

ICE ClusterWare™
AI and HPC Cluster Management Software
Overview
Penguin Solutions® ICE ClusterWareTM is a hardware-agnostic, intelligent cluster management platform that transforms bare-metal hardware, networking, and software resources into reliable highperformance infrastructure. It simplifies deployment and management, delivers real-time health monitoring, and, with ClusterWare AIM services, enables peak performance for enterprise-scale AI and HPC workloads.
ICE ClusterWare provides comprehensive management functionality, including node provisioning, boot and runtime management, real-time monitoring, automated image deployment, and resilient data storage. It also serves as a platform for additional software and schedulers. With these capabilities, customers can seamlessly scale operations, automate best practices, and optimize AI and HPC cluster performance with confidence.

Key Features
• Rapid image-based provisioning for fast cluster boot times
• Comprehensive performance, status, and health monitoring of all cluster components
• Heterogeneous OS and hardware support
• Cluster configurations and logical resource isolation to enable multi-tenancy
• Web GUI and rich set of CLIs driven by REST APIs for administrator management
• Dashboards for cluster observability, visualization, and analytics
• High interoperability to integrate software stacks of choice
• Ansible-based orchestration and automation for DevOps and infrastructure-as-code
• Support for variety of industry standard batch or containerbased workload managers
• HA configuration enabling any head node to serve any compute node
• Node management for NVIDIA Cumulus Linux and Edge-core SONiC switches
• Support for strictest security protocols required in secure, controlled environments

The ICE ClusterWare web console features a node display, allowing admins to quickly gain a status overview. Admins can drill down into nodes to see the associated attributes and other details.

ICE ClusterWare ships with Grafana for powerful, customizable health monitoring and alerting. Standard cluster-wide and node dashboards are provided.
Tame the Complexity of AI and HPC Infrastructure at Scale
AI is advancing rapidly, and companies are challenged to adopt, manage, and optimize their AI infrastructure. Unlike traditional HPC architectures, AI clusters present unique challenges. Job requirements, admin experience, cluster size, and security needs add layers of complexity. ICE ClusterWare’s intuitive tools simplify the deployment and management of thousands of nodes, streamline administration, and optimize compute resources for both experienced admins and new admins alike.
Scalable Deployment and Management for any Environment
Penguin Solutions ICE ClusterWare simplifies cluster deployment and management for administrators and system architects with the following key features:
• Rapid, image-based provisioning – Minimizes steps and reduces errors when adding new nodes
• Flexible management options – Intuitive web-based GUI, powerful command-line interface (CLI), and RESTful API for seamless control
• Advanced resource allocation – Logical resource provisioning, dynamic workload partitioning, and multi-tenancy governance to ensure secure, efficient cluster usage
• Scheduler-agnostic integration – Natively supports Slurm, Torque, OpenPBS, Kubernetes, and other workload managers
• Containerization for IT efficiency – Enables easy integration with IT management tools and accelerates workload deployment
• Seamless CI/CD pipeline integration – Automates image management, software deployment, and hardware driver updates using Git-based continuous integration/continuous delivery (CI/CD)
• Stateless compute architecture – Supports diskless booting, ensuring immutable configurations, high reliability, and effortless scalability
Robust Monitoring and Health Management
Continuous monitoring and health management provides administrators with the visibility and status required for on-going cluster optimization, development, and resilience. Features include:
• Real-time performance monitoring – Customizable alerts and notifications via Grafana, email, and webhooks to proactively address issues
• Centralized logging and auditing – Comprehensive system logs, authentication tracking, and RBAC activity monitoring for enhanced security and compliance
• Comprehensive performance metrics – Monitors compute nodes, networking, and GPU/CPU utilization, providing deep insights into cluster health
• Automated health management – Advanced error detection with detailed per-node metrics, enabling proactive issue resolution and self-healing capabilities
Highly Secure Operations
ICE ClusterWare supports strict security protocols and restricted environment requirements to maintain high security standards through features, including intranode encryption. ClusterWare security includes support for:
• SELinux enforcement – Supports both Targeted and Multi-Level Security (MLS) policies for granular access control
• FIPS 140-2 compliance – Ensures cryptographic security in sensitive computing environments
• Security Technical Implementation Guides (STIGs) – Adheres to rigorous security configurations for hardened deployments
• Air-gapped deployment support – Enables fully isolated, offline environments for maximum security
• Trusted Platform Module (TPM) encryption – Enhances disk security with LUKS-based encryption

Contact Us
For sales queries, please contact sales@penguinsolutions.com
To learn more about the ICE ClusterWareTM and other Penguin Solutions products, please visit: www.penguinsolutions.com