Rapid image-based provisioning for fast cluster boot times
Comprehensive performance, status, and health monitoring of all cluster components
Heterogeneous OS and hardware support
Cluster configurations and logical resource isolation to enable multi-tenancy
Web GUI and a rich set of CLIs driven by REST APIs for administrator management
Dashboards for cluster observability, visualization, and analytics
High interoperability to integrate software stacks of choice
Ansible-based orchestration and automation for DevOps and infrastructure-as-code
Support for a variety of industry-standard batch or container-based workload managers
HA configuration enabling any head node to serve any compute node
Node management for Nvidia Cumulus Linux and Edge-core SONiC switches
Support for the strictest security protocols required in secure, controlled environments
ICE ClusterWare™
AI and HPC Cluster Management Software
ICE ClusterWare™ is a hardware-agnostic, intelligent cluster management platform that transforms bare-metal hardware, networking, and software resources into reliable high-performance infrastructure. It simplifies deployment and management, delivers real-time health monitoring, and, with ClusterWare AIM services, enables peak performance for enterprise-scale AI and HPC workloads.
ICE ClusterWare provides comprehensive management functionality, including node provisioning, boot and runtime management, real-time monitoring, automated image deployment, and resilient data storage. It also serves as a platform for additional software and schedulers. With these capabilities, customers can seamlessly scale operations, automate best practices, and optimize AI and HPC cluster performance with confidence.
The ICE ClusterWare web console features a node display, allowing admins to quickly gain a status overview. Admins can drill down into nodes to see the associated attributes and other details.
ICE ClusterWare ships with Grafana for powerful, customizable health monitoring and alerting. Standard cluster-wide and node dashboards are provided.
Tame the Complexity of AI and HPC Infrastructure at Scale
AI is advancing rapidly, and companies are challenged to adopt, manage, and optimize their AI infrastructure. Unlike traditional HPC architectures, AI clusters present unique challenges. Job requirements, admin experience, cluster size, and security needs add layers of complexity. ICE ClusterWare’s intuitive tools simplify the deployment and management of thousands of nodes, streamline administration, and optimize compute resources for experienced and new admins alike.
Scalable Deployment and Management for any Environment
Penguin Solutions ICE ClusterWare simplifies cluster deployment and management for administrators and system architects with the following key features:
Rapid, image-based provisioning – Minimizes steps and reduces errors when adding new nodes
Flexible management options – Intuitive web-based GUI, powerful command-line interface (CLI), and RESTful API for seamless control
Advanced resource allocation – Logical resource provisioning, dynamic workload partitioning, and multi-tenancy governance to ensure secure, efficient cluster usage
Scheduler-agnostic integration – Natively supports Slurm, Torque, OpenPBS, Kubernetes, and other workload managers
Containerization for IT efficiency – Enables easy integration with IT management tools and accelerates workload deployment
Seamless CI/CD pipeline integration – Automates image management, software deployment, and hardware driver updates using Git-based continuous integration/continuous delivery (CI/CD)
Stateless compute architecture – Supports diskless booting, ensuring immutable configurations, high reliability, and effortless scalability
Robust Monitoring and Health Management
Continuous monitoring and health management provide administrators with the visibility and status required for ongoing cluster optimization, development, and resilience. Features include:
Real-time performance monitoring – Customizable alerts and notifications via Grafana, email, and webhooks to proactively address issues
Centralized logging and auditing – Comprehensive system logs, authentication tracking, and RBAC activity monitoring for enhanced security and compliance
Comprehensive performance metrics – Monitors compute nodes, networking, and GPU/CPU utilization, providing deep insights into cluster health
Automated health management – Advanced error detection with detailed per-node metrics, enabling proactive issue resolution and self-healing capabilities
Highly Secure Operations
ICE ClusterWare supports strict security protocols and restricted-environment requirements, maintaining high security standards through features such as intranode encryption. ClusterWare security includes support for:
SELinux enforcement – Supports both Targeted and Multi-Level Security (MLS) policies for granular access control
FIPS 140-2 compliance – Ensures cryptographic security in sensitive computing environments
Security Technical Implementation Guides (STIGs) – Adheres to rigorous security configurations for hardened deployments
Air-gapped deployment support – Enables fully isolated, offline environments for maximum security
Trusted Platform Module (TPM) encryption – Enhances disk security with LUKS-based encryption
About Penguin Solutions
The most exciting technological advancements are also the most challenging for companies to adopt. At Penguin Solutions®, we support our customers in achieving their ambitions across our computing, memory, and LED lines of business. With our expert skills, experience, and partnerships, we turn our customers’ most complex challenges into compelling opportunities.
For more information, visit https://www.penguinsolutions.com.
Learn More
Sign Up for a Demo or Evaluation of ClusterWare
Reach out to us at sales@penguincomputing.com
Visit www.penguinsolutions.com to learn more about our software.