Stratus ztC Endurance ZEN Availability - Whitepaper

Page 1


Evolutionary Fault Tolerance

An innovative approach to availability for x86 architectures

White Paper

Executive summary

This white paper provides a high-level overview of Penguin Solutions® Stratus ztC Endurance™ computing platform and the evolutionary fault tolerance delivered by its availability architecture. When combined with Penguin Solutions award-winning service and support, Stratus ztC Endurance provides the highest possible levels of uptime, availability, and reliability for mission-critical applications and data.

The design of the Stratus ztC Endurance availability architecture was based on several factors to provide the optimal combination of protection, performance, intelligence, modularity, flexibility, and serviceability for a next-generation, mission-critical advanced computing system.

Stratus ztC Endurance features redundant modular hardware, advanced health monitoring and predictive diagnostics, and automatic self-healing and failover capabilities. It proactively identifies and isolates faults, making it easier to deliver system services and repairs via simple online replacements. This Identify-Isolate-Service availability architecture maximizes uptime and availability for workloads and ensures that critical applications continue to run with no downtime or data loss, even if a hardware failure occurs.

The Stratus ztC Endurance platform

Stratus ztC Endurance is an innovative family of fault-tolerant computing platforms that enable intelligent predictive failover and 99.99999% compute platform availability. The platform builds on Penguin Solutions proven combination of built-in fault tolerance, proactive health monitoring, and unmatched serviceability.

Stratus ztC Endurance is an innovative family of fault-tolerant computing platforms that enable intelligent predictive failover and 99.99999% compute platform availability.

Together with the Stratus Automated Uptime Layer with Smart ExchangeTM (Stratus AUL - Smart Exchange), Stratus ztC Endurance delivers predictable, protected performance by leveraging Intel RAS capabilities, embedded hardware and software security, and increased manageability and serviceability using a modular architecture.

Stratus ztC Endurance is available in four models:

Stratus ztC Endurance 3110 — This entry-level model provides affordable performance for critical applications in remote or branch offices and on shopfloors. It is a reliable, fault-tolerant system with a single-socket processor architecture that can accommodate approximately 12 medium-sized virtual machines (VMs) or applications. The 3110 model offers a single-processor architecture with 1 x 12-core Hyper-Threaded Intel® Xeon® Silver processor (providing 24 threads or vCPUs) and includes either 64 GB, 128 GB, or 256 GB of memory.

Stratus ztC Endurance 5110 — This mid-range model offers a versatile compute platform for rapidly evolving application requirements in regional offices, remote plants, or regional data centers. The 5110 model is a reliable, fault-tolerant system that can accommodate roughly 24 medium-sized VMs or applications. It has a dual-processor architecture with 2 x 12-core Hyper-Threaded Intel Xeon Silver processors (providing 48 threads or vCPUs) and offers either 128 GB, 256 GB, or 512 GB of memory. The platform provides a balance of application performance and value to meet the demands of mission-critical applications in traditional data centers and at the edge.

Stratus ztC Endurance 7110 — This high-performance model provides the highest level of performance for compute-intensive applications such as AI and ML as well as data- and transaction-intensive applications at larger remote plants or corporate data centers. The 7110 model features a dual-processor architecture with 2 x 28-core Hyper-Threaded Intel Xeon Gold processors (providing 112 threads or vCPUs) and offers either 256 GB, 512 GB, or 1024 GB of memory. The platform can accommodate more than 56 medium-sized VMs or applications.

Stratus ztC Endurance 9110 — This ultra-high-performance model provides the highest level of performance for the most demanding critical applications, such as high volume transaction processing applications found in the financial industry and compute-intensive applications used in AI and ML. Located in larger plants or corporate data centers, the 9110 model features a dual-processor architecture with 2 x 32-core Hyper-Threaded Intel Xeon Gold processors (providing 128 threads or vCPUs), offers 1024 GB of memory, and can accommodate more than 64 medium-sized VMs or applications.

Simple

Stratus ztC Endurance platforms are easy to install, deploy, and manage across applications and infrastructure. With zero-touch operation, they provide rapid time-to-value and simplified management for technical and non-technical staff alike. Stratus ztC Endurance platforms support both bare metal and virtualized architectures to expedite application deployment, increase flexibility, maximize computing resources, and lower total cost of ownership (TCO).

Easy to use: The Stratus ztC Endurance platform was designed for ease of use by both IT and OT staff. Features such as automatic deployment scripts, a simple user interface (UI), industry-standard interfaces for remote monitoring and management, support for standard off-the-shelf operating systems and hypervisors, automatic local and remote notification capabilities, and hot-swappable plug-and-play components ensure that Stratus ztC Endurance systems are easy to deploy, configure, operate, monitor, and service.

Modular and serviceable: The Stratus ztC Endurance platform expands on the redundant, hot-swappable Customer Replaceable Units (CRUs) used in other Penguin Solutions computing platforms with an even more modular and more serviceable design. A single Stratus ztC Endurance system chassis includes eight CRU modules:

Two compute modules

Two I/O modules

Two storage modules

Two power supply units (PSUs)

Each CRU module can be independently hot-replaced (i.e., a failed CRU can be hot-removed as indicated by the CRU’s “safe-to-pull” LED and a replacement CRU can be hot-inserted) to restore the system to a healthy, fully redundant configuration in the event of a hardware failure.

Replacement parts can be dispatched to a site and hot-installed in the system while it is running, with no impact to operations, applications, or data. The platform’s modular structure also allows for highly granular serviceability, so any subsystem (compute, storage, I/O, or power) can be independently serviced without affecting the performance of the other subsystems.

Penguin Solutions Stratus ztC Endurance family of fault-tolerant platforms is another advance in simple, protected, and autonomous computing.

Protected

Stratus ztC Endurance is a redundant, fault-tolerant, hardened, secure system with no single point of failure, ensuring continued operation with no loss of in-flight data should a hardware failure occur. Like all Penguin Solutions fault-tolerant computing solutions, Stratus ztC Endurance is designed to protect operations, applications, and data and reduce operational, financial, and reputational risk by ensuring “always-on” availability, zero downtime, and zero data loss.

Redundant architecture: A core aspect of the Stratus ztC Endurance architecture is its fully redundant hardware design, which includes mainboards, processors, memory, disk drives, network interfaces, and power supplies. Pairs of identical CRU modules for compute, storage, I/O, and power supply provide redundancy to prevent system outages.

Redundancy for the compute CRU modules is provided via an active/standby availability architecture. The active module handles all processing while the standby module stays ready to be promoted to active status should the active module begin to fail.

Redundancy for the storage modules, I/O modules, and PSUs is provided via an active/active availability architecture that keeps both modules active and operational at all times. This active/active redundancy allows the associated subsystem to continue operating in a seamless, bumpless fashion (i.e., with no system outages, no downtime, and no failover process required) should one of the modules in that subsystem fail. The failed module can be serviced or removed and replaced while the healthy active module continues operating, with no disruption to overall system operations or applications.

Fault-tolerant approach: In the event that Stratus AULSmart Exchange identifies a hardware failure or predicts a potential hardware failure, the Stratus ztC Endurance platform will automatically utilize its redundant hardware and built-in failover capabilities to take action and avoid a system outage.

Identify Isolate Service

Stratus ztC Endurance utilizes an “IdentifyIsolate-Service” approach. Any internal components identified as failed or likely to fail are automatically removed from operation without affecting the compute workload. Applications will continue to run, and data will continue to be accessible. The failed component can be serviced via online CRU replacement while the server is running, with no interruption to business operations.

The Stratus ztC Endurance platform automatically provides this fault-tolerant protection in an application-transparent fashion. The system’s hardware redundancy and fault-tolerant capabilities are abstracted from an operating system/hypervisor, VM, or application allowing the Stratus ztC Endurance platform to run standard operating systems/hypervisors and the same applications it would run on a typical server with no special setup, custom configuration, or code modifications required. This requires no additional work and automatically protects the operating system/hypervisor, VMs, applications, and data from outages and downtime.

Hardened hardware and software: Stratus ztC Endurance makes use of hardened hardware components wherever possible. All memory modules utilized in Stratus ztC Endurance systems undergo a SMART Zefr (Zero Failure Rate) screening process that involves extended-runtime testing of each memory module on a server-class motherboard at elevated operational temperature, with high-speed data exchange driven by demanding test scripts. This screening process filters out potentially weak memory modules and reduces the memory-related defective parts per million (DPPM) metric from the industry standard 3,000 DPPM to 200 DPPM, dramatically increasing the reliability of the memory modules as well as the overall reliability and availability of the Stratus ztC Endurance platform.

continue running and automatically self-heal in the event of a firmware lockup, driver-related error condition, or surprise removal or insertion of a storage or I/O hardware component. Competing solutions typically do not include these drivers and thus cannot recover or continue running should a similar error condition occur.

Secure operation: Cybersecurity is of critical importance due to the increasing reliance on technology, the growing need for interconnectivity and networking among distributed sites, the adoption of remote access technologies, the emergence of the Internet of Things (IoT), and the rise of IT/OT convergence. These trends are exposing previously isolated systems to new vulnerabilities, making it difficult to secure computing architectures using conventional tools and methods.

Stratus ztC Endurance reduces these risks with multi-layered defense-in-depth approaches to minimize the number of viable attack vectors. Penguin Solutions focus on both process and product security ensure maximum protection for all our fault-tolerant computing platforms.

For more information on Penguin Solutions process security and the security of Stratus ztC Endurance, please download the Stratus Product Security Whitepaper

SMART ZefrTM memory reduces the memory-related defective parts per million (DPPM) metric from the industry standard 3,000 DPPM down to 200 DPPM.

Autonomous

The Stratus ztC Endurance zero-touch computing platform requires no human intervention for identification and isolation of faults and minimal human intervention for system support, service, maintenance, and repair. Our platforms offer “call home” features that provide remote issue notification and 24/7/365 support, minimizing the possibility of unplanned downtime for equipment and applications. Penguin Solutions also supports remote management and monitoring, which provides greater flexibility for system management and maintenance activities.

Self-monitoring: Stratus ztC Endurance actively monitors hundreds of internal data points drawn from multiple sources and continuously analyzes this information to identify or predict failures before they occur. If a hardware component fails or is about to fail, Stratus ztC Endurance automatically identifies the issue via its health monitoring capabilities.

Once an issue has been identified, Stratus ztC Endurance sends both local and remote notifications to ensure that issue resolution can immediately begin. The system can alert the user of an issue locally using industry-standard methods and protocols such as e-mail alerts, SNMP, and REST APIs. Stratus ztC Endurance can also “call home” to notify Penguin Solutions support of an issue using secure, encrypted communication channels.

Self-healing: Stratus ztC Endurance automatically takes action to ensure continued system operation if a component fails or is predicted to fail. The system leverages built-in health monitoring, predictive analysis capabilities, and hardware redundancy and online failover functionality to continue operating through component failures with zero downtime or data loss.

If a component or module fails or is predicted to fail, Stratus ztC Endurance removes it from service and runs diagnostics on the isolated component or module, using algorithms to determine whether the error condition was persistent or transient. For infrequent transient errors, the isolated component or module typically can be reset and returned to service. In the event of a persistent error or a frequent transient error, the component or module will remain out of service and will need to be replaced.

Stratus ztC Endurance’s self-healing capabilities also assist the operator when module replacement required. Aside from physically plugging the replacement module into the Stratus ztC Endurance system chassis, no additional effort is required. No need for a keyboard, mouse, or monitor as there is no need to perform diagnostics, restore device configurations, flash firmware, synchronize data, run scripts, balance loads, or other activities typically associated with component replacement. The Stratus ztC Endurance management subsystem automatically detects the replacement module and performs any actions required to return the component to service and the system to full health. This self-healing capability means that simple module replacements can be done by OT staff, with no need for IT support.

Remote monitoring and management: The Stratus ztC Endurance management subsystem provides several web-based UIs for remote monitoring and management. These include a BMC UI for system health monitoring, console access, and power control; a management subsystem UI for system health monitoring and system management; and an operating system/hypervisor UI for system management and configuration of virtual machines, applications, and the like.

Stratus ztC Endurance also provides built-in support for remote monitoring by Penguin Solutions, providing automatic notifications to Penguin Solutions Support Service via a secure internet-based connectivity with no additional implementation required.

Remote servive and support: Stratus ztC Endurance provides a cloud-based capability for remote access that enable Penguin Solutions Support to connect remotely for diagnostics, troubleshooting, and service. This can be done using secure internet-based connectivity if explicitly permitted by the user.

In addition, Penguin Solutions Managed Services provides a wide range of additional remote service and support features, including server management, health monitoring, database administration, reporting, and additional services.

Interoperability: In modern data center and edge computing environments, interoperability—the ease with which different computerized systems can connect and communicate with one another using standard, coordinated methods—has become increasingly important. The Stratus ztC Endurance platform provides broad support for SNMP, SMTP, REST APIs, and other industry-standard protocols used by standard off-the-shelf system management platforms to support interoperability and enable centralized remote monitoring and management.

The Stratus ztC Endurance availability architecture

The Stratus ztC Endurance availability architecture was specifically designed to leverage the increased reliability and intelligence of modern compute hardware components to deliver the highest possible levels of server availability and performance.

It ensures this level of availability via internal health monitoring, component diagnostics, predictive failure analysis, redundancy, and automatic failover capabilities.

At the same time, Stratus ztC Endurance provides maximum compute performance by utilizing the advanced features and full functionality of modern hardware components such as fourth-generation Intel® Xeon® processors, all while shielding the system from the spurious alarms, false data divergences, and unnecessary hardware recovery/resync cycles that can result from common, transient, and correctable errors.

Stratus ztC Endurance has an overall uptime percentage of seven nines (99.99999%), equivalent to an expected average system downtime of less than 3.15 seconds per year. The availability of Stratus ztC Endurance is significantly higher than that of a typical high availability (HA) cluster system, which has an uptime percentage of 99.95% (i.e., 4 hours and 23 minutes of expected downtime per year on average). It also offers greater availability than a conventional standalone server, which has an uptime percentage of 99% or an average expected downtime of 87 hours and 40 minutes per year.

Stratus ztC Endurance systems have an overall uptime percentage of 99.99999%, meaning expected average system downtime is less than 3.15 seconds per year.

Stratus ztC Endurance offers an active/standby availability architecture (motherboard, processors, memory) from a compute perspective alongside fully redundant active/active storage (disk drives), I/O (network interfaces and PCIe peripherals), and power.

Stratus AUL- Smart Exchange

Stratus ztC Endurance features the Stratus AUL - Smart Exchange to enhance system availability. Unique in the industry, Stratus AUL - Smart Exchange provides reliability and fault tolerance for motherboards, processors, memory, busses, storage devices, and I/O devices. It also simplifies system monitoring and management and enables local monitoring and remote service and support.

Service and support

The redundancy, fault tolerance, and availability architecture of Stratus ztC Endurance ensure that in the event of a hardware failure, workloads will continue with no significant interruption or downtime. Its self-monitoring and self-healing capabilities identify and resolve issues quickly, returning the system to full health as soon as possible.

Penguin Solutions Customer Support complements these capabilities and make it easier to achieve the highest possible levels of uptime and availability using a Stratus ztC Endurance platform. In addition to providing proactive monitoring and next-day replacement parts in the event of a hardware failure, Penguin Solutions Customer Support offers 24/7/365 technical support for all Penguin Solutions-provided hardware and software, system software upgrades and patches, root cause failure analysis services, emergency onsite response, vendor collaboration for original equipment manufacturer (OEM) components, and full support for the operating system/hypervisor.

For more details on Stratus ztC Endurance, the Stratus availability architecture, or Penguin Solutions Customer Support, please reach out to your local Penguin Solutions representative or contact us to schedule a meeting with a Penguin Solutions expert. You can also learn more at www.penguinsolutions.com

About Penguin Solutions

With over two decades of experience as trusted advisors to our valued customers, Penguin Solutions is an end-to-end solutions provider helping solve complex challenges in computing, memory, and LED solutions.

Penguin Solutions designs, builds, deploys, and manages high-performance, high-availability enterprise solutions, allowing customers to achieve breakthrough innovations.

Our Stratus high availability and fault tolerant computing platforms ensure the continuous availability of our customers’ critical applications and data in remote data centers and edge locations.

Contact Us

For sales queries, pl ease contact sales@penguinsolutions.com.

To learn more about Stratus ztC Endurance and other Penguin Solution s products, please call 800-221-6588 or visit ww w.penguinsolution s.com

© 2025 Penguin Solutions, Inc. All rights reserved. Penguin Solutions, Penguin Computing, OriginAI, and ICE ClusterWare are trademarks or registered trademarks of Penguin Solutions. All other product names, trademarks, and registered trademarks are the property of their respective owners. All company, product, and service names used in this document are for identification purposes only. Use of these names, trademarks, and brands does not imply endorsement.

Turn static files into dynamic content formats.

Create a flipbook
Issuu converts static files into: digital portfolios, online yearbooks, online catalogs, digital photo albums and more. Sign up and create your flipbook.