Reliability, Availability, and Serviceability whitepaper

RAS Characteristics of the Azul Compute Appliance
version 2.1
Azul Systems, Inc.
AWP-009-021
August 2007

EXECUTIVE OVERVIEW

The Azul Systems® Compute Appliance is designed with enterprise class reliability, availability, and serviceability features. As an appliance, it can be viewed as a black box with limited and thoroughly-tested configurations, low component count, few moving parts, redundant subsystems, and built-in error checking and error correction capabilities. Multiple appliances form a compute pool with additional redundancy that in turn preserves application clustering topologies. Because it provides massive capacity, it typically allows customer workloads to run on fewer systems, thus reducing system count and improving reliability, availability, and serviceability across the infrastructure. The Azul Compute Appliance meets today’s requirements for a highly-reliable, on demand compute resource.

Introduction

A key component of any enterprise operating environment is the reliability, availability, and serviceability (RAS) of the hardware and software components that make up the applications infrastructure. With billions of transistors, hundreds of mechanical devices, and millions of lines of code, modern information technology systems can suffer failures. Modern commercial systems have extensive RAS features and capabilities designed in from the outset. These capabilities allow an IT organization to configure a resilient infrastructure that withstands failures and, in most cases, recovers completely, either by self-correction or by engaging some form of redundancy and/or initiating human intervention to replace or repair specific components.

The Azul Systems compute appliance is no exception. Built for the most demanding IT environments, the compute appliance has comprehensive RAS capabilities as an integral part of its hardware and software design.
The appliance is designed and optimized solely to run massive amounts of virtual machine-based workloads. The considerable capacity of this appliance in a symmetric multiprocessing (SMP) architecture enables applications to scale dynamically, responding to varying workloads and spikes without the need to reconfigure or provision application tier servers. The small form factor provides high rack density, low environmental costs, and simple administration. Moreover, because of the optimized nature of the appliance, component counts are greatly reduced, and users can realize levels of application availability beyond those of general-purpose commercial application servers.

The Azul Compute Appliance is the industry’s first solution in a new processing model called network attached processing (Figure 1). Network attached processing allows Java™ platform-based applications initiated on a general-purpose server to access the processing power of an Azul compute pool. These pools are composed of multiple high-capacity Azul Compute Appliances, and provide highly available processing power. Instead of running on the host server, the Azul Virtual Machine and application threads are redirected to a compute appliance within the compute pool for execution. All interfaces to clients, databases, and host operating system services remain on the host server. Relative to the requirements of a single application, the compute pool provides a virtually infinite, unbound pool of compute resources.

A policy-based Compute Pool Manager™ (CPM) provides flexible, central control of application resources within a compute pool. The CPM is responsible for managing appliances and the resources allocated to applications. Using the CPM, the compute pool administrator defines the order, priorities, and guarantees that apply to running applications.
With the natural redundancies of a compute pool and the policies governing each application or class of applications, organizations can confidently scale one or many applications within Azul Compute Appliance pools. The bottom-line result of employing network attached processing is a high degree of RAS because of reduced component counts, simpler infrastructure, the redundancies inherent in a compute pool, and the enterprise class RAS features within each compute appliance.

Figure 1. The Network Attached Processing Model (diagram labels: Web Tier, Application Tier, Application Hosts, Compute Appliances, Compute Pool, Database Tier)

This paper explores the hardware and software methods employed by the Azul Systems architecture to attain these levels of RAS within the Azul Compute Appliance.

Network Attached Processing

Using network attached processing, servers can redirect application processing tasks to an Azul compute pool. These pools are composed of multiple ultra-high-capacity Azul appliances, and provide highly available application processing power. Relative to the requirements of a single application, the compute pool provides a virtually infinite, unbound pool of compute resources. This makes it possible for multiple applications to reside in one compute pool. That task is accomplished with the help of a policy-based Compute Pool Manager (CPM), which provides flexible, central control of application resources.

Workload consolidation onto Azul compute pools is transparent to the application and the existing servers. Applications continue to be invoked by and run on the existing servers, each with its own separately configurable operating system and application server, but their raw processing power is augmented by the Azul compute pool. As shown in Figure 2, when an application is launched, a VM proxy on the local application host server redirects the workload to a corresponding VM engine running on an appliance in the compute pool.
The result is a reduced need for traditional servers at the application tier, eliminating a large number of points of failure from the overall implementation. In other words, even if compute appliances had only levels of RAS similar to those of the application hosts they replace, the application environment would still realize increased levels of RAS due to decreased component counts, and without deploying standby systems that remain idle most of the time.

Figure 2. (diagram labels: Web Tier, Application Tier, Database Tier; VM Proxies on Host 1, Host 2, and Host 3; VM Engines on Appliance 1 and Appliance 2 in the Compute Pool)

Azul Virtual Machine technology is a key component in network attached processing. The virtual machine engine runs on an Azul Systems compute appliance and shifts the virtual machine workload from the application host to the compute appliance. A VM engine is started automatically by each VM proxy.

Azul Compute Appliance Hardware and Software Protection

The reliability, availability, and serviceability features of a system’s design determine its ability to operate continuously without failures, to become operational after a failure, and to minimize the time needed to restore it after a failure. Together, these features enable uninterrupted system operation.

As expected from a large scale enterprise class multiprocessor system, Azul Compute Appliances are equipped with hardware and software protections that enable them to continue processing without interruption when hardware failures occur, and to recover quickly in the unlikely event of a failure.
These protections include the following:

• ECC and/or parity protection
• Failure prediction and avoidance capabilities
• DRAM fault tolerance (Chipkill) on all system memory
• Memory scrubbing
• Automatic restart and configuration around failed processor, memory, and networking system elements
• Redundant network processors, with automatic failover
• Redundant, bonded gigabit Ethernet data network links
• Software fault isolation
• N+1 power supply and fans

In addition, the following features ensure ease of serviceability of the Azul appliance:

• Self monitoring of all critical system elements
• SNMP integration
• A few customer replaceable units (CRUs)
• Hot pluggable fan assemblies and power supplies

Compute Appliance Reliability and Availability

Because the compute appliance is built to run mission-critical business applications, reliability is paramount, and is addressed in the compute appliance design. Experienced processor architecture, verification, and silicon technology teams developed, reviewed, and perfected the Azul appliance design to ensure enterprise-level reliability. During system manufacturing, the Azul appliance goes through a thorough testing process to ensure high product quality levels.

The Azul system incorporates a highly scalable, symmetric multi-processor (SMP) design, built around a custom processor. Each chip contains 48 fully independent 64-bit processor cores. With up to 16 chips, each Azul appliance houses as many as 768 processor cores, optimized for multi-threaded virtual machine execution. Azul connects all of these processors through a passive, non-blocking interconnect mesh that enables all processors to access available memory in a given appliance. Each Azul appliance employs dual network processors for network communications and system control/service processing.

The system view indicating the location and orientation of major system components is shown in Figure 3.
Processor boards, each containing two chips and 16 memory modules, are horizontally oriented in the front of the system, and are connected to a midplane that is unique to each of the available four-, eight-, and sixteen-chip processor configurations. Hot swappable power supplies are inserted from the front at the bottom. All I/O cabling is done via connections to the dual redundant network processor modules at the rear, located just above the AC power inlets. Hot swappable fan assemblies providing N+1 fault resilient fans are mounted in the rear with a pull orientation, providing front to back cooling airflow for all components.

Figure 3. System View of the Azul Systems Model 7280 Compute Appliance

The dynamic reconfiguration capabilities in Azul Compute Appliances allow the system to reboot with reduced capacity after experiencing certain hardware faults related to processor cores, processor chips, or DIMMs. A partially configured Azul appliance continues to operate as a member of a compute pool, albeit with reduced memory or CPU capacity, until the appliance is serviced and repaired.

Additional RAS features, such as hardware error injection, ECC and parity protection, and failure prediction and avoidance, to name a few, enable a simple, closed box solution that can operate reliably. In addition, each Azul Compute Appliance is manufactured to ISO 9000 quality manufacturing standards.

Processors

The appliance’s custom-designed processors are based on proven ASIC design methodology. Compute appliance processors undergo hardware error injection during the design validation process, and feature built-in system monitoring and diagnostics, as well as a fail-over interconnect to ensure dynamic fault resiliency. Because the processor chips are fully integrated, no separate chips are needed for functions such as I/O and memory management. By minimizing the total number of chips within the appliance, Azul has significantly lowered the probability of system failure.
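To illustrate why a lower chip count reduces the probability of system failure, consider a simplified series model in which the system fails if any one of its components fails and component failures are independent. Both the model and the per-component failure probability used below are illustrative assumptions, not Azul data.

```python
def system_failure_probability(component_failure_prob: float,
                               component_count: int) -> float:
    """Probability that at least one of `component_count` components fails.

    Assumes a series system: every component is required, and failures
    are independent with the same probability over the period of interest.
    """
    if not 0.0 <= component_failure_prob <= 1.0:
        raise ValueError("component_failure_prob must be between 0 and 1")
    if component_count < 0:
        raise ValueError("component_count must be non-negative")
    # The system survives only if every component survives.
    survival = (1.0 - component_failure_prob) ** component_count
    return 1.0 - survival


if __name__ == "__main__":
    # Hypothetical per-component failure probability, for illustration only.
    P_FAIL = 0.001
    for count in (100, 50, 10):
        prob = system_failure_probability(P_FAIL, count)
        print(f"{count:4d} components -> failure probability {prob:.4%}")
```

Under this model the failure probability grows with every additional required component, which is why integrating I/O and memory management into the processor chip helps.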
Extensive ECC and parity protection on caches, processor register files, and translation look-aside buffers support failure prediction and avoidance. BIST (Built-In Self-Test) and POST (Power-On Self-Test) check all caches, links, memory, and processor operation. The non-blocking, passive processor-to-memory interconnect features ECC on each processor path. The ECC ensures protection for all data transfers between the processors and memory.

If a hard processor core failure occurs, the Azul appliance maps out the affected processor cores. If the failure does not affect kernel processes, only the affected virtual machine will fail when this hard failure is detected. The failed processor core is mapped out immediately if the system is able to continue operating. Otherwise, it is mapped out on the next reboot.

Network Processors

The dual network processors found on each appliance feature dual Gigabit Ethernet network links per network processor, built-in system monitoring and diagnostics, and a failover mechanism. Just as with the processors, the appliance’s network processors undergo hardware error injection during the design validation process.

Power and Cooling Subsystems

As a network attached processing resource, the Azul appliance does not need to maintain state, and thus requires no disk drives. The fans are the only true moving parts, which contributes to a higher level of hardware reliability for the entire appliance. N+1 power and cooling components provide redundancy in case of failures. The cooling fans feature a straight flow-through air design, while a tachometer on each fan informs the service processor when the fan is not working properly. At the same time, predictive fan failure monitoring tracks fan RPM and generates an alert if the speed falls below the threshold level for a continuous period of time.

Memory Subsystem

Because they do not run a traditional operating system, Azul appliances do not require hard drives.
Instead, each appliance features an enterprise grade solid state flash card that, by definition, has no head, platter, or motor; in other words, it does not contain any moving parts. This enables an Azul appliance to realize a longer operational lifetime before a failure than a general-purpose server with a hard drive.

As with processor register files and caches, extensive error detection and checking within the system memory helps maintain the integrity of data that passes through the system. The system design facilitates the recognition of errors that are either corrected dynamically (such as through memory scrubbing), or properly reported for isolation and repair.

Each Azul appliance monitors the rate of correctable errors (i.e., soft errors) on memory pages and sends out an alert to replace the affected DIMM when the rate exceeds a certain threshold. Since these are soft errors (and are corrected on the fly by the ECC logic in the memory subsystem), the offending memory remains in use. If a hard failure occurs, the Azul appliance maps out the memory (similarly to mapping out processor cores). If the system is able to continue operating, the failed memory is mapped out immediately. Otherwise, it is mapped out on the next reboot. In addition, the DRAMs feature Chipkill, so that the appliance can continue operating even if an entire DRAM stops operating.

Serviceability of Azul Compute Appliances

The modular nature of this relatively simple appliance ensures ease of serviceability. Each Azul Compute Appliance can be viewed as a black box with very few customer-replaceable elements. The CPM enables organizations to easily monitor the status and performance statistics of applications and compute appliances. The CPM can integrate with IT management software, including authentication servers and SNMP-based network monitoring tools.
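The correctable-error monitoring described in the memory subsystem discussion is one example of the appliance watching its own health. The following is a minimal sketch of that kind of rate check, assuming a sliding time window and a per-DIMM threshold. The class name, window length, and threshold value are hypothetical and do not describe Azul's implementation.

```python
from collections import defaultdict, deque


class CorrectableErrorMonitor:
    """Tracks corrected (soft) memory errors per DIMM over a sliding window.

    Hypothetical illustration: the paper states only that an alert is sent
    when the correctable-error rate exceeds a threshold.
    """

    def __init__(self, max_errors: int, window_seconds: float) -> None:
        self.max_errors = max_errors
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # DIMM id -> error timestamps

    def record(self, dimm_id: str, timestamp: float) -> bool:
        """Record one corrected error.

        Returns True when the DIMM has exceeded the threshold and should be
        flagged for replacement. The memory stays in use either way, because
        ECC has already corrected the error.
        """
        events = self._events[dimm_id]
        events.append(timestamp)
        # Discard errors that have aged out of the window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        return len(events) > self.max_errors


if __name__ == "__main__":
    monitor = CorrectableErrorMonitor(max_errors=3, window_seconds=3600.0)
    for t in (0.0, 10.0, 20.0, 30.0):
        if monitor.record("DIMM0", t):
            print(f"t={t}: alert, schedule replacement of DIMM0")
```

Counting over a window rather than over the lifetime of the DIMM keeps occasional, isolated soft errors from triggering a replacement.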
Self Monitoring of Critical System Elements

Remote Support Monitor is an optional feature of the compute appliance that, when enabled, automatically collects and sends system availability metrics, environmental metrics, and error event information to Azul Customer Support for proactive monitoring. To enable this feature, the customer only needs SMTP connectivity to send messages over the Internet. This information, collected hourly, includes processor core statistics, memory statistics, fan assembly unit information, network processor board and processor board temperatures and voltages, power supply status, and application start and exit events. With this data, Azul can detect problems before they become critical and make necessary corrections before downtime occurs.

The Azul appliance also contains a Management Information Base (MIB) that can be accessed via SNMP, allowing Azul support personnel to monitor an appliance and proactively schedule a replacement when the data points to an impending problem.

Customer-Replaceable Units

The customer-replaceable components in an Azul Compute Appliance include the fan assembly unit, power supply unit, lower bezel, RJ-45 to DB9 connector adapter, AC power cord, serial cable, and rackmount kit. Customers can safely replace these parts without the need for a service technician. Because the appliance mounts solidly in the rack, not on rails, and all components are accessible from the front or rear, the appliance does not need to be slid out for repairs. This not only simplifies repairs, but also lessens the likelihood of a cable being accidentally unplugged in the process.

Hot-Pluggable Components

The hot-pluggable nature of the cooling fans and power supplies enables customers to install or remove these components while the Azul appliance is running, without affecting the rest of the system’s capabilities. On-site sparing for these two components is part of the Azul Pool Power service agreement.
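The benefit of pairing N+1 redundancy with hot-pluggable fans and power supplies can be estimated with a simple binomial model: the subsystem stays up as long as no more than one of its N+1 units is down at a time. The sketch below assumes that units fail independently and share the same availability. The figures in the usage example are hypothetical, not Azul measurements.

```python
def n_plus_one_availability(unit_availability: float,
                            required_units: int) -> float:
    """Availability of a subsystem that needs `required_units` of
    `required_units + 1` installed units to be working.

    Assumes independent units with identical availability.
    """
    if not 0.0 <= unit_availability <= 1.0:
        raise ValueError("unit_availability must be between 0 and 1")
    if required_units < 1:
        raise ValueError("required_units must be at least 1")
    a = unit_availability
    n = required_units
    all_up = a ** (n + 1)
    exactly_one_down = (n + 1) * (a ** n) * (1.0 - a)
    return all_up + exactly_one_down


if __name__ == "__main__":
    # Hypothetical unit availability, for illustration only.
    a, n = 0.99, 2
    print(f"without spare: {a ** n:.6f}")
    print(f"with N+1:      {n_plus_one_availability(a, n):.6f}")
```

Because a failed unit can be replaced while the appliance is running, the time a subsystem spends with a unit down is short, which is what keeps the chance of a second overlapping failure small.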
Replacement Process

When a hard failure occurs that does not involve kernel processes, the appliance keeps running. An alert is immediately generated and an Azul technician is scheduled to replace the failed part. Prior to taking the appliance offline, the customer puts the appliance into Maintenance Mode to prevent new applications from being assigned to it. Applications already assigned to the appliance are reassigned to other compute appliances in the compute pool, if possible. Administrators can either wait for current applications running on the appliance to terminate normally, or they can terminate them manually. The appliance is turned off until the replacement is made. Once tested and verified to be operational again, the appliance is changed from Maintenance Mode to Active Mode. At this point, the policy allows applications to be deployed on the repaired appliance.

Conclusion

The Azul Compute Appliance is designed with enterprise class reliability, availability, and serviceability features. As an appliance, it can be viewed as a black box with limited and thoroughly-tested configurations, low component count, few moving parts, redundant subsystems, and built-in error checking and error correction capabilities. Multiple appliances form a compute pool with additional redundancy, which in turn preserves application clustering topologies. The Azul Compute Appliance meets today’s requirements for a highly-reliable, on demand compute resource. Because it provides massive capacity, it typically allows customer workloads to run on fewer systems, thus reducing system count and improving reliability, availability, and serviceability across the infrastructure.
Related White Papers

The following related white papers can be found at www.azulsystems.com:

• Network Attached Processing: Scaling Enterprise Application Deployments
• Azul Compute Appliances: Ultra-High Capacity Building Blocks for Scalable Compute Pools
• Pauseless Garbage Collection: Improving Application Scalability and Predictability
• The Azul Virtual Machine: Realizing the Elegance of Java Technology
• Optimistic Thread Concurrency: Breaking the Scale Barrier
• Java Application Environment Security with Azul Compute Appliances
• Reducing the Cost of Application-tier Deployments with Network Attached Processing

About Azul Systems

Azul Systems has pioneered the industry’s first network attached processing solution designed to enable unbound compute resources for Java and J2EE based enterprise applications. Azul compute appliances eliminate capacity planning at the application level and much of the cost and complexity associated with the conventional delivery of computing resources. More information about Azul Systems can be found at www.azulsystems.com.

Copyright © 2007 Azul Systems, Inc. All rights reserved. Azul Systems and Azul are registered logos in the United States and other countries. The Azul arch logo, Compute Pool Manager, and Vega are trademarks of Azul Systems Inc. in the United States and other countries. Sun, Sun Microsystems, J2EE, J2SE, and Java are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. Other marks are the property of their respective owners and are used here only for identification purposes. Products and specifications discussed in this document may reflect future versions and are subject to change by Azul Systems without notice. This document may not be used for commercial purposes.

1600 Plymouth Street, Mountain View, CA 94043
T 650.230.6500 | F 650.230.6600 | www.azulsystems.com