Reliability, Availability, and Serviceability
whitepaper
RAS Characteristics of the Azul Compute Appliance
version 2.1
Azul Systems, Inc.
AWP-009-021
August 2007
EXECUTIVE OVERVIEW
The Azul Systems Compute Appliance is designed with enterprise
class reliability, availability, and serviceability features. As an
appliance, it can be viewed as a black box with limited and
thoroughly-tested configurations, low component count, few moving
parts, redundant subsystems, and built-in error checking and error
correction capabilities. Multiple appliances form a compute pool with
additional redundancy that in turn preserves application clustering
topologies.
Because it provides massive capacity, it typically allows customer
workloads to run on fewer systems, thus reducing system count
and improving reliability, availability, and serviceability across
the infrastructure. The Azul Compute Appliance meets today’s
requirements for a highly reliable, on-demand compute resource.
Introduction
A key component of any enterprise operating environment is the reliability, availability, and
serviceability (RAS) of hardware and software components that make up the applications
infrastructure. With billions of transistors, hundreds of mechanical devices, and millions of lines of
code, modern information technology systems can suffer failures.
Modern commercial systems have extensive RAS features and capabilities designed in from the
outset. These capabilities allow an IT organization to configure a resilient infrastructure that withstands
failures and, in most cases, recovers completely, either by self-correction or by engaging some form
of redundancy and/or initiating human intervention to replace or repair specific components.
The Azul Systems compute appliance is no exception. It is built for the most demanding IT
environments, and comprehensive RAS capabilities are an integral part of its hardware and software
design. The appliance is designed and optimized solely to run massive
amounts of virtual machine-based workloads. The considerable capacity of this appliance in
a symmetric multiprocessing (SMP) architecture enables applications to dynamically scale,
responding to varying workload and spikes without the need to reconfigure or provision application
tier servers. The small form factor provides high rack density, low environmental costs, and simple
administration. Moreover, because of the optimized nature of the appliance, component counts are
greatly reduced, and users can realize increased levels of application availability beyond that of
general-purpose commercial application servers.
The Azul Compute Appliance is the industry’s first solution in a new processing model called network
attached processing (Figure 1). Network attached processing allows Java™ platform-based
applications initiated on a general-purpose server to access the processing power of an Azul
compute pool. These pools consist of multiple high-capacity Azul Compute Appliances and
provide highly available processing power. Instead of running on the host server, the Azul Virtual
Machine and application threads are redirected to a compute appliance within the compute pool
for execution. All interfaces to clients, databases, and host operating system services remain on
the host server. Relative to the requirements of a single application, the compute pool provides a
virtually infinite, unbound pool of compute resources.
A policy-based Compute Pool Manager™ (CPM) provides flexible, central control of application
resources within a compute pool. The CPM is responsible for managing appliances and the
resources allocated to applications. Using the CPM, the compute pool administrator creates the
order, priorities, and the guarantees related to running applications. With the natural redundancies
of a compute pool and the policies governing each application or classes of applications,
organizations can confidently scale one or many applications within Azul Compute Appliance pools.
The bottom-line result of employing network attached processing is a high degree of RAS, owing to
reduced component counts, simpler infrastructure, the redundancies inherent in a compute
pool, and the enterprise-class RAS features within each compute appliance.
Figure 1. The Network Attached Processing Model (diagram labels: Web Tier, Application Tier, Application Hosts, Compute Pool, Compute Appliances, Database Tier)
This paper explores the hardware and software methods employed by the Azul Systems
architecture to attain these levels of RAS within the Azul Compute Appliance.
Network Attached Processing
Using network attached processing, servers can redirect application processing tasks to an Azul
compute pool. These pools consist of multiple ultra-high-capacity Azul appliances and
provide highly available application processing power. Relative to the requirements of a single
application, the compute pool provides a virtually infinite, unbound pool of compute resources. This
makes it possible for multiple applications to reside in one compute pool, a task accomplished
with the help of a policy-based Compute Pool Manager (CPM), which provides flexible, central
control of application resources.
Workload consolidation onto Azul compute pools is transparent to the application and the existing
servers. Applications continue to be invoked by and run on the existing servers, each with their own
separately configurable operating system and application server, but their raw processing power is
augmented by the Azul compute pool.
As shown in Figure 2, when an application is launched, a VM proxy on the local application host
server redirects the workload to a corresponding VM engine running on an appliance in the
compute pool. The result is a reduced need for traditional servers at the application tier, eliminating
a large number of points of failure from the overall implementation.
Figure 2. Azul Virtual Machine technology is a key component in network attached processing. The
virtual machine engine runs on an Azul Systems compute appliance and shifts the virtual machine
workload from the application host to the compute appliance. A VM engine is started automatically
by each VM proxy. (Diagram labels: Web Tier, Application Tier, Database Tier, Compute Pool; VM
proxies on Hosts 1 to 3; VM engines on Appliances 1 and 2.)
In other words, if compute appliances only had levels of RAS similar to the replaced application
hosts, the application environment would still realize increased levels of RAS due to decreased
component counts, and without deploying standby systems that remain idle most of the time.
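The component-count argument can be illustrated with a small calculation. The sketch below assumes that systems fail independently and with the same probability over some service interval; the function name and the sample figures (40 hosts, 4 appliances, a 2% failure probability) are hypothetical and are not Azul data.

```python
def probability_of_any_failure(system_count: int, failure_probability: float) -> float:
    """Chance that at least one of `system_count` independent systems fails.

    Assumes identical, independent systems; both are simplifying assumptions.
    """
    if system_count < 0:
        raise ValueError("system_count must be non-negative")
    if not 0.0 <= failure_probability <= 1.0:
        raise ValueError("failure_probability must be between 0 and 1")
    return 1.0 - (1.0 - failure_probability) ** system_count


if __name__ == "__main__":
    per_system = 0.02  # hypothetical per-system failure probability
    for count in (40, 4):  # hypothetical "before" and "after" system counts
        print(count, probability_of_any_failure(count, per_system))
```

Under these assumptions, the chance of having to service at least one system falls as the system count falls, even when each individual system is no more reliable than the one it replaces.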
Azul Compute Appliance Hardware and Software Protection
The reliability, availability, and serviceability features of a system’s design determine its ability to
operate continuously without failures, to become operational after a failure, and to minimize the time
needed to restore it after a failure. Together, these features enable undisrupted system operation.
As expected from a large scale enterprise class multiprocessor system, Azul Compute Appliances
are equipped with hardware and software protections that enable them to continue processing
without interruption when hardware failures occur, and to recover quickly in the unlikely event of a
failure. These protections include the following:
• ECC and/or parity protection
• Failure prediction and avoidance capabilities
• DRAM fault tolerance (Chipkill) on all system memory
• Memory scrubbing
• Automatic restart and configuration around failed processor, memory, and networking system
elements
• Redundant network processors, with automatic failover
• Redundant, bonded gigabit Ethernet data network links
• Software fault isolation
• N+1 power supply and fans
In addition, the following features ensure ease of serviceability of the Azul appliance:
• Self-monitoring of all critical system elements
• SNMP integration
• A small number of customer-replaceable units (CRUs)
• Hot-pluggable fan assemblies and power supplies
Compute Appliance Reliability and Availability
Because the compute appliance is built to run mission-critical business applications, reliability is
paramount, and is addressed in the compute appliance design. Experienced processor architecture,
verification, and silicon technology teams developed, reviewed, and perfected the Azul appliance
design to ensure enterprise-level reliability. During system manufacturing, the Azul appliance goes
through a thorough testing process to ensure high product quality levels.
The Azul system incorporates a highly scalable, symmetric multi-processor (SMP) design, built
around a custom processor. Each chip contains forty-eight fully independent 64-bit processor cores.
With up to sixteen chips, each Azul appliance houses as many as 768 processor cores, optimized
for multi-threaded virtual machine execution. Azul connects all of these processors through a
passive, non-blocking interconnect mesh that enables all processors to access available memory in
a given appliance.
Each Azul appliance employs dual network processors for network communications and system
control/service processing. Figure 3 shows a system view indicating the location and orientation of
the major system components. Processor boards, each containing two chips and 16 memory
modules, are oriented horizontally at the front of the system and connect to a midplane
that is unique to each of the available four-, eight-, and sixteen-chip processor configurations.
Hot-swappable power supplies are inserted from the front at the bottom. All I/O cabling connects
to the dual redundant network processor modules at the rear, located just above the
AC power inlets. Hot-swappable fan assemblies providing N+1 fault-resilient fans are mounted at
the rear in a pull orientation, providing front-to-back cooling airflow for all components.
Figure 3. System View of the Azul Systems Model 7280 Compute Appliance
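The figures quoted above (48 cores per chip, two chips and 16 memory modules per processor board, and four-, eight-, and sixteen-chip configurations) determine the board, core, and memory-module counts of each configuration. The following minimal sketch derives them; the function and constant names are illustrative only.

```python
CORES_PER_CHIP = 48                 # stated in the text
CHIPS_PER_BOARD = 2                 # stated in the text
MEMORY_MODULES_PER_BOARD = 16       # stated in the text
CHIP_CONFIGURATIONS = (4, 8, 16)    # stated in the text


def describe_configuration(chips: int) -> dict:
    """Derive board, core, and memory-module counts for a chip configuration."""
    if chips <= 0 or chips % CHIPS_PER_BOARD != 0:
        raise ValueError("chips must be a positive multiple of chips per board")
    boards = chips // CHIPS_PER_BOARD
    return {
        "chips": chips,
        "boards": boards,
        "cores": chips * CORES_PER_CHIP,
        "memory_modules": boards * MEMORY_MODULES_PER_BOARD,
    }


if __name__ == "__main__":
    for chips in CHIP_CONFIGURATIONS:
        print(describe_configuration(chips))
```

The sixteen-chip case reproduces the 768-core maximum cited earlier (16 × 48).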
The dynamic reconfiguration capabilities in Azul Compute Appliances allow the system to
reboot with reduced capacity after experiencing certain hardware faults related to processor
cores, processor chips or DIMMs. A partially configured Azul appliance continues to operate as
a member of a compute pool, albeit with reduced memory or CPU capacity, until the appliance
is serviced and repaired.
Additional RAS features, such as hardware error injection, ECC and parity protection, and
failure prediction and avoidance, to name a few, enable a simple, closed-box solution that
operates reliably. In addition, each Azul Compute Appliance is manufactured to ISO 9000 quality
manufacturing standards.
Processors
The appliance’s custom-designed processors are based on proven ASIC design methodology.
Compute appliance processors undergo hardware error injection during the design validation
process, and feature built-in system monitoring and diagnostics, as well as fail-over
interconnect to ensure dynamic fault resiliency.
Fully integrated processor chips mean no separate chips are needed for I/O and memory
management, for example. By minimizing the total number of chips within the appliance, Azul
has significantly lowered the probability of system failure. Extensive ECC and parity protection
on caches, processor register files, and translation look-aside buffers support failure prediction
and avoidance. BIST (Built-In Self-Test) and POST (Power-On Self-Test) check all caches, links,
memory, and processor operation.
The non-blocking, passive processor-to-memory interconnect features ECC on each
processor path. This ECC protects all data transfers between the processors and
memory.
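This paper does not state which error-correcting code the interconnect uses. As a general illustration of how ECC lets a receiver repair a corrupted transfer, the sketch below implements the textbook Hamming(7,4) code, which corrects any single flipped bit in a 7-bit codeword. It is an assumption chosen for brevity, not a description of the Azul hardware.

```python
from typing import List, Tuple


def hamming74_encode(data_bits: List[int]) -> List[int]:
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(codeword: List[int]) -> Tuple[List[int], int]:
    """Return (data_bits, error_position); error_position is 0 if no error."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome, read as a binary number, is the 1-based position of the flipped bit.
    error_position = s1 + 2 * s2 + 4 * s4
    if error_position:
        c[error_position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], error_position


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1  # simulate a single-bit fault in transit
    print(hamming74_correct(word))
```

Production memory and interconnect ECC typically uses wider codes over larger words, but the principle is the same: redundant check bits locate the faulty bit so that it can be corrected in flight.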
If a hard processor core failure occurs, the Azul appliance maps out the affected processor
cores. If the failure does not affect kernel processes, only the affected virtual machine will fail
when this hard failure is detected. The failed processor core is mapped out immediately if the
system is able to continue operating. Otherwise, it is mapped out on the next reboot.
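The map-out behavior described above can be sketched as follows. The class, method names, and the `can_continue` flag are hypothetical; how the appliance actually decides whether it can keep running is not described in this paper.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class CoreMap:
    """Tracks which processor cores are in service (illustrative model only)."""
    active_cores: Set[int]
    pending_map_out: Set[int] = field(default_factory=set)

    def handle_hard_failure(self, core_id: int, can_continue: bool) -> str:
        """Map a failed core out now if possible, otherwise defer to reboot."""
        if can_continue:
            self.active_cores.discard(core_id)
            return "mapped_out_immediately"
        self.pending_map_out.add(core_id)
        return "map_out_on_next_reboot"

    def reboot(self) -> None:
        """Apply deferred map-outs; the system restarts with reduced capacity."""
        self.active_cores -= self.pending_map_out
        self.pending_map_out.clear()


if __name__ == "__main__":
    cores = CoreMap(active_cores=set(range(48)))
    print(cores.handle_hard_failure(7, can_continue=True))
    print(len(cores.active_cores))
```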
Network Processors
The dual network processors found on each appliance feature dual Gigabit Ethernet network
links per network processor, built-in system monitoring and diagnostics, and a failover
mechanism. Just as with the processors, the appliance’s network processors undergo
hardware error injection during the design validation process.
Power and Cooling Subsystems
As a network attached processing resource, the Azul appliance does not need to maintain
state, and thus requires no disk drives. The fans are the only moving parts, which contributes to a
higher level of hardware reliability for the entire appliance. N+1 power and cooling components
provide redundancy in case of failures.
The cooling fans feature a straight flow-through air design, and a tachometer on each
fan informs the service processor when the fan is not working properly. In addition,
predictive fan failure detection monitors fan RPM and generates an alert if the speed stays below the
threshold level for a continuous period of time.
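The predictive check described above, an alert only when the speed stays below a threshold for a continuous period, can be sketched as a simple scan over tachometer samples. The function name, the sample format, and the threshold values used in the example are assumptions; the actual thresholds are not given in this paper.

```python
from typing import Iterable, Optional, Tuple


def fan_alert_time(
    samples: Iterable[Tuple[float, float]],
    min_rpm: float,
    hold_seconds: float,
) -> Optional[float]:
    """Return the timestamp at which an alert would fire, or None.

    `samples` holds (timestamp_seconds, rpm) pairs in increasing time order.
    An alert fires once the RPM has stayed below `min_rpm` for at least
    `hold_seconds` without recovering.
    """
    below_since = None
    for timestamp, rpm in samples:
        if rpm < min_rpm:
            if below_since is None:
                below_since = timestamp  # start of a low-speed episode
            if timestamp - below_since >= hold_seconds:
                return timestamp
        else:
            below_since = None  # speed recovered, so reset the episode
    return None


if __name__ == "__main__":
    readings = [(0, 5000), (10, 2900), (20, 5000), (30, 2800), (40, 2700), (50, 2600)]
    print(fan_alert_time(readings, min_rpm=3000, hold_seconds=20))
```

Requiring a continuous low-speed period keeps a single noisy reading from raising an alert.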
Memory Subsystem
Because they do not run a traditional operating system, Azul appliances do not require hard
drives. Instead, each appliance features an enterprise-grade solid state flash card that, by
definition, has no head, platter, or motor; in other words, it contains no moving
parts. This enables an Azul appliance to realize a longer operational lifetime before a failure
than a general-purpose server with a hard drive.
As with processor register files and caches, extensive error detection and checking within
the system memory helps maintain the integrity of data that passes through the system. The
system design facilitates the recognition of errors that are either corrected dynamically (such
as through memory scrubbing), or properly reported for isolation and repair.
Each Azul appliance monitors the rate of correctable errors (i.e., soft errors) on memory pages
and sends out an alert to replace the affected DIMM when the rate exceeds a certain
threshold. Because these are soft errors (corrected on the fly by the ECC logic in the
memory subsystem), the offending memory remains in use. If a hard failure occurs, the Azul
appliance maps out the memory (much as it maps out processor cores). If the system is
able to continue operating, the failed memory is mapped out immediately. Otherwise, it is
mapped out on the next reboot. In addition, the DRAMs feature Chipkill, so that the appliance
can continue operating even if an entire DRAM stops operating.
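Correctable-error rate monitoring of this kind is commonly implemented with a sliding time window per DIMM. The sketch below is one such approach under stated assumptions: the class name, the per-DIMM granularity, and the threshold values are hypothetical, since this paper does not specify them.

```python
from collections import defaultdict, deque


class CorrectableErrorMonitor:
    """Flags a DIMM when its corrected-error count in a time window is too high."""

    def __init__(self, max_errors: int, window_seconds: float) -> None:
        self.max_errors = max_errors
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # dimm_id -> timestamps of recent errors

    def record(self, dimm_id: str, timestamp: float) -> bool:
        """Record one corrected error; return True if a replacement alert is due."""
        events = self._events[dimm_id]
        events.append(timestamp)
        # Discard errors that have aged out of the window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        return len(events) > self.max_errors


if __name__ == "__main__":
    monitor = CorrectableErrorMonitor(max_errors=3, window_seconds=60)
    for t in (0, 10, 20, 30):
        print(t, monitor.record("DIMM-A0", t))
```

The memory stays in service while errors remain correctable; the alert only schedules a replacement.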
Serviceability of Azul Compute Appliances
The modular nature of this relatively simple appliance ensures ease of serviceability. Each
Azul Compute Appliance can be viewed as a black box with very few customer-replaceable
elements. The CPM enables organizations to easily monitor the status and performance
statistics of applications and compute appliances. The CPM can integrate with IT management
software, including authentication servers and SNMP-based network monitoring tools.
Self Monitoring of Critical System Elements
Remote Support Monitor is an optional feature enabled on the compute appliance that
automatically collects and sends system availability metrics, environmental metrics, and
error event information to Azul Customer Support for proactive monitoring. To enable this
feature, the customer only needs SMTP connectivity to send messages over the Internet.
This information, collected hourly, includes processor core statistics, memory statistics, fan
assembly unit information, network processor board and processor board temperatures and
voltages, power supply status, and application start and exit events. With this data, Azul can
detect problems before they become critical and make necessary corrections before system
downtime is encountered. The Azul appliance also contains a Management Information
Base (MIB) that can be accessed via SNMP, allowing Azul support personnel to monitor an
appliance and proactively schedule a replacement when the data points to an impending
problem.
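The format of the Remote Support Monitor messages is not documented in this paper. Purely as an illustration of sending collected metrics over SMTP, the sketch below packages a metrics dictionary as a plain-text e-mail using Python's standard library. The function name, field names, addresses, and JSON encoding are all assumptions.

```python
import json
from email.message import EmailMessage


def build_status_report(appliance_name: str, metrics: dict,
                        sender: str, recipient: str) -> EmailMessage:
    """Package one collection interval's metrics as an e-mail message."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = f"Hourly status report: {appliance_name}"
    # JSON body is a hypothetical encoding chosen for readability.
    message.set_content(json.dumps(metrics, indent=2, sort_keys=True))
    return message


if __name__ == "__main__":
    report = build_status_report(
        "appliance-1",
        {"fan_rpm": [5000, 4980], "power_supplies_ok": True},
        sender="appliance-1@example.com",
        recipient="support@example.com",
    )
    print(report)
```

A message built this way could be handed to `smtplib.SMTP.send_message` for delivery, which matches the paper's statement that only SMTP connectivity is required.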
Customer-Replaceable Units
The customer-replaceable components in an Azul Compute Appliance include the fan
assembly unit, power supply unit, lower bezel, RJ-45 to DB9 connector adapter, AC power
cord, serial cable, and rackmount kit. Customers can safely replace these parts without the
need for a service technician. Because the appliance is solid-mounted in the rack rather than on rails,
and all components are front and rear accessible, the appliance does not need to be slid out
for repairs. This not only simplifies repairs, but also lessens the likelihood of a cable getting
accidentally unplugged in the process.
Hot-Pluggable Components
The hot-pluggable nature of the cooling fans and power supplies enables customers to install
or remove these components while the Azul appliance is running, without affecting the rest of
the system’s capabilities. On-site sparing for these two components is part of the Azul Pool
Power service agreement.
Replacement Process
When a hard failure occurs that does not involve kernel processes, the appliance keeps
running. An alert is immediately generated and an Azul technician is scheduled to replace
the failed part. Prior to taking the appliance offline, the customer puts the appliance into
Maintenance Mode to prevent new applications from being assigned to it. Applications already
assigned to the appliance are reassigned to other compute appliances in the compute pool,
if possible. Administrators can either wait for current applications running on the appliance
to terminate normally, or they can terminate them manually. The appliance is turned off until
the replacement is made. Once tested and verified to be operational again, the appliance is
changed from Maintenance Mode to Active Mode. At this point, the policy allows applications
to be deployed on the repaired appliance.
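The replacement sequence above amounts to a small state machine: enter Maintenance Mode, drain applications, power off, repair, then return to Active Mode. The following sketch models that flow. The class, its method names, and the simple first-available reassignment are illustrative assumptions and do not represent the Compute Pool Manager interface.

```python
from enum import Enum
from typing import List


class Mode(Enum):
    ACTIVE = "active"
    MAINTENANCE = "maintenance"


class PooledAppliance:
    """Illustrative model of an appliance in a compute pool."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.mode = Mode.ACTIVE
        self.powered_on = True
        self.applications: List[str] = []

    def accepts_new_applications(self) -> bool:
        return self.powered_on and self.mode is Mode.ACTIVE

    def enter_maintenance(self) -> None:
        self.mode = Mode.MAINTENANCE

    def drain_to(self, others: List["PooledAppliance"]) -> List[str]:
        """Reassign applications to peers where possible; return those left."""
        remaining = []
        for app in self.applications:
            target = next((o for o in others if o.accepts_new_applications()), None)
            if target is None:
                remaining.append(app)  # must finish or be terminated manually
            else:
                target.applications.append(app)
        self.applications = remaining
        return list(remaining)

    def power_off(self) -> None:
        if self.mode is not Mode.MAINTENANCE:
            raise RuntimeError("enter Maintenance Mode before powering off")
        if self.applications:
            raise RuntimeError("applications are still assigned")
        self.powered_on = False

    def return_to_service(self, verified: bool) -> None:
        if not verified:
            raise RuntimeError("appliance must be tested and verified first")
        self.powered_on = True
        self.mode = Mode.ACTIVE
```

Blocking power-off until the appliance is in Maintenance Mode and empty mirrors the order of steps given in the text.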
Conclusion
The Azul Compute Appliance is designed with enterprise class reliability, availability, and
serviceability features. As an appliance, it can be viewed as a black box with limited and
thoroughly-tested configurations, low component count, few moving parts, redundant
subsystems, and built-in error checking and error correction capabilities. Multiple appliances
form a compute pool with additional redundancy which in turn preserves application clustering
topologies. The Azul Compute Appliance meets today’s requirements for a highly reliable,
on-demand compute resource. Because it provides massive capacity, it typically allows customer
workloads to run on fewer systems, thus reducing system count and improving reliability,
availability, and serviceability across the infrastructure.
Related White Papers
The following related white papers can be found at www.azulsystems.com:
• Network Attached Processing: Scaling Enterprise Application Deployments
• Azul Compute Appliances: Ultra-High Capacity Building Blocks for Scalable Compute Pools
• Pauseless Garbage Collection: Improving Application Scalability and Predictability
• The Azul Virtual Machine: Realizing the Elegance of Java Technology
• Optimistic Thread Concurrency: Breaking the Scale Barrier
• Java Application Environment Security with Azul Compute Appliances
• Reducing the Cost of Application-tier Deployments with Network Attached Processing
About Azul Systems
Azul Systems has pioneered the industry’s first network attached processing solution designed
to enable unbound compute resources for Java and J2EE based enterprise applications. Azul
compute appliances eliminate capacity planning at the application level and much of the cost
and complexity associated with the conventional delivery of computing resources. More
information about Azul Systems can be found at www.azulsystems.com.
Copyright © 2007 Azul Systems, Inc. All rights reserved. Azul Systems and Azul are registered logos in the
United States and other countries. The Azul arch logo, Compute Pool Manager, and Vega are trademarks
of Azul Systems Inc. in the United States and other countries. Sun, Sun Microsystems, J2EE, J2SE, Java are
trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries.
Other marks are the property of their respective owners and are used here only for identification purposes.
Products and specifications discussed in this document may reflect future versions and are subject to
change by Azul Systems without notice. This document may not be used for commercial purposes.
1600 Plymouth Street, Mountain View, CA 94043 T 650.230.6500 | F 650.230.6600 | www.azulsystems.com