Monitoring and Managing PowerEdge 1655MC High Performance Computing Clusters
Dell White Paper
By Scalable Systems Group
April 2003
Contents
Introduction: Modular Computing in HPCC
PowerEdge 1655MC Overview
Dell’s Management Solution for PowerEdge 1655MC HPC Clusters
In-band Monitoring and Management
    IT Assistant (ITA)
    Ganglia
Out-of-Band Monitoring and Management
    ERA/MC
    Digital KVM
Conclusion
References
Figures
Figure 1: PowerEdge 1655MC Chassis - Front View
Figure 2: PowerEdge 1655MC Chassis - Rear View
Figure 3: 66-blade PowerEdge 1655MC HPC Cluster Configuration
Figure 4: Summary of PowerEdge 1655MC Chassis Information
Figure 5: At-a-Glance View of Ganglia
Figure 6: Information about One Node
Figure 7: Web Based ERA/MC Console
Figure 8: OSCAR screen on 2161DS
Figure 9: ERA/MC and KVM Controller Card
Dell Enterprise Product Group
Section 1
Introduction: Modular Computing in HPCC
Modular computing solutions target environments in which the servers are
consolidated into one physical location, which is most commonly the case with
clusters. Some elements – the power supply, the cabling, and the systems
management – do not need to be replicated for every server, and can be shared
among the modular pieces.
The Dell™ PowerEdge™ 1655MC is the first product in Dell’s Modular
Computing or “blade server” product line. Blade server architecture introduces
several self-contained servers, known as blades, within a server chassis. Each
blade has its own processor(s), memory, I/O subsystem, a set of hard drives, an
operating system, and other basic components. The chassis provides redundant
infrastructure components, such as power supplies, fans, and switches. The
concept of modular computing has the potential to increase server density,
improve manageability, lower power consumption, and enhance deployment
and serviceability, all resulting in lower TCO (Total Cost of Ownership).
Furthermore, compared with integrated servers, the PowerEdge 1655MC modular
design offers the following advantages, which make it well suited as a
building block for a high performance computing cluster:
• Low heat production
• Low power consumption
• Lower space requirements (0.5U/server)
• Easy deployment and simplified cable management
• Ease of service and replacement
• Ease of adding computing resources
Section 2
PowerEdge 1655MC Overview
The Dell PowerEdge 1655MC features up to six server blades in one chassis in a
3U form factor. Each blade functions as an individual server utilizing its own
memory, 2 CPUs and 2 internal SCSI hard drives. The chassis includes power
supplies, network module, fans, and a management module. The PowerEdge
1655MC optionally ships with a USB CDROM/Floppy drive. The chassis also
contains two Gigabit Ethernet network switches, which connect internally to two
network interface cards (NICs) embedded on each blade. Additionally, Dell
embedded remote access (ERA) hardware and firmware are integrated in the
chassis. The ERA module monitors all the shared infrastructure components of
the chassis. Figures 1 and 2 show the front and rear views of the PowerEdge
1655MC, respectively. For detailed information regarding the Dell PowerEdge
1655MC, refer to
http://www.dell.com/us/en/esg/topics/esg_pedge_rackmain_servers_1_pedge_1655mc.htm
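
The density figures quoted in this paper (six blades per 3U chassis, or 0.5U per server) reduce to simple arithmetic. The following is a minimal sketch for illustration only; the function names and the 42U rack height are assumptions and do not come from Dell documentation.

```python
# Density arithmetic for the PowerEdge 1655MC, using the figures stated in
# this paper: up to six blades per chassis, each chassis 3U high.
BLADES_PER_CHASSIS = 6
CHASSIS_HEIGHT_U = 3


def rack_units_per_blade():
    """Rack space consumed per blade, in rack units (U)."""
    return CHASSIS_HEIGHT_U / BLADES_PER_CHASSIS


def blades_per_rack(rack_height_u=42):
    """Upper bound on blades in one rack.

    The 42U default is an assumed rack height. The result ignores space
    needed for switches, KVM equipment, and power distribution.
    """
    chassis = rack_height_u // CHASSIS_HEIGHT_U
    return chassis * BLADES_PER_CHASSIS


if __name__ == "__main__":
    print("U per blade:", rack_units_per_blade())
    print("Max blades in an assumed 42U rack:", blades_per_rack())
```

In practice the usable blade count per rack is lower, because the cluster switches and management hardware also occupy rack space.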
Figure 1: PowerEdge 1655MC Chassis - Front View
Figure 2: PowerEdge 1655MC Chassis - Rear View
Section 3
Dell’s Management Solution for PowerEdge
1655MC HPC Clusters
Dell’s PowerEdge 1655MC HPCC solution provides four methods of managing
and monitoring the cluster: Dell OpenManage™ IT Assistant, Ganglia, digital
KVM and ERA. IT Assistant and Ganglia are the two in-band management tools;
they use the cluster fabric, or intra-cluster network, for monitoring and
management traffic. IT Assistant is Dell’s server management solution. It
provides a centralized management console used to discover nodes on the
network and examine hardware sensor data to help prevent failures at the
system level. Ganglia is an OS-level cluster monitor that can be used to look
at resource usage, detect node failures, and troubleshoot performance problems.
Both ITA and Ganglia require OS support and use the cluster fabric for
communication.
Figure 3 shows a sample PowerEdge 1655MC HPC cluster configuration
with 66 blades as the compute nodes. The Cluster Fabric in the diagram is
built from three Dell PowerConnect™ 5224 Gigabit Ethernet switches.
Four Gigabit Ethernet links form a network trunk from each PowerEdge
1655MC chassis to one of the PowerConnect 5224 switches. A dedicated IT
Assistant node, a PowerEdge 1650 serving as the IT Assistant monitoring and
management station, is connected to one of the switches as well as to the ERA
Fabric. The ERA Fabric is built on a PowerConnect 3024 Fast
Ethernet switch. The ERA ports on the PowerEdge 1655MC chassis are connected to
this PowerConnect 3024 switch. The master node, a PowerEdge 1650 server, is
also connected to the 3024 switch, so that both the ITA node and the master node
can perform out-of-band ERA monitoring and management functions.
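
The port counts implied by this configuration can be worked out from the figures given above (66 blades, six blades per chassis, four trunk links per chassis, three cluster switches). The sketch below is illustrative only: the function name is hypothetical, and the even spread of chassis across the three switches is an assumption, since the paper does not state how the chassis are distributed.

```python
import math


def cluster_port_budget(blades=66, blades_per_chassis=6,
                        trunk_links_per_chassis=4, cluster_switches=3):
    """Derive chassis and port counts for the sample configuration."""
    chassis = math.ceil(blades / blades_per_chassis)
    trunk_links = chassis * trunk_links_per_chassis
    # Assumption: chassis are spread as evenly as possible across the
    # cluster switches, and each chassis trunks to exactly one switch.
    max_chassis_per_switch = math.ceil(chassis / cluster_switches)
    max_trunk_ports_per_switch = max_chassis_per_switch * trunk_links_per_chassis
    # ERA fabric: one ERA port per chassis, plus the ITA node and the
    # master node, which the paper also connects to the ERA switch.
    era_fabric_ports = chassis + 2
    return {
        "chassis": chassis,
        "trunk_links": trunk_links,
        "max_trunk_ports_per_switch": max_trunk_ports_per_switch,
        "era_fabric_ports": era_fabric_ports,
    }


if __name__ == "__main__":
    for key, value in cluster_port_budget().items():
        print(f"{key}: {value}")
```

For the 66-blade example this gives 11 chassis, 44 trunk links in total, and 13 ports on the ERA fabric switch.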
The other out-of-band fabric, the KVM Fabric, runs through a digital KVM
switch, the Dell 2161DS Remote Console Switch. The KVM ports on the
PowerEdge 1655MC chassis, the master node, and the ITA node are connected to
the 2161DS switch. The Ethernet ports on the 2161DS switch are connected to the
LAN outside the cluster to form a complete out-of-band management network
that is independent of the cluster fabric and the ERA fabric. For detailed
information about using the 2161DS switch, refer to:
http://www.dell.com/us/en/biz/topics/power_ps3q02-avocent.htm.
For information regarding PowerEdge 1655MC HPC clusters, please visit the
Dell HPCC web site at:
http://www.dell.com/us/en/esg/topics/products_clstr_gb1655_pedge_configs_1655_cluster_hpcc.htm
Figure 3: 66-blade PowerEdge 1655MC HPC Cluster Configuration
Section 4
In-band Monitoring and Management
It is important for an HPCC system administrator to be able to monitor a cluster
at the hardware level, especially in a large cluster environment. The Dell HPC
cluster solution offers two methods of in-band management:
• Dell OpenManage™ IT Assistant (ITA), a Web-based tool for managing Dell
  servers, and
• Ganglia, an open source monitoring tool developed at the University of
  California, Berkeley.
IT Assistant (ITA)
OpenManage IT Assistant is a web browser-based tool that supports all
of the PowerEdge 1655MC components through the Simple Network
Management Protocol (SNMP). It allows the cluster administrator to
manage and monitor the hardware of an entire cluster, and to perform
day-to-day cluster management tasks from a centralized location using a GUI.
SNMP provides the communication between the management console and the nodes,
with every system component running an SNMP agent. IT Assistant provides the
following functionality:
• Discovery of the chassis and chassis components (see Figure 4)
• Support for hot swapping blades
• Summary and status information for all chassis components and support for
  system inventory and search
• Launch of management applications for chassis components
• Management of events generated by chassis components
• Page/e-mail when an event occurs
• One-to-many centralized console
All of the functions mentioned above are crucial to the management of an HPC
cluster. One of the most basic system administration tasks, discovery and
identification of nodes, is performed by IT Assistant, as is discovery of
chassis components: the embedded Ethernet switch and the ERA module.
IT Assistant allows the administrator to hot swap any blade in the chassis
without interrupting the other blades, so maintenance can be
performed without shutting down all the blades in a chassis. As the cluster grows
in size, the node status information becomes even more important to monitor in
order to simplify administration. IT Assistant provides such information as
system name, IP address, MAC address, versions of components, memory size,
chassis service tag, chassis asset tag, blade slot number and blade service tag.
IT Assistant provides one-to-many functions such as remote shutdown,
BIOS flashing, configuration of server alert functions, and inventory of all
components.
IT Assistant includes an event management system (ESM) for capturing any
event that is generated by the modules through SNMP traps. Administrators can
associate actions with specific events, including email, paging or application
launching.
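
The idea of associating actions with specific events can be sketched as a small dispatch table. This is not IT Assistant's API or implementation: the class, the event type strings, and the action helpers below are all hypothetical, and the actions only return a description of what they would do rather than sending mail or pages.

```python
from collections import defaultdict


class EventActions:
    """Hypothetical registry mapping event types to lists of actions."""

    def __init__(self):
        self._actions = defaultdict(list)

    def associate(self, event_type, action):
        """Register an action (a callable) for one event type."""
        self._actions[event_type].append(action)

    def handle(self, event):
        """Run every action registered for the event's type, in order."""
        return [action(event) for action in self._actions.get(event["type"], [])]


def email_action(address):
    """Build an action that describes an e-mail notification."""
    def send(event):
        # A real system would hand this message to a mail transport.
        return f"email to {address}: {event['type']} on {event['source']}"
    return send


def page_action(pager_id):
    """Build an action that describes a pager notification."""
    def send(event):
        return f"page {pager_id}: {event['type']} on {event['source']}"
    return send


if __name__ == "__main__":
    registry = EventActions()
    registry.associate("fan_failure", email_action("admin@example.com"))
    registry.associate("fan_failure", page_action("oncall-1"))
    # In a real deployment the event would be decoded from an SNMP trap.
    for line in registry.handle({"type": "fan_failure", "source": "chassis-04"}):
        print(line)
```

Events with no registered action are simply ignored, which matches the opt-in nature of associating actions with specific events.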
Figure 4: Summary of PowerEdge 1655MC Chassis Information
Ganglia
Another in-band management tool available in a Dell PowerEdge 1655MC
cluster offering is Ganglia, an open source OS-level cluster monitor. Out of the
box, Ganglia monitors and automatically graphs over 20 metrics such as the
node’s load average, number of running processes, number of incoming and
outgoing network packets, and total and free memory on every node of the
cluster.
Ganglia provides several levels of cluster information. The at-a-glance view
(Figure 5) shows the overall status of the cluster and summarizes total node
count, number of nodes that are up, overall load average, and CPU and memory
utilization for the cluster. Color-coding is used to represent CPU utilization
to enable quick identification of overloaded systems. A crossbones icon
indicates that a node is down. Selecting a different metric in this view
redisplays the screen with the value of this metric for each node, and uses the
metric as a sort index when displaying the nodes.
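
The kind of summary shown in the at-a-glance view can be approximated from the XML that Ganglia's monitoring daemon (gmond) publishes. The sketch below works on an embedded, made-up sample rather than a live cluster. The element and attribute names are modeled on gmond's XML output and should be checked against the installed Ganglia version; the rule that a host is down when its last report is older than four times its expected interval is likewise an assumption modeled on the Ganglia web frontend.

```python
import xml.etree.ElementTree as ET

# Illustrative sample only; host names, addresses, and values are made up.
SAMPLE_XML = """\
<GANGLIA_XML SOURCE="gmond">
 <CLUSTER NAME="pe1655mc">
  <HOST NAME="node01" IP="192.0.2.1" TN="5" TMAX="20">
   <METRIC NAME="load_one" VAL="0.50" TYPE="float" UNITS=""/>
  </HOST>
  <HOST NAME="node02" IP="192.0.2.2" TN="3" TMAX="20">
   <METRIC NAME="load_one" VAL="1.50" TYPE="float" UNITS=""/>
  </HOST>
  <HOST NAME="node03" IP="192.0.2.3" TN="300" TMAX="20">
   <METRIC NAME="load_one" VAL="0.00" TYPE="float" UNITS=""/>
  </HOST>
 </CLUSTER>
</GANGLIA_XML>
"""


def summarize(xml_text, down_factor=4):
    """Return node counts, down nodes, and mean one-minute load of up nodes."""
    root = ET.fromstring(xml_text)
    total = 0
    up = 0
    down = []
    loads = []
    for host in root.iter("HOST"):
        total += 1
        tn = int(host.get("TN", "0"))      # seconds since last report
        tmax = int(host.get("TMAX", "0"))  # expected maximum report interval
        if tmax and tn > down_factor * tmax:
            down.append(host.get("NAME"))
            continue
        up += 1
        for metric in host.findall("METRIC"):
            if metric.get("NAME") == "load_one":
                loads.append(float(metric.get("VAL")))
    avg_load = sum(loads) / len(loads) if loads else 0.0
    return {"total": total, "up": up, "down": down, "avg_load_one": avg_load}


if __name__ == "__main__":
    print(summarize(SAMPLE_XML))
```

Against a real cluster, the same function could be fed the XML read from a gmond daemon instead of the embedded sample.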
Figure 5: At-a-Glance View of Ganglia
Clicking on an individual node icon displays all available information for this
node (Figure 6). This view summarizes the static information such as the version
of the OS, system usage, IP address, machine type, and graphs those metrics that
change over time, such as memory and CPU usage, network traffic statistics,
number of running processes, and disk usage.
Figure 6: Information about One Node
Ganglia allows administrators to define and add other parameters in the
cluster that they want to monitor. Ganglia’s GUI will automatically graph those
values in addition to the pre-set metrics for every node. Ganglia also simplifies
cluster management by providing a remote execution environment. This feature
is used for remote management, and to execute commands in parallel on
multiple nodes.
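
One way to publish a user-defined metric is Ganglia's gmetric command-line tool; this paper does not say which mechanism was used, so treating gmetric as the mechanism is an assumption. The sketch below only builds and prints the command line and does not run it. The helper name and the metric name scratch_free_gb are hypothetical, and the options should be checked against the installed Ganglia version.

```python
import shlex

# Value types accepted by gmetric's --type option.
VALID_TYPES = {
    "string", "int8", "uint8", "int16", "uint16",
    "int32", "uint32", "float", "double",
}


def gmetric_command(name, value, metric_type="float", units=""):
    """Build (but do not run) a gmetric invocation for a custom metric."""
    if metric_type not in VALID_TYPES:
        raise ValueError(f"unsupported metric type: {metric_type}")
    argv = ["gmetric", "--name", name, "--value", str(value),
            "--type", metric_type]
    if units:
        argv += ["--units", units]
    return argv


if __name__ == "__main__":
    # Hypothetical metric: free space on a scratch file system, in GB.
    argv = gmetric_command("scratch_free_gb", 41.5, units="GB")
    print(shlex.join(argv))
```

Running the resulting command periodically on each node (for example from cron) would make the value available for graphing alongside the pre-set metrics.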
Additionally, Ganglia provides the ability to monitor multiple clusters. This is
especially useful in large compute centers where computational resources are
grouped in smaller clusters for specialized use. The centralized console enables
an administrator to monitor multiple clusters at once, while maintaining a high
level of security by defining trust relationships.
Section 5
Out-of-Band Monitoring and Management
During heavy communication between application components or blade-server
nodes, in-band management and monitoring can inaccurately report network or
server problems, since they share the fabric with the applications. In addition,
any monitoring or management traffic consumes resources that would otherwise
be available to parallel applications. Finally, if a machine’s operating system
(OS) is not responding, neither in-band method guarantees access to the node
or the ability to fix the problem, since both rely on OS support.
In these situations, system administrators can use the out-of-band network
management methods to communicate with the cluster hardware, and diagnose
or fix problems. Dell’s HPCC solution provides two out-of-band management
routes: digital KVM and ERA.
ERA/MC
The Dell Embedded Remote Access/MC Controller for Dell PowerEdge 1655MC
provides remote systems management for the modular computing blades.
ERA/MC provides an out-of-band management route by using its own
dedicated processor, memory, bus and network port, without consuming the
cluster computing or network resources. If the cluster blades become
unresponsive, ERA/MC allows the administrator to view and access the nodes
remotely to troubleshoot the system. ERA/MC provides the following
functionality to the PowerEdge 1655MC system:
• Initial configuration of chassis and blades
• Scripting for automation
• Local and remote management of chassis and blades
• Configuration of blades, network switches and the digital KVM through
  console redirection
• Remote firmware updates
• Remote monitoring of fans and sensors
• Remote power cycle, power down and power up
The use of ERA/MC within an HPC cluster simplifies cluster management and
allows a system administrator to monitor the hardware components remotely
either through a CLI (through the serial port) or a web-based GUI console. The
main utility used in ERA is racadm (remote access control and administrator),
which provides the interface for monitoring and configuring the system. The
racadm utility can be used through a serial port using a communications program
such as minicom or HyperTerminal, through a remote interface, or through a
web-based console across the network. Through the serial interface, the
administrator can view or modify the configuration settings on the chassis or the
blades. For instance, the administrator can change the IP configuration of the
ERA/MC port in order to reach the GUI available on the web console (Figure
7).
Also through the serial interface, console redirection can be used for configuring
the blades, switches and KVM. An administrator can use the automated
scripting feature to run configuration commands on multiple nodes. This proves
to be a useful tool for making identical changes within large cluster
configurations. The remote interface is currently supported only through the
MS-DOS environment under Windows, where the racadm command can be used for
managing the nodes. This CLI provides the only means of modifying the
properties of the ERA/MC on the PowerEdge 1655MC, and automated
scripting can be used here as well. The web interface can be accessed through
any supported web browser using the ERA/MC IP address or through IT
Assistant. It allows the user to utilize the features of the ERA/MC in a graphical
interface.
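
Automated scripting across several chassis amounts to issuing the same racadm subcommand to each ERA/MC address. The sketch below only builds and prints the command lines; it does not run them. The -r, -u and -p remote options follow the general form of Dell's remote racadm and are an assumption here, as is the getsysinfo subcommand used as a placeholder. Both should be verified against the ERA/MC documentation. The IP addresses and credentials are made up.

```python
import shlex


def racadm_commands(chassis_ips, subcommand, user, password):
    """Build one remote racadm command line per ERA/MC address.

    subcommand is a list of arguments, for example ["getsysinfo"].
    """
    commands = []
    for ip in chassis_ips:
        # Assumed remote syntax: racadm -r <ip> -u <user> -p <password> <subcommand>
        commands.append(
            ["racadm", "-r", ip, "-u", user, "-p", password, *subcommand]
        )
    return commands


if __name__ == "__main__":
    # Hypothetical ERA/MC addresses, one per chassis.
    era_ips = ["192.0.2.11", "192.0.2.12", "192.0.2.13"]
    for argv in racadm_commands(era_ips, ["getsysinfo"], "USER", "PASSWORD"):
        # Dry run: print the command instead of executing it.
        print(shlex.join(argv))
```

Passing a password on the command line exposes it to other local users, so a production script would need a safer way to supply credentials.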
Figure 7: Web Based ERA/MC Console
One of the main features of out-of-band management is the ability to control and
monitor the hardware from a remote location. The racadm commands on the
PowerEdge 1655MC allow the administrator to view the health status of the
chassis and blades within the cluster. By allocating appropriate IP addresses to
the ERA/MC ports of all the chassis within a cluster, the administrator can assign
names to each system and to each blade, allowing access to individual blades in
order to utilize specific resources. Racadm offers multiple commands for
troubleshooting the cause of a failure. For example, the administrator can view
information on the modules within the chassis (the blades, the network switches,
and the fans), as well as sensor information such as fan speeds in RPM, the
status of the power supplies and much more. For the blades, administrators can
power-cycle the nodes individually, reset configurations, and cause LEDs to
blink or glow to easily identify systems within a cluster.
Digital KVM
The PowerEdge 1655MC contains an embedded digital KVM switch, which
allows video, keyboard, and mouse access to each blade. All access to the
blades is from the management card on the chassis, either through
the standard analog PS/2 keyboard, mouse and video ports or through the Analog
Rack Interface port with a CAT5 cable. The Analog Rack Interface port can be
connected directly to a port on the Dell 2161DS Remote Console Switch with a
CAT5 cable, which cascades the switches and allows them to be accessed from
one central place. In large cluster configurations with several PowerEdge
1655MC chassis, this can greatly simplify cable organization and
management.
The 2161DS Remote Console Switch pulls together both analog and digital
technologies to provide a central point of access to an entire cluster. Each KVM
switch has 16 ports to attach machines or other switches and can be networked
over a LAN connection to provide remote access to these machines. Each
machine must use a System Interface POD (SIP) for converting the keyboard,
video and mouse signals to Ethernet. This considerably reduces the groups of
KVM cables that are usually associated with HPC clusters. The switch comes
with cross-platform software that allows the administrator to manage the
switch, install a new 2161DS switch or launch video sessions to a system server.
The administrator can view multiple machines from this access point and use the
keyboard and mouse on the individual machines.
A 2161DS KVM switch with one or more chassis attached allows all
the blades to be viewed from a centralized location. The KVM switches use OSCAR
(On-Screen Configuration and Activity Reporting interface) to select the nodes;
when multiple chassis are cascaded, the user is able to see all nodes on one
interface.
Figure 8: OSCAR screen on 2161DS
Each PowerEdge 1655MC chassis contains an internal KVM switch, so it appears
on the main 2161DS OSCAR interface as a cascaded switch. In Figure 8, ‘Edmond’
is the third blade on the fourth chassis; each node appears with this numbering
scheme.
Figure 9: ERA/MC and KVM Controller Card
Section 6
Conclusion
The Dell blade cluster solution provides four management routes: two in-band
and two out-of-band. The in-band management tools, ITA and Ganglia, provide
easy access to cluster status information, such as load, utilization, and number
of dead nodes, as well as individual machines’ hardware sensor data, including
fan speed and voltage. In-band management routes share the cluster fabric with
cluster applications. Out-of-band management routes include KVM and ERA,
and are useful to administrators when the cluster is under heavy load, since both
methods use a dedicated fabric and do not interfere with applications running on
the cluster. These four methods provided in Dell’s HPCC solutions help
to make a PowerEdge 1655MC cluster easier to manage and monitor, reduce the
possibility of failure, and lower the TCO.
References
DellTech/Support
http://delltech.us.dell.com/support/
Ganglia: Distributed Monitoring and Execution System
http://ganglia.sourceforge.net/
Serial and Remote Execution of CLI Commands for Blade Server Management
http://www.dell.com/us/en/esg/topics/power_ps3q02-suniti.htm
THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL
ERRORS AND TECHNICAL INACCURACIES. THE CONTENT IS PROVIDED AS IS, WITHOUT EXPRESS OR
IMPLIED WARRANTIES OF ANY KIND.
Dell, PowerConnect, PowerVault and PowerEdge are trademarks of Dell Computer Corporation. Other trademarks and
trade names may be used in this document to refer to either the entities claiming the marks and names or their products.
Dell disclaims proprietary interest in the marks and names of others.
©Copyright 2003 Dell Computer Corporation. All rights reserved. Reproduction in any manner whatsoever without the
express written permission of Dell Computer Corporation is strictly forbidden. For more information, contact Dell.
Information in this document is subject to change without notice.