Roadrunner: Hardware
and Software Overview
Review components that comprise the
Roadrunner supercomputer
Understand Roadrunner hardware
components
Learn about Roadrunner
system software
Dr. Andrew Komornicki
Gary Mullen-Schulz
Deb Landon
ibm.com/redbooks
Redpaper
International Technical Support Organization
Roadrunner: Hardware and Software Overview
January 2009
REDP-4477-00
Note: Before using this information and the product it supports, read the information in “Notices” on page v.
First Edition (January 2009)
This edition applies to the Roadrunner computing system.
© Copyright International Business Machines Corporation 2009. All rights reserved.
Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule
Contract with IBM Corp.
Contents
Notices
Trademarks
Preface
The team that wrote this paper
Become a published author
Comments welcome
Chapter 1. Roadrunner hardware overview
1.1 What Roadrunner is
1.1.1 A historical perspective
1.2 Roadrunner hardware components
1.2.1 TriBlade: a unique concept
1.2.2 IBM BladeCenter QS22
1.2.3 IBM BladeCenter LS21
1.3 Rack configurations
1.3.1 Compute node rack
1.3.2 Compute node and I/O rack
1.3.3 Switch and service rack
1.4 The Connected Unit
1.5 Networks
1.5.1 Networks within a Connected Unit cluster
1.5.2 Networks between Connected Unit clusters
Chapter 2. Roadrunner software overview
2.1 Roadrunner components
2.1.1 Compute node (TriBlade)
2.1.2 I/O node
2.1.3 Service node
2.1.4 Master (management) node
2.2 Cluster boot sequence
2.2.1 Boot scenarios
2.3 xCAT
2.4 How applications are written and executed
2.4.1 Application core
2.4.2 Offloading logic
Appendix A. The Cell Broadband Engine (Cell/B.E.) processor
Background
The processor elements
The Element Interconnect Bus
Memory Flow Controller
Glossary
Abbreviations and acronyms
Related publications
IBM Redbooks
Other publications
Online resources
How to get Redbooks
Notices
This information was developed for products and services offered in the U.S.A.
IBM may not offer the products, services, or features discussed in this document in other countries. Consult
your local IBM representative for information on the products and services currently available in your area. Any
reference to an IBM product, program, or service is not intended to state or imply that only that IBM product,
program, or service may be used. Any functionally equivalent product, program, or service that does not
infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to
evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this document. The
furnishing of this document does not give you any license to these patents. You can send license inquiries, in
writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.
The following paragraph does not apply to the United Kingdom or any other country where such
provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION
PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR
IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT,
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of
express or implied warranties in certain transactions, therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically made
to the information herein; these changes will be incorporated in new editions of the publication. IBM may make
improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time
without notice.
Any references in this information to non-IBM Web sites are provided for convenience only and do not in any
manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the
materials for this IBM product and use of those Web sites is at your own risk.
IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring
any obligation to you.
Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products and cannot confirm the
accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the
capabilities of non-IBM products should be addressed to the suppliers of those products.
This information contains examples of data and reports used in daily business operations. To illustrate them
as completely as possible, the examples include the names of individuals, companies, brands, and products.
All of these names are fictitious and any similarity to the names and addresses used by an actual business
enterprise is entirely coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs in
any form without payment to IBM, for the purposes of developing, using, marketing or distributing application
programs conforming to the application programming interface for the operating platform for which the sample
programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore,
cannot guarantee or imply reliability, serviceability, or function of these programs.
Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines
Corporation in the United States, other countries, or both. These and other IBM trademarked terms are
marked on their first occurrence in this information with the appropriate symbol (® or ™), indicating US
registered or common law trademarks owned by IBM at the time this information was published. Such
trademarks may also be registered or common law trademarks in other countries. A current list of IBM
trademarks is available on the Web at http://www.ibm.com/legal/copytrade.shtml
The following terms are trademarks of the International Business Machines Corporation in the United States,
other countries, or both:
AS/400®
BladeCenter®
Blue Gene/L™
Blue Gene®
Domino®
GPFS™
IBM PowerXCell™
IBM®
iSeries®
PartnerWorld®
Power Architecture®
POWER3™
POWER5™
PowerPC®
Redbooks®
Redbooks (logo)®
RS/6000®
System i®
WebSphere®
The following terms are trademarks of other companies:
AMD, AMD Opteron, HyperTransport, the AMD Arrow logo, and combinations thereof, are trademarks of
Advanced Micro Devices, Inc.
InfiniBand, and the InfiniBand design marks are trademarks and/or service marks of the InfiniBand Trade
Association.
Cell Broadband Engine and Cell/B.E. are trademarks of Sony Computer Entertainment, Inc., in the United
States, other countries, or both and are used under license therefrom.
Java, Sun, and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States,
other countries, or both.
Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States, other
countries, or both.
Intel Pentium, Intel, Pentium, Intel logo, Intel Inside logo, and Intel Centrino logo are trademarks or registered
trademarks of Intel Corporation or its subsidiaries in the United States, other countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Other company, product, or service names may be trademarks or service marks of others.
Preface
This IBM® Redpaper publication provides an overview of the hardware and software
components that constitute a Roadrunner system. This includes the actual chips, cards, and
so on that comprise a Roadrunner connected unit, as well as the peripheral systems required
to run applications. It also includes a brief description of the software used to manage and run
the system.
The team that wrote this paper
This publication was produced by a team of IBM specialists working in collaboration with the
International Technical Support Organization (ITSO), Rochester Center.
Dr. Andrew Komornicki is an accomplished computational scientist with many years of
experience. Prior to joining IBM, his career included independent research, scientific
management, government service, as well as work in the computer industry. During the
1990s, he spent two years as a rotator at the National Science Foundation as a program
director, where he co-managed the program in computational chemistry. As a computational
scientist, he also spent four years as the chair of the allocation committee at the San Diego
Supercomputer Center. He has consulted extensively in both the computer and chemical
industry. Upon his return from Washington, he spent several years at Sun™ Microsystems,
where he worked as a business development executive tasked with the development of
vertical markets in the chemistry and pharmaceutical markets. Three years ago, he joined the
Advanced Technical Support group at IBM in the role of supporting scientific computing in the
High Performance Computing (HPC) arena. His duties have included support of large scale
procurements, benchmarks, and some software contributions.
Gary Mullen-Schulz is a Consulting IT Specialist at the ITSO, Rochester Center. He leads
the team responsible for producing Roadrunner documentation, and was the primary author
of IBM System Blue Gene Solution: Application Development, SG24-7179. Gary also focuses
on Java™ and WebSphere®. He is a Sun Certified Java Programmer, Developer and
Architect, and has three issued patents.
Deb Landon is an IBM Certified Senior IT Specialist in the IBM ITSO, Rochester Center.
Debbie has been with IBM for 25 years, working first with the S/36 and then the AS/400®,
which has since evolved to the IBM System i® platform. Before joining the ITSO in November
of 2000, Debbie was a member of the PartnerWorld® for Developers iSeries® team,
supporting IBM Business Partners in the area of Domino® for iSeries.
Thanks to the following people for their contributions to this project:
Bill Brandmeyer
Mike Brutman
Chris Engel
Susan Lee
Dave Limpert
Camille Mann
Andrew Schram
IBM Rochester
Prashant Manikal
Cornell Wright
IBM Austin
Debbie Landon
Wade Wallace
International Technical Support Organization, Rochester Center
Become a published author
Join us for a two- to six-week residency program! Help write a book dealing with specific
products or solutions, while getting hands-on experience with leading-edge technologies. You
will have the opportunity to team with IBM technical professionals, Business Partners, and
Clients.
Your efforts will help increase product acceptance and customer satisfaction. As a bonus, you
will develop a network of contacts in IBM development labs, and increase your productivity
and marketability.
Learn more about the residency program, browse the residency index, and apply online at:
ibm.com/redbooks/residencies.html
Comments welcome
Your comments are important to us!
We want our papers to be as helpful as possible. Send us your comments about this paper or
other IBM Redbooks® in one of the following ways:
• Use the online Contact us review Redbooks form found at:
ibm.com/redbooks
• Send your comments in an e-mail to:
redbooks@us.ibm.com
• Mail your comments to:
IBM Corporation, International Technical Support Organization
Dept. HYTD Mail Station P099
2455 South Road
Poughkeepsie, NY 12601-5400
Chapter 1.
Roadrunner hardware overview
This chapter describes the hardware components that comprise the Roadrunner system.
Specifically, this chapter examines the various components that make up a Connected Unit
(CU) and then discusses how the CUs are tied together to create a complete Roadrunner
cluster.
Note: This IBM Redpaper publication is not intended to be a detailed analysis, but rather a
“big picture” discussion meant to acquaint the reader with the Roadrunner system.
1.1 What Roadrunner is
Roadrunner is the first general purpose computer system to reach the petaflop milestone. On
June 10, 2008, IBM announced that this supercomputer had sustained a record-breaking
petaflop, or 10^15 floating point operations per second, as measured by the Linpack
benchmark. As a result of this achievement, Roadrunner became the world’s fastest
supercomputer.
Roadrunner was designed, manufactured, and tested at the IBM facility in Rochester,
Minnesota. The actual initial petaflop run was done in Poughkeepsie, New York. Its final
destination is the Los Alamos National Laboratory (LANL) in New Mexico, which will use this
system for a variety of scientific efforts. Most notably, Roadrunner is the latest tool used by
the National Nuclear Security Administration (NNSA) to ensure the safety and reliability of the
US nuclear weapons stockpile.
This computer system has a number of unique characteristics. The most notable are its sheer
size and the fact that it is the first modern heterogeneous system of its kind. As a petascale
design, the Roadrunner system has the fewest compute nodes and the fewest cores of any of
the outstanding designs considered to date. In a nutshell, the attributes of this system can be
summarized with the following characteristics:
• Roadrunner is a cluster of clusters.
The fundamental building block of the Roadrunner system is a Connected Unit (CU). As
originally designed, Roadrunner would have 18 such connected units, of which 17 have
been delivered to LANL for the final system configuration. Roadrunner is made up of
approximately 6500 AMD™ dual-core processors coupled with 12,240 Cell Broadband
Engine™ (Cell/B.E.™) processors. The total peak (theoretical) performance of this hybrid
system is in excess of 1.3 petaflops. The memory on this system consists of a total of 98
TB equally distributed between the Opteron and the Cell/B.E. nodes.
Each CU is made up of 180 compute nodes and 12 I/O nodes. A unique aspect of the
Roadrunner design is the creation of a TriBlade as a fundamental building block for the
CU. Each TriBlade consists of an AMD Opteron™ blade and two Cell/B.E. IBM
BladeCenter® QS22 blades. The Opteron blade contains two dual-core processors, while
the Cell/B.E. blades each contain two new Cell/B.E. eDP (double precision) processors.
This architecture allows for a one-to-one mapping of Opteron cores to Cell/B.E.
processors. As discussed in 1.2.1, “TriBlade: a unique concept” on page 5, this design
architecture creates a master-subordinate relationship between the Opterons and the
Cell/B.E. processors. Each Opteron core is connected to a Cell/B.E. chip through a
dedicated PCIe link. Communication between Opteron nodes is accomplished through
an extensive InfiniBand® network.
• Fedora Linux® is the operating system of choice for this system.
System management of this cluster of clusters is accomplished with the xCAT cluster
management software tools.
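The node, processor, and memory totals quoted above can be cross-checked from the per-CU figures. The following Python sketch is illustrative only: the constant and function names are hypothetical, and the split of 16 GB of Opteron memory plus 16 GB of Cell/B.E. memory per TriBlade is the figure given in 1.2.1.

```python
# Illustrative cross-check of the system totals quoted in the text.
# All names are hypothetical; the inputs are figures from this paper.
CONNECTED_UNITS = 17            # CUs delivered to LANL
COMPUTE_NODES_PER_CU = 180      # each compute node is one TriBlade
OPTERON_CORES_PER_NODE = 2 * 2  # two dual-core Opteron processors
CELL_PER_NODE = 2 * 2           # two QS22 blades, two Cell/B.E. processors each
MEMORY_GB_PER_NODE = 16 + 16    # Opteron memory plus Cell/B.E. memory


def system_totals(connected_units: int = CONNECTED_UNITS) -> dict:
    """Return aggregate compute-node figures for the given number of CUs."""
    nodes = connected_units * COMPUTE_NODES_PER_CU
    return {
        "compute_nodes": nodes,
        "opteron_cores": nodes * OPTERON_CORES_PER_NODE,
        "cell_processors": nodes * CELL_PER_NODE,
        "memory_tb": nodes * MEMORY_GB_PER_NODE / 1000,  # decimal terabytes
    }


if __name__ == "__main__":
    for name, value in system_totals().items():
        print(f"{name}: {value}")
```

With 17 CUs, the sketch yields 3,060 compute nodes, 12,240 Cell/B.E. processors, and 97.92 TB of memory, which agrees with the 12,240 processors and the 98 TB quoted above. I/O nodes are deliberately left out of the calculation.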
It is worthwhile to note some of the physical characteristics of this system. The entire system
consists of 278 racks that occupy approximately 5000 square feet of floor space. The weight
of this system is approximately 500,000 pounds, or 250 tons. The networking required for
both the compute and management tasks consists of 55 miles of InfiniBand (IB) cables.
Lastly, even though the system consumes 2.4 MW of power, it is very energy efficient,
delivering almost 437 megaflops per watt.
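As a quick consistency check, the power and efficiency figures above should multiply out to roughly the sustained petaflop reported for the Linpack run. The short Python sketch below performs that arithmetic; the names are hypothetical, and the weight conversion assumes US short tons of 2,000 pounds.

```python
# Illustrative arithmetic on the physical figures quoted in the text.
POWER_WATTS = 2.4e6          # 2.4 MW
MEGAFLOPS_PER_WATT = 437     # quoted energy efficiency
WEIGHT_POUNDS = 500_000
POUNDS_PER_SHORT_TON = 2000  # assumption: US short tons


def implied_sustained_flops(power_watts: float, mflops_per_watt: float) -> float:
    """Floating point operations per second implied by power and efficiency."""
    return power_watts * mflops_per_watt * 1e6


if __name__ == "__main__":
    flops = implied_sustained_flops(POWER_WATTS, MEGAFLOPS_PER_WATT)
    print(f"Implied sustained rate: {flops / 1e15:.2f} petaflops")
    print(f"Weight: {WEIGHT_POUNDS / POUNDS_PER_SHORT_TON:.0f} tons")
```

The product is approximately 1.05 x 10^15 floating point operations per second, which is consistent with a sustained petaflop.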
Roadrunner holds a unique position in the history of scientific computing. It was over ten
years ago that the first teraflop (10^12 floating point operations per second) computer was built.
In 1997, a computer consisting of 7000+ Intel® Pentium® II processors sustained a teraflop
on the Linpack benchmark. Roadrunner in 2008 has demonstrated a thousandfold increase
in sustained compute performance.
Note: The name Roadrunner was chosen by Los Alamos National Laboratory and is not a
product name of the IBM Corporation. This supercomputer was designed and developed
for the Department of Energy and Los Alamos National Laboratory under the project name
Roadrunner. The project was named after the state bird of New Mexico.
1.1.1 A historical perspective
Machines of Roadrunner’s size and capability are the direct result of the scientific needs of
the weapons-physics communities. In October of 1992, the United States (US) entered a
nuclear testing moratorium that banned all nuclear testing above and below
ground. Prior to this moratorium, the US nuclear weapons stockpile was maintained through a
combination of underground nuclear testing and the development of new weapons
systems. When theory and experiment were combined, the Department of Energy could rely
on much simpler models than those needed today. Without nuclear testing, weapons
scientists must rely much more heavily on sophisticated hardware and software to simulate
the complex aging process of both weapons systems and their components.
Established in 1995, the Advanced Simulation and Computing Program (ASC) is an integral
part of the shift in emphasis by the Department of Energy's National Nuclear Security
Administration (NNSA) from test-based to simulation-based programs. Under the ASC, computer
simulation capabilities are continually developed to analyze and predict the performance,
safety, and reliability of nuclear weapons and to certify their functionality. All of this work is
integrated into the three weapons laboratories:
• Los Alamos National Laboratory (LANL)
• Lawrence Livermore National Laboratory (LLNL)
• Sandia National Laboratories (SNL)
The predecessor of the ASC was the Accelerated Strategic Computing Initiative (known as
the ASCI program), which was created in direct response to the National Defense
Authorization Act of 1994. In the absence of nuclear testing, the act required the
Department of Energy to:
• Support a focused multifaceted program to increase the understanding of the existing
nuclear stockpile.
• Predict, detect, and evaluate potential problems associated with the aging of the nuclear
stockpile.
• Maintain the science and engineering institutions needed to support the national nuclear
deterrent, now and in the future.
In response to this mandate, the ASCI program set the following objectives in order to meet
the needs and requirements of the Stockpile Stewardship program. These objectives cover
performance, safety, reliability, and renewal, and were articulated in the ASCI
program plan, published by the Department of Energy Defense Programs in January 2000:
• Create predictive simulations of nuclear weapon systems to analyze behavior and assess
performance in an environment without nuclear testing.
• Predict with high certainty the behavior of full weapon systems in complex accident
scenarios.
• Achieve sufficient, validated predictive simulations to extend the lifetime of the stockpile,
predict failure mechanisms, and reduce routine maintenance.
• Use virtual prototyping and modeling to understand how new production processes and
materials affect performance, safety, reliability, and aging. This understanding helps define
the right configuration of production and testing facilities necessary for managing the
stockpile throughout the next several decades.
Throughout the history of this program, the IBM Corporation has been a key partner of the
Department of Energy's National Nuclear Security Administration (NNSA) program. Here are
several historical examples:
• In 1998, IBM delivered the ASCI Blue Pacific system, which consisted of 5,856 PowerPC®
604e microprocessors. The theoretical peak performance of this system was 3.8 teraflops.
• In 2000, IBM delivered the ASCI White system. This computer system was based on the
IBM RS/6000® computer, which contained IBM POWER3™ nodes running at 375 MHz.
This cluster consisted of 512 nodes, each of which had 16 processors, for a total of 8,192
processors. This machine required 3 MW of power for the computer and an additional
3 MW for cooling. The theoretical peak processing power was 12.3 teraflops, and the
Linpack performance was 7.2 teraflops.
• In 2005, IBM delivered and installed the ASC Purple system at Lawrence Livermore
National Laboratory. This system was a 100 teraflop machine and was the successful
realization of a goal set a decade earlier (1996) to deliver a 100 teraflop machine within
the 2004 to 2005 time frame.
Note: At the time these goals were set, computers were still at the gigaflop level and
were still two years away from the realization of the first teraflop machine.
ASC Purple is based on the symmetric shared memory IBM POWER5™ architecture. The
combined system contains approximately 12,500 POWER5 processors and requires 7.5
MW of electrical power for both the computer and cooling equipment.
• Another machine in the ASC program is the IBM System Blue Gene/L™ machine
delivered by IBM to Lawrence Livermore National Laboratory. The Blue Gene® architecture
is unique in that it allows for a very dense packing of compute nodes. A single Blue Gene
rack contains 1024 nodes. On March 24, 2005, the US Department of Energy announced
that the Blue Gene/L installation at Lawrence Livermore National Laboratory had achieved
a speed of 135 teraflops on a system consisting of 32 racks. On October 27, 2005,
Lawrence Livermore National Laboratory and IBM announced that Blue Gene/L had
produced a Linpack benchmark result that exceeded 280 teraflops. This system consisted
of 65,536 compute nodes housed in 64 Blue Gene racks.
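The processor and node counts in these examples follow directly from the per-node and per-rack figures. The short Python sketch below is an illustrative check only, with hypothetical names; it also derives the fraction of theoretical peak that ASCI White sustained on Linpack, a ratio the text does not state explicitly.

```python
# Illustrative checks of figures quoted for earlier ASC systems.
ASCI_WHITE_NODES = 512
ASCI_WHITE_PROCESSORS_PER_NODE = 16
ASCI_WHITE_PEAK_TFLOPS = 12.3
ASCI_WHITE_LINPACK_TFLOPS = 7.2

BLUE_GENE_NODES_PER_RACK = 1024
BLUE_GENE_RACKS = 64


def linpack_efficiency(linpack_tflops: float, peak_tflops: float) -> float:
    """Fraction of theoretical peak that was sustained on Linpack."""
    return linpack_tflops / peak_tflops


if __name__ == "__main__":
    print(ASCI_WHITE_NODES * ASCI_WHITE_PROCESSORS_PER_NODE)  # ASCI White processors
    print(BLUE_GENE_RACKS * BLUE_GENE_NODES_PER_RACK)         # Blue Gene/L nodes
    ratio = linpack_efficiency(ASCI_WHITE_LINPACK_TFLOPS, ASCI_WHITE_PEAK_TFLOPS)
    print(f"ASCI White Linpack efficiency: {ratio:.1%}")
```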
As with each of the systems described above, the Roadrunner project is a partnership with
IBM. The original contract was signed in September 2006 and projected for three phases. In
phase 1, a base system was delivered consisting of Opteron nodes. A hybrid node prototype
system was projected for phase 2. The delivery of a hybrid final system, one that would
achieve a sustained petaflop in Linpack performance, was projected for phase 3.
For more information, refer to the Advanced Simulation and Computing Web site at:
http://www.sandia.gov/NNSA/ASC/about.html
1.2 Roadrunner hardware components
A simple way to describe the Roadrunner system is that it is a heterogeneous cluster of
clusters, each of which is accelerated by Cell/B.E. processors. The unique feature of this
design is that each compute node consists of node-attached Cell/B.E. processors, rather than
a simple cluster of Cell/B.E. processors. A collection of such compute and I/O nodes, all
connected through a high speed switch fabric, makes up a scalable unit known as a
Connected Unit (CU).
The fundamental building block of a CU is the compute node, and each compute node is a
TriBlade. The TriBlade is an original design concept created for the Roadrunner system and
allows for the integration of Cell/B.E. and Opteron blades. Architecturally, this design allows
for the incorporation of these TriBlades into an IBM BladeCenter chassis.
1.2.1 TriBlade: a unique concept
The TriBlade makes up what is called a hybrid compute node. The components of this node
consist of an IBM LS21 Opteron blade, two IBM BladeCenter QS22 Cell/B.E. blades, and a
fourth blade that houses the communications fabric for the compute node. This expansion
blade connects the two QS22 blades through four PCI Express x8 links to the Opteron blade
and provides each node with an InfiniBand 4x DDR cluster interconnect. Figure 1-1 shows a
schematic of a TriBlade.
Figure 1-1 TriBlade schematic
The node design of the TriBlade offers a number of important characteristics. Since each
node is accelerated by Cell/B.E. processors, by design there is one Cell/B.E. chip for each
Opteron core. The TriBlade is populated with 16 GB of Opteron memory and an equal amount
of Cell/B.E. memory. Since the new Cell/B.E. eDP processors are capable of delivering 102.4
gigaflops of peak performance, each TriBlade node is capable of approximately 400 gigaflops
of double precision compute power. For additional information about the Cell/B.E. processor,
see Appendix A, “The Cell Broadband Engine (Cell/B.E.) processor” on page 27.
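The peak figure for a TriBlade can be reproduced from the per-processor number, and scaling it to the compute nodes of all 17 CUs shows how much of the system peak comes from the Cell/B.E. processors. The Python sketch below is illustrative, with hypothetical names. It does not model the Opteron contribution because the text does not quantify it.

```python
# Illustrative peak-performance arithmetic for the Cell/B.E. processors only.
CELL_PEAK_GFLOPS = 102.4   # double precision peak per Cell/B.E. eDP processor
CELL_PER_TRIBLADE = 4      # one per Opteron core
COMPUTE_NODES = 17 * 180   # 17 CUs of 180 TriBlades each


def triblade_cell_peak_gflops() -> float:
    """Double precision peak of the four Cell/B.E. processors in one TriBlade."""
    return CELL_PER_TRIBLADE * CELL_PEAK_GFLOPS


def system_cell_peak_pflops(nodes: int = COMPUTE_NODES) -> float:
    """Aggregate Cell/B.E. peak across all compute nodes, in petaflops."""
    return nodes * triblade_cell_peak_gflops() / 1e6


if __name__ == "__main__":
    print(f"Per TriBlade: {triblade_cell_peak_gflops():.1f} gigaflops")
    print(f"All compute nodes: {system_cell_peak_pflops():.3f} petaflops")
```

The per-node result of 409.6 gigaflops matches the "approximately 400 gigaflops" quoted above, and the aggregate is about 1.25 petaflops before any Opteron contribution is counted.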
The design of the TriBlade presents the user with a very specific memory hierarchy. The
Opteron processors establish a master-subordinate relationship with the Cell/B.E.
processors. Each Opteron blade contains 4 GB of memory per core, resulting in 8 GB of
shared memory per socket. The Opteron blade thus contains 16 GB of NUMA shared
memory per node.
Each Cell/B.E. processor contains 4 GB of shared memory, resulting in 8 GB of shared
memory per blade. In total, the Cell/B.E. blades contain 16 GB of distributed memory per
TriBlade node. It is important to note that not only is there a one-to-one mapping of Opteron
cores to Cell/B.E. processors, but memory is also distributed equally among these
components: 4 GB for each Opteron core and 4 GB for each Cell/B.E. processor.
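One way to picture this memory hierarchy is as a small data model. The Python sketch below is a hypothetical illustration, not part of any Roadrunner software: the class and attribute names are invented here, and only the counts and capacities come from the text.

```python
# Hypothetical data model of a TriBlade node; names are illustrative only.
from dataclasses import dataclass, field


@dataclass(frozen=True)
class OpteronBlade:
    sockets: int = 2            # two dual-core Opteron processors
    cores_per_socket: int = 2
    gb_per_core: int = 4

    @property
    def cores(self) -> int:
        return self.sockets * self.cores_per_socket

    @property
    def memory_gb(self) -> int:
        return self.cores * self.gb_per_core


@dataclass(frozen=True)
class CellBlade:
    processors: int = 2         # two Cell/B.E. processors per QS22 blade
    gb_per_processor: int = 4

    @property
    def memory_gb(self) -> int:
        return self.processors * self.gb_per_processor


@dataclass(frozen=True)
class TriBlade:
    opteron: OpteronBlade = field(default_factory=OpteronBlade)
    cell_blades: tuple = field(default_factory=lambda: (CellBlade(), CellBlade()))

    @property
    def opteron_cores(self) -> int:
        return self.opteron.cores

    @property
    def cell_processors(self) -> int:
        return sum(blade.processors for blade in self.cell_blades)

    @property
    def opteron_memory_gb(self) -> int:
        return self.opteron.memory_gb

    @property
    def cell_memory_gb(self) -> int:
        return sum(blade.memory_gb for blade in self.cell_blades)

    def is_one_to_one(self) -> bool:
        """True when every Opteron core has exactly one Cell/B.E. processor."""
        return self.opteron_cores == self.cell_processors


if __name__ == "__main__":
    node = TriBlade()
    print(node.opteron_cores, node.cell_processors, node.is_one_to_one())
    print(node.opteron_memory_gb, node.cell_memory_gb)
```

A default `TriBlade()` reports four Opteron cores, four Cell/B.E. processors, and 16 GB on each side of the node.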
In order to sustain this compute power, the connectivity within each node consists of four PCI
Express x8 links, each capable of a 2 GB/s transfer rate with a 2 microsecond latency. The
expansion slot also contains the InfiniBand interconnect, which allows communications to the
rest of the cluster. The InfiniBand 4x DDR interconnect is rated at 2 GB/s with
a 2 microsecond latency.
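To get a feel for what these ratings mean for data movement, a simple first-order model adds the link latency to the time needed to push the payload through the link. The Python sketch below uses that model, which is an assumption made here for illustration and not a measured characterization of the hardware. The function names are hypothetical.

```python
# First-order transfer-time model: time = latency + size / bandwidth.
# This is an illustrative assumption, not a measured hardware model.
LINK_BANDWIDTH = 2e9   # bytes per second (2 GB/s per PCI Express x8 link)
LINK_LATENCY = 2e-6    # seconds (2 microseconds)
LINKS_PER_NODE = 4     # four PCI Express x8 links per TriBlade


def transfer_time(nbytes: float,
                  bandwidth: float = LINK_BANDWIDTH,
                  latency: float = LINK_LATENCY) -> float:
    """Estimated seconds to move nbytes across one link."""
    return latency + nbytes / bandwidth


def break_even_size(bandwidth: float = LINK_BANDWIDTH,
                    latency: float = LINK_LATENCY) -> float:
    """Message size in bytes at which latency equals the time on the wire."""
    return latency * bandwidth


if __name__ == "__main__":
    print(f"1 MB transfer: {transfer_time(1e6) * 1e6:.0f} microseconds")
    print(f"Break-even message size: {break_even_size():.0f} bytes")
    print(f"Aggregate PCIe bandwidth: {LINKS_PER_NODE * LINK_BANDWIDTH / 1e9:.0f} GB/s")
```

Under this model, messages much smaller than about 4 KB are dominated by latency, while larger transfers are dominated by bandwidth. The aggregate figure assumes that all four links can be driven at the same time.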
1.2.2 IBM BladeCenter QS22
The IBM BladeCenter QS22 is based on the IBM PowerXCell™ 8i processor, a new
generation processor based on the Cell/B.E. architecture. In contrast to its predecessors, the
QS20 and QS21, the QS22 is based on the second generation processor of the Cell/B.E.
architecture and offers single instruction multiple data (SIMD) vector capability along with
strong parallelization. It performs double precision floating point operations at five times the
speed of the previous generations of Cell/B.E. processors.
Due to its parallel nature and extraordinary computing speed, the QS22 is ideal for use in
scientific applications, which is why it was chosen as an integral part of the Roadrunner
system by IBM and Los Alamos. The QS22 is a single-wide blade server that offers an SMP
with shared memory and two Cell/B.E. processors in a single blade enclosure.
Figure 1-2 on page 7 provides an illustration of the IBM BladeCenter QS22. Features of the
QS22 include:
• Two 3.2 GHz IBM PowerXCell 8i processors
• Up to 32 GB of PC2-6400 800 MHz DDR2 memory
• 460 single-precision gigaflops per blade (peak)
• 217 double-precision gigaflops per blade (peak)
• Integrated dual 1 Gb Ethernet
• IBM Enhanced I/O Bridge chip
• Serial Over LAN
The QS22 is based on the 64-bit IBM PowerXCell 8i processor. This processor operates at
3.2 GHz. Each of the eight SIMD vector processors is capable of producing four floating point
results per clock period. The memory subsystem on the QS22 consists of eight DIMM slots,
enabling configurations from 4 GB up to 32 GB of ECC memory.
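The peak figures in the feature list can be roughly reproduced from the clock rate and the number of SIMD units. The following Python sketch is a back-of-the-envelope check, not an IBM-published formula. It takes the four double-precision results per clock for each of the eight SIMD units from the text above. The single-precision rate (eight per clock) and the contribution of the PowerPC core (eight single-precision and two double-precision operations per clock) are assumptions that are not stated in this paper.

```python
# Rough peak-throughput check for one QS22 blade.
CLOCK_GHZ = 3.2          # from the text
SIMD_UNITS_PER_CHIP = 8  # from the text
CHIPS_PER_BLADE = 2      # from the text


def peak_gflops(simd_ops_per_clock, core_ops_per_clock):
    """Peak gigaflops per blade for the given per-clock operation counts."""
    ops_per_clock = SIMD_UNITS_PER_CHIP * simd_ops_per_clock + core_ops_per_clock
    return CHIPS_PER_BLADE * CLOCK_GHZ * ops_per_clock


# Double precision: 4 results per clock per SIMD unit (from the text),
# plus an assumed 2 per clock from the PowerPC core.
double_precision = peak_gflops(simd_ops_per_clock=4, core_ops_per_clock=2)

# Single precision: assumed 8 per clock per SIMD unit and 8 from the core.
single_precision = peak_gflops(simd_ops_per_clock=8, core_ops_per_clock=8)

print(f"double precision: {double_precision:.1f} gigaflops per blade")
print(f"single precision: {single_precision:.1f} gigaflops per blade")
```

Under these assumptions the results come out at roughly 217.6 and 460.8 gigaflops, in line with the 217 and 460 gigaflop peak figures listed for the blade.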
Important: The implementation chosen for the Roadrunner system consists of the
standard blade populated with 16 GB of DDR2 memory. As with the Opteron blades, all of
the Cell/B.E. based blades are diskless.
For additional information about the Cell/B.E. processor, see Appendix A, “The Cell
Broadband Engine (Cell/B.E.) processor” on page 27.
Figure 1-2 IBM BladeCenter QS22
For more information about the QS22, see the IBM BladeCenter QS22 Web page at:
http://www.ibm.com/systems/bladecenter/hardware/servers/qs22/index.html
1.2.3 IBM BladeCenter LS21
The IBM BladeCenter LS21 is a single width AMD Opteron-based server. The LS21 blade
server supports up to two of the dual-core 2200 series AMD Opteron processors combined
with up to 32 GB of ECC memory and one fixed SAS HDD.
Important: The configuration used for the Roadrunner system contains two AMD Opteron
processors running at 1.8 GHz, 16 GB of ECC memory, and no hard disk. The diskless
configuration is an important implementation design, which eliminates additional moving
parts and potential points of failure for a system with so many thousands of nodes.
The memory used in the LS21 is ECC-protected DDR2. The general memory
configuration for the LS21 has to follow these guidelines:
򐂰 A total of eight DIMM slots (four per processor socket). Two of these slots (1 and 2) are
preconfigured with a pair of DIMMs.
򐂰 Because memory is 2-way interleaved, the memory modules must be installed in matched
pairs. However, one DIMM pair is not required to match the other in capacity.
򐂰 A maximum of 32 GB of installed memory is achieved when all DIMM sockets are
populated with 4 GB DIMMs.
򐂰 For each installed microprocessor, a set of four DIMM sockets is enabled.
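The guidelines above can be expressed as a small validation routine. The following Python sketch is illustrative only. The function name is hypothetical, and two details are assumptions that this paper does not state: that matched pairs occupy adjacent slots (1 and 2, 3 and 4, and so on) and that slots 1 through 4 belong to the first processor socket.

```python
# Sketch of the LS21 DIMM population rules (hypothetical helper).
ALLOWED_DIMM_GB = (0.5, 1, 2, 4)  # 512 MB, 1 GB, 2 GB, 4 GB
TOTAL_SLOTS = 8
SLOTS_PER_SOCKET = 4


def installed_memory_gb(slots, sockets=2):
    """Validate a DIMM layout and return the total installed memory in GB.

    slots   -- list of eight entries: DIMM size in GB, or None if empty
    sockets -- number of installed processors (1 or 2)
    """
    if len(slots) != TOTAL_SLOTS:
        raise ValueError("the LS21 has exactly eight DIMM slots")

    # Assumption: slots beyond sockets * 4 are disabled.
    enabled = sockets * SLOTS_PER_SOCKET
    if any(size is not None for size in slots[enabled:]):
        raise ValueError("DIMM installed in a slot that is not enabled")

    total = 0
    # Assumption: pairs are adjacent slots (1+2, 3+4, 5+6, 7+8).
    for first in range(0, TOTAL_SLOTS, 2):
        a, b = slots[first], slots[first + 1]
        if a != b:
            raise ValueError("DIMMs must be installed in matched pairs")
        if a is None:
            continue
        if a not in ALLOWED_DIMM_GB:
            raise ValueError(f"unsupported DIMM size: {a} GB")
        total += 2 * a
    return total


# Fully populated with 4 GB DIMMs: the 32 GB maximum.
print(installed_memory_gb([4] * 8))
# Pairs may differ from each other in capacity.
print(installed_memory_gb([2, 2, 1, 1, None, None, None, None]))
```

A layout such as `[2, 1, None, None, None, None, None, None]` is rejected because the first pair is not matched.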
The processors used in these blades are standard low-power processors. The standard AMD
Opteron processors draw a maximum of 95 W. Specially manufactured low-power processors
operate at 68 W or less without any performance trade-offs. These power savings at the
processor level, combined with the smarter power solution that IBM BladeCenter delivers,
make these blades very attractive for installations that are limited by power and cooling
resources.
This blade is designed with power management capability to provide the maximum uptime
possible. In extended thermal conditions, rather than shut down completely or fail, the LS21
automatically reduces the processor frequency to maintain acceptable thermal levels.
A standard LS21 blade server offers these features:
򐂰 Up to two high-performance, AMD Dual-Core Opteron processors.
򐂰 A system board containing eight DIMM connectors, supporting 512 MB, 1 GB, 2 GB, or 4
GB DIMMs.
򐂰 Up to 32 GB of system memory is supported with 4 GB DIMMs.
򐂰 A SAS controller, supporting one internal SAS drive (36 or 73 GB) and up to three
additional SAS drives with optional SIO blade.
򐂰 Two TCP/IP Offload Engine enabled Gigabit Ethernet controllers (Broadcom 5706S) as
standard, with load balancing and failover features.
򐂰 Support for concurrent KVM (cKVM) and concurrent USB/DVD (cMedia) through
Advanced Management Module and an optional daughter card.
򐂰 Support for a Storage and I/O Expansion (SIO) unit.
Dual Gigabit Ethernet controllers are standard, providing high-speed data transfers and
offering TCP/IP Offload Engine support, load-balancing, and failover capabilities. The version
used for Roadrunner uses optional InfiniBand expansion cards, allowing high speed
communication between nodes. The InfiniBand fabric installed with Roadrunner provides
4x DDR connections that have a theoretical peak of 2 GB per second.
Finally, the LS21 supports both the Windows® and Linux operating systems. The Roadrunner
implementation uses the Fedora version of Linux.
Figure 1-3 on page 9 shows a schematic of the planar of an LS21.
Figure 1-3 LS21 planar
For more information about the LS21, see the IBM BladeCenter LS21 Web page at:
http://www.ibm.com/systems/bladecenter/hardware/servers/ls21/features.html
1.3 Rack configurations
TriBlades are combined into racks to create assemblies of hybrid compute nodes. In addition,
some racks contain other components for other required functionality. There are three
different rack types:
򐂰 Compute node rack
򐂰 Compute node and I/O rack
򐂰 Switch and service rack
In general, these racks look very similar. Each can hold a maximum of 12 TriBlades and some
hold additional components.
1.3.1 Compute node rack
A compute node rack holds a total of 12 TriBlades, which means it holds 12 LS21s and 24
QS22s. A compute node rack looks similar to the picture shown in Figure 1-4.
Figure 1-4 Compute node rack
1.3.2 Compute node and I/O rack
A compute node and I/O rack contains 12 TriBlades, but also contains an IBM System x3655
(x3655) at the bottom of the rack. The x3655 performs input/output (I/O) services on behalf of
the system. A compute and I/O node rack looks similar to the picture shown in Figure 1-5 on
page 11.
The x3655 is a new rack-optimized server based on the AMD Opteron dual-core processor.
The x3655 supports four processor sockets and 32 memory DIMM slots. The memory is 667
MHz DDR2, in sizes ranging from 512 MB to 4 GB per DIMM. This gives a total capacity of up
to 128 GB of main system memory.
Note: The x3655 used in the Roadrunner system supports 16 GB or 32 GB of memory.
Figure 1-5 Compute and I/O node rack
1.3.3 Switch and service rack
The switch and service rack contains no TriBlades. This rack contains a Voltaire Grid Director
ISR 9288 switch that is used to manage InfiniBand networking traffic. This is known in
Roadrunner as a first-stage switch. See “First-stage InfiniBand switch” on page 14 for more
information about its role and function.
You can learn more about the Voltaire switch technology on the Voltaire Web page at:
http://www.voltaire.com/Products/Grid_Backbone_Switches/Voltaire_Grid_Director_ISR_9288
In addition, this rack contains an IBM System x3655, which serves as the service node for the
CU. The functions that the service node performs include the following:
򐂰 Holds the boot images used to IPL the Opteron and Cell/B.E. blades, as well as the I/O
nodes.
򐂰 IPLs all elements in the CU when instructed to do so by the central management node.
A switch and service rack looks similar to the picture shown in Figure 1-6.
Figure 1-6 Switch and service rack
1.4 The Connected Unit
The Connected Unit (CU) is a core concept in the Roadrunner system. Groups of the various
rack configurations discussed in 1.3, “Rack configurations” on page 9 are put together to
create a single CU. Table 1-1 lists the racks that comprise a single CU.
Table 1-1 Racks making up a Connected Unit
Rack type                    Racks in the Connected Unit    TriBlades in a rack    Total TriBlades
Compute node rack            3                              12                     36
Compute node and I/O rack    12                             12                     144
Switch and service rack      1                              0                      0
Total                        16                             N/A                    180
A CU can be thought of as a base cluster unit. The racks that make up a CU are connected to
each other through first-stage switches. CUs are then tied together through second-stage
switches to create a larger grid.
The size of a CU is largely determined by the capabilities of the first-stage switch. There are
180 TriBlades in a CU. This number of TriBlades means that a Connected Unit contains 180
AMD Opteron LS21s and 360 IBM BladeCenter QS22s. See Figure 1-7 on page 13.
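The counts above follow directly from the rack configuration. The following Python sketch reproduces the arithmetic from Table 1-1 and extends it to the 17 CUs of the full system. The variable names are illustrative only.

```python
# Rack mix of one Connected Unit: (number of racks, TriBlades per rack).
RACKS_PER_CU = {
    "compute node rack": (3, 12),
    "compute node and I/O rack": (12, 12),
    "switch and service rack": (1, 0),
}
CUS_IN_SYSTEM = 17

racks_per_cu = sum(count for count, _ in RACKS_PER_CU.values())
triblades_per_cu = sum(count * blades for count, blades in RACKS_PER_CU.values())

# Each TriBlade is one LS21 plus two QS22s.
ls21_per_cu = triblades_per_cu
qs22_per_cu = 2 * triblades_per_cu

print(f"racks per CU:      {racks_per_cu}")
print(f"TriBlades per CU:  {triblades_per_cu}")
print(f"LS21s per CU:      {ls21_per_cu}")
print(f"QS22s per CU:      {qs22_per_cu}")
print(f"TriBlades overall: {CUS_IN_SYSTEM * triblades_per_cu}")
```

With 17 CUs, the same arithmetic gives 3,060 TriBlades for the whole system.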
Figure 1-7 Racks comprising a Connected Unit
Note: As previously discussed in this chapter, the entire Roadrunner system or cluster
consists of a total of 17 CUs.
1.5 Networks
Given the high number of racks and nodes in the Roadrunner system, it should come as no
surprise that there are several different networks used to tie the system together. This section
provides an overview of the different networks involved as well as their functional purpose.
1.5.1 Networks within a Connected Unit cluster
First-stage switches are used to connect all the racks making up a Connected Unit (CU)
together and to allow the CU to communicate with the outside world (for example, a file
system) and other CUs. The second-stage switches primarily serve as a hub to tie the 17 CUs
together into a common computational system.
First-stage InfiniBand switch
As discussed in 1.3.3, “Switch and service rack” on page 11, each CU contains a rack with a
Voltaire Grid Director ISR 9288 switch. This switch allows for 288 different InfiniBand inputs,
which are used as shown in Table 1-2.
Table 1-2 Connections in and out from a first-stage switch
Component                                  Connections    Purpose
TriBlade InfiniBand links                  180            Connect the AMD Opteron nodes together to allow them to participate in a network.
InfiniBand links to second-stage switch    96             Allow the CUs to be tied together into a single network.
InfiniBand links to I/O nodes              12             Provide the hybrid compute nodes access to the file system for application input and output.
Total                                      288
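The port budget of the first-stage switch can be checked with simple arithmetic. The following Python sketch assumes one InfiniBand link for each of the twelve I/O nodes in a CU, as described for the file system LAN later in this section. The even split of uplinks across the eight second-stage switches is also an assumption made for illustration.

```python
# Port budget of one 288-port first-stage switch.
SWITCH_PORTS = 288
SECOND_STAGE_SWITCHES = 8

links = {
    "TriBlades": 180,             # one link per TriBlade
    "second-stage switches": 96,  # uplinks that tie the CUs together
    "I/O nodes": 12,              # assumed: one link per I/O node
}

ports_used = sum(links.values())
ports_free = SWITCH_PORTS - ports_used

# Assumption: uplinks are spread evenly over the second-stage switches.
uplinks_per_switch = links["second-stage switches"] // SECOND_STAGE_SWITCHES

print(f"ports used: {ports_used} of {SWITCH_PORTS}")
print(f"ports free: {ports_free}")
print(f"uplinks to each second-stage switch: {uplinks_per_switch}")
```

Under these assumptions every port on the switch is accounted for.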
InfiniBand Connected Unit
This network creates a “fat tree” that allows the AMD Opterons to communicate with each
other using the industry-standard Message Passing Interface (MPI). It is built on top of the
switched InfiniBand network. A “fat tree” is a special topology invented by Charles E.
Leiserson of MIT. Unlike a traditional binary tree, a fat tree has “thicker branches” the closer
you get to the tree’s root. In this way, you do not end up with a communications bottleneck at
the root of the tree.
Figure 1-8 shows a traditional binary tree. Note that as messages flow up the tree, the single
links to the root node can become a point of congestion.
Figure 1-8 Traditional binary tree
Figure 1-9 on page 15, on the other hand, shows a fat tree. Notice how the number of links
between nodes increases as you get closer to the tree’s root. The number of links shown is
just one example of a fat tree configuration; the actual number may be higher or lower
between any two nodes depending on the given requirements.
Figure 1-9 Fat tree
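The difference between the two topologies can be made concrete by counting links at each level of the tree. The following Python sketch models an idealized binary fat tree in which the number of links from a node to its parent doubles at every level toward the root. This doubling rule is an assumption chosen for illustration. As noted above, real configurations may use more or fewer links.

```python
def links_crossing_each_level(depth, fat):
    """Total links between each level and the level above it.

    depth -- number of levels below the root (the tree has 2**depth leaves)
    fat   -- True for the idealized fat tree, False for a plain binary tree
    Index 0 of the result is the leaf level; the last entry is the
    boundary just below the root.
    """
    totals = []
    for level in range(depth):
        nodes_at_level = 2 ** (depth - level)
        # Plain tree: one link per node. Fat tree: doubles per level.
        links_per_node = 2 ** level if fat else 1
        totals.append(nodes_at_level * links_per_node)
    return totals


print("binary tree:", links_crossing_each_level(3, fat=False))
print("fat tree:   ", links_crossing_each_level(3, fat=True))
```

For eight leaves, the plain binary tree narrows from eight links at the leaves to two at the root, while the idealized fat tree keeps eight links at every level, which is why the root does not become a bottleneck.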
Fat tree topologies are becoming quite popular in InfiniBand clusters. For more information
about fat trees and their usage with InfiniBand, see the article Performance Modeling of
Subnet Management on Fat Tree InfiniBand Networks using OpenSM, which is available at
the following Web site:
http://nowlab.cse.ohio-state.edu/publications/conf-papers/2005/vishnu-fastos05.pdf
10 Gigabit Ethernet file system LAN
Every CU has twelve I/O nodes, each of which has a single InfiniBand connection to the CU's
InfiniBand switch. This allows the hybrid compute nodes (TriBlades) to retrieve and pass data
to the I/O nodes over the InfiniBand network. The file system is connected through the I/O
nodes, each of which has two 10 Gb Ethernet links to the file system LAN.
Gigabit Ethernet Control VLAN (CVLAN)
The Gigabit Ethernet control VLAN is used to perform vital program and node control functions
within each CU, such as passing the Message Passing Interface (MPI) information required for
program operation and communication.
Gigabit Ethernet Management VLAN (MVLAN)
The Gigabit Ethernet management VLAN is used to perform vital system management
functions within each CU, such as passing the required operating system boot images from
the CU's service node to the processors on the hybrid compute nodes and I/O nodes in order
to IPL them.
Important: This VLAN is used exclusively for control traffic; no user data flows across this
network.
PCI Express link between LS21 and Cell/B.E. blades
Each AMD Opteron has a one-to-one “master-subordinate” relationship with a Cell/B.E.
processor. Although the Opterons participate in MPI communications with other Opteron
nodes and access the file system through the I/O nodes, the Cell/B.E. processors only
communicate with their “master” Opteron.
The link between the AMD Opteron and its associated Cell/B.E. processor is through a direct,
point-to-point PCI-Express connection. As discussed in 1.2.3, “IBM BladeCenter LS21” on
page 7, each LS21 has an expansion card installed, a Broadcom HT-2100. This expansion
card allows for PCI-Express (PCIe) communications to a Cell/B.E. processor.
The Cell/B.E. blades have PCIe functionality built into them directly, so no extra expansion
card is needed. Low-level device drivers have been written to enable communications across
the PCIe link. Higher-level APIs, such as Data Communication and Synchronization (DaCS)
and the Accelerated Library Framework (ALF), will flow across this PCIe connection to enable
Opteron-to-Cell/B.E. communications.
For more information about MPI, DaCS, and ALF, see Chapter 2, “Roadrunner software
overview” on page 19.
1.5.2 Networks between Connected Unit clusters
As discussed previously, there are several networks in place within a Connected Unit to
provide for cluster management, MPI communications, file I/O, and Opteron-to-Cell/B.E.
communications. This section discusses the connectivity between CUs that serves to create a
grid, which from an application perspective appears as a single computational unit.
Second-stage InfiniBand switches
These InfiniBand switches serve as a way to interconnect the 17 CUs together to form a
single system image. Like the first-stage switches, these are Voltaire Grid Director ISR 9288s.
In this case, there are eight. Strictly speaking, only six units are actually mandatory; two are
there for expansion and redundancy.
Figure 1-10 shows the role of the second-stage switches for Roadrunner. All connections to,
from, and between switches are InfiniBand optical links.
Figure 1-10 Role of second stage InfiniBand switches in Roadrunner
Gigabit Ethernet management VLAN (MVLAN)
The Gigabit Ethernet management VLAN is the grid-wide system management network. It is
used for booting, system control, and status determination operations between the
management nodes and the various managed elements throughout the cluster. The MVLAN
does not have direct network access to the “internals” of a CU (for example, the hybrid
compute nodes and I/O nodes). Management operations to those nodes occur from the
MVLAN to the CU's MVLAN through the service node to the desired target.
No user or application data flows across the MVLAN. Only system
management and control traffic flows across it.
Chapter 2. Roadrunner software overview
This chapter briefly describes the software used to run applications on the Roadrunner
system.
Note: This IBM Redpaper publication is not intended to be a detailed analysis, but rather a
“big picture” discussion meant to acquaint the reader with the Roadrunner system.
2.1 Roadrunner components
This section provides a brief explanation of the software that runs on the various
components that comprise a Roadrunner system.
2.1.1 Compute node (TriBlade)
As described in 1.2.1, “TriBlade: a unique concept” on page 5, a TriBlade is made up of one
IBM BladeCenter LS21 blade and two IBM BladeCenter QS22 blades. Each of these runs its
own operating system image, but “shares” a common user application.
Note: From an IBM BladeCenter Advanced Management Module (AMM) perspective, the
TriBlade still appears as separate blades. In other words, it appears as one LS21 and two
QS22s. The logical grouping of the LS21 and QS22s is handled through the xCAT
management tools. See 2.3, “xCAT” on page 23 for more information.
The following is the software that runs on the various components of the TriBlade:
򐂰 AMD Opteron LS21 for IBM BladeCenter
Each LS21 is standard except for the fact that it is diskless. The operating system is
Fedora Linux. Since it is diskless, it is booted up from its Connected Unit’s service node.
򐂰 IBM BladeCenter QS22
Each QS22 is standard except for the fact that it is diskless. The operating system is
Fedora Linux. Since it is diskless, it is booted up from its Connected Unit’s service node.
򐂰 Broadcom HT-2100 (PCIe adapter)
The dual Opteron host blade (LS21) is connected to the two QS22s through a PCI
Express (PCIe) interconnect. Two HyperTransport™ x16 connections from the LS21 blade
drive an expansion card containing two Broadcom HT-2100 HyperTransport to PCI
Express bridge chips. Each Broadcom HT-2100 drives two PCI Express x8 connections to
the two Axon Southbridge chips on one of the Cell Broadband Engine (Cell/B.E.) blades
(QS22). This provides a dedicated PCIe x8 connection to each Cell/B.E. processor.
The PCIe interconnect is supported by a low-level device driver that provides direct
memory access (DMA) and a remote memory mapped small message area (SMA). DMA
operations can be started by calls to the device driver from programs on either the LS21 or
the QS22. The device driver initiates the DMA operation using a DMA controller in the
Axon Southbridge. The small message area provides regions of memory that can be
accessed remotely by user space instructions without a context switch to the kernel or
device driver interaction. There is a unique device driver instance on both the Opteron and
the Cell/B.E. blade for each Axon Southbridge. A virtual Ethernet driver (also replicated
per Axon) supports point-to-point communications between the Opteron and each
Cell/B.E. processor.
2.1.2 I/O node
As mentioned previously in 1.3.2, “Compute node and I/O rack” on page 10, each I/O node is
an IBM System x3655 server. I/O nodes are diskless and serve as “pipes” to the external file
system across the 10 Gigabit Ethernet file system LAN.
Each I/O node runs Fedora Linux as its operating system. Since the node is diskless, it is
booted up from its Connected Unit’s service node. The I/O node will run either the IBM
GPFS™ or Panasas PanFS client to communicate with the external file system, depending on
what file system software is running there.
2.1.3 Service node
Service nodes are standard IBM System x3655 Opteron-based servers and are diskless.
There is one dedicated service node per Connected Unit, so this image can be updated
directly from the master node over the management network (MVLAN) described in “Gigabit
Ethernet management VLAN (MVLAN)” on page 17.
Service nodes obtain copies of the boot images for the I/O nodes and compute nodes from
the master node. These images are refreshed on an as needed basis. The images are loaded
over the CVLAN (see “Gigabit Ethernet Control VLAN (CVLAN)” on page 15).
2.1.4 Master (management) node
The master node is a standard IBM System x3655 Opteron-based server and is booted from
the local disk. The master node runs Fedora Linux.
Note: There is only one master node for the entire Roadrunner cluster.
2.2 Cluster boot sequence
The initial booting of the nodes is complicated by two factors in the Roadrunner system:
򐂰 All of the nodes except for the master node are diskless, so they must boot over the
network.
򐂰 There are over 3,000 total nodes and 10,000 operating system images that need to be
installed and booted.
There will be times when the entire system needs to be booted, and there will be times when
only parts of the system need to be booted (while the rest of the system is still available but
powered off). This places two distinct demands on the management network:
򐂰 It must be able to boot the entire system without causing timeouts on the management
network such that no boot progress is being made.
򐂰 It must be able to boot substantial portions of the system without interfering with any
status and control operations that are occurring on the running portion of the system.
Since the majority of nodes are diskless, a scalable way to move the boot images to each of
the nodes is required. To this end, a hierarchy of management nodes has been created.
The solution to this concern is to use the Bootstrap Protocol (BOOTP) together with Trivial
File Transfer Protocol (TFTP) subnet multicast to boot the diskless LS21 Opteron and QS22
Cell/B.E. blades. This method provides a broadcast of the common boot image that the
LS21s and QS22s can pick up midstream. The multicast repeats until all requesting blades
have received all packets of the boot image. There are unique boot images for the various
configurations. The boot images are stored on the Connected Unit service nodes and
multicast over the CVLAN. This method significantly reduces network traffic compared to
sending individual boot images to each processor.
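The traffic savings of a repeating multicast can be illustrated with a small simulation. The following Python sketch is a toy model, not BOOTP or TFTP code. The function name, the packet counts, and the join times are all hypothetical. It cycles through the packets of one boot image and lets each blade start collecting packets from the moment it joins, which mirrors the midstream pickup described above.

```python
def multicast_packets_sent(image_packets, join_times):
    """Simulate a repeating multicast of one boot image.

    image_packets -- number of packets in the boot image
    join_times    -- tick at which each blade starts listening
    Returns the number of packets sent before every blade holds
    the complete image.
    """
    received = [set() for _ in join_times]
    tick = 0
    while any(len(packets) < image_packets for packets in received):
        packet = tick % image_packets  # the stream wraps around and repeats
        for blade, start in enumerate(join_times):
            if tick >= start:          # the blade is already listening
                received[blade].add(packet)
        tick += 1
    return tick


image_packets = 100
join_times = [0, 30, 75]  # three blades joining at different points

multicast_total = multicast_packets_sent(image_packets, join_times)
unicast_total = image_packets * len(join_times)

print(f"multicast: {multicast_total} packets")
print(f"unicast:   {unicast_total} packets")
```

In this example the last blade joins at tick 75 and needs one full pass of the image, so the multicast ends after 175 packets, compared with 300 packets for three separate transfers. The gap widens as the number of blades grows, because the multicast cost depends on the latest join time rather than on the blade count.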
2.2.1 Boot scenarios
This section describes in more detail what happens when a cluster (or parts of the cluster)
are booted up.
Master (management) node (tier 1)
This node is installed and booted with the required management node image. The
management node boots from the local disk.
Service nodes (tier 2)
There is only one service node per Connected Unit, so this image can be updated directly
from the master node over the MVLAN at any time (not just at service node bring-up). Once
booted, service nodes obtain copies of the boot images for the I/O nodes and compute nodes
from the master node. These images are refreshed on an as-needed basis. The images are
loaded over the CVLAN through the multicast boot process, which allows for far less network
traffic and parallel image download.
I/O nodes
Once successfully booted, the service nodes begin transferring the required boot images
down the CVLAN. The I/O nodes are standard Opteron Linux servers and are booted diskless
with the required image. I/O nodes are connected over 10 Gb Ethernet to the Global File System (GFS) to
service the compute nodes’ file access requests. The image required to boot the I/O node is
received from its local service node through the CVLAN network.
Compute nodes (TriBlades)
Compute nodes (TriBlades) are either accelerated or non-accelerated, with the difference
being that accelerated nodes will have their associated Cell/B.E. blades powered on and
booted, while Cell/B.E. blades on the non-accelerated nodes are left powered off.
Note: There is no “low power” mode for the Cell/B.E. blades, so some sort of “standby”
mode is not possible. They are either on (accelerated) or off (non-accelerated).
There is no need for a “heartbeat” function between the Opteron core and its associated Cell
Broadband Engine processor. The general health of both resources is known by the xCAT
software and reflected in the resource manager. Communication health status between the
two resources is monitored and understood “on demand” by the application running on the
Opteron side. The Data Communication and Synchronization (DaCS) API is notified of errors
from the Cell/B.E. processor concerning any data transfer or communications request.
Failures of these transactions are reported by the software structures. If the PCI Express
connection between the Opteron and Cell/B.E. processor fails, an appropriate error event is
posted and the application is terminated.
Given the PCI Express interface between the Opteron and Cell/B.E. processor, it is necessary
to boot the Cell/B.E. processor portions of a compute node (in the accelerated node pool)
before the Opteron portion. This allows the proper initialization of the interconnect firmware
and PCI Express device drivers. The Cell/B.E. PCI Express device drivers “listen” for the
necessary firmware/driver handshakes from the LS21 and Broadcom HT-2100 (PCIe adapter)
expansion card to establish communication. The process of ensuring the correct booting
sequence is controlled by the xCAT software.
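The ordering rule described above can be summarized in a few lines. The following Python sketch is not xCAT code. The function and blade names are hypothetical and only illustrate the rule that, on an accelerated node, the Cell/B.E. blades are booted before the Opteron blade, while on a non-accelerated node they stay powered off.

```python
def triblade_boot_order(accelerated):
    """Return the order in which the blades of one TriBlade are booted.

    accelerated -- True if the node's Cell/B.E. blades are to be used
    """
    if accelerated:
        # Cell/B.E. blades first, so that their PCI Express drivers are
        # listening when the Opteron blade initializes the interconnect.
        return ["QS22-1", "QS22-2", "LS21"]
    # Non-accelerated: the Cell/B.E. blades are left powered off.
    return ["LS21"]


print(triblade_boot_order(accelerated=True))
print(triblade_boot_order(accelerated=False))
```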
2.3 xCAT
Installing and managing a cluster is a complicated task, and doing everything manually
makes it harder still. xCAT grew out of the desire to automate many of the repetitive
steps involved in installing and configuring a Linux cluster.
The development of xCAT is driven by customer requirements. Because xCAT itself is written
entirely using scripting languages such as Korn shell, Perl, and Expect, an administrator can
easily modify the scripts should the need arise.
The main functions of xCAT are grouped as follows:
򐂰 Automated installation
򐂰 Hardware management and monitoring
򐂰 Software administration
򐂰 Remote console support for text and graphics
For more information about xCAT, refer to the xCAT Web site at:
http://xcat.sourceforge.net
2.4 How applications are written and executed
This section discusses how applications are written and executed on the Roadrunner system.
The unique architecture employed means that applications are designed and written in a
revolutionary new manner compared to previous parallel processing applications.
2.4.1 Application core
The bulk of the user application, including initiation and termination, runs on the AMD
Opteron processor (LS21). It uses Message Passing Interface (MPI) APIs to communicate
with the other Opteron processors the application is running on in a typical single program,
multiple data (SPMD) fashion. The number of compute nodes used to run the application is
determined at program launch.
The MPI implementation of Roadrunner is based on the open-source Open MPI Project and
therefore is standard MPI. In this regard, Roadrunner applications are similar to other typical
MPI applications (such as those that run on the IBM Blue Gene solution). Where Roadrunner
differs in the sphere of application architecture is how its Cell/B.E. “accelerators” are
employed. At any point in the application flow, the MPI application running on each Opteron
can offload computationally-complex logic to its “subordinate” Cell/B.E. processor.
For more information about Open MPI Project, refer to the Open MPI: Open Source High
Performance Computing Web site at:
http://www.open-mpi.org/
2.4.2 Offloading logic
Determining which logic routines get offloaded to the Cell/B.E. processor, and when that
occurs, is one of the most challenging tasks facing an application developer of the
Roadrunner system. But it is this very challenge that makes the opportunity for incredibly high
application performance possible.
There are two primary techniques that a developer can employ to actually perform
asynchronous offloads of logic. This section briefly describes each, and points to areas where
you can find more detailed information.
DaCS
The Data Communication and Synchronization (DaCS) library provides a set of services that
ease the development of applications and application frameworks in a heterogeneous
multi-tiered system (for example, a 64-bit x86 system (x86_64) and one or more Cell/B.E.
processor systems). The DaCS services are implemented as a set of APIs providing an
architecturally neutral layer for application developers on a variety of multi-core systems. One
of the key abstractions that further differentiates DaCS from other programming frameworks
is a hierarchical topology of processing elements, each referred to as a DaCS Element (DE).
Within the hierarchy, each DE can serve one or both of the following roles:
򐂰 A general purpose processing element, acting as a supervisor, control, or master
processor. This type of element usually runs a full operating system and manages jobs
running on other DEs. This is referred to as a Host Element (HE).
򐂰 A general or special purpose processing element running tasks assigned by an HE. This
is referred to as an Accelerator Element (AE).
DaCS for Hybrid (DaCSH) is an implementation of the DaCS API specification that supports
the connection of an HE on an x86_64 system to one or more AEs on Cell/B.E. processors. In
SDK 3.0, DaCSH only supports the use of sockets to connect the HE with the AEs. Direct
access to the Synergistic Processor Elements (SPEs) on the Cell/B.E. processor is not
provided. Instead, DaCSH provides access to the PowerPC Processor Element (PPE),
allowing a PPE program to be started and stopped and allowing data transfer between the
x86_64 system and the PPE. The SPEs can only be used by the program running on the
PPE.
For more information about DaCS, see IBM Software Development Kit for Multicore
Acceleration Data Communication and Synchronization Library for Hybrid-x86 Programmer's
Guide and API Reference, SC33-8408.
ALF
The Accelerated Library Framework (ALF) provides a programming environment for data and
task parallel applications and libraries. The ALF API provides you with a set of interfaces to
simplify library development on heterogeneous multi-core systems. You can use the provided
framework to offload the computationally intensive work to the accelerators. More complex
applications can be developed by combining the several function offload libraries. You can
also choose to implement applications directly to the ALF interface.
ALF supports the multiple-program-multiple-data (MPMD) programming model where
multiple programs can be scheduled to run on multiple accelerator elements at the same
time.
The ALF functionality includes:
򐂰 Data transfer management
򐂰 Parallel task management
򐂰 Double buffering
򐂰 Dynamic load balancing for data parallel tasks
With the provided API, you can also create descriptions for multiple compute tasks and define
their execution orders by defining task dependency. Task parallelism is accomplished by
having tasks without direct or indirect dependencies between them. The ALF run time
provides an optimal parallel scheduling scheme for the tasks based on given dependencies.
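The idea of deriving parallelism from task dependencies can be sketched without the ALF API itself. The following Python example is a stand-in that uses the standard library's graphlib module, not ALF. The function and task names are hypothetical. It groups tasks into waves, where every task in a wave has no unfinished dependencies and could therefore run at the same time as the others in that wave.

```python
from graphlib import TopologicalSorter


def schedule_in_waves(dependencies):
    """Group tasks into waves that respect the given dependencies.

    dependencies -- dict mapping each task to the set of tasks that
                    must finish before it can start
    Returns a list of waves; tasks within one wave are independent
    of each other and can run in parallel.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all tasks that are runnable now
        waves.append(ready)
        sorter.done(*ready)                 # mark the whole wave as finished
    return waves


# Hypothetical task graph: two transforms depend on a load step,
# and a final step combines their results.
tasks = {
    "load": set(),
    "transform_x": {"load"},
    "transform_y": {"load"},
    "combine": {"transform_x", "transform_y"},
}

for number, wave in enumerate(schedule_in_waves(tasks), start=1):
    print(f"wave {number}: {wave}")
```

Here the two transform tasks have no direct or indirect dependency on each other, so they land in the same wave. A real run time can schedule more aggressively than whole waves, but the dependency rule is the same.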
For more information about ALF, see IBM Software Development Kit for Multicore
Acceleration Accelerated Library Framework for Hybrid-x86 Programmer's Guide and API
Reference, SC33-8406.
Appendix A. The Cell Broadband Engine (Cell/B.E.) processor
Of all of the components that make up the Roadrunner cluster, the Cell/B.E. processor holds
a special place in that it provides extraordinary compute power that can be harnessed from a
single multi-core chip. This appendix provides a brief architectural overview of the current
Cell/B.E. processor, the motivation for some of its features, as well as the general properties
of this unique processor.
Note: Be aware that ample and extensive resources exist on the Cell/B.E. processor, the
Cell/B.E. architecture, as well as tutorials for the interested programmer. It is not the
intention of this publication to reproduce all of this information in this short section. We
have utilized these extensive resources in our attempt to provide this summary.
For additional information about the Cell/B.E. processor, refer to the following resources:
򐂰 Programming the Cell Broadband Engine Architecture: Examples and Best Practices,
SG24-7575
򐂰 IBM Software Development Kit for Multicore Acceleration Data Communication and
Synchronization Library for Cell/B.E. Programmer's Guide and API Reference, SC33-8407
򐂰 IBM Software Development Kit for Multicore Acceleration Accelerated Library Framework
for Cell/B.E. Programmer's Guide and API Reference, SC33-8333
򐂰 The Cell/B.E. project at IBM Research, found at:
http://www.research.ibm.com/cell/
򐂰 The Cell/B.E. resource center, found at:
http://www.ibm.com/developerworks/power/cell/
Background
The Cell/B.E. architecture is designed to support a very broad range of applications. The first
implementation is a single-chip multiprocessor with nine processor elements operating on a
shared memory model, as shown in Figure A-1. In this respect, the Cell/B.E. processor
extends current trends in PC and server processors. The most distinguishing feature of the
Cell/B.E. processor is that, although all processor elements can share or access all available
memory, their function is specialized into two types: the Power Processor Element (PPE) and
the Synergistic Processor Element (SPE). The Cell/B.E. processor has one PPE and eight
SPEs.
The architectural definition of the physical Cell/B.E. architecture-compliant processor is much
more general than the initial implementation. A Cell/B.E. architecture-compliant processor
can consist of a single chip, a multi-chip module (or modules), or multiple single-chip modules
on a system board or other second-level package. The design depends on the technology
used and performance characteristics of the intended design.
Logically, the Cell/B.E. architecture defines four separate types of functional components:
• PowerPC Processor Element (PPE)
• Synergistic Processor Unit (SPU)
• Memory Flow Controller (MFC)
• Internal Interrupt Controller (IIC)
The computational units in the Cell/B.E. architecture-compliant processor are the PPEs and
the SPUs. Each SPU must have a dedicated local storage, a dedicated MFC with its
associated memory management unit (MMU), and a replacement management table (RMT).
The combination of these components is called a Synergistic Processor Element (SPE).
Figure A-1 Cell/B.E. schematic
The first type of processor element, the PPE, contains a 64-bit PowerPC architecture core. It
complies with the 64-bit PowerPC architecture and can run 32-bit and 64-bit applications. The
second type of processor element, the SPE, is designed to run computationally intensive
single-instruction multiple-data (SIMD)/vector applications. It is not intended to run a
full-featured operating system. The SPEs are independent processor elements, each running
their own individual application programs or threads. Each SPE has full access to shared
memory, including the memory-mapped I/O space implemented by multiple DMA units. There
is a mutual dependence between the PPE and the SPEs. The SPEs depend on the PPE to
run the operating system and, in many cases, the top-level thread control for a user code. The
PPE depends on the SPEs to provide the bulk of compute power.
The SPEs are designed to be programmed in high level languages. They support a rich
instruction set that includes extensive SIMD functionality. However, like conventional
processors with SIMD extensions, use of SIMD data is preferred but not mandatory. For
programming convenience, the PPE also supports the standard PowerPC architecture
instruction set and the SIMD/vector multimedia extensions. To an application programmer, the
Cell/B.E. processor looks like a single-core, dual-threaded processor with eight additional
cores, each with its own local store. The PPE is more adept than the SPEs at
control-intensive tasks and quicker at task switching. The SPEs are more adept at
compute-intensive tasks and slower than the PPE at task switching. Either processor element is
capable of both types of functions. This specialization is a significant factor in accounting for
the order-of-magnitude improvement in peak computational performance and power
efficiency that the Cell/B.E. processor achieves over conventional processors.
The more significant difference between the SPE and PPE lies in how they access memory.
The PPE accesses memory with load and store instructions that move data between main
storage and a set of registers, the contents of which may be cached. PPE memory access is
like that of a conventional processor. The SPEs in contrast access main storage with direct
memory access (DMA) commands that move data and instructions between main storage
and a private local memory, called a local store (LS). An SPE's instruction fetches and
load/store instructions access a private local store rather than the shared main memory.
This three-level organization of storage (registers, LS, and main memory), with asynchronous
DMA transfers between LS and main memory, is a radical break from conventional
architecture and programming models. It explicitly parallels computation with the transfer of
data and instructions that feed computation and stores the results of computation in main
memory.
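The overlap of computation with data transfer described above is commonly programmed as double buffering. The following is a minimal sketch of that pattern in Python, not SPE code: a single background worker stands in for the asynchronous DMA engine, a Python list stands in for main memory, and the names fetch_block and process_double_buffered are hypothetical. In CPython the thread gives little real speedup; the point is the order of operations.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_block(main_memory, index, block_size):
    """Stand-in for an asynchronous DMA 'get' from main memory into a local buffer."""
    start = index * block_size
    return main_memory[start:start + block_size]


def process_double_buffered(main_memory, block_size, compute):
    """Compute on block i while the fetch of block i+1 is already in flight."""
    n_blocks = (len(main_memory) + block_size - 1) // block_size
    results = []
    if n_blocks == 0:
        return results
    # One worker models one transfer engine running alongside the compute loop.
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(fetch_block, main_memory, 0, block_size)
        for i in range(n_blocks):
            current = pending.result()        # wait for the in-flight transfer
            if i + 1 < n_blocks:              # issue the next transfer early
                pending = dma.submit(fetch_block, main_memory, i + 1, block_size)
            results.append(compute(current))  # runs while the next fetch proceeds
    return results


totals = process_double_buffered(list(range(10)), 4, sum)
```

Only two buffers are live at any time (the block being computed on and the block being fetched), which is what makes the pattern fit in a small local store.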
A primary motivation for this new memory model is the realization that over the past
twenty-five years, memory latency, as measured in processor cycles, has increased by almost three
orders of magnitude. The result is that application performance is, in most cases, limited by
memory latency rather than peak compute capability, as measured by processor clock
speeds. When a sequential program performs a load instruction that encounters a cache
miss, program execution comes to a halt for several hundred cycles (techniques such as
hardware threading attempt to hide these stalls, but they do not help single-threaded
applications). Compared to this penalty, the few cycles that it takes to set up a DMA transfer
for an SPE are a much better trade-off, especially considering that each of the eight
SPEs' DMA controllers can maintain up to 16 DMA transfers in flight simultaneously.
Anticipating DMA needs efficiently can provide “just in time delivery” of data, which may
reduce this stall or eliminate it entirely. Conventional processors, even with deep and costly
speculation, manage to get, at best, a handful of independent memory accesses in flight.
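The trade-off described above can be made concrete with a small cycle-count model. This is a sketch under stated assumptions, not a measurement: the latency, compute, and setup figures passed in below are hypothetical values chosen only to match the text's "several hundred cycles" and "few cycles", and the function names are illustrative. The only figures taken from the text are the eight SPEs and the 16 transfers in flight per SPE.

```python
def blocking_cycles(n_blocks, fetch_latency, compute_cycles):
    """Each block waits out the full memory latency before computing."""
    return n_blocks * (fetch_latency + compute_cycles)


def overlapped_cycles(n_blocks, fetch_latency, compute_cycles, setup_cycles):
    """Double-buffered: the fetch of block i+1 is issued before computing block i."""
    if n_blocks == 0:
        return 0
    first_fetch = setup_cycles + fetch_latency
    # Each later step pays the DMA setup, then waits for whichever is slower:
    # computing on the current block or the in-flight fetch of the next one.
    steady_step = setup_cycles + max(fetch_latency, compute_cycles)
    last_compute = compute_cycles
    return first_fetch + (n_blocks - 1) * steady_step + last_compute


# Hypothetical workload: 100 blocks, 400-cycle latency, 500 compute cycles per
# block, 10 cycles to set up each DMA transfer.
stalled = blocking_cycles(100, 400, 500)
hidden = overlapped_cycles(100, 400, 500, 10)

# From the text: eight SPEs, each able to keep up to 16 DMA transfers in flight.
chip_transfers_in_flight = 8 * 16
```

With these assumed numbers the memory latency disappears almost entirely behind computation; when the fetch latency exceeds the compute time per block, the model shows the remaining stall instead.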
One of the SPE's DMA transfer methods supports a list (such as a scatter gather list) of DMA
transfers that is constructed in an SPE's local store, so that the SPE's DMA controller can
process the list asynchronously while the SPE operates on previously transferred data. In
several cases, this approach of accessing memory has improved application performance by
almost two orders of magnitude when compared to the performance of conventional
processors. This is significantly more than one would expect from the peak performance ratio
(approximately 10x) between the Cell/B.E. processor and conventional PC processors.
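The idea of a list-driven (scatter-gather) transfer can be illustrated with a toy model. This sketch is not the MFC programming interface: main memory is a Python bytes object, a list element is simply an (address, size) pair, and the function name gather is hypothetical. The 256 KB limit is the local store size given in this appendix.

```python
LOCAL_STORE_BYTES = 256 * 1024  # local store size per SPE, from the text


def gather(main_memory, dma_list):
    """Copy each (address, size) region of main memory, in order, into one local buffer."""
    total = sum(size for _, size in dma_list)
    if total > LOCAL_STORE_BYTES:
        raise ValueError("gathered data does not fit in the local store")
    local = bytearray()
    for address, size in dma_list:
        local += main_memory[address:address + size]
    return bytes(local)


# Toy main memory of 1024 bytes and a three-element list of scattered regions.
main_memory = bytes(range(256)) * 4
dma_list = [(0, 4), (100, 2), (250, 3)]
gathered = gather(main_memory, dma_list)
```

On the real hardware the list is processed asynchronously by the DMA controller; here the loop runs synchronously, so the sketch shows only the data layout, not the overlap.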
The processor elements
The general Cell/B.E. architecture-compliant processor may contain one or more PPEs, while
the current implementation consists of only one. The PPE contains a 64-bit, dual threaded
PowerPC RISC core and supports a PowerPC virtual memory subsystem. The current
PowerPC PPE runs at 3.2 GHz. It has 32 KB level-1 (L1) instruction and data caches and a
512 KB level-2 (L2) unified (instruction and data) cache. It is intended primarily for control
processing, running an operating system, managing system resources, and managing SPE
threads. It can run existing PowerPC architecture software and is well suited to executing
system control code. The instruction set for the PPE is an extended version of the PowerPC
instruction set. It includes the vector/SIMD multimedia extensions.
Each of the eight Synergistic Processor Elements (SPEs) contains a 3.2 GHz Synergistic
Processor Unit (SPU) vector processor plus the 256 KB of local store that is directly
addressable. Computationally, each of these SPEs is capable of producing four floating point
results per clock period. Simple arithmetic shows that all eight of these SPEs have a peak
compute power of 102.4 gigaflops.
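The "simple arithmetic" behind the peak figure can be written out as follows. All three inputs come from the paragraph above; only the variable names are introduced here.

```python
SPE_COUNT = 8           # SPEs per Cell/B.E. processor
CLOCK_HZ = 3.2e9        # SPU clock rate, 3.2 GHz
RESULTS_PER_CYCLE = 4   # floating point results per SPE per clock period

peak_per_spe_gflops = CLOCK_HZ * RESULTS_PER_CYCLE / 1e9
peak_chip_gflops = SPE_COUNT * peak_per_spe_gflops
```

Each SPE therefore contributes 12.8 gigaflops, and the eight SPEs together give the 102.4 gigaflops quoted in the text. The PPE is not included in this count.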
The eight identical SPEs are single-instruction multiple-data (SIMD) processor elements that
are intended for computationally intensive operations allocated to them by the PPE. Each
SPE contains a RISC core, 256 KB software controlled local store for instructions and data,
and a set of 128 registers, each of which is 128 bits wide. The SPEs support a special SIMD
instruction set and a unique set of commands for managing DMA transfers and
inter-processor messaging and control.
SPE DMA transfers access main memory using PowerPC effective addresses. As in the PPE,
SPE address translation is governed by PowerPC architecture segment and page tables,
which are loaded into the SPEs by privileged software running on the PPE. The SPEs are not
intended to run an operating system.
An SPE controls DMA transfers and communicates with the system by means of channels
that are implemented in and managed by the SPE's Memory Flow Controller (MFC). The
channels are unidirectional message passing interfaces. The PPE and other devices on the
system, including other SPEs, can also access this MFC state through the MFC's
memory-mapped I/O (MMIO) registers and queues, which are visible to software in the main
memory address space.
The Element Interconnect Bus
The SPEs, PPE, the Memory Interface Controller (MIC) and broadband interface, and the
connection to other Cell/B.E. processors within an SMP are interconnected through a
high-speed Element Interconnect Bus (EIB). The EIB is the communication path for commands
and data between all processor elements on the Cell/B.E. processor and the on-chip
controllers for memory and I/O. The EIB supports fully memory-coherent symmetric
multiprocessor (SMP) operations. A Cell/B.E. architecture processor is designed to be
combined coherently with other Cell/B.E. architecture processors to produce a cluster. The
Cell/B.E. blade is one such example where two Cell/B.E. processors are combined in a
shared memory environment to produce an SMP.
The EIB consists of four 16 byte wide data rings, two in each direction, and a central arbiter.
In the absence of path contention, each ring can perform three concurrent data transfers.
Each ring transfers 128 bytes (one PPE cache line) at a time. Processor elements can drive
and receive data simultaneously. The SPEs, PPE, and MIC each have 25.6 GBps links to and
from the EIB. In aggregate, the EIB is capable of 204.8 GBps transfers. Figure A-1
shows each of these elements and the order in which the elements are connected to
the EIB. The connection order is important to programmers seeking to minimize the latency of
transfers on the EIB, where latency is a function of the number of connection hops. Transfers
between adjacent elements have the shortest latencies, while transfers between elements
separated by multiple hops have the longest latencies.
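Because the data rings run in both directions, the hop count between two elements can be modeled as the shorter way around a ring. The following sketch assumes a purely hypothetical connection order; the real order is the one shown in Figure A-1 and is not reproduced here. The function name ring_hops is also illustrative.

```python
def ring_hops(order, src, dst):
    """Number of hops between two elements, taking the shorter direction around the ring."""
    n = len(order)
    distance = abs(order.index(src) - order.index(dst))
    return min(distance, n - distance)


# Hypothetical connection order, for illustration only.
RING_ORDER = ["MIC", "PPE", "SPE0", "SPE1", "SPE2", "SPE3",
              "SPE4", "SPE5", "SPE6", "SPE7", "IO"]

adjacent = ring_hops(RING_ORDER, "PPE", "SPE0")
far_apart = ring_hops(RING_ORDER, "PPE", "SPE7")
wrap_around = ring_hops(RING_ORDER, "MIC", "IO")
```

A programmer who wants to keep latency low would place communicating tasks on elements with a small hop count, for example neighbouring SPEs in the actual connection order.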
The EIB's internal maximum bandwidth is 96 bytes per processor clock cycle. Multiple
transfers can be in process concurrently on each ring, including more than 100 outstanding
DMA memory transfer requests between main storage and the SPEs in either direction.
These requests also may include SPE memory to and from the I/O space. The EIB does not
support any particular quality of service (QoS) behavior other than to guarantee forward
progress. However, a resource allocation management (RAM) facility resides in the EIB.
Privileged software can use it to regulate the rate at which resource requesters (the PPE,
SPEs, and I/O devices) can use memory and I/O resources.
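The bandwidth figures quoted for the EIB can be cross-checked with a few lines of arithmetic. The ring count, ring width, concurrent transfers per ring, and processor clock come from the text. The clock ratio is an assumption that this paper does not state: the sketch assumes the EIB runs at half the processor clock, which is what makes the numbers agree.

```python
RINGS = 4                    # data rings, two in each direction
RING_WIDTH_BYTES = 16        # each ring is 16 bytes wide
TRANSFERS_PER_RING = 3       # concurrent transfers per ring without contention
PROCESSOR_CLOCK_HZ = 3.2e9   # 3.2 GHz
EIB_CLOCK_DIVISOR = 2        # ASSUMPTION: EIB clock is half the processor clock

# Bytes moved per EIB cycle when every ring carries its maximum of transfers.
bytes_per_bus_cycle = RINGS * RING_WIDTH_BYTES * TRANSFERS_PER_RING

# The same quantity expressed per processor clock cycle.
bytes_per_processor_cycle = bytes_per_bus_cycle / EIB_CLOCK_DIVISOR

# One 16-byte link in one direction, in gigabytes per second.
link_gbytes_per_s = RING_WIDTH_BYTES * PROCESSOR_CLOCK_HZ / EIB_CLOCK_DIVISOR / 1e9
```

Under that assumption the model reproduces both the 96 bytes per processor clock cycle and the 25.6 GBps per link given in the text.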
Memory Flow Controller
The Memory Flow Controller (MFC) is the data transfer engine. It provides the primary
method for data transfer, protection, and synchronization between main storage and the
associated local storage, or between the associated local storage and another local storage.
An MFC command describes the transfer to be performed. A principal architectural objective
of the MFC is to perform these data transfer operations in as fast and as fair a manner as
possible, thereby maximizing the overall throughput of the processor.
Commands that transfer data are called MFC DMA commands. These commands are
converted into DMA transfers between the local storage domain and main storage domain.
Each MFC can typically support multiple DMA transfers at the same time and can maintain
and process multiple MFC commands. To accomplish this, the MFC maintains and processes
queues of MFC commands. Each MFC provides one queue for the associated SPU (MFC
SPU command queue) and one queue for other processors and devices (MFC proxy
command queue). Logically, a set of MFC queues is always associated with each SPU in a
Cell/B.E. architecture-compliant processor.
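The two command queues of an MFC can be pictured with a small model. This is a toy sketch, not the hardware interface: the class and method names are hypothetical, the queue depths are parameters whose defaults are assumptions (the text only says that an SPE's DMA controller can keep up to 16 transfers in flight), and the alternating service order is one simple way to illustrate the "fair" objective, not the documented arbitration policy.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class MfcCommand:
    kind: str               # "get" (main storage to LS) or "put" (LS to main storage)
    local_address: int
    effective_address: int
    size: int


class MemoryFlowController:
    """Toy model: one queue for the associated SPU, one proxy queue for everyone else."""

    def __init__(self, spu_depth=16, proxy_depth=8):  # depths are assumptions
        self.depths = {"spu": spu_depth, "proxy": proxy_depth}
        self.queues = {"spu": deque(), "proxy": deque()}
        self._turn = "spu"

    def enqueue(self, queue_name, command):
        """Return False instead of queuing when the chosen queue is full."""
        queue = self.queues[queue_name]
        if len(queue) >= self.depths[queue_name]:
            return False
        queue.append(command)
        return True

    def process_one(self):
        """Serve the queues alternately; fall back to the other queue if one is empty."""
        first = self._turn
        second = "proxy" if first == "spu" else "spu"
        for name in (first, second):
            if self.queues[name]:
                self._turn = second if name == first else first
                return name, self.queues[name].popleft()
        return None


mfc = MemoryFlowController(spu_depth=2, proxy_depth=1)
accepted = [
    mfc.enqueue("spu", MfcCommand("get", 0x0000, 0x10000, 128)),
    mfc.enqueue("spu", MfcCommand("put", 0x0080, 0x20000, 128)),
    mfc.enqueue("spu", MfcCommand("get", 0x0100, 0x30000, 128)),  # queue is full
    mfc.enqueue("proxy", MfcCommand("get", 0x0180, 0x40000, 64)),
]
served = [mfc.process_one() for _ in range(4)]
```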
The on-chip memory interface controller (MIC) provides the interface between the EIB and
physical memory. The IBM BladeCenter QS22 uses normal DDR memory and additional
hardware logic to implement the MIC. Memory accesses on each interface are 1 to 8, 16, 32,
64, or 128 bytes, with coherent memory ordering. Up to 64 reads and 64 writes can be
queued. The resource allocation token manager provides feedback about queue levels. The
MIC has multiple software controlled modes, including fast path mode (for improved latency
when command queues are empty), high priority read (for prioritizing SPE reads in front of all
other reads), early read (for starting a read before a previous write completes), speculative
read, and slow mode (for power management). The MIC implements a closed page controller
(bank rows are closed after being read, written, or refreshed), memory initialization, and
memory scrubbing.
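The access sizes listed above can be captured in a short check. The sketch reads "1 to 8, 16, 32, 64, or 128 bytes" as any size from 1 through 8 bytes, or exactly 16, 32, 64, or 128 bytes; that reading, and the names used, are assumptions made for illustration.

```python
# Sizes accepted on a memory interface, per the reading described above.
VALID_ACCESS_SIZES = frozenset(range(1, 9)) | {16, 32, 64, 128}


def is_valid_access_size(size_in_bytes):
    """True if a single memory access of this size matches the listed sizes."""
    return size_in_bytes in VALID_ACCESS_SIZES


checks = {size: is_valid_access_size(size) for size in (1, 8, 9, 16, 48, 128, 256)}
```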
Glossary
Accelerator General or special purpose processing
element in a hybrid system. An accelerator might have a
multi-level architecture with both host elements and
accelerator elements. An accelerator, as defined here, is
a hierarchy with potentially multiple layers of hosts and
accelerators. An accelerator element is always associated
with one host. Aside from its direct host, an accelerator
cannot communicate with other processing elements in
the system. The memory subsystem of the accelerator
can be viewed as distinct and independent from a host.
This is referred to as the subordinate in a cluster
collective.
de_id A unique number assigned by the DaCS
application at run time to a physical processing element in
a topology.
group A group construct specifies a collection
of DEs and processes in a system.
EMC
Electromagnetic compatibility.
All-reduce operation Output from multiple accelerators
is reduced and combined into one output.
ESD
Electrostatic discharge.
API Application Programming Interface. An application
programming interface defines the syntax and semantics
for invoking services from within an executing application.
All APIs are targeted to be available to both FORTRAN
and C programs, although implementation issues (such
as whether the FORTRAN routines are simply wrappers
for calling C routines) are up to the supplier.
ASCI The name commonly used for the Advanced
Simulation and Computing program administered by the
Department of Energy (DOE)/National Nuclear Security
Administration (NNSA).
ASIC
Application Specific Integrated Circuit.
B/U
Bring up.
CEC
Central electronic complex.
cluster
A collection of nodes.
compute kernel Part of the accelerator code that does
stateless computation tasks on one piece of input data
and generates the corresponding output results.
compute task An accelerator execution image that
consists of a compute kernel linked with the accelerated
library framework accelerator runtime library.
DaCS element A general or special purpose processing
element in a topology. This refers specifically to the
physical unit in the topology. A DaCS element can serve
as a host or an accelerator.
DDR Double Data Rate. DDR is a technique for
doubling the switching rate of a circuit by triggering both
the rising edge and falling edge of a clock signal.
DE
See DaCS element.
EDRAM Enhanced dynamic random access memory is
dynamic random access memory that includes a small
amount of static RAM (SRAM) inside a larger amount of
DRAM. Performance is enhanced by making sure that
many of the memory accesses will be to the faster SRAM.
ETH
Ethernet, as in adapter or interface.
FLOP Floating Point OPeration. A measure of
computation speed frequently used with
supercomputers.
FLOP/s
FLOPs per second.
FPU
Floating point unit.
FRU
Field replaceable unit.
GFLOP GigaFLOP. A gigaFLOP/s is a billion (10^9 =
1,000,000,000) floating point operations per second.
handle A handle is an abstraction of a data object,
usually a pointer to a structure.
HBCT
Hardware-based cycle time.
host A general purpose processing element in a hybrid
system. A host can have multiple accelerators attached to
it. This is often referred to as the master node in a cluster
collective.
hybrid A 64-bit x86 system using a Cell Broadband
Engine (Cell/B.E.) architecture as an accelerator.
I/O I/O (input/output) describes any operation, program,
or device that transfers data to or from a computer.
I/O node The I/O nodes (ION) are responsible, in part,
for providing I/O services to compute nodes.
Job A job is a cluster-wide abstraction similar to a
POSIX session, with certain characteristics and attributes.
Commands are targeted to be available to manipulate a
job as a single entity (including kill, modify, query
characteristics, and query state).
LANL
Los Alamos National Laboratory.
LINPACK LINPACK is a collection of FORTRAN
subroutines that analyze and solve linear equations and
linear least-squares problems.
main thread The main thread of the application. In
many cases, Cell/B.E. architecture programs are
multi-threaded using multiple SPEs running concurrently.
A typical scenario is that the application consists of a main
thread that creates as many SPE threads as needed and
the application organizes them.
MFLOP MegaFLOP/s. A megaFLOP/s is a million (10^6
= 1,000,000) floating point operations per second.
MPI
Message passing interface.
MPICH2 MPICH is an implementation of the MPI
standard available from Argonne National Laboratory.
node A node is a functional unit in the system topology,
consisting of one host together with all the accelerators
connected as children in the topology (this includes any
children of accelerators).
parent The parent of a DE is the DE that resides
immediately above it in the topology tree.
PPE Power Processor Element: 64-bit Power
Architecture® unit within the CBE that is optimized for
running operating systems and applications. The PPE
depends on the SPEs to provide the bulk of the application
performance.
PPE PowerPC Processor Element. The
general-purpose processor in the Cell/B.E. processor.
process A process is a standard UNIX®-type process
with a separate address space.
RAS
Reliability, availability, and serviceability.
service node The service node is responsible, in part,
for management and control of Roadrunner.
SIMD Single Instruction Multiple Data. Processing in
which a single instruction operates on multiple data
elements that make up a vector data type. Also known as
vector processing. This style of programming implements
data-level parallelism.
SN
See service node.
SPE Synergistic Processor Element. Extends the
PowerPC 64 architecture by acting as cooperative offload
processors (synergistic processors), with the direct
memory access (DMA) and synchronization mechanisms
to communicate with them (memory flow control), and with
enhancements for real-time management. There are eight
SPEs on each Cell/B.E. processor.
SPE Synergistic Processor Element. Eight of these exist
within the Cell/B.E. processor, optimized for running
compute-intensive applications, and they are not
optimized for running an operating system. The SPEs are
independent processors, each running its own individual
application programs.
SPMD Single Program Multiple Data. A common style of
parallel computing. All processes use the same program,
but each has its own data.
SPU Synergistic Processor Unit. The part of an SPE
that executes instructions from its local store (LS).
SWL
Synthetic workload.
thread A sequence of instructions executed within the
global context (shared memory space and other global
resources) of a process that has created (spawned) the
thread. Multiple threads (including multiple instances of
the same sequence of instructions) can run
simultaneously if each thread has its own architectural
state (registers, program counter, flags, and other
program-visible state). Each SPE can support only a
single thread at any one time. Multiple SPEs can
simultaneously support multiple threads. The PPE
supports two threads at any one time, without the need for
software to create the threads. It does this by duplicating
the architectural state. A thread is typically created by the
pthreads library.
topology A topology is a configuration of DaCS
elements in a system. The topology specifies how the
different processing elements in a system are related to
each other. DaCS assumes a tree topology: Each DE has
at most one parent.
Tri-Lab The Tri-Lab includes Los Alamos National
Laboratory, Lawrence Livermore National Laboratory, and
Sandia National Laboratories.
VPD
Vital product data.
work block A basic unit of data to be managed by the
framework. It consists of one piece of the partitioned data,
the corresponding output buffer, and related parameters.
A work block is associated with a task. A task can have as
many work blocks as necessary.
work queue An internal data structure of the
accelerated library framework that holds the lists of work
blocks to be processed by the active instances of the
compute task.
Abbreviations and acronyms
ALF        Accelerated Library Framework
AMM        Advanced Management Module
ASIC       Application-Specific Integrated Circuit
BMC        Baseboard Management Controller
Cell/B.E.  Cell Broadband Engine processor
CU         Connected Unit
DaCS       Data Communication and Synchronization
DMA        Direct Memory Access
FW         Firmware
IBM        International Business Machines Corporation
IPMI       Intelligent Platform Management Interface
ITSO       International Technical Support Organization
LANL       Los Alamos National Laboratory
MPI        Message Passing Interface
PBS        Portable Batch System
PCIe       PCI-Express
PPE        Power Processing Element
SLOF       Slim Line Open Firmware
SMA        Small Message Area
SOL        Serial over LAN
SPEs       Synergistic Processing Elements
VNFS       Virtual Node File System
VPD        Vital Product Data
Related publications
The publications listed in this section are considered particularly suitable for a more detailed
discussion of the topics covered in this paper.
IBM Redbooks
For information about ordering these publications, see “How to get Redbooks” on page 38.
Note that some of the documents referenced here may be available in softcopy only.
• Building a Linux HPC Cluster with xCAT, SG24-6623
• IBM BladeCenter Products and Technology, SG24-7523
• Programming the Cell Broadband Engine Architecture: Examples and Best Practices,
  SG24-7575
Other publications
These publications are also relevant as further information sources:
• IBM Software Development Kit for Multicore Acceleration Accelerated Library Framework
  for Cell/B.E. Programmer's Guide and API Reference, SC33-8333-02
• IBM Software Development Kit for Multicore Acceleration Accelerated Library Framework
  for Hybrid-x86 Programmer's Guide and API Reference, SC33-8406-00
• IBM Software Development Kit for Multicore Acceleration Data Communication and
  Synchronization Library for Cell/B.E. Programmer's Guide and API Reference,
  SC33-8407-00
• IBM Software Development Kit for Multicore Acceleration Data Communication and
  Synchronization Library for Hybrid-x86 Programmer's Guide and API Reference,
  SC33-8408-00
• Software Development Kit for Multicore Acceleration Version 3.0 Programmer's Guide
  Version 3.0, SC33-8325
• Software Development Kit for Multicore Acceleration Version 3.0 Programming Tutorial,
  SC33-8410
• Performance Modeling of Subnet Management on Fat Tree InfiniBand Networks using
  OpenSM, found at:
  http://nowlab.cse.ohio-state.edu/publications/conf-papers/2005/vishnu-fastos05.pdf
Online resources
These Web sites are also relevant as further information sources:
• Advanced Simulation and Computing
  http://www.sandia.gov/NNSA/ASC/about.html
• The Cell Broadband Engine (Cell/B.E.) project at IBM Research
  http://www.research.ibm.com/cell/
• Cell/B.E. resource center
  http://www.ibm.com/developerworks/power/cell/
• IBM BladeCenter QS22
  http://www.ibm.com/systems/bladecenter/hardware/servers/qs22/index.html
• IBM BladeCenter LS21
  http://www.ibm.com/systems/bladecenter/hardware/servers/ls21/features.html
• Open MPI: Open Source High Performance Computing
  http://www.open-mpi.org/
• xCAT
  http://xcat.sourceforge.net
How to get Redbooks
You can search for, view, or download Redbooks, Redpapers, Technotes, draft publications
and Additional materials, as well as order hardcopy Redbooks, at this Web site:
ibm.com/redbooks
Back cover
This IBM Redpaper publication provides an overview of the
hardware and software components that constitute a Roadrunner
system. This includes the actual chips, cards, and so on that
comprise a Roadrunner connected unit, as well as the peripheral
systems required to run applications. It also includes a brief
description of the software used to manage and run the system.
IBM Redbooks are developed by the IBM International Technical Support Organization.
Experts from IBM, Customers and Partners from around the world create timely technical
information based on realistic scenarios. Specific recommendations are provided to help
you implement IT solutions more effectively in your environment.
For more information:
ibm.com/redbooks
REDP-4477-00