STATE OF THE ART

Kei Davis and Fabrizio Petrini
{kei,fabrizio}@lanl.gov
PAL / CCS-3, Los Alamos National Laboratory
Europar 2004, Pisa, Italy

Section 2

Overview

• We briefly describe some state-of-the-art supercomputers
• The goal is to evaluate the degree of integration of the three main components: processing nodes, interconnection network, and system software
• Analysis limited to six supercomputers (ASCI Q and Thunder, System X, BlueGene/L, Cray XD1, and ASCI Red Storm), due to space and time limitations
ASCI Q: Los Alamos National Laboratory
ASCI Q

• Total
  – 20.48 TF/s, #3 in the Top 500
• Systems
  – 2,048 AlphaServer ES45s
  – 8,192 EV-68 1.25-GHz CPUs with 16-MB cache
• Memory
  – 22 Terabytes
• System Interconnect
  – Dual-rail Quadrics interconnect
  – 4,096 QSW PCI adapters
  – Four 1,024-way QSW federated switches
• Operational in 2002
Node: HP (Compaq) AlphaServer ES45 21264 System Architecture

[Block diagram] Four EV68 1.25-GHz CPUs (16 MB cache each) connect to the C-chip controller over 64b 500-MHz paths (4.0 GB/s each). Memory, up to 32 GB on four MMBs, is reached over 256b 125-MHz paths (4.0 GB/s). Two PCI chips drive the PCI buses: 64b 66 MHz (528 MB/s) and 64b 33 MHz (266 MB/s), serving ten PCI slots (several hot-swap, 3.3V and 5.0V I/O), PCI-USB, and legacy I/O (serial, parallel, keyboard/mouse, floppy).
QsNET: Quaternary Fat Tree

• Hardware support for collective communication
• MPI latency 4 µs, bandwidth 300 MB/s
• Barrier latency less than 10 µs
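To make the fat-tree structure concrete, here is a small sizing sketch. It models QsNet as a k-ary n-tree built from switches with 4 links down and 4 links up; the slide itself quotes no switch counts, so the numbers are illustrative only.

# Sketch: sizing a quaternary fat tree (k-ary n-tree with k = 4).
def quaternary_fat_tree(nodes, k=4):
    """Return (levels, switches per level, total switches) for a k-ary n-tree."""
    levels, capacity = 0, 1
    while capacity < nodes:
        capacity *= k
        levels += 1
    per_level = nodes // k  # constant bisection: nodes/k switches per level
    return levels, per_level, levels * per_level

print(quaternary_fat_tree(1024))  # (5, 256, 1280): a 4-ary 5-tree spans 1024 nodes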
Interconnection Network

[Diagram] A 1,024-way federated QsNet switch: sixteen 64-up/64-down (64U64D) switches at the bottom levels, the 1st serving nodes 0-63 through the 16th serving nodes 960-1023, with mid-level and super-top-level switch stages (levels 1 through 6) above them; 1,024 nodes per rail segment (2x = 2,048 nodes).
System Software

• Operating system is Tru64
• Nodes organized in clusters of 32 for resource allocation and administration purposes (TruCluster)
• Resource management executed through Ethernet (RMS)
ASCI Q: Overview

• Node Integration
  – Low (multiple boards per node, network interface on I/O bus)
• Network Integration
  – High (HW support for atomic collective primitives)
• System Software Integration
  – Medium/Low (TruCluster)
ASCI Thunder, 1,024 Nodes, 23 TF/s Peak
ASCI Thunder: Lawrence Livermore National Laboratory

• 1,024 nodes, 4,096 processors, 23 TF/s
• #2 in the Top 500
ASCI Thunder: Configuration

• 1,024 nodes, quad 1.4-GHz Itanium2, 8 GB DDR266 SDRAM (8 Terabytes total)
• 2.5 µs MPI latency and 912 MB/s bandwidth over Quadrics Elan4 (see the measurement sketch after this list)
• Barrier synchronization 6 µs, allreduce 15 µs
• 75 TB of local disk (73 GB/node, UltraSCSI320)
• Lustre file system with 6.4 GB/s delivered parallel I/O performance
• Linux RH 3.0, SLURM, Chaos
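Latency and bandwidth figures like those above come from ping-pong microbenchmarks. Below is a minimal sketch of that measurement using mpi4py; the original numbers were of course measured with native C MPI over Elan4, so this only illustrates the method.

# Sketch: ping-pong microbenchmark (run with exactly 2 ranks).
from mpi4py import MPI
import time

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
peer = 1 - rank
reps = 1000

for size in (0, 1 << 20):          # 0 B for latency, 1 MB for bandwidth
    buf = bytearray(size)
    comm.Barrier()
    t0 = time.perf_counter()
    for _ in range(reps):
        if rank == 0:
            comm.Send(buf, dest=peer); comm.Recv(buf, source=peer)
        else:
            comm.Recv(buf, source=peer); comm.Send(buf, dest=peer)
    dt = time.perf_counter() - t0
    if rank == 0:
        one_way = dt / (2 * reps)  # half the round-trip time
        if size == 0:
            print(f"latency ~ {one_way * 1e6:.2f} us")
        else:
            print(f"bandwidth ~ {size / one_way / 1e6:.0f} MB/s")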
CHAOS: Clustered High Availability Operating System

• Derived from Red Hat, but differs in the following areas:
  – Modified kernel (Lustre and HW specific)
  – New packages for cluster monitoring, system installation, power/console management
  – SLURM, an open-source resource manager
ASCI Thunder: Overview

• Node Integration
  – Medium/Low (network interface on I/O bus)
• Network Integration
  – Very High (HW support for atomic collective primitives)
• System Software Integration
  – Medium (Chaos)
System X: Virginia Tech
System X, 10.28 TF/s

• 1,100 dual Apple G5 2-GHz CPU based nodes
• 8 billion operations/second/processor (8 GFlops) peak double-precision floating-point performance
• Each node has 4 GB of main memory and 160 GB of Serial ATA storage
  – 176 TB total secondary storage
• Infiniband, 8 µs latency and 870 MB/s bandwidth; partial support for collective communication
• System-level fault tolerance (Déjà vu)
System X: Overview

• Node Integration
  – Medium/Low (network interface on I/O bus)
• Network Integration
  – Medium (limited support for atomic collective primitives)
• System Software Integration
  – Medium (system-level fault tolerance)
BlueGene/L System

• Chip (2 processors): 2.8/5.6 GF/s, 4 MB
• Compute Card (2 chips, 2x1x1): 5.6/11.2 GF/s, 0.5 GB DDR
• Node Card (32 chips, 4x4x2; 16 compute cards): 90/180 GF/s, 8 GB DDR
• Cabinet (32 node boards, 8x8x16): 2.9/5.7 TF/s, 256 GB DDR
• System (64 cabinets, 64x32x32): 180/360 TF/s, 16 TB DDR
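A quick sanity check that the per-chip figure rolls up to the system peak (my reading of the pairs: the first number counts one FPU pipe per processor, the doubled number counts both pipes of the "double FPU"):

# Sketch: rolling the per-chip peak up the packaging hierarchy.
gf_per_chip = 2.8                 # GF/s per chip (one pipe per CPU, 2 CPUs)
chips = 2 * 16 * 32 * 64          # compute card -> node card -> cabinet -> system
print(chips)                      # 65536 chips in the full machine
print(chips * gf_per_chip / 1e3)  # ~183.5 TF/s, the quoted "180 TF/s"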
BlueGene/L Compute ASIC

[Block diagram] Two PPC 440 CPU cores (one doubling as I/O processor), each with 32k/32k L1 caches and a "double FPU", share an L2, a multiported shared SRAM buffer, and a shared L3 directory for 4 MB of embedded DRAM (L3 cache or memory, with ECC); a DDR controller with ECC drives 144-bit-wide external DDR (256/512 MB). On-chip network interfaces: torus (6 out and 6 in, each at 1.4 Gbit/s per link), tree (3 out and 3 in, each at 2.8 Gbit/s per link), global interrupt (4 global barriers or interrupts), Gbit Ethernet, and JTAG access.

• IBM CU-11, 0.13 µm
• 11 x 11 mm die size
• 25 x 32 mm CBGA
• 474 pins, 328 signal
• 1.5/2.5 Volt
[Photo: BlueGene/L node card] 16 compute cards, 2 I/O cards, DC-DC converters (40V → 1.5, 2.5V)
BlueGene/L Interconnection Networks

3-Dimensional Torus
• Interconnects all compute nodes (65,536)
• Virtual cut-through hardware routing
• 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
• 350/700 GB/s bisection bandwidth (see the sketch after this list)
• Communications backbone for computations

Global Tree
• One-to-all broadcast functionality
• Reduction operations functionality
• 2.8 Gb/s of bandwidth per link
• Latency of tree traversal on the order of 5 µs
• Interconnects all compute and I/O nodes (1024)

Ethernet
• Incorporated into every node ASIC
• Active in the I/O nodes (1:64)
• All external comm. (file I/O, control, user interaction, etc.)

Low-Latency Global Barrier
• 8 single wires crossing the whole system, touching all nodes

Control Network (JTAG)
• For booting, checkpointing, error logging
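The bisection figure can be recovered from the link speed. A sketch (my arithmetic, assuming the bisection cut crosses the longest dimension of the 64x32x32 torus):

# Sketch: bisection bandwidth of the 64x32x32 torus from the link speed.
x, y, z = 64, 32, 32
link_gbit = 1.4                        # Gb/s per link, per direction
cross_links = 2 * y * z                # torus wraparound doubles the cut
one_way = cross_links * link_gbit / 8  # GB/s across the bisection
print(one_way, 2 * one_way)            # ~358 and ~717 GB/s, the quoted "350/700"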
BlueGene/L System Software Organization

• Compute nodes are dedicated to running the user application, and almost nothing else - a simple compute node kernel (CNK)
• I/O nodes run Linux and provide O/S services: file access, process launch/termination, debugging
• Service nodes perform system management services (e.g., system boot, heartbeat, error monitoring) - largely transparent to application/system software
Operating Systems

Compute nodes: CNK
• Specialized simple O/S: 5,000 lines of code, 40 KB in core
• No thread support, no virtual memory
• Protection: protect kernel from application; some net devices in userspace
• File I/O offloaded ("function shipped") to I/O nodes through kernel system calls (sketched below)
• "Boot, start app and then stay out of the way"

I/O nodes: Linux
• 2.4.19 kernel (2.6 underway) with ramdisk
• NFS/GPFS client
• CIO daemon to start/stop jobs and execute file I/O

Global O/S (CMCS, service node)
• Invisible to user programs
• Global and collective decisions
• Interfaces with external policy modules (e.g., job scheduler)
• Commercial database technology (DB2) stores static and dynamic state: partition selection, partition boot, running of jobs, system error logs, checkpoint/restart mechanism
• Scalability, robustness, security

Execution mechanisms in the core; policy decisions in the service node.
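To illustrate the "function shipping" idea, here is a toy version of a forwarded write(). Every detail (the port, the message layout, the io-node hostname) is hypothetical; the actual CNK/CIOD protocol is not described on the slide.

# Toy sketch of function shipping: the compute kernel forwards a file
# write to its I/O node and blocks for the reply.
import socket, struct

IOD_ADDR = ("io-node", 7000)          # hypothetical CIO daemon endpoint

def shipped_write(fd: int, data: bytes) -> int:
    """Forward a write() to the I/O node; return bytes written."""
    with socket.create_connection(IOD_ADDR) as s:
        # header: opcode (1 = write), file descriptor, payload length
        s.sendall(struct.pack("!III", 1, fd, len(data)) + data)
        (ret,) = struct.unpack("!i", s.recv(4))
        return ret

# On the I/O-node side, the daemon would unpack the header, perform the
# write against the real (NFS/GPFS) file system, and return the result.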
BlueGene/L: Overview

• Node Integration
  – High (processing node integrates processors and network interfaces; network interfaces directly connected to the processors)
• Network Integration
  – High (separate tree network)
• System Software Integration
  – Medium/High (compute kernels are not globally coordinated)
• #2 and #4 in the Top 500
Cray XD1
Cray XD1 System Architecture

Compute
• 12 AMD Opteron 32/64-bit x86 processors
• High-performance Linux

RapidArray Interconnect
• 12 communications processors
• 1 Tb/s switch fabric

Active Management
• Dedicated processor

Application Acceleration
• 6 co-processors
• Processors directly connected to the interconnect
Cray XD1 Processing Node

[Photo] Chassis front: six 2-way SMP blades, 4 fans, six SATA hard drives, 500 Gb/s crossbar switch. Chassis rear: 12-port inter-chassis connector, four independent PCI-X slots, and a connector to a second 500 Gb/s crossbar switch and 12-port inter-chassis connector.
Cray XD1 Compute Blade

[Photo] Two AMD Opteron 2XX processors, each with 4 DIMM sockets for DDR 400 registered ECC memory; a RapidArray communications processor; connector to the main board.
Fast Access to the Interconnect

[Diagram] Per-processor data paths (GB/s), Xeon server vs. Cray XD1:

                 Xeon Server            Cray XD1
Memory           5.3 GB/s (DDR 333)     6.4 GB/s (DDR 400)
I/O              1 GB/s (PCI-X)         -
Interconnect     0.25 GB/s (GigE)       8 GB/s (RapidArray)
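To see what the table means for a real message, here is a toy latency-plus-bandwidth transfer model. The latency numbers are my illustrative assumptions; only the bandwidths come from the slide.

# Sketch: time to move one message, t = latency + size/bandwidth.
def transfer_time_us(size_bytes, latency_us, bw_gb_s):
    return latency_us + size_bytes / (bw_gb_s * 1e9) * 1e6

for name, lat, bw in [("GigE via I/O bus", 50.0, 0.25),
                      ("XD1 RapidArray  ", 2.0, 8.0)]:
    print(name, f"{transfer_time_us(64 * 1024, lat, bw):.1f} us for 64 KB")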
Communications Optimizations

RapidArray Communications Processor
• HT/RA tunnelling with bonding
• Routing with route redundancy
• Reliable transport
• Short-message latency optimization
• DMA operations
• System-wide clock synchronization

[Diagram] The AMD Opteron 2XX processor connects to the RapidArray communications processor over 3.2 GB/s HyperTransport; the communications processor drives two 2 GB/s RapidArray links.
Active Manager System

Usability
• Single system command and control

Resiliency
• Dedicated management processors, real-time OS and communications fabric
• Proactive background diagnostics with self-healing
• Synchronized Linux kernels
Cray XD1: Overview

• Node Integration
  – High (direct access from HyperTransport to RapidArray)
• Network Integration
  – Medium/High (HW support for collective communication)
• System Software Integration
  – High (compute kernels are globally coordinated)
• Early stage
ASCI Red Storm
Red Storm Architecture

• Distributed-memory MIMD parallel supercomputer
• Fully connected 3D mesh interconnect; each compute node processor has a bidirectional connection to the primary communication network
• 108 compute node cabinets and 10,368 compute node processors (AMD Sledgehammer @ 2.0 GHz)
• ~10 TB of DDR memory @ 333 MHz
• Red/Black switching: ~1/4, ~1/2, ~1/4
• 8 service and I/O cabinets on each end (256 processors for each color)
• 240 TB of disk storage (120 TB per color)
Red Storm Architecture

• Functional hardware partitioning: service and I/O nodes, compute nodes, and RAS nodes
• Partitioned operating system (OS): Linux on service and I/O nodes, LWK (Catamount) on compute nodes, stripped-down Linux on RAS nodes
• Separate RAS and system management network (Ethernet)
• Router table-based routing in the interconnect (a routing sketch follows below)
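The route tables can be thought of as encoding dimension-ordered paths through the mesh. A sketch (an assumption on my part; the slide only says the routing is table-based):

# Sketch: dimension-ordered routing on a 27x16x24 mesh.
DIMS = (27, 16, 24)

def route(src, dst):
    """Yield hops from src to dst, resolving X, then Y, then Z (no wraparound)."""
    assert all(0 <= s < n and 0 <= t < n for s, t, n in zip(src, dst, DIMS))
    cur = list(src)
    for d in range(3):
        step = 1 if dst[d] > cur[d] else -1
        while cur[d] != dst[d]:
            cur[d] += step
            yield tuple(cur)

hops = list(route((0, 0, 0), (26, 15, 23)))
print(len(hops))  # 64 hops: corner-to-corner path length in the mesh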
Red Storm Architecture

[Diagram] Partition layout: users reach the service partition, which fronts the compute partition; file I/O nodes serve /home, and net I/O nodes provide external connectivity.
System Layout (27 x 16 x 24 mesh)

[Diagram] The normally classified and normally unclassified ends of the machine flank a block of switchable nodes; disconnect cabinets separate the sections for Red/Black switching.
Red Storm System Software

Run-Time System
• Logarithmic loader
• Fast, efficient node allocator
• Batch system – PBS
• Libraries – MPI, I/O, Math

File systems being considered include
• PVFS – interim file system
• Lustre – Pathforward support
• Panasas…

Operating Systems
• Linux on service and I/O nodes
• Sandia's LWK (Catamount) on compute nodes
• Linux on RAS nodes
ASCI Red Storm: Overview

• Node Integration
  – High (direct access from HyperTransport to network through custom network interface chip)
• Network Integration
  – Medium (no support for collective communication)
• System Software Integration
  – Medium/High (scalable resource manager, no global coordination between nodes)
• Expected to become the most powerful machine in the world (competition permitting)
Overview

                Node Integration    Network Integration    Software Integration
ASCI Q          Low                 High                   Medium/Low
ASCI Thunder    Medium/Low          Very High              Medium
System X        Medium/Low          Medium                 Medium
BlueGene/L      High                High                   Medium/High
Cray XD1        High                Medium/High            High
Red Storm       High                Medium                 Medium/High