Cray-X1

advertisement
The Cray X1
Architecture
Steve Scott
Cray X1 Chief Architect
Cray Proprietary (expired)
Cray’s Computing Vision
Scalable High-Bandwidth Computing
2010
‘Cascade’
‘Black Widow’
‘Black Widow 2’
2006
Sustained
Petaflops
X1E
2004
Product Integration
X1
2006
‘Strider 3’
Red Storm
Cray X1 Overview
2004
2005
RS
‘Strider 2’
Cray Proprietary (expired)
‘Strider X’
Slide 2
Cray X1
Cray PVP
T3E
• Powerful vector processors
• Very high memory bandwidth
• Non-unit stride computation
• Special ISA features
• Extreme scalability
• Optimized communication
• Memory hierarchy
• Synchronization features
• Modernized the ISA
• Improved via vectors
Extreme scalability with high bandwidth vector processors
Cray X1 Overview
Cray Proprietary (expired)
Slide 3
Cray X1 Instruction Set Architecture
New ISA
–
–
–
–
Much larger register set (32x64 vector, 64+64 scalar)
All operations performed under mask
64- and 32-bit memory and IEEE arithmetic
Integrated synchronization features
Advantages of a vector ISA
– Compiler provides useful dependence information to hardware
– Very high single processor performance with low complexity
ops/sec = (cycles/sec) * (instrs/cycle) * (ops/instr)
– Localized computation on processor chip
– large register state with very regular access patterns
– registers and functional units grouped into local clusters (pipes)
 excellent fit with future IC technology
– Latency tolerance and pipelining to memory/network
 very well suited for scalable systems
– Easy extensibility to next generation implementation
Cray X1 Overview
Cray Proprietary (expired)
Slide 4
Not Your Father’s Vector Machine
• New instruction set
• New system architecture
• New processor microarchitecture
• “Classic” vector machines were programmed
differently
– Classic vector: Optimize for loop length with little regard for locality
– Scalable micro: Optimize for locality with little regard for loop length
• The Cray X1 is programmed like a parallel microbased machine
– Rewards locality: register, cache, local memory, remote memory
– Decoupled microarchitecture performs well on short loop nests
– (however, does require vectorizable code)
Cray X1 Overview
Cray Proprietary (expired)
Slide 5
Cray X1 Node
P
P
P
P
P
P
P
P
P
P
P
P
P
P
P
P
$
$
$
$
$
$
$
$
$
$
$
$
$
$
$
$
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
M
mem
mem
mem
mem
mem
mem
mem
mem
mem
mem
mem
mem
mem
mem
mem
mem
IO
IO
51 Gflops, 200 GB/s
• Four multistream processors (MSPs), each 12.8 Gflops
• High bandwidth local shared memory (128 Direct Rambus channels)
• 32 network links and four I/O links per node
Cray X1 Overview
Cray Proprietary (expired)
Slide 6
NUMA Scalable up to 1024 Nodes
Interconnection
Network
• 16 parallel networks for bandwidth
• Global shared memory across machine
Cray X1 Overview
Cray Proprietary (expired)
Slide 7
Designed for Scalability
• Distributed shared memory (DSM) architecture
– Low latency, load/store access to entire machine (tens of TBs)
• Decoupled vector memory architecture for latency tolerance
– Thousands of outstanding references, flexible addressing
• Very high performance network
– High bandwidth, fine-grained transfers
– Same router as Origin 3000, but 16 parallel copies of the network
• Architectural features for scalability
– Remote address translation
– Global coherence protocol optimized for distributed memory
– Fast synchronization
• Parallel I/O scales with system size
Cray X1 Overview
Cray Proprietary (expired)
Slide 8
Network Topology (16 CPUs)
P
M0
Section 0
Cray X1 Overview
P
P
node 0
P
node 1
M15
M1
P
P
P
node 2
M15
M1
P
M0
P
M15
P
M0
P
M1
P
M0
P
P
P
P
node 3
M15
M1
Section 15
Section 1
Cray Proprietary (expired)
Slide 9
Network Topology (128 CPUs)
R
R
R
R
R
R
R
R
Cray X1 Overview
Cray Proprietary (expired)
Slide 10
Network Topology (512 CPUs)
one
chassis
Cray X1 Overview
Cray Proprietary (expired)
Slide 11
Network Topology (1024 CPUs)
one
chassis
Cray X1 Overview
Cray Proprietary (expired)
Slide 12
Some Design Challenges
• How do we tolerate ever growing memory
latencies?
• How do we support address translations and
cache coherence with high bandwidth vector
processors?
Cray X1 Overview
Cray Proprietary (expired)
Slide 13
Decoupled Vector Microarchitecture
• Decoupled access/execute and decoupled scalar/vector
• Scalar unit runs ahead, doing addressing and control
– Scalar and vector loads issued early
– Store addresses computed early, saved for later use
– Operations queued and executed later when data arrives
• Hardware dynamically unrolls loops
– Scalar starts on next loop before current loop has completed
– Memory pipeline stays full of requests
– Special sync operations keep pipeline full, even across
barriers
This is key to making the system perform well on short loop nests
Cray X1 Overview
Cray Proprietary (expired)
Slide 14
Maintaining Decoupling Past
Synchronization Points
Msync control and data barrier:
P0
….
St Vx
….
Msync
….
Ld
P1
….
St Vx
….
Msync
….
Ld
P2
….
St Vx
….
Msync
….
Ld
P3
….
St Vx
….
Msync
….
Ld
Want to protect against hazards, but not drain memory pipeline.
Vector store addresses computed early (before data is
available):
– sent out to the shared L2 cache, where it modifies cache state
– later loads from other P chips can now be performed
– loads only wait if there is a true conflict
Cray X1 Overview
Cray Proprietary (expired)
Slide 15
Address Translation
63 62 61
VA:
32 31
48 47
MBZ
16 15
Virtual Page #
Page Offset
Memory region:
useg, kseg, kphys
47 4645
PA:
0
Possible page boundaries:
64 KB to 4 GB
36 35
Node
0
Offset
Physical address space:
Main memory, MMR, I/O
Max 64 TB physical memory
• High translation bandwidth:
– scalar + four vector translations per cycle per P chip
• Remote (hierarchical) translation:
– allows each node to manage its own memory (eases memory
mgmt.)
– TLB only needs to hold translations for one node  scales
Cray X1 Overview
Cray Proprietary (expired)
Slide 16
Cache Coherence
• Global coherence, but only cache memory from local node
– Supports SMP-style codes up to 4 MSPs
– References outside this domain converted to non-allocate
• Scalable codes use explicit communication anyway
– Keeps directory entry and protocol simple
– Significant reliability benefits for large scale systems
• Explicit cache allocation control
– Per instruction hints
– Use non-allocating refs to avoid cache pollution
• Coherence directory stored on the M chips (rather than in
DRAM)
– Low latency and really high bandwidth to support vectors
• Factor of several hundred over typical DRAM-based directory
Cray X1 Overview
Cray Proprietary (expired)
Slide 17
Mechanical
Design
Cray Proprietary (expired)
Multi-Chip Module
8 - 7S IBM IC’s & 80 Decoupling Capacitors
72 mm X 72 mm X 8.3 mm
3832 LGA Pads on BSM
34,000 C4
Pads on TSM
173,000 mm of Routing – a Routing Density
of 3300 mm per Square cm
Cray X1 Overview
18 Plane Pairs of X & Y
Routing in Ceramic
Cray Proprietary (expired)
83 layer, Glass Ceramic/Copper
Conductor/Mesh Construction
Slide 19
Compliant Interconnect
Resistance of 0.015 m ohms @ 0.75
amps
3832 Contacts on 1-mm Spacing
Force of 40 grams (153 kg per MCM)
Alignment is by Socket
Spring/Fence Centering
Cray X1 Overview
Inductance of < 1.5 nH @ 500-1000 MHz
Cray Proprietary (expired)
Compliance of 12 mils Non-Yielding
Re-Mateable Interface (100 times)
Slide 20
Printed Circuit Board
558 mm x 431 mm and 3.5 mm
17,000 Nets and
99,000 Buried & Thru Vias
1 mm Via Pitch
•Two Routes / Channel
3,600,000 mm of Routing
(1500 mm per square cm)
34 layers - 16 Power and Ground, 16 Signal, & 2 Surface
Cray X1 Overview
3 mil Traces on 4 and 8 mil Pitch
100 W impedance - Differential
45 W impedance - Single Ended
Cray Proprietary (expired)
Distributes 2000 Amps DC Current
Slide 21
Spray Cap Assembly
Heat Flux of IC’s on the MCM are:
2
P Chip Heat Flux - 45 W/cm
2
E Chip Heat Flux - 15 W/cm
IC Junction Temperature of 85o C @ Heat Flux Density up to 70 W/cm2
O-Ring Seal
Flow Rate is 1 ml/w/min @
Pressure Differential of 25 psig
Mixed Vapor Return
Evaporation Efficiency
Of ~ 25%
Maintain Component Junction
Temperatures @ 75o C +/-10o
Cray X1 Overview
Fluid Inlet
Fluorinert™ FC 72, the liquid coolant is atomized and sprayed onto
the (ICs) to maintain a continuously wetted surface
Cray Proprietary (expired)
Slide 22
Power Converter
Synchronous Rectification
1,000,000 hour MTBF
Input Voltage- 48 Volt
•Electronic Inrush Control
•Current-Share
•Margins
•Enable Features
Conduction Cooled
Power Density of 16 watts/in3
Cray X1 Overview
190 amp output current @ 1.8 volts(125 amps @ 2.5-volt)
with efficiencies of 80% at max. output
Cray Proprietary (expired)
Slide 23
Memory Daughter Card
Forced Convection Cooled Via a Heat Spreader
•(qja=16C/W@1200fpm)
16 RDRAM® (Rambus®) Memory Chips
1.25 CFM of Air per Daughter Card
Cards Support Multiple Memory Chip Densities
Cray X1 Overview
Connects to the PCB Via a Custom 294-Pin Connector
Cray Proprietary (expired)
Slide 24
System Packaging
• LC Cabinets
–
–
–
–
All System Heat is Rejected to Facility Water
1 to 16 Node Modules per LC Cabinet
Up to 8 Cabinets, 512 CPUs, standard hypercube
Up to 64 Cabinets, 4,096 CPUs, mesh topology
• AC Cabinets
– All System Heat is Rejected to Facility Air
– 1 to 4 Node Modules (16 CPUs) per AC Cabinet
– Maximum 4 Cabinets, 64 CPUs
• 50 Hz & 60 Hz Power Supported
Cray X1 Overview
Cray Proprietary (expired)
Slide 25
System Specifications
• Power
LC Cabinet
AC Cabinet
82 KW, 208v 3Ø
20 KW, 208v 3Ø
• Cooling
LC Cabinet
AC Cabinet
40-45 GPM
2000 CFM
• Footprint
LC Cabinet
AC Cabinet
43.5 in x 84 x 82.25 in
32 in x 48 in x 82 in
• Weight
LC Cabinet
AC Cabinet
Cray X1 Overview
3500 lbs
1200 lbs
Cray Proprietary (expired)
Slide 26
Liquid Cooled Cabinet
Cable Routing
Card Cage &
Connectors
Router Modules
Power Distribution Bus
Node Modules
Damper Doors
Incoming Power
Box
Power Supplies
Heat Exchanger
Blower Assembly
FC-72 Gear Pumps
FC-72 Filters
Cray X1 Overview
Cray Proprietary (expired)
Slide 27
Air Cooled Cabinet
Cable Routing
Card Cage &
Connectors
Router Modules
Node Modules
Blower Assembly
Heat Exchanger
Incoming Power
Box
FC-72 Filters
Power Supplies
Cray X1 Overview
Cray Proprietary (expired)
Slide 28
Node board
Field-Replaceable
Memory
Daughter Cards
Spray Cooling
Caps
CPU MCM
8 chips
Network
Interconnect
PCB
17” x 22’’
Cray X1 Overview
Air Cooling
Manifold
Cray Proprietary (expired)
Slide 29
Cray X1 Node Module
Cray X1 Overview
Cray Proprietary (expired)
Slide 30
Node Module (Power Converter side)
Cray X1 Overview
Cray Proprietary (expired)
Slide 31
Cray X1 Chassis
Cray X1 Overview
Cray Proprietary (expired)
Slide 32
64 Processor Cray X1 System
~820 Gflops
Cray X1 Overview
Cray Proprietary (expired)
Slide 33
256 Processor Cray X1 System
~3.3 Tflops
Cray X1 Overview
Cray Proprietary (expired)
Slide 34
Download