The Cray X1 Architecture
Steve Scott
Cray X1 Chief Architect
Cray X1 Overview
Cray Proprietary (expired)

Cray’s Computing Vision
Scalable High-Bandwidth Computing
[Figure: product roadmap with year markers 2004, 2005, 2006, and 2010. Labels: X1, X1E, ‘Black Widow’, ‘Black Widow 2’, Red Storm, RS, ‘Strider 2’, ‘Strider 3’, ‘Strider X’, ‘Cascade’, Product Integration, Sustained Petaflops.]

Cray X1
Cray PVP
• Powerful vector processors
• Very high memory bandwidth
• Non-unit stride computation
• Special ISA features
T3E
• Extreme scalability
• Optimized communication
• Memory hierarchy
• Synchronization features

• Modernized the ISA
• Improved via vectors
Extreme scalability with high bandwidth vector processors

Cray X1 Instruction Set Architecture
New ISA
  – Much larger register set (32x64 vector, 64+64 scalar)
  – All operations performed under mask
  – 64- and 32-bit memory and IEEE arithmetic
  – Integrated synchronization features
Advantages of a vector ISA
  – Compiler provides useful dependence information to hardware
  – Very high single processor performance with low complexity
      ops/sec = (cycles/sec) * (instrs/cycle) * (ops/instr)
  – Localized computation on processor chip
      – large register state with very regular access patterns
      – registers and functional units grouped into local clusters (pipes)
      excellent fit with future IC technology
  – Latency tolerance and pipelining to memory/network
      very well suited for scalable systems
  – Easy extensibility to next generation implementation

Not Your Father’s Vector Machine
• New instruction set
• New system architecture
• New processor microarchitecture
• “Classic” vector machines were programmed differently
  – Classic vector: Optimize for loop length with little regard for locality
  – Scalable micro: Optimize for locality with little regard for loop length
• The Cray X1 is programmed like a parallel micro-based machine
  – Rewards locality: register, cache,
    local memory, remote memory
  – Decoupled microarchitecture performs well on short loop nests
  – (however, does require vectorizable code)

Cray X1 Node
[Figure: node block diagram with 16 P chips, 16 caches ($), 16 M chips, 16 memory blocks (mem), and two I/O blocks (IO).]
51 Gflops, 200 GB/s
• Four multistream processors (MSPs), each 12.8 Gflops
• High bandwidth local shared memory (128 Direct Rambus channels)
• 32 network links and four I/O links per node

NUMA Scalable up to 1024 Nodes
[Figure: nodes attached to an Interconnection Network.]
• 16 parallel networks for bandwidth
• Global shared memory across machine

Designed for Scalability
• Distributed shared memory (DSM) architecture
  – Low latency, load/store access to entire machine (tens of TBs)
• Decoupled vector memory architecture for latency tolerance
  – Thousands of outstanding references, flexible addressing
• Very high performance network
  – High bandwidth, fine-grained transfers
  – Same router as Origin 3000, but 16 parallel copies of the network
• Architectural features for scalability
  – Remote address translation
  – Global coherence protocol optimized for distributed memory
  – Fast synchronization
• Parallel I/O scales with system size

Network Topology (16 CPUs)
[Figure: node 0 to node 3, each with four P chips and M chips M0, M1, ..., M15; network sections labeled Section 0, Section 1, ..., Section 15.]

Network Topology (128 CPUs)
[Figure: topology diagram with eight routers (R).]

Network Topology (512 CPUs)
[Figure: topology diagram; label: one chassis.]

Network Topology (1024 CPUs)
[Figure: topology diagram; label: one chassis.]

Some Design Challenges
• How do we tolerate ever growing memory
  latencies?
• How do we support address translations and cache coherence with high bandwidth vector processors?

Decoupled Vector Microarchitecture
• Decoupled access/execute and decoupled scalar/vector
• Scalar unit runs ahead, doing addressing and control
  – Scalar and vector loads issued early
  – Store addresses computed early, saved for later use
  – Operations queued and executed later when data arrives
• Hardware dynamically unrolls loops
  – Scalar starts on next loop before current loop has completed
  – Memory pipeline stays full of requests
  – Special sync operations keep pipeline full, even across barriers
This is key to making the system perform well on short loop nests

Maintaining Decoupling Past Synchronization Points
Msync control and data barrier:
  P0  ….  St Vx  ….  Msync  ….  Ld
  P1  ….  St Vx  ….  Msync  ….  Ld
  P2  ….  St Vx  ….  Msync  ….  Ld
  P3  ….  St Vx  ….  Msync  ….  Ld
Want to protect against hazards, but not drain memory pipeline.
Vector store addresses computed early (before data is available):
  – sent out to the shared L2 cache, where it modifies cache state
  – later loads from other P chips can now be performed
  – loads only wait if there is a true conflict

Address Translation
VA:
  bits 63:62   Memory region: useg, kseg, kphys
  bits 61:48   MBZ
  bits 47:16   Virtual Page #
  bits 15:0    Page Offset
  Possible page boundaries: 64 KB to 4 GB (between bit positions 16 and 32)
PA:
  bits 47:46   Physical address space: Main memory, MMR, I/O
  bits 45:36   Node
  bits 35:0    Offset
  Max 64 TB physical memory
• High translation bandwidth:
  – scalar + four vector translations per cycle per P chip
• Remote (hierarchical) translation:
  – allows each node to manage its own memory (eases memory mgmt.)
  – TLB only needs to hold translations for one node, so it scales

Cache Coherence
• Global coherence, but only cache memory from local node
  – Supports SMP-style codes up to 4 MSPs
  – References outside this domain converted to non-allocate
• Scalable codes use explicit communication anyway
  – Keeps directory entry and protocol simple
  – Significant reliability benefits for large scale systems
• Explicit cache allocation control
  – Per instruction hints
  – Use non-allocating refs to avoid cache pollution
• Coherence directory stored on the M chips (rather than in DRAM)
  – Low latency and really high bandwidth to support vectors
• Factor of several hundred over typical DRAM-based directory

Mechanical Design

Multi-Chip Module
• 8 - 7S IBM ICs & 80 Decoupling Capacitors
• 72 mm x 72 mm x 8.3 mm
• 3832 LGA Pads on BSM
• 34,000 C4 Pads on TSM
• 173,000 mm of Routing, a Routing Density of 3300 mm per Square cm
• 18 Plane Pairs of X & Y Routing in Ceramic
• 83 layer, Glass Ceramic/Copper Conductor/Mesh Construction

Compliant Interconnect
• Resistance of 0.015 m ohms @ 0.75 amps
• 3832 Contacts on 1-mm Spacing
• Force of 40 grams (153 kg per MCM)
• Alignment is by Socket Spring/Fence Centering
• Inductance of < 1.5 nH @ 500-1000 MHz
• Compliance of 12 mils
• Non-Yielding Re-Mateable Interface (100 times)

Printed Circuit Board
• 558 mm x 431 mm and 3.5 mm
• 17,000 Nets and 99,000 Buried & Thru Vias
• 1 mm Via Pitch, Two Routes / Channel
• 3,600,000 mm of Routing (1500 mm per square cm)
• 34 layers - 16 Power and Ground, 16 Signal, & 2 Surface
• 3 mil Traces on 4 and 8 mil Pitch
• 100 Ω impedance - Differential
• 45 Ω impedance - Single Ended
• Distributes 2000 Amps DC Current

Spray Cap Assembly
Heat Flux of ICs on the MCM are:
• P Chip Heat
  Flux - 45 W/cm²
• E Chip Heat Flux - 15 W/cm²
• IC Junction Temperature of 85 °C @ Heat Flux Density up to 70 W/cm²
• O-Ring Seal
• Flow Rate is 1 ml/W/min @ Pressure Differential of 25 psig
• Mixed Vapor Return
• Evaporation Efficiency of ~25%
• Maintain Component Junction Temperatures @ 75 °C +/- 10 °C
• Fluid Inlet
• Fluorinert™ FC-72, the liquid coolant, is atomized and sprayed onto the ICs to maintain a continuously wetted surface

Power Converter
• Synchronous Rectification
• 1,000,000 hour MTBF
• Input Voltage - 48 Volt
• Features
  – Electronic Inrush Control
  – Current-Share
  – Margins
  – Enable
• Conduction Cooled
• Power Density of 16 watts/in³
• 190 amp output current @ 1.8 volts (125 amps @ 2.5 volts) with efficiencies of 80% at max. output

Memory Daughter Card
• Forced Convection Cooled Via a Heat Spreader (θja = 16 °C/W @ 1200 fpm)
• 16 RDRAM® (Rambus®) Memory Chips
• 1.25 CFM of Air per Daughter Card
• Cards Support Multiple Memory Chip Densities
• Connects to the PCB Via a Custom 294-Pin Connector

System Packaging
• LC Cabinets
  – All System Heat is Rejected to Facility Water
  – 1 to 16 Node Modules per LC Cabinet
  – Up to 8 Cabinets, 512 CPUs, standard hypercube
  – Up to 64 Cabinets, 4,096 CPUs, mesh topology
• AC Cabinets
  – All System Heat is Rejected to Facility Air
  – 1 to 4 Node Modules (16 CPUs) per AC Cabinet
  – Maximum 4 Cabinets, 64 CPUs
• 50 Hz & 60 Hz Power Supported

System Specifications
• Power
  – LC Cabinet: 82 kW, 208 V 3Ø
  – AC Cabinet: 20 kW, 208 V 3Ø
• Cooling
  – LC Cabinet: 40-45 GPM
  – AC Cabinet: 2000 CFM
• Footprint
  – LC Cabinet: 43.5 in x 84 in x 82.25 in
  – AC Cabinet: 32 in x 48 in x 82 in
• Weight
  – LC Cabinet: 3500 lbs
  – AC Cabinet: 1200 lbs

Liquid Cooled Cabinet
• Cable Routing
• Card Cage & Connectors
• Router Modules
• Power Distribution Bus
• Node Modules
• Damper Doors
• Incoming Power Box
• Power Supplies
• Heat Exchanger
• Blower Assembly
• FC-72 Gear Pumps
• FC-72 Filters

Air Cooled Cabinet
• Cable Routing
• Card Cage & Connectors
• Router Modules
• Node Modules
• Blower Assembly
• Heat Exchanger
• Incoming Power Box
• FC-72 Filters
• Power Supplies

Node board
• Field-Replaceable Memory Daughter Cards
• Spray Cooling Caps
• CPU MCM 8 chips
• Network Interconnect
• PCB 17” x 22”
• Air Cooling Manifold

Cray X1 Node Module

Node Module (Power Converter side)

Cray X1 Chassis

64 Processor Cray X1 System
~820 Gflops

256 Processor Cray X1 System
~3.3 Tflops
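
The peak figures quoted for the 64 and 256 processor systems can be cross-checked against the per-MSP peak given on the node slide (12.8 Gflops per MSP, four MSPs per node). The following is a minimal sketch of that arithmetic, assuming that "processor" on these slides means one MSP; the constant and function names are illustrative and do not come from the deck.

```python
# Peak-rate arithmetic for Cray X1 configurations.
# Assumption: "processor"/"CPU" in the deck means one multistream processor (MSP).
MSP_PEAK_GFLOPS = 12.8   # node slide: "each 12.8 Gflops"
MSPS_PER_NODE = 4        # node slide: "Four multistream processors (MSPs)"


def peak_gflops(num_msps: int) -> float:
    """Return the aggregate peak rate in Gflops for num_msps MSPs."""
    return num_msps * MSP_PEAK_GFLOPS


def nodes_needed(num_msps: int) -> int:
    """Return how many four-MSP nodes are needed to hold num_msps MSPs."""
    return -(-num_msps // MSPS_PER_NODE)  # ceiling division


if __name__ == "__main__":
    for msps in (4, 64, 256):
        print(f"{msps:4d} MSPs, {nodes_needed(msps):3d} nodes: "
              f"{peak_gflops(msps):8.1f} Gflops peak")
```

Under that assumption, one node comes to 51.2 Gflops (the 51 Gflops on the node slide), 64 MSPs to 819.2 Gflops (~820 Gflops), and 256 MSPs to 3276.8 Gflops (~3.3 Tflops).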