Supercomputers
Special Course of Computer Architecture
H. Amano

Contents
• What are supercomputers?
• Architecture of supercomputers
• Representative supercomputers
• Exa-scale supercomputer project

Defining Supercomputers
• High-performance computers mainly for scientific computation.
  – A huge amount of computation for biochemistry, physics, astronomy, meteorology, etc.
  – Very expensive: developed and managed with national funds.
  – High-level techniques are required to develop and manage them.
  – The USA, Japan and China compete for the top-1 supercomputer.
  – A large amount of national funding is used, so supercomputers tend to become political news.
    → In Japan, the supercomputer project became a target of the budget review in Dec. 2009.
• 「K」 achieved 10 PFLOPS and became the top 1 last year, but Sequoia took the title back last month.

FLOPS
• Floating-point Operations Per Second.
• Floating-point number:
  – (mantissa) × 2^(exponent)
  – Double precision: 64 bits; single precision: 32 bits.
  – The IEEE standard defines the format and the rounding.

           sign | exponent | mantissa
  Single    1   |    8     |   23
  Double    1   |   11     |   52
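As a concrete illustration of the IEEE layout above, the sketch below unpacks the sign, exponent and mantissa fields of a double-precision value in Python (the function and variable names are ours, not part of any standard API):

```python
import struct

def ieee754_fields(x: float):
    """Split a Python float (an IEEE 754 double) into sign, exponent, mantissa."""
    bits = struct.unpack('>Q', struct.pack('>d', x))[0]  # the raw 64 bits
    sign = bits >> 63                  # 1 bit
    exponent = (bits >> 52) & 0x7FF    # 11 bits, stored with a bias of 1023
    mantissa = bits & ((1 << 52) - 1)  # 52 bits of fraction
    return sign, exponent, mantissa

s, e, m = ieee754_fields(-6.25)
# For normalized numbers: value = (-1)^s * (1 + m/2^52) * 2^(e-1023)
print(s, e - 1023, 1 + m / 2**52)     # -> 1 2 1.5625, i.e. -1.5625 * 2^2
```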
The range of performance

  10^6   M (Mega)  (1 million)
  10^9   G (Giga)  (1 billion)
  10^12  T (Tera)  (1 trillion)
  10^15  P (Peta)
  10^18  E (Exa)

• iPhone 4S: 140 MFLOPS; high-end PC: 50-80 GFLOPS; powerful GPU: Tera-FLOPS; supercomputers: 10 TFLOPS-16 PFLOPS.
• 10 PFLOPS = 10^16 operations per second = 1 kei (京) in Japanese → the name 「K」 comes from it.
• Growth rate: 1.9 times/year.

How to select the top 1?
• Top500/Green500: performance executing Linpack.
  – Linpack is a kernel for matrix computation.
  – Scale free.
  – Performance centric.
• Gordon Bell Prize
  – Peak performance, price/performance, special achievement.
• HPC Challenge
  – Global HPL (matrix computation): computation.
  – Global RandomAccess (random memory access): communication.
  – EP-STREAM per system (heavy-load memory access): memory performance.
  – Global FFT: a complicated problem requiring both memory and communication performance.
• Nov.: ACM/IEEE Supercomputing Conference
  – Top500, Gordon Bell Prize, HPC Challenge, Green500.
• Jun.: International Supercomputing Conference
  – Top500, Green500.

[Figure: Rmax (PFLOPS) of the top-5 systems from Jun. 2010 to Nov. 2011 — K (Japan), Sequoia (USA, 16 PFLOPS), Tianhe (天河, China), Jaguar (USA), Nebulae (China), Kraken (USA), Roadrunner (USA), Tsubame (Japan), Jugene (Germany). From the SACSIS2012 invited speech.]

Top 500, Nov. 2011
Name | Developer | Hardware | Cores | Rmax (peak) TFLOPS | Power (kW)
1. K (京), Japan | RIKEN AICS | SPARC64 VIIIfx 2.0 GHz, Tofu interconnect (Fujitsu) | 705,024 | 10,510 (11,280) | 12,659.9
2. Tianhe-1A (天河), China | National Supercomputer Center in Tianjin | NUDT YH MPP, Xeon X5670 6C 2.93 GHz + NVIDIA 2050 (NUDT) | 186,368 | 2,566 (4,701) | 4,040
3. Jaguar, USA | DOE/SC/Oak Ridge National Lab. | Cray XT5-HE, Opteron 6C 2.6 GHz (Cray Inc.) | 224,162 | 1,759 (2,331) | 6,950
4. Nebulae, China | National Supercomputing Centre in Shenzhen | Dawning TC3600 Blade, Xeon X5650 6C 2.66 GHz, Infiniband QDR + NVIDIA 2050 (Dawning) | 120,640 | 1,271 (2,974) | 2,580
5. TSUBAME 2.0, Japan | GSIC, Tokyo Inst. of Technology | HP ProLiant SL390s G7, Xeon X5670 6C + NVIDIA GPU (NEC/HP) | 73,238 | 1,192 (2,287) | 1,398.6

Green 500, Nov. 2011 (IBM Blue Gene/Q took places 1-5)
Machine | Place | MFLOPS/W | Power (kW)
1. BlueGene/Q, Power BQC 16C 1.60 GHz, Custom | IBM Rochester | 2026.48 | 85.12
2-5. BlueGene/Q, Power BQC 16C 1.60 GHz, Custom / BlueGene/Q Prototype | IBM Thomas J. Watson Research Center / Rochester | 1689.86-2026.48 |
6. DEGIMA Cluster, Intel i5, ATI Radeon GPU, Infiniband QDR | Nagasaki Univ. | 1378.32 | 47.05
7. Bullx B505, Xeon E5649 6C 2.53 GHz, Infiniband QDR, NVIDIA 2090 | Barcelona Supercomputing Center | 1266.26 | 81.50
8. Curie Hybrid Nodes, Bullx B505, NVIDIA M2090, Xeon E5640 2.67 GHz, Infiniband QDR | TGCC / GENCI | 1010.11 | 108.80
10. Tsubame 2.0 | Tokyo Inst. of Technology | |

Why top 1?
• Top 1 is just a measure of matrix computation.
• There are also the top 1 of Green500, the Gordon Bell Prize, and the top 1 of each HPC Challenge program → all of these machines are valuable. TV and newspapers focus too much on the Top 500.
• However, most top-1 computers also got the Gordon Bell Prize and the HPC Challenge top 1.
  – K and Sequoia.
• The impact of top 1 is great!

Why are supercomputers so fast?
× Because they use a high-frequency clock.
[Figure: clock frequency of high-end PCs, 1992-2008 — Alpha 21064 at 150 MHz (1992), rising about 40%/year to the Pentium 4 at 3.2 GHz and Nehalem at 3.3 GHz; K runs at 2 GHz and Sequoia at 1.6 GHz.]
• The speed-up of the clock saturated around 2003: power and heat dissipation.
• The clock frequency of K and Sequoia is lower than that of common PCs.

The 3 major methods of parallel processing in supercomputers
Supercomputer = massively parallel computer.
– SIMD (Single Instruction Stream, Multiple Data Streams)
  • Most accelerators.
– Pipelined processing
  • Vector computers.
– MIMD (Multiple Instruction Streams, Multiple Data Streams)
  • Homogeneous (vs. accelerators), scalar (vs. vector machines).
– Although every supercomputer uses all three methods at various levels, it can be classified by which one it mainly relies on.

Key issues other than computational nodes
• Large, high-bandwidth memory.
• Large disk.
• High-speed interconnection networks.

SIMD (Single Instruction Stream, Multiple Data Streams)
[Figure: one instruction memory and instruction processing unit driving an array of processing units, each with its own data memory.]
• All processing units execute the same instruction.
• Low degree of flexibility.
• Illiac-IV / MMX instructions / ClearSpeed / IMAP / GP-GPU (coarse grain).
• CM-2 (fine grain).

GPGPU (General-Purpose computing on Graphics Processing Units)
– TSUBAME 2.0 (Xeon + Tesla; Top500 2010/11 4th)
– Tianhe-1 (天河一号) (Xeon + FireStream; 2009/11 5th)
※ Parentheses show the development environment.

GeForce GTX280: 240 cores
[Figure: host feeding an input assembler and thread execution manager; arrays of thread processors, each with per-block shared memory (PBSM); load/store path to global memory.]

GPU (NVIDIA's GTX580)
• 512 GPU cores (128 × 4) around a shared L2 cache.
• 768 KB L2 cache.
• 40 nm CMOS, 550 mm^2.

Cell Broadband Engine
• A common platform for supercomputers and games: used in the PS3 and in IBM's Roadrunner.
[Figure: one PPE (PXU with L1/L2 caches) and 8 SPEs (SXU + local store + DMA) connected by 4 × 16B data rings at 1.6 GHz, with BIF/MIC and the I/O interfaces IOIF0/IOIF1.]

Peak performance vs. Linpack performance
[Figure: peak vs. Linpack PFLOPS for K (Japan, homogeneous), Tianhe (天河, China), Nebulae (China), Jaguar (USA) and Tsubame (Japan); the GPU-based machines show large gaps between the two.]
• The difference is large in machines with accelerators.
• The accelerator type is energy efficient.

Pipeline processing
[Figure: six pipeline stages in sequence.]
• Each stage sends its result and receives new input every clock cycle.
• N stages = N times the performance.
• Data dependencies cause RAW hazards and degrade the performance.
• If a large array is processed, many stages can work efficiently (see the sketch below).

Vector computers
[Figure (animation over four slides): elements a0, a1, a2, ... and b0, b1, b2, ... stream from the vector registers through a multiplier computing X[i] = A[i]*B[i], whose results x0, x1, ... are chained into an adder computing Y = Y + X[i].]
• The classic style of supercomputer since the Cray-1.
• The Earth Simulator may be the last vector supercomputer.
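A back-of-the-envelope model of the pipeline claim above: an s-stage pipeline needs s + N - 1 cycles for N independent elements, versus s·N cycles unpipelined, so the speedup approaches s only for long arrays. A minimal sketch (our own toy arithmetic, not a simulator of any real machine):

```python
def pipeline_speedup(stages: int, n_elements: int) -> float:
    """Speedup of an s-stage pipeline over unpipelined execution."""
    unpipelined = stages * n_elements    # one element finishes at a time
    pipelined = stages + n_elements - 1  # fill the pipe once, then 1 result/cycle
    return unpipelined / pipelined

for n in (1, 6, 60, 600):
    print(n, round(pipeline_speedup(6, n), 2))
# 1 -> 1.0, 6 -> 3.27, 60 -> 5.54, 600 -> 5.95: approaching the 6-stage limit
```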
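The chained multiply-add in the vector figure (X[i] = A[i]*B[i] feeding Y = Y + X[i]) is exactly what a vectorizing compiler turns a simple loop into. NumPy's whole-array operations give the same data-parallel flavor on a PC (a loose analogy assuming NumPy is installed, not vector hardware):

```python
import numpy as np

A = np.arange(1.0, 9.0)   # a0 .. a7
B = np.arange(9.0, 17.0)  # b0 .. b7

# Scalar style: one element per loop iteration, like a non-vector CPU.
Y = 0.0
for a, b in zip(A, B):
    Y += a * b

# Vector style: one multiply over the whole register, then a chained reduction.
X = A * B        # X[i] = A[i] * B[i], elementwise
Y_vec = X.sum()  # Y = Y + X[i] accumulated over all i
assert Y == Y_vec
```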
MIMD (Multiple Instruction Streams / Multiple Data Streams)
• Multiple processors (cores) can work independently.
  – Synchronization mechanism.
  – Data communication: shared memory.
• All supercomputers are MIMD machines with multiple cores.
• However, K and Sequoia (BlueGene/Q) are the typical massively parallel MIMD machines:
  – homogeneous computers,
  – scalar processors.
[Figure: four nodes of processors, which can work independently, connected through an interconnection network to a shared memory.]

Multi-core (Intel's Nehalem-EX)
• 8 CPU cores around two L3 cache blocks.
• 24 MB L3 cache.
• 45 nm CMOS, 600 mm^2.

Intel 80-core chip [Vangal, ISSCC'07]

How to program them?
• Can common PC programs be accelerated on supercomputers?
  – Yes, to a certain degree, by parallel compilers.
• However, to use many cores efficiently, specialists must optimize the programs (see the sketches below):
  – multiple processes communicating with MPI,
  – OpenMP,
  – OpenCL/CUDA → GPU accelerator type.
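As a taste of the MPI style listed above, here is a minimal sketch using the mpi4py binding (assuming mpi4py, NumPy and an MPI runtime are installed; run with e.g. `mpiexec -n 4 python dot.py`, where the file name is ours):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

N = 1_000_000
chunk = N // size  # assume size divides N evenly for simplicity
lo = rank * chunk  # this rank's slice of the problem

# Each rank computes a partial sum of squares on its own slice: MIMD style,
# every process runs the same program but on independent data and control flow.
a = np.arange(lo, lo + chunk, dtype=np.float64)
partial = float(np.dot(a, a))

# A reduction gathers the partial sums onto rank 0 over the interconnect.
total = comm.reduce(partial, op=MPI.SUM, root=0)
if rank == 0:
    print("sum of squares:", total)
```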
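OpenMP-style shared-memory parallelism has no direct Python equivalent, but the standard-library multiprocessing pool gives a similar parallel-for-plus-reduction flavor (a loose analogy only, not OpenMP itself; names are ours):

```python
from multiprocessing import Pool

def partial_sum_of_squares(bounds):
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))

if __name__ == "__main__":
    N, workers = 1_000_000, 4
    step = N // workers  # assume workers divides N evenly
    chunks = [(i * step, (i + 1) * step) for i in range(workers)]
    with Pool(workers) as pool:  # fork a team of workers,
        # then distribute loop iterations and reduce, like a parallel for.
        total = sum(pool.map(partial_sum_of_squares, chunks))
    print(total)  # same answer as the serial loop
```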
IBM's BlueGene/Q: the fastest computer
• Successor of Blue Gene/L and Blue Gene/P.
• Sequoia consists of BlueGene/Q.
• 18 Power processors (16 computational, 1 control and 1 redundant) and the network interfaces are provided in one chip.
• The intra-chip interconnect is a crossbar switch.
• 5-dimensional mesh/torus.
• 1.6 GHz clock.
• Also a simple NUMA.
[Figure: BlueGene/Q packaging, from the IBM web site.]

Japanese supercomputers
• K supercomputer
  – A homogeneous, scalar-type massively parallel computer.
• Earth Simulator
  – A vector computer.
  – The difference between peak and Linpack performance is small.
• Tokyo Tech's Tsubame
  – Uses a lot of GPUs. An energy-efficient supercomputer.
• Nagasaki University's DEGIMA
  – Uses a lot of GPUs. A hand-made supercomputer with high cost-performance: Gordon Bell Prize cost-performance winner.
• GRAPE projects
  – Dedicated SIMD supercomputers for astronomy. Various versions won the Gordon Bell Prize.

Supercomputer 「K」 (from the SACSIS2012 invited speech)
• SPARC64 VIIIfx chip: 8 cores sharing an L2 cache and memory.
• Tofu interconnect: a 6-D torus/mesh, with an interconnect controller providing an RDMA mechanism.
• 4 nodes/board, 24 boards/rack, 96 nodes/rack.
• NUMA, or UMA + NORMA.
[Photo: racks of K and the water-cooling system, from the SACSIS2012 invited speech.]

Tofu: 6-dimensional torus
[Figure: node addresses of a 3-ary 1-cube, 3-ary 2-cube and 3-ary 3-cube mesh, extended to a 4-dimensional mesh (0***, 1***, 2***).]

Why K could get the top 1
• The delay of BlueGene/Q / Sequoia.
  – The financial crisis in the USA.
• The withdrawal of NEC/Hitachi.
  – At the start, a complex of a vector machine and a scalar machine was planned.
  – After the withdrawal, the whole budget could be used for the scalar machine.
• The budget review made the project famous.
  – Enough funding was injected in a short period.
• The engineers at Fujitsu did a really good job.

The Earth Simulator (2002)
• Peak performance 40 TFLOPS; a simple NUMA.
• 640 nodes, each with 8 vector processors sharing 16 GB of memory.
• Interconnection network: 16 GB/s × 2.
(From the SACSIS2012 invited talk.)

TIT's Tsubame: a well-balanced supercomputer with GPUs.
Nagasaki Univ.'s DEGIMA.
GRAPE-DR: Kei Hiraki, "GRAPE-DR", http://www.fpl.org (FPL2007).

Exa-scale computer
• The Japanese national project for an exa-scale computer has started.
• A feasibility study has started.
  – U. Tokyo, Tsukuba Univ., Tohoku Univ. and Riken.
• It is difficult to produce supercomputers with original Japanese chips.
• In Japan, a vendor takes a loss to develop a supercomputer.
  – The vendor may recover the development cost later by selling smaller systems.
  – However, Japanese semiconductor companies will not be able to afford such a large development budget.
• If Intel's CPUs or NVIDIA's GPUs are used, a huge amount of national money will flow to US companies.
• For exa-scale, 70,000,000 cores are needed.
  – The limitation of the budget is severer than the technical limits.

Amdahl's law
• Serial part 1%, parallel part 99%; only the parallel part is accelerated by parallel processing:
  speedup(p) = 1 / (0.01 + 0.99/p)
• 50 times with 100 cores, 91 times with 1,000 cores.
• If there is even a small serial part, the performance improvement is limited.

Why exa-scale supercomputers?
• The ratio of the serial part becomes small for large-scale problems.
  – Linpack is a scale-free benchmark.
  – Serial part 1 day + parallel part 10 years → 1 day + 1 day: a big impact.
• Are there any big programs which cannot be solved by K but can be solved by an exa-scale supercomputer?
  – The number of such programs will decrease.
  – Can we find new areas of application?
• It is important that such a big computing power is open for research.

Should we develop floating-point-computation-centric supercomputers?
• What do people want a big supercomputer to do?
  – Finding new medicines: pattern matching.
  – Simulation of earthquakes; meteorology for analyzing global warming.
  – Big data.
  – Artificial intelligence.
• Most of these are not suitable for floating-point-centric supercomputers.
• "Supercomputers for big data" or "super-cloud computers" might be required.

Motivation and limitation
• Integrates computer technologies including architecture, hardware, software, dependability techniques, semiconductors and applications.
• A flagship and a symbol.
• No computer development remains in Japan other than supercomputers.
• A huge computing power is open for peaceful research.
• It is a tool which makes impossible analyses possible.
• What needs infinite computing power?
• Is it a Japanese supercomputer if all cores and accelerators are made in the USA?
• Does a floating-point-centric supercomputer that solves Linpack as fast as possible really fit the demand?
• Keep watching the exa-scale computer project!

Exercise
• A target program: serial computation part: 1; parallel computation part: N^3.
• K: 700,000 cores.
• Exa: 70,000,000 cores.
• What N makes Exa 10 times faster than K? (A numerical check follows below.)
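Both the Amdahl's-law numbers and the closing exercise can be checked with a few lines of Python (a sketch using the rounded 700,000 and 70,000,000 core counts from the slides; the function name is ours):

```python
def time_amdahl(serial, parallel, cores):
    """Execution time of a program with the given serial and parallel work."""
    return serial + parallel / cores

# Amdahl's-law slide: serial 1%, parallel 99%.
for p in (100, 1000):
    print(p, round(1 / time_amdahl(0.01, 0.99, p)))  # -> 50, 91

# Exercise: serial part 1, parallel part N^3. Exa is 10x faster than K when
#   1 + N^3/700_000 = 10 * (1 + N^3/70_000_000)
# i.e. N^3 * (1/700_000 - 10/70_000_000) = 9, so N^3 = 7_000_000.
N = 7_000_000 ** (1 / 3)
print(N)  # about 191: for N of roughly 192 or more, Exa is >= 10x faster
```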