Re-Configurable EXASCALE Computing
Steve Wallach
swallach "at" conveycomputer "dot" com
Convey <= Convex++

Acknowledgement
• Steve Poole (ORNL) and I exchanged many emails and discussions on technology trends.
• We are both guilty of being overzealous architects.

Discussion
• The state of HPC software – today
• The path to Exascale Computing
• "If you don't know where you're going, any road will take you there." – The Wizard of Oz
• "When you get to the fork in the road, take it." – Yogi Berra

Mythology
• In the old days, we were told "Beware of Greeks bearing gifts."

Today
• "Beware of Geeks bearing gifts."

What Problems Are We Solving
• New hardware paradigms
• Uniprocessor performance is leveling off
• MPP and multithreading for the masses
• Power consumption is a constraint

Deja Vu
• Multi-core evolves
  – ILP fizzles
  – Many core
• x86 extended with SSE2, SSE3, and SSE4
  – Application-specific enhancements
  – Co-processor within the x86 microarchitecture
• Basically, performance enhancements by
  – On-chip parallelism
  – Instructions for specific application acceleration
    • One application instruction replaces MANY generic instructions
• "Déjà vu all over again" (Yogi Berra) – the 1980s
  – Need more performance than the micro provides
  – GPU, CELL, and FPGAs
    • Different software environment
    • Heterogeneous computing AGAIN

Current Languages
• Fortran 66, Fortran 77, Fortran 95, Fortran 2003
  – HPC Fortran
  – Co-Array Fortran
• C, C++
  – UPC
  – Stream C
  – C# (Microsoft)
  – Ct (Intel)
• OpenMP
  – Shared-memory multiprocessing API

Another Bump in the Road
• GPGPUs are very cost effective for many applications.
• Matrix multiply – Fortran:

  do i = 1,n1
    do k = 1,n3
      c(i,k) = 0.0
      do j = 1,n2
        c(i,k) = c(i,k) + a(i,j) * b(j,k)
      enddo
    enddo
  enddo

PGI Fortran to CUDA

  __global__ void matmulKernel( float* C, float* A, float* B, int N2, int N3 ){
    int bx = blockIdx.x, by = blockIdx.y;
    int tx = threadIdx.x, ty = threadIdx.y;
    int aFirst = 16 * by * N2;
    int bFirst = 16 * bx;
    float Csub = 0;
    for( int j = 0; j < N2; j += 16 ) {
      __shared__ float Atile[16][16], Btile[16][16];
      Atile[ty][tx] = A[aFirst + j + N2 * ty + tx];
      Btile[ty][tx] = B[bFirst + j*N3 + N3 * ty + tx];
      __syncthreads();
      for( int k = 0; k < 16; ++k )
        Csub += Atile[ty][k] * Btile[k][tx];
      __syncthreads();
    }
    int c = N3 * 16 * by + 16 * bx;
    C[c + N3 * ty + tx] = Csub;
  }

  void matmul( float* A, float* B, float* C, size_t N1, size_t N2, size_t N3 ){
    void *devA, *devB, *devC;
    cudaSetDevice(0);
    cudaMalloc( &devA, N1*N2*sizeof(float) );
    cudaMalloc( &devB, N2*N3*sizeof(float) );
    cudaMalloc( &devC, N1*N3*sizeof(float) );
    cudaMemcpy( devA, A, N1*N2*sizeof(float), cudaMemcpyHostToDevice );
    cudaMemcpy( devB, B, N2*N3*sizeof(float), cudaMemcpyHostToDevice );
    dim3 threads( 16, 16 );
    dim3 grid( N1 / threads.x, N3 / threads.y );
    matmulKernel<<< grid, threads >>>( devC, devA, devB, N2, N3 );
    cudaMemcpy( C, devC, N1*N3*sizeof(float), cudaMemcpyDeviceToHost );
    cudaFree( devA ); cudaFree( devB ); cudaFree( devC );
  }

Pornographic programming: can't define it, but you know it when you see it.
Michael Wolfe – The Portland Group, "How We Should Program GPGPUs," November 1st, 2008
http://www.linuxjournal.com/article/10216
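For contrast with the generated CUDA above, here is a minimal OpenMP sketch in C of the same triple loop (my example, not part of the PGI output). A single directive states the parallelism and leaves data placement and tiling to the compiler and runtime:

  #include <stddef.h>

  /* C = A * B with A (n1 x n2), B (n2 x n3), C (n1 x n3), row-major. */
  void matmul_omp(int n1, int n2, int n3,
                  const double *A, const double *B, double *C)
  {
      #pragma omp parallel for          /* rows of C are independent */
      for (int i = 0; i < n1; i++) {
          for (int k = 0; k < n3; k++) {
              double sum = 0.0;         /* accumulate C(i,k) */
              for (int j = 0; j < n2; j++)
                  sum += A[(size_t)i*n2 + j] * B[(size_t)j*n3 + k];
              C[(size_t)i*n3 + k] = sum;
          }
      }
  }

Whether this is fast enough depends entirely on the memory system; that is precisely the productivity-versus-performance tradeoff the following slides discuss.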
Find the Accelerator
• Accelerators can be beneficial, but they aren't "free" (like waiting for the next clock-speed boost)
• Worst case: you will have to completely rethink your algorithms and/or data structures
• Performance tuning is still time consuming
• Don't forget our long history of parallel computing...
• "One of These Things Isn't Like the Other... Now What?" – Pat McCormick, LANL, Fall Creek Falls Conference, Chattanooga, TN, Sept. 2009

Programming Model
• Tradeoff: programmer productivity vs. performance
• Web programming is mostly done with scripting and interpreted languages
  – Java, JavaScript
  – Server-side programming languages (Python, Ruby, etc.)
  – Matlab
• Users trade off productivity and performance
  – Moore's Law helps performance
  – Moore's Law hurts productivity (multi-core)
• What languages are being used?
  – http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html

"NO" – Kathy Yelick, 2008 Keynote, Salishan Conference

Exascale Workshop – Dec 2009, San Diego

Exascale Specs – July 2011

Berkeley's 13 Motifs
http://www.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-23.html

What Does This All Mean?
• How do we create architectures to do this?
• Each application is a different point on this 3D grid (actually a curve)
[Figure: 3D design space – axes are Application Performance Metric (e.g., Efficiency, low to high), Compiler Complexity (Cache-based, Stride-1 Physical, Stride-N Physical, Stride-N Smart), and architecture style (SISD, SIMD, MIMD/Threads, Full Custom)]

Convey – ISA (Compiler View)
[Figure: the x86-64 ISA surrounded by personalities – Vector (32-bit float) for seismic, user-written systolic/state machine for bio-informatics, sorting/tree traversal and bit/logical for data mining]

Hybrid-Core Computing – Application-Specific Personalities
• Extend the x86 instruction set
• Convey compilers implement key operations in hardware
• Application areas: life sciences, CAE, financial, oil & gas, custom
• Shared virtual memory: cache-coherent, shared memory; both ISAs address common memory
[Figure: applications compiled by Convey compilers against the x86-64 ISA plus a custom ISA, sharing virtual memory]

HC-1 Hardware
• 2U enclosure
  – Top half of the 2U platform contains the coprocessor
  – Bottom half contains the Intel motherboard
[Figure: coprocessor assembly with FSB mezzanine card and coprocessor memory DIMMs on top of a host x86 server assembly with host memory DIMMs, an x16 PCI-E slot, and 3 x 3.5" disk drives]

Memory Subsystem
• Optimized for 64-bit accesses; 80 GB/sec peak
• Automatically maintains coherency without impacting AE performance
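The Strided Memory Access Performance slide later in the deck measures this behavior with Mark Seager's strid3.c. The following is a generic C sketch of the same access pattern (mine, not strid3.c itself): a scaled strided update whose sustained rate shows how a memory system copes with non-unit strides.

  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  /* y[i] += t * x[i] over every incx-th 64-bit word, as in the strid3.c kernel. */
  static void strided_update(double *y, const double *x, double t,
                             size_t n, size_t incx)
  {
      for (size_t i = 0; i < n * incx; i += incx)
          y[i] += t * x[i];
  }

  int main(void)
  {
      const size_t n = 1u << 20;       /* elements touched at each stride        */
      const size_t max_stride = 64;    /* strides of 1..64 words, as on the slide */
      double *x = malloc(n * max_stride * sizeof *x);
      double *y = malloc(n * max_stride * sizeof *y);
      if (!x || !y) return 1;
      for (size_t i = 0; i < n * max_stride; i++) { x[i] = 1.0; y[i] = 2.0; }

      for (size_t stride = 1; stride <= max_stride; stride++) {
          clock_t t0 = clock();
          strided_update(y, x, 3.0, n, stride);
          double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;
          /* 3 x 8 bytes move per element: read x, read y, write y */
          printf("stride %2zu: %.2f GB/s\n", stride,
                 sec > 0 ? 24.0 * n / sec / 1e9 : 0.0);
      }
      free(x); free(y);
      return 0;
  }

On a cache-line-based host, most of each fetched line is wasted at large strides; the "3131" SG-DIMM interleave is designed to sustain bandwidth at every stride except multiples of 31.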
Personality Development Kit (PDK)
• Customer-designed logic in the Convey infrastructure
• Executes as instructions within an x86-64 address space
• Allows designers to concentrate on intellectual property, not housekeeping
[Figure: AE block diagram – host interface (HIB & IAS), dispatch interface, scalar instructions and exception handling, AE-AE interface, management/debug interface, customer-designed logic, and memory-controller link interfaces to virtually addressable, cache-coherent memory; the management processor handles personality loading, debug, and HW monitoring (IPMI)]

Current Convey Compiler Architecture
[Figure: compiler flow – C/C++ and Fortran95 front ends feed the WHIRL tree-based intermediate representation, an interprocedural analyzer, and the x86 loop-nest optimizer; the Convey vectorizer (with personality-specific optimizations built into the vectorizer) and global scalar optimizer (woptimizer) feed the Intel 64 and Convey code generators, driven by a personality signature and Convey instruction table; object files go to the linker, and the AE FPGA bit file is loaded at runtime]

Next Generation Optimizer
[Figure: same front ends and interprocedural analyzer; the Convey optimizer uses a directed-acyclic-graph intermediate representation, with personality-specific optimizations coded as graph-transformation rules and custom instruction definitions in descriptor files; table-based Intel 64 and Convey code generation emits object files to the linker, and the AE FPGA bit file is loaded at runtime]

Strided Memory Access Performance
• Convey SG-DIMMs optimized for 64-bit memory access
• High bandwidth for non-unit strides and scatter/gather
• 3131 interleave for high bandwidth at all strides except multiples of 31
• Measured with strid3.c, written by Mark Seager:

  for( i=0; i<n*incx; i+=incx ) {
    yy[i] += t*xx[i];
  }

[Figure: measured 64-bit strided bandwidth (GB/sec, 0–60) versus stride (1–64 words) for a single Nehalem core and for the HC-1 with SG-DIMM 3131 interleave]

Performance/$ and Performance/Watt
(Adapted from http://www.graph500.org/Results.html prior to June 22, 2011)

Rank | Machine                                                | Owner                                  | Scale | GTEPS | $K         | Perf/$ | kW   | Perf/W
1    | Intrepid (IBM BlueGene/P, 8192 nodes / 32k cores)      | Argonne National Laboratory            | 36    | 6.600 | $10,400.00 | 635    | 136  | 49
2    | Franklin (Cray XT4, 500 of 9544 nodes)                 | NERSC                                  | 32    | 5.220 | $2,000.00  | 2610   | 80   | 65
3    | cougarxmt (128-node Cray XMT)                          | Pacific Northwest National Laboratory  | 29    | 1.220 | $1,000.00  | 1220   | 100  | 12
4    | graphstorm (128-node Cray XMT)                         | Sandia National Laboratories           | 29    | 1.170 | $1,000.00  | 1170   | 100  | 12
5    | Endeavor (256-node, 512-core Westmere X5670 2.93)      | Intel Corporation                      | 29    | 0.533 | $2,000.00  | 267    | 88.3 | 6
6    | Erdos (64-node Cray XMT)                               | Oak Ridge National Laboratory          | 29    | 0.051 | $500.00    | 102    | 50   | 1
7    | Red Sky (Nehalem X5570, IB torus, 512 processors)      | Sandia National Laboratories           | 28    | 0.478 | $1,500.00  | 319    | 88.3 | 5
8    | Jaguar (Cray XT5-HE, 512-node subset)                  | Oak Ridge National Laboratory          | 27    | 0.800 | $2,700.00  | 296    | 80   | 10
9    | Convey (single-node HC-1ex)                            | Convey Computer Corporation            | 27    | 0.773 | $80.00     | 9663   | 0.75 | 1031
10   | Endeavor (128-node, 256-core Westmere X5670)           | Intel Corporation                      | 26    | 0.616 | $1,000.00  | 616    | 88.3 | 7

November 2011 Graph 500 Benchmark (BFS)
(Problem size 31 and lower)

Rank | System                                                  | Site                           | Scale | MTEPS  | MTEPS/kW
20   | Dingus (Convey HC-1ex – 1 node / 4 cores, 4 FPGAs)      | SNL                            | 28    | 1,759  | 2,345
22   | Vortex (Convey HC-1ex – 1 node / 4 cores, 4 FPGAs)      | Convey Computer                | 28    | 1,675  | 2,233
23   | Convey01 (Convey HC-1ex – 1 node / 4 cores, 4 FPGAs)    | Bielefeld University, CeBiTec  | 28    | 1,615  | 2,153
24   | HC1-d (Convey HC-1 – 1 node / 4 cores, 4 FPGAs)         | Convey Computer                | 28    | 1,601  | 2,135
25   | Convey2 (Convey HC-1 – 1 node / 4 cores, 4 FPGAs)       | LBL/NERSC                      | 28    | 1,598  | 2,130
7    | IBM BlueGene/Q (512 nodes)                              | IBM Research, T.J. Watson      | 30    | 56,523 | 1,809
40   | ultraviolet (4 processors / 32 cores)                   | SNL                            | 29    | 420    | 420
9    | SGI Altix ICE 8400EX (256 nodes / 1024 cores)           | SGI                            | 31    | 14,085 | 363
26   | IBM and ScaleMP/vSMP Foundation (128 cores)             | LANL                           | 29    | 1,551  | 242
32   | IBM and ScaleMP/vSMP Foundation (128 cores)             | LANL                           | 30    | 1,036  | 162
14   | DAS-4/VU (128 processors)                               | VU University                  | 30    | 7,135  | 139
34   | Hyperion (64 nodes / 512 cores)                         | LLNL                           | 31    | 1,004  | 39
43   | Kraken (6128 cores)                                     | LLNL                           | 31    | 105    | 21
35   | Matterhorn (64 nodes)                                   | CSCS                           | 31    | 885    | 18
19   | Todi (176 AMD Interlagos, 176 NVIDIA Tesla X2090)       | CSCS                           | 28    | 3,060  | 17
42   | Knot (64 cores / 8 processors)                          | UCSB                           | 30    | 177    | 9
49   | Gordon (7 nodes / 84 cores)                             | SDSC                           | 29    | 30     | 3
33   | Jaguar (224,256 processors)                             | ORNL                           | 30    | 1,011  | 0
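Graph 500 ranks machines by breadth-first-search traversal rate, in traversed edges per second (TEPS). Purely as a reference point (the actual benchmark generates Kronecker graphs, validates the BFS tree, and reports statistics over many search roots), here is a minimal serial BFS over a compressed-sparse-row graph in C:

  #include <stdlib.h>
  #include <stdint.h>

  /* CSR graph: row_ptr[v]..row_ptr[v+1] indexes the neighbors of v in col_idx. */
  typedef struct {
      int64_t  nv;         /* number of vertices           */
      int64_t *row_ptr;    /* nv+1 offsets into col_idx    */
      int64_t *col_idx;    /* concatenated adjacency lists */
  } csr_graph;

  /* FIFO-queue BFS; returns the number of edges examined so the caller can
     compute TEPS = edges / elapsed_seconds. parent[] doubles as the visited
     mark (-1 = unvisited). */
  int64_t bfs(const csr_graph *g, int64_t root, int64_t *parent)
  {
      int64_t *queue = malloc(g->nv * sizeof *queue);
      int64_t head = 0, tail = 0, edges = 0;
      for (int64_t v = 0; v < g->nv; v++) parent[v] = -1;

      parent[root] = root;
      queue[tail++] = root;
      while (head < tail) {
          int64_t u = queue[head++];
          for (int64_t e = g->row_ptr[u]; e < g->row_ptr[u + 1]; e++) {
              int64_t w = g->col_idx[e];
              edges++;
              if (parent[w] == -1) {    /* first visit: set parent, enqueue */
                  parent[w] = u;
                  queue[tail++] = w;
              }
          }
      }
      free(queue);
      return edges;
  }

The access pattern is irregular and latency-bound, which is why the multithreaded Cray XMTs and the heavily interleaved Convey memory system score so well per dollar and per watt in the tables above.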
Graph – Multi-Threaded
[Figure: graph traversal on a multi-threaded processor]

Multi-Threaded Processor
[Figure: De Bruijn graph]

Now What
• A strawman Exascale system feasible by 2020 (delivered)
• Philosophy/assumptions behind the approach
  – Clock cycle constant
  – Memory, power, and network bandwidth are DEAR
  – Efficiency for delivered ops is more important than peak
  – If we think we know the application profile for Exascale applications, we are kidding ourselves
• Then a counterproposal
  – Contradict myself

Take an Operations Research Method of Prediction
• Moore's Law
• Application specific
  – Matrices
  – Matrix arithmetic
• Hard IP for floating point
• Number of arithmetic engines
• System architecture and programming model
  – Language-directed design
• Resulting analysis
  – Benefit
  – Mean and +/- sigma (if normal)

Moore's Law
• Feature set
  – Every 2 years, twice the logic
  – Thus by 2020: 8 times the logic, same clock rate
  – Mean factor of 7, sigma +/- 2
• Benefit
  – 7 times the performance, same clock rate, same internal architecture

Rock's Law
• Rock's law, named for Arthur Rock, says that the cost of a semiconductor chip fabrication plant doubles every four years. As of 2011, the price had already reached about 8 billion US dollars.
• Rock's law can be seen as the economic flip side of Moore's law: the latter is a direct consequence of the ongoing growth of the capital-intensive semiconductor industry. Innovative and popular products mean more profits, meaning more capital available to invest in ever higher levels of large-scale integration, which in turn leads to the creation of even more innovative products.
http://newsroom.intel.com/community/en_za/blog/2010/10/19/intel-announces-multi-billion-dollar-investment-in-next-generation-manufacturing-in-us
http://en.wikipedia.org/wiki/Rock%27s_law

Application Specific
• Feature set
  – Matrix engine: mean factor of 4 (+4/-1)
• Linpack (% of peak) over the years
  – June 1993: NEC SX-3/44 – 95% (4); Cray Y-MP (16) – 90%
  – June 2006: Earth Simulator (NEC) – 87.5% (5120)
  – June 2010: Jaguar – 75% (224,162 cores)
  – November 2010: Tianhe-1A – 53% (86,016 cores & GPU); French Bull – 83% (17,480 sockets, 140k cores)
  – June 2011: Riken K – 93% (548,352 cores – vector)
• Benefit
  – 90% of peak
  – Matrix arithmetic
  – Outer loop parallel / inner loop vector within ONE functional unit

3D Finite Difference (3DFD) Personality
• Designed for nearest-neighbor operations on structured grids
  – Maximizes data reuse
• Reconfigurable "registers"
  – 1D (vector), 2D, and 3D modes
  – 8192 elements in a register
• Operations on entire cubes
  – "Add points to their neighbor to the left times a scalar" is a single instruction
  – Up to 7 points away in any direction
• Finite difference method for post-stack reverse-time migration

  X(I,J,K) = S0*Y(I  ,J  ,K  )
           + S1*Y(I-1,J  ,K  )
           + S2*Y(I+1,J  ,K  )
           + S3*Y(I  ,J-1,K  )
           + S4*Y(I  ,J+1,K  )
           + S5*Y(I  ,J  ,K-1)
           + S6*Y(I  ,J  ,K+1)
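For reference, the update above is a 7-point stencil. A plain C sketch (mine, not the personality's implementation) makes the data-reuse problem visible: every output point touches seven input points, and on a conventional core that reuse has to come from cache, which is what the 3DFD personality's 1D/2D/3D registers are built to capture.

  #include <stddef.h>

  /* 7-point stencil: x(i,j,k) = sum of s[0..6] times y at the center and its
     six face neighbors. Arrays are ni x nj x nk with k the fastest-varying
     (unit-stride) index; boundary planes are skipped. */
  static void stencil7(int ni, int nj, int nk,
                       const double s[7],
                       const double *y, double *x)
  {
  #define IDX(i, j, k) (((size_t)(i) * nj + (j)) * nk + (k))
      for (int i = 1; i < ni - 1; i++)
          for (int j = 1; j < nj - 1; j++)
              for (int k = 1; k < nk - 1; k++)
                  x[IDX(i,j,k)] = s[0] * y[IDX(i,  j,  k  )]
                                + s[1] * y[IDX(i-1,j,  k  )]
                                + s[2] * y[IDX(i+1,j,  k  )]
                                + s[3] * y[IDX(i,  j-1,k  )]
                                + s[4] * y[IDX(i,  j+1,k  )]
                                + s[5] * y[IDX(i,  j,  k-1)]
                                + s[6] * y[IDX(i,  j,  k+1)];
  #undef IDX
  }

Production reverse-time-migration stencils reach several points in each direction (the personality allows up to 7), which multiplies both the flops and the reuse per grid point.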
FPGA – Hard IP Floating Point
• Feature set, by 2020
  – Fused multiply-add: mean factor of 8 +/- 2
  – Reconfigurable IP multiply and add: mean factor of 4 +/- 1
  – Bigger fixed-point DSPs: mean factor of 3
• Benefit
  – More floating-point ALUs
  – More routing paths

Number of AE FPGAs per Node
• Feature set
  – Extensible AEs: 4, 8, or 16 AEs as a function of physical memory capacity
    • 4 – one byte per flop – mean factor of 1
    • 8 – 1/2 byte per flop – mean factor of 2
    • 16 – 1/4 byte per flop – mean factor of 4
• Benefit
  – More internal parallelism
  – Transparent to the user
  – Potential heterogeneity within a node

Exascale Programming Model
Example 4.1-1: Matrix by Vector Multiply (UPC; note that THREADS is 16,000 here)

  #include <upc_relaxed.h>
  #define N 200*THREADS
  shared [N] double A[N][N];
  shared double b[N], x[N];
  void main()
  {
    int i, j;
    /* reading the elements of matrix A and the
       vector x and initializing the vector b to zeros */
    upc_forall(i=0; i<N; i++; i)
      for(j=0; j<N; j++)
        b[i] += A[i][j]*x[j];
  }

Math Results In
• Using the means:
  – 7 – Moore's Law
  – 4 – matrix arithmetic
  – 0.90 – efficiency (percentage of peak)
  – 8 – fused multiply/add (64 bits)
  – 4 – 16 AEs (user-visible pipelines) per node
  – Or a MEAN of roughly 800 times today
• Best case: 2304
• Worst case: 448

The System
• 2010 base-level node is 80 GFlops/node (peak)
• Thus 7 x 4 x 0.9 x 8 x 4 = 806
  – Mean of 800, +1500 (upside) / -400 (downside)
  – 64 TFlops/node (peak)
  – 16,000 nodes per Exascale Linpack
  – High-speed serial DRAM
• 64-bit virtual address space
  – Flat address space
  – UPC/OpenMP addressing paradigm integrated within the TLB hardware
  – Programming model is 16,000 shared-memory nodes
• Compiler optimizations (user transparent) deal with the local node micro-architecture
• Power is 1.5 KWatts/node (3-4U, rack mounted)
  – 24 MegaWatts/system
  – 32 TBytes/node (288 PetaBytes – system)
  – Physical memory is approximately 60% of the power

INTEGRATED SMP – WDM
[Figure: node block diagram – highly interleaved DRAM (16/32 TeraBytes), multi-lambda optical transmit/receive, and a crossbar at 6.4 TBytes/sec, i.e. 0.1 bytes/sec per peak flop]
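The projection on "The System" slide is just the product of the per-feature mean factors. A toy C calculation (mine, assuming the factors are independent) reproduces the slide's numbers:

  #include <stdio.h>

  int main(void)
  {
      /* Mean improvement factors from the preceding slides */
      double moore      = 7.0;   /* process scaling, 2012 -> 2020      */
      double matrix     = 4.0;   /* application-specific matrix engine */
      double efficiency = 0.9;   /* delivered fraction of peak         */
      double fma        = 8.0;   /* hard-IP fused multiply-add         */
      double aes        = 4.0;   /* 16 AEs (user-visible pipelines)    */

      double speedup   = moore * matrix * efficiency * fma * aes;  /* ~806    */
      double node_tf   = 0.080 * speedup;      /* 80 GFlops/node base, in TF  */
      double nodes     = 16000.0;
      double system_pf = node_tf * nodes / 1000.0;   /* peak, in PFlops       */
      double power_mw  = 1.5 * nodes / 1000.0;       /* 1.5 kW/node, in MW    */

      printf("per-node speedup : %.0fx\n", speedup);
      printf("node peak        : %.1f TFlops\n", node_tf);
      printf("system peak      : %.0f PFlops (about 1 EFlop/s)\n", system_pf);
      printf("system power     : %.0f MW\n", power_mw);
      return 0;
  }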
IBM Optical Technology
http://domino.research.ibm.com/comm/research_projects.nsf/pages/photonics.index.html

COTS ExaOPs/Flop/s System
[Figure: 16,000 single nodes (16 AEs each) connected through an all-optical switch, with I/O and LAN/WAN attachments; 10 meters of fiber is about 50 ns of delay]

Concluding (Almost)
• Uniprocessor performance has to be increased
  – Heterogeneous computing is here to stay
  – The easiest to program will be the correct technology
• Smarter memory systems (PIM)
• New HPC software must be developed
  – SMARTER COMPILERS
  – ARCHITECTURE TRANSPARENT
• New algorithms, not necessarily new languages

CONVEZ COMPUTER
• Design space – IPHONE 6.0S
  – User interface
  – Processor
  – External communications
2009, Sept, "CONVEZ COMPUTER", Annapolis, MD
http://www.lps.umd.edu/AdvancedComputing/

Processor – 2020
• Power budget: 300 milliwatts
• Either ARM or IA-64 compatible
• In 2020, all other ISAs will either be gone or niche players (no volume)
• 64-bit virtual address space (maybe)
• 2 to 4 GBytes of physical memory
• FPGA co-processor (hybrid processor on a die)
• 1 TByte of flash
• 10 to 20 GFlops (real*8)
  – 3D video drives DSP performance
• 2**24 chips
  – Approximately an ExaOP
  – Approximately 64 PetaBytes of DRAM
  – Approximately 2.4 MegaWatts
  – The switch is the cell towers

Virtual Address Space
• 64 bits is NOT enough
  – Exa is 2**60
• Applications reference the network, not just local memory (disk)
• Time to embed/contemplate the network in the address space
  – Unique name
    • IPv6 (unicast host ID)
    • Global phone number
    • MAC address
• 128-bit virtual address space
  – NO NATs
  – JUST LIKE the world-wide phone numbering system
  – JUST LIKE IPv6
  – US 4,656,579, 04/07/1987 – "Digital data processing system having a uniquely organized memory system and means for storing and accessing information therein"

Finally
• Whatever the metric (Flops, Ops, FLOPs/Watt, OPs/Watt), the system which is the simplest to program will win.
• USER cycles are more important than CPU cycles.
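Returning to the Virtual Address Space slide: as a purely hypothetical sketch (mine, not the patent's or the talk's scheme), a 128-bit global address could pair an IPv6-style 64-bit node name with a 64-bit local offset, so that naming hardware can route remote references while local ones fall through to the normal TLB path.

  #include <stdint.h>
  #include <stdio.h>
  #include <inttypes.h>

  /* Hypothetical 128-bit global virtual address: the high 64 bits name a
     node (IPv6-unicast style), the low 64 bits are an offset within that
     node's local virtual address space. */
  typedef struct {
      uint64_t node;      /* globally unique host/node identifier */
      uint64_t offset;    /* local 64-bit virtual address         */
  } gva128;

  /* A remote reference would be routed by node id; a local one is handled
     by the ordinary TLB path. */
  static int gva_is_local(gva128 a, uint64_t my_node)
  {
      return a.node == my_node;
  }

  int main(void)
  {
      gva128 a = { 0x20010db800000001ULL, 0x0000100000000000ULL };
      printf("node %016" PRIx64 " offset %016" PRIx64 " local=%d\n",
             a.node, a.offset, gva_is_local(a, 0x20010db800000001ULL));
      return 0;
  }

The field widths here are arbitrary; the point of the slide is that the name is globally unique, with no NAT-style translation.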