Parallel Programming Trends in Extremely Scalable Architectures
Carlo Cavazzoni, HPC department, CINECA

CINECA
CINECA is a non-profit consortium made up of 50 Italian universities, the National Institute of Oceanography and Experimental Geophysics (OGS), the CNR (National Research Council), and the Ministry of Education, University and Research (MIUR).
CINECA is the largest Italian computing centre and one of the most important worldwide. The HPC department manages the HPC infrastructure, provides support to Italian and European researchers, and promotes technology transfer initiatives for industry.

Why parallel programming?
• Solve larger problems
• Run memory demanding codes
• Solve problems with greater speed

Modern Parallel Architectures
Two basic architectural schemes:
• Distributed Memory
• Shared Memory
Now most computers have a mixed architecture + accelerators -> hybrid architectures

Distributed Memory
[Diagram: nodes, each with its own CPU and memory, connected through a NETWORK]

Shared Memory
[Diagram: several CPUs attached to a single memory]

Real Shared
[Diagram: CPUs connected to memory banks through a System Bus]

Virtual Shared
[Diagram: nodes, each with a CPU and a HUB, connected through a Network]

Mixed Architectures
[Diagram: nodes, each with several CPUs sharing one memory, connected through a NETWORK]

Most Common Networks
• Switched (switch)
• Cube, hypercube, n-cube
• Torus in 1,2,...,N Dim
• Fat Tree

HPC Trends
[Figure: number of cores of the no. 1 system from the Top500, June 1993 to June 2011; vertical axis: number of cores, 0 to 600000]
Paradigm Change in HPC
Next HPC system installed in CINECA will have 200000 cores
…. What about applications?

Roadmap to Exascale (architectural trends)

Dennard Scaling law (MOSFET)
$L' = L / 2$
$V' = V / 2$
$F' = F \cdot 2$
$D' = 1 / L'^2 = 4 D$
$P' = P$
does not hold anymore!

What holds now:
$L' = L / 2$
$V' \approx V$
$F' \approx F \cdot 2$
$D' = 1 / L'^2 = 4 D$
$P' = 4 P$

The core frequency and performance no longer grow following Moore's law.
The power crisis!
CPU + Accelerator to keep the evolution of architectures on Moore's law.
Programming crisis!

Where are Watts burnt?
Today (at 40nm), moving three 64-bit operands (A, B, C) to compute a 64-bit floating-point FMA
$D = A + B \cdot C$
takes 4.7x the energy of the FMA operation itself.
Extrapolating down to 10nm integration, the energy required to move data becomes 100x!

MPP System
Arch: Option for BG/Q
When?: 2012
PFlop/s: >2
Power: >1 MWatt
Cores: >150000
Threads: >500000

Accelerator
"A set (one or more) of very simple execution units that can perform few operations (with respect to a standard CPU) with very high efficiency. When combined with a full featured CPU (CISC or RISC), it can accelerate the “nominal” speed of a system." (Carlo Cavazzoni)
[Diagram labels: CPU, single thread perf.; ACC., throughput; CPU & ACC; architectural integration; physical integration]

nVIDIA GPU
Fermi implementation will pack 512 processor cores

ATI FireStream, AMD GPU
2012: New Graphics Core Next “GCN”, with new instruction set and new SIMD design

Intel MIC (Knights Ferry)

What about parallel App?
In a massively parallel context, an upper limit for the scalability of parallel applications is determined by the fraction of the overall execution time spent in non-scalable operations (Amdahl's law).
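To make the Amdahl's law statement concrete, here is a minimal Python sketch. It assumes the standard finite-core form of the law, S(N) = 1 / ((1 - P) + P / N), where P is the parallel fraction and N the number of cores. The function names `amdahl_speedup` and `max_speedup` are illustrative and do not come from the slides.

```python
def amdahl_speedup(parallel_fraction: float, n_cores: int) -> float:
    """Speedup predicted by Amdahl's law when running on n_cores.

    Assumes the parallel part scales perfectly and the serial part
    does not scale at all.
    """
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cores)


def max_speedup(parallel_fraction: float) -> float:
    """Limit of the speedup as the number of cores goes to infinity."""
    return 1.0 / (1.0 - parallel_fraction)


if __name__ == "__main__":
    n_cores = 1_000_000
    # Compare a few parallel fractions on one million cores.
    for p in (0.99, 0.9999, 0.999999):
        s = amdahl_speedup(p, n_cores)
        print(f"P = {p}: speedup on {n_cores} cores = {s:.1f}, "
              f"upper limit = {max_speedup(p):.1f}")
```

Under these assumptions, a code with a serial fraction of 0.000001 reaches a speedup of roughly 500000 on 1000000 cores: when the serial fraction equals 1/N, about half of the machine is effectively lost.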
Maximum speedup tends to $1 / (1 - P)$, where P = parallel fraction.
With 1000000 cores: P = 0.999999, serial fraction = 0.000001

Programming Models
• Message Passing (MPI)
• Shared Memory (OpenMP)
• Partitioned Global Address Space Programming (PGAS) Languages: UPC, Coarray Fortran, Titanium
• Next Generation Programming Languages and Models: Chapel, X10, Fortress
• Languages and Paradigm for Hardware Accelerators: CUDA, OpenCL
• Hybrid: MPI + OpenMP + CUDA/OpenCL

Trends
• Scalar Application
• Vector
• Distributed memory (MPP System, Message Passing: MPI)
• Shared Memory (Multi core nodes: OpenMP)
• Hybrid codes (Accelerator (GPGPU, FPGA): CUDA, OpenCL)

Message Passing: domain decomposition
[Diagram: nodes, each with its own CPU and memory, connected through an Internal High Performance Network]

Ghost Cells - Data exchange
[Diagram: five-point stencil (i,j), (i-1,j), (i+1,j), (i,j-1), (i,j+1) at the sub-domain boundaries between Processor 1 and Processor 2]
Ghost Cells exchanged between processors at every update

Message Passing: MPI
Main Characteristics
• Library
• Coarse grain
• Inter node parallelization (few real alternatives)
• Domain partition
• Distributed Memory
• Almost all HPC parallel App
Open Issues
• Latency
• OS jitter
• Scalability

Shared memory
[Diagram: one node with four CPUs running Thread 0 to Thread 3, all accessing variables x and y in the same memory]

Shared Memory: OpenMP
Main Characteristics
• Compiler directives
• Medium grain
• Intra node parallelization (pthreads)
• Loop or iteration partition
• Shared memory
• Many HPC App
Open Issues
• Thread creation overhead
• Memory/core affinity
• Interface with MPI

OpenMP

    !$omp parallel do
    do i = 1 , nsl
       call 1DFFT along z ( f [ offset( threadid ) ] )
    end do
    !$omp end parallel do

    call fw_scatter ( . . . )

    !$omp parallel
    do i = 1 , nzl
       !$omp parallel do
       do j = 1 , Nx
          call 1DFFT along y ( f [ offset( threadid ) ] )
       end do
       !$omp parallel do
       do j = 1, Ny
          call 1DFFT along x ( f [ offset( threadid ) ] )
       end do
    end do
    !$omp end parallel

Accelerator/GPGPU
Sum of 1D array

CUDA sample

    // CPU version: one loop over all elements
    void CPUCode( int* input1, int* input2, int* output, int length )
    {
        for ( int i = 0; i < length; ++i ) {
            output[ i ] = input1[ i ] + input2[ i ];
        }
    }

    // GPU version: the loop index becomes the global thread index
    __global__ void GPUCode( int* input1, int* input2, int* output, int length )
    {
        int idx = blockDim.x * blockIdx.x + threadIdx.x;
        if ( idx < length ) {
            output[ idx ] = input1[ idx ] + input2[ idx ];
        }
    }

Each thread executes one loop iteration

CUDA, OpenCL
Main Characteristics
• Ad-hoc compiler
• Fine grain
• Offload parallelization (GPU)
• Single iteration parallelization
• Ad-hoc memory
• Few HPC App
Open Issues
• Memory copy
• Standard
• Tools
• Integration with other languages

Hybrid (MPI+OpenMP+CUDA+… …+python)
• Take the positives of all models
• Exploit memory hierarchy
• Many HPC applications are adopting this model
• Mainly due to developer inertia
• Hard to rewrite millions of source lines

Hybrid parallel programming
Python: Ensemble simulations
MPI: Domain partition
OpenMP: External loop partition
CUDA: assign inner loop iterations to GPU threads
Example: Quantum ESPRESSO

Storage I/O
• The I/O subsystem is not keeping pace with the CPU
• Checkpointing will not be possible
• Reduce I/O
• On the fly analysis and statistics
• Disk only for archiving
• Scratch on non volatile memory (“close to RAM”)

PRACE
PRACE Research Infrastructure: the top level of the European HPC ecosystem
The vision of PRACE is to enable and support European global leadership in public and private research and development.
CINECA (representing Italy) is a hosting member of PRACE and can host a Tier-0 system.
Tier 0: European (PRACE)
Tier 1: National (CINECA today)
Tier 2: Local

FERMI @ CINECA
PRACE Tier-0 System
Architecture: 10 BGQ Frame
Model: IBM-BG/Q
Processor Type: IBM PowerA2, 1.6 GHz
Computing Cores: 163840
Computing Nodes: 10240
RAM: 1 GByte / core
Internal Network: 5D Torus
Disk Space: 2 PByte of scratch space
Peak Performance: 2 PFlop/s
ISCRA & PRACE call for projects now open!

Conclusion
Parallel programming trends in extremely scalable architectures
• Exploit millions of ALU
• Hybrid Hardware
• Hybrid codes
• Memory Hierarchy
• Flops/Watt (more than Flops/Sec)
• I/O subsystem
• Non volatile memory
• Fault Tolerance!