Node Characteristics
  Number of Cores: 32
  Peak Performance (2.3 GHz): 294 Gflops
  Memory Size: 32-128 GB per node
  Memory Bandwidth: 102 GB/sec
[Figure: node diagram with X, Y and Z axis labels]

New AVX instruction set
  AVX = Advanced Vector eXtensions
  Both 128-bit and 256-bit “vector length” instructions
Composed of 8 “core modules”
A core module has shared and dedicated components
  Two independent integer schedulers and a shared 2x128-bit FP resource
Flexible architecture
  A single “thread” can use the entire FP resource
[Figure: core module block diagram. Blocks: Fetch, Decode, two Integer Schedulers, two Integer Functional Units, FP Scheduler, two 128-bit FMAC units, shared 2 Mbyte L2 cache, shared 8 Mbyte L3 cache and NB. Legend: dedicated components, shared at the module level, shared at the chip level]

1 MPI Rank or Thread per Core Module
  Each MPI rank has exclusive access to the 2x128-bit FP unit and is capable of 8 FP results per clock cycle
  Maximizes memory per core and memory per rank
  Larger L2/L3 cache per MPI rank
  The peak of the chip is not reduced
  Better with well-vectorized code
[Figure: core module block diagram. Legend: active components, idle components]

2 MPI Ranks or Threads per Core Module
  Each unit has exclusive access to an integer scheduler, integer pipelines and L1 Dcache
  The 2x128-bit FP unit and the L2 cache are shared between units
  AVX instructions are dynamically executed as two 128-bit instructions using either or both FP units
  Best for highly parallel integer or mostly scalar applications
[Figure: core module block diagram. Legend: MPI Rank 1, MPI Rank 2, shared components]

Node Characteristics
  Number of X86 Cores: 16
  X86 Peak: 147 Gflops
  Accelerator Peak: ~1 Tflop
  X86 Memory: 16 or 32 GB capacity at 51 GB/sec
  Accelerator Memory: 12 GB capacity at 225 GB/sec
[Figure: node diagram with X, Y and Z axis labels]

“Kepler” accelerator
  Peak: ~1 Tflop (64-bit)
  Memory: 12 GB
  Memory BW: ~225 GB/sec
  Several architectural improvements over Fermi generation

MPI Support
  ~1.2 µs latency
  ~15M independent messages/sec/NIC
  BTE for large messages
  FMA stores for small messages
  One-sided MPI
  Small, scalable memory footprint
Advanced Synchronization and Communication Features
  Globally addressable memory
  Atomic memory operations
  Pipelined global loads and stores
  ~25M (65M) independent (indexed) Puts/sec/NIC
  Efficient support for UPC, CAF, and Global Arrays
Embedded high-performance router
  Adaptive routing
  Scales to over 100,000 endpoints

Globally addressable memory provides efficient support for UPC, Co-array FORTRAN, Shmem and Global Arrays
  Cray Programming Environment will target this capability directly
Pipelined global loads and stores
  Allow fast irregular communication patterns
Atomic memory operations
  Provide the fast synchronization needed for one-sided communication models
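
As a cross-check on the node peak figures quoted earlier, the sketch below recomputes them from the per-module numbers on the slides: 8 FP results per clock cycle from each module's shared 2x128-bit FP unit, at 2.3 GHz. It is an illustration under stated assumptions, not a vendor formula: the function and parameter names are hypothetical, two cores per module is inferred from the two integer schedulers per module, and the 2.3 GHz clock is assumed to apply to the 16-core X86 side of the accelerator node as well (the slides state the clock only for the 32-core node).

```python
def node_peak_gflops(cores, clock_ghz,
                     cores_per_module=2,           # assumption: two integer cores share one module
                     flops_per_clock_per_module=8  # from the slides: 8 FP results per clock per FP unit
                     ):
    """Peak double-precision Gflops for the X86 cores of one node."""
    modules = cores // cores_per_module  # one shared 2x128-bit FP unit per module
    return modules * flops_per_clock_per_module * clock_ghz


def bytes_per_flop(bandwidth_gb_per_s, peak_gflops):
    """Machine balance: memory bytes available per peak floating-point operation."""
    return bandwidth_gb_per_s / peak_gflops


if __name__ == "__main__":
    cpu_node = node_peak_gflops(cores=32, clock_ghz=2.3)  # 32-core node
    x86_side = node_peak_gflops(cores=16, clock_ghz=2.3)  # X86 side of the accelerator node
    print(f"32-core node peak: {cpu_node:.1f} Gflops")
    print(f"16-core X86 peak:  {x86_side:.1f} Gflops")
    print(f"32-core node balance: {bytes_per_flop(102, cpu_node):.2f} bytes/flop")
    print(f"accelerator balance:  {bytes_per_flop(225, 1000):.2f} bytes/flop")
```

Rounded to whole Gflops, the two results agree with the 294 and 147 Gflops quoted in the node tables. Because the FP unit is counted once per module, the same peak is obtained whether one or two ranks run on each module, which is consistent with the statement that the peak of the chip is not reduced in the one-rank-per-module mode.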