Parallel Computer Architecture Concepts
TDDD93 Lecture 1
Christoph Kessler, PELAB / IDA, Linköping University, Sweden, 2015

Outline – Lecture 1: Parallel Computer Architecture Concepts
• Parallel computer, multiprocessor, multicomputer
• SIMD vs. MIMD execution
• Shared memory vs. distributed memory architecture
• Interconnection networks
• Parallel architecture design concepts
  - Instruction-level parallelism
  - Hardware multithreading
  - Multi-core and many-core
  - Accelerators and heterogeneous systems
  - Clusters
• Implications for programming and algorithm design

Traditional Use of Parallel Computing: in HPC
(figure: a parallel computer)

Example: Weather Forecast (very simplified…)
• 3D space discretization (cells)
• Time discretization (steps)
• Start from current observations (sent from weather stations)
• Simulation step by evaluating the weather model equations
(figure: one cell, carrying air pressure, temperature, humidity, sun radiation, wind direction, wind velocity, …)
(a minimal sketch of such a time-stepping loop is given further below)

Parallel Computing for HPC Applications
• High Performance Computing
  - Much computational work (in FLOPs, floating-point operations)
  - Often, large data sets
  - E.g. climate simulations, particle physics, engineering, sequence matching or protein docking in bioinformatics, …
• Single-CPU computers and even today's multicore processors cannot provide such massive computation power
• Aggregate LOTS of computers → clusters (photo: NSC Triolith)
  - Need scalable algorithms
  - Need to exploit multiple levels of parallelism

Parallel Computer Architecture Concepts
Classification of parallel computer architectures:
• by control structure
• by memory organization
  - in particular, distributed memory vs. shared memory
• by interconnection network topology

Classification by Control Structure
(figure: operations op1 … op4)

Classification by Memory Organization
(figure)

Interconnection Networks (1)
(figure: processors P and routers R)

Interconnection Networks (2): Simple Topologies
(figure: e.g. a fully connected network of processors P)

Interconnection Networks (3): Hypercube
• Inductive definition
(figure; a small code sketch of hypercube node numbering is given further below)

More about Interconnection Networks
• Fat-tree, butterfly, … – see TDDC78
• Switching and routing algorithms
• Discussion of interconnection network properties
  - Cost (#switches, #lines)
  - Scalability (asymptotically, cost grows not much faster than #nodes)
  - Node degree
  - Longest path (→ latency)
  - Accumulated bandwidth
  - Fault tolerance (node or switch failure)
  - …

Instruction-Level Parallelism (1): Pipelined Execution Units
• SIMD computing with pipelined vector units
  - e.g., vector supercomputers: Cray (1970s, 1980s), Fujitsu, …
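The weather-forecast example above follows the typical HPC simulation pattern: discretize space into cells and time into steps, then update every cell in each step from the values of its neighbours. The following is only an illustrative sketch (not taken from the lecture; the single state variable, the toy update rule and the grid sizes are invented for the example) of what such a serial time-stepping loop looks like in C; a parallel version would distribute the cell loop over cores or cluster nodes.

```c
#include <stdio.h>

#define NX 64
#define NY 64
#define NZ 32
#define STEPS 100

/* One state variable per cell (say, temperature); a real weather model
 * carries pressure, humidity, wind, radiation, ... per cell.           */
static double t_old[NX][NY][NZ], t_new[NX][NY][NZ];

int main(void)
{
    /* ... initialize t_old from current observations ... */

    for (int step = 0; step < STEPS; step++) {             /* time discretization  */
        for (int x = 1; x < NX - 1; x++)                   /* space discretization */
            for (int y = 1; y < NY - 1; y++)
                for (int z = 1; z < NZ - 1; z++)
                    /* toy "model equation": relax towards the neighbour average */
                    t_new[x][y][z] = (t_old[x-1][y][z] + t_old[x+1][y][z] +
                                      t_old[x][y-1][z] + t_old[x][y+1][z] +
                                      t_old[x][y][z-1] + t_old[x][y][z+1]) / 6.0;

        for (int x = 1; x < NX - 1; x++)                   /* new state becomes  */
            for (int y = 1; y < NY - 1; y++)               /* input of next step */
                for (int z = 1; z < NZ - 1; z++)
                    t_old[x][y][z] = t_new[x][y][z];
    }
    printf("t_old[32][32][16] after %d steps: %f\n", STEPS, t_old[32][32][16]);
    return 0;
}
```

All cell updates within one step are independent of each other, which is exactly the parallelism that clusters exploit by partitioning the grid across nodes.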
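To make the inductive hypercube definition above concrete: a d-dimensional hypercube has 2^d nodes, and two nodes are connected exactly when their binary node IDs differ in a single bit (a d-cube is two (d-1)-cubes with corresponding nodes linked). This property is standard but not spelled out on the slide; the small C sketch below enumerates the d neighbours of a node.

```c
#include <stdio.h>

/* Print the d neighbours of node `id` in a d-dimensional hypercube.
 * Node IDs are 0 .. 2^d - 1; flipping bit k of the ID gives the
 * neighbour along dimension k.                                     */
static void hypercube_neighbours(unsigned id, unsigned d)
{
    for (unsigned k = 0; k < d; k++)
        printf("node %u -- node %u  (dimension %u)\n", id, id ^ (1u << k), k);
}

int main(void)
{
    hypercube_neighbours(5, 3);   /* node 101b in a 3-cube: neighbours 100b, 111b, 001b */
    return 0;
}
```

Each node thus has degree d = log2(#nodes) and the network diameter (longest path) is also d; these are the node-degree and longest-path properties listed under "More about Interconnection Networks".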
Instruction-Level Parallelism (2): VLIW and Superscalar
• Multiple functional units in parallel
• 2 main paradigms:
  - VLIW (very long instruction word) architecture
    - parallelism is explicit, programmer-/compiler-managed (hard)
  - Superscalar architecture
    - sequential instruction stream
    - hardware-managed dispatch → power + area overhead
• ILP in applications is limited
  - typically < 3…4 instructions can be issued simultaneously
  - due to control and data dependences in applications
• Solution: multithread the application and the processor

Hardware Multithreading
(figure: a processor P interleaving several hardware threads so that stalls, e.g. due to a data dependence, can be hidden)

SIMD Instructions
• "Single Instruction stream, Multiple Data streams"
  - single thread of control flow
  - restricted form of data parallelism: apply the same primitive operation (a single instruction) in parallel to multiple data elements stored contiguously
  - SIMD units use long "vector registers", each holding multiple data elements
• Common today
  - MMX, SSE, SSE2, SSE3, …
  - Altivec, VMX, SPU, …
• Performance boost for operations on shorter data types
• Area- and energy-efficient
• Code to be rewritten (SIMDized) by programmer or compiler
• Does not help (much) for memory bandwidth
(figure: a SIMD unit applying one operation op elementwise to a "vector register")
(a small SIMD intrinsics sketch is given further below)

The Memory Wall
• Performance gap CPU – memory
• Memory hierarchy
• Increasing cache sizes shows diminishing returns
  - costs power and chip area (GPUs spend the area instead on many simple cores with little memory)
  - relies on good data locality in the application
• What if there is no / little data locality?
  - irregular applications, e.g. sorting, searching, optimization, …
• Solution: spread out / overlap memory access delay
  - programmer/compiler: prefetching, on-chip pipelining, SW-managed on-chip buffers
  - generally: hardware multithreading, again!

Moore's Law
• Exponential increase in transistor density (since 1965)
(graph; data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović)

Moore's Law vs. Clock Frequency
• #Transistors / mm² still growing exponentially according to Moore's Law
• Clock speed flattening out at ~3 GHz (2003)
(graph)

The Power Issue
• Power = static (leakage) power + dynamic (switching) power
• Dynamic power ~ Voltage² * Clock frequency, where Clock frequency approx. ~ voltage
  → Dynamic power ~ Frequency³
• Total power ~ #processors
• More transistors + limited frequency ⇒ more cores
(a short worked example is given further below)

Conclusion: Moore's Law Continues, But ...
(graph; data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović)

Single-processor Performance Scaling
(graph: log2 speedup over technology generations 90 nm … 22 nm, composed of RISC/CISC CPI, pipelining, device speed and parallelism contributions; throughput increased ~55%/year, device speed assumed to increase ~17%/year; limits: clock rate, RISC ILP; source: Doug Burger, UT Austin 2005)

Solution for CPU Design: Multicore + Multithreading
• Single-thread performance does not improve any more
  - ILP wall
  - memory wall
  - power wall
• but we can put more cores on a chip
  - and hardware-multithread the cores to hide memory latency
• All major chip manufacturers produce multicore CPUs today

Main features of a multicore system
• There are multiple computational cores on the same chip.
• The cores might have (small) private on-chip memory modules and/or access to on-chip memory shared by several cores.
• The cores have access to a common off-chip main memory.
• There is a way by which these cores communicate with each other and/or with the environment.
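To illustrate the "SIMD Instructions" slide above: the sketch below (an illustration only, not taken from the lecture) shows what SIMDized code can look like when written with SSE intrinsics in C. One 128-bit vector register holds four floats, so one add instruction performs four additions on contiguously stored data.

```c
#include <xmmintrin.h>   /* SSE intrinsics: one __m128 register = 4 floats */

/* c[i] = a[i] + b[i]; for brevity, n is assumed to be a multiple of 4
 * (a real version would also handle the remainder elements).          */
void vec_add(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);    /* load 4 contiguous floats     */
        __m128 vb = _mm_loadu_ps(&b[i]);
        __m128 vc = _mm_add_ps(va, vb);     /* one instruction, 4 additions */
        _mm_storeu_ps(&c[i], vc);           /* store 4 results              */
    }
}
```

Note that the loop still streams all operands through memory, which is why the slide points out that SIMD does not help much against the memory-bandwidth limit.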
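As a worked example of the frequency relation on "The Power Issue" slide (the numbers are illustrative and ignore static leakage power): since dynamic power ~ Voltage² * Clock frequency and the feasible clock frequency scales roughly with the voltage, dynamic power ~ Frequency³. Hence

  one core at half the clock frequency:   (1/2)³ = 1/8 of the dynamic power
  two such cores instead of one:          2 * 1/8 = 1/4 of the dynamic power,
                                          at roughly the same aggregate throughput

This idealized trade-off is exactly what "more transistors + limited frequency ⇒ more cores" expresses.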
Standard CPU Multicore Designs
• Standard desktop/server CPUs have a few cores with shared off-chip main memory
  - on-chip cache (typ. 2 levels)
    - L1-cache mostly core-private
    - L2-cache often shared by groups of cores
  - memory access interface shared by all or groups of cores
• Caching → multiple copies of the same data item
• Writing to one copy (only) causes inconsistency
• Shared memory coherence mechanism to enforce automatic updating or invalidation of all copies around
→ More about shared-memory architecture, caches, data locality, consistency issues and coherence protocols in TDDC78/TDDD56
(figure: cores with private L1$, shared L2$, interconnect / memory interface, off-chip main memory (DRAM))

Some early dual-core CPUs (2004/2005)
(figure: IBM Power5 (2004) with two SMT cores, AMD Opteron Dualcore (2005), Intel Xeon Dualcore (2005); each with cores P0/P1, per-core L1$/D1$, L2$, memory controller and main memory)
$ = "cache", L1$ = "level-1 instruction cache", D1$ = "level-1 data cache", L2$ = "level-2 cache" (uniform)

SUN/Oracle SPARC T Niagara (8 cores)
• Niagara T1 (2005): 8 cores, 32 HW threads
• Niagara T2 (2008): 8 cores, 64 HW threads
• Niagara T3 (2010): 16 cores, 128 HW threads
• T5 (2012): 16 cores, 128 HW threads
(figure: Sun UltraSPARC "Niagara" T1 with cores P0 … P7, per-core L1$/D1$, shared L2$, four memory controllers and main memory banks)

SUN / Oracle SPARC-T5 (2012)
• 28 nm process, 16 cores x 8 HW threads
• L3 cache on-chip
• On-die accelerators for common encryption algorithms

Scaling Up: Network-On-Chip
• Cache-coherent shared memory (hardware-controlled) does not scale well to many cores
  - power- and area-hungry
  - signal latency across the whole chip
  - not well predictable access times
• Idea: NCC-NUMA – non-cache-coherent, non-uniform memory access
  - physically distributed on-chip [cache] memory,
  - on-chip network, connecting PEs or coherent "tiles" of PEs,
  - global shared address space,
  - but software responsible for maintaining coherence
• Examples: STI Cell/B.E., Tilera TILE64, Intel SCC, Kalray MPPA

Example: Cell/B.E. (IBM/Sony/Toshiba 2006)
• An on-chip network (four parallel unidirectional rings) interconnects the master core, the slave cores and the main memory interface
• LS = local on-chip memory, PPE = master, SPE = slave

Towards Many-Core CPUs...
• Tilera TILE64 (2007): 64 cores, 8x8 2D-mesh on-chip network
• 1 tile: VLIW processor + cache + router
(image simplified)

Towards Many-Core Architectures
• Intel, second-generation many-core research processor: the 48-core (in-order x86) SCC "Single-chip Cloud Computer", 2010
  - no longer fully cache coherent over the entire chip
  - MPI-like message passing over a 2D mesh network on chip (a minimal sketch in this style is given further below)
(figure: 2D mesh of tiles with cores P, caches C and routers R, memory controllers and I/O at the edges; source: Intel)

Kalray MPPA-256
• 16 tiles with 16 VLIW compute cores each, plus 1 control core per tile
• Message-passing network on chip
• Virtually unlimited array extension by clustering several chips
• 28 nm CMOS technology
• Low power dissipation, typ. 5 W

Intel Xeon Phi (since late 2012)
• Up to 61 cores, 244 HW threads, 1.2 Tflops peak performance
• Simpler x86 (Pentium) cores (x 4 HW threads), with 512-bit wide SIMD vector registers
• Can also be used as a coprocessor, instead of a GPU
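Several of the many-core designs above (Intel SCC, Kalray MPPA) drop chip-wide cache coherence and rely on explicit message passing instead, the same programming style used on clusters. The sketch below is a hedged illustration using standard MPI (not the vendor-specific on-chip communication APIs): process 0 sends a value to process 1 explicitly rather than through a coherent shared variable.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process/core am I? */

    if (rank == 0) {
        value = 42;
        /* explicit communication instead of a coherent shared variable */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %d from process 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```

The data movement that a hardware coherence protocol would perform implicitly is spelled out explicitly by the programmer here.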
"General-purpose" GPUs
• Main GPU providers for laptop/desktop: Nvidia, AMD (ATI), Intel
• Example: NVIDIA's 10-series GPU (Tesla, 2008) has 240 cores
• Each core has a
  - floating point / integer unit
  - logic unit
  - move, compare unit
  - branch unit
• Cores managed by thread manager
• Thread manager can spawn and manage 30,000+ threads
• Zero-overhead thread switching
• Nvidia Tesla C1060: 933 GFlops
(images removed)

Nvidia Fermi (2010): 512 cores
(figure: one Fermi C2050 GPU with L2 cache; one "shared-memory multiprocessor" (SM) with I-cache, scheduler, dispatch unit, register file, 32 streaming processors (cores), load/store units, special function units, and 64K configurable L1 cache / shared memory; one streaming processor (SP) contains an FPU and an integer unit (IntU))

GPU Architecture Paradigm
• Optimized for high throughput
  - in theory, ~10x to ~100x higher throughput than CPU is possible
• Massive hardware-multithreading hides memory access latency
• Massive parallelism
• GPUs are good at data-parallel computations
  - multiple threads executing the same instruction on different data, preferably located adjacently in memory

The future will be heterogeneous!
Need 2 kinds of cores – often on the same chip:
• For non-parallelizable code:
  parallelism only from running several serial applications simultaneously on different cores (e.g. on a desktop: word processor, email, virus scanner, … not much more; or on cloud servers)
  → few (ca. 4-8) "fat" cores (power-hungry, area-costly, caches, out-of-order issue, …) for high single-thread performance
• For well-parallelizable code:
  → hundreds of simple cores (power- and area-efficient, GPU-/SCC-like)

Heterogeneous / Hybrid Multi-/Manycore Systems
Key concept: master-slave parallelism, offloading
• General-purpose CPU (master) processor controls execution of slave processors by submitting tasks to them and transferring operand data to the slaves' local memory
  → master offloads computation to the slaves
• Slaves often optimized for heavy throughput computing
• Master could do something else while waiting for the result, or switch to a power-saving mode
• Examples: Cell/B.E.; GPU-based system: offload heavy computation
(figure: CPU with main memory, data transfer to a GPU with its device memory)
(a hedged sketch of this offload pattern is given further below)

Heterogeneous / Hybrid Multi-/Manycore
• Master and slave cores might reside on the same chip (e.g., Cell/B.E.) or on different chips (e.g., most GPU-based systems today)
• Slaves might have access to off-chip main memory (e.g., Cell) or not (e.g., today's GPUs)

Multi-GPU Systems
• Connect one or few general-purpose (CPU) multicore processors with shared off-chip memory to several GPUs
• Increasingly popular in high-performance computing
  - cost- and (quite) energy-effective if the offloaded computation fits the GPU architecture well
(figure: multicore CPUs with L2 caches and shared main memory (DRAM), attached to several GPUs)

Reconfigurable Computing Units
• FPGA – Field-Programmable Gate Array

Example: Beowulf-class PC Clusters
• with off-the-shelf CPUs (Xeon, Opteron, …)

Cluster Example: Triolith (NSC, 2012 / 2013)
• Capability cluster (fast network for parallel applications)
• Final configuration: 1200 HP SL230 compute nodes, each equipped with 2 Intel E5-2660 (2.2 GHz Sandy Bridge) processors with 8 cores each
  - 19200 cores in total
  - theoretical peak performance of 338 Tflop/s
• Mellanox Infiniband network
(photo: NSC Triolith; source: NSC)
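The master-slave offloading pattern described above can be expressed in several programming models (CUDA, OpenCL, OpenMP, …). The sketch below uses OpenMP target directives as one possible illustration (an assumption of this example, not the model prescribed by the lecture): the host (master) maps the operand arrays into the device (slave) memory, the loop is executed on the device, and the result array is copied back to main memory.

```c
#include <stdio.h>

#define N 1000000

static float a[N], b[N], c[N];

int main(void)
{
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    /* Offload: copy a and b to device memory, run the loop on the device,
     * copy c back to host main memory when the region ends.               */
    #pragma omp target teams distribute parallel for \
            map(to: a[0:N], b[0:N]) map(from: c[0:N])
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[10] = %f\n", c[10]);
    return 0;
}
```

The explicit map clauses correspond to the "data transfer" arrows between main memory and device memory in the figure; without an attached accelerator, the same code simply runs on the host cores.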
The Challenge
• Today, basically all computers are parallel computers!
  - single-thread performance stagnating
  - dozens, hundreds of cores
  - hundreds, thousands of hardware threads
  - heterogeneous (core types, accelerators)
  - data locality matters
  - clusters for HPC, require message passing
• Utilizing more than one CPU core requires thread-level parallelism
• One of the biggest software challenges: exploiting parallelism
  - every programmer will eventually have to deal with it
  - all application areas, not only traditional HPC (general-purpose, graphics, games, embedded, DSP, …)
  - affects HW/SW system architecture, programming languages, algorithms, data structures …
  - parallel programming is more error-prone (deadlocks, races, further sources of inefficiencies) and thus more expensive and time-consuming

Can't the compiler fix it for us?
• Automatic parallelization?
  - at compile time: static analysis – not effective for pointer-based languages
    - requires programmer hints / rewriting ...
    - ok for a few benign special cases: (Fortran) loop SIMDization, extraction of instruction-level parallelism, …
  - at run time (e.g. speculative multithreading)
    - high overheads, not scalable
• More about parallelizing compilers in TDDD56 + TDDC78

And worse yet,
• A lot of variations/choices in hardware
  - many will have performance implications
  - no standard parallel programming model → portability issue
• Understanding the hardware will make it easier to make programs get high performance
  - performance-aware programming gets more important also for single-threaded code
  - adaptation leads to the portability issue again

The Challenge
• Bad news 1: Many programmers (also less skilled ones) need to use parallel programming in the future
• Bad news 2: There will be no single uniform parallel programming model as we were used to in the old sequential times
  → several competing general-purpose and domain-specific languages and their concepts will co-exist
• How to write future-proof parallel programs?

What we learned in the past…
• Sequential von Neumann model: programming, algorithms, data structures, complexity
• Sequential / few-threaded languages: C/C++, Java, ... – not designed for exploiting massive parallelism
• time T(n) = O( n log n )   (n = problem size)

… and what we need now
• Parallel programming!
• Parallel algorithms and data structures
• Analysis / cost model: parallel time, work, cost; scalability
• Performance-awareness: data locality, load balancing, communication
• time T(n,p) = O( (n log n)/p + log p )   (n = problem size, p = number of processing units used)
  (a short worked example of this cost model follows after the last slide)

Questions?
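To make the parallel cost model on the "… and what we need now" slide concrete, here is a small worked example with illustrative numbers (not from the lecture), treating the O(...) bounds as rough operation counts:

  take n = 2^20 ≈ 10^6 elements, so n log2 n ≈ 2^20 * 20 ≈ 2*10^7 operations
  sequential:        T(n)   ~ 2*10^7 steps
  p = 1024 units:    T(n,p) ~ (2*10^7)/1024 + log2 1024 ≈ 2*10^4 + 10 steps
  speedup:           T(n)/T(n,p) ≈ 1000, i.e. close to p

The additive log p term only becomes noticeable when p grows so large that little work remains per processing unit, which is the scalability aspect the cost model is meant to capture.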