Parallel Computer Architecture Concepts
TDDD93 Lecture 1
Christoph Kessler, PELAB / IDA, Linköping University, Sweden, 2015

Outline – Lecture 1: Parallel Computer Architecture Concepts
• Parallel computer, multiprocessor, multicomputer
• SIMD vs. MIMD execution
• Shared memory vs. distributed memory architecture
• Interconnection networks
• Parallel architecture design concepts
  - Instruction-level parallelism
  - Hardware multithreading
  - Multi-core and many-core
  - Accelerators and heterogeneous systems
  - Clusters
• Implications for programming and algorithm design

Traditional Use of Parallel Computing: in HPC
(figure: a parallel computer)

Example: Weather Forecast (very simplified…)
• 3D space discretization (cells)
• Time discretization (steps)
• Start from current observations (sent from weather stations)
• Simulation step by evaluating the weather model equations
(figure: one cell, carrying air pressure, temperature, humidity, sun radiation, wind direction, wind velocity, …)
(a minimal sketch of such a time-stepping loop is given further below)

Parallel Computing for HPC Applications
• High Performance Computing
  - Much computational work (in FLOPs, floating-point operations)
  - Often, large data sets
  - E.g. climate simulations, particle physics, engineering, sequence matching or protein docking in bioinformatics, …
• Single-CPU computers and even today's multicore processors cannot provide such massive computation power
• Aggregate LOTS of computers → clusters (photo: NSC Triolith)
  - Need scalable algorithms
  - Need to exploit multiple levels of parallelism

Parallel Computer Architecture Concepts
Classification of parallel computer architectures:
• by control structure
• by memory organization
  - in particular, distributed memory vs. shared memory
• by interconnection network topology

Classification by Control Structure
(figure: operations op1 … op4)

Classification by Memory Organization
(figure)

Interconnection Networks (1)
(figure: processors P and routers R)

Interconnection Networks (2): Simple Topologies
(figure: e.g. a fully connected network of processors P)

Interconnection Networks (3): Hypercube
• Inductive definition
(figure; a small code sketch of hypercube node numbering is given further below)

More about Interconnection Networks
• Fat-tree, butterfly, … – see TDDC78
• Switching and routing algorithms
• Discussion of interconnection network properties
  - Cost (#switches, #lines)
  - Scalability (asymptotically, cost grows not much faster than #nodes)
  - Node degree
  - Longest path (→ latency)
  - Accumulated bandwidth
  - Fault tolerance (node or switch failure)
  - …

Instruction-Level Parallelism (1): Pipelined Execution Units
• SIMD computing with pipelined vector units
  - e.g., vector supercomputers: Cray (1970s, 1980s), Fujitsu, …
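The weather-forecast example above follows the typical HPC simulation pattern: discretize space into cells and time into steps, then update every cell in each step from the values of its neighbours. The following is only an illustrative sketch (not taken from the lecture; the single state variable, the toy update rule and the grid sizes are invented for the example) of what such a serial time-stepping loop looks like in C; a parallel version would distribute the cell loop over cores or cluster nodes.

```c
#include <stdio.h>

#define NX 64
#define NY 64
#define NZ 32
#define STEPS 100

/* One state variable per cell (say, temperature); a real weather model
 * carries pressure, humidity, wind, radiation, ... per cell.           */
static double t_old[NX][NY][NZ], t_new[NX][NY][NZ];

int main(void)
{
    /* ... initialize t_old from current observations ... */

    for (int step = 0; step < STEPS; step++) {             /* time discretization  */
        for (int x = 1; x < NX - 1; x++)                   /* space discretization */
            for (int y = 1; y < NY - 1; y++)
                for (int z = 1; z < NZ - 1; z++)
                    /* toy "model equation": relax towards the neighbour average */
                    t_new[x][y][z] = (t_old[x-1][y][z] + t_old[x+1][y][z] +
                                      t_old[x][y-1][z] + t_old[x][y+1][z] +
                                      t_old[x][y][z-1] + t_old[x][y][z+1]) / 6.0;

        for (int x = 1; x < NX - 1; x++)                   /* new state becomes  */
            for (int y = 1; y < NY - 1; y++)               /* input of next step */
                for (int z = 1; z < NZ - 1; z++)
                    t_old[x][y][z] = t_new[x][y][z];
    }
    printf("t_old[32][32][16] after %d steps: %f\n", STEPS, t_old[32][32][16]);
    return 0;
}
```

All cell updates within one step are independent of each other, which is exactly the parallelism that clusters exploit by partitioning the grid across nodes.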
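To make the inductive hypercube definition above concrete: a d-dimensional hypercube has 2^d nodes, and two nodes are connected exactly when their binary node IDs differ in a single bit (a d-cube is two (d-1)-cubes with corresponding nodes linked). This property is standard but not spelled out on the slide; the small C sketch below enumerates the d neighbours of a node.

```c
#include <stdio.h>

/* Print the d neighbours of node `id` in a d-dimensional hypercube.
 * Node IDs are 0 .. 2^d - 1; flipping bit k of the ID gives the
 * neighbour along dimension k.                                     */
static void hypercube_neighbours(unsigned id, unsigned d)
{
    for (unsigned k = 0; k < d; k++)
        printf("node %u -- node %u  (dimension %u)\n", id, id ^ (1u << k), k);
}

int main(void)
{
    hypercube_neighbours(5, 3);   /* node 101b in a 3-cube: neighbours 100b, 111b, 001b */
    return 0;
}
```

Each node thus has degree d = log2(#nodes) and the network diameter (longest path) is also d; these are the node-degree and longest-path properties listed under "More about Interconnection Networks".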
Instruction-Level Parallelism (2): VLIW and Superscalar
• Multiple functional units in parallel
• 2 main paradigms:
  - VLIW (very long instruction word) architecture
    - parallelism is explicit, programmer-/compiler-managed (hard)
  - Superscalar architecture
    - sequential instruction stream
    - hardware-managed dispatch → power + area overhead
• ILP in applications is limited
  - typically < 3…4 instructions can be issued simultaneously
  - due to control and data dependences in applications
• Solution: multithread the application and the processor

Hardware Multithreading
(figure: a processor P interleaving several hardware threads so that stalls, e.g. due to a data dependence, can be hidden)

SIMD Instructions
• "Single Instruction stream, Multiple Data streams"
  - single thread of control flow
  - restricted form of data parallelism: apply the same primitive operation (a single instruction) in parallel to multiple data elements stored contiguously
  - SIMD units use long "vector registers", each holding multiple data elements
• Common today
  - MMX, SSE, SSE2, SSE3, …
  - Altivec, VMX, SPU, …
• Performance boost for operations on shorter data types
• Area- and energy-efficient
• Code to be rewritten (SIMDized) by programmer or compiler
• Does not help (much) for memory bandwidth
(figure: a SIMD unit applying one operation op elementwise to a "vector register")
(a small SIMD intrinsics sketch is given further below)

The Memory Wall
• Performance gap CPU – memory
• Memory hierarchy
• Increasing cache sizes shows diminishing returns
  - costs power and chip area (GPUs spend the area instead on many simple cores with little memory)
  - relies on good data locality in the application
• What if there is no / little data locality?
  - irregular applications, e.g. sorting, searching, optimization, …
• Solution: spread out / overlap memory access delay
  - programmer/compiler: prefetching, on-chip pipelining, SW-managed on-chip buffers
  - generally: hardware multithreading, again!

Moore's Law
• Exponential increase in transistor density (since 1965)
(graph; data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović)

Moore's Law vs. Clock Frequency
• #Transistors / mm² still growing exponentially according to Moore's Law
• Clock speed flattening out at ~3 GHz (2003)
(graph)

The Power Issue
• Power = static (leakage) power + dynamic (switching) power
• Dynamic power ~ Voltage² * Clock frequency, where Clock frequency approx. ~ voltage
  → Dynamic power ~ Frequency³
• Total power ~ #processors
• More transistors + limited frequency ⇒ more cores
(a short worked example is given further below)

Conclusion: Moore's Law Continues, But ...
(graph; data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović)

Single-processor Performance Scaling
(graph: log2 speedup over technology generations 90 nm … 22 nm, composed of RISC/CISC CPI, pipelining, device speed and parallelism contributions; throughput increased ~55%/year, device speed assumed to increase ~17%/year; limits: clock rate, RISC ILP; source: Doug Burger, UT Austin 2005)

Solution for CPU Design: Multicore + Multithreading
• Single-thread performance does not improve any more
  - ILP wall
  - memory wall
  - power wall
• but we can put more cores on a chip
  - and hardware-multithread the cores to hide memory latency
• All major chip manufacturers produce multicore CPUs today

Main features of a multicore system
• There are multiple computational cores on the same chip.
• The cores might have (small) private on-chip memory modules and/or access to on-chip memory shared by several cores.
• The cores have access to a common off-chip main memory.
• There is a way by which these cores communicate with each other and/or with the environment.
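To illustrate the "SIMD Instructions" slide above: the sketch below (an illustration only, not taken from the lecture) shows what SIMDized code can look like when written with SSE intrinsics in C. One 128-bit vector register holds four floats, so one add instruction performs four additions on contiguously stored data.

```c
#include <xmmintrin.h>   /* SSE intrinsics: one __m128 register = 4 floats */

/* c[i] = a[i] + b[i]; for brevity, n is assumed to be a multiple of 4
 * (a real version would also handle the remainder elements).          */
void vec_add(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);    /* load 4 contiguous floats     */
        __m128 vb = _mm_loadu_ps(&b[i]);
        __m128 vc = _mm_add_ps(va, vb);     /* one instruction, 4 additions */
        _mm_storeu_ps(&c[i], vc);           /* store 4 results              */
    }
}
```

Note that the loop still streams all operands through memory, which is why the slide points out that SIMD does not help much against the memory-bandwidth limit.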
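As a worked example of the frequency relation on "The Power Issue" slide (the numbers are illustrative and ignore static leakage power): since dynamic power ~ Voltage² * Clock frequency and the feasible clock frequency scales roughly with the voltage, dynamic power ~ Frequency³. Hence

  one core at half the clock frequency:   (1/2)³ = 1/8 of the dynamic power
  two such cores instead of one:          2 * 1/8 = 1/4 of the dynamic power,
                                          at roughly the same aggregate throughput

This idealized trade-off is exactly what "more transistors + limited frequency ⇒ more cores" expresses.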
Standard CPU Multicore Designs
• Standard desktop/server CPUs have a few cores with shared off-chip main memory
  - on-chip cache (typ. 2 levels)
    - L1-cache mostly core-private
    - L2-cache often shared by groups of cores
  - memory access interface shared by all or groups of cores
• Caching → multiple copies of the same data item
• Writing to one copy (only) causes inconsistency
• Shared memory coherence mechanism to enforce automatic updating or invalidation of all copies around
→ More about shared-memory architecture, caches, data locality, consistency issues and coherence protocols in TDDC78/TDDD56
(figure: cores with private L1$, shared L2$, interconnect / memory interface, off-chip main memory (DRAM))

Some early dual-core CPUs (2004/2005)
(figure: IBM Power5 (2004) with two SMT cores, AMD Opteron Dualcore (2005), Intel Xeon Dualcore (2005); each with cores P0/P1, per-core L1$/D1$, L2$, memory controller and main memory)
$ = "cache", L1$ = "level-1 instruction cache", D1$ = "level-1 data cache", L2$ = "level-2 cache" (uniform)

SUN/Oracle SPARC T Niagara (8 cores)
• Niagara T1 (2005): 8 cores, 32 HW threads
• Niagara T2 (2008): 8 cores, 64 HW threads
• Niagara T3 (2010): 16 cores, 128 HW threads
• T5 (2012): 16 cores, 128 HW threads
(figure: Sun UltraSPARC "Niagara" T1 with cores P0 … P7, per-core L1$/D1$, shared L2$, four memory controllers and main memory banks)

SUN / Oracle SPARC-T5 (2012)
• 28 nm process, 16 cores x 8 HW threads
• L3 cache on-chip
• On-die accelerators for common encryption algorithms

Scaling Up: Network-On-Chip
• Cache-coherent shared memory (hardware-controlled) does not scale well to many cores
  - power- and area-hungry
  - signal latency across the whole chip
  - not well predictable access times
• Idea: NCC-NUMA – non-cache-coherent, non-uniform memory access
  - physically distributed on-chip [cache] memory,
  - on-chip network, connecting PEs or coherent "tiles" of PEs,
  - global shared address space,
  - but software responsible for maintaining coherence
• Examples: STI Cell/B.E., Tilera TILE64, Intel SCC, Kalray MPPA

Example: Cell/B.E. (IBM/Sony/Toshiba 2006)
• An on-chip network (four parallel unidirectional rings) interconnects the master core, the slave cores and the main memory interface
• LS = local on-chip memory, PPE = master, SPE = slave

Towards Many-Core CPUs...
• Tilera TILE64 (2007): 64 cores, 8x8 2D-mesh on-chip network
• 1 tile: VLIW processor + cache + router
(image simplified)

Towards Many-Core Architectures
• Intel, second-generation many-core research processor: the 48-core (in-order x86) SCC "Single-chip Cloud Computer", 2010
  - no longer fully cache coherent over the entire chip
  - MPI-like message passing over a 2D mesh network on chip (a minimal sketch in this style is given further below)
(figure: 2D mesh of tiles with cores P, caches C and routers R, memory controllers and I/O at the edges; source: Intel)

Kalray MPPA-256
• 16 tiles with 16 VLIW compute cores each, plus 1 control core per tile
• Message-passing network on chip
• Virtually unlimited array extension by clustering several chips
• 28 nm CMOS technology
• Low power dissipation, typ. 5 W

Intel Xeon Phi (since late 2012)
• Up to 61 cores, 244 HW threads, 1.2 Tflops peak performance
• Simpler x86 (Pentium) cores (x 4 HW threads), with 512-bit wide SIMD vector registers
• Can also be used as a coprocessor, instead of a GPU
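Several of the many-core designs above (Intel SCC, Kalray MPPA) drop chip-wide cache coherence and rely on explicit message passing instead, the same programming style used on clusters. The sketch below is a hedged illustration using standard MPI (not the vendor-specific on-chip communication APIs): process 0 sends a value to process 1 explicitly rather than through a coherent shared variable.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process/core am I? */

    if (rank == 0) {
        value = 42;
        /* explicit communication instead of a coherent shared variable */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %d from process 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```

The data movement that a hardware coherence protocol would perform implicitly is spelled out explicitly by the programmer here.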
"General-purpose" GPUs
• Main GPU providers for laptop/desktop: Nvidia, AMD (ATI), Intel
• Example: NVIDIA's 10-series GPU (Tesla, 2008) has 240 cores
• Each core has a
  - floating point / integer unit
  - logic unit
  - move, compare unit
  - branch unit
• Cores managed by thread manager
• Thread manager can spawn and manage 30,000+ threads
• Zero-overhead thread switching
• Nvidia Tesla C1060: 933 GFlops
(images removed)

Nvidia Fermi (2010): 512 cores
(figure: one Fermi C2050 GPU with L2 cache; one "shared-memory multiprocessor" (SM) with I-cache, scheduler, dispatch unit, register file, 32 streaming processors (cores), load/store units, special function units, and 64K configurable L1 cache / shared memory; one streaming processor (SP) contains an FPU and an integer unit (IntU))

GPU Architecture Paradigm
• Optimized for high throughput
  - in theory, ~10x to ~100x higher throughput than CPU is possible
• Massive hardware-multithreading hides memory access latency
• Massive parallelism
• GPUs are good at data-parallel computations
  - multiple threads executing the same instruction on different data, preferably located adjacently in memory

The future will be heterogeneous!
Need 2 kinds of cores – often on the same chip:
• For non-parallelizable code:
  parallelism only from running several serial applications simultaneously on different cores (e.g. on a desktop: word processor, email, virus scanner, … not much more; or on cloud servers)
  → few (ca. 4-8) "fat" cores (power-hungry, area-costly, caches, out-of-order issue, …) for high single-thread performance
• For well-parallelizable code:
  → hundreds of simple cores (power- and area-efficient, GPU-/SCC-like)

Heterogeneous / Hybrid Multi-/Manycore Systems
Key concept: master-slave parallelism, offloading
• General-purpose CPU (master) processor controls execution of slave processors by submitting tasks to them and transferring operand data to the slaves' local memory
  → master offloads computation to the slaves
• Slaves often optimized for heavy throughput computing
• Master could do something else while waiting for the result, or switch to a power-saving mode
• Examples: Cell/B.E.; GPU-based system: offload heavy computation
(figure: CPU with main memory, data transfer to a GPU with its device memory)
(a hedged sketch of this offload pattern is given further below)

Heterogeneous / Hybrid Multi-/Manycore
• Master and slave cores might reside on the same chip (e.g., Cell/B.E.) or on different chips (e.g., most GPU-based systems today)
• Slaves might have access to off-chip main memory (e.g., Cell) or not (e.g., today's GPUs)

Multi-GPU Systems
• Connect one or few general-purpose (CPU) multicore processors with shared off-chip memory to several GPUs
• Increasingly popular in high-performance computing
  - cost- and (quite) energy-effective if the offloaded computation fits the GPU architecture well
(figure: multicore CPUs with L2 caches and shared main memory (DRAM), attached to several GPUs)

Reconfigurable Computing Units
• FPGA – Field-Programmable Gate Array

Example: Beowulf-class PC Clusters
• with off-the-shelf CPUs (Xeon, Opteron, …)

Cluster Example: Triolith (NSC, 2012 / 2013)
• Capability cluster (fast network for parallel applications)
• Final configuration: 1200 HP SL230 compute nodes, each equipped with 2 Intel E5-2660 (2.2 GHz Sandy Bridge) processors with 8 cores each
  - 19200 cores in total
  - theoretical peak performance of 338 Tflop/s
• Mellanox Infiniband network
(photo: NSC Triolith; source: NSC)
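The master-slave offloading pattern described above can be expressed in several programming models (CUDA, OpenCL, OpenMP, …). The sketch below uses OpenMP target directives as one possible illustration (an assumption of this example, not the model prescribed by the lecture): the host (master) maps the operand arrays into the device (slave) memory, the loop is executed on the device, and the result array is copied back to main memory.

```c
#include <stdio.h>

#define N 1000000

static float a[N], b[N], c[N];

int main(void)
{
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    /* Offload: copy a and b to device memory, run the loop on the device,
     * copy c back to host main memory when the region ends.               */
    #pragma omp target teams distribute parallel for \
            map(to: a[0:N], b[0:N]) map(from: c[0:N])
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[10] = %f\n", c[10]);
    return 0;
}
```

The explicit map clauses correspond to the "data transfer" arrows between main memory and device memory in the figure; without an attached accelerator, the same code simply runs on the host cores.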
The Challenge
• Today, basically all computers are parallel computers!
  - single-thread performance stagnating
  - dozens, hundreds of cores
  - hundreds, thousands of hardware threads
  - heterogeneous (core types, accelerators)
  - data locality matters
  - clusters for HPC, require message passing
• Utilizing more than one CPU core requires thread-level parallelism
• One of the biggest software challenges: exploiting parallelism
  - every programmer will eventually have to deal with it
  - all application areas, not only traditional HPC (general-purpose, graphics, games, embedded, DSP, …)
  - affects HW/SW system architecture, programming languages, algorithms, data structures …
  - parallel programming is more error-prone (deadlocks, races, further sources of inefficiencies) and thus more expensive and time-consuming

Can't the compiler fix it for us?
• Automatic parallelization?
  - at compile time: static analysis – not effective for pointer-based languages
    - requires programmer hints / rewriting ...
    - ok for a few benign special cases: (Fortran) loop SIMDization, extraction of instruction-level parallelism, …
  - at run time (e.g. speculative multithreading)
    - high overheads, not scalable
• More about parallelizing compilers in TDDD56 + TDDC78

And worse yet,
• A lot of variations/choices in hardware
  - many will have performance implications
  - no standard parallel programming model → portability issue
• Understanding the hardware will make it easier to make programs get high performance
  - performance-aware programming gets more important also for single-threaded code
  - adaptation leads to the portability issue again

The Challenge
• Bad news 1: Many programmers (also less skilled ones) need to use parallel programming in the future
• Bad news 2: There will be no single uniform parallel programming model as we were used to in the old sequential times
  → several competing general-purpose and domain-specific languages and their concepts will co-exist
• How to write future-proof parallel programs?

What we learned in the past…
• Sequential von Neumann model: programming, algorithms, data structures, complexity
• Sequential / few-threaded languages: C/C++, Java, ... – not designed for exploiting massive parallelism
• time T(n) = O( n log n )   (n = problem size)

… and what we need now
• Parallel programming!
• Parallel algorithms and data structures
• Analysis / cost model: parallel time, work, cost; scalability
• Performance-awareness: data locality, load balancing, communication
• time T(n,p) = O( (n log n)/p + log p )   (n = problem size, p = number of processing units used)
  (a short worked example of this cost model follows after the last slide)

Questions?
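To make the parallel cost model on the "… and what we need now" slide concrete, here is a small worked example with illustrative numbers (not from the lecture), treating the O(...) bounds as rough operation counts:

  take n = 2^20 ≈ 10^6 elements, so n log2 n ≈ 2^20 * 20 ≈ 2*10^7 operations
  sequential:        T(n)   ~ 2*10^7 steps
  p = 1024 units:    T(n,p) ~ (2*10^7)/1024 + log2 1024 ≈ 2*10^4 + 10 steps
  speedup:           T(n)/T(n,p) ≈ 1000, i.e. close to p

The additive log p term only becomes noticeable when p grows so large that little work remains per processing unit, which is the scalability aspect the cost model is meant to capture.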