Chapter 7: Multicores, Multiprocessors, and Clusters

Objectives
The student shall be able to:
- Define parallel processing, multicore, cluster, and vector processing.
- Define SISD, SIMD, MISD, and MIMD.
- Define multithreading, hardware multithreading, coarse-grained multithreading, fine-grained multithreading, and simultaneous multithreading.
- Draw network configurations: bus, ring, mesh, cube.
- Describe how vector processing works: how instructions may look and how they may be processed.
- Define GPU and name 3 characteristics that differentiate a CPU and a GPU.

What We've Already Covered
- §4.10: Parallelism and Advanced Instruction-Level Parallelism
  - Pipelines and multiple issue
- §5.8: Parallelism and Memory Hierarchies
  - Associative memory, interleaved memory
- §6.9: Parallelism and I/O
  - Redundant Arrays of Inexpensive Disks

Parallel Processing
[Figure: concurrent tasks on one machine, e.g., a Microsoft Word editor with spell check, grammar check, and backup running alongside a matrix-multiply job.]

Parallel Programming
Example: Factor.cpp forks numChild child processes, each testing part of a number range for primes. Reconstructed from the slide's two-column listing:

    // main: spawn children, then wait for them
    cout << "Run Factor " << total << ":" << numChild << endl;
    Factor factor;
    // Spawn children
    for (i = 0; i < numChild; i++) {
        if (fork() == 0)
            factor.child(begin, begin + range);  // child runs its slice (and exits there)
        begin += range + 1;                      // parent advances to the next slice
    }
    // Wait for children to finish
    for (i = 0; i < numChild; i++)
        wait(&stat);
    cout << "All Children Done: " << numChild << endl;

    // Each child tests its slice by trial division
    Factor::child(int begin, int end) {
        int val, i;
        for (val = begin; val < end; val++) {
            for (i = 2; i <= val/2; i++)
                if (val % i == 0) break;         // found a divisor
            if (i > val/2)                       // no divisor: val is prime
                cout << "Factor:" << val << endl;
        }
        exit(0);
    }

§7.1 Introduction
Introduction
- Goal: connecting multiple computers to get higher performance
  - Multiprocessors
  - Scalability, availability, power efficiency
- Job-level (process-level) parallelism
  - High throughput for independent jobs
- Parallel processing program
  - Single program run on multiple processors
- Multicore microprocessors
  - Chips with multiple processors (cores)

Hardware and Software
- Hardware
  - Serial: e.g., Pentium 4
  - Parallel: e.g., quad-core Xeon e5345
- Software
  - Sequential: e.g., traditional program
  - Concurrent: e.g., operating system
- Sequential/concurrent software can run on serial/parallel hardware
- Challenge: making effective use of parallel hardware

Amdahl's Law
- Sequential part can limit speedup
- Example: 100 processors, 90× speedup?
  - Tnew = Tparallelizable/100 + Tsequential
  - Speedup = 1 / ((1 − Fparallelizable) + Fparallelizable/100) = 90
  - Solving: Fparallelizable = 0.999
- Need the sequential part to be 0.1% of the original time

§7.3 Shared Memory Multiprocessors
Shared Memory
- SMP: shared memory multiprocessor
  - Hardware provides a single physical address space for all processors
  - Synchronize shared variables using locks
  - Memory access time: UMA (uniform) vs. NUMA (nonuniform)
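The lock idea can be made concrete with a short sketch. The following is a minimal C++11 example, not taken from the slides: two threads sum disjoint ranges in a single shared address space and use a mutex lock to update the shared total.

    #include <iostream>
    #include <mutex>
    #include <thread>

    int main() {
        long sum = 0;        // shared variable in the single address space
        std::mutex lock;     // protects sum

        auto worker = [&](long begin, long end) {
            long partial = 0;                         // private partial sum
            for (long i = begin; i < end; i++)
                partial += i;
            std::lock_guard<std::mutex> guard(lock);  // acquire the lock
            sum += partial;                           // safe shared update
        };

        std::thread t1(worker, 0, 500);
        std::thread t2(worker, 500, 1000);
        t1.join();
        t2.join();
        std::cout << "sum = " << sum << std::endl;    // prints 499500
        return 0;
    }

Without the lock, both threads could read and modify sum at the same time and lose updates; keeping a private partial sum means each thread takes the lock only once. (Compile with g++ -pthread.)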
Example: Sum Reduction
- Reducing 100 partial sums, one per processor (Pn is this processor's number):

    half = 100;
    repeat
        synch();                        /* synchronize: wait for all processors */
        if (half % 2 != 0 && Pn == 0)
            sum[0] = sum[0] + sum[half-1];
            /* Conditional sum needed when half is odd;
               Processor0 gets the missing element */
        half = half / 2;                /* dividing line on who sums */
        if (Pn < half)
            sum[Pn] = sum[Pn] + sum[Pn + half];
    until (half == 1);

§7.4 Clusters and Other Message-Passing Multiprocessors
Cluster: Message Passing
- Each processor (or computer) has a private physical address space
- Hardware sends/receives messages between processors

Loosely Coupled Clusters
- Network of independent computers
  - Each has private memory and OS
  - Connected using the I/O system, e.g., Ethernet/switch, Internet
- Suitable for applications with independent tasks
  - Web servers, databases, simulations, …
- High availability, scalable, affordable
- Problems
  - Administration cost (prefer virtual machines)
  - Low interconnect bandwidth, cf. processor/memory bandwidth on an SMP

Grid Computing
- Separate computers interconnected by long-haul networks
  - E.g., Internet connections
  - Work units farmed out, results sent back
- Can make use of idle time on PCs
  - E.g., SETI@home, World Community Grid

Multithreading
- Hardware multithreading: each thread has its own register file and PC
- Fine-grained (interleaved) multithreading: switches between threads on each instruction
- Coarse-grained multithreading: switches threads when a stall is required (memory access, wait)
- Simultaneous multithreading: uses dynamic scheduling to schedule multiple threads simultaneously

§7.5 Hardware Multithreading
Multithreading
- Performing multiple threads of execution in parallel
  - Replicate registers, PC, etc.
  - Fast switching between threads
- Fine-grain multithreading
  - Switch threads after each cycle
  - Interleave instruction execution
  - If one thread stalls, others are executed
- Coarse-grain multithreading
  - Only switch on a long stall (e.g., L2-cache miss)
  - Simplifies hardware, but doesn't hide short stalls (e.g., data hazards)

Simultaneous Multithreading
- In a multiple-issue, dynamically scheduled processor
  - Schedule instructions from multiple threads
  - Instructions from independent threads execute when function units are available
  - Within threads, dependencies are handled by scheduling and register renaming
- Example: Intel Pentium-4 HT
  - Two threads: duplicated registers, shared function units and caches

Multithreading Example
[Figure: issue slots over time under the multithreading schemes above.]

Future of Multithreading
- Will it survive? In what form?
- Power considerations → simplified microarchitectures
  - Simpler forms of multithreading
- Tolerating cache-miss latency
  - Thread switch may be most effective
- Multiple simple cores might share resources more effectively

Multithreading Lab
In /home/student/Classes/Cs355/PrimeLab are 2 files: Factor.cpp and runFactor. Copy them to one of your directories (below called mydirectory):

    cp Factor.cpp ~/mydirectory
    cp runFactor ~/mydirectory
    cd ~/mydirectory

You want to observe the processor utilization (how busy the processor is). In Linux: Applications -> System Tools -> System Monitor -> Resources.

Now compile Factor.cpp into the executable Factor and run it using the command file runFactor (in Linux):

    g++ Factor.cpp -o Factor
    ./runFactor

The file time.dat will contain the start and end time of the program, so you can calculate the duration. Now change the number of threads (numChild) in Factor.cpp. Recompile and run.

Create a matrix in Microsoft Excel or OpenOffice (Linux: Applications -> Office -> LibreOffice Calc), with one column containing the number of children and a second column containing the seconds to complete; label the second column Delay. Find the times for a range of thread counts: 1, 2, 3, 4, 5, 6, 10, 20 (whatever you have time for). Have your spreadsheet draw a graph of your data to show the efficiency of multiple threads. Show me your data.
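As a companion to the lab, here is a hedged, self-contained sketch of timing such a run with std::chrono instead of reading time.dat. The constants total, range, and numChild are illustrative placeholders, not the lab's actual values, and the child tests primes silently so printing does not dominate the measurement.

    #include <chrono>
    #include <iostream>
    #include <sys/wait.h>
    #include <unistd.h>

    // Brute-force trial division over one slice of the range (no output).
    static void child_work(int begin, int end) {
        for (int val = begin; val < end; val++) {
            int i;
            for (i = 2; i <= val / 2; i++)
                if (val % i == 0) break;
            if (i > val / 2) { /* val is prime */ }
        }
    }

    int main() {
        const int numChild = 4;              // vary: 1, 2, 3, 4, 5, 6, 10, 20
        const int total = 200000;            // fixed amount of work to divide
        const int range = total / numChild;  // slice per child
        auto start = std::chrono::steady_clock::now();

        int begin = 2;
        for (int i = 0; i < numChild; i++) {
            if (fork() == 0) {               // child: do its slice, then exit
                child_work(begin, begin + range);
                _exit(0);
            }
            begin += range + 1;              // parent: advance to the next slice
        }
        int stat;
        for (int i = 0; i < numChild; i++)
            wait(&stat);                     // wait for all children to finish

        std::chrono::duration<double> delay =
            std::chrono::steady_clock::now() - start;
        std::cout << "Delay: " << delay.count() << " s for "
                  << numChild << " children" << std::endl;
        return 0;
    }

Because the total work is fixed and divided among the children, the Delay column should shrink with more children until all cores are busy.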
§7.6 SISD, MIMD, SIMD, SPMD, and Vector
Instruction and Data Streams
An alternate classification:

                               Data stream: Single       Data stream: Multiple
    Instr. stream: Single      SISD: Intel Pentium 4     SIMD: SSE instructions of x86
    Instr. stream: Multiple    MISD: no examples today   MIMD: Intel Xeon e5345

- SPMD: Single Program Multiple Data
  - A parallel program on a MIMD computer
  - Conditional code for different processors

SIMD
- Operate elementwise on vectors of data
  - E.g., MMX and SSE instructions in x86
    - Multiple data elements in 128-bit wide registers
- All processors execute the same instruction at the same time
  - Each with a different data address, etc.
- Simplifies synchronization
- Reduced instruction control hardware
- Works best for highly data-parallel applications

Vector Processors
- Highly pipelined function units
- Stream data from/to vector registers to the units
  - Data collected from memory into registers
  - Results stored from registers to memory
- Example: vector extension to MIPS
  - 32 × 64-element registers (64-bit elements)
  - Vector instructions
    - lv, sv: load/store vector
    - addv.d: add vectors of double
    - addvs.d: add scalar to each element of a vector of double
- Significantly reduces instruction-fetch bandwidth

Example: DAXPY (Y = a × X + Y)

Conventional MIPS code:

          l.d    $f0,a($sp)        ;load scalar a
          addiu  r4,$s0,#512       ;upper bound of what to load
    loop: l.d    $f2,0($s0)        ;load x(i)
          mul.d  $f2,$f2,$f0       ;a × x(i)
          l.d    $f4,0($s1)        ;load y(i)
          add.d  $f4,$f4,$f2       ;a × x(i) + y(i)
          s.d    $f4,0($s1)        ;store into y(i)
          addiu  $s0,$s0,#8        ;increment index to x
          addiu  $s1,$s1,#8        ;increment index to y
          subu   $t0,r4,$s0        ;compute bound
          bne    $t0,$zero,loop    ;check if done

Vector MIPS code:

    l.d     $f0,a($sp)     ;load scalar a
    lv      $v1,0($s0)     ;load vector x
    mulvs.d $v2,$v1,$f0    ;vector-scalar multiply
    lv      $v3,0($s1)     ;load vector y
    addv.d  $v4,$v2,$v3    ;add y to product
    sv      $v4,0($s1)     ;store the result
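For reference, both MIPS sequences implement the same scalar loop. A C++ rendering, assuming the 64-element vector length implied by the 512-byte bound (64 × 8-byte doubles) and the 64-element vector registers above:

    // DAXPY: Y = a*X + Y over 64 double-precision elements.
    void daxpy(double a, const double x[64], double y[64]) {
        for (int i = 0; i < 64; i++)
            y[i] = a * x[i] + y[i];   // one multiply and one add per element
    }

The vector version replaces this whole loop, and its per-iteration bookkeeping, with six instructions.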
Vector vs. Scalar
- Vector architectures and compilers
  - Simplify data-parallel programming
  - Speed up processing, since explicit loops are eliminated
  - No data hazards within a vector instruction
  - Benefit from interleaved and burst memory
  - Avoid control hazards by avoiding loops

Multimedia Improvements
- Intel x86 architecture extensions
  - MMX: MultiMedia Extensions
  - SSE: Streaming SIMD Extensions
- A register/ALU can be subdivided into smaller units, or extended and subdivided
  - [Figure: a 32-bit unit split into two 16-bit or four 8-bit lanes operating in parallel on one ALU.]

§7.8 Introduction to Multiprocessor Network Topologies
Interconnection Networks
- Network topologies: arrangements of processors, switches, and links
  - Bus
  - Ring
  - 2D mesh
  - N-cube (N = 3)
  - Fully connected

Multistage Networks
[Figure: examples of multistage switching networks.]

Network Characteristics
- Performance
  - Latency (delay) per message
  - Throughput: messages/second
  - Congestion delays (depending on traffic)
- Cost
- Power
- Routability in silicon

§7.7 Introduction to Graphics Processing Units
Graphics Processing Units
- Processors oriented to 3D graphics tasks
  - Vertex/pixel processing, shading, texture mapping, rasterization
- Architecture
  - Hundreds or thousands of threads
  - Parallel processing: SIMD + scalar; simultaneous execution
  - GPU memory optimized for bandwidth, not latency
    - Smaller memories, no multilevel cache
    - Wider DRAM chips
  - No double-precision floating point

Graphics in the System
[Figure: where the GPU sits in the system.]

GPU Architectures
- Processing is highly data-parallel
  - GPUs are highly multithreaded
  - Use thread switching to hide memory latency
    - Less reliance on multi-level caches
  - Graphics memory is wide and high-bandwidth
- Trend toward general-purpose GPUs
  - Heterogeneous CPU/GPU systems
  - CPU for sequential code, GPU for parallel code
- Programming languages/APIs
  - DirectX, OpenGL
  - C for Graphics (Cg), High Level Shader Language (HLSL)
  - Compute Unified Device Architecture (CUDA)

Example: NVIDIA Tesla
- Streaming multiprocessor: 8 × streaming processors
  [Figure: Tesla streaming-multiprocessor block diagram.]

Example: NVIDIA Tesla
- Streaming Processors (SP)
  - Single-precision FP and integer units
  - Each SP is fine-grained multithreaded
- Warp: group of 32 threads
  - Executed in parallel, SIMD (or SPMD) style, over time: 8 SPs × 4 clock cycles
  - Hardware contexts for 24 warps
    - Registers, PCs, …

Classifying GPUs
- Don't fit nicely into the SIMD/MIMD model
  - Conditional execution in a thread allows an illusion of MIMD
    - But with performance degradation
    - Need to write general-purpose code with care

                                  Static: discovered        Dynamic: discovered
                                  at compile time           at runtime
    Instruction-level parallelism   VLIW                      Superscalar
    Data-level parallelism          SIMD or Vector            Tesla Multiprocessor

Roofline Diagram
- Attainable GFLOPs/sec = min(Peak Memory BW × Arithmetic Intensity, Peak FP Performance)

Optimizing Performance
- Choice of optimization depends on the arithmetic intensity of the code
- Arithmetic intensity is not always fixed
  - May scale with problem size
  - Caching reduces memory accesses
    - Increases arithmetic intensity
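A hedged sketch of evaluating that roofline formula; the bandwidth and peak-FP numbers are illustrative placeholders, not measurements of any system in this chapter:

    #include <algorithm>
    #include <initializer_list>
    #include <iostream>

    // Attainable GFLOPs/sec = min(peak memory BW × arithmetic intensity,
    //                             peak floating-point performance)
    double attainable_gflops(double peak_bw_gbytes_s,
                             double intensity_flops_per_byte,
                             double peak_fp_gflops) {
        return std::min(peak_bw_gbytes_s * intensity_flops_per_byte,
                        peak_fp_gflops);
    }

    int main() {
        // Illustrative machine: 16 GB/s memory bandwidth, 64 GFLOP/s peak FP.
        for (double ai : {0.5, 1.0, 2.0, 4.0, 8.0})
            std::cout << "AI " << ai << " FLOPs/byte -> "
                      << attainable_gflops(16.0, ai, 64.0) << " GFLOP/s\n";
        // Below AI = 4 the code is memory-bound; above it, compute-bound.
        return 0;
    }

The crossover at 4 FLOPs/byte is where the two roofs meet for these placeholder numbers; raising arithmetic intensity (e.g., via caching, as noted above) moves code toward the compute-bound region.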
Comparing Systems
- Example: Opteron X2 vs. Opteron X4
  - 2-core vs. 4-core, 2× FP performance/core, 2.2 GHz vs. 2.3 GHz
  - Same memory system
- To get higher performance on the X4 than the X2
  - Need high arithmetic intensity
  - Or the working set must fit in the X4's 2 MB L3 cache

Optimizing Performance
- Optimize FP performance
  - Balance adds and multiplies
  - Improve superscalar ILP and use of SIMD instructions
- Optimize memory usage
  - Software prefetch
    - Avoid load stalls
  - Memory affinity
    - Avoid non-local data accesses

§7.11 Real Stuff: Benchmarking Four Multicores…
Four Example Systems
- 2 × quad-core Intel Xeon e5345 (Clovertown)
  - Chipset = bus; fully buffered DRAM DIMMs
- 2 × quad-core AMD Opteron X4 2356 (Barcelona)
- 2 × oct-core Sun UltraSPARC T2 5140 (Niagara 2)
  - Fine-grained multithreading
- 2 × oct-core IBM Cell QS20
  - SPE = Synergistic Processor Element; SPEs have a SIMD instruction set

Pitfalls
- Not developing the software to take account of a multiprocessor architecture
  - Example: using a single lock for a shared composite resource
    - Serializes accesses, even if they could be done in parallel
  - Use finer-granularity locking (a sketch follows the concluding remarks below)

§7.13 Concluding Remarks
Concluding Remarks
- Goal: higher performance by using multiple processors
- Difficulties
  - Developing parallel software
  - Devising appropriate architectures
- Many reasons for optimism
  - Changing software and application environment
  - Chip-level multiprocessors with lower-latency, higher-bandwidth interconnect
- An ongoing challenge for computer architects!
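Returning to the locking pitfall above: a minimal C++ sketch of finer-granularity locking, using a hypothetical bucketed counter table (the type and bucket count are illustrative, not from the text). One lock per bucket lets updates to different buckets proceed in parallel, where a single table-wide lock would serialize them all.

    #include <array>
    #include <mutex>

    struct SharedTable {
        static const int kBuckets = 16;             // illustrative bucket count
        std::array<long, kBuckets> count{};         // shared composite resource
        std::array<std::mutex, kBuckets> locks;     // finer-granularity locks

        void add(int key, long delta) {
            int b = key % kBuckets;
            std::lock_guard<std::mutex> guard(locks[b]);  // lock only one bucket
            count[b] += delta;   // other buckets stay accessible in parallel
        }
    };

With a single mutex guarding the whole table, every add() would contend for the same lock even when threads touch different buckets; the per-bucket locks remove that serialization at a small cost in memory and bookkeeping.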