Chapter 7 Multicores, p , and Multiprocessors, Clusters Goal: connecting multiple computers to get higher g e performance pe o a ce High g throughput oug pu for o independent depe de jobs Parallel processing program Multiprocessors Scalability, y, availability, y, power p efficiencyy Job-level (process-level) parallelism §9.1 Intro oduction Introduction Single program run on multiple processors Multicore microprocessors Chips with multiple processors (cores) Chapter 7 — Multicores, Multiprocessors, and Clusters — 2 Hardware and Software Hardware Software Serial: e e.g., g Pentium 4 Parallel: e.g., quad-core Xeon e5345 Sequential: e.g., matrix multiplication Concurrent: e.g., operating system, multithreaded matrix multiplication Sequential/concurrent software can run on serial/parallel hardware Challenge: g making g effective use of p parallel hardware Chapter 7 — Multicores, Multiprocessors, and Clusters — 3 Parallel software is the problem Need to get significant performance improvement Otherwise, Oth i just j t use a ffaster t uniprocessor, i since it’s easier! Diffi lti Difficulties Partitioning Coordination Communications overhead §7.2 The Difficulty of Creatin ng Parallel Processin ng Programs Parallel Programming Chapter 7 — Multicores, Multiprocessors, and Clusters — 4 Amdahl’s Law Sequential part can limit speedup Example: 100 processors processors, 90× speedup? Tnew = Tparallelizable/100 + Tsequential Speedup Solving: Fparallelizable = 0.999 1 (1 Fpparallelizable ) Fpparallelizable /100 90 Need sequential part to be 0 0.1% 1% of original time Chapter 7 — Multicores, Multiprocessors, and Clusters — 5 Scaling Example Workload: sum of 10 scalars, and 10 × 10 matrix sum Single processor: Time = (10 + 100) × tadd 10 processors Time = 10 × tadd + 100/10 × tadd = 20 × tadd Speedup = 110/20 = 5.5 (55% of potential) 100 processors S Speed d up ffrom 10 tto 100 processors Time = 10 × tadd + 100/100 × tadd = 11 × tadd Speedup = 110/11 0/ = 10 0 ((10% 0% o of po potential) e a) Assumes load can be balanced across processors Chapter 7 — Multicores, Multiprocessors, and Clusters — 6 Scaling Example (cont) What if matrix size is 100 × 100? Single processor: Time = (10 + 10000) × tadd dd 10 processors 100 processors Time = 10 × tadd dd + 10000/10 × tadd dd = 1010 × tadd dd Speedup = 10010/1010 = 9.9 (99% of potential) Time = 10 × tadd + 10000/100 × tadd = 110 × tadd Speedup = 10010/110 = 91 (91% of potential) Assuming load balanced Chapter 7 — Multicores, Multiprocessors, and Clusters — 7 Strong vs Weak Scaling Strong scaling: problem size fixed As in example Weak scaling: problem size proportional to n mber of processors number 10 processors, 10 × 10 matrix 100 processors, 32 × 32 matrix Ti Time = 20 × tadd Ti Time = 10 × tadd + 1000/100 × tadd = 20 × tadd Constant performance in this example Chapter 7 — Multicores, Multiprocessors, and Clusters — 8 SMP: shared memory multiprocessor Hardware provides single physical address space for all processors Synchronize shared variables using locks Cache coherence issue Memory access time UMA (uniform) vs vs. NUMA (nonuniform) §7.3 Sha ared Memo ory Multiprrocessors Shared Memory Chapter 7 — Multicores, Multiprocessors, and Clusters — 9 Multi-cores processors: Core Core Cache Cache §7.3 Sha ared Memo ory Multiprrocessors Shared Memory Chapter 7 — Multicores, Multiprocessors, and Clusters — 10 Each processor has private physical address add ess space Hardware sends/receives messages p between processors §7.4 Clussters and O Other Mes ssage-Pas ssing Multiprocessors Message Passing Chapter 7 — Multicores, Multiprocessors, and Clusters — 11 Loosely Coupled Clusters Network of independent computers Each has private memory and OS Connected using I/O system Suitable for applications with independent tasks E.g., Ethernet/switch, Internet Web servers, databases, simulations, … High availability, scalable, affordable Problems Administration cost (prefer virtual machines) Low interconnect bandwidth c f processor/memory bandwidth on an SMP c.f. Chapter 7 — Multicores, Multiprocessors, and Clusters — 12 Grid Computing Separate computers interconnected by long-haul networks Can make use of idle time on PCs E.g., Internet connections Work units farmed out, out results sent back E.g., SETI@home, World Community Grid Other parallel computers Supercomputers, p p , vector computers, p , GPUs,, cells, etc. Chapter 7 — Multicores, Multiprocessors, and Clusters — 13 Goal: higher performance by using multiple p processors Difficulties Manyy reasons for optimism Developing p gp parallel software Devising appropriate architectures §7.13 Co oncluding R Remarks Concluding Remarks Changing software and application environment Chip-level multiprocessors with lower latency, hi h b higher bandwidth d idth iinterconnect t t An ongoing challenge for computer architects, software/tool/compiler developers/researchers! Chapter 7 — Multicores, Multiprocessors, and Clusters — 14