Multicores, Multiprocessors, and Clusters
Hiroaki Kobayashi, 7/18/2012

Contents
- Introduction
- The Difficulty of Creating Parallel Processing Programs
- Shared Memory Multiprocessors
- Clusters and Other Message-Passing Multiprocessors
- Cache Coherence on Shared-Memory Multiprocessors
  - Snooping protocol for cache coherence on shared-memory multiprocessors
  - Directory-based protocol for cache coherence on message-passing multiprocessors
- SISD, MIMD, SIMD, SPMD, and Vector
- Introduction to Multiprocessor Network Topologies
- Multiprocessor Benchmarks
- Summary

Introduction (1/2)
- The goal of multiprocessors is to create powerful computers simply by connecting many existing smaller ones.
  - Replace large, inefficient processors with many smaller, efficient processors
  - Scalability, availability, power efficiency
- Job-level (or process-level) parallelism
  - High throughput for independent jobs
- Parallel processing program
  - A single program runs on multiple processors
  - Short turnaround time

Introduction (2/2)
- A cluster is composed of microprocessors housed in many independent servers or PCs.
  - Now very popular for search engines, Web servers, email servers, and database servers in data centers
  [Figure: cluster system at the Cyberscience Center]
- Multicore/manycore microprocessors
  - Chips with multiple processors (cores)
  - Difficulty in seeking higher clock rates and improved CPI on a single processor due to the power problem
  - The number of cores is expected to double every two years
  [Figure: Intel Nehalem-EP processor with 4 cores]
- Programmers who care about performance must become parallel programmers!

Concurrency vs. Parallelism: Hardware/Software Categorization

                      | Software: Sequential                        | Software: Concurrent
  Hardware: Serial    | Matrix multiply written in C, compiled for  | Windows Vista operating system running
                      | and running on an Intel Pentium 4           | on an Intel Pentium 4
  Hardware: Parallel  | Matrix multiply written in C, compiled for  | Windows Vista operating system running
                      | and running on an Intel Xeon e5345          | on an Intel Xeon e5345 (Clovertown)
                      | (Clovertown)                                |

Challenge: making effective use of parallel hardware for sequential or concurrent software.

Difficulty of Creating Parallel Processing Programs
- Parallel software is the problem.
- Need to get a significant performance improvement
  - Otherwise, just use a faster uniprocessor, since it's easier!
  - Uniprocessor design techniques such as superscalar and out-of-order execution take advantage of ILP, normally without the involvement of the programmer.
- Difficulties in the design of parallel programs
  - Partitioning (as equally as possible)
  - Coordination/scheduling for load balancing
  - Synchronization and communication overheads
- The difficulties get worse as the number of processors increases.

Speedup Challenge
- Amdahl's Law says the sequential part can limit speedup.
- Example: can 100 processors give a 90× speedup?
  - Tnew = Tparallelizable/100 + Tsequential
  - Speedup = 1 / ((1 − Fparallelizable) + Fparallelizable/100) = 90
  - Solving: Fparallelizable = 0.999
- The sequential part must be only 0.1% of the original execution time.
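As a quick check of the arithmetic above, here is a minimal C sketch (not from the slides; the function name and the printed cases are only illustrative) that evaluates Amdahl's Law for a given parallelizable fraction and processor count:

    #include <stdio.h>

    /* Amdahl's Law: speedup = 1 / ((1 - f) + f / p), where
       f = fraction of the original time that can be parallelized,
       p = number of processors. */
    double amdahl_speedup(double f, int p) {
        return 1.0 / ((1.0 - f) + f / p);
    }

    int main(void) {
        /* f = 0.999 with 100 processors gives roughly the 90x of the
           example (about 91, since 0.999 is a rounded solution). */
        printf("f = 0.999, p = 100: %.1f\n", amdahl_speedup(0.999, 100));
        /* Even 1% of sequential work caps the speedup near 50. */
        printf("f = 0.990, p = 100: %.1f\n", amdahl_speedup(0.990, 100));
        return 0;
    }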
Scaling Example (1/2)
- Workload: sum of 10 scalars, and a 10 × 10 matrix sum
  - Speedup from 10 to 100 processors?
- Single processor: Time = (10 + 100) × tadd
- 10 processors
  - Time = 10 × tadd + 100/10 × tadd = 20 × tadd
  - Speedup = 110/20 = 5.5 (55% of potential)
- 100 processors
  - Time = 10 × tadd + 100/100 × tadd = 11 × tadd
  - Speedup = 110/11 = 10 (10% of potential)
- Assumes the load can be balanced across processors.

Scaling Example (2/2)
- What if the matrix size is 100 × 100?
- Single processor: Time = (10 + 10000) × tadd
- 10 processors
  - Time = 10 × tadd + 10000/10 × tadd = 1010 × tadd
  - Speedup = 10010/1010 = 9.9 (99% of potential)
- 100 processors
  - Time = 10 × tadd + 10000/100 × tadd = 110 × tadd
  - Speedup = 10010/110 = 91 (91% of potential)
- Assuming the load is balanced.

Strong vs. Weak Scaling
- Getting good speedup on a multiprocessor while keeping the problem size fixed is harder than getting good speedup by increasing the size of the problem.
- Strong scaling: problem size fixed
  - As in the example above
- Weak scaling: problem size proportional to the number of processors
  - 10 processors, 10 × 10 matrix
    - Time = 20 × tadd
  - 100 processors, 32 × 32 matrix
    - Time = 10 × tadd + 1024/100 × tadd ≈ 20 × tadd
  - Constant performance in this example

Speedup Challenge: Balancing Load
- In the previous example, each processor had 1% of the work to achieve the speedup of 91 on the larger problem with 100 processors.
- If one processor's load is higher than all the rest, show the impact on speedup. Calculate for 2% and 5%.
  - If one processor has 2% of the 10,000 additions, the other 99 share the remaining 9,800:
    - Execution time = Max(9800t/99, 200t/1) + 10t = 210t
    - Speedup drops to 10,010t/210t = 48
  - If one processor has 5% of the 10,000 additions, the other 99 share the remaining 9,500:
    - Execution time = Max(9500t/99, 500t/1) + 10t = 510t
    - Speedup drops to 10,010t/510t = 20

Shared Memory Multiprocessors
- SMP: shared memory multiprocessor
  - Hardware provides a single physical address space for all processors
  - Synchronize shared variables using locks
  - Memory access time: UMA (uniform) vs. NUMA (nonuniform)
  [Figure: uniform memory access (UMA) multiprocessor]

Non-Uniform Memory Access Multiprocessor
- Single address space
- Lower latency to nearby memory
- Higher scalability
- More effort needed for efficient parallel programming
  [Figure: non-uniform memory access (NUMA) multiprocessor]

Synchronization and Lock on an SMP
- Synchronization: the process of coordinating the behavior of two or more processes, which may be running on different processors.
  - As processors operating in parallel will normally share data, they also need to coordinate when operating on shared data.
- Lock: a synchronization device that allows access to data by only one processor at a time.
  - Other processors interested in the shared data must wait until the original processor unlocks the variable.
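In practice, the lock described above is what a mutex provides. A minimal sketch of the idea, assuming POSIX threads (the slides themselves do not name a threading API):

    #include <pthread.h>
    #include <stdio.h>

    /* Shared data protected by a lock: only one thread may update it at a time. */
    static long shared_sum = 0;
    static pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        long local = 0;
        for (int i = 0; i < 1000000; i++)
            local += 1;                   /* private work needs no coordination */

        pthread_mutex_lock(&sum_lock);    /* acquire the lock ...              */
        shared_sum += local;              /* ... update the shared variable ... */
        pthread_mutex_unlock(&sum_lock);  /* ... release so others may proceed  */
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        for (int i = 0; i < 2; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);
        printf("shared_sum = %ld\n", shared_sum);   /* 2000000 */
        return 0;
    }

Compile with -pthread; without the lock, the two updates of shared_sum could interleave and lose an increment.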
Example: Sum Reduction (1/2)
- Sum 100,000 numbers on a 100-processor UMA machine
  - Each processor has an ID: 0 ≤ Pn ≤ 99
  - Partition: 1,000 numbers per processor
  - Initial summation on each processor:

    sum[Pn] = 0;
    for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
        sum[Pn] = sum[Pn] + A[i];

- Now these partial sums need to be added together
  - Reduction: divide and conquer
  - Half the processors add pairs, then a quarter, ...
  - Need to synchronize between reduction steps

Example: Sum Reduction (2/2)

    half = 100;
    repeat
        synch();
        if (half%2 != 0 && Pn == 0)
            sum[0] = sum[0] + sum[half-1];
            /* Conditional sum needed when half is odd;
               processor 0 gets the missing element */
        half = half/2;   /* dividing line on who sums */
        if (Pn < half)
            sum[Pn] = sum[Pn] + sum[Pn+half];
    until (half == 1);
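The same shared-memory reduction is commonly expressed through a parallel-programming API rather than hand-coded barriers. A minimal sketch using OpenMP (an assumption on my part; the slides use the abstract synch() primitive, not OpenMP):

    #include <stdio.h>
    #include <omp.h>

    #define N 100000

    int main(void) {
        static double A[N];
        for (int i = 0; i < N; i++) A[i] = 1.0;

        double sum = 0.0;
        /* Each thread accumulates a private partial sum over its share of A;
           the runtime then combines the partial sums, much like the
           divide-and-conquer reduction on the slide. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += A[i];

        printf("sum = %f\n", sum);
        return 0;
    }

Compiled with OpenMP enabled (e.g., gcc -fopenmp), this reproduces the partial-sums-plus-reduction structure of the hand-written version.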
Message-Passing Multiprocessors
- Each processor has its own private physical address space
- Hardware sends/receives messages between processors
- Multicomputers, clusters, networks of workstations (NOW)
- Suited for job-level parallelism and applications with little communication
  - Web search, mail servers, file servers, ...

Loosely-Coupled Clusters
- Network of independent computers
  - Each has private memory and its own OS
  - Connected using the I/O system
    - E.g., Ethernet/switch, Internet
- Suitable for applications with independent tasks
  - Web servers, databases, simulations, ...
- High availability, scalable, affordable
- Problems
  - Administration cost (prefer virtual machines)
  - Low interconnect bandwidth
    - c.f. processor/memory bandwidth on an SMP

Sum Reduction on a Message-Passing Multiprocessor (1/2)
- Sum 100,000 numbers on 100 processors
- First distribute 1,000 numbers to each
  - Then do the partial sums:

    sum = 0;
    for (i = 0; i < 1000; i = i + 1)
        sum = sum + AN[i];

- Reduction
  - Half the processors send, the other half receive and add
  - Then a quarter send, a quarter receive and add, ...

Sum Reduction on a Message-Passing Multiprocessor (2/2)
- Given send() and receive() operations:

    limit = 100; half = 100;  /* 100 processors */
    repeat
        half = (half+1)/2;    /* send vs. receive dividing line */
        if (Pn >= half && Pn < limit)
            send(Pn - half, sum);
        if (Pn < (limit/2))
            sum = sum + receive();
        limit = half;         /* upper limit of senders */
    until (half == 1);        /* exit with final sum */

- Send/receive also provide synchronization
- Assumes send/receive take similar time to an addition
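On real message-passing machines this send/receive pattern is usually left to a library collective. A minimal sketch using MPI (an assumption; the slides use abstract send()/receive(), not MPI):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Each process sums its private 1000 numbers (here simply 1.0 each). */
        double partial = 0.0;
        for (int i = 0; i < 1000; i++)
            partial += 1.0;

        /* MPI_Reduce combines the partial sums on rank 0, typically with a
           combining tree much like the halving scheme on the slide. */
        double total = 0.0;
        MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("total = %f over %d processes\n", total, nprocs);

        MPI_Finalize();
        return 0;
    }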
Grid Computing
- Separate computers interconnected by long-haul networks
  - E.g., Internet connections
  - Work units are farmed out, results sent back
- Can make use of idle time on PCs
  - E.g., SETI@home (also called volunteer computing)
  - 5+ million computers in 200+ countries delivered 257 Tflop/s for the search for extraterrestrial intelligence in 2006
  - http://www.planetary.or.jp/setiathome/home_japanese.html

Parallelism and Memory Hierarchies: Cache Coherence
- Suppose two CPU cores share a physical address space, with write-through caches:

  Time step | Event               | CPU A's cache | CPU B's cache | Memory
  0         |                     |               |               | 0
  1         | CPU A reads X       | 0             |               | 0
  2         | CPU B reads X       | 0             | 0             | 0
  3         | CPU A writes 1 to X | 1             | 0             | 1

- After step 3, CPU B's cache still holds the stale value 0: an inconsistency.

Cache Coherence Protocol
- Operations performed by caches in multiprocessors to ensure coherence
  - Migration of data to local caches
    - Reduces bandwidth demand on the shared memory
  - Replication of read-shared data
    - Reduces contention for access
- Snooping protocols
  - Each cache monitors bus reads/writes
- Directory-based protocols
  - Caches and memory record the sharing status of blocks in a directory

Snoop-based and Directory-based Systems
  [Figure: snoop-based system on a broadcast bus vs. directory-based system]

Invalidating Snooping Protocols
- A cache gets exclusive access to a block when it is to be written
  - It broadcasts an invalidate message on the bus
  - A subsequent read in another cache misses
    - The owning cache supplies the updated value
  - If two processors attempt to write the same data simultaneously, one of them wins the race, causing the other processor's copy to be invalidated.
    - This enforces write serialization.

  CPU activity         | Bus activity      | CPU A's cache | CPU B's cache | Memory
                       |                   |               |               | 0
  CPU A reads X        | Cache miss for X  | 0             |               | 0
  CPU B reads X        | Cache miss for X  | 0             | 0             | 0
  CPU A writes 1 to X  | Invalidate for X  | 1             |               | 0
  CPU B reads X        | Cache miss for X  | 1             | 1             | 1

Coherence Misses
- Coherence misses: a new class of cache misses due to sharing data on a multiprocessor
- True sharing
  - A write to shared data in one processor's cache invalidates the entire block holding that data in other processors' caches; a miss occurs when another processor accesses a modified word in that block.
- False sharing
  - The same invalidation of the whole block causes a miss even when another processor accesses a different word that merely lies in the same block.
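A small C/pthreads sketch of the false-sharing case (an illustration, not from the slides): two threads write different words that happen to share a cache block, so the block ping-pongs between the caches even though no value is truly shared.

    #include <pthread.h>
    #include <stdio.h>

    /* counts[0] and counts[1] are distinct words, but they are adjacent in
       memory and normally fall in the same cache block. Every write by one
       thread invalidates the other thread's copy of the block, causing
       coherence (false-sharing) misses. */
    static long counts[2];

    static void *bump(void *arg) {
        long *counter = (long *)arg;
        for (long i = 0; i < 10000000; i++)
            (*counter)++;            /* each thread touches only its own word */
        return NULL;
    }

    int main(void) {
        pthread_t t0, t1;
        pthread_create(&t0, NULL, bump, &counts[0]);
        pthread_create(&t1, NULL, bump, &counts[1]);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        printf("%ld %ld\n", counts[0], counts[1]);
        return 0;
    }

Padding the array so that each counter sits in its own cache block removes the false sharing and typically makes the loops run noticeably faster.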
Contributing Causes of Memory Cycles on an SMP
  [Figure: breakdown of the causes of memory cycles on an SMP]

Classifications Based on Instruction and Data Streams
- Flynn's classification (1966):

                                  | Data streams: Single     | Data streams: Multiple
  Instruction streams: Single     | SISD: Intel Pentium 4    | SIMD: SSE instructions of x86
  Instruction streams: Multiple   | MISD: no examples today  | MIMD: Intel Xeon e5345

- SPMD: Single Program, Multiple Data
  - A parallel program on a MIMD computer
  - Conditional code selects the work of different processors

SIMD
- Operate elementwise on vectors of data
  - E.g., MMX and SSE instructions in x86
    - Multiple data elements in 128-bit-wide registers
- All processors execute the same instruction at the same time
  - Each with a different data address, etc.
- Simplifies synchronization
- Reduced instruction control hardware
- Works best for highly data-parallel applications
- Weakest for case or switch statements

Vector Processors
- Highly pipelined functional units
- Stream data between vector registers and the units
  - Data collected from memory into registers
  - Results stored from registers to memory
- Example: vector extension to MIPS
  - 32 × 64-element registers (64-bit elements)
  - Vector instructions
    - lv, sv: load/store vector
    - addv.d: add vectors of double
    - addvs.d: add scalar to each element of a vector of double
- Significantly reduces instruction-fetch bandwidth

Characteristics of a Vector Architecture
- Vector registers
  - Each vector register is a fixed-length bank holding a single vector and must have at least two read ports and one write port.
- Vector functional units
  - Each unit is fully pipelined and can start a new operation on every clock cycle. A control unit is needed to detect hazards, both from conflicts for the functional units and from conflicts for register accesses.
- Vector load-store unit
  - A vector memory unit that loads or stores a vector to or from memory. Vector loads and stores are fully pipelined, so words can be moved between the vector registers and memory with a bandwidth of 1 word per clock cycle, after an initial latency.
  [Figure: pipelined FP add unit reading from VR1 and VR2, writing to VR3]

Example: DAXPY (Y = a × X + Y)
- Conventional MIPS code:

          l.d    $f0,a($sp)      ;load scalar a
          addiu  r4,$s0,#512     ;upper bound of what to load
    loop: l.d    $f2,0($s0)      ;load x(i)
          mul.d  $f2,$f2,$f0     ;a × x(i)
          l.d    $f4,0($s1)      ;load y(i)
          add.d  $f4,$f4,$f2     ;a × x(i) + y(i)
          s.d    $f4,0($s1)      ;store into y(i)
          addiu  $s0,$s0,#8      ;increment index to x
          addiu  $s1,$s1,#8      ;increment index to y
          subu   $t0,r4,$s0      ;compute bound
          bne    $t0,$zero,loop  ;check if done

- Vector MIPS code:

          l.d     $f0,a($sp)     ;load scalar a
          lv      $v1,0($s0)     ;load vector x
          mulvs.d $v2,$v1,$f0    ;vector-scalar multiply
          lv      $v3,0($s1)     ;load vector y
          addv.d  $v4,$v2,$v3    ;add y to product
          sv      $v4,0($s1)     ;store the result
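For reference, a plain C version of the loop that both instruction sequences implement (the function and variable names here are only illustrative; only the DAXPY operation itself comes from the slide):

    /* DAXPY: Y = a * X + Y over 64 double-precision elements,
       matching the 512-byte (64 x 8-byte) bound in the scalar MIPS code. */
    void daxpy(double a, const double *x, double *y) {
        for (int i = 0; i < 64; i++)
            y[i] = a * x[i] + y[i];
    }

The vector version replaces the 11-instruction scalar loop body, executed 64 times, with just 6 instructions in total, which is where the large saving in instruction-fetch bandwidth comes from.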
Vector vs. Scalar
- Vector architectures and compilers
  - Simplify data-parallel programming
  - Explicit statement of the absence of loop-carried dependences
    - Reduced checking in hardware
  - Regular access patterns benefit from interleaved and burst memory
  - Avoid control hazards by avoiding loops
- More general than ad hoc media extensions (such as MMX, SSE)
  - Better match with compiler technology

History of GPUs
- Early video cards (VGA controllers)
  - Frame buffer memory with address generation for video output
- 3D graphics processing
  - Originally on high-end computers (e.g., SGI)
  - Moore's Law ⇒ lower cost, higher density
  - 3D graphics cards for PCs and game consoles
- Graphics Processing Units
  - Processors oriented to 3D graphics tasks
  - Vertex/pixel processing, shading, texture mapping, rasterization

Graphics in the System
  [Figure: GPU placement in the system; a 16 GB/s link is labeled in the figure]

Graphics Rendering Pipeline
  [Figure: stages of the graphics rendering pipeline]

GPU Architecture
- Processing is highly data-parallel
  - GPUs are highly multithreaded
  - Use thread switching to hide memory latency
    - Less reliance on multi-level caches
  - Graphics memory is wide and high-bandwidth
- Trend toward general-purpose GPUs
  - Heterogeneous CPU/GPU systems
  - CPU for sequential code, GPU for parallel code
- Programming languages/APIs
  - DirectX, OpenGL
  - C for Graphics (Cg), High Level Shader Language (HLSL)
  - Compute Unified Device Architecture (CUDA)

Trend in Performance between CPU and GPU
  [Figure: CPU vs. GPU peak performance over time]

Example: NVIDIA Tesla (1/2)
  [Figure: Tesla GPU with 16 streaming multiprocessors, each containing 8 streaming processors]
- 345.6 Gflop/s = 16 MPs × 8 SPs × 2 flops × 1.35 GHz

Example: NVIDIA Tesla (2/2)
- Streaming processors (SPs)
  - Single-precision FP and integer units
  - Each SP is fine-grained multithreaded
- Warp: a group of 32 threads
  - Executed in parallel, SIMD style
    - 8 SPs × 4 clock cycles
  - Hardware contexts for 24 warps
    - Registers, PCs, ...

Classifying GPUs
- GPUs don't fit nicely into the SIMD/MIMD model
  - Conditional execution in a thread allows an illusion of MIMD
    - But with performance degradation
  - Need to write general-purpose code with care

                                  | Static: discovered at compile time | Dynamic: discovered at runtime
  Instruction-level parallelism   | VLIW                               | Superscalar
  Data-level parallelism          | SIMD or Vector                     | Tesla multiprocessor

Interconnection Networks
- Network topologies: arrangements of processors, switches, and links
  [Figure: bus, ring, 2D mesh, N-cube (N = 3), and fully connected network]

Multi-Stage Networks
  [Figure: multi-stage network topologies]

Network Characteristics
- Performance
  - Latency per message (on an unloaded network)
  - Throughput
    - Link bandwidth: peak data transfer rate per link
    - Total network bandwidth: collective data transfer rate of all links in the network
    - Bisection bandwidth: the bandwidth between two equal halves of the multiprocessor, measured for a worst-case split
  - Congestion delays (depending on traffic)
- Cost
- Power
- Routability in silicon

Parallel Benchmarks (1/2)
- Linpack: matrix linear algebra
  - DAXPY routine
  - Weak scaling
  - www.top500.org
- SPECrate: parallel runs of SPEC CPU programs
  - Job-level parallelism (no communication between the copies)
  - Throughput metric
  - Weak scaling
- SPLASH: Stanford Parallel Applications for Shared Memory
  - Mix of kernels and applications; strong scaling
- NAS (NASA Advanced Supercomputing) suite
  - Computational fluid dynamics kernels
  - Weak scaling
- PARSEC (Princeton Application Repository for Shared Memory Computers) suite
  - Multithreaded applications using Pthreads and OpenMP
  - Weak scaling

Parallel Benchmarks (2/2)
  [Table: summary of the parallel benchmark suites]

Concluding Remarks
- Goal: higher performance by using multiple processors
- Difficulties
  - Developing parallel software
  - Devising appropriate architectures
- Many reasons for optimism
  - Changing software and application environment
  - Chip-level multiprocessors with lower-latency, higher-bandwidth interconnect
- An ongoing challenge for computer architects!