Fundamental Issues in Parallel and Distributed Computing
Assaf Schuster, Computer Science, Technion

Serial vs. parallel program
• Serial: one instruction at a time (task = process)
• Parallel: multiple instructions at the same time

Levels of parallelism
• Transistors
• Hardware modules
• Architecture, pipeline
• Superscalar
• Multicore
• Symmetric multi-processors
• Hardware-connected multi-processors
• Tightly-coupled machine clusters
• Distributed systems
• Geo-distributed systems
• Internet-scale systems
• Interstellar systems

Flynn's HARDWARE taxonomy
• SIMD: lock-step execution; example: vector operations
• MIMD: example: multicores

Why parallel programming?
Most software will have to be written as parallel software. Why?

The free lunch...
• Wait 18 months and get a new CPU with 2x the speed
• Moore's law: the number of transistors per chip doubles roughly every two years (source: Wikipedia)

... is over
• More transistors = more execution units, BUT performance per unit does not improve
• Bad news: serial programs will not run faster
• Good news: parallelization has high performance potential; a parallel program may get faster on newer architectures

Parallel programming is hard
• Need to optimize for performance
• Understand the management of resources
• Identify bottlenecks
• No one technology fits all needs
• Zoo of programming models, languages, run-times
• Hardware architecture is a moving target
• Parallel thinking is NOT intuitive
• Parallel debugging is not fun
But there is no detour.

Parallelizing the Game of Life (a cellular automaton of pixels)
Given a 2D grid: v_t(i,j) = F(v_{t-1} of all of (i,j)'s neighbors)
[Figure: 5-point stencil — cell (i,j) and its neighbors (i-1,j), (i+1,j), (i,j-1), (i,j+1)]

Problem partitioning
• Domain decomposition (SPMD): by input domain, by output domain, or both
• Functional decomposition (MPMD): independent tasks, pipelining
We choose domain decomposition: the field is split between the processors (CPU 0 | CPU 1).

Issue 1: Memory
Can we access v(i+1,j) from CPU 0 as in the serial program?
It depends...
• YES: shared memory architecture — all CPUs see the same memory space
• NO: distributed memory architecture — disjoint memory spaces connected by a network

Someone has to pay
• Shared memory: easier to program, harder to build the hardware
• Distributed memory: harder to program, easier to build the hardware
Tradeoff: programmability vs. scalability

"Scalability" – the holy grail
Improving efficiency on larger problems when more processors are available.
• Single machine, multicore – vertical scalability (scale up)
• Multiple machines – horizontal scalability (scale out)

Memory architecture – latency in accessing data
For CPU 0: is the time to access v(i+1,j) equal to the time to access v(i-1,j)?

Hardware shared memory, flavor 1
• Uniform memory access (UMA): the same cost of accessing any data item from any processor
• SMP = Symmetric Multi-Processor

Hardware shared memory, flavor 2
• Non-uniform memory access (NUMA)
Tradeoff: scalability vs. latency

Software Distributed Shared Memory (SDSM)
• DSM daemons run alongside the processes on each node
• Software DSM: emulation of NUMA in software over a distributed memory space

Memory-optimized programming
• Most modern systems are NUMA or distributed
• The architecture is easier to scale up by scaling out
• Access-time difference, local vs. remote data: x100–10000
• Locality is the most important optimization parameter for program speed, serial or parallel

Issue 2: Control
• Can we assign one pixel per CPU?
• Can we assign one pixel per process/logical task?
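For concreteness, a minimal sketch of one grid update under domain decomposition on a shared-memory machine, assuming C with OpenMP; the rule F, the grid size, and the initial state below are placeholders, not taken from the lecture. Note that each thread is given a block of rows rather than one pixel per task — the next slides explain why.

```c
/* A minimal sketch, assuming C with OpenMP on a shared-memory machine.
   Compile with: gcc -fopenmp life.c */
#include <stdio.h>

#define N 1024

static int cur[N][N], nxt[N][N];     /* two generations of the grid */

/* Placeholder rule built on the 5-point stencil from the figure;
   the real Game of Life uses all 8 neighbors. */
static int F(int up, int down, int left, int right)
{
    return (up + down + left + right) >= 2;
}

static void step(void)
{
    /* Domain decomposition: OpenMP splits the i-loop (rows) among threads,
       so each thread owns a contiguous block of rows - good locality,
       and far fewer tasks than one per pixel. */
    #pragma omp parallel for schedule(static)
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            nxt[i][j] = F(cur[i - 1][j], cur[i + 1][j],
                          cur[i][j - 1], cur[i][j + 1]);
    /* Implicit barrier here: no one reads nxt before all rows are written. */
}

int main(void)
{
    cur[N / 2][N / 2] = 1;           /* arbitrary initial state */
    step();                          /* one generation */
    printf("%d\n", nxt[N / 2][N / 2]);
    return 0;
}
```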
Task management overhead
• Each task has a state that must be managed
• More tasks – more state to manage
• Who manages the tasks?
• How many tasks should each CPU run?
• ... it depends on the complexity of F: v(i,j) = F(all of v's neighbors)

Question
Every process reads the data from its neighbors. Will it produce correct results?

Issue 3: Synchronization
• The order of reads and writes made by different tasks is non-deterministic
• Synchronization is required to enforce the order
• Locks, semaphores, barriers, condition variables

Checkpoint: fundamental hardware-related issues
• Memory accesses – optimizing the locality of accesses
• Control – overhead
• Synchronization
Goals and tradeoffs: ease of programming, correctness, scalability

Parallel programming issues
We decide to split this 3x3 grid like this:
[Figure: the 3x3 grid split unevenly among four CPUs]
OK?

Issue 1: Load balancing
• The computation always waits for the slowest task
• Solutions?

Issue 2: Granularity
• G = computation / communication
• Fine-grain parallelism (small G): good load balancing, potentially high overhead
• Coarse-grain parallelism (large G): potentially bad load balancing, low overhead
• What granularity works for you? Must balance overhead against CPU utilization

Issue 3: Memory models
Which writes by a task are seen by which reads of the other tasks?
Initially a=0, b=0:
  Task 1: a=1; print(b)
  Task 2: b=1; print(a)
Printed: 0,0? 1,0? 1,1?

Memory consistency models
Example program:
  Pi: R V; W V,7;  R V; R V
  Pj: R V; W V,13; R V; R V
A consistency (memory) model is an "agreement" between the execution environment (hardware, OS, middleware) and the processes: the runtime guarantees to the application certain properties on the way values written to shared variables become visible to reads. This determines the memory model – what is valid and what is not.
Example execution:
  Pi: R V,0; W V,7;  R V,7;  R V,13
  Pj: R V,0; W V,13; R V,13; R V,7
Order of writes to V as seen by Pi: (1) W V,7; (2) W V,13
Order of writes to V as seen by Pj: (1) W V,13; (2) W V,7

Memory model: coherence
Coherence is the memory model in which the system guarantees to the program that the writes performed by the processes to each specific variable are viewed by all processes in the same order.
Example program:
  Pi: W V,7;  R V; R V
  Pj: W V,13; R V; R V
All valid executions under coherence:
  1. Pi: W V,7; R V,7;  R V,13    Pj: W V,13; R V,13; R V,13
  2. Pi: W V,7; R V,7;  R V,7     Pj: W V,13; R V,13; R V,7
  3. Pi: W V,7; R V,13; R V,13    Pj: W V,13; R V,13; R V,13
  4. Pi: W V,7; R V,7;  R V,7     Pj: W V,13; R V,7;  R V,7
  5. Pi: W V,7; R V,7;  R V,7     Pj: W V,13; R V,13; R V,13
The register property: the view of a process consists of the values it "sees" in its reads and the writes it performs. If a read of V in P that comes after P's own write W V,x sees a value different from x, then no later read of V in P may see x.

Tradeoff: memory model / programmability vs. scalability
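The a=0,b=0 question above can be written down concretely. A hedged sketch, assuming C11 atomics and POSIX threads (not code from the lecture): with relaxed ordering the pair 0,0 may be printed on weakly ordered hardware, while sequentially consistent operations rule it out.

```c
/* Sketch of the a=0,b=0 example, assuming C11 <stdatomic.h> and pthreads.
   Compile with: gcc -pthread memmodel.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int a = 0, b = 0;

static void *task1(void *arg)
{
    (void)arg;
    /* relaxed ordering: the compiler/hardware may reorder the store and load */
    atomic_store_explicit(&a, 1, memory_order_relaxed);
    printf("b = %d\n", atomic_load_explicit(&b, memory_order_relaxed));
    return NULL;
}

static void *task2(void *arg)
{
    (void)arg;
    atomic_store_explicit(&b, 1, memory_order_relaxed);
    printf("a = %d\n", atomic_load_explicit(&a, memory_order_relaxed));
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, task1, NULL);
    pthread_create(&t2, NULL, task2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* Possible outputs: (1,1), (1,0), (0,1) from interleaving alone;
       with relaxed ordering, even (0,0) is allowed by the memory model.
       Replacing memory_order_relaxed with memory_order_seq_cst forbids (0,0). */
    return 0;
}
```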
Summary
Parallelism and scalability do not come for free:
• Overhead
• Memory-aware access
• Synchronization
• Load balancing
• Granularity
• Memory models
Is it worth the trouble? It depends on how hard the above are vs. the potential gain. How would you know? Some tasks cannot be parallelized.

"Speedup" – yet another holy grail
Speedup = (time of the best available serial program) / (time of the parallel program)

An upper bound on the speedup: Amdahl's law
The sequential component limits the speedup. Split the program's serial time T_serial = 1 into:
• the ideally parallelizable fraction: A
• the fraction that cannot be parallelized: 1 - A
Ideal parallel time: T_parallel = A/#CPUs + (1 - A)
Speedup(#CPUs) = T_serial / T_parallel <= 1 / (A/#CPUs + (1 - A))
(A small numeric sketch appears at the end of this section.)

Bad news
[Figure: Amdahl's law speedup curves for several parallel fractions; source: Wikipedia]
So why do we need machines with thousands of CPUs?
• The larger the problem, the smaller its serial part, and the closer the simulation gets to the best serial program.

Super-linear speedups
• Do they exist?
True super-linear speedups
• Suppose the data does not fit the cache of a single CPU
• But after domain decomposition, each piece does

Debugging parallel programs

General-purpose computing on graphics processing units (GPU)
• NVIDIA's Streaming Multiprocessor
• CUDA = Compute Unified Device Architecture
• NVIDIA's Kepler: 2 tera-instructions/sec (real)
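Returning to Amdahl's law above: a small numeric sketch in C (the values of A below are hypothetical, chosen only for illustration) of how the serial fraction caps the speedup no matter how many CPUs are added.

```c
/* Amdahl's law: speedup(#CPUs) <= 1 / (A/#CPUs + (1 - A)),
   where A is the parallelizable fraction of the serial run time.
   The A values below are hypothetical; illustration only. */
#include <stdio.h>

static double amdahl_bound(double A, int cpus)
{
    return 1.0 / (A / cpus + (1.0 - A));
}

int main(void)
{
    const double A[] = {0.50, 0.90, 0.95, 0.99};
    const int cpus[] = {2, 8, 64, 1024};

    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            printf("A = %.2f, %4d CPUs -> speedup <= %6.1f\n",
                   A[i], cpus[j], amdahl_bound(A[i], cpus[j]));

    /* Even at A = 0.95 the bound saturates near 1/(1-A) = 20x:
       the serial 5% dominates once enough CPUs are available. */
    return 0;
}
```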