Introduction to Parallel Processing
Shantanu Dutt, University of Illinois at Chicago

Acknowledgements
• Ashish Agrawal, IIT Kanpur, "Fundamentals of Parallel Processing" (slides), w/ some modifications and augmentations by Shantanu Dutt
• John Urbanic, "Parallel Computing: Overview" (slides), w/ some modifications and augmentations by Shantanu Dutt
• John Mellor-Crummey, "COMP 422 Parallel Computing: An Introduction", Department of Computer Science, Rice University (slides), w/ some modifications and augmentations by Shantanu Dutt

Outline
• The need for explicit multi-core/processor parallel processing:
  – Applications for parallel processing
  – Moore's Law and its limits
  – Different uni-processor performance enhancement techniques and their limits
• Overview of different applications
• Classification of parallel computations
• Classification of parallel architectures
• Examples of MIMD/SPMD parallel algorithms
• Summary
Some text from: Fund. of Parallel Processing, A. Agrawal, IIT Kanpur

Moore's Law & Need for Parallel Processing
• Chip performance doubles every 18–24 months; power consumption is proportional to frequency.
• Limits of serial computing: heating issues, limits to transmission speeds, leakage currents, limits to miniaturization.
• Multi-core processors are already commonplace, and most high-performance servers are already parallel.

Quest for Performance
• Pipelining
• Superscalar architecture
• Out-of-order execution
• Caches
• Instruction set design advancements
• Parallelism: multi-core processors, clusters, grids – this is the future

Pipelining
• Illustration of a pipeline using the fetch, load, execute, store stages.
• At the start of execution: wind-up. At the end of execution: wind-down.
• Pipeline stalls, due to data dependencies (RAW, WAR), resource conflicts, or incorrect branch prediction, hurt performance and speedup.
• Pipeline depth = number of stages, i.e., the number of instructions in execution simultaneously. Intel Pentium 4: 35 stages.
Top text from: Fundamentals of Parallel Processing, A. Agrawal, IIT Kanpur

Pipelining (cont'd)
• T_pipe(n), the pipelined time to process n instructions, = fill_time + n*max{t_i} ≈ n*max{t_i} for large n (fill_time is a constant w.r.t. n), where t_i = execution time of the i'th stage.
• The pipelined throughput is thus 1/max{t_i}.

Cache
• Desire for fast, cheap, and non-volatile memory. Memory speed grows at ~7% per annum while processor speed grows at ~50% p.a.
• Cache: a fast, small memory. L1 and L2 caches.
• Retrieval from main memory takes several hundred clock cycles; retrieval from the L1 cache takes on the order of one clock cycle, and from the L2 cache on the order of 10 clock cycles.
• Cache 'hit' and 'miss'. Prefetching is used to avoid cache misses at the start of program execution; cache lines are used to hide latency on a cache miss.
• Order of search: L1 cache -> L2 cache -> RAM -> disk.
• Cache coherency – correctness of data; important for distributed parallel computing.
• Limit to cache improvement: improving cache performance will at most improve efficiency to match processor efficiency.
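Aside: a minimal C sketch (added here, not from the original slides) that makes the cache-line point concrete. Summing a matrix in row-major order reuses every element of each fetched cache line, while column-major order touches a new line on almost every access; the measured gap reflects the memory-vs-cache latency ratio quoted above.

    /* Cache-friendly vs. cache-hostile traversal. Compile: gcc -O2 cache.c */
    #include <stdio.h>
    #include <time.h>

    #define N 2048
    static double a[N][N];          /* zero-initialized, ~32 MB */

    int main(void) {
        double sum = 0.0;
        clock_t t0 = clock();
        for (int i = 0; i < N; i++)     /* row-major: sequential in memory */
            for (int j = 0; j < N; j++)
                sum += a[i][j];
        clock_t t1 = clock();
        for (int j = 0; j < N; j++)     /* column-major: ~one miss per access */
            for (int i = 0; i < N; i++)
                sum += a[i][j];
        clock_t t2 = clock();
        printf("row-major: %.2fs  column-major: %.2fs  (sum=%g)\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC,
               (double)(t2 - t1) / CLOCKS_PER_SEC, sum);
        return 0;
    }

Hardware prefetchers also help the row-major loop, since its access pattern is sequential and therefore predictable.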
[Slides 10–16: figures on implicit parallelism mechanisms in uni-processors: instruction-level parallelism (its degree is generally low and depends on how the sequential code was written, so it is not very effective), single-instruction multiple-data units (examples of limited data parallelism), examples of limited and low-level functional parallelism, multi-threading, and simultaneous multithreading.]

Thus…: Two Fundamental Issues in Future High Performance
• Microprocessor performance improvement via various implicit and explicit parallelism schemes and technology improvements is reaching (has reached?) a point of diminishing returns.
• Thus we need the development of explicit parallel algorithms that are based on a fundamental understanding of the parallelism inherent in a problem, and that exploit that parallelism with minimum interaction/communication between the parallel parts.
Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur

Outline
• The need for explicit multi-core/processor parallel processing:
  – Applications for parallel processing
  – Moore's Law and its limits
  – Different uni-processor performance enhancement techniques and their limits
• Overview of different applications
• Classification of parallel computations
• Classification of parallel architectures
• Examples of MIMD/SPMD parallel algorithms
• Summary

[Slides 19–20: application figures.]

Computing and Design/CAD
• Designs of complex to very complex systems have almost become the norm in many areas of engineering, from chips with billions of transistors, to aircraft of various levels of sophistication (large fly-by-wire passenger aircraft to fighter planes), to complex engines, to buildings and bridges.
• An effective design process needs to explore the design space in smart ways (without being exhaustive, but also without leaving out useful design points) to optimize some metric (e.g., minimizing the power consumption of a chip) while satisfying tens to hundreds of constraints on others (e.g., on the speed and temperature profile of the chip).
• This is an extremely time-intensive process for large and complex designs, and it can benefit significantly from parallel processing.

Applications of Parallel Processing
[Slides 22–29: application figures.]

Outline
• The need for explicit multi-core/processor parallel processing:
  – Applications for parallel processing
  – Moore's Law and its limits
  – Different uni-processor performance enhancement techniques and their limits
• Overview of different applications
• Classification of parallel computations
• Classification of parallel architectures
• Examples of MIMD/SPMD parallel algorithms
• Summary and future advances

Classification of Parallel Computations
• Multiple tasks at once: distribute the work onto multiple execution units.
• A classification of parallelism: data parallelism, and functional or control parallelism.
• Data parallelism: divide the dataset and solve each sector "similarly" on a separate execution unit.
• Functional parallelism: divide the 'problem' into different tasks and execute the tasks on different units. What would functional parallelism look like for the example on the right (figure)?
• Hybrid: can do both. Say, first partition by data, and then for each data block, partition by functionality.
[Figure: data parallelism vs. sequential execution – a simplistic understanding.]
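As a concrete illustration of the two styles, here is a hypothetical C/pthreads sketch (all names invented here, not from the deck). The first phase is data-parallel: two threads run the same operation on different halves of an array. The second phase is functionally parallel: two threads run different tasks (sum and max) over the same data.

    #include <pthread.h>
    #include <stdio.h>

    #define N 8
    static double data[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    static double sum, max;

    /* Data parallelism: same work, different halves of the dataset. */
    static void *scale_half(void *arg) {
        int half = *(int *)arg;                  /* 0 = low half, 1 = high half */
        for (int i = half * N/2; i < (half + 1) * N/2; i++)
            data[i] *= 2.0;                      /* identical operation everywhere */
        return NULL;
    }

    /* Functional parallelism: different tasks, same data (reads only). */
    static void *task_sum(void *arg) {
        (void)arg; sum = 0;
        for (int i = 0; i < N; i++) sum += data[i];
        return NULL;
    }
    static void *task_max(void *arg) {
        (void)arg; max = data[0];
        for (int i = 1; i < N; i++) if (data[i] > max) max = data[i];
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        int halves[2] = {0, 1};
        pthread_create(&t[0], NULL, scale_half, &halves[0]);   /* data-parallel */
        pthread_create(&t[1], NULL, scale_half, &halves[1]);
        pthread_join(t[0], NULL); pthread_join(t[1], NULL);
        pthread_create(&t[0], NULL, task_sum, NULL);           /* functional */
        pthread_create(&t[1], NULL, task_max, NULL);
        pthread_join(t[0], NULL); pthread_join(t[1], NULL);
        printf("sum=%g max=%g\n", sum, max);
        return 0;
    }

A hybrid breakup would nest the two: partition the data first, then run different tasks within each partition.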
Functional Parallelism
• Example: Earth weather model (figure).
• Q: What would a data-parallel breakup look like for this problem?
• Q: How can a hybrid breakup be done?
Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur

Flynn's Classification
Flynn's classical taxonomy is based on the number of instruction/task and data streams:
• Single Instruction, Single Data streams (SISD): your single-core uni-processor PC.
• Single Instruction, Multiple Data streams (SIMD): special-purpose, low-granularity multi-processor machines w/ a single control unit relaying the same instruction to all processors (w/ different data) every clock cycle (e.g., nVIDIA graphics co-processors w/ 1000's of simple cores).
• Multiple Instruction, Single Data streams (MISD): pipelining is a major example.
• Multiple Instruction, Multiple Data streams (MIMD): the most prevalent model. SPMD (Single Program Multiple Data) is a very useful subset. Note that this is very different from SIMD. Why?
• Data vs. control parallelism is a classification independent of Flynn's.

Flynn's Classification (cont'd)
• Example machines: Thinking Machines CM 2000, nVIDIA GPUs.

[Slides 35–36: figures.]

Flynn's Classification (cont'd)
• Example machines: various current multicomputers (see the most recent list at http://www.top500.org/), and multi-core processors like the Intel i3, i5, i7 (all quad-core: 4 processors on a single chip).

[Slide 38: figure.]

Flynn's Classification (cont'd)
• Data parallelism: SIMD and SPMD fall into this category.
• Functional parallelism: MISD falls into this category.
• MIMD can incorporate both data and functional parallelism (the latter either at the instruction level, with different instructions executing across the processors at any given time, or at the level of high-level functions).

Outline
• The need for explicit multi-core/processor parallel processing:
  – Applications for parallel processing
  – Moore's Law and its limits
  – Different uni-processor performance enhancement techniques and their limits
• Overview of different applications
• Classification of parallel computations
• Classification of parallel architectures
• Examples of MIMD/SPMD parallel algorithms
• Summary

Parallel Arch. Classification
Multi-processor architectures:
• Distributed memory with message passing – the most prevalent architecture model for # processors > 8:
  – Indirect interconnection n/ws
  – Direct interconnection n/ws
• Shared memory:
  – Uniform Memory Access (UMA)
  – Non-Uniform Memory Access (NUMA) – distributed shared memory

Distributed Memory – Message-Passing Architectures
• Each processor P (with its own local cache C) is connected to exclusive local memory, i.e., no other CPU has direct access to it.
• Each node comprises at least one network interface (NI) that mediates the connection to a communication network.
• On each CPU runs a serial process that can communicate with other processes on other CPUs by means of the network.
• Blocking vs. non-blocking communication:
  – Blocking: computation stalls until the communication occurs/completes.
  – Non-blocking: if no communication has occurred/completed at the calling point, computation proceeds to the next instruction/statement (this will require later calls to the communication primitive until the communication occurs).
• Direct vs. indirect communication/interconnection networks. Example: a 2x4 mesh n/w (a direct interconnection n/w).
Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur
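A minimal sketch (added here, assuming MPI as the message-passing library, which the deck does not mandate) of the blocking vs. non-blocking distinction: MPI_Send/MPI_Recv complete before returning control at the receiver, while MPI_Irecv returns immediately and a later MPI_Wait is the point where the computation actually stalls if the data has not yet arrived.

    /* Blocking vs. non-blocking message passing. Run: mpirun -np 2 ./a.out */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, x = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            x = 42;
            MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);       /* blocking send */
        } else if (rank == 1) {
            MPI_Request req;
            MPI_Irecv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req); /* non-blocking */
            /* ... useful computation can overlap with the communication here ... */
            MPI_Wait(&req, MPI_STATUS_IGNORE);   /* stall only when x is needed */
            printf("rank 1 received x = %d\n", x);
        }
        MPI_Finalize();
        return 0;
    }

The overlap between the MPI_Irecv and the MPI_Wait is exactly where non-blocking communication buys performance.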
[Slides 43–44: interconnection-network figures.]

The ARGO Beowulf Cluster at UIC (http://accc.uic.edu/service/argo-cluster)
• Has 56 compute nodes/computers and a master node.
• "Master" here has a different meaning than the "master" node in a parallel algorithm: the cluster's master is generally a system front-end where you log in and perform various tasks before submitting your parallel code to run on several compute nodes. A parallel algorithm's master (e.g., the one we saw for the finite-element heat distribution problem) would actually be one of the compute nodes; it generally distributes data to the other compute nodes, monitors progress of the computation, determines the end of the computation, etc., and may additionally perform a part of the computation itself.
• The compute nodes are divided among 14 zones, each zone containing 4 nodes connected as a ring network. Zones are connected to each other by a higher-level n/w.
• Each node (compute or master) has 2 processors. On some nodes each processor is single-core, on others dual-core; see http://accc.uic.edu/service/arg/nodes

System Computational Actions in a Message-Passing Program
(a) Two basic parallel processes X and Y, with a data dependency on data item "b":
    Proc. X: a := b+c;
    Proc. Y: b := x*y;
(b) Their mapping to a message-passing multicomputer, with X on processor/core P(X) and Y on processor/core P(Y):
    Proc. X: recv(P2, b); /* blocking */ a := b+c;
    Proc. Y: b := x*y; send(P1, b); /* non-blocking */
    The message passing of data item "b" occurs over the link(s), direct or indirect, between the two processors.

Shared Memory Arch.: UMA
• Flat memory model: memory bandwidth and latency are the same for all processors and all memory locations.
• Simplest example: a dual-core processor.
• Most commonly represented today by Symmetric Multiprocessor (SMP) machines.
• Cache-coherent UMA: consistent cache values of the same data item in different processor/core caches.
[Figure: dual-core and quad-core processors with L1 and L2 caches.]
Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur

System Computational Actions in a Shared-Memory Program
(a) Two basic parallel processes X and Y, with a data dependency on data item "b":
    Proc. X: a := b+c;
    Proc. Y: b := x*y;
(b) Their mapping to a shared-memory multiprocessor:
• Possible actions by the O.S. for the reader X: (i) since "b" is a shared data item (e.g., designated so by the compiler or programmer), check "b"'s status bit to see if it has been written to (or, more generally, check a write counter to see if it has a new value since the last read); (ii) if so, read "b" and decrement the read counter for "b"; else go to (i) and busy-wait (check periodically).
• Possible actions by the O.S. for the writer Y: (i) since "b" is a shared data item, check "b"'s location to see if it can be written to (all reads done: the read counter for "b" = 0); (ii) if so, write "b" to its location and mark its status bit as written by Y (or increment its write counter if "b" will be written to multiple times by Y); (iii) initialize the read counter for "b" to a pre-determined value.
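A simplified sketch of the reader/writer handoff just described, using C11 atomics and POSIX threads. It collapses the slide's read/write counters into a single status flag for a one-reader, one-writer case, so it illustrates the idea rather than the full protocol; all names are invented here.

    /* Status-bit handoff of shared item "b". Compile: gcc -O2 -pthread flag.c */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    static int b;                        /* the shared data item "b" */
    static atomic_bool b_written;        /* the "status bit" of the slide */

    static void *writer_Y(void *arg) {
        (void)arg;
        b = 6 * 7;                                       /* b := x*y */
        atomic_store_explicit(&b_written, true,
                              memory_order_release);     /* mark as written */
        return NULL;
    }

    static void *reader_X(void *arg) {
        (void)arg;
        while (!atomic_load_explicit(&b_written,
                                     memory_order_acquire))
            ;                                            /* busy-wait, step (i) */
        printf("a = %d\n", b + 10);                      /* a := b+c, with c = 10 */
        return NULL;
    }

    int main(void) {
        pthread_t tx, ty;
        pthread_create(&tx, NULL, reader_X, NULL);
        pthread_create(&ty, NULL, writer_Y, NULL);
        pthread_join(tx, NULL);
        pthread_join(ty, NULL);
        return 0;
    }

The release/acquire pairing plays the role of cache coherency bookkeeping at the language level: the reader is guaranteed to see the written value of "b" once it sees the flag set.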
Distributed Shared Memory Arch.: NUMA
• Memory is physically distributed but logically shared: the physical layout is similar to the distributed-memory message-passing case, but the aggregated memory of the whole system appears as one single address space.
• Due to the distributed nature, memory access performance varies depending on which CPU accesses which parts of memory ("local" vs. "remote" access).
• Example: two locality domains linked through a high-speed connection called HyperTransport (in general via a link, as in message-passing architectures, only here the links are used by the O.S., not by the programmer, to transmit read/write non-local data between processors and non-local memory).
• Advantage: scalability (compared to UMAs).
• Disadvantages: (a) locality problems and connection congestion; (b) not a natural parallel programming/algorithm model (it is easier to partition data among processors than to think of all of it occupying a large monolithic address space that each processor can access).
[Figure: all-to-all (complete-graph) connection via a combination of direct and indirect connections.]
Most text from: Fundamentals of Parallel Processing, A. Agrawal, IIT Kanpur

Outline
• The need for explicit multi-core/processor parallel processing:
  – Applications for parallel processing
  – Moore's Law and its limits
  – Different uni-processor performance enhancement techniques and their limits
• Overview of different applications
• Classification of parallel computations
• Classification of parallel architectures
• Examples of MIMD/SPMD parallel algorithms
• Summary

An Example Parallel Algorithm for a Finite-Element Computation
• Easy parallel situation: each data part is independent, and no communication is required between the execution units solving two different parts. E.g., matrix multiplication.
• Next level: simple, structured, and sparse communication is needed. Example: the heat equation (more generally, a Poisson equation solver).
  – The initial temperature is zero on the boundaries and high in the middle; the boundary temperature is held at zero.
  – The calculation of an element depends on its neighbor elements.
• Data partition: data1, data2, ..., dataP, with data communication needed between processes working on adjacent data sets. Q: Is this a good data partition for N data elements (grid points) and P processors? Analysis?
• Serial code:
    repeat
      do y = 2, N-1
        do x = 2, M-1
          u2(x,y) = u1(x,y) + cx*[u1(x+1,y) + u1(x-1,y)] + cy*[u1(x,y+1) + u1(x,y-1)]  /* cx, cy are constants */
        enddo
      enddo
      u1 = u2
    until convergence (u1 ~ u2)
Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur

Master/worker pseudocode (the master can be one of the workers):
    find out if I am MASTER or WORKER
    if I am MASTER
      initialize array
      send each WORKER starting info and subarray
      do until all WORKERS converge
        gather from all WORKERS convergence data
        broadcast to all WORKERS convergence signal
      end do
      receive results from each WORKER
    else if I am WORKER
      receive from MASTER starting info and subarray
      do until solution converged {
        send (non-blocking?) neighbors my border info
        receive (non-blocking?) neighbors' border info
        update interior of my portion of the solution array (see the computation given in the serial code)
        wait for any incomplete non-blocking receive to complete, by busy-waiting or a blocking receive
        update border of my portion of the solution array
        determine if my solution has converged
        if so { send MASTER convergence signal; recv. from MASTER convergence signal }
      end do }
      send MASTER results
    endif
[Figure: the problem grid partitioned among the workers.]
Code from: Fundamentals of Parallel Processing, A. Agrawal, IIT Kanpur
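Below is a hedged C/MPI sketch (added here; it is not the slides' actual program) of one iteration of the worker pseudocode above, under assumed simplifications: a 1-D strip decomposition with ROWS interior rows per process, constant cx and cy, and the convergence test omitted. Each worker posts non-blocking receives for its neighbors' border rows, sends its own borders, updates its interior while communication is in flight, then waits and updates its border rows.

    #include <mpi.h>

    #define M    128          /* columns (assumed) */
    #define ROWS 64           /* interior rows owned by this rank (assumed) */

    /* One Jacobi step; rows 0 and ROWS+1 of u1 are ghost rows. */
    void jacobi_step(double u1[ROWS+2][M], double u2[ROWS+2][M],
                     double cx, double cy, int up, int down) {
        MPI_Request r[2];
        /* post non-blocking receives for the neighbors' border (ghost) rows */
        MPI_Irecv(u1[0],      M, MPI_DOUBLE, up,   0, MPI_COMM_WORLD, &r[0]);
        MPI_Irecv(u1[ROWS+1], M, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &r[1]);
        /* send my border rows to my neighbors */
        MPI_Send(u1[1],    M, MPI_DOUBLE, up,   0, MPI_COMM_WORLD);
        MPI_Send(u1[ROWS], M, MPI_DOUBLE, down, 0, MPI_COMM_WORLD);
        /* update interior rows: needs no ghost data, overlaps communication */
        for (int y = 2; y < ROWS; y++)
            for (int x = 1; x < M-1; x++)
                u2[y][x] = u1[y][x] + cx*(u1[y][x+1] + u1[y][x-1])
                                    + cy*(u1[y+1][x] + u1[y-1][x]);
        /* wait for the ghost rows, then update my two border rows */
        MPI_Waitall(2, r, MPI_STATUSES_IGNORE);
        for (int y = 1; y <= ROWS; y += ROWS-1)
            for (int x = 1; x < M-1; x++)
                u2[y][x] = u1[y][x] + cx*(u1[y][x+1] + u1[y][x-1])
                                    + cy*(u1[y+1][x] + u1[y-1][x]);
    }

    int main(int argc, char **argv) {
        static double u1[ROWS+2][M], u2[ROWS+2][M];
        int rank, np;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &np);
        /* MPI_PROC_NULL at the physical boundary makes sends/recvs no-ops,
           so the fixed boundary rows simply keep their values */
        int up   = (rank == 0)      ? MPI_PROC_NULL : rank - 1;
        int down = (rank == np - 1) ? MPI_PROC_NULL : rank + 1;
        /* ... initialize u1, then iterate until convergence (omitted) ... */
        jacobi_step(u1, u2, 0.1, 0.1, up, down);
        MPI_Finalize();
        return 0;
    }

The interior/border split mirrors the pseudocode's "update interior, wait, then update border" ordering, which is what lets computation hide the communication latency.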
[Slides 53–54: an example of an SPMD message-passing parallel program (code figures).]

• How to interconnect the multiple cores/processors is a major consideration in a parallel architecture.

[Slide 56: performance/power figures (Tflops, kW).]

Summary
• Serial computers/microprocessors will probably not get much faster; parallelization is unavoidable.
• Pipelining, caches, and other optimization strategies for serial computers are reaching a plateau; the heat wall has also been reached.
• Application examples.
• Data and functional parallelism; Flynn's taxonomy: SIMD, MISD, MIMD/SPMD.
• Parallel architectures intro:
  – Distributed memory (message passing)
  – Shared memory: Uniform Memory Access, Non-Uniform Memory Access (distributed shared memory)
• Parallel program/algorithm examples.
Most text from: Fund. of Parallel Processing, A. Agrawal, IIT Kanpur

Additional References
• Computer Organization and Design – Patterson & Hennessy
• Modern Operating Systems – Tanenbaum
• Concepts of High Performance Computing – Georg Hager & Gerhard Wellein
• Cramming More Components onto Integrated Circuits – Gordon Moore, 1965
• Introduction to Parallel Computing – https://computing.llnl.gov/tutorials/parallel_comp
• The Landscape of Parallel Computing Research: A View from Berkeley – 2006