ALL LECTURE SLIDES Lecture 1 Why parallel? 1 Why we need ever-increasing performance? 1 Why we’re building ever-increasing performance? 1 The solution of parallel 1 Why we need to write parallel programs 1 Approaches to the serial problem 1 More problems + Example 1 Analysis of example #1 and #2 2 Better parallel algorithm 2 Multiple cores forming a global sum 2 How do we write parallel programs? Task Parallelism and Data Parallelism 2 Division of work: Task Parallelism and Data Parallelism 3 Coordination 3 Types of parallel systems: Shared-memory and Distrubuted Memory 3 Terminology of Concurrent, Parallel and Distributed computing 2 Concluding remarks of lecture 1 3 Lecture 2 & 3 The von Neumann Architecture 4 Main Memory 4 CPU(Central Processing Unit) and its parts 4 Register 4 Program Counter 4 Bus 4 Process and its components 4 Multitasking 4 Threading 4 A process and two threads figure 5 Basics of caching 5 Principle of locality: Spatial locality and Temporal locality 5 Levels of Cache. Cache hit. Cache miss 5 Issues with cache(Write-through and Write-back) 5 Cache mappings (Full associative, Directed mapped, n-way set associative) 5 A table of 16- line main memory to a 4-line cache 5 Caches and programs code and table 6 Virtual memory 6 Virtual page numbers 6 Page table with Virtual Address divided into Virtual Page Number and Byte Offset TLB (Translation-lookaside buffer) 6 ILP (Instruction Level Parallelism) 6 Pipeling Example #1 6 Pipeling Example #2 7 Pipeling in general + table 7 Multiple Issues (Static multiple issue, Dynamic multiple issue) 7 Speculation 7 Hardware multithreading (SMT, Fine-grained and Coarse-grained) 7 Flynn’s Taxonomy (MISD, MIMD, SISD, SIMD) 8 SIMD and its drawbacks 8 What if we don’t have as many ALUs as data items? 
8 Vector Processors in general (the pros, the cons) 8 GPUs (Graphics Processing Units) 9 MIMD 9 Shared Memory System 9 UMA multicore System 9 NUMA multicore System 9 Distributed Memory System 9 Interconnection network 9 Shared Memory Interconnects (Bus and Switch interconnects, Crossbar) 10 Distributed Memory Interconnects (Direct and Indirect interconnections) 10 6 Bisection width 10 Figures about “Two bisections of a ring” and “A bisection of toroidal mesh” 10 Bandwidth 11 Bisection bandwidth 11 Hypercube 11 Indirect interconnects 11 Figures about “Crossbar interconnect for distributed memory” and “Omega network” 11 A switch in an omega network 11 More definition (Latency, Bandwidth) 12 Cache coherence 12 A shared memory system with two cores and two caches 12 Snooping Cache Coherence 12 Directory Based Cache Coherence 12 The burden is on software 12 SPMD 12 Writing Parallel Programs 13 Shared Memory (Dynamic and Static Threads) 13 Nondeterminism 13 Codes about “Busy-waiting”, “message-passing”, “Partitioned Global Address Space Languages” 13 Input and Output 14 Sources of Overhead in Parallel Programs (Interprocess interaction, Idling, Excess Computation) 14 Perfomance Metrics for Parallel Systems: Execution Time 14 Speedup (Speedup of a parallel program, speedup bounds) 14 Superlinear Speedups 15 Performance Metrics: Efficiency 15 Efficiency of a parallel program 15 Speedups and efficiencies of a parallel program or on different problem sizes 15 Effect of overhead 16 Amdahl’s Law 16 Proof for Traditional Problems 16 Proof of Amdahl’s Law and Examples 16 Limitations of Amdahl’s Law 16 Scalability 16 Taking Timing codes 17 Foster’s methodology(Partitioning, Communication, Agglomeration, Mapping) 17 Histogram Example 17 Serial program input and output 17 First two stages of Foster’s Methodology 17 Alternative definition of tasks and communication 18 Adding the local arrays 18 Concluding Remarks 18 Lecture 4 Corresponding MPI functions 18 Topologies 18 Linear model of communication overhead 18 Contention 18 One-to-all broadcast (Input, Output, Ring, Mesh, Hypercube, Algorithm) 19 All-to-one reduction (Input, Output, Algorithm) 20 All-to-all broadcast (Input, Output, Ring, Mesh, Hypercube, summary) 20-21 All-to-all reduction (Input, Output, Algorithm) 21 All-reduce (Input, Output, Algorithm) 22 Prefix Sum (Input, Output, Algorithm) 22 Scatter (Input, Output, Algorithm) 22 Gather (Input, Output, Algorithm) 23 All-to-all personalized (Input, Output, Summary, Hypercube algorithm, E-cube routing) 23-24 Hypercube Time for all topologies 24 Improved one-to-all broadcast (Scatter, All-to-all broadcast, Time analysis) 24-25 Improved all-to-one reduction (All-to-all reduction, Gather, Time Analysis) 25 Improved all-reduce (All-to-all reduction, Gather, Scatter, All-to-all broadcast, Time analysis) 25 Lecture 5 & 6 POSIX®Threads 26 Caveat 26 Hello World 1, 2 and 3 26 Compiling a Pthread program 26 Running a Pthread program 26 Global variables 26 Starting the Threads 26 pthread_t objects 26 A closer look 1 and 2 27 Function started by pthread_create 27 Running the Threads 27 Stopping the Threads 27 Serial pseudo-code 27 Using 3 Pthreads 27 Pthreads matrix-vector multiplication 27 Estimating π and a thread function for computing π 28 Busy-Waiting 28 Possible race condition 28 Pthreads global sum with busy-waiting 28 Global sum function with critical section after loop 28 Mutexes 29 Global sum function that uses a mutex 29 Issues 29 Problems with a mutex solution 30 A first attempt at sending messages 
using pthreads 30 Syntax of the various Semaphore functions 30 Barriers + (Using barriers to time the slowest threads and for debugging) Busy-Waiting and a Mutex 30 Implementing a barrier with Semaphores 30 Condition Variables 31 Implementing a barrier with conditional variables 31 Linked Lists (+Linked Lists Membership) 31 Inserting a new node into a list 31 Deleting a node from a linked list 31 A Multi-Threaded Linked List 32 Simultaneous access by two threads 32 Solution #1 and #2 and their issues 32 Implementation of Member with one mutex per list node 32 Pthreads Read-Write Locks 32 Protecting our Linked List functions 33 Linked List Perfomance 33 Caches, Cache-Coherence, and False Sharing 33 Pthreads matrix-vector multiplication 33 Thread-Safety (+Example) 33 Simple Approach 34 The strtok function 34 Multi-threaded tokenizer 34 Running with one thread 34 Running with two threads 34 Other unsafe C library functions 34 “re-entrant” (thread safe) functions 35 Concluding remarks about lecture 5&6 35 Lecture 7 & 8 OpenMP 35 Pragmas 35 OpenMP pragmas 36 A process forking and joining two threads clause 36 Of note… 36 36 30 Some terminology 36 In case the compiler doesn’t support OpenMD 36 The Trapezoid Rule in general 36-38 Scope and Scope in OpenMD 37 The Reduction Clause 37 Reduction Operators 38 Parallel for 38 Legal forms for parallelizable for statements 38 Caveats 38 Data dependencies 39 Estimating π with OpenMD solution 39 The default clause 39 Bubble Sort 39 Serial Odd-Even Transposition Sort 39 First OpenMD Odd-Even Sort 39 Second OpenMD Odd-Even Sort 40 Scheduling loops and their Results 40 The Schedule Clause (Default schedule, Cyclic schedule) 40 Schedule(type, chunksize, static, dynamic or guided, auto, runtime) The Static Schedule Type 41 The Dynamic Schedule Type 41 The Guided Schedule Type 41 The Runtime Schedule Type 41 Queues 41 Message-Passing 41 Sending Messages 41 Receiving Messages 41 Termination Detection 42 Startup 42 The Atomic Directive 42 Critical Sections 42 Locks (+Using Locks in the Message-Passing Programs) 42 Some Caveats 43 Matrix-vector multiplication 43 Thread-Safety 43 Concluding Remarks about Lecture 7 & 8 43 Lecture 9 & 10 Identifying MPI processes 43 Our first MPI program 43 MPI Compilation 44 MPI Execution 44 MPI Programs 44 MPI Components 44 Basic Outline 44 Communicators 44 SPMD 44 Communication 45 Data Types 45 Message Matching 45 Receiving messages 45 status_p argument 45 How much data am I receiving? 
45 The Trapezoidal Rule in MPI 45 Parallelizing the Trapezoidal Rule 46 Parallel pseudo-code 46 One trapezoid 46 Pseudo-code for a serial program 46 40 Tasks and communications for Trapezoidal Rule 46 First Version (1&2&3) for Trapezoidal 46 Dealing with I/O 46 Input 47 Running with 6 processes 47 Function for reading user input 47 Tree-structured communication 47 A tree-structured global sum (+an alternate one) 47 MPI_Reduce 48 Predefined reduction operators in MPI 48 Collective vs Point-to-Point Communications 48 MPI_Reduce examples 48 MPI_Allreduce 48 A global sum followed by distribution of the result 49 A butterfly structured global sum 49 Broadcast (+ A tree structured broadcast) 49 A version of Get_input that uses MPI_Bcast 49 Data distributions 49 Serial implementation of vector addition 49 Different partitions of a 12-component vector among 3 processes 49 Partitioning options (Block, Cyclic, Block-cyclic partitioning ) 50 MPI_Scatter 50 Parallel implementation of vector addition 50 MPI Gather 50 Reading and distributing a vector 50 Print a distributed vector 50 Allgather 51 Matrix-vector multiplication 51 Multiply a matrix by a vector 51 C style arrays 51 Serial Matrix-vector multiplication 51 An MPI matrix-vector multiplication function 51 Derived datatypes 51 MPI_Type create_struct 52 MPI_Get_address 52 MPI_Type_commit 52 MPI_Type_free 52 Get input function with a derived datatype 52 Elapsed parallel time 52 Elapsed serial time 52 MPI_Barrier 53 Scalability 53 Speedup formula 53 Efficiency formula 53 A parallel sorting algorithm sort 53 Odd-even transposition sort (+Example) 53 Serial odd-even transposition sort 54 Communications among tasks in odd-even sort 54 Parallel odd-even transposition sort (+Pseudocode) 54 Compute_partner 54 Safety in MPI programs 54 MPI_Ssend 55 Restructuring communication 55 MPI_Sendrecv 55 Safe communication with five processes 55 Parallel odd-even transposition sort with code 55 Concluding Remarks about Lecture 9&10 56 1 Why Parallel? Instead of designing and building faster microprocessors, we put multiple processors on a single integrated circuit. Why we need ever-increasing performance? Computational power is increasing, but so are our computation problems and needs. Problems we never dreamed of have been solved because of past increases, such as decoding the human genome. More complex problems are still waiting to be solved. Why we’re building parallel systems? Smaller transistors = faster processors. Faster processors = increased power consumption. Increased power consumption = increased heat. Increased heat = unreliable processors. The solution of parallel Move away from single-core systems to multicore processors. “core” = central processing unit (CPU) Why we need to write parallel programs? Running multiple instances of a serial program often isn’t very useful. What you really want is for it to run faster. Approaches to the serial problem Rewrite serial programs so that they’re parallel. Write translation programs that automatically convert serial programs into parallel programs. o This is very difficult to do. o Success has been limited. More problems Some coding constructs can be recognized by an automatic program generator, and converted to a parallel construct. However, it’s likely that the result will be a very inefficient program. Sometimes the best parallel solution is tostep back and devise an entirely new algorithm. 
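The example developed next refers to a serial global sum in which each core's calls to a function Compute_next_value are accumulated, and to a per-core private variable my_sum. As a point of reference, here is a minimal, hypothetical sketch of both patterns; the dummy Compute_next_value body, the value of n, and the rank/block values are illustrative assumptions and not part of the original slides.

#include <stdio.h>

/* Stand-in for the per-item computation in the example below;
   the slides leave this computation unspecified. */
static double Compute_next_value(int i) {
    return (double)(i % 10);
}

int main(void) {
    const int n = 24;                 /* number of values (arbitrary here) */

    /* Serial pattern: one core accumulates everything. */
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += Compute_next_value(i);

    /* Conceptual parallel pattern: a core with rank my_rank out of p cores
       sums only its own block into a private my_sum; the per-core partial
       sums are then combined as described in the example that follows. */
    int p = 4, my_rank = 0;           /* illustrative values only */
    int block = n / p;
    int my_first_i = my_rank * block;
    int my_last_i  = my_first_i + block;
    double my_sum = 0.0;
    for (int my_i = my_first_i; my_i < my_last_i; my_i++)
        my_sum += Compute_next_value(my_i);

    printf("sum = %.1f, my_sum (rank 0) = %.1f\n", sum, my_sum);
    return 0;
}

The question the example then addresses is how the private my_sum values are combined into a single global sum.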
Example After each core completes execution of the code, is a private variable my_sum contains the sum of the values computed by its calls to Compute_next_value. Once all the cores are done computing their private my_sum, they form a global sum by sending results to a designated “master” core which adds the final result. 2 Example #2 Analysis In the first example, the master core performs 7 receives and 7 additions. In the second example, the master core performs 3 receives and 3 additions. The improvement is more than a factor of 2! The difference is more dramatic with a larger number of cores. If we have 1000 cores: o The first example would require the master to perform 999 receives and 999 additions. o The second example would only require 10 receives and 10 additions. That’s an improvement of almost a factor of 100! Better parallel algorithm Don’t make the master core do all the work. Share it among the other cores. Pair the cores so that core 0 adds its result with core 1’s result. Core 2 adds its result with core 3’s result, etc. Work with odd and even numbered pairs of cores. Repeat the process now with only the evenly ranked cores. Core 0 adds result from core 2. Core 4 adds the result from core 6, etc. Now cores divisible by 4 repeat the process, and so forth, until core 0 has the final result Multiple cores forming a global sum How do we write parallel programs? Task parallelism o Partition various tasks carried out solving the problem among the cores. Data parallelism o Partition the data used in solving the problem among the cores. o Each core carries out similar operations on it’s part of the data. 3 Division of work – data parallelism Division of work –task parallelism Coordination Cores usually need to coordinate their work. Communication – one or more cores send their current partial sums to another core. Load balancing – share the work evenly among the cores so that one is not heavily loaded. Synchronization – because each core works at its own pace, make sure cores do not get too far ahead of the rest Types of parallel systems Shared-memory o The cores can share access to the computer’s memory. o Coordinate the cores by having them examine and update shared memory locations. Distributed-memory o Each core has its own, private memory. o The cores must communicate explicitly by sending messages across a network Terminology Concurrent computing – a program is one in which multiple tasks can be in progress at any instant. Parallel computing – a program is one in which multiple tasks cooperate closely to solve a problem Distributed computing – a program may need to cooperate with other programs to solve a problem Concluding Remarks The laws of physics have brought us to the doorstep of multicore technology. Serial programs typically don’t benefit from multiple cores. Automatic parallel program generation from serial program code isn’t the most efficient approach to get high performance from multicore computers. Learning to write parallel programs involves learning how to coordinate the cores. Parallel programs are usually very complex and therefore, require sound program techniques and development. 4 The von Neumann Architecture Main Memory: This is a collection of locations, each of which is capable of storing both instructions and data. Every location consists of an address, which is used to access the location, and the contents of the location. Central processing unit (CPU) Divided into two parts. 
Control unit - responsible for deciding which instruction in a program should be executed. (the boss) Arithmetic and logic unit (ALU) -responsible for executing the actual instructions. (the worker) Register – very fast storage, part of the CPU. Program counter – stores address of the next instruction to be executed. Bus – wires and hardware that connects the CPU and memory An operating system “process” An instance of a computer program that is being executed. Components of a process: o The executable machine language program. o A block of memory. o Descriptors of resources the OS has allocated to the process. o Security information. o Information about the state of the process Multitasking Gives the illusion that a single processor system is running multiple programs simultaneously. Each process takes turns running. (time slice) After its time is up, it waits until it has a turn again. (blocks) Threading Threads are contained within processes. They allow programmers to divide their programs into (more or less) independent tasks. The hope is that when one thread blocks because it is waiting on a resource, another will have work to do and can run. 5 A process and two threads Basics of caching A collection of memory locations that can be accessed in less time than some other memory locations. A CPU cache is typically located on the same chip, or one that can be accessed much faster than ordinary memory Principle of locality Accessing one location is followed by an access of a nearby location. Spatial locality – accessing a nearby location. Temporal locality – accessing in the near future. Levels of Cache Cache hit Cache miss Issues with cache When a CPU writes data to cache, the value in cache may be inconsistent with the value in main memory. Write-through caches handle this by updating the data in main memory at the time it is written to cache. Write-back caches mark data in the cache as dirty. When the cache line is replaced by a new cache line from memory, the dirty line is written to memory. Cache mappings Full associative – a new line can be placed at any location in the cache. Direct mapped – each cache line has a unique location in the cache to which it will be assigned. n-way set associative – each cache line can be place in one of n different locations in the cache. When more than one line in memory can be mapped to several different locations in cache we also need to be able to decide which line should be replaced or evicted. Example Table: Assignments of a 16-line main memory to a 4-line cache 6 Caches and programs Virtual memory If we run a very large program or a program that accesses very large data sets, all of the instructions and data may not fit into main memory. Virtual memory functions as a cache for secondary storage. It exploits the principle of spatial and temporal locality. It only keeps the active parts of running programs in main memory. Swap space - those parts that are idle are kept in a block of secondary storage. Pages – blocks of data and instructions. o Usually these are relatively large. o Most systems have a fixed page size that currently ranges from 4 to 16 kilobytes. Virtual page numbers When a program is compiled its pages are assigned virtual page numbers. When the program is run, a table is created that maps the virtual page numbers to physical addresses. A page table is used to translate the virtual address into a physical address. 
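The page-table discussion above splits a virtual address into a virtual page number and a byte offset. A minimal sketch of that split, assuming a 4 KiB page size (the notes say typical page sizes range from 4 to 16 KB); the example address is arbitrary.

#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* Assume a 4 KiB page: 4096 = 2^12 bytes, so 12 offset bits. */
    const uint64_t PAGE_SIZE   = 4096;
    const unsigned OFFSET_BITS = 12;

    uint64_t virtual_addr = 0x00403a7f;   /* arbitrary example address */

    /* The low OFFSET_BITS bits are the byte offset within the page;
       the remaining high bits are the virtual page number that the
       page table (or the TLB) maps to a physical frame. */
    uint64_t byte_offset         = virtual_addr & (PAGE_SIZE - 1);
    uint64_t virtual_page_number = virtual_addr >> OFFSET_BITS;

    printf("address 0x%llx -> virtual page number %llu, byte offset %llu\n",
           (unsigned long long) virtual_addr,
           (unsigned long long) virtual_page_number,
           (unsigned long long) byte_offset);
    return 0;
}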
Page Table: the virtual address is divided into a Virtual Page Number and a Byte Offset.

Translation-lookaside buffer (TLB)
Using a page table has the potential to significantly increase each program's overall run-time. The TLB is a special address translation cache in the processor. It caches a small number of entries (typically 16–512) from the page table in very fast memory.
Page fault – attempting to access a valid physical address for a page in the page table, but the page is only stored on disk.

Instruction Level Parallelism (ILP)
Attempts to improve processor performance by having multiple processor components or functional units simultaneously executing instructions.
Pipelining – functional units are arranged in stages.
Multiple issue – multiple instructions can be simultaneously initiated.

Pipelining Example #1: add the floating point numbers 9.87×10⁴ and 6.54×10³.

Pipelining Example #2
Assume each operation takes one nanosecond (10⁻⁹ seconds). The example's loop of 1000 floating point additions (7 operations each) then takes about 7000 nanoseconds.

Pipelining
Divide the floating point adder into 7 separate pieces of hardware or functional units. The first unit fetches two operands, the second unit compares exponents, etc. The output of one functional unit is the input to the next. One floating point addition still takes 7 nanoseconds, but 1000 floating point additions now take only 1006 nanoseconds!
Pipelined Addition Table: numbers in the table are subscripts of operands/results.

Multiple Issue
Multiple issue processors replicate functional units and try to simultaneously execute different instructions in a program.
Static multiple issue – functional units are scheduled at compile time.
Dynamic multiple issue – functional units are scheduled at run-time (superscalar).

Speculation
In order to make use of multiple issue, the system must find instructions that can be executed simultaneously. In speculation, the compiler or the processor makes a guess about an instruction, and then executes the instruction on the basis of the guess.

Hardware multithreading
There aren't always good opportunities for simultaneous execution of different threads. Hardware multithreading provides a means for systems to continue doing useful work when the task currently being executed has stalled.
o Ex.: the current task has to wait for data to be loaded from memory.
Fine-grained – the processor switches between threads after each instruction, skipping threads that are stalled.
o Pros: potential to avoid wasted machine time due to stalls.
o Cons: a thread that's ready to execute a long sequence of instructions may have to wait to execute every instruction.
Coarse-grained – only switches threads that are stalled waiting for a time-consuming operation to complete.
o Pros: switching threads doesn't need to be nearly instantaneous.
o Cons: the processor can be idled on shorter stalls, and thread switching will also cause delays.
Simultaneous multithreading (SMT) – a variation on fine-grained multithreading. Allows multiple threads to make use of the multiple functional units.

Flynn's Taxonomy

SIMD: Single instruction stream, Multiple data stream
Parallelism is achieved by dividing data among the processors and applying the same instruction to multiple data items. Called data parallelism.
Example: What if we don't have as many ALUs as data items? Divide the work and process iteratively. Ex.: m = 4 ALUs and n = 15 data items.

SIMD drawbacks
All ALUs are required to execute the same instruction, or remain idle.
In classic design, they must also operate synchronously. The ALUs have no instruction storage. Efficient for large data parallel problems, but not other types of more complex parallel problems. Vector processors Operate on arrays or vectors of data while conventional CPU’s operate on individual data elements or scalars. Vector registers: Capable of storing a vector of operands and operating simultaneously on their contents. Vectorized and pipelined functional units: The same operation is applied to each element in the vector (or pairs of elements). Vector instructions: Operate on vectors rather than scalars. Interleaved memory. o Multiple “banks” of memory, which can be accessed more or less independently. o Distribute elements of a vector across multiple banks, so reduce or eliminate delay in loading/storing successive elements. Strided memory access and hardware scatter/gather. o The program accesses elements of a vector located at fixed intervals. The Pros of Vector Processors Fast. Easy to use. Vectorizing compilers are good at identifying code to exploit. Compilers also can provide information about code that cannot be vectorized. o Helps the programmer re-evaluate code. High memory bandwidth. Uses every item in a cache line. The Cons of Vector Processors They don’t handle irregular data structures as well as other parallel architectures. A very finite limit to their ability to handle ever larger problems. (scalability) 9 Graphics Processing Units (GPUs) Real time graphics application programming interfaces or API’s use points, lines, and triangles to internally represent the surface of an object. A graphics processing pipeline converts the internal representation into an array of pixels that can be sent to a computer screen. Several stages of this pipeline (called shader functions) are programmable. Typically just a few lines of C code. Shader functions are also implicitly parallel, since they can be applied to multiple elements in the graphics stream. GPU’s can often optimize performance by using SIMD parallelism. The current generation of GPU’s use SIMD parallelism. Although they are not pure SIMD systems. MIMD: Multiple instruction stream Multiple data stream Supports multiple simultaneous instruction streams operating on multiple data streams. Typically consist of a collection of fully independent processing units or cores, each of which has its own control unit and its own ALU. Shared Memory System A collection of autonomous processors is connected to a memory system via an interconnection network. Each processor can access each memory location. The processors usually communicate implicitly by accessing shared data structures. Most widely available shared memory systems use one or more multicore processors. (multiple CPU’s or cores on a single chip) UMA multicore system NUMA multicore system UMA: Time to access all the memory locations will be the same for all the cores. NUMA: A memory location a core is directly connected to can be accessed faster than a memory location that must be accessed through another chip. Distributed Memory System Clusters (most popular): A collection of commodity systems. Connected by a commodity interconnection network. Nodes of a cluster are individual computations units joined by a communication network. Interconnection network Affects performance of both distributed and shared memory systems. 
Two categories: Shared memory interconnects Distributed memory interconnect 10 Shared memory interconnects Bus interconnect: A collection of parallel communication wires together with some hardware that controls access to the bus. Communication wires are shared by the devices that are connected to it. As the number of devices connected to the bus increases, contention for use of the bus increases, and performance decreases. Switched interconnect: Uses switches to control the routing of data among the connected devices. Crossbar: Allows simultaneous communication among different devices. Faster than buses. But the cost of the switches and links is relatively high. BELOW WE HAVE CROSSBARS A crossbar switch connecting 4 processors (Pi) and 4 memory modules (Mj) Configuration of internal switches in a crossbar Simultaneous memory accesses by the processors Distributed memory interconnects Two groups Direct interconnect: Each switch is directly connected to a processor memory pair, and the switches are connected to each other. Indirect interconnect: Switches may not be directly connected to a processor. DIRECT INTERCONNECT -> Bisection width: A measure of “number of simultaneous communications” or “connectivity”. Two bisections of a ring A bisection of a toroidal mesh 11 Bandwidth: The rate at which a link can transmit data. Usually given in megabits or megabytes per second. Bisection bandwidth: A measure of network quality. Instead of counting the number of links joining the halves, it sums the bandwidth of the links. Fully connected network: Each switch is directly connected to every other switch. Hypercube Highly connected direct interconnect. Built inductively: o A one-dimensional hypercube is a fully-connected system with two processors. o A two-dimensional hypercube is built from two onedimensional hypercubes by joining “corresponding” switches. o Similarly a three-dimensional hypercube is built from two two-dimensional hypercubes. Hypercubes Indirect interconnects Simple examples of indirect networks: o Crossbar o Omega network Often shown with unidirectional links and a collection of processors, each of which has an outgoing and an incoming link, and a switching network. A generic indirect network Crossbar interconnect for distributed memory A switch in an omega network Omega network 12 More definitions: Any time data is transmitted, we’re interested in how long it will take for the data to reach its destination. Latency: The time that elapses between the source’s beginning to transmit the data and the destination’s starting to receive the first byte. Bandwidth: The rate at which the destination receives data after it has started to receive the first byte. Cache coherence: Programmers have no control over caches and when they get updated. A shared memory system with two cores and two caches: Snooping Cache Coherence The cores share a bus . Any signal transmitted on the bus can be “seen” by all cores connected to the bus. When core 0 updates the copy of x stored in its cache it also broadcasts this information across the bus. If core 1 is “snooping” the bus, it will see that x has been updated and it can mark its copy of x as invalid. Directory Based Cache Coherence Uses a data structure called a directory that stores the status of each cache line. When a variable is updated, the directory is consulted, and the cache controllers of the cores that have that variable’s cache line in their caches are invalidated. The burden is on software Hardware and compilers can keep up the pace needed. 
From now on… In shared memory programs: o Start a single process and fork threads. o Threads carry out tasks. In distributed memory programs: o Start multiple processes. o Processes carry out tasks. SPMD – single program multiple data A SPMD programs consists of a single executable that can behave as if it were multiple different programs through the use of conditional branches. 13 Writing Parallel Programs 1. Divide the work among theprocesses/threads (a) so each process/threadgets roughly the same amount of work (b) and communication isminimized. 2. Arrange for the processes/threads to synchronize. 3. Arrange for communication among processes/threads. Shared Memory Dynamic threads o Master thread waits for work, forks new threads, and when threads are done, they terminate o Efficient use of resources, but thread creation and termination is time consuming. Static threads o Pool of threads created and are allocated work, but do not terminate until cleanup. o Better performance, but potential waste of system resources. Nondeterminism my_val = Compute_val ( my_rank ) ; x += my_val ; Nondeterminism Race condition Critical section Mutually exclusive Mutual exclusion lock (mutex, or simply lock) busy-waiting Partitioned Global Address Space Languages message-passing 14 Input and Output In distributed memory programs, only process 0 will access stdin. In shared memory programs, only the master thread or thread 0 will access stdin. In both distributed memory and shared memory programs all the processes/threads can access stdout and stderr. However, because of the indeterminacy of the order of output to stdout, in most cases only a single process/thread will be used for all output to stdout other than debugging output. Debug output should always include the rank or id of the process/thread that’s generating the output. Only a single process/thread will attempt to access any single file other than stdin, stdout, or stderr. So, for example, each process/thread can open its own, private file for reading or writing, but no two processes/threads will open the same file. Sources of Overhead in Parallel Programs: If I use two processors, shouldn't my program run twice as fast? No - a number of overheads cause degradation in performance, like: excess computation, communication, idling, contention. The execution profile of a hypothetical parallel program executing on eight processing elements. Profile indicates times spent performing computation (both essential and excess), communication, and idling. Interprocess interactions: Processors working on any non-trivial parallel problem will need to talk to each other. Idling: Processes may idle because of load imbalance, synchronization, or serial components. Excess Computation: This is computation not performed by the serial version. This might be because the serial algorithm is difficult to parallelize, or that some computations are repeated across processors to minimize communication. Performance Metrics for Parallel Systems: Execution Time Serial runtime of a program is the time elapsed between the beginning and the end of its execution on a sequential computer. The parallel runtime is the time that elapses from the moment the first processor starts to the moment the last processor finishes execution. 
We denote the serial runtime by Ts or Tserial and the parallel runtime by Tp or Tparallel.

Speedup
Number of cores = p, serial run-time = Tserial, parallel run-time = Tparallel. If Tparallel = Tserial / p we have linear speedup.

Speedup of a parallel program
Speedup (S) is the ratio of the time taken to solve a problem on a single processor to the time required to solve the same problem on a parallel computer with p identical processing elements:
S = Tserial / Tparallel

Speedup Bounds
Speedup can be as low as 0 (the parallel program never terminates). Speedup, in theory, should be upper bounded by p; after all, we can only expect a p-fold speedup if we use p times as many resources. A speedup greater than p is possible only if the processing elements spend on average less than Tserial / p time solving the problem. In that case, a single processor could be timesliced to achieve a faster serial program, which contradicts our assumption of the fastest serial program as the basis for speedup.

Superlinear Speedups
One reason for superlinearity is that the parallel version does less work than the corresponding serial algorithm. Example: searching an unstructured tree for a node with a given label, 'S', on two processing elements using depth-first traversal. The two-processor version, with processor 0 searching the left subtree and processor 1 searching the right subtree, expands only the shaded nodes (in the slide's figure) before the solution is found. The corresponding serial formulation expands the entire tree, so the serial algorithm clearly does more work than the parallel algorithm.
Resource-based superlinearity: the higher aggregate cache/memory bandwidth can result in better cache-hit ratios, and therefore superlinearity. Example: a processor with 64KB of cache yields an 80% hit ratio. If two processors are used, since the problem size per processor is smaller, the hit ratio goes up to 90%. Of the remaining 10% of accesses, 8% come from local memory and 2% from remote memory. If DRAM access time is 100 ns, cache access time is 2 ns, and remote memory access time is 400 ns, this corresponds to a speedup of 2.43!

Performance Metrics: Efficiency
Efficiency is a measure of the fraction of time for which a processing element is usefully employed. Mathematically, it is given by:
E = S / p
Following the bounds on speedup, efficiency can be as low as 0 and as high as 1.
Efficiency of a parallel program: E = S / p = Tserial / (p · Tparallel).
Speedups and efficiencies of a parallel program, and of the same program on different problem sizes, are tabulated in the slides.

Effect of overhead:
Tparallel = Tserial / p + Toverhead

Amdahl's Law
Unless virtually all of a serial program is parallelized, the possible speedup is going to be very limited, regardless of the number of cores available. Thus Amdahl's law provides an upper bound on the speedup that can be obtained by a parallel program. Let f be the fraction of operations in a computation that must be performed sequentially, where 0 ≤ f ≤ 1. The maximum speedup achievable by a parallel computer with n processors is
S ≤ 1 / (f + (1 − f)/n)

Proof for Traditional Problems
If the fraction of the computation that cannot be divided into concurrent tasks is f, and no overhead incurs when the computation is divided into concurrent parts, the time to perform the computation with n processors is given by:
Tp ≥ f·Ts + [(1 − f)·Ts] / n

Proof of Amdahl's Law
Using the preceding expression for Tp:
S = Ts / Tp ≤ Ts / ( f·Ts + [(1 − f)·Ts] / n ) = 1 / ( f + (1 − f)/n )
The last expression is obtained by dividing both numerator and denominator by Ts.
Multiplying numerator & denominator by n produces the following alternate version of this formula: Example 1 95% of a program’s execution time occurs inside a loop that can be executed in parallel(while the other 5% is serial). What is the maximum speedup we should expect from a parallel version of the program executing on 8 CPUs? Example 2 What is the maximum speedup achievable by the program of the previous example (increasing the number of processors as much as we may prefer) ? Limitation of Amdahl’s Law Ignores communication cost On communications-intensive applications, does not capture the additional communication slowdown due to network congestion. As a result, Amdahl’s law usually overestimates speedup achievable Scalability In general, a problem is scalable if it can handle ever increasing problem sizes. If we increase the number of processes/threads and keep the efficiency fixed without increasing problem size, the problem is strongly scalable. If we keep the efficiency fixed by increasing the problem size at the same rate as we increase the number of processes/threads, the problem is weakly scalable. 17 Taking Timings Foster’s methodology 1. Partitioning: divide the computation to be performed and the data operated on by the computation into small tasks. The focus here should be on identifying tasks that can be executed in parallel. 2. Communication: determine what communication needs to be carried out among the tasks identified in the previous step. 3. Agglomeration or aggregation: combine tasks and communications identified in the first step into larger tasks. For example, if task A must be executed before task B can be executed, it may make sense to aggregate them into a single composite task. 4. Mapping: assign the composite tasks identified in the previous step to processes/threads.This should be done so that communication is minimized, and each process/thread gets roughly the same amount of work. Histogram example Serial program - input 1. 2. 3. 4. 5. The number of measurements: data_count An array of data_count floats: data The minimum value for the bin containing the smallest values: min_meas The maximum value for the bin containing the largest values: max_meas The number of bins: bin_count Serial program - output 1. bin_maxes : an array of bin_count floats 2. bin_counts : an array of bin_count ints First two stages of Foster’s Methodology 18 Alternative definition of tasks and communication Adding the local arrays Concluding Remarks: Serial systems: The standard model of computer hardware has been the von Neumann architecture. Parallel hardware: Flynn’s taxonomy. Parallel software: o We focus on software for homogeneous MIMD systems, consisting of a single program that obtains parallelism by branching. o SPMD programs. Input and Output o We’ll write programs in which one process or thread can access stdin, and all processes can access stdout and stderr. o However, because of nondeterminism, except for debug output we’ll usually have a single process or thread accessing stdout Performance o Speedup o Efficiency o Amdahl’s law o Scalability Parallel Program Design: Foster’s methodologyss or thread accessing stdout. 
Corresponding MPI functions: one-to-all broadcast = MPI_Bcast, all-to-one reduction = MPI_Reduce, all-reduce = MPI_Allreduce, prefix sum = MPI_Scan, scatter = MPI_Scatter, gather = MPI_Gather, all-to-all broadcast = MPI_Allgather, all-to-all reduction = MPI_Reduce_scatter, all-to-all personalized = MPI_Alltoall.

Linear model of communication overhead
A point-to-point message takes time ts + tw·m, where ts is the latency, tw is the per-word transfer time (inverse bandwidth), and m is the message size in # words. (Must use compatible units for m and tw.)

Contention
Assuming bi-directional links, each node can send and receive simultaneously. There is contention if a link is used by more than one message; k-way contention means tw → tw/k.

Topologies

One-to-all broadcast
Input: the message M is stored locally on the root.
Output: the message M is stored locally on all processes.
Ring: recursive doubling; double the number of active processes in each step.
Mesh: use the ring algorithm on the root's mesh row, then use the ring algorithm on all mesh columns in parallel.
Hypercube: generalize the mesh algorithm to d dimensions.
Algorithm: the algorithms described above are identical on all three topologies. The given algorithm is not general.
Number of steps: d = log2 p. Time per step: ts + tw·m. Total time: (ts + tw·m) log2 p.
In particular, note that broadcasting to p² processes is only twice as expensive as broadcasting to p processes (log2 p² = 2 log2 p).

All-to-one reduction
Algorithm: analogous to the one-to-all broadcast algorithm, with analogous time (plus the time to compute a ⊕ b). Reverse the order of communications, reverse the direction of communications, and combine the incoming message with the local message using ⊕.

All-to-all broadcast
Ring algorithm:
1. left ← (me − 1) mod p
2. right ← (me + 1) mod p
3. result ← M
4. M ← result
5. for k = 1, 2, . . . , p − 1 do
6.     Send M to right
7.     Receive M from left
8.     result ← result ∪ M
9. end for
The "send" is assumed to be non-blocking. Lines 6–7 can be implemented via MPI_Sendrecv.
Time of ring algorithm: number of steps p − 1, time per step ts + tw·m, total time (p − 1)(ts + tw·m).
Mesh algorithm: the mesh algorithm is based on the ring algorithm. Apply the ring algorithm to all mesh rows in parallel, then apply the ring algorithm to all mesh columns in parallel.
Time of mesh algorithm: 2·ts(√p − 1) + tw·m(p − 1).
Hypercube algorithm: the hypercube algorithm is also based on the ring algorithm. For each dimension d of the hypercube in sequence, apply the ring algorithm to the 2^(d−1) links in the current dimension in parallel.
Time of hypercube algorithm: ts·log2 p + tw·m(p − 1).
Summary: on all three topologies the tw term is tw·m(p − 1); the ts term is (p − 1)·ts on the ring, 2(√p − 1)·ts on the mesh, and log2 p·ts on the hypercube.

All-to-all reduction
Input: the p² messages Mr,k for r, k = 0, 1, . . . , p − 1; the message Mr,k is stored locally on process r; an associative reduction operator ⊕.
Output: the "sum" Mr := M0,r ⊕ M1,r ⊕ · · · ⊕ M(p−1),r stored locally on each process r.
Algorithm: analogous to the all-to-all broadcast algorithm, with analogous time (plus the time for computing a ⊕ b). Reverse the order of communications, reverse the direction of communications, and combine the incoming message with part of the local message using ⊕. Same transfer time (tw term), but the number of messages differs.

All-reduce
Input: the p messages Mk for k = 0, 1, . . . , p − 1; the message Mk is stored locally on process k; an associative reduction operator ⊕.
Output: the "sum" M := M0 ⊕ M1 ⊕ · · · ⊕ M(p−1) stored locally on all processes.
Algorithm: analogous to the all-to-all broadcast algorithm; combine the incoming message with the local message using ⊕. Cheaper since the message size does not grow. Total time: (ts + tw·m) log2 p.

Prefix sum
Input: The p messages Mk for k = 0, 1, . . .
, p − 1 The message Mk is stored locally on process k An associative reduction operator ⊕ Output: The “sum” M(k) := M0 ⊕ M1 ⊕ · · · ⊕ Mk stored locally onprocess k for all k Algorithm Analogous to all-reduce algorithm Analogous time Locally store only the corresponding partial sum Scatter Input: The p messages Mk for k = 0, 1, . . . , p − 1 stored locally on the root Output: The message Mk stored locally on process k for all k Algorithm Analogous to one-to-all broadcast algorithm Send half of the messages in the first step, send one quarter inthe second step, and so on More expensive since several messages are sent in each step Total time: tslog2 p + tw (p − 1)m 23 Gather Input: The p messages Mk for k = 0, 1, . . . , p − 1 The message Mk is stored locally on process k Output: The p messages Mk stored locally on the root Algorithm Analogous to scatter algorithm Analogous time Reverse the order of communications Reverse the direction of communications All-to-all personalized Input: The p2 messages Mr,k for r, k = 0, 1, . . . , p − 1 The message Mr,k is stored locally on process r Output: The p messages Mr,k stored locally on process k for all k Summary The hypercube algorithm is not optimal with respect tocommunication volume (the lower bound is t wm(p − 1)) An optimal (w.r.t. volume) hypercube algorithm Idea: Let each pair of processes exchange messages directly Time: I (p − 1)(ts + twm) Q: In which order do we pair the processes? A:In step k, let me exchange messages with me XOR k. This can be done without contention! An optimal hypercube algorithm 24 An optimal hypercube algorithm based on E-cube routing E-cube routing Routing from s to t := s XOR k in step k The difference between s and t is s XOR t = s XOR(s XOR k) = k The number of links to traverse equals the number of 1’s in thebinary representation of k (the so-called Hamming distance) E-cube routing: route through the links according to somefixed (arbitrary) ordering imposed on the dimensions Why does E-cube work? E-cube routing example Hypercube for all topologies Improved one-to-all broadcast 25 Time analysis Improved all-to-one reduction Time analysis Improved all-reduce All-reduce = One-to-all reduction + All-to-one broadcast ...but gather followed by scatter cancel out! Time analysis Processes and Threads A process is an instance of a running (or suspended) program. Threads are analogous to a “light-weight”process. In a shared memory program a single process may have multiple threads of control. 26 POSIX®Threads Also known as Pthreads. A standard for Unix-like operating systems. A library that can be linked with C programs. Specifies an application programming interface (API) for multi-threaded programming. Caveat: The Pthreads API is only available onPOSIXR systems — Linux, MacOS X,Solaris, HPUX, … Hello World(1) Hello World(2) Hello World(3) Compiling a Pthread program gcc −g −Wall −o pth_hello pth_hello . c –lpthread Running a Pthreads program Global variables Can introduce subtle and confusing bugs! Limit use of global variables to situations in which they’re really needed. o (Shared variables) Starting the Threads Processes in MPI are usually started by a script. In Pthreads the threads are started by the program executable. pthread_t objects Opaque The actual data that they store is system-specific. Their data members aren’t directly accessible to user code. However, the Pthreads standard guarantees that a pthread_t object does store enough information to uniquely identify the thread with which it’s associated. 
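The Hello World(1)–(3) slides above are listed without their code. A minimal sketch in the same spirit, in which the main thread reads thread_count from the command line, starts the threads with pthread_create, and stops them with pthread_join; it compiles with the gcc line shown above.

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

int thread_count;   /* global (shared) variable: number of threads */

void* Hello(void* rank) {
    long my_rank = (long) rank;   /* the rank is passed by value through the pointer */
    printf("Hello from thread %ld of %d\n", my_rank, thread_count);
    return NULL;
}

int main(int argc, char* argv[]) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s <number of threads>\n", argv[0]);
        return 1;
    }
    thread_count = (int) strtol(argv[1], NULL, 10);

    pthread_t* thread_handles = malloc(thread_count * sizeof(pthread_t));

    /* Start the threads: each thread gets its rank as the thread-function argument. */
    for (long thread = 0; thread < thread_count; thread++)
        pthread_create(&thread_handles[thread], NULL, Hello, (void*) thread);

    printf("Hello from the main thread\n");

    /* Stop the threads: wait for each one to finish. */
    for (long thread = 0; thread < thread_count; thread++)
        pthread_join(thread_handles[thread], NULL);

    free(thread_handles);
    return 0;
}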
27 A closer look(1) A closer look(2) Function started by pthread_create Prototype: void* thread_function ( void* args_p ) ; Void* can be cast to any pointer type in C. So args_p can point to a list containing one or more values needed by thread_function. Similarly, the return value of thread_function can point to a list of one or more values. Running the Threads Main thread forks and joins two threads. Stopping the Threads We call the function pthread_join once foreach thread. A single call to pthread_join will wait for thethread associated with the pthread_t object to complete. Serial pseudo-code Using 3 Pthreads Pthreads matrix-vector multiplication 28 Estimating π A thread function for computing π Note that as we increase n, the estimate with one thread gets better and better. Busy-Waiting Possible race condition A thread repeatedly tests a condition, but, effectively, does no useful work until the condition has the appropriate value. Beware of optimizing compilers, though! Pthreads global sum with busy-waiting Global sum function with critical section after loop (1) Global sum function with critical section after loop (2) 29 Mutexes A thread that is busy-waiting may continually use the CPU accomplishing nothing. Mutex (mutual exclusion) is a special type of variable that can be used to restrict access to a critical section to a single thread at a time. Used to guarantee that one thread“excludes” all other threads while itexecutes the critical section. The Pthreads standard includes a special type for mutexes: pthread_mutex_t. When a Pthreads program finishes using a mutex, it should call In order to gain access to a critical section a thread calls When a thread is finished executing the code in a critical section, it should call Global sum function that uses a mutex(1) Global sum function that uses a mutex(2) Run-times (in seconds) of π programs using n = 108 terms on a system with two four-core processors. Possible sequence of events with busy-waiting and more threads than cores. Issues Busy-waiting enforces the order threads access a critical section. Using mutexes, the order is left to chance and the system. There are applications where we need to control the order threads access the critical section. 30 Problems with a mutex solution A first attempt at sending messages using pthreads Syntax of the various semaphore functions Barriers Synchronizing the threads to make sure that they all are at the same point in a program is called a barrier. No thread can cross the barrier until all the threads have reached it. Using barriers to time the slowest thread Using barriers for debugging Busy-waiting and a Mutex Implementing a barrier using busy-waitingand a mutex is straightforward. We use a shared counter protected by themutex. When the counter indicates that everythread has entered the critical section,threads can leave the critical section. Implementing a barrier with semaphores 31 Condition Variables A condition variable is a data object that allows a thread to suspend execution until a certain event or condition occurs. When the event or condition occurs another thread can signal the thread to “wake up.” A condition variable is always associated with a mutex. Implementing a barrier with condition variables Linked Lists Linked List Membership Deleting a node from a linked list Inserting a new node into a list 32 A Multi-Threaded Linked List In order to share access to the list, we can define head_p to be a global variable. 
This will simplify the function headers for Member, Insert, and Delete, since we won’t need to pass in either head_p or a pointer to head_p: we’ll only need to pass in the value of interest. Simultaneous access by two threads Solution #1 An obvious solution is to simply lock the list any time that a thread attempts to access it. A call to each of the three functions can be protected by a mutex. Issues of Sol #1 We’re serializing access to the list. If the vast majority of our operations are calls to Member, we’ll fail to exploit this opportunity for parallelism. On the other hand, if most of our operations are calls to Insert and Delete, then this may be the best solution since we’ll need to serialize access to the list for most of the operations, and this solution will certainly be easy to implement. Solution #2 Instead of locking the entire list, we could try to lock individual nodes. A “finer-grained” approach. Issues of Sol #2 This is much more complex than the original Member function. It is also much slower, since, in general, each time a node is accessed, a mutex must be locked and unlocked. The addition of a mutex field to each node will substantially increase the amount of storage needed for the list. Implementation of Member with one mutex per list node (1) Implementation of Member with one mutex per list node(2) 33 Pthreads Read-Write Locks Neither of our multi-threaded linked lists exploits the potential for simultaneous access to any node by threads that are executing Member. The first solution only allows one thread to access the entire list at any instant. The second only allows one thread to access any given node at any instant. A read-write lock is somewhat like a mutex except that it provides two lock functions. The first lock function locks the read-writelock for reading, while the second locks itfor writing. So multiple threads can simultaneously obtain the lock by calling the read-lock function, while only one thread can obtain the lock by calling the write-lock function. Thus, if any threads own the lock for reading, any threads that want to obtain the lock for writing will block in the call to the write-lock function. If any thread owns the lock for writing, any threads that want to obtain the lock for reading or writing will block in their respective locking functions. Protecting our linked list functions Linked List Performance 100,000 ops/thread 99.9% Member 0.05% Insert 0.05% Delete 100,000 ops/thread 80% Member 10% Insert 10% Delete Caches, Cache-Coherence, and False Sharing Recall that chip designers have added blocks of relatively fast memory to processors called cache memory. The use of cache memory can have a huge impact on shared-memory. A write-miss occurs when a core tries to update a variable that’s not in cache, and it has to access main memory. Pthreads matrix-vector multiplication Thread-Safety A block of code is thread-safe if it can be simultaneously executed by multiple threads without causing problems. Example Suppose we want to use multiple threads to “tokenize” a file that consists of ordinary English text. The tokens are just contiguous sequences of characters separated from the rest of the text by white-space — a space, a tab, or a newline. 34 Simple approach Divide the input file into lines of text and assign the lines to the threads in a roundrobin fashion. The first line goes to thread 0, the second goes to thread 1, . . . , the tth goes to thread t, the t +1st goes to thread 0, etc. 
We can serialize access to the lines ofinput using semaphores. After a thread has read a single line ofinput, it can tokenize the line using the strtok function. The strtok function The first time it’s called the string argument should be the text to be tokenized. Our line of input. For subsequent calls, the first argument should be NULL. The idea is that in the first call, strtok caches a pointer to string, and for subsequent calls it returns successive tokens taken from the cached copy. Multi-threaded tokenizer (1) Multi-threaded tokenizer (2) Running with one thread It correctly tokenizes the input stream. o Pease porridge hot. o Pease porridge cold. o Pease porridge in the pot o Nine days old. Running with two threads What happened? strtok caches the input line by declaring a variable to have static storage class. This causes the value stored in this variable to persist from one call to the next. Unfortunately for us, this cached string is shared, not private. Thus, thread 0’s call to strtok with the third line of the input has apparently overwritten the contents of thread 1’s call with the second line. So the strtok function is not thread-safe.If multiple threads call it simultaneously, the output may not be correct. Other unsafe C library functions Regrettably, it’s not uncommon for C library functions to fail to be thread-safe. The random number generator random in stdlib.h. The time conversion function localtime in time.h. 35 “re-entrant” (thread safe) functions In some cases, the C standard specifies an alternate, thread-safe, version of a function. Concluding remarks A thread in shared-memory programming is analogous to a process in distributed memory programming. o However, a thread is often lighter-weight than a full-fledged process. o In Pthreads programs, all the threads have access to global variables, while local o variables usually are private to the thread running the function. o When indeterminacy results from multiple threads attempting to access a shared resource such as a shared variable or a shared file, at least one of the accesses is an update, and the accesses can result in an error, we have a race condition. A critical section is a block of code that updates a shared resource that can only be updated by one thread at a time. o So the execution of code in a critical section should, effectively, be executed as serial code. A mutex can be used to avoid conflicting access to critical sections as well. o Think of it as a lock on a critical section, since mutexes arrange for mutually exclusive access to a critical section. A semaphore is the third way to avoid conflicting access to critical sections. o It is an unsigned int together with two operations: sem_wait and sem_post. o Semaphores are more powerful than mutexes since they can be initialized to any nonnegative value. A barrier is a point in a program at which the threads block until all of the threads have reached it. A read-write lock is used when it’s safe for multiple threads to simultaneously read a data structure, but if a thread needs to modify or write to the data structure, then only that thread can access the data structure during the modification. Some C functions cache data between calls by declaring variables to be static, causing errors when multiple threads call the function. o This type of function is not thread-safe OpenMP An API for shared-memory parallel programming. MP = multiprocessing Designed for systems in which each thread or process can potentially have access to all available memory. 
System is viewed as a collection of cores or CPU’s, all of which have access to main memory. Pragmas (#pragma) Special preprocessor instructions. Typically added to a system to allow behaviors that aren’t part of the basic C specification. Compilers that don’t support the pragmas ignore them. 36 OpenMP pragmas # pragma omp parallel o Most basic parallel directive. o The number of threads that run the following structured block of code is determined by the run-time system. A process forking and joining two threads clause Text that modifies a directive. The num_threads clause can be added to a parallel directive. It allows the programmer to specify the number of threads that should execute the following block. # pragma omp parallel num_threads ( thread_count ) Of note… There may be system-defined limitations on the number of threads that a program can start. The OpenMP standard doesn’t guarantee that this will actually start thread_count threads. Most current systems can start hundreds or even thousands of threads. Unless we’re trying to start a lot of threads, we will almost always get the desired number of threads. Some terminology: In OpenMP parlance the collection of threads executing the parallel block — the original thread and the new threads — is called a team, the original thread is called the master, and the additional threads are called slaves. In case the compiler doesn’t support OpenMP The trapezoid rule Serial algorithm A First OpenMP Version 1) We identified two types of tasks: a) computation of the areas of individual trapezoids, and b) adding the areas of trapezoids. 2) There is no communication among the tasks in the first collection, but each task in the first collection communicates with task 1b. 3) We assumed that there would be many more trapezoids than cores. 1. So we aggregated tasks by assigning a contiguous block of trapezoids to each thread (and a single thread to each core) 37 Assignment of trapezoids to threads Unpredictable results when two (or more) threads attempts to simultaneously executete: global_result += my_result Mutual exclusion # pragma omp critical <- only one thread can execute the following structured block at a time global_result += my_result ; Scope 2. In serial programming, the scope of a variable consists of those parts of a program in which the variable can be used. 3. In OpenMP, the scope of a variable refers to the set of threads that can access the variable in a parallel block. Scope in OpenMP A variable that can be accessed by all the threads in the team has shared scope. A variable that can only be accessed by a single thread has private scope. The default scope for variables declared before a parallel block is shared. The Reduction Clause 38 Reduction operators A reduction operator is a binary operation (such as addition or multiplication). A reduction is a computation that repeatedly applies the same reduction operator to a sequence of operands in order to get a single result. All of the intermediate results of the operation should be stored in the same variable: the reduction variable. Parallel for Forks a team of threads to execute the following structured block. However, the structured block following the parallel for directive must be a for loop. Furthermore, with the parallel for directive the system parallelizes the for loop by dividing the iterations of the loop among the threads. Legal forms for parallelizable for statements Caveats The variable index must have integer or pointer type (e.g., it can’t be a float). 
Legal forms for parallelizable for statements
Caveats
The variable index must have integer or pointer type (e.g., it can't be a float).
The expressions start, end, and incr must have a compatible type. For example, if index is a pointer, then incr must have integer type.
The expressions start, end, and incr must not change during execution of the loop.
During execution of the loop, the variable index can only be modified by the "increment expression" in the for statement.

Data dependencies
What happened?
1. OpenMP compilers don't check for dependences among iterations in a loop that's being parallelized with a parallel for directive.
2. A loop in which the results of one or more iterations depend on other iterations cannot, in general, be correctly parallelized by OpenMP.

Estimating π
OpenMP solution #1
OpenMP solution #2

The default clause
Lets the programmer specify the scope of each variable in a block.
default(none): with this clause the compiler will require that we specify the scope of each variable that we use in the block and that has been declared outside the block.

Bubble Sort
Serial Odd-Even Transposition Sort
First OpenMP Odd-Even Sort
Second OpenMP Odd-Even Sort
Odd-even sort with two parallel for directives and two for directives. (Times are in seconds.)

Scheduling loops
Results
f(i) calls the sin function i times. Assume the time to execute f(2i) requires approximately twice as much time as the time to execute f(i).
n = 10,000; one thread; run-time = 3.67 seconds.
n = 10,000; two threads; default assignment; run-time = 2.76 seconds; speedup = 1.33.
n = 10,000; two threads; cyclic assignment; run-time = 1.84 seconds; speedup = 1.99.

The Schedule Clause
Default schedule:
Cyclic schedule:
schedule ( type , chunksize )
Type can be:
o static: the iterations can be assigned to the threads before the loop is executed.
o dynamic or guided: the iterations are assigned to the threads while the loop is executing.
o auto: the compiler and/or the run-time system determine the schedule.
o runtime: the schedule is determined at run-time.
The chunksize is a positive integer.

The static Schedule Type
Twelve iterations, 0, 1, ..., 11, and three threads.

The Dynamic Schedule Type
The iterations are also broken up into chunks of chunksize consecutive iterations.
Each thread executes a chunk, and when a thread finishes a chunk, it requests another one from the run-time system.
This continues until all the iterations are completed.
The chunksize can be omitted. When it is omitted, a chunksize of 1 is used.

The Guided Schedule Type
Each thread also executes a chunk, and when a thread finishes a chunk, it requests another one.
However, in a guided schedule, as chunks are completed the size of the new chunks decreases.
If no chunksize is specified, the size of the chunks decreases down to 1.
If chunksize is specified, it decreases down to chunksize, with the exception that the very last chunk can be smaller than chunksize.

The Runtime Schedule Type
The system uses the environment variable OMP_SCHEDULE to determine at run time how to schedule the loop.
The OMP_SCHEDULE environment variable can take on any of the values that can be used for a static, dynamic, or guided schedule.
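A small sketch of the cyclic assignment experiment described above, assuming f(i) calls sin i times and using schedule(static, 1) together with a reduction; actual timings will of course differ from the slides' numbers:

    #include <math.h>
    #include <stdio.h>
    #include <omp.h>

    /* Work grows with i: f(i) calls sin i times, so a default block
     * partition leaves the last thread with most of the work.        */
    static double f(int i) {
       double sum = 0.0;
       for (int j = 0; j < i; j++) sum += sin((double) j);
       return sum;
    }

    int main(void) {
       const int n = 10000;
       double sum = 0.0;
       double start = omp_get_wtime();

       /* schedule(static, 1): iterations 0, 1, 2, ... are handed out
        * cyclically, one at a time, which balances the load here.    */
    #  pragma omp parallel for schedule(static, 1) reduction(+: sum)
       for (int i = 0; i < n; i++)
          sum += f(i);

       printf("sum = %f, elapsed = %f seconds\n", sum, omp_get_wtime() - start);
       return 0;
    }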
Queues
Can be viewed as an abstraction of a line of customers waiting to pay for their groceries in a supermarket.
A natural data structure to use in many multithreaded applications.
For example, suppose we have several "producer" threads and several "consumer" threads.
o Producer threads might "produce" requests for data.
o Consumer threads might "consume" the request by finding or generating the requested data.

Message-Passing
Each thread could have a shared message queue, and when one thread wants to "send a message" to another thread, it could enqueue the message in the destination thread's queue.
A thread could receive a message by dequeuing the message at the head of its message queue.

Sending Messages
Receiving Messages
Termination Detection

Startup
When the program begins execution, a single thread, the master thread, will get command-line arguments and allocate an array of message queues: one for each thread.
This array needs to be shared among the threads, since any thread can send to any other thread, and hence any thread can enqueue a message in any of the queues.
One or more threads may finish allocating their queues before some other threads.
We need an explicit barrier, so that when a thread encounters the barrier, it blocks until all the threads in the team have reached the barrier.
After all the threads have reached the barrier, all the threads in the team can proceed.
# pragma omp barrier

The Atomic Directive
Unlike the critical directive, it can only protect critical sections that consist of a single C assignment statement.
o # pragma omp atomic
Further, the statement must have one of the following forms:
o x <op>= <expression>;
o x++;
o ++x;
o x--;
o --x;
Here <op> can be one of the binary operators:
o +, *, -, /, &, ^, |, <<, or >>
Many processors provide a special load-modify-store instruction.
A critical section that only does a load-modify-store can be protected much more efficiently by using this special instruction rather than the constructs that are used to protect more general critical sections.

Critical Sections
OpenMP provides the option of adding a name to a critical directive:
o # pragma omp critical(name)
When we do this, two blocks protected with critical directives with different names can be executed simultaneously.
However, the names are set during compilation, and we want a different critical section for each thread's queue.

Locks
A lock consists of a data structure and functions that allow the programmer to explicitly enforce mutual exclusion in a critical section.

Using Locks in the Message-Passing Program

Some Caveats
1. You shouldn't mix the different types of mutual exclusion for a single critical section.
2. There is no guarantee of fairness in mutual exclusion constructs.
3. It can be dangerous to "nest" mutual exclusion constructs.

Matrix-vector multiplication
Thread-Safety

Concluding Remarks
OpenMP is a standard for programming shared-memory systems.
OpenMP uses both special functions and preprocessor directives called pragmas.
OpenMP programs start multiple threads rather than multiple processes.
Many OpenMP directives can be modified by clauses.
A major problem in the development of shared-memory programs is the possibility of race conditions.
OpenMP provides several mechanisms for ensuring mutual exclusion in critical sections:
o Critical directives
o Named critical directives
o Atomic directives
o Simple locks
By default most systems use a block partitioning of the iterations in a parallelized for loop.
OpenMP offers a variety of scheduling options.
In OpenMP the scope of a variable is the collection of threads to which the variable is accessible.
A reduction is a computation that repeatedly applies the same reduction operator to a sequence of operands in order to get a single result.

Identifying MPI processes
Common practice to identify processes by nonnegative integer ranks.
p processes are numbered 0, 1, 2, ..., p-1.
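A minimal sketch of an MPI program that identifies processes by rank (not necessarily the exact "first MPI program" from the slides):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
       int comm_sz;   /* number of processes */
       int my_rank;   /* my process rank     */

       MPI_Init(&argc, &argv);
       MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
       MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

       printf("Proc %d of %d: greetings!\n", my_rank, comm_sz);

       MPI_Finalize();
       return 0;
    }

Such a program is typically compiled with mpicc and launched with something like mpiexec -n 4 ./a.out; the exact commands depend on the MPI installation.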
Our first MPI program
Compilation
Execution

MPI Programs
Written in C.
o Has main.
o Uses stdio.h, string.h, etc.
Need to add the mpi.h header file.
Identifiers defined by MPI start with "MPI_".
The first letter following the underscore is uppercase.
o For function names and MPI-defined types.
o Helps to avoid confusion.

MPI Components
MPI_Init
o Tells MPI to do all the necessary setup.
MPI_Finalize
o Tells MPI we're done, so clean up anything allocated for this program.

Basic Outline

Communicators
A collection of processes that can send messages to each other.
MPI_Init defines a communicator that consists of all the processes created when the program is started.
Called MPI_COMM_WORLD.

SPMD
Single-Program Multiple-Data.
We compile one program.
Process 0 does something different.
o Receives messages and prints them while the other processes do the work.
The if-else construct makes our program SPMD.

Communication
Data types
Message matching

Receiving messages
A receiver can get a message without knowing:
o the amount of data in the message,
o the sender of the message,
o or the tag of the message.
status_p argument
How much data am I receiving?

Issues with send and receive
Exact behavior is determined by the MPI implementation.
MPI_Send may behave differently with regard to buffer size, cutoffs and blocking.
MPI_Recv always blocks until a matching message is received.
Know your implementation; don't make assumptions!

The Trapezoidal Rule in MPI
One trapezoid
Pseudo-code for a serial program

Parallelizing the Trapezoidal Rule
1. Partition the problem solution into tasks.
2. Identify the communication channels between tasks.
3. Aggregate tasks into composite tasks.
4. Map composite tasks to cores.

Parallel pseudo-code
Tasks and communications for the Trapezoidal Rule
First version (1 & 2)
First version (3)

Dealing with I/O
Input
Running with 6 processes (unpredictable output)
Most MPI implementations only allow process 0 in MPI_COMM_WORLD access to stdin.
Process 0 must read the data (scanf) and send it to the other processes.
Function for reading user input

Tree-structured communication
1. In the first phase:
(a) Process 1 sends to 0, 3 sends to 2, 5 sends to 4, and 7 sends to 6.
(b) Processes 0, 2, 4, and 6 add in the received values.
(c) Processes 2 and 6 send their new values to processes 0 and 4, respectively.
(d) Processes 0 and 4 add the received values into their new values.
2. (a) Process 4 sends its newest value to process 0.
(b) Process 0 adds the received value to its newest value.

A tree-structured global sum
An alternative tree-structured global sum

MPI_Reduce
Predefined reduction operators in MPI

Collective vs. Point-to-Point Communications
All the processes in the communicator must call the same collective function.
For example, a program that attempts to match a call to MPI_Reduce on one process with a call to MPI_Recv on another process is erroneous, and, in all likelihood, the program will hang or crash.
The arguments passed by each process to an MPI collective communication must be "compatible."
For example, if one process passes in 0 as the dest_process and another passes in 1, then the outcome of a call to MPI_Reduce is erroneous, and, once again, the program is likely to hang or crash.
The output_data_p argument is only used on dest_process.
However, all of the processes still need to pass in an actual argument corresponding to output_data_p, even if it's just NULL.
Point-to-point communications are matched on the basis of tags and communicators.
Collective communications don't use tags; they're matched solely on the basis of the communicator and the order in which they're called.
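For reference, a global sum with MPI_Reduce might look like the following sketch; the local values are placeholders rather than the slides' partial sums:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
       int my_rank, comm_sz;
       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
       MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);

       double local_val = my_rank + 1.0;   /* stand-in for a partial result */
       double total = 0.0;

       /* Every process in the communicator must make this same call;
        * only the destination process (0) receives the reduced value. */
       MPI_Reduce(&local_val, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

       if (my_rank == 0)
          printf("global sum = %f\n", total);

       MPI_Finalize();
       return 0;
    }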
Collective communications don’t use tags. They’re matched solely on the basis of the communicator and the order in which they’re called. Example 1 Multiple calls to MPI_Reduce Example 2 Suppose that each process calls MPI_Reduce with operator MPI_SUM, and destination process 0. At first glance, it might seem that after the two calls to MPI_Reduce, the value of b will be 3, and the value of d will be 6. However, the names of the memory locations are irrelevant to the matching of the calls to MPI_Reduce. The order of the calls will determine the matching so the value stored in b will be 1+2+1 = 4, and the value stored in d will be 2+1+2 = 5. MPI_Allreduce Useful in a situation in which all of the processes need the result of a global sum in order to complete some larger computation. 49 Broadcast: Data belonging to a single process is sent to all of the processes in the communicator. A version of Get_input that uses MPI_Bcast Data distributions(computing a vector sum) Serial implementation of vector addition Different partitions of a 12-component vector among 3 processes 50 Partitioning options Block partitioning o Assign blocks of consecutive components to each process. Cyclic partitioning o Assign components in a round robin fashion. Block-cyclic partitioning o Use a cyclic distribution of blocks of components. Scatter Parallel implementation of vector addition MPI_Scatter can be used in a function that reads in an entire vector on process 0 but only sends the needed components to each of the other processes. Reading and distributing a vector Gather Collect all of the components of the vector onto process 0, and then process 0 can process all of the components. Print a distributed vector (1) Print a distributed vector (2) 51 Allgather Concatenates the contents of each process’ send_buf_p and stores this in each process’ recv_buf_p. As usual, recv_count is the amount of data being received from each process. Matrix-vector multiplication Multiply a matrix by a vector C style arrays Serial matrix-vector multiplication An MPI matrix-vector multiplication function (1) An MPI matrix-vector multiplication function (2) Derived datatypes Used to represent any collection of data items in memory by storing both the types of the items and their relative locations in memory. The idea is that if a function that sends data knows this information about a collection of data items, it can collect the items from memory before they are sent. Similarly, a function that receives data can distribute the items into their correct destinations in memory when they’re received. Formally, consists of a sequence of basic MPI data types together with a displacement for each of the data types. Trapezoidal Rule example: 52 MPI_Type create_struct Builds a derived datatype that consists of individual elements that have different basic types. MPI_Get_address Returns the address of the memory location referenced by location_p. The special type MPI_Aint is an integer type that is big enough to store an address on the system. MPI_Type_commit Allows the MPI implementation to optimize its internal representation of the datatype for use in communication functions. MPI_Type_free When we’re finished with our new type, this frees any additional storage used. Get input function with a derived datatype (1) Get input function with a derived datatype (2) Get input function with a derived datatype(3) Elapsed parallel time Returns the number of seconds that have elapsed since some time in the past. 
Elapsed parallel time
Returns the number of seconds that have elapsed since some time in the past.
Elapsed serial time
In this case, you don't need to link in the MPI libraries.
Returns the time in microseconds elapsed from some point in the past.

MPI_Barrier
Ensures that no process will return from calling it until every process in the communicator has started calling it.

Speedup
Efficiency
Scalability
A program is scalable if the problem size can be increased at a rate so that the efficiency doesn't decrease as the number of processes increases.
Programs that can maintain a constant efficiency without increasing the problem size are sometimes said to be strongly scalable.
Programs that can maintain a constant efficiency if the problem size increases at the same rate as the number of processes are sometimes said to be weakly scalable.

A parallel sorting algorithm
Sorting n keys with p = comm_sz processes.
n/p keys assigned to each process.
No restrictions on which keys are assigned to which processes.
When the algorithm terminates:
o The keys assigned to each process should be sorted in (say) increasing order.
o If 0 ≤ q < r < p, then each key assigned to process q should be less than or equal to every key assigned to process r.

Odd-even transposition sort
A sequence of phases.
Even phases: compare-swaps on the pairs (a[0], a[1]), (a[2], a[3]), ...
Odd phases: compare-swaps on the pairs (a[1], a[2]), (a[3], a[4]), ...

Example
Start: 5, 9, 4, 3
Even phase: compare-swap (5,9) and (4,3), getting the list 5, 9, 3, 4
Odd phase: compare-swap (9,3), getting the list 5, 3, 9, 4
Even phase: compare-swap (5,3) and (9,4), getting the list 3, 5, 4, 9
Odd phase: compare-swap (5,4), getting the list 3, 4, 5, 9

Serial odd-even transposition sort
Parallel odd-even transposition sort
Communications among tasks in odd-even sort
Pseudo-code
Compute_partner

Safety in MPI programs
The MPI standard allows MPI_Send to behave in two different ways:
o it can simply copy the message into an MPI-managed buffer and return,
o or it can block until the matching call to MPI_Recv starts.
Many implementations of MPI set a threshold at which the system switches from buffering to blocking.
Relatively small messages will be buffered by MPI_Send.
Larger messages will cause it to block.
If the MPI_Send executed by each process blocks, no process will be able to start executing a call to MPI_Recv, and the program will hang or deadlock.
o Each process is blocked waiting for an event that will never happen.
A program that relies on MPI-provided buffering is said to be unsafe.
o Such a program may run without problems for various sets of input, but it may hang or crash with other sets.

MPI_Ssend
An alternative to MPI_Send defined by the MPI standard.
The extra "s" stands for synchronous, and MPI_Ssend is guaranteed to block until the matching receive starts.

Restructuring communication
MPI_Sendrecv
An alternative to scheduling the communications ourselves.
Carries out a blocking send and a receive in a single call.
The dest and the source can be the same or different.
Especially useful because MPI schedules the communications so that the program won't hang or crash.
Safe communication with five processes
Parallel odd-even transposition sort
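A sketch of the key exchange in one phase of the parallel odd-even sort, using MPI_Sendrecv so that the paired processes cannot deadlock (the merge that keeps the smaller or larger half is only indicated in a comment):

    #include <mpi.h>

    /* Exchange the local block of keys with this phase's partner.
     * MPI_Sendrecv performs the blocking send and receive in a single
     * call, so neither partner can be left waiting for the other.     */
    void Exchange_keys(int my_keys[], int recv_keys[], int n_per_proc,
                       int partner, MPI_Comm comm) {
       MPI_Sendrecv(my_keys,   n_per_proc, MPI_INT, partner, 0,
                    recv_keys, n_per_proc, MPI_INT, partner, 0,
                    comm, MPI_STATUS_IGNORE);
       /* Next (not shown): merge my_keys and recv_keys, keeping the
        * smaller half on the lower-ranked process and the larger half
        * on the higher-ranked process.                                */
    }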
Concluding Remarks
MPI, or the Message-Passing Interface, is a library of functions that can be called from C, C++, or Fortran programs.
A communicator is a collection of processes that can send messages to each other.
Many parallel programs use the single-program multiple-data, or SPMD, approach.
Most serial programs are deterministic: if we run the same program with the same input, we'll get the same output.
o Parallel programs often don't possess this property.
Collective communications involve all the processes in a communicator.
When we time parallel programs, we're usually interested in elapsed time or "wall clock time".
Speedup is the ratio of the serial run-time to the parallel run-time.
Efficiency is the speedup divided by the number of parallel processes.
If it's possible to increase the problem size (n) so that the efficiency doesn't decrease as p is increased, a parallel program is said to be scalable.
An MPI program is unsafe if its correct behavior depends on the fact that MPI_Send is buffering its input.
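As a hypothetical illustration of the last few definitions (the numbers are made up, not measurements): if a serial run takes T_serial = 24 seconds and the parallel version takes T_parallel = 4 seconds on p = 8 processes, then the speedup is S = T_serial / T_parallel = 24 / 4 = 6, and the efficiency is E = S / p = 6 / 8 = 0.75. The program is scalable if, as p grows, the problem size n can grow quickly enough to keep E from decreasing.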