Chapter 3: Parallel Programming Models

Abstraction
• Machine level – looks at hardware, OS, buffers
• Architectural models – look at the interconnection network, memory organization, synchronous & asynchronous operation
• Computational model – cost models, algorithm complexity, RAM vs. PRAM
• Programming model – uses a programming-language description of the processes

Control Flows
• Process – address spaces differ – distributed memory
• Thread – shares an address space – shared memory
• Created statically (as in MPI-1) or dynamically at run time (MPI-2 allows this, as do Pthreads)

Parallelization of a Program
• Decomposition of the computations
– Can be done at many levels (e.g., pipelining)
– Divide the work into tasks and identify the dependencies between tasks
– Can be done statically (at compile time) or dynamically (at run time)
– The number of tasks places an upper bound on the parallelism that can be exploited
– Granularity: the computation time of a task

Assignment of Tasks
• The number of processes or threads does not need to equal the number of processors
• Load balancing: each process/thread gets the same amount of work (computation, memory access, communication)
• Have tasks that use the same memory execute on the same thread (good cache use)
• Scheduling: the assignment of tasks to threads/processes

Assignment to Processors
• 1-to-1: map each process/thread to a unique processor
• Many-to-1: map several processes to a single processor (load-balancing issues)
• Done by the OS or by the programmer

Scheduling
• Precedence constraints – dependencies between tasks
• Capacity constraints – a fixed number of processors
• Goal: meet the constraints and finish in minimum time

Levels of Parallelism
• Instruction level
• Data level
• Loop level
• Functional level

Instruction Level
• Executing multiple instructions in parallel; may have problems with dependencies
– Flow dependency – the next instruction needs a value computed by the previous instruction
– Anti-dependency – an instruction reads a value from a register or memory location that the next instruction writes (the order of the instructions cannot be reversed)
– Output dependency – two instructions store into the same location

Data Level
• The same operation is applied to different elements of a large data structure
• If these operations are independent, the data can be distributed among the processors
• One single control flow
• SIMD

Loop Level
• If there are no dependencies between the iterations of a loop, each iteration can be done independently, in parallel
• Similar to data parallelism

Functional Level
• Look at the parts of a program and determine which parts can be executed independently
• Use a dependency graph to find the dependencies/independencies
• Static or dynamic assignment of tasks to processors
– Dynamic assignment would use a task pool

Explicit/Implicit Parallelism Expression
• Language dependent
• Some languages hide the parallelism in the language
• In other languages, you must state the parallelism explicitly

Parallelizing Compilers
• Take a program in a sequential language and generate parallel code
– Must analyze the dependencies and not violate them
– Should provide good load balancing (difficult)
– Should minimize communication
• Functional programming languages
– Express computations as the evaluation of functions with no side effects
– Allows for parallel evaluation

More explicit/implicit
• Explicit parallelism / implicit distribution
– The language explicitly states the parallelism in the algorithm but lets the system assign the tasks to processors
• Explicit assignment to processors
– The programmer still does not have to worry about communication
• Explicit communication and synchronization – MPI
– The programmer must additionally state the communication and synchronization points explicitly (sketched below)
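As a minimal sketch of the fully explicit style, the following hypothetical two-process MPI program (the message value and tag are illustrative assumptions, not from the slides) states both the communication (send/receive) and a synchronization point (barrier) explicitly:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Explicit communication: process 0 sends one int to process 1. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    /* Explicit synchronization point: all processes wait here. */
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}

Typically built with mpicc and run with mpirun -np 2; nothing is communicated or synchronized unless the programmer writes these calls, which is the point of the slide.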
Parallel Programming Patterns
• Process/Thread Creation
• Fork-Join
• Parbegin-Parend
• SPMD, SIMD
• Master-Slave (Worker)
• Client-Server
• Pipelining
• Task Pools
• Producer-Consumer

Process/Thread Creation
• Static or dynamic
• Threads: traditionally dynamic
• Processes: traditionally static, but dynamic creation has recently become available

Fork-Join
• An existing thread can create a number of child threads with a fork
• The child threads work in parallel
• Join waits for all of the forked threads to terminate
• Spawn/exit is similar

Parbegin-Parend
• Also called cobegin-coend
• Each statement (block/function call) in the cobegin-coend block is executed in parallel
• Statements after coend are not executed until all of the parallel statements are complete

SPMD – SIMD
• Single Program, Multiple Data vs. Single Instruction, Multiple Data
• Both use a number of threads/processes that apply the same program to different data
• SIMD executes the statements synchronously on different data
• SPMD executes the statements asynchronously

Master-Slave
• One thread/process controls all the others
• With dynamic thread/process creation, the master is usually the one that creates the workers
• The master "assigns" work to the workers and the workers send their results back to the master

Client-Server
• Multiple clients connect to a server that responds to requests
• The server can satisfy requests in parallel (several requests handled at once, or a single complex request solved in parallel)
• The client also does some work with the response from the server
• A very good model for heterogeneous systems

Pipelining
• The output of one thread is the input to another thread
• A special type of functional decomposition
• Another case where heterogeneous systems are useful

Task Pools
• Keep a collection of tasks to be done together with the data they operate on
• A thread/process can generate new tasks to add to the pool, and obtains a new task when it finishes its current one

Producer-Consumer
• Producer threads create data used as input by the consumer threads
• The data is stored in a common buffer accessed by producers and consumers
• A producer cannot add if the buffer is full
• A consumer cannot remove if the buffer is empty

Array Data Distributions
• 1-D (a code sketch follows the checkerboard slides)
– Blockwise
• Each process gets ceil(n/p) elements of A, except for the last process, which gets n - (p-1)*ceil(n/p) elements
• Alternatively, the first n % p processes get ceil(n/p) elements while the rest get floor(n/p) elements
– Cyclic
• Process i gets the elements k*p + i, for k = 0, 1, ..., ceil(n/p) - 1 (as long as k*p + i < n)
– Block-cyclic
• Distribute blocks of size b to the processes in a cyclic manner

2-D Array Distribution
• Blockwise distribution of rows or columns
• Cyclic distribution of rows or columns
• Blockwise-cyclic distribution of rows or columns

Checkerboard
• Take an array of size n x m
• Overlay a grid of size g x f
– g <= n
– f <= m
– Most easily seen if n is a multiple of g and m is a multiple of f
• Blockwise checkerboard
– Assign each n/g x m/f submatrix to a processor

Cyclic Checkerboard
• Take each element of an n/g x m/f submatrix and assign it in a cyclic manner
• Block-cyclic checkerboard
– Assign the n/g x m/f submatrices as whole blocks to the processors in a cyclic fashion
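The 1-D distributions above reduce to simple index arithmetic. Here is a minimal C sketch (the helper names block_size, cyclic_owner, and block_cyclic_owner are illustrative, not from the slides) of the blockwise sizes and the cyclic/block-cyclic ownership rules:

#include <stdio.h>

/* Blockwise: the first n % p processes get ceil(n/p) elements, the rest
   get floor(n/p).  Returns how many elements process `rank` (0-based) owns. */
int block_size(int n, int p, int rank) {
    int base = n / p;      /* floor(n/p) */
    int rem  = n % p;      /* these many processes get one extra element */
    return base + (rank < rem ? 1 : 0);
}

/* Cyclic: element i belongs to process i % p. */
int cyclic_owner(int i, int p) { return i % p; }

/* Block-cyclic with block size b: block i/b is dealt out cyclically. */
int block_cyclic_owner(int i, int b, int p) { return (i / b) % p; }

int main(void) {
    int n = 10, p = 3;
    for (int r = 0; r < p; r++)
        printf("process %d owns %d elements (blockwise)\n", r, block_size(n, p, r));
    printf("element 7 -> process %d (cyclic), process %d (block-cyclic, b=2)\n",
           cyclic_owner(7, p), block_cyclic_owner(7, 2, p));
    return 0;
}

For n = 10 and p = 3 this prints block sizes 4, 3, 3, matching the "first n % p processes get one extra element" rule.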
Information Exchange
• Shared variables
– Used in shared memory
– When thread T1 wants to share information with thread T2, T1 writes the information into a variable that is shared with T2
– Must avoid two or more threads accessing the same variable at the same time when at least one of them writes (a race condition)
– Races lead to non-deterministic behavior

Critical Sections
• Sections of code where there may be concurrent accesses to shared variables
• Must make these sections mutually exclusive
– Only one process may execute the section at any one time
• A lock mechanism is used to keep the sections mutually exclusive
– A process checks whether the section is "open"
– If it is, it "locks" the section and executes it (unlocking when done)
– If not, it waits until the section is unlocked

Communication Operations
• Single transfer – Pi sends a message to Pj
• Single broadcast – one process sends the same data to all other processes
• Single accumulation – values from all processes are combined into a single value that is placed at the root
• Gather – each process provides a block of data to a single root process
• Scatter – the root process sends a separate block of a large data structure to every other process

More Communications
• Multi-broadcast – every process sends its data to every other process, so every process ends up with all the data that was spread across the processes
• Multi-accumulation – like accumulation, but every process gets the result
• Total exchange – each process provides p data blocks; its ith block is sent to process Pi. Each process receives one block from every other process and assembles them in order of the sender's index

Applications
• Parallel matrix-vector product
– A·b = c, where A is n x m, b has length m, and c has length n
– Want A to be in contiguous memory
• A single array, not an array of arrays
– Have blocks of rows, together with all of b, calculate a block of c (see the sketch below)
• Used if A is stored row-wise
– Have blocks of columns, with the corresponding block of b, compute partial column results that must then be summed
• Used if A is stored column-wise
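A minimal sketch of the row-block variant, assuming A is stored row-wise in one contiguous array and that the row range [row_lo, row_hi) assigned to this process comes from whatever 1-D row distribution the program uses (those bounds are assumptions for illustration):

#include <stdio.h>

/* Row-block matrix-vector product c = A*b: this process multiplies its block
   of rows of A by all of b, producing the corresponding block of c. */
static void matvec_row_block(const double *A, const double *b, double *c,
                             int m, int row_lo, int row_hi) {
    for (int i = row_lo; i < row_hi; i++) {
        double sum = 0.0;
        for (int j = 0; j < m; j++)
            sum += A[i * m + j] * b[j];   /* row-major indexing into A */
        c[i] = sum;
    }
}

int main(void) {
    /* Tiny 2 x 3 example with a single "process" owning both rows. */
    double A[6] = {1, 2, 3,
                   4, 5, 6};
    double b[3] = {1, 1, 1};
    double c[2];
    matvec_row_block(A, b, c, 3, 0, 2);
    printf("c = [%g, %g]\n", c[0], c[1]);  /* expected: [6, 15] */
    return 0;
}

Each process needs all of b but only its rows of A, which is why this layout suits a row-wise (blockwise row) distribution of A.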
Processes and Threads
• Process – a program in execution
– Includes the code, the program data on the stack and heap, the register values, and the PC
– Assigned to a processor or core for execution
– If there are more processes than resources (processors or memory) for all of them, they execute in a round-robin, time-shared manner
– Context switch – switching the processor from executing one process to executing another

Fork
• The Unix fork system call
– Creates a new process
– Makes a copy of the program
– The copy starts at the statement after the fork
– NOT the shared-memory model; this is the distributed-memory model
– Can take a while to execute

Threads
• Share a single address space
• Best with physically shared memory
• Cheaper to start than a process – no copy of the code space
• Two types
– Kernel threads – managed by the OS
– User threads – managed by a thread library

Thread Execution
• If user threads are executed by a thread library/scheduler (no OS support for threads), then all the threads are part of one process that is scheduled by the OS
– Only one thread executes at a time, even if there are multiple processors
• If the OS has thread management, then threads can be scheduled by the OS and multiple threads can execute concurrently
• Alternatively, the thread scheduler can map user threads to kernel threads (several user threads may map to one kernel thread)

Thread States
• Newly generated
• Executable
• Running
• Waiting
• Finished
• Threads transition from state to state based on events (start, interrupt, end, block, unblock, assign-to-processor)

Synchronization
• Locks
– A process "locks" a shared variable at the beginning of a critical section
• The lock allows the process to proceed if the shared variable is unlocked
• The process is blocked if the variable is locked, until it is unlocked
• Locking is an "atomic" operation

Semaphore
• Usually binary, but can be an integer (counting) semaphore
• wait(s)
– Waits until the value of s is 1 (or greater)
– When it is, decreases s by 1 and continues
• signal(s)
– Increments s by 1

Barrier Synchronization
• A way to have every process wait until every process has reached a certain point
• Guarantees the state of every process before the code that follows is executed

Condition Synchronization
• A thread is blocked until a given condition is established
– If the condition is not true, the thread is put into the blocked state
– When the condition becomes true, the thread is moved from blocked to ready (not necessarily directly onto a processor)
– Since other threads may execute in the meantime, by the time this thread gets a processor the condition may no longer be true
• So the condition must be checked again after the thread resumes

Efficient Thread Programs
• Use the proper number of threads
– Consider the degree of parallelism in the application
– The number of processors
– The size of the shared cache
• Avoid synchronization as much as possible
– Make critical sections as small as possible
• Watch for deadlock conditions

Memory Access
• Must consider that values written to shared memory may be held in local caches
• False sharing
– Consider two threads writing to different memory locations
– This SHOULD not be an issue, since no location is shared between the two caches
– HOWEVER, if the memory locations are close to each other they may fall in the same cache line, so each write invalidates that line in the other processor's cache even though no data is actually shared (illustrated below)
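A minimal Pthreads sketch of the false-sharing scenario, assuming 64-byte cache lines (the struct names, padding size, and iteration count are illustrative assumptions): two threads first increment adjacent counters that share a cache line, then increment counters padded onto separate lines.

#include <pthread.h>
#include <stdio.h>

#define ITERS 10000000L
#define CACHE_LINE 64                     /* assumed cache-line size in bytes */

/* Adjacent counters: they share one cache line, so writes by the two threads
   keep invalidating each other's cached copy (false sharing). */
static struct { long a; long b; } shared_line;

/* Padding pushes the second counter onto a separate cache line. */
static struct { long a; char pad[CACHE_LINE - sizeof(long)]; long b; } padded;

/* Each thread repeatedly increments the counter whose address it is given. */
static void *bump(void *arg) {
    long *counter = (long *)arg;
    for (long i = 0; i < ITERS; i++)
        (*counter)++;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;

    /* False sharing: both counters sit in the same cache line. */
    pthread_create(&t1, NULL, bump, &shared_line.a);
    pthread_create(&t2, NULL, bump, &shared_line.b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    /* No false sharing: the padded counters sit in different cache lines. */
    pthread_create(&t1, NULL, bump, &padded.a);
    pthread_create(&t2, NULL, bump, &padded.b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("counts: %ld %ld %ld %ld\n",
           shared_line.a, shared_line.b, padded.a, padded.b);
    return 0;
}

Built with cc -O0 -pthread and timed (e.g., with the time command), the first pair of joins typically takes noticeably longer than the second, even though no data is logically shared in either phase.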