Starting Parallel Algorithm Design
David Monismith
Based on notes from Introduction to Parallel Programming, 2nd Edition, by Grama, Gupta, Karypis, and Kumar

Outline
• Decomposition
• Tasks and Interaction
• Load Balancing
• Managing Overhead
• Parallel Models

Decomposition
• Decomposition - dividing a computation into parts that may be executed in parallel
• Tasks - programmer-defined units of computation into which the main computation is subdivided
• Task-dependency graphs - an abstraction used to express dependencies between tasks and their relative order of execution

Granularity
• Granularity - the number and size of the tasks into which a computation is divided
• Fine-grained - the computation is divided into many small tasks
• Coarse-grained - the computation is divided into a few large tasks
• Degree of concurrency - the maximum number of tasks that can be executed in parallel at any time during the program's execution
• The average degree of concurrency is often more useful, as it provides a better indication of overall performance

Example
• Matrix-Vector Multiplication (figure will be drawn on the board; a code sketch follows below)
  – Generally considered fine-grained if parallelized so that each dot product is a task
  – Could be considered coarse-grained if, on a dual-core processor, each task computes half of the dot products

Task Graphs
• Critical path - the longest directed path between a pair of start and finish nodes in the task graph
• Critical path length - the sum of the weights of the nodes along a critical path
• The weight of a node is the size of the task, i.e., the amount of work associated with it
• Aside from these factors, interaction between tasks running on different processors may add to the runtime
• An example of a task-dependency graph will be drawn in class to aid in understanding these concepts

Processes and Threads vs. Processors
• Mapping - the mechanism by which tasks are assigned to processes and/or threads for execution
• Threads and processes are logical units that perform tasks
• Processors physically perform the computations
• This distinction is important because a computation may involve multiple stages with different costs, for example internode communication vs. shared-memory communication
• Drawing a task-dependency or task-interaction graph helps us understand how tasks interact with one another and aids in the development of a parallel algorithm
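Sketch: Fine-Grained Matrix-Vector Multiplication
• A minimal sketch of the fine-grained decomposition from the matrix-vector example above: each iteration of the outer loop computes one dot product (one row of y) and forms one task; the matrix layout, size, and test data are illustrative assumptions, not part of the Grama et al. notes

    #include <stdio.h>

    #define N 4

    /* y = A*x with one fine-grained task per row of A:
       each outer-loop iteration is an independent dot product. */
    double A[N][N], x[N], y[N];

    int main(void) {
        int i, j;
        for (i = 0; i < N; i++) {      /* arbitrary test data */
            x[i] = 1.0;
            for (j = 0; j < N; j++)
                A[i][j] = i + j;
        }
        #pragma omp parallel for private(i, j) shared(A, x, y)
        for (i = 0; i < N; i++) {
            double dot = 0.0;          /* local to each iteration */
            for (j = 0; j < N; j++)
                dot += A[i][j] * x[j];
            y[i] = dot;
        }
        for (i = 0; i < N; i++)
            printf("y[%d] = %f\n", i, y[i]);
        return 0;
    }

• A coarse-grained version on a dual-core machine would instead hand each thread half of the rows; OpenMP's default static schedule produces essentially that mapping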
Decomposition Techniques
• Embarrassingly Parallel
• Recursive Decomposition
• Data Decomposition
• Exploratory Decomposition
• Speculative Decomposition

Embarrassingly Parallel Tasks
• Some tasks lend themselves to direct parallelization
• Such tasks are said to be embarrassingly parallel and can be directly mapped to processes or threads
• A subset of these tasks represents the map pattern
• The map pattern represents a function that can be "replicated and applied to all elements in a collection" - source: https://software.intel.com/en-us/blogs/2009/06/10/parallel-patterns-3-map
• Map operations occur in independent loop iterations

Embarrassingly Parallel (Map)
• Array (or matrix) addition is a straightforward example that is easily parallelized
• The serial version follows:

    for (i = 0; i < N; i++)
        C[i] = A[i] + B[i];

• Three OpenMP parallel versions follow on the next slides

OpenMP First Try
• We could parallelize the loop above directly as follows (omp_get_thread_num and omp_get_num_threads require #include <omp.h>):

    #pragma omp parallel private(i) shared(A,B,C)
    {
        int chunk = N / omp_get_num_threads();
        int start = omp_get_thread_num() * chunk;
        int end   = start + chunk;
        /* The last thread picks up the leftover iterations when
           N is not evenly divisible by the number of threads. */
        if (omp_get_thread_num() == omp_get_num_threads() - 1)
            end = N;
        for (i = start; i < end; i++)
            C[i] = A[i] + B[i];
    }

• Notice that i is declared private so that it is not shared between threads - each thread gets its own copy of i
• Arrays A, B, and C are declared shared because they are shared between threads

OpenMP for Directive
• It is preferable to let OpenMP parallelize loops directly using the for directive:

    #pragma omp parallel private(i) shared(A,B,C)
    {
        #pragma omp for
        for (i = 0; i < N; i++)
            C[i] = A[i] + B[i];
    }

• Notice that the loop can be written in serial fashion; its iterations are automatically partitioned among the threads

Shortened OpenMP for
• When a parallel region contains a single for loop, the parallel and for directives may be combined:

    #pragma omp parallel for private(i) shared(A,B,C)
    for (i = 0; i < N; i++)
        C[i] = A[i] + B[i];
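Sketch: A Complete Map Example
• For reference, the combined form above drops into a complete, compilable program; the wrapper below (main function, array size, initialization values) is an illustrative assumption rather than part of the original notes

    #include <stdio.h>

    #define N 1000000

    double A[N], B[N], C[N];

    int main(void) {
        int i;
        for (i = 0; i < N; i++) {   /* initialize inputs serially */
            A[i] = i;
            B[i] = 2.0 * i;
        }
        /* Map pattern: every iteration is independent, so OpenMP is
           free to divide the iterations among the threads. */
        #pragma omp parallel for private(i) shared(A, B, C)
        for (i = 0; i < N; i++)
            C[i] = A[i] + B[i];

        printf("C[N-1] = %f\n", C[N-1]);
        return 0;
    }

• Compile with an OpenMP-aware compiler, e.g. gcc -fopenmp; the thread count can be set with the OMP_NUM_THREADS environment variable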
Recursive Decomposition
• Used to extract concurrency from problems that can be solved by divide-and-conquer
• Such a problem is solved by dividing it into independent sub-problems, solving each (often recursively), and combining the results
• A special case of this decomposition is the Reduction Pattern, wherein the elements of a collection are combined with a binary associative operator (e.g. +, *, min, max) - source: https://software.intel.com/en-us/blogs/2009/07/23/parallel-pattern-7-reduce

Example
• To find the minimum of an array A of size N serially, use the following algorithm:

    min = A[0];
    for (i = 1; i < N; i++)
        if (A[i] < min)
            min = A[i];

Example
• Decomposing this task for parallelism suggests a recursive solution (a task-parallel sketch of this recursion appears at the end of these notes):

    /* Returns the minimum of the n elements of A starting at index i. */
    int findMinRec(int A[], int i, int n) {
        if (n == 1)
            return A[i];
        else {
            int lmin = findMinRec(A, i, n/2);           /* left half  */
            int rmin = findMinRec(A, i + n/2, n - n/2); /* right half */
            return (lmin < rmin) ? lmin : rmin;
        }
    }

OpenMP Implementation
• OpenMP expresses this pattern with a reduction clause (min reductions require OpenMP 3.1 or later):

    for (i = 0; i < N; i++)
        A[i] = rand() % 100;

    small = A[0];
    #pragma omp parallel for reduction(min:small)
    for (i = 0; i < N; i++) {
        if (A[i] < small)
            small = A[i];
    }

OpenMP Sum Reduction

    for (i = 0; i < N; i++)
        A[i] = i + 1;

    sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < N; i++)
        sum += A[i];

    printf("The sum is %d\n", sum);

Data Decomposition
• Commonly used for algorithms that operate on large data structures
• Involves two steps:
  – The data is partitioned
  – The data partitioning is used to induce a partitioning of the computation into tasks
• Operations on different data partitions are typically similar, or are chosen from a small set of operations

Partitioning
• Partitioning output data - each element of the output can be computed independently of the others as a function of the input
  – Example - matrix multiplication can be partitioned into submatrix computations (a sketch follows at the end of these notes)
• Partitioning input data - a task is created for each partition of the input data
  – Example - finding a minimum or maximum
• Partitioning input and output data - a combination of the two cases above
• Partitioning intermediate data

Next Time
• More decompositions
  – Exploratory Decomposition
  – Speculative Decomposition
• Tasks and Interactions
• Load Balancing
• Handling Overhead
• Parallel Algorithm Models
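Sketch: Task-Parallel findMinRec
• The recursive decomposition of findMinRec can also be executed directly with OpenMP tasks (OpenMP 3.0 or later); the cutoff value and task structure below are illustrative choices, not part of the original notes

    #include <omp.h>

    /* Stop creating tasks once sub-problems are small, so task
       overhead does not swamp the useful work. */
    #define CUTOFF 1000

    int findMinTask(int A[], int i, int n) {
        if (n == 1)
            return A[i];
        if (n < CUTOFF) {              /* small cases: plain recursion */
            int lmin = findMinTask(A, i, n/2);
            int rmin = findMinTask(A, i + n/2, n - n/2);
            return (lmin < rmin) ? lmin : rmin;
        } else {
            int lmin, rmin;
            #pragma omp task shared(lmin)   /* left half as a task */
            lmin = findMinTask(A, i, n/2);
            rmin = findMinTask(A, i + n/2, n - n/2);
            #pragma omp taskwait            /* wait for the left half */
            return (lmin < rmin) ? lmin : rmin;
        }
    }

    /* Call from a parallel region with a single initial task:
       #pragma omp parallel
       #pragma omp single
       result = findMinTask(A, 0, N);
    */

• For a simple minimum, the reduction clause shown earlier is both shorter and faster; tasks pay off when the work per sub-problem is irregular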
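Sketch: Output-Data Partitioning in Matrix Multiplication
• The Partitioning slide notes that matrix multiplication can be partitioned by output; a minimal sketch follows, assuming square n-by-n matrices and partitioning the output into rows (one kind of submatrix) - illustrative choices rather than part of the original notes

    /* C = A*B with n x n matrices. Each element of C depends only on
       the inputs A and B, so the rows of C are independent output
       partitions; each outer-loop iteration below is one task. */
    void matmul(int n, double A[n][n], double B[n][n], double C[n][n]) {
        int i, j, k;
        #pragma omp parallel for private(i, j, k) shared(A, B, C)
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++) {
                C[i][j] = 0.0;
                for (k = 0; k < n; k++)
                    C[i][j] += A[i][k] * B[k][j];
            }
    }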