Parallel Computing
Chapter 3 - Patterns
R. Halverson, Midwestern State University

Parallel Patterns
- Build on serial patterns and structured programming
- Universal algorithmic skeletons, techniques, and strategies, including OOP features
- Goals: well-structured, maintainable, efficient, deterministic, composable

Nesting Pattern
- The ability to compose patterns hierarchically: patterns within patterns
- As in structured programming
  - Static: sequence, selection, iteration
  - Dynamic: recursion
- Any pattern can contain any other pattern

Data Parallelism vs. Functional Decomposition
- Static patterns: functional decomposition
- Dynamic pattern (recursion): data parallelism
- Nesting plus recursion provides parallel slack
- What about "excessive" recursion?

3.2 Serial Control Flow Patterns
- Sequence
- Selection (decision)
- Iteration (loop, repetition)
  - Loop-carried dependencies
  - Related patterns: map, scan, recurrence, scatter, gather, pack
- Recursion
- What is an alias?

Can this loop be parallelized? Problems?

    void engine(int n, double x[], int a[], int b[], int c[], int d[]) {
        for (int k = 0; k < n; ++k)
            x[a[k]] = x[b[k]] * x[c[k]] + x[d[k]];
    }

Can this loop be parallelized? Problems?

    void engine(int n, double x[], double y[],
                int a[], int b[], int c[], int d[]) {
        for (int k = 0; k < n; ++k)
            y[a[k]] = x[b[k]] * x[c[k]] + x[d[k]];
    }

3.3 Parallel Control Patterns
- Fork-join
- Map
- Stencil
- Reduction
- Scan
- Recurrence
- NVIDIA GeForce 480

3.3.1 Fork-Join
- Fork: instruction that creates a new control flow
- Join: instruction that synchronizes control flows created via fork; after the join, only one control flow continues
- Variation: Spawn, which executes a function without the caller waiting for it to return
- Barrier: synchronizes multiple control flows, but all of them continue after the barrier

3.3.2 Map (Fig. 3.6)
- Map replicates an elemental function over each element of an index set
- The elemental function is applied to the elements of collections
- Replaces iteration (a loop) in which every iteration is independent
- The computation may use the count, the index, and the data item
- The number of iterations is known in advance
- Pure elemental function: no side effects
- (A small sketch combining map with per-thread stack allocation appears after Section 3.4.2 below.)

3.3.3 Stencil (Fig. 3.7)
- Stencil extends map by giving the elemental function access to a set of "neighbors"
- The regular pattern of access eliminates memory/data conflicts
- Special cases: out-of-bounds accesses
- Uses tiling (see Section 7.3)
- Applications: image filtering, simulation (e.g., fluid flow), linear algebra

3.3.4 Reduction (Fig. 3.9)
- Reduction combines the elements of a collection into a single element using an associative combiner function
- O(log n) steps in parallel
- Consider summation of an array: calculate the total number of additions
- (A serial sketch of reduction and scan appears after Section 3.4.2 below.)

3.3.5 Scan (Fig. 3.10)
- Scan computes all partial reductions of a collection: for each output position, the reduction up to that point
- AKA prefix sums (example)
- Total number of additions: serial? parallel? How many processors? Implications?
- O(log n) steps in parallel
- Applications: checkbook balances, integration, random number generation

3.3.6 Recurrence
- Omit?

3.4 Serial Data Management Patterns
- How stored data is allocated, shared, read, written, and copied
- Random read/write
- Stack allocation
- Heap allocation
- Closures
- Objects

3.4.1 Random Read & Write
- Memory access via addresses
- Pointers
  - Aliases: if they are "forbidden", avoiding them becomes the programmer's responsibility
- Arrays
  - Safer due to contiguous storage
  - Can still be aliased
- Normal for serial code. Implications for parallel? Locality?

3.4.2 Stack Allocation
- Dynamic allocation
- Nested, as in function calls
- Where is the stack used by systems?
- LIFO
- Parallel: each thread has its own stack
- Preserves locality
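As promised under Section 3.3.2, here is a minimal sketch tying the map pattern to per-thread stack allocation. The slides do not prescribe a framework or these names; OpenMP and the elemental function shifted_sqrt are assumptions used purely for illustration.

    #include <math.h>

    /* Hypothetical elemental function: pure, no side effects (Section 3.3.2). */
    static double shifted_sqrt(double v) {
        return sqrt(v) + 1.0;
    }

    /* Map: apply the elemental function to every element of x.
     * Every iteration is independent, so the loop can run in parallel.
     * Each thread's loop index and temporaries live on its own stack
     * (Section 3.4.2), which preserves locality. */
    void map_shifted_sqrt(int n, const double *x, double *y) {
        #pragma omp parallel for
        for (int k = 0; k < n; ++k)
            y[k] = shifted_sqrt(x[k]);
    }

Compiled without OpenMP support (-fopenmp), the pragma is ignored and the loop simply runs serially, which makes the serial/parallel correspondence easy to see.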
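Sections 3.3.4 and 3.3.5 describe reduction and scan in words only. The serial sketches below, with illustrative names not taken from the slides, show exactly the computations those patterns parallelize; a parallel version would replace each loop with the O(log n) tree and prefix-sum schemes discussed above.

    /* Reduction (3.3.4): combine all elements with the associative "+".
     * Serially this takes n - 1 additions; a parallel tree performs the
     * same additions in O(log n) steps. */
    double reduce_sum(int n, const double *x) {
        double total = 0.0;
        for (int k = 0; k < n; ++k)
            total += x[k];
        return total;
    }

    /* Inclusive scan (3.3.5): out[k] holds the reduction of x[0..k].
     * The checkbook-balance example from the slides is exactly this. */
    void scan_sum(int n, const double *x, double *out) {
        double running = 0.0;
        for (int k = 0; k < n; ++k) {
            running += x[k];
            out[k] = running;
        }
    }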
3.4.3 Heap Allocation
- Definition? Where is the heap used by the system?
- Features: dynamic, complex, slow
- No locality guarantee; loss of coherence
- Fragments memory
- Limited scalability

3.4.4 & 3.4.5 Closures & Objects
- Omit

3.5 Parallel Data Management Patterns
- Shared versus unshared data
- Patterns for modifying data
- Help improve performance

3.5.1 Pack - Unpack
- Pack eliminates unused space in a collection (e.g., an array)
- How?
  - Assign 0 or 1 to each location
  - Use scan (parallel prefix) to compute each element's new address
  - Write to the new array
- Example: Figure 3.12 (p. 98)
- Unpack: return elements to the original array
- Applications?
- (A sketch following these steps appears at the end of these notes.)

3.5.2 Pipeline
- A sequence (series) of processing elements in which the output of one element is the input of the next
- Functional decomposition: limited parallelism, since the number of stages is generally fixed
- Useful for serially dependent tasks and when nested with other patterns

3.5.3 Geometric Decomposition & 3.5.4 Gather
- Omit
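As noted under Section 3.5.1, here is a sketch of pack following the flag / scan / write recipe. It is serial, the names and predicate argument are illustrative assumptions, and in a parallel implementation step 2 would use the scan pattern of Section 3.3.5; Figure 3.12 in the text remains the authoritative example.

    #include <stdlib.h>

    /* Pack (3.5.1): keep only the elements of x for which keep(x[k]) is
     * nonzero.  Follows the slide's recipe:
     *   1. assign 0 or 1 to each location,
     *   2. exclusive scan of the flags gives each kept element its new address,
     *   3. write the flagged elements to those addresses in the output array.
     * Returns the packed length.  (malloc error handling omitted for brevity.) */
    int pack(int n, const double *x, double *out, int (*keep)(double)) {
        int *flag = malloc(n * sizeof *flag);
        int *addr = malloc(n * sizeof *addr);
        int total = 0;

        for (int k = 0; k < n; ++k)          /* step 1: flags */
            flag[k] = keep(x[k]) ? 1 : 0;

        for (int k = 0; k < n; ++k) {        /* step 2: exclusive scan */
            addr[k] = total;
            total += flag[k];
        }

        for (int k = 0; k < n; ++k)          /* step 3: write kept elements */
            if (flag[k])
                out[addr[k]] = x[k];

        free(flag);
        free(addr);
        return total;
    }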