Structuring Parallel Algorithms
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
ECE 498AL, University of Illinois, Urbana-Champaign

Slides from:
• David Kirk/NVIDIA and Wen-mei W. Hwu
• ECE 498AL, University of Illinois, Urbana-Champaign
• Podcasts of their lectures (recommended):
  – http://courses.ece.illinois.edu/ece498/al/Syllabus.html

Key Parallel Programming Steps
1) Find the concurrency in the problem
2) Structure the algorithm to translate the concurrency into performance
3) Implement the algorithm in a suitable programming environment
4) Execute and tune the performance of the code on a parallel system
Unfortunately, these have not been separated into levels of abstraction that can be dealt with independently.

Algorithm
• A step-by-step procedure that is guaranteed to terminate, such that each step is precisely stated and can be carried out by a computer
  – Definiteness – each step is precisely stated
  – Effective computability – each step can be carried out by a computer
  – Finiteness – the procedure terminates
• Multiple algorithms can be used to solve the same problem
  – Some require fewer steps
  – Some exhibit more parallelism
  – Some have a larger memory footprint than others

Choosing Algorithm Structure
[Decision tree from the slide:]
• Task-centric
  – Linear → Task Parallelism
  – Recursive → Divide and Conquer
• Data-centric
  – Linear → Geometric Decomposition
  – Recursive → Recursive Data
• Data-flow-centric
  – Regular → Pipeline
  – Irregular → Event Driven

A Simple Implementation
• Assume we have already loaded the array into __shared__ float partialSum[]

    unsigned int t = threadIdx.x;
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();                  // wait for all sums from the previous round
        if (t % (2*stride) == 0)          // only every (2*stride)-th thread is active
            partialSum[t] += partialSum[t + stride];
    }

Mapping a Divide and Conquer Algorithm
[Figure: reduction tree over array elements 0..15, one level per iteration. Iteration 1: thread 0 computes 0+1, thread 2 computes 2+3, thread 4 computes 4+5, thread 6 computes 6+7, thread 8 computes 8+9, thread 10 computes 10+11. Iteration 2 combines these into 0..3, 4..7, 8..11, ...; iteration 3 produces 0..7 and 8..15; the final iteration produces the full sum.]

Tiled (Stenciled) Algorithms are Important for Geometric Decomposition
• A framework for memory data sharing and reuse by increasing data access locality
  – Tiled access patterns allow small cache/scratchpad memories to hold on to data for reuse
  – For matrix multiplication, a 16x16 thread block performs 2*256 = 512 float loads from device memory for 256 * (2*16) = 8,192 mul/add operations
• A convenient framework for organizing threads (tasks)
[Figure: WIDTH x WIDTH matrices M, N, and P; thread block (bx, by) with threads (tx, ty), tx and ty in 0..bsize-1, each block computing one BLOCK_WIDTH x BLOCK_WIDTH sub-matrix Psub of P.]

Increased Work per Thread for Even More Locality
• Each thread computes two elements of Pdsub
• More work done in each iteration
  – Reduced loads from global memory (Md) to shared memory
  – Reduced instruction overhead
[Figure: Md, Nd, Pd tiled with TILE_WIDTH; each thread produces two Pdsub elements, reusing the same Md tile.]
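The load and mul/add counts above come from the standard tiled matrix multiplication kernel used throughout this course. The following is a minimal sketch of that kernel, assuming square matrices whose Width is a multiple of TILE_WIDTH = 16; the Md/Nd/Pd names follow the slides, but the code itself is a reconstruction rather than the exact course listing.

    #define TILE_WIDTH 16

    // Pd = Md * Nd for Width x Width matrices in row-major order.
    // Each 16x16 block loads one Md tile and one Nd tile into shared memory
    // (2*256 = 512 float loads) and then performs 256 * (2*16) = 8,192
    // mul/add operations on them before moving to the next tile pair.
    __global__ void MatrixMulTiled(const float* Md, const float* Nd,
                                   float* Pd, int Width)
    {
        __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
        __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

        int tx = threadIdx.x, ty = threadIdx.y;
        int Row = blockIdx.y * TILE_WIDTH + ty;   // row of Pd for this thread
        int Col = blockIdx.x * TILE_WIDTH + tx;   // column of Pd for this thread

        float Pvalue = 0.0f;
        for (int m = 0; m < Width / TILE_WIDTH; ++m) {
            // Each thread loads one element of each tile
            Mds[ty][tx] = Md[Row * Width + (m * TILE_WIDTH + tx)];
            Nds[ty][tx] = Nd[(m * TILE_WIDTH + ty) * Width + Col];
            __syncthreads();                       // tiles fully loaded

            for (int k = 0; k < TILE_WIDTH; ++k)   // 16 mul/add pairs per thread
                Pvalue += Mds[ty][k] * Nds[k][tx];
            __syncthreads();                       // finish before overwriting tiles
        }
        Pd[Row * Width + Col] = Pvalue;
    }

The "increased work per thread" variant keeps two running sums per thread (e.g., for two Pd columns), so each Md tile loaded into shared memory is reused for twice as many mul/add operations.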
Double Buffering – a frequently used algorithm pattern
• One could double buffer the computation, getting a better instruction mix within each thread
  – This is classic software pipelining in ILP compilers
• Original:

    Loop {
        Load current tile to shared memory
        syncthreads()
        Compute current tile
        syncthreads()
    }

• Double buffered (a CUDA sketch of this pattern follows the pipeline discussion below):

    Load first tile from global memory into registers
    Loop {
        Deposit tile from registers to shared memory
        syncthreads()
        Load next tile from global memory into registers
        Compute current tile
        syncthreads()
    }

Double Buffering
• Deposit blue tile from registers into shared memory
• Syncthreads
• Load orange tile into registers
• Compute blue tile
• Deposit orange tile into shared memory
• ...
[Figure: the Md/Nd/Pd tiling from before; the current ("blue") tile is computed while the next ("orange") tile is loaded.]

One Can Trade More Work for Increased Parallelism
• Diamond search algorithm for motion estimation: work efficient but sequential
  – Popular in traditional CPU implementations
• Exhaustive search: totally parallel but work inefficient
  – Popular in HW and parallel implementations
[Figure: MPEG encoder pipeline (Mot. Est.; Mot. Comp. & Frame Sub.; DCT & Quantizer; Dequantizer & IDCT), output frame order I, P1, P2, ...; the region of P1 containing the macroblocks required to process MB2 in frame P2.]

An MPEG Algorithm Based on Data Parallelism
• Loops distributed – DOALL style
  – Replicates instructions and tables across accelerators
• If instructions and data are too large for local memory...
  – Large memory transfers are required to preserve data
  – Saturation of and contention for communication resources can leave computation resources idle
• Memory bandwidth constrains performance
[Figure: time steps 0–6; Frame 1 and then Frame 2 each processed across accelerators ACC1–ACC4, with communication between the frames.]

Loop Fusion & Memory Privatization
• Stage loops fused into a single DOALL macroblock loop
  – Memory privatization reduces main memory accesses
• Replicates instructions and tables across processors
  – Local memory constraints may prevent this technique
• Novel dimensions of parallelism reduce communication
[Figure: time steps 0–4; Frame 1 on ACC1–ACC4 and Frame 2 on ACC5–ACC8 in parallel.]

Pipeline or "Spatial Computing" Model
• Each PE performs as one pipeline stage in macroblock processing
• Imbalanced stages result in idle resources
• Takes advantage of direct, accelerator-to-accelerator communication
• Not very effective in CUDA, but can be effective for Cell
• Efficient point-to-point communication can enable new models
[Figure: time steps 0–4; macroblocks of Frame 1 stream through pipeline stages ACC1–ACC4 and those of Frame 2 through ACC5–ACC8.]
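As promised above, here is a minimal CUDA sketch of the double-buffered tile loop, applied to the tiled matrix multiplication kernel from earlier. It is a reconstruction under the same assumptions (Width a multiple of TILE_WIDTH = 16), not the exact course code: registers act as the second buffer, so the global-memory load of the next tile is in flight while the current tile, already in shared memory, is being computed.

    #define TILE_WIDTH 16

    // Double-buffered tiled multiplication: Pd = Md * Nd, Width x Width.
    __global__ void MatrixMulDoubleBuffered(const float* Md, const float* Nd,
                                            float* Pd, int Width)
    {
        __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
        __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

        int tx = threadIdx.x, ty = threadIdx.y;
        int Row = blockIdx.y * TILE_WIDTH + ty;
        int Col = blockIdx.x * TILE_WIDTH + tx;
        int numTiles = Width / TILE_WIDTH;

        // Load first tile from global memory into registers
        float mReg = Md[Row * Width + tx];
        float nReg = Nd[ty * Width + Col];

        float Pvalue = 0.0f;
        for (int m = 0; m < numTiles; ++m) {
            // Deposit tile from registers into shared memory
            Mds[ty][tx] = mReg;
            Nds[ty][tx] = nReg;
            __syncthreads();

            // Load next tile from global memory into registers; the load
            // latency overlaps with the compute loop below
            if (m + 1 < numTiles) {
                mReg = Md[Row * Width + (m + 1) * TILE_WIDTH + tx];
                nReg = Nd[((m + 1) * TILE_WIDTH + ty) * Width + Col];
            }

            // Compute current tile
            for (int k = 0; k < TILE_WIDTH; ++k)
                Pvalue += Mds[ty][k] * Nds[k][tx];
            __syncthreads();   // finish computing before the next deposit
        }
        Pd[Row * Width + Col] = Pvalue;
    }

The cost of the better instruction mix is register pressure: each thread holds the next tile's elements in registers across the compute loop, exactly as in software pipelining for ILP compilers.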
Small Decisions in Algorithms Can Have a Major Effect on Parallelism
(H.263 motion estimation example)
• Different algorithms may expose different levels of parallelism while achieving the desired result
• In motion estimation, one can use previous vectors (either from space or from time) as guess vectors
[Figure: prev_frame and cur_frame pairs. (a) Guess vectors are obtained from the previous macroblock. (b) Guess vectors are obtained from the corresponding macroblock in the previous frame.]
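To make the parallelism difference concrete, here is a small C sketch contrasting the two guess strategies. All names (MotionVector, spatial_guess, temporal_guess, the vector arrays) are hypothetical illustrations, not part of the course code.

    #include <stddef.h>

    /* Hypothetical type for illustration. */
    typedef struct { int dx, dy; } MotionVector;

    /* (a) Spatial guess: reuse the vector of the previously processed
     *     macroblock in the same frame. Each macroblock depends on its
     *     predecessor, which serializes processing within a frame.      */
    MotionVector spatial_guess(const MotionVector* cur_frame_mvs, size_t mb_index)
    {
        return (mb_index > 0) ? cur_frame_mvs[mb_index - 1]
                              : (MotionVector){0, 0};
    }

    /* (b) Temporal guess: reuse the vector of the corresponding macroblock
     *     in the previous frame. Every guess depends only on already-final
     *     data, so all macroblocks of the current frame can be estimated
     *     in parallel.                                                    */
    MotionVector temporal_guess(const MotionVector* prev_frame_mvs, size_t mb_index)
    {
        return prev_frame_mvs[mb_index];
    }

The computation per macroblock is essentially the same in both cases; the "small decision" of where the guess vector comes from determines whether the frame's macroblocks form a dependence chain or an embarrassingly parallel loop.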