GPU-Efficient Recursive Filtering and Summed-Area Tables
Jeremiah van Oosten, Reinier van Oeveren

Outline
Introduction
Related Works
◦ Prefix Sums and Scans
◦ Recursive Filtering
◦ Summed-Area Tables
Problem Definition
Parallelization Strategies
◦ Baseline (Algorithm RT)
◦ Block Notation
◦ Inter-block Parallelism
◦ Kernel Fusion (Algorithm 2)
Overlapping & Summed-Area Tables
◦ Causal-Anticausal Overlapping (Algorithms 3 & 4)
◦ Row-Column Causal-Anticausal Overlapping (Algorithm 5)
◦ Overlapped Summed-Area Tables (Algorithm SAT)
Results
Conclusion

Introduction
Linear filtering is commonly used to blur, sharpen, or down-sample images. A direct implementation evaluating a filter of support $d$ on an $h \times w$ image has a cost of $O(hwd)$. The cost can be reduced using a recursive filter, in which previously computed results are used to compute the current value, for example
$$y_i = x_i - \tfrac{1}{2}\, y_{i-1}.$$
The cost drops to $O(hwr)$, where $r$ is the number of recursive feedbacks (the filter order). At each step, the filter produces an output element as a linear combination of the input element and previously computed output elements:
$$\boldsymbol{y} = F(\boldsymbol{p}, \boldsymbol{x}), \qquad y_i = x_i - \sum_{k=1}^{r} a_k\, y_{i-k}, \qquad O(hwr).$$

Applications of recursive filters
◦ Low-pass filtering, e.g. Gaussian kernels
◦ Inverse convolution ($X = V * F$, $V = X * F^{-1}$)
◦ Summed-area tables
[Figure: input image and the blurred result produced by recursive filters]

Recursive filters can be causal or anticausal (non-causal). Causal filters operate on previous values:
$$y_i = x_i - \sum_{k=1}^{r} a_k\, y_{i-k}.$$
Anticausal filters operate on "future" values, $\boldsymbol{z} = R(\boldsymbol{y}, \boldsymbol{e})$:
$$z_i = y_i - \sum_{k=1}^{r} a_k\, z_{i+k}.$$

It is often required to perform a sequence of recursive image filters.
[Figure: input $X$, intermediates $Y$, $Z$, $U$, output $V$, with prologues $P$, $P'$ and epilogues $E$, $E'$]
• Independent columns: causal $Y = F(P, X)$, anticausal $Z = R(Y, E)$
• Independent rows: causal $U = F^{\tau}(P', Z)$, anticausal $V = R^{\tau}(U, E')$

The naïve approach to solving the sequence of recursive filters does not sufficiently utilize the processing cores of the GPU. The latest GPU from NVIDIA has 2,668 shader cores; even a large image (2048×2048) processed one row or column per thread will not make full use of all available cores, and under-utilization of the cores prevents latency hiding. We need a way to make better use of the GPU without increasing IO.

In the paper "GPU-Efficient Recursive Filtering and Summed-Area Tables", Diego Nehab et al. introduce a new algorithmic framework that reduces memory bandwidth by overlapping computation over the full sequence of recursive filters. The image is partitioned into 2D blocks of size $b \times b$.

Related work: prefix sums and scans
A prefix sum $y_i = x_i + y_{i-1}$ is the simple case of a first-order recursive filter. A scan generalizes the recurrence to an arbitrary binary associative operator. Parallel prefix sums and scans are important building blocks for numerous algorithms [Iverson 1962; Stone 1971; Blelloch 1989; Sengupta et al. 2007]. An optimized implementation comes with the CUDPP library [2011].

Related work: recursive filtering
A generalization of the prefix sum using a weighted combination of prior outputs,
$$y_i = x_i - \sum_{k=1}^{r} a_k\, y_{i-k},$$
can be implemented as a scan operation with redefined basic operators. Ruijters and Thévenaz [2010] exploit parallelism across the rows and columns of the input. Sung and Mitra [1986] use block parallelism and split the computation into two parts:
◦ one computation based only on the block data, assuming zero initial conditions;
◦ one computation based only on the initial conditions, assuming zero block data:
$$Y = F(0, X) + F(P, 0).$$
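To make this split concrete, here is a minimal NumPy sketch (not from the paper or the slides; the order $r = 2$, the coefficient values, and the helper name `causal_filter` are illustrative). It checks numerically that the response to the block data with a zero prologue plus the response to the prologue with zero block data equals the full filter output.

```python
import numpy as np

def causal_filter(prologue, x, a):
    """y_i = x_i - sum_k a_k * y_{i-k}; 'prologue' holds the r outputs before x."""
    r = len(a)
    y = np.concatenate([prologue, np.zeros_like(x)])
    for i in range(len(x)):
        y[r + i] = x[i] - sum(a[k] * y[r + i - 1 - k] for k in range(r))
    return y[r:]

a = np.array([-0.5, 0.1])   # feedback coefficients a_1, a_2 (order r = 2), illustrative
p = np.array([0.3, -0.2])   # initial conditions (prologue)
x = np.random.rand(8)       # block data

full  = causal_filter(p, x, a)                       # F(p, x)
split = (causal_filter(np.zeros_like(p), x, a)       # F(0, x): zero initial conditions
         + causal_filter(p, np.zeros_like(x), a))    # F(p, 0): zero block data
assert np.allclose(full, split)                      # superposition: the two parts add up
```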
Related work: summed-area tables
Summed-area tables enable averaging rectangular regions of pixels with a constant number of reads:
$$avg = \frac{LR - LL - UR + UL}{width \cdot height}.$$
[Figure: corner samples $UL$, $UR$, $LL$, $LR$ of a $width \times height$ rectangle]
The paper "Fast Summed-Area Table Generation…" by Justin Hensley et al. (2005) describes a recursive-doubling method that requires multiple passes over the input image (a 256×256 image requires 16 passes). In 2010, Hensley extended the 2005 implementation to compute shaders, taking more samples per pass and storing intermediate results in shared memory; a 256×256 image then requires only 4 passes when reading 16 samples per pass.

Problem definition
Causal recursive filters of order $r$ are characterized by a set of $r$ feedback coefficients $a_k$ in the following manner. Given a prologue vector $\boldsymbol{p} \in \mathbb{R}^r$ and an input vector $\boldsymbol{x} \in \mathbb{R}^h$ of any size $h$, the filter $F$ produces the output $\boldsymbol{y} = F(\boldsymbol{p}, \boldsymbol{x})$ with $\boldsymbol{y} \in \mathbb{R}^h$ ($\boldsymbol{y}$ has the same size as the input $\boldsymbol{x}$):
$$y_i = x_i - \sum_{k=1}^{r} a_k\, y_{i-k}.$$
Similarly for the anticausal filter: given an input vector $\boldsymbol{y} \in \mathbb{R}^h$ and an epilogue vector $\boldsymbol{e} \in \mathbb{R}^r$, the output vector $\boldsymbol{z} = R(\boldsymbol{y}, \boldsymbol{e})$ is defined by
$$z_i = y_i - \sum_{k=1}^{r} a'_k\, z_{i+k}.$$
For row processing we define an extended causal filter $F^{\tau}$ and anticausal filter $R^{\tau}$:
$$\boldsymbol{u} = F^{\tau}(\boldsymbol{p}', \boldsymbol{z}), \qquad \boldsymbol{v} = R^{\tau}(\boldsymbol{u}, \boldsymbol{e}').$$
With these definitions we can formulate the problem of applying the full sequence of four recursive filters (down, up, right, left):
[Figure: input $X$, intermediates $Y$, $Z$, $U$, output $V$, with prologues $P$, $P'$ and epilogues $E$, $E'$]
• Independent columns: causal $Y = F(P, X)$, anticausal $Z = R(Y, E)$
• Independent rows: causal $U = F^{\tau}(P', Z)$, anticausal $V = R^{\tau}(U, E')$

The goal is to implement this algorithm on the GPU and make full use of all available resources:
◦ maximize occupancy by splitting the problem up to use all cores;
◦ reduce I/O to global memory.
We must break the dependency chain in order to increase task parallelism. Primary design goal: increase the amount of parallelism without increasing memory I/O. The strategies considered are the baseline algorithm RT, block notation, inter-block parallelism, and kernel fusion.

Baseline (Algorithm RT): independent row and column processing
Step RT1: in parallel for each column in $X$, apply $F$ sequentially and store $Y$.
Step RT2: in parallel for each column in $Y$, apply $R$ sequentially and store $Z$.
Step RT3: in parallel for each row in $Z$, apply $F^{\tau}$ sequentially and store $U$.
Step RT4: in parallel for each row in $U$, apply $R^{\tau}$ sequentially and store $V$.
[Figure: input, column-processing and row-processing stages, output]
Completion takes $O\!\left(\tfrac{hw}{cp}\, 4r\right)$ steps and total bandwidth usage is $8hw$, where $p$ = number of streaming multiprocessors, $c$ = number of cores per multiprocessor, $w$ = input image width, $h$ = input image height, and $r$ = order of the applied filter.
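A sequential NumPy sketch of the baseline RT scheme for a first-order filter ($r = 1$); the helper names and coefficient value are illustrative, and on the GPU each column or row would be handled by one thread instead of a Python loop.

```python
import numpy as np

def causal(x, a):
    """y_i = x_i - a * y_{i-1} with a zero prologue (first order, r = 1)."""
    y = np.empty_like(x)
    prev = 0.0
    for i in range(len(x)):
        prev = y[i] = x[i] - a * prev
    return y

def anticausal(y, a):
    """z_i = y_i - a * z_{i+1} with a zero epilogue: a causal pass on the reversed signal."""
    return causal(y[::-1], a)[::-1]

def algorithm_rt(X, a):
    # RT1/RT2: independent columns (one GPU thread per column), causal then anticausal
    Y = np.apply_along_axis(causal,     0, X, a)
    Z = np.apply_along_axis(anticausal, 0, Y, a)
    # RT3/RT4: independent rows (one GPU thread per row), causal then anticausal
    U = np.apply_along_axis(causal,     1, Z, a)
    V = np.apply_along_axis(anticausal, 1, U, a)
    return V   # every intermediate image goes through memory: 8*h*w of traffic in total

# With a = -1 every pass is a prefix (or suffix) sum, so the result is easy to verify.
X = np.random.rand(64, 64)
Y_ref = X.cumsum(axis=0)
Z_ref = Y_ref[::-1, :].cumsum(axis=0)[::-1, :]
U_ref = Z_ref.cumsum(axis=1)
V_ref = U_ref[:, ::-1].cumsum(axis=1)[:, ::-1]
assert np.allclose(algorithm_rt(X, a=-1.0), V_ref)
```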
Block notation
Partition the input image into $b \times b$ blocks, where $b$ is the number of threads in a warp (32). Notation:
◦ $B_{m,n}(X)$ = the $b \times b$ block of matrix $X$ with index $(m, n)$
◦ $P_{m-1,n}(X)$ = the $r \times b$ column-prologue submatrix
◦ $E_{m+1,n}(X)$ = the $r \times b$ column-epilogue submatrix
For rows we have the (similar) transposed operators $P^{\tau}_{m,n-1}(X)$ and $E^{\tau}_{m,n+1}(X)$.
Tail and head operators select prologue- and epilogue-shaped submatrices from a $b \times b$ block:
$$P_{m,n}(X) = T(B_{m,n}(X)), \quad P^{\tau}_{m,n}(X) = T^{\tau}(B_{m,n}(X)), \quad E_{m,n}(X) = H(B_{m,n}(X)), \quad E^{\tau}_{m,n}(X) = H^{\tau}(B_{m,n}(X)).$$
The result is a blocked version of the problem definition $Y = F(P, X)$, $Z = R(Y, E)$, $U = F^{\tau}(P', Z)$, $V = R^{\tau}(U, E')$:
$$B_{m,n}(Y) = F(P_{m-1,n}(Y), B_{m,n}(X)),$$
$$B_{m,n}(Z) = R(B_{m,n}(Y), E_{m+1,n}(Z)),$$
$$B_{m,n}(U) = F^{\tau}(P^{\tau}_{m,n-1}(U), B_{m,n}(Z)),$$
$$B_{m,n}(V) = R^{\tau}(B_{m,n}(U), E^{\tau}_{m,n+1}(V)).$$

Superposition (based on linearity)
The effects of the input and of the prologue/epilogue on the output can be computed independently:
$$F(\boldsymbol{p}, \boldsymbol{x}) = F(\boldsymbol{0}, \boldsymbol{x}) + F(\boldsymbol{p}, \boldsymbol{0}), \qquad R(\boldsymbol{y}, \boldsymbol{e}) = R(\boldsymbol{y}, \boldsymbol{0}) + R(\boldsymbol{0}, \boldsymbol{e}).$$
These terms can be expressed as matrix products (for any $r$, $I_r$ is the $r \times r$ identity matrix):
$$F(\boldsymbol{p}, \boldsymbol{0}) = F(I_r, \boldsymbol{0})\,\boldsymbol{p} = A_{FP}\,\boldsymbol{p}, \qquad R(\boldsymbol{0}, \boldsymbol{e}) = R(\boldsymbol{0}, I_r)\,\boldsymbol{e} = A_{RE}\,\boldsymbol{e},$$
$$F(\boldsymbol{0}, \boldsymbol{x}) = F(\boldsymbol{0}, I_b)\,\boldsymbol{x} = A_{FB}\,\boldsymbol{x}, \qquad R(\boldsymbol{y}, \boldsymbol{0}) = R(I_b, \boldsymbol{0})\,\boldsymbol{y} = A_{RB}\,\boldsymbol{y},$$
$$T(A_{FP}) = A_F^b, \qquad H(A_{RE}) = A_R^b.$$
All of these matrices are precomputed and depend only on the feedback coefficients of the filters $F$ and $R$: $A_{FB}$ and $A_{RB}$ are $b \times b$, $A_{FP}$ and $A_{RE}$ are $b \times r$, and $A_F^b$ and $A_R^b$ are $r \times r$. Details in the paper.

Inter-block parallelism (Algorithm 1)
Block computations can be performed independently. The output block is $B_m(\boldsymbol{y}) = F(P_{m-1}(\boldsymbol{y}), B_m(\boldsymbol{x}))$, where the prologue $P_{m-1}(\boldsymbol{y})$ is the tail of the previous output block, $T(B_{m-1}(\boldsymbol{y}))$. Using superposition,
$$B_m(\boldsymbol{y}) = F(\boldsymbol{0}, B_m(\boldsymbol{x})) + F(P_{m-1}(\boldsymbol{y}), \boldsymbol{0}).$$
The first term is the incomplete causal output $\bar{B}_m(\boldsymbol{y})$, and the second term equals $F(I_r, \boldsymbol{0})\, P_{m-1}(\boldsymbol{y}) = A_{FP}\, P_{m-1}(\boldsymbol{y})$. Taking tails, the completed prologues satisfy
$$P_m(\boldsymbol{y}) = \bar{P}_m(\boldsymbol{y}) + A_F^b\, P_{m-1}(\boldsymbol{y}). \qquad (1)$$
Recall that
$$B_m(\boldsymbol{y}) = F(P_{m-1}(\boldsymbol{y}), B_m(\boldsymbol{x})). \qquad (2)$$
Algorithm 1
1.1 In parallel for all $m$, compute and store each incomplete $\bar{P}_m(\boldsymbol{y})$.
1.2 Sequentially for each $m$, compute and store the $P_m(\boldsymbol{y})$ according to (1), using the previously computed $\bar{P}_m(\boldsymbol{y})$.
1.3 In parallel for all $m$, compute and store the output block $B_m(\boldsymbol{y})$ using (2) and the previously computed $P_{m-1}(\boldsymbol{y})$.
Processing all rows and columns with the causal and anticausal filter pairs requires 4 successive applications of Algorithm 1. There are $\frac{hw}{b}$ independent tasks, which hides memory-access latency. However, the memory bandwidth usage is now $\left(12 + \frac{16r}{b}\right)hw$, significantly more than Algorithm RT ($8hw$). This can be solved by kernel fusion.

Kernel fusion (Algorithm 2)
Original idea: Kirk & Hwu [2010]. Use the output of one kernel as the input for the next without going through global memory: a fused kernel contains the code of both kernels but keeps intermediate results in shared memory. Use Algorithm 1 for all filters and fuse:
◦ the last stage of $F$ with the first stage of $R$;
◦ the last stage of $R$ with the first stage of $F^{\tau}$;
◦ the last stage of $F^{\tau}$ with the first stage of $R^{\tau}$.
We aimed for bandwidth reduction. Did it work?
◦ Algorithm 1: $\left(12 + \frac{16r}{b}\right)hw$
◦ Algorithm 2: $\left(9 + \frac{16r}{b}\right)hw$
Yes, it did! (For the full algorithm in text, please see the paper.)
Further I/O reduction is still possible by recomputing intermediate results instead of storing them in memory. Bandwidth drops to $\left(6 + \frac{22r}{b}\right)hw$ (good), at a step count of roughly $O\!\left(\tfrac{hw}{cp}\left(14r + \tfrac{1}{4}\, b\,(r^2 + r)\right)\right)$ (bad). Bandwidth usage is then less than Algorithm RT(!), but it involves more computation; future hardware may tip the balance in favor of more computation.
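Before moving on to overlapping, here is a minimal sequential NumPy sketch of Algorithm 1 for a first-order filter ($r = 1$) on a one-dimensional signal; the block size, coefficient, and helper names are illustrative. For $r = 1$ the matrix $A_F^b$ reduces to the scalar $(-a)^b$, and on the GPU stages 1.1 and 1.3 would run as parallel kernels.

```python
import numpy as np

def causal(prologue, x, a):
    """First-order causal filter y_i = x_i - a*y_{i-1}; 'prologue' is y_{-1}."""
    y = np.empty_like(x)
    prev = prologue
    for i in range(len(x)):
        prev = y[i] = x[i] - a * prev
    return y

def algorithm_1(x, a, b=32):
    blocks = x.reshape(-1, b)                      # B_m(x), one row per block
    # 1.1  in parallel for all m: incomplete prologues, the tail of F(0, B_m(x))
    p_bar = np.array([causal(0.0, blk, a)[-1] for blk in blocks])
    # 1.2  sequentially for each m: fix them, P_m(y) = P_bar_m(y) + A_F^b * P_{m-1}(y)
    AFb = (-a) ** b                                # the 1x1 matrix A_F^b when r = 1
    p = np.empty_like(p_bar)
    prev = 0.0
    for m in range(len(p_bar)):
        prev = p[m] = p_bar[m] + AFb * prev
    # 1.3  in parallel for all m: complete output blocks B_m(y) = F(P_{m-1}(y), B_m(x))
    out = [causal(0.0 if m == 0 else p[m - 1], blk, a)
           for m, blk in enumerate(blocks)]
    return np.concatenate(out)

x = np.random.rand(8 * 32)
assert np.allclose(algorithm_1(x, a=-0.5), causal(0.0, x, a=-0.5))
```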
Causal-anticausal overlapping (Algorithms 3 & 4)
Overlapping is introduced to reduce IO to global memory. It is possible to work with twice-incomplete anticausal epilogues $\bar{\bar{E}}_{m,n}(Z)$, computed directly from the incomplete causal output blocks $\bar{B}_{m,n}(Y)$; this is called causal-anticausal overlapping. Recall that the filter can be expressed so that the effects of the input and of the prologue or epilogue are computed independently and later added together:
$$F(\boldsymbol{p}, \boldsymbol{x}) = F(\boldsymbol{0}, \boldsymbol{x}) + F(\boldsymbol{p}, \boldsymbol{0}), \qquad R(\boldsymbol{y}, \boldsymbol{e}) = R(\boldsymbol{y}, \boldsymbol{0}) + R(\boldsymbol{0}, \boldsymbol{e}).$$
Using these properties we can split the dependency chains of the anticausal epilogues and express each completed $E_m(\boldsymbol{z})$ in terms of a twice-incomplete epilogue $\bar{\bar{E}}_m(\boldsymbol{z})$ obtained directly from the block input (see the paper for the exact completion formulas). Each twice-incomplete epilogue $\bar{\bar{E}}_m(\boldsymbol{z})$ depends only on the corresponding input block $B_m(\boldsymbol{x})$, so they can all be computed in parallel already in the first pass. As a byproduct of that same pass, we can compute and store the $\bar{P}_m(\boldsymbol{y})$ that will be needed to obtain $P_m(\boldsymbol{y})$. With $P_m(\boldsymbol{y})$, we can compute all $E_m(\boldsymbol{z})$ in the following pass.

Algorithm 3
1. In parallel for all $m$, compute and store $\bar{P}_m(\boldsymbol{y})$ and $\bar{\bar{E}}_m(\boldsymbol{z})$.
2. Sequentially for each $m$, compute and store the $P_m(\boldsymbol{y})$ using the previously computed $\bar{P}_m(\boldsymbol{y})$.
3. Sequentially for each $m$, compute and store $E_m(\boldsymbol{z})$ using the previously computed $P_{m-1}(\boldsymbol{y})$ and $\bar{\bar{E}}_m(\boldsymbol{z})$.
4. In parallel for all $m$, compute each causal output block $B_m(\boldsymbol{y})$ using the previously computed $P_{m-1}(\boldsymbol{y})$. Then compute and store each anticausal output block $B_m(\boldsymbol{z})$ using the previously computed $E_{m+1}(\boldsymbol{z})$.

Algorithm 3 processes rows and columns in separate passes; fusing these two stages results in Algorithm 4.

Algorithm 4
1. In parallel for all $m$ and $n$, compute and store the $\bar{P}_{m,n}(Y)$ and $\bar{\bar{E}}_{m,n}(Z)$.
2. Sequentially for each $m$, but in parallel for each $n$, compute and store the $P_{m,n}(Y)$ using the previously computed $\bar{P}_{m,n}(Y)$.
3. Sequentially for each $m$, but in parallel for each $n$, compute and store the $E_{m,n}(Z)$ using the previously computed $P_{m-1,n}(Y)$ and $\bar{\bar{E}}_{m,n}(Z)$.
4. In parallel for all $m$ and $n$, compute $B_{m,n}(Y)$ using the previously computed $P_{m-1,n}(Y)$. Then compute and store the $B_{m,n}(Z)$ using the previously computed $E_{m+1,n}(Z)$. Finally, compute and store both $\bar{P}^{\tau}_{m,n}(U)$ and $\bar{\bar{E}}^{\tau}_{m,n}(V)$.
5. Sequentially for each $n$, but in parallel for each $m$, compute and store the $P^{\tau}_{m,n}(U)$ from $\bar{P}^{\tau}_{m,n}(U)$.
6. Sequentially for each $n$, but in parallel for each $m$, compute and store each $E^{\tau}_{m,n}(V)$ using the previously computed $P^{\tau}_{m,n-1}(U)$ and $\bar{\bar{E}}^{\tau}_{m,n}(V)$.
7. In parallel for all $m$ and $n$, compute $B_{m,n}(U)$ using the previously computed $P^{\tau}_{m,n-1}(U)$ and $B_{m,n}(Z)$. Then compute and store the $B_{m,n}(V)$ using the previously computed $E^{\tau}_{m,n+1}(V)$.

[Figure: input to output stages; both the causal and anticausal borders are fixed in-flight]
Algorithm 4 adds causal-anticausal overlapping:
◦ eliminates reading and writing causal results, both in column and in row processing;
◦ modest increase in computation.

Row-column causal-anticausal overlapping (Algorithm 5)
There is still one source of inefficiency in Algorithm 4: we wait until the complete block $B_{m,n}(Z)$ is available in stage 4 before computing the incomplete $\bar{P}^{\tau}_{m,n}(U)$ and twice-incomplete $\bar{\bar{E}}^{\tau}_{m,n}(V)$. We can instead overlap the row and column computations and work with thrice-incomplete transposed prologues of $U$ and four-times-incomplete transposed epilogues of $V$, obtained directly during stage 1 (the formulas for completing these quantities are given in the paper).

Algorithm 5
1. In parallel for all $m$ and $n$, compute and store each $\bar{P}_{m,n}(Y)$, $\bar{\bar{E}}_{m,n}(Z)$, the thrice-incomplete $P^{\tau}_{m,n}(U)$, and the four-times-incomplete $E^{\tau}_{m,n}(V)$.
2. In parallel for all $n$, sequentially for each $m$, compute and store the $P_{m,n}(Y)$ using the previously computed $P_{m-1,n}(Y)$.
3. In parallel for all $n$, sequentially for each $m$, compute and store $E_{m,n}(Z)$ using the previously computed $P_{m-1,n}(Y)$ and $E_{m+1,n}(Z)$.
4. In parallel for all $m$, sequentially for each $n$, compute and store $P^{\tau}_{m,n}(U)$ using the previously computed thrice-incomplete $P^{\tau}_{m,n}(U)$, $P_{m-1,n}(Y)$, and $E_{m+1,n}(Z)$.
5. In parallel for all $m$, sequentially for each $n$, compute and store $E^{\tau}_{m,n}(V)$ using the previously computed four-times-incomplete $E^{\tau}_{m,n}(V)$, $P^{\tau}_{m,n-1}(U)$, $P_{m-1,n}(Y)$, and $E_{m+1,n}(Z)$.
6. In parallel for all $m$ and $n$, successively compute $B_{m,n}(Y)$, $B_{m,n}(Z)$, $B_{m,n}(U)$, and $B_{m,n}(V)$ using the previously computed $P_{m-1,n}(Y)$, $E_{m+1,n}(Z)$, $P^{\tau}_{m,n-1}(U)$, and $E^{\tau}_{m,n+1}(V)$. Store $B_{m,n}(V)$.

[Figure: input to output stages; all borders are fixed in-flight]
Algorithm 5 adds row-column overlapping:
◦ eliminates reading and writing column results;
◦ modest increase in computation.

Walkthrough
◦ Start from the input and the global borders; load the blocks into shared memory.
◦ Compute and store the incomplete borders for every block, so that all borders reside in global memory.
◦ Fix the incomplete, twice-incomplete, thrice-incomplete, and four-times-incomplete borders.
◦ Load the blocks into shared memory again; finish the causal columns, anticausal columns, causal rows, and anticausal rows.
◦ Store the results to global memory. Done!

Overlapped summed-area tables (Algorithm SAT)
A summed-area table is obtained using prefix sums over columns and rows. The prefix-sum filter $S$ is a special case of a first-order causal recursive filter (with feedback coefficient $a_1 = -1$). We can directly apply overlapping to optimize the computation of summed-area tables. In blocked form, the problem is to obtain the output $V$ from the input $X$, where the blocks satisfy
$$B_{m,n}(Y) = S(P_{m-1,n}(Y), B_{m,n}(X)), \qquad B_{m,n}(V) = S^{\tau}(P^{\tau}_{m,n-1}(V), B_{m,n}(Y)).$$
Using the strategy developed for causal-anticausal overlapping, computing $S$ and $S^{\tau}$ with overlapping becomes easy. In the first stage we compute the incomplete output blocks $\bar{B}_{m,n}(Y)$ and $\bar{\bar{B}}_{m,n}(V)$ directly from the input:
$$\bar{B}_{m,n}(Y) = S(\boldsymbol{0}, B_{m,n}(X)), \qquad \bar{\bar{B}}_{m,n}(V) = S^{\tau}(\boldsymbol{0}, \bar{B}_{m,n}(Y)).$$
We store only the incomplete prologues $\bar{P}_{m,n}(Y)$ and $\bar{\bar{P}}^{\tau}_{m,n}(V)$, and then complete them using
$$P_{m,n}(Y) = P_{m-1,n}(Y) + \bar{P}_{m,n}(Y),$$
$$P^{\tau}_{m,n}(V) = P^{\tau}_{m,n-1}(V) + s\!\left(P_{m-1,n}(Y)\right) + \bar{\bar{P}}^{\tau}_{m,n}(V),$$
where the scalar $s(P_{m-1,n}(Y))$ denotes the sum of all entries in the vector $P_{m-1,n}(Y)$.

Algorithm SAT
1. In parallel for all $m$ and $n$, compute and store the $\bar{P}_{m,n}(Y)$ and $\bar{\bar{P}}^{\tau}_{m,n}(V)$.
2. Sequentially for each $m$, but in parallel for each $n$, compute and store the $P_{m,n}(Y)$ using the previously computed $\bar{P}_{m,n}(Y)$. Compute and store $s(P_{m,n}(Y))$.
3. Sequentially for each $n$, but in parallel for each $m$, compute and store the $P^{\tau}_{m,n}(V)$ using the previously computed $P_{m-1,n}(Y)$, $\bar{\bar{P}}^{\tau}_{m,n}(V)$, and $s(P_{m,n}(Y))$.
4. In parallel for all $m$ and $n$, compute $B_{m,n}(Y)$, then compute and store $B_{m,n}(V)$ using the previously computed $P_{m,n}(Y)$ and $P^{\tau}_{m,n}(V)$.

Stages:
◦ S.1 reads the input, then computes and stores the incomplete prologues $\bar{P}_{m,n}(Y)$ (red) and $\bar{\bar{P}}^{\tau}_{m,n}(V)$ (blue).
◦ S.2 completes the prologues $P_{m,n}(Y)$ (red) and computes the scalars $s(P_{m-1,n}(Y))$ (yellow).
◦ S.3 completes the prologues $P^{\tau}_{m,n}(V)$.
◦ S.4 reads the input and the completed prologues, then computes and stores the final summed-area table.
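A sequential NumPy sketch of stages S.1–S.4 described above (the block size and all helper names are illustrative; on the GPU each stage is a separate kernel and the per-block work in S.1 and S.4 runs in parallel). The final check compares the blocked result against a direct cumulative-sum SAT.

```python
import numpy as np

def overlapped_sat(X, b=32):
    h, w = X.shape
    M, N = h // b, w // b                                # number of b x b blocks
    Xb = X.reshape(M, b, N, b).transpose(0, 2, 1, 3)     # Xb[m, n] is block (m, n)

    # S.1  per block, from the input alone: incomplete prologues
    Ybar = Xb.cumsum(axis=2)                   # within-block column prefix sums
    Pbar = Ybar[:, :, -1, :]                   # incomplete P_{m,n}(Y): last row of each block
    PTbar = Ybar.cumsum(axis=3)[:, :, :, -1]   # twice-incomplete P^t_{m,n}(V): last column

    # S.2  complete P_{m,n}(Y) down the block columns, plus the scalars s(P_{m,n}(Y))
    P = Pbar.cumsum(axis=0)                    # P_{m,n}(Y) = P_{m-1,n}(Y) + incomplete part
    s = P.sum(axis=2)                          # s(P_{m,n}(Y)), one scalar per block

    # S.3  complete P^t_{m,n}(V) along the block rows
    PT = np.zeros_like(PTbar)
    for n in range(N):
        left = PT[:, n - 1] if n > 0 else 0.0  # P^t_{m,n-1}(V)
        carry = np.zeros((M, 1))
        carry[1:, 0] = s[:-1, n]               # s(P_{m-1,n}(Y)), added to every entry
        PT[:, n] = left + carry + PTbar[:, n]

    # S.4  read the input and completed prologues, emit the final SAT block by block
    out = np.empty_like(Xb)
    for m in range(M):
        for n in range(N):
            blockY = Xb[m, n].cumsum(axis=0)
            if m > 0:
                blockY += P[m - 1, n][None, :]         # add the column-prologue row
            blockV = blockY.cumsum(axis=1)
            if n > 0:
                blockV += PT[m, n - 1][:, None]        # add the row-prologue column
            out[m, n] = blockV
    return out.transpose(0, 2, 1, 3).reshape(h, w)

X = np.random.rand(128, 128)
assert np.allclose(overlapped_sat(X), X.cumsum(axis=0).cumsum(axis=1))
```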
Results: first-order filter benchmarks
• Alg. RT is the baseline implementation (Ruijters et al. 2010, "GPU prefilter […]").
• Alg. 2 adds block parallelism and tricks from GPU parallel scan algorithms (Sung et al. 1986, "Efficient […] recursive […]"; Blelloch 1990, "Prefix sums […]").
• Alg. 4 adds causal-anticausal overlapping: eliminates $4hw$ of IO, with a modest increase in computation.
• Alg. 5 adds row-column overlapping: eliminates an additional $2hw$ of IO, with a modest increase in computation.
[Figure: cubic B-spline interpolation throughput (GiP/s) versus input size (64² to 4096² pixels) on a GeForce GTX 480 for Alg. RT, Alg. 2, Alg. 4, and Alg. 5]
[Table: step complexity, maximum number of threads, and memory bandwidth for each algorithm]

Results: summed-area table benchmarks
• First-order filter, unit coefficient, no anticausal component.
• Harris et al. 2008, GPU Gems 3, "Parallel prefix-scan […]": multi-scan + transpose + multi-scan, implemented with CUDPP.
• Hensley 2010, Gamefest, "High-quality depth of field": multi-wave method.
• Overlapped SAT: row-column overlapping.
• Our improvements: specialized row and column kernels, saving only the incomplete borders, and fusing the row and column stages.
[Figure: summed-area table throughput (GiP/s) versus input size (64² to 4096² pixels) on a GeForce GTX 480 for Overlapped SAT, Improved Hensley [2010], Hensley [2010], and Harris et al. [2008]]

Conclusion
The paper describes an efficient algorithmic framework that reduces memory bandwidth over a sequence of recursive filters. It splits the input into blocks that are processed in parallel on the GPU, and it overlaps the causal, anticausal, row, and column filters to reduce IO to global memory, which leads to substantial performance gains. Drawbacks: the framework is difficult to understand theoretically, and the implementation is complex.

Questions?

Summary of results
◦ Alg. RT (0.5 GiP/s): baseline
◦ Alg. 2 (3 GiP/s): + block parallelism
◦ Alg. 4 (5 GiP/s): + causal-anticausal overlapping
◦ Alg. 5 (6 GiP/s): + row-column overlapping