pptx 16 - Advanced Graphics

Jeremiah van Oosten
Reinier van Oeveren




Introduction
Related Works
◦ Prefix Sums and Scans
◦ Recursive Filtering
◦ Summed-Area Tables
Problem Definition
Parallelization Strategies
◦ Baseline (Algorithm RT)
◦ Block Notation
◦ Inter-block Parallelism
◦ Kernel Fusion (Algorithm 2)
Overlapping
◦ Causal-Anticausal Overlapping (Algorithms 3 & 4)
◦ Row-Column Causal-Anticausal Overlapping (Algorithm 5)
Summed-Area Tables
◦ Overlapped Summed-Area Tables (Algorithm SAT)
Results
Conclusion


Linear filtering is commonly used to blur, sharpen, or down-sample images.
A direct implementation evaluating a filter of support $d$ on an $h \times w$ image has a cost of $O(hwd)$.


The cost of the image filter can be reduced using a recursive filter, in which case previous results can be used to compute the current value:
$y_i = x_i - \frac{1}{2}\,y_{i-1}$
The cost is reduced to $O(hwr)$, where $r$ is the number of recursive feedbacks.

At each step, the filter produces an output element by a linear combination of the input element and previously computed output elements:
$\mathbf{y} = F(\mathbf{p}, \mathbf{x}), \qquad y_i = x_i - \sum_{k=1}^{r} a_k\,y_{i-k}$
at a cost of $O(hwr)$.
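As a sequential reference for this definition, here is a minimal sketch of an order-$r$ causal filter; the function name and container types are illustrative, not from the paper.

```cuda
// Sequential reference for a causal recursive filter of order r:
//   y[i] = x[i] - sum_{k=1..r} a[k-1] * y[i-k]
// The prologue p supplies the r outputs "before" the input (p[r-1] is y[-1]).
#include <vector>

std::vector<float> causalFilter(const std::vector<float>& p,  // prologue, size r
                                const std::vector<float>& x,  // input, any size h
                                const std::vector<float>& a)  // feedback coefficients
{
    const int r = static_cast<int>(a.size());
    std::vector<float> y(x.size());
    for (int i = 0; i < static_cast<int>(x.size()); ++i) {
        float v = x[i];
        for (int k = 1; k <= r; ++k) {
            // y[i-k] comes from the prologue while i-k < 0
            float prev = (i - k >= 0) ? y[i - k] : p[r + (i - k)];
            v -= a[k - 1] * prev;
        }
        y[i] = v;
    }
    return y;
}
```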

Applications of recursive filters:
◦ Low-pass filtering like Gaussian kernels
◦ Inverse convolution ($X = V * F$, $V = X * F^{-1}$)
◦ Summed-area tables
[Figure: input image and its blurred result after recursive filtering]

Recursive filters can be causal or anticausal (or non-causal).
Causal filters operate on previously computed values:
$y_i = x_i - \sum_{k=1}^{r} a_k\,y_{i-k}$
Anticausal filters operate on “future” values:
$\mathbf{z} = R(\mathbf{y}, \mathbf{e}), \qquad z_i = y_i - \sum_{k=1}^{r} a'_k\,z_{i+k}$

It is often required to perform a sequence of recursive image filters; a sequential sketch follows the list below.
[Diagram: input $X$ with prologues $P$, $P'$ and epilogues $E$, $E'$; intermediates $Y$, $Z$, $U$; output $V$]
• Independent columns
  ◦ Causal: $Y = F(P, X)$
  ◦ Anticausal: $Z = R(Y, E)$
• Independent rows
  ◦ Causal: $U = F^\tau(P', Z)$
  ◦ Anticausal: $V = R^\tau(U, E')$
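A minimal host-side sketch of the four passes, shown first-order for brevity; the coefficient `a` and the zero boundaries standing in for $P$, $E$, $P'$, $E'$ are assumptions for illustration.

```cuda
// Host-side sketch of the full sequence X -> Y -> Z -> U -> V for a
// first-order filter pair with feedback coefficient `a`. The image is
// h x w, row-major, and is filtered in place.
#include <vector>

void filterSequence(std::vector<float>& img, int h, int w, float a)
{
    // Y = F(P, X): causal, down each column
    for (int x = 0; x < w; ++x)
        for (int y = 1; y < h; ++y)
            img[y * w + x] -= a * img[(y - 1) * w + x];
    // Z = R(Y, E): anticausal, up each column
    for (int x = 0; x < w; ++x)
        for (int y = h - 2; y >= 0; --y)
            img[y * w + x] -= a * img[(y + 1) * w + x];
    // U = F^tau(P', Z): causal, left to right along each row
    for (int y = 0; y < h; ++y)
        for (int x = 1; x < w; ++x)
            img[y * w + x] -= a * img[y * w + x - 1];
    // V = R^tau(U, E'): anticausal, right to left along each row
    for (int y = 0; y < h; ++y)
        for (int x = w - 2; x >= 0; --x)
            img[y * w + x] -= a * img[y * w + x + 1];
}
```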




The naïve approach to solving the sequence of recursive filters does not sufficiently utilize the processing cores of the GPU.
The latest GPU from NVIDIA has 2,688 shader cores; processing even large images (2048×2048) will not make full use of all available cores.
Underutilization of the GPU cores does not allow for latency hiding.
We need a way to make better use of the GPU without increasing I/O.

In the paper “GPU-Efficient Recursive Filtering and Summed-Area Tables”, Diego Nehab et al. introduce a new algorithmic framework that reduces memory bandwidth by overlapping computation over the full sequence of recursive filters.

Partition the image into 2D blocks of size $b \times b$.

A prefix sum, $y_i = x_i + y_{i-1}$, is the simple case of a first-order recursive filter.
A scan generalizes the recurrence using an arbitrary binary associative operator.
Parallel prefix sums and scans are important building blocks for numerous algorithms [Iverson 1962; Stone 1971; Blelloch 1989; Sengupta et al. 2007].
An optimized implementation comes with the CUDPP library [2011].
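As a concrete reference, here is a minimal single-block inclusive scan (the Kogge-Stone pattern) in shared memory. This is a simplified sketch, not CUDPP's implementation, which adds multi-block fix-up passes and further optimizations.

```cuda
// Single-block inclusive prefix sum (Kogge-Stone) in shared memory.
// Handles up to blockDim.x elements; larger inputs need a multi-block
// scheme with an inter-block fix-up pass.
__global__ void inclusiveScanBlock(const float* in, float* out, int n)
{
    extern __shared__ float s[];
    int i = threadIdx.x;
    s[i] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    for (int offset = 1; offset < blockDim.x; offset *= 2) {
        // Read before the barrier, accumulate after, so no thread
        // overwrites a value another thread still needs.
        float v = (i >= offset) ? s[i - offset] : 0.0f;
        __syncthreads();
        s[i] += v;
        __syncthreads();
    }
    if (i < n) out[i] = s[i];
}
// Launch example (one block of 256 threads, matching shared memory):
//   inclusiveScanBlock<<<1, 256, 256 * sizeof(float)>>>(d_in, d_out, n);
```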

A generalization of the prefix sum using a weighted combination of prior outputs:
$y_i = x_i - \sum_{k=1}^{r} a_k\,y_{i-k}$
This can be implemented as a scan operation with redefined basic operators.
Ruijters and Thévenaz [2010] exploit parallelism across the rows and columns of the input.

Sung and Mitra [1986] use block parallelism and split the computation into two parts:
◦ one based only on the block data, assuming zero initial conditions;
◦ one based only on the initial conditions, assuming zero block data.
$Y = F(\mathbf{0}, X) + F(P, \mathbf{0})$
This identity is verified numerically in the sketch below.
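The split is easy to check for a first-order filter; the coefficient, prologue, and data below are illustrative values, not from the paper.

```cuda
// Numeric check of superposition for a first-order causal filter:
//   F(p, x) == F(0, x) + F(p, 0), with y[i] = x[i] - a*y[i-1], y[-1] = p.
#include <cstdio>
#include <vector>

std::vector<float> F(float p, const std::vector<float>& x, float a)
{
    std::vector<float> y(x.size());
    float prev = p;
    for (size_t i = 0; i < x.size(); ++i) {
        y[i] = x[i] - a * prev;
        prev = y[i];
    }
    return y;
}

int main()
{
    std::vector<float> x = {1, 2, 3, 4}, zero(4, 0.0f);
    float p = 5.0f, a = 0.5f;
    std::vector<float> full = F(p, x, a);     // F(p, x)
    std::vector<float> data = F(0.0f, x, a);  // F(0, x): block data only
    std::vector<float> init = F(p, zero, a);  // F(p, 0): initial conditions only
    for (size_t i = 0; i < x.size(); ++i)
        std::printf("%g == %g\n", full[i], data[i] + init[i]);
}
```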
Summed-area tables enable averaging rectangular regions of pixels with a constant number of reads:
[Figure: a width × height rectangle with SAT corner samples UL, UR, LL, LR]
$avg = \dfrac{LR - LL - UR + UL}{width \cdot height}$
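A minimal sketch of the four-corner lookup, assuming an inclusive, row-major SAT; the border guards for rectangles touching the image edge are an implementation detail not shown on the slide.

```cuda
// Average of the rectangle spanning rows y0..y1 and cols x0..x1, using
// four reads from an inclusive summed-area table:
//   sat[y * w + x] = sum of all pixels with row <= y and col <= x.
__host__ __device__
float satAverage(const float* sat, int w, int x0, int y0, int x1, int y1)
{
    float LR = sat[y1 * w + x1];
    float UR = (y0 > 0) ? sat[(y0 - 1) * w + x1] : 0.0f;
    float LL = (x0 > 0) ? sat[y1 * w + (x0 - 1)] : 0.0f;
    float UL = (y0 > 0 && x0 > 0) ? sat[(y0 - 1) * w + (x0 - 1)] : 0.0f;
    float area = float(x1 - x0 + 1) * float(y1 - y0 + 1);
    return (LR - LL - UR + UL) / area;   // the slide's formula
}
```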

The paper “Fast Summed-Area Table Generation…” by Justin Hensley et al. (2005) describes a method called recursive doubling, which requires multiple passes over the input image (a 256×256 image requires 16 passes to compute).
[Figure: ping-ponging between image A and image B across passes]
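The recursive-doubling idea in 1D, as a host-side sketch: pass $d$ adds the element a distance $d$ away, so $\log_2 n$ passes suffice per dimension, hence 8 + 8 = 16 passes for a 256×256 image. The function name is illustrative.

```cuda
// One recursive-doubling pass computes out[i] = in[i] + in[i - d] where it
// exists. After passes d = 1, 2, 4, ..., every element holds an inclusive
// prefix sum; log2(n) passes per dimension.
#include <vector>

void prefixSumDoubling(std::vector<float>& v)
{
    std::vector<float> tmp(v.size());
    for (size_t d = 1; d < v.size(); d *= 2) {  // ping-pong between buffers
        for (size_t i = 0; i < v.size(); ++i)
            tmp[i] = v[i] + (i >= d ? v[i - d] : 0.0f);
        v.swap(tmp);
    }
}
```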

In 2010, Justin Hensley extended his 2005 implementation to compute shaders, taking more samples per pass and storing the intermediate results in shared memory. A 256×256 image then requires only 4 passes when reading 16 samples per pass.


Causal recursive filters of order $r$ are characterized by a set of $r$ feedback coefficients $a_k$ in the following manner. Given a prologue vector $\mathbf{p} \in \mathbb{R}^r$ and an input vector $\mathbf{x} \in \mathbb{R}^h$ of any size $h$, the filter $F$ produces the output
$\mathbf{y} = F(\mathbf{p}, \mathbf{x})$
such that $\mathbf{y} \in \mathbb{R}^h$ ($\mathbf{y}$ has the same size as the input $\mathbf{x}$).

Causal recursive filters depend on a prologue vector $\mathbf{p} \in \mathbb{R}^r$:
$y_i = x_i - \sum_{k=1}^{r} a_k\,y_{i-k}$

Similarly for the anticausal filter: given an input vector $\mathbf{y} \in \mathbb{R}^h$ and an epilogue vector $\mathbf{e} \in \mathbb{R}^r$, the output vector $\mathbf{z} = R(\mathbf{y}, \mathbf{e})$ is defined by:
$z_i = y_i - \sum_{k=1}^{r} a'_k\,z_{i+k}$

For row processing, we define an extended causal filter $F^\tau$ and anticausal filter $R^\tau$:
$\mathbf{u} = F^\tau(\mathbf{p}', \mathbf{z})$
$\mathbf{v} = R^\tau(\mathbf{u}, \mathbf{e}')$

With these definitions, we are able to formulate the problem of applying the full sequence of four recursive filters (down, up, right, left):
[Diagram: input $X$ with prologues $P$, $P'$ and epilogues $E$, $E'$; intermediates $Y$, $Z$, $U$; output $V$]
• Independent columns
  ◦ Causal: $Y = F(P, X)$
  ◦ Anticausal: $Z = R(Y, E)$
• Independent rows
  ◦ Causal: $U = F^\tau(P', Z)$
  ◦ Anticausal: $V = R^\tau(U, E')$

The goal is to implement this algorithm on the GPU, making full use of all available resources:
◦ maximize occupancy by splitting the problem up to make use of all cores;
◦ reduce I/O to global memory.


Must break the dependency chain in order to increase task parallelism.
Primary design goal: increase the amount of parallelism without increasing memory I/O.
◦ Baseline algorithm ‘RT’
◦ Block notation
◦ Inter-block parallelism
◦ Kernel fusion

Algorithm RT: independent row and column processing.



Step RT1: In parallel for each column in $X$, apply $F$ sequentially and store $Y$.
Step RT2: In parallel for each column in $Y$, apply $R$ sequentially and store $Z$.
Step RT3: In parallel for each row in $Z$, apply $F^\tau$ sequentially and store $U$.
Step RT4: In parallel for each row in $U$, apply $R^\tau$ sequentially and store $V$.
[Diagram: input → column-processing stages → row-processing stages → output]
Completion takes $O\!\left(\frac{hw}{cp}\,4r\right)$ steps. Bandwidth usage in total is $8hw$, where
p = number of streaming multiprocessors
c = number of cores (per processor)
w = width of the input image
h = height of the input image
r = order of the applied filter
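A minimal CUDA sketch of step RT1, assuming a first-order filter with an illustrative coefficient `a` and a zero prologue at the image border (both assumptions, not the paper's code); steps RT2 through RT4 are analogous. Note how every intermediate value is written to global memory, which is what the $8hw$ bandwidth figure counts.

```cuda
// Step RT1 sketch: one thread per column applies the causal filter F
// sequentially down its column. Launch with, e.g.:
//   rt1CausalColumns<<<(w + 255) / 256, 256>>>(dX, dY, h, w, 0.5f);
__global__ void rt1CausalColumns(const float* X, float* Y,
                                 int h, int w, float a)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= w) return;
    float prev = 0.0f;                  // zero prologue at the image border
    for (int row = 0; row < h; ++row) {
        float y = X[row * w + col] - a * prev;
        Y[row * w + col] = y;           // every intermediate hits global memory
        prev = y;
    }
}
```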


Partition the input image into $b \times b$ blocks
◦ $b$ = number of threads in a warp (= 32)
Notation:
◦ $B_{m,n}(X)$ = $b \times b$ block in matrix $X$ with index $m, n$
◦ $P_{m-1,n}(X)$ = $r \times b$ column-prologue submatrix
◦ $E_{m+1,n}(X)$ = $r \times b$ column-epilogue submatrix
For rows we have (similar) transposed operators: $P^\tau_{m,n-1}(X)$ and $E^\tau_{m,n+1}(X)$.

Tail and head operators select prologue- and epilogue-shaped submatrices from $b \times b$ blocks:
$P_{m,n}(X) = T(B_{m,n}(X))$, $\quad P^\tau_{m,n}(X) = T^\tau(B_{m,n}(X))$
$E_{m,n}(X) = H(B_{m,n}(X))$, $\quad E^\tau_{m,n}(X) = H^\tau(B_{m,n}(X))$

Result: blocked version of the problem definition
$Y = F(P, X)$, $\; Z = R(Y, E)$, $\; U = F^\tau(P', Z)$, $\; V = R^\tau(U, E')$
$B_{m,n}(Y) = F(P_{m-1,n}(Y),\, B_{m,n}(X))$
$B_{m,n}(Z) = R(B_{m,n}(Y),\, E_{m+1,n}(Z))$
$B_{m,n}(U) = F^\tau(P^\tau_{m,n-1}(U),\, B_{m,n}(Z))$
$B_{m,n}(V) = R^\tau(B_{m,n}(U),\, E^\tau_{m,n+1}(V))$
Superposition (based on linearity): the effects of the input and of the prologue/epilogue on the output can be computed independently:
$F(\mathbf{p}, \mathbf{x}) = F(\mathbf{0}, \mathbf{x}) + F(\mathbf{p}, \mathbf{0})$
$R(\mathbf{y}, \mathbf{e}) = R(\mathbf{y}, \mathbf{0}) + R(\mathbf{0}, \mathbf{e})$
These can be expressed as matrix products. For any $r$, $I_r$ is the $r \times r$ identity matrix:
$F(\mathbf{p}, \mathbf{0}) = F(I_r, \mathbf{0})\,\mathbf{p} = A_{FP}\,\mathbf{p}$
$R(\mathbf{0}, \mathbf{e}) = R(\mathbf{0}, I_r)\,\mathbf{e} = A_{RE}\,\mathbf{e}$
$F(\mathbf{0}, \mathbf{x}) = F(\mathbf{0}, I_b)\,\mathbf{x} = A_{FB}\,\mathbf{x}$
$R(\mathbf{y}, \mathbf{0}) = R(I_b, \mathbf{0})\,\mathbf{y} = A_{RB}\,\mathbf{y}$
with $T(A_{FP}) = A_F^b$ and $H(A_{RE}) = A_R^b$.
These precomputed matrices depend only on the feedback coefficients of filters $F$ and $R$: $A_{FB}$ and $A_{RB}$ are $b \times b$, while $A_{FP}$ and $A_{RE}$ are $b \times r$. Details in the paper.
Perform the block computation independently:
$B_m(\mathbf{y}) = F(P_{m-1}(\mathbf{y}),\, B_m(\mathbf{x}))$
where the prologue $P_{m-1}(\mathbf{y})$ is the tail of the previous output block, $T(B_{m-1}(\mathbf{y}))$.
By superposition:
$B_m(\mathbf{y}) = F(\mathbf{0}, B_m(\mathbf{x})) + F(P_{m-1}(\mathbf{y}), \mathbf{0})$
The first term is the incomplete causal output $\hat{B}_m(\mathbf{y})$ (hats mark quantities computed with zero boundary conditions); the second term equals $F(I_r, \mathbf{0})\,P_{m-1}(\mathbf{y}) = A_{FP}\,P_{m-1}(\mathbf{y})$.
Taking tails on both sides, the incomplete prologues can be completed sequentially:
$P_m(\mathbf{y}) = \hat{P}_m(\mathbf{y}) + A_F^b\,P_{m-1}(\mathbf{y})$   (1)
Recall: $B_m(\mathbf{y}) = F(P_{m-1}(\mathbf{y}),\, B_m(\mathbf{x}))$   (2)
Algorithm 1
1.1 In parallel for all $m$, compute and store each incomplete $\hat{P}_m(\mathbf{y})$.
1.2 Sequentially for each $m$, compute and store the $P_m(\mathbf{y})$ according to (1), using the previously computed $\hat{P}_m(\mathbf{y})$.
1.3 In parallel for all $m$, compute and store each output block $B_m(\mathbf{y})$ using (2) and the previously computed $P_{m-1}(\mathbf{y})$.
Processing all rows and columns with causal and anticausal filter pairs requires 4 successive applications of Algorithm 1.
There are $\frac{hw}{b}$ independent tasks, which hides memory access latency.
However, the memory bandwidth usage is now $\left(12 + \frac{16r}{b}\right)hw$, significantly more than algorithm RT ($8hw$). This can be solved.
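To make the three stages concrete, here is a minimal host-side sketch of Algorithm 1 for a first-order filter in 1D. The in-place processing, coefficient `a`, and the assumption that the input length is divisible by $b$ are illustrative choices; for $r = 1$ the precomputed matrix $A_F^b$ reduces to the scalar $(-a)^b$.

```cuda
// Algorithm 1 sketch in 1D for y[i] = x[i] - a*y[i-1], block size b.
#include <cmath>
#include <vector>

void algorithm1(std::vector<float>& x, int b, float a)
{
    int nBlocks = static_cast<int>(x.size()) / b;  // assume size % b == 0
    float AFb = std::pow(-a, b);                   // A_F^b for r = 1
    // 1.1 In parallel for all m: incomplete tail of F(0, B_m(x))
    std::vector<float> P(nBlocks);
    for (int m = 0; m < nBlocks; ++m) {
        float prev = 0.0f;
        for (int i = 0; i < b; ++i) prev = x[m * b + i] - a * prev;
        P[m] = prev;                               // incomplete P_m(y)
    }
    // 1.2 Sequentially for each m: complete the tails, eq. (1)
    for (int m = 1; m < nBlocks; ++m) P[m] += AFb * P[m - 1];
    // 1.3 In parallel for all m: recompute each block with the
    //     now-correct prologue, eq. (2)
    for (int m = 0; m < nBlocks; ++m) {
        float prev = (m > 0) ? P[m - 1] : 0.0f;
        for (int i = 0; i < b; ++i) {
            x[m * b + i] -= a * prev;
            prev = x[m * b + i];
        }
    }
}
```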



Original idea: Kirk & Hwu [2010].
Use the output of one kernel as input for the next without going through global memory.
A fused kernel contains the code from both kernels but keeps intermediate results in shared memory.
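A sketch of the fusion idea for one causal/anticausal column pair: both passes live in one kernel and the intermediate $Y$ never reaches global memory. This is illustrative only; it assumes a first-order filter with coefficient `a` and zero cross-block prologues/epilogues (the real Algorithm 2 completes them as described above).

```cuda
// Fused causal + anticausal column passes over one b x b tile.
// One thread per column; the intermediate stays in shared memory.
#define B 32

__global__ void fusedCausalAnticausal(const float* X, float* Z,
                                      int h, int w, float a)
{
    __shared__ float tile[B][B + 1];   // +1 pad avoids bank conflicts
    int col  = blockIdx.x * B + threadIdx.x;
    int row0 = blockIdx.y * B;
    if (col >= w) return;
    int rows = min(B, h - row0);
    // Causal pass down the column; result kept on-chip
    float prev = 0.0f;
    for (int i = 0; i < rows; ++i) {
        prev = X[(row0 + i) * w + col] - a * prev;
        tile[i][threadIdx.x] = prev;
    }
    // Anticausal pass back up, consuming the shared-memory intermediate
    float next = 0.0f;
    for (int i = rows - 1; i >= 0; --i) {
        next = tile[i][threadIdx.x] - a * next;
        Z[(row0 + i) * w + col] = next;
    }
}
// Launch example: dim3 grid((w + B - 1) / B, (h + B - 1) / B);
//                 fusedCausalAnticausal<<<grid, B>>>(dX, dZ, h, w, 0.5f);
```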




Use Algorithm 1 for all filters, then fuse:
◦ Fuse the last stage of $F$ with the first stage of $R$.
◦ Fuse the last stage of $R$ with the first stage of $F^\tau$.
◦ Fuse the last stage of $F^\tau$ with the first stage of $R^\tau$.
We aimed for bandwidth reduction. Did it work?

Algorithm 1: $\left(12 + \frac{16r}{b}\right)hw$
Algorithm 2: $\left(9 + \frac{16r}{b}\right)hw$
Yes, it did!
[Diagram: input → stages → output, with “fix” passes between stages*]
* for the full algorithm in text, please see the paper



Further I/O reduction is still possible by recomputing intermediary results instead of storing them in memory.
More bandwidth reduction: $\left(6 + \frac{22r}{b}\right)hw$ (= good)
Number of steps: $O\!\left(\frac{hw}{cp}\left(14r + \frac{1}{4}b + r^2 + r\right)\right)$ (≈ bad*)
Bandwidth usage is less than Algorithm RT(!) but involves more computation. *But future hardware may tip the balance in favor of more computation.



Overlapping is introduced to reduce I/O to global memory.
It is possible to work with twice-incomplete anticausal epilogues $E_{m,n}(Z)$, computed directly from the incomplete causal output block $B_{m,n}(Y)$.
This is called causal-anticausal overlapping.

Recall that we can express the filter so that the effects of the input and of the prologue or epilogue can be computed independently and later added together:
$F(\mathbf{p}, \mathbf{x}) = F(\mathbf{0}, \mathbf{x}) + F(\mathbf{p}, \mathbf{0})$
$R(\mathbf{y}, \mathbf{e}) = R(\mathbf{y}, \mathbf{0}) + R(\mathbf{0}, \mathbf{e})$

Using the previous properties, we can split the dependency chains of the anticausal epilogues; the resulting expressions can be simplified further (the derivation of the twice-incomplete $\mathbf{z}$ is in the paper).
Each twice-incomplete epilogue $E_m(\mathbf{z})$ depends only on the corresponding input block $B_m(\mathbf{x})$, and therefore they can all be computed in parallel already in the first pass. As a byproduct of that same pass, we can compute and store the incomplete $P_m(\mathbf{y})$ that will be needed to obtain the complete $P_m(\mathbf{y})$. With $P_m(\mathbf{y})$, we can compute all $E_m(\mathbf{z})$ in the following pass.
Algorithm 3:
1. In parallel for all $m$, compute and store the incomplete $P_m(\mathbf{y})$ and twice-incomplete $E_m(\mathbf{z})$.
2. Sequentially for each $m$, compute and store the $P_m(\mathbf{y})$ using the previously computed incomplete $P_m(\mathbf{y})$.
3. Sequentially for each $m$, compute and store the $E_m(\mathbf{z})$ using the previously computed $P_{m-1}(\mathbf{y})$ and twice-incomplete $E_m(\mathbf{z})$.
4. In parallel for all $m$, compute each causal output block $B_m(\mathbf{y})$ using the previously computed $P_{m-1}(\mathbf{y})$. Then compute and store each anticausal output block $B_m(\mathbf{z})$ using the previously computed $E_{m+1}(\mathbf{z})$.

Algorithm 3 computes rows and columns in separate passes. Fusing these two stages results in Algorithm 4:
1. In parallel for all $m$ and $n$, compute and store the incomplete $P_{m,n}(Y)$ and twice-incomplete $E_{m,n}(Z)$.
2. Sequentially for each $m$, but in parallel for each $n$, compute and store the $P_{m,n}(Y)$ using the previously computed incomplete $P_{m,n}(Y)$.
3. Sequentially for each $m$, but in parallel for each $n$, compute and store the $E_{m,n}(Z)$ using the previously computed $P_{m-1,n}(Y)$ and twice-incomplete $E_{m,n}(Z)$.
4. In parallel for all $m$ and $n$, compute $B_{m,n}(Y)$ using the previously computed $P_{m-1,n}(Y)$. Then compute and store the $B_{m,n}(Z)$ using the previously computed $E_{m+1,n}(Z)$. Finally, compute and store both the incomplete $P^\tau_{m,n}(U)$ and twice-incomplete $E^\tau_{m,n}(V)$.
5. Sequentially for each $n$, but in parallel for each $m$, compute and store the $P^\tau_{m,n}(U)$ from the incomplete $P^\tau_{m,n}(U)$.
6. Sequentially for each $n$, but in parallel for each $m$, compute and store each $E^\tau_{m,n}(V)$ using the previously computed $P^\tau_{m,n-1}(U)$ and twice-incomplete $E^\tau_{m,n}(V)$.
7. In parallel for all $m$ and $n$, compute $B_{m,n}(U)$ using the previously computed $P^\tau_{m,n-1}(U)$ and $B_{m,n}(Z)$. Then compute and store the $B_{m,n}(V)$ using the previously computed $E^\tau_{m,n+1}(V)$.
[Diagram: input → output stages, with “fix both” passes between column and row processing]
Algorithm 4 adds causal-anticausal overlapping:
◦ eliminates reading and writing causal results, both in column and in row processing
◦ modest increase in computation


There is still one source of inefficiency in Algorithm 4: we wait until the complete block $B_{m,n}(Z)$ is available in stage 4 before computing the incomplete $P^\tau_{m,n}(U)$ and twice-incomplete $E^\tau_{m,n}(V)$.
We can overlap row and column computations and work with thrice-incomplete transposed prologues and epilogues obtained directly during stage 1 of Algorithm 4.


The formulas for completing the thrice-incomplete transposed prologues of $U$ and the four-times-incomplete transposed epilogues of $V$ are derived in the paper.

Algorithm 5:
1. In parallel for all $m$ and $n$, compute and store each incomplete $P_{m,n}(Y)$, twice-incomplete $E_{m,n}(Z)$, thrice-incomplete $P^\tau_{m,n}(U)$, and four-times-incomplete $E^\tau_{m,n}(V)$.
2. In parallel for all $n$, sequentially for each $m$, compute and store the $P_{m,n}(Y)$ using the previously computed $P_{m-1,n}(Y)$.
3. In parallel for all $n$, sequentially for each $m$, compute and store $E_{m,n}(Z)$ using the previously computed $P_{m-1,n}(Y)$ and $E_{m+1,n}(Z)$.
4. In parallel for all $m$, sequentially for each $n$, compute and store $P^\tau_{m,n}(U)$ using the previously computed thrice-incomplete $P^\tau_{m,n}(U)$, $P_{m-1,n}(Y)$, and $E_{m+1,n}(Z)$.
5. In parallel for all $m$, sequentially for each $n$, compute and store $E^\tau_{m,n}(V)$ using the previously computed four-times-incomplete $E^\tau_{m,n}(V)$, $P^\tau_{m,n-1}(U)$, $P_{m-1,n}(Y)$, and $E_{m+1,n}(Z)$.
6. In parallel for all $m$ and $n$, successively compute $B_{m,n}(Y)$, $B_{m,n}(Z)$, $B_{m,n}(U)$, and $B_{m,n}(V)$ using the previously computed $P_{m-1,n}(Y)$, $E_{m+1,n}(Z)$, $P^\tau_{m,n-1}(U)$, and $E^\tau_{m,n+1}(V)$. Store $B_{m,n}(V)$.
[Diagram: input → output stages, fixing all borders at once]

Algorithm 5 adds row-column overlapping:
◦ eliminates reading and writing column results
◦ modest increase in computation

Walkthrough of the fused implementation:
1. Start from the input and global borders.
2. Load blocks into shared memory; compute & store all incomplete borders.
3. With all borders in global memory, fix the incomplete, twice-incomplete, thrice-incomplete, and four-times-incomplete borders.
4. Load blocks into shared memory again; finish the causal columns, anticausal columns, causal rows, and anticausal rows.
5. Store the results to global memory. Done!



A summed-area table is obtained using prefix sums over columns and rows.
The prefix-sum filter $S$ is a special case: a first-order causal recursive filter with feedback coefficient $a_1 = -1$.
We can directly apply overlapping to optimize the computation of summed-area tables.

In blocked form, the problem is to obtain output $V$ from input $X$, where the blocks satisfy the relations:
$B_{m,n}(Y) = S(P_{m-1,n}(Y),\, B_{m,n}(X))$
$B_{m,n}(V) = S^\tau(P^\tau_{m,n-1}(V),\, B_{m,n}(Y))$
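Unblocked, these relations reduce to a column prefix-sum pass followed by a row prefix-sum pass; a minimal sequential sketch (function name illustrative):

```cuda
// Y = column prefix sums of X (filter S), then V = row prefix sums of Y
// (filter S^tau); the result is the summed-area table of X, in place.
#include <vector>

void summedAreaTable(std::vector<float>& img, int h, int w)
{
    for (int x = 0; x < w; ++x)          // Y = S(P, X): down each column
        for (int y = 1; y < h; ++y)
            img[y * w + x] += img[(y - 1) * w + x];
    for (int y = 0; y < h; ++y)          // V = S^tau(P', Y): along each row
        for (int x = 1; x < w; ++x)
            img[y * w + x] += img[y * w + x - 1];
}
```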


Using the strategy developed for causal-anticausal overlapping, computing $S$ and $S^\tau$ with overlapping becomes easy.
In the first stage, we compute the incomplete output blocks $B_{m,n}(Y)$ and $B_{m,n}(V)$ directly from the input:
$B_{m,n}(Y) = S(\mathbf{0},\, B_{m,n}(X))$
$B_{m,n}(V) = S^\tau(\mathbf{0},\, B_{m,n}(Y))$

We store only the incomplete prologues $\hat{P}_{m,n}(Y)$ and $\hat{P}^\tau_{m,n}(V)$ (hats marking incomplete quantities). We then complete them using:
$P_{m,n}(Y) = P_{m-1,n}(Y) + \hat{P}_{m,n}(Y)$
$P^\tau_{m,n}(V) = P^\tau_{m,n-1}(V) + s(P_{m-1,n}(Y)) + \hat{P}^\tau_{m,n}(V)$
The scalar $s(P_{m-1,n}(Y))$ denotes the sum of all entries in the vector $P_{m-1,n}(Y)$.
Algorithm SAT:
S.1 In parallel for all $m$ and $n$, compute and store the incomplete $\hat{P}_{m,n}(Y)$ and $\hat{P}^\tau_{m,n}(V)$.
S.2 Sequentially for each $m$, but in parallel for each $n$, compute and store the $P_{m,n}(Y)$ using the previously computed $\hat{P}_{m,n}(Y)$. Compute and store $s(P_{m,n}(Y))$.
S.3 Sequentially for each $n$, but in parallel for each $m$, compute and store the $P^\tau_{m,n}(V)$ using the previously computed $P_{m-1,n}(Y)$, $\hat{P}^\tau_{m,n}(V)$, and $s(P_{m,n}(Y))$.
S.4 In parallel for all $m$ and $n$, compute $B_{m,n}(Y)$, then compute and store $B_{m,n}(V)$ using the previously computed $P_{m-1,n}(Y)$ and $P^\tau_{m,n-1}(V)$.

S.1 reads the input, then computes and stores the incomplete prologues $\hat{P}_{m,n}(Y)$ (red) and $\hat{P}^\tau_{m,n}(V)$ (blue).
S.2 completes the prologues $P_{m,n}(Y)$ (red) and computes the scalars $s(P_{m-1,n}(Y))$ (yellow).
S.3 completes the prologues $P^\tau_{m,n}(V)$.
S.4 reads the input and the completed prologues, then computes and stores the final summed-area table.
First-order filter benchmarks
• Alg. RT: the baseline implementation (Ruijters et al. 2010, “GPU prefilter […]”)
• Alg. 2: adds block parallelism & tricks (Sung et al. 1986, “Efficient […] recursive […]”; Blelloch 1990, “Prefix sums […]”; plus tricks from GPU parallel scan algorithms)
• Alg. 4: adds causal-anticausal overlapping; eliminates $4hw$ of I/O with a modest increase in computation
• Alg. 5: adds row-column overlapping; eliminates an additional $2hw$ of I/O with a modest increase in computation
[Chart: Cubic B-Spline Interpolation on a GeForce GTX 480; throughput (GiP/s) vs. input size from 64² to 4096² pixels for Alg. RT, 2, 4, and 5]
[Table: step complexity, maximum number of threads, and memory bandwidth per algorithm; see the paper]
Summed-area table benchmarks
• First-order filter, unit coefficient, no anticausal component
• Harris et al. 2008, GPU Gems 3, “Parallel prefix-scan […]”: multi-scan + transpose + multi-scan, implemented with CUDPP
• Hensley 2010, Gamefest, “High-quality depth of field”: multi-wave method
• Overlapped SAT: row-column overlapping
• Our improvements: specialized row and column kernels; save only incomplete borders; fuse row and column stages
[Chart: Summed-area Table on a GeForce GTX 480; throughput (GiP/s) vs. input size from 64² to 4096² pixels for Overlapped SAT, Improved Hensley [2010], Hensley [2010], and Harris et al. [2008]]



The paper describes an efficient algorithmic framework that reduces memory bandwidth over a sequence of recursive filters.
It splits the input into blocks that are processed in parallel on the GPU, and it overlaps the causal, anticausal, row, and column filters to reduce I/O to global memory, which leads to substantial performance gains.


◦ Difficult to understand theoretically
◦ Complex implementation

Questions?

Alg. RT (0.5 GiP/s): baseline
Alg. 2 (3 GiP/s): + block parallelism
Alg. 4 (5 GiP/s): + causal-anticausal overlapping
Alg. 5 (6 GiP/s): + row-column overlapping