Structuring Parallel Algorithms
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
ECE 498AL, University of Illinois, Urbana-Champaign
Slides from:
• David Kirk/NVIDIA and Wen-mei W. Hwu
• ECE 498AL, University of Illinois, Urbana-Champaign
• Podcasts of their lectures (recommended):
  – http://courses.ece.illinois.edu/ece498/al/Syllabus.html
Key Parallel Programming Steps
1) Find the concurrency in the problem
2) Structure the algorithm to translate concurrency into performance
3) Implement the algorithm in a suitable programming environment
4) Execute and tune the performance of the code on a parallel system
Unfortunately, these steps have not been separated into levels of abstraction that can be dealt with independently.
Algorithm
• A step-by-step procedure that is guaranteed to terminate, such that each step is precisely stated and can be carried out by a computer
  – Definiteness – the notion that each step is precisely stated
  – Effective computability – each step can be carried out by a computer
  – Finiteness – the procedure terminates
• Multiple algorithms can be used to solve the same problem
  – Some require fewer steps
  – Some exhibit more parallelism
  – Some have a larger memory footprint than others
Choosing Algorithm Structure
Start with how the problem is organized, then follow the branch:
• Task-centric
  – Linear → Task Parallelism
  – Recursive → Divide and Conquer
• Data Flow centric
  – Regular → Pipeline
  – Irregular → Event Driven
• Data-centric
  – Linear → Geometric Decomposition
  – Recursive → Recursive Data
A simple implementation
• Assume the array has already been loaded into
  __shared__ float partialSum[]
unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
    __syncthreads();
    if (t % (2 * stride) == 0)
        partialSum[t] += partialSum[t + stride];
}
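
For context, a minimal complete kernel built around this loop; the 256-thread block size, the kernel name, and the blockSums output array are illustrative assumptions, not from the original slide:

__global__ void reduceSum(const float *input, float *blockSums)
{
    __shared__ float partialSum[256];        // assumes blockDim.x <= 256
    unsigned int t = threadIdx.x;

    // The load step the slide assumes has already happened
    partialSum[t] = input[blockIdx.x * blockDim.x + t];

    // Tree reduction from the slide: half of the remaining threads
    // drop out (and the rest diverge) each iteration
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();
        if (t % (2 * stride) == 0)
            partialSum[t] += partialSum[t + stride];
    }

    if (t == 0)
        blockSums[blockIdx.x] = partialSum[0];  // one partial sum per block
}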
Mapping a Divide and Conquer Algorithm
[Figure: reduction tree over array elements 0–15, with iterations on the vertical axis. Iteration 1: threads 0, 2, 4, ... add adjacent pairs (0+1, 2+3, ...). Iteration 2: partial sums combine into ranges 0..3, 4..7, 8..11, .... Iteration 3: into 0..7 and 8..15, and so on, with half of the active threads dropping out each iteration.]
Tiled (Stenciled) Algorithms are Important for Geometric Decomposition
• A framework for memory data sharing and reuse by increasing data access locality
  – Tiled access patterns allow small cache/scratchpad memories to hold on to data for re-use
  – For matrix multiplication, a 16×16 thread block performs 2 × 256 = 512 float loads from device memory for 256 × (2 × 16) = 8,192 mul/add operations
• A convenient framework for organizing threads (tasks)
[Figure: matrix multiplication M × N = P; each block of bsize × bsize threads (indexed by tx, ty) computes one BLOCK_WIDTH × BLOCK_WIDTH subtile Psub of the WIDTH × WIDTH result.]
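
A minimal sketch of the tiled kernel this slide quantifies, assuming square Width × Width matrices with Width a multiple of TILE_WIDTH = 16; the kernel name is illustrative:

#define TILE_WIDTH 16

__global__ void matMulTiled(const float *M, const float *N, float *P, int Width)
{
    // One tile of M and one of N live in scratchpad for the whole block
    __shared__ float Ms[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Ns[TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    float Pvalue = 0.0f;

    for (int m = 0; m < Width / TILE_WIDTH; ++m) {
        // 2 global loads per thread per phase: 2 * 256 = 512 for a 16x16 block
        Ms[threadIdx.y][threadIdx.x] = M[row * Width + m * TILE_WIDTH + threadIdx.x];
        Ns[threadIdx.y][threadIdx.x] = N[(m * TILE_WIDTH + threadIdx.y) * Width + col];
        __syncthreads();                 // tile fully loaded before use

        // 2 * 16 mul/adds per thread per phase: 256 * 32 = 8,192 for the block
        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
        __syncthreads();                 // finish before overwriting the tile
    }
    P[row * Width + col] = Pvalue;
}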
Increased Work per Thread for even more locality
• Each thread computes two elements of Pdsub
  – More work done in each iteration
  – Reduced loads from global memory (Md) to shared memory
  – Reduced instruction overhead
[Figure: the Md × Nd = Pd tiling, with each thread block now covering two TILE_WIDTH × TILE_WIDTH subtiles Pdsub of Pd.]
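
One hedged way to realize this, assuming each thread produces two horizontally adjacent elements of Pd so each Md tile is loaded once but used twice; Width must here be a multiple of 2 × TILE_WIDTH, and the names are illustrative:

#define TILE_WIDTH 16

// Launch with a grid of (Width / (2*TILE_WIDTH)) x (Width / TILE_WIDTH) blocks
__global__ void matMulTwoPerThread(const float *Md, const float *Nd,
                                   float *Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds0[TILE_WIDTH][TILE_WIDTH];  // tile for first Pdsub
    __shared__ float Nds1[TILE_WIDTH][TILE_WIDTH];  // tile for second Pdsub

    int row  = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col0 = blockIdx.x * (2 * TILE_WIDTH) + threadIdx.x;  // two side-by-side
    int col1 = col0 + TILE_WIDTH;                            // output tiles
    float Pvalue0 = 0.0f, Pvalue1 = 0.0f;

    for (int m = 0; m < Width / TILE_WIDTH; ++m) {
        // The Md tile is loaded once and reused for both outputs,
        // halving Md traffic per result element
        Mds[threadIdx.y][threadIdx.x]  = Md[row * Width + m * TILE_WIDTH + threadIdx.x];
        Nds0[threadIdx.y][threadIdx.x] = Nd[(m * TILE_WIDTH + threadIdx.y) * Width + col0];
        Nds1[threadIdx.y][threadIdx.x] = Nd[(m * TILE_WIDTH + threadIdx.y) * Width + col1];
        __syncthreads();

        // One pass over the loop counter serves two results,
        // reducing per-element instruction overhead
        for (int k = 0; k < TILE_WIDTH; ++k) {
            Pvalue0 += Mds[threadIdx.y][k] * Nds0[k][threadIdx.x];
            Pvalue1 += Mds[threadIdx.y][k] * Nds1[k][threadIdx.x];
        }
        __syncthreads();
    }
    Pd[row * Width + col0] = Pvalue0;
    Pd[row * Width + col1] = Pvalue1;
}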
Double Buffering - a frequently used algorithm pattern
• One could double buffer the computation, getting a better instruction mix within each thread
  – This is classic software pipelining in ILP compilers

Single buffering:
Loop {
    Load current tile to shared memory
    syncthreads()
    Compute current tile
    syncthreads()
}

Double buffering:
Load next tile from global memory
Loop {
    Deposit current tile to shared memory
    syncthreads()
    Load next tile from global memory
    Compute current tile
    syncthreads()
}
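
A hedged CUDA sketch of the double-buffered loop applied to the tiled kernel above, staging the next tile in registers so its global-memory latency overlaps with computing the current tile:

#define TILE_WIDTH 16

__global__ void matMulDoubleBuffered(const float *Md, const float *Nd,
                                     float *Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    int phases = Width / TILE_WIDTH;

    // Load next (here: first) tile from global memory into registers
    float mReg = Md[row * Width + threadIdx.x];
    float nReg = Nd[threadIdx.y * Width + col];
    float Pvalue = 0.0f;

    for (int m = 0; m < phases; ++m) {
        // Deposit current tile to shared memory
        Mds[threadIdx.y][threadIdx.x] = mReg;
        Nds[threadIdx.y][threadIdx.x] = nReg;
        __syncthreads();

        // Load next tile from global memory; these loads can overlap
        // with the compute loop below
        if (m + 1 < phases) {
            mReg = Md[row * Width + (m + 1) * TILE_WIDTH + threadIdx.x];
            nReg = Nd[((m + 1) * TILE_WIDTH + threadIdx.y) * Width + col];
        }

        // Compute current tile
        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[threadIdx.y][k] * Nds[k][threadIdx.x];
        __syncthreads();
    }
    Pd[row * Width + col] = Pvalue;
}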
Double Buffering
• Deposit blue tile from register into shared memory
• Syncthreads
• Load orange tile into register
• Compute blue tile
• Deposit orange tile into shared memory
• ….
[Figure: the Md × Nd = Pd tiling, with the current (blue) and next (orange) tiles highlighted along a row strip of Md and a column strip of Nd.]
One can trade more work for increased parallelism
• Diamond search algorithm for motion estimation: work efficient but sequential
  – Popular in traditional CPU implementations
• Exhaustive search: totally parallel but work inefficient
  – Popular in HW and parallel implementations
[Figure: MPEG encoder block diagram (Motion Estimation → Motion Compensation & Frame Subtraction → DCT & Quantizer → Dequantizer & IDCT → Output); frame order = I, P1, P2, ...; inset shows macroblock MB2 in frame P2 and the region of P1 macroblocks required to process MB2.]
An MPEG Algorithm based on Data Parallelism
[Figure: timeline 0–6; the macroblock loops for Frame 1 and then Frame 2 are each distributed across accelerators ACC1–ACC4, with communication between the distributed loops.]
• Loops distributed – DOALL style
  – Replicates instructions and tables across accelerators
• If instructions and data are too large for local memory…
  – Large memory transfers required to preserve data
  – Saturation of and contention for communication resources can leave computation resources idle
Memory bandwidth constrains performance
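
Purely as an illustration of this DOALL structure (all type names and stage bodies below are hypothetical stand-ins, not the course's encoder): each stage is an independent loop over macroblocks whose iterations can be distributed, but the per-stage arrays must pass through main memory between loops.

typedef struct { int x, y; } MotionVector;   /* hypothetical stand-in types */
typedef struct { float energy; } Residual;

/* Dummy stage bodies, just enough to make the sketch self-contained */
static MotionVector motion_estimate(int mb) { MotionVector v = { mb % 4, mb / 4 }; return v; }
static Residual subtract_and_dct(int mb, MotionVector v) { Residual r = { (float)(mb + v.x + v.y) }; return r; }
static float quantize(Residual r) { return r.energy * 0.5f; }

/* One DOALL loop per stage: iterations are independent and can be spread
   across accelerators, but mv[] and res[] round-trip through main memory */
void encode_frame_doall(float *out, MotionVector *mv, Residual *res, int numMB)
{
    for (int mb = 0; mb < numMB; ++mb) mv[mb]  = motion_estimate(mb);
    for (int mb = 0; mb < numMB; ++mb) res[mb] = subtract_and_dct(mb, mv[mb]);
    for (int mb = 0; mb < numMB; ++mb) out[mb] = quantize(res[mb]);
}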
Loop fusion & memory privatization
[Figure: timeline 0–4; Frame 1 runs on ACC1–ACC4 while Frame 2 runs on ACC5–ACC8, each accelerator executing the entire fused macroblock loop.]
• Stage loops fused into a single DOALL macroblock loop
  – Memory privatization reduces main memory access
• Replicates instructions and tables across processors
  – Local memory constraints may prevent this technique
Novel dimensions of parallelism reduce communication
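
Continuing the hypothetical sketch above, the fused version: the three stage loops collapse into one DOALL loop, and the intermediates become per-iteration privates that stay in local memory.

/* Fused DOALL macroblock loop: my_mv and my_res are privatized (one copy
   per iteration, living only in local memory), so the inter-stage arrays
   and their main-memory round trips disappear */
void encode_frame_fused(float *out, int numMB)
{
    for (int mb = 0; mb < numMB; ++mb) {
        MotionVector my_mv  = motion_estimate(mb);
        Residual     my_res = subtract_and_dct(mb, my_mv);
        out[mb] = quantize(my_res);
    }
}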
Pipeline or “Spatial Computing” Model
[Figure: timeline 0–4; macroblocks of Frames 1 and 2 flow through accelerators ACC1–ACC8, one pipeline stage per accelerator.]
• Each PE performs as one pipeline stage in macroblock processing
• Imbalanced stages result in idle resources
• Takes advantage of direct, accelerator-to-accelerator communication
• Not very effective in CUDA but can be effective for Cell
Efficient point-to-point communication can enable new models
Small decisions in algorithms can have a major effect on parallelism
(H.263 motion estimation example)
• Different algorithms may expose different levels of parallelism while achieving the desired result
• In motion estimation, one can use previous vectors (either from space or time) as guess vectors
[Figure: prev_frame / cur_frame pairs. (a) Guess vectors are obtained from the previous macroblock. (b) Guess vectors are obtained from the corresponding macroblock in the previous frame.]