ECE 498HK: Computational Thinking for Many-core Computing
Lecture 15: Dealing with Dynamic Data
©Wen-mei W. Hwu and David Kirk/NVIDIA 2010
Objective
• To understand the nature of dynamic input data extraction
• To learn efficient queue structures that support massively parallel extraction of input data from bulk data structures
• To learn efficient kernel structures that cope with dynamic variation of working set size
Examples of Computation with Dynamic Input Data
• Graph search and optimization problems
– Breadth-first search
– Shortest path
– Graph coverage
– …
• Rendering and game physics
– Ray tracing
– Collision detection
– …
Dynamic Data Extraction
• The data to be processed in each phase of computation must be dynamically determined and extracted from a bulk data structure
– This is harder when the bulk data structure is not organized for massively parallel access, as with graphs
• Graph algorithms are popular examples of computation with dynamic data
– Widely used in EDA and large-scale optimization applications
– We will use Breadth-First Search (BFS) as an example
Main Challenges of Dynamic Data
• Input data must be organized for locality, coalescing, and contention avoidance as it is extracted during execution
• The amount of work and the level of parallelism often grow and shrink during execution
– As more or less data is extracted during each phase
– Hard to fit efficiently into one CUDA/OpenCL kernel configuration, which cannot be changed once launched
– Different kernel strategies fit different data sizes
Breadth-First Search (BFS)
[Figure: BFS on an example graph with vertices r, s, t, u, v, w, x, y; successive snapshots show the search expanding from source s through Levels 1–4, with frontier and visited vertices highlighted.]
Sequential BFS
• Store the frontier in a queue
• Complexity: O(V+E)
[Figure: sequential BFS on the example graph; at each level, frontier vertices are dequeued and their unvisited neighbors are enqueued.]
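A minimal host-side sketch of this queue-based sequential BFS, assuming a CSR graph representation (row_offsets, col_indices); the names are illustrative, not the lecture's code:

    #include <queue>
    #include <vector>

    // Returns the BFS level of every vertex (-1 = unreachable).
    std::vector<int> bfs_sequential(const std::vector<int> &row_offsets,
                                    const std::vector<int> &col_indices,
                                    int source)
    {
        int num_vertices = (int)row_offsets.size() - 1;
        std::vector<int> level(num_vertices, -1);
        std::queue<int> frontier;
        level[source] = 0;
        frontier.push(source);
        while (!frontier.empty()) {
            int v = frontier.front();            // dequeue a frontier vertex
            frontier.pop();
            for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e) {
                int n = col_indices[e];
                if (level[n] == -1) {            // first visit: enqueue
                    level[n] = level[v] + 1;
                    frontier.push(n);
                }
            }
        }
        return level;   // each vertex and edge is touched once: O(V+E)
    }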
Parallelism in BFS
• Parallel propagation within each level
• One GPU kernel launch per level; the kernel boundary serves as the global barrier between levels
[Figure: the example graph processed level by level; each level's frontier is expanded in parallel by one kernel (e.g., Kernel 3 for Level 3, Kernel 4 for Level 4), with a global barrier between launches.]
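A host-side sketch of the one-kernel-per-level structure; bfs_level_kernel (shown later under "An Initial Attempt") and the buffer names are illustrative assumptions:

    #include <cuda_runtime.h>
    #include <utility>

    __global__ void bfs_level_kernel(const int *, const int *, int *, int,
                                     const int *, int, int *, int *);

    void bfs_gpu(const int *d_row_offsets, const int *d_col_indices,
                 int *d_levels, int *d_frontier_in, int *d_frontier_out,
                 int *d_out_count, int source)
    {
        int frontier_size = 1;            // the frontier starts as {source}
        int level = 0;
        cudaMemcpy(d_frontier_in, &source, sizeof(int), cudaMemcpyHostToDevice);
        while (frontier_size > 0) {
            cudaMemset(d_out_count, 0, sizeof(int));
            int blocks = (frontier_size + 255) / 256;
            bfs_level_kernel<<<blocks, 256>>>(d_row_offsets, d_col_indices,
                                              d_levels, level,
                                              d_frontier_in, frontier_size,
                                              d_frontier_out, d_out_count);
            // Returning to the host is the global barrier between levels.
            cudaMemcpy(&frontier_size, d_out_count, sizeof(int),
                       cudaMemcpyDeviceToHost);
            std::swap(d_frontier_in, d_frontier_out);  // output becomes next input
            ++level;
        }
    }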
BFS in VLSI CAD
• Maze Routing
[Figure: maze routing on a grid; a BFS wavefront expands from a net terminal around blockages.]
BFS in VLSI CAD
• Logic Simulation/Timing Analysis
[Figure: a gate-level netlist with logic values; an input change (0 to 1) propagates through the gates in BFS order.]
BFS in VLSI CAD
• In formal verification, for reachability analysis
• In clustering, for finding connected components
• In logic synthesis
• …
Potential Pitfall of Parallel Algorithms
• A greatly accelerated O(n²) algorithm is still slower than an O(n log n) algorithm for large enough inputs
– For example, a 100× speedup of an n² algorithm loses once n²/100 > n log₂ n, i.e., n > 100 log₂ n, which holds beyond roughly n = 1,000
• Always keep the fast sequential algorithm in view as the baseline
[Figure: running time vs. problem size; the accelerated O(n²) curve eventually crosses above the O(n log n) curve.]
Node-Oriented Parallelization
• IIIT-BFS
– P. Harish et al., "Accelerating Large Graph Algorithms on the GPU Using CUDA"
– Each thread is dedicated to one node
– Every thread examines its neighbor nodes to determine whether its node will be a frontier node in the next phase
– Complexity O(VL+E) for L levels, compared with the sequential O(V+E)
– Slower than the sequential version for large graphs
• Especially for sparsely connected graphs
[Figure: successive snapshots of the example graph r, s, t, u, v, w, x, y under node-oriented BFS.]
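A device-side sketch of the one-thread-per-node idea, assuming a CSR graph and a levels array with -1 for unvisited vertices; names are illustrative. Every level launches one thread per vertex even when only a few vertices are on the frontier, which is where the extra VL term comes from:

    __global__ void bfs_node_oriented(const int *row_offsets,
                                      const int *col_indices,
                                      int *levels, int current_level,
                                      int num_vertices, int *frontier_not_empty)
    {
        int v = blockIdx.x * blockDim.x + threadIdx.x;
        if (v >= num_vertices) return;
        if (levels[v] != current_level) return;  // non-frontier threads idle
        for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e) {
            int n = col_indices[e];
            if (levels[n] == -1) {               // unvisited neighbor joins next frontier
                levels[n] = current_level + 1;   // benign race: all writers agree
                *frontier_not_empty = 1;         // host launches another level
            }
        }
    }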
Matrix-based Parallelization
• Y. Deng et al., "Taming Irregular EDA Applications on GPUs"
• Propagation is done through matrix-vector multiplication
– For sparsely connected graphs, the connectivity matrix is a sparse matrix
• Complexity O(V+EL) for L levels, compared with the sequential O(V+E)
– Slower than sequential for large graphs
s 0 1 0 1  0 s
s
s
u
u
u 1 0 1   0  1  u
v 0 1 0 0 0 v
v
v
©Wen-mei W. Hwu and David Kirk/NVIDIA 2010
s u v
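A sketch of one propagation step phrased as a sparse matrix-vector multiplication over a CSR connectivity matrix, with 0/1 indicator arrays as frontier vectors; names are illustrative. Every edge is touched at every level, giving the EL term:

    __global__ void spmv_propagate(const int *row_offsets, const int *col_indices,
                                   const int *frontier_in, int *frontier_out,
                                   int num_vertices)
    {
        int v = blockIdx.x * blockDim.x + threadIdx.x;
        if (v >= num_vertices) return;
        int reached = 0;
        // Row v of the connectivity matrix dotted with the 0/1 frontier vector.
        for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e)
            reached |= frontier_in[col_indices[e]];
        frontier_out[v] = reached;   // 1 if any frontier vertex reaches v
    }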
Need a More General Technique
• To efficiently handle most graph types
• Use more specialized formulations, when appropriate, as an optimization
• Efficient queue-based parallel algorithms
– Hierarchical, scalable queue implementation
– Hierarchical kernel arrangements
An Initial Attempt
• Manage the queue structure
– Complexity: O(V+E)
– Dequeue in parallel
– Each frontier node is a thread
– Enqueue in sequence
• Poor coalescing
• Poor scalability
– No speedup
[Figure: frontier nodes t, v, x, u, y are dequeued in parallel, one thread each; the new frontier nodes are enqueued sequentially.]
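A sketch of this first attempt, and of the bfs_level_kernel assumed by the earlier host loop; levels[] starts at -1 for unvisited vertices, and the names are illustrative. The single out_count counter serializes every enqueue, and the scattered writes coalesce poorly, which is exactly what the following slides address:

    __global__ void bfs_level_kernel(const int *row_offsets, const int *col_indices,
                                     int *levels, int level,
                                     const int *frontier_in, int frontier_size,
                                     int *frontier_out, int *out_count)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= frontier_size) return;            // one thread per frontier node
        int v = frontier_in[i];
        for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e) {
            int n = col_indices[e];
            if (atomicCAS(&levels[n], -1, level + 1) == -1) {  // claim unvisited node
                int idx = atomicAdd(out_count, 1); // every thread hits one counter
                frontier_out[idx] = n;             // scattered writes: poor coalescing
            }
        }
    }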
Parallel Insert-Compact Queues
• C. Lauterbach et al., "Fast BVH Construction on GPUs"
• Parallel enqueue, but with a compaction cost
• Not suitable for light-node problems
[Figure: propagation writes new nodes and Φ (empty) entries into a sparse array; a compaction pass then removes the Φ entries to form the queue.]
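A minimal sketch of the insert-then-compact pattern using Thrust stream compaction; -1 stands in for the slide's Φ entries, and the buffer names are illustrative:

    #include <thrust/copy.h>
    #include <thrust/device_vector.h>

    struct is_valid {
        __host__ __device__ bool operator()(int x) const { return x != -1; }
    };

    // d_sparse holds the propagated candidates, with -1 (Φ) in empty slots;
    // compaction packs the valid entries into d_queue and returns the new size.
    int compact_queue(const thrust::device_vector<int> &d_sparse,
                      thrust::device_vector<int> &d_queue)
    {
        return (int)(thrust::copy_if(d_sparse.begin(), d_sparse.end(),
                                     d_queue.begin(), is_valid())
                     - d_queue.begin());
    }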
Basic ideas
• Each thread processes one or more frontier nodes
• Find the index of each new frontier node
• Build the queue of the next frontier hierarchically
[Figure: per-block local queues q1, q2, q3 are concatenated into one global queue.]
• Index = offset of q2 (the number of nodes in q1) + index within q2
– For example, if q1 holds 3 nodes, the node at position 1 of q2 lands at global index 3 + 1 = 4
Two-level Hierarchy
• Block queue (b-queue)
– Inserted into by all threads in a block
– Resides in shared memory
• Global queue (g-queue)
– Inserted into only when a block completes
– Resides in global memory
• Problem:
– Collisions on b-queues
– Threads in the same block can cause heavy contention
[Figure: each block's b-queue in shared memory feeds the single g-queue in global memory.]
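A minimal sketch of the two-level enqueue, assuming a fixed b-queue capacity and illustrative names (not the lecture's code). Each node's global index is the block's g-queue offset plus its index in the b-queue, matching the formula on the "Basic ideas" slide; note how every thread in the block collides on the one b_count counter, which is the contention problem stated above:

    #define BQ_CAPACITY 512   // assumes blockDim.x <= BQ_CAPACITY

    __device__ int g_queue[1 << 20];   // g-queue in global memory
    __device__ int g_count;

    __global__ void enqueue_two_level(const int *new_frontier, int n)
    {
        __shared__ int b_queue[BQ_CAPACITY];   // b-queue in shared memory
        __shared__ int b_count, b_offset;
        if (threadIdx.x == 0) b_count = 0;
        __syncthreads();

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            int pos = atomicAdd(&b_count, 1);  // heavy contention: one counter per block
            b_queue[pos] = new_frontier[i];
        }
        __syncthreads();

        // When the block completes, reserve g-queue space once...
        if (threadIdx.x == 0) b_offset = atomicAdd(&g_count, b_count);
        __syncthreads();
        // ...then release the b-queue with coalesced writes.
        for (int j = threadIdx.x; j < b_count; j += blockDim.x)
            g_queue[b_offset + j] = b_queue[j];
    }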
Warp-level Queue
• Thread scheduling
[Figure: issue timeline for warps i and j; threads T0, T8, T16, T24 of a warp execute on the same SP, so a warp's threads form eight groups by thread index mod 8.]
• Divide threads into 8 groups (for the GTX 280)
– 8 is the number of SPs per SM in general
– One queue is written by each SP in the SM
• Each group writes to its own warp-level queue (w-queue)
• Atomic operations are still needed
– But with a much lower level of contention
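A sketch of the warp-level stage, assuming 8 w-queues stored in the interleaved W-queues[][8] layout described two slides below; the slot count and names are illustrative, and blockDim.x is assumed to be at most NUM_WQ * WQ_SLOTS. Contention is confined to the threads that share a queue:

    #define NUM_WQ   8    // one w-queue per SP on the GTX 280
    #define WQ_SLOTS 64

    __global__ void enqueue_warp_level(const int *new_frontier, int n)
    {
        __shared__ int w_queues[WQ_SLOTS][NUM_WQ]; // queue k = column k: no bank conflicts
        __shared__ int w_counts[NUM_WQ];
        if (threadIdx.x < NUM_WQ) w_counts[threadIdx.x] = 0;
        __syncthreads();

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            int q = threadIdx.x % NUM_WQ;          // this thread's group and queue
            int pos = atomicAdd(&w_counts[q], 1);  // contention only within the group
            w_queues[pos][q] = new_frontier[i];
        }
        __syncthreads();
        // The w-queues are then merged into the block's b-queue, and the
        // b-queue into the g-queue, as in the two-level sketch above.
    }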
Three-level hierarchy
[Figure: each block's w-queues feed its b-queue, and the b-queues of all blocks feed the single g-queue.]
Hierarchical Queue Management
• Shared memory:
– Interleaved queue layout W-queues[][8]: no bank conflicts
[Figure: the eight w-queues (W-queue0 … W-queue7) stored as the columns of W-queues[][8].]
• Global memory:
– Coalescing when releasing a b-queue to the g-queue
– Moderate contention across blocks
• Texture memory:
– Stores the graph structure (random access, no coalescing)
– The Fermi cache may help
Hierarchical Queue Management
• Advantages and limitations
– The technique can be applied to any inherently sequential data structure
– The w-queues and b-queues are limited by the capacity of shared memory; if we know the upper limit of the node degree, we can adjust the number of threads per block accordingly
ANY FURTHER QUESTIONS?