ECE 498HK Computational Thinking for Many-core Computing
Lecture 15: Dealing with Dynamic Data
©Wen-mei W. Hwu and David Kirk/NVIDIA 2010

Objective
• To understand the nature of dynamic input data extraction
• To learn efficient queue structures that support massively parallel extraction of input data from bulk data structures
• To learn efficient kernel structures that cope with dynamic variation of working set size

Examples of Computation with Dynamic Input Data
• Graph search and optimization problems
  – Breadth-first search
  – Shortest path
  – Graph coverage
  – …
• Rendering and game physics
  – Ray tracing
  – Collision detection
  – …

Dynamic Data Extraction
• The data to be processed in each phase of computation needs to be dynamically determined and extracted from a bulk data structure
  – Harder when the bulk data structure is not organized for massively parallel access, such as graphs
• Graph algorithms are popular examples that deal with dynamic data
  – Widely used in EDA and large-scale optimization applications
  – We will use Breadth-First Search (BFS) as an example

Main Challenges of Dynamic Data
• Input data need to be organized for locality, coalescing, and contention avoidance as they are extracted during execution
• The amount of work and level of parallelism often grow and shrink during execution
  – As more or less data is extracted during each phase
  – Hard to fit efficiently into one CUDA/OpenCL kernel configuration, which cannot be changed once launched
  – Different kernel strategies fit different data sizes

Breadth-First Search (BFS)
[Figure: BFS on an example graph with vertices r, s, t, u, v, w, x, y; starting from s, frontier vertices and visited vertices advance through Levels 1–4]

Sequential BFS
• Store the frontier in a queue
• Complexity: O(V+E)
[Figure: sequential BFS on the example graph; each step dequeues a frontier vertex and enqueues its unvisited neighbors, level by level]

Parallelism in BFS
• Parallel propagation within each level
• One GPU kernel per level
[Figure: the frontier vertices of each level are expanded in parallel by one kernel; the boundary between consecutive kernels (e.g., Kernel 3 and Kernel 4) acts as a global barrier between levels]

BFS in VLSI CAD
• Maze Routing
[Figure: maze routing of a net between terminals around a blockage, a BFS wave expansion]

BFS in VLSI CAD
• Logic Simulation/Timing Analysis
[Figure: logic values (0s and a 1) propagated through a gate network level by level]

BFS in VLSI CAD
• In formal verification for reachability analysis
• In clustering for finding connected components
• In logic synthesis
• ……

Potential Pitfall of Parallel Algorithms
• A greatly accelerated O(n²) algorithm is still slower than an O(n log n) algorithm once the problem is large enough.
• Always keep an eye on the fast sequential algorithm as the baseline.
[Figure: running time vs. problem size; the accelerated O(n²) curve eventually crosses above the O(n log n) curve]

Node Oriented Parallelization
• IIIT-BFS
  – P. Harish et al., "Accelerating large graph algorithms on the GPU using CUDA"
  – Each thread is dedicated to one node
  – Every thread examines neighbor nodes to determine if its node will be a frontier node in the next phase
  – Complexity O(VL+E), where L is the number of levels (compared with O(V+E))
  – Slower than the sequential version for large graphs
    • Especially for sparsely connected graphs
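To make the node-oriented scheme above concrete, here is a minimal CUDA sketch of one level step in the spirit of IIIT-BFS. It assumes a CSR graph layout (row_offsets, col_indices) and a level array that is -1 for unvisited nodes and 0 for the source before the first launch; all names are illustrative, not taken from the original code.

```cuda
// One BFS level step, one thread per node (node-oriented sketch).
// level[v] == cur_level marks v as a current-frontier node.
__global__ void bfs_node_oriented(const int *row_offsets,
                                  const int *col_indices,
                                  int *level, int cur_level,
                                  int num_nodes, int *done)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per node
    if (v >= num_nodes || level[v] != cur_level) return;

    // v is on the current frontier: mark its unvisited neighbors.
    for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e) {
        int nbr = col_indices[e];
        if (level[nbr] == -1) {          // benign race: all writers store
            level[nbr] = cur_level + 1;  // the same value
            *done = 0;                   // another level must be launched
        }
    }
}
```

The host sets the done flag to 1, launches one thread per node, increments cur_level, and repeats until no thread clears the flag. All V threads run in every level even when the frontier is tiny, which is where the extra O(VL) term on the slide comes from.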
Matrix-based Parallelization
• Yangdong Deng et al., "Taming Irregular EDA Applications on GPUs"
• Propagation is done through matrix-vector multiplication
  – For sparsely connected graphs, the connectivity matrix will be a sparse matrix
• Complexity O(V+EL) (compared with O(V+E))
  – Slower than sequential for large graphs
[Figure: adjacency-matrix rows for vertices s, u, v; multiplying the matrix by the current frontier vector yields the next frontier vector]

Need a More General Technique
• To efficiently handle most graph types
• Use more specialized formulations when appropriate as an optimization
• Efficient queue-based parallel algorithms
  – Hierarchical scalable queue implementation
  – Hierarchical kernel arrangements

An Initial Attempt
• Manage the queue structure
  – Complexity: O(V+E)
  – Dequeue in parallel
  – Each frontier node is a thread
  – Enqueue in sequence
• Poor coalescing
• Poor scalability
  – No speedup
[Figure: the frontier {v, t, x, u, y} is dequeued in parallel, but new frontier nodes are enqueued one at a time]

Parallel Insert-Compact Queues
• C. Lauterbach et al., "Fast BVH Construction on GPUs"
• Parallel enqueue with compaction cost
• Not suitable for light-node problems
[Figure: propagation writes new frontier candidates interleaved with empty (Φ) slots; a compaction pass then removes the Φ entries]

Basic ideas
• Each thread processes one or more frontier nodes
• Find the index of each new frontier node
• Build the queue of the next frontier hierarchically
[Figure: local queues q1, q2, q3 are concatenated into the global queue; e.g., element h of q2 lands at global index = offset of q2 (= #nodes in q1) + index of h within q2]

Two-level Hierarchy
• Block queue (b-queue)
  – Inserted by all threads in a block
  – Resides in shared memory
• Global queue (g-queue)
  – Inserted only when a block completes
  – Resides in global memory
• Problem:
  – Collision on b-queues
  – Threads in the same block can cause heavy contention

Warp-level Queue
[Figure: thread scheduling; threads of warps i and j (WiT0…WiT30, WjT0…WjT30) are interleaved over time across the SM's SPs]
• Divide threads into 8 groups (for GTX280)
  – 8 is the number of SPs in each SM in general
  – One queue to be written by each SP in the SM
• Each group writes to one warp-level queue
• Atomic operations are still needed
  – But with a much lower level of contention

Three-level hierarchy
[Figure: per-warp w-queues feed per-block b-queues, which feed the single g-queue]

Hierarchical Queue Management
• Shared memory:
  – Interleaved queue layout (W-queues[][8] holding W-queue0 … W-queue7), no bank conflicts
• Global memory:
  – Coalescing when releasing a b-queue to the g-queue
  – Moderate contention across blocks
• Texture memory:
  – Store the graph structure (random access, no coalescing)
  – The Fermi cache may help

Hierarchical Queue Management
• Advantage and limitation
  – The technique can be applied to any inherently sequential data structure
  – The w-queues and b-queues are limited by the capacity of shared memory. If we know the upper limit of the degree, we can adjust the number of threads per block accordingly.

ANY FURTHER QUESTIONS?
©Wen-mei W. Hwu and David Kirk/NVIDIA 2010
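Backup: B-queue/G-queue Insertion Sketch

A minimal CUDA sketch of the two-level queue insertion from the hierarchical-queue slides, assuming a CSR graph and a visited-flag array; the w-queue level of the three-level scheme is omitted for brevity, and all names and capacities are illustrative.

```cuda
#define BQ_CAPACITY 1024  // bounded by shared memory size, as the slides note

// Expand the current frontier (c_queue) into the next frontier (g_queue).
// *g_tail must be zeroed by the host before each level's launch.
__global__ void bfs_expand_frontier(const int *row_offsets,
                                    const int *col_indices,
                                    const int *c_queue, int c_size,
                                    int *g_queue, int *g_tail,
                                    int *visited)
{
    __shared__ int b_queue[BQ_CAPACITY];
    __shared__ int b_tail;     // entries currently in this block's b-queue
    __shared__ int g_offset;   // this block's reserved slice of the g-queue

    if (threadIdx.x == 0) b_tail = 0;
    __syncthreads();

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < c_size) {                        // dequeue one frontier node
        int v = c_queue[tid];
        for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e) {
            int nbr = col_indices[e];
            if (atomicExch(&visited[nbr], 1) == 0) {   // first visit wins
                int idx = atomicAdd(&b_tail, 1);       // block-local contention only
                if (idx < BQ_CAPACITY)
                    b_queue[idx] = nbr;
                else                                   // overflow: spill directly
                    g_queue[atomicAdd(g_tail, 1)] = nbr;
            }
        }
    }
    __syncthreads();

    // One thread reserves space in the g-queue for the whole block.
    if (threadIdx.x == 0)
        g_offset = atomicAdd(g_tail, min(b_tail, BQ_CAPACITY));
    __syncthreads();

    // Release the b-queue to the g-queue; consecutive threads write
    // consecutive slots, so the stores coalesce.
    for (int i = threadIdx.x; i < min(b_tail, BQ_CAPACITY); i += blockDim.x)
        g_queue[g_offset + i] = b_queue[i];
}
```

A single atomicAdd per block reserves the block's slice of the g-queue, so per-node contention stays within the block and the final copy-out is coalesced, which is the behavior the "Hierarchical Queue Management" slides call for.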