A Hybrid Task Graph Scheduler for High Performance Image Processing Workflows
Timothy Blattner (NIST | UMBC)
Global Conference on Signal and Information Processing, December 15, 2015

Outline
◦ Introduction
◦ Challenges
◦ Image Stitching
◦ Hybrid Task Graph Scheduler
◦ Preliminary Results
◦ Conclusions
◦ Future Work

Credits
◦ Walid Keyrouz (NIST)
◦ Milton Halem (UMBC)
◦ Shuvra Bhattacharyya (UMD)

Introduction
The hardware landscape is changing, and traditional software approaches to extracting performance from it are reaching a complexity limit:
◦ Multiple GPUs on a node
◦ Complex memory hierarchies
We present a novel abstract machine model:
◦ The Hybrid Task Graph Scheduler, which expands on hybrid pipeline workflows
◦ Scope: a single node with multiple CPUs and GPUs
◦ Emphasis on execution pipelines to scale to multiple GPUs / CPU sockets, and on a memory interface to attach to hierarchies of memory
◦ Can be extended beyond a single node (clusters)

Introduction – Future Architectures
The next generation of hybrid architectures features:
◦ A few fat cores with many more simpler cores
  ◦ Intel Knights Landing
  ◦ POWER9 + NVIDIA Volta + NVLink (the Sierra cluster)
◦ Faster interconnects
◦ Deeper memory hierarchies
Programming methods must present the right machine model to programmers so they can extract performance.
Figure: NVIDIA Volta GPU (nvidia.com)

Introduction – Data Transfer Costs
Copying data between address spaces is expensive; PCI Express is the bottleneck.
Current hybrid CPU+GPU systems contain multiple independent address spaces:
◦ Unifying the address spaces simplifies the programmer's job and is good for prototyping, but it obscures the cost of data motion
Techniques for improving hybrid utilization:
◦ Provide enough computation per data element
◦ Overlap data motion with computation (sketched below)
◦ Use a faster bus (80 GB/s NVLink versus 16 GB/s PCIe)
  ◦ NVLink requires multiple GPUs to reach peak performance [NVLink Whitepaper 2014]
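To make the overlap technique concrete, here is a minimal CUDA sketch that pipelines tile transfers against kernel execution using two streams. It is an illustration under stated assumptions, not code from the stitching implementation: processTile, pipelineTiles, and the tile layout are hypothetical, tiles are assumed independent, and host buffers are assumed pinned.

```cuda
// Minimal sketch: overlap PCIe transfers with kernel execution using two
// CUDA streams. Assumes host buffers were allocated with cudaMallocHost
// (pinned memory), which is required for the copies to run asynchronously.
#include <cuda_runtime.h>

__global__ void processTile(float *tile, size_t n) {
  size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
  if (i < n) tile[i] *= 2.0f;  // placeholder per-element computation
}

void pipelineTiles(float **hostTiles, float **devTiles, int numTiles, size_t n) {
  cudaStream_t streams[2];
  for (int s = 0; s < 2; ++s) cudaStreamCreate(&streams[s]);

  for (int t = 0; t < numTiles; ++t) {
    cudaStream_t s = streams[t % 2];
    // Copy, compute, and copy back are ordered within a stream; while
    // tile t transfers in this stream, tile t-1 computes in the other.
    cudaMemcpyAsync(devTiles[t], hostTiles[t], n * sizeof(float),
                    cudaMemcpyHostToDevice, s);
    processTile<<<(unsigned)((n + 255) / 256), 256, 0, s>>>(devTiles[t], n);
    cudaMemcpyAsync(hostTiles[t], devTiles[t], n * sizeof(float),
                    cudaMemcpyDeviceToHost, s);
  }
  cudaDeviceSynchronize();
  for (int s = 0; s < 2; ++s) cudaStreamDestroy(streams[s]);
}
```

With enough computation per element, the transfers in one stream hide behind the kernel running in the other, which is the utilization technique the bullet list describes.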
Introduction – Complex Memory Hierarchies
Data locality is becoming more complex:
◦ Non-volatile storage devices: NVMe, 3D XPoint (future), SATA SSD, SATA HDD
◦ Volatile memories: HBM / 3D-stacked memory, DDR, and GPU shared memory / L1, L2, L3 caches
These memories need to be modeled within programming methods:
◦ Effectively utilize each level based on its size and speed
◦ Hierarchy-aware programming
Figure: Memory hierarchy speed, cost, and capacity [Ang et al. 2014]

Key Challenges
The changing hardware landscape demands:
◦ Hierarchy-aware programming to manage data locality
◦ Multi-GPU computation to exploit wider data transfer channels such as NVLink
◦ Hybrid computing that utilizes all compute resources
A programming and execution machine model is needed to address these challenges:
◦ The Hybrid Task Graph Scheduler (HTGS) model
◦ Expands on hybrid pipeline workflows [Blattner 2013]

Hybrid Pipeline Workflows
The hybrid pipeline workflow system:
◦ Schedules tasks using a multiple-producer, multiple-consumer model
◦ Was prototyped in a 2013 Master's thesis [Blattner 2013]
◦ Kept all GPUs busy with execution pipelines, one per GPU
◦ Stayed within memory limits
◦ Overlapped data motion with computation
◦ Was tailored for image stitching
◦ Required significant programming effort to prevent race conditions, manage dependencies, and maintain memory limits
We expand on hybrid pipeline workflows with the Hybrid Task Graph Scheduler (HTGS), which:
◦ Formulates a model for a variety of algorithms
◦ Will reduce programmer effort

Hybrid Workflow Impact – Image Stitching
Image stitching:
◦ Addresses the scale mismatch between a microscope's field of view and the plate under study
◦ Overlapping images must be 'stitched' to form one large image
◦ Three compute stages:
  ◦ (S1) Fast Fourier transform (FFT) of each image
  ◦ (S2) Phase correlation image alignment method (PCIAM) [Kuglin & Hines 1975]; its formulation is given below
  ◦ (S3) Cross-correlation factors (CCFs)
Figure: Image stitching dataflow graph
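For reference, stage S2 computes the standard phase correlation of Kuglin & Hines between two overlapping tiles; the notation here (tiles I1 and I2, Fourier operator F) is ours, not the slide deck's:

```latex
% Normalized cross-power spectrum of two tiles, and the peak of its
% inverse transform, which gives their relative translation:
R(u,v) = \frac{\mathcal{F}\{I_1\}(u,v)\,\overline{\mathcal{F}\{I_2\}(u,v)}}
              {\left|\mathcal{F}\{I_1\}(u,v)\,\overline{\mathcal{F}\{I_2\}(u,v)}\right|},
\qquad
(\Delta x, \Delta y) = \arg\max_{(x,y)} \; \mathcal{F}^{-1}\{R\}(x,y)
```

Separating stage S1 in the dataflow graph lets each tile's FFT be computed once and reused for every neighbor pair the tile participates in.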
Hybrid Workflow Impact – Image Stitching
Implementation using traditional parallel techniques (Simple-GPU):
◦ Port the computationally intensive components to the GPU, copying data to/from the GPU as needed
◦ 1.14x end-to-end speedup compared to a sequential CPU-only implementation
◦ Data motion dominated the run time
Implementation using the hybrid workflow system:
◦ Reuses the existing compute kernels
◦ 24x end-to-end speedup compared to Simple-GPU
◦ Scales with multiple GPUs (~1.8x from one to two GPUs)
◦ Requires significant programming effort [Blattner et al. 2014]

HTGS Motivation
Performance gains using a hybrid pipeline workflow.
Figure 1: Simple-GPU profile
Figure 2: Hybrid workflow profile

HTGS Motivation
Transforming dataflow graphs into task graphs.

Dataflow and Task Graphs
Both consist of a set of vertices and edges:
◦ A vertex is a task / compute function that implements a function applied to data
◦ An edge is data flowing between tasks
◦ The main difference between dataflow graphs and task graphs is scheduling
◦ Both are effective methods for representing MIMD concurrency
Figure: Example dataflow graph
Figure: Example task graph

HTGS Motivation
Scale to multiple GPUs:
◦ Partition the task graph into sub-graphs
◦ Bind each sub-graph to a separate GPU
Memory interface:
◦ Represents separate address spaces (CPU, GPU)
◦ Will manage complex memory hierarchies (future)
Overlap computation with I/O:
◦ Pipeline computation alongside I/O

Hybrid Task Graph Scheduler Model
Four primary components:
◦ Tasks
◦ Data
◦ Dependency rules
◦ Memory rules
Task graphs are constructed from these four components:
◦ Vertices are tasks
◦ Edges are data flow
Figure: Task graph

Hybrid Task Graph Scheduler Model – Tasks
◦ The programmer implements 'execute', which defines the functionality of the task (see the sketch after the prototype slide below)
◦ Special task types:
  ◦ GPU tasks bind to a device prior to execution
  ◦ The bookkeeper manages dependencies
◦ Threading: each task is bound to one or more threads in a thread pool

CUDA Task
Binds a CUDA graphics card to a task:
◦ Provides a CUDA context and stream to the execute function
◦ One CPU thread launches GPU kernels with thousands or millions of GPU threads
Figure: CUDA task

Memory Interface
Attaches to a task that needs reusable memory:
◦ Memory is freed based on programmer-defined memory rules
◦ A task requests memory from the manager and blocks if no memory is available (a sketch of such a pool also follows below)
◦ Acts as a channel separate from the dataflow
Figure: Memory manager interface

Hybrid Task Graph Scheduler Model – Execution Pipelines
Execution pipelines:
◦ Encapsulate a sub-graph
◦ Create duplicate instances of the sub-graph
◦ Schedule and execute each instance using new threads
◦ Can distribute instances among the available GPUs (one instance per GPU)
Figure: Execution pipeline task

HTGS API
Using the model, we implement the HTGS API:
◦ Tasks: default, bookkeeper, execution pipeline, and CUDA
◦ Memory interface: attaches to any task to allocate, free, or update memory

Prototype HTGS API – Image Stitching
Full implementation in Java, using image stitching as a test case.
Figure: Image stitching task graph
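To illustrate the task abstraction from the model above, here is a minimal C++-style sketch under our own naming; Task, ImageTile, TileFFT, and FFTTask are hypothetical and are neither the prototype's Java classes nor the actual HTGS API.

```cpp
// Hypothetical sketch of the task abstraction: a task implements
// 'execute', consuming one input datum and producing one output datum
// that flows along a graph edge to the next task.
#include <memory>

struct ImageTile { /* pixel data plus grid position in the mosaic */ };
struct TileFFT   { /* frequency-domain representation of a tile */ };

template <typename In, typename Out>
class Task {
public:
  virtual ~Task() = default;
  virtual std::shared_ptr<Out> execute(std::shared_ptr<In> in) = 0;
};

// Stage S1 of the stitching graph: FFT of one image tile.
class FFTTask : public Task<ImageTile, TileFFT> {
public:
  std::shared_ptr<TileFFT> execute(std::shared_ptr<ImageTile> tile) override {
    auto fft = std::make_shared<TileFFT>();
    // ... compute the 2D FFT of 'tile' (e.g., FFTW on CPU, cuFFT on GPU) ...
    return fft;
  }
};
```

In such a design, the runtime binds each task to its thread pool and connects tasks with thread-safe queues; a bookkeeper task between S1 and S2 would apply a dependency rule that releases a pair of tiles once both FFTs arrive, and wrapping the sub-graph in an execution pipeline parameterized by the GPU count would be the "single line" that duplicates it per device.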
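The memory interface's blocking behavior can likewise be sketched as a bounded buffer pool guarded by a condition variable. This is a minimal sketch assuming fixed-size buffers; MemoryPool, acquire, and release are illustrative names, not the HTGS API.

```cpp
// Hypothetical bounded pool: tasks block in acquire() when no buffers
// remain, which enforces the graph's memory limit; release() is invoked
// when a memory rule decides a buffer is no longer needed.
#include <condition_variable>
#include <deque>
#include <mutex>
#include <vector>

class MemoryPool {
public:
  MemoryPool(size_t count, size_t bytes) {
    for (size_t i = 0; i < count; ++i)
      pool_.push_back(std::vector<char>(bytes));
  }

  // Called by a task that needs memory; blocks until a buffer is free.
  std::vector<char> acquire() {
    std::unique_lock<std::mutex> lock(mutex_);
    available_.wait(lock, [this] { return !pool_.empty(); });
    std::vector<char> buf = std::move(pool_.front());
    pool_.pop_front();
    return buf;
  }

  // Returns a buffer to the pool and wakes one blocked task.
  void release(std::vector<char> buf) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      pool_.push_back(std::move(buf));
    }
    available_.notify_one();
  }

private:
  std::mutex mutex_;
  std::condition_variable available_;
  std::deque<std::vector<char>> pool_;
};
```

Because the pool is a channel separate from the dataflow, a producer task stalls on memory pressure without blocking unrelated parts of the graph.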
Preliminary Results
Machine specifications:
◦ Two Xeon E5620 CPUs (16 logical cores)
◦ Two NVIDIA Tesla C2070s and one GTX 680
◦ Libraries: JCuda and JCuFFT
◦ Baseline implementation: [Blattner et al. 2014]
◦ Problem size: 42x59 images (70% overlap)
The HTGS prototype achieves a runtime similar to the baseline's with a 23.6% reduction in code size.

| Implementation                      | GPUs | Runtime (s) | Lines of Code |
|-------------------------------------|------|-------------|---------------|
| Baseline (hybrid pipeline workflow) | 3    | 29.8        | 949           |
| HTGS, no execution pipeline         | 1    | 43.3        | 725           |
| HTGS with execution pipeline        | 1    | 41.4        | 726           |
| HTGS with execution pipeline        | 2    | 26.6        | 726           |
| HTGS with execution pipeline        | 3    | 24.5        | 726           |

Conclusions
The prototype HTGS API:
◦ Reduces code size by 23.6% compared to the hybrid pipeline workflow implementation
◦ Achieves a 17% speedup
◦ Enables multi-GPU execution by adding a single line of code
It handles the concerns of coarse-grained parallelism:
◦ Decomposition of the algorithm and its data structures
◦ Memory management
◦ Data locality
◦ Scheduling

Conclusions
The HTGS model and API:
◦ Scale across multiple GPUs and CPUs
◦ Overlap data motion with computation
◦ Keep processors busy
◦ Provide a memory interface for separate address spaces
◦ Are currently restricted to a single node with multiple CPUs and multiple NVIDIA GPUs
HTGS is a tool for representing complex image processing algorithms that require high performance.

Future Work
◦ Release the C++ implementation of HTGS (currently in development)
◦ Apply HTGS to other classes of algorithms: out-of-core matrix multiplication and LU factorization
◦ Expand execution pipelines to support clusters and the Intel MIC
◦ Image stitching with LIDE++, a lightweight dataflow environment [Shen, Plishker, & Bhattacharyya 2012]:
  ◦ Tool-assisted acceleration via annotated dataflow graphs
  ◦ Managed memory and data motion
  ◦ Enhanced scheduling and improved concurrency

References
[Ang et al. 2014] Ang, J. A.; Barrett, R. F.; Benner, R. E.; Burke, D.; Chan, C.; Cook, J.; Donofrio, D.; Hammond, S. D.; Hemmert, K. S.; Kelly, S. M.; Le, H.; Leung, V. J.; Resnick, D. R.; Rodrigues, A. F.; Shalf, J.; Stark, D.; Unat, D.; and Wright, N. J. 2014. Abstract machine models and proxy architectures for exascale computing. In Proceedings of the 1st International Workshop on Hardware-Software Co-Design for High Performance Computing (Co-HPC '14), 25-32. IEEE Press.
[Blattner et al. 2014] Blattner, T.; Keyrouz, W.; Chalfoun, J.; Stivalet, B.; Brady, M.; and Zhou, S. 2014. A hybrid CPU-GPU system for stitching large scale optical microscopy images. In 43rd International Conference on Parallel Processing (ICPP), 1-9.
[Blattner 2013] Blattner, T. 2013. A Hybrid CPU/GPU Pipeline Workflow System. Master's thesis, University of Maryland, Baltimore County.
[Shen, Plishker, & Bhattacharyya 2012] Shen, C.; Plishker, W.; and Bhattacharyya, S. S. 2012. Dataflow-based design and implementation of image processing applications. In Guan, L.; He, Y.; and Kung, S.-Y., eds., Multimedia Image and Video Processing, second edition, chapter 24, 609-629. CRC Press.
[Kuglin & Hines 1975] Kuglin, C. D., and Hines, D. C. 1975. The phase correlation image alignment method. In Proceedings of the 1975 IEEE International Conference on Cybernetics and Society, 163-165.
[NVLink Whitepaper 2014] NVIDIA. 2014. NVLink whitepaper. http://www.nvidia.com/object/nvlink.html

Thank You
Questions?