Chi-Keung Luk, Sunpyo Hong, Hyesoon Kim
Presented by Chris Spain

Heterogeneous Computing
- Multiple levels of hardware parallelism are exposed to software on current CPU+GPU systems
- (The GPU has tens to hundreds of special-purpose cores, while the CPU has a few general-purpose cores. Within each CPU core, there is short-vector parallelism provided by the SIMD extension of the ISA.)

Problem: Dot Product of Large Vectors
- Multiplications of large vectors can be done in parallel on the GPU
- Summation of the products is also done on the GPU using reduction
- After reduction, we have a partial sum for each block
- How should we add the partial sums?
  - GPU: if there are only a small number of them, this is a waste
  - CPU: lets the GPU get on with the next set

Example
$A \cdot B = \sum_{i=1}^{N} A_i B_i$
[Figure: arrays A and B are split into Block 1, Block 2, ..., Block N; an array reduction yields one partial sum per block. Final sum on GPU or CPU?]

Not an Either/Or Decision
- Matrix multiplication example
- The most efficient mapping depends on data size and may share work between CPU and GPU

How do we know whether to run on GPU or CPU?
- Could depend on the problem type: is it mostly serial, does it need large amounts of cache?
- Recall last week's paper showing sort and search faster on the CPU
- Also depends on the size of the data set: would we bother summing 30 elements on a GPU?
  - Have to allocate memory, transfer data, run the kernel, and transfer the results back

Mapping Computations
- Manually mapping to GPU or CPU
  - Labor intensive
  - Does not adapt to changes during runtime
  - Does not adapt to hardware changes
- Qilin: Adaptive Mapping
  - API for writing parallelizable operations in C/C++
  - Automatic mapping
  - Responds to runtime changes in data size
  - Adapts to new hardware
  - Quicker and easier than manual mapping

Qilin API
- API on top of C/C++
- Built on Intel Threading Building Blocks (TBB) and CUDA
- The Qilin compiler generates TBB and CUDA source code on the fly
- Defines new Qilin types: QArray and QArrayList
- Allows the option to specify which device, for example:
    Add(Qx, Qy, PE_SELECTOR_GPU)
    Add(Qx, Qy, PE_SELECTOR_CPU)
    Add(Qx, Qy, PE_SELECTOR_DEFAULT)
- Uses dynamic compilation to compile API calls into native machine code while the program runs
- Two possible approaches can be used:
  - Stream API
  - Threading API

Stream API operations

Example 1: Stream API
- Convert normal arrays into QArrays
- All operations here are elementwise in parallel (sum does the reduction)
- Convert QArrays back to normal arrays

Example 2: Threading API
- CPU implementation
- GPU implementation

Threading API Continued
- Convert normal arrays into 2D QArrays
- Create a glued implementation of the function
- Allow work to be divided between CPU and GPU, and create the argument list
- Run the function with the argument list and the default mapping

Compilation of Qilin Programs
Uses dynamic compilation at runtime with the following steps:
1) Build directed acyclic graphs (DAGs) from API calls according to data dependencies
2) Decide the mapping from computation to processing elements (CPU and/or GPU)
3) Optimize the DAGs via coalescing and removal of temporary arrays
4) Code generation

Compilation of Qilin Programs (continued)

Adaptive Mapping
- Adaptive mapping automatically finds the near-optimal mapping from computations to processing elements
- Optimal for the given application, problem size, and system configuration
- Stores a database of execution-time projections for every program it has ever executed
- The first execution of a program is treated as a training run

First Execution of Program
1) Take input data of size Nt and divide it into two parts, N1 and N2.
2) Use N1 on the CPU and N2 on the GPU.
3) Subdivide each part and run the pieces on their respective devices.
4) Measure the execution time of each piece.
5) Fit a line to the time data points for each device.
6) Each line then becomes the projection of how execution time on that device scales with data size.

Projection of Execution Time as Data Size Increases

Using the Curves to Determine Division of Work
- Equation of projection lines

Experimental Setup

Results 1: Effect on Power

Results 2: Training Set Size
- Impact of the training set size on adaptive mapping performance (Note: the y-axis is in logarithmic scale. The legend "X%" means the training set size is X% of the reference set size.)

Results 3: Change in Hardware

Issues: Is it really worth the trouble?
- Many problems have one single loop or task that is suited to the GPU
- Data size is often known and fixed
- Threading API: not fun writing two implementations of the same code (CPU and GPU)
- Compilation overhead at runtime

Conclusion
- Automated mapping is preferable to manual techniques
- Qilin works almost as well as manual mapping for:
  - Power consumption
  - Execution time
- Qilin is adaptable to changes in:
  - Input size
  - Hardware configuration

Questions?
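
Backup: Sketch of the Division of Work
The training steps under "First Execution of Program" and the idea under "Using the Curves to Determine Division of Work" can be sketched roughly as follows. This is an illustration, not Qilin's code or API. It assumes each projection is a straight line T(N) = a + b*N fitted by ordinary least squares, and that the split is chosen so both devices are projected to finish at the same time. The names Sample, Line, fit_line, and cpu_share are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One training measurement: piece size and its measured run time.
// (Hypothetical type; units are whatever the timer reports.)
struct Sample {
    double n;
    double t;
};

// Assumed linear projection of execution time: T(N) = a + b * N.
struct Line {
    double a;  // fixed overhead
    double b;  // cost per element
    double at(double n) const { return a + b * n; }
};

// Ordinary least-squares fit of a line through the training samples.
Line fit_line(const std::vector<Sample>& samples) {
    assert(samples.size() >= 2);
    double sn = 0.0, st = 0.0, snn = 0.0, snt = 0.0;
    for (const Sample& s : samples) {
        sn += s.n;
        st += s.t;
        snn += s.n * s.n;
        snt += s.n * s.t;
    }
    const double m = static_cast<double>(samples.size());
    const double denom = m * snn - sn * sn;
    assert(denom != 0.0);  // needs at least two distinct sizes
    const double b = (m * snt - sn * st) / denom;
    const double a = (st - b * sn) / m;
    return Line{a, b};
}

// Fraction beta of the n elements to give to the CPU (the rest goes to the
// GPU), chosen so the two projected times are equal:
//   a_c + b_c * beta * n = a_g + b_g * (1 - beta) * n
// The result is clamped to [0, 1]: 0 means GPU only, 1 means CPU only.
// Assumes both slopes are positive.
double cpu_share(const Line& cpu, const Line& gpu, double n) {
    const double denom = (cpu.b + gpu.b) * n;
    assert(denom > 0.0);
    const double beta = (gpu.a - cpu.a + gpu.b * n) / denom;
    return std::min(1.0, std::max(0.0, beta));
}
```

As a made-up example, if the fitted lines were T_cpu(N) = 1 + 4N and T_gpu(N) = 11 + N, then for N = 100 the sketch gives the CPU 22% of the elements, and both devices are projected to finish at time 89. For N = 2 the GPU's fixed overhead alone exceeds the CPU's total time, so everything stays on the CPU.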