Qilin: Exploiting Parallelism on Heterogeneous Multiprocessors with Adaptive Mapping

Chi-Keung Luk
SSG Software Pathfinding and Innovation, Intel Corporation, Hudson, MA
Email: chi-keung.luk@intel.com

Sunpyo Hong    Hyesoon Kim
School of Computer Science, Georgia Institute of Technology, Atlanta, GA
Email: hyesoon@cc.gatech.edu

Abstract

Heterogeneous multiprocessors are increasingly important in the multi-core era due to their potential for high performance and energy efficiency. In order for software to fully realize this potential, the step that maps computations to processing elements must be as automated as possible. However, the state-of-the-art approach is to rely on the programmer to specify this mapping manually and statically. This approach is not only labor intensive but also not adaptable to changes in runtime environments like problem sizes and hardware configurations. In this study, we propose adaptive mapping, a fully automatic technique to map computations to processing elements on heterogeneous multiprocessors. We have implemented it in our experimental heterogeneous programming system called Qilin. Our results demonstrate that, for a set of important computation kernels, automatic adaptive mapping achieves a speedup of 9.3x on average over the best serial implementation by judiciously distributing work over the CPU and GPU, which is 69% and 33% faster than using the CPU or GPU alone, respectively. In addition, adaptive mapping is within 94% of the speedup of the best manual mapping found via exhaustive search. To the best of our knowledge, Qilin is the first and only system to date with this capability.

1. Introduction

Multiprocessors have emerged as mainstream computing platforms. Among them, an increasingly popular class are those with heterogeneous architectures. By providing processing elements^1 with different performance/energy characteristics on the same machine, these architectures can deliver high performance and energy efficiency [10]. The most well-known heterogeneous architecture today is probably the IBM/Sony Cell architecture, which consists of a Power processor and eight synergistic processors [20]. In the personal computer (PC) world, a desktop now has a multicore CPU and a GPU, exposing multiple levels of hardware parallelism to software, as illustrated in Figure 1.

^1 The term processing element (PE) is a generic term for a hardware element that executes a stream of instructions.

Figure 1. Multiple levels of hardware parallelism exposed to software on current CPU+GPU systems. (The GPU has tens to hundreds of special-purpose cores, while the CPU has a few general-purpose cores. Within each CPU core, there is short-vector parallelism provided by the SIMD extension of the ISA.)

In order for mainstream programmers to fully tap the potential of heterogeneous multiprocessors, the step that maps computations to processing elements must be as automated as possible. Unfortunately, the state-of-the-art approach [12, 17, 30] is to rely on the programmer to perform this mapping manually: for the Cell, O'Brien et al. extend the IBM XL compiler to support OpenMP on this architecture [17]; for commodity PCs, Linderman et al. propose the Merge framework [12], a library-based system for programming the CPU and GPU together using the map-reduce paradigm. In both cases, the computation-to-processor mappings are statically determined by the programmer.
This manual approach not only puts burdens on the programmer but is also not adaptable, since the optimal mapping is likely to change with different applications, different input problem sizes, and different hardware configurations. In this paper, we address this problem by introducing a fully automatic approach that decides the mapping from computations to processing elements using run-time adaptation. We have implemented our approach in Qilin^2, our experimental system for programming heterogeneous multiprocessors. Experimental results demonstrate that our automated adaptive mapping performs nearly as well as manual mapping while at the same time tolerating changes in input problem sizes and hardware configurations. While Qilin currently focuses on CPU+GPU platforms, our adaptive mapping technique is applicable to heterogeneous platforms in general, including the Cell and others, since it does not require any information about the hardware or machine models. To the best of our knowledge, Qilin is the first and only system to date with this capability.

^2 "Qilin" is a mythical chimerical creature in Chinese culture; we picked it as the project name because of the heterogeneity of this creature.

The rest of this paper is organized as follows. In Section 2, we use a case study to further motivate the need for adaptive mapping. Section 3 then describes the Qilin system in detail. We focus on our runtime adaptation techniques in Section 4. Experimental evidence is given in Section 5. Finally, we relate our work to others in Section 6 and conclude in Section 7.

2. A Case Study: Why Do We Need Adaptive Mapping?

We now motivate the need for adaptive mapping with a case study on matrix multiplication, a very commonly used computation kernel in scientific computing. We measured the parallelization speedups of matrix multiplication on a heterogeneous machine consisting of an Intel multicore CPU and an Nvidia multicore GPU (details of the machine configuration are given in Section 5.1). We did three experiments with different input matrix sizes and numbers of CPU cores used. In each experiment, we varied the distribution of work between the CPU and GPU. For matrix multiplication C = A * B, we first divide A into two smaller matrices A1 and A2 by rows. We then compute C1 = A1 * B on the CPU and C2 = A2 * B on the GPU in parallel. Finally, we obtain C by combining C1 and C2. We use the best matrix multiplication libraries available: for the CPU, we use the Intel Math Kernel Library (MKL) [8]; for the GPU, we use the CUDA CUBLAS library [15].
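To make the row-wise split concrete, the following sketch (our own illustration, not Qilin code) shows how C1 and C2 could be computed from the two halves of A. CpuSgemm() and GpuSgemm() are hypothetical wrappers around the MKL and CUBLAS sgemm routines, and launching the two calls concurrently (e.g., issuing the GPU call from a separate thread) is omitted for brevity.

// Hypothetical wrappers (not part of Qilin): row-major C = A * B, where
// A is rows x k, B is k x n, and C is rows x n.
void CpuSgemm(const float* A, const float* B, float* C, int rows, int k, int n);
void GpuSgemm(const float* A, const float* B, float* C, int rows, int k, int n);

// Split an n x n multiplication by rows: the first rowsCpu rows of A (and C)
// are handled on the CPU, the remaining rows on the GPU. C is simply the
// concatenation of C1 and C2 by rows, so no extra merge step is needed.
void SplitSgemm(const float* A, const float* B, float* C, int n, double cpuFraction) {
    int rowsCpu = static_cast<int>(n * cpuFraction);   // rows assigned to the CPU
    int rowsGpu = n - rowsCpu;                         // rows assigned to the GPU
    CpuSgemm(A, B, C, rowsCpu, n, n);                                    // C1 = A1 * B
    GpuSgemm(A + rowsCpu * n, B, C + rowsCpu * n, rowsGpu, n, n);        // C2 = A2 * B
}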
Figure 2 shows the results of these three experiments. All input matrices are N * N square matrices. The y-axis is the speedup over the serial case. The x-axis is the distribution of work across the CPU and GPU, where the notation "X/Y" means X% of work mapped to the GPU and Y% of work mapped to the CPU. At the two extremes are the cases where we schedule all the work on either the GPU or the CPU. In Experiment 1, we use a relatively small problem size (N = 1000) with eight CPU cores. The low computation-to-communication ratio with this problem size renders the GPU less effective. As a result, the optimal mapping is to schedule all work on the CPU. In Experiment 2, we increase the problem size to N = 6000 and keep the same number of CPU cores. Now, with a higher computation-to-communication ratio, the GPU becomes more effective: both the GPU-alone and the CPU-alone speedups are over 7x. The optimal mapping is to schedule 60% of work on the GPU and 40% on the CPU, resulting in a 10.3x speedup. In Experiment 3, we keep the problem size at N = 6000 but reduce the number of CPU cores to only two. With much less CPU horsepower, the CPU-only speedup is limited to 2x, and the optimal mapping now is to schedule 80% of work on the GPU and 20% on the CPU.

It is clear from these experiments that even for a single application, the optimal mapping from computations to processing elements highly depends on the input problem size and the hardware capability. Needless to say, different applications would have different optimal mappings. Therefore, we believe that any static mapping technique would not be satisfactory. What we want is a dynamic mapping technique that can automatically adapt to the runtime environment, as we are going to propose next.

Figure 2. Matrix multiplication experiments with different input matrix sizes and numbers of CPU cores used. The input matrices are N by N. The notation "X/Y" on the x-axis means X% of work mapped to the GPU and Y% of work mapped to the CPU. (Panels: (a) Experiment 1, N = 1000, 8 CPU cores; (b) Experiment 2, N = 6000, 8 CPU cores; (c) Experiment 3, N = 6000, 2 CPU cores. The y-axis is the speedup over serial.)

3. The Qilin Programming System

Qilin is a programming system we have recently developed for heterogeneous multiprocessors. Figure 3 shows the software architecture of Qilin. At the application level, Qilin provides an API to programmers for writing parallelizable operations. By explicitly expressing these computations through the API, the compiler is alleviated from the difficult job of extracting parallelism from serial code and can instead focus on performance tuning. Similar to OpenMP, the Qilin API is built on top of C/C++ so that it can be easily adopted. But unlike standard OpenMP, where parallelization only happens on the CPU, Qilin can exploit the hardware parallelism available on both the CPU and GPU.

Figure 3. Qilin software architecture.

Beneath the API layer is the Qilin system layer, which consists of a dynamic compiler and its code cache, a number of libraries, a set of development tools, and a scheduler. The compiler dynamically translates the API calls into native machine code.
It also decides the near-optimal mapping from computations to processing elements using an adaptive algorithm. To reduce compilation overhead, translated code is stored in the code cache so that it can be reused without recompilation whenever possible. Once native machine code is available, it is scheduled to run on the CPU and/or GPU by the scheduler. Libraries include commonly used functions such as BLAS and FFT. Finally, debugging, visualization, and profiling tools can be provided to facilitate the development of Qilin programs.

Current Implementation
Qilin is currently built on top of two popular C/C++ based threading APIs: Intel Threading Building Blocks (TBB) [24] for the CPU and Nvidia CUDA [16] for the GPU. Instead of directly generating CPU and GPU native machine code, the Qilin compiler generates TBB and CUDA source code from Qilin programs on the fly and then uses the system compilers to generate the final machine code. The "Libraries" component in Figure 3 is implemented as wrapper calls to the libraries provided by the vendors: Intel MKL [8] for the CPU and Nvidia CUBLAS [15] for the GPU. For the "Scheduler" component, we simply take advantage of the TBB task scheduler to schedule all CPU threads, and we dedicate one CPU thread to handle all handshaking with the GPU. The "Dev. tools" component is still under development. In the rest of this section, we focus our discussion on the Qilin API and the Qilin compiler. We discuss our adaptive mapping technique in Section 4.

3.1 Qilin API

Qilin defines two new types, QArray and QArrayList, using C++ templates. A QArray represents a multidimensional array of a generic type. For example, QArray<float> is the type for an array of floats. QArray is an opaque type and its actual implementation is hidden from the programmer. A QArrayList represents a list of QArray objects and is also an opaque type.

Qilin parallel computations are performed on either QArray's or QArrayList's. There are two approaches to writing these computations. The first approach is called the Stream-API approach, where the programmer solely uses the Qilin stream API, which implements common data-parallel operations including elementwise, reduction, and linear algebra operations on QArray's. Our stream API is similar to those found in some GPGPU systems [2, 6, 19, 23, 28]. However, since Qilin targets heterogeneous machines, it also allows the programmer to optionally select the processing elements. For instance, "QArray<float> Qsum = Add(Qx, Qy, PE_SELECTOR_GPU)" specifies that the addition of the two arrays Qx and Qy must be performed on the GPU. Other possible selector values are PE_SELECTOR_CPU for choosing the CPU and PE_SELECTOR_DEFAULT for choosing the default mapping scheme, which can be a static one or the adaptive mapping scheme (see Section 4).

Figure 4 shows a function MySgemm() which uses the Qilin stream API to perform matrix multiplication. The function QArray::Create2D() creates a 2D QArray from a normal C array. Its reverse is QArray::ToNormalArray(), which converts a QArray back to a normal C array. BlasSgemm() is the BLAS Sgemm() function provided by Qilin. Since PE_SELECTOR_DEFAULT is used in the call, BlasSgemm() can be mapped to the CPU, the GPU, or both, depending on the default mapping scheme.

void MySgemm(float* A, float* B, float* C, int m, int k, int n,
             float alpha, float beta)
{
    // Create Qilin arrays from normal arrays
    QArray<float> qA = QArray<float>::Create2D(m, k, A);
    QArray<float> qB = QArray<float>::Create2D(k, n, B);
    QArray<float> qC = QArray<float>::Create2D(m, n, C);

    // Invoke the Qilin version of BLAS Sgemm() on the processing
    // elements determined by the default mapping scheme
    qC = BlasSgemm(qA, qB, qC, alpha, beta, PE_SELECTOR_DEFAULT);

    // Convert from qC[] back to C[] and this triggers the lazy evaluation
    qC.ToNormalArray(C, m*n*sizeof(float));
}

Figure 4. Matrix multiplication written with the Stream-API approach.
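As a usage sketch (our own illustration, not from the paper), MySgemm() can then be invoked on ordinary C arrays like any other C++ function; the matrix sizes and initial values below are assumptions for illustration only.

#include <vector>

// Hypothetical driver for MySgemm() from Figure 4.
void RunSgemmExample() {
    const int m = 1024, k = 1024, n = 1024;
    std::vector<float> A(m * k, 1.0f), B(k * n, 1.0f), C(m * n, 0.0f);
    // C = 1.0 * A * B + 0.0 * C; Qilin decides how the work is split
    // between the CPU and GPU under the default mapping scheme.
    MySgemm(A.data(), B.data(), C.data(), m, k, n, 1.0f, 0.0f);
}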
The second approach to writing parallel computations is called the Threading-API approach, in which the programmer provides the parallel implementations of computations in the threading APIs on which Qilin is built (i.e., TBB on the CPU side and CUDA on the GPU side for our current implementation). A Qilin operation is then defined to glue these two implementations together. When Qilin compiles this operation, it will automatically partition computations across processing elements and eventually merge the results.

Figure 5 shows an example of using the Threading-API approach to write an image filter. First, the programmer provides a TBB implementation of the filter in CpuFilter() and a CUDA implementation in GpuFilter() (the TBB and CUDA code is omitted for clarity). Since both TBB and CUDA work on normal arrays, we need to convert Qilin arrays back to normal arrays before invoking TBB and CUDA. Second, we use the Qilin function MakeQArrayOp() to make a new operation myFilter out of CpuFilter() and GpuFilter(). Third, we construct the argument list of myFilter from the two Qilin arrays qSrc and qDst. The keyword QILIN_PARTITIONABLE tells Qilin that the associated computations of both arrays can be partitioned to run on the CPU and GPU. Fourth, we call another Qilin function, ApplyQArrayOp(), to apply myFilter with the argument list using the default mapping scheme. The result of ApplyQArrayOp() is a Qilin boolean array qSuccess of a single element, which indicates whether the operation is applied successfully. Finally, we convert qSuccess to a normal boolean value.

void CpuFilter(QArray<Pixel> qSrc, QArray<Pixel> qDst) {
    Pixel* src_cpu = qSrc.NormalArray(), *dst_cpu = qDst.NormalArray();
    int height_cpu = qSrc.DimSize(0), width_cpu = qSrc.DimSize(1);
    // ... Filter implementation in TBB ...
}

void GpuFilter(QArray<Pixel> qSrc, QArray<Pixel> qDst) {
    Pixel* src_gpu = qSrc.NormalArray(), *dst_gpu = qDst.NormalArray();
    int height_gpu = qSrc.DimSize(0), width_gpu = qSrc.DimSize(1);
    // ... Filter implementation in CUDA ...
}

void MyFilter(Pixel* src, Pixel* dst, int height, int width) {
    // Create Qilin arrays from normal arrays
    QArray<Pixel> qSrc = QArray<Pixel>::Create2D(height, width, src);
    QArray<Pixel> qDst = QArray<Pixel>::Create2D(height, width, dst);

    // Define myFilter as an operation that glues CpuFilter() and GpuFilter()
    QArrayOp myFilter = MakeQArrayOp("myFilter", CpuFilter, GpuFilter);

    // Build the argument list for myFilter. QILIN_PARTITIONABLE means the
    // associated computation can be partitioned to run on both CPU and GPU.
    QArrayOpArgsList argList;
    argList.Insert(qSrc, QILIN_PARTITIONABLE);
    argList.Insert(qDst, QILIN_PARTITIONABLE);

    // Apply myFilter with argList using the default mapping scheme
    QArray<BOOL> qSuccess = ApplyQArrayOp(myFilter, argList, PE_SELECTOR_DEFAULT);

    // Convert from qSuccess[] to success, and this triggers the lazy evaluation
    BOOL success;
    qSuccess.ToNormalArray(&success, sizeof(BOOL));
}

Figure 5. Image filter written with the Threading-API approach.

3.2 Dynamic Compilation

Qilin uses dynamic compilation to compile Qilin API calls into native machine code while the program runs. The main advantage of dynamic over static compilation is the ability to adapt to changes in the runtime environment. The downside of dynamic compilation is the compilation overhead incurred at run time. However, we argue that this overhead is largely amortized in the typical Qilin usage model, where a relatively small program runs for a long period of time.
The Qilin dynamic compilation consists of the following four steps:

1. Building directed acyclic graphs (DAGs) from Qilin API calls: DAGs are built according to the data dependencies among QArrays in the API calls. These DAGs are essentially the intermediate representation on which later steps in the compilation process operate.

2. Deciding the mapping from computations to processing elements: This step either uses the programmer-specified choice at each operation (see Section 3.1) or uses the automatic adaptive mapping technique (see Section 4).

3. Performing optimizations on DAGs: A number of optimizations are applied to the DAGs. The most important ones are (i) operation coalescing and (ii) removal of unnecessary temporary arrays. Operation coalescing groups as many operations running on the same processing elements as possible into a single function, thereby reducing the overhead of scheduling individual operations. It is also important to remove the allocation/deallocation and copying of temporary arrays used in the intermediate computations of QArrays.

4. Code generation: At this point, we have the optimized DAGs and their computation-to-processor mappings decided. One additional step we need to do here is to ensure that all hardware resource constraints are satisfied. The most common issue is the memory requirement of computations that are mapped to the GPU, because the amount of GPU physical memory available is relatively limited (typically less than 1GB) and the GPU does not have virtual memory. To cope with this issue, if Qilin estimates that the GPU memory requirement of a DAG exceeds the limit, it will split the DAG into multiple smaller DAGs and run them sequentially. Once all resource constraints are taken care of, Qilin generates the native machine code from the DAGs according to the mappings. Qilin also automatically generates all gluing code needed to combine the results from the CPU and the GPU.

3.2.1 Reducing Compilation Overhead

To reduce the runtime overhead, dynamic compilation is mostly done in a lazy-evaluation manner: When a Qilin program is executed, DAGs are built (i.e., Step 1 in the compilation process) as API calls are encountered. Nevertheless, the remaining three steps are not invoked until the computation results are really needed, that is, when we perform a QArray::ToNormalArray() call to convert from a QArray to a normal C array. Thus, dynamically dead Qilin API calls will not cause any compilation. In addition, code generated for a particular DAG is stored in a software code cache so that if the same DAG is seen again later in the same program run, we do not have to redo Steps 2-4.
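To illustrate how such reuse could be organized, here is a minimal sketch (our own, not Qilin's actual implementation) of a code cache keyed by a DAG signature. CompiledCode and CompileDag() are hypothetical placeholders standing in for Qilin's generated code and for Steps 2-4 of the compilation process.

#include <string>
#include <unordered_map>

struct CompiledCode;                                         // opaque handle to generated code (assumed)
CompiledCode* CompileDag(const std::string& dagSignature);   // assumed: runs Steps 2-4 for one DAG

class CodeCache {
public:
    CompiledCode* Lookup(const std::string& dagSignature) {
        auto it = cache_.find(dagSignature);
        if (it != cache_.end()) return it->second;           // hit: reuse code, skip Steps 2-4
        CompiledCode* code = CompileDag(dagSignature);        // miss: compile once and remember
        cache_[dagSignature] = code;
        return code;
    }
private:
    std::unordered_map<std::string, CompiledCode*> cache_;
};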
4. Adaptive Mapping

Qilin's adaptive mapping is a technique to automatically find the near-optimal mapping from computations to processing elements for the given application, problem size, and hardware configuration. To facilitate our discussion, let us introduce the following notation:

    T_C(N)   the actual time to execute the given program of problem size N on the CPU
    T_G(N)   the actual time to execute the given program of problem size N on the GPU
    T'_C(N)  Qilin's projection of T_C(N)
    T'_G(N)  Qilin's projection of T_G(N)

One approach to predicting T_C(N) and T_G(N) is to build an analytical performance model based on static analysis. While this approach might work well for simple programs, it is unlikely to be sufficient for more complex programs. In particular, predicting T_C(N) using an analytical model is challenging with features like out-of-order execution, speculation, and prefetching on modern processors. Instead, Qilin takes an empirical approach, as pictured in Figure 6 and explained below.

Qilin maintains a database that provides execution-time projections for all the programs it has ever executed. The first time that a program is run under Qilin, it is used as the training run (see Figure 6(a)). Suppose the input problem size for this training run is N_t. Qilin divides the input into two parts of size N_1 and N_2. The N_1 part is mapped to the CPU while the N_2 part is mapped to the GPU. Within the CPU part, it further divides N_1 into m subparts N_{1,1} ... N_{1,m}. Each subpart N_{1,i} is run on the CPU and the execution time T_C(N_{1,i}) is measured. Similarly, the N_2 part is further divided into m subparts N_{2,1} ... N_{2,m} and their execution times on the GPU, T_G(N_{2,i}), are measured. Once all T_C(N_{1,i})'s and T_G(N_{2,i})'s are available, Qilin uses curve fitting to construct two linear equations, T'_C(N) and T'_G(N), as projections of the actual execution times T_C(N) and T_G(N), respectively. That is:

    T_C(N) ≈ T'_C(N) = a_c + b_c * N    (1)
    T_G(N) ≈ T'_G(N) = a_g + b_g * N    (2)

where a_c, b_c, a_g, and b_g are constant real numbers.

Figure 6. Qilin's adaptive mapping technique. ((a) Training run: the training input of size N_t is split into subparts whose CPU and GPU execution times are measured and curve-fitted into T'_C(N) and T'_G(N), which are stored in the database. (b) Reference run: for a new input size N_r, the stored projections are used to find the β that minimizes T_β(N_r).)

The next time that Qilin runs the same program with a different input problem size N_r, it can use the execution-time projections stored in the database to determine the computation-to-processor mapping (see Figure 6(b)). Let β be the fraction of work mapped to the CPU and T_β(N) be the actual time to execute β of the work on the CPU and (1 − β) of the work on the GPU in parallel. Then:

    T_β(N) = Max( T_C(βN), T_G((1 − β)N) )
           ≈ Max( T'_C(βN), T'_G((1 − β)N) )    (3)

The above equation assumes that running the CPU and GPU simultaneously is as fast as running them standalone. In reality, this is not the case, since the CPU and GPU do contend for common resources, including bus bandwidth, and CPU processor cycles are needed to handshake with the GPU. So, we take these factors into account by multiplying a factor into Equation (3):

    T_β(N) ≈ Max( (p/(p−1)) * T'_C(βN), T'_G((1 − β)N) )    (4)

where p is the number of CPU cores. Here we dedicate one CPU core to handshaking with the GPU, and we assume that bus bandwidth is not a major factor.

Now, we can plug the input problem size N_r into Equation (4). T'_C(βN_r) becomes a linear equation of the single variable β, and the same holds for T'_G((1 − β)N_r). We need to find the value of β that minimizes Max( (p/(p−1)) * T'_C(βN_r), T'_G((1 − β)N_r) ). This is equivalent to finding the β at which (p/(p−1)) * T'_C(βN_r) and T'_G((1 − β)N_r) intersect. There are three possible cases, as illustrated in Figure 7. In case (i), where the intersection happens at β ≤ 0, Qilin maps all work to the GPU; in case (ii), where the intersection happens at β ≥ 1, Qilin maps all work to the CPU; in case (iii), where the intersection happens at 0 < β_min < 1, Qilin maps β_min of the work to the CPU and 1 − β_min of the work to the GPU.

Figure 7. Three possible cases of the β in Equation (4): (i) the CPU and GPU curves intersect at β ≤ 0, so the time is minimized by mapping all work to the GPU; (ii) they intersect at β ≥ 1, so the time is minimized by mapping all work to the CPU; (iii) they intersect at some 0 < β_min < 1, so the time is minimized by mapping β_min of the work to the CPU.

Finally, if Qilin detects any hardware changes since the last training run for a particular program was done, it will trigger a new training run for the new hardware configuration and use the results for future performance projection.
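To make the procedure concrete, the sketch below (our own illustration, not Qilin source code) fits the constants a_c, b_c, a_g, b_g of Equations (1) and (2) by ordinary least squares from the measured subpart timings, and then solves Equation (4) for the β at which the two projected curves intersect, clamping the result to [0, 1] as in the three cases of Figure 7. The function names and the use of double precision are assumptions for illustration.

#include <algorithm>
#include <cstddef>
#include <vector>

// Least-squares fit of T ≈ a + b*N from measured (N, T) pairs,
// corresponding to Equations (1) and (2).
void FitLine(const std::vector<double>& N, const std::vector<double>& T,
             double& a, double& b) {
    double n = static_cast<double>(N.size());
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (std::size_t i = 0; i < N.size(); ++i) {
        sx += N[i]; sy += T[i]; sxx += N[i] * N[i]; sxy += N[i] * T[i];
    }
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    a = (sy - b * sx) / n;
}

// Solve (p/(p-1)) * (ac + bc*beta*Nr) = ag + bg*(1-beta)*Nr for beta and
// clamp to [0, 1]: beta <= 0 maps everything to the GPU (case i),
// beta >= 1 maps everything to the CPU (case ii), otherwise case iii.
double ChooseBeta(double ac, double bc, double ag, double bg, double Nr, int p) {
    double f = static_cast<double>(p) / (p - 1);        // CPU handshaking penalty
    double beta = (ag + bg * Nr - f * ac) / ((f * bc + bg) * Nr);
    return std::min(1.0, std::max(0.0, beta));
}

The clamped β is the fraction of work that would be sent to the CPU for a reference run of size N_r.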
5. Evaluation

We now present the methodology used to evaluate the effectiveness of adaptive mapping and the results.

5.1 Methodology

Our evaluation was done on a heterogeneous PC consisting of a multicore CPU and a high-end GPU, as described in Table 1.

                      CPU                                      GPU
  Architecture        Intel Core2 Quad                         Nvidia 8800 GTX
  Core Clock          2.4 GHz                                  575 MHz
  Number of Cores     8 cores (on two sockets)                 128 stream processors
  Memory Size         4 GB                                     768 MB
  Memory Bandwidth    8 GB/s                                   86.4 GB/s
  Threading API       Intel Threading Building Blocks (TBB)    Nvidia CUDA 1.1
  Compiler            Intel C Compiler (ICC 10.1, "-fast")     Nvidia C Compiler (NVCC 1.1, "-O3")
  OS                  32-bit Linux Fedora Core 6

Table 1. Experimental Setup

Table 2 lists our benchmarks, including the ones used in the Merge work [12], the most relevant previous study, and a few other important computation kernels. We compare Qilin's adaptive implementation against the best available CPU or GPU implementations whenever possible: for the GPU implementations, we use the ones provided in the CUDA Software Development Kit (SDK) [14] if available. We measured the wall-clock execution time, including the time required to transfer data between the CPU and the GPU when the GPU is used. For Qilin's adaptive versions, we did one training run followed by a variable number of reference runs until the total execution time (including both training and reference runs) reached one hour. We then report the average execution time per reference run (i.e., the total execution time divided by the number of reference runs). This emulates the expected usage model of Qilin, where a program is trained once and then used many times afterwards. In addition, since our current prototype uses source-to-source translation and invokes the system compilers to generate the final executables (see Section 3), the dynamic compilation overhead is fairly high, so we amortize it over multiple reference runs. In a production environment with a full-fledged compiler that generates code all the way from source to binary, the compilation overhead would be much less of an issue.

  Benchmark       Description                                          Input Problem Size                      Serial Time   Origin
  Binomial        American option pricing                              1000 options, 2048 steps                11454 ms      CUDA SDK [14]
  BlackScholes    European option pricing                              10000000 options                        343 ms        CUDA SDK
  Convolve        2D separable image convolution                       12000 x 12000 image, kernel radius 8    10844 ms      CUDA SDK
  MatrixMultiply  Dense matrix multiplication                          6000 x 6000 matrices                    37583 ms      CUDA SDK
  Linear          Linear image filter (output pixel is the average
                  of a 9-pixel square)                                 13000 x 13000 image                     6956 ms       Merge [12]
  Sepia           Modify RGB values to artificially age images         13000 x 13000 image                     2307 ms       Merge
  Smithwat        Compute scoring matrix for a pair of DNA sequences   2000 base pairs                         26494 ms      Merge
  Svm             Kernel from an SVM-based face classifier             736 x 992 image                         491 ms        Merge

Table 2. Benchmarks

5.2 Results

In this section, we first present results that compare adaptive mapping against manual mapping, using training set sizes identical to the reference set sizes. Second, we present results with different training input sizes. Finally, we present results with different hardware configurations.

5.2.1 Effectiveness of Adaptive Mapping

Figure 8 shows the performance comparison between automatic adaptive mapping and other mappings. The y-axis is the speedup over the serial case in logarithmic scale. The x-axis shows the benchmarks and the geometric mean. For each benchmark, there are four bars representing different mapping schemes. "CPU-always" means always scheduling all the computations to the CPU, while "GPU-always" means scheduling all the computations to the GPU. "Manual mapping" means the programmer manually determines the near-optimal mapping through an exhaustive search. "Adaptive mapping" means Qilin automatically determines the near-optimal mapping via adaptive mapping. Both "Manual mapping" and "Adaptive mapping" use training inputs that are identical to the reference inputs (we investigate the impact of different training inputs in Section 5.2.2). Table 3 reports the distribution of computations under the last two mapping schemes. Table 4 reports the linear equations T'_C(N) and T'_G(N) that Qilin constructed for each benchmark via curve fitting. Note that the equations are fairly different for different benchmarks, indicating that Qilin automatically adjusts the time predictions for different programs.

Figure 8. Performance of Adaptive Mapping. (The y-axis is the speedup over serial, in logarithmic scale; the four bars per benchmark are CPU-always, GPU-always, Manual mapping, and Adaptive mapping.)

                  Manual mapping        Adaptive mapping
                  CPU      GPU          CPU      GPU
  Binomial        10%      90%          10.5%    89.5%
  BlackScholes    40%      60%          46.5%    53.5%
  Convolve        40%      60%          36.3%    63.7%
  MatrixMultiply  40%      60%          45.5%    54.5%
  Linear          60%      40%          50.8%    49.2%
  Sepia           80%      20%          76.2%    23.8%
  Smithwat        60%      40%          59.3%    40.7%
  Svm             10%      90%          14.3%    85.7%

Table 3. Distribution of computations under the manual mapping and adaptive mapping in Figure 8.

                  CPU: T'_C(N) = a_c + b_c * N    GPU: T'_G(N) = a_g + b_g * N
                  a_c        b_c                  a_g        b_g
  Binomial        16.21      1.61                 1.20       0.26
  BlackScholes    4.40       0.000015             0.98       0.000013
  Convolve        6.45       0.18                 4.50       0.10
  MatrixMultiply  24.23      0.00017              96.58      0.00014
  Linear          4.07       0.0000076            2.73       0.0000079
  Sepia           3.80       0.0000028            4.98       0.0000089
  Smithwat        178.85     2.55                 13.71      3.91
  Svm             3.81       0.0011               5.34       0.00017

Table 4. The time-projection equations constructed by Qilin via curve fitting (i.e., Equations (1) and (2) in Section 4).

On average, adaptive mapping is 69% faster than always using the CPU and 33% faster than always using the GPU. It is within 94% of the near-optimal mapping found by the programmer via exhaustive search. In the cases of Binomial, Convolve, and Svm, adaptive mapping performs slightly better than manual mapping.

There are two major factors that determine the performance of adaptive mapping. The first factor is how accurate Qilin's performance projections are (i.e., Equation (4) in Section 4). Table 3 shows that the work distributions under the two mapping schemes are similar, indicating that Qilin's performance projections are fairly accurate. Figure 9 evaluates the accuracy of Qilin's performance projection in more detail for Binomial. It plots the actual and predicted execution times for both the CPU and GPU with different problem sizes. As shown, the actual and predicted execution-time curves are very close in both the CPU and GPU cases. Similar observations are made in the other benchmarks as well.

Figure 9. Accuracy of Qilin's performance projections in Binomial. (Actual vs. predicted execution time in ms on the CPU and GPU, for problem sizes of 100 to 1000 options.)

The second factor is the runtime overhead of adaptive mapping. Among our benchmarks, Smithwat is the only one that significantly suffers from this overhead (see the relatively low performance of "Adaptive mapping" for Smithwat in Figure 8). This benchmark computes the scoring matrix for a pair of DNA sequences. The serial version is sketched in Figure 10(a). The two-level for-loop considers all possible pairs of elements (one element from seqA and the other from seqB). The Qilin version is sketched in Figure 10(b). Data dependency constrains us to parallelize the inner loop instead of the outer loop. Consequently, the Qilin version has to pay the adaptation overhead in each iteration of the outer loop (i.e., the work done in MySmithwat(), including converting between normal arrays and Qilin arrays and defining and applying new QArrayOp's).

(a) Serial version

void SerialSmithwat(float* score,
                    float* seqA, int startA, int lenA,
                    float* seqB, int startB, int lenB) {
    for (int i = startA; i < lenA; i++)
        for (int j = startB; j < lenB; j++)
            ScoreOnePair(score, seqA, i, lenA, seqB, j, lenB);
}

(b) Qilin version

void MySmithwat(float* score,
                float* seqA, int i, int lenA,
                float* seqB, int startB, int lenB) {
    QArray<float> q1 = QArray<float>::Create1D(...);
    QArray<float> q2 = QArray<float>::Create1D(...);

    QArrayOp mySmithwat = MakeQArrayOp(...);
    QArrayOpArgsList argList;
    argList.Insert(q1, ...);
    argList.Insert(q2, ...);

    QArray<BOOL> qSuccess = ApplyQArrayOp(mySmithwat, argList, ...);
    qSuccess.ToNormalArray(...);
}

void QilinSmithwat(float* score,
                   float* seqA, int startA, int lenA,
                   float* seqB, int startB, int lenB) {
    for (int i = startA; i < lenA; i++)
        MySmithwat(score, seqA, i, lenA, seqB, startB, lenB);
}

Figure 10. Explaining the high adaptation overhead in Smithwat.

5.2.2 Impact of Training Input Size

Figure 11 shows the impact of the training set size on the performance of adaptive mapping. Each benchmark uses six different training set sizes, ranging from 10% to 100% of the reference set size (i.e., the "100%" bars in Figure 11 are the same as the "Adaptive mapping" bars in Figure 8). Figure 11 shows that much of the performance benefit of adaptive mapping is preserved as long as the training set size is at least 30% of the reference set size. When the training set size is only one-tenth of the reference set size, the average adaptive-mapping speedup drops to 7.5x, but is still higher than the 7.0x and 5.5x speedups obtained by using the GPU or CPU alone, respectively. These results provide a guideline on when Qilin should apply adaptive mapping if the actual problem sizes are significantly different from the training input sizes stored in the database.

Figure 11. Impact of the training set size on adaptive mapping performance. (Note: The y-axis is in logarithmic scale. The legend "X%" means the training set size is X% of the reference set size.)

5.2.3 Adapting to Hardware Changes

One important advantage of adaptive mapping is its capability to adjust to hardware changes. To demonstrate this advantage, we did the following two experiments. In the first experiment, we replaced our GPU by a less powerful one (Nvidia 8800 GTS) but kept the same 8-core CPU. The 8800 GTS has fewer stream processors (96) and less memory (640 MB) than the 8800 GTX.
The performance of adaptive mapping with this new hardware configuration is shown in Figure 12(a). The "GPU-always" speedup with the 8800 GTS is 5.7x, compared to the 7x speedup with the 8800 GTX in Figure 8, a 19% performance reduction. Adaptive mapping automatically re-distributes the computations for this change. The new work distribution is shown under the "Less powerful GPU" column in Table 5. Comparing this against the original work distribution in Table 3, Qilin has shifted more work from the GPU to the CPU. Consequently, adaptive mapping achieves an 8.2x speedup, a 12% performance reduction compared against its 9.3x speedup in Figure 8. Note that the performance reduction in the "Adaptive mapping" case (12%) is less than that in the "GPU-always" case (19%), because Qilin is able to recover part of the loss from the CPU.

In the second experiment, we went to the other extreme, where we replaced our 8-core CPU by a 2-core CPU but kept the original GPU (8800 GTX). The performance with this new hardware configuration is shown in Figure 12(b) and the new work distribution is shown under the "Less powerful CPU" column in Table 5. With a 2-core CPU, the average speedup of "CPU-always" is down to 1.5x. Qilin decided to shift most computations to the GPU. As a result, the "Adaptive mapping" speedup in Figure 12(b) is only slightly better than the "GPU-always" speedup.

Figure 12. Performance of Adaptive Mapping with respect to hardware changes: (a) using a less powerful GPU, (b) using a less powerful CPU. (Note: The y-axis is in logarithmic scale.)

                  Less powerful GPU     Less powerful CPU
                  CPU      GPU          CPU      GPU
  Binomial        19.2%    80.8%        0%       100%
  BlackScholes    46.3%    53.7%        9.9%     90.1%
  Convolve        39.4%    60.6%        9.4%     90.6%
  MatrixMultiply  53.6%    46.4%        11.7%    88.3%
  Linear          55.3%    44.7%        14.5%    85.5%
  Sepia           82%      18%          32.4%    67.6%
  Smithwat        60%      40%          22.9%    77.1%
  Svm             15%      85%          0%       100%

Table 5. Distribution of computations under adaptive mapping corresponding to the two hardware changes in Figure 12.

6. Related Work

Heterogeneous multiprocessors have been drawing increasing attention from both the hardware and software research communities. On the hardware side, Kumar et al. [10] demonstrate the advantages of heterogeneous chip multiprocessors (CMPs) over homogeneous CMPs in terms of power and throughput. They predict that once homogeneous CMPs reach a total of four cores, the benefits of heterogeneity will outweigh the benefits of additional homogeneous cores in many applications. They also classify heterogeneous multiprocessors into multi-ISA multiprocessors, such as the Cell and CPU+GPU, and single-ISA multiprocessors [9], such as the upcoming Intel Larrabee [26]. Our adaptive mapping technique is applicable to both single-ISA and multi-ISA heterogeneous multicores. Recently, Hill and Marty [7] also argue that heterogeneous ("asymmetric" in their terminology) multicore designs offer greater potential speedup than homogeneous ("symmetric") designs, provided that software challenges like the computation-to-processor mapping problem can be addressed.

On the software side, there are a number of GPGPU programming systems from both academia and industry.
Most of them, including Stanford's Brook [2], Microsoft's Accelerator [28], Google's Peakstream [19], Rapidmind [23], AMD's Brook+ [1], and Nvidia's CUDA [16], target only the GPU. In contrast, Intel's Ct [6] currently targets the CPU only. Liao et al. [11] developed a compiler that generates multithreaded CPU code from Brook-like programs. Similarly, Stratton et al. [27] have developed MCUDA for translating CUDA kernels to run on multicore CPUs. While Qilin shares some similarities with these systems in its stream API, it differs from them by taking advantage of both the CPU and the GPU simultaneously. As we mentioned in Section 1, IBM's OpenMP extension for Cell [17] and Intel's Merge work [12] are most related to Qilin, as they can also map computations to both the host processor and the special processor. However, the key advantage of Qilin is that the mapping is done automatically and is adaptive to changes in the runtime environment. Most recently, OpenCL [13] has been proposed as a standard API for programming GPUs. At this moment, it is unclear what kind of mechanism will be available in OpenCL to help programmers decide the computation-to-processor mapping. At the operating system level, Ghiasi et al. [5] propose a scheduler that schedules memory-bound tasks to cores running at lower frequencies, thereby limiting system power while minimizing total performance loss.

Our work is also related to program autotuning [3, 4, 18, 21, 22, 25, 29, 31], an increasingly popular approach to producing high-quality portable code by generating many variants of a computation kernel and benchmarking each variant on the target platform. Existing autotuners largely focus on tuning the program parameters that affect memory-hierarchy performance, such as the cache blocking factor and prefetch distance. Adaptive mapping can be viewed as an autotuning technique that tunes the distribution of work on heterogeneous multiprocessors.

7. Conclusion

We have presented adaptive mapping, the first automatic technique that maps computations to processing elements on heterogeneous multiprocessors. We have implemented it in an experimental system called Qilin for programming CPUs and GPUs. We demonstrate that automated adaptive mapping performs close to manual mapping and can adapt to changes in input problem sizes and hardware configurations. We believe that adaptive mapping could be an important technique in the multicore software stack.

References

[1] AMD. AMD Stream SDK User Guide v1.2.1-beta, Oct 2008.
[2] Buck, I., Foley, T., Horn, D., Sugerman, J., Fatahalian, K., Houston, M., and Hanrahan, P. Brook for GPUs: Stream Computing on Graphics Hardware. ACM Transactions on Graphics 23, 3 (2004), 777-786.
[3] Chen, C., Chame, J., Nelson, Y. L., Diniz, P., Hall, M., and Lucas, R. Compiler-Assisted Performance Tuning. In Proceedings of SciDAC 2007, Journal of Physics: Conference Series (June 2007).
[4] Fursin, G. G., O'Boyle, M. F. P., and Knijnenburg, P. M. W. Evaluating Iterative Compilation. In Proceedings of the 2002 Workshop on Languages and Compilers for Parallel Computing.
[5] Ghiasi, S., Keller, T., and Rawson, F. Scheduling for Heterogeneous Processors in Server Systems. In Proceedings of the 2nd Conference on Computing Frontiers (May 2005), pp. 199-210.
[6] Ghuloum, A., Smith, T., Wu, G., Zhou, X., Fang, J., Guo, P., So, B., Rajagopalan, M., Chen, Y., and Chen, B. Future-Proof Data Parallel Algorithms and Software on Intel Multi-Core Architecture. Intel Technology Journal 11, 4, 333-348.
[7] Hill, M., and Marty, M. R. Amdahl's Law in the Multicore Era. IEEE Computer (July 2008), 33-38.
[8] Intel. Intel Math Kernel Library Reference Manual, Sept 2007.
[9] Kumar, R., Farkas, K. I., Jouppi, N. P., Ranganathan, P., and Tullsen, D. Single-ISA Heterogeneous Multicore Architectures: The Potential for Processor Power Reduction. In Proceedings of MICRO'03 (December 2003), pp. 81-92.
[10] Kumar, R., Tullsen, D., Jouppi, N., and Ranganathan, P. Heterogeneous Chip Multiprocessors. IEEE Computer (November 2005), 32-38.
[11] Liao, S.-W., Du, Z., Wu, G., and Lueh, G.-Y. Data and Computation Transformations for Brook Streaming Applications on Multiprocessors. In Proceedings of the 4th Conference on CGO (March 2006), pp. 196-207.
[12] Linderman, M. D., Collins, J. D., Wang, H., and Meng, T. H. Merge: A Programming Model for Heterogeneous Multi-core Systems. In Proceedings of the 2008 ASPLOS (March 2008).
[13] Munshi, A. OpenCL Parallel Computing on the GPU and CPU. In ACM SIGGRAPH 2008 (2008).
[14] Nvidia. CUDA SDK. http://www.nvidia.com/object/cuda_get.html.
[15] Nvidia. CUDA CUBLAS Reference Manual, June 2007.
[16] Nvidia. CUDA Programming Guide v1.0, June 2007.
[17] O'Brien, K., O'Brien, K., Sura, Z., Chen, T., and Zhang, T. Supporting OpenMP on Cell. International Journal on Parallel Programming 36 (2008), 289-311.
[18] Pan, Z., and Eigenmann, R. PEAK—A Fast and Effective Performance Tuning System via Compiler Optimization Orchestration. ACM Transactions on Programming Languages and Systems 30, 3 (May 2008).
[19] Peakstream. Peakstream Stream Platform API C++ Programming Guide v1.0, May 2007.
[20] Pham, D., Asano, S., Bolliger, M., Day, M. M., Hofstee, H. P., Johns, C., Kahle, J., Kameyama, A., Keaty, J., Masubuchi, Y., Riley, M., Shippy, D., Stasiak, D., Suzuoki, M., Wang, M., Warnock, J., Weitzel, S., Wendel, D., Yamazaki, T., and Yazawa, K. The Design and Implementation of a First-Generation CELL Processor. In IEEE International Solid-State Circuits Conference (May 2005), pp. 49-52.
[21] Pouchet, L.-N., Bastoul, C., Cohen, A., and Cavazos, J. Iterative Optimization in the Polyhedral Model: Part II, Multidimensional Time. In Proceedings of the ACM SIGPLAN 08 Conference on PLDI (June 2008).
[22] Puschel, M., Moura, J., Johnson, J., Padua, D., Veloso, M., Singer, B., Xiong, J., Franchetti, F., Gacic, A., Voronenko, Y., Chen, K., Johnson, R., and Rizzolo, N. SPIRAL: Code Generation for DSP Transforms. Proceedings of the IEEE, special issue on Program Generation, Optimization, and Adaptation 93, 2 (2005), 232-275.
[23] Rapidmind. Rapidmind. http://www.rapidmind.net.
[24] Reinders, J. Intel Threading Building Blocks. O'Reilly, July 2007.
[25] Ren, M., Park, J., Houston, M., Aiken, A., and Dally, W. J. A Tuning Framework for Software-Managed Memory Hierarchies. In Proceedings of the 2008 International Conference on PACT.
[26] Seiler, L., Carmean, D., Sprangle, E., Forsyth, T., Abrash, M., Dubey, P., Junkins, S., Lake, A., Sugerman, J., Cavin, R., Espasa, R., Grochowski, E., Juan, T., and Hanrahan, P. Larrabee: A Many-Core x86 Architecture for Visual Computing. In Proceedings of ACM SIGGRAPH 2008 (2008).
[27] Stratton, J. A., Stone, S. S., and Hwu, W. M. W. MCUDA: An Efficient Implementation of CUDA Kernels for Multi-Core CPUs. In Proceedings of the 2008 Workshop on Languages and Compilers for Parallel Computing.
[28] Tarditi, D., Puri, S., and Oglesby, J. Accelerator: Using Data Parallelism to Program GPUs for General-Purpose Uses. In Proceedings of the 2006 ASPLOS (October 2006).
[29] Vuduc, R., Demmel, J., and Yelick, K. OSKI: A Library of Automatically Tuned Sparse Matrix Kernels. In Proceedings of SciDAC 2005, Journal of Physics: Conference Series (June 2005).
[30] Wang, P., Collins, J. D., Chinya, G., Jiang, H., Tian, X., Girkar, M., Yang, N., Lueh, G.-Y., and Wang, H. EXOCHI: Architecture and Programming Environment for a Heterogeneous Multi-core Multithreaded System. In Proceedings of the ACM SIGPLAN 07 Conference on PLDI (June 2007), pp. 156-166.
[31] Whaley, R. C., Petitet, A., and Dongarra, J. J. Automated Empirical Optimization of Software and the ATLAS Project. Parallel Computing 27, 1-2 (2001), 3-35.