Provably Efficient GPU Algorithms

Nodari Sitchinava (Karlsruhe Institute of Technology, Karlsruhe, Germany)
Volker Weichert (Goethe University Frankfurt am Main, Frankfurt, Germany)

arXiv:1306.5076v1 [cs.DS] 21 Jun 2013
Abstract. In this paper we present an abstract model for algorithm design on GPUs by extending the parallel external memory (PEM) model
with computations in internal memory (commonly known as shared memory in GPU literature) defined in the presence of memory banks and bank
conflicts. We also present a framework for designing bank conflict free
algorithms on GPUs. Using our framework we develop the first shared
memory sorting algorithm that incurs no bank conflicts. Our sorting algorithm can be used as a subroutine for comparison-based GPU sorting
algorithms to replace current use of sorting networks in shared memory.
We show experimentally that such substitution improves the runtime of
the mergesort implementation of the THRUST library.
1 Introduction
In the past 10 years GPUs – the massively parallel processors developed for
graphics rendering – have been adapted for general purpose computation and
new languages like C for CUDA and OpenCL have been developed for programming them. With hundreds of cores on the latest hardware, GPUs have become
a standard computational platform in high-performance computing (HPC) and
have been successfully used to analyze data in natural sciences, such as biology,
chemistry, physics and astronomy. However, general purpose computation on
GPUs is still in the early stages and there is very little understanding of what
kind of algorithms translate into efficient GPU implementations.
In this paper we present a theoretical model for GPUs, which simplifies algorithm design while capturing the most important features of the architecture.
Our model is not completely new – some features of our model have appeared
in literature in one way or another. Our contribution is that we identify the
most important aspects of the GPU architecture, consolidate the algorithmic
techniques and design recommendations that appear in literature, and provide
clear complexity metrics for theoretical analysis of GPU algorithms. Our model
is an extension of existing models of computation, allowing reuse of algorithmic
techniques from existing literature applicable to GPUs. The algorithms designed
in our model come close to the performance of finely tuned state-of-the-art GPU implementations, and with some additional fine tuning and optimizations outperform existing implementations. At the same time, our algorithms are simple,
which is a big advantage in code maintainability.
Our model decouples data from threads, which is more similar to the way programs are designed on manycore CPUs. The model also simplifies GPU algorithm design by removing the requirement for coalesced accesses to global memory from the algorithm design process, letting the programmer focus on defining how to process the data in a more familiar way.
Finally, we present a framework that clarifies how to design algorithms to process data in shared memory without bank conflicts. It is well known that bank conflicts — multiple threads trying to simultaneously access different locations on
the same memory bank of shared memory — cause a significant slowdown in
GPU implementations and complicate theoretical analysis of GPU algorithms.
Some researchers have presented techniques to avoid bank conflicts for a simple
sequential scan of an array [5]. However, up to now it was not clear how to avoid
bank conflicts for more complex data access patterns. Using our framework it
becomes very easy to identify a bank conflict free algorithm to sort items in
shared memory. Our sorting algorithm can be used as a black box to replace
current implementations of sorting networks to sort items in the base case of
the comparison-based algorithms, which leads to predictable and better overall
runtimes.
2 An algorithmic model for GPUs
The model consists of p multiprocessors and global memory of conceptually
unlimited size shared among all processors. Each multiprocessor consists of w
scalar processors each running a single thread. In addition each multiprocessor
pi contains a cache Ci of size M , which is accessible only by threads of pi .
Initially, the input resides in global memory. To process data, pi must contain
it in its cache Ci . The transfer between the global memory and caches is performed by each multiprocessor using so-called input-output (I/O) operations. In
each I/O operation, each pi can transfer a single block of B elements between the
global memory and its cache Ci , independently of and concurrently with other
multiprocessors. Thus, in each step, up to p distinct blocks can be transferred
between the global memory and the p caches. In case of concurrent writes to
the same element in global memory, an arbitrary multiprocessor succeeds, while
others fail without notification. Concurrent writes to the same block but distinct
addresses within the block succeed in a single I/O.
The multiprocessors communicate with each other only by writing to and
reading from the global memory. To avoid race conditions to global memory, the
execution between different multiprocessors can be explicitly synchronized with
barrier synchronizations. Each synchronization makes sure that all multiprocessors finish their computations and I/Os before proceeding any further.
Each multiprocessor pi consists of w scalar processors, each running a single
thread and executing synchronously. At each time step, all threads execute the
same instruction, but can apply the instruction to any item in the cache, i.e., pi
is a SIMD multiprocessor. An execution of an instruction is broken into three
phases. In the first phase each thread reads data from the appropriate address. In
the second phase each thread executes the instruction on the data read in the first
phase. In the third phase each thread writes out the result. Each cache consists
of b = w memory banks. If in either phase more than one thread attempts
to access the same memory bank, a bank conflict occurs and only one of the
threads succeeds and finishes the execution. The other threads stall and replay
the instruction in the next time step. Since execution is performed synchronously
across all threads, the threads do not proceed to the next instruction until all
of them finish executing the current one. Thus, bank conflicts are serialized and
can result in up to w time steps to execute a single SIMD instruction.3
Complexity metrics and runtime. The complexity of an algorithm in
our model consists of three metrics. The first one is the parallel I/O complexity
– the number of parallel blocks transferred between the caches and the global
memory. The second one is parallel time complexity – the number of units of
time it takes to execute the instructions by each multiprocessor. Memory bank
conflicts in caches make it difficult to analyze the second metric. Therefore, in
Section 3 we will describe how to develop algorithms without any bank conflicts.
Finally, since barrier synchronizations take a non-trivial amount of time, the third
complexity metric is the number of such synchronizations.
Let λ be the latency of accessing a single block of B elements in global
memory and σ be the upper bound on the time it takes to perform a barrier
synchronization. We call the execution between barrier synchronizations a round.
Let tk (n) and qk (n) be, respectively, the parallel time complexity and parallel
I/O complexity of an algorithm A for round k and R(n) be the total number of
rounds. Then the total runtime of A can be upper bounded as follows:
$$T(n) = \sum_{k=1}^{R(n)} \left( t_k(n) + \lambda\, q_k(n) + \sigma \right) = \sum_{k=1}^{R(n)} t_k(n) + \lambda \sum_{k=1}^{R(n)} q_k(n) + \sigma R(n) \qquad (1)$$
Designing efficient GPU algorithms. An efficient algorithm will minimize the three functions: $R(n)$, $\sum_{k=1}^{R(n)} q_k(n)$, and $\sum_{k=1}^{R(n)} t_k(n)$. A familiar reader
will notice that our model describes the access to global memory exactly as the
parallel external memory model of Arge et al. [1]. Thus, we can use the algorithmic techniques developed in the PEM model to optimize the parallel I/O
complexity q(n) and the number of rounds R(n) in our GPU model. In Section 3 we describe how to design algorithms without bank conflicts. Then, the function tk(n)
for such algorithms can be analyzed similarly to the PRAM algorithms.
Observe that in some cases minimizing both the tk(n) and qk(n) functions will be conflicting goals: optimal parallel I/O complexity for the problem of list ranking results in increased time complexity [4]. Therefore, to obtain the best running time, the values of λ and σ need to be considered.4
3 The special case of multiple threads accessing the same word in shared memory can be executed on modern GPUs in a single time step. However, for simplicity of the model we ignore this case and it can be considered as a fine-grained optimization.
4 In our experiments on NVIDIA GTX580, σ is 1-2 orders of magnitude larger than λ, while λ is 2-3 orders of magnitude larger than access by a thread to data within cache.
2.1 Relation between the model and GPU hardware
We only mention the aspects of GPU hardware that are essential for our model.
For more details of GPU hardware we refer the reader to [6]. Since NVIDIA’s
CUDA is the most commonly applied GPU programming framework, we use it
as a reference to design our model. However, we believe that the concepts can
be applied to any GPU that supports programming beyond custom shaders.
SMs, CTAs and warps. Typical graphics hardware consists of off-chip graphics memory known as global memory and P streaming multiprocessors
(SMs), each consisting of a number of scalar processors and a faster memory
known as shared memory which is shared among the scalar processors within
the SM, but is inaccessible by other multiprocessors.
A CUDA program (kernel) is executed as a number of threads that are
grouped into Cooperative Thread Arrays (CTAs) or thread blocks. The hardware
performs automatic load balancing when assigning the CTAs to SMs, however,
threads of a single CTA are never split among multiple SMs. An SM in turn is
able to accommodate multiple CTAs, which have to share the register file and
shared memory. A CTA is executed in warps, a hardware unit of w threads that
can be executed at the same time (currently w = 32).
Each multiprocessor in our model represents a CTA running a single warp.
The cache in our model represents a fraction of shared memory available to the
CTA. Note that, to keep the model simple, we do not model multiple warps on a CTA. Instead, s CTAs each running r warps on hardware with P SMs can
be approximated in our model by p = sr multiprocessors, each containing a
private cache of size M = P·M̂/(sr), where M̂ is the size of shared memory on each SM. Algorithms designed and analyzed in this setting will only run faster if
implemented with multiple multiprocessors grouped as multiple warps per CTA.
Since the shared memory and register file resources are shared among all
the warps, it is beneficial to keep p as low as possible. However, too few CTAs
and warps could translate into underutilized bandwidth between global memory
and SMs. In our experiments (Figure 4) we observed that p = 8P is enough to
reach the maximum throughput between global memory and P = 16 SMs of the
GTX580 graphics card.
Coalesced access to global memory. The global memory has a much higher latency than the shared memory: shared memory latency is 2 to 14 clock cycles, depending on the literature (compare [14] and [17]), while global memory latency is between 400 and 800 clock cycles [14]. However, by choosing access patterns that transfer w consecutive elements at the same time, the reads can be bundled into one memory operation, potentially decreasing the overall latency by a factor of as much as w = 32 [14]. This is called coalesced memory access. The blockwise access of the I/O operations in our model represents these coalesced memory accesses. That is, any access to global memory on GPUs can load B contiguous
elements in the same time as a single element and can be performed in parallel
by the multiple streaming multiprocessors. In our experiments we observed that
this is true in practice and the value of B for current GPUs can be estimated
anywhere between w and 4w, depending on the number of warps (each containing w threads) that each CTA consists of. We observed that as long as we access full blocks with enough warps, whether in contiguous locations or not (Figure 4), the full SM-to-global-memory bandwidth is reached, thus justifying our model.

Fig. 1: The view of shared memory as a b × ⌈M/b⌉ matrix. [Figure: with b = 8 banks, array elements A0–A63 are laid out so that bank i holds A_i, A_{i+8}, A_{i+16}, ..., i.e., consecutive array elements fall into consecutive banks and each row of the matrix corresponds to one bank.]
Memory banks of shared memory. The memory bank layout of caches in our model mimics the organization of shared memory of SMs. We observed that
memory bank conflicts incur significant penalties on overall running time and
match our predictions of serialization of instructions which cause bank conflicts.
Registers. The fastest memory on GPUs is the register file. Accessing a
register incurs no latency and is thus preferable for many situations. However,
modeling registers adds extra complications. First, registers are private to each
thread and communication of register data among threads of a warp must be performed via shared memory. Second, the access pattern to registers must be known at compile time; otherwise the data is moved to local memory, which is a part of the global memory set aside for each thread. Although local memory is cached in
shared memory, bank conflicts and limited cache size make it difficult to analyze
algorithms that use registers if memory accesses are data-dependent. To keep the
model simple, we do not consider registers. Instead we recommend that explicit
use of registers be considered as a final optimization during implementation,
rather than during algorithm design.
Thread synchronization. Synchronization between all threads of a CTA can be achieved by an intrinsic function (__syncthreads()) and incurs no overhead, but inter-CTA synchronization is costly. While efforts have been made to
achieve synchronization within the program [7], the most common strategy is
to end the running kernel and start another. The time to do this is an order of
magnitude larger than the global memory access latency. The barrier synchronizations of our model represent these types of inter-CTA synchronization. Since
each multiprocessor models a CTA running a single warp, the model does not
consider inter-warp synchronizations within a single CTA.
3 Designing bank conflict free algorithms in shared memory
We model the cache of each multiprocessor of size M as a two dimensional matrix
M with b rows and ⌈M/b⌉ columns similar to the view in [5]. Each row of the
matrix corresponds to a separate memory bank. When a block of B items is
loaded from global memory to cache, the items of the block are stored in B/b
adjacent columns in the column-major order in the matrix. Thus, an array of
n elements loaded into contiguous space in the cache will be placed into n/b
adjacent columns in the column-major order (see Figure 1).
Observe that, under such a view, one can obtain a simple bank conflict free computation by processing each row of the matrix by a different thread. Note that if M = b^2, transposition of the matrix will place each column in a separate row and we can process the original columns with separate threads, again incurring no bank conflicts. Thus, we can design bank conflict free algorithms if we process matrices by rows or columns.
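To make this concrete, the following CUDA sketch (our illustration, not code from the paper) processes such a matrix strictly by rows: one one-warp CTA loads a b × c tile in the column-major layout of Figure 1 and lets thread r touch only the words of logical row r, i.e., only bank r. The kernel name, the per-row reduction, and the launch configuration are our own choices.

#define BANKS 32                              // b = w = 32 banks on current GPUs

// Launch with 32 threads per block and BANKS*colsPerRow*sizeof(int) bytes of
// dynamic shared memory, e.g. rowSums<<<p, 32, 32*cols*sizeof(int)>>>(...).
__global__ void rowSums(const int *in, int *out, int colsPerRow) {
    extern __shared__ int cache[];            // the b x colsPerRow matrix
    int r = threadIdx.x;                      // thread r owns logical row r

    // Load: for each column, the warp reads 32 consecutive global words
    // (coalesced) and writes them to 32 distinct banks (no conflicts).
    for (int c = 0; c < colsPerRow; ++c)
        cache[c * BANKS + r] =
            in[(size_t)blockIdx.x * BANKS * colsPerRow + c * BANKS + r];
    __syncthreads();

    // Process by rows: thread r only ever touches words c*BANKS + r,
    // all of which lie in bank r, so no bank conflicts can occur.
    int sum = 0;
    for (int c = 0; c < colsPerRow; ++c)
        sum += cache[c * BANKS + r];
    out[(size_t)blockIdx.x * BANKS + r] = sum;
}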
This approach can easily be generalized for caches of size M = cb^2 for a small integer constant c, by transforming the matrix from column-major to row-major order and vice versa. Since cache space is limited, we must perform the transformation in-place. Although there is no simple in-place algorithm for transforming general rectangular matrices from row-major to column-major order, c is a sufficiently small integer for current sizes of shared memories on GPUs.
The transformation of a matrix of size M = cb^2 from one order into another can be performed by splitting the rectangular matrix into c square submatrices, transposing each submatrix and remembering to access the correct submatrices by index arithmetic. Bank conflict free transposition of a square b × b matrix can be accomplished trivially in t(b^2) = b^2/w = b parallel time on a multiprocessor with w = b threads. The algorithm is given in Appendix B for completeness.
In Section 5 we use this view of shared memory to implement ShearSort [16] to obtain the first bank conflict free shared memory GPU sorting implementation.
4 Prefix sums and colored prefix sums
Based on our model, we design a simple three round algorithm for prefix sums,
similar to [12]: a +-reduction round, a spine round and a prefix sums round. Each
multiprocessor pi is assigned a tile: ⌈n/p⌉ contiguous elements of the input. To
process the tile, pi repeatedly loads disjoint pages of size M into its cache.
In the reduction round, pi simply reduces all tile elements using the prefix
sums operation. The result of the +-reduction, which is a single value per tile, is
written out in contiguous space in global memory. The spine round computes the
exclusive prefix sums of the +-reduction round's results. Since the input data for this round is very small, we use only a single multiprocessor. In
the prefix sums round, pi scans the tile once more, computing prefix sums using
the inter-tile offsets from the spine round. Within the multiprocessor, we assign
each scalar processor a contiguous part of the page to work on and calculate
intra-page offsets before actually working on the prefix sums.
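As an illustration, a minimal CUDA sketch of the reduction round is given below. It assumes one-warp CTAs, a tile of tileSize contiguous elements per CTA, and lets a single thread add up the w per-thread partial sums, as discussed in Section 6.2; the kernel and variable names are ours, not the authors' implementation.

#define W 32                                    // warp size

__global__ void reduceRound(const int *in, int *tileSums, int tileSize) {
    __shared__ int partial[W];                  // one word per thread = per bank
    int t = threadIdx.x;
    const int *tile = in + (size_t)blockIdx.x * tileSize;

    // Strided loop: in every iteration the warp reads W consecutive words,
    // i.e. one block of the model (a coalesced access to global memory).
    int sum = 0;
    for (int i = t; i < tileSize; i += W)
        sum += tile[i];
    partial[t] = sum;                           // distinct banks, no conflicts
    __syncthreads();

    if (t == 0) {                               // one thread adds the w partials
        int total = 0;
        for (int i = 0; i < W; ++i)
            total += partial[i];
        tileSums[blockIdx.x] = total;           // one value per tile, contiguous
    }
}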
We expand our prefix sums algorithm to the problem of colored prefix sums.
Given an array of n elements and a set of d colors {c1 , c2 , . . . , cd }, with a color ci
associated with each element, colored prefix sums asks to compute d independent
prefix sums among the elements of the same color.
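For example (our own illustration), on the input 3, 1, 4, 1, 5 with colors r, g, r, r, g, the colored prefix sums are 3, 1, 7, 8, 6: the elements of color r accumulate to 3, 3+4 = 7 and 7+1 = 8, while those of color g accumulate to 1 and 1+5 = 6.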
Fig. 2: The runtime of the colored prefix sums algorithm rises with increasing number of colors. The correlation between the number of bank conflicts and runtime can be seen in the example of the prefix sums kernel. [Plots: left, runtime [ms] of colored prefix sums vs. number of colors (0–16); right, kernel 3 runtime [ms], kernel 3 bank conflicts [10^6], and improved kernel 3 runtime vs. number of colors.]
The algorithm is basically the same as for prefix sum, apart from having
intermediate values for each color and performing extra operations to identify the
color of each element. The d intermediate values per multiprocessor for the spine
round form a d × p-matrix that is still processed using only one multiprocessor.
Prefix sums operations in shared memory can be performed by one thread per
color without significant slowdown, as they only have to work on w elements.
We would expect the runtime to depend on the input size n, the number of
processors p and the warp size w, but not on d, since all calculations depending on
the number of colors are performed in parallel. However, the runtime increases
with the number of colors (see Figure 2). The slowdown correlates nicely with the
number of shared memory bank conflicts incurred when calculating intra-page
offsets in the prefix sums round.
To improve our algorithm, we implement a way to calculate the intra-page
offsets without any bank conflicts. We assign each thread one memory bank
for storage of intermediate values. The resulting w × d-matrix lists the color
values in columns. To avoid having multiple threads working on the same banks
we transpose the matrix and then calculate the prefix sums with d threads in
parallel. After retransposing the matrix, we can propagate the offsets.
We observe that the runtime of the algorithm is now constant for different
numbers of colors (see Figure 2). We limit the number of colors to 16 because of
the limited amount of shared memory available for each multiprocessor.
Both algorithms execute one scan of the data in the reduction round, as well as w computation steps in shared memory. Therefore, the computation time is $t_1^{ps}(M) = t_1^{cps}(M) = O(M/w)$ and the parallel I/O is $q_1^{ps}(M) = q_1^{cps}(M) = O(M/B)$. The data transfer and the computation of the spine round are much smaller: $t_2^{ps}(p) = O(\frac{p}{w}\log w)$ and $q_2^{ps}(p) = O(p/B)$ for simple prefix sums, and $t_2^{cps}(d \cdot p) = O(w)$ and $q_2^{cps}(d \cdot p) = O(\frac{d \cdot p}{B})$ for colored prefix sums. These values are too small to have significant impact on the runtime. The prefix sums round finally causes two more scans of the data in global memory, as well as two scans in shared memory and an extra w for the intra-page offsets, so $t_3^{ps}(M) = O(M/w + w)$ and $q_3^{ps}(M) = q_3^{cps}(M) = O(M/B)$. For colored prefix sums we have to perform two additional transpositions that can be done in time O(w), so $t_3^{cps}(M) = O(M/w + w)$ as well. Considering that $M = w^2$, $B = 4w$, and $d \cdot p \ll n$, the total runtime is

$$T^{cps}(n) = t_2^{cps}(d \cdot p) + \frac{n}{Mp}\sum_{k=1,3} t_k^{cps}(M) + \lambda \left( q_2^{cps}(d \cdot p) + \frac{n}{Mp}\sum_{k=1,3} q_k^{cps}(M) \right) + 3\sigma = O\!\left(\frac{n}{wp} + \lambda\,\frac{n}{wp} + \sigma\right)$$
5 Sorting
There exist a number of implementations of various sorting algorithms on GPUs.
However, all implementations of comparison-based sorting algorithms have one
thing in common: when the size of the array to be sorted is very small (e.g.
the base case of recursive algorithms), all implementations resort to loading the
array into shared memory and implementing some version of a sorting network.
Sorting networks are ideal for GPUs because they are data oblivious (resulting
in no divergent execution branches) and they are very simple (resulting in very
small constant factors). However, sorting networks still cause bank conflicts. To
the best of our knowledge, there is no implementation of a comparison-based
sorting algorithm that achieves no bank conflicts when sorting small arrays.
Using our view of shared memories (caches) as two dimensional matrices from
Section 3, it becomes very easy to identify a bank conflict free algorithm.
We implement ShearSort [16]. It works on data arranged in an n × m matrix
leaving it sorted in ascending order along the column-major order. It performs
log m rounds of first sorting columns in alternating order and then rows in ascending order. After Θ(log m) rounds the matrix is sorted.
On a square matrix of size M = w^2, ShearSort takes O(t̂(w) log w) parallel
time using w threads, where t̂(w) is the time it takes to sort w elements using a
single thread. We can still benefit from the simplicity and data-oblivious nature
of sorting networks by implementing sequential sorting of rows and columns
with a sorting network. Thus, if the row/column sorting is implemented using
Batcher's sorting network [2], then t̂(w) = O(w log^2 w) and the total time it takes to sort a square matrix is O(w log^3 w). Notice that although this approach incurs
O(log w) extra steps over a simple implementation of a sorting network in shared
memory, this time complexity is guaranteed in the worst case and in practice the
savings due to lack of bank conflicts are larger than the extra O(log w) factor.
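The sketch below (ours, for illustration only, with all names being our own) shows how the pieces fit together in CUDA for a single w × w tile: the bank-per-row layout of Section 3, the diagonal-sweep transposition of Algorithm 2 in Appendix B, and sequential per-thread sorting of rows and columns. For brevity it uses insertion sort where the implementation described above uses Batcher's sorting network.

#define W 32  // warp size = number of shared memory banks

// Element (r, c) of the logical w x w tile is kept at shared word c*W + r, so
// logical row r always lives in bank r (the layout of Fig. 1 / Section 3).
__device__ __forceinline__ int &el(int (*m)[W], int r, int c) { return m[c][r]; }

// One thread sorts logical row r sequentially; every word it touches is in
// bank r, so no conflicts occur.
__device__ void sortRow(int (*m)[W], int r, bool asc) {
  for (int i = 1; i < W; ++i) {
    int v = el(m, r, i);
    int j = i - 1;
    while (j >= 0 && (asc ? el(m, r, j) > v : el(m, r, j) < v)) {
      el(m, r, j + 1) = el(m, r, j);
      --j;
    }
    el(m, r, j + 1) = v;
  }
}

// In-place, bank-conflict-free transposition by diagonal sweeps (Algorithm 2
// in Appendix B): in sweep i, thread t swaps elements (t, t+i) and (t+i, t).
__device__ void transposeTile(int (*m)[W], int t) {
  for (int i = 1; i < W; ++i) {
    int r = t + i;
    if (r < W) {
      int a = el(m, t, r);
      int b = el(m, r, t);
      el(m, r, t) = a;
      el(m, t, r) = b;
    }
  }
}

// One one-warp CTA ShearSorts one w x w tile of the input.
__global__ void shearSortTiles(int *data) {
  __shared__ int m[W][W];
  int t = threadIdx.x;                      // t in [0, W)
  int *tile = data + (size_t)blockIdx.x * W * W;

  for (int c = 0; c < W; ++c)               // coalesced load; element k of the
    el(m, t, c) = tile[c * W + t];          // tile goes to (k mod w, k div w)
  __syncthreads();

  for (int round = 0; round < 6; ++round) { // Theta(log w) ShearSort rounds
    // Column phase: transpose so columns become bank rows, sort them in
    // alternating directions, transpose back.
    transposeTile(m, t); __syncthreads();
    sortRow(m, t, (t & 1) == 0); __syncthreads();
    transposeTile(m, t); __syncthreads();
    // Row phase: each thread sorts its own bank row in ascending order.
    sortRow(m, t, true); __syncthreads();
  }

  for (int c = 0; c < W; ++c)               // write the tile back
    tile[c * W + t] = el(m, t, c);
}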
We modify the well-known mergesort implementation of the thrust library [3]
to use our ShearSort implementation as base case instead of its own base sorting.
The reason for choosing a mergesort version is the predictable size of the base
case, which enables us to use w × w matrices without the need for padding.
Fig. 3: Setup 1 and 2 of the bandwidth experiments with P CTAs and W warps per CTA (w_xy is warp y of CTA x). [Diagram: in Setup 1 each CTA reads its own contiguous tile (CTA0, CTA1, ..., CTAP); in Setup 2 the blocks are read in the interleaved order w00, w10, ..., wP0, w01, w11, ..., wP1, ..., w0W, w1W, ..., wPW.]
It would be interesting to see how the use of ShearSort would perform in
practice on sorting algorithms that do not guarantee fixed sized base case, e.g.
sample sort [10]. We suspect that extra padding would reduce the advantage
of not having any bank conflicts, but the overall running time would still be
improved. We leave this as an open problem for future work.
We cannot analyze the running time of the whole mergesort because the
merging steps still incur bank conflicts. It would be interesting to see if our
framework can help develop a bank conflict free merging, e.g., by using techniques of the external memory algorithms [13] or of Green et al. [8]. Such an
algorithm would be the first fully bank conflict free sorting algorithm on GPUs.
6 Experimental results
We conduct our experiments on a standard PC with a 2.4 GHz Intel Core2 Quad
CPU, 8 GB of RAM and an NVIDIA Geforce GTX580 GPU. Our algorithms
are implemented using the CUDA runtime API and compiled with the CUDA
toolkit version 5.0, gcc version 4.5 and the -O3 optimization flag.
The runtimes reported do not include the time needed to transfer data from
the host computer to the GPU or back, as both prefix sums and sorting are
primitives often used as a subroutine by more complex algorithms.
6.1 Global memory bandwidth
While the theoretical maximum DRAM bandwidth for our GPU is 192.4 GB/s5, it is not initially clear how to reach that bandwidth. The CUDA SDK's sample program to measure the bandwidth using the API's cudaMemcpy command
reported a bandwidth of 158.9 GB/s for our graphics card.
Merrill describes a scheme [12] that uses NVIDIA’s internal vector classes to
bundle elements. Following this scheme we achieved a bandwidth of 168.2 GB/s.
To verify that the PEM model can serve as a good basis, we created two sets
of experiments. In the first setup, each CTA is assigned one contiguous tile of
the input data and proceeds to read this tile using all its warps. The second
setup breaks with the tile paradigm and treats each warp as a virtual CTA. The
virtual CTAs read one block at a time such that the ith warp of every CTA reads a block before any (i + 1)th warp of any CTA reads one (see Figure 3).
5 http://www.nvidia.de/object/product-geforce-gtx-580-de.html
Fig. 4: Achievable throughput for transfer between global and shared memory in setups 1 (left) and 2 (right). [Plots: throughput [GB/s] vs. warps/block for 1, 2, 4, 8, and 16 blocks/SM.]
Fig. 5: Runtimes and throughput of prefix sums algorithms. [Plots: runtime [s] and throughput [10^9 elements/s] vs. input size (2^16 to 2^29) for thrust scan, b40c scan, and our prefix sums.]
Our experiments show that eight CTAs per SM and one warp per CTA are
sufficient to reach near peak bandwidth in both setups (see Figure 4). That is
indicative of block accesses of width w elements being feasible to use as unit in
the model. We use this configuration in our kernels.
6.2 Prefix sums
While our kernels are optimized to a point, it sometimes seems excessive to go
even further. For example the summation of w = 32 integers in shared memory
can be performed in parallel in time O(log w). Simply having one thread read all w values and add them up in a register is not significantly slower, if at all.
This is done in the reduction kernel of our simple prefix sums implementation.
The spine phase is run on just one CTA because inter-CTA communication is
impossible. The previous kernel’s output is only p = 128 values (8 CTAs·16 SMs)
which are copied into shared memory. The w threads then cooperatively calculate
the prefix sums by four scans of w consecutive values, where each scan starts
with the final value of the previous scan. The scans follow the simple algorithm by Kogge and Stone [9]. Since w consecutive values in shared memory always reside on different memory banks, there are no bank conflicts to worry about.

Fig. 6: Runtimes and throughput of thrust mergesort base case in original size and 1024, as well as ShearSort. The base case runtimes have been measured with the NVIDIA Visual Profiler. [Plots: runtime [s] and throughput [10^9 elements/s] vs. input size (2^10 to 2^27) for thrust mergesort, thrust mergesort (base case 1024), and shearsort.]
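A hedged sketch of this spine scan is shown below; it assumes one CTA of w = 32 threads and p = 128 = 4w tile sums, as in the configuration above, and uses our own names. The inclusive Kogge-Stone pass is shown; the exclusive offsets the prefix sums round needs can be obtained by shifting the result.

#define W 32                                        // warp size

// One Kogge-Stone pass over W consecutive shared words; word t lies in bank t,
// so the W threads of the warp never conflict.
__device__ void koggeStoneScan(int *vals, int t) {
    for (int d = 1; d < W; d <<= 1) {
        int add = (t >= d) ? vals[t - d] : 0;       // all reads happen ...
        __syncthreads();                            // ... before any write
        if (t >= d)
            vals[t] += add;
        __syncthreads();
    }
}

// Spine round: scan p = 4*W tile sums as four chained passes, each starting
// from the running total of the previous one. Launch with one CTA of W threads.
__global__ void spineRound(int *tileSums) {
    __shared__ int s[4 * W];
    int t = threadIdx.x;
    for (int k = 0; k < 4; ++k)
        s[k * W + t] = tileSums[k * W + t];
    __syncthreads();

    int carry = 0;
    for (int k = 0; k < 4; ++k) {
        if (t == 0)
            s[k * W] += carry;                      // chain with previous segment
        __syncthreads();
        koggeStoneScan(s + k * W, t);
        carry = s[k * W + W - 1];                   // running total so far
    }
    for (int k = 0; k < 4; ++k)
        tileSums[k * W + t] = s[k * W + t];
}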
We compared our prefix sums algorithm against the well-known implementations of the thrust library [3] and the back40computing (b40c) library [11].
Our algorithm outperforms both library implementations on many input sizes,
reaching relative speedups of 13 over thrust and 10 over b40c on smaller inputs. On larger inputs, our implementation is still faster than thrust by a factor
of 1.3, while the highly optimized b40c surpasses our algorithm by a margin of
3%. Figure 5 summarizes the results.
6.3 Sorting
Thrust mergesort [15] is implemented as sorting a small base case and then
iteratively merging the sorted subsequences into larger ones. We replace the base
case by our own ShearSort and adjust the iteration to start at 1024 elements.
Since rows consist of only w = 32 elements and because accesses to registers
are not data dependent, i.e., can be determined at compile time, as an additional
optimization we sort the elements of each row in register space.
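As an illustration of such data-oblivious, register-resident sorting, the sketch below (ours) shows a compare-exchange network for four values held in registers; the actual base case unrolls Batcher's network over the w = 32 keys of a row in the same style.

__device__ __forceinline__ void cmpSwap(int &a, int &b) {
    int lo = min(a, b);                 // data-oblivious compare-exchange:
    int hi = max(a, b);                 // no divergent branches
    a = lo;
    b = hi;
}

// Batcher's odd-even merge network for 4 inputs (5 comparators).
__device__ void sort4InRegisters(int &a, int &b, int &c, int &d) {
    cmpSwap(a, b); cmpSwap(c, d);
    cmpSwap(a, c); cmpSwap(b, d);
    cmpSwap(b, c);
}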
For better comparability we modify the thrust library to use a base case size of 1024 elements (thrust_1024). While this change causes a slowdown of between 1.5 and 4 for the initial sorting, the reduced number of merging phases results in a faster overall runtime for input sizes smaller than 2^18.
On the GTX580, ShearSort fully uses the hardware resources for input sizes larger than 2^17 elements. Since smaller inputs simply result in multiprocessors being idle, it is no surprise that the original thrust_orig and thrust_1024 are faster than ShearSort here. For inputs larger than 2^17 elements, our ShearSort implementation achieves speedups of up to 1.7 over thrust_orig and 2.2 over thrust_1024.
7 Conclusion
We presented a theoretical model for designing and analyzing GPU algorithms.
Using the model, we designed algorithms which are simple to describe and analyze, yet their implementations are competitive with (and sometimes superior
to) those of the best known GPU implementations. Because we eliminate the
sources of unpredictable runtimes, e.g., bank conflicts, the performance of the
algorithms can be predicted in our model. This provides a basis for algorithm
designers to reason about the performance of their algorithms, thus bridging the
gap between GPU programming and theoretical computer science.
Acknowledgments. We would like to thank Vitaly Osipov for helpful discussions and sharing his insights on GPUs.
References
1. Arge, L., Goodrich, M.T., Nelson, M.J., Sitchinava, N.: Fundamental parallel algorithms for private-cache chip multiprocessors. In: SPAA. pp. 197–206 (2008)
2. Batcher, K.E.: Sorting networks and their applications. In: Proceedings of the
AFIPS Spring Joint Computer Conference. pp. 307–314. ACM (1968)
3. Bell, N., Hoberock, J.: Thrust: A productivity-oriented library for CUDA. GPU Computing Gems: Jade Edition, pp. 359–372 (2011)
4. Chiang, Y.J., Goodrich, M.T., Grove, E.F., Tamassia, R., Vengroff, D.E., Vitter,
J.S.: External-memory graph algorithms. In: SODA’95. pp. 139–149 (1995)
5. Dotsenko, Y., Govindaraju, N.K., Sloan, P.P., Boyd, C., Manfedelli, J.: Fast scan
algorithms on graphics processors. In: ICS (2008)
6. Farber, R.: CUDA application design and development. Morgan Kaufmann (2011)
7. Feng, W., Xiao, S.: To GPU synchronize or not GPU synchronize? In: ISCAS’10.
pp. 3801–3804 (2010)
8. Green, O., McColl, R., Bader, D.A.: GPU merge path: a GPU merging algorithm.
In: ICS’12. pp. 331–340 (2012)
9. Kogge, P.M., Stone, H.S.: A parallel algorithm for the efficient solution of a general
class of recurrence equations. IEEE Transactions on Computers C-22(8), 786 –793
(August 1973)
10. Leischner, N., Osipov, V., Sanders, P.: GPU sample sort. In: IPDPS (2010)
11. Merrill, D.: back40computing: Fast and efficient software primitives for GPU computing, http://code.google.com/p/back40computing/, svn checkout 4/21/2013
12. Merrill, D., Grimshaw, A.: Parallel scan for stream architectures. Tech. Rep.
CS2009-14, Department of Computer Science, University of Virginia (2009)
13. Meyer, U., Sanders, P., Sibeyn, J.: Algorithms for Memory Hierarchies: Advanced
Lectures. Springer-Verlag (2003)
14. NVIDIA: NVIDIA CUDA Programming Guide 5.0 (2012), http://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf, last viewed on 2/11/2013
15. Satish, N., Harris, M., Garland, M.: Designing efficient sorting algorithms for manycore gpus. In: IPDPS’09. pp. 1–10. IEEE (2009)
16. Sen, S., Scherson, I.D., Shamir, A.: Shear Sort: A true two-dimensional sorting technique for VLSI networks. In: ICPP. pp. 903–908 (1986)
17. Wong, H., Papadopoulou, M.M., Sadooghi-Alvandi, M., Moshovos, A.: Demystifying GPU microarchitecture through microbenchmarking. In: ISPASS’10. pp.
235–246. IEEE (2010)
A Additional bandwidth
We ran experiments in both setups (see Figure 3), both for transfers between global memory and shared memory and between global memory and registers. Another
set of experiments included computation, namely incrementing the data. The
number of increments performed per element was controlled by a parameter.
Apart from the obvious tuning parameters, the number of CTAs and the number of warps per CTA, Merrill's data transfer method [12] proposes two additional parameters: the vector length for element bundling (which has a direct impact on the size of the read or write operation performed by CUDA) and the number of accesses per round (which can be exploited to overlap computation and data transfer).
Fig. 7: Achievable throughput for transfer from global memory to registers in both setups. [Plots: throughput [GB/s] vs. warps/block for 1, 2, 4, 8, and 16 blocks/SM.]
Fig. 8: Achievable throughput for transfer from global to shared memory in both setups. [Plots: throughput [GB/s] vs. warps/block for 1, 2, 4, 8, and 16 blocks/SM.]
Fig. 9: Achievable throughput for data transfer with increments using 4 CTAs per SM. Transfer from global memory to registers (left) and from global to shared memory (right). [Plots: throughput [GB/s] for 1, 2, 4, 6, 8, 10, and 12 warps/block.]
Fig. 10: Achievable throughput for data transfer with increments using 8 CTAs per SM. Transfer from global memory to registers (left) and from global to shared memory (right). [Plots: throughput [GB/s] for 1 to 6 warps/block.]
B Primitives
Data transfer between global and shared memory. Based on Merrill's observations [12] we use the following scheme to transfer data between global memory and shared memory. The data is cast into a vector type to increase the load type size to 128 bits. The requests are split into multiple rounds (the while-loop in Algorithm 1), with several vectors being requested per round to allow the optimizer to reorganize the requests to achieve optimal throughput (the while-loop in Algorithm 1 should be unrolled).
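A CUDA rendering of this scheme might look roughly as follows (a sketch under our own naming; the int4 vector width, the COUNT_ELE constant and the assumption that the page size is a multiple of the per-round request size are ours). Algorithm 1 gives the corresponding pseudocode.

#define COUNT_ELE 4                        // vectors requested per thread per round

// in: global memory, out: shared memory (or vice versa). Assumes nVec, the
// number of 128-bit vectors in a page, is a multiple of blockDim.x * COUNT_ELE,
// so every thread runs the same number of rounds.
__device__ void transferPage(const int4 *in, int4 *out, int nVec) {
    int4 elements[COUNT_ELE];
    int nThreads = blockDim.x;
    int count = threadIdx.x;
    while (count < nVec) {
        #pragma unroll
        for (int i = 0; i < COUNT_ELE; ++i)         // issue several 128-bit loads
            elements[i] = in[count + nThreads * i]; // back to back
        __syncthreads();
        #pragma unroll
        for (int i = 0; i < COUNT_ELE; ++i)         // write the round out
            out[count + nThreads * i] = elements[i];
        count += nThreads * COUNT_ELE;              // advance to the next round
        __syncthreads();
    }
}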
Algorithm 1 Data transfer between global and shared memory
  n_vec ← pagesize / width_vec
  declare elements[count_ele]
  count ← threadIdx.x
  while count < n_vec do
    for i ∈ [0, count_ele) do
      elements[i] ← in[count + blockDim.x · i]
    end for
    __syncthreads()
    for i ∈ [0, count_ele) do
      out[count + blockDim.x · i] ← elements[i]
    end for
    count ← count + blockDim.x · count_ele
  end while

Transposing a matrix in shared memory. There is no algorithm known to us that transposes an arbitrary matrix in-place with O(1) extra memory. Fortunately, our algorithms operate on either square matrices or m × n matrices, where m = 2n. That allows us to limit our transposition to square matrices, splitting
the long matrix into two square matrices and using in-program addressing to
concatenate the pieces.
Algorithm 2 Bank-conflict free transposition of a square matrix
  declare element_1, element_2, row
  col ← threadIdx.x
  for i ∈ [1, count_columns) do
    row ← col + i
    if row < count_rows then
      element_1 ← matrix[col][row]
      element_2 ← matrix[row][col]
      matrix[row][col] ← element_1
      matrix[col][row] ← element_2
    end if
    __syncthreads()
  end for
This function executes one sweep of the matrix below the main diagonal, using one thread per column. Since no two threads read element_1 from the same row at the same time, and the row where element_2 comes from is used exclusively by one thread, no bank conflicts can occur.