GPU computing with C++ AMP
Kenneth Domino, Domem Technologies, Inc.
http://domemtech.com
October 24, 2011
NB: This presentation is based on Daniel Moth's presentation at http://channel9.msdn.com/Events/BUILD/BUILD2011/TOOL-802T

CPUs vs GPUs today
CPU
• Low memory bandwidth
• Higher power consumption
• Medium level of parallelism
• Deep execution pipelines
• Random accesses
• Supports general code
• Mainstream programming

GPU
• High memory bandwidth
• Lower power consumption
• High level of parallelism
• Shallow execution pipelines
• Sequential accesses
• Supports data-parallel code
• Niche programming

Tomorrow…
• CPUs and GPUs are coming closer together…
• …nothing is settled in this space; things are still in motion…
• C++ AMP is designed as a mainstream solution not only for today, but also for tomorrow
(image source: AMD)

C++ AMP
• Part of Visual C++
• Visual Studio integration
• STL-like library for multidimensional data
• Builds on Direct3D
• Goals: performance, productivity, portability

C++ AMP vs. CUDA vs. OpenCL
• C++ AMP – Wow, classes! Lambda functions (1936, Church – back to the future)
• OpenCL, CUDA – stone knives and bearskins (C99!! YUCK!)

Hello World: Array Addition
Serial code that runs on the CPU:

void AddArrays(int n, int * pA, int * pB, int * pC)
{
    for (int i = 0; i < n; i++)
    {
        pC[i] = pA[i] + pB[i];
    }
}

How do we take this serial code and convert it to run on an accelerator like the GPU?

Hello World: Array Addition (C++ AMP)

#include <amp.h>
using namespace concurrency;

void AddArrays(int n, int * pA, int * pB, int * pC)
{
    array_view<int,1> a(n, pA);
    array_view<int,1> b(n, pB);
    array_view<int,1> sum(n, pC);

    parallel_for_each(
        sum.grid,
        [=](index<1> i) restrict(direct3d)
        {
            sum[i] = a[i] + b[i];
        });
}

Basic Elements of C++ AMP coding
• array_view: wraps the data to operate on the accelerator; array_view variables captured by the lambda have their data copied to the accelerator on demand
• parallel_for_each: executes the lambda on the accelerator, once per thread
• grid: the number and shape of threads that execute the lambda
• restrict(direct3d): tells the compiler to check that this code can execute on Direct3D hardware (aka an accelerator)
• index: the thread ID that is running the lambda, used to index into the data

grid<N>, extent<N>, and index<N>
• index<N> – represents an N-dimensional point
• extent<N> – the number of units in each dimension of an N-dimensional space
• grid<N> – an origin (index<N>) plus an extent<N>
• N can be any number

Examples: grid, extent, and index

index<1> i(2);     index<2> i(0,2);     index<3> i(2,0,1);
extent<1> e(6);    extent<2> e(3,4);    extent<3> e(3,2,2);
grid<1> g(e);      grid<2> g(e);        grid<3> g(e);

array<T,N>
• Multi-dimensional array of rank N with element type T
• Storage lives on the accelerator

vector<int> v(96);
extent<2> e(8,12);   // e[0] == 8; e[1] == 12
array<int,2> a(e, v.begin(), v.end());

// in the body of my lambda
index<2> i(3,9);     // i[0] == 3; i[1] == 9
int o = a[i];        // or a[i] = 16;
//int o = a(i[0], i[1]);
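Because the storage of an array<T,N> lives on the accelerator, results must be copied back to the host explicitly. A minimal sketch, assuming a concurrency::copy(array, output iterator) overload as in the released library (the Developer Preview API may differ); SquareAll is an illustrative name, not from the deck:

#include <amp.h>
#include <vector>
using namespace concurrency;

void SquareAll(std::vector<int> & v)
{
    extent<1> e((int)v.size());
    array<int,1> a(e, v.begin(), v.end());   // host data is copied to the accelerator here
    grid<1> g(e);

    parallel_for_each(g, [&a](index<1> idx) restrict(direct3d)
    {
        a[idx] = a[idx] * a[idx];            // array<> is captured by reference
    });

    copy(a, v.begin());                      // explicit copy back to the host
}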
array_view<T,N>
• A view over existing data on the CPU or GPU
• array_view<T,N>
• array_view<const T,N>

vector<int> v(10);
extent<2> e(2,5);
array_view<int,2> a(e, v);
// the above two lines can also be written:
// array_view<int,2> a(2,5,v);

Data Classes Comparison
array<T,N>
• Rank at compile time
• Extent at runtime
• Rectangular
• Dense
• Container for data
• Explicit copy
• Capture by reference [&]

array_view<T,N>
• Rank at compile time
• Extent at runtime
• Rectangular
• Dense in one dimension
• Wrapper for data
• Implicit copy
• Capture by value [=]

KED note: array_view<> seems faster than array<>. Could it be because of the on-demand copying of array_view?

parallel_for_each
• Executes the lambda for each point in the extent
• As-if synchronous in terms of visible side effects

parallel_for_each(
    g,   // g is of type grid<N>
    [ ](index<N> idx) restrict(direct3d) { /* kernel code */ });

restrict(…)
• Applies to functions (including lambdas)
• Why restrict?
  • Target-specific language restrictions
  • Optimizations or special code-gen behavior
  • Future-proofing
• Functions can have multiple restrictions
• The 1st release implements direct3d and cpu
• cpu – the implicit default

restrict(direct3d) restrictions
• Can only call other restrict(direct3d) functions
• All functions must be inlinable
• Only direct3d-supported types
  • int, unsigned int, float, double
  • structs & arrays of these types
• Pointers and references
  • Lambdas cannot capture by reference, nor capture pointers
  • References and single-indirection pointers are supported only as local variables and function arguments

restrict(direct3d) restrictions (continued)
• No recursion
• No 'volatile'
• No virtual functions
• No pointers to functions
• No pointers to member functions
• No pointers in structs
• No pointers to pointers
• No goto or labeled statements
• No throw, try, catch
• No globals or statics
• No dynamic_cast or typeid
• No asm declarations
• No varargs
• No unsupported types, e.g. char, short, long double

Example: restrict overloading

double bar( double ) restrict(cpu,direct3d); // 1: same code for both
double cos( double );                        // 2a: general code
double cos( double ) restrict(direct3d);     // 2b: specific code

void SomeMethod(array<double> c)
{
    parallel_for_each( c.grid,
        [=](index<2> idx) restrict(direct3d)
        {
            //…
            double d1 = bar(c[idx]); // ok
            double d2 = cos(c[idx]); // ok, chooses direct3d overload
            //…
        });
}
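Besides overloading, a single function can carry both restriction specifiers so the same body compiles for the CPU and for the accelerator. A minimal sketch; clamp01 and Saturate are illustrative names, not part of the library:

float clamp01(float x) restrict(cpu, direct3d)
{
    if (x < 0.0f) return 0.0f;
    if (x > 1.0f) return 1.0f;
    return x;
}

void Saturate(int n, float * p)
{
    p[0] = clamp01(p[0]);              // ok: called from ordinary cpu code
    array_view<float,1> v(n, p);
    parallel_for_each(v.grid, [=](index<1> idx) restrict(direct3d)
    {
        v[idx] = clamp01(v[idx]);      // ok: called from a restrict(direct3d) lambda
    });
    // results are copied back to p when v is synchronized or destroyed (see the Issues slides later)
}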
accelerator, accelerator_view
• accelerator
  • e.g. a DX11 GPU, the REF software emulator, or the CPU
• accelerator_view
  • a context for scheduling and memory management
(Diagram: host connected to an accelerator, e.g. a GPU, over PCIe)

Example: accelerator

// Identify an accelerator based on its Windows device ID
accelerator myAcc("PCI\\VEN_1002&DEV_9591&CC_0300");
// …or enumerate all accelerators (not shown)

// Allocate an array on my accelerator
array<int> myArray(10, myAcc.default_view);

// …or launch a kernel on my accelerator
parallel_for_each(myAcc.default_view, myArrayView.grid, ...);

C++ AMP at a Glance (so far)
• restrict(direct3d, cpu)
• parallel_for_each
• class array<T,N>
• class array_view<T,N>
• class index<N>
• class extent<N>, grid<N>
• class accelerator
• class accelerator_view

Achieving maximum performance gains
• Schedule threads in tiles
• Avoid thread index remapping
• Gain the ability to use tile_static memory
(Diagram: an 8 x 6 grid of threads partitioned into 4 x 3 tiles and into 2 x 2 tiles)

extent<2> e(8,6);
grid<2> g(e);
g.tile<4,3>()
g.tile<2,2>()

• The parallel_for_each overload for tiles accepts
  • a tiled_grid<D0>, tiled_grid<D0,D1>, or tiled_grid<D0,D1,D2>
  • and a lambda which accepts
  • a tiled_index<D0>, tiled_index<D0,D1>, or tiled_index<D0,D1,D2>

tiled_grid, tiled_index
• Given

array_view<int,2> data(8, 6, p_my_data);
parallel_for_each(
    data.grid.tile<2,2>(),
    [=] (tiled_index<2,2> t_idx) restrict(direct3d) { … });

• When the lambda is executed by the thread whose global index is (6,3):
  • t_idx.global      = index<2> (6,3)
  • t_idx.local       = index<2> (0,1)
  • t_idx.tile        = index<2> (3,1)
  • t_idx.tile_origin = index<2> (6,2)
(Diagram: the 8 x 6 array_view partitioned into 2 x 2 tiles, with the thread T highlighted at global index (6,3))

tile_static, tile_barrier
• Within the tiled parallel_for_each lambda we can use:
• the tile_static storage class for local variables
  • indicates that the variable is allocated in fast cache memory, i.e. shared by each thread in a tile of threads
  • only applicable in restrict(direct3d) functions
• class tile_barrier
  • synchronizes all threads within a tile
  • e.g. t_idx.barrier.wait();
(A minimal tiled-kernel sketch putting these pieces together appears after the summary below.)

C++ AMP at a Glance
• restrict(direct3d, cpu)
• parallel_for_each
• class array<T,N>
• class array_view<T,N>
• class index<N>
• class extent<N>, grid<N>
• class accelerator
• class accelerator_view
• tile_static storage class
• class tiled_grid<D0,D1,D2>
• class tiled_index<D0,D1,D2>
• class tile_barrier
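To make the tiling constructs concrete, here is a minimal sketch of a tiled kernel that reverses each 256-element tile of an array in place, staging data through tile_static memory and synchronizing with a tile_barrier. It uses only the Developer Preview constructs shown above; ReverseTiles is an illustrative name:

static const int TS = 256;

void ReverseTiles(int * data, int length)   // assumes length is a multiple of TS
{
    array_view<int,1> a(length, data);
    parallel_for_each(a.grid.tile<TS>(),
        [=](tiled_index<TS> t_idx) restrict(direct3d)
    {
        tile_static int buf[TS];             // shared by the threads of one tile
        int l = t_idx.local[0];
        buf[l] = a[t_idx.global];            // stage this tile into fast memory
        t_idx.barrier.wait();                // all loads finish before any thread reads
        a[t_idx.global] = buf[TS - 1 - l];   // write back, reversed within the tile
    });
}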
EXAMPLES!!!
• The beloved, ubiquitous matrix multiplication
• Bitonic sort

Example: Matrix Multiplication
Serial version:

void Multiply_Serial(Matrix * C, Matrix * A, Matrix * B)
{
    int wA = A->cols; int hA = A->rows;
    int wB = B->cols; int hB = B->rows;
    int wC = C->cols; int hC = C->rows;
    for (int gr = 0; gr < hA; ++gr)          // row
        for (int gc = 0; gc < wB; ++gc) {    // col
            float sum = 0;
            for (int k = 0; k < hB; ++k)
                sum += Data(A)[gr * wA + k] * Data(B)[k * wB + gc];
            Data(C)[gr * wC + gc] = sum;
        }
}

Simple C++ AMP version (Matrix and Data() are the presenter's helpers; a sketch of them appears after the performance table below):

void MultiplySimple(Matrix * C, Matrix * A, Matrix * B)
{
    int wA = A->cols; int hA = A->rows;
    int wB = B->cols; int hB = B->rows;
    int wC = C->cols; int hC = C->rows;
    array_view<const float,1> a(hA * wA, Data(A));
    array_view<const float,1> b(hB * wB, Data(B));
    array_view<writeonly<float>,1> c(hC * wC, Data(C));
    extent<2> e(hC, wC);
    grid<2> g(e);
    parallel_for_each(g, [=](index<2> idx) restrict(direct3d)
    {
        int gr = idx[0]; int gc = idx[1];
        float sum = 0.0f;
        for (int k = 0; k < hB; k++)
            sum += a[gr * wA + k] * b[k * wB + gc];
        c[gr * wC + gc] = sum;
    });
}

Example: Matrix Multiplication (tiled, shared memory)

void MultiplyTiled(Matrix * C, Matrix * A, Matrix * B)
{
    int wA = A->cols; int hA = A->rows;
    int wB = B->cols; int hB = B->rows;
    int wC = C->cols; int hC = C->rows;
    array_view<const float,1> a(hA * wA, Data(A));
    array_view<const float,1> b(hB * wB, Data(B));
    array_view<writeonly<float>,1> c(hC * wC, Data(C));
    extent<2> e(hC, wC);
    grid<2> g(e);
    const int TS = 16;
    parallel_for_each(g.tile<TS,TS>(),
        [=](tiled_index<TS,TS> idx) restrict(direct3d)
    {
        int lr = idx.local[0];  int lc = idx.local[1];
        int gr = idx.global[0]; int gc = idx.global[1];
        float sum = 0.0f;
        for (int i = 0; i < hB; i += TS)
        {
            tile_static float locA[TS][TS], locB[TS][TS];
            locA[lr][lc] = a[gr * wA + lc + i];
            locB[lr][lc] = b[(lr + i) * wB + gc];
            idx.barrier.wait();
            for (int k = 0; k < TS; k++)
                sum += locA[lr][k] * locB[k][lc];
            idx.barrier.wait();
        }
        c[gr * wC + gc] = sum;
    });
}

Performance
C++ AMP performs as well as CUDA or OpenCL for matrix multiplication.

Implementation                  C++ AMP            CUDA                OpenCL
Sequential                      1.317 ± 0.006 s    1.454 ± 0.008 s     1.448 ± 0.003 s
Implicit (unspecified tile)     0.035 ± 0.008 s    n.a.                n.a.
Explicit (16 x 16 tile)         0.030 ± 0.001 s    0.046 ± 0.001 s     0.061 ± 0.002 s
Shared mem (16 x 16 tile)       0.015 ± 0.002 s    0.0150 ± 0.0003 s   0.033 ± 0.002 s

Random matrices A (480 x 640) x B (640 x 960) = C (480 x 960), single-precision floats. Environment: NVIDIA GeForce GTX 470, an ATI Radeon HD 6450 (not used), and an Intel Q6600 @ 2.51 GHz (overclocked), 4 GB RAM, Windows 7 64-bit. 10 runs, the first run thrown out (for a total of 9). Average ± S.E. of the sample.
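The Matrix type and the Data() accessor used above are the presenter's own helpers and are not shown in the deck. A minimal sketch of what they might look like, assuming row-major storage in a flat float buffer (purely an assumption for readability):

struct Matrix
{
    int rows;
    int cols;
    float * elements;   // rows * cols values, row-major: element (r,c) at r * cols + c
};

// Returns the flat element buffer, matching the a[gr * wA + k] indexing in the kernels.
inline float * Data(Matrix * m)
{
    return m->elements;
}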
NB: this comparison is between a Developer Preview (C++ AMP) and RTM products (CUDA, OpenCL).

Bitonic sort
(A sequence of diagram slides illustrating the bitonic sorting network is omitted here.)

Bitonic sort
Sequential version (orange_box, red_box, and swap are the presenter's helpers; a sketch of them follows the Peters et al. reference below):

void bitonicSortSequential(int * data, int length)
{
    unsigned int log2length = log2(length);
    unsigned int checklength = pow2(log2length);
    for (int phase = 0; phase < log2length; ++phase)
    {
        int compares = length / 2;
        unsigned int phase2 = pow2((unsigned int)phase);
        for (int ig = 0; ig < compares; ++ig)
        {
            int cross, paired;
            orange_box(ig, phase2, cross, paired);
            swap(data[cross], data[paired]);
        }
        for (int level = phase-1; level >= 0; --level)
        {
            unsigned int level2 = pow2((unsigned int)level);
            for (int ig = 0; ig < compares; ++ig)
            {
                int cross, paired;
                red_box(ig, level2, cross, paired);
                swap(data[cross], data[paired]);
            }
        }
    }
}

Simple C++ AMP version:

void bitonicSortSimple(int * data, int length)
{
    unsigned int log2length = log2(length);
    unsigned int checklength = pow2(log2length);
    static const int TS = 1024;
    array_view<int,1> a(length, data);
    for (int phase = 0; phase < log2length; ++phase)
    {
        int compares = length / 2;
        extent<1> e(compares);
        grid<1> g(e);
        unsigned int phase2 = pow2((unsigned int)phase);
        parallel_for_each(g.tile<TS>(),
            [phase2, a](tiled_index<TS> idx) restrict(direct3d)
        {
            int ig = idx.global[0];
            int cross, paired;
            orange_box(ig, phase2, cross, paired);
            swap(a[cross], a[paired]);
        });
        for (int level = phase-1; level >= 0; --level)
        {
            unsigned int level2 = pow2((unsigned int)level);
            parallel_for_each(g.tile<TS>(),
                [level2, a](tiled_index<TS> idx) restrict(direct3d)
            {
                int ig = idx.global[0];
                int cross, paired;
                red_box(ig, level2, cross, paired);
                swap(a[cross], a[paired]);
            });
        }
    }
}

Performance
C++ AMP performs almost as well as CUDA for (naïve) bitonic sort.

Implementation          C++ AMP          CUDA
Sequential              12.5 ± 0.4 s     12.6 ± 0.01 s
Explicit (1024 tile)    0.42 ± 0.04 s    0.372 ± 0.002 s

Array of length 1024 * 32 * 16 * 16 = 8388608 integers. Environment: NVIDIA GeForce GTX 470, an ATI Radeon HD 6450 (not used), and an Intel Q6600 @ 2.51 GHz (overclocked), 4 GB RAM, Windows 7 64-bit. 10 runs, the first run thrown out (for a total of 9). Average ± S.E. of the sample.

Bitonic sort with subsets of bitonic merge
Peters, H., Schulz-Hildebrandt, O., et al. (2011). "Fast in-place, comparison-based sorting with CUDA: a study with bitonic sort." Concurrency and Computation: Practice and Experience 23(7): 681–693.
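The helpers orange_box, red_box, and swap are not shown in the deck. The sketch below is one plausible implementation, assuming the common bitonic-sort formulation in which the first stage of each phase (the "orange box") compares mirrored pairs within a block and the remaining stages (the "red boxes") are half-cleaners, with swap acting as an ascending compare-exchange; the presenter's actual helpers may differ.

// First stage of a phase: pair element with its mirror inside a block of 2*phase2.
void orange_box(int ig, unsigned int phase2, int & cross, int & paired) restrict(cpu, direct3d)
{
    int block  = ig / phase2;
    int offset = ig % phase2;
    cross  = block * 2 * phase2 + offset;
    paired = block * 2 * phase2 + (2 * phase2 - 1 - offset);
}

// Later stages of a phase: classic half-cleaner with distance level2.
void red_box(int ig, unsigned int level2, int & cross, int & paired) restrict(cpu, direct3d)
{
    cross  = (ig / level2) * 2 * level2 + (ig % level2);
    paired = cross + level2;
}

// Ascending compare-exchange (not std::swap): order the pair in place.
void swap(int & a, int & b) restrict(cpu, direct3d)
{
    if (a > b) { int t = a; a = b; b = t; }
}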
Optimized bitonic sort
• Optimization: compute multiple compares per thread using subsets of the bitonic merge of a given degree.
• Load the portion of the subset bitonic merge into registers.
• Perform the swaps in registers.
• Store the registers back to global memory.
• NB: the loop must be unrolled manually – there is a bug in the compiler with #pragma unroll!

for (int i = 0; i < msize; ++i)
    mm[i] = a[base + memory[i]];
for (int i = 0; i < csize; i += 2)
{
    int cross = normalized_compares[i];
    int paired = normalized_compares[i+1];
    swap(mm[cross], mm[paired]);
}
for (int i = 0; i < msize; ++i)
    a[base + memory[i]] = mm[i];

Performance
C++ AMP performs almost as well as CUDA for the optimized bitonic sort.

Implementation                      C++ AMP           CUDA
Explicit (512 tile)                 0.285 ± 0.4 s     0.280 ± 0.001 s
Degree opt., explicit (512 tile)    0.239 ± 0.001 s   0.187 ± 0.001 s

Array of length 1024 * 32 * 16 * 16 = 8388608 integers. Environment: NVIDIA GeForce GTX 470, an ATI Radeon HD 6450 (not used), and an Intel Q6600 @ 2.51 GHz (overclocked), 4 GB RAM, Windows 7 64-bit. 10 runs, the first run thrown out (for a total of 9). Average ± S.E. of the sample.

Issues
• All GPU memory must be an array_view<> (or array<>), so you must use an array index!!!
  foo_bar[…] = …;
• A pointer to a struct/class is unavailable, but a struct/class itself is OK:

class foo {};
foo f;
…
parallel_for_each(…
    f.bar();        // error: cannot convert 'this' pointer from 'const foo' to 'foo &'
…);

foo f;
array_view<foo,1> xxx(1, &f);
parallel_for_each(…
    xxx[0].bar();   // OK, but you must use []'s – YUCK!
…);

Issues
• It is easy to forget that after parallel_for_each, results must be fetched via
  1) a call to array_view synchronize() on the variable;
  2) destruction of the array_view variable; or
  3) an explicit copy for array<>.
  (A minimal sketch follows these slides.)
• Take care when comparing CUDA/OpenCL with C++ AMP:
  1) C++ AMP may not pick the optimal accelerator if you have two GPUs;
  2) C++ AMP picks the tile size if you don't specify it. If you do pick a tile size, it must fit into the grid as an exact multiple!
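A minimal sketch of the three ways results come back to the host, using the synchronize() call and explicit copy mentioned above. The function name and the copy(array, output iterator) overload are assumptions based on the released library; the Developer Preview API may differ:

void IncrementAll(std::vector<float> & v, std::vector<float> & w)   // assumes v and w are the same length
{
    extent<1> e((int)v.size());
    array_view<float,1> av(e, v);                // wraps v on the host
    array<float,1> ar(e, w.begin(), w.end());    // copies w to the accelerator

    parallel_for_each(av.grid, [=, &ar](index<1> idx) restrict(direct3d)
    {
        av[idx] += 1.0f;
        ar[idx] += 1.0f;
    });

    av.synchronize();       // 1) explicit synchronize of the array_view…
    copy(ar, w.begin());    // 3) …and an explicit copy back for array<>
}                           // 2) destroying an array_view also flushes its data back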
Not Covered
• Math library
  • e.g. acosf
• Atomic operations library
  • e.g. atomic_fetch_add
• Direct3D intrinsics
  • debugging (e.g. direct3d_printf), fences (e.g. __dp_d3d_all_memory_fence), float math (e.g. __dp_d3d_absf)
• Direct3D interop
  • *get_device, create_accelerator_view, make_array, *get_buffer

Visual Studio 11
• Organize, Edit, Design, Build, Browse, Debug, Profile
NB: VS Express does not contain C++ AMP! If you install Windows 8, make sure to install the VS 11 Ultimate Developer Preview.

C++ AMP Parallel Debugger
• Well-known Visual Studio debugging features
  • Launch, Attach, Break, Stepping, Breakpoints, DataTips
  • KED note: I could not break on a statement in the kernel
• Tool windows
  • Processes, Debug Output, Modules, Disassembly, Call Stack, Memory, Registers, Locals, Watch, Quick Watch
• New features (for both CPU and GPU)
  • Parallel Stacks window, Parallel Watch window, Barrier
• New GPU-specific features
  • Emulator, GPU Threads window, race detection

Visual Studio 11 – Profiler
• Labels the statements that are most expensive in a routine
• Does not seem to label kernel code

Concurrency Visualizer for GPU
• Direct3D-centric
• Supports any library/programming model built on it
• Integrated GPU and CPU view
• Goal is to analyze high-level performance metrics
  • Memory copy overheads
  • Synchronization overheads across CPU/GPU
  • GPU activity and contention with other processes
• KED note: Where is my GPU???

Summary
• Democratization of parallel hardware programmability
• Performance for the mainstream
• High-level abstractions in C++ (not C)
• State-of-the-art Visual Studio IDE
• Hardware abstraction platform
• The intent is to make C++ AMP an open specification

Ken's blog comparing C++ AMP, CUDA, OpenCL
• http://domemtech.com/?p=1025
Daniel Moth's blog (PM of C++ AMP)
• http://www.danielmoth.com/Blog/
MSDN Native Parallelism blog (team blog)
• http://blogs.msdn.com/b/nativeconcurrency/
MSDN Dev Center for Parallel Computing
• http://msdn.com/concurrency
MSDN Forums to ask questions
• http://social.msdn.microsoft.com/Forums/en/parallelcppnative/threads