GPU computing with C++ AMP
Kenneth Domino
Domem Technologies, Inc.
http://domemtech.com
October 24, 2011
NB: This presentation is based on Daniel Moth’s presentation at
http://channel9.msdn.com/Events/BUILD/BUILD2011/TOOL-802T
CPUs vs GPUs today

CPU
• Low memory bandwidth
• Higher power consumption
• Medium level of parallelism
• Deep execution pipelines
• Random accesses
• Supports general code
• Mainstream programming
GPU
• High memory bandwidth
• Lower power consumption
• High level of parallelism
• Shallow execution pipelines
• Sequential accesses
• Supports data-parallel code
• Niche programming
Tomorrow…
• CPUs and GPUs coming closer together…
• …nothing settled in this space, things still in motion…
• C++ AMP is designed as a mainstream solution, not only for today but also for tomorrow
(image source: AMD)
C++ AMP
• Part of Visual C++
• Visual Studio integration
• STL-like library for multidimensional data
• Builds on Direct3D
• performance, productivity, portability
C++ AMP vs. CUDA vs. OpenCL
• OpenCL: stone knives and bearskins (C99!! YUCK!)
• CUDA: wow, classes!
• C++ AMP: lambda functions (1936, Church; back to the future)
Hello World: Array Addition

void AddArrays(int n, int * pA, int * pB, int * pC)
{
    for (int i = 0; i < n; i++)
    {
        pC[i] = pA[i] + pB[i];
    }
}

How do we take this serial code, which runs on the CPU, and convert it to run on an accelerator like the GPU?
Hello World: Array Addition

#include <amp.h>
using namespace concurrency;

void AddArrays(int n, int * pA, int * pB, int * pC)
{
    array_view<int,1> a(n, pA);
    array_view<int,1> b(n, pB);
    array_view<int,1> sum(n, pC);

    parallel_for_each(
        sum.grid,
        [=](index<1> i) restrict(direct3d)
        {
            sum[i] = a[i] + b[i];
        }
    );
}
Basic Elements of C++ AMP coding

void AddArrays(int n, int * pA, int * pB, int * pC)
{
    // array_view: wraps the data to operate on the accelerator
    array_view<int,1> a(n, pA);
    array_view<int,1> b(n, pB);
    array_view<int,1> sum(n, pC);

    // parallel_for_each: execute the lambda on the accelerator, once per thread
    parallel_for_each(
        sum.grid,                              // grid: the number and shape of threads to execute the lambda
        [=](index<1> idx) restrict(direct3d)   // restrict(direct3d): tells the compiler to check that this code
        {                                      // can execute on Direct3D hardware (aka the accelerator)
            sum[idx] = a[idx] + b[idx];        // index: the thread ID running the lambda, used to index into data
        }
    );
    // array_view variables captured by the lambda have their associated data
    // copied to the accelerator on demand.
}
grid<N>, extent<N>, and index<N>
• index<N>
• represents an N-dimensional point
• extent<N>
• number of units in each dimension of an N-dimensional space
• grid<N>
• origin (index<N>) plus extent<N>
• N can be any number
Examples: grid, extent, and index

index<1> i(2);        extent<1> e(6);        grid<1> g(e);
index<2> i(0,2);      extent<2> e(3,4);      grid<2> g(e);
index<3> i(2,0,1);    extent<3> e(3,2,2);    grid<3> g(e);
array<T,N>
• Multi-dimensional array of rank N with element type T
• Storage lives on the accelerator

vector<int> v(96);
extent<2> e(8,12);      // e[0] == 8; e[1] == 12;
array<int,2> a(e, v.begin(), v.end());

// in the body of my lambda
index<2> i(3,9);        // i[0] == 3; i[1] == 9;
int o = a[i];           // or a[i] = 16;
//int o = a(i[0], i[1]);

[figure: an 8 x 12 array, highlighting the element at index (3,9)]
array_view<T,N>
• View on existing data on the CPU or GPU
• array_view<T,N>
• array_view<const T,N>
vector<int> v(10);
extent<2> e(2,5);
array_view<int,2> a(e, v);
// the above two lines can also be written as
// array_view<int,2> a(2,5,v);
Data Classes Comparison

array<T,N>
• Rank at compile time
• Extent at runtime
• Rectangular
• Dense
• Container for data
• Explicit copy
• Capture by reference [&]

array_view<T,N>
• Rank at compile time
• Extent at runtime
• Rectangular
• Dense in one dimension
• Wrapper for data
• Implicit copy
• Capture by value [=]

KED Note: array_view<> seems faster than array<>. Could it be because of the on-demand copy feature of array_view?
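To make the copy-semantics difference concrete, here is a minimal sketch against the Developer Preview API used throughout this deck (grid members, restrict(direct3d)). The Square() function is my own illustration, not from the deck, and the use of concurrency::copy() for the explicit copy-out is an assumption.

#include <amp.h>
#include <vector>
using namespace concurrency;

void Square(std::vector<int>& v)
{
    extent<1> ex((int)v.size());

    // array<>: storage lives on the accelerator; data must be copied explicitly.
    array<int,1> arr(ex, v.begin(), v.end());          // explicit copy in
    parallel_for_each(arr.grid, [&arr](index<1> i) restrict(direct3d) {
        arr[i] = arr[i] * arr[i];                       // array is captured by reference
    });
    copy(arr, v.begin());                               // explicit copy back out

    // array_view<>: wraps the existing CPU data; copies happen on demand.
    array_view<int,1> av((int)v.size(), v);
    parallel_for_each(av.grid, [=](index<1> i) restrict(direct3d) {
        av[i] = av[i] * av[i];                          // array_view is captured by value
    });
    av.synchronize();                                   // make results visible on the CPU
}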
parallel_for_each
• Executes the lambda for each point in the extent
• As-if synchronous in terms of visible side-effects
parallel_for_each(
    g,                                  // g is of type grid<N>
    [ ](index<N> idx) restrict(direct3d)
    {
        // kernel code
    }
);
restrict(…)
• Applies to functions (including lambdas)
• Why restrict
• Target-specific language restrictions
• Optimizations or special code-gen behavior
• Future-proofing
• Functions can have multiple restrictions
• In 1st release we are implementing direct3d and cpu
• cpu – the implicit default
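As a minimal sketch of a multiply-restricted function (the clamp01 helper is illustrative, not from the deck), a single body can be compiled for both the cpu and direct3d targets:

// Hypothetical helper: one definition usable from ordinary CPU code
// and from inside a restrict(direct3d) kernel.
float clamp01(float x) restrict(cpu, direct3d)
{
    return x < 0.0f ? 0.0f : (x > 1.0f ? 1.0f : x);
}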
restrict(direct3d) restrictions
• Can only call other restrict(direct3d) functions
• All functions must be inlinable
• Only direct3d-supported types
• int, unsigned int, float, double
• structs & arrays of these types
• Pointers and References
• Lambdas cannot capture by reference¹, nor capture pointers
• References and single-indirection pointers supported only as local variables and function arguments
restrict(direct3d) restrictions
• No recursion
• No 'volatile'
• No virtual functions
• No pointers to functions
• No pointers to member functions
• No pointers in structs
• No pointers to pointers
• No goto or labeled statements
• No throw, try, catch
• No globals or statics
• No dynamic_cast or typeid
• No asm declarations
• No varargs
• No unsupported types
  • e.g. char, short, long double
Example: restrict overloading

double bar( double ) restrict(cpu, direct3d);   // 1: same code for both
double cos( double );                           // 2a: general code
double cos( double ) restrict(direct3d);        // 2b: specific code

void SomeMethod(array<double,2> c) {
    parallel_for_each( c.grid, [&c](index<2> idx) restrict(direct3d)
    {
        //…
        double d1 = bar(c[idx]);   // ok
        double d2 = cos(c[idx]);   // ok, chooses direct3d overload
        //…
    });
}
accelerator, accelerator_view
• accelerator
  • e.g. DX11 GPU, REF
  • e.g. CPU
• accelerator_view
  • a context for scheduling and memory management

[figure: host connected to an accelerator (GPU example) over PCIe]
Example: accelerator

// Identify an accelerator based on Windows device ID
accelerator myAcc("PCI\\VEN_1002&DEV_9591&CC_0300");
// …or enumerate all accelerators (not shown)

// Allocate an array on my accelerator
array<int> myArray(10, myAcc.default_view);

// …or launch a kernel on my accelerator
parallel_for_each(myAcc.default_view, myArrayView.grid, ...);
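For the enumeration step the deck leaves out, here is a sketch. Note the assumption: accelerator::get_all() and the members used below are the released C++ AMP API; the Developer Preview this deck targets may differ, and the PickAccelerator() helper is mine.

#include <amp.h>
#include <vector>
#include <iostream>
using namespace concurrency;

accelerator PickAccelerator()
{
    std::vector<accelerator> all = accelerator::get_all();   // every accelerator Windows reports
    for (const accelerator& acc : all)
        std::wcout << acc.device_path << L"  "
                   << acc.description << L"  "
                   << acc.dedicated_memory << L" KB\n";
    // Fall back to the default accelerator if nothing was found.
    return all.empty() ? accelerator() : all.front();
}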
C++ AMP at a Glance (so far)
• restrict(direct3d, cpu)
• parallel_for_each
• class array<T,N>
• class array_view<T,N>
• class index<N>
• class extent<N>, grid<N>
• class accelerator
• class accelerator_view
Achieving maximum performance gains
• Schedule threads in tiles
  • Avoid thread index remapping
  • Gain ability to use tile_static memory

extent<2> e(8,6);
grid<2> g(e);
g.tile<4,3>()
g.tile<2,2>()

[figure: an 8 x 6 grid of threads partitioned into 4 x 3 tiles and into 2 x 2 tiles]

• parallel_for_each overload for tiles accepts
  • tiled_grid<D0> or tiled_grid<D0, D1> or tiled_grid<D0, D1, D2>
  • a lambda which accepts
  • tiled_index<D0> or tiled_index<D0, D1> or tiled_index<D0, D1, D2>
tiled_grid, tiled_index
• Given

array_view<int,2> data(8, 6, p_my_data);
parallel_for_each(
    data.grid.tile<2,2>(),
    [=] (tiled_index<2,2> t_idx) … { … });

• When the lambda is executed by thread T (see figure)
  • t_idx.global      = index<2> (6,3)
  • t_idx.local       = index<2> (0,1)
  • t_idx.tile        = index<2> (3,1)
  • t_idx.tile_origin = index<2> (6,2)

[figure: the 8 x 6 grid with thread T and its 2 x 2 tile highlighted]
tile_static, tile_barrier
• Within the tiled parallel_for_each lambda we can use
• tile_static storage class for local variables
• indicates that the variable is allocated in fast cache memory
• i.e. shared by each thread in a tile of threads
• only applicable in restrict(direct3d) functions
• class tile_barrier
• synchronize all threads within a tile
• e.g. t_idx.barrier.wait();
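As a minimal sketch of these two features (the TileSums() function and its names are illustrative, not from the deck; it assumes the input length is an exact multiple of the tile size, as C++ AMP requires):

#include <amp.h>
using namespace concurrency;

// Each tile of 256 threads cooperatively sums its own 256 inputs
// into one entry of partialSums (one entry per tile).
void TileSums(const array_view<const int,1>& input, const array_view<int,1>& partialSums)
{
    static const int TS = 256;
    parallel_for_each(input.grid.tile<TS>(),
        [=](tiled_index<TS> t_idx) restrict(direct3d)
    {
        tile_static int shared[TS];                     // one copy per tile, in fast memory
        shared[t_idx.local[0]] = input[t_idx.global];   // each thread loads one element
        t_idx.barrier.wait();                           // wait until the whole tile has loaded

        if (t_idx.local[0] == 0)                        // thread 0 of the tile adds them up
        {
            int sum = 0;
            for (int i = 0; i < TS; i++) sum += shared[i];
            partialSums[t_idx.tile[0]] = sum;
        }
    });
}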
C++ AMP at a Glance
• restrict(direct3d, cpu)
• parallel_for_each
• class array<T,N>
• class array_view<T,N>
• class index<N>
• class extent<N>, grid<N>
• class accelerator
• class accelerator_view
• tile_static storage class
• class tiled_grid< , , >
• class tiled_index< , , >
• class tile_barrier
EXAMPLES!!!
• The beloved, ubiquitous matrix multiplication
• Bitonic sort
Example: Matrix Multiplication

// Serial version
void Multiply_Serial(Matrix * C, Matrix * A, Matrix * B)
{
    int wA = A->cols; int hA = A->rows;
    int wB = B->cols; int hB = B->rows;
    int wC = C->cols; int hC = C->rows;
    for (int gr = 0; gr < hA; ++gr)          // row
        for (int gc = 0; gc < wB; ++gc) {    // col
            float sum = 0;
            for (int k = 0; k < hB; ++k)
                sum += Data(A)[gr * wA + k] * Data(B)[k * wB + gc];
            Data(C)[gr * wC + gc] = sum;
        }
}

// Simple C++ AMP version
void MultiplySimple(Matrix * C, Matrix * A, Matrix * B)
{
    int wA = A->cols; int hA = A->rows;
    int wB = B->cols; int hB = B->rows;
    int wC = C->cols; int hC = C->rows;
    array_view<const float,1> a(hA * wA, Data(A));
    array_view<const float,1> b(hB * wB, Data(B));
    array_view<writeonly<float>,1> c(hC * wC, Data(C));
    extent<2> e(hC, wC);
    grid<2> g(e);
    parallel_for_each(g,
        [=](index<2> idx) restrict(direct3d) {
            int gr = idx[0];
            int gc = idx[1];
            float sum = 0.0f;
            for (int k = 0; k < hB; k++)
                sum += a[gr * wA + k] * b[k * wB + gc];
            c[gr * wC + gc] = sum;
        });
}
Example: Matrix Multiplication (tiled, shared memory)

void MultiplyTiled(Matrix * C, Matrix * A, Matrix * B)
{
    int wA = A->cols; int hA = A->rows;
    int wB = B->cols; int hB = B->rows;
    int wC = C->cols; int hC = C->rows;
    array_view<const float,1> a(hA * wA, Data(A));
    array_view<const float,1> b(hB * wB, Data(B));
    array_view<writeonly<float>,1> c(hC * wC, Data(C));
    extent<2> e(hC, wC);
    grid<2> g(e);
    const int TS = 16;
    parallel_for_each(g.tile<TS,TS>(),
        [=](tiled_index<TS,TS> idx) restrict(direct3d) {
            int lr = idx.local[0];  int lc = idx.local[1];
            int gr = idx.global[0]; int gc = idx.global[1];
            float sum = 0.0f;
            for (int i = 0; i < hB; i += TS) {
                tile_static float locA[TS][TS], locB[TS][TS];
                locA[lr][lc] = a[gr * wA + lc + i];
                locB[lr][lc] = b[(lr + i) * wB + gc];
                idx.barrier.wait();
                for (int k = 0; k < TS; k++)
                    sum += locA[lr][k] * locB[k][lc];
                idx.barrier.wait();
            }
            c[gr * wC + gc] = sum;
        });
}
Performance

C++ AMP performs as well as CUDA or OpenCL for matrix multiplication.

Implementation   Sequential        Implicit (unspecified tile)   Explicit (16 x 16 tile)   Shared mem (16 x 16 tile)
C++ AMP          1.317 ± 0.006 s   0.035 ± 0.008 s               0.030 ± 0.001 s           0.015 ± 0.002 s
CUDA             1.454 ± 0.008 s   n.a.                          0.046 ± 0.001 s           0.0150 ± 0.0003 s
OpenCL           1.448 ± 0.003 s   n.a.                          0.061 ± 0.002 s           0.033 ± 0.002 s

Random matrices A (480 x 640) x B (640 x 960) = C (480 x 960), single-precision floats.
Environment: NVIDIA GeForce GTX 470, an ATI Radeon HD 6450 (not used), and an Intel Q6600 @ 2.51 GHz (overclocked),
4 GB RAM, Windows 7 64-bit. 10 runs, first run thrown out (for a total of 9). Average ± S.E. of sample.
NB: this comparison is between a Developer Preview (C++ AMP) and RTM releases (CUDA, OpenCL).
Bitonic sort

[figure: the bitonic sorting network, developed step by step over several slides]
Bitonic sort

void bitonicSortSequential(int * data, int length)
{
    unsigned int log2length = log2(length);
    unsigned int checklength = pow2(log2length);
    for (int phase = 0; phase < log2length; ++phase)
    {
        int compares = length / 2;
        unsigned int phase2 = pow2((unsigned int)phase);
        for (int ig = 0; ig < compares; ++ig) {
            int cross, paired; orange_box(ig, phase2, cross, paired);
            swap(data[cross], data[paired]);
        }
        for (int level = phase-1; level >= 0; --level)
        {
            unsigned int level2 = pow2((unsigned int)level);
            for (int ig = 0; ig < compares; ++ig) {
                int cross, paired; red_box(ig, level2, cross, paired);
                swap(data[cross], data[paired]);
            }
        }
    }
}

void bitonicSortSimple(int * data, int length)
{
    unsigned int log2length = log2(length);
    unsigned int checklength = pow2(log2length);
    static const int TS = 1024;
    array_view<int,1> a(length, data);
    for (int phase = 0; phase < log2length; ++phase)
    {
        int compares = length / 2;
        extent<1> e(compares); grid<1> g(e);
        unsigned int phase2 = pow2((unsigned int)phase);
        parallel_for_each(g.tile<TS>(),
            [phase2, a](tiled_index<TS> idx) restrict(direct3d) {
                int ig = idx.global[0];
                int cross, paired; orange_box(ig, phase2, cross, paired);
                swap(a[cross], a[paired]);
            });
        for (int level = phase-1; level >= 0; --level)
        {
            unsigned int level2 = pow2((unsigned int)level);
            parallel_for_each(g.tile<TS>(),
                [level2, a](tiled_index<TS> idx) restrict(direct3d) {
                    int ig = idx.global[0];
                    int cross, paired; red_box(ig, level2, cross, paired);
                    swap(a[cross], a[paired]);
                });
        }
    }
}
Performance

C++ AMP performs almost as well as CUDA for (naïve) bitonic sort.

Implementation   Sequential       Explicit (1024 tile)
C++ AMP          12.5 ± 0.4 s     0.42 ± 0.04 s
CUDA             12.6 ± 0.01 s    0.372 ± 0.002 s

Array of length 1024 * 32 * 16 * 16 = 8388608 integers.
Environment: NVIDIA GeForce GTX 470, an ATI Radeon HD 6450 (not used), and an Intel Q6600 @ 2.51 GHz (overclocked),
4 GB RAM, Windows 7 64-bit. 10 runs, first run thrown out (for a total of 9). Average ± S.E. of sample.
Bitonic sort with subsets of bitonic merge

[figure: bitonic sorting network with subsets of the bitonic merge assigned to single threads]

Peters, H., Schulz-Hildebrandt, O., et al. (2011). "Fast in-place, comparison-based sorting with CUDA: a study with bitonic sort." Concurrency and Computation: Practice and Experience 23(7): 681-693.
Optimized bitonic sort
• Optimization: compute multiple compares per thread using subsets of the sorted merge of a given degree.
• Load the portion of the subset bitonic merge into registers.
• Perform swaps in registers.
• Store registers back to global memory.
• NB: Must manually unroll the loop; there is a bug in the compiler with #pragma unroll!

for (int i = 0; i < msize; ++i)
    mm[i] = a[base + memory[i]];
for (int i = 0; i < csize; i += 2)
{
    int cross = normalized_compares[i];
    int paired = normalized_compares[i+1];
    swap(mm[cross], mm[paired]);
}
for (int i = 0; i < msize; ++i)
    a[base + memory[i]] = mm[i];
Performance

C++ AMP performs almost as well as CUDA for the optimized bitonic sort.

Implementation   Explicit (512 tile)   Degree opt., explicit (512 tile)
C++ AMP          0.285 ± 0.4 s         0.239 ± 0.001 s
CUDA             0.280 ± 0.001 s       0.187 ± 0.001 s

Array of length 1024 * 32 * 16 * 16 = 8388608 integers.
Environment: NVIDIA GeForce GTX 470, an ATI Radeon HD 6450 (not used), and an Intel Q6600 @ 2.51 GHz (overclocked),
4 GB RAM, Windows 7 64-bit. 10 runs, first run thrown out (for a total of 9). Average ± S.E. of sample.
Issues

All GPU memory must be an array_view<> (or array<>), so you must use an array index:

    foo_bar[…] = …;

A pointer to a struct/class is unavailable, but the struct/class itself is OK:

    class foo {};
    foo f;
    …
    parallel_for_each…
        f.bar();      // cannot convert 'this' pointer from 'const foo' to 'foo &'

    foo f;
    array_view<foo,1> xxx(1, &f);
    parallel_for_each…
        xxx[0].bar(); // OK, but must use []'s -- YUCK!
Issues

It is easy to forget that after parallel_for_each, results must be fetched via
1) a call to array_view::synchronize() on the variable;
2) destruction of the array_view variable; or
3) an explicit copy for array<>.

Take care in comparing CUDA/OpenCL with C++ AMP:
1) C++ AMP may not pick the optimal accelerator if you have two GPUs;
2) C++ AMP picks the tile size if you don't specify it.
If you do pick a tile size, the grid extent must be an exact multiple of it!
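A minimal sketch of the first point (the Scale() function is my own illustration, not from the deck): results written through an array_view are not guaranteed to be visible on the CPU until the view is synchronized, destroyed, or explicitly copied.

#include <amp.h>
using namespace concurrency;

void Scale(float * data, int n)
{
    array_view<float,1> av(n, data);
    parallel_for_each(av.grid, [=](index<1> i) restrict(direct3d) {
        av[i] *= 2.0f;                 // runs on the accelerator
    });
    // Without one of the following, 'data' may still hold stale values:
    av.synchronize();                  // 1) explicit synchronize
    // 2) or let 'av' go out of scope (its destructor synchronizes)
    // 3) or, for array<>, do an explicit copy back to CPU memory
}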
Not Covered
• Math library
  • e.g. acosf
• Atomic operation library
  • e.g. atomic_fetch_add
• Direct3D intrinsics
  • debugging (e.g. direct3d_printf), fences (e.g. __dp_d3d_all_memory_fence), float math (e.g. __dp_d3d_absf)
• Direct3D Interop
  • *get_device, create_accelerator_view, make_array, *get_buffer
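As a taste of the atomic library the deck names but does not cover, here is a minimal histogram sketch. Assumptions: the atomic_fetch_add(int*, int) signature is from the released C++ AMP library and may differ in the Developer Preview; the Histogram() function and its names are mine.

#include <amp.h>
using namespace concurrency;

// values holds integers in [0, 256); bins is 256 zero-initialized counters.
void Histogram(const array_view<const int,1>& values,
               const array_view<int,1>& bins)
{
    parallel_for_each(values.grid, [=](index<1> i) restrict(direct3d) {
        // Many threads may hit the same bin, so the increment must be atomic.
        atomic_fetch_add(&bins[values[i]], 1);
    });
    bins.synchronize();   // make the counts visible on the CPU
}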
Visual Studio 11
• Organize
• Edit
• Design
• Build
• Browse
• Debug
• Profile

NB: VS Express does not contain C++ AMP! If you install Windows 8, make sure to install the VS Ultimate 11 Developer Preview.
C++ AMP Parallel Debugger
• Well known Visual Studio debugging features
  • Launch, Attach, Break, Stepping, Breakpoints, DataTips
• Tool windows
  • Processes, Debug Output, Modules, Disassembly, Call Stack, Memory, Registers, Locals, Watch, Quick Watch
• New features (for both CPU and GPU)
  • Parallel Stacks window, Parallel Watch window, Barrier
• New GPU-specific features
  • Emulator, GPU Threads window, race detection

KED Note: I could not break on a statement in the kernel.
Visual Studio 11 - Profiler
• Labels the statements that are most expensive in a routine.
• Does not seem to label kernel code.

NB: VS Express does not contain C++ AMP! If you install Windows 8, make sure to install the VS Ultimate 11 Developer Preview.
Concurrency Visualizer for GPU
• Direct3D-centric
• Supports any library/programming model built on it
• Integrated GPU and CPU view
• Goal is to analyze high-level performance metrics
• Memory copy overheads
• Synchronization overheads across CPU/GPU
• GPU activity and contention with other processes
Where is my GPU???
Summary
• Democratization of parallel hardware programmability
• Performance for the mainstream
• High-level abstractions in C++ (not C)
• State-of-the-art Visual Studio IDE
• Hardware abstraction platform
• Intent is to make C++ AMP an open specification
Ken’s blog comparing C++ AMP, CUDA, OpenCL
• http://domemtech.com/?p=1025
Daniel Moth's blog (PM of C++ AMP)
• http://www.danielmoth.com/Blog/
MSDN Native parallelism blog (team blog)
• http://blogs.msdn.com/b/nativeconcurrency/
MSDN Dev Center for Parallel Computing
• http://msdn.com/concurrency
MSDN Forums to ask questions
• http://social.msdn.microsoft.com/Forums/en/parallelcppnative/threads
Download