CUDA Threads and Atomics
Prepared 8/8/2011 by T. O’Neil for 3460:677, Fall 2011, The
University of Akron.
 The Problem: how do you do global communication?
 Finish a kernel and start a new one
 All writes from all threads complete before a kernel
finishes
step1<<<grid1,blk1>>>(…);
// The system ensures that all
// writes from step 1 complete
step2<<<grid2,blk2>>>(…);
 Would need to decompose kernels into before and after
parts
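 A minimal sketch of such a decomposition (not from the original slides; names such as partial_sum are hypothetical) – step 2 can safely read every result written by step 1:
// Kernel 1: each block writes one partial result to global memory
__global__ void partial_sum(const float* data, float* partial)
{
  // for brevity, one thread per block serially reduces the block’s chunk
  if (threadIdx.x == 0) {
    float s = 0.0f;
    for (int j = 0; j < blockDim.x; ++j)
      s += data[blockIdx.x * blockDim.x + j];
    partial[blockIdx.x] = s;
  }
}

// Kernel 2: launched afterwards, so all writes to partial[] are complete
__global__ void final_sum(const float* partial, float* result, int nblocks)
{
  if (threadIdx.x == 0 && blockIdx.x == 0) {
    float s = 0.0f;
    for (int j = 0; j < nblocks; ++j)
      s += partial[j];
    *result = s;
  }
}

// host side:
// partial_sum<<<grid, blk>>>(d_data, d_partial);
// final_sum<<<1, 1>>>(d_partial, d_result, grid.x);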
CUDA Threads and Atomics – Slide 2
 Alternatively, write to a predefined memory location
 Race condition! Updates can be lost
// vector[0] was equal to zero

threadID: 0              threadID: 1917

vector[0] += 5;          vector[0] += 1;
…                        …
a = vector[0];           a = vector[0];
 What is the value of a in thread 0? In thread 1917?
CUDA Threads and Atomics – Slide 3
// vector[0] was equal to zero

threadID: 0              threadID: 1917

vector[0] += 5;          vector[0] += 1;
…                        …
a = vector[0];           a = vector[0];
 Thread 0 could have finished execution before 1917
started
 Or the other way around
 Or both are executing at the same time
 Answer: not defined by the programming model; can be
arbitrary
CUDA Threads and Atomics – Slide 4
 CUDA provides atomic operations to deal with this
problem
 An atomic operation guarantees that only a single thread
has access to a piece of memory while an operation
completes
 The name atomic comes from the fact that it is
uninterruptible
 No dropped data, but ordering is still arbitrary
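 A sketch (not in the original slides) of the earlier vector[0] example rewritten with atomicAdd – neither update is lost, though the ordering is still arbitrary:
__global__ void update(int* vector, int* old_vals)
{
  int tid = threadIdx.x + blockDim.x * blockIdx.x;
  if (tid == 0)
    old_vals[0] = atomicAdd(&vector[0], 5);  // returns the value before the add
  if (tid == 1917)
    old_vals[1] = atomicAdd(&vector[0], 1);
  // vector[0] ends up as 6 either way; which old value each thread saw
  // (0 or 1 for thread 0, 0 or 5 for thread 1917) still depends on ordering
}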
CUDA Threads and Atomics – Slide 5
 CUDA provides atomic operations to deal with this
problem
 Requires hardware with compute capability 1.1 and above
 Different types of atomic instructions
 Addition/subtraction: atomicAdd, atomicSub
 Minimum/maximum: atomicMin, atomicMax
 Conditional increment/decrement: atomicInc,
atomicDec
 Exchange/compare-and-swap: atomicExch, atomicCAS
 More types on Fermi: atomicAnd, atomicOr,
atomicXor
CUDA Threads and Atomics – Slide 6
// Determine frequency of colors in a picture
// colors have already been converted into integers
// Each thread looks at one pixel and increments
// a counter atomically
__global__ void histogram(int* colors, int* buckets)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  int c = colors[i];
  atomicAdd(&buckets[c], 1);
}
 atomicAdd returns the previous value stored at the
address
 Useful for grabbing variable amounts of data from a
shared list, as sketched below
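 A sketch of that idea (hypothetical names, not from the original slides): each thread reserves a variable number of slots by atomically advancing a shared counter, and the returned old value is the start of its private region:
__global__ void append_items(int* out_list, int* out_count,
                             const int* items, const int* item_counts)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  int n = item_counts[i];               // amount of data this thread produces
  int base = atomicAdd(out_count, n);   // reserve n slots; old count = start index
  for (int j = 0; j < n; ++j)
    out_list[base + j] = items[i];      // fill the reserved region
}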
CUDA Threads and Atomics – Slide 7
// For algorithms where the amount of work per item
// is highly non-uniform, it often makes sense for
// threads to continuously grab work from a queue
__global__ void workq(int* work_q, unsigned int* q_counter,
                      int* output, unsigned int queue_max)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  // atomicInc takes unsigned arguments; the counter wraps to 0 after queue_max
  unsigned int q_index = atomicInc(q_counter, queue_max);
  int result = do_work(work_q[q_index]);  // do_work: placeholder for the real work
  output[i] = result;
}
CUDA Threads and Atomics – Slide 8
int atomicCAS(int* address, int compare, int val)
 if compare equals the old value stored at address, then
val is stored instead
 in either case, the routine returns the old value
 seems a bizarre routine at first sight, but it can be very
useful for atomic locks
CUDA Threads and Atomics – Slide 9
int atomicCAS(int* address, int compare, int val)
{
  int old_reg_val = *address;
  if (old_reg_val == compare) *address = val;
  return old_reg_val;
}
 Most general type of atomic
 Can emulate all others with CAS
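 For example (a standard pattern, not from the original slides), atomicAdd can be emulated with a compare-and-swap retry loop; the helper name my_atomic_add is just for illustration:
__device__ int my_atomic_add(int* address, int val)
{
  int old = *address, assumed;
  do {
    assumed = old;
    old = atomicCAS(address, assumed, assumed + val);
  } while (old != assumed);  // another thread changed the value first: retry
  return old;                // the value before our add, just like atomicAdd
}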
CUDA Threads and Atomics – Slide 10
 Atomics are slower than normal load/store
 Most of these are associative operations on
signed/unsigned integers:
 quite fast for data in shared memory
 slower for data in device memory
 You can have the whole machine queuing on a single
location in memory
 Atomics unavailable on G80!
CUDA Threads and Atomics – Slide 11
// If you require the maximum across all threads
// in a grid, you could do it with a single global
// maximum value, but it will be VERY slow
__global__ void global_max(int* values, int* gl_max)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  int val = values[i];
  atomicMax(gl_max, val);
}
CUDA Threads and Atomics – Slide 12
// introduce intermediate maximum results, so that
// most threads do not try to update the global max
__global__ void global_max(int* values, int* gl_max,
                           int* regional_maxes,
                           int num_regions)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  int val = values[i];
  int region = i % num_regions;
  if (atomicMax(&regional_maxes[region], val) < val)
  {
    atomicMax(gl_max, val);
  }
}
CUDA Threads and Atomics – Slide 13
 Single value causes serial bottleneck
 Create hierarchy of values for more parallelism
 Performance will still be slow, so use judiciously
CUDA Threads and Atomics – Slide 14
 Can’t use normal load/store for inter-thread
communication because of race conditions
 Use atomic instructions for sparse and/or
unpredictable global communication
 Decompose data (very limited use of single global
sum/max/min/etc.) for more parallelism
CUDA Threads and Atomics – Slide 15
 How a streaming multiprocessor (SM) executes
threads
 Overview of how a streaming multiprocessor works
 SIMT Execution
 Divergence
CUDA Threads and Atomics – Slide 16
[Figure: thread blocks (5, 27, 61, 2001, …) being scheduled onto a streaming multiprocessor]
 Hardware schedules thread blocks onto available SMs
 No guarantee of ordering among thread blocks
 Hardware will schedule a new thread block as soon as a
previous thread block finishes
CUDA Threads and Atomics – Slide 17
[Figure: control units and ALUs – a single control unit drives a group of ALUs]
 A warp = 32 threads launched together
 Usually execute together as well
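 A common convention (not part of the original slide, and assuming 1-D thread blocks) for finding a thread’s warp and its lane within that warp:
__global__ void warp_info(int* warp_of, int* lane_of)
{
  int tid  = threadIdx.x + blockDim.x * blockIdx.x;
  int warp = threadIdx.x / 32;  // which warp of the block this thread belongs to
  int lane = threadIdx.x % 32;  // position of the thread within its warp
  warp_of[tid] = warp;
  lane_of[tid] = lane;
}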
CUDA Threads and Atomics – Slide 18
 Each thread block is mapped to one or more warps
 The hardware schedules each warp independently
[Figure: Thread Block N (128 threads) mapped onto warps TB N W1 – TB N W4]
CUDA Threads and Atomics – Slide 19
 SM implements zero-overhead warp scheduling
 At any time, only one of the warps is executed by an SM
 Warps whose next instruction has its inputs ready for
consumption are eligible for execution
 Eligible warps are selected for execution based on a
prioritized scheduling policy
 All threads in a warp execute the same instruction when
selected
[Figure: warp scheduling timeline – instructions from TB1 W1 execute until it stalls, then the SM switches to TB2 W1, then TB3 W1, then TB3 W2, and eventually back to TB1 W1, TB1 W2, TB1 W3, …; TB = Thread Block, W = Warp]
CUDA Threads and Atomics – Slide 20
 Threads are executed in warps of 32, with all threads in
the warp executing the same instruction at the same
time
 What happens if you have the following code?
if (foo(threadIdx.x))
{
  do_A();
}
else
{
  do_B();
}
CUDA Threads and Atomics – Slide 21
if (foo(threadIdx.x))
{
  do_A();
}
else
{
  do_B();
}
 This is called warp divergence – CUDA will generate
correct code to handle this, but to understand the
performance you need to understand what CUDA does
with it
CUDA Threads and Atomics – Slide 22
[Figure: a diverged warp serialises the branch – some lanes execute Path A while the rest are masked off, then the remaining lanes execute Path B; from Fung et al., MICRO ’07]
CUDA Threads and Atomics – Slide 23
 Nested branches are handled as well
if (foo(threadIdx.x))
{
  if (bar(threadIdx.x))
    do_A();
  else
    do_B();
}
else
  do_C();
CUDA Threads and Atomics – Slide 24
[Figure: nested divergence – the warp serialises over Path A, Path B and Path C in turn]
CUDA Threads and Atomics – Slide 25
 You don’t have to worry about divergence for
correctness
 Mostly true, except corner cases (for example intra-warp
locks)
 You might have to think about it for performance
 Depends on your branch conditions
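 A sketch of the intra-warp lock corner case (hypothetical kernel, not from the original slides): on hardware of this era, if two lanes of the same warp contend for the lock, the warp can get stuck because the lane holding the lock sits on the inactive branch path while the other lanes keep spinning:
__device__ int lock = 0;

__global__ void intra_warp_lock(void)
{
  while (atomicCAS(&lock, 0, 1) != 0)
    ;                      // spinning lanes keep the warp on this path
  // critical section: may never be reached if a same-warp lane holds the lock
  atomicExch(&lock, 0);    // release
}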
CUDA Threads and Atomics – Slide 26
 One solution: NVIDIA GPUs have predicated
instructions which are carried out only if a logical flag
is true.
 In the previous example, all threads compute the
logical predicate and then execute both predicated instructions
p = (foo(threadIdx.x));
p: do_A();
!p: do_B();
CUDA Threads and Atomics – Slide 27
 Performance drops off with the degree of divergence
switch (threadIdx.x % N) {
  case 0:
    ...
  case 1:
    ...
}

[Chart: Performance (y-axis, 0–35) versus Divergence (x-axis, 0–18); performance falls steadily as divergence grows]
CUDA Threads and Atomics – Slide 28
 Performance drops off with the degree of divergence
 In the worst case, effectively lose a factor of 32 in
performance if one thread needs an expensive branch
while the rest do nothing
 Another example: processing a long list of elements
where, depending on run-time values, a few require
very expensive processing
 GPU implementation:
 first process the list to build two sub-lists of “simple” and
“expensive” elements (as sketched below)
 then process the two sub-lists separately
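 A sketch of that first pass (hypothetical names; is_expensive stands in for whatever run-time test applies), using an atomic counter per sub-list:
__device__ bool is_expensive(int x) { return x < 0; }  // placeholder cost test

__global__ void split_work(const int* items, int n,
                           int* simple, int* n_simple,
                           int* expensive, int* n_expensive)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  if (i >= n) return;
  if (is_expensive(items[i]))
    expensive[atomicAdd(n_expensive, 1)] = items[i];
  else
    simple[atomicAdd(n_simple, 1)] = items[i];
}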
CUDA Threads and Atomics – Slide 29
 Already introduced __syncthreads(); which
forms a barrier – all threads wait until every one has
reached this point.
 When writing conditional code, must be careful to
make sure that all threads do reach the
__syncthreads();
 Otherwise, can end up in deadlock
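 A small sketch (not from the original slides) of the hazard and a safe alternative:
__global__ void barrier_example(float* data)
{
  // WRONG: a barrier inside a divergent branch –
  //   if (threadIdx.x < 32) {
  //     data[threadIdx.x] *= 2.0f;
  //     __syncthreads();   // threads with threadIdx.x >= 32 never arrive
  //   }

  // Safe: do the conditional work, then place the barrier where every
  // thread of the block reaches it
  if (threadIdx.x < 32)
    data[threadIdx.x] *= 2.0f;
  __syncthreads();
}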
CUDA Threads and Atomics – Slide 30
 Fermi supports some new synchronisation
instructions which are similar to __syncthreads()
but have extra capabilities:
 int __syncthreads_count(predicate)
 counts how many predicates are true
 int __syncthreads_and(predicate)
 returns non-zero (true) if all predicates are true
 int __syncthreads_or(predicate)
 returns non-zero (true) if any predicate is true
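 A sketch using __syncthreads_count (hypothetical names; assumes compute capability 2.0): count how many threads of the block saw a value above a threshold, then have one thread update a global total:
__global__ void count_hits(const float* values, int n,
                           float threshold, int* hits)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  int found = (i < n) && (values[i] > threshold);  // per-thread predicate
  int block_hits = __syncthreads_count(found);     // same value in every thread
  if (threadIdx.x == 0)
    atomicAdd(hits, block_hits);                   // one update per block
}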
CUDA Threads and Atomics – Slide 31
 There are similar warp voting instructions which
operate at the level of a warp:
 int __all(predicate)
 returns non-zero (true) if all predicates in warp are true
 int __any(predicate)
 returns non-zero (true) if any predicate is true
 unsigned int __ballot(predicate)
 sets the nth bit according to the predicate of the nth
thread in the warp
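 A sketch with the Fermi-era forms of these intrinsics (hypothetical names; later CUDA versions use __any_sync / __ballot_sync): skip a slow path only when no lane of the warp needs it, and record which lanes did:
__global__ void fixup(const float* x, float* out, unsigned int* masks)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  int negative = (x[i] < 0.0f);

  if (__any(negative)) {
    out[i] = negative ? -x[i] : x[i];  // at least one lane needs the fix-up
  } else {
    out[i] = x[i];                     // the whole warp takes the fast path
  }

  if (threadIdx.x % 32 == 0)           // lane 0 records the warp’s mask
    masks[i / 32] = __ballot(negative);
}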
CUDA Threads and Atomics – Slide 32
 Use very judiciously
 Always include a max_iter in your spinloop!
 Decompose your data and your locks
CUDA Threads and Atomics – Slide 33
// global variable: 0 unlocked, 1 locked
__device__ int lock = 0;

__global__ void kernel(...) {
  ...
  // set lock
  do {} while (atomicCAS(&lock, 0, 1));
  ...
  // free lock
  lock = 0;
}
 Problem: when a thread writes data to device memory
the order of completion is not guaranteed, so global
writes may not have completed by the time the lock is
unlocked
CUDA Threads and Atomics – Slide 34
// global variable: 0 unlocked, 1 locked
__device__ int lock = 0;

__global__ void kernel(...) {
  ...
  // set lock
  do {} while (atomicCAS(&lock, 0, 1));
  ...
  // free lock
  __threadfence();  // wait for writes to finish
  lock = 0;
}
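 Combining this with the earlier advice to bound the spinloop, a sketch (hypothetical names, not from the original slides) that reuses the lock declared above but gives up after max_iter attempts instead of spinning forever:
__global__ void bounded_lock(int max_iter, int* gave_up)
{
  int acquired = 0;
  for (int it = 0; it < max_iter; ++it) {
    if (atomicCAS(&lock, 0, 1) == 0) { acquired = 1; break; }
  }
  if (acquired) {
    // ... critical section ...
    __threadfence();          // wait for writes to finish, as above
    lock = 0;                 // free lock
  } else {
    atomicAdd(gave_up, 1);    // record that this thread hit max_iter
  }
}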
CUDA Threads and Atomics – Slide 35
 __threadfence_block();
 wait until all global and shared memory writes are
visible to all threads in the block
 __threadfence();
 wait until all global and shared memory writes are
visible to all threads in the block (and, for global data,
to all threads on the device)
CUDA Threads and Atomics – Slide 36
 Lots of esoteric capabilities – don’t worry about most of
them
 Essential to understand warp divergence – it can have a
very big impact on performance
 __syncthreads(); is vital
 The rest can be ignored until you have a critical need –
then read the documentation carefully and look for
examples in the SDK
CUDA Threads and Atomics – Slide 37
 Based on original material from
 Oxford University: Mike Giles
 Stanford University
 Jared Hoberock, David Tarjan
 Revision history: last updated 8/8/2011.
CUDA Threads and Atomics – Slide 38