Histograms and Sparse Array
Multiplication
Introduction to
CUDA Programming
Andreas Moshovos
Winter 2009
Based on documents from:
NVIDIA & Appendix A of the New P&H Book
Histogram
• E.g., given an image, calculate the distribution of its pixel values
Sequential Algorithm
for (int i = 0; i < BIN_COUNT; i++)
result[i] = 0;
for (int i = 0; i < dataN; i++)
result[data[i]]++;
Parallel Strategy
• Distribute work across multiple blocks
– Divide the input data among the blocks
• Each block processes its own portion
– Multiple threads, one per image pixel?
• #pixels / #threads pixels per thread
– Produces a partial histogram
• Could produce multiple histograms per block
• Merge all partial histograms
– Produces the final histogram (a serial merge sketch follows below)
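• The merge step, sketched as serial host-side code (hypothetical; partial[], NUM_BLOCKS, BIN_COUNT and result[] are illustrative names, not from the slides):
for (int bin = 0; bin < BIN_COUNT; bin++) {
    uint total = 0;
    for (int b = 0; b < NUM_BLOCKS; b++)
        total += partial[b][bin];   // add block b's count for this bin
    result[bin] = total;            // final histogram entry
}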
Data Structures and Access Patterns
result[data[i]]++;
• data[]:
– We control the access order: it can be read sequentially
– Each element is accessed only once
• result[]:
– Access order is data-dependent
– Each element may be accessed multiple times
• data[] goes in global memory
• result[] goes in shared memory
Sub-Histograms
• How many sub-histograms can we fit in shared memory?
– Input value range: 0-255, 1 byte
– Each histogram needs 256 entries
– How many bytes per entry?
• That's data dependent
– Let's assume 32 bits or 4 bytes
• 16KB (shared memory) / (256 x 4 bytes) (one histogram)
– 16 sub-histograms at any given point in time
– If one per thread then we have fewer threads than a warp
• Let's try one histogram per block
– many threads per block (see the sizing sketch below)
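• A quick sketch of the sizing arithmetic above (the 16 KB figure is the per-multiprocessor shared memory assumed on the slide):
size_t bytes_per_hist = 256 * sizeof(unsigned int);    // 256 bins x 4 bytes = 1 KB
size_t max_hists      = (16 * 1024) / bytes_per_hist;  // 16 KB / 1 KB = 16 sub-histograms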
Partial Histogram Data Structure
• Array in shared memory
• One per block
• One entry (bin) per possible pixel value
Algorithm Overview
• Step 1:
– Initialize partial histogram
– Each thread:
• s_Hist[index] = 0
• index += threads per block
• Step 2:
– Update histogram
– Each thread:
• read data[index]
• update s_Hist[]
• index += Total number of threads
• Step 3:
– Update global histogram
– Each thread
• read s_Hist[index]
• update global histogram
• index += threads per block
Simultaneous Updates?
• Threads in a block:
– update s_Hist[]
• All threads:
– update global histogram
• Without special support this becomes:
– register X = value of A
– X ++
– A = register X
• This is a read-modify-write sequence
The problem with simultaneous updates
• What if two threads each do the steps individually?
– Thread 1: r10 = mem[100], r10++, mem[100] = r10
– Thread 2: r100 = mem[100], r100++, mem[100] = r100
• With mem[100] starting at 10, both threads read 10 and both write 11
– mem[100] ends up as 11
• But we really wanted 12
• What if we had 32 threads running in parallel?
• Starting with 10 we would want: 10 + 32 = 42
– We may still get just 11
• Need to think about this:
– Special support: Atomic operations
Atomic Operations
• Read-Modify-Write operations that are
guaranteed to happen “atomically”
– Produces the same result as if the sequence
executed in isolation in time
– Think of it as “serializing the execution” of all
atomics
– This is not what happens – This is how you should
think about them
Atomic Operations
• Supported both in Shared and Global memory
• Example:
– atomicAdd (pointer, value)
– does: *pointer += value
• Atomic Operations
– Add, Sub, Inc, Dec
– Exch, Min, Max, CAS
– Bitwise: And, Or, Xor
• Work with (unsigned) integers
• atomicExch also works with single-precision floating point
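• A minimal sketch of atomicAdd on a global counter (illustrative kernel, not from the slides; d_count is assumed to point to a single unsigned int initialized to 0):
__global__ void count_kernel (unsigned int *d_count)
{
    // every thread increments the same counter; the read-modify-write
    // is performed atomically, so no update is lost
    atomicAdd (d_count, 1);
}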
atomicExch, atomicMin, atomicMax, atomicCAS
• atomicExch (pointer, value)
– tmp = * pointer
– *pointer = value
– return tmp
• atomicMin (pointer, value) (Max is similar)
– tmp = *pointer
– if (value < *pointer) *pointer = value
– return tmp
• atomicCAS (pointer, value1, value2)
– tmp = *pointer
– if (*pointer == value1) *pointer = value2
– return tmp
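• A small sketch of atomicCAS in use (illustrative names; d_flag is assumed to start at 0, and the single thread that swaps 0 -> 1 records its id):
__global__ void claim_kernel (unsigned int *d_flag, unsigned int *d_winner)
{
    // atomicCAS returns the previous value; only one thread can observe 0 here
    unsigned int old = atomicCAS (d_flag, 0u, 1u);
    if (old == 0)
        *d_winner = blockIdx.x * blockDim.x + threadIdx.x;
}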
atomicInc, atomicDec
• atomicInc (pointer, value)
– tmp = *pointer
– if (*pointer < value) (*pointer)++
– else *pointer = 0
– return tmp
• atomicDec (pointer, value)
– tmp = *pointer
– if (*pointer == 0 || *pointer > value) *pointer = value
– else (*pointer)--
– return tmp
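• One use of the wrap-around in atomicInc, sketched with illustrative names: a ring-buffer write index that wraps back to 0:
#define RING_SIZE 1024
__global__ void ring_push (unsigned int *d_index, float *d_ring, float v)
{
    // returns the old index; resets to 0 once it would pass RING_SIZE - 1
    unsigned int slot = atomicInc (d_index, RING_SIZE - 1);
    d_ring[slot] = v;
}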
atomicAnd, atomicOr, atomicXOR
• atomicAnd (pointer, value)
– tmp = *pointer
– *pointer = *pointer & value
– return tmp
• Others similar
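• A common use of the bitwise atomics, again a sketch with illustrative names: atomicOr setting bit v in a "seen" bitmap:
__global__ void mark_seen (unsigned int *d_bitmap, const unsigned int *d_idx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        unsigned int v = d_idx[i];
        atomicOr (&d_bitmap[v >> 5], 1u << (v & 31));   // word v/32, bit v%32
    }
}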
CUDA Implementation - Declarations
__global__ void histogram256Kernel (uint *d_Result, uint *d_Data, int dataN){
    //Current global thread index
    const int globalTid = blockIdx.x * blockDim.x + threadIdx.x;
    //Total number of threads in the compute grid
    const int numThreads = blockDim.x * gridDim.x;
    __shared__ uint s_Hist[BIN_COUNT];
Clear partial histogram buffer
//Clear shared memory buffer for current thread block before processing
for ( int pos = threadIdx.x;
pos < BIN_COUNT;
pos += blockDim.x)
s_Hist[pos] = 0;
__syncthreads ();
Generate partial histogram
for (int pos = globalTid;
     pos < dataN;
     pos += numThreads){
    uint data4 = d_Data[pos];
    atomicAdd (s_Hist + ((data4 >>  0) & 0xFFU), 1);
    atomicAdd (s_Hist + ((data4 >>  8) & 0xFFU), 1);
    atomicAdd (s_Hist + ((data4 >> 16) & 0xFFU), 1);
    atomicAdd (s_Hist + ((data4 >> 24) & 0xFFU), 1);
}
__syncthreads();
Merge partial histogram with global histogram
for (int pos = threadIdx.x;
pos < BIN_COUNT;
pos += blockDim.x){
atomicAdd(d_Result + pos, s_Hist[pos]);
}
Code overview
__global__ void histogram256Kernel (uint *d_Result, uint *d_Data,
int dataN){
const int
globalTid = blockIdx.x * blockDim.x + threadIdx.x;
const int
numThreads = blockDim.x * gridDim.x;
__shared__ uint s_Hist[BIN_COUNT];
for (int pos = threadIdx.x; pos < BIN_COUNT; pos += blockDim.x)
s_Hist[pos] = 0;
__syncthreads ();
for (int pos = globalTid; pos < dataN; pos += numThreads){
uint data4 = d_Data[pos];
atomicAdd (s_Hist + ((data4 >> 0) & 0xFFU), 1);
atomicAdd (s_Hist + ((data4 >> 8) & 0xFFU), 1);
atomicAdd (s_Hist + ((data4 >> 16) & 0xFFU), 1);
atomicAdd (s_Hist + ((data4 >> 24) & 0xFFU), 1);
}
__syncthreads();
for (int pos = threadIdx.x; pos < BIN_COUNT; pos += blockDim.x)
atomicAdd(d_Result + pos, s_Hist[pos]);
}
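• A hypothetical host-side driver for this kernel (BLOCKS, THREADS, dataN, h_data and h_result are illustrative; uint is assumed typedef'd to unsigned int as in the kernel):
const int BLOCKS = 64, THREADS = 192;                     // illustrative launch configuration
const int dataN  = 1 << 20;                               // number of packed 4-pixel words (illustrative)
uint *h_data = (uint *) malloc (dataN * sizeof(uint));    // filled with pixel data elsewhere
uint  h_result[BIN_COUNT];
uint *d_Data, *d_Result;
cudaMalloc ((void **) &d_Data,   dataN * sizeof(uint));
cudaMalloc ((void **) &d_Result, BIN_COUNT * sizeof(uint));
cudaMemcpy (d_Data, h_data, dataN * sizeof(uint), cudaMemcpyHostToDevice);
cudaMemset (d_Result, 0, BIN_COUNT * sizeof(uint));       // kernel adds into the global bins
histogram256Kernel<<<BLOCKS, THREADS>>> (d_Result, d_Data, dataN);
cudaMemcpy (h_result, d_Result, BIN_COUNT * sizeof(uint), cudaMemcpyDeviceToHost);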
Discussion
• s_Hist updates
– Conflicts in shared memory
– Data Dependent
– 16-way conflicts possible and likely
• Is there an alternative?
– One histogram per thread?
– Load data in shared memory
• Each thread produces a portion of the s_Hist that maps
onto the same bank?
Warp Vote Functions
• int __all (int predicate);
– evaluates predicate for all threads of the warp and
returns non-zero if and only if predicate evaluates
to non-zero for all of them.
• int __any (int predicate);
– evaluates predicate for all threads of the warp and
returns non-zero if and only if predicate evaluates
to non-zero for any of them.
Warp Vote Functions Example
• Original code:
for (int pos = threadIdx.x; pos < BIN_COUNT; pos += blockDim.x){
    atomicAdd(d_Result + pos, s_Hist[pos]);
}
• Modified w/ __any ():
for (int pos = threadIdx.x; pos < BIN_COUNT; pos += blockDim.x){
    if (__any (s_Hist[pos] != 0))
        atomicAdd(d_Result + pos, s_Hist[pos]);
}
• Modified w/ __all ():
for (int pos = threadIdx.x; pos < BIN_COUNT; pos += blockDim.x){
    if (!__all (s_Hist[pos] == 0))
        atomicAdd(d_Result + pos, s_Hist[pos]);
}
Sparse Matrix Multiplication
• Sparse Matrix N x N:
– number of non-zero entries m is only a small
fraction of the total
• Representation goal:
– store only non-zero entries
• Typically:
– m = O(N)
Compressed Sparse Row Representation
• Av[]:
– Array values in row-major order
• Aj[]:
– Column for corresponding Av[] entry
• Ap[]:
– row i extends from indexes Ap[i] to Ap[i+1] -1 in Av[]
and Aj[]
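• A minimal sketch of a CSR container in C (illustrative; the slides pass Ap, Aj, Av as separate arrays rather than a struct):
typedef unsigned int uint;
typedef struct {
    uint   num_rows;
    uint  *Ap;    // num_rows + 1 row pointers
    uint  *Aj;    // column index of each non-zero
    float *Av;    // value of each non-zero
} csr_matrix;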
Matrix x Vector: y = Ax
Av = [ 3  1  2  4  1  1  1 ]
Aj = [ 0  2  1  2  3  0  3 ]
Ap = [ 0  2  2  5  7 ]
x  = [ 10  20  30  40 ]
• Row 2 (entries Ap[2] .. Ap[3]-1): y[2] = 2 x 20 + 4 x 30 + 1 x 40 = 200
• y = Ax = [ 60  0  200  50 ]
Single Row Multiply
• Produces an entry of the result vector
float multiply_row (uint rowsize,   // number of non-zeros in this row
                    uint *Aj,       // column indices for row
                    float *Av,      // nonzero entries for row
                    float *x)       // the RHS vector
{
    float sum = 0;
    for (uint column = 0; column < rowsize; column++)
        sum += Av[column] * x[ Aj[column] ];
    return sum;
}
Av = [ 3 1 2 4 1 1 1 ]   Aj = [ 0 2 1 2 3 0 3 ]   Ap = [ 0 2 2 5 7 ]   x = [ 10 20 30 40 ]
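• Hypothetical usage of multiply_row on row 2 of the example above (not from the original slides):
// row 2 spans Av[2..4] and Aj[2..4], since Ap[2] = 2 and Ap[3] = 5
float y2 = multiply_row (Ap[3] - Ap[2],   // 3 non-zeros in row 2
                         Aj + Ap[2],
                         Av + Ap[2],
                         x);              // y2 = 2*20 + 4*30 + 1*40 = 200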
Serial Code
void csrmul_serial (uint *Ap, uint *Aj, float *Av,
                    uint num_rows, float *x, float *y)
{
    for (uint row = 0; row < num_rows; row++)
    {
        uint row_begin = Ap[row];
        uint row_end   = Ap[row + 1];
        y[row] = multiply_row (row_end - row_begin,
                               Aj + row_begin,
                               Av + row_begin,
                               x);
    }
}
CUDA Strategy
• Assume that there are many rows
• One thread per row
CUDA Kernel
__global__
void csrmul_kernel (uint *Ap, uint *Aj, float *Av,
                    uint num_rows, float *x, float *y)
{
    uint row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows)
    {
        uint row_begin = Ap[row];
        uint row_end   = Ap[row + 1];
        y[row] = multiply_row (row_end - row_begin,
                               Aj + row_begin,
                               Av + row_begin,
                               x);
    }
}
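• A possible launch configuration for one-thread-per-row (a sketch; blocksize is an assumed constant, and Ap, Aj, Av, x, y are assumed to be device pointers):
const unsigned int blocksize = 128;                              // threads per block
unsigned int nblocks = (num_rows + blocksize - 1) / blocksize;   // round up
csrmul_kernel<<<nblocks, blocksize>>> (Ap, Aj, Av, num_rows, x, y);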
Discussion
• We are not using shared memory
– all arrays are in global memory
• Claim:
– rows that use x[i] will be rows near row i
• Block processing rows i through j
– cache x[i] through x[j] in shared memory
– load from shared memory when possible
• Unroll multiply_row()
• Fetch Ap[row+1] from the adjacent thread
CSR multiplication using shared memory
__global__ void csrmul_cached( uint *Ap, uint *Aj,
float *Av, uint num_rows,float *x, float *y)
{
__shared__ float cache[blocksize];
uint block_begin = blockIdx.x * blockDim.x;
uint block_end = block_begin + blockDim.x;
uint row = block_begin + threadIdx.x;
if (row < num_rows) cache[threadIdx.x] = x[row];
__syncthreads();
if (row<num_rows)
{
uint row_begin = Ap[row];
uint row_end = Ap[row + 1];
float sum = 0, x_j ;
for (uint col=row_begin; col < row_end; col++ )
{
uint j = Aj[col];
// Fetch x_j from our cache when possible
if (j>=block_begin && j<block_end )
x_j = cache[j - block_begin];
else
x_j = x [j];
sum += Av[col] * x_j ;
}
y[row] = sum;
}
}