L6: Memory Hierarchy Optimization IV, Bandwidth Optimization
CS6235, 1/28/13

Administrative
•  Next assignment available
–  Next three slides
–  Goals of assignment:
– simple memory hierarchy management
– block-thread decomposition tradeoff
–  Due Friday, Feb. 8, 5PM
–  Use handin program on CADE machines
•  “handin CS6235 lab2 <probfile>”

Assignment 2: Memory Hierarchy Optimization
Due Friday, February 8 at 5PM
Sobel edge detection: find the boundaries of the image where there is a significant difference compared to neighboring "pixels" and replace values to find edges.

[Figure: example input image and output edge maps (sum1 only, sum2 only, both).]

CPU implementation:

for (i = 1; i < ImageNRows - 1; i++)
  for (j = 1; j < ImageNCols - 1; j++) {
    sum1 = u[i-1][j+1] - u[i-1][j-1]
         + 2 * u[i][j+1] - 2 * u[i][j-1]
         + u[i+1][j+1] - u[i+1][j-1];
    sum2 = u[i-1][j-1] + 2 * u[i-1][j] + u[i-1][j+1]
         - u[i+1][j-1] - 2 * u[i+1][j] - u[i+1][j+1];
    magnitude = sum1*sum1 + sum2*sum2;
    if (magnitude > THRESHOLD)
      e[i][j] = 255;
    else
      e[i][j] = 0;
  }
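For the GPU versions of the assignment, a minimal global-memory kernel might look like the sketch below. The kernel name, d_u, d_e, and the launch geometry are my assumptions, not the provided code; THRESHOLD is the constant from the CPU version.

// Sketch of a global-memory Sobel kernel, one output pixel per thread.
__global__ void sobelKernel(const int *d_u, int *d_e, int nrows, int ncols)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // row
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // column

    // interior pixels only, matching the CPU loop bounds
    if (i < 1 || i >= nrows - 1 || j < 1 || j >= ncols - 1) return;

    int sum1 = d_u[(i-1)*ncols + (j+1)] - d_u[(i-1)*ncols + (j-1)]
             + 2 * d_u[i*ncols + (j+1)] - 2 * d_u[i*ncols + (j-1)]
             + d_u[(i+1)*ncols + (j+1)] - d_u[(i+1)*ncols + (j-1)];
    int sum2 = d_u[(i-1)*ncols + (j-1)] + 2 * d_u[(i-1)*ncols + j]
             + d_u[(i-1)*ncols + (j+1)] - d_u[(i+1)*ncols + (j-1)]
             - 2 * d_u[(i+1)*ncols + j] - d_u[(i+1)*ncols + (j+1)];

    int magnitude = sum1*sum1 + sum2*sum2;
    d_e[i*ncols + j] = (magnitude > THRESHOLD) ? 255 : 0;
}

Note that mapping threadIdx.x to the column index makes neighboring threads read neighboring words, which matters for the coalescing discussion later in this lecture. A shared-memory version (GPU version 2) would tile the input so each pixel is loaded from global memory once per block rather than several times.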
General Approach

0. Provided
   a. Input file
   b. Sample output file
   c. CPU implementation
1. Structure
   a. Compare CPU version and GPU version output [compareInt]
   b. Time performance of two GPU versions (see 2 & 3 below) [EventRecord]
2. GPU version 1 (partial credit if correct)
   –  implementation using global memory
3. GPU version 2 (highest points to best performing versions)
   –  use memory hierarchy optimizations from previous and current lectures
4. Extra credit: Try two different block/thread decompositions. What happens if you use more threads versus more blocks? What if you do more work per thread? Explain your choices in a README file.

Handin using the following on CADE machines, where probfile includes all files:
   "handin cs6235 lab2 <probfile>"

Overview

•  Bandwidth optimization
  –  Global memory coalescing
  –  Avoiding shared memory bank conflicts
  –  A few words on alignment
•  Reading:
  –  Chapter 4, Kirk and Hwu
  –  Chapter 5, Kirk and Hwu
  –  Sections 3.2.4 (texture memory) and 5.1.2 (bandwidth optimizations) of the NVIDIA CUDA Programming Guide

Overview of Texture Memory
•  Recall, texture cache of read-only data
•  Special protocol for allocating and copying to GPU
–  texture<Type, Dim, ReadMode> texRef;
•  Dim: 1, 2 or 3D objects
•  Special protocol for accesses (macros)
–  tex2D(<name>,dim1,dim2);
•  In full glory can also apply functions to textures
•  Writing is possible, but unsafe if followed by a read in the same kernel
6"
L6: Memory Hierarchy IV Using Texture Memory (simpleTexture project
from SDK)
// host code: allocate output, build a cudaArray, copy input into it
cudaMalloc( (void**) &d_data, size);
cudaChannelFormatDesc channelDesc =
    cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
cudaArray* cu_array;
cudaMallocArray( &cu_array, &channelDesc, width, height );
cudaMemcpyToArray( cu_array, 0, 0, h_data, size, cudaMemcpyHostToDevice);

// set texture parameters
tex.addressMode[0] = tex.addressMode[1] = cudaAddressModeWrap;
tex.filterMode = cudaFilterModeLinear;
tex.normalized = true;

cudaBindTextureToArray( tex, cu_array, channelDesc );

// execute the kernel
transformKernel<<< dimGrid, dimBlock, 0 >>>( d_data, width, height, angle);

Kernel function:

// declare texture reference for 2D float texture (at file scope)
texture<float, 2, cudaReadModeElementType> tex;

… = tex2D(tex, i, j);
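For reference, the SDK's kernel rotates the image by sampling the texture at transformed normalized coordinates. The following is a sketch in that style; details may differ from the actual SDK source.

// Sketch of simpleTexture's transform kernel: rotate by angle theta,
// sampling through the texture unit (wrapping and linear filtering
// come for free from the parameters set on the host above).
__global__ void transformKernel(float* odata, int width, int height, float theta)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;

    // normalized coordinates, centered so the rotation is about the middle
    float u = x / (float) width  - 0.5f;
    float v = y / (float) height - 0.5f;

    float tu = u * cosf(theta) - v * sinf(theta) + 0.5f;
    float tv = v * cosf(theta) + u * sinf(theta) + 0.5f;

    odata[y * width + x] = tex2D(tex, tu, tv);
}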
When to use Texture (and Surface) Memory

(From Section 5.3 of the CUDA manual.) Reading device memory through texture or surface fetching presents some benefits that can make it an advantageous alternative to reading device memory from global or constant memory:
•  If memory reads to global or constant memory will not be coalesced, higher bandwidth can be achieved provided there is locality in the texture fetches or surface reads (this is less likely for devices of compute capability 2.x, since global memory reads are cached on those devices);
•  Addressing calculations are performed outside the kernel by dedicated units;
•  Packed data may be broadcast to separate variables in a single operation;
•  8-bit and 16-bit integer input data may be optionally converted to 32-bit floating-point values in the range [0.0, 1.0] or [-1.0, 1.0] (see Section 3.2.4.1.1).

Introduction to Memory System
•  Recall the execution model for a multiprocessor
  –  Scheduling unit: a "warp" of threads is issued at a time (32 threads in current chips)
  –  Execution unit: each cycle, 8 "cores" or SPs are executing (32 cores in a Fermi)
  –  Memory unit: the memory system scans a "half-warp" of 16 threads for data to be loaded (a full warp for Fermi)

Global Memory Accesses
•  Each thread issues memory accesses to data types of varying sizes, perhaps as small as 1-byte entities
•  Given an address to load or store, memory returns/updates "segments" of either 32 bytes, 64 bytes, or 128 bytes
•  Maximizing bandwidth:
  –  Operate on an entire 128-byte segment for each memory transfer

Understanding Global Memory Accesses

Memory protocol for compute capability 1.2 and 1.3* (CUDA Manual 5.1.2.1 and Appendix A.1):
•  Start with the memory request by the smallest-numbered thread. Find the memory segment that contains the address (32-, 64-, or 128-byte segment, depending on data type)
•  Find other active threads requesting addresses within that segment and coalesce
•  Reduce the transaction size if possible
•  Access memory and mark threads as "inactive"
•  Repeat until all threads in the half-warp are serviced

*Includes Tesla and GTX platforms as well as the new Linux machines!
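To make the protocol concrete, here is a minimal illustration (my example, not from the slides) of the difference between a coalesced and a strided access pattern:

// Coalesced: a half-warp (threads 0-15) touches 16 consecutive words,
// which the hardware services as a single 64-byte transaction.
__global__ void copyCoalesced(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: with a large stride, each thread's address falls in a
// different segment, so the half-warp needs up to 16 transactions.
__global__ void copyStrided(float *out, const float *in, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];
}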
Memory Layout of a Matrix in C

The protocol for most systems (including the lab6 machines) is even more restrictive:
•  For compute capability 1.0 and 1.1
  –  Threads must access the words in a segment in sequence
  –  The kth thread must access the kth word
  –  Alignment to the beginning of a segment becomes a very important optimization!

[Figure: a 4x4 matrix M stored in row-major order (M0,0 M1,0 M2,0 M3,0 M0,1 M1,1 M2,1 M3,1 M0,2 M1,2 M2,2 M3,2 M0,3 M1,3 M2,3 M3,3), with the kernel's access direction running down the columns; in each time period, threads T1-T4 touch four different rows.]

Consecutive threads will access different rows in memory, so each thread will require a different memory operation.

Odd: but this is the RIGHT layout for a conventional multi-core!

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Memory Layout of a Matrix in C

[Figure: the same row-major matrix, with the access direction now running along the rows; in each time period, threads T1-T4 touch consecutive elements of one row.]

Each thread in a half-warp (assuming rows of 16 elements) will access consecutive memory locations. GREAT! All accesses are coalesced.

With just a 4x4 block, however, we may need 4 separate memory operations to load the data for a half-warp.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign

How to find out compute capability

•  See Appendix A.1 in the NVIDIA CUDA Programming Guide to look up your device.
•  The Linux lab, most CADE machines, and the Tesla cluster are compute capability 1.2 and 1.3.
•  Fermi machines are 2.x.
•  Also, recall "deviceQuery" in the SDK to learn about the features of an installed device.
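In code, the coalesced layout corresponds to making threadIdx.x the column (fastest-varying) index; a small illustration (mine, not from the slides; assumes the grid exactly covers the matrix):

// 2D indexing of a row-major matrix, one element per thread.
__global__ void scale(float *M, int width, float alpha)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // fastest-varying index
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    // Consecutive threads (in threadIdx.x) touch consecutive addresses:
    // coalesced. Swapping the roles (M[col * width + row]) would send each
    // thread to a different row, one memory operation per thread, as in
    // the first figure above.
    M[row * width + col] *= alpha;
}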
Alignment

•  Addresses accessed within a half-warp may need to be aligned to the beginning of a segment to enable coalescing
  –  An aligned memory address is a multiple of the memory segment size
  –  In compute 1.0 and 1.1 devices, the address accessed by the lowest-numbered thread must be aligned to the beginning of the segment for coalescing
  –  In future systems, sometimes alignment can reduce the number of accesses

More on Alignment

•  Objects allocated statically or by cudaMalloc begin at aligned addresses
  –  But still need to think about index expressions
•  May want to align structures:

struct __align__(8) {
    float a;
    float b;
};

struct __align__(16) {
    float a;
    float b;
    float c;
};
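For 2D data, the runtime can also pad each row so that every row starts at an aligned address; a sketch (my example, not from the slides) using cudaMallocPitch:

// cudaMallocPitch pads each row of a 2D allocation so row starts remain
// aligned; index with the returned pitch (in bytes), not the logical width.
float *d_M;
size_t pitch;
cudaMallocPitch((void**)&d_M, &pitch, width * sizeof(float), height);

// element (x, y):
// float *row = (float*)((char*)d_M + y * pitch);
// float v = row[x];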
What Can You Do to Improve Bandwidth to Global Memory?

•  Think about spatial reuse and access patterns across threads
  –  May need a different computation & data partitioning
  –  May want to rearrange data in shared memory, even if there is no temporal reuse (transpose example)
  –  Similar issues arise, but are much less severe, in future hardware generations

Bandwidth to Shared Memory: Parallel Memory Accesses
•  Consider each thread accessing a different location in shared memory
•  Bandwidth is maximized if each one is able to proceed in parallel
•  Hardware to support this
  –  Banked memory: each bank can support an access on every memory cycle
How addresses map to banks on G80

•  Each bank has a bandwidth of 32 bits per clock cycle
•  Successive 32-bit words are assigned to successive banks
•  G80 has 16 banks
  –  So bank = address % 16
  –  Same as the size of a half-warp
•  No bank conflicts between different half-warps, only within a single half-warp

Shared memory bank conflicts

•  Shared memory is as fast as registers if there are no bank conflicts
•  The fast case:
  –  If all threads of a half-warp access different banks, there is no bank conflict
  –  If all threads of a half-warp access the identical address, there is no bank conflict (broadcast)
•  The slow case:
  –  Bank conflict: multiple threads in the same half-warp access the same bank
  –  The accesses must be serialized
  –  Cost = max # of simultaneous accesses to a single bank
L6: Memory Hierarchy IV Bank Addressing Examples
• 
No Bank Conflicts
–  Linear addressing
stride == 1
• 
Bank Addressing Examples
No Bank Conflicts
–  Random 1:1 Permutation
Thread 0
Thread 1
Thread 2
Thread 3
Thread 4
Thread 5
Thread 6
Thread 7
Bank 0
Bank 1
Bank 2
Bank 3
Bank 4
Bank 5
Bank 6
Bank 7
Thread 0
Thread 1
Thread 2
Thread 3
Thread 4
Thread 5
Thread 6
Thread 7
Bank 0
Bank 1
Bank 2
Bank 3
Bank 4
Bank 5
Bank 6
Bank 7
Thread 15
Bank 15
Thread 15
Bank 15
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
23"
L6: Memory Hierarchy IV • 
2-way Bank Conflicts
–  Linear addressing
stride == 2
Thread 0
Thread 1
Thread 2
Thread 3
Thread 4
Thread 8
Thread 9
Thread 10
Thread 11
• 
8-way Bank Conflicts
–  Linear addressing
stride == 8
Bank 0
Bank 1
Bank 2
Bank 3
Bank 4
Bank 5
Bank 6
Bank 7
Thread 0
Thread 1
Thread 2
Thread 3
Thread 4
Thread 5
Thread 6
Thread 7
Bank 15
Thread 15
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
x8
x8
Bank 0
Bank 1
Bank 2
Bank 7
Bank 8
Bank 9
Bank 15
24"
L6: Memory Hierarchy IV 6
1/28/13
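These strides come directly from the shared memory index expression; a small demonstration kernel (my illustration, not from the slides; assumes a block of at most 32 threads so the indices stay in bounds):

__global__ void bankDemo(float *out)
{
    __shared__ float s[256];
    int t = threadIdx.x;          // consider one half-warp: t = 0..15
    s[t] = (float) t;
    __syncthreads();

    float a = s[t];               // stride 1: thread k -> bank k, no conflict
    float b = s[2 * t];           // stride 2: threads k and k+8 share a bank (2-way)
    float c = s[8 * t];           // stride 8: 8 threads share a bank (8-way)
    float d = s[0];               // identical address: broadcast, no conflict
    out[t] = a + b + c + d;
}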
Putting It Together: Global Memory Coalescing and Bank Conflicts

•  Let's look at matrix transpose
•  Simple goal: replace A[i][j] with A[j][i]
•  Any reuse of data?
•  Do you think shared memory might be useful?

Matrix Transpose (from SDK)

__global__ void transpose(float *odata, float *idata, int width, int height)
{
    // odata and idata in global memory
    int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
    int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
    int index_in  = xIndex + width * yIndex;
    int index_out = yIndex + height * xIndex;

    for (int r = 0; r < nreps; r++) {
        for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
            // read the element and write the transposed element to global memory
            odata[index_out + i] = idata[index_in + i * width];
        }
    }
}
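Here the reads (idata[index_in + i*width]) are coalesced, but consecutive threads write odata at a stride of height words, so the writes are not. These kernels also assume a tiled launch; a hypothetical configuration (the constants and the timing-loop count nreps are my assumptions):

// Hypothetical launch setup for the transpose kernels; assumes width
// and height are multiples of TILE_DIM.
#define TILE_DIM   16
#define BLOCK_ROWS 16

dim3 grid(width / TILE_DIM, height / TILE_DIM);
dim3 threads(TILE_DIM, BLOCK_ROWS);   // each thread handles TILE_DIM/BLOCK_ROWS elements
transpose<<<grid, threads>>>(d_odata, d_idata, width, height);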
Coalesced Matrix Transpose

Read each tile of idata with coalesced loads, rearrange it in shared memory, and write it back efficiently to global memory:

__global__ void transpose(float *odata, float *idata, int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM];

    // odata and idata in global memory
    int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
    int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
    int index_in = xIndex + yIndex * width;

    xIndex = blockIdx.y * TILE_DIM + threadIdx.x;
    yIndex = blockIdx.x * TILE_DIM + threadIdx.y;
    int index_out = xIndex + yIndex * height;

    for (int r = 0; r < nreps; r++) {
        // read the matrix tile into shared memory
        for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
            tile[threadIdx.y + i][threadIdx.x] = idata[index_in + i * width];
        }
        __syncthreads();
        // write the transposed matrix tile to global memory
        for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
            odata[index_out + i * height] = tile[threadIdx.x][threadIdx.y + i];
        }
    }
}

Coalesced Matrix Transpose – No Shared Memory Bank Conflicts

The kernel is identical except for one line: the tile is padded by one column,

    __shared__ float tile[TILE_DIM][TILE_DIM+1];

In the write-back phase, the threads of a half-warp read a column of the tile (tile[threadIdx.x][…]), i.e., at a stride of the row length in words. With TILE_DIM equal to the number of banks (16), those accesses would all fall in the same bank; padding the row length to TILE_DIM+1 shifts successive rows to successive banks and eliminates the conflict.
Further Optimization: Partition Camping

•  A further optimization addresses the analog of bank conflicts in global memory ("partition camping")
•  But it has not proven that useful in codes with additional computation
•  Map blocks to different parts of the chip:

    int bid = blockIdx.x + gridDim.x * blockIdx.y;
    by = bid % gridDim.y;
    bx = ((bid / gridDim.y) + by) % gridDim.x;

Performance Results for Matrix Transpose (GTX280)

[Figure: measured bandwidth for three transpose versions.]
•  SDK-prev: all optimizations other than partition camping
•  CHiLL: generated by our compiler
•  SDK-new: includes partition camping
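In use (a sketch, assuming a square grid of tiles), the remapped bx/by simply replace blockIdx.x/blockIdx.y in the transpose index computation:

// Diagonal block reordering: concurrently scheduled blocks spread their
// tiles across the memory partitions instead of lining up on one.
int bid = blockIdx.x + gridDim.x * blockIdx.y;
int by  = bid % gridDim.y;
int bx  = ((bid / gridDim.y) + by) % gridDim.x;

int xIndex = bx * TILE_DIM + threadIdx.x;     // was blockIdx.x * TILE_DIM + ...
int yIndex = by * TILE_DIM + threadIdx.y;     // was blockIdx.y * TILE_DIM + ...
int index_in = xIndex + yIndex * width;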
Summary of Lecture

•  Completion of bandwidth optimizations
  –  Global memory coalescing
  –  Alignment
  –  Shared memory bank conflicts
  –  Partition camping
•  Matrix transpose example