IEEE Boston Continuing Education Program
Ken Domino, Domem Technologies
May 9, 2011
Announcements
 Course website updates:
Syllabus- http://domemtech.com/ieee-pp/Syllabus.docx
Lecture1– http://domemtech.com/ieee-pp/Lecture1.pptx
Lecture2– http://domemtech.com/ieee-pp/Lecture2.pptx
References- http://domemtech.com/ieee-pp/References.docx
 Ocelot April 5 download is not working
PRAM
 Parallel Random Access Machine (PRAM).
 An idealized SIMD parallel computing model:
 Unlimited RAMs, called Processing Units (PUs).
 The RAMs execute the same instructions, synchronously.
 Shared Memory is unlimited and accessed in one unit of time.
 Shared Memory access is one of CREW, CRCW, or EREW.
 Communication between the RAMs is only through Shared Memory.
PRAM pseudo code
 Parallel for loop
 for P_i, 1 ≤ i ≤ n, in parallel do … end
 (aka “data-level parallelism”)
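This maps directly onto the CUDA kernels covered later in this lecture. A minimal sketch (not from the slides), assuming an array a of n floats and a loop body that simply doubles each element:
// Hedged sketch: the PRAM "parallel for" written as a CUDA kernel,
// where thread i plays the role of processor P_i.
__global__ void parallel_for(float * a, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;  // this thread's index i
    if (i < n)
        a[i] = 2.0f * a[i];   // the loop body: here, just double a[i]
}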
Synchronization
 A simple example from C:
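The slide's C code is not reproduced in this export; a minimal sketch of the kind of routine it refers to, assuming hypothetical shared globals char_in and char_out used by an echo routine:
#include <stdio.h>

/* Hedged reconstruction of the kind of code on the slide (the names
 * char_in and char_out come from the following slides). */
char char_in;    /* shared: the character read from input     */
char char_out;   /* shared: the character to be written out   */

void echo(void)
{
    char_in  = getchar();   /* read into the shared variable   */
    char_out = char_in;     /* copy it to the output variable  */
    putchar(char_out);      /* write it out                    */
}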
Synchronization
 What happens if we have two threads competing for
the same resources (char_in/char_out)?
Synchronization
 What happens if two threads execute this code
serially?
No problem!
Synchronization
 What happens if two threads execute this code in
parallel? Sometimes we get a problem:
the char_in of T2 overwrites the char_in of T1!
Synchronization
 Synchronization forces thread serialization, so that
concurrent access does not cause problems.
Synchronization
 Two types:
 Mutual exclusion, using a “mutex” semaphore (a lock).
 Cooperation: wait on an object until all other threads are ready, using wait() + notify(), or barrier synchronization.
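A minimal sketch in C (not from the slides), assuming POSIX threads, that shows both kinds: a mutex for mutual exclusion and a barrier for cooperation:
#include <pthread.h>
#include <stdio.h>

/* Hedged sketch, assuming POSIX threads: the mutex serializes access to
 * char_in/char_out; the barrier makes both threads wait for each other. */
char char_in, char_out;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
pthread_barrier_t barrier;

void * worker(void * arg)
{
    pthread_mutex_lock(&lock);      /* mutual exclusion: one thread at a time */
    char_in  = *(char *)arg;
    char_out = char_in;
    putchar(char_out);
    pthread_mutex_unlock(&lock);

    pthread_barrier_wait(&barrier); /* cooperation: wait until both arrive */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    char a = 'a', b = 'b';
    pthread_barrier_init(&barrier, NULL, 2);
    pthread_create(&t1, NULL, worker, &a);
    pthread_create(&t2, NULL, worker, &b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("\n");
    return 0;
}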
Deadlock
 Can arise from mutual exclusion on two or more resources: each thread holds one lock and waits for a lock held by another, so none can proceed.
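A minimal C sketch (not from the slides) of how this happens with two mutexes acquired in opposite orders:
#include <pthread.h>

/* Hedged sketch: each thread holds one lock and waits for the other,
 * so neither can proceed.  Running this program may hang forever. */
pthread_mutex_t m1 = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t m2 = PTHREAD_MUTEX_INITIALIZER;

void * thread_a(void * p)
{
    pthread_mutex_lock(&m1);    /* holds m1 ...          */
    pthread_mutex_lock(&m2);    /* ... and waits for m2  */
    pthread_mutex_unlock(&m2);
    pthread_mutex_unlock(&m1);
    return NULL;
}

void * thread_b(void * p)
{
    pthread_mutex_lock(&m2);    /* holds m2 ...          */
    pthread_mutex_lock(&m1);    /* ... and waits for m1  */
    pthread_mutex_unlock(&m1);
    pthread_mutex_unlock(&m2);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, thread_a, NULL);
    pthread_create(&b, NULL, thread_b, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}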
PRAM Synchronization
 “stay idle” – wait until the other processors complete; this is “cooperative” synchronization.
CUDA
 “Compute Unified Device Architecture”
 Developed by NVIDIA, introduced November 2006
 Based on C, extended later to work with C++.
 CUDA provides three key abstractions:
 a hierarchy of thread groups
 shared memories
 barrier synchronization
http://www.nvidia.com/object/IO_37226.html,
http://www.gpgpu.org/oldsite/sc2006/workshop/presentations/Buck_NVIDIA_Cuda.pdf,
Nickolls, J., Buck, I., Garland, M. and Skadron, K. Scalable parallel programming with CUDA. Queue, 6 (2). 40-53.
GPU coprocessor to CPU
NVIDIA GPU Architecture
Multiprocessor (MP) = texture/processor cluster (TPC)
Dynamic random-access memory (DRAM), aka “global memory”
Raster operation processor (ROP)
L2 – Level-2 memory cache
NVIDIA GPU Architecture
Streaming Multiprocessor (SM)
Streaming processor (SP)
Streaming multiprocessor control (SMC)
Texture processing unit (TPU)
Con Cache – “constant” memory
Sh. Memory – “shared” memory
Multithreaded instruction fetch and issue unit (MTIFI)
1st generation, G80 – 2006
3rd generation, Fermi, GTX 570 – 2010
Single-instruction, multiple-thread
 “SIMT”
 SIMT = SIMD + SPMD (single program, multiple data).
 Multiple threads.
 Sort of “Single Instruction”, except that each instruction executed is in multiple independent parallel threads.
 Instruction set architecture: a register-based instruction set including floating-point, integer, bit, conversion, transcendental, flow control, memory load/store, and texture operations.
Single-instruction, multiple-thread
 The Streaming Multiprocessor (SM) is a hardware multithreaded unit.
 Threads are executed in groups of 32 parallel threads
called warps.
 Each thread has its own set of registers.
 Individual threads composing a warp are of the same
program and start together at the same program
address, but they are otherwise free to branch and
execute independently.
Single-instruction, multiple-thread
 The instruction executed is the same for every thread of a warp.
 If threads of a warp diverge via a data dependent
conditional branch, the warp serially executes each
branch path taken.
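A minimal CUDA sketch (not from the slides) of such a data-dependent branch; even and odd threads of a warp take different paths, so the warp executes the two paths one after the other:
// Hedged sketch: threads of the same warp diverge on a data-dependent branch.
__global__ void diverge(int * out)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid % 2 == 0)
        out[tid] = 2 * tid;   // even-numbered threads take this path ...
    else
        out[tid] = tid + 1;   // ... odd-numbered threads take this one
}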
Single-instruction, multiple-thread
 Warps are serialized if there is:
 Divergence in instructions (i.e., a data-dependent conditional branch)
 Write access to the same memory location
Warp Scheduling
 SM hardware implements near-zero-overhead warp scheduling:
 Warps whose next instruction has its operands ready for consumption can be executed.
 Eligible warps are selected for execution by priority.
 All threads in a warp execute the same instruction.
 4 clock cycles are needed to dispatch the instruction for all threads of a warp (G80).
Cooperative Thread Array (CTA)
 An abstraction for synchronizing threads.
 AKA a thread block; thread blocks are organized into a grid.
 CTAs are mapped onto warps.
Cooperative Thread Array (CTA)
 Each thread has a unique integer thread ID (TID).
 Threads of a CTA share data in global or shared
memory
 Threads synchronize with the barrier instruction.
 CTA thread programs use their TIDs to select work and
index shared data arrays.
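A minimal CUDA sketch (not from the slides) of these ideas, assuming a hypothetical kernel that reverses each block's slice of an array; each thread uses its TID to index a shared array, and __syncthreads() is the barrier:
// Hedged sketch (assumes blockDim.x <= 256): threads of one CTA share data
// in shared memory and synchronize at a barrier before reading it back.
__global__ void reverse_in_block(int * d)
{
    __shared__ int tmp[256];             // shared by the threads of this CTA
    int tid = threadIdx.x;               // thread ID within the block
    int gid = tid + blockIdx.x * blockDim.x;

    tmp[tid] = d[gid];                   // each thread stages one element
    __syncthreads();                     // barrier: wait for the whole block
    d[gid] = tmp[blockDim.x - 1 - tid];  // then read another thread's element
}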
Cooperative Thread Array (CTA)
 The programmer declares a 1D, 2D, or 3D grid shape and its dimensions in threads.
 The TID is a 1D, 2D, or 3D index.
Restrictions in grid sizes
Kernel
 Every thread in a grid executes the same body of
instructions, called a kernel.
 In CUDA, it’s just a function.
CUDA Kernels
 Kernels are declared with __global__ void
 Parameters are the same for all threads.
__global__ void fun(float * d, int size)
{
    // Compute a unique global index from the block and thread indices
    // of a (possibly 2D) grid of (possibly 2D) blocks.
    int idx = threadIdx.x
            + blockDim.x * blockIdx.x
            + blockDim.x * gridDim.x * blockDim.y * blockIdx.y
            + blockDim.x * gridDim.x * threadIdx.y;
    if (idx < 0)
        return;
    if (idx >= size)   // guard against threads past the end of the array
        return;
    d[idx] = idx * 10.0 / 0.1;
}
CUDA Kernels
 Kernels are called via the “chevron syntax”
 Func<<< Dg, Db, Ns, S >>>(parameters)
 Dg is of type dim3 and specifies the dimension and size of the grid
 Db is of type dim3 and specifies the dimension and size of the block
 Ns is of type size_t and specifies the number of bytes in shared memory that is dynamically allocated per block
 S is of type cudaStream_t and specifies the associated stream
 A kernel has void return type; results must be returned through pointer parameters.
 Example:
 Foo<<<1, 100>>>(1, 2, i);
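A minimal sketch (not from the slides) of a launch that uses all four chevron arguments, reusing the fun kernel above:
// Hedged sketch: a 2D grid of 2D blocks, 1 KB of dynamically allocated
// shared memory per block, on the default stream (0).
void launch_fun(float * d, int size)
{
    dim3 Dg(4, 4);      // grid: 4 x 4 blocks
    dim3 Db(16, 16);    // block: 16 x 16 threads
    size_t Ns = 1024;   // dynamic shared memory per block (unused by fun)
    fun<<< Dg, Db, Ns, 0 >>>(d, size);
}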
Memory
 CTAs have various types of memory:
 Global, shared, constant, texture, registers.
 Threads can access host memory, too.
Types of memory
CUDA Memory
 Data types (int, long, float, double, etc.) are the same as on the host.
 Shared memory is shared between the threads in a block.
 Global memory is shared by all threads in all blocks.
 Constant memory is shared by all threads in all blocks, but it cannot be changed (so it is faster).
 Host memory (of the CPU) can be accessed by all threads in all blocks.
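A minimal sketch (not from the slides) of constant memory, assuming a hypothetical coefficient table filled from the host with cudaMemcpyToSymbol():
// Hedged sketch: a small read-only table in constant memory, visible to
// every thread in every block.
__constant__ float coeffs[4];

__global__ void scale(float * d, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        d[i] *= coeffs[i % 4];   // all threads read the constant table
}

// Host side: fill the table before launching the kernel.
void setup_coeffs(void)
{
    float h_coeffs[4] = { 1.0f, 0.5f, 0.25f, 0.125f };
    cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));
}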
Shared Memory
 __shared__ declares a variable that:
 Resides in the shared memory space of a thread block,
 Has the lifetime of the block,
 Is accessible only from the threads within the block.
 Examples:
 extern __shared__ float shared[];
 (or declared on kernel call—later!)
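A minimal sketch (not from the slides) of the second form: extern __shared__ with the size supplied by the Ns chevron argument at launch:
// Hedged sketch: "buf" has no compile-time size; the launch below reserves
// blockDim.x floats of shared memory per block via the third chevron argument.
extern __shared__ float buf[];

__global__ void add_one(float * d, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        buf[threadIdx.x] = d[i];   // stage the element in shared memory
    __syncthreads();               // every thread of the block waits here
    if (i < n)
        d[i] = buf[threadIdx.x] + 1.0f;
}

// Host side:  add_one<<< blocks, threads, threads * sizeof(float) >>>(d, n);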
Global Memory
 __device__ declares a variable that:
 Resides in global memory space;
 Has the lifetime of an application;
 Is accessible from all the threads within the grid and
from the host through the runtime library
(cudaGetSymbolAddress() / cudaGetSymbolSize() /
cudaMemcpyToSymbol() /
cudaMemcpyFromSymbol())
 Can be allocated through cudaMalloc()
 Examples:
 extern __device__ int data[100];
 cudaMalloc(&d, 100*sizeof(int));
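A minimal sketch (not from the slides), assuming a hypothetical __device__ counter that the host reads back with cudaMemcpyFromSymbol():
// Hedged sketch: a statically declared global-memory variable, shared by the
// whole grid and accessed from the host by symbol.
__device__ int hit_count;

__global__ void count_positive(const float * d, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n && d[i] > 0.0f)
        atomicAdd(&hit_count, 1);   // needs a GPU with global atomics
}

// Host side:
//   int zero = 0, h_count;
//   cudaMemcpyToSymbol(hit_count, &zero, sizeof(int));      // initialize
//   count_positive<<<blocks, threads>>>(d, n);
//   cudaMemcpyFromSymbol(&h_count, hit_count, sizeof(int)); // read back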
Basic host function calls
 Global memory allocation via cudaMalloc()
 Copying memory between host and GPU via
cudaMemcpy()
 Kernels are called by chevron syntax
Counting 6’s
 We have an array of integers, h[], and want to count the number of 6s that appear in the array.
 h[0..size-1]
 How do we do this in CUDA?
Counting 6’s
 Divide the array into blocks of blocksize threads.
 For each block, sum the number of times 6 appears.
 Return the sum for each block.
Counting 6’s
#include <stdio.h>

// Each block sums the number of 6s in its blocksize-element slice of d_in
// and writes one partial sum to d_out[blockIdx.x].  (Every thread of the
// block redundantly computes the same sum.)
__global__ void c6(int * d_in, int * d_out, int size)
{
    int sum = 0;
    for (int i = 0; i < blockDim.x; i++)
    {
        int val = d_in[i + blockIdx.x * blockDim.x];
        if (val == 6)
            sum++;
    }
    d_out[blockIdx.x] = sum;
}
Counting 6’s
 In the main program, call the kernel with the correct dimensions of the block.
 Note: size % blocksize must be 0.
 How would we extend this for an arbitrary array size? (See the sketch after the code below.)
int main()
{
    int size = 300;
    int * h = (int*)malloc(size * sizeof(int));
    for (int i = 0; i < size; ++i)
        h[i] = i % 10;                   // every 10th element is a 6
    int * d_in;
    int * d_out;
    int bsize = 100;
    int blocks = size / bsize;           // size is assumed to be a multiple of bsize
    int threads_per_block = bsize;
    int rv1 = cudaMalloc((void**)&d_in, size * sizeof(int));
    int rv2 = cudaMalloc((void**)&d_out, blocks * sizeof(int));
    int rv3 = cudaMemcpy(d_in, h, size * sizeof(int),
        cudaMemcpyHostToDevice);
    c6<<<blocks, threads_per_block>>>(d_in, d_out, size);
    cudaThreadSynchronize();             // wait for the kernel to finish
    int rv4 = cudaGetLastError();        // check for launch errors
    int * r = (int*)malloc(blocks * sizeof(int));
    int rv5 = cudaMemcpy(r, d_out, blocks * sizeof(int),
        cudaMemcpyDeviceToHost);
    int sum = 0;
    for (int i = 0; i < blocks; ++i)     // add up the per-block partial sums
        sum += r[i];
    printf("Result = %d\n", sum);
    return 0;
}
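As asked above, a minimal sketch (not from the slides) of a kernel that handles an arbitrary array size; the bounds check lets the last block be partially full, and the host would launch it with blocks = (size + bsize - 1) / bsize:
// Hedged sketch: same structure as c6, but with a bounds check so that the
// last block can cover fewer than blockDim.x valid elements.
__global__ void c6_any_size(int * d_in, int * d_out, int size)
{
    int sum = 0;
    for (int i = 0; i < blockDim.x; i++)
    {
        int idx = i + blockIdx.x * blockDim.x;
        if (idx < size && d_in[idx] == 6)   // skip indices past the end
            sum++;
    }
    d_out[blockIdx.x] = sum;
}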
Developing CUDA programs
 Install the CUDA SDK (drivers, Toolkit, examples)
 Windows, Linux, Mac:
 Use Version 4.0, release candidate 2. (The older 3.2 release does not work with VS2010 easily! You can install both VS2010 and VS2008, but you will have to manage paths.)
 http://developer.nvidia.com/cuda-toolkit-40
 Install the toolkit, tools SDK, and example code
 For the drivers, you must have an NVIDIA GPU card
 Recommendation: the CUDA examples use definitions in a common library; do not force your code to depend on it.
Developing CUDA programs
 Emulation
 Do not install the CUDA drivers (the install will fail).
 Windows and Mac only:
 Install VirtualBox.
 Create a 40GB virtual drive.
 Install Ubuntu from an ISO image on VirtualBox.
 Install Ocelot (http://code.google.com/p/gpuocelot/downloads/list)
 Install the various dependencies (sudo apt-get install xxxx, for g++, boost, etc.)
 Note: There is a problem with the current release of Ocelot; I emailed Gregory.Diamos@gatech.edu to resolve the build issue.
Developing CUDA programs
 Windows:
 Install VS2010 C++ Express
(http://www.microsoft.com/visualstudio/enus/products/2010-editions/visual-cpp-express)
 (Test installation with “Hello World” .cpp example.)
Developing CUDA programs
 Windows:
 Create an empty C++ console project
 Create a hw.cu “hello world” program in the source directory
 Project ‐> Custom Build Rules, check the box for CUDA 4.0 targets
 Add hw.cu to your empty project
 Note: the “.cu” suffix stands for “CUDA source code”. You can put CUDA syntax into .cpp files, but the build environment won’t know what to compile them with (cl/g++ vs. nvcc).
Developing CUDA programs
hw.cu:
#include <stdio.h>

// Kernel: one thread writes 1 into device memory.
__global__ void fun(int * mem)
{
    *mem = 1;
}

int main()
{
    int h = 0;
    int * d;
    cudaMalloc((void**)&d, sizeof(int));                     // allocate one int on the GPU
    cudaMemcpy(d, &h, sizeof(int), cudaMemcpyHostToDevice);  // copy 0 to the GPU
    fun<<<1,1>>>(d);                                          // one block of one thread
    cudaThreadSynchronize();                                  // wait for the kernel
    int rv = cudaGetLastError();                              // check for launch errors
    cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost);   // copy the result back
    printf("Result = %d\n", h);
    return 0;
}
Developing CUDA programs
 Compile, link, and run.
 (The Version 4.0 installation adjusts all environment variables.)
NVCC
 nvcc (the NVIDIA CUDA compiler) is a driver program for the compiler phases.
 Use the –keep option to see the intermediate files. (You need to add “.” to the include directories on compile.)
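For example (a hedged sketch; -keep and -I are standard nvcc options), compiling the earlier hw.cu while keeping the intermediate files:
nvcc -keep -I. hw.cu -o hw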
NVCC
 Compiles a “.cu” file into a “.cu.cpp” file.
 Two types of targets: virtual and real, represented as PTX assembly code and “cubin” binary code, respectively.
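For example (a hedged sketch of standard nvcc options), -arch selects the virtual PTX target and -code the real and/or virtual code embedded in the executable:
nvcc -arch=compute_10 -code=sm_10,compute_10 hw.cu -o hw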
PTXAS
 Compiles PTX assembly code into machine code, placed in an ELF module.
 Example: dumping the cubin shows it is an ELF module (magic bytes 7f 45 4c 46):
# cat hw.sm_10.cubin | od -t x1 | head
(hex dump output omitted)
 Disassembly of the machine code can be done using cuobjdump or my own utility nvdis (http://forums.nvidia.com/index.php?showtopic=183438).
PTX, the GPU assembly code
 PTX = “Parallel Thread Execution”
 The target for PTX is an abstract GPU machine.
 Contains operations for load, store, register declarations, add, sub, mul, etc.
.version 1.4
.target sm_10, map_f64_to_f32
// compiled with …/be.exe
// nvopencc 4.0 built on 2011-03-24
.entry _Z3funPi (
    .param .u32 __cudaparm__Z3funPi_mem)
{
    .reg .u32 %r<4>;
    .loc    16  4   0
$LDWbegin__Z3funPi:
    .loc    16  6   0
    mov.s32         %r1, 1;
    ld.param.u32    %r2, [__cudaparm__Z3funPi_mem];
    st.global.s32   [%r2+0], %r1;
    .loc    16  7   0
    exit;
$LDWend__Z3funPi:
} // _Z3funPi
CUDA GPU targets
 Virtual – PTX code is embedded in the executable as a string, then compiled at runtime, “just in time”.
 Real – PTX code is compiled ahead of time into binary code for a specific GPU target.
Next time
 For next week, we will go into more detail:
 The CUDA runtime API;
 Writing efficient CUDA code;
 Look at some important examples.