ECE408 / CS483
Applied Parallel Programming
Lecture 3:
Introduction to CUDA C (Part 2)
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2012
ECE408/CS483, University of Illinois, Urbana-Champaign
Update – TA Office Hours
• TA office hours will be held at the following
times in CSL369:
– Wednesday 2:00p.m. - 3:00p.m.
– Friday 11:00 a.m. - 12:00 p.m.
• More TA office hours will be added on an as-needed basis
Tentative Plan for Labs
• Lab 0: Account Setup
– Released: Thu, Aug 30 (Week 1)
– Due: N/A
• Lab 1: Vector Add
– Released: Tue, Sep 4 (Week 2)
– Due: Tue, Sep 11 (Week 3)
• Lab 2: Basic Matrix Multiplication
– Released: Tue, Sep 11 (Week 3)
– Due: Tue, Sep 18 (Week 4)
• Lab 3: Tiled Matrix Multiplication
– Released: Tue, Sep 18 (Week 4)
– Due: Tue, Sep 25 (Week 5)
• Lab 4: Tiled Convolution
– Released: Tue, Sep 25 (Week 5)
– Due: Tue, Oct 9 (Week 7)
• Lab 5: Reduction & Prefix Scan
– Released: Tue, Oct 9 (Week 7)
– Due: Tue, Oct 23 (Week 9)
• Lab 6: Histogramming
– Released: Tue, Oct 23 (Week 9)
– Due: Tue, Nov 6 (Week 11)
• Project:
– Release & kickoff lecture: Tue, Oct 23 (Week 9)
– Proposals for non-competition projects due: TBD
– Progress report: TBD (if any)
– Presentations: TBD
Objective
• To learn about data parallelism and the basic
features of CUDA C that enable exploitation
of data parallelism
– Hierarchical thread organization
– Main interfaces for launching parallel execution
– Thread index(es) to data index mapping
CUDA C / OpenCL – Execution Model
• Integrated host+device app C program
– Serial or modestly parallel parts in host C code
– Highly parallel parts in device SPMD kernel C code

Serial Code (host)
Parallel Kernel (device)
KernelA<<< nBlk, nTid >>>(args);
...
Serial Code (host)
Parallel Kernel (device)
KernelB<<< nBlk, nTid >>>(args);
...
Heterogeneous Computing vecAdd – CUDA Host Code
[Figure: host memory on the CPU side, device memory on the GPU side; Parts 1–3 below move data between them]

void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
  int size = n * sizeof(float);
  float *d_A, *d_B, *d_C;
  …
1. // Allocate device memory for A, B, and C;
   // copy A and B to device memory
2. // Kernel launch code – have the device
   // perform the actual vector addition
3. // Copy C from the device memory;
   // free device vectors
}
Partial Overview of CUDA Memories
• Device code can:
– R/W per-thread registers
– R/W all-shared global memory
• Host code can:
– Transfer data to/from per-grid global memory
[Figure: (Device) Grid containing Block (0, 0) and Block (1, 0); each Thread (0, 0) and Thread (1, 0) has its own registers; all blocks share Global Memory, which the host can access]
We will cover more later.
CUDA Device Memory Management API Functions
• cudaMalloc()
– Allocates object in the device global memory
– Two parameters
• Address of a pointer to the allocated object
• Size of allocated object in bytes
• cudaFree()
– Frees object from device global memory
• Pointer to freed object
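A minimal usage sketch of the two calls (the error handling is added here and the size and names are illustrative; the lecture code that follows omits checking):

```cuda
float *d_A = NULL;
size_t size = 1024 * sizeof(float);

// cudaMalloc takes the ADDRESS of the pointer and a byte count,
// and returns a status code worth checking.
cudaError_t err = cudaMalloc((void **)&d_A, size);
if (err != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    exit(EXIT_FAILURE);
}

// ... use d_A in kernels and memcpys ...

cudaFree(d_A);   // release the device global memory
```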
Host-Device Data Transfer API Functions
• cudaMemcpy()
– Memory data transfer
– Requires four parameters
• Pointer to destination
• Pointer to source
• Number of bytes copied
• Type/Direction of transfer
– Transfer to device is asynchronous
void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
  int size = n * sizeof(float);
  float *d_A, *d_B, *d_C;
1. // Transfer A and B to device memory
  cudaMalloc((void **) &d_A, size);
  cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
  cudaMalloc((void **) &d_B, size);
  cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
  // Allocate device memory for C
  cudaMalloc((void **) &d_C, size);
2. // Kernel invocation code – to be shown later
  …
3. // Transfer C from device to host
  cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
  // Free device memory for A, B, C
  cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
Example: Vector Addition Kernel
Device Code
// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__
void vecAddKernel(float* A, float* B, float* C, int n)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  if (i < n) C[i] = A[i] + B[i];
}
void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
  // d_A, d_B, d_C allocations and copies omitted
  // Run ceil(n/256.0) blocks of 256 threads each
  vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);
}
Example: Vector Addition Kernel
// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__
void vecAddKernel(float* A_d, float* B_d, float* C_d, int n)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  if (i < n) C_d[i] = A_d[i] + B_d[i];
}
Host Code
void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
  // d_A, d_B, d_C allocations and copies omitted
  // Run ceil(n/256.0) blocks of 256 threads each
  vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);
}
More on Kernel Launch
Host Code
void vecAdd(float* A, float* B, float* C, int n)
{
  // A_d, B_d, C_d allocations and copies omitted
  // Run ceil(n/256) blocks of 256 threads each
  dim3 DimGrid(n/256, 1, 1);
  if (n%256) DimGrid.x++;  // makes sure there are enough threads to cover all elements
  dim3 DimBlock(256, 1, 1);
  vecAddKernel<<<DimGrid,DimBlock>>>(A_d, B_d, C_d, n);
}
• Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synchronization is needed for blocking.
Kernel execution in a nutshell

__host__
void vecAdd()
{
  dim3 DimGrid(ceil(n/256.0), 1, 1);
  dim3 DimBlock(256, 1, 1);
  vecAddKernel<<<DimGrid,DimBlock>>>(A_d, B_d, C_d, n);
}

__global__
void vecAddKernel(float *A_d, float *B_d, float *C_d, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) C_d[i] = A_d[i] + B_d[i];
}

[Figure: the kernel launches blocks Blk 0 … Blk N-1, which are scheduled onto the GPU's multiprocessors M0 … Mk, all sharing device RAM]
More on CUDA Function Declarations

                                   Executed on the:   Only callable from the:
__device__ float DeviceFunc()      device             device
__global__ void  KernelFunc()      device             host
__host__   float HostFunc()        host               host

• __global__ defines a kernel function
• Each “__” consists of two underscore characters
• A kernel function must return void
• __device__ and __host__ can be used together
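As a sketch of the last point (the function names here are illustrative, not from the lecture), qualifying a function with both __host__ and __device__ compiles it twice, so the same source serves both sides:

```cuda
// Compiled twice: once as a host function, once as a device function.
__host__ __device__ float square(float x) { return x * x; }

// The device copy is callable from kernels...
__global__ void squareKernel(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = square(in[i]);
}

// ...and the host copy is callable from ordinary host code.
float hostSquare(float x) { return square(x); }
```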
LET’S TAKE A LOOK AT A
REAL PIECE OF CUDA CODE
QUESTIONS?