Programming Massively Parallel
Processors Using CUDA & C++AMP
Lecture 1 - Introduction
Wen-mei Hwu, Izzat El Hajj
CEA-EDF-Inria Summer School 2013
Course Goals
• Learn how to program massively parallel
processors and achieve
– high performance
– functionality and maintainability
– scalability across future generations
• Technical subjects
– principles and patterns of parallel algorithms
– processor architecture features and constraints
– programming API, tools and techniques
© David Kirk/NVIDIA and
Wen-mei W. Hwu, 2007-2013
Text/Notes
• D. Kirk and W. Hwu, “Programming Massively Parallel Processors – A Hands-on Approach,” 2nd Edition, Morgan Kaufmann Publishers, 2012
• NVIDIA, CUDA C Programming Guide, Version 5.0, NVIDIA, 2013 (reference)
Course Sessions Overview
Monday
• Lecture 1
• Lecture 2
• Lab Session 1
Tuesday
• Lecture 3
• Lecture 4
Wednesday
• Lecture 5
• Lecture 6
• Lab Session 2
Thursday
• Lab Session 3
Friday
• Lab Session 4
• Lab Session 5
Blue Waters Hardware
• Cray System & Storage cabinets: >300
• Compute nodes: >25,000
• Usable storage bandwidth: >1 TB/s
• System memory: >1.5 petabytes
• Memory per core module: 4 GB
• Gemini interconnect topology: 3D torus
• Usable storage: >25 petabytes
• Peak performance: >11.5 petaflops
• Number of AMD Interlagos processors: >49,000
• Number of AMD x86 core modules: >380,000
• Number of NVIDIA Kepler GPUs: >4,000
Cray XK7 Compute Node
XK7 Compute Node Characteristics:
• Host processor: AMD Series 6200 (Interlagos)
• GPU: NVIDIA Kepler (Keplers in final installation)
• Host memory: 32 GB, 1600 MT/s DDR3
• NVIDIA Tesla X2090 memory: 6 GB GDDR5 capacity
• Gemini high-speed interconnect
CPU and GPU have very different design philosophies
• CPU: latency-oriented cores
• GPU: throughput-oriented cores
[Diagram: a CPU chip with a few large cores, each with control logic, SIMD units, threading support, registers, and a local cache, versus a GPU chip with many small compute units, each with registers, SIMD units, and cache/local memory]
CPUs: Latency-Oriented Design
• Large caches
– Convert long-latency memory accesses into short-latency cache accesses
• Sophisticated control
– Branch prediction for reduced branch latency
– Data forwarding for reduced data latency
• Powerful ALUs
– Reduced operation latency
[Diagram: CPU chip with control logic, a large cache, a few powerful ALUs, and attached DRAM]
GPUs: Throughput-Oriented Design
• Small caches
– To boost memory throughput
• Simple control
– No branch prediction
– No data forwarding
• Energy-efficient ALUs
– Many, long-latency but heavily pipelined for high throughput
• Require a massive number of threads to tolerate latencies
[Diagram: GPU chip with many simple ALUs and attached DRAM]
Winning Applications Use Both
CPU and GPU
• CPUs for sequential
parts where latency
matters
– CPUs can be 10+X
faster than GPUs for
sequential code
• GPUs for parallel
parts where
throughput wins
– GPUs can be 10+X
faster than CPUs for
parallel code
Heterogeneous parallel computing is
catching on.
Financial Analysis, Scientific Simulation, Engineering Simulation, Data Intensive Analytics, Medical Imaging, Digital Audio Processing, Digital Video Processing, Computer Vision, Biomedical Informatics, Electronic Design Automation, Statistical Modeling, Ray Tracing Rendering, Interactive Physics, Numerical Methods
• 280 submissions to GPU Computing Gems
– 110 articles included in two volumes
What is at stake?
• Scalable and portable software lasts
through many hardware generations
Scalable algorithms and libraries can
be the best legacy we can leave behind
from this era
Introduction to CUDA C
Objective
• To learn about data parallelism and the
basic features of CUDA C that enable
exploitation of data parallelism
– Hierarchical thread organization
– Main interfaces for launching parallel
execution
– Thread index(es) to data index mapping
Parallel Programming Work Flow
• Identify compute intensive parts of an
application
• Adopt scalable algorithms
• Optimize data arrangements to maximize
locality
• Tune performance
• Pay attention to code portability and
maintainability
Massive Parallelism – Regularity
CUDA /OpenCL – Execution Model
• Integrated host+device app C program
– Serial or modestly parallel parts in host C code
– Highly parallel parts in device SPMD kernel C code
Serial Code (host)
Parallel Kernel (device)
KernelA<<< nBlk, nTid >>>(args);
...
Serial Code (host)
Parallel Kernel (device)
KernelB<<< nBlk, nTid >>>(args);
...
The Von-Neumann Model
[Diagram: Memory and I/O connected to a processor consisting of a processing unit (ALU, register file) and a control unit (PC, IR)]
Arrays of Parallel Threads
• A CUDA kernel is executed by a grid (array) of
threads
– Each thread is a virtualized Von-Neumann Processor
– All threads in a grid run the same kernel code (SPMD)
– Each thread has an index that it uses to compute
memory addresses and make control decisions
tid = compute thread index
work(tid)
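A minimal sketch (not from the slides) of how this pseudocode looks as an actual CUDA kernel; the per-element doubling stands in for whatever work(tid) would do:
__global__ void spmdKernel(float* data, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // compute thread index
    if (tid < n) data[tid] = 2.0f * data[tid];         // "work(tid)": example per-element operation
}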
Thread Blocks: Scalable Cooperation
• Divide thread array into multiple blocks
– Threads within a block cooperate via shared
memory, atomic operations and barrier
synchronization
– Threads in different blocks cannot cooperate
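A minimal sketch (not from the slides) of within-block cooperation through shared memory and a barrier; it assumes blockDim.x == 256, n a multiple of 256, and a grid of n/256 blocks:
__global__ void reverseInBlock(float* A, int n)
{
    __shared__ float buf[256];                     // memory shared by all threads in this block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = A[i];                       // each thread stages one element
    __syncthreads();                               // barrier: wait until every thread in the block has written
    A[i] = buf[blockDim.x - 1 - threadIdx.x];      // read an element written by another thread
}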
Computing the Thread Index
gridDim.x – number of blocks in the grid
blockIdx.x – position of the block within the grid
blockDim.x – number of threads in each block
threadIdx.x – position of the thread within its block
tid = blockIdx.x*blockDim.x + threadIdx.x
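A common extension of this index computation (a sketch, not in the slides) is the grid-stride loop, which lets a fixed-size grid process an array of any length:
__global__ void scaleKernel(float* data, float s, int n)
{
    // start at the global thread index and stride by the total number of threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        data[i] = s * data[i];
}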
blockIdx and threadIdx
• Each thread uses indices to decide what data to work on
– blockIdx: 1D, 2D, or 3D (CUDA 4.0)
– threadIdx: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data (see the sketch below)
– Image processing
– Solving PDEs on volumes
– …
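As an illustration (a sketch, not from the slides; the kernel name and image layout are assumptions), a 2D grid of 2D blocks can assign one thread per pixel of a width×height image:
__global__ void invertImage(unsigned char* img, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // pixel column
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // pixel row
    if (col < width && row < height)
        img[row * width + col] = 255 - img[row * width + col];
}
// launch example: dim3 block(16, 16); dim3 grid((width+15)/16, (height+15)/16);
// invertImage<<<grid, block>>>(img_d, width, height);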
Vector Addition – Conceptual View
[Diagram: vector A (A[0] … A[N-1]) and vector B (B[0] … B[N-1]) are added element by element to produce vector C (C[0] … C[N-1])]
Vector Addition – Traditional C Code
// Compute vector sum C = A+B
void vecAdd(float* A, float* B, float* C, int n)
{
    for (int i = 0; i < n; i++)
        C[i] = A[i] + B[i];
}

int main()
{
    // Memory allocation for A_h, B_h, and C_h
    // I/O to read A_h and B_h, N elements
    …
    vecAdd(A_h, B_h, C_h, N);
}
Heterogeneous Computing vecAdd – Host Code
#include <cuda.h>
void vecAdd(float* A, float* B, float* C, int n)
{
    int size = n * sizeof(float);
    float *A_d, *B_d, *C_d;
    ...
    // Part 1: Allocate device memory for A, B, and C
    //         copy A and B to device memory

    // Part 2: Kernel launch code – to have the device
    //         perform the actual vector addition

    // Part 3: copy C from the device memory
    //         Free device vectors
}
[Diagram: Part 1 copies the inputs from host memory (CPU) to device memory (GPU), Part 2 runs on the GPU, and Part 3 copies the result back to the host]
Control Flow
CPU (host): cudaMalloc(...) → cudaMemcpy(...) → kernel<<<...>>>(...) → cudaMemcpy(...) → cudaFree(...)
GPU (device): executes the launched kernel
Partial Overview of CUDA Memories
• Device code can:
– R/W per-thread registers
– R/W per-grid global memory
• Host code can:
– Transfer data to/from per-grid global memory
[Diagram: a device grid of thread blocks; each thread has its own registers, and all threads share global memory, which the host can access]
We will cover more later.
CUDA Device Memory Management API Functions
• cudaMalloc()
– Allocates an object in the device global memory
– Two parameters:
• Address of a pointer to the allocated object
• Size of the allocated object in bytes
• cudaFree()
– Frees an object from device global memory
– One parameter:
• Pointer to the freed object
(see the usage sketch below)
[Diagram: same device grid / global memory picture as the previous slide]
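A minimal usage sketch (not from the slides; the size is hypothetical and the error check is one common convention):
float *A_d = NULL;
int size = 1024 * sizeof(float);
cudaError_t err = cudaMalloc((void**)&A_d, size);   // allocate in device global memory
if (err != cudaSuccess)
    printf("%s\n", cudaGetErrorString(err));        // report allocation failure (needs <stdio.h>)
...
cudaFree(A_d);                                      // release the device allocation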
Host-Device Data Transfer API Functions
• cudaMemcpy()
– Memory data transfer
– Requires four parameters:
• Pointer to destination
• Pointer to source
• Number of bytes copied
• Type/direction of transfer
– Transfer to device is asynchronous
[Diagram: same device grid / global memory picture as the previous slides]
void vecAdd(float* A, float* B, float* C, int n) {
    int size = n * sizeof(float);
    float *A_d, *B_d, *C_d;

    // 1. Transfer A and B to device memory
    cudaMalloc((void **) &A_d, size);
    cudaMemcpy(A_d, A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &B_d, size);
    cudaMemcpy(B_d, B, size, cudaMemcpyHostToDevice);
    // Allocate device memory for C
    cudaMalloc((void **) &C_d, size);

    // 2. Kernel invocation code – to be shown later
    ...

    // 3. Transfer C from device to host
    cudaMemcpy(C, C_d, size, cudaMemcpyDeviceToHost);
    // Free device memory for A, B, C
    cudaFree(A_d); cudaFree(B_d); cudaFree(C_d);
}
Example: Vector Addition Kernel
[Diagrams: one thread is assigned per element of the input vectors; because threads come in fixed-size blocks, the global thread count is quantized up to a multiple of the block size, so either boundary checks or data padding are needed for the extra threads]
Example: Vector Addition Kernel

// Device Code
// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__
void vecAddKernel(float* A_d, float* B_d, float* C_d, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C_d[i] = A_d[i] + B_d[i];
}

// Host Code
void vecAdd(float* A, float* B, float* C, int n)
{
    // A_d, B_d, C_d allocations and copies omitted
    // Run ceil(n/256) blocks of 256 threads each
    vecAddKernel<<<ceil(n/256.0), 256>>>(A_d, B_d, C_d, n);
}
More on Kernel Launch (Host Code)
void vecAdd(float* A, float* B, float* C, int n) {
    // A_d, B_d, C_d allocations and copies omitted
    // Run ceil(n/256) blocks of 256 threads each
    dim3 DimGrid(n/256, 1, 1);
    if (n%256) DimGrid.x++;
    dim3 DimBlock(256, 1, 1);
    vecAddKernel<<<DimGrid,DimBlock>>>(A_d, B_d, C_d, n);
}
• Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synchronization is needed for blocking (see the sketch below)
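A minimal sketch (not from the slides) of making the host block until the kernel has finished:
vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);  // launch returns immediately
cudaDeviceSynchronize();                                // host waits here for the kernel to complete
// a blocking cudaMemcpy of the result would also wait for the kernel implicitly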
Kernel execution in a nutshell

__host__
void vecAdd()
{
    dim3 DimGrid(ceil(n/256.0), 1, 1);
    dim3 DimBlock(256, 1, 1);
    vecAddKernel<<<DimGrid,DimBlock>>>(A_d, B_d, C_d, n);
}

__global__
void vecAddKernel(float *A_d, float *B_d, float *C_d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C_d[i] = A_d[i] + B_d[i];
}

[Diagram: the launched kernel is a grid of blocks Blk 0 … Blk N-1, which are scheduled onto the GPU's multiprocessors M0 … Mk, all sharing device RAM]
More on CUDA Function Declarations
                                   Executed on the:   Only callable from the:
__device__ float DeviceFunc()      device             device
__global__ void KernelFunc()       device             host
__host__ float HostFunc()          host               host

__global__ defines a kernel function
• Each “__” consists of two underscore characters
• A kernel function must return void
__device__ and __host__ can be used together (see the sketch below)
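A minimal sketch (not from the slides) of combining the two qualifiers so one function is compiled for both host and device:
__host__ __device__ float square(float x)
{
    return x * x;   // the same source is usable from CPU code and from inside kernels
}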
Compiling A CUDA Program
Integrated C programs with CUDA extensions
↓ NVCC Compiler
• Host Code → Host C Compiler/Linker
• Device Code (PTX) → Device Just-in-Time Compiler
↓
Heterogeneous Computing Platform with CPUs, GPUs
ANY MORE QUESTIONS?