Lecture 13 – GPU Architectures

CS6461 – Computer Architecture
Spring 2015
Morris Lancaster – Instructor
Adapted from Professor Stephen Kaisler’s Notes
Lecture 13
Graphics Processing Units
(GPUs)
Early Graphics
• Vector graphics (1960s)
• Based on the oscilloscope
• Consists of an electron gun + a phosphor display
• Examples:
– SketchPad
– Asteroids
– Pong
• Problems:
– No color
– Limited objects
– Crude images
Raster Displays
• Represent an image with a
“framebuffer”
– A 2D array of pixels
• Requires a RAMDAC:
– Random Access Memory
Digital-to-Analog Converter
• Translates the framebuffer into
a video signal
• Can be used to display any
image (in theory)
• Early Games:
– Sprite based rendering (late
1980s-early 1990s)
– 'Blit' rectangular subregions
quickly to animate objects
– Later games added scrolling
backgrounds
– Examples: Duke Nukem,
Commander Keen
NVIDIA GPU Card
What is a GPU?
• A specialized processor oriented to performing
graphics operations very fast
– Highly parallel vs. serial
– Many execution units at a slower clock vs. few EUs at a higher clock
– Much deeper pipelines
– Mostly deterministic in their computations
• GPUs are fast!!
– 3.0 GHz dual-core Pentium 4: 24.6 GFLOPS
– NVIDIA GeForce 7800 GTX: 165 GFLOPS
– Advanced memory interfaces for moving data
• Growth:
– CPUs: 1.4× annual growth
– GPUs: 1.7× (pixels) to 2.3× (vertices) annual growth
Why GPU?
• To provide a separate, dedicated graphics computation and rendering resource, including a graphics processor and memory.
• To relieve some of the burden on the main system resources (the Central Processing Unit, Main Memory, and the System Bus), which would otherwise get saturated with graphical operations and I/O requests.
Why Are GPUs So Fast?
• The entertainment industry has driven the economics of these chips
– Males age 15-35 buy $10B in video games / year
• Moore's Law ++
• Simplified design (stream processing)
• Single-chip designs
GPU: a Multithreaded Coprocessor
Programming Model: Single Instruction
Multiple Thread (SIMT)
GPU: a Multithreaded Coprocessor
• Avoid divergent branches (see the sketch below)
• Threads of a single SM must be executing the same code
• Code that branches heavily and unpredictably will execute slowly
• Threads should be as independent as possible
• Synchronization and communication can be done efficiently only among the threads of a single multiprocessor
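A minimal CUDA sketch of the divergence issue (both kernels are illustrative, not from the original slides): threads of one warp that take different branch paths are serialized, while branching at warp granularity keeps every thread of a warp on the same path.

// Divergent: even and odd threads of the same warp take different paths,
// so the warp executes both paths one after the other.
__global__ void divergent(float *a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)
        a[i] = a[i] * 2.0f;
    else
        a[i] = a[i] + 1.0f;
}

// Warp-aligned: the branch condition is constant within each 32-thread warp,
// so no warp has to execute both paths.
__global__ void warpAligned(float *a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i / 32) % 2 == 0)
        a[i] = a[i] * 2.0f;
    else
        a[i] = a[i] + 1.0f;
}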
Modern GPU Pipeline
• Input Assembler:
– Takes in 1D vertex data from up to 8 input streams
– Converts data to a canonical format
– Supports a mechanism that allows the IA to effectively replicate an object n times (instancing)
Modern GPU Pipeline - I
• Vertex Shader:
– Receives a stream of vertices in object space, with all their associated information (normals, texture coordinates, per-vertex color, etc.), from the CPU interface
– Outputs a stream of vertices in screen/clip space
– The VS and other programmable stages share a common feature set (the "common core") that includes:
• an expanded set of floating-point, integer, and control instructions
• memory read instructions allowing access to up to 128 memory buffers (textures) and 16 parameter (constant) buffers
– No new vertices are created in this stage
– No vertices are discarded
Modern GPU Pipeline - II
• Geometry Shader:
– In this part of the pipeline, geometry information is eventually turned into raster information (screen-space geometry is the input, pixels are the output)
– Takes the vertices of a single primitive (point, line segment, or triangle) as input and generates the vertices of zero or more primitives
– Additional vertices can be generated on-the-fly, allowing displacement mapping
– The geometry shader has access to adjacency information
– This enables implementation of some powerful new algorithms:
• Realistic fur rendering
• Non-photorealistic (NPR) rendering
Modern GPU Pipeline - III
• Clip/Project/Pre-Rasterize
• Handles the following operations:
– Clipping
– Culling
– Perspective divide
– Viewport transform
– Primitive set-up
– Scissoring
– Depth offset
– Depth processing such as hierarchical-Z
– Fragment generation
• Copies a subset of the vertex information to up to 4 1D output buffers in sequential order
– Data conversion and packing can be implemented by a GS program
• Prior to rasterization, triangles that are backfacing or are located outside the viewing frustum are rejected
• Some GPUs also do some hidden-surface removal at this stage
Modern GPU Pipeline - IV
• Pixel Shader:
– Each fragment produced by the rasterization stage is fed into fragment processing as a set of attributes (position, normal, texcoord, etc.), which are used to compute the final color for this pixel
– The computations taking place here include texture mapping and math operations
– Typically the bottleneck in modern applications
– Produces a single output fragment consisting of 1-8 attribute values and an optional depth value
– If the fragment is to be rendered, it is output to up to 8 render targets
– Each target represents a different representation of the scene
Modern GPU Pipeline
• Output Merger:
– Input is a fragment from the pixel shader
– Fragment colors provided by the previous stage are written to the framebuffer
• Used to be the biggest bottleneck before fragment processing took over
– Uses a single unified depth/stencil buffer; bind points are specified for this buffer and for the 8 render targets
• The degree of multiple render targets has been enhanced to 8
• Before the final write occurs, some fragments are rejected by the z-buffer, stencil, and alpha tests
– On modern GPUs, z and color are compressed to reduce framebuffer bandwidth (but not size)
Vertex and Fragment Shaders: Example
GPU Applications: Bump/Displacement mapping
[Figures: height map; diffuse light without bump mapping; diffuse light with bump mapping]
GPU Applications: Volume texture mapping
GPU Applications: Soft Shadows
Modern GPU
• Vector Processor
– Operates on 4-tuples:
• Position (x, y, z, w)
• Color (red, green, blue, alpha)
• Texture Coordinates (s, t, r, q)
– 4-tuple ops performed in 1 clock cycle
• How many processing units?
– Hundreds plus
• How many ALUs?
– Hundreds to thousands
• Do you need a cache?
– No! Use the space for ALUs
• What kind of memory?
– Lots, very fast
NVIDIA GeForce 6800 3D Pipeline
[Figure: vertex processors → triangle setup → Z-cull → shader instruction dispatch → fragment processors (with L2 texture cache) → fragment crossbar → composite units → four memory partitions]
What Are GPUs Good For?
• Everything you can get them to do efficiently for you!!!
– Physics
– Collision Detection
– AI
– And yes, graphics too…
• GPUs mostly do stuff like rendering images.
• This is done through mostly floating-point arithmetic
– The same stuff people use supercomputing for!
GPU versus Fastest Computers
ref: http://www.llnl.gov/str/JanFeb05/Seager.html
GPU vs. CPU: NVIDIA 280 vs. Intel i7 860
                        GPU (NVIDIA 280)             CPU (Intel i7 860)
Registers               16,384 (32-bit) per          128 reservation stations
                        multiprocessor
Peak memory bandwidth   141.7 GB/sec                 21 GB/sec
Peak GFLOPS             562 (float) / 77 (double)    50 (double)
Cores                   240                          4/8 (hyperthreaded)
Processor clock (MHz)   1296                         2800
Memory                  1 GB                         16 GB
Shared memory           16 KB per TPC                N/A
Virtual memory          None                         Yes
GPU vs. CPU Performance
Graphics in the System
Programming GPUs
• GPUs – in the olden days – used to be hard to
program:
– using a graphics standard like OpenGL (which is mostly
meant for rendering), or
– getting fairly deep into the graphics rendering pipeline.
• To use a GPU to do general purpose number
crunching, you had to make your number crunching
pretend to be graphics.
– This was hard. So most people didn’t bother.
• The power and flexibility of GPUs makes them an
attractive platform for general-purpose computation
– Example applications range from in-game physics simulation
to conventional computational science
– Goal: make the inexpensive power of the GPU available to
developers as a sort of computational coprocessor
GPGPU programming model
• The GPU is viewed as a compute device that:
– Is a coprocessor (slave) to the CPU (host)
– Has its own DRAM (device memory) but no virtual memory
• Kernel is a CPU-callable function. Thread is an instance of a kernel.
• GPU runs many threads in parallel.
[Figure: the host (CPU) launches a kernel on the device (GPU) across PCIe; the kernel runs as many threads (instances of the kernel), which access device memory]
GPU <-> CPU Data Transfer
• GPUs and CPUs communicate via a PCIe bus
– This communication is expensive and should be minimized for target
applications
• Graphics applications usually require
– Initial data to be sent from CPU to GPU
– Single transfer of processed data from GPU to CPU
• General-purpose computations usually require
– Multiple transfers between CPU and GPU (since conditional checks run on the CPU)
– Possibility of saturating the PCIe bus and reducing the achievable performance (a sketch of the one-upload / one-download pattern follows)
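A hedged sketch of the recommended pattern (the kernel and variable names are placeholders, not from the slides): copy the inputs to the device once, keep the iteration loop on the GPU, and copy the results back once, instead of crossing the PCIe bus on every step.

#include <cuda_runtime.h>

// Placeholder kernel: one compute step over n elements.
__global__ void process(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 0.5f + 1.0f;
}

void runOnDevice(float *h_data, int n, int nSteps)
{
    size_t bytes = n * sizeof(float);
    float *d_data;
    cudaMalloc((void **)&d_data, bytes);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // one upload
    for (int step = 0; step < nSteps; step++)                    // all steps stay on the GPU
        process<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // one download
    cudaFree(d_data);
}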
GPU Threads vs. CPU Threads
• GPU threads:
– Lightweight, small creation and scheduling overhead,
extremely fast hardware context switching
– Need to issue 1000s of GPU threads to hide global memory
latencies (600-800 cycles)
• CPU threads:
– Heavyweight, large scheduling overhead, slow context
switching
• Multi-GPU usage requires invocation of multiple CPU
threads
– Each CPU thread creates a GPU context
– Context swapping is required for a CPU thread to access GPU
memory allocated by another CPU thread
GPU Memory Space
• Each thread runs on an SP and has:
– R/W per-thread registers (on-chip)
• Limit usage (max 16K/MP)
– R/W per-thread local memory (off-chip)
– R/W per-block shared memory (on-chip)
• Need to avoid bank conflicts
– R/W per-grid global memory (off-chip)
• Not cached; 600-800 cycle reads
– Latency hidden by parallelism and fast context switches
• Main means of data transfer between host and device
– RO per-grid cached constant and texture memory (off-chip)
• The host can R/W global, constant, and texture memories (visible to all threads); a small kernel sketch using these spaces follows
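A small illustrative kernel (not from the slides) that touches three of these spaces: per-thread registers, per-block shared memory guarded by a __syncthreads() barrier, and per-grid global memory.

#define TILE 256

// Reverses each 256-element tile of the input in place, staging the tile
// through on-chip shared memory (assumes the array length is a multiple of
// TILE and the kernel is launched with TILE threads per block).
__global__ void reverseTiles(float *g_data)
{
    __shared__ float tile[TILE];               // per-block shared memory (on-chip)
    int local  = threadIdx.x;                  // held in a per-thread register
    int global = blockIdx.x * TILE + local;    // index into per-grid global memory

    tile[local] = g_data[global];              // global -> shared
    __syncthreads();                           // wait until the whole block has loaded
    g_data[global] = tile[TILE - 1 - local];   // shared -> global, reversed
}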
But, GPUs Are Difficult To Use as GPs
• GPUs designed for & driven by video games
– Programming model unusual; map general-purpose data
structures to graphics structures
– Programming idioms tied to computer graphics
– Programming environment tightly constrained
• Underlying architectures are:
– Inherently parallel
– Rapidly evolving (even in basic feature set!)
– Largely proprietary
– Lacking random read/write to memory
• Not a Panacea!!
– Can’t simply “port” CPU code!
– GPUs are fast because they are specialized
• Poorly suited to sequential, “pointer-chasing” code
– Missing support for some basic functionality
• E.g. double precision, bitwise operations, indexed write
GPGPU Stream Operations
• Map: apply a function to every element in a
stream
• Reduce: use a function to reduce a stream to a
smaller stream (often 1 element)
• Scatter/gather: indirect read and write
• Filter: select a subset of elements in a stream
• Sort: order elements in a stream
• Search: find a given element, nearest neighbors, etc. (a sketch of the map primitive follows)
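As a sketch (illustrative kernel, not part of the original material), the map primitive is simply an element-wise kernel; reduce, scatter/gather, and the rest are built from similar kernels plus shared-memory cooperation.

// Map: apply f(x) = x*x to every element of a stream.
__global__ void mapSquare(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * in[i];
}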
GPGPU Challenges
• The killer app
• Programming models and tools
• GPU in tomorrow's computer?
• Data conditionals
• Relationship to other parallel hw/sw
• Managing rapid change in hw/sw (roadmaps)
• Performance evaluation and cliffs
• Philosophy of faults and lack of precision
• Broader toolbox for computation / data structures
• Wedding graphics and GPGPU techniques
Programming GPGPUs
• Proprietary programming language or extensions
– NVIDIA: Compute Unified Device Architecture (CUDA) (C/C++)
– AMD/ATI: StreamSDK/Brook+ (C/C++)
• OpenCL (Open Computing Language): an industry
standard for doing number crunching on GPUs.
• Portland Group Inc (PGI) Fortran and C compilers with
accelerator directives;
– PGI CUDA Fortran (Fortran 90 equivalent of NVIDIA’s CUDA C).
Processing Flow
To Compute, We Need To
• Allocate memory that will be used for the computation (variable
declaration and allocation)
• Read the data that we will compute on (input)
• Specify the computation that will be performed
• Write the results to the appropriate device (output)
But, the GPU is a specialized computing engine:
• Need to allocate space in the GPU’s memory for the variables.
• The GPU does not have I/O devices
– need to copy the input data from the memory in the host computer
into the memory in the GPU, using the variable allocated in the
previous step.
• Need to specify code to execute.
• Must copy the results back to the memory in the host computer.
What Executes on the GPU?
• Kernel:
– A simple C function
– Executes in parallel with other instances on other cores
– The keyword __global__ tells the compiler to make a function a kernel (and compile it for the GPU, instead of the CPU); see the sketch below
• How to increase parallelism:
– More independent work within a thread (warp)
– More concurrent threads
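A minimal kernel in the spirit of this slide (the saxpy name and signature are illustrative): __global__ marks the function for GPU compilation, and each of the many concurrent instances handles one element.

// __global__ makes this a kernel: compiled for the GPU, callable from the host.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // this instance's element
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// Host-side launch, e.g.:  saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);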
Requirements for GPU Performance
• Expose sufficient parallelism
– Large number of independent instructions
– Arithmetic pipes should be fed with many independent instructions
– Memory system: keep many requests in flight to saturate bandwidth
• Coalesce memory accesses (illustrated below)
• Coherent execution within a warp
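A hedged illustration of coalescing (both kernels are hypothetical): consecutive threads touching consecutive addresses are combined into a few wide memory transactions, while a strided pattern scatters each warp's accesses across many segments.

// Coalesced: thread i reads element i, so a warp reads one contiguous block.
__global__ void copyCoalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: thread i reads element i*stride, wasting most of each memory transaction.
__global__ void copyStrided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];
}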
What Do Programmers Need To Do?
• How is every instance of the kernel going to know which piece of data it is working on?
• Programmers need to specify (see the launch sketch below):
– The grid size: the size and shape of the data that the program will be working on
– The block size: the sub-area of the original grid that will be assigned to an MP (a set of stream processors that share local memory)
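A sketch of specifying the two sizes for a hypothetical 2D problem (the kernel name and dimensions are illustrative):

// One thread per pixel of a width x height image.
__global__ void processImage(float *img, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (x < width && y < height)
        img[y * width + x] *= 0.5f;
}

void launch(float *d_img, int width, int height)
{
    dim3 block(16, 16);                              // block size: 16x16 threads
    dim3 grid((width  + block.x - 1) / block.x,      // grid size: enough blocks
              (height + block.y - 1) / block.y);     // to cover the whole image
    processImage<<<grid, block>>>(d_img, width, height);
}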
Threads: Grids and Blocks

[Figure: the host launches Kernel 1 on Grid 1 (a 3×2 array of blocks) and Kernel 2 on Grid 2; Block (1,1) is expanded to show its 5×3 array of threads]

Ref: "NVIDIA CUDA Programming Guide" version 1.1
Threads: Grids and Blocks
• A kernel is executed as a grid of thread blocks (aka
blocks)
• A thread block is a batch of threads that can
cooperate with each other by:
– Synchronizing their execution
• Diverging execution results in performance loss
– Efficiently sharing data through a low latency shared
memory
• Two threads from two different blocks cannot
cooperate
Ref: “NVIDIA CUDA Programming Guide” version 1.1
Block and Thread IDs
• Threads and blocks have IDs
– So each thread can identify what data it will operate on
– Block ID: 1D or 2D
– Thread ID: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data
– Image processing
– Solving PDEs on volumes
– Other problems with an underlying 1D, 2D, or 3D geometry
[Figure: the same grid-of-blocks / block-of-threads hierarchy, annotated with block and thread IDs]

Ref: "NVIDIA CUDA Programming Guide" version 1.1

Scheduling in GPUs
• GPU consists of “multiprocessors”, each of which has
many processors
• A kernel is executed as a grid of blocks
• Thread block is a batch of threads that cooperate with
each other by:
– Synchronizing their execution
• Diverging execution results in performance loss
– Efficiently sharing data through a low latency shared memory
• All threads of a block reside on the same multiprocessor
• Number of blocks a multiprocessor can process at once
depends on register and shared memory usage per
thread
Ref: “NVIDIA CUDA Programming Guide” version 1.1
C Example

// Plain C version
int A[2][4];
for (i = 0; i < 2; i++) {
    for (j = 0; j < 4; j++) {
        A[i][j]++;
    }
}

// Converted into CUDA (schematic: in real code A would live in device memory)
int A[2][4];
kernelF<<<(2,1),(4,1)>>>(A);         // define threads: 2 blocks of 4 threads each

__global__ void kernelF(int A[][4])  // all threads run the same kernel
{
    int i = blockIdx.x;              // each thread block has its id
    int j = threadIdx.x;             // each thread has its id
    A[i][j]++;                       // each thread has a different i and j
}
Thread Hierarchy
Thread 3 of block 1 operates on element A[1][3]:

int A[2][4];
kernelF<<<(2,1),(4,1)>>>(A);         // define threads
__global__ void kernelF(int A[][4])  // all threads run the same kernel
{
    int i = blockIdx.x;              // each thread block has its id
    int j = threadIdx.x;             // each thread has its id
    A[i][j]++;                       // each thread has a different i and j
}
Thread Execution
int A[2][4];
kernelF<<<(2,1),(4,1)>>>(A);
__global__ void kernelF(int A[][4])
{
    int i = blockIdx.x;
    int j = threadIdx.x;
    A[i][j]++;
}

Corresponding PTX (schematic):

mov.u32 %r0, %ctaid.x           // r0 = i = blockIdx.x
mov.u32 %r1, %ntid.x            // r1 = "threads-per-block"
mov.u32 %r2, %tid.x             // r2 = j = threadIdx.x
mad.u32 %r3, %r0, %r1, %r2      // r3 = i * "threads-per-block" + j
ld.global.s32 %r4, [%r3]        // r4 = A[i][j]
add.s32 %r4, %r4, 1             // r4 = r4 + 1
st.global.s32 [%r3], %r4        // A[i][j] = r4
NVidia’s CUDA
• Compute Unified Device Architecture (CUDA)
– Elegant solution to problem of expressing parallelism
– Not all algorithms, but enough to matter
• CUDA’s design goals
– extend a standard sequential programming language,
specifically C/C++,
– focus on the important issues of parallelism—how to craft
efficient parallel algorithms—rather than grappling with the
mechanics of an unfamiliar and complicated language.
• minimalist set of abstractions for expressing parallelism
• highly scalable parallel code that can run across tens of thousands
of concurrent threads and hundreds of processor cores.
– A thread is associated with each data element
OpenCL
• Open Computing Language
• Open standard developed by Apple:
– Now maintained by the Khronos Group, a consortium of many companies (including NVIDIA, AMD, and Intel, but also lots of others)
• Initial version of OpenCL standard released in Dec
2008.
• Many companies will create their own implementations.
• Apple was first to market, with an OpenCL
implementation included in Mac OS X v10.6 (“Snow
Leopard”) in 2009.
OpenCL vs. CUDA
• CUDA uses early code binding
– Code is compiled with normal C/C++/FORTRAN (beta)
source code
• Need CUDA occupancy calculator to determine number of threads
based on resource utilization
– Library support: BLAS & FFT & DPT
• OpenCL
– Late binding of OpenCL code to executable
• OpenCL compiler/linker embedded within application
• No need for CUDA occupancy calculator
– Only supports C
– No libraries
Matrix Multiplication
Ref: Multicores, Multiprocessors, and Clusters; University of Arizona
Matrix Multiplication
• For a 4096×4096 matrix multiplication:
– Matrix C will require calculation of 16,777,216 matrix cells.
– On the GPU, each cell is calculated by its own thread.
– On a general-purpose processor we can only calculate one cell at a time per core.
• We can have 23,040 active threads (GTX570), which means we can have this many matrix cells calculated in parallel.
• Each thread exploits the GPU's fine granularity by computing one element of Matrix C.
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth (see the tiled kernel sketch below).
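A hedged sketch of that scheme (assuming square TILE x TILE thread blocks and n divisible by TILE, which holds for 4096; this is not the exact code of the cited reference): each thread computes one element of C, and the block cooperatively stages tiles of A and B in shared memory.

#define TILE 16

// C = A * B for n x n row-major matrices; one thread per element of C.
__global__ void matMulTiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < n / TILE; t++) {
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];    // tile of A
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];  // tile of B
        __syncthreads();                        // tile fully loaded
        for (int k = 0; k < TILE; k++)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                        // done reading this tile
    }
    C[row * n + col] = sum;
}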
Solving Systems of Equations
Ref: Multicores, Multiprocessors, and Clusters; University of Arizona
Additional Material
Simple Example
#include <stdio.h>
#define SIZEOFARRAY 64
extern void fillArray(int *a, int size);

/* The main program */
int main(int argc, char *argv[])
{
    /* Declare the array that will be modified by the GPU */
    int a[SIZEOFARRAY];
    int i;
    /* Initialize the array */
    for (i = 0; i < SIZEOFARRAY; i++) {
        a[i] = i;
    }
    /* Print the initial array */
    printf("Initial state of the array:\n");
    for (i = 0; i < SIZEOFARRAY; i++) {
        printf("%d ", a[i]);
    }
    printf("\n");
    /* Call the function that will in turn call the function
       in the GPU that will fill the array */
    fillArray(a, SIZEOFARRAY);
    /* Now print the array after calling fillArray */
    printf("Final state of the array:\n");
    for (i = 0; i < SIZEOFARRAY; i++) {
        printf("%d ", a[i]);
    }
    printf("\n");
    return 0;
}
Function in the GPU: Simple.cu
• simple.cu contains two functions
– fillArray(): a function that will be executed on the host and which takes care of:
• Allocating variables in the global GPU memory
• Copying the array from the host to the GPU memory
• Setting the grid and block sizes
• Invoking the kernel that is executed on the GPU
• Copying the values back to the host memory
• Freeing the GPU memory
– cu_fillArray():
• This is the kernel that will be executed by every stream processor in the GPU
• It is identified as a kernel by the use of the keyword: __global__
• This function uses the built-in variables blockIdx.x and threadIdx.x to identify a particular position in the array
fillArray – Part 1
#define BLOCK_SIZE 32

extern "C" void fillArray(int *array, int arraySize)
{
    /* array_d is the GPU counterpart of the array that exists in host memory */
    int *array_d;
    cudaError_t result;

    /* allocate memory on device */
    /* cudaMalloc allocates space in the memory of the GPU card */
    result = cudaMalloc((void **)&array_d, sizeof(int) * arraySize);

    /* copy the array into the variable array_d in the device */
    /* The memory from the host is copied to the corresponding
       variable in the GPU global memory */
    result = cudaMemcpy(array_d, array,
                        sizeof(int) * arraySize,
                        cudaMemcpyHostToDevice);
fillArray – Part 2
    /* execution configuration... */
    /* Indicate the dimension of the block */
    dim3 dimblock(BLOCK_SIZE);
    /* Indicate the dimension of the grid in blocks */
    dim3 dimgrid(arraySize / BLOCK_SIZE);

    /* actual computation: call the kernel, the function that is
       executed by each and every processing element on the GPU card */
    cu_fillArray<<<dimgrid, dimblock>>>(array_d);

    /* read results back: */
    /* Copy the results from the GPU back to the memory on the host */
    result = cudaMemcpy(array, array_d,
                        sizeof(int) * arraySize,
                        cudaMemcpyDeviceToHost);

    /* Release the memory on the GPU card */
    cudaFree(array_d);
}
cu_fillArray
__global__ void cu_fillArray(int *array_d)
{
    int x;
    /* blockIdx.x is a built-in variable in CUDA
       that returns the blockId in the x axis
       of the block that is executing this block of code.
       threadIdx.x is another built-in variable in CUDA
       that returns the threadId in the x axis
       of the thread that is being executed by this
       stream processor in this particular block */
    x = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    array_d[x] += array_d[x];
}
NVIDIA CUDA

// main routine that executes on the host
int main(void)
{
    float *a_h, *a_d;                  // Pointers to host & device arrays
    const int N = 10;                  // Number of elements in arrays
    size_t size = N * sizeof(float);
    a_h = (float *)malloc(size);       // Allocate array on host
    cudaMalloc((void **)&a_d, size);   // Allocate array on device
    // Initialize host array and copy it to CUDA device
    for (int i = 0; i < N; i++) a_h[i] = (float)i;
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
    // Do calculation on device:
    int block_size = 4;
    int n_blocks = N / block_size + (N % block_size == 0 ? 0 : 1);
    square_array<<<n_blocks, block_size>>>(a_d, N);
    // Retrieve result from device and store it in host array
    cudaMemcpy(a_h, a_d, sizeof(float) * N, cudaMemcpyDeviceToHost);
    // Print results
    for (int i = 0; i < N; i++) printf("%d %f\n", i, a_h[i]);
    // Cleanup
    free(a_h); cudaFree(a_d);
}
Ref: http://llpanorama.wordpress.com/2008/05/21/my-first-cuda-program/
OpenCL Example Part 1
// create a compute context with a GPU device
context = clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU,
                                  NULL, NULL, NULL);

// create a command queue
queue = clCreateCommandQueue(context, NULL, 0, NULL);

// allocate the buffer memory objects
memobjs[0] = clCreateBuffer(context,
                            CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(float) * 2 * num_entries, srcA, NULL);
memobjs[1] = clCreateBuffer(context,
                            CL_MEM_READ_WRITE,
                            sizeof(float) * 2 * num_entries, NULL, NULL);

// create the compute program
program = clCreateProgramWithSource(context, 1,
                                    &fft1D_1024_kernel_src, NULL, NULL);
OpenCL Example Part 2
// build the compute program executable
clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

// create the compute kernel
kernel = clCreateKernel(program, "fft1D_1024", NULL);

// set the argument values
clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memobjs[0]);
clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&memobjs[1]);
clSetKernelArg(kernel, 2,
               sizeof(float) * (local_work_size[0] + 1) * 16, NULL);
clSetKernelArg(kernel, 3,
               sizeof(float) * (local_work_size[0] + 1) * 16, NULL);

// create an N-D range object with work-item dimensions and execute the kernel
global_work_size[0] = num_entries;
local_work_size[0] = 64;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                       global_work_size, local_work_size, 0, NULL, NULL);
OpenCL Example Part 3
// This kernel computes an FFT of length 1024. The 1024-point FFT
// is decomposed into calls to a radix-16 function, another radix-16
// function, and then a radix-4 function.
__kernel void fft1D_1024(__global float2 *in, __global float2 *out,
                         __local float *sMemx, __local float *sMemy)
{
    int tid = get_local_id(0);
    int blockIdx = get_group_id(0) * 1024 + tid;
    float2 data[16];

    // starting index of data to/from global memory
    in = in + blockIdx;
    out = out + blockIdx;

    globalLoads(data, in, 64);            // coalesced global reads
    fftRadix16Pass(data);                 // in-place radix-16 pass
    twiddleFactorMul(data, tid, 1024, 0);
OpenCL Example Part 4
    // local shuffle using local memory
    localShuffle(data, sMemx, sMemy, tid, (((tid & 15) * 65) + (tid >> 4)));

    // in-place radix-16 pass
    fftRadix16Pass(data);
    // twiddle factor multiplication
    twiddleFactorMul(data, tid, 64, 4);

    localShuffle(data, sMemx, sMemy, tid, (((tid >> 4) * 64) + (tid & 15)));

    // four radix-4 function calls
    fftRadix4Pass(data);        // radix-4 function number 1
    fftRadix4Pass(data + 4);    // radix-4 function number 2
    fftRadix4Pass(data + 8);    // radix-4 function number 3
    fftRadix4Pass(data + 12);   // radix-4 function number 4

    // coalesced global writes
    globalStores(data, out, 64);
}
Dr. Dobb’s Rules for Multicore Programming
Think parallel.
• Approach all problems looking for the parallelism.
• Understand where parallelism is, and organize your
thinking to express it.
• Decide on the best parallel approach before other design
or implementation decisions. Learn to "Think Parallel."
Dr. Dobb’s Rules for Multicore Programming
Program using abstraction.
• Focus on writing code to express parallelism, but
avoid writing code to manage threads or processor
cores.
• Libraries, OpenMP, and Intel Threading Building
Blocks are all examples of using abstractions.
• Do not use raw native threads (pthreads, Windows
threads, Boost threads, and the like).
• Threads and MPI are the assembly languages for
parallelism. They offer maximum flexibility, but
require too much time to write, debug, and maintain.
• Your programming should be at a high-enough level
that your code is about your problem, not about
thread or core management.
Dr. Dobb’s Rules for Multicore Programming
Program in tasks (chores), not threads (cores).
• Leave the mapping of tasks to threads or processor
cores as a distinctly separate operation in your
program, preferably an abstraction you are using
that handles thread/core management for you.
• Create an abundance of tasks in your program, or a task that can be spread across processor cores automatically (such as an OpenMP loop; see the sketch below).
• By creating tasks, you are free to create as many as
you can without worrying about oversubscription.
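A minimal C/OpenMP sketch of this idea (illustrative, not from the original notes): the loop describes the tasks, and the runtime, not the programmer, maps them onto however many cores are available.

#include <omp.h>

// Each iteration is an independent task; OpenMP schedules them across cores.
void scaleArray(float *a, int n, float factor)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] *= factor;
}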
Dr. Dobb’s Rules for Multicore Programming
Design with the option to turn concurrency off.
• To make debugging simpler, create programs that can run without
concurrency. This way, when debugging, you can run programs first
with—then without—concurrency, and see if both runs fail or not.
• Debugging common issues is simpler when the program is not running
concurrently because it is more familiar and better supported by today's
tools.
• Knowing that something fails only when run concurrently hints at the
type of bug you are tracking down.
• If you ignore this rule and can't force your program to run in only one
thread, you'll spend too much time debugging. Since you want to have
the capability to run in a single thread specifically for debugging, it
doesn't need to be efficient.
• You just need to avoid creating parallel programs that require
concurrency to work correctly, such as many producer-consumer
models.
• MPI programs often violate this rule, which is part of the reason MPI
programs can be problematic to implement and debug.
Dr. Dobb’s Rules for Multicore Programming
Avoid using locks.
• Simply say "no" to locks. Locks slow programs, reduce
their scalability, and are the source of bugs in parallel
programs.
• Make implicit synchronization the solution for your
program.
• When you still need explicit synchronization, use atomic operations (see the sketch below).
• Use locks only as a last resort.
• Work hard to design the need for locks completely out of
your program.
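For example, a shared counter can usually be updated with an atomic operation instead of a lock; a CUDA-flavored sketch (a hypothetical histogram kernel, matching the language used elsewhere in this lecture):

// Lock-free histogram: atomicAdd serializes only conflicting updates to a bin,
// not the whole kernel (bins are assumed to be zero-initialized).
__global__ void histogram(const unsigned char *data, int n, unsigned int *bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);
}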
Dr. Dobb’s Rules for Multicore Programming
Use tools and libraries designed to help with
concurrency.
• Don't "tough it out" with old tools.
• Be critical of tool support with regards to how it presents
and interacts with parallelism.
• Most tools are not yet ready for parallelism.
• Look for threadsafe libraries—ideally ones that are
designed to utilize parallelism themselves.
Dr. Dobb’s Rules for Multicore Programming
Use scalable memory allocators.
• Threaded programs need to use scalable memory
allocators. Period.
• There are a number of solutions and I'd guess that
all of them are better than malloc().
• Using scalable memory allocators speeds up
applications by eliminating global bottlenecks,
reusing memory within threads to better utilize
caches, and partitioning properly to avoid cache line
sharing.
• SHK: Better yet, use a programming language that has implicit memory allocation and garbage collection, e.g., Java, Common Lisp, …
Dr. Dobb’s Rules for Multicore Programming
Design to scale through increased workloads.
• The amount of work your program needs to handle
increases over time. Plan for that.
• Designed with scaling in mind, your program will handle more work as the number of processor cores increases.
• Every year, we ask our computers to do more and more.
• Your designs should favor using increases in parallelism
to give you advantages in handling bigger workloads in
the future.