Device Routines
and device variables
These notes will introduce:
• Declaring routines that are executed on the device and on the host
• Declaring local variables on the device
ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 25, 2011
DeviceRoutines.pptx
CUDA extensions to declare
kernel routines

Host = CPU
Device = GPU

__global__ indicates a routine that can only be called from the host and is executed only on the device.

__device__ indicates a routine that can only be called from the device and is executed only on the device.

__host__ indicates a routine that can only be called from the host and is executed only on the host (generally only used in combination with __device__, see later).

Each qualifier is written with two underscores before and after.

Note: a kernel cannot call a routine to be executed on the host.
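As a minimal sketch of how the three qualifiers appear in source (the names square, cube, and squareAll are invented here for illustration, not from the notes):

```cuda
// Compiled for the device; callable only from device code.
__device__ float square(float x) { return x * x; }

// Compiled for the host; callable only from host code.
// (__host__ alone is the default and is usually omitted.)
__host__ float cube(float x) { return x * x * x; }

// Kernel: called from the host, executed on the device.
__global__ void squareAll(float *a, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) a[tid] = square(a[tid]);   // device routine called from a kernel
}
```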
So far we have seen __global__:

__global__ must have a void return type. Why? A kernel launch is asynchronous: the call returns to the host before the kernel completes, so there is no way to hand a return value back to the caller.

// Executed on the device:
__global__ void add(int *a, int *b, int *c) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N) c[tid] = a[tid] + b[tid];
}

int main(int argc, char *argv[]) {
    int T = 10, B = 1;            // threads per block and blocks per grid
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;
    …
    cudaMalloc((void**)&dev_a, N * sizeof(int));
    cudaMalloc((void**)&dev_b, N * sizeof(int));
    cudaMalloc((void**)&dev_c, N * sizeof(int));

    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    add<<<B,T>>>(dev_a, dev_b, dev_c);        // called from the host

    cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);
    …
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}

Note: __global__ launches are asynchronous; they return before the kernel completes.
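Because the launch returns immediately, the host must synchronize before it can rely on the kernel having finished. A sketch of the usual pattern (not from the notes; assumes the CUDA runtime API):

```cuda
add<<<B, T>>>(dev_a, dev_b, dev_c);        // returns immediately

// Block the host until the kernel has finished; also surfaces launch errors.
cudaError_t err = cudaDeviceSynchronize();
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));

// A cudaMemcpy on the default stream also waits for prior kernels,
// which is why the example above is correct without an explicit sync.
cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);
```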
Routines to be executed on
device

Generally you cannot call C library routines from device code!

However, CUDA provides math routines for the device that are equivalent to the standard C math routines with the same names, so in practice you can call math routines such as sin(x). Check the CUDA docs* before use.

CUDA also provides faster, less accurate GPU-only routines (their names begin with __).*

* See the NVIDIA CUDA C Programming Guide for more details
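For example (a sketch; the kernel name wave and its layout are illustrative): sinf() is the device equivalent of the C library routine, while __sinf() is the faster, less accurate intrinsic:

```cuda
__global__ void wave(float *out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        float x = 0.01f * tid;
        out[tid]     = sinf(x);      // device version of the C library sinf
        out[tid + n] = __sinf(x);    // faster, less accurate GPU intrinsic
    }
}
```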
__device__ routines

__global__ void gpu_sort(int *a, int *b, int N) {
    …
    swap(&list[m], &list[j]);
    …
}

__device__ void swap(int *x, int *y) {
    int temp;
    temp = *x;
    *x = *y;
    *y = temp;
}

int main(int argc, char *argv[]) {
    …
    gpu_sort<<<B, T>>>(dev_a, dev_b, N);
    …
    return 0;
}

Recursion is possible with __device__ routines, so far as I can tell.
Routines executable on both
host and device

The __device__ and __host__ qualifiers can be used together.

The routine is then callable and executable on both host and device, and will be compiled for both.

This feature might be used to create code that optionally uses a GPU, or for test purposes.

Generally such a routine will need statements that differentiate between host and device.

Note: the __global__ and __host__ qualifiers cannot be used together.
__CUDA_ARCH__ macro

Indicates the compute capability of the GPU code being compiled.
Can be used to create different paths through device code for different capabilities.

__CUDA_ARCH__ = 100 for compute capability 1.0
__CUDA_ARCH__ = 110 for compute capability 1.1
…
Example

__host__ __device__ void func() {
#ifdef __CUDA_ARCH__
    …        // Device code
#else
    …        // Host code
#endif
}

One could also test the value of __CUDA_ARCH__ to select specific compute capabilities.
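To target specific compute capabilities, test the macro's value rather than just its presence. A sketch:

```cuda
__host__ __device__ void func() {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 200
    // Device code compiled for compute capability 2.0 and above
#elif defined(__CUDA_ARCH__)
    // Device code compiled for earlier compute capabilities
#else
    // Host code (__CUDA_ARCH__ is undefined during host compilation)
#endif
}
```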
Declaring local variables
for host and for device
Local variables
on host

In C, the scope of a variable is the block it is declared in, which does not extend to routines called from that block.

If the scope is to include main and everything within it, including called routines, place the declaration outside main:

#include <stdio.h>
#include <stdlib.h>

int cpuA[10];
...
void clearArray() {
    for (int i = 0; i < 10; i++)
        cpuA[i] = 0;
}

void setArray(int n) {
    for (int i = 0; i < 10; i++)
        cpuA[i] = n;
}

int main(int argc, char *argv[]) {
    …
    clearArray();
    …
    setArray(N);
    …
    return 0;
}
Declaring local
kernel variables

Declare the variable outside main, but use the __device__ keyword (now used as a variable type qualifier rather than a function type qualifier).

Without further qualification, the variable is in global (GPU) memory, accessible by all threads.

#include <stdio.h>
#include <stdlib.h>

__device__ int gpuA[10];
...
__global__ void clearArray() {
    for (int i = 0; i < 10; i++)
        gpuA[i] = 0;
}

int main(int argc, char *argv[]) {
    …
    clearArray<<<1,1>>>();    // kernels need a launch configuration
    …
    setArray<<<1,1>>>(N);
    …
    return 0;
}
Accessing kernel variables from host

Device variables are accessible from the host using:
• cudaMemcpyToSymbol(),
• cudaMemcpyFromSymbol(), …
where the device variable is given as an argument:

int main(int argc, char *argv[]) {
    int cpuA[10];
    …
    cudaMemcpyFromSymbol(cpuA, gpuA,
        sizeof(cpuA), 0, cudaMemcpyDeviceToHost);
    …
    return 0;
}
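A round trip through both symbol-copy routines might look like the following sketch (the initializer values are illustrative):

```cuda
__device__ int gpuA[10];

__global__ void clearArray() {
    for (int i = 0; i < 10; i++)
        gpuA[i] = 0;
}

int main(void) {
    int cpuA[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};

    // Host -> device: initialize the device variable from cpuA.
    cudaMemcpyToSymbol(gpuA, cpuA, sizeof(cpuA));

    clearArray<<<1, 1>>>();

    // Device -> host: read the device variable back into cpuA.
    cudaMemcpyFromSymbol(cpuA, gpuA, sizeof(cpuA));
    // cpuA now holds all zeros.
    return 0;
}
```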
Example of both local host
and device variables

#include <stdio.h>
#include <cuda.h>
#include <stdlib.h>

int cpu_hist[10];               // globally accessible on cpu; histogram computed on cpu

__device__ int gpu_hist[10];    // globally accessible on gpu; histogram computed on gpu

void cpu_histogram(int *a, int N) {
    …
}

__global__ void gpu_histogram(int *a, int N) {
    …
}

int main(int argc, char *argv[]) {
    …
    gpu_histogram<<<B,T>>>(dev_a, N);
    cpu_histogram(a, N);
    …
    return 0;
}
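One possible body for the elided gpu_histogram (a sketch, not the notes' implementation): each thread classifies one element and bumps its bin with atomicAdd, since many threads may hit the same bin concurrently. The modulo-10 binning rule is an assumption made here for illustration.

```cuda
__device__ int gpu_hist[10];

__global__ void gpu_histogram(int *a, int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N)
        atomicAdd(&gpu_hist[a[tid] % 10], 1);   // atomic: bins are shared by all threads
}
```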
Questions