Zero-Copy Host Memory
These notes introduce “zero-copy” memory, which requires page-locked host memory.
These materials come from Chapter 11 of “CUDA by Example” by Jason
Sanders and Edward Kandrot.
ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 4, 2013
1
Zero-copy memory
Zero-copy refers to the GPU accessing host memory without the data being
explicitly copied from host memory to GPU memory, i.e. zero copying.
Depending upon the hardware structure, though, the data may still get copied:
• Integrated GPUs that are part of the system chipset and share system memory
do not copy (example: a MacBook Pro).
• Discrete GPU cards with their own device memory do copy.
2
CUDA routines for zero-copy memory
Zero-copy uses page-locked memory. Allocate it with:
cudaHostAlloc (void ** ptr, size_t size, unsigned int flags)
which allocates page-locked memory that is accessible to the device.
Set flags to:
cudaHostAllocMapped - Map the allocation into the CUDA address space.
Reference: NVIDIA CUDA library,
http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/online/
3
Flags continued
cudaHostAllocWriteCombined - Allocates the memory as “write-combined”, which
can be transferred more quickly across the PCIe bus on some system
configurations, but cannot be read efficiently by most CPUs.
Use it for memory written by the CPU and read by the device via mapped
pinned memory.
Combining flags (with bitwise OR, not logical ||):
cudaHostAllocMapped | cudaHostAllocWriteCombined
Reference: NVIDIA CUDA library,
http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/online/
4
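As a minimal sketch of the combined allocation (the buffer name and size here
are illustrative assumptions, not from the slides):

int *buf;                            // hypothetical host pointer
size_t bytes = 1024 * sizeof(int);   // hypothetical size
// The flags are bit masks, so combine them with bitwise OR ( | );
// logical OR ( || ) would evaluate to the integer 1, which is a
// different flag entirely (cudaHostAllocPortable).
cudaHostAlloc( (void**)&buf, bytes,
               cudaHostAllocMapped | cudaHostAllocWriteCombined );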
Device pointer to allocated memory
The device pointer to the memory is obtained by calling:
cudaHostGetDevicePointer( void ** pDevice, void * pHost, unsigned int flags)
which “passes back the device pointer corresponding to the mapped, pinned host
buffer allocated by cudaHostAlloc() or …”. This is needed to account for the
different host and device address spaces.
Parameters:
pDevice - returned device pointer for the mapped memory
pHost - requested host pointer mapping
flags - flags for extensions (must be 0 for now)
Reference: NVIDIA CUDA library
http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/online/
http://www.clear.rice.edu/comp422/resources/cuda/html/group__CUDART__MEMORY_ga475419a9b21a66036029d5001ea908c.html
5
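As a sketch, the same call with basic error checking added (the error handling
is an assumption; the slides omit it):

int *a;        // host pointer previously allocated with cudaHostAlloc
int *dev_a;    // device pointer to the same memory
cudaError_t err = cudaHostGetDevicePointer( (void**)&dev_a, a, 0 );
if (err != cudaSuccess)
    printf("cudaHostGetDevicePointer failed: %s\n", cudaGetErrorString(err));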
Code to allocate memory and get pointer for device
int *a;          // host pointer
int *dev_a;      // device pointer to host memory
size = … ;       // number of bytes to allocate
Allocate pinned memory on the host (write-combined if desired):
cudaHostAlloc( (void**)&a, size, cudaHostAllocMapped | cudaHostAllocWriteCombined );
Get the device pointer to it:
cudaHostGetDevicePointer(&dev_a, a, 0);
There is now no need to copy memory from host to device.
6
Using pointer to host memory
Simply use the returned pointer in the kernel call where one would otherwise
have used a device memory pointer:
MyKernel<<< B,T >>> (dev_a, … );
without needing to modify the kernel code at all!
7
Example: vector addition without host-device transfers

#define N 32                            // size of vectors

__global__ void add(int *a, int *b, int *c) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N) c[tid] = a[tid] + b[tid];
}

int main(int argc, char *argv[]) {
    int T = 32, B = 1;                  // threads per block and blocks per grid
    int *a, *b, *c;                     // host pointers
    int *dev_a, *dev_b, *dev_c;         // device pointers to host memory
    int size = N * sizeof(int);         // number of bytes to allocate
    cudaEvent_t start, stop;            // to measure time
    float elapsed_time_ms;

    // Note the flags:
    cudaHostAlloc( (void**)&a, size, cudaHostAllocMapped | cudaHostAllocWriteCombined );
    cudaHostAlloc( (void**)&b, size, cudaHostAllocMapped | cudaHostAllocWriteCombined );
    cudaHostAlloc( (void**)&c, size, cudaHostAllocMapped );
    …                                   // load arrays with some numbers

    // Memory copies to the device are not needed now, but pointers are:
    cudaHostGetDevicePointer(&dev_a, a, 0);
    cudaHostGetDevicePointer(&dev_b, b, 0);
    cudaHostGetDevicePointer(&dev_c, c, 0);
    …                                   // start time
    add<<<B,T>>>(dev_a, dev_b, dev_c);
    cudaThreadSynchronize();            // copy back not needed, but thread synchronization is
    …                                   // end time
    …                                   // print results
    printf("Time to calculate results: %f ms.\n", elapsed_time_ms); // print out execution time

    cudaFreeHost(a);                    // clean up
    cudaFreeHost(b);
    cudaFreeHost(c);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}

The book seems to miss out this special free routine (cudaFreeHost) when using
cudaHostAlloc.
8
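For reference, here is a complete runnable version with the elided
initialization, timing, and result-printing steps filled in. The test data,
the event-timing placement, and the cudaSetDeviceFlags call are assumptions
added for completeness, not the original slide's code:

#include <stdio.h>
#define N 32

__global__ void add(int *a, int *b, int *c) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N) c[tid] = a[tid] + b[tid];
}

int main() {
    int T = 32, B = 1;
    int size = N * sizeof(int);
    int *a, *b, *c;                // host pointers
    int *dev_a, *dev_b, *dev_c;    // device pointers to host memory
    cudaEvent_t start, stop;
    float elapsed_time_ms;

    cudaSetDeviceFlags(cudaDeviceMapHost);   // enable mapping before allocating

    cudaHostAlloc((void**)&a, size, cudaHostAllocMapped | cudaHostAllocWriteCombined);
    cudaHostAlloc((void**)&b, size, cudaHostAllocMapped | cudaHostAllocWriteCombined);
    cudaHostAlloc((void**)&c, size, cudaHostAllocMapped);

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = i; }   // assumed test data

    cudaHostGetDevicePointer(&dev_a, a, 0);
    cudaHostGetDevicePointer(&dev_b, b, 0);
    cudaHostGetDevicePointer(&dev_c, c, 0);

    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);

    add<<<B, T>>>(dev_a, dev_b, dev_c);
    cudaThreadSynchronize();       // kernel writes land in host memory; no copy back

    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsed_time_ms, start, stop);
    printf("Time to calculate results: %f ms.\n", elapsed_time_ms);

    for (int i = 0; i < N; i++) printf("%d ", c[i]);      // print results
    printf("\n");

    cudaFreeHost(a); cudaFreeHost(b); cudaFreeHost(c);
    cudaEventDestroy(start); cudaEventDestroy(stop);
    return 0;
}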
Host memory pointed to from device
[Figure: the host (CPU) allocates mapped memory with cudaHostAlloc( (void**)&a, … )
and obtains a device pointer with cudaHostGetDevicePointer(&dev_a, a, 0); the
device (GPU) kernel __global__ void add(int *a, … ), launched as
MyKernel<<< B,T>>> (dev_a, … ), accesses that host memory directly through dev_a.]
9
Code to determine whether GPU has the capability of features being used
Look at the device properties:
cudaDeviceProp prop;
int myDevice;
cudaGetDevice(&myDevice);                  // returns device executing the thread
cudaGetDeviceProperties(&prop, myDevice);  // returns a structure, see next slide
if (prop.property_name != 1) printf("Feature not available\n");
There are various property names; see the next slide.
10
Properties
struct cudaDeviceProp {
    char name[256];
    size_t totalGlobalMem;
    size_t sharedMemPerBlock;
    int regsPerBlock;
    int warpSize;
    size_t memPitch;
    int maxThreadsPerBlock;
    int maxThreadsDim[3];
    int maxGridSize[3];
    size_t totalConstMem;
    int major;
    int minor;
    int clockRate;
    size_t textureAlignment;
    int deviceOverlap;
    int multiProcessorCount;
    int kernelExecTimeoutEnabled;
    int integrated;
    int canMapHostMemory;
    int computeMode;
    int concurrentKernels;
    …
};
11
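As a sketch, a small host program that prints a few of these properties (the
choice of fields and the output formatting are assumptions):

#include <stdio.h>

int main() {
    cudaDeviceProp prop;
    int myDevice;
    cudaGetDevice(&myDevice);                  // device executing this thread
    cudaGetDeviceProperties(&prop, myDevice);  // fill in the structure
    printf("Device %d: %s, compute capability %d.%d\n",
           myDevice, prop.name, prop.major, prop.minor);
    printf("canMapHostMemory = %d, integrated = %d\n",
           prop.canMapHostMemory, prop.integrated);
    return 0;
}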
Checking can map page-locked host memory into device address space
…
cudaDeviceProp prop;
int myDevice;
cudaGetDevice(&myDevice);
cudaGetDeviceProperties(&prop, myDevice);
if (prop.canMapHostMemory != 1) {
    printf("Feature not available\n");
    return 0;
}
…
The feature is very likely available, as it only needs compute capability > 1.0.
12
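One step the check above does not show: on the CUDA versions these notes
target, the host thread must also put the runtime into host-mapping mode
before any mapped allocation is made. A minimal sketch (its placement right
after the capability check is an assumption):

cudaSetDeviceFlags(cudaDeviceMapHost);  // must be called before any
                                        // cudaHostAlloc with cudaHostAllocMapped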
Integrated GPU systems
Zero-copy memory is particularly interesting on integrated GPU systems, where
system memory is shared between the CPU and GPU. According to the course
textbook, using zero-copy memory on such systems always results in increased
performance.
Example: my 13” MacBook Pro, 2010.
[Figure: 2.4 GHz Intel Core 2 Duo CPU and NVIDIA GeForce 320M GPU; 256 MB of
the 4 GB DDR3 SDRAM main memory is shared between the CPU and GPU. The 15/17”
models use an Intel Graphics Media Accelerator (GMA) on a shared bus.]
13
Using multiple GPUs on one system
Each GPU needs to be controlled by a separate host thread:
[Figure: the code spawns Thread 1 and Thread 2; Thread 1 controls GPU 1 and
Thread 2 controls GPU 2.]
So one needs to write a multi-threaded program using thread APIs/tools
such as Pthreads, WinThreads, OpenMP, … .
14
Textbook utility routines for multi-threading
Found in ../common/book.h. Provides Win32 threads for Windows or Pthreads
for Linux.
thread = start_thread(funct, ptr) is used to start a new thread; it takes as
arguments a function void* funct (void*) and a pointer void* ptr, and returns
a thread identifier of type CUTThread.
To terminate a thread (join it to the main thread): end_thread(thread).

…
#if _WIN32
//Windows threads.
#include <windows.h>

typedef HANDLE CUTThread;
typedef unsigned (WINAPI *CUT_THREADROUTINE)(void *);
#define CUT_THREADPROC unsigned WINAPI
#define CUT_THREADEND return 0
#else
//POSIX threads.
#include <pthread.h>

typedef pthread_t CUTThread;
typedef void *(*CUT_THREADROUTINE)(void *);
#define CUT_THREADPROC void
#define CUT_THREADEND
#endif

//Create thread.
CUTThread start_thread( CUT_THREADROUTINE, void *data );
//Wait for thread to finish.
void end_thread( CUTThread thread );
//Destroy thread.
void destroy_thread( CUTThread thread );
//Wait for multiple threads.
void wait_for_threads( const CUTThread *threads, int num );

#if _WIN32
//Create thread
CUTThread start_thread(CUT_THREADROUTINE func, void *data){
    return CreateThread(NULL, 0, (LPTHREAD_START_ROUTINE)func, data, 0, NULL);
}
//Wait for thread to finish
void end_thread(CUTThread thread){
    WaitForSingleObject(thread, INFINITE);
    CloseHandle(thread);
}
//Destroy thread
void destroy_thread( CUTThread thread ){
    TerminateThread(thread, 0);
    CloseHandle(thread);
}
//Wait for multiple threads
void wait_for_threads(const CUTThread * threads, int num){
    WaitForMultipleObjects(num, threads, true, INFINITE);
    for(int i = 0; i < num; i++)
        CloseHandle(threads[i]);
}
#else
//Create thread
CUTThread start_thread(CUT_THREADROUTINE func, void * data){
    pthread_t thread;
    pthread_create(&thread, NULL, func, data);
    return thread;
}
//Wait for thread to finish
void end_thread(CUTThread thread){
    pthread_join(thread, NULL);
}
//Destroy thread
void destroy_thread( CUTThread thread ){
    pthread_cancel(thread);
}
//Wait for multiple threads
void wait_for_threads(const CUTThread * threads, int num){
    for(int i = 0; i < num; i++)
        end_thread( threads[i] );
}
#endif
…
15
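As a usage sketch, two GPUs controlled with these routines, following the
textbook's pattern; the worker name and the DataStruct contents here are
illustrative assumptions:

struct DataStruct {
    int deviceID;
    // ... per-GPU inputs and results ...
};

void* worker( void *pvoidData ) {
    DataStruct *data = (DataStruct*)pvoidData;
    cudaSetDevice( data->deviceID );   // bind this host thread to one GPU
    // ... allocate memory, launch kernels, collect results on this GPU ...
    return 0;
}

// In main():
DataStruct data[2];
data[0].deviceID = 0;
data[1].deviceID = 1;
CUTThread thread = start_thread( (CUT_THREADROUTINE)worker, &(data[0]) ); // GPU 0
worker( &(data[1]) );                             // GPU 1 on the main thread
end_thread( thread );                             // wait for the helper thread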
Pinned memory on multiple GPUs
Pinned memory is only treated as pinned by the thread that allocated it.
Other threads see it as pageable and access it more slowly. Those threads
also cannot use cudaMemcpyAsync, which requires pinned memory.
“Portable” pinned memory
Memory allowed to move between host threads, with any thread seeing it as
pinned memory. Use cudaHostAlloc and include the cudaHostAllocPortable flag.
16
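A minimal sketch of a portable pinned allocation (the variable name and size
are assumptions):

int *a;
size_t bytes = 1024 * sizeof(int);   // hypothetical size
cudaHostAlloc( (void**)&a, bytes,
               cudaHostAllocPortable | cudaHostAllocMapped );
// Any host thread now sees this allocation as pinned, so any thread may
// use it with cudaMemcpyAsync or map it into the device address space.
cudaFreeHost(a);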
Questions
More information: see Chapter 11 of “CUDA by Example” by Jason Sanders and
Edward Kandrot, Addison-Wesley, 2011.