“GPU With CUDA Architecture”
Presented by: Dhaval Kaneria (13014061010)
Guided by: Mr. Rajesh K Navandar

Table Of Contents
• Introduction of GPU
• Performance Factors Of GPU
• GPU Pipeline
• Block Diagram Of Pipeline Process Flow
• Introduction Of CUDA
• Thread Batching
• Simple Processing Flow
• CUDA C/C++
• Applications
• The Future Scope Of CUDA Technology
• Conclusion
• References

Introduction of GPU
• A Graphics Processing Unit (GPU) is a microprocessor designed specifically for processing 3D graphics.
• The processor is built with integrated transform, lighting, triangle setup/clipping, and rendering engines, capable of handling millions of math-intensive operations per second.
• GPUs form the heart of modern graphics cards, relieving the CPU (central processing unit) of much of the graphics processing load. GPUs allow products such as desktop PCs, portable computers, and game consoles to process real-time 3D graphics that, only a few years ago, were available only on high-end workstations.
• Used primarily for 3D applications, a GPU is a single-chip processor that creates lighting effects and transforms objects every time a 3D scene is redrawn. These are mathematically intensive tasks that would otherwise put considerable strain on the CPU. Lifting this burden from the CPU frees up cycles that can be used for other jobs.

Performance Factors Of GPU
• Fill Rate: the number of pixels or texels (texture elements) that the GPU renders to memory per second. It is a common measure of raw rendering throughput. Modern GPUs have fill rates as high as 3.2 billion pixels per second. The fill rate of a GPU can be increased by raising its clock frequency.
• Memory Bandwidth: the data transfer speed between the graphics chip and its local frame buffer. More bandwidth usually gives better performance when the image to be rendered is of high quality and at very high resolution.
• Memory Management: GPU performance also depends on how efficiently memory is managed, because memory bandwidth can become the main bottleneck if it is not managed properly.
• Hidden Surface Removal: reducing overdraw when rendering a scene by not rendering surfaces that are not visible. This improves GPU performance considerably.

GPU Pipeline
• The GPU receives geometry information from the CPU as input and produces a picture as output.
• The host interface is the communication bridge between the CPU and the GPU.
• It receives commands from the CPU and also pulls geometry information from system memory.
• It outputs a stream of vertices in object space with all their associated information (normals, texture coordinates, per-vertex color, etc.).
• The vertex processing stage receives vertices from the host interface in object space and outputs them in screen space.
• This may be a simple linear transformation or a complex operation involving morphing effects.
[Figure: pipeline stages: host interface → vertex processing → triangle setup → pixel processing → memory interface]

GPU Pipeline (cont.)
• A fragment is generated if and only if its center is inside the triangle.
• Every fragment generated has its attributes computed as the perspective-correct interpolation of the three vertices that make up the triangle.
• Each fragment produced by triangle setup is fed into fragment processing as a set of attributes (position, normal, texture coordinates, etc.), which are used to compute the final color for the pixel.
• Before the final write occurs, some fragments are rejected by the z-buffer, stencil, and alpha tests.

Block Diagram Of Pipeline Process Flow
[Figure: block diagram of the pipeline process flow]

Block Diagram Of Pipeline Process Flow (cont.)
• Allows a shader to be applied to each vertex: transformation and other per-vertex operations
• Allows the vertex shader to fetch texture data
• Cull/clip: per-primitive operations and data preparation for rasterization
• Rasterization: primitive-to-pixel mapping
• Z culling: quick pixel elimination based on depth
• Fragment: a candidate pixel
• Varying number of pixel pipelines
• SIMD processing hides texture fetch latency

Introduction Of CUDA
• CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA and implemented on its graphics processing units.

CUDA Programming Model: A Highly Multithreaded Coprocessor
• The GPU is viewed as a compute device that:
  – Is a coprocessor to the CPU, or host
  – Has its own DRAM (device memory)
  – Runs many threads in parallel
• Data-parallel portions of an application are executed on the device as kernels, which run in parallel on many threads
• Differences between GPU and CPU threads:
  – GPU threads are extremely lightweight, with very little creation overhead
  – A GPU needs thousands of threads for full efficiency; a multi-core CPU needs only a few

Thread Batching: Grids and Blocks
• A kernel is executed as a grid of thread blocks
  – All threads share the data memory space
• A thread block is a batch of threads that can cooperate with each other by:
  – Synchronizing their execution, for hazard-free shared memory accesses
  – Efficiently sharing data through a low-latency shared memory
• Two threads from two different blocks cannot cooperate
[Figure: the host launches Kernel 1 on Grid 1 (Blocks (0, 0) to (2, 1)) and Kernel 2 on Grid 2 on the device; Block (1, 1) is expanded into a 5 × 3 array of threads, Thread (0, 0) to Thread (4, 2). Courtesy: NVIDIA]

Block and Thread IDs
• Threads and blocks have IDs
  – So each thread can decide what data to work on
  – Block ID: 1D or 2D (devices of compute capability 2.0 and later also allow 3D)
  – Thread ID: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data
  – Image processing
  – Solving PDEs on volumes
[Figure: the same grid, block, and thread layout as on the Thread Batching slide. Courtesy: NVIDIA]

CUDA Device Memory Space Overview
• Each thread can:
  – R/W per-thread registers
  – R/W per-thread local memory
  – R/W per-block shared memory
  – R/W per-grid global memory
  – Read only per-grid constant memory
  – Read only per-grid texture memory
• The host can R/W global, constant, and texture memories
[Figure: a device grid with Block (0, 0) and Block (1, 0); each block has its own shared memory, and each of its threads, Thread (0, 0) and Thread (1, 0), has its own registers and local memory; global, constant, and texture memories are shared by the grid and accessible from the host]

Global, Constant, and Texture Memories
• Global memory
  – Main means of communicating R/W data between host and device
  – Contents visible to all threads
• Texture and constant memories
  – Constants initialized by host
  – Contents visible to all threads
[Figure: the same device memory layout as on the previous slide. Courtesy: NVIDIA]

Simple Processing Flow
1. Copy input data from CPU memory to GPU memory
2. The CPU instructs the GPU to start processing
3. Load the GPU program and execute it, caching data on chip for performance
4. Copy results from GPU memory to CPU memory

CUDA C/C++
• CUDA language: C with minimal extensions
• Philosophy: provide the minimal set of extensions necessary to expose the power of the GPU
• Declaration specifiers to indicate where things live
    __global__ void KernelFunc(...);  // kernel function, runs on device
    __device__ int GlobalVar;         // variable in device memory
    __shared__ int SharedVar;         // variable in per-block shared memory
• Extended function invocation syntax for parallel kernel launch
    KernelFunc<<<500, 128>>>(...);    // launch 500 blocks w/ 128 threads each
• Special variables for thread identification in kernels
    dim3 threadIdx; dim3 blockIdx; dim3 blockDim; dim3 gridDim;
• Intrinsics that expose specific operations in kernel code
    __syncthreads();                  // barrier synchronization within kernel

Applications
• 3D image analysis
• Adaptive radiation therapy
• Astronomy
• Automobile vision
• Bioinformatics
• Biological simulation
• Broadcast
• Computational Fluid Dynamics
• Computer Vision
• Cryptography
• CT reconstruction
• Data Mining
• Electromagnetic simulation
• Equity training
• Financial - lots of areas
• Mathematics research
• Military (lots)
• Mine planning
• Molecular dynamics
• MRI reconstruction
• Network processing
• Neural network
• Protein folding
• Quantum chemistry
• Ray tracing
• Radar
• Reservoir simulation
• Robotic vision/AI
• Robotic surgery
• Satellite data analysis
• Seismic imaging
• Surgery simulation

Simulation Result
• If the CUDA software is installed and configured correctly, the output of deviceQuery should look similar to the screenshot on this slide.
[Figure: deviceQuery sample output]
• Valid results from the bandwidthTest CUDA sample
[Figure: bandwidthTest sample output]
• Create an array of size BLOCKS, allocate space for the array on the device, and call generateArray<<<BLOCKS,1>>>( deviceArray );
• The kernel now runs as BLOCKS parallel thread blocks (one thread each), creating the entire array in one call.
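
The generateArray launch described on the last slide can be sketched as follows. This is a minimal sketch, not the presenter's actual code: the value of BLOCKS, the kernel body (each block writes its own index into the array), and the host helper fillOnDevice are assumptions made for illustration. Only the launch configuration generateArray<<<BLOCKS,1>>>( deviceArray ) comes from the slide.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define BLOCKS 16  // assumed array size; the slides do not give a value

// The launch is <<<BLOCKS, 1>>>, so each block holds exactly one thread and
// blockIdx.x alone identifies the element that thread fills.
// Writing the block index as the value is an assumption for illustration.
__global__ void generateArray(int *deviceArray)
{
    deviceArray[blockIdx.x] = blockIdx.x;
}

// Hypothetical host helper: fills hostArray[0 .. BLOCKS-1] on the GPU.
// Returns cudaSuccess, or the first CUDA error encountered.
cudaError_t fillOnDevice(int *hostArray)
{
    int *deviceArray = NULL;

    // Allocate space for the array in device (global) memory.
    cudaError_t err = cudaMalloc((void **)&deviceArray, BLOCKS * sizeof(int));
    if (err != cudaSuccess)
        return err;

    // Launch one kernel as a grid of BLOCKS blocks with 1 thread per block.
    generateArray<<<BLOCKS, 1>>>(deviceArray);
    err = cudaGetLastError();  // reports launch-configuration errors

    // Copy results from GPU memory to CPU memory. This blocking copy on the
    // default stream also waits for the kernel to finish.
    if (err == cudaSuccess)
        err = cudaMemcpy(hostArray, deviceArray, BLOCKS * sizeof(int),
                         cudaMemcpyDeviceToHost);

    cudaFree(deviceArray);
    return err;
}

int main(void)
{
    int hostArray[BLOCKS];

    cudaError_t err = fillOnDevice(hostArray);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return EXIT_FAILURE;
    }

    for (int i = 0; i < BLOCKS; ++i)
        printf("%d ", hostArray[i]);
    printf("\n");
    return EXIT_SUCCESS;
}
```

With one thread per block, the sketch mirrors the slide but leaves most of each multiprocessor idle. A launch with many threads per block, indexing with blockIdx.x * blockDim.x + threadIdx.x, would usually make better use of the hardware.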

The Future Scope Of CUDA Technology
• Currently, most research focuses on general-purpose GPU computing. Because GPUs offer highly efficient and flexible parallel programmability, a growing number of researchers and businesses have started using GPUs for non-graphics computation, creating a new field of study: GPGPU (General-Purpose computation on GPU). Its objective is to apply GPUs to a broader range of scientific computing. GPGPU has been used successfully in algebra, fluid simulation, database applications, spectrum analysis, and other non-graphics applications.
• Region-based Software Virtual Memory (RSVM): a software virtual memory running on both the CPU and the GPU in a distributed and cooperative way.
• Size reduction
• Cooling technique

Conclusion
• CUDA is a powerful parallel programming model
  – Heterogeneous: mixed serial-parallel programming
  – Scalable: hierarchical thread execution model
  – Accessible: minimal but expressive changes to C
• CUDA on GPUs can achieve great results on data-parallel computations with a few simple performance optimization strategies:
• Structure your application and select execution configurations to maximize exploitation of the GPU’s parallel capabilities.
• Minimize CPU ↔ GPU data transfers.
• Coalesce global memory accesses.
• Take advantage of shared memory.
• Minimize divergent warps.
• Minimize use of low-throughput instructions.

References
1. Xiao Yang, Shamik K. Valia, Michael J. Schulte, Ruby B. Lee, “Exploration and Evaluation of PLX Floating-point Instructions and Implementations for 3D Graphics”, IEEE, 2004.
2. Lei Wang, Yong-zhong Huang, Xin Chen, Chun-yan Zhang, “Task Scheduling of Parallel Processing in CPU-GPU Collaborative Environment”, CSIT, 2008.
3. Feng Ji, Heshan Lin, Xiaosong Ma, “RSVM: a Region-based Software Virtual Memory for GPU”, IEEE, 2013.
4. Nathan Whitehead, Alex Fit-Florea, “CUDA_Architecture_Overview”, NVIDIA Corporation.
5. Cyril Zeller, “CUDA C/C++ Basics”, NVIDIA Corporation.
6. Mark Harris, “Optimizing Parallel Reduction in CUDA”, NVIDIA Developer Technology.

Thank You