PPT - SEAS - University of Pennsylvania

NVIDIA Fermi Architecture
Patrick Cozzi
University of Pennsylvania
CIS 565 - Spring 2011

Administrivia
- Assignment 4 grades returned
- Project checkpoint on Monday
  - Post an update on your blog beforehand
- Poster session: 04/28
  - Three weeks from tomorrow

G80, GT200, and Fermi
- November 2006: G80
- June 2008: GT200
- March 2010: Fermi (GF100)
Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

New GPU Generation
- What are the technical goals for a new GPU generation?
  - Improve existing application performance. How?
  - Advance programmability. In what ways?

Fermi: What’s More?
- More total cores (SPs), though not more SMs
- More registers: 32K per SM
- More shared memory: up to 48K per SM
- More Special Function Units (SFUs)

Fermi: What’s Faster?
- Faster double precision: 8x over GT200
- Faster atomic operations: 5-20x. What for?
- Faster context switches: 10x
  - Between applications
  - Between graphics and compute, e.g., OpenGL and CUDA

Fermi: What’s New?
- L1 and L2 caches. For compute or graphics?
- Dual warp scheduling
- Concurrent kernel execution
- C++ support
- Full IEEE 754-2008 support in hardware
- Unified address space
- Error Correcting Code (ECC) memory support
- Fixed function tessellation for graphics

G80, GT200, and Fermi
Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

GT200 and Fermi
Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

Fermi Block Diagram
- GF100
  - 16 SMs
  - Each with 32 cores
  - 512 total cores
- Each SM hosts up to
  - 48 warps, or
  - 1,536 threads
- In flight, up to 24,576 threads
Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

Fermi SM
- Why 32 cores per SM instead of 8? Why not more SMs?
- G80: 8 cores
- GT200: 8 cores
- GF100: 32 cores

Fermi SM
- Dual warp scheduling. Why?
- 32K registers
- 32 cores
  - Floating point and integer unit per core
- 16 load/store units
- 4 SFUs
Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

Fermi SM
- 16 SMs * 32 cores/SM = 512 floating point operations per cycle
- Why not in practice?
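The thread and throughput figures on the block-diagram and SM slides can be reproduced with a few lines of arithmetic. The sketch below is only a sanity check, not part of the original deck. It assumes 32 threads per warp (the usual CUDA warp size, which the slides do not state), and the `ops_per_cycle_with_divergence` helper is a hypothetical toy model of one possible reason the 512 ops/cycle peak is not reached in practice.

```python
# Figures quoted on the slides for GF100.
NUM_SMS = 16
CORES_PER_SM = 32
MAX_WARPS_PER_SM = 48

# Assumption: 32 threads per warp (not stated on the slides).
WARP_SIZE = 32


def max_threads_per_sm():
    """Resident threads per SM: 48 warps of 32 threads each."""
    return MAX_WARPS_PER_SM * WARP_SIZE


def max_threads_in_flight():
    """Resident threads across the whole chip."""
    return NUM_SMS * max_threads_per_sm()


def peak_ops_per_cycle():
    """Peak rate if every core retires one operation every cycle."""
    return NUM_SMS * CORES_PER_SM


def ops_per_cycle_with_divergence(active_lanes):
    """Hypothetical model: only `active_lanes` of the 32 lanes in each
    warp do useful work, e.g. because of branch divergence."""
    if not 0 <= active_lanes <= WARP_SIZE:
        raise ValueError("active_lanes must be between 0 and WARP_SIZE")
    return peak_ops_per_cycle() * active_lanes // WARP_SIZE


if __name__ == "__main__":
    print("threads per SM:    ", max_threads_per_sm())
    print("threads in flight: ", max_threads_in_flight())
    print("peak ops per cycle:", peak_ops_per_cycle())
    print("half-divergent:    ", ops_per_cycle_with_divergence(16))
```

The first three functions match the slide figures (1,536 threads per SM, 24,576 in flight, 512 operations per cycle). The last one only illustrates that idle lanes scale the peak down proportionally. Memory stalls and instruction mix are not modeled here.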
Fermi SM
- Each SM has 64KB of on-chip memory
  - 48KB shared memory / 16KB L1 cache, or
  - 16KB shared memory / 48KB L1 cache
- Configurable by the CUDA developer
Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

Fermi Dual Warp Scheduling
Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf
Slide from: http://gpgpu.org/wp/wp-content/uploads/2009/11/SC09_CUDA_luebke_Intro.pdf

Fermi Caches
Slide from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

Fermi: Unified Address Space
Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

Fermi: Unified Address Space
- 64-bit virtual addresses
- 40-bit physical addresses (currently)
- CUDA 4: Shared address space with CPU. Why?
  - No explicit CPU/GPU copies
  - Direct GPU-GPU copies
  - Direct I/O device to GPU copies

Fermi ECC
- ECC protected
  - Register file, L1, L2, DRAM
- Uses redundancy to ensure data integrity against cosmic rays flipping bits
  - For example, 64 bits are stored as 72 bits
- Fixes single-bit errors, detects multiple-bit errors
- What are the applications?
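The "64 bits stored as 72 bits" figure is what an extended Hamming (SECDED) code would give: 7 check bits to locate a single flipped bit among 71 positions, plus 1 overall parity bit to tell single errors from double errors. The slides do not say which code Fermi's hardware actually uses, so the sketch below is only an assumed textbook scheme. All function names are made up for illustration.

```python
def hamming_check_bits(data_bits):
    """Smallest r with 2**r >= data_bits + r + 1 (single-error location)."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r


def secded_total_bits(data_bits):
    """Data bits + Hamming check bits + one overall parity bit."""
    return data_bits + hamming_check_bits(data_bits) + 1


def encode(data):
    """Return a list of bits: index 0 is the overall parity bit, indices
    1..n use Hamming layout (check bits at power-of-two positions)."""
    r = hamming_check_bits(len(data))
    n = len(data) + r
    code = [0] * (n + 1)
    bits = iter(data)
    for pos in range(1, n + 1):
        if pos & (pos - 1):          # not a power of two: data position
            code[pos] = next(bits)
    for i in range(r):
        p = 1 << i                   # check bit covers positions with bit i set
        code[p] = sum(code[pos] for pos in range(1, n + 1) if pos & p) % 2
    code[0] = sum(code[1:]) % 2      # make total parity even
    return code


def decode(code):
    """Return (status, word) where status is 'ok', 'corrected',
    or 'double_error'. Handles at most two flipped bits."""
    syndrome = 0
    for pos in range(1, len(code)):
        if code[pos]:
            syndrome ^= pos          # XOR of set positions is 0 when valid
    odd_parity = sum(code) % 2 == 1
    if syndrome == 0 and not odd_parity:
        return "ok", list(code)
    if odd_parity:                   # one flip; syndrome 0 means bit 0 itself
        fixed = list(code)
        fixed[syndrome] ^= 1
        return "corrected", fixed
    return "double_error", list(code)


if __name__ == "__main__":
    word = encode([i % 2 for i in range(64)])
    print(len(word), "bits stored for 64 data bits")
    damaged = list(word)
    damaged[10] ^= 1                 # simulate one flipped bit
    print(decode(damaged)[0])
```

Flipping a second bit in `damaged` makes `decode` report `double_error` instead of attempting a repair, which is the "fix single, detect multiple" behaviour the slide describes.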
Fermi Tessellation
Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

Fermi Tessellation
- Fixed function hardware on each SM for graphics
  - Texture filtering
  - Texture cache
  - Tessellation
  - Vertex Fetch / Attribute Setup
  - Stream Output
  - Viewport Transform
- Why?
Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

Observations
- Becoming easier to port CPU code to the GPU
  - Recursion, fast atomics, L1/L2 caches, faster global memory
- In fact… GPUs are starting to look like CPUs
  - Beefier SMs, L1 and L2 caches, dual warp scheduling, double precision, fast atomics
