NVIDIA Fermi Architecture
Patrick Cozzi
University of Pennsylvania
CIS 565 - Spring 2011
Administrivia
 Assignment 4 grades returned
 Project checkpoint on Monday
  Post an update on your blog beforehand
 Poster session: 04/28
  Three weeks from tomorrow
G80, GT200, and Fermi
 November 2006: G80
 June 2008: GT200
 March 2010: Fermi (GF100)
Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf
New GPU Generation
 What are the technical goals for a new GPU generation?
  Improve existing application performance. How?
  Advance programmability. In what ways?
Fermi: What’s More?
 More total cores (SPs) – not SMs though
 More registers: 32K per SM
 More shared memory: up to 48K per SM
 More Special Function Units (SFUs)
Fermi: What’s Faster?
 Faster double precision – 8x over GT200
 Faster atomic operations – 5-20x. What for? (sketch below)
 Faster context switches
  Between applications – 10x
  Between graphics and compute, e.g., OpenGL and CUDA
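
A classic use of fast atomics is histogramming, where many threads contend to increment the same bin. A minimal CUDA sketch (the kernel name, bin count, and dummy input are illustrative, not from the slides):

#include <cstdio>
#include <cuda_runtime.h>

#define NUM_BINS 256

// Each thread atomically increments the bin for its input byte.
__global__ void histogram(const unsigned char* data, int n, unsigned int* bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);
}

int main()
{
    const int n = 1 << 20;
    unsigned char* d_data;
    unsigned int* d_bins;

    cudaMalloc(&d_data, n);
    cudaMalloc(&d_bins, NUM_BINS * sizeof(unsigned int));
    cudaMemset(d_data, 7, n);  // dummy input: every byte is 7
    cudaMemset(d_bins, 0, NUM_BINS * sizeof(unsigned int));

    histogram<<<(n + 255) / 256, 256>>>(d_data, n, d_bins);

    unsigned int bins[NUM_BINS];
    cudaMemcpy(bins, d_bins, sizeof(bins), cudaMemcpyDeviceToHost);
    printf("bin[7] = %u\n", bins[7]);  // expect n

    cudaFree(d_data);
    cudaFree(d_bins);
    return 0;
}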
Fermi: What’s New?
 L1 and L2 caches. For compute or graphics?
 Dual warp scheduling
 Concurrent kernel execution (sketch below)
 C++ support
 Full IEEE 754-2008 support in hardware
 Unified address space
 Error Correcting Code (ECC) memory support
 Fixed function tessellation for graphics
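
Concurrent kernel execution is exposed through CUDA streams: on Fermi, independent kernels launched into different streams may run on the GPU at the same time. A minimal sketch (the kernel names are placeholders):

#include <cuda_runtime.h>

__global__ void kernelA() { /* independent work */ }
__global__ void kernelB() { /* independent work */ }

int main()
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // On pre-Fermi hardware these launches would serialize; on Fermi
    // they may execute concurrently because they are in different
    // streams with no dependency between them.
    kernelA<<<16, 256, 0, s1>>>();
    kernelB<<<16, 256, 0, s2>>>();

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return 0;
}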
G80, GT200, and Fermi
Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf
GT200 and Fermi
Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf
Fermi Block Diagram
 GF100: 16 SMs, each with 32 cores
  512 total cores
 Each SM hosts up to 48 warps, or 1,536 threads
 In flight, up to 24,576 threads
Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
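
These per-chip limits can be queried at runtime with the CUDA runtime's cudaGetDeviceProperties; a minimal sketch (the values in the comments assume a full GF100):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("SMs:                     %d\n", prop.multiProcessorCount);         // 16
    printf("Max threads per SM:      %d\n", prop.maxThreadsPerMultiProcessor); // 1536
    printf("Registers per block:     %d\n", prop.regsPerBlock);                // 32K
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);    // 48K
    return 0;
}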
Fermi SM
 Why 32 cores per SM instead of 8? Why not more SMs?
  G80 – 8 cores per SM
  GT200 – 8 cores per SM
  GF100 – 32 cores per SM
Fermi SM
 Dual warp scheduling. Why?
 32K registers
 32 cores
  Floating point and integer unit per core
 16 load/store units
 4 SFUs
Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
Fermi SM
 16 SMs * 32 cores/SM = 512 floating point operations per cycle
 Why not in practice?
Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
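
For a rough sense of the peak this implies, counting a fused multiply-add as two FLOPs and assuming a ~1.4 GHz shader clock (the GTX 480 shipped at 1401 MHz; an illustrative figure, not from the slides): 512 cores × 2 FLOPs/cycle × 1.401 GHz ≈ 1.43 TFLOPS. In practice, memory bandwidth, branch divergence, and occupancy limits keep real kernels well below this peak.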
Fermi SM
 Each SM has 64KB of on-chip memory
  48KB shared memory / 16KB L1 cache, or
  16KB shared memory / 48KB L1 cache
 Configurable by the CUDA developer
Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
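
The split is selected per kernel through the CUDA runtime; a minimal sketch (the kernel name is a placeholder):

#include <cuda_runtime.h>

__global__ void myKernel() { /* shared-memory-heavy work */ }

int main()
{
    // Prefer 48KB shared memory / 16KB L1 for this kernel...
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);

    // ...or uncomment to prefer 16KB shared memory / 48KB L1 instead:
    // cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);

    myKernel<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}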
Fermi Dual Warp Scheduling
Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf
Slide from: http://gpgpu.org/wp/wp-content/uploads/2009/11/SC09_CUDA_luebke_Intro.pdf
Fermi Caches
Slide from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf
Fermi: Unified Address Space
Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
Fermi: Unified Address Space
 64-bit virtual addresses
 40-bit physical addresses (currently)
 CUDA 4: Shared address space with CPU. Why?
  No explicit CPU/GPU copies
  Direct GPU-GPU copies
  Direct I/O device to GPU copies
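
With a unified virtual address space and CUDA 4, a pinned host allocation can be handed straight to a kernel with no explicit copy. A minimal sketch, assuming a UVA-capable device (the kernel and sizes are illustrative):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* data, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= s;  // reads and writes host memory over PCIe
}

int main()
{
    const int n = 1024;
    float* h = 0;

    // Pinned, mapped allocation; under UVA the same pointer is
    // valid on both the CPU and the GPU.
    cudaHostAlloc(&h, n * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    scale<<<(n + 255) / 256, 256>>>(h, 2.0f, n);
    cudaDeviceSynchronize();

    printf("h[0] = %f\n", h[0]);  // 2.0, with no explicit cudaMemcpy
    cudaFreeHost(h);
    return 0;
}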
Fermi ECC
 ECC protected: register file, L1, L2, DRAM
 Uses redundancy to ensure data integrity against cosmic rays flipping bits
  For example, 64 bits are stored as 72 bits
 Fixes single-bit errors, detects multi-bit errors
 What are the applications?
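
To see how the 72/64 redundancy works at a toy scale, here is a hedged host-side sketch of SECDED (single-error-correct, double-error-detect) coding using Hamming(7,4) plus an overall parity bit; Fermi applies the same idea to 64-bit words (this is illustrative, not NVIDIA's hardware implementation):

#include <cstdio>

// Encode 4 data bits as an 8-bit SECDED codeword: Hamming(7,4)
// parity bits at codeword positions 1, 2, 4, plus an overall
// parity bit in bit 7.
unsigned encode(unsigned d)
{
    unsigned d1 = (d >> 0) & 1, d2 = (d >> 1) & 1;
    unsigned d3 = (d >> 2) & 1, d4 = (d >> 3) & 1;
    unsigned p1 = d1 ^ d2 ^ d4;  // covers positions 1,3,5,7
    unsigned p2 = d1 ^ d3 ^ d4;  // covers positions 2,3,6,7
    unsigned p4 = d2 ^ d3 ^ d4;  // covers positions 4,5,6,7
    unsigned w  = p1 | (p2 << 1) | (d1 << 2) | (p4 << 3)
                | (d2 << 4) | (d3 << 5) | (d4 << 6);
    return w | (__builtin_parity(w) << 7);
}

// Decode: correct any single-bit error, detect double-bit errors.
unsigned decode(unsigned w)
{
    unsigned s1 = ((w >> 0) ^ (w >> 2) ^ (w >> 4) ^ (w >> 6)) & 1;
    unsigned s2 = ((w >> 1) ^ (w >> 2) ^ (w >> 5) ^ (w >> 6)) & 1;
    unsigned s4 = ((w >> 3) ^ (w >> 4) ^ (w >> 5) ^ (w >> 6)) & 1;
    unsigned syndrome = s1 | (s2 << 1) | (s4 << 2);  // 1-based error position
    unsigned parityBad = __builtin_parity(w & 0xFF);

    if (syndrome && parityBad)
        w ^= 1u << (syndrome - 1);  // single-bit error: fix it
    else if (syndrome && !parityBad)
        printf("double-bit error detected (uncorrectable)\n");

    return ((w >> 2) & 1) | (((w >> 4) & 1) << 1)
         | (((w >> 5) & 1) << 2) | (((w >> 6) & 1) << 3);
}

int main()
{
    unsigned word = encode(0xB);
    word ^= 1u << 5;  // simulate a cosmic-ray bit flip
    printf("recovered: 0x%X\n", decode(word));  // prints 0xB
    return 0;
}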

Fermi Tessellation
Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf
Fermi Tessellation
 Fixed function hardware on each SM for graphics
  Texture filtering
  Texture cache
  Tessellation
  Vertex Fetch / Attribute Setup
  Stream Output
  Viewport Transform. Why?
Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf
Observations
 Becoming easier to port CPU code to the GPU
  Recursion, fast atomics, L1/L2 caches, faster global memory
 In fact…
 GPUs are starting to look like CPUs
  Beefier SMs, L1 and L2 caches, dual warp scheduling, double precision, fast atomics
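
As one concrete instance: Fermi (sm_20) is the first NVIDIA GPU on which device functions may recurse, since each thread now gets a proper call stack. A minimal sketch (compile with nvcc -arch=sm_20; the function names are illustrative):

#include <cstdio>
#include <cuda_runtime.h>

// Recursive device function: legal on Fermi and later.
__device__ int factorial(int n)
{
    return (n <= 1) ? 1 : n * factorial(n - 1);
}

__global__ void kernel(int* out)
{
    *out = factorial(10);
}

int main()
{
    int *d_out, h_out = 0;
    cudaMalloc(&d_out, sizeof(int));
    kernel<<<1, 1>>>(d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("10! = %d\n", h_out);  // 3628800
    cudaFree(d_out);
    return 0;
}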