Parallel Processing on GPUs with the Fermi Architecture
Arnaldo Tavares
Tesla Sales Manager for Latin America

Product Availability Update
C1060: 200 units in inventory; 8-week lead time for big orders; build to order
M1060: 500 units; 8-week lead time; build to order
S1070-400: 50 units; 10-week lead time; build to order
S1070-500: 25 units + 75 being built; 10-week lead time; build to order
M2050: shipping now, building 20K for Q2; 8-week lead time; sold out through mid-July
S2050: shipping now, building 200 for Q2; 8-week lead time; sold out through mid-July
C2050: 2000 units; 8-week lead time; will maintain inventory
M2070: Sept 2010; get PO in now to get priority
C2070: Sept-Oct 2010; get PO in now to get priority
M2070-Q: Oct 2010
Quadro or Tesla?
Computer Aided Design
• e.g. CATIA, SolidWorks, Siemens NX
3D Modeling / Animation
• e.g. 3ds, Maya, Softimage
Video Editing / FX
• e.g. Adobe CS5, Avid
Numerical Analytics
• e.g. MATLAB, Mathematica
Computational Biology
• e.g. AMBER, NAMD, VMD
Computer Aided Engineering
• e.g. ANSYS, SIMULIA/ABAQUS
GPU Computing
CPU + GPU Co-Processing
CPU (4 cores): 48 GigaFlops (DP)
GPU: 515 GigaFlops (DP)
(Average efficiency in Linpack: 50%)
50x – 150x Speedups Across Application Domains
Medical Imaging (U of Utah): 146X
Molecular Dynamics (U of Illinois, Urbana): 36X
Video Transcoding (Elemental Tech): 18X
Matlab Computing (AccelerEyes): 50X
Astrophysics (RIKEN): 100X
Financial Simulation (Oxford): 149X
Linear Algebra (Universidad Jaime): 47X
3D Ultrasound (Techniscan): 20X
Quantum Chemistry (U of Illinois, Urbana): 130X
Gene Sequencing (U of Maryland): 30X
Increasing Number of Professional CUDA Apps

Tools
Available now: CUDA C/C++, PGI Accelerators, Platform LSF Cluster Manager, TauCUDA Perf Tools, Parallel Nsight Visual Studio IDE, PGI CUDA Fortran, CAPS HMPP, Bright Cluster Manager, Allinea DDT Debugger, ParaTools VampirTrace
Future: PGI CUDA x86, TotalView Debugger

Libraries
Available now: AccelerEyes Jacket (MATLAB), Wolfram Mathematica, CUDA FFT, CUDA BLAS, EMPhotonics CULAPACK, Thrust C++ Template Library, NVIDIA NPP Performance Primitives, MAGMA (LAPACK), NVIDIA RNG & SPARSE, NVIDIA video libraries
Future: MATLAB

Oil & Gas
Available now: Headwave Suite, OpenGeoSolutions OpenSEIS, GeoStar Seismic Suite, Acceleware RTM Solver, StoneRidge RTM, ffA SVI Pro, VSG Open Inventor, Seismic City RTM, Tsunami RTM
Future: Paradigm RTM, Panorama Tech, Paradigm SKUA

Bio-Chemistry
Available now: AMBER, NAMD, HOOMD, TeraChem, BigDFT, ABINIT, GROMACS, LAMMPS, VMD, GAMESS, CP2K
Future: Acellera ACEMD, DL-POLY

Bio-Informatics
Available now: CUDA-BLASTP, MUMmerGPU, CUDA-MEME, PIPER Docking, CUDA SW++ (Smith-Waterman), GPU-HMMER, CUDA-EC, HEX Protein Docking
Announced: OpenEye ROCS

CAE
Available now: ACUSIM AcuSolve 1.8, Autodesk Moldflow, Prometch Particleworks, Remcom XFdtd 7.0
Announced: ANSYS Mechanical, LSTC LS-DYNA 971, FluiDyna OpenFOAM, Metacomp CFD++, MSC.Software Marc 2010.2
Increasing Number of Professional CUDA Apps (continued)

Video
Available now: Adobe Premiere Pro CS5, ARRI various apps, GenArts Sapphire, TDVision TDVCodec, Black Magic Da Vinci, MainConcept CUDA Encoder, Elemental Video, Fraunhofer JPEG2000, Cinnafilm Pixel Strings, Assimilate SCRATCH

Rendering
Available now: Bunkspeed Shot (iray), Refractive SW Octane, Random Control Arion, ILM Plume, Autodesk 3ds Max, Cebas finalRender, mental images iray (OEM), NVIDIA OptiX (SDK), Caustic Graphics, Weta Digital PantaRay, Lightworks Artisan, Chaos Group V-Ray GPU

Finance
Available now: NAG RNG, Numerix Risk, SciComp SciFinance, RMS Risk Mgt Solutions, Aquimin AlphaVision, Hanweck Options Analytics, Murex MACS

EDA
Available now: Agilent EMPro 2010, CST Microwave, Agilent ADS SPICE, Acceleware FDTD Solver, Synopsys TCAD, SPEAG SEMCAD X, Gauda OPC, Acceleware EM Solution

Other
Available now: Siemens 4D Ultrasound, Digisens Medical, Schrodinger Core Hopping, Useful Progress Med, MotionDSP Ikena Video, Manifold GIS, Dalsa Machine Vision, Digital Anarchy Photo

Announced: The Foundry Kronos, Works Zebra Zeany, Rocketick Verilog Sim, MVTec Machine Vision
3 of Top 5 Supercomputers
[Chart: Linpack performance and power (Megawatts) for Tianhe-1A, Jaguar, Nebulae, Tsubame, Hopper II, and Tera 100]
What if Every Supercomputer Had Fermi?
[Chart: Linpack Teraflops across the Top 500 Supercomputers (Nov 2009)]
Top 50: 450 GPUs, 110 TeraFlops, $2.2M
Top 100: 225 GPUs, 55 TeraFlops, $1.1M
Top 150: 150 GPUs, 37 TeraFlops, $740K
Hybrid ExaScale Trajectory
2008: 1 TFLOP, 7.5 KWatts
2010: 1.27 PFLOPS, 2.55 MWatts
2017*: 2 EFLOPS, 10 MWatts
* This is a projection based on Moore's law and does not represent a committed roadmap
Tesla Roadmap
The March of the GPUs
[Chart: Peak double precision floating point (GFlops/sec), 2007-2012: NVIDIA GPUs (T10, T20, T20A) vs. x86 CPUs (Nehalem 3 GHz, Westmere 3 GHz, 8-core Sandy Bridge 3 GHz)]
[Chart: Peak memory bandwidth (GBytes/sec), 2007-2012: NVIDIA GPUs with ECC off (T10, T20, T20A) vs. the same x86 CPUs]
Project Denver
Expected Tesla Roadmap with Project Denver
Workstation / Data Center Solutions
Workstations: up to 4x Tesla C2050/70 GPUs
OEM CPU Server + Tesla S2050/70: 4 Tesla GPUs in 2U
Integrated CPU-GPU Server: 2x Tesla M2050/70 GPUs in 1U
Tesla C-Series Workstation GPUs (Tesla C2050 and C2070)
Processor: Tesla 20-series GPU
Number of cores: 448
Caches: 64 KB L1 cache + shared memory per 32 cores; 768 KB L2 cache
Floating point peak performance: 1030 Gigaflops (single), 515 Gigaflops (double)
GPU memory: 3 GB (2.625 GB with ECC on) on the C2050; 6 GB (5.25 GB with ECC on) on the C2070
Memory bandwidth: 144 GB/s (GDDR5)
System I/O: PCIe x16 Gen2
Power: 238 W (max) for both
Available: both shipping now
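A note derived from the memory figures above (not stated on the slide): turning ECC on reserves one eighth of the DRAM for the error-correcting codes, so 3 GB x (1 - 1/8) = 2.625 GB and 6 GB x (1 - 1/8) = 5.25 GB remain visible to applications.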
How is the GPU Used?
Basic component: the "Streaming Multiprocessor" (SM)
SIMD: "Single Instruction, Multiple Data": the same instruction is issued to all cores, but each core operates on different data
"SIMD at the SM, MIMD at the GPU chip"
Source: Presentation from Felipe A. Cruz, Nagasaki University
The Use of GPUs and Bottleneck Analysis
Source: Presentation from Takayuki Aoki, Tokyo Institute of Technology
The Fermi Architecture
3 billion transistors
16 Streaming Multiprocessors (SMs)
6 x 64-bit memory partitions = 384-bit memory interface
Host Interface: connects the GPU to the CPU via PCI-Express
GigaThread global scheduler: distributes thread blocks to the SM thread schedulers
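As a quick way to see these figures on a real card, the CUDA runtime can report them at run time. A minimal sketch (not from the original slides) using the standard cudaGetDeviceProperties call:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                 // properties of device 0
    printf("GPU: %s\n", prop.name);
    printf("Streaming Multiprocessors: %d\n", prop.multiProcessorCount);
    printf("Global memory: %zu MB\n", prop.totalGlobalMem >> 20);
    printf("ECC enabled: %d\n", prop.ECCEnabled);
    return 0;
}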
SM Architecture
32 CUDA cores per SM (512 total)
16 load/store units: source and destination addresses calculated for 16 threads per clock
4 special function units (sine, cosine, square root, etc.)
64 KB of RAM for shared memory and L1 cache (configurable)
Dual warp scheduler
[Diagram: instruction cache, two schedulers with two dispatch units, register file, 32 cores, 16 load/store units, 4 SFUs, interconnect network, 64 KB configurable cache/shared memory, uniform cache]
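The 64 KB split between shared memory and L1 is chosen per kernel through the runtime. A minimal sketch (the kernel name is hypothetical), assuming the cudaFuncSetCacheConfig call from the Fermi-era toolkits:

#include <cuda_runtime.h>

__global__ void stencil_kernel(float *data)
{
    // ... hypothetical kernel body that stages data in __shared__ arrays ...
}

int main(void)
{
    // Hint: give this kernel the 48 KB shared / 16 KB L1 split
    // (the alternative is cudaFuncCachePreferL1 for 16 KB shared / 48 KB L1).
    cudaFuncSetCacheConfig(stencil_kernel, cudaFuncCachePreferShared);
    // ... allocate data and launch stencil_kernel as usual ...
    return 0;
}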
Dual Warp Scheduler
1 warp = 32 parallel threads
2 warps are issued and executed concurrently
Each warp goes to 16 CUDA cores
Most instructions can be dual-issued (exception: double-precision instructions)
The dual-issue model allows near-peak hardware performance
CUDA Core Architecture
New IEEE 754-2008 floating-point standard, surpassing even the most advanced CPUs
Newly designed integer ALU optimized for 64-bit and extended-precision operations
Fused multiply-add (FMA) instruction for both 32-bit single and 64-bit double precision
[Diagram: each CUDA core contains a dispatch port, operand collector, FP unit, INT unit, and result queue, inside the SM with its instruction cache, dual scheduler/dispatch, register file, 16 load/store units, 4 SFUs, interconnect network, 64 KB configurable cache/shared memory, and uniform cache]
Fused Multiply-Add Instruction (FMA)
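As a brief illustration (not from the original slide): FMA computes a*b + c with a single rounding step, unlike a separate multiply followed by an add, which rounds twice. In CUDA C the single-precision form is exposed as fmaf(), which Fermi maps to the hardware FMA instruction; a minimal sketch with a hypothetical kernel:

__global__ void fma_demo(const float *a, const float *b, const float *c, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = fmaf(a[i], b[i], c[i]);   // a*b + c, rounded once
}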
GigaThread™ Hardware Thread Scheduler (HTS)
Hierarchically manages thousands of simultaneously active threads
10x faster application context switching (each program receives a time slice of processing resources)
Concurrent kernel execution
GigaThread Hardware Thread Scheduler
Concurrent kernel execution + faster context switch
[Diagram: timeline comparing serial kernel execution, where Kernel 1 through Kernel 5 run one after another, with parallel kernel execution, where independent kernels share the GPU concurrently and total time shrinks]
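Concurrent kernel execution is exposed through CUDA streams: kernels launched into different non-default streams may run at the same time when neither fills the whole GPU. A minimal sketch (kernel names and sizes are hypothetical):

#include <cuda_runtime.h>

__global__ void scale(float *x, int n)  { int i = blockIdx.x*blockDim.x + threadIdx.x; if (i < n) x[i] *= 2.0f; }
__global__ void offset(float *y, int n) { int i = blockIdx.x*blockDim.x + threadIdx.x; if (i < n) y[i] += 1.0f; }

int main(void)
{
    const int n = 1 << 18;
    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));            // contents are irrelevant for this sketch
    cudaMemset(d_y, 0, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    int blocks = (n + 255) / 256;
    scale <<<blocks, 256, 0, s1>>>(d_x, n);           // stream 1
    offset<<<blocks, 256, 0, s2>>>(d_y, n);           // stream 2: independent, may overlap on Fermi

    cudaDeviceSynchronize();                          // wait for both streams

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}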
GigaThread Streaming Data Transfer (SDT) Engine
Dual DMA engines
Simultaneous CPU-to-GPU and GPU-to-CPU data transfer
Fully overlapped with CPU and GPU processing time
[Activity snapshot: for Kernel 0 through Kernel 3, the CPU keeps working while SDT engine 0 streams data in and SDT engine 1 streams data out of the GPU]
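The dual DMA engines are driven from the runtime with asynchronous copies on page-locked (pinned) host memory, so an upload, a download, and a kernel can be in flight at once. A minimal sketch (buffer names and the process_chunk kernel are hypothetical):

#include <cuda_runtime.h>

__global__ void process_chunk(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;
}

int main(void)
{
    const int n = 1 << 20;
    float *h_in, *h_out, *d_buf0, *d_buf1;
    cudaMallocHost(&h_in,  n * sizeof(float));        // pinned host memory: required for async copies
    cudaMallocHost(&h_out, n * sizeof(float));
    cudaMalloc(&d_buf0, n * sizeof(float));
    cudaMalloc(&d_buf1, n * sizeof(float));           // assumed to hold results of a previous iteration
    cudaMemset(d_buf1, 0, n * sizeof(float));

    cudaStream_t up, down;
    cudaStreamCreate(&up);
    cudaStreamCreate(&down);

    // Upload and process the next chunk in one stream while downloading earlier
    // results in the other; on Fermi the two copies use the two DMA engines.
    cudaMemcpyAsync(d_buf0, h_in,  n * sizeof(float), cudaMemcpyHostToDevice, up);
    process_chunk<<<(n + 255) / 256, 256, 0, up>>>(d_buf0, n);
    cudaMemcpyAsync(h_out, d_buf1, n * sizeof(float), cudaMemcpyDeviceToHost, down);

    cudaDeviceSynchronize();
    cudaStreamDestroy(up);
    cudaStreamDestroy(down);
    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    cudaFree(d_buf0);
    cudaFree(d_buf1);
    return 0;
}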
Cached Memory Hierarchy
First GPU architecture to support a true cache hierarchy in combination with on-chip shared memory
Shared memory / L1 cache per SM (64 KB): improves bandwidth and reduces latency
Unified L2 cache (768 KB): fast, coherent data sharing across all cores in the GPU
Global memory (up to 6 GB)
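To make the roles concrete, here is a minimal sketch (not from the original slides) of a per-block sum: each thread stages one element in shared memory, the block synchronizes with __syncthreads(), and one partial result per block is written to global memory. It assumes 256-thread blocks:

__global__ void block_sum(const float *in, float *block_sums, int n)
{
    __shared__ float tile[256];                    // on-chip shared memory, one tile per block
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + threadIdx.x;

    tile[tid] = (i < n) ? in[i] : 0.0f;            // stage one element per thread
    __syncthreads();                               // block-wide barrier before reading neighbours

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];       // pairwise reduction inside shared memory
        __syncthreads();
    }

    if (tid == 0)
        block_sums[blockIdx.x] = tile[0];          // one partial sum per block goes to global memory
}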
CUDA: Compute Unified Device Architecture
NVIDIA's parallel computing architecture: a software development platform aimed at the GPU architecture, offering both device-level APIs and language integration
Applications using DirectX (HLSL) or OpenCL (OpenCL C) go through the DirectX 11 Compute or OpenCL driver; applications can also call the CUDA Driver API directly
Applications written in C, C++, Fortran, Java, Python, ... use C for CUDA and the C Runtime for CUDA
All paths target the CUDA Driver (CUDA support in the kernel-level driver) and its PTX instruction set (ISA), which drive the CUDA parallel compute engines inside the GPU
Thread Hierarchy
Kernels (simple C programs) are executed by threads
Threads are grouped into Blocks
Threads in a Block can synchronize execution
Blocks are grouped into a Grid
Blocks are independent (they must be able to execute in any order)
Source: Presentation from Felipe A. Cruz, Nagasaki University
Memory and Hardware Hierarchy
Threads access Registers; CUDA cores execute Threads
Threads within a Block can share data/results via Shared Memory; Streaming Multiprocessors (SMs) execute Blocks
Grids use Global Memory for result sharing (after kernel-wide global synchronization); the GPU executes Grids
Source: Presentation from Felipe A. Cruz, Nagasaki University
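Continuing the block_sum sketch from the memory-hierarchy section (both kernels are hypothetical): the partial sums live in global memory, and the grid-wide synchronization point is simply the end of the first kernel, so a second launch can safely combine them:

__global__ void final_sum(const float *block_sums, float *result, int nblocks)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {     // a single thread, kept serial for clarity
        float s = 0.0f;
        for (int b = 0; b < nblocks; ++b)
            s += block_sums[b];
        *result = s;
    }
}

On the host, block_sum<<<nblocks, 256>>>(...) followed by final_sum<<<1, 1>>>(...) in the same stream is enough: the second launch only starts once every block of the first has written its partial sum to global memory.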
Full View of the Hierarchy Model
CUDA level -> hardware level -> memory access:
Thread -> CUDA Core -> Registers
Block -> SM -> Shared Memory
Grid -> GPU -> Global Memory
Device -> Node -> Host Memory
IDs and Dimensions
Threads: 3D IDs, unique within a block
Blocks: 2D IDs, unique within a grid
Dimensions are set at launch time and can be unique for each grid
Built-in variables: threadIdx, blockIdx, blockDim, gridDim
[Diagram: a Device runs Grid 1, an arrangement of Blocks (0,0) through (2,1); Block (1,1) expands into Threads (0,0) through (4,2)]
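A minimal sketch (not from the original slides) of how the built-in variables combine into a unique position for each thread, here for a 2D problem; the kernel and variable names are hypothetical:

__global__ void fill2d(float *out, int width, int height)
{
    // Unique 2D coordinates for this thread, built from the block and thread IDs.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = (float)(x + y);       // illustrative write
}

// Launch configuration chosen at launch time: a 2D grid of 16x16-thread blocks.
// dim3 block(16, 16);
// dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
// fill2d<<<grid, block>>>(d_out, width, height);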
Compiling C for CUDA Applications

void serial_function(… ) {
  ...
}
void other_function(int ... ) {
  ...
}
void saxpy_serial(float ... ) {
  for (int i = 0; i < n; ++i)
    y[i] = a*x[i] + y[i];
}
void main( ) {
  float x;
  saxpy_serial(..);
  ...
}

The key kernels are modified into parallel CUDA code; the rest of the C application stays as it is. NVCC (Open64) compiles the CUDA kernels into CUDA object files, the CPU compiler compiles the rest into CPU object files, and the linker combines both into a single CPU-GPU executable.
C for CUDA: C with a few keywords

Standard C code:

void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);

Parallel C for CUDA code:

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}

// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
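The slide omits the host-side setup the parallel version needs; a minimal sketch (not from the original slide) using the standard runtime calls, assuming x and y are host arrays of n floats already allocated and filled:

size_t bytes = n * sizeof(float);
float *d_x, *d_y;                                    // device copies of x and y
cudaMalloc(&d_x, bytes);
cudaMalloc(&d_y, bytes);
cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice);

int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0f, d_x, d_y);

cudaMemcpy(y, d_y, bytes, cudaMemcpyDeviceToHost);   // copy the result back
cudaFree(d_x);
cudaFree(d_y);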
Software Programming (8 slides)
Source: Presentation from Andreas Klöckner, NYU
CUDA C/C++ Leadership

CUDA Toolkit 1.0 (July 07): C compiler, C extensions, Win XP 64, single precision, BLAS, FFT, SDK with 40 examples
CUDA Toolkit 1.1 (Nov 07): atomics support, multi-GPU support
CUDA Toolkit 2.0 (Aug 08): double precision, compiler optimizations, Vista 32/64, Mac OSX, 3D textures, HW interpolation
CUDA Toolkit 2.3 (July 09): DP FFT, 16-32 conversion intrinsics, performance enhancements
CUDA Toolkit 3.0 (Mar 10): C++ inheritance, Fermi architecture support, tools updates, driver / runtime interop
Milestones along the way: CUDA Visual Profiler 2.2 (April 08), cuda-gdb hardware debugger (2009), Parallel Nsight Beta (Nov 09)
Why should I choose Tesla over consumer cards?

Features
4x higher double precision (on 20-series): higher performance for scientific CUDA applications.
ECC, only on Tesla & Quadro (on 20-series): data reliability inside the GPU and on DRAM memories.
Bi-directional PCI-E communication (Tesla has dual DMA engines, GeForce has only one): higher performance for CUDA applications by overlapping communication and computation.
Larger memory for larger data sets (3 GB and 6 GB products): higher performance on a wide range of applications (medical, oil & gas, manufacturing, FEA, CAE).
Cluster management software tools available on Tesla only: needed for GPU monitoring and job scheduling in data center deployments.
TCC (Tesla Compute Cluster) driver for Windows, supported only on Tesla: higher performance for CUDA applications due to lower kernel launch overhead; TCC adds support for RDP and Services.
Integrated OEM workstations and servers: trusted, reliable systems built for Tesla products.
Professional ISVs will certify CUDA applications only on Tesla: bug reproduction, support, and feature requests for Tesla only.

Quality & Warranty
2 to 4 days of stress testing and memory burn-in, plus added margin in memory and core clocks: built for 24/7 computing in data center and workstation environments.
Manufactured and guaranteed by NVIDIA: no changes in key components like GPU and memory without notice; always the same clocks for known, reliable performance.
3-year warranty from HP: reliable, long-life products.

Support & Lifecycle
Enterprise support, higher priority for CUDA bugs and requests: ability to influence the CUDA and GPU roadmap and get early access to requested features.
18-24 months availability + 6-month EOL notice: reliable product supply.