Parallel Processing on GPUs with the Fermi Architecture
Arnaldo Tavares, Tesla Sales Manager for Latin America
Oct 2010

Product Availability Update
Product | Inventory | Lead time for big orders | Notes
C1060 | 200 units | 8 weeks | Build to order
M1060 | 500 units | 8 weeks | Build to order
S1070-400 | 50 units | 10 weeks | Build to order
S1070-500 | 25 units + 75 being built | 10 weeks | Build to order
M2050 | Shipping now; building 20K for Q2 | 8 weeks | Sold out through mid-July
S2050 | Shipping now; building 200 for Q2 | 8 weeks | Sold out through mid-July
C2050 | 2000 units | 8 weeks | Will maintain inventory
M2070 | Sept 2010 | - | Get PO in now to get priority
C2070 | Sept-Oct 2010 | - | Get PO in now to get priority
M2070-Q | Oct 2010 | - | Get PO in now to get priority

Quadro or Tesla?
• Computer Aided Design: e.g. CATIA, SolidWorks, Siemens NX
• 3D Modeling / Animation: e.g. 3ds Max, Maya, Softimage
• Video Editing / FX: e.g. Adobe CS5, Avid
• Numerical Analytics: e.g. MATLAB, Mathematica
• Computational Biology: e.g. AMBER, NAMD, VMD
• Computer Aided Engineering: e.g. ANSYS, SIMULIA/ABAQUS

GPU Computing: CPU + GPU Co-Processing
• CPU (4 cores): 48 GigaFlops double precision
• GPU: 515 GigaFlops double precision
• Average efficiency in Linpack: 50%

Application Speedups: 50x - 150x
• 146X: Medical Imaging, U of Utah
• 36X: Molecular Dynamics, U of Illinois, Urbana
• 18X: Video Transcoding, Elemental Tech
• 50X: Matlab Computing, AccelerEyes
• 100X: Astrophysics, RIKEN
• 149X: Financial Simulation, Oxford
• 47X: Linear Algebra, Universidad Jaime
• 20X: 3D Ultrasound, Techniscan
• 130X: Quantum Chemistry, U of Illinois, Urbana
• 30X: Gene Sequencing, U of Maryland

Increasing Number of Professional CUDA Apps (available now, announced, and future)
• Tools: CUDA C/C++, PGI CUDA Fortran, PGI Accelerators, PGI CUDA x86, CAPS HMPP, Platform LSF Cluster Manager, Bright Cluster Manager, TauCUDA Perf Tools, ParaTools VampirTrace, Parallel Nsight Visual Studio IDE, Allinea DDT Debugger, TotalView Debugger
• CUDA Libraries: CUDA FFT, CUDA BLAS, NVIDIA NPP Perf Primitives, NVIDIA RNG & SPARSE libraries, video libraries, Thrust C++ Template Lib, EMPhotonics CULAPACK, MAGMA (LAPACK), AccelerEyes Jacket (MATLAB), MATLAB, Wolfram Mathematica
• Oil & Gas: Headwave Suite, OpenGeoSolutions OpenSEIS, GeoStar Seismic Suite, Acceleware RTM Solver, StoneRidge RTM, ffA SVI Pro, VSG Open Inventor, Seismic City RTM, Tsunami RTM, Paradigm RTM, Paradigm SKUA, Panorama Tech
• Bio-Chemistry: AMBER, NAMD, GROMACS, LAMMPS, HOOMD, TeraChem, BigDFT, ABINIT, VMD, GAMESS, CP2K, Acellera ACEMD, DL-POLY
• Bio-Informatics: CUDA-BLASTP, CUDA-EC, CUDA-MEME, CUDA SW++ (Smith-Waterman), GPU-HMMER, MUMmerGPU, PIPER Docking, HEX Protein Docking, OpenEye ROCS
• CAE: ACUSIM AcuSolve 1.8, ANSYS Mechanical, LSTC LS-DYNA 971, Autodesk Moldflow, MSC.Software Marc 2010.2, Metacomp CFD++, FluiDyna OpenFOAM, Prometch Particleworks, Remcom XFdtd 7.0

Increasing Number of Professional CUDA Apps, continued (available now, announced, and future)
• Video: Adobe Premiere Pro CS5, ARRI (various apps), GenArts Sapphire, TDVision TDVCodec, Black Magic Da Vinci, MainConcept CUDA Encoder, Elemental Video, Fraunhofer JPEG2000, Cinnafilm Pixel Strings, Assimilate SCRATCH, Digital Vision, The Foundry Kronos
• Rendering: Bunkspeed Shot (iray), Refractive SW Octane, Random Control Arion, ILM Plume, Autodesk 3ds Max, Cebas finalRender, mental images iray (OEM), NVIDIA OptiX (SDK), Caustic Graphics, Weta Digital PantaRay, Lightworks Artisan, Chaos Group V-Ray GPU, Works Zebra Zeany
• Finance: NAG RNG, Numerix Risk, SciComp SciFinance, RMS Risk Mgt Solutions, Aquimin AlphaVision, Hanweck Options Analytics, Murex MACS
• EDA: Agilent EMPro 2010, Agilent ADS SPICE, CST Microwave Studio, Acceleware FDTD Solver, Acceleware EM Solution, Synopsys TCAD, SPEAG SEMCAD X, Gauda OPC, Rocketick Verilog Sim
• Other: Siemens 4D Ultrasound, Digisens Medical, Schrodinger Core Hopping, Useful Progress Med, MotionDSP Ikena Video, Manifold GIS, Dalsa Machine Vision, MVTec Machine Vision, Anarchy Photo
3 of the Top 5 Supercomputers
[Chart: Linpack performance and power consumption (Megawatts) for Tianhe-1A, Jaguar, Nebulae, Tsubame, Hopper II, and Tera 100]

What if Every Supercomputer Had Fermi?
[Chart: Linpack Teraflops of the Top 500 supercomputers (Nov 2009)]
• Top 50 level: 450 GPUs, 110 TeraFlops, $2.2 M
• Top 100 level: 225 GPUs, 55 TeraFlops, $1.1 M
• Top 150 level: 150 GPUs, 37 TeraFlops, $740 K

Hybrid ExaScale Trajectory
• 2008: 1 TFLOP, 7.5 KWatts
• 2010: 1.27 PFLOPS, 2.55 MWatts
• 2017*: 2 EFLOPS, 10 MWatts
* This is a projection based on Moore's law and does not represent a committed roadmap

Tesla Roadmap
[Roadmap figure]

The March of the GPUs
[Charts: peak double precision floating point (GFlops/sec) and peak memory bandwidth (GBytes/sec), 2007 through 2012, comparing NVIDIA GPUs (T10, T20, T20A; ECC off for bandwidth) with x86 CPUs (3 GHz Nehalem, Westmere, 8-core Sandy Bridge)]

Project Denver
[Figure]

Expected Tesla Roadmap with Project Denver
[Figure]

Workstation / Data Center Solutions
• Workstations: up to 4x Tesla C2050/70 GPUs
• OEM CPU server + Tesla S2050/70: 4 Tesla GPUs in 2U
• Integrated CPU-GPU server: 2x Tesla M2050/70 GPUs in 1U

Tesla C-Series Workstation GPUs
Specification | Tesla C2050 | Tesla C2070
Processor | Tesla 20-series GPU | Tesla 20-series GPU
Number of cores | 448 | 448
Caches | 64 KB L1 cache + shared memory per 32 cores; 768 KB L2 cache | 64 KB L1 cache + shared memory per 32 cores; 768 KB L2 cache
Floating point peak performance | 1030 Gigaflops (single), 515 Gigaflops (double) | 1030 Gigaflops (single), 515 Gigaflops (double)
GPU memory | 3 GB (2.625 GB with ECC on) | 6 GB (5.25 GB with ECC on)
Memory bandwidth | 144 GB/s (GDDR5) | 144 GB/s (GDDR5)
System I/O | PCIe x16 Gen2 | PCIe x16 Gen2
Power | 238 W (max) | 238 W (max)
Availability | Shipping now | Shipping now

How is the GPU Used?
• Basic component: the "Streaming Multiprocessor" (SM)
• SIMD: "Single Instruction, Multiple Data": the same instruction for all cores, but operating on different data
• "SIMD at the SM, MIMD at the GPU chip"
Source: presentation from Felipe A. Cruz, Nagasaki University

The Use of GPUs and Bottleneck Analysis
Source: presentation from Takayuki Aoki, Tokyo Institute of Technology

The Fermi Architecture
• 3 billion transistors
• 16 Streaming Multiprocessors (SMs)
• 6 x 64-bit memory partitions = 384-bit memory interface
• Host interface: connects the GPU to the CPU via PCI-Express
• GigaThread global scheduler: distributes thread blocks to the SM thread schedulers
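Most of the per-board figures above (SM count, clock, memory size) can be read back at runtime through the CUDA runtime API. The following is a minimal device-query sketch, not taken from the presentation; the peak-GFLOPS arithmetic at the end assumes the Fermi ratios quoted in this deck (32 CUDA cores per SM, two single precision flops per core per clock via FMA, double precision at half the single precision rate on Tesla 20-series parts), so treat it as an illustration rather than a general formula.

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
            fprintf(stderr, "No CUDA device found\n");
            return 1;
        }

        /* Raw properties reported by the runtime. */
        printf("Device            : %s\n", prop.name);
        printf("Compute capability: %d.%d\n", prop.major, prop.minor);
        printf("SM count          : %d\n", prop.multiProcessorCount);
        printf("Core clock        : %.2f GHz\n", prop.clockRate / 1.0e6);
        printf("Global memory     : %.2f GB\n",
               prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));

        /* Illustration only, using the Fermi-era assumptions above.
           Example: 14 SMs x 32 cores x 1.15 GHz x 2 flops = ~1030 GFlops SP,
           and half of that (~515 GFlops) in double precision. */
        double cores    = 32.0 * prop.multiProcessorCount;
        double clockGHz = prop.clockRate / 1.0e6;   /* clockRate is in kHz */
        printf("Peak SP (approx.) : %.0f GFlops\n", cores * clockGHz * 2.0);
        printf("Peak DP (approx.) : %.0f GFlops\n", cores * clockGHz * 1.0);
        return 0;
    }

Built with nvcc (for example, nvcc devicequery.cu -o devicequery), this prints the per-board numbers behind the Tesla C2050/C2070 table above.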
SM Architecture
• 32 CUDA cores per SM (512 total)
• 16 load/store units: source and destination addresses calculated for 16 threads per clock
• 4 special function units (sin, cosine, square root, etc.)
• 64 KB of RAM for shared memory and L1 cache (configurable)
[SM block diagram: instruction cache, dual scheduler/dispatch, register file, 32 cores, 16 load/store units, 4 special function units, interconnect network, 64 KB configurable cache/shared memory, uniform cache]

Dual Warp Scheduler
• 1 warp = 32 parallel threads
• 2 warps issued and executed concurrently
• Each warp goes to 16 CUDA cores
• Most instructions can be dual issued (exception: double precision instructions)
• The dual-issue model allows near-peak hardware performance

CUDA Core Architecture
• New IEEE 754-2008 floating-point standard, surpassing even the most advanced CPUs
• Fused multiply-add (FMA) instruction for both 32-bit single and 64-bit double precision
• Newly designed integer ALU optimized for 64-bit and extended precision operations
[CUDA core block diagram: dispatch port, operand collector, FP unit, INT unit, result queue]

Fused Multiply-Add Instruction (FMA)
[Figure]

GigaThread Hardware Thread Scheduler (HTS)
• Hierarchically manages thousands of simultaneously active threads
• 10x faster application context switching (each program receives a time slice of processing resources)
• Concurrent kernel execution

GigaThread Hardware Thread Scheduler: Concurrent Kernel Execution + Faster Context Switch
[Figure: timeline comparing serial kernel execution with parallel (concurrent) execution of kernels 1-5]

GigaThread Streaming Data Transfer (SDT) Engine
• Dual DMA engines
• Simultaneous CPU-to-GPU and GPU-to-CPU data transfer
• Fully overlapped with CPU and GPU processing time
[Figure: activity snapshot showing CPU work, the two DMA engines (SDT0, SDT1), and GPU execution overlapping across kernels 0-3]

Cached Memory Hierarchy
• First GPU architecture to support a true cache hierarchy in combination with on-chip shared memory
• Shared memory / L1 cache per SM (64 KB): improves bandwidth and reduces latency
• Unified L2 cache (768 KB): fast, coherent data sharing across all cores in the GPU
• Global memory (up to 6 GB)

CUDA: Compute Unified Device Architecture
• NVIDIA's parallel computing architecture
• Software development platform aimed at the GPU architecture
• Language integration and device-level APIs
[Figure: CUDA software stack. Applications written in C, C++, Fortran, Java, Python, etc. use C for CUDA and the C Runtime for CUDA; applications can also target the CUDA driver API directly, OpenCL through the OpenCL driver, or DirectX HLSL through DirectX 11 Compute. All paths go through the CUDA driver and its kernel-level driver support, down to PTX (ISA) and the parallel compute engines inside the GPU.]

Thread Hierarchy
• Kernels (simple C programs) are executed by threads
• Threads are grouped into blocks
• Threads in a block can synchronize execution
• Blocks are grouped in a grid
• Blocks are independent (they must be able to execute in any order)
Source: presentation from Felipe A. Cruz, Nagasaki University

Memory and Hardware Hierarchy
• Threads access registers
• CUDA cores execute threads
• Threads within a block can share data/results via shared memory
• Streaming Multiprocessors (SMs) execute blocks
• Grids use global memory for result sharing (after kernel-wide global synchronization)
• The GPU executes grids
Source: presentation from Felipe A. Cruz, Nagasaki University
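To make the thread/block/grid and register/shared/global memory pairings concrete, here is a small kernel sketch. It is not from the slides; the kernel name, the problem size, and the choice of 256 threads per block are illustrative assumptions. Each thread works in registers, threads within a block cooperate through __shared__ memory and __syncthreads(), and blocks hand results to each other only through global memory.

    #include <cuda_runtime.h>

    /* Each block reduces 256 elements of the input to one partial sum.
       Registers hold per-thread values, __shared__ memory is the per-block
       scratch space, and the partial sums go back to global memory,
       mirroring the thread/core, block/SM, grid/GPU pairing above. */
    __global__ void block_sum(const float *in, float *partial, int n)
    {
        __shared__ float scratch[256];              /* shared memory per block */
        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + threadIdx.x;

        scratch[tid] = (i < n) ? in[i] : 0.0f;      /* register -> shared      */
        __syncthreads();                            /* block-wide barrier      */

        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                scratch[tid] += scratch[tid + stride];
            __syncthreads();
        }
        if (tid == 0)
            partial[blockIdx.x] = scratch[0];       /* shared -> global memory */
    }

    int main(void)
    {
        const int n = 1 << 20, threads = 256, blocks = (n + threads - 1) / threads;
        float *in, *partial;
        cudaMalloc(&in, n * sizeof(float));
        cudaMalloc(&partial, blocks * sizeof(float));
        /* ... fill 'in', then reduce or copy back the per-block sums ... */
        block_sum<<<blocks, threads>>>(in, partial, n);
        cudaDeviceSynchronize();
        cudaFree(in);
        cudaFree(partial);
        return 0;
    }

The <<<blocks, threads>>> launch syntax and the built-in threadIdx, blockIdx, and blockDim variables are the ones described on the following slides.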
Full View of the Hierarchy Model
CUDA | Hardware level | Memory access
Thread | CUDA core | Registers
Block | SM | Shared memory
Grid | GPU | Global memory
Device | Node | Host memory

IDs and Dimensions
• Threads: 3D IDs, unique within a block
• Blocks: 2D IDs, unique within a grid
• Dimensions are set at launch time and can be unique for each grid
• Built-in variables: threadIdx, blockIdx, blockDim, gridDim
[Figure: a device holding a grid of blocks, e.g. Block (0,0) through Block (2,1), each block holding a 2D arrangement of threads, e.g. Thread (0,0) through Thread (4,2)]

Compiling C for CUDA Applications

    void serial_function(...) { ... }
    void other_function(int ...) { ... }
    void saxpy_serial(float ...) {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }
    void main() {
        float x;
        saxpy_serial(...);
        ...
    }

• Modify the key kernels into parallel C for CUDA code; the rest of the C application stays as plain C
• NVCC (Open64) compiles the CUDA kernels into CUDA object files, the CPU compiler produces CPU object files, and the linker combines them into a single CPU-GPU executable

C for CUDA: C with a few keywords

Standard C code:

    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }
    // Invoke serial SAXPY kernel
    saxpy_serial(n, 2.0, x, y);

Parallel C for CUDA code:

    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a*x[i] + y[i];
    }
    // Invoke parallel SAXPY kernel with 256 threads/block
    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);

(A complete, runnable host-side version of this example is sketched after the CUDA toolkit timeline below.)

Software Programming
[Eight figure-only slides]
Source: presentation from Andreas Klöckner, NYU

CUDA C/C++ Leadership
• CUDA Toolkit 1.0 (July 07): C compiler, C extensions, single precision, BLAS, FFT, SDK with 40 examples
• CUDA Toolkit 1.1 (Nov 07): Win XP 64, atomics support, multi-GPU support
• CUDA Visual Profiler (April 08); cuda-gdb HW debugger (with CUDA 2.2)
• CUDA Toolkit 2.0 (Aug 08): double precision, compiler optimizations, Vista 32/64, Mac OSX, 3D textures, HW interpolation
• CUDA Toolkit 2.3 (July 09): DP FFT, 16-32 conversion intrinsics, performance enhancements
• Parallel Nsight Beta (Nov 09)
• CUDA Toolkit 3.0 (Mar 10): C++ inheritance, Fermi architecture support, tools updates, driver / runtime interop
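For completeness, here is a self-contained host program around the saxpy_parallel kernel from the "C for CUDA" slide above. The kernel and launch configuration are taken from the slide; the problem size, the fill values, the error check at the end, and the memory management calls are assumptions added only to make the sketch runnable.

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    // Kernel from the slide: one thread per element of y.
    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    int main(void)
    {
        const int n = 1 << 20;                      // assumed problem size
        size_t bytes = n * sizeof(float);

        // Host data
        float *h_x = (float *)malloc(bytes);
        float *h_y = (float *)malloc(bytes);
        for (int i = 0; i < n; ++i) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

        // Device data and host-to-device transfers
        float *d_x, *d_y;
        cudaMalloc(&d_x, bytes);
        cudaMalloc(&d_y, bytes);
        cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

        // Launch with 256 threads per block, as on the slide
        int nblocks = (n + 255) / 256;
        saxpy_parallel<<<nblocks, 256>>>(n, 2.0f, d_x, d_y);

        // Copy the result back and spot-check one element: 2*1 + 2 = 4
        cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);
        printf("y[0] = %f (expected 4.0)\n", h_y[0]);

        cudaFree(d_x); cudaFree(d_y);
        free(h_x); free(h_y);
        return 0;
    }

Compiled with a single command (nvcc saxpy.cu -o saxpy), the NVCC / CPU-compiler / linker split described on the "Compiling C for CUDA Applications" slide happens behind the scenes.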
Why should I choose Tesla over consumer cards?

Features
Feature | Benefit
4x higher double precision (on 20-series) | Higher performance for scientific CUDA applications
ECC, only on Tesla & Quadro (on 20-series) | Data reliability inside the GPU and on DRAM memories
Bi-directional PCI-E communication (Tesla has dual DMA engines, GeForce has only one DMA engine) | Higher performance for CUDA applications by overlapping communication and computation (see the overlap sketch after this table)
Larger memory for larger data sets: 3 GB and 6 GB products | Higher performance on a wide range of applications (medical, oil & gas, manufacturing, FEA, CAE)
Cluster management software tools available on Tesla only | Needed for GPU monitoring and job scheduling in data center deployments
TCC (Tesla Compute Cluster) driver for Windows, supported only on Tesla | Higher performance for CUDA applications due to lower kernel launch overhead; TCC adds support for RDP and Services
Integrated OEM workstations and servers | Trusted, reliable systems built for Tesla products
Professional ISVs will certify CUDA applications only on Tesla | Bug reproduction, support, and feature requests handled for Tesla only

Quality & Warranty
Feature | Benefit
2 to 4 days of stress testing and memory burn-in; added margin in memory and core clocks | Built for 24/7 computing in data center and workstation environments
Manufactured and guaranteed by NVIDIA | No changes in key components like GPU and memory without notice; always the same clocks for known, reliable performance
3-year warranty from HP | Reliable, long-life products

Support & Lifecycle
Feature | Benefit
Enterprise support, higher priority for CUDA bugs and requests | Ability to influence the CUDA and GPU roadmap; early access to requested features
18-24 months of availability + 6-month EOL notice | Reliable product supply
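The dual-DMA-engine row above claims higher performance from overlapping transfers with computation. The sketch below shows the standard CUDA pattern for that overlap: pinned host memory, two streams, and cudaMemcpyAsync, so one chunk can be uploading while another is being processed and a third is downloading. The kernel, chunk size, and stream count are illustrative assumptions, not from the presentation, and the achievable overlap depends on the board (Tesla 20-series has two copy engines, GeForce has one).

    #include <cuda_runtime.h>

    __global__ void scale(float *data, int n, float a)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= a;
    }

    int main(void)
    {
        const int nchunks = 8, chunk = 1 << 20;     // assumed sizes
        size_t bytes = chunk * sizeof(float);

        // Pinned host memory is required for asynchronous copies.
        float *h_buf;
        cudaMallocHost(&h_buf, nchunks * bytes);
        for (int i = 0; i < nchunks * chunk; ++i) h_buf[i] = 1.0f;

        float *d_buf;
        cudaMalloc(&d_buf, nchunks * bytes);

        cudaStream_t streams[2];
        cudaStreamCreate(&streams[0]);
        cudaStreamCreate(&streams[1]);

        // Alternate chunks between two streams so the upload of one chunk,
        // the kernel on another, and the download of a third can overlap.
        for (int k = 0; k < nchunks; ++k) {
            cudaStream_t s = streams[k % 2];
            float *h = h_buf + (size_t)k * chunk;
            float *d = d_buf + (size_t)k * chunk;
            cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, s);
            scale<<<(chunk + 255) / 256, 256, 0, s>>>(d, chunk, 2.0f);
            cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, s);
        }
        cudaDeviceSynchronize();

        cudaStreamDestroy(streams[0]);
        cudaStreamDestroy(streams[1]);
        cudaFree(d_buf);
        cudaFreeHost(h_buf);
        return 0;
    }

This is the same overlap pattern the GigaThread Streaming Data Transfer slide illustrates with its dual DMA engines (SDT0/SDT1) running alongside CPU and GPU work.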