GPU in HPC
Scott A. Friedman
friedman@ats.ucla.edu
ATS Research Computing Technologies

First of all…
• Double precision is coming!
  – GPU: late 07 or early 08 (NVIDIA)
    • Will be half speed (word on the street)
    • At G80 speed, that equals 175 Gflop
  – Cell HPC: summer 08
    • First appears in LANL Roadrunner
    • 5x increase to 100 Gflop

IDRE GPU Lunch

Hardware
• Remember!
  – GPUs are for graphics (graphics processing unit)
• Think data parallelism!
• Must hide memory latency
  – Lots of computation: 'arithmetic intensity'
  – Low latency memory is a precious resource
• Limitations
  – Regs zeroed, minimal shared/static data, no read-modify-write (r-m-w) buffers
  – Varying latencies: dependent on memory type accessed
  – Designed for independent operations (legacy of graphics)
  – Lots of gotchas that will kill performance
  – Hardware constantly changing
• Current generation
  – Proprietary architectures
  – NVIDIA G80, 128 ALUs, 350 Gflop SP

Programming Model
• Streaming
  – Elements (array) processed by a kernel (function)
    • Sounds like a SIMD vector processor
    • Not exactly; the term SPMD is often used (P = program)
    • No index ops on streams
    • Input stream(s) -> compute -> output stream
    • No dependencies between stream elements
  – CUDA relaxes this somewhat
• Experimentation required
  – Balancing essential
• Compute rather than move data
  – Maximize use of precious low latency, high bandwidth memory
  – Cover latencies with as much computation as possible
    • High arithmetic intensity, you will hear this a lot!
  – Often better to re-compute than cache data
  – Avoid code that is memory bound
    • Memory access speed is progressing much more slowly than the number of ALUs
    • Better to batch memory moves into large transfers
    • Complex memory access rules have a major impact on performance

Tools
• Cell SDK
  – Direct access to the hardware
  – Very low level
• CUDA (NVIDIA >8xxx)
  – C API: provides scalar execution model (with caveats)
  – Low level, think of MPI?
  – Certain amount of hardware abstraction
  – User maps problem domain to processing units and memory hierarchy
    • Re-imagining of graphics hardware as programming concepts (e.g. threads, arrays)
    • GLSL, graphics tools: even lower level, but not as necessary now
  – Kernel is: 1-, 2-, or 3-D Grid : Blocks : Threads
    • Threads within a block can communicate via on-chip shared memory and synchronize
    • Blocks are independent!
      – No communication between blocks
      – No execution ordering or concurrency guarantees
  – Free but specific to NVIDIA hardware (will hide future architecture changes)
• CTM (AMD/ATI)
  – Similar, but lower level than CUDA
• RapidMind
  – Integrate into C++ code
  – Higher level abstractions, think OpenMP?
  – SPMD oriented: e.g. streams and kernels, more restrictive than CUDA
  – Portable?
    • Several back-ends supported: Cell, GPUs, multicore CPUs
    • Allows tuning to specific hardware
  – Let the experts do the mapping to memory hierarchy
  – Not free
• Brook, Sh
  – Open source tools
  – Sh is the precursor to the RapidMind kit

Resources
• One stop shopping
  – http://www.gpgpu.org
• More good stuff
  – http://developer.rapidmind.com
• RapidMind
  – http://www.rapidmind.com/resources.php
• AMD/ATI
  – http://ati.amd.com/technology/streamcomputing/index.html
• NVIDIA
  – http://developer.nvidia.com/object/cuda.html
• IBM Cell
  – http://www.ibm.com/developerworks/power/cell/
• Siggraph 2007 gpgpu course (very good)
  – http://www.gpgpu.org/s2007/
• Cell HPC presentation
  – http://www.power.org/resources/devcorner/cellcorner/hpcspe.pdf
• Great survey paper
  – http://graphics.idav.ucdavis.edu/publications/print_pub?pub_id=907
• Google is your friend, of course

Conclusions
• This is the future of the highest performance codes
  – GPU, Cell or Larrabee-esque multi-core
  – Industry is scaling cores, not clocks
  – Industry contacts share that customers are 'in denial' and need to get on board
• Programming is going to get a whole lot more complex
  – Memory hierarchies
  – Load and system balancing
  – More and more people doing it, fewer and fewer who are any good at it
    • Education!
• Mapping problem domains to these architectures is still evolving
  – Lots of clever solutions to lots of problems
    • Domain and algorithm level
• Tools are currently pretty weak
  – Industry appears to be aware of this, not just of the market opportunity
• Hopefully
  – APIs will insulate us from the variety and evolution of hardware

Thank you
• Questions?
• Please feel free to contact me
• ATS has several resources that you can access to try some of these things out
  – Sony PlayStation 3, Cell SDK
  – NVIDIA 8800GTX, CUDA, RapidMind
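
Backup: Grid : Blocks : Threads in miniature

A minimal sketch of the kernel hierarchy from the Tools slide, emulated on the CPU in plain Python rather than written in CUDA. It only illustrates the index mapping and the rule that blocks are independent; it says nothing about performance. The names `block_sum_kernel` and `run_grid`, the block size of 4, and the choice of a per-block sum are all assumptions made for illustration, not part of CUDA or of this talk.

```python
# CPU-side emulation of the grid/block/thread hierarchy; illustrative only.
# All names here are hypothetical and not part of CUDA or any GPU API.


def block_sum_kernel(block_idx, block_dim, data, out):
    """Work done by one block.

    Each 'thread' loads one element into block-local storage, then the
    block reduces it. A block never reads another block's storage.
    """
    shared = [0.0] * block_dim            # stands in for on-chip shared memory
    for thread_idx in range(block_dim):   # threads of one block, run serially here
        i = block_idx * block_dim + thread_idx   # global element index
        shared[thread_idx] = data[i] if i < len(data) else 0.0
    # A real kernel would synchronize the block's threads at this point.
    out[block_idx] = sum(shared)


def run_grid(data, block_dim=4):
    """Run enough blocks to cover the data and return per-block sums."""
    grid_dim = (len(data) + block_dim - 1) // block_dim   # ceiling division
    out = [0.0] * grid_dim
    # Reverse order on purpose: the result must not depend on block order.
    for block_idx in reversed(range(grid_dim)):
        block_sum_kernel(block_idx, block_dim, data, out)
    return out


partial_sums = run_grid([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
print(partial_sums)        # one partial sum per block
print(sum(partial_sums))   # a second pass (or the host) combines the blocks
```

The blocks are run in reverse order deliberately. Because no block reads another block's data, any order gives the same partial sums, and combining results across blocks has to happen in a separate step.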