GPU in HPC

Scott A. Friedman
friedman@ats.ucla.edu
ATS Research Computing Technologies
IDRE GPU Lunch

First of all…
• Double precision is coming!
  – GPU: late '07 or early '08 (NVIDIA)
    • Will be half speed (word on the street)
    • At G80 speed, that equals 175 Gflop
  – Cell HPC: summer '08
    • First appears in LANL Roadrunner
    • 5x increase to 100 Gflop

Hardware
• Remember!
  – GPUs are for graphics (graphics processing unit)
    • Think data parallelism!
    • Must hide memory latency
      – Lots of computation ('arithmetic intensity')
      – Low latency memory is a precious resource
• Limitations
  – Regs zeroed, minimal shared/static data, no r-m-w buffers
  – Varying latencies: dependent on memory type accessed
  – Designed for independent operations (legacy of graphics)
  – Lots of gotchas that will kill performance
  – Hardware constantly changing
• Current generation
  – Proprietary architectures
  – NVIDIA G80, 128 ALUs, 350 Gflop SP

Programming Model
• Streaming
  – Elements (array) processed by a kernel (function)
    • Sounds like a SIMD vector processor
    • Not exactly; the term SPMD is often used (P = program)
    • No index ops on streams
    • Input stream(s) -> compute -> output stream
    • No dependencies between stream elements
      – CUDA relaxes this somewhat
• Experimentation required
  – Balancing essential
• Compute rather than move data
  – Maximize use of precious low latency, high bandwidth memory
  – Cover latencies with as much computation as possible
    • High arithmetic intensity, you will hear this a lot!
  – Often better to re-compute than cache data
  – Avoid code that is memory bound
    • Memory access is progressing much slower than # of ALUs
    • Better to batch memory moves into large transfers
    • Complex memory access rules have a major impact on performance

Tools
• Cell SDK
  – Direct access to the hardware
  – Very low level
• CUDA (NVIDIA 8xxx and later)
  – C API: provides scalar execution model (with caveats)
  – Low level, think of MPI?
  – Certain amount of hardware abstraction
    • Re-imagining of graphics hardware as programming concepts (e.g. threads, arrays)
    • GLSL, graphics tools: even lower level, but not as necessary now
  – User maps problem domain to processing units and memory hierarchy
    • Kernel is : 1,2,3D Grid : Blocks : Threads
    • Threads within a block can communicate via on-chip shared memory and synchronize
    • Blocks are independent!
      – No communication between blocks
      – No execution ordering or concurrency guarantees
  – Free, but specific to NVIDIA hardware (will hide future architecture changes)
• CTM (AMD/ATI)
  – Similar, but lower level than CUDA
• RapidMind
  – Integrate into C++ code
  – Higher level abstractions, think OpenMP?
  – SPMD oriented: e.g. streams and kernels, more restrictive than CUDA
  – Portable?
    • Several back-ends supported: Cell, GPUs, multicore CPUs
    • Allows tuning to specific hardware
  – Let the experts do the mapping to memory hierarchy
  – Not free
• Brook, Sh
  – Open source tools
  – Sh is the precursor to the RapidMind kit

Resources
• One stop shopping
  – http://www.gpgpu.org
• More good stuff
  – http://developer.rapidmind.com
• RapidMind
  – http://www.rapidmind.com/resources.php
• AMD/ATI
  – http://ati.amd.com/technology/streamcomputing/index.html
• NVIDIA
  – http://developer.nvidia.com/object/cuda.html
• IBM Cell
  – http://www.ibm.com/developerworks/power/cell/
• Siggraph 2007 gpgpu course (very good)
  – http://www.gpgpu.org/s2007/
• Cell HPC presentation
  – http://www.power.org/resources/devcorner/cellcorner/hpcspe.pdf
• Great survey paper
  – http://graphics.idav.ucdavis.edu/publications/print_pub?pub_id=907
• Google is your friend, of course

Conclusions
• This is the future of the highest performance codes
  – GPU, Cell or Larrabee-esque multi-core
  – Industry is scaling cores, not clocks
    • Industry contacts share that customers are 'in denial' and need to get on board
• Programming is going to get a whole lot more complex
  – Memory hierarchies
  – Load and system balancing
  – More and more doing it, fewer and fewer who are any good at it
    • Education!
• Mapping problem domains to these architectures is still evolving
  – Lots of clever solutions to lots of problems
    • Domain and algorithm level
• Tools are currently pretty weak
  – Industry appears to be aware of this, not just the market opportunity
• Hopefully
  – APIs will insulate us from variety and evolution of hardware

Thank you
• Questions?
• Please feel free to contact me
• ATS has several resources that you can access to try some of these things out
  – Sony PlayStation 3, Cell SDK
  – NVIDIA 8800 GTX, CUDA, RapidMind
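The Tools slide describes the CUDA decomposition as Kernel : Grid : Blocks : Threads, with threads in a block sharing on-chip memory and blocks being fully independent. The sketch below is one way to picture that mapping. It is plain Python running on the CPU, not CUDA, and every name in it (BLOCK_SIZE, block_sum, grid_sum) is hypothetical and chosen only for illustration. It assumes a 1D grid and a power-of-two block size, and it marks in comments where a real kernel would need a block-wide synchronization.

```python
# CPU-side emulation of the Grid : Blocks : Threads decomposition.
# Plain Python, not CUDA; all names are hypothetical, for illustration only.

BLOCK_SIZE = 4  # "threads" per block (assumed; must be a power of two here)


def block_sum(data, block_id):
    """Sum one block's slice using a block-private 'shared memory' buffer."""
    # Each "thread" loads one element into the block's shared buffer.
    shared = [0.0] * BLOCK_SIZE
    for thread_id in range(BLOCK_SIZE):
        global_index = block_id * BLOCK_SIZE + thread_id
        if global_index < len(data):  # guard the ragged last block
            shared[thread_id] = data[global_index]
    # (a real kernel would synchronize the block here, after the loads)

    # Tree reduction inside the block: halve the active threads each step.
    stride = BLOCK_SIZE // 2
    while stride > 0:
        for thread_id in range(stride):
            shared[thread_id] += shared[thread_id + stride]
        # (and synchronize again after every reduction step)
        stride //= 2
    return shared[0]


def grid_sum(data):
    """Run a 1D 'grid' of independent blocks, then combine on the host."""
    num_blocks = (len(data) + BLOCK_SIZE - 1) // BLOCK_SIZE
    # Blocks never communicate, so running them in reverse order is
    # deliberate: the partial sums do not depend on execution order.
    partials = [block_sum(data, b) for b in reversed(range(num_blocks))]
    return sum(partials)


values = [float(i) for i in range(10)]
total = grid_sum(values)
```

The point of the design is that only the final combine crosses block boundaries. Everything inside block_sum touches a single block's buffer, which is the property that lets hardware schedule blocks in any order.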
