GPU in HPC

Scott A. Friedman
friedman@ats.ucla.edu
ATS Research Computing Technologies
First of all…
• Double precision is coming!
– GPU: late '07 or early '08 (NVIDIA)
• Expected to run at half the single-precision speed (word on the street)
• At G80 speeds, that is about 175 Gflop (half of 350 Gflop SP)
– Cell HPC: summer '08
• First appears in LANL Roadrunner
• 5x increase to 100 Gflop
IDRE GPU Lunch
Hardware
• Remember!
– GPUs are for graphics (graphics processing unit)
• Think data parallelism!
• Must hide memory latency
– Lots of computation: ‘arithmetic intensity’
– Low latency memory is a precious resource
• Limitations
– Regs zeroed, minimal shared/static data, no r-m-w buffers
– Varying latencies: dependent on memory type accessed
– Designed for independent operations (legacy of graphics)
– Lots of gotchas that will kill performance
– Hardware constantly changing
• Current generation
– Proprietary architectures
– NVIDIA G80, 128 ALUs, 350 Gflop SP
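To make 'arithmetic intensity' concrete, here is a minimal Python sketch of a bandwidth-versus-peak bound (a roofline-style estimate, which is an illustration added here and not something from the talk). The 350 Gflop peak is the G80 figure above; the bandwidth value and the names `arithmetic_intensity` and `attainable_gflops` are assumptions chosen for the example.

```python
# Rough estimate of whether a kernel is limited by the ALUs or by memory.
# PEAK_GFLOPS is the slide's G80 single-precision figure; the bandwidth
# is an ASSUMED placeholder value, not a number from the talk.

PEAK_GFLOPS = 350.0           # from the slide: NVIDIA G80, single precision
ASSUMED_BANDWIDTH_GBS = 80.0  # hypothetical off-chip memory bandwidth, GB/s


def arithmetic_intensity(flops, bytes_moved):
    """Floating-point operations performed per byte of memory traffic."""
    return flops / bytes_moved


def attainable_gflops(intensity, peak=PEAK_GFLOPS,
                      bandwidth=ASSUMED_BANDWIDTH_GBS):
    """Upper bound on throughput: capped by the ALUs or by memory traffic."""
    return min(peak, intensity * bandwidth)


# Example kernel: y[i] = a * x[i] + y[i] on 4-byte floats.
# Per element: 2 flops, 12 bytes moved (read x, read y, write y).
saxpy = arithmetic_intensity(flops=2, bytes_moved=12)
saxpy_bound = attainable_gflops(saxpy)

# Intensity at which the two limits meet under the assumed bandwidth.
break_even = PEAK_GFLOPS / ASSUMED_BANDWIDTH_GBS
```

Under these assumed numbers the example kernel is bounded far below peak, which is the sense in which low-intensity code is memory bound: the ALUs sit idle waiting for data.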
Programming Model
• Streaming
– Elements (array) processed by a kernel (function)
• Sounds like a SIMD vector processor
• Not exactly; the term SPMD is often used (P = program)
• No index ops on streams
• Input stream(s) -> compute -> output stream
• No dependencies between stream elements
– CUDA relaxes this somewhat
• Experimentation required
– Balancing essential
• Compute rather than move data
– Maximize use of precious low latency, high bandwidth memory
– Cover latencies with as much computation as possible
• High arithmetic intensity, you will hear this a lot!
– Often better to re-compute than cache data
– Avoid code that is memory bound
• Memory access speed is improving much more slowly than the number of ALUs
• Better to batch memory moves into large transfers
• Complex memory access rules have a major impact on performance
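The streaming model above can be sketched in a few lines of plain Python. This is only an emulation to show the shape of the idea (input streams in, kernel applied per element, output stream out); `run_kernel` and `saxpy_kernel` are hypothetical names and belong to no real toolkit.

```python
def run_kernel(kernel, *input_streams):
    """Apply `kernel` independently to each element of the input stream(s).

    The kernel sees only the current element of each stream: it gets no
    index and no access to neighbouring elements, so elements can be
    processed in any order (or all at once on parallel hardware).
    """
    return [kernel(*elements) for elements in zip(*input_streams)]


def saxpy_kernel(x, y, a=2.0):
    """Per-element work: one multiply and one add."""
    return a * x + y


xs = [1.0, 2.0, 3.0]
ys = [10.0, 20.0, 30.0]

out = run_kernel(saxpy_kernel, xs, ys)

# Feeding the elements in a different order changes nothing per element,
# because no element depends on any other.
out_reversed = run_kernel(saxpy_kernel, xs[::-1], ys[::-1])
```

Because the kernel has no way to reach another element, the hardware is free to schedule elements however it likes; that independence is what the "no dependencies between stream elements" rule buys.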
Tools
• Cell SDK
– Direct access to the hardware
– Very low level
• CUDA (NVIDIA >8xxx)
– C API: provides scalar execution model (with caveats)
– Low level, think of MPI? Certain amount of hardware abstraction
– User maps problem domain to processing units and memory hierarchy
• Re-imagining of graphics hardware to programming concepts (e.g. threads, arrays)
• GLSL, graphics tools, even lower level but not as necessary now
– Kernel is: 1/2/3D grid : blocks : threads
• Threads within a block can communicate via on-chip shared memory and synchronize
• Blocks are independent!
– No communication between blocks
– No execution ordering or concurrency guarantees
– Free but specific to NVIDIA hardware (will hide future architecture changes)
• CTM (AMD/ATI)
– Similar, but lower level than CUDA
• RapidMind
– Integrate into C++ code
– Higher level abstractions, think OpenMP?
– SPMD oriented: e.g. streams and kernels, more restrictive than CUDA
– Portable?
• Several back-ends supported: Cell, GPUs, multicore CPUs
• Allows tuning to specific hardware
– Let the experts do the mapping to memory hierarchy
– Not free
• Brook, Sh
– Open source tools
– Sh is precursor to RapidMind kit
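The CUDA grid : blocks : threads hierarchy can be illustrated with a small emulation. The sketch below is plain Python, not CUDA code, and `block_sum` and `launch` are hypothetical names. It assumes a simple sum: threads in a block cooperate through a list standing in for shared memory, while blocks run in a shuffled order and never talk to each other.

```python
import random


def block_sum(block_data):
    """Tree reduction inside one block.

    `shared` stands in for on-chip shared memory. Each pass of the while
    loop is one step between synchronization barriers; each value of `i`
    plays the role of one active thread.
    """
    shared = list(block_data)  # private to this block
    stride = 1
    while stride < len(shared):
        for i in range(0, len(shared) - stride, 2 * stride):
            shared[i] += shared[i + stride]
        stride *= 2  # all threads would synchronize here
    return shared[0] if shared else 0


def launch(data, block_size, seed=0):
    """Split `data` into blocks, run them in arbitrary order, then combine
    the per-block partial sums on the host."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    order = list(range(len(blocks)))
    random.Random(seed).shuffle(order)  # no ordering guarantee between blocks
    partial = [0] * len(blocks)
    for b in order:
        partial[b] = block_sum(blocks[b])  # each block writes only its own slot
    return sum(partial)


total = launch(list(range(100)), block_size=16)
```

The result does not depend on the order in which blocks run, which is the property a problem must have to map cleanly onto independent blocks.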
Resources
• One stop shopping
– http://www.gpgpu.org
• More good stuff
• Rapidmind
– http://www.rapidmind.com/resources.php
– http://developer.rapidmind.com
• AMD/ATI
– http://ati.amd.com/technology/streamcomputing/index.html
• NVIDIA
– http://developer.nvidia.com/object/cuda.html
• IBM Cell
– http://www.ibm.com/developerworks/power/cell/
• Siggraph 2007 gpgpu course: very good
– http://www.gpgpu.org/s2007/
• Cell HPC presentation
– http://www.power.org/resources/devcorner/cellcorner/hpcspe.pdf
• Great survey paper
– http://graphics.idav.ucdavis.edu/publications/print_pub?pub_id=907
• Google is your friend, of course
Conclusions
• This is the future of the highest performance codes
– GPU, Cell or Larrabee-esque multi-core
– Industry is scaling cores, not clocks
– Industry contacts share that customers are 'in denial' and need to get on board
• Programming is going to get a whole lot more complex
– Memory hierarchies
– Load and system balancing
– More and more doing it, fewer and fewer who are any good at it
• Education!
• Mapping problem domains to these architectures is still evolving
– Lots of clever solutions to lots of problems
• Domain and algorithm level
• Tools are currently pretty weak
– Industry appears to be aware of this, not just the market opportunity
• Hopefully
– APIs will insulate us from variety and evolution of hardware
Thank you
• Questions?
• Please feel free to contact me
• ATS has several resources that you can
access to try some of these things out
– Sony Playstation3, Cell SDK
– nVidia 8800GTX, CUDA, Rapidmind