GPU Computing with Matlab®
@ CBI Laboratory
Overview
• GPU History & Hardware
– GPU History
– CPU vs. GPU Hardware
– Parallelism Design Points
• GPU Software Infrastructure (CUDA)
• Matlab Parallel Computing Toolbox, GPU Computing
• GPU nodes @ CBI Lab
• Examples
• Additional Features
GPU History
[Figure: a 3-D object model (e.g. a blue circle of radius R centered at (x,y,z)) and a light source at (x,y,z), projected onto a 2-dimensional screen. Goal: for each pixel (X,Y) on the screen, determine its (R,G,B) value.]
GPU History
[Same 3-D scene and 2-D screen figure as above.] Much parallelism is available, and the screen refresh rate is far lower than the processor clock rate.
GPU History
[Same figure as above.] GPU model: an assembly-line concept, with high latency but high throughput.
GPU History
[Figure: the graphics pipeline. A stream of triangles, each defined by 3 vertices (x1,y1,z1), (x2,y2,z2), (x3,y3,z3), passes through successive matrix-multiplication stages: 3-D translation/rotation/scaling, 3-D rotation, and 3-D to 2-D projection (perspective projection) onto the screen.]
Many independent computations: streams of triangles & vertices. The more calculators, the more points we can move around in the same amount of time.
GPU History
[Same pipeline figure as above.]
Why must we be limited to performing a single type of function? The answer involves the start of general-purpose GPU computing: allow the programmer to create custom functions (a.k.a. kernels) that run in parallel.
GPU vs. CPU
Different goals: a fast-food restaurant vs. anywhere there are long lines of people waiting. Which column maps to the CPU and which to the GPU?

Higher latency, exceptionally high throughput:
• An individual may need to wait a long time in line, but many more people go through the system during the course of a day.
• Workers are always kept busy: even if the current person forgets a document and needs to wait for someone to deliver it, there are many more people waiting in line.
• More workers, with a smaller desk per worker.
• Use as much of the building space as possible to add workers.

Lower latency, good throughput:
• An individual waits as little as possible in line.
• Workers are always kept busy by having large local caches of supplies, both at the store and at the work counters.
• Subdivide one task into smaller tasks and increase the speed of each smaller task (ILP & pipelining).
• Try to find parallelism within one task (out-of-order execution).
• Try to predict what people may order, to get a head start (branch prediction).
• Optimizing for minimum wait time for a single user uses up resources (workers, plus space where you could have put more workers).
GPU vs. CPU
The answer: the GPU is the higher-latency, exceptionally-high-throughput column, and the CPU is the lower-latency, good-throughput column.
Parallelism Design Points
• Key: focus on dependency analysis.
• How much of your program is independent determines the potential parallelism (Amdahl's Law), for a fixed amount of work in the parallel section.
• Gustafson's Law: do more work within the parallel sections. (A back-of-the-envelope sketch of both laws follows this list.)
• Data transfer vs. compute (arithmetic intensity)
  – The cost of moving data from the CPU to the GPU needs to be taken into account.
  – The GPU may provide a large benefit when compute >> data I/O.
  – Analogy: going to the store to get 100 items with 10 workers, you ideally want to make only 1 trip for all 100 items; even if all 10 workers fetch their items in parallel, there is little benefit if you make 10 round trips.
• Resource contention
  – Data transfer bandwidth
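A minimal Matlab sketch (not from the original slides) comparing the two laws; the parallel fraction p and the worker counts N are assumed values for illustration:

% Amdahl's Law: fixed total work, fraction p of which is parallelizable.
p = 0.90;               % assumed parallelizable fraction of the work
N = [1 2 4 8 16 448];   % numbers of workers (448 ~ CUDA cores on an M2070)
amdahlSpeedup = 1 ./ ((1 - p) + p ./ N);

% Gustafson's Law: scale the parallel work with N (scaled speedup).
gustafsonSpeedup = (1 - p) + p .* N;

disp([N(:) amdahlSpeedup(:) gustafsonSpeedup(:)]);   % workers | Amdahl | Gustafson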
Parallelism Design Points
• Resource limits (memory, disk)
• Hardware limits
  – Memory cache line sizes, memory alignment issues, disk block sizes, cache sizes, number of queues, etc.
• Physical data organization (e.g. row major vs. column major; see the sketch after this list)
• Conditional (if-else) minimization
  – Ideally you would have zero if statements in your functions; this is not always feasible for algorithm correctness.
• Synchronization
  – Algorithm correctness often requires some type of synchronization.
• Many more variables affect function-, program-, and system-level parallelism.
  – A function may be highly parallelizable, but overall system parallelism may require looking at several levels of parallelism to achieve a good solution.
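Matlab stores arrays in column-major order, so traversal order affects memory locality. A small sketch (the matrix size is arbitrary; exact timings will vary by machine):

A = rand(4000);                       % Matlab arrays are column major

tic;                                  % column-wise traversal: contiguous memory
s1 = 0;
for j = 1:size(A,2)
    for i = 1:size(A,1)
        s1 = s1 + A(i,j);
    end
end
tColumn = toc;

tic;                                  % row-wise traversal: strided memory access
s2 = 0;
for i = 1:size(A,1)
    for j = 1:size(A,2)
        s2 = s2 + A(i,j);
    end
end
tRow = toc;

fprintf('column-wise: %.3f s, row-wise: %.3f s\n', tColumn, tRow);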
GPU Hardware
Fermi Architecture [16]
[Figures: Fermi architecture block diagrams.] Many resources are available at www.nvidia.com.
GPU Software Infrastructure
CUDA: Compute Unified Device Architecture. The software stack, from applications down to the hardware:
• Applications (e.g. Matlab)
• CUDA C/C++, compiled by the NVCC compiler and supporting utilities (nvprof, visual profiler)
• PTX: Parallel Thread eXecution assembly language (a virtual machine), and CUBIN (CUDA binary)
• CUDA libraries and the CUDA Runtime API
• CUDA driver
• Operating system (Linux, Windows, etc.)
• GPU card(s) & system board with CPU, buses (PCIe), …
GPU Software Infrastructure
CUDA: Compute Unified Device Architecture
Software model: an abstraction of the hardware. Software-to-hardware mapping:
• Streams: compute & data-transfer queues (order is guaranteed within a single stream) → GPU1, GPU2, …
• Grids: run the same kernel (a.k.a. function) → GPU1, GPU2, …
• Blocks: groups of cooperating threads → SM (streaming multiprocessor)
  – 32 compute cores per SM in the Fermi architecture.
  – Blocks should be viewed as self-contained work units.
• Warps: groups of 32 threads → SM (streaming multiprocessor)
  – The basic unit of execution: 32 threads running the same instruction at the same time.
• Threads: execution context (keeps track of a core's state) → compute core
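To make the grid/block/thread terms concrete from the Matlab side, here is a hedged sketch using parallel.gpu.CUDAKernel. It assumes a kernel has already been compiled with nvcc -ptx; the file names addVec.ptx / addVec.cu and the kernel signature are hypothetical:

% Assumed CUDA C kernel, compiled separately:
%   __global__ void addVec(double *c, const double *a, const double *b, int n)
k = parallel.gpu.CUDAKernel('addVec.ptx', 'addVec.cu');

n = 1e6;
k.ThreadBlockSize = [256 1 1];        % threads per block (executed as warps of 32)
k.GridSize = [ceil(n/256) 1 1];       % number of blocks in the grid

a = gpuArray.rand(n, 1);
b = gpuArray.rand(n, 1);
c = gpuArray.zeros(n, 1);

c = feval(k, c, a, b, n);             % launch the kernel on the GPU
result = gather(c);                   % copy the result back to the host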
Matlab Parallel Computing Toolbox, GPU Computing
Matlab Parallel Computing Toolbox: with each release, more and more functions are enabled for transparent GPU support.
• gpuDevice(#)
• gpuDeviceCount()
• reset(gpuDevice(#))
• wait()
• bsxfun()
• gpuArray()
• gather()
• arrayfun()
• existsOnGPU()
• parallel.gpu.CUDAKernel()
• feval()
• setConstantMemory()
• Many GPU-enabled built-in functions (e.g. fft, …). Check with:
  – methods('gpuArray')
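A minimal sketch (not from the slides) of the basic workflow using a few of the functions above; the array sizes and the element-wise function are arbitrary:

gpuDeviceCount()                 % how many GPUs are visible to Matlab
d = gpuDevice(1);                % select and query GPU #1

x = gpuArray(rand(1e6, 1));      % copy host data to GPU memory
y = gpuArray(rand(1e6, 1));

f = @(a, b) 4*a + b - 2*a.*b;    % element-wise function
z = arrayfun(f, x, y);           % runs on the GPU because the inputs are gpuArrays

existsOnGPU(z)                   % true: z still lives in GPU memory
zHost = gather(z);               % copy the result back to host memory
reset(gpuDevice(1));             % optional: clear the device when finished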
Matlab Parallel Computing Toolbox, GPU Computing
• Many GPU-enabled built-in functions, e.g. fft, fft2, ….
• Try running >> methods('gpuArray') to see the list of supported functions.
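For example, a built-in such as fft dispatches to the GPU automatically when it is handed a gpuArray (a hedged sketch; the test signal is arbitrary):

t = linspace(0, 1, 2^20);
signal = sin(2*pi*50*t) + 0.5*randn(size(t));   % noisy 50 Hz sine wave

g = gpuArray(signal);        % move the signal to the GPU
G = fft(g);                  % GPU-enabled fft runs on the device
spectrum = gather(abs(G));   % bring the magnitude spectrum back to the host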
GPU Nodes @ CBI Lab
Nvidia M2070: Fermi architecture, 448 CUDA cores, 14 multiprocessors @ 32 CUDA cores per multiprocessor.
• 2 modes: interactive & batch
• Interactive: use for development
  $ ssh -Y username@cheetah.cbi.utsa.edu
  $ qlogin -q gpu.q -l gpuonly
  $ matlab &
• Batch mode: for production runs, submit a job script containing:
  #!/bin/bash
  #$ -q gpu.q
  #$ -l gpuonly
Putty + Xming can be used to access the Matlab GUI from a Windows system: http://cbi.utsa.edu/faq/xforwarding
[Source: http://www.cbi.utsa.edu/faq/sge/gpu]
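A hedged sketch of the Matlab side of such a batch run; the script name runAnalysis.m, its contents, and the launch line are assumptions, not part of the CBI documentation:

% runAnalysis.m -- example script a GPU batch job might run
% (the job script above would launch it, e.g. with:
%  matlab -nodisplay -nosplash -r "runAnalysis; exit")
d = gpuDevice(1);              % select the GPU assigned to the job
A = gpuArray(rand(4096));
B = A * A';                    % compute on the GPU
result = gather(B);            % copy the result back to host memory
save('result.mat', 'result');  % write output for later inspection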
GPU Nodes @ CBI Lab
[Screenshot: Matlab GUI access from Windows using Putty + X11 forwarding with Xming, after logging in with qlogin -q gpu.q -l gpuonly.]
GPU Nodes @ CBI Lab
[Screenshots: an interactive session on a GPU node. Useful commands: matlab &, nvidia-smi, top, and >> gpuDevice(#) from within Matlab.]
GPU Nodes @ CBI Lab
M2070: Fermi architecture, 448 CUDA cores, 14 multiprocessors @ 32 CUDA cores per multiprocessor.
Built-in function support for GPU
Consider the linear system
  4x + y - 2z = 0
  2x - 3y + 3z = 9
  -6x - 2y + z = 0
• In matrix form, A*x = b, with
  A = [4 1 -2; 2 -3 3; -6 -2 1];
  b = [0; 9; 0];
• What is x?
Quickly solving sets of linear equations has applications throughout science & engineering. The \ operator is one of many functions that work on gpuArray data types.
  x = A\b;   % x = [0.75; -2; 0.5]
Substituting back in verifies the solution:
  4*0.75 + (-2) - (2*0.5) = 0     → matches, as it should for a correct solution
  2*0.75 + (-3*-2) + (3*0.5) = 9  → matches, as it should for a correct solution
  -6*0.75 + (-2*-2) + 0.5 = 0     → matches, as it should for a correct solution
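A hedged sketch of solving the same system on the GPU with gpuArray inputs. For a 3×3 system the host-to-device transfer dwarfs the compute, so this only shows the mechanics; much larger matrices are where the GPU pays off:

A = gpuArray([4 1 -2; 2 -3 3; -6 -2 1]);
b = gpuArray([0; 9; 0]);

x = A \ b;                   % mldivide runs on the GPU for gpuArray inputs
residual = gather(A*x - b)   % should be (numerically) zero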
Many Additional Features
• Using Matlab with the GPU in batch mode via a job script
• Calling .cu and .ptx code directly from Matlab
• Using the GPU from C/C++ code directly with the MEX interface
  – Allows incorporating custom GPU code into Matlab, as well as using Nvidia Nsight and the Nvidia Visual Profiler for custom GPU algorithm development.
Demo
An example of Matlab code running on a GPU system.
Appendix
Many applications are being enabled for GPU acceleration:
e.g. NAMD for molecular dynamics using the GPU
http://www.nvidia.com/object/gpu-applications.html
http://www.nvidia.com/content/tesla/pdf/gpu-acceleratedapplications-for-hpc.pdf
C/C++/Fortran Library:
Accelereyes Arrayfire
https://developer.nvidia.com/accelereyes-arrayfire
http://www.accelereyes.com/examples/case_studies
Appendix
CUDA internals: Valgrind + KCachegrind visualization of libcudart.so (screenshots).
References
[1] http://www.mathworks.com/help/distcomp/release-notes.html
[2] http://www.mathworks.com/help/distcomp/examples/benchmarking-a-b-on-the-gpu.html
[3] http://www.mathworks.com/help/distcomp/examples/illustrating-three-approaches-to-gpu-computing-the-mandelbrot-set.html
[4] http://www.mathworks.com/help/distcomp/executing-cuda-or-ptx-code-on-the-gpu.html
[5] http://www.nvidia.com/docs/IO/105880/DS-Tesla-M-Class-Aug11.pdf
[6] http://en.wikipedia.org/wiki/Nvidia_Tesla#cite_note-11
[7] http://en.wikipedia.org/wiki/Rasterisation
[8] http://en.wikipedia.org/wiki/Perspective_projection#Perspective_projection
[9] http://en.wikipedia.org/wiki/GPGPU
[10] http://www.cbi.utsa.edu/faq/sge/gpu
[11] http://medim.sth.kth.se/6l2872/F/F11c.pdf (FFT registration )
[12] http://medim.sth.kth.se/6l2872/F/F11c.pdf
[13] http://www.nvidia.com/content/PDF/kepler/Tesla-K20-Passive-BD-06455-001-v05.pdf
[14] http://www.nvidia.com/docs/IO/122874/K20-and-K20X-application-performance-technical-brief.pdf
[15] http://en.wikipedia.org/wiki/Nvidia_Tesla
[16] http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
[17] http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
[18] https://www.udacity.com/wiki/cs344/Lesson_1_-_The_GPU_Programming_Model#latency-vs-bandwidth
[19] https://www.udacity.com/wiki/cs344
[20] http://www.computingbook.org/FullText.pdf
[21] http://en.wikipedia.org/wiki/Dynamic_random-access_memory
[22] http://web.sfc.keio.ac.jp/~rdv/keio/sfc/teaching/architecture/architecture-2009/lec08-cache.html
[23] http://web.sfc.keio.ac.jp/~rdv/keio/sfc/teaching/architecture/computer-architecture-2012/lec03-fastest.html
[24] http://en.wikipedia.org/wiki/Gustafson%27s_law
[25] http://archive.hpcwire.com/hpc/705814.html
[26] http://www.johngustafson.net/pubs/pub13/amdahl.pdf
[27] http://spartan.cis.temple.edu/shi/public_html/docs/amdahl/amdahl.html
[28] http://software.intel.com/en-us/articles/amdahls-law-gustafsons-trend-and-the-performance-limits-of-parallel-applications
Acknowledgements
• This project received computational, research & development, and software design/development support
from the Computational System Biology
Core/Computational Biology Initiative, funded by the
National Institute on Minority Health and Health
Disparities (G12MD007591) from the National
Institutes of Health. URL: http://www.cbi.utsa.edu
Contact Us
http://cbi.utsa.edu