Numerical Simulations Using Programmable GPUs
Data Analysis and Visualization
Stan Tomov

September 5, 2003
Brookhaven Science Associates
U.S. Department of Energy
Outline
• Motivation
• Literature review
• The graphics pipeline
• Programmable GPUs
• Block diagram of nVidia's GeForce FX
• Some probability based simulations
- Monte Carlo simulations
- Ising model
- Percolation model
• Implementation
• Performance results and analysis
• Extensions and future work
• Conclusions
Motivation
The GPUs have:
● High flops count (nVidia has listed 200 Gflops theoretical speed for NV30)
● Competitive price/performance (0.1 cents per Mflop)
● High rate of performance increase over time (doubling every 6 months)

                 Frames per second using
Problem size     OpenGL (GPU)     Mesa (CPU)
11,540           189              8.01
47,636           52               1.71
193,556          13               0.44
780,308          3                0.12
Table 1. GPU vs CPU in rendering polygons.
The GPU (Quadro2 Pro) is approximately 30 times faster than the CPU (Pentium III, 1 GHz) in rendering polygonal data of various sizes.

Explore the possibility of extending GPUs' use to non-graphics applications
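The "approximately 30 times" figure quoted with Table 1 can be checked directly from the frame rates. The sketch below only divides the two columns of the table; the variable and function names are illustrative and not part of the original slides.

```python
# Frame rates copied from Table 1: (problem size, GPU fps, CPU fps).
TABLE_1 = [
    (11_540, 189.0, 8.01),
    (47_636, 52.0, 1.71),
    (193_556, 13.0, 0.44),
    (780_308, 3.0, 0.12),
]


def speedups(rows):
    """Return the GPU/CPU frame-rate ratio for each row of the table."""
    return [gpu_fps / cpu_fps for _, gpu_fps, cpu_fps in rows]


ratios = speedups(TABLE_1)
mean_ratio = sum(ratios) / len(ratios)

for (size, _, _), ratio in zip(TABLE_1, ratios):
    print(f"problem size {size:>7}: GPU is {ratio:.1f}x faster")
print(f"mean speedup: {mean_ratio:.1f}x")
```

The per-row ratios fall between roughly 24 and 30, which is consistent with the rounded figure on the slide.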
Literature review
Using graphics hardware for non-graphics applications:
• Cellular automata
• Reaction-diffusion simulation (Mark Harris, University of North Carolina)
• Matrix multiply (E. Larsen and D. McAllister, University of North Carolina)
• Lattice Boltzmann computation (Wei Li, Xiaoming Wei, and Arie Kaufman, Stony Brook)
• CG and multigrid (J. Bolz et al, Caltech, and N. Goodnight et al, University of Virginia)
• Convolution (University of Stuttgart)
Performance results:
• Significant speedups of GPU vs CPU are reported when the GPU performs
low precision computations (30 to 60 times, depending on the configuration)
• The fact that the operations are low precision is often omitted, which may be confusing:
- NCSA, University of Illinois assembled a $50,000 supercomputer out of 70 PlayStation 2
consoles, which could theoretically deliver 0.5 trillion operations/second
- also, current $200 GPUs are capable of 1.2 trillion op/s
• The GPU's flops performance is comparable to the CPU's
The graphics pipeline
Programmable GPUs
(in particular NV30)
• Support floating point operations
• Vertex program
- Replaces fixed-function pipeline for vertices
- Manipulates single vertex data
- Executes for every vertex
• Fragment program
- Similar to vertex program but for pixels
• Programming in Cg:
- High level language
- Looks like C
- Portable
- Compiles Cg programs to assembly code
Block diagram of GeForce FX
• AGP 8x graphics bus bandwidth: 2.1 GB/s
• Local memory bandwidth: 16 GB/s
• Chip officially clocked at 500 MHz
• Vertex processor:
- executes vertex shaders or emulates fixed transformations and lighting (T&L)
• Pixel processor:
- executes pixel shaders or emulates fixed shaders
- 2 int & 1 float ops or 2 texture accesses per clock cycle
• Texture & color interpolators
- interpolate texture coordinates and color values
Performance (on processing 4D vectors):
• Vertex ops/sec - 1.5 Gops
• Pixel ops/sec - 8 Gops (int), or 4 Gops (float)
Source: Hardware at Digit-Life.com, NVIDIA GeForce FX, or "Cinema show started", November 18, 2002.
Monte Carlo simulations
● Used in a variety of simulations in physics, finance, chemistry, etc.
● Based on probability and statistics, and use random numbers
● A classical example: compute the area of a circle
● Computation of expected values:
$E(F) = \sum_{i=1}^{N} F(S_i) P(S_i)$    (1)
N can be very large: on a 1024 x 1024 lattice of particles, with every particle modeled to have k states, $N = k^{1024^2}$
● Random number generation. We used a linear congruential type generator:
$R(n) = (a \cdot R(n-1) + b) \bmod N$
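The two ingredients of this slide, a linear congruential generator and the classical circle-area example, can be combined in a short sketch. The constants a, b, and the modulus below are the widely used "Numerical Recipes" values and are an assumption: the slides do not say which values were used. The class and function names are likewise illustrative.

```python
class LCG:
    """Linear congruential generator R(n) = (a * R(n-1) + b) mod N.

    The default a, b, and modulus are assumed values (not from the slides).
    """

    def __init__(self, seed=12345, a=1664525, b=1013904223, modulus=2**32):
        self.a, self.b, self.modulus = a, b, modulus
        self.state = seed % modulus

    def next_int(self):
        self.state = (self.a * self.state + self.b) % self.modulus
        return self.state

    def next_float(self):
        """Uniform sample in [0, 1)."""
        return self.next_int() / self.modulus


def circle_area(radius, samples, rng):
    """Estimate the area of a circle by sampling its bounding square."""
    hits = 0
    for _ in range(samples):
        x = (2.0 * rng.next_float() - 1.0) * radius
        y = (2.0 * rng.next_float() - 1.0) * radius
        if x * x + y * y <= radius * radius:
            hits += 1
    square_area = (2.0 * radius) ** 2
    # Fraction of samples inside the circle times the area of the square.
    return square_area * hits / samples


estimate = circle_area(1.0, 200_000, LCG())
print(f"estimated area of the unit circle: {estimate:.4f}")
```

The estimate converges to pi for the unit circle at the usual Monte Carlo rate, with an error shrinking like one over the square root of the number of samples.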
Ising model
● Simplified model for magnets (introduced by Wilhelm Lenz in 1920, further studied by his student Ernst Ising)
● Modeled on a 2D lattice with a “spin” (corresponding to the orientation of electrons) at every cell, pointing up or down
● Uses temperature to couple 2 opposing physical principles
- minimization of the system's energy
- entropy maximization
● Want to compute
- expected magnetization: $F(S_i) = N_{up}(S_i) - N_{down}(S_i)$
- expected energy: $F(S_i) = En(S_i) = -\sum_{j,k} S_i(j) S_i(k)$ (sum over neighboring sites j, k)
● Evolve the system into “higher probability” states and compute expected values as averages over those states
- evolving from state to state, based on a probabilistic decision, is related to so-called Markov chains:
W. Gilks, S. Richardson, and D. Spiegelhalter (Editors), Markov Chain Monte Carlo in Practice, Chapman & Hall, 1996.
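The two quantities of interest can be evaluated for a single lattice state as sketched below. Several details are assumptions that the slides do not spell out: spins are stored as +1 (up) and -1 (down), only nearest neighbors interact, the boundaries are periodic, the coupling constant is 1, and the sign is chosen so that aligned spins lower the energy.

```python
def magnetization(spins):
    """N_up - N_down for a lattice of +1 (up) and -1 (down) spins."""
    return sum(sum(row) for row in spins)


def energy(spins):
    """Minus the sum of S(j) * S(k) over nearest-neighbor pairs.

    Periodic boundaries are assumed. Each pair is counted once by
    looking only at the right and lower neighbor of every site.
    """
    n_rows, n_cols = len(spins), len(spins[0])
    total = 0
    for i in range(n_rows):
        for j in range(n_cols):
            right = spins[i][(j + 1) % n_cols]
            down = spins[(i + 1) % n_rows][j]
            total += spins[i][j] * (right + down)
    return -total


all_up = [[1] * 4 for _ in range(4)]
checker = [[1 if (i + j) % 2 == 0 else -1 for j in range(4)] for i in range(4)]

print("all up:      ", magnetization(all_up), energy(all_up))
print("checkerboard:", magnetization(checker), energy(checker))
```

On a 4 x 4 lattice the fully aligned state has the largest magnetization and the lowest energy, while the checkerboard state has zero magnetization and the highest energy.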
Ising model computational procedure
● Choose an absolute temperature of interest T (in Kelvin)
● Color the lattice in a checkerboard manner
● Start consecutive black and white “sweeps”
● Change the spin at a site based on the procedure
1. Denote the current state as S, the state with the flipped spin as S'
2. Compute $\Delta E = E(S') - E(S)$
3. If $\Delta E \le 0$ accept S',
else generate $R \in [0,1]$ and accept S' if
$R \le \frac{P(S')}{P(S)} = e^{-\Delta E/(kT)}$
where P(S) is given by the Boltzmann probability distribution function
$P(S) = \frac{e^{-E(S)/(kT)}}{\sum_{i=1}^{N} e^{-E(S_i)/(kT)}}$
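The procedure above can be sketched on the CPU as follows. This is an illustration under stated assumptions, not the GPU implementation from the talk: spins are +1/-1, only nearest neighbors interact with unit coupling, boundaries are periodic (so the lattice side must be even for the checkerboard coloring to be consistent), and the product kT is passed as one parameter in energy units. Python's standard `random` module stands in for the random number generator.

```python
import math
import random


def delta_energy(spins, i, j):
    """E(S') - E(S) when the spin at site (i, j) is flipped."""
    n = len(spins)
    neighbors = (spins[(i - 1) % n][j] + spins[(i + 1) % n][j]
                 + spins[i][(j - 1) % n] + spins[i][(j + 1) % n])
    return 2 * spins[i][j] * neighbors


def sweep(spins, color, kT, rng):
    """Update every site of one checkerboard color (0 = black, 1 = white).

    Sites of the same color are never neighbors, so within one sweep
    each decision depends only on sites of the other color.
    """
    n = len(spins)
    for i in range(n):
        for j in range(n):
            if (i + j) % 2 != color:
                continue
            dE = delta_energy(spins, i, j)
            # Accept if the energy does not increase, otherwise accept
            # with probability exp(-dE / kT).
            if dE <= 0 or rng.random() <= math.exp(-dE / kT):
                spins[i][j] = -spins[i][j]


def simulate(n, kT, frames, seed=0):
    """Run `frames` black-then-white sweeps from a random start."""
    rng = random.Random(seed)
    spins = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(n)]
    for _ in range(frames):
        sweep(spins, 0, kT, rng)
        sweep(spins, 1, kT, rng)
    return spins


state = simulate(16, kT=2.0, frames=100)
m = sum(map(sum, state))
print("magnetization after 100 frames:", m)
```

The checkerboard coloring is what makes the update suitable for a GPU: all sites of one color can be processed independently in a single rendering pass.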
Percolation model
● First studied by Broadbent and Hammersley in 1957
● Used in studies of disordered media (usually specified by a probability distribution)
● Applied in studies of various phenomena such as the spread of diseases, flow in porous media, forest fire propagation, clustering, etc.
● Of particular interest are:
- the threshold of the media model after which there exists a “spanning cluster”
- relations between different media models
- time to reach a steady state spanning cluster
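The notion of a spanning cluster can be made concrete with a small sketch. The media model here is an assumption chosen for illustration (site percolation on a square lattice, where each cell is open independently with probability p), and "spanning" is taken to mean that open cells connect the top row to the bottom row through edge-sharing neighbors. The slides do not state which model was implemented.

```python
import random
from collections import deque


def make_lattice(n, p, rng):
    """Site percolation: each cell is open (True) with probability p."""
    return [[rng.random() < p for _ in range(n)] for _ in range(n)]


def has_spanning_cluster(lattice):
    """True if open cells connect the top row to the bottom row."""
    n = len(lattice)
    seen = [[False] * n for _ in range(n)]
    queue = deque()
    # Start a breadth-first search from every open cell in the top row.
    for j in range(n):
        if lattice[0][j]:
            seen[0][j] = True
            queue.append((0, j))
    while queue:
        i, j = queue.popleft()
        if i == n - 1:
            return True
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            a, b = i + di, j + dj
            if 0 <= a < n and 0 <= b < n and lattice[a][b] and not seen[a][b]:
                seen[a][b] = True
                queue.append((a, b))
    return False


def spanning_fraction(n, p, trials, seed=0):
    """Fraction of random lattices that contain a spanning cluster."""
    rng = random.Random(seed)
    hits = sum(has_spanning_cluster(make_lattice(n, p, rng))
               for _ in range(trials))
    return hits / trials


for p in (0.4, 0.6, 0.8):
    print(f"p = {p}: spanning fraction = {spanning_fraction(32, p, 50)}")
```

Scanning p and watching where the spanning fraction jumps from near 0 to near 1 gives a rough estimate of the threshold; for this particular model the known value is about 0.593.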
Implementation
Approaches:
• Pure OpenGL (simulations using the fixed-function pipeline)
• Shaders in assembly
• Shaders in Cg
Dynamic texturing:
• Create a texture T (think of a 2D lattice)
• Loop:
- Render an image using T (in an off-screen buffer)
- Update T from the resulting image
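The dynamic texturing loop can be mimicked on the CPU to show its data flow. The sketch below is only an analogue: it makes no OpenGL or Cg calls, the "texture" is a list of lists, and the per-texel function `average_neighbors` is a hypothetical stand-in for a fragment program.

```python
def render(texture, fragment):
    """Off-screen pass: evaluate `fragment` once per texel.

    The pass reads only from the input texture and writes to a separate
    output image, as a rendering pass into an off-screen buffer would.
    """
    height, width = len(texture), len(texture[0])
    return [[fragment(texture, x, y) for x in range(width)]
            for y in range(height)]


def average_neighbors(texture, x, y):
    """Hypothetical fragment program: mean of the four neighbors."""
    height, width = len(texture), len(texture[0])
    return (texture[(y - 1) % height][x] + texture[(y + 1) % height][x]
            + texture[y][(x - 1) % width] + texture[y][(x + 1) % width]) / 4.0


def run(texture, fragment, steps):
    for _ in range(steps):
        image = render(texture, fragment)  # render an image using T
        texture = image                    # update T from the resulting image
    return texture


T = [[0.0] * 4 for _ in range(4)]
T[1][1] = 4.0
T = run(T, average_neighbors, 1)
for row in T:
    print(row)
```

Because every texel is computed from the previous texture only, all texels of one pass are independent, which is the property the GPU exploits.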
Performance results and analysis
• Time in s (approximate) for different vector flops on the GPU:

              256x256    512x512
traffic       0.00063    0.0024
+, -, *, /    0.00010    0.0003
cos, sin      0.00026    0.0010
log, exp      0.00045    0.0015
if, ? :       0.00016    0.0008

- 48 B per node: speed limited by GPU’s memory speed (16 GB/s)
- 3.5 Gflops
- 20 x faster than CPU, but the operations are of low accuracy

• Time in s (approximate), including traffic, for different vector flops on the CPU:

              256x256    512x512    1024x1024
+, -, *, /    0.0011     0.0046     0.017
cos, sin      0.0540     0.0650     0.267
log, exp      0.0609     0.1100     0.426

- 32 B per node: speed limited by CPU’s memory speed (4.2 GB/s)
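The 3.5 Gflops figure can be recovered from the GPU timings. The sketch below assumes that one "vector flop" means one operation on a 4-component vector at every lattice node; this reading is an inference from the hardware figures being quoted for 4D vectors, and the function name is illustrative.

```python
def gflops(width, height, seconds, components=4):
    """Flop rate for one vector operation applied at every lattice node.

    components=4 is an assumption: one operation on a 4-component
    vector per node.
    """
    return width * height * components / seconds / 1e9


# GPU times for +, -, *, / taken from the table on this slide.
gpu_256 = gflops(256, 256, 0.00010)
gpu_512 = gflops(512, 512, 0.0003)

print(f"GPU, 256x256: {gpu_256:.2f} Gflops")
print(f"GPU, 512x512: {gpu_512:.2f} Gflops")
```

Under this assumption the 512x512 timing gives about 3.5 Gflops, matching the slide; the 256x256 timing gives a somewhat lower rate.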
Performance results and analysis
• GPU and CPU (2.8 GHz) performance on the Ising model

Lattice size (not necessarily a power of 2)
                  64x64     128x128   256x256   512x512   1024x1024
GPU sec/frame     0.0006    0.0023    0.0081    0.033     0.14
CPU no opt.       0.0009    0.0024    0.0083    0.032     0.13
CPU with -O4      0.0008    0.0020    0.0069    0.026     0.10
GPU instr./sec    0.55 G    0.57 G    0.66 G    0.63 G    0.61 G

• 2.64 Gflops, i.e. 15% GPU theoretical power utilization (too many ifs):
- if (flag) { … } : exec. time = time to compute the block even if flag = 0
• Performance comparable to visualization related sample shaders from nVidia
• Cg vs assembly
- Performance is the same when using runtime Cg or the generated assembly code
- The generated assembly code is not optimal: we found cases where the code could
be optimized and performance increased
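The 2.64 Gflops figure follows from the best instruction rate in the table if each instruction is counted as 4 flops (one operation on a 4-component vector); that factor is an assumption inferred from the numbers, not stated on the slide. The same table also implies a roughly constant instruction count per lattice site per frame, which is a derived quantity and not a figure from the talk.

```python
LATTICE = [64, 128, 256, 512, 1024]
GPU_SEC_PER_FRAME = [0.0006, 0.0023, 0.0081, 0.033, 0.14]
GPU_GINSTR_PER_SEC = [0.55, 0.57, 0.66, 0.63, 0.61]


def vector_gflops(ginstr_per_sec, components=4):
    """Gflops if every instruction acts on `components` values (assumed 4)."""
    return ginstr_per_sec * components


def instructions_per_site(n, sec_per_frame, ginstr_per_sec):
    """Instructions executed per lattice site in one frame."""
    return ginstr_per_sec * 1e9 * sec_per_frame / (n * n)


best = vector_gflops(max(GPU_GINSTR_PER_SEC))
per_site = [instructions_per_site(n, t, g)
            for n, t, g in zip(LATTICE, GPU_SEC_PER_FRAME, GPU_GINSTR_PER_SEC)]

print(f"best rate: {best:.2f} Gflops")
for n, count in zip(LATTICE, per_site):
    print(f"{n}x{n}: about {count:.0f} instructions per site per frame")
```

The per-site count staying near 80 across all lattice sizes suggests the GPU time scales with the number of sites, as expected when every fragment executes the full program regardless of the branch taken.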
Extensions and future work
• Code optimization (through optimization of Cg generated assembly)
• More applications:
- QCD ?
- Fluid flow ?
• Parallel algorithms (or just as a coprocessor)
- domain decomposition type in cluster environment
- Motivation: communication times (in seconds) between CPU and GPU for lattices of different sizes

                               64x64      128x128    256x256    512x512    speed
Read bdr (glReadPixels)        0.00016    0.0002     0.0006     0.0024     14 MB/s
Read all (glReadPixels)        0.00040    0.0015     0.0062     0.0250     167 MB/s
Write bdr (glDrawPixels)       0.00022    0.0003     0.0007     0.0024     14 MB/s
Write all (glTexSubImage2D)    0.00020    0.0008     0.0032     0.0120     350 MB/s
Write bdr (glTexSubImage2D)    0.00050    0.0020     0.0071     0.0250     1.3 MB/s

Not a bottleneck in a cluster with a 1 Gbit network
• Other ideas?
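The speed column of the communication table can be reproduced from the 512x512 timings. Two assumptions are needed, both inferred from the numbers and not stated on the slide: each lattice node occupies 16 bytes (four 4-byte floats), and "bdr" refers to the lattice boundary, counted as four edges of n nodes each.

```python
BYTES_PER_NODE = 16  # assumption: 4 components x 4-byte float per node


def all_bytes(n):
    """Bytes in a full n x n lattice."""
    return n * n * BYTES_PER_NODE


def boundary_bytes(n):
    """Bytes in the lattice boundary (assumed: four edges of n nodes)."""
    return 4 * n * BYTES_PER_NODE


def rate_mb_per_s(num_bytes, seconds):
    return num_bytes / seconds / 1e6


n = 512
rates = {
    "Read bdr (glReadPixels)": rate_mb_per_s(boundary_bytes(n), 0.0024),
    "Read all (glReadPixels)": rate_mb_per_s(all_bytes(n), 0.0250),
    "Write all (glTexSubImage2D)": rate_mb_per_s(all_bytes(n), 0.0120),
    "Write bdr (glTexSubImage2D)": rate_mb_per_s(boundary_bytes(n), 0.0250),
}
for name, rate in rates.items():
    print(f"{name}: {rate:.1f} MB/s")
```

With these assumptions the computed rates round to the 14, 167, 350, and 1.3 MB/s entries in the table.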
Conclusions
• GPUs have higher rate of performance increase over time than CPUs
- always appealing as “research for the future”
• In certain applications GPUs are 30 to 60 times faster than CPUs
for low precision computations (depending on configuration)
• For certain floating point applications GPU and CPU
performance is comparable
- can be used as coprocessor
• GPUs are often constrained in memory, but preliminary results show it is feasible to use GPUs in parallel
• Cg is a convenient tool (but cgc could be optimized)
• It is feasible to use GPUs for numerical simulations
- we demonstrated it by implementing 2 models (with many applications), and
- used the implementation in benchmarking NV30 and Cg