Data Analysis and Visualization
Numerical Simulations Using Programmable GPUs
Stan Tomov
Brookhaven Science Associates, U.S. Department of Energy
September 5, 2003

Outline
• Motivation
• Literature review
• The graphics pipeline
• Programmable GPUs
• Block diagram of nVidia's GeForce FX
• Some probability-based simulations
  - Monte Carlo simulations
  - Ising model
  - Percolation model
• Implementation
• Performance results and analysis
• Extensions and future work
• Conclusions

Motivation
The GPUs have:
• High flops count (nVidia has quoted a theoretical peak of 200 Gflops for the NV30)

                            Problem size
                    11,540   47,636   193,556   780,308
  OpenGL (GPU), frames/s      189       52        13         3
  Mesa (CPU), frames/s       8.01     1.71      0.44      0.12

  Table 1. GPU vs. CPU in rendering polygons. The GPU (Quadro2 Pro) is
  approximately 30 times faster than the CPU (Pentium III, 1 GHz) in
  rendering polygonal data of various sizes.

• Comparable price/performance (0.1 cents per Mflop)
• High rate of performance increase over time (doubling every 6 months)

Goal: explore the possibility of extending the use of GPUs to non-graphics applications.

Literature review
Using graphics hardware for non-graphics applications:
• Cellular automata
• Reaction-diffusion simulation (Mark Harris, University of North Carolina)
• Matrix multiply (E. Larsen and D. McAllister, University of North Carolina)
• Lattice Boltzmann computation (Wei Li, Xiaoming Wei, and Arie Kaufman, Stony Brook)
• CG and multigrid (J. Bolz et al., Caltech, and N. Goodnight et al., University of Virginia)
• Convolution (University of Stuttgart)

Performance results:
• Significant GPU-over-CPU speedups are reported when the GPU performs low-precision computations (30 to 60 times, depending on the configuration)
• The fact that the operations are low precision is often omitted, which may be confusing:
  - NCSA, University of Illinois assembled a $50,000 supercomputer out of 70 PlayStation 2 consoles, which could theoretically deliver 0.5 trillion operations/second
  - also, current $200 GPUs are capable of 1.2 trillion op/s
• In floating point, the GPU's flops performance is comparable to the CPU's

The graphics pipeline
[Figure: stages of the graphics pipeline]

Programmable GPUs (in particular the NV30)
• Support floating-point operations
• Vertex program
  - Replaces the fixed-function pipeline for vertices
  - Manipulates the data of a single vertex
  - Executes for every vertex
• Fragment program
  - Similar to a vertex program, but executed for every fragment (pixel)
• Programming in Cg:
  - High-level language
  - Looks like C
  - Portable
  - The Cg compiler translates Cg programs into assembly code

Block diagram of the GeForce FX
• AGP 8x graphics bus bandwidth: 2.1 GB/s
• Local memory bandwidth: 16 GB/s
• Chip officially clocked at 500 MHz
• Vertex processor:
  - executes vertex shaders or emulates the fixed-function transformation and lighting (T&L)
• Pixel processor:
  - executes pixel shaders or emulates the fixed-function shading
  - 2 int and 1 float operations, or 2 texture accesses, per clock cycle
• Texture and color interpolators:
  - interpolate texture coordinates and color values
Performance (on processing 4D vectors):
• Vertex ops/sec: 1.5 Gops
• Pixel ops/sec: 8 Gops (int) or 4 Gops (float)
Source: Digit-Life.com, "NVIDIA GeForce FX, or 'Cinema show started'", November 18, 2002.
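To make the fragment-program idea concrete before moving to the simulations, here is a minimal sketch of what a Cg fragment program looks like for the NV30 fragment profile. It is an illustration only, not code from this work; the sampler name state and the pass-through body are placeholders.

    // Minimal Cg fragment program (NV30 / fp30 profile) -- illustrative sketch.
    // It executes once per fragment: it reads this fragment's value from a
    // texture that holds the simulation state and writes a result to the
    // (off-screen) color buffer.
    float4 main(float2 coords : TEXCOORD0,          // lattice coordinates of this fragment
                uniform samplerRECT state) : COLOR  // texture holding the current state
    {
        float4 v = texRECT(state, coords);  // fetch the value stored at this site
        // a real kernel would compute an updated value here
        return v;                           // becomes the new value of this site
    }

The dynamic-texturing approach described later in the talk turns this per-fragment execution into one update of the whole lattice per rendering pass.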
Monte Carlo simulations
• Used in a variety of simulations in physics, finance, chemistry, etc.
• Based on probability and statistics; rely on random numbers
• A classical example: computing the area of a circle
• Computation of expected values:

      E(F) = \sum_{i=1}^{N} F(S_i) P(S_i)                    (1)

  N can be very large: on a 1024 x 1024 lattice of particles, with every
  particle modeled to have k states, N = k^{1024^2}
• Random number generation. We used a linear congruential generator:

      R(n) = (a R(n-1) + b) mod N

Ising model
• A simplified model for magnets (introduced by Wilhelm Lenz in 1920 and further studied by his student Ernst Ising)
• Modeled on a 2D lattice with a "spin" (corresponding to the orientation of electrons) at every cell, pointing either up or down
• Uses temperature to couple two opposing physical principles:
  - minimization of the system's energy
  - maximization of entropy
• We want to compute
  - the expected magnetization:  F(S_i) = N_up(S_i) - N_down(S_i)
  - the expected energy:  F(S_i) = En(S_i) = \sum_{\langle j,k \rangle} S_i(j) S_i(k)
• Evolve the system into "higher probability" states and compute expected values as averages over those states
  - evolving from state to state based on a probabilistic decision is related to so-called Markov chains: W. Gilks, S. Richardson, and D. Spiegelhalter (Eds.), Markov Chain Monte Carlo in Practice, Chapman & Hall, 1996.

Ising model computational procedure
• Choose an absolute temperature of interest T (in Kelvin)
• Color the lattice in a checkerboard manner
• Run consecutive "black" and "white" sweeps
• Change the spin at a site based on the following procedure:
  1. Denote the current state by S and the state with the flipped spin by S'
  2. Compute \Delta E = E(S') - E(S)
  3. If \Delta E \le 0, accept S'; otherwise generate a random R \in [0,1] and accept S' if

         P(S') / P(S) = e^{-\Delta E / (kT)} \ge R,

     where P(S) is given by the Boltzmann probability distribution

         P(S) = e^{-E(S)/(kT)} / \sum_{i=1}^{N} e^{-E(S_i)/(kT)}

Percolation model
• First studied by Broadbent and Hammersley in 1957
• Used in studies of disordered media (usually specified by a probability distribution)
• Applied in studies of various phenomena such as the spread of diseases, flow in porous media, forest fire propagation, clustering, etc.
• Of particular interest are:
  - the threshold of the medium model above which a "spanning cluster" exists
  - relations between different media models
  - the time to reach a steady-state spanning cluster

Implementation
Approaches:
• Pure OpenGL (simulations using the fixed-function pipeline)
• Shaders in assembly
• Shaders in Cg
Dynamic texturing:
• Create a texture T (think of it as a 2D lattice)
• Loop:
  - Render an image using T (into an off-screen buffer)
  - Update T from the resulting image

Performance results and analysis
• Time in seconds (approximate) for different vector flops on the GPU:

                     256x256    512x512
  traffic            0.00063     0.0024
  +, -, *, /         0.00010     0.0003
  cos, sin           0.00026     0.0010
  log, exp           0.00045     0.0015
  if, ?:             0.00016     0.0008

  - 48 B per node: speed limited by the GPU's memory bandwidth (16 GB/s)
  - about 3.5 Gflops, roughly 20x faster than the CPU, but the operations are of low accuracy

• Time in seconds (approximate), including traffic, for different vector flops on the CPU:

                     256x256    512x512    1024x1024
  +, -, *, /          0.0011     0.0046       0.017
  cos, sin            0.0540     0.0650       0.267
  log, exp            0.0609     0.1100       0.426

  - 32 B per node: speed limited by the CPU's memory bandwidth (4.2 GB/s)
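To connect the implementation approach with the Ising-model timings that follow, the sketch below shows how one Metropolis spin update can be written as a Cg fragment program. This is a hypothetical reconstruction, not the code that was actually benchmarked: the names lattice, rnd, and invkT are placeholders, spins are assumed to be stored as +1/-1 in the red channel, the per-site random numbers are assumed to be precomputed (e.g., with the linear congruential generator above) into a second texture, and the ferromagnetic sign convention E(S) = -\sum_{\langle j,k \rangle} S(j) S(k) is used. The checkerboard restriction (updating only black or only white sites in a pass) is assumed to be handled on the host side.

    // Hypothetical Cg fragment program: one Metropolis update per lattice site.
    float4 main(float2 tc : TEXCOORD0,
                uniform samplerRECT lattice,  // current spins, +1/-1 in .x
                uniform samplerRECT rnd,      // uniform random numbers in [0,1] in .x
                uniform float invkT) : COLOR  // 1 / (k*T)
    {
        float s = texRECT(lattice, tc).x;

        // Sum of the four nearest-neighbour spins.
        float nb = texRECT(lattice, tc + float2( 1.0, 0.0)).x
                 + texRECT(lattice, tc + float2(-1.0, 0.0)).x
                 + texRECT(lattice, tc + float2( 0.0, 1.0)).x
                 + texRECT(lattice, tc + float2( 0.0,-1.0)).x;

        // Energy change dE = E(S') - E(S) if this spin is flipped,
        // with E(S) = -sum over neighbour pairs of the spin products.
        float dE = 2.0 * s * nb;

        // Metropolis rule: accept if dE <= 0, otherwise with probability
        // exp(-dE/(kT)).  Since exp(-dE*invkT) >= 1 whenever dE <= 0, a single
        // comparison against the random number covers both cases.
        float r = texRECT(rnd, tc).x;
        float snew = (r < exp(-dE * invkT)) ? -s : s;

        return float4(snew, 0.0, 0.0, 1.0);
    }

Each black or white sweep then amounts to binding such a program, rendering the corresponding half of the lattice into the off-screen buffer, and updating the lattice texture from the result, exactly as in the dynamic-texturing loop above.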
Performance results and analysis (continued)
• GPU and CPU (2.8 GHz) performance on the Ising model:

  Lattice size (not necessarily a power of 2)   64x64    128x128   256x256   512x512   1024x1024
  GPU, sec/frame                                0.0006    0.0023    0.0081     0.033        0.14
  CPU, no optimization, sec/frame               0.0009    0.0024    0.0083     0.032        0.13
  CPU with -O4, sec/frame                       0.0008    0.0020    0.0069     0.026        0.10
  GPU instructions/sec                          0.55 G    0.57 G    0.66 G    0.63 G      0.61 G

• 2.64 Gflops, i.e. 15% utilization of the GPU's theoretical power (too many ifs):
  - if (flag) { ... } : the execution time equals the time to compute the block even when flag = 0 (a branch-free rewrite is sketched after the Conclusions)
• Performance is comparable to that of the visualization-related sample shaders from nVidia
• Cg and assembly:
  - Performance is the same whether the runtime Cg or the generated assembly code is used
  - The generated assembly code is not optimal: we found cases where it could be optimized and performance increased

Extensions and future work
• Code optimization (through optimization of the Cg-generated assembly)
• More applications:
  - QCD?
  - Fluid flow?
• Parallel algorithms (or use the GPU just as a coprocessor)
  - domain-decomposition type algorithms in a cluster environment
  - Motivation: CPU-GPU communication rates, in seconds, for lattices of different sizes:

                                        64x64    128x128   256x256   512x512      speed
    Read boundary (glReadPixels)       0.00016    0.0002    0.0006    0.0024     14 MB/s
    Read all (glReadPixels)            0.00040    0.0015    0.0062    0.0250    167 MB/s
    Write boundary (glDrawPixels)      0.00022    0.0003    0.0007    0.0024     14 MB/s
    Write all (glTexSubImage2D)        0.00020    0.0008    0.0032    0.0120    350 MB/s
    Write boundary (glTexSubImage2D)   0.00050    0.0020    0.0071    0.0250    1.3 MB/s

    Not a bottleneck in a cluster with a 1 Gbit network
• Other ideas?

Conclusions
• GPUs have a higher rate of performance increase over time than CPUs
  - always appealing as "research for the future"
• In certain applications GPUs are 30 to 60 times faster than CPUs for low-precision computations (depending on the configuration)
• For certain floating-point applications the GPU's and the CPU's performance is comparable
  - the GPU can be used as a coprocessor
• GPUs are often constrained in memory, but preliminary results show it is feasible to use GPUs in parallel
• Cg is a convenient tool (but cgc could be optimized)
• It is feasible to use GPUs for numerical simulations
  - we demonstrated this by implementing two models (with many applications), and
  - used the implementation in benchmarking the NV30 and Cg
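As a footnote to the "too many ifs" observation and the code-optimization item above: since an if (flag) { ... } block costs its full execution time on the NV30 even when flag = 0, a data-dependent conditional can often be replaced by straight arithmetic. The fragment below is a hypothetical illustration of that kind of rewrite, not code from the talk's shaders; lerp(a, b, w) = a + w*(b - a) is the standard Cg blend function.

    // Two equivalent per-site updates; 'flag' is assumed to be 0.0 or 1.0.

    // With a branch: the block is paid for even when flag is 0.
    float update_if(float v, float flag)
    {
        if (flag > 0.0) {
            v = 2.0 * v + 1.0;   // some per-site work
        }
        return v;
    }

    // Branch-free: blend the unchanged and the updated values instead.
    float update_blend(float v, float flag)
    {
        return lerp(v, 2.0 * v + 1.0, flag);
    }

Whether the branch-free form actually wins depends on how the Cg compiler maps the conditional, so it is the kind of case-by-case optimization of the generated assembly mentioned under Extensions.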