Wave Propagation Modeling Enabled by GPUs in Underwater Acoustics
Paul Hursky
Heat, Light, and Sound Research Inc.
http://hlsresearch.com
GTC 2013, San Jose, CA, 18-21 March 2013

Outline
• Introduction to underwater sound propagation: typical problems and numerical approaches
• Split-step Fourier Parabolic Equation (SFPE) implemented in CUDA
• Example application for SFPE

Deep water propagation

Shallow water coastal propagation

Ocean is a thin layer at scales we are interested in
• 5-by-75 kilometers
• 100-by-5000 meters
We can take advantage of low angles: paraxial approximation.

Repertoire of models
• Wavenumber integral
• Normal modes
• Ray tracing and Gaussian beams
• Parabolic equation
• Finite elements and boundary elements (currently impractical at range scales desired, except for small focused problems, like target scattering in a small volume)

Wavenumber Integration: Fast Field Programs (FFP)
• FFP is an integral transform technique for horizontally stratified media
• Total field is sum of particular solution (due to specified sources) and some linear combination of homogeneous solutions to depth-separated wave equation
• Boundary conditions are used to determine coefficients of homogeneous solutions in total field at each wavenumber (i.e.
at each value of the k_r transform variable for range)
• Field is calculated using spectral (wavenumber) integral to transform these solutions (at each wavenumber) to spatial domain (range)
• Called FFP because early versions used FFT to evaluate the spectral integrals
• If integral is evaluated using contour integration, the discrete set of poles corresponds to the normal modes

Normal modes
• Normal mode methods are closely related to wavenumber integral methods
• The unforced version of the depth-separated wave equation is solved using an eigenanalysis, in which the eigenvectors are the “normal modes” of vibration and the eigenvalues are the horizontal wavenumbers
• The field is calculated by summing contributions from each mode, weighted according to the source distribution
• This can be viewed as evaluating the wavenumber integral using a sum of residues, where the poles are at the discrete set of mode horizontal wavenumbers

Fast and accurate at low frequency: Normal Mode, Wavenumber Integral, and Parabolic Equation Models
• Wavenumber integral and normal mode models do not handle range-dependent oceans; nevertheless, they provide useful baselines for more complicated models
• Wavenumber integral, normal mode, and PE models are SINGLE FREQUENCY, so broadband solutions require multiple runs and an inverse Fourier transform to synthesize the time-domain waveform; they are VERY accurate at low frequency, but perhaps impractical at high frequency (> 3 kHz)
• We are going to present a Parabolic Equation model that we have implemented in CUDA; details follow in several slides
• First, I’ll briefly cover ray trace and Gaussian beam models: these models are high-frequency and inherently broadband; they can produce a broadband impulse response function, with which an arbitrary source waveform can be convolved

Ray tracing is a useful way to look at propagation, but produces artifacts; Gaussian beams are better
[Figure panels: Single Gaussian beam; Ray tracing; FFP reference solution]

Gaussian beam models are fast
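and accurate at high frequency
[Figure panels: Gaussian beams; Ray tracing; Reference solution]

Split-step Fourier PE Model
• Exploits FFTs (thus CUDA FFT)
• Has been eclipsed by split-step Pade (which uses tri-diagonal solver), which can model wider angles, but Fourier PE still used when computational domain is large (deep ocean)
• Will show shallow and deep water benchmarks
• Will show timing comparisons, CPU vs GPU

Split-step Fourier Parabolic Equation
Helmholtz equation:
$$\frac{\partial^2 p}{\partial r^2} + \frac{1}{r}\frac{\partial p}{\partial r} + \frac{\partial^2 p}{\partial z^2} + k_0^2\, n^2(r,z)\, p = 0$$
Assume outgoing cylindrical wave and far field ($k_0 r \gg 1$):
$$p(r,z) = \Psi(r,z)\, H_0^{(1)}(k_0 r), \qquad H_0^{(1)}(k_0 r) \cong \sqrt{\frac{2}{\pi k_0 r}}\; e^{i\left(k_0 r - \pi/4\right)}$$
Left with:
$$\frac{\partial^2 \Psi}{\partial r^2} + 2 i k_0 \frac{\partial \Psi}{\partial r} + \frac{\partial^2 \Psi}{\partial z^2} + k_0^2\left(n^2 - 1\right)\Psi = 0$$
Paraxial approximation ($\partial^2\Psi/\partial r^2 \ll 2 i k_0\, \partial\Psi/\partial r$) leaves standard parabolic equation (Hardin and Tappert, 1973):
$$2 i k_0 \frac{\partial \Psi}{\partial r} + \frac{\partial^2 \Psi}{\partial z^2} + k_0^2\left(n^2 - 1\right)\Psi = 0$$
Jensen, Kuperman, Porter, and Schmidt, Computational Ocean Acoustics, Second Edition, Springer, 2011.

Split-step Fourier Parabolic Equation
Use Fourier transform (FFT) to calculate $\partial\Psi/\partial z$, $\partial^2\Psi/\partial z^2$:
$$\Psi(r,z) \rightarrow \tilde\Psi(r,k_z), \qquad \frac{\partial \Psi}{\partial z} \rightarrow i k_z \tilde\Psi(r,k_z), \qquad \frac{\partial^2 \Psi}{\partial z^2} \rightarrow -k_z^2\, \tilde\Psi(r,k_z)$$
This first order differential equation is basis for range step:
$$2 i k_0 \frac{\partial \tilde\Psi}{\partial r} - k_z^2\, \tilde\Psi + k_0^2\left(n^2 - 1\right)\tilde\Psi = 0$$
$$\tilde\Psi(r,k_z) = \tilde\Psi(r_0,k_z)\, \exp\!\left[-\frac{k_0^2\left(n^2 - 1\right) - k_z^2}{2 i k_0}\,(r - r_0)\right]$$
Jensen, Kuperman, Porter, and Schmidt, Computational Ocean Acoustics, Second Edition, Springer, 2011.

Split-step Fourier Parabolic Equation
Range step can be split several ways:
$$\frac{\partial \Psi}{\partial r} = (A + B)\,\Psi, \qquad A = \frac{i k_0}{2}\left(n^2 - 1\right), \qquad B = \frac{i}{2 k_0}\frac{\partial^2}{\partial z^2}$$
$$\Psi(r_0 + \Delta r, z) \approx e^{A \Delta r}\, e^{B \Delta r}\, \Psi(r_0, z)$$
$$\Psi(r_0 + \Delta r, z) \approx e^{A \Delta r / 2}\, e^{B \Delta r}\, e^{A \Delta r / 2}\, \Psi(r_0, z)$$
$$\Psi(r_0 + \Delta r, z) \approx e^{B \Delta r / 2}\, e^{A \Delta r}\, e^{B \Delta r / 2}\, \Psi(r_0, z)$$
Name “split-step”: A refraction and B diffraction step
Jensen, Kuperman, Porter, and Schmidt, Computational Ocean Acoustics, Second Edition, Springer, 2011.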
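The range step described on the slides above can be sketched in a few lines. This is a minimal illustration, not the CUDA implementation described in this talk. It assumes the simplest splitting, in which the diffraction phase is applied in the vertical wavenumber domain and the refraction phase in the depth domain: Ψ(r0+Δr, z) = exp[i k0 (n²−1) Δr / 2] · F⁻¹{ exp[−i Δr kz² / (2 k0)] · F{Ψ(r0, z)} }. It also assumes a periodic depth grid with no surface or seabed treatment, a Gaussian-shaped starting field, and a small pure-Python radix-2 FFT standing in for FFTW or CUFFT. The names `fft`, `ifft`, `sspe_step` and all parameter values are hypothetical.

```python
import cmath
import math


def fft(x, inverse=False):
    """Unnormalized recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def ifft(x):
    """Inverse FFT including the 1/N normalization."""
    n = len(x)
    return [v / n for v in fft(x, inverse=True)]


def sspe_step(psi, n2, k0, dz, dr):
    """Advance psi(z) by one range step dr (assumed splitting, see text)."""
    n = len(psi)
    spectrum = fft(psi)
    for j in range(n):
        m = j if j < n // 2 else j - n        # signed FFT bin index
        kz = 2.0 * math.pi * m / (n * dz)     # vertical wavenumber of bin j
        # Diffraction: phase exp(-i * dr * kz^2 / (2 * k0)) in the kz domain.
        spectrum[j] *= cmath.exp(-1j * dr * kz * kz / (2.0 * k0))
    psi = ifft(spectrum)
    # Refraction: phase exp(i * k0/2 * (n^2 - 1) * dr) in the depth domain.
    return [p * cmath.exp(0.5j * k0 * (n2_j - 1.0) * dr)
            for p, n2_j in zip(psi, n2)]


# Hypothetical example values, chosen only for illustration.
c0 = 1500.0                      # reference sound speed, m/s
freq = 100.0                     # source frequency, Hz
k0 = 2.0 * math.pi * freq / c0   # reference wavenumber, rad/m
nz, dz, dr = 512, 1.0, 10.0      # depth samples, depth step (m), range step (m)
zs = 256.0                       # source depth, m

# Gaussian-shaped starting field centred on the source depth.
psi0 = [math.sqrt(k0) * math.exp(-0.5 * (k0 * (j * dz - zs)) ** 2)
        for j in range(nz)]
# Squared index of refraction for a sound speed that grows slowly with depth.
n2 = [(c0 / (c0 + 0.02 * j * dz)) ** 2 for j in range(nz)]

psi = psi0
for _ in range(20):              # march 200 m in range
    psi = sspe_step(psi, n2, k0, dz, dr)

e0 = sum(abs(p) ** 2 for p in psi0)
e1 = sum(abs(p) ** 2 for p in psi)
print("relative energy drift:", abs(e1 - e0) / e0)
```

Both phase factors have unit modulus when n² is real, so the printed energy drift should sit at round-off level. That makes a convenient sanity check before attenuation or boundary handling is added. On a GPU, the two transforms would presumably map onto a library FFT and the two phase multiplications onto elementwise kernels.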
Split-step Fourier PE Model
Reasonably complete implementation:
• Starters: Gaussian, Greene, Thomson, normal mode, RAM self-starter
• Variable density via reduced pressure, seabed attenuation
• Three forms of operator splitting, including Thomson-Chapman splitting
Implemented on:
• Host CPU, in Matlab
• Host CPU, in C using FFTW ($562, single-core and multiple-core versions with OpenMP)
• NVIDIA GTX-460, in CUDA C using CUDA FFT ($250, Fermi architecture, 336 cores, 1GB RAM)
Getting acceleration of roughly 2x to 35x on GPU relative to CPU

Pekeris waveguide benchmark
[Figure panels: Scooter; Split step, Thomson starter, 40 degrees]

Wedge benchmark
[Figure panels: RAM; Split step, Self-starter]

Demo
• Shows “race” between CPU (FFTW, OpenMP) and GPU (NVIDIA CUDA and CUFFT) models, both with a real-time display; CPU version gets a 10-second head start
• Watch how fast the CPU version runs in first 10 seconds, and then compare with GPU version, which is also sharing computer with CPU version
• Full set of examples in YouTube video: search for “Split-step Fourier CPU vs GPU”

Race: CPU gets 10-second head start

Dickins seamount

Examples compared: GPU vs CPU
[Figure panels: m5rd; w500; dickins]

Desktop workstation
Case     GPU sec  CPU sec  Time ratio  GPU GFLOPs/sec  CPU GFLOPs/sec  GFLOPs ratio  GPU threads  CPU threads
m5rd       11.05   319.65       28.93       34.632023        1.197348         28.92           32            1
m5rd        9.83   343.03       34.90       38.945095        1.115724         34.91           64            2
m5rd        9.34   310.89       33.29       40.984165        1.231072         33.29          128            4
m5rd        9.25   336.18       36.34       41.371372        1.138483         36.34          256            8
dickins     3.11    68.69       22.09       38.460396        1.741282         22.09           32            1
dickins     2.83     67.8       23.96       42.318722        1.763977         23.99           64            2
dickins     2.71       57       21.03       44.200207        2.098401         21.06          128            4
dickins     2.69    64.93       24.14       44.387684        1.841962         24.10          256            8
w500        2.98    19.56        6.56       11.388378        1.733704          6.57           32            1
w500        2.89    18.61        6.44       11.719803        1.822122          6.43           64            2
w500        2.87    15.64        5.45        11.80256        2.167825          5.44          128            4
w500        2.87    15.96        5.56       11.806542        2.124061          5.56          256            8
Totals (sec): GPU 62.42, CPU 1621.98
Totals (min): GPU 1.04, CPU 27.033
CPU: Intel Core i7 950 @ 3.07 GHz (4 cores, 8 threads)
GPU: GeForce GTX 460, 7 MPs x 48 cores = 336 cores, 1.35
GHz, CUDA 4.0

Profiler screen shot: m5rd case, 128 threads per block

Predicting exposure of marine mammals to man-made noise sources
• Software product (called Simple) developed under NOAA SBIR funding (Phase I and II)
• Used by NOAA to assess impact of man-made noise on marine environment for environmental impact assessments
• Simple enables non-acoustic specialists to produce reliable predictions of sound pressure levels due to variety of sources (cargo ships, oil exploration ships, air guns, pile drivers)
• Given sources at particular locations, Simple calculates how loud sound gets at remote location, where marine mammals may be located
• Forms map of loudness relative to source locations

User selects site to work in
Each site has databases for:
• sound speed profiles (by month)
• bathymetry
• seabed type (hard, soft)

User places sources and “pods” of marine mammals on the map
• User selects from 122 different source types and sets locations
• User selects from 125 marine mammal species and sets locations and densities, or relies upon OBIS-SEAMAP database of previous sightings

Appropriate model is run along each radial from each source to predict sound pressure levels in vertical planes
[Figure panel: Gaussian beam model]

Parabolic equation alternative: fast and accurate at low frequency and shallow water
[Figure panel: Parabolic equation model]

Gaussian beams are versatile, but here we see too few rays have made it past seamount
[Figure panel: Gaussian beam model]

Parabolic equation handles full wave effects like diffraction better, particularly at low frequency
[Figure panel: Parabolic equation model]

Once calculations are complete along all radials, map of relevant metric is displayed (e.g.
“sound exposure level” or “peak pressure”)

Challenges for CUDA (perhaps OptiX)
• 3D features such as bathymetric canyons and internal waves cannot be handled using Nx2D modeling and require 3D modeling formulations
• Acoustic communications and reverberation observed in multi-static active sonar systems require significant bandwidth at high frequency
• Modeling acomms channel must handle reflections from moving ocean surface and from the seabed when the platform is moving; this produces time-varying wideband Doppler effects

Summary
• Initial implementation of split-step Fourier PE model served as proof-of-concept that GPGPU was a viable path for our work
• This work generated healthy interest from our sponsors
• We are engaged in several projects where GPGPU technology is an important ingredient
• Interested to see how GPGPU intersects with low-power mobile hardware like Tegra for use in autonomous vehicles

Gaming laptop
Case     GPU sec  CPU sec  Time ratio  GPU GFLOPs/sec  CPU GFLOPs/sec  GFLOPs ratio  GPU threads  CPU threads
m5rd       35.52   374.81       10.55       10.776348        1.021144         10.55           32            1
m5rd       35.82   547.66       15.29       10.686275         0.69885         15.29           64            2
m5rd       36.19   486.57       13.44       10.574224        0.786596         13.44          128            4
m5rd       36.01   529.34       14.70       10.627154        0.723031         14.70          256            8
dickins    10.45    81.27        7.78       11.448703        1.471658          7.78           32            1
dickins    10.52   109.42       10.40       11.373087        1.093098         10.40           64            2
dickins    10.65    91.34        8.58       11.229033        1.309453          8.58          128            4
dickins     10.6   104.82        9.89       11.279705        1.141046          9.89          256            8
w500           8    23.02        2.88        4.239896        1.472965          2.88           32            1
w500        7.99    29.87        3.74        4.240794        1.134978          3.74           64            2
w500        8.02    25.17        3.14        4.226517        1.347116          3.14          128            4
w500        8.03    25.45        3.17        4.224125        1.332057          3.17          256            8
Totals (sec): GPU 217.8, CPU 2403.29
Totals (min): GPU 3.63, CPU 40.05
CPU: Intel Core i7 Q 720 @ 1.60 GHz, quad core (4 cores, 8 threads)
GPU: GeForce GTS 360M, 12 MPs x 8 cores = 96 cores, 1.32 GHz, CUDA 4.0

GPU server
Case     GPU sec  CPU sec  Time ratio  GPU GFLOPs/sec  CPU GFLOPs/sec  GFLOPs ratio  GPU threads  CPU threads
m5rd       20.77   348.75       16.79       18.425264        1.097438         16.79           32            1
m5rd       20.58   334.06       16.23       18.599964        1.145707         16.23           64            2
m5rd       20.63   302.02       14.64       18.554523        1.267248         14.64          128            4
m5rd       20.66   315.26       15.26       18.525311        1.214019         15.26          256            8
dickins     6.21    80.19       12.91       19.275263        1.491578         12.92           32            1
dickins     6.19    73.58       11.89       19.335495        1.625393         11.90           64            2
dickins      6.2    56.91        9.18        19.29336        2.101644          9.18          128            4
dickins     6.21     61.4        9.89       19.256332        1.947969          9.89          256            8
w500        6.42    20.57        3.20         5.27944         1.64818          3.20           32            1
w500        6.44    18.72        2.91        5.264232        1.810779          2.91           64            2
w500        6.45    16.23        2.52        5.257461        2.088724          2.52          128            4
w500        6.47    15.27        2.36        5.243372        2.220888          2.36          256            8
Totals (sec): GPU 133.23, CPU 1627.69
Totals (min): GPU 2.2205, CPU 27.1281667
CPU: Intel Xeon X5550 @ 2.67 GHz, dual quad core (8 cores, 16 threads)
GPU: GeForce GTX 285, 30 MPs x 8 cores = 240 cores, 1.48 GHz, CUDA 3.2

Profiler screen shot: w500 case, 128 threads per block

Pekeris waveguide benchmark
[Figure panels: RAM; Split step, Thomson starter, 40 degrees]
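As a quick check on the timing tables, the overall speedup per machine can be recomputed from the reported totals. The sketch below only divides total CPU seconds by total GPU seconds. The dictionary keys and the `speedup` helper are names chosen here for illustration, and each total aggregates runs over all three test cases and all thread settings.

```python
# Total run times in seconds as reported in the timing tables: (GPU, CPU).
totals_sec = {
    "desktop": (62.42, 1621.98),   # GeForce GTX 460 vs Intel Core i7 950
    "laptop": (217.8, 2403.29),    # GeForce GTS 360M vs Intel Core i7 Q 720
    "server": (133.23, 1627.69),   # GeForce GTX 285 vs Intel Xeon X5550
}


def speedup(gpu_sec, cpu_sec):
    """Ratio of CPU time to GPU time; values above 1 favor the GPU."""
    return cpu_sec / gpu_sec


overall = {name: speedup(gpu, cpu) for name, (gpu, cpu) in totals_sec.items()}
for name, ratio in overall.items():
    print(f"{name}: {ratio:.1f}x")
```

These aggregate ratios fall inside the per-run range of roughly 2 to 36 listed in the tables.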