RTM at Petascale and Beyond
Michael Perrone
IBM Master Inventor
Computational Sciences Center, IBM Research
© 2011 IBM Corporation

RTM (Reverse Time Migration) Seismic Imaging on BG/Q
• RTM is a widely used imaging technique for oil and gas exploration, particularly under subsalt formations
• Over $5 trillion of subsalt oil is believed to exist in the Gulf of Mexico
• Imaging subsalt regions of the Earth is extremely challenging
• Industry anticipates exascale need by 2020

Bottom Line: Seismic Imaging
We can make RTM 10 to 100 times faster. How?
► Abandon embarrassingly parallel RTM
► Use domain-partitioned, multisource RTM
System requirements
► High communication bandwidth
► Low communication latency
► Lots of memory
The approach extends equally well to FWI (full waveform inversion).

Take-Home Messages
Embarrassingly parallel is not always the best approach
It is crucial to know where bottlenecks exist
Algorithmic changes can dramatically improve performance

Compute Performance on New Hardware
[Bar chart: kernel run time on old hardware vs. new hardware 1 and new hardware 2, showing the kernel performance improvement]

Compute Performance on New Hardware
[Bar chart: the same comparison with disk IO included; total run time barely improves]
Need to track end-to-end performance

Bottlenecks: Memory IO
► GPU: 0.1 B/F (100 GB/s memory bandwidth, 1 TF/s peak)
► BG/P: 1.0 B/F (13.6 GB/s, 13.6 GF/s)
► BG/Q: 0.2 B/F (43 GB/s, 204.8 GF/s)
► BG/Q L2: 1.5 B/F (>300 GB/s, 204.8 GF/s)

GPUs for Seismic Imaging?
x86/GPU [old results, 2x now]
► 17 billion stencils/second
► nVidia/INRIA collaboration
  – Velocity model: 560x560x905
  – Iterations: 22760
BlueGene/P
► 40 billion stencils/second
► Comparable model size/complexity
► Partial optimization
  – MPI not overlapped
  – Kernel optimization ongoing
► BlueGene/Q will be even faster
Abdelkhalek, R., Calandra, H., Coulaud, O., Roman, J., Latu, G. 2009. Fast Seismic Modeling and Reverse Time Migration on a GPU Cluster. In International Conference on High Performance Computing & Simulation (HPCS'09).

Reverse Time Migration (RTM)
[Diagram: marine seismic acquisition for one shot; a ship tows a source and a receiver array, with scales of ~1 km and ~5 km indicated]
Receiver data: R(x,y,z,t)
Source data: S(x,y,z,t)

RTM: Reverse Time Migration
Use the 3D wave equation to model sound in the Earth.

Forward (source):
$$\left(\partial_x^2 + \partial_y^2 + \partial_z^2 - \frac{1}{v^2(x,y,z)}\,\partial_t^2\right) P_S(x,y,z,t) = S(x,y,z,t)$$

Reverse (receiver):
$$\left(\partial_x^2 + \partial_y^2 + \partial_z^2 - \frac{1}{v^2(x,y,z)}\,\partial_t^2\right) P_R(x,y,z,t) = R(x,y,z,t)$$

Imaging condition:
$$I(x,y,z) = \sum_t P_S(x,y,z,t)\, P_R(x,y,z,t)$$

Implementing the Wave Equation
Finite difference in time:
$$\partial_t^2 P(x,y,z,t) \approx \frac{P(x,y,z,t+1) - 2\,P(x,y,z,t) + P(x,y,z,t-1)}{\Delta t^2}$$
Finite difference in space:
$$\partial_x^2 P(x,y,z,t) \approx \sum_n g_x(n)\, P(x+n,y,z,t)$$
$$\partial_y^2 P(x,y,z,t) \approx \sum_n g_y(n)\, P(x,y+n,z,t)$$
$$\partial_z^2 P(x,y,z,t) \approx \sum_n g_z(n)\, P(x,y,z+n,t)$$
Plus absorbing boundary conditions, interpolation, compression, etc.
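The finite-difference scheme above maps directly to a few lines of array code. The following is a minimal sketch in Python/NumPy, not IBM's optimized kernel: it assumes standard 8th-order central-difference coefficients for g(n), the same along all three axes, and uses periodic wraparound via np.roll where a production code would apply the absorbing boundary conditions mentioned above. All grid dimensions and physical constants in the usage example are illustrative.

```python
import numpy as np

# Standard 8th-order central-difference coefficients for the second
# derivative: g(0) is the center tap, g(1..4) the symmetric off-center taps.
G = np.array([-205.0 / 72, 8.0 / 5, -1.0 / 5, 8.0 / 315, -1.0 / 560])

def laplacian(p, dx):
    """Sum over axes of sum_n g(n) * P(. + n), as in the spatial stencils above."""
    lap = 3 * G[0] * p  # center tap, applied once per axis
    for axis in range(3):
        for n in range(1, len(G)):
            # np.roll wraps around: a stand-in for absorbing boundaries.
            lap += G[n] * (np.roll(p, n, axis) + np.roll(p, -n, axis))
    return lap / dx ** 2

def step(p_prev, p_curr, v, src, dt, dx):
    """Leapfrog update: P(t+1) = 2 P(t) - P(t-1) + dt^2 v^2 (lap(P) - S)."""
    return 2 * p_curr - p_prev + (dt * v) ** 2 * (laplacian(p_curr, dx) - src)

# Illustrative usage: a point source in a constant-velocity cube.
n, dx, dt, vel = 64, 10.0, 1e-3, 1500.0
p0, p1 = np.zeros((n, n, n)), np.zeros((n, n, n))
s = np.zeros((n, n, n))
s[n // 2, n // 2, n // 2] = 1.0
p2 = step(p0, p1, vel, s, dt, dx)
```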
RTM Algorithm (for each shot)
Load data
► Velocity model v(x,y,z)
► Source & receiver data
Forward propagation
► Calculate P_S(x,y,z,t)
► Every N timesteps
  – Compress P(x,y,z,t)
  – Write P(x,y,z,t) to disk/memory
Backward propagation
► Calculate P_R(x,y,z,t)
► Every N timesteps
  – Read P(x,y,z,t) from disk/memory
  – Decompress P(x,y,z,t)
  – Calculate partial sum of I(x,y,z)
Merge I(x,y,z) with global image
[Timeline: F(N), R(N), I(N) at t=N; F(2N), R(2N), I(2N) at t=2N; ...; F(kN), R(kN), I(kN) at t=kN]

Embarrassingly Parallel RTM
Process shots in parallel, one per slave node
[Diagram: data archive (disk) feeds a master node, which distributes a subset of the model for each shot (~100k+ shots) to slave nodes, each with its own scratch disk]
► Scratch disk bottleneck

Domain-Partitioned Multisource RTM
Process all data at once with domain decomposition
[Diagram: data archive (disk) feeds a master node; shots are merged and the model is partitioned across slave nodes]
► Small partitions mean the forward wave can be stored locally: no disks

Multisource RTM
[Diagram: full velocity model with a velocity subset, source, and receiver data]
Linear superposition principle:
$$\left(\partial_x^2 + \partial_y^2 + \partial_z^2 - \frac{1}{v^2(x,y,z)}\,\partial_t^2\right) P_i(x,y,z,t) = S_i(x,y,z,t)$$
So N sources can be merged:
$$\left(\partial_x^2 + \partial_y^2 + \partial_z^2 - \frac{1}{v^2(x,y,z)}\,\partial_t^2\right) \sum_{i=1}^{N} P_i(x,y,z,t) = \sum_{i=1}^{N} S_i(x,y,z,t)$$
► Accelerates RTM by a factor of N
► The finite receiver array acts as a nonlinear filter on the data: $R_\text{measured}(x,y,z,t) = M\,R_\text{full}(x,y,z,t)$
► The nonlinearity leads to "crosstalk" noise, which needs to be minimized

3D RTM Scaling (Partial Optimization)
[Chart: scaling for 512x512x512 and 1024x1024x1024 models]
► Scaling improves for larger models

GPU Scaling is Comparatively Poor
► Tsubame supercomputer (Japan)
► GPUs achieve only 10% of peak performance (100x increase for 1000 nodes)
Okamoto, T., Takenaka, H., Nakamura, T. and Aoki, T. 2010. Accelerating large-scale simulation of seismic wave propagation by multi-GPUs and three-dimensional domain decomposition. Earth Planets Space, November 2010.

Physical Survey Size Mapped to BG/Q L2 Cache
► Isotropic RTM with minimum v = 1.5 km/s
► 10 points per wavelength (5 would reduce the numbers below by 8x)
► Mapping the entire survey volume, not a subset (enables multisource)
[Chart (y-axis 1 to 1000, log scale): grid sizes (512)^3, (4096)^3, (16384)^3 and survey volumes 512, 4096, 16384 km^3 vs. maximum imaging frequency (0 to 80 Hz)]

Snapshot Data Easily Fits in Memory (No Disk Required)
[Chart: number of uncompressed snapshots that can be stored (0 to 10,000) vs. number of nodes (128 to 10,240), for model sizes from 500^3 to 1600^3; BG/Q provides 4x more capacity]
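The chart's arithmetic is easy to reproduce. The sketch below assumes 16 GB of memory per BG/Q node (the "4x more capacity" note suggests a 4 GB BG/P baseline) and 4-byte floats, and it ignores the memory consumed by the velocity model, wavefields, and system software, so real capacities are somewhat lower.

```python
def snapshots_in_memory(model_edge, nodes, gb_per_node=16, bytes_per_cell=4):
    """Uncompressed snapshots that fit in aggregate node memory."""
    snapshot_bytes = model_edge ** 3 * bytes_per_cell
    return nodes * gb_per_node * 2 ** 30 // snapshot_bytes

# e.g. a 1024^3 model on 2048 nodes: each snapshot is 4 GiB and
# aggregate memory is 32 TiB, so ~8192 snapshots fit.
print(snapshots_in_memory(1024, 2048))  # 8192
```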
Comparison
Embarrassingly parallel RTM
► Coarse-grain communication
► Coarse-grain synchronization
► Disk IO bottleneck
Partitioned RTM
► Fine-grain communication
► Fine-grain synchronization
► No scratch disk
Partitioned RTM needs low latency and high bandwidth: Blue Gene provides both.

Conclusion: RTM Can Be Dramatically Accelerated
Algorithmic:
► Adopt partitioned, multisource RTM
► Abandon embarrassingly parallel implementations
Hardware:
► Increase communication bandwidth
► Decrease communication latency
► Reduce node nondeterminism
Advantages:
► Can process larger models and scales well
► Avoids the scratch disk IO bottleneck
► Improves RAS & MTBF: no disk means no moving parts
Disadvantages:
► Must handle shot "crosstalk" noise
  – Methods exist; research continuing (see the sketch below)
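To make the overall flow concrete, here is a schematic of the partitioned multisource pipeline in Python/NumPy: shots are merged by linear superposition, forward snapshots are kept in memory rather than on scratch disk, and the backward pass would correlate against them via the imaging condition. The random +/-1 source encoding shown is one published way to suppress crosstalk; the deck only states that such methods exist. Grid sizes, shot locations, and the low-order Laplacian are illustrative, not IBM's implementation.

```python
import numpy as np

def lap(p, dx):
    """2nd-order 3D Laplacian (periodic boundaries, for brevity)."""
    out = -6.0 * p
    for ax in range(3):
        out += np.roll(p, 1, ax) + np.roll(p, -1, ax)
    return out / dx ** 2

def propagate(source_at, v, steps, dt, dx, snap_every, shape):
    """Leapfrog propagation, storing a snapshot in memory every few steps."""
    p_prev, p_curr, snaps = np.zeros(shape), np.zeros(shape), {}
    for t in range(steps):
        p_prev, p_curr = p_curr, (
            2 * p_curr - p_prev
            + (dt * v) ** 2 * (lap(p_curr, dx) - source_at(t))
        )
        if t % snap_every == 0:
            snaps[t] = p_curr.copy()  # kept in local memory: no scratch disk
    return snaps

shape, dx, dt, vel = (32, 32, 32), 10.0, 1e-3, 1500.0
rng = np.random.default_rng(0)
shot_locs = [(8, 8, 4), (24, 8, 4), (16, 24, 4)]    # hypothetical shot points
enc = rng.choice([-1.0, 1.0], size=len(shot_locs))  # random-sign encoding

def merged_source(t):
    """Linear superposition of N encoded shots, all fired at t = 0."""
    s = np.zeros(shape)
    if t == 0:
        for a, (i, j, k) in zip(enc, shot_locs):
            s[i, j, k] = a
    return s

fwd_snaps = propagate(merged_source, vel, 200, dt, dx, 10, shape)
# The backward pass would propagate the (identically encoded) receiver data
# in reverse time and accumulate I(x,y,z) = sum_t P_S * P_R at the stored
# snapshot times; the encoding makes crosstalk terms average toward zero.
```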