Reverse Time Migration on GMAC
NVIDIA GTC, 22nd of September, 2010

Javier Cabezas (BSC), Mauricio Araya (Repsol/BSC), Isaac Gelado (UPC/UIUC), Thomas Bradley (NVIDIA), Gladys González (Repsol), José María Cela (UPC/BSC), Nacho Navarro (UPC/BSC)

Outline
• Introduction
• Reverse Time Migration on CUDA
• GMAC at a glance
• Reverse Time Migration on GMAC
• Conclusions

Reverse Time Migration on CUDA └ RTM
• RTM generates an image of the subsurface layers
• Uses traces recorded by sensors in the field
• RTM's algorithm:
  1. Propagation of a modeled wave (forward in time)
  2. Propagation of the recorded traces (backward in time)
  3. Correlation of the forward and backward wavefields
     • Last forward wavefield with the first backward wavefield
• FDTD is preferred to FFT:
  • 2nd-order finite differencing in time
  • High-order finite differencing in space

Introduction └ Barcelona Supercomputing Center (BSC)
• BSC and Repsol: the Kaleidoscope project
  • Develop better algorithms/techniques for seismic imaging
  • We focused on Reverse Time Migration (RTM), as it is the most popular seismic imaging technique for depth exploration
• Due to the high computational power required, the project started a quest for the most suitable hardware:
  • PowerPC: scalability issues
  • Cell: good performance (in production @ Repsol), difficult programmability
  • FPGA: potentially the best performance, a programmability nightmare
  • GPUs: 5x speedup vs Cell (GTX280), but what about programmability?

Outline
• Introduction
• Reverse Time Migration on CUDA
  → General approach
  • Disk I/O
  • Domain decomposition
  • Overlapping computation and communication
• GMAC at a glance
• Reverse Time Migration on GMAC
• Conclusions

Reverse Time Migration on CUDA └ General approach
• We focus on the host-side part of the implementation
  1. Avoid memory transfers between host and GPU memories
     • Implement as many computations as possible on the GPU
  2. Hide the latency of memory transfers
     • Overlap memory transfers with kernel execution
  3. Take advantage of the PCIe full-duplex capabilities (Fermi)
     • Overlap deviceToHost and hostToDevice memory transfers

Reverse Time Migration on CUDA └ General approach
• Forward: 3D-Stencil → Absorbing Boundary Conditions → Source insertion → Compression → Write to disk
• Backward: 3D-Stencil → Absorbing Boundary Conditions → Traces insertion → Read from disk → Decompression → Correlation

Reverse Time Migration on CUDA └ General approach
• Data structures used in the RTM algorithm
  • Read/Write structures:
    • 3D volume for the wavefield (can be larger than 1000x1000x1000 points)
    • State of the wavefield in previous time-steps, to compute finite differences in time
    • Some extra points in each direction at the boundaries (halos)
  • Read-only structures:
    • A 3D volume of the same size as the wavefield (the velocity model)
    • Geophones' recorded traces: time-steps x #geophones

Reverse Time Migration on CUDA └ General approach
• Data flow-graph (forward): 3D-Stencil → ABC → Source → Compress, operating on the wavefields plus the constant read-only data (velocity model, geophones' traces)

Reverse Time Migration on CUDA └ General approach
• Simplified data flow-graph (forward): RTM Kernel → Compress, operating on the wave-fields plus the constant read-only data (velocity model, geophones' traces)
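To make the "RTM Kernel" box concrete, here is a minimal sketch of the kind of update it performs: 2nd-order finite differencing in time and high-order (here 8th-order) differencing in space, as listed on the RTM slide. This is illustrative code, not the deck's actual kernel; the coefficient array c, the vel2dt2 volume (v² · dt² per point) and the padded dimensions are assumptions.

    // Illustrative sketch only: a 2nd-order-in-time, 8th-order-in-space
    // wave-equation step. Assumes volumes padded with 4-point halos and
    // hypothetical finite-difference coefficients c[0..4].
    #include <cuda_runtime.h>

    #define RADIUS 4

    __constant__ float c[RADIUS + 1];   // FD coefficients for the Laplacian

    __global__ void rtm_step(float *next, const float *curr, const float *prev,
                             const float *vel2dt2,   // v^2 * dt^2 per point
                             int dimX, int dimY, int dimZ)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x + RADIUS;
        int y = blockIdx.y * blockDim.y + threadIdx.y + RADIUS;
        if (x >= dimX - RADIUS || y >= dimY - RADIUS) return;

        for (int z = RADIUS; z < dimZ - RADIUS; ++z) {    // march along z
            size_t i = ((size_t)z * dimY + y) * dimX + x;
            float lap = 3.0f * c[0] * curr[i];            // center point, 3 axes
            for (int r = 1; r <= RADIUS; ++r)
                lap += c[r] * (curr[i - r] + curr[i + r]                    // x
                             + curr[i - (size_t)r * dimX]
                             + curr[i + (size_t)r * dimX]                   // y
                             + curr[i - (size_t)r * dimX * dimY]
                             + curr[i + (size_t)r * dimX * dimY]);          // z
            // 2nd-order time step: u(t+dt) = 2u(t) - u(t-dt) + v^2 dt^2 * lap
            next[i] = 2.0f * curr[i] - prev[i] + vel2dt2[i] * lap;
        }
    }

Each time-step rotates the prev/curr/next pointers, which is why the wavefield state of previous time-steps has to be kept around.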
Reverse Time Migration on CUDA └ General approach
• Control flow-graph (forward):
  • Start → i = 0 → RTM Kernel → (every N steps, i % N == 0: Compress → toHost → Disk I/O) → i++ → repeat while i < steps → End
  • Computation (RTM Kernel): runs on the GPU
  • Compress and transfer to disk (deviceToHost + Disk I/O): runs on the CPU
    • Performed every N steps
    • Can run in parallel with the next compute steps

Outline
• Introduction
• Reverse Time Migration on CUDA
  • General approach
  → Disk I/O
  • Domain decomposition
  • Overlapping computation and communication
• GMAC at a glance
• Reverse Time Migration on GMAC
• Conclusions

Reverse Time Migration on CUDA └ Disk I/O
• GPU → Disk transfers are very time-consuming
  [Timeline: kernels K1..K4, then Compress, toHost and Disk I/O run serialized before K5]
• Transferring to disk can be overlapped with the next (compute-only) steps
  [Timeline: toHost and Disk I/O run on the CPU while K5..K8 run on the GPU]

Reverse Time Migration on CUDA └ Disk I/O
• Single transfer: wait for all the data to be in host memory before starting the disk I/O
  [Timeline: one long deviceToHost, then Disk I/O]
• Multiple transfers: overlap deviceToHost transfers with disk I/O
  • Double buffering
  [Timeline: small toH copies interleaved with Disk I/O writes]

Reverse Time Migration on CUDA └ Disk I/O
• CUDA-RT limitations
  • GPU memory is accessible by the owner host thread only
  → deviceToHost transfers must be performed by the compute thread
  [Diagram: the GPU address space is reachable from the compute thread but not from the I/O thread]

Reverse Time Migration on CUDA └ Disk I/O
• CUDA-RT implementation (single transfer)
  • CUDA streams must be used so as not to block GPU execution
  → An intermediate page-locked buffer must be used: for real-size problems the system can run out of memory!

Reverse Time Migration on CUDA └ Disk I/O
• CUDA-RT implementation (multiple transfers)
  • Besides launching kernels, the compute thread must program and monitor several deviceToHost transfers while executing the next compute-only steps on the GPU
  → Lots of synchronization code in the compute thread
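For illustration, a minimal sketch of what the multiple-transfer scheme asks of the compute thread: chunking the snapshot through two page-locked buffers so the fwrite of one chunk overlaps the deviceToHost copy of the next. The function name, chunk size and buffer count are assumptions, not the deck's code.

    // Hedged sketch: stream a snapshot to disk in chunks, overlapping each
    // deviceToHost copy with the fwrite of the previous chunk.
    #include <stdio.h>
    #include <cuda_runtime.h>

    void snapshot_to_disk(const float *d_volume, size_t bytes, FILE *f,
                          cudaStream_t stream)
    {
        const size_t CHUNK = (size_t)16 << 20;   // 16 MB per transfer
        void *pinned[2];
        cudaHostAlloc(&pinned[0], CHUNK, cudaHostAllocDefault);
        cudaHostAlloc(&pinned[1], CHUNK, cudaHostAllocDefault);

        size_t prev_size = 0;
        int cur = 0;
        for (size_t off = 0; off < bytes; off += CHUNK) {
            size_t size = (bytes - off < CHUNK) ? bytes - off : CHUNK;
            cudaMemcpyAsync(pinned[cur], (const char *)d_volume + off, size,
                            cudaMemcpyDeviceToHost, stream);
            if (prev_size)                                  // write the previous
                fwrite(pinned[cur ^ 1], prev_size, 1, f);   // chunk meanwhile
            cudaStreamSynchronize(stream);
            prev_size = size;
            cur ^= 1;
        }
        fwrite(pinned[cur ^ 1], prev_size, 1, f);           // flush last chunk

        cudaFreeHost(pinned[0]);
        cudaFreeHost(pinned[1]);
    }

Even this simplified version hints at why the deck calls the scheme synchronization-heavy: in the real application the same thread is also launching the next compute steps.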
Outline
• Introduction
• Reverse Time Migration on CUDA
  • General approach
  • Disk I/O
  → Domain decomposition
  • Overlapping computation and communication
• GMAC at a glance
• Reverse Time Migration on GMAC
• Conclusions

Reverse Time Migration on CUDA └ Domain decomposition
• But… wait, real-size problems require > 16 GB of data!
• Volumes are split into tiles (along the Z-axis)
• The 3D-Stencil introduces data dependencies between neighboring tiles
  [Diagram: one volume split into domains D1..D4 along the z-axis]

Reverse Time Migration on CUDA └ Domain decomposition
• Multi-node execution may be required to overcome memory capacity limitations
  • Shared memory for intra-node communication
  • MPI for inter-node communication
  [Diagram: Node 1 (GPU1..GPU4 + host memory) ↔ MPI ↔ Node 2 (GPU1..GPU4 + host memory)]

Reverse Time Migration on CUDA └ Domain decomposition
• Data flow-graph (multi-domain): one RTM Kernel → Compress pipeline per domain, each on its own wave-fields (domain 1, domain 2, …), sharing the constant read-only data (velocity model, geophones' traces)

Reverse Time Migration on CUDA └ Domain decomposition
• Control flow-graph (multi-domain):
  • Start → i = 0 → RTM Kernel → kernel sync → boundary exchange → (i % N == 0: Compress → toHost → Disk I/O) → i++ → repeat while i < steps → End
  • A boundary exchange happens every time-step
  • Inter-domain communication blocks the execution of the next steps!

Reverse Time Migration on CUDA └ Domain decomposition
• A boundary exchange is needed every time-step
  [Timeline: each kernel K is followed by an exchange X before the next kernel can start; toHost and Disk I/O still run on the CPU]

Reverse Time Migration on CUDA └ Domain decomposition
• Single-transfer exchange
  • "Easy" to program, needs large page-locked buffers
  [Timeline: all deviceToHost transfers, then all hostToDevice transfers]
• Multiple-transfer exchange, to maximize PCI-Express utilization
  • "Complex" to program, needs smaller page-locked buffers
  [Timeline: many small toH/toD transfers interleaved in both directions]

Reverse Time Migration on CUDA └ Domain decomposition
• CUDA-RT limitations
  • Each host thread can only access the memory objects it allocates
  [Diagram: four per-GPU address spaces, each private to one host thread]

Reverse Time Migration on CUDA └ Domain decomposition
• CUDA-RT implementation (single-transfer exchange)
  • Streams and page-locked memory buffers must be used
  • The page-locked memory buffers can be too big

Reverse Time Migration on CUDA └ Domain decomposition
• CUDA-RT implementation (multiple-transfer exchange)
  • Uses small page-locked buffers
  • More synchronization code
  • Too complex to be represented using PowerPoint!
  • Very difficult to implement in real code!
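To give a flavor of even the "easy" variant, here is a hedged sketch of a single-transfer exchange in CUDA-RT style, with one host thread per GPU staging its boundary slab through a shared page-locked buffer. The Domain struct, the offsets and the barrier are illustrative assumptions, not the deck's code.

    // Hedged sketch: single-transfer halo exchange, one host thread per GPU.
    // A pthread barrier separates the two copy directions.
    #include <pthread.h>
    #include <cuda_runtime.h>

    extern pthread_barrier_t barrier;     // one arrival per domain thread

    struct Domain {
        float *d_wavefield;               // this GPU's tile
        float *h_halo;                    // large page-locked staging buffer
        size_t send_off, recv_off;        // element offsets of the slabs
    };

    void exchange_with_neighbor(struct Domain *me, struct Domain *neighbor,
                                size_t halo_bytes, cudaStream_t stream)
    {
        // 1. My boundary slab -> my page-locked staging buffer
        cudaMemcpyAsync(me->h_halo, me->d_wavefield + me->send_off,
                        halo_bytes, cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);

        // 2. Wait until every domain has published its boundary
        pthread_barrier_wait(&barrier);

        // 3. Neighbor's staging buffer -> my halo region
        cudaMemcpyAsync(me->d_wavefield + me->recv_off, neighbor->h_halo,
                        halo_bytes, cudaMemcpyHostToDevice, stream);
        cudaStreamSynchronize(stream);
    }

The multiple-transfer variant splits steps 1 and 3 into many chunked, double-buffered copies, which is where the synchronization code explodes.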
Outline
• Introduction
• Reverse Time Migration on CUDA
  • General approach
  • Disk I/O
  • Domain decomposition
  → Overlapping computation and communication
• GMAC at a glance
• Reverse Time Migration on GMAC
• Conclusions

Reverse Time Migration on CUDA └ Overlapping computation and communication
• Problem: the boundary exchange blocks the execution of the following time-step
  [Timeline: K1 X K2 X … — every exchange X serializes with the kernels]

Reverse Time Migration on CUDA └ Overlapping computation and communication
• Solution: with a 2-stage execution plan we can effectively overlap the boundary exchange between domains
  [Timeline: per step, a small stage-1 kernel k is followed by a large stage-2 kernel K; the exchange X overlaps with the stage-2 computation]

Reverse Time Migration on CUDA └ Overlapping computation and communication
• Approach: two-stage execution
  • Stage 1: compute the wavefield points to be exchanged
  [Diagram: only the boundary slabs between GPU1 and GPU2 are computed]

Reverse Time Migration on CUDA └ Overlapping computation and communication
• Approach: two-stage execution
  • Stage 2: compute the remaining points while exchanging the boundaries
  [Diagram: the interiors of GPU1 and GPU2 are computed while the boundary slabs travel]
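A hedged sketch of the two-stage plan (illustrative names and kernel declarations, not the deck's code): stage 1 computes the boundary slabs on their own stream; while stage 2 computes the interior, the fresh boundaries travel through page-locked memory.

    // Illustrative sketch of the two-stage time-step.
    #include <pthread.h>
    #include <cuda_runtime.h>

    __global__ void stencil_boundary(float *next, const float *curr);
    __global__ void stencil_interior(float *next, const float *curr);

    extern pthread_barrier_t barrier;   // one arrival per domain thread

    void timestep(float *d_next, const float *d_curr,
                  float *h_my_halo, const float *h_neighbor_halo,
                  size_t send_off, size_t recv_off,    // element offsets
                  size_t halo_bytes, dim3 Dg_b, dim3 Dg_i, dim3 Db,
                  cudaStream_t s_boundary, cudaStream_t s_interior)
    {
        // Stage 1: the points to be exchanged, on their own stream
        stencil_boundary<<<Dg_b, Db, 0, s_boundary>>>(d_next, d_curr);
        // Stage 2: the remaining points; overlaps with the exchange below
        stencil_interior<<<Dg_i, Db, 0, s_interior>>>(d_next, d_curr);

        // Publish my boundary while the interior kernel keeps running
        cudaMemcpyAsync(h_my_halo, d_next + send_off, halo_bytes,
                        cudaMemcpyDeviceToHost, s_boundary);
        cudaStreamSynchronize(s_boundary);
        pthread_barrier_wait(&barrier);
        cudaMemcpyAsync(d_next + recv_off, h_neighbor_halo, halo_bytes,
                        cudaMemcpyHostToDevice, s_boundary);

        cudaThreadSynchronize();        // both stages and the exchange done
    }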
Reverse Time Migration on CUDA └ Overlapping computation and communication
• But two-stage execution requires more abstractions and more code complexity
  • An additional stream per domain
  • We already have 1 to launch kernels, 1 to overlap transfers to disk, 1 to exchange boundaries
→ At this point the code is a complete mess!
  • It requires 4 streams per domain, many page-locked buffers, and lots of inter-thread synchronization
  • Poor readability and maintainability
  • Easy to introduce bugs

Outline
• Introduction
• Reverse Time Migration on CUDA
• GMAC at a glance
  → Features
  • Code examples
• Reverse Time Migration on GMAC
• Conclusions

GMAC at a glance └ Introduction
• A library that enhances the host programming model of CUDA
• Freely available at http://code.google.com/p/adsm/
  • Developed by BSC and UIUC
  • NCSA license (BSD-like)
  • Works on Linux and MacOS X (Windows version coming soon)
• Presented in detail tomorrow at 9 am @ San Jose Ballroom

GMAC at a glance └ Features
• Unified virtual address space for all the memories in the system
  • A single allocation for shared objects
  • Special API calls: gmacMalloc, gmacFree
  • GPU memory allocated by a host thread is visible to all host threads
→ Brings POSIX thread semantics back to developers
  [Diagram: a shared data object mapped into both CPU and GPU memory; CPU-only data stays private]

GMAC at a glance └ Features
• Parallelism exposed via regular POSIX threads
  • Replaces the explicit use of CUDA streams
  • OpenMP support
• GMAC uses streams and page-locked buffers internally
  • Concurrent kernel execution and memory transfers for free

GMAC at a glance └ Features
• Optimized bulk memory operations via library interposition
  • File I/O
    • Standard I/O functions: fwrite, fread
    • Automatic overlap of disk I/O with hostToDevice and deviceToHost transfers
  • Optimized GPU-to-GPU transfers via a regular memcpy
  • Enhanced versions of the MPI send/receive calls
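A hedged sketch of what those POSIX semantics buy (illustrative, not from the deck; the gmac.h header name is an assumption): a pointer obtained with gmacMalloc in the compute thread can be handed to a plain I/O thread, and the interposed fwrite takes care of consistency and overlap.

    // Illustrative sketch of GMAC's POSIX-thread semantics.
    #include <pthread.h>
    #include <stdio.h>
    #include <gmac.h>                    // assumed header name

    struct Snapshot { float *wavefield; size_t size; FILE *file; };

    static void *io_thread(void *arg)
    {
        struct Snapshot *s = (struct Snapshot *)arg;
        // Same pointer the kernel wrote on the GPU: the interposed fwrite
        // overlaps the deviceToHost transfer with the disk write internally.
        fwrite(s->wavefield, s->size, 1, s->file);
        return NULL;
    }

Contrast this with the CUDA-RT rule from the Disk I/O section, where only the owner thread could touch the GPU allocation.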
Outline
• Introduction
• Reverse Time Migration on CUDA
• GMAC at a glance
  • Features
  → Code examples
• Reverse Time Migration on GMAC
• Conclusions

GMAC at a glance └ Examples
• A single allocation (and pointer) for shared objects

CUDA-RT:
    void compute(FILE *file, int size)
    {
     1    float *foo, *dev_foo;
     2    foo = malloc(size);
     3    fread(foo, size, 1, file);
     4    cudaMalloc(&dev_foo, size);
     5    cudaMemcpy(dev_foo, foo, size, ToDevice);
     6    kernel<<<Dg, Db>>>(dev_foo, size);
     7    cudaThreadSynchronize();
     8    cudaMemcpy(foo, dev_foo, size, ToHost);
     9    cpuComputation(foo);
    10    cudaFree(dev_foo);
    11    free(foo);
    }

GMAC (deck line numbers kept: lines 4, 5 and 8 of the CUDA-RT version disappear):
    void compute(FILE *file, int size)
    {
     1    float *foo;
     2    foo = gmacMalloc(size);
     3    fread(foo, size, 1, file);
     6    kernel<<<Dg, Db>>>(foo, size);
     7    gmacThreadSynchronize();
     9    cpuComputation(foo);
    10    gmacFree(foo);
    11 }

GMAC at a glance └ Examples
• Optimized support for bulk memory operations
• The listings are the same as above: in the GMAC version, the fread on line 3 reads from disk straight into the shared allocation, and GMAC overlaps the disk read with the hostToDevice transfer

Outline
• Introduction
• GMAC at a glance
• Reverse Time Migration on GMAC
  → Disk I/O
  • Domain decomposition
  • Overlapping computation and communication
  • Development cycle and debugging
• Conclusions

Reverse Time Migration on GMAC └ Disk I/O
• Recall the CUDA-RT implementation (multiple transfers): besides launching kernels, the compute thread must program and monitor several deviceToHost transfers while executing the next compute-only steps on the GPU — lots of synchronization code in the compute thread

Reverse Time Migration on GMAC └ Disk I/O
• GMAC implementation
  • deviceToHost transfers are performed by the I/O thread
  • deviceToHost and disk I/O transfers overlap for free
  • Small page-locked buffers are used internally
  [Diagram: GPU and CPU sharing one global address space]

Outline
• Introduction
• GMAC at a glance
• Reverse Time Migration on GMAC
  • Disk I/O
  → Domain decomposition
  • Overlapping computation and communication
  • Development cycle and debugging
• Conclusions

Reverse Time Migration on GMAC └ Domain decomposition (CUDA-RT)
• Recall the CUDA-RT implementation (single-transfer exchange): streams and page-locked memory buffers must be used, and the page-locked buffers can be too big

Reverse Time Migration on GMAC └ Domain decomposition (GMAC)
• GMAC implementation (multiple-transfer exchange)
  • The exchange of boundaries is performed using a simple memcpy!
  [Diagram: GPU1..GPU4 in one unified global address space]
• Full PCIe utilization: internally, GMAC performs several transfers and double buffering
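A hedged sketch of that exchange (illustrative names): under GMAC's unified address space, the halo exchange collapses to a memcpy between two domains' pointers, with the chunking and double buffering handled inside the library.

    // Illustrative sketch: halo exchange as a plain memcpy under GMAC.
    #include <string.h>

    void exchange(float *mine, const float *neighbor,
                  size_t recv_off, size_t send_off, size_t halo_elems)
    {
        // Destination lives in one GPU's memory, source in another's:
        // with GMAC both are plain pointers in the global address space.
        memcpy(mine + recv_off, neighbor + send_off,
               halo_elems * sizeof(float));
    }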
Outline
• Introduction
• GMAC at a glance
• Reverse Time Migration on GMAC
  • Disk I/O
  • Domain decomposition
  → Overlapping computation and communication
  • Development cycle and debugging
• Conclusions

Reverse Time Migration on GMAC └ Overlapping computation and communication
• No streams, no page-locked buffers, similar performance: ±2%

CUDA-RT:
    readVelocity(velocity);
    cudaMalloc(&d_input, W_SIZE);
    cudaMalloc(&d_output, W_SIZE);
    cudaHostAlloc(&i_halos, H_SIZE);
    cudaHostAlloc(&disk_buffer, W_SIZE);
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaMemcpy(d_velocity, velocity, W_SIZE);
    for all time steps do
        launch_stage1(d_output, d_input, s1);
        launch_stage2(d_output, d_input, s2);
        cudaMemcpyAsync(i_halos, d_output, s1);
        cudaStreamSynchronize(s1);
        barrier();
        cudaMemcpyAsync(d_output, i_halos, s1);
        cudaThreadSynchronize();
        barrier();
        if (timestep % N == 0) {
            compress(output, c_output);
            transfer_to_host(disk_buffer);
            barrier_write_to_disk();
        }
        // ... Update pointers
    end for

GMAC:
    fread(velocity);
    gmacMalloc(&input, W_SIZE);
    gmacMalloc(&output, W_SIZE);
    for all time steps do
        launch_stage1(output, input);
        gmacThreadSynchronize();
        launch_stage2(output, input);
        memcpy(neighbor, output);
        gmacThreadSynchronize();
        barrier();
        if (timestep % N == 0) {
            compress(output, c_output);
            barrier_write_to_disk();
        }
        // ... Update pointers
    end for

Outline
• Introduction
• GMAC at a glance
• Reverse Time Migration on GMAC
  • Disk I/O
  • Domain decomposition
  • Overlapping computation and communication
  → Development cycle and debugging
• Conclusions

Reverse Time Migration on GMAC └ Development cycle and debugging
• CUDA-RT
  • Start from a simple, correct sequential code
  • Implement the kernels one at a time and check their correctness (3D-Stencil, Absorbing Boundary Conditions, Source insertion, Compression)
    • Two allocations per data structure
    • Keep data consistency by hand (cudaMemcpy)
  • To introduce modifications to any kernel:
    • Two allocations per data structure
    • Keep data consistency by hand (cudaMemcpy)

Reverse Time Migration on GMAC └ Development cycle and debugging
• GMAC
  • Allocate objects with gmacMalloc
    • A single pointer per object
  • Use that pointer both in the host code and in the GPU kernel implementations (3D-Stencil, Absorbing Boundary Conditions, Source insertion, Compression)
    • No copies

Outline
• Introduction
• Reverse Time Migration on CUDA
• GMAC at a glance
• Reverse Time Migration on GMAC
• Conclusions

Conclusions
• Heterogeneous systems based on GPUs are currently the most appropriate to implement RTM
• CUDA has programmability issues:
  • CUDA provides a good language to expose data parallelism in the code to be run on the GPU
  • The host-side interface provided by the CUDA-RT makes it difficult to implement even some basic optimizations
• GMAC eases the development of applications for GPU-based systems with no performance penalty
  • A single part-time programmer delivered the full RTM version in 6 months (5x speedup over the previous Cell implementation)

Acknowledgements
• Barcelona Supercomputing Center
• Repsol
• Universitat Politècnica de Catalunya
• University of Illinois at Urbana-Champaign

Thank you! Questions?