Reverse Time Migration on GMAC
NVIDIA GTC, 22nd of September, 2010

Javier Cabezas (BSC), Mauricio Araya (Repsol/BSC), Isaac Gelado (UPC/UIUC), Thomas Bradley (NVIDIA), Gladys González (Repsol), José María Cela (UPC/BSC), Nacho Navarro (UPC/BSC)

Outline
• Introduction
• Reverse Time Migration on CUDA
• GMAC at a glance
• Reverse Time Migration on GMAC
• Conclusions

Reverse Time Migration on CUDA └ RTM
• RTM generates an image of the subsurface layers
• Uses traces recorded by sensors in the field
• RTM's algorithm:
  1. Propagation of a modeled wave (forward in time)
  2. Propagation of the recorded traces (backward in time)
  3. Correlation of the forward and backward wavefields
     • Last forward wavefield with the first backward wavefield
• FDTD is preferred to FFT:
  • 2nd-order finite differencing in time
  • High-order finite differencing in space

Introduction └ Barcelona Supercomputing Center (BSC)
• BSC and Repsol: the Kaleidoscope project
  • Develop better algorithms/techniques for seismic imaging
  • We focused on Reverse Time Migration (RTM), as it is the most popular seismic imaging technique for depth exploration
• Due to the high computational power required, the project started a quest for the most suitable hardware:
  • PowerPC: scalability issues
  • Cell: good performance (in production @ Repsol), difficult programmability
  • FPGA: potentially the best performance, a programmability nightmare
  • GPUs: 5x speedup vs Cell (GTX280), but what about programmability?

Outline
• Introduction
• Reverse Time Migration on CUDA
  → General approach
  • Disk I/O
  • Domain decomposition
  • Overlapping computation and communication
• GMAC at a glance
• Reverse Time Migration on GMAC
• Conclusions

Reverse Time Migration on CUDA └ General approach
• We focus on the host-side part of the implementation
  1. Avoid memory transfers between host and GPU memories
     • Implement as many computations as possible on the GPU
  2. Hide the latency of memory transfers
     • Overlap memory transfers with kernel execution
  3. Take advantage of the PCIe full-duplex capabilities (Fermi)
     • Overlap deviceToHost and hostToDevice memory transfers

Reverse Time Migration on CUDA └ General approach
• Forward: 3D-Stencil → Absorbing Boundary Conditions → Source insertion → Compression → Write to disk
• Backward: 3D-Stencil → Absorbing Boundary Conditions → Traces insertion → Read from disk → Decompression → Correlation

Reverse Time Migration on CUDA └ General approach
• Data structures used in the RTM algorithm
  • Read/Write structures:
    • 3D volume for the wavefield (can be larger than 1000x1000x1000 points)
    • State of the wavefield in previous time-steps, to compute finite differences in time
    • Some extra points in each direction at the boundaries (halos)
  • Read-only structures:
    • A 3D volume of the same size as the wavefield (the velocity model)
    • Geophones' recorded traces: time-steps x #geophones

Reverse Time Migration on CUDA └ General approach
• Data flow-graph (forward): 3D-Stencil → ABC → Source → Compress, operating on the wavefields plus the constant read-only data (velocity model, geophones' traces)

Reverse Time Migration on CUDA └ General approach
• Simplified data flow-graph (forward): RTM Kernel → Compress, operating on the wave-fields plus the constant read-only data (velocity model, geophones' traces)
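To make the "RTM Kernel" box concrete, here is a minimal sketch of the kind of update it performs: 2nd-order finite differencing in time and high-order (here 8th-order) differencing in space, as listed on the RTM slide. This is illustrative code, not the deck's actual kernel; the coefficient array c, the vel2dt2 volume (v² · dt² per point) and the padded dimensions are assumptions.

    // Illustrative sketch only: a 2nd-order-in-time, 8th-order-in-space
    // wave-equation step. Assumes volumes padded with 4-point halos and
    // hypothetical finite-difference coefficients c[0..4].
    #include <cuda_runtime.h>

    #define RADIUS 4

    __constant__ float c[RADIUS + 1];   // FD coefficients for the Laplacian

    __global__ void rtm_step(float *next, const float *curr, const float *prev,
                             const float *vel2dt2,   // v^2 * dt^2 per point
                             int dimX, int dimY, int dimZ)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x + RADIUS;
        int y = blockIdx.y * blockDim.y + threadIdx.y + RADIUS;
        if (x >= dimX - RADIUS || y >= dimY - RADIUS) return;

        for (int z = RADIUS; z < dimZ - RADIUS; ++z) {    // march along z
            size_t i = ((size_t)z * dimY + y) * dimX + x;
            float lap = 3.0f * c[0] * curr[i];            // center point, 3 axes
            for (int r = 1; r <= RADIUS; ++r)
                lap += c[r] * (curr[i - r] + curr[i + r]                    // x
                             + curr[i - (size_t)r * dimX]
                             + curr[i + (size_t)r * dimX]                   // y
                             + curr[i - (size_t)r * dimX * dimY]
                             + curr[i + (size_t)r * dimX * dimY]);          // z
            // 2nd-order time step: u(t+dt) = 2u(t) - u(t-dt) + v^2 dt^2 * lap
            next[i] = 2.0f * curr[i] - prev[i] + vel2dt2[i] * lap;
        }
    }

Each time-step rotates the prev/curr/next pointers, which is why the wavefield state of previous time-steps has to be kept around.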
Reverse Time Migration on CUDA └ General approach
• Control flow-graph (forward):
  • Start → i = 0 → RTM Kernel → (every N steps, i % N == 0: Compress → toHost → Disk I/O) → i++ → repeat while i < steps → End
  • Computation (RTM Kernel): runs on the GPU
  • Compress and transfer to disk (deviceToHost + Disk I/O): runs on the CPU
    • Performed every N steps
    • Can run in parallel with the next compute steps

Outline
• Introduction
• Reverse Time Migration on CUDA
  • General approach
  → Disk I/O
  • Domain decomposition
  • Overlapping computation and communication
• GMAC at a glance
• Reverse Time Migration on GMAC
• Conclusions

Reverse Time Migration on CUDA └ Disk I/O
• GPU → Disk transfers are very time-consuming
  [Timeline: kernels K1..K4, then Compress, toHost and Disk I/O run serialized before K5]
• Transferring to disk can be overlapped with the next (compute-only) steps
  [Timeline: toHost and Disk I/O run on the CPU while K5..K8 run on the GPU]

Reverse Time Migration on CUDA └ Disk I/O
• Single transfer: wait for all the data to be in host memory before starting the disk I/O
  [Timeline: one long deviceToHost, then Disk I/O]
• Multiple transfers: overlap deviceToHost transfers with disk I/O
  • Double buffering
  [Timeline: small toH copies interleaved with Disk I/O writes]

Reverse Time Migration on CUDA └ Disk I/O
• CUDA-RT limitations
  • GPU memory is accessible by the owner host thread only
  → deviceToHost transfers must be performed by the compute thread
  [Diagram: the GPU address space is reachable from the compute thread but not from the I/O thread]

Reverse Time Migration on CUDA └ Disk I/O
• CUDA-RT implementation (single transfer)
  • CUDA streams must be used so as not to block GPU execution
  → An intermediate page-locked buffer must be used: for real-size problems the system can run out of memory!

Reverse Time Migration on CUDA └ Disk I/O
• CUDA-RT implementation (multiple transfers)
  • Besides launching kernels, the compute thread must program and monitor several deviceToHost transfers while executing the next compute-only steps on the GPU
  → Lots of synchronization code in the compute thread
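For illustration, a minimal sketch of what the multiple-transfer scheme asks of the compute thread: chunking the snapshot through two page-locked buffers so the fwrite of one chunk overlaps the deviceToHost copy of the next. The function name, chunk size and buffer count are assumptions, not the deck's code.

    // Hedged sketch: stream a snapshot to disk in chunks, overlapping each
    // deviceToHost copy with the fwrite of the previous chunk.
    #include <stdio.h>
    #include <cuda_runtime.h>

    void snapshot_to_disk(const float *d_volume, size_t bytes, FILE *f,
                          cudaStream_t stream)
    {
        const size_t CHUNK = (size_t)16 << 20;   // 16 MB per transfer
        void *pinned[2];
        cudaHostAlloc(&pinned[0], CHUNK, cudaHostAllocDefault);
        cudaHostAlloc(&pinned[1], CHUNK, cudaHostAllocDefault);

        size_t prev_size = 0;
        int cur = 0;
        for (size_t off = 0; off < bytes; off += CHUNK) {
            size_t size = (bytes - off < CHUNK) ? bytes - off : CHUNK;
            cudaMemcpyAsync(pinned[cur], (const char *)d_volume + off, size,
                            cudaMemcpyDeviceToHost, stream);
            if (prev_size)                                  // write the previous
                fwrite(pinned[cur ^ 1], prev_size, 1, f);   // chunk meanwhile
            cudaStreamSynchronize(stream);
            prev_size = size;
            cur ^= 1;
        }
        fwrite(pinned[cur ^ 1], prev_size, 1, f);           // flush last chunk

        cudaFreeHost(pinned[0]);
        cudaFreeHost(pinned[1]);
    }

Even this simplified version hints at why the deck calls the scheme synchronization-heavy: in the real application the same thread is also launching the next compute steps.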
Outline
• Introduction
• Reverse Time Migration on CUDA
  • General approach
  • Disk I/O
  → Domain decomposition
  • Overlapping computation and communication
• GMAC at a glance
• Reverse Time Migration on GMAC
• Conclusions

Reverse Time Migration on CUDA └ Domain decomposition
• But… wait, real-size problems require > 16 GB of data!
• Volumes are split into tiles (along the Z-axis)
• The 3D-Stencil introduces data dependencies between neighboring tiles
  [Diagram: one volume split into domains D1..D4 along the z-axis]

Reverse Time Migration on CUDA └ Domain decomposition
• Multi-node execution may be required to overcome memory capacity limitations
  • Shared memory for intra-node communication
  • MPI for inter-node communication
  [Diagram: Node 1 (GPU1..GPU4 + host memory) ↔ MPI ↔ Node 2 (GPU1..GPU4 + host memory)]

Reverse Time Migration on CUDA └ Domain decomposition
• Data flow-graph (multi-domain): one RTM Kernel → Compress pipeline per domain, each on its own wave-fields (domain 1, domain 2, …), sharing the constant read-only data (velocity model, geophones' traces)

Reverse Time Migration on CUDA └ Domain decomposition
• Control flow-graph (multi-domain):
  • Start → i = 0 → RTM Kernel → kernel sync → boundary exchange → (i % N == 0: Compress → toHost → Disk I/O) → i++ → repeat while i < steps → End
  • A boundary exchange happens every time-step
  • Inter-domain communication blocks the execution of the next steps!

Reverse Time Migration on CUDA └ Domain decomposition
• A boundary exchange is needed every time-step
  [Timeline: each kernel K is followed by an exchange X before the next kernel can start; toHost and Disk I/O still run on the CPU]

Reverse Time Migration on CUDA └ Domain decomposition
• Single-transfer exchange
  • "Easy" to program, needs large page-locked buffers
  [Timeline: all deviceToHost transfers, then all hostToDevice transfers]
• Multiple-transfer exchange, to maximize PCI-Express utilization
  • "Complex" to program, needs smaller page-locked buffers
  [Timeline: many small toH/toD transfers interleaved in both directions]

Reverse Time Migration on CUDA └ Domain decomposition
• CUDA-RT limitations
  • Each host thread can only access the memory objects it allocates
  [Diagram: four per-GPU address spaces, each private to one host thread]

Reverse Time Migration on CUDA └ Domain decomposition
• CUDA-RT implementation (single-transfer exchange)
  • Streams and page-locked memory buffers must be used
  • The page-locked memory buffers can be too big

Reverse Time Migration on CUDA └ Domain decomposition
• CUDA-RT implementation (multiple-transfer exchange)
  • Uses small page-locked buffers
  • More synchronization code
  • Too complex to be represented using PowerPoint!
  • Very difficult to implement in real code!
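To give a flavor of even the "easy" variant, here is a hedged sketch of a single-transfer exchange in CUDA-RT style, with one host thread per GPU staging its boundary slab through a shared page-locked buffer. The Domain struct, the offsets and the barrier are illustrative assumptions, not the deck's code.

    // Hedged sketch: single-transfer halo exchange, one host thread per GPU.
    // A pthread barrier separates the two copy directions.
    #include <pthread.h>
    #include <cuda_runtime.h>

    extern pthread_barrier_t barrier;     // one arrival per domain thread

    struct Domain {
        float *d_wavefield;               // this GPU's tile
        float *h_halo;                    // large page-locked staging buffer
        size_t send_off, recv_off;        // element offsets of the slabs
    };

    void exchange_with_neighbor(struct Domain *me, struct Domain *neighbor,
                                size_t halo_bytes, cudaStream_t stream)
    {
        // 1. My boundary slab -> my page-locked staging buffer
        cudaMemcpyAsync(me->h_halo, me->d_wavefield + me->send_off,
                        halo_bytes, cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);

        // 2. Wait until every domain has published its boundary
        pthread_barrier_wait(&barrier);

        // 3. Neighbor's staging buffer -> my halo region
        cudaMemcpyAsync(me->d_wavefield + me->recv_off, neighbor->h_halo,
                        halo_bytes, cudaMemcpyHostToDevice, stream);
        cudaStreamSynchronize(stream);
    }

The multiple-transfer variant splits steps 1 and 3 into many chunked, double-buffered copies, which is where the synchronization code explodes.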
Outline
• Introduction
• Reverse Time Migration on CUDA
  • General approach
  • Disk I/O
  • Domain decomposition
  → Overlapping computation and communication
• GMAC at a glance
• Reverse Time Migration on GMAC
• Conclusions

Reverse Time Migration on CUDA └ Overlapping computation and communication
• Problem: the boundary exchange blocks the execution of the following time-step
  [Timeline: K1 X K2 X … — every exchange X serializes with the kernels]

Reverse Time Migration on CUDA └ Overlapping computation and communication
• Solution: with a 2-stage execution plan we can effectively overlap the boundary exchange between domains
  [Timeline: per step, a small stage-1 kernel k is followed by a large stage-2 kernel K; the exchange X overlaps with the stage-2 computation]

Reverse Time Migration on CUDA └ Overlapping computation and communication
• Approach: two-stage execution
  • Stage 1: compute the wavefield points to be exchanged
  [Diagram: only the boundary slabs between GPU1 and GPU2 are computed]

Reverse Time Migration on CUDA └ Overlapping computation and communication
• Approach: two-stage execution
  • Stage 2: compute the remaining points while exchanging the boundaries
  [Diagram: the interiors of GPU1 and GPU2 are computed while the boundary slabs travel]
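A hedged sketch of the two-stage plan (illustrative names and kernel declarations, not the deck's code): stage 1 computes the boundary slabs on their own stream; while stage 2 computes the interior, the fresh boundaries travel through page-locked memory.

    // Illustrative sketch of the two-stage time-step.
    #include <pthread.h>
    #include <cuda_runtime.h>

    __global__ void stencil_boundary(float *next, const float *curr);
    __global__ void stencil_interior(float *next, const float *curr);

    extern pthread_barrier_t barrier;   // one arrival per domain thread

    void timestep(float *d_next, const float *d_curr,
                  float *h_my_halo, const float *h_neighbor_halo,
                  size_t send_off, size_t recv_off,    // element offsets
                  size_t halo_bytes, dim3 Dg_b, dim3 Dg_i, dim3 Db,
                  cudaStream_t s_boundary, cudaStream_t s_interior)
    {
        // Stage 1: the points to be exchanged, on their own stream
        stencil_boundary<<<Dg_b, Db, 0, s_boundary>>>(d_next, d_curr);
        // Stage 2: the remaining points; overlaps with the exchange below
        stencil_interior<<<Dg_i, Db, 0, s_interior>>>(d_next, d_curr);

        // Publish my boundary while the interior kernel keeps running
        cudaMemcpyAsync(h_my_halo, d_next + send_off, halo_bytes,
                        cudaMemcpyDeviceToHost, s_boundary);
        cudaStreamSynchronize(s_boundary);
        pthread_barrier_wait(&barrier);
        cudaMemcpyAsync(d_next + recv_off, h_neighbor_halo, halo_bytes,
                        cudaMemcpyHostToDevice, s_boundary);

        cudaThreadSynchronize();        // both stages and the exchange done
    }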
Reverse Time Migration on CUDA └ Overlapping computation and communication
• But two-stage execution requires more abstractions and more code complexity
  • An additional stream per domain
  • We already have 1 to launch kernels, 1 to overlap transfers to disk, 1 to exchange boundaries
→ At this point the code is a complete mess!
  • It requires 4 streams per domain, many page-locked buffers, and lots of inter-thread synchronization
  • Poor readability and maintainability
  • Easy to introduce bugs

Outline
• Introduction
• Reverse Time Migration on CUDA
• GMAC at a glance
  → Features
  • Code examples
• Reverse Time Migration on GMAC
• Conclusions

GMAC at a glance └ Introduction
• A library that enhances the host programming model of CUDA
• Freely available at http://code.google.com/p/adsm/
  • Developed by BSC and UIUC
  • NCSA license (BSD-like)
  • Works on Linux and MacOS X (Windows version coming soon)
• Presented in detail tomorrow at 9 am @ San Jose Ballroom

GMAC at a glance └ Features
• Unified virtual address space for all the memories in the system
  • A single allocation for shared objects
  • Special API calls: gmacMalloc, gmacFree
  • GPU memory allocated by a host thread is visible to all host threads
→ Brings POSIX thread semantics back to developers
  [Diagram: a shared data object mapped into both CPU and GPU memory; CPU-only data stays private]

GMAC at a glance └ Features
• Parallelism exposed via regular POSIX threads
  • Replaces the explicit use of CUDA streams
  • OpenMP support
• GMAC uses streams and page-locked buffers internally
  • Concurrent kernel execution and memory transfers for free

GMAC at a glance └ Features
• Optimized bulk memory operations via library interposition
  • File I/O
    • Standard I/O functions: fwrite, fread
    • Automatic overlap of disk I/O with hostToDevice and deviceToHost transfers
  • Optimized GPU-to-GPU transfers via a regular memcpy
  • Enhanced versions of the MPI send/receive calls
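A hedged sketch of what those POSIX semantics buy (illustrative, not from the deck; the gmac.h header name is an assumption): a pointer obtained with gmacMalloc in the compute thread can be handed to a plain I/O thread, and the interposed fwrite takes care of consistency and overlap.

    // Illustrative sketch of GMAC's POSIX-thread semantics.
    #include <pthread.h>
    #include <stdio.h>
    #include <gmac.h>                    // assumed header name

    struct Snapshot { float *wavefield; size_t size; FILE *file; };

    static void *io_thread(void *arg)
    {
        struct Snapshot *s = (struct Snapshot *)arg;
        // Same pointer the kernel wrote on the GPU: the interposed fwrite
        // overlaps the deviceToHost transfer with the disk write internally.
        fwrite(s->wavefield, s->size, 1, s->file);
        return NULL;
    }

Contrast this with the CUDA-RT rule from the Disk I/O section, where only the owner thread could touch the GPU allocation.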
Outline
• Introduction
• Reverse Time Migration on CUDA
• GMAC at a glance
  • Features
  → Code examples
• Reverse Time Migration on GMAC
• Conclusions

GMAC at a glance └ Examples
• A single allocation (and pointer) for shared objects

CUDA-RT:
    void compute(FILE *file, int size)
    {
     1    float *foo, *dev_foo;
     2    foo = malloc(size);
     3    fread(foo, size, 1, file);
     4    cudaMalloc(&dev_foo, size);
     5    cudaMemcpy(dev_foo, foo, size, ToDevice);
     6    kernel<<<Dg, Db>>>(dev_foo, size);
     7    cudaThreadSynchronize();
     8    cudaMemcpy(foo, dev_foo, size, ToHost);
     9    cpuComputation(foo);
    10    cudaFree(dev_foo);
    11    free(foo);
    }

GMAC (deck line numbers kept: lines 4, 5 and 8 of the CUDA-RT version disappear):
    void compute(FILE *file, int size)
    {
     1    float *foo;
     2    foo = gmacMalloc(size);
     3    fread(foo, size, 1, file);
     6    kernel<<<Dg, Db>>>(foo, size);
     7    gmacThreadSynchronize();
     9    cpuComputation(foo);
    10    gmacFree(foo);
    11 }

GMAC at a glance └ Examples
• Optimized support for bulk memory operations
• The listings are the same as above: in the GMAC version, the fread on line 3 reads from disk straight into the shared allocation, and GMAC overlaps the disk read with the hostToDevice transfer

Outline
• Introduction
• GMAC at a glance
• Reverse Time Migration on GMAC
  → Disk I/O
  • Domain decomposition
  • Overlapping computation and communication
  • Development cycle and debugging
• Conclusions

Reverse Time Migration on GMAC └ Disk I/O
• Recall the CUDA-RT implementation (multiple transfers): besides launching kernels, the compute thread must program and monitor several deviceToHost transfers while executing the next compute-only steps on the GPU — lots of synchronization code in the compute thread

Reverse Time Migration on GMAC └ Disk I/O
• GMAC implementation
  • deviceToHost transfers are performed by the I/O thread
  • deviceToHost and disk I/O transfers overlap for free
  • Small page-locked buffers are used internally
  [Diagram: GPU and CPU sharing one global address space]

Outline
• Introduction
• GMAC at a glance
• Reverse Time Migration on GMAC
  • Disk I/O
  → Domain decomposition
  • Overlapping computation and communication
  • Development cycle and debugging
• Conclusions

Reverse Time Migration on GMAC └ Domain decomposition (CUDA-RT)
• Recall the CUDA-RT implementation (single-transfer exchange): streams and page-locked memory buffers must be used, and the page-locked buffers can be too big

Reverse Time Migration on GMAC └ Domain decomposition (GMAC)
• GMAC implementation (multiple-transfer exchange)
  • The exchange of boundaries is performed using a simple memcpy!
  [Diagram: GPU1..GPU4 in one unified global address space]
• Full PCIe utilization: internally, GMAC performs several transfers and double buffering
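A hedged sketch of that exchange (illustrative names): under GMAC's unified address space, the halo exchange collapses to a memcpy between two domains' pointers, with the chunking and double buffering handled inside the library.

    // Illustrative sketch: halo exchange as a plain memcpy under GMAC.
    #include <string.h>

    void exchange(float *mine, const float *neighbor,
                  size_t recv_off, size_t send_off, size_t halo_elems)
    {
        // Destination lives in one GPU's memory, source in another's:
        // with GMAC both are plain pointers in the global address space.
        memcpy(mine + recv_off, neighbor + send_off,
               halo_elems * sizeof(float));
    }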
Outline
• Introduction
• GMAC at a glance
• Reverse Time Migration on GMAC
  • Disk I/O
  • Domain decomposition
  → Overlapping computation and communication
  • Development cycle and debugging
• Conclusions

Reverse Time Migration on GMAC └ Overlapping computation and communication
• No streams, no page-locked buffers, similar performance: ±2%

CUDA-RT:
    readVelocity(velocity);
    cudaMalloc(&d_input, W_SIZE);
    cudaMalloc(&d_output, W_SIZE);
    cudaHostAlloc(&i_halos, H_SIZE);
    cudaHostAlloc(&disk_buffer, W_SIZE);
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaMemcpy(d_velocity, velocity, W_SIZE);
    for all time steps do
        launch_stage1(d_output, d_input, s1);
        launch_stage2(d_output, d_input, s2);
        cudaMemcpyAsync(i_halos, d_output, s1);
        cudaStreamSynchronize(s1);
        barrier();
        cudaMemcpyAsync(d_output, i_halos, s1);
        cudaThreadSynchronize();
        barrier();
        if (timestep % N == 0) {
            compress(output, c_output);
            transfer_to_host(disk_buffer);
            barrier_write_to_disk();
        }
        // ... Update pointers
    end for

GMAC:
    fread(velocity);
    gmacMalloc(&input, W_SIZE);
    gmacMalloc(&output, W_SIZE);
    for all time steps do
        launch_stage1(output, input);
        gmacThreadSynchronize();
        launch_stage2(output, input);
        memcpy(neighbor, output);
        gmacThreadSynchronize();
        barrier();
        if (timestep % N == 0) {
            compress(output, c_output);
            barrier_write_to_disk();
        }
        // ... Update pointers
    end for

Outline
• Introduction
• GMAC at a glance
• Reverse Time Migration on GMAC
  • Disk I/O
  • Domain decomposition
  • Overlapping computation and communication
  → Development cycle and debugging
• Conclusions

Reverse Time Migration on GMAC └ Development cycle and debugging
• CUDA-RT
  • Start from a simple, correct sequential code
  • Implement the kernels one at a time and check their correctness (3D-Stencil, Absorbing Boundary Conditions, Source insertion, Compression)
    • Two allocations per data structure
    • Keep data consistency by hand (cudaMemcpy)
  • To introduce modifications to any kernel:
    • Two allocations per data structure
    • Keep data consistency by hand (cudaMemcpy)

Reverse Time Migration on GMAC └ Development cycle and debugging
• GMAC
  • Allocate objects with gmacMalloc
    • A single pointer per object
  • Use that pointer both in the host code and in the GPU kernel implementations (3D-Stencil, Absorbing Boundary Conditions, Source insertion, Compression)
    • No copies

Outline
• Introduction
• Reverse Time Migration on CUDA
• GMAC at a glance
• Reverse Time Migration on GMAC
• Conclusions

Conclusions
• Heterogeneous systems based on GPUs are currently the most appropriate to implement RTM
• CUDA has programmability issues:
  • CUDA provides a good language to expose data parallelism in the code to be run on the GPU
  • The host-side interface provided by the CUDA-RT makes it difficult to implement even some basic optimizations
• GMAC eases the development of applications for GPU-based systems with no performance penalty
  • A single part-time programmer delivered the full RTM version in 6 months (5x speedup over the previous Cell implementation)

Acknowledgements
• Barcelona Supercomputing Center
• Repsol
• Universitat Politècnica de Catalunya
• University of Illinois at Urbana-Champaign

Thank you! Questions?