Reverse Time Migration on GMAC

NVIDIA GTC
22nd of September, 2010
Javier Cabezas (BSC)
Mauricio Araya (Repsol/BSC)
Isaac Gelado (UPC/UIUC)
Thomas Bradley (NVIDIA)
Gladys González (Repsol)
José María Cela (UPC/BSC)
Nacho Navarro (UPC/BSC)
Outline
• Introduction
• Reverse Time Migration on CUDA
• GMAC at a glance
• Reverse Time Migration on GMAC
• Conclusions
Introduction
└ RTM
• RTM generates an image of the subsurface layers
  • Uses traces recorded by sensors in the field
• RTM's algorithm
  1. Propagation of a modeled wave (forward in time)
  2. Propagation of the recorded traces (backward in time)
  3. Correlation of the forward and backward wavefields
     • The last forward wavefield is correlated with the first backward wavefield
• FDTD is preferred to FFT
  • 2nd-order finite differencing in time
  • High-order finite differencing in space
Introduction
└ Barcelona Supercomputing Center (BSC)
• BSC and Repsol: Kaleidoscope project
  • Develop better algorithms/techniques for seismic imaging
  • We focused on Reverse Time Migration (RTM), as it is the most popular seismic imaging technique for depth exploration
• Due to the high computational power required, the project started a quest for the most suitable hardware
  • PowerPC: scalability issues
  • Cell: good performance (in production @ Repsol), difficult programmability
  • FPGA: potentially best performance, a programmability nightmare
  • GPUs: 5x speedup vs Cell (GTX280), but what about programmability?
Outline
• Introduction
• Reverse Time Migration on CUDA
→General approach
• Disk I/O
• Domain decomposition
• Overlapping computation and communication
• GMAC at a glance
• Reverse Time Migration on GMAC
• Conclusions
Reverse Time Migration on CUDA
└ General approach
• We focus on the host-side part of the implementation
  1. Avoid memory transfers between host and GPU memories
     • Implement on the GPU as many computations as possible
  2. Hide the latency of memory transfers
     • Overlap memory transfers and kernel execution
  3. Take advantage of the PCIe full-duplex capabilities (Fermi)
     • Overlap deviceToHost and hostToDevice memory transfers
Reverse Time Migration on CUDA
└ General approach
[Diagram: the forward phase runs 3D-Stencil, Absorbing Boundary Conditions, Source insertion, Compression, and Write to disk; the backward phase runs 3D-Stencil, Absorbing Boundary Conditions, Traces insertion, Read from disk, Decompression, and Correlation.]
Reverse Time Migration on CUDA
└ General approach
• Data structures used in the RTM algorithm (see the sketch below)
  • Read/write structures
    • 3D volume for the wavefield (can be larger than 1000x1000x1000 points)
    • State of the wavefield in previous time-steps, needed to compute the finite differences in time
    • Some extra points in each direction at the boundaries (halos)
  • Read-only structures
    • 3D volume of the same size as the wavefield (the velocity model)
    • Geophones' recorded traces: time-steps x #geophones
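A minimal sketch of how these structures could be laid out in C; the struct names and the explicit two-time-step window are illustrative assumptions, not the deck's actual code:

    /* Illustrative RTM data structures (hypothetical names) */
    typedef struct {
        int nx, ny, nz;   /* interior points per dimension                 */
        int halo;         /* extra boundary points in each direction       */
        float *curr;      /* wavefield at time-step t   (read/write)       */
        float *prev;      /* wavefield at time-step t-1 (read/write), kept
                             for the 2nd-order finite differences in time  */
    } Wavefield;

    typedef struct {
        float *velocity;  /* 3D volume, same size as the wavefield (read-only)    */
        float *traces;    /* recorded traces: time-steps x #geophones (read-only) */
    } ReadOnlyData;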
Reverse Time Migration on CUDA
└ General approach
• Data flow-graph (forward)
[Diagram: 3D-Stencil → ABC → Source insertion → Compress, reading and writing the wavefields; constant read-only data: velocity model, geophones' traces.]
Reverse Time Migration on CUDA
└ General approach
• Simplified data flow-graph (forward)
[Diagram: RTM Kernel → Compress, operating on the wave-fields; constant read-only data: velocity model, geophones' traces.]
Reverse Time Migration on CUDA
└ General approach
• Control flow-graph (forward)
[Flowchart: i = 0; while i < steps: run the RTM kernel (on the GPU); if i % N == 0: compress, transfer toHost, write to disk (on the CPU); i++.]
  • Compress and transfer to disk (sketched below)
    • deviceToHost + Disk I/O
    • Performed every N steps
    • Can run in parallel with the next compute steps
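A rough host-side sketch of this loop; kernel and helper names are hypothetical, not the deck's code:

    /* Forward phase: compute on the GPU, snapshot every N steps */
    for (int i = 0; i < steps; i++) {
        rtm_kernel<<<grid, block>>>(wf_curr, wf_prev, velocity);   /* GPU */
        if (i % N == 0) {
            compress<<<grid, block>>>(wf_curr, compressed);
            /* deviceToHost + Disk I/O can overlap the next compute steps */
            cudaMemcpyAsync(host_buf, compressed, snap_bytes,
                            cudaMemcpyDeviceToHost, io_stream);
            enqueue_disk_write(host_buf);    /* handled by an I/O thread */
        }
        float *tmp = wf_prev; wf_prev = wf_curr; wf_curr = tmp;    /* t++ */
    }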
Outline
• Introduction
• Reverse Time Migration on CUDA
• General approach
→Disk I/O
• Domain decomposition
• Overlapping computation and communication
• GMAC at a glance
• Reverse Time Migration on GMAC
• Conclusions
Reverse Time Migration on CUDA
└ Disk I/O
• GPU → Disk transfers are very time-consuming
[Timeline: K1 K2 K3 K4 C, followed by toHost and Disk I/O; the snapshot path adds significant time around K5.]
• Transferring to disk can be overlapped with the next (compute-only) steps
[Timeline: K1 K2 K3 K4 C run on the GPU; toHost and Disk I/O run on the CPU, overlapping the next compute-only steps K5 K6 K7 K8.]
Reverse Time Migration on CUDA
└ Disk I/O
• Single transfer: wait for all the data to be in host memory
[Timeline: one long deviceToHost transfer, then the Disk I/O.]
• Multiple transfers: overlap deviceToHost transfers with disk I/O
  • Double buffering (sketched below)
[Timeline: small toH chunks interleaved with Disk I/O chunks.]
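A minimal double-buffering sketch under these assumptions (two page-locked host buffers, one stream; all names are illustrative):

    /* Chunk k+1 copies deviceToHost while chunk k is written to disk */
    void dump_to_disk(const float *dev, size_t chunk_bytes, int chunks, FILE *file)
    {
        float *buf[2];
        cudaStream_t s;
        cudaStreamCreate(&s);
        cudaHostAlloc((void **)&buf[0], chunk_bytes, cudaHostAllocDefault);
        cudaHostAlloc((void **)&buf[1], chunk_bytes, cudaHostAllocDefault);

        cudaMemcpyAsync(buf[0], dev, chunk_bytes, cudaMemcpyDeviceToHost, s);
        for (int k = 0; k < chunks; k++) {
            cudaStreamSynchronize(s);              /* chunk k is on the host */
            if (k + 1 < chunks)                    /* start the next copy    */
                cudaMemcpyAsync(buf[(k + 1) & 1],
                                dev + (size_t)(k + 1) * chunk_bytes / sizeof(float),
                                chunk_bytes, cudaMemcpyDeviceToHost, s);
            fwrite(buf[k & 1], chunk_bytes, 1, file);  /* overlaps the copy */
        }
        cudaFreeHost(buf[0]);
        cudaFreeHost(buf[1]);
        cudaStreamDestroy(s);
    }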
Reverse Time Migration on CUDA
└ Disk I/O
• CUDA-RT limitations
  • GPU memory is accessible by the owner host thread only
  → deviceToHost transfers must be performed by the compute thread
[Diagram: the compute thread sees the GPU address space; the I/O thread sees only the CPU address space.]
Reverse Time Migration on CUDA
└ Disk I/O
• CUDA-RT implementation (single transfer)
  • CUDA streams must be used to avoid blocking GPU execution
  → An intermediate page-locked buffer must be used: for real-size problems the system can run out of memory!
[Diagram: a full-size page-locked buffer staged between the GPU address space and the CPU address space.]
Reverse Time Migration on CUDA
└ Disk I/O
• CUDA-RT implementation (multiple transfers)
  • Besides launching kernels, the compute thread must program and monitor several deviceToHost transfers while executing the next compute-only steps on the GPU
  → Lots of synchronization code in the compute thread
[Diagram: the compute thread juggles many small GPU-address-space transfers into the CPU address space.]
Outline
• Introduction
• Reverse Time Migration on CUDA
• General approach
• Disk I/O
→Domain decomposition
• Overlapping computation and communication
• GMAC at a glance
• Reverse Time Migration on GMAC
• Conclusions
Reverse Time Migration on CUDA
└ Domain decomposition
• But… wait, real-size problems require > 16 GB of data!
• Volumes are split into tiles (along the Z-axis)
  • The 3D-Stencil introduces data dependencies between neighboring tiles
[Diagram: a volume split along the Z-axis into domains D1, D2, D3, D4.]
Reverse Time Migration on CUDA
└ Domain decomposition
• Multi-node execution may be required to overcome memory capacity limitations
  • Shared memory for intra-node communication
  • MPI for inter-node communication (see the sketch below)
[Diagram: Node 1 (GPU1–GPU4 over host memory) ↔ MPI ↔ Node 2 (GPU1–GPU4 over host memory).]
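For the inter-node part, the halo exchange with a neighboring node can be expressed as one MPI_Sendrecv per neighbor and time-step; a minimal sketch (rank layout and buffer names are illustrative assumptions):

    #include <mpi.h>

    /* Send our boundary plane and receive the neighbor's in one call */
    void exchange_halo(float *send_halo, float *recv_halo, int halo_count,
                       int neighbor_rank)
    {
        MPI_Sendrecv(send_halo, halo_count, MPI_FLOAT, neighbor_rank, 0,
                     recv_halo, halo_count, MPI_FLOAT, neighbor_rank, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }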
Reverse Time Migration on CUDA
└ Domain decomposition
• Data flow-graph (multi-domain)
[Diagram: one RTM Kernel → Compress pipeline per domain, operating on the wave-fields of domain 1 and domain 2; constant read-only data: velocity model, geophones' traces.]
Reverse Time Migration on CUDA
└ Domain decomposition
• Control flow-graph (multi-domain)
  • Boundary exchange every time-step
  • Inter-domain communication blocks execution of the next steps!
[Flowchart: i = 0; while i < steps: run the kernel, sync, exchange boundaries; if i % N == 0: compress, toHost, Disk I/O; i++. Kernel and compress run on the GPU, the rest on the CPU.]
Reverse Time Migration on CUDA
└ Domain decomposition
• Boundary exchange is needed every time-step
[Timeline: K1 X K2 X K3 X K4 X C K5 X K6 X K7 …; every kernel K is followed by an exchange X, with toHost and Disk I/O after the compress C.]
Reverse Time Migration on CUDA
└ Domain decomposition
• Single-transfer exchange
  • "Easy" to program, needs large page-locked buffers
[Timeline: a few large deviceToHost transfers, then the matching hostToDevice transfers, strictly serialized.]
• Multiple-transfer exchange to maximize PCI-Express utilization
  • "Complex" to program, needs smaller page-locked buffers
[Timeline: many small toH transfers overlapped with many small toD transfers, keeping both PCIe directions busy.]
Reverse Time Migration on CUDA
└ Domain decomposition
• CUDA-RT limitations
  • Each host thread can only access the memory objects it allocates
[Diagram: GPU1–GPU4 address spaces, each private to its owner thread, disjoint from the CPU address space.]
Reverse Time Migration on CUDA
└ Domain decomposition
• CUDA-RT implementation (single-transfer exchange)
  • Streams and page-locked memory buffers must be used
  • Page-locked memory buffers can be too big
[Diagram: GPU1–GPU4 address spaces staged through large page-locked buffers in the CPU address space.]
Reverse Time Migration on CUDA
└ Domain decomposition
• CUDA-RT implementation (multiple-transfer exchange)
  • Uses small page-locked buffers
  • More synchronization code
• Too complex to be represented using PowerPoint!
• Very difficult to implement in real code!
Outline
• Introduction
• Reverse Time Migration on CUDA
• General approach
• Disk I/O
• Domain decomposition
→Overlapping computation and communication
• GMAC at a glance
• Reverse Time Migration on GMAC
• Conclusions
Reverse Time Migration on CUDA
└ Overlapping computation and communication
• Problem: the boundary exchange blocks the execution of the following time-step
[Timeline: K1 X K2 X K3 X K4 X C K5 X K6 X K7 …; each exchange X serializes with the next kernel, and toHost/Disk I/O follow.]
Reverse Time Migration on CUDA
└ Overlapping computation and communication
• Solution: with a 2-stage execution plan we can effectively overlap the boundary exchange between domains
[Timeline: each time-step runs a small stage-1 kernel k (boundary points) followed by a large stage-2 kernel K (remaining points); the exchange X of a step overlaps with that step's K, so k1 K1 X, k2 K2 X, … proceed without stalls, while toHost and Disk I/O still overlap the compute.]
Reverse Time Migration on CUDA
└ Overlapping computation and communication
• Approach: two-stage execution
  • Stage 1: compute the wavefield points to be exchanged
[Diagram: GPU1 and GPU2 each compute the boundary planes of their tile first.]
Reverse Time Migration on CUDA
└ Overlapping computation and communication
• Approach: two-stage execution
  • Stage 2: compute the remaining points while exchanging the boundaries (see the sketch below)
[Diagram: GPU1 and GPU2 compute their interior points while the boundary planes travel across PCIe.]
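One two-stage time-step might look like this with CUDA streams; kernel and helper names are hypothetical:

    /* Stage 1: boundary points first, on their own stream */
    stencil_boundary<<<g1, b1, 0, s_boundary>>>(out, in, velocity);
    cudaStreamSynchronize(s_boundary);        /* halos are ready             */

    /* Stage 2: interior points overlap with the halo exchange */
    stencil_interior<<<g2, b2, 0, s_interior>>>(out, in, velocity);
    exchange_halos(out, neighbor);            /* PCIe traffic during stage 2 */
    cudaStreamSynchronize(s_interior);        /* time-step complete          */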
Reverse Time Migration on CUDA
└ Overlapping computation and communication
• But two-stage execution requires more abstractions and code complexity
  • An additional stream per domain
    • We already have 1 to launch kernels, 1 to overlap transfers to disk, 1 to exchange boundaries
→ At this point the code is a complete mess!
  • Requires 4 streams per domain, many page-locked buffers, lots of inter-thread synchronization
  • Poor readability and maintainability
  • Easy to introduce bugs
Outline
• Introduction
• Reverse Time Migration on CUDA
• GMAC at a glance
→Features
• Code examples
• Reverse Time Migration on GMAC
• Conclusions
GMAC at a glance
└ Introduction
• Library that enhances the host programming model of CUDA
• Freely available at http://code.google.com/p/adsm/
  • Developed by BSC and UIUC
  • NCSA license (BSD-like)
  • Works on Linux and Mac OS X (Windows version coming soon)
• Presented in detail tomorrow at 9 am @ San Jose Ballroom
GMAC at a glance
└ Features
• Unified virtual address space for all the memories in the system
  • Single allocation for shared objects
    • Special API calls: gmacMalloc, gmacFree
  • GPU memory allocated by a host thread is visible to all host threads (sketched below)
  → Brings POSIX thread semantics back to developers
[Diagram: shared data lives in a single address space spanning GPU and CPU memory, alongside CPU-only data.]
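A minimal sketch of what this enables, using the gmacMalloc/gmacFree calls from the slides (thread structure and names are illustrative):

    float *data;
    gmacMalloc((void **)&data, SIZE);      /* single allocation           */

    /* compute thread */
    kernel<<<Dg, Db>>>(data, SIZE);        /* same pointer inside kernels */
    gmacThreadSynchronize();

    /* any other host thread may use the very same pointer, e.g. for I/O */
    fwrite(data, SIZE, 1, file);

    gmacFree(data);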
GMAC at a glance
└ Features
• Parallelism exposed via regular POSIX threads
  • Replaces the explicit use of CUDA streams
  • OpenMP support
• GMAC uses streams and page-locked buffers internally
  • Concurrent kernel execution and memory transfers for free
[Diagram: several host threads driving a single GPU.]
GMAC at a glance
└ Features
• Optimized bulk memory operations via library interposition
  • File I/O
    • Standard I/O functions: fwrite, fread
    • Automatic overlap of Disk I/O with hostToDevice and deviceToHost transfers
  • Optimized GPU to GPU transfers via regular memcpy
  • Enhanced versions of the MPI send/receive calls
Outline
• Introduction
• Reverse Time Migration on CUDA
• GMAC at a glance
• Features
→Code examples
• Reverse Time Migration on GMAC
• Conclusions
GMAC at a glance
└ Examples
• Single allocation (and pointer) for shared objects
CUDA-RT:

    void compute(FILE *file, int size)
    {
        float *foo, *dev_foo;
        foo = malloc(size);
        fread(foo, size, 1, file);
        cudaMalloc(&dev_foo, size);
        cudaMemcpy(dev_foo, foo, size, ToDevice);
        kernel<<<Dg, Db>>>(dev_foo, size);
        cudaThreadSynchronize();
        cudaMemcpy(foo, dev_foo, size, ToHost);
        cpuComputation(foo);
        cudaFree(dev_foo);
        free(foo);
    }

GMAC:

    void compute(FILE *file, int size)
    {
        float *foo;
        foo = gmacMalloc(size);
        fread(foo, size, 1, file);
        kernel<<<Dg, Db>>>(foo, size);
        gmacThreadSynchronize();
        cpuComputation(foo);
        gmacFree(foo);
    }
GMAC at a glance
└ Examples
• Optimized support for bulk memory operations
(The code is the same as on the previous slide; the point here is the bulk memory operations: fread reads from disk straight into the shared allocation, and GMAC automatically overlaps the Disk I/O with the implicit hostToDevice transfer, with no intermediate buffer or explicit cudaMemcpy.)
Outline
• Introduction
• GMAC at a glance
• Reverse Time Migration on GMAC
→Disk I/O
• Domain decomposition
• Overlapping computation and communication
• Development cycle and debugging
• Conclusions
Reverse Time Migration on GMAC
└ Disk I/O
• CUDA-RT implementation (multiple transfers)
  • Besides launching kernels, the compute thread must program and monitor several deviceToHost transfers while executing the next compute-only steps on the GPU
  → Lots of synchronization code in the compute thread
[Diagram: the compute thread juggles GPU-address-space transfers into the CPU address space.]
Reverse Time Migration on GMAC
└ Disk I/O (GMAC)
• GMAC implementation
  • deviceToHost transfers performed by the I/O thread (sketched below)
  • deviceToHost and Disk I/O transfers overlap for free
  • Small page-locked buffers are used
[Diagram: GPU and CPU share a single global address space; the I/O thread reads the snapshot directly.]
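Such an I/O thread could be as simple as the sketch below; snapshot_t and wait_for_snapshot are hypothetical helpers, and s->data is the GPU-backed allocation obtained with gmacMalloc in the compute thread:

    #include <pthread.h>
    #include <stdio.h>

    void *io_thread(void *arg)
    {
        snapshot_t *s = (snapshot_t *)arg;
        while (wait_for_snapshot(s))              /* signaled by the compute thread */
            fwrite(s->data, s->size, 1, s->file); /* GMAC streams the data
                                                     deviceToHost through small
                                                     page-locked buffers, overlapped
                                                     with the disk write            */
        return NULL;
    }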
Outline
• Introduction
• GMAC at a glance
• Reverse Time Migration on GMAC
• Disk I/O
→Domain decomposition
• Overlapping computation and communication
• Development cycle and debugging
• Conclusions
Reverse Time Migration on GMAC
└ Domain decomposition (CUDA-RT)
• CUDA-RT implementation (single-transfer exchange)
  • Streams and page-locked memory buffers must be used
  • Page-locked memory buffers can be too big
[Diagram: GPU1–GPU4 address spaces staged through large page-locked buffers in the CPU address space.]
Reverse Time Migration on GMAC
└ Domain decomposition (GMAC)
• GMAC implementation (multiple-transfer exchange)
  • Exchange of boundaries performed using a simple memcpy! (sketched below)
[Diagram: GPU1–GPU4 share a unified global address space.]
  • Full PCIe utilization: internally GMAC performs several transfers and double buffering
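A sketch of the exchange itself (pointer names are illustrative):

    /* Both pointers are GPU-backed GMAC allocations on different GPUs;
       GMAC splits the copy into pipelined deviceToHost/hostToDevice
       transfers with internal double buffering */
    memcpy(neighbor_halo, my_boundary, HALO_BYTES);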
Outline
• Introduction
• GMAC at a glance
• Reverse Time Migration on GMAC
• Disk I/O
• Domain decomposition
→Overlapping computation and communication
• Development cycle and debugging
• Conclusions
Reverse Time Migration on GMAC
└ Overlapping computation and communication
• No streams, no page-locked buffers, similar performance: ±2%
CUDA-RT:

    readVelocity(velocity);
    cudaMalloc(&d_input, W_SIZE);
    cudaMalloc(&d_output, W_SIZE);
    cudaHostAlloc(&i_halos, H_SIZE);
    cudaHostAlloc(&disk_buffer, W_SIZE);
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaMemcpy(d_velocity, velocity, W_SIZE);
    for all time steps do
        launch_stage1(d_output, d_input, s1);
        launch_stage2(d_output, d_input, s2);
        cudaMemcpyAsync(i_halos, d_output, s1);
        cudaStreamSynchronize(s1);
        barrier();
        cudaMemcpyAsync(d_output, i_halos, s1);
        cudaThreadSynchronize();
        barrier();
        if (timestep % N == 0) {
            compress(output, c_output);
            transfer_to_host(disk_buffer);
            barrier_write_to_disk();
        }
        // ... Update pointers
    end for

GMAC:

    fread(velocity);
    gmacMalloc(&input, W_SIZE);
    gmacMalloc(&output, W_SIZE);
    for all time steps do
        launch_stage1(output, input);
        gmacThreadSynchronize();
        launch_stage2(output, input);
        memcpy(neighbor, output);
        gmacThreadSynchronize();
        barrier();
        if (timestep % N == 0) {
            compress(output, c_output);
            barrier_write_to_disk();
        }
        // ... Update pointers
    end for
Outline
• Introduction
• GMAC at a glance
• Reverse Time Migration on GMAC
• Disk I/O
• Domain decomposition
• Overlapping computation and communication
→Development cycle and debugging
• Conclusions
Reverse Time Migration on GMAC
└ Development cycle and debugging
• CUDA-RT
  • Start from a simple, correct sequential code
  • Implement kernels one at a time (3D-Stencil, Source insertion, Absorbing Boundary Conditions, Compression) and check correctness
    • Two allocations per data structure
    • Keep data consistency by hand (cudaMemcpy)
  • To introduce modifications to any kernel:
    • Two allocations per data structure
    • Keep data consistency by hand (cudaMemcpy)
Reverse Time Migration on GMAC
└ Development cycle and debugging
• GMAC
  • Allocate objects with gmacMalloc
    • Single pointer
  • Use the pointer both in the host and GPU kernel implementations (3D-Stencil, Source insertion, Absorbing Boundary Conditions, Compression)
    • No copies
Outline
• Introduction
• Reverse Time Migration on CUDA
• GMAC at a glance
• Reverse Time Migration on GMAC
• Conclusions
Conclusions
• Heterogeneous systems based on GPUs are currently the most appropriate to implement RTM
• CUDA has programmability issues
  • CUDA provides a good language to expose data parallelism in the code to be run on the GPU
  • The host-side interface provided by the CUDA-RT makes it difficult to implement even some basic optimizations
• GMAC eases the development of applications for GPU-based systems with no performance penalty
  • A single part-time programmer delivered the full RTM version in 6 months (5x speedup over the previous Cell implementation)
Acknowledgements
• Barcelona Supercomputing Center
• Repsol
• Universitat Politècnica de Catalunya
• University of Illinois at Urbana-Champaign
Thank you!
Questions?