JACOBI ITERATIVE TECHNIQUE ON MULTI GPU PLATFORM
By Ishtiaq Hossain and Venkata Krishna Nimmagadda
APPLICATION OF JACOBI ITERATION

• Cardiac tissue is considered as a grid of cells.
• Each GPU thread takes care of the voltage calculation at one cell. This calculation requires the voltage values of neighboring cells.
• Two different models are shown in the bottom right corner.
• Vcell0 in the current time step is calculated using the values of the surrounding cells from the previous time step, to avoid synchronization issues:

  Vcell0^k = f(Vcell1^(k-1) + Vcell2^(k-1) + Vcell3^(k-1) + ... + VcellN^(k-1))

  where N can be 6 or 18.
APPLICATION OF JACOBI ITERATION

• Initial values are provided to start the computation.
• In a single time step, the ODE and PDE parts are sequentially evaluated and added.
• By solving the finite difference equations, the voltage value of every cell in a time step is calculated by a thread.
• Figure 1 shows a healthy cell's voltage curve over time.

Figure 1: A healthy cell's voltage curve over time.
THE TIME STEP

1. Solve the ODE part and add it to the current cell's voltage to obtain Vtemp1 for each cell.
2. Using Vtemp1 as the initial value, perform Jacobi iterations over the surrounding values to generate Vtemp2. A new Vtemp2 is generated in every iteration, for all cells in the grid, from the previous iteration's Vtemp2 values.
3. Once the iterations are completed, the final Vtemp2 is added to Vtemp1 to generate the voltage values for that time step.
CORRECTNESS OF OUR IMPLEMENTATION
MEMORY COALESCING

[Chart: execution time in milliseconds for the unaligned single-cell layout vs. the structure declared with __align__.]
Design of the data structure:

typedef struct __align__(N)
{
    int a[N];
    int b[N];
} NODE;
...
NODE nodes[N*N];

N*N blocks and N threads are launched so that all N threads access values in consecutive places.
SERIAL VS SINGLE GPU

[Charts: serial execution time vs. single-GPU execution time, both in seconds.]

• Serial is not helpful. Hey serial, what takes you so long?
• 128X128X128 gives us 309 secs.
• Enormous speed-up on a single GPU.
STEP 1 LESSONS LEARNT

• Choose a data structure that maximizes memory coalescing.
• The mechanics of serial code and parallel code are very different.
• Develop algorithms that address the areas where the serial code takes a long time.
MULTI GPU APPROACH

1. Multiple host threads creation: OpenMP is used to launch the host threads.
2. Establishing multiple host–GPU contexts: data partitioning and kernel invocation for GPU computation.
3. Solve the cell model ODE: the ODE is solved using the Forward Euler method.
4. Solve the communication model PDE: the PDE is solved using Jacobi iteration.
5. Visualize data.
INTER GPU DATA PARTITIONING

• Input data: a 2D array of structures; the structures contain arrays.
• The data resides in host memory.
• Let both cubes be of dimensions s X s X s.
• The interface region of the left one is 2s^2.
• The interface region of the right one is 3s^2.
• After division, the data is copied into the device memory (global) of each GPU.
SOLVING PDES USING MULTIPLE GPUS

• During each Jacobi iteration, threads use global memory to share data among themselves.
• Threads in the interface region need data from other GPUs.
• Inter-GPU sharing is done through host memory.
• A separate kernel is launched that handles the interface-region computation and copies the result back to device memory, so the GPUs stay synchronized.
• Once the PDE calculation is completed for one time step, all values are written back to host memory.

SOLVING PDES USING MULTIPLE GPUS

[Timeline: host-to-device copy, GPU computation, interface-region computation, device-to-host copy.]
THE CIRCUS OF INTER GPU SYNC

• Ghost cell computing! Pad with dummy cells at the inter-GPU interfaces to reduce communication.
• Let's make the other cores of the CPU work: 4 out of the 8 CPU cores hold contexts, so use the 4 free cores to do the interface computation.
• Simple is the best: launch new kernels with different dimensions to handle cells at the interface.
VARIOUS STAGES

[Chart: time split across inter-GPU sync, solving the PDE, solving the ODE, and memory copy.]

Solving the ODE and PDE takes most of the time. Interestingly, solving the PDE using Jacobi iteration eats up most of it.
SCALABILITY

[Chart: execution time for 1, 2, and 3 GPUs across configurations A, B, C, and D, where each configuration has 32X32X32 cells executed by each GPU.]
STEP 2 LESSONS LEARNT

• The Jacobi iterative technique looks pretty good in scalability.
• Interface selection is very important.
• Making a multi-GPU program generic takes a lot of effort on the programmer's side.
LET'S WATCH A VIDEO
Q&A