High Performance Compute Platform Based on Multi-core DSP for Seismic Modeling and Imaging
Presenter: Murtaza Ali, Texas Instruments
Contributors:
Murtaza Ali, Eric Stotzer, Xiaohui Li, Texas Instruments
William Symes, Jan Odegard, Rice University
TI Information – Selective Disclosure
Outline
• Introduction to TI Multi-core DSP
• Brief review of IWAVE based seismic signal modeling
• Details and challenges of implementation
• Results and conclusions
A New Paradigm in High Performance Computing
• Industry-best floating point performance
– 16 Gflops/W
• Standard programming model
– supports MPI and OpenMP
• Wide range of applications
– from embedded systems to server blades
• Full ecosystem support
– Off-the-shelf PCIe and ATCA cards
– O/S and application software
Supported by a full set of development tools and the Code Composer Studio IDE
Shannon (TMS320C6678) – Block Diagram
• Multi-Core KeyStone SoC
• Fixed/Floating CorePac
• 8 CorePac @ 1.25 GHz
• 0.5 MB L2/core, 4.0 MB Shared L2
• 320G MAC, 160G FLOP, 60G DFLOPS
• 10 W
• Navigator
• Hardware Queue Manager with DMA
• Multicore Shared Memory Controller
• Low latency, high bandwidth memory access
• Network Coprocessor
• IPv4/IPv6 Network interface solution
• IPSec, SRTP, Encryption fully offloaded
• HyperLink
• 50G Baud Expansion Port
• Transparent to Software
[Block diagram: 8 x C66x DSP CorePac, each with L1 and L2; Memory Subsystem (DDR3 64b, Multicore Shared Memory Controller (MSMC), Shared Memory 4MB); Multicore Navigator; Network CoProcessors (Crypto, Packet Accelerator); TeraNet; GbE Switch with 2 x SGMII; System Elements (Power Management, SysMon, Debug, EDMA); Peripherals & IO (HyperLink 50, SRIO x4, PCIe x2, EMIF 16, TSIP 2x, I2C, SPI, UART)]
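The peak figures quoted for the C6678 can be cross-checked against each other with a few lines of arithmetic. The sketch below only uses numbers that appear on the slides (8 cores, 1.25 GHz, 160 GFLOP, 10 W); the variable names are illustrative.

```python
# Figures quoted on the slides for the TMS320C6678.
CORES = 8
CLOCK_GHZ = 1.25
PEAK_SP_GFLOPS = 160.0   # "160G FLOP"
POWER_W = 10.0           # "10W"

gflops_per_core = PEAK_SP_GFLOPS / CORES        # peak per core
flops_per_cycle = gflops_per_core / CLOCK_GHZ   # per core, per clock cycle
gflops_per_watt = PEAK_SP_GFLOPS / POWER_W      # compare with "16 Gflops/W"

print(gflops_per_core, flops_per_cycle, gflops_per_watt)
```

The result is consistent with the 16 Gflops/W figure quoted earlier in the deck.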
C66x – Core Architecture
• 8-issue VLIW architecture
– Can issue 8 instructions per cycle
• 2 data paths
– 4 units per data path
– L, S, D, M
• 64 registers (32 bit)
– 32 per data path
– Can be arranged in dual (64 bit) or quad (128 bit) registers
– Cross connect available
• Single Instruction Multiple Data (SIMD) available
– Dual or quad multiplies
TI DSP SW Resources
• Multicore Software Development Kit
– Peripheral drivers
– Demos for quick start
• OpenMP – alpha version released, example code available
• Linear Algebra Library (BLAS, LAPACK)
– Working with UT Austin to port “libflame” (LAPACK equivalent) to Shannon
• Optimized Libraries
– DSPLIB (math functions), ImageLib
– Medical Imaging SW Toolkit – Ultrasound, Optical Coherence, 3D Rendering
Shannon PCIe Development Cards
• 512 Gflops, 50 W, available now
• 1 Tera-flop, 120 W, available 1Q12
Seismic Modeling
• Focus of our current study: typical iteration in the forward sweep (essential part of modeling)
– Wave equation update
– Source addition
– Boundary condition
• Reverse Time Migration (RTM): typical iteration in the backward sweep (essential part of imaging)
– Wave equation update
– Receiver addition
– Boundary condition
– Imaging after iterations complete
• IWAVE: a framework to enable efficient and scalable finite difference simulation on a regular grid
– Includes seismic modeling and imaging
– Implements different wave equation updates
– Used for modeling and imaging
– Open source from Rice University
Inside wave update
• Based on velocity-stress PDE
• First order hyperbolic system
• 10th order finite difference method
[Diagram: the pressure components px, py, pz are updated from a linear combination of dvxdx, dvydy and dvzdz; the velocity components vx, vy, vz are updated from dpxdx, dpydy and dpzdz. Labels on the update blocks: epx/mpx, epy/mpy, epz/mpz, evx/mvx, evy/mvy, evz/mvz, lax, lay, laz.]
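As an illustration of what a 10th-order finite difference looks like in code, here is a minimal one-dimensional sketch. It assumes a staggered grid and the standard Taylor-series coefficients for a 10-point first derivative; the slides do not state which coefficients or grid layout the implementation uses, and the function name is hypothetical.

```python
from fractions import Fraction

# Taylor-series coefficients of the 10th-order staggered-grid first derivative
# (an assumption: the slides do not list the coefficients actually used).
COEFFS = [Fraction(19845, 16384), Fraction(-735, 8192), Fraction(567, 40960),
          Fraction(-405, 229376), Fraction(35, 294912)]

def staggered_derivative(f, i, h):
    """Approximate df/dx midway between samples i and i+1 of the list f."""
    total = 0.0
    for k, c in enumerate(COEFFS):
        # Pair the samples at distance (k + 1/2) * h on either side.
        total += float(c) * (f[i + 1 + k] - f[i - k])
    return total / h

h = 0.1
samples = [(j * h) ** 3 for j in range(12)]   # f(x) = x**3 on a uniform grid
x_mid = 5.5 * h                               # midpoint between samples 5 and 6
approx = staggered_derivative(samples, 5, h)
exact = 3 * x_mid ** 2
print(approx, exact)
```

Each output point reads ten input samples, which is why the differential kernels are dominated by loads.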
Kernel Implementations
• Identified four kernels to optimize for the core instruction architecture
– Differential in x-direction (first dimension)
– Differential in y or z-direction (orthogonal dimension)
– Update in x-direction
– Update in y or z directions
• Optimization trade-off at kernel level
– Memory access (load/store)
– Compute resource
– Load/store friendly
– Cache friendly (first dimension)
Compiler software pipelining report for one kernel:
;*      .L units                     0        0
;*      .S units                     0        0
;*      .D units                     8*       8*
;*      .M units                     5        7
;*      .X cross paths               3        2
;*      .T address paths             8*       8*
;*      ...
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 8  Schedule found with 4 iterations in parallel
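The reported iteration interval can be related to the resource table with a small calculation. The sketch below assumes that the two numeric columns are the two sides of the core, that each side has one unit of each listed kind, and that every use occupies the unit for one cycle; under those assumptions the busiest resource sets a lower bound on ii. The function name is hypothetical.

```python
# Per-iteration usage as reported by the compiler (two columns per resource).
usage = {
    ".L units":         (0, 0),
    ".S units":         (0, 0),
    ".D units":         (8, 8),
    ".M units":         (5, 7),
    ".X cross paths":   (3, 2),
    ".T address paths": (8, 8),
}

def resource_bound_ii(table):
    """Smallest ii that fits every resource, assuming one unit of each per side."""
    return max(max(pair) for pair in table.values())

bound = resource_bound_ii(usage)
# Resources that reach the bound are the ones limiting the schedule.
bottlenecks = [name for name, pair in usage.items() if max(pair) == bound]
print(bound, bottlenecks)
```

The resources that reach the bound are the ones marked with an asterisk in the report: the load/store units and address paths, not the multipliers.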
Kernel Results
• Kernels take between 1 and 3 cycles per cell
• Summing up the kernel numbers shows a capability of over 200 M cells/sec on an 8-core DSP running at 1 GHz
• Initial benchmarks carried out with all data kept in DDR3 memory
– OpenMP used to parallelize across cores
– Need better data movement strategy over DDR3
– Analyze bottlenecks of performance
• Assignment is based on z direction
[Figure: OpenMP threads running on each core, Core #0 to Core #7]
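Two pieces of arithmetic sit behind this slide: how a grid might be divided among the cores along z, and how many cycles per cell the kernels may spend in total to reach the quoted rate. The sketch below is illustrative only; the grid size is an arbitrary example, the even contiguous split is an assumption about how the work is assigned, and both function names are hypothetical.

```python
def z_slabs(nz, cores):
    """Split nz planes into contiguous [start, stop) ranges, one per core."""
    base, extra = divmod(nz, cores)
    slabs, start = [], 0
    for core in range(cores):
        # The first `extra` cores take one additional plane each.
        stop = start + base + (1 if core < extra else 0)
        slabs.append((start, stop))
        start = stop
    return slabs

def cycle_budget(cores, clock_hz, cells_per_sec):
    """Cycles per cell each core may spend to sustain cells_per_sec overall."""
    return cores * clock_hz / cells_per_sec

slabs = z_slabs(nz=100, cores=8)         # nz=100 is an arbitrary example size
budget = cycle_budget(8, 1.0e9, 200e6)   # 8 cores at 1 GHz, 200 M cells/sec
print(slabs, budget)
```

A rate above 200 M cells/sec on 8 cores at 1 GHz therefore corresponds to fewer than 40 cycles per cell summed over all kernels.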
Data Movement Strategy
• C66x architecture allows 3-D data movement using DMA
• Allows defining strides in two directions
• Limits on stride sizes restrict the shape of the transferred block
– May limit sub-domain definition
– A tall sub-domain will be most useful
• DMAs can be linked
– Multiple data transfers can be initiated
– Transfers continue without core intervention
• Compute can be overlapped with data movement
– Needs double buffering
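A 3-D transfer with two strides can be pictured as a nested copy over a flat array: contiguous elements along x, one stride between rows and another between planes. The sketch below models only the addressing; it is not a DMA driver, and the function name, parameter names and grid size are hypothetical.

```python
def gather_3d(src, base, count_x, count_y, count_z, stride_y, stride_z):
    """Copy a count_x * count_y * count_z block out of the flat list src.

    Elements along x are contiguous; stride_y and stride_z are the two
    strides (in elements) between rows and between planes.
    """
    dst = []
    for z in range(count_z):
        for y in range(count_y):
            row = base + z * stride_z + y * stride_y
            dst.extend(src[row:row + count_x])
    return dst

nx, ny, nz = 6, 5, 4                  # hypothetical grid size
grid = list(range(nx * ny * nz))      # each value equals its flat index
tile = gather_3d(grid, base=1 + 1 * nx + 1 * nx * ny,
                 count_x=3, count_y=2, count_z=2,
                 stride_y=nx, stride_z=nx * ny)
print(tile)
```

Because the sub-block is described by counts and strides alone, a limit on stride size translates directly into a limit on the grid dimensions or sub-domain shapes that a single transfer can cover.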
3-D differential calculation strategy
• Kernel operates on 4 lines simultaneously
• Core computation strategy: operate on a 4 x 4 x nx data set
• Determine x-differentials on the set of 16 lines
• Add y-differentials on a horizontal plane of 4 x nx, four times
• Add z-differentials on a vertical plane of 4 x nx, four times
[Figure: total data set needed, showing the lines read for the x-differential, y-differential and z-differential]
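The accumulation order described here can be sketched on a small tile. To keep the example short, a two-point centred difference stands in for the 10th-order stencil, so the halo is one cell wide instead of the wider halo the real kernels would need; the array layout and all names are assumptions made for illustration.

```python
TILE = 4  # lines per side of the tile, as on the slide

def diff(a, b, h):
    """Two-point centred difference of two lines (stand-in for the real stencil)."""
    return [(q - p) / (2 * h) for p, q in zip(a, b)]

def tile_divergence(vx, vy, vz, h):
    """vx, vy, vz are indexed [z][y][x] with a one-cell halo in every direction."""
    nx = len(vx[0][0]) - 2
    out = [[[0.0] * nx for _ in range(TILE)] for _ in range(TILE)]
    # 1. x-differentials on all 16 lines.
    for z in range(TILE):
        for y in range(TILE):
            line = vx[z + 1][y + 1]
            out[z][y] = diff(line[:-2], line[2:], h)
    # 2. y-differentials, one horizontal plane (4 lines) at a time.
    for z in range(TILE):
        for y in range(TILE):
            d = diff(vy[z + 1][y][1:-1], vy[z + 1][y + 2][1:-1], h)
            out[z][y] = [o + v for o, v in zip(out[z][y], d)]
    # 3. z-differentials, one vertical plane (4 lines) at a time.
    for y in range(TILE):
        for z in range(TILE):
            d = diff(vz[z][y + 1][1:-1], vz[z + 2][y + 1][1:-1], h)
            out[z][y] = [o + v for o, v in zip(out[z][y], d)]
    return out

nx, h = 5, 0.5

def field(fn):
    """Sample fn(x, y, z) on the tile plus its one-cell halo."""
    return [[[fn(x * h, y * h, z * h) for x in range(nx + 2)]
             for y in range(TILE + 2)] for z in range(TILE + 2)]

div = tile_divergence(field(lambda x, y, z: 2 * x),
                      field(lambda x, y, z: 3 * y),
                      field(lambda x, y, z: 5 * z), h)
```

With the linear test fields used here the three contributions are 2, 3 and 5, so every entry of `div` should equal their sum.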
Example of Data Movement
[Diagram: memory hierarchy used for data movement]
• CPU
• L1 (16K SRAM / 16K Cache)
• L2 (384K SRAM / 128K Cache)
• MSMC SRAM (shared by all cores)
• DDR
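Overlapping transfers with computation relies on two local buffers: while the kernels work on one, the next block is fetched into the other. The sketch below imitates that schedule with a single worker thread standing in for the DMA engine; `fetch` and `compute` are hypothetical placeholders, not TI APIs, and Python threads only illustrate the ordering, not the performance.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(tiles, index):
    """Stand-in for a DMA transfer from DDR into a local buffer."""
    return list(tiles[index])

def compute(buffer):
    """Stand-in for the wave-update kernels working on one local buffer."""
    return sum(buffer)

def process_all(tiles):
    results = []
    buffers = [None, None]                      # ping and pong buffers
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(fetch, tiles, 0)   # prefetch the first tile
        for i in range(len(tiles)):
            buffers[i % 2] = pending.result()   # wait for the current tile
            if i + 1 < len(tiles):              # start the next transfer...
                pending = dma.submit(fetch, tiles, i + 1)
            results.append(compute(buffers[i % 2]))  # ...while computing
    return results

totals = process_all([[1, 2, 3], [4, 5], [6]])
print(totals)
```

The core only waits when a transfer takes longer than the computation on the previous buffer, which is the situation a bandwidth-limited run ends up in.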
Results
• After implementing DMA data movement, performance went from 45 to 59 M cells/sec on a single 8-core C6678 multi-core DSP
• Performance limited by data transfers over DDR3
– Performance only went up to 63 M cells/sec when all computation was disabled
– Theoretical DDR3 bandwidth-limited performance is 120 M cells/sec @ 1330 MHz DDR3
– We are currently operating at about 50% of DDR3 bandwidth
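The ratios behind these results can be worked out directly from the quoted rates. The sketch below uses only the four numbers on this slide; the variable names are illustrative, and reading the last ratio as "the share of throughput lost to computation" is an interpretation rather than a statement from the slides.

```python
before, after = 45.0, 59.0   # M cells/sec without and with DMA data movement
transfer_only = 63.0         # M cells/sec with all computation disabled
ddr3_limit = 120.0           # M cells/sec, theoretical DDR3 bound at 1330 MHz

dma_gain = after / before - 1.0                     # relative gain from DMA
utilization = after / ddr3_limit                    # share of the DDR3 bound
transfer_utilization = transfer_only / ddr3_limit   # same, transfers only
compute_share = 1.0 - after / transfer_only         # drop when compute is on

print(round(dma_gain, 3), round(utilization, 3),
      round(transfer_utilization, 3), round(compute_share, 3))
```

The run with computation enabled reaches roughly half of the theoretical bound, and disabling computation recovers only a few percent, which is why the transfers rather than the kernels are identified as the limit.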
Future Activity
• Continued performance analysis
– Current measurements done with DDR3 clock rate of 1330 MHz
– Device capable of handling 1600 MHz, for a potential 20% improvement
– Further optimize parameters for maximum data transfer utilization
• Extend analysis to multiple DSP based PCI board
– MPI based message passing
– Side region data exchange
• Integrate with IWAVE framework
– Framework can run on host with main computation handled by DSP board(s)
• Add more complicated wave equation update
– Elastic modeling
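The 20% figure follows from the ratio of the two DDR3 clock rates. The projection below additionally assumes that throughput scales linearly with the DDR3 clock, which the slides do not claim; it is a rough estimate only.

```python
measured_clock = 1330.0   # MHz, DDR3 rate used for the measurements
max_clock = 1600.0        # MHz, rate the device can handle
measured_rate = 59.0      # M cells/sec at the measured clock

clock_gain = max_clock / measured_clock - 1.0   # relative clock increase
# Assumption: a bandwidth-bound run speeds up in proportion to the clock.
projected_rate = measured_rate * max_clock / measured_clock

print(round(clock_gain, 3), round(projected_rate, 1))
```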