High Performance Compute Platform Based on Multi-core DSP for Seismic Modeling and Imaging

Presenter: Murtaza Ali, Texas Instruments
Contributors: Murtaza Ali, Eric Stotzer, Xiaohui Li, Texas Instruments
William Symes, Jan Odegard, Rice University

TI Information – Selective Disclosure

Outline
• Introduction to TI multi-core DSP
• Brief review of IWAVE-based seismic signal modeling
• Details and challenges of implementation
• Results and conclusions

A New Paradigm in High Performance Computing
• Industry-best floating point performance
  – 16 Gflops/W
• Standard programming model
  – Supports MPI and OpenMP
• Wide range of applications
  – From embedded systems to server blades
• Full ecosystem support
  – Off-the-shelf PCIe and ATCA cards
  – O/S and application software
Supported by a full set of development tools and the Code Composer Studio IDE

Shannon (TMS320C6678) – Block Diagram
• Multi-core KeyStone SoC
• Fixed/Floating CorePac
  – 8 CorePac @ 1.25 GHz
  – 0.5 MB L2 per core, 4.0 MB shared L2
  – 320G MAC, 160G FLOP, 60G DFLOPS
  – 10 W
• Navigator
  – Hardware Queue Manager with DMA
• Multicore Shared Memory Controller
  – Low latency, high bandwidth memory access
• Network Coprocessor
  – IPv4/IPv6 network interface solution
  – IPSec, SRTP, encryption fully offloaded
• HyperLink
  – 50G Baud expansion port
  – Transparent to software
• IP Interfaces
[Block diagram: 8 x C66x DSP CorePac, each with L1 and L2; Memory Subsystem with 64-bit DDR3 and the Multicore Shared Memory Controller (MSMC) with 4 MB shared memory; Multicore Navigator; TeraNet; Network Coprocessors (Crypto, Packet Accelerator); GbE Switch with 2 x SGMII; System Elements (Power Management, SysMon, Debug, EDMA); Peripherals & IO (HyperLink 50, SRIO x4, PCIe x2, EMIF 16, TSIP 2x, I2C, SPI, UART)]

C66x – Core Architecture
• 8-issue VLIW architecture
  – Can issue 8 instructions per cycle
• 2 data paths
  – 4 units per data path: L, S, D, M
• 64 registers (32 bit)
  – 32 per data path
  – Can be arranged in dual (64 bit) or quad (128 bit) registers
  – Cross connect available
• Single Instruction Multiple Data (SIMD) available
  – Dual or quad multiplies

TI DSP SW Resources
• Multicore Software Development Kit
  – Peripheral drivers
  – Demos for quick start
• OpenMP
  – Alpha version released, example code available
• Linear Algebra Library (BLAS, LAPACK)
  – Working with UT Austin to port "libflame" (LAPACK equivalent) to Shannon
• Optimized Libraries
  – DSPLIB (math functions), ImageLib
  – Medical Imaging SW Toolkit: Ultrasound, Optical Coherence, 3D Rendering

Shannon PCIe Development Cards
• 512 Gflops card
  – 50 W
  – Available now
• 1 Tera-flop card
  – 120 W
  – Available 1Q12

Seismic Modeling
Focus of our current study
Typical iteration in forward sweep (essential part of modeling)
• Wave equation update
• Source addition
• Boundary condition
Reverse Time Migration (RTM): typical iteration in backward sweep (essential part of imaging)
• Wave equation update
• Receiver addition
• Boundary condition
• Imaging after iterations complete
IWAVE: a framework to enable efficient and scalable finite difference simulation on a regular grid
  – Includes seismic modeling and imaging
  – Implements different wave equation updates
  – Used for modeling and imaging
  – Open source from Rice University

Inside wave update
• Based on velocity-stress PDE
• First order hyperbolic system
• 10th order finite difference method
[Data-flow diagram. Pressure side: vx, vy, vz feed the x, y, z differentials dvxdx, dvydy, dvzdz, which pass through a linear combination (with lax, lay, laz) into the px, py, pz update blocks (with epx, mpx; epy, mpy; epz, mpz). Velocity side: px, py, pz feed the differentials dpxdx, dpydy, dpzdz, which pass into the vx, vy, vz update blocks (with evx, mvx; evy, mvy; evz, mvz).]
• Identified four kernels to optimize for the core instruction architecture
  – Differential in x-direction (first dimension)
  – Differential in y- or z-direction (orthogonal dimension)
  – Update in x-direction
  – Update in y- or z-direction

Kernel Implementations
Optimization trade-off at kernel level
• Memory access (load/store)
• Compute resource
• Load/store friendly
• Cache friendly (first dimension)
Compiler software-pipelining feedback:
  ;*      .L units                     0        0
  ;*      .S units                     0        0
  ;*      .D units                     8*       8*
  ;*      .M units                     5        7
  ;*      .X cross paths               3        2
  ;*      .T address paths             8*       8*
  ……………………..
  ;*
  ;*      Searching for software pipeline schedule at ...
  ;*         ii = 8  Schedule found with 4 iterations in parallel

Kernel Results
• Kernels take between 1 and 3 cycles per cell
• Summing up the kernel numbers shows a capability of over 200 M cells/sec on an 8-core DSP running at 1 GHz
• Initial benchmarks carried out with all data kept in DDR3 memory
  – OpenMP used to parallelize across cores
• Assignment is based on z direction
  – Need better data movement strategy over DDR3
  – Analyze bottlenecks of performance
[Figure: OpenMP threads running on each core, Core #0 through Core #7]

Data Movement Strategy
• C66 architecture allows 3-D data movement using DMA
• Allows defining strides in two directions
• Some limitations exist on sizes of strides, limiting shape
  – May limit sub-domain definition
  – A tall sub-domain will be most useful
• DMAs can be linked
  – Multiple data transfers can be initiated
  – Continued without core intervention
• Compute can be overlapped with data movement
  – Needs double buffering

3-D differential calculation strategy
• Kernel operates on 4 lines simultaneously
• Operate on a 4 x 4 x nx data set as the core computation strategy
• Determine x-differentials on the set of 16 lines
• Add y-differentials on a horizontal plane of 4 x nx, four times
• Add z-differentials on a vertical plane of 4 x nx, four times
[Figure: total data set needed; panels: x-differential, y-differential, z-differential]
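The "Inside wave update" slide names a 10th order finite difference method and an x-direction differential kernel, but does not list the stencil. A minimal sketch in plain C follows, assuming a staggered-grid first derivative with the standard 10th-order staggered weights. The grid layout, the weights, and the names `fd_coef` and `diff_x_staggered` are assumptions for illustration and are not taken from IWAVE or from the TI kernels.

```c
#include <assert.h>
#include <stddef.h>

#define FD_HALF_WIDTH 5u  /* a 10th-order stencil reads 5 points on each side */

/* Assumed weights: standard 10th-order staggered-grid coefficients.
 * The slides do not state which weights the real kernels use. */
static const float fd_coef[FD_HALF_WIDTH] = {
    19845.0f / 16384.0f,
    -735.0f / 8192.0f,
    567.0f / 40960.0f,
    -405.0f / 229376.0f,
    35.0f / 294912.0f
};

/* Hypothetical helper: d[i] approximates df/dx midway between samples
 * i and i+1 of one grid line along x (the contiguous first dimension).
 * inv_h is 1/h for grid spacing h. Entries of d too close to either end
 * of the line to fit the stencil are left untouched. */
void diff_x_staggered(const float *f, float *d, size_t nx, float inv_h)
{
    assert(nx >= 2 * FD_HALF_WIDTH);
    for (size_t i = FD_HALF_WIDTH - 1; i + FD_HALF_WIDTH < nx; ++i) {
        float acc = 0.0f;
        for (size_t m = 1; m <= FD_HALF_WIDTH; ++m) {
            /* symmetric pair at distance (m - 1/2) * h from the midpoint */
            acc += fd_coef[m - 1] * (f[i + m] - f[i + 1 - m]);
        }
        d[i] = acc * inv_h;
    }
}
```

Applied to a linear ramp f[i] = i*h, every interior output equals 1 up to rounding, which is a quick sanity check on the weights. This scalar loop only shows the arithmetic; a kernel tuned as described on the slides would process several lines at once with SIMD.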
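The "Kernel Results" slide states that OpenMP parallelizes the work across cores and that assignment is based on the z direction. A sketch of that pattern in C is shown here under stated assumptions: the pointwise update is a placeholder rather than the actual IWAVE update, and `update_field` and its parameters are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical placeholder update: p += coef * rhs over an nx*ny*nz grid
 * stored with x fastest and z slowest. Whole z-planes are divided among
 * the OpenMP threads, mirroring the z-based assignment on the slide. */
void update_field(float *p, const float *rhs, float coef,
                  size_t nx, size_t ny, size_t nz)
{
    assert(p != NULL && rhs != NULL);
#ifdef _OPENMP
#pragma omp parallel for schedule(static)
#endif
    for (long z = 0; z < (long)nz; ++z) {
        for (size_t y = 0; y < ny; ++y) {
            size_t base = ((size_t)z * ny + y) * nx;
            for (size_t x = 0; x < nx; ++x) {
                p[base + x] += coef * rhs[base + x];
            }
        }
    }
}
```

With a static schedule and 8 threads, each thread receives a contiguous slab of roughly nz/8 planes. When compiled without OpenMP support the guard removes the pragma and the loop simply runs serially.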
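The "Data Movement Strategy" slide notes that compute can be overlapped with data movement if double buffering is used. The control flow can be sketched as below. This is only an illustration: `dma_start` and `dma_wait` are hypothetical stand-ins implemented with a synchronous `memcpy`, not the EDMA API, and `process_block` is a placeholder kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for an asynchronous DMA engine. memcpy completes
 * immediately, so nothing overlaps here; only the ordering is shown. */
static void dma_start(float *dst, const float *src, size_t n)
{
    memcpy(dst, src, n * sizeof *dst);
}

static void dma_wait(void)
{
    /* a real implementation would block until the pending transfer ends */
}

/* Placeholder kernel: sums one block held in fast local memory. */
static float process_block(const float *buf, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i) s += buf[i];
    return s;
}

/* Ping-pong loop: while block b is processed from one buffer, block b+1
 * is fetched into the other buffer. */
float process_double_buffered(const float *ddr, size_t total,
                              float *ping, float *pong, size_t block)
{
    assert(block > 0 && total % block == 0);
    float *buf[2] = { ping, pong };
    size_t nblocks = total / block;
    float result = 0.0f;

    if (nblocks == 0) return result;
    dma_start(buf[0], ddr, block);              /* prefetch block 0 */
    for (size_t b = 0; b < nblocks; ++b) {
        dma_wait();                             /* block b has arrived */
        if (b + 1 < nblocks) {
            dma_start(buf[(b + 1) & 1], ddr + (b + 1) * block, block);
        }
        result += process_block(buf[b & 1], block);
    }
    return result;
}
```

On hardware with a real asynchronous DMA engine, the call to `process_block` would run while the next transfer is in flight, which is where the overlap comes from. The two buffers would typically live in on-chip memory, at the cost of halving the block size that fits.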
Example of Data Movement
[Figure: data movement through the memory hierarchy]
• CPU
• L1 (16K SRAM / 16K Cache)
• L2 (384K SRAM / 128K Cache)
• MSMC SRAM (shared by all cores)
• DDR

Results
• After implementing DMA data movement, performance went from 45 to 59 M cells/sec on a single 8-core C6678 multi-core DSP
• Performance is limited by data transfers over DDR3
  – Performance only went up to 63 M cells/sec when all computes are disabled
  – Theoretical DDR3 bandwidth-limited performance is 120 M cells/sec @ 1330 MHz DDR3
  – Currently we are operating at about 50% of DDR3 bandwidth

Future Activity
• Continued performance analysis
  – Current measurements done with DDR3 clock rate of 1330 MHz
  – Device capable of handling 1600 MHz -> 20% improvement
  – Optimize parameters further for maximum data transfer utilization
• Extend analysis to multiple-DSP based PCI board
  – MPI-based message passing
  – Side region data exchange
• Integrate with IWAVE framework
  – Framework can run on host with main computes being handled by DSP board(s)
• Add more complicated wave equation update
  – Elastic modeling
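The figures on the "Results" and "Future Activity" slides can be cross-checked with a few lines of arithmetic. The sketch below assumes that "1330 MHz" denotes the DDR3 data rate in mega-transfers per second and that the DDR3 interface is 64 bits wide, as labelled in the block diagram. The function names are hypothetical, and the bytes-per-cell value is only what those assumptions imply, not a number stated in the slides.

```c
#include <assert.h>

/* Peak DDR3 bandwidth in bytes/s for a data rate in MT/s and a bus width
 * in bits (assumed interpretation of the slide's "1330 MHz"). */
double ddr3_peak_bytes_per_sec(double mega_transfers_per_sec, int bus_width_bits)
{
    assert(bus_width_bits > 0);
    return mega_transfers_per_sec * 1.0e6 * (bus_width_bits / 8.0);
}

/* Fraction of the bandwidth-limited ceiling reached by a measured rate,
 * both given in M cells/sec. */
double fraction_of_ceiling(double measured_mcells, double ceiling_mcells)
{
    assert(ceiling_mcells > 0.0);
    return measured_mcells / ceiling_mcells;
}

/* DDR3 traffic per cell implied by a bandwidth-limited cell rate. */
double implied_bytes_per_cell(double peak_bytes_per_sec, double ceiling_mcells)
{
    assert(ceiling_mcells > 0.0);
    return peak_bytes_per_sec / (ceiling_mcells * 1.0e6);
}
```

With the slide values, `fraction_of_ceiling(59.0, 120.0)` is about 0.49, matching the "about 50%" statement, and the ratio 1600/1330 is about 1.20, matching the quoted 20% improvement. Under the stated assumptions the peak is about 10.6 GB/s, so the 120 M cells/sec ceiling would correspond to roughly 89 bytes of DDR3 traffic per cell.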