Leveraging Multicomputer Frameworks for Use in Multi-Core Processors

Yael Steinsaltz, ysteinsa@mc.com
Scott Geaghan, sgeaghan@mc.com
Myra Jean Prelle, mjp@mc.com
Brian Bouzas, bbouzas@mc.com
Michael Pepe, mpepe@mc.com

High Performance Embedded Computing Workshop, September 21, 2006
© 2005 Mercury Computer Systems, Inc.

Outline
• Introduction
• Channelizer Problem
• Preliminary Results
• Summary

Multi-Core Processors
• Multi-core processors range in architecture from two to four identical cores (Intel Xeon, Freescale 8641) to a single Manager plus several Workers on one die (IBM Cell Broadband Engine™ (BE) processor).
• Focusing on the IBM Cell BE processor, and using the standard presented at www.data-re.org, we implemented an API called the Multi-Core Framework (MCF).
• MCF is applicable across architectures as long as one process acts as a Manager; more established APIs would work as well.

Multi-Core Framework
• MCF is based on Mercury's prior implementation of the www.data-re.org standard, a product named the Parallel Acceleration System (PAS).
• Distributed data flows in a Manager-Worker fashion, enabling concurrent I/O and parallel processing.
• Function-offload model in which the user programs both the Manager and the Workers; MCF simplifies this development.
• Local store (LS) memory is used efficiently (the MCF kernel consumes less than 5%).
• Tasks run on the SPEs without Linux® overhead (thread creation is bypassed).

Data Movement
• Multi-buffered strip mining of N-dimensional data sets between the large main memory (XDR) and the small worker memories (a sketch of this pattern appears at the end of this deck).
• Provides for overlap and duplication when distributing data, as well as different partitionings.
• Data re-organization enables easy transfer of data between local stores.

Objective and Motivation
• Objective: develop a Cell BE based real-time signal acquisition system composed of frequency channelizers and signal detectors in a single ~6U slot.
• Motivation: benchmark computational density between PowerPCs, FPGAs, and the Cell BE for a typical streaming application.

The Channelizer Problem
• FM3TR signal (hopping, multi-waveform, multi-band).
• Channelization using a 16K real FFT with 75% overlap of the input (computation is signal-independent).
• Simple thresholding for detection of the active channels (computation is data-dependent).

Channelizer Problem
• The signal acquisition system separates a wide radio-frequency band into a set of narrow frequency bands.
• Implementation specifications:
  - 4:1 overlap buffer: 16K sample buffer -> 8K complex FFT.
  - Blackman window (embedded multipliers).
  - Log-magnitude threshold: an adjustable register and comparator determine detections.
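Note: the per-channel processing chain above (a 4:1 overlapped 16K-sample buffer, a Blackman window, a large real FFT, and a log-magnitude threshold) can be sketched in a few dozen lines of C. This is only an illustrative stand-in, not the implementation described in these slides: the real system used Mercury SAL calls on the PPC and Cell BE, FFTW's real-to-complex transform is used here in place of the 8K complex FFT formulation on the slides, and the threshold value and random input are placeholders.

/* Illustrative sketch of the channelizer chain: 4:1 overlapped 16K-sample
 * buffers, Blackman window, real FFT, log-magnitude threshold detection.
 * FFTW stands in for the SAL/SPE FFT used in the actual implementation.
 * Compile with: gcc chan_sketch.c -lfftw3f -lm
 */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <fftw3.h>

#define NFFT   16384            /* 16K real samples per transform            */
#define NBINS  (NFFT / 2 + 1)   /* bins produced by a real-to-complex FFT    */
#define HOP    (NFFT / 4)       /* 75% overlap -> advance 4K samples per FFT */

static const float TWO_PI = 6.28318530718f;

int main(void)
{
    static float samples[NFFT + 3 * HOP];    /* stand-in for streaming input */
    float         *window = malloc(sizeof(float) * NFFT);
    float         *in     = fftwf_malloc(sizeof(float) * NFFT);
    fftwf_complex *out    = fftwf_malloc(sizeof(fftwf_complex) * NBINS);
    fftwf_plan     plan   = fftwf_plan_dft_r2c_1d(NFFT, in, out, FFTW_MEASURE);
    float threshold_db    = 20.0f;           /* adjustable detection threshold */

    /* Blackman window weights (the Cell BE port substituted a constant
     * multiply because the weights did not fit in SPE local store). */
    for (int n = 0; n < NFFT; n++)
        window[n] = 0.42f - 0.50f * cosf(TWO_PI * n / (NFFT - 1))
                          + 0.08f * cosf(2.0f * TWO_PI * n / (NFFT - 1));

    for (int i = 0; i < NFFT + 3 * HOP; i++)
        samples[i] = (float)rand() / RAND_MAX - 0.5f;

    /* Slide one 16K-sample window across the input in 4K-sample hops. */
    for (int start = 0; start + NFFT <= NFFT + 3 * HOP; start += HOP) {
        for (int n = 0; n < NFFT; n++)
            in[n] = samples[start + n] * window[n];

        fftwf_execute(plan);

        /* Log-magnitude threshold: count channels above threshold_db. */
        int detections = 0;
        for (int k = 0; k < NBINS; k++) {
            float mag_db = 10.0f * log10f(out[k][0] * out[k][0] +
                                          out[k][1] * out[k][1] + 1e-20f);
            if (mag_db > threshold_db)
                detections++;
        }
        printf("buffer at sample %d: %d channels above %.1f dB\n",
               start, detections, threshold_db);
    }

    fftwf_destroy_plan(plan);
    fftwf_free(in);
    fftwf_free(out);
    free(window);
    return 0;
}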
Data Flow and Work Distribution
[Figure: a Manager thread of execution reads the input data and distributes it to a team of Channelizer workers, which perform the data-parallel math and produce the channelizer output; a second team of High Speed Alarm (HSA) workers produces the HSA output; the remaining processing elements are unused.]

Data Flow – Re-org
[Figure, shown as an animation across several slides in the original deck: channels flow from XDR memory into the Channelizer team's local stores and are then re-organized into the HSA team's local stores.]

Development Time and Hardware Use
• PPC: 22 PowerPCs were needed for the channelizer and 7 for the HSA; about 2 man-months of development.
• FPGA: one half of a Virtex-II Pro P70 FPGA (a quarter of a board); about 8 man-months; all of the math had to be developed, using some Xilinx cores.
• Cell BE: a single processor (half a board); about 4 man-weeks (using the same math and SAL calls as the PPC code).

Data Rates Tested
• The PPC implementation accepted data at 70, 80, and 105 MS/s (and is easily scalable).
• The FPGA implementation met data rates of 70 and 80 MS/s.
• The Cell BE implementation met data rates of 70, 80, and 105 MS/s.
• Windowing was not implemented on the Cell BE because there was insufficient local store for the weights. Adding it would require an extra 2-3 weeks of design modification to the data organization and channels. (Times were measured with a multiply by a constant to remain true to performance.)
• The math only began to impact data rates when fewer than 4 SPEs were used for the FFT; adding more SPEs did not yield additional speed.

Summary
• Morphing a library with a similar API onto a new architecture makes porting applications efficient.
• The hardware footprint (6U slots) is comparable to that of the FPGA implementation.
• The small size of the SPE local store is a significant factor in determining whether an application will port easily or require additional work.
• Mercury is fully cognizant of the architecture and works to reduce code size while benefiting from the large I/O bandwidth and fast processing capability of the Cell BE.
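Backup: a minimal sketch of the double-buffered strip-mining pattern referenced on the Data Movement slide, assuming a single worker and synchronous copies. None of the names below are the MCF API; fetch_strip() and write_strip() are hypothetical memcpy stand-ins for the asynchronous DMA transfers MCF would issue into SPE local store, and the strip size is an arbitrary choice meant to represent a buffer small enough for the 256 KB local store.

/* Double-buffered strip mining, sketched in portable C.  On the Cell BE,
 * fetch_strip()/write_strip() would be asynchronous DMA transfers into SPE
 * local store, overlapping communication with computation; here they are
 * synchronous memcpy stand-ins so the sketch runs anywhere. */
#include <string.h>
#include <stdio.h>

#define TOTAL_SAMPLES  (1 << 16)     /* "XDR" data set held by the manager  */
#define STRIP_SAMPLES  (1 << 12)     /* strip small enough for local store  */
#define NUM_STRIPS     (TOTAL_SAMPLES / STRIP_SAMPLES)

static float xdr_in[TOTAL_SAMPLES];  /* stand-in for main (XDR) memory      */
static float xdr_out[TOTAL_SAMPLES];

static void fetch_strip(float *ls, int strip)      /* "DMA" XDR -> local store */
{
    memcpy(ls, &xdr_in[strip * STRIP_SAMPLES], STRIP_SAMPLES * sizeof(float));
}

static void write_strip(const float *ls, int strip) /* local store -> XDR      */
{
    memcpy(&xdr_out[strip * STRIP_SAMPLES], ls, STRIP_SAMPLES * sizeof(float));
}

static void process(float *ls)                      /* worker math on a strip  */
{
    for (int i = 0; i < STRIP_SAMPLES; i++)
        ls[i] *= 2.0f;
}

int main(void)
{
    static float ls_buf[2][STRIP_SAMPLES];           /* ping-pong local buffers */

    for (int i = 0; i < TOTAL_SAMPLES; i++)
        xdr_in[i] = (float)i;

    fetch_strip(ls_buf[0], 0);                       /* prime the pipeline      */
    for (int strip = 0; strip < NUM_STRIPS; strip++) {
        int cur = strip & 1, nxt = cur ^ 1;

        if (strip + 1 < NUM_STRIPS)                  /* start the next transfer */
            fetch_strip(ls_buf[nxt], strip + 1);

        process(ls_buf[cur]);                        /* while processing this one */
        write_strip(ls_buf[cur], strip);
    }

    printf("processed %d strips of %d samples\n", NUM_STRIPS, STRIP_SAMPLES);
    return 0;
}

With asynchronous DMA in place of the memcpy calls, the fetch of strip n+1 overlaps the math on strip n, which is the concurrent I/O and parallel processing the Data Movement and Multi-Core Framework slides refer to.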