How to realize high-performance compute with Multicore DSP

C667x Target Applications (Non-Telecom)
• Mission Critical
• Video Infrastructure
• Test and Automation
• Infrastructure Audio
• HPC, Imaging and Medical
• Emerging
• Others
• Emerging Broadband
• Innovations

TI Confidential – NDA Restrictions

RF and Communication Applications
Military & Defense
• Avionics Application
• ISR (Intelligence/Surveillance/Reconnaissance)
  o SIGINT/COMINT/Signal Generators
• Military Communications
  o SDR (JTRS) - Manpack/LMR/Fixed
  o Comm. Infra - VoIP/Video Gateways
• Satellite/Avionics Communications
  o Ground Receiver/Repeaters
  o Weather Radar
Govt & Public Safety
• FAA – Civil Aviation/Govt Comm.
• Conventional PS – TETRA/APCO/E911
  o Wireless Infrastructure
  o Comm. Infra - VoIP/Video Gateways
• Emerging Broadband (OFDM/LTE/WiMAX)
  o Utilities/Transport/Smart Grid

Key Customer Careabouts
• Long Term Partnership
• Financial Stability
• Strong Roadmap and R&D
• Floating Point Performance
• Size, Weight, and Power (SWaP)
• I/O Bandwidth
• Longevity of supply (10+ yrs)

RF and Comm. Product Requirements
End Product Need
• Support Multiple Waveforms
• Common Platform for TDMA/CDMA/OFDMA
• Multi-channel VoIP/Video capability
• Support FEC and Modulation
• TCP/IP Networking support
DSP Requirement
• Needs raw performance in terms of MIPS/GHz/MMACS
• Floating-point capable ISA to achieve "precision" and high GFLOPS
• Large on-chip RAM to reduce accesses to slow external memory
• High-speed external memory interface
• Large addressable memory
• Efficient DMA architecture
• Wireless-specific accelerators and TCP/IP offload

Imaging Product Requirements
End Product Need
• High-BW interface to RF front end and telecom ports
• Connect multiple DSPs on a board, e.g. in an ATCA card
• High-BW backplane and network connectivity
• Reliability in mission-critical designs
• Low-power design
• Ease of use
DSP Requirement
• Needs multiple high-speed interfaces
  – PCIe, Serial RapidIO
  – OBSAI/CPRI interface
  – Gigabit Ethernet, etc.
• Memory Error Correction & Checking (ECC)
• Efficient low-power DSPs
• Support for extended temperature ranges, from -40°C to 105°C, and others
• Dev and Debug Tools
• Multicore S/W Frameworks
• Signal/Image Processing functions
• VoIP Library
• Audio/Video Codecs

Introducing "Keystone Architecture" (C66x)
The Best Combination of Performance (GHz) and Power Consumption in the Industry
16 GFLOPS & 32 GMACS per Core @ 1 GHz

Next-Generation C66x DSP Core
• NEW multicore DSP C66x: fixed- and floating-point core @ 1.25 GHz, combining the C64x+ core (fixed pt) and the C67x core (floating pt)
• Fixed point: 4x C64x+ MAC (32)
• Floating point: 4x C67x floating-point MAC (8); 16 FLOP/cycle compared to 6 FLOP/cycle
• C64x+: lowest power, highest performance DSP core
• C67x: industry's lowest power FP DSP core; high precision and wide dynamic range
• 100% code compatible with all C64x (fixed) & C67x (floating) devices
• Similar power profiles as the C64x core
• Supported by Code Composer Studio IDE

KEYSTONE Architecture
The 8-core C6678, based on the C66x core, delivers 320 GMACS/160 GFLOPS @ 1.25 GHz/core (effectively a 10 GHz DSP).

Unmatched Performance
[Figure: BDTImark2000™ score bar charts. Panels: "BDTI Score for Floating Point Processors" and "BDTI Score for Fixed Point Processors". Devices compared: ADI 2116x, 2126x and 213xx (SHARC), ADI BF5xx (Blackfin), ADI TS201S and TS202S/203S (TigerSHARC), NEC uPD77050, Freescale MSC81xx (SC140), MSC814x (SC3400) and MSC815x (SC3850), Intel Pentium III, Renesas SH77xx (SH-4), TMS320C67x, TMS320C64x+, TMS320C66xx.]

Algorithm                                               C67x @ 300 MHz / C64x+ @ 1.2 GHz   C66x @ 1.25 GHz   Gain
Single Precision Floating Point FFT, 2048 pt, Radix 4   86.84 us                           14.00 us*         ~600%
Fixed Point FFT, 2048 pt, Radix 4                       8.23 us                            4.46 us*          ~200%
FIR Filter, 40 samples, 40 taps                         0.69 us                            0.34 us*          ~200%
Matrix Multiply 32 x 32                                 17.92 us                           6.16 us*          ~300%
Matrix Inverse 4 x 4                                    0.53 us                            0.13 us*          ~400%

TI Multicore KeyStone Architecture
• Highest Integration – Cost & Power
• Common Architecture – Portable Software
• Scalable – Tailored Solutions
• Navigator – Innovative Multi-core System Management
• Floating Point – Development Time
• Tools & Debugging – R&D Efficiency
• Quality Software – Solutions & Libraries
[Block diagram labels: Multicore Navigator; Network on Chip; C66x, ARM Processing Cores; Multicore Shared Memory Controller; Shared Memory; TeraNet 2; (Debug, Clocking, Power); Application Accelerators; High Speed I/O; HyperLink 50]
The first network on chip infrastructure to unleash full multicore entitlement

Product Highlights: C6670 and C6678
C6670 (Performance Optimized Core)
• Next Generation C66x Core
  – 4 C66x cores @ 1 GHz - 1.2 GHz
• Memory Architecture
  – 4MB local L2 (1MB per core)
  – 2MB Multicore Shared Memory
• Communication Accelerators
  – TCP3e (Turbo Encode) – up to 550 Mbps
  – TCP3d (Turbo Decode) – up to 600 Mbps
  – FFTC – 2048 FFT every 4.6 µs
  – VCP2 for voice channel decoding
C6678 (Power Optimized Core)
• Next Generation C66x Core
  – Up to 8 C66x cores @ 1 GHz - 1.25 GHz
  – Available options: 1, 2, 4, and 8 core devices
• Memory Architecture
  – 4MB local L2 (512KB per core)
  – 4MB Multicore Shared Memory
• Power Optimized Core
  – <10W at 1 GHz, nominal temp

[Block diagram labels, C6670: C66x DSP cores with L1/L2; Multicore Navigator; TeraNet; Communications CoProcessors (4x VCP2, 3x TCP3d, 2x RAC, 1x TAC, 3x FFTC, BCP); Network CoProcessors (Crypto); Multicore Shared Memory Controller (MSMC) with 2MB Shared Memory; DDR3 64b; System Elements (Power Management, SysMon, Debug, EDMA); Peripherals & IO (HyperLink)]
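As a quick sanity check on the headline numbers quoted in the slides above, the sketch below scales the per-core peaks (16 GFLOPS and 32 GMACS at 1 GHz) to a multicore device and recomputes the benchmark speedups from the quoted timings. It assumes ideal linear scaling with core count and clock, which is a peak figure and not a measured result. The names `device_peak`, `speedups` and `BENCHMARKS_US` are illustrative only and do not come from any TI tool. The Gain column appears consistent with the ratio of baseline time to C66x time, rounded to the nearest 100%.

```python
# Per-core peaks at 1 GHz, as quoted on the "Keystone Architecture" slide.
GMACS_PER_CORE_AT_1GHZ = 32.0
GFLOPS_PER_CORE_AT_1GHZ = 16.0


def device_peak(cores, clock_ghz):
    """Return (GMACS, GFLOPS, aggregate GHz), assuming ideal linear scaling."""
    scale = cores * clock_ghz
    return (GMACS_PER_CORE_AT_1GHZ * scale,
            GFLOPS_PER_CORE_AT_1GHZ * scale,
            scale)


# (baseline time in us, C66x time in us), copied from the benchmark table.
BENCHMARKS_US = {
    "SP floating-point FFT, 2048 pt, radix 4": (86.84, 14.00),
    "Fixed-point FFT, 2048 pt, radix 4": (8.23, 4.46),
    "FIR filter, 40 samples, 40 taps": (0.69, 0.34),
    "Matrix multiply 32 x 32": (17.92, 6.16),
    "Matrix inverse 4 x 4": (0.53, 0.13),
}


def speedups(table):
    """Map each benchmark name to baseline_time / c66x_time."""
    return {name: base / c66 for name, (base, c66) in table.items()}


if __name__ == "__main__":
    gmacs, gflops, ghz = device_peak(cores=8, clock_ghz=1.25)
    print(f"C6678 peak: {gmacs:.0f} GMACS, {gflops:.0f} GFLOPS, {ghz:.0f} GHz aggregate")
    for name, ratio in speedups(BENCHMARKS_US).items():
        print(f"{name}: {ratio:.2f}x")
```

Calling `device_peak(4, 1.2)` gives the same kind of estimate for a four-core part at 1.2 GHz; real application throughput depends on memory traffic and how well the work parallelizes.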
[Block diagram labels, C6670 (continued): Network CoProcessors (Packet Accelerator); Memory Subsystem; Peripherals & IO (SRIO x4, PCIe x2, AIF2 x6, SGMII x2, I2C, SPI, UART)]
[Block diagram labels, C6678: 8 x CorePac (C66x DSP cores with L1/L2); Multicore Navigator; TeraNet; Memory Subsystem (MSMC with 4MB Shared Memory, DDR3 64b); Network CoProcessors (Crypto, Packet Accelerator, IP Interfaces, GbE Switch, 2x SGMII); System Elements (Power Management, Debug, SysMon, EDMA); Peripherals & IO (HyperLink, SRIO x4, PCIe x2, EMIF 16, TSIP x2, I2C, SPI, UART)]

Innovation & Integration via C6678 DSP Highlights

Multicore Navigator
Data transfer engine architected to move data between system elements without using any CPU overhead, so maximum system efficiency is achieved.

C66x Core
Next-generation fixed/floating-point DSP core with clock speeds ranging from 1 GHz to 1.25 GHz and up to 8 core options.

HyperLink
Ultra high-speed (up to 50 Gbaud), low-latency serial interface that connects to other DSPs and FPGAs in the system.

Network Co-Processor and Accelerators
A cost-effective implementation to off-load the TCP/IP and secure networking functions from the DSP.

Improved Debug
S/W dev and debug support leveraged by CCS.

Memory Architecture
• 0.5 MB of local memory per core
• 4 MB of shared memory
• Enhanced memory architecture through an enhanced Multicore Shared Memory Controller
• Bottleneck-free fast on- and off-chip memory access, including a DDR3-1333MHz (64-bit) interface
• L1/L2/L3 ECC

TeraNet
Switch fabric with 2 terabits of bandwidth, which allows maximum data transfer between system components to realize full system entitlement.

Peripherals and I/O Interfaces
High-bandwidth peripherals that operate independently (NOT shared), allowing simultaneous data transfer to prevent bottlenecks, featuring:
• RapidIO v2.1 – 4 lanes @ 5 Gbps with 1x, 2x and 4x support
• PCIe x2 – 2 lanes, running independently of RapidIO

Competitive Analysis
Value Prop against FPGA
• C66x Performance
  – 320 GMACS/160 GFLOPS
  – Baseband on a chip. Handles multiple waveforms supporting OFDM, CDMA, TDM
  – L1/L2/L3 processing capability
  – Wireless accelerators (VCP/TCP/FFT)
• Software programmability – time to market
• Smaller package (more DSPs per board)
• Lower power – smaller battery, simpler cooling
Value Prop against other DSPs
• C66x fixed & floating point capability @ 1.25 GHz – industry's fastest DSP at 10 GHz
• On-chip RAM up to 8MB
• DDR3 – 1600MHz, 64-bit, 8GB address space
• Multiple independent high-speed IO – 4x sRIO v2.1, 2x PCIe Gen II, 2x SGMII, 2x TSIP
• High-BW FPGA connectivity – HyperLink @ 50 Gbps
• 1/2/4/8 core option (pin compatible)
• L1/L2/L3 memory ECC – system reliability
• Low power per GFLOPS and GMACS
• Extended temp support, -40°C to 105°C
• CCS tools + S/W collateral
• 3rd party network
• Low cost – MIPS/$

H/W Development Tools
TMDXEVM6678L EVM
• Single-wide AMC form factor
• C6678

Code Composer Studio™ IDE
• Design
• Code and Build
• Debug
• Analyze
• Tune
• CCSv5 allows designers of all experience levels to move quickly through application development (www.ti.com/ccstudio)
• Time-limited FREE evaluation versions available for download.
  Includes C667x Simulator.

EVM Kit includes
• BIOS 6.x
• BIOS-MCSDK / LINUX-MCSDK 2.0 (NDK, PDK, LIB etc.)
• Sample program and out-of-box (OOB) demo, e.g. I/O Benchmark, Image Processing Pipeline and High Performance DSP Utility Application (HUA)
• User Guide, Starter Guide, Tech Ref Guide, App Notes etc.

• TMDXEVM6678L – EVM with XDS100 emulation - $399
• TMDXEVM6678LE – EVM with XDS560V2 emulation - $599
• TMDXEVM6678LXE – EVM with XDS560V2 emulation, encryption enabled - $599
• TMDSEMU560v2STM-UE – XDS560v2 System Trace Emulator with 128Mb System Trace buffer and Ethernet/USB support
• Optional PCIe adapter card to connect the C6678 EVM to a standard PCI header of a desktop

TI's Multicore Hardware Ecosystem
[Figure labels: Chassis / System; Standardized Boards; PCI Express (with Gen 2); Advanced Mezzanine (AMC); ATCA; Custom; Others]

TI's Multicore Software Ecosystem
[Layer diagram labels: Customer Application; Multicore Entitlement; Layer 2+; Layer 1 UMTS; Layer 1 LTE; IP Network Stack; TI Runtime; TI's Device Entitlement Libraries; TI Layer 1 Libraries; TI BIOS, Linux, OSE(ck)]

Multicore Tools and Software (MC-SDK)
• Tools
  – Codegen with OpenMP support
  – Emulator/Debugger
  – Simulator
  – Profiler / DVT
  – 3rd party tools
• Software
  – BIOS/Linux SDK
    • Multicore Demonstration
    • 6.x DSP BIOS
  – Platform Abstraction
  – Basic Networking
  – Inter core communication
• Application Specific Libraries
  – Audio/Video CODECS
  – VoIP Components
  – WiMAX Toolkit, LTE Toolkit
  – DSPLib
• others..
[Tool-stack diagram labels: Eclipse; Code Composer Studio™ (Editor/IDE, Compiler, Linker (Codegen), Debugger, Remote Debug, Profiler, SoC Analyzer); Third Party Plug-Ins (Polycore, ENEA Optima, 3L); Multicore Software Development Kit (Demo Apps, Multicore BIOS, Multicore Linux, NDK, DSPLIB, IMGLIB, Speech Codec, Audio Codec, Video Codec, Inter Core Communication, Platform Development Kit); Operating System w/ Boot Loader (BIOS, Linux); DSP Customer Application; Multicore Entitlement; Full Silicon Entitlement]
[Diagram labels: Target Board; Host Computer; XDS 560 V2; XDS 560 Trace]

KeyStone Multicore Software – Libraries & Codecs

Libraries
Digital Signal Processing
• FFT
• Adaptive Filtering
• Filtering and convolution
• Others
• Available free from TI
Image Processing
• Edge Detection
• Boundary
• Morphology
• Others
• Available free from TI
Voice and Fax
• Line Echo Cancellation
• Voice Activity Detection
• Others
• Available free from TI
Security/Cryptography
• AES, SHA1, 3DES
Vision Analytics: Vision Lib (object only)
• 50+ royalty-free kernels:
  – Background modeling & subtraction
  – Object feature extraction
  – Tracking, recognition
  – Low-level pixel processing
MATLAB
• Image processing
• Math operations

Codecs
Voice
• G.711, G.722
• G.723, G.729
• CDMA, AMR (NB/WB), EVRC-B
• Others
Fax
• T.38
• Fax Modem
Video
• H.263
• H.264
• MPEG2
• MPEG4
• VC1/WMV9 Decode
• Others
Audio
• MPEG1 Layer2
• AAC LC/HE
• AC3 2.0/5.1
• Sample Rate Conversion

[Graphic labels: High-Performance and Multicore Processor; High Value; Keystone Architecture; High-Performance at the Right Power & Price; Low-Cost EVM; Open & Affordable Tools; Easy to Use; Training; Product Collateral; Drivers & Example Code; Quick to Market; User Community; Enabler Software; Quick-Start Hardware; Benchmarks & Functional Understanding; Frameworks & Abstraction; Generic Libraries; Application Libraries]

Getting Started – More Information/Links
• Product Folders:
  – C66X Informational Wiki Page
  – All C6000 Multicore DSPs
    • TMS320C6670
    • TMS320C6678
• EVMs and Software Tools:
  – TMS320C6678 EVM
  – TMS320C6670 EVM
  – AMC to PCIe Adapter Card
  – Multicore Software Development Kit for BIOS & Linux
    • MCSDK Wiki
    • CCS v5 Wiki
    • C66x Linux Wiki
  – DSP Signal Processing Library (DSPLIB)
  – Image and Video Processing Library (IMGLIB)
  – LTE/WiMAX Toolkit – Discuss with BDM
• Technical Support
  – TI E2E Community (Online Support)
  – Product Training
Online Video Training
http://focus.ti.com/docs/training/catalog/events/event.jhtml?sku=OLT110027

Mission Critical DSP Market
• Undisputed #1 DSP and SoC supplier
  – Strong growth for 8 years in a row, even in 2009
  – Higher R&D spending than DSP revenue of most competitors
• KeyStone SoC Architecture secures future success
  – Rich product portfolio & strong roadmap
  – 2 families with multiple devices and growing
    • Nyquist (6670), Shannon (6678/4/2)
    • 40nm -> 28nm
    • Tools/Software & Compilers
    • 3rd Party Eco-System
  – Multiple design wins pre-announcement
[Chart labels: Revenue; 2002; 2009; TI SoC Architecture; Layer 1; Macro; Pico; Femto; PHY; Software Radio; IP Network]

"What Customers Like about TI"
• Secure supply – no DSP product discontinuation (end of life)
• History of delivery upon promises (Power, GHz, ..)
• Field experience – completeness of system analysis, architecture, internal switch, ....
• Customer support
• Business model – long-term relationships with key customers
  – Actively seek and incorporate customer feedback in roadmap devices.

Backup Slides
Product Details

C6678 (Shannon) "Lightning" Half-Length PCIe Card
Feature Set
• TI TMS320C6678 (8-core) x 4
  – C66x core frequency: 1.25GHz
  – DDR3 memory
    – Data frequency: 1600MHz
    – Data bus width: 64-bit
  – Serial RapidIO Gen-2 interface
  – PCIe Gen-2 interface
  – 10/100/1000Mbps Ethernet w/ SGMII
  – Hyperlink50 interface
• 1024 MB DDR3-1333 on board
• PLX PEX8624 PCIe Gen-2 switch
• Serial RapidIO daisy-chain
• Ethernet daisy-chain
• Each DSP device is linked to the PCIe switch by x2 lanes
• Dual DSPs linked by Hyperlink50
• Power: max 54 Watts

What is Hyperlink?
"high-speed, low-latency, and low-pin-count communication interface"
• Low pin count (24 pins)
• Point-to-point connection
• Interconnect
  – DSP-to-DSP
  – DSP-to-FPGA
• SerDes for data transfer
  – x1 and x4 modes for Tx and Rx
  – 12.5 GBaud/lane
  – Effectively 8b9b encoding
• LVCMOS sideband signals for flow control & power mgmt
  – errors/events/timeouts
• Simple packet-based transfer protocol for memory-mapped access
• Read/Write to DSP/FPGA local memory
  – Up to 64 memory-mapped regions, each region up to 256MB
  – Discrete memory access of any byte-aligned width up to 64 bits
  – Burst transfer modes
• Write (maximum burst size 256 bytes)
  – Write Request --->
  – Data Packet --->
• Read (maximum burst size 256 bytes)
  – Read Request --->
  – Read Response
• Interrupt Request <-->

Universal Parallel Port (uPP)
• What is it?
  – Parallel bus, two independent channels (separate data buses)
  – I/O speeds up to 75 MHz with 8-16 bit data width per channel
  – 1 or 2 channel parallel interface operating in RX, TX or FD mode
  – Supports double data rate mode of operation (bandwidth does not change/increase)
• Application
  – Each channel can interface cleanly with high-speed ADCs and/or DACs with up to 16-bit data width (per channel).
  – Useful as a low-cost interface with FPGAs.
  – Can also be used to interface two C6655/57 devices or to connect C6655/57 with C674x or OMAP-L13x family of devices.
• Throughput Estimates:
  – Can run up to 120 MByte/s per channel in single-channel or bi-directional mode (240 MByte/s for both channels in unidirectional mode)
• Other benefits
  – Internal DMA – leaves CPU EDMA free
  – Simple protocol with few control pins (configurable: 2-4 per channel)
  – Multiple data packing formats for 9-15 bit data widths
  – Interleave mode (single channel only)
  – Simple interface: IO queued by software
Note: Max. clock of 50 MHz in (*) configuration

Thank You
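The HyperLink figures on the backup slide (12.5 GBaud per lane, x1 or x4 modes, "effectively 8b9b encoding", 256-byte maximum burst, 64 regions of up to 256MB) can be tied together with a little arithmetic. The sketch below is a rough estimate under stated assumptions: it treats 8b9b as 8 payload bits per 9 line bits and ignores packet headers and flow-control overhead, so the result is an upper bound, not a measured throughput. The function names `hyperlink_rates` and `bursts_needed` are hypothetical and are not part of any TI driver.

```python
from fractions import Fraction

LANE_RATE_GBAUD = Fraction(25, 2)    # 12.5 GBaud per lane, from the slide
ENCODING_EFFICIENCY = Fraction(8, 9)  # assumption: 8 payload bits per 9 line bits
MAX_BURST_BYTES = 256                 # maximum read/write burst size, from the slide
REGIONS = 64                          # memory-mapped regions, from the slide
REGION_BYTES = 256 * 2**20            # each region up to 256MB


def hyperlink_rates(lanes):
    """Return (raw GBaud, payload Gbit/s upper bound) for x1 or x4 mode."""
    if lanes not in (1, 4):
        raise ValueError("the slide lists only x1 and x4 modes")
    raw = LANE_RATE_GBAUD * lanes
    return raw, raw * ENCODING_EFFICIENCY


def bursts_needed(nbytes, max_burst=MAX_BURST_BYTES):
    """Number of bursts needed to move nbytes (ceiling division)."""
    return -(-nbytes // max_burst)


if __name__ == "__main__":
    raw, payload = hyperlink_rates(4)
    print(f"x4: {float(raw):.1f} GBaud raw, about {float(payload):.1f} Gbit/s payload at most")
    print(f"4 KB transfer: {bursts_needed(4096)} bursts")
    print(f"Mappable window: {REGIONS * REGION_BYTES // 2**30} GB")
```

In x4 mode the raw rate matches the 50 Gbaud headline quoted earlier in the deck; the payload estimate is lower because of the line encoding.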