Computing Engine Choices

• General Purpose Processors (GPPs): Intended for general purpose computing (desktops, servers, clusters, ...)
  – General purpose ISAs (RISC or CISC)
• Application-Specific Processors (ASPs): Processors with ISAs and architectural features tailored towards specific application domains
  – e.g. Digital Signal Processors (DSPs), Network Processors (NPs), Media Processors, Graphics Processing Units (GPUs), Vector Processors(?), ...
  – Special purpose ISAs
• Co-Processors: A hardware (hardwired) implementation of specific algorithms with a limited programming interface (augment GPPs or ASPs)
• Configurable Hardware:
  – Field Programmable Gate Arrays (FPGAs)
  – Configurable array of simple processing elements
• Application Specific Integrated Circuits (ASICs): A custom VLSI hardware solution for a specific computational task

Note on processors: the ISA forms an abstraction layer that sets the requirements for both compiler and CPU designers.

The choice of one or more depends on a number of factors including:
- Type and complexity of computational algorithm (general purpose vs. specialized)
- Desired level of flexibility and programmability
- Performance requirements
- Desired level of computational efficiency (e.g. computations per watt or computations per chip area)
- Expected useful lifecycle of computing element or system
- Power requirements
- Development time and cost
- Real-time constraints
- System cost

(Repeated here from lecture 1)

EECC722 - Shaaban, lec # 8, Fall 2012, 10-10-2012

Computing Engine Choices

[Figure: computing engine choices arranged from software (processors) to hardware: General Purpose Processors (GPPs), Application-Specific Processors (ASPs), Configurable Hardware, Co-Processors, Application Specific Integrated Circuits (ASICs). One axis is labeled Programmability / Flexibility; the other is labeled Specialization, Development cost/time, Performance/Chip Area/Watt (Computational Efficiency).]

Processor = Programmable computing element that runs programs written using a pre-defined set of instructions (ISA).

For Application-Specific Processors (ASPs): Application Domain Requirements → ASP ISA → ASP Architecture
  – e.g. Digital Signal Processors (DSPs), Network Processors (NPs), Media Processors, Graphics Processing Units (GPUs), Physics Processor ….

Selection Factors:
- Type and complexity of computational algorithm (general purpose vs. specialized)
- Desired level of flexibility and programmability
- Performance requirements
- Desired level of computational efficiency
- Power requirements
- Real-time constraints
- Development time and cost
- System cost

(Repeated here from lecture 1)

Why Application-Specific Processors (ASPs)?

Computing element choices observation:
• Generality and efficiency (i.e. computational efficiency) are in some sense inversely related to one another:
  – The more general-purpose a computing element is, and thus the greater the number of tasks it can perform, the less efficient (e.g. computations per chip area/watt) it will be in performing any of those specific tasks.
  – Design decisions are therefore almost always compromises; designers identify key features or requirements of applications that must be met and make compromises on other less important features.
• To counter the problem of computationally intense and specialized problems for which general purpose processors/machines cannot achieve the necessary performance/other requirements (ASPs):
  – Special-purpose processors (or Application-Specific Processors, ASPs), attached processors, and coprocessors have been designed/built for many years, for specific application domains, such as image or digital signal processing (for which many of the computational tasks are specialized and can be very well defined).

Generality = Flexibility = Programmability?
Efficiency = Computational Efficiency (computations per watt or chip area)

Digital Signal Processor (DSP) Architecture

• Classification of Main Processor Types/Applications
• Requirements of Embedded Processors (DSPs are often embedded)
• DSP vs. General Purpose CPUs
• DSP Cores vs. Chips
• Classification of DSP Applications
• DSP Algorithm Format
• DSP Benchmarks
• Basic Architectural Features of DSPs
• DSP Software Development Considerations
• Classification of Current DSP Architectures and example DSPs (by DSP generation):
  – Generations 1-2, Conventional DSPs: TI TMS320C54xx
  – Generation 3, Enhanced Conventional DSPs: TI TMS320C55xx
  – Generation 4, Multiple-Issue DSPs:
    • VLIW DSPs: TI TMS320C62xx, TMS320C64xx
    • Superscalar DSPs: LSI Logic ZSP400/500 DSP core

Main Processor Types/Applications

• General Purpose Computing & General Purpose Processors (GPPs) (64 bit)
  – High performance: In general, faster is always better.
  – RISC or CISC: Intel P4, IBM Power4, SPARC, PowerPC, MIPS ...
  – Used for general purpose software
  – End-user programmable
  – Real-time performance may not be fully predictable (due to dynamic arch. features)
  – Heavy weight, multi-tasking OS - Windows, UNIX
  – Normally, low cost and power not a requirement (changing)
  – Servers, Workstations, Desktops (PCs), Notebooks, Clusters …
• Embedded Processing: Embedded processors and processor cores (16-32 bit)
  – Cost, power, code-size and real-time requirements and constraints
  – Once real-time constraints are met, a faster processor may not be better
  – e.g: Intel XScale, ARM, 486SX, Hitachi SH7000, NEC V800...
  – Often require digital signal processing (DSP) support or other application-specific support (e.g. network, media processing)
  – Single or few specialized programs, known at system design time
  – Not end-user programmable
  – Real-time performance must be fully predictable (avoid dynamic arch. features)
  – Lightweight, often real-time OS or no OS
  – Examples: Cellular phones, consumer electronics, …
• Microcontrollers (8 bit)
  – Extremely code size/cost/power sensitive
  – Single program
  – Small word size - 8 bit common
  – Usually no OS
  – Highest volume processors by far
  – Examples: Control systems, automobiles, industrial control, thermostats, ...

Slide annotations: Examples of Application-Specific Processors (ASPs); Increasing Cost/Complexity (toward GPPs); Increasing volume (toward microcontrollers).

The Processor Design Space (Main Types)

[Figure: processor design space plotted as Performance versus Processor Cost (chip area, power, complexity). Regions: Microcontrollers (cost is everything); Embedded processors (real-time constraints, specialized applications, low power/cost constraints; application specific architectures for performance); Microprocessors / GPPs (performance is everything & software rules). Label: Examples of ASPs.]

Requirements of Embedded Processors

• Usually must meet strict real-time constraints:
  – Real-time performance must be fully predictable:
    • Avoid dynamic processor architectural features that make real-time performance harder to predict (e.g. cache, dynamic scheduling, hardware speculation …)
  – Embedded processors, how fast? Once real-time constraints are met, a faster processor is not desirable (overkill) due to increased cost/power requirements.
• Optimized for a single (or few) program(s) - code often in on-chip ROM or on/off chip EPROM/flash memory.
• Minimum code size (one of the motivations initially for Java)
• Performance obtained by optimizing datapath
• Low cost
  – Lowest possible area
    • High computational efficiency: computation per unit area
  – VLSI implementation technology usually behind the leading edge (Good or bad?)
  – High level of integration of peripherals (System-on-Chip, SoC, approach reduces system cost/power)
• Fast time to market
  – Compatible architectures (e.g. ARM family) allow reusable code
  – Customizable cores (System-on-Chip, SoC).
• Low power if application requires portability

Embedded Processors

Area of processor cores = Cost (and power requirements). Thus need to minimize chip area.

[Figure: processor core areas; labels: Embedded version of a GPP, Nintendo processor, Cellular phones.]

Embedded Processors

Another figure of merit: computation per unit chip area (computational efficiency).

[Figure: computation per unit chip area; labels: Embedded version of a GPP, Nintendo processor, Cellular phones.]

Embedded Processors: Code size

• Code size: smaller is better
• If a majority of the chip is the program stored in ROM, then minimizing code size is a critical issue
• How? Common embedded processor ISA features to minimize code size (CISC-like?):
  1 – Variable length instruction encoding common:
    • e.g. the Piranha has 3 sized instructions - basic 2 byte, and 2 byte plus 16 or 32 bit immediate
  2 – Complex/specialized instructions
  3 – Complex addressing modes

Embedded Systems vs. General Purpose Computing

Embedded Systems (and embedded processors) | General Purpose Computing Systems (and processors, GPPs)
Run a single or few specialized applications often known at system design time | Used for general purpose software: intended to run a fully general set of applications that may not be known at design time
May require application-specific capability (e.g. DSP) | No application-specific capability required
Not end-user programmable | End-user programmable
Minimum code size is highly desirable | Minimizing code size is not an issue
Lightweight, often real-time OS or no OS | Heavy weight, multi-tasking OS - Windows, UNIX
Low power and cost constraints/requirements | Higher power and cost constraints/requirements
Usually must meet strict real-time constraints (e.g. real-time sampling rate) | In general, no real-time constraints
Thus real-time performance must be fully predictable: avoid dynamic processor architectural features that make real-time performance harder to predict | Thus real-time performance may not be fully predictable (due to dynamic processor architectural features): superscalar: dynamic scheduling, hardware speculation, branch prediction, cache
Once real-time constraints are met, a faster processor is not desirable (overkill) due to increased cost/power requirements | Faster (higher-performance) is always (usually) better

Evolution of GPPs and DSPs

• General Purpose Processors (GPPs) trace roots back to Eckert, Mauchly, Von Neumann (ENIAC) + EDSAC (first generation processors)
• Digital Signal Processors (DSPs) are microprocessors designed for efficient mathematical manipulation of digital signals utilizing digital signal processing algorithms.
  – DSPs usually process infinite continuous sampled (digitized) data streams (physical signals) while meeting real-time and power constraints.
  – DSPs evolved from Analog Signal Processors (ASPs) that utilize analog hardware to transform physical signals (classical electrical engineering)
  – ASP to DSP because:
    • DSP insensitive to environment (e.g., same response in snow or desert if it works at all)
    • DSP performance identical even with variations in components; the behavior of 2 analog systems varies even if built with the same components with 1% variation
• Different history and different application requirements led to different ISA design considerations, terms, different metrics, architectures, some new inventions.
  – i.e. for Application-Specific Processors (ASPs): Application Domain Requirements → ASP ISA → ASP Architecture

DSP vs. General Purpose CPUs

• DSPs tend to run one (or few) program(s), not many programs.
  – Hence OSes (if any) are much simpler, there is no virtual memory or protection, ...
• DSPs usually run applications with hard real-time constraints (similar to other embedded processors):
  – DSP performance requirements: the DSP must meet application signal sampling rate computational requirements:
    • Once the above real-time constraints are met, a faster DSP is overkill (higher DSP cost, power, ...) without additional benefit.
  – You must account for anything that could happen in a time slot (DSP algorithm inner-loop, data sampling rate)
  – All possible interrupts or exceptions must be accounted for and their collective time be subtracted from the time interval.
    • Therefore, exceptions are BAD.
• DSPs usually process infinite continuous data streams:
  – Requires high memory bandwidth (with predictable latency, e.g. no data cache) for streaming real-time data samples and predictable processing time on the data samples
• The design of DSP ISAs and processor architectures is driven by the requirements of DSP algorithms.
  – Thus DSPs are application-specific processors
  – DSP Algorithms → DSP ISAs → DSP Architectures

DSP vs. GPP

• The “MIPS/MFLOPS” of DSPs is the speed of Multiply-Accumulate (MAC), i.e. the main performance measure of DSPs is MAC speed. Why?
  – MAC is common in DSP algorithms that involve computing a vector dot product, such as digital filters, correlation, and Fourier transforms.
  – DSPs are judged by whether they can keep the multipliers busy 100% of the time and by how many MACs are performed in each cycle.
• The "SPEC" of DSPs is 4 algorithms:
  – Infinite Impulse Response (IIR) filters
  – Finite Impulse Response (FIR) filters
  – FFT, and
  – convolvers
• In DSPs, target algorithms are important (since DSPs are application domain specific processors):
  – Binary compatibility not a major issue, unlike general purpose
• High-level software is not as important in DSPs as in GPPs.
  – People still write in assembly language for a product to minimize the die area for ROM in the DSP chip (code size) and improve performance.

Note: While this is still mostly true, programming for DSPs in high level languages (HLLs) has been gaining more acceptance due to the development of more efficient HLL DSP compilers in recent years.

Types of DSP Processors According to Type of Arithmetic/Operand Size Supported

• 32-bit floating point (5% of DSP market) (or 24 bit). Examples:
  – TI TMS320C3X, TMS320C67xx (VLIW)
  – AT&T DSP32C
  – Analog Devices ADSP21xxx
  – Hitachi SH-4
• 16-bit fixed point (95% of DSP market). Examples:
  – TI TMS320C2X, TMS320C62xx (VLIW)
  – Infineon TC1xxx (TriCore1) (VLIW)
  – Motorola DSP568xx, MSC810x (VLIW)
  – Analog Devices ADSP21xx
  – Agere Systems DSP16xxx, Starpro2000
  – LSI Logic LSI140x (ZSP400) superscalar
  – Hitachi SH3-DSP
  – StarCore SC110, SC140 (VLIW)

DSP Cores vs. Chips

DSPs are usually available as synthesizable cores or off-the-shelf packaged chips.

• Synthesizable cores (IP):
  – Map into chosen fabrication process
    • Speed, power, and size vary
  – Choice of peripherals, etc. (SoC = System on Chip)
  – Requires extensive hardware development effort, resulting in more development time and cost (very high volume needed to justify development cost)
• Off-the-shelf packaged chips:
  – Highly optimized for speed, energy efficiency, and/or cost.
  – Lower development time/cost/effort.
  – Tools, 3rd-party support often more mature.
  – Faster time to market.
  – Limited performance, integration options.

[Figure: DSP chip organization; blocks: DSP Core, Instruction Memory, Data Memory, A/D Converter, D/A Converter, Serial Ports.]

DSP Architecture: Enabling Technologies

Generation | Time Frame | Approach | Primary Application | Enabling Technologies
 | Early 1970's | Discrete logic | Non-real time processing; Simulation | Bipolar SSI, MSI; FFT algorithm
 | Late 1970's | Building block | Military radars; Digital Comm. | Single chip bipolar multiplier; Flash A/D
1 | Early 1980's | Single chip DSP µP | Telecom; Control | µP architectures; NMOS/CMOS
2 | Late 1980's | Function/Application specific chips | Computers; Communication | Vector processing; Parallel processing
3 | Early 1990's | Multiprocessing | Video/Image Processing | Advanced multiprocessing; VLIW, MIMD, etc.
4 | Late 1990's | Single-chip multiprocessing | Wireless telephony; Internet related | Low power single-chip DSP; VLIW/Multiprocessing

Generations 1-4 are the generations of single-chip (microprocessor) DSPs. First microprocessor DSP: TI TMS32010.

Texas Instruments TMS320 Family: Multiple DSP µP Generations

Uniprocessor based (Harvard architecture):

Generation | Device | First Sample | Bit Size | Clock speed (MHz) | Instruction Throughput | MAC execution (ns) | MOPS | Device density (# of transistors)
1 | TMS32010 | 1982 | 16 integer | 20 | 5 MIPS | 400 | 5 | 58,000 (3 µm)
2 | TMS320C25 | 1985 | 16 integer | 40 | 10 MIPS | 100 | 20 | 160,000 (2 µm)
3 | TMS320C30 | 1988 | 32 flt. pt. | 33 | 17 MIPS | 60 | 33 | 695,000 (1 µm)
4 | TMS320C50 | 1991 | 16 integer | 57 | 29 MIPS | 35 | 60 | 1,000,000 (0.5 µm)
 | TMS320C2XXX | 1995 | 16 integer | | 40 MIPS | 25 | 80 |

Multiprocessor (VLIW) based:

Device | First Sample | Bit Size
TMS320C80 | 1996 | 32 integer/flt.
TMS320C62XX | 1997 | 16 integer
TMS320C67XX | 1997 | 32 flt. pt.

Remaining entries for the multiprocessor-based devices, as they appear on the slide: MIMD; 5; 2 GOPS; 120 MFLOP; 20 GOPS; 5; 1 GFLOP; VLIW; 1600 MIPS.

Generations of single-chip (microprocessor) DSPs.

DSP Applications

• Digital audio applications
  – MPEG Audio
  – Portable audio
• Digital cameras
• Cellular telephones
• Wearable medical appliances
• Storage products:
  – disk drive servo control
• Military applications:
  – radar
  – sonar
• Industrial control
• Seismic exploration
• Networking (telecom infrastructure):
  – Wireless
  – Base station
  – Cable modems
  – ADSL
  – VDSL
  – ...

Current DSP killer applications: cell phones and telecom infrastructure. HDTV? Other?
DSP Algorithms & Applications

DSP Algorithm | System Application
Speech Coding | Digital cellular telephones, personal communications systems, digital cordless telephones, multimedia computers, secure communications
Speech Encryption | Digital cellular telephones, personal communications systems, digital cordless telephones, secure communications
Speech Recognition | Advanced user interfaces, multimedia workstations, robotics, automotive applications, cellular telephones, personal communications systems
Speech Synthesis | Advanced user interfaces, robotics
Speaker Identification | Security, multimedia workstations, advanced user interfaces
High-fidelity Audio | Consumer audio, consumer video, digital audio broadcast, professional audio, multimedia computers
Modems | Digital cellular telephones, personal communications systems, digital cordless telephones, digital audio broadcast, digital signaling on cable TV, multimedia computers, wireless computing, navigation, data/fax
Noise cancellation | Professional audio, advanced vehicular audio, industrial applications
Audio Equalization | Consumer audio, professional audio, advanced vehicular audio, music
Ambient Acoustics Emulation | Consumer audio, professional audio, advanced vehicular audio, music
Audio Mixing/Editing | Professional audio, music, multimedia computers
Sound Synthesis | Professional audio, music, multimedia computers, advanced user interfaces
Vision | Security, multimedia computers, advanced user interfaces, instrumentation, robotics, navigation
Image Compression | Digital photography, digital video, multimedia computers, videoconferencing
Image Compositing | Multimedia computers, consumer video, advanced user interfaces, navigation
Beamforming | Navigation, medical imaging, radar/sonar, signals intelligence
Echo cancellation | Speakerphones, hands-free cellular telephones
Spectral Estimation | Signals intelligence, radar/sonar, professional audio, music

Another Look at DSP Applications

• High-end (increasing cost):
  – Military applications (e.g. radar/sonar)
  – Wireless Base Station - TMS320C6000
  – Cable modem
  – Gateways
  – HDTV …
• Mid-range:
  – Industrial control
  – Cellular phone - TMS320C540
  – Fax/voice server …
• Low end (increasing volume):
  – Storage products - TMS320C27 (hard drive controllers)
  – Digital camera - TMS320C5000
  – Portable phones
  – Wireless headsets
  – Consumer audio
  – Automobiles, thermostats, ...

DSP Range of Applications & Possible Target DSPs

[Figure: DSP range of applications and possible target DSPs.]

Cellular Phone System (Example DSP Application)

[Figure: cellular phone block diagram; blocks: keypad and display, Controller, Physical Layer Processing, Speech Encode, Speech Decode, A/D, DAC, Baseband Converter, RF Modem.]

Cellular Phone: HW/SW/IC Partitioning (Example DSP Application)

[Figure: the same cellular phone blocks (Controller, Physical Layer Processing, Speech Encode, Speech Decode, A/D, DAC, Baseband Converter, RF Modem) partitioned among Microcontroller, ASIC, DSP, and Analog IC.]

Mapping Onto System-on-Chip (SoC) (Cellular Phone) (Example DSP Application)

[Figure: SoC mapping; hardware blocks: micro-controller or embedded processor (µC), DSP core, ASIC logic, RAM, DMA, S/P; mapped functions: phone book, keypad intfc, control, protocol, speech quality enhancement, voice recognition, de-intl & decoder, RPE-LTP speech decoder, demodulator and synchronizer, Viterbi equalizer.]

Example Cellular Phone Organization (Example DSP Application)

[Figure: cellular phone organization built around a C540 (DSP) and an ARM7 (µC).]

Multimedia System-on-Chip (SoC) (Example DSP Application)

e.g. multimedia terminal electronics

[Figure: multimedia SoC; blocks: µP, custom DSP, Video Unit (ASIC), Memory, Coms; interfaces: Graphics Out, Video I/O, Voice I/O, Pen In, Downlink Radio, Uplink Radio. Annotation: ASIC co-processor or ASP.]

• Future chips will be a mix of processors, memory and dedicated hardware for specific algorithms and I/O

DSP Algorithm Format

• DSP culture has a graphical format to represent formulas (i.e. DSP algorithms).
• Like a flowchart for formulas, inner loops, not programs.
• Some seem natural: Σ is add, X is multiply
• Others are obtuse: $z^{-1}$ means take variable from earlier iteration (delay).
• These graphs are trivial to decode

DSP Algorithm Notation

• Uses “flowchart” notation instead of equations
• Multiply is ⊗ or X
• Add is ⊕ or +
• Delay/Storage is a box labeled "Delay", $z^{-1}$, or D

Typical DSP Algorithm: Finite-Impulse Response (FIR) Filter

• Filters reduce signal noise and enhance image or signal quality by removing unwanted frequencies.
• Finite Impulse Response (FIR) filters compute:

$$y(i) = \sum_{k=0}^{N-1} h(k)\, x(i-k) = h(n) * x(n)$$

where
  – x is the input sequence (signal samples)
  – y is the output sequence
  – h is the impulse response (filter coefficients)
  – N is the number of taps (coefficients) in the filter
• This is a vector dot product: N multiply-accumulate (MAC) operations, one per tap.
• Output sequence depends only on input sequence and impulse response (i.e. filter coefficients).

Typical DSP Algorithms: Finite-Impulse Response (FIR) Filter

• N most recent samples in the delay line (Xi)
• New sample moves data down delay line
• Filter “tap” is a multiply-add (Multiply And Accumulate, MAC)
• Each tap (N taps total) nominally requires:
  – Two data fetches
    • Requires real-time data sample streaming
    • Predictable data bandwidth/latency
    • Special addressing modes
    • Separate memory banks/buses?
  – Multiply
  – Accumulate
    • Multiply and accumulate (MAC) are repetitive computations
    • Requires efficient MAC support
  – Memory write-back to update delay line
    • Special addressing modes (e.g. modulo)

Performance goal: at least 1 FIR tap / DSP instruction cycle

Finite-Impulse Response (FIR) Filter

[Figure: FIR filter signal-flow graph. Signal samples X arrive from the A/D and pass through a delay line of $z^{-1}$ elements (delayed samples); each sample is multiplied by a filter coefficient $h_0, h_1, \ldots, h_{N-2}, h_{N-1}$ and the products are summed into the output Y (accumulator register), which goes to the D/A. One multiply plus one add (a MAC) is one FIR filter tap.]

$$y(i) = \sum_{k=0}^{N-1} h(k)\, x(i-k)$$

i.e. a vector dot product.

Performance goal: at least 1 FIR tap / DSP instruction cycle.

The DSP must meet application signal sampling rate computational requirements: a faster DSP is overkill (more cost/power than really needed).

Sample Computational Rates for FIR Filtering (DSP Performance Requirements)

FIR Type | Signal type | Frequency | # taps | Performance
1-D | Speech | 8 kHz | N = 128 | 20 MOPs
1-D | Music | 48 kHz | N = 256 | 24 MOPs
2-D | Video phone | 6.75 MHz | N*N = 81 | 1,090 MOPs
2-D | TV | 27 MHz | N*N = 81 | 4,370 MOPs (4.37 GOPs)
2-D | HDTV | 144 MHz | N*N = 81 | 23,300 MOPs (23.3 GOPs)

OPs = operations per second. A 1-D FIR has $n_{op} = 2N$ operations per output sample and a 2-D FIR has $n_{op} = 2N^2$.

The DSP must meet application signal sampling rate computational requirements:
• A faster DSP is overkill (higher DSP cost, power, ...)

FIR Filter on (Simple) General Purpose Processor (GPP)

    loop: lw  x0, 0(r0)
          lw  y0, 0(r1)
          mul a, x0, y0
          add y0, a, b
          sw  y0, (r2)
          inc r0
          inc r1
          inc r2
          dec ctr
          tst ctr
          jnz loop

• Problems:
  – GPP real-time performance (to meet signal sampling rate) may not be fully predictable (due to dynamic processor architectural features):
    • Superscalar: dynamic scheduling, hardware speculation, branch prediction, cache.
  – Bus / memory bandwidth bottleneck
  – Control/loop code overhead
  – No suitable addressing modes or instructions, e.g. a multiply and accumulate (MAC) instruction

Typical DSP Algorithms: Infinite-Impulse Response (IIR) Filter

• Infinite Impulse Response (IIR) filters compute:

$$y(i) = \sum_{k=1}^{M-1} a(k)\, y(i-k) + \sum_{k=0}^{N-1} b(k)\, x(i-k)$$

  Each of the two sums is a chain of MAC operations.
• Output sequence depends on input sequence, previous outputs, and impulse response (i.e. filter coefficients a(k), b(k)).
• Both FIR and IIR filters
  – Require vector dot product (multiply-accumulate, MAC) operations
  – Use fixed coefficients normally
• Adaptive filters update their coefficients to minimize the distance between the filter output and the desired signal.

Typical DSP Algorithms: Discrete Fourier Transform (DFT)

• The Discrete Fourier Transform (DFT) allows for spectral analysis in the frequency domain.
• It is computed (with MAC operations) as

$$y(k) = \sum_{n=0}^{N-1} W_N^{nk}\, x(n), \quad k = 0, 1, \ldots, N-1, \quad \text{where } W_N = e^{-j 2\pi / N},\ j = \sqrt{-1}$$

  – x is the input sequence in the time domain
  – y is an output sequence in the frequency domain
• The Inverse Discrete Fourier Transform is computed (again with MAC operations) as

$$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} W_N^{-nk}\, y(k), \quad n = 0, 1, \ldots, N-1$$

• The Fast Fourier Transform (FFT) provides an efficient method for computing the DFT.

Typical DSP Algorithms: Discrete Cosine Transform (DCT)

• The Discrete Cosine Transform (DCT) is frequently used in image & video compression (e.g. JPEG, MPEG-2).
• The DCT and Inverse DCT (IDCT) are computed (with MAC operations) as:

$$y(k) = e(k) \sum_{n=0}^{N-1} \cos\!\left[\frac{(2n+1)\pi k}{2N}\right] x(n), \quad k = 0, 1, \ldots, N-1$$

$$x(n) = \frac{2}{N} \sum_{k=0}^{N-1} e(k) \cos\!\left[\frac{(2n+1)\pi k}{2N}\right] y(k), \quad n = 0, 1, \ldots, N-1$$

  where $e(k) = 1/\sqrt{2}$ if k = 0; otherwise e(k) = 1.
• An N-point 1-D DCT requires $N^2$ MAC operations.
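The FIR and DCT formulas above both reduce to vector dot products, which is why MAC throughput dominates DSP performance. The following is a minimal Python sketch of that idea, not production DSP code: the helper names `mac_dot`, `fir_output`, `dct`, and `idct` are hypothetical, floating point stands in for the fixed-point arithmetic a real DSP would use, samples before x(0) are assumed to be zero, and the final e(k) and 2/N scalings are not counted as MACs.

```python
import math


def mac_dot(coeffs, samples):
    """Dot product as a chain of multiply-accumulate (MAC) steps.

    Returns (accumulated sum, number of MACs performed).
    """
    acc = 0.0
    macs = 0
    for c, s in zip(coeffs, samples):
        acc += c * s  # one MAC: multiply, then add into the accumulator
        macs += 1
    return acc, macs


def fir_output(h, x, i):
    """y(i) = sum_{k=0}^{N-1} h(k) x(i-k); samples before x(0) assumed zero."""
    window = [x[i - k] if i - k >= 0 else 0.0 for k in range(len(h))]
    return mac_dot(h, window)


def dct(x):
    """Direct N-point 1-D DCT: one N-term dot product per output value."""
    n_points = len(x)
    y, total_macs = [], 0
    for k in range(n_points):
        e_k = 1.0 / math.sqrt(2.0) if k == 0 else 1.0
        basis = [math.cos((2 * n + 1) * math.pi * k / (2 * n_points))
                 for n in range(n_points)]
        acc, macs = mac_dot(basis, x)
        y.append(e_k * acc)
        total_macs += macs
    return y, total_macs


def idct(y):
    """Inverse DCT matching dct() above (scale factor 2/N)."""
    n_points = len(y)
    x = []
    for n in range(n_points):
        basis = [(1.0 / math.sqrt(2.0) if k == 0 else 1.0)
                 * math.cos((2 * n + 1) * math.pi * k / (2 * n_points))
                 for k in range(n_points)]
        acc, _ = mac_dot(basis, y)
        x.append(2.0 / n_points * acc)
    return x


if __name__ == "__main__":
    samples = [1.0, 2.0, 3.0, 4.0, 0.0, -1.0, -2.0, -3.0]
    _, macs = dct(samples)
    print(macs)  # 64 MACs for N = 8, i.e. N^2
```

Each FIR output costs N MACs (one per tap), while the direct DCT costs N MACs for each of its N outputs, which is where the $N^2$ figure comes from.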
DSP Benchmarks

• DSPstone: University of Aachen, application benchmarks
  – ADPCM TRANSCODER - CCITT G.721, REAL_UPDATE, COMPLEX_UPDATES
  – DOT_PRODUCT, MATRIX_1X3, CONVOLUTION
  – FIR, FIR2DIM, HR_ONE_BIQUAD
  – LMS, FFT_INPUT_SCALED
• BDTImark2000: Berkeley Design Technology Inc (BDTI)
  – 12 DSP kernels in hand-optimized assembly language:
    • FIR, IIR, vector dot product, vector add, vector maximum, FFT ….
  – Returns single number (higher means faster) per processor
  – Uses only on-chip memory (memory bandwidth is the major bottleneck in performance of embedded applications).
• EEMBC (pronounced “embassy”): EDN Embedded Microprocessor Benchmark Consortium
  – 30 companies, formed by Electronic Data News (EDN)
  – Benchmark evaluates compiled C code on a variety of embedded processors (microcontrollers, DSPs, etc.)
  – Application domains: automotive-industrial, consumer, office automation, networking and telecommunications

[Figure: DSP performance by generation (1st, 2nd, 3rd, 4th generation); 4th generation is > 800x faster than the first generation.]

DSPs from generations 2, 3 and 4 are in use today. Why?
Basic DSP ISA/Architectural Features

Specialized DSP algorithms/application requirements → DSP ISAs → DSP architectures

• Data path configured for DSP algorithms
  – Fixed-point arithmetic (most DSPs) (DSP ISA feature)
    • Saturation arithmetic to handle overflow (instead of modulo wrap-around)
  – MAC: multiply-accumulate unit(s) (DSP architectural feature)
  – Hardware rounding support (DSP architectural feature)
• Multiple memory banks and buses (DSP architectural feature)
  – Harvard architecture
  – Multiple data memories/buses
  – Usually with no data cache, for predictable fast data sample streaming
• Specialized addressing modes (DSP ISA feature)
  – Bit-reversed addressing
  – Circular buffers
  – Dedicated address generation units are usually used
• Specialized instruction set and execution control (DSP ISA feature)
  – Zero-overhead loops
  – Support for fast MAC
  – Fast interrupt handling (to meet real-time signal sampling/processing constraints)
• Specialized peripherals for DSP (System on Chip, SoC, style) (DSP architectural feature)

DSP ISA Features: DSP Data Path: Arithmetic

• DSPs dealing with numbers representing real world signals => want “reals”/fractions (fixed-point)
• DSPs dealing with numbers for addresses => want integers
• Thus the DSP ISA (and DSP) must support “fixed point” as well as integers

[Figure: two word formats, each with a sign bit S. Fixed-point fraction: radix point immediately after the sign bit, $-1 \le x < 1$. Integer: radix point at the right end of the word, $-2^{N-1} \le x < 2^{N-1}$.]

• Most common: fixed point (16-bit or 24-bit) + integer arithmetic; usually 16-bit fixed-point.
• Much less common: single precision floating-point support.
• In DSP ISAs: fixed-point arithmetic must be supported; floating point support is optional and is much less common.

DSP ISA Features: DSP Data Path: Precision

• Word size affects precision of fixed point numbers
• DSPs have 16-bit, 20-bit, or 24-bit data words (16-bit fixed-point most common)
• Floating point DSPs (single precision) cost 2X - 4X vs. fixed point and are slower than fixed point
• In DSP ISAs: fixed-point arithmetic must be supported; floating point support is optional and is much less common
• DSP programmers will scale values inside code
  – SW libraries
  – Separate explicit exponent
• “Blocked Floating Point”: single exponent for a group of fractions
• Floating point support simplifies development for high-end DSP applications.

DSP ISA Feature: DSP Data Path: Overflow Handling

• DSPs are descended from analog signal processors:
  – On overflow the result does not wrap around as in modulo arithmetic.
    • It is set to the most positive value ($2^{N-1}-1$) or the most negative value ($-2^{N-1}$): “saturation”
    • Many DSP algorithms were developed in this model.
• Why support saturation? Due to the physical nature of signals.

[Figure: transfer curve that clamps at $2^{N-1}-1$ (saturation) on the positive side and at $-2^{N-1}$ (saturation) on the negative side.]

DSP Architectural Features: DSP Data Path: Specialized Hardware

• Fast specialized hardware functional units for commonly needed operations perform all key arithmetic operations in 1 cycle (to help meet real-time constraints), including:
  – Shifters
  – Saturation
  – Guard bits
  – Rounding modes
  – Multiplication/addition (MAC)
• 50% of instructions can involve the multiplier => single cycle latency multiplier
• Need to perform multiply-accumulate (MAC) fast
• n-bit multiplier => 2n-bit product
• i.e. must optimize common operations

DSP Data Path: Multiply Accumulate (MAC) Unit

• One or more MAC units
• Don’t want overflow or have to scale accumulator
• Option 1: accumulator wider than product: “guard bits”
  – Motorola DSP: 24b x 24b => 48b product, 56b accumulator
• Option 2: shift right and round product before adder

[Figure: two MAC unit datapaths. Option 1: Multiplier → ALU (add) → Accumulator with guard bits (G). Option 2: Multiplier → Shift → ALU (add) → Accumulator.]

DSP Data Path: Rounding Modes

• Even with guard bits, will need to round when storing accumulator into memory
• 3 DSP standard options (supported in hardware, not in software as in GPPs):
  1 – Truncation: chop results => biases results (toward more negative values for two's-complement data)
  2 – Round to nearest: < 1/2 round down, ≥ 1/2 round up (more positive) => smaller bias
  3 – Convergent: < 1/2 round down, > 1/2 round up (more positive), = 1/2 round to make lsb a zero (+1 if 1, +0 if 0) => no bias
    • IEEE 754 calls this round to nearest even

Data Path Comparison

DSP Processor:
• Specialized hardware performs all key arithmetic operations in 1 cycle (e.g. MAC).
• Hardware support for managing numeric fidelity:
  – Shifters
  – Guard bits
  – Saturation
  – Rounding modes

General-Purpose Processor:
• Multiplies often take > 1 cycle
• Shifts often take > 1 cycle
• Other operations (e.g., saturation, rounding) typically take multiple cycles.

TI 320C54x DSP (1995) Functional Block Diagram

[Figure: TI 320C54x functional block diagram; highlighted: multiple memory banks and buses, MAC unit, hardware support for rounding/saturation.]

First Commercial DSP (1982): Texas Instruments TMS32010 (i.e. single-chip DSP (microprocessor))

• 16-bit fixed-point arithmetic
• Introduced at 5 MHz (200 ns) instruction cycle.
• “Harvard architecture” – separate instruction and data memories
• Accumulator
• Specialized instruction set
  – Load and Accumulate
• Two-cycle (400 ns) Multiply-Accumulate (MAC) time.
[Figure: Instruction Memory and Data Memory connected to the Processor. Datapath (i.e. MAC unit): Mem -> T-Register -> Multiplier -> P-Register -> ALU -> Accumulator.]
EECC722 - Shaaban #49 lec # 8 Fall 2012 10-10-2012

First Generation DSP µP: Texas Instruments TMS32010 - 1982
Features
• 200 ns instruction cycle (5 MIPS)
• 144 words (16 bit) on-chip data RAM
• 1.5K words (16 bit) on-chip program ROM - TMS32010
• External program memory expansion to a total of 4K words at full speed
• 16-bit instruction/data word
• Single-cycle 32-bit ALU/accumulator
• Single-cycle 16 x 16-bit multiply in 200 ns
• Two-cycle MAC (5 MOPS)
• Zero to 15-bit barrel shifter
• Eight input and eight output channels
EECC722 - Shaaban #50 lec # 8 Fall 2012 10-10-2012

First Generation DSP µP: TI TMS32010 Block Diagram
[Figure: block diagram. Annotations: Program Memory (ROM/EPROM); Data/Samples Memory; MAC Unit; Barrel Shifter (1 cycle).]
EECC722 - Shaaban #51 lec # 8 Fall 2012 10-10-2012

TMS32010 FIR Filter Code
• Here X4, H4, ... are direct (absolute) memory addresses:
    LT  X4    ; Load T with x(n-4)
    MPY H4    ; P = H4*X4
    LTD X3    ; Load T with x(n-3); x(n-4) = x(n-3); Acc = Acc + P
    MPY H3    ; P = H3*X3
    LTD X2    ; Load and Accumulate
    MPY H2
    ...
• Two instructions per tap, but requires unrolling
EECC722 - Shaaban #52 lec # 8 Fall 2012 10-10-2012

DSP Architectural Features
DSP Memory
• FIR tap implies multiple memory accesses
• DSPs require multiple data ports (separate memories for data, program)
• Some DSPs have ad hoc techniques to reduce memory bandwidth demand:
  – Instruction repeat buffer: do 1 instruction 256 times
  – Often disables interrupts, thereby increasing interrupt response time
• Some recent DSPs have instruction caches
  – Even then may allow programmer to “lock in” instructions into cache
  – Option to turn cache into fast program memory
• Usually DSPs have no data caches.
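A minimal Python sketch of why an FIR tap implies multiple memory accesses, modeled loosely on the LT/MPY/LTD sequence in the TMS32010 FIR code. The function name, the list-based delay line and the read counter are illustrative assumptions; only data reads are counted (instruction fetches and the delay-line write are ignored).

```python
def fir_step(delay, coeffs, new_sample):
    """Compute one FIR output sample.

    delay[k] holds x(n-k); coeffs[k] holds h(k). Both lists have one
    entry per tap. Returns (output, number_of_data_reads).
    """
    delay[0] = new_sample
    acc = 0
    reads = 0
    # Walk from the oldest tap to the newest, as the unrolled assembly does.
    for k in range(len(coeffs) - 1, -1, -1):
        acc += coeffs[k] * delay[k]      # one sample read + one coefficient read
        reads += 2
        if k + 1 < len(delay):
            delay[k + 1] = delay[k]      # data move, like LTD: x(n-k-1) = x(n-k)
    return acc, reads


if __name__ == "__main__":
    taps = [1, 2, 3]
    line = [0, 0, 0]
    for sample in (1, 0, 0):             # unit impulse reproduces the coefficients
        print(fir_step(line, taps, sample))
```

Every tap needs a sample and a coefficient in the same step, which is the motivation for giving the datapath more than one data port.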
• May have multiple data memories (e.g. one for signal data samples and one for filter coefficients)
Why no data caches? For better real-time performance predictability.
EECC722 - Shaaban #53 lec # 8 Fall 2012 10-10-2012

Conventional “Von Neumann” memory
AKA unified or Princeton memory architecture
EECC722 - Shaaban #54 lec # 8 Fall 2012 10-10-2012

HARVARD MEMORY ARCHITECTURE in DSP (i.e. split memory)
[Figure: PROGRAM MEMORY (ROM/EPROM/FLASH?), X MEMORY and Y MEMORY (data memory banks, SRAM; e.g. one for signal data samples and one for filter coefficients), connected by GLOBAL, P DATA, X DATA and Y DATA buses. Multiple memory banks and buses.]
EECC722 - Shaaban #55 lec # 8 Fall 2012 10-10-2012

Memory Architecture Comparison
DSP Processor
• Harvard architecture (split)
• 2-4 memory accesses/cycle
• No caches: on-chip SRAM (for real-time performance predictability)
General-Purpose Processor
• Von Neumann architecture (i.e. unified memory, but not the L1 cache, which is split)
• Typically 1 access/cycle
• Use caches (makes real-time performance harder to predict)
[Figure: DSP: Processor with separate Program Memory and Data Memory. GPP: Processor with a single Memory.]
EECC722 - Shaaban #56 lec # 8 Fall 2012 10-10-2012

TI TMS320C3x MEMORY BLOCK DIAGRAM - Harvard Architecture
[Figure: block diagram. Annotations: instruction cache; program and data buses; multiple memory banks and buses.]
EECC722 - Shaaban #57 lec # 8 Fall 2012 10-10-2012

TI 320C62x/67x DSP (1997) – (Fourth Generation DSP)
[Figure: block diagram. Annotations: Program; Data.]
EECC722 - Shaaban #58 lec # 8 Fall 2012 10-10-2012

DSP ISA Features
DSP Addressing Modes (complex & specialized)
• Have standard addressing modes: immediate, displacement, register indirect
• Want to keep MAC datapath busy.
• Assumption: any extra instructions imply additional clock cycles of overhead in inner loop and larger code size => Thus complex addressing is good
Why? To match data access patterns in DSP algorithms and reduce number of instructions (code size).
Examples:
• Autoincrement/autodecrement register indirect
  – lw r1,0(r2)+ => r1 <- M[r2]; r2 <- r2+1
  – Option to do it before addressing, positive or negative
• “Bit reverse” addressing mode.
• “Modulo” or “circular” addressing
=> Don’t use normal datapath integer unit to calculate complex addressing modes:
  – Instead use dedicated address generation units (related DSP architectural feature).
EECC722 - Shaaban #59 lec # 8 Fall 2012 10-10-2012

DSP ISA Features
DSP Addressing: FFT
• FFTs start or end with data in butterfly order (bit-reversed addressing):
    0 (000) => 0 (000)
    1 (001) => 4 (100)
    2 (010) => 2 (010)
    3 (011) => 6 (110)
    4 (100) => 1 (001)
    5 (101) => 5 (101)
    6 (110) => 3 (011)
    7 (111) => 7 (111)
• How to avoid overhead of address checking instructions for FFT?
• Have an optional “bit reverse” addressing mode for use with autoincrement addressing
• Thus most DSPs have “bit reverse” addressing for radix-2 FFT
EECC722 - Shaaban #60 lec # 8 Fall 2012 10-10-2012

DSP ISA Features
Bit Reversed Addressing
    Address   Input   Output
    000       x(0)    F(0)
    100       x(4)    F(1)
    010       x(2)    F(2)
    110       x(6)    F(3)
    001       x(1)    F(4)
    101       x(5)    F(5)
    011       x(3)    F(6)
    111       x(7)    F(7)
Stages: four 2-point DFTs, two 4-point DFTs, one 8-point DFT.
[Figure: data flow in the radix-2 decimation-in-time FFT algorithm]
EECC722 - Shaaban #61 lec # 8 Fall 2012 10-10-2012

DSP Addressing: Circular Buffers and addressing
• DSPs deal with continuous I/O (sampled signal)
• Often interact with an I/O buffer (delay lines)
• To save memory, buffers are often organized as circular buffers
• What can be done to avoid overhead of address checking instructions for circular buffer?
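To make the overhead in question concrete, the sketch below computes the two specialized address sequences in plain software: a bit-reversed index for a radix-2 FFT and a wrapping pointer for a circular buffer. Each address update costs extra work (a loop over the bits, or a compare and subtract) that a DSP performs as a side effect of the memory access. The function names and signatures are illustrative assumptions, and `circular_next` assumes a positive step smaller than the buffer length.

```python
def bit_reverse(index, bits):
    """Reverse the low `bits` bits of `index` (e.g. 001 -> 100 for bits=3)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)   # move the LSB of index into result
        index >>= 1
    return result


def circular_next(addr, start, length, step=1):
    """Advance `addr` by `step`, wrapping inside [start, start + length)."""
    addr += step
    if addr >= start + length:                 # the explicit "address check"
        addr -= length
    return addr


if __name__ == "__main__":
    print([bit_reverse(i, 3) for i in range(8)])

    addr = 100
    visited = []
    for _ in range(6):
        visited.append(addr)
        addr = circular_next(addr, start=100, length=4)
    print(visited)
```

For an 8-point FFT the first loop reproduces the bit-reversed ordering tabulated in the FFT slide; the second shows a pointer cycling through a 4-entry buffer without ever leaving it.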
• Option 1: Keep a start register and an end register per address register for use with autoincrement addressing; reset to start when the end of the buffer is reached
• Option 2: Keep a buffer length register, assuming buffers start on an aligned address; reset to start when addressing reaches the end
• Every DSP has “modulo” or “circular” addressing
EECC722 - Shaaban #62 lec # 8 Fall 2012 10-10-2012

DSP ISA Features
Circular Buffers Addressing Support
Every DSP has a “modulo” or “circular” addressing mode.
Instructions accommodate three elements:
• Buffer address
• Buffer size
• Increment
Why? Allows for cycling through:
• Delay elements (signal samples)
• Filter coefficients in data memory (or other DSP algorithm coefficients)
[Figure: circular buffer filled e.g. from A/D and drained e.g. to D/A]
EECC722 - Shaaban #63 lec # 8 Fall 2012 10-10-2012

DSP Architectural Features
Address calculation for DSPs
• Dedicated address generation units (do not use normal integer unit)
• Support modulo and bit reversal arithmetic
• Often duplicated to calculate multiple addresses per cycle
EECC722 - Shaaban #64 lec # 8 Fall 2012 10-10-2012

Addressing Comparison
DSP Processor
• Dedicated address generation units (DSP architectural feature)
• Specialized addressing modes (DSP ISA feature); e.g.:
  – Autoincrement
  – Modulo (circular)
  – Bit-reversed (for FFT)
• Good immediate data support
General-Purpose Processor
• Often, no separate address generation units
• General-purpose addressing modes (GPP ISA feature; number minimized in RISC ISAs)
EECC722 - Shaaban #65 lec # 8 Fall 2012 10-10-2012

DSP ISA Features
DSP Instructions and Execution
• May specify multiple operations in a single complex instruction (to reduce number of instructions and reduce code size)
  – e.g. a compound instruction may perform: multiply + add + load + modify address register
• Must support Multiply-Accumulate (MAC)
• Need parallel move support
• Usually have special loop support to reduce branch overhead (reduce loop overhead)
  – Loop an instruction or sequence
  – 0 value in register usually means loop maximum number of times
  – Must be sure if calculate loop count that 0 does not mean 0
• May have saturating shift left arithmetic
• May have conditional execution to reduce branches (in 4th generation VLIW DSPs)
EECC722 - Shaaban #66 lec # 8 Fall 2012 10-10-2012

DSP ISA Features
DSP Low/Zero Overhead Loops Examples
Example FIR inner loop on TI TMS320C54xx:
[Figure: code listing not recovered. Annotations: Repeat; number of filter taps.]
In ADSP 2100: “DO <addr> UNTIL condition”
    DO X
    ...
    X
Address generation:
    PCS = PC + 1
    if (PC = x && ! condition)
        PC = PCS
    else
        PC = PC + 1
Lowers loop overhead:
• Eliminates a few instructions in loops
• Important in loops with small bodies
EECC722 - Shaaban #67 lec # 8 Fall 2012 10-10-2012

Instruction Set (ISA) Comparison: DSP Processor ISA vs. General-Purpose Processor ISA
DSP Processor ISA
• Specialized, complex instructions (e.g. MAC)
• Multiple operations per instruction
    mac x0,y0,a  x:(r0)+,x0  y:(r4)+,y0
  Code size = 16 bits (smaller code size)
• Zero or reduced overhead loops.
The above is in addition to the addressing mode differences identified earlier (slide 65).
General-Purpose Processor ISA
• General-purpose instructions (less complex)
• Typically only one operation per instruction
    mov *r0,x0
    mov *r1,y0
    mpy x0,y0,a
    add a,b
    mov y0,*r2
    inc r0
    inc r1
  Code size = 7 x 32 = 224 bits (14X, larger code size)
• No zero or reduced overhead loops support
EECC722 - Shaaban #68 lec # 8 Fall 2012 10-10-2012

DSP Architectural Features
Specialized Peripherals for DSPs
System on Chip (SoC) approach: heavy integration of peripherals/components to reduce cost (chip count)/power
• Synchronous serial ports
• Parallel ports
• Timers
• On-chip A/D, D/A converters
• Co-processors
• ASIC
• Micro-controller ….
• Program/data memory and busses
• Component/system interconnects
• Host ports
• Bit I/O ports
• On-chip DMA controller
• Clock generators
On-chip peripherals are often designed for “background” operation, even when the DSP core is powered down.
[Figure: SoC containing DSP Core, Instruction Memory, Data Memory, A/D Converter, D/A Converter, Serial Ports]
EECC722 - Shaaban #69 lec # 8 Fall 2012 10-10-2012

TI TMS320C203/LC203 Block Diagram
DSP Core Approach - 1995
[Figure: block diagram. Annotations: Program; Data; Integrated DSP Peripherals.]
EECC722 - Shaaban #70 lec # 8 Fall 2012 10-10-2012

Summary of Architectural Features of DSPs
• Data path configured for DSP (DSP architectural feature)
  – Fixed-point arithmetic (most common: 95% of all DSPs)
  – Fast MAC (multiply-accumulate)
• Multiple memory banks and buses (DSP architectural features)
  – Harvard architecture
  – Multiple data memories
  – Dedicated address generation units
• Specialized addressing modes (DSP ISA feature)
  – Bit-reversed addressing
  – Circular buffers
• Specialized instruction set and execution control (DSP ISA features)
  – Zero-overhead loops
  – Support for MAC
• Specialized peripherals for DSP (SoC) (DSP architectural feature)
+ Avoiding dynamic processor architectural features that make real-time performance harder to predict (e.g. dynamic scheduling, hardware speculation, branch prediction, cache). Why? To achieve predictable real-time performance.
THE ULTIMATE IN BENCHMARK DRIVEN ARCHITECTURE DESIGN (or algorithm driven, DSP algorithms in this case).
EECC722 - Shaaban #71 lec # 8 Fall 2012 10-10-2012

DSP Software Development Considerations
• Different from general-purpose software development:
  Requirements:
  – Resource-hungry, complex algorithms.
  – Specialized and/or complex processor architectures.
  – Severe cost/storage limitations.
  – Hard real-time constraints.
  Thus:
  – Optimization is essential.
  – Increased testing challenges.
• Program in DSP assembly? Most common (for performance) but changing: HLL/tools becoming more mature/gaining popularity.
• Essential tools:
  – Assembler, linker.
  – Instruction set simulator.
  – HLL code generation: C compiler.
  – Debugging and profiling tools.
• Increasingly important:
  – DSP software libraries (hand optimized).
  – Real-time operating systems.
EECC722 - Shaaban #72 lec # 8 Fall 2012 10-10-2012

Classification of Current DSP Architectures
(Ordered from lower cost/power to higher cost/power/performance)
• Modern Conventional DSPs (Second Generation, late 1980’s):
  – Similar to the original DSPs of the early 1980s
  – Single instruction/cycle. Example: TI TMS320C54x
  – Complex instructions/not compiler friendly
  – Usually one MAC unit
• Enhanced Conventional DSPs (Third Generation, early 1990’s):
  – Add parallel execution units: SIMD operation
  – Complex, compound instructions.
  – Example: TI TMS320C55x
  – Not compiler friendly
  – Usually more than one MAC unit (> 1 MAC unit)
• Multiple-Issue DSPs (Fourth Generation, late 1990’s):
  – VLIW. Example: TI TMS320C62xx, TMS320C64xx
    • Simpler (RISC-like, fixed-width) instructions than conventional DSPs; more instructions and instruction bandwidth needed
    • More compiler friendly - higher cost/power
    • SIMD instruction support added to recent DSPs of this class
  – Superscalar. Example: LSI Logic ZSP400, ZSP500
DSPs from all three of these generations are still available today. Why?
EECC722 - Shaaban #73 lec # 8 Fall 2012 10-10-2012

A Conventional DSP: TI TMS320C54xx (Second Generation DSP, ~1989)
• 16-bit fixed-point DSP.
• Issues one 16-bit instruction/cycle
• Modified Harvard memory architecture
• Peripherals typical of conventional DSPs:
  – 2-3 synch. serial ports, parallel port
  – Bit I/O, timer, DMA
• Inexpensive (100 MHz ~$5 qty 10K).
• Low power (60 mW @ 1.8V, 100 MHz).
Has one MAC unit.
EECC722 - Shaaban #74 lec # 8 Fall 2012 10-10-2012

A Current Conventional DSP: TI TMS320C54xx (Second Generation DSP)
[Figure: block diagram. Annotation: One MAC Unit.]
EECC722 - Shaaban #75 lec # 8 Fall 2012 10-10-2012

An Enhanced Conventional DSP: TI TMS320C55xx (Third Generation DSP, ~1994)
• The TMS320C55xx is based on Texas Instruments' earlier TMS320C54xx family, but adds significant enhancements to the architecture and instruction set, including:
  – Two instructions/cycle (limited VLIW?)
    • Instructions are scheduled for parallel execution by the assembly programmer or compiler.
  – Two MAC units.
• Complex, compound instructions:
  – Assembly source code compatible with C54xx
  – Mixed-width instructions: 8 to 48 bits.
  – 200 MHz @ 1.5 V, ~130 mW, $17 qty 10k
• Poor compiler target.
(Annotation on the TMS320C54xx family: 2nd generation DSP)
EECC722 - Shaaban #76 lec # 8 Fall 2012 10-10-2012

An Enhanced Conventional DSP: TI TMS320C55xx (Third Generation DSP)
[Figure: block diagram. Annotation: 2 MAC Units.]
EECC722 - Shaaban #77 lec # 8 Fall 2012 10-10-2012

Multiple-Issue DSPs
16-bit Fixed-Point 8-way VLIW DSP: TI TMS320C6201 Revision 2 (1997)
Example Fourth Generation DSP
The TMS320C62xx is the first fixed-point DSP processor from Texas Instruments that is based on a VLIW-like architecture which allows it to execute up to eight 32-bit RISC-like instructions per clock cycle.
• More compiler friendly
• Higher cost/power
• SIMD instructions support added to recent DSPs of this class (TMS320C64xx)
Floating point version: TMS320C67xx
[Figure: C6201 block diagram. C6201 CPU Megamodule: Program Fetch, Instruction Dispatch, Instruction Decode, Control Registers, Control Logic, Test, Emulation, Interrupts; Data Path 1 (A Register File; units L1, S1, M1, D1) and Data Path 2 (B Register File; units D2, M2, S2, L2). Program Cache / Program Memory: 32-bit address, 256-bit data, 512K bits RAM. Data Memory: 32-bit address, 8-, 16-, 32-bit data, 512K bits RAM. Peripherals: 4-DMA, Host Port Interface, Ext. Memory Interface, 2 Timers, 2 multichannel buffered serial ports (T1/E1), Pwr Dwn.]
EECC722 - Shaaban #78 lec # 8 Fall 2012 10-10-2012

TI TMS320C62xx Internal Memory Architecture
• Separate internal program and data spaces
• Program
  – 16K 32-bit instructions (2K fetch packets)
  – 256-bit fetch width
  – Configurable as either
    • Direct mapped cache, memory mapped program memory
• Data
  – 32K x 16
  – Single ported, accessible by both CPU data buses
  – 4 x 8K 16-bit banks
    • 2 possible simultaneous memory accesses (4 banks)
    • 4-way interleave; banks and interleave minimize access conflicts
EECC722 - Shaaban #79 lec # 8 Fall 2012 10-10-2012

Fourth Generation DSP Example
TI TMS320C62xx Datapaths (8-way VLIW)
[Figure: datapath diagram. Register files A0 - A15 and B0 - B15; functional units L1, S1, M1, D1, D2, M2, S2, L2 with source (S1, S2), destination (D) and long (SL, DL) ports; cross paths 1X and 2X; data ports DDATA_I1 and DDATA_I2 (load data), DDATA_O1 and DDATA_O2 (store data); address ports DADR1 and DADR2. Legend: cross paths; 40-bit write paths (8 MSBs); 40-bit read paths/store paths.]
EECC722 - Shaaban #80 lec # 8 Fall 2012 10-10-2012

TI TMS320C62xx Functional Units
• L-Unit (L1, L2)
  – 40-bit integer ALU, comparisons
  – Bit counting, normalization
• S-Unit (S1, S2)
  – 32-bit ALU, 40-bit shifter
  – Bitfield operations, branching
• M-Unit (M1, M2)
  – 16 x 16 -> 32
• D-Unit (D1, D2)
  – 32-bit add/subtract
  – Address calculations
(Statically scheduled)
EECC722 - Shaaban #81 lec # 8 Fall 2012 10-10-2012

TI TMS320C62xx Instruction Packing (Advanced 8-way VLIW)
• Fetch packet
  – CPU fetches 8 instructions/cycle
• Execute packet
  – CPU executes 1 to 8 instructions/cycle
  – Fetch packets can contain multiple execute packets
• Parallelism determined at compile/assembly time
• Examples
  – 1) 8 parallel instructions
  – 2) 8 serial instructions
  – 3) Mixed serial/parallel groups
    • A // B
    • C
    • D
    • E // F // G // H
• Reduces code size, number of program fetches, power consumption
(Statically scheduled VLIW)
[Figure: Examples 1 to 3, each showing one fetch packet of instructions A B C D E F G H]
EECC722 - Shaaban #82 lec # 8 Fall 2012 10-10-2012

TI TMS320C62xx Pipeline Operation
Pipeline phases: Fetch (PG PS PW PR), Decode (DP DC), Execute (E1 E2 E3 E4 E5)
• Single-cycle throughput
• Operate in lock step
• Fetch
  – PG: Program Address Generate
  – PS: Program Address Send
  – PW: Program Access Ready Wait
  – PR: Program Fetch Packet Receive
• Decode
  – DP: Instruction Dispatch
  – DC: Instruction Decode
• Execute
  – E1 - E5: Execute 1 through Execute 5
[Figure: pipeline diagram. Execute packets 1 to 7 each enter PG one cycle after the previous one and advance through PG PS PW PR DP DC E1 E2 E3 E4 E5 in lock step.]
EECC722 - Shaaban #83 lec # 8 Fall 2012 10-10-2012

C62x Pipeline Operation: Delay Slots
• Delay Slots:
number of extra cycles until result is:
  – written to register file
  – available for use by a subsequent instruction
  – Multi-cycle NOP instruction can fill delay slots while minimizing code size impact
    Most instructions    E1                                         No delay
    Integer multiply     E1 E2                                      1 delay slot
    Loads                E1 E2 E3 E4 E5                             4 delay slots
    Branches             E1; branch target: PG PS PW PR DP DC E1    5 delay slots
(Statically scheduled VLIW) For better real-time performance predictability
EECC722 - Shaaban #84 lec # 8 Fall 2012 10-10-2012

C6000 Instruction Set Features
Conditional Instruction Execution
• All instructions can be conditional (similar to Intel IA-64)
  – A1, A2, B0, B1, B2 can be used as conditions
  – Based on zero or non-zero value
  – Compare instructions can allow other conditions (<, >, etc)
• Reduces branching
• Increases parallelism
EECC722 - Shaaban #85 lec # 8 Fall 2012 10-10-2012

C6000 Instruction Set Addressing Features
• Load-store architecture
• Two addressing units (D1, D2)
• Orthogonal
  – Any register can be used for addressing or indexing
• Signed/unsigned byte, half-word, word, double-word addressable
  – Indexes are scaled by type
• Register or 5-bit unsigned constant index
EECC722 - Shaaban #86 lec # 8 Fall 2012 10-10-2012

C6000 Instruction Set Addressing Modes/Features
• Indirect addressing modes
  – Pre-increment *++R[index]
  – Post-increment *R++[index]
  – Pre-decrement *--R[index]
  – Post-decrement *R--[index]
  – Positive offset *+R[index]
  – Negative offset *-R[index]
• 15-bit positive/negative constant offset from either B14 or B15
• Circular addressing
  – Fast and low cost: power of 2 sizes and alignment
  – Up to 8 different pointers/buffers, up to 2 different buffer sizes
• Bit-reversal addressing
• Dual endian support
EECC722 - Shaaban #87 lec # 8 Fall 2012 10-10-2012

FIR Filter On TMS320C54xx vs. TMS320C62xx
[Figure: code listings not recovered. Annotations: 2nd Gen conventional DSP: smaller code size than VLIW. 4th Gen VLIW DSP: larger code size; two filter taps in parallel.]
EECC722 - Shaaban #88 lec # 8 Fall 2012 10-10-2012

TI TMS320C64xx
• Announced in February 2000, the TMS320C64xx is an extension of Texas Instruments' earlier TMS320C62xx architecture.
• The TMS320C64xx has 64 32-bit general-purpose registers, twice as many as the TMS320C62xx.
• The TMS320C64xx instruction set is a superset of that used in the TMS320C62xx, and, among other enhancements, adds significant SIMD/media processing capabilities (not in C62):
  – 8-bit operations for image/video processing.
• Introduced at 600 MHz clock speed (1 GHz now), but:
  – 11-stage pipeline with long latencies
  – Dynamic caches.
• $100 qty 10k.
• The only current DSP family with compatible fixed and floating-point versions.
EECC722 - Shaaban #89 lec # 8 Fall 2012 10-10-2012

[Figure: DSP comparison chart not recovered. Annotations:]
• C64xx (also C62xx and C67xx) (VLIW): higher memory use due to simpler (RISC-like, fixed-width) instructions than conventional DSPs; more instructions and instruction bandwidth needed.
• Also VLIW but with variable-length instruction encoding (16-32 bits): less memory use than C64xx.
EECC722 - Shaaban #90 lec # 8 Fall 2012 10-10-2012

[Figure not recovered; surviving labels: Computational; (XScale)]
EECC722 - Shaaban #91 lec # 8 Fall 2012 10-10-2012

Multiple-Issue 4th Generation DSPs Example
Superscalar DSP: LSI Logic ZSP400
• A 4-way superscalar dynamically scheduled 16-bit fixed-point DSP core. (Good or bad for a DSP?)
• 16-bit RISC-like instructions
• Separate on-chip caches for instructions and data
• Two MAC units, two ALU/shifter units
  – Limited SIMD support.
  – MACs can be combined for 32-bit operations.
• Possible disadvantage:
  – Dynamic behavior complicates DSP software development:
    • Ensuring real-time behavior
    • Optimizing code.
EECC722 - Shaaban #92 lec # 8 Fall 2012 10-10-2012

[Figure not recovered; surviving label: 2004]
EECC722 - Shaaban #93 lec # 8 Fall 2012 10-10-2012

[Figure not recovered; surviving label: 2010]
EECC722 - Shaaban #94 lec # 8 Fall 2012 10-10-2012

[Figure not recovered; surviving labels: 2004; GPP; (4th generation TI DSP)]
TI not actively improving their flagship FP DSP (fixed-point more important!)
EECC722 - Shaaban #95 lec # 8 Fall 2012 10-10-2012