Analog Devices TigerSHARC® DSP Family

Analog Devices TigerSHARC® DSP Family Presented By: Mike Lee and Mike Demcoe Date: April 8th, 2002 1 TigerSHARC Architectural Overview    High performance, 128-bit successor to the ADSP-2106x SHARC family ADSP-TS101S, the newest TigerSHARC DSP, operates at 250MHz! Multiple computational units  Two compute blocks, each containing a register file, ALU, multiplier, and shifter.  Two additional integer ALUs  Two hardware loop counter registers  Can execute up to four independent 32-bit instructions at a time    Or, eight 16-bit instructions Very wide word widths for high precision arithmetic Designed to be used in a multiple processor environment 2 TigerSHARC Architecture Overview (cont…)  BTB (Branch Target Buffer) as a means of alleviating issues with the deep pipeline  32-instruction, 4-way set-associative cache  User controlled Branch Prediction    Three, 128-bit blocks of memory which provide access to a program and two data operands without causing instruction/data conflicts. Load-store, Harvard architecture, like SHARC. Native support for complex number instructions 3 The TS101S Architecture 4 Details of Multiple Compute Blocks  Two computational units, each containing: file – Multi-ported to allow multiple accesses to registers in a single clock cycle  Register   General purpose registers! Contains 32 words, each word being 32-bits in length. – Fixed-point and floating point  Multiplier – Fixed-point and floating point  ALU  Also features MAC (multiply-and-accumulate) capabilities – Standard logical and arithmetic shifts as well as bit manipulation  Shifter 5 The TS101S Pipeline Fetch 1 Fetch Stages Fetch 2 Fetch 3 IAB Decode Integer Execution Stages Access Execute 1 Execute 2 6 Pipelines and Instruction Related Information  ADSP-21061  Three stage pipeline  20ns instruction cycle  SISD but can put instructions in parallel  ADSP-TS101S  Eight stage pipeline with IAB  4ns instruction cycle  MIMD and can also put instructions in parallel 7 Loops, Branching and Timers  ADSP-21061  Zero-overhead hardware loop support  Delayed Branching  One timer  ADSP-TS101S  Little support for zero-overhead hardware loops  32-entry 4-way associative BTB cache with Branch prediction  Two timers 8 Memory and Buses  ADSP-21061 1 Mbit dual ported SRAM  Shared by three buses (PM, DM, I/O)  PM and DM share a port while the I/O receives it’s own  ADSP-TS101S 6 Mbit of SRAM (Quad Ported??)  User defined partitions  Each block is accessed by one 128-bit bus 9 Multiplication and other Nifty Tricks  ADSP-21061  MAC instructions (MRF and MRB)  Various precision output (32, 40, or 80 bit)  ADSP-TS101S  Each compute block has it’s own set of MAC registers  8 16-bit MAC with 40-bit accumulation or 2 32-bit MAC with 80-bit accumulation  Complex number MAC instructions  128-bit accelerator  Trellis decoding (8 Trellis butterflies per cycle) 10 Data Address Generation  ADSP-21061 2 data address generation units (DAGS)  8 circular buffers per DAG  ADSP-TS101S 2 data address generation units (IALU)  4 circular buffers per IALU  Both support modulo arithmetic, bit reversal addressing, and post and pre-modify instructions 11 Ease of Use  ADSP-21061  Easy to use  Algebraic instruction set  Visual DSP environment  ADSP-TS101S  Similar to 21061 but know have to consider 2 compute blocks  ADI suggests leaving parallelization to their optimizing compiler  Visual DSP environment 12 Specific DSP Algorithms and the TigerSHARC In ENEL515 (and/or related articles) we’ve studied the FIR, IIR, and FFT algorithms  TigerSHARC has a massively parallel architecture that is tailored to performing these algorithms.  13 FIR Filter Characteristics   Think back (or forward, depending on how much you’ve procrastinated) to Lab #3. FIR Characteristics      Simple, long loop Repetitive calculations (multiply, then add!) Access to an array of coefficients, and an array of “delay-line” values Few data dependency issues during the calculation of a single output For a filter of length N, require N multiplications and N adds to obtain a single output value. 14 TigerSHARC and the FIR Filter   The general idea is: Divide and conquer! Take a filter of size N and split it into two groups of N/2    Utilize the TigerSHARC’s multiple computational units and MAC instructions to perform the algorithm in ½ the time (plus some overhead) Two hardware loop counters to simultaneously control the two new “N/2” size FIR loops with no overhead! Can do all of the following SIMULTANEOUSLY!  Fetch two operands (one coefficient, one delay line value) from two separate memory banks  Fetch the next instruction  Perform arithmetic operations on the PREVIOUS operands!  Unlike SHARC, instruction/data clashes are non-existant due to the numerous bus paths linking computational units to memory space 15 TigerSHARC and the FIR Filter (continued….)  8-cycle-deep pipeline    The long loop characteristic of the FIR filter algorithm allows us to keep the 8-cycle-deep pipeline full   Stalls are expensive.. Branch Target Buffer reduces performance loss that results from branching in a deeply pipelined processor Full pipeline means fast algorithm FIR Filter algorithms rely heavily on data sets that are aligned in memory   Post-increment is your friend TigerSHARC Quad Data Accesses – Supply four aligned words to one compute block or two aligned words to each compute block. 16 Example Instructions X/Y Conditional Compute if xALE; do, R0=R1+R2 Condition codes, AEQ, ALT, ALE, ALU, MEQ, MLT, MLE, SEQ, SLT, SF0, SF1. A = Adder, M = Multiplier, S = Shifter Memory Addessing Indirect post-modify with update, register offset: YR20=[J1+=J2] Indirect post-modify with update, 8-bit immediate offset: Q[K1+=0xF8]=XYR3:0 Indirect pre-modify no update, register offset: J3:2=L[K1+K2] Indirect pre-modify no update, immediate offset: YR3:2=L[K1+0x0003333] Complex Quad 16-bit Fixed Point Multiplication Instructions {X|Y|XY} MRa += Rm ** Rn {({U}{I}{C|CR}{J})} {X|Y|XY} Rs|Rsd=MRa, MRa+= Rm ** Rn {({U}{I}{C}{J})} 17 FIR Code Example 18 TigerSHARC and the IIR Filter    Short, simple loop characteristic  Means loop overhead is more of a concern  Means keeping the pipeline full is tougher!  Time to unroll the loop, although ADI says to let VisualDSP do it for you. Again, split up the calculations on an N-tap IIR filter into two N/2 sets operating simultaneously  Idea: One computational block does feedforward calculations, one does feedback! Complex numbers commonly required    Hardware support for complex MAC in TigerSHARC Again, Quad Data Access comes in handy for aligned data Post-increment is still your friend 19 TigerSHARC and the FFT   Does not use the same MAC modes that IIR and FIR filters do. Requires more complicated addressing modes  Example: Bit reverse addressing    Difficult to split onto separate computational units and even more difficult to split amongst distributed processors Requires large arrays of complex variables and fixed coefficients    Found on both SHARC and TigerSHARC Hardware complex number MAC comes in handy again! Large arrays of aligned data – Quad Data access again! Requires HIGH-PRECISION arithmetic  Luckily we have 64-bit fixed point arithmetic and 40-bit extended floating point arithmetic.  80-bit MAC precision  FFT Requires many intermediate values  32 GP registers in a single computational block 20 http://www.analog.com/technology/dsp/Sharc/benchmarks.html 21 http://www.analog.com/technology/dsp/TigerSHARC/benchmarks.html 22 Conclusion  TigerSHARC have a very SHARC-like architecture, except it’s MUCH more complex.  Highly   optimized for parallelism Major features: Complex number support, multiple computational units, high instruction throughput, wider buses. Performs DSP algorithms including FIR, IIR, FFT significantly faster than SHARC! 23 References              1. http://www.analog.com/productSelection/pdf/ADSP-21061_L_b.pdf 2. http://products.analog.com/products/info.asp?product=ADSP-TS101-S 3. http://www.analog.com/technology/dsp/TigerSHARC/backgrounder.html 4. http://www.analog.com/library/dspManuals/Tigersharc_hardware.html 5. http://www.analog.com/library/dspManuals/Tigersharc_instruction.html 6. http://www.btid.com/procsum/tsfloat.htm 7. http://www.analog.com/library/applicationNotes/dsp/tigerSharc/EE-147.pdf 8. http://www.analog.com/technology/dsp/TigerSHARC/architecture.html 9. http://www.analog.com/library/dspManuals/pdf/TSDSP_instruction/tsintr.pdf (2-182 - 2-188) 10. ADSP-2106x SHARC User’s Manual, Second Edition 11. http://www.analog.com/library/dspManuals/pdf/TSDSP_instruction/tsin_flw.pdf (3-9 - 3-16) 24 Note from Dr. Smith  Information on Burg algorithm outside ICT536. It is essentially an FIR filter used for prediction (i.e. what FIR coefficients are needed so that the filtered signal is "white noise" ) 25

Analog Devices TigerSHARC® DSP Family

Related documents

Products

Support

Analog Devices TigerSHARC® DSP Family

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib