Analog Devices TigerSHARC® DSP Family

advertisement
Analog Devices
TigerSHARC® DSP
Family
Presented By: Mike Lee and
Mike Demcoe
Date: April 8th, 2002
1
TigerSHARC Architectural
Overview



High performance, 128-bit successor to the ADSP-2106x SHARC
family
ADSP-TS101S, the newest TigerSHARC DSP, operates at 250MHz!
Multiple computational units

Two compute blocks, each containing a register file, ALU, multiplier, and
shifter.
 Two additional integer ALUs
 Two hardware loop counter registers

Can execute up to four independent 32-bit instructions at a time



Or, eight 16-bit instructions
Very wide word widths for high precision arithmetic
Designed to be used in a multiple processor environment
2
TigerSHARC Architecture Overview
(cont…)

BTB (Branch Target Buffer) as a means of
alleviating issues with the deep pipeline
 32-instruction,
4-way set-associative cache
 User controlled Branch Prediction



Three, 128-bit blocks of memory which provide
access to a program and two data operands
without causing instruction/data conflicts.
Load-store, Harvard architecture, like SHARC.
Native support for complex number instructions
3
The TS101S Architecture
4
Details of Multiple Compute Blocks

Two computational units, each containing:
file – Multi-ported to allow multiple accesses
to registers in a single clock cycle
 Register


General purpose registers!
Contains 32 words, each word being 32-bits in length.
– Fixed-point and floating point
 Multiplier – Fixed-point and floating point
 ALU

Also features MAC (multiply-and-accumulate) capabilities
– Standard logical and arithmetic shifts as well
as bit manipulation
 Shifter
5
The TS101S Pipeline
Fetch 1
Fetch
Stages
Fetch 2
Fetch 3
IAB
Decode
Integer
Execution
Stages
Access
Execute 1
Execute 2
6
Pipelines and Instruction Related
Information

ADSP-21061
 Three
stage pipeline
 20ns instruction cycle
 SISD but can put instructions in parallel

ADSP-TS101S
 Eight
stage pipeline with IAB
 4ns instruction cycle
 MIMD and can also put instructions in parallel
7
Loops, Branching and Timers

ADSP-21061
 Zero-overhead
hardware loop support
 Delayed Branching
 One timer

ADSP-TS101S
 Little
support for zero-overhead hardware loops
 32-entry 4-way associative BTB cache with Branch
prediction
 Two timers
8
Memory and Buses

ADSP-21061
1
Mbit dual ported SRAM
 Shared by three buses (PM, DM, I/O)
 PM and DM share a port while the I/O receives it’s
own

ADSP-TS101S
6
Mbit of SRAM (Quad Ported??)
 User defined partitions
 Each block is accessed by one 128-bit bus
9
Multiplication and other Nifty
Tricks

ADSP-21061
 MAC
instructions (MRF and MRB)
 Various precision output (32, 40, or 80 bit)

ADSP-TS101S
 Each
compute block has it’s own set of MAC registers
 8 16-bit MAC with 40-bit accumulation or 2 32-bit
MAC with 80-bit accumulation
 Complex number MAC instructions
 128-bit accelerator

Trellis decoding (8 Trellis butterflies per cycle)
10
Data Address Generation

ADSP-21061
2
data address generation units (DAGS)
 8 circular buffers per DAG

ADSP-TS101S
2
data address generation units (IALU)
 4 circular buffers per IALU

Both support modulo arithmetic, bit reversal
addressing, and post and pre-modify instructions
11
Ease of Use

ADSP-21061
 Easy to use
 Algebraic instruction set
 Visual DSP environment

ADSP-TS101S
 Similar
to 21061 but know have to consider 2
compute blocks
 ADI suggests leaving parallelization to their optimizing
compiler
 Visual DSP environment
12
Specific DSP Algorithms and the
TigerSHARC
In ENEL515 (and/or related articles) we’ve
studied the FIR, IIR, and FFT algorithms
 TigerSHARC has a massively parallel
architecture that is tailored to performing
these algorithms.

13
FIR Filter Characteristics


Think back (or forward, depending on how much you’ve
procrastinated) to Lab #3.
FIR Characteristics





Simple, long loop
Repetitive calculations (multiply, then add!)
Access to an array of coefficients, and an array of “delay-line”
values
Few data dependency issues during the calculation of a single
output
For a filter of length N, require N multiplications and N
adds to obtain a single output value.
14
TigerSHARC and the FIR Filter


The general idea is: Divide and conquer!
Take a filter of size N and split it into two groups of N/2



Utilize the TigerSHARC’s multiple computational units and MAC
instructions to perform the algorithm in ½ the time (plus some overhead)
Two hardware loop counters to simultaneously control the two new
“N/2” size FIR loops with no overhead!
Can do all of the following SIMULTANEOUSLY!

Fetch two operands (one coefficient, one delay line value) from two
separate memory banks
 Fetch the next instruction
 Perform arithmetic operations on the PREVIOUS operands!

Unlike SHARC, instruction/data clashes are non-existant due to the
numerous bus paths linking computational units to memory space
15
TigerSHARC and the FIR Filter
(continued….)

8-cycle-deep pipeline



The long loop characteristic of the FIR filter algorithm
allows us to keep the 8-cycle-deep pipeline full


Stalls are expensive..
Branch Target Buffer reduces performance loss that results from
branching in a deeply pipelined processor
Full pipeline means fast algorithm
FIR Filter algorithms rely heavily on data sets that are
aligned in memory


Post-increment is your friend
TigerSHARC Quad Data Accesses – Supply four aligned words
to one compute block or two aligned words to each compute
block.
16
Example Instructions
X/Y
Conditional Compute
if xALE; do, R0=R1+R2
Condition codes,
AEQ, ALT, ALE, ALU, MEQ, MLT, MLE, SEQ, SLT, SF0, SF1.
A = Adder, M = Multiplier, S = Shifter
Memory Addessing
Indirect post-modify with update, register offset:
YR20=[J1+=J2]
Indirect post-modify with update, 8-bit immediate offset:
Q[K1+=0xF8]=XYR3:0
Indirect pre-modify no update, register offset:
J3:2=L[K1+K2]
Indirect pre-modify no update, immediate offset:
YR3:2=L[K1+0x0003333]
Complex Quad 16-bit Fixed Point Multiplication Instructions
{X|Y|XY} MRa += Rm ** Rn {({U}{I}{C|CR}{J})}
{X|Y|XY} Rs|Rsd=MRa, MRa+= Rm ** Rn {({U}{I}{C}{J})}
17
FIR Code Example
18
TigerSHARC and the IIR Filter



Short, simple loop characteristic
 Means loop overhead is more of a concern
 Means keeping the pipeline full is tougher!
 Time to unroll the loop, although ADI says to let
VisualDSP do it for you.
Again, split up the calculations on an N-tap IIR filter into two
N/2 sets operating simultaneously
 Idea: One computational block does feedforward
calculations, one does feedback!
Complex numbers commonly required



Hardware support for complex MAC in TigerSHARC
Again, Quad Data Access comes in handy for aligned data
Post-increment is still your friend
19
TigerSHARC and the FFT


Does not use the same MAC modes that IIR and FIR filters do.
Requires more complicated addressing modes

Example: Bit reverse addressing



Difficult to split onto separate computational units and even more
difficult to split amongst distributed processors
Requires large arrays of complex variables and fixed coefficients



Found on both SHARC and TigerSHARC
Hardware complex number MAC comes in handy again!
Large arrays of aligned data – Quad Data access again!
Requires HIGH-PRECISION arithmetic

Luckily we have 64-bit fixed point arithmetic and 40-bit extended floating
point arithmetic.
 80-bit MAC precision

FFT Requires many intermediate values

32 GP registers in a single computational block
20
http://www.analog.com/technology/dsp/Sharc/benchmarks.html
21
http://www.analog.com/technology/dsp/TigerSHARC/benchmarks.html
22
Conclusion

TigerSHARC have a very SHARC-like
architecture, except it’s MUCH more complex.
 Highly


optimized for parallelism
Major features: Complex number support,
multiple computational units, high instruction
throughput, wider buses.
Performs DSP algorithms including FIR, IIR, FFT
significantly faster than SHARC!
23
References













1. http://www.analog.com/productSelection/pdf/ADSP-21061_L_b.pdf
2. http://products.analog.com/products/info.asp?product=ADSP-TS101-S
3. http://www.analog.com/technology/dsp/TigerSHARC/backgrounder.html
4. http://www.analog.com/library/dspManuals/Tigersharc_hardware.html
5. http://www.analog.com/library/dspManuals/Tigersharc_instruction.html
6. http://www.btid.com/procsum/tsfloat.htm
7. http://www.analog.com/library/applicationNotes/dsp/tigerSharc/EE-147.pdf
8. http://www.analog.com/technology/dsp/TigerSHARC/architecture.html
9. http://www.analog.com/library/dspManuals/pdf/TSDSP_instruction/tsintr.pdf
(2-182 - 2-188)
10. ADSP-2106x SHARC User’s Manual, Second Edition
11. http://www.analog.com/library/dspManuals/pdf/TSDSP_instruction/tsin_flw.pdf
(3-9 - 3-16)
24
Note from Dr. Smith

Information on Burg algorithm outside
ICT536. It is essentially an FIR filter used
for prediction (i.e. what FIR coefficients
are needed so that the filtered signal is
"white noise" )
25
Download