ENCM DSP Processor requirements

Processor Requirements needed to optimize DSP performance
M. R. Smith,
Electrical and Computer Engineering,
University of Calgary, Alberta, Canada
smithmr@ucalgary.ca
2000/03/05
To be tackled today

Characteristics of DSP algorithms
Specialized handling of
  Multiplication
  Division (21K has no division instruction)
ENCM515 Reference Material
  How RISCy Is DSP, IEEE Micro (Jan-10)
  Simply Signal Processing (Jan-40)
  Fast Scaling, CCI (Apr-10)
  Saturation Arithmetic (Apr-20)
ENCM515 -- Characteristics needed in DSP processors
Copyright smithmr@ucalgary.ca
DSP Algorithms


DSP algorithms require specialized features on processors
Processors are a compromise
  speed, cost, silicon
When have you, as a designer, found a compromise that meets your requirements?
As a consultant, you may have to add DSP characteristics to an existing system, or add a DSP coprocessor to it
FIR






Multiply/Addition intensive
Sum operation with high precision -- overflow considerations
Long simple loop
Online operation -- “infinite” amount of data
Store coefficients on-chip for fast access
Complex domain arithmetic
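
The FIR characteristics above can be sketched in C. This is a minimal illustration, not code from the course notes: the function name fir_step, the 16-bit sample format and the 64-bit accumulator are all assumptions chosen to show the multiply/add loop, the wide sum, and a circular history buffer for an endless input stream.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One output sample of an ntaps-tap FIR filter on 16-bit fixed-point data.
 * coeff[k] multiplies the sample that arrived k steps ago.
 * history[] is a circular buffer; *newest indexes the latest sample. */
int64_t fir_step(const int16_t *coeff, int16_t *history, size_t ntaps,
                 size_t *newest, int16_t sample)
{
    assert(ntaps > 0 && *newest < ntaps);

    /* Overwrite the oldest sample: an "infinite" input stream never
     * needs more than ntaps stored values. */
    *newest = (*newest + 1) % ntaps;
    history[*newest] = sample;

    /* Accumulate in 64 bits so a long sum of 32-bit products cannot overflow. */
    int64_t acc = 0;
    size_t idx = *newest;
    for (size_t k = 0; k < ntaps; k++) {
        acc += (int32_t)coeff[k] * history[idx];
        idx = (idx == 0) ? ntaps - 1 : idx - 1;  /* step back one sample, wrapping */
    }
    return acc;
}
```

Feeding a unit impulse followed by zeros returns the coefficients in order, which is a quick check that the buffer wraps correctly.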
IIR-1





Interrelated and order-dependent multiplications and additions
Small number of delays -- via register moves?
Short loop -- the low number of instructions in the loop makes it difficult to optimize
Precision -- very important because of feedback
Multiple stages -- i.e. IIR follows IIR, etc.
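
A short C sketch may make these points concrete. It assumes a second-order section in direct form II, which the slide does not specify; the names biquad, biquad_step and iir_cascade are hypothetical. Note how each line depends on the previous one (hard to pipeline), how the delays are plain moves, and how stages chain.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    float b0, b1, b2;   /* feed-forward coefficients */
    float a1, a2;       /* feedback coefficients */
    float w1, w2;       /* the two delay elements */
} biquad;

/* One sample through one second-order section (direct form II, assumed). */
float biquad_step(biquad *s, float x)
{
    float w0 = x - s->a1 * s->w1 - s->a2 * s->w2;          /* feedback first */
    float y  = s->b0 * w0 + s->b1 * s->w1 + s->b2 * s->w2; /* needs w0 */
    s->w2 = s->w1;      /* delays implemented as moves */
    s->w1 = w0;
    return y;
}

/* Multiple stages: the output of one IIR feeds the next. */
float iir_cascade(biquad *stages, size_t nstages, float x)
{
    assert(stages != NULL && nstages > 0);
    for (size_t i = 0; i < nstages; i++)
        x = biquad_step(&stages[i], x);
    return x;
}
```

With b0 = 1 and a1 = -0.5 (all else zero) the impulse response is 1, 0.5, 0.25, ..., showing how each output is reused by the next and why rounding errors accumulate through the feedback path.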
IIR-2 LDI



Short complicated loop
Many intermediate values
Pipeline issues because of interdependence
FFT





Complex variables (A and B) and fixed coefficients (W)
Address calculations are complex
Memory accesses are numerous
Multiplications and additions
Need for fast access to many registers, address pointers, constants, variables
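
The register pressure is easiest to see in a single butterfly. The sketch below assumes the common radix-2 decimation-in-time form A' = A + W*B, B' = A - W*B; the slide names A, B and W but does not give the equations, and the type cplx and function butterfly are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float re, im; } cplx;

/* Radix-2 butterfly, in place (assumed form):
 *   A' = A + W*B,   B' = A - W*B */
void butterfly(cplx *a, cplx *b, cplx w)
{
    assert(a != NULL && b != NULL && a != b);

    /* Complex product W*B: four multiplies and two adds. */
    float tr = w.re * b->re - w.im * b->im;
    float ti = w.re * b->im + w.im * b->re;

    /* The sum and the difference share the same operands. */
    b->re = a->re - tr;
    b->im = a->im - ti;
    a->re = a->re + tr;
    a->im = a->im + ti;
}
```

Six input values, two temporaries and four results are live at once, before counting the pointers used to find A, B and W in memory.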
Fast instruction cycle -- needed

DSP chips -- two-cycle instructions (on top of FETCH/DECODE) during which the processor performs many parallel operations
  More recent technology -- 1 clock cycle
Many processors take 6 to 32 cycles to handle MULT, FMULT, FDIV or even FADD
Make the processor highly pipelined -- the pipeline must be started and then kept full
  FIR (easy to pipeline)
  IIR (hard to pipeline)
  FFT (challenging to pipeline)
Loop Overhead -- must be minimized

Use specialized hardware
  specialized decrement and branch instructions occurring in a single cycle
  instruction cached with counter
  superscalar operations
  delayed branches
  hardware loop control
Use specialized software techniques
  loop unrolling
  down counting loops
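
The two software techniques can be sketched in C as follows. The function names are illustrative only, and whether the unrolled form is actually faster depends on the compiler and processor.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward up-counting loop: one compare and branch per element. */
long sum_simple(const int *x, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Unrolled by four, with loop counters that count down to zero so the
 * loop test is a compare against zero (a decrement-and-branch pattern). */
long sum_unrolled(const int *x, size_t n)
{
    assert(x != NULL || n == 0);
    long s = 0;
    for (size_t blocks = n / 4; blocks != 0; blocks--) {
        s += x[0];
        s += x[1];
        s += x[2];
        s += x[3];
        x += 4;                 /* one pointer update per four elements */
    }
    for (size_t rest = n % 4; rest != 0; rest--)
        s += *x++;              /* leftover elements */
    return s;
}
```

The unrolled version pays the loop overhead once per four additions instead of once per addition, at the cost of larger code.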
Memory operations -- Many of them


Data/instruction and data/data conflicts
Data caches
Will also have external data memory banks
Harvard architecture
branch target caches
multi-ported memory
register pre-forwarding -- avoid stalls while trying to write back the result of an ALU operation only to re-access the same register
large register banks -- avoid memory ops associated with just-calculated values
Precision -- high but without speed loss






FIR -- accumulated value can grow big
IIR -- recursive use of a value
External Memory bus width
Internal Memory bus width
Data width of registers and ALU
Saturation arithmetic
Saturation Arithmetic



For a full discussion see the 21K SHARC user manual and also “Being Assertive with your processor” (APR-20)
Internal register is 80 bits wide but external buses are only 32 bits wide
0xFFFF F0000001 00000000
  stored as F0000001
0xFFFF 00000001 00000000
  stored as 00000001 (normal math)
  stored as 80000000 (saturation)
Can be a good solution (FIR) or a bad solution (IIR) to the problem of overflow
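
The two storing behaviours can be imitated in C. This sketch makes a simplifying assumption: the upper 48 bits of the 80-bit register are modelled as a signed 64-bit integer and the low 32 bits are ignored. The function names are hypothetical and this is not the 21K mechanism itself.

```c
#include <assert.h>
#include <stdint.h>

/* Normal math: keep the low 32 bits and silently drop the rest. */
uint32_t store_wrapped(int64_t acc)
{
    return (uint32_t)acc;
}

/* Saturation: clamp to the largest or smallest 32-bit value. */
int32_t store_saturated(int64_t acc)
{
    if (acc > INT32_MAX) return INT32_MAX;
    if (acc < INT32_MIN) return INT32_MIN;
    return (int32_t)acc;
}

/* The two register values from the slide, as signed 48-bit numbers. */
void saturation_demo(void)
{
    int64_t fits     = -268435455LL;    /* 0xFFFF F0000001 */
    int64_t overflow = -4294967295LL;   /* 0xFFFF 00000001 */

    assert(store_wrapped(fits) == 0xF0000001u);
    assert(store_saturated(fits) == -268435455);      /* also 0xF0000001 */
    assert(store_wrapped(overflow) == 0x00000001u);   /* sign is lost */
    assert(store_saturated(overflow) == INT32_MIN);   /* 0x80000000 */
}
```

Wrapping turns a large negative sum into a small positive one; saturation keeps the sign and the approximate size, which suits a one-off FIR sum better than a value that is fed back through an IIR.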
Complex arithmetic -- frequency domain operations




Need to fetch real and imaginary parts at different times during the algorithm
Need fast access to adjacent memory locations -- burst memory
Need for many internal registers to temporarily store real/imaginary components (FFT butterfly and last year's exams)
Duplication of resources -- was custom, but now consider the 21160
TigerSHARC ADSP-21160 Core Architecture
[Figure: core block diagram. Labelled blocks: cache memory (32 x 48); DAG 1 and DAG 2 (each 8 x 4 x 32); program sequencer; timer; flags; JTAG test & emulation; PMA bus (32) and DMA bus (32); PMD bus (64) and DMD bus (64); bus connect; and two identical processing elements, each with a register file (16 x 40), a floating- and fixed-point multiplier with fixed-point accumulator, a 32-bit barrel shifter, and a floating-point and fixed-point ALU.]
Address calculations -- frequent


Complex addressing modes -- take many clock cycles
Use pointers and autoincrement rather than calculating pointer + offset
  need many address-related registers
  address calculations compete with ALU calculations
  group instructions within program
    e.g. read and store often use the same or similar addresses, so don't recalculate the addresses
Specialized addressing modes







standard memory access
premodify
postmodify
circular buffers (modulo arithmetic on the address registers)
bit-reverse addressing
structure handling
auto-increment with size accounted for
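
Two of these modes are shown below in plain C, to make clear what the address hardware saves. The function names are hypothetical; on a DSP each of these would typically be folded into the memory access itself rather than executed as separate instructions.

```c
#include <assert.h>
#include <stddef.h>

/* Post-modify with circular wrap: return the current index, then advance
 * it by step, modulo the buffer length. */
size_t circ_postmodify(size_t *index, size_t step, size_t length)
{
    assert(length > 0 && *index < length);
    size_t current = *index;
    *index = (*index + step) % length;
    return current;
}

/* Bit-reverse addressing: reverse the low `bits` bits of i, the access
 * order needed to unscramble a radix-2 FFT. */
unsigned bit_reverse(unsigned i, unsigned bits)
{
    unsigned r = 0;
    for (unsigned b = 0; b < bits; b++) {
        r = (r << 1) | (i & 1u);   /* move the lowest bit of i onto r */
        i >>= 1;
    }
    return r;
}
```

For an 8-point FFT (3 address bits), index 1 maps to 4 and index 3 maps to 6. In software this costs a loop per access, which is why dedicated hardware support matters.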
Key issue -- ease of development




Microcontrollers -- onboard peripherals
Host communication
Multiprocessor communications
Simulators




Multi-processor operations
Application notes
Good working environment
Compatibility with previous processor versions -- legacy code (an advantage and a disadvantage)
Multiplication Extensive algorithms

Off-chip multipliers have big bottlenecks





Get and then give instruction to multiplier
Get and then give first, second data to multiplier
Wait till cooked, and then get value
Newer chips have on-board multiplication or intelligent co-processors (F-LINE exceptions)
Many chips do multiplication using specialized techniques introduced by optimizing compiler
Smart Multiplication through
optimizing compiler techniques



29K RISC FMULT execution takes 6 cycles + fetch
16-bit x 16-bit INTEGER multiplication on 68K CISC takes 70 cycles regardless of operations
Use adds and shifts instead since these take less time -- easy with integers, but floats?
  What are the equivalent operations on the 21K? Discussed in the early lecture on Quirks and SHARCs
Smart Integer 68k Multiplication

Multiplication by 2, 4, 8, 16




Achieved by shifting 1, 2, 3 or 4 times (done in 6 + 2n cycles on 68K)
D2 = D0 * 19
  MOVE.W D0, D2
  ASL.W #4, D2      ; D2 = D0 * 16
  ADD.W D0, D2      ; D2 = D0 * 17
  ASL.W #1, D0      ; D0 = D0 * 2
  ADD.W D0, D2      ; D2 = (original D0) * 19
  (29 cycles compared to 70)
Watch out for overflow; may need conversion to 32 bits (SSI, SSF on some processors -- not only 21k)
Waste of time if you have single-cycle multipliers (21k?). Careful, because multiplication results may end up in a special register.
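
The same shift-and-add sequence in C, as a sketch. Unsigned arithmetic is used here as an assumption to keep the shifts well defined in C, and the operand is widened to 32 bits first, which is the overflow precaution mentioned above. The name times19 is illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply by 19 without a multiply instruction: 19 = 16 + 1 + 2. */
uint32_t times19(uint16_t d0)
{
    uint32_t x  = d0;        /* widen so the result cannot overflow 16 bits */
    uint32_t d2 = x << 4;    /* x * 16 */
    d2 += x;                 /* x * 17 */
    x <<= 1;                 /* x * 2  */
    d2 += x;                 /* x * 19 */

    /* Cross-check against an ordinary multiply. */
    assert(d2 == (uint32_t)d0 * 19u);
    return d2;
}
```

A compiler targeting a processor with a slow multiplier may generate this kind of sequence automatically for constant multipliers.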
Multiplication Extensive algorithms

Highly pipelined, therefore complex instruction interdependence
  R0 = R1 * R2
  R3 = R4 * R5
BUT
  R0 = R1 * R2
  R3 = R0 * R5     <- delay dependency
Need automated tools to schedule instructions
Need multiple destinations (registers) for multiplier result
Multiply and Accumulate (MAC) instruction
  Super-scalar operations even on a simpler processor
  Cause problems in short loops
  Many types of MACs needed
Not all processors have the 21061 single-cycle multiplication operation
See “In the AM29050 a FIR-bearing animal” (FEB-80 in class notes)

Typically need “Normalization” of result

N point DFT
  Result = DFT (Input) ; 0 <= n < N
N point inverse DFT
  Result = IDFT (Input) / N ; 0 <= n < N
Division is typically done by the equivalent of repeated subtraction -- 150 cycles on 68K
  /* unsigned operands, Denom > 0 */
  result = 0;
  while (Numerator >= Denom) {
      Numerator = Numerator - Denom;
      result++;
  }
Special shift-subtract tricks speed operations
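
One well-known shift-subtract scheme is restoring binary long division, sketched below. The slide does not say which trick it has in mind, so treat this as one possible example; the function name and the 32-bit operand width are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Restoring shift-subtract division: one shift, one compare and at most
 * one subtract per quotient bit, so always 32 steps for 32-bit operands
 * instead of one step per unit of the quotient. */
uint32_t divide_shift_subtract(uint32_t numerator, uint32_t denom,
                               uint32_t *remainder)
{
    assert(denom != 0);
    uint32_t quotient = 0;
    uint64_t rem = 0;    /* wider than 32 bits so the shift cannot overflow */

    for (int bit = 31; bit >= 0; bit--) {
        /* Bring down the next bit of the numerator. */
        rem = (rem << 1) | ((numerator >> bit) & 1u);
        if (rem >= denom) {          /* does the divisor fit? */
            rem -= denom;
            quotient |= (uint32_t)1 << bit;
        }
    }
    if (remainder != NULL)
        *remainder = (uint32_t)rem;
    return quotient;
}
```

Repeated subtraction needs as many passes as the size of the quotient; this loop needs a fixed 32 passes however large the quotient is.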
Smart Integer Division

Division by 2, 4, 8, 16
  unsigned   LSR.B #1, D0
  signed     ASR.B #1, D0
Need to propagate (or not propagate) the sign bit
  Unsigned   original = 0x80 (128)     final = 0x40 (64)
  Signed     original = 0x80 (-128)    final = 0xC0 (-64)
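
The difference between the two right shifts can be written out explicitly in C on 8-bit patterns. The helper names are hypothetical; the arithmetic shift is built from unsigned operations so that the result does not depend on how a particular compiler shifts negative values.

```c
#include <assert.h>
#include <stdint.h>

/* Logical shift right: a zero enters at the top. Correct for unsigned values. */
uint8_t lsr1(uint8_t x)
{
    return (uint8_t)(x >> 1);
}

/* Arithmetic shift right: the sign bit is copied back into the top.
 * Correct for two's complement signed values. */
uint8_t asr1(uint8_t x)
{
    return (uint8_t)((x >> 1) | (x & 0x80u));
}

/* The example from the slide: the same bit pattern, two interpretations. */
void shift_demo(void)
{
    assert(lsr1(0x80) == 0x40);   /*  128 / 2 =  64 */
    assert(asr1(0x80) == 0xC0);   /* -128 / 2 = -64 */
}
```

One caveat: an arithmetic shift rounds toward minus infinity, so the pattern 0xFF (-1) shifts to 0xFF (-1), whereas integer division of -1 by 2 in C gives 0.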
Floating Point Division


The FDIV on 29K takes 15 cycles
There is no FDIV on the 21K -- use recursion!!
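
One way to read "use recursion" is iterative refinement of a reciprocal estimate (Newton-Raphson), which needs only multiplies and subtracts. That reading, the function names, and passing the starting estimate as a parameter are all assumptions of this sketch; on real hardware the seed would typically come from a small table or a dedicated seed instruction.

```c
#include <assert.h>

/* Refine an estimate of 1/d using x = x * (2 - d * x).
 * Converges when 0 < seed < 2/d; the error roughly squares on each pass. */
float reciprocal(float d, float seed)
{
    assert(d > 0.0f && seed > 0.0f);
    float x = seed;
    for (int pass = 0; pass < 3; pass++)
        x = x * (2.0f - d * x);
    return x;
}

/* Division becomes one reciprocal followed by one multiply. */
float divide(float n, float d, float seed)
{
    return n * reciprocal(d, seed);
}
```

Starting from a seed of 0.03 for d = 32 (a 4% error), three passes bring the estimate to about 0.03125, within single-precision rounding. When dividing many values by the same constant, the reciprocal is computed once and reused.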
Why is floating point so difficult?
Number      Internal representation
1.0         0x3F 80 00 00
32.0        0x42 00 00 00
31.98125    0x41 FF D9 9A
1023.4      0x44 7F D9 9A
31.98125 = 1023.4 / 32 = 1023.4 / 2^5
Why is floating point so difficult?


“Fast scaling Routine for Floating-point RISC and DSP processors” (APR-10)
Floating Point Format
  bit 31    bits 30-23    bits 22-0
  S         bexp          frac
Floating point number K
$K = (-1)^{s} \times 1.\mathit{frac} \times 2^{(\mathit{bexp} - 127)}$
$1.0 = \mathtt{0x1.0} \times 2^{0} = (-1)^{0} \times \mathtt{0x1.0000} \times 2^{(127 - 127)}$
Floating point number K
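$K = (-1)^{s} \times 1.\mathit{frac} \times 2^{(\mathit{bexp} - 127)}$
$10.0 = \mathtt{0xA.0} = \%1010.0 = \%1.0100 \times 2^{3} \quad (\mathtt{0x1.4} \times 2^{3})$
$10.0 = (-1)^{0} \times \mathtt{0x1.4000} \times 2^{(130 - 127)}$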
IEEE Std. 754, 1985
Number      Internal representation    s    bexp    frac
1.0         0x3F 80 00 00              0    0x7F    0x00 00 00
32.0        0x42 00 00 00              0    0x84    0x00 00 00
31.98125    0x41 FF D9 9A              0    0x83    0x7F D9 9A
1023.4      0x44 7F D9 9A              0    0x88    0x7F D9 9A
1.frac -- only the fractional part is stored
Remember JAMES BOND helped by M (Smith)
“The ONE is remembered and not stored”
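
The field values in the table can be reproduced with a few lines of C. This sketch assumes the platform's float is the 32-bit IEEE 754 single format; the type fp_fields and function fp_decode are illustrative names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t s;      /* sign, 1 bit */
    uint32_t bexp;   /* biased exponent, 8 bits */
    uint32_t frac;   /* stored fraction, 23 bits (the leading 1 is implied) */
} fp_fields;

/* Split a single-precision value into its three stored fields. */
fp_fields fp_decode(float k)
{
    assert(sizeof(float) == sizeof(uint32_t));
    uint32_t bits;
    memcpy(&bits, &k, sizeof bits);    /* copy out the raw bit pattern */

    fp_fields f;
    f.s    = bits >> 31;
    f.bexp = (bits >> 23) & 0xFFu;
    f.frac = bits & 0x7FFFFFu;
    return f;
}
```

Decoding 1023.4f and 31.98125f gives the same frac (0x7FD99A) and biased exponents of 0x88 and 0x83, which differ by exactly 5.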
Fast floating pt division possible
Number      Internal representation    s    bexp    frac
1.0         0x3F 80 00 00              0    0x7F    0x00 00 00
32.0        0x42 00 00 00              0    0x84    0x00 00 00
  BEXP DIFF = 5
31.98125    0x41 FF D9 9A              0    0x83    0x7F D9 9A
1023.4      0x44 7F D9 9A              0    0x88    0x7F D9 9A
  BEXP DIFF = 5
K = K / -1 -- flip the sign bit with an XOR instruction
K = K / N where N = 2^p -- decrease bexp: bexp = bexp - p (p = 5 for division by 32)
Fast Floating Point Division by 32 Doing it

29K -- FP# K is in gr96
Setting up the power
  CONST BEXPchange, 5
Setting up the bexp-diff
  SLL BEXPchange, BEXPchange, 23
result = K / 32
  SUB result, K, BEXPchange     <- REPEATED
Note -- when processing a large array, only the last step is needed for every number (inside the loop)
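
The same three steps in C, as a sketch that assumes 32-bit IEEE 754 floats. The function name fast_div_pow2 is hypothetical, and the routine deliberately has no guard for zero, very small values, infinities or NaNs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Divide a normal, nonzero float by 2^p with one integer subtraction:
 * subtract p from the biased exponent field. No range checks. */
float fast_div_pow2(float k, unsigned p)
{
    assert(sizeof(float) == sizeof(uint32_t) && p < 255);
    uint32_t bits;
    memcpy(&bits, &k, sizeof bits);     /* view the FP number as an integer */
    bits -= (uint32_t)p << 23;          /* the SUB on the shifted constant */
    memcpy(&k, &bits, sizeof k);
    return k;
}
```

In a loop over an array, the shifted constant would be set up once and only the subtraction repeated per element.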
Fast Floating Point Division by FP M
when M is known to be 2^p
F0 = 1.0
Setting up the bexp-diff
  R0 = R8 - R0          // NOTE integer operation
result = K / 32
  R4 = R4 - R0
Works because
  F8 = 32.0 (0x42000000)
  F0 = 1.0 (0x3F800000)
  so R8 - R0 = 0x02800000, which is already 5 shifted left by 23; no further ASHIFT is needed
PROBLEMS?


Try to do 0 / 32
Get a large negative number
Number             s    bexp    frac
0.0                0    0x00    0x00 00 00
subtract           0    0x05    0x00 00 00
-2.126 * 10^37     1    0xFB    0x00 00 00
If dividing by 2^p -- problems if the number is smaller than 2^(p-127)
  Must be overcome on many processors
  Non-issue on 21k, which has single-cycle multiplication: calculate the reciprocal and then multiply
Must guarantee result

68K, 29K, MIPS and 21k problems
  ADD.W D0, D1
  ADD gr96, gr97, gr98
Every addition (subtraction) result has the possibility of being out of range -- overflow. Must be tested.
68K solution
  ADD.W D0, D1
  BVS Somewhere          <- Test takes cycles
29K and MIPS solution
  Special instructions -- ADDU and ADDS
21k solution is what?
Specialized coding techniques, e.g. 29k has the ability of “throwing” SWI as part of compare (ASSERT)
Test for FP number too small from previous special division operation
68K code     <- EXTRA cycles always executed
            CMP.L #toosmall, D0
            BGE okay
            MOVE.L #0, D0
            BRA continue
  okay:     SUB.L #b_exp, D0
  continue:
  ASGE TRAP#, temp, BEXPchange     <- Only “compare” for 29k
  SUB gr96, gr96, BEXPchange       <- Not in a delay slot?
where TOOSMALL:
  CONST gr96, 0
  RTI
Extra code only executed in the special case that it is needed
Specialized conditional instructions on 21k

21K -- F4 contains the FP value -- need F4/32
  R0 = 5
  R0 = ASHIFT R0 BY 23
  F1 = minimum value ( 2^(5-127) )
  F2 = ABS F4
  COMP (F2, F1)
  IF GE R4 = R4 - R0 ELSE R4 = R4 - R4     <- NO DELAY
Can’t use
  ELSE R4 = 0
as this is not a compute operation but uses a 32-bit constant.
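
The intent of the guarded sequence can be shown in C, leaving aside whether the 21K can encode it as one instruction. This sketch assumes 32-bit IEEE 754 floats; both function names are hypothetical, and infinities and NaNs are not handled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Unguarded version: subtract p from the exponent field, no checks. */
float unguarded_div_pow2(float k, unsigned p)
{
    uint32_t bits;
    memcpy(&bits, &k, sizeof bits);
    bits -= (uint32_t)p << 23;
    memcpy(&k, &bits, sizeof k);
    return k;
}

/* Guarded version: if |k| is below 2^(p-127) the exponent field would go
 * negative, so return zero instead (the ELSE R4 = R4 - R4 branch). */
float safe_div_pow2(float k, unsigned p)
{
    assert(sizeof(float) == sizeof(uint32_t) && p >= 1 && p <= 126);
    uint32_t bits;
    memcpy(&bits, &k, sizeof bits);

    uint32_t magnitude = bits & 0x7FFFFFFFu;   /* ABS: clear the sign bit */
    uint32_t minimum   = (uint32_t)p << 23;    /* bit pattern of 2^(p-127) */

    /* Non-negative floats order the same way as their bit patterns,
     * so an integer compare does the job of COMP. */
    if (magnitude >= minimum)
        bits -= minimum;
    else
        bits = 0;

    memcpy(&k, &bits, sizeof k);
    return k;
}
```

Without the guard, 0 / 32 comes out as roughly -2.1 x 10^37; with it, the result is 0 and ordinary values are scaled as before.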
LIES -- ALL LIES
IF GE R4 = R4 - R0 ELSE R4 = R4 - R4




This is not a legal instruction either!!
COMPUTE instructions take 22 bits to describe
IF JUMP/CALL ELSE R4 = R4 - R4 is allowed
Useless approach anyway since there are better ways on 21k to do repeated division by a constant.
Processors compared


IEEE Micro Magazine Special Feature 1992
DSP
  TMS320C25, 030
  DSP56000/1, DSP96002 (Motorola)
RISC
  i860 (Intel)
  MC88100 (Motorola)
  SPARC (Sparc Consortium NOT Sun)
  Am29050
Ideal -- SMITH CRISP
CRISP -- triple pun as well

Comprehensive RISC -- Predicted 1992










Harvard architecture
MAC (rather than Super-Scalar instructions)
Ability to do X = R+S, Y = R-S operations
many registers for address/values
FP as well as integer capability
Bit-reverse addressing
Peripherals with DMA
Low power standby
High precision -- double precision
Efficient pipeline with parallel completion of many operations (dual-ported memory and register banks)
Comparisons -- 1
FIR/IIR
FFT -- Radix 2 and Radix 4
Requirements for “perfect” DSP





Fast instruction cycle -- different from high clock speed
Cycle time adjustable according to instruction type
Fast hardware multiplier
Floating point for easier algorithm design
High precision, implying wide data buses for memory, internal processor transfers, registers and on-board processing units
Requirements for “perfect” DSP




Several data buses available to reduce bus-conflict transfer overhead
Harvard architecture and/or instruction cache to avoid instruction and data-fetch clashes
Duplicate resources for parallel computation of real and imaginary components of complex numbers
Dedicated hardware required for address calculations to avoid APU clash with main algorithm
Requirements for “perfect” DSP

Extensive temporary registers to reduce unwanted fetches of continually used data
  Or single cycle, highly parallel, memory operations
Fast and reliable, easily programmed, developed and upgraded
Inexpensive and easy to develop peripherals
High level of customer support
Inexpensive to purchase
Low power consumption with a standby mode
Tackled today

Characteristics of DSP algorithms
Specialized handling of
  Multiplication
  Division (21K has no division instruction)