DSP Processors – Lecture 13
Ingrid Verbauwhede
Department of Electrical Engineering
University of California Los Angeles
ingrid@ee.ucla.edu
EE213A, Spring 2000, Ingrid Verbauwhede, UCLA, Lecture 13
References
• The origins:
  • E.A. Lee, “Programmable DSP Processors,” Part I, IEEE ASSP Magazine, October 1988, pp. 4-19.
  • Part II, IEEE ASSP Magazine, January 1989, pp. 4-14.
• Good overview:
  • P. Lapsley, J. Bier, A. Shoham, E.A. Lee, “DSP Processor Fundamentals: Architectures and Features,” IEEE Press, 1998.
• More references:
  • P. Faraboschi, G. Desoli, J. Fisher, “The Latest Word in Digital and Media Processing,” IEEE Signal Processing Magazine, March 1998, pp. 59-85 (download from the INSPEC webpage).
  • I. Verbauwhede, M. Touriguian, “Wireless Digital Signal Processors,” Chapter 11 in Digital Signal Processing for Multimedia Systems, K. Parhi, T. Nishitani (Eds.), Marcel Dekker, Inc.
  • C. Nicol, I. Verbauwhede, “DSP Architectures for Next Generation Wireless Communications,” ISSCC 2000 tutorial.
Recall: Memory architecture
FIR execution on:
• Von Neumann: 3 cycles/tap
• Basic Harvard: 2 cycles/tap
• Modified Harvard & repeat loop: 1 cycle per tap & only 3 instructions
Key issues:
• Memory bandwidth by multiple memory banks or multi port memories
• Every memory has its OWN address generation unit operating in parallel
• Special instructions that combine operations with memory moves: MACD
• Indirect addressing: *r1++ or *r2--
• Circular buffers: extra hardware in the address generation units
FASTER THAN 1 CYCLE PER TAP??
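The circular-buffer addressing listed among the key issues can be sketched in software. This is a minimal Python model, not any vendor's address generation unit: the names CircularDelayLine and fir_circular are hypothetical, and the modulo arithmetic stands in for the extra hardware that wraps the pointer at no cycle cost.

```python
class CircularDelayLine:
    """Software model of a circular buffer holding the last N input samples."""

    def __init__(self, n_taps):
        self.buf = [0] * n_taps
        self.head = 0  # index of the newest sample x(n)

    def push(self, sample):
        # Step the pointer back by one with wrap-around, then overwrite the
        # oldest sample. No data is copied.
        self.head = (self.head - 1) % len(self.buf)
        self.buf[self.head] = sample

    def tap(self, i):
        # x(n-i) sits i positions after the head, modulo the buffer length.
        return self.buf[(self.head + i) % len(self.buf)]


def fir_circular(coefs, samples):
    """Run an FIR filter over samples using the circular delay line."""
    line = CircularDelayLine(len(coefs))
    out = []
    for s in samples:
        line.push(s)
        out.append(sum(c * line.tap(i) for i, c in enumerate(coefs)))
    return out
```

The point of the design is that a new sample costs one write and one pointer update, instead of shifting the whole delay line.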
Compute Intensive function 1: FIR (cont.)
[Figure: FIR filter as a tapped delay line (50 TAPS): x(n) passes through a chain of z^-1 delays giving x(n-1), ..., x(n-(N-1)); the taps are multiplied by c(0), ..., c(N-1) and summed to give y(n)]
y(n) = Σ_{i=0}^{N-1} c(i) x(n-i)
y(0) = c(0)x(0) + c(1)x(-1) + c(2)x(-2) + . . . + c(N-1)x(1-N);
y(1) = c(0)x(1) + c(1)x(0) + c(2)x(-1) + . . . + c(N-1)x(2-N);
y(2) = c(0)x(2) + c(1)x(1) + c(2)x(0) + . . . + c(N-1)x(3-N);
. . .
y(n) = c(0)x(n) + c(1)x(n-1) + c(2)x(n-2) + . . . + c(N-1)x(n-(N-1));
One output = 2N reads, N MAC’s, 1 write
Classic Harvard: one output = N cycles
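The operation count for one output can be checked with a small sketch. The function name fir_output_with_counts is hypothetical; the code simply evaluates the sum y(n) = c(0)x(n) + ... + c(N-1)x(n-(N-1)) and tallies reads, MACs, and writes, assuming n >= N-1 so that every index is valid.

```python
def fir_output_with_counts(c, x, n):
    """Compute y(n) and count memory reads, MACs, and writes."""
    reads = macs = 0
    acc = 0
    for i in range(len(c)):
        coef = c[i]        # one coefficient read
        sample = x[n - i]  # one data read
        reads += 2
        acc += coef * sample
        macs += 1
    writes = 1             # y(n) is written back once
    return acc, reads, macs, writes
```

For N = 3 taps this gives 6 reads, 3 MACs, and 1 write, matching the 2N / N / 1 count above.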
FIR speed-up
FIR filtering: two outputs in parallel
y(0) = c(0)x(0) + c(1)x(-1) + c(2)x(-2) + . . . + c(N-1)x(1-N);
y(1) = c(0)x(1) + c(1)x(0) + c(2)x(-1) + . . . + c(N-1)x(2-N);
y(2) = c(0)x(2) + c(1)x(1) + c(2)x(0) + . . . + c(N-1)x(3-N);
. . .
y(n) = c(0)x(n) + c(1)x(n-1) + c(2)x(n-2) + . . . + c(N-1)x(n-(N-1));
Two outputs = 4N reads, 2N MAC’s, 2 writes
Dual MAC architecture with ONLY 2 data buses??
• Read two 32-bit numbers instead of four 16-bit numbers: solution by Lucent 16000 core with dual MAC
• Run MAC at double frequency, read two 32-bit numbers: solution by Matsushita
• Insert delay register: solution by Atmel’s LODE
Example 3: Lucent DSP16210
[Figure: DSP16210 datapath. Labels: buses XDB(32) and IDB(32); registers Y(32) and X(32); two 16 x 16 mpy units producing p0 (32) and p1 (32); two Shift/Sat. stages; ALU, ADD, and BMU; ACC File 8 x 40]
Inner loop of 32-tap FIR Filter
do 14 { //one instruction !
    a0=a0+p0+p1
    p0=xh*yh p1=xl*yl
    y=*r0++ x=*pt0++
}
Outer Loop: 19 cycles, 38 bytes
1 cycle in inner loop
5 exec units used in inner loop
2 MACs per cycle
Horizontal parallelism, one sample at a time
2G mobile wireless base-stations
Courtesy: Gareth Hughes, Bell Labs Australia
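The idea of reading two 32-bit words and feeding two 16 x 16 multipliers can be modelled as follows. This is a sketch under stated assumptions, not DSP16210 code: pack16, unpack16, and dual_mac_dot are hypothetical names, each 32-bit word is assumed to hold two signed 16-bit values (high and low half), and the instruction's pipelining across cycles is ignored.

```python
def pack16(hi, lo):
    """Pack two signed 16-bit values into one 32-bit word."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def unpack16(word):
    """Split a 32-bit word into its signed high and low 16-bit halves."""
    def s16(v):
        return v - 0x10000 if v & 0x8000 else v
    return s16((word >> 16) & 0xFFFF), s16(word & 0xFFFF)


def dual_mac_dot(x_words, y_words):
    """Two MACs per loop pass: a0 = a0 + xh*yh + xl*yl."""
    a0 = 0
    cycles = 0
    for xw, yw in zip(x_words, y_words):
        xh, xl = unpack16(xw)
        yh, yl = unpack16(yw)
        a0 += xh * yh + xl * yl  # both products belong to the same output
        cycles += 1
    return a0, cycles
```

Four taps are consumed in two passes, which is the horizontal parallelism noted above: both multipliers work on the same output sample.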
FIR on Lode
FIR filter: two outputs in parallel with delay register
y(0) = c(0)x(0) + c(1)x(-1) + c(2)x(-2) + . . . + c(N-1)x(1-N);
y(1) = c(0)x(1) + c(1)x(0) + c(2)x(-1) + . . . + c(N-1)x(2-N);
y(2) = c(0)x(2) + c(1)x(1) + c(2)x(0) + . . . + c(N-1)x(3-N);
. . .
y(n) = c(0)x(n) + c(1)x(n-1) + c(2)x(n-2) + . . . + c(N-1)x(n-(N-1));
Total energy for one output sample:

Energy                      Single MAC   Dual MAC   Dual MAC with REG
No. of MAC operations       N            N          N
No. of memory reads         2N           2N         N
No. of instruction cycles   N            N/2        N/2
FIR on Lode
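Two MAC units with dedicated bus network
• DB0 fetches coefficient c(i)
• DB1 fetches data x(n-i+1)
• LREG delays input data, giving x(n-i)
• A0 stores y(n) output
• A1 stores y(n+1) output
[Figure: buses DB0(16) and DB1(16) feed MAC0 and MAC1; c(i) goes to both multipliers, x(n-i+1) goes to one multiplier directly and through LREG, as x(n-i), to the other; accumulators A0 and A1 hold y(n) and y(n+1)]
Same structure can be used for IIR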
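The delay-register scheme can be sketched as a loop that reads one coefficient and one data sample per cycle yet feeds two accumulators. This is a behavioural model, not Lode code: the function name is hypothetical, and it assumes the loop walks i from N-1 down to 0 so that the register holds the older sample x(n-i), with one extra read to preload it.

```python
def fir_dual_mac_delay(c, x, n):
    """Return (y(n), y(n+1), reads) using one delay register."""
    N = len(c)
    reads = 0
    lreg = x[n - (N - 1)]        # preload LREG with the oldest sample
    reads += 1
    a0 = a1 = 0
    for i in range(N - 1, -1, -1):
        coef = c[i]              # DB0: coefficient c(i)
        data = x[n - i + 1]      # DB1: data x(n-i+1)
        reads += 2
        a0 += coef * lreg        # uses delayed x(n-i), accumulates y(n)
        a1 += coef * data        # uses x(n-i+1), accumulates y(n+1)
        lreg = data              # delay the sample for the next cycle
    return a0, a1, reads
```

Under these assumptions two outputs cost 2N + 1 reads, roughly N per output, which is the saving shown in the table for the dual MAC with register.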
Arithmetic
DSP processors come in two flavors:
• floating point
• most popular one: Sharc’s from Analog Devices
• fixed point
• usually 16 bit, sometimes 24 bit (audio processors)
• newer processors might have wider data paths or registers (TI C6x: 16x16 mpy, 32 bit registers, 40 bit ALU)
[Figure: basic datapath: 16 x 16 mpy with 32 bit product, ALU with 40 bit paths, shifter, select 16 bit]
Overflow:
• Saturation logic combined with output shifter
[Figure: datapath: 16 x 16 mpy with 32 bit product, ALU with 40 bit paths, shifter/saturate, select 16 bit]
• How to implement saturation?
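One common answer to the saturation question is to clamp the shifted result to the largest or smallest representable value instead of letting it wrap around. The sketch below assumes a 16-bit two's complement output and an arithmetic right shift as the output shifter; the function names and the default shift of 15 are illustrative choices, not taken from a specific processor.

```python
def saturate(value, bits=16):
    """Clamp value to the signed range of a bits-wide word."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


def shift_and_saturate(acc, shift=15, bits=16):
    """Model an output shifter followed by saturation logic."""
    return saturate(acc >> shift, bits)
```

In hardware the same decision is usually made by checking whether the bits above the selected 16-bit field all equal the sign bit; the comparison here is the software equivalent.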
Overflow:
• Input shifter: scaling, lining up of the inputs
= loss of precision if shifted too far down.
[Figure: datapath: 16 x 16 mpy with 32 bit product, input shifter, ALU with 40 bit paths, shifter/saturate, select 16 bit]
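The precision loss from shifting inputs down can be seen in a short integer experiment. Both function names are hypothetical; the sketch compares scaling each input before accumulation with scaling the sum afterwards, and ignores the overflow risk that motivates input scaling in the first place.

```python
def sum_scaled_before(values, shift):
    """Shift every input down, then accumulate (bits are lost per input)."""
    return sum(v >> shift for v in values)


def sum_scaled_after(values, shift):
    """Accumulate at full precision, then shift the result once."""
    return sum(values) >> shift
```

With inputs 3, 5, 7 and a shift of 2, scaling first yields 2 while scaling last yields 3: the low-order bits discarded per input never reach the accumulator.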
Block normalization
• Often used in speech coders because dynamic range of the
input signals is unknown.
• Scale the whole array of values such that the maximum entry
sits in the range [0.5, 1)
• minimum loss of precision
TI C54x:
EXP A     <- counts number of sign bits, stores this number in TREG
NORM A    <- shifts the accumulator by the number of bits in TREG
Lode:
Repeat N;
A3 = expmn (*r0), r0++;    (stores # of sign bits in special register ASR)
Repeat N;
*r0 = *r0 < ASR, r0++;
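Block normalization can be sketched in two passes, mirroring the two repeat loops above: first find the smallest number of redundant sign bits in the array, then shift every entry left by that amount. This is a minimal model assuming 16-bit two's complement samples; sign_bits and block_normalize are hypothetical names and do not reproduce the exact semantics of EXP, NORM, or expmn.

```python
def sign_bits(value, width=16):
    """Count redundant sign bits of a width-bit two's complement value."""
    if value < 0:
        value = ~value  # fold negative values onto the non-negative case
    return width - 1 - value.bit_length()


def block_normalize(block, width=16):
    """Shift the whole block left so its largest entry uses the full range."""
    shift = min(sign_bits(v, width) for v in block)
    return [v << shift for v in block], shift
```

The shift count must be kept alongside the block so that later stages can undo the scaling; that shared count acts as the block exponent.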
Pipelining:
Time ->
Fetch           Decode          Memory Access   Execute
                Fetch           Decode          Memory Access   Execute
                                Fetch           Decode          Memory Access   Execute
Fetch = fetch instruction
Decode = decode instruction
Memory access = address generation and read operands
Execute = perform operation
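The staggered diagram can be generated programmatically, which also makes the cycle count explicit. The sketch assumes an ideal pipeline with no stalls, one instruction issued per cycle; the name pipeline_schedule is hypothetical.

```python
STAGES = ["Fetch", "Decode", "Memory Access", "Execute"]


def pipeline_schedule(n_instructions, stages=STAGES):
    """Return one row per cycle listing the stage each instruction occupies.

    None means the instruction has not started yet or has already finished.
    """
    total_cycles = n_instructions + len(stages) - 1
    table = []
    for cycle in range(total_cycles):
        row = []
        for instr in range(n_instructions):
            k = cycle - instr  # instruction `instr` enters the pipe at cycle `instr`
            row.append(stages[k] if 0 <= k < len(stages) else None)
        table.append(row)
    return table
```

Three instructions in this four-stage pipeline finish in 3 + 4 - 1 = 6 cycles, and once the pipeline is full one instruction completes per cycle.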
Pipelining
How does the pipeline appear to the programmer?
Lee’s paper (Part II) discusses 3 variations (the difference is often blurry):
• interlocking
• time stationary coding
• data stationary coding
Interlocking: the instructions appear as if executed one after another
Interlocking on C10
LT      Fetch           Decode          Memory Access   Execute
MPY                     Fetch           Decode          Memory Access   Execute
LTD                                     Fetch           Decode          Memory Access   Execute
MPY                                                     Fetch           Decode          Memory Access   Execute
Reservation table:
PMEM    LT      MPY     LTD     MPY     LTD     MPY
DMEM    data    coef1   data    coef2   ...
MPY
ALU
Interlocking on C2x
The programmer does not know the pipeline.
If an access conflict occurs, the hardware will “stall” and finish one (part of an) instruction before finishing a second part.
RPTK 49
MACD
Reservation table:
PMEM    RPTK    MACD    coef1   coef2   coef3   ...
DMEM                    data1   data2   ...
MPY
ALU
Single Cycle MAC
TMS320C2x Multiplier/ALU
[Figure: TMS320C2x multiplier/ALU datapath. Labels: Program Bus (16) and Data Bus (16); T Register (16); Multiplier (16x16); P Register (32); Left Shifter (0-16) stages; MUXes; Arithmetic Logic Unit (ALU) (32); carry bit C and Accumulator Register (32); Left Shifter (0-7)]
• Single cycle 16x16 bit multiply yielding a 32-bit product
• Supports simultaneous program and two data operand acquisition
• Supports simultaneous ALU and multiplier operations
• 0-16 bit left post-shifter
Courtesy: Texas Instruments
Example: MACD
MACD = Multiply by Program Memory and Accumulate with Delay
(Instruction is still present in C54x and C55x)
MACD Smem, pmad, src
Smem = data memory
pmad = program address
src = accumulator (A or B)
Executes (simplified):
(Smem) x (Pmem(at location pmad)) + src -> src    ; multiply-accumulate
(Smem) -> Treg                                    ; load data in Treg register
(Smem) -> Smem + 1                                ; load data in next mem loc.
(pmad) + 1 -> pmad                                ; increment program address pointer
When executed with a repeat instruction, MACD takes one cycle.
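The simplified MACD semantics can be modelled to show how a repeated MACD both computes one FIR output and ages the delay line. This sketch makes several assumptions that are not in the slide: the data pointer starts at the oldest sample and decrements, the coefficients sit in program memory in reverse order c(N-1) ... c(0), the delay line has N+1 locations, and the function names are hypothetical.

```python
def macd_fir_step(dmem, base, pmem, pmad, n_taps):
    """Model n_taps repeated MACD operations; returns the accumulator."""
    acc = 0
    ar = base + n_taps - 1             # data pointer at oldest sample x(n-(N-1))
    for _ in range(n_taps):            # stands in for the repeat instruction
        acc += dmem[ar] * pmem[pmad]   # multiply-accumulate
        dmem[ar + 1] = dmem[ar]        # (Smem) -> Smem + 1: age the sample
        pmad += 1                      # next coefficient in program memory
        ar -= 1                        # next (newer) data sample
    return acc


def macd_fir(coefs, samples):
    """Run the modelled MACD loop once per input sample."""
    n_taps = len(coefs)
    dmem = [0] * (n_taps + 1)          # delay line plus one spill location
    pmem = list(reversed(coefs))       # c(N-1) first, c(0) last
    out = []
    for s in samples:
        dmem[0] = s                    # newest sample x(n) at the base address
        out.append(macd_fir_step(dmem, 0, pmem, 0, n_taps))
    return out
```

Unlike a circular buffer, this approach physically moves every sample by one location per output; the data move is folded into the MAC instruction, so it costs no extra cycles.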
Time stationary
An instruction specifies “one instruction cycle”, so it specifies everything that occurs in parallel.
Fetch           Decode          Memory Access   Execute
                Fetch           Decode          Memory Access   Execute
                                Fetch           Decode          Memory Access   Execute
                                                Fetch           Decode          Memory Access   Execute
Example:
Motorola:
MAC X0, Y0, A    X:(R0)+, X0    Y:(R4)-, Y0
(multiply-accumulate of values read from memory in the previous cycle)
Lucent 16x:
a0 = a0 + p, p = x * y, y = *r0++, x = *pt++
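The time-stationary style can be mimicked with a simultaneous assignment: every right-hand side uses the register values from before the cycle, exactly as the Lucent-style line combines an accumulate, a multiply, and two loads that belong to different samples. The function below is a hypothetical sketch, with zeros fed in during the two flush cycles.

```python
def time_stationary_dot(data, coef):
    """Dot product where each loop pass models one instruction cycle."""
    a0 = p = x = y = 0
    n = len(data)
    for cycle in range(n + 2):           # two extra cycles drain the pipeline
        new_y = data[cycle] if cycle < n else 0
        new_x = coef[cycle] if cycle < n else 0
        # All four updates happen in parallel on the old register values:
        # a0 = a0 + p, p = x * y, y = *r0++, x = *pt++
        a0, p, y, x = a0 + p, x * y, new_y, new_x
    return a0
```

The programmer has to account for the two-cycle latency between loading a pair of operands and seeing their product in the accumulator, which is why the loop runs n + 2 times.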
Data stationary
Time stationary: working on different samples in one instruction
Data stationary: describes what happens with one input data from
start to end.
Example (Lode):
*r3++ = a0 += a2 * *r2++;
(read from memory with pointer reg r2,
Multiply with a2, add to a0 and store back in a0,
Store the result in memory with pointer r3,
Post modify r2 and r3)
Fetch -> Decode -> Read -> Execute -> Write
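In the data-stationary style, one instruction describes everything that happens to one operand, from read to write. The sketch below follows the parenthesised description of the Lode example step by step; the dictionary of registers and the function name are illustrative assumptions, and the actual pipelining over several cycles is not modelled.

```python
def data_stationary_mac(mem, regs):
    """Model one instruction of the form *r3++ = a0 += a2 * *r2++."""
    operand = mem[regs["r2"]]            # Read: fetch through pointer r2
    regs["a0"] += regs["a2"] * operand   # Execute: multiply by a2, add to a0
    mem[regs["r3"]] = regs["a0"]         # Write: store result through r3
    regs["r2"] += 1                      # post-modify both pointers
    regs["r3"] += 1
```

Calling the function twice on consecutive operands behaves like two back-to-back instructions; in hardware their Read, Execute, and Write phases would overlap in different pipeline stages.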
Control & Pipeline for DSPs
RISC: load/store machine
memory access with load/store instructions (DLX, MIPS, D10V)
Fetch -> Decode -> Execute -> Memory Access -> Write Back
(Execute: execution / address generation; Memory Access: memory access / branch)
Excellent for complex decision making!
DSP: register-memory architecture (TI, Lucent, HX, Lode)
Fetch -> Decode -> Memory Access -> Execute -> Write Back
(Memory Access: memory access; Execute: execution)
Excellent for number crunching!
Pipeline RISC compared to DSP
RISC: example
r0 = *p0;        // load data
a0 = a0 + r0;    // execute
Fetch           Decode          Execute         Memory Access
                Fetch           Decode          Execute         Memory Access
                                Fetch           Decode          Execute         Memory Access
Too expensive for DSP
DSP: memory intensive applications:
Fetch           Decode          Memory Access   Execute
                Fetch           Decode          Memory Access   Execute
                                Fetch           Decode          Memory Access   Execute
                                                Fetch           Decode          Memory Access   Execute
Penalty: data dependent branch is expensive