DSP Processors – Lecture 13
Ingrid Verbauwhede
Department of Electrical Engineering
University of California Los Angeles
ingrid@ee.ucla.edu
EE213A, Spring 2000, Ingrid Verbauwhede, UCLA, Lecture 13

References
• The origins:
  • E. A. Lee, "Programmable DSP Processors," Part I, IEEE ASSP Magazine, October 1988, pp. 4-19.
  • E. A. Lee, "Programmable DSP Processors," Part II, IEEE ASSP Magazine, January 1989, pp. 4-14.
• Good overview:
  • P. Lapsley, J. Bier, A. Shoham, E. A. Lee, "DSP Processor Fundamentals: Architectures and Features," IEEE Press, 1998.
• More references:
  • P. Faraboschi, G. Desoli, J. Fisher, "The Latest Word in Digital and Media Processing," IEEE Signal Processing Magazine, March 1998, pp. 59-85 (download from the INSPEC webpage).
  • I. Verbauwhede, M. Touriguian, "Wireless Digital Signal Processors," Chapter 11 in Digital Signal Processing for Multimedia Systems, eds. K. Parhi and T. Nishitani, Marcel Dekker, Inc.
  • C. Nicol, I. Verbauwhede, "DSP Architectures for Next Generation Wireless Communications," ISSCC 2000 tutorial.

Recall: Memory architecture
FIR execution on:
• Von Neumann: 3 cycles/tap
• Basic Harvard: 2 cycles/tap
• Modified Harvard & repeat loop: 1 cycle/tap & only 3 instructions
Key issues:
• Memory bandwidth: multiple memory banks or multi-port memories
• Every memory has its OWN address generation unit operating in parallel
• Special instructions that combine operations with memory moves: MACD
• Indirect addressing: *r1++ or *r2--
• Circular buffers: extra hardware in the address generation units
FASTER THAN 1 CYCLE PER TAP??

Compute-intensive function 1: FIR (cont.)
y(n) = Σ_{i=0}^{N-1} c(i) x(n-i)
[Filter diagram: tapped delay line x(n), x(n-1), ..., x(n-(N-1)) through z^-1 elements; multipliers c(0), ..., c(N-1); adder chain producing y(n); 50 taps]
y(0) = c(0)x(0) + c(1)x(-1) + c(2)x(-2) + . . .
y(1) = c(0)x(1) + c(1)x(0) + c(2)x(-1) + . . .
y(2) = c(0)x(2) + c(1)x(1) + c(2)x(0) + . . .
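As an illustration (in plain Python rather than DSP assembly), a minimal sketch of the direct-form FIR computation; the function name `fir` and the zero-padding of x(k) for k < 0 are choices made here for the example, not part of any processor's API:

```python
def fir(coeffs, samples):
    """Direct-form FIR: y(n) = sum_{i=0}^{N-1} c(i) * x(n-i),
    assuming x(k) = 0 for k < 0."""
    N = len(coeffs)
    out = []
    for n in range(len(samples)):
        acc = 0                      # accumulator, like a MAC unit's ACC
        for i in range(N):           # N multiply-accumulates per output
            x = samples[n - i] if n - i >= 0 else 0
            acc += coeffs[i] * x     # one MAC = 2 operand reads + multiply-add
        out.append(acc)              # 1 write per output
    return out
```

Counting the operations in the inner loop gives exactly the slide's figures: one output costs N MACs, 2N reads and 1 write.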
y(0) = c(0)x(0) + c(1)x(-1) + c(2)x(-2) + . . . + c(N-1)x(1-N);
y(1) = c(0)x(1) + c(1)x(0) + c(2)x(-1) + . . . + c(N-1)x(2-N);
y(2) = c(0)x(2) + c(1)x(1) + c(2)x(0) + . . . + c(N-1)x(3-N);
. . .
y(n) = c(0)x(n) + c(1)x(n-1) + c(2)x(n-2) + . . . + c(N-1)x(n-(N-1));
One output = 2N reads, N MACs, 1 write
Classic Harvard: one output = N cycles

FIR speed-up
FIR filtering: two outputs in parallel
y(0) = c(0)x(0) + c(1)x(-1) + c(2)x(-2) + . . . + c(N-1)x(1-N);
y(1) = c(0)x(1) + c(1)x(0) + c(2)x(-1) + . . . + c(N-1)x(2-N);
y(2) = c(0)x(2) + c(1)x(1) + c(2)x(0) + . . . + c(N-1)x(3-N);
. . .
y(n) = c(0)x(n) + c(1)x(n-1) + c(2)x(n-2) + . . . + c(N-1)x(n-(N-1));
Two outputs = 4N reads, 2N MACs, 2 writes
Dual-MAC architecture with ONLY 2 data busses??
• Read two 32-bit numbers instead of four 16-bit numbers: solution of the Lucent 16000 core with dual MAC
• Run the MAC at double frequency, read two 32-bit numbers: solution of Matsushita
• Insert a delay register: solution of Atmel's LODE

Example 3: Lucent DSP16210
[Datapath diagram: XDB(32) and IDB(32) busses feed X(32) and Y(32) registers; two 16×16 multipliers; product registers p0(32) and p1(32); shift/saturate units; ALU/ADD and BMU; accumulator file of 8 × 40 bits]
Inner loop of a 32-tap FIR filter:
do 14 {   // one instruction!
  a0=a0+p0+p1 p0=xh*yh p1=xl*yl y=*r0++ x=*pt0++
}
• Outer loop: 19 cycles, 38 bytes
• 1 cycle in the inner loop
• 5 execution units used in the inner loop
• 2 MACs per cycle
• Horizontal parallelism, one sample at a time
• Used in 2G mobile wireless base-stations
Courtesy: Gareth Hughes, Bell Labs Australia

FIR on Lode
FIR filter: two outputs in parallel with a delay register
y(0) = c(0)x(0) + c(1)x(-1) + c(2)x(-2) + . . . + c(N-1)x(1-N);
y(1) = c(0)x(1) + c(1)x(0) + c(2)x(-1) + . . . + c(N-1)x(2-N);
y(2) = c(0)x(2) + c(1)x(1) + c(2)x(0) + . . . + c(N-1)x(3-N);
. . .
y(n) = c(0)x(n) + c(1)x(n-1) + c(2)x(n-2) + . . . + c(N-1)x(n-(N-1));
Total energy for one output sample:
Energy | Single MAC | Dual MAC | Dual MAC with REG
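The dual-MAC idea of computing y(n) and y(n+1) in the same pass can be modeled in a few lines of Python (names are illustrative; for simplicity the sketch assumes an even number of samples). Note that each coefficient is fetched once and shared by both MAC units:

```python
def fir_dual_mac(coeffs, samples):
    """Two FIR outputs per coefficient pass, mimicking a dual-MAC datapath.
    Assumes len(samples) is even."""
    N = len(coeffs)
    x = lambda k: samples[k] if 0 <= k < len(samples) else 0  # zero outside
    out = []
    for n in range(0, len(samples), 2):  # two outputs per iteration
        a0 = a1 = 0
        for i in range(N):
            c = coeffs[i]           # one coefficient read, shared
            a0 += c * x(n - i)      # MAC0 accumulates y(n)
            a1 += c * x(n + 1 - i)  # MAC1 accumulates y(n+1)
        out += [a0, a1]
    return out
```

The data operands x(n-i) and x(n+1-i) overlap by all but one sample, which is what the delay-register variant exploits to halve the memory reads.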
No. of MAC operations | N | N | N
No. of memory reads | 2N | 2N | N
No. of instruction cycles | N | N/2 | N/2

FIR on Lode
Two MAC units with a dedicated bus network
[Datapath diagram: busses DB0(16) and DB1(16); delay register LREG between the data inputs x(n-i+1) and x(n-i); two multiply-add units MAC0 and MAC1; accumulators A0 and A1]
• DB0 fetches coefficient c(i)
• DB1 fetches data
• LREG delays the input data
• A0 stores the y(n) output
• A1 stores the y(n+1) output
The same structure can be used for IIR.

Arithmetic
DSP processors come in two flavors:
• floating point
  • most popular one: SHARCs from Analog Devices
• fixed point
  • usually 16 bit, sometimes 24 bit (audio processors)
  • newer processors may have wider datapaths or registers (TI C6x: 16×16 multiply, 32-bit registers, 40-bit ALU)
[Basic datapath: 16×16 multiplier and 32-bit ALU feeding a 40-bit path, 40-bit shifter, select 16 bits]

Overflow
• Saturation logic combined with the output shifter
[Datapath: 16×16 multiplier, 32-bit ALU, 40-bit shifter/saturate, select 16 bits]
• How to implement saturation?

Overflow
• Input shifter: scaling, lining up the inputs = loss of precision if shifted down too much
[Datapath: input shifter before the 16×16 multiplier and 32-bit ALU, 40-bit shifter/saturate, select 16 bits]

Block normalization
• Often used in speech coders because the dynamic range of the input signals is unknown.
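One common answer to "how to implement saturation" is to clamp the two's-complement result to the representable range instead of letting it wrap. A minimal Python sketch (function names are made up for the example) contrasts saturating and wrapping 16-bit behavior:

```python
INT16_MIN, INT16_MAX = -2**15, 2**15 - 1

def saturate16(v):
    """Clamp to the signed 16-bit range, as saturation logic would."""
    if v > INT16_MAX:
        return INT16_MAX
    if v < INT16_MIN:
        return INT16_MIN
    return v

def wrap16(v):
    """What plain two's-complement hardware does without saturation."""
    return (v + 2**15) % 2**16 - 2**15
```

Saturation turns a large positive overflow into the maximum representable value, which is far less damaging to a filter output than the sign flip produced by wrapping.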
• Scale the whole array of values such that the maximum entry sits in the range [0.5, 1)
• Minimum loss of precision
TI C54x:
  EXP A    <- counts the number of sign bits, stores this number in TREG
  NORM A   <- shifts the accumulator by the number of bits in TREG
Lode:
  Repeat N; A3 = expmn(*r0), r0++;    // stores the # of sign bits in the special register ASR
  Repeat N; *r0 = *r0 << ASR, r0++;

Pipelining
[Pipeline diagram: successive instructions overlap the Fetch, Decode, Memory Access and Execute stages over time]
Fetch = fetch instruction
Decode = decode instruction
Memory access = address generation and read operands
Execute = perform operation

Pipelining
How does the pipeline appear to the programmer?
Lee's paper (Part II) discusses 3 variations (the difference is often blurry):
• interlocking
• time-stationary coding
• data-stationary coding
Interlocking: the instructions appear as if executed one after another.

Interlocking on C10
[Pipeline diagram: the instructions LT, MPY, LTD, MPY overlap the Fetch, Decode, Memory Access and Execute stages]
Reservation table:
  PMEM: LT | MPY | LTD | MPY | LTD | MPY | ...
  DMEM: data | coef1 | data | coef2 | ...
  (MPY and ALU rows show the overlapped execute stages)

Interlocking on C2x
The programmer does not know the pipeline.
If an access conflict occurs, the hardware will "stall" and finish one (part) of an instruction before finishing a second part.
  RPTK 49
  MACD
Reservation table:
  PMEM: RPTK | MACD | coef1 | coef2 | coef3 | ...
  DMEM: data1 | data2 | ...
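The EXP/NORM block-normalization idea can be sketched in Python, under the assumption that the values are `width`-bit two's-complement integers (the helper names `exp_bits` and `block_normalize` are invented for this example, not instruction names):

```python
def exp_bits(v, width=16):
    """Count redundant sign bits (the EXP idea): how far v can be
    shifted left while still fitting in `width` signed bits."""
    mag = v if v >= 0 else ~v          # fold negatives onto non-negatives
    return width - 1 - mag.bit_length()

def block_normalize(block, width=16):
    """Shift every element left by the minimum exponent over the block
    (the NORM idea applied blockwise), so the largest magnitude lands
    in [0.5, 1) of full scale. Returns (scaled block, shift)."""
    shift = min(exp_bits(v, width) for v in block)
    return [v << shift for v in block], shift
```

Using one common shift for the whole array keeps the relative scaling of the samples intact, which is why the precision loss is minimal.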
  (MPY and ALU rows show the overlapped execute stages)

Single-cycle MAC: TMS320C2x Multiplier/ALU
[Datapath diagram: 16-bit Program Bus and Data Bus; T register (16); 16×16 multiplier yielding a 32-bit product; P register (32); left shifters (0-16); MUXes; 32-bit Arithmetic Logic Unit (ALU) with carry; Accumulator register (32); left shifter (0-7)]
• Single-cycle 16×16-bit multiply yielding a 32-bit product
• Supports simultaneous program and two data operand acquisition
• Supports simultaneous ALU and multiplier operations
• 0-16 bit left post-shifter
Courtesy: Texas Instruments

Example: MACD
MACD = Multiply by Program Memory and Accumulate with Delay
(the instruction is still present in the C54x and C55x)
MACD Smem, pmad, src
  Smem = data memory
  pmad = program address
  src  = accumulator (A or B)
Executes (simplified):
  (Smem) x (Pmem at location pmad) + src -> src   ; multiply-accumulate
  (Smem) -> Treg                                  ; load data into the Treg register
  (Smem) -> Smem + 1                              ; copy data into the next memory location
  (pmad) + 1 -> pmad                              ; increment the program address pointer
When executed with a repeat instruction, MACD takes one cycle.

Time-stationary
The instruction specifies "one instruction cycle": it specifies all that occurs in parallel.
[Pipeline diagram: one instruction's fields control the Fetch, Decode, Memory Access and Execute stages active in the same cycle]
Example (Motorola):
  MAC X0, Y0, A   X:(R0)+, X0   Y:(R4)-, Y0
(multiply-accumulate of values read from memory in the previous cycle)
Example (Lucent 16x):
  a0 = a0 + p, p = x * y, y = *r0++, x = *pt++

Data-stationary
Time-stationary: working on different samples in one instruction.
Data-stationary: describes what happens with one input data item from start to end.
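The simplified MACD semantics can be modeled step by step. This Python sketch is an illustration only (the memory-as-list representation and the function name are assumptions, not the TI programming model); it performs the multiply-accumulate, the Treg load, the delay-line move and the program-pointer increment of one MACD:

```python
def macd(dmem, pmem, addr, pmad, acc, treg):
    """One simplified MACD step. Mutates dmem for the delay move and
    returns the updated (acc, treg, pmad)."""
    smem = dmem[addr]
    acc += smem * pmem[pmad]   # (Smem) x (Pmem(pmad)) + src -> src
    treg = smem                # (Smem) -> Treg
    dmem[addr + 1] = smem      # (Smem) -> Smem + 1 : the delay-line move
    pmad += 1                  # advance the program address pointer
    return acc, treg, pmad
```

The delay move is what lets a repeated MACD both compute one FIR tap and shift the sample buffer in the same cycle.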
Example (Lode):
  *r3++ = a0 += a2 * *r2++;
(read from memory with pointer register r2, multiply by a2, add to a0 and store back in a0, store the result in memory with pointer r3, post-modify r2 and r3)
[Pipeline: Fetch - Decode - Read - Execute - Write]

Control & pipeline for DSPs
RISC: load/store machine, memory access with load/store instructions (DLX, MIPS, D10V)
  Fetch - Decode - Execute/address generation - Memory access/branch - Write back
  Excellent for complex decision making!
DSP: register-memory architecture (TI, Lucent, HX, Lode)
  Fetch - Decode - Memory access - Execute - Write back
  Excellent for number crunching!

Pipeline: RISC compared to DSP
RISC example:
  r0 = *p0;       // load data
  a0 = a0 + r0;   // execute
[Pipeline diagram: each RISC instruction passes through Fetch, Decode, Execute and Memory Access; the loaded value is available only after the load's memory-access stage]
Too expensive for DSP.
DSP: memory-intensive applications:
[Pipeline diagram: each DSP instruction passes through Fetch, Decode, Memory Access and Execute, so operand fetch and execute happen in one instruction]
Penalty: a data-dependent branch is expensive.
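A small Python model of a single data-stationary Lode instruction of the form *r3++ = a0 += a2 * *r2++ (the memory-as-list representation and the function name are assumptions made for illustration):

```python
def lode_mac_step(mem, r2, r3, a0, a2):
    """Model of one data-stationary instruction:
        *r3++ = a0 += a2 * *r2++
    i.e. read, multiply-accumulate, write back, post-increment pointers."""
    a0 += a2 * mem[r2]   # read operand via r2, MAC into accumulator a0
    r2 += 1              # post-modify r2
    mem[r3] = a0         # store the running result via r3
    r3 += 1              # post-modify r3
    return r2, r3, a0
```

Everything the function does happens to one data item, from its read to the write of its result, which is exactly the data-stationary view of the pipeline.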