A System Solution for HighPerformance, Low Power SDR Yuan Lin1, Hyunseok Lee1, Yoav Harel1, Mark Woh1, Scott Mahlke1, Trevor Mudge1 and Krisztian Flautner2 1Advanced Computer Architecture Laboratory University of Michigan 2ARM, Ltd. SDR Design Challenges: Hardware design challenges High computational throughput (~40 Gops) Low power consumption (~200mW) Meet real-time requirements DSP programming support System-level development Inter-algorithm communication Algorithm-level development Efficient DSP representations 2 Advanced Computer Architecture Laboratory University of Michigan SDR Benchmark Design & Analysis W-CDMA Protocol: 2Mbps Transmitter modulator LPF-Tx scrambler spreader Interleaver Channel encoder U p p e r la y e rs F ro n te n d demodulator Receiver searcher c o m b in e r LPF-Rx descrambler despreader .. . descrambler deinteleaver Channel decoder (turbo/viterbi) despreader 4 Advanced Computer Architecture Laboratory University of Michigan W-CDMA Characteristics Plenty of vector parallelism 8 & 16-bit DSP algorithms Multiplication is not dominant No floating-point operation Small instruction/data memory Has periodic real-time tasks 5 Advanced Computer Architecture Laboratory University of Michigan 802.11a Protocol: 24Mbps Transmitter modulator LPF-Tx Preamble Insertion IFFT QAM Mapping Interleaver Channel encoder U p p e r la y e rs F ro n te n d demodulator FFT Frequency Equalization Receiver QAM demapping deinteleaver LPF-Rx Frequency, Time Synchronization Channel Estimation Channel decoder (viterbi) 6 Advanced Computer Architecture Laboratory University of Michigan 802.11a Characteristics Similar to W-CDMA Plenty of vector parallelism No floating-point operation Small instruction/data memory Different from W-CDMA Mostly 16-bit DSP algorithms Multiplication is more dominant No periodic real-time tasks 7 Advanced Computer Architecture Laboratory University of Michigan SDR Processor Architecture Design System Architecture Design Tradeoffs PE alu alu alu alu alu alu alu alu Uniprocessor alu alu alu alu alu alu alu alu alu alu alu alu PE PE alu alu alu alu alu alu alu alu alu alu alu alu Inter-processor Network alu alu alu alu PE PE alu alu alu alu alu alu alu alu alu alu alu alu Coarse grain processors Amortized fetch Inter-processor Communication Fine grain processors Intra-processor Communication PE PE PE PE PE PE PE PE alu alu alu alu alu alu alu alu Inter-processor Network PE PE PE PE PE PE PE PE alu alu alu alu alu alu alu alu 9 Advanced Computer Architecture Laboratory University of Michigan System Architecture Design Tradeoffs Number of Processing Elements x SIMD width For W-CDMA 2Mbps (51.2GOP/sec) 90nm 1V @400MHz 300 4x32 250 2x64 1x128 1x256 (50% Utilization) 8x16 MOPS/mW 200 150 100 Coarse grain processors Fine grain processors PE PE PE PE alu PEaluPEaluPEaluPE alu alu alu PE alu PE alu alu alu alu Uniprocessor alu alu alu alu Inter-processor PE Inter-processor Network 16x8 32x4 alu alu alu alu alu alu alu alu Network PE PE alu alu alu alu alu alu alualu alualu alu PE PE alu PEalu PEalu PE PE PEaluPE 64x2 alu alu alu alu alu alu alu alu alu 50 128x1 Intra-processor shuffle network critical delay path 0 ALU dominated 0 critical delay path 50 100 150 200 250 critical delay path 300 SIMD width (arithmetic units) 10 Advanced Computer Architecture Laboratory University of Michigan System Architecture Design Controller ALU MEM MEM Memory Global Memory Global Memory SoC Interface SoC Interface SoC Interface Interconnect SoC Interface SoC Interface PE SoC Interface PE PE Scalar Memory SIMD Memory Scalar Memory SIMD Memory Scalar Memory SIMD Memory Scalar RF SIMD RF Scalar RF SIMD RF Scalar RF SIMD RF Scalar ALU SIMD ALU Scalar ALU SIMD ALU Scalar ALU SIMD ALU Scalar Pipeline SIMD Pipeline Scalar Pipeline SIMD Pipeline Scalar Pipeline SIMD Pipeline 12 Advanced Computer Architecture Laboratory University of Michigan PE Design (Area < 1mm2 Power<50mW) PE SIMD Unit SIMD Memory 4 KB 32x8bit 8bit 16x8bit RegFile 8bit 8bit 8bit ALU 8bit 16x8bit RegFile 8bit 8bit 8bit ALU 8bit 16x8bit RegFile 8bit 8bit 8bit ALU Data Shuffle Network 8bit 16x8bit RegFile 8bit 8bit 8bit ALU 16bit 16x16bit RegFile 16bit 16bit 16bit ALU DMA Scalar Memory 4KB 16bit Scalar Unit 13 Advanced Computer Architecture Laboratory University of Michigan Mapping DSP Algorithms: Filters z-1 In b Out spread Vin, Sin shift z, z, up mac z, Vin, Sin z-1 In b3 b2 Z-1 b1 Z-1 b0 Z-1 Out 14 Advanced Computer Architecture Laboratory University of Michigan Mapping DSP Algorithm: Filter PE SIMD Unit SIMD Memory 4 KB spread Vin, Sin shift z, z, up mac z, Vin, Sin 32x8bit 8bit 16x8bit RegFile 8bit 8bit 8bit ALU 8bit 16x8bit RegFile 8bit 8bit 8bit ALU 8bit 16x8bit RegFile 8bit 8bit 8bit ALU Data Shuffle Network 8bit 16x8bit RegFile 8bit 8bit 8bit ALU 16bit 16x16bit RegFile 16bit 16bit 16bit ALU DMA Scalar Memory 4KB 16bit Scalar Unit 15 Advanced Computer Architecture Laboratory University of Michigan Efficient Design Wide SIMD width Small register file with minimum ports Small memories Narrow system BUS Data-path optimized for 8bits Vector shuffle reduce memory ports 16 Advanced Computer Architecture Laboratory University of Michigan Transmitter modulator LPF-Tx LPF-Tx scrambler scrambler spreader spreader Channel Channel encoder encoder Interleaver Interleaver U p p e r la y e rs F ro n te n d demodulator Receiver searcher searcher descrambler descrambler PE PN Code Code PN TX/RX TX/RX 23Mops Mops 23 Misc. Control Control Misc. Mops 11Mops Buffer Buffer (10 Bytes) (10 Bytes) Power Power Control Control 15Kops Kops 15 Deinterleaver Deinterleaver 16 Mops 16 despreader .. . descrambler descrambler ARM c o m b in e r LPF-Rx LPF-Rx despreader PE Buffer Buffer (1280 Bytes) (1280 Bytes) LPF-Rx 22LPF-Rx 182Mops Mops 182 Channel Channel decoder decoder (turbo/viterbi) (turbo/viterbi) deinteleaver deinteleaver PE Buffer Buffer (2560 Bytes) (2560 Bytes) Searcher Searcher 200Mops Mops 200 PE Buffer Buffer (1024 Bytes) (1024 Bytes) Turbo Encoder Turbo Encoder 2 Mops 2 Mops Interleaver Interleaver 2 Mops 2 Mops Spreader Spreader 4.7 Mops 4.7 Mops Scrambler Scrambler 8.6 Mops 8.6 Mops 4 LPF-Rx 4 LPF-Rx 307 Mops 307 Mops Buffer Buffer (1024 Bytes) (1024 Bytes) Turbo Decoder Decoder Turbo 324 Mops Mops 324 Buffer Buffer (1360 Bytes) (1360 Bytes) Descrambler Descrambler 22.5Mops Mops 22.5 Despreader Despreader 11.3Mops Mops 11.3 Global Memory FIFOQueue Queue FIFO (12.5KBytes) KBytes) (12.5 Buffer Buffer (20KBytes) KBytes) (20 Buffer Buffer (20KBytes) KBytes) (20 Combiner Combiner Mops 33Mops WCDMA Receiver WCDMA Transmitter 18 Advanced Computer Architecture Laboratory University of Michigan Transmitter modulator LPF-Tx LPF-Tx scrambler scrambler spreader spreader Channel Channel encoder encoder Interleaver Interleaver U p p e r la y e rs F ro n te n d demodulator Receiver searcher searcher c o m b i n e r LPF-Rx LPF-Rx descrambler descrambler despreader .. . descrambler descrambler Channel Channel decoder decoder (turbo/viterbi) (turbo/viterbi) deinteleaver deinteleaver despreader MAC Frontend A/D ARM PE PN Code Code PN TX/RX TX/RX 23Mops Mops 23 Misc. Control Control Misc. Mops 11Mops Buffer Buffer (10 Bytes) (10 Bytes) Power Power Control Control 15Kops Kops 15 Deinterleaver Deinterleaver 16 Mops 16 PE Buffer Buffer (1280 Bytes) (1280 Bytes) LPF-Rx 22LPF-Rx 182Mops Mops 182 PE Buffer Buffer (2560 Bytes) (2560 Bytes) Searcher Searcher 200Mops Mops 200 PE Buffer Buffer (1024 Bytes) (1024 Bytes) Turbo Encoder Turbo Encoder 2 Mops 2 Mops Interleaver Interleaver 2 Mops 2 Mops Spreader Spreader 4.7 Mops 4.7 Mops Scrambler Scrambler 8.6 Mops 8.6 Mops 4 LPF-Rx 4 LPF-Rx 307 Mops 307 Mops Buffer Buffer (1024 Bytes) (1024 Bytes) Turbo Decoder Decoder Turbo 324 Mops Mops 324 Buffer Buffer (1360 Bytes) (1360 Bytes) Descrambler Descrambler 22.5Mops Mops 22.5 MAC Despreader Despreader 11.3Mops Mops 11.3 Global Memory FIFOQueue Queue FIFO (12.5KBytes) KBytes) (12.5 Buffer Buffer (20KBytes) KBytes) (20 Buffer Buffer (20KBytes) KBytes) (20 Combiner Combiner Mops 33Mops WCDMA Receiver WCDMA Transmitter Frontend D/A Advanced Computer Architecture Laboratory University of Michigan 19 802.11a PE Mapping 802.11a Receiver ARM PE Deinterleaver 60 Mops PE Buffer (2048 Bytes) FIR (Rx) 320 Mops Buffer (1024 Bytes) PE Buffer (2048 Bytes) Interplator 250 Mops FFT 120 Mops PE Buffer (2048 Bytes) Freq. Eq. 120 Mops Buffer (2048 Bytes) Viterbi Dec. 398 Mops QAM Demod 2 Mops Misc. Control 80 Mops Interleaver 60 Mops FIR (Tx) 320 Mops Buffer (1024 Bytes) QAM Mod 2 Mops IFFT 120 Mops Buffer (30 KBytes) Buffer (30 KBytes) Sync. 20 Mops Buffer (2048 Bytes) Global Memory Buffer (22048 Bytes) Channel Enc. 20 Mops Buffer (30 KBytes) Buffer (30 KBytes) Preamble Ins. 50 Mops 802.11a Transmitter 20 Advanced Computer Architecture Laboratory University of Michigan Power Results 400 350 250 WCDMA 200 802.11a 150 100 50 To ta l th er s O ce ss or In te r-p ro em Bu s or y M G lo ba lM O PE PE AR th er s U AL PE R M em or y eg ist er Fi le 0 PE Power (mW) 300 Configuration 4 PEs, 1 ARM (Cortex M3) controller Global scratchpad memory (64Kb) 90nm (1V @ 400 MHZ), Advanced Computer Architecture Laboratory of Michigan SynthesizedUniversity conservatively 21 Area Results 3.5 2.5 2 1.5 1 0.5 ta l To er s O th lM em er -p or ro y ce ss or Bu s ba AR M In t PE G lo M em or Re y gi st er Fi le PE AL U PE O th er s 0 PE mm^2 (90nm) 3 22 Advanced Computer Architecture Laboratory University of Michigan SDR Programming Language Support Software Development Flow Programmer Defined Matlab-Simulink/C Floating Point Algorithm Prototyping Algorithm-level Design Flow System-Level Design Flow Timing Requirements C/SPEX Fixed Point Algorithm Kernel Implementations IO/ Control Code C/SPEX System Description Timing Requirements Timing/Machine Independent asm code Memory Access Patterns System Description asm code Compiler Generated Controller asm code PE asm code c 24 Advanced Computer Architecture Laboratory University of Michigan SPEX (Signal Processing EXtension) Implemented as a library extension to C System-level development Support concurrent DSP kernel function definitions Channel variables for inter-kernel communications Algorithm-level development Native vector & matrix variables Explicit DSP variable attribute definition Native vector & matrix operations 25 Advanced Computer Architecture Laboratory University of Michigan SPEX Overview SPEX DSP kernel extension DSP variable extension DSP variable attributes Mode Bitwidth Type Scalar object SIMD object Concurrent variable object DSP variable operations DSP variable objects Arith. & logic op. Predication Permutation Kernel object Channel object Concurrent variable operations Concurrent kernel mangement Channel communication operations 26 Advanced Computer Architecture Laboratory University of Michigan SPEX Example Code: Viterbi ACS Concurrent DSP kernel definitions void* acs(void*) { /* variable declaration */ saturated char<64> metrics1, metrics2; saturated char<64> states; saturated char<64> t1, t2; while (!viterbi.stop()) { /* receiving data from BMC */ metrics1 = bmc_to_acs.receive(); metrics2 = bmc_to_acs.receive(); /* add */ metrics1 += states; metrics2 += states; /* compare and select */ t1 = (metrics1(0,2,62),metrics2(0,2,62)); t2 = (metrics1(1,2,63),metrics2(1,2,63)); states(t1<t2) = t1; states(t1>=t2) = t2; /* sending data to TB */ acs_to_tb.send(states); } } Advanced Computer Architecture Laboratory University of Michigan 27 SPEX Example Code: Viterbi ACS void* acs(void*) { /* variable declaration */ saturated char<64> metrics1, metrics2; saturated char<64> states; saturated char<64> t1, t2; Native SIMD variable definition with explicit attributes while (!viterbi.stop()) { /* receiving data from BMC */ metrics1 = bmc_to_acs.receive(); metrics2 = bmc_to_acs.receive(); SPEX variable supports 1. saturated/overflow 2. various variable bit-width 3. vector & matrices /* add */ metrics1 += states; metrics2 += states; /* compare and select */ t1 = (metrics1(0,2,62),metrics2(0,2,62)); t2 = (metrics1(1,2,63),metrics2(1,2,63)); states(t1<t2) = t1; states(t1>=t2) = t2; /* sending data to TB */ acs_to_tb.send(states); } } Advanced Computer Architecture Laboratory University of Michigan 28 SPEX Example Code: Viterbi ACS void* acs(void*) { /* variable declaration */ saturated char<64> metrics1, metrics2; saturated char<64> states; saturated char<64> t1, t2; Inter-kernel communication through channel operations while (!viterbi.stop()) { /* receiving data from BMC */ metrics1 = bmc_to_acs.receive(); metrics2 = bmc_to_acs.receive(); /* add */ metrics1 += states; metrics2 += states; Channel types: 1. FIFO queue 2. Broadcast queue 3. Sync/control channel 4. Random-read FIFO queue /* compare and select */ t1 = (metrics1(0,2,62),metrics2(0,2,62)); t2 = (metrics1(1,2,63),metrics2(1,2,63)); states(t1<t2) = t1; states(t1>=t2) = t2; /* sending data to TB */ acs_to_tb.send(states); } } Advanced Computer Architecture Laboratory University of Michigan 29 SPEX Example Code: Viterbi ACS void* acs(void*) { /* variable declaration */ saturated char<64> metrics1, metrics2; saturated char<64> states; saturated char<64> t1, t2; while (!viterbi.stop()) { /* receiving data from BMC */ metrics1 = bmc_to_acs.receive(); metrics2 = bmc_to_acs.receive(); SPEX vector operations Supports (Matlab-like C code) /* add */ metrics1 += states; metrics2 += states; 1. SIMD arithmetic operations 2. SIMD permutation 3. SIMD predication /* compare and select */ t1 = (metrics1(0,2,62),metrics2(0,2,62)); t2 = (metrics1(1,2,63),metrics2(1,2,63)); states(t1<t2) = t1; states(t1>=t2) = t2; /* sending data to TB */ acs_to_tb.send(states); } } Advanced Computer Architecture Laboratory University of Michigan 30 Summary - Hardware & software solutions for SDR - Hardware - - 4 dual-issue asymmetric SIMD processing elements Consumes 200~300mW for 90nm Meets the performance requirements for WCDMA & 802.11a Software - SPEX provides efficient DSP algorithm and system implementation 31 Advanced Computer Architecture Laboratory University of Michigan Questions? 32 Advanced Computer Architecture Laboratory University of Michigan