LPC Speech Coder on the TI C6x DSP
Mark Anderson, Jeff Burke
EE213A / EE298-2
Prof. Ingrid Verbauwhede

Summary
  - Implementation platform: Texas Instruments TMS320C6000
  - Low-quantity cost: US $35 ('C6211)
  - Architecture clock frequency: 150 MHz ('C6211)
  - Throughput: 75-80 channels @ 8000 samples/sec

Summary
  - Total energy per sample: 1.8 uJ/sample
  - 'Area':
      - 1.2% of cycle budget per channel per frame
      - 8.5% of unified memory per channel
      - 25% of unified memory for algorithm

Summary
  - Flexibility of implementation: High; programmable processor with C compiler, GUI debugger & simulator
  - SegSNR_A: ?
  - SegSNR_Q: 26 dB (voiced segments)

Architecture overview
  - 256-bit VLIW
  - Two "clustered" data paths
  - Four functional units in each data path
      - 16x16 multiply
      - Two ALUs
      - Data addressing unit
  - 32-bit instruction for each functional unit (256-bit "instruction" for 8 functional units)

[Figure: data path diagram]

Architecture overview
  - Split register file
  - Only two cross-paths exist
  - Each cluster is limited to one source read from the opposite register file per cycle
  - Data types: 8, 16, 32-bit with 40-bit accumulate
  - 40-bit = register pair

Memory architecture
  - 'C6211 (US$35) has a cache!
  - 4 kB L1 instruction cache (L1P)
  - 4 kB L1 data cache (L1D)
  - 64 kB L2 unified memory and/or cache
  - Extra DMA channels

[Figure: memory architecture]

Design Tools
  - Command-line
      - Compiler, debugger, simulator
  - Code Composer Studio
      - Same tools
      - Windows NT GUI
      - 30-day "evaluation" license
      - Draconian copy protection, pulls out the rug from under you

Design Flow
  - Consolidate Matlab reference into a single function
  - Matlab rewritten C-style
  - Verified C-style Matlab
  - C prototype created
  - Imported into Code Composer, optimized & simulated

Fixed-point quantization
  - Input samples
      - 16-bit, normalized to [-1,1)
      - <1.15> format used
  - Coefficient quantization
      - Hamming window, pre-emphasis, FIR
      - <1.15> format used
      - No noticeable change in characteristics

Fixed-point quantization
  - Most values 16 bit
      - Take advantage of 16x16 fast multipliers
      - Remain close to other class implementations
  - Add metric for overpowered LPC engine
      - Use # of channels as performance metric

Fixed-point quantization
  - Energy stored in <5.27>
  - Temporary values stored in <10.30>
      - Prevent overflow, provide precision for low energy segments
      - Take advantage of extended precision
  - Modified autocorrelation used <16.0>
      - All whole numbers

Fixed-Point SNR
  - Matlab simulation of magnitude truncation
  - Tools again.
  - SegSNR_A = ?
  - SegSNR_Q = 26 dB
      - Voiced segments only
      - Sent_female test data

Performance results
  - Initial version: 80,000 CPU cycles/frame
  - Optimization
      - Take advantage of VLIW, pipelining
      - Use TI's DSP Library
      - Observe assembly, modify C loops
      - Assembly advantage without assembly
  - Optimized version: 30,182 cycles/frame
      - Had to stop early, still at least 5K cycles wasted

Performance
  - Then, the tool license expired.
  - The tool would not install on other machines.
  - TI responded, but wasn't too helpful.
  - Moral #1: Avoid the evaluation version.
  - Moral #2: Give tools away to sell hardware

Cycle count details
  Routine                                 %      Cycles/frame
  Windowing, pre-emphasis                 4.3     1285
  Energy calc                             0.8      254
  Autocorrelation in Levinson-Durbin      8.0     2421
  Autocorrelation in pitch detection      51     15334
  Algorithm total                         95     28561
  Total w/ housekeeping                          30182

Additional optimizations
  - Use more DSPLIB routines
      - Autocorrelation
  - Assembly-level optimization
  - Code size reduction?
  - Reduce number of buffers to reduce L1D usage per frame

Energy per sample
  - 'C6211 consumes 1.24 W
      - 75% high activity / 25% low activity
  - 1.24 W / 80 channels = 15.5 mW/channel
  - 15.5 mJ/sec/channel * 1/8000 sec/sample = 1.8 uJ/sample

Number of channels
  - 150 x 10^6 cycles/sec x 0.02 sec/frame = 3.0 x 10^6 cycles/frame
  - 3.0 x 10^6 cycles/frame / 30,182 cycles = 99 channels

Memory
  - 'C6211 cache complicates estimates
  - Performance is 85-99% of optimal for typical applications
  - 30,182 cycles becomes 35,508 cycles/frame for 85% efficiency
  - => now support only 86 channels

Memory
  - Try to account for off-chip memory transfers
  - ~220,000 cycles for 150 ns fetches for 80 channels
  - => support 75-80 channels
  - Unable to verify/simulate because of unexpected tool expiration

Memory
  - L2 usage
      - ~16 kB code size thanks to VLIW
          - 512 32-byte instruction clusters
          - More suited for 'C6201 & larger processors
      - Remaining used by data for channels
          - 480 bytes each (8.5% of remaining memory)
  - L1 usage
      - L1P: Can't tell because of cache
      - L1D: 2.2 kB (~56%)

Tool comments
  - Powerful, easy to use IDE… When it worked.
  - Licensing problems for eval version
  - Debugging support a bit odd
      - puts/printf

C6x Conclusions
  - Easily support 75-80 channels of coding
  - 26 dB fixed-point SNR, 16-bit types
  - VLIW = Large code size
  - Cache on a low-end DSP!
  - Good tools, but draconian copy protection
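
The channel-count and energy-per-sample estimates on the slides are simple ratios, so they can be re-derived in a few lines of C. The sketch below is only an illustration: the function names `channels_supported` and `energy_uj_per_sample` are hypothetical, and the constants are the figures quoted on the slides (150 MHz, 0.02 sec/frame, 30,182 cycles/frame, 1.24 W, 8000 samples/sec). Evaluated directly, the formulas give 99 channels at ideal cache behaviour, about 84 channels at 85% efficiency, and about 1.9 uJ/sample for 80 channels. These differ slightly from the 86 channels and 1.8 uJ/sample quoted on the slides; 1.8 uJ/sample is what the same formula yields for 86 channels, so the quoted values may reflect a different intermediate figure.

```c
#include <assert.h>

/* Figures quoted on the slides. */
#define CLOCK_HZ         150e6    /* 'C6211 clock frequency          */
#define FRAME_SEC        0.02     /* frame duration                  */
#define CYCLES_PER_FRAME 30182.0  /* optimized coder incl. housekeeping */
#define POWER_W          1.24     /* 'C6211 power consumption        */
#define SAMPLE_RATE_HZ   8000.0   /* samples per second per channel  */

/* Whole channels that fit into one frame period when the cache delivers
 * the given fraction (0 < efficiency <= 1) of ideal performance. */
static int channels_supported(double efficiency)
{
    assert(efficiency > 0.0 && efficiency <= 1.0);
    double budget = CLOCK_HZ * FRAME_SEC;         /* cycles per frame period */
    double cost = CYCLES_PER_FRAME / efficiency;  /* effective cycles per channel */
    return (int)(budget / cost);                  /* truncate to whole channels */
}

/* Energy per sample, in microjoules, when the chip's power is shared
 * equally by `channels` channels. */
static double energy_uj_per_sample(int channels)
{
    assert(channels > 0);
    return POWER_W / channels / SAMPLE_RATE_HZ * 1e6;
}
```

Calling `channels_supported(1.0)` and `channels_supported(0.85)` from a `main` reproduces the two cache scenarios; the off-chip memory transfer penalty (~220,000 cycles) that brings the slides' estimate down to 75-80 channels is not modelled here.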
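
The fixed-point slides describe 16-bit <1.15> samples, 16x16 multiplies, and wider temporaries that prevent overflow. A minimal portable sketch of that idea follows. It is not the authors' code: the helper names `q15_from_double` and `autocorr_q30` are hypothetical, and `int64_t` stands in for the 40-bit register-pair accumulator, which plain C does not expose. The conversion uses truncation toward zero, matching the magnitude truncation mentioned on the Fixed-Point SNR slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* <1.15> (Q15): 1 sign bit, 15 fractional bits, range [-1, 1). */
#define Q15_SCALE 32768.0

/* Convert a real value to Q15 by magnitude truncation (toward zero),
 * saturating at the limits of the 16-bit format. */
static int16_t q15_from_double(double x)
{
    double scaled = x * Q15_SCALE;
    if (scaled >= 32767.0)  return INT16_MAX;
    if (scaled <= -32768.0) return INT16_MIN;
    return (int16_t)scaled;  /* C conversion truncates toward zero */
}

/* Autocorrelation at one lag: sum of x[i] * x[i + lag].
 * Each 16x16 product has 30 fractional bits (Q30) and fits in 32 bits;
 * the running sum is kept in a wider accumulator so it cannot overflow.
 * int64_t is an assumption standing in for the 40-bit accumulator. */
static int64_t autocorr_q30(const int16_t *x, size_t n, size_t lag)
{
    assert(lag < n);
    int64_t acc = 0;
    for (size_t i = 0; i + lag < n; ++i) {
        acc += (int32_t)x[i] * (int32_t)x[i + lag];
    }
    return acc;
}
```

The result is left in Q30 rather than rounded back to 16 bits, so low-energy segments keep their precision until a later normalization step. How the original coder scaled its "modified autocorrelation" into <16.0> is not stated on the slides and is not reproduced here.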