LPC Speech Coder on the TI C6x DSP

Mark Anderson, Jeff Burke
EE213A / EE298-2
Prof. Ingrid Verbauwhede

Summary
- Implementation platform: Texas Instruments TMS320C6000
- Low-quantity cost: US $35 ('C6211)
- Architecture clock frequency: 150 MHz ('C6211)
- Throughput: 75-80 channels @ 8000 samples/sec

Summary
- Total energy per sample: 1.8 uJ/sample
- 'Area'
  - 1.2% of cycle budget per chan. per frame
  - 8.5% of unified memory per channel
  - 25% of unified memory for algorithm

Summary
- Flexibility of implementation: High; programmable processor with C compiler, GUI debugger & simulator
- SegSNR_A: ?
- SegSNR_Q: 26 dB (voiced segments)

Architecture overview
- 256-bit VLIW
- Two "clustered" data paths
- Four functional units in each data path
  - 16x16 multiply
  - Two ALUs
  - Data addressing unit
- 32-bit instruction for each functional unit (256-bit "instruction" for 8 func. units)

Data path diagram
  [Figure: data path diagram]

Architecture overview
- Split register file
  - Only two cross-paths exist
  - Each cluster is limited to one source read from the opposite register file per cycle
- Data types
  - 8, 16, 32-bit with 40-bit accumulate
  - 40-bit = register pair

Memory architecture
- 'C6211 (US$35) has a cache!
- 4kB L1 Instruction cache (L1P)
- 4kB L1 Data cache (L1D)
- 64kB L2 Unified memory and/or cache
- Extra DMA channels

Memory architecture
  [Figure: memory architecture diagram]

Design Tools
- Command-line
  - Compiler, debugger, simulator
- Code Composer Studio
  - Same tools
  - Windows NT GUI
  - 30-day "evaluation" license
  - Draconian copy protection, pulls out the rug from under you

Design Flow
- Consolidate Matlab reference into a single function
- Matlab rewritten C-style
- Verified C-style Matlab
- C prototype created
- Imported into Code Composer, optimized & simulated

Fixed-point quantization
- Input samples
  - 16-bit, normalized to [-1,1)
  - <1.15> format used
- Coefficient quantization
  - Hamming window, pre-emphasis, FIR
  - <1.15> format used
  - No noticeable change in characteristics

Fixed-point quantization
- Most values 16 bit
  - Take advantage of 16x16 fast multipliers
  - Remain close to other class implementations
- Add metric for overpowered LPC engine
  - Use # of channels as performance metric

Fixed-point quantization
- Energy stored in <5.27>
  - Prevent overflow, provide precision for low energy segments
- Temporary values stored in <10.30>
  - Take advantage of extended precision
- Modified autocorrelation used <16.0>
  - All whole numbers

Fixed-Point SNR
- Matlab simulation of magnitude truncation
- SegSNR_A = ?
  - Tools again.
- SegSNR_Q = 26 dB
  - Voiced segments only
  - Sent_female test data

Performance results
- Initial version: 80,000 CPU cycles/frame
- Optimization
  - Take advantage of VLIW, pipelining
  - Use TI's DSP Library
  - Observe assembly, modify C loops
  - Assembly advantage without assembly
- Optimized version: 30,182 cycles/frame
  - Had to stop early, still at least 5K cycles wasted

Performance
- Then, the tool license expired.
- The tool would not install on other machines.
- TI responded, but wasn't too helpful.
- Moral #1: Avoid the evaluation version.
- Moral #2: Give tools away to sell hardware

Cycle count details

  Routine                                  %      Cycles/frame
  Windowing, pre-emphasis                  4.3    1285
  Energy calc                              0.8    254
  Autocorrelation in Levinson-Durbin       8.0    2421
  Autocorrelation in pitch detection       51     15334
  Algorithm total                          95     28561
  Total w/ housekeeping                           30182

Additional optimizations
- Use more DSPLIB routines
  - Autocorrelation
- Assembly-level optimization
- Code size reduction?
- Reduce number of buffers to reduce L1D usage per frame

Energy per sample
- 'C6211 consumes 1.24W
  - 75% high activity / 25% low activity
- 1.24W / 80 channels = 15.5mW/channel
- 15.5 mJ/sec/channel * 1/8000 = 1.8 uJ / sample

Number of channels
- 150 x 10^6 cycles/sec x 0.02 sec/frame = 3.0 x 10^6 cycles/frame
- 3.0 x 10^6 cycles/frame / 30,182 cycles = 99 channels

Memory
- 'C6211 cache complicates estimates
- Performance is 85-99% of optimal for typical applications
  - 30,182 cycles becomes 35,508 cycles/frame for 85% efficiency => now support only 86 channels

Memory
- Try to account for off-chip memory transfers
  - ~220,000 cycles for 150ns fetches for 80 channels => support 75-80 channels
- Unable to verify/simulate because of unexpected tool expiration

Memory
- L2 usage
  - ~16kB code size thanks to VLIW
    - 512 32-byte instruction clusters
    - More suited for 'C6201 & larger processors
  - Remaining used by data for channels
    - 480 bytes each (8.5% of remaining memory)
- L1 usage
  - L1P: Can't tell because of cache
  - L1D: 2.2kB (~56%)

Tool comments
- Powerful, easy to use IDE... When it worked.
  - Licensing problems for eval version
- Debugging support a bit odd
  - puts/printf

C6x Conclusions
- Easily support 75-80 channels of coding
- 26 dB fixed-point SNR, 16-bit types
- VLIW = Large code size
- Cache on a low-end DSP!
- Good tools, but draconian copy protection
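
The channel-count and energy-per-sample estimates on the "Number of channels", "Memory" and "Energy per sample" slides can be reproduced with a few lines of C. This is only a sketch of that arithmetic, not code from the project: the function names and the integer-percent efficiency parameter are hypothetical, chosen for illustration. Evaluated as written, the 85% efficiency case gives 84 channels and 1.24 W over 80 channels gives about 1.94 uJ/sample, slightly different from the 86 channels and 1.8 uJ quoted on the slides (1.8 uJ corresponds to roughly 86 channels).

```c
#include <assert.h>

/* Cycles available in one frame: clock rate (Hz) times frame length (ms). */
long frame_budget_cycles(long clock_hz, long frame_ms)
{
    return (clock_hz / 1000) * frame_ms;
}

/* Per-channel cycle cost after derating for cache efficiency (in percent),
 * rounded to the nearest cycle. 30182 cycles at 85% -> 35508 cycles. */
long effective_cycles(long cycles_per_frame, int efficiency_pct)
{
    assert(cycles_per_frame > 0 && efficiency_pct > 0 && efficiency_pct <= 100);
    return (cycles_per_frame * 100 + efficiency_pct / 2) / efficiency_pct;
}

/* Whole channels that fit in the budget once a fixed overhead
 * (e.g. off-chip memory fetches) has been subtracted. */
long channels_supported(long budget, long overhead, long cycles_per_channel)
{
    assert(cycles_per_channel > 0);
    if (budget <= overhead)
        return 0;
    return (budget - overhead) / cycles_per_channel;
}

/* Energy per sample in microjoules: chip power split evenly across
 * channels, divided by the sample rate of each channel. */
double energy_per_sample_uj(double power_w, long channels, double sample_rate_hz)
{
    assert(channels > 0 && sample_rate_hz > 0.0);
    return power_w / (double)channels / sample_rate_hz * 1e6;
}
```

With the slide values (150 MHz clock, 20 ms frames, 30,182 cycles/frame), `channels_supported(frame_budget_cycles(150000000L, 20), 0, 30182)` returns 99; adding the 85% derating and the ~220,000-cycle fetch overhead brings it to 78, inside the 75-80 range the slides quote.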

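The <1.15> format from the fixed-point quantization slides can be illustrated with a short C sketch. It assumes <1.15> means one sign bit plus 15 fractional bits (commonly written Q15) and that products are rescaled by truncation toward zero, in the spirit of the "magnitude truncation" mentioned on the Fixed-Point SNR slide. The function names are hypothetical, and the 64-bit accumulator is only a portable stand-in for the 'C6x 40-bit register pair; the <5.27> and <10.30> scalings used in the project are not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

#define Q15_SCALE 32768  /* 2^15: one unit in <1.15> is 1/32768 */

/* Convert a real value to <1.15>, saturating anything outside [-1, 1). */
int16_t q15_from_double(double x)
{
    double scaled = x * Q15_SCALE;
    if (scaled >= 32767.0)
        return INT16_MAX;
    if (scaled <= -32768.0)
        return INT16_MIN;
    return (int16_t)scaled;  /* the cast truncates toward zero */
}

/* 16x16 -> 32-bit multiply, rescaled back to <1.15>.
 * Integer division truncates toward zero (magnitude truncation). */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;  /* 30 fractional bits */
    int32_t result = product / Q15_SCALE;
    if (result > INT16_MAX)  /* only (-1) * (-1) can overflow */
        result = INT16_MAX;
    return (int16_t)result;
}

/* Frame energy as a sum of squares, held in a wide accumulator so that
 * long frames cannot overflow. The result has 30 fractional bits. */
int64_t q15_energy(const int16_t *x, int n)
{
    assert(x != 0 && n >= 0);
    int64_t acc = 0;
    for (int i = 0; i < n; ++i)
        acc += (int32_t)x[i] * (int32_t)x[i];
    return acc;
}
```

For example, `q15_from_double(0.5)` gives 16384, and `q15_mul(16384, 16384)` gives 8192, which is 0.25 in <1.15>. Keeping the multiply at 16x16 is what lets the compiler map it onto the fast multipliers the slides mention.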