LPC Speech Coder on the TI C6x DSP

Mark Anderson, Jeff Burke
EE213A / EE298-2
Prof. Ingrid Verbauwhede
Summary

Implementation platform: Texas Instruments TMS320C6000
Low-quantity cost: US $35 (‘C6211)
Architecture clock frequency: 150 MHz (‘C6211)
Throughput: 75-80 channels @ 8000 samples/sec
Summary

Total energy per sample: 1.8 uJ/sample
‘Area’:
1.2% of cycle budget per channel per frame
8.5% of unified memory per channel
25% of unified memory for algorithm
Summary

Flexibility of implementation: High; programmable processor with C compiler, GUI debugger & simulator
SegSNR_A: ?
SegSNR_Q: 26 dB (voiced segments)
Architecture overview

256-bit VLIW
Two “clustered” data paths
Four functional units in each data path:
16x16 multiplier
Two ALUs
Data addressing unit
32-bit instruction for each functional unit
(256-bit “instruction” for 8 functional units)
Data path diagram
Architecture overview

Split register file
Only two cross-paths exist
Each cluster is limited to one source read from the opposite register file per cycle
Data types
8, 16, 32-bit with 40-bit accumulate
40-bit = register pair
Memory architecture





‘C6211 (US$35) has a cache!
4kB L1 Instruction cache (L1P)
4kB L1 Data cache (L1D)
64kB L2 Unified memory and/or cache
Extra DMA channels
Design Tools

Command-line
Compiler, debugger, simulator
Code Composer Studio
Same tools
Windows NT GUI
30-day “evaluation” license
Draconian copy protection, pulls the rug out from under you
Design Flow





Consolidate Matlab reference into a single function
Matlab rewritten C-style
Verified C-style Matlab
C prototype created
Imported into Code Composer, optimized & simulated
Fixed-point quantization

Input samples



16-bit, normalized to [-1,1)
<1.15> format used
Coefficient quantization



Hamming window, pre-emphasis, FIR
<1.15> format used
No noticeable change in characteristics
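
The <1.15> format above can be illustrated with a short C sketch. It shows conversion to <1.15>, a 16x16 multiply with rounding, and a first-order pre-emphasis filter built on that multiply. This is not the authors' code: the function names are hypothetical, the pre-emphasis coefficient is whatever the caller passes in (the slides do not state one), and the right shift of a negative product is assumed to be arithmetic, as it is on common compilers.

```c
#include <assert.h>
#include <stdint.h>

/* <1.15> (Q15): 1 sign bit, 15 fractional bits; represents [-1, 1). */
typedef int16_t q15_t;

/* Convert a double to Q15, rounding to nearest and saturating at the ends. */
q15_t q15_from_double(double x)
{
    double scaled = x * 32768.0;
    if (scaled >= 32767.0) return INT16_MAX;
    if (scaled <= -32768.0) return INT16_MIN;
    return (q15_t)(scaled >= 0.0 ? scaled + 0.5 : scaled - 0.5);
}

/* Q15 x Q15 -> Q15: the 16x16 product is Q30; round and shift back. */
q15_t q15_mul(q15_t a, q15_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;   /* Q30 product */
    p = (p + (1 << 14)) >> 15;             /* round to nearest, back to Q15 */
    if (p > INT16_MAX) p = INT16_MAX;      /* only (-1) * (-1) can overflow */
    return (q15_t)p;
}

/* Pre-emphasis y[n] = x[n] - alpha * x[n-1], all values in Q15.
   alpha is an assumed parameter, not a value taken from the slides. */
void q15_preemphasis(const q15_t *x, q15_t *y, int n, q15_t alpha)
{
    assert(n > 0);
    q15_t prev = 0;
    for (int i = 0; i < n; i++) {
        int32_t v = (int32_t)x[i] - q15_mul(alpha, prev);
        if (v > INT16_MAX) v = INT16_MAX;  /* saturate instead of wrapping */
        if (v < INT16_MIN) v = INT16_MIN;
        prev = x[i];
        y[i] = (q15_t)v;
    }
}
```

Saturating rather than wrapping keeps an occasional full-scale sample from flipping sign, which matters more for speech quality than the small clipping error it introduces.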
Fixed-point quantization

Most values 16 bit
Take advantage of fast 16x16 multipliers
Remain close to other class implementations
Add metric for overpowered LPC engine
Use # of channels as performance metric
Fixed-point quantization

Energy stored in <5.27>
Prevent overflow, provide precision for low energy segments
Temporary values stored in <10.30>
Take advantage of extended precision
Modified autocorrelation used <16.0>
All whole numbers
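
A minimal sketch of how these formats might fit together for a frame energy calculation: <1.15> samples are squared into <2.30>-style products, summed in a wide accumulator with 30 fractional bits, then narrowed to <5.27>. Several things here are assumptions rather than facts from the slides: `int64_t` stands in for the 40-bit register pair, the function names are hypothetical, and saturating on overflow is one possible policy (the slides do not say how the authors scaled or limited the energy).

```c
#include <assert.h>
#include <stdint.h>

/* Sum of x[n]^2 for <1.15> samples, kept with 30 fractional bits.
   int64_t stands in for a 40-bit <10.30> accumulator, whose range is
   [-512, 512), so up to 511 full-scale squares fit without overflow. */
int64_t energy_acc_q30(const int16_t *x, int n)
{
    assert(n > 0 && n <= 511);
    int64_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int32_t)x[i] * (int32_t)x[i];   /* Q15 * Q15 = Q30 */
    return acc;
}

/* Narrow a Q30 accumulator to <5.27> (32 bits, range [-16, 16)),
   saturating if the value does not fit (assumed policy). */
int32_t q30_to_q5_27(int64_t acc)
{
    int64_t v = acc >> 3;                       /* drop 3 fractional bits */
    if (v > INT32_MAX) v = INT32_MAX;
    if (v < INT32_MIN) v = INT32_MIN;
    return (int32_t)v;
}
```

Keeping 27 fractional bits in the stored energy preserves resolution for quiet frames, while the extra integer bits in the accumulator absorb the growth from summing a whole frame.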
Fixed-Point SNR

Matlab simulation of magnitude truncation
Tools again.
SegSNR_A = ?
SegSNR_Q = 26 dB
Voiced segments only
Sent_female test data
Performance results


Initial version: 80,000 CPU cycles/frame
Optimization
Take advantage of VLIW, pipelining: observe assembly, modify C loops
Use TI’s DSP Library: assembly advantage without assembly
Optimized version: 30,182 cycles/frame
Had to stop early, still at least 5K cycles wasted
Performance





Then, the tool license expired.
The tool would not install on other machines.
TI responded, but wasn’t too helpful.
Moral #1: Avoid the evaluation version.
Moral #2: Give tools away to sell hardware.
Cycle count details
Routine                                 %       Cycles/frame
Windowing, pre-emphasis                 4.3     1285
Energy calc                             0.8     254
Autocorrelation in Levinson-Durbin      8.0     2421
Autocorrelation in pitch detection      51      15334
Algorithm total                         95      28561
Total w/ housekeeping                           30182
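
Autocorrelation accounts for well over half of the cycles listed above, so it is the loop most worth shaping for the compiler. The sketch below is a plain C formulation of that kind of loop, not the authors' code and not the DSPLIB routine. The function name is hypothetical and `int64_t` stands in for a 40-bit accumulator.

```c
#include <assert.h>
#include <stdint.h>

/* r[k] = sum over i = k..n-1 of x[i] * x[i-k], for k = 0..max_lag.
   16-bit inputs feed a 16x16 multiply; sums go into a wide accumulator. */
void autocorr_q15(const int16_t *x, int n, int max_lag, int64_t *r)
{
    assert(n > 0 && max_lag >= 0 && max_lag < n);
    for (int k = 0; k <= max_lag; k++) {
        int64_t acc = 0;
        /* Inner loop is a pure multiply-accumulate with no branches,
           the shape a VLIW compiler can software-pipeline. */
        for (int i = k; i < n; i++)
            acc += (int32_t)x[i] * (int32_t)x[i - k];
        r[k] = acc;
    }
}
```

For example, the input {1, 2, 3, 4} with `max_lag` 2 yields r = {30, 20, 11}.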
Additional optimizations

Use more DSPLIB routines
Autocorrelation
Assembly-level optimization
Code size reduction?
Reduce number of buffers to reduce L1D usage per frame
Energy per sample

‘C6211 consumes 1.24 W
75% high activity / 25% low activity
1.24 W / 80 channels = 15.5 mW/channel
15.5 mJ/sec/channel * 1/8000 = 1.8 uJ / sample
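
The same arithmetic as a small C helper, offered as a sketch with a hypothetical function name. Evaluated with exactly 1.24 W, 80 channels and 8000 samples/sec it returns about 1.94 uJ. The quoted 1.8 uJ corresponds to a channel count of about 86, so the slide figure was presumably taken at that count. That is an inference from the numbers, not something the slides state.

```c
#include <assert.h>

/* Energy per sample in microjoules: chip power shared equally across
   channels, divided by the per-channel sample rate. */
double energy_per_sample_uj(double power_w, int channels, double sample_rate_hz)
{
    assert(channels > 0 && sample_rate_hz > 0.0);
    double watts_per_channel = power_w / channels;
    return watts_per_channel / sample_rate_hz * 1e6;
}
```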
Number of channels
150 x 10^6 cycles/sec x 0.02 sec/frame = 3.0 x 10^6 cycles/frame
3.0 x 10^6 cycles/frame / 30,182 cycles = 99 channels
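
The channel estimate can be reproduced with integer arithmetic. This is a sketch with hypothetical function names. The inputs are the 150 MHz clock, the 20 ms frame and the 30,182 cycles/frame figure from the slides.

```c
#include <assert.h>

/* Cycles available in one frame period. */
long cycles_per_frame_budget(long clock_hz, long frame_ms)
{
    assert(clock_hz > 0 && frame_ms > 0);
    return clock_hz / 1000 * frame_ms;
}

/* Whole channels that fit when each needs cycles_per_channel per frame. */
long max_channels(long budget_cycles, long cycles_per_channel)
{
    assert(cycles_per_channel > 0);
    return budget_cycles / cycles_per_channel;
}
```

`cycles_per_frame_budget(150000000L, 20)` gives 3,000,000 cycles, and `max_channels(3000000L, 30182L)` gives 99. The integer division rounds down because a partially serviced channel does not count.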
Memory


‘C6211 cache complicates estimates
Performance is 85-99% of optimal for typical applications
30,182 cycles becomes 35,508 cycles/frame for 85% efficiency
=> now support only 86 channels
Memory

Try to account for off-chip memory transfers
~220,000 cycles for 150 ns fetches for 80 channels
=> support 75-80 channels
Unable to verify/simulate because of unexpected tool expiration
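
A rough sketch of how the cache derating and the off-chip transfer cost combine into a channel estimate. The function names are hypothetical. Treating the ~220,000 cycles as a fixed per-frame overhead is a simplification, since the slides quote that figure for 80 channels specifically.

```c
#include <assert.h>

/* Cycles per frame after derating for cache efficiency (0 < eff <= 1),
   rounded to the nearest cycle. */
long derated_cycles(long ideal_cycles, double efficiency)
{
    assert(efficiency > 0.0 && efficiency <= 1.0);
    return (long)(ideal_cycles / efficiency + 0.5);
}

/* Channels that fit once a fixed per-frame overhead (e.g. off-chip
   transfers) is subtracted from the cycle budget. */
long channels_with_overhead(long budget, long overhead, long per_channel)
{
    assert(per_channel > 0 && overhead <= budget);
    return (budget - overhead) / per_channel;
}
```

With the slide figures, `derated_cycles(30182, 0.85)` gives 35,508 cycles/frame, and `channels_with_overhead(3000000L, 220000L, 35508L)` gives 78, which falls inside the quoted 75-80 channel range.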
Memory

L2 usage
~16kB code size thanks to VLIW
512 32-byte instruction clusters
More suited for ‘C6201 & larger processors
Remaining used by data for channels
480 bytes each (8.5% of remaining memory)
L1 usage
L1P: Can’t tell because of cache
L1D: 2.2kB (~56%)
Tool comments


Powerful, easy to use IDE… when it worked.
Licensing problems for eval version
Debugging support a bit odd
puts/printf
C6x Conclusions





Easily support 75-80 channels of coding
26 dB fixed-point SNR, 16-bit types
VLIW = Large code size
Cache on a low-end DSP!
Good tools, but draconian copy protection