ECE5527 - Final Pres..

advertisement
ASR Front End Processing
Implemented on Texas Instruments OMAP-L137
Jacob Zurasky – 12/12/11
Project Goals
 Create a front-end for embedded ASR
 Extract feature vectors from speech data
 Allow for many different specifications
 Extract features real-time, while allowing enough CPU time
for analysis
Hardware Platform
 Texas Instruments OMAP – L137 DSP, dual core
 TMS320C6747
 ARM9
 AIC3106 Audio Codec
 64MB SDRAM
Signal Flow Block Diagram
Audio
Framing
PreEmphasis
Window
FFT
Mel Filter
Log
DCT
Deltas
13 - MFCCs
13 - Deltas
13 - Delta Deltas
Data Streams
 Streams are a way to transfer blocks of data efficiently
 Uses enhanced direct memory access (EDMA)
 Block of data can be accessed by SIO_reclaim(…)
 Block of data can be sent by SIO_issue(…)
Input Stream
DSP
Audio Codec
Output Stream
Stream Example
 After SIO_reclaim, pIn points to input data and pOut points
to output data
 After SIO_issue, those buffers are reused by the audio codec
Pre-Emphasis
 y[n] = x[n] – ax[n-1]
 First order high-pass filter
 Used to compensate for the higher frequency roll-off in
human speech production
Windowing Function
 Rectangular, Hann, Hamming, Cosine, Gaussian…
Hamming Window
FFT
 Magnitude of Frequency Spectrum
 Texas Instrument’s DSPLIB for C67x
Mel Filter
 Triangular Bandpass Filters along Mel Frequency Scale
 Mimics the logarithmic nature of human hearing
Discrete Cosine Transform (DCT)
 Transforms back from frequency domain
 Typically first 12 values are used as the Mel Frequency
Cepstral Coefficients
 Look-up table for efficiency
Deltas
 Produce 13 MFCC’s per frame
 13 more from the first derivative
 13 additional from the second derivative
 39 dimensional vector to represent the current frame
Observations
 Pre-Emphasis and Windowing an input frame
Input Frame
Pre-Emphasis and Windowed Frame
Observations
 FFT and Log, Mel Filter
Magnitude of Frequency Spectrum
Log, Mel Filtered Spectrum
Observations
 Discrete Cosine Transform to produce MFCC’s
Mel Frequency Cepstral Coefficients
Full Feature Vector for 1 frame
Observations
 Frame Size = 256 samples @ 16 kHz Fs
 1 Frame = 16 mS
 Feature Extraction Time
 Debug – 1.55 mS
 Release – 0.25 mS
 Real Time Feature Extraction
 0.25 mS / 16 mS = 1.56% usuage
Future Goals
 Complete training code for DSP
 Load training data to SDRAM
 DSP calculates all feature vectors associated with a given phone
 Calculates Gaussian mixture model
 Save acoustic model off-chip
 Evaluate the acoustic model (digital recognition)
 Complete embedded ASR on limited vocabulary
Download