Lecture 7
Spoken Language Processing
Prof. Andrew Rosenberg
Representing Acoustic Information
• 16-bit samples at a 44.1kHz sampling rate
– ~86kB/sec
– ~5MB/min
• Waves repeat, so much of this data is redundant.
• A good representation of speech (for recognition)
– Keeps all of the information to discriminate between phones
– Is compact, i.e. gets rid of everything else
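The data-rate figures above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes a single (mono) channel and binary units (1kB = 1024 bytes); the function name `pcm_data_rate` is only illustrative.

```python
def pcm_data_rate(sample_rate_hz, bits_per_sample, channels=1):
    """Return the uncompressed PCM data rate in bytes per second."""
    return sample_rate_hz * (bits_per_sample // 8) * channels

# Assumption: one channel, as implied by the ~86kB/sec figure.
bytes_per_sec = pcm_data_rate(44_100, 16)    # 88,200 bytes per second
kib_per_sec = bytes_per_sec / 1024           # roughly 86
mib_per_min = bytes_per_sec * 60 / 1024**2   # roughly 5

print(f"{kib_per_sec:.1f} kB/sec, {mib_per_min:.2f} MB/min")
```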
1
Frame Based analysis
• Using a short window of analysis, analyze the wave form every 10ms (or other analysis rate)
• Usually performed with overlapping windows.
• e.g. FFT and Spectrogram
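Frame-based analysis can be sketched as slicing the signal into overlapping windows. The 10ms step comes from the slide; the 25ms window length and the 16kHz sampling rate are assumptions chosen for illustration, and `split_into_frames` is a hypothetical helper name.

```python
def split_into_frames(samples, sample_rate_hz, window_ms=25.0, step_ms=10.0):
    """Cut a signal into overlapping frames.

    window_ms is an assumed window length; step_ms is the analysis rate.
    Trailing samples that do not fill a whole window are dropped.
    """
    window_len = int(round(sample_rate_hz * window_ms / 1000))
    step_len = int(round(sample_rate_hz * step_ms / 1000))
    frames = []
    start = 0
    while start + window_len <= len(samples):
        frames.append(samples[start:start + window_len])
        start += step_len
    return frames

# One second of placeholder "audio" at an assumed 16kHz sampling rate.
signal = list(range(16000))
frames = split_into_frames(signal, 16000)
print(len(frames), len(frames[0]))  # number of frames, samples per frame
```

Because the window (25ms) is longer than the step (10ms), consecutive frames share 15ms of samples.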
2
Overlapping frames
• Spectrograms allow for visual inspection of spectral information.
• We are looking for a compact, numerical representation
[Figure: overlapping analysis windows, with successive frames spaced 10ms apart]
3
Example Spectrogram
4
Standard Representation in the field
• Mel Frequency Cepstral Coefficients
– MFCC
[Diagram: MFCC extraction pipeline]
Pre-emphasis → Window → FFT → Mel filter bank → log → FFT^-1 → 12 MFCC → Deltas → 12 MFCC, 12 ∆ MFCC, 12 ∆∆ MFCC
Window → energy → Deltas → 1 energy, 1 ∆ energy, 1 ∆∆ energy
5
Pre-emphasis
• In the spectrum of voiced segments, there is more energy at lower frequencies than at higher frequencies.
• Boosting the high frequencies makes the high-frequency information more available.
– First-order high-pass filter for pre-emphasis.
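A first-order high-pass filter of this kind is often written as y[n] = x[n] - a·x[n-1]. The sketch below assumes a = 0.97, a commonly used value that the slide itself does not specify.

```python
def pre_emphasize(samples, alpha=0.97):
    """Apply y[n] = x[n] - alpha * x[n-1]; alpha=0.97 is an assumed value."""
    if not samples:
        return []
    out = [samples[0]]  # the first sample has no predecessor
    for n in range(1, len(samples)):
        out.append(samples[n] - alpha * samples[n - 1])
    return out

# A constant (zero-frequency) signal is almost cancelled...
flat = pre_emphasize([1.0, 1.0, 1.0, 1.0])
# ...while a signal alternating at the highest frequency is boosted.
alternating = pre_emphasize([1.0, -1.0, 1.0, -1.0])
print(flat, alternating)
```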
6
Windowing
• Overlapping windows allow analysis centered at a frame point, while using more information.
7
Hamming Windowing
• Discontinuities at the edges of the window can cause problems for the FFT.
• The Hamming window smooths out the edges.
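The Hamming window has the standard form w[n] = 0.54 - 0.46·cos(2πn/(N-1)). A minimal sketch of building it and applying it to one frame follows; the helper names are illustrative.

```python
import math

def hamming_window(length):
    """Return Hamming window coefficients for a frame of the given length."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]

def apply_window(frame, window):
    """Multiply each sample by the matching window coefficient."""
    return [s * w for s, w in zip(frame, window)]

window = hamming_window(5)
tapered = apply_window([1.0] * 5, window)
print(window)  # small at the edges, largest in the middle
```

The edge samples are scaled down to 0.08 of their value rather than cut off abruptly, which is what reduces the discontinuity at the frame boundaries.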
8
Discrete Fourier Transform
• The Fast Fourier Transform (FFT) is an efficient algorithm for calculating the Discrete Fourier Transform (DFT).
[Figure: Australian male /i:/ from “heed”, FFT analysis window 12.8ms]
http://clas.mq.edu.au/acoustics/speech_spectra/fft_lpc_settings.html
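To make the transform concrete, here is a direct implementation of the DFT definition in plain Python. It is deliberately the slow O(N²) form, not an FFT; an FFT produces the same values in O(N log N). The test tone and its length are made up for illustration.

```python
import cmath
import math

def dft(samples):
    """Direct DFT: X[k] = sum over t of x[t] * exp(-2j*pi*k*t/N)."""
    n = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]

# A cosine that completes exactly 4 cycles in 32 samples.
n = 32
tone = [math.cos(2 * math.pi * 4 * t / n) for t in range(n)]
magnitudes = [abs(x) for x in dft(tone)]

# Only the first half of the bins is unique for a real-valued signal.
peak_bin = max(range(n // 2), key=lambda k: magnitudes[k])
print(peak_bin)
```

All of the tone's energy lands in the bin matching its number of cycles per frame, which is the kind of spectral peak a spectrogram makes visible.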
10
Mel Filter Bank and Log
• Human hearing is not equally sensitive at all frequency regions.
• Modeling human hearing sensitivity helps phone recognition.
• MFCC approach: Warp frequencies from Hz to the Mel frequency scale.
• Mel: pairs of sounds that are perceptually equidistant in pitch are separated by an equal number of mels.
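The slide does not give a formula for the warping. One commonly used approximation is mel(f) = 2595·log10(1 + f/700); the sketch below assumes that formula, and other variants exist.

```python
import math

def hz_to_mel(freq_hz):
    """Convert Hz to mels using an assumed, commonly used approximation."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10 ** (mel / 2595.0) - 1.0)

for freq in (500, 1000, 2000, 4000, 8000):
    print(freq, round(hz_to_mel(freq)))
```

Under this formula 1000Hz maps to roughly 1000 mels, and doubling the frequency above that point adds progressively fewer mels than a linear scale would.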
11
Mel frequency Filter bank
• Create a bank of filters, each collecting energy from one frequency band: 10 filters linearly spaced below 1000Hz, and filters logarithmically spaced above 1000Hz.
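The filter spacing described above can be sketched by listing filter center frequencies. The 10 linear filters follow the slide; placing them at 100Hz intervals, using 10 further filters above 1000Hz, and ending at 8000Hz are all assumptions made only for illustration.

```python
def filter_center_frequencies(num_log_filters=10, max_hz=8000.0):
    """Return center frequencies: linear up to 1000Hz, logarithmic above.

    num_log_filters and max_hz are assumed values, not from the lecture.
    """
    # 10 linearly spaced centers: 100, 200, ..., 1000 Hz.
    linear = [100.0 * i for i in range(1, 11)]
    # Above 1000Hz each center is a constant ratio above the previous one.
    ratio = (max_hz / 1000.0) ** (1.0 / num_log_filters)
    log_spaced = [1000.0 * ratio ** i for i in range(1, num_log_filters + 1)]
    return linear + log_spaced

centers = filter_center_frequencies()
print([round(c) for c in centers])
```

In a full filter bank each center would typically anchor a triangular filter that overlaps its neighbors; only the spacing is shown here.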
12
Cepstrum
• Separation of source and filter.
• Source differences are speaker dependent.
• Filter differences are phone dependent.
• Cepstrum is the “Spectrum of the Log of the Spectrum”: the inverse DFT of the log magnitude of the DFT of the signal.
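The definition above translates directly into code: DFT, log magnitude, inverse DFT. This sketch uses a slow direct DFT from the standard library for clarity; the small `floor` value that guards against log(0) is an implementation assumption, and the MFCC pipeline applies the Mel filter bank before the log, which is omitted here.

```python
import cmath
import math

def dft(values, inverse=False):
    """Direct O(N^2) DFT, or inverse DFT when inverse=True."""
    n = len(values)
    sign = 1 if inverse else -1
    out = [sum(values[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
               for t in range(n))
           for k in range(n)]
    return [x / n for x in out] if inverse else out

def real_cepstrum(samples, floor=1e-12):
    """Inverse DFT of the log magnitude of the DFT of the signal."""
    spectrum = dft(samples)
    # floor avoids log(0) for bins with no energy (assumed safeguard).
    log_magnitude = [math.log(max(abs(x), floor)) for x in spectrum]
    return [c.real for c in dft(log_magnitude, inverse=True)]

impulse = real_cepstrum([1.0, 0.0, 0.0, 0.0])
decaying = real_cepstrum([1.0, 0.5, 0.25, 0.125, 0.0, 0.0, 0.0, 0.0])
print(decaying)
```

An impulse has a flat spectrum, so its log magnitude and therefore its cepstrum are zero everywhere.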
13
Cepstrum Visualization
• Peak at 120 samples represents the glottal pulse, corresponding to the F0
• Large values closer to zero correspond to vocal tract filter (tongue position, jaw opening, etc.)
• It is common to take the first 12 coefficients.
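The position of the glottal-pulse peak gives the pitch period in samples, so F0 is the sampling rate divided by that lag. The slide does not state the sampling rate of the plotted example; 16kHz is assumed below purely for illustration.

```python
def f0_from_cepstral_peak(peak_lag_samples, sample_rate_hz):
    """Convert the lag of the cepstral pitch peak to F0 in Hz."""
    return sample_rate_hz / peak_lag_samples

# Assumption: 16kHz sampling rate (not given on the slide).
f0 = f0_from_cepstral_peak(120, 16000)
print(round(f0, 1))  # in Hz
```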
14
Deltas and Energy
• Energy within a frame is just the sum of the power of the samples.
• The spectrum of some phones changes over time, for example from stop closure to stop burst, or in the slope of a formant.
• Taking the delta (velocity) and double delta (acceleration) incorporates this information.
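Frame energy and deltas can be sketched as follows. The delta here is the simple two-frame difference d[t] = (c[t+1] - c[t-1]) / 2, with the first and last frames repeated at the edges; that edge handling and the toy frames are assumptions, and wider regression windows are also used in practice.

```python
def frame_energy(frame):
    """Sum of the squared sample values in one frame."""
    return sum(s * s for s in frame)

def deltas(values):
    """d[t] = (c[t+1] - c[t-1]) / 2, repeating the edge values."""
    n = len(values)
    out = []
    for t in range(n):
        prev = values[max(t - 1, 0)]
        nxt = values[min(t + 1, n - 1)]
        out.append((nxt - prev) / 2.0)
    return out

# Four tiny made-up frames, only to show the arithmetic.
toy_frames = [[1, 1], [2, 2], [3, 3], [3, 3]]
energies = [frame_energy(f) for f in toy_frames]
velocity = deltas(energies)       # delta energy
acceleration = deltas(velocity)   # delta-delta energy
print(energies, velocity, acceleration)
```

The same `deltas` function would be applied to each of the 12 cepstral coefficient tracks to obtain their delta and double-delta features.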
15
Summary: MFCC
• Commonly, MFCC feature vectors have 39 features:
– 12 Cepstral Coefficients
– 12 Delta Cepstral Coefficients
– 12 Delta Delta Cepstral Coefficients
– 1 Energy Coefficient
– 1 Delta Energy Coefficient
– 1 Delta Delta Energy Coefficient
16
Next Class
• Introduction to Statistical Modeling and Classification
• Reading: J&M 9.4, optional 6.6
17