Speech Representations - MFCC

Representing Acoustics with Mel Frequency Cepstral Coefficients

Lecture 7

Spoken Language Processing

Prof. Andrew Rosenberg

Representing Acoustic Information

• 16-bit samples at a 44.1 kHz sampling rate

– ~86kB/sec

– ~5MB/min

• Waves repeat, so much of this data is redundant.

• A good representation of speech (for recognition)

– Keeps all of the information to discriminate between phones

– Is compact, i.e. gets rid of everything else
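
The storage figures above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes a single (mono) channel and binary units (1 KiB = 1024 bytes); neither assumption is stated on the slide.

```python
# Storage cost of uncompressed PCM audio at the rates quoted above.
# Assumption: one channel (mono); the slide does not say.
BITS_PER_SAMPLE = 16
SAMPLE_RATE_HZ = 44_100

bytes_per_second = SAMPLE_RATE_HZ * BITS_PER_SAMPLE // 8   # 88,200 bytes
bytes_per_minute = bytes_per_second * 60                    # 5,292,000 bytes

kib_per_second = bytes_per_second / 1024
mib_per_minute = bytes_per_minute / 1024 ** 2

print(f"{kib_per_second:.1f} KiB/s, {mib_per_minute:.2f} MiB/min")
# prints "86.1 KiB/s, 5.05 MiB/min"
```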

Frame Based analysis

• Using a short analysis window, analyze the waveform every 10 ms (or at another analysis rate).

• Usually performed with overlapping windows.

• e.g. FFT and Spectrogram
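
Frame-based analysis can be sketched as slicing the signal into overlapping rows. The 10 ms step comes from the slide; the 25 ms window length, the 16 kHz sampling rate, and the function name `frame_signal` are assumptions chosen for illustration.

```python
import numpy as np

def frame_signal(x, sample_rate, win_ms=25.0, hop_ms=10.0):
    """Split a 1-D signal into overlapping frames (one row per frame)."""
    win = int(round(sample_rate * win_ms / 1000))   # samples per window
    hop = int(round(sample_rate * hop_ms / 1000))   # samples between frame starts
    if len(x) < win:
        raise ValueError("signal is shorter than one analysis window")
    n_frames = 1 + (len(x) - win) // hop
    # Index matrix: row i selects samples [i*hop, i*hop + win).
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

sr = 16_000                          # assumed sampling rate
x = np.arange(sr, dtype=float)       # one second of dummy samples
frames = frame_signal(x, sr)
print(frames.shape)                  # (98, 400)
```

Because the window (400 samples) is longer than the step (160 samples), consecutive frames share 240 samples.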

Overlapping frames

• Spectrograms allow for visual inspection of spectral information.

• We are looking for a compact, numerical representation

[Figure: successive overlapping analysis frames, each advanced by 10 ms]

Example Spectrogram

Standard Representation in the field

• Mel Frequency Cepstral Coefficients

– MFCC

Processing pipeline:

Pre-emphasis → Window → FFT → Mel Filter Bank → log → FFT⁻¹ → 12 MFCC → Deltas

Energy → Deltas (computed alongside the cepstral path)

Output features:

12 MFCC

12 ∆ MFCC

12 ∆∆ MFCC

1 energy

1 ∆ energy

1 ∆∆ energy

Pre-emphasis

• In the spectrum of voiced segments, there is more energy at the lower frequencies than at the higher frequencies.

• Boosting high frequencies helps make the high frequency information more available.

– First-order high-pass filter for pre-emphasis.
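
A first-order high-pass filter of this kind is often written as y[n] = x[n] - α·x[n-1]. The sketch below assumes α = 0.97, a commonly used value that is not given on the slide, and a 16 kHz sampling rate; `pre_emphasize` is an illustrative name.

```python
import numpy as np

def pre_emphasize(x, alpha=0.97):
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1]."""
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    y[0] = x[0]                       # no previous sample exists for n = 0
    y[1:] = x[1:] - alpha * x[:-1]
    return y

def rms(v):
    """Root-mean-square amplitude."""
    return np.sqrt(np.mean(v ** 2))

sr = 16_000                            # assumed sampling rate
t = np.arange(sr) / sr
low = np.sin(2 * np.pi * 100 * t)      # 100 Hz tone
high = np.sin(2 * np.pi * 4000 * t)    # 4 kHz tone

low_gain = rms(pre_emphasize(low)) / rms(low)     # strongly attenuated
high_gain = rms(pre_emphasize(high)) / rms(high)  # boosted
print(low_gain, high_gain)
```

The low tone comes out much weaker than it went in while the high tone is amplified, which is the tilt correction described above.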

Windowing

• Overlapping windows allow analysis centered at a frame point, while using more information.

Hamming Windowing

• Discontinuities at the edge of the window can cause problems for the FFT

• A Hamming window smooths out the edges.
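
The edge tapering can be sketched with the usual Hamming formula, w[n] = 0.54 - 0.46·cos(2πn/(N-1)). The frame length of 400 samples is an assumption for illustration; NumPy's built-in `np.hamming` computes the same window.

```python
import numpy as np

def hamming(n_samples):
    """Hamming window: w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    n = np.arange(n_samples)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (n_samples - 1))

N = 400                   # assumed frame length in samples
w = hamming(N)

# Worst case for edge discontinuities: a frame at full amplitude throughout.
frame = np.ones(N)
tapered = frame * w       # edges pulled down to 0.08, centre left near 1.0
print(tapered[0], tapered[N // 2])
```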

Discrete Fourier Transform

• The Fast Fourier Transform (FFT) is the standard efficient algorithm for calculating the Discrete Fourier Transform (DFT).

[Figure: Australian male /i:/ from “heed”; FFT analysis window 12.8 ms]

http://clas.mq.edu.au/acoustics/speech_spectra/fft_lpc_settings.html
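
To make the DFT/FFT relationship concrete, the sketch below evaluates the DFT definition directly and checks that `np.fft.fft` returns the same values. The 256-sample frame, 16 kHz sampling rate, and 1 kHz test tone are assumptions chosen so that the tone falls exactly on one frequency bin.

```python
import numpy as np

def naive_dft(x):
    """Direct O(N^2) evaluation of X[k] = sum_n x[n] * exp(-2j*pi*k*n/N)."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    n = np.arange(N)
    W = np.exp(-2j * np.pi * np.outer(n, n) / N)   # N x N matrix of complex exponentials
    return W @ x

sr = 16_000                               # assumed sampling rate
N = 256                                   # assumed frame length
t = np.arange(N) / sr
frame = np.sin(2 * np.pi * 1000 * t)      # 1 kHz tone, exactly 16 cycles per frame

X_slow = naive_dft(frame)
X_fast = np.fft.fft(frame)                # same result, O(N log N)

peak_bin = int(np.argmax(np.abs(X_fast[: N // 2])))
peak_hz = peak_bin * sr / N
print(peak_bin, peak_hz)                  # 16 1000.0
```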

Mel Filter Bank and Log

• Human hearing is not equally sensitive at all frequency regions.

• Modeling human hearing sensitivity helps phone recognition.

• MFCC approach: Warp frequencies from Hz to the Mel frequency scale.

• Mel: a unit of pitch defined so that pairs of sounds that are perceptually equidistant in pitch are separated by an equal number of mels.
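
The slides do not give a formula for the warping. One widely used approximation is mel(f) = 2595·log10(1 + f/700); the sketch below assumes that formula, and the function names are illustrative.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Assumed warping: mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

# The same 100 Hz step covers far more mels at low frequencies than at high ones.
low_step = hz_to_mel(200) - hz_to_mel(100)
high_step = hz_to_mel(4100) - hz_to_mel(4000)
print(float(hz_to_mel(1000)), float(low_step), float(high_step))
```

Under this formula 1000 Hz maps to roughly 1000 mels, and the scale is close to linear below that point and close to logarithmic above it.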

Mel frequency Filter bank

• Create a bank of filters, each collecting energy from one frequency band: 10 filters spaced linearly below 1000 Hz, with the remaining filters spaced logarithmically above 1000 Hz.
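
A filter bank of this kind can be sketched with triangular filters. Note the substitution: instead of the exact "10 linear, then logarithmic" layout described above, this sketch spaces the filter edges evenly on an assumed mel formula (2595·log10(1 + f/700)), which yields a similar near-linear-then-logarithmic spacing. The filter count (26), FFT size (512), and sampling rate (16 kHz) are also assumptions.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_filters, n_fft, sample_rate):
    """Return an (n_filters, n_fft // 2 + 1) matrix of triangular filters."""
    # n_filters + 2 edge frequencies, evenly spaced in mels from 0 Hz to Nyquist.
    edges_mel = np.linspace(0.0, hz_to_mel(sample_rate / 2), n_filters + 2)
    edges_hz = mel_to_hz(edges_mel)
    bin_hz = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)   # frequency of each FFT bin
    bank = np.zeros((n_filters, bin_hz.size))
    for i in range(n_filters):
        lo, mid, hi = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (bin_hz - lo) / (mid - lo)      # 0 at lo, 1 at mid
        falling = (hi - bin_hz) / (hi - mid)     # 1 at mid, 0 at hi
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

bank = mel_filter_bank(n_filters=26, n_fft=512, sample_rate=16_000)

# Apply to one frame's power spectrum (a flat dummy spectrum here), then take the log.
power_spectrum = np.ones(257)
log_mel = np.log(bank @ power_spectrum + 1e-10)
print(bank.shape, log_mel.shape)    # (26, 257) (26,)
```

Each row of `bank` weights the FFT bins belonging to one band, so `bank @ power_spectrum` gives one energy value per filter.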

Cepstrum

• Separation of source and filter.

• Source differences are speaker dependent.

• Filter differences are phone dependent.

• Cepstrum is the “Spectrum of the Log of the Spectrum”: the inverse DFT of the log magnitude of the DFT of the signal.
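
The definition above translates almost directly into code. In this sketch the test signal is a synthetic pulse train with one pulse every 120 samples, standing in for a periodic source; the small constant `eps` is an assumption added only to avoid taking the log of zero.

```python
import numpy as np

def real_cepstrum(x, eps=1e-10):
    """Inverse DFT of the log magnitude of the DFT of x."""
    spectrum = np.fft.fft(x)
    log_mag = np.log(np.abs(spectrum) + eps)   # eps guards against log(0)
    return np.fft.ifft(log_mag).real

period = 120                      # samples between pulses
x = np.zeros(10 * period)
x[::period] = 1.0                 # idealised periodic source

c = real_cepstrum(x)

# Search away from the origin for the strongest component.
lo, hi = 50, 200
peak = lo + int(np.argmax(c[lo:hi]))
print(peak)                       # 120
```

The periodic source shows up as an isolated peak at its period, well separated from the region near zero.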

Cepstrum Visualization

• Peak at 120 samples represents the glottal pulse, corresponding to the F0

• Large values closer to zero correspond to vocal tract filter (tongue position, jaw opening, etc.)

• It is common to keep only the first 12 coefficients.
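
Keeping the first 12 coefficients can be sketched as follows. Note the substitution: many MFCC implementations replace the inverse DFT with a discrete cosine transform (DCT-II) of the log mel energies, and this sketch does the same. Skipping coefficient 0 (which only tracks overall level) and the 26-filter input size are further assumptions.

```python
import numpy as np

def dct_ii(v):
    """Unnormalised DCT-II: c[k] = sum_n v[n] * cos(pi * k * (2n + 1) / (2N))."""
    v = np.asarray(v, dtype=float)
    N = v.shape[-1]
    n = np.arange(N)
    k = n[:, None]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * N))   # shape (N, N), one row per k
    return v @ basis.T

def first_cepstral_coeffs(log_mel_energies, n_keep=12):
    """Keep coefficients 1..n_keep of the DCT of the log mel energies."""
    return dct_ii(log_mel_energies)[..., 1 : n_keep + 1]

rng = np.random.default_rng(0)
log_mel = rng.normal(size=(5, 26))        # 5 frames x 26 filters of dummy data
mfcc = first_cepstral_coeffs(log_mel)
print(mfcc.shape)                         # (5, 12)
```

The low-order coefficients describe the smooth spectral envelope (the vocal tract filter), which is why the higher ones are discarded.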

Deltas and Energy

• Energy within a frame is just the sum of the power of the samples.

• The spectrum of some phones changes over time, for example from stop closure to stop burst, or along the slope of a formant.

• Taking the delta (velocity) and double delta (acceleration) incorporates this information.
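
Frame energy and deltas can be sketched as below. The slides do not say how the delta is estimated; this sketch assumes a regression-style slope over ±2 frames with edge frames repeated as padding, and the function names are illustrative.

```python
import numpy as np

def frame_energy(frames):
    """Sum of squared samples in each frame (frames: n_frames x frame_len)."""
    return np.sum(np.asarray(frames, dtype=float) ** 2, axis=1)

def delta(features, width=2):
    """Slope of each feature over +/- `width` frames (features: n_frames x n_dims)."""
    features = np.asarray(features, dtype=float)
    T = features.shape[0]
    # Repeat the first and last frames so every frame has neighbours.
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    num = sum(
        k * (padded[width + k : width + k + T] - padded[width - k : width - k + T])
        for k in range(1, width + 1)
    )
    return num / (2 * sum(k * k for k in range(1, width + 1)))

ramp = 3.0 * np.arange(10)[:, None]    # one feature rising by 3 per frame
d = delta(ramp)                        # velocity: 3 away from the edges
dd = delta(d)                          # acceleration: 0 away from the edges
print(d[2:-2].ravel(), dd[4:-4].ravel())
```

Applying `delta` to its own output gives the double delta.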

Summary: MFCC

• MFCC feature vectors commonly have 39 features:

– 12 Cepstral Coefficients

– 12 Delta Cepstral Coefficients

– 12 Delta Delta Cepstral Coefficients

– 1 Energy Coefficient

– 1 Delta Energy Coefficient

– 1 Delta Delta Energy Coefficient
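
Assembling the 39 features per frame can be sketched as stacking the 13 static values with their deltas and double deltas. The column order, the use of `np.gradient` (central differences) as the delta estimate, and the random placeholder inputs are all assumptions for illustration.

```python
import numpy as np

def simple_delta(features):
    """Frame-to-frame slope along the time axis (central differences)."""
    return np.gradient(features, axis=0)

def stack_39(mfcc12, energy):
    """mfcc12: (T, 12) array; energy: (T,) array. Returns a (T, 39) array."""
    static = np.hstack([mfcc12, energy[:, None]])   # 12 MFCC + 1 energy = 13
    d1 = simple_delta(static)                       # 13 deltas
    d2 = simple_delta(d1)                           # 13 double deltas
    return np.hstack([static, d1, d2])              # 13 * 3 = 39

rng = np.random.default_rng(0)
T = 50                                              # number of frames
features = stack_39(rng.normal(size=(T, 12)), rng.random(T))
print(features.shape)                               # (50, 39)
```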

Next Class

• Introduction to Statistical Modeling and Classification

• Reading: J&M 9.4, optional 6.6
