PROJECT SUMMARY REPORT
Eigen-Faces Applied to Speech Style Classification
Submitted To
The 2012 Academic Year NSF AY-REU Program
Part of
NSF Type 1 STEP Grant
Sponsored By
The National Science Foundation
Grant ID No.: DUE-0756921
College of Engineering and Applied Science
University of Cincinnati
Cincinnati, Ohio
Prepared By
Brad Keserich, Senior, Computer Engineering
Report Reviewed By:
Dr. Dharma Agrawal
REU Faculty Mentor
Professor
School of Computing Sciences and Informatics
University of Cincinnati
September 17 – December 6, 2012
Eigen-Faces Applied to Speech Style Classification
Brad Keserich, Senior, Computer Engineering; College of Engineering and Applied Science; University of
Cincinnati; Cincinnati, Ohio
Graduate Mentor: Suryadip Chakraborty, School of Computing Sciences and Informatics; University of
Cincinnati; Cincinnati, Ohio
Mentor: Dr. Dharma Agrawal, Professor, School of Computing Sciences and Informatics; University of
Cincinnati; Cincinnati, Ohio
Abstract
In this study we investigate the usefulness of the eigen-faces method for speech classification. The eigen-faces method has previously been applied to other audio classification problems, such as recognizing different vehicle noises [1]. Here we use the method to differentiate between speaking styles. Eigen-face classification is performed on abstract features derived from concrete features in the audio data. Results indicate that simple classification is possible and that the eigen-faces technique may prove useful in speech classification applications. In simple cases, normal and non-normal speech are qualitatively separable in the feature space.
Subject Headings
Classification, Signal processing, Fourier analysis
Introduction
A fully mature voice recognition system may be useful in various applications. There are currently computer vision algorithms that, by learning from past examples, can identify a person in a crowd by their gait or recognize a person's handwritten signature. A similar concept may be applied to human speaking style. Existing voice recognition systems may become more powerful if they can identify features analogous to the "gait" or "handwriting" used in vision tasks. These features would capture the stylistic differences between speakers. A machine could use trends such as stuttering over difficult words, subtle patterns in speech, or reliance on fillers like "um". These could be used not only to recognize a speaker but also to better understand what the speaker is saying, leading to better overall performance in speech recognition tasks.
Materials and Methods
Equipment
Audio was recorded using the built-in microphone of a 2008 MacBook Pro laptop. The audio recording software was Logic Express 8. Scripts were implemented in the Octave language and run on the Mac OS X operating system.
Data Acquisition
Audio data was recorded for various speech styles. The speaker and the spoken content were constant for all test cases; a single voice actor generated all of the audio samples. This eliminated the complexities of speaker recognition so that the focus could remain on the style of speech. Audio was saved in WAV format, sampled at 44.1 kHz with a 24-bit depth. The data is split into two main cases. In the first case the speaker spoke in a clear, normal voice. In the second case the speaker was instructed to add a pitch inflection to their words or to mimic a stutter.
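As an illustration of how such a recording can be loaded for processing, the following is a minimal Octave sketch. The file name used here is hypothetical (actual file names are not given in this report), and it assumes a version of Octave that provides audioread (older versions used wavread instead).

    % Minimal sketch: load one recorded sample into Octave for processing.
    % The file name below is hypothetical.
    [x, fs] = audioread("ta_normal_01.wav");   % fs should be 44100 Hz
    x = mean(x, 2);                            % mix down to mono if stereo
    t = (0:length(x) - 1) / fs;                % time axis in seconds
    plot(t, x);
    xlabel("Time (s)");
    ylabel("Amplitude");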
Methodology
The first task is to segment the raw waveform into spoken sections. This can be done using the short-time energy, average magnitude, and average zero-crossing rate of the raw waveform. Figure 1 illustrates these three measures applied to the spoken waveform (shown in blue). The word used for this was "Tabemono", chosen because it is easily separated into the consonant + vowel pairs TA, BE, MO, and NO. Figure 2 illustrates the result of taking the minimum of all three measures and applying a threshold, which yields a segmentation corresponding roughly to the consonant + vowel pairs. A sketch of these segmentation measures is given below.
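The Octave sketch below illustrates the three frame-based measures and the minimum-plus-threshold combination described above. The frame length, hop size, normalization, and threshold value are assumptions made only for illustration; the report does not give the exact parameters. The zero-crossing rate is also inverted here so that all three measures are large during voiced speech, which the report does not state explicitly.

    % Sketch of the frame-based segmentation measures (parameter values are
    % illustrative assumptions, not those used in the study).
    frame_len = round(0.02 * fs);          % 20 ms analysis frames
    hop       = round(0.01 * fs);          % 10 ms hop between frames
    n_frames  = floor((length(x) - frame_len) / hop) + 1;

    ste = zeros(n_frames, 1);              % short-time energy
    avm = zeros(n_frames, 1);              % average magnitude
    zcr = zeros(n_frames, 1);              % zero-crossing rate
    for k = 1:n_frames
      frame  = x((k - 1) * hop + (1:frame_len));
      ste(k) = sum(frame .^ 2);
      avm(k) = mean(abs(frame));
      zcr(k) = sum(abs(diff(sign(frame)))) / (2 * frame_len);
    end

    % Normalize each measure to [0, 1]; the zero-crossing rate is inverted
    % (an assumption) so that all three are large during voiced speech.
    nrm    = @(v) (v - min(v)) ./ (max(v) - min(v) + eps);
    combo  = min([nrm(ste), nrm(avm), 1 - nrm(zcr)], [], 2);
    voiced = combo > 0.1;                  % threshold gives the segment mask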
Figure 1. Waveform analysis for segmentation.
Figure 2. Automated waveform segmentation.
Automatic segmentation was investigated but was not used for the training data. The training data consists of only the "TA" portion of the above waveforms and was segmented by hand in a wave editor.
The "TA" data set exists in two portions: the training set and the verification set. The classifier is tuned using the training set, and the verification set is held back to test the performance of the classifier. Each WAV file in the verification set has a human-assigned truth class that is used to check the answer of the classifier. The performance of the classifier is then gauged by looking at the separability of the data in the feature space.
The eigen-faces procedure is described by Wu, Siegel, and Khosla [1], but a brief overview is provided here. Essentially, the "TA" portion is taken from many samples, each producing a time-domain vector. Each frame of the vector is then subjected to a Fourier transform to produce a set of power spectra. These are normalized and combined with the total duration of the spoken portion and any stutter detection bit that might be available. The resulting set of feature vectors is averaged to produce the mean vector Ω. Ω is subtracted from each vector before a final principal component analysis (PCA) is applied to obtain the eigen-vectors of the covariance matrix.
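A compact Octave sketch of this training step is given below. The variable names, FFT length, and normalization are assumptions for illustration, and the whole segment is transformed at once here rather than frame by frame; segments is assumed to be a cell array of hand-segmented "TA" waveforms and stutter_bits a vector of 0/1 stutter indicators.

    % Sketch of the eigen-faces training step (FFT length, normalization,
    % and variable names are assumptions; the report names only the steps).
    n_fft = 1024;
    M     = numel(segments);                    % number of "TA" samples
    Gamma = zeros(n_fft / 2 + 1 + 2, M);        % one feature vector per column
    for i = 1:M
      s   = segments{i}(:);
      P   = abs(fft(s, n_fft)) .^ 2;            % power spectrum of the segment
      P   = P(1:n_fft / 2 + 1);                 % keep the one-sided spectrum
      P   = P / (sum(P) + eps);                 % normalize
      dur = length(s) / fs;                     % total duration in seconds
      Gamma(:, i) = [P; dur; stutter_bits(i)];  % spectrum + duration + stutter
    end

    Omega  = mean(Gamma, 2);                    % the mean feature vector Omega
    A      = Gamma - Omega;                     % mean-subtracted vectors (broadcast)
    C      = A * A';                            % covariance matrix (up to scaling)
    [V, D] = eig(C);                            % PCA via eigen-decomposition
    [vals, idx] = sort(diag(D), "descend");
    U      = V(:, idx(1:min(10, M)));           % keep the leading eigen-vectors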
Classification is performed by projecting new feature vectors onto the eigen-basis. The magnitude of the residual vector is the indicator of the classification: larger magnitudes indicate that the vectors are not of the same class as the training set. Examples of this can be seen in Figures 3 and 4.
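A sketch of this classification step, continuing the variables from the training sketch above, might look as follows; make_feature is a hypothetical helper that builds the same [spectrum; duration; stutter] vector used during training.

    % Sketch of classification by projection onto the eigen-basis.
    % make_feature is a hypothetical helper producing the same feature
    % vector layout used during training.
    gamma = make_feature(new_segment, fs, new_stutter_bit);
    phi   = gamma - Omega;            % subtract the training mean
    w     = U' * phi;                 % coordinates in the eigen-basis
    resid = phi - U * w;              % portion not explained by the basis
    score = norm(resid);              % large score => different class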
Results and Discussion
Results were analyzed qualitatively using the residual scatter plots. This is reasonable for classification problems since we are looking for a high degree of separation between classes in the feature space. The results indicate that good stutter detection can separate the classes relatively easily; even outside of the eigen-faces method, indicating a stutter as a single bit is effective. Accurate segmentation also allows the spacing time between segments to be calculated and used as a feature. It is relatively easy to compare the variance of these spacings so that an outlying gap indicates an awkward pause, as sketched below. Pitch and volume are more difficult to analyze and must be investigated further.
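A rough sketch of such a pause check follows; seg_starts and seg_ends are hypothetical vectors of segment boundary times (in seconds) produced by the segmentation, and the three-standard-deviation rule is an assumption made only for illustration.

    % Flag awkward pauses as outlying gaps between spoken segments.
    % seg_starts and seg_ends are hypothetical boundary-time vectors.
    gaps    = seg_starts(2:end) - seg_ends(1:end-1);   % spacing times (s)
    mu      = mean(gaps);
    sigma   = std(gaps);
    awkward = abs(gaps - mu) > 3 * sigma;              % outlier => awkward pause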
Figure 3. Green and blue belong to the same class but different data sets. Green is the training data. Red is in a different class than green and blue.
Figure 4. Same data as Figure 3, but with the additional features of duration and stutter included.
Conclusions
The eigen-faces method seems to be a useful technique for speech classification. The jump from vehicle sounds to human voices does carry some complexity, but further investigation into techniques like wavelet transforms and the Mel-Cepstrum may improve the performance of the classification. There is still much room for improvement in the current state of the research. Other nonlinear alternatives to PCA should be tested as well. Laplacian eigen-maps were briefly investigated, but not enough is known about their applicability to this problem to reach a conclusion about their usefulness [2]. The use of Mel-Cepstrum coefficients in particular seems promising for the eigen-faces method, as they have been successfully applied to many speech recognition tasks [3].
Acknowledgments
Funding for this research was provided by the NSF CEAS AY-REU Program, part of NSF Type 1 STEP Grant, Grant ID No. DUE-0756921.
References
1. Wu, H., Siegel, M., & Khosla, P. (1999). Vehicle sound signature recognition by frequency vector principal component analysis. IEEE Transactions on Instrumentation and Measurement, 48(5). doi: http://dx.doi.org/10.1109/19.799662.
2. Belkin, M., & Niyogi, P. (2002). Laplacian Eigenmaps for Dimensionality Reduction and Data Representation.
3. Prahallad, K. Speech Technology: A Practical Introduction. Topics: Spectrogram, Cepstrum and Mel-Frequency Analysis. http://www.speech.cs.cmu.edu/11-492/slides/03_mfcc.pdf.