PROJECT SUMMARY REPORT

Eigen-Faces Applied to Speech Style Classification

Submitted To The 2012 Academic Year NSF AY-REU Program
Part of NSF Type 1 STEP Grant
Sponsored By The National Science Foundation
Grant ID No.: DUE-0756921

College of Engineering and Applied Science
University of Cincinnati
Cincinnati, Ohio

Prepared By
Brad Keserich, Senior, Computer Engineering

Report Reviewed By:
Dr. Dharma Agrawal
REU Faculty Mentor
Professor
School of Computing Sciences and Informatics
University of Cincinnati

September 17 – December 6, 2012

Eigen-Faces Applied to Speech Style Classification

Brad Keserich, Senior, Computer Engineering; College of Engineering and Applied Science; University of Cincinnati; Cincinnati, Ohio
Graduate Mentor: Suryadip Chakraborty, School of Computing Sciences and Informatics; University of Cincinnati; Cincinnati, Ohio
Mentor: Dr. Dharma Agrawal, Professor, School of Computing Sciences and Informatics; University of Cincinnati; Cincinnati, Ohio

Abstract

In this study we investigate the usefulness of the eigen-faces method for speech classification. The eigen-faces method has been applied to other audio classification problems, such as recognizing different vehicle noises [1]. We use this method to differentiate between speaking styles. Eigen-face classification is performed using abstract features derived from concrete features in the audio data. Results indicate that simple classification is possible and that the eigen-faces technique may prove useful in speech classification applications. Simple cases show that normal and non-normal speech are qualitatively separable in the feature space.

Subject Headings

Classification, Signal processing, Fourier analysis

Introduction

A fully mature voice recognition system may be useful in various applications. Computer vision algorithms already exist that can identify a person in a crowd by their gait, or verify handwritten signatures, by learning from past examples.
A similar concept may be applied to human speaking style. Existing voice recognition systems may become more powerful if they can identify features analogous to the "gait" or "handwriting" used in vision tasks. These features would relate to the stylistic differences between speakers. A machine could use trends such as stuttering over difficult words, subtle patterns in speech, or reliance on fillers like "um". These could be used not only to recognize a speaker but also to better understand what the speaker is saying, leading to better overall performance in speech recognition tasks.

Materials and Methods

Equipment

Audio was recorded using the built-in microphone of a 2008 MacBook Pro laptop. The audio recording software was Logic Express 8. Scripts were implemented in the Octave language and run on the Mac OS X operating system.

Data Acquisition

Audio data was recorded for various speech styles. The speaker and the contents of the speech were held constant across all test cases. A single voice actor generated all of the audio samples; this eliminated the complexities of speaker recognition so that the style of speech could be focused on. Audio was saved in WAV format sampled at 44.1 kHz with a 24-bit depth. The data is split into two main cases. In the first case the speaker spoke in a clear, normal voice. In the second case the speaker was instructed to add a pitch inflection to their words or to mimic a stutter.

Methodology

The first task is to segment the raw waveform into spoken sections. This can be done using the short-time energy, average magnitude, and average zero-cross rate of the raw waveform. Figure 1 illustrates these three techniques applied to a spoken waveform (in blue). The word used for this was "Tabemono", chosen because it is easily separated into the consonant + vowel pairs TA, BE, MO, and NO. Figure 2 illustrates the result of taking the minimum of all three of the previously mentioned measures and applying a threshold.
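The three measures and the thresholded minimum can be sketched as follows. This is an illustrative Python/NumPy version (the study's scripts were written in Octave); the frame length, hop size, and threshold value here are assumptions for illustration, not the values used in the study.

```python
import numpy as np

def frame_measures(x, frame_len=1024, hop=512):
    """Short-time energy, average magnitude, and average zero-cross
    rate, computed per frame over the raw waveform x."""
    n_frames = 1 + (len(x) - frame_len) // hop
    energy = np.empty(n_frames)
    avg_mag = np.empty(n_frames)
    zcr = np.empty(n_frames)
    for i in range(n_frames):
        frame = x[i * hop : i * hop + frame_len]
        energy[i] = np.sum(frame ** 2)
        avg_mag[i] = np.mean(np.abs(frame))
        # Count sign changes between consecutive samples.
        zcr[i] = np.mean(np.abs(np.diff(np.sign(frame)))) / 2
    return energy, avg_mag, zcr

def spoken_mask(x, threshold=0.1):
    """Scale each measure to [0, 1], take the per-frame minimum of
    all three, and threshold it to mark frames containing speech."""
    measures = frame_measures(x)
    scaled = [m / (m.max() + 1e-12) for m in measures]
    combined = np.minimum.reduce(scaled)
    return combined > threshold
```

Contiguous runs of marked frames then correspond to the spoken sections, roughly the consonant + vowel pairs in the "Tabemono" example.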
This thresholded minimum yields a segmentation corresponding roughly to the consonant + vowel pairs.

Figure 1. Waveform analysis for segmentation.

Figure 2. Automated waveform segmentation.

Automatic segmentation was investigated but was not used for the training data. The training data consists of the "TA" portion of the above waveforms only and was segmented by hand in a wave editor. The "TA" data set exists in two portions: the training set and the verification set. The classifier is tuned using the training set, and the verification set is reserved to test the performance of the classifier. Each WAV file in the verification set already has a human-assigned truth class that we use for checking the answer of the classifier. The performance of the classifier is then gauged by looking at the separability of the data in the feature space.

The eigen-faces procedure is described by Wu, Siegel, and Khosla, but a brief overview is provided here [1]. Essentially, the "TA" portion is taken from many samples to produce a time-domain vector. Each frame in the vector is then subject to a Fourier transform to produce a set of power spectra. These are then normalized and combined with the total duration of the spoken portion and any stutter-detection bit that might be available. This final set of feature vectors is averaged to produce Ω. Each vector has Ω subtracted from it before a final principal component analysis (PCA) is performed to obtain the eigenvectors of the covariance matrix. Classification is performed by projecting new sets of vectors onto the eigen-basis. The magnitude of the residual vector is the indicator of the classification: larger magnitudes indicate that the vectors are not of the same class as the training set. Examples of this can be seen in Figures 3 and 4.

Results and Discussion

Results were analyzed qualitatively using the residual scatter plots.
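The residuals behind these scatter plots come from the projection step described in the Methodology: subtract the mean Ω, project onto the eigen-basis, and measure what the basis fails to capture. A minimal Python/NumPy sketch of that step, assuming the feature vectors have already been extracted (the study's scripts were in Octave; the SVD route to PCA, the function names, and the choice of k are illustrative):

```python
import numpy as np

def train_eigenbasis(features, k=4):
    """features: (n_samples, n_dims) array of feature vectors.
    Returns the mean vector (Omega in the text) and the top-k
    eigenvectors of the covariance matrix, obtained via SVD of
    the mean-subtracted data."""
    omega = features.mean(axis=0)
    centered = features - omega
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return omega, vt[:k]

def residual_magnitude(x, omega, basis):
    """Project a new vector onto the eigen-basis and return the
    magnitude of the part the basis fails to capture.  Larger
    values suggest x is not of the same class as the training set."""
    centered = x - omega
    reconstruction = basis.T @ (basis @ centered)
    return np.linalg.norm(centered - reconstruction)
```

A new sample is taken to resemble the training class when its residual magnitude is small; plotting residuals for samples of different classes produces scatter plots like those in Figures 3 and 4.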
Qualitative analysis is reasonable for classification problems, since we are looking for a high degree of separation between classes in the feature space. The results indicate that good stutter detection can separate the classes relatively easily. This separation remains effective outside of the eigen-faces method when the stutter is indicated as a single bit. Accurate segmentation also allows the spacing time between spoken segments to be calculated and used as a feature. Checking the variance of these spacings relative to one another is relatively easy, so an outlier indicates an awkward pause. Pitch and volume are more difficult to analyze and must be investigated further.

Figure 3. Green and blue belong to the same class but different data sets. Green is the training data. Red is in a different class than green and blue.

Figure 4. Same data as Figure 3, but now the additional features duration and stutter are used.

Conclusions

The eigen-faces method seems to be a useful technique for speech classification. The jump from vehicles to human voices does carry some complexity, but further investigation into techniques like wavelet transforms and the Mel-Cepstrum may improve the performance of the classification. There is still much room for improvement in the current state of the research. Other nonlinear alternatives to PCA should be tested as well. Laplacian eigen-maps were briefly investigated, but not enough is known about their applicability to this problem to reach a conclusion about their usefulness [2]. The use of Mel-Cepstrum coefficients in particular seems promising for the eigen-faces method; Mel-Cepstrums have been successfully applied to many speech recognition tasks [3].

Acknowledgments

Funding for this research was provided by the NSF CEAS AY REU Program, Part of NSF Type 1 STEP Grant, Grant ID No.: DUE-0756921.

References

1. Wu, H., Siegel, M., & Khosla, P. (1999). Vehicle sound signature recognition by frequency vector principal component analysis.
IEEE Transactions on Instrumentation and Measurement, 48(5). doi:10.1109/19.799662.
2. Belkin, M., & Niyogi, P. (2002). Laplacian Eigenmaps for Dimensionality Reduction and Data Representation.
3. Prahallad, K. Speech Technology: A Practical Introduction. Topic: Spectrogram, Cepstrum and Mel-Frequency Analysis. http://www.speech.cs.cmu.edu/11-492/slides/03_mfcc.pdf.