Voice-Based Gender Classification Using Support Vector Machine

Project Presentation for Class COMS E6772, Fall 2006
Student: Wenwei Wang
Advisor: Prof. Tony Jebara
Columbia University
December 11, 2006

Motivation

Gender classification plays an important role in:
• Speech/speaker recognition;
• Other applications, such as HCI, passive surveillance, and smart living environments.

Bimodal gender classification can improve overall performance:
• Image-based gender classification performance varies with factors such as ambient lighting and face angle;
• Voice-based gender classification can be degraded by factors such as environmental noise and the recording channel.

About this project:
• Focuses on voice-based gender classification using Support Vector Machines;
• The Gaussian Mixture Model (GMM) method was used as a comparison;
• Both the text-dependent and text-independent cases were explored.

Columbia University, Electrical Engineering Department

Voice Data Source

Training and Test Voices:
• Each of 25 speakers was asked to read two different paragraphs: the longer one for the training voice and the shorter one for the testing voice.

Recording Method:
• Different offices with normal noise levels during working hours;
• Ordinary telephone microphone;
• Microsoft Sound Recorder, Version 5.1;
• 16-bit, 16 kHz, mono;
• Recording length ranges from 40 to 60 seconds.
Speakers Summary: Male: 15; Female: 10.

Voice Feature

MFCC: the most commonly used feature for speech/speaker recognition and verification.

Feature Extraction:
• Pre-emphasis: H(z) = 1 - 0.95 z^{-1};
• Framing: window size = 500 samples; overlap = 200 samples;
• Windowing: Hamming window;
• Training voice: 800 MFCC vectors of order 12 per speaker;
• Testing voice: 400 MFCC vectors of order 12 per speaker;
• In addition to each MFCC vector, the delta MFCC and delta-delta MFCC vectors were computed (they were tried alongside MFCC, but the results showed no improvement for gender classification);
• A ground-truth gender label, 1 for male or -1 for female, was created for each MFCC vector.

SVC Implementation

SVC Model Training:
• 100 frames of training MFCC per speaker (100 frames yielded the best overall classification performance);
• For each MFCC vector, only the first 3 coefficients were used (adding more coefficients, or using other combinations with delta MFCC and delta-delta MFCC coefficients, did not improve the overall classification performance);
• The 1st frame from each speaker generated the 1st SVC model, the 2nd frame from each speaker generated the 2nd SVC model, and so on, so 100 frames generated 100 SVC models;
• Kernel: RBF with sigma = 0.1 and cost = inf (RBF, ERBF, and B-spline all gave an equally good model with 100% gender classification; for RBF, sigma = 0.1 and cost = inf were selected based on the overall classification performance).

SVC Classification:
• Text independent: 100 frames of testing MFCC per speaker (100 frames and 3 coefficients yielded the best results);
• Text dependent: 100 frames of training MFCC per speaker (100 frames and 3 coefficients yielded the best results);
• The 100 gender predictions from the 100 SVC models were simply averaged to give the final gender score.
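
The score-averaging step described above can be sketched as follows. This is a minimal illustration, not the project's code: the project used Dr. Gunn's SVM software, whereas here each trained model is represented by a hypothetical Python callable returning +1 (male) or -1 (female). Pairing test frame i with model i, and leaving a score of exactly 0 undecided, are assumptions; the slides only state that the 100 predictions were averaged.

```python
from typing import Callable, Optional, Sequence

Frame = Sequence[float]         # first 3 MFCC coefficients of one frame
Model = Callable[[Frame], int]  # hypothetical stand-in for a trained SVC: returns +1 or -1


def gender_score(models: Sequence[Model], frames: Sequence[Frame]) -> float:
    """Average the +1/-1 predictions, pairing frame i with model i (assumed pairing)."""
    if len(models) != len(frames) or not models:
        raise ValueError("need one frame per model, and at least one model")
    predictions = [model(frame) for model, frame in zip(models, frames)]
    return sum(predictions) / len(predictions)


def decide(score: float) -> Optional[str]:
    """Map the averaged score to a label; exactly 0 is left undecided (assumed)."""
    if score > 0:
        return "m"
    if score < 0:
        return "f"
    return None


if __name__ == "__main__":
    # Toy stand-in models that threshold the first coefficient; NOT real SVCs.
    toy_models = [lambda frame, t=t: 1 if frame[0] > t else -1 for t in (0.0, 0.5, 1.0)]
    toy_frames = [(0.8, 0.1, 0.0), (0.8, 0.2, 0.1), (0.8, 0.0, 0.3)]
    score = gender_score(toy_models, toy_frames)
    print(score, decide(score))
```

Because each prediction is +1 or -1, the averaged score lies in [-1, 1], and its magnitude indicates how strongly the per-frame models agree.
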
SVC tool: SVM software written by Dr. Steve Gunn.

SVC Model

SVC Model Performance:
• SVC plot for the first 2 dimensions of the training MFCC features;
• Top: frame 1 from 25 speakers (blue: male; red: female);
• Bottom: frame 100 from 25 speakers (blue: male; red: female);
• Examined one by one, all 100 frames are classified with 100% accuracy; only two frames are shown here as examples;
• Overall, the SVC model is highly accurate on the training frames.

SVC Text Independent Classification

• 25 test voices used;
• MFCC: 100 frames per voice;
• 21 voices detected correctly; 3 male voices detected as female, and 1 female voice detected as male;
• Overall gender detection accuracy rate: 84%.

Test Voice  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
Detected    m  m  f  m  f  m  m  m  f  f  f  m  m  f  m  f  m  f  f  m  m  f  f  f  m
Label       m  m  f  m  f  m  m  m  f  f  m  m  m  f  m  m  m  m  f  f  m  f  f  f  m

SVC Text Dependent Classification

• 25 training voices used;
• MFCC: 100 frames per voice;
• All 25 voices detected correctly;
• Overall gender detection accuracy rate: 100%.

Test Voice  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
Detected    m  m  f  m  f  m  m  m  f  f  m  m  m  f  m  m  m  m  f  f  m  f  f  f  m
Label       m  m  f  m  f  m  m  m  f  f  m  m  m  f  m  m  m  m  f  f  m  f  f  f  m

GMM Implementation

GMM Model Training:
• The 25 x 800 frames of training MFCC were divided into two groups, male and female;
• For each MFCC vector, only the first 2 coefficients were used (adding more coefficients, or using other combinations with delta MFCC and delta-delta MFCC coefficients, did not improve the overall classification performance);
• A male GMM and a female GMM were trained from the two MFCC groups;
• GMM parameters: 2 dimensions, 5 mixtures, diagonal covariance, 20 EM iterations (selected based on overall classification results).
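
To make the diagonal-covariance ("diag") setting concrete, the sketch below evaluates the log-density of such a Gaussian mixture at one feature vector. It is a plain-Python illustration rather than the Netlab routine used in the project; the function name and the parameter layout (weights, means, and variances as nested lists) are assumptions, all weights are assumed positive, and the EM fitting step is not shown.

```python
import math
from typing import Sequence


def diag_gmm_logpdf(x: Sequence[float],
                    weights: Sequence[float],
                    means: Sequence[Sequence[float]],
                    variances: Sequence[Sequence[float]]) -> float:
    """Log-density of a diagonal-covariance Gaussian mixture at point x."""
    component_logs = []
    for w, mu, var in zip(weights, means, variances):
        # With a diagonal covariance the dimensions factorize, so the
        # log-density of one component is a sum over dimensions.
        log_p = math.log(w)
        for xd, md, vd in zip(x, mu, var):
            log_p += -0.5 * (math.log(2.0 * math.pi * vd) + (xd - md) ** 2 / vd)
        component_logs.append(log_p)
    # Log-sum-exp over components for numerical stability.
    peak = max(component_logs)
    return peak + math.log(sum(math.exp(v - peak) for v in component_logs))


if __name__ == "__main__":
    # Made-up parameters for illustration only (1 mixture each, 2 dimensions).
    male = ([1.0], [[-15.0, 1.0]], [[4.0, 1.0]])
    female = ([1.0], [[-18.0, -1.0]], [[4.0, 1.0]])
    frame = (-15.5, 0.8)
    log_ratio = diag_gmm_logpdf(frame, *male) - diag_gmm_logpdf(frame, *female)
    print("male" if log_ratio > 0 else "female")
```

Evaluating a frame under two such models and taking the difference of the log-densities is equivalent to comparing the ratio of the two densities against 1.
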
GMM Classification:
• Text independent: all 400 frames of testing MFCC per speaker, using the first 2 coefficients;
• Text dependent: all 800 frames of training MFCC per speaker, using the first 2 coefficients;
• Each frame was fed into the male GMM and the female GMM; the ratio of the two resulting values decides the gender;
• The 400 or 800 per-frame gender predictions were simply averaged to give the final gender score.

GMM tool: Netlab software written by Dr. Ian Nabney and Dr. Christopher Bishop.

GMM Model

GMM Model Performance:
• PDF plots (top) of the first 2 dimensions of the MFCC features (left: male; right: female);
• Combined 3D PDF plot (bottom) for the first 2 dimensions of the MFCC features based on the GMMs (red: male; blue: female);
• The Gaussian peaks clearly show the difference between male and female, but the Gaussian bodies overlap;
• Overall, the GMM model is not as good as the SVC model.

GMM Text Independent Classification

• 25 test voices used;
• MFCC: 400 frames per voice;
• 21 voices detected correctly; 3 male voices detected as female, and 1 female voice detected as male;
• Overall gender detection accuracy rate: 84%.

Test Voice  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
Detected    m  m  f  m  f  m  m  m  f  f  f  m  m  f  m  f  m  f  f  f  m  f  m  f  m
Label       m  m  f  m  f  m  m  m  f  f  m  m  m  f  m  m  m  m  f  f  m  f  f  f  m

GMM Text Dependent Classification

• 25 training voices used;
• MFCC: 800 frames per voice;
• 22 voices detected correctly; 2 male voices detected as female, and 1 female voice detected as male;
• Overall gender detection accuracy rate: 88%.

Test Voice  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
Detected    m  m  f  m  f  m  m  m  f  f  f  m  m  f  m  f  m  m  f  f  m  f  m  f  m
Label       m  m  f  m  f  m  m  m  f  f  m  m  m  f  m  m  m  m  f  f  m  f  f  f  m

Result Summary

                                 SVC                 GMM
Model                            100% accuracy       PDF peaks clearly separated; bodies overlap to some degree
Text-independent classification  84% accuracy        84% accuracy
Text-dependent classification    100% accuracy       88% accuracy

Conclusion

• The SVC model itself is highly accurate, and hence has more potential than the GMM model in voice-based gender classification, and possibly in other classification applications;
• For text-dependent classification, the SVC could be the best choice;
• For text-independent classification, the SVC is one of the choices.
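
The accuracy figures quoted on the four classification slides can be re-derived from their per-voice Detected and Label rows. The sketch below does that arithmetic; the strings are transcribed from those rows (voices 1 to 25, left to right), and the helper name `summarize` is an assumption for illustration.

```python
# Ground-truth labels for test voices 1..25 (15 male, 10 female).
LABEL = "mmfmfmmmffmmmfmmmmffmfffm"

# Detected genders per experiment, transcribed from the slides.
DETECTED = {
    "SVC text independent": "mmfmfmmmfffmmfmfmffmmfffm",
    "SVC text dependent":   "mmfmfmmmffmmmfmmmmffmfffm",
    "GMM text independent": "mmfmfmmmfffmmfmfmfffmfmfm",
    "GMM text dependent":   "mmfmfmmmfffmmfmfmmffmfmfm",
}


def summarize(detected: str, label: str = LABEL):
    """Return (correct, male_as_female, female_as_male, accuracy_percent)."""
    if len(detected) != len(label):
        raise ValueError("detected and label rows must have the same length")
    pairs = list(zip(detected, label))
    correct = sum(d == l for d, l in pairs)
    male_as_female = sum(l == "m" and d == "f" for d, l in pairs)
    female_as_male = sum(l == "f" and d == "m" for d, l in pairs)
    return correct, male_as_female, female_as_male, 100.0 * correct / len(label)


if __name__ == "__main__":
    for name, row in DETECTED.items():
        correct, m_as_f, f_as_m, acc = summarize(row)
        print(f"{name}: {correct}/25 correct, "
              f"{m_as_f} male as female, {f_as_m} female as male, {acc:.0f}%")
```

For the GMM text-dependent case the rows give 22 of 25 voices correct, i.e. 88%.
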
Future Work

• Investigate why such an accurate SVC model does not perform well for text-independent gender classification;
• Explore voice features that might improve the SVC text-independent classification performance;
• Compare SVC performance with other classification models, such as HMM and neural networks (NNW);
• Examine the SVC model for other voice-based classification applications, such as age and spoken language.

References

• Steve R. Gunn, "Support Vector Machines for Classification and Regression," Technical Report, University of Southampton, 1998.
• W. M. Campbell, J. P. Campbell, T. P. Gleason, D. A. Reynolds, and T. R. Leek, "High-Level Speaker Verification With Support Vector Machines," ICASSP, 2004.
• And others (to be listed in the final report).