Recommendations Based on Speech Classification
(and examples of what recommender systems can learn from signal processing)
Christian Müller
German Research Center for Artificial Intelligence
International Computer Science Institute, Berkeley, CA

Overview
- Speech as a source of information for non-intrusive user modeling
- Speech/signal processing | Take-away messages
  - Vocal aging -> features for speaker age recognition | Knowledge-driven feature selection
  - GMM/SVM supervector approach for acoustic speech features | Classification methods for independent "bag of observations" features
  - Detection task and pseudo-NIST evaluation procedure | Valid application-independent evaluation
  - Rank and polynomial rank normalization | Feature space warping normalization
- Conclusions

Speech as a Source for Non-Intrusive UM
Information about the user:
- inference from sensors (not intrusive); speech = sensor (speaker classification)
- explicit statement (intrusive)
User model -> adaptive speech dialog system, which:
- adapts its dialog behavior (e.g. detailed map with shops vs. arrows)
- provides recommendations (e.g. a different route to the gate)
Example prompt: "Now it's time to get to gate 38."

Speaker Classification Systems
Input: audio segment (telephone quality) -> system
- Cognitive Load: Best Research Paper Award UM 2001
- Age and Gender: Voice Award 2007; Telekom live operation 2009
- Language: 14 languages + dialects; NIST evaluation 2007
- Identity: project with BKA 2009; NIST* Evaluation 2008
- Acoustic Events: project with VW 2008; Interspeech 2008

Recommendations Based on Speech Classification
- Recommended: products, media, services, actions, strategies
- Based on: age, gender, emotions, language, dialect, accent, identity, acoustic events

Product Recommendations Based on Age and Gender
[Embedded video; not available in this text version]

Product Recommendations Based on Age and Gender
[Screenshot; recognized class shown: AM]
Michael Feld and Christian Müller. Speaker Classification for Mobile Devices. In Proceedings of the 2nd IEEE International Interdisciplinary Conference on Portable Information Devices (Portable 2008). 2008

How can you find features for building your models by explicitly studying the underlying phenomena?
Proposing knowledge-driven feature selection on the example of features for speaker age recognition

Speaker Classification as an Interdisciplinary Area of Research
Contributing fields: Speech Technology / Artificial Intelligence, Phonetics, Voice Pathology, Software Technology
- Which are the manifestations of age (and gender) in the speaker's voice and speaking style?
- How can the age (and gender) of a speaker be recognized automatically?
- Which are the requirements of a speaker classification system, and how can they be solved on the implementation layer?

Impact of Aging on the Human Speech Production: speech breathing
- thorax: stiffer
- lungs: lighter, less elastic, lower position
- effects: lower expiratory volume, more speech pauses, lower amplitude

Impact of Aging on the Human Speech Production: laryngeal area
- larynx: calcification and ossification
- vocal folds: loss of tissue, stiffening
- effects: rise of fundamental frequency (in men), reduced voice quality

Impact of Aging on the Human Speech Production: supralaryngeal area
- facial bones and muscles: degeneration, reduced elasticity
- effects: imprecise articulation, for example vowel centralization

Impact of Aging on the Human Speech Production: neurological effects
- loss of tissue in the cortex
- reduced performance of the neuronal transmitters
- effects: reduced articulation rate, defective coordination between the articulators, vowel centralization

Development of F0 in Men / Women
[Figure: fundamental frequency F0 (Hz, axis from 90 to 170) over age in years (20 to 90), one curve each for men and women; annotations: "only non-smokers", "smokers and nonsmokers". Source: Linville (2001)]

Age Classes
  Class      Age             Female  Male
  Children   <= 13 years     CF      CM
  Youth      14 - 19 years   YF      YM
  Adults     20 - 64 years   AF      AM
  Seniors    >= 65 years     SF      SM

Features
- fundamental frequency (pitch)
  - mean: pitch_mean
  - standard deviation: pitch_stddev
  - min, max and difference: pitch_min / pitch_max / pitch_diff
- voice quality
  - shimmer: shim_l / shim_ldb / shim_apq3 / shim_apq11 / shim_ddp
  - jitter: jitt_l / jitt_la / jitt_rap / jitt_ppq / jitt_ddp
  - harmonics-to-noise ratio: harm_mean / harm_stddev
- articulation rate: ar_rate
- speech pauses: pause_num / pause_dur

Features (grouping)
- voice: fundamental frequency (mean, standard deviation, min, max and difference); voice quality (shimmer, jitter, harmonics-to-noise ratio)
- speaking style: articulation rate, speech pauses

Example Results
[Figures: per-class values for CF, CM, YF, YM, AF, AM, SF, SM of jitter (high jitter value = low voice quality), fundamental frequency (F0), and speech pauses; class groupings shown: C_YF, AF, SF, YM_AM_SM]
Christian Müller. Zweistufige kontextsensitive Sprecherklassifikation am Beispiel von Alter und Geschlecht [Two-layered Context-Sensitive Speaker Classification on the Example of Age and Gender]. AKA, Berlin, 2006

Hierarchical Feature Model
Levels, ordered from high-level features (learned characteristics) to low-level features (physical characteristics):
- semantics: ?
- dialog (exchange between speakers A and B)
- idiolect: <s> how shall I say this <c> <s> yeah I know...
- phonetics: /S/ /oU/ /m/ /i:/ /D/ /&/ /m/ / / /n/ /i:/ ...
- prosody
- spectrum

How can your features be modeled, assuming that they
- are multi-dimensional
- represent repeating observations of the same kind
- can be assumed to be independent ("bag" of observations)
Proposing the GMM/SVM supervector approach on the example of frame-by-frame acoustic features

General Classification Scheme
Preprocessing (e.g. channel compensation) -> Feature Extraction -> Classification (multilayer perceptron networks, support-vector machines) -> Fusion; Top-Down Knowledge
[Diagram annotation: "not addressed in this talk"]
[Figure: multilayer perceptron with inputs x1, x2, hidden units y1, y2, output zk, and weights wji, wkj]

Modeling Acoustics and Prosodics
Feature levels: semantics, dialog, idiolect, phonetics, prosody, spectrum
Annotation: no ASR

Generative Approach: Gaussian Mixture Model (GMM)
- training: "emergency vehicle" data -> feature extraction -> "emergency vehicle" model (probability density)
- test: unknown frame of speech -> feature extraction -> "emergency vehicle" model -> avg. likelihood over all frames for class "emergency vehicle"

Generative Approach: Gaussian Mixture Model (GMM) with background model
- test: unknown frame of speech -> feature extraction -> "emergency vehicle" model and background model -> avg. log likelihood ratio over all frames for class "emergency vehicle"

A Mixture of Gaussians
- Means, variances, and mixture weights are optimized in training
- [Figure: black line = mixture of 3 Gaussians]

Discriminative Method: Support Vector Machine (SVM)
- training: "em. vehic." (1) and "not em. vehic." (-1) data -> feature extraction -> "em. vehic." model
- Features are transformed into a higher-dimensional space where the problem is linear
- The discriminating hyperplane is learned by maximizing the margin between the classes
- Trade-off between training error and width of margin
- Model is stored in the form of "support vectors" (data points on the margin)

Discriminative Method: Support Vector Machine (SVM)
- test: unknown segment -> feature extraction -> score (distance to hyperplane)
- Discriminative methods have been shown to be superior to generative methods for similar tasks
- Feature vectors have to be of the same length (sensitive to variable segment lengths)
- Solutions:
  - feature statistics calculated over the entire utterance
  - fixed portion of the segment
  - sequential kernels

GMM/SVM Supervector Approach
- feature extraction -> Gaussian means (MAP adapted) -> SVM
- Combines the discriminative power of SVMs with the length independence of GMMs
- Very successful with similar tasks such as speaker recognition
- GMM is trained using MAP adaptation

Evaluation Results
[Figure: bar chart of DCF (axis from 0 to 25) for GMM-UBM and GMM-SVM on the conditions "entire set", "matched", and "unmatched"; bar values shown: 23.41, 19.55, 14.58, 10.22, 8.09, 3.45]
Christian Müller, Joan-Isaac Biel, Edward Kim, and Daniel Rosario, "Speech-overlapped Acoustic Event Detection for Automotive Applications," in Proceedings of the Interspeech 2008, Brisbane, Australia, 2008.

How can you evaluate your multi-class models independently from the given application? How can you establish an appropriate evaluation procedure in order to obtain valid results?
Proposing the detection task and the "pseudo NIST" evaluation procedure on the example of acoustic event detection and speaker age recognition.

Background
- With multi-class recognition problems, many test/analysis methods are very application specific, e.g. confusion matrices.
  -> We want a method that allows results to be generalized across a large set of applications.
- With home-grown databases, parameter tuning on the evaluation set often compromises the validity of the results/inferences.
  -> We want a fair "one shot" evaluation.

The Detection Task
[Figure: the system is asked "emergency vehicle?" and answers "yes" with score 1.324326]
Given a speech segment (s) and an acoustic event to be detected (target event, E_T):
- the task is to decide whether E_T is present in s (yes or no)
- the system's output shall also contain a score indicating its confidence, with more positive scores indicating greater confidence

Terminology
- Segment class: e.g. segment event, segment age class; the ground truth (not known)
- Target: the hypothesized class
- Trial: a combination of segment and target

Evaluation
[Figure: one test segment is presented with the targets "emergency vehicle", "music", "talking", "laughing", "phone", and "no event"; the system answers yes or no for each target and outputs a score (values shown: 1.32432, -0.3212, 1.8463, -2.5773, 0.00132, 2.20122)]
- The system performance is evaluated by presenting it with a set of trials.
- Each test segment is used for multiple trials.
- The absence of all targets is explicitly included.

Type of Errors
- MISS: segment "em. vehic.", target "em. vehic."? -> system answers "no"
- FALSE ALARM: segment "em. vehic.", target "phone"? -> system answers "yes"

Decision-Error Tradeoff
[Figure: misses plotted against false alarms; the "equal error rate" point is marked on a dotted line]
- Selecting an operating point (decision threshold) along the dotted line trades misses off against false alarms.
- The optimal operating point is application dependent.
- Low false alarm rates are desirable for most applications.

Decision Cost Function
\[ C(E_T, E_N) = C_{Miss} \cdot P_{Target} \cdot P_{Miss}(E_T) + C_{FA} \cdot (1 - P_{Target}) \cdot P_{FA}(E_T, E_N) \]
where E_T and E_N are the target and non-target events, and C_{Miss}, C_{FA} and P_{Target} are application model parameters.
- The application parameters for EER are: C_{Miss} = C_{FA} = 1 and P_{Target} = 0.5
- Weighted sum of misses and false alarms using variable costs and priors.
- Application model parameters are selected according to the application.

Example DET-Plot
[Figure: DET plot, miss probability versus false alarm probability]
Christian Müller, Joan-Isaac Biel, Edward Kim, and Daniel Rosario, "Speech-overlapped Acoustic Event Detection for Automotive Applications," in Proceedings of the Interspeech 2008, Brisbane, Australia, 2008.
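
Sketch: computing the decision cost for one set of trials
The following is a minimal sketch of how the decision cost function could be evaluated at a fixed decision threshold, not the scoring software used in the evaluation. It pools all trials of one target/non-target pair; the function name `detection_cost`, the trial representation as (score, is_target) pairs, and the example scores are assumptions made for illustration. The default parameters are the EER setting from the slide (C_Miss = C_FA = 1, P_Target = 0.5).

```python
def detection_cost(trials, threshold, c_miss=1.0, c_fa=1.0, p_target=0.5):
    """Return (p_miss, p_fa, cost) for a list of (score, is_target) trials.

    A trial is accepted ("yes") when its score is >= threshold.
    """
    target_scores = [score for score, is_target in trials if is_target]
    nontarget_scores = [score for score, is_target in trials if not is_target]
    if not target_scores or not nontarget_scores:
        raise ValueError("need at least one target and one non-target trial")

    # Miss: a target trial that the system rejects.
    p_miss = sum(s < threshold for s in target_scores) / len(target_scores)
    # False alarm: a non-target trial that the system accepts.
    p_fa = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)

    cost = c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa
    return p_miss, p_fa, cost


# Hypothetical scores, for illustration only (not results from the talk).
trials = [
    (1.3, True), (0.2, True), (-0.5, True), (2.1, True),
    (1.8, False), (-0.3, False), (-2.6, False), (-0.1, False),
]
p_miss, p_fa, cost = detection_cost(trials, threshold=0.0)
print(p_miss, p_fa, cost)
```

Sweeping `threshold` over the observed scores and plotting `p_miss` against `p_fa` gives the points of a decision-error tradeoff curve; application-specific costs and priors replace the defaults.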

Example Cost Chart
COSTS: (At, An)
  An              C      YF     YM     AF     AM     SF     SM
  C               --     0.220  0.092  0.145  0.083  0.133  0.069
  YF              0.166  --     0.081  0.201  0.080  0.198  0.070
  YM              0.076  0.084  --     0.130  0.203  0.108  0.188
  AF              0.088  0.161  0.110  --     0.095  0.219  0.082
  AM              0.064  0.083  0.254  0.139  --     0.105  0.228
  SF              0.096  0.150  0.100  0.249  0.091  --     0.095
  SM              0.065  0.085  0.238  0.117  0.246  0.118  --
  Avg Cost (At)   0.092  0.130  0.146  0.164  0.133  0.147  0.122
  Avg Cost        0.133
Acoustic GMM/SVM supervector system on 7-class age task

Pseudo NIST Evaluation Procedure
1. ERL provided development and evaluation data as representative as possible for the application.
2. Three months before the evaluation, ICSI was provided with the development data.
3. At a pre-determined date, the blind evaluation data was provided to ICSI for processing.
4. The system's output was submitted to ERL in NIST format.
5. ERL downloaded the scoring software from NIST's website and made the modifications required by the changed labels.
6. ERL ran the software on the submitted system output.
7. The results were then disclosed to ICSI along with the keys (truth) for further analysis.
--> Fair "one-shot" evaluation, no parameter tuning on the evaluation set.

How can you normalize your features in order to obtain a uniform scale and a uniform distribution?
Proposing rank normalization and polynomial rank normalization

Background
- Fundamental frequency (pitch): 75-200 Hz
- Jitter: 0.001324 PPQ
--> implicit feature weighting

Mean/Variance Normalization
\[ a_i = \frac{v_i - \min(v_i)}{\max(v_i) - \min(v_i)} \]
[Figure: normalized values on axes from -1 to 1]
- uniform scale
- non-uniform distribution

Rank-Normalization
[Figure: feature values from the background model are sorted into a lookup table and each feature is replaced by its rank; example table: 0 -> 0, 0.01 -> 0.25, 0.06 -> 0.5, 0.13 -> 0.75, 0.29 -> 1]
- create an ordered list of values using background data
- rank = position in list / number of values
- no occurrence is mapped to 0

Rank Normalization
[Figure: normalized values on axes from -1 to 1]
(+) uniform distribution
(-) large three-dimensional lookup tables
(-) linear interpolation for unseen values (larger values? smaller values?)

Polynomial Rank Normalization
- use ranks to train a polynomial
- apply the polynomial instead of look-up tables
- better interpolation
- no need to store look-up tables
Christian Müller and Joan-Isaac Biel. The ICSI 2007 Language Recognition System. In Proceedings of the Odyssey 2008 Workshop on Speaker and Language Recognition. Stellenbosch, South Africa, 2008

Conclusions
- Speech as a source of information for non-intrusive user modeling
- Speech/signal processing | Take-away messages
  - Vocal aging -> features for speaker age recognition | Knowledge-driven feature selection
  - GMM/SVM supervector approach for acoustic speech features | Classification methods for independent "bag of observations" features
  - Detection task and pseudo-NIST evaluation procedure | Valid application-independent evaluation
  - Rank and polynomial rank normalization | Feature space warping normalization

Thank you!
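
Supplement: sketch of lookup-table rank normalization
As a supplement to the rank normalization slides, the following minimal sketch shows one way the lookup-table variant could work for a single feature: background values are sorted, each value gets the rank position / number of values, and unseen values are linearly interpolated. The function names, the `floor` parameter (the value treated as "no occurrence"), and the handling of values beyond the table ends are assumptions; the polynomial variant is not covered here.

```python
from bisect import bisect_right


def build_rank_table(background, floor=0.0):
    """Return sorted (value, rank) points built from background data.

    rank = position in the ordered list / number of values;
    `floor` (assumed to mean "no occurrence") is mapped to rank 0.
    """
    ordered = sorted(background)
    n = len(ordered)
    points = [(floor, 0.0)]
    for value in sorted(set(ordered)):
        if value > floor:
            # bisect_right counts the background values <= value.
            points.append((value, bisect_right(ordered, value) / n))
    return points


def rank_normalize(x, points):
    """Map x to its rank, interpolating linearly between table entries."""
    values = [value for value, _ in points]
    if x <= values[0]:
        return 0.0
    if x >= values[-1]:
        return points[-1][1]  # values above the table are clipped to the top rank
    hi = bisect_right(values, x)  # index of the first table value greater than x
    (x0, r0), (x1, r1) = points[hi - 1], points[hi]
    return r0 + (r1 - r0) * (x - x0) / (x1 - x0)


# Background values taken from the example table on the slide.
points = build_rank_table([0.13, 0.01, 0.29, 0.06])
print(points)
print(rank_normalize(0.06, points), rank_normalize(0.035, points))
```

Clipping at the table ends is one possible answer to the "larger values? smaller values?" question on the slide; other choices, such as extrapolation, are equally conceivable.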