Discriminative Models for Speech Recognition
M.J.F. Gales, Cambridge University Engineering Department, 2007
Presented by: Fang-Hui Chu

Outline
• Introduction
• Hidden Markov Models
• Discriminative Training Criteria
  – Maximum Mutual Information
  – Minimum Classification Error
  – Minimum Bayes’ Risk
  – Techniques to improve generalization
• Large Margin HMMs
• Maximum Entropy Markov Models
• Conditional Random Field
• Dynamic Kernels
• Conditional Augmented Models
• Conclusions

Automatic Speech Recognition
• The task of speech recognition is to determine the identity of a given observation sequence $O$ by assigning the recognized word sequence $W$ to it
• The decision is to find the identity with the maximum a posteriori (MAP) probability $P(W \mid O)$
  – This is the so-called Bayes decision (or minimum-error-rate) rule
  – The distributions in this rule have to be estimated

$$W^* = \arg\max_W P(W \mid O) = \arg\max_W \frac{p(O \mid W)\,P(W)}{P(O)} = \arg\max_W p(O \mid W)\,P(W)$$

  – $p(O \mid W)$: acoustic model (Gaussian)
  – $P(W)$: language model (multinomial), assumed to be given
• A certain parametric representation of these distributions is needed
• HMMs are widely adopted for acoustic modeling

Acoustic Modeling (1/2)
• In the development of an ASR system, acoustic modeling is always an indispensable and crucial ingredient
• The purpose of acoustic modeling is to provide a method to calculate the likelihood of a speech utterance given a word sequence, $p(O \mid W)$
• In principle, the word sequence can be decomposed into a sequence of phone-like units (acoustic models)
  – Each unit is normally represented by an HMM and can be estimated from a corpus of training utterances
  – Traditionally, maximum likelihood (ML) training is employed for this estimation

Acoustic Modeling (2/2)
• Besides ML training, the acoustic model can alternatively be trained with discriminative training criteria
  – MCE training, MMI training, MPE training, etc.
  – In MCE training, an approximation to the error rate on the training data is optimized
  – The MMI and MPE algorithms were
developed in an attempt to correctly discriminate the recognition hypotheses for the best recognition results
• However:
  – The underlying acoustic model is still generative, with the associated constraints on the state and transition probability distributions
  – Classification is based on Bayes’ decision rule

Introduction
• Initially these discriminative criteria were applied to small vocabulary speech recognition tasks
• A number of techniques were then developed to enable their use for LVCSR tasks
  – I-smoothing
  – Language model weakening
  – The use of lattices to compactly represent the denominator score
• But the performance on LVCSR tasks is still not satisfactory for many speech-enabled applications
  – This has led to interest in discriminative (or direct) models for speech recognition, where the posterior of the word sequence given the observation, $P(W \mid O)$, is modeled directly

Hidden Markov Models
• HMMs are the standard acoustic model used in speech recognition

[Figure: HMM graphical model with states $q_t$, $q_{t+1}$ and observations $o_t$, $o_{t+1}$]

• The likelihood function is

$$p(O_{1:T} \mid w; \lambda) = \sum_{\mathbf{q}} P(\mathbf{q} \mid w) \prod_{t=1}^{T} p(o_t \mid q_t; \lambda)$$

$$p(o_t \mid q_t; \lambda) = \sum_{m=1}^{M} c_m \, \mathcal{N}(o_t; \mu_m, \Sigma_m)$$

  where the sum runs over the state sequences $\mathbf{q}$ allowed for $w$ and $\lambda$ denotes the model parameters
• The standard training of HMMs is based on Maximum Likelihood training

$$\mathcal{F}_{\text{ML}}(\lambda) = \frac{1}{R} \sum_{r=1}^{R} \log p\big(O^{(r)} \mid w_{\text{ref}}^{(r)}; \lambda\big)$$

  – This optimization is normally performed using Expectation Maximization

Discriminative Training Criteria
• The discriminative training criteria are more closely linked to minimizing the error rate than to maximizing the likelihood of generating the training data
• Three main forms of discriminative training have been examined
  – Maximum Mutual Information (MMI)
  – Minimum Classification Error (MCE)
  – Minimum Bayes’ Risk (MBR)
    • Minimum Phone Error (MPE)

Discriminative Training Criteria
• Maximum Mutual Information:
  – Maximizes the mutual information between the observed sequences and the models

$$\mathcal{F}_{\text{MMI}}(\lambda) = \frac{1}{R} \sum_{r=1}^{R} \log P\big(w_{\text{ref}}^{(r)} \mid O^{(r)}; \lambda\big) = \frac{1}{R} \sum_{r=1}^{R} \log \frac{p\big(O^{(r)} \mid w_{\text{ref}}^{(r)}; \lambda\big)\, P\big(w_{\text{ref}}^{(r)}\big)}{\sum_{w} p\big(O^{(r)} \mid w; \lambda\big)\, P(w)}$$

• Minimum Classification Error:
  – Based on a smooth function
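of the difference between the log-likelihood of the correct sequence and all other competing word sequences

$$\mathcal{F}_{\text{MCE}}(\lambda) = \frac{1}{R} \sum_{r=1}^{R} \left( 1 + \left[ \frac{P\big(w_{\text{ref}}^{(r)} \mid O^{(r)}; \lambda\big)}{\sum_{w \neq w_{\text{ref}}^{(r)}} P\big(w \mid O^{(r)}; \lambda\big)} \right]^{\varrho} \right)^{-1}$$

  where the exponent $\varrho$ controls the smoothness

[Second form of $\mathcal{F}_{\text{MCE}}$, written with $p(O_r \mid s_r)\,P(s_r)$ for the correct sequence $s_r$ and a sum of $p(O_r \mid u)\,P(u)$ over the competing sequences $u$; its exponents are not recoverable]

Discriminative Training Criteria
• Minimum Bayes’ Risk:
  – Rather than trying to model the correct distribution, the expected loss during inference is minimized

$$\mathcal{F}_{\text{MBR}}(\lambda) = \frac{1}{R} \sum_{r=1}^{R} \sum_{w} P\big(w \mid O^{(r)}; \lambda\big)\, L\big(w, w_{\text{ref}}^{(r)}\big)$$

  – A number of loss functions can be used:
    • 1/0 function, equivalent to a sentence-level loss function

$$L\big(w, w_{\text{ref}}^{(r)}\big) = \begin{cases} 1, & w \neq w_{\text{ref}}^{(r)} \\ 0, & w = w_{\text{ref}}^{(r)} \end{cases}$$

    • Word: the loss function is directly related to minimizing the expected Word Error Rate (WER)
    • Phone

Large Margin HMMs
• The simplest form of large margin training criterion can be expressed as maximizing [Li et al. 2005]

$$\mathcal{F}_{\text{mm}}(\lambda) = \frac{1}{R} \sum_{r=1}^{R} \min_{w \neq w_{\text{ref}}^{(r)}} \left\{ \log \frac{P\big(w_{\text{ref}}^{(r)} \mid O^{(r)}; \lambda\big)}{P\big(w \mid O^{(r)}; \lambda\big)} \right\}$$

  – This aims to maximize the minimum distance between the log-posterior of the correct label and those of all the incorrect labels
• Some properties are shared with the MMI and MCE criteria
  – A log-posterior cost function is used, as in the MMI criterion
  – The denominator term does not include an element from the correct label, in a similar fashion to the MCE criterion

Large Margin HMMs
• A couple of variants of large margin training
  – Soft margin training [Jinyu Li et al. 2006]

$$\mathcal{F}_{\text{hl}}(\lambda) = \frac{1}{R} \sum_{r=1}^{R} \left[ \varrho - \min_{w \neq w_{\text{ref}}^{(r)}} \left\{ \log \frac{P\big(w_{\text{ref}}^{(r)} \mid O^{(r)}; \lambda\big)}{P\big(w \mid O^{(r)}; \lambda\big)} \right\} \right]_{+}$$

  where $\varrho$ is the target margin and

$$[f(x)]_{+} = \begin{cases} f(x), & \text{if } f(x) > 0 \\ 0, & \text{otherwise} \end{cases}$$

  – Large margin GMM [F. Sha and L.K.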
Saul 2007]
    • The size of the margin is specified in terms of a loss function between the two sets of sequences

$$\mathcal{F}_{\text{ha}}(\lambda) = \frac{1}{R} \sum_{r=1}^{R} \left[ \max_{w \neq w_{\text{ref}}^{(r)}} \left\{ \mathcal{H}\big(w, w_{\text{ref}}^{(r)}\big) - \log \frac{P\big(w_{\text{ref}}^{(r)} \mid O^{(r)}; \lambda\big)}{P\big(w \mid O^{(r)}; \lambda\big)} \right\} \right]_{+}$$

Direct Models
• Direct modeling attempts to model the posterior probability $P(W \mid O)$ directly
• There are many potential advantages as well as challenges for direct modeling
  – The direct model can potentially make decoding simpler
  – The direct model allows for the potential combination of multiple sources of data in a unified fashion
    • Asynchronous and overlapping features can be incorporated formally
    • It will be possible to take advantage of supra-segmental features such as prosodic features, acoustic phonetic features, speaker style, rate of speech, and channel differences
  – However, joint estimation would require a large amount of parallel speech and text data (a challenge for data collection)

Direct Models
• The relationship between observations and states is reversed
  – Separate transition and observation probabilities are replaced with one function $p(s_t \mid o_t, s_{t-1})$
  – Directly modeling $p(s_t \mid o_t, s_{t-1})$ makes direct computation of $P(S \mid O)$ possible
• The model can also be conditioned flexibly on a variety of contextual features
  – Any computable property of the observation sequence can be used as a feature
  – The number of features at each time frame need not be the same

[Figure: graphical models over states $s_{t-1}$, $s_t$, $s_{t+1}$ and observations $o_{t-1}$, $o_t$, $o_{t+1}$]

Assumption:

$$p(s_t \mid o_t, \ldots, o_1, s_{t-1}, \ldots, s_1) = p(s_t \mid o_t, s_{t-1})$$

Maximum Entropy Markov Models
• Recently, McCallum et al.
(ICML 2000) modeled sequential processes using a direct model similar to the HMM in graphical structure and used exponential models for the transition-observation probabilities
  – Called the Maximum Entropy Markov Model (MEMM)
• Maximum Entropy (ME) modeling is used to model the conditional distributions $p(s_t \mid o_t, s_{t-1})$
  – ME modeling is based on the principle of avoiding unnecessary assumptions
  – The principle states that the modeled probability distribution should be consistent with the given collection of facts about itself and otherwise be as uniform as possible

Maximum Entropy Markov Models
• The mathematical interpretation of this principle results in a constrained optimization problem
  – Maximize the entropy of a conditional distribution $p(s_t \mid o_t, s_{t-1})$, subject to given constraints
  – Constraints represent the known facts about the model from statistics of the training data

Definition 1:

$$f_{\langle b, s \rangle}(c_t, s_t) = \begin{cases} 1, & \text{if } b(c_t) = 1 \text{ and } s_t = s \\ 0, & \text{otherwise} \end{cases}$$

In the MEMM, $c_t = \{o_t, s_{t-1}\}$.

Definition 2: $f_i$ is said to be activated by a given pair $(c_t, s_t)$ if $f_i(c_t, s_t) > 0$.

Maximum Entropy Markov Models
• These definitions allow us to introduce the constraints of the model

$$\forall f_i: \quad \tilde{E}(f_i) = E(f_i)$$

$$\tilde{E}(f_i) = \sum_{c \in C,\, s \in V} \tilde{p}(c, s)\, f_i(c, s) = \frac{1}{N} \sum_{j=1}^{N} f_i(c_j, s_j)$$

• The expected value of $f_i$ with respect to the model $p(s \mid c)$ is

$$E(f_i) = \sum_{c \in C,\, s \in V} \tilde{p}(c)\, p(s \mid c)\, f_i(c, s)$$

• Using Lagrange multipliers for constrained optimization, the desired probability distribution is given by the maximum of the function

$$\Lambda\big(p(s \mid c), \lambda\big) = H\big(p(s \mid c)\big) + \sum_i \lambda_i \big( E(f_i) - \tilde{E}(f_i) \big)$$

$$H\big(p(s \mid c)\big) = -\sum_{c \in C,\, s \in V} \tilde{p}(c)\, p(s \mid c) \log p(s \mid c)$$

Maximum Entropy Markov Models
• Finally, the solution of the objective function is given by the exponential model

$$p(s \mid c) = \frac{\exp\big( \sum_i \lambda_i f_i(c, s) \big)}{Z(c)}, \qquad Z(c) = \sum_{s \in V} \exp\Big( \sum_j \lambda_j f_j(c, s) \Big)$$

Reference
• [SAP06] Jeff Kuo and Yuqing Gao, “Maximum Entropy Direct Models for Speech Recognition”
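
Worked example
• The exponential model from the last Maximum Entropy Markov Models slide, together with the binary features of Definition 1, can be illustrated with a minimal Python sketch. The state names, the two predicates, the weights, and the explicit start state are all made-up assumptions for illustration; they do not come from the slides or from the cited paper, and no training of the weights is shown.

```python
import math


def make_feature(b, s):
    """Binary feature f_<b,s>(c, s_t): 1 if predicate b(c) holds and s_t == s, else 0."""
    return lambda c, s_t: 1.0 if b(c) and s_t == s else 0.0


def memm_posterior(weights, features, context, states):
    """Return p(s | c) for every s in `states` under the exponential model."""
    scores = {
        s: sum(w * f(context, s) for w, f in zip(weights, features))
        for s in states
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exp_scores = {s: math.exp(v - top) for s, v in scores.items()}
    z = sum(exp_scores.values())  # partition function Z(c), up to the shift
    return {s: e / z for s, e in exp_scores.items()}


def sequence_log_posterior(weights, features, observations, state_seq, states, start_state):
    """log P(S | O) as a sum of log p(s_t | o_t, s_{t-1}); start_state is an assumption."""
    log_p = 0.0
    prev = start_state
    for o_t, s_t in zip(observations, state_seq):
        context = (o_t, prev)  # c_t = {o_t, s_{t-1}}
        log_p += math.log(memm_posterior(weights, features, context, states)[s_t])
        prev = s_t
    return log_p


# Hypothetical toy setup: o_t is a scalar energy value, states are two labels.
states = ["sil", "speech"]
features = [
    make_feature(lambda c: c[0] > 0.5, "speech"),  # high energy favours "speech"
    make_feature(lambda c: c[1] == "sil", "sil"),  # staying in "sil"
]
weights = [2.0, 1.0]  # illustrative values for the multipliers lambda_i

posterior = memm_posterior(weights, features, (0.9, "sil"), states)
log_p = sequence_log_posterior(
    weights, features, [0.9, 0.2], ["speech", "sil"], states, start_state="sil"
)
print(posterior, log_p)
```

• Each call normalizes over the states for one context only, which is what lets $P(S \mid O)$ be accumulated frame by frame as a product of local posteriors.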