Discriminative Models for Speech Recognition

M.J.F. Gales
Cambridge University Engineering Department
2007
Presented by: Fang-Hui Chu
Outline
• Introduction
• Hidden Markov Models
• Discriminative Training Criteria
– Maximum Mutual Information
– Minimum Classification Error
– Minimum Bayes’ Risk
– Techniques to improve generalization
• Large Margin HMMs
• Maximum Entropy Markov Models
• Conditional Random Field
• Dynamic Kernels
• Conditional Augmented Models
• Conclusions
Automatic Speech Recognition
• The task of speech recognition is to determine the identity of a given
observation sequence O by assigning the recognized word sequence W to it
• The decision is to find the identity with maximum a
posteriori (MAP) probability P(W | O)
– The so-called Bayes decision (or minimum-error-rate) rule; the
distributions involved have to be estimated
W *  arg max P(W | O)  arg max
W
W
p(O | W ) P(W )
 arg max p(O | W ) P(W )
P(O)
W
Acoustic Language
Model
Gaussian
Model
is assumed to
be given
Multinomial
• A certain parametric representation of these distributions is needed
• HMMs are widely adopted for acoustic modeling
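As a concrete illustration of the decision rule, the sketch below picks the MAP word sequence from a few candidate hypotheses; all scores are invented for illustration:

```python
import math

# Invented scores for three candidate word sequences: acoustic
# log-likelihoods log p(O|W) and language-model log-probabilities log P(W)
candidates = {
    "the cat sat": (-120.4, math.log(0.02)),
    "the cat sad": (-121.0, math.log(0.001)),
    "a cat sat":   (-123.7, math.log(0.015)),
}

# P(O) is common to all hypotheses, so the MAP decision maximizes
# log p(O|W) + log P(W)
best = max(candidates, key=lambda w: sum(candidates[w]))
print(best)
```

In practice the max runs over a search graph or lattice rather than an explicit list, but the scoring is the same.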
Acoustic Modeling (1/2)
• In the development of an ASR system, acoustic modeling
is always an indispensable and crucial ingredient
• The purpose of acoustic modeling is to provide a method
to calculate the likelihood of a speech utterance
occurring given a word sequence, p(O | W )
• In principle, the word sequence can be decomposed into
a sequence of phone-like units (acoustic models)
– Each of which is normally represented by an HMM, and can be
estimated from a corpus of training utterances
– Traditionally, the maximum likelihood (ML) training can be
employed for this estimation
Acoustic Modeling (2/2)
• Besides ML training, the acoustic model can alternatively be
trained with discriminative training criteria
– MCE training, MMI training, MPE training, etc.
– In MCE training, an approximation to the error rate on the training
data is optimized
– The MMI and MPE algorithms were developed in an attempt to
correctly discriminate the recognition hypotheses for the best
recognition results
• However..
– The underlying acoustic model is still generative, with the
associated constraints on the state and transition probability
distributions
– Classification is based on Bayes’ decision rule
Introduction
• Initially these discriminative criteria were applied to small
vocabulary speech recognition tasks
• A number of techniques were then developed to enable
their use for LVCSR tasks
– I-smoothing
– Language model weakening
– The use of lattices to compactly represent the denominator score
• But the performance on LVCSR tasks is still not
satisfactory for many speech-enabled applications
– This has led to interest in discriminative (or direct) models for
speech recognition where the posterior of the word-sequence
given the observation, P(W | O), is directly modeled
Hidden Markov Models
• HMMs are the standard acoustic model used in speech
recognition
[Figure: HMM as a generative model — hidden states q_t, q_{t+1} emitting observations o_t, o_{t+1}]

• The likelihood function is

p(O_{1:T} | w; λ) = Σ_q P(q | w) Π_{t=1}^{T} p(o_t | q_t; λ^(w))

where each state output distribution is a Gaussian mixture model:

p(o_t | q_t; λ^(w)) = Σ_{m=1}^{M} c_m N(o_t; μ_m, Σ_m)
• The standard training of HMM is based on Maximum
Likelihood training
F_ML(λ) = (1/R) Σ_{r=1}^{R} log p(O^(r) | w_ref^(r); λ)
– This optimization is normally performed using Expectation
Maximization
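The sum over state sequences q can be computed efficiently with the forward algorithm. A minimal log-space sketch, with invented single-component GMM parameters (M = 1) for a toy two-state model:

```python
import numpy as np

def log_gmm(o, weights, means, variances):
    # log p(o_t | q_t): log of a diagonal-covariance Gaussian mixture
    lp = [np.log(c) - 0.5 * np.sum(np.log(2 * np.pi * v) + (o - m) ** 2 / v)
          for c, m, v in zip(weights, means, variances)]
    return np.logaddexp.reduce(lp)

def forward_loglik(O, log_A, log_pi, emit):
    # alpha[s] = log p(o_1..o_t, q_t = s); the final reduce sums over q
    alpha = log_pi + np.array([emit(O[0], s) for s in range(len(log_pi))])
    for o in O[1:]:
        alpha = np.array([np.logaddexp.reduce(alpha + log_A[:, s]) + emit(o, s)
                          for s in range(len(log_pi))])
    return np.logaddexp.reduce(alpha)

# Toy left-to-right model: 2 states, 1-D observations, M = 1 component
log_A = np.log(np.array([[0.7, 0.3], [1e-30, 1.0]]))
log_pi = np.log(np.array([1.0, 1e-30]))
params = [([1.0], [np.array([0.0])], [np.array([1.0])]),
          ([1.0], [np.array([3.0])], [np.array([1.0])])]
emit = lambda o, s: log_gmm(o, *params[s])

O = [np.array([0.1]), np.array([2.9]), np.array([3.1])]
loglik = forward_loglik(O, log_A, log_pi, emit)
print(loglik)
```

ML training (EM) re-estimates the Gaussian parameters from state occupancies derived from these same forward (and backward) quantities.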
Discriminative Training Criteria
• The discriminative training criteria are more closely
linked to minimizing the error rate, rather than
maximizing the likelihood of generating the training data
• Three main forms of discriminative training have been
examined
– Maximum Mutual Information (MMI)
– Minimum Classification Error (MCE)
– Minimum Bayes’ Risk (MBR)
• Minimum Phone Error (MPE)
Discriminative Training Criteria
• Maximum Mutual Information:
– Maximizes the mutual information between the observed
sequences and the models

F_MMI(λ) = (1/R) Σ_{r=1}^{R} log P(w_ref^(r) | O^(r); λ)
         = (1/R) Σ_{r=1}^{R} log [ p(O^(r) | w_ref^(r); λ) P(w_ref^(r)) / Σ_{w∈W} p(O^(r) | w; λ) P(w) ]
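Given per-hypothesis joint scores log[p(O^(r)|w; λ)P(w)] for one utterance (invented numbers below), the MMI contribution is just the reference score minus a log-sum over all hypotheses:

```python
import numpy as np

# Invented joint log-scores log[p(O^(r)|w; lambda) P(w)] for the
# hypotheses of one training utterance; "ref" is the reference
scores = {"ref": -100.0, "hyp1": -103.0, "hyp2": -105.0}

# Per-utterance MMI contribution: numerator (reference) score minus the
# log-sum over all competing word sequences
den = np.logaddexp.reduce(np.array(list(scores.values())))
f_mmi = scores["ref"] - den
print(f_mmi)  # approaches 0 as the reference dominates
```

In LVCSR the denominator sum is taken over a lattice rather than an explicit hypothesis list.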
• Minimum Classification Error:
– Based on a smooth function of the difference between the
log-likelihood of the correct sequence and all other competing word
sequences
F_MCE(λ) = (1/R) Σ_{r=1}^{R} [ 1 + ( P(w_ref^(r) | O^(r); λ) / Σ_{w≠w_ref^(r)} P(w | O^(r); λ) )^ρ ]^(-1)

– In its classical sigmoid form, with an η-weighted average over the M−1 competing classes:

F_MCE(Λ, γ) = Σ_{r=1}^{R} [ 1 + exp( γ { log p(O_r | s_r) P(s_r) − log [ (1/(M−1)) Σ_{u≠s_r} ( p(O_r | u) P(u) )^η ]^(1/η) } ) ]^(-1)
Discriminative Training Criteria
• Minimum Bayes’ Risk:
– Rather than trying to model the correct distribution, the expected
loss during inference is minimized
F_MBR(λ) = (1/R) Σ_{r=1}^{R} Σ_w P(w | O^(r); λ) L(w, w_ref^(r))
– A number of loss functions:
• 1/0 function
– equivalent to a sentence-level loss function

L(w, w_ref^(r)) = 1 if w ≠ w_ref^(r); 0 if w = w_ref^(r)
• Word
– the loss function directly related to minimizing the expected
Word Error Rate (WER)
• Phone
– the loss function leading to Minimum Phone Error (MPE) training
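The word-level loss is the Levenshtein distance between word sequences; with invented posteriors for a few hypotheses, the MBR expected loss for one utterance can be sketched as:

```python
def word_error(hyp, ref):
    # Levenshtein distance between word sequences (the Word loss),
    # single-row dynamic programming
    h, r = hyp.split(), ref.split()
    d = list(range(len(r) + 1))
    for i, hw in enumerate(h, 1):
        prev, d[0] = d[0], i
        for j, rw in enumerate(r, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                   prev + (hw != rw))
    return d[len(r)]

# Invented posteriors P(w|O^(r)) for the hypotheses of one utterance
ref = "the cat sat"
posteriors = {"the cat sat": 0.7, "the cat sad": 0.2, "a cat sat": 0.1}
f_mbr = sum(p * word_error(w, ref) for w, p in posteriors.items())
print(f_mbr)  # 0.7*0 + 0.2*1 + 0.1*1 = 0.3
```

Swapping the word-level distance for a phone-level one gives the MPE flavour of the same criterion.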
Large Margin HMMs
• The simplest form of large margin training criterion can
be expressed as maximizing [Li et al. 2005]
F_mm(λ) = (1/R) Σ_{r=1}^{R} min_{w≠w_ref^(r)} log [ P(w_ref^(r) | O^(r); λ) / P(w | O^(r); λ) ]
– This aims to maximize the minimum distance between the log-posterior of the correct label and all the incorrect labels
• Some properties related to both the MMI and MCE
criterion
– A log-posterior cost function is used as in the MMI criterion
– The denominator term used with this approach does not include
an element from the correct label in a similar fashion to the MCE
criterion
Large Margin HMMs
• A couple of variants of large margin training
– Soft margin training [Jinyu Li et al. 2006]
F_hl(λ) = (1/R) Σ_{r=1}^{R} ⌊ min_{w≠w_ref^(r)} log [ P(w_ref^(r) | O^(r); λ) / P(w | O^(r); λ) ] ⌋

where ⌊f(x)⌋ = f(x) if f(x) < 0, and 0 otherwise
– Large margin GMM [F. Sha and L.K. Saul 2007]
• The size of the margin is specified in terms of a loss function
between the two sets of sequences
F_ha(λ) = (1/R) Σ_{r=1}^{R} max_{w≠w_ref^(r)} { H(w, w_ref^(r)) − log [ P(w_ref^(r) | O^(r); λ) / P(w | O^(r); λ) ] }
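The floor operator of the soft margin criterion can be sketched directly; with invented log-posteriors, a correctly classified utterance (positive minimum margin) contributes nothing:

```python
def floor0(x):
    # soft margin operator: pass through only negative margins, so
    # well-classified utterances contribute nothing to the criterion
    return x if x < 0 else 0.0

# Invented log-posteriors for one utterance: reference and two competitors
log_post = {"ref": -1.0, "hyp1": -2.5, "hyp2": -4.0}
margin = min(log_post["ref"] - lp
             for w, lp in log_post.items() if w != "ref")
contribution = floor0(margin)
print(margin, contribution)  # margin 1.5 > 0, so contribution is 0.0
```

This is what gives soft margin training its focus on utterances near or across the decision boundary.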
Direct Models
• Direct modeling attempts to model the posterior probability P(W | O)
directly
• There are many potential advantages as well as challenges
for direct modeling
– The direct model can potentially make decoding simpler
– The direct model allows for the potential combination of multiple sources
of data in a unified fashion
• Asynchronous and overlapping features can be incorporated formally
• It will be possible to take advantage of supra-segmental features like
prosodic features, acoustic phonetic features, speaker style, rate of speech,
channel differences
– However, joint estimation would require a large amount of parallel
speech and text data (a challenge for data collection)
Direct Models
• The relationship between observations and states is
reversed
– Separate transition and observation probabilities are replaced
with one function p(s_t | o_t, s_{t-1})
– Directly modeling p(s_t | o_t, s_{t-1}) makes direct computation of P(S | O)
possible
• The model can also be conditioned flexibly on a variety
of contextual features
– Any computable property of the observation sequence can be
used as a feature
– The number of features at each time frame need not be the same
st 1
st
st 1
st 1
st
Assumption:
pst | o t ...o1 , st 1...s1 
ot 1
ot
ot 1
ot 1
ot
 pst | o t , st 1 
Maximum Entropy Markov Models
• Recently, McCallum et al. (ICML 2000) modeled sequential
processes using a direct model similar to the HMM in
graphical structure and used exponential models for
transition-observation probabilities
– Called Maximum Entropy Markov Model (MEMM)
• Maximum Entropy modeling is used to model the conditional
distributions p(s_t | o_t, s_{t-1})
– ME modeling is based on the principle of avoiding unnecessary
assumptions
– The principle states that the modeled probability distribution
should be consistent with the given collection of facts about
itself and otherwise be as uniform as possible
Maximum Entropy Markov Models
• The mathematical interpretation of this principle results in a
constrained optimization problem
– Maximize the entropy of a conditional distribution p(s_t | o_t, s_{t-1}),
subject to given constraints
– Constraints represent the known facts about the model from
statistics of the training data

Definition 1:
f_{b,s}(c_t, s_t) = 1 if b(c_t) = 1 and s_t = s; 0 otherwise
in MEMM, c_t = (o_t, s_{t-1})

Definition 2:
f_i is said to be activated by a given pair (c_t, s_t) if f_i(c_t, s_t) ≠ 0
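Definition 1 can be sketched as a small feature-function factory; the predicate b and the state names here are hypothetical:

```python
def make_feature(b, s):
    # f_{b,s}(c_t, s_t): fires only when the context predicate b holds
    # and the state equals s (Definition 1)
    return lambda c, st: 1.0 if b(c) and st == s else 0.0

# Hypothetical predicate on the MEMM context c_t = (o_t, previous state)
b = lambda c: c[0] > 0.5
f = make_feature(b, "vowel")

print(f((0.9, "sil"), "vowel"))  # 1.0 -> the feature is activated
print(f((0.1, "sil"), "vowel"))  # 0.0 -> not activated (Definition 2)
```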
Maximum Entropy Markov Models
• These definitions allow us to introduce the constraints of the model
∀ f_i : E[f_i] = Ẽ[f_i]

Ẽ[f_i] = Σ_{c∈C, s∈V} p̃(c, s) f_i(c, s) = (1/N) Σ_{j=1}^{N} f_i(c_j, s_j)

• The expected value of f_i with respect to the model p(s | c) is

E[f_i] = Σ_{c∈C, s∈V} p̃(c) p(s | c) f_i(c, s)

• Using Lagrange multipliers for constrained optimization, the desired
probability distribution is given by the maximum of the function

Λ(p(s | c), λ) = H(p(s | c)) + Σ_i λ_i ( E[f_i] − Ẽ[f_i] )

H(p(s | c)) = − Σ_{c∈C, s∈V} p̃(c) p(s | c) log p(s | c)
Maximum Entropy Markov Models
• Finally, the solution of the objective function is given by the
exponential model

p(s | c) = exp( Σ_i λ_i f_i(c, s) ) / Z_λ(c)

Z_λ(c) = Σ_{s∈V} exp( Σ_j λ_j f_j(c, s) )
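A minimal sketch of this exponential (log-linear) model with two states and two hypothetical binary features; dividing by Z(c) normalizes the distribution over states:

```python
import numpy as np

states = ["A", "B"]
# Two hypothetical binary features f_i(c, s) on a scalar context c
features = [lambda c, s: 1.0 if s == "A" and c > 0 else 0.0,
            lambda c, s: 1.0 if s == "B" and c <= 0 else 0.0]
lam = np.array([2.0, 1.0])  # hypothetical trained weights lambda_i

def p_s_given_c(c):
    scores = np.array([sum(l * f(c, s) for l, f in zip(lam, features))
                       for s in states])
    e = np.exp(scores)
    return e / e.sum()  # dividing by Z(c) normalizes over states

p = p_s_given_c(1.0)
print(p)  # state "A" dominates when c > 0
```

Training fits the lambda_i so that the model's feature expectations match their empirical counts, exactly the constraints of the previous slide.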
Reference
• [SAP06][Jeff Kuo and Yuqing Gao] “Maximum Entropy Direct Models
for Speech Recognition”