Foundations of ASR
Steven Wegmann
Director, Speech Research
International Computer Science Institute
29 June 2015

Outline
- The main idea in the ASR formalism
- High level view of HMMs and why they are used in ASR
- HMM details and inference algorithms
- ASR model training and decoding details

The main idea in the ASR formalism
Some notation:
- W = sequence of words (transcription)
- X = acoustic observations or evidence (MFCCs, Δs, ΔΔs)
Given a particular acoustic utterance x, our goal is to compute, for every possible transcription w, the probability
  P(W = w | X = x).
We use these probabilities to decode or recognize the utterance x by selecting the most likely transcription w_recog via
  w_recog = argmax_w P(W = w | X = x).

The main idea (cont'd)
We use Bayes' Theorem:
  P(w | x) = P(x | w) P(w) / P(x).
This decomposes the problem into two probability models:
- The acoustic model (AM) gives P(x | w).
- The language model (LM) gives P(w).
- The term P(x) is constant in w, so it is (usually) ignored.

The main idea (cont'd)
The mainstream choices for these models:
- AM: hidden Markov model (HMM)
- LM: smoothed n-gram model (earlier lecture)
We can view P(w | x) as the posterior on word sequences given the acoustic observations,
- where P(w) is the prior on word sequences,
- and P(x | w) is the likelihood of x given w.
Note that we are not modeling P(w | x) directly.
- Why not?

Generative vs discriminative classifiers
What we've just described is an example of a generative classifier:
- Model P(X | W = w) separately for each class W = w.
- X is random, W = w is given.
- Stronger model assumptions.
- Uses maximum likelihood estimation.
- Estimation is "easy".
A discriminative classifier models P(W | X = x) directly:
- W is random, X = x is given.
- Weaker model assumptions.
- Uses conditional maximum likelihood estimation.
- Estimation is "hard".

Our acoustic models are hidden Markov models
We want a probability model for a sequence of frames given a particular word sequence, i.e. P(x | w).
In the last 25 years, hidden Markov models (HMMs) have provided the most successful form for this model.
This is in spite of the fact that they make very strong independence assumptions which are clearly violated by speech data.

Notational quibbles
In this talk X is always a continuous random variable:
- So P(X = x | w) = P(x | w) is a probability density function.
- I would normally write f(x | w) to emphasize this.
- In this talk I will use P for notational simplicity.
- It's bad because it makes densities look like probabilities: sorry!
In practice computations are done in the log domain:
- So log P(X = x | w) instead of P(X = x | w).
- This is because most of the quantities are very, very small: lots of multiplications of small quantities.
- In the log domain these become sums.
- However, I'm leaving out the logs for clarity.
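To make the decoding rule and the log-domain remark concrete, here is a minimal sketch of choosing w_recog = argmax_w [log P(x | w) + log P(w)] over a small hypothesis list. The hypothesis strings and the acoustic/language model log scores are invented for illustration only; in a real system they would come from the HMM acoustic model and an n-gram language model.

```python
import math

# Toy hypotheses with invented acoustic and language model scores.
hypotheses = {
    "recognize speech":   {"log_p_x_given_w": -1520.3, "log_p_w": math.log(1e-6)},
    "wreck a nice beach": {"log_p_x_given_w": -1518.9, "log_p_w": math.log(1e-9)},
}

def decode(hyps):
    """Return argmax_w [log P(x | w) + log P(w)]; P(x) is ignored (constant in w)."""
    return max(hyps, key=lambda w: hyps[w]["log_p_x_given_w"] + hyps[w]["log_p_w"])

print(decode(hypotheses))  # -> "recognize speech" for these toy scores
```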
Roughly speaking, what is an HMM?
An HMM consists of two synchronized stochastic processes:
- an unobservable Markov chain of hidden states S_t,
- an observed process X_t: in our case the acoustic observations.
Each S_t is a discrete random variable, while the X_t can be either discrete or continuous random variables.
The hidden chain "explains" the observed process, because each S_t emits X_t
- using a state-dependent probability distribution.

Why use HMMs for speech recognition?
We will sketch two motivations for HMMs,
- in particular for the hidden state sequence,
- to help build some intuition about HMMs.
The first is based on the physiology of speech production.
The second is based on properties of the model.
The spectrogram serves both motivations:
- it is essentially what we observe.

Spectrograms
A spectrogram is a particularly useful way to display speech data.
It displays a sequence of spectral envelopes computed from the signal:
- Slide a 25 ms analysis window along the signal in 10 ms frame steps.
- Apply a Hamming window.
- Compute the FFT in the window to get the energy at each frequency.
The usual front end (MFCCs) starts with this same processing.
- It, like most other front ends, is a transformation of this picture.

A spectrogram (from MIT 6.345, Spring '03)
[Figure: spectrogram.]

Physiological motivation: picture
[Figure: the speech production apparatus.]

Physiological motivation: idea
When we speak a sentence, our production mechanism traverses an essentially hidden sequence of configurations involving our
- glottis,
- vocal tract,
- tongue,
- lips, ...
View the HMM's hidden state sequence as a discrete approximation to these continuous sequences of configurations.
Spectrograms give us hope that relatively few states are necessary for a reasonable approximation.

Second motivation: the HMM's point of view
HMMs were designed to solve the following "segmentation" problem:
- We observe a slowly varying stochastic process produced by a hidden group of states.
- We know that the hidden states emit in bunches.
- The task is to identify which states emitted the observed data.
The spectrogram gives us hope again.

The spectrogram segmentation problem
[Figure: a spectrogram segmented into state-like regions.]

Historical notes
L. Baum and colleagues at IDA in the '60s:
- They developed HMMs to solve problems in cryptography.
- Jim Baker was an intern at IDA while a Princeton undergrad.
- Jim went to CMU for his PhD, where he applied HMMs to ASR (1970s).
- Jim eventually cofounded Dragon Systems.
- Baum and Simons (a colleague) both went on to found early hedge funds.
Fred Jelinek independently applied HMMs to ASR at IBM:
- He first used HMMs in the '60s to solve a problem involving hard drive read errors.
- He may have used the work of Stratonovich?
- Stratonovich developed many HMM properties in the late '50s but published them in Russian.

HMM details: discrete time stochastic processes
A discrete time stochastic process is a sequence of random variables
- (X_0, X_1, ..., X_t, ...),
- where all the X_t share the same sample space;
- the X_t may be vectors.
In ASR there are canonical stochastic processes, all related to spoken utterances:
- the observed sequence of speech frames (X_t),
- the corresponding hidden sequence of HMM states (S_t),
- the sequence of words (W_i).

HMM details: Markov chains
A Markov chain is a discrete time stochastic process (S_t) where the S_t are discrete random variables satisfying
  P(S_{t+1} = s_{t+1} | S_t = s_t, ..., S_0 = s_0) = P(S_{t+1} = s_{t+1} | S_t = s_t).
- This is called the (first order) Markov assumption.
For a Markov chain, the future depends on the past only through the present!
We also assume that the S_t take on a finite set of values,
- i.e., the sample space S is finite.
- Abuse notation and write S = |S| (the same symbol for the state set and its size), so S = {1, ..., S}.

HMM details: stationary Markov chains
A Markov chain is called stationary if
  P(S_{t+1} | S_t) = P(S_t | S_{t-1}) = ... = P(S_1 | S_0).
The transition probabilities define an S × S matrix A:
  A_ij = P(S_{t+1} = j | S_t = i)  for all t ≥ 0 and all i, j ∈ S.
There is also an initial distribution π:
  π(i) = P(S_0 = i)  for all i ∈ S.
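As a quick illustration of a stationary Markov chain specified by (π, A), here is a minimal sampling sketch. The 3-state initial distribution and transition matrix are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented example: 3 states, initial distribution pi and transition matrix A.
pi = np.array([1.0, 0.0, 0.0])          # always start in state 0
A = np.array([[0.7, 0.3, 0.0],          # A[i, j] = P(S_{t+1} = j | S_t = i)
              [0.0, 0.8, 0.2],
              [0.0, 0.0, 1.0]])

def sample_chain(pi, A, T):
    """Sample S_0 ~ pi, then S_{t+1} ~ A[S_t, :] for t = 0, ..., T-2."""
    states = [int(rng.choice(len(pi), p=pi))]
    for _ in range(T - 1):
        states.append(int(rng.choice(A.shape[1], p=A[states[-1]])))
    return states

print(sample_chain(pi, A, 10))  # one sample path, e.g. something like [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
```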
Most HMMs in ASR use this Markov chain
[Figure: a left-to-right chain 0 -> 1 -> 2 -> 3 -> 4, with self-loops A_11, A_22, A_33 on states 1, 2, 3 and forward transitions A_01, A_12, A_23, A_34.]
There are 5 states:
- State 0 is the initial state.
- State 4 is the final (accepting) state.
- π(0) = 1 and A_01 = 1.
At state 1 there are only two possible transitions:
- stay put with probability A_11,
- advance to state 2 with probability A_12,
- and A_11 + A_12 = 1.

HMM definition
A hidden Markov model consists of
- an observed stochastic process X_0, ..., X_T,
- a hidden stationary Markov chain S_0, ..., S_T,
- and their joint probability distribution P(x_0, ..., x_T, s_0, ..., s_T),
satisfying these (strong!) assumptions:
- Conditional independence:
  - given S_0, ..., S_T, the X_i are independent;
  - given S_i, X_i is independent of S_j when j ≠ i.
- Stationarity: the distribution of X_i given S_i does not depend on i.
  - These are called the state's output distributions.
  - There is one distribution per state: {P(x | s)}_{s=1..S}.
  - Their form is not specified by the model, i.e., we choose it.

Simple toy example using an isolated word
Use this HMM to model the word "COW":
[Figure: 0 --1.0--> C --0.5--> O --0.2--> W --0.9--> 4, with self-loop probabilities C: 0.5, O: 0.8, W: 0.1.]
States 0 (start) and 4 (accept) do not emit frames (think FSA).
States C, O, W emit frames according to the (as yet unspecified) distributions
- P(x_t | s_t = C), P(x_t | s_t = O), and P(x_t | s_t = W).

Toy example (cont'd)
Suppose we have a 5 frame utterance of COW: (x_0, x_1, x_2, x_3, x_4).
What are the allowable state sequences and their probabilities?
- CCCOW
- CCOOW
- CCOWW
- COOOW
- COOWW
- COWWW

Toy example (cont'd)
For the probability of CCCOW multiply the following:
- 1.0 × P(x_0 | C)
- 0.5 × P(x_1 | C)
- 0.5 × P(x_2 | C)
- 0.5 × P(x_3 | O)
- 0.2 × P(x_4 | W) × 0.9
- You do the rest!

Toy example (cont'd)
In principle there are 3^T possible state sequences for a length-T utterance with this HMM
- (not counting the non-emitting start and accept states).
In reality most of those sequences are not allowed: they have zero probability.
Exercise: the number of allowable state sequences is
  Σ_{k=1..T-2} k = (T − 2)(T − 1) / 2.
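The brute-force view of the toy example can be spelled out in a few lines of code: enumerate the six allowable paths, compute each path's joint probability using the transition probabilities above, and sum them to get the likelihood of the 5 frame utterance. The per-frame output probabilities are invented placeholders, since the output distributions have not been specified yet.

```python
import itertools

# Transitions of the toy COW HMM (self-loops and advances), from the figure above.
trans = {("<s>", "C"): 1.0, ("C", "C"): 0.5, ("C", "O"): 0.5,
         ("O", "O"): 0.8, ("O", "W"): 0.2,
         ("W", "W"): 0.1, ("W", "</s>"): 0.9}

# Invented per-frame output probabilities P(x_t | state); placeholders only.
emit = [{"C": 0.90, "O": 0.05, "W": 0.05},   # x_0
        {"C": 0.80, "O": 0.15, "W": 0.05},   # x_1
        {"C": 0.20, "O": 0.70, "W": 0.10},   # x_2
        {"C": 0.10, "O": 0.60, "W": 0.30},   # x_3
        {"C": 0.05, "O": 0.15, "W": 0.80}]   # x_4

def path_prob(path):
    """Joint probability of one state path and the 5 frames."""
    p, prev = 1.0, "<s>"
    for t, state in enumerate(path):
        p *= trans[(prev, state)] * emit[t][state]
        prev = state
    return p * trans[(prev, "</s>")]

def allowable(p):
    """Monotone C..O..W paths that use all three emitting states."""
    idx = ["COW".index(c) for c in p]
    return idx == sorted(idx) and set(p) == set("COW")

paths = ["".join(p) for p in itertools.product("COW", repeat=5) if allowable(p)]
total = sum(path_prob(p) for p in paths)
print(len(paths), total)   # 6 paths; their probabilities sum to P(x_0, ..., x_4)
```

Enumerating paths like this is only feasible for toy examples; the inference algorithms that follow avoid it.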
HMM inference problems/algorithms
The joint distribution:
  P(x_0, ..., x_T, s_0, ..., s_T).
The marginal distribution (forward algorithm):
  P(x_0, ..., x_T) = Σ_{s_0,...,s_T} P(x_0, ..., x_T, s_0, ..., s_T).
The conditional distributions (forward-backward algorithm):
  P(s_t | x_0, ..., x_T).
The maximum likelihood state sequence (Viterbi algorithm):
  ŝ_ML = argmax_{s_0,...,s_T} P(x_0, ..., x_T, s_0, ..., s_T).

A major reason for the HMM's success
There are efficient (linear in T) algorithms for inference.
Recursive factorization and dynamic programming are key to these algorithms.
The strong model assumptions allow major simplifications.
We make repeated use of the chain rule for probabilities:
  P(A, B, C) = P(C | A, B) P(A, B) = P(C | A, B) P(B | A) P(A).

The HMM's joint distribution
First, a simple decomposition:
  P(x_0, ..., x_T, s_0, ..., s_T) = P(x_0, ..., x_T | s_0, ..., s_T) P(s_0, ..., s_T).
Use recursive factorization on the Markov chain:
  P(s_0, ..., s_T) = P(s_T | s_0, ..., s_{T-1}) P(s_0, ..., s_{T-1})
                   = P(s_T | s_0, ..., s_{T-1}) P(s_{T-1} | s_{T-2}, ..., s_0) ... P(s_1 | s_0) P(s_0)
                   = P(s_T | s_{T-1}) P(s_{T-1} | s_{T-2}) ... P(s_1 | s_0) P(s_0)
                   = A_{s_{T-1} s_T} A_{s_{T-2} s_{T-1}} ... A_{s_0 s_1} π(s_0)
                   = π(s_0) ∏_{t=0..T-1} A_{s_t s_{t+1}}.

The HMM's joint distribution (cont'd)
Apply the conditional independence assumptions:
  P(x_0, ..., x_T | s_0, ..., s_T) = ∏_{t=0..T} P(x_t | s_0, ..., s_T) = ∏_{t=0..T} P(x_t | s_t).
Putting it all together:
  P(x_0, ..., x_T, s_0, ..., s_T) = π(s_0) ∏_{t=0..T-1} A_{s_t s_{t+1}} ∏_{t=0..T} P(x_t | s_t).

The HMM's marginal distribution
Sum over all possible state sequences:
  P(x_0, ..., x_T) = Σ_{(s_0,...,s_T) ∈ S × ... × S} π(s_0) ∏_{t=0..T-1} A_{s_t s_{t+1}} ∏_{t=0..T} P(x_t | s_t).
It is not feasible to sum over all state sequences:
- there are S^{T+1} state sequences!
- Instead, the computation is done using the forward algorithm.

The forward algorithm
Define the forward probabilities
  α_t(s_t) = P(x_0, ..., x_t, s_t).
Note that
  P(x_0, ..., x_T) = Σ_{s_T=1..S} P(x_0, ..., x_T, s_T) = Σ_{s_T=1..S} α_T(s_T).
We will derive the recursion
  α_t(s_t) = P(x_t | s_t) Σ_{s_{t-1}=1..S} A_{s_{t-1} s_t} α_{t-1}(s_{t-1}).

The forward algorithm: dynamic programming
To compute P(x_0, ..., x_T):
- Initialize: α_0(s_0) = P(x_0, s_0) = π(s_0) P(x_0 | s_0) for all s_0 ∈ S.
- At time t: compute α_t(s_t) for all s_t ∈ S using the results from t − 1 via
    α_t(s_t) = P(x_t | s_t) Σ_{s_{t-1}=1..S} A_{s_{t-1} s_t} α_{t-1}(s_{t-1}).
- Terminate at T:
    P(x_0, ..., x_T) = Σ_{s_T=1..S} α_T(s_T).
Dynamic programming reduces exponential time to linear time:
- O(S^T) becomes O(T S²).
- For the HMMs used in ASR the time is actually O(T S).

The forward algorithm: derivation
The chain rule, conditional independence, and the transition matrix A give
  α_t(s_t) = P(x_0, ..., x_t, s_t)
           = Σ_{s_{t-1}=1..S} P(x_0, ..., x_t, s_t, s_{t-1})
           = Σ_{s_{t-1}=1..S} P(x_t | x_0, ..., x_{t-1}, s_t, s_{t-1}) P(s_t | x_0, ..., x_{t-1}, s_{t-1}) P(x_0, ..., x_{t-1}, s_{t-1})
           = P(x_t | s_t) Σ_{s_{t-1}=1..S} P(s_t | s_{t-1}) α_{t-1}(s_{t-1})
           = P(x_t | s_t) Σ_{s_{t-1}=1..S} A_{s_{t-1} s_t} α_{t-1}(s_{t-1}).

Toy example again
[Figure: the COW HMM, as before.]
Exercise: work out the forward algorithm using a 5 frame utterance of COW: (x_0, x_1, x_2, x_3, x_4).

Toy example (cont'd)
Related exercise: show that the allowable paths form a trellis.
[Figure: the trellis of states C, O, W across frames x_0, ..., x_4, from start state 0 to accept state 4.]

The forward-backward algorithm
We want to compute P(s_t | x_0, ..., x_T) for all t.
Introduce the backward probabilities
  β_t(s_t) = P(x_{t+1}, ..., x_T | s_t).
Exercise: these satisfy the recursion
  β_t(s_t) = Σ_{s_{t+1}=1..S} P(x_{t+1} | s_{t+1}) A_{s_t s_{t+1}} β_{t+1}(s_{t+1}).

The forward-backward algorithm (cont'd)
The independence assumptions imply
  γ_t(s_t) ≡ P(x_0, ..., x_T, s_t)
           = P(x_0, ..., x_t, s_t) P(x_{t+1}, ..., x_T | x_0, ..., x_t, s_t)
           = P(x_0, ..., x_t, s_t) P(x_{t+1}, ..., x_T | s_t)
           = α_t(s_t) β_t(s_t).
Thus
  P(s_t | x_0, ..., x_T) = P(x_0, ..., x_T, s_t) / P(x_0, ..., x_T) = γ_t(s_t) / Σ_{s=1..S} γ_t(s).

The forward-backward algorithm: dynamic programming
To compute P(s_t | x_0, ..., x_T):
- Run the forward algorithm.
- Initialize: β_T(s) = 1 for all s ∈ S.
- At time t: compute β_t(s_t) for all s_t ∈ S using the results from t + 1 via
    β_t(s_t) = Σ_{s_{t+1}=1..S} P(x_{t+1} | s_{t+1}) A_{s_t s_{t+1}} β_{t+1}(s_{t+1}).
- Terminate at t = 0; then for all t set γ_t(s_t) = α_t(s_t) β_t(s_t) and
    P(s_t | x_0, ..., x_T) = γ_t(s_t) / Σ_{s=1..S} γ_t(s).
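Here is a minimal numpy sketch of the forward recursion and termination step described above. It assumes the per-frame output likelihoods are already available as a matrix B with B[t, s] = P(x_t | s_t = s); in practice the computation would be carried out in the log domain, as noted in the notational quibbles.

```python
import numpy as np

def forward(pi, A, B):
    """Forward algorithm: returns (alpha, P(x_0, ..., x_T)).

    pi: (S,) initial distribution; A: (S, S) transitions, A[i, j] = P(j | i);
    B:  (T+1, S) output likelihoods, B[t, s] = P(x_t | s_t = s).
    """
    T1, S = B.shape
    alpha = np.zeros((T1, S))
    alpha[0] = pi * B[0]                        # alpha_0(s) = pi(s) P(x_0 | s)
    for t in range(1, T1):
        alpha[t] = B[t] * (alpha[t - 1] @ A)    # alpha_t(s) = P(x_t | s) sum_s' A[s', s] alpha_{t-1}(s')
    return alpha, alpha[-1].sum()               # marginal likelihood P(x_0, ..., x_T)

# Toy check with invented numbers: 2 states, 3 frames.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.2, 0.8]])
B = np.array([[0.9, 0.1],
              [0.5, 0.5],
              [0.2, 0.8]])
alpha, like = forward(pi, A, B)
print(like)   # equals the brute-force sum over all 2**3 state sequences
```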
The Viterbi algorithm: brief review
The goal is to find the maximum likelihood state sequence
  ŝ_ML = argmax_{s_0,...,s_T} P(x_0, ..., x_T, s_0, ..., s_T).
The Viterbi algorithm is
- used for decoding/recognition,
- used for model parameter estimation.
It is a modified version of the forward algorithm:
- sums are replaced by max: the "Viterbi approximation";
- an extra term is introduced to store the "back-trace".

The Viterbi algorithm review (cont'd)
We introduce a quantity similar to α_t(s_t):
  vit_t(s_t) = max_{s_0,...,s_{t-1}} P(x_0, ..., x_t, s_0, ..., s_{t-1}, s_t).
The following recursions hold:
  vit_t(s_t) = P(x_t | s_t) max_{s ∈ S} A_{s s_t} vit_{t-1}(s),
  bp_t(s_t) = argmax_{s ∈ S} A_{s s_t} vit_{t-1}(s).

In ASR, HMMs are usually used to model phones
We want the hidden states to model a portion of the speech that is consistently realized,
- usually across a large vocabulary,
- usually across many speakers.
Instead of using phones, we actually use phones in context, often triphones:
- The realization of a phone depends on its context.
- Start with a dictionary entry: cat = k ae t.
- The corresponding triphone sequence is: sil-k+ae k-ae+t ae-t+sil.
- Longer contexts, e.g., pentaphones, are useful too.
- In principle there is one HMM for each triphone.
Each phone HMM has three states, modeling the start, middle, and end of the phone:
- regions of local stationarity.

Phones in context
The triphones need to be clustered.
Problem: a typical language uses about 50 phonemes, so there are 50³ ≈ 125K possible triphones.
- The triphone distribution, like that of words, is approximately Zipfian:
  - a small population of frequent triphones,
  - a long tail of rare or non-occurring triphones.
- Even in huge training sets, many triphones will never occur.
- What models should we use for unobserved triphones? They may occur during recognition!
Solution: top-down decision tree clustering is performed on the triphone HMM states.
- Each triphone state is in exactly one cluster.
- Decision trees ask a series of questions, e.g.:
  - Is the left context a vowel?
  - Is the right context a stop?

What we need for ASR
A vocabulary:
- the list of words that can be recognized.
A dictionary:
- a pronunciation for each word in terms of phonemes.
An AM and an LM:
- AM: HMMs for each triphone.
- LM: models word sequences in the vocabulary,
  - often estimated from 100s of billions of words.
The next slide shows the generative model,
- which is exploited in recognition and training.

The standard ASR formalism
[Figure: the generative model. Words w_1, w_2, w_3 generate phones p_1, p_2, ..., which generate HMM states s_1, ..., s_6; each state s_t emits an observation vector o_t with components c_1, ..., c_39. Notation: o = x, the observations.]

Remarks on recognition/decoding
The two main approaches to decoding differ in how their search space is constructed:
- dynamic decoders,
- static WFST decoders.
Both use the Viterbi algorithm to conduct their search.
Dynamic decoders follow the previous slide very closely.
- Expert knowledge/experimentation has been applied to optimize the search algorithms.
- Much has been published, but there are many unpublished secrets!

Remarks on recognition/decoding (cont'd)
WFST-based decoders use WFSTs to represent the search space.
- Standard WFST theory/operations are used to algorithmically optimize the search space.
- The actual decode on this space is vanilla Viterbi.
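Both decoder families ultimately run the Viterbi recursion from the review above. Here is a minimal numpy sketch using the same (pi, A, B) conventions as the forward-algorithm sketch, where B[t, s] = P(x_t | s_t = s) is assumed to be precomputed; it returns the best path and its score.

```python
import numpy as np

def viterbi(pi, A, B):
    """Return (best state sequence, max over paths of P(x_0..x_T, s_0..s_T))."""
    T1, S = B.shape
    vit = np.zeros((T1, S))
    bp = np.zeros((T1, S), dtype=int)            # back-pointers
    vit[0] = pi * B[0]
    for t in range(1, T1):
        scores = vit[t - 1][:, None] * A         # scores[s', s] = vit_{t-1}(s') A[s', s]
        bp[t] = scores.argmax(axis=0)            # bp_t(s)  = argmax_{s'} A[s', s] vit_{t-1}(s')
        vit[t] = B[t] * scores.max(axis=0)       # vit_t(s) = P(x_t | s) max_{s'} A[s', s] vit_{t-1}(s')
    path = [int(vit[-1].argmax())]               # back-trace from the best final state
    for t in range(T1 - 1, 0, -1):
        path.append(int(bp[t][path[-1]]))
    return path[::-1], vit[-1].max()

# Reusing the toy 2-state example from the forward-algorithm sketch:
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]])
print(viterbi(pi, A, B))   # best path [0, 1, 1] with probability ~0.0518 for these toy numbers
```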
Acoustic model training overview: requirements
A large acoustic training corpus:
- many thousands of hours of audio data,
  - drawn from a large speaker population,
  - similar to the target application,
- with corresponding word-level transcripts.
A dictionary that covers all the words in the training corpus.

The standard ASR formalism: training
[Figure: the same generative model as before, now used for training: words generate phones, phones generate HMM states s_1, ..., s_6, and each state emits an observation vector o_t with components c_1, ..., c_39. Notation: o = x, the observations.]

Acoustic model training overview: basic recipe
Train monophone HMMs:
- Use the transcripts and dictionary to get the monophone sequences: one HMM per monophone.
Initialize triphone HMMs:
- Use the context in the phone sequences to construct the corresponding triphone sequences:
  - "a b c" maps to "a+b a-b+c b-c".
- Clone triphone models:
  - the HMM parameters for a-b+c come from b.
Re-estimate the triphone parameters.
Cluster the HMM triphone states using decision trees.
Re-estimate the clustered HMMs.

Output distributions
While there are many potential choices for P(x_t | s_t):
- Mixture models with normal components have been dominant until recently.
- They are often called "Gaussian" mixture models (GMMs):
  - this gives Gauss (his usage was in 1810) undeserved credit;
  - De Moivre was first, in 1733;
  - even Laplace was earlier than Gauss (in the 1770s).
Neural networks are currently the preferred choice.
- The hybrid HMM/NNET framework was developed in the late '80s/early '90s by Morgan and Bourlard (with many other key contributors).
- A later lecture will cover HMM/NNETs.
- HMM/GMMs are still very useful/instructive.

Diagonal, multivariate normal distributions: review
A D-dimensional normal with diagonal covariance:
- Parameters: mean vector μ and variance vector σ².
- Write μ_d for the d-th component.
- N(μ, σ²) denotes the distribution.
- φ(x; μ, σ²) denotes the probability density function (p.d.f.):
    φ(x; μ, σ²) = (1 / √((2π)^D ∏_{d=1..D} σ_d²)) exp( −(1/2) Σ_{d=1..D} (x_d − μ_d)² / σ_d² ).
Mixture models of normals, i.e., GMMs:
- M component weights, means, and variances {w_m, μ_m, σ_m²}_{m=1..M},
- with Σ_{m=1..M} w_m = 1,
- p.d.f.: Σ_{m=1..M} w_m φ(x; μ_m, σ_m²).

Multivariate normal distributions: review
Maximum likelihood estimates (MLEs):
- A sample of T D-dimensional training examples {x_t}_{t=1..T}.
- x_{t,d} is example t's d-th component.
- MLE for the mean:
    μ̂ = (1/T) Σ_{t=1..T} x_t.
- MLE for the variance in dimension d:
    σ̂_d² = (1/T) Σ_{t=1..T} (x_{t,d} − μ̂_d)².
We are skipping the MLEs for GMMs.

HMMs with normal output distributions
For each state s ∈ S:
- P(x | s) = φ(x; μ_s, σ_s²),
- with state-dependent means and variances {μ_s, σ_s²}_{s=1..S}.
Given a training sample {x_t}_{t=1..T}, if we knew which state s_t was responsible for each x_t, then the MLEs would be obvious.
- But we don't, since the state identities are hidden.
Two approximate solutions to this problem:
- Viterbi training,
- Baum-Welch (BW) training.

HMMs with normal output distributions (cont'd)
Let θ denote the HMM model parameters:
- the state transition matrix A,
- the state means and variances {μ_s, σ_s²}_{s=1..S}.
We make the dependence on θ explicit by writing the probability distributions as P_θ.
The Baum-Welch and Viterbi estimation algorithms
- are iterative,
- and produce sequences of model parameter estimates θ̂_0, θ̂_1, ..., θ̂_n.
We are skipping the estimates for the transitions and mixture models:
- they are functionally equivalent and easy to work out.

Estimate θ̂_{i+1} using θ̂_i
Viterbi training:
- uses the maximum likelihood estimate of the state identities,
- uses the Viterbi algorithm and P_{θ̂_i},
- makes a hard, single choice of which state generates each frame,
- is what Kaldi uses.
Baum-Welch training:
- uses the distributions P_{θ̂_i}(s_t | x_1, ..., x_T),
- uses the forward-backward algorithm,
- makes a soft, fractional choice of which state generates each frame,
- is what HTK uses.
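Both training styles reduce to the same weighted Gaussian MLEs, which the following slides write out: Viterbi training uses hard 0/1 frame-to-state counts from a forced alignment, while Baum-Welch uses the soft posteriors P_{θ̂_i}(s_t = k | x_1, ..., x_T) from the forward-backward algorithm. Here is a hedged numpy sketch of that common re-estimation step for one state; the frame matrix and per-frame weights are invented toy values.

```python
import numpy as np

def reestimate_state_gaussian(frames, weights):
    """Weighted diagonal-Gaussian MLE for one state.

    frames:  (T, D) observation vectors x_t
    weights: (T,)   per-frame counts for this state
             (0/1 Kronecker deltas for Viterbi training,
              posteriors P(s_t = k | x_1..x_T) for Baum-Welch)
    Returns the new mean and diagonal variance.
    """
    total = weights.sum()
    mean = (weights[:, None] * frames).sum(axis=0) / total
    var = (weights[:, None] * (frames - mean) ** 2).sum(axis=0) / total
    return mean, var

# Toy illustration: 5 frames of 3-dimensional features (values invented).
frames = np.array([[1.0, 0.2, -0.5],
                   [0.9, 0.1, -0.4],
                   [1.1, 0.3, -0.6],
                   [3.0, 1.0,  0.5],
                   [2.9, 1.1,  0.4]])

hard = np.array([1.0, 1.0, 1.0, 0.0, 0.0])    # Viterbi: frames 0-2 aligned to this state
soft = np.array([0.9, 0.8, 0.7, 0.1, 0.05])   # Baum-Welch: fractional state occupancy
print(reestimate_state_gaussian(frames, hard))
print(reestimate_state_gaussian(frames, soft))
```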
Viterbi training: basic idea
First use the Viterbi algorithm to obtain
  ŝ = argmax_{s_1,...,s_T} P_{θ̂_i}(x_1, ..., x_T, s_1, ..., s_T).
- ŝ depends on θ̂_i, so we should write ŝ_{θ̂_i}; we don't, to keep the notation cleaner.
- View this as an assignment of frames to states,
  - i.e., x_t is generated by state ŝ_t.
- This is called a Viterbi or forced state-level alignment.
- It means that state k's training data is the set of frames {x_t : 1 ≤ t ≤ T and δ_{k,ŝ_t} = 1}.
- Frame x_t's count for state k is δ_{k,ŝ_t} (which depends on θ̂_i).

Viterbi training: the new estimate θ̂_{i+1}
We use the MLEs obtained on each state's training data to obtain the new parameter estimates θ̂_{i+1}:
- MLE for state k's mean:
    μ̂_k = Σ_{t=1..T} δ_{k,ŝ_t} x_t / Σ_{t=1..T} δ_{k,ŝ_t}.
- MLE for state k's variance in dimension d:
    σ̂_{k,d}² = Σ_{t=1..T} δ_{k,ŝ_t} (x_{t,d} − μ̂_{k,d})² / Σ_{t=1..T} δ_{k,ŝ_t}.

Baum-Welch training: basic idea
First use the forward-backward algorithm to obtain P_{θ̂_i}(s_t | x_1, ..., x_T).
- View this as a probabilistic assignment of frames to states,
  - i.e., x_t is generated by state k with probability P_{θ̂_i}(s_t = k | x_1, ..., x_T).
- Frame x_t's count for state k is P_{θ̂_i}(s_t = k | x_1, ..., x_T).
- The counts are now fractional instead of integral.

Baum-Welch training: the new estimate θ̂_{i+1}
We use the MLEs obtained on each state's training data to obtain the new parameter estimates θ̂_{i+1}:
- MLE for state k's mean:
    μ̂_k = Σ_{t=1..T} P_{θ̂_i}(s_t = k | x_1, ..., x_T) x_t / Σ_{t=1..T} P_{θ̂_i}(s_t = k | x_1, ..., x_T).
- MLE for state k's variance in dimension d:
    σ̂_{k,d}² = Σ_{t=1..T} P_{θ̂_i}(s_t = k | x_1, ..., x_T) (x_{t,d} − μ̂_{k,d})² / Σ_{t=1..T} P_{θ̂_i}(s_t = k | x_1, ..., x_T).

How do we initialize, i.e., choose θ_0?
Usually via a flat start:
- State transition probabilities use the uniform distribution,
  - e.g. A_{1,1} = A_{1,2} = 0.5.
- Use the training data's total mean and variance for all of the states' means and variances:
    μ̄ = (1/T) Σ_{t=1..T} x_t,
    σ̄_d² = (1/T) Σ_{t=1..T} (x_{t,d} − μ̄_d)²  for all d.

What are Viterbi and Baum-Welch training optimizing?
Baum-Welch training:
- The Baum-Welch model/parameter selection criterion is the training data likelihood:
    L(θ) = Σ_{s_1,...,s_T} P_θ(x_1, ..., x_T, s_1, ..., s_T).
- It produces a parameter sequence (θ̂_0, θ̂_1, ..., θ̂_n) satisfying
    L(θ̂_0) ≤ L(θ̂_1) ≤ ... ≤ L(θ̂_n).
- Baum-Welch is an example of an Expectation-Maximization (EM) algorithm,
  - which has many interesting/important properties
  - that are beyond the scope of this introduction.

What are Viterbi and Baum-Welch training optimizing? (cont'd)
Viterbi training:
- The Viterbi model/parameter selection criterion is an approximation to the training data likelihood:
    VL(θ) = max_{s_1,...,s_T} P_θ(x_1, ..., x_T, s_1, ..., s_T).
- It produces a parameter sequence (θ̂_0, θ̂_1, ..., θ̂_n) satisfying
    VL(θ̂_0) ≤ VL(θ̂_1) ≤ ... ≤ VL(θ̂_n).
- This is straightforward to verify: exercise!

A short, incomplete list of things we didn't cover
Discriminative training:
- alternatives to MLE using different model selection criteria,
- e.g., MMI and MPE.
Model adaptation:
- E.g., given a small sample of data from a new speaker, can we leverage extant models to more accurately predict future samples?
- Bayesian methods are ideal: MAP, MLLR, etc.
Reducing inter-speaker variability at training/test time:
- vocal tract length normalization (VTLN),
- speaker adaptive training (SAT).