EE E6820: Speech & Audio Processing & Recognition Lecture 10: ASR: Sequence Recognition 1 Signal template matching 2 Statistical sequence recognition 3 Acoustic modeling 4 The Hidden Markov Model (HMM) Dan Ellis <dpwe@ee.columbia.edu> http://www.ee.columbia.edu/~dpwe/e6820/ E6820 SAPR - Dan Ellis L10 - Sequence recognition 2002-04-15 - 1 Signal template matching 1 Framewise comparison of unknown word and stored templates: ONE TWO THREE Reference FOUR FIVE • 70 60 50 40 30 20 10 10 20 30 40 50 time /frames Test - distance metric? - comparison between templates? - constraints? E6820 SAPR - Dan Ellis L10 - Sequence recognition 2002-04-15 - 2 Dynamic Time Warp (DTW) • Find lowest-cost constrained path: - matrix d(i,j) of distances between input frame fi and reference frame rj - allowable predecessors & transition costs Txy Reference frames rj Lowest cost to (i,j) T10 D(i,j) = d(i,j) + min T 11 T01 D(i-1,j) D(i-1,j) { D(i-1,j) + T10 D(i,j-1) + T01 D(i-1,j-1) + T11 } Local match cost D(i-1,j) Best predecessor (including transition cost) Input frames fi • Best path via traceback from final state - have to store predecessors for (almost) every (i,j) E6820 SAPR - Dan Ellis L10 - Sequence recognition 2002-04-15 - 3 DTW-based recognition Reference templates for each possible word • Isolated word: - mark endpoints of input word - calculate scores through each template (+prune) - choose best • Continuous speech - one matrix of template slices; special-case constraints at word ends ONE TWO THREE Reference FOUR • Input frames E6820 SAPR - Dan Ellis L10 - Sequence recognition 2002-04-15 - 4 DTW-based recognition (2) + Successfully handles timing variation + Able to recognize speech at reasonable cost - Distance metric? - pseudo-Euclidean space? - Warp penalties? - How to choose templates? - several templates per word? - choose ‘most representative’? - align and average? → need a rigorous foundation... E6820 SAPR - Dan Ellis L10 - Sequence recognition 2002-04-15 - 5 Outline 1 Signal template matching 2 Statistical sequence recognition - state-based modeling 3 Acoustic modeling 4 The Hidden Markov Model (HMM) E6820 SAPR - Dan Ellis L10 - Sequence recognition 2002-04-15 - 6 2 Statistical sequence recognition • DTW limited because it’s hard to optimize - interpretation of distance, transition costs? • Need a theoretical foundation: Probability • Formulate as MAP choice among models: * M = argmax p ( M j X , Θ ) Mj - X = observed features - Mj = word-sequence models - Θ = all current parameters E6820 SAPR - Dan Ellis L10 - Sequence recognition 2002-04-15 - 7 Statistical formulation (2) • Can rearrange via Bayes’ rule (& drop p(X) ): * M = argmax p ( M j X , Θ ) Mj = argmax p ( X M j, Θ A ) p ( M j Θ L ) Mj - p(X | Mj) = likelihood of obs’v’ns under model - p(Mj) = prior probability of model - ΘA = acoustics-related model parameters - ΘL = language-related model parameters • Questions: - what form of model to use for p ( X M j, Θ A ) ? - how to find ΘA (training)? - how to solve for Mj (decoding)? E6820 SAPR - Dan Ellis L10 - Sequence recognition 2002-04-15 - 8 State-based modeling • Assume discrete-state model for the speech: - observations are divided up into time frames - model → states → observations: Model Mj Qk : q1 q2 q3 q4 q5 q6 ... states time X1N : x1 x2 x3 x4 x5 x6 ... • observed feature vectors Probability of observations given model is: p( X M j) = ∑ N p ( X 1 Q k, M j ) ⋅ p ( Q k M j ) all Q k - sum over all possible state sequences Qk • How do observations depend on states? 
DTW-based recognition

• Reference templates for each possible word
• Isolated word:
  - mark endpoints of input word
  - calculate scores through each template (+ prune)
  - choose best
• Continuous speech:
  - one matrix of template slices; special-case constraints at word ends
  [Figure: input frames aligned against a stack of reference templates ONE, TWO, THREE, FOUR]

DTW-based recognition (2)

+ Successfully handles timing variation
+ Able to recognize speech at reasonable cost
- Distance metric?
  - pseudo-Euclidean space?
- Warp penalties?
- How to choose templates?
  - several templates per word?
  - choose 'most representative'?
  - align and average?
→ need a rigorous foundation...

Outline

1 Signal template matching
2 Statistical sequence recognition
  - state-based modeling
3 Acoustic modeling
4 The Hidden Markov Model (HMM)

2 Statistical sequence recognition

• DTW is limited because it is hard to optimize
  - interpretation of distance, transition costs?
• Need a theoretical foundation: probability
• Formulate as a MAP choice among models:
    M* = argmax_{Mj} p(Mj | X, Θ)
  - X = observed features
  - Mj = word-sequence models
  - Θ = all current parameters

Statistical formulation (2)

• Can rearrange via Bayes' rule (& drop p(X)):
    M* = argmax_{Mj} p(Mj | X, Θ) = argmax_{Mj} p(X | Mj, Θ_A) · p(Mj | Θ_L)
  - p(X | Mj) = likelihood of observations under the model
  - p(Mj) = prior probability of the model
  - Θ_A = acoustics-related model parameters
  - Θ_L = language-related model parameters
• Questions:
  - what form of model to use for p(X | Mj, Θ_A)?
  - how to find Θ_A (training)?
  - how to solve for Mj (decoding)?

State-based modeling

• Assume a discrete-state model for the speech:
  - observations are divided up into time frames
  - model → states → observations:
    model Mj generates a state sequence Qk: q1 q2 q3 q4 q5 q6 ...,
    which generates the observed feature vectors X_1^N: x1 x2 x3 x4 x5 x6 ...
• Probability of observations given the model:
    p(X | Mj) = Σ_{all Qk} p(X_1^N | Qk, Mj) · p(Qk | Mj)
  - sum over all possible state sequences Qk
• How do observations depend on states?
  How do state sequences depend on the model?

The speech recognition chain

• After classification, still have the problem of classifying the sequences of frames:
    sound → feature calculation → feature vectors
          → acoustic classifier (network weights) → phone probabilities
          → HMM decoder (word models, language model) → phone & word labeling
• Questions:
  - what to use for the acoustic classifier?
  - how to represent 'model' sequences?
  - how to score matches?

Outline

1 Signal template matching
2 Statistical sequence recognition
3 Acoustic modeling
  - defining targets
  - neural networks & Gaussian models
4 The Hidden Markov Model (HMM)

3 Acoustic modeling

• Goal: convert features into probabilities of particular labels,
  i.e. find p(q^i | x_n) over some state set {q^i}
  - a conventional statistical classification problem
• Classifier construction is data-driven
  - assume we can get examples of known-good x's for each of the q^i
  - calculate model parameters by a standard training scheme
• Various classifiers can be used
  - GMMs model the feature distribution under each state
  - neural nets directly estimate posteriors
• Different classifiers have different properties
  - features and labels limit ultimate performance

Defining classifier targets

• Choice of {q^i} can make a big difference
  - must support the recognition task
  - must be a practical classification task
• Hand-labeling is one source...
  - 'experts' mark spectrogram boundaries
• ...Forced alignment is another
  - 'best guess' with existing classifiers, given the words
• Result is a target label for each training frame
  [Figure: feature vectors over time with per-frame training targets, e.g. the phone labels g w eh n]

Forced alignment

• Best labeling given an existing classifier, constrained by the known word sequence
  [Figure: feature vectors → existing classifier → phone posterior probabilities; known word sequence + dictionary → constrained alignment (e.g. ow th r iy) → training targets → classifier training]

Gaussian Mixture Models vs. Neural Nets

• GMMs fit the distribution of features under each state
  - separate 'likelihood' model for each state q^k, e.g. a Gaussian:
      p(x | q^k) = 1 / ( (2π)^(d/2) |Σ_k|^(1/2) ) · exp( -1/2 · (x - µ_k)^T Σ_k^(-1) (x - µ_k) )
  - mixtures can match any distribution given enough data
• Neural nets estimate posteriors directly:
      p(q^k | x) = F[ Σ_j w_jk · F[ Σ_i w_ij x_i ] ]
  - parameters are set to discriminate between classes
• Posteriors & likelihoods are related by Bayes' rule:
      p(q^k | x) = p(x | q^k) · Pr(q^k) / Σ_j p(x | q^j) · Pr(q^j)
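As a small illustration (not part of the original slides, and with made-up numbers), the sketch below evaluates the Gaussian likelihood above for one state and then applies the Bayes relation in both directions: likelihoods times priors give posteriors, and posteriors divided by priors are again proportional to the likelihoods, which is how a posterior-estimating classifier such as a neural net can feed a decoder that expects per-state likelihoods. The function name gaussian_likelihood is my own.

    import numpy as np

    def gaussian_likelihood(x, mu, Sigma):
        """Single full-covariance Gaussian p(x | q) with mean mu and covariance Sigma."""
        d = len(mu)
        diff = x - mu
        norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
        return norm * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

    print(gaussian_likelihood(np.array([0.5, 1.0]), np.zeros(2), np.eye(2)))   # ≈ 0.085

    # Bayes' rule relating per-state likelihoods and posteriors (toy numbers)
    likelihoods = np.array([2.5, 0.1, 0.4])    # p(x | q^k) for three states
    priors      = np.array([0.5, 0.3, 0.2])    # Pr(q^k)
    posteriors  = likelihoods * priors / np.sum(likelihoods * priors)    # p(q^k | x)

    # Dividing posteriors by priors recovers something proportional to the likelihoods,
    # so 'scaled posteriors' can stand in for p(x | q^k) up to a constant factor.
    scaled = posteriors / priors
    print(scaled / scaled.sum())          # same as likelihoods / likelihoods.sum()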
Outline

1 Signal template matching
2 Statistical sequence recognition
3 Acoustic modeling
4 The Hidden Markov Model (HMM)
  - generative Markov models
  - hidden Markov models
  - model fit likelihood
  - HMM examples

4 Markov models

• A (first-order) Markov model is a finite-state system whose behavior depends only on the current state
• E.g. a generative Markov model with states {S, A, B, C, E} and transition probabilities p(q_{n+1} | q_n):

    q_n \ q_{n+1}    A    B    C    E
    S                1    0    0    0
    A               .8   .1   .1    0
    B               .1   .8   .1    0
    C               .1   .1   .7   .1
    E                0    0    0    1

  - a typical generated state sequence: SAAAAAAAABBBBBBBBBCCCCBBBBBBCE

Hidden Markov models

• Markov models where the state sequence Q = {q_n} is not directly observable (= 'hidden')
• But the observations X do depend on Q:
  - x_n is a random variable whose distribution depends on the current state: p(x | q)
  [Figure: hidden state sequence AAAAAAAABBBBBBBBBBBCCCCBBBBBBBC, per-state emission distributions p(x | q) for q = A, B, C, and the resulting observation sequence x_n over time steps n]
  - can still tell something about the state sequence from the observations...

(Generative) Markov models (2)

• An HMM is specified by:
  - states q^i
  - transition probabilities p(q^j_n | q^i_{n-1}) ≡ a_ij
  - (initial state probabilities p(q^i_1) ≡ π_i)
  - emission distributions p(x | q^i) ≡ b_i(x)
• E.g. a left-to-right word model over states k, a, t, • with transition probabilities p(q_n | q_{n-1}):

    q_{n-1} \ q_n    k     a     t     •
    •               1.0   0.0   0.0   0.0
    k               0.9   0.1   0.0   0.0
    a               0.0   0.9   0.1   0.0
    t               0.0   0.0   0.9   0.1

  plus an emission distribution p(x | q) for each state.

Markov models for speech

• Speech models Mj
  - typically left-to-right HMMs (sequence constraint)
  - observation & evolution are conditionally independent of the rest given the (hidden) state q_n
  [Figure: word model S → ae1 → ae2 → ae3 → E with self-loops; hidden states q1 ... q5 emit observations x1 ... x5]
  - self-loops for time dilation

Markov models for sequence recognition

• Independence of observations:
  - observation x_n depends only on the current state q_n:
      p(X | Q) = p(x_1, x_2, ..., x_N | q_1, q_2, ..., q_N)
               = p(x_1 | q_1) · p(x_2 | q_2) · ... · p(x_N | q_N)
               = ∏_{n=1..N} p(x_n | q_n) = ∏_{n=1..N} b_{q_n}(x_n)
• Markov transitions:
  - transition to the next state depends only on the current state:
      p(Q | M) = p(q_1, q_2, ..., q_N | M)
               = p(q_N | q_1 ... q_{N-1}) · p(q_{N-1} | q_1 ... q_{N-2}) · ... · p(q_2 | q_1) · p(q_1)
               = p(q_N | q_{N-1}) · p(q_{N-1} | q_{N-2}) · ... · p(q_2 | q_1) · p(q_1)
               = p(q_1) · ∏_{n=2..N} p(q_n | q_{n-1}) = π_{q_1} · ∏_{n=2..N} a_{q_{n-1} q_n}

Model fit calculation

• From 'state-based modeling':
    p(X | Mj) = Σ_{all Qk} p(X_1^N | Qk, Mj) · p(Qk | Mj)
• For HMMs:
    p(X | Q) = ∏_{n=1..N} b_{q_n}(x_n)
    p(Q | M) = π_{q_1} · ∏_{n=2..N} a_{q_{n-1} q_n}
• Hence, solve for M*:
  - calculate p(X | Mj) for each available model, scale by the prior p(Mj) → p(Mj | X)
• Sum over all Qk ???

Summing over all paths

• Example: model M1 with entry state S, emitting states A and B, exit state E, and transitions p(q_{n+1} | q_n):

    q_n \ q_{n+1}    S     A     B     E
    S                •    0.9   0.1    •
    A                •    0.7   0.2   0.1
    B                •     •    0.8   0.2
    E                •     •     •     1

• Observation likelihoods p(x_n | q) for a three-frame input X = x_1, x_2, x_3:

    q       x_1    x_2    x_3
    A       2.5    0.2    0.1
    B       0.1    2.2    2.3

• All possible 3-emission paths Qk from S to E:

    path Q       p(Q | M) = ∏_n p(q_n | q_{n-1})    p(X | Q, M) = ∏_n p(x_n | q_n)    p(X, Q | M)
    S A A A E    .9 × .7 × .7 × .1 = 0.0441         2.5 × 0.2 × 0.1 =  0.05           0.0022
    S A A B E    .9 × .7 × .2 × .2 = 0.0252         2.5 × 0.2 × 2.3 =  1.15           0.0290
    S A B B E    .9 × .2 × .8 × .2 = 0.0288         2.5 × 2.2 × 2.3 = 12.65           0.3643
    S B B B E    .1 × .8 × .8 × .2 = 0.0128         0.1 × 2.2 × 2.3 =  0.506          0.0065

                                 Σ = 0.1109                              Σ = p(X | M) = 0.4020
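As a check on the table above, here is a short Python sketch (not part of the original slides) that enumerates every 3-emission path of M1 and accumulates p(X, Q | M); it reproduces p(X | M) ≈ 0.4020.

    from itertools import product

    # transitions p(q_{n+1} | q_n) for model M1; missing pairs are impossible (probability 0)
    a = {('S', 'A'): 0.9, ('S', 'B'): 0.1,
         ('A', 'A'): 0.7, ('A', 'B'): 0.2, ('A', 'E'): 0.1,
         ('B', 'B'): 0.8, ('B', 'E'): 0.2}

    # observation likelihoods b_q(x_n) from the table above
    b = {'A': [2.5, 0.2, 0.1],
         'B': [0.1, 2.2, 2.3]}

    total = 0.0
    for emitting in product('AB', repeat=3):              # candidate emitting-state sequences
        states = ('S',) + emitting + ('E',)
        p_Q = 1.0                                         # p(Q | M) = product of transition probs
        for prev, nxt in zip(states[:-1], states[1:]):
            p_Q *= a.get((prev, nxt), 0.0)
        p_X_given_Q = 1.0                                 # p(X | Q, M) = product of emission likelihoods
        for n, q in enumerate(emitting):
            p_X_given_Q *= b[q][n]
        if p_Q > 0.0:
            print(' '.join(states), round(p_Q, 4), round(p_X_given_Q, 4), round(p_Q * p_X_given_Q, 4))
        total += p_Q * p_X_given_Q

    print('p(X | M) =', round(total, 4))                  # 0.402, as in the table

Only the four paths in the table survive (the others would require impossible B→A transitions), and this brute-force sum grows exponentially with the number of frames, which is the motivation for the forward recursion that follows.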
The 'forward recursion'

• A dynamic-programming-like technique to calculate the sum over all Qk
• Define α_n(i) as the probability of getting to state q^i at time step n (by any path):
    α_n(i) = p(x_1, x_2, ..., x_n, q_n = q^i) ≡ p(X_1^n, q_n^i)
• Then α_{n+1}(j) can be calculated recursively:
    α_{n+1}(j) = [ Σ_{i=1..S} α_n(i) · a_ij ] · b_j(x_{n+1})
  [Figure: trellis of model states q^i against time steps n, with each α_n(i) feeding α_{n+1}(j) through the transition probabilities a_ij and the emission likelihood b_j(x_{n+1})]

Forward recursion (2)

• Initialize with α_1(i) = π_i · b_i(x_1)
• Then the total probability is
    p(X_1^N | M) = Σ_{i=1..S} α_N(i)
→ a practical way to solve for p(X | Mj) and hence perform recognition:
  evaluate p(X | Mj) · p(Mj) for each model and choose the best

Optimal path

• May be interested in the actual q_n assignments
  - which state was 'active' at each time frame
  - e.g. phone labelling (for training?)
• Total probability is over all paths...
• ...but can also solve for the single best path = the "Viterbi" state sequence
• Probability along the best path to state q^j at time n+1:
    α*_{n+1}(j) = max_i { α*_n(i) · a_ij } · b_j(x_{n+1})
  - backtrack from the final state to get the best path
  - the final probability is a product only (no sum)
    → in the log domain the calculation is just a summation
• Total probability is often dominated by the best path:
    p(X, Q* | M) ≈ p(X | M)

Interpreting the Viterbi path

• The Viterbi path assigns each x_n to a state q^i
  - performing classification based on b_i(x)
  - ...at the same time as applying the transition constraints a_ij
  [Figure: observation sequence x_n with its inferred classification; Viterbi labels: AAAAAAAABBBBBBBBBBBCCCCBBBBBBBC]
• Can be used for segmentation
  - train an HMM with 'garbage' and 'target' states
  - decode on new data to find 'targets' and their boundaries
• Can be used for (heuristic) training
  - e.g. train classifiers based on the Viterbi labels...
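The sketch below (again not from the original slides) implements both recursions for the two-emitting-state model M1 from the 'summing over all paths' example. Because M1 finishes through an explicit exit state E, the final sum and max are weighted by the exit probabilities a_iE; the forward pass reproduces p(X | M1) ≈ 0.4020 and the Viterbi pass picks out the best path S A B B E with probability ≈ 0.3643, matching the table.

    import numpy as np

    states = ['A', 'B']                       # emitting states of model M1
    pi     = np.array([0.9, 0.1])             # entry transitions p(A | S), p(B | S)
    A      = np.array([[0.7, 0.2],            # a_ij = p(q_{n+1} = j | q_n = i), emitting states only
                       [0.0, 0.8]])
    exit_p = np.array([0.1, 0.2])             # a_iE: transitions into the non-emitting exit state E
    B      = np.array([[2.5, 0.2, 0.1],       # b_i(x_n): observation likelihoods from the example
                       [0.1, 2.2, 2.3]])
    N = B.shape[1]

    # Forward recursion: alpha[n, i] = p(x_1 .. x_{n+1}, q_{n+1} = i | M)  (n is 0-based here)
    alpha = np.zeros((N, len(states)))
    alpha[0] = pi * B[:, 0]                           # alpha_1(i) = pi_i * b_i(x_1)
    for n in range(1, N):
        alpha[n] = (alpha[n-1] @ A) * B[:, n]         # [sum_i alpha_n(i) a_ij] * b_j(x_{n+1})
    print('p(X | M1) =', round(float(np.sum(alpha[-1] * exit_p)), 4))       # 0.402

    # Viterbi: the same recursion with max instead of sum, plus traceback
    delta = np.zeros((N, len(states)))
    back  = np.zeros((N, len(states)), dtype=int)
    delta[0] = pi * B[:, 0]
    for n in range(1, N):
        scores   = delta[n-1][:, None] * A            # scores[i, j] = delta_n(i) * a_ij
        back[n]  = np.argmax(scores, axis=0)          # best predecessor of each state j
        delta[n] = np.max(scores, axis=0) * B[:, n]
    last = int(np.argmax(delta[-1] * exit_p))
    best = [last]
    for n in range(N - 1, 0, -1):                     # traceback from the final state
        best.append(int(back[n, best[-1]]))
    path = ['S'] + [states[i] for i in reversed(best)] + ['E']
    print('best path:', ' '.join(path),               # S A B B E
          ' p(X, Q* | M1) =', round(float(np.max(delta[-1] * exit_p)), 4))  # 0.3643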
Recognition with HMMs

• Isolated word
  - choose the best p(M | X) ∝ p(X | M) · p(M)
  [Figure: the input is scored against word models M1 (w ah n), M2 (t uw), M3 (th r iy), giving p(X | M1)·p(M1), p(X | M2)·p(M2), p(X | M3)·p(M3)]
• Continuous speech
  - Viterbi decoding of one large HMM gives the words
  [Figure: the word models w-ah-n, t-uw, th-r-iy joined in parallel through a silence state (sil), with entry priors p(M1), p(M2), p(M3)]

HMM example: Different state sequences

[Figure: two models, M1 and M2, built as different state sequences over the states K, O, A, T, sharing the same emission distributions p(x | O), p(x | K), p(x | T), p(x | A)]

Model inference: Emission probabilities

• Scoring the same observation sequence x_n against the two models:

                        Model M1    Model M2
    log p(X | M)          -32.1       -47.0
    log p(X, Q* | M)      -33.5       -47.5
    log p(Q* | M)          -7.5        -8.3
    log p(X | Q*, M)      -26.0       -39.2

  - the gap between the models comes almost entirely from the emission likelihoods along the best alignment

Model inference: Transition probabilities

• Two variants, M'1 and M'2, differ only in their transition probabilities; the best alignment gives the same emission likelihood for both, log p(X | Q*, M) = -26.0, so the overall scores stay close:

                        Model M'1   Model M'2
    log p(X | M)          -32.2       -33.5
    log p(X, Q* | M)      -33.6       -34.9
    log p(Q* | M)          -7.6        -8.9

Validity of HMM assumptions

• The key assumption is conditional independence:
  given q^i, future evolution & the observation distribution are independent of previous events
  - duration behavior: self-loops imply an exponentially decaying (geometric) duration distribution,
      p(N = n) = γ^(n-1) · (1 - γ)    (γ = self-loop probability)
  - independence of successive x_n's:
      p(X) = ∏_n p(x_n | q^i) ?

Recap: Recognizer structure

    sound → feature calculation → feature vectors
          → acoustic classifier (network weights) → phone probabilities
          → HMM decoder (word models, language model) → phone & word labeling

• Know how to execute each stage
• ... training HMMs?
• ... language/word models?

Summary

• Speech is modeled as a sequence of features
  - need the temporal aspect in recognition
  - best time-alignment of templates = DTW
• Hidden Markov models are the rigorous solution
  - self-loops allow temporal dilation
  - exact, efficient likelihood calculations