Lecture 5: Acoustic Modeling with Gaussians

CS 224S / LINGUIST 285
Spoken Language Processing
Dan Jurafsky, Stanford University, Spring 2014

Outline for Today
—  Acoustic Model
—  Increasingly sophisticated models
—  Acoustic Likelihood for each state:
—  Gaussians
—  Multivariate Gaussians
—  Mixtures of Multivariate Gaussians
—  Where a state is progressively:
—  CI Subphone (3ish per phone)
—  CD phone (= triphones)
—  State-tying of CD phones
—  The Embedded Training Algorithm

Problem: how to apply the HMM model to continuous observations?
—  We have assumed that the output alphabet V has a finite number of symbols
—  But spectral feature vectors are real-valued!
—  How to deal with real-valued features?
—  Decoding: Given ot, how to compute P(ot|q)
—  Learning: How to modify EM to deal with real-valued features

Vector Quantization
—  Create a training set of feature vectors
—  Cluster them into a small number of classes
—  Represent each class by a discrete symbol
—  For each class vk, we can compute the probability that it is generated by a given HMM state using Baum-Welch as above

VQ
—  We'll define a
—  Codebook, which lists for each symbol
—  A prototype vector, or codeword
—  If we had 256 classes ('8-bit VQ'),
—  A codebook with 256 prototype vectors
—  Given an incoming feature vector, we compare it to each of the 256 prototype vectors
—  We pick whichever one is closest (by some distance metric)
—  And replace the input vector by the index of this prototype vector

VQ
VQ requirements
—  A distance metric or distortion metric
—  Specifies how similar two vectors are
—  Used:
—  to build clusters
—  to find the prototype vector for a cluster
—  and to compare incoming vectors to prototypes
—  A clustering algorithm
—  K-means, etc.

Simplest distance metric: Euclidean distance
—  Actually square of Euclidean distance

    d^2(x, y) = \sum_{i=1}^{D} (x_i - y_i)^2

—  Also called 'sum-squared error'

Distance metrics: Mahalanobis distance
—  Actually square of Mahalanobis distance
—  Assume that each dimension i of the feature vector has variance σi^2

    d^2(x, y) = \sum_{i=1}^{D} \frac{(x_i - y_i)^2}{\sigma_i^2}

—  Equation above assumes a diagonal covariance matrix; more on this later
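As a small illustration of the two metrics (not from the original slides), here is a NumPy sketch; the arrays x, y, and var are made-up example values, and var holds the per-dimension variances σi^2 assumed by the diagonal-covariance form above.

    import numpy as np

    def euclidean_sq(x, y):
        """Squared Euclidean distance ('sum-squared error')."""
        diff = x - y
        return float(np.sum(diff ** 2))

    def mahalanobis_sq_diag(x, y, var):
        """Squared Mahalanobis distance with a diagonal covariance:
        each dimension i is scaled by its variance var[i] = sigma_i**2."""
        diff = x - y
        return float(np.sum(diff ** 2 / var))

    # Toy 3-dimensional feature vectors (illustrative values only).
    x = np.array([1.0, 2.0, 0.5])
    y = np.array([0.5, 1.0, 1.5])
    var = np.array([0.25, 2.0, 4.0])        # per-dimension variances sigma_i^2

    print(euclidean_sq(x, y))               # 2.25
    print(mahalanobis_sq_diag(x, y, var))   # 1.75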
Training a VQ system (generating codebook): K-means clustering
1.  Initialization: Choose M vectors from the L training vectors as initial code words
2.  Search: For each training vector, find the closest code word; assign this training vector to that cell
3.  Centroid Update: For each cell, compute the centroid of that cell. The new code word is the centroid.
4.  Repeat (2)-(3) until the average distance falls below a threshold (or no change)
Slide from John-Paul Hosum
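A minimal sketch of this K-means codebook training loop, assuming plain NumPy and squared Euclidean distance; the function names and the convergence test are illustrative choices, not part of the original slides.

    import numpy as np

    def train_codebook(train_vecs, M, n_iter=100, seed=0):
        """K-means VQ codebook training.
        train_vecs: (L, D) array of L training feature vectors.
        Returns an (M, D) array of code words (cluster centroids)."""
        rng = np.random.default_rng(seed)
        # 1. Initialization: choose M of the L training vectors as initial code words.
        codebook = train_vecs[rng.choice(len(train_vecs), size=M, replace=False)].copy()
        for _ in range(n_iter):
            # 2. Search: assign each training vector to the closest code word.
            dists = ((train_vecs[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
            cells = dists.argmin(axis=1)
            # 3. Centroid update: each cell's new code word is its centroid.
            new_codebook = np.array([
                train_vecs[cells == k].mean(axis=0) if np.any(cells == k) else codebook[k]
                for k in range(M)
            ])
            # 4. Repeat until (here) the code words stop changing.
            if np.allclose(new_codebook, codebook):
                break
            codebook = new_codebook
        return codebook

    def quantize(vec, codebook):
        """Return the index of the closest code word (the VQ symbol)."""
        return int(((codebook - vec) ** 2).sum(axis=1).argmin())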
Vector Quantization
Given data points, split into 4 codebook vectors with initial values at (2,2), (4,6), (6,5), and (8,8)
Slide from John-Paul Hosum
Vector Quantization
•  Example
Compute centroids of each codebook cluster, re-compute nearest neighbor, re-compute centroids...
Slide from John-Paul Hosum
Vector Quantization
•  Example
Once there's no more change, the feature space will be partitioned into 4 regions. Any input feature can be classified as belonging to one of the 4 regions. The entire codebook can be specified by the 4 centroid points.
Slide from John-Paul Hosum
Summary: VQ
—  To compute p(ot|qj)
—  Compute distance between feature vector ot
—  and each codeword (prototype vector)
—  in a preclustered codebook
—  where distance is either
—  Euclidean
—  Mahalanobis
—  Choose the vector that is closest to ot
—  and take its codeword vk
—  And then look up the likelihood of vk given HMM state j in the B matrix
—  Bj(ot) = bj(vk) s.t. vk is the codeword of the closest prototype to ot
—  Using Baum-Welch as above
Computing bj(vk)
[Figure: VQ codewords plotted by feature value 1 and feature value 2 for state j]
•  bj(vk) = (number of vectors with codebook index k in state j) / (number of vectors in state j) = 14/56 = 1/4
Slide from John-Paul Hosum
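Putting the VQ pieces together, here is a sketch (not from the slides) of how p(ot|qj) would be looked up: quantize the frame, then read the symbol likelihood from a B matrix estimated by relative frequency as in the 14/56 example above. The names codebook, B, and estimate_B are illustrative assumptions.

    import numpy as np

    def vq_likelihood(o_t, j, codebook, B):
        """p(o_t | state j) under VQ: quantize o_t to its nearest code word v_k,
        then read b_j(v_k) from B, an (n_states, n_codewords) likelihood matrix."""
        k = int(((codebook - o_t) ** 2).sum(axis=1).argmin())  # nearest prototype
        return B[j, k]

    def estimate_B(symbols, states, n_states, n_codewords):
        """Relative-frequency estimate of b_j(v_k) from state-labeled, quantized frames."""
        B = np.zeros((n_states, n_codewords))
        for k, j in zip(symbols, states):
            B[j, k] += 1
        return B / np.maximum(B.sum(axis=1, keepdims=True), 1)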
Summary: VQ
—  Training: —  Do VQ and then use Baum-­‐Welch to assign probabili<es to each symbol —  Decoding: —  Do VQ and then use the symbol probabili<es in decoding Directly Modeling Continuous
Observations
—  Gaussians —  Univariate Gaussians —  Baum-­‐Welch for univariate Gaussians —  Mul<variate Gaussians —  Baum-­‐Welch for mul<variate Gausians —  Gaussian Mixture Models (GMMs) —  Baum-­‐Welch for GMMs Better than VQ
—  VQ is insufficient for real ASR
—  Instead: Assume the possible values of the observation feature vector ot are normally distributed.
—  Represent the observation likelihood function bj(ot) as a Gaussian with mean µj and variance σj^2

    f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

Gaussians are parameterized by mean and variance
Reminder: means and variances
—  For a discrete random variable X
—  Mean is the expected value of X
—  Weighted sum over the values of X
—  Variance is the average squared deviation from the mean

Gaussian as Probability Density Function
—  A Gaussian is a probability density function; probability is the area under the curve.
—  To make it a probability, constrain the area under the curve to equal 1.

Gaussian PDFs
—  BUT…
—  We will be using "point estimates": the value of the Gaussian at a point.
—  Technically these are not probabilities, since a pdf gives a probability over an interval and needs to be multiplied by dx
—  As we will see later, this is OK since the same factor is omitted from all Gaussians, so the argmax is still correct.

Gaussians for Acoustic Modeling
—  P(o|q): A Gaussian parameterized by mean and variance
[Figure: Gaussians with different means; P(o|q) is highest at the mean and low far from the mean]

Using a (univariate) Gaussian as an acoustic likelihood estimator
—  Let's suppose our observation was a single real-valued feature (instead of a 39D vector)
—  Then if we had learned a Gaussian over the distribution of values of this feature
—  We could compute the likelihood of any given observation ot as follows:

    b_j(o_t) = \frac{1}{\sigma_j\sqrt{2\pi}} \exp\left(-\frac{(o_t-\mu_j)^2}{2\sigma_j^2}\right)
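For instance, a short NumPy sketch of that computation (illustrative, not from the slides), where mu_j and var_j are the state's learned mean and variance:

    import numpy as np

    def gaussian_likelihood(o_t, mu_j, var_j):
        """b_j(o_t) for a univariate Gaussian with mean mu_j and variance var_j."""
        return np.exp(-(o_t - mu_j) ** 2 / (2.0 * var_j)) / np.sqrt(2.0 * np.pi * var_j)

    print(gaussian_likelihood(1.0, 0.0, 1.0))  # density of N(0,1) at 1.0, about 0.2420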
Training a Univariate Gaussian
—  A (single) Gaussian is characterized by a mean and a variance
—  Imagine that we had some training data in which each state was labeled
—  We could just compute the mean and variance from the data:

    \mu_i = \frac{1}{T} \sum_{t=1}^{T} o_t \quad \text{s.t. } q_t \text{ is state } i

    \sigma_i^2 = \frac{1}{T} \sum_{t=1}^{T} (o_t - \mu_i)^2 \quad \text{s.t. } q_t \text{ is state } i
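A brief sketch of those two estimates, assuming a 1-D observation array obs and a parallel array state_labels of hand-labeled state ids (both names are made up for the example):

    import numpy as np

    def fit_state_gaussian(obs, state_labels, i):
        """Mean and variance of the observations labeled with state i."""
        o_i = obs[state_labels == i]
        mu_i = o_i.mean()
        var_i = ((o_i - mu_i) ** 2).mean()
        return mu_i, var_i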
Training Univariate Gaussians
—  But we don't know which observation was produced by which state!
—  What we want: to assign each observation vector ot to every possible state i, prorated by the probability that the HMM was in state i at time t.
—  The probability of being in state i at time t is γt(i)!!

    \mu_i = \frac{\sum_{t=1}^{T} \gamma_t(i)\, o_t}{\sum_{t=1}^{T} \gamma_t(i)}

    \sigma_i^2 = \frac{\sum_{t=1}^{T} \gamma_t(i)\,(o_t - \mu_i)^2}{\sum_{t=1}^{T} \gamma_t(i)}
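The soft-count version of the same estimates, as a sketch: gamma is assumed to be a (T, N) array of state posteriors γt(i) from forward-backward, and obs a length-T observation array; the names are illustrative.

    import numpy as np

    def reestimate_state_gaussian(obs, gamma, i):
        """Gamma-weighted mean and variance for state i (univariate case)."""
        w = gamma[:, i]                      # gamma_t(i) for t = 1..T
        mu_i = np.sum(w * obs) / np.sum(w)
        var_i = np.sum(w * (obs - mu_i) ** 2) / np.sum(w)
        return mu_i, var_i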
Multivariate Gaussians
—  Instead of a single mean µ and variance σ:

    f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

—  Vector of observations x modeled by vector of means µ and covariance matrix Σ:

    f(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)
Multivariate Gaussians
—  Defining µ and Σ:

    \mu = E(x)

    \Sigma = E\left[(x-\mu)(x-\mu)^T\right]

—  So the (i,j)-th element of Σ is:

    \sigma_{ij}^2 = E\left[(x_i - \mu_i)(x_j - \mu_j)\right]
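As a sketch (not from the slides), here is the full-covariance density above evaluated in NumPy; in practice one works with its logarithm, as discussed later under "Log domain".

    import numpy as np

    def mvn_logpdf(x, mu, Sigma):
        """log f(x | mu, Sigma) for a D-dimensional Gaussian with full covariance.
        Assumes Sigma is positive definite."""
        D = len(mu)
        diff = x - mu
        sign, logdet = np.linalg.slogdet(Sigma)     # log |Sigma| (sign should be +1)
        maha = diff @ np.linalg.solve(Sigma, diff)  # (x-mu)^T Sigma^{-1} (x-mu)
        return -0.5 * (D * np.log(2 * np.pi) + logdet + maha)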
Gaussian Intuitions: Size of Σ
—  µ = [0 0], µ = [0 0], µ = [0 0]
—  Σ = I, Σ = 0.6I, Σ = 2I
—  As Σ becomes larger, the Gaussian becomes more spread out; as Σ becomes smaller, the Gaussian becomes more compressed
From Andrew Ng's lecture notes for CS229
[Figure: two diagonal-covariance Gaussians, Σ = [1 0; 0 1] and Σ = [.6 0; 0 2]]
—  Different variances in different dimensions
From Chen, Picheny et al lecture slides

Gaussian Intuitions: Off-diagonal
—  As we increase the off-diagonal entries, more correlation between the value of x and the value of y
Text and figures from Andrew Ng's lecture notes for CS229
Gaussian Intuitions: off-diagonal and diagonal
—  Decreasing non-diagonal entries (#1-2)
—  Increasing the variance of one dimension in the diagonal (#3)
Text and figures from Andrew Ng's lecture notes for CS229
In two dimensions
From Chen, Picheny et al lecture slides
But: assume diagonal covariance
—  I.e., assume that the features in the feature vector are uncorrelated
—  This isn't true for FFT features, but is true for MFCC features, as we will see next time.
—  Computation and storage are much cheaper with diagonal covariance.
—  I.e., only diagonal entries are non-zero
—  Diagonal contains the variance of each dimension σii^2
—  So this means we consider the variance of each acoustic feature (dimension) separately

Diagonal covariance
—  Diagonal contains the variance of each dimension σii^2
—  So this means we consider the variance of each acoustic feature (dimension) separately

    b_j(o_t) = \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi\sigma_{jd}^2}} \exp\left(-\frac{1}{2}\left(\frac{o_{td}-\mu_{jd}}{\sigma_{jd}}\right)^2\right)

    b_j(o_t) = \frac{1}{(2\pi)^{D/2} \prod_{d=1}^{D}\sigma_{jd}} \exp\left(-\frac{1}{2}\sum_{d=1}^{D}\frac{(o_{td}-\mu_{jd})^2}{\sigma_{jd}^2}\right)
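A sketch of the diagonal-covariance b_j(o_t) above (computed in the log domain and exponentiated at the end, matching the second form); mu_j and var_j are assumed to be length-D arrays of the state's per-dimension means and variances.

    import numpy as np

    def bj_diag(o_t, mu_j, var_j):
        """b_j(o_t) for a single Gaussian with diagonal covariance.
        o_t, mu_j, var_j: length-D arrays; var_j[d] = sigma_jd^2."""
        log_b = -0.5 * np.sum(np.log(2 * np.pi * var_j) + (o_t - mu_j) ** 2 / var_j)
        return np.exp(log_b)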
Baum-Welch re-estimation equations for multivariate Gaussians
—  Natural extension of the univariate case, where now µi is the mean vector for state i:

    \hat{\mu}_i = \frac{\sum_{t=1}^{T} \gamma_t(i)\, o_t}{\sum_{t=1}^{T} \gamma_t(i)}

    \hat{\Sigma}_i = \frac{\sum_{t=1}^{T} \gamma_t(i)\,(o_t-\mu_i)(o_t-\mu_i)^T}{\sum_{t=1}^{T} \gamma_t(i)}
But we’re not there yet
—  A single Gaussian may do a bad job of modeling the distribution in any dimension:
—  Solution: Mixtures of Gaussians
Figure from Chen, Picheny et al slides
Mixture of Gaussians to model a function
[Figure: a weighted sum of Gaussians approximating an arbitrary density]
Mixtures of Gaussians
—  M mixtures of Gaussians:

    f(x \mid \mu_{jk}, \Sigma_{jk}) = \sum_{k=1}^{M} c_{jk} \frac{1}{(2\pi)^{D/2} |\Sigma_{jk}|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_{jk})^T \Sigma_{jk}^{-1} (x-\mu_{jk})\right)

—  For diagonal covariance:

    b_j(o_t) = \sum_{k=1}^{M} c_{jk} \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi\sigma_{jkd}^2}} \exp\left(-\frac{1}{2}\left(\frac{o_{td}-\mu_{jkd}}{\sigma_{jkd}}\right)^2\right)

    b_j(o_t) = \sum_{k=1}^{M} \frac{c_{jk}}{(2\pi)^{D/2} \prod_{d=1}^{D}\sigma_{jkd}} \exp\left(-\frac{1}{2}\sum_{d=1}^{D}\frac{(o_{td}-\mu_{jkd})^2}{\sigma_{jkd}^2}\right)
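A sketch of the diagonal-covariance GMM output likelihood above, assuming per-state arrays c (M mixture weights), mu (M×D means), and var (M×D variances); the log-sum-exp trick is an implementation detail for numerical stability, not something on the slide.

    import numpy as np

    def bj_gmm_diag(o_t, c, mu, var):
        """b_j(o_t) for an M-component GMM with diagonal covariances.
        c: (M,) mixture weights; mu, var: (M, D) per-component means/variances."""
        log_comp = -0.5 * np.sum(np.log(2 * np.pi * var) + (o_t - mu) ** 2 / var, axis=1)
        log_weighted = np.log(c) + log_comp
        m = log_weighted.max()                       # log-sum-exp for stability
        return np.exp(m) * np.sum(np.exp(log_weighted - m))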
GMMs
—  Summary: each state has a likelihood function parameterized by:
—  M mixture weights
—  M mean vectors of dimensionality D
—  Either
—  M covariance matrices of D×D
—  Or more likely
—  M diagonal covariance matrices of D×D
—  which is equivalent to
—  M variance vectors of dimensionality D

Training a GMM
—  Problem: how do we train a GMM if we don't know what component is accounting for aspects of any particular observation?
—  Intuition: we use Baum-Welch to find it for us, just as we did for finding the hidden states that accounted for the observation

Baum-Welch for Mixture Models
—  By analogy with γ earlier, let's define the probability of being in state j at time t with the mth mixture component accounting for ot:

    \gamma_{tm}(j) = \frac{\sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, c_{jm}\, b_{jm}(o_t)\, \beta_t(j)}{\alpha_F(T)}

—  Now,

    c_{jm} = \frac{\sum_{t=1}^{T} \gamma_{tm}(j)}{\sum_{t=1}^{T}\sum_{k=1}^{M} \gamma_{tk}(j)}

    \mu_{jm} = \frac{\sum_{t=1}^{T} \gamma_{tm}(j)\, o_t}{\sum_{t=1}^{T}\sum_{k=1}^{M} \gamma_{tk}(j)}

    \Sigma_{jm} = \frac{\sum_{t=1}^{T} \gamma_{tm}(j)\,(o_t-\mu_{jm})(o_t-\mu_{jm})^T}{\sum_{t=1}^{T}\sum_{k=1}^{M} \gamma_{tk}(j)}
How to train mixtures?
—  Choose M (often 16; or can tune M depending on the amount of training observations)
—  Then can do various splitting or clustering algorithms. One simple method for "splitting":
—  Compute the global mean µ and global variance
1.  Split into two Gaussians, with means µ ± ε (sometimes ε is 0.2σ)
2.  Run Forward-Backward to retrain
3.  Go to 1 until we have 16 mixtures
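A sketch of the splitting step just described (the Forward-Backward retraining between splits is omitted); the names means, variances, weights, and eps_frac are illustrative, with ε taken as 0.2σ per dimension as the slide suggests.

    import numpy as np

    def split_mixtures(means, variances, weights, eps_frac=0.2):
        """Double the number of mixture components by perturbing each mean by
        +/- eps, where eps = eps_frac * sigma (per dimension).
        means, variances: (M, D); weights: (M,). Returns arrays of size 2M."""
        eps = eps_frac * np.sqrt(variances)
        new_means = np.concatenate([means + eps, means - eps], axis=0)
        new_vars = np.concatenate([variances, variances], axis=0)
        new_weights = np.concatenate([weights, weights], axis=0) / 2.0
        return new_means, new_vars, new_weights

    # Repeat: split, retrain with Forward-Backward, split again... until M components.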
Embedded Training
—  Components of a speech recognizer:
—  Feature extraction: not statistical
—  Language model: word transition probabilities, trained on some other corpus
—  Acoustic model:
—  Pronunciation lexicon: the HMM structure for each word, built by hand
—  Observation likelihoods bj(ot)
—  Transition probabilities aij

Embedded training of acoustic model
—  If we had hand-segmented and hand-labeled training data
—  With word and phone boundaries
—  We could just compute
—  B: means and variances of all our triphone Gaussians
—  A: transition probabilities
—  And we'd be done!
—  But we don't have word and phone boundaries, nor phone labeling

Embedded training
—  Instead:
—  We'll train each phone HMM embedded in an entire sentence
—  We'll do word/phone segmentation and alignment automatically as part of the training process

Embedded Training
Initialization: “Flat start”
—  Transition probabilities:
—  set to zero any that you want to be "structurally zero"
—  The γ probability computation includes the previous value of aij, so if it's zero it will never change
—  Set the rest to identical values
—  Likelihoods:
—  initialize µ and σ of each state to the global mean and variance of all training data

Embedded Training
—  Given: phoneset, lexicon, transcribed wavefiles
—  Build a whole-sentence HMM for each sentence
—  Initialize A probs to 0.5, or to zero
—  Initialize B probs to the global mean and variance
—  Run multiple iterations of Baum-Welch
—  During each iteration, we compute forward and backward probabilities
—  Use them to re-estimate A and B
—  Run Baum-Welch until convergence.

Viterbi training
—  Baum-Welch training says:
—  We need to know what state we were in, to accumulate counts of a given output symbol ot
—  We'll compute γt(i), the probability of being in state i at time t, by using forward-backward to sum over all possible paths that might have been in state i and output ot.
—  Viterbi training says:
—  Instead of summing over all possible paths, just take the single most likely path
—  Use the Viterbi algorithm to compute this "Viterbi" path
—  Via "forced alignment"

Forced Alignment
—  Computing the "Viterbi path" over the training data is called "forced alignment"
—  Because we know which word string to assign to each observation sequence.
—  We just don't know the state sequence.
—  So we use aij to constrain the path to go through the correct words
—  And otherwise do normal Viterbi
—  Result: state sequence!

Viterbi training equations
Viterbi (for all pairs of emitting states, 1 <= i, j <= N):

    \hat{a}_{ij} = \frac{n_{ij}}{n_i}

    \hat{b}_j(v_k) = \frac{n_j(\text{s.t. } o_t = v_k)}{n_j}

Baum-Welch uses the soft counts γ in place of these hard counts.
Where nij is number of frames with transition from i to j in best path
And nj is number of frames where state j is occupied
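A sketch of this count-based re-estimation, assuming we already have a forced-alignment state_seq (one state id per frame) and the VQ symbols for the discrete-output case; the names are illustrative.

    import numpy as np

    def viterbi_counts(state_seq, symbols, n_states, n_symbols):
        """Re-estimate A and B from a single best (forced-alignment) state path.
        a_hat[i, j] = n_ij / n_i ;  b_hat[j, k] = n_j(o_t = v_k) / n_j."""
        n_ij = np.zeros((n_states, n_states))
        n_jk = np.zeros((n_states, n_symbols))
        for t in range(len(state_seq) - 1):
            n_ij[state_seq[t], state_seq[t + 1]] += 1      # transition counts
        for j, k in zip(state_seq, symbols):
            n_jk[j, k] += 1                                # emission counts
        a_hat = n_ij / np.maximum(n_ij.sum(axis=1, keepdims=True), 1)
        b_hat = n_jk / np.maximum(n_jk.sum(axis=1, keepdims=True), 1)
        return a_hat, b_hat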
Viterbi Training
—  Much faster than Baum-Welch
—  But doesn't work quite as well
—  But the tradeoff is often worth it.

Viterbi training (II)
—  Equations for non-mixture Gaussians:

    \mu_i = \frac{1}{N_i} \sum_{t=1}^{T} o_t \quad \text{s.t. } q_t = i

    \sigma_i^2 = \frac{1}{N_i} \sum_{t=1}^{T} (o_t - \mu_i)^2 \quad \text{s.t. } q_t = i

—  Viterbi training for mixture Gaussians is more complex; generally just assign each observation to 1 mixture

Log domain
—  In practice, do all computation in the log domain
—  Avoids underflow
—  Instead of multiplying lots of very small probabilities, we add numbers that are not so small.
—  Single multivariate Gaussian (diagonal Σ), compute:

    b_j(o_t) = \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi\sigma_{jd}^2}} \exp\left(-\frac{(o_{td}-\mu_{jd})^2}{2\sigma_{jd}^2}\right)

—  In log space:

    \log b_j(o_t) = -\frac{1}{2}\sum_{d=1}^{D}\left[\log(2\pi) + \log\sigma_{jd}^2 + \frac{(o_{td}-\mu_{jd})^2}{\sigma_{jd}^2}\right]

Log domain
—  Repeating:

    \log b_j(o_t) = -\frac{1}{2}\sum_{d=1}^{D}\left[\log(2\pi) + \log\sigma_{jd}^2 + \frac{(o_{td}-\mu_{jd})^2}{\sigma_{jd}^2}\right]

—  With some rearrangement of terms:

    \log b_j(o_t) = C_j - \frac{1}{2}\sum_{d=1}^{D}\frac{(o_{td}-\mu_{jd})^2}{\sigma_{jd}^2}

—  Where:

    C_j = -\frac{1}{2}\sum_{d=1}^{D}\left[\log(2\pi) + \log\sigma_{jd}^2\right]

—  Note that this looks like a weighted Mahalanobis distance!!!
—  Also may justify why these aren't really probabilities (point estimates); these are really just distances.
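A sketch of the log-domain evaluation, precomputing the per-state constant C_j once so that only the Mahalanobis-like term depends on each frame; the function names are illustrative.

    import numpy as np

    def precompute_const(var_j):
        """C_j = -0.5 * sum_d (log(2*pi) + log sigma_jd^2), computed once per state."""
        return -0.5 * np.sum(np.log(2 * np.pi) + np.log(var_j))

    def log_bj_diag(o_t, mu_j, var_j, C_j):
        """log b_j(o_t) = C_j - 0.5 * sum_d (o_td - mu_jd)^2 / sigma_jd^2."""
        return C_j - 0.5 * np.sum((o_t - mu_j) ** 2 / var_j)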
Context-Dependent Phones (triphones)
Phonetic context: different “eh”s
w eh d    |    y eh l    |    b eh n
Modeling phonetic context
—  The strongest factor affecting phonetic variability is the neighboring phone
—  How to model that in HMMs?
—  Idea: have phone models which are specific to context.
—  Instead of Context-Independent (CI) phones
—  We'll have Context-Dependent (CD) phones

CD phones: triphones
—  Triphones —  Each triphone captures facts about preceding and following phone —  Monophone: —  p, t, k —  Triphone: —  iy-­‐p+aa —  a-­‐b+c means “phone b, preceding by phone a, followed by phone c” “Need” with triphone models
Word-Boundary Modeling
—  Word-Internal Context-Dependent Models
'OUR LIST': SIL AA+R AA-R L+IH L-IH+S IH-S+T S-T
—  Cross-Word Context-Dependent Models
'OUR LIST': SIL-AA+R AA-R+L R-L+IH L-IH+S IH-S+T S-T+SIL
—  Dealing with cross-words makes decoding harder!
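A sketch (not from the slides) of expanding a phone string into word-internal vs. cross-word triphone names of the form a-b+c; the helper name to_triphones and the simple SIL handling at the edges are illustrative assumptions.

    def to_triphones(words, cross_word=True):
        """words: list of phone lists, e.g. [['AA', 'R'], ['L', 'IH', 'S', 'T']].
        Returns triphone labels 'a-b+c'; word-internal models drop context across
        word boundaries, cross-word models keep it (with SIL at the edges)."""
        out = []
        if cross_word:
            phones = ['SIL'] + [p for w in words for p in w] + ['SIL']
            for i in range(1, len(phones) - 1):
                out.append(f"{phones[i-1]}-{phones[i]}+{phones[i+1]}")
        else:
            for w in words:
                for i, p in enumerate(w):
                    left = w[i - 1] + '-' if i > 0 else ''
                    right = '+' + w[i + 1] if i < len(w) - 1 else ''
                    out.append(f"{left}{p}{right}")
        return out

    # 'OUR LIST' as on the slide:
    print(to_triphones([['AA', 'R'], ['L', 'IH', 'S', 'T']], cross_word=True))
    # ['SIL-AA+R', 'AA-R+L', 'R-L+IH', 'L-IH+S', 'IH-S+T', 'S-T+SIL']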
Implications of Cross-Word Triphones
—  Possible triphones: 50 × 50 × 50 = 125,000
—  How many triphone types actually occur?
—  20K-word WSJ Task, numbers from Young et al
—  Cross-word models: need 55,000 triphones
—  But in the training data only 18,500 triphones occur!
—  Need to generalize models

Modeling phonetic context: some contexts look similar
w iy    |    r iy    |    m iy    |    n iy
Solution: State Tying
—  Young, Odell, Woodland 1994
—  Decision-tree based clustering of triphone states
—  States which are clustered together will share their Gaussians
—  We call this "state tying", since these states are "tied together" to the same Gaussian.
—  Previous work: generalized triphones
—  Model-based clustering ('model' = 'phone')
—  Clustering at the state level is more fine-grained

Young et al state tying
State tying/clustering
—  How do we decide which triphones to cluster together?
—  Use phonetic features (or 'broad phonetic classes')
—  Stop
—  Nasal
—  Fricative
—  Sibilant
—  Vowel
—  Lateral

Decision tree for clustering triphones for tying
State Tying: Young, Odell, Woodland 1994
!"
+,-./01!&.23&3453&6.
*!&')6.718**!1&
2396)*
+:-.;)3&6.23&3453&6*
.#3.#0!453&6*
#$!"%&
#$!"%&'
($!"%)
*$!"%)
<<<.
+>-.;)8*#60.1&9.#!6.
#0!453&6*
#$!"%&
#$!"%&'
($!"%)
*$!"%)
<<<.
+?-.@A41&9.#3.
7BB*
#$!"%&
#$!"%&'
($!"%)
*$!"%)
<<<.
Summary: Acoustic Modeling for LVCSR
—  Increasingly sophisticated models
—  For each state:
—  Gaussians
—  Multivariate Gaussians
—  Mixtures of Multivariate Gaussians
—  Where a state is progressively:
—  CI Phone
—  CI Subphone (3ish per phone)
—  CD phone (= triphones)
—  State-tying of CD phones
—  Forward-Backward Training
—  Viterbi training
—  Latest innovation: Use DNNs instead of Gaussians

Summary: ASR Architecture
—  Five easy pieces: ASR Noisy Channel architecture
—  Feature Extraction:
—  39 "MFCC" features
—  Acoustic Model:
—  Gaussians for computing p(o|q)
—  Lexicon/Pronunciation Model
—  HMM: what phones can follow each other
—  Language Model
—  N-grams for computing p(wi|wi-1)
—  Decoder
—  Viterbi algorithm: dynamic programming for combining all these to get the word sequence from speech!