Digital Communication and Computation
Final Report on
Hidden Markov Models
Computer Science and Information Engineering, Year 4, B86506009, 鄭家駿
I. Background of Speech Recognition
1. A Mathematical Formulation
A = a_1, a_2, …, a_m : the acoustic evidence (data), a sequence of symbols taken from some alphabet A, on the basis of which the recognizer will make its decision about which words were spoken.
W = w_1, w_2, …, w_n : a string of n words, each belonging to a fixed and known vocabulary V.
If P(W|A) denotes the probability that the words W were spoken given that the evidence A was observed, then the recognizer should decide in favor of a word string Ŵ satisfying

Ŵ = arg max_W P(W|A)

That is, the recognizer picks the most likely word string given the observed acoustic evidence. By Bayes' rule,

P(W|A) = P(W) P(A|W) / P(A)

Since the maximization is carried out with A fixed, the denominator can be ignored and

Ŵ = arg max_W P(W) P(A|W)    (1)
where P(W) is the probability that the word string W will be uttered, and P(A|W) is the probability that the acoustic evidence A will be observed when the speaker says W.
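As a minimal illustration of the decision rule (1), the sketch below picks the best word string from a small list of candidates. The functions p_w (language model) and p_a_given_w (acoustic model) and their toy values are hypothetical placeholders, not something taken from the report.

```python
# Minimal sketch of the decision rule (1): W_hat = argmax_W P(W) P(A|W).
# p_w and p_a_given_w are hypothetical stand-ins for a language model and
# an acoustic model; in a real recognizer both are estimated from data.

def p_w(words):
    # toy language-model prior P(W)
    table = {("this", "is", "speech"): 0.6, ("this", "is", "peach"): 0.4}
    return table.get(tuple(words), 0.0)

def p_a_given_w(acoustics, words):
    # toy acoustic-model likelihood P(A|W)
    scores = {("this", "is", "speech"): 0.2, ("this", "is", "peach"): 0.5}
    return scores.get(tuple(words), 0.0)

def recognize(acoustics, candidates):
    # pick the candidate word string maximizing P(W) * P(A|W)
    return max(candidates, key=lambda w: p_w(w) * p_a_given_w(acoustics, w))

candidates = [["this", "is", "speech"], ["this", "is", "peach"]]
print(recognize("dummy acoustic evidence", candidates))
```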
2. Acoustic Processing
This is the first step in speech recognition: one needs to decide on a front end that will transform the pressure waveform into the symbols a_i with which the recognizer will deal.
3. Acoustic Modeling
In (1), we need to know the value of P(A|W) for all possible pairings of A and W. Since the number of different possible pairings is far too large to handle directly, we need a statistical acoustic model of the speaker's interaction with the acoustic processor. The model covers the way the speaker pronounces W, the ambience, the microphone placement and characteristics, and the acoustic processing performed by the front end. The usual model is the hidden Markov model.
4. Language Modeling
In (1), we also need to know P(W), the a priori probability that the speaker wishes to utter W. We decompose P(W) as

P(W) = Π_i P(w_i | w_1, …, w_{i-1})

In practice, the conditioning histories w_1, …, w_{i-1} are far too numerous for these probabilities to be estimated directly. Therefore, we replace the entire history by an equivalence class Φ(w_1, …, w_{i-1}) and use

P(W) = Π_i P(w_i | Φ(w_1, …, w_{i-1}))
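A common choice of equivalence class, assumed here only for illustration, keeps just the last word of the history, Φ(w_1, …, w_{i-1}) = w_{i-1} (a bigram model). A minimal sketch of estimating such a model from counts:

```python
# Bigram language-model sketch: the equivalence class keeps only the previous
# word, Phi(w_1, ..., w_{i-1}) = w_{i-1}.  The corpus below is a toy example.
from collections import defaultdict

def train_bigram(sentences):
    # count pairs (w_{i-1}, w_i), with a start-of-sentence marker "<s>"
    pair_counts = defaultdict(int)
    history_counts = defaultdict(int)
    for sentence in sentences:
        words = ["<s>"] + sentence
        for prev, cur in zip(words, words[1:]):
            pair_counts[(prev, cur)] += 1
            history_counts[prev] += 1
    # relative-frequency estimate of P(w_i | w_{i-1})
    return {pair: c / history_counts[pair[0]] for pair, c in pair_counts.items()}

def sentence_probability(model, sentence):
    # P(W) = product over i of P(w_i | Phi(w_1, ..., w_{i-1}))
    prob = 1.0
    words = ["<s>"] + sentence
    for prev, cur in zip(words, words[1:]):
        prob *= model.get((prev, cur), 0.0)
    return prob

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
model = train_bigram(corpus)
print(sentence_probability(model, ["the", "cat", "sat"]))  # 1 * 2/3 * 1/2
```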
II. Hidden Markov Chains
1. HMM Concepts
We define
(1) an output alphabet Y = {0, 1, …, b-1};
(2) a state space I = {1, 2, …, c} with a unique starting state s_0;
(3) a probability distribution of transitions from state s to state s': p(s'|s);
(4) an output probability distribution associated with transitions from state s to state s': q(y|s, s').
Then the probability of observing an HMM output string y_1, y_2, …, y_k is given by

P(y_1, y_2, …, y_k) = Σ_{s_1,…,s_k} Π_{i=1}^{k} p(s_i|s_{i-1}) q(y_i|s_{i-1}, s_i)    (2)

where the sum runs over all state sequences s_1, …, s_k. Figure 1 is an example of an HMM with b = 2 and c = 3 that attaches outputs to transitions.
Fig. 1. State-transition diagram of the example HMM, with states 1, 2, 3 and outputs 0 and 1 attached to the transitions.
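Equation (2) can be checked directly by enumerating all state sequences. The sketch below encodes the example HMM through the combined transition/output probabilities p(y, s'|s) listed later in Fig. 4, taking state 1 as the starting state s_0 (a choice consistent with the γ_1 values in Fig. 5), and sums the probabilities of every path that produces a given output string.

```python
from itertools import product

# Combined transition/output probabilities p(y, s'|s) from Fig. 4,
# keyed as (from_state, output, to_state) -> probability.
P = {
    (1, 0, 1): 0.2, (1, 1, 2): 0.4, (1, 0, 2): 0.1, (1, 0, 3): 0.3,
    (2, 0, 3): 0.4, (2, 1, 3): 0.6,
    (3, 1, 1): 0.5, (3, 1, 2): 0.5,
}
STATES = (1, 2, 3)
START = 1  # starting state s0, consistent with the gamma_1 values in Fig. 5

def prob_by_enumeration(outputs):
    # Equation (2): sum over all state sequences of the product of the
    # transition/output probabilities along the path.
    total = 0.0
    for path in product(STATES, repeat=len(outputs)):
        prev, p = START, 1.0
        for y, s in zip(outputs, path):
            p *= P.get((prev, y, s), 0.0)
            prev = s
        total += p
    return total

print(prob_by_enumeration([0, 1, 1, 0]))  # P(0110)
```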
2. The Trellis
Figure 2 shows the stages for y_i = 0 and y_i = 1 corresponding to the HMM of Figure 1.
Fig. 2. The two trellis stages, one for y = 0 and one for y = 1, for the HMM of Fig. 1.
Figure 3 shows the trellis corresponding to the output sequence 0110. The required
probability P(0110) is equal to the sum of the probabilities of all complete paths
through the trellis that start in the obligatory starting state.
Fig. 3. The trellis for the output sequence 0110, starting from s_0 and unrolling states 1, 2, 3 over four stages.
The probability P(y_1, y_2, …, y_n) can be obtained recursively. Define

α_i(s) = P(y_1, y_2, …, y_i, s_i = s),    (3)

the probability that the output sequence is y_1, y_2, …, y_i and the ending state is s. With the boundary conditions

α_0(s) = 1 for s = s_0, α_0(s) = 0 otherwise,

and setting

p(y_i, s|s') = q(y_i|s', s) p(s|s'),    (4)

we get the recursion

α_i(s) = Σ_{s'} p(y_i, s|s') α_{i-1}(s').    (5)

By definition, the desired probability is

P(y_1, y_2, …, y_n) = Σ_s α_n(s).    (6)
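The recursion (3) through (6) can be sketched as follows for the same example HMM, again using the Fig. 4 probabilities p(y, s|s') and state 1 as the starting state:

```python
# Forward recursion (5)-(6) for the example HMM of Fig. 1 / Fig. 4.
P = {
    (1, 0, 1): 0.2, (1, 1, 2): 0.4, (1, 0, 2): 0.1, (1, 0, 3): 0.3,
    (2, 0, 3): 0.4, (2, 1, 3): 0.6,
    (3, 1, 1): 0.5, (3, 1, 2): 0.5,
}
STATES = (1, 2, 3)
START = 1

def forward_probability(outputs):
    # alpha[s] = P(y_1, ..., y_i, s_i = s); boundary condition: all mass on s0
    alpha = {s: (1.0 if s == START else 0.0) for s in STATES}
    for y in outputs:
        # equation (5): alpha_i(s) = sum over s' of p(y_i, s|s') alpha_{i-1}(s')
        alpha = {s: sum(alpha[sp] * P.get((sp, y, s), 0.0) for sp in STATES)
                 for s in STATES}
    # equation (6): sum the final alphas over all ending states
    return sum(alpha.values())

print(forward_probability([0, 1, 1, 0]))  # P(0110)
```

For the output string 0110 this evaluates to about 0.054 and agrees with the brute-force enumeration of equation (2), but the work grows only linearly with the length of the output string.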
3. Search for the Likeliest State Transition Sequence
Given an observed output sequence y_1, y_2, …, y_k, which state sequence s_1, s_2, …, s_k is most likely to have generated it? Note that

P(s_1, s_2, …, s_k | y_1, y_2, …, y_k, s_0) = P(s_1, s_2, …, s_k, y_1, y_2, …, y_k | s_0) / P(y_1, y_2, …, y_k | s_0).

Because the process is Markov, for all i,

P(s_1, …, s_k, y_1, …, y_k | s_0) = P(s_1, …, s_i, y_1, …, y_i | s_0) P(s_{i+1}, …, s_k, y_{i+1}, …, y_k | s_i).

Therefore, we can find the most likely sequence s_1, …, s_i, s_{i+1}, …, s_k by first finding, for each state s on trellis level i, the most likely state sequence s_1(s), …, s_{i-1}(s) leading into state s, then finding the most likely sequence s_{i+1}(s), …, s_k(s) leading out of state s, and finally finding the state s on level i for which the complete sequence s_1(s), …, s_{i-1}(s), s_i = s, s_{i+1}(s), …, s_k(s) has the highest probability. This leads to the Viterbi algorithm.
(1) Define the recursive formula

γ_i(s_i) = max_{s_1,…,s_{i-1}} P(s_1, …, s_i, y_1, …, y_i | s_0)
         = max_{s_{i-1}} p(y_i, s_i|s_{i-1}) max_{s_1,…,s_{i-2}} P(s_1, …, s_{i-1}, y_1, …, y_{i-1} | s_0)
         = max_{s_{i-1}} p(y_i, s_i|s_{i-1}) γ_{i-1}(s_{i-1})    (7)

(2) Set γ_0(s_0) = 1 and γ_0(s) = 0 for s ≠ s_0.
(3) Use (7) to compute γ_1(s) for all states of the trellis's first column:
γ_1(s) = p(y_1, s|s_0) γ_0(s_0) = p(y_1, s|s_0).
(4) Compute γ_2(s) for all states of the trellis's second column, and purge every transition (s', s) for which p(y_2, s|s') γ_1(s') < γ_2(s), i.e. keep only the transition achieving the maximum. If more than one transition remains, arbitrarily select one to keep and purge the others.
(5) In general, compute γ_i(s) for all states of the trellis's ith column and purge all but one of the transitions entering each state.
(6) Find the state s in the trellis's kth column for which γ_k(s) is maximal; then

max P(s_1, …, s_k, y_1, …, y_k | s_0) = max_s γ_k(s)    (8)

and the most likely state sequence is recovered by tracing the surviving transitions back from that state.
Figures 4 through 8 trace the algorithm on the example of Figure 1 for the output sequence 0110.
Fig. 4. The example HMM with combined transition/output probabilities p(y, s'|s): p(0,1|1) = 0.2, p(1,2|1) = 0.4, p(0,2|1) = 0.1, p(0,3|1) = 0.3; p(0,3|2) = 0.4, p(1,3|2) = 0.6; p(1,1|3) = 0.5, p(1,2|3) = 0.5.
Fig. 5. The trellis after the first stage (y_1 = 0): γ_1(1) = 0.2, γ_1(2) = 0.1, γ_1(3) = 0.3.
Fig. 6. The trellis after the second stage (y_2 = 1): γ_2(1) = 0.15, γ_2(2) = 0.15, γ_2(3) = 0.06.
Fig. 7. The trellis after the third stage (y_3 = 1): γ_3(1) = 0.03, γ_3(2) = 0.06, γ_3(3) = 0.09.
Fig. 8. The trellis after the fourth stage (y_4 = 0): γ_4(1) = 0.006, γ_4(2) = 0.003, γ_4(3) = 0.024.
The maximum is γ_4(3) = 0.024; tracing the surviving transitions back from state 3 gives the most likely state sequence 1 3 1 2 3 (starting from s_0 = 1).
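The worked example can be reproduced with a short sketch of the algorithm; as before, the probabilities are those of Fig. 4 and state 1 is taken as s_0:

```python
# Viterbi algorithm (equations (7)-(8)) for the example HMM of Fig. 4.
P = {
    (1, 0, 1): 0.2, (1, 1, 2): 0.4, (1, 0, 2): 0.1, (1, 0, 3): 0.3,
    (2, 0, 3): 0.4, (2, 1, 3): 0.6,
    (3, 1, 1): 0.5, (3, 1, 2): 0.5,
}
STATES = (1, 2, 3)
START = 1

def viterbi(outputs):
    # gamma[s]: probability of the best partial path ending in state s (eq. (7))
    gamma = {s: (1.0 if s == START else 0.0) for s in STATES}
    back = []  # back[i][s]: surviving (best) predecessor of state s at stage i+1
    for y in outputs:
        pred = {s: max(STATES, key=lambda sp: gamma[sp] * P.get((sp, y, s), 0.0))
                for s in STATES}
        gamma = {s: gamma[pred[s]] * P.get((pred[s], y, s), 0.0) for s in STATES}
        back.append(pred)
    # equation (8): pick the best final state, then trace the predecessors back
    state = max(STATES, key=lambda s: gamma[s])
    best_prob, path = gamma[state], [state]
    for pred in reversed(back):
        state = pred[state]
        path.append(state)
    return best_prob, path[::-1]

print(viterbi([0, 1, 1, 0]))  # approximately (0.024, [1, 3, 1, 2, 3])
```

Running it on 0110 reproduces the γ values of Figs. 5 through 8 and returns the path 1 3 1 2 3 with probability 0.024.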
4. Estimation of Statistical Parameters of HMMs
In most cases, the parameters of an HMM are not available. We want to use HMMs to model data generation, and the most we can expect is to be given a sample of such data. We wish to find the parameter specification that allows the HMM to account best for the observed data, as well as for unseen data in the future.
Consider the following example, and write q(y|t) for q(y|s, s'), where t denotes the transition from state s to state s'.
(Figure: a three-state example HMM with states s1, s2, s3 and transitions t1, t2, t3, t4, t5.)
p(t1) + p(t2) = 1; p(t4) + p(t5) = 1
We will estimate the probabilities p(t_i) and q(y|t), where y = 0, 1 and i = 1, …, 5.
Let c*(t) denote the normalized number of times the process went through transition t, and c*(y, t) the normalized number of times the process went through t and, as a consequence, produced the output y. Then the natural estimates are

q(y|t) = c*(y, t) / c*(t)
p(t_i) = c*(t_i) / (c*(t1) + c*(t2)) for i = 1, 2
p(t3) = 1
p(t_i) = c*(t_i) / (c*(t4) + c*(t5)) for i = 4, 5
Suppose that y_1, …, y_k were observed, and define
- L(t), R(t): the source and target states of the transition t;
- P*{t_i = t}: the probability that y_1, …, y_k were observed and transition t was taken when leaving the ith stage of the trellis, i = 0, 1, …, k;
- α_i(s) = P(y_1, …, y_i, s_i = s): the probability that y_1, …, y_i were observed and the state reached at the ith stage was s, as in (3);
- β_i(s) = P(y_{i+1}, …, y_k | s_i = s): the probability that y_{i+1}, …, y_k were observed given that the state at the ith stage was s.
Then c*(t) and c*(y, t) are given by

c*(t) = Σ_i P*{t_i = t}    (9)

c*(y, t) = Σ_i P*{t_i = t} δ(y_{i+1}, y),  where δ(y_{i+1}, y) = 1 for y_{i+1} = y and δ(y_{i+1}, y) = 0 otherwise.    (10)
Observe that transition t can be taken when leaving the ith stage only after the HMM has reached the state L(t), and the rest of the output is then generated starting from state R(t); thus

P*{t_i = t} = α_i(L(t)) p(t) q(y_{i+1}|t) β_{i+1}(R(t))    (11)
The following recursions also hold:

α_i(s) = Σ_{t: R(t)=s} α_{i-1}(L(t)) p(t) q(y_i|t) + Σ_{t: R(t)=s} α_i(L(t)) p(t)    (12)

β_i(s) = Σ_{t: L(t)=s} p(t) q(y_{i+1}|t) β_{i+1}(R(t)) + Σ_{t: L(t)=s} p(t) β_i(R(t))    (13)

where, in each case, the second sum runs over transitions that produce no output (null transitions).
Equations (9) through (13) can be used to estimate the parameters of an HMM. Since the quantities α_i, β_i, and P*{t_i = t} themselves depend on the current parameter values, the procedure is iterated: compute the counts with the current parameters, re-estimate p(t) and q(y|t), and repeat.
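The sketch below puts (9) through (13) together for a single re-estimation pass on one observed output string. The two-state HMM, its initial probability values, and the training string are made-up placeholders for illustration only; null transitions are left out, and in practice the pass is repeated until the estimates stop changing.

```python
# One re-estimation pass using the counts of (9)-(11) for an HMM whose
# outputs hang on transitions (q(y|t)).  Null transitions are ignored here.
# The tiny two-state HMM and its starting values are made-up placeholders.
from collections import defaultdict

TRANS = {            # t: (L(t), R(t))
    "t1": ("s1", "s1"),
    "t2": ("s1", "s2"),
    "t3": ("s2", "s1"),
}
p = {"t1": 0.5, "t2": 0.5, "t3": 1.0}                    # p(t)
q = {"t1": {0: 0.9, 1: 0.1}, "t2": {0: 0.2, 1: 0.8},     # q(y|t)
     "t3": {0: 0.5, 1: 0.5}}
START = "s1"
STATES = ("s1", "s2")

def reestimate(outputs):
    k = len(outputs)
    # forward pass, alpha_i(s), eq. (12) without the null-transition term
    alpha = [{s: (1.0 if s == START else 0.0) for s in STATES}]
    for y in outputs:
        alpha.append({s: sum(alpha[-1][L] * p[t] * q[t][y]
                             for t, (L, R) in TRANS.items() if R == s)
                      for s in STATES})
    # backward pass, beta_i(s), eq. (13) without the null-transition term
    beta = [None] * k + [{s: 1.0 for s in STATES}]
    for i in range(k - 1, -1, -1):
        beta[i] = {s: sum(p[t] * q[t][outputs[i]] * beta[i + 1][R]
                          for t, (L, R) in TRANS.items() if L == s)
                   for s in STATES}
    # transition counts c*(t) and c*(y, t), eqs. (9)-(11)
    c_t, c_yt = defaultdict(float), defaultdict(float)
    for i in range(k):
        for t, (L, R) in TRANS.items():
            pstar = alpha[i][L] * p[t] * q[t][outputs[i]] * beta[i + 1][R]
            c_t[t] += pstar
            c_yt[(outputs[i], t)] += pstar
    # re-estimate p(t) and q(y|t) by normalizing the counts
    new_p = {t: c_t[t] / sum(c_t[t2] for t2, (L2, _) in TRANS.items() if L2 == L)
             for t, (L, _) in TRANS.items()}
    new_q = {t: {y: c_yt[(y, t)] / c_t[t] for y in (0, 1)} for t in TRANS}
    return new_p, new_q

print(reestimate([0, 1, 0, 0, 1]))
```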
III. Additional Considerations
1. In most of the previous discussion, transitions that produce null symbols (i.e., no output) were omitted. When they are taken into consideration, appropriate changes must be made to the formulas.
2. In practice, the recursions (12) and (13) should be normalized, because the α_i and β_i values lose precision (underflow) as i increases and decreases, respectively.
3. The front end of a speech recognizer includes a microphone whose output is an electric signal, a means of sampling that signal, and a manner of processing the resulting sequence of samples. Features are extracted from the samples and compared with a pre-stored prototype database to find the best match.
4. Reference: F. Jelinek, Statistical Methods for Speech Recognition, MIT Press, 1997.