Automatic Speech Recognition
ILVB-2006 Tutorial
The Noisy Channel Model
• Automatic speech recognition (ASR) is a process by which an acoustic
speech signal is converted into a set of words [Rabiner et al., 1993]
• The noisy channel model [Lee et al., 1996]
– Acoustic input considered a noisy version of a source sentence
Figure: a source sentence passes through a noisy channel to become the noisy (acoustic) observation, and the decoder guesses the original sentence. Example: 버스 정류장이 어디에 있나요? ("Where is the bus stop?")
The Noisy Channel Model
• What is the most likely sentence out of all sentences in the language L
given some acoustic input O?
• Treat the acoustic input O as a sequence of individual observations
– O = o1,o2,o3,…,ot
• Define a sentence as a sequence of words:
– W = w1,w2,w3,…,wn
Ŵ = argmax_{W ∈ L} P(W | O)
  = argmax_{W ∈ L} P(O | W) P(W) / P(O)      (Bayes' rule)
  = argmax_{W ∈ L} P(O | W) P(W)             (P(O) does not depend on W; the "golden rule" of ASR)
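A minimal sketch of this decision rule in the log domain, with toy score tables standing in for the acoustic and language models (the sentences and numbers are purely illustrative):

# Toy log-probability tables standing in for log P(O|W) and log P(W).
ACOUSTIC_LOGPROB = {"recognize speech": -12.3, "wreck a nice beach": -11.9}
LM_LOGPROB = {"recognize speech": -4.1, "wreck a nice beach": -9.7}

def decode(candidates):
    """Return argmax_W [log P(O|W) + log P(W)] -- the golden rule in the log domain."""
    return max(candidates, key=lambda w: ACOUSTIC_LOGPROB[w] + LM_LOGPROB[w])

print(decode(["recognize speech", "wreck a nice beach"]))  # -> recognize speech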
Speech Recognition Architecture Meets Noisy Channel
Figure: speech signal (e.g., 버스 정류장이 어디에 있나요?) → Feature Extraction → observations O → Decoding, Ŵ = argmax_{W ∈ L} P(O | W) P(W) → word sequence W. The decoder searches a network built by Network Construction from the Acoustic Model (HMM estimation on a Speech DB), the Pronunciation Model (word lexicon, generated by G2P), and the Language Model (LM estimation on Text Corpora).
Feature Extraction
• Mel-Frequency Cepstral Coefficients (MFCCs) are a popular choice [Paliwal, 1992]
– Pipeline: preemphasis / Hamming window → x(n) → FFT (Fast Fourier Transform) → Mel-scale filter bank → log|·| → DCT (Discrete Cosine Transform) → MFCCs (12 dimensions)
– Frame size: 25 ms / frame rate: 10 ms (overlapping frames)
– 39 features per 10 ms frame
– Absolute: log frame energy (1) and MFCCs (12)
– Delta: first-order derivatives of the 13 absolute coefficients
– Delta-delta: second-order derivatives of the 13 absolute coefficients
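A minimal sketch of this front end, assuming the librosa library is available (25 ms window, 10 ms hop, 13 static coefficients with deltas and delta-deltas appended; here c0 stands in for the log-energy term):

import numpy as np
import librosa  # assumption: librosa is available

def mfcc_39(path, sr=16000):
    """25 ms frames every 10 ms; 13 static coefficients plus deltas and delta-deltas."""
    y, sr = librosa.load(path, sr=sr)
    win = int(0.025 * sr)                 # 25 ms analysis window
    hop = int(0.010 * sr)                 # 10 ms frame rate
    static = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=win, hop_length=hop)
    delta = librosa.feature.delta(static)             # first-order derivatives
    delta2 = librosa.feature.delta(static, order=2)   # second-order derivatives
    return np.vstack([static, delta, delta2])         # shape: (39, n_frames)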
Acoustic Model
• Provide P(O|Q) = P(features|phone)
• Modeling Units [Bahl et al., 1986]
– Context-independent : Phoneme
– Context-dependent : Diphone, Triphone, Quinphone
– pL−p+pR : triphone for phone p with left context pL and right context pR
• Typical acoustic model [Juang et al., 1986]
– Continuous-density Hidden Markov Model: λ = (A, B, π)
– Output distribution: Gaussian mixture
  b_j(x_t) = ∑_{k=1}^{K} c_{jk} N(x_t; μ_{jk}, Σ_{jk})
– HMM topology: 3-state left-to-right model for each phone, 1-state model for silence or pause
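As a small numeric sketch, the mixture emission density b_j(x_t) above can be evaluated directly; diagonal covariances are assumed and the parameter values are illustrative:

import numpy as np

def log_gmm_emission(x, weights, means, variances):
    """log b_j(x) = log sum_k c_jk N(x; mu_jk, Sigma_jk), diagonal covariances."""
    x = np.asarray(x, dtype=float)
    log_terms = []
    for c, mu, var in zip(weights, means, variances):
        mu, var = np.asarray(mu, float), np.asarray(var, float)
        log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * var))
        log_quad = -0.5 * np.sum((x - mu) ** 2 / var)
        log_terms.append(np.log(c) + log_norm + log_quad)
    return np.logaddexp.reduce(log_terms)  # log-sum-exp for numerical stability

# Illustrative 2-component mixture over 2-dimensional features.
lp = log_gmm_emission(x=[0.3, -1.2], weights=[0.6, 0.4],
                      means=[[0.0, -1.0], [1.0, 0.5]],
                      variances=[[1.0, 0.5], [2.0, 1.0]])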
Pronunciation Model
• Provide P(Q|W) = P(phone|word)
• Word Lexicon [Hazen et al., 2002]
– Map legal phone sequences into words according to phonotactic rules
– G2P (grapheme-to-phoneme) conversion: generates a word lexicon automatically
– A single word may have multiple pronunciations
• Example: tomato
– Pronunciation network: [t] → ([ow] 0.2 | [ah] 0.8) → [m] → ([ey] 0.5 | [aa] 0.5) → [t] → [ow]
– P([towmeytow]|tomato) = P([towmaatow]|tomato) = 0.1
– P([tahmeytow]|tomato) = P([tahmaatow]|tomato) = 0.4
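A toy sketch of how such a weighted lexicon entry yields P(Q|W) by multiplying arc probabilities (the network below is hand-coded from the figure above):

from itertools import product

# Weighted pronunciation network for "tomato": each slot lists (phone, probability).
TOMATO = [[("t", 1.0)], [("ow", 0.2), ("ah", 0.8)], [("m", 1.0)],
          [("ey", 0.5), ("aa", 0.5)], [("t", 1.0)], [("ow", 1.0)]]

def pronunciation_probs(network):
    """Enumerate all phone sequences and their probabilities P(Q|W)."""
    probs = {}
    for choice in product(*network):
        phones = tuple(p for p, _ in choice)
        prob = 1.0
        for _, arc_prob in choice:
            prob *= arc_prob
        probs[phones] = prob
    return probs

print(pronunciation_probs(TOMATO)[("t", "ah", "m", "ey", "t", "ow")])  # 0.4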
Training
• Training process [Lee et al., 1996]
– Flow: Speech DB → Feature Extraction → Baum-Welch re-estimation → converged? If not, re-estimate again; once converged, output the trained HMMs.
• Network for training
– Figure: a training sentence such as "ONE TWO THREE ONE" is compiled into a sentence HMM, each word into its word HMM (e.g., ONE → phones W, AH, N), and each phone into a 3-state HMM, chained through to an end state.
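A minimal training sketch, assuming the hmmlearn package; a real recognizer performs embedded re-estimation over whole sentence networks, so this only illustrates the Baum-Welch loop for a single model on random stand-in features:

import numpy as np
from hmmlearn.hmm import GMMHMM  # assumption: hmmlearn is installed

# Stand-in data: a few 39-dimensional feature sequences for one phone model.
rng = np.random.default_rng(0)
sequences = [rng.standard_normal((n, 39)) for n in (50, 60, 55)]
X = np.concatenate(sequences)          # frames stacked row-wise
lengths = [len(s) for s in sequences]  # sequence boundaries

# 3-state phone HMM with Gaussian-mixture emissions; a real system would also
# constrain the transition matrix to a left-to-right topology.
model = GMMHMM(n_components=3, n_mix=2, covariance_type="diag", n_iter=10)
model.fit(X, lengths)                  # Baum-Welch (EM) re-estimation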
Language Model
• Provide P(W): the probability of the sentence [Beaujard et al., 1999]
– We saw this was also used in the decoding process as the probability of
transitioning from one word to another.
– Word sequence : W = w1,w2,w3,…,wn
P(w_1 w_2 … w_n) = ∏_{i=1}^{n} P(w_i | w_1 … w_{i−1})
– The problem is that we cannot reliably estimate the conditional probabilities P(w_i | w_1 … w_{i−1}) for all words and all sequence lengths in a given language
– n-gram language model
– An n-gram language model uses only the previous n−1 words to represent the history:
  P(w_i | w_1 … w_{i−1}) ≈ P(w_i | w_{i−(n−1)} … w_{i−1})
– Bigrams are easily incorporated into a Viterbi search
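A toy maximum-likelihood bigram estimator (no smoothing), using two made-up Korean sentences of the kind shown on the next slide:

from collections import Counter

def bigram_lm(sentences):
    """Maximum-likelihood bigram estimates P(w_i | w_{i-1}) from a toy corpus."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        words = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(words[:-1])
        bigrams.update(zip(words[:-1], words[1:]))
    return {(h, w): c / unigrams[h] for (h, w), c in bigrams.items()}

lm = bigram_lm(["서울 에서 세시 출발 하는 기차",
                "부산 에서 네시 출발 하는 버스"])
print(lm[("에서", "세시")])  # 0.5 on this toy corpus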
Language Model
• Example
– Finite State Network (FSN)
Figure: a word network over 서울/부산/대구/대전 ("Seoul"/"Busan"/"Daegu"/"Daejeon"), 에서 ("from"), 세시/네시 ("three o'clock"/"four o'clock"), 출발 ("departing"), 도착 ("arriving"), 하는 (relativizer), and 기차/버스 ("train"/"bus").
– Context Free Grammar (CFG)
$time = 세시|네시;
$city = 서울|부산|대구|대전;
$trans = 기차|버스;
$sent = $city (에서 $time 출발 | 출발 $city 도착) 하는 $trans
– Bigram
P(에서|서울)=0.2 P(세시|에서)=0.5
P(출발|세시)=1.0 P(하는|출발)=0.5
P(출발|서울)=0.5 P(도착|대구)=0.9
…
Network Construction
• Expanding every word to state level, we get a search network [Demuynck
et al., 1997]
Figure: the acoustic model (phone HMMs), the pronunciation model (e.g., the Korean digits 일 "one" → I L, 이 "two" → I, 삼 "three" → S A M, 사 "four" → S A), and the language model are compiled into one search network. Within each word, phone-HMM states are linked by intra-word transitions; a start node connects to each word and each word connects on to the end node through between-word transitions, where the language-model probabilities P(일|x), P(이|x), P(사|x), P(삼|x) are applied.
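A toy sketch of flattening a word lexicon and a bigram LM into one weighted state graph, as in the figure; the lexicon uses romanized phones for the Korean digits and the probabilities are illustrative:

# Toy lexicon for the Korean digits in the figure (romanized phones) and an
# illustrative bigram table; "<s>" marks the start node.
LEXICON = {"일": ["i", "l"], "이": ["i"], "삼": ["s", "a", "m"], "사": ["s", "a"]}
BIGRAM = {("<s>", "일"): 0.5, ("<s>", "이"): 0.5, ("일", "삼"): 0.4, ("이", "사"): 0.6}

def build_search_network(lexicon, bigram):
    """Return weighted edges: intra-word phone-to-phone transitions plus
    between-word transitions that carry the language-model probability."""
    edges = []
    for word, phones in lexicon.items():
        for i in range(len(phones) - 1):                  # intra-word transitions
            edges.append(((word, i), (word, i + 1), 1.0))
    for (prev, word), p in bigram.items():                # between-word transitions
        src = "start" if prev == "<s>" else (prev, len(lexicon[prev]) - 1)
        edges.append((src, (word, 0), p))                 # LM applied here
    return edges

network = build_search_network(LEXICON, BIGRAM)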
Decoding
• Find Ŵ = argmax_{W ∈ L} P(W | O)
• Viterbi Search : Dynamic Programming
– Token Passing Algorithm [Young et al., 1989]
• Initialize every state with a token holding a null history and the likelihood that it is a start state
• For each frame a_k:
– For each token t in state s, with probability P(t) and history H
– For each successor state r
– Add a new token to r with probability P(t) · P_{s,r} · P_r(a_k) and history s.H
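A minimal token-passing sketch in the log domain; start_logprob, trans_logprob, and emit_logprob are hypothetical stand-ins for the real model scores:

from collections import namedtuple

Token = namedtuple("Token", ["logprob", "history"])

def token_passing(frames, states, start_logprob, trans_logprob, emit_logprob):
    """Viterbi decoding by token passing: keep only the best token per state per frame."""
    tokens = {s: Token(start_logprob(s), ()) for s in states}
    for frame in frames:
        new_tokens = {}
        for s, tok in tokens.items():
            for r in states:
                lp = tok.logprob + trans_logprob(s, r) + emit_logprob(r, frame)
                if r not in new_tokens or lp > new_tokens[r].logprob:
                    new_tokens[r] = Token(lp, tok.history + (s,))
        tokens = new_tokens
    return max(tokens.values(), key=lambda t: t.logprob)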
Decoding
• Pruning [Young et al., 1996]
– The entire search space for a Viterbi search is much too large
– The solution is to prune tokens on paths whose score is too low (see the sketch after this list)
– Typical methods:
– Histogram pruning: keep at most n total hypotheses
– Beam pruning: keep only hypotheses whose score is within a fraction of the best score
• N-best Hypotheses and Word Graphs
– Keep multiple tokens and return n-best paths/scores
– Can produce a packed word graph (lattice)
• Multiple Pass Decoding
– Perform multiple passes, applying successively more fine-grained language
models
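The sketch below shows histogram plus beam pruning applied to the per-frame token set (reusing the Token tuple from the previous sketch; the beam width and count are illustrative):

def prune(tokens, beam=10.0, max_tokens=1000):
    """Histogram pruning: keep at most max_tokens hypotheses.
    Beam pruning: drop tokens whose log score falls more than `beam` below the best."""
    if not tokens:
        return []
    survivors = sorted(tokens, key=lambda t: t.logprob, reverse=True)[:max_tokens]
    best = survivors[0].logprob
    return [t for t in survivors if t.logprob >= best - beam]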
Large Vocabulary Continuous Speech
Recognition (LVCSR)
• Decoding continuous speech over a large vocabulary
– Computationally complex because of the huge potential search space
• Weighted Finite-State Transducers (WFSTs) [Mohri et al., 2002]
– Efficient in time and space (a toy composition sketch follows at the end of this slide)
Figure: component WFSTs (word→sentence, phone→word, HMM→phone, state→HMM) are combined and optimized into a single search-network WFST.
• Dynamic Decoding
– On-demand network construction
– Much lower memory requirements
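A toy sketch of weighted transducer composition, the operation used to combine the component WFSTs above into one search network; real toolkits such as OpenFst also handle epsilon transitions, determinization, and minimization, which are omitted here:

def compose(t1, t2):
    """Epsilon-free composition of weighted transducers given as arc lists
    (src, in_label, out_label, weight, dst): states become pairs, weights multiply."""
    return [((p, r), a, d, w1 * w2, (q, s))
            for (p, a, b, w1, q) in t1
            for (r, c, d, w2, s) in t2
            if b == c]

# Toy "phone -> word" and "word bigram" transducers (labels and weights illustrative).
L = [(0, "i", "이", 1.0, 1), (0, "sa", "사", 1.0, 1)]
G = [(0, "이", "이", 0.5, 0), (0, "사", "사", 0.4, 0)]
LG = compose(L, G)  # arcs of the combined phone -> word network with LM weights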
References (1/2)
• L. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer, 1986. Maximum mutual information estimation of hidden Markov model parameters for speech recognition, Proceedings of ICASSP, pp. 49–52.
• C. Beaujard and M. Jardino, 1999. Language modeling based on automatic word concatenations, Proceedings of the 8th European Conference on Speech Communication and Technology, vol. 4, pp. 1563–1566.
• K. Demuynck, J. Duchateau, and D. V. Compernolle, 1997. A static lexicon network representation for cross-word context-dependent phones, Proceedings of the 5th European Conference on Speech Communication and Technology, pp. 143–146.
• T. J. Hazen, I. L. Hetherington, H. Shu, and K. Livescu, 2002. Pronunciation modeling using a finite-state transducer representation, Proceedings of the ISCA Workshop on Pronunciation Modeling and Lexicon Adaptation, pp. 99–104.
• M. Mohri, F. Pereira, and M. Riley, 2002. Weighted finite-state transducers in speech recognition, Computer Speech and Language, vol. 16, no. 1, pp. 69–88.
References (2/2)
• B. H. Juang, S. E. Levinson, and M. M. Sondhi, 1986. Maximum likelihood estimation for multivariate mixture observations of Markov chains, IEEE Transactions on Information Theory, vol. 32, no. 2, pp. 307–309.
• C. H. Lee, F. K. Soong, and K. K. Paliwal, 1996. Automatic Speech and Speaker Recognition: Advanced Topics, Kluwer Academic Publishers.
• K. K. Paliwal, 1992. Dimensionality reduction of the enhanced feature set for the HMM-based speech recognizer, Digital Signal Processing, vol. 2, pp. 157–173.
• L. R. Rabiner, 1989. A tutorial on hidden Markov models and selected applications in speech recognition, Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286.
• L. R. Rabiner and B. H. Juang, 1993. Fundamentals of Speech Recognition, Prentice-Hall.
• S. J. Young, N. H. Russell, and J. H. S. Thornton, 1989. Token passing: a simple conceptual model for connected speech recognition systems. Technical Report CUED/F-INFENG/TR.38, Cambridge University Engineering Department.
• S. Young, J. Jansen, J. Odell, D. Ollason, and P. Woodland, 1996. The HTK Book. Entropic Cambridge Research Laboratory, Cambridge, UK.