Large Vocabulary Continuous Speech Recognition

Given the acoustic observation sequence Y, the recognizer selects the word string \hat{W} satisfying

    P(\hat{W} | Y) = \max_W P(W | Y).

By Bayes' rule,

    P(W | Y) = P(Y | W) P(W) / P(Y),

and since P(Y) does not depend on W,

    \hat{W} = \arg\max_W P(Y | W) P(W),

where P(Y | W) is given by the acoustic model and P(W) by the language model.

Subword Speech Units

HMM-Based Subword Speech Units

Training of Subword Units

A word string S_W = W_1 W_2 W_3 ... W_I is mapped onto the corresponding string of subword units

    S_U = U_1(W_1) U_2(W_1) ... U_{L(W_1)}(W_1) ⊕ U_1(W_2) U_2(W_2) ... U_{L(W_2)}(W_2) ⊕ U_1(W_3) ... U_{L(W_3)}(W_3) ⊕ ... ⊕ U_1(W_I) ... U_{L(W_I)}(W_I),

where U_j(W_i) is the j-th subword unit of word W_i, L(W_i) is the number of units in W_i, and ⊕ denotes concatenation.

Training of Subword Units: Training Procedure

Errors and performance evaluation in PLU recognition
- Substitution error (s)
- Deletion error (d)
- Insertion error (i)

Performance evaluation: if the total number of PLUs is N, we define:
- Correctness rate: (N - s - d) / N
- Accuracy rate: (N - s - d - i) / N

Language Models for LVCSR

For a word string W = w_1 w_2 ... w_Q,

    P(W) = P(w_1 w_2 ... w_Q) = P(w_1) P(w_2 | w_1) P(w_3 | w_1 w_2) ... P(w_Q | w_1 w_2 ... w_{Q-1}).

An N-gram model approximates each conditional probability using only the N-1 preceding words:

    P(w_j | w_1 w_2 ... w_{j-1}) ≈ P(w_j | w_{j-N+1} ... w_{j-1}).

Word Pair Model: specify which word pairs are valid, i.e.,

    P(w_j | w_k) = 1 if the pair w_k w_j is valid, 0 otherwise.

Statistical Language Modeling

    P_N(W) = \prod_{i=1}^{Q} P(w_i | w_{i-1}, w_{i-2}, ..., w_{i-N+1}),

where each conditional probability is estimated from training-set occurrence counts F(·):

    \hat{P}(w_i | w_{i-1}, ..., w_{i-N+1}) = F(w_i, w_{i-1}, ..., w_{i-N+1}) / F(w_{i-1}, ..., w_{i-N+1}).

For a trigram model, the estimate is smoothed by interpolating trigram, bigram, and unigram relative frequencies:

    \hat{P}(w_3 | w_1, w_2) = p_1 F(w_1, w_2, w_3)/F(w_1, w_2) + p_2 F(w_2, w_3)/F(w_2) + p_3 F(w_3)/\sum_i F(w_i),

with nonnegative weights p_1 + p_2 + p_3 = 1.

Perplexity of the Language Model

Entropy of the source:

    H = -\lim_{Q→∞} (1/Q) \sum_{w_1, ..., w_Q} P(w_1, w_2, ..., w_Q) \log P(w_1, w_2, ..., w_Q).

If the words are independent, P(w_1, w_2, ..., w_Q) = P(w_1) P(w_2) ... P(w_Q), and the first-order entropy of the source is

    H = -\sum_{w ∈ V} P(w) \log P(w).

If the source is ergodic, meaning its statistical properties can be completely characterized by a sufficiently long sequence that the source puts out, then

    H = -\lim_{Q→∞} (1/Q) \log P(w_1, w_2, ..., w_Q).

We often compute H based on a finite but sufficiently large Q:

    H ≈ -(1/Q) \log P(w_1, w_2, ..., w_Q).

H measures the degree of difficulty that the recognizer encounters, on average, when it has to determine a word from the same source.

If an N-gram language model P_N(W) is used, an estimate of H is

    H_p = -(1/Q) \sum_{i=1}^{Q} \log P(w_i | w_{i-1}, w_{i-2}, ..., w_{i-N+1}).

In general,

    H_p = -(1/Q) \log \hat{P}(w_1, w_2, ..., w_Q).

The perplexity is then defined as

    B = 2^{H_p} = \hat{P}(w_1, w_2, ..., w_Q)^{-1/Q}.

Overall recognition system based on subword units

Naval Resource (Battleship) Management Task: 991-word vocabulary.
- NG (no grammar): perplexity = 991.

Word pair grammar. The vocabulary can be partitioned into four nonoverlapping sets of words:
- {BE}: words that can either begin or end a sentence, |BE| = 117
- {BĒ}: words that can begin a sentence but cannot end one, |BĒ| = 64
- {B̄E}: words that cannot begin a sentence but can end one, |B̄E| = 448
- {B̄Ē}: words that can neither begin nor end a sentence, |B̄Ē| = 322

The overall FSN allows recognition of sentences of the form

    S : (silence) {BE, BĒ} (silence) ({W}) ({W}) ... (silence) {B̄E, BE} (silence),

where parenthesized elements are optional and {W} denotes any word in the vocabulary.

- WP (word pair) grammar: perplexity = 60.
- FSN based on the partitioning scheme: 995 real arcs and 18 null arcs.
- WB (word bigram) grammar: perplexity = 20.

Control of word insertion/word deletion rate

In the structure discussed so far there is no control over sentence length. We therefore introduce a word insertion penalty into the Viterbi decoding: a fixed negative quantity is added to the likelihood score at the end of each word arc.

Context-dependent subword units

Different unit inventories for the word "above" (phones ax b ah v):
(1) above: ax   b   ah   v   (context-independent units)
(2) above: $-ax-b   ax-b-ah   b-ah-v   ah-v-$   (triphones, context-dependent)
(3) above: ax2   b2   ah1   v1   (multiple phone units)
(4) above: ax(above)   b(above)   ah(above)   v(above)   (word-dependent units)
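To make the relationship between the count-based N-gram estimate, the entropy estimate H_p, and the perplexity B = 2^{H_p} concrete, here is a minimal sketch (not part of the lecture material): the toy corpus, the `bigram_prob` helper, and the floor used to avoid log(0) are assumptions made only for illustration.

```python
import math
from collections import Counter

# Toy training corpus; "<s>" and "</s>" mark sentence boundaries.
train = [["<s>", "show", "all", "ships", "</s>"],
         ["<s>", "show", "all", "alerts", "</s>"],
         ["<s>", "list", "all", "ships", "</s>"]]

# Count unigrams F(w) and bigrams F(w_{i-1}, w_i) on the training set.
unigrams = Counter(w for sent in train for w in sent)
bigrams = Counter((sent[i - 1], sent[i]) for sent in train for i in range(1, len(sent)))

def bigram_prob(prev, word, eps=1e-6):
    """Count-based estimate P^(w_i | w_{i-1}) = F(w_{i-1}, w_i) / F(w_{i-1}),
    floored at eps so that unseen pairs do not produce log(0)."""
    if unigrams[prev] == 0:
        return eps
    return max(bigrams[(prev, word)] / unigrams[prev], eps)

def perplexity(sentences):
    """H_p = -(1/Q) * sum_i log2 P^(w_i | w_{i-1});  perplexity B = 2 ** H_p."""
    log_prob, q = 0.0, 0
    for sent in sentences:
        for i in range(1, len(sent)):
            log_prob += math.log2(bigram_prob(sent[i - 1], sent[i]))
            q += 1
    h_p = -log_prob / q
    return 2 ** h_p

test = [["<s>", "show", "all", "alerts", "</s>"]]
print("perplexity B =", perplexity(test))
```

A lower perplexity on test sentences indicates a tighter language model, which is why the word-bigram grammar (perplexity 20) constrains the recognizer far more than no grammar at all (perplexity 991).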
Creation of context-dependent diphones and triphones

With $ denoting an unspecified (don't-care) context, a phone p with left context pL and right context pR gives the units:
- pL-p-$ : left-context (LC) diphone
- $-p-pR : right-context (RC) diphone
- pL-p-pR : left-right-context (LRC) triphone.

If c(·) is the occurrence count of a given unit in the training set, a unit reduction rule can be applied when c(pL-p-pR) falls below a count threshold T; the triphone pL-p-pR is then replaced by
1. $-p-pR if c($-p-pR) ≥ T,
2. pL-p-$ if c(pL-p-$) ≥ T,
3. $-p-$ otherwise.

CD units using only intraword units for "show all ships" (phones: sh ow / aw l / sh i p s):

    $-sh-ow  sh-ow-$  $-aw-l  aw-l-$  $-sh-i  sh-i-p  i-p-s  p-s-$

CD units using both intraword and interword units:

    $-sh-ow  sh-ow-aw  ow-aw-l  aw-l-sh  l-sh-i  sh-i-p  i-p-s  p-s-$

Smoothing and interpolation of CD PLU models

The output distribution of a context-dependent unit is smoothed by interpolating it with the distributions of its reduced-context units:

    \hat{B}_{pL-p-pR} = λ_{pL-p-pR} B_{pL-p-pR} + λ_{pL-p-$} B_{pL-p-$} + λ_{$-p-pR} B_{$-p-pR} + λ_{$-p-$} B_{$-p-$},

where the interpolation weights satisfy λ_{pL-p-pR} + λ_{pL-p-$} + λ_{$-p-pR} + λ_{$-p-$} = 1.

Implementation issues using CD units

Word junction effects: to handle known phonological changes at word boundaries, a set of phonological rules is superimposed on both the training and recognition networks. Some typical phonological rules include:

Recognition results using CD units

Position-dependent units

For each unit p, a separation measure can be computed as

    D(p) = \min_{q ≠ p} [ L(Y_p | λ_p) − L(Y_p | λ_q) ],

where L(Y_p | λ_p) is the log-likelihood of the training tokens Y_p of unit p given its own model λ_p, and L(Y_p | λ_q) is their log-likelihood given the model of a competing unit q.

Unit splitting and clustering

A key source of difficulty in continuous speech recognition is the so-called function words, which include words like a, and, for, in, and is. The function words have the following properties:

Creation of vocabulary-independent units

Semantic Postprocessor for Recognition
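The triphone expansion and the count-threshold unit reduction rule can be sketched in a few lines of code. This is an illustrative sketch only, not the lecture's implementation; the phone transcription, the threshold value, and the helper names (`make_triphones`, `reduce_unit`) are assumptions.

```python
from collections import Counter

def make_triphones(phones, left_edge="$", right_edge="$"):
    """Expand a phone string into triphone units (pL, p, pR).
    '$' marks an unspecified (don't-care) context at the edges."""
    units = []
    for i, p in enumerate(phones):
        left = phones[i - 1] if i > 0 else left_edge
        right = phones[i + 1] if i < len(phones) - 1 else right_edge
        units.append((left, p, right))
    return units

def reduce_unit(unit, counts, T):
    """Unit reduction rule: if the LRC triphone pL-p-pR occurs fewer than T
    times, back off to $-p-pR, then pL-p-$, then the CI unit $-p-$."""
    pL, p, pR = unit
    if counts[(pL, p, pR)] >= T:
        return (pL, p, pR)
    if counts[("$", p, pR)] >= T:
        return ("$", p, pR)
    if counts[(pL, p, "$")] >= T:
        return (pL, p, "$")
    return ("$", p, "$")

# Intraword units for "show all ships" (phones per word: sh ow / aw l / sh i p s).
words = [["sh", "ow"], ["aw", "l"], ["sh", "i", "p", "s"]]
intraword = [u for w in words for u in make_triphones(w)]

# Hypothetical occurrence counts; in practice these come from the training corpus.
counts = Counter(intraword * 3)      # pretend every intraword unit was seen 3 times
counts[("sh", "i", "p")] = 1         # force one rare triphone to trigger the backoff

reduced = [reduce_unit(u, counts, T=2) for u in intraword]
print(reduced)                       # the rare sh-i-p unit collapses to the CI unit $-i-$
```

The same `make_triphones` helper applied to the concatenated phone string (rather than word by word) yields the interword units listed above, since word-boundary phones then pick up the neighboring word's phones as context.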