Large Vocabulary Continuous Speech Recognition

Given the acoustic observation sequence Y, the recognizer selects the word string \hat{W} satisfying

    P(\hat{W} | Y) = \max_W P(W | Y).

By Bayes' rule,

    P(W | Y) = P(Y | W) P(W) / P(Y),

and since P(Y) does not depend on W,

    \hat{W} = \arg\max_W P(Y | W) P(W),

where P(Y | W) is given by the acoustic model and P(W) by the language model.

Subword Speech Units

HMM-Based Subword Speech Units

Training of Subword Units

A word string S_W = W_1 W_2 W_3 ... W_I is mapped onto the corresponding string of subword units

    S_U = U_1(W_1) U_2(W_1) ... U_{L(W_1)}(W_1) ⊕ U_1(W_2) U_2(W_2) ... U_{L(W_2)}(W_2) ⊕ U_1(W_3) ... U_{L(W_3)}(W_3) ⊕ ... ⊕ U_1(W_I) ... U_{L(W_I)}(W_I),

where U_j(W_i) is the j-th subword unit of word W_i, L(W_i) is the number of units in W_i, and ⊕ denotes concatenation.

Training of Subword Units: Training Procedure

Errors and performance evaluation in PLU recognition
- Substitution error (s)
- Deletion error (d)
- Insertion error (i)

Performance evaluation: if the total number of PLUs is N, we define:
- Correctness rate: (N - s - d) / N
- Accuracy rate: (N - s - d - i) / N

Language Models for LVCSR

For a word string W = w_1 w_2 ... w_Q,

    P(W) = P(w_1 w_2 ... w_Q) = P(w_1) P(w_2 | w_1) P(w_3 | w_1 w_2) ... P(w_Q | w_1 w_2 ... w_{Q-1}).

An N-gram model approximates each conditional probability using only the N-1 preceding words:

    P(w_j | w_1 w_2 ... w_{j-1}) ≈ P(w_j | w_{j-N+1} ... w_{j-1}).

Word Pair Model: specify which word pairs are valid, i.e.,

    P(w_j | w_k) = 1 if the pair w_k w_j is valid, 0 otherwise.

Statistical Language Modeling

    P_N(W) = \prod_{i=1}^{Q} P(w_i | w_{i-1}, w_{i-2}, ..., w_{i-N+1}),

where each conditional probability is estimated from training-set occurrence counts F(·):

    \hat{P}(w_i | w_{i-1}, ..., w_{i-N+1}) = F(w_i, w_{i-1}, ..., w_{i-N+1}) / F(w_{i-1}, ..., w_{i-N+1}).

For a trigram model, the estimate is smoothed by interpolating trigram, bigram, and unigram relative frequencies:

    \hat{P}(w_3 | w_1, w_2) = p_1 F(w_1, w_2, w_3)/F(w_1, w_2) + p_2 F(w_2, w_3)/F(w_2) + p_3 F(w_3)/\sum_i F(w_i),

with nonnegative weights p_1 + p_2 + p_3 = 1.

Perplexity of the Language Model

Entropy of the source:

    H = -\lim_{Q→∞} (1/Q) \sum_{w_1, ..., w_Q} P(w_1, w_2, ..., w_Q) \log P(w_1, w_2, ..., w_Q).

If the words are independent, P(w_1, w_2, ..., w_Q) = P(w_1) P(w_2) ... P(w_Q), and the first-order entropy of the source is

    H = -\sum_{w ∈ V} P(w) \log P(w).

If the source is ergodic, meaning its statistical properties can be completely characterized by a sufficiently long sequence that the source puts out, then

    H = -\lim_{Q→∞} (1/Q) \log P(w_1, w_2, ..., w_Q).

We often compute H based on a finite but sufficiently large Q:

    H ≈ -(1/Q) \log P(w_1, w_2, ..., w_Q).

H measures the degree of difficulty that the recognizer encounters, on average, when it has to determine a word from the same source.

If an N-gram language model P_N(W) is used, an estimate of H is

    H_p = -(1/Q) \sum_{i=1}^{Q} \log P(w_i | w_{i-1}, w_{i-2}, ..., w_{i-N+1}).

In general,

    H_p = -(1/Q) \log \hat{P}(w_1, w_2, ..., w_Q).

The perplexity is then defined as

    B = 2^{H_p} = \hat{P}(w_1, w_2, ..., w_Q)^{-1/Q}.

Overall recognition system based on subword units

Naval Resource (Battleship) Management Task: 991-word vocabulary.
- NG (no grammar): perplexity = 991.

Word pair grammar. The vocabulary can be partitioned into four nonoverlapping sets of words:
- {BE}: words that can either begin or end a sentence, |BE| = 117
- {BĒ}: words that can begin a sentence but cannot end one, |BĒ| = 64
- {B̄E}: words that cannot begin a sentence but can end one, |B̄E| = 448
- {B̄Ē}: words that can neither begin nor end a sentence, |B̄Ē| = 322

The overall FSN allows recognition of sentences of the form

    S : (silence) {BE, BĒ} (silence) ({W}) ({W}) ... (silence) {B̄E, BE} (silence),

where parenthesized elements are optional and {W} denotes any word in the vocabulary.

- WP (word pair) grammar: perplexity = 60.
- FSN based on the partitioning scheme: 995 real arcs and 18 null arcs.
- WB (word bigram) grammar: perplexity = 20.

Control of word insertion/word deletion rate

In the structure discussed so far there is no control over sentence length. We therefore introduce a word insertion penalty into the Viterbi decoding: a fixed negative quantity is added to the likelihood score at the end of each word arc.

Context-dependent subword units

Different unit inventories for the word "above" (phones ax b ah v):
(1) above: ax   b   ah   v   (context-independent units)
(2) above: $-ax-b   ax-b-ah   b-ah-v   ah-v-$   (triphones, context-dependent)
(3) above: ax2   b2   ah1   v1   (multiple phone units)
(4) above: ax(above)   b(above)   ah(above)   v(above)   (word-dependent units)
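To make the relationship between the count-based N-gram estimate, the entropy estimate H_p, and the perplexity B = 2^{H_p} concrete, here is a minimal sketch (not part of the lecture material): the toy corpus, the `bigram_prob` helper, and the floor used to avoid log(0) are assumptions made only for illustration.

```python
import math
from collections import Counter

# Toy training corpus; "<s>" and "</s>" mark sentence boundaries.
train = [["<s>", "show", "all", "ships", "</s>"],
         ["<s>", "show", "all", "alerts", "</s>"],
         ["<s>", "list", "all", "ships", "</s>"]]

# Count unigrams F(w) and bigrams F(w_{i-1}, w_i) on the training set.
unigrams = Counter(w for sent in train for w in sent)
bigrams = Counter((sent[i - 1], sent[i]) for sent in train for i in range(1, len(sent)))

def bigram_prob(prev, word, eps=1e-6):
    """Count-based estimate P^(w_i | w_{i-1}) = F(w_{i-1}, w_i) / F(w_{i-1}),
    floored at eps so that unseen pairs do not produce log(0)."""
    if unigrams[prev] == 0:
        return eps
    return max(bigrams[(prev, word)] / unigrams[prev], eps)

def perplexity(sentences):
    """H_p = -(1/Q) * sum_i log2 P^(w_i | w_{i-1});  perplexity B = 2 ** H_p."""
    log_prob, q = 0.0, 0
    for sent in sentences:
        for i in range(1, len(sent)):
            log_prob += math.log2(bigram_prob(sent[i - 1], sent[i]))
            q += 1
    h_p = -log_prob / q
    return 2 ** h_p

test = [["<s>", "show", "all", "alerts", "</s>"]]
print("perplexity B =", perplexity(test))
```

A lower perplexity on test sentences indicates a tighter language model, which is why the word-bigram grammar (perplexity 20) constrains the recognizer far more than no grammar at all (perplexity 991).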
Creation of context-dependent diphones and triphones

With $ denoting an unspecified (don't-care) context, a phone p with left context pL and right context pR gives the units:
- pL-p-$ : left-context (LC) diphone
- $-p-pR : right-context (RC) diphone
- pL-p-pR : left-right-context (LRC) triphone.

If c(·) is the occurrence count of a given unit in the training set, a unit reduction rule can be applied when c(pL-p-pR) falls below a count threshold T; the triphone pL-p-pR is then replaced by
1. $-p-pR if c($-p-pR) ≥ T,
2. pL-p-$ if c(pL-p-$) ≥ T,
3. $-p-$ otherwise.

CD units using only intraword units for "show all ships" (phones: sh ow / aw l / sh i p s):

    $-sh-ow  sh-ow-$  $-aw-l  aw-l-$  $-sh-i  sh-i-p  i-p-s  p-s-$

CD units using both intraword and interword units:

    $-sh-ow  sh-ow-aw  ow-aw-l  aw-l-sh  l-sh-i  sh-i-p  i-p-s  p-s-$

Smoothing and interpolation of CD PLU models

The output distribution of a context-dependent unit is smoothed by interpolating it with the distributions of its reduced-context units:

    \hat{B}_{pL-p-pR} = λ_{pL-p-pR} B_{pL-p-pR} + λ_{pL-p-$} B_{pL-p-$} + λ_{$-p-pR} B_{$-p-pR} + λ_{$-p-$} B_{$-p-$},

where the interpolation weights satisfy λ_{pL-p-pR} + λ_{pL-p-$} + λ_{$-p-pR} + λ_{$-p-$} = 1.

Implementation issues using CD units

Word junction effects: to handle known phonological changes at word boundaries, a set of phonological rules is superimposed on both the training and recognition networks. Some typical phonological rules include:

Recognition results using CD units

Position-dependent units

For each unit p, a separation measure can be computed as

    D(p) = \min_{q ≠ p} [ L(Y_p | λ_p) − L(Y_p | λ_q) ],

where L(Y_p | λ_p) is the log-likelihood of the training tokens Y_p of unit p given its own model λ_p, and L(Y_p | λ_q) is their log-likelihood given the model of a competing unit q.

Unit splitting and clustering

A key source of difficulty in continuous speech recognition is the so-called function words, which include words like a, and, for, in, and is. The function words have the following properties:

Creation of vocabulary-independent units

Semantic Postprocessor for Recognition
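The triphone expansion and the count-threshold unit reduction rule can be sketched in a few lines of code. This is an illustrative sketch only, not the lecture's implementation; the phone transcription, the threshold value, and the helper names (`make_triphones`, `reduce_unit`) are assumptions.

```python
from collections import Counter

def make_triphones(phones, left_edge="$", right_edge="$"):
    """Expand a phone string into triphone units (pL, p, pR).
    '$' marks an unspecified (don't-care) context at the edges."""
    units = []
    for i, p in enumerate(phones):
        left = phones[i - 1] if i > 0 else left_edge
        right = phones[i + 1] if i < len(phones) - 1 else right_edge
        units.append((left, p, right))
    return units

def reduce_unit(unit, counts, T):
    """Unit reduction rule: if the LRC triphone pL-p-pR occurs fewer than T
    times, back off to $-p-pR, then pL-p-$, then the CI unit $-p-$."""
    pL, p, pR = unit
    if counts[(pL, p, pR)] >= T:
        return (pL, p, pR)
    if counts[("$", p, pR)] >= T:
        return ("$", p, pR)
    if counts[(pL, p, "$")] >= T:
        return (pL, p, "$")
    return ("$", p, "$")

# Intraword units for "show all ships" (phones per word: sh ow / aw l / sh i p s).
words = [["sh", "ow"], ["aw", "l"], ["sh", "i", "p", "s"]]
intraword = [u for w in words for u in make_triphones(w)]

# Hypothetical occurrence counts; in practice these come from the training corpus.
counts = Counter(intraword * 3)      # pretend every intraword unit was seen 3 times
counts[("sh", "i", "p")] = 1         # force one rare triphone to trigger the backoff

reduced = [reduce_unit(u, counts, T=2) for u in intraword]
print(reduced)                       # the rare sh-i-p unit collapses to the CI unit $-i-$
```

The same `make_triphones` helper applied to the concatenated phone string (rather than word by word) yields the interword units listed above, since word-boundary phones then pick up the neighboring word's phones as context.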