ch13 (LVCSR).ppt

Large Vocabulary
Continuous Speech Recognition
Wˆ  P(Wˆ | Y )  max P(W | Y ).
W
P(Y | W ) P(W )
P(W | Y ) 
.
P(Y )
Wˆ  arg max P(Y | W ) P(W ).
W
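As a minimal sketch of this decision rule (the candidate sentences and all scores below are hypothetical), the recognizer picks the word string whose combined acoustic and language-model log score, log P(Y|W) + log P(W), is largest:

```python
import math

def decode(hypotheses):
    """Pick W_hat = argmax_W  log P(Y|W) + log P(W).

    `hypotheses` maps a candidate word string W to a pair
    (acoustic_log_prob, lm_log_prob); both are log-domain scores.
    """
    return max(hypotheses, key=lambda w: sum(hypotheses[w]))

# Hypothetical scores for two candidate sentences.
candidates = {
    "show all ships": (-1520.3, math.log(1e-4)),
    "show all chips": (-1518.9, math.log(1e-7)),
}
print(decode(candidates))  # the language model tips the choice toward "show all ships"
```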
Subword Speech Units
HMM-Based Subword Speech Units
Training of Subword Units
S_W : W_1\, W_2\, W_3 \cdots W_I

S_U : U_1(W_1)\, U_2(W_1) \cdots U_{L(W_1)}(W_1) \;\oplus\; U_1(W_2)\, U_2(W_2) \cdots U_{L(W_2)}(W_2) \;\oplus\; U_1(W_3) \cdots U_{L(W_3)}(W_3) \;\oplus\; \cdots \;\oplus\; U_1(W_I)\, U_2(W_I) \cdots U_{L(W_I)}(W_I)

where U_l(W_i) is the l-th subword unit of word W_i and L(W_i) is the number of units in W_i.
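A minimal sketch of this expansion, assuming a small hypothetical lexicon that maps each word W_i to its subword-unit sequence U_1(W_i) ... U_{L(W_i)}(W_i):

```python
# Hypothetical lexicon: each word maps to its subword (PLU) sequence.
LEXICON = {
    "show":  ["sh", "ow"],
    "all":   ["aw", "l"],
    "ships": ["sh", "i", "p", "s"],
}

def word_string_to_units(words):
    """Concatenate the subword units of every word in the string S_W."""
    units = []
    for w in words:
        units.extend(LEXICON[w])
    return units

print(word_string_to_units(["show", "all", "ships"]))
# ['sh', 'ow', 'aw', 'l', 'sh', 'i', 'p', 's']
```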
Training of Subword Units
Training Procedure
Errors and performance evaluation in PLU recognition
Substitution error (s)
Deletion error (d)
Insertion error (i)
Performance evaluation:
If the total number of PLUs is N, we define:

Correctness rate: (N - s - d) / N
Accuracy rate: (N - s - d - i) / N
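The counts s, d, and i are typically obtained by aligning the recognized PLU string against the reference with a minimum-edit-distance (Levenshtein) alignment; the sketch below illustrates this and then applies the two rate definitions (the example strings are made up):

```python
def align_counts(ref, hyp):
    """Return (s, d, i): substitution, deletion, insertion counts
    from a minimum-edit-distance alignment of hyp against ref."""
    # dp[r][h] = (cost, s, d, i) of aligning ref[:r] with hyp[:h]
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for r in range(len(ref) + 1):
        for h in range(len(hyp) + 1):
            if r == 0 and h == 0:
                continue
            best = None
            if r > 0 and h > 0:                      # match or substitution
                c, s, d, i = dp[r - 1][h - 1]
                sub = 0 if ref[r - 1] == hyp[h - 1] else 1
                best = (c + sub, s + sub, d, i)
            if r > 0:                                # deletion (reference PLU missed)
                c, s, d, i = dp[r - 1][h]
                cand = (c + 1, s, d + 1, i)
                best = cand if best is None or cand < best else best
            if h > 0:                                # insertion (extra PLU recognized)
                c, s, d, i = dp[r][h - 1]
                cand = (c + 1, s, d, i + 1)
                best = cand if best is None or cand < best else best
            dp[r][h] = best
    return dp[len(ref)][len(hyp)][1:]

ref = "sh ow aw l sh i p s".split()
hyp = "sh ow aw sh i i p s".split()
s, d, i = align_counts(ref, hyp)
N = len(ref)
print("correctness:", (N - s - d) / N, " accuracy:", (N - s - d - i) / N)
```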
Language Models for LVCSR
W  w1 w2  wQ ,
P(W )  P( w1 w2  wQ )  P( w1 ) P( w2 | w1 ) P( w3 | w1 w2 ) 
P( wQ | w1 w2  wQ 1).
P( wQ | w1 w2  w j 1 )  P( w j | w j  N 1  w j 1 ),
Word Pair Model: Specify which word pairs are valid
P(w_j \mid w_k) =
\begin{cases}
1 & \text{if } w_k\, w_j \text{ is a valid word pair} \\
0 & \text{otherwise}
\end{cases}
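A minimal sketch of such a word-pair grammar; the validity table below is hypothetical, and P(w_j | w_k) simply tests membership in it:

```python
# Hypothetical word-pair table: each word maps to the set of words
# that may legally follow it.
VALID_PAIRS = {
    "show": {"all", "the"},
    "all":  {"ships", "alerts"},
    "the":  {"ships"},
}

def p_word_pair(w_j, w_k):
    """P(w_j | w_k) for a word-pair grammar: 1 if the pair is valid, else 0."""
    return 1.0 if w_j in VALID_PAIRS.get(w_k, set()) else 0.0

print(p_word_pair("all", "show"))    # 1.0
print(p_word_pair("ships", "show"))  # 0.0
```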
Statistical Language Modeling
P_N(W) = \prod_{i=1}^{Q} P(w_i \mid w_{i-1}, w_{i-2}, \ldots, w_{i-N+1})

\hat{P}(w_i \mid w_{i-1}, \ldots, w_{i-N+1}) = \frac{F(w_i, w_{i-1}, \ldots, w_{i-N+1})}{F(w_{i-1}, \ldots, w_{i-N+1})}

where F(·) is the number of occurrences (frequency) of the given word sequence in the training corpus. A smoothed trigram estimate interpolates the trigram, bigram, and unigram relative frequencies:

\hat{P}(w_3 \mid w_1, w_2) = p_1\, \frac{F(w_1, w_2, w_3)}{F(w_1, w_2)} + p_2\, \frac{F(w_2, w_3)}{F(w_2)} + p_3\, \frac{F(w_3)}{\sum_i F(w_i)}
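A minimal sketch of this count-based estimate on a toy corpus; the interpolation weights p1, p2, p3 are illustrative fixed values (in practice they would be estimated, e.g. by deleted interpolation):

```python
from collections import Counter

def ngram_counts(words):
    """Collect unigram, bigram, and trigram frequencies F(.) from a corpus."""
    uni = Counter(words)
    bi  = Counter(zip(words, words[1:]))
    tri = Counter(zip(words, words[1:], words[2:]))
    return uni, bi, tri

def p_interp(w3, w1, w2, uni, bi, tri, p1=0.6, p2=0.3, p3=0.1):
    """Smoothed trigram estimate:
    P(w3|w1,w2) = p1*F(w1,w2,w3)/F(w1,w2) + p2*F(w2,w3)/F(w2) + p3*F(w3)/sum_i F(wi)."""
    total = sum(uni.values())
    t = tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    b = bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0
    u = uni[w3] / total
    return p1 * t + p2 * b + p3 * u

corpus = "show all ships show all alerts show the ships".split()
uni, bi, tri = ngram_counts(corpus)
print(p_interp("ships", "show", "all", uni, bi, tri))
```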
Perplexity of the Language Model
Entropy of the Source:
1
H   lim     P( w1 , w2 , , wQ ) log P( w1 , w2 , , wQ )
Q  Q
 
P( w1 , w2 , , wQ )  P( w1 ) P( w2 )  P( wQ )
First order entropy of the source:
H    P( w) log P( w)
wV
If the source is ergodic, meaning its statistical properties can be completely characterized in a sufficiently long sequence that the source puts out, then

H = -\lim_{Q \to \infty} \frac{1}{Q} \log P(w_1, w_2, \ldots, w_Q)
We often compute H based on a finite but sufficiently large Q:

H \approx -\frac{1}{Q} \log P(w_1, w_2, \ldots, w_Q)
H measures the average difficulty the recognizer faces when it must determine a word produced by the same source.

If an N-gram language model P_N(W) is used, an estimate of H is:

H_p = -\frac{1}{Q} \sum_{i=1}^{Q} \log P(w_i \mid w_{i-1}, w_{i-2}, \ldots, w_{i-N+1})
In general:

H_p = -\frac{1}{Q} \log \hat{P}(w_1, w_2, \ldots, w_Q)
Perplexity is defined as:

B = 2^{H_p} = \hat{P}(w_1, w_2, \ldots, w_Q)^{-1/Q}
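A minimal sketch of the perplexity computation, assuming the per-word model probabilities P(w_i | history) for a test string are already available (the values below are hypothetical):

```python
import math

def perplexity(word_probs):
    """Given P(w_i | history) for each of the Q words in a test string,
    compute H_p = -(1/Q) * sum(log2 P) and perplexity B = 2 ** H_p
    (base-2 logs, matching B = 2 ** H_p)."""
    q = len(word_probs)
    h_p = -sum(math.log2(p) for p in word_probs) / q
    return 2 ** h_p

# Hypothetical per-word bigram probabilities for a short test sentence.
probs = [0.2, 0.05, 0.1, 0.25]
print(perplexity(probs))  # the inverse geometric mean of the word probabilities
```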
Overall recognition system based on subword units
Naval Resource (Battleship) Management Task:
991-word vocabulary
NG (no grammar): perplexity = 991
Word pair grammar
We can partition the vocabulary into four nonoverlapping sets of words:
\{BE\}: set of words that can either begin or end a sentence, |\{BE\}| = 117
\{B\bar{E}\}: set of words that can begin a sentence but cannot end a sentence, |\{B\bar{E}\}| = 64
\{\bar{B}E\}: set of words that cannot begin a sentence but can end a sentence, |\{\bar{B}E\}| = 448
\{\bar{B}\bar{E}\}: set of words that cannot begin or end a sentence, |\{\bar{B}\bar{E}\}| = 322
The overall FSN allows recognition of sentences of the form:
S: (silence) → \{BE, B\bar{E}\} → (silence) → (\{W\}) ⋯ (\{W\}) → (silence) → \{\bar{B}E, BE\} → (silence)
FSN based on the partitioning scheme: 995 real arcs and 18 null arcs
WP (word pair) grammar: perplexity = 60
WB (word bigram) grammar: perplexity = 20
Control of word insertion/word deletion rate



In the structure discussed above, there is no control over the sentence length. We therefore introduce a word insertion penalty into the Viterbi decoding: a fixed negative quantity is added to the likelihood score at the end of each word arc.
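A minimal sketch of how such a penalty acts on competing paths, assuming each path already carries per-word acoustic and language-model log scores; the penalty value is illustrative and would be tuned empirically:

```python
WORD_INSERTION_PENALTY = -15.0   # fixed negative quantity, tuned on held-out data

def score_path(acoustic_log_likelihoods, lm_log_probs, penalty=WORD_INSERTION_PENALTY):
    """Viterbi-style path score with a word insertion penalty:
    the penalty is added once at the end of each word arc, so longer
    word sequences are discouraged unless the acoustics support them."""
    n_words = len(acoustic_log_likelihoods)
    return sum(acoustic_log_likelihoods) + sum(lm_log_probs) + penalty * n_words

# Two competing segmentations of the same audio (hypothetical scores):
short_hyp = score_path([-300.0, -310.0], [-2.0, -3.0])            # 2 words
long_hyp  = score_path([-150.0, -140.0, -155.0, -160.0],          # 4 words
                       [-2.0, -2.5, -3.0, -2.5])
print(short_hyp, long_hyp)  # the penalty favors the shorter word sequence
```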
Context-dependent subword units
(1) above: ax  b  ah  v   (Context-Independent Units)
(2) above: $-ax+b  ax-b+ah  b-ah+v  ah-v+$   (Triphones, Context-Dependent Units)
(3) above: ax2  b2  ah1  v1   (Multiple Phone Units)
(4) above: ax(above)  b(above)  ah(above)  v(above)   (Word-Dependent Units)
Creation of context-dependent diphones and triphones
p L  p  $ left context (LC) diphone
$  p  pR
pL  p  pR
rightt context (RC) diphone,
left  right context (LRC) diphone.
If c(.) is the occurrence count for a given unit, we can use
a unit reduction rule such as:
If c(p_L - p + p_R) < T, then:
1. p_L - p + p_R → $ - p + p_R   if c($ - p + p_R) ≥ T
2. p_L - p + p_R → p_L - p + $   if c(p_L - p + $) ≥ T
3. p_L - p + p_R → $ - p + $     otherwise
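A sketch of this reduction rule, assuming unit counts c(·) are kept in a dictionary keyed by unit name; the threshold T and the counts below are illustrative:

```python
def reduce_unit(pl, p, pr, counts, T=50):
    """Back off a triphone pl-p+pr whose training count is below T
    to a better-trained diphone, or to the context-independent unit."""
    tri = f"{pl}-{p}+{pr}"
    if counts.get(tri, 0) >= T:
        return tri                      # enough training data: keep the triphone
    rc = f"$-{p}+{pr}"
    if counts.get(rc, 0) >= T:
        return rc                       # rule 1: right-context diphone
    lc = f"{pl}-{p}+$"
    if counts.get(lc, 0) >= T:
        return lc                       # rule 2: left-context diphone
    return f"$-{p}+$"                   # rule 3: context-independent unit

counts = {"sh-ow+aw": 12, "$-ow+aw": 80, "sh-ow+$": 5}
print(reduce_unit("sh", "ow", "aw", counts))  # '$-ow+aw'
```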
CD units using only intraword units for "show all ships":
$-sh+ow  sh-ow+$  $-aw+l  aw-l+$  $-sh+i  sh-i+p  i-p+s  p-s+$

CD units using both intraword and interword units:
$-sh+ow  sh-ow+aw  ow-aw+l  aw-l+sh  l-sh+i  sh-i+p  i-p+s  p-s+$
Smoothing and interpolation of CD PLU models
Bˆ pL p  pR   pL p  pR B pL p  pR   pL p $ B pL p $
  $ p  pR B$ p  pR   $ p $ B$ p $ ,
 pL p  p   pL p $   $ p  p   $ p $  1.
R
R
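A sketch of this interpolation for a discrete-observation HMM, where each B is taken here to be a single state's output probability vector; the distributions and weights below are hypothetical:

```python
def interpolate_B(models, weights):
    """Weighted combination of output distributions:
    B_hat = sum_u lambda_u * B_u, with sum_u lambda_u == 1.
    Each B_u is a list of per-symbol output probabilities for one state."""
    assert abs(sum(weights) - 1.0) < 1e-9
    n_symbols = len(models[0])
    return [sum(w * b[k] for w, b in zip(weights, models)) for k in range(n_symbols)]

# Hypothetical output distributions for the triphone, the two diphones,
# and the CI unit of the same phone (one state, 4 codebook symbols).
B_triphone = [0.70, 0.20, 0.05, 0.05]
B_lc       = [0.60, 0.25, 0.10, 0.05]
B_rc       = [0.55, 0.30, 0.10, 0.05]
B_ci       = [0.40, 0.30, 0.20, 0.10]

B_hat = interpolate_B([B_triphone, B_lc, B_rc, B_ci], [0.5, 0.2, 0.2, 0.1])
print(B_hat, sum(B_hat))  # still a valid distribution (sums to 1)
```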
Implementation issues using CD units
Word junction effects
To handle known phonological changes, a set of phonological rules is superimposed on both the training and recognition networks.
Some typical phonological rules include:
Recognition results using CD units
Position dependent units
D( p)  min LYp |  p   LYp | q 
q p
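A minimal sketch of this measure, assuming the log-likelihoods L(Y_p | λ_q) of unit p's training tokens under each candidate model are already available (the unit names and scores are hypothetical):

```python
def discrimination(p, log_likelihoods):
    """D(p) = min over q != p of  L(Y_p | lambda_p) - L(Y_p | lambda_q).
    A small D(p) means some other model q scores p's training data almost
    as well as p's own model, i.e. unit p is not very distinct."""
    own = log_likelihoods[p][p]
    return min(own - log_likelihoods[p][q] for q in log_likelihoods[p] if q != p)

# Hypothetical scores: L(Y_p | lambda_q) for unit 'aa' evaluated under each model.
scores = {"aa": {"aa": -120.0, "ah": -123.5, "ao": -121.0}}
print(discrimination("aa", scores))  # 1.0: 'ao' is the closest competitor
```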
Unit splitting and clustering
A key source of difficulty in continuous speech recognition is the so-called function words, which include words like a, and, for, in, is.
The function words have the following properties:
Creation of vocabulary-independent units
Semantic Postprocessor for Recognition