Statistical Language Modeling for Speech
Recognition and Information Retrieval
Berlin Chen
Department of Computer Science & Information Engineering
National Taiwan Normal University
Outline
• What is Statistical Language Modeling
• Statistical Language Modeling for Speech Recognition,
Information Retrieval, and Document Summarization
• Categorization of Statistical Language Models
• Main Issues for Statistical Language Models
• Conclusions
What is Statistical Language Modeling?
• Statistical language modeling (LM) aims to capture the regularities in
  human natural language and to quantify the acceptability of a given word
  sequence, e.g.:

  P("hi") ≈ 0.01
  P("and nothing but the truth") ≈ 0.01
  P("and nuts sing on the roof") ≈ 0

Adopted from Joshua Goodman’s public presentation file
What is Statistical LM Used for?
• It has continuously been a focus of active research in the speech and
  language community over the past three decades
• It has also been introduced to information retrieval (IR) problems,
  providing an effective and theoretically attractive probabilistic
  framework for building IR systems
• Other application domains
  – Machine translation
  – Input method editors (IME)
  – Optical character recognition (OCR)
  – Bioinformatics
  – etc.
Statistical LM for Speech Recognition
• Speech recognition: finding the word sequence $W = w_1, w_2, \ldots, w_m$,
  out of the many possible word sequences, that has the maximum posterior
  probability given the input speech utterance $X$:

  $W^{*} = \arg\max_{W} P(W \mid X)
         = \arg\max_{W} \frac{P(X \mid W)\, P(W)}{P(X)}
         = \arg\max_{W} P(X \mid W)\, P(W)$

  where $P(X \mid W)$ is the acoustic model (acoustic modeling) and
  $P(W)$ is the language model (language modeling)
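As a concrete illustration, the short sketch below scores a couple of candidate transcriptions by combining acoustic likelihoods $P(X \mid W)$ with language-model probabilities $P(W)$ in the log domain and takes the arg max. The hypotheses and all numbers are made up purely for illustration.

```python
import math

# Hypothetical candidates: (log P(X|W), log P(W)) -- illustrative numbers only.
candidates = {
    "and nothing but the truth": (-120.5, math.log(1e-3)),
    "and nuts sing on the roof": (-119.8, math.log(1e-9)),
}

def decode(candidates):
    # arg max over W of P(X|W) * P(W), computed as a sum of log-probabilities
    return max(candidates, key=lambda w: sum(candidates[w]))

# The first hypothesis wins despite a slightly worse acoustic score,
# because the language model strongly prefers it.
print(decode(candidates))
```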
Statistical LM for Information Retrieval
• Information retrieval (IR): identifying information items or “documents”
  within a large collection that best match (are most relevant to) a
  “query” provided by a user that describes the user’s information need
• Query-likelihood retrieval model: a query Q is considered to be generated
  from a “relevant” document D that satisfies the information need
  – Estimate the likelihood of each document in the collection being the
    relevant document and rank the documents accordingly

    $P(D \mid Q) = \frac{P(Q \mid D)\, P(D)}{P(Q)} \propto P(Q \mid D)\, P(D)$

    where $P(Q \mid D)$ is the query likelihood and $P(D)$ is the document
    prior probability
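A minimal sketch of query-likelihood ranking under a unigram document model follows. The toy documents, the add-one smoothing inside the document model, and the uniform document prior are all illustrative assumptions, not part of the slides.

```python
from collections import Counter

# Toy document collection (illustrative only)
docs = {
    "d1": "statistical language modeling for speech recognition".split(),
    "d2": "language models for information retrieval and ranking".split(),
}

def query_likelihood(query, doc_words, vocab_size, doc_prior=1.0):
    # P(Q|D) * P(D) under a unigram document model; add-one smoothing keeps
    # unseen query words from zeroing out the whole score.
    counts = Counter(doc_words)
    score = doc_prior
    for w in query:
        score *= (counts[w] + 1) / (len(doc_words) + vocab_size)
    return score

vocab = {w for words in docs.values() for w in words}
query = "language modeling for retrieval".split()
ranking = sorted(docs, key=lambda d: query_likelihood(query, docs[d], len(vocab)),
                 reverse=True)
print(ranking)  # documents ranked by P(Q|D) P(D)
```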
Statistical LM for Document Summarization
• Estimate the likelihood of each sentence S of the document D being in the
  summary and rank the sentences accordingly

  $P(S \mid D) = \frac{P(D \mid S)\, P(S)}{P(D)} \propto P(D \mid S)\, P(S)$

  where $P(D \mid S)$ is the sentence generative probability (or document
  likelihood) and $P(S)$ is the sentence prior probability
  – The sentence generative probability $P(D \mid S)$ can be taken as a
    relevance measure between the document D and the sentence S
  – The sentence prior probability $P(S)$ is, to some extent, a measure of
    the importance of the sentence itself
n-Gram Language Models (1/3)
• For a word sequence W, P(W) can be decomposed into a product of
  conditional probabilities (multiplication rule):

  $P(W) = P(w_1, w_2, \ldots, w_m)
        = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_m \mid w_1, w_2, \ldots, w_{m-1})
        = P(w_1) \prod_{i=2}^{m} P(w_i \mid w_1, w_2, \ldots, w_{i-1})$

  – E.g., P("and nothing but the truth") = P("and") · P("nothing" | "and")
    · P("but" | "and nothing") · P("the" | "and nothing but")
    · P("truth" | "and nothing but the")
  – However, it is impossible to estimate and store
    $P(w_i \mid w_1, w_2, \ldots, w_{i-1})$ if i is large (the curse of
    dimensionality); $w_1, w_2, \ldots, w_{i-1}$ is the history of $w_i$
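The decomposition can be made concrete in a few lines of code. The conditional-probability table below is a stand-in for whatever model is available; the numbers are illustrative only, and in a real n-gram model the history would be truncated to the last n-1 words.

```python
import math

# Illustrative conditional probabilities P(w_i | history), keyed by (word, history)
cond_prob = {
    ("and", ""): 0.02,                       # P("and")
    ("nothing", "and"): 0.001,               # P("nothing" | "and")
    ("but", "and nothing"): 0.1,             # P("but" | "and nothing")
    ("the", "and nothing but"): 0.3,         # P("the" | "and nothing but")
    ("truth", "and nothing but the"): 0.2,   # P("truth" | "and nothing but the")
}

def sentence_logprob(words):
    # log P(W) = log P(w1) + sum_i log P(w_i | w_1 ... w_{i-1})
    logp = 0.0
    for i, w in enumerate(words):
        history = " ".join(words[:i])
        logp += math.log(cond_prob[(w, history)])
    return logp

print(sentence_logprob("and nothing but the truth".split()))
```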
n-Gram Language Models (2/3)
• n-gram approximation

  $P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx
    P(w_i \mid w_{i-n+1}, w_{i-n+2}, \ldots, w_{i-1})$

  i.e., the history is truncated to length n-1
  – Also called (n-1)-order Markov modeling
  – The most prevailing language model
• E.g., trigram modeling

  P("truth" | "whole truth and nothing but the") ≈ P("truth" | "but the")
  P("the" | "whole truth and nothing but") ≈ P("the" | "nothing but")

  – How do we find the probabilities? Maximum likelihood estimation:
    get real text and start counting (empirically)

    $P(\text{"the"} \mid \text{"nothing but"}) =
      C(\text{"nothing but the"}) \,/\, C(\text{"nothing but"})$

    where $C(\cdot)$ is a count; the probability may be zero
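A minimal maximum-likelihood trigram estimator over a toy corpus; the corpus text is made up, and no smoothing is applied, so unseen trigrams get probability zero exactly as the slide warns.

```python
from collections import Counter

corpus = "the whole truth and nothing but the truth".split()  # toy corpus

trigram_counts = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_ml(w, h1, h2):
    # P(w | h1 h2) = C(h1 h2 w) / C(h1 h2); zero if the trigram (or history) is unseen
    denom = bigram_counts[(h1, h2)]
    return trigram_counts[(h1, h2, w)] / denom if denom else 0.0

print(p_ml("the", "nothing", "but"))   # C(nothing but the) / C(nothing but) = 1.0
print(p_ml("roof", "nothing", "but"))  # unseen trigram -> 0.0
```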
n-Gram Language Models (3/3)
• Minimum Word Error (MWE) Discriminative Training
  – Given a training set of observation sequences
    $X = \{X_1, \ldots, X_u, \ldots, X_U\}$, the MWE criterion aims to
    minimize the expected word errors on these observation sequences, using
    the following objective function (the expected raw word accuracy, which
    is maximized):

    $F_{\mathrm{MWE}}(X, \Lambda, \Gamma) = \sum_{u=1}^{U} \sum_{S_u \in W_u}
      \frac{P(X_u \mid S_u)\, P(S_u)}
           {\sum_{S'_u \in W_u} P(X_u \mid S'_u)\, P(S'_u)}\,
      \mathrm{RawAcc}(S_u, R_u)$

  – The MWE objective function can be optimized with respect to the
    language model probabilities using the Extended Baum-Welch (EBW)
    algorithm:

    $P_{\mathrm{MWE}}(w_i \mid h) =
      \frac{P(w_i \mid h)\left(\dfrac{\partial F_{\mathrm{MWE}}(X, \Lambda)}{\partial P(w_i \mid h)} + C\right)}
           {\sum_{k} P(w_k \mid h)\left(\dfrac{\partial F_{\mathrm{MWE}}(X, \Lambda)}{\partial P(w_k \mid h)} + C\right)}$
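The EBW re-estimation step itself is just a constrained re-normalization. The sketch below applies one common form of the update to a single history h, assuming the gradients of F_MWE with respect to P(w|h) have already been computed elsewhere; the probability and gradient values are placeholders.

```python
def ebw_update(probs, grads, C):
    """One Extended Baum-Welch step for the distribution P(. | h).

    probs: current P(w|h); grads: dF_MWE/dP(w|h), assumed precomputed;
    C: smoothing constant chosen large enough to keep every term positive.
    """
    unnormalized = {w: probs[w] * (grads[w] + C) for w in probs}
    z = sum(unnormalized.values())
    return {w: v / z for w, v in unnormalized.items()}

# Placeholder numbers, purely illustrative
probs = {"truth": 0.5, "roof": 0.3, "soup": 0.2}
grads = {"truth": 2.0, "roof": -1.0, "soup": 0.5}
print(ebw_update(probs, grads, C=2.0))  # still sums to 1; mass shifts toward positive gradients
```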
n-Gram-Based Retrieval Model (1/2)
• Each document is a probabilistic generative model consisting of a set of
  n-gram distributions for predicting the query $Q = w_1 w_2 \ldots w_i \ldots w_L$,
  with component models $P(w_i \mid D_j)$, $P(w_i \mid C)$,
  $P(w_i \mid w_{i-1}, D_j)$, and $P(w_i \mid w_{i-1}, C)$ weighted by
  $m_1, m_2, m_3, m_4$ ($m_1 + m_2 + m_3 + m_4 = 1$):

  $P(Q \mid D_j) = \big[\, m_1 P(w_1 \mid D_j) + m_2 P(w_1 \mid C) \,\big]
    \prod_{i=2}^{L} \big[\, m_1 P(w_i \mid D_j) + m_2 P(w_i \mid C)
    + m_3 P(w_i \mid w_{i-1}, D_j) + m_4 P(w_i \mid w_{i-1}, C) \,\big]$

  Features:
  1. A formal mathematical framework
  2. Uses collection statistics rather than heuristics
  3. The retrieval system can be gradually improved through usage

  – Document models can be optimized by the expectation-maximization (EM)
    or minimum classification error (MCE) training algorithms, given a set
    of query and relevant-document pairs
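A sketch of the four-component mixture score $P(Q \mid D_j)$ follows; the component probability tables are stubbed out as dictionaries, and the mixture weights are illustrative values rather than trained ones.

```python
def mixture_score(query, p_uni_doc, p_uni_col, p_bi_doc, p_bi_col,
                  m=(0.4, 0.1, 0.4, 0.1)):
    """P(Q|Dj) with unigram/bigram document and collection components.

    p_uni_doc[w]     ~ P(w|Dj)        p_uni_col[w]     ~ P(w|C)
    p_bi_doc[(v,w)]  ~ P(w|v,Dj)      p_bi_col[(v,w)]  ~ P(w|v,C)
    m1 + m2 + m3 + m4 should sum to 1.
    """
    m1, m2, m3, m4 = m
    w1 = query[0]
    # First query word uses only the unigram components, as in the slide
    score = m1 * p_uni_doc.get(w1, 0.0) + m2 * p_uni_col.get(w1, 0.0)
    for prev, w in zip(query, query[1:]):
        score *= (m1 * p_uni_doc.get(w, 0.0) + m2 * p_uni_col.get(w, 0.0)
                  + m3 * p_bi_doc.get((prev, w), 0.0)
                  + m4 * p_bi_col.get((prev, w), 0.0))
    return score

# Documents would be ranked by this score; the weights m1..m4 are the
# quantities tuned by EM or MCE training.
```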
n-Gram-Based Retrieval Model (2/2)
• MCE training
  – Given a query Q and a desired relevant document D*, define the
    classification error function as

    $E(Q, D^{*}) = \frac{1}{|Q|} \Big[ -\log P(Q \mid D^{*} \text{ is } R)
      + \max_{D'} \log P(Q \mid D' \text{ is not } R) \Big]$

    • One can also take all irrelevant documents in the answer set into
      consideration
    • $E > 0$ means a misclassification; $E \le 0$ means a correct decision
  – Transform the error function into a (sigmoid) loss function:

    $L(Q, D^{*}) = \frac{1}{1 + \exp\!\big(-\gamma\, E(Q, D^{*}) + \theta\big)}$

  – Iteratively update the weighting parameters, e.g., $m_1$:

    $\tilde{m}_1(i+1) = \tilde{m}_1(i) - \varepsilon(i)
      \left. \frac{\partial L(Q, D^{*})}{\partial \tilde{m}_1}
      \right|_{D^{*} = D^{*}(i)}$
n-Gram-Based Summarization Model (1/2)
• Each sentence S of the spoken document $D_i$ is treated as a probabilistic
  generative model of n-grams, while the spoken document is the observation:

  $P_{\mathrm{LM}}(D_i \mid S) = \prod_{w_j \in D_i}
    \big[\, \lambda\, P(w_j \mid S) + (1 - \lambda)\, P(w_j \mid C) \,\big]^{n(w_j, D_i)}$

  – $P(w_j \mid S)$: the sentence model, estimated from the sentence S
  – $P(w_j \mid C)$: the collection model, estimated from a large corpus C
    • Included in order to give every word in the vocabulary some
      probability
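A sketch of ranking sentences by this score follows. The value of λ, the use of simple relative-frequency estimates for the sentence and collection models, and the toy inputs are all illustrative assumptions.

```python
import math
from collections import Counter

def sentence_score(doc_words, sent_words, collection_words, lam=0.7):
    # log P_LM(D|S) = sum over w in D of n(w,D) * log( lam*P(w|S) + (1-lam)*P(w|C) )
    p_s = Counter(sent_words)
    p_c = Counter(collection_words)
    logp = 0.0
    for w, n in Counter(doc_words).items():
        pw = (lam * p_s[w] / len(sent_words)
              + (1 - lam) * p_c[w] / len(collection_words))
        # The collection model keeps pw > 0 for any word seen in the collection
        logp += n * math.log(pw)
    return logp

# Sentences of a document would be ranked by this score, highest first.
```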
n-Gram-Based Summarization Model (2/2)
• Relevance Model (RM)
  – Used to improve the estimation of the sentence models
  – Each sentence S has its own associated relevance model $P(w_j \mid R_S)$,
    constructed from the subset of documents in the collection that are
    relevant to the sentence (the top-L retrieved documents $D_{\mathrm{TopL}}$):

    $P(w_j \mid R_S) = \sum_{D_l \in D_{\mathrm{TopL}}} P(D_l \mid S)\, P(w_j \mid D_l),
      \qquad \text{where } P(D_l \mid S) =
      \frac{P(S \mid D_l)\, P(D_l)}{\sum_{D_u \in D_{\mathrm{TopL}}} P(S \mid D_u)\, P(D_u)}$

  – The relevance model is then linearly combined with the original sentence
    model to form a more accurate sentence model:

    $\hat{P}(w_j \mid S) = \alpha\, P(w_j \mid S) + (1 - \alpha)\, P(w_j \mid R_S)$

    $P_{\mathrm{LM+RM}}(D_i \mid S) = \prod_{w_j \in D_i}
      \big[\, \lambda\, \hat{P}(w_j \mid S) + (1 - \lambda)\, P(w_j \mid C) \,\big]^{n(w_j, D_i)}$
Categorization of Statistical Language Models (1/4)
(Overview figure: the four categories detailed below)
Categorization of Statistical Language Models (2/4)
1. Word-based LMs
– The n-gram model is usually the basic model of this category
– Many other models of this category are designed to overcome
the major drawback of n-gram models
• That is, to capture long-distance word dependence information
without increasing the model complexity rapidly
– E.g., mixed-order Markov model and trigger-based language
model, etc.
2. Word class (or topic)-based LMs
– These models are similar to the n-gram model, but the
relationship among words is constructed via (latent) word
classes
• Once the relationship is established, the probability of a decoded
word given the history words can be readily computed
– E.g., class-based n-gram model, aggregate Markov model, and the
word topical mixture model (WTMM)
Categorization of Statistical Language Models (3/4)
3. Structure-based LMs
– Due to the constraints of grammars, rules for a sentence may be
derived and represented as a parse tree
• Then, we can select among candidate words by the sentence
patterns or head words of the history
– E.g., structured language model
4. Document class (or topic)-based LMs
– Words are aggregated in a document to represent some topics
(or concepts). During speech recognition, the history is
considered as an incomplete document and the associated
latent topic distributions can be discovered on the fly
• The decoded words related to most of the topics that the history
probably belongs to can therefore be selected
– E.g., mixture-based language model, probabilistic latent semantic
analysis (PLSA), and latent Dirichlet allocation (LDA)
Categorization of Statistical Language Models (4/4)
• Ironically, the most successful statistical language
modeling techniques use very little knowledge of what
language is
– $W = w_1, w_2, \ldots, w_m$ may be a sequence of arbitrary symbols, with
no deep structure, intention, or thought behind them
• F. Jelinek said “put language back into language
modeling”
– “Closing remarks” presented at the 1995 Language Modeling Summer
Workshop, Baltimore
Main Issues for Statistical Language Models
• Evaluation
  – How can you tell a good language model from a bad one?
  – Run a speech recognizer or adopt other statistical measures
• Smoothing
  – Deal with the data sparseness of real training data
  – Various approaches have been proposed
• Adaptation
  – The subject matter and lexical characteristics of the linguistic
    content of utterances or documents (e.g., news articles) may be very
    diverse and often change over time
    • LMs should be adapted accordingly
  – Caching: if you say something, you are likely to say it again later
    • Adjust the word frequencies observed in the current conversation
Evaluation (1/7)
• The two most common metrics for evaluating a language
model
– Word Recognition Error Rate (WER)
– Perplexity (PP)
• Word Recognition Error Rate
– Requires the participation of a speech recognition system
(slow!)
– Need to deal with the combination of acoustic probabilities and
language model probabilities (penalizing or weighting between
them)
Evaluation (2/7)
• Perplexity
  – Perplexity is the geometric-average inverse language model probability
    (it measures language model difficulty, not acoustic
    difficulty/confusability):

    $PP(W = w_1, w_2, \ldots, w_m) =
      \sqrt[m]{\frac{1}{P(w_1) \prod_{i=2}^{m} P(w_i \mid w_1, w_2, \ldots, w_{i-1})}}$

  – Can be roughly interpreted as the geometric mean of the branching
    factor of the text when presented to the language model
  – For trigram modeling:

    $PP(W = w_1, w_2, \ldots, w_m) =
      \sqrt[m]{\frac{1}{P(w_1)\, P(w_2 \mid w_1) \prod_{i=3}^{m} P(w_i \mid w_{i-2}, w_{i-1})}}$
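Perplexity is easy to compute from per-word log probabilities. The sketch below assumes some model object with a `cond_prob(word, history)` method; that method name, and the uniform ten-word model used to check the result, are placeholders of my own.

```python
import math

def perplexity(model, words):
    # PP(W) = P(W)^(-1/m) = exp( -(1/m) * sum_i log P(w_i | history) )
    log_prob = 0.0
    for i, w in enumerate(words):
        log_prob += math.log(model.cond_prob(w, words[:i]))  # history = w_1 .. w_{i-1}
    return math.exp(-log_prob / len(words))

# Sanity check: a uniform model over a 10-word digit vocabulary has perplexity 10
class UniformDigits:
    def cond_prob(self, word, history):
        return 1.0 / 10

print(perplexity(UniformDigits(), "one two three four".split()))  # ≈ 10
```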
Evaluation (3/7)
• More about Perplexity
  – Perplexity is an indication of the complexity of the language, if we
    have an accurate estimate of $P(W)$
  – A language with higher perplexity means that the number of words
    branching from a previous word is larger on average
  – A language model with perplexity L has roughly the same difficulty as
    another language model in which every word can be followed by L
    different words with equal probabilities
  – Examples:
    • Ask a speech recognizer to recognize digits: “0, 1, 2, 3, 4, 5, 6, 7,
      8, 9” – easy – perplexity 10
    • Ask a speech recognizer to recognize names at a large institute
      (10,000 persons) – hard – perplexity ≈ 10,000
Evaluation (4/7)
• More about Perplexity (Cont.)
– Training-set perplexity: measures how the language model fits the
training data
– Test-set perplexity: evaluates the generalization capability of the
language model
• When we say perplexity, we mean “test-set perplexity”
Evaluation (5/7)
• Is a language model with lower perplexity better?
– The true (optimal) model for data has the lowest possible
perplexity
– The lower the perplexity, the closer we are to the true model
– Typically, perplexity correlates well with speech recognition word
error rate
• Correlates better when both models are trained on same data
• Doesn’t correlate well when training data changes
– The 20,000-word continuous speech recognition task for the Wall Street
  Journal (WSJ) corpus has a perplexity of about 128 ~ 176 (trigram)
– The 2,000-word conversational Air Travel Information System (ATIS)
  task has a perplexity of less than 20
Evaluation (6/7)
• The perplexity of bigram with different vocabulary size
Evaluation (7/7)
• A rough rule of thumb (recommended by Rosenfeld)
– Reduction of 5% in perplexity is usually not practically significant
– A 10% ~ 20% reduction is noteworthy, and usually translates into
some improvement in application performance
– A perplexity improvement of 30% or more over a good baseline
is quite significant
  Vocabulary                                                              Perplexity  WER
  zero | one | two | three | four | five | six | seven | eight | nine         10       5
  John | tom | sam | bon | ron | susan | sharon | carol | laura | sarah       10       7
  bit | bite | boot | bait | bat | bet | beat | boat | burt | bart            10       9

  (Tasks of recognizing 10 isolated words using IBM ViaVoice; perplexity
  cannot always reflect the difficulty of a speech recognition task)
Smoothing (1/3)
• Maximum likelihood (ML) estimates of language model probabilities were
  shown previously, e.g.:
  – Trigram probabilities

    $P_{\mathrm{ML}}(z \mid xy) = \frac{C(xyz)}{\sum_{w} C(xyw)} = \frac{C(xyz)}{C(xy)}$

  – Bigram probabilities

    $P_{\mathrm{ML}}(y \mid x) = \frac{C(xy)}{\sum_{w} C(xw)} = \frac{C(xy)}{C(x)}$

  where $C(\cdot)$ denotes a count in the training data
Smoothing (2/3)
• Data Sparseness
  – Many actually possible events (word successions) in the test set may
    not be well observed in the training set/data
    • E.g., with bigram modeling:
      P(read | Mulan) = 0
      ⇒ P(Mulan read a book) = 0
      ⇒ P(W) = 0
      ⇒ P(X|W) P(W) = 0
  – Whenever a string W such that P(W) = 0 occurs during a speech
    recognition task, an error will be made
Smoothing (3/3)
• Smoothing
– Assign every string (event/word succession) a nonzero probability, even
  if it never occurs in the training data
– Tends to make distributions flatter, by adjusting low probabilities
  upward and high probabilities downward
Smoothing: Simple Models
• Add-one smoothing
  – For example, pretend each trigram occurs once more than it actually
    does:

    $P_{\mathrm{smooth}}(z \mid xy) = \frac{C(xyz) + 1}{\sum_{w} \big(C(xyw) + 1\big)}
      = \frac{C(xyz) + 1}{C(xy) + V}$

    where V is the total number of vocabulary words
• Add-delta smoothing

    $P_{\mathrm{smooth}}(z \mid xy) = \frac{C(xyz) + \delta}{C(xy) + \delta V}$

  These work badly! DO NOT use these two.
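For completeness, here is what the two (not recommended) estimators look like over trigram counts; the count dictionaries are assumed to come from something like the earlier MLE sketch.

```python
def add_delta(trigram_counts, bigram_counts, vocab_size, w, h1, h2, delta=1.0):
    # P_smooth(w | h1 h2) = (C(h1 h2 w) + delta) / (C(h1 h2) + delta * V)
    # delta = 1.0 gives add-one (Laplace) smoothing; smaller delta gives add-delta.
    return ((trigram_counts.get((h1, h2, w), 0) + delta)
            / (bigram_counts.get((h1, h2), 0) + delta * vocab_size))
```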
Smoothing: Back-Off Models
• The general form of n-gram back-off:

  $P_{\mathrm{smooth}}(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) =
    \begin{cases}
      \hat{P}(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) &
        \text{if } C(w_{i-n+1}, \ldots, w_{i-1}, w_i) > 0 \\
      \alpha(w_{i-n+1}, \ldots, w_{i-1})\,
        P_{\mathrm{smooth}}(w_i \mid w_{i-n+2}, \ldots, w_{i-1}) &
        \text{if } C(w_{i-n+1}, \ldots, w_{i-1}, w_i) = 0
    \end{cases}$

  – $\alpha(w_{i-n+1}, \ldots, w_{i-1})$: a normalizing/scaling factor chosen
    to make the conditional probabilities sum to 1, i.e.,
    $\sum_{w_i} P_{\mathrm{smooth}}(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) = 1$.
    For example,

    $\alpha(w_{i-n+1}, \ldots, w_{i-1}) =
      \frac{1 - \sum_{w_i:\, C(w_{i-n+1}, \ldots, w_{i-1}, w_i) > 0}
              \hat{P}(w_i \mid w_{i-n+1}, \ldots, w_{i-1})}
           {\sum_{w_j:\, C(w_{i-n+1}, \ldots, w_{i-1}, w_j) = 0}
              P_{\mathrm{smooth}}(w_j \mid w_{i-n+2}, \ldots, w_{i-1})}$

    where the numerator involves the (discounted) n-gram estimates and the
    denominator sums the smoothed (n-1)-gram over the unseen successors
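A sketch of the back-off recursion for the simplest bigram-to-unigram case follows. The discounted estimate P̂ is stubbed in with a simple absolute-discounting step and the lower-order model is add-one smoothed; both are illustrative choices of mine, not the slide's choice of estimator.

```python
from collections import Counter

corpus = "the whole truth and nothing but the truth".split()  # toy corpus
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)
D = 0.5  # absolute discount; an illustrative choice for the held-out mass

def p_unigram(w):
    # lower-order model, add-one smoothed so every vocabulary word is covered
    return (unigrams[w] + 1) / (len(corpus) + V)

def p_hat(w, h):
    # discounted bigram estimate for seen bigrams
    return max(bigrams[(h, w)] - D, 0) / unigrams[h]

def alpha(h):
    # scaling factor: leftover bigram mass divided by unigram mass of unseen successors
    seen = [w for w in unigrams if bigrams[(h, w)] > 0]
    left_over = 1.0 - sum(p_hat(w, h) for w in seen)
    unseen_mass = sum(p_unigram(w) for w in unigrams if bigrams[(h, w)] == 0)
    return left_over / unseen_mass

def p_backoff(w, h):
    if bigrams[(h, w)] > 0:
        return p_hat(w, h)
    return alpha(h) * p_unigram(w)

# Sanity check: the distribution over the vocabulary sums to 1 for any history
print(sum(p_backoff(w, "the") for w in unigrams))  # ~1.0
```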
Smoothing: Interpolated Models
• The general form of interpolated n-gram smoothing:

  $P_{\mathrm{smooth}}(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) =
    \lambda(w_{i-n+1}, \ldots, w_{i-1})\,
      P_{\mathrm{ML}}(w_i \mid w_{i-n+1}, \ldots, w_{i-1})
    + \big(1 - \lambda(w_{i-n+1}, \ldots, w_{i-1})\big)\,
      P_{\mathrm{smooth}}(w_i \mid w_{i-n+2}, \ldots, w_{i-1})$

  where $P_{\mathrm{ML}}(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) =
    \dfrac{C(w_{i-n+1}, \ldots, w_{i-1}, w_i)}{C(w_{i-n+1}, \ldots, w_{i-1})}$
  and $C(\cdot)$ is a count

• The key difference between back-off and interpolated models
  – For n-grams with nonzero counts, interpolated models use information
    from lower-order distributions while back-off models do not
  – Moreover, in interpolated models, n-grams with the same counts can
    have different probability estimates
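The interpolated counterpart is shorter to sketch: it always mixes the ML n-gram estimate with the smoothed lower-order model, here recursing down to a uniform distribution over the toy vocabulary used above. The fixed λ per order and the uniform base are illustrative simplifications; in practice λ depends on the history and is tuned on held-out data.

```python
from collections import Counter

corpus = "the whole truth and nothing but the truth".split()  # same toy corpus
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)

def p_interp(w, h=None, lam=0.7):
    # P_smooth(w|h) = lam * P_ML(w|h) + (1 - lam) * P_smooth(w | shorter history)
    if h is None:  # unigram level, backing off to a uniform distribution
        return lam * unigrams[w] / len(corpus) + (1 - lam) / V
    p_ml = bigrams[(h, w)] / unigrams[h] if unigrams[h] else 0.0
    return lam * p_ml + (1 - lam) * p_interp(w)

print(p_interp("truth", "the"))  # seen bigram: ML estimate mixed with unigram information
print(p_interp("and", "the"))    # unseen bigram: still nonzero thanks to the lower order
```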
Caching (1/2)
• The basic idea of caching is to accumulate the n-grams dictated so far in
  the current document/conversation and use them to create a dynamic n-gram
  model
• Trigram interpolated with a unigram cache:

  $P_{\mathrm{cache}}(z \mid xy) = \lambda\, P_{\mathrm{smooth}}(z \mid xy)
    + (1 - \lambda)\, P_{\mathrm{cache}}(z \mid \mathrm{history})$

  $P_{\mathrm{cache}}(z \mid \mathrm{history}) =
    \frac{C(z \in \mathrm{history})}{\mathrm{length}(\mathrm{history})} =
    \frac{C(z \in \mathrm{history})}{\sum_{w} C(w \in \mathrm{history})}$

  where "history" is the document/conversation dictated so far
• Trigram interpolated with a bigram cache:

  $P_{\mathrm{cache}}(z \mid xy) = \lambda\, P_{\mathrm{smooth}}(z \mid xy)
    + (1 - \lambda)\, P_{\mathrm{cache}}(z \mid y, \mathrm{history})$

  $P_{\mathrm{cache}}(z \mid y, \mathrm{history}) =
    \frac{C(yz \in \mathrm{history})}{C(y \in \mathrm{history})}$
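A sketch of a trigram interpolated with a unigram cache follows; `p_smooth` stands in for any static smoothed trigram model, and the value of λ is illustrative.

```python
class UnigramCache:
    """Static trigram model interpolated with a unigram cache of the dictation history."""

    def __init__(self, lam=0.9):
        self.history = []   # words dictated so far in the conversation
        self.lam = lam      # weight of the static (smoothed) trigram model

    def observe(self, word):
        self.history.append(word)

    def prob(self, z, x, y, p_smooth):
        # P_cache(z|xy) = lam * P_smooth(z|xy) + (1 - lam) * C(z in history) / length(history)
        cache_p = self.history.count(z) / len(self.history) if self.history else 0.0
        return self.lam * p_smooth(z, x, y) + (1 - self.lam) * cache_p

# Usage idea: after the user dictates "i swear to tell the truth", call observe()
# on each word; "truth" then gets a boost the next time it is a candidate.
```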
Caching (2/2)
• Real Life of Caching
– Someone says “I swear to tell the truth”
Cache remembers!
– System hears “I swerve to smell the soup”
– Someone says “The whole truth”, and, with cache, system hears
“The toll booth.” – errors are locked in
• Caching works well when users correct as they go,
poorly or even hurts without corrections
Adopted from Joshua Goodman’s public presentation file
LM Integrated into Speech Recognition
• Theoretically,

  $\hat{W} = \arg\max_{W} P(W)\, P(X \mid W)$

• Practically, the language model is a better predictor, while the acoustic
  probabilities aren’t “real” probabilities
  – Penalize insertions:

    $\hat{W} = \arg\max_{W} P(W)^{\alpha}\, P(X \mid W)\, \beta^{\mathrm{length}(W)}$

    where $\alpha$ and $\beta$ can be empirically decided
    • E.g., $\alpha \approx 8$
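In practice this combination is done in the log domain: the LM log score is scaled by α and a per-word insertion penalty (log β) is added. In the sketch below, α = 8 is taken from the slide, while the insertion penalty and all hypothesis scores are made-up illustrative numbers.

```python
def rescore(hypotheses, lm_weight=8.0, insertion_penalty=-0.5):
    """Pick arg max over W of  log P(X|W) + lm_weight * log P(W) + penalty * |W|.

    hypotheses: list of (words, acoustic_logprob, lm_logprob) triples.
    lm_weight plays the role of alpha; insertion_penalty plays the role of log(beta).
    """
    def score(h):
        words, acoustic, lm = h
        return acoustic + lm_weight * lm + insertion_penalty * len(words)
    return max(hypotheses, key=score)[0]

# Illustrative numbers only
hyps = [
    ("i swear to tell the truth".split(), -230.0, -18.0),
    ("i swerve to smell the soup".split(), -228.5, -25.0),
]
print(rescore(hyps))  # the LM weight tips the decision toward the first hypothesis
```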
n-Gram Language Model Adaptation (1/4)
• Count Merging
  – The n-gram conditional probabilities form a multinomial distribution
    • The parameters form sets of independent Dirichlet distributions with
      associated hyperparameters (one set per possible n-gram history, one
      dimension per vocabulary word)
  – The MAP estimate is the mode of the posterior distribution of the
    parameters
n-Gram Language Model Adaptation (2/4)
• Count Merging (cont.)
  – Maximize the posterior distribution of the parameters
  – Differentiate it with respect to the parameters, subject to the
    constraint that the probabilities sum to one (using a Lagrange
    multiplier)
n-Gram Language Model Adaptation (3/4)
• Count Merging (cont.)
  – Parameterization of the prior distribution (I): the hyperparameters are
    set from the counts of the background corpus
  – The adaptation formula for count merging then combines the n-gram
    counts of the background corpus with those of the adaptation corpus
n-Gram Language Model Adaptation (4/4)
• Model Interpolation
  – Parameterization of the prior distribution (II): an alternative choice
    of the hyperparameters, based on the background model's probabilities
    rather than its raw counts
  – The adaptation formula for model interpolation then linearly combines
    the background model with the adaptation model
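For concreteness, the standard count-merging and model-interpolation adaptation formulas from the MAP adaptation literature are sketched below; the symbols $C_B$, $C_A$, $\nu$, $\mu$, and $\lambda$ are my own notation and may differ from the slides'.

```latex
% Count merging: mix background-corpus (B) and adaptation-corpus (A) counts,
% with nu and mu weighting the two sources.
P_{\mathrm{CM}}(w \mid h) =
  \frac{\nu\, C_B(h, w) + \mu\, C_A(h, w)}{\nu\, C_B(h) + \mu\, C_A(h)}

% Model interpolation: linearly combine the two component models,
% with 0 <= lambda <= 1 typically tuned on held-out adaptation data.
P_{\mathrm{MI}}(w \mid h) = \lambda\, P_B(w \mid h) + (1 - \lambda)\, P_A(w \mid h)
```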
Known Weakness in Current n-Gram LM
• Brittleness Across Domain
– Current language models are extremely sensitive to changes in
the style or topic of the text on which they are trained
• E.g., conversations vs. news broadcasts, fiction vs. politics
– Language model adaptation
• In-domain or contemporary text corpora/speech transcripts
• Static or dynamic adaptation
• Local contextual (n-gram) or global semantic/topical information
• False Independence Assumption
  – In order to remain trainable, n-gram modeling assumes that the
    probability of the next word in a sentence depends only on the identity
    of the last n-1 words
    • i.e., (n-1)-order Markov modeling
Conclusions
• Statistical language modeling has proven to be an effective probabilistic
  framework for NLP, ASR, and IR-related applications
• Many issues remain to be solved for statistical language modeling, e.g.,
– Unknown word (or spoken term) detection
– Discriminative training of language models
– Adaptation of language models across different domains and
genres
– Fusion of various (or different levels of) features for language
modeling
• Positional Information?
• Rhetorical (structural) Information?
References
• J. R. Bellegarda. Statistical language model adaptation: review and
  perspectives. Speech Communication 42(1), 93-108, 2004
• X. Liu, W. B. Croft. Statistical language modeling for information
  retrieval. Annual Review of Information Science and Technology 39,
  Chapter 1, 2005
• R. Rosenfeld. Two decades of statistical language modeling: where do we
  go from here? Proceedings of the IEEE, August 2000
• J. Goodman. A bit of progress in language modeling, extended version.
  Microsoft Research Technical Report MSR-TR-2001-72, 2001
• H. S. Chiu, B. Chen. Word topical mixture models for dynamic language
  model adaptation. ICASSP 2007
• J. W. Kuo, B. Chen. Minimum word error based discriminative training of
  language models. Interspeech 2005
• B. Chen, H. M. Wang, L. S. Lee. Spoken document retrieval and
  summarization. Advances in Chinese Spoken Language Processing,
  Chapter 13, 2006
Maximum Likelihood Estimate (MLE) for n-Grams (1/2)
• Given a training corpus T and the language model $\Theta$:

  Corpus: $T = w_{(1)}\, w_{(2)} \ldots w_{(k)} \ldots w_{(L)}$
  Vocabulary: $\mathcal{W} = \{w_1, w_2, \ldots, w_V\}$

  $p(T \mid \Theta) = \prod_{k=1}^{L} p\big(w_{(k)} \mid \text{history of } w_{(k)}\big)
    = \prod_{h \subset T} \prod_{w_i} \theta_{h, w_i}^{\,N_{h, w_i}},
    \qquad \forall h \subset T:\ \sum_{w_j} \theta_{h, w_j} = 1$

  (n-grams with the same history are collected together)

  – Essentially, the distribution of the sample counts $N_{h, w_i}$ with the
    same history h is a multinomial distribution:

    $\forall h \subset T:\quad
      P(N_{h, w_1}, \ldots, N_{h, w_V}) =
      \frac{N_h!}{\prod_{w_i} N_{h, w_i}!} \prod_{w_i} \theta_{h, w_i}^{\,N_{h, w_i}},
      \qquad \sum_{w_i} N_{h, w_i} = N_h,\quad \sum_{w_j} \theta_{h, w_j} = 1$

    where $p(w_i \mid h) = \theta_{h, w_i}$, $N_{h, w_i} = C(h, w_i)$, and
    $N_h = \sum_{w_i} C(h, w_i) = C(h)$ in the corpus T

  – E.g., for the bigram P(總統 | 陳水扁) ("P(president | Chen Shui-bian)"),
    count the occurrences in text such as
    "… 陳水扁 總統 訪問 美國 紐約 … 陳水扁 總統 在 巴拿馬 表示 …"
    ("… President Chen Shui-bian visits New York, USA … President
    Chen Shui-bian states in Panama …")
Maximum Likelihood Estimate (MLE) for n-Grams (2/2)
• Taking the logarithm of $p(T \mid \Theta)$, we have

  $\Phi(\Theta) = \log p(T \mid \Theta) =
    \sum_{h \subset T} \sum_{w_i} N_{h, w_i} \log \theta_{h, w_i}$

• For every history h, maximize $\Phi(\Theta)$ subject to
  $\sum_{w_j} \theta_{h, w_j} = 1$, using Lagrange multipliers
  (and $\frac{d \log x}{dx} = \frac{1}{x}$):

  $\bar{\Phi}(\Theta) = \sum_{h} \sum_{w_i} N_{h, w_i} \log \theta_{h, w_i}
    + \sum_{h} l_h \Big( \sum_{w_j} \theta_{h, w_j} - 1 \Big)$

  $\frac{\partial \bar{\Phi}(\Theta)}{\partial \theta_{h, w_i}} =
    \frac{N_{h, w_i}}{\theta_{h, w_i}} + l_h = 0
    \;\Rightarrow\; \theta_{h, w_i} = -\frac{N_{h, w_i}}{l_h}$

  $\frac{N_{h, w_1}}{\theta_{h, w_1}} = \frac{N_{h, w_2}}{\theta_{h, w_2}} = \cdots
    = \frac{N_{h, w_V}}{\theta_{h, w_V}} = -l_h
    \;\Rightarrow\; -l_h = \sum_{w_s} N_{h, w_s} = N_h$

  (using the same ratio identity as, e.g., $\frac{1}{2} = \frac{2}{4} = \frac{3}{6}
    = \frac{1 + 2 + 3}{2 + 4 + 6}$)

  $\Rightarrow\; \hat{\theta}_{h, w_i} = \frac{N_{h, w_i}}{N_h}
    = \frac{C(h, w_i)}{C(h)}$