Statistical Language Modeling for Speech
Recognition and Information Retrieval
Berlin Chen
Department of Computer Science & Information Engineering
National Taiwan Normal University
Outline
• What is Statistical Language Modeling
• Statistical Language Modeling for Speech Recognition,
Information Retrieval, and Document Summarization
• Categorization of Statistical Language Models
• Main Issues for Statistical Language Models
• Conclusions
What is Statistical Language Modeling?
• Statistical language modeling (LM) aims to capture the regularities in
  human natural language and to quantify the acceptability of a given word
  sequence, e.g.:

  P("hi") ≈ 0.01
  P("and nothing but the truth") ≈ 0.01
  P("and nuts sing on the roof") ≈ 0

Adopted from Joshua Goodman’s public presentation file
What is Statistical LM Used for?
• It has continuously been a focus of active research in the speech and
  language community over the past three decades
• It has also been introduced to information retrieval (IR) problems,
  providing an effective and theoretically attractive probabilistic
  framework for building IR systems
• Other application domains
  – Machine translation
  – Input method editors (IME)
  – Optical character recognition (OCR)
  – Bioinformatics
  – etc.
Statistical LM for Speech Recognition
• Speech recognition: finding the word sequence $W = w_1, w_2, \ldots, w_m$,
  out of the many possible word sequences, that has the maximum posterior
  probability given the input speech utterance $X$:

  $W^{*} = \arg\max_{W} P(W \mid X)
         = \arg\max_{W} \frac{P(X \mid W)\, P(W)}{P(X)}
         = \arg\max_{W} P(X \mid W)\, P(W)$

  where $P(X \mid W)$ is the acoustic model (acoustic modeling) and
  $P(W)$ is the language model (language modeling)
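As a concrete illustration, the short sketch below scores a couple of candidate transcriptions by combining acoustic likelihoods $P(X \mid W)$ with language-model probabilities $P(W)$ in the log domain and takes the arg max. The hypotheses and all numbers are made up purely for illustration.

```python
import math

# Hypothetical candidates: (log P(X|W), log P(W)) -- illustrative numbers only.
candidates = {
    "and nothing but the truth": (-120.5, math.log(1e-3)),
    "and nuts sing on the roof": (-119.8, math.log(1e-9)),
}

def decode(candidates):
    # arg max over W of P(X|W) * P(W), computed as a sum of log-probabilities
    return max(candidates, key=lambda w: sum(candidates[w]))

# The first hypothesis wins despite a slightly worse acoustic score,
# because the language model strongly prefers it.
print(decode(candidates))
```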
Statistical LM for Information Retrieval
• Information retrieval (IR): identifying information items or “documents”
  within a large collection that best match (are most relevant to) a
  “query” provided by a user that describes the user’s information need
• Query-likelihood retrieval model: a query Q is considered to be generated
  from a “relevant” document D that satisfies the information need
  – Estimate the likelihood of each document in the collection being the
    relevant document and rank the documents accordingly

    $P(D \mid Q) = \frac{P(Q \mid D)\, P(D)}{P(Q)} \propto P(Q \mid D)\, P(D)$

    where $P(Q \mid D)$ is the query likelihood and $P(D)$ is the document
    prior probability
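A minimal sketch of query-likelihood ranking under a unigram document model follows. The toy documents, the add-one smoothing inside the document model, and the uniform document prior are all illustrative assumptions, not part of the slides.

```python
from collections import Counter

# Toy document collection (illustrative only)
docs = {
    "d1": "statistical language modeling for speech recognition".split(),
    "d2": "language models for information retrieval and ranking".split(),
}

def query_likelihood(query, doc_words, vocab_size, doc_prior=1.0):
    # P(Q|D) * P(D) under a unigram document model; add-one smoothing keeps
    # unseen query words from zeroing out the whole score.
    counts = Counter(doc_words)
    score = doc_prior
    for w in query:
        score *= (counts[w] + 1) / (len(doc_words) + vocab_size)
    return score

vocab = {w for words in docs.values() for w in words}
query = "language modeling for retrieval".split()
ranking = sorted(docs, key=lambda d: query_likelihood(query, docs[d], len(vocab)),
                 reverse=True)
print(ranking)  # documents ranked by P(Q|D) P(D)
```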
Statistical LM for Document Summarization
• Estimate the likelihood of each sentence S of the document D being in the
  summary and rank the sentences accordingly

  $P(S \mid D) = \frac{P(D \mid S)\, P(S)}{P(D)} \propto P(D \mid S)\, P(S)$

  where $P(D \mid S)$ is the sentence generative probability (or document
  likelihood) and $P(S)$ is the sentence prior probability
  – The sentence generative probability $P(D \mid S)$ can be taken as a
    relevance measure between the document D and the sentence S
  – The sentence prior probability $P(S)$ is, to some extent, a measure of
    the importance of the sentence itself
n-Gram Language Models (1/3)
• For a word sequence W, P(W) can be decomposed into a product of
  conditional probabilities (multiplication rule):

  $P(W) = P(w_1, w_2, \ldots, w_m)
        = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_m \mid w_1, w_2, \ldots, w_{m-1})
        = P(w_1) \prod_{i=2}^{m} P(w_i \mid w_1, w_2, \ldots, w_{i-1})$

  – E.g., P("and nothing but the truth") = P("and") · P("nothing" | "and")
    · P("but" | "and nothing") · P("the" | "and nothing but")
    · P("truth" | "and nothing but the")
  – However, it is impossible to estimate and store
    $P(w_i \mid w_1, w_2, \ldots, w_{i-1})$ if i is large (the curse of
    dimensionality); $w_1, w_2, \ldots, w_{i-1}$ is the history of $w_i$
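The decomposition can be made concrete in a few lines of code. The conditional-probability table below is a stand-in for whatever model is available; the numbers are illustrative only, and in a real n-gram model the history would be truncated to the last n-1 words.

```python
import math

# Illustrative conditional probabilities P(w_i | history), keyed by (word, history)
cond_prob = {
    ("and", ""): 0.02,                       # P("and")
    ("nothing", "and"): 0.001,               # P("nothing" | "and")
    ("but", "and nothing"): 0.1,             # P("but" | "and nothing")
    ("the", "and nothing but"): 0.3,         # P("the" | "and nothing but")
    ("truth", "and nothing but the"): 0.2,   # P("truth" | "and nothing but the")
}

def sentence_logprob(words):
    # log P(W) = log P(w1) + sum_i log P(w_i | w_1 ... w_{i-1})
    logp = 0.0
    for i, w in enumerate(words):
        history = " ".join(words[:i])
        logp += math.log(cond_prob[(w, history)])
    return logp

print(sentence_logprob("and nothing but the truth".split()))
```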
n-Gram Language Models (2/3)
• n-gram approximation

  $P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx
    P(w_i \mid w_{i-n+1}, w_{i-n+2}, \ldots, w_{i-1})$

  i.e., the history is truncated to length n-1
  – Also called (n-1)-order Markov modeling
  – The most prevailing language model
• E.g., trigram modeling

  P("truth" | "whole truth and nothing but the") ≈ P("truth" | "but the")
  P("the" | "whole truth and nothing but") ≈ P("the" | "nothing but")

  – How do we find the probabilities? Maximum likelihood estimation:
    get real text and start counting (empirically)

    $P(\text{"the"} \mid \text{"nothing but"}) =
      C(\text{"nothing but the"}) \,/\, C(\text{"nothing but"})$

    where $C(\cdot)$ is a count; the probability may be zero
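A minimal maximum-likelihood trigram estimator over a toy corpus; the corpus text is made up, and no smoothing is applied, so unseen trigrams get probability zero exactly as the slide warns.

```python
from collections import Counter

corpus = "the whole truth and nothing but the truth".split()  # toy corpus

trigram_counts = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_ml(w, h1, h2):
    # P(w | h1 h2) = C(h1 h2 w) / C(h1 h2); zero if the trigram (or history) is unseen
    denom = bigram_counts[(h1, h2)]
    return trigram_counts[(h1, h2, w)] / denom if denom else 0.0

print(p_ml("the", "nothing", "but"))   # C(nothing but the) / C(nothing but) = 1.0
print(p_ml("roof", "nothing", "but"))  # unseen trigram -> 0.0
```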
n-Gram Language Models (3/3)
• Minimum Word Error (MWE) Discriminative Training
  – Given a training set of observation sequences
    $X = \{X_1, \ldots, X_u, \ldots, X_U\}$, the MWE criterion aims to
    minimize the expected word errors on these observation sequences, using
    the following objective function (the expected raw word accuracy, which
    is maximized):

    $F_{\mathrm{MWE}}(X, \Lambda, \Gamma) = \sum_{u=1}^{U} \sum_{S_u \in W_u}
      \frac{P(X_u \mid S_u)\, P(S_u)}
           {\sum_{S'_u \in W_u} P(X_u \mid S'_u)\, P(S'_u)}\,
      \mathrm{RawAcc}(S_u, R_u)$

  – The MWE objective function can be optimized with respect to the
    language model probabilities using the Extended Baum-Welch (EBW)
    algorithm:

    $P_{\mathrm{MWE}}(w_i \mid h) =
      \frac{P(w_i \mid h)\left(\dfrac{\partial F_{\mathrm{MWE}}(X, \Lambda)}{\partial P(w_i \mid h)} + C\right)}
           {\sum_{k} P(w_k \mid h)\left(\dfrac{\partial F_{\mathrm{MWE}}(X, \Lambda)}{\partial P(w_k \mid h)} + C\right)}$
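The EBW re-estimation step itself is just a constrained re-normalization. The sketch below applies one common form of the update to a single history h, assuming the gradients of F_MWE with respect to P(w|h) have already been computed elsewhere; the probability and gradient values are placeholders.

```python
def ebw_update(probs, grads, C):
    """One Extended Baum-Welch step for the distribution P(. | h).

    probs: current P(w|h); grads: dF_MWE/dP(w|h), assumed precomputed;
    C: smoothing constant chosen large enough to keep every term positive.
    """
    unnormalized = {w: probs[w] * (grads[w] + C) for w in probs}
    z = sum(unnormalized.values())
    return {w: v / z for w, v in unnormalized.items()}

# Placeholder numbers, purely illustrative
probs = {"truth": 0.5, "roof": 0.3, "soup": 0.2}
grads = {"truth": 2.0, "roof": -1.0, "soup": 0.5}
print(ebw_update(probs, grads, C=2.0))  # still sums to 1; mass shifts toward positive gradients
```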
n-Gram-Based Retrieval Model (1/2)
• Each document is a probabilistic generative model consisting of a set of
  n-gram distributions for predicting the query $Q = w_1 w_2 \ldots w_i \ldots w_L$,
  with component models $P(w_i \mid D_j)$, $P(w_i \mid C)$,
  $P(w_i \mid w_{i-1}, D_j)$, and $P(w_i \mid w_{i-1}, C)$ weighted by
  $m_1, m_2, m_3, m_4$ ($m_1 + m_2 + m_3 + m_4 = 1$):

  $P(Q \mid D_j) = \big[\, m_1 P(w_1 \mid D_j) + m_2 P(w_1 \mid C) \,\big]
    \prod_{i=2}^{L} \big[\, m_1 P(w_i \mid D_j) + m_2 P(w_i \mid C)
    + m_3 P(w_i \mid w_{i-1}, D_j) + m_4 P(w_i \mid w_{i-1}, C) \,\big]$

  Features:
  1. A formal mathematical framework
  2. Uses collection statistics rather than heuristics
  3. The retrieval system can be gradually improved through usage

  – Document models can be optimized by the expectation-maximization (EM)
    or minimum classification error (MCE) training algorithms, given a set
    of query and relevant-document pairs
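A sketch of the four-component mixture score $P(Q \mid D_j)$ follows; the component probability tables are stubbed out as dictionaries, and the mixture weights are illustrative values rather than trained ones.

```python
def mixture_score(query, p_uni_doc, p_uni_col, p_bi_doc, p_bi_col,
                  m=(0.4, 0.1, 0.4, 0.1)):
    """P(Q|Dj) with unigram/bigram document and collection components.

    p_uni_doc[w]     ~ P(w|Dj)        p_uni_col[w]     ~ P(w|C)
    p_bi_doc[(v,w)]  ~ P(w|v,Dj)      p_bi_col[(v,w)]  ~ P(w|v,C)
    m1 + m2 + m3 + m4 should sum to 1.
    """
    m1, m2, m3, m4 = m
    w1 = query[0]
    # First query word uses only the unigram components, as in the slide
    score = m1 * p_uni_doc.get(w1, 0.0) + m2 * p_uni_col.get(w1, 0.0)
    for prev, w in zip(query, query[1:]):
        score *= (m1 * p_uni_doc.get(w, 0.0) + m2 * p_uni_col.get(w, 0.0)
                  + m3 * p_bi_doc.get((prev, w), 0.0)
                  + m4 * p_bi_col.get((prev, w), 0.0))
    return score

# Documents would be ranked by this score; the weights m1..m4 are the
# quantities tuned by EM or MCE training.
```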
n-Gram-Based Retrieval Model (2/2)
• MCE training
  – Given a query Q and a desired relevant document D*, define the
    classification error function as

    $E(Q, D^{*}) = \frac{1}{|Q|} \Big[ -\log P(Q \mid D^{*} \text{ is } R)
      + \max_{D'} \log P(Q \mid D' \text{ is not } R) \Big]$

    • One can also take all irrelevant documents in the answer set into
      consideration
    • $E > 0$ means a misclassification; $E \le 0$ means a correct decision
  – Transform the error function into a (sigmoid) loss function:

    $L(Q, D^{*}) = \frac{1}{1 + \exp\!\big(-\gamma\, E(Q, D^{*}) + \theta\big)}$

  – Iteratively update the weighting parameters, e.g., $m_1$:

    $\tilde{m}_1(i+1) = \tilde{m}_1(i) - \varepsilon(i)
      \left. \frac{\partial L(Q, D^{*})}{\partial \tilde{m}_1}
      \right|_{D^{*} = D^{*}(i)}$
n-Gram-Based Summarization Model (1/2)
• Each sentence S of the spoken document $D_i$ is treated as a probabilistic
  generative model of n-grams, while the spoken document is the observation:

  $P_{\mathrm{LM}}(D_i \mid S) = \prod_{w_j \in D_i}
    \big[\, \lambda\, P(w_j \mid S) + (1 - \lambda)\, P(w_j \mid C) \,\big]^{n(w_j, D_i)}$

  – $P(w_j \mid S)$: the sentence model, estimated from the sentence S
  – $P(w_j \mid C)$: the collection model, estimated from a large corpus C
    • Included in order to give every word in the vocabulary some
      probability
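A sketch of ranking sentences by this score follows. The value of λ, the use of simple relative-frequency estimates for the sentence and collection models, and the toy inputs are all illustrative assumptions.

```python
import math
from collections import Counter

def sentence_score(doc_words, sent_words, collection_words, lam=0.7):
    # log P_LM(D|S) = sum over w in D of n(w,D) * log( lam*P(w|S) + (1-lam)*P(w|C) )
    p_s = Counter(sent_words)
    p_c = Counter(collection_words)
    logp = 0.0
    for w, n in Counter(doc_words).items():
        pw = (lam * p_s[w] / len(sent_words)
              + (1 - lam) * p_c[w] / len(collection_words))
        # The collection model keeps pw > 0 for any word seen in the collection
        logp += n * math.log(pw)
    return logp

# Sentences of a document would be ranked by this score, highest first.
```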
n-Gram-Based Summarization Model (2/2)
• Relevance Model (RM)
  – Used to improve the estimation of the sentence models
  – Each sentence S has its own associated relevance model $P(w_j \mid R_S)$,
    constructed from the subset of documents in the collection that are
    relevant to the sentence (the top-L retrieved documents $D_{\mathrm{TopL}}$):

    $P(w_j \mid R_S) = \sum_{D_l \in D_{\mathrm{TopL}}} P(D_l \mid S)\, P(w_j \mid D_l),
      \qquad \text{where } P(D_l \mid S) =
      \frac{P(S \mid D_l)\, P(D_l)}{\sum_{D_u \in D_{\mathrm{TopL}}} P(S \mid D_u)\, P(D_u)}$

  – The relevance model is then linearly combined with the original sentence
    model to form a more accurate sentence model:

    $\hat{P}(w_j \mid S) = \alpha\, P(w_j \mid S) + (1 - \alpha)\, P(w_j \mid R_S)$

    $P_{\mathrm{LM+RM}}(D_i \mid S) = \prod_{w_j \in D_i}
      \big[\, \lambda\, \hat{P}(w_j \mid S) + (1 - \lambda)\, P(w_j \mid C) \,\big]^{n(w_j, D_i)}$
Categorization of Statistical Language Models (1/4)
(Overview figure: the four categories detailed below)
Categorization of Statistical Language Models (2/4)
1. Word-based LMs
– The n-gram model is usually the basic model of this category
– Many other models of this category are designed to overcome
the major drawback of n-gram models
• That is, to capture long-distance word dependence information
without increasing the model complexity rapidly
– E.g., mixed-order Markov model and trigger-based language
model, etc.
2. Word class (or topic)-based LMs
– These models are similar to the n-gram model, but the
relationship among words is constructed via (latent) word
classes
• Once the relationship is established, the probability of a decoded
word given the history words can be readily computed
– E.g., class-based n-gram model, aggregate Markov model, and the
word topical mixture model (WTMM)
Categorization of Statistical Language Models (3/4)
3. Structure-based LMs
– Due to the constraints of grammars, rules for a sentence may be
derived and represented as a parse tree
• Then, we can select among candidate words by the sentence
patterns or head words of the history
– E.g., structured language model
4. Document class (or topic)-based LMs
– Words are aggregated in a document to represent some topics
(or concepts). During speech recognition, the history is
considered as an incomplete document and the associated
latent topic distributions can be discovered on the fly
• The decoded words related to most of the topics that the history
probably belongs to can therefore be selected
– E.g., mixture-based language model, probabilistic latent semantic
analysis (PLSA), and latent Dirichlet allocation (LDA)
Categorization of Statistical Language Models (4/4)
• Ironically, the most successful statistical language
modeling techniques use very little knowledge of what
language is
– $W = w_1, w_2, \ldots, w_m$ may be a sequence of arbitrary symbols, with
no deep structure, intention, or thought behind them
• F. Jelinek said “put language back into language
modeling”
– “Closing remarks” presented at the 1995 Language Modeling Summer
Workshop, Baltimore
Main Issues for Statistical Language Models
• Evaluation
  – How can you tell a good language model from a bad one?
  – Run a speech recognizer or adopt other statistical measures
• Smoothing
  – Deal with the data sparseness of real training data
  – Various approaches have been proposed
• Adaptation
  – The subject matter and lexical characteristics of the linguistic
    content of utterances or documents (e.g., news articles) may be very
    diverse and often change over time
    • LMs should be adapted accordingly
  – Caching: if you say something, you are likely to say it again later
    • Adjust the word frequencies observed in the current conversation
Evaluation (1/7)
• The two most common metrics for evaluating a language
model
– Word Recognition Error Rate (WER)
– Perplexity (PP)
• Word Recognition Error Rate
– Requires the participation of a speech recognition system
(slow!)
– Need to deal with the combination of acoustic probabilities and
language model probabilities (penalizing or weighting between
them)
Evaluation (2/7)
• Perplexity
  – Perplexity is the geometric-average inverse language model probability
    (it measures language model difficulty, not acoustic
    difficulty/confusability):

    $PP(W = w_1, w_2, \ldots, w_m) =
      \sqrt[m]{\frac{1}{P(w_1) \prod_{i=2}^{m} P(w_i \mid w_1, w_2, \ldots, w_{i-1})}}$

  – Can be roughly interpreted as the geometric mean of the branching
    factor of the text when presented to the language model
  – For trigram modeling:

    $PP(W = w_1, w_2, \ldots, w_m) =
      \sqrt[m]{\frac{1}{P(w_1)\, P(w_2 \mid w_1) \prod_{i=3}^{m} P(w_i \mid w_{i-2}, w_{i-1})}}$
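Perplexity is easy to compute from per-word log probabilities. The sketch below assumes some model object with a `cond_prob(word, history)` method; that method name, and the uniform ten-word model used to check the result, are placeholders of my own.

```python
import math

def perplexity(model, words):
    # PP(W) = P(W)^(-1/m) = exp( -(1/m) * sum_i log P(w_i | history) )
    log_prob = 0.0
    for i, w in enumerate(words):
        log_prob += math.log(model.cond_prob(w, words[:i]))  # history = w_1 .. w_{i-1}
    return math.exp(-log_prob / len(words))

# Sanity check: a uniform model over a 10-word digit vocabulary has perplexity 10
class UniformDigits:
    def cond_prob(self, word, history):
        return 1.0 / 10

print(perplexity(UniformDigits(), "one two three four".split()))  # ≈ 10
```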
Evaluation (3/7)
• More about Perplexity
  – Perplexity is an indication of the complexity of the language, if we
    have an accurate estimate of $P(W)$
  – A language with higher perplexity means that the number of words
    branching from a previous word is larger on average
  – A language model with perplexity L has roughly the same difficulty as
    another language model in which every word can be followed by L
    different words with equal probabilities
  – Examples:
    • Ask a speech recognizer to recognize digits: “0, 1, 2, 3, 4, 5, 6, 7,
      8, 9” – easy – perplexity 10
    • Ask a speech recognizer to recognize names at a large institute
      (10,000 persons) – hard – perplexity ≈ 10,000
Evaluation (4/7)
• More about Perplexity (Cont.)
– Training-set perplexity: measures how the language model fits the
training data
– Test-set perplexity: evaluates the generalization capability of the
language model
• When we say perplexity, we mean “test-set perplexity”
Evaluation (5/7)
• Is a language model with lower perplexity better?
– The true (optimal) model for data has the lowest possible
perplexity
– The lower the perplexity, the closer we are to the true model
– Typically, perplexity correlates well with speech recognition word
error rate
• Correlates better when both models are trained on same data
• Doesn’t correlate well when training data changes
– The 20,000-word continuous speech recognition task for the Wall Street
  Journal (WSJ) corpus has a perplexity of about 128 ~ 176 (trigram)
– The 2,000-word conversational Air Travel Information System (ATIS)
  task has a perplexity of less than 20
Evaluation (6/7)
• The perplexity of bigram with different vocabulary size
Evaluation (7/7)
• A rough rule of thumb (recommended by Rosenfeld)
– Reduction of 5% in perplexity is usually not practically significant
– A 10% ~ 20% reduction is noteworthy, and usually translates into
some improvement in application performance
– A perplexity improvement of 30% or more over a good baseline
is quite significant
  Vocabulary                                                              Perplexity  WER
  zero | one | two | three | four | five | six | seven | eight | nine         10       5
  John | tom | sam | bon | ron | susan | sharon | carol | laura | sarah       10       7
  bit | bite | boot | bait | bat | bet | beat | boat | burt | bart            10       9

  (Tasks of recognizing 10 isolated words using IBM ViaVoice; perplexity
  cannot always reflect the difficulty of a speech recognition task)
Smoothing (1/3)
• Maximum likelihood (ML) estimates of language model probabilities were
  shown previously, e.g.:
  – Trigram probabilities

    $P_{\mathrm{ML}}(z \mid xy) = \frac{C(xyz)}{\sum_{w} C(xyw)} = \frac{C(xyz)}{C(xy)}$

  – Bigram probabilities

    $P_{\mathrm{ML}}(y \mid x) = \frac{C(xy)}{\sum_{w} C(xw)} = \frac{C(xy)}{C(x)}$

  where $C(\cdot)$ denotes a count in the training data
Smoothing (2/3)
• Data Sparseness
  – Many actually possible events (word successions) in the test set may
    not be well observed in the training set/data
    • E.g., with bigram modeling:
      P(read | Mulan) = 0
      ⇒ P(Mulan read a book) = 0
      ⇒ P(W) = 0
      ⇒ P(X|W) P(W) = 0
  – Whenever a string W such that P(W) = 0 occurs during a speech
    recognition task, an error will be made
Smoothing (3/3)
• Smoothing
– Assign every string (event/word succession) a nonzero probability, even
  if it never occurs in the training data
– Tends to make distributions flatter, by adjusting low probabilities
  upward and high probabilities downward
Smoothing: Simple Models
• Add-one smoothing
  – For example, pretend each trigram occurs once more than it actually
    does:

    $P_{\mathrm{smooth}}(z \mid xy) = \frac{C(xyz) + 1}{\sum_{w} \big(C(xyw) + 1\big)}
      = \frac{C(xyz) + 1}{C(xy) + V}$

    where V is the total number of vocabulary words
• Add-delta smoothing

    $P_{\mathrm{smooth}}(z \mid xy) = \frac{C(xyz) + \delta}{C(xy) + \delta V}$

  These work badly! DO NOT use these two.
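For completeness, here is what the two (not recommended) estimators look like over trigram counts; the count dictionaries are assumed to come from something like the earlier MLE sketch.

```python
def add_delta(trigram_counts, bigram_counts, vocab_size, w, h1, h2, delta=1.0):
    # P_smooth(w | h1 h2) = (C(h1 h2 w) + delta) / (C(h1 h2) + delta * V)
    # delta = 1.0 gives add-one (Laplace) smoothing; smaller delta gives add-delta.
    return ((trigram_counts.get((h1, h2, w), 0) + delta)
            / (bigram_counts.get((h1, h2), 0) + delta * vocab_size))
```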
Smoothing: Back-Off Models
• The general form of n-gram back-off:

  $P_{\mathrm{smooth}}(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) =
    \begin{cases}
      \hat{P}(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) &
        \text{if } C(w_{i-n+1}, \ldots, w_{i-1}, w_i) > 0 \\
      \alpha(w_{i-n+1}, \ldots, w_{i-1})\,
        P_{\mathrm{smooth}}(w_i \mid w_{i-n+2}, \ldots, w_{i-1}) &
        \text{if } C(w_{i-n+1}, \ldots, w_{i-1}, w_i) = 0
    \end{cases}$

  – $\alpha(w_{i-n+1}, \ldots, w_{i-1})$: a normalizing/scaling factor chosen
    to make the conditional probabilities sum to 1, i.e.,
    $\sum_{w_i} P_{\mathrm{smooth}}(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) = 1$.
    For example,

    $\alpha(w_{i-n+1}, \ldots, w_{i-1}) =
      \frac{1 - \sum_{w_i:\, C(w_{i-n+1}, \ldots, w_{i-1}, w_i) > 0}
              \hat{P}(w_i \mid w_{i-n+1}, \ldots, w_{i-1})}
           {\sum_{w_j:\, C(w_{i-n+1}, \ldots, w_{i-1}, w_j) = 0}
              P_{\mathrm{smooth}}(w_j \mid w_{i-n+2}, \ldots, w_{i-1})}$

    where the numerator involves the (discounted) n-gram estimates and the
    denominator sums the smoothed (n-1)-gram over the unseen successors
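A sketch of the back-off recursion for the simplest bigram-to-unigram case follows. The discounted estimate P̂ is stubbed in with a simple absolute-discounting step and the lower-order model is add-one smoothed; both are illustrative choices of mine, not the slide's choice of estimator.

```python
from collections import Counter

corpus = "the whole truth and nothing but the truth".split()  # toy corpus
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)
D = 0.5  # absolute discount; an illustrative choice for the held-out mass

def p_unigram(w):
    # lower-order model, add-one smoothed so every vocabulary word is covered
    return (unigrams[w] + 1) / (len(corpus) + V)

def p_hat(w, h):
    # discounted bigram estimate for seen bigrams
    return max(bigrams[(h, w)] - D, 0) / unigrams[h]

def alpha(h):
    # scaling factor: leftover bigram mass divided by unigram mass of unseen successors
    seen = [w for w in unigrams if bigrams[(h, w)] > 0]
    left_over = 1.0 - sum(p_hat(w, h) for w in seen)
    unseen_mass = sum(p_unigram(w) for w in unigrams if bigrams[(h, w)] == 0)
    return left_over / unseen_mass

def p_backoff(w, h):
    if bigrams[(h, w)] > 0:
        return p_hat(w, h)
    return alpha(h) * p_unigram(w)

# Sanity check: the distribution over the vocabulary sums to 1 for any history
print(sum(p_backoff(w, "the") for w in unigrams))  # ~1.0
```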
Smoothing: Interpolated Models
• The general form of interpolated n-gram smoothing:

  $P_{\mathrm{smooth}}(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) =
    \lambda(w_{i-n+1}, \ldots, w_{i-1})\,
      P_{\mathrm{ML}}(w_i \mid w_{i-n+1}, \ldots, w_{i-1})
    + \big(1 - \lambda(w_{i-n+1}, \ldots, w_{i-1})\big)\,
      P_{\mathrm{smooth}}(w_i \mid w_{i-n+2}, \ldots, w_{i-1})$

  where $P_{\mathrm{ML}}(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) =
    \dfrac{C(w_{i-n+1}, \ldots, w_{i-1}, w_i)}{C(w_{i-n+1}, \ldots, w_{i-1})}$
  and $C(\cdot)$ is a count

• The key difference between back-off and interpolated models
  – For n-grams with nonzero counts, interpolated models use information
    from lower-order distributions while back-off models do not
  – Moreover, in interpolated models, n-grams with the same counts can
    have different probability estimates
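The interpolated counterpart is shorter to sketch: it always mixes the ML n-gram estimate with the smoothed lower-order model, here recursing down to a uniform distribution over the toy vocabulary used above. The fixed λ per order and the uniform base are illustrative simplifications; in practice λ depends on the history and is tuned on held-out data.

```python
from collections import Counter

corpus = "the whole truth and nothing but the truth".split()  # same toy corpus
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)

def p_interp(w, h=None, lam=0.7):
    # P_smooth(w|h) = lam * P_ML(w|h) + (1 - lam) * P_smooth(w | shorter history)
    if h is None:  # unigram level, backing off to a uniform distribution
        return lam * unigrams[w] / len(corpus) + (1 - lam) / V
    p_ml = bigrams[(h, w)] / unigrams[h] if unigrams[h] else 0.0
    return lam * p_ml + (1 - lam) * p_interp(w)

print(p_interp("truth", "the"))  # seen bigram: ML estimate mixed with unigram information
print(p_interp("and", "the"))    # unseen bigram: still nonzero thanks to the lower order
```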
Caching (1/2)
• The basic idea of caching is to accumulate the n-grams dictated so far in
  the current document/conversation and use them to create a dynamic n-gram
  model
• Trigram interpolated with a unigram cache:

  $P_{\mathrm{cache}}(z \mid xy) = \lambda\, P_{\mathrm{smooth}}(z \mid xy)
    + (1 - \lambda)\, P_{\mathrm{cache}}(z \mid \mathrm{history})$

  $P_{\mathrm{cache}}(z \mid \mathrm{history}) =
    \frac{C(z \in \mathrm{history})}{\mathrm{length}(\mathrm{history})} =
    \frac{C(z \in \mathrm{history})}{\sum_{w} C(w \in \mathrm{history})}$

  where "history" is the document/conversation dictated so far
• Trigram interpolated with a bigram cache:

  $P_{\mathrm{cache}}(z \mid xy) = \lambda\, P_{\mathrm{smooth}}(z \mid xy)
    + (1 - \lambda)\, P_{\mathrm{cache}}(z \mid y, \mathrm{history})$

  $P_{\mathrm{cache}}(z \mid y, \mathrm{history}) =
    \frac{C(yz \in \mathrm{history})}{C(y \in \mathrm{history})}$
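A sketch of a trigram interpolated with a unigram cache follows; `p_smooth` stands in for any static smoothed trigram model, and the value of λ is illustrative.

```python
class UnigramCache:
    """Static trigram model interpolated with a unigram cache of the dictation history."""

    def __init__(self, lam=0.9):
        self.history = []   # words dictated so far in the conversation
        self.lam = lam      # weight of the static (smoothed) trigram model

    def observe(self, word):
        self.history.append(word)

    def prob(self, z, x, y, p_smooth):
        # P_cache(z|xy) = lam * P_smooth(z|xy) + (1 - lam) * C(z in history) / length(history)
        cache_p = self.history.count(z) / len(self.history) if self.history else 0.0
        return self.lam * p_smooth(z, x, y) + (1 - self.lam) * cache_p

# Usage idea: after the user dictates "i swear to tell the truth", call observe()
# on each word; "truth" then gets a boost the next time it is a candidate.
```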
Caching (2/2)
• Real Life of Caching
– Someone says “I swear to tell the truth”
Cache remembers!
– System hears “I swerve to smell the soup”
– Someone says “The whole truth”, and, with cache, system hears
“The toll booth.” – errors are locked in
• Caching works well when users correct as they go,
poorly or even hurts without corrections
Adopted from Joshua Goodman’s public presentation file
LM Integrated into Speech Recognition
• Theoretically,

  $\hat{W} = \arg\max_{W} P(W)\, P(X \mid W)$

• Practically, the language model is a better predictor, while the acoustic
  probabilities aren’t “real” probabilities
  – Penalize insertions:

    $\hat{W} = \arg\max_{W} P(W)^{\alpha}\, P(X \mid W)\, \beta^{\mathrm{length}(W)}$

    where $\alpha$ and $\beta$ can be empirically decided
    • E.g., $\alpha \approx 8$
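In practice this combination is done in the log domain: the LM log score is scaled by α and a per-word insertion penalty (log β) is added. In the sketch below, α = 8 is taken from the slide, while the insertion penalty and all hypothesis scores are made-up illustrative numbers.

```python
def rescore(hypotheses, lm_weight=8.0, insertion_penalty=-0.5):
    """Pick arg max over W of  log P(X|W) + lm_weight * log P(W) + penalty * |W|.

    hypotheses: list of (words, acoustic_logprob, lm_logprob) triples.
    lm_weight plays the role of alpha; insertion_penalty plays the role of log(beta).
    """
    def score(h):
        words, acoustic, lm = h
        return acoustic + lm_weight * lm + insertion_penalty * len(words)
    return max(hypotheses, key=score)[0]

# Illustrative numbers only
hyps = [
    ("i swear to tell the truth".split(), -230.0, -18.0),
    ("i swerve to smell the soup".split(), -228.5, -25.0),
]
print(rescore(hyps))  # the LM weight tips the decision toward the first hypothesis
```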
n-Gram Language Model Adaptation (1/4)
• Count Merging
  – The n-gram conditional probabilities form a multinomial distribution
    • The parameters form sets of independent Dirichlet distributions with
      associated hyperparameters (one set per possible n-gram history, one
      dimension per vocabulary word)
  – The MAP estimate is the mode of the posterior distribution of the
    parameters
n-Gram Language Model Adaptation (2/4)
• Count Merging (cont.)
  – Maximize the posterior distribution of the parameters
  – Differentiate it with respect to the parameters, subject to the
    constraint that the probabilities sum to one (using a Lagrange
    multiplier)
n-Gram Language Model Adaptation (3/4)
• Count Merging (cont.)
  – Parameterization of the prior distribution (I): the hyperparameters are
    set from the counts of the background corpus
  – The adaptation formula for count merging then combines the n-gram
    counts of the background corpus with those of the adaptation corpus
n-Gram Language Model Adaptation (4/4)
• Model Interpolation
  – Parameterization of the prior distribution (II): an alternative choice
    of the hyperparameters, based on the background model's probabilities
    rather than its raw counts
  – The adaptation formula for model interpolation then linearly combines
    the background model with the adaptation model
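For concreteness, the standard count-merging and model-interpolation adaptation formulas from the MAP adaptation literature are sketched below; the symbols $C_B$, $C_A$, $\nu$, $\mu$, and $\lambda$ are my own notation and may differ from the slides'.

```latex
% Count merging: mix background-corpus (B) and adaptation-corpus (A) counts,
% with nu and mu weighting the two sources.
P_{\mathrm{CM}}(w \mid h) =
  \frac{\nu\, C_B(h, w) + \mu\, C_A(h, w)}{\nu\, C_B(h) + \mu\, C_A(h)}

% Model interpolation: linearly combine the two component models,
% with 0 <= lambda <= 1 typically tuned on held-out adaptation data.
P_{\mathrm{MI}}(w \mid h) = \lambda\, P_B(w \mid h) + (1 - \lambda)\, P_A(w \mid h)
```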
Known Weakness in Current n-Gram LM
• Brittleness Across Domain
– Current language models are extremely sensitive to changes in
the style or topic of the text on which they are trained
• E.g., conversations vs. news broadcasts, fiction vs. politics
– Language model adaptation
• In-domain or contemporary text corpora/speech transcripts
• Static or dynamic adaptation
• Local contextual (n-gram) or global semantic/topical information
• False Independence Assumption
  – In order to remain trainable, n-gram modeling assumes that the
    probability of the next word in a sentence depends only on the identity
    of the last n-1 words
    • i.e., (n-1)-order Markov modeling
Conclusions
• Statistical language modeling has proven to be an effective probabilistic
  framework for NLP, ASR, and IR-related applications
• Many issues remain to be solved for statistical language modeling, e.g.,
– Unknown word (or spoken term) detection
– Discriminative training of language models
– Adaptation of language models across different domains and
genres
– Fusion of various (or different levels of) features for language
modeling
• Positional Information?
• Rhetorical (structural) Information?
References
• J. R. Bellegarda. Statistical language model adaptation: review and
  perspectives. Speech Communication 42(1), 93-108, 2004
• X. Liu, W. B. Croft. Statistical language modeling for information
  retrieval. Annual Review of Information Science and Technology 39,
  Chapter 1, 2005
• R. Rosenfeld. Two decades of statistical language modeling: where do we
  go from here? Proceedings of the IEEE, August 2000
• J. Goodman. A bit of progress in language modeling, extended version.
  Microsoft Research Technical Report MSR-TR-2001-72, 2001
• H. S. Chiu, B. Chen. Word topical mixture models for dynamic language
  model adaptation. ICASSP 2007
• J. W. Kuo, B. Chen. Minimum word error based discriminative training of
  language models. Interspeech 2005
• B. Chen, H. M. Wang, L. S. Lee. Spoken document retrieval and
  summarization. Advances in Chinese Spoken Language Processing,
  Chapter 13, 2006
Maximum Likelihood Estimate (MLE) for n-Grams (1/2)
• Given a training corpus T and the language model $\Theta$:

  Corpus: $T = w_{(1)}\, w_{(2)} \ldots w_{(k)} \ldots w_{(L)}$
  Vocabulary: $\mathcal{W} = \{w_1, w_2, \ldots, w_V\}$

  $p(T \mid \Theta) = \prod_{k=1}^{L} p\big(w_{(k)} \mid \text{history of } w_{(k)}\big)
    = \prod_{h \subset T} \prod_{w_i} \theta_{h, w_i}^{\,N_{h, w_i}},
    \qquad \forall h \subset T:\ \sum_{w_j} \theta_{h, w_j} = 1$

  (n-grams with the same history are collected together)

  – Essentially, the distribution of the sample counts $N_{h, w_i}$ with the
    same history h is a multinomial distribution:

    $\forall h \subset T:\quad
      P(N_{h, w_1}, \ldots, N_{h, w_V}) =
      \frac{N_h!}{\prod_{w_i} N_{h, w_i}!} \prod_{w_i} \theta_{h, w_i}^{\,N_{h, w_i}},
      \qquad \sum_{w_i} N_{h, w_i} = N_h,\quad \sum_{w_j} \theta_{h, w_j} = 1$

    where $p(w_i \mid h) = \theta_{h, w_i}$, $N_{h, w_i} = C(h, w_i)$, and
    $N_h = \sum_{w_i} C(h, w_i) = C(h)$ in the corpus T

  – E.g., for the bigram P(總統 | 陳水扁) ("P(president | Chen Shui-bian)"),
    count the occurrences in text such as
    "… 陳水扁 總統 訪問 美國 紐約 … 陳水扁 總統 在 巴拿馬 表示 …"
    ("… President Chen Shui-bian visits New York, USA … President
    Chen Shui-bian states in Panama …")
Maximum Likelihood Estimate (MLE) for n-Grams (2/2)
• Taking the logarithm of $p(T \mid \Theta)$, we have

  $\Phi(\Theta) = \log p(T \mid \Theta) =
    \sum_{h \subset T} \sum_{w_i} N_{h, w_i} \log \theta_{h, w_i}$

• For every history h, maximize $\Phi(\Theta)$ subject to
  $\sum_{w_j} \theta_{h, w_j} = 1$, using Lagrange multipliers
  (and $\frac{d \log x}{dx} = \frac{1}{x}$):

  $\bar{\Phi}(\Theta) = \sum_{h} \sum_{w_i} N_{h, w_i} \log \theta_{h, w_i}
    + \sum_{h} l_h \Big( \sum_{w_j} \theta_{h, w_j} - 1 \Big)$

  $\frac{\partial \bar{\Phi}(\Theta)}{\partial \theta_{h, w_i}} =
    \frac{N_{h, w_i}}{\theta_{h, w_i}} + l_h = 0
    \;\Rightarrow\; \theta_{h, w_i} = -\frac{N_{h, w_i}}{l_h}$

  $\frac{N_{h, w_1}}{\theta_{h, w_1}} = \frac{N_{h, w_2}}{\theta_{h, w_2}} = \cdots
    = \frac{N_{h, w_V}}{\theta_{h, w_V}} = -l_h
    \;\Rightarrow\; -l_h = \sum_{w_s} N_{h, w_s} = N_h$

  (using the same ratio identity as, e.g., $\frac{1}{2} = \frac{2}{4} = \frac{3}{6}
    = \frac{1 + 2 + 3}{2 + 4 + 6}$)

  $\Rightarrow\; \hat{\theta}_{h, w_i} = \frac{N_{h, w_i}}{N_h}
    = \frac{C(h, w_i)}{C(h)}$