Statistical Language Modeling for Speech Recognition and Information Retrieval
Berlin Chen
Department of Computer Science & Information Engineering
National Taiwan Normal University

Outline
• What is statistical language modeling
• Statistical language modeling for speech recognition, information retrieval, and document summarization
• Categorization of statistical language models
• Main issues for statistical language models
• Conclusions

What is Statistical Language Modeling?
• Statistical language modeling (LM) aims to capture the regularities of human natural language and to quantify the acceptability of a given word sequence, e.g.:
  P("hi") ≈ 0.01
  P("and nothing but the truth") ≈ 0.01
  P("and nuts sing on the roof") ≈ 0
  (Adopted from Joshua Goodman's public presentation file)

What is Statistical LM Used for?
• It has continuously been a focus of active research in the speech and language community over the past three decades
• It has also been introduced to information retrieval (IR) problems, providing an effective and theoretically attractive probabilistic framework for building IR systems
• Other application domains:
  – Machine translation
  – Input method editors (IME)
  – Optical character recognition (OCR)
  – Bioinformatics
  – etc.

Statistical LM for Speech Recognition
• Speech recognition: finding the word sequence W = w_1, w_2, ..., w_m that, among the many possible word sequences, has the maximum posterior probability given the input speech utterance X:
  W* = argmax_W P(W | X) = argmax_W [ P(X | W) P(W) ] / P(X) = argmax_W P(X | W) P(W)
  where P(X | W) is the acoustic model and P(W) is the language model

Statistical LM for Information Retrieval
• Information retrieval (IR): identifying the information items or "documents" within a large collection that best match (are most relevant to) a "query" provided by a user to describe the user's information need
• Query-likelihood retrieval model: a query Q is considered to be generated from a "relevant" document D that satisfies the information need
  – Estimate the likelihood of each document in the collection being the relevant document and rank the documents accordingly:
    P(D | Q) = P(Q | D) P(D) / P(Q) ∝ P(Q | D) P(D)
    where P(Q | D) is the query likelihood and P(D) is the document prior probability

Statistical LM for Document Summarization
• Estimate the likelihood of each sentence S of a document D being included in the summary and rank the sentences accordingly:
  P(S | D) = P(D | S) P(S) / P(D) ∝ P(D | S) P(S)
  where P(D | S) is the sentence generative probability (or document likelihood) and P(S) is the sentence prior probability
  – The sentence generative probability P(D | S) can be taken as a relevance measure between the document D and the sentence S
  – The sentence prior probability P(S) is, to some extent, a measure of the importance of the sentence itself

n-Gram Language Models (1/3)
• For a word sequence W, P(W) can be decomposed into a product of conditional probabilities (multiplication/chain rule):
  P(W) = P(w_1, w_2, ..., w_m)
       = P(w_1) P(w_2 | w_1) P(w_3 | w_1, w_2) ... P(w_m | w_1, w_2, ..., w_{m-1})
       = P(w_1) ∏_{i=2}^{m} P(w_i | w_1, w_2, ..., w_{i-1})
  – E.g., P("and nothing but the truth") = P("and") × P("nothing" | "and") × P("but" | "and nothing") × P("the" | "and nothing but") × P("truth" | "and nothing but the")
  – However, it is impossible to estimate and store P(w_i | w_1, w_2, ..., w_{i-1}) when i is large (the curse of dimensionality); w_1, w_2, ..., w_{i-1} is called the history of w_i
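As a minimal illustration of the chain-rule decomposition above, the sketch below scores a word sequence by multiplying conditional probabilities supplied by a hypothetical cond_prob(word, history) function; the function name and the toy probabilities are assumptions for illustration only, not part of the original slides.

```python
import math

def sentence_log_prob(words, cond_prob):
    """Chain-rule decomposition: log P(w_1..w_m) = sum_i log P(w_i | w_1..w_{i-1})."""
    history, total = [], 0.0
    for w in words:
        total += math.log(cond_prob(w, tuple(history)))
        history.append(w)
    return total

# Toy conditional model (hypothetical numbers, for illustration only).
toy = {
    ((), "and"): 0.05,
    (("and",), "nothing"): 0.01,
    (("and", "nothing"), "but"): 0.2,
    (("and", "nothing", "but"), "the"): 0.3,
    (("and", "nothing", "but", "the"), "truth"): 0.1,
}
cond_prob = lambda w, h: toy.get((h, w), 1e-8)  # small floor for unseen events

print(sentence_log_prob("and nothing but the truth".split(), cond_prob))
```

With a full (unbounded) history, such a table would have to cover every possible prefix of every sentence, which is exactly the storage and estimation problem the n-gram approximation in the next section addresses.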
"the" " whole trut h and nothin g but" P "the" " nothing bu t" – How do we find probabilities? (maximum likelihood estimation) • Get real text, and start counting (empirically) P" the" " nothing but" Cnothing but the/ Cnothing but count Probability may be zero Berlin Chen 9 n-Gram Language Models (3/3) • Minimum Word Error (MWE) Discriminative Training – Given a training set of observation sequences X {X1,, X u ,, XU } , the MWE criterion aims to minimize the expected word errors of these observation sequences using the following objective function U FMWE ( X , , ) u1 SuWu P X u | Su ) P ( Su ) RawAcc ( Su , Ru ) P X u | Su ) P ( Su ) Su Wu – MWE objective function can be optimized with respect to the language model probabilities using Extended Baum-Welch (EBW) algorithm F ( X , ) P ( wi | h ) MWE C P ( wi | h ) PMWE ( wi | h ) F ( X , ) C k P ( wk | h ) MWE P ( wk | h ) Berlin Chen 10 n-Gram-Based Retrieval Model (1/2) • Each document is a probabilistic generative model consisting of a set of n-gram distributions for predicting the query Features: Query Q w1w2..wi ..wL m1 P wi D j m2 P wi C m3 P wi wi 1, D j m4 1. A formal mathematic framework 2. Use collection statistics but not heuristics 3. The retrieval system can be gradually improved through usage P wi wi 1 , C m1 m2 m3 m4 1 m P w D m Pw C m P w w P Q D j m1P w1 D j m2 P w1 C L i 2 1 i j 2 i 3 i i 1 , D j m Pw w 4 i i 1 , C – Document models can be optimized by the expectationmaximization (EM) or minimum classification error (MCE) training algorithms, given a set of query and relevant document pairs Berlin Chen 11 n-Gram-Based Retrieval Model (2/2) • MCE training – Given a query Q and a desired relevant doc D* , define the classification error function as: 1 * E (Q, D ) log PQ D* is R max log PQ D' is not R D' Q Also can take all irrelevant doc in the answer set into consideration “>0”: means misclassified; “<=0”: means a correct decision – Transform the error function to the loss function L(Q, D* ) 1 1 exp( E (Q, D* ) ) – Iteratively update the weighting parameters, e.g., m1 LQ, D * ~ ~ m1 i 1 m1 i i ~ m 1 D * D * i Berlin Chen 12 n-Gram-Based Summarization Model (1/2) • Each sentence S of the spoken document Di is treated as a probabilistic generative model of n-grams, while the spoken document is the observation PLM Di S w jDi Pw j S 1 Pw j C nw ,D j i – Pw j S : the sentence model, estimated from the sentence S – Pw j C : the collection model, estimated from a large corpus C • In order to have some probability of every word in the vocabulary Berlin Chen 13 n-Gram-Based Summarization Model (2/2) • Relevance Model (RM) – In order to improve the estimation of sentence models – Each sentence S has its own associated relevance model Pw j RS , constructed by the subset of documents in the collection that are relevant to the sentence Pw j RS where P Dl S PD S Pw Dl D l j Dl To p L P S Dl P Dl P S Du P Du duDTop L – The relevance model is then linearly combined with the original sentence model to form a more accurate sentence model Pˆ w j S Pw j S 1 Pw j RS PLM RM Di S Pˆ w j S 1 P w j C w jDi n w j ,Di Berlin Chen 14 Categorization of Statistical Language Models (1/4) Berlin Chen 15 Categorization of Statistical Language Models (2/4) 1. 
Categorization of Statistical Language Models (1/4)
(overview of the four model categories discussed below)

Categorization of Statistical Language Models (2/4)
1. Word-based LMs
  – The n-gram model is usually the basic model of this category
  – Many other models in this category are designed to overcome the major drawback of n-gram models
    • That is, to capture long-distance word dependence information without rapidly increasing the model complexity
  – E.g., the mixed-order Markov model and the trigger-based language model
2. Word class (or topic)-based LMs
  – These models are similar to the n-gram model, but the relationship among words is constructed via (latent) word classes
    • Once the relationship is established, the probability of a decoded word given the history words can be easily computed
  – E.g., the class-based n-gram model, the aggregate Markov model, and the word topical mixture model (WTMM)

Categorization of Statistical Language Models (3/4)
3. Structure-based LMs
  – Under the constraints of a grammar, rules for a sentence may be derived and represented as a parse tree
    • Candidate words can then be selected according to the sentence patterns or the head words of the history
  – E.g., the structured language model
4. Document class (or topic)-based LMs
  – Words are aggregated in a document to represent certain topics (or concepts); during speech recognition, the history is treated as an incomplete document, and the associated latent topic distributions can be discovered on the fly
    • Decoded words related to the topics that the history most probably belongs to can therefore be favored
  – E.g., the mixture-based language model, probabilistic latent semantic analysis (PLSA), and latent Dirichlet allocation (LDA)

Categorization of Statistical Language Models (4/4)
• Ironically, the most successful statistical language modeling techniques use very little knowledge of what language is
  – W = w_1, w_2, ..., w_m may be a sequence of arbitrary symbols, with no deep structure, intention, or thought behind them
• F. Jelinek urged researchers to "put language back into language modeling"
  – "Closing remarks" presented at the 1995 Language Modeling Summer Workshop, Baltimore
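To make the class-based idea above concrete, here is a minimal sketch in the spirit of a class-based bigram, where P(w_i | w_{i-1}) is approximated by P(w_i | c(w_i)) P(c(w_i) | c(w_{i-1})); the class map and the probabilities below are toy assumptions, not values from the slides.

```python
# Toy class-based bigram: P(w_i | w_{i-1}) ~ P(w_i | c(w_i)) * P(c(w_i) | c(w_{i-1})).
word2class = {"monday": "DAY", "friday": "DAY", "on": "PREP", "by": "PREP"}
p_word_given_class = {("monday", "DAY"): 0.5, ("friday", "DAY"): 0.5,
                      ("on", "PREP"): 0.6, ("by", "PREP"): 0.4}
# Keys are (current_class, previous_class): P(current_class | previous_class).
p_class_given_class = {("DAY", "PREP"): 0.7, ("PREP", "DAY"): 0.1}

def class_bigram_prob(prev_word, word):
    """Class-based bigram probability; unseen events fall back to a small floor."""
    c_prev, c_cur = word2class.get(prev_word), word2class.get(word)
    emit = p_word_given_class.get((word, c_cur), 0.0)
    trans = p_class_given_class.get((c_cur, c_prev), 0.0)
    return emit * trans or 1e-8

print(class_bigram_prob("on", "monday"))  # P(monday|DAY) * P(DAY|PREP) = 0.5 * 0.7
```

The appeal is that the class-level transition table is far smaller than a word-level bigram table, so word pairs never seen together can still receive sensible probabilities through their classes.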
Main Issues for Statistical Language Models
• Evaluation
  – How can you tell a good language model from a bad one?
  – Run a speech recognizer, or adopt other statistical measurements
• Smoothing
  – Deal with the sparseness of real training data
  – Various approaches have been proposed
• Adaptation
  – The subject matter and lexical characteristics of the linguistic content of utterances or documents (e.g., news articles) can be very diverse and often change over time
    • LMs should be adapted accordingly
  – Caching: if you say something, you are likely to say it again later
    • Adjust word frequencies according to what has been observed in the current conversation

Evaluation (1/7)
• The two most common metrics for evaluating a language model:
  – Word recognition error rate (WER)
  – Perplexity (PP)
• Word recognition error rate
  – Requires the participation of a speech recognition system (slow!)
  – Needs to deal with the combination of acoustic and language model probabilities (penalizing or weighting between them)

Evaluation (2/7)
• Perplexity
  – Perplexity is the geometric-average inverse language model probability (it measures language model difficulty, not acoustic difficulty/confusability):
    PP(W) = [ 1 / ( P(w_1) ∏_{i=2}^{m} P(w_i | w_1, w_2, ..., w_{i-1}) ) ]^{1/m},   W = w_1, w_2, ..., w_m
  – It can be roughly interpreted as the geometric mean of the branching factor of the text when presented to the language model
  – For trigram modeling:
    PP(W) = [ 1 / ( P(w_1) P(w_2 | w_1) ∏_{i=3}^{m} P(w_i | w_{i-2}, w_{i-1}) ) ]^{1/m}

Evaluation (3/7)
• More about perplexity
  – Perplexity is an indication of the complexity of the language, provided we have an accurate estimate of P(W)
  – A language with higher perplexity means that the number of words branching from a previous word is larger on average
  – A language model with perplexity L has roughly the same difficulty as another language model in which every word can be followed by L different words with equal probabilities
  – Examples:
    • Ask a speech recognizer to recognize the digits "0, 1, 2, 3, 4, 5, 6, 7, 8, 9": easy (perplexity 10)
    • Ask a speech recognizer to recognize names at a large institute (10,000 persons): hard (perplexity 10,000)

Evaluation (4/7)
• More about perplexity (cont.)
  – Training-set perplexity measures how well the language model fits the training data
  – Test-set perplexity evaluates the generalization capability of the language model
    • When we say "perplexity", we usually mean test-set perplexity

Evaluation (5/7)
• Is a language model with lower perplexity always better?
  – The true (optimal) model for the data has the lowest possible perplexity
  – The lower the perplexity, the closer we are to the true model
  – Typically, perplexity correlates well with speech recognition word error rate
    • It correlates better when both models are trained on the same data
    • It does not correlate well when the training data changes
  – The 20,000-word continuous speech recognition Wall Street Journal (WSJ) task has a perplexity of about 128-176 (trigram)
  – The 2,000-word conversational Air Travel Information System (ATIS) task has a perplexity of less than 20

Evaluation (6/7)
• The perplexity of bigram models with different vocabulary sizes (figure omitted)

Evaluation (7/7)
• A rough rule of thumb (recommended by Rosenfeld):
  – A reduction of 5% in perplexity is usually not practically significant
  – A 10%-20% reduction is noteworthy, and usually translates into some improvement in application performance
  – A perplexity improvement of 30% or more over a good baseline is quite significant
• However, perplexity cannot always reflect the difficulty of a speech recognition task; consider three tasks of recognizing 10 isolated words with IBM ViaVoice:
  Vocabulary                                                     Perplexity   WER (%)
  zero, one, two, three, four, five, six, seven, eight, nine     10           5
  john, tom, sam, bon, ron, susan, sharon, carol, laura, sarah   10           7
  bit, bite, boot, bait, bat, bet, beat, boat, burt, bart        10           9
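As a small illustration of the test-set perplexity defined above, the sketch below computes PP from per-word model probabilities in log space; the prob(word, history) callable is a stand-in for whatever n-gram model is being evaluated.

```python
import math

def perplexity(words, prob):
    """PP(W) = exp( -(1/m) * sum_i log P(w_i | history_i) ) over the test text."""
    log_sum = 0.0
    for i, w in enumerate(words):
        log_sum += math.log(prob(w, words[:i]))
    return math.exp(-log_sum / len(words))

# A uniform model over a 10-word digit vocabulary gives perplexity 10, as in the slides.
digits = "zero one two three four five six seven eight nine".split()
uniform = lambda w, h: 1.0 / len(digits)
print(perplexity(digits, uniform))  # 10.0 (up to floating-point rounding)
```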
Smoothing (1/3)
• The maximum likelihood (ML) estimates of language model probabilities were shown previously, e.g.:
  – Trigram probabilities: P_ML(z | xy) = C(xyz) / Σ_w C(xyw) = C(xyz) / C(xy)
  – Bigram probabilities: P_ML(y | x) = C(xy) / Σ_w C(xw) = C(xy) / C(x)

Smoothing (2/3)
• Data sparseness
  – Many possible events (word successions) in the test set may never be observed in the training set/data
  • E.g., with bigram modeling, if P(read | Mulan) = 0, then P("Mulan read a book") = 0, so P(W) = 0 and P(X | W) P(W) = 0
  – Whenever a string W with P(W) = 0 occurs during a speech recognition task, an error will be made

Smoothing (3/3)
• Smoothing
  – Assigns all strings (events/word successions) a nonzero probability, even if they never occur in the training data
  – Tends to make distributions flatter, by adjusting low probabilities upward and high probabilities downward

Smoothing: Simple Models
• Add-one smoothing
  – For example, pretend each trigram occurs once more often than it actually does:
    P_smooth(z | xy) = (C(xyz) + 1) / Σ_w (C(xyw) + 1) = (C(xyz) + 1) / (C(xy) + V),   where V is the vocabulary size
• Add-delta smoothing
    P_smooth(z | xy) = (C(xyz) + δ) / (C(xy) + δV)
• Both work badly in practice; do not use them

Smoothing: Back-Off Models
• The general form of n-gram back-off:
    P_smooth(w_i | w_{i-n+1}, ..., w_{i-1}) =
      P̂(w_i | w_{i-n+1}, ..., w_{i-1})                                      if C(w_{i-n+1}, ..., w_{i-1}, w_i) > 0   (discounted n-gram estimate)
      α(w_{i-n+1}, ..., w_{i-1}) P_smooth(w_i | w_{i-n+2}, ..., w_{i-1})     if C(w_{i-n+1}, ..., w_{i-1}, w_i) = 0   (smoothed (n-1)-gram)
  – α(w_{i-n+1}, ..., w_{i-1}) is a normalizing/scaling factor chosen to make the conditional probabilities sum to 1, i.e., Σ_{w_i} P_smooth(w_i | w_{i-n+1}, ..., w_{i-1}) = 1; for example,
    α(w_{i-n+1}, ..., w_{i-1}) = [ 1 − Σ_{w_i: C(w_{i-n+1}, ..., w_{i-1}, w_i) > 0} P̂(w_i | w_{i-n+1}, ..., w_{i-1}) ] / [ Σ_{w_j: C(w_{i-n+1}, ..., w_{i-1}, w_j) = 0} P_smooth(w_j | w_{i-n+2}, ..., w_{i-1}) ]

Smoothing: Interpolated Models
• The general form of interpolated n-gram smoothing:
    P_smooth(w_i | w_{i-n+1}, ..., w_{i-1}) = λ(w_{i-n+1}, ..., w_{i-1}) P_ML(w_i | w_{i-n+1}, ..., w_{i-1}) + [ 1 − λ(w_{i-n+1}, ..., w_{i-1}) ] P_smooth(w_i | w_{i-n+2}, ..., w_{i-1})
    where P_ML(w_i | w_{i-n+1}, ..., w_{i-1}) = C(w_{i-n+1}, ..., w_{i-1}, w_i) / C(w_{i-n+1}, ..., w_{i-1})
• The key differences between back-off and interpolated models:
  – For n-grams with nonzero counts, interpolated models also use information from the lower-order distributions, while back-off models do not
  – Moreover, in interpolated models, n-grams with the same counts can have different probability estimates

Caching (1/2)
• The basic idea of caching is to accumulate the n-grams dictated so far in the current document/conversation and use them to build a dynamic n-gram model
• Trigram interpolated with a unigram cache:
    P_cache(z | xy) = λ P_smooth(z | xy) + (1 − λ) P_cache(z | history)
    P_cache(z | history) = C(z in history) / length(history) = C(z in history) / Σ_w C(w in history)
    where "history" is the document/conversation dictated so far
• Trigram interpolated with a bigram cache:
    P_cache(z | xy) = λ P_smooth(z | xy) + (1 − λ) P_cache(z | y, history)
    P_cache(z | y, history) = C(yz in history) / C(y in history)
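A minimal sketch of the trigram-plus-unigram-cache interpolation above, assuming a pre-trained base trigram model is available through a function passed in as base_trigram_prob (the class and argument names are placeholders); only the cache bookkeeping is shown.

```python
from collections import Counter

class UnigramCache:
    """Accumulates the words dictated so far and interpolates a cache unigram with a
    base trigram model: P(z|xy) = lam * P_smooth(z|xy) + (1 - lam) * C(z)/|history|."""
    def __init__(self, base_trigram_prob, lam=0.9):
        self.base = base_trigram_prob
        self.lam = lam
        self.counts = Counter()
        self.length = 0

    def observe(self, word):
        """Call once for each word the user has dictated so far."""
        self.counts[word] += 1
        self.length += 1

    def prob(self, z, x, y):
        cache_p = self.counts[z] / self.length if self.length else 0.0
        return self.lam * self.base(z, x, y) + (1 - self.lam) * cache_p

# Usage with a stand-in base model (uniform over a 1000-word toy vocabulary).
base = lambda z, x, y: 1.0 / 1000
cache_lm = UnigramCache(base, lam=0.8)
for w in "i swear to tell the truth".split():
    cache_lm.observe(w)
print(cache_lm.prob("truth", "tell", "the"))
```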
Caching (2/2)
• Real life of caching
  – Someone says "I swear to tell the truth", and the cache remembers it!
  – The system hears "I swerve to smell the soup"
  – Someone then says "the whole truth"; with the cache, the system hears "the toll booth", and the errors are locked in
• Caching works well when users correct as they go; it works poorly, or even hurts, without corrections
  (Adopted from Joshua Goodman's public presentation file)

LM Integrated into Speech Recognition
• Theoretically:
    Ŵ = argmax_W P(W) P(X | W)
• Practically, the language model is a better predictor, while the acoustic probabilities are not "real" probabilities; word insertions also need to be penalized:
    Ŵ = argmax_W P(W)^α P(X | W) β^{length(W)}
    where α (the language model weight) and β (the word insertion penalty) can be empirically decided
  • E.g., α ≈ 8

n-Gram Language Model Adaptation (1/4)
• Count merging
  – The n-gram conditional probabilities for a given history form a multinomial distribution (one such distribution over the vocabulary for each possible n-gram history)
    • The parameters are given independent Dirichlet prior distributions with appropriate hyperparameters
  – The MAP estimate is obtained from the posterior distribution of the parameters

n-Gram Language Model Adaptation (2/4)
• Count merging (cont.)
  – Maximize the posterior distribution of the parameters
  – Differentiate it with respect to each parameter under the sum-to-one constraint (Lagrange multiplier)

n-Gram Language Model Adaptation (3/4)
• Count merging (cont.)
  – Parameterization of the prior distribution (I): hyperparameters tied to the background-corpus counts
  – The resulting adaptation formula for count merging pools the background-corpus counts C_B and the adaptation-corpus counts C_A, e.g.:
    P_CM(w | h) = [ C_B(h, w) + β C_A(h, w) ] / [ C_B(h) + β C_A(h) ]
    where β controls the weight given to the adaptation corpus

n-Gram Language Model Adaptation (4/4)
• Model interpolation
  – Parameterization of the prior distribution (II): hyperparameters tied to the background model's probabilities
  – The resulting adaptation formula is a linear interpolation of the background model P_B and the adaptation model P_A, e.g.:
    P_MI(w | h) = λ P_B(w | h) + (1 − λ) P_A(w | h)
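A small sketch contrasting the two adaptation schemes above, using bigram counts from a background corpus and an in-domain adaptation corpus; the weighting values, example counts, and function names are illustrative assumptions, not figures from the slides.

```python
def count_merging(cB_hw, cB_h, cA_hw, cA_h, beta=5.0):
    """Count merging: background and (weighted) adaptation counts are pooled."""
    return (cB_hw + beta * cA_hw) / (cB_h + beta * cA_h)

def model_interpolation(pB, pA, lam=0.7):
    """Model interpolation: background and adaptation models are mixed at the probability level."""
    return lam * pB + (1 - lam) * pA

# Example: history h = "stock", word w = "market" (toy counts).
cB_hw, cB_h = 40, 10000   # background counts C_B(h,w), C_B(h)
cA_hw, cA_h = 12, 300     # in-domain adaptation counts C_A(h,w), C_A(h)
print(count_merging(cB_hw, cB_h, cA_hw, cA_h))
print(model_interpolation(cB_hw / cB_h, cA_hw / cA_h))
```

Count merging lets histories that are frequent in the adaptation data dominate for those histories only, while model interpolation shifts every estimate toward the adaptation model by the same global weight.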
Known Weaknesses of Current n-Gram LMs
• Brittleness across domains
  – Current language models are extremely sensitive to changes in the style or topic of the text on which they are trained
    • E.g., conversations vs. news broadcasts, fiction vs. politics
  – Language model adaptation can help:
    • In-domain or contemporary text corpora/speech transcripts
    • Static or dynamic adaptation
    • Local contextual (n-gram) or global semantic/topical information
• False independence assumption
  – In order to remain trainable, n-gram modeling assumes that the probability of the next word in a sentence depends only on the identities of the last n-1 words
    • i.e., (n-1)th-order Markov modeling

Conclusions
• Statistical language modeling has been demonstrated to be an effective probabilistic framework for NLP, ASR, and IR-related applications
• Many issues remain to be solved for statistical language modeling, e.g.:
  – Unknown word (or spoken term) detection
  – Discriminative training of language models
  – Adaptation of language models across different domains and genres
  – Fusion of various (or different levels of) features for language modeling
    • Positional information?
    • Rhetorical (structural) information?

References
• J.R. Bellegarda. Statistical language model adaptation: review and perspectives. Speech Communication 42(1), 93-108, 2004
• X. Liu, W.B. Croft. Statistical language modeling for information retrieval. Annual Review of Information Science and Technology 39, Chapter 1, 2005
• R. Rosenfeld. Two decades of statistical language modeling: where do we go from here? Proceedings of the IEEE, August 2000
• J. Goodman. A bit of progress in language modeling, extended version. Microsoft Research Technical Report MSR-TR-2001-72, 2001
• H.S. Chiu, B. Chen. Word topical mixture models for dynamic language model adaptation. ICASSP 2007
• J.W. Kuo, B. Chen. Minimum word error based discriminative training of language models. Interspeech 2005
• B. Chen, H.M. Wang, L.S. Lee. Spoken document retrieval and summarization. Advances in Chinese Spoken Language Processing, Chapter 13, 2006

Maximum Likelihood Estimate (MLE) for n-Grams (1/2)
• Given a training corpus T (a token sequence w^(1) w^(2) ... w^(L)) over a vocabulary W = {w_1, w_2, ..., w_V}, and a language model with parameters θ_{h,w_i} = p(w_i | h), the n-grams with the same history h are collected together:
    p(T) = ∏_{k} p(w^(k) | history of w^(k)) = ∏_{h} ∏_{w_i} θ_{h,w_i}^{N(h, w_i)},   subject to Σ_{w_j} θ_{h,w_j} = 1 for every history h
  – Essentially, the sample counts N(h, w_i) sharing the same history h follow a multinomial (polynomial) distribution:
    P( N(h, w_1), ..., N(h, w_V) ) = [ N_h! / ∏_{w_i} N(h, w_i)! ] ∏_{w_i} θ_{h,w_i}^{N(h, w_i)},
    where Σ_{w_i} N(h, w_i) = N_h, and N(h, w_i) = C(h w_i) and N_h = C(h) are the counts observed in the corpus T
  – E.g., given a corpus containing "... 陳水扁 總統 訪問 美國 紐約 ... 陳水扁 總統 在 巴拿馬 表示 ..." ("... President Chen Shui-bian visited New York, USA ... President Chen Shui-bian stated in Panama ..."), what is the bigram estimate of P(總統 | 陳水扁)?

Maximum Likelihood Estimate (MLE) for n-Grams (2/2)
• Taking the logarithm of p(T), we have
    log p(T) = Σ_{h} Σ_{w_i} N(h, w_i) log θ_{h,w_i}
• For every history h, maximize this subject to Σ_{w_j} θ_{h,w_j} = 1 with a Lagrange multiplier l_h (using d log x / dx = 1/x):
    Φ_h = Σ_{w_i} N(h, w_i) log θ_{h,w_i} + l_h ( Σ_{w_j} θ_{h,w_j} − 1 )
    ∂Φ_h / ∂θ_{h,w_i} = N(h, w_i) / θ_{h,w_i} + l_h = 0   ⟹   θ_{h,w_i} = −N(h, w_i) / l_h
    Summing over all words and applying the constraint gives l_h = −Σ_{w_s} N(h, w_s) = −N_h, so
    θ̂_{h,w_i} = N(h, w_i) / N_h = C(h w_i) / C(h)
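The derivation above ends in simple relative-frequency counting. A minimal sketch of that closed-form estimate for bigrams, assuming whitespace-tokenized training text (the toy corpus reuses the example tokens from the slide, without the elided context):

```python
from collections import Counter

def train_bigram_mle(tokens):
    """MLE bigram estimates: P(w | h) = C(h w) / C(h), the closed form derived above."""
    bigrams = Counter(zip(tokens[:-1], tokens[1:]))
    histories = Counter(tokens[:-1])  # history counts C(h)
    return {(h, w): c / histories[h] for (h, w), c in bigrams.items()}

corpus = "陳水扁 總統 訪問 美國 紐約 陳水扁 總統 在 巴拿馬 表示".split()
p = train_bigram_mle(corpus)
print(p[("陳水扁", "總統")])  # 1.0: "總統" always follows "陳水扁" in this toy corpus
```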