NOISY CHANNEL MODEL
Starting at this point, we need to be able to model the target language.
Speech recognition:
- Observe: an acoustic signal A = a1,…,an
- Challenge: find the most likely word sequence
- We also have to consider the context, which is the job of language modeling

RESOLVING WORD AMBIGUITIES
Description: after determining word boundaries, the speech recognition process matches an array of possible word sequences against the spoken audio.
Issues to consider:
- Determine the intended word sequence
- Resolve grammatical and pronunciation errors
Implementation:
- Establish word sequence probabilities
- Use existing corpora
- Train the program with run-time data

RECOGNIZER ISSUES
Problem: our recognizer translates the audio into a possible string of text. How do we know the translation is correct?
Problem: how do we handle a string of text containing words that are not in the dictionary?
Problem: how do we handle strings of valid words that do not form sentences whose semantics make sense?

CORRECTING RECOGNIZER AMBIGUITIES
Problem: resolving words not in the dictionary
Question: how different is a recognized word from the words that are in the dictionary?
Solution: count the single-step transformations necessary to convert one word into another.
Example: caat -> cat with the removal of one letter
Example: fplc -> fireplace requires adding the letters ire after f, an a before c, and an e at the end

WORD PREDICTION APPROACHES
Simple vs. smart:
Simple: every word follows every other word with equal probability (0-gram)
- Assume |V| is the size of the vocabulary
- Likelihood of a sentence S of length n is 1/|V| × 1/|V| × … × 1/|V| (n times)
- If English has 100,000 words, the probability of each next word is 1/100,000 = .00001
Smarter: the probability of each next word is related to its word frequency
- Likelihood of sentence S = P(w1) × P(w2) × … × P(wn)
- Assumes the probability of each word is independent of the probabilities of the other words
Even smarter: look at the probability given the previous word
- Likelihood of sentence S = P(w1) × P(w2|w1) × … × P(wn|wn-1)
- Assumes the probability of each word depends on the word before it

COUNTS
What's the probability of "canine"?
What's the probability of "canine tooth", i.e., P(tooth | canine)?
What's the probability of "canine companion"?
P(tooth | canine) = P(canine & tooth) / P(canine)
Sometimes we can use counts to deduce probabilities. Example, according to Google:
- "canine" occurs 1,750,000 times
- "canine tooth" occurs 6,280 times
- P(tooth | canine) = 6280/1,750,000 ≈ .0036
- P(companion | canine) = .01
So "companion" is the more likely next word after "canine".
Detecting likely word sequences using counts / table look-up.

SINGLE WORD PROBABILITIES
Single word probability: compute the likelihood P([ni]|w), then multiply by P(w).

Word   P(O|w)   P(w)       P(O|w)P(w)
new    .36      .001       .00036      P([ni]|new)P(new)
neat   .52      .00013     .000068     P([ni]|neat)P(neat)
need   .11      .00056     .000062     P([ni]|need)P(need)
knee   1.00     .000024    .000024     P([ni]|knee)P(knee)

Limitation: this ignores context.
We might need to factor in the surrounding words: use P(need | I) instead of just P(need).
Note: P(new | I) < P(need | I)
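To make the table above concrete, here is a minimal sketch (Python; the probability values are taken from the slide and treated as given assumptions) of how a recognizer would pick the word that maximizes P(O|w) × P(w) for the observation O = [ni]:

```python
# Minimal sketch: choose the word w that maximizes P(O|w) * P(w) for the
# observation O = [ni], using the values from the table above as assumptions.

candidates = {
    # word: (P(O|w), P(w))
    "new":  (0.36, 0.001),
    "neat": (0.52, 0.00013),
    "need": (0.11, 0.00056),
    "knee": (1.00, 0.000024),
}

def best_word(cands):
    """Return the word with the highest combined score P(O|w) * P(w)."""
    return max(cands, key=lambda w: cands[w][0] * cands[w][1])

for w, (p_o_w, p_w) in candidates.items():
    print(f"{w:5s}  P(O|w)P(w) = {p_o_w * p_w:.6f}")
print("best guess:", best_word(candidates))  # "new" when context is ignored
```

Replacing the unigram prior P(w) with a context-conditioned prior such as P(w | I) would shift the decision to "need", as the slide's note P(new | I) < P(need | I) indicates.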
APPLICATION EXAMPLE
What is the most likely word sequence?
Observed phone sequences: ik-'spen-siv  'pre-z&ns  'bot
Candidate words for each phone sequence:
- ik-'spen-siv: excessive, expensive, expressive, inactive
- 'pre-z&ns: presidents, presence, presents, press
- 'bot: bald, bold, bought, boat
Bigram probabilities linking candidates in adjacent positions determine the most likely word sequence.

PROBABILITY CHAIN RULE
Conditional probability: P(A1,A2) = P(A1) · P(A2|A1)
The chain rule generalizes to multiple events:
P(A1,…,An) = P(A1) P(A2|A1) P(A3|A1,A2) … P(An|A1…An-1)
Examples:
- P(the dog) = P(the) P(dog | the)
- P(the dog bites) = P(the) P(dog | the) P(bites | the dog)
Conditional probabilities work better than individual relative word frequencies because they consider the context:
- 'dog' may be a relatively rare word in a corpus
- But if we see 'barking', P(dog | barking) is much more likely
In general, the probability of a complete string of words w1…wn is:
P(w1…wn) = P(w1) P(w2|w1) P(w3|w1w2) … P(wn|w1…wn-1), i.e., the product over k = 1,…,n of P(wk | w1…wk-1)
Detecting likely word sequences using probabilities.

N-GRAMS
How many previous words should we consider?
- 0-gram: every word's likelihood is equal. Each word of a 300,000-word vocabulary has probability 1/300,000 ≈ .0000033
- Uni-gram: a word's likelihood depends on its frequency count, P(w). The word 'the' occurs 69,971 times in the 1,000,000-word Brown corpus, so P(the) ≈ .07; 'rabbit' has frequency ≈ .00001
- Bi-gram: word likelihood is determined by the previous word, P(wn | wn-1). 'Rabbit' is a more likely word to follow 'white' than 'the' is, despite 'the' being far more frequent overall
- Tri-gram: word likelihood is determined by the previous two words, P(wn | wn-2 wn-1)
N-gram: a model of word or phoneme prediction that uses the previous N-1 words or phonemes to predict the next

APPROXIMATING SHAKESPEARE
Generating sentences with random unigrams:
- Every enter now severally so, let
- Hill he late speaks; or! a more to leg less first you enter
With bigrams:
- What means, sir. I confess she? then all sorts, he is trim, captain.
- Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry.
Trigrams:
- Sweet prince, Falstaff shall die.
- This shall forbid it should be branded, if renown made it empty.

N-GRAM CONCLUSIONS
Quadrigrams (the output now reads like Shakespeare):
- What! I will go seek the traitor Gloucester.
- Will you not tell me who I am?
Comments:
- The accuracy of an n-gram model increases with increasing n because word combinations become more and more constrained
- Higher-order n-gram models are more and more sparse: Shakespeare produced only 0.04% of the 844 million possible bigrams
- There is a tradeoff between accuracy and computational overhead and memory requirements

LANGUAGE MODELING: N-GRAMS
Unigrams (SWB):
• Most common: "I", "and", "the", "you", "a"
• Rank 100: "she", "an", "going"
• Least common: "Abraham", "Alastair", "Acura"
Bigrams (SWB):
• Most common: "you know", "yeah SENT!", "!SENT um-hum", "I think"
• Rank 100: "do it", "that we", "don't think"
• Least common: "raw fish", "moisture content", "Reagan Bush"
Trigrams (SWB):
• Most common: "!SENT um-hum SENT!", "a lot of", "I don't know"
• Rank 100: "it was a", "you know that"
• Least common: "you have parents", "you seen Brooklyn"

N-GRAMS FOR SPELL CHECKS
- Non-word detection (easiest). Example: graffe => giraffe
- Isolated-word (context-free) error correction. A correction is not possible when the error word is itself in the dictionary
- Context-dependent correction (hardest). Example: your an idiot => you're an idiot (the mistyped word happens to be a real word)

EXAMPLE
Misspelled word: acress
Candidate corrections are considered along with their probability of use P(c) and their probability of use within the surrounding context (context × P(c)).
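Candidate corrections like those above are typically within a single edit of the typo. Here is a minimal sketch (not the course's own code; the tiny dictionary is assumed purely for illustration) that enumerates the dictionary words one edit away from "acress":

```python
import string

def one_edit_candidates(word, dictionary):
    """Return the dictionary words reachable from `word` by one deletion,
    adjacent transposition, substitution, or insertion."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {L + R[1:] for L, R in splits if R}
    transposes = {L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1}
    substitutes = {L + c + R[1:] for L, R in splits if R for c in letters}
    inserts = {L + c + R for L, R in splits for c in letters}
    return (deletes | transposes | substitutes | inserts) & dictionary

# Tiny dictionary assumed purely for illustration.
dictionary = {"actress", "cress", "caress", "access", "across", "acres"}
print(sorted(one_edit_candidates("acress", dictionary)))
# -> ['access', 'acres', 'across', 'actress', 'caress', 'cress']
```

Each returned candidate corresponds to one of the single-step edits discussed on the next slide; ranking them still requires the error and context probabilities described there.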
WHICH CORRECTION IS MOST LIKELY?
Misspelled word: acress
Word frequency percentage alone is not enough; we need P(typo | candidate) × P(candidate).
How likely is the particular error? Each candidate implies a different edit:
- Deletion of a t after a c and before an r
- Insertion of an a at the beginning
- Transposition of a c and an a
- Substitution of a c for an r
- Substitution of an o for an e
- Insertion of an s before the last s, or after the last s
We also need the context of the word within the sentence or paragraph.

COMMON SPELLING ERRORS
A spell check that does not consider context will fail on errors that are themselves real words:
- They are leaving in about fifteen minuets
- The study was conducted manly be John Black.
- The design an construction of the system will take more than a year.
- Hopefully, all with continue smoothly in my absence.
- Can they lave him my messages?
- I need to notified the bank of….
- He is trying to fine out.
Difficulty: detecting grammatical errors or nonsensical expressions.

THE SPARSE DATA PROBLEM
Definitions:
- Maximum likelihood: finding the most probable sequence of tokens based on the context of the input
- N-gram sequence: a sequence of n words whose context speech algorithms consider
- Training data: a group of probabilities computed from a corpus of text data
- Sparse data problem: how should algorithms handle n-grams that have very low probabilities?
Problem 1: Low-frequency n-grams
- Data sparseness is a frequently occurring problem; algorithms will make incorrect decisions if it is not handled
- Assume n-gram x occurs twice and n-gram y occurs once. Is x really twice as likely to occur as y?
Problem 2: Zero counts
- Probabilities compute to zero for n-grams not seen in the corpus
- If n-gram y does not occur, should its probability be zero?

SMOOTHING
Definition: a corpus is a collection of written or spoken material in machine-readable form.
Smoothing: an algorithm that redistributes the probability mass.
Discounting: reduces the probabilities of n-grams with non-zero counts to accommodate the n-grams with zero counts (those unseen in the corpus).
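Before the smoothing techniques that follow, a minimal sketch (Python; the toy corpus is assumed for illustration only) of how maximum-likelihood bigram estimates behave on sparse data: a once-seen bigram looks half as likely as a twice-seen one, and an unseen bigram gets probability zero.

```python
from collections import Counter

# Toy training corpus (assumed for illustration only).
corpus = "i want chinese food . i want lunch . i need food .".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def mle_bigram(w_prev, w):
    """Maximum-likelihood estimate P(w | w_prev) = c(w_prev, w) / c(w_prev)."""
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev] if unigram_counts[w_prev] else 0.0

print(mle_bigram("i", "want"))     # 0.667 -- seen twice out of three
print(mle_bigram("i", "need"))     # 0.333 -- seen once: is it really half as likely?
print(mle_bigram("want", "food"))  # 0.0   -- never seen, though clearly possible
```

Smoothing and discounting, described next, redistribute some probability mass so that the last case is no longer zero.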
ADD-ONE SMOOTHING
The naïve smoothing technique:
- Add one to the count of all seen and unseen n-grams
- Add the total increased count to the probability mass
Example: uni-grams
- Un-smoothed probability for word w: P(w) = c(w) / N
- Add-one revised probability for word w: P+1(w) = (c(w) + 1) / (N + V)
where N = number of words encountered, V = vocabulary size, and c(w) = number of times word w was encountered

ADD-ONE BI-GRAM SMOOTHING EXAMPLE
Note: this example assumes bi-gram counts and a vocabulary of V = 1616 words.
Note: each table row gives the number of times the word in a column precedes the word on the left, or starts a sentence.
Un-smoothed: P(wn | wn-1) = C(wn-1wn) / C(wn-1), e.g., 1087/3437 = .3163
Add-one: P+1(wn | wn-1) = [C(wn-1wn) + 1] / [C(wn-1) + V], e.g., 1088/(3437 + 1616) = .2153
Note: C(I)=3437, C(want)=1215, C(to)=3256, C(eat)=938, C(Chinese)=213, C(food)=1506, C(lunch)=459

EVALUATION OF ADD-ONE SMOOTHING
Advantage:
- Simple technique to implement and understand
Disadvantages:
- Too much probability mass moves to the unseen n-grams
- Underestimates the probabilities of the common n-grams
- Overestimates the probabilities of rare (or unseen) n-grams
- Relative smoothing of all unseen n-grams is the same
- Relative smoothing of rare n-grams is still incorrect
Alternative: use a smaller add value
- Disadvantage: this does not fully solve the problem

ADD-ONE BI-GRAM DISCOUNTING
Original counts c(wi-1 wi) are revised to c'(wi-1 wi) = (c(wi-1 wi) + 1) × N / (N + V), where N = c(wi-1) and V = vocabulary size = 1616
Unigram counts: C(I)=3437, C(want)=1215, C(to)=3256, C(eat)=938, C(Chinese)=213, C(food)=1506, C(lunch)=459
Note: high counts are reduced by approximately a third in this example, e.g., a count of 1087 becomes (1087 + 1) × 3437/(3437 + 1616) ≈ 740
Note: low counts get larger
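A minimal sketch (Python; the counts and V = 1616 are taken from the example above) of the add-one probability and the corresponding discounted count:

```python
def add_one_prob(c_bigram, c_prev, V):
    """Add-one (Laplace) bigram probability: (C(w_{n-1}w_n) + 1) / (C(w_{n-1}) + V)."""
    return (c_bigram + 1) / (c_prev + V)

def add_one_discounted_count(c_bigram, c_prev, V):
    """Revised count c' = (c + 1) * N / (N + V), with N = C(w_{n-1})."""
    return (c_bigram + 1) * c_prev / (c_prev + V)

V = 1616  # vocabulary size from the slide example

print(add_one_prob(1087, 3437, V))              # ~0.2153 (vs. .3163 un-smoothed)
print(add_one_discounted_count(1087, 3437, V))  # ~740: a high count shrinks by about a third
print(add_one_discounted_count(0, 3437, V))     # ~0.68: a zero count becomes non-zero
```

The output illustrates the complaint in the evaluation slide: a large amount of probability mass is shifted from frequent bi-grams onto the many unseen ones.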
UNIGRAM WITTEN-BELL DISCOUNTING
Add probability mass to un-encountered words; discount the rest.
Compute the probability of a first-time encounter of a new word:
- Note: every one of the O observed words had a first encounter
- Unseen words: U = V - O
- Answer: P(any newly encountered word) = O / (V + O)
Equally distribute this probability across all unobserved words:
- P(any specific newly encountered word) = 1/U × O/(V + O)
- Adjusted count = V × 1/U × O/(V + O)
Discount each encountered word wi to preserve probability space:
- Probability: from counti / V to counti / (V + O)
- Discounted count: from counti to counti × V/(V + O)
O = observed words, U = words never seen, V = corpus vocabulary words

BI-GRAM WITTEN-BELL DISCOUNTING
Add probability mass to un-encountered bi-grams; discount the rest.
Consider bi-grams starting with wn-1 and compute the probability of a first-time encounter of a new bi-gram starting with wn-1:
- Answer: P(any newly encountered bi-gram) = O(wn-1) / (N(wn-1) + O(wn-1))
- Note: we observed O(wn-1) distinct bi-grams among N(wn-1) + O(wn-1) events
- Note: an event is either a bi-gram occurrence or a first-time encounter
Divide this probability among all unseen bi-grams starting with wn-1:
- O(wn-1) = number of uniquely observed bi-grams starting with wn-1
- N(wn-1) = count of bi-grams starting with wn-1
- U(wn-1) = number of un-observed bi-grams starting with wn-1
- Adjusted count = N(wn-1) × 1/U(wn-1) × O(wn-1)/(N(wn-1) + O(wn-1))
Discount the observed bi-grams starting with wn-1:
- Counts: from c(wn-1 wn) to c(wn-1 wn) × N(wn-1)/(N(wn-1) + O(wn-1))
O = observed bi-grams, U = bi-grams never seen, V = corpus vocabulary bi-grams
N, O, and U values are on the next slide.

WITTEN-BELL SMOOTHING
Note: N, O, and U refer to counts for wn-1.
Original counts: c(wn-1 wn)
Adjusted add-one counts: c'(wn-1 wn) = (c(wn-1 wn) + 1) × Nn-1 / (Nn-1 + V)
Adjusted Witten-Bell counts:
- If c(wn-1 wn) = 0: c'(wn-1 wn) = On-1/Un-1 × Nn-1/(On-1 + Nn-1), e.g., 76/1540 × 1215/(1215 + 76) = .0464
- Otherwise: c'(wn-1 wn) = c(wn-1 wn) × Nn-1/(On-1 + Nn-1), e.g., 1087 × 3437/(3437 + 89) = 1059.563

BI-GRAM COUNTS FOR EXAMPLE
O(wn-1) = number of observed bi-grams starting with wn-1
N(wn-1) = count of bi-grams starting with wn-1
U(wn-1) = number of un-observed bi-grams starting with wn-1

wn-1      O(wn-1)   U(wn-1)   N(wn-1)
I         89        1,521     3,437
Want      76        1,540     1,215
To        130       1,486     3,256
Eat       124       1,492     938
Chinese   20        1,596     213
Food      82        1,534     1,506
Lunch     45        1,571     459

EVALUATION OF WITTEN-BELL
- Uses the probability of already-encountered grams to compute probabilities for unseen grams
- Has a smaller impact on the probabilities of already-encountered grams
- Generally computes reasonable probabilities

BACK-OFF DISCOUNTING
The general concept:
- Goal: use a hierarchy of approximations: trigram > bigram > unigram
- Consider the trigram (wn-2, wn-1, wn)
- If c(wn-2, wn-1, wn) = 0, consider the 'back-off' bi-gram (wn-1, wn)
- If c(wn-1, wn) = 0, consider the 'back-off' unigram wn
- Degrade gracefully when higher-level grams don't exist
Given a word sequence fragment … wn-2 wn-1 wn, use the following preference rule (a minimal sketch appears at the end of this section):
1. P(wn | wn-2 wn-1) if c(wn-2 wn-1 wn) > 0
2. α1 P(wn | wn-1) if c(wn-1 wn) > 0
3. α2 P(wn)
Note: α1 and α2 are values carefully computed to preserve probability mass.

SENONE MODEL
Definition: a cluster of similar Markov states
Goal: reduce the number of trainable units that the recognizer needs to process
Approach:
- HMMs represent sub-phonetic units
- A tree structure combines the sub-phonetic units
- The phoneme recognizer searches the tree to find HMMs
- Nodes partition with questions about neighboring phones, such as: Is the left phone sonorant or nasal? Is the right phone a back-R? Is the right phone voiced? Is the left phone a back-L? Is the left phone s, z, sh, or zh?
Performance:
- Triphones reduce the error rate by 15%
- Senones reduce the error rate by 24%
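As referenced in the BACK-OFF DISCOUNTING slide, here is a minimal sketch (Python) of the preference rule. The weights alpha1 and alpha2 are placeholder assumptions; in a real back-off model they are computed to preserve the probability mass, as the slide notes.

```python
def backoff_prob(w, w1, w2, tri, bi, uni, alpha1=0.4, alpha2=0.4):
    """Back-off preference rule from the slide: use the trigram estimate when its
    count is nonzero, otherwise the weighted bigram, otherwise the weighted unigram.
    tri/bi/uni are count dictionaries; alpha1 and alpha2 are placeholder weights."""
    if tri.get((w2, w1, w), 0) > 0 and bi.get((w2, w1), 0) > 0:
        return tri[(w2, w1, w)] / bi[(w2, w1)]
    if bi.get((w1, w), 0) > 0 and uni.get(w1, 0) > 0:
        return alpha1 * bi[(w1, w)] / uni[w1]
    total = sum(uni.values())
    return alpha2 * uni.get(w, 0) / total if total else 0.0

# Toy counts (assumed) just to exercise each branch of the rule.
uni = {"the": 3, "dog": 2, "barks": 1}
bi = {("the", "dog"): 2, ("dog", "barks"): 1}
tri = {("the", "dog", "barks"): 1}

print(backoff_prob("barks", "dog", "the", tri, bi, uni))  # trigram branch: 1/2
print(backoff_prob("dog", "the", "a", tri, bi, uni))      # bigram back-off: 0.4 * 2/3
print(backoff_prob("cat", "the", "a", tri, bi, uni))      # unigram back-off: 0.0
```

The sketch degrades gracefully exactly as the preference rule describes: it only falls back to a lower-order estimate when the higher-order count is zero.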