NOISY CHANNEL MODEL
Starting at this point, we need to be able to model the target language.
Speech recognition:
- Observe: an acoustic signal A = a1,…,an
- Challenge: find the most likely word sequence
- We also have to consider the context, which is the job of language modeling

RESOLVING WORD AMBIGUITIES
Description: after determining word boundaries, the speech recognition process matches an array of possible word sequences against the spoken audio.
Issues to consider:
- Determine the intended word sequence
- Resolve grammatical and pronunciation errors
Implementation:
- Establish word sequence probabilities
- Use existing corpora
- Train the program with run-time data

RECOGNIZER ISSUES
Problem: our recognizer translates the audio into a possible string of text. How do we know the translation is correct?
Problem: how do we handle a string of text containing words that are not in the dictionary?
Problem: how do we handle strings of valid words that do not form sentences whose semantics make sense?

CORRECTING RECOGNIZER AMBIGUITIES
Problem: resolving words not in the dictionary
Question: how different is a recognized word from the words that are in the dictionary?
Solution: count the single-step transformations necessary to convert one word into another.
Example: caat -> cat with the removal of one letter
Example: fplc -> fireplace requires adding the letters ire after f, an a before c, and an e at the end

WORD PREDICTION APPROACHES
Simple vs. smart:
Simple: every word follows every other word with equal probability (0-gram)
- Assume |V| is the size of the vocabulary
- Likelihood of a sentence S of length n is 1/|V| × 1/|V| × … × 1/|V| (n times)
- If English has 100,000 words, the probability of each next word is 1/100,000 = .00001
Smarter: the probability of each next word is related to its word frequency
- Likelihood of sentence S = P(w1) × P(w2) × … × P(wn)
- Assumes the probability of each word is independent of the probabilities of the other words
Even smarter: look at the probability given the previous word
- Likelihood of sentence S = P(w1) × P(w2|w1) × … × P(wn|wn-1)
- Assumes the probability of each word depends on the word before it

COUNTS
What's the probability of "canine"?
What's the probability of "canine tooth", i.e., P(tooth | canine)?
What's the probability of "canine companion"?
P(tooth | canine) = P(canine & tooth) / P(canine)
Sometimes we can use counts to deduce probabilities. Example, according to Google:
- "canine" occurs 1,750,000 times
- "canine tooth" occurs 6,280 times
- P(tooth | canine) = 6280/1,750,000 ≈ .0036
- P(companion | canine) = .01
So "companion" is the more likely next word after "canine".
Detecting likely word sequences using counts / table look-up.

SINGLE WORD PROBABILITIES
Single word probability: compute the likelihood P([ni]|w), then multiply by P(w).

Word   P(O|w)   P(w)       P(O|w)P(w)
new    .36      .001       .00036      P([ni]|new)P(new)
neat   .52      .00013     .000068     P([ni]|neat)P(neat)
need   .11      .00056     .000062     P([ni]|need)P(need)
knee   1.00     .000024    .000024     P([ni]|knee)P(knee)

Limitation: this ignores context.
We might need to factor in the surrounding words: use P(need | I) instead of just P(need).
Note: P(new | I) < P(need | I)
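To make the table above concrete, here is a minimal sketch (Python; the probability values are taken from the slide and treated as given assumptions) of how a recognizer would pick the word that maximizes P(O|w) × P(w) for the observation O = [ni]:

```python
# Minimal sketch: choose the word w that maximizes P(O|w) * P(w) for the
# observation O = [ni], using the values from the table above as assumptions.

candidates = {
    # word: (P(O|w), P(w))
    "new":  (0.36, 0.001),
    "neat": (0.52, 0.00013),
    "need": (0.11, 0.00056),
    "knee": (1.00, 0.000024),
}

def best_word(cands):
    """Return the word with the highest combined score P(O|w) * P(w)."""
    return max(cands, key=lambda w: cands[w][0] * cands[w][1])

for w, (p_o_w, p_w) in candidates.items():
    print(f"{w:5s}  P(O|w)P(w) = {p_o_w * p_w:.6f}")
print("best guess:", best_word(candidates))  # "new" when context is ignored
```

Replacing the unigram prior P(w) with a context-conditioned prior such as P(w | I) would shift the decision to "need", as the slide's note P(new | I) < P(need | I) indicates.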
APPLICATION EXAMPLE
What is the most likely word sequence?
Observed phone sequences: ik-'spen-siv  'pre-z&ns  'bot
Candidate words for each phone sequence:
- ik-'spen-siv: excessive, expensive, expressive, inactive
- 'pre-z&ns: presidents, presence, presents, press
- 'bot: bald, bold, bought, boat
Bigram probabilities linking candidates in adjacent positions determine the most likely word sequence.

PROBABILITY CHAIN RULE
Conditional probability: P(A1,A2) = P(A1) · P(A2|A1)
The chain rule generalizes to multiple events:
P(A1,…,An) = P(A1) P(A2|A1) P(A3|A1,A2) … P(An|A1…An-1)
Examples:
- P(the dog) = P(the) P(dog | the)
- P(the dog bites) = P(the) P(dog | the) P(bites | the dog)
Conditional probabilities work better than individual relative word frequencies because they consider the context:
- 'dog' may be a relatively rare word in a corpus
- But if we see 'barking', P(dog | barking) is much more likely
In general, the probability of a complete string of words w1…wn is:
P(w1…wn) = P(w1) P(w2|w1) P(w3|w1w2) … P(wn|w1…wn-1), i.e., the product over k = 1,…,n of P(wk | w1…wk-1)
Detecting likely word sequences using probabilities.

N-GRAMS
How many previous words should we consider?
- 0-gram: every word's likelihood is equal. Each word of a 300,000-word vocabulary has probability 1/300,000 ≈ .0000033
- Uni-gram: a word's likelihood depends on its frequency count, P(w). The word 'the' occurs 69,971 times in the 1,000,000-word Brown corpus, so P(the) ≈ .07; 'rabbit' has frequency ≈ .00001
- Bi-gram: word likelihood is determined by the previous word, P(wn | wn-1). 'Rabbit' is a more likely word to follow 'white' than 'the' is, despite 'the' being far more frequent overall
- Tri-gram: word likelihood is determined by the previous two words, P(wn | wn-2 wn-1)
N-gram: a model of word or phoneme prediction that uses the previous N-1 words or phonemes to predict the next

APPROXIMATING SHAKESPEARE
Generating sentences with random unigrams:
- Every enter now severally so, let
- Hill he late speaks; or! a more to leg less first you enter
With bigrams:
- What means, sir. I confess she? then all sorts, he is trim, captain.
- Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry.
Trigrams:
- Sweet prince, Falstaff shall die.
- This shall forbid it should be branded, if renown made it empty.

N-GRAM CONCLUSIONS
Quadrigrams (the output now reads like Shakespeare):
- What! I will go seek the traitor Gloucester.
- Will you not tell me who I am?
Comments:
- The accuracy of an n-gram model increases with increasing n because word combinations become more and more constrained
- Higher-order n-gram models are more and more sparse: Shakespeare produced only 0.04% of the 844 million possible bigrams
- There is a tradeoff between accuracy and computational overhead and memory requirements

LANGUAGE MODELING: N-GRAMS
Unigrams (SWB):
• Most common: "I", "and", "the", "you", "a"
• Rank 100: "she", "an", "going"
• Least common: "Abraham", "Alastair", "Acura"
Bigrams (SWB):
• Most common: "you know", "yeah SENT!", "!SENT um-hum", "I think"
• Rank 100: "do it", "that we", "don't think"
• Least common: "raw fish", "moisture content", "Reagan Bush"
Trigrams (SWB):
• Most common: "!SENT um-hum SENT!", "a lot of", "I don't know"
• Rank 100: "it was a", "you know that"
• Least common: "you have parents", "you seen Brooklyn"

N-GRAMS FOR SPELL CHECKS
- Non-word detection (easiest). Example: graffe => giraffe
- Isolated-word (context-free) error correction. A correction is not possible when the error word is itself in the dictionary
- Context-dependent correction (hardest). Example: your an idiot => you're an idiot (the mistyped word happens to be a real word)

EXAMPLE
Misspelled word: acress
Candidate corrections are considered along with their probability of use P(c) and their probability of use within the surrounding context (context × P(c)).
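Candidate corrections like those above are typically within a single edit of the typo. Here is a minimal sketch (not the course's own code; the tiny dictionary is assumed purely for illustration) that enumerates the dictionary words one edit away from "acress":

```python
import string

def one_edit_candidates(word, dictionary):
    """Return the dictionary words reachable from `word` by one deletion,
    adjacent transposition, substitution, or insertion."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {L + R[1:] for L, R in splits if R}
    transposes = {L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1}
    substitutes = {L + c + R[1:] for L, R in splits if R for c in letters}
    inserts = {L + c + R for L, R in splits for c in letters}
    return (deletes | transposes | substitutes | inserts) & dictionary

# Tiny dictionary assumed purely for illustration.
dictionary = {"actress", "cress", "caress", "access", "across", "acres"}
print(sorted(one_edit_candidates("acress", dictionary)))
# -> ['access', 'acres', 'across', 'actress', 'caress', 'cress']
```

Each returned candidate corresponds to one of the single-step edits discussed on the next slide; ranking them still requires the error and context probabilities described there.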
WHICH CORRECTION IS MOST LIKELY?
Misspelled word: acress
Word frequency percentage alone is not enough; we need P(typo | candidate) × P(candidate).
How likely is the particular error? Each candidate implies a different edit:
- Deletion of a t after a c and before an r
- Insertion of an a at the beginning
- Transposition of a c and an a
- Substitution of a c for an r
- Substitution of an o for an e
- Insertion of an s before the last s, or after the last s
We also need the context of the word within the sentence or paragraph.

COMMON SPELLING ERRORS
A spell check that does not consider context will fail on errors that are themselves real words:
- They are leaving in about fifteen minuets
- The study was conducted manly be John Black.
- The design an construction of the system will take more than a year.
- Hopefully, all with continue smoothly in my absence.
- Can they lave him my messages?
- I need to notified the bank of….
- He is trying to fine out.
Difficulty: detecting grammatical errors or nonsensical expressions.

THE SPARSE DATA PROBLEM
Definitions:
- Maximum likelihood: finding the most probable sequence of tokens based on the context of the input
- N-gram sequence: a sequence of n words whose context speech algorithms consider
- Training data: a group of probabilities computed from a corpus of text data
- Sparse data problem: how should algorithms handle n-grams that have very low probabilities?
Problem 1: Low-frequency n-grams
- Data sparseness is a frequently occurring problem; algorithms will make incorrect decisions if it is not handled
- Assume n-gram x occurs twice and n-gram y occurs once. Is x really twice as likely to occur as y?
Problem 2: Zero counts
- Probabilities compute to zero for n-grams not seen in the corpus
- If n-gram y does not occur, should its probability be zero?

SMOOTHING
Definition: a corpus is a collection of written or spoken material in machine-readable form.
Smoothing: an algorithm that redistributes the probability mass.
Discounting: reduces the probabilities of n-grams with non-zero counts to accommodate the n-grams with zero counts (those unseen in the corpus).
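Before the smoothing techniques that follow, a minimal sketch (Python; the toy corpus is assumed for illustration only) of how maximum-likelihood bigram estimates behave on sparse data: a once-seen bigram looks half as likely as a twice-seen one, and an unseen bigram gets probability zero.

```python
from collections import Counter

# Toy training corpus (assumed for illustration only).
corpus = "i want chinese food . i want lunch . i need food .".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def mle_bigram(w_prev, w):
    """Maximum-likelihood estimate P(w | w_prev) = c(w_prev, w) / c(w_prev)."""
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev] if unigram_counts[w_prev] else 0.0

print(mle_bigram("i", "want"))     # 0.667 -- seen twice out of three
print(mle_bigram("i", "need"))     # 0.333 -- seen once: is it really half as likely?
print(mle_bigram("want", "food"))  # 0.0   -- never seen, though clearly possible
```

Smoothing and discounting, described next, redistribute some probability mass so that the last case is no longer zero.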
ADD-ONE SMOOTHING
The naïve smoothing technique:
- Add one to the count of all seen and unseen n-grams
- Add the total increased count to the probability mass
Example: uni-grams
- Un-smoothed probability for word w: P(w) = c(w) / N
- Add-one revised probability for word w: P+1(w) = (c(w) + 1) / (N + V)
where N = number of words encountered, V = vocabulary size, and c(w) = number of times word w was encountered

ADD-ONE BI-GRAM SMOOTHING EXAMPLE
Note: this example assumes bi-gram counts and a vocabulary of V = 1616 words.
Note: each table row gives the number of times the word in a column precedes the word on the left, or starts a sentence.
Un-smoothed: P(wn | wn-1) = C(wn-1wn) / C(wn-1), e.g., 1087/3437 = .3163
Add-one: P+1(wn | wn-1) = [C(wn-1wn) + 1] / [C(wn-1) + V], e.g., 1088/(3437 + 1616) = .2153
Note: C(I)=3437, C(want)=1215, C(to)=3256, C(eat)=938, C(Chinese)=213, C(food)=1506, C(lunch)=459

EVALUATION OF ADD-ONE SMOOTHING
Advantage:
- Simple technique to implement and understand
Disadvantages:
- Too much probability mass moves to the unseen n-grams
- Underestimates the probabilities of the common n-grams
- Overestimates the probabilities of rare (or unseen) n-grams
- Relative smoothing of all unseen n-grams is the same
- Relative smoothing of rare n-grams is still incorrect
Alternative: use a smaller add value
- Disadvantage: this does not fully solve the problem

ADD-ONE BI-GRAM DISCOUNTING
Original counts c(wi-1 wi) are revised to c'(wi-1 wi) = (c(wi-1 wi) + 1) × N / (N + V), where N = c(wi-1) and V = vocabulary size = 1616
Unigram counts: C(I)=3437, C(want)=1215, C(to)=3256, C(eat)=938, C(Chinese)=213, C(food)=1506, C(lunch)=459
Note: high counts are reduced by approximately a third in this example, e.g., a count of 1087 becomes (1087 + 1) × 3437/(3437 + 1616) ≈ 740
Note: low counts get larger
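A minimal sketch (Python; the counts and V = 1616 are taken from the example above) of the add-one probability and the corresponding discounted count:

```python
def add_one_prob(c_bigram, c_prev, V):
    """Add-one (Laplace) bigram probability: (C(w_{n-1}w_n) + 1) / (C(w_{n-1}) + V)."""
    return (c_bigram + 1) / (c_prev + V)

def add_one_discounted_count(c_bigram, c_prev, V):
    """Revised count c' = (c + 1) * N / (N + V), with N = C(w_{n-1})."""
    return (c_bigram + 1) * c_prev / (c_prev + V)

V = 1616  # vocabulary size from the slide example

print(add_one_prob(1087, 3437, V))              # ~0.2153 (vs. .3163 un-smoothed)
print(add_one_discounted_count(1087, 3437, V))  # ~740: a high count shrinks by about a third
print(add_one_discounted_count(0, 3437, V))     # ~0.68: a zero count becomes non-zero
```

The output illustrates the complaint in the evaluation slide: a large amount of probability mass is shifted from frequent bi-grams onto the many unseen ones.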
UNIGRAM WITTEN-BELL DISCOUNTING
Add probability mass to un-encountered words; discount the rest.
Compute the probability of a first-time encounter of a new word:
- Note: every one of the O observed words had a first encounter
- Unseen words: U = V - O
- Answer: P(any newly encountered word) = O / (V + O)
Equally distribute this probability across all unobserved words:
- P(any specific newly encountered word) = 1/U × O/(V + O)
- Adjusted count = V × 1/U × O/(V + O)
Discount each encountered word wi to preserve probability space:
- Probability: from counti / V to counti / (V + O)
- Discounted count: from counti to counti × V/(V + O)
O = observed words, U = words never seen, V = corpus vocabulary words

BI-GRAM WITTEN-BELL DISCOUNTING
Add probability mass to un-encountered bi-grams; discount the rest.
Consider bi-grams starting with wn-1 and compute the probability of a first-time encounter of a new bi-gram starting with wn-1:
- Answer: P(any newly encountered bi-gram) = O(wn-1) / (N(wn-1) + O(wn-1))
- Note: we observed O(wn-1) distinct bi-grams among N(wn-1) + O(wn-1) events
- Note: an event is either a bi-gram occurrence or a first-time encounter
Divide this probability among all unseen bi-grams starting with wn-1:
- O(wn-1) = number of uniquely observed bi-grams starting with wn-1
- N(wn-1) = count of bi-grams starting with wn-1
- U(wn-1) = number of un-observed bi-grams starting with wn-1
- Adjusted count = N(wn-1) × 1/U(wn-1) × O(wn-1)/(N(wn-1) + O(wn-1))
Discount the observed bi-grams starting with wn-1:
- Counts: from c(wn-1 wn) to c(wn-1 wn) × N(wn-1)/(N(wn-1) + O(wn-1))
O = observed bi-grams, U = bi-grams never seen, V = corpus vocabulary bi-grams
N, O, and U values are on the next slide.

WITTEN-BELL SMOOTHING
Note: N, O, and U refer to counts for wn-1.
Original counts: c(wn-1 wn)
Adjusted add-one counts: c'(wn-1 wn) = (c(wn-1 wn) + 1) × Nn-1 / (Nn-1 + V)
Adjusted Witten-Bell counts:
- If c(wn-1 wn) = 0: c'(wn-1 wn) = On-1/Un-1 × Nn-1/(On-1 + Nn-1), e.g., 76/1540 × 1215/(1215 + 76) = .0464
- Otherwise: c'(wn-1 wn) = c(wn-1 wn) × Nn-1/(On-1 + Nn-1), e.g., 1087 × 3437/(3437 + 89) = 1059.563

BI-GRAM COUNTS FOR EXAMPLE
O(wn-1) = number of observed bi-grams starting with wn-1
N(wn-1) = count of bi-grams starting with wn-1
U(wn-1) = number of un-observed bi-grams starting with wn-1

wn-1      O(wn-1)   U(wn-1)   N(wn-1)
I         89        1,521     3,437
Want      76        1,540     1,215
To        130       1,486     3,256
Eat       124       1,492     938
Chinese   20        1,596     213
Food      82        1,534     1,506
Lunch     45        1,571     459

EVALUATION OF WITTEN-BELL
- Uses the probability of already-encountered grams to compute probabilities for unseen grams
- Has a smaller impact on the probabilities of already-encountered grams
- Generally computes reasonable probabilities

BACK-OFF DISCOUNTING
The general concept:
- Goal: use a hierarchy of approximations: trigram > bigram > unigram
- Consider the trigram (wn-2, wn-1, wn)
- If c(wn-2, wn-1, wn) = 0, consider the 'back-off' bi-gram (wn-1, wn)
- If c(wn-1, wn) = 0, consider the 'back-off' unigram wn
- Degrade gracefully when higher-level grams don't exist
Given a word sequence fragment … wn-2 wn-1 wn, use the following preference rule (a minimal sketch appears at the end of this section):
1. P(wn | wn-2 wn-1) if c(wn-2 wn-1 wn) > 0
2. α1 P(wn | wn-1) if c(wn-1 wn) > 0
3. α2 P(wn)
Note: α1 and α2 are values carefully computed to preserve probability mass.

SENONE MODEL
Definition: a cluster of similar Markov states
Goal: reduce the number of trainable units that the recognizer needs to process
Approach:
- HMMs represent sub-phonetic units
- A tree structure combines the sub-phonetic units
- The phoneme recognizer searches the tree to find HMMs
- Nodes partition with questions about neighboring phones, such as: Is the left phone sonorant or nasal? Is the right phone a back-R? Is the right phone voiced? Is the left phone a back-L? Is the left phone s, z, sh, or zh?
Performance:
- Triphones reduce the error rate by 15%
- Senones reduce the error rate by 24%
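As referenced in the BACK-OFF DISCOUNTING slide, here is a minimal sketch (Python) of the preference rule. The weights alpha1 and alpha2 are placeholder assumptions; in a real back-off model they are computed to preserve the probability mass, as the slide notes.

```python
def backoff_prob(w, w1, w2, tri, bi, uni, alpha1=0.4, alpha2=0.4):
    """Back-off preference rule from the slide: use the trigram estimate when its
    count is nonzero, otherwise the weighted bigram, otherwise the weighted unigram.
    tri/bi/uni are count dictionaries; alpha1 and alpha2 are placeholder weights."""
    if tri.get((w2, w1, w), 0) > 0 and bi.get((w2, w1), 0) > 0:
        return tri[(w2, w1, w)] / bi[(w2, w1)]
    if bi.get((w1, w), 0) > 0 and uni.get(w1, 0) > 0:
        return alpha1 * bi[(w1, w)] / uni[w1]
    total = sum(uni.values())
    return alpha2 * uni.get(w, 0) / total if total else 0.0

# Toy counts (assumed) just to exercise each branch of the rule.
uni = {"the": 3, "dog": 2, "barks": 1}
bi = {("the", "dog"): 2, ("dog", "barks"): 1}
tri = {("the", "dog", "barks"): 1}

print(backoff_prob("barks", "dog", "the", tri, bi, uni))  # trigram branch: 1/2
print(backoff_prob("dog", "the", "a", tri, bi, uni))      # bigram back-off: 0.4 * 2/3
print(backoff_prob("cat", "the", "a", tri, bi, uni))      # unigram back-off: 0.0
```

The sketch degrades gracefully exactly as the preference rule describes: it only falls back to a lower-order estimate when the higher-order count is zero.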