
NOISY CHANNEL MODEL
Starting at this point, we need to be able to model the target language.

Speech Recognition
 Observe: acoustic signal A = a1, …, an
 Challenge: find the most likely word sequence
 But we also have to consider the context
Language Modeling
RESOLVING WORD AMBIGUITIES

 Description: After determining word boundaries, the speech
recognition process produces an array of possible word
sequences from the spoken audio
 Issues to consider
– Determine the intended word sequence
– Resolve grammatical and pronunciation errors
 Implementation: Establish word sequence probabilities
– Use existing corpora
– Train the program with run-time data
RECOGNIZER ISSUES
 Problem: Our recognizer translates the audio into a possible
string of text. How do we know the translation is correct?
 Problem: How do we handle a string of text containing words
that are not in the dictionary?
 Problem: How do we handle strings with valid words that do
not form sentences whose semantics make sense?

CORRECTING RECOGNIZER AMBIGUITIES





 Problem: Resolving words not in the dictionary
 Question: How different is a recognized word from those that
are in the dictionary?
 Solution: Count the single-step transformations necessary to
convert one word into another
 Example: caat → cat with removal of one letter
 Example: fplc → fireplace requires adding the letters ire after
f, an a before c, and an e at the end
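As a concrete illustration of counting single-step transformations, here is a minimal sketch of the standard dynamic-programming edit distance (insertions, deletions, and substitutions, each with cost 1). The function name and cost scheme are illustrative assumptions, not taken from the slides.

```python
# Minimal sketch: minimum edit distance between a recognized word and a
# dictionary word, counting single-step insertions, deletions, and
# substitutions (each with cost 1). Names and costs are illustrative.
def edit_distance(source: str, target: str) -> int:
    m, n = len(source), len(target)
    # d[i][j] = cost of turning source[:i] into target[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete all remaining source letters
    for j in range(n + 1):
        d[0][j] = j          # insert all remaining target letters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[m][n]

print(edit_distance("caat", "cat"))        # 1 (one deletion)
print(edit_distance("fplc", "fireplace"))  # 5 (five insertions)
```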
WORD PREDICTION APPROACHES
Simple vs. Smart
 Simple: every word follows every other word with equal probability (0-gram)
– Assume |V| is the size of the vocabulary
– Likelihood of a sentence S of length n is 1/|V| × 1/|V| × … × 1/|V| (n times)
– If English has 100,000 words, the probability of each next word is
1/100,000 = .00001
 Smarter: the probability of each next word is related to its word frequency
– Likelihood of sentence S = P(w1) × P(w2) × … × P(wn)
– Assumes the probability of each word is independent of the probabilities
of the other words
 Even smarter: look at the probability given the previous words
– Likelihood of sentence S = P(w1) × P(w2|w1) × … × P(wn|wn-1)
– Assumes the probability of each word depends on the preceding word
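A minimal sketch of the three scoring schemes above, assuming a toy corpus for the counts; the corpus, vocabulary, and function names are illustrative, not taken from the slides.

```python
from collections import Counter

# Toy corpus; the counts and vocabulary are illustrative assumptions.
corpus = "the white rabbit saw the white dog and the dog saw the rabbit".split()
V = set(corpus)
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
N = len(corpus)

def p_zerogram(sentence):
    # Every word equally likely: (1/|V|)^n
    return (1.0 / len(V)) ** len(sentence)

def p_unigram(sentence):
    # Product of relative frequencies; words assumed independent
    p = 1.0
    for w in sentence:
        p *= unigrams[w] / N
    return p

def p_bigram(sentence):
    # P(w1) * P(w2|w1) * ... * P(wn|wn-1)
    p = unigrams[sentence[0]] / N
    for prev, w in zip(sentence, sentence[1:]):
        p *= bigrams[(prev, w)] / unigrams[prev]
    return p

s = "the white rabbit".split()
print(p_zerogram(s), p_unigram(s), p_bigram(s))
```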
COUNTS






 What is the probability of "canine"?
 What is the probability of "canine tooth", i.e., P(tooth | canine)?
 What is the probability of "canine companion"?
 P(tooth|canine) = P(canine & tooth)/P(canine)
 Sometimes we can use counts to deduce probabilities
 Example: according to Google,
– "canine" occurs 1,750,000 times
– "canine tooth" occurs 6,280 times
– P(tooth | canine) ≈ 6280/1,750,000 = .0036
– P(companion | canine) ≈ .01
– So companion is the more likely next word after canine
Detecting likely word sequences using counts/table look-up
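A quick sketch of the count-based estimate above; the counts are the ones quoted from Google on the slide, and the variable names are illustrative.

```python
# Conditional probability from raw counts: P(next | word) = count(word next) / count(word)
count_canine = 1_750_000      # count of "canine" (from the slide)
count_canine_tooth = 6_280    # count of "canine tooth" (from the slide)

p_tooth_given_canine = count_canine_tooth / count_canine
print(round(p_tooth_given_canine, 4))   # 0.0036
```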
SINGLE WORD PROBABILITIES
Single word probability: compute the likelihood P([ni]|w), then multiply by P(w)

Word   P(O|w)   P(w)      P(O|w)P(w)
new    .36      .001      .00036      P([ni]|new)P(new)
neat   .52      .00013    .000068     P([ni]|neat)P(neat)
need   .11      .00056    .000062     P([ni]|need)P(need)
knee   1.00     .000024   .000024     P([ni]|knee)P(knee)

Limitation: ignores context
 We might need to factor in the surrounding words
– Use P(need|I) instead of just P(need)
– Note: P(new|I) < P(need|I)
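A minimal sketch of picking the most likely word for the observation [ni] using the table above; the numbers are the slide's, while the dictionary layout and function call are illustrative.

```python
# Noisy-channel style word choice: argmax over P(O|w) * P(w), numbers from the table.
candidates = {
    #  word: (P(O|w), P(w))
    "new":  (0.36, 0.001),
    "neat": (0.52, 0.00013),
    "need": (0.11, 0.00056),
    "knee": (1.00, 0.000024),
}

best = max(candidates, key=lambda w: candidates[w][0] * candidates[w][1])
print(best)   # "new" -- .00036 is the largest product, but context could change this
```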
APPLICATION EXAMPLE

 What is the most likely word sequence?
[Figure: a lattice of candidate words for the phonetic inputs ik-'spen-siv,
'pre-z&ns, and 'bot (excessive, expensive, expressive; presidents, presence,
presents, press; inactive; bald, bold, bought, boat), with bigram
probabilities such as P(inactive | bald) and P('bot | bald) linking them]
PROBABILITY CHAIN RULE


 Conditional probability: P(A1,A2) = P(A1) · P(A2|A1)
 The Chain Rule generalizes to multiple events
– P(A1, …, An) = P(A1) P(A2|A1) P(A3|A1,A2) … P(An|A1…An-1)
 Examples:
– P(the dog) = P(the) P(dog | the)
– P(the dog bites) = P(the) P(dog | the) P(bites | the dog)
 Conditional probabilities tell us more than individual relative word
frequencies because they take the context into account
– Dog may be a relatively rare word in a corpus
– But if we see barking, P(dog | barking) is much higher
 In general, the probability of a complete string of words w1…wn is:
P(w1…wn) = P(w1) P(w2|w1) P(w3|w1w2) … P(wn|w1…wn-1)
         = ∏k=1..n P(wk | w1…wk-1)

Detecting likely word sequences using probabilities
N-GRAMS
How many previous words should we consider?
 0-gram: every word's likelihood is equal
– Each word of a 300,000-word vocabulary has probability 1/300,000 ≈ .0000033
 Uni-gram: a word's likelihood depends on its frequency count
– The word 'the' occurs 69,971 times in the Brown corpus of 1,000,000 words
(frequency ≈ .07); 'rabbit' appears with frequency .00001
 Bi-gram: word likelihood is determined by the previous word, P(w | wi-1)
– 'Rabbit' is a more likely word to follow 'white' than 'the' is
 Tri-gram: word likelihood is determined by the previous two words, P(w | wi-2, wi-1)
 N-gram: a model of word or phoneme prediction that uses the previous
N-1 words or phonemes to predict the next
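A minimal sketch of extracting and counting n-grams of arbitrary order from a token list; the toy sentence and function name are illustrative assumptions.

```python
from collections import Counter

def ngrams(tokens, n):
    # All n-grams (as tuples) in a token list; the previous n-1 words predict the nth.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the white rabbit chased the white dog".split()
print(Counter(ngrams(tokens, 2)).most_common(2))  # most frequent bigrams
print(Counter(ngrams(tokens, 3)).most_common(2))  # most frequent trigrams
```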
APPROXIMATING SHAKESPEARE

 Generating sentences with random unigrams:
– Every enter now severally so, let
– Hill he late speaks; or! a more to leg less first you enter
 With bigrams:
– What means, sir. I confess she? then all sorts, he is trim, captain.
– Why dost stand forth thy canopy, forsooth; he is this palpable hit the
King Henry.
 With trigrams:
– Sweet prince, Falstaff shall die.
– This shall forbid it should be branded, if renown made it empty.
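A minimal sketch of the sentence-generation idea above: sample each next word from the bigram distribution conditioned on the previous word. The tiny corpus and start word are illustrative assumptions, not the Shakespeare data used for the slide.

```python
import random
from collections import Counter, defaultdict

text = ("sweet prince falstaff shall die . this shall forbid it . "
        "sweet prince , will you not tell me who i am ?").split()

# Build bigram successor counts.
successors = defaultdict(Counter)
for prev, word in zip(text, text[1:]):
    successors[prev][word] += 1

def generate(start, length=8):
    out = [start]
    for _ in range(length):
        nexts = successors.get(out[-1])
        if not nexts:
            break
        words, counts = zip(*nexts.items())
        out.append(random.choices(words, weights=counts)[0])
    return " ".join(out)

print(generate("sweet"))
```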
N-GRAM CONCLUSIONS

 Quadrigrams (the output now is Shakespeare):
– What! I will go seek the traitor Gloucester.
– Will you not tell me who I am?
 Comments:
– The accuracy of an n-gram model increases with increasing n because
word combinations are more and more constrained
– Higher-order n-gram models are more and more sparse: Shakespeare
produced only 0.04% of the 844 million possible bigrams
– There is a tradeoff between accuracy and the computational overhead
and memory requirements
LANGUAGE MODELING: N-GRAMS
Unigrams (SWB):
•Most Common: “I”, “and”, “the”, “you”, “a”
•Rank-100: “she”, “an”, “going”
•Least Common: “Abraham”, “Alastair”, “Acura”
Bigrams (SWB):
•Most Common:
“you know”, “yeah SENT!”,
“!SENT um-hum”, “I think”
•Rank-100: “do it”, “that we”, “don’t think”
•Least Common: “raw fish”, “moisture content”,
“Reagan Bush”
Trigrams (SWB):
•Most Common: “!SENT um-hum SENT!”,
“a lot of”, “I don’t know”
•Rank-100: “it was a”, “you know that”
•Least Common: “you have parents”,
“you seen Brooklyn”
N-GRAMS FOR SPELL CHECKS


 Non-word error detection (easiest). Example: graffe => giraffe
 Isolated-word (context-free) error correction
– A correction is not possible when the error word is itself in the dictionary
 Context-dependent error correction (hardest). Example: your an idiot =>
you're an idiot (the mistyped word happens to be a real word)
EXAMPLE
Misspelled word: acress
Candidate corrections, each listed with its probability of use P(c) and
its probability of use within the context (Context * P(c))
WHICH CORRECTION IS MOST LIKELY?
Misspelled word: acress
 Word frequency percentage alone is not enough
– We need P(typo|candidate) * P(candidate)
 How likely is the particular error that produced acress?
– Deletion of a t after a c and before an r (actress)
– Insertion of an a at the beginning (cress)
– Transposition of a c and an a (caress)
– Substitution of a c for an r (access)
– Substitution of an o for an e (across)
– Insertion of an s before the last s, or after the last s (acres)
 The context of the word within the sentence or paragraph also matters
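A minimal sketch of the noisy-channel scoring P(typo|candidate) * P(candidate); the candidate set matches the list above, but the error probabilities and word priors below are made-up illustrative numbers, not values from the slides.

```python
# Noisy-channel spelling correction: score = P(typo | candidate) * P(candidate).
# All probabilities below are made-up placeholders for illustration.
typo = "acress"
candidates = {
    #  candidate: (P(typo | candidate), P(candidate))
    "actress": (1e-4, 2e-5),
    "cress":   (1e-6, 5e-7),
    "caress":  (2e-6, 2e-6),
    "access":  (2e-7, 9e-5),
    "across":  (1e-5, 3e-4),
    "acres":   (3e-5, 3e-5),
}

scores = {c: p_err * p_word for c, (p_err, p_word) in candidates.items()}
best = max(scores, key=scores.get)
print(best, scores[best])
```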
COMMON SPELLING ERRORS
Spell check without considering context will fail:
 They are leaving in about fifteen minuets
 The study was conducted manly be John Black.
 The design an construction of the system will take more than a year.
 Hopefully, all with continue smoothly in my absence.
 Can they lave him my messages?
 I need to notified the bank of….
 He is trying to fine out.
Difficulty: detecting grammatical errors or nonsensical expressions
THE SPARSE DATA PROBLEM

 Definitions
– Maximum likelihood: finding the most probable sequence of tokens
based on the context of the input
– N-gram sequence: a sequence of n words whose context speech
algorithms consider
– Training data: a group of probabilities computed from a corpus of
text data
– Sparse data problem: how should algorithms handle n-grams that
have very low probabilities?
 Data sparseness is a frequently occurring problem
 Algorithms will make incorrect decisions if it is not handled
 Problem 1: Low-frequency n-grams
– Assume n-gram x occurs twice and n-gram y occurs once
– Is x really twice as likely to occur as y?
 Problem 2: Zero counts
– Probabilities compute to zero for n-grams not seen in the corpus
– If n-gram y does not occur, should its probability be zero?
SMOOTHING
 Definition: a corpus is a collection of written or spoken material
in machine-readable form
 Smoothing: an algorithm that redistributes the probability mass
 Discounting: reduces the probabilities of n-grams with non-zero
counts to accommodate the n-grams with zero counts (those that are
unseen in the corpus)
ADD-ONE SMOOTHING

The naïve smoothing technique
 Add one to the count of all seen and unseen n-grams
 Add the total increased count to the probability mass
 Example: uni-grams
– Un-smoothed probability for word w: P(w) = c(w) / N
– Add-one revised probability for word w: P+1(w) = (c(w) + 1) / (N + V)
– N = number of words encountered, V = vocabulary size,
c(w) = number of times word w was encountered
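A minimal sketch of add-one unigram smoothing using the formula above; the corpus and vocabulary here are illustrative assumptions.

```python
from collections import Counter

# Illustrative corpus and vocabulary (the vocabulary includes words never seen in the corpus).
tokens = "the cat sat on the mat".split()
vocab = set(tokens) | {"dog", "chair"}        # two unseen words
counts = Counter(tokens)
N, V = len(tokens), len(vocab)

def p_unsmoothed(w):
    return counts[w] / N                      # P(w) = c(w)/N  (zero for unseen words)

def p_add_one(w):
    return (counts[w] + 1) / (N + V)          # P+1(w) = (c(w)+1)/(N+V)

print(p_unsmoothed("dog"), p_add_one("dog"))  # 0.0 vs. a small non-zero probability
print(p_unsmoothed("the"), p_add_one("the"))  # a common word loses some mass
```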
ADD-ONE BI-GRAM SMOOTHING EXAMPLE
Note: This example assumes bi-gram counts and a vocabulary of V = 1616 words
Note: row = times that the word in the column precedes the word on the left, or starts a sentence

Un-smoothed: P(wn|wn-1) = C(wn-1wn)/C(wn-1), e.g., 1087/3437 = .3163
Add-one:     P+1(wn|wn-1) = [C(wn-1wn)+1]/[C(wn-1)+V], e.g., 1088/(3437+1616) = .2153

Note: C(I)=3437, C(want)=1215, C(to)=3256, C(eat)=938,
C(Chinese)=213, C(food)=1506, C(lunch)=459
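A quick check of the two numbers above in code, using the slide's counts C(I want) = 1087, C(I) = 3437, and V = 1616; the variable names are illustrative.

```python
# Un-smoothed vs. add-one bigram probability for "I want", counts from the slide.
c_bigram, c_prev, V = 1087, 3437, 1616

p_mle = c_bigram / c_prev                   # 1087/3437
p_add_one = (c_bigram + 1) / (c_prev + V)   # 1088/5053

print(round(p_mle, 4), round(p_add_one, 4))  # 0.3163 0.2153
```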
EVALUATION OF ADD-ONE SMOOTHING

 Advantage:
– Simple technique to implement and understand
 Disadvantages:
– Too much probability mass moves to the unseen n-grams
– Underestimates the probabilities of the common n-grams
– Overestimates the probabilities of rare (or unseen) n-grams
– Relative smoothing of all unseen n-grams is the same
– Relative smoothing of rare n-grams is still incorrect
 Alternative:
– Use a smaller add value
– Disadvantage: does not fully solve this problem
ADD-ONE BI-GRAM DISCOUNTING
Original counts C(wi-1):
I 3437   Want 1215   To 3256   Eat 938   Chinese 213   Food 1506   Lunch 459

Revised counts:
c'(wi-1,wi) = (c(wi-1,wi) + 1) * N/(N + V)
e.g., c'(I want) = (1087 + 1) * 3437/(3437 + 1616)

Note: N = c(wi-1), V = vocabulary size = 1616
Note: High counts are reduced by approximately a third in this example
Note: Low counts get larger
UNIGRAM WITTEN-BELL DISCOUNTING
Add probability mass to un-encountered words; discount the rest
 Compute the probability of a first-time encounter of a new word
– Note: every one of the O observed words had a first encounter
– How many unseen words? U = V – O
– What is the probability of encountering a new word?
– Answer: P(any newly encountered word) = O/(V+O)
 Equally divide this probability across all unobserved words
– P(any specific newly encountered word) = 1/U * O/(V+O)
– Adjusted count = V * 1/U * O/(V+O)
 Discount each encountered wordi to preserve the probability space
– Probability: from counti/V to counti/(V+O)
– Discounted counts: from counti to counti * V/(V+O)
O = observed words, U = words never seen, V = corpus vocabulary words
BI-GRAM WITTEN-BELL DISCOUNTING
Add probability mass to un-encountered bi-grams; discount the rest
 Consider the bi-gram wn-1wn
– O(wn-1) = number of uniquely observed bi-grams starting with wn-1
– N(wn-1) = count of bi-grams starting with wn-1
– U(wn-1) = number of un-observed bi-grams starting with wn-1
 Compute the probability of a new bi-gram starting with wn-1
– Answer: P(any newly encountered bi-gram) = O(wn-1)/(N(wn-1)+O(wn-1))
– Note: we observed O(wn-1) first encounters in N(wn-1)+O(wn-1) events
– Note: an event is either a bi-gram occurrence or a first-time encounter
 Divide this probability among all U(wn-1) unseen bi-grams starting with wn-1
– Adjusted count = N(wn-1) * 1/U(wn-1) * O(wn-1)/(N(wn-1)+O(wn-1))
 Discount the observed bi-grams starting with wn-1
– Counts: from c(wn-1wn) to c(wn-1wn) * N(wn-1)/(N(wn-1)+O(wn-1))
O = observed bi-grams, U = bi-grams never seen, V = corpus vocabulary bi-grams
N, O, and U values are on the next slide
WITTEN-BELL SMOOTHING
Note: V, O, U refer to wn-1 counts

Original counts: c(wn-1,wn)

Adjusted add-one counts:
c′(wn-1,wn) = (c(wn-1,wn) + 1) * N(wn-1)/(N(wn-1) + V)

Adjusted Witten-Bell counts:
c′(wn-1,wn) = O(wn-1)/U(wn-1) * N(wn-1)/(O(wn-1)+N(wn-1))   if c(wn-1,wn) = 0
              e.g., 76/1540 * 1215/(1215+76) = .0464
c′(wn-1,wn) = c(wn-1,wn) * N(wn-1)/(O(wn-1)+N(wn-1))        otherwise
              e.g., 1087 * 3437/(3437+89) = 1059.563
BI-GRAM COUNTS FOR EXAMPLE
O(wn-1) = number of observed bi-grams starting with wn-1
N(wn-1) = count of bi-grams starting with wn-1
U(wn-1) = number of un-observed bi-grams starting with wn-1

Word      O(wn-1)   U(wn-1)   N(wn-1)
I         89        1,521     3437
Want      76        1,540     1215
To        130       1,486     3256
Eat       124       1,492     938
Chinese   20        1,596     213
Food      82        1,534     1506
Lunch     45        1,571     459
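A minimal sketch of the Witten-Bell adjusted counts from the two previous slides, using the table values for "I" and "want"; the function name is an illustrative assumption.

```python
# Witten-Bell adjusted bigram counts, using the slide's values for wn-1 = "I" and "want".
def wb_adjusted_count(c, N, O, U):
    # c = original bigram count, N = count of bigrams starting with wn-1,
    # O = observed bigram types starting with wn-1, U = unseen bigram types.
    if c == 0:
        return (O / U) * (N / (O + N))   # share of the new-bigram probability mass
    return c * N / (O + N)               # discounted count for a seen bigram

# Seen bigram "I want": c = 1087, N(I) = 3437, O(I) = 89, U(I) = 1521
print(round(wb_adjusted_count(1087, 3437, 89, 1521), 3))   # 1059.563

# Unseen bigram starting with "want": N(want) = 1215, O(want) = 76, U(want) = 1540
print(round(wb_adjusted_count(0, 1215, 76, 1540), 4))      # 0.0464
```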
EVALUATION OF WITTEN-BELL

 Estimates the probability of already encountered grams to compute
probabilities for unseen grams
 Smaller impact on the probabilities of already encountered grams
 Generally computes reasonable probabilities
BACK-OFF DISCOUNTING

 The general concept
– Consider the trigram (wn-2, wn-1, wn)
– If c(wn-2, wn-1) = 0, consider the 'back-off' bi-gram (wn-1, wn)
– If c(wn-1) = 0, consider the 'back-off' unigram wn
 Goal is to use a hierarchy of approximations
– trigram > bigram > unigram
– Degrade gracefully when higher-level grams don't exist
 Given a word sequence fragment: wn-2 wn-1 wn …
 Utilize the following preference rule
1. p(wn | wn-2 wn-1)   if c(wn-2 wn-1 wn) > 0
2. α1 p(wn | wn-1)     if c(wn-1 wn) > 0
3. α2 p(wn)            otherwise
Note: α1 and α2 are values carefully computed to preserve probability mass
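A minimal sketch of the preference rule above, assuming count tables built from a toy corpus and fixed back-off weights alpha1 and alpha2 (how those weights are actually computed to preserve probability mass is not shown); all names and values are illustrative.

```python
from collections import Counter

tokens = "we saw the white rabbit and we saw the white dog".split()
unigram_c = Counter(tokens)
bigram_c = Counter(zip(tokens, tokens[1:]))
trigram_c = Counter(zip(tokens, tokens[1:], tokens[2:]))
N = len(tokens)

def backoff_prob(w, w1, w2, alpha1=0.4, alpha2=0.4):
    """Prefer the trigram estimate, then a weighted bigram, then a weighted unigram.
    alpha1/alpha2 stand in for weights computed elsewhere to preserve probability mass."""
    if trigram_c.get((w2, w1, w), 0) > 0:
        return trigram_c[(w2, w1, w)] / bigram_c[(w2, w1)]   # p(w | w2 w1)
    if bigram_c.get((w1, w), 0) > 0:
        return alpha1 * bigram_c[(w1, w)] / unigram_c[w1]    # alpha1 * p(w | w1)
    return alpha2 * unigram_c.get(w, 0) / N                  # alpha2 * p(w)

print(backoff_prob("rabbit", w1="white", w2="the"))  # trigram seen: uses full context
print(backoff_prob("dog",    w1="white", w2="a"))    # unseen history: backs off to bigram
```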
SENONE MODEL
Definition: a cluster of similar Markov states
 Goal: reduce the number of trainable units that the recognizer
needs to process
 Approach:
– HMMs represent sub-phonetic units
– A tree structure combines sub-phonetic units
– The phoneme recognizer searches the tree to find HMMs
– Nodes partition with questions about neighboring phones
(e.g., Is the left phone sonorant or nasal? Is the right phone a back-R?
Is the right phone voiced? Is the left phone a back-L?
Is the left phone s, z, sh, or zh?)
 Performance:
– Triphones reduce the error rate by 15%
– Senones reduce the error rate by 24%