
Bayes Rule
P(A | B) = P(B | A) P(A) / P(B)
Rev. Thomas Bayes
(1702-1761)
• How is this rule derived?
• Using Bayes rule for probabilistic inference:
P(Cause | Evidence) = P(Evidence | Cause) P(Cause) / P(Evidence)
– P(Cause | Evidence): diagnostic probability
– P(Evidence | Cause): causal probability
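A minimal numeric sketch of Bayes rule for this kind of diagnostic inference; all the probabilities below are invented for illustration.

```python
# Invented numbers: a rare cause, a fairly reliable piece of evidence.
p_cause = 0.01                  # prior P(Cause)
p_ev_given_cause = 0.9          # causal probability P(Evidence | Cause)
p_ev_given_not_cause = 0.1      # P(Evidence | ¬Cause)

# Total probability: P(Evidence) = P(E | C) P(C) + P(E | ¬C) P(¬C)
p_ev = p_ev_given_cause * p_cause + p_ev_given_not_cause * (1 - p_cause)

# Bayes rule: diagnostic probability P(Cause | Evidence)
p_cause_given_ev = p_ev_given_cause * p_cause / p_ev
print(round(p_cause_given_ev, 3))  # 0.083
```

Even with strong evidence, the small prior keeps the posterior low, which is why the prior term matters in the rule.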
Bayesian decision theory
• Suppose the agent has to make a decision about
the value of an unobserved query variable X given
some observed evidence E = e
– Partially observable, stochastic, episodic environment
– Examples: X = {spam, not spam}, e = email message
X = {zebra, giraffe, hippo}, e = image features
– The agent has a loss function, which is 0 if the value
of X is guessed correctly and 1 otherwise
– What is the agent’s optimal estimate of the value of X?
• Maximum a posteriori (MAP) decision: value of
X that minimizes expected loss is the one that has
the greatest posterior probability P(X = x | e)
MAP decision
• X = x: value of query variable
• E = e: evidence
x* = argmax_x P(x | e) = argmax_x P(e | x) P(x) / P(e) = argmax_x P(e | x) P(x)
P(x | e) ∝ P(e | x) P(x)
posterior ∝ likelihood × prior
• Maximum likelihood (ML) decision:
x*  arg max x P(e | x)
Example: Spam Filter
• We have X = {spam, ¬spam}, E = email message.
• What should be our decision criterion?
– Compute P(spam | message) and P(¬spam | message),
and assign the message to the class that gives higher
posterior probability
P(spam | message) ∝ P(message | spam) P(spam)
P(¬spam | message) ∝ P(message | ¬spam) P(¬spam)
Example: Spam Filter
• We need to find P(message | spam) P(spam) and
P(message | ¬spam) P(¬spam)
• How do we represent the message?
– Bag of words model:
• The order of the words is not important
• Each word is conditionally independent of the others given
message class
• If the message consists of words (w1, …, wn), how do we
compute P(w1, …, wn | spam)?
– Naïve Bayes assumption: each word is conditionally
independent of the others given message class
P(message | spam) = P(w1, …, wn | spam) = ∏_{i=1}^{n} P(wi | spam)
Example: Spam Filter
• Our filter will classify the message as spam if
P(spam) ∏_{i=1}^{n} P(wi | spam) > P(¬spam) ∏_{i=1}^{n} P(wi | ¬spam)
• In practice, likelihoods are pretty small numbers, so we
need to take logs to avoid underflow:
log [ P(spam) ∏_{i=1}^{n} P(wi | spam) ] = log P(spam) + Σ_{i=1}^{n} log P(wi | spam)
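The full log-space decision rule can be sketched as follows; the priors and word likelihoods are invented for illustration (a real filter would learn them from training data, as described below).

```python
import math

# Invented model parameters for a two-class toy filter.
prior = {"spam": 0.3, "ham": 0.7}
likelihood = {
    "spam": {"free": 0.05, "money": 0.04, "meeting": 0.001},
    "ham":  {"free": 0.001, "money": 0.002, "meeting": 0.03},
}

def log_score(cls, words):
    # log P(cls) + sum_i log P(wi | cls): sums of logs avoid the
    # underflow that a product of many small likelihoods would cause
    return math.log(prior[cls]) + sum(math.log(likelihood[cls][w]) for w in words)

words = ["free", "money"]
decision = "spam" if log_score("spam", words) > log_score("ham", words) else "ham"
print(decision)  # spam
```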
• Model parameters:
– Priors P(spam), P(¬spam)
– Likelihoods P(wi | spam), P(wi | ¬spam)
• These parameters need to be learned from a training set
(a representative sample of email messages marked
with their classes)
Parameter estimation
• Model parameters:
– Priors P(spam), P(¬spam)
– Likelihoods P(wi | spam), P(wi | ¬spam)
• Estimation by empirical word frequencies in the training set:
P(wi | spam) = (# of occurrences of wi in spam messages) / (total # of words in spam messages)
– This happens to be the parameter estimate that maximizes the
likelihood of the training data:
∏_{d=1}^{D} ∏_{i=1}^{n_d} P(w_{d,i} | class_d)
d: index of training document, i: index of a word
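The counting estimate above can be sketched in a few lines; the toy training messages are invented for illustration.

```python
from collections import Counter

# Invented spam training messages, already tokenized into words.
spam_messages = [["free", "money", "free"], ["money", "now"]]

# Count occurrences of every word across all spam messages.
counts = Counter(w for msg in spam_messages for w in msg)
total = sum(counts.values())  # total # of words in spam messages

# P(wi | spam) = (# occurrences of wi in spam) / (total spam words)
p_free_given_spam = counts["free"] / total
print(p_free_given_spam)  # 0.4  (2 of the 5 spam words are "free")
```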
• Parameter smoothing: dealing with words that were never
seen or seen too few times
– Laplacian smoothing: pretend you have seen every vocabulary word
one more time than you actually did
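Laplacian smoothing can be sketched as an add-one adjustment to the counting estimate; the vocabulary and counts are invented for illustration.

```python
from collections import Counter

# Invented vocabulary and raw counts; "lottery" was never seen in spam.
vocab = ["free", "money", "now", "lottery"]
counts = Counter({"free": 2, "money": 2, "now": 1})
total = sum(counts.values())

def p_smoothed(word):
    # Pretend every vocabulary word was seen once more than it was:
    # add 1 to each count and |vocab| to the total, so the smoothed
    # probabilities still sum to 1 over the vocabulary.
    return (counts[word] + 1) / (total + len(vocab))

print(round(p_smoothed("lottery"), 3))  # 0.111 instead of 0
```

Without smoothing, a single unseen word would zero out the whole likelihood product for that class.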
Bayesian decision making:
Summary
• Suppose the agent has to make decisions about
the value of an unobserved query variable X
based on the values of an observed evidence
variable E
• Inference problem: given some evidence E = e,
what is P(X | e)?
• Learning problem: estimate the parameters of
the probabilistic model P(X | E) given a training
sample {(x1,e1), …, (xn,en)}
Bag-of-word models for images
Csurka et al. (2004), Willamowski et al. (2005), Grauman & Darrell (2005), Sivic et al. (2003, 2005)
1. Extract image features
2. Learn “visual vocabulary”
3. Map image features to visual words
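Step 3 can be sketched as nearest-center assignment followed by a word histogram; the cluster centers and 2-D "descriptors" below are invented toys (real descriptors are high-dimensional, and the vocabulary would come from clustering training features, e.g. with k-means).

```python
import math

# Invented visual vocabulary: cluster centers in descriptor space.
vocabulary = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

def nearest_word(desc):
    # Assign a descriptor to the index of its closest cluster center.
    return min(range(len(vocabulary)),
               key=lambda k: math.dist(desc, vocabulary[k]))

# Invented local descriptors extracted from one image.
descriptors = [(0.1, 0.1), (0.9, 1.1), (0.2, 0.0), (0.1, 0.95)]

# The image is then represented as a histogram of visual word counts.
histogram = [0] * len(vocabulary)
for d in descriptors:
    histogram[nearest_word(d)] += 1
print(histogram)  # [2, 1, 1]
```

This histogram is the image-domain analogue of the bag-of-words vector used by the spam filter.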