Bayes Rule

$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$

Rev. Thomas Bayes (1702–1761)

• How is this rule derived? (From the product rule: $P(A, B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$.)
• Using Bayes rule for probabilistic inference:

$P(\text{Cause} \mid \text{Evidence}) = \frac{P(\text{Evidence} \mid \text{Cause})\, P(\text{Cause})}{P(\text{Evidence})}$

– P(Cause | Evidence): diagnostic probability
– P(Evidence | Cause): causal probability

Bayesian decision theory

• Suppose the agent has to make a decision about the value of an unobserved query variable X given some observed evidence E = e
– Partially observable, stochastic, episodic environment
– Examples: X = {spam, not spam}, e = email message; X = {zebra, giraffe, hippo}, e = image features
– The agent has a loss function, which is 0 if the value of X is guessed correctly and 1 otherwise
– What is the agent's optimal estimate of the value of X?
• Maximum a posteriori (MAP) decision: the value of X that minimizes the expected loss is the one with the greatest posterior probability P(X = x | e)

MAP decision

• X = x: value of the query variable
• E = e: evidence

$x^* = \arg\max_x P(x \mid e) = \arg\max_x \frac{P(e \mid x)\, P(x)}{P(e)} = \arg\max_x P(e \mid x)\, P(x)$

$P(x \mid e) \propto P(e \mid x)\, P(x)$ (posterior $\propto$ likelihood $\times$ prior)

• Maximum likelihood (ML) decision: $x^* = \arg\max_x P(e \mid x)$ (a short code sketch of the MAP rule appears after the spam-filter slides below)

Example: Spam Filter

• We have X = {spam, ¬spam}, E = email message.
• What should be our decision criterion?
– Compute P(spam | message) and P(¬spam | message), and assign the message to the class with the higher posterior probability:

$P(\text{spam} \mid \text{message}) \propto P(\text{message} \mid \text{spam})\, P(\text{spam})$
$P(\neg\text{spam} \mid \text{message}) \propto P(\text{message} \mid \neg\text{spam})\, P(\neg\text{spam})$

Example: Spam Filter

• We need to find P(message | spam) P(spam) and P(message | ¬spam) P(¬spam)
• How do we represent the message?
– Bag-of-words model:
  • The order of the words is not important
  • Each word is conditionally independent of the others given the message class
• If the message consists of words (w1, …, wn), how do we compute P(w1, …, wn | spam)?
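Before the Naïve Bayes assumption below answers that question, it may help to see the MAP rule $x^* = \arg\max_x P(e \mid x)\,P(x)$ as code. This is a minimal sketch; the class names, priors, and likelihood values are hypothetical numbers chosen only for illustration, not taken from the slides.

```python
# Minimal sketch of a MAP decision: pick the class x maximizing P(e | x) * P(x).
# All numbers below are hypothetical, for illustration only.

def map_decision(priors, likelihoods):
    """Return the class with the largest unnormalized posterior P(e | x) * P(x)."""
    return max(priors, key=lambda x: likelihoods[x] * priors[x])

# Hypothetical model: P(spam) = 0.3, and the observed message e is ten times
# more likely under the spam class than under the non-spam class.
priors = {"spam": 0.3, "not_spam": 0.7}
likelihoods = {"spam": 0.010, "not_spam": 0.001}   # P(e | x) for the observed e

print(map_decision(priors, likelihoods))  # -> "spam", since 0.010*0.3 > 0.001*0.7
```

Dropping the prior factor from the score gives the ML decision $x^* = \arg\max_x P(e \mid x)$.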
– Naïve Bayes assumption: each word is conditionally independent of the others given the message class:

$P(\text{message} \mid \text{spam}) = P(w_1, \ldots, w_n \mid \text{spam}) = \prod_{i=1}^{n} P(w_i \mid \text{spam})$

Example: Spam Filter

• Our filter will classify the message as spam if

$P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam}) > P(\neg\text{spam}) \prod_{i=1}^{n} P(w_i \mid \neg\text{spam})$

• In practice, likelihoods are pretty small numbers, so we need to take logs to avoid underflow:

$\log\Big( P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam}) \Big) = \log P(\text{spam}) + \sum_{i=1}^{n} \log P(w_i \mid \text{spam})$

• Model parameters:
– Priors P(spam), P(¬spam)
– Likelihoods P(wi | spam), P(wi | ¬spam)
• These parameters need to be learned from a training set (a representative sample of email messages marked with their classes)

Parameter estimation

• Model parameters:
– Priors P(spam), P(¬spam)
– Likelihoods P(wi | spam), P(wi | ¬spam)
• Estimation by empirical word frequencies in the training set:

P(wi | spam) = (# of occurrences of wi in spam messages) / (total # of words in spam messages)

– This happens to be the parameter estimate that maximizes the likelihood of the training data:

$\prod_{d=1}^{D} \prod_{i=1}^{n_d} P(w_{d,i} \mid \text{class}_d)$, where d indexes training documents and i indexes the words within a document

• Parameter smoothing: dealing with words that were never seen or were seen too few times
– Laplacian smoothing: pretend you have seen every vocabulary word one more time than you actually did (pulled together in the spam-filter code sketch at the end of this section)

Bayesian decision making: Summary

• Suppose the agent has to make decisions about the value of an unobserved query variable X based on the values of an observed evidence variable E
• Inference problem: given some evidence E = e, what is P(X | e)?
• Learning problem: estimate the parameters of the probabilistic model P(X | E) given a training sample {(x1, e1), …, (xn, en)}

Bag-of-words models for images

Csurka et al. (2004), Willamowski et al. (2005), Grauman & Darrell (2005), Sivic et al. (2003, 2005)

1. Extract image features
2. Learn a “visual vocabulary”
3. Map image features to visual words
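Steps 2 and 3 above are commonly implemented by clustering local feature descriptors and then assigning each descriptor to its nearest cluster center. The sketch below is one minimal way to do this with k-means; the random vectors stand in for real local features (e.g., SIFT descriptors), and the use of scikit-learn is an illustrative assumption rather than something specified in the slides.

```python
# Sketch of steps 2-3: cluster local descriptors into a "visual vocabulary"
# with k-means, then map each descriptor to its nearest visual word.
# Random vectors stand in for real local features (e.g., SIFT descriptors).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
descriptors = rng.normal(size=(5000, 128))      # placeholder 128-D local features

# Step 2: learn the visual vocabulary (cluster centers = visual words).
vocab_size = 200
kmeans = KMeans(n_clusters=vocab_size, n_init=10, random_state=0).fit(descriptors)

# Step 3: map the features of one image to visual words and build its
# bag-of-words histogram, analogous to word counts in a text message.
image_descriptors = rng.normal(size=(300, 128))
words = kmeans.predict(image_descriptors)       # visual-word index per descriptor
histogram = np.bincount(words, minlength=vocab_size)
```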
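Finally, returning to the spam filter: the sketch below pulls together the pieces from the earlier slides, namely parameter estimation by word frequencies, Laplacian smoothing, log-space scoring, and the MAP decision rule. The toy training corpus and all helper names are hypothetical, for illustration only.

```python
# Naive Bayes spam filter sketch: Laplacian-smoothed word likelihoods,
# log-space scoring, and a MAP decision. The training data is a toy example.
import math
from collections import Counter

train = [("win cash now", "spam"), ("cash prize win", "spam"),
         ("meeting at noon", "not_spam"), ("lunch meeting today", "not_spam")]

classes = {"spam", "not_spam"}
vocab = {w for text, _ in train for w in text.split()}
word_counts = {c: Counter() for c in classes}
doc_counts = Counter()
for text, c in train:
    doc_counts[c] += 1
    word_counts[c].update(text.split())

def log_likelihood(word, c):
    # Laplacian smoothing: add 1 to every vocabulary word's count.
    return math.log((word_counts[c][word] + 1) /
                    (sum(word_counts[c].values()) + len(vocab)))

def classify(message):
    # MAP decision: argmax over classes of log P(class) + sum_i log P(w_i | class).
    # Words outside the training vocabulary are ignored here for simplicity.
    def score(c):
        prior = math.log(doc_counts[c] / len(train))
        return prior + sum(log_likelihood(w, c) for w in message.split() if w in vocab)
    return max(classes, key=score)

print(classify("win a cash prize"))   # -> "spam" on this toy corpus
```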