Bayesian Learning
Thanks to Nir Friedman, HU.

Example
Suppose we are required to build a controller that removes bad oranges from a packaging line. Decisions are made based on a sensor that reports the overall color of the orange.
[Figure: examples of bad oranges.]

Classifying Oranges
Suppose we know all the aspects of the problem.
Prior probabilities of good (+1) and bad (-1) oranges:
- P(C = +1) = probability of a good orange
- P(C = -1) = probability of a bad orange
Note: P(C = +1) + P(C = -1) = 1.
Assumption: oranges are independent; the occurrence of a bad orange does not depend on the previous ones.

Classifying Oranges (cont.)
Sensor performance: let X denote the sensor measurement obtained from each type of orange.
[Figure: the class-conditional densities p(X | C = -1) and p(X | C = +1) as functions of the sensor reading X.]

Bayes Rule
Given this knowledge, we can compute the posterior probabilities with Bayes rule:
$P(C \mid X = x) = \frac{P(C)\, P(X = x \mid C)}{P(X = x)}$
where
$P(X = x) = P(C = +1)\, P(X = x \mid C = +1) + P(C = -1)\, P(X = x \mid C = -1)$

Posterior of Oranges
The data likelihood p(X | C), combined with the prior P(C) and normalized by P(X), yields the posterior.
[Figure: the likelihoods p(X | C = -1) and p(X | C = +1); the products P(C = -1) p(X | C = -1) and P(C = +1) p(X | C = +1); and, after normalization, the posteriors P(C = -1 | X) and P(C = +1 | X), which sum to 1.]

Decision Making
Intuition: predict "Good" if P(C = +1 | X) > P(C = -1 | X); predict "Bad" otherwise.
[Figure: the posteriors P(C = -1 | X) and P(C = +1 | X); the range of X is split into "bad", "good", and "bad" regions.]

Loss Function
We have classes +1 and -1. Suppose we can make predictions a1, …, ak. Assume a loss function L(ai, cj) that describes the loss associated with making prediction ai when the class is cj.

  Prediction \ Real label    -1    +1
  Bad                         1     5
  Good                       10     0

Expected Risk
Given the estimates of P(C | X) we can compute the expected conditional risk of each decision:
$R(a \mid X) = \sum_c L(a, c)\, P(C = c \mid X)$

The Risk in Oranges
[Figure: the posteriors P(C = -1 | X) and P(C = +1 | X) together with the risks R(Good | X) and R(Bad | X) computed from the loss table above.]

Optimal Decisions
Goal: minimize risk.
Optimal decision rule: "Given X = x, predict ai if R(ai | X = x) = min_a R(a | X = x)" (break ties arbitrarily).
Note: randomized decisions do not help.
(A small numeric sketch of this rule appears at the end of this part.)

0-1 Loss
If we don't have prior knowledge, it is common to use the 0-1 loss:
- L(a, c) = 0 if a = c
- L(a, c) = 1 otherwise
Consequence: $R(a \mid X) = P(a \neq C \mid X)$.
Decision rule: "choose ai if P(C = ai | X) = max_a P(C = a | X)".

Bayesian Decisions: Summary
Decisions are based on two components:
- the conditional distribution P(C | X)
- the loss function L(A, C)
Pros: specifies optimal actions in the presence of noisy signals; can deal with skewed loss functions.
Cons: requires P(C | X).

Simple Statistics: Binomial Experiment
When tossed, a thumbtack can land in one of two positions: Head or Tail.
We denote by $\theta$ the (unknown) probability P(H).
Estimation task: given a sequence of toss samples x[1], x[2], …, x[M], we want to estimate the probabilities $P(H) = \theta$ and $P(T) = 1 - \theta$.
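Before moving on to parameter estimation, the following short Python sketch ties together Bayes rule, the loss table, and the minimum-risk decision rule from the oranges example. The priors, the Gaussian class-conditional densities, and the sensor readings are made-up illustrative values (the slides do not specify them); only the loss table matches the one above.

```python
# A minimal sketch of Bayes-rule posteriors and risk-minimizing decisions for the
# orange example. Priors, class-conditional Gaussians, and sensor readings are
# assumed for illustration; the loss values match the table in the slides.
import math

def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (math.sqrt(2 * math.pi) * std)

prior = {+1: 0.9, -1: 0.1}                       # assumed P(C = +1), P(C = -1)
likelihood = {                                   # assumed p(X | C)
    +1: lambda x: gaussian_pdf(x, mean=5.0, std=1.0),
    -1: lambda x: gaussian_pdf(x, mean=2.0, std=1.5),
}
loss = {("Bad", -1): 1, ("Bad", +1): 5,          # L(prediction, class), from the table
        ("Good", -1): 10, ("Good", +1): 0}

def posterior(x):
    """P(C | X = x) via Bayes rule: prior times likelihood, then normalize."""
    joint = {c: prior[c] * likelihood[c](x) for c in (+1, -1)}
    evidence = sum(joint.values())               # P(X = x)
    return {c: joint[c] / evidence for c in joint}

def decide(x):
    """Pick the action with minimal expected conditional risk R(a | X = x)."""
    post = posterior(x)
    risk = {a: sum(loss[(a, c)] * post[c] for c in (+1, -1)) for a in ("Good", "Bad")}
    return min(risk, key=risk.get), risk

for x in (1.0, 3.0, 5.0):                        # a few assumed sensor readings
    action, risk = decide(x)
    post = posterior(x)
    print(f"x={x}: P(+1|x)={post[+1]:.3f}, R(Good|x)={risk['Good']:.2f}, "
          f"R(Bad|x)={risk['Bad']:.2f} -> predict {action}")
```

With the skewed loss table, the rule only predicts "Good" where the posterior of a good orange is high enough to outweigh the cost of shipping a bad one.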
Why Learning is Possible?
Suppose we perform M independent flips of the thumbtack. The number of heads we see has a binomial distribution:
$P(\#\mathrm{Heads} = k) = \binom{M}{k} \theta^k (1 - \theta)^{M - k}$
and thus $E[\#\mathrm{Heads}] = M\theta$.
This suggests that we can estimate $\theta$ by $\#\mathrm{Heads} / M$.

Maximum Likelihood Estimation
MLE principle: learn parameters that maximize the likelihood function.
This is one of the most commonly used estimators in statistics: it is intuitively appealing and has well-studied properties.

Computing the Likelihood Function
To compute the likelihood in the thumbtack example we only require NH and NT (the number of heads and the number of tails):
$L(\theta : D) = \theta^{N_H} (1 - \theta)^{N_T}$
Applying the MLE principle we get
$\hat{\theta} = \frac{N_H}{N_H + N_T}$
NH and NT are sufficient statistics for the binomial distribution.

Sufficient Statistics
A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood. Formally, s(D) is a sufficient statistic if for any two datasets D and D':
s(D) = s(D') implies $L(\theta \mid D) = L(\theta \mid D')$.

Maximum A Posteriori (MAP)
Suppose we observe the sequence H, H. The MLE estimate is P(H) = 1, P(T) = 0.
Should we really believe that tails are impossible at this stage? Such an estimate can have disastrous effects: if we assume that P(T) = 0, then we are willing to act as though this outcome is impossible.

Laplace Correction
Suppose we observe n coin flips with k heads.
MLE: $P(H) = \frac{k}{n}$
Laplace correction: $P(H) = \frac{k + 1}{n + 2}$
As though we observed one additional H and one additional T.
Can we justify this estimate? Uniform prior!

Bayesian Reasoning
In Bayesian reasoning we represent our uncertainty about the unknown parameter $\theta$ by a probability distribution. This distribution can be viewed as a subjective probability: a personal judgment of uncertainty.

Bayesian Inference
We start with:
- $P(\theta)$ - the prior distribution over the values of $\theta$
- $P(x_1, \ldots, x_n \mid \theta)$ - the likelihood of the examples given a known value $\theta$
Given examples $x_1, \ldots, x_n$, we can compute the posterior distribution over $\theta$:
$P(\theta \mid x_1, \ldots, x_n) = \frac{P(x_1, \ldots, x_n \mid \theta)\, P(\theta)}{P(x_1, \ldots, x_n)}$
where the marginal likelihood is
$P(x_1, \ldots, x_n) = \int P(x_1, \ldots, x_n \mid \theta)\, P(\theta)\, d\theta$

Binomial Distribution: Laplace Est.
In this case the unknown parameter is $\theta = P(H)$.
Simplest prior: $P(\theta) = 1$ for $0 < \theta < 1$.
Likelihood: $P(x_1, \ldots, x_n \mid \theta) = \theta^k (1 - \theta)^{n - k}$, where k is the number of heads in the sequence.
Marginal likelihood: $P(x_1, \ldots, x_n) = \int_0^1 \theta^k (1 - \theta)^{n - k}\, d\theta$

Marginal Likelihood
Using integration by parts we have:
$\int_0^1 \theta^k (1 - \theta)^{n-k}\, d\theta = \left[ \frac{\theta^{k+1}}{k+1} (1 - \theta)^{n-k} \right]_0^1 + \frac{n-k}{k+1} \int_0^1 \theta^{k+1} (1 - \theta)^{n-k-1}\, d\theta = \frac{n-k}{k+1} \int_0^1 \theta^{k+1} (1 - \theta)^{n-k-1}\, d\theta$
Multiplying both sides by $\binom{n}{k}$, we have
$\binom{n}{k} \int_0^1 \theta^k (1 - \theta)^{n-k}\, d\theta = \binom{n}{k+1} \int_0^1 \theta^{k+1} (1 - \theta)^{n-k-1}\, d\theta$

Marginal Likelihood (cont.)
The recursion terminates when k = n:
$\binom{n}{n} \int_0^1 \theta^n\, d\theta = \frac{1}{n+1}$
Thus
$P(x_1, \ldots, x_n) = \int_0^1 \theta^k (1 - \theta)^{n-k}\, d\theta = \frac{1}{(n+1) \binom{n}{k}}$
We conclude that the posterior is
$P(\theta \mid x_1, \ldots, x_n) = (n+1) \binom{n}{k}\, \theta^k (1 - \theta)^{n-k}$

Bayesian Prediction
How do we predict using the posterior? We can think of this as computing the probability of the next element in the sequence:
$P(x_{n+1} \mid x_1, \ldots, x_n) = \int P(x_{n+1}, \theta \mid x_1, \ldots, x_n)\, d\theta = \int P(x_{n+1} \mid \theta, x_1, \ldots, x_n)\, P(\theta \mid x_1, \ldots, x_n)\, d\theta = \int P(x_{n+1} \mid \theta)\, P(\theta \mid x_1, \ldots, x_n)\, d\theta$
Assumption: if we know $\theta$, the probability of $X_{n+1}$ is independent of $X_1, \ldots, X_n$:
$P(x_{n+1} \mid \theta, x_1, \ldots, x_n) = P(x_{n+1} \mid \theta)$

Bayesian Prediction (cont.)
Thus, we conclude that
$P(x_{n+1} = H \mid x_1, \ldots, x_n) = \int \theta\, P(\theta \mid x_1, \ldots, x_n)\, d\theta = (n+1) \binom{n}{k} \int_0^1 \theta^{k+1} (1 - \theta)^{n-k}\, d\theta = \frac{(n+1) \binom{n}{k}}{(n+2) \binom{n+1}{k+1}} = \frac{k+1}{n+2}$
which is exactly the Laplace correction.
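The following short Python sketch compares the three estimates discussed in this part for an assumed sequence of flips: the MLE k/n, the Laplace correction (k+1)/(n+2), and the Bayesian predictive probability under the uniform prior, computed here by numerically integrating theta against the posterior to confirm that it matches (k+1)/(n+2). The observed sequence is a made-up example.

```python
# A minimal sketch comparing MLE, the Laplace correction, and the Bayesian
# predictive probability under a uniform prior, for an assumed flip sequence.
from math import comb

flips = "HHHTHHTH"            # assumed observation sequence
n = len(flips)
k = flips.count("H")          # number of heads (the sufficient statistic)

mle = k / n                   # maximum likelihood estimate of P(H)
laplace = (k + 1) / (n + 2)   # Laplace correction

# Bayesian predictive P(x_{n+1} = H | data) under the uniform prior, computed by
# numerically integrating theta * posterior(theta) on a grid (midpoint Riemann sum).
steps = 100_000
posterior_norm = (n + 1) * comb(n, k)   # (n+1) * C(n, k), from the closed-form posterior
predictive = 0.0
for i in range(steps):
    theta = (i + 0.5) / steps
    posterior = posterior_norm * theta**k * (1 - theta)**(n - k)
    predictive += theta * posterior / steps

print(f"n={n}, k={k}")
print(f"MLE                 : {mle:.4f}")
print(f"Laplace correction  : {laplace:.4f}")
print(f"Bayesian predictive : {predictive:.4f}  (should match (k+1)/(n+2))")
```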
Naïve Bayes

Bayesian Classification: Binary Domain
Consider the following situation:
- two classes: -1, +1
- each example is described by N attributes X1, …, XN
- each Xi is a binary variable with values 0, 1
Example dataset:

  X1  X2  …  XN   C
   0   1  …   1   +1
   1   0  …   1   -1
   1   1  …   0   +1
   …   …  …   …    …
   0   0  …   0   +1

Binary Domain - Priors
How do we estimate P(C)? Simple binomial estimation: count the number of instances with C = -1 and the number with C = +1.

Binary Domain - Attribute Probability
How do we estimate $P(X_1, \ldots, X_N \mid C)$? Two sub-problems:
- a training set for $P(X_1, \ldots, X_N \mid C = +1)$: the instances with C = +1
- a training set for $P(X_1, \ldots, X_N \mid C = -1)$: the instances with C = -1

Naïve Bayes
Naïve Bayes: assume
$P(X_1, \ldots, X_N \mid C) = P(X_1 \mid C) \cdots P(X_N \mid C)$
This is an independence assumption: each attribute Xi is independent of the other attributes once we know the value of C.

Naïve Bayes: Boolean Domain
Parameters, for each i:
- $\theta_{i|+1} = P(X_i = 1 \mid C = +1)$
- $\theta_{i|-1} = P(X_i = 1 \mid C = -1)$
How do we estimate $\theta_{1|+1}$? Simple binomial estimation: count the number of 1 and 0 values of X1 in the instances where C = +1.

Interpretation of Naïve Bayes
$\log \frac{P(+1 \mid X_1, \ldots, X_N)}{P(-1 \mid X_1, \ldots, X_N)} = \log \frac{P(X_1, \ldots, X_N \mid +1)\, P(+1)}{P(X_1, \ldots, X_N \mid -1)\, P(-1)} = \sum_i \log \frac{P(X_i \mid +1)}{P(X_i \mid -1)} + \log \frac{P(+1)}{P(-1)}$

Interpretation of Naïve Bayes (cont.)
Each Xi "votes" about the prediction:
- if $P(X_i \mid C = -1) = P(X_i \mid C = +1)$ then Xi has no say in the classification
- if $P(X_i \mid C = -1) = 0$ then Xi overrides all other votes ("veto")

Interpretation of Naïve Bayes (cont.)
Set
$w_i = \log \frac{P(X_i = 1 \mid +1)}{P(X_i = 1 \mid -1)} - \log \frac{P(X_i = 0 \mid +1)}{P(X_i = 0 \mid -1)}$
$b = \log \frac{P(+1)}{P(-1)} + \sum_i \log \frac{P(X_i = 0 \mid +1)}{P(X_i = 0 \mid -1)}$
Classification rule: $\mathrm{sign}\left(b + \sum_i w_i x_i\right)$
(A sketch of this linear form appears at the end of this part.)

Normal Distribution
The Gaussian distribution $X \sim N(\mu, \sigma^2)$:
$p(x) = \frac{1}{\sqrt{2\pi}\, \sigma}\, e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2}$
[Figure: the densities of N(0, 1^2) and N(4, 2^2).]

Maximum Likelihood Estimate
Suppose we observe $x_1, \ldots, x_M$. Simple calculations show that the MLE is
$\hat{\mu} = \frac{1}{M} \sum_m x_m$ and $\hat{\sigma}^2 = \frac{1}{M} \sum_m (x_m - \hat{\mu})^2$
The sufficient statistics are $\sum_m x_m$ and $\sum_m x_m^2$.

Naïve Bayes with Gaussian Distributions
Recall $P(X_1, \ldots, X_N \mid C) = P(X_1 \mid C) \cdots P(X_N \mid C)$.
Assume $P(X_i \mid C) \sim N(\mu_{i,C}, \sigma_i^2)$: the mean of Xi depends on the class, but the variance of Xi does not.
[Figure: two Gaussians with means $\mu_{i,-1}$ and $\mu_{i,+1}$ and a common variance.]

Naïve Bayes with Gaussian Distributions (cont.)
Recall
$\log \frac{P(+1 \mid X_1, \ldots, X_N)}{P(-1 \mid X_1, \ldots, X_N)} = \log \frac{P(+1)}{P(-1)} + \sum_i \log \frac{P(X_i \mid +1)}{P(X_i \mid -1)}$
With the Gaussian assumption,
$\log \frac{P(X_i \mid +1)}{P(X_i \mid -1)} = \frac{\mu_{i,+1} - \mu_{i,-1}}{\sigma_i^2} \left( X_i - \frac{\mu_{i,+1} + \mu_{i,-1}}{2} \right)$
i.e., the (scaled) distance between the class means times the distance of Xi from the midway point between them.

Different Variances?
If we allow different variances, the classification rule is more complex: the term $\log \frac{P(X_i \mid +1)}{P(X_i \mid -1)}$ is quadratic in Xi.
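The following Python sketch puts the pieces of this part together: it estimates Bernoulli naïve Bayes parameters with the Laplace correction (a choice made here to avoid taking the log of zero; the slides motivate the correction earlier) and rewrites the classifier in the linear form sign(b + sum_i w_i x_i) derived above. The tiny training set is made up for illustration.

```python
# A minimal sketch of naïve Bayes over binary attributes, written in the linear
# form sign(b + sum_i w_i x_i). The dataset is an assumed, made-up example;
# parameters use the Laplace correction so no probability is exactly 0 or 1.
import math

# Assumed training data: rows of (x_1, ..., x_N), with class labels in {+1, -1}.
X = [(0, 1, 1), (1, 0, 1), (1, 1, 0), (0, 0, 0), (1, 1, 1), (0, 0, 1)]
y = [+1, -1, +1, +1, -1, +1]
N = len(X[0])

def estimate(X, y):
    """Laplace-corrected estimates of P(C) and theta_{i|c} = P(X_i = 1 | C = c)."""
    classes = (+1, -1)
    count = {c: sum(1 for label in y if label == c) for c in classes}
    prior = {c: (count[c] + 1) / (len(y) + 2) for c in classes}
    theta = {c: [(sum(x[i] for x, label in zip(X, y) if label == c) + 1)
                 / (count[c] + 2) for i in range(N)]
             for c in classes}
    return prior, theta

def linear_form(prior, theta):
    """Convert the naïve Bayes parameters into weights w_i and bias b."""
    w = [math.log(theta[+1][i] / theta[-1][i])
         - math.log((1 - theta[+1][i]) / (1 - theta[-1][i])) for i in range(N)]
    b = math.log(prior[+1] / prior[-1]) + sum(
        math.log((1 - theta[+1][i]) / (1 - theta[-1][i])) for i in range(N))
    return w, b

prior, theta = estimate(X, y)
w, b = linear_form(prior, theta)

def predict(x):
    """Classify by the sign of the linear score b + sum_i w_i x_i."""
    return +1 if b + sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

print("weights:", [round(wi, 3) for wi in w], "bias:", round(b, 3))
print("prediction for (1, 1, 1):", predict((1, 1, 1)))
```

Writing the classifier this way makes the "voting" interpretation concrete: each attribute contributes its weight when it is 1, and the bias absorbs the prior together with the all-zeros contribution.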