Bayesian Learning
Thanks to Nir Friedman, HU.

Example
Suppose we are required to build a controller that removes bad oranges from a packaging line. Decisions are made based on a sensor that reports the overall color of the orange.
[Figure: examples of bad oranges.]

Classifying Oranges
Suppose we know all the aspects of the problem.
Prior probabilities of good (+1) and bad (-1) oranges:
- P(C = +1) = probability of a good orange
- P(C = -1) = probability of a bad orange
Note: P(C = +1) + P(C = -1) = 1.
Assumption: oranges are independent; the occurrence of a bad orange does not depend on the previous ones.

Classifying Oranges (cont.)
Sensor performance: let X denote the sensor measurement obtained from each type of orange.
[Figure: the class-conditional densities p(X | C = -1) and p(X | C = +1) as functions of the sensor reading X.]

Bayes Rule
Given this knowledge, we can compute the posterior probabilities with Bayes rule:
$P(C \mid X = x) = \frac{P(C)\, P(X = x \mid C)}{P(X = x)}$
where
$P(X = x) = P(C = +1)\, P(X = x \mid C = +1) + P(C = -1)\, P(X = x \mid C = -1)$

Posterior of Oranges
The data likelihood p(X | C), combined with the prior P(C) and normalized by P(X), yields the posterior.
[Figure: the likelihoods p(X | C = -1) and p(X | C = +1); the products P(C = -1) p(X | C = -1) and P(C = +1) p(X | C = +1); and, after normalization, the posteriors P(C = -1 | X) and P(C = +1 | X), which sum to 1.]

Decision Making
Intuition: predict "Good" if P(C = +1 | X) > P(C = -1 | X); predict "Bad" otherwise.
[Figure: the posteriors P(C = -1 | X) and P(C = +1 | X); the range of X is split into "bad", "good", and "bad" regions.]

Loss Function
We have classes +1 and -1. Suppose we can make predictions a1, …, ak. Assume a loss function L(ai, cj) that describes the loss associated with making prediction ai when the class is cj.

  Prediction \ Real label    -1    +1
  Bad                         1     5
  Good                       10     0

Expected Risk
Given the estimates of P(C | X) we can compute the expected conditional risk of each decision:
$R(a \mid X) = \sum_c L(a, c)\, P(C = c \mid X)$

The Risk in Oranges
[Figure: the posteriors P(C = -1 | X) and P(C = +1 | X) together with the risks R(Good | X) and R(Bad | X) computed from the loss table above.]

Optimal Decisions
Goal: minimize risk.
Optimal decision rule: "Given X = x, predict ai if R(ai | X = x) = min_a R(a | X = x)" (break ties arbitrarily).
Note: randomized decisions do not help.
(A small numeric sketch of this rule appears at the end of this part.)

0-1 Loss
If we don't have prior knowledge, it is common to use the 0-1 loss:
- L(a, c) = 0 if a = c
- L(a, c) = 1 otherwise
Consequence: $R(a \mid X) = P(a \neq C \mid X)$.
Decision rule: "choose ai if P(C = ai | X) = max_a P(C = a | X)".

Bayesian Decisions: Summary
Decisions are based on two components:
- the conditional distribution P(C | X)
- the loss function L(A, C)
Pros: specifies optimal actions in the presence of noisy signals; can deal with skewed loss functions.
Cons: requires P(C | X).

Simple Statistics: Binomial Experiment
When tossed, a thumbtack can land in one of two positions: Head or Tail.
We denote by $\theta$ the (unknown) probability P(H).
Estimation task: given a sequence of toss samples x[1], x[2], …, x[M], we want to estimate the probabilities $P(H) = \theta$ and $P(T) = 1 - \theta$.
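Before moving on to parameter estimation, the following short Python sketch ties together Bayes rule, the loss table, and the minimum-risk decision rule from the oranges example. The priors, the Gaussian class-conditional densities, and the sensor readings are made-up illustrative values (the slides do not specify them); only the loss table matches the one above.

```python
# A minimal sketch of Bayes-rule posteriors and risk-minimizing decisions for the
# orange example. Priors, class-conditional Gaussians, and sensor readings are
# assumed for illustration; the loss values match the table in the slides.
import math

def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (math.sqrt(2 * math.pi) * std)

prior = {+1: 0.9, -1: 0.1}                       # assumed P(C = +1), P(C = -1)
likelihood = {                                   # assumed p(X | C)
    +1: lambda x: gaussian_pdf(x, mean=5.0, std=1.0),
    -1: lambda x: gaussian_pdf(x, mean=2.0, std=1.5),
}
loss = {("Bad", -1): 1, ("Bad", +1): 5,          # L(prediction, class), from the table
        ("Good", -1): 10, ("Good", +1): 0}

def posterior(x):
    """P(C | X = x) via Bayes rule: prior times likelihood, then normalize."""
    joint = {c: prior[c] * likelihood[c](x) for c in (+1, -1)}
    evidence = sum(joint.values())               # P(X = x)
    return {c: joint[c] / evidence for c in joint}

def decide(x):
    """Pick the action with minimal expected conditional risk R(a | X = x)."""
    post = posterior(x)
    risk = {a: sum(loss[(a, c)] * post[c] for c in (+1, -1)) for a in ("Good", "Bad")}
    return min(risk, key=risk.get), risk

for x in (1.0, 3.0, 5.0):                        # a few assumed sensor readings
    action, risk = decide(x)
    post = posterior(x)
    print(f"x={x}: P(+1|x)={post[+1]:.3f}, R(Good|x)={risk['Good']:.2f}, "
          f"R(Bad|x)={risk['Bad']:.2f} -> predict {action}")
```

With the skewed loss table, the rule only predicts "Good" where the posterior of a good orange is high enough to outweigh the cost of shipping a bad one.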
Why Learning is Possible?
Suppose we perform M independent flips of the thumbtack. The number of heads we see has a binomial distribution:
$P(\#\mathrm{Heads} = k) = \binom{M}{k} \theta^k (1 - \theta)^{M - k}$
and thus $E[\#\mathrm{Heads}] = M\theta$.
This suggests that we can estimate $\theta$ by $\#\mathrm{Heads} / M$.

Maximum Likelihood Estimation
MLE principle: learn parameters that maximize the likelihood function.
This is one of the most commonly used estimators in statistics: it is intuitively appealing and has well-studied properties.

Computing the Likelihood Function
To compute the likelihood in the thumbtack example we only require NH and NT (the number of heads and the number of tails):
$L(\theta : D) = \theta^{N_H} (1 - \theta)^{N_T}$
Applying the MLE principle we get
$\hat{\theta} = \frac{N_H}{N_H + N_T}$
NH and NT are sufficient statistics for the binomial distribution.

Sufficient Statistics
A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood. Formally, s(D) is a sufficient statistic if for any two datasets D and D':
s(D) = s(D') implies $L(\theta \mid D) = L(\theta \mid D')$.

Maximum A Posteriori (MAP)
Suppose we observe the sequence H, H. The MLE estimate is P(H) = 1, P(T) = 0.
Should we really believe that tails are impossible at this stage? Such an estimate can have disastrous effects: if we assume that P(T) = 0, then we are willing to act as though this outcome is impossible.

Laplace Correction
Suppose we observe n coin flips with k heads.
MLE: $P(H) = \frac{k}{n}$
Laplace correction: $P(H) = \frac{k + 1}{n + 2}$
As though we observed one additional H and one additional T.
Can we justify this estimate? Uniform prior!

Bayesian Reasoning
In Bayesian reasoning we represent our uncertainty about the unknown parameter $\theta$ by a probability distribution. This distribution can be viewed as a subjective probability: a personal judgment of uncertainty.

Bayesian Inference
We start with:
- $P(\theta)$ - the prior distribution over the values of $\theta$
- $P(x_1, \ldots, x_n \mid \theta)$ - the likelihood of the examples given a known value $\theta$
Given examples $x_1, \ldots, x_n$, we can compute the posterior distribution over $\theta$:
$P(\theta \mid x_1, \ldots, x_n) = \frac{P(x_1, \ldots, x_n \mid \theta)\, P(\theta)}{P(x_1, \ldots, x_n)}$
where the marginal likelihood is
$P(x_1, \ldots, x_n) = \int P(x_1, \ldots, x_n \mid \theta)\, P(\theta)\, d\theta$

Binomial Distribution: Laplace Est.
In this case the unknown parameter is $\theta = P(H)$.
Simplest prior: $P(\theta) = 1$ for $0 < \theta < 1$.
Likelihood: $P(x_1, \ldots, x_n \mid \theta) = \theta^k (1 - \theta)^{n - k}$, where k is the number of heads in the sequence.
Marginal likelihood: $P(x_1, \ldots, x_n) = \int_0^1 \theta^k (1 - \theta)^{n - k}\, d\theta$

Marginal Likelihood
Using integration by parts we have:
$\int_0^1 \theta^k (1 - \theta)^{n-k}\, d\theta = \left[ \frac{\theta^{k+1}}{k+1} (1 - \theta)^{n-k} \right]_0^1 + \frac{n-k}{k+1} \int_0^1 \theta^{k+1} (1 - \theta)^{n-k-1}\, d\theta = \frac{n-k}{k+1} \int_0^1 \theta^{k+1} (1 - \theta)^{n-k-1}\, d\theta$
Multiplying both sides by $\binom{n}{k}$, we have
$\binom{n}{k} \int_0^1 \theta^k (1 - \theta)^{n-k}\, d\theta = \binom{n}{k+1} \int_0^1 \theta^{k+1} (1 - \theta)^{n-k-1}\, d\theta$

Marginal Likelihood (cont.)
The recursion terminates when k = n:
$\binom{n}{n} \int_0^1 \theta^n\, d\theta = \frac{1}{n+1}$
Thus
$P(x_1, \ldots, x_n) = \int_0^1 \theta^k (1 - \theta)^{n-k}\, d\theta = \frac{1}{(n+1) \binom{n}{k}}$
We conclude that the posterior is
$P(\theta \mid x_1, \ldots, x_n) = (n+1) \binom{n}{k}\, \theta^k (1 - \theta)^{n-k}$

Bayesian Prediction
How do we predict using the posterior? We can think of this as computing the probability of the next element in the sequence:
$P(x_{n+1} \mid x_1, \ldots, x_n) = \int P(x_{n+1}, \theta \mid x_1, \ldots, x_n)\, d\theta = \int P(x_{n+1} \mid \theta, x_1, \ldots, x_n)\, P(\theta \mid x_1, \ldots, x_n)\, d\theta = \int P(x_{n+1} \mid \theta)\, P(\theta \mid x_1, \ldots, x_n)\, d\theta$
Assumption: if we know $\theta$, the probability of $X_{n+1}$ is independent of $X_1, \ldots, X_n$:
$P(x_{n+1} \mid \theta, x_1, \ldots, x_n) = P(x_{n+1} \mid \theta)$

Bayesian Prediction (cont.)
Thus, we conclude that
$P(x_{n+1} = H \mid x_1, \ldots, x_n) = \int \theta\, P(\theta \mid x_1, \ldots, x_n)\, d\theta = (n+1) \binom{n}{k} \int_0^1 \theta^{k+1} (1 - \theta)^{n-k}\, d\theta = \frac{(n+1) \binom{n}{k}}{(n+2) \binom{n+1}{k+1}} = \frac{k+1}{n+2}$
which is exactly the Laplace correction.
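The following short Python sketch compares the three estimates discussed in this part for an assumed sequence of flips: the MLE k/n, the Laplace correction (k+1)/(n+2), and the Bayesian predictive probability under the uniform prior, computed here by numerically integrating theta against the posterior to confirm that it matches (k+1)/(n+2). The observed sequence is a made-up example.

```python
# A minimal sketch comparing MLE, the Laplace correction, and the Bayesian
# predictive probability under a uniform prior, for an assumed flip sequence.
from math import comb

flips = "HHHTHHTH"            # assumed observation sequence
n = len(flips)
k = flips.count("H")          # number of heads (the sufficient statistic)

mle = k / n                   # maximum likelihood estimate of P(H)
laplace = (k + 1) / (n + 2)   # Laplace correction

# Bayesian predictive P(x_{n+1} = H | data) under the uniform prior, computed by
# numerically integrating theta * posterior(theta) on a grid (midpoint Riemann sum).
steps = 100_000
posterior_norm = (n + 1) * comb(n, k)   # (n+1) * C(n, k), from the closed-form posterior
predictive = 0.0
for i in range(steps):
    theta = (i + 0.5) / steps
    posterior = posterior_norm * theta**k * (1 - theta)**(n - k)
    predictive += theta * posterior / steps

print(f"n={n}, k={k}")
print(f"MLE                 : {mle:.4f}")
print(f"Laplace correction  : {laplace:.4f}")
print(f"Bayesian predictive : {predictive:.4f}  (should match (k+1)/(n+2))")
```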
Naïve Bayes

Bayesian Classification: Binary Domain
Consider the following situation:
- two classes: -1, +1
- each example is described by N attributes X1, …, XN
- each Xi is a binary variable with values 0, 1
Example dataset:

  X1  X2  …  XN   C
   0   1  …   1   +1
   1   0  …   1   -1
   1   1  …   0   +1
   …   …  …   …    …
   0   0  …   0   +1

Binary Domain - Priors
How do we estimate P(C)? Simple binomial estimation: count the number of instances with C = -1 and the number with C = +1.

Binary Domain - Attribute Probability
How do we estimate $P(X_1, \ldots, X_N \mid C)$? Two sub-problems:
- a training set for $P(X_1, \ldots, X_N \mid C = +1)$: the instances with C = +1
- a training set for $P(X_1, \ldots, X_N \mid C = -1)$: the instances with C = -1

Naïve Bayes
Naïve Bayes: assume
$P(X_1, \ldots, X_N \mid C) = P(X_1 \mid C) \cdots P(X_N \mid C)$
This is an independence assumption: each attribute Xi is independent of the other attributes once we know the value of C.

Naïve Bayes: Boolean Domain
Parameters, for each i:
- $\theta_{i|+1} = P(X_i = 1 \mid C = +1)$
- $\theta_{i|-1} = P(X_i = 1 \mid C = -1)$
How do we estimate $\theta_{1|+1}$? Simple binomial estimation: count the number of 1 and 0 values of X1 in the instances where C = +1.

Interpretation of Naïve Bayes
$\log \frac{P(+1 \mid X_1, \ldots, X_N)}{P(-1 \mid X_1, \ldots, X_N)} = \log \frac{P(X_1, \ldots, X_N \mid +1)\, P(+1)}{P(X_1, \ldots, X_N \mid -1)\, P(-1)} = \sum_i \log \frac{P(X_i \mid +1)}{P(X_i \mid -1)} + \log \frac{P(+1)}{P(-1)}$

Interpretation of Naïve Bayes (cont.)
Each Xi "votes" about the prediction:
- if $P(X_i \mid C = -1) = P(X_i \mid C = +1)$ then Xi has no say in the classification
- if $P(X_i \mid C = -1) = 0$ then Xi overrides all other votes ("veto")

Interpretation of Naïve Bayes (cont.)
Set
$w_i = \log \frac{P(X_i = 1 \mid +1)}{P(X_i = 1 \mid -1)} - \log \frac{P(X_i = 0 \mid +1)}{P(X_i = 0 \mid -1)}$
$b = \log \frac{P(+1)}{P(-1)} + \sum_i \log \frac{P(X_i = 0 \mid +1)}{P(X_i = 0 \mid -1)}$
Classification rule: $\mathrm{sign}\left(b + \sum_i w_i x_i\right)$
(A sketch of this linear form appears at the end of this part.)

Normal Distribution
The Gaussian distribution $X \sim N(\mu, \sigma^2)$:
$p(x) = \frac{1}{\sqrt{2\pi}\, \sigma}\, e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2}$
[Figure: the densities of N(0, 1^2) and N(4, 2^2).]

Maximum Likelihood Estimate
Suppose we observe $x_1, \ldots, x_M$. Simple calculations show that the MLE is
$\hat{\mu} = \frac{1}{M} \sum_m x_m$ and $\hat{\sigma}^2 = \frac{1}{M} \sum_m (x_m - \hat{\mu})^2$
The sufficient statistics are $\sum_m x_m$ and $\sum_m x_m^2$.

Naïve Bayes with Gaussian Distributions
Recall $P(X_1, \ldots, X_N \mid C) = P(X_1 \mid C) \cdots P(X_N \mid C)$.
Assume $P(X_i \mid C) \sim N(\mu_{i,C}, \sigma_i^2)$: the mean of Xi depends on the class, but the variance of Xi does not.
[Figure: two Gaussians with means $\mu_{i,-1}$ and $\mu_{i,+1}$ and a common variance.]

Naïve Bayes with Gaussian Distributions (cont.)
Recall
$\log \frac{P(+1 \mid X_1, \ldots, X_N)}{P(-1 \mid X_1, \ldots, X_N)} = \log \frac{P(+1)}{P(-1)} + \sum_i \log \frac{P(X_i \mid +1)}{P(X_i \mid -1)}$
With the Gaussian assumption,
$\log \frac{P(X_i \mid +1)}{P(X_i \mid -1)} = \frac{\mu_{i,+1} - \mu_{i,-1}}{\sigma_i^2} \left( X_i - \frac{\mu_{i,+1} + \mu_{i,-1}}{2} \right)$
i.e., the (scaled) distance between the class means times the distance of Xi from the midway point between them.

Different Variances?
If we allow different variances, the classification rule is more complex: the term $\log \frac{P(X_i \mid +1)}{P(X_i \mid -1)}$ is quadratic in Xi.
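The following Python sketch puts the pieces of this part together: it estimates Bernoulli naïve Bayes parameters with the Laplace correction (a choice made here to avoid taking the log of zero; the slides motivate the correction earlier) and rewrites the classifier in the linear form sign(b + sum_i w_i x_i) derived above. The tiny training set is made up for illustration.

```python
# A minimal sketch of naïve Bayes over binary attributes, written in the linear
# form sign(b + sum_i w_i x_i). The dataset is an assumed, made-up example;
# parameters use the Laplace correction so no probability is exactly 0 or 1.
import math

# Assumed training data: rows of (x_1, ..., x_N), with class labels in {+1, -1}.
X = [(0, 1, 1), (1, 0, 1), (1, 1, 0), (0, 0, 0), (1, 1, 1), (0, 0, 1)]
y = [+1, -1, +1, +1, -1, +1]
N = len(X[0])

def estimate(X, y):
    """Laplace-corrected estimates of P(C) and theta_{i|c} = P(X_i = 1 | C = c)."""
    classes = (+1, -1)
    count = {c: sum(1 for label in y if label == c) for c in classes}
    prior = {c: (count[c] + 1) / (len(y) + 2) for c in classes}
    theta = {c: [(sum(x[i] for x, label in zip(X, y) if label == c) + 1)
                 / (count[c] + 2) for i in range(N)]
             for c in classes}
    return prior, theta

def linear_form(prior, theta):
    """Convert the naïve Bayes parameters into weights w_i and bias b."""
    w = [math.log(theta[+1][i] / theta[-1][i])
         - math.log((1 - theta[+1][i]) / (1 - theta[-1][i])) for i in range(N)]
    b = math.log(prior[+1] / prior[-1]) + sum(
        math.log((1 - theta[+1][i]) / (1 - theta[-1][i])) for i in range(N))
    return w, b

prior, theta = estimate(X, y)
w, b = linear_form(prior, theta)

def predict(x):
    """Classify by the sign of the linear score b + sum_i w_i x_i."""
    return +1 if b + sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

print("weights:", [round(wi, 3) for wi in w], "bias:", round(b, 3))
print("prediction for (1, 1, 1):", predict((1, 1, 1)))
```

Writing the classifier this way makes the "voting" interpretation concrete: each attribute contributes its weight when it is 1, and the bias absorbs the prior together with the all-zeros contribution.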