Bayesian Learning
Rong Jin

Outline
• MAP learning vs. ML learning
• Minimum description length principle
• Bayes optimal classifier
• Bagging

Maximum Likelihood Learning (ML)
• Find the best model by maximizing the log-likelihood of the training data: $h_{ML} = \arg\max_{h} \log \Pr(D \mid h)$

Maximum A Posteriori Learning (MAP)
• ML learning
  • Models are determined by the training data alone
  • Unable to incorporate prior knowledge/preference about models
• Maximum a posteriori learning (MAP)
  • Knowledge/preference is incorporated through a prior: $h_{MAP} = \arg\max_{h} \Pr(h \mid D) = \arg\max_{h} \Pr(D \mid h) \Pr(h)$
  • The prior $\Pr(h)$ encodes the knowledge/preference

MAP
• Uninformative prior: regularized logistic regression (a zero-mean Gaussian prior on the weights, for example, gives L2-regularized logistic regression)

MAP
Consider text categorization
• $w_i$: importance of the i-th word in classification
• Prior knowledge: the more common the word, the less important it is
• How can a prior be constructed from this prior knowledge?

MAP
• An informative prior for text categorization
• [symbol missing]$_i$: the number of occurrences of the i-th word in the training data

MAP
Two correlated classification tasks: $C_1$ and $C_2$
• How can an appropriate prior be introduced to capture this prior knowledge?

MAP
• Construct priors that capture the dependence between $w_1$ and $w_2$

Minimum Description Length (MDL) Principle
• Occam's razor: prefer a simple hypothesis
• A simple hypothesis has a short description length
• Minimum description length: $h_{MDL} = \arg\min_{h \in H} \left[ L_{C_1}(h) + L_{C_2}(D \mid h) \right]$
  • $L_{C_1}(h)$: bits for encoding hypothesis h
  • $L_{C_2}(D \mid h)$: bits for encoding data given h
• $L_C(x)$ is the description length for message x under coding scheme C

MDL
[Figure: a sender must transmit the training data D to a receiver. Send only D? Send only h? Send h plus D given h?]

Example: Decision Tree
H = decision trees, D = training data labels
• $L_{C_1}(h)$ is the number of bits to describe tree h
• $L_{C_2}(D \mid h)$ is the number of bits to describe D given tree h
• $L_{C_2}(D \mid h) = 0$ if the examples are classified perfectly by h
• Only the exceptions need to be described
• $h_{MDL}$ trades off tree size for training errors

MAP vs. MDL
• MAP learning: $h_{MAP} = \arg\max_{h} \Pr(D \mid h) \Pr(h) = \arg\min_{h} \left[ -\log_2 \Pr(D \mid h) - \log_2 \Pr(h) \right]$
• MDL learning: $h_{MDL} = \arg\min_{h} \left[ L_{C_1}(h) + L_{C_2}(D \mid h) \right]$

Problems with Maximum Approaches
Consider three possible hypotheses:
$\Pr(h_1 \mid D) = 0.4$, $\Pr(h_2 \mid D) = 0.3$, $\Pr(h_3 \mid D) = 0.3$
• Maximum approaches will pick $h_1$
• Given a new instance x: $h_1(x) = +$, $h_2(x) = -$, $h_3(x) = -$
• Maximum approaches will output +
• However, is this the most probable result?

Bayes Optimal Classifier (Bayesian Average)
Bayes optimal classification: $c^{*} = \arg\max_{c} \sum_{h \in H} \Pr(h \mid D) \Pr(c \mid h, x)$
Example:
$\Pr(h_1 \mid D) = 0.4$, $\Pr(+ \mid h_1, x) = 1$, $\Pr(- \mid h_1, x) = 0$
$\Pr(h_2 \mid D) = 0.3$, $\Pr(+ \mid h_2, x) = 0$, $\Pr(- \mid h_2, x) = 1$
$\Pr(h_3 \mid D) = 0.3$, $\Pr(+ \mid h_3, x) = 0$, $\Pr(- \mid h_3, x) = 1$
$\sum_{h} \Pr(h \mid D) \Pr(+ \mid h, x) = 0.4$, $\sum_{h} \Pr(h \mid D) \Pr(- \mid h, x) = 0.6$
The most probable class is −, not the + output by the single most probable hypothesis $h_1$.

Computational Issues
• Need to sum over all possible hypotheses
• This is expensive or impossible when the hypothesis space is large
  • E.g., decision trees
• Solution: sampling!

Gibbs Classifier
Gibbs algorithm
1. Choose one hypothesis at random, according to $p(h \mid D)$
2. Use this hypothesis to classify the new instance
• Surprising fact: $E[err_{Gibbs}] \le 2\, E[err_{BayesOptimal}]$
• Improve by sampling multiple hypotheses from $p(h \mid D)$ and averaging their classification results

Bagging Classifiers
• In general, sampling from $p(h \mid D)$ is difficult
  • $p(h \mid D)$ is difficult to compute
  • $p(h \mid D)$ is impossible to compute for non-probabilistic classifiers such as SVMs
• Bagging classifiers:
  • Realize sampling from $p(h \mid D)$ by sampling training examples

Bootstrap Sampling
Bagging = Bootstrap aggregating
• Bootstrap sampling: given a set D containing m training examples
• Create $D_i$ by drawing m examples at random with replacement from D
• Each $D_i$ is expected to leave out about 0.37 of the examples in D, since an example is missed by all m draws with probability $(1 - 1/m)^m \approx e^{-1} \approx 0.37$

Bagging Algorithm
• Create k bootstrap samples $D_1, D_2, \ldots, D_k$
• Train a distinct classifier $h_i$ on each $D_i$
• Classify a new instance by a classifier vote with equal weights

Bagging ≈ Bayesian Average
[Figure: two parallel pipelines. Bayesian average: D → $P(h \mid D)$ → sampling → $h_1, h_2, \ldots, h_k$ → $\sum_i \Pr(c \mid h_i, x)$. Bagging: D → bootstrap sampling → $D_1, D_2, \ldots, D_k$ → $h_1, h_2, \ldots, h_k$ → $\sum_i \Pr(c \mid h_i, x)$.]
Bootstrap sampling is almost equivalent to sampling from the posterior $P(h \mid D)$.

Empirical Study of Bagging
Bagging decision trees
• Bootstrap 50 different samples from the original training data
• Learn a decision tree over each bootstrap sample
• Predict the class labels for test instances by the majority vote of the 50 decision trees
• Bagged decision trees outperform a single decision tree

Bias-Variance Tradeoff
Why does bagging work better than a single classifier?
• Real-valued case
• $y = f(x) + \epsilon$, $\epsilon \sim N(0, \sigma^2)$
• $\hat{f}(x \mid D)$ is a predictor learned from training data D
• $E_{D,\epsilon}\left[ (y - \hat{f}(x \mid D))^2 \right] = \sigma^2 + \left( f(x) - E_D[\hat{f}(x \mid D)] \right)^2 + E_D\left[ \left( \hat{f}(x \mid D) - E_D[\hat{f}(x \mid D)] \right)^2 \right]$
  • Irreducible variance: $\sigma^2$
  • Model bias (second term): the simpler the $\hat{f}(x \mid D)$, the larger the bias
  • Model variance (third term): the simpler the $\hat{f}(x \mid D)$, the smaller the variance

Bagging
• Bagging performs better than a single classifier because it effectively reduces the model variance
[Figure: bias and variance of a single decision tree compared with bagged decision trees]
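
Worked example: maximum approach vs. Bayes optimal classifier
The three-hypothesis example from the "Problems with Maximum Approaches" and "Bayes Optimal Classifier" slides can be checked numerically. The following is a minimal sketch, not part of the original slides; the function names `map_prediction` and `bayes_optimal_prediction` and the dictionary layout are illustrative assumptions. Only the probabilities are taken from the slides.

```python
# Posterior Pr(h | D) over the three hypotheses, as given in the slides.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}

# Pr(c | h, x) for the new instance x: h1 predicts "+", h2 and h3 predict "-".
class_given_h = {
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}


def map_prediction(posterior, class_given_h):
    """Pick the single most probable hypothesis and return its preferred class."""
    best_h = max(posterior, key=posterior.get)
    return max(class_given_h[best_h], key=class_given_h[best_h].get)


def bayes_optimal_prediction(posterior, class_given_h):
    """Average Pr(c | h, x) over hypotheses, weighted by Pr(h | D)."""
    scores = {}
    for h, p_h in posterior.items():
        for c, p_c in class_given_h[h].items():
            scores[c] = scores.get(c, 0.0) + p_h * p_c
    return max(scores, key=scores.get), scores


map_label = map_prediction(posterior, class_given_h)
bayes_label, bayes_scores = bayes_optimal_prediction(posterior, class_given_h)
print("maximum approach:", map_label)
print("Bayes optimal:", bayes_label, bayes_scores)
```

The maximum approach follows h1 alone, while the weighted sum gives class "-" a total of 0.6 against 0.4 for "+", which is why the two approaches disagree on this instance.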
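
Worked example: how much a bootstrap sample leaves out
The "about 0.37" figure on the bootstrap sampling slide follows from the chance that one example is missed by all m draws, $(1 - 1/m)^m$, which tends to $e^{-1}$ as m grows. The sketch below compares that expression with a single simulated bootstrap sample. The helper names, the sample size of 10,000, and the seed are assumptions chosen for illustration.

```python
import math
import random


def expected_left_out_fraction(m):
    """Probability that a given example is never drawn in m draws with replacement."""
    return (1.0 - 1.0 / m) ** m


def bootstrap_sample(data, rng):
    """Draw len(data) examples uniformly at random with replacement."""
    m = len(data)
    return [data[rng.randrange(m)] for _ in range(m)]


def simulated_left_out_fraction(m, rng):
    """Fraction of the m original examples missing from one bootstrap sample."""
    data = list(range(m))
    drawn = set(bootstrap_sample(data, rng))
    return 1.0 - len(drawn) / m


rng = random.Random(0)  # fixed seed, chosen arbitrarily for repeatability
exact = expected_left_out_fraction(10_000)
simulated = simulated_left_out_fraction(10_000, rng)
limit = math.exp(-1)  # limit of (1 - 1/m)^m as m grows
print(f"exact: {exact:.4f}  simulated: {simulated:.4f}  limit: {limit:.4f}")
```

The simulated value varies with the seed, but for a sample of this size it should stay close to the limit.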
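
Worked example: the bagging algorithm
The three steps on the "Bagging Algorithm" slide (create k bootstrap samples, train one classifier per sample, vote with equal weights) can be sketched as follows. This is an illustration under stated assumptions, not the setup used in the empirical study: the base learner is a hypothetical one-dimensional threshold classifier (`train_stump`) standing in for a decision tree, and the toy data, k = 50, and the seed are made up for the example.

```python
import random
from collections import Counter


def train_stump(data):
    """Fit a 1-D threshold rule: predict `right` if x > t, else `left`.

    Tries every observed x as threshold t and every label pair, and keeps
    the first rule with the fewest training errors.
    """
    best = None
    thresholds = sorted({x for x, _ in data})
    labels = sorted({y for _, y in data})
    for t in thresholds:
        for left in labels:
            for right in labels:
                errors = sum((right if x > t else left) != y for x, y in data)
                if best is None or errors < best[0]:
                    best = (errors, t, left, right)
    _, t, left, right = best
    return lambda x: right if x > t else left


def bagging(data, k, rng):
    """Train k classifiers on bootstrap samples; predict by equal-weight vote."""
    m = len(data)
    classifiers = []
    for _ in range(k):
        # Bootstrap sample D_i: m draws with replacement from D.
        sample = [data[rng.randrange(m)] for _ in range(m)]
        classifiers.append(train_stump(sample))

    def predict(x):
        votes = Counter(h(x) for h in classifiers)
        return votes.most_common(1)[0][0]

    return predict


# Toy training set (an assumption): label "-" for x < 10, "+" otherwise.
data = [(x, "-" if x < 10 else "+") for x in range(20)]
predict = bagging(data, k=50, rng=random.Random(0))
print(predict(0), predict(19))
```

Any base learner with the same train-then-predict shape could replace `train_stump`; the voting step does not depend on how each classifier was fit.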