Bayesian Learning
Rong Jin
Outline
• MAP learning vs. ML learning
• Minimum description length principle
• Bayes optimal classifier
• Bagging
Maximum Likelihood Learning (ML)
• Find the best model by maximizing the log-likelihood of the training data
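As a reference formulation (standard notation, not taken verbatim from the slide), with training data D = {x_1, …, x_m} and model h:

h_{ML} = \arg\max_h \log \Pr(D \mid h) = \arg\max_h \sum_{i=1}^{m} \log \Pr(x_i \mid h)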
Maximum A Posteriori Learning (MAP)
• ML learning
  • Models are determined solely by the training data
  • Unable to incorporate prior knowledge/preference about models
• Maximum a posteriori learning (MAP)
  • Knowledge/preference is incorporated through a prior
  • The prior encodes the knowledge/preference
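For reference, the MAP criterion in its standard form; the prior Pr(h) is where the knowledge/preference enters:

h_{MAP} = \arg\max_h \Pr(h \mid D) = \arg\max_h \Pr(D \mid h)\,\Pr(h)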
MAP
• Uninformative prior: regularized logistic regression
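One standard way this arises (the slide's exact formulation may differ): a zero-mean Gaussian prior on the weight vector w of a logistic regression model turns MAP estimation into L2-regularized logistic regression:

\Pr(\mathbf{w}) \propto \exp\left(-\tfrac{\lambda}{2}\,\|\mathbf{w}\|^2\right)
\;\Rightarrow\;
\mathbf{w}_{MAP} = \arg\max_{\mathbf{w}} \sum_{i=1}^{m} \log \Pr(y_i \mid \mathbf{x}_i, \mathbf{w}) - \tfrac{\lambda}{2}\,\|\mathbf{w}\|^2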
MAP
Consider text categorization
• w_i: importance of the i-th word in classification
• Prior knowledge: the more common the word, the less important it is
• How to construct a prior according to the prior knowledge?
MAP
• An informative prior for text categorization
• n_i: the number of occurrences of the i-th word in the training data; one possible prior built from n_i is sketched below
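One illustrative construction, using the n_i notation above (the slide's exact prior may differ): give each weight w_i a zero-mean Gaussian prior whose variance shrinks as n_i grows, so more frequent words are pulled more strongly toward zero:

\Pr(\mathbf{w}) \propto \prod_i \exp\left(-\tfrac{n_i}{2}\, w_i^2\right),
\qquad
\log \Pr(\mathbf{w}) = -\sum_i \tfrac{n_i}{2}\, w_i^2 + \text{const}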
MAP
Two correlated classification tasks: C1 and C2
• How to introduce an appropriate prior to capture this prior knowledge?
MAP
• Construct priors to capture the dependence between w_1 and w_2, as sketched below
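An illustrative choice of such a prior (not necessarily the slide's): a Gaussian that penalizes the difference between the two tasks' weight vectors, so solutions where w_1 and w_2 agree are preferred:

\Pr(\mathbf{w}_1, \mathbf{w}_2) \propto \exp\left(-\tfrac{\lambda}{2}\,\|\mathbf{w}_1 - \mathbf{w}_2\|^2\right)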
Minimum Description Length (MDL) Principle
• Occam’s razor: prefer a simple hypothesis
• Simple hypothesis → short description length
• Minimum description length:
  h_MDL = argmin_h [ L_C1(h) + L_C2(D|h) ]
  where L_C1(h) is the number of bits for encoding hypothesis h, and L_C2(D|h) is the number of bits for encoding the data given h
• L_C(x) is the description length for message x under coding scheme C
MDL
[Diagram: a sender must transmit the training labels D to a receiver]
• Send only D?
• Send only h?
• Send h + D|h (h plus the exceptions)?
Example: Decision Tree
H = decision trees, D = training data labels
• L_C1(h) is # bits to describe tree h
• L_C2(D|h) is # bits to describe D given tree h
  • L_C2(D|h) = 0 if examples are classified perfectly by h
  • Only need to describe the exceptions
h_MDL trades off tree size for training errors
MAP vs. MDL
• MAP learning: h_MAP = argmax_h Pr(D | h) Pr(h)
• MDL learning: h_MDL = argmin_h [ L_C1(h) + L_C2(D|h) ]
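For reference, the standard connection between the two, using the fact that an optimal code assigns an event of probability p a code of length -log2 p bits:

h_{MAP} = \arg\max_h \Pr(D \mid h)\,\Pr(h)
        = \arg\min_h \left[ -\log_2 \Pr(D \mid h) - \log_2 \Pr(h) \right]

so MAP learning coincides with MDL learning when L_C1(h) = -log2 Pr(h) and L_C2(D|h) = -log2 Pr(D|h).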
Problems with Maximum Approaches
Consider three possible hypotheses:
Pr(h1 | D) = 0.4, Pr(h2 | D) = 0.3, Pr(h3 | D) = 0.3
Maximum approaches will pick h1
Given a new instance x:
h1(x) = +, h2(x) = -, h3(x) = -
Maximum approaches will output +
However, is this the most probable result?
Bayes Optimal Classifier (Bayesian Average)
Bayes optimal classification:
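For reference, the standard Bayes optimal classification rule, which averages over all hypotheses instead of committing to a single one:

c^{*} = \arg\max_{c \in C} \sum_{h \in H} \Pr(c \mid h, x)\,\Pr(h \mid D)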
Example:
Pr(h1 | D) = 0.4, Pr(+ | h1, x) = 1, Pr(- | h1, x) = 0
Pr(h2 | D) = 0.3, Pr(+ | h2, x) = 0, Pr(- | h2, x) = 1
Pr(h3 | D) = 0.3, Pr(+ | h3, x) = 0, Pr(- | h3, x) = 1
Σ_h Pr(h | D) Pr(+ | h, x) = 0.4,  Σ_h Pr(h | D) Pr(- | h, x) = 0.6
The most probable class is -
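A minimal Python sketch of this computation, using the posteriors and per-hypothesis class probabilities from the example above:

# Bayes optimal classification for the three-hypothesis example above.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}      # Pr(h | D)
class_probs = {                                      # Pr(c | h, x)
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}

def bayes_optimal(posteriors, class_probs, classes=("+", "-")):
    # Weight each hypothesis's class probabilities by its posterior and sum.
    scores = {c: sum(posteriors[h] * class_probs[h][c] for h in posteriors) for c in classes}
    return max(scores, key=scores.get), scores

print(bayes_optimal(posteriors, class_probs))   # ('-', {'+': 0.4, '-': 0.6})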
Computational Issues
• Need to sum over all possible hypotheses
• It is expensive or impossible when the hypothesis space is large
  • E.g., decision trees
• Solution: sampling!
Gibbs Classifier
Gibbs algorithm
1. Choose one hypothesis at random, according to p(h|D)
2. Use this hypothesis to classify new instance
• Surprising fact: E[err_Gibbs] ≤ 2 · E[err_BayesOptimal]
• Improve by sampling multiple hypotheses from p(h|D) and averaging their classification results
Bagging Classifiers
• In general, sampling from P(h|D) is difficult
  • P(h|D) is difficult to compute
  • P(h|D) is impossible to compute for non-probabilistic classifiers such as SVM
• Bagging classifiers:
  • Realize sampling from P(h|D) by sampling training examples
Bootstrap Sampling
Bagging = Bootstrap aggregating
• Bootstrap sampling: given a set D containing m training examples
  • Create Di by drawing m examples at random with replacement from D
  • Each Di is expected to leave out about 37% of the examples in D, since the probability that a given example is never drawn is (1 - 1/m)^m ≈ e^(-1) ≈ 0.37
Bagging Algorithm
• Create k bootstrap samples D1, D2, …, Dk
• Train a distinct classifier hi on each Di
• Classify a new instance by classifier vote with equal weights (see the sketch below)
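A minimal Python sketch of the algorithm; the base learner (scikit-learn's DecisionTreeClassifier) and the assumption of integer class labels are illustrative choices, not taken from the slide:

import numpy as np
from sklearn.tree import DecisionTreeClassifier  # illustrative base learner

def bagging_fit(X, y, k=50, seed=0):
    """Train k classifiers, each on a bootstrap sample of (X, y) (numpy arrays)."""
    rng = np.random.default_rng(seed)
    m = len(X)
    models = []
    for _ in range(k):
        idx = rng.integers(0, m, size=m)               # m draws with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Classify each instance by an equal-weight majority vote of the k classifiers."""
    votes = np.stack([h.predict(X) for h in models])   # shape (k, n_instances)
    # Assumes integer class labels 0..C-1 so bincount can tally the votes.
    return np.array([np.bincount(col).argmax() for col in votes.T])

# Usage (hypothetical data): models = bagging_fit(X_train, y_train, k=50)
#                            y_pred = bagging_predict(models, X_test)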
Bagging ≈ Bayesian Average
[Diagram: Bayesian averaging samples hypotheses h1, h2, …, hk from the posterior P(h|D) and averages Σ_i Pr(c | hi, x); bagging draws bootstrap samples D1, D2, …, Dk from D, trains h1, h2, …, hk on them, and averages Σ_i Pr(c | hi, x)]
• Bootstrap sampling is almost equivalent to sampling from the posterior P(h|D)
Empirical Study of Bagging
Bagging decision trees
• Bootstrap 50 different samples from the original training data
• Learn a decision tree over each bootstrap sample
• Predict the class labels for test instances by the majority vote of the 50 decision trees
• The bagged decision trees outperform a single decision tree
Bias-Variance Tradeoff
Why does bagging work better than a single classifier?
• Real-valued case
  • y ~ f(x) + ε, ε ~ N(0, σ²)
  • h(x|D) is a predictor learned from training data D
• The expected squared error splits into three terms (see the decomposition below):
  • Irreducible variance
  • Model bias: the simpler the h(x|D), the larger the bias
  • Model variance: the simpler the h(x|D), the smaller the variance
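The decomposition these labels refer to is the standard bias-variance identity, written here with the h(x|D) notation above; the expectation is over the draw of D and the noise ε:

E_{D,\epsilon}\left[(y - h(x \mid D))^2\right]
  = \sigma^2                                                          % irreducible variance
  + \left(f(x) - E_D[h(x \mid D)]\right)^2                            % (model bias)^2
  + E_D\left[\left(h(x \mid D) - E_D[h(x \mid D)]\right)^2\right]     % model variance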
Bagging
• Bagging performs better than a single classifier because it effectively reduces the model variance
[Figure: bias and variance of a single decision tree vs. a bagged decision tree]