EP in practice

Bayesian Learning for Conditional Models
Alan Qi
MIT CSAIL
September, 2005
Joint work with T. Minka, Z. Ghahramani, M. Szummer,
and R. W. Picard
Motivation
• Two types of graphical models: generative and
conditional
• Conditional models:
– Make no assumptions about data generation
– Enable the use of flexible features
• Learning conditional models: estimating
(distributions of) model parameters
– Maximum likelihood approaches: overfitting
– Bayesian learning
Outline
• Background
– Conditional models for independent and relational data
classification
– Bayesian learning
• Bayesian classification and Predictive ARD
– Feature selection
– Fast kernel learning
• Bayesian conditional random fields
– Contextual object recognition/Segmentation
• Conclusions
Outline
• Background
– Conditional models
– Bayesian learning
• Bayesian classification and Predictive ARD
• Bayesian conditional random fields
• Conclusions
Graphical Models
Conditional models
- Logistic/Probit regression
- Classification of independent data

[Figure: graphical model with labels t1, t2, inputs x1, x2, and shared parameters w]

p(t | x, w) = ∏_i p(t_i | x_i, w)

Conditional random fields
- Model relational data, such as natural language and images

[Figure: graphical model with coupled labels t1, t2, inputs x1, x2, and parameters w]

p(t | x, w) = (1/Z(w)) ∏_a g_a(x, t, w)
Bayesian learning
Simple: Given prior distributions and data likelihoods, estimate the posterior distributions of model parameters or the predictive posterior of a new data point.
Difficult: calculating the posterior distributions in practice.
– Randomized methods: Markov Chain Monte Carlo, Importance Sampling
– Deterministic approximation: Variational methods, Expectation Propagation
Outline
• Background
• Bayesian classification and Predictive ARD
– Feature selection
– Fast kernel learning
• Bayesian conditional random fields
• Conclusions
Goal
Task 1: Classify high-dimensional datasets with
many irrelevant features, e.g., normal vs.
cancer microarray data.
Task 2: Sparse Bayesian kernel classifiers for
fast test performance.
Part 1: Roadmap
• Automatic relevance determination (ARD)
• Risk of overfitting when optimizing hyperparameters
• Predictive ARD by expectation propagation (EP):
– Approximate prediction error
– EP approximation
• Experiments
• Conclusions
Bayesian Classification Model
Labels: t    Inputs: X    Parameters: w
Likelihood for the data set:
p(t | X, w) = ∏_i Φ(t_i wᵀφ(x_i))
Prior of the classifier w:
p(w | α) = ∏_j N(w_j; 0, α_j⁻¹)
where Φ(·) is the cumulative distribution function of a standard Gaussian.
Evidence and Predictive Distribution
The evidence, i.e., the marginal likelihood of the hyperparameters α:
p(t | X, α) = ∫ p(t | X, w) p(w | α) dw
The predictive posterior distribution of the label t_{N+1} for a new input x_{N+1}:
p(t_{N+1} | x_{N+1}, t) = ∫ p(t_{N+1} | x_{N+1}, w) p(w | t) dw
Automatic Relevance Determination (ARD)
• Give the classifier weights independent Gaussian priors whose variance, 1/α_j, controls how far away from zero each weight is allowed to go:
p(w_j | α_j) = N(w_j; 0, α_j⁻¹)
• Maximize p(t | X, α), the marginal likelihood of the model, with respect to α.
• Outcome: many elements of α go to infinity, which naturally prunes irrelevant features in the data.
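The pruning mechanism above can be sketched concretely for a Bayesian *linear* model, where evidence maximization has closed-form (MacKay) updates; the probit classifier on these slides needs EP or a Laplace approximation instead. The function name, priors, and thresholds below are illustrative assumptions, not from the talk:

```python
import numpy as np

def ard_linear(X, t, n_iter=50, noise_var=0.01, prune=1e6):
    """Type-II ML (evidence maximization) ARD for a Bayesian linear model.
    Each weight w_j has prior N(0, 1/alpha_j); the alphas of irrelevant
    features are driven towards infinity, pruning those features."""
    n, d = X.shape
    alpha = np.ones(d)
    beta = 1.0 / noise_var                                   # noise precision
    for _ in range(n_iter):
        Sigma = np.linalg.inv(np.diag(alpha) + beta * X.T @ X)  # posterior cov
        m = beta * Sigma @ X.T @ t                              # posterior mean
        gamma = 1.0 - alpha * np.diag(Sigma)        # "well-determinedness" of w_j
        alpha = gamma / np.maximum(m ** 2, 1e-12)   # MacKay re-estimation
    keep = alpha < prune                            # surviving (relevant) features
    return m, alpha, keep
```

On data generated with only a few relevant features, the alphas of the irrelevant columns grow by orders of magnitude while the relevant weights are left essentially unshrunk, which is exactly the pruning behavior the slide describes.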
Two Types of Overfitting
• Classical Maximum likelihood:
– Optimizing the classifier weights w can directly fit
noise in the data, resulting in a complicated model.
• Type II Maximum likelihood (ARD):
– Optimizing the hyperparameters corresponds to
choosing which variables are irrelevant. Choosing
one out of exponentially many models can also
overfit if we maximize the model marginal
likelihood.
Risk of Optimizing 
Evd-ARD-1
Bayes Point
Evd-ARD-2
X: Class 1 vs O: Class 2
Predictive-ARD
• Choosing the model with the best estimated predictive performance instead of the most probable model.
• Expectation propagation (EP) estimates the leave-one-out predictive performance without performing any expensive cross-validation.
Estimate Predictive Performance
• Predictive posterior given a test data point x_{N+1}:
p(t_{N+1} | x_{N+1}, t) = ∫ p(t_{N+1} | x_{N+1}, w) p(w | t) dw
• EP can estimate the predictive leave-one-out error probability:
(1/N) ∑_{i=1}^N [1 − p(t_i | x_i, t_{\i})] ≈ (1/N) ∑_{i=1}^N [1 − ∫ p(t_i | x_i, w) q(w | t_{\i}) dw]
• where q(w | t_{\i}) is the approximate posterior obtained by leaving out the ith label.
• EP can also estimate the predictive leave-one-out error count:
LOO = (1/N) ∑_{i=1}^N I( p(t_i | x_i, t_{\i}) < 1/2 )
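Once EP has supplied the leave-one-out predictive probabilities p(t_i | x_i, t_{\i}), both estimators are plain averages; a minimal sketch with made-up probabilities (the function names are illustrative):

```python
import numpy as np

def loo_error_probability(p_loo):
    """Average leave-one-out error probability: mean of 1 - p(t_i | x_i, t_\\i)."""
    return np.mean(1.0 - np.asarray(p_loo, dtype=float))

def loo_error_count(p_loo):
    """Fraction of points whose LOO predictive probability of the true
    label falls below 1/2, i.e. the estimated LOO error count / N."""
    return np.mean(np.asarray(p_loo, dtype=float) < 0.5)

# hypothetical LOO predictive probabilities for 5 training points
p = [0.9, 0.8, 0.4, 0.95, 0.6]
print(loo_error_probability(p))  # ≈ 0.27
print(loo_error_count(p))        # 0.2 (only the 0.4 point is misclassified)
```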
Expectation Propagation in a Nutshell
• Approximate a probability distribution by simpler parametric terms:
p(w | t) ∝ ∏_i f_i(w), with f_i(w) = Φ(t_i wᵀφ(x_i))
q(w) ∝ ∏_i f̃_i(w)
• Each approximation term f̃_i(w) lives in an exponential family (e.g. Gaussian)
EP in a Nutshell
Three key steps:
• Deletion step: approximate the “leave-one-out” predictive posterior for the ith point:
q^{\i}(w) ∝ q(w) / f̃_i(w) = ∏_{j≠i} f̃_j(w)
• Moment matching: minimize the following KL divergence:
f̃_i(w) = arg min_{f̃_i} KL( f_i(w) q^{\i}(w) || f̃_i(w) q^{\i}(w) )
• Inclusion: q(w) ∝ f̃_i(w) q^{\i}(w)
The key observation: we can use the approximate predictive posterior, obtained in the deletion step, for model selection. No extra computation!
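The three steps can be run directly for a toy scalar-weight probit model. This sketch is my own illustration (the quadrature-based moment matching and all names are assumptions, not the talk's implementation): Gaussian site terms are kept in natural parameters, and the tilted moments are matched numerically on a grid.

```python
import numpy as np
from math import erf

def Phi(z):
    """Standard Gaussian CDF (vectorized via math.erf)."""
    return 0.5 * (1.0 + np.vectorize(erf)(z / np.sqrt(2.0)))

def ep_probit_scalar(a, prior_var=1.0, n_sweeps=10):
    """EP for p(w | t) ∝ N(w; 0, prior_var) ∏_i Φ(a_i w), with scalar w.
    Each exact term f_i(w) = Φ(a_i w) gets a Gaussian approximation f̃_i
    with natural parameters (tau_i, nu_i)."""
    a = np.asarray(a, float)
    tau = np.zeros(len(a))              # site precisions
    nu = np.zeros(len(a))               # site precision * site mean
    w = np.linspace(-8.0, 8.0, 4001)    # quadrature grid
    dw = w[1] - w[0]
    for _ in range(n_sweeps):
        for i in range(len(a)):
            # Deletion: cavity q^\i = prior * all sites except i
            tau_c = 1.0 / prior_var + tau.sum() - tau[i]
            nu_c = nu.sum() - nu[i]
            m_c, v_c = nu_c / tau_c, 1.0 / tau_c
            # Moment matching: moments of f_i(w) q^\i(w) via quadrature
            tilted = np.exp(-0.5 * (w - m_c) ** 2 / v_c) * Phi(a[i] * w)
            Z = tilted.sum() * dw
            m_t = (w * tilted).sum() * dw / Z
            v_t = ((w - m_t) ** 2 * tilted).sum() * dw / Z
            # Inclusion: reset site i so q matches the tilted moments
            tau[i] = 1.0 / v_t - tau_c
            nu[i] = m_t / v_t - nu_c
    tau_q = 1.0 / prior_var + tau.sum()
    return nu.sum() / tau_q, 1.0 / tau_q    # posterior mean and variance
```

Here a_i stands in for t_i x_i with one-dimensional inputs; with vector w the same three steps run with full (or factorized) Gaussian moments, and the cavity computed in the deletion step is exactly the leave-one-out posterior used for model selection above.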
Comparison of different model selection criteria for ARD training
The estimated leave-one-out error probabilities and counts are better correlated with the test error than evidence and sparsity level.
[Figure rows:]
• 1st row: Test error
• 2nd row: Estimated leave-one-out error probability
• 3rd row: Estimated leave-one-out error counts
• 4th row: Evidence (model marginal likelihood)
• 5th row: Fraction of selected features
Gene Expression Classification
Task: Classify gene expression datasets into different categories, e.g., normal vs. cancer.
Challenge: Thousands of genes are measured in the microarray data. Only a small subset of genes is likely to be correlated with the classification task.
Classifying Leukemia Data
• The task: distinguish acute
myeloid leukemia (AML) from
acute lymphoblastic leukemia
(ALL).
• The dataset: 47 and 25 samples
of type ALL and AML
respectively with 7129 features
per sample.
• The dataset was randomly split
100 times into 36 training and 36
testing samples.
Classifying Colon Cancer Data
• The task: distinguish normal and
cancer samples
• The dataset: 22 normal and 40
cancer samples with 2000
features per sample.
• The dataset was randomly split
100 times into 50 training and 12
testing samples.
• SVM results from Li et al. 2002
Bayesian Sparse Kernel Classifiers
• Using feature/kernel expansions defined on
training data points:
• Predictive-ARD-EP trains a classifier that
depends on a small subset of the training
set.
• Fast test performance.
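Prediction with such a classifier touches only the retained relevance vectors, so test cost scales with their number rather than with the training-set size. A minimal sketch (function names are illustrative) using a Gaussian kernel, as in the experiments on these slides:

```python
import numpy as np

def gaussian_kernel(x, z, width=5.0):
    """k(x, z) = exp(-||x - z||^2 / (2 * width^2))."""
    x, z = np.asarray(x, float), np.asarray(z, float)
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * width ** 2))

def predict(x_new, rel_x, rel_w, width=5.0):
    """Sparse kernel classifier: sign of sum_i w_i k(x_new, x_i),
    summing only over the retained relevance vectors rel_x."""
    s = sum(w * gaussian_kernel(x_new, xi, width)
            for w, xi in zip(rel_w, rel_x))
    return 1 if s >= 0 else -1
```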
Test error rates and numbers of relevance or support vectors on the breast cancer dataset.
50 partitionings of the data were used. All these methods use the same Gaussian kernel with kernel width = 5. The trade-off parameter C in SVM is chosen via 10-fold cross-validation for each partition.
Part 1: Conclusions
• Maximizing marginal likelihood can lead to
overfitting in the model space if there are a lot of
features.
• We propose Predictive-ARD based on EP for
– feature selection
– sparse kernel learning
• In practice Predictive-ARD works better than
traditional ARD.
Outline
• Background
• Bayesian classification and Predictive ARD
• Bayesian conditional random fields
– Contextual object recognition/Segmentation
• Conclusions
Part 2: Conditional random fields (CRFs)

[Figure: graphical model with coupled labels t1, t2, inputs x1, x2, and parameters w]

• Generalize the traditional classification model by introducing correlation between labels.
Potential functions:
p(t | x, w) = (1/Z(w)) ∏_k g_k(t_i, t_j, x, w), where k = {i, j} indexes edges
• Model data with interdependence and structure, e.g., web pages, natural language, and multiple visual objects in a picture.
• Information at one location propagates to other locations.
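For a graph small enough to enumerate, Z(w) and hence p(t | x, w) can be computed by brute force; this toy sketch (the exponential-form potentials and parameter names are my own choices, not the talk's features) shows how the pairwise terms couple the labels:

```python
import numpy as np
from itertools import product

def crf_prob(t, x, w_pair, w_unary, edges):
    """p(t | x, w) = (1/Z(w)) * product of potentials, for binary labels.
    Toy potentials: exp(w_pair * [t_i == t_j]) on each edge and
    exp(w_unary * x_i * (2*t_i - 1)) on each node; Z(w) is computed
    by enumerating all 2^n labelings."""
    def score(labels):
        s = sum(w_pair * (labels[i] == labels[j]) for i, j in edges)
        s += sum(w_unary * xi * (2 * li - 1) for xi, li in zip(x, labels))
        return np.exp(s)
    Z = sum(score(labels) for labels in product((0, 1), repeat=len(t)))
    return score(tuple(t)) / Z
```

The probabilities over all labelings sum to one by construction; for real chains or grids the sum over t in Z(w) is computed by belief propagation rather than enumeration.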
Bayesian Conditional Networks
• Bayesian training to avoid overfitting
• Need efficient training:
– The exact posterior of w
– The Gaussian approximate posterior of w
Learning the parameter w by ML/MAP
Maximum likelihood (ML): maximize the data likelihood p(t | x, w).
Maximum a posteriori (MAP): add a Gaussian prior on w and maximize p(t | x, w) p(w).
ML/MAP problem: overfitting to the noise in the data.
EP in a Nutshell
• Approximate a probability distribution by simpler parametric terms (Minka 2001):
p(x) ∝ ∏_a f_a(x)        q(x) ∝ ∏_a f̃_a(x)
– For Bayesian networks: f_a(x) = p(x_{i_a} | x_{j_a})
– For Markov networks: f_a(x) = g_a(x, y)
p(w | t, x) ∝ ∏_a f_a(w)        q(w) ∝ ∏_a f̃_a(w)
– For conditional classification: f_a(w) = p(t_a | x_a, w)
– For conditional random fields: f_a(w) = g_a(x, t_a, w)
• Each approximation term f̃_a(x) or f̃_a(w) lives in an exponential family (such as Gaussian or multinomial)
EP in a Nutshell (2)
• The approximate term f̃_a(x) minimizes the following KL divergence by moment matching:
f̃_a(x) = arg min_{f̃_a} KL( f_a(x) q^{\a}(x) || f̃_a(x) q^{\a}(x) )
where the leave-one-out approximation is
q^{\a}(x) = ∏_{b≠a} f̃_b(x) = q(x) / f̃_a(x)
EP in a Nutshell (3)
Three key steps:
• Deletion step: approximate the “leave-one-out” predictive posterior for the ith point:
q^{\i}(x) ∝ q(x) / f̃_i(x) = ∏_{j≠i} f̃_j(x)
• Moment matching (assumed-density filtering): minimize the following KL divergence:
f̃_i(x) = arg min_{f̃_i} KL( f_i(x) q^{\i}(x) || f̃_i(x) q^{\i}(x) )
• Inclusion: q(x) ∝ f̃_i(x) q^{\i}(x)
Two Difficulties for Bayesian Training
• the partition function appears in the
denominator
– Regular EP does not apply
• the partition function is a complicated
function of w
Turn Denominator to Numerator (1)
• Transformed EP:
– Deletion:
– ADF:
– Inclusion:
Turn Denominator to Numerator (2)
• Power EP:
– Deletion:
– ADF:
– Inclusion:
Power EP minimizes the α-divergence
Approximating the partition function
• The parameters w and the labels t are intertwined in Z(w):
Z(w) = ∑_t ∏_k g_k(t_i, t_j, x, w), where k = {i, j} indexes edges.
The joint distribution of w and t:
• Factorized approximation:
Flatten Approximation Structure
[Figure: approximation structures compared across iterations]
Remove the intermediate level: increased efficiency, stability, and accuracy!
Model Averaging for Prediction
• Bayesian training provides a set of
estimated models
• Bayesian model averaging combines
predictions from all the models to eliminate
overfitting
• Approximate model averaging: weighted
belief propagation
Results on Synthetic Data
• Data generation: fix the true parameters w, randomly sample inputs x, and then sample the labels t
• Graphical structure: four nodes in a simple loop
• Comparing maximum-likelihood-trained CRFs with BCRFs: 10 trials, 100 training examples and 1000 test examples.
BCRFs significantly outperformed ML-trained
CRFs.
FAQs Labeling
• The dataset consists of 47 files, belonging
to 7 Usenet newsgroup FAQs. Each file has
multiple lines, which can be the header (H),
a question (Q), an answer (A), or the tail
(T).
• Task: label the lines that are questions or
answers.
FAQs Features
Results
BCRFs outperform MAP-trained CRFs with a
high statistical significance on FAQs labeling.
Ink Application: analyzing handwritten
organization charts
• Parsing a graph into different components:
containers vs. connectors
Comparing results
[Figure: chart-parsing results from the Bayes Point Machine, the MAP-trained CRF, and the BCRF]
Results
BCRF outperforms ML- and MAP-trained CRFs. BCRF-ARD further improves test accuracy. The results are averaged over 20 runs.
Part 2: Conclusions
Bayesian CRFs:
• Model the relational data
• BCRFs improve the predictive performance
over ML- and MAP-trained CRFs, especially
by approximate model averaging
• ARD for CRFs enables feature selection
• More applications: image segmentation and
joint scene analysis, etc.
Outline
• Background
• Bayesian classification and Predictive ARD
• Bayesian conditional random fields
• Conclusions
Conclusions
1. Predictive ARD by EP
Gene expression classification: Outperformed
traditional ARD, SVM with feature selection
2. Bayesian conditional random fields
FAQs labeling and joint diagram analysis: Beats
ML- and MAP-trained CRFs
3. Future work
END
Appendix: Sequential Updates
• EP approximates true likelihood terms by
Gaussian virtual observations.
• Based on Gaussian virtual observations, the
classification model becomes a regression model.
• Then, we can achieve efficient sequential updates
without maintaining and updating a full
covariance matrix. (Faul & Tipping 02)