Multi-label Prediction via Sparse Infinite CCA

Piyush Rai and Hal Daume III
NIPS 2009
Presented by Lingbo Li
ECE, Duke University
July 16th, 2010
Note: all tables and figures are taken from the original paper
Outline
• Canonical Correlation Analysis
– CCA
– Probabilistic CCA
• Infinite Canonical Correlation Analysis Model
– The Indian Buffet Process
– The Infinite CCA Model
– Inference
• Multitask Learning using Infinite CCA
– Fully supervised setting
– Semi-supervised setting
• Experiments
– Infinite CCA results on synthetic data
– Infinite CCA applied to multi-label prediction
• Conclusion
Canonical Correlation Analysis
• For variables $x \in \mathbb{R}^{D_1}$ and $y \in \mathbb{R}^{D_2}$, CCA seeks the linear projections $u \in \mathbb{R}^{D_1}$ and $v \in \mathbb{R}^{D_2}$ so that the variables are maximally correlated in the projection space.
• The correlation coefficient between the two variables in the embedded space is given by
$$\rho = \frac{u^\top \Sigma_{xy} v}{\sqrt{(u^\top \Sigma_{xx} u)\,(v^\top \Sigma_{yy} v)}}$$
• CCA can be posed as a constrained optimization problem:
$$\max_{u,v}\; u^\top \Sigma_{xy} v \quad \text{subject to} \quad u^\top \Sigma_{xx} u = 1,\; v^\top \Sigma_{yy} v = 1$$
• Here $\Sigma_{xx}$ and $\Sigma_{yy}$ denote the covariance matrices of the data samples, and $\Sigma_{xy}$ the cross-covariance matrix.
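As a concrete illustration (not from the paper), the constrained problem above reduces to a generalized eigenvalue problem. A minimal numpy/scipy sketch; the function name, the ridge term `reg`, and the row-per-sample data layout are my own assumptions:

```python
import numpy as np
from scipy.linalg import eigh

def cca(X, Y, n_components=2, reg=1e-6):
    """Classical CCA: max u'Sxy v  s.t.  u'Sxx u = v'Syy v = 1.

    X: (N, D1) and Y: (N, D2) arrays of paired samples (rows).
    Returns projections U (D1, K), V (D2, K) and canonical correlations rho.
    """
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    N = X.shape[0]
    Sxx = Xc.T @ Xc / N + reg * np.eye(X.shape[1])  # covariance of x (+ ridge)
    Syy = Yc.T @ Yc / N + reg * np.eye(Y.shape[1])  # covariance of y (+ ridge)
    Sxy = Xc.T @ Yc / N                             # cross-covariance

    # Eliminating v gives: Sxy Syy^{-1} Syx u = rho^2 Sxx u
    M = Sxy @ np.linalg.solve(Syy, Sxy.T)
    rho2, U = eigh(M, Sxx)                          # ascending generalized eigenvalues
    idx = np.argsort(rho2)[::-1][:n_components]
    U, rho = U[:, idx], np.sqrt(np.clip(rho2[idx], 0, 1))
    V = np.linalg.solve(Syy, Sxy.T @ U) / rho       # recover the paired projection
    return U, V, rho
```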
Probabilistic CCA
Let $x \in \mathbb{R}^{D_1}$ and $y \in \mathbb{R}^{D_2}$; consider the following latent variable model:
$$z \sim \mathcal{N}(0, I_K), \qquad x \mid z \sim \mathcal{N}(W_x z, \Psi_x), \qquad y \mid z \sim \mathcal{N}(W_y z, \Psi_y)$$
We can also write
$$x = W_x z + \epsilon_x, \qquad y = W_y z + \epsilon_y,$$
where $\epsilon_x \sim \mathcal{N}(0, \Psi_x)$ and $\epsilon_y \sim \mathcal{N}(0, \Psi_y)$.
The latent variable $z$ is shared between $x$ and $y$.
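A minimal generative sketch of this latent variable model, assuming zero means and arbitrary sizes (D1 = 25, D2 = 10, K = 4, N = 100); all names here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
D1, D2, K, N = 25, 10, 4, 100

Wx = rng.normal(size=(D1, K))    # projection matrix for x
Wy = rng.normal(size=(D2, K))    # projection matrix for y
Psi_x = 0.1 * np.eye(D1)         # noise covariance for x
Psi_y = 0.1 * np.eye(D2)         # noise covariance for y

Z = rng.normal(size=(K, N))      # shared latents: z_n ~ N(0, I_K)
X = Wx @ Z + rng.multivariate_normal(np.zeros(D1), Psi_x, size=N).T
Y = Wy @ Z + rng.multivariate_normal(np.zeros(D2), Psi_y, size=N).T
```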
Probabilistic CCA
• Gives a probabilistic interpretation of CCA
• Uses a maximum likelihood approach for parameter estimation
• Limitation: the number of canonical correlation components is fixed
• Limitation: the projection matrix is not sparse
– Use the IBP as a prior on binary matrices with countably infinite columns
– Posterior inference determines the subset of latent features responsible for the observations
– The IBP ensures that the matrices are sparse
Indian Buffet Process
• Given an $N \times D$ matrix $X$ of $N$ observations each with $D$ features, the latent feature model can be expressed as
$$X = ZA + E,$$
where $Z$ is an $N \times K$ binary feature-assignment matrix, $A$ is a $K \times D$ matrix of feature values, and $E$ is noise; the IBP places a prior on $Z$ that allows a countably infinite number of columns $K$.
• IBP interpretation:
– The first customer tries $\mathrm{Poisson}(\alpha)$ dishes
– The $n$th customer tries:
· previously-tasted dish $k$ with probability $m_k/n$, where $m_k$ is the number of previous customers who tried dish $k$
· $\mathrm{Poisson}(\alpha/n)$ completely new dishes
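The buffet metaphor above translates directly into a sampler for the binary matrix $Z$. A minimal sketch (the function name and structure are my own):

```python
import numpy as np

def sample_ibp(N, alpha, rng=None):
    """Draw a binary feature matrix Z from the Indian Buffet Process.

    N: number of customers (rows); alpha: concentration parameter.
    Returns Z with a random, data-driven number of columns K.
    """
    rng = rng or np.random.default_rng()
    dishes = []                                      # m_k: customers per dish
    rows = []
    for n in range(1, N + 1):
        row = [rng.random() < m / n for m in dishes]  # old dish k w.p. m_k / n
        for k, taken in enumerate(row):
            dishes[k] += int(taken)
        k_new = rng.poisson(alpha / n)                # Poisson(alpha/n) new dishes
        row.extend([True] * k_new)
        dishes.extend([1] * k_new)
        rows.append(row)
    K = len(dishes)
    Z = np.zeros((N, K), dtype=int)
    for n, row in enumerate(rows):
        Z[n, :len(row)] = row
    return Z
```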
Infinite CCA Model
• Impose an IBP prior on the binary matrix $B$ so that the dimensionality $K$ of the latent space associated with $z$ can be automatically determined from an unbounded number of candidate components.
• Represent the projection matrix as $W = B \odot V$, where $\odot$ denotes the elementwise (Hadamard) product and $V$ is a real-valued weight matrix.
• The two random vectors $x$ and $y$ can be modeled as
$$x = W_x z + \epsilon_x, \qquad y = W_y z + \epsilon_y$$
• $z$ is shared between $x$ and $y$, and $\epsilon_x$, $\epsilon_y$ are noise.
Infinite CCA Model
Let $t = [x; y] \in \mathbb{R}^D$ with $D = D_1 + D_2$, and stack $W = [W_x; W_y]$ and $e = [\epsilon_x; \epsilon_y]$.
The full model can be written as
$$t = (B \odot V)\, z + e, \qquad B \sim \mathrm{IBP}(\alpha)$$
(Figure: the graphical model structure; taken from the original paper.)
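A generative sketch of this stacked model, reusing the `sample_ibp` helper from the IBP sketch above; the isotropic noise scale 0.1 and alpha = 2.0 are arbitrary assumptions rather than the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(1)
D1, D2, N = 25, 10, 100
D = D1 + D2                                          # stacked dimensionality

B = sample_ibp(D, alpha=2.0, rng=rng).astype(float)  # binary support, IBP prior
K = B.shape[1]                                       # latent dim chosen by the IBP
V = rng.normal(size=(D, K))                          # real-valued weights
W = B * V                                            # sparse projection: Hadamard product

Z = rng.normal(size=(K, N))                          # shared latent variables
E = 0.1 * rng.normal(size=(D, N))                    # isotropic noise (simplification)
T = W @ Z + E                                        # stacked observations t = [x; y]
X, Y = T[:D1], T[D1:]                                # split back into the two views
```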
Inference
• Sample B
– Sample existing dishes via Gibbs steps:
$$P(b_{dk} = 1 \mid B_{-dk}, \text{rest}) \propto \frac{m_{-d,k}}{D}\; p(T \mid B, V, Z)$$
where $m_{-d,k}$ is the number of other rows possessing dish (feature) $k$.
– Sample new dishes: use an M-H step
Propose $k^{\mathrm{new}} \sim \mathrm{Poisson}(\alpha / D)$
Accept the proposal with an acceptance probability $\min\{1,\; p(\text{rest} \mid k^{\mathrm{new}}) / p(\text{rest} \mid k^{\mathrm{old}})\}$
Inference
• Sample V from its full conditional (Gaussian, by conjugacy of the Gaussian prior)
• Sample Z from its full conditional (likewise Gaussian)
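For illustration, here is a sketch of the existing-dish Gibbs sweep over B, assuming the isotropic Gaussian noise model from the generative sketch above. Re-evaluating the full log-likelihood per flip is wasteful but keeps the sketch short; the paper's actual updates may be organized differently:

```python
import numpy as np

def log_lik(T, B, V, Z, sigma=0.1):
    """Gaussian log-likelihood of T under W = B * V (up to an additive constant)."""
    R = T - (B * V) @ Z
    return -0.5 * np.sum(R ** 2) / sigma ** 2

def gibbs_existing_dishes(T, B, V, Z, rng, sigma=0.1):
    """One sweep over existing entries b_dk; the new-dish M-H step is omitted."""
    D, K = B.shape
    for d in range(D):
        for k in range(K):
            m = B[:, k].sum() - B[d, k]       # m_{-d,k}: other rows using dish k
            if m == 0:
                continue                      # singleton dishes belong to the M-H step
            logp = np.empty(2)
            for b in (0, 1):
                B[d, k] = b
                prior = np.log(m / D) if b else np.log(1 - m / D)
                logp[b] = prior + log_lik(T, B, V, Z, sigma)
            p1 = 1.0 / (1.0 + np.exp(logp[0] - logp[1]))  # normalize in log space
            B[d, k] = float(rng.random() < p1)
    return B
```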
Multitask Learning using Infinite CCA
Consider a setting in which each example is associated with multiple labels; predicting each label is one task.
Motivation: borrow information across tasks.
• Apply the infinite CCA model to capture label correlations
• Learn better predictive features by projecting the data to a subspace directed by label information
– Cross-covariance matrix $\Sigma_{xy}$: captures input-output correlations
– Label covariance matrix $\Sigma_{yy}$: captures label correlations
Multitask Learning using Infinite CCA
• Fully supervised setting (Model-1)
Given fully labeled data, learn the task parameters in the $K$-dimensional subspace.
Predict labels in the original $D$-dimensional space by inflating the parameters back to $D$ dimensions with the projection matrix (see the sketch below).
• Semi-supervised setting (Model-2)
Learn the embeddings for both training and test data, so that training and testing both take place in the $K$-dimensional subspace.
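A sketch of the Model-1 recipe: embed the inputs with the learned projection, fit per-label weights in the subspace, then inflate them back to $D$ dimensions. Ridge regression here is a stand-in for the paper's task learner, and all names are illustrative:

```python
import numpy as np

def train_model1(X, Ylab, Wx, lam=1.0):
    """Fit per-label weights in the K-dim subspace, then inflate via Wx.

    X: (N, D) inputs; Ylab: (N, L) binary label matrix; Wx: (D, K) projection.
    Returns theta_d of shape (D, L) for prediction in the original space.
    """
    Zx = X @ Wx                                # embed inputs in the subspace
    K = Wx.shape[1]
    # Closed-form ridge solution, one weight vector per label (column of Ylab)
    theta_k = np.linalg.solve(Zx.T @ Zx + lam * np.eye(K), Zx.T @ Ylab)
    return Wx @ theta_k                        # inflate parameters back to D dims

# Usage: scores = X_test @ train_model1(X_train, Y_train, Wx)
#        predicted labels: scores > 0.5 (threshold is an arbitrary choice)
```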
Experiments (I)
• Generate two synthetic datasets of dimensionalities 25 and 10, each having 100 samples.
• Ground truth: 4 correlation components, with 63% sparsity in the true projection matrix.
• Classical CCA found 8 components with significant correlations, whereas infinite CCA correctly discovered exactly 4 components.
• Classical CCA inferred a projection matrix with no exactly-zero entries; setting small values to zero recovered only about 25% sparsity.
• Infinite CCA inferred a projection matrix with 57% zero entries (62% after thresholding very small values).
Experiments (II)
• Use two real-world multi-label datasets (Yeast and Scene) from the UCI repository.
• The Yeast dataset consists of 1500 training and 917 test examples, each with 103 features and 14 possible labels.
• The Scene dataset consists of 1211 training and 1196 test examples, each with 294 features and 6 possible labels.
• Compare the following models (comparison table taken from the original paper).
Conclusion
• Presents a nonparametric Bayesian model for the CCA problem;
• Automatically selects the number of correlation components and captures the sparsity pattern of the projections;
• Can deal with missing data;
• Solves the multi-label learning problem.