Multi-label Prediction via Sparse Infinite CCA
Piyush Rai and Hal Daumé III, NIPS 2009
Presented by Lingbo Li, ECE, Duke University, July 16th, 2010
Note: all tables and figures are taken from the original paper.

Outline
• Canonical Correlation Analysis
  CCA
  Probabilistic CCA
• Infinite Canonical Correlation Analysis Model
  The Indian Buffet Process
  The Infinite CCA Model
  Inference
• Multitask Learning using Infinite CCA
  Fully supervised setting
  Semi-supervised setting
• Experiments
  Infinite CCA results on synthetic data
  Infinite CCA applied to multi-label prediction
• Conclusion

Canonical Correlation Analysis
• For variables $x \in \mathbb{R}^{D_1}$ and $y \in \mathbb{R}^{D_2}$, CCA seeks linear projections $u_x$ and $u_y$ such that the projected variables $u_x^\top x$ and $u_y^\top y$ are maximally correlated.
• The correlation coefficient between the two variables in the embedded space is
  $\rho = \frac{u_x^\top C_{xy} u_y}{\sqrt{(u_x^\top C_{xx} u_x)\,(u_y^\top C_{yy} u_y)}}$
• CCA can be posed as the constrained optimization problem
  $\max_{u_x, u_y} \; u_x^\top C_{xy} u_y \quad \text{s.t.} \quad u_x^\top C_{xx} u_x = 1, \; u_y^\top C_{yy} u_y = 1$
• Here $C_{xx}$ and $C_{yy}$ denote the covariance matrices of the data samples, and $C_{xy}$ their cross-covariance.

Probabilistic CCA
• Let $z \sim \mathcal{N}(0, I_K)$ and consider the following latent variable model:
  $x \mid z \sim \mathcal{N}(W_x z + \mu_x, \Psi_x)$
  $y \mid z \sim \mathcal{N}(W_y z + \mu_y, \Psi_y)$
• We can also write $x = W_x z + \mu_x + \epsilon_x$ and $y = W_y z + \mu_y + \epsilon_y$, where $\epsilon_x \sim \mathcal{N}(0, \Psi_x)$ and $\epsilon_y \sim \mathcal{N}(0, \Psi_y)$.
• The latent variable $z$ is shared between $x$ and $y$.

Probabilistic CCA
• Gives a probabilistic interpretation of CCA, with maximum likelihood estimation of the parameters.
• Limitations: the number of canonical correlation components is fixed in advance, and the projection matrix is not sparse.
• Remedy: use the IBP as a prior on binary matrices with countably infinite columns.
  Posterior inference determines the subset of latent features responsible for the observations.
  The IBP ensures that the matrices are sparse.

Indian Buffet Process
• Given an $N \times D$ matrix $X$ of observations, each with $K$ latent features, the latent feature model can be expressed as
  $X = Z A + E$, where $Z$ is an $N \times K$ binary matrix with $z_{nk} = 1$ if observation $n$ possesses feature $k$.
• IBP culinary interpretation:
  The first customer tries $\text{Poisson}(\alpha)$ dishes (features).
  The $n$th customer tries:
    previously-tasted dish $k$ with probability $m_k / n$, where $m_k$ is the number of earlier customers who tried dish $k$;
    $\text{Poisson}(\alpha / n)$ completely new dishes.
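To make the IBP generative process concrete, here is a minimal simulation sketch (illustrative code, not from the paper; the function name `sample_ibp` and its parameters are invented for this example):

```python
import numpy as np

def sample_ibp(alpha, n_customers, seed=None):
    """Simulate one draw from the Indian Buffet Process.

    Returns a binary matrix Z with one row per customer and one
    column per dish (latent feature) that was ever sampled.
    """
    rng = np.random.default_rng(seed)
    dish_counts = []              # m_k: how many customers have tried dish k
    chosen = []                   # per-customer sets of dishes
    for n in range(1, n_customers + 1):
        dishes = set()
        # Previously-tasted dish k is chosen with probability m_k / n
        for k, m_k in enumerate(dish_counts):
            if rng.random() < m_k / n:
                dishes.add(k)
        # Poisson(alpha / n) completely new dishes
        for _ in range(rng.poisson(alpha / n)):
            dish_counts.append(0)
            dishes.add(len(dish_counts) - 1)
        for k in dishes:
            dish_counts[k] += 1
        chosen.append(dishes)
    Z = np.zeros((n_customers, len(dish_counts)), dtype=int)
    for n, dishes in enumerate(chosen):
        Z[n, list(dishes)] = 1
    return Z

Z = sample_ibp(alpha=2.0, n_customers=10, seed=0)
print(Z.shape, Z.sum(axis=0))  # number of dishes sampled and their popularity
```

The number of columns of Z grows with the data (about $\alpha \sum_{n=1}^{N} 1/n$ dishes in expectation), which is what lets the infinite CCA model leave the number of correlation components unbounded a priori.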
Infinite CCA Model
• Impose an IBP prior on the binary matrix $B$ so that the dimensionality $K$ of the latent space associated with $z$ can be determined automatically from an unbounded number of candidate columns.
• Represent the projection matrix as $W = B \odot V$, where $B$ is a binary mask, $V$ is a real-valued weight matrix, and $\odot$ denotes the element-wise (Hadamard) product.
• The two random vectors $x$ and $y$ can then be modeled as
  $x = (B_x \odot V_x)\, z + \epsilon_x$, $y = (B_y \odot V_y)\, z + \epsilon_y$
• $z$ is shared between $x$ and $y$; $\epsilon_x$ and $\epsilon_y$ are noise terms.

Infinite CCA Model
• Let $t = [x; y]$ stack the two views, with $B = [B_x; B_y]$ and $V = [V_x; V_y]$. The full model can be written as
  $B \sim \text{IBP}(\alpha)$, $v_{dk} \sim \mathcal{N}(0, \sigma_v^2)$, $z \sim \mathcal{N}(0, I)$, $t \mid B, V, z \sim \mathcal{N}((B \odot V)\, z, \Psi)$
• The graphical model structure is shown in the original paper.

Inference
• Sample B:
  Existing dishes: resample each entry $b_{dk}$ from its conditional posterior,
  $P(b_{dk} = 1 \mid -) \propto m_{-d,k}\; p(\text{data} \mid b_{dk} = 1, -)$,
  where $m_{-d,k}$ is the number of other rows in which feature $k$ is active.
  New dishes: use an M-H step. Propose $K_{\text{new}} \sim \text{Poisson}(\alpha / D)$ new columns and accept the proposal with the usual M-H acceptance probability.
• Sample V: entries are drawn from their closed-form Gaussian conditional posteriors (linear-Gaussian conjugacy).
• Sample Z: the latent variables are likewise drawn from Gaussian conditional posteriors.

Multitask Learning using Infinite CCA
• Each example is associated with multiple labels; predicting each label is one task.
• Motivation: borrow information across tasks.
• Apply the infinite CCA model to capture label correlations.
• Learn better predictive features by projecting the data to a subspace guided by label information:
  the cross-covariance matrix $C_{xy}$: input-output correlation;
  the label covariance matrix $C_{yy}$: label correlation.

Multitask Learning using Infinite CCA
• Fully supervised setting (Model 1): given labeled data $\{(x_n, y_n)\}_{n=1}^N$, learn the task parameters in the $K$-dimensional subspace; predict labels in the original $D$-dimensional space by inflating the parameters back to $D$ dimensions with the projection matrix.
• Semi-supervised setting (Model 2): learn the embeddings for both training and test data, so that training and prediction both take place in the $K$-dimensional subspace.

Experiments (I)
• Generate two synthetic datasets of dimensionalities 25 and 10, each having 100 samples.
• Ground truth: 4 correlation components, with 63% sparsity in the true projection matrix.
• Classical CCA found 8 components with significant correlations, while infinite CCA correctly discovered exactly 4.
• Classical CCA infers a projection matrix with no exact zero entries; setting small values to zero uncovers only about 25% sparsity. Infinite CCA infers a projection matrix with 57% zero entries (62% after thresholding very small values).

Experiments (II)
• Two real-world multi-label datasets (Yeast and Scene) from the UCI repository are used.
• Yeast: 1500 training and 917 test examples, each with 103 features; there are 14 labels per example.
• Scene: 1211 training and 1196 test examples, each with 294 features; there are 6 labels per example.
• The compared models and their results are given in the tables in the original paper.

Conclusion
• Presents a nonparametric Bayesian model for the CCA problem.
• Automatically selects the number of correlation components and captures the sparsity pattern of the projection matrices.
• Naturally handles missing data.
• Addresses multi-label prediction by treating it as multitask learning in the shared latent subspace.
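As a rough, self-contained illustration of the synthetic setup in Experiments (I) — a sketch under an invented random sparse ground truth and assumed noise level, not the authors' code — the following generates two views from a shared latent $z$ through sparse projections $W = B \odot V$ and reports the per-component correlations that classical CCA finds:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
N, Dx, Dy, K_true = 100, 25, 10, 4    # sizes from the slide; everything else is invented

# Sparse ground-truth projections W = B * V: binary masks (~63% zeros) times Gaussian weights
Bx = rng.random((Dx, K_true)) > 0.63
By = rng.random((Dy, K_true)) > 0.63
Wx = Bx * rng.normal(size=(Dx, K_true))
Wy = By * rng.normal(size=(Dy, K_true))

Z = rng.normal(size=(N, K_true))               # shared latent variables
X = Z @ Wx.T + 0.1 * rng.normal(size=(N, Dx))  # view 1 (dim 25)
Y = Z @ Wy.T + 0.1 * rng.normal(size=(N, Dy))  # view 2 (dim 10)

# Classical CCA extracts min(Dx, Dy) = 10 components; only ~K_true of them
# should carry substantial correlation.
cca = CCA(n_components=min(Dx, Dy)).fit(X, Y)
Xc, Yc = cca.transform(X, Y)
corrs = [abs(np.corrcoef(Xc[:, k], Yc[:, k])[0, 1]) for k in range(Xc.shape[1])]
print(np.round(corrs, 2))
```

With only 4 true components, most of the 10 reported correlations should be small; the paper's point is that classical CCA still flags extra components as significant and yields no exactly-zero projection weights, whereas the IBP-based model infers both the component count and the zero pattern directly.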