Outline
- Latent Variable Models: Probabilistic PCA (PPCA), Dual Probabilistic PCA (DPPCA)
- Gaussian Process Latent Variable Models: MDS, kernel PCA, nonlinear variants
- Unified objective function
- Experiments
- Extensions: Gaussian Process Dynamical Models, Hierarchical GPLVM

Motivation
- To learn a low-dimensional representation of high-dimensional data.
- Our primary focus will be on visualization of the data in 2D.
- Desiderata:
  - Maintains proximity structure (e.g. what is close in data space stays close in latent space).
  - Probabilistic framework.
  - Projection from latent space to data space.
  - A non-linear latent space embedding.

Notation
- D = number of observed dimensions
- q = number of latent dimensions
- N = number of data points
- Y = observed data
- X = latent data
- W = linear mapping

Probabilistic PCA
- A prior is placed over X, our latent embedding, s.t. X ~ N(0, I). X is then marginalized out.
- Marginalize X; optimize W; Y is observed.
- Columns of U are the eigenvectors of S, L is a diagonal matrix of eigenvalues, and R is an arbitrary orthogonal matrix.
- Taking the marginal likelihood of Y by integrating over X results in PPCA.
- Optimizing the log likelihood w.r.t. W amounts to solving the eigenvalue problem for our sample covariance matrix S.

Generative Model of PPCA (Bishop, 2006)
- In Bishop's notation, z represents the latent variable, which is mapped to the data space x via W.
- The green contours show the marginal density for x.

Dual Probabilistic PCA
- A prior is placed over W, our mapping parameter, s.t. W ~ N(0, I). W is then marginalized out.
- Optimize X; marginalize W; Y is observed.
- Columns of U are the eigenvectors of S, L is a diagonal matrix of eigenvalues, and R is an arbitrary orthogonal matrix.
- Taking the marginal likelihood of Y by integrating over W results in Dual PPCA.
- The marginal likelihood replaces W with X, but the ML estimate takes a similar form…

DPPCA = PPCA
- Solution for PPCA and solution for Dual PPCA (see the reference equations after the kernel slides below).
- Lawrence claims these are "equivalent" through a relation between the two eigenvalue problems.

Connection to Gaussian Processes
- Recall that a Gaussian process is fully defined by its mean function m(x) and covariance function k(x, x').
- Assume a zero mean function and a covariance function K; for DPPCA this is the linear kernel K = XXᵀ + σ²I.
- The marginal likelihood for DPPCA is then equivalent to a product over D independent Gaussian processes with the linear covariance function mentioned above.

Kernel PCA vs. GPLVM
- [Diagram] Kernel PCA: the nonlinearity enters through a kernel K'(Y) on the observed data Y; the embedding X is then obtained by a linear mapping.
- [Diagram] GPLVM: the nonlinearity enters through a kernel K(X) on the latent points X, with the mapping from latent to data space playing the role of W.

Unifying Objective
- GPLVM can be connected to classical MDS and kernel PCA via a unifying objective function.
- Kernel PCA: S is non-linear and K is linear.
- GPLVM: K is non-linear and S is linear.
- Classical MDS: S is a proximity or similarity matrix.
- GPLVM can be optimized via scaled conjugate gradients (SCG); a minimal optimization sketch appears after the kernel visualization slides below.

Sparsification
- Kernel methods may be sped up through sparsification.
- In SCG optimization, each gradient computation requires an inversion of the kernel matrix, which is cubic in N and thus prohibitive for visualizing large datasets.
- We can instead represent the data with the Informative Vector Machine (IVM): the likelihood is built from a subset I of d points known as the active set.
- The cost is then dominated by selection of the active set.

(Sparse) Oil Flow Visualizations
- Radial Basis Function (RBF) kernel: leads to smooth functions that fall away to zero in regions with no data.
- Multi-layer Perceptron (MLP) kernel: also leads to smooth functions, but regions with no data tend to take the same values.
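For reference, the solution equations referred to on the PPCA, Dual PPCA, and unifying-objective slides above take roughly the following standard form. This is a sketch following Tipping & Bishop (1999) and Lawrence (2005); the normalization constants are assumptions, not a transcription of the original slides.

  PPCA:       W_{ML} = U_q (\Lambda_q - \sigma^2 I)^{1/2} R, with U_q, \Lambda_q the top-q eigenvectors/eigenvalues of S = N^{-1} Y^\top Y
  Dual PPCA:  X_{ML} = U'_q (\Lambda'_q - \sigma^2 I)^{1/2} R, with U'_q, \Lambda'_q from the eigendecomposition of D^{-1} Y Y^\top
  Marginal likelihood (DPPCA / GPLVM):
              \log p(Y \mid X) = -\frac{DN}{2}\ln 2\pi - \frac{D}{2}\ln|K| - \frac{1}{2}\mathrm{tr}\!\left(K^{-1} Y Y^\top\right), \quad K = X X^\top + \sigma^2 I
  Equivalence: both eigenvalue problems arise from the SVD Y = U \Sigma V^\top, so the two solutions agree up to scaling and the arbitrary rotation R.
  Unifying objective: maximize  -\frac{D}{2}\left(\ln|K| + \mathrm{tr}(K^{-1} S)\right),  with S = D^{-1} Y Y^\top for GPLVM, a non-linear similarity matrix for kernel PCA, and a proximity matrix for classical MDS.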
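The unifying-objective slide notes that the GPLVM is fit by optimizing this likelihood over the latent positions X with scaled conjugate gradients. Below is a minimal NumPy sketch of that idea under stated assumptions: an RBF kernel with fixed hyperparameters, PCA initialization, and SciPy's L-BFGS (with numerical gradients) standing in for SCG. All names and values are illustrative, not the paper's code.

```python
# Minimal GPLVM sketch: optimize latent positions X under an RBF kernel.
# Assumptions: fixed kernel hyperparameters, PCA initialization, and
# scipy's L-BFGS with numerical gradients standing in for SCG.
import numpy as np
from scipy.optimize import minimize


def rbf_kernel(X, alpha=1.0, gamma=1.0, beta=100.0):
    """K_ij = alpha * exp(-gamma/2 * ||x_i - x_j||^2) + (1/beta) * delta_ij."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(X**2, axis=1)[None, :] - 2.0 * X @ X.T
    return alpha * np.exp(-0.5 * gamma * sq) + np.eye(X.shape[0]) / beta


def neg_log_marginal_likelihood(x_flat, Y, q):
    """-log p(Y | X) = DN/2 log 2pi + D/2 log|K| + 1/2 tr(K^{-1} Y Y^T)."""
    N, D = Y.shape
    X = x_flat.reshape(N, q)
    K = rbf_kernel(X)
    L = np.linalg.cholesky(K)
    log_det = 2.0 * np.sum(np.log(np.diag(L)))            # log |K|
    Kinv_Y = np.linalg.solve(L.T, np.linalg.solve(L, Y))  # K^{-1} Y
    return 0.5 * (N * D * np.log(2 * np.pi) + D * log_det + np.sum(Y * Kinv_Y))


# Toy usage on synthetic data.
rng = np.random.default_rng(0)
Y = rng.standard_normal((50, 12))
Y -= Y.mean(axis=0)                     # centre the data
q = 2                                   # latent dimensionality for visualization

# Initialize the latent points with PCA (principal subspace of Y).
_, _, Vt = np.linalg.svd(Y, full_matrices=False)
X0 = Y @ Vt[:q].T

res = minimize(neg_log_marginal_likelihood, X0.ravel(), args=(Y, q), method="L-BFGS-B")
X_latent = res.x.reshape(-1, q)         # learned 2D embedding
```

With a linear kernel K = XXᵀ + σ²I this objective recovers the dual PPCA solution up to rotation; the RBF kernel is what makes the embedding non-linear.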
Oil Flow Visualization: Full (No Sparsification)
- [Figure] RBF-kernel GPLVM compared with the Generative Topographic Mapping (GTM).

Alternative Noise Models
- We are NOT constrained to Gaussian noise models.
- We visualize handwritten 2's with two noise models: Gaussian noise and Bernoulli noise.
- [Figure] Test pixels removed are shown in red; columns compare the baseline model, the Gaussian model, and the Bernoulli model.

Hallucinating Faces
- These faces were created by taking 64 uniformly spaced, ordered points from the (1D) latent space and visualizing the mean of their distribution in data space.
- Also shown: examples from the data set that are closest to the corresponding fantasy images in latent space.

Pros and Cons of GPLVM
- Pros:
  - Probabilistic.
  - Missing data is straightforward to handle.
  - Can sample from the model given X.
  - Different noise models can be handled.
  - Flexible family of kernels.
- Cons:
  - Speed of optimisation.
  - Optimisation is non-convex when using a non-linear kernel.
  - Sensitivity to initialization.

Initialization Issues (Swiss Roll Dataset)
- [Figure] GPLVM initialized with PCA vs. GPLVM initialized with Isomap.

Gaussian Process Dynamical Models
- A nonlinear dynamical system: the generative process couples latent dynamics with parameters A and an observation mapping with parameters B.
- GPDM marginalizes over the parameters A and B.
- Jack M. Wang et al., Gaussian Process Dynamical Models (NIPS 2006).

GPDM on Walking Data
- Training data: video of walking.
- GPDM of the walk data learned with an RBF kernel + 2nd-order dynamics.

Missing Data
- Training data with frames 51 to 100 missing, out of 158 frames in total.
- Missing frames recovered by GPDM with a linear + RBF kernel.

Visualization of Latent Space
- [Figure] GPLVM vs. GPDM latent spaces; random samples from X and their mappings to data space.

More GPDM Action
- GPDM learned using a 2nd-order RBF kernel for the dynamics.

Hierarchical GPLVM
- We want MAP estimates of this hierarchical model.
- Neil D. Lawrence and Andrew J. Moore, Hierarchical Gaussian Process Latent Variable Models (ICML 2007).

High Five!
- A motion capture data set consisting of two interacting subjects. The data, taken from the CMU MOCAP database, consists of two subjects that approach each other and 'high five'.
- The hierarchy provides coordination information between the two subjects.
- [Figure] Frames A: 85, B: 114, C: 127, D: 141, E: 155, F: 170, G: 190, H: 215.

Run, Don't Walk!
- The skeleton is decomposed as shown in Figure 4 of the paper.
- In the plots, crosses are latent positions associated with the run and circles with the walk.
- We have mapped three points from each motion through the hierarchy.
- Periodic dynamics were used in the latent spaces.

Conclusion