Gaussian Processes
Le Song
Machine Learning II: Advanced Topics
CSE 8803ML, Spring 2012

Conditional Gaussian (general case)
Joint Gaussian: (X, Y) ~ N(\mu; \Sigma)
Conditional Gaussian: Y | X ~ N(\mu_{Y|X}; \Sigma_{YY|X})
  \mu_{Y|X} = \mu_Y + \Sigma_{YX} \Sigma_{XX}^{-1} (X - \mu_X)
  \Sigma_{YY|X} = \Sigma_{YY} - \Sigma_{YX} \Sigma_{XX}^{-1} \Sigma_{XY}
The conditional Gaussian is linear in X: Y | X ~ N(\beta_0 + B X; \Sigma_{YY|X})
  \beta_0 = \mu_Y - \Sigma_{YX} \Sigma_{XX}^{-1} \mu_X
  B = \Sigma_{YX} \Sigma_{XX}^{-1}
Linear regression model: Y = \beta_0 + B X + \epsilon, with white noise \epsilon ~ N(0, \Sigma_{YY|X})

What is a Gaussian Process?
A Gaussian process is a generalization of a multivariate Gaussian distribution to infinitely many variables.
Formally: a collection of random variables, any finite number of which have (consistent) Gaussian distributions.
Informally: an infinitely long vector whose dimensions are indexed by x, i.e. a function f(x).
A Gaussian process is fully specified by a mean function m(x) = E[f(x)] and a covariance function k(x, x') = E[(f(x) - m(x))(f(x') - m(x'))]:
  f(x) ~ GP(m(x), k(x, x')),  x: indices

Random function from a Gaussian process
One-dimensional Gaussian process: f(x) ~ GP(0, k(x, x') = exp(-\frac{1}{2}(x - x')^2))
To generate a sample from the GP:
  Gaussian variables f_i, f_j are indexed by x_i, x_j respectively, and their covariance (the ij-th entry of \Sigma) is defined by k(x_i, x_j).
  Generate n i.i.d. samples y = (y_1, ..., y_n)^T ~ N(0, I).
  Transform the sample: f = (f_1, ..., f_n)^T = \mu + \Sigma^{1/2} y.

Covariance function of Gaussian processes
For any finite collection of indices x_1, x_2, ..., x_n, the covariance matrix is positive semidefinite:
  \Sigma = K = [ k(x_1, x_1)  k(x_1, x_2)  ...  k(x_1, x_n)
                 k(x_2, x_1)  k(x_2, x_2)  ...  k(x_2, x_n)
                 ...
                 k(x_n, x_1)  k(x_n, x_2)  ...  k(x_n, x_n) ]
The covariance function needs to be a kernel function over the indices!
E.g. the Gaussian RBF kernel k(x, x') = exp(-\frac{1}{2} \|x - x'\|^2).

Samples from GPs with different kernels
  k(x_i, x_j) = v_0 exp(-(|x_i - x_j| / r)^\alpha) + v_1 + v_2 \delta_{ij}
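The sampling procedure above (draw i.i.d. standard Gaussians, then multiply by a square root of \Sigma) is only a few lines of code. A minimal NumPy sketch, using the kernel family from the previous slide with v_1 = v_2 = 0; the grid of index points, the kernel parameters, and the small jitter term added for numerical stability are illustrative choices, not part of the slides:

```python
import numpy as np

def kernel(xi, xj, v0=1.0, r=1.0, alpha=2.0):
    # k(x_i, x_j) = v0 * exp(-(|x_i - x_j| / r)^alpha), with v1 = v2 = 0
    return v0 * np.exp(-(np.abs(xi[:, None] - xj[None, :]) / r) ** alpha)

# Index points x_1, ..., x_n at which the random function is evaluated
x = np.linspace(-5, 5, 100)
K = kernel(x, x) + 1e-8 * np.eye(len(x))   # jitter keeps the Cholesky factorization stable

# Generate i.i.d. samples y ~ N(0, I) and transform: f = mu + Sigma^{1/2} y
L = np.linalg.cholesky(K)                  # any square root of K works; Cholesky is cheapest
y = np.random.randn(len(x), 3)             # three independent draws
f = 0.0 + L @ y                            # mean function m(x) = 0

# Each column of f is one random function drawn from GP(0, k), evaluated on the grid x
```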
Kernels for periodic, smooth functions
To create a GP over periodic functions, we can first map the inputs to u = (sin x, cos x)^T and then measure distances in u space.
Combined with the squared exponential function: k(x, x') = exp(-2 sin^2(\pi (x - x')) / \ell^2)
[Figure: three functions drawn at random; left \ell > 1, right \ell < 1.]

Using Gaussian processes for nonlinear regression
Observe a dataset D = {(x_i, y_i)}_{i=1}^n.
The prior p(f) is a Gaussian process; as in the multivariate Gaussian case, the posterior of f is therefore also a Gaussian process.
Bayes' rule: p(f | D) = p(D | f) p(f) / p(D)
Everything else about GPs follows from the basic rules of probability applied to multivariate Gaussians.

Graphical model for Gaussian processes
Square nodes are observed, round nodes are unobserved (latent).
Red nodes are training data, blue nodes are test data.
All pairs of latent variables (f) are connected.
Prediction of y* depends only on the corresponding f*.
We can do learning and inference based on this graphical model.

Posterior of a Gaussian process
Gaussian process regression; for simplicity, assume noiseless observations y = f(x).
The parameter is a function with a Gaussian process prior: f(x) ~ GP(m(x) = 0, k(x, x')).
GP posterior: f(x) | {(x_i, y_i)}_{i=1}^n ~ GP(m_post(x), k_post(x, x')), with Y = (y_1, ..., y_n)^T = (f(x_1), ..., f(x_n))^T:
  m_post(x) = 0 + \Sigma_{f(x)Y} \Sigma_{YY}^{-1} Y
  k_post(x, x') = \Sigma_{f(x)f(x')} - \Sigma_{f(x)Y} \Sigma_{YY}^{-1} \Sigma_{Y f(x')}

Posterior of a Gaussian process in kernel form
Define kernel matrices:
  k(x, X) := \Sigma_{f(x)Y} = (k(x, x_1), ..., k(x, x_n))
  K := \Sigma_{YY} = [ k(x_1, x_1)  k(x_1, x_2)  ...  k(x_1, x_n)
                       k(x_2, x_1)  k(x_2, x_2)  ...  k(x_2, x_n)
                       ...
                       k(x_n, x_1)  k(x_n, x_2)  ...  k(x_n, x_n) ]
Then we have
  m_post(x) = k(x, X) K^{-1} Y
  k_post(x, x') = k(x, x') - k(x, X) K^{-1} k(x', X)^T

Prior and posterior GP
In the noiseless case (y = f(x)), the mean function of the posterior GP passes through the training data points.
The posterior GP has reduced variance: zero variance at the training points.
[Figure: samples from the prior (left) and from the posterior (right).]

Noisy observations
y | x, f(x) ~ N(f, \sigma_{noise}^2 I); let Y = (y_1, ..., y_n)^T.
f(x) | {(x_i, y_i)}_{i=1}^n ~ GP(m_post(x), k_post(x, x'))
  m_post(x) = k(x, X) (K + \sigma_{noise}^2 I)^{-1} Y
  k_post(x, x') = k(x, x') - k(x, X) (K + \sigma_{noise}^2 I)^{-1} k(x', X)^T

GP: prediction of a new observation
Given a new point x^*, the predictive distribution for y^* is
  p(y^* | x^*, {(x_i, y_i)}_{i=1}^n) = \int p(y^* | f, x^*) p(f | {(x_i, y_i)}_{i=1}^n) df
The predictive distribution is Gaussian: y^* | x^*, {(x_i, y_i)}_{i=1}^n ~ N(\mu_{pred}, \sigma_{pred}^2), with Y = (y_1, ..., y_n)^T:
  \mu_{pred} = k(x^*, X) (K + \sigma_{noise}^2 I)^{-1} Y
  \sigma_{pred}^2 = k(x^*, x^*) - k(x^*, X) (K + \sigma_{noise}^2 I)^{-1} k(x^*, X)^T

Weight space view of GP
Assume a linear regression model: f(x | w) = x^T w, y = f + \epsilon, \epsilon ~ N(0, \sigma^2).
Let Y = (y_1, ..., y_n)^T and X = (x_1, ..., x_n).
Likelihood of the observations: p({y_i}_{i=1}^n | {x_i}_{i=1}^n, w) = N(X^T w, \sigma^2 I).
Assume a Gaussian prior over the parameters: p(w) = N(0, I).
Apply Bayes' theorem to obtain the posterior: p(w | X, Y) \propto p(Y | X, w) p(w).

Weight space view of GP
The posterior distribution over w is
  p(w | X, Y) = N( \frac{1}{\sigma^2} (I + \frac{1}{\sigma^2} X X^T)^{-1} X Y,  (I + \frac{1}{\sigma^2} X X^T)^{-1} )
The predictive distribution is
  p(f^* | x^*, X, Y) = \int p(f(x^*) | w) p(w | X, Y) dw
                     = N( \frac{1}{\sigma^2} x^{*T} (I + \frac{1}{\sigma^2} X X^T)^{-1} X Y,  x^{*T} (I + \frac{1}{\sigma^2} X X^T)^{-1} x^* )
The predictive distribution is in outer-product form; it can be turned into inner-product form using the matrix-inversion lemma.

Weight space view of GP
Predictive distribution:
  N( \frac{1}{\sigma^2} x^{*T} (I + \frac{1}{\sigma^2} X X^T)^{-1} X Y,  x^{*T} (I + \frac{1}{\sigma^2} X X^T)^{-1} x^* )
Equivalent to
  N( x^{*T} X (\sigma^2 I + X^T X)^{-1} Y,  x^{*T} x^* - x^{*T} X (\sigma^2 I + X^T X)^{-1} X^T x^* )
Instead of using the original x, map it to a feature space \phi(x); inner products of features become kernel entries and we recover N(\mu_{pred}, \sigma_{pred}^2) with
  \mu_{pred} = k(x^*, X) (K + \sigma_{noise}^2 I)^{-1} Y
  \sigma_{pred}^2 = k(x^*, x^*) - k(x^*, X) (K + \sigma_{noise}^2 I)^{-1} k(x^*, X)^T
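These kernel-form predictive equations (the same ones on the "GP: prediction of a new observation" slide) translate directly into code. A minimal NumPy sketch, assuming a squared exponential kernel and an illustrative noise variance; both hyperparameters would in practice be tuned, e.g. by the marginal likelihood discussed next:

```python
import numpy as np

def rbf_kernel(A, B, length=1.0):
    # Squared exponential kernel k(x, x') = exp(-0.5 * (x - x')^2 / length^2)
    d = A[:, None] - B[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def gp_predict(x_train, y_train, x_test, noise_var=0.1, length=1.0):
    """Posterior predictive mean and variance of a zero-mean GP with noisy observations."""
    K = rbf_kernel(x_train, x_train, length) + noise_var * np.eye(len(x_train))
    k_star = rbf_kernel(x_test, x_train, length)        # k(x*, X), one row per test point
    alpha = np.linalg.solve(K, y_train)                 # (K + sigma^2 I)^{-1} Y
    mu = k_star @ alpha                                 # predictive mean
    v = np.linalg.solve(K, k_star.T)                    # (K + sigma^2 I)^{-1} k(x*, X)^T
    var = rbf_kernel(x_test, x_test, length).diagonal() - np.sum(k_star * v.T, axis=1)
    return mu, var

# Toy usage: noisy observations of sin(x)
x_train = np.linspace(-3, 3, 20)
y_train = np.sin(x_train) + 0.1 * np.random.randn(20)
mu, var = gp_predict(x_train, y_train, np.linspace(-4, 4, 100))
```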
Model selection
Use the marginal likelihood (evidence) to select and tune hyperparameters in the covariance function:
  P({y_i}_{i=1}^n | {x_i}_{i=1}^n) = \int p({y_i}_{i=1}^n | f, {x_i}_{i=1}^n) p(f) df
An example: k(x_i, x_j) = v_0 exp(-(|x_i - x_j| / r)^\alpha) + v_1
The marginal likelihood is a function of the parameters v:
  P({y_i}_{i=1}^n | {x_i}_{i=1}^n) = N(0, K_v + \sigma^2 I)
  ln p({y_i}_{i=1}^n | {x_i}_{i=1}^n) = -\frac{1}{2} ln det(K_v + \sigma^2 I) - \frac{1}{2} Y^T (K_v + \sigma^2 I)^{-1} Y + const
Optimize it as a function of v.

Automatic relevance determination
We want to automatically decide which inputs are relevant to the output (feature selection).
Use the covariance function
  k(x_i, x_j) = v_0 exp(-\sum_{d=1}^D (x_i^d - x_j^d)^\alpha / r_d) + v_1 + v_2 \delta_{ij}
r_d is the length scale of the function along input dimension d.
As r_d \to \infty, the corresponding feature influences f less.
Use the marginal likelihood ln p({y_i}_{i=1}^n | {x_i}_{i=1}^n) to tune r_d and thereby do feature selection.

GP for classification
Regression model:
  f ~ GP(m(x), k(x, x'))
  y | x, f ~ N(f(x), \sigma_{noise}^2 I)
Classification model: given data {(x_i, y_i)}_{i=1}^n where y_i \in {-1, +1},
  f ~ GP(m(x), k(x, x'))
  y | x, f ~ p(y | f(x))

Relating the GP to class probability
Transform the continuous output of the Gaussian process to a value in [-1, 1] or [0, 1].
With binary outputs, the joint distribution of all variables in the model is no longer Gaussian.
The likelihood is also not Gaussian, so we need approximate inference to compute the posterior GP (Laplace approximation, sampling).

Connection to kernel support vector machines
  min_w \frac{1}{2} \|w\|^2 + C \sum_i \xi_i
  s.t. y_i (w^T \phi(x_i) + b) \geq 1 - \xi_i,  \xi_i \geq 0, \forall i
\xi_i: slack variables.
Can be written equivalently with the hinge loss [1 - y_i (w^T \phi(x_i))]_+ = [1 - y_i f_i]_+.

Connection to kernel support vector machines
The decision function is f(x) = w^T \phi(x) = \sum_i \alpha_i k(x, x_i).
Let the vector f be f(x) evaluated at all training points: f = K \alpha, so \alpha = K^{-1} f and
  \|w\|^2 = \alpha^T K \alpha = f^T K^{-1} f
We can rewrite the kernelized SVM as
  min_f \frac{1}{2} f^T K^{-1} f + C \sum_i [1 - y_i f_i]_+

Connection to kernel support vector machines
We can rewrite the kernelized SVM as
  min_f \frac{1}{2} f^T K^{-1} f + C \sum_i [1 - y_i f_i]_+
We can write the negative log of a GP likelihood as
  \frac{1}{2} f^T K^{-1} f - \sum_i ln p(y_i | f_i) + c
With a Gaussian process we
  handle uncertainty in the unknown function f by averaging, not minimization;
  can learn the kernel parameters and features using the marginal likelihood;
  can incorporate interpretable noise models and priors, and can sample.

Gaussian process latent variable models
GPs can be used for nonlinear dimensionality reduction.
Observe n data points {y_i}_{i=1}^n; assume each dimension y_i^d of the data is modeled by a separate GP using a common low-dimensional input x_i.
Find the best latent inputs x_i by maximizing the marginal likelihood.
Computationally intensive.

Gaussian process latent variable models
Finding the latent variables is a high-dimensional, nonlinear optimization problem with local optima.
GPLVM defines a map from latent to observed space, not a generative model:
  mapping a new latent coordinate to observations is easy;
  finding the latent coordinates for new observations is difficult.

Computation issue of GP
  f(x) | {(x_i, y_i)}_{i=1}^n ~ GP(m_post(x), k_post(x, x'))
  m_post(x) = k(x, X) (K + \sigma_{noise}^2 I)^{-1} Y
  k_post(x, x') = k(x, x') - k(x, X) (K + \sigma_{noise}^2 I)^{-1} k(x', X)^T
Inverting the kernel matrix K is computationally intensive: with n data points the cost is O(n^3).
We need to reduce the computational cost by some type of approximation, K \approx Q Q^T (sketched below, detailed on the next slide).
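A sketch of why such a low-rank factor helps, assuming we are already given Q with K \approx Q Q^T (the next slide describes obtaining it by incomplete Cholesky): the matrix-inversion lemma turns the n x n solve into a k x k solve, dropping the cost from O(n^3) to roughly O(n k^2). The random Q below merely stands in for the output of a real factorization routine:

```python
import numpy as np

def lowrank_gp_mean(Q, y_train, k_star, noise_var):
    """Approximate posterior mean k(x*, X)(Q Q^T + sigma^2 I)^{-1} Y via the
    matrix-inversion lemma: only a k x k system is solved.
    Q: n x k factor with K approx Q Q^T; k_star: n_test x n matrix k(x*, X)."""
    n, k = Q.shape
    # Woodbury: (Q Q^T + s I)^{-1} v = (v - Q (s I + Q^T Q)^{-1} Q^T v) / s
    small = noise_var * np.eye(k) + Q.T @ Q
    alpha = (y_train - Q @ np.linalg.solve(small, Q.T @ y_train)) / noise_var
    return k_star @ alpha

# Hypothetical usage: a random low-rank factor standing in for incomplete Cholesky output
n, k = 1000, 20
Q = np.random.randn(n, k)
y = np.random.randn(n)
k_star = np.random.randn(5, n)
mu = lowrank_gp_mean(Q, y, k_star, noise_var=0.1)
```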
Kernel low rank approximation
Incomplete Cholesky factorization of the kernel matrix K of size n x n gives Q of size n x k, with k \ll n, such that K \approx Q Q^T.
Substituting into the posterior:
  f(x) | {(x_i, y_i)}_{i=1}^n ~ GP(m_post(x), k_post(x, x'))
  m_post(x) = k_x^T (Q Q^T + \sigma_{noise}^2 I)^{-1} Y
  k_post(x, x') = k_{xx'} - k_x^T (Q Q^T + \sigma_{noise}^2 I)^{-1} k_{x'}
where k_x = k(x, X)^T; the inverse can now be applied cheaply through the matrix-inversion lemma.

Sparse nonparametric regression
Support vector regression.

Dual of support vector regression and kernelization
The dual problem uses the data only through inner products.
Prediction for a new point.
Replace the inner products by kernel functions to obtain nonlinear regression.

Collaborative Filtering
R: rating matrix; U: user factors; V: movie factors.
  min_{U,V} f(U, V) = \|R - U V^T\|_F^2
  s.t. U \geq 0, V \geq 0, k \ll m, n
Low-rank matrix approximation approach.
Probabilistic matrix factorization.
Bayesian probabilistic matrix factorization.

Nonparametric effect model
The ratings R_{ij} are generated by a bias m, a user-item compatibility function f_{ij}, and a random effect g_{ij}, plus noise e_{ij}:
  R_{ij} = m + f_{ij} + g_{ij} + e_{ij},  e_{ij} ~ N(0, \sigma^2)
Gaussian process priors for both f and g:
  f ~ GP(0, \Omega \otimes \Sigma)
  g_i ~ GP(0, \tau \Sigma), i = 1, ..., M
Hyperprior on the covariance matrix: an inverse-Wishart process, \Sigma ~ IWP(\kappa, \Sigma_0 + \nu \delta).
Learning with EM.
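For the low-rank matrix approximation approach on the collaborative filtering slide, a minimal alternating least squares sketch. It optimizes only the unconstrained objective \|R - U V^T\|_F^2 on a fully observed matrix; the nonnegativity constraints and the masking of missing ratings used in real recommender data are omitted here:

```python
import numpy as np

def als_factorize(R, k=10, n_iters=50, reg=1e-3):
    """Alternating least squares for min_{U,V} ||R - U V^T||_F^2.
    A small ridge term keeps the k x k solves well conditioned."""
    m, n = R.shape
    U = np.random.randn(m, k)
    V = np.random.randn(n, k)
    for _ in range(n_iters):
        # Fix V, solve for U: U = R V (V^T V + reg I)^{-1}
        U = R @ V @ np.linalg.inv(V.T @ V + reg * np.eye(k))
        # Fix U, solve for V: V = R^T U (U^T U + reg I)^{-1}
        V = R.T @ U @ np.linalg.inv(U.T @ U + reg * np.eye(k))
    return U, V

# Toy usage: recover a rank-5 matrix
U0, V0 = np.random.randn(100, 5), np.random.randn(80, 5)
R = U0 @ V0.T
U, V = als_factorize(R, k=5)
rel_err = np.linalg.norm(R - U @ V.T) / np.linalg.norm(R)
```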