Gaussian Processes
Le Song
Machine Learning II: Advanced Topics, CSE 8803ML, Spring 2012

Pictorial view of embedding distributions
- Transform the entire distribution into its expected features in a feature space, via a feature map.
[Figure: pictorial view of mapping a distribution into feature space]

Embedding distributions: mean
- The mean reduces the entire distribution to a single number.
- Representation power is very restricted (1D feature space).

Embedding distributions: mean + variance
- Mean and variance reduce the entire distribution to two numbers (2D feature space).
- Richer representation, but still not enough.

Embedding with kernel features
- Transform the distribution into an infinite-dimensional vector in feature space: mean, variance, and higher-order moments.
- Rich representation.

Finite sample approximation of the embedding
[Figure: empirical average of feature-mapped samples approximates the embedding]

Estimating embedding distance
- Finite sample estimator: form a kernel matrix with 4 blocks and average each block.

Measuring dependence via embeddings
- Use the squared distance between embeddings to measure dependence between X and Y.
- The feature-space embedding collects the mean of X, the mean of Y, their covariance, and higher-order features.

Estimating embedding distances
- Given samples $(x_1, y_1), \ldots, (x_m, y_m) \sim P(X, Y)$.
- The dependence measure can be expressed via inner products:
  $\|\mu_{XY} - \mu_X \otimes \mu_Y\|^2 = \|E_{XY}[\phi(X) \otimes \phi(Y)] - E_X[\phi(X)] \otimes E_Y[\phi(Y)]\|^2$
  $= \langle \mu_{XY}, \mu_{XY} \rangle - 2\langle \mu_{XY}, \mu_X \otimes \mu_Y \rangle + \langle \mu_X \otimes \mu_Y, \mu_X \otimes \mu_Y \rangle$
- Kernel matrix operation: $\frac{1}{m^2}\,\mathrm{trace}(K H L H)$, where $H = I - \frac{1}{m}\mathbf{1}\mathbf{1}^\top$, $K_{ij} = k(x_i, x_j)$ and $L_{ij} = k(y_i, y_j)$.
- The X and Y data are ordered in the same way (paired samples).

Application of kernel distance measure
[Figure: applications of the kernel distance measure]

Multivariate Gaussians
- $p(X_1, X_2, \ldots, X_n) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right)$
- Mean vector: $\mu_i = E[X_i]$, $\mu = (\mu_1, \mu_2, \ldots, \mu_n)^\top$.
- Covariance matrix: $\Sigma_{ij} = E[(X_i - \mu_i)(X_j - \mu_j)]$, e.g.
  $\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \sigma_{13} \\ \sigma_{21} & \sigma_2^2 & \sigma_{23} \\ \sigma_{31} & \sigma_{32} & \sigma_3^2 \end{pmatrix}$

Conditioning on a Gaussian
- Joint Gaussian: $P(X, Y) \sim \mathcal{N}(\mu; \Sigma)$.
- Conditioning a Gaussian variable Y on another Gaussian variable X still gives a Gaussian: $P(Y \mid X) \sim \mathcal{N}(\mu_{Y|X}; \sigma^2_{Y|X})$, with
  $\mu_{Y|X} = \mu_Y + \frac{\sigma_{XY}}{\sigma_X^2}(X - \mu_X)$ (prior mean plus a correction from the new observation)
  $\sigma^2_{Y|X} = \sigma_Y^2 - \frac{\sigma_{XY}^2}{\sigma_X^2}$ (prior variance minus a reduction)
- The posterior variance does not depend on the particular observed value: observing X always decreases the variance.

Conditional Gaussian is a linear model
- Conditional linear Gaussian: $P(Y \mid X) \sim \mathcal{N}(\mu_{Y|X}; \sigma^2_{Y|X})$ with $\mu_{Y|X} = \mu_Y + \frac{\sigma_{XY}}{\sigma_X^2}(X - \mu_X)$, i.e. $P(Y \mid X) \sim \mathcal{N}(\beta_0 + \beta X; \sigma^2_{Y|X})$.
- The ridge in the figure is the line $\beta_0 + \beta X$; a slice at a particular X gives a Gaussian.
- All these Gaussian slices have the same variance $\sigma^2_{Y|X} = \sigma_Y^2 - \frac{\sigma_{XY}^2}{\sigma_X^2}$.

Conditional Gaussian (general case)
- Joint Gaussian: $P(X, Y) \sim \mathcal{N}(\mu; \Sigma)$.
- Conditional Gaussian: $P(Y \mid X) \sim \mathcal{N}(\mu_{Y|X}; \Sigma_{YY|X})$, with
  $\mu_{Y|X} = \mu_Y + \Sigma_{YX} \Sigma_{XX}^{-1} (X - \mu_X)$
  $\Sigma_{YY|X} = \Sigma_{YY} - \Sigma_{YX} \Sigma_{XX}^{-1} \Sigma_{XY}$
- The conditional Gaussian is linear in X: $P(Y \mid X) \sim \mathcal{N}(\beta_0 + B X; \Sigma_{YY|X})$ with $\beta_0 = \mu_Y - \Sigma_{YX} \Sigma_{XX}^{-1} \mu_X$ and $B = \Sigma_{YX} \Sigma_{XX}^{-1}$.
- Equivalently, a linear regression model $Y = \beta_0 + B X + \epsilon$ with white noise $\epsilon \sim \mathcal{N}(0, \Sigma_{YY|X})$ (a small numerical sketch of these conditioning formulas follows below).
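To make the conditioning formulas above concrete, here is a minimal NumPy sketch (not part of the original slides; the function name, index arguments, and the toy numbers are illustrative choices) that computes the conditional mean and covariance of the Y block of a joint Gaussian given an observed X block:

```python
import numpy as np

def condition_gaussian(mu, Sigma, x_idx, y_idx, x_obs):
    """Condition a joint Gaussian N(mu, Sigma) on observing the X block.

    x_idx / y_idx select the observed and unobserved coordinates,
    x_obs is the observed value of X.  Returns mean and covariance
    of P(Y | X = x_obs).
    """
    mu_x, mu_y = mu[x_idx], mu[y_idx]
    S_xx = Sigma[np.ix_(x_idx, x_idx)]
    S_yx = Sigma[np.ix_(y_idx, x_idx)]
    S_yy = Sigma[np.ix_(y_idx, y_idx)]

    # mu_{Y|X} = mu_Y + Sigma_YX Sigma_XX^{-1} (x - mu_X)
    mu_cond = mu_y + S_yx @ np.linalg.solve(S_xx, x_obs - mu_x)
    # Sigma_{YY|X} = Sigma_YY - Sigma_YX Sigma_XX^{-1} Sigma_XY
    Sigma_cond = S_yy - S_yx @ np.linalg.solve(S_xx, S_yx.T)
    return mu_cond, Sigma_cond

# Toy example: 3-dimensional joint Gaussian, observe the first coordinate.
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
mu_c, Sigma_c = condition_gaussian(mu, Sigma, x_idx=[0], y_idx=[1, 2],
                                   x_obs=np.array([1.2]))
print(mu_c, Sigma_c)  # Sigma_c does not depend on the observed value x_obs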
What is a Gaussian process?
- A Gaussian process is a generalization of the multivariate Gaussian distribution to infinitely many variables.
- Formally: a collection of random variables, any finite number of which have (consistent) Gaussian distributions.
- Informally: an infinitely long vector whose dimensions are indexed by $x$, i.e. a function $f(x)$.
- A Gaussian process is fully specified by a mean function $m(x) = E[f(x)]$ and a covariance function $k(x, x') = E[(f(x) - m(x))(f(x') - m(x'))]$:
  $f(x) \sim GP(m(x), k(x, x'))$, where $x$ ranges over the indices.

A set of samples from a Gaussian process
- For each fixed value of $x$ there is a Gaussian variable associated with it.
- Focus on a finite subset of values $f = (f(x_1), f(x_2), \ldots, f(x_m))^\top$, for which $f \sim \mathcal{N}(0, \Sigma)$ with $\Sigma_{ij} = k(x_i, x_j)$.
- Then plot the coordinates of $f$ as a function of the corresponding $x$ values.

Random function from a Gaussian process
- One-dimensional Gaussian process: $f(x) \sim GP\!\left(0,\ k(x, x') = \exp\!\left(-\tfrac{1}{2}(x - x')^2\right)\right)$.
- To generate a sample from the GP (see the code sketch after this section):
  The Gaussian variables $f_i, f_j$ are indexed by $x_i, x_j$ respectively, and their covariance (the $ij$-th entry of $\Sigma$) is $k(x_i, x_j)$.
  Generate $m$ i.i.d. samples $y = (y_1, \ldots, y_m)^\top \sim \mathcal{N}(0, I)$.
  Transform the sample: $f = (f_1, \ldots, f_m)^\top = \mu + \Sigma^{1/2} y$.

Random function from a Gaussian process (two-dimensional index)
- Now there are two indices, $x$ and $y$, with covariance function
  $k\big((x, y), (x', y')\big) = \exp\!\left(-\frac{(x - x')^2 + (y - y')^2}{2}\right)$.

Gaussian process as a prior
- A Gaussian process is a prior over functions, so we can use it for nonparametric regression: fit a function to noisy observations.
- Gaussian process regression: Gaussian likelihood $y \mid x, f(x) \sim \mathcal{N}(f, \sigma^2_{noise} I)$.
- The parameter is a function $f(x)$, with Gaussian process prior $f(x) \sim GP(m(x) = 0, k(x, x'))$.

Graphical model for a Gaussian process
- Square nodes are observed, round nodes are unobserved (latent); red nodes are training data, blue nodes are test data.
- All pairs of latent variables $f$ are connected.
- The prediction of $y^*$ depends only on the corresponding $f^*$.
- We can do learning and inference based on this graphical model.

Covariance function of Gaussian processes
- For any finite collection of indices $x_1, x_2, \ldots, x_m$, the covariance matrix must be positive semidefinite:
  $\Sigma = \begin{pmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_m) \\ k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_m) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_m, x_1) & k(x_m, x_2) & \cdots & k(x_m, x_m) \end{pmatrix}$
- The covariance function therefore needs to be a kernel function over the indices, e.g. the Gaussian RBF kernel $k(x, x') = \exp\!\left(-\tfrac{1}{2}\|x - x'\|^2\right)$.

Covariance function of a Gaussian process (another example)
- $k(x_i, x_j) = v_0 \exp\!\left(-\left(\frac{|x_i - x_j|}{r}\right)^{\alpha}\right) + v_1 + v_2 \delta_{ij}$
- These kernel parameters are interpretable in the covariance-function context:
  $v_0$: variance scale, $v_1$: variance bias, $v_2$: noise variance, $r$: length scale, $\alpha$: roughness.

Samples from GPs with different kernels
[Figure: sample functions drawn from GPs with different covariance functions]
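The sampling recipe above (build $\Sigma$ from the covariance function, draw $y \sim \mathcal{N}(0, I)$, transform with a square root of $\Sigma$) is easy to implement. A minimal sketch, assuming a squared-exponential kernel and a small jitter term for numerical stability (both my additions, not from the slides), using a Cholesky factor $L$ with $LL^\top = \Sigma$ as the matrix square root:

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0):
    """Squared-exponential covariance k(x, x') = exp(-|x - x'|^2 / (2 l^2))."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def sample_gp_prior(x, kernel, n_samples=3, jitter=1e-6):
    """Draw sample functions f ~ GP(0, k) evaluated at the index points x."""
    Sigma = kernel(x, x) + jitter * np.eye(len(x))  # jitter keeps Sigma numerically PSD
    L = np.linalg.cholesky(Sigma)                   # L L^T = Sigma, plays the role of Sigma^{1/2}
    y = np.random.randn(len(x), n_samples)          # y ~ N(0, I)
    return L @ y                                    # f = mu + Sigma^{1/2} y, with mu = 0

x = np.linspace(-5, 5, 200)
f = sample_gp_prior(x, rbf_kernel)  # each column is one random function; plot against x
```

Swapping in a different covariance function changes the smoothness and character of the sample paths, which is what the "samples from GPs with different kernels" figure illustrates.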
Matern kernel
- $k(x_i, x_j) = \frac{1}{\Gamma(\nu)\, 2^{\nu - 1}} \left(\frac{\sqrt{2\nu}}{\ell} |x_i - x_j|\right)^{\nu} K_\nu\!\left(\frac{\sqrt{2\nu}}{\ell} |x_i - x_j|\right)$
  where $K_\nu$ is the modified Bessel function of the second kind of order $\nu$ and $\ell$ is the length scale.
- Sample functions from a GP with a Matern kernel are $\lceil \nu \rceil - 1$ times differentiable; the hyperparameter $\nu$ controls the smoothness.
- Special cases (let $r = |x_i - x_j|$):
  $k_{\nu = 1/2}(r) = \exp\!\left(-\frac{r}{\ell}\right)$: Laplace kernel, Brownian motion
  $k_{\nu = 3/2}(r) = \left(1 + \frac{\sqrt{3}\, r}{\ell}\right) \exp\!\left(-\frac{\sqrt{3}\, r}{\ell}\right)$ (once differentiable)
  $k_{\nu = 5/2}(r) = \left(1 + \frac{\sqrt{5}\, r}{\ell} + \frac{5 r^2}{3 \ell^2}\right) \exp\!\left(-\frac{\sqrt{5}\, r}{\ell}\right)$ (twice differentiable)
  $k_{\nu \to \infty}(r) = \exp\!\left(-\frac{r^2}{2 \ell^2}\right)$: smooth (infinitely differentiable)

Matern kernel II
[Figure: univariate Matern kernel functions with unit length scale]

Kernels for periodic, smooth functions
- To create a GP over periodic functions, first map the inputs to $u = (\sin x, \cos x)^\top$ and then measure distance in $u$ space. Combined with the squared exponential kernel this gives
  $k(x, x') = \exp\!\left(-\frac{2 \sin^2\!\big(\pi (x - x')\big)}{\ell^2}\right)$
[Figure: three functions drawn at random; left $\ell > 1$, right $\ell < 1$]

Using Gaussian processes for nonlinear regression
- Observe a dataset $D = \{(x_i, y_i)\}_{i=1}^m$.
- The prior $p(f)$ is a Gaussian process; like a multivariate Gaussian, the posterior of $f$ is therefore also a Gaussian process.
- Bayes' rule: $p(f \mid D) = \frac{p(D \mid f)\, p(f)}{p(D)}$
- Everything else about GPs follows the basic rules of probability applied to multivariate Gaussians.

Posterior of a Gaussian process
- Gaussian process regression; for simplicity, noiseless observations $y = f(x)$.
- The parameter is a function $f(x)$, with Gaussian process prior $f(x) \sim GP(m(x) = 0, k(x, x'))$.
- Multivariate Gaussian conditioning:
  $P(Y \mid X) \sim \mathcal{N}(\mu_{Y|X}; \Sigma_{YY|X})$,
  $\mu_{Y|X} = \mu_Y + \Sigma_{YX} \Sigma_{XX}^{-1} (X - \mu_X)$,
  $\Sigma_{YY|X} = \Sigma_{YY} - \Sigma_{YX} \Sigma_{XX}^{-1} \Sigma_{XY}$
- GP posterior: $f(x) \mid \{(x_i, y_i)\}_{i=1}^m \sim GP\big(m_{post}(x), k_{post}(x, x')\big)$, with $Y = (y_1, \ldots, y_m)^\top = (f(x_1), \ldots, f(x_m))^\top$ and
  $m_{post}(x) = 0 + \Sigma_{f(x) Y}\, \Sigma_{YY}^{-1}\, Y$
  $k_{post}(x, x') = k(x, x') - \Sigma_{f(x) Y}\, \Sigma_{YY}^{-1}\, \Sigma_{Y f(x')}$

Prior and posterior GP
- In the noiseless case ($y = f(x)$), the mean function of the posterior GP passes through the training data points.
- The posterior GP has reduced variance, with zero variance at the training points.
[Figure: samples from the prior GP (left) and the posterior GP (right)]

Noisy observations
- Gaussian likelihood $y \mid x, f(x) \sim \mathcal{N}(f, \sigma^2_{noise} I)$.
- $f(x) \mid \{(x_i, y_i)\}_{i=1}^m \sim GP\big(m_{post}(x), k_{post}(x, x')\big)$, with $Y = (y_1, \ldots, y_m)^\top$ and
  $m_{post}(x) = 0 + \Sigma_{f(x) Y}\, \big(\Sigma_{YY} + \sigma^2_{noise} I\big)^{-1} Y$
  $k_{post}(x, x') = k(x, x') - \Sigma_{f(x) Y}\, \big(\Sigma_{YY} + \sigma^2_{noise} I\big)^{-1} \Sigma_{Y f(x')}$
- The covariance function is the kernel function:
  $\Sigma_{f(x) Y} = \big(k(x, x_1), \ldots, k(x, x_m)\big)$, $\Sigma_{YY} = \big[k(x_i, x_j)\big]_{i,j=1}^m$
  (a numerical sketch of these formulas follows below).

Prior and posterior: noisy case
- In the noisy case ($y = f(x) + \epsilon$), the mean function of the posterior GP does not necessarily pass through the training data points.
- The posterior GP has reduced variance.
[Figure: prior and posterior GP samples in the noisy case]
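A minimal sketch of the noisy-observation posterior, assuming an RBF kernel, a toy sine-function dataset, and variable names of my choosing (none of this is from the slides); it implements the posterior mean and covariance formulas above by conditioning on the training observations:

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0):
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_test, kernel, noise_var=0.1):
    """Posterior mean and covariance of f(x_test) given noisy observations.

    m_post(x)     = Sigma_{f(x)Y} (Sigma_YY + sigma^2 I)^{-1} Y
    k_post(x, x') = k(x, x') - Sigma_{f(x)Y} (Sigma_YY + sigma^2 I)^{-1} Sigma_{Y f(x')}
    """
    K_yy = kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_sy = kernel(x_test, x_train)   # Sigma_{f(x) Y}
    K_ss = kernel(x_test, x_test)    # prior covariance at the test points

    mean = K_sy @ np.linalg.solve(K_yy, y_train)
    cov = K_ss - K_sy @ np.linalg.solve(K_yy, K_sy.T)
    return mean, cov

# Toy example: noisy samples of a sine function.
rng = np.random.default_rng(0)
x_train = rng.uniform(-4, 4, size=12)
y_train = np.sin(x_train) + 0.3 * rng.standard_normal(12)
x_test = np.linspace(-5, 5, 200)
mean, cov = gp_posterior(x_train, y_train, x_test, rbf_kernel, noise_var=0.09)
std = np.sqrt(np.clip(np.diag(cov), 0, None))  # pointwise predictive uncertainty
```

With noise_var set to (nearly) zero this reduces to the noiseless case, where the posterior mean interpolates the training points and the posterior variance vanishes there; with noise the mean need not pass through the data, matching the noisy-case figure.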