Gaussian Processes
Le Song
Machine Learning II: Advanced Topics, CSE 8803ML, Spring 2012

Pictorial view of embedding distributions
- Transform the entire distribution into its expected features in a feature space, via a feature map.
[Figure: pictorial view of mapping a distribution into feature space]

Embedding distributions: mean
- The mean reduces the entire distribution to a single number.
- Representation power is very restricted (1D feature space).

Embedding distributions: mean + variance
- Mean and variance reduce the entire distribution to two numbers (2D feature space).
- Richer representation, but still not enough.

Embedding with kernel features
- Transform the distribution into an infinite-dimensional vector in feature space: mean, variance, and higher-order moments.
- Rich representation.

Finite sample approximation of the embedding
[Figure: empirical average of feature-mapped samples approximates the embedding]

Estimating embedding distance
- Finite sample estimator: form a kernel matrix with 4 blocks and average each block.

Measuring dependence via embeddings
- Use the squared distance between embeddings to measure dependence between X and Y.
- The feature-space embedding collects the mean of X, the mean of Y, their covariance, and higher-order features.

Estimating embedding distances
- Given samples $(x_1, y_1), \ldots, (x_m, y_m) \sim P(X, Y)$.
- The dependence measure can be expressed via inner products:
  $\|\mu_{XY} - \mu_X \otimes \mu_Y\|^2 = \|E_{XY}[\phi(X) \otimes \phi(Y)] - E_X[\phi(X)] \otimes E_Y[\phi(Y)]\|^2$
  $= \langle \mu_{XY}, \mu_{XY} \rangle - 2\langle \mu_{XY}, \mu_X \otimes \mu_Y \rangle + \langle \mu_X \otimes \mu_Y, \mu_X \otimes \mu_Y \rangle$
- Kernel matrix operation: $\frac{1}{m^2}\,\mathrm{trace}(K H L H)$, where $H = I - \frac{1}{m}\mathbf{1}\mathbf{1}^\top$, $K_{ij} = k(x_i, x_j)$ and $L_{ij} = k(y_i, y_j)$.
- The X and Y data are ordered in the same way (paired samples).

Application of kernel distance measure
[Figure: applications of the kernel distance measure]

Multivariate Gaussians
- $p(X_1, X_2, \ldots, X_n) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right)$
- Mean vector: $\mu_i = E[X_i]$, $\mu = (\mu_1, \mu_2, \ldots, \mu_n)^\top$.
- Covariance matrix: $\Sigma_{ij} = E[(X_i - \mu_i)(X_j - \mu_j)]$, e.g.
  $\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \sigma_{13} \\ \sigma_{21} & \sigma_2^2 & \sigma_{23} \\ \sigma_{31} & \sigma_{32} & \sigma_3^2 \end{pmatrix}$

Conditioning on a Gaussian
- Joint Gaussian: $P(X, Y) \sim \mathcal{N}(\mu; \Sigma)$.
- Conditioning a Gaussian variable Y on another Gaussian variable X still gives a Gaussian: $P(Y \mid X) \sim \mathcal{N}(\mu_{Y|X}; \sigma^2_{Y|X})$, with
  $\mu_{Y|X} = \mu_Y + \frac{\sigma_{XY}}{\sigma_X^2}(X - \mu_X)$ (prior mean plus a correction from the new observation)
  $\sigma^2_{Y|X} = \sigma_Y^2 - \frac{\sigma_{XY}^2}{\sigma_X^2}$ (prior variance minus a reduction)
- The posterior variance does not depend on the particular observed value: observing X always decreases the variance.

Conditional Gaussian is a linear model
- Conditional linear Gaussian: $P(Y \mid X) \sim \mathcal{N}(\mu_{Y|X}; \sigma^2_{Y|X})$ with $\mu_{Y|X} = \mu_Y + \frac{\sigma_{XY}}{\sigma_X^2}(X - \mu_X)$, i.e. $P(Y \mid X) \sim \mathcal{N}(\beta_0 + \beta X; \sigma^2_{Y|X})$.
- The ridge in the figure is the line $\beta_0 + \beta X$; a slice at a particular X gives a Gaussian.
- All these Gaussian slices have the same variance $\sigma^2_{Y|X} = \sigma_Y^2 - \frac{\sigma_{XY}^2}{\sigma_X^2}$.

Conditional Gaussian (general case)
- Joint Gaussian: $P(X, Y) \sim \mathcal{N}(\mu; \Sigma)$.
- Conditional Gaussian: $P(Y \mid X) \sim \mathcal{N}(\mu_{Y|X}; \Sigma_{YY|X})$, with
  $\mu_{Y|X} = \mu_Y + \Sigma_{YX} \Sigma_{XX}^{-1} (X - \mu_X)$
  $\Sigma_{YY|X} = \Sigma_{YY} - \Sigma_{YX} \Sigma_{XX}^{-1} \Sigma_{XY}$
- The conditional Gaussian is linear in X: $P(Y \mid X) \sim \mathcal{N}(\beta_0 + B X; \Sigma_{YY|X})$ with $\beta_0 = \mu_Y - \Sigma_{YX} \Sigma_{XX}^{-1} \mu_X$ and $B = \Sigma_{YX} \Sigma_{XX}^{-1}$.
- Equivalently, a linear regression model $Y = \beta_0 + B X + \epsilon$ with white noise $\epsilon \sim \mathcal{N}(0, \Sigma_{YY|X})$ (a small numerical sketch of these conditioning formulas follows below).
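To make the conditioning formulas above concrete, here is a minimal NumPy sketch (not part of the original slides; the function name, index arguments, and the toy numbers are illustrative choices) that computes the conditional mean and covariance of the Y block of a joint Gaussian given an observed X block:

```python
import numpy as np

def condition_gaussian(mu, Sigma, x_idx, y_idx, x_obs):
    """Condition a joint Gaussian N(mu, Sigma) on observing the X block.

    x_idx / y_idx select the observed and unobserved coordinates,
    x_obs is the observed value of X.  Returns mean and covariance
    of P(Y | X = x_obs).
    """
    mu_x, mu_y = mu[x_idx], mu[y_idx]
    S_xx = Sigma[np.ix_(x_idx, x_idx)]
    S_yx = Sigma[np.ix_(y_idx, x_idx)]
    S_yy = Sigma[np.ix_(y_idx, y_idx)]

    # mu_{Y|X} = mu_Y + Sigma_YX Sigma_XX^{-1} (x - mu_X)
    mu_cond = mu_y + S_yx @ np.linalg.solve(S_xx, x_obs - mu_x)
    # Sigma_{YY|X} = Sigma_YY - Sigma_YX Sigma_XX^{-1} Sigma_XY
    Sigma_cond = S_yy - S_yx @ np.linalg.solve(S_xx, S_yx.T)
    return mu_cond, Sigma_cond

# Toy example: 3-dimensional joint Gaussian, observe the first coordinate.
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
mu_c, Sigma_c = condition_gaussian(mu, Sigma, x_idx=[0], y_idx=[1, 2],
                                   x_obs=np.array([1.2]))
print(mu_c, Sigma_c)  # Sigma_c does not depend on the observed value x_obs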
What is a Gaussian process?
- A Gaussian process is a generalization of the multivariate Gaussian distribution to infinitely many variables.
- Formally: a collection of random variables, any finite number of which have (consistent) Gaussian distributions.
- Informally: an infinitely long vector whose dimensions are indexed by $x$, i.e. a function $f(x)$.
- A Gaussian process is fully specified by a mean function $m(x) = E[f(x)]$ and a covariance function $k(x, x') = E[(f(x) - m(x))(f(x') - m(x'))]$:
  $f(x) \sim GP(m(x), k(x, x'))$, where $x$ ranges over the indices.

A set of samples from a Gaussian process
- For each fixed value of $x$ there is a Gaussian variable associated with it.
- Focus on a finite subset of values $f = (f(x_1), f(x_2), \ldots, f(x_m))^\top$, for which $f \sim \mathcal{N}(0, \Sigma)$ with $\Sigma_{ij} = k(x_i, x_j)$.
- Then plot the coordinates of $f$ as a function of the corresponding $x$ values.

Random function from a Gaussian process
- One-dimensional Gaussian process: $f(x) \sim GP\!\left(0,\ k(x, x') = \exp\!\left(-\tfrac{1}{2}(x - x')^2\right)\right)$.
- To generate a sample from the GP (see the code sketch after this section):
  The Gaussian variables $f_i, f_j$ are indexed by $x_i, x_j$ respectively, and their covariance (the $ij$-th entry of $\Sigma$) is $k(x_i, x_j)$.
  Generate $m$ i.i.d. samples $y = (y_1, \ldots, y_m)^\top \sim \mathcal{N}(0, I)$.
  Transform the sample: $f = (f_1, \ldots, f_m)^\top = \mu + \Sigma^{1/2} y$.

Random function from a Gaussian process (two-dimensional index)
- Now there are two indices, $x$ and $y$, with covariance function
  $k\big((x, y), (x', y')\big) = \exp\!\left(-\frac{(x - x')^2 + (y - y')^2}{2}\right)$.

Gaussian process as a prior
- A Gaussian process is a prior over functions, so we can use it for nonparametric regression: fit a function to noisy observations.
- Gaussian process regression: Gaussian likelihood $y \mid x, f(x) \sim \mathcal{N}(f, \sigma^2_{noise} I)$.
- The parameter is a function $f(x)$, with Gaussian process prior $f(x) \sim GP(m(x) = 0, k(x, x'))$.

Graphical model for a Gaussian process
- Square nodes are observed, round nodes are unobserved (latent); red nodes are training data, blue nodes are test data.
- All pairs of latent variables $f$ are connected.
- The prediction of $y^*$ depends only on the corresponding $f^*$.
- We can do learning and inference based on this graphical model.

Covariance function of Gaussian processes
- For any finite collection of indices $x_1, x_2, \ldots, x_m$, the covariance matrix must be positive semidefinite:
  $\Sigma = \begin{pmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_m) \\ k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_m) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_m, x_1) & k(x_m, x_2) & \cdots & k(x_m, x_m) \end{pmatrix}$
- The covariance function therefore needs to be a kernel function over the indices, e.g. the Gaussian RBF kernel $k(x, x') = \exp\!\left(-\tfrac{1}{2}\|x - x'\|^2\right)$.

Covariance function of a Gaussian process (another example)
- $k(x_i, x_j) = v_0 \exp\!\left(-\left(\frac{|x_i - x_j|}{r}\right)^{\alpha}\right) + v_1 + v_2 \delta_{ij}$
- These kernel parameters are interpretable in the covariance-function context:
  $v_0$: variance scale, $v_1$: variance bias, $v_2$: noise variance, $r$: length scale, $\alpha$: roughness.

Samples from GPs with different kernels
[Figure: sample functions drawn from GPs with different covariance functions]
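The sampling recipe above (build $\Sigma$ from the covariance function, draw $y \sim \mathcal{N}(0, I)$, transform with a square root of $\Sigma$) is easy to implement. A minimal sketch, assuming a squared-exponential kernel and a small jitter term for numerical stability (both my additions, not from the slides), using a Cholesky factor $L$ with $LL^\top = \Sigma$ as the matrix square root:

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0):
    """Squared-exponential covariance k(x, x') = exp(-|x - x'|^2 / (2 l^2))."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def sample_gp_prior(x, kernel, n_samples=3, jitter=1e-6):
    """Draw sample functions f ~ GP(0, k) evaluated at the index points x."""
    Sigma = kernel(x, x) + jitter * np.eye(len(x))  # jitter keeps Sigma numerically PSD
    L = np.linalg.cholesky(Sigma)                   # L L^T = Sigma, plays the role of Sigma^{1/2}
    y = np.random.randn(len(x), n_samples)          # y ~ N(0, I)
    return L @ y                                    # f = mu + Sigma^{1/2} y, with mu = 0

x = np.linspace(-5, 5, 200)
f = sample_gp_prior(x, rbf_kernel)  # each column is one random function; plot against x
```

Swapping in a different covariance function changes the smoothness and character of the sample paths, which is what the "samples from GPs with different kernels" figure illustrates.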
Matern kernel
- $k(x_i, x_j) = \frac{1}{\Gamma(\nu)\, 2^{\nu - 1}} \left(\frac{\sqrt{2\nu}}{\ell} |x_i - x_j|\right)^{\nu} K_\nu\!\left(\frac{\sqrt{2\nu}}{\ell} |x_i - x_j|\right)$
  where $K_\nu$ is the modified Bessel function of the second kind of order $\nu$ and $\ell$ is the length scale.
- Sample functions from a GP with a Matern kernel are $\lceil \nu \rceil - 1$ times differentiable; the hyperparameter $\nu$ controls the smoothness.
- Special cases (let $r = |x_i - x_j|$):
  $k_{\nu = 1/2}(r) = \exp\!\left(-\frac{r}{\ell}\right)$: Laplace kernel, Brownian motion
  $k_{\nu = 3/2}(r) = \left(1 + \frac{\sqrt{3}\, r}{\ell}\right) \exp\!\left(-\frac{\sqrt{3}\, r}{\ell}\right)$ (once differentiable)
  $k_{\nu = 5/2}(r) = \left(1 + \frac{\sqrt{5}\, r}{\ell} + \frac{5 r^2}{3 \ell^2}\right) \exp\!\left(-\frac{\sqrt{5}\, r}{\ell}\right)$ (twice differentiable)
  $k_{\nu \to \infty}(r) = \exp\!\left(-\frac{r^2}{2 \ell^2}\right)$: smooth (infinitely differentiable)

Matern kernel II
[Figure: univariate Matern kernel functions with unit length scale]

Kernels for periodic, smooth functions
- To create a GP over periodic functions, first map the inputs to $u = (\sin x, \cos x)^\top$ and then measure distance in $u$ space. Combined with the squared exponential kernel this gives
  $k(x, x') = \exp\!\left(-\frac{2 \sin^2\!\big(\pi (x - x')\big)}{\ell^2}\right)$
[Figure: three functions drawn at random; left $\ell > 1$, right $\ell < 1$]

Using Gaussian processes for nonlinear regression
- Observe a dataset $D = \{(x_i, y_i)\}_{i=1}^m$.
- The prior $p(f)$ is a Gaussian process; like a multivariate Gaussian, the posterior of $f$ is therefore also a Gaussian process.
- Bayes' rule: $p(f \mid D) = \frac{p(D \mid f)\, p(f)}{p(D)}$
- Everything else about GPs follows the basic rules of probability applied to multivariate Gaussians.

Posterior of a Gaussian process
- Gaussian process regression; for simplicity, noiseless observations $y = f(x)$.
- The parameter is a function $f(x)$, with Gaussian process prior $f(x) \sim GP(m(x) = 0, k(x, x'))$.
- Multivariate Gaussian conditioning:
  $P(Y \mid X) \sim \mathcal{N}(\mu_{Y|X}; \Sigma_{YY|X})$,
  $\mu_{Y|X} = \mu_Y + \Sigma_{YX} \Sigma_{XX}^{-1} (X - \mu_X)$,
  $\Sigma_{YY|X} = \Sigma_{YY} - \Sigma_{YX} \Sigma_{XX}^{-1} \Sigma_{XY}$
- GP posterior: $f(x) \mid \{(x_i, y_i)\}_{i=1}^m \sim GP\big(m_{post}(x), k_{post}(x, x')\big)$, with $Y = (y_1, \ldots, y_m)^\top = (f(x_1), \ldots, f(x_m))^\top$ and
  $m_{post}(x) = 0 + \Sigma_{f(x) Y}\, \Sigma_{YY}^{-1}\, Y$
  $k_{post}(x, x') = k(x, x') - \Sigma_{f(x) Y}\, \Sigma_{YY}^{-1}\, \Sigma_{Y f(x')}$

Prior and posterior GP
- In the noiseless case ($y = f(x)$), the mean function of the posterior GP passes through the training data points.
- The posterior GP has reduced variance, with zero variance at the training points.
[Figure: samples from the prior GP (left) and the posterior GP (right)]

Noisy observations
- Gaussian likelihood $y \mid x, f(x) \sim \mathcal{N}(f, \sigma^2_{noise} I)$.
- $f(x) \mid \{(x_i, y_i)\}_{i=1}^m \sim GP\big(m_{post}(x), k_{post}(x, x')\big)$, with $Y = (y_1, \ldots, y_m)^\top$ and
  $m_{post}(x) = 0 + \Sigma_{f(x) Y}\, \big(\Sigma_{YY} + \sigma^2_{noise} I\big)^{-1} Y$
  $k_{post}(x, x') = k(x, x') - \Sigma_{f(x) Y}\, \big(\Sigma_{YY} + \sigma^2_{noise} I\big)^{-1} \Sigma_{Y f(x')}$
- The covariance function is the kernel function:
  $\Sigma_{f(x) Y} = \big(k(x, x_1), \ldots, k(x, x_m)\big)$, $\Sigma_{YY} = \big[k(x_i, x_j)\big]_{i,j=1}^m$
  (a numerical sketch of these formulas follows below).

Prior and posterior: noisy case
- In the noisy case ($y = f(x) + \epsilon$), the mean function of the posterior GP does not necessarily pass through the training data points.
- The posterior GP has reduced variance.
[Figure: prior and posterior GP samples in the noisy case]
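A minimal sketch of the noisy-observation posterior, assuming an RBF kernel, a toy sine-function dataset, and variable names of my choosing (none of this is from the slides); it implements the posterior mean and covariance formulas above by conditioning on the training observations:

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0):
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_test, kernel, noise_var=0.1):
    """Posterior mean and covariance of f(x_test) given noisy observations.

    m_post(x)     = Sigma_{f(x)Y} (Sigma_YY + sigma^2 I)^{-1} Y
    k_post(x, x') = k(x, x') - Sigma_{f(x)Y} (Sigma_YY + sigma^2 I)^{-1} Sigma_{Y f(x')}
    """
    K_yy = kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_sy = kernel(x_test, x_train)   # Sigma_{f(x) Y}
    K_ss = kernel(x_test, x_test)    # prior covariance at the test points

    mean = K_sy @ np.linalg.solve(K_yy, y_train)
    cov = K_ss - K_sy @ np.linalg.solve(K_yy, K_sy.T)
    return mean, cov

# Toy example: noisy samples of a sine function.
rng = np.random.default_rng(0)
x_train = rng.uniform(-4, 4, size=12)
y_train = np.sin(x_train) + 0.3 * rng.standard_normal(12)
x_test = np.linspace(-5, 5, 200)
mean, cov = gp_posterior(x_train, y_train, x_test, rbf_kernel, noise_var=0.09)
std = np.sqrt(np.clip(np.diag(cov), 0, None))  # pointwise predictive uncertainty
```

With noise_var set to (nearly) zero this reduces to the noiseless case, where the posterior mean interpolates the training points and the posterior variance vanishes there; with noise the mean need not pass through the data, matching the noisy-case figure.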