Gaussian Processes
Le Song
Machine Learning II: Advanced Topics
CSE 8803ML, Spring 2012

Conditional Gaussian (general case)
Joint Gaussian: (X, Y) ~ N(\mu; \Sigma)
Conditional Gaussian: Y | X ~ N(\mu_{Y|X}; \Sigma_{YY|X})
  \mu_{Y|X} = \mu_Y + \Sigma_{YX} \Sigma_{XX}^{-1} (X - \mu_X)
  \Sigma_{YY|X} = \Sigma_{YY} - \Sigma_{YX} \Sigma_{XX}^{-1} \Sigma_{XY}
The conditional Gaussian is linear in X: Y | X ~ N(\beta_0 + B X; \Sigma_{YY|X})
  \beta_0 = \mu_Y - \Sigma_{YX} \Sigma_{XX}^{-1} \mu_X
  B = \Sigma_{YX} \Sigma_{XX}^{-1}
Linear regression model: Y = \beta_0 + B X + \epsilon, with white noise \epsilon ~ N(0, \Sigma_{YY|X})

What is a Gaussian Process?
A Gaussian process is a generalization of a multivariate Gaussian distribution to infinitely many variables.
Formally: a collection of random variables, any finite number of which have (consistent) Gaussian distributions.
Informally: an infinitely long vector whose dimensions are indexed by x, i.e. a function f(x).
A Gaussian process is fully specified by a mean function m(x) = E[f(x)] and a covariance function k(x, x') = E[(f(x) - m(x))(f(x') - m(x'))]:
  f(x) ~ GP(m(x), k(x, x')),  x: indices

Random function from a Gaussian process
One-dimensional Gaussian process: f(x) ~ GP(0, k(x, x') = exp(-\frac{1}{2}(x - x')^2))
To generate a sample from the GP:
  Gaussian variables f_i, f_j are indexed by x_i, x_j respectively, and their covariance (the ij-th entry of \Sigma) is defined by k(x_i, x_j).
  Generate n i.i.d. samples y = (y_1, ..., y_n)^T ~ N(0, I).
  Transform the sample: f = (f_1, ..., f_n)^T = \mu + \Sigma^{1/2} y.

Covariance function of Gaussian processes
For any finite collection of indices x_1, x_2, ..., x_n, the covariance matrix is positive semidefinite:
  \Sigma = K = [ k(x_1, x_1)  k(x_1, x_2)  ...  k(x_1, x_n)
                 k(x_2, x_1)  k(x_2, x_2)  ...  k(x_2, x_n)
                 ...
                 k(x_n, x_1)  k(x_n, x_2)  ...  k(x_n, x_n) ]
The covariance function needs to be a kernel function over the indices!
E.g. the Gaussian RBF kernel k(x, x') = exp(-\frac{1}{2} \|x - x'\|^2).

Samples from GPs with different kernels
  k(x_i, x_j) = v_0 exp(-(|x_i - x_j| / r)^\alpha) + v_1 + v_2 \delta_{ij}
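The sampling procedure above (draw i.i.d. standard Gaussians, then multiply by a square root of \Sigma) is only a few lines of code. A minimal NumPy sketch, using the kernel family from the previous slide with v_1 = v_2 = 0; the grid of index points, the kernel parameters, and the small jitter term added for numerical stability are illustrative choices, not part of the slides:

```python
import numpy as np

def kernel(xi, xj, v0=1.0, r=1.0, alpha=2.0):
    # k(x_i, x_j) = v0 * exp(-(|x_i - x_j| / r)^alpha), with v1 = v2 = 0
    return v0 * np.exp(-(np.abs(xi[:, None] - xj[None, :]) / r) ** alpha)

# Index points x_1, ..., x_n at which the random function is evaluated
x = np.linspace(-5, 5, 100)
K = kernel(x, x) + 1e-8 * np.eye(len(x))   # jitter keeps the Cholesky factorization stable

# Generate i.i.d. samples y ~ N(0, I) and transform: f = mu + Sigma^{1/2} y
L = np.linalg.cholesky(K)                  # any square root of K works; Cholesky is cheapest
y = np.random.randn(len(x), 3)             # three independent draws
f = 0.0 + L @ y                            # mean function m(x) = 0

# Each column of f is one random function drawn from GP(0, k), evaluated on the grid x
```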
Kernels for periodic, smooth functions
To create a GP over periodic functions, we can first map the inputs to u = (sin x, cos x)^T and then measure distances in u space.
Combined with the squared exponential function: k(x, x') = exp(-2 sin^2(\pi (x - x')) / \ell^2)
[Figure: three functions drawn at random; left \ell > 1, right \ell < 1.]

Using Gaussian processes for nonlinear regression
Observe a dataset D = {(x_i, y_i)}_{i=1}^n.
The prior p(f) is a Gaussian process; as in the multivariate Gaussian case, the posterior of f is therefore also a Gaussian process.
Bayes' rule: p(f | D) = p(D | f) p(f) / p(D)
Everything else about GPs follows from the basic rules of probability applied to multivariate Gaussians.

Graphical model for Gaussian processes
Square nodes are observed, round nodes are unobserved (latent).
Red nodes are training data, blue nodes are test data.
All pairs of latent variables (f) are connected.
Prediction of y* depends only on the corresponding f*.
We can do learning and inference based on this graphical model.

Posterior of a Gaussian process
Gaussian process regression; for simplicity, assume noiseless observations y = f(x).
The parameter is a function with a Gaussian process prior: f(x) ~ GP(m(x) = 0, k(x, x')).
GP posterior: f(x) | {(x_i, y_i)}_{i=1}^n ~ GP(m_post(x), k_post(x, x')), with Y = (y_1, ..., y_n)^T = (f(x_1), ..., f(x_n))^T:
  m_post(x) = 0 + \Sigma_{f(x)Y} \Sigma_{YY}^{-1} Y
  k_post(x, x') = \Sigma_{f(x)f(x')} - \Sigma_{f(x)Y} \Sigma_{YY}^{-1} \Sigma_{Y f(x')}

Posterior of a Gaussian process in kernel form
Define kernel matrices:
  k(x, X) := \Sigma_{f(x)Y} = (k(x, x_1), ..., k(x, x_n))
  K := \Sigma_{YY} = [ k(x_1, x_1)  k(x_1, x_2)  ...  k(x_1, x_n)
                       k(x_2, x_1)  k(x_2, x_2)  ...  k(x_2, x_n)
                       ...
                       k(x_n, x_1)  k(x_n, x_2)  ...  k(x_n, x_n) ]
Then we have
  m_post(x) = k(x, X) K^{-1} Y
  k_post(x, x') = k(x, x') - k(x, X) K^{-1} k(x', X)^T

Prior and posterior GP
In the noiseless case (y = f(x)), the mean function of the posterior GP passes through the training data points.
The posterior GP has reduced variance: zero variance at the training points.
[Figure: samples from the prior (left) and from the posterior (right).]

Noisy observations
y | x, f(x) ~ N(f, \sigma_{noise}^2 I); let Y = (y_1, ..., y_n)^T.
f(x) | {(x_i, y_i)}_{i=1}^n ~ GP(m_post(x), k_post(x, x'))
  m_post(x) = k(x, X) (K + \sigma_{noise}^2 I)^{-1} Y
  k_post(x, x') = k(x, x') - k(x, X) (K + \sigma_{noise}^2 I)^{-1} k(x', X)^T

GP: prediction of a new observation
Given a new point x^*, the predictive distribution for y^* is
  p(y^* | x^*, {(x_i, y_i)}_{i=1}^n) = \int p(y^* | f, x^*) p(f | {(x_i, y_i)}_{i=1}^n) df
The predictive distribution is Gaussian: y^* | x^*, {(x_i, y_i)}_{i=1}^n ~ N(\mu_{pred}, \sigma_{pred}^2), with Y = (y_1, ..., y_n)^T:
  \mu_{pred} = k(x^*, X) (K + \sigma_{noise}^2 I)^{-1} Y
  \sigma_{pred}^2 = k(x^*, x^*) - k(x^*, X) (K + \sigma_{noise}^2 I)^{-1} k(x^*, X)^T

Weight space view of GP
Assume a linear regression model: f(x | w) = x^T w, y = f + \epsilon, \epsilon ~ N(0, \sigma^2).
Let Y = (y_1, ..., y_n)^T and X = (x_1, ..., x_n).
Likelihood of the observations: p({y_i}_{i=1}^n | {x_i}_{i=1}^n, w) = N(X^T w, \sigma^2 I).
Assume a Gaussian prior over the parameters: p(w) = N(0, I).
Apply Bayes' theorem to obtain the posterior: p(w | X, Y) \propto p(Y | X, w) p(w).

Weight space view of GP
The posterior distribution over w is
  p(w | X, Y) = N( \frac{1}{\sigma^2} (I + \frac{1}{\sigma^2} X X^T)^{-1} X Y,  (I + \frac{1}{\sigma^2} X X^T)^{-1} )
The predictive distribution is
  p(f^* | x^*, X, Y) = \int p(f(x^*) | w) p(w | X, Y) dw
                     = N( \frac{1}{\sigma^2} x^{*T} (I + \frac{1}{\sigma^2} X X^T)^{-1} X Y,  x^{*T} (I + \frac{1}{\sigma^2} X X^T)^{-1} x^* )
The predictive distribution is in outer-product form; it can be turned into inner-product form using the matrix-inversion lemma.

Weight space view of GP
Predictive distribution:
  N( \frac{1}{\sigma^2} x^{*T} (I + \frac{1}{\sigma^2} X X^T)^{-1} X Y,  x^{*T} (I + \frac{1}{\sigma^2} X X^T)^{-1} x^* )
Equivalent to
  N( x^{*T} X (\sigma^2 I + X^T X)^{-1} Y,  x^{*T} x^* - x^{*T} X (\sigma^2 I + X^T X)^{-1} X^T x^* )
Instead of using the original x, map it to a feature space \phi(x); inner products of features become kernel entries and we recover N(\mu_{pred}, \sigma_{pred}^2) with
  \mu_{pred} = k(x^*, X) (K + \sigma_{noise}^2 I)^{-1} Y
  \sigma_{pred}^2 = k(x^*, x^*) - k(x^*, X) (K + \sigma_{noise}^2 I)^{-1} k(x^*, X)^T
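These kernel-form predictive equations (the same ones on the "GP: prediction of a new observation" slide) translate directly into code. A minimal NumPy sketch, assuming a squared exponential kernel and an illustrative noise variance; both hyperparameters would in practice be tuned, e.g. by the marginal likelihood discussed next:

```python
import numpy as np

def rbf_kernel(A, B, length=1.0):
    # Squared exponential kernel k(x, x') = exp(-0.5 * (x - x')^2 / length^2)
    d = A[:, None] - B[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def gp_predict(x_train, y_train, x_test, noise_var=0.1, length=1.0):
    """Posterior predictive mean and variance of a zero-mean GP with noisy observations."""
    K = rbf_kernel(x_train, x_train, length) + noise_var * np.eye(len(x_train))
    k_star = rbf_kernel(x_test, x_train, length)        # k(x*, X), one row per test point
    alpha = np.linalg.solve(K, y_train)                 # (K + sigma^2 I)^{-1} Y
    mu = k_star @ alpha                                 # predictive mean
    v = np.linalg.solve(K, k_star.T)                    # (K + sigma^2 I)^{-1} k(x*, X)^T
    var = rbf_kernel(x_test, x_test, length).diagonal() - np.sum(k_star * v.T, axis=1)
    return mu, var

# Toy usage: noisy observations of sin(x)
x_train = np.linspace(-3, 3, 20)
y_train = np.sin(x_train) + 0.1 * np.random.randn(20)
mu, var = gp_predict(x_train, y_train, np.linspace(-4, 4, 100))
```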
Model selection
Use the marginal likelihood (evidence) to select and tune hyperparameters in the covariance function:
  P({y_i}_{i=1}^n | {x_i}_{i=1}^n) = \int p({y_i}_{i=1}^n | f, {x_i}_{i=1}^n) p(f) df
An example: k(x_i, x_j) = v_0 exp(-(|x_i - x_j| / r)^\alpha) + v_1
The marginal likelihood is a function of the parameters v:
  P({y_i}_{i=1}^n | {x_i}_{i=1}^n) = N(0, K_v + \sigma^2 I)
  ln p({y_i}_{i=1}^n | {x_i}_{i=1}^n) = -\frac{1}{2} ln det(K_v + \sigma^2 I) - \frac{1}{2} Y^T (K_v + \sigma^2 I)^{-1} Y + const
Optimize it as a function of v.

Automatic relevance determination
We want to automatically decide which inputs are relevant to the output (feature selection).
Use the covariance function
  k(x_i, x_j) = v_0 exp(-\sum_{d=1}^D (x_i^d - x_j^d)^\alpha / r_d) + v_1 + v_2 \delta_{ij}
r_d is the length scale of the function along input dimension d.
As r_d \to \infty, the corresponding feature influences f less.
Use the marginal likelihood ln p({y_i}_{i=1}^n | {x_i}_{i=1}^n) to tune r_d and thereby do feature selection.

GP for classification
Regression model:
  f ~ GP(m(x), k(x, x'))
  y | x, f ~ N(f(x), \sigma_{noise}^2 I)
Classification model: given data {(x_i, y_i)}_{i=1}^n where y_i \in {-1, +1},
  f ~ GP(m(x), k(x, x'))
  y | x, f ~ p(y | f(x))

Relating the GP to class probability
Transform the continuous output of the Gaussian process to a value in [-1, 1] or [0, 1].
With binary outputs, the joint distribution of all variables in the model is no longer Gaussian.
The likelihood is also not Gaussian, so we need approximate inference to compute the posterior GP (Laplace approximation, sampling).

Connection to kernel support vector machines
  min_w \frac{1}{2} \|w\|^2 + C \sum_i \xi_i
  s.t. y_i (w^T \phi(x_i) + b) \geq 1 - \xi_i,  \xi_i \geq 0, \forall i
\xi_i: slack variables.
Can be written equivalently with the hinge loss [1 - y_i (w^T \phi(x_i))]_+ = [1 - y_i f_i]_+.

Connection to kernel support vector machines
The decision function is f(x) = w^T \phi(x) = \sum_i \alpha_i k(x, x_i).
Let the vector f be f(x) evaluated at all training points: f = K \alpha, so \alpha = K^{-1} f and
  \|w\|^2 = \alpha^T K \alpha = f^T K^{-1} f
We can rewrite the kernelized SVM as
  min_f \frac{1}{2} f^T K^{-1} f + C \sum_i [1 - y_i f_i]_+

Connection to kernel support vector machines
We can rewrite the kernelized SVM as
  min_f \frac{1}{2} f^T K^{-1} f + C \sum_i [1 - y_i f_i]_+
We can write the negative log of a GP likelihood as
  \frac{1}{2} f^T K^{-1} f - \sum_i ln p(y_i | f_i) + c
With a Gaussian process we
  handle uncertainty in the unknown function f by averaging, not minimization;
  can learn the kernel parameters and features using the marginal likelihood;
  can incorporate interpretable noise models and priors, and can sample.

Gaussian process latent variable models
GPs can be used for nonlinear dimensionality reduction.
Observe n data points {y_i}_{i=1}^n; assume each dimension y_i^d of the data is modeled by a separate GP using a common low-dimensional input x_i.
Find the best latent inputs x_i by maximizing the marginal likelihood.
Computationally intensive.

Gaussian process latent variable models
Finding the latent variables is a high-dimensional, nonlinear optimization problem with local optima.
GPLVM defines a map from latent to observed space, not a generative model:
  mapping a new latent coordinate to observations is easy;
  finding the latent coordinates for new observations is difficult.

Computation issue of GP
  f(x) | {(x_i, y_i)}_{i=1}^n ~ GP(m_post(x), k_post(x, x'))
  m_post(x) = k(x, X) (K + \sigma_{noise}^2 I)^{-1} Y
  k_post(x, x') = k(x, x') - k(x, X) (K + \sigma_{noise}^2 I)^{-1} k(x', X)^T
Inverting the kernel matrix K is computationally intensive: with n data points the cost is O(n^3).
We need to reduce the computational cost by some type of approximation, K \approx Q Q^T (sketched below, detailed on the next slide).
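A sketch of why such a low-rank factor helps, assuming we are already given Q with K \approx Q Q^T (the next slide describes obtaining it by incomplete Cholesky): the matrix-inversion lemma turns the n x n solve into a k x k solve, dropping the cost from O(n^3) to roughly O(n k^2). The random Q below merely stands in for the output of a real factorization routine:

```python
import numpy as np

def lowrank_gp_mean(Q, y_train, k_star, noise_var):
    """Approximate posterior mean k(x*, X)(Q Q^T + sigma^2 I)^{-1} Y via the
    matrix-inversion lemma: only a k x k system is solved.
    Q: n x k factor with K approx Q Q^T; k_star: n_test x n matrix k(x*, X)."""
    n, k = Q.shape
    # Woodbury: (Q Q^T + s I)^{-1} v = (v - Q (s I + Q^T Q)^{-1} Q^T v) / s
    small = noise_var * np.eye(k) + Q.T @ Q
    alpha = (y_train - Q @ np.linalg.solve(small, Q.T @ y_train)) / noise_var
    return k_star @ alpha

# Hypothetical usage: a random low-rank factor standing in for incomplete Cholesky output
n, k = 1000, 20
Q = np.random.randn(n, k)
y = np.random.randn(n)
k_star = np.random.randn(5, n)
mu = lowrank_gp_mean(Q, y, k_star, noise_var=0.1)
```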
Kernel low rank approximation
Incomplete Cholesky factorization of the kernel matrix K of size n x n gives Q of size n x k, with k \ll n, such that K \approx Q Q^T.
Substituting into the posterior:
  f(x) | {(x_i, y_i)}_{i=1}^n ~ GP(m_post(x), k_post(x, x'))
  m_post(x) = k_x^T (Q Q^T + \sigma_{noise}^2 I)^{-1} Y
  k_post(x, x') = k_{xx'} - k_x^T (Q Q^T + \sigma_{noise}^2 I)^{-1} k_{x'}
where k_x = k(x, X)^T; the inverse can now be applied cheaply through the matrix-inversion lemma.

Sparse nonparametric regression
Support vector regression.

Dual of support vector regression and kernelization
The dual problem uses the data only through inner products.
Prediction for a new point.
Replace the inner products by kernel functions to obtain nonlinear regression.

Collaborative Filtering
R: rating matrix; U: user factors; V: movie factors.
  min_{U,V} f(U, V) = \|R - U V^T\|_F^2
  s.t. U \geq 0, V \geq 0, k \ll m, n
Low-rank matrix approximation approach.
Probabilistic matrix factorization.
Bayesian probabilistic matrix factorization.

Nonparametric effect model
The ratings R_{ij} are generated by a bias m, a user-item compatibility function f_{ij}, and a random effect g_{ij}, plus noise e_{ij}:
  R_{ij} = m + f_{ij} + g_{ij} + e_{ij},  e_{ij} ~ N(0, \sigma^2)
Gaussian process priors for both f and g:
  f ~ GP(0, \Omega \otimes \Sigma)
  g_i ~ GP(0, \tau \Sigma), i = 1, ..., M
Hyperprior on the covariance matrix: an inverse-Wishart process, \Sigma ~ IWP(\kappa, \Sigma_0 + \nu \delta).
Learning with EM.
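For the low-rank matrix approximation approach on the collaborative filtering slide, a minimal alternating least squares sketch. It optimizes only the unconstrained objective \|R - U V^T\|_F^2 on a fully observed matrix; the nonnegativity constraints and the masking of missing ratings used in real recommender data are omitted here:

```python
import numpy as np

def als_factorize(R, k=10, n_iters=50, reg=1e-3):
    """Alternating least squares for min_{U,V} ||R - U V^T||_F^2.
    A small ridge term keeps the k x k solves well conditioned."""
    m, n = R.shape
    U = np.random.randn(m, k)
    V = np.random.randn(n, k)
    for _ in range(n_iters):
        # Fix V, solve for U: U = R V (V^T V + reg I)^{-1}
        U = R @ V @ np.linalg.inv(V.T @ V + reg * np.eye(k))
        # Fix U, solve for V: V = R^T U (U^T U + reg I)^{-1}
        V = R.T @ U @ np.linalg.inv(U.T @ U + reg * np.eye(k))
    return U, V

# Toy usage: recover a rank-5 matrix
U0, V0 = np.random.randn(100, 5), np.random.randn(80, 5)
R = U0 @ V0.T
U, V = als_factorize(R, k=5)
rel_err = np.linalg.norm(R - U @ V.T) / np.linalg.norm(R)
```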