Gaussian Processes
Le Song
Machine Learning II: Advanced Topics
CSE 8803ML, Spring 2012
Pictorial view of embedding distribution
Transform the entire distribution to its expected features.
[Figure: a distribution is mapped by the feature map $\phi(x)$ into the feature space, where it is summarized by its expected feature vector.]
Embedding Distributions: Mean
The mean reduces the entire distribution to a single number: a 1D feature space. Its representation power is very restricted.
Embedding Distributions: Mean + Variance
The mean and variance reduce the entire distribution to two numbers: a 2D feature space. This is a richer representation, but still not enough.
Embedding with kernel features
Transform the distribution to an infinite-dimensional feature vector that captures the mean, the variance, and all higher-order moments: a rich representation in feature space.
Finite sample approximation of embedding
Given samples $x_1, \dots, x_m \sim P(X)$, the embedding $\mu_X = E_X[\phi(X)]$ is approximated by the empirical average $\hat{\mu}_X = \frac{1}{m} \sum_{i=1}^m \phi(x_i)$.
Estimating embedding distance
Given samples $\{x_i\}_{i=1}^m \sim P$ and $\{y_j\}_{j=1}^n \sim Q$, the finite sample estimator of the squared embedding distance $\|\mu_X - \mu_Y\|^2$ forms a kernel matrix over the pooled samples with 4 blocks and averages each block:
$\|\hat{\mu}_X - \hat{\mu}_Y\|^2 = \frac{1}{m^2} \sum_{i,j} k(x_i, x_j) - \frac{2}{mn} \sum_{i,j} k(x_i, y_j) + \frac{1}{n^2} \sum_{i,j} k(y_i, y_j)$
Average the $XX$ block and the $YY$ block, and subtract twice the average of the cross block. A numerical sketch follows.
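A minimal sketch of this estimator in NumPy, assuming a Gaussian RBF kernel; the function and variable names are illustrative, not from the slides:

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Gaussian RBF kernel matrix: k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def mmd2(X, Y, gamma=0.5):
    """Biased estimate of ||mu_X - mu_Y||^2: average the two diagonal
    blocks of the pooled kernel matrix, subtract twice the average of
    the cross block."""
    return (rbf_kernel(X, X, gamma).mean()
            + rbf_kernel(Y, Y, gamma).mean()
            - 2 * rbf_kernel(X, Y, gamma).mean())

X = np.random.randn(200, 1)         # samples from P
Y = np.random.randn(200, 1) + 1.0   # samples from Q (shifted mean)
print(mmd2(X, Y))                   # near 0 only if P and Q match
```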
Measure Dependence via Embeddings
Use the squared distance $\|\mu_{XY} - \mu_X \otimes \mu_Y\|^2$ to measure the dependence between X and Y.
[Figure: in feature space, the joint embedding captures the X mean, the Y mean, the covariance, and higher-order features; the distance compares it to the product of the marginal embeddings.]
Estimating embedding distances
Given samples $(x_1, y_1), \dots, (x_m, y_m) \sim P(X, Y)$.
The dependence measure can be expressed as inner products:
$\|\mu_{XY} - \mu_X \otimes \mu_Y\|^2 = \|E_{XY}[\phi(X) \otimes \phi(Y)] - E_X[\phi(X)] \otimes E_Y[\phi(Y)]\|^2$
$= \langle \mu_{XY}, \mu_{XY} \rangle - 2 \langle \mu_{XY}, \mu_X \otimes \mu_Y \rangle + \langle \mu_X \otimes \mu_Y, \mu_X \otimes \mu_Y \rangle$
Kernel matrix operation, with centering matrix $H = I - \frac{1}{m} \mathbf{1}\mathbf{1}^\top$:
$\frac{1}{m^2} \mathrm{trace}(K H L H)$, where $K_{ij} = k(x_i, x_j)$ and $L_{ij} = k(y_i, y_j)$.
The X and Y data must be ordered in the same way (paired samples). A numerical sketch follows.
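A minimal, self-contained sketch of the trace formula above; the helper name and the test data are illustrative:

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    """Gaussian RBF kernel matrix: k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def hsic(X, Y, gamma=0.5):
    """Dependence estimate trace(K H L H) / m^2 for paired samples."""
    m = X.shape[0]
    K = rbf(X, X, gamma)                     # K_ij = k(x_i, x_j)
    L = rbf(Y, Y, gamma)                     # L_ij = k(y_i, y_j)
    H = np.eye(m) - np.ones((m, m)) / m      # centering: I - (1/m) 1 1^T
    return np.trace(K @ H @ L @ H) / m**2

X = np.random.randn(300, 1)
Y_dep = X**2 + 0.1 * np.random.randn(300, 1)   # Y depends on X
Y_ind = np.random.randn(300, 1)                # Y independent of X
print(hsic(X, Y_dep), hsic(X, Y_ind))          # first value is larger
```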
Application of kernel distance measure
Multivariate Gaussians
$p(X_1, X_2, \dots, X_n) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right)$
Mean vector: $\mu_i = E[X_i]$, so $\mu = (\mu_1, \mu_2, \dots, \mu_n)^\top$
Covariance matrix: $\Sigma_{ij} = E[(X_i - \mu_i)(X_j - \mu_j)]$, e.g. for $n = 3$:
$\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \sigma_{13} \\ \sigma_{21} & \sigma_2^2 & \sigma_{23} \\ \sigma_{31} & \sigma_{32} & \sigma_3^2 \end{pmatrix}$
Conditioning on a Gaussian
Joint Gaussian: $p(X, Y) \sim \mathcal{N}(\mu; \Sigma)$
Conditioning a Gaussian variable Y on another Gaussian variable X still gives a Gaussian:
$p(Y|X) \sim \mathcal{N}(\mu_{Y|X}; \sigma^2_{Y|X})$
$\mu_{Y|X} = \mu_Y + \frac{\sigma_{XY}}{\sigma_X^2}(X - \mu_X)$ (prior mean corrected by the new observation)
$\sigma^2_{Y|X} = \sigma_Y^2 - \frac{\sigma_{XY}^2}{\sigma_X^2}$ (prior variance reduced)
The posterior variance does not depend on the particular observed value: observing X always decreases the variance.
Conditional Gaussian is a linear model
Conditional linear Gaussian:
$p(Y|X) \sim \mathcal{N}(\mu_{Y|X}; \sigma^2_{Y|X})$, with $\mu_{Y|X} = \mu_Y + \frac{\sigma_{XY}}{\sigma_X^2}(X - \mu_X)$
Equivalently, $p(Y|X) \sim \mathcal{N}(\beta_0 + \beta X; \sigma^2_{Y|X})$
The ridge in the figure is the line $\beta_0 + \beta X$.
If we make a slice at a particular X, we get a Gaussian.
All these Gaussian slices have the same variance $\sigma^2_{Y|X} = \sigma_Y^2 - \frac{\sigma_{XY}^2}{\sigma_X^2}$.
Conditional Gaussian (general case)
Joint Gaussian: $p(X, Y) \sim \mathcal{N}(\mu; \Sigma)$
Conditional Gaussian: $p(Y|X) \sim \mathcal{N}(\mu_{Y|X}; \Sigma_{YY|X})$
$\mu_{Y|X} = \mu_Y + \Sigma_{YX} \Sigma_{XX}^{-1} (X - \mu_X)$
$\Sigma_{YY|X} = \Sigma_{YY} - \Sigma_{YX} \Sigma_{XX}^{-1} \Sigma_{XY}$
The conditional Gaussian is linear in X: $p(Y|X) \sim \mathcal{N}(\beta_0 + BX; \Sigma_{YY|X})$, with
$\beta_0 = \mu_Y - \Sigma_{YX} \Sigma_{XX}^{-1} \mu_X$ and $B = \Sigma_{YX} \Sigma_{XX}^{-1}$
This is a linear regression model $Y = \beta_0 + BX + \epsilon$ with white noise $\epsilon \sim \mathcal{N}(0, \Sigma_{YY|X})$. A small numerical sketch follows.
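A minimal sketch of these conditioning formulas in NumPy; the block matrices in the usage example are illustrative:

```python
import numpy as np

def condition_gaussian(mu_x, mu_y, Sxx, Sxy, Syy, x_obs):
    """Mean and covariance of Y | X = x_obs for a joint Gaussian,
    where Sxy = Sigma_XY (so Sigma_YX = Sxy.T)."""
    B = Sxy.T @ np.linalg.inv(Sxx)        # B = Sigma_YX Sigma_XX^{-1}
    mu_cond = mu_y + B @ (x_obs - mu_x)   # linear in the observation
    S_cond = Syy - B @ Sxy                # does not depend on x_obs
    return mu_cond, S_cond

# 2-D joint Gaussian: one X dimension, one Y dimension
mu_x, mu_y = np.array([0.0]), np.array([1.0])
Sxx = np.array([[1.0]])
Sxy = np.array([[0.8]])   # Sigma_XY
Syy = np.array([[2.0]])
print(condition_gaussian(mu_x, mu_y, Sxx, Sxy, Syy, np.array([1.5])))
# mean shifts toward the observation; variance drops to 2.0 - 0.64
```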
What is a Gaussian Process?
A Gaussian process is a generalization of a multivariate Gaussian distribution to infinitely many variables.
Formally: a collection of random variables, any finite number of which have (consistent) Gaussian distributions.
Informally: an infinitely long vector whose dimensions are indexed by $x$, i.e. a function $f(x)$.
A Gaussian process is fully specified by a mean function $m(x) = E[f(x)]$ and a covariance function $k(x, x') = E[(f(x) - m(x))(f(x') - m(x'))]$:
$f(x) \sim GP(m(x), k(x, x'))$, with $x$ ranging over the indices.
A set of samples from a Gaussian process
For each fixed value of $x$, there is a Gaussian variable associated with it.
Focus on a finite subset of values $f = (f(x_1), f(x_2), \dots, f(x_m))^\top$, for which $f \sim \mathcal{N}(0, \Sigma)$ where $\Sigma_{ij} = k(x_i, x_j)$.
Then plot the coordinates of $f$ as a function of the corresponding $x$ values.
Random function from a Gaussian process
A one-dimensional Gaussian process:
$f(x) \sim GP\left(0,\; k(x, x') = \exp\left(-\frac{1}{2}(x - x')^2\right)\right)$
To generate a sample from the GP:
The Gaussian variables $f_i, f_j$ are indexed by $x_i, x_j$ respectively, and their covariance (the $ij$-th entry in $\Sigma$) is defined by $k(x_i, x_j)$.
Generate $m$ iid samples $y = (y_1, \dots, y_m)^\top \sim \mathcal{N}(0; I)$.
Transform the sample: $f = (f_1, \dots, f_m)^\top = \mu + \Sigma^{1/2} y$. (A code sketch follows.)
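A minimal sketch of this recipe, assuming the RBF covariance above; a Cholesky factor serves as $\Sigma^{1/2}$, with a small diagonal jitter added for numerical stability:

```python
import numpy as np

x = np.linspace(-5, 5, 200)                           # index points x_1..x_m
Sigma = np.exp(-0.5 * (x[:, None] - x[None, :])**2)   # Sigma_ij = k(x_i, x_j)

# Cholesky factor acts as Sigma^{1/2}; jitter keeps it positive definite
L = np.linalg.cholesky(Sigma + 1e-10 * np.eye(len(x)))

y = np.random.randn(len(x), 3)   # three iid N(0, I) sample vectors
f = L @ y                        # f = mu + Sigma^{1/2} y, with mu = 0
# each column of f, plotted against x, is one random function from the GP
```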
Random function from a Gaussian process
Now there are two indices, $x$ and $y$, with covariance function
$k\left((x, y), (x', y')\right) = \exp\left( -\frac{(x - x')^2 + (y - y')^2}{2} \right)$
Gaussian process as a prior
A Gaussian process is a prior over functions; we can use it for nonparametric regression: fit a function to noisy observations.
Gaussian process regression:
Gaussian likelihood: $y \mid x, f(x) \sim \mathcal{N}(f, \sigma^2_{noise} I)$
The parameter is a function with a Gaussian process prior: $f(x) \sim GP(m(x) = 0, k(x, x'))$
Graphical model for Gaussian Process
Square nodes are observed, round nodes unobserved (latent).
Red nodes are training data, blue nodes are test data.
All pairs of latent variables ($f$) are connected.
Prediction of $y^*$ depends only on the corresponding $f^*$.
We can do learning and inference based on this graphical model.
Covariance function of Gaussian processes
For any finite collection of indices $x_1, x_2, \dots, x_m$, the covariance matrix is positive semidefinite:
$\Sigma = \begin{pmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_m) \\ k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_m) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_m, x_1) & k(x_m, x_2) & \cdots & k(x_m, x_m) \end{pmatrix}$
The covariance function needs to be a kernel function over the indices!
E.g. the Gaussian RBF kernel $k(x, x') = \exp\left(-\frac{1}{2}\|x - x'\|^2\right)$. (A quick numerical check follows.)
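A quick numerical check of this claim with the RBF kernel, as a sketch; any valid kernel would behave the same way:

```python
import numpy as np

x = np.random.randn(50)                                # arbitrary finite indices
Sigma = np.exp(-0.5 * (x[:, None] - x[None, :])**2)    # Sigma_ij = k(x_i, x_j)
print(np.linalg.eigvalsh(Sigma).min())  # >= 0 up to floating-point round-off
```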
Covariance function of Gaussian process
Another example:
$k(x_i, x_j) = v_0 \exp\left( -\left( \frac{|x_i - x_j|}{r} \right)^{\alpha} \right) + v_1 + v_2 \delta_{ij}$
These kernel parameters are interpretable in the covariance function context (a short implementation sketch follows):
$v_0$: variance scale
$v_1$: bias variance
$v_2$: noise variance
$r$: length scale
$\alpha$: roughness
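A minimal sketch of this covariance function; the parameter names follow the slide, and the Kronecker delta is implemented by comparing indices rather than values:

```python
import numpy as np

def k_entry(i, j, x, v0=1.0, v1=0.1, v2=0.01, r=1.0, alpha=2.0):
    """Covariance between f(x_i) and f(x_j) for the kernel above."""
    decay = v0 * np.exp(-(np.abs(x[i] - x[j]) / r)**alpha)  # variance scale
    return decay + v1 + v2 * (i == j)   # bias variance + noise on the diagonal

x = np.linspace(0, 5, 6)
Sigma = np.array([[k_entry(i, j, x) for j in range(6)] for i in range(6)])
```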
Samples from GPs with different kernels
Matern kernel
$k(x_i, x_j) = \frac{1}{\Gamma(\nu)\, 2^{\nu - 1}} \left( \frac{\sqrt{2\nu}}{\ell} |x_i - x_j| \right)^{\nu} K_\nu\left( \frac{\sqrt{2\nu}}{\ell} |x_i - x_j| \right)$
$K_\nu$ is the modified Bessel function of the second kind of order $\nu$; $\ell$ is the length scale.
Sample functions from a GP with the Matern kernel are $\lceil \nu \rceil - 1$ times differentiable; the hyperparameter $\nu$ controls smoothness.
Special cases (let $r = |x_i - x_j|$; a code sketch follows):
$k_{\nu = 1/2}(r) = \exp\left( -\frac{r}{\ell} \right)$: Laplace kernel, Brownian motion
$k_{\nu = 3/2}(r) = \left( 1 + \frac{\sqrt{3}\, r}{\ell} \right) \exp\left( -\frac{\sqrt{3}\, r}{\ell} \right)$: once differentiable
$k_{\nu = 5/2}(r) = \left( 1 + \frac{\sqrt{5}\, r}{\ell} + \frac{5 r^2}{3 \ell^2} \right) \exp\left( -\frac{\sqrt{5}\, r}{\ell} \right)$: twice differentiable
$k_{\nu \to \infty}(r) = \exp\left( -\frac{r^2}{2 \ell^2} \right)$: smooth (infinitely differentiable)
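A minimal sketch of the closed-form special cases above (no Bessel function needed); `ell` is the length scale:

```python
import numpy as np

def matern(r, nu, ell=1.0):
    """Matern kernel at distance r, for nu in {0.5, 1.5, 2.5, inf}."""
    r = np.abs(r)
    if nu == 0.5:                          # Laplace kernel / Brownian motion
        return np.exp(-r / ell)
    if nu == 1.5:                          # once-differentiable samples
        a = np.sqrt(3) * r / ell
        return (1 + a) * np.exp(-a)
    if nu == 2.5:                          # twice-differentiable samples
        a = np.sqrt(5) * r / ell
        return (1 + a + a**2 / 3) * np.exp(-a)
    return np.exp(-r**2 / (2 * ell**2))    # nu -> infinity: squared exponential
```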
Matern kernel II
Univariate Matern kernel function with unit length scale
Kernels for periodic, smooth functions
To create a GP over periodic functions, we can first map the inputs to $u = (\sin x, \cos x)^\top$ and then measure distance in $u$ space. Combined with the squared exponential function (a code sketch follows):
$k(x, x') = \exp\left( -\frac{2 \sin^2\left( \pi (x - x') \right)}{\ell^2} \right)$
[Figure: three functions drawn at random; left $\ell > 1$, right $\ell < 1$.]
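A minimal sketch of this periodic covariance; `ell` controls how wiggly the sampled functions are within each period:

```python
import numpy as np

def k_periodic(x, xp, ell=1.0):
    """Periodic covariance: distance measured through u = (sin x, cos x)."""
    return np.exp(-2 * np.sin(np.pi * (x - xp))**2 / ell**2)
```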
Using Gaussian process for nonlinear regression
Observing a dataset $D = \{(x_i, y_i)\}_{i=1}^m$.
The prior $p(f)$ is a Gaussian process; like a multivariate Gaussian, the posterior of $f$ is therefore also a Gaussian process.
Bayes' rule: $p(f|D) = \frac{p(D|f)\, p(f)}{p(D)}$
Everything else about GPs follows the basic rules of probability applied to multivariate Gaussians.
Posterior of Gaussian process
Gaussian process regression with, for simplicity, noiseless observations $y = f(x)$.
The parameter is a function with a Gaussian process prior: $f(x) \sim GP(m(x) = 0, k(x, x'))$.
Recall multivariate Gaussian conditioning: $p(Y|X) \sim \mathcal{N}(\mu_{Y|X}; \Sigma_{YY|X})$ with
$\mu_{Y|X} = \mu_Y + \Sigma_{YX} \Sigma_{XX}^{-1} (X - \mu_X)$, $\Sigma_{YY|X} = \Sigma_{YY} - \Sigma_{YX} \Sigma_{XX}^{-1} \Sigma_{XY}$
GP posterior: $f(x) \mid \{(x_i, y_i)\}_{i=1}^m \sim GP(m_{post}(x), k_{post}(x, x'))$, where $Y = (y_1, \dots, y_m)^\top = (f(x_1), \dots, f(x_m))^\top$ and
$m_{post}(x) = 0 + \Sigma_{xY} \Sigma_{YY}^{-1} Y$
$k_{post}(x, x') = k(x, x') - \Sigma_{xY} \Sigma_{YY}^{-1} \Sigma_{Y x'}$
(A code sketch follows.)
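A minimal sketch of the noiseless posterior, assuming the RBF kernel; a tiny diagonal jitter is added before inversion for numerical stability, and the data are illustrative:

```python
import numpy as np

def k(A, B):
    """RBF kernel matrix for 1-D inputs."""
    return np.exp(-0.5 * (A[:, None] - B[None, :])**2)

X = np.array([-2.0, 0.0, 1.5])   # training inputs x_i
Y = np.array([0.5, -1.0, 1.0])   # noiseless targets y_i = f(x_i)
xs = np.linspace(-5, 5, 200)     # test inputs

Kinv = np.linalg.inv(k(X, X) + 1e-10 * np.eye(len(X)))
m_post = k(xs, X) @ Kinv @ Y                      # Sigma_xY Sigma_YY^{-1} Y
k_post = k(xs, xs) - k(xs, X) @ Kinv @ k(X, xs)   # posterior covariance
# m_post interpolates Y exactly; diag(k_post) is ~0 at the training inputs
```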
Prior and Posterior GP
In the noiseless case ($y = f(x)$), the mean function of the posterior GP passes through the training data points.
The posterior GP has reduced variance: zero variance at the training points.
[Figure: samples from the prior GP (left) and from the posterior GP (right).]
Noisy Observation
Gaussian likelihood: $y \mid x, f(x) \sim \mathcal{N}(f, \sigma^2_{noise} I)$
GP posterior: $f(x) \mid \{(x_i, y_i)\}_{i=1}^m \sim GP(m_{post}(x), k_{post}(x, x'))$, where $Y = (y_1, \dots, y_m)^\top$ and
$m_{post}(x) = 0 + \Sigma_{xY} \left( \Sigma_{YY} + \sigma^2_{noise} I \right)^{-1} Y$
$k_{post}(x, x') = k(x, x') - \Sigma_{xY} \left( \Sigma_{YY} + \sigma^2_{noise} I \right)^{-1} \Sigma_{Y x'}$
The covariance function is the kernel function:
$\Sigma_{xY} = \left( k(x, x_1), \dots, k(x, x_m) \right)$
$\Sigma_{YY} = \begin{pmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_m) \\ k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_m) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_m, x_1) & k(x_m, x_2) & \cdots & k(x_m, x_m) \end{pmatrix}$
(A code sketch follows.)
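Continuing the noiseless sketch above (same `k`, `X`, `Y`, `xs`), the noisy case changes only the matrix that is inverted; the noise level here is an assumed example value:

```python
sigma2_noise = 0.1   # assumed observation-noise variance

Kinv = np.linalg.inv(k(X, X) + sigma2_noise * np.eye(len(X)))
m_post = k(xs, X) @ Kinv @ Y                      # no longer interpolates Y exactly
k_post = k(xs, xs) - k(xs, X) @ Kinv @ k(X, xs)   # variance > 0 even at training x
```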
Prior and posterior: noisy case
In the noisy case ($y = f(x) + \epsilon$), the mean function of the posterior GP does not necessarily pass through the training data points.
The posterior GP has reduced variance.