Gaussian Processes
Le Song
Machine Learning II: Advanced Topics
CSE 8803ML, Spring 2012
Pictorial view of embedding distribution
Transform the entire distribution to expected features
[Figure: the distribution in input space is transformed by a feature map into a feature space, where it is represented by its expected features]
Embedding Distributions: Mean
The mean reduces the entire distribution to a single number
Representation power is very restricted: a 1D feature space
Embedding Distributions: Mean + Variance
The mean and variance reduce the entire distribution to two numbers
Richer representation, but still not enough: a 2D feature space
[Figure: distribution summarized by its mean and variance]
Embedding with kernel features
Transform the distribution to an infinite-dimensional vector of features
Rich representation: captures the mean, variance, and higher-order moments
[Figure: distribution mapped to an infinite-dimensional kernel feature space]
Finite sample approximation of embedding
Given samples $x_1, \ldots, x_m \sim P(X)$, the embedding is approximated by the empirical average of the features: $\hat{\mu}_X = \frac{1}{m}\sum_{i=1}^{m} \phi(x_i)$
Estimating embedding distance
Finite sample estimator: form a kernel matrix over the pooled samples and average its 4 blocks
[Figure: kernel matrix with four blocks; each block is averaged to estimate one term of the squared distance]
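As a concrete sketch of this estimator (not from the slides; the RBF kernel, bandwidth, and toy data below are illustrative assumptions), the four block averages combine into the standard biased estimate of the squared distance between the two embeddings:

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian RBF kernel matrix between row-sample arrays A (m x d) and B (n x d)."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2_biased(X, Y, sigma=1.0):
    """Biased estimate of the squared embedding distance: average the XX, XY, YX, YY blocks."""
    Kxx = rbf_kernel(X, X, sigma)   # within-X block
    Kyy = rbf_kernel(Y, Y, sigma)   # within-Y block
    Kxy = rbf_kernel(X, Y, sigma)   # the two cross blocks (symmetric)
    return Kxx.mean() - 2 * Kxy.mean() + Kyy.mean()

# toy usage: two samples from slightly shifted Gaussians
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 1))
Y = rng.normal(0.5, 1.0, size=(200, 1))
print(mmd2_biased(X, Y))   # larger when the two distributions differ more
```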
Measure Dependence via Embeddings
Use the squared distance $\|\mu_{XY} - \mu_X \otimes \mu_Y\|^2$ to measure dependence between $X$ and $Y$
[Figure: joint embedding in feature space, whose coordinates include the $X$ mean, the $Y$ mean, the covariance, and higher-order features]
Estimating embedding distances
Given samples $(x_1, y_1), \ldots, (x_m, y_m) \sim P(X, Y)$
The dependence measure can be expressed in terms of inner products:
$\|\mu_{XY} - \mu_X \otimes \mu_Y\|^2 = \|E_{XY}[\phi(X) \otimes \psi(Y)] - E_X[\phi(X)] \otimes E_Y[\psi(Y)]\|^2$
$= \langle \mu_{XY}, \mu_{XY} \rangle - 2\langle \mu_{XY}, \mu_X \otimes \mu_Y \rangle + \langle \mu_X \otimes \mu_Y, \mu_X \otimes \mu_Y \rangle$
Kernel matrix operation, with centering matrix $H = I - \frac{1}{m}\mathbf{1}\mathbf{1}^\top$:
$\frac{1}{m^2}\,\mathrm{trace}(K H L H)$, where $K_{ij} = k(x_i, x_j)$ and $L_{ij} = k(y_i, y_j)$
The $X$ and $Y$ data must be ordered in the same way (paired samples)
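A minimal sketch of the trace formula above, assuming Gaussian RBF kernels on both $X$ and $Y$ and an illustrative bandwidth:

```python
import numpy as np

def rbf_gram(Z, sigma=1.0):
    """Kernel matrix K_ij = k(z_i, z_j) with a Gaussian RBF kernel."""
    sq = np.sum(Z**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * Z @ Z.T
    return np.exp(-sq / (2 * sigma**2))

def dependence_measure(X, Y, sigma=1.0):
    """Empirical dependence measure (1/m^2) trace(K H L H).
    X and Y must be paired, i.e. ordered in the same way."""
    m = X.shape[0]
    K = rbf_gram(X, sigma)                   # k(x_i, x_j)
    L = rbf_gram(Y, sigma)                   # k(y_i, y_j)
    H = np.eye(m) - np.ones((m, m)) / m      # centering matrix
    return np.trace(K @ H @ L @ H) / m**2

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 1))
Y_dep = X + 0.1 * rng.normal(size=(300, 1))   # strongly dependent on X
Y_ind = rng.normal(size=(300, 1))             # independent of X
print(dependence_measure(X, Y_dep), dependence_measure(X, Y_ind))  # first value is larger
```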
Application of kernel distance measure
Multivariate Gaussians
๐‘ƒ ๐‘‹1 , ๐‘‹2 , … , ๐‘‹๐‘› =
1
2๐œ‹
๐‘›
2
1 exp −
Σ2
1
2
๐‘ฅ − ๐œ‡ โŠค Σ −1 ๐‘ฅ − ๐œ‡
Mean vector ๐œ‡๐‘– = ๐ธ[๐‘‹๐‘– ]
๐œ‡1
๐œ‡2
๐œ‡=
โ‹ฎ
๐œ‡๐‘›
Covariance matrix ๐œŽ๐‘–๐‘— = ๐ธ ๐‘‹๐‘– − ๐œ‡๐‘– ๐‘‹๐‘— − ๐œ‡๐‘—
๐œŽ12
Σ = ๐œŽ21
๐œŽ31
๐œŽ12
๐œŽ22
๐œŽ32
๐œŽ13
๐œŽ23
๐œŽ32
12
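A small sketch evaluating this density (in log form for numerical stability); the mean and covariance values are made up for illustration:

```python
import numpy as np

def gaussian_logpdf(x, mu, Sigma):
    """log N(x; mu, Sigma) following the density formula above."""
    n = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x - mu)^T Sigma^{-1} (x - mu)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)

mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 2.0, 0.3],
                  [0.2, 0.3, 1.5]])
print(gaussian_logpdf(np.array([0.1, 0.9, -1.2]), mu, Sigma))
```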
Conditioning on a Gaussian
Joint Gaussian $P(X, Y) \sim \mathcal{N}(\mu, \Sigma)$
Conditioning a Gaussian variable $Y$ on another Gaussian variable $X$ still gives a Gaussian:
$P(Y \mid X) \sim \mathcal{N}(\mu_{Y|X},\, \sigma^2_{Y|X})$
$\mu_{Y|X} = \mu_Y + \frac{\sigma_{YX}}{\sigma_X^2}(X - \mu_X)$   (prior mean plus a correction from the new observation $X$)
$\sigma^2_{Y|X} = \sigma_Y^2 - \frac{\sigma_{YX}^2}{\sigma_X^2}$   (prior variance minus a reduction term)
The posterior variance does not depend on the particular observed value
Observing $X$ always decreases the variance
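A tiny numerical sketch of these updates, with made-up joint parameters:

```python
mu_X, mu_Y = 0.0, 1.0
var_X, var_Y, cov_YX = 1.0, 2.0, 0.8   # illustrative joint Gaussian parameters

def condition_on_x(x_obs):
    """Posterior mean and variance of Y after observing X = x_obs."""
    mu_post = mu_Y + (cov_YX / var_X) * (x_obs - mu_X)
    var_post = var_Y - cov_YX**2 / var_X   # independent of the observed value
    return mu_post, var_post

print(condition_on_x(0.5))    # mean shifts with the observation
print(condition_on_x(-2.0))   # variance is the same, and smaller than var_Y
```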
Conditional Gaussian is a linear model
Conditional linear Gaussian:
$P(Y \mid X) \sim \mathcal{N}(\mu_{Y|X},\, \sigma^2_{Y|X})$, with $\mu_{Y|X} = \mu_Y + \frac{\sigma_{YX}}{\sigma_X^2}(X - \mu_X)$
Equivalently, $P(Y \mid X) \sim \mathcal{N}(\beta_0 + \beta X,\, \sigma^2_{Y|X})$
The ridge in the figure is the line $\beta_0 + \beta X$
If we make a slice at a particular $X$, we get a Gaussian
All these Gaussian slices have the same variance $\sigma^2_{Y|X} = \sigma_Y^2 - \frac{\sigma_{YX}^2}{\sigma_X^2}$
Conditional Gaussian (general case)
Joint Gaussian $P(X, Y) \sim \mathcal{N}(\mu, \Sigma)$
Conditional Gaussian: $P(Y \mid X) \sim \mathcal{N}(\mu_{Y|X},\, \Sigma_{YY|X})$
$\mu_{Y|X} = \mu_Y + \Sigma_{YX}\Sigma_{XX}^{-1}(X - \mu_X)$
$\Sigma_{YY|X} = \Sigma_{YY} - \Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}$
The conditional Gaussian is linear in $X$: $P(Y \mid X) \sim \mathcal{N}(\beta_0 + BX,\, \Sigma_{YY|X})$ with
$\beta_0 = \mu_Y - \Sigma_{YX}\Sigma_{XX}^{-1}\mu_X$
$B = \Sigma_{YX}\Sigma_{XX}^{-1}$
Linear regression model $Y = \beta_0 + BX + \epsilon$, with white noise $\epsilon \sim \mathcal{N}(0, \Sigma_{YY|X})$
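A sketch of the block formulas for the general case; the joint mean and covariance below are illustrative values:

```python
import numpy as np

def condition_gaussian(mu, Sigma, idx_Y, idx_X, x_obs):
    """Return the mean and covariance of Y | X = x_obs for a joint Gaussian N(mu, Sigma)."""
    mu_Y, mu_X = mu[idx_Y], mu[idx_X]
    S_YY = Sigma[np.ix_(idx_Y, idx_Y)]
    S_YX = Sigma[np.ix_(idx_Y, idx_X)]
    S_XX = Sigma[np.ix_(idx_X, idx_X)]
    gain = S_YX @ np.linalg.inv(S_XX)          # B = Sigma_YX Sigma_XX^{-1}
    mu_cond = mu_Y + gain @ (x_obs - mu_X)     # mu_Y + B (x - mu_X)
    Sigma_cond = S_YY - gain @ S_YX.T          # Sigma_YY - Sigma_YX Sigma_XX^{-1} Sigma_XY
    return mu_cond, Sigma_cond

mu = np.array([0.0, 0.0, 1.0])
Sigma = np.array([[1.0, 0.6, 0.3],
                  [0.6, 1.5, 0.4],
                  [0.3, 0.4, 2.0]])
# condition the last variable (Y) on the first two (X)
print(condition_gaussian(mu, Sigma, idx_Y=[2], idx_X=[0, 1], x_obs=np.array([0.2, -0.5])))
```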
What is a Gaussian Process?
A Gaussian process is a generalization of a multivariate
Gaussian distribution to infinitely many variables
Formally: a collection of random variables, any finite number
of which have (consistent) Gaussian distributions
Informally: an infinitely long vector whose dimensions are indexed by $x$, i.e. a function $f(x)$
A Gaussian process is fully specified by a mean function $m(x) = E[f(x)]$ and a covariance function $k(x, x') = E[(f(x) - m(x))(f(x') - m(x'))]$
$f(x) \sim GP\big(m(x),\, k(x, x')\big)$, with $x$ ranging over the index set
A set of samples from a Gaussian process
For each fixed value of $x$, there is a Gaussian variable associated with it
Focus on a finite subset of values $f = \big(f(x_1), f(x_2), \ldots, f(x_N)\big)^\top$, for which $f \sim \mathcal{N}(0, \Sigma)$ where $\Sigma_{ij} = k(x_i, x_j)$
Then plot the coordinates of $f$ as a function of the corresponding $x$ values
Random function from a Gaussian process
One-dimensional Gaussian process:
$f(x) \sim GP\Big(0,\; k(x, x') = \exp\big(-\tfrac{1}{2}(x - x')^2\big)\Big)$
To generate a sample from the GP:
The Gaussian variables $f_i, f_j$ are indexed by $x_i, x_j$ respectively, and their covariance (the $ij$-th entry of $\Sigma$) is defined by $k(x_i, x_j)$
Generate $N$ i.i.d. samples $y = (y_1, \ldots, y_N)^\top \sim \mathcal{N}(0, I)$
Transform the sample: $f = (f_1, \ldots, f_N)^\top = \mu + \Sigma^{1/2} y$
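A minimal sketch of this sampling recipe (grid size, jitter, and random seed are illustrative choices; a Cholesky factor is used as the matrix square root $\Sigma^{1/2}$):

```python
import numpy as np

def rbf(x1, x2):
    """k(x, x') = exp(-(x - x')^2 / 2) for 1-D inputs."""
    return np.exp(-0.5 * (x1[:, None] - x2[None, :])**2)

x = np.linspace(-5, 5, 200)                 # finite set of indices x_1, ..., x_N
Sigma = rbf(x, x) + 1e-8 * np.eye(len(x))   # small jitter for numerical stability

rng = np.random.default_rng(1)
y = rng.standard_normal(len(x))   # y ~ N(0, I)
L = np.linalg.cholesky(Sigma)     # plays the role of Sigma^{1/2}
f = 0.0 + L @ y                   # f = mu + Sigma^{1/2} y, here mu = 0

# plotting f against x gives one random function drawn from the GP
print(f[:5])
```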
Random function from a Gaussian process
Now there are two indices, $x$ and $y$
Covariance function: $k\big((x, y), (x', y')\big) = \exp\left(-\frac{(x - x')^2 + (y - y')^2}{2}\right)$
[Figure: random surface drawn from this two-dimensional Gaussian process]
Gaussian process as a prior
A Gaussian process is a prior over functions, so we can use it for nonparametric regression
Fit a function to noisy observations
Gaussian process regression:
Gaussian likelihood: $y \mid x, f(x) \sim \mathcal{N}(f,\, \sigma^2_{\mathrm{noise}} I)$
Gaussian process prior: the parameter is a function, $f(x) \sim GP\big(m(x) = 0,\, k(x, x')\big)$
Graphical model for Gaussian Process
Square nodes are observed, round nodes unobserved (latent)
Red nodes are training data, blue nodes are test data
All pairs of latent variables $f$ are connected
Prediction of $y^*$ depends only on the corresponding $f^*$
We can do learning and inference based on this graphical model
Covariance function of Gaussian processes
For any finite collection of indices $x_1, x_2, \ldots, x_n$, the covariance matrix is positive semidefinite:
$\Sigma = \begin{pmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_n) \\ k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_n, x_1) & k(x_n, x_2) & \cdots & k(x_n, x_n) \end{pmatrix}$
The covariance function needs to be a kernel function over the indices!
E.g. the Gaussian RBF kernel $k(x, x') = \exp\left(-\frac{1}{2}\|x - x'\|^2\right)$
Covariance function of Gaussian process
Another example:
$k(x_i, x_j) = v_0 \exp\left(-\left(\frac{|x_i - x_j|}{r}\right)^{\alpha}\right) + v_1 + v_2 \delta_{ij}$
These kernel parameters are interpretable in the covariance function context:
$v_0$: variance scale
$v_1$: variance bias
$v_2$: noise variance
$r$: lengthscale
$\alpha$: roughness
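A sketch of this covariance function; the parameter values are illustrative, and the Kronecker delta (which formally compares sample indices) is approximated here by comparing input values:

```python
import numpy as np

def composite_kernel(xi, xj, v0=1.0, v1=0.1, v2=0.01, r=1.0, alpha=2.0):
    """k(x_i, x_j) = v0 * exp(-(|x_i - x_j| / r)^alpha) + v1 + v2 * delta_ij."""
    delta = 1.0 if np.isclose(xi, xj) else 0.0   # stand-in for i == j
    return v0 * np.exp(-(abs(xi - xj) / r) ** alpha) + v1 + v2 * delta

print(composite_kernel(0.0, 0.0))   # variance scale + bias + noise on the diagonal
print(composite_kernel(0.0, 2.5))   # decays with distance at a rate set by r and alpha
```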
Samples from GPs with different kernels
Matern kernel
$k(x_i, x_j) = \frac{1}{\Gamma(\nu)\, 2^{\nu - 1}} \left(\frac{\sqrt{2\nu}}{l}\,|x_i - x_j|\right)^{\nu} K_{\nu}\!\left(\frac{\sqrt{2\nu}}{l}\,|x_i - x_j|\right)$
$K_{\nu}$ is the modified Bessel function of the second kind of order $\nu$, and $l$ is the length scale
Sample functions from a GP with a Matern kernel are $\lceil \nu \rceil - 1$ times differentiable; the hyperparameter $\nu$ controls smoothness
Special cases (let $r = |x_i - x_j|$):
$k_{\nu = 1/2}(r) = \exp\left(-\frac{r}{l}\right)$: Laplace kernel, Brownian motion
$k_{\nu = 3/2}(r) = \left(1 + \frac{\sqrt{3}\, r}{l}\right)\exp\left(-\frac{\sqrt{3}\, r}{l}\right)$: once differentiable
$k_{\nu = 5/2}(r) = \left(1 + \frac{\sqrt{5}\, r}{l} + \frac{5 r^2}{3 l^2}\right)\exp\left(-\frac{\sqrt{5}\, r}{l}\right)$: twice differentiable
$k_{\nu \to \infty}(r) = \exp\left(-\frac{r^2}{2 l^2}\right)$: smooth (infinitely differentiable)
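A sketch of the special cases above as functions of the distance $r$ (unit length scale by default):

```python
import numpy as np

def matern12(r, l=1.0):
    """nu = 1/2: exponential / Laplace kernel (non-differentiable samples)."""
    return np.exp(-r / l)

def matern32(r, l=1.0):
    """nu = 3/2: once-differentiable samples."""
    c = np.sqrt(3.0) * r / l
    return (1.0 + c) * np.exp(-c)

def matern52(r, l=1.0):
    """nu = 5/2: twice-differentiable samples."""
    c = np.sqrt(5.0) * r / l
    return (1.0 + c + 5.0 * r**2 / (3.0 * l**2)) * np.exp(-c)

def squared_exponential(r, l=1.0):
    """nu -> infinity limit: infinitely differentiable samples."""
    return np.exp(-r**2 / (2.0 * l**2))

r = np.linspace(0.0, 3.0, 7)
for k in (matern12, matern32, matern52, squared_exponential):
    print(k.__name__, np.round(k(r), 3))
```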
Matern kernel II
[Figure: univariate Matern kernel functions with unit length scale]
Kernels for periodic, smooth functions
To create a GP over periodic functions, we can first map the inputs to $u = (\sin x, \cos x)^\top$ and then measure distance in $u$ space. Combined with the squared exponential kernel,
$k(x, x') = \exp\left(-\frac{2\sin^2\!\big(\pi (x - x')\big)}{l^2}\right)$
[Figure: three functions drawn at random; left $l > 1$, right $l < 1$]
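A sketch of this periodic covariance function (period 1 in $x$; the grid and length scale are illustrative):

```python
import numpy as np

def periodic_kernel(x1, x2, l=1.0):
    """k(x, x') = exp(-2 sin^2(pi (x - x')) / l^2); period 1 in x."""
    return np.exp(-2.0 * np.sin(np.pi * (x1[:, None] - x2[None, :]))**2 / l**2)

x = np.linspace(0, 2, 5)
K = periodic_kernel(x, x, l=0.5)
print(np.round(K, 3))   # points one period apart (e.g. x=0 and x=1) have covariance 1
```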
Using Gaussian process for nonlinear regression
Observing a dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$
The prior $P(f)$ is a Gaussian process; since it behaves like a multivariate Gaussian, the posterior over $f$ is also a Gaussian process
Bayes' rule: $P(f \mid D) = \frac{P(D \mid f)\, P(f)}{P(D)}$
Everything else about GPs follows the basic rules of probability applied to multivariate Gaussians
Posterior of Gaussian process
Gaussian process regression
For simplicity, noiseless observations $y = f(x)$
Gaussian process prior: the parameter is a function, $f(x) \sim GP\big(m(x) = 0,\, k(x, x')\big)$
Multivariate Gaussian conditioning: $P(Y \mid X) \sim \mathcal{N}(\mu_{Y|X},\, \Sigma_{YY|X})$ with
$\mu_{Y|X} = \mu_Y + \Sigma_{YX}\Sigma_{XX}^{-1}(X - \mu_X)$
$\Sigma_{YY|X} = \Sigma_{YY} - \Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}$
GP posterior: $f(x) \mid \{(x_i, y_i)\}_{i=1}^{n} \sim GP\big(m_{post}(x),\, k_{post}(x, x')\big)$, where $Y = (y_1, \ldots, y_n)^\top = \big(f(x_1), \ldots, f(x_n)\big)^\top$
$m_{post}(x) = 0 + \Sigma_{f(x)Y}\,\Sigma_{YY}^{-1}\,Y$
$k_{post}(x, x') = \Sigma_{f(x)f(x')} - \Sigma_{f(x)Y}\,\Sigma_{YY}^{-1}\,\Sigma_{Y f(x')}$
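A sketch of the noiseless posterior formulas; the kernel, training points, and jitter term are illustrative choices:

```python
import numpy as np

def rbf(a, b):
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2)

# training data (noiseless observations y_i = f(x_i))
x_train = np.array([-2.0, 0.0, 1.5])
y_train = np.array([0.5, -0.3, 1.0])
x_test = np.linspace(-3, 3, 5)

K_yy = rbf(x_train, x_train) + 1e-10 * np.eye(len(x_train))   # Sigma_YY (+ jitter)
K_sy = rbf(x_test, x_train)                                    # Sigma_{f(x) Y}
K_ss = rbf(x_test, x_test)                                     # Sigma_{f(x) f(x')}

m_post = K_sy @ np.linalg.solve(K_yy, y_train)                 # posterior mean
k_post = K_ss - K_sy @ np.linalg.solve(K_yy, K_sy.T)           # posterior covariance

print(np.round(m_post, 3))
print(np.round(np.diag(k_post), 3))   # ~0 variance at the training inputs
```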
Prior and Posterior GP
In the noiseless case ($y = f(x)$), the mean function of the posterior GP passes through the training data points
The posterior GP has reduced variance, with zero variance at the training points
[Figure: samples from the prior GP (left) and the posterior GP (right)]
Noisy Observation
Gaussian likelihood: $y \mid x, f(x) \sim \mathcal{N}(f,\, \sigma^2_{\mathrm{noise}} I)$
$f(x) \mid \{(x_i, y_i)\}_{i=1}^{n} \sim GP\big(m_{post}(x),\, k_{post}(x, x')\big)$, with $Y = (y_1, \ldots, y_n)^\top$
$m_{post}(x) = 0 + \Sigma_{f(x)Y}\,\big(\Sigma_{YY} + \sigma^2_{\mathrm{noise}} I\big)^{-1}\,Y$
$k_{post}(x, x') = \Sigma_{f(x)f(x')} - \Sigma_{f(x)Y}\,\big(\Sigma_{YY} + \sigma^2_{\mathrm{noise}} I\big)^{-1}\,\Sigma_{Y f(x')}$
The covariance function is the kernel function:
$\Sigma_{f(x)Y} = \big(k(x, x_1), \ldots, k(x, x_n)\big)$
$\Sigma_{YY} = \begin{pmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_n) \\ k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_n, x_1) & k(x_n, x_2) & \cdots & k(x_n, x_n) \end{pmatrix}$
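The noisy case only replaces $\Sigma_{YY}$ with $\Sigma_{YY} + \sigma^2_{\mathrm{noise}} I$; a sketch with illustrative data and noise level:

```python
import numpy as np

def rbf(a, b):
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2)

rng = np.random.default_rng(0)
x_train = np.linspace(-3, 3, 10)
y_train = np.sin(x_train) + 0.1 * rng.standard_normal(10)   # noisy observations
x_test = np.linspace(-3, 3, 50)
sigma_noise2 = 0.1**2

K_yy = rbf(x_train, x_train) + sigma_noise2 * np.eye(len(x_train))  # Sigma_YY + sigma^2 I
K_sy = rbf(x_test, x_train)
K_ss = rbf(x_test, x_test)

m_post = K_sy @ np.linalg.solve(K_yy, y_train)               # posterior mean of f
k_post = K_ss - K_sy @ np.linalg.solve(K_yy, K_sy.T)         # posterior covariance of f
std = np.sqrt(np.clip(np.diag(k_post), 0, None))

# m_post +/- 2*std gives the usual predictive band; variance is reduced but not zero
print(np.round(m_post[:5], 3), np.round(std[:5], 3))
```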
Prior and posterior: noisy case
In the noisy case ($y = f(x) + \epsilon$), the mean function of the posterior GP does not necessarily pass through the training data points
The posterior GP has reduced variance