Gaussian Processes
Le Song
Machine Learning II: Advanced Topics
CSE 8803ML, Spring 2012
Conditional Gaussian (general case)
Joint Gaussian: $P(X, Y) \sim \mathcal{N}(\mu, \Sigma)$
Conditional Gaussian: $P(Y \mid X) \sim \mathcal{N}(\mu_{Y|X}, \Sigma_{YY|X})$ with
  $\mu_{Y|X} = \mu_Y + \Sigma_{YX}\Sigma_{XX}^{-1}(X - \mu_X)$
  $\Sigma_{YY|X} = \Sigma_{YY} - \Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}$
The conditional Gaussian is linear in $X$: $P(Y \mid X) \sim \mathcal{N}(\beta_0 + BX, \Sigma_{YY|X})$ with
  $\beta_0 = \mu_Y - \Sigma_{YX}\Sigma_{XX}^{-1}\mu_X$
  $B = \Sigma_{YX}\Sigma_{XX}^{-1}$
This is exactly a linear regression model $Y = \beta_0 + BX + \epsilon$ with white noise $\epsilon \sim \mathcal{N}(0, \Sigma_{YY|X})$.
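A minimal NumPy sketch of these conditioning formulas; the joint mean and covariance below are made-up illustrative values, not from the slides:

```python
import numpy as np

# Hypothetical joint Gaussian over (X, Y), both one-dimensional here,
# with mean (mu_X, mu_Y) and covariance partitioned into XX, XY, YX, YY blocks.
mu_X, mu_Y = 1.0, 2.0
Sigma_XX = np.array([[2.0]])
Sigma_XY = np.array([[0.8]])
Sigma_YX = Sigma_XY.T
Sigma_YY = np.array([[1.5]])

x_obs = np.array([0.5])  # observed value of X

# mu_{Y|X} = mu_Y + Sigma_YX Sigma_XX^{-1} (x - mu_X)
mu_cond = mu_Y + Sigma_YX @ np.linalg.solve(Sigma_XX, x_obs - mu_X)
# Sigma_{YY|X} = Sigma_YY - Sigma_YX Sigma_XX^{-1} Sigma_XY
Sigma_cond = Sigma_YY - Sigma_YX @ np.linalg.solve(Sigma_XX, Sigma_XY)

# Regression view: B = Sigma_YX Sigma_XX^{-1}, beta_0 = mu_Y - B mu_X
B = Sigma_YX @ np.linalg.inv(Sigma_XX)
beta_0 = mu_Y - B @ np.array([mu_X])
```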
What is a Gaussian Process?
A Gaussian process is a generalization of a multivariate
Gaussian distribution to infinitely many variables
Formally: a collection of random variables, any finite number
of which have (consistent) Gaussian distributions
Informally: an infinitely long vector whose dimensions are indexed by $x$, i.e., a function $f(x)$
A Gaussian process is fully specified by a mean function $m(x) = E[f(x)]$ and a covariance function $k(x, x') = E[(f(x) - m(x))(f(x') - m(x'))]$:
$f(x) \sim GP(m(x), k(x, x'))$, where $x$ is the index
Random function from a Gaussian process
A one-dimensional Gaussian process:
$f(x) \sim GP\left(0,\; k(x, x') = \exp\left(-\tfrac{1}{2}(x - x')^2\right)\right)$
To generate a sample from the GP (see the code sketch below):
  Gaussian variables $f_i, f_j$ are indexed by $x_i, x_j$ respectively, and their covariance (the $ij$-th entry of $\Sigma$) is defined by $k(x_i, x_j)$
  Generate $N$ i.i.d. samples $y = (y_1, \dots, y_N)^\top \sim \mathcal{N}(0, I)$
  Transform the sample: $f = (f_1, \dots, f_N)^\top = \mu + \Sigma^{1/2} y$
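A minimal NumPy sketch of this sampling procedure for the squared-exponential kernel above; the helper name and the small diagonal jitter are added assumptions for illustration and numerical stability:

```python
import numpy as np

def rbf_kernel(xs, length_scale=1.0):
    """Squared-exponential kernel matrix k(x, x') = exp(-(x - x')^2 / (2 l^2))."""
    d = xs[:, None] - xs[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

# Index points at which the random function is evaluated.
xs = np.linspace(-5, 5, 100)
Sigma = rbf_kernel(xs) + 1e-8 * np.eye(len(xs))   # jitter for numerical stability

# Draw N iid standard normals and transform by a square root of Sigma.
y = np.random.randn(len(xs))
L = np.linalg.cholesky(Sigma)   # L @ L.T = Sigma, plays the role of Sigma^{1/2}
f = 0.0 + L @ y                 # mean function m(x) = 0
```

Each call draws one random function; plotting `f` against `xs` reproduces the kind of sample paths shown on the slides.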
Covariance function of Gaussian processes
For any finite collection of indices $x_1, x_2, \dots, x_n$, the covariance matrix must be positive semidefinite:
$$\Sigma = K = \begin{pmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_n) \\ k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_n, x_1) & k(x_n, x_2) & \cdots & k(x_n, x_n) \end{pmatrix}$$
The covariance function therefore needs to be a kernel function over the indices!
E.g., the Gaussian RBF kernel: $k(x, x') = \exp\left(-\frac{1}{2}\|x - x'\|^2\right)$
Samples from GPs with different kernels
๐‘˜ ๐‘ฅ๐‘– , ๐‘ฅ๐‘— = ๐‘ฃ0 exp −
๐‘ฅ๐‘– −๐‘ฅ๐‘—
๐‘Ÿ
๐›ผ
+ ๐‘ฃ1 + ๐‘ฃ2 ๐›ฟ๐‘–๐‘—
6
Kernels for periodic, smooth functions
To create a GP over periodic functions, we can first map the inputs to $u = (\sin x, \cos x)^\top$ and then measure distances in $u$ space. Combined with the squared-exponential kernel, this gives
$$k(x, x') = \exp\left(-\frac{2\sin^2\left(\pi(x - x')\right)}{\ell^2}\right)$$
Three functions drawn at random: left panel $\ell > 1$, right panel $\ell < 1$ (figure omitted).
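A short sketch of this periodic covariance function (with the formula above, the implied period is 1; the helper name is illustrative):

```python
import numpy as np

def periodic_kernel(x1, x2, length_scale=1.0):
    """k(x, x') = exp(-2 sin^2(pi (x - x')) / l^2): squared-exponential distance
    measured after mapping the inputs to (sin, cos) space, as on the slide."""
    return np.exp(-2.0 * np.sin(np.pi * (x1 - x2)) ** 2 / length_scale ** 2)

# The kernel is periodic: shifting either input by an integer (one period)
# leaves the covariance unchanged.
assert np.isclose(periodic_kernel(0.3, 0.7), periodic_kernel(1.3, 0.7))
```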
Using Gaussian process for nonlinear regression
Observing a dataset $D = \{(x_i, y_i)\}_{i=1}^n$
The prior $P(f)$ is a Gaussian process; like a multivariate Gaussian, the posterior of $f$ is therefore also a Gaussian process
Bayes' rule: $P(f \mid D) = \dfrac{P(D \mid f)\,P(f)}{P(D)}$
Everything else about GPs follows from the basic rules of probability applied to multivariate Gaussians
Graphical model for Gaussian Process
Square nodes are observed, round nodes unobserved (latent)
Red nodes are training data, blue nodes are test data
All pairs of latent variables $f$ are connected
Prediction of $y^*$ depends only on the corresponding $f^*$
We can do learning and inference based on this graphical model
Posterior of Gaussian process
Gaussian process regression. For simplicity, assume noiseless observations $y = f(x)$
The "parameter" is a function with a Gaussian process prior: $f(x) \sim GP(m(x) = 0, k(x, x'))$
Let $Y = (y_1, \dots, y_n)^\top = (f(x_1), \dots, f(x_n))^\top$. The GP posterior is
$f(x) \mid \{(x_i, y_i)\}_{i=1}^n \sim GP(m_{post}(x), k_{post}(x, x'))$ with
$m_{post}(x) = 0 + \Sigma_{f(x)Y}\,\Sigma_{YY}^{-1}\,Y$
$k_{post}(x, x') = \Sigma_{f(x)f(x')} - \Sigma_{f(x)Y}\,\Sigma_{YY}^{-1}\,\Sigma_{Yf(x')}$
Posterior of Gaussian process in Kernel Form
Define kernel matrices
$k(x, X) := \Sigma_{f(x)Y} = \left(k(x, x_1), \dots, k(x, x_n)\right)$
$K := \Sigma_{YY} = \begin{pmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_n) \\ k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_n, x_1) & k(x_n, x_2) & \cdots & k(x_n, x_n) \end{pmatrix}$
Then we have
$m_{post}(x) = k(x, X)\,K^{-1}\,Y$
$k_{post}(x, x') = k(x, x') - k(x, X)\,K^{-1}\,k(x', X)^\top$
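A compact NumPy sketch of the kernel-form posterior, using the RBF kernel and a few made-up noiseless training points (helper names are illustrative):

```python
import numpy as np

def k_rbf(a, b, length_scale=1.0):
    """k(x, x') = exp(-(x - x')^2 / (2 l^2)) for 1-d inputs, vectorized."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length_scale) ** 2)

# Hypothetical noiseless training data y_i = f(x_i).
X = np.array([-2.0, -0.5, 1.0, 2.5])
Y = np.sin(X)

K = k_rbf(X, X) + 1e-10 * np.eye(len(X))   # jitter for numerical stability

def posterior(x_star):
    """Posterior mean and covariance at test points x_star."""
    k_sX = k_rbf(x_star, X)                            # k(x, X)
    mean = k_sX @ np.linalg.solve(K, Y)                # k(x, X) K^{-1} Y
    cov = k_rbf(x_star, x_star) - k_sX @ np.linalg.solve(K, k_sX.T)
    return mean, cov

m_post, C_post = posterior(np.linspace(-3, 3, 50))
```

At the training inputs the posterior mean equals the observed values and the posterior variance is (numerically) zero, matching the noiseless interpolation picture on the next slide.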
Prior and Posterior GP
In the noiseless case ($y = f(x)$), the mean function of the posterior GP passes through the training data points
The posterior GP has reduced variance: zero variance at the training points
(Figure: prior samples on the left, posterior samples on the right)
Noisy Observation
$y \mid x, f(x) \sim \mathcal{N}(f(x), \sigma_{noise}^2 I)$, and let $Y = (y_1, \dots, y_n)^\top$
$f(x) \mid \{(x_i, y_i)\}_{i=1}^n \sim GP(m_{post}(x), k_{post}(x, x'))$ with
$m_{post}(x) = k(x, X)\,(K + \sigma_{noise}^2 I)^{-1}\,Y$
$k_{post}(x, x') = k(x, x') - k(x, X)\,(K + \sigma_{noise}^2 I)^{-1}\,k(x', X)^\top$
GP: prediction of new observation
Given a new point $x^*$, the predictive distribution for $y^*$ is
$P(y^* \mid x^*, \{(x_i, y_i)\}_{i=1}^n) = \int P(y^* \mid f, x^*)\,P(f \mid \{(x_i, y_i)\}_{i=1}^n)\,df$
The predictive distribution is Gaussian: $y^* \mid x^*, \{(x_i, y_i)\}_{i=1}^n \sim \mathcal{N}(\mu_{pred}, \sigma_{pred}^2)$, with $Y = (y_1, \dots, y_n)^\top$ and
$\mu_{pred} = k(x^*, X)\,(K + \sigma_{noise}^2 I)^{-1}\,Y$
$\sigma_{pred}^2 = k(x^*, x^*) - k(x^*, X)\,(K + \sigma_{noise}^2 I)^{-1}\,k(x^*, X)^\top$
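A sketch of these noisy-observation predictive equations; the kernel, training data, noise level, and helper names below are illustrative assumptions:

```python
import numpy as np

def k_rbf(a, b, length_scale=1.0):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length_scale) ** 2)

sigma_noise = 0.1                                   # illustrative noise level
X = np.array([-2.0, -0.5, 1.0, 2.5])
Y = np.sin(X) + sigma_noise * np.random.randn(len(X))   # noisy observations

K_noisy = k_rbf(X, X) + sigma_noise ** 2 * np.eye(len(X))

def predict(x_star):
    """Predictive mean and variance at test points x_star, as on the slide."""
    k_sX = k_rbf(x_star, X)
    mu_pred = k_sX @ np.linalg.solve(K_noisy, Y)
    var_pred = np.diag(k_rbf(x_star, x_star) - k_sX @ np.linalg.solve(K_noisy, k_sX.T))
    return mu_pred, var_pred

mu, var = predict(np.linspace(-3, 3, 50))
```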
Weight space view of GP
Assume a linear regression model:
$f(x \mid w) = x^\top w$, $\quad y = f + \epsilon$, $\quad \epsilon \sim \mathcal{N}(0, \sigma^2)$
Let $Y = (y_1, \dots, y_n)$ and $X = (x_1, \dots, x_n)$. The likelihood of the observations is
$P(\{y_i\}_{i=1}^n \mid \{x_i\}_{i=1}^n, w) = \mathcal{N}(X^\top w, \sigma^2 I)$
Assume a Gaussian prior over the parameters: $P(w) = \mathcal{N}(0, I)$
Apply Bayes' theorem to obtain the posterior: $P(w \mid Y, X) \propto P(Y \mid X, w)\,P(w)$
Weight space view of GP
The posterior distribution over $w$ is
$P(w \mid Y, X) = \mathcal{N}\left(\frac{1}{\sigma^2}\left(I + \frac{1}{\sigma^2} X X^\top\right)^{-1} X Y,\;\; \left(I + \frac{1}{\sigma^2} X X^\top\right)^{-1}\right)$
The predictive distribution is
$P(f^* \mid x^*, X, Y) = \int f(x^* \mid w)\,P(w \mid Y, X)\,dw$
$\qquad = \mathcal{N}\left(\frac{1}{\sigma^2}\, x^{*\top}\left(I + \frac{1}{\sigma^2} X X^\top\right)^{-1} X Y,\;\; x^{*\top}\left(I + \frac{1}{\sigma^2} X X^\top\right)^{-1} x^*\right)$
The predictive distribution is in outer-product form; it can be turned into inner-product form using the matrix inversion lemma
Weight space view of GP
The predictive distribution
$\mathcal{N}\left(\frac{1}{\sigma^2}\, x^{*\top}\left(I + \frac{1}{\sigma^2} X X^\top\right)^{-1} X Y,\;\; x^{*\top}\left(I + \frac{1}{\sigma^2} X X^\top\right)^{-1} x^*\right)$
is equivalent to
$\mathcal{N}\left(x^{*\top} X\,(\sigma^2 I + X^\top X)^{-1} Y,\;\; x^{*\top} x^* - x^{*\top} X\,(\sigma^2 I + X^\top X)^{-1} X^\top x^*\right)$
Instead of using the original $x$, map it to a feature space $\phi(x)$; with $k(x, x') = \phi(x)^\top \phi(x')$ this recovers the GP predictive distribution $\mathcal{N}(\mu_{pred}, \sigma_{pred}^2)$:
$\mu_{pred} = k(x^*, X)\,(K + \sigma_{noise}^2 I)^{-1}\,Y$
$\sigma_{pred}^2 = k(x^*, x^*) - k(x^*, X)\,(K + \sigma_{noise}^2 I)^{-1}\,k(x^*, X)^\top$
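A small numerical check of this equivalence, assuming a linear kernel $k(x, x') = x^\top x'$ (identity feature map) and random made-up data, comparing the weight-space and kernel-form predictive means:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma2 = 3, 20, 0.25
X = rng.normal(size=(d, n))          # columns are training inputs x_i
w_true = rng.normal(size=d)
Y = X.T @ w_true + np.sqrt(sigma2) * rng.normal(size=n)
x_star = rng.normal(size=d)

# Weight-space form: (1/sigma^2) x*^T (I + (1/sigma^2) X X^T)^{-1} X Y
A = np.eye(d) + X @ X.T / sigma2
mu_weight = x_star @ np.linalg.solve(A, X @ Y) / sigma2

# Kernel form with the linear kernel: k(x*, X) (K + sigma^2 I)^{-1} Y
K = X.T @ X
mu_kernel = (x_star @ X) @ np.linalg.solve(K + sigma2 * np.eye(n), Y)

assert np.isclose(mu_weight, mu_kernel)   # the two views agree
```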
Model selection
Use the marginal likelihood (evidence) to select and tune hyperparameters of the covariance function:
$P(\{y_i\}_{i=1}^n \mid \{x_i\}_{i=1}^n) = \int P(\{y_i\}_{i=1}^n \mid f, \{x_i\}_{i=1}^n)\,P(f)\,df$
An example covariance function:
$k(x_i, x_j) = v_0 \exp\left(-\left(\frac{|x_i - x_j|}{r}\right)^\alpha\right) + v_1$
The marginal likelihood is a function of the parameters $v$:
$P(\{y_i\}_{i=1}^n \mid \{x_i\}_{i=1}^n) = \mathcal{N}(0, K_v + \sigma^2 I)$
$\ln P(\{y_i\}_{i=1}^n \mid \{x_i\}_{i=1}^n) = -\frac{1}{2}\ln\det(K_v + \sigma^2 I) - \frac{1}{2} Y^\top (K_v + \sigma^2 I)^{-1} Y + const$
Optimize the marginal likelihood as a function of $v$
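A sketch of evaluating and maximizing this log marginal likelihood over kernel hyperparameters; the squared-exponential kernel, the parameter names, and the use of scipy.optimize are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

X = np.linspace(0, 5, 30)
Y = np.sin(X) + 0.1 * np.random.randn(len(X))
sigma2 = 0.01                                        # fixed observation noise

def neg_log_marginal_likelihood(log_params):
    """Negative of the log evidence on the slide, up to an additive constant."""
    v0, r = np.exp(log_params)                       # signal variance, length scale
    K = v0 * np.exp(-0.5 * ((X[:, None] - X[None, :]) / r) ** 2)
    C = K + sigma2 * np.eye(len(X))
    _, logdet = np.linalg.slogdet(C)
    return 0.5 * logdet + 0.5 * Y @ np.linalg.solve(C, Y)

# Optimize in log-space so the hyperparameters stay positive.
res = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 1.0]))
v0_hat, r_hat = np.exp(res.x)
```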
Automatic relevance detection
We want to automatically decide which inputs are relevant to the output (feature selection)
Use the covariance function
$k(x_i, x_j) = v_0 \exp\left(-\sum_{d=1}^{D}\left(\frac{|x_{id} - x_{jd}|}{r_d}\right)^\alpha\right) + v_1 + v_2\,\delta_{ij}$
$r_d$ is the length scale of the function along input dimension $d$
As $r_d \to \infty$, the corresponding feature influences $f$ less and less
Use the marginal likelihood $\ln P(\{y_i\}_{i=1}^n \mid \{x_i\}_{i=1}^n)$ to tune $r_d$ and thereby do feature selection
GP for classification
Regression model:
$f \sim GP(m(x), k(x, x'))$
$y \mid x, f \sim \mathcal{N}(f(x), \sigma_{noise}^2 I)$
Classification model, given data $\{(x_i, y_i)\}_{i=1}^n$ where $y_i \in \{-1, +1\}$:
$f \sim GP(m(x), k(x, x'))$
$y \mid x, f \sim p(y \mid f(x))$
Relate GP to class probability
Transform the continuous output of the Gaussian process to a value in $[-1, 1]$ or $[0, 1]$
With binary outputs, the joint distribution of all variables in the model is no longer Gaussian
The likelihood is also not Gaussian, so we need approximate inference to compute the posterior GP (Laplace approximation, sampling)
Connection to kernel support vector machines
$\min_w \frac{1}{2}\|w\|^2 + C \sum_j \xi_j \quad s.t.\;\; (w^\top \phi(x_j) + b)\,y_j \geq 1 - \xi_j,\; \xi_j \geq 0,\; \forall j$
$\xi_j$: slack variables
The constraints can be equivalently written as a hinge loss:
$\left[1 - y_i\,(w^\top \phi(x_i))\right]_+ = \left[1 - y_i f_i\right]_+$
Connection to kernel support vector machines
The decision function is $f(x) = w^\top \phi(x) = \sum_i \alpha_i k(x, x_i)$
Let the vector $f$ be $f(x)$ evaluated at all training points: $f = K\alpha$, so $\alpha = K^{-1} f$
$\|w\|^2 = \alpha^\top K \alpha = f^\top K^{-1} f$
We can rewrite the kernelized SVM as:
$\min_f \frac{1}{2} f^\top K^{-1} f + C \sum_i \left[1 - y_i f_i\right]_+$
Connection to kernel support vector machines
We can rewrite the kernelized SVM as:
$\min_f \frac{1}{2} f^\top K^{-1} f + C \sum_i \left[1 - y_i f_i\right]_+$
Similarly, the negative log of the (unnormalized) GP posterior can be written as
$\frac{1}{2} f^\top K^{-1} f - \sum_i \ln p(y_i \mid f_i) + c$
With a Gaussian process we
  handle uncertainty in the unknown function $f$ by averaging, not minimization;
  can learn the kernel parameters and features using the marginal likelihood;
  can incorporate interpretable noise models and priors, and can sample.
Gaussian process latent variable models
GPs can be used for nonlinear dimensionality reduction
Observe $n$ data points $\{y_i\}_{i=1}^n$ and assume each dimension $y_{id}$ of the data is modeled by a separate GP using a common low-dimensional input $x_i$
Find the best latent input $x_i$ by maximizing the marginal likelihood
(Graphical model: latent $x \to f_d \to y_d$ for each output dimension $d$)
Computationally intensive
Gaussian process latent variable models
Finding the latent variables is a high-dimensional, nonlinear optimization problem with local optima
GPLVM defines a map from latent space to observed space; it is not a generative model
Mapping a new latent coordinate to observations is easy
Finding the latent coordinates for new observations is difficult
Computation issue of GP
$f(x) \mid \{(x_i, y_i)\}_{i=1}^n \sim GP(m_{post}(x), k_{post}(x, x'))$ with
$m_{post}(x) = k(x, X)\,(K + \sigma_{noise}^2 I)^{-1}\,Y$
$k_{post}(x, x') = k(x, x') - k(x, X)\,(K + \sigma_{noise}^2 I)^{-1}\,k(x', X)^\top$
Inverting the kernel matrix $K$ is computationally intensive: with $n$ data points, the computational cost is $O(n^3)$
Need to reduce the computational cost by some type of approximation, e.g. a low-rank factorization $K \approx R^\top R$
Kernel low rank approximation
Incomplete Cholesky factorization of the $n \times n$ kernel matrix $K$ gives $R$ of size $d \times n$ with $d \ll n$:
$K \approx R^\top R$
For a new point $x$, let $R_x$ denote its $d$-dimensional feature vector under the same factorization, so $k(x, x') \approx R_x^\top R_{x'}$. Then
$f(x) \mid \{(x_i, y_i)\}_{i=1}^n \sim GP(m_{post}(x), k_{post}(x, x'))$ with
$m_{post}(x) = R_x^\top\,(R R^\top + \sigma_{noise}^2 I)^{-1}\,R\,Y$
$k_{post}(x, x') = R_x^\top R_{x'} - R_x^\top\,(R R^\top + \sigma_{noise}^2 I)^{-1}\,R R^\top R_{x'}$
Now only a $d \times d$ matrix needs to be inverted.
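A sketch of low-rank GP regression in this spirit, using a Nyström-style factorization with inducing points as a stand-in for incomplete Cholesky (the inducing-point construction and all names below are assumptions for illustration; both approaches yield a $d \times n$ factor $R$):

```python
import numpy as np

def k_rbf(a, b, length_scale=1.0):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length_scale) ** 2)

rng = np.random.default_rng(1)
n, d, sigma2 = 500, 20, 0.01
X = np.sort(rng.uniform(-5, 5, n))
Y = np.sin(X) + np.sqrt(sigma2) * rng.normal(size=n)

# Nystrom-style factorization K ~= R^T R built from d inducing points.
Z = np.linspace(-5, 5, d)                     # inducing inputs
Kzz = k_rbf(Z, Z) + 1e-8 * np.eye(d)
Lzz = np.linalg.cholesky(Kzz)
R = np.linalg.solve(Lzz, k_rbf(Z, X))         # d x n, so K ~= R^T R

# Posterior mean at test points: only a d x d system has to be solved.
x_star = np.linspace(-5, 5, 100)
R_star = np.linalg.solve(Lzz, k_rbf(Z, x_star))         # d x m
A = R @ R.T + sigma2 * np.eye(d)
m_post = R_star.T @ np.linalg.solve(A, R @ Y)
```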
Sparse nonparametric regression
Support vector regression
Dual of support vector regression and kernelization
The dual problem uses the data only through inner products
Prediction for a new point likewise involves only inner products with the training data
Replace the inner products with kernel functions to obtain nonlinear regression
Collaborative Filtering
Collaborative Filtering
$R$: rating matrix; $U$: user factors; $V$: movie factors
$\min_{U,V} f(U, V) = \|R - UV^\top\|_F^2 \quad s.t.\;\; U \geq 0,\; V \geq 0,\; k \ll m, n$
Low-rank matrix approximation approach
Probabilistic matrix factorization
Bayesian probabilistic matrix factorization
Nonparametric effect model
The ratings $R_{ij}$ are generated by a bias $\mu$, a user-item compatibility function $m_{ij}$, and a random effect $f_{ij}$, plus noise $\epsilon_{ij}$:
$R_{ij} = \mu + m_{ij} + f_{ij} + \epsilon_{ij}, \quad \epsilon_{ij} \sim \mathcal{N}(0, \sigma^2)$
Gaussian process priors for both $m$ and $f$:
$m \sim GP(0, \Omega \otimes \Sigma)$
$f_i \sim GP(0, \tau\Sigma), \quad i = 1, \dots, M$
Hyperprior on the covariance matrix: an inverse-Wishart process, $\Sigma \sim IWP(\kappa, \Sigma_0 + \lambda\delta)$
Learning with EM