Kernel methods for comparing
distributions, measuring dependence
Le Song
Machine Learning II: Advanced Topics
CSE 8803ML, Spring 2012
Principal component analysis
Given a set of M centered observations x_k ∈ R^d, PCA finds the direction that maximizes the variance:
X = (x_1, x_2, …, x_M)
w* = argmax_{‖w‖ ≤ 1} (1/M) Σ_k (w^⊤ x_k)²
   = argmax_{‖w‖ ≤ 1} w^⊤ ( (1/M) X X^⊤ ) w
With C = (1/M) X X^⊤, w* can be found by solving the eigenvalue problem
C w = λ w
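A minimal numpy sketch of this eigenvalue computation, assuming the centered observations are stored as the columns of X as on the slide:

import numpy as np

def pca_top_direction(X):
    # X: d x M matrix whose columns are the centered observations x_k
    M = X.shape[1]
    C = X @ X.T / M                       # C = (1/M) X X^T
    eigvals, eigvecs = np.linalg.eigh(C)  # solves C w = lambda w
    return eigvecs[:, -1], eigvals[-1]    # direction of maximum variance and its variance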
Alternative expression for PCA
The principal component lies in the span of the data
w = Σ_k α_k x_k = X α
Plugging this in, we have
C w = (1/M) X X^⊤ X α = λ X α
Furthermore, for each data point x_k, the following relation holds:
x_k^⊤ C w = (1/M) x_k^⊤ X X^⊤ X α = λ x_k^⊤ X α,  ∀k
In matrix form,
(1/M) X^⊤ X X^⊤ X α = λ X^⊤ X α
which only depends on the inner product matrix X^⊤ X.
Kernel PCA
Key Idea: Replace inner product matrix by kernel matrix
PCA:  (1/M) X^⊤ X X^⊤ X α = λ X^⊤ X α
Map x_k ↦ φ(x_k), and let Φ = (φ(x_1), …, φ(x_M)), K = Φ^⊤ Φ
Nonlinear component w = Φ α
Kernel PCA:  (1/M) K K α = λ K α,  equivalent to  (1/M) K α = λ α
First form an M × M kernel matrix K, and then perform an eigendecomposition of K.
Kernel PCA example
Gaussian RBF kernel  exp( −‖x − x′‖² / (2σ²) )  over a 2-dimensional space.
The eigenvector evaluated at a test point x is a function:
w^⊤ φ(x) = Σ_k α_k k(x_k, x)
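A short numpy sketch of the kernel PCA procedure above, using the Gaussian RBF kernel. The feature-space centering step (H K H) is the analogue of the centered-observations assumption on the PCA slide and is my addition; the eigenvector scaling keeps ‖w‖ = 1.

import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def kernel_pca(X, n_components=2, sigma=1.0):
    # X: M x d array, one observation per row
    M = X.shape[0]
    K = rbf_kernel(X, X, sigma)
    H = np.eye(M) - np.ones((M, M)) / M        # center in feature space (my addition)
    Kc = H @ K @ H
    eigvals, eigvecs = np.linalg.eigh(Kc / M)  # (1/M) K alpha = lambda alpha
    order = np.argsort(eigvals)[::-1][:n_components]
    lam, alpha = eigvals[order], eigvecs[:, order]
    # Scale so each component w = Phi alpha has unit norm:
    # ||w||^2 = alpha^T Kc alpha = M * lambda for a unit-norm eigenvector alpha
    alpha = alpha / np.sqrt(np.maximum(M * lam, 1e-12))
    return alpha, lam

# Projection of a test point x, as on the slide: w^T phi(x) = sum_k alpha_k k(x_k, x)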
Spectral clustering
Form the kernel matrix K with a Gaussian RBF kernel.
Treat K as the adjacency matrix of a graph (set the diagonal of K to 0).
Construct the graph Laplacian L = D^{−1/2} K D^{−1/2}, where D = diag(K 1).
Compute the top k eigenvectors V = (v_1, v_2, …, v_k) of L.
Use V as the input to k-means for clustering.
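A compact numpy/scipy sketch of this procedure; the row normalization of V before k-means is a common practical addition and is not part of the recipe above.

import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_clustering(X, k, sigma=1.0):
    # Kernel (affinity) matrix with Gaussian RBF kernel, zero diagonal
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(K, 0.0)
    # Graph Laplacian as defined above: L = D^{-1/2} K D^{-1/2}, D = diag(K 1)
    d = K.sum(axis=1)
    Dis = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = Dis @ K @ Dis
    # Top-k eigenvectors of L
    eigvals, eigvecs = np.linalg.eigh(L)
    V = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    # Row-normalize (common practical step), then run k-means on the rows of V
    V = V / np.maximum(np.linalg.norm(V, axis=1, keepdims=True), 1e-12)
    _, labels = kmeans2(V, k, minit='++')
    return labels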
Canonical correlation analysis
Given paired observations (x_i, y_i) from two multivariate variables X and Y,
estimate two basis vectors w_x and w_y so that the correlation of the projections onto these vectors is maximized.
CCA derivation II
Define the covariance matrices C_xx, C_yy and the cross-covariance matrix C_xy of x and y.
The optimization problem is equivalent to
ρ = max_{w_x, w_y}  ( w_x^⊤ C_xy w_y ) / sqrt( (w_x^⊤ C_xx w_x)(w_y^⊤ C_yy w_y) )
We can require the normalization  w_x^⊤ C_xx w_x = w_y^⊤ C_yy w_y = 1  and just maximize the numerator.
CCA as generalized eigenvalue problem
The optimality conditions say
C_xy w_y = λ C_xx w_x
C_yx w_x = λ C_yy w_y
Put these conditions into matrix form:
[ 0      C_xy ] [ w_x ]       [ C_xx   0    ] [ w_x ]
[ C_yx   0    ] [ w_y ]  = λ  [ 0      C_yy ] [ w_y ]
This is a generalized eigenvalue problem  A w = λ B w.
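A numpy/scipy sketch of solving this generalized eigenvalue problem for linear CCA. The empirical covariance estimates and the small ridge term added for numerical stability are my own choices, not part of the slide.

import numpy as np
from scipy.linalg import eigh

def linear_cca(X, Y, reg=1e-6):
    # X: d_x x m, Y: d_y x m; columns are centered, paired observations
    m = X.shape[1]
    dx, dy = X.shape[0], Y.shape[0]
    Cxx = X @ X.T / m + reg * np.eye(dx)
    Cyy = Y @ Y.T / m + reg * np.eye(dy)
    Cxy = X @ Y.T / m
    # A w = lambda B w with w = [w_x; w_y]
    A = np.block([[np.zeros((dx, dx)), Cxy],
                  [Cxy.T, np.zeros((dy, dy))]])
    B = np.block([[Cxx, np.zeros((dx, dy))],
                  [np.zeros((dy, dx)), Cyy]])
    lam, W = eigh(A, B)            # generalized symmetric eigenvalue problem
    top = np.argmax(lam)           # largest eigenvalue = canonical correlation
    return W[:dx, top], W[dx:, top], lam[top]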
CCA in inner product format
Similar to PCA, the directions of projection lie in the span of the data X = (x_1, …, x_m), Y = (y_1, …, y_m):
w_x = X α,  w_y = Y β
C_xy = (1/m) X Y^⊤,  C_xx = (1/m) X X^⊤,  C_yy = (1/m) Y Y^⊤
Earlier we have
ρ = max_{w_x, w_y}  ( w_x^⊤ C_xy w_y ) / sqrt( (w_x^⊤ C_xx w_x)(w_y^⊤ C_yy w_y) )
Plugging in w_x = X α, w_y = Y β, we have
ρ = max_{α, β}  ( α^⊤ X^⊤ X Y^⊤ Y β ) / sqrt( (α^⊤ X^⊤ X X^⊤ X α)(β^⊤ Y^⊤ Y Y^⊤ Y β) )
The data only appear through inner products.
Kernel CCA
Replace inner product matrix by kernel matrix
ρ = max_{α, β}  ( α^⊤ K_x K_y β ) / sqrt( (α^⊤ K_x K_x α)(β^⊤ K_y K_y β) )
where K_x is the kernel matrix for the data X, with entries K_x(i, j) = k(x_i, x_j).
Solve the generalized eigenvalue problem
[ 0        K_x K_y ] [ α ]       [ K_x K_x   0       ] [ α ]
[ K_y K_x  0       ] [ β ]  = λ  [ 0         K_y K_y ] [ β ]
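A numpy/scipy sketch of this kernel CCA eigenproblem. The ridge term κ on the diagonal blocks is a standard regularization for numerical stability and is my addition; centering of the kernel matrices is likewise left to the caller.

import numpy as np
from scipy.linalg import eigh

def kernel_cca(Kx, Ky, kappa=1e-3):
    # Kx, Ky: m x m kernel matrices for paired data X, Y (ideally centered)
    m = Kx.shape[0]
    Z = np.zeros((m, m))
    A = np.block([[Z, Kx @ Ky],
                  [Ky @ Kx, Z]])
    B = np.block([[Kx @ Kx + kappa * np.eye(m), Z],
                  [Z, Ky @ Ky + kappa * np.eye(m)]])
    lam, W = eigh(A, B)                    # generalized eigenvalue problem
    top = np.argmax(lam)
    alpha, beta = W[:m, top], W[m:, top]   # directions in feature space: w_x = Phi_x alpha, w_y = Phi_y beta
    return alpha, beta, lam[top]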
Comparing two distributions
For two Gaussian distributions P(X) and Q(X) with unit variance, simply test
H_0: μ_1 = μ_2 ?
For general distributions, the hypothesis is
H_0: P(X) = Q(X) ?
and we can, for example, use the KL-divergence
KL(P‖Q) = ∫ P(X) log( P(X) / Q(X) ) dX
Given a set of samples x_1, …, x_m ∼ P(X) and x'_1, …, x'_n ∼ Q(X):
μ_1 ≈ (1/m) Σ_i x_i
∫ P(X) log( P(X) / Q(X) ) dX  ≈  (1/m) Σ_i log( P(x_i) / Q(x_i) )
(for the latter we need to estimate the density functions first)
Embedding distributions into feature space
Summary statistics for distributions: mean, covariance, expected features.
Pick a kernel, and generate a different summary statistic.
Pictorial view of embedding distribution
Transform the entire distribution to expected features
(Figure: the feature map sends the distribution P(X) to its expected features in feature space.)
Finite sample approximation of embedding
One-to-one mapping from the distribution to its expected features for certain kernels (e.g., the Gaussian RBF kernel).
The sample average converges to the true mean embedding at rate O(1/√m).
Embedding Distributions: Mean
Mean reduces the entire distribution to a single number
Representation power is very restricted (a 1D feature space).
Embedding Distributions: Mean + Variance
Mean and variance reduce the entire distribution to two numbers (a 2D feature space).
Richer representation, but still not enough.
Embedding with kernel features
Transform the distribution to an infinite dimensional feature vector: mean, variance, and higher order moments.
Rich representation.
Estimating embedding distances
Given samples x_1, …, x_m ∼ P(X) and x'_1, …, x'_{m'} ∼ Q(X),
the distance can be expressed in terms of inner products:
‖μ_X − μ_X'‖² = ⟨μ_X, μ_X⟩ − 2⟨μ_X, μ_X'⟩ + ⟨μ_X', μ_X'⟩
Estimating embedding distance
Finite sample estimator
Form a kernel matrix on the pooled sample with 4 blocks and average each block:
‖μ_X − μ_X'‖²  ≈  (1/m²) Σ_{i,j} k(x_i, x_j)  −  (2/(m m')) Σ_{i,j} k(x_i, x'_j)  +  (1/m'²) Σ_{i,j} k(x'_i, x'_j)
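A numpy sketch of this finite sample (biased) estimator; the Gaussian RBF kernel and its bandwidth are illustrative assumptions.

import numpy as np

def rbf(A, B, sigma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def embedding_distance2(X, Xp, sigma=1.0):
    # ||mu_X - mu_X'||^2: average the four blocks of the pooled kernel matrix
    Kxx = rbf(X, X, sigma)    # <mu_X,  mu_X >
    Kxy = rbf(X, Xp, sigma)   # <mu_X,  mu_X'>
    Kyy = rbf(Xp, Xp, sigma)  # <mu_X', mu_X'>
    return Kxx.mean() - 2 * Kxy.mean() + Kyy.mean()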
Optimization view of embedding distance
Optimization problem
πœ‡π‘‹ − πœ‡π‘‹′
2
=
2
sup < 𝑀, πœ‡π‘‹ − πœ‡π‘‹′ >
𝑀 ≤1
sup < 𝑀, 𝐸𝑋∼𝑃 πœ™ 𝑋
𝑀 ≤1
− 𝐸𝑋∼𝑄 πœ™ 𝑋
=
>
2
Witness function
𝑀∗
1
π‘š
1
π‘š
=
π‘–πœ™
𝐸𝑋∼𝑃 πœ™ 𝑋 −𝐸𝑋∼𝑄 πœ™ 𝑋
𝐸𝑋∼𝑃 πœ™ 𝑋 −𝐸𝑋∼𝑄 πœ™ 𝑋
1
π‘₯𝑖 − ′ 𝑖 πœ™(π‘₯𝑖′ )
π‘š
=
′
πœ‡π‘‹ −πœ‡π‘‹
′
πœ‡π‘‹ −πœ‡π‘‹
≈
1
′
𝑖 πœ™ π‘₯𝑖 −π‘š′ 𝑖 πœ™(π‘₯𝑖 )
πœ‡π‘‹ − πœ‡π‘‹′
𝑀
𝑀∗
23
Plot the witness function values
𝑀 ∗ π‘₯ = 𝑀 ∗⊀ πœ™ π‘₯ ∝
1
π‘š
𝑖 π‘˜ π‘₯𝑖 , π‘₯ −
1
π‘š′
′
π‘˜(π‘₯
𝑖
𝑖 , π‘₯)
Gaussian and Laplace distribution with the same mean and
variance (Use Gaussian RBF kernel)
24
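A numpy sketch that evaluates this witness function on a grid of test points; the commented Gaussian/Laplace sampling matches the equal-mean, equal-variance setup described above and is only illustrative.

import numpy as np

def witness_values(X, Xp, grid, sigma=1.0):
    # w*(x) proportional to (1/m) sum_i k(x_i, x) - (1/m') sum_i k(x'_i, x)
    def rbf(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * sigma ** 2))
    return rbf(grid, X).mean(axis=1) - rbf(grid, Xp).mean(axis=1)

# Example (1D): Gaussian vs Laplace with the same mean and unit variance
# X    = np.random.normal(0.0, 1.0, size=(500, 1))
# Xp   = np.random.laplace(0.0, 1.0 / np.sqrt(2.0), size=(500, 1))
# vals = witness_values(X, Xp, np.linspace(-4, 4, 200)[:, None])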
Application of kernel distance measure
Covariate shift correction
Training and test data are not from the same distribution.
We want to reweight the training data points to match the distribution of the test data points.
argmin_{α ≥ 0, ‖α‖_1 = 1}  ‖ Σ_i α_i φ(x_i) − (1/m') Σ_i φ(y_i) ‖²
(x_i: training points, y_i: test points)
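Expanding the squared norm gives the quadratic objective α^⊤ K_xx α − 2 α^⊤ q + const, with q_i = (1/m') Σ_j k(x_i, y_j). Below is a sketch that solves it with scipy's SLSQP under the simplex constraints; the solver and the RBF kernel are my choices, the slide only states the objective.

import numpy as np
from scipy.optimize import minimize

def rbf(A, B, sigma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def reweight_training(Xtrain, Xtest, sigma=1.0):
    m = Xtrain.shape[0]
    Kxx = rbf(Xtrain, Xtrain, sigma)
    q = rbf(Xtrain, Xtest, sigma).mean(axis=1)   # q_i = (1/m') sum_j k(x_i, y_j)
    obj = lambda a: a @ Kxx @ a - 2 * a @ q      # constant term dropped
    grad = lambda a: 2 * (Kxx @ a - q)
    a0 = np.full(m, 1.0 / m)
    res = minimize(obj, a0, jac=grad, method='SLSQP',
                   bounds=[(0.0, None)] * m,
                   constraints=[{'type': 'eq', 'fun': lambda a: a.sum() - 1.0}])
    return res.x    # weights alpha over the training points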
Embedding Joint Distributions
Transform the entire joint distribution to expected features
(Figure: P(X, Y) maps to its expected joint features — the constant 1, the X mean, the Y mean, the cross covariance (Cov.), and higher order features.)
Embedding Joint: Finite Sample
Finite sample estimate: a weighted combination of the feature mapped data points in feature space,
μ_XY ≈ (1/m) Σ_i φ(x_i) ⊗ ψ(y_i)
[Smola, Gretton, Song and Scholkopf. 2007]
Measure Dependence via Embeddings
Use the squared distance in feature space, ‖μ_XY − μ_X ⊗ μ_Y‖², to measure the dependence between X and Y.
Dependence measure useful for:
•Dimensionality reduction
•Clustering
•Matching
•…
[Smola, Gretton, Song and Scholkopf. 2007]
Estimating embedding distances
Given samples (x_1, y_1), …, (x_m, y_m) ∼ P(X, Y),
the dependence measure can be expressed in terms of inner products:
‖μ_XY − μ_X ⊗ μ_Y‖² = ‖E_XY[φ(X) ⊗ ψ(Y)] − E_X[φ(X)] ⊗ E_Y[ψ(Y)]‖²
= ⟨μ_XY, μ_XY⟩ − 2⟨μ_XY, μ_X ⊗ μ_Y⟩ + ⟨μ_X ⊗ μ_Y, μ_X ⊗ μ_Y⟩
As a kernel matrix operation, with H = I − (1/m) 1 1^⊤:
trace( H K_x H K_y ),  where [K_x]_{ij} = k(x_i, x_j) and [K_y]_{ij} = k(y_i, y_j)
(the X and Y data are ordered in the same way, i.e., paired)
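A minimal numpy sketch of this kernel matrix operation; a 1/m² normalization is often included in practice but is not shown above, so it is left out here.

import numpy as np

def dependence_measure(Kx, Ky):
    # trace(H Kx H Ky), with H = I - (1/m) 1 1^T; Kx, Ky built from paired data
    m = Kx.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(H @ Kx @ H @ Ky)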
Optimization view of the dependence measure
Optimization problem
πœ‡π‘‹π‘Œ − πœ‡π‘‹ ⊗ πœ‡π‘Œ
2
=
sup < 𝑀, πœ‡π‘‹π‘Œ − πœ‡π‘‹ ⊗ πœ‡π‘Œ >
2
𝑀 ≤1
𝑀 ∗ ∝ πœ‡π‘‹π‘Œ − πœ‡π‘‹ ⊗ πœ‡π‘Œ
Witness function
𝑀 ∗ π‘₯, 𝑦 = 𝑀 ∗⊀ (πœ™ π‘₯ ⊗ πœ“ 𝑦 )
A distribution with two stripes
Two stripe distribution vs
Uniform over [-1,1]x[-1,1]
31
Application of dependence measure
Independent component analysis
Transform the time series such that the resulting signals are as independent as possible (minimize kernel dependence).
Feature selection
Choose a set of features such that their dependence with the labels is as large as possible (maximize kernel dependence).
Clustering
Generate labels for each data point such that the dependence between the labels and the data is maximized (maximize kernel dependence).
Supervised dimensionality reduction
Reduce the dimension of the data such that its dependence with side information is maximized (maximize kernel dependence).
PCA vs. Supervised dimensionality reduction
(Figure: results on the 20 newsgroups data.)
Supervised dimensionality reduction
10 years of NIPS papers: Text + Coauthor networks
Visual Map of LabelMe Images
Imposing Structures to Image Collections
Layout (sort/organize) images according to high dimensional image features (color, texture, SIFT, composition description) and maximize their dependence with an external structure, e.g., a grid on which adjacent points are similar.
Compare to Other Methods
Other layout algorithms do not have exact control of what structure to impose.
Kernel embedding method [Quadrianto, Song and Smola 2009]
Generative Topographic Map (GTM) [Bishop et al. 1998]
Self-Organizing Map (SOM) [Kohonen 1990]