Kernel methods, kernel SVM and
ridge regression
Le Song
Machine Learning II: Advanced Topics
CSE 8803ML, Spring 2012
Collaborative Filtering
R: rating matrix; U: user factor; V: movie factor

min_{U,V}  f(U, V) = ‖R − U V^⊤‖_F²
s.t.  U ≥ 0, V ≥ 0, k ≪ m, n

Low rank matrix approximation approach
Probabilistic matrix factorization
Bayesian probabilistic matrix factorization
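Below is a minimal sketch (not from the slides) of fitting the low-rank objective above by gradient descent on ‖R − U V^⊤‖_F² over the observed entries, with a small quadratic penalty; the toy ratings, rank k, step size, and penalty are made-up values, and the nonnegativity constraints are not enforced here.

```python
import numpy as np

# Toy rating matrix (0 = missing); values are made up for illustration.
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
mask = (R > 0).astype(float)           # fit only the observed entries

n_users, n_movies = R.shape
k, lr, lam = 2, 0.01, 0.1              # rank, step size, regularization (arbitrary)
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(n_users, k))
V = rng.normal(scale=0.1, size=(n_movies, k))

for _ in range(2000):
    E = mask * (R - U @ V.T)           # residual on observed entries
    U += lr * (E @ V - lam * U)        # gradient step for U
    V += lr * (E.T @ U - lam * V)      # gradient step for V

print(np.round(U @ V.T, 2))            # reconstructed / predicted ratings
```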
Parameter Estimation and Prediction
The Bayesian approach treats the unknown parameters as random variables:
P(θ|D) = P(D|θ) P(θ) / P(D) = P(D|θ) P(θ) / ∫ P(D|θ) P(θ) dθ
Posterior mean estimation: θ_bayes = ∫ θ P(θ|D) dθ
Maximum likelihood / MAP approach:
θ_ML = argmax_θ P(D|θ),  θ_MAP = argmax_θ P(θ|D)
Bayesian prediction takes into account all possible values of θ:
P(x_new|D) = ∫ P(x_new, θ|D) dθ = ∫ P(x_new|θ) P(θ|D) dθ
A frequentist prediction uses a "plug-in" estimator:
P(x_new|D) ≈ P(x_new|θ_ML)  or  P(x_new|D) ≈ P(x_new|θ_MAP)
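A concrete textbook example (not from the slides) where all of these quantities have closed forms: coin flips with a Beta prior.

```latex
% k heads in n flips, prior \theta \sim \mathrm{Beta}(a, b) with a, b \ge 1
\begin{align*}
\theta_{ML} &= \frac{k}{n}, \qquad
\theta_{MAP} = \frac{k + a - 1}{n + a + b - 2}, \qquad
P(\theta \mid D) = \mathrm{Beta}(\theta;\; k + a,\; n - k + b) \\
P(x_{new} = 1 \mid D) &= \int_0^1 \theta \, P(\theta \mid D)\, d\theta
  = \frac{k + a}{n + a + b}
\end{align*}
```

For small n the Bayesian prediction differs noticeably from the plug-in value P(x_new = 1 | θ_ML) = k/n; the two agree as n grows.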
PMF: Parameter Estimation
Parameter estimation: MAP estimate
θ_MAP = argmax_θ P(θ|D, α) = argmax_θ P(θ, D|α) = argmax_θ P(D|θ) P(θ|α)
In the paper, both the likelihood P(D|θ) and the prior P(θ|α) are Gaussian.
PMF: Interpret prior as regularization
Maximize the posterior distribution with respect to the parameters U and V.
This is equivalent to minimizing a sum-of-squares error function with a quadratic regularization term (plug in the Gaussians and take the log), as sketched below.
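A sketch of that step, using the standard PMF model: Gaussian observation noise σ², Gaussian priors σ_U², σ_V² on the rows of U and V, and I_ij indicating which ratings are observed.

```latex
\begin{align*}
\ln P(U, V \mid R)
  &= -\frac{1}{2\sigma^2} \sum_{i,j} I_{ij}\bigl(R_{ij} - U_i^\top V_j\bigr)^2
     -\frac{1}{2\sigma_U^2} \sum_i \lVert U_i \rVert^2
     -\frac{1}{2\sigma_V^2} \sum_j \lVert V_j \rVert^2 + \mathrm{const} \\
\max_{U,V} \ln P(U, V \mid R)
  \;&\equiv\;
  \min_{U,V} \; \frac{1}{2}\sum_{i,j} I_{ij}\bigl(R_{ij} - U_i^\top V_j\bigr)^2
  + \frac{\lambda_U}{2} \lVert U \rVert_F^2
  + \frac{\lambda_V}{2} \lVert V \rVert_F^2,
  \quad \lambda_U = \frac{\sigma^2}{\sigma_U^2},\;
        \lambda_V = \frac{\sigma^2}{\sigma_V^2}
\end{align*}
```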
Bayesian PMF: predicting new ratings
Bayesian prediction takes into account all possible values of θ:
P(x_new|D) = ∫ P(x_new, θ|D) dθ = ∫ P(x_new|θ) P(θ|D) dθ
In the paper, all parameters and hyperparameters are integrated out.
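This integral is intractable for Bayesian PMF, so in practice it is approximated by averaging over posterior samples; the sketch below states the Monte Carlo approximation in generic form.

```latex
P(x_{new} \mid D)
  = \int P(x_{new} \mid \theta)\, P(\theta \mid D)\, d\theta
  \;\approx\; \frac{1}{K} \sum_{k=1}^{K} P\bigl(x_{new} \mid \theta^{(k)}\bigr),
  \qquad \theta^{(k)} \sim P(\theta \mid D)
```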
Bayesian PMF: overall algorithm
Nonlinear classifier
[Figure: a nonlinear decision boundary compared with a linear SVM decision boundary]
Nonlinear regression
[Figure: linear regression vs. a nonlinear regression function, with bands at three standard deviations above and below the mean]
For nonlinear regression where we also want to estimate the variance, we need advanced methods such as Gaussian processes and kernel regression.
Nonconventional clusters
Need more advanced methods, such as kernel methods or spectral clustering.
Nonlinear principal component analysis
[Figure: PCA vs. nonlinear PCA]
Support Vector Machines (SVM)
min_w  (1/2) w^⊤w + C Σ_j ξ_j
s.t.  (w^⊤x_j + b) y_j ≥ 1 − ξ_j,  ξ_j ≥ 0, ∀j
ξ_j: slack variables
SVM for nonlinear problem
Solve a nonlinear problem with a linear relation in feature space.
[Figure: transforming the data points turns a nonlinear decision boundary in input space into a linear decision boundary in feature space]
The same trick applies to nonlinear clustering, principal component analysis, canonical correlation analysis, …
SVM for nonlinear problems
Some problems need complicated or even infinite-dimensional features
φ(x) = (x, x², x³, x⁴, …)^⊤
Explicitly computing high-dimensional features is time consuming and makes the subsequent optimization costly.
[Figure: a nonlinear decision boundary compared with a linear SVM decision boundary]
SVM Lagrangian
Primal problem:
min_{w,ξ}  (1/2) w^⊤w + C Σ_j ξ_j
s.t.  (w^⊤x_j + b) y_j ≥ 1 − ξ_j,  ξ_j ≥ 0, ∀j

Lagrangian:
L(w, ξ, α, β) = (1/2) w^⊤w + C Σ_j ξ_j + Σ_j α_j (1 − ξ_j − (w^⊤x_j + b) y_j) − Σ_j β_j ξ_j,   α_j ≥ 0, β_j ≥ 0
(x_j can be replaced by possibly infinite-dimensional features φ(x_j))
Taking the derivative of L(w, ξ, α, β) with respect to w and ξ and setting it to zero, we have
w = Σ_j α_j y_j x_j
b = y_k − w^⊤x_k  for any k such that 0 < α_k < C
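Filling in the intermediate step (standard soft-margin algebra, not spelled out on the slide): the stationarity conditions are also what produce the box constraint 0 ≤ α_j ≤ C in the dual.

```latex
\begin{align*}
\frac{\partial L}{\partial w} &= w - \sum_j \alpha_j y_j x_j = 0
  &&\Rightarrow\; w = \sum_j \alpha_j y_j x_j \\
\frac{\partial L}{\partial \xi_j} &= C - \alpha_j - \beta_j = 0
  &&\Rightarrow\; \beta_j = C - \alpha_j \ge 0
  \;\Rightarrow\; 0 \le \alpha_j \le C
\end{align*}
```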
SVM dual problem
Plug w and b back into the Lagrangian to obtain the dual problem:
max_α  Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j x_i^⊤x_j
s.t.  Σ_i α_i y_i = 0,  0 ≤ α_i ≤ C

x_i^⊤x_j = Σ_l x_{il} x_{jl},   φ(x_i)^⊤φ(x_j) = Σ_l φ(x_i)_l φ(x_j)_l
It is a quadratic program; solving for α, we get
w = Σ_j α_j y_j x_j
b = y_k − w^⊤x_k  for any k such that 0 < α_k < C
Data points corresponding to nonzero α_i are called support vectors.
Kernel Functions
Denote the inner product as a function k(x_i, x_j) = φ(x_i)^⊤ φ(x_j)

[Figure: kernel values between structured objects, e.g. K(graph_1, graph_2) = 0.6, 0.2, 0.5, where each graph maps to a feature vector of counts (# nodes, # edges, # triangles, # rectangles, # pentagons, …) and the kernel is the inner product of these vectors; similarly K(DNA sequence 1, DNA sequence 2) = 0.7 for string data]
Problem of explicitly constructing features
Explicitly constructing features φ(x): R^m ↦ F, the feature space can grow really large, really quickly.
E.g. polynomial features of degree d:  x_1^d,  x_1 x_2 ⋯ x_d,  x_1² x_2 ⋯ x_{d−1}, …
The total number of such features is huge for large m and d:
C(d + m − 1, d) = (d + m − 1)! / (d! (m − 1)!)
For d = 6, m = 100 there are about 1.6 billion terms.
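A quick check of that count using Python's standard library:

```python
from math import comb

d, m = 6, 100
# Number of degree-d monomials in m variables: C(d + m - 1, d)
print(comb(d + m - 1, d))   # 1609344100, i.e. about 1.6 billion
```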
Can we avoid expanding the features?
Rather than computing the features explicitly and then computing the inner product, can we merge the two steps using a clever kernel function k(x_i, x_j)?
E.g. polynomial kernel with d = 2, using φ(x) = (x_1², x_1x_2, x_2x_1, x_2²)^⊤:

φ(x)^⊤ φ(y) = x_1²y_1² + x_1x_2 y_1y_2 + x_2x_1 y_2y_1 + x_2²y_2²
            = x_1²y_1² + 2 x_1x_2 y_1y_2 + x_2²y_2²
            = (x_1y_1 + x_2y_2)² = (x^⊤y)²

Polynomial kernel with d = 2: k(x, y) = (x^⊤y)² = φ(x)^⊤ φ(y), which takes only O(m) computation!
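A small numerical check of this identity (numpy, arbitrary test vectors):

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for 2-D input: (v1^2, v1*v2, v2*v1, v2^2)."""
    return np.array([v[0]**2, v[0]*v[1], v[1]*v[0], v[1]**2])

x, y = np.array([1.5, -2.0]), np.array([0.3, 4.0])
lhs = phi(x) @ phi(y)      # inner product of explicit features
rhs = (x @ y) ** 2         # kernel evaluation, O(m)
print(lhs, rhs)            # identical up to floating point
```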
What 𝑘(𝑥, 𝑦) can be called a kernel function?
k(x, y) is equivalent to first computing the features φ(x), φ(y) and then performing the inner product k(x, y) = φ(x)^⊤ φ(y).
Given a dataset D = {x_1, x_2, x_3, …, x_n}, compute the pairwise kernel function k(x_i, x_j) and form an n × n kernel matrix (Gram matrix):
K = [ k(x_1, x_1)  …  k(x_1, x_n)
           ⋮        ⋱        ⋮
      k(x_n, x_1)  …  k(x_n, x_n) ]
k(x, y) is a kernel function iff the matrix K is positive semidefinite for every such dataset:
∀v ∈ R^n, v^⊤Kv ≥ 0
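An empirical sanity check of this condition for one random dataset (a sketch only; a real proof must cover every dataset):

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    """Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                       # 20 random 3-D points
K = np.array([[rbf(xi, xj) for xj in X] for xi in X])

eigvals = np.linalg.eigvalsh(K)                    # K is symmetric
print(eigvals.min() >= -1e-10)                     # all eigenvalues >= 0 up to rounding
```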
Kernel trick
In the dual problem of the SVM, replace the inner product by a kernel:
max_α  Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j k(x_i, x_j),   where k(x_i, x_j) = φ(x_i)^⊤ φ(x_j)
s.t.  Σ_i α_i y_i = 0,  0 ≤ α_i ≤ C
It is a quadratic program; solving for α, we get
w = Σ_j α_j y_j φ(x_j)
b = y_k − w^⊤φ(x_k) = y_k − Σ_j α_j y_j k(x_j, x_k)  for any k such that 0 < α_k < C
Evaluate the decision boundary on a new data point:
f(x) = w^⊤φ(x) = (Σ_j α_j y_j φ(x_j))^⊤ φ(x) = Σ_j α_j y_j k(x_j, x)
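A minimal sketch of evaluating this kernel decision function once the dual variables are known; the α values and offset b below are placeholders, not an actual QP solution.

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

# Training data and a hypothetical dual solution from some QP solver.
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5], [3.0, 3.0]])
y_train = np.array([-1.0, -1.0, 1.0, 1.0])
alpha   = np.array([0.4, 0.0, 0.4, 0.0])      # placeholder dual variables
b       = 0.1                                 # placeholder offset

def decision(x):
    """f(x) + b = sum_j alpha_j y_j k(x_j, x) + b."""
    return sum(a * y * rbf(xj, x) for a, y, xj in zip(alpha, y_train, X_train)) + b

x_new = np.array([2.5, 1.0])
print(np.sign(decision(x_new)))               # predicted class label
```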
Typical kernels for vector data
Polynomial of degree d:  k(x, y) = (x^⊤y)^d
Polynomial of degree up to d:  k(x, y) = (x^⊤y + c)^d
Exponential kernel (infinite-degree polynomial):  k(x, y) = exp(s · x^⊤y)
Gaussian RBF kernel:  k(x, y) = exp(−‖x − y‖² / (2σ²))
Laplace kernel:  k(x, y) = exp(−‖x − y‖ / (2σ²))
Exponentiated distance:  k(x, y) = exp(−d(x, y)² / s²)
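Straightforward numpy implementations of these kernels (a sketch; the parameter names and defaults are mine, and the Laplace bandwidth follows the slide's 2σ² convention):

```python
import numpy as np

def poly(x, y, d=3):
    return (x @ y) ** d

def poly_upto(x, y, d=3, c=1.0):
    return (x @ y + c) ** d

def exponential(x, y, s=1.0):
    return np.exp(s * (x @ y))

def rbf(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def laplace(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) / (2 * sigma ** 2))

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly(x, y), rbf(x, y), laplace(x, y))
```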
Shape of some kernels
Translation invariant kernel: k(x, y) = g(x − y)
The decision boundary is a weighted sum of bumps:
f(x) = Σ_j α_j y_j k(x_j, x)
How to construct more complicated kernels
Know the feature space φ(x), but find a fast way to compute the inner product k(x, y), e.g. string kernels and graph kernels.
Find a function k(x, y) and prove it is positive semidefinite; make sure the function captures some useful similarity between data points.
Combine existing kernels to get new kernels. Which combinations still result in kernels?
Combining kernels I
Positive weighted combinations of kernels are kernels:
If k_1(x, y) and k_2(x, y) are kernels and α, β ≥ 0, then k(x, y) = α k_1(x, y) + β k_2(x, y) is a kernel.
A weighted combination kernel is like concatenating the feature spaces:
k_1(x, y) = φ(x)^⊤ φ(y),  k_2(x, y) = ψ(x)^⊤ ψ(y)
k(x, y) = [φ(x); ψ(x)]^⊤ [φ(y); ψ(y)]   (with the two feature blocks scaled by √α and √β)
Mappings between spaces give you kernels:
If k(x, y) is a kernel, then k(φ(x), φ(y)) is a kernel, e.g. k(x, y) = x²y² (the linear kernel applied to φ(x) = x²).
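Why the positivity of the weights matters, in one line (the standard argument, stated for the Gram matrices K_1, K_2 of an arbitrary dataset):

```latex
v^\top (\alpha K_1 + \beta K_2)\, v
  = \alpha\, v^\top K_1 v + \beta\, v^\top K_2 v \;\ge\; 0
  \quad \text{for all } v \in \mathbb{R}^n,
  \text{ since } \alpha, \beta \ge 0 \text{ and } K_1, K_2 \succeq 0 .
```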
Combining kernels II
Products of kernels are kernels:
If k_1(x, y) and k_2(x, y) are kernels, then k(x, y) = k_1(x, y) k_2(x, y) is a kernel.
A product kernel is like using a tensor product feature space:
k_1(x, y) = φ(x)^⊤ φ(y),  k_2(x, y) = ψ(x)^⊤ ψ(y)
k(x, y) = (φ(x) ⊗ ψ(x))^⊤ (φ(y) ⊗ ψ(y))
k(x, y) = k_1(x, y) k_2(x, y) ⋯ k_d(x, y) is a kernel with higher-order tensor features:
k(x, y) = (φ(x) ⊗ ψ(x) ⊗ ⋯ ⊗ ξ(x))^⊤ (φ(y) ⊗ ψ(y) ⊗ ⋯ ⊗ ξ(y))
Nonlinear regression
[Figure: linear regression vs. a nonlinear regression function, with bands at three standard deviations above and below the mean]
For nonlinear regression where we also want to estimate the variance, we need advanced methods such as Gaussian processes and kernel regression.
Ridge regression
A dataset X = [x_1, …, x_n] ∈ R^{d×n},  y = (y_1, …, y_n)^⊤ ∈ R^n
With some regularization parameter λ, the goal is to find the regression weight vector w:
w* = argmin_w Σ_i (y_i − x_i^⊤w)² + λ‖w‖²
   = argmin_w ‖y − X^⊤w‖² + λ‖w‖²
Find w: take the derivative of the objective and set it to zero:
w* = (XX^⊤ + λI)⁻¹ X y
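The omitted derivative step, spelled out (standard ridge regression algebra):

```latex
\nabla_w \Bigl( \lVert y - X^\top w \rVert^2 + \lambda \lVert w \rVert^2 \Bigr)
  = -2X\bigl(y - X^\top w\bigr) + 2\lambda w = 0
  \;\Longrightarrow\;
  \bigl(XX^\top + \lambda I_d\bigr) w = X y
  \;\Longrightarrow\;
  w^* = \bigl(XX^\top + \lambda I_d\bigr)^{-1} X y
```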
Matrix inversion lemma
Find w: take the derivative of the objective and set it to zero:
w* = (XX^⊤ + λI_d)⁻¹ X y
w* = (XX^⊤ + λI_d)⁻¹ X y = X (X^⊤X + λI_n)⁻¹ y
Evaluating w* at a new data point x, we have
w*^⊤ x = y^⊤ (X^⊤X + λI_n)⁻¹ X^⊤ x
The above expression only depends on inner products!
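A one-line justification of the identity used above (a special case of the matrix inversion / push-through lemma):

```latex
\bigl(XX^\top + \lambda I_d\bigr) X = X \bigl(X^\top X + \lambda I_n\bigr)
\;\Longrightarrow\;
\bigl(XX^\top + \lambda I_d\bigr)^{-1} X = X \bigl(X^\top X + \lambda I_n\bigr)^{-1}
\;\Longrightarrow\;
\bigl(XX^\top + \lambda I_d\bigr)^{-1} X y = X \bigl(X^\top X + \lambda I_n\bigr)^{-1} y
```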
Kernel ridge regression
w*^⊤ x = y^⊤ (X^⊤X + λI_n)⁻¹ X^⊤ x  only depends on inner products!

X^⊤X = [ x_1^⊤x_1  …  x_1^⊤x_n
              ⋮       ⋱       ⋮
         x_n^⊤x_1  …  x_n^⊤x_n ]

X^⊤x = [ x_1^⊤x
             ⋮
         x_n^⊤x ]

Kernel ridge regression: replace the inner products by a kernel function:
X^⊤X → K = [k(x_i, x_j)]  (n × n)
X^⊤x → k(x) = [k(x_i, x)]  (n × 1)
f(x) = y^⊤ (K + λI_n)⁻¹ k(x)
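A minimal numpy sketch of this predictor with the Gaussian RBF kernel; the toy 1-D data, σ, and λ are arbitrary choices for illustration.

```python
import numpy as np

def rbf(a, b, sigma=0.5):
    """k(x, y) = exp(-||x - y||^2 / (2 sigma^2)) for scalar inputs."""
    return np.exp(-(a - b) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = np.linspace(0, 2 * np.pi, 30)                      # training inputs
y = np.sin(X) + 0.1 * rng.normal(size=X.shape)         # noisy targets

lam = 0.1
K = rbf(X[:, None], X[None, :])                        # n x n Gram matrix
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # (K + lam I)^{-1} y

def predict(x_new):
    """f(x) = y^T (K + lam I)^{-1} k(x) = sum_i alpha_i k(x_i, x)."""
    return rbf(X, x_new) @ alpha

print(predict(1.0), np.sin(1.0))                       # fitted value vs. true function
```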
Kernel ridge regression
Use the Gaussian RBF kernel k(x, y) = exp(−‖x − y‖² / (2σ²))
[Figure: kernel ridge regression fits for large σ, large λ; small σ, small λ; and small σ, large λ]
Use cross-validation to choose the parameters σ and λ, e.g. as sketched below.
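One way to do this model selection in practice, assuming scikit-learn is available; KernelRidge writes the RBF kernel as exp(−γ‖x − y‖²), so γ plays the role of 1/(2σ²) and its alpha parameter is the regularizer λ.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = np.linspace(0, 2 * np.pi, 100)[:, None]
y = np.sin(X).ravel() + 0.1 * rng.normal(size=100)

param_grid = {
    "alpha": [1e-3, 1e-2, 1e-1, 1.0],   # regularization lambda
    "gamma": [0.1, 1.0, 10.0],          # RBF bandwidth, gamma = 1 / (2 sigma^2)
}
search = GridSearchCV(KernelRidge(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)              # cross-validated choice of lambda and sigma
```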