Review
Le Song
Machine Learning II: Advanced Topics
CSE 8803ML, Spring 2012
What is Machine Learning (ML)?
Study of algorithms that improve their performance at some task with experience
2
Graphical Models
Representation
directed vs. undirected
Conditional independence semantics
Factorization
Inference
message passing algorithm (tree vs. general graph)
Junction tree for graphs
Variational inference vs. sampling
Learning
directed vs. undirected
Fully observed vs. latent variable
Structure learning
3
Conditional Independence Assumptions
Local Markov Assumption (BN): X ⊥ Nondescendants_X | Pa_X
Global Markov Assumption (MN): A ⊥ B | C if sep_G(A, B; C)
Examples (Allergy, Flu, Sinus, Nose, Headache network):
¬(A ⊥ H), but A ⊥ H | S
A ⊥ F, but ¬(A ⊥ F | S)
N ⊥ H | S, but ¬(N ⊥ H)
D-separation, active trail
Derived local and pairwise assumptions (MN):
X ⊥ TheRest | MB_X, e.g., MB_X = {A, B, C, D}
X ⊥ Y | TheRest (no edge between X and Y)
4
Distribution Factorization
Bayesian Networks (Directed Graphical Models)
I-map: I_l(G) ⊆ I(P) ⇔ P(X1, …, Xn) = ∏_{i=1}^n P(Xi | Pa_Xi)
Conditional Probability Tables (CPTs)
Markov Networks (Undirected Graphical Models)
Strictly positive P, I-map: I(G) ⊆ I(P) ⇔ P(X1, …, Xn) = (1/Z) ∏_{i=1}^m Ψ_i(D_i)
Maximal clique potentials Ψ_i(D_i)
Normalization (partition function): Z = Σ_{x1, x2, …, xn} ∏_{i=1}^m Ψ_i(D_i)
5
Representation Power
Can a distribution P represented as a BN always be converted to an MN with the same independencies, and vice versa?
BN: minimal I-map not unique; do not always have a P-map
e.g., the distribution with X1 ⊥ X3 | X2, X4 and X2 ⊥ X4 | X1, X3 (a 4-cycle MN over X1, X2, X3, X4) has no directed P-map
MN: minimal I-map unique; do not always have a P-map
e.g., the v-structure A → S ← F with A ⊥ F but ¬(A ⊥ F | S) has no undirected P-map
6
Inference in Graphical Models
General form of the inference problem
P(X1, …, Xn) ∝ ∏_i Ψ(D_i)
Want to query variables Y given evidence e, and "don't care" about a set of variables Z
Compute τ(Y, e) = Σ_Z ∏_i Ψ(D_i) using variable elimination
Renormalize to obtain the conditional P(Y | e) = τ(Y, e) / Σ_Y τ(Y, e)
Two examples: use the graph structure to order the computation
DAG example over A, B, C, D, E, F, G, H
Chain example: A - B - C - D - E
7
Message passing algorithm
m_{j→i}(X_i) ∝ Σ_{X_j} Ψ(X_i, X_j) Ψ(X_j) ∏_{s ∈ N(j)\i} m_{s→j}(X_j)
Sum out X_j; multiply by the local potentials; take the product of incoming messages (see the sketch below)
X_j can send its message m_{j→i}(X_i) when the incoming messages m_{k→j}(X_j), m_{l→j}(X_j), … from all neighbors in N(j)\i have arrived
8
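To make the update concrete, here is a minimal sketch of sum-product message passing on a chain of discrete variables, assuming unary potentials Ψ(X_i) and pairwise potentials Ψ(X_i, X_{i+1}) stored as NumPy arrays; the function name and toy potentials are illustrative, not from the slides.

```python
import numpy as np

def chain_marginals(unary, pairwise):
    """Sum-product on a chain X1 - X2 - ... - Xn of discrete variables.

    unary:    list of n arrays, unary[i][x] = psi_i(x)
    pairwise: list of n-1 arrays, pairwise[i][x, y] = psi(X_i = x, X_{i+1} = y)
    Returns the normalized node marginals."""
    n = len(unary)
    # Forward messages m_{i -> i+1}
    fwd = [None] * n
    msg = np.ones_like(unary[0])
    for i in range(n - 1):
        msg = pairwise[i].T @ (unary[i] * msg)   # sum out X_i
        msg = msg / msg.sum()                    # normalize for numerical stability
        fwd[i + 1] = msg
    # Backward messages m_{i+1 -> i}
    bwd = [None] * n
    msg = np.ones_like(unary[-1])
    for i in range(n - 1, 0, -1):
        msg = pairwise[i - 1] @ (unary[i] * msg)
        msg = msg / msg.sum()
        bwd[i - 1] = msg
    # Node belief = local potential times product of incoming messages
    marginals = []
    for i in range(n):
        b = unary[i].copy()
        if fwd[i] is not None:
            b = b * fwd[i]
        if bwd[i] is not None:
            b = b * bwd[i]
        marginals.append(b / b.sum())
    return marginals

# Tiny example: 3 binary variables with attractive pairwise potentials
unary = [np.array([0.7, 0.3]), np.array([0.5, 0.5]), np.array([0.2, 0.8])]
pairwise = [np.array([[2.0, 1.0], [1.0, 2.0]])] * 2
print(chain_marginals(unary, pairwise))
```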
Junction tree algorithm for DAG
Moralize the DAG (connect parents, drop edge directions)
Triangulate the moralized graph
Form the clique graph over the maximal cliques (here BC, CDE, ADE, AEF, EFH, GE), connected through separators such as E, DE, AE, EF
Run maximum spanning tree on the clique graph to obtain the junction tree
9
Message passing in junction trees
m_{D_j→D_i}(S_{ji}) ∝ Σ_{D_j \ S_{ji}} Φ(D_j) ∏_{D_t ∈ N(D_j)\D_i} m_{D_t→D_j}(S_{tj})
Sum out the variables not in the separator; multiply by the local potentials; take the product of incoming messages
Separator: S_{kj} = D_k ∩ D_j
Can also be applied to loopy clique graphs for approximate inference
10
Variational Inference
What is the approximating structure? Approximate P(X1, …, Xn) by a simpler distribution Q(X1, …, Xn)
How to measure the goodness of the approximation of Q(X1, …, Xn) to the original P(X1, …, Xn)? Reverse KL-divergence KL(Q || P)
How to compute the new parameters? Optimization Q* = argmin_Q KL(Q || P), e.g., mean field
11
Mean Field Algorithm
Initialize Q(X1, …, Xn) = ∏_i Q_i(X_i) (e.g., randomly or smartly)
Set all variables to unprocessed
Pick an unprocessed variable X_i
Update Q_i (see the sketch below):
Q_i(X_i) = (1/Z_i) exp( Σ_{D_j : X_i ∈ D_j} E_Q[ln Ψ(D_j)] )
Set variable X_i as processed
If Q_i changed, set the neighbors of X_i to unprocessed
Guaranteed to converge
12
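A minimal sketch of the update above for a pairwise MRF, assuming potentials are stored as NumPy arrays; for simplicity it sweeps over all variables a fixed number of times instead of maintaining the processed/unprocessed queue on the slide. All names and the toy model are illustrative.

```python
import numpy as np

def mean_field(unary, pairwise, n_iters=50):
    """Naive mean field Q(X) = prod_i Q_i(X_i) for a pairwise MRF.

    unary:    dict  i -> array of local potentials psi_i(x_i)
    pairwise: dict (i, j) -> matrix psi_ij(x_i, x_j), one entry per edge with i < j
    Returns the factors Q_i as a dict of normalized arrays."""
    nbrs = {i: [] for i in unary}
    for (i, j) in pairwise:
        nbrs[i].append(j)
        nbrs[j].append(i)
    Q = {i: np.ones_like(p) / p.size for i, p in unary.items()}   # uniform init
    for _ in range(n_iters):
        for i in unary:
            # E_Q[ln psi] over all factors containing X_i
            log_q = np.log(unary[i])
            for j in nbrs[i]:
                psi = pairwise[(i, j)] if (i, j) in pairwise else pairwise[(j, i)].T
                log_q += np.log(psi) @ Q[j]       # sum_{x_j} Q_j(x_j) ln psi(x_i, x_j)
            q = np.exp(log_q - log_q.max())       # subtract max for stability
            Q[i] = q / q.sum()                    # normalize (the 1/Z_i factor)
    return Q

# Example: 3 binary variables in a chain 0 - 1 - 2
unary = {0: np.array([0.7, 0.3]), 1: np.array([0.5, 0.5]), 2: np.array([0.2, 0.8])}
pairwise = {(0, 1): np.array([[2.0, 1.0], [1.0, 2.0]]),
            (1, 2): np.array([[2.0, 1.0], [1.0, 2.0]])}
print(mean_field(unary, pairwise))
```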
Why Sampling
Previous inference tasks focus on obtaining the entire posterior distribution P(X_i | e)
Often we want to take expectations
Mean: μ_{X_i|e} = E[X_i | e] = ∫ X_i P(X_i | e) dX_i
Variance: σ²_{X_i|e} = E[(X_i − μ_{X_i|e})² | e] = ∫ (X_i − μ_{X_i|e})² P(X_i | e) dX_i
More generally E[f] = ∫ f(X) P(X | e) dX, which can be difficult to compute analytically
Key idea: approximate the expectation by a sample average (see the sketch below)
E[f] ≈ (1/N) Σ_{i=1}^N f(x_i), where x_1, …, x_N ∼ P(X | e) independently and identically
13
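A tiny illustration of the sample-average idea, assuming for the sake of the example that the posterior P(X | e) is a Gaussian we can sample from directly (in practice one would use the samplers on the following slides); the numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend the posterior P(X | e) is N(1, 2^2) and draw iid samples from it.
samples = rng.normal(loc=1.0, scale=2.0, size=100_000)

# E[f] ≈ (1/N) sum_i f(x_i) for different choices of f
mean_est = samples.mean()                        # f(x) = x
var_est = ((samples - mean_est) ** 2).mean()     # f(x) = (x - mean)^2
prob_est = (samples > 3.0).mean()                # f(x) = 1{x > 3}, i.e. P(X > 3 | e)
print(mean_est, var_est, prob_est)
```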
Sampling Methods
Direct Sampling
Works only for easy distributions (multinomial, Gaussian etc.)
Rejection Sampling
Create samples like direct sampling
Only count samples consistent with given evidence
Importance Sampling
Create samples like direct sampling
Assign weights to samples
Gibbs Sampling
Often used for high-dimensional problems
Sample each variable conditioned on its Markov blanket
14
Gibbs Sampling in formula
Gibbs sampling (see the sketch below)
X = x^0
For t = 1 to N
x_1^t ∼ P(X_1 | x_2^{t−1}, …, x_K^{t−1})
x_2^t ∼ P(X_2 | x_1^t, x_3^{t−1}, …, x_K^{t−1})
…
x_K^t ∼ P(X_K | x_1^t, …, x_{K−1}^t)
For graphical models, only need to condition on the variables in the Markov blanket (e.g., a variable X_1 with neighbors X_2, …, X_5)
Variants:
Randomly pick the variable to sample
Sample block by block, e.g., x_1^t, x_2^t ∼ P(X_1, X_2 | x_3^{t−1}, …, x_K^{t−1})
15
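A minimal sketch of a Gibbs sampler for a case where the full conditionals are available in closed form: a bivariate standard Gaussian with correlation ρ. The target distribution and all names are illustrative, not from the slides.

```python
import numpy as np

def gibbs_bivariate_gaussian(rho, n_samples=5000, burn_in=500, seed=0):
    """Gibbs sampling from a 2D standard Gaussian with correlation rho.

    Each full conditional is Gaussian:
      X1 | X2 = x2 ~ N(rho * x2, 1 - rho^2), and symmetrically for X2."""
    rng = np.random.default_rng(seed)
    x1, x2 = 0.0, 0.0                     # initial state X = x^0
    out = []
    for t in range(n_samples + burn_in):
        x1 = rng.normal(rho * x2, np.sqrt(1 - rho ** 2))   # x1^t ~ P(X1 | x2^{t-1})
        x2 = rng.normal(rho * x1, np.sqrt(1 - rho ** 2))   # x2^t ~ P(X2 | x1^t)
        if t >= burn_in:
            out.append((x1, x2))
    return np.array(out)

samples = gibbs_bivariate_gaussian(rho=0.8)
print(np.corrcoef(samples.T))   # empirical correlation should be close to 0.8
```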
Learning for GMs
                        Known Structure    Unknown Structure
Fully observable data   Relatively easy    Hard
Missing data            Hard (EM)          Very hard
Estimation principles:
Maximum likelihood estimation
Bayesian estimation
Common features:
Make use of the distribution factorization
Make use of inference algorithms
Make use of regularization/priors
16
Bayesian Parameter Estimation
Bayesians treat the unknown parameter as a random variable, whose distribution can be inferred using Bayes' rule:
P(θ | D) = P(D | θ) P(θ) / P(D) = P(D | θ) P(θ) / ∫ P(D | θ) P(θ) dθ
The crucial equation can be written in words:
Posterior = likelihood × prior / marginal likelihood
For iid data (coin flips x_i ∈ {0, 1}), the likelihood is
P(D | θ) = ∏_{i=1}^N P(x_i | θ) = ∏_{i=1}^N θ^{x_i} (1 − θ)^{1 − x_i} = θ^{Σ_i x_i} (1 − θ)^{Σ_i (1 − x_i)} = θ^{#head} (1 − θ)^{#tail}
The prior P(θ) encodes our prior knowledge of the domain
Different priors P(θ) will end up with different estimates P(θ | D)! (see the numerical example below)
17
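A small numerical version of the coin example, assuming a Beta(a, b) prior on θ (the slide does not fix a particular prior); with this conjugate choice the posterior is Beta(a + #head, b + #tail), so the MLE, posterior mean, and MAP estimates can be compared directly. All numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.binomial(1, 0.7, size=20)          # iid coin flips, x_i in {0, 1}
heads, tails = data.sum(), len(data) - data.sum()

# Assumed Beta(a, b) prior on theta; the posterior is Beta(a + #head, b + #tail).
a, b = 2.0, 2.0
post_a, post_b = a + heads, b + tails

theta_mle = heads / len(data)                      # frequentist MLE
theta_post_mean = post_a / (post_a + post_b)       # Bayesian posterior mean
theta_map = (post_a - 1) / (post_a + post_b - 2)   # MAP point estimate
print(theta_mle, theta_post_mean, theta_map)
```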
Frequentist Parameter Estimation
Bayesian estimation has been criticized for being “subjective”
Frequentists think of a parameter as a fixed, unknown
constant, not a random variable
Hence different “objective” estimators, instead of Bayes’ rule
These estimators have different properties, such as being
“unbiased”, “minimum variance”, etc.
A very popular estimator is the maximum likelihood estimator
(MLE), which is simple and has good statistical properties
θ̂_ML = argmax_θ P(D | θ) = argmax_θ ∏_{i=1}^N P(x_i | θ)
18
How should estimators be used?
θ̂_MAP is not Bayesian (even though it uses a prior), since it is a point estimate
Consider predicting the future. A sensible way is to combine predictions based on all possible values of θ, weighted by their posterior probability; this is called Bayesian prediction:
P(x_new | D) = ∫ P(x_new, θ | D) dθ = ∫ P(x_new | θ, D) P(θ | D) dθ = ∫ P(x_new | θ) P(θ | D) dθ
A frequentist prediction will typically use a "plug-in" estimator such as ML/MAP:
P(x_new | D) = P(x_new | θ̂_ML) or P(x_new | D) = P(x_new | θ̂_MAP)
19
Decomposable likelihood of directed models
l(θ; D) = log P(D | θ) = Σ_i log P(a_i | θ_a) + Σ_i log P(f_i | θ_f) + Σ_i log P(s_i | a_i, f_i, θ_s) + Σ_i log P(h_i | s_i, θ_h)
One term for each CPT; break up the MLE problem into independent subproblems
Because of the factorization of the distribution, we can estimate each CPT separately
(Allergy, Flu) → Sinus → Headache network: learn each CPT separately
20
Bayesian estimators for directed models
Factorization: P(X = x) = ∏_i P(x_i | pa_{X_i}, θ_i)
Local CPT: multinomial distribution P(X_i = k | Pa_{X_i} = j) = θ_{kj}
Factorized prior over parameters: P(θ_a) P(θ_b) P(θ_s) P(θ_h), one prior per CPT of the Allergy, Flu, Sinus, Headache network
21
MLE Learning Algorithm for Exponential Models
max_θ l(θ; D) is a convex optimization problem.
It can be solved by many methods, such as gradient ascent or conjugate gradient (see the sketch below):
Initialize model parameters θ
Loop until convergence
Compute ∂l(θ; D)/∂θ_ij = E_{P̂(X_i, X_j)}[X_i X_j] − E_{P_θ(X)}[X_i X_j]
Update θ_ij ← θ_ij + η ∂l(θ; D)/∂θ_ij (gradient ascent on the log-likelihood)
22
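A minimal sketch of this loop for a small pairwise binary exponential model, where the model expectation E_{P_θ}[X_i X_j] is computed by brute-force enumeration rather than by an inference algorithm; the model, names, and toy data are illustrative.

```python
import numpy as np
from itertools import product

def fit_pairwise_binary(data, edges, lr=0.1, n_iters=300):
    """Gradient ascent MLE for P(x) ∝ exp(sum_{(i,j) in edges} theta_ij x_i x_j), x in {0,1}^n.

    Gradient per edge: E_data[Xi Xj] - E_model[Xi Xj]; the model expectation is
    computed by enumerating all 2^n states (small n only)."""
    n = data.shape[1]
    theta = {e: 0.0 for e in edges}
    emp = {(i, j): np.mean(data[:, i] * data[:, j]) for (i, j) in edges}
    states = np.array(list(product([0, 1], repeat=n)))
    for _ in range(n_iters):
        scores = np.zeros(len(states))
        for (i, j), t in theta.items():
            scores += t * states[:, i] * states[:, j]
        p = np.exp(scores - scores.max())
        p /= p.sum()                                           # P_theta over all states
        for (i, j) in edges:
            model = np.sum(p * states[:, i] * states[:, j])    # E_{P_theta}[Xi Xj]
            theta[(i, j)] += lr * (emp[(i, j)] - model)        # ascend the log-likelihood
    return theta

rng = np.random.default_rng(0)
data = rng.binomial(1, 0.5, size=(500, 3))
data[:, 1] = data[:, 0]                      # make X0 and X1 strongly dependent
print(fit_pairwise_binary(data, edges=[(0, 1), (1, 2)]))
```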
Partially observed graphical models
Mixture models and hidden Markov models
23
Why is learning hard?
In fully observed iid settings, the log-likelihood decomposes into a sum of local terms
l(θ; D) = log p(x, z | θ) = log P(z | θ_1) + log p(x | z, θ_2)
With latent variables, all the parameters become coupled together via marginalization
l(θ; D) = log Σ_z p(x, z | θ) = log Σ_z p(x | z, θ_2) P(z | θ_1)
24
EM algorithm
EM: Expectation-Maximization for finding θ
l(θ; D) = log Σ_z p(x, z | θ) = log Σ_z p(x | z, θ_2) P(z | θ_1)
Iterate between the E-step and the M-step until convergence (see the sketch below)
Expectation step (E-step)
f(θ) = E_{q(z)}[log p(x, z | θ)], where q(z) = P(z | x, θ^t)
Maximization step (M-step)
θ^{t+1} = argmax_θ f(θ)
25
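A minimal sketch of EM for the simplest latent-variable model on these slides, a 1-D Gaussian mixture (z = component indicator, x = observation); the initialization and iteration count are illustrative choices.

```python
import numpy as np

def em_gmm_1d(x, k=2, n_iters=100, seed=0):
    """EM for a 1-D Gaussian mixture.

    E-step: responsibilities q(z) = P(z | x, theta_t).
    M-step: maximize E_q[log p(x, z | theta)] in closed form."""
    rng = np.random.default_rng(seed)
    pi = np.ones(k) / k
    mu = rng.choice(x, size=k, replace=False)
    var = np.full(k, x.var())
    for _ in range(n_iters):
        # E-step: r[n, j] = q(z_n = j)
        logp = (-0.5 * (x[:, None] - mu) ** 2 / var
                - 0.5 * np.log(2 * np.pi * var) + np.log(pi))
        r = np.exp(logp - logp.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the expected sufficient statistics
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var

# Data drawn from two well-separated Gaussians
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])
print(em_gmm_1d(x))
```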
Structure Learning
The goal: given a set of independent samples (assignments of the random variables), find the best (the most likely) graphical model structure
Data: (A, F, S, N, H) = (T, F, F, T, F), (A, F, S, N, H) = (T, F, T, T, F), …, (A, F, S, N, H) = (F, T, T, T, T)
Propose candidate structures over A, F, S, N, H and score them
Scores: maximum likelihood; Bayesian score; margin
26
Chow-Liu algorithm
T* = argmax_T [ M Σ_{(i,j)∈T} I(x_i, x_j) − M Σ_i H(x_i) ]
Chow-Liu algorithm (see the sketch below):
For each pair of variables X_i, X_j, compute their empirical mutual information I(x_i, x_j)
Now you have a complete graph connecting the variable nodes, with edge weights equal to I(x_i, x_j)
Run a maximum spanning tree algorithm
27
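A minimal sketch of the three steps above for discrete data stored as non-negative integer codes; the maximum spanning tree is grown greedily (Prim-style) and the toy data are illustrative.

```python
import numpy as np
from itertools import combinations

def empirical_mi(xi, xj):
    """Empirical mutual information I(Xi; Xj) for two discrete columns."""
    joint = np.zeros((xi.max() + 1, xj.max() + 1))
    for a, b in zip(xi, xj):
        joint[a, b] += 1
    joint /= joint.sum()
    pi, pj = joint.sum(1), joint.sum(0)
    nz = joint > 0
    return np.sum(joint[nz] * np.log(joint[nz] / np.outer(pi, pj)[nz]))

def chow_liu(data):
    """Chow-Liu tree: maximum spanning tree under mutual-information edge weights."""
    n = data.shape[1]
    mi = {(i, j): empirical_mi(data[:, i], data[:, j]) for i, j in combinations(range(n), 2)}
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        # greedily add the highest-MI edge that crosses the current cut
        best = max(((i, j) for (i, j) in mi
                    if (i in in_tree) != (j in in_tree)), key=lambda e: mi[e])
        edges.append(best)
        in_tree |= set(best)
    return edges

# Toy data: X1 drives X0 and X2, so the tree should connect 0-1 and 1-2
rng = np.random.default_rng(0)
x1 = rng.binomial(1, 0.5, 1000)
x0 = x1 ^ rng.binomial(1, 0.1, 1000)
x2 = x1 ^ rng.binomial(1, 0.1, 1000)
print(chow_liu(np.column_stack([x0, x1, x2])))
```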
Kernel methods
Kernels
Similarity measure between a pair of data points
Positive definite kernel matrix
Design and combine kernels
Fast kernel computation
Kernelize algorithms
Use inner product between data points to express algorithms
The learned function is a linear combination of the data points
Replace inner products with kernels
SVM, ridge regression, clustering, PCA, CCA, ICA, Statistical tests
Gaussian processes
Covariance functions are kernel functions
28
Support Vector Machines (SVM)
min_{w, b, ξ} (1/2) w^⊤ w + C Σ_j ξ_j
s.t. (w^⊤ x_j + b) y_j ≥ 1 − ξ_j, ξ_j ≥ 0, ∀j
ξ_j: slack variables
29
SVM for nonlinear problems
Solve a nonlinear problem with a linear relation in feature space: transform the data points so that a nonlinear decision boundary in input space becomes a linear decision boundary in feature space
The same idea gives nonlinear clustering, principal component analysis, canonical correlation analysis, …
30
SVM for nonlinear problems
Some problems need complicated, even infinite-dimensional, features
φ(x) = (x, x², x³, x⁴, …)^⊤
Explicitly computing high-dimensional features is time consuming and makes the subsequent optimization costly
31
Kernel trick
In the dual problem of the SVM, replace the inner product φ(x_i)^⊤ φ(x_j) by the kernel k(x_i, x_j):
max_α Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j k(x_i, x_j)
s.t. Σ_i α_i y_i = 0, 0 ≤ α_i ≤ C
The corresponding kernel matrix is positive semidefinite
It is a quadratic program; solve for α, then we get
w = Σ_j α_j y_j φ(x_j)
b = y_k − w^⊤ φ(x_k) for any k such that 0 < α_k < C
Evaluate the decision function on a new data point:
f(x) = w^⊤ φ(x) = (Σ_j α_j y_j φ(x_j))^⊤ φ(x) = Σ_j α_j y_j k(x_j, x)
32
Typical kernels for vector data
Polynomial of degree d: k(x, y) = (x^⊤ y)^d
Polynomial of degree up to d: k(x, y) = (x^⊤ y + c)^d
Gaussian RBF kernel: k(x, y) = exp(−‖x − y‖² / (2σ²))
Laplace kernel: k(x, y) = exp(−‖x − y‖ / (2σ²))
(See the sketch below.)
33
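The kernels above written as NumPy functions over data matrices with one point per row; the Laplace kernel here uses the standard exp(−‖x−y‖/σ) bandwidth parameterization, which differs from the slide's constant only in how σ is defined. All names are illustrative.

```python
import numpy as np

def poly_kernel(X, Y, d=3, c=1.0):
    """Polynomial kernel of degree up to d: k(x, y) = (x^T y + c)^d."""
    return (X @ Y.T + c) ** d

def rbf_kernel(X, Y, sigma=1.0):
    """Gaussian RBF kernel: k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma ** 2))

def laplace_kernel(X, Y, sigma=1.0):
    """Laplace kernel: k(x, y) = exp(-||x - y|| / sigma)."""
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-np.sqrt(np.maximum(sq, 0.0)) / sigma)

X = np.random.default_rng(0).normal(size=(5, 2))
K = rbf_kernel(X, X)
print(np.all(np.linalg.eigvalsh(K) > -1e-10))   # kernel matrix is positive semidefinite
```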
Kernel Functions
Denote the inner product as a function k(x_i, x_j) = φ(x_i)^⊤ φ(x_j)
Kernels also apply to structured data: a graph maps to a vector of substructure counts (# nodes, # edges, # triangles, # rectangles, # pentagons, …) and the kernel is the inner product of these count vectors; similarly for strings such as DNA sequences
Example similarity values from the slide: K(graph, graph) = 0.6, 0.2, 0.5; K(sequence, sequence) = 0.7
34
Combining kernels
Positive weighted combinations of kernels are kernels
If k_1(x, y) and k_2(x, y) are kernels and α, β ≥ 0, then k(x, y) = α k_1(x, y) + β k_2(x, y) is a kernel
Products of kernels are kernels
If k_1(x, y) and k_2(x, y) are kernels, then k(x, y) = k_1(x, y) k_2(x, y) is a kernel
Mappings between spaces give you kernels
If k(x, y) is a kernel, then k(φ(x), φ(y)) is a kernel, e.g., k(x, y) = x² y²
35
Principal component analysis
Given a set of M centered observations x_k ∈ R^d, X = (x_1, x_2, …, x_M), PCA finds the direction that maximizes the variance
w* = argmax_{‖w‖≤1} (1/M) Σ_k (w^⊤ x_k)² = argmax_{‖w‖≤1} (1/M) w^⊤ X X^⊤ w
With C = (1/M) X X^⊤, w* can be found by solving the eigenvalue problem C w = λ w
36
Alternative expression for PCA
The principal component lies in the span of the data: w = Σ_k α_k x_k = X α
Plugging this in, we have C w = (1/M) X X^⊤ X α = λ X α
Furthermore, for each data point x_k the following relation holds:
x_k^⊤ C w = (1/M) x_k^⊤ X X^⊤ X α = λ x_k^⊤ X α, ∀k
In matrix form, (1/M) X^⊤ X X^⊤ X α = λ X^⊤ X α
This only depends on the inner product matrix X^⊤ X
37
Kernel PCA
Key idea: replace the inner product matrix by the kernel matrix
PCA: (1/M) X^⊤ X X^⊤ X α = λ X^⊤ X α
x_k ↦ φ(x_k), Φ = (φ(x_1), …, φ(x_M)), K = Φ^⊤ Φ
Nonlinear component w = Φ α
Kernel PCA: (1/M) K K α = λ K α, equivalent to (1/M) K α = λ α
First form the M × M kernel matrix K, then perform an eigendecomposition of K (see the sketch below)
38
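A minimal sketch of kernel PCA as described above, with the usual centering of the kernel matrix added (the slide leaves centering implicit); the α vectors are rescaled so that the implicit feature-space directions w = Φα have unit norm. Names and the toy data are illustrative.

```python
import numpy as np

def kernel_pca(K, n_components=2):
    """Kernel PCA: eigendecompose the centered M x M kernel matrix K.

    Returns the projections of the training points onto the top components."""
    M = K.shape[0]
    H = np.eye(M) - np.ones((M, M)) / M      # centering in feature space
    Kc = H @ K @ H
    vals, vecs = np.linalg.eigh(Kc)          # ascending eigenvalues
    vals, vecs = vals[::-1], vecs[:, ::-1]
    # scale alpha so that w = Phi @ alpha has unit norm: ||w||^2 = alpha^T Kc alpha
    alpha = vecs[:, :n_components] / np.sqrt(np.maximum(vals[:n_components], 1e-12))
    return Kc @ alpha                         # projections <w, phi(x_k)> for training data

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)                         # Gaussian RBF kernel matrix
print(kernel_pca(K).shape)                    # (100, 2)
```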
CCA in inner product format
Similar to PCA, the directions of projection lie in the span of the data X = (x_1, …, x_m), Y = (y_1, …, y_m):
w_x = X α, w_y = Y β
C_xy = (1/m) X Y^⊤, C_xx = (1/m) X X^⊤, C_yy = (1/m) Y Y^⊤
Earlier we had ρ = max_{w_x, w_y} w_x^⊤ C_xy w_y / sqrt( w_x^⊤ C_xx w_x · w_y^⊤ C_yy w_y )
Plugging in w_x = X α, w_y = Y β, we have
ρ = max_{α, β} α^⊤ X^⊤ X Y^⊤ Y β / sqrt( α^⊤ X^⊤ X X^⊤ X α · β^⊤ Y^⊤ Y Y^⊤ Y β )
The data only appear in inner products
39
Kernel CCA
Replace the inner product matrices by kernel matrices:
ρ = max_{α, β} α^⊤ K_x K_y β / sqrt( α^⊤ K_x K_x α · β^⊤ K_y K_y β )
where K_x is the kernel matrix for data X, with entries K_x(i, j) = k(x_i, x_j)
Solve the generalized eigenvalue problem:
[ 0        K_x K_y ] [α]       [ K_x K_x   0       ] [α]
[ K_y K_x  0       ] [β]  = ρ  [ 0         K_y K_y ] [β]
40
Embedding with kernel features
Transform a distribution into an (infinite-dimensional) feature-space vector: a rich representation that captures the mean, variance, and higher-order moments
41
Estimating embedding distance
Finite sample estimator: form a kernel matrix with 4 blocks and average within each block
42
Measure Dependence via Embeddings
Use the squared distance between embeddings (in feature space) to measure the dependence between X and Y
Dependence measures are useful for: dimensionality reduction, clustering, matching, …
[Smola, Gretton, Song and Scholkopf. 2007]
43
Estimating embedding distances
Given samples (x_1, y_1), …, (x_m, y_m) ∼ P(X, Y)
The dependence measure can be expressed as inner products:
‖μ_XY − μ_X ⊗ μ_Y‖² = ‖E_XY[φ(X) ⊗ ψ(Y)] − E_X[φ(X)] ⊗ E_Y[ψ(Y)]‖²
= <μ_XY, μ_XY> − 2 <μ_XY, μ_X ⊗ μ_Y> + <μ_X ⊗ μ_Y, μ_X ⊗ μ_Y>
In kernel matrix operations: trace( H K H L ), where H = I − (1/m) 11^⊤, K has entries k(x_i, x_j), L has entries k(y_i, y_j), and the X and Y data are ordered in the same way (see the sketch below)
44
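A minimal sketch of the finite-sample dependence statistic (HSIC) built from the trace expression above; the 1/m² normalization and the RBF bandwidth are illustrative choices, and the data must be paired row by row.

```python
import numpy as np

def rbf_gram(X, sigma=1.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def hsic(X, Y, sigma=1.0):
    """Biased HSIC estimate: (1/m^2) trace(K H L H), with H = I - (1/m) 11^T.

    Row i of X and row i of Y must come from the same joint sample."""
    m = X.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    K, L = rbf_gram(X, sigma), rbf_gram(Y, sigma)
    return np.trace(K @ H @ L @ H) / m ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
Y_dep = X + 0.1 * rng.normal(size=(200, 1))     # strongly dependent on X
Y_ind = rng.normal(size=(200, 1))               # independent of X
print(hsic(X, Y_dep), hsic(X, Y_ind))           # the first value should be much larger
```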
Other advanced methods
Combining classifiers
Bagging
Stacking
Boosting (Adaboost)
Semisupervised learning
Graph-based methods (label propagation)
Co-training
Semisupervised SVM
Active learning
Tensor data decomposition
Parafac and Tucker decomposition
45
What is Gaussian Process?
A Gaussian process is a generalization of a multivariate
Gaussian distribution to infinitely many variables
Formally: a collection of random variables, any finite number
of which have (consistent) Gaussian distributions
Informally, an infinitely long vector with dimensions indexed by x ≅ a function f(x)
A Gaussian process is fully specified by a mean function m(x) = E[f(x)] and a covariance function k(x, x′) = E[(f(x) − m(x))(f(x′) − m(x′))]
f(x) ∼ GP(m(x), k(x, x′)), x: indices
46
Covariance function of Gaussian processes
For any finite collection of indices x_1, x_2, …, x_n, the covariance matrix is positive semidefinite
Σ = K = [ k(x_1, x_1)  k(x_1, x_2)  ⋯  k(x_1, x_n)
          k(x_2, x_1)  k(x_2, x_2)  ⋯  k(x_2, x_n)
          ⋮            ⋮            ⋱  ⋮
          k(x_n, x_1)  k(x_n, x_2)  ⋯  k(x_n, x_n) ]
The covariance function needs to be a kernel function over the indices!
E.g., Gaussian RBF kernel k(x, x′) = exp( −(1/2) ‖x − x′‖² )
47
Samples from GPs with different kernels
k(x_i, x_j) = v_0 exp( −(|x_i − x_j| / r)^α ) + v_1 + v_2 δ_ij
(Sample paths for different parameter settings shown on the slide.)
48
Using Gaussian processes for nonlinear regression
Observe a dataset D = {(x_i, y_i)}_{i=1}^n
The prior P(f) is a Gaussian process; like a multivariate Gaussian, the posterior of f is also a Gaussian process
Bayes' rule: P(f | D) = P(D | f) P(f) / P(D)
Everything else about GPs follows the basic rules of probability applied to multivariate Gaussians
49
Noisy Observation
y | x, f(x) ∼ N(f, σ²_noise I); let Y = (y_1, …, y_n)^⊤
f(x) | {(x_i, y_i)}_{i=1}^n ∼ GP( m_post(x), k_post(x, x′) )
m_post(x) = k(x, X) (K + σ²_noise I)^{−1} Y
k_post(x, x′) = k(x, x′) − k(x, X) (K + σ²_noise I)^{−1} k(x′, X)^⊤ (see the sketch below)
50
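A minimal sketch of the posterior mean and covariance formulas above for 1-D inputs and an RBF covariance function; the kernel, its lengthscale, the noise level, and the toy data are illustrative.

```python
import numpy as np

def gp_posterior(X_train, y_train, X_test, sigma_noise=0.1, ell=1.0):
    """GP regression posterior under an RBF covariance function.

    m_post(x)     = k(x, X) (K + sigma^2 I)^{-1} y
    k_post(x, x') = k(x, x') - k(x, X) (K + sigma^2 I)^{-1} k(x', X)^T
    """
    def k(A, B):
        return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / ell ** 2)
    K = k(X_train, X_train) + sigma_noise ** 2 * np.eye(len(X_train))
    K_star = k(X_test, X_train)
    mean = K_star @ np.linalg.solve(K, y_train)
    cov = k(X_test, X_test) - K_star @ np.linalg.solve(K, K_star.T)
    return mean, cov

X_train = np.linspace(0, 5, 20)
y_train = np.sin(X_train) + 0.1 * np.random.default_rng(0).normal(size=20)
X_test = np.linspace(0, 5, 50)
mean, cov = gp_posterior(X_train, y_train, X_test)
print(mean.shape, np.diag(cov)[:3])   # predictive means and variances
```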
Relate GP to class probability
Transform the continuous output of a Gaussian process to a value in [−1, 1] or [0, 1]
With binary outputs, the joint distribution of all variables in the model is no longer Gaussian
The likelihood is also not Gaussian, so we will need to use
approximate inference to compute the posterior GP (Laplace
approximation, sampling)
51
Kernel low rank approximation
Incomplete Cholesky factorization of the n × n kernel matrix K into a factor R of size d × n, with d ≪ n: K ≈ R^⊤ R
Plugging the low-rank factor into the GP posterior (using the matrix inversion lemma, so that only d × d systems need to be solved):
f(x) | {(x_i, y_i)}_{i=1}^n ∼ GP( m_post(x), k_post(x, x′) )
m_post(x) = R_x^⊤ (R R^⊤ + σ²_noise I)^{−1} R Y
k_post(x, x′) = R_{xx′} − R_x^⊤ (R R^⊤ + σ²_noise I)^{−1} (R R^⊤) R_{x′}
where R_x denotes the reduced feature vector of a test point x
52
Incomplete Cholesky Decomposition
We have a few things to understand
Gram-Schmidt orthogonalization
Given a set of vectors V = {v_1, v_2, …, v_n}, find a set of orthonormal basis vectors Q = (u_1, u_2, …, u_n), with u_i^⊤ u_j = 0 for i ≠ j and u_i^⊤ u_i = 1
QR decomposition
Given the orthonormal basis Q, compute the projection of V onto Q: v_i = Σ_j r_ji u_j, R = (r_ji), so V = QR
Cholesky decomposition with pivots
V ≈ Q(:, 1:k) R(1:k, :)
Kernelization
V^⊤ V = R^⊤ Q^⊤ Q R = R^⊤ R ≈ R(1:k, :)^⊤ R(1:k, :)
K = Φ^⊤ Φ ≈ R(1:k, :)^⊤ R(1:k, :)
53
Incomplete Cholesky decomposition: Matlab
(Matlab implementation shown on the slide.)
Kernel entries can be computed on the fly
Computation: O(n d²) kernel evaluations
54
Random features
What basis to use?
e^{jω^⊤(x−y)} can be replaced by cos(ω^⊤(x − y)) since both k(x − y) and p(ω) are real functions
cos(ω^⊤(x − y)) = cos(ω^⊤x) cos(ω^⊤y) + sin(ω^⊤x) sin(ω^⊤y)
For each ω, use the features [cos(ω^⊤x), sin(ω^⊤x)]
What randomness to use?
Randomly draw ω from p(ω), e.g., for the Gaussian RBF kernel, draw ω from a Gaussian (see the sketch below)
55
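A minimal sketch of random Fourier features for the Gaussian RBF kernel: draw ω from the Gaussian spectral density and use the [cos(ω^⊤x), sin(ω^⊤x)] features, so that inner products of the features approximate the kernel. Names and sizes are illustrative.

```python
import numpy as np

def random_fourier_features(X, n_features=200, sigma=1.0, seed=0):
    """Random features for the RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).

    Draw omega ~ N(0, I / sigma^2) and map x to paired cos/sin features so that
    z(x)^T z(y) approximates k(x, y)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, n_features))
    proj = X @ W
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(n_features)

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
Z = random_fourier_features(X, n_features=5000)
approx = Z @ Z.T
exact = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)
print(np.max(np.abs(approx - exact)))   # approximation error should be small
```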
Other advanced methods
Combining classifiers
Bagging, Stacking, Boosting (adaboost)
Semi-supervised learning
Graph-based methods (label propagation), Co-training, Semisupervised SVM
Active Learning
Active learning for SVM
Tensor data analysis
Parafac and Tucker decomposition
Connection with latent variable models
56
Bagging
Bagging: Bootstrap aggregating
Generate B bootstrap samples of the training data: uniformly random sampling with replacement
Train a classifier or a regression function using each bootstrap sample (see the sketch below)
For classification: majority vote on the classification results
For regression: average the predicted values
Original:       1 2 3 4 5 6 7 8
Training set 1: 2 7 8 3 7 6 3 1
Training set 2: 7 8 5 6 4 2 7 1
Training set 3: 3 6 2 7 5 6 2 2
Training set 4: 4 5 1 4 6 4 3 8
57
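A minimal sketch of bagging with a deliberately weak base learner (a decision stump chosen by exhaustive search); the base learner, the toy data, and the 0/1 vote threshold are illustrative choices.

```python
import numpy as np

def bagging_predict(X_train, y_train, X_test, base_fit_predict, B=25, seed=0):
    """Bagging: train B models on bootstrap resamples and majority-vote their predictions.

    base_fit_predict(Xb, yb, X_test) must return 0/1 class predictions for X_test."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    votes = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)       # sample n points with replacement
        votes.append(base_fit_predict(X_train[idx], y_train[idx], X_test))
    votes = np.array(votes)
    return (votes.mean(axis=0) >= 0.5).astype(int)   # majority vote

def stump_fit_predict(Xb, yb, X_test):
    """Weak base learner: the best single-feature threshold rule on the bootstrap sample."""
    best = (0, 0.0, 1, -1.0)
    for j in range(Xb.shape[1]):
        for t in np.unique(Xb[:, j]):
            for sign in (1, -1):
                acc = ((sign * (Xb[:, j] - t) > 0).astype(int) == yb).mean()
                if acc > best[3]:
                    best = (j, t, sign, acc)
    j, t, sign, _ = best
    return (sign * (X_test[:, j] - t) > 0).astype(int)

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
print((bagging_predict(X, y, X, stump_fit_predict) == y).mean())   # training accuracy
```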
Stacking classifiers
Level-0 models are based on different learning models and use
original data (level-0 data)
Level-1 models are based on results of level-0 models (level-1
data are outputs of level-0 models) -- also called “generalizer”
If you have lots of models, you can stack them into deeper
hierarchies
58
Boosting
Boosting: a general method for converting rough rules of thumb
into a highly accurate prediction rule
A family of methods which produce a sequence of classifiers
Each classifier is dependent on the previous one and focuses on
the previous one’s errors
Examples that are incorrectly predicted in the previous classifiers
are chosen more often or weighted more heavily when
estimating a new classifier.
Questions:
How to choose “hardest” examples?
How to combine these classifiers?
59
Adaboost flow chart
Original training set → Data set 1 → Learner1
Training instances that are wrongly predicted by Learner1 play more important roles in the training of Learner2
Data set 2 → Learner2, …, Data set T → LearnerT
Final prediction: weighted combination of the learners
60
AdaBoost
61
Graph-based methods
Idea: construct a graph with edges between very similar examples
Unlabeled data can help "glue" the objects of the same class together
Suppose there are just two labels, 0 and 1. Solve for labels f(x) for the unlabeled examples x to minimize:
Minimum cut: Σ_{e=(u,v)} |f(u) − f(v)|
Minimum "soft-cut": Σ_{e=(u,v)} (f(u) − f(v))²; spectral partitioning
Label propagation: average of neighbor labels (see the sketch below)
62
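A minimal sketch of label propagation on a small similarity graph: each unlabeled node is repeatedly set to the weighted average of its neighbors' values while the labeled nodes stay clamped; the graph and labels are illustrative.

```python
import numpy as np

def label_propagation(W, y, labeled_mask, n_iters=100):
    """Label propagation on a similarity graph W (symmetric, nonnegative).

    Unlabeled f(x) values are set to the weighted average of their neighbors;
    labeled nodes are clamped to their given 0/1 labels."""
    f = np.where(labeled_mask, y.astype(float), 0.5)      # initialize unlabeled to 0.5
    deg = W.sum(axis=1)
    for _ in range(n_iters):
        f_new = (W @ f) / np.maximum(deg, 1e-12)          # average of neighbor labels
        f = np.where(labeled_mask, y.astype(float), f_new)
    return f

# Two clusters connected internally, with a single weak link between them
W = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])
labeled_mask = np.array([True, False, False, False, False, True])   # one label per cluster
print(np.round(label_propagation(W, y, labeled_mask), 2))
```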
Passive Learning (Non-sequential Design)
The Data Source provides labeled data points (labels from an Expert / Oracle) to the Learning Algorithm (estimator)
The algorithm outputs a classifier
63
Active Learning (Sequential Design)
The Learning Algorithm repeatedly requests the label of a chosen data point from the Data Source's Expert / Oracle, receives the label of that point, then requests the label of another data point, and so on
The algorithm outputs a classifier
64
Active Learning (Sequential Design)
How many label requests are required to learn? (Label complexity)
65