Review
Le Song
Machine Learning II: Advanced Topics
CSE 8803ML, Spring 2012

What is Machine Learning (ML)
- The study of algorithms that improve their performance at some task with experience.

Graphical Models
- Representation: directed vs. undirected; conditional independence semantics; factorization.
- Inference: message passing algorithms (trees vs. general graphs); junction trees for general graphs; variational inference vs. sampling.
- Learning: directed vs. undirected; fully observed vs. latent variables; structure learning.

Conditional Independence Assumptions
- Bayesian networks (BN): local Markov assumption, X ⊥ Nondescendants_X | Pa_X; derived notions are d-separation and active trails. Example (Allergy/Flu/Sinus/Headache network): ¬(A ⊥ H) but A ⊥ H | S, while A ⊥ F but ¬(A ⊥ F | S) because of the v-structure A → S ← F (explaining away).
- Markov networks (MN): global Markov assumption, A ⊥ B | C whenever sep_G(A, B; C); derived local and pairwise assumptions are X ⊥ TheRest | MB_X (Markov blanket) and X ⊥ Y | TheRest whenever there is no edge X–Y. Example: for the chain N – S – H, ¬(N ⊥ H) but N ⊥ H | S; for a node X with neighbors A, B, C, D, the Markov blanket is MB_X = {A, B, C, D}.

Distribution Factorization
- Bayesian networks (directed graphical models): G is an I-map of P, I_l(G) ⊆ I(P), if and only if
  P(X_1, ..., X_n) = ∏_{i=1}^n P(X_i | Pa_{X_i}),
  where the factors are conditional probability tables (CPTs).
- Markov networks (undirected graphical models): for strictly positive P, G is an I-map of P, I(G) ⊆ I(P), if and only if
  P(X_1, ..., X_n) = (1/Z) ∏_{i=1}^m Ψ_i(D_i),
  with (maximal) clique potentials Ψ_i and normalization (partition function)
  Z = Σ_{x_1, ..., x_n} ∏_{i=1}^m Ψ_i(D_i).

Representation Power
- Can a distribution P always be converted between a BN and an MN representation? No.
- BN: the minimal I-map is not unique, and a P-map does not always exist; e.g., the independencies of the 4-cycle Markov network over X_1, X_2, X_3, X_4 (X_1 ⊥ X_3 | X_2, X_4 and X_2 ⊥ X_4 | X_1, X_3) have no directed P-map.
- MN: the minimal I-map is unique, but a P-map does not always exist; e.g., the v-structure A → S ← F with A ⊥ F but ¬(A ⊥ F | S) has no undirected P-map.

Inference in Graphical Models
- General form of the inference problem: P(X_1, ..., X_n) ∝ ∏_i Ψ(D_i).
- We want to query variables Y given evidence e, and "don't care" about a set of variables Z.
- Compute τ(Y, e) = Σ_Z ∏_i Ψ(D_i) using variable elimination, then renormalize to obtain the conditional P(Y | e) = τ(Y, e) / Σ_Y τ(Y, e).
- Two examples where the graph structure is used to order the computation: a chain A – B – C – D – E, and a DAG over A, B, C, D, E, F, G, H.

Message passing algorithm
- Message from node j to node i:
  m_{ji}(X_i) ∝ Σ_{X_j} Ψ(X_i, X_j) Ψ(X_j) ∏_{s ∈ N(j)\i} m_{sj}(X_j),
  i.e., take the product of the incoming messages, multiply by the local potentials, and sum out X_j.
- Node j can send its message to i once the incoming messages m_{kj}, m_{lj}, ... from all neighbors in N(j)\i have arrived.

Junction tree algorithm for a DAG
- Steps: moralize the DAG, triangulate the resulting undirected graph, build the clique graph over the maximal cliques, and take a maximum spanning tree (by separator size) to obtain the junction tree.
- In the running example over A–H, the maximal cliques are BC, CDE, GE, ADE, AEF, EFH, connected through separators such as C, DE, E, AE, EF.

Message passing in junction trees
- Message from clique D_j to clique D_i over the separator S_{ji} = D_j ∩ D_i:
  m_{D_j → D_i}(S_{ji}) ∝ Σ_{D_j \ S_{ji}} Φ(D_j) ∏_{D_t ∈ N(D_j)\D_i} m_{D_t → D_j}(S_{tj}),
  i.e., take the product of the incoming messages, multiply by the local potential, and sum out the variables not in the separator.
- The same scheme can also be applied to loopy clique graphs for approximate inference.
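To make the message-passing recursion above concrete, here is a minimal Python/NumPy sketch of sum-product on a chain MRF; the potentials, the three-node toy example, and the function name are illustrative choices, not taken from the slides.

```python
import numpy as np

def chain_marginals(unary, pairwise):
    """Sum-product on a chain MRF X1 - X2 - ... - Xn.

    unary:    list of n vectors, unary[i][x] = psi_i(x)
    pairwise: list of n-1 matrices, pairwise[i][x, y] = psi(X_i = x, X_{i+1} = y)
    Returns the normalized marginal of every node."""
    n = len(unary)
    fwd = [np.ones_like(u) for u in unary]   # fwd[i]: message from node i-1 into node i
    bwd = [np.ones_like(u) for u in unary]   # bwd[i]: message from node i+1 into node i
    for i in range(1, n):                    # forward pass: multiply incoming message by the
        m = (unary[i - 1] * fwd[i - 1]) @ pairwise[i - 1]   # local potentials, sum out X_{i-1}
        fwd[i] = m / m.sum()                 # normalize only for numerical stability
    for i in range(n - 2, -1, -1):           # backward pass: same recursion in reverse
        m = pairwise[i] @ (unary[i + 1] * bwd[i + 1])
        bwd[i] = m / m.sum()
    marginals = [unary[i] * fwd[i] * bwd[i] for i in range(n)]
    return [p / p.sum() for p in marginals]

# toy usage: 3 binary nodes, a pairwise potential favoring agreement,
# and a unary potential pinning node 0 towards state 0
psi = np.array([[2.0, 1.0], [1.0, 2.0]])
print(chain_marginals([np.array([0.9, 0.1]), np.ones(2), np.ones(2)], [psi, psi]))
```

The same two-sweep schedule is exactly the junction tree message passing specialized to a chain, where each clique contains two neighboring variables.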
Variational Inference
- What is the approximating structure? Replace the original distribution P with a simpler distribution Q (e.g., a fully factorized mean field approximation).
- How do we measure the goodness of the approximation of Q(X_1, ..., X_n) to the original P(X_1, ..., X_n)? With the reverse KL-divergence KL(Q || P).
- How do we compute the new parameters? By optimization: Q* = argmin_Q KL(Q || P).

Mean Field Algorithm
- Initialize Q(X_1, ..., X_n) = ∏_i Q(X_i) (e.g., randomly or smartly); set all variables to unprocessed.
- Pick an unprocessed variable X_i and update Q_i:
  Q_i(X_i) = (1/Z_i) exp( Σ_{D_j : X_i ∈ D_j} E_Q[ln Ψ(D_j)] )
- Set X_i as processed; if Q_i changed, set the neighbors of X_i back to unprocessed.
- Guaranteed to converge.

Why Sampling
- The previous inference tasks focus on obtaining the entire posterior distribution P(X_i | e).
- Often we instead want to take expectations:
  Mean: μ_{X_i|e} = E[X_i | e] = ∫ X_i P(X_i | e) dX_i
  Variance: σ²_{X_i|e} = E[(X_i − μ_{X_i|e})² | e] = ∫ (X_i − μ_{X_i|e})² P(X_i | e) dX_i
  More generally E[f] = ∫ f(X) P(X | e) dX, which can be difficult to compute analytically.
- Key idea: approximate the expectation by a sample average,
  E[f] ≈ (1/N) Σ_{i=1}^N f(x_i), where x_1, ..., x_N ∼ P(X | e) independently and identically.

Sampling Methods
- Direct sampling: works only for easy distributions (multinomial, Gaussian, etc.).
- Rejection sampling: create samples as in direct sampling, but only count samples consistent with the given evidence.
- Importance sampling: create samples as in direct sampling and assign weights to the samples.
- Gibbs sampling: often used for high-dimensional problems; samples each variable conditioned on its Markov blanket.

Gibbs Sampling in formulas
- Initialize X = x^0. For t = 1 to N:
  x_1^t ∼ P(X_1 | x_2^{t−1}, ..., x_K^{t−1})
  x_2^t ∼ P(X_2 | x_1^t, x_3^{t−1}, ..., x_K^{t−1})
  ...
  x_K^t ∼ P(X_K | x_1^t, ..., x_{K−1}^t)
- For graphical models, we only need to condition on the variables in the Markov blanket.
- Variants: randomly pick which variable to sample; sample block by block, e.g.
  (x_1^t, x_2^t) ∼ P(X_1, X_2 | x_3^{t−1}, ..., x_K^{t−1}).

Learning for GMs
- Difficulty of the learning problem:

                        Known structure    Unknown structure
  Fully observed data   Relatively easy    Hard
  Missing data          Hard (EM)          Very hard

- Estimation principles: maximum likelihood estimation; Bayesian estimation.
- Common features: make use of the distribution factorization, of inference algorithms, and of regularization/priors.

Bayesian Parameter Estimation
- Bayesians treat the unknown parameters as a random variable whose distribution can be inferred using Bayes' rule:
  P(θ | D) = P(D | θ) P(θ) / P(D) = P(D | θ) P(θ) / ∫ P(D | θ) P(θ) dθ
- The crucial equation can be written in words: posterior = likelihood × prior / marginal likelihood.
- For iid coin-flip data, the likelihood is
  P(D | θ) = ∏_{i=1}^N P(x_i | θ) = ∏_{i=1}^N θ^{x_i} (1 − θ)^{1−x_i} = θ^{Σ_i x_i} (1 − θ)^{Σ_i (1−x_i)} = θ^{#head} (1 − θ)^{#tail}
- The prior P(θ) encodes our prior knowledge of the domain; different priors P(θ) end up with different estimates P(θ | D)!

Frequentist Parameter Estimation
- Bayesian estimation has been criticized for being "subjective"; frequentists think of a parameter as a fixed, unknown constant, not a random variable.
- Hence they use different "objective" estimators instead of Bayes' rule; these estimators have different properties, such as being "unbiased" or "minimum variance".
- A very popular estimator is the maximum likelihood estimator (MLE), which is simple and has good statistical properties:
  θ̂ = argmax_θ P(D | θ) = argmax_θ ∏_{i=1}^N P(x_i | θ)
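As a concrete contrast between point estimates and the Bayesian treatment of the coin-flip model above, here is a short sketch assuming a conjugate Beta(a, b) prior; the data and the hyperparameter choice a = b = 2 are made up for illustration.

```python
import numpy as np

# coin-flip data: x_i = 1 for head, 0 for tail (made-up sample: 7 heads, 3 tails)
x = np.array([1, 1, 1, 0, 1, 1, 0, 1, 1, 0])
n_head, n_tail = int(x.sum()), int(len(x) - x.sum())

a, b = 2.0, 2.0   # Beta(a, b) prior on theta -- an assumed, illustrative choice

# MLE: maximize P(D | theta) = theta^#head (1 - theta)^#tail
theta_mle = n_head / (n_head + n_tail)
# MAP: maximize P(theta | D); still a point estimate, even though it uses the prior
theta_map = (n_head + a - 1) / (n_head + n_tail + a + b - 2)
# Bayesian prediction: integrate over the Beta(a + #head, b + #tail) posterior,
# P(x_new = head | D) = E[theta | D]
p_next_head = (n_head + a) / (n_head + n_tail + a + b)

print(theta_mle, theta_map, p_next_head)
```

The last quantity is the fully Bayesian predictive discussed on the next slide; the first two are the plug-in estimators.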
How should estimators be used?
- θ̂_MAP is not Bayesian (even though it uses a prior), since it is a point estimate.
- Consider predicting the future. A sensible way is to combine the predictions based on all possible values of θ, weighted by their posterior probability; this is called Bayesian prediction:
  P(x_new | D) = ∫ P(x_new, θ | D) dθ = ∫ P(x_new | θ, D) P(θ | D) dθ = ∫ P(x_new | θ) P(θ | D) dθ
- A frequentist prediction will typically use a "plug-in" estimator such as ML or MAP:
  P(x_new | D) = P(x_new | θ̂_ML) or P(x_new | D) = P(x_new | θ̂_MAP)

Decomposable likelihood of a directed model
- For the Allergy/Flu/Sinus/Headache network,
  l(θ; D) = log P(D | θ) = Σ_i log P(a_i | θ_a) + Σ_i log P(f_i | θ_f) + Σ_i log P(s_i | a_i, f_i, θ_s) + Σ_i log P(h_i | s_i, θ_h)
- One term per CPT, so the MLE problem breaks up into independent subproblems.
- Because of the factorization of the distribution, we can estimate each CPT separately.

Bayesian estimation for directed models
- Factorization: P(X = x) = ∏_i P(x_i | pa_{X_i}, θ_i).
- Local CPTs are multinomial distributions: P(X_i = k | Pa_{X_i} = j) = θ_{kj}.
- Use a factorized prior over the parameters, P(θ_a) P(θ_f) P(θ_s) P(θ_h), one factor per CPT.

MLE Learning Algorithm for Exponential Models
- max_θ l(θ; D) is a convex optimization problem that can be solved by many methods, such as gradient descent or conjugate gradient.
- Initialize the model parameters θ. Loop until convergence:
  Compute ∂l(θ, D)/∂θ_ij = E_{P̃(X_i, X_j)}[X_i X_j] − E_{P(X|θ)}[X_i X_j]
  Update θ_ij ← θ_ij + η ∂l(θ, D)/∂θ_ij (gradient ascent, since we maximize the log-likelihood)

Partially observed graphical models
- Examples: mixture models and hidden Markov models.

Why is learning hard?
- In fully observed iid settings, the log-likelihood decomposes into a sum of local terms:
  l(θ; D) = log p(x, z | θ) = log P(z | θ_1) + log p(x | z, θ_2)
- With latent variables, all the parameters become coupled together via marginalization:
  l(θ; D) = log Σ_z p(x, z | θ) = log Σ_z p(x | z, θ_2) P(z | θ_1)

EM algorithm
- EM: expectation-maximization for finding θ when
  l(θ; D) = log Σ_z p(x, z | θ) = log Σ_z p(x | z, θ_2) P(z | θ_1)
- Iterate between the E-step and the M-step until convergence:
  E-step: f(θ) = E_{q(z)}[log p(x, z | θ)], where q(z) = P(z | x, θ^t)
  M-step: θ^{t+1} = argmax_θ f(θ)

Structure Learning
- The goal: given a set of independent samples (assignments of the random variables), find the best (most likely) graphical model structure.
- E.g., from samples such as (A, F, S, N, H) = (T, F, F, T, F), (T, F, T, T, F), ..., (F, T, T, T, T), compare several candidate structures over A, F, S, N, H.
- Score structures with maximum likelihood, a Bayesian score, or a margin-based criterion.

Chow-Liu algorithm
- Finds the best tree structure:
  T* = argmax_T [ M Σ_{(i,j)∈T} Î(x_i, x_j) − M Σ_i Ĥ(x_i) ]
- For each pair of variables X_i, X_j, compute their empirical mutual information Î(x_i, x_j).
- This gives a complete graph over the variable nodes, with edge weights equal to Î(x_i, x_j).
- Run a maximum spanning tree algorithm on this graph.
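A minimal sketch of the Chow-Liu procedure in Python/NumPy, assuming discrete data in an (n_samples, n_vars) array; the function names and the use of Kruskal's algorithm for the maximum spanning tree are my own choices for illustration.

```python
import numpy as np

def empirical_mi(xi, xj):
    """Empirical mutual information I(X_i; X_j) between two discrete columns."""
    n = len(xi)
    _, ci = np.unique(xi, return_inverse=True)
    _, cj = np.unique(xj, return_inverse=True)
    joint = np.zeros((ci.max() + 1, cj.max() + 1))
    np.add.at(joint, (ci, cj), 1.0 / n)            # empirical joint distribution
    pi, pj = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (pi * pj)[nz])).sum())

def chow_liu_tree(data):
    """Maximum-weight spanning tree under empirical mutual information (Kruskal).

    data: (n_samples, n_vars) array of discrete values.
    Returns a list of edges (i, j, mutual_information)."""
    d = data.shape[1]
    edges = [(empirical_mi(data[:, i], data[:, j]), i, j)
             for i in range(d) for j in range(i + 1, d)]
    parent = list(range(d))                        # union-find over variable nodes
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    tree = []
    for w, i, j in sorted(edges, reverse=True):    # heaviest edges first
        ri, rj = find(i), find(j)
        if ri != rj:                               # keep the edge if it joins two components
            parent[ri] = rj
            tree.append((i, j, w))
    return tree
```

The entropy term in T* does not depend on the tree, which is why maximizing the summed mutual information over spanning trees is sufficient.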
Kernel methods
- Kernels: a similarity measure between a pair of data points; the kernel matrix must be positive definite; kernels can be designed and combined; fast kernel computation matters.
- Kernelizing algorithms: express the algorithm through inner products between data points; the learned function is a linear combination of data points; replace the inner products with kernels. Examples: SVM, ridge regression, clustering, PCA, CCA, ICA, statistical tests.
- Gaussian processes: covariance functions are kernel functions.

Support Vector Machines (SVM)
- Primal problem:
  min_w (1/2) w⊤w + C Σ_j ξ_j
  s.t. (w⊤x_j + b) y_j ≥ 1 − ξ_j, ξ_j ≥ 0, ∀j
  where the ξ_j are slack variables.

SVM for nonlinear problems
- Solve a nonlinear problem with a linear relation in feature space: transform the data points so that a linear decision boundary in feature space corresponds to a nonlinear decision boundary in the input space.
- The same idea yields nonlinear clustering, principal component analysis, canonical correlation analysis, and more.
- Some problems need complicated or even infinite-dimensional features, e.g., φ(x) = (x, x², x³, x⁴, ...)⊤. Explicitly computing high-dimensional features is time consuming and makes the subsequent optimization costly.

Kernel trick
- In the dual problem of the SVM, replace the inner product ⟨φ(x_i), φ(x_j)⟩ by a kernel k(x_i, x_j):
  max_α Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j k(x_i, x_j)
  s.t. Σ_i α_i y_i = 0, 0 ≤ α_i ≤ C
- The corresponding kernel matrix is positive semidefinite. This is a quadratic program; solve it for α, then
  w = Σ_j α_j y_j φ(x_j), and b = y_k − w⊤φ(x_k) for any k such that 0 < α_k < C.
- Evaluate the decision function on a new data point:
  f(x) = w⊤φ(x) = (Σ_j α_j y_j φ(x_j))⊤ φ(x) = Σ_j α_j y_j k(x_j, x)

Typical kernels for vector data
- Polynomial of degree d: k(x, y) = (x⊤y)^d
- Polynomial of degree up to d: k(x, y) = (x⊤y + c)^d
- Gaussian RBF kernel: k(x, y) = exp(−‖x − y‖² / (2σ²))
- Laplace kernel: k(x, y) = exp(−‖x − y‖ / (2σ²))

Kernel functions on structured data
- Denote the inner product as a function: k(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩.
- Kernels can be defined directly on structured objects by comparing suitable features: images (the slide's examples have K = 0.6, 0.2, 0.5), graphs mapped to counts of nodes, edges, triangles, rectangles, pentagons, ..., and strings/DNA sequences such as ACAAGAT... vs. GCATGAC... (K = 0.7).

Combining kernels
- Positive weighted combinations of kernels are kernels: if k_1(x, y) and k_2(x, y) are kernels and α, β ≥ 0, then k(x, y) = α k_1(x, y) + β k_2(x, y) is a kernel.
- Products of kernels are kernels: if k_1 and k_2 are kernels, then k(x, y) = k_1(x, y) k_2(x, y) is a kernel.
- Mappings between spaces give kernels: if k(x, y) is a kernel, then k(φ(x), φ(y)) is a kernel, e.g., k(x, y) = x² y².

Principal component analysis
- Given a set of M centered observations x_k ∈ R^d collected as X = (x_1, x_2, ..., x_M), PCA finds the direction that maximizes the variance:
  w* = argmax_{‖w‖≤1} (1/M) Σ_k (w⊤x_k)² = argmax_{‖w‖≤1} (1/M) w⊤XX⊤w
- With C = (1/M) XX⊤, w* can be found by solving the eigenvalue problem C w = λ w.

Alternative expression for PCA
- The principal component lies in the span of the data: w = Σ_k α_k x_k = Xα.
- Plugging this in: C w = (1/M) XX⊤Xα = λ Xα.
- Furthermore, for each data point x_k the relation x_k⊤Cw = (1/M) x_k⊤XX⊤Xα = λ x_k⊤Xα holds; in matrix form,
  (1/M) X⊤XX⊤Xα = λ X⊤Xα
- This only depends on the inner product matrix X⊤X.

Kernel PCA
- Key idea: replace the inner product matrix by the kernel matrix. With x_k ↦ φ(x_k), Φ = (φ(x_1), ..., φ(x_M)), K = Φ⊤Φ, and nonlinear component w = Φα, the PCA condition becomes
  (1/M) K K α = λ K α, which is equivalent to (1/M) K α = λ α.
- So first form the M × M kernel matrix K, then perform an eigendecomposition of K.

CCA in inner product format
- As in PCA, the directions of projection lie in the span of the data: with X = (x_1, ..., x_m) and Y = (y_1, ..., y_m), write w_x = Xα, w_y = Yβ, and
  C_xy = (1/m) XY⊤, C_xx = (1/m) XX⊤, C_yy = (1/m) YY⊤.
- Plugging w_x = Xα and w_y = Yβ into the CCA objective gives
  max_{α,β} α⊤X⊤XY⊤Yβ / ( √(α⊤X⊤XX⊤Xα) √(β⊤Y⊤YY⊤Yβ) )
- The data only appear through inner products.

Kernel CCA
- Replace the inner product matrices by kernel matrices:
  max_{α,β} α⊤K_x K_y β / ( √(α⊤K_x K_x α) √(β⊤K_y K_y β) )
  where K_x is the kernel matrix for data X, with entries K_x(i, j) = k(x_i, x_j).
- Solve the generalized eigenvalue problem
  [ 0        K_x K_y ] [α]       [ K_x K_x   0       ] [α]
  [ K_y K_x  0       ] [β]  = λ  [ 0         K_y K_y ] [β]
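Returning to the kernel PCA recipe above, here is a short NumPy sketch; the RBF kernel, its bandwidth, the explicit centering step, and the toy data are illustrative assumptions rather than prescriptions from the slides (which assume the data are already centered).

```python
import numpy as np

def rbf_kernel(X, sigma=1.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def kernel_pca(X, n_components=2, sigma=1.0):
    """Kernel PCA: eigendecompose the (centered) M x M kernel matrix and
    return the projections of the training points onto the top components."""
    M = X.shape[0]
    K = rbf_kernel(X, sigma)
    H = np.eye(M) - np.ones((M, M)) / M          # centering in feature space
    Kc = H @ K @ H
    lam, alpha = np.linalg.eigh(Kc)              # eigenvalues in ascending order
    lam, alpha = lam[::-1][:n_components], alpha[:, ::-1][:, :n_components]
    alpha = alpha / np.sqrt(np.maximum(lam, 1e-12))  # make w = Phi @ alpha unit norm
    return Kc @ alpha                            # projections <phi(x_k), w>

# toy usage: 30 points on a noisy circle, projected onto 2 nonlinear components
rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, 30)
X = np.c_[np.cos(t), np.sin(t)] + 0.05 * rng.standard_normal((30, 2))
print(kernel_pca(X).shape)
```

Only the kernel matrix is ever touched, which is exactly the point of the "data only appear through inner products" observation above.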
Embedding distributions with kernel features
- Transform a distribution into an (infinite-dimensional) feature-space vector. This is a rich representation: it captures the mean, variance, and higher-order moments.

Estimating embedding distances
- Finite-sample estimator: form a kernel matrix over the combined samples; the squared distance between the two embeddings is obtained by averaging each of its four blocks (the two within-sample blocks and the two cross blocks).

Measuring dependence via embeddings
- Use the squared feature-space distance between the embedding of the joint distribution and the product of the marginal embeddings to measure the dependence between X and Y.
- This dependence measure is useful for dimensionality reduction, clustering, matching, and more. [Smola, Gretton, Song and Scholkopf, 2007]

Estimating the dependence measure
- Given samples (x_1, y_1), ..., (x_m, y_m) ∼ P(X, Y), the dependence measure can be expressed with inner products:
  ‖μ_XY − μ_X ⊗ μ_Y‖² = ‖E_XY[φ(X) ⊗ ψ(Y)] − E_X[φ(X)] ⊗ E_Y[ψ(Y)]‖²
    = ⟨μ_XY, μ_XY⟩ − 2⟨μ_XY, μ_X ⊗ μ_Y⟩ + ⟨μ_X ⊗ μ_Y, μ_X ⊗ μ_Y⟩
- In terms of kernel matrix operations (with the X and Y data ordered in the same way):
  trace(K H L H), where K_ij = k(x_i, x_j), L_ij = k(y_i, y_j), and H = I − (1/m) 11⊤.

Other advanced methods
- Combining classifiers: bagging, stacking, boosting (AdaBoost).
- Semi-supervised learning: graph-based methods (label propagation), co-training, semi-supervised SVM.
- Active learning.
- Tensor data decomposition: Parafac and Tucker decompositions.

What is a Gaussian Process?
- A Gaussian process is a generalization of a multivariate Gaussian distribution to infinitely many variables.
- Formally: a collection of random variables, any finite number of which have (consistent) Gaussian distributions.
- Informally: an infinitely long vector with dimensions indexed by x, i.e., a function f(x).
- A Gaussian process is fully specified by a mean function m(x) = E[f(x)] and a covariance function k(x, x′) = E[(f(x) − m(x))(f(x′) − m(x′))]:
  f(x) ∼ GP(m(x), k(x, x′)), with x ranging over the indices.
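A small NumPy sketch of what "a function is one draw from a GP" means in practice; the grid, zero mean, RBF bandwidth, and jitter value are my own choices for illustration: evaluate the covariance function on a finite collection of indices and draw jointly Gaussian function values.

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    """Gaussian RBF covariance function on scalar indices."""
    return np.exp(-0.5 * ((x[:, None] - y[None, :]) / sigma) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(-5.0, 5.0, 200)                 # a finite collection of indices
K = rbf(x, x) + 1e-8 * np.eye(len(x))           # PSD covariance matrix (+ jitter)
L = np.linalg.cholesky(K)
samples = L @ rng.standard_normal((len(x), 3))  # 3 draws of f ~ N(0, K) on the grid,
                                                # i.e. sample paths from the GP prior
print(samples.shape)                            # (200, 3): each column is one function
```

Changing the covariance function (next slides) changes how smooth and how variable these sample paths look.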
Covariance functions of Gaussian processes
- For any finite collection of indices x_1, x_2, ..., x_n, the covariance matrix must be positive semidefinite:
  Σ = K = [ k(x_1, x_1)  k(x_1, x_2)  ...  k(x_1, x_n)
            k(x_2, x_1)  k(x_2, x_2)  ...  k(x_2, x_n)
            ...
            k(x_n, x_1)  k(x_n, x_2)  ...  k(x_n, x_n) ]
- So the covariance function needs to be a kernel function over the indices, e.g., the Gaussian RBF kernel k(x, x′) = exp(−(1/2)‖x − x′‖²).

Samples from GPs with different kernels
- E.g., draws from GPs with covariance functions of the form
  k(x_i, x_j) = v_0 exp(−(|x_i − x_j| / r)^α) + v_1 + v_2 δ_ij
  look qualitatively different as the hyperparameters vary.

Using Gaussian processes for nonlinear regression
- Observe a dataset D = {(x_i, y_i)}_{i=1}^n.
- The prior P(f) is a Gaussian process; as with multivariate Gaussians, the posterior over f is therefore also a Gaussian process.
- Bayes' rule: P(f | D) = P(D | f) P(f) / P(D).
- Everything else about GPs follows from the basic rules of probability applied to multivariate Gaussians.

Noisy observations
- With y | x, f(x) ∼ N(f, σ²_noise I) and Y = (y_1, ..., y_n)⊤:
  f(x) | {(x_i, y_i)}_{i=1}^n ∼ GP(m_post(x), k_post(x, x′))
  m_post(x) = k(x, X) (K + σ²_noise I)^{-1} Y
  k_post(x, x′) = k(x, x′) − k(x, X) (K + σ²_noise I)^{-1} k(x′, X)⊤

Relating GPs to class probabilities
- Transform the continuous output of the Gaussian process to a value in [−1, 1] or [0, 1].
- With binary outputs, the joint distribution of all variables in the model is no longer Gaussian.
- The likelihood is also not Gaussian, so we need approximate inference to compute the posterior GP (Laplace approximation, sampling).

Kernel low-rank approximation
- Incomplete Cholesky factorization of the n × n kernel matrix K gives R of size d × n with d ≪ n and K ≈ R⊤R.
- Writing R_x for the low-rank features of a test point x (so that k(x, X) ≈ R_x⊤R), the GP posterior can be computed with a d × d instead of an n × n inverse:
  m_post(x) ≈ R_x⊤ (RR⊤ + σ²_noise I)^{-1} R Y
  k_post(x, x′) ≈ R_x⊤R_{x′} − R_x⊤ (RR⊤ + σ²_noise I)^{-1} (RR⊤) R_{x′}

Incomplete Cholesky decomposition
- A few ingredients are needed to understand it:
  Gram-Schmidt orthogonalization: given a set of vectors V = {v_1, v_2, ..., v_n}, find an orthonormal basis Q = {u_1, u_2, ..., u_n} with u_i⊤u_j = 0 for i ≠ j and u_i⊤u_i = 1.
  QR decomposition: given the orthonormal basis Q, compute the projection of V onto Q, v_i = Σ_j r_ji u_j with R = (r_ji), so that V = QR.
  Cholesky decomposition with pivots: V ≈ Q(:, 1:k) R(1:k, :).
  Kernelization: V⊤V = R⊤Q⊤QR = R⊤R ≈ R(1:k, :)⊤ R(1:k, :), so K = Φ⊤Φ ≈ R(1:k, :)⊤ R(1:k, :).
- In the Matlab implementation shown in class, kernel entries are computed on the fly; the computation is O(nd²), with O(nd) kernel evaluations.
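Since the Matlab listing does not survive in this text version, here is a hedged Python/NumPy sketch of pivoted incomplete Cholesky in the same spirit; the function name, stopping tolerance, and toy kernel are my own choices. Kernel entries are computed on the fly, one pivot column per iteration.

```python
import numpy as np

def incomplete_cholesky(X, kernel, tol=1e-6, max_rank=None):
    """Pivoted incomplete Cholesky of K[i, j] = kernel(X[i], X[j]).

    Returns R of shape (d, n) with K ~= R.T @ R, computing only one
    column of K per pivot (kernel entries evaluated on the fly)."""
    n = len(X)
    max_rank = n if max_rank is None else max_rank
    diag = np.array([kernel(X[i], X[i]) for i in range(n)])     # residual diagonal
    R = np.zeros((max_rank, n))
    for j in range(max_rank):
        i = int(np.argmax(diag))               # pivot with the largest residual
        if diag[i] <= tol:                     # remaining error is small: stop early
            return R[:j]
        col = np.array([kernel(X[i], X[m]) for m in range(n)])  # i-th column of K
        R[j] = (col - R[:j].T @ R[:j, i]) / np.sqrt(diag[i])
        diag -= R[j] ** 2                      # update residual diagonal
    return R

# toy usage with a Gaussian RBF kernel on 1-d points
k = lambda a, b: np.exp(-0.5 * (a - b) ** 2)
X = np.linspace(0, 1, 50)
R = incomplete_cholesky(X, k)
print(R.shape, np.abs(k(X[:, None], X[None, :]) - R.T @ R).max())
```

With d pivots this uses on the order of n·d kernel evaluations, which is what makes the low-rank GP formulas above practical for large n.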
Random features
- What basis to use? e^{jω⊤(x−y)} can be replaced by cos(ω⊤(x − y)), since both k(x − y) and p(ω) are real functions, and
  cos(ω⊤(x − y)) = cos(ω⊤x) cos(ω⊤y) + sin(ω⊤x) sin(ω⊤y),
  so for each ω we can use the feature [cos(ω⊤x), sin(ω⊤x)].
- What randomness to use? Randomly draw ω from p(ω); e.g., for the Gaussian RBF kernel, ω is drawn from a Gaussian.

Other advanced methods
- Combining classifiers: bagging, stacking, boosting (AdaBoost).
- Semi-supervised learning: graph-based methods (label propagation), co-training, semi-supervised SVM.
- Active learning: active learning for SVM.
- Tensor data analysis: Parafac and Tucker decompositions; connection with latent variable models.

Bagging
- Bagging = bootstrap aggregating.
- Generate B bootstrap samples of the training data by uniformly random sampling with replacement.
- Train a classifier or a regression function on each bootstrap sample.
- For classification: take a majority vote over the classification results. For regression: average the predicted values.
- Example bootstrap samples of the training set {1, ..., 8}:

  Original        1 2 3 4 5 6 7 8
  Training set 1  2 7 8 3 7 6 3 1
  Training set 2  7 8 5 6 4 2 7 1
  Training set 3  3 6 2 7 5 6 2 2
  Training set 4  4 5 1 4 6 4 3 8

Stacking classifiers
- Level-0 models are based on different learning models and use the original data (level-0 data).
- Level-1 models are based on the results of the level-0 models (level-1 data are the outputs of the level-0 models); the level-1 model is also called a "generalizer".
- If you have lots of models, you can stack them into deeper hierarchies.

Boosting
- Boosting: a general method for converting rough rules of thumb into a highly accurate prediction rule.
- A family of methods that produce a sequence of classifiers; each classifier depends on the previous one and focuses on the previous one's errors.
- Examples that are incorrectly predicted by the previous classifiers are chosen more often, or weighted more heavily, when estimating the next classifier.
- Questions: how do we choose the "hardest" examples, and how do we combine the classifiers?

AdaBoost
- Flow: from the original training set, build data set 1 and train Learner 1; training instances that are wrongly predicted by Learner 1 play a more important role in the training of Learner 2 on data set 2, and so on up to data set T and Learner T; the final prediction is a weighted combination of Learner 1, ..., Learner T.
- (A code sketch of the AdaBoost reweighting scheme appears at the end of this review.)

Graph-based methods for semi-supervised learning
- Idea: construct a graph with edges between very similar examples; unlabeled data can help "glue" objects of the same class together.
- Suppose there are just two labels, 0 and 1. Solve for labels f(x) on the unlabeled examples x that minimize one of:
  Minimum cut: Σ_{e=(u,v)} |f(u) − f(v)|
  Minimum "soft" cut: Σ_{e=(u,v)} (f(u) − f(v))²
  Spectral partitioning
- Label propagation: set each label to the average of its neighbors' labels.

Passive learning (non-sequential design)
- The learning algorithm (estimator) receives labeled data points from the data source via an expert/oracle, and outputs a classifier.

Active learning (sequential design)
- The learning algorithm repeatedly requests the label of a chosen data point from the expert/oracle, receives that label, requests the label of another data point, and so on; finally it outputs a classifier.
- How many label requests are required to learn? This quantity is the label complexity.
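The AdaBoost sketch referenced above, in Python/NumPy with decision stumps as the weak learners; the exhaustive stump search and all function names are illustrative assumptions, since the slides only specify the reweighting and weighted-combination idea.

```python
import numpy as np

def adaboost(X, y, n_rounds=20):
    """AdaBoost with decision-stump weak learners; labels y must be in {-1, +1}."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)                    # example weights, uniform at the start
    ensemble = []                              # list of (alpha, feature, threshold, polarity)
    for _ in range(n_rounds):
        best = None
        for j in range(d):                     # exhaustively pick the stump with the
            for thr in np.unique(X[:, j]):     # smallest weighted training error
                for s in (1.0, -1.0):
                    pred = s * np.where(X[:, j] > thr, 1.0, -1.0)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, s, pred)
        err, j, thr, s, pred = best
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-12))   # weight of this weak learner
        w *= np.exp(-alpha * y * pred)         # misclassified examples get heavier weights
        w /= w.sum()
        ensemble.append((alpha, j, thr, s))
    return ensemble

def adaboost_predict(ensemble, X):
    score = sum(a * s * np.where(X[:, j] > thr, 1.0, -1.0) for a, j, thr, s in ensemble)
    return np.sign(score)
```

Each round trains its weak learner on a reweighted view of the data rather than a resampled one; resampling proportionally to w, as in the flow chart, is an equivalent alternative.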