Kernel methods, kernel SVM and ridge regression
Le Song
Machine Learning II: Advanced Topics
CSE 8803ML, Spring 2012

Collaborative Filtering
$R$: rating matrix; $U$: user factors; $V$: movie factors
$$\min_{U,V} f(U,V) = \|R - UV^\top\|_F^2 \quad \text{s.t.} \quad U \ge 0,\; V \ge 0,\; k \ll m, n$$
- Low-rank matrix approximation approach
- Probabilistic matrix factorization (PMF)
- Bayesian probabilistic matrix factorization

Parameter Estimation and Prediction
The Bayesian approach treats the unknown parameter as a random variable:
$$P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)} = \frac{P(D \mid \theta)\, P(\theta)}{\int P(D \mid \theta)\, P(\theta)\, d\theta}$$
Posterior mean estimate: $\hat\theta_{Bayes} = \int \theta\, P(\theta \mid D)\, d\theta$
Maximum likelihood and MAP estimates: $\hat\theta_{ML} = \arg\max_\theta P(D \mid \theta)$, $\quad \hat\theta_{MAP} = \arg\max_\theta P(\theta \mid D)$
Bayesian prediction takes into account all possible values of $\theta$:
$$P(x_{new} \mid D) = \int P(x_{new}, \theta \mid D)\, d\theta = \int P(x_{new} \mid \theta)\, P(\theta \mid D)\, d\theta$$
A frequentist prediction uses a "plug-in" estimator:
$$P(x_{new} \mid D) \approx P(x_{new} \mid \hat\theta_{ML}) \quad \text{or} \quad P(x_{new} \mid D) \approx P(x_{new} \mid \hat\theta_{MAP})$$

PMF: Parameter Estimation
Parameter estimation via the MAP estimate:
$$\theta_{MAP} = \arg\max_\theta P(\theta \mid D, \alpha) = \arg\max_\theta P(\theta, D \mid \alpha) = \arg\max_\theta P(D \mid \theta)\, P(\theta \mid \alpha)$$
[See the PMF paper for the specific likelihood and priors.]

PMF: Interpret the prior as regularization
Maximizing the posterior distribution with respect to the parameters $U$ and $V$ is equivalent to minimizing a sum-of-squares error function with quadratic regularization terms (plug in the Gaussians and take the log).

Bayesian PMF: predicting new ratings
Bayesian prediction takes into account all possible values of $\theta$:
$$P(x_{new} \mid D) = \int P(x_{new}, \theta \mid D)\, d\theta = \int P(x_{new} \mid \theta)\, P(\theta \mid D)\, d\theta$$
In the paper, all parameters and hyperparameters are integrated out.

Bayesian PMF: overall algorithm
[Algorithm figure from the paper.]

Nonlinear classifier
[Figure: linear SVM decision boundaries vs. nonlinear decision boundaries.]

Nonlinear regression
[Figure: linear regression vs. a nonlinear regression function, with bands three standard deviations above and below the mean.]
Nonlinear regression where we also want to estimate the variance needs more advanced methods, such as Gaussian processes and kernel regression.

Nonconventional clusters
Clusters that are not linearly separable need more advanced methods, such as kernel methods or spectral clustering.

Nonlinear principal component analysis
[Figure: PCA vs. nonlinear PCA.]

Support Vector Machines (SVM)
$$\min_{w} \frac{1}{2} w^\top w + C \sum_j \xi_j \quad \text{s.t.} \quad (w^\top x_j + b)\, y_j \ge 1 - \xi_j,\; \xi_j \ge 0,\; \forall j$$
$\xi_j$: slack variables

SVM for nonlinear problems
Solve a nonlinear problem with a linear relation in feature space: transform the data points so that a linear decision boundary in feature space corresponds to a nonlinear decision boundary in the original space. The same idea applies to nonlinear clustering, principal component analysis, canonical correlation analysis, ...

SVM for nonlinear problems
Some problems need complicated, even infinite-dimensional, features
$$\phi(x) = (x, x^2, x^3, x^4, \ldots)^\top$$
Explicitly computing high-dimensional features is time consuming and makes the subsequent optimization costly.

SVM Lagrangian
Primal problem:
$$\min_{w,\xi} \frac{1}{2} w^\top w + C \sum_j \xi_j \quad \text{s.t.} \quad (w^\top x_j + b)\, y_j \ge 1 - \xi_j,\; \xi_j \ge 0,\; \forall j$$
Lagrangian (the $x_j$ can be replaced by possibly infinite-dimensional features $\phi(x_j)$):
$$L(w, \xi, \alpha, \beta) = \frac{1}{2} w^\top w + C \sum_j \xi_j + \sum_j \alpha_j \left(1 - \xi_j - (w^\top x_j + b)\, y_j\right) - \sum_j \beta_j \xi_j, \qquad \alpha_j \ge 0,\; \beta_j \ge 0$$
Taking derivatives of $L(w, \xi, \alpha, \beta)$ with respect to $w$ and $\xi$ and setting them to zero, we have
$$w = \sum_j \alpha_j y_j x_j, \qquad b = y_k - w^\top x_k \;\text{ for any } k \text{ such that } 0 < \alpha_k < C.$$
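As a quick sanity check on the condition $w = \sum_j \alpha_j y_j x_j$ (an illustration, not part of the original slides), the following sketch fits a soft-margin linear SVM with scikit-learn, whose `dual_coef_` attribute stores $\alpha_j y_j$ for the support vectors:

```python
# Sketch (illustration, not from the slides): check w = sum_j alpha_j * y_j * x_j on toy data.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
y = 2 * y - 1                                   # labels in {-1, +1}

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ holds alpha_j * y_j for the support vectors only,
# so w is a weighted sum of the support vectors.
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_))      # True
```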
SVM dual problem
Plugging $w$ and $b$ into the Lagrangian gives the dual problem
$$\max_\alpha \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^\top x_j \quad \text{s.t.} \quad \sum_i \alpha_i y_i = 0,\; 0 \le \alpha_i \le C$$
The inner product $x_i^\top x_j = \sum_l x_{il}\, x_{jl}$ can be replaced by $\phi(x_i)^\top \phi(x_j) = \sum_l \phi(x_i)_l\, \phi(x_j)_l$.
This is a quadratic program; solving for $\alpha$, we get
$$w = \sum_j \alpha_j y_j x_j, \qquad b = y_k - w^\top x_k \;\text{ for any } k \text{ such that } 0 < \alpha_k < C.$$
Data points with nonzero $\alpha_i$ are called support vectors.

Kernel Functions
Denote the inner product as a function: $k(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$.
[Figure: kernel values between pairs of graphs (e.g., $K(\cdot,\cdot) = 0.6, 0.2, 0.5$), where each graph maps to features counting nodes, edges, triangles, rectangles, pentagons, ...; and a kernel value between two DNA sequences ($K(\cdot,\cdot) = 0.7$).]

Problem of explicitly constructing features
If we explicitly construct the feature map $\phi(x): R^m \mapsto F$, the feature space can grow really large, really quickly.
E.g., polynomial features of degree $d$: $x_1^d,\; x_1 x_2 \cdots x_d,\; x_1^2 x_2 \cdots x_{d-1},\; \ldots$
The total number of such features,
$$\binom{d + m - 1}{d} = \frac{(d + m - 1)!}{d!\, (m - 1)!},$$
is huge for large $m$ and $d$: for $d = 6$, $m = 100$, there are about 1.6 billion terms.

Can we avoid expanding the features?
Rather than computing the features explicitly and then taking the inner product, can we merge the two steps with a clever kernel function $k(x_i, x_j)$?
E.g., the polynomial kernel with $d = 2$:
$$\phi(x)^\top \phi(y) = (x_1^2,\, x_1 x_2,\, x_2 x_1,\, x_2^2)\, (y_1^2,\, y_1 y_2,\, y_2 y_1,\, y_2^2)^\top = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 = (x_1 y_1 + x_2 y_2)^2 = (x^\top y)^2$$
Polynomial kernel: $k(x, y) = (x^\top y)^d = \phi(x)^\top \phi(y)$, only $O(m)$ computation!

What $k(x, y)$ can be called a kernel function?
$k(x, y)$ must be equivalent to first computing the features $\phi(x)$, $\phi(y)$ and then taking the inner product: $k(x, y) = \phi(x)^\top \phi(y)$.
Given a dataset $D = \{x_1, x_2, \ldots, x_n\}$, compute the pairwise kernel values $k(x_i, x_j)$ and form the $n \times n$ kernel matrix (Gram matrix)
$$K = \begin{pmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{pmatrix}$$
$k(x, y)$ is a kernel function iff the matrix $K$ is positive semidefinite for every dataset: $\forall v \in R^n,\; v^\top K v \ge 0$.

Kernel trick
In the dual problem of the SVM, replace the inner product by the kernel:
$$\max_\alpha \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, k(x_i, x_j) \quad \text{s.t.} \quad \sum_i \alpha_i y_i = 0,\; 0 \le \alpha_i \le C$$
This is a quadratic program; solving for $\alpha$, we get
$$w = \sum_j \alpha_j y_j \phi(x_j), \qquad b = y_k - w^\top \phi(x_k) \;\text{ for any } k \text{ such that } 0 < \alpha_k < C$$
(so $b$ itself is computed from kernel values, $b = y_k - \sum_j \alpha_j y_j\, k(x_j, x_k)$).
Evaluating the decision function on a new data point also needs only kernel values:
$$f(x) = w^\top \phi(x) = \Big( \sum_j \alpha_j y_j \phi(x_j) \Big)^\top \phi(x) = \sum_j \alpha_j y_j\, k(x_j, x)$$

Typical kernels for vector data
- Polynomial of degree $d$: $k(x, y) = (x^\top y)^d$
- Polynomial of degree up to $d$: $k(x, y) = (x^\top y + c)^d$
- Exponential kernel (infinite-degree polynomial): $k(x, y) = \exp(s \cdot x^\top y)$
- Gaussian RBF kernel: $k(x, y) = \exp\!\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)$
- Laplace kernel: $k(x, y) = \exp\!\left(-\frac{\|x - y\|}{2\sigma^2}\right)$
- Exponentiated distance: $k(x, y) = \exp\!\left(-\frac{d(x, y)^2}{s^2}\right)$

Shape of some kernels
Translation-invariant kernels: $k(x, y) = g(x - y)$.
The decision boundary is a weighted sum of bumps: $f(x) = \sum_j \alpha_j y_j\, k(x_j, x)$.
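To make the $d = 2$ example above concrete, here is a small numerical sketch (an illustration, not from the slides) comparing the explicit quadratic feature map with the kernel shortcut, and checking that the resulting Gram matrix is positive semidefinite:

```python
# Sketch: degree-2 polynomial kernel vs. explicit feature map, plus a PSD check of the Gram matrix.
import numpy as np

def phi(x):
    # Explicit degree-2 monomial features x_i * x_j: O(m^2) of them.
    return np.outer(x, x).ravel()

def poly2_kernel(x, y):
    # Kernel shortcut: the same inner product in O(m) time.
    return (x @ y) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))

x, y = X[0], X[1]
print(np.isclose(phi(x) @ phi(y), poly2_kernel(x, y)))   # True

# The Gram matrix of a valid kernel is positive semidefinite:
# all eigenvalues are >= 0 (up to round-off).
K = (X @ X.T) ** 2
print(np.linalg.eigvalsh(K).min() >= -1e-10)             # True
```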
How to construct more complicated kernels
- Know the feature space $\phi(x)$, but find a fast way to compute the inner product $k(x, y)$. E.g., string kernels and graph kernels.
- Find a function $k(x, y)$ and prove it is positive semidefinite; make sure the function captures some useful similarity between data points.
- Combine existing kernels to get new kernels. Which combinations still result in kernels?

Combining kernels I
Positive weighted combinations of kernels are kernels: if $k_1(x, y)$ and $k_2(x, y)$ are kernels and $\alpha, \beta \ge 0$, then
$$k(x, y) = \alpha\, k_1(x, y) + \beta\, k_2(x, y)$$
is a kernel. A weighted combination of kernels corresponds to concatenating the feature spaces: with $k_1(x, y) = \phi(x)^\top \phi(y)$, $k_2(x, y) = \psi(x)^\top \psi(y)$ and $\alpha = \beta = 1$,
$$k(x, y) = \begin{pmatrix} \phi(x) \\ \psi(x) \end{pmatrix}^\top \begin{pmatrix} \phi(y) \\ \psi(y) \end{pmatrix}.$$
Mappings between spaces also give kernels: if $k(x, y)$ is a kernel, then $k(\phi(x), \phi(y))$ is a kernel; e.g., $k(x, y) = \|x\|^2 \|y\|^2$ is a kernel.

Kernel combination II
Products of kernels are kernels: if $k_1(x, y)$ and $k_2(x, y)$ are kernels, then $k(x, y) = k_1(x, y)\, k_2(x, y)$ is a kernel. The product kernel corresponds to the tensor product feature space: with $k_1(x, y) = \phi(x)^\top \phi(y)$ and $k_2(x, y) = \psi(x)^\top \psi(y)$,
$$k(x, y) = \big(\phi(x) \otimes \psi(x)\big)^\top \big(\phi(y) \otimes \psi(y)\big).$$
More generally, $k(x, y) = k_1(x, y)\, k_2(x, y) \cdots k_d(x, y)$ is a kernel with higher-order tensor features:
$$k(x, y) = \big(\phi(x) \otimes \psi(x) \otimes \cdots \otimes \xi(x)\big)^\top \big(\phi(y) \otimes \psi(y) \otimes \cdots \otimes \xi(y)\big).$$

Nonlinear regression
[Figure: linear regression vs. a nonlinear regression function, with bands three standard deviations above and below the mean.]
Nonlinear regression where we also want to estimate the variance needs more advanced methods, such as Gaussian processes and kernel regression.

Ridge regression
A dataset $X = (x_1, \ldots, x_n) \in R^{d \times n}$, $y = (y_1, \ldots, y_n)^\top \in R^n$.
With a regularization parameter $\lambda$, the goal is to find the regression weights $w$:
$$w^* = \arg\min_w \sum_i (y_i - x_i^\top w)^2 + \lambda \|w\|^2 = \arg\min_w \|y - X^\top w\|^2 + \lambda \|w\|^2$$
To find $w$, take the derivative of the objective and set it to zero:
$$w^* = (X X^\top + \lambda I)^{-1} X y$$

Matrix inversion lemma
By the matrix inversion lemma,
$$w^* = (X X^\top + \lambda I_d)^{-1} X y = X (X^\top X + \lambda I_n)^{-1} y.$$
Evaluating $w^*$ at a new data point $x$:
$$w^{*\top} x = y^\top (X^\top X + \lambda I_n)^{-1} X^\top x$$
This expression depends only on inner products!

Kernel ridge regression
$$X^\top X = \begin{pmatrix} x_1^\top x_1 & \cdots & x_1^\top x_n \\ \vdots & \ddots & \vdots \\ x_n^\top x_1 & \cdots & x_n^\top x_n \end{pmatrix}, \qquad X^\top x = \begin{pmatrix} x_1^\top x \\ \vdots \\ x_n^\top x \end{pmatrix}$$
Kernel ridge regression: replace the inner products by a kernel function,
$$X^\top X \to K = \big(k(x_i, x_j)\big)_{n \times n}, \qquad X^\top x \to k(x) = \big(k(x_i, x)\big)_{n \times 1},$$
$$f(x) = y^\top (K + \lambda I_n)^{-1}\, k(x)$$

Kernel ridge regression
Use the Gaussian RBF kernel $k(x, y) = \exp\!\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)$.
[Figure: fitted regression functions for large $\sigma$, large $\lambda$; small $\sigma$, small $\lambda$; and small $\sigma$, large $\lambda$.]
Use cross-validation to choose the parameters.
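To illustrate the closed form $f(x) = y^\top (K + \lambda I_n)^{-1} k(x)$, here is a minimal sketch of kernel ridge regression with the Gaussian RBF kernel (an illustration with made-up data; the bandwidth $\sigma$ and regularizer $\lambda$ are arbitrary and would normally be chosen by cross-validation):

```python
# Sketch: kernel ridge regression, f(x) = y^T (K + lambda*I)^{-1} k(x), with a Gaussian RBF kernel.
import numpy as np

def rbf_kernel(A, B, sigma):
    # Pairwise k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for the rows of A and B.
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X_train = rng.uniform(-3, 3, size=(50, 1))                      # training inputs
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=50)     # noisy targets

sigma, lam = 0.5, 0.1                                           # hyperparameters (cross-validate in practice)
K = rbf_kernel(X_train, X_train, sigma)
alpha = np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)  # (K + lambda*I)^{-1} y

X_test = np.linspace(-3, 3, 200)[:, None]
f_test = rbf_kernel(X_test, X_train, sigma) @ alpha             # f(x) = sum_i alpha_i k(x_i, x)
print(f_test[:5])
```

The same model is available off the shelf as `sklearn.kernel_ridge.KernelRidge(kernel="rbf")`, with `GridSearchCV` used to tune the kernel width and regularization strength.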