Kernel methods for comparing distributions, measuring dependence
Le Song
Machine Learning II: Advanced Topics
CSE 8803ML, Spring 2012

Principal component analysis
Given a set of n centered observations x_i \in \mathbb{R}^d, collected as X = (x_1, x_2, \ldots, x_n), PCA finds the direction that maximizes the variance:
w^* = \arg\max_{\|w\| \le 1} \frac{1}{n} \sum_i (w^\top x_i)^2 = \arg\max_{\|w\| \le 1} w^\top \Big(\frac{1}{n} X X^\top\Big) w
With the covariance matrix C = \frac{1}{n} X X^\top, the maximizer w^* can be found by solving the eigenvalue problem
C w = \lambda w

Alternative expression for PCA
The principal component lies in the span of the data:
w = \sum_i \alpha_i x_i = X\alpha
Plugging this in, we have
C w = \frac{1}{n} X X^\top X \alpha = \lambda X \alpha
Furthermore, for each data point x_i the following relation holds:
\frac{1}{n} x_i^\top X X^\top X \alpha = \lambda\, x_i^\top X \alpha, \quad \forall i
In matrix form,
\frac{1}{n} X^\top X X^\top X \alpha = \lambda\, X^\top X \alpha
which only depends on the inner product matrix X^\top X.

Kernel PCA
Key idea: replace the inner product matrix by a kernel matrix.
PCA: \frac{1}{n} X^\top X X^\top X \alpha = \lambda\, X^\top X \alpha
Map x_i \mapsto \phi(x_i), and let \Phi = (\phi(x_1), \ldots, \phi(x_n)) and K = \Phi^\top \Phi. The nonlinear component is w = \Phi\alpha.
Kernel PCA: \frac{1}{n} K K \alpha = \lambda K \alpha, which is equivalent to \frac{1}{n} K \alpha = \lambda \alpha
First form an n \times n kernel matrix K, then perform an eigendecomposition of K.

Kernel PCA example
Gaussian RBF kernel \exp\big(-\frac{\|x - x'\|^2}{2\sigma^2}\big) over a 2-dimensional space.
An eigenvector evaluated at a test point x is a function: w^\top \phi(x) = \sum_i \alpha_i\, k(x_i, x)

Spectral clustering
Form the kernel matrix K with a Gaussian RBF kernel.
Treat the kernel matrix K as the adjacency matrix of a graph (set the diagonal of K to 0).
Construct the graph Laplacian L = D^{-1/2} K D^{-1/2}, where D = \mathrm{diag}(K\mathbf{1}).
Compute the top k eigenvectors V = (v_1, v_2, \ldots, v_k) of L.
Use V as the input to K-means for clustering.

Canonical correlation analysis
Given paired observations X = (x_1, \ldots, x_n) and Y = (y_1, \ldots, y_n), estimate two basis vectors w_x and w_y so that the correlation between the projections onto these vectors is maximized.

CCA derivation II
Define the covariance matrices C_{xx}, C_{yy} and the cross-covariance C_{xy} of x and y. The optimization problem is equal to
\rho = \max_{w_x, w_y} \frac{w_x^\top C_{xy} w_y}{\sqrt{w_x^\top C_{xx} w_x}\,\sqrt{w_y^\top C_{yy} w_y}}
We can require the normalization w_x^\top C_{xx} w_x = w_y^\top C_{yy} w_y = 1, and just maximize the numerator.

CCA as a generalized eigenvalue problem
The optimality conditions say
C_{xy} w_y = \lambda\, C_{xx} w_x
C_{yx} w_x = \lambda\, C_{yy} w_y
Putting these conditions into matrix form:
\begin{pmatrix} 0 & C_{xy} \\ C_{yx} & 0 \end{pmatrix} \begin{pmatrix} w_x \\ w_y \end{pmatrix} = \lambda \begin{pmatrix} C_{xx} & 0 \\ 0 & C_{yy} \end{pmatrix} \begin{pmatrix} w_x \\ w_y \end{pmatrix}
This is a generalized eigenvalue problem A w = \lambda B w.

CCA in inner product format
Similar to PCA, the directions of projection lie in the span of the data X = (x_1, \ldots, x_n), Y = (y_1, \ldots, y_n):
w_x = X\alpha, \quad w_y = Y\beta
C_{xy} = \frac{1}{n} X Y^\top, \quad C_{xx} = \frac{1}{n} X X^\top, \quad C_{yy} = \frac{1}{n} Y Y^\top
Plugging w_x = X\alpha and w_y = Y\beta into the earlier objective, we have
\rho = \max_{\alpha, \beta} \frac{\alpha^\top X^\top X\, Y^\top Y \beta}{\sqrt{\alpha^\top X^\top X\, X^\top X \alpha}\,\sqrt{\beta^\top Y^\top Y\, Y^\top Y \beta}}
The data only appear in inner products.

Kernel CCA
Replace the inner product matrices by kernel matrices:
\rho = \max_{\alpha, \beta} \frac{\alpha^\top K_x K_y \beta}{\sqrt{\alpha^\top K_x K_x \alpha}\,\sqrt{\beta^\top K_y K_y \beta}}
where K_x is the kernel matrix for data X, with entries K_x(i, j) = k(x_i, x_j).
Solve the generalized eigenvalue problem
\begin{pmatrix} 0 & K_x K_y \\ K_y K_x & 0 \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \lambda \begin{pmatrix} K_x K_x & 0 \\ 0 & K_y K_y \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix}
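To make the kernel CCA recipe concrete, here is a minimal numpy/scipy sketch of the generalized eigenvalue problem above. A few choices are mine rather than the slides': a Gaussian RBF kernel, centering of the kernel matrices, and a small ridge term `reg` on the right-hand-side blocks so the eigensolver does not face a singular matrix; the helper names `rbf_kernel` and `kernel_cca` are likewise hypothetical.

```python
import numpy as np
from scipy.linalg import eigh

def rbf_kernel(A, B, sigma=1.0):
    # Gaussian RBF kernel matrix: K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def kernel_cca(X, Y, sigma=1.0, reg=1e-3):
    """First kernel canonical correlation and dual weights (alpha, beta).

    Solves  [0     KxKy] [a]          [KxKx    0 ] [a]
            [KyKx  0   ] [b] = lambda [0    KyKy ] [b]
    with a ridge `reg` added to the right-hand side for numerical stability
    (the ridge is an assumption on top of the slides' formulation).
    """
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    Kx = H @ rbf_kernel(X, X, sigma) @ H
    Ky = H @ rbf_kernel(Y, Y, sigma) @ H
    zero = np.zeros((n, n))
    A = np.block([[zero, Kx @ Ky], [Ky @ Kx, zero]])
    B = np.block([[Kx @ Kx + reg * np.eye(n), zero],
                  [zero, Ky @ Ky + reg * np.eye(n)]])
    evals, evecs = eigh(A, B)                # generalized symmetric eigensolver
    top = evecs[:, -1]                       # eigenvector of the largest eigenvalue
    return evals[-1], top[:n], top[n:]
```

Without some form of regularization, kernel CCA with full-rank kernel matrices can reach perfect correlation on any data set, so the ridge (or a low-rank approximation of the kernel matrices) does statistical work here, not just numerical cleanup.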
Comparing two distributions
For two Gaussian distributions p(x) and q(x) with unit variance, simply test H_0: \mu_1 = \mu_2.
For general distributions, we can also use the KL-divergence and test H_0: p(x) = q(x):
KL(p \| q) = \int p(x) \log \frac{p(x)}{q(x)}\, dx
Given a set of samples x_1, \ldots, x_m \sim p(x) and x'_1, \ldots, x'_n \sim q(x), the Gaussian means are easy to estimate, \mu_1 \approx \frac{1}{m}\sum_i x_i, but for the KL-divergence we need to estimate the density functions first:
\int p(x) \log \frac{p(x)}{q(x)}\, dx \approx \frac{1}{m} \sum_i \log \frac{\hat{p}(x_i)}{\hat{q}(x_i)}

Embedding distributions into feature space
Summary statistics for distributions: mean, covariance, expected features.
Pick a kernel, and generate a different summary statistic.

Pictorial view of embedding a distribution
Transform the entire distribution to its expected features in feature space, via the feature map \phi:
\mu_X := \mathbb{E}_X[\phi(X)]

Finite sample approximation of the embedding
The mapping from a distribution to its mean embedding is one-to-one for certain kernels (e.g., the Gaussian RBF kernel).
The sample average \hat{\mu}_X = \frac{1}{m} \sum_i \phi(x_i) converges to the true mean embedding at rate O(1/\sqrt{m}).

Embedding distributions: mean
The mean reduces the entire distribution to a single number, so its representation power is very restricted (a 1D feature space).

Embedding distributions: mean + variance
The mean and variance reduce the entire distribution to two numbers: a richer representation, but still not enough (a 2D feature space).

Embedding with kernel features
Transform the distribution to an infinite-dimensional vector in feature space, capturing the mean, variance, and higher order moments: a rich representation.

Estimating embedding distances
Given samples x_1, \ldots, x_m \sim p(x) and x'_1, \ldots, x'_n \sim q(x), the distance between embeddings can be expressed in terms of inner products:
\|\mu_p - \mu_q\|^2 = \langle \mu_p, \mu_p \rangle - 2\langle \mu_p, \mu_q \rangle + \langle \mu_q, \mu_q \rangle
Finite sample estimator: form a kernel matrix with 4 blocks (two within-sample blocks and two cross-sample blocks) and average each block:
\|\hat{\mu}_p - \hat{\mu}_q\|^2 = \frac{1}{m^2} \sum_{i,j} k(x_i, x_j) - \frac{2}{mn} \sum_{i,j} k(x_i, x'_j) + \frac{1}{n^2} \sum_{i,j} k(x'_i, x'_j)

Optimization view of the embedding distance
\|\mu_p - \mu_q\|^2 = \Big(\sup_{\|w\| \le 1} \langle w, \mu_p - \mu_q \rangle\Big)^2 = \Big(\sup_{\|w\| \le 1} \langle w, \mathbb{E}_p[\phi(X)] - \mathbb{E}_q[\phi(X)] \rangle\Big)^2
The witness function is
w^* = \frac{\mu_p - \mu_q}{\|\mu_p - \mu_q\|} \approx \frac{\frac{1}{m}\sum_i \phi(x_i) - \frac{1}{n}\sum_i \phi(x'_i)}{\|\hat{\mu}_p - \hat{\mu}_q\|}

Plotting the witness function values
w^*(x) = {w^*}^\top \phi(x) \propto \frac{1}{m} \sum_i k(x_i, x) - \frac{1}{n} \sum_i k(x'_i, x)
Example: a Gaussian and a Laplace distribution with the same mean and variance (using a Gaussian RBF kernel).
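The four-block estimator and the witness function translate almost line for line into code. Below is a minimal numpy sketch; the Gaussian RBF kernel and the names `rbf_kernel`, `mmd2`, and `witness` are my own choices rather than anything fixed by the slides, and the estimator is the simple biased one (all entries of each block averaged, diagonals included).

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # Gaussian RBF kernel matrix: K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2(X, Xp, sigma=1.0):
    """Biased estimate of ||mu_p - mu_q||^2 from samples X ~ p and Xp ~ q.

    Average the two within-sample kernel blocks and subtract twice the
    average of the cross block, exactly as in the four-block picture.
    """
    Kxx = rbf_kernel(X, X, sigma)
    Kyy = rbf_kernel(Xp, Xp, sigma)
    Kxy = rbf_kernel(X, Xp, sigma)
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()

def witness(X, Xp, grid, sigma=1.0):
    # Witness function evaluated on `grid`, up to normalization:
    # w*(x) ~ (1/m) sum_i k(x_i, x) - (1/n) sum_j k(x'_j, x)
    return rbf_kernel(grid, X, sigma).mean(1) - rbf_kernel(grid, Xp, sigma).mean(1)
```

Evaluating `witness` on a grid reproduces the kind of plot described on the Gaussian-versus-Laplace slide: the function is positive where p places more mass than q and negative where q places more.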
Application of the kernel distance measure

Covariate shift correction
Training and test data are not from the same distribution. We want to reweight the training data points so that they match the distribution of the test data points:
\arg\min_{\alpha \ge 0,\ \|\alpha\|_1 = 1} \Big\| \sum_i \alpha_i \phi(x_i) - \frac{1}{n'} \sum_j \phi(y_j) \Big\|^2
where the x_i are the training points and the y_j are the n' test points.

Embedding Joint Distributions
Transform the entire joint distribution to its expected features:
(x, y) \mapsto \phi(x) \otimes \phi(y), \quad \mu_{XY} := \mathbb{E}_{XY}[\phi(X) \otimes \phi(Y)]
This captures the X mean, the Y mean, the cross covariance, and higher order features.

Embedding Joint Distributions: Finite Sample
\hat{\mu}_{XY} = \frac{1}{m} \sum_i \phi(x_i) \otimes \phi(y_i)
i.e., uniform weights on the feature-mapped data points. [Smola, Gretton, Song and Scholkopf, 2007]

Measure Dependence via Embeddings
Use the squared distance between the joint embedding and the product of the marginal embeddings to measure the dependence between X and Y. [Smola, Gretton, Song and Scholkopf, 2007]
The dependence measure is useful for:
• Dimensionality reduction
• Clustering
• Matching
• ...

Estimating the dependence measure
Given samples (x_1, y_1), \ldots, (x_m, y_m) \sim p(x, y), the dependence measure can be expressed in terms of inner products:
\|\mu_{XY} - \mu_X \otimes \mu_Y\|^2 = \|\mathbb{E}_{XY}[\phi(X) \otimes \phi(Y)] - \mathbb{E}_X[\phi(X)] \otimes \mathbb{E}_Y[\phi(Y)]\|^2
= \langle \mu_{XY}, \mu_{XY} \rangle - 2\langle \mu_{XY}, \mu_X \otimes \mu_Y \rangle + \langle \mu_X \otimes \mu_Y, \mu_X \otimes \mu_Y \rangle
In terms of kernel matrices this is
\frac{1}{m^2} \mathrm{trace}(K H L H), \quad H = I - \frac{1}{m}\mathbf{1}\mathbf{1}^\top, \quad K_{ij} = k(x_i, x_j), \quad L_{ij} = l(y_i, y_j)
where the X and Y data are ordered in the same way (paired).

Optimization view of the dependence measure
\|\mu_{XY} - \mu_X \otimes \mu_Y\|^2 = \Big(\sup_{\|w\| \le 1} \langle w, \mu_{XY} - \mu_X \otimes \mu_Y \rangle\Big)^2, \quad w^* \propto \mu_{XY} - \mu_X \otimes \mu_Y
The witness function is w^*(x, y) = {w^*}^\top (\phi(x) \otimes \phi(y)).
Example: a distribution with two stripes versus the uniform distribution over [-1, 1] × [-1, 1].

Application of the dependence measure
Independent component analysis: transform the time series so that the resulting signals are as independent as possible (minimize kernel dependence).
Feature selection: choose a set of features so that their dependence with the labels is as large as possible (maximize kernel dependence).
Clustering: generate labels for each data point so that the dependence between the labels and the data is maximized (maximize kernel dependence).
Supervised dimensionality reduction: reduce the dimension of the data so that its dependence with side information is maximized (maximize kernel dependence).

PCA vs. supervised dimensionality reduction
Example: 20 newsgroups.

Supervised dimensionality reduction
Example: 10 years of NIPS papers (text + coauthor networks).

Visual Map of LabelMe Images

Imposing Structures to Image Collections
Layout (sort/organize) images according to high-dimensional image features (color, texture, SIFT, composition, description) and maximize the dependence with an external structure, so that adjacent points on the grid are similar.

Compare to Other Methods
Other layout algorithms do not have exact control over what structure to impose. Compared methods:
Kernel embedding method [Quadrianto, Song and Smola, 2009]
Generative Topographic Map (GTM) [Bishop et al., 1998]
Self-Organizing Map (SOM) [Kohonen, 1990]
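To close the loop between the dependence measure and the applications above, here is a minimal numpy sketch of the plug-in statistic trace(K H L H) / m^2 from the dependence-measure slides (known in the literature as the Hilbert-Schmidt Independence Criterion, HSIC). The Gaussian RBF kernels on both arguments and the names `rbf_kernel` and `hsic` are my own choices, and the toy check at the end is purely illustrative.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # Gaussian RBF kernel matrix: K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def hsic(X, Y, sigma_x=1.0, sigma_y=1.0):
    """Plug-in estimate of ||mu_XY - mu_X (x) mu_Y||^2 = trace(K H L H) / m^2.

    X[i] and Y[i] must be paired (ordered the same way). Larger values mean
    stronger dependence; under independence the statistic is close to zero.
    """
    m = X.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m      # centering matrix H = I - (1/m) 1 1^T
    K = rbf_kernel(X, X, sigma_x)            # kernel matrix on the X sample
    L = rbf_kernel(Y, Y, sigma_y)            # kernel matrix on the Y sample
    return np.trace(K @ H @ L @ H) / m**2

# toy check: a dependent pair should score higher than an independent one
rng = np.random.default_rng(0)
x = rng.normal(size=(300, 1))
print(hsic(x, np.sin(3 * x) + 0.1 * rng.normal(size=(300, 1))))   # larger
print(hsic(x, rng.normal(size=(300, 1))))                          # close to zero
```

This is the quantity that the feature selection, clustering, and supervised dimensionality reduction applications above maximize (and that kernel ICA minimizes), typically with one of the two kernel matrices computed on labels or side information.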