Kernel Methods for Topological Data Analysis Toward Topological Statistics Kenji Fukumizu The Institute of Statistical Mathematics (Tokyo, Japan) Joint work with Genki Kusano and Yasuaki Hiraoka (Tohoku Univ.) supported by JST CREST 2nd UCL Workshop on the Theory of Big Data. 6 - 8 January 2016 1 Topological Data Analysis • TDA: a new method for extracting topological or geometrical information of data. • Persistence homology: a recent mathematical tool (Edelsbrunner et al 2002; Carlsson 2005) • Progress of computational topology: computing topological invariants in feasible computational cost. • Big data? Not only the size, buy complex structure matters TDA. 2 TDA: Various applications Computer Vision Data of highly complex geometric structure Often difficult to define good feature vectors / descriptors Biology / protein Material Science Liquid Brain Science Glass Shape signature, natural image statistics (Freedman & Chen 2009) Structure change of proteins Brain artery trees e.g. age effect (Bendich et al 2014) eg. open / closed (Kovacev-Nikolic et al 2015) Non-crystal materials etc… (Nakamura, Hiraoka, Hirata, Escolar, Nishiura. Nanotechnology 26 (2015)) TDA provides a compact representation for geometry of such data 3 Algebraic Topology • Homology group Algebraic operation The generators of 1 dim. homology group 0 dim(connected comp.) : 1 dim(ring) : ≅ ⊕ ・・・ ≅ ≅ Homology group: topological invariance ≅ Independent ℓ-dim. holes Various topological invariances e.g. Euler number = ∑ −1 dim 4 Topology of statistical data? True structure neighborhood (e.g. manifold learning) Small disconnected object Noisy sample Stable extraction of topology is NOT easy! Large Small ring is not visible 5 Persistence Homology • All considered = ⊂ , ≔∪ ( ) small large Two rings (generators of 1 dim homology) persist in a long interval. 6 • Persistence homology (formal definition) Filtration of topological spaces X ∶ (X): , → ⊂ →⋯→ ( at at ≅0→⋯→0→ →⋯→ ⊂⋯⊂ ) ≅⊕ [ , →0→⋯→0 ] Irreducible decomposition : field The lifetime of each generator is rigorously defined, and can be computed numerically. • Persistence diagram (PD) Birth and death of a generator Birth Death Handy descriptor of complex geometry PD plots the birth and death of each generator of PH in a 2D graph ( ≥ ). 7 Topological statistics: systematic data analysis in TDA • Conventional analysis in TDA: analysis WITH PH/PD Data Computation of PH Display(PD) Analysis by human • Our statistical approach to TDA: analysis OF random PH/PD Many data sets Many persistent diagrams Computation of PD’s PD1 How? PD2 Analysis of PD’s PD3 Software available CGAL / PHAT Features / Descriptors PDn Topological Statistics 8 “I predict a new subject of statistical topology. Rather than count the number of holes, Betti numbers, etc., one will be more interested in the distribution of such objects on noncompact manifolds as one goes out to infinity.” — Isadore Singer, 2004. 9 Kernel representation of PD • Vectorization of PD by positive definite kernel ≔∑ • PD = Discrete measure ∈ • Embedding of PD’s into RKHS ℇ : ↦∫ ⋅, =∑ (⋅, ) ∈ • For some kernels (e.g., Gaussian, Laplace), ℇ is injective. , Vectorization : positive definite kernel : corresponding RKHS • By vectorization, • a number of methods for data analysis can be applied, SVM, regression, PCA, CCA, etc. • tractable computation is possible with kernel methods. 10 Persistence Weighted Gaussian (PWG) Kernel Generators close to the diagonal may be noise, and should be discounted. , = = , Pers exp − ≔ arctan ≔ Pers − for ( , ∈{ , ∈ > 0) | ≥ } Pers(x1) 11 • Stability with PWG kernel embedding • PWGK defines a distance measure on the persistence diagrams, , ≔ ℇ −ℇ Statiblity Theorem (Kusano, Hiraoka, Fukumizu 2015+) : compact subset in . ⊂ , ⊂ : finite sets. If > + 1, then with PWG kernel ( , , ), ( ), ( ) ≤ , . : constant depending only on , , , , ( ): dimensional persistence diagram of A small change of a set causes only a small change in PD : Haussdorff distance This stability is not known for Gaussian kernel. 12 2nd‑level kernel Data analysis method PD1 PD2 Embedding ℇ ℇ PD3 Application of pos. def. Kernel on RKHS … ℇ PDm PD Data sets Vectors in RKHS 2nd-level kernel (SVM for measures, Muandet, Fukumizu, Dinuzzo, Schölkopf 2012) , • RKHS-Gaussian kernel = exp − derives ℇ ( , = exp − ) ℇ ( ) , : Persistence diagrams 13 Computational issue The number of generators in a PD may be large (≥ 103, 104 ) ℇ ( For =∑ computation ℇ ( =∑ )−ℇ ( ∑ () ∪ Δ, , ) ℇ ( ) = exp − requires ) , The number of exp − ≈ 10 +∑ ∑ = − 2∑ , ( ) ∑ , . computationally expensive for = max{ | = 1, … , } 14 • Approximation by random features (Rahimi & Recht 2008) Gaussian distribution =: By Bochner’s theorem exp − = ∫ (Fourier transform) ,…, Approximation by sampling: exp − ∑ ∑ ∑ℓ ≈ , ≈ ∑ ℓ ∑ ) ℓ ( ) () ∑ℓ = ∑ℓ Computational cost ( : . . .~ ℓ ℓ ∑ () ℓ dim. 2nd level Gram matrix ( Big gain if , ≪ ( ) ∑ ℓ + ). c.f. ( ) 15 Comparison: Persistence Scale Space Kernel (Reininghaus et al 2015) • PSS Kernel , 1 = 8 − exp 8 = ( , ) for − exp = ( , ). ℇ ( ) is considered. − 8 Pos. def. on 0 on Δ. , ≥ • Comparison between PWGK and PSSK • PWGK can control the discount around the diagonal independently of the bandwidth parameter. • PSSK is not shift-invariant Random feature approximation is not applicable. • In Reininghaus et al 2015, 2nd level kernel is not considered. 16 Synthetic example: SVM classification • Classification of PD’s by SVM • Synthetic data • One big circle (random location and sample size) 1 with or without small circle 0 • = XOR((birth( 1)<1 && death( 1)>4)), 0 exists) • In fact, data include noise. S1 PD1 • 100 for training =1 and 100 for testing • Result (correct classification) • PWGK (proposed): 83.8% • PSSK (comparison): 46.5% S1 S0 Data points 1 S1 No S0 S0 noise Data points 2 17 Application: Transition of Silica (SiO2 ) If cooled down rapidly from the liquid state, SiO2 changes into the glass state (not Amorphous: “soft” structure to crystal). Goal: identify the temperature of phase transition. Data: Molecular Dynamics simulation for SiO2. 3D arrangements of the atoms are (Nakamura et al 2015; Hiraoka et al 2015) used for computing PD at 80 temperatures. Examples of PD’s Liquid Glass (Amorphous) 18 Change point detection • Data along a parameter , = 1, … , . Kernel Change Point Analysis with Fisher Discriminant score (Harchoui et al 2009): For each , two classes are defined by the data before and after . Fisher score on RKHS is used. ∑ • For each , compute : = ∑ Φ( ) and Φ( ). : = • Compute Δ ≔ : + : + ( : − : ) . • Find max Δ . • For the packing problem, =ℇ ( = 1, … , 80). 19 • Detection of liquid-glass state transition • Approach in physics:Estimation using derivatives of enthalpy, but not so accurate. Estimate = [2000K, 3500K] • Our approach: Change point detection by Kernel FDR. • Number of generators in a PD is 30000 at most difficult to use PSSK directly • Only PWGK (proposed) is applied with random features. KFDR Gram matrix Detected change point = 3100K 20 • Kernel PCA with PWGK Sharp change between the two phases. (Colored by the result of change point detection. Colors are not used for KPCA). PD + PWGK successfully capture “soft” and complex structure of the glass state. 21 Conclusion • Goal: establishing topological statistics • Persistence homology can be used for defining useful descriptors to complex objects of geometrical structures. • Systematic statistical methods for TDA: Kernel methods introduce systematic data analysis to TDA. • Vectorization of persistence diagrams by kernel embedding. • Persistence weighted Gaussian kernel flexible kernel for noise. • A toolbox of kernel methods become available: SVM, kernel ridge regression, kernel PCA, kernel CCA, etc …. 22 Thank you 23