Kernel Methods for Topological Data Analysis Toward Topological Statistics Kenji Fukumizu

advertisement
Kernel Methods for Topological Data Analysis
Toward Topological Statistics
Kenji Fukumizu
The Institute of Statistical Mathematics (Tokyo, Japan)
Joint work with Genki Kusano and Yasuaki Hiraoka (Tohoku Univ.) supported by JST CREST
2nd UCL Workshop on the Theory of Big Data. 6 - 8 January 2016
1
Topological Data Analysis
• TDA: a new method for extracting topological or geometrical
information of data.
• Persistence homology: a recent mathematical tool
(Edelsbrunner et al 2002; Carlsson 2005)
• Progress of computational topology: computing topological invariants in
feasible computational cost.
• Big data? Not only the size, buy complex structure matters  TDA.
2
TDA: Various applications
Computer Vision
Data of highly complex geometric structure
Often difficult to define good feature vectors / descriptors
Biology / protein
Material Science
Liquid
Brain Science
Glass
Shape signature,
natural image statistics
(Freedman & Chen 2009)
Structure change
of proteins
Brain artery trees
e.g. age effect
(Bendich et al 2014)
eg. open / closed
(Kovacev-Nikolic et al 2015)
Non-crystal materials
etc…
(Nakamura, Hiraoka, Hirata, Escolar, Nishiura.
Nanotechnology 26 (2015))
TDA provides a compact representation for geometry of such data
3
Algebraic Topology
• Homology group
Algebraic
operation
The generators of 1 dim. homology group
0 dim(connected comp.) :
1 dim(ring)
:
≅ ⊕
・・・
≅
≅
Homology group:
topological invariance
≅
Independent
ℓ-dim. holes
Various topological invariances
e.g. Euler number = ∑ −1 dim
4
Topology of statistical data?
True structure
neighborhood
(e.g. manifold learning)
Small  disconnected object
Noisy sample
Stable extraction of topology
is NOT easy!
Large  Small ring is not visible
5
Persistence Homology
• All considered
=
⊂
,
≔∪
( )
small
large
Two rings (generators of 1 dim homology) persist
in a long interval.
6
• Persistence homology (formal definition)
Filtration of topological spaces X ∶
(X):
,
→
⊂
→⋯→
(
at
at
≅0→⋯→0→
→⋯→
⊂⋯⊂
) ≅⊕
[ ,
→0→⋯→0
]
Irreducible decomposition
: field
The lifetime of each generator is rigorously defined,
and can be computed numerically.
• Persistence diagram (PD)
Birth and death of a generator
Birth
Death
Handy descriptor
of complex geometry
PD plots the birth and death of each
generator of PH in a 2D graph ( ≥ ).
7
Topological statistics: systematic data analysis in TDA
• Conventional analysis in TDA: analysis WITH PH/PD
Data  Computation of PH  Display(PD) Analysis by human
• Our statistical approach to TDA: analysis OF random PH/PD
Many data sets
Many persistent diagrams
Computation of PD’s
PD1
How?
PD2
Analysis of
PD’s
PD3
Software
available
CGAL / PHAT
Features /
Descriptors
PDn
Topological
Statistics
8
“I predict a new subject of statistical topology. Rather than count the number
of holes, Betti numbers, etc., one will be more interested in the distribution of
such objects on noncompact manifolds as one goes out to infinity.”
— Isadore Singer, 2004.
9
Kernel representation of PD
• Vectorization of PD by positive definite kernel
≔∑
• PD = Discrete measure
∈
• Embedding of PD’s into RKHS
ℇ :
↦∫
⋅,
=∑
(⋅,
) ∈
• For some kernels (e.g., Gaussian, Laplace), ℇ is injective.
, Vectorization
: positive definite kernel
: corresponding RKHS
• By vectorization,
• a number of methods for data analysis can be applied,
SVM, regression, PCA, CCA, etc.
• tractable computation is possible with kernel methods.
10
Persistence Weighted Gaussian (PWG) Kernel
Generators close to the diagonal may be noise, and should be discounted.
,
=
=
,
Pers
exp −
≔ arctan
≔
Pers
− for
( ,
∈{ ,
∈
> 0)
| ≥ }
Pers(x1)
11
• Stability with PWG kernel embedding
• PWGK defines a distance measure on the persistence diagrams,
,
≔ ℇ
−ℇ
Statiblity Theorem (Kusano, Hiraoka, Fukumizu 2015+)
: compact subset in . ⊂ , ⊂ : finite sets.
If > + 1, then with PWG kernel ( , , ),
( ),
( ) ≤
,
.
: constant depending only on , , , ,
( ): dimensional persistence diagram of
A small change of a set causes
only a small change in PD
: Haussdorff distance
This stability is not known for Gaussian kernel.
12
2nd‑level kernel
Data analysis method
PD1
PD2
Embedding
ℇ
ℇ
PD3
Application of pos. def.
Kernel on RKHS
…
ℇ
PDm
PD
Data sets
Vectors in RKHS
2nd-level kernel (SVM for measures, Muandet, Fukumizu, Dinuzzo, Schölkopf 2012)
,
• RKHS-Gaussian kernel
= exp −
derives
ℇ (
,
= exp −
) ℇ (
)
,
: Persistence diagrams
13
Computational issue
The number of generators in a PD may be large (≥ 103, 104 )
ℇ (
For
=∑
computation
ℇ (
=∑
)−ℇ (
∑
()
∪ Δ,
,
) ℇ (
)
= exp −
requires
)
,
The number of exp −
≈ 10
+∑
∑
=
− 2∑
,
(
)
∑
,
.
 computationally expensive for
= max{ | = 1, … , }
14
• Approximation by random features (Rahimi & Recht 2008)
Gaussian distribution =:
By Bochner’s theorem
exp −
=
∫
(Fourier transform)
,…,
Approximation by sampling:
exp −
∑
∑
∑ℓ
≈
,
≈ ∑
ℓ
∑
)
ℓ
( )
()
∑ℓ
= ∑ℓ
Computational cost (
: . . .~
ℓ
ℓ
∑
()
ℓ
dim.
 2nd level Gram matrix (
Big gain if , ≪
( )
∑
ℓ
+
). c.f. (
)
15
Comparison: Persistence Scale Space Kernel
(Reininghaus et al 2015)
• PSS Kernel
,
1
=
8
−
exp
8
= ( , ) for
− exp
= ( , ).
ℇ ( ) is considered.
−
8
Pos. def. on
0 on Δ.
,
≥
• Comparison between PWGK and PSSK
• PWGK can control the discount around the diagonal independently of the
bandwidth parameter.
• PSSK is not shift-invariant  Random feature approximation is not applicable.
• In Reininghaus et al 2015, 2nd level kernel is not considered.
16
Synthetic example: SVM classification
• Classification of PD’s by SVM
• Synthetic data
• One big circle (random location and sample size) 1
with or without small circle 0
• = XOR((birth( 1)<1 && death( 1)>4)), 0 exists)
• In fact, data include noise.
S1
PD1
• 100 for training
=1
and 100 for testing
• Result (correct classification)
• PWGK (proposed): 83.8%
• PSSK (comparison): 46.5%
S1
S0
Data points 1
S1
No S0
S0
noise
Data points 2
17
Application: Transition of Silica (SiO2 )
If cooled down rapidly from the liquid state, SiO2 changes into the glass state (not
Amorphous: “soft” structure
to crystal).
Goal: identify the temperature of phase transition.
Data: Molecular Dynamics simulation for SiO2. 3D arrangements of the atoms are
(Nakamura et al 2015; Hiraoka et al 2015)
used for computing PD at 80 temperatures.
Examples of PD’s
Liquid
Glass (Amorphous)
18
Change point detection
• Data along a parameter
, = 1, … , .
Kernel Change Point Analysis with Fisher Discriminant score (Harchoui et al 2009):
For each , two classes are defined by the data before and after .
Fisher score on RKHS is used.
∑
• For each , compute : = ∑ Φ( ) and
Φ( ).
: =
• Compute Δ ≔
:
+
:
+
(
:
−
:
)
.
• Find max Δ .
• For the packing problem,
=ℇ
( = 1, … , 80).
19
• Detection of liquid-glass state transition
• Approach in physics:Estimation using derivatives of enthalpy, but not so
accurate. Estimate = [2000K, 3500K]
• Our approach: Change point detection by Kernel FDR.
• Number of generators in a PD is 30000 at most  difficult to use PSSK directly
• Only PWGK (proposed) is applied with random features.
KFDR
Gram matrix
Detected change point = 3100K
20
• Kernel PCA with PWGK
Sharp change between the two phases.
(Colored by the result of change point detection.
Colors are not used for KPCA).
PD + PWGK successfully capture “soft” and
complex structure of the glass state.
21
Conclusion
• Goal: establishing topological statistics
• Persistence homology can be used for defining useful descriptors to
complex objects of geometrical structures.
• Systematic statistical methods for TDA:
Kernel methods introduce systematic data analysis to TDA.
• Vectorization of persistence diagrams by kernel embedding.
• Persistence weighted Gaussian kernel  flexible kernel for noise.
• A toolbox of kernel methods become available:
SVM, kernel ridge regression, kernel PCA, kernel CCA, etc ….
22
Thank you
23
Download