DRP Türkiye 2023
a report for the online reading period
Topological Data Analysis
Mentee Name: Emir Gül
Mentor Name: Ali Peker
A Report submitted for the DRP Türkiye
Topological Data Analysis
August 29, 2023
Contents
1 Abstract
2 Introduction (Information in TDA Pipeline)
2.1 The TDA Pipeline
2.2 Motivation
2.3 From Time Series to Point Clouds
2.4 From Point Clouds to Simplicial Complex
2.5 Persistence Diagram
3 Example Application in Time Series
4 Conclusion
1 Abstract
In today’s data-driven world, uncovering meaningful patterns in complex datasets poses
significant challenges. Topological Data Analysis (TDA) is an emerging approach that addresses these challenges by studying the shape of data. This report gives a gentle introduction
to TDA and applies it to the detection of sudden changes in time series data. The introduction
presents the fundamental concepts step by step, following a pipeline built around this
time series use case. An example implementation of the technique then demonstrates the results.
The report aims to provide a basic understanding of what TDA is, how it works, and why it is
important in today’s research.
2 Introduction (Information in TDA Pipeline)
2.1 The TDA Pipeline
My reading plan for this work follows this pipeline:
Figure 1: The TDA Pipeline
In this report, each stage of the pipeline is therefore explained step by step. Before that, the
motivation behind this area should be pointed out.
2.2 Motivation
As technology advances, the amount of data grows, and the resulting datasets are often noisy and high-dimensional. Topological Data Analysis can
help here because data has "shape". This shape matters when analyzing the data since it is robust to noise and perturbations. TDA algorithms
are also applicable to very high-dimensional datasets. [6]
Formally, suppose we are given a set of data points S embedded in some d-dimensional
space Y. We assume that this data is sampled from some unknown k-dimensional subspace X ⊆ Y, where k ≤ d. Both the geometry and the topology of X are lost during sampling.
The goal of the analysis is to recover information about X from the given dataset S. [8]
-Properties of the embedding space Y are extrinsic, while properties of the unknown
space X are intrinsic.
Figure 2: Guideline of TDA in the analysis part.
Definition: Topological Data Analysis Given a finite set (data) S ⊆ Y of noisy points
sampled from an unknown space X, topological data analysis recovers the topology of X,
assuming both X and Y are topological spaces. [8]
2.3 From Time Series to Point Clouds
In applications of Topological Data Analysis, the starting point is generating simplicial
complexes from point clouds. A point cloud is simply a set of points in a Euclidean
space of arbitrary dimension. When a time series is given, a point cloud can be
obtained by using Takens’ Embedding Theorem.
Takens’ theorem, introduced by Floris Takens in 1981, gives the
conditions under which a continuous attractor can be reconstructed from data
collected using a generic function.
Theorem: Takens’ Embedding Let $M$ be a compact manifold of dimension $m$. For
pairs $(\phi, y)$, with $\phi: M \to M$ a smooth diffeomorphism and $y: M \to \mathbb{R}$ a smooth function, it
is a generic property that the map $\Phi_{(\phi,y)}: M \to \mathbb{R}^{2m+1}$, defined by
$$\Phi_{(\phi,y)}(x) = \left(y(x),\, y(\phi(x)),\, \dots,\, y(\phi^{2m}(x))\right)$$
is an embedding; by "smooth", we mean at least $C^2$. [1]
Example: Lorenz Attractor
In this example, a time series is given, the embedding dimension is chosen as 3, and
ϕ(t) = t − τ. The shadow attractor, which is diffeomorphic to the Lorenz attractor, is then obtained as
follows:
Figure 3: Shadow attractor obtained by Takens’ embedding. [7]
Thus, a point cloud (the shadow attractor) is obtained. Point clouds are created in a similar way
in this work. For example, with time delay parameter $\tau$ and dimension $d$, we get:
$$Y_{t_i} = \left(y_{t_i},\, y_{t_i+\tau},\, \dots,\, y_{t_i+(d-1)\tau}\right)$$
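The delay-vector construction can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name `delay_embedding`, the integer-valued delay measured in samples, and the toy series are all hypothetical and not taken from the report.

```python
def delay_embedding(series, dimension, delay):
    """Return the delay vectors (y[t], y[t+delay], ..., y[t+(dimension-1)*delay])."""
    if dimension < 1 or delay < 1:
        raise ValueError("dimension and delay must be positive integers")
    span = (dimension - 1) * delay      # index offset of the last coordinate
    n_points = len(series) - span       # number of complete delay vectors
    return [
        tuple(series[t + k * delay] for k in range(dimension))
        for t in range(max(n_points, 0))
    ]


# A short made-up series, only to show the shape of the resulting point cloud.
series = [0, 1, 2, 3, 4, 5, 6]
cloud = delay_embedding(series, dimension=3, delay=2)
print(cloud)
```

Each tuple is one point of the cloud in d-dimensional space; a series of length N yields N − (d − 1)τ points.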
2.4 From Point Clouds to Simplicial Complex
After obtaining point clouds, the next step is to build simplicial complexes. To do that, we
need some definitions.
To clarify the idea behind simplicial complexes, one can think of polygons:
Figure 4: Polygons [5]
As can be seen here, any polygon can be built from triangles, triangles can be
built from line segments, and line segments can be built from points. The key building blocks
of a simplicial complex are therefore triangles, line segments and points.
Definition: Simplex A k-simplex, denoted $\sigma = [v_0, v_1, \dots, v_k]$, is the smallest convex
set in a given Euclidean space $\mathbb{R}^d$ that contains the $k+1$ vertices $v_i$, $0 \le i \le k$, where
the vertices are affinely independent (that is, the vectors $v_1 - v_0, \dots, v_k - v_0$ are linearly independent). [4]
Figure 5: Examples of simplices
-The (k−1)-simplex obtained by removing a vertex from a k-simplex is called a face of the simplex. Removing $v_i$ from a k-simplex is denoted $[v_0, v_1, \dots, \hat{v}_i, \dots, v_k]$.
Definition: Abstract Simplicial Complex An abstract simplicial complex K is a set of simplices
such that any face of a simplex σ in K is also in K. In other words, there are no
missing "building blocks" in K. [4]
Definition: The Geometric Realisation of K This is an embedding of K in some $\mathbb{R}^n$,
where it is also required that the intersection of any two simplices σ, σ′ ∈ K is
either empty or a shared face of both σ and σ′. [4]
-In general, the geometric realisation of an abstract simplicial complex will be used,
and this will be referred to as a simplicial complex. With these definitions, a shape is
given to the data.
In order to detect holes, homology groups must be defined. To define them, we first need to
answer the question "How can linear algebra be performed on the simplices of K?"
In computation, the field $\mathbb{Z}_2 = \{0, 1\}$ is generally used. The set of p-simplices of K
is taken as a basis, and the resulting vector space is denoted
$C_p(K)$. Elements of $C_p(K)$ are called p-chains.
Example:
Figure 6: A simplicial complex with labelled vertices [4]
This is the simplicial complex $K = \{[v_0], [v_1], [v_2], [v_0, v_1], [v_1, v_2]\}$. In this case,
-$C_0(K) = \{0, [v_0], [v_1], [v_2], [v_0]+[v_1], [v_1]+[v_2], [v_0]+[v_2], [v_0]+[v_1]+[v_2]\}$
-$C_1(K) = \{0, [v_0, v_1], [v_1, v_2], [v_0, v_1]+[v_1, v_2]\}$
-As there are no higher-order simplices in K, $C_p(K) = 0$ for $p > 1$.
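As a small check of this example, the chains can be enumerated directly. A vector space over Z2 with n basis elements has 2^n elements, so C0(K) has 8 chains and C1(K) has 4. The sketch below represents a chain as a set of simplices, which is one possible encoding and an assumption of this illustration; the helper names are hypothetical.

```python
from itertools import combinations


def all_chains(simplices):
    """Enumerate every Z_2-chain on the given simplices as a frozenset."""
    chains = []
    for size in range(len(simplices) + 1):
        for subset in combinations(simplices, size):
            chains.append(frozenset(subset))
    return chains


def add_chains(a, b):
    """Addition over Z_2: a simplex that appears in both chains cancels."""
    return a ^ b


vertices = [("v0",), ("v1",), ("v2",)]
edges = [("v0", "v1"), ("v1", "v2")]

c0 = all_chains(vertices)
c1 = all_chains(edges)
print(len(c0), len(c1))

# ([v0] + [v1]) + ([v1] + [v2]) = [v0] + [v2], because [v1] appears twice.
total = add_chains(frozenset(vertices[:2]), frozenset(vertices[1:]))
print(sorted(total))
```

The empty set plays the role of the zero chain, and the symmetric difference implements addition modulo 2.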
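To define homology groups, the boundary map must be defined first.
Definition: Boundary Map The boundary map $\partial_p : C_p(K) \to C_{p-1}(K)$ acts on a p-simplex $\sigma = [v_0, v_1, \dots, v_p]$ by
$$\partial_p \sigma = \sum_{i=0}^{p} [v_0, v_1, \dots, \hat{v}_i, \dots, v_p]$$
and is extended linearly to all p-chains.
Example: $\partial_1([v_0, v_1]) = [v_0] + [v_1]$ [4]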
Figure 7: A cartoon illustrating the operation of the boundary map on a 1-simplex. A
1-simplex is mapped to its two endpoint vertices. [4]
-Note that applying the boundary map twice in succession gives zero:
$$\partial_{p-1} \circ \partial_p = 0$$
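The boundary map and the identity above can be checked on small cases. The following is a minimal sketch over Z2, assuming chains are stored as sets of simplices and each simplex is a sorted tuple of vertex labels; both choices are illustrative and not prescribed by the report.

```python
def boundary(chain):
    """Boundary of a Z_2-chain given as a set of simplices (sorted vertex tuples)."""
    result = set()
    for simplex in chain:
        if len(simplex) == 1:
            continue                              # the boundary of a vertex is zero
        for i in range(len(simplex)):
            face = simplex[:i] + simplex[i + 1:]  # drop the i-th vertex
            result ^= {face}                      # add the face modulo 2
    return result


edge = {(0, 1)}
triangle = {(0, 1, 2)}

print(boundary(edge))                # the two endpoint vertices
print(boundary(triangle))            # the three edges of the triangle
print(boundary(boundary(triangle)))  # every vertex appears twice, so nothing is left
```

In the last line each vertex is a face of exactly two edges of the triangle, so all terms cancel modulo 2.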
Homology Groups
These groups are used to detect holes in data.
Definition: p-cycles $Z_p = \ker \partial_p = \{c \in C_p(K) \mid \partial_p c = 0\}$
Definition: p-boundaries $B_p = \operatorname{Im}(\partial_{p+1}) = \{\partial_{p+1} c \mid c \in C_{p+1}(K)\}$
-Both are subspaces of the space of p-chains.
-From the relation $\partial_{p-1} \circ \partial_p = 0$ (applied with $p+1$ in place of $p$), it follows that every element of $B_p$ is an element of $Z_p$.
Definition: pth Homology Group of K [4]
$$H_p(K) = Z_p(K) / B_p(K)$$
Betti Numbers
Definition: pth Betti Number
$$\beta_p(K) = \dim_F H_p(K)$$
where F is the coefficient field (here $\mathbb{Z}_2$). This number carries topological information: it counts the p-dimensional holes.
β0(K): Number of connected components.
β1(K): Number of loops.
β2(K): Number of voids (holes bounded by surfaces).
Figure 8: Some topological spaces and their associated Betti numbers [4]
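Since dim Z_p = dim C_p − rank ∂_p and dim B_p = rank ∂_{p+1}, the Betti numbers follow from matrix ranks: β_p = dim C_p − rank ∂_p − rank ∂_{p+1}. The sketch below computes these ranks over Z2 with a simple elimination on bitmask rows. It is a hedged illustration, not an optimised algorithm, and the function names and the tuple encoding of simplices are assumptions.

```python
def rank_gf2(rows):
    """Rank over Z_2 of a matrix whose rows are given as integer bitmasks."""
    pivots = {}                          # leading-bit position -> reduced row
    for row in rows:
        while row:
            lead = row.bit_length() - 1
            if lead not in pivots:
                pivots[lead] = row
                break
            row ^= pivots[lead]          # cancel the leading bit
    return len(pivots)


def betti_numbers(simplices):
    """Betti numbers of a complex given as a list of sorted vertex tuples."""
    by_dim = {}
    for s in simplices:
        by_dim.setdefault(len(s) - 1, []).append(s)
    top = max(by_dim)
    ranks = {}                           # ranks[p] = rank of the boundary map d_p
    for p in range(1, top + 1):
        index = {s: i for i, s in enumerate(by_dim.get(p - 1, []))}
        rows = []
        for s in by_dim.get(p, []):
            mask = 0
            for i in range(len(s)):
                mask |= 1 << index[s[:i] + s[i + 1:]]   # mark each face of s
            rows.append(mask)
        ranks[p] = rank_gf2(rows)
    return [
        len(by_dim.get(p, [])) - ranks.get(p, 0) - ranks.get(p + 1, 0)
        for p in range(top + 1)
    ]


path = [(0,), (1,), (2,), (0, 1), (1, 2)]                  # two edges in a row
hollow_triangle = [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2)]
filled_triangle = hollow_triangle + [(0, 1, 2)]

print(betti_numbers(path))
print(betti_numbers(hollow_triangle))
print(betti_numbers(filled_triangle))
```

The hollow triangle has one connected component and one loop; filling it in with a 2-simplex removes the loop.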
Vietoris-Rips complexes are used to construct simplicial complexes from point clouds,
so the Vietoris-Rips complex needs to be explained.
Definition: Vietoris-Rips Complex This complex was originally developed as a means
of calculating the homology of metric spaces. To construct the Rips complex on a finite
set of points S, the following procedure is used:
-Define a parameter r.
-For every subset s ⊆ S:
-If diam(s) ≤ 2r, include the
simplex with vertices in s. [4]
-Geometrically, this is equivalent to placing balls of radius r around the points in s and
including the simplex if every pair of balls intersects.
Figure 9: Example of a Rips complex [4]
-This complex depends on the value of r. For example, when r = 0, every point is isolated,
and β0 is equal to the number of points in the set.
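The construction procedure can be sketched as a brute-force enumeration. This is only meant for very small point sets: the cap `max_dim` on simplex dimension, the function name, and the unit-square example are assumptions made for the illustration.

```python
from itertools import combinations
from math import dist


def vietoris_rips(points, r, max_dim=2):
    """Simplices (as index tuples) whose vertices are pairwise within 2*r."""
    n = len(points)
    close = {
        (i, j)
        for i, j in combinations(range(n), 2)
        if dist(points[i], points[j]) <= 2 * r
    }
    simplices = [(i,) for i in range(n)]
    for k in range(2, max_dim + 2):          # k vertices form a (k-1)-simplex
        for subset in combinations(range(n), k):
            # diam(subset) <= 2r exactly when every pair is within 2r
            if all(pair in close for pair in combinations(subset, 2)):
                simplices.append(subset)
    return simplices


# Corners of a unit square (made-up data).
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
for r in (0.0, 0.5, 0.75):
    print(r, len(vietoris_rips(square, r)))
```

At r = 0 only the four vertices are present; at r = 0.5 the four sides appear and form a loop; at r = 0.75 the diagonals and all four triangles are included, which fills the loop.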
Persistent Homology
When observing the holes in the data, what matters is how long the holes live. The main focus is
therefore not any single value of r but how the holes change as r varies. This is the subject of
persistent homology. The first step in persistent homology is to define a nested sequence
of simplicial complexes:
$$K_0 \hookrightarrow K_1 \hookrightarrow \dots \hookrightarrow K_n$$
Here, $K_0 \subseteq K_1 \subseteq \dots \subseteq K_n$, and $\hookrightarrow$ denotes the inclusion map. The sequence
of Vietoris-Rips complexes with increasing r fits this setting.
Let $0 \le i \le j \le n$. The inclusion maps lead to induced maps in homology:
$$f_p^{i,j} : H_p(K_i) \to H_p(K_j)$$
-Each degree of homology is studied independently.
The structures of persistent homology are "classes". For example, a class in 1-dimensional
homology is represented by a collection of 1-simplices (edges) in which an even number
of edges touches each vertex. The classes within homology groups are characterised in terms of
the maps $f_p^{i,j}$:
-Classes α that are born at i. These are classes with $\alpha \neq 0$ and $\alpha \notin \operatorname{im}(f^{i-1,i})$.
-Classes β that persist from i → j. These are classes where $f^{i,j}(\beta) \notin \operatorname{im}(f^{i-1,j})$.
*This implies that β also persists from i → i + ε if i + ε < j.
-Classes γ that die at j. These are the classes where $\gamma \in \ker f^{j-1,j}$ or $f^{j-1,j}(\gamma) = f^{j-1,j}(\gamma')$ for another class $\gamma'$. [4]
-There is no guarantee that every class will die. Two pieces of notation are used for birth and
death times:
δb: Birth time
δd: Death time
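For degree 0 (connected components) of a Vietoris-Rips filtration, birth and death times can be computed with a union-find structure: every point is born at r = 0, and a component dies at the first r at which an edge joins it to another component. The sketch below is a simplified illustration limited to H0, with hypothetical names and data; it uses the convention above that an edge of length ℓ enters the complex when 2r ≥ ℓ.

```python
from itertools import combinations
from math import dist, inf


def h0_persistence(points):
    """(birth, death) pairs of connected components in the Rips filtration."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]     # path halving
            i = parent[i]
        return i

    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    pairs = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:                  # this edge merges two components
            parent[root_j] = root_i
            pairs.append((0.0, length / 2))   # the edge appears once 2r >= length
    pairs.append((0.0, inf))                  # one component never dies
    return pairs


# Two nearby points and one far away (made-up data).
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
print(h0_persistence(points))
```

The pair that dies at r = 0.5 is short-lived, while the pair that dies at r = 2.0 reflects the gap between the two clusters; the last class never dies, in line with the remark that not every class has a death time.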
Note: Features that are born and then die shortly afterward are typically categorized as topological noise, whereas classes that persist for an extended
period are regarded as genuine features of the underlying structure. Nonetheless, it is
crucial to emphasize that this distinction should be used solely as guidance. [4]
So far, all the definitions required to determine the holes in a dataset have been given.
Once these holes are determined, their birth and death times must be represented in order to analyze them and obtain a topological indicator. Several representations can
be used. In this work, persistence diagrams are used.
2.5 Persistence Diagram
The persistence diagram is a plot of the points (δb, δd), one for each class.
Figure 10: Persistence Diagram [4]
Metrics on Persistence Diagrams
A persistence diagram is a multiset of points in $\mathbb{R}^2$: each point carries a
multiplicity, so the diagram can record distinct features with identical birth and
death times. This makes it possible to define metrics on the space of persistence
diagrams:
Wasserstein Metric
In general, this metric measures the distance between two probability distributions, and it provides a meaningful and smooth
representation of that distance. For persistence diagrams, it is defined as:
$$d_{W_p}(PD_1, PD_2) = \inf_{\varphi: PD_1 \to PD_2} \left[\sum_{x \in PD_1} \|x - \varphi(x)\|^{p}\right]^{1/p}$$
Here φ is a bijection, so the p-Wasserstein metric can be viewed as finding
the optimal matching between the diagrams. In general, however, two persistence diagrams
do not contain the same number of points, which prevents φ from being a bijection. To resolve this, the diagonal is used. The diagonal can be regarded as a multiset
of points that are born and die at the same time. In particular, there can be infinitely
many points with this property. Unmatched points can then be matched to the diagonal, which allows bijections between
persistence diagrams to be defined. [4]
In practice, this increases the number of points that need to be handled and the
number of bijections that need to be considered. This definition can also handle points
that live forever; it leads to an infinite distance if the two persistence diagrams have
different persistent Betti numbers.
Bottleneck Metric
Another metric is the bottleneck metric. It can be considered as the limit of the p-Wasserstein metric as p → ∞, and it can be written as
$$d_B(PD_1, PD_2) = \inf_{\varphi: PD_1 \to PD_2} \left[\sup_{x \in PD_1} \|x - \varphi(x)\|_\infty\right]$$ [4]
The bottleneck metric is computationally cheaper than the p-Wasserstein metric: it depends only
on the largest distance between matched points, whereas the Wasserstein distance sums over all of the points,
including the noisy points near the diagonal. The bottleneck metric is therefore better suited to a simple
test of the proximity of diagrams. On the other hand, the Wasserstein metric is useful when the
noisy classes near the diagonal hold useful information about the data. [2]
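Both metrics can be illustrated on tiny diagrams by trying every matching. The sketch below makes several assumptions that are not fixed by the text: the sup norm is used as the ground distance in both metrics, each diagram is padded with as many diagonal slots as the other diagram has points, only finite points are handled, and the search is brute force (factorial cost), so it is suitable only for a handful of points. All names are hypothetical.

```python
from itertools import permutations


def _cost(x, y):
    """Sup-norm cost of matching x to y; None stands for the diagonal."""
    if x is None and y is None:
        return 0.0
    if x is None:
        return (y[1] - y[0]) / 2          # distance from y to the diagonal
    if y is None:
        return (x[1] - x[0]) / 2          # distance from x to the diagonal
    return max(abs(x[0] - y[0]), abs(x[1] - y[1]))


def _matchings(pd1, pd2):
    """Yield the list of costs for every bijection between the padded diagrams."""
    left = list(pd1) + [None] * len(pd2)  # pad with diagonal slots
    right = list(pd2) + [None] * len(pd1)
    for perm in permutations(right):
        yield [_cost(x, y) for x, y in zip(left, perm)]


def wasserstein(pd1, pd2, p=2):
    """p-Wasserstein distance by exhaustive search over matchings."""
    best = min(sum(c ** p for c in costs) for costs in _matchings(pd1, pd2))
    return best ** (1 / p)


def bottleneck(pd1, pd2):
    """Bottleneck distance by exhaustive search over matchings."""
    return min(max(costs, default=0.0) for costs in _matchings(pd1, pd2))


# Made-up diagrams: pd1 has one long-lived and one short-lived class.
pd1 = [(0.0, 4.0), (1.0, 1.5)]
pd2 = [(0.0, 3.0)]
print(bottleneck(pd1, pd2))
print(wasserstein(pd1, pd2, p=1))
```

In the optimal matching the long-lived points are paired with each other and the short-lived point is sent to the diagonal. The bottleneck distance only sees the larger of the two costs, while the 1-Wasserstein distance adds the small diagonal cost as well, which is the difference described above.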
3 Example Application in Time Series
Consider a time series containing stock market crashes, and apply all of the above to it.
First, let us look at detecting stock market crashes with the first derivative: [3]
Figure 11: Detection of stock market crashes from baseline (left) and topological
(right) models, discussed in detail below. [3]
*A representation should satisfy criteria such as allowing averages and allowing distances to be computed. Persistence diagrams do not satisfy all of these criteria
(in particular, they have no unique mean), so the persistence diagrams are transformed into persistence landscapes, which satisfy all of the required criteria.
Applying all of the steps in the pipeline, with persistence landscapes instead of diagrams in the representation step, gives this:
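The report does not define persistence landscapes, so the following sketch assumes the standard definition: for each point (b, d) of the diagram, form the "tent" function max(0, min(t − b, d − t)), and let the k-th landscape at t be the k-th largest tent value. Because landscapes are ordinary functions, their pointwise average is well defined. The function names, the grid, and the diagrams are hypothetical.

```python
def landscape(diagram, k, grid):
    """Values of the k-th landscape function (k = 1, 2, ...) on a grid of t values."""
    values = []
    for t in grid:
        tents = sorted(
            (max(0.0, min(t - b, d - t)) for b, d in diagram),
            reverse=True,
        )
        values.append(tents[k - 1] if k <= len(tents) else 0.0)
    return values


def mean_landscape(diagrams, k, grid):
    """Pointwise average of the k-th landscapes of several diagrams."""
    rows = [landscape(dgm, k, grid) for dgm in diagrams]
    return [sum(column) / len(rows) for column in zip(*rows)]


grid = [0.0, 1.0, 2.0, 3.0, 4.0]
dgm_a = [(0.0, 4.0), (1.0, 3.0)]   # made-up diagrams
dgm_b = [(0.0, 2.0)]

print(landscape(dgm_a, 1, grid))
print(landscape(dgm_a, 2, grid))
print(mean_landscape([dgm_a, dgm_b], 1, grid))
```

The averaged landscape is again a function on the same grid, which is what makes landscapes convenient when results from several windows of a time series need to be combined or compared.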
Figure 12: A cartoon illustrating the operation of the boundary map on a 1-simplex. A
1-simplex is mapped to its two endpoint vertices. [3]
The difference between the two results is clear:
Figure 13: Detection of stock market crashes from baseline (left) and topological (right)
models, discussed in detail below. [3]
4 Conclusion
In summary, this report has presented the basics of Topological Data Analysis. An example
application has also shown the difference between the traditional approach and Topological
Data Analysis.
References
[1] Thomas Lagrange. Taken’s embedding theorem for non-mathematicians, 2021.
[2] Elizabeth Munch. A user’s guide to topological data analysis. Journal of Learning
Analytics, 4(2):47–61, 2017.
[3] Wallyson De Oliveira. Detecting stock market crashes with topological data analysis, 2019.
[4] Lee Steinberg. Topological Data Analysis and its Application to Chemical Systems.
PhD thesis, University of Southampton, 2019.
[5] Shawhin Talebi. Persistent homology | introduction python example code, 2022.
YouTube video.
[6] Shawhin Talebi. Topological data analysis (tda) | an introduction, 2022. YouTube
video.
[7] Francis Villatoro. Takens’ theorem in action for the lorenz chaotic attractor, 2013.
YouTube video.
[8] Afra Zomorodian. Topological data analysis. Advances in applied and computational topology, 70:1–39, 2012.