Short review of Diffusion Maps

Technion - Israel Institute of Technology
Faculty of Electrical Engineering
Signal and Image Processing Lab
Authors:
Rami Cohen
Oren Ierushalmi
Supervisors:
Michal Genussov
Dr. Yizhar Lavner
Figure 7: Example of LDA classification - a fish is classified according to its
length and width. The decision boundary is the purple curve.
3.4 Diffusion Maps
Diffusion Maps [11] is a manifold learning method. In manifold learning it is assumed that the high-dimensional data to be classified lies on a lower-dimensional manifold, and the goal is to capture the underlying geometric structure. A simple example is given in Figure 8.
Figure 8: Manifold learning
The Diffusion Maps framework is based on diffusion processes for finding meaningful geometric descriptions of data sets. In this method, a graph is built from the samples on the manifold, and the diffusion distance describes the connectivity on the graph between every two points. This distance is characterized by the probability of transition between them (see Figure 9).
Figure 9: The diffusion distance BC is shorter than AB
Diffusion maps capture the ‘intrinsic’ natural parameters that generate the data, which usually lie in a lower dimension. For example, the left part of Figure 10 (from [11]) shows a pattern in 3D, while the right part shows the natural parameters of this data, which lie in 2D.
Figure 10: Recovering the natural parameter that generated the data, the
two angles of variation
The assumption is that the data lie on a non-linear manifold. The data is transformed using a kernel k : X × X → R, which should be symmetric and positivity-preserving, i.e. k(x, y) = k(y, x) and k(x, y) ≥ 0. It should also convey the local geometry of the data, and therefore it should be selected accordingly. One can think of the data points as being the nodes of a symmetric graph whose edge weights are specified by k. This kernel is used for the construction of the Markov Random Walk (MRW) matrix.
It can be proved that the diffusion distances in the original space are mapped into Euclidean distances in the new diffusion space.
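For completeness, this correspondence can be stated explicitly; the identity below is the standard one from [11] and is added here only as a reminder (p_t(x, ·) denotes the row of P^t corresponding to x, φ0 is the stationary distribution of the random walk, and λj, ψj are the eigenvalues and right eigenvectors of P, under the normalization used in [11]):

    % Diffusion distance at time t, and its expression in the diffusion coordinates:
    D_t^2(x, y) = \sum_{z} \frac{\bigl(p_t(x, z) - p_t(y, z)\bigr)^2}{\phi_0(z)}
                = \sum_{j \ge 1} \lambda_j^{2t} \bigl(\psi_j(x) - \psi_j(y)\bigr)^2 ,

so the ordinary Euclidean distance between the new (diffusion-space) coordinates of two points equals the diffusion distance between them, up to the truncation performed in the dimensionality-reduction step described later.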
The Gaussian kernel is widely used for its simplicity and the ability to
control its scale easily by σ:
k(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right),    (3)
where xi and xj are feature vectors in our data set. This kernel represents some notion of affinity or similarity between points of the data.
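As an illustration, the following is a minimal sketch of computing this kernel matrix in Python with NumPy (the function name gaussian_kernel and the assumption that X is a matrix whose rows are the feature vectors are ours, not part of the report):

    import numpy as np

    def gaussian_kernel(X, sigma):
        """Gaussian affinity matrix of (3): K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
        # Squared Euclidean distances between all pairs of rows of X.
        sq_norms = np.sum(X ** 2, axis=1)
        sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
        sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative values from round-off
        return np.exp(-sq_dists / (2.0 * sigma ** 2))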
As an example, the MRW matrix of our data is depicted in Figure 11. White pixels denote high probability, and black pixels denote low probability. It can easily be seen in this figure that there are high transition probabilities from /f/ to /th/ and vice versa. This conforms with our previous results, in which we noted that /f/ and /th/ have similar features.
Figure 11: MRW matrix for /s/, /sh/, /f/ and /th/
A description of the algorithm which implements the diffusion maps
method is given below.
1. Distance matrix. The first stage involves building a distance matrix according to a given kernel. As mentioned before, we used the Gaussian kernel defined in (3). Let us denote this matrix by K. Its elements are:
K_{ij} = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)    (4)
2. Normalization. To form the Markov Random Walk matrix P, we normalize the distance matrix by its row sums. This can be done in the following way: let d denote the vector whose elements are the sums of the rows of K. Creating a diagonal matrix D from this vector, we obtain P as:

P = D^{-1} K    (5)
3. Spectral decomposition. Compute the eigenvalues and eigenvectors of the matrix P. Since P is a transition matrix derived from a positive-definite kernel, it has a sequence of positive eigenvalues:

1 = λ0 > λ1 ≥ λ2 ≥ ...    (6)

The eigenvalues should be sorted in descending order. When dealing with a large data set, the Nyström extension [12] provides a way of estimating the eigenvalues and eigenvectors using only a small subset of the data.
4. Dimensionality reduction. This is achieved by omitting the smallest eigenvalues. Each element of an eigenvector is weighted by the corresponding eigenvalue. Let us denote the eigenvalues by λi and their corresponding eigenvectors by ψi. By keeping only the largest l eigenvalues, a new feature vector x̃i is obtained in R^l (where l can be less than 16, the original dimension of the feature space):

\tilde{x}_i = \begin{pmatrix} \lambda_1 \psi_{i1} \\ \lambda_2 \psi_{i2} \\ \vdots \\ \lambda_l \psi_{il} \end{pmatrix}    (7)
P can be raised to some desired power t (which is equivalent to running the Markov chain forward to time t) for a faster decay of the eigenvalues. An example of the different decay rates obtained for different values of t is shown in Figure 12.
From now on, the new feature vectors x̃i can be used, and any classification method, such as KNN or LDA, can be employed on the new representation of the data. A short code sketch of the whole procedure is given below.
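The following is a minimal Python (NumPy) sketch of steps 1-4; the function name diffusion_map, the parameters sigma, t and n_components, and the use of the symmetric matrix D^{-1/2} K D^{-1/2} for a numerically stable eigendecomposition are our own choices rather than part of the report:

    import numpy as np

    def diffusion_map(X, sigma=1.0, t=1, n_components=2):
        """Diffusion maps embedding of the rows of X into R^{n_components} (steps 1-4)."""
        # Step 1: Gaussian kernel matrix K, equations (3)-(4).
        sq_norms = np.sum(X ** 2, axis=1)
        sq_dists = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T, 0.0)
        K = np.exp(-sq_dists / (2.0 * sigma ** 2))

        # Step 2: row normalization, P = D^{-1} K, equation (5). Only the row sums d are needed.
        d = K.sum(axis=1)

        # Step 3: spectral decomposition. Instead of the non-symmetric P we decompose the
        # similar symmetric matrix S = D^{-1/2} K D^{-1/2}, which has the same eigenvalues;
        # the right eigenvectors of P are then recovered as psi = D^{-1/2} u.
        d_inv_sqrt = 1.0 / np.sqrt(d)
        S = K * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
        eigvals, U = np.linalg.eigh(S)                  # returned in ascending order
        order = np.argsort(eigvals)[::-1]               # sort in descending order, equation (6)
        eigvals, U = eigvals[order], U[:, order]
        psi = U * d_inv_sqrt[:, None]

        # Step 4: keep the l largest non-trivial eigenvalues and weight each eigenvector
        # by lambda^t (equation (7), with P raised to the power t).
        lambdas = eigvals[1:n_components + 1] ** t
        return psi[:, 1:n_components + 1] * lambdas[None, :]

For example, diffusion_map(X, sigma=2.0, t=2, n_components=3) would map each original 16-dimensional feature vector to a 3-dimensional one, on which KNN or LDA can then be trained; the values of sigma and n_components here are hypothetical and given only for illustration.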
Figure 12: Eigenvalue decay rate, from [11]
3.5 Results
When classifying frames, both LDA and KNN produced an average error of 23.5%. In the phoneme classification (‘majority vote’), an average error of 20% was achieved. The use of diffusion maps provided only a mild improvement.
The results can be conveniently described by a confusion matrix. Each row of the matrix represents the instances of a predicted class, while each column represents the instances of an actual class. A typical confusion matrix representing our results can be seen in Figure 13. It should be noted that most of the error is related to the confusion between /f/ and /th/.
The difficulty in discriminating between /f/ and /th/ due to their similar
features is well known in the literature [13].
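As a side note, the following minimal sketch shows how such a matrix can be built from lists of predicted and actual labels, using the row/column convention described above (the function name confusion_matrix and the variable names are ours; standard libraries such as scikit-learn provide an equivalent routine, though sometimes with the transposed convention):

    import numpy as np

    def confusion_matrix(predicted, actual, labels):
        # Rows correspond to the predicted class, columns to the actual class.
        index = {label: k for k, label in enumerate(labels)}
        cm = np.zeros((len(labels), len(labels)), dtype=int)
        for p, a in zip(predicted, actual):
            cm[index[p], index[a]] += 1
        return cm

    # Hypothetical usage with the four phoneme classes considered in this work:
    # cm = confusion_matrix(predicted_labels, actual_labels, ['/s/', '/sh/', '/f/', '/th/'])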
Figure 13: Confusion matrix