Figure 7: Example of LDA classification - a fish is classified according to its length and width. The decision boundary is the purple curve.

3.4 Diffusion Maps

Diffusion Maps [11] is a manifold learning method. In manifold learning it is assumed that the high-dimensional data to be classified lies on a lower-dimensional manifold, and the goal is to capture this underlying geometric structure. A simple example is given in Figure 8.

Figure 8: Manifold learning

The Diffusion Maps framework is based on diffusion processes for finding meaningful geometric descriptions of data sets. In this method, a graph is built from the samples on the manifold, and the diffusion distance describes the connectivity on the graph between every two points. This distance is characterized by the probability of transition between them (see Figure 9).

Figure 9: The diffusion distance BC is shorter than AB

Diffusion maps capture the ‘intrinsic’ natural parameters that generate the data, which usually lie in a lower dimension. For example, the left part of Figure 10 (from [11]) shows a pattern in 3D, while the right part shows the natural parameters of this data, which lie in two dimensions.

Figure 10: Recovering the natural parameters that generated the data: the two angles of variation

The assumption is that the data lie on a non-linear manifold. The data are transformed using a kernel k : X × X → R, which should be symmetric and positivity preserving, i.e. k(x, y) = k(y, x) and k(x, y) ≥ 0. It should also convey the local geometry of the data, and therefore it should be selected accordingly. One can think of the data points as the nodes of a symmetric graph whose edge weights are specified by k. This kernel is used for the construction of the Markov Random Walk (MRW) matrix. It can be proved that diffusion distances in the original space are mapped into Euclidean distances in the new diffusion space.

The Gaussian kernel is widely used for its simplicity and the ease with which its scale is controlled by σ:

k(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|^2}{2\sigma^2} \right),    (3)

where x_i and x_j are feature vectors in our data set. This kernel represents a notion of affinity or similarity between data points. As an example, the MRW matrix of our data is depicted in Figure 11. White pixels denote high probability, and black pixels denote low probability. It can easily be seen in this figure that there are high transition probabilities from /f/ to /th/ and vice versa, which conforms with our previous results, in which we noted that /f/ and /th/ have similar features.

Figure 11: MRW matrix for /s/, /sh/, /f/ and /th/

A description of the algorithm which implements the diffusion maps method is given below.

1. Distance matrix: The first stage is the construction of a distance matrix according to a given kernel. As mentioned before, we used the Gaussian kernel defined in (3). Denoting this matrix by K, its elements are

K_{ij} = \exp\left( -\frac{\|x_i - x_j\|^2}{2\sigma^2} \right).    (4)

2. Normalization: To form the Markov Random Walk matrix P, we normalize the distance matrix by the sum of its rows. Let d denote the vector whose elements are the row sums of K. Creating a diagonal matrix D from this vector, we obtain P as

P = D^{-1} K.    (5)

3. Spectral decomposition: Compute the eigenvalues and eigenvectors of the matrix P. As a transition matrix which is positive definite, P has a sequence of positive eigenvalues

1 = \lambda_0 > \lambda_1 \geq \lambda_2 \geq \ldots    (6)

The eigenvalues should be sorted in descending order. When dealing with a large data set, the Nyström extension [12] provides a way to estimate the eigenvalues and eigenvectors using only a small subset of the data.

4. Dimensionality reduction: This is achieved by omitting the smallest eigenvalues. Each element of the eigenvectors is normalized by the corresponding eigenvalue. Denote the eigenvalues by λ_i and their corresponding eigenvectors by ψ_i, and let ψ_{ik} be the i-th element of ψ_k. By selecting only the largest l eigenvalues, a new feature vector x̃_i is obtained in R^l (where l can be smaller than 16, the original dimension of the feature space):

\tilde{x}_i = \begin{pmatrix} \lambda_1 \psi_{i1} \\ \lambda_2 \psi_{i2} \\ \vdots \\ \lambda_l \psi_{il} \end{pmatrix}.    (7)

P can be raised to some desired power t (which is equivalent to running the Markov chain forward to time t) for a faster decay of the eigenvalues. An example of different decay rates for different values of t is shown in Figure 12.

From now on, the new feature vectors x̃_i can be used, and any classification method, such as KNN or LDA, can be employed on this new representation of the data (a short code sketch of steps 1-4 is given below).

Figure 12: Eigen-values decay rate, from [11]
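To make the four steps concrete, here is a minimal sketch in Python/NumPy (an illustration only, not the implementation used in this project). It assumes the feature vectors are stored row-wise in an array X, and the parameter names sigma, t and l stand for the σ, t and l used in the text.

import numpy as np

def diffusion_maps(X, sigma=1.0, t=1, l=3):
    # Step 1: Gaussian kernel (distance) matrix K, Eqs. (3)-(4)
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / (2 * sigma ** 2))

    # Step 2: row-normalize to get the Markov Random Walk matrix P = D^{-1} K, Eq. (5)
    d = K.sum(axis=1)
    P = K / d[:, None]

    # Step 3: spectral decomposition of P, eigenvalues sorted in descending order
    eigvals, eigvecs = np.linalg.eig(P)
    order = np.argsort(-eigvals.real)
    eigvals = eigvals.real[order]
    eigvecs = eigvecs.real[:, order]

    # Step 4: drop the trivial eigenvalue lambda_0 = 1, keep the next l eigenpairs,
    # and scale each eigenvector by lambda^t (running the chain t steps), Eq. (7)
    lam = eigvals[1:l + 1] ** t
    psi = eigvecs[:, 1:l + 1]
    return psi * lam  # row i is the new feature vector x~_i in R^l

The rows of the returned array are the new feature vectors x̃_i, which can then be handed to KNN or LDA exactly as the original 16-dimensional features were.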
3.5 Results

When classifying frames, both LDA and KNN produced an average error of 23.5%. In the phoneme classification (‘majority vote’), an average error of 20% was achieved. The use of diffusion maps provided only a mild improvement.

The results can be described conveniently in a confusion matrix, in which each row represents the instances in a predicted class, while each column represents the instances in an actual class. A typical confusion matrix representing our results is shown in Figure 13. It should be noted that most of the error stems from the confusion between /f/ and /th/. The difficulty in discriminating between /f/ and /th/ due to their similar features is well known in the literature [13]. A short sketch of how such a matrix is tallied is given after Figure 13.

Figure 13: Confusion matrix
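The following is an illustrative sketch only (not the code that produced Figure 13), assuming hypothetical integer label arrays predicted and actual. It follows the convention above, with rows indexed by the predicted class and columns by the actual class, and also shows the per-phoneme ‘majority vote’ over frame decisions.

import numpy as np

def confusion_matrix(predicted, actual, n_classes=4):
    # Rows: predicted class, columns: actual class (convention used above).
    C = np.zeros((n_classes, n_classes), dtype=int)
    for p, a in zip(predicted, actual):
        C[p, a] += 1
    return C

def majority_vote(frame_decisions):
    # A phoneme is labeled with the most frequent per-frame decision
    # among its frames; frame_decisions is an array of class indices.
    return np.bincount(frame_decisions).argmax()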