An Introduction to Diffusion Maps

J. de la Porte†, B. M. Herbst†, W. Hereman*, S. J. van der Walt†
† Applied Mathematics Division, Department of Mathematical Sciences, University of Stellenbosch, South Africa
* Colorado School of Mines, United States of America
jolanidlp@googlemail.com, herbst@sun.ac.za, hereman@mines.edu, stefan@sun.ac.za

Abstract

This paper describes a mathematical technique [1] for dimensionality reduction. Given data in a high-dimensional space, we show how to find parameters that describe the lower-dimensional structures of which it is comprised. Unlike other popular methods such as Principal Component Analysis and Multidimensional Scaling, diffusion maps are non-linear and focus on discovering the underlying manifold (the lower-dimensional constrained "surface" upon which the data is embedded). By integrating local similarities at different scales, a global description of the data-set is obtained. In comparisons, it is shown that the technique is robust to noise perturbation and is computationally inexpensive. Illustrative examples and an open implementation are given.

1. Introduction: Dimensionality Reduction

The curse of dimensionality, a term which vividly reminds us of the problems associated with the processing of high-dimensional data, is ever-present in today's information-driven society. The dimensionality of a data-set, or the number of variables measured per sample, can easily amount to thousands. Think, for example, of a 100 by 100 pixel image, where each pixel can be seen to represent a variable, leading to a dimensionality of 10,000. In such a high-dimensional feature space, data points typically occur sparsely, which causes numerous problems: some algorithms slow down or fail entirely, function and density estimation become expensive or inaccurate, and global similarity measures break down [4].

The breakdown of common similarity measures hampers the efficient organisation of data, which, in turn, has serious implications in the field of pattern recognition. For example, consider a collection of n × m images, each encoding a digit between 0 and 9. Furthermore, the images differ in their orientation, as shown in Fig. 1. A human, faced with the task of organising such images, would likely first notice the different digits, and thereafter that they are oriented. The observer intuitively attaches greater value to parameters that encode larger variances in the observations, and therefore clusters the data into 10 groups, one for each digit. Inside each of the 10 groups, digits are furthermore arranged according to the angle of rotation. This organisation leads to a simple two-dimensional parametrisation, which significantly reduces the dimensionality of the data-set, whilst preserving all important attributes.

Figure 1: Two images of the same digit at different rotation angles.

A computer, on the other hand, sees each image as a data point in $\mathbb{R}^{nm}$, an nm-dimensional coordinate space. The data points are, by nature, organised according to their position in the coordinate space, where the most common similarity measure is the Euclidean distance. A small Euclidean distance between vectors almost certainly indicates that they are highly similar. A large distance, on the other hand, provides very little information on the nature of the discrepancy. The Euclidean distance therefore provides a good measure of local similarity only. In higher dimensions, distances are often large, given the sparsely populated feature space.
Key to non-linear dimensionality reduction is the realisation that data is often embedded in (lies on) a lower-dimensional structure or manifold, as shown in Fig. 2. It would therefore be possible to characterise the data and the relationships between individual points using fewer dimensions, if we were able to measure distances on the manifold itself instead of in Euclidean space. For example, taking into account its global structure, we could represent the data in our digits data-set using only two variables: digit and rotation.

Figure 2: Low-dimensional data measured in a high-dimensional space.

The challenge, then, is to determine the lower-dimensional data structure that encapsulates the data, leading to a meaningful parametrisation. Such a representation achieves dimensionality reduction, whilst preserving the important relationships between data points. One realisation of the solution is diffusion maps.

In Section 2, we give an overview of three other well-known techniques for dimensionality reduction. Section 3 introduces diffusion maps and explains their functioning. In Section 4, we apply the knowledge gained in a real-world scenario. Section 5 compares the performance of diffusion maps against the other techniques discussed in Section 2. Finally, Section 6 demonstrates the organisational ability of diffusion maps in an image processing example.

2. Other Techniques for Dimensionality Reduction

There exist a number of dimensionality reduction techniques. These can be broadly categorised into those that are able to detect non-linear structures, and those that are not. Each method aims to preserve some specific property of interest in the mapping. We focus on three well-known techniques: Principal Component Analysis (PCA), multidimensional scaling (MDS) and the isometric feature map (isomap).

2.1. Principal Component Analysis (PCA)

PCA [3] is a linear dimensionality reduction technique. It aims to find a linear mapping between a high-dimensional space (n-dimensional) and a subspace (d-dimensional, with d < n) that captures most of the variability in the data. The subspace is specified by d orthogonal vectors: the principal components. The PCA mapping is a projection into that space. The principal components are the dominant eigenvectors (i.e., the eigenvectors corresponding to the largest eigenvalues) of the covariance matrix of the data. Principal component analysis is simple to implement, but many real-world data-sets have non-linear characteristics which a PCA mapping fails to encapsulate.

2.2. Multidimensional Scaling (MDS)

MDS [6] aims to embed data in a lower-dimensional space in such a way that pair-wise distances between data points, $X_{1..N}$, are preserved. First, a distance matrix $D_X$ is created; its elements contain the distances between points in the feature space, i.e. $D_X[i, j] = d(x_i, x_j)$. For simplicity's sake, we consider only Euclidean distances here. The goal is to find a lower-dimensional set of feature vectors, $Y_{1..N}$, for which the distance matrix, $D_Y[i, j] = d(y_i, y_j)$, minimises a cost function, $\rho(D_X, D_Y)$. Of the different cost functions available, strain is the most popular (MDS using strain is called "classical MDS"):

$$\rho_{\text{strain}}(D_X, D_Y) = \| J^T (D_X^2 - D_Y^2) J \|_F^2.$$

Here, J is the centering matrix, so that $J^T X J$ subtracts the vector mean from each component in X. The Frobenius matrix norm is defined as $\|X\|_F = \sqrt{\sum_{i=1}^{M} \sum_{j=1}^{N} |x_{ij}|^2}$.
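As a concrete illustration of classical MDS, the following short numpy sketch (our own; the function name classical_mds is not part of the paper's implementation) embeds a data matrix X, one sample per row, using the eigendecomposition of the double-centred squared distance matrix discussed in the next paragraph.

```python
import numpy as np

def classical_mds(X, d=2):
    """Minimal classical MDS sketch: embed the rows of X into d dimensions
    so that pairwise Euclidean distances are (approximately) preserved."""
    N = X.shape[0]
    # Squared Euclidean distance matrix D_X^2
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Centering matrix J = I - (1/N) 11^T
    J = np.eye(N) - np.ones((N, N)) / N
    # Double-centred matrix B = -1/2 J D_X^2 J (J is symmetric)
    B = -0.5 * J @ sq_dists @ J
    # Dominant eigenpairs of B give the low-dimensional coordinates
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:d]
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))
```

A call such as classical_mds(X, d=2) would return N two-dimensional coordinates whose pairwise distances approximate those in $D_X$.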
The intuition behind this cost function is that it preserves variation in distances, so that scaling by a constant factor has no influence [2]. Minimising the strain has a convenient solution, given by the dominant eigenvectors of the matrix $-\frac{1}{2} J^T D_X^2 J$.

MDS, when using Euclidean distances, is criticised for weighing large and small distances equally. We mentioned earlier that large Euclidean distances provide little information on the global structure of a data-set, and that only local similarity can be accurately inferred. For this reason, MDS cannot capture non-linear, lower-dimensional structures according to their true parameters of change.

2.3. Isometric Feature Map (Isomap)

Isomap [5] is a non-linear dimensionality reduction technique that builds on MDS. Unlike MDS, it preserves the geodesic distance, not the Euclidean distance, between data points. The geodesic represents a straight line in curved space or, in this application, the shortest curve along the geometric structure defined by our data points [2]. Isomap seeks a mapping such that the geodesic distance between data points matches the corresponding Euclidean distance in the transformed space. This preserves the true geometric structure of the data.

How do we approximate the geodesic distance between points without knowing the geometric structure of our data? We assume that, in a small neighbourhood (determined by K nearest neighbours or points within a specified radius), the Euclidean distance is a good approximation for the geodesic distance. For points further apart, the geodesic distance is approximated as the sum of Euclidean distances along the shortest connecting path. There exist a number of graph-based algorithms for calculating this approximation. Once the geodesic distances have been obtained, MDS is performed as explained above. A weakness of the isomap algorithm is that its approximation of the geodesic distance is not robust to noise perturbation.

3. Diffusion Maps

Diffusion maps are a non-linear technique. They achieve dimensionality reduction by re-organising data according to the parameters of its underlying geometry. The connectivity of the data set, measured using a local similarity measure, is used to create a time-dependent diffusion process. As the diffusion progresses, it integrates local geometry to reveal geometric structures of the data-set at different scales. Defining a time-dependent diffusion metric, we can then measure the similarity between two points at a specific scale (or time), based on the revealed geometry.

A diffusion map embeds data in (transforms data to) a lower-dimensional space, such that the Euclidean distance between points approximates the diffusion distance in the original feature space. The dimension of the diffusion space is determined by the geometric structure underlying the data, and by the accuracy with which the diffusion distance is approximated. The rest of this section discusses the different aspects of the algorithm in more detail.

3.1. Connectivity

Suppose we take a random walk on our data, jumping between data points in feature space (see Fig. 3). Jumping to a nearby data point is more likely than jumping to one that is far away. This observation provides a relation between distance in the feature space and probability. The connectivity between two data points, x and y, is defined as the probability of jumping from x to y in one step of the random walk,

$$\text{connectivity}(x, y) = p(x, y). \quad (1)$$

Figure 3: A random walk on a data set. Each "jump" has a probability associated with it. The dashed path between nodes 1 and 6 requires two jumps (i.e., two time units), with the probability along the path being p(node 1, node 2) p(node 2, node 6).

It is useful to express this connectivity in terms of a non-normalised likelihood function, k, known as the diffusion kernel:

$$\text{connectivity}(x, y) \propto k(x, y). \quad (2)$$

The kernel defines a local measure of similarity within a certain neighbourhood; outside the neighbourhood, the function quickly goes to zero. For example, consider the popular Gaussian kernel,

$$k(x, y) = \exp\left( -\frac{|x - y|^2}{\alpha} \right). \quad (3)$$
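As a minimal sketch of Eq. (3) in numpy (the function name is our own and this is not the paper's Elefant code), the kernel matrix over a data set X, one sample per row, can be computed as follows.

```python
import numpy as np

def gaussian_kernel_matrix(X, alpha=1.0):
    """Gaussian diffusion kernel of Eq. (3):
    K[i, j] = exp(-|x_i - x_j|^2 / alpha) for rows x_i of X."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / alpha)
```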
The neighbourhood of x can be defined as all those elements y for which $k(x, y) \geq \varepsilon$, with $0 < \varepsilon \ll 1$. This defines the area within which we trust our local similarity measure (e.g. Euclidean distance) to be accurate. By tweaking the kernel scale (α, in this case) we choose the size of the neighbourhood, based on prior knowledge of the structure and density of the data. For intricate, non-linear, lower-dimensional structures, a small neighbourhood is chosen. For sparse data, a larger neighbourhood is more appropriate.

The diffusion kernel satisfies the following properties:

1. k is symmetric: k(x, y) = k(y, x)
2. k is positivity preserving: k(x, y) ≥ 0

We shall see in Section 8 that the first property is required to perform spectral analysis of the kernel matrix, $K_{ij} = k(x_i, x_j)$. The second property is specific to the diffusion kernel and allows it to be interpreted as a scaled probability (which must always be positive), so that

$$\frac{1}{d_X} \sum_{y \in X} k(x, y) = 1, \quad (4)$$

where $d_X = \sum_{y \in X} k(x, y)$. The relation between the kernel function and the connectivity is then

$$\text{connectivity}(x, y) = p(x, y) = \frac{1}{d_X} k(x, y), \quad (5)$$

with $1/d_X$ the normalisation constant.

Define a row-normalised diffusion matrix, P, with entries $P_{ij} = p(X_i, X_j)$. Each entry provides the connectivity between two data points, $X_i$ and $X_j$, and encapsulates what is known locally. By analogy with a random walk, this matrix provides the probabilities for a single step taken from i to j. By taking powers of the diffusion matrix¹, we can increase the number of steps taken. For example, take a 2 × 2 diffusion matrix,

$$P = \begin{pmatrix} p_{11} & p_{12} \\ p_{21} & p_{22} \end{pmatrix}.$$

Each element, $P_{ij}$, is the probability of jumping between data points i and j. When P is squared, it becomes

$$P^2 = \begin{pmatrix} p_{11}p_{11} + p_{12}p_{21} & p_{11}p_{12} + p_{12}p_{22} \\ p_{21}p_{11} + p_{22}p_{21} & p_{21}p_{12} + p_{22}p_{22} \end{pmatrix}.$$

Note that $P^2_{11} = p_{11}p_{11} + p_{12}p_{21}$, which sums two probabilities: that of staying at point 1, and that of moving from point 1 to point 2 and back. When making two jumps, these are all the paths from point 1 back to point 1. Similarly, $P^t_{ij}$ sums all paths of length t from point i to point j.

3.2. Diffusion Process

As we calculate the probabilities $P^t$ for increasing values of t, we observe the data-set at different scales. This is the diffusion process², where we see the local connectivity integrated to provide the global connectivity of the data-set. With increased values of t (i.e. as the diffusion process "runs forward"), the probability of following a path along the underlying geometric structure of the data set increases. This happens because, along the geometric structure, points are dense and therefore highly connected (the connectivity is a function of the Euclidean distance between two points, as discussed in Section 2). Pathways form along short, high-probability jumps.
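These two operations, row-normalising K into P and taking matrix powers $P^t$, amount to a few lines of numpy (a sketch with our own function names, building on the kernel sketch above):

```python
import numpy as np

def diffusion_matrix(K):
    """Row-normalise a kernel matrix K into a Markov (diffusion) matrix P,
    as in Eq. (5): P[i, j] = K[i, j] / sum_j K[i, j]."""
    return K / K.sum(axis=1, keepdims=True)

def diffusion_matrix_power(P, t):
    """P^t: probabilities of t-step paths between all pairs of points."""
    return np.linalg.matrix_power(P, t)
```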
On the other hand, paths that do not follow this structure include one or more long, low-probability jumps, which lower the path's overall probability. In Fig. 4, the red path becomes a viable alternative to the green path as the number of steps increases. Since it consists of short jumps, it has a high probability. The green path keeps the same, low probability, regardless of the value of t.

Figure 4: Paths along the true geometric structure of the data set have high probability.

¹ The diffusion matrix can be interpreted as the transition matrix of an ergodic Markov chain defined on the data [1].
² In theory, a random walk is a discrete-time stochastic process, while a diffusion process is considered to be a continuous-time stochastic process. However, here we study discrete processes only, and consider random walks and diffusion processes equivalent.

3.3. Diffusion Distance

The previous section showed how a diffusion process reveals the global geometric structure of a data set. Next, we define a diffusion metric based on this structure. The metric measures the similarity of two points in the observation space as the connectivity (probability of "jumping") between them. It is related to the diffusion matrix P, and is given by

$$D_t(X_i, X_j)^2 = \sum_{u \in X} |p_t(X_i, u) - p_t(X_j, u)|^2 \quad (6)$$
$$= \sum_k |P^t_{ik} - P^t_{jk}|^2. \quad (7)$$

The diffusion distance is small if there are many high-probability paths of length t between two points. Unlike isomap's approximation of the geodesic distance, the diffusion metric is robust to noise perturbation, as it sums over all possible paths of length t between points.

As the diffusion process runs forward, revealing the geometric structure of the data, the main contributors to the diffusion distance are paths along that structure. Consider the term $p_t(x, u)$ in the diffusion distance. This is the probability of jumping from x to u (for any u in the data set) in t time units, and sums the probabilities of all possible paths of length t between x and u. As explained in the previous section, this term has large values for paths along the underlying geometric structure of the data. In order for the diffusion distance between x and y to remain small, the path probabilities between x and u, and between y and u, must be roughly equal. This happens when x and y are both well connected via u. The diffusion metric manages to capture the similarity of two points in terms of the true parameters of change in the underlying geometric structure of the specific data set.

3.4. Diffusion Map

Low-dimensional data is often embedded in a higher-dimensional space. The data lies on some geometric structure or manifold, which may be non-linear (see Fig. 2). In the previous section, we found a metric, the diffusion distance, that is capable of approximating distances along this structure. Calculating diffusion distances is computationally expensive. It is therefore convenient to map data points into a Euclidean space according to the diffusion metric: the diffusion distance in data space simply becomes the Euclidean distance in this new diffusion space.

A diffusion map, which maps coordinates between data and diffusion space, aims to re-organise data according to the diffusion metric. We exploit it for reducing dimensionality. The diffusion map preserves a data set's intrinsic geometry and, since the mapping measures distances on a lower-dimensional structure, we expect to find that fewer coordinates are needed to represent data points in the new space. The question becomes which dimensions to neglect, in order to preserve diffusion distances (and therefore geometry) optimally.
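Computing the diffusion distance directly from Eq. (7) illustrates why it is expensive: it requires the full matrix power $P^t$. A brute-force sketch (our own, for illustration only; Section 8 derives the cheaper eigenvector formulation) follows.

```python
import numpy as np

def diffusion_distance(P, i, j, t=1):
    """Brute-force diffusion distance of Eq. (7): compare rows i and j of P^t."""
    Pt = np.linalg.matrix_power(P, t)
    return np.sqrt(np.sum((Pt[i] - Pt[j]) ** 2))
```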
With this in mind, we examine the mapping

$$Y_i := \big[\, p_t(X_i, X_1),\; p_t(X_i, X_2),\; \ldots,\; p_t(X_i, X_N) \,\big]^T = P_{i*}^T. \quad (8)$$

For this map, the Euclidean distance between two mapped points, $Y_i$ and $Y_j$, is

$$\|Y_i - Y_j\|_E^2 = \sum_{u \in X} |p_t(X_i, u) - p_t(X_j, u)|^2 = \sum_k |P^t_{ik} - P^t_{jk}|^2 = D_t(X_i, X_j)^2,$$

which is the diffusion distance between data points $X_i$ and $X_j$. This provides the re-organisation we sought, according to diffusion distance. Note that no dimensionality reduction has been achieved yet: the dimension of the mapped data is still the sample size, N.

Dimensionality reduction is done by neglecting certain dimensions in the diffusion space. Which dimensions are of lesser importance? The proof in Section 8 provides the key. Take the normalised diffusion matrix, $P = D^{-1} K$, where D is the diagonal matrix consisting of the row-sums of K. The diffusion coordinates in (8) can be expressed in terms of the eigenvectors and eigenvalues of P as

$$Y_i' = \big[\, \lambda_1^t \psi_1(i),\; \lambda_2^t \psi_2(i),\; \ldots,\; \lambda_n^t \psi_n(i) \,\big]^T, \quad (9)$$

where $\psi_k(i)$ indicates the i-th element of the k-th eigenvector of P. Again, the Euclidean distance between mapped points $Y_i'$ and $Y_j'$ is the diffusion distance. The set of orthogonal left eigenvectors of P forms a basis for the diffusion space, and the associated eigenvalues $\lambda_l$ indicate the importance of each dimension. Dimensionality reduction is achieved by retaining the m dimensions associated with the dominant eigenvectors, which ensures that $\|Y_i' - Y_j'\|$ approximates the diffusion distance, $D_t(X_i, X_j)$, best. Therefore, the diffusion map that optimally preserves the intrinsic geometry of the data is (9).

4. Diffusion Process Experiment

We implemented a diffusion map algorithm in the Python programming language. The code was adapted for the machine learning framework Elefant (Efficient Learning, Large-scale Inference and Optimisation Toolkit), and will be included as part of the next release. The basic algorithm is outlined in Algorithm 1.

Algorithm 1: Basic Diffusion Mapping Algorithm
INPUT: High-dimensional data set $X_i$, i = 0 ... N − 1.
1. Define a kernel, k(x, y), and create a kernel matrix, K, such that $K_{ij} = k(X_i, X_j)$.
2. Create the diffusion matrix by normalising the rows of the kernel matrix.
3. Calculate the eigenvectors of the diffusion matrix.
4. Map to the d-dimensional diffusion space at time t, using the d dominant eigenvectors and eigenvalues, as shown in (9).
OUTPUT: Lower-dimensional data set $Y_i$, i = 0 ... N − 1.

The experiment shows how the algorithm integrates local information through a time-dependent diffusion to reveal structures at different time scales. The chosen data-set exhibits different structures on each scale. The data consists of 5 clusters which, on an intermediate scale, form a noisy C-shape with a single parameter: the position along the C-shape. This parameter is encoded in the colours of the clusters. On a global scale, the structure is one super-cluster.

Figure 5: Original data.

As the diffusion time increases, we expect different scales to be revealed. A one-dimensional curve, characterised by a single parameter, should appear. This is the position along the C-shape, the direction of maximum global variation in the data. The ordering of colours along the C-shape should be preserved.
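As a concrete companion to Algorithm 1, a minimal numpy sketch might look as follows. This is our own illustration, not the Elefant implementation mentioned above; its eigenvector scaling differs slightly from the exact derivation in Section 8, and it drops the trivial constant eigenvector (eigenvalue 1), a common practical choice.

```python
import numpy as np

def diffusion_map(X, d=2, t=1, alpha=1.0):
    """Sketch of Algorithm 1: map the rows of X to a d-dimensional
    diffusion space at time t."""
    # Step 1: Gaussian kernel matrix, Eq. (3)
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / alpha)
    # Step 2: row-normalise to obtain the diffusion matrix P = D^{-1} K
    P = K / K.sum(axis=1, keepdims=True)
    # Step 3: eigen-decomposition of P (eigenvalues sorted, largest first)
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-vals.real)
    vals, vecs = vals.real[order], vecs.real[:, order]
    # Step 4: diffusion coordinates, Eq. (9); skip the constant eigenvector
    # (eigenvalue 1) and keep the next d components, scaled by lambda^t.
    return (vals[1:d + 1] ** t) * vecs[:, 1:d + 1]
```

A call such as diffusion_map(X, d=2, t=3) would then produce two-dimensional coordinates analogous to those shown in Fig. 6.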
4.1. Discussion

In these results, three different geometric structures are revealed, as shown in Figures 5 and 6. At t = 1, the local geometry is visible as five clusters. The diffusion process is still restricted to individual clusters, as the probability of jumping to another cluster in one time-step is small. Therefore, even with reduced dimensions, we can clearly distinguish the five clusters in the diffusion space.

At t = 3, the clusters connect to form a single structure in diffusion space: paths of length three bring them together. Being better connected, the diffusion distance between clusters decreases. At this time scale, the one-dimensional parameter of change, the position along the C-shape, is recovered: in the diffusion space, the order of colours is preserved along a straight line.

At t = 10, a third geometric structure is seen. The five clusters have merged to form a single super-cluster. At this time scale, all points in the observation space are equally well connected. The diffusion distances between points are very small. In the lower-dimensional diffusion space they are approximated as zero, which projects the super-cluster onto a single point.

This experiment shows how the algorithm uses the connectivity of the data to reveal geometric structures at different scales.

Figure 6: Projection onto diffusion space at times t = 1, t = 2, t = 3 and t = 10 (in clockwise order, starting from the top left).

5. Comparison With Other Methods

We investigate the performance of the dimensionality reduction algorithms discussed in Section 2, and compare them to diffusion maps. We use a similar data set as in the previous experiment, the only difference being an increased variance inside the clusters. Which of the algorithms detects the position along the C-shape as a parameter of change, i.e. which of the methods best preserves the ordering of the clusters in a one-dimensional feature space? Being linear techniques, we expect PCA and MDS to fail. In theory, isomap should detect the C-shape, although it is known to struggle with noisy data.

5.1. Discussion

5.1.1. PCA

In observation space (Fig. 5), most variation occurs along the z-axis. The PCA mapping in Fig. 7 preserves this axis as the first principal component. The second principal component runs parallel to x = y, and is orthogonal to the z-axis. The variation in the third dimension is orthogonal to these principal components, and preserves the C-shape. Once the data is projected onto the two primary axes of variation, the ordering along the non-linear C-shape is lost. As expected, PCA fails to detect a single parameter of change.

Figure 7: The two- and one-dimensional feature spaces of PCA.

5.1.2. MDS

Similar to PCA, MDS preserves the two axes along which the Euclidean distance varies most. Its one-dimensional projection fails to preserve the cluster order, as shown in Fig. 8. Suppose we had fixed the red data points in a one-dimensional feature space (see Fig. 11), and wanted to plot the other data points such that Euclidean distances are preserved (this is the premise of MDS). Data points lying on the same radius would then be plotted on top of one another. Due to the non-linear structure of the data, MDS cannot preserve the underlying clusters.

Figure 8: The two- and one-dimensional feature spaces of MDS.

Figure 11: Failure of MDS for non-linear structures.

5.1.3. Isomap

When determining geodesic distances, isomap searches within a certain neighbourhood. The choice of this neighbourhood is critical: if it is too large, it allows for "short circuits" whereas, if it is too small, connectivity becomes very sparse. These "short circuits" can skew the geometry of the outcome entirely. For diffusion maps, a similar challenge is faced when choosing the kernel parameters. Isomap is able to recover all the clusters in this experiment, given a very small neighbourhood. It even preserves the clusters in the correct order. Two of the clusters, however, overlap a great deal, due to the algorithm's sensitivity to noise.

Figure 9: The two- and one-dimensional feature spaces of isomap.

5.1.4. Comparison with Diffusion Maps

Fig. 10 shows that a diffusion map is able to preserve the order of the clusters in one dimension. As mentioned before, choosing parameter(s) for the diffusion kernel remains difficult. Unlike isomap, however, the result is based on a summation over all the data, which lessens the sensitivity towards the kernel parameters. Diffusion maps are the only one of these techniques that allows geometric analysis at different scales.

Figure 10: The one-dimensional diffusion space.
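A rough way to reproduce the spirit of this comparison (not the exact data or figures of the paper) is sketched below, using scikit-learn's PCA, MDS and Isomap. The data generator and all parameter values are our own illustrative choices and would need tuning.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import MDS, Isomap

def noisy_c_clusters(n_per_cluster=30, noise=0.15, seed=0):
    """Five noisy clusters placed along a C-shaped arc (illustrative data only)."""
    rng = np.random.default_rng(seed)
    angles = np.linspace(0.25 * np.pi, 1.75 * np.pi, 5)   # cluster centres on an arc
    centres = np.c_[np.cos(angles), np.sin(angles), np.zeros(5)]
    X = np.vstack([c + noise * rng.standard_normal((n_per_cluster, 3)) for c in centres])
    labels = np.repeat(np.arange(5), n_per_cluster)
    return X, labels

X, labels = noisy_c_clusters()
methods = {
    "PCA": PCA(n_components=1),
    "MDS": MDS(n_components=1),
    "isomap": Isomap(n_neighbors=6, n_components=1),
}
for name, method in methods.items():
    Y = method.fit_transform(X)
    # Rough check: list the cluster labels in the order they appear along the 1-D embedding.
    print(name, labels[np.argsort(Y.ravel())])
# The diffusion_map sketch from Section 4, e.g. diffusion_map(X, d=1, t=3, alpha=0.5),
# can be inspected in the same way.
```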
6. Demonstration: Organisational Ability

A diffusion map organises data according to its underlying parameters of change. In this experiment, we visualise those parameters. Our data-set consists of randomly rotated versions of a 255 × 255 image template (see Fig. 12). Each rotated image represents a single, 65 025-dimensional data point. The data is mapped to the diffusion space, and the dimensionality reduced to two. At each two-dimensional coordinate, the original image is displayed.

Figure 12: Template: 255 × 255 pixels.

6.1. Discussion

In the diffusion space, the images are organised according to their angle of rotation (see Fig. 13). Images with similar rotations lie close to one another. The dimensionality of the data-set has been reduced from 65 025 to only two. As an example, imagine running a K-means clustering algorithm in diffusion space. This would not only be much faster than in data space, but would likely achieve better results, since it does not have to deal with a massive, sparse, high-dimensional data-set.

Figure 13: Organisation of images in diffusion space.

7. Conclusion

We investigated diffusion maps, a technique for non-linear dimensionality reduction. We showed how it integrates local connectivity to recover parameters of change at different time scales. We compared it to three other techniques, and found the diffusion mapping to be more robust to noise perturbation, and the only technique that allows geometric analysis at differing scales. We furthermore demonstrated the power of the algorithm by organising images. Future work will revolve around applications in clustering, noise reduction and feature extraction.

8. Proof

This section discusses the mathematical foundation of diffusion maps. The diffusion distance, given in (7), is very expensive to calculate but, as shown here, it can be written in terms of the eigenvectors and eigenvalues of the diffusion matrix, which can be calculated efficiently. We set up a new diffusion space, where the coordinates are scaled components of the eigenvectors of the diffusion matrix. In the diffusion space, Euclidean distances are equivalent to diffusion distances in data space.

Lemma 1: Suppose K is a symmetric, n × n kernel matrix such that K[i, j] = k(i, j). A diagonal matrix, D, normalises the rows of K to produce a diffusion matrix

$$P = D^{-1} K. \quad (10)$$

Then the matrix P', defined as

$$P' = D^{1/2} P D^{-1/2}, \quad (11)$$

1. is symmetric,
2. has the same eigenvalues as P,
3. has eigenvectors $x'_k$ which, multiplied by $D^{-1/2}$ and $D^{1/2}$, give the right ($Pv = \lambda v$) and left ($e^T P = \lambda e^T$) eigenvectors of P, respectively.
Proof: Substitute (10) into (11) to obtain

$$P' = D^{-1/2} K D^{-1/2}. \quad (12)$$

As K is symmetric, P' will also be symmetric. We can make P the subject of the equation in (11),

$$P = D^{-1/2} P' D^{1/2}. \quad (13)$$

As P' is symmetric, there exists an orthonormal set of eigenvectors of P' such that

$$P' = S \Lambda S^T, \quad (14)$$

where Λ is a diagonal matrix containing the eigenvalues of P', and S is a matrix with the orthonormal eigenvectors of P' as columns. Substituting (14) into (13),

$$P = D^{-1/2} S \Lambda S^T D^{1/2} \quad (15)$$
$$= (D^{-1/2} S) \Lambda (D^{-1/2} S)^{-1} = Q \Lambda Q^{-1}. \quad (16)$$

Therefore, the eigenvalues of P' and P are the same. Furthermore, the right eigenvectors of P are the columns of

$$Q = D^{-1/2} S, \quad (17)$$

while the left eigenvectors are the rows of

$$Q^{-1} = S^T D^{1/2}. \quad (18)$$

From (17) and (18) we see that the eigenvectors of P can be given in terms of the eigenvectors $x'_k$ of P'. The right eigenvectors of P are

$$v_k = D^{-1/2} x'_k, \quad (19)$$

and the left eigenvectors are

$$e_k = D^{1/2} x'_k. \quad (20)$$

From (15) we then obtain the eigen decomposition

$$P = \sum_k \lambda_k v_k e_k^T. \quad (21)$$

When we examine this eigen decomposition further, we see something interesting. Eq. (21) expresses each row of the diffusion matrix in terms of a new basis: $e_k$, the left eigenvectors of the diffusion matrix. In this new coordinate system in $\mathbb{R}^n$, row i of P is represented by the point

$$M_i = \big[\, \lambda_1 v_1[i],\; \lambda_2 v_2[i],\; \ldots,\; \lambda_n v_n[i] \,\big]^T,$$

where $v_k[i]$ is the i-th component of the k-th right eigenvector. However, P is not symmetric, and so the coordinate system will not be orthonormal, i.e.

$$e_k^T I\, e_k \neq 1 \quad \text{and} \quad e_l^T I\, e_k \neq 0 \;\; \text{for } l \neq k.$$

This is a result of the scaling applied to the orthogonal eigenvectors of P' in (19) and (20). The scaling can be counteracted by using a different metric Q, such that

$$e_k^T Q\, e_k = 1 \quad \text{and} \quad e_l^T Q\, e_k = 0 \;\; \text{for } l \neq k,$$

where Q must be a positive definite, symmetric matrix. We choose $Q = D^{-1}$, where D is the diagonal normalisation matrix. It satisfies the two requirements for a metric and leads to

$$e_k^T Q\, e_k = e_k^T (D^{-1/2})^T (D^{-1/2})\, e_k = x'^T_k x'_k = 1,$$

using (20). In the same way we can show that $e_l^T D^{-1} e_k = 0$ for $l \neq k$. Therefore the left eigenvectors of the diffusion matrix form an orthonormal coordinate system of $\mathbb{R}^n$, given that the metric is $D^{-1}$. We define $\mathbb{R}^n$ with the metric $D^{-1}$ as the diffusion space, denoted by $\ell^2(\mathbb{R}^n, D^{-1})$. For example, the Euclidean distance between two vectors, a and a', is normally

$$d(a, a')_{\ell^2} = d(a, a')_{\ell^2(\mathbb{R}^n, I)} = (a - a')^T (a - a')$$

in $\ell^2$. In $\ell^2(\mathbb{R}^n, D^{-1})$, it becomes

$$d(a, a')_{\ell^2(\mathbb{R}^n, D^{-1})} = (a - a')^T D^{-1} (a - a').$$

Lemma 2: If we choose our diffusion coordinates as in (9), then the diffusion distance between points in data space (measured using the $D^{-1}$ metric) is equal to the Euclidean distance in the diffusion space.

Proof: We are required to prove that

$$D_t(x_i, x_j)^2 = \| p_t(x_i, \cdot) - p_t(x_j, \cdot) \|^2_{\ell^2(\mathbb{R}^n, D^{-1})} \quad (22)$$
$$= \| M_i - M_j \|^2_{\ell^2(\mathbb{R}^n, I)} \quad (23)$$
$$= \sum_k \lambda_k^{2t} \big( v_k[i] - v_k[j] \big)^2. \quad (24)$$

Here, $p_t(x_i, x_j) = P^t_{ij}$ are the probabilities which form the components of the diffusion matrix. For simplicity we assume t = 1. Then

$$D(x_i, x_j)^2 = \| p(x_i, \cdot) - p(x_j, \cdot) \|^2_{\ell^2(\mathbb{R}^n, D^{-1})} = \| P[i, \cdot] - P[j, \cdot] \|^2_{\ell^2(\mathbb{R}^n, D^{-1})}.$$

According to the eigen decomposition in (21), this equals

$$\Big\| \sum_k \lambda_k v_k[i]\, e_k^T - \sum_k \lambda_k v_k[j]\, e_k^T \Big\|^2 = \Big\| \sum_k \lambda_k \big( v_k[i] - v_k[j] \big) e_k^T \Big\|^2 = \Big\| \sum_k \lambda_k \big( v_k[i] - v_k[j] \big) x'^T_k D^{1/2} \Big\|^2.$$

In $\ell^2(\mathbb{R}^n, D^{-1})$, this distance is

$$\Big( \sum_k \lambda_k \big( v_k[i] - v_k[j] \big) x'^T_k D^{1/2} \Big) D^{-1} \Big( \sum_m \lambda_m \big( v_m[i] - v_m[j] \big) x'^T_m D^{1/2} \Big)^T = \sum_k \sum_m \lambda_k \lambda_m \big( v_k[i] - v_k[j] \big) \big( v_m[i] - v_m[j] \big)\, x'^T_k x'_m.$$

Since $\{x'_k\}$ is an orthonormal set, $x'^T_m x'_k = 0$ for $m \neq k$ and $x'^T_k x'_k = 1$. Therefore

$$D(x_i, x_j)^2 = \sum_k \lambda_k^2 \big( v_k[i] - v_k[j] \big)^2. \quad (25)$$

We have therefore shown that the diffusion distance, $D_t(x_i, x_j)^2$, is simply the Euclidean distance between the mapped points, $M_i$ and $M_j$, in diffusion space.
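Both lemmas are easy to check numerically. The following sketch (our own, not part of the paper's implementation) verifies, for a random data set, that P' is symmetric with the same eigenvalues as P, and that the diffusion distance under the $D^{-1}$ metric matches the eigenvector expression in (25).

```python
import numpy as np

# Numerical sanity check of Lemmas 1 and 2 on a random Gaussian kernel.
rng = np.random.default_rng(42)
X = rng.standard_normal((20, 3))
sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
K = np.exp(-sq)                                    # symmetric kernel matrix
d = K.sum(axis=1)
P = K / d[:, None]                                 # diffusion matrix P = D^{-1} K
P_sym = np.diag(d ** 0.5) @ P @ np.diag(d ** -0.5)  # P' = D^{1/2} P D^{-1/2}

# Lemma 1: P' is symmetric and shares its eigenvalues with P.
assert np.allclose(P_sym, P_sym.T)
vals, x = np.linalg.eigh(P_sym)                    # orthonormal eigenvectors of P'
assert np.allclose(np.sort(vals), np.sort(np.linalg.eigvals(P).real))
v = x / d[:, None] ** 0.5                          # right eigenvectors v_k = D^{-1/2} x'_k

# Lemma 2 (t = 1): the diffusion distance between rows i, j of P, measured with the
# D^{-1} metric, equals sum_k lambda_k^2 (v_k[i] - v_k[j])^2 as in Eq. (25).
i, j = 0, 1
diff_dist = (P[i] - P[j]) @ np.diag(1.0 / d) @ (P[i] - P[j])
mapped = (vals * (v[i] - v[j])) ** 2
assert np.isclose(diff_dist, mapped.sum())
print("Lemmas 1 and 2 verified numerically:", diff_dist, mapped.sum())
```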
9. References

[1] R. R. Coifman and S. Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):5–30, 2006.

[2] A. Ihler. Nonlinear Manifold Learning (MIT 6.454 Summary), 2003.

[3] I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, 1986.

[4] S. S. Lafon. Diffusion Maps and Geometric Harmonics. PhD thesis, Yale University, 2004.

[5] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.

[6] W. S. Torgerson. Multidimensional scaling: I. Theory and method. Psychometrika, 17(4):401–419, 1952.