A tiny review on Manifold Embedding techniques
Felipe Orihuela-Espina

The references are only examples of sources that I have used; they do not always correspond to the original publication on a topic, nor are they necessarily the best.

I.- Mathematical Background

Topological Space
A topological space $X$ is a set of points $X = \{x_i \mid i = 1 \ldots n\}$ together with a collection of subsets $T = \{S_j \subseteq X\}$ satisfying the following axioms:
- The empty set and $X$ itself are in $T$: $\emptyset \in T \wedge X \in T$.
- The union of an arbitrary collection of sets in $T$ is also in $T$: $\bigcup_i S_i \in T$.
- The intersection of a finite collection of sets in $T$ is also in $T$: $\bigcap_{i=1}^{k} S_i \in T$.
The collection $T$ is called the topology of $X$. Basically, a topological space is a geometric object, and a topology is a structure imposed on it. [Wolfram, World of Maths]

Manifold (I)
A manifold is a topological space that is locally Euclidean. The concept of manifold generalises the traditional Euclidean (linear) space to non-Euclidean topologies. Note that "locally Euclidean" does not mean that the manifold is constrained to a Euclidean metric globally, only that it is locally homeomorphic to a Euclidean space. In other words, a manifold is an object placed in an n-dimensional ambient space. A k-dimensional manifold has k degrees of freedom, i.e. it can be described locally with only k coordinates.

Manifold (II)
If the manifold is infinitely differentiable it is called a smooth manifold. A smooth manifold with a metric imposed to induce the topology is called a Riemannian manifold. A submanifold is a subset of a manifold which is itself a manifold. [Wolfram, World of Maths] [Carreira-Perpiñán, 1997]

Homeomorphism and diffeomorphism
A homeomorphism is a continuous bijective transformation $f: X \rightarrow Y$ between topological spaces $X$ and $Y$ whose inverse is also continuous. The fact that it is continuous means that points which are close in $X$ are also close in $Y$, and points which are far in $X$ are also far in $Y$. The fact that it is bijective (one to one) means that it is injective and surjective, which also implies that the inverse $f^{-1}: Y \rightarrow X$ exists. If the homeomorphism is differentiable, i.e. if the derivative exists and so does that of its inverse, then it is called a diffeomorphism. [Wolfram, World of Maths] [Figure from Wikipedia]

Embedding
An embedding is a map $f: X \rightarrow Y$ such that $f$ is a diffeomorphism from $X$ to $f(X)$, and $f(X)$ is a smooth submanifold of $Y$. An embedding is the representation of a topological object (e.g. a manifold, graph, lattice, etc.) in a certain (sub-)space so that its topology is preserved. In particular, for manifolds, it preserves the open sets in $T$. [Bonatti, 2006]

Summarizing…
A manifold is any object which is locally linear (flat). An embedding is a function from one space to another such that the topology (shape) is preserved through deformations (twisting and stretching).

II.- Manifold embedding

Manifold Embedding
Dimensionality reduction is a particular case of manifold embedding in which the dimension of the destination space is lower than that of the original data space. Domain-specific data are often distributed on (lie on, or close to) a low-dimensional manifold in a high-dimensional space [Yang, 2004]. Topology or structure is retained/preserved if the pairwise distances in the low-dimensional space approximate the corresponding pairwise distances in the feature space. [Sammon, 1969]
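This distance-preservation criterion can be checked directly in code. A minimal sketch, assuming NumPy, SciPy and scikit-learn are available; the Swiss roll dataset, Isomap and the Spearman correlation are my own illustrative choices, not prescribed by the slides:

```python
# Sketch: check how well an embedding preserves pairwise distances (Swiss roll example).
# Dataset and method are illustrative only.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, _ = make_swiss_roll(n_samples=800, noise=0.05, random_state=0)  # data near a 2-D manifold in 3-D
Y = Isomap(n_neighbors=10, n_components=2).fit_transform(X)        # low-dimensional configuration

d_high = pdist(X)   # pairwise distances in the feature (high-dimensional) space
d_low = pdist(Y)    # pairwise distances in the embedded (low-dimensional) space

# A monotone agreement between the two sets of distances indicates preserved structure.
rho, _ = spearmanr(d_high, d_low)
print(f"Spearman correlation between distance matrices: {rho:.3f}")
```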
Manifold Embedding
The problem of dimension reduction, or data embedding, has been defined in several similar ways:
- The search for a low-dimensional manifold that embeds the high-dimensional data [Carreira-Perpiñán, 1997].
- Finding/recovering meaningful low-dimensional structures hidden in high-dimensional data [Tenenbaum, 2002].
- Detecting and identifying inherent "structure" (i.e. clusters or relationships between vectors) [Sammon, 1969].
- …

Manifold Embedding
In data embedding there are methods for:
- Estimating the intrinsic dimensionality of the data, without actually projecting the data.
- Generating a lower-dimensional configuration by means of a projection (data projection methods).

Manifold Embedding: Variants
- Multiple manifold embedding: data lie in more than one manifold.
- Multi-class manifold embedding: data lie in a single manifold, but the sampling contains large gaps, perhaps even fragmenting connected components.

Manifold Embedding: Nomenclature
Manifold embedding is also called manifold learning [Souvernir, 2005], multivariate data projection [[Mao, 1995] in Demartines, 1997], or simply projection [Venna, 2007].
The origin space is sometimes called: high-dimensional (input) space [Tenenbaum, 2000][Demartines, 1997][Venna, 2007], vector space [Roweis, 2000][Sammon, 1969][Brand, 2003], data space [Souvernir, 2005], observation space [Silva, 2002], domain space [Yang, 2004, 2005], or feature space (usually in the context of pattern analysis).
The destination space is more consistently called the low-dimensional space, but other names include output space [Demartines, 1997][Venna, 2007] and, the one I personally like, embedded space [Leff, 2007].

Manifold Embedding: Applications
Some applications of manifold embedding / dimensionality reduction are:
- Regression and smoothing in statistics
- Data compression and coding in information theory
- Visualization and representation of data in general
- Feature extraction in pattern analysis
- Determination of latent variables in causal models
- Complexity reduction in algorithmics
- Data exploration in statistics
- A prior step to clustering in machine learning

III.- Estimating intrinsic dimensionality without projection

Intrinsic dimensionality (ID)
The intrinsic dimensionality (ID) of a manifold has been defined as "the number of independent variables that explains satisfactorily" that manifold. Determination of the ID eliminates the possibility of over- or under-fitting. Since it is always possible to find a manifold of any dimension which passes through all points in a data set given enough parameters, the problem of estimating the ID of a dataset is ill-posed in the Hadamard sense. Note that this is the case of interpolation, which can always fit a 1-D curve through a dataset! [Carreira-Perpiñán, 1997] [Figure modified from Carreira-Perpiñán, 1997]
Intrinsic dimensionality vs Topological dimension
The topological dimension is the "local" dimensionality at every point, i.e. the dimension of the tangent space. The topological dimension is a lower bound of the ID. Example: a sphere has ID 3 but topological dimension 2 (at every point the sphere can be approximated by a surface). [Camastra, 2003]

Examples of methods for estimating the intrinsic dimensionality of data (without projection)
- Bennet's algorithm [Bennet, 1969]
- Fukunaga and Olsen's algorithm [Fukunaga et al, 1971]
- Local eigenvalue estimator [Verveer et al, 1995]
- Bruske and Sommer's work based on a topology preserving map [Bruske et al, 1998]
- Trunk's statistical approach (near neighbour techniques) [Trunk, 1968] [[Trunk, 1976] in [Camastra, 2003]]
- Pettis' algorithm, which adds the assumption of uniformly distributed sampling to derive a simple expression
- Near neighbour estimator [Verveer et al, 1995]
- Fractal based methods [Review by Camastra, 2003]
- Broomhead's topological dimension of a time series [Broomhead, 1987]

Bennet's algorithm
Define the probability density function of the interpoint distances. Observation: a displacement/perturbation of the dataset which increases the variance of the interpoint distances tends to reduce the dimensionality of the data set. A perturbation of this kind reduces the small distances (those smaller than the mean interpoint distance) and increases the large distances. [Bennet, 1969]

Bennet's algorithm
1. Start from the original dataset.
2. Repeat until the change in variance is smaller than a given threshold:
   a. Increase the variance of the dataset (dataset += Δ).
   b. For each point, restore the ranking order of the local distances.
3. Calculate PCA on the obtained configuration; the ID is the number of non-zero eigenvalues. Noise can make no eigenvalue exactly zero, so in practice a threshold is needed.
[Bennet, 1969]

Fukunaga and Olsen's algorithm
Observation: for vectors embedded in a linear subspace, the dimension is equal to the number of non-zero eigenvalues of the covariance matrix (PCA), which is step 3 of Bennet's algorithm! The idea is therefore to apply PCA locally. The original formulation is based on a Taylor expansion of small subregions and the calculation of PCA in each small region. [Fukunaga, 1971] [Camastra, 2003]

Fukunaga and Olsen's algorithm
1. Divide the data set into a number of small subsets or hyperspherical regions (neighbourhoods) containing a fixed number of points.
2. Compute the Taylor expansion of each local neighbourhood.
3. Count the number of significant terms (i.e. different order derivatives) in the expansion. This step is actually solved by applying PCA to each small region and counting the eigenvalues larger than a threshold; this count is close to the local ID.
This formulation is not noise robust: noise can make no eigenvalue exactly zero, so a threshold is needed.

Fukunaga and Olsen's algorithm
The partition of the dataset does not necessarily need to be complete. Step 1 of the algorithm permits two variants:
1. The variant with overlapping regions is more representative of the data.
2. The variant without overlapping regions is faster.
A particularity is that a single final answer is not given; the algorithm produces summary tables with "local" answers, giving upper and lower bounds (a range) for the dimensionality. Neighbourhood sizes and the threshold value are difficult to choose.
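As an illustration of the local-PCA idea behind Fukunaga and Olsen's algorithm, a minimal sketch; the use of k-nearest-neighbour balls, the neighbourhood size and the relative eigenvalue threshold are my own illustrative choices, not part of the original formulation:

```python
# Sketch: local-PCA intrinsic dimensionality estimates in the spirit of Fukunaga and Olsen.
# Neighbourhood size and eigenvalue threshold are illustrative and would need tuning.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_id_estimates(X, n_neighbors=20, threshold=0.05):
    """For each point, count the local PCA eigenvalues above a relative threshold."""
    nbrs = NearestNeighbors(n_neighbors=n_neighbors).fit(X)
    _, idx = nbrs.kneighbors(X)
    local_ids = []
    for neighbourhood in idx:
        patch = X[neighbourhood]
        patch = patch - patch.mean(axis=0)                   # centre the local region
        eigvals = np.linalg.eigvalsh(np.cov(patch.T))[::-1]  # local covariance spectrum
        # "non-zero" eigenvalues in practice: those above a fraction of the largest one
        local_ids.append(int(np.sum(eigvals > threshold * eigvals[0])))
    return np.array(local_ids)

# Example: noisy samples from a 2-D plane embedded in 5-D should give local IDs close to 2.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
A = rng.normal(size=(2, 5))
X = latent @ A + 0.01 * rng.normal(size=(500, 5))
ids = local_id_estimates(X)
print("range of local ID estimates:", ids.min(), "-", ids.max())
```

As in the algorithm above, the result is a set of local answers (here an array of local IDs) rather than a single number.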
Fukunaga and Olsen's algorithm: a slightly different formulation with an Optimally Topology Preserving Map (OTPM)
1. Compute a Voronoi tessellation such that the regions can be considered locally linear (using the LBG vector quantisation algorithm), and construct a graph G corresponding to the induced Delaunay triangulation (the optimally topology preserving map, OTPM).
2. For each Voronoi cell:
   a. Compute the difference vectors from each local point to its generating vector.
   b. Apply PCA to the set of difference vectors.
   c. The local ID is the number of eigenvalues larger than a given threshold.
[Bruske and Sommer, 1998] [Image simulated with VoroGlide 2.0]

Verveer and Duin's Local Eigenvalue Estimator
A modification of Fukunaga's algorithm to automatically choose the threshold. It requires a high sampling of the manifold and assumes a uniform sampling density in the local neighbourhoods. [Verveer, 1995]

Trunk's statistical approach (near neighbour)
Iteratively look for the most likely local dimensionality by examining invariant statistics; the topological dimensionality is computed from the distribution of distances. The method involves many ad-hoc assumptions: linearity and independence, all distance ratios and angles being independent, and an ad-hoc density distribution for the angles. The correct answer for the ID is not guaranteed! ([Trunk, 1968: pg 519]) [Trunk, 1968] [Camastra, 2003]

Trunk's statistical approach (near neighbour): iterative variation for noisy data
1. Initialise k.
2. Repeat:
   a. Construct the k-NN graph.
   b. For each point i: find the (k+1)-th nearest neighbour of point i, and calculate the angle between that neighbour and the "flat" hypersurface spanned by the k-neighbourhood.
   c. Set avgAngle to the mean of these angles.
   d. If avgAngle is above a given threshold, increment k.
3. Until avgAngle falls below the threshold.
[[Trunk, 1976] in [Camastra, 2003]]

Trunk's statistical approach (near neighbour)
How to choose the threshold is not clear. At each iteration, the computed k-neighbourhoods are assumed locally linear. The iterative description is only conceptual; the iterative algorithm is not computationally efficient. Trunk's original statistical approach was not iterative and was much more efficient, but also more complicated and sensitive to noise. [Trunk, 1968] [Camastra, 2003]

Verveer and Duin's Near Neighbour Estimator
A non-iterative approximation to Trunk's algorithm for noisy data, in which the ID is directly calculated by a derived (cumbersome) formula, namely the near neighbour estimator. It is a correction to Pettis' algorithm (which they show leads to incorrect answers). The result is a real value, so it must be rounded to the nearest integer. It usually underestimates the ID (when the real ID is high), and it is sensitive to noise, outliers, edge effects, sampling density and distribution, etc. [Verveer, 1995] [Camastra, 2003]

Broomhead's topological dimension of a time series
Observation: for a small neighbourhood of any point x, the effects of curvature become unimportant (the manifold is by definition locally linear), and therefore the manifold can be well approximated by its tangent space. [Broomhead, 1987]

Broomhead's topological dimension of a time series
For a point $x$ construct a neighbourhood matrix $B_\epsilon(x)$ whose rows are the vectors $(x_j - x)^T$ such that the $x_j$ are the $\epsilon$-neighbours of $x$, i.e. $\|x_j - x\| < \epsilon$. Note that this amounts to considering $x$ as the origin of a coordinate system and expressing all the neighbours $x_j$ as vectors "centred" at this origin. For sufficiently small $\epsilon$, $B_\epsilon(x)$ represents the tangent vectors to the manifold, i.e. the tangent space. The rank of $B_\epsilon(x)$ (its number of linearly independent rows) is the dimension of the manifold. [Broomhead, 1987]
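A minimal sketch of this tangent-space rank idea; the ε value, the singular-value tolerance (which plays the role of the thresholds discussed above) and the helix test data are illustrative assumptions on my part:

```python
# Sketch: estimate the local (topological) dimension as the rank of the neighbourhood
# matrix B_eps(x), following Broomhead's tangent-space argument.
import numpy as np

def tangent_rank(X, x, eps, tol=0.05):
    diffs = X - x                                   # vectors x_j - x
    mask = np.linalg.norm(diffs, axis=1) < eps      # keep only the eps-neighbours
    B = diffs[mask]                                 # rows of B_eps(x)
    if B.shape[0] == 0:
        return 0
    # Rank via SVD with a relative tolerance, since noise makes no singular value exactly zero.
    s = np.linalg.svd(B, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

t = np.linspace(0, 4 * np.pi, 2000)
helix = np.c_[np.cos(t), np.sin(t), 0.1 * t]        # points on a 1-D manifold in 3-D
print(tangent_rank(helix, helix[1000], eps=0.2))    # expected: 1 for a small enough eps
```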
Broomhead's topological dimension of a time series
The construction of the tangent matrix $B_\epsilon(x)$ is also used by Laplacian Eigenmaps and Hessian Eigenmaps. How the manifold is created differs: for us, patterns (points in the manifold) are columns, i.e. one signal with all its time samples; for Broomhead, patterns are rows, i.e. one time sample across all signals or observed variables (the two axes of the data matrix being time samples/observations and signals/subjects/variables). [Broomhead, 1987]

IV.- Data Projection Methods
[Figure from http://people.hofstra.edu/Stefan_Waner/diff_geom/pics/Chart.gif]

Data Projection Methods
They can be coarsely classified as:
- Linear vs. non-linear
- Global vs. local vs. topology neutral
There is no BEST method for all cases; different techniques deal with different types of manifolds.

Data Projection Methods: Common things
- The manifold is always considered to be placed in an "ambient" Euclidean space $\mathbb{R}^n$.
- Most of the methods require a priori knowledge (or an estimation) of the intrinsic dimensionality.
- Virtually every projection solves an optimization problem.
- The embedded solution is not unique.
- The embedded solution normally implies a deformation.

Data Projection Methods: Common things
- A number of techniques require the construction of an (undirected, weighted) neighbourhood graph, built from the k nearest neighbours or from an ε-radius neighbourhood.
- Often projections are based on accepting the first n eigenvectors of the eigendecomposition of a matrix derived from distances.

IV.a.- Linear Data Projection Methods

Examples of data projection methods: Linear
- PCA (Principal Component Analysis) [LOTS!!]
- MDS (Multidimensional Scaling, a.k.a. Principal Coordinate Analysis) (LOTS!! – [Kruskal, 1974][Cox, 1994])
- ICA (Independent Component Analysis) [Comon, 1994]
- CCA (Canonical Correlation Analysis) [Friman, 2002]
- PP (Projection Pursuit) [Carreira-Perpiñán, 1997]

PCA – Principal Component Analysis
PCA rotates the axes of the original space so that the new axes maximise the variance explained by each component. As a side effect, discarding the least significant components (those with minimum variance) results in a dimension reduction with minimal loss of information. [2D representation of the PCA transformation; figure from web.media.mit.edu]

PCA – Principal Component Analysis
Let $X_{K \times M}$ be the dataset with $M$ vectors as columns and $K$ observations or dimensions as rows. First we centre (mean correct) the data: $X' = X - E\{X\} \Rightarrow E\{X'\} = 0$. Define a matrix $T = \frac{1}{\sqrt{M-1}} X'^{T}$ such that the covariance matrix of the dataset can be expressed as
$$\mathrm{Cov}_X = \frac{1}{M-1} X' X'^{T} = T^{T} T$$
The columns of $V$ in the SVD $T = U \Sigma V^{T}$ are basically the eigenvectors of $X' X'^{T}$ (the covariance matrix), which are the principal components of $X'$. [Schlens, 2005], [Fodor, 2002]

PCA – Principal Component Analysis
Possibly the most well known and widely applied method. It assumes linearity and orthogonality between the components. PCA is invariant to both translation and rotation of the dataset. PCA tends to overestimate the intrinsic dimensionality (ID).

MDS – Multidimensional Scaling
MDS finds the best spatial representation in a k-dimensional space for a cloud of points, given the pairwise distances $d_{ij}$ between them in the original space. The aim is to provide a map which minimises the discrepancy between the distances $d_{ij}$ in the original space and the distances $\hat{d}_{ij}$ in the destination space, thus maximizing the quality of the mapping. [Kruskal, 1978], [Cox, 1994] [Figure reproduced from Cox, 1994]
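Before moving on to the MDS variants, a minimal NumPy sketch of the SVD route to the principal components described in the PCA slides above; the toy data and the print-out are my own illustrative choices, and the data layout follows the slides' convention of samples as columns:

```python
# Sketch: principal components via the SVD of T = X'^T / sqrt(M-1), as in the PCA slides.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 200)) * np.array([[5.0], [1.0], [0.1]])  # toy data, K=3 dims, M=200 vectors

Xc = X - X.mean(axis=1, keepdims=True)        # centre: X' = X - E{X}
M = X.shape[1]
T = Xc.T / np.sqrt(M - 1)                     # T = X'^T / sqrt(M-1), so Cov_X = T^T T
U, S, Vt = np.linalg.svd(T, full_matrices=False)

components = Vt.T                             # columns of V: the principal components of X'
explained_variance = S**2                     # eigenvalues of the covariance matrix
print(explained_variance)                     # dominated by the first component here

# Cross-check against the direct eigendecomposition of the covariance matrix:
evals = np.linalg.eigvalsh(np.cov(Xc))[::-1]
assert np.allclose(evals, explained_variance)
```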
MDS – Multidimensional Scaling
MDS comprises a number of variants differing in:
- the cost function, which determines the output configuration, and
- the optimization algorithm, which defines the computational procedure.
MDS variants can be split into [Kruskal, 1978], [Cox, 1994]:
- Metric: distances in the embedded space are related linearly to distances in the feature space.
- Non-metric: distances in the embedded space are only related to distances in the feature space by a monotonically increasing function.

MDS – Multidimensional Scaling
According to the number of difference (distance) criteria used, MDS is [Arabie, 1987], [Cox, 1994]:
- Two-way: a single measure.
- Three-way: multiple measures.

MDS – Multidimensional Scaling
Classical MDS:
- Is a particular case of two-way metric MDS.
- Uses the cost function known as strain, $\mathrm{strain} = \| J'(D^2 - \hat{D}^2) J \|^2$, where $D = \{d_{ij}\}$, $\hat{D} = \{\hat{d}_{ij}\}$ and $J$ is the centring matrix used in the eigendecomposition.
- Is a dual form of PCA (Principal Component Analysis) and yields the same solution: PCA starts from the locations of the points, whereas MDS starts from the distances between points. Classical MDS is a.k.a. Principal Coordinate Analysis.
[Kruskal, 1978], [Cox, 1994]

MDS – Multidimensional Scaling
Other cost functions include:
- Stress (f-stress): $\mathrm{stress} = \| f(D) - \hat{D} \|_F^2 = \frac{\sum_i \sum_j (f(d_{ij}) - \hat{d}_{ij})^2}{\text{scale factor}}$, with the scale factor usually a sum over the distances, e.g. $\sum_i \sum_j \hat{d}_{ij}^2$.
- S-stress (squared stress): $\mathrm{sstress} = \| f(D)^2 - \hat{D}^2 \|_F^2$.
- Sammon's cost function: $E = \frac{1}{\sum_i \sum_j d_{ij}} \sum_i \sum_j \frac{(d_{ij} - \hat{d}_{ij})^2}{d_{ij}}$.
The use of Sammon's cost function leads to results comparable to Sammon's NLM, despite differences in the algorithm.

MDS – Multidimensional Scaling
Although commonly classified as linear, MDS can behave non-linearly (depending on the cost function); when using Sammon's cost function it performs similarly to NLM. It is the seed for a number of other techniques, e.g. Isomap.

PP – Projection Pursuit
Automatically picks an interesting low-dimensional projection, where what is "interesting" is defined by an objective function called the projection index. You can think of PP as rotating a 3D scatter plot on the PC screen to "best" see the structure of the data; PP automatically searches for the most interesting viewpoint for you. In the slide's example, the projection onto the Y-Z plane is more interesting than the projection onto the X-Y plane, since the clusters are easily visible.

PP – Projection Pursuit
Let $X$ be the dataset with distribution $F$, and let $A$ be a projection matrix with distribution $F_A$ and component vectors $a_i$. The projection index $Q$ (the optimization function) is a real function of the distribution of the projection of the dataset: $Q: F_A \rightarrow \mathbb{R}$. PP finds a projection direction $a$ which, for a given distribution $F$, produces a local optimum of $Q$. The normal distribution is the least interesting density distribution. To find other local optima of $Q$, run the optimization algorithm again (suppressing the current solution). [Carreira-Perpiñán, 1997], [Fodor, 2002]

PP – Projection Pursuit
Particularly effective as a step previous to clustering. It is linear and the projections are orthogonal. It is computationally expensive.
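A crude sketch of the projection-pursuit idea; using absolute excess kurtosis as the projection index (departure from normality) and a naive random search over unit directions are my own simplifications, since real PP uses more refined indices and optimizers:

```python
# Sketch: toy 1-D projection pursuit. Interestingness = departure from normality,
# measured here by absolute excess kurtosis; the search is a naive random sampling
# of candidate directions rather than a proper optimizer.
import numpy as np

def projection_index(z):
    """Absolute excess kurtosis of a 1-D projection (approximately 0 for a Gaussian)."""
    z = (z - z.mean()) / z.std()
    return abs(np.mean(z**4) - 3.0)

def projection_pursuit_1d(X, n_candidates=5000, seed=0):
    rng = np.random.default_rng(seed)
    best_a, best_q = None, -np.inf
    for _ in range(n_candidates):
        a = rng.normal(size=X.shape[1])
        a /= np.linalg.norm(a)                 # candidate unit projection direction
        q = projection_index(X @ a)
        if q > best_q:
            best_a, best_q = a, q
    return best_a, best_q

# Two clusters separated along the first axis: that direction should be "interesting".
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (300, 5)),
               rng.normal(0, 1, (300, 5)) + [6, 0, 0, 0, 0]])
a, q = projection_pursuit_1d(X)
print("best direction:", np.round(a, 2), "index:", round(q, 2))
```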
IV.b.- Non-Linear Data Projection Methods

Examples of data projection methods: Non-linear
- Sammon's non-linear mapping (NLM) [Sammon, 1969]
- Kohonen's self organising maps (SOM) [Kohonen, 1997], a.k.a. topologically continuous maps or Kohonen maps; Temporal Kohonen maps [Chappell, 1993]
- Laplacian eigenmaps [Belkin, 2002, 2003]
- GeoNLM [Yang, 2004b]
- Laplacian eigenmaps with fast N-body methods [Wang, 2006]
- PCA based: non-linear PCA [Fodor, 2002], Kernel PCA [Scholkopf, 1998], Principal Curves [Carreira-Perpiñán, 1997], space partitioning and locally applied PCA [Olsen and Fukunaga, 1973]

Examples of data projection methods: Non-linear
- Isomap [Tenenbaum, 2000]; FR-Isomap [Lekadir, 2006], S-Isomap [Geng, 2005], ST-Isomap [Jenkins, 2004], L-Isomap [Silva, 2002], C-Isomap [Silva, 2002]
- Locally linear embedding (LLE) [Roweis, 2000]
- Hessian Eigenmaps, a.k.a. Hessian Locally Linear Embedding [Donoho, 2003]
- Curvilinear Component Analysis [Demartines, 1997]; Curvilinear Distance Analysis (CDA) [Lee, 2002, 2004]

Examples of data projection methods: Non-linear
- Kernel ICA [Bach, 2003]
- Manifold charting [Brand, 2003]
- Stochastic neighbour embedding [Hinton, 2002]
- Triangulation method [Lee, 1977]; tetrahedral methods: distance preserving projection [Yang, 2004]

Examples of data projection methods: Non-linear
Others…
- Semidefinite embedding (SDE)
- Conformal Eigenmaps [Maaten, 2007]: maximally angle preserving
- Maximum Variance Unfolding (MVU) [Maaten, 2007]
- Minimum Volume Embedding [Shaw, 2007]: a variant of LLE
- Diffusion Maps (DM): based on a Markov random walk on the high-dimensional graph to obtain a measure of proximity between data points.

Sammon's NLM – Non-Linear Mapping
Seeks to preserve the structure (clusters and non-linear relationships). Based on the famous Sammon cost function, which favours a better mapping of the smaller (local) distances. [Sammon, 1969]

Sammon's NLM – Non-Linear Mapping
1. Compute all pairwise distances $d_{ij}$ in the data space.
2. Generate an initial low-dimensional configuration $Y_{k \times N}$ (randomly, using PCA, or otherwise).
3. Compute all pairwise distances $\hat{d}_{ij}$ in the low-dimensional space for $Y_{k \times N}$.
4. Iteratively optimize $Y_{k \times N}$ (hence changing $\hat{D} = \{\hat{d}_{ij}\}$) so as to minimize the cost function
$$E(Y) = \frac{1}{\sum_i \sum_j d_{ij}} \sum_i \sum_j \frac{(d_{ij} - \hat{d}_{ij})^2}{d_{ij}}$$
NLM uses a steepest descent procedure for the optimization. [Sammon, 1969]

Sammon's NLM – Non-Linear Mapping
Highly efficient at identifying hyperspherical/hyperellipsoidal structures. Points too close in the input space ($d_{ij} \approx 0$) may badly disturb the cost function. It has strong similarities with MDS; as argued by Sammon, the "mathematical formulation[s] are similar, [but] the underlying mapping criterions are quite different". [Sammon, 1969], [Demartines, 1997]

Kohonen's SOM – Self Organising Maps
Maps a given input dataset to a discretized lattice (grid) of a given shape, regardless of the actual shape of the manifold. Linked to a neural network implementation. Detailed final equations have proved difficult to obtain and only exist for extremely simplified cases. I really do not understand this very well, to be honest… [Kohonen, 1990]

Kohonen's SOM – Self Organising Maps
Let $X = \{x_i(t)\}$ be the high-dimensional data set, and $M = \{m_i(t)\}$ the low-dimensional reference vectors placed upon a lattice. Initially select the $m_i(0)$ randomly. Then, iteratively [Kohonen, 1990]:
- Link $x(t)$ to the closest matching $m_i(t)$.
- Update $m_i(t+1) = m_i(t) + h(t)\,[x(t) - m_i(t)]$, where $h(t)$ is a kernel (neighbourhood) function.
A sketch of this update loop is given below.
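A minimal SOM training sketch; the 1-D chain lattice, the Gaussian neighbourhood kernel and the decay schedules are illustrative assumptions of mine, not Kohonen's exact settings:

```python
# Sketch: a tiny self-organising map with a 1-D chain lattice and Gaussian neighbourhood.
import numpy as np

def train_som(X, n_units=25, n_iter=5000, seed=0):
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), n_units)].astype(float)   # m_i(0): random initial reference vectors
    grid = np.arange(n_units)                          # positions of the units on the lattice
    for t in range(n_iter):
        x = X[rng.integers(len(X))]                    # present one sample x(t)
        bmu = np.argmin(np.linalg.norm(m - x, axis=1)) # best matching unit (closest m_i)
        frac = t / n_iter
        alpha = 0.5 * (1 - frac)                       # decaying learning rate
        sigma = max(n_units / 4 * (1 - frac), 0.5)     # shrinking neighbourhood width
        h = alpha * np.exp(-((grid - bmu) ** 2) / (2 * sigma**2))  # kernel h(t) over the lattice
        m += h[:, None] * (x - m)                      # m_i(t+1) = m_i(t) + h(t)[x(t) - m_i(t)]
    return m

# Example: the chain of reference vectors unfolds along a noisy circle in 2-D.
rng = np.random.default_rng(1)
theta = rng.uniform(0, 2 * np.pi, 1000)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.normal(size=(1000, 2))
print(train_som(X)[:3])
```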
Laplacian Eigenmaps
Aim to preserve neighbour distances: the cost from the "first" nearest neighbour contributes more to the cost function than that from the "second" nearest neighbour, and so on up to the k-th nearest neighbour used to produce the low-dimensional representation. Large weights correspond to small distances and hence contribute more to the cost function. [Belkin, 2002, 2003] [Maaten, 2007]

Laplacian Eigenmaps
Conceptual algorithm:
1. Construct a weighted neighbourhood graph. Edge weights are computed using a Gaussian kernel, $w_{ij} = e^{-\frac{\|x_i - x_j\|^2}{2\sigma^2}}$, or a heat kernel, $w_{ij} = e^{-\frac{\|x_i - x_j\|^2}{t}}$.
2. Minimize the cost function $F(Y) = \sum_i \sum_j w_{ij} (y_i - y_j)^2$.
In practice the solution is expressed in terms of the graph Laplacian of the weighted neighbourhood graph, which allows expressing the solution of the minimization problem as an eigendecomposition.

Laplacian Eigenmaps
Mathematical formulation. Let $W = \{w_{ij}\}$ be the weighted graph matrix, and let $M = \{m_{ii}\}$ be the degree matrix such that $m_{ii} = \sum_j w_{ij}$. The graph Laplacian is computed as $L = M - W$. The cost function can now be rewritten as
$$F(Y) = \sum_i \sum_j w_{ij} (y_i - y_j)^2 = 2 Y^T L Y$$
The low-dimensional projection is calculated from the eigendecomposition of $L$: the eigenvectors of the generalized problem $Lv = \lambda M v$ are the projection components.

Laplacian Eigenmaps
Particularly suitable when there are clusters (slide figure: Laplacian Eigenmaps vs. PCA for comparison). But when there are no clusters it does not excel: holes emerge in the visualization.

Laplacian Eigenmaps
A local technique with a distance magnification distortion effect: small distances tend to get magnified, which results in a "global" distortion. Laplacian Eigenmaps can be seen as a variant of classical MDS that tries to preserve "commuting" time [[Ham, 2004] in [Venna, 2007]]. Since the weighted neighbourhood graph from which the graph Laplacian is calculated is a sparse matrix, the computation is fast.

Isomap – Isometric Feature Mapping
Seeks to preserve the global geometry of the manifold based on geodesic distances. Geodesic distances are approximated from Euclidean distances as a number of short hops, so the geodesic is "forced" over the manifold hypersurface. Classical MDS is then applied to the geodesic distances. Suitable for convex manifolds. [Tenenbaum, 2000]

Isomap – Isometric Feature Mapping
Pipeline (slide diagram):
1. Start from the data in the feature space.
2. Construct the complete weighted graph with $e_{ij}$ = Euclidean distance.
3. Prune the graph so that only nearest neighbours remain (k neighbours or an ε radius).
4. Compute shortest paths (Dijkstra or Floyd) and reconstruct the complete weighted graph with $\delta_{ij}$ = geodesic distance.
5. Embedding: classical (metric) MDS on the geodesic distances (STRAIN), giving the embedded Euclidean distances $\hat{d}_{ij}$ of the embedded data in the embedded space.
[Tenenbaum, 2000]
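The pipeline above translates almost step by step into code. A minimal sketch, assuming SciPy's graph routines for the shortest paths and implementing the classical MDS (strain) step via double centring; the Swiss roll data and parameter values are illustrative:

```python
# Sketch: a bare-bones Isomap following the pipeline above
# (kNN graph -> shortest paths -> classical MDS on geodesic distances).
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors=10, n_components=2):
    # Steps 1-3. Complete graph pruned to k nearest neighbours (weights = Euclidean distances).
    knn = kneighbors_graph(X, n_neighbors=n_neighbors, mode='distance')
    # Step 4. Geodesic distances as shortest paths over the neighbourhood graph (Dijkstra).
    G = shortest_path(knn, method='D', directed=False)
    # Step 5. Classical MDS (strain): double-centre the squared distances and eigendecompose.
    n = G.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centring matrix
    B = -0.5 * J @ (G ** 2) @ J
    evals, evecs = np.linalg.eigh(B)
    order = np.argsort(evals)[::-1][:n_components]
    return evecs[:, order] * np.sqrt(np.maximum(evals[order], 0))

X, _ = make_swiss_roll(n_samples=600, random_state=0)
Y = isomap(X)
print(Y.shape)   # (600, 2): the unrolled Swiss roll
```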
Isomap – Isometric Feature Mapping
Some Isomap variants:
- Fixed Reference (FR)-Isomap [Lekadir, 2006]: allows consistent embedding and a reduction of the computational cost.
- Supervised (S)-Isomap [Geng, 2005]: a supervised version to tackle topological instability.
- Spatio-Temporal (ST)-Isomap [Jenkins, 2004]: suitable for uncovering data exhibiting temporal coherence. Two flavours: Continuous ST-Isomap, and Segmented ST-Isomap, which uncovers spatio-temporal clusters in segmented data.
- Landmark (L)-Isomap [Silva, 2002]: reduces the computational cost by only embedding the landmarks and then simply locating the rest of the points.
- Conformal (C)-Isomap [Silva, 2002]: aims to deal with certain curved manifolds, such as the fishbowl.

FR-Isomap
Pipeline (slide diagram): from the original dataset in the feature space, select a set of reference points and embed them with Isomap to obtain the reference coordinates in the embedded space; the points not selected as references, and new datasets, are then mapped with respect to those references. [Lekadir, 2006]

S-Isomap
A supervised method which aims to enhance robustness against noise (topological instability). Rather than simply using pairwise Euclidean distances, the neighbourhood graph is constructed according to a similarity designed to integrate class information, i.e. class information guides the embedding. One parameter allows points of the same class to be closer than their simple Euclidean distance would suggest; another parameter accounts for the density of the sampling and is set to the average Euclidean distance over all pairs of points. [Geng, 2005]

ST-Isomap
1. Break the data into temporal blocks; each block is a data point.
2. Compute the nearest neighbour matrix based on Euclidean distance.
3. Locally identify the temporal neighbours (among those which are already neighbours in the feature space).
4. Reduce the distances between temporal neighbours (several possible criteria).
5. Compute the geodesic distances.
6. Apply classical MDS.
[Jenkins, 2004]

C-Isomap
Instead of approximating the geodesic directly from the pairwise Euclidean distances $\|x_i - x_j\|$, it weights these Euclidean distances by the mean distances $M(x)$ of each point to its neighbours:
$$\frac{\|x_i - x_j\|}{\sqrt{M(x_i)\,M(x_j)}}$$
[Silva, 2002]

L-Isomap
1. Select a number of landmarks.
2. Embed those landmarks with plain Isomap (using classical MDS).
3. Embed the rest of the points by using the known distances to the landmarks as constraints (a modified landmark-MDS).
It is somewhat similar to FR-Isomap, but changes the way in which the points which are not references or landmarks are included. [Silva, 2002]

LLE – Locally Linear Embedding
Aims to preserve local neighbourhoods in two steps [Roweis, 2000]:
- Represent each point as a weighted combination of its neighbours.
- Project a map trying to distort each point's neighbourhood as little as possible.

LLE – Locally Linear Embedding
Calculate the reconstruction weights:
1. Compute pairwise distances and find the K nearest neighbours.
2. For each point $x_i$ calculate the optimum reconstruction weights $w_{ij}$ from its neighbours according to the weight cost function
$$\varepsilon(W) = \sum_i \Big\| x_i - \sum_j w_{ij} x_j \Big\|^2$$
subject to the constraints $w_{ij} = 0$ if $x_j$ is not among the nearest neighbours of $x_i$, and $\sum_j w_{ij} = 1$.
Embed the neighbourhoods:
3. Optimize the embedding from the eigenvectors of the cost matrix, by minimizing the embedding cost function
$$\Phi(Y) = \sum_i \Big\| y_i - \sum_j w_{ij} y_j \Big\|^2$$
[Roweis, 2000]

LLE – Locally Linear Embedding
Works for non-convex manifolds [Roweis, 2000] [Maaten, 2007]. It is weak for manifolds with high intrinsic dimensionality or with holes, and it presents a "collapsing" effect.

LLE – Locally Linear Embedding
It requires two optimizations: one for the best reconstruction weights and another for the projection. However, it is efficient because it exploits sparse computation.
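A bare-bones sketch of the two LLE optimizations described above; K, the regularization of the local Gram matrix and the toy dataset are illustrative choices (a practical implementation would also use sparse matrices, which this sketch does not):

```python
# Sketch: minimal LLE (local reconstruction weights, then embedding from bottom eigenvectors).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def lle(X, n_neighbors=10, n_components=2, reg=1e-3):
    n = X.shape[0]
    nbrs = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X)
    _, idx = nbrs.kneighbors(X)
    idx = idx[:, 1:]                                  # drop the point itself
    W = np.zeros((n, n))
    for i in range(n):
        Z = X[idx[i]] - X[i]                          # neighbours centred on x_i
        C = Z @ Z.T                                   # local Gram matrix
        C += reg * np.trace(C) * np.eye(n_neighbors)  # regularize for numerical stability
        w = np.linalg.solve(C, np.ones(n_neighbors))
        W[i, idx[i]] = w / w.sum()                    # constraints: sum_j w_ij = 1, rest zero
    # Embedding: minimize ||y_i - sum_j w_ij y_j||^2 -> bottom eigenvectors of M = (I-W)^T (I-W)
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    evals, evecs = np.linalg.eigh(M)
    return evecs[:, 1:n_components + 1]               # skip the constant (zero-eigenvalue) vector

rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 3 * np.pi, 500))
X = np.c_[t * np.cos(t), t * np.sin(t), rng.uniform(0, 1, 500)]   # a spiral sheet in 3-D
print(lle(X).shape)   # (500, 2)
```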
HLLE – Hessian Eigenmaps, a.k.a. Hessian LLE
Minimizes the "curviness" of the high-dimensional manifold when embedded. A modification of LLE built on the mathematical framework of the Laplacian Eigenmaps and of Broomhead's work on finding the topological dimensionality: it is similar to the Laplacian Eigenmaps, but the Laplacian is substituted by a quadratic form based on the Hessian. [Donoho, 2003]

HLLE – Hessian Eigenmaps
For a point $x_i$ construct a neighbourhood matrix $B_\epsilon(x)$ whose rows are the vectors $v = (x_j - x_i)^T$ such that the $x_j$ are the $\epsilon$-neighbours of $x_i$ ($\|x_j - x_i\| < \epsilon$). Note that this amounts to considering $x_i$ as the origin of a coordinate system and expressing all the neighbours $x_j$ as vectors $v$ "centred" at this origin. For sufficiently small $\epsilon$, $B_\epsilon(x)$ represents the tangent space to the manifold. (The original algorithm applies an SVD here, but that is not strictly necessary.) Based on the Taylor expansion, construct a smooth tangent function
$$f(v) = f(x) + v \frac{\partial f}{\partial x_i} + \frac{v^T v}{2} \frac{\partial^2 f}{\partial x_i^2}$$
where $H = \frac{\partial^2 f}{\partial x_i^2}$ is the Hessian, and the partial derivatives are applied to each tangent component separately. [Donoho, 2003]

HLLE – Hessian Eigenmaps
Construct a matrix from the quadratic symmetric form of the Hessian, $\mathcal{H} = \sum_i \sum_j H_{ij} H_{ji}$, and calculate its eigendecomposition. The eigenvectors are the projection components. [Donoho, 2003]

HLLE – Hessian Eigenmaps
A local technique aimed at non-convex manifolds. For the Swiss roll with a hole (non-convex), the non-convexity causes a strong dilation in Isomap, but HLLE responds reasonably well. [Donoho, 2003]

CCA – Curvilinear Component Analysis
Embeds some landmarks and then interpolates the rest of the points, performing a global unfolding of strongly non-linear, non-convex and closed structures. [Demartines, 1997] [Lee, 2002, 2004] Although the approach is radically different, it is conceptually similar to FR-Isomap or L-Isomap. It only attempts to preserve a subset of the distances (rather than the whole set, as Isomap does). The mapping is invertible: once the map is learned, it can be used both ways.

CCA – Curvilinear Component Analysis
Learning stage (slide diagram): from the original dataset in the feature space, select a set of prototypes by vector quantization (vector quantization consists of partitioning the space into Voronoi regions and representing each region by its generating centroid); then perform a non-linear embedding of the prototype coordinates into the embedded space by minimizing
$$E = \frac{1}{2} \sum_i \sum_{j \neq i} (d_{ij} - \hat{d}_{ij})^2 \, F(\hat{d}_{ij}, \lambda)$$
Points not selected as prototypes, and new points, are mapped by the continuous CCA mapping (interpolation by optimization, via stochastic gradient descent). Several options are available for the weighting function $F(\hat{d}_{ij}, \lambda)$; the parameter λ controls the size of what is considered "local". [Demartines, 1997]

CCA – Curvilinear Component Analysis
Implemented as a two-layer neural network: one layer performs the vector quantization, and the second layer performs the non-linear embedding. It was initially proposed as a continuous improvement of Kohonen's SOM. Sammon's NLM (and possibly the Laplacian Eigenmaps) can be modelled as a particular case by manipulating the weighting function $F(\hat{d}_{ij}, \lambda)$. [Demartines, 1997] [Lee, 2002, 2004]
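A toy sketch of the CCA-style cost above, assuming a simple step-function weighting $F(\hat{d}, \lambda) = 1$ if $\hat{d} \leq \lambda$ and $0$ otherwise, a PCA initialisation, and plain batch gradient descent with $F$ held fixed within each step; Demartines' actual algorithm uses a stochastic per-point update and an annealed λ, so this is only an illustration of the cost, not his rule:

```python
# Sketch: naive batch gradient descent on E = 1/2 * sum_{i!=j} (d_ij - dhat_ij)^2 F(dhat_ij, lambda)
# with a step-function F. A simplification of Demartines' stochastic update.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def cca_toy(X, n_components=2, lam=2.0, lr=0.1, n_iter=300):
    n = X.shape[0]
    D = squareform(pdist(X))                          # d_ij: feature-space distances
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Y = Xc @ Vt[:n_components].T                      # PCA initialisation of the embedding
    for _ in range(n_iter):
        Dhat = squareform(pdist(Y)) + 1e-12           # dhat_ij: embedded-space distances
        F = (Dhat <= lam).astype(float)               # step-function neighbourhood weighting
        np.fill_diagonal(F, 0.0)
        coeff = F * (Dhat - D) / Dhat
        # Gradient of E wrt y_i (up to a constant factor, F held fixed):
        #   sum_j F_ij (dhat_ij - d_ij)/dhat_ij * (y_i - y_j)
        grad = (coeff.sum(axis=1, keepdims=True) * Y - coeff @ Y) / n
        Y -= lr * grad
    return Y

rng = np.random.default_rng(1)
t = rng.uniform(0, 4 * np.pi, 300)
X = np.c_[np.cos(t), np.sin(t), t / 4]                # points along a helix in 3-D
print(cca_toy(X).shape)                               # (300, 2)
```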
CCA – Curvilinear Component Analysis
It has been said to outperform Isomap [Lee, 2002, 2004] [Venna, 2007]. I think [Lee, 2002, 2004] was biasing his results by using non-convex manifolds, for which Isomap is known to fail [Donoho, 2003]; however, [Venna, 2007] seems to reach the same conclusion with different datasets… Convergence to a global minimum has not yet been proved (as far as I know).

Manifold charting
Minimizes the loss of information about data density and location. Topology neutral. Intuition [Brand, 2003]: take a sheet of paper and crumple it into a ball. In order to unfold it, break it into tiny pieces (patches) which have not been affected by any folding (locally linear), and then make a collage with all the tiny patches. First step: estimate the intrinsic dimensionality and neighbourhood size by counting the number of points within a ball of radius r.

Manifold charting
1. Estimate the intrinsic dimensionality using balls of radius r (see the previous slide); this also allows selecting the approximate size of the neighbourhoods (charts).
2. Create a partition of the data set into locally linear neighbourhoods (charts), minimizing the loss in the connection between neighbouring charts. A Gaussian mixture model (GMM), solved with the Bayesian expectation maximization (EM) algorithm, is used for this optimization; the GMM produces a soft partitioning of the dataset into neighbourhoods of mean $\mu_k$ and covariance $\Sigma_k$.
3. Connect (sew) the charts: compute a minimal-distortion merger (connection) of all charts, using weighted least-squares optimization.
[Brand, 2003]

Manifold charting
[Figure from www.merl.com/projects/images/charting.jpg, Brand, 2004]

Manifold charting
Aimed at convex manifolds, but it seems to perform well with the fishbowl (non-convex) [Brand, 2003]. And because it is topology neutral, it is not largely affected by noise…

SNE – Stochastic Neighbour Embedding
Preserves neighbour identities, i.e. the n-th neighbour in the high-dimensional space is still the n-th neighbour in the low-dimensional space (up to some N-th neighbour). Basically, it defines a p.d.f. of the neighbourhood in the high-dimensional space and tries to preserve that p.d.f. as well as possible in the low-dimensional space. It can be seen as a variant of MDS minimizing the Kullback-Leibler divergence. [Hinton and Roweis, 2002]

SNE – Stochastic Neighbour Embedding
At each point $x_i$ define a Gaussian probability (heat kernel based) representing the probability that $x_i$ picks $x_j$ as its neighbour over the rest of the points:
$$p_{ij} = \frac{e^{-\frac{\|x_i - x_j\|^2}{2\sigma_i^2}}}{\sum_{k \neq i} e^{-\frac{\|x_i - x_k\|^2}{2\sigma_i^2}}}$$
[Hinton and Roweis, 2002]

SNE – Stochastic Neighbour Embedding
…and an induced probability function for the low-dimensional images, representing at each image point $y_i$ the probability that $y_i$ picks image $y_j$ as its neighbour over the rest of the image points:
$$q_{ij} = \frac{e^{-\frac{\|y_i - y_j\|^2}{2\sigma_i^2}}}{\sum_{k \neq i} e^{-\frac{\|y_i - y_k\|^2}{2\sigma_i^2}}}$$
[Hinton and Roweis, 2002]

SNE – Stochastic Neighbour Embedding
The cost function to minimise is the sum of Kullback-Leibler divergences between the original and induced distributions over neighbours:
$$C = \sum_i KLD(P_i, Q_i) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$$
This cost function only allows one image per point; to allow for more images, a more cumbersome variant of this cost function is used. SNE uses steepest descent to minimise the cost function. [Hinton and Roweis, 2002]
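A minimal SNE sketch following the formulas above, with two simplifications on my part: a single global σ in the input space rather than per-point σ_i tuned by perplexity, and the fixed-variance low-dimensional kernel of the original paper (so the steepest-descent gradient takes the form $2\sum_j (p_{ij} - q_{ij} + p_{ji} - q_{ji})(y_i - y_j)$ given by Hinton and Roweis):

```python
# Sketch: basic SNE with a global sigma, fixed low-dimensional kernel variance,
# no perplexity calibration, and plain gradient descent.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def conditional_probs(D2, denom):
    """Row-wise p_ij = exp(-D2_ij/denom) / sum_{k!=i} exp(-D2_ik/denom)."""
    P = np.exp(-D2 / denom)
    np.fill_diagonal(P, 0.0)
    return P / np.maximum(P.sum(axis=1, keepdims=True), 1e-12)

def sne(X, n_components=2, sigma=1.0, lr=1.0, n_iter=1000, seed=0):
    rng = np.random.default_rng(seed)
    P = conditional_probs(squareform(pdist(X, 'sqeuclidean')), 2 * sigma**2)
    Y = 1e-3 * rng.normal(size=(X.shape[0], n_components))
    for _ in range(n_iter):
        Q = conditional_probs(squareform(pdist(Y, 'sqeuclidean')), 1.0)
        # dC/dy_i = 2 * sum_j (p_ij - q_ij + p_ji - q_ji) (y_i - y_j)
        M = P - Q + P.T - Q.T
        grad = 2 * (M.sum(axis=1, keepdims=True) * Y - M @ Y)
        Y -= lr * grad
    return Y

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 10)), rng.normal(5, 1, (100, 10))])  # two clusters in 10-D
print(sne(X).shape)   # (200, 2)
```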
SNE – Stochastic Neighbour Embedding
Allows one point in the high-dimensional space to have more than one image in the low-dimensional space, which is particularly interesting at "cut points". It avoids the "collapsing" effect of LLE. It is computationally expensive. [Hinton and Roweis, 2002]

Triangulation method
Sequential mapping of high-dimensional points specifically to a plane. It can be thought of as a 2D view of the minimal spanning tree (MST) of the dataset. It only attempts to preserve exactly three distances for each point; in particular, all distances in the MST are preserved. It emphasizes a particular viewpoint, and the viewpoint selected depends on the chosen root of the constructed MST. No global information is kept. [Lee, 1977]

Triangulation method
1. Create the minimal spanning tree (MST).
2. Select a node (point) as the root.
3. Traverse the tree in pre-order.
4. Project the root to the origin and its first (leftmost) child at the appropriate distance (usually to the right, but not necessarily).
5. For every new node, use the previous two nodes as references to create a triangle and place the new point at one of the two possible intersections.
[Lee, 1977]

Triangulation method
[Figure: MST example]

Distance Preserving Projection
An extension of the triangulation method, not necessarily confined to a plane; the number of distances preserved depends on the dimensionality of the target space. Key observation: the triangle is the simplex of a 2D space, and in a triangle three (2+1) distances are preserved. In an N-dimensional space, the corresponding simplex can preserve exactly N+1 distances. The extension from the triangulation method is trivial.

Some final remarks
[Venna, 2007] presents a comparison based on trustworthiness and continuity:
- A projection is trustworthy if proximate points in the visualization (low dimension) are also proximate in the original space (high dimension).
- A projection is continuous if originally proximate points (high dimension) remain proximate (low dimension).

Some final remarks
[Comparison figure from Venna, 2007]

Some final remarks
[Figure from http://www.math.umn.edu/~wittman/mani/mani_gui2.jpg]
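Both criteria can be computed directly. scikit-learn ships a trustworthiness measure, and, given the symmetric definitions above, continuity can be approximated by swapping the roles of the two spaces in that measure; a minimal sketch, with the dataset, the methods compared and the neighbourhood size being illustrative choices:

```python
# Sketch: compare two embeddings with trustworthiness/continuity in the spirit of [Venna, 2007].
# Continuity is obtained here by swapping the roles of the original and embedded spaces.
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap, trustworthiness

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

embeddings = {
    "PCA": PCA(n_components=2).fit_transform(X),
    "Isomap": Isomap(n_neighbors=10, n_components=2).fit_transform(X),
}

for name, Y in embeddings.items():
    t = trustworthiness(X, Y, n_neighbors=12)   # proximate in Y -> also proximate in X?
    c = trustworthiness(Y, X, n_neighbors=12)   # proximate in X -> also proximate in Y?
    print(f"{name}: trustworthiness={t:.3f} continuity={c:.3f}")
```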