Diffusion Geometries, and multiscale Harmonic Analysis on graphs and complex data sets. Multiscale diffusion geometries, “Ontologies and knowledge building” Ronald Coifman Applied Mathematics Yale university. Conventional nearest neighbor search , compared with a diffusion search. The data is a pathology slide ,each pixel is a digital document (spectrum below for each class ) One of our goals is to report on mathematical tools used in machine learning, document and web browsing, bio informatics, and many other data mining activities. The remarkable observation is that basic Geometric Harmonic Analysis of empirical Markov processes provides a unified mathematical structure which encapsulates most successful methods in these areas. These methods enable global descriptions of objects verifying microscopic relations (like calculus). We relate these ideas to methods of classical Harmonic analysis , like Calderon Zygmund theory in which Fourier analysis and multiscale geometry merge. • This simple point is illustrated below Each puzzle piece is linked to its neighbors ( in feature space ) the network of links forms a sphere. A parametrization of the sphere can be obtained from the eigenvectors of the inference relation (diffusion operator) A simple empirical diffusion matrix A can be constructed as follows Let represent normalized data ,we “soft truncate” the covariance matrix Xi as A0 [ X i X j ] exp{(1 X i X j ) / } Xi 1 A is a renormalized Markov version of this matrix The eigenvectors of this matrix provide a local non linear principal component analysis of the data . Whose entries are the diffusion coordinates These are also the eigenfunctions of the discrete Graph Laplace Operator. A ( X i )l ( X j ) 2 l l ( ( X ), ( X ), ( X ),..) t t t i 1 1 i 2 2 i 3 3 i This map is a diffusion (at time t) embedding into Euclidean space X (t ) The First two eigenfunctions organize the small images which were provided in random order, in fact assembling the 3D puzzle. A two dimensional map created by the Diffusion Map algorithm for 400 MMPI-2 examinees. The distance between two people was measured as the difference between their responses. The color corresponds to the score each examinee received on the depression scale. New subjects need to be placed in this tabulation of responders. The following image indicates that graphs may have clusters at different scales. A very simple way to build a hierarchical multiscale structure is as follows. We define the diffusion distance between two subsets E and F as : 2 dt ( E, F ) kt ( x, y)[ E ( y) F ( y)]dy dx 2 Start by considering small disjoint clusters of nearest neighbors . Form a graph of these clusters where the distance is defined with t=1 . Repeat on the graph of these clusters doubling the time , etc 4 Gaussian Clouds A simple application of signal processing on data ,or data filters is Feature based diffusion algorithms . Given an image, associate with each pixel p a vector v(p) of features . For example a spectrum, or the 5x5 subimage centered at the pixel ,or any combination of features . Define a Markov filter as Ap ,q exp( v( p) v(q) exp( v( p) v(q) 2 /) 2 /) q The various powers of A or polynomials in A provide filters which account for feature similarity between pixels . Feature diffusion filtering of the noisy Lenna image is achieved by associating with each pixel a feature vector (say the 5x5 subimage centerd at the pixel) this defines a Markov diffusion matrix which is used to filter the image ,as was done in for the spiral in the preceding slide The data is given as a random cloud , the filter organizes the data. The colors are not part of the data