Diffusion Geometries, and Multiscale Harmonic Analysis on Graphs and Complex Data Sets

Multiscale diffusion geometries,
“Ontologies and knowledge building”
Ronald Coifman
Applied Mathematics, Yale University
Conventional nearest-neighbor search compared with a diffusion search. The data is
a pathology slide; each pixel is a digital document (the spectrum for each class is shown below).
One of our goals is to report on mathematical tools used in
machine learning, document and web browsing, bioinformatics,
and many other data mining activities.
The remarkable observation is that basic Geometric
Harmonic Analysis of empirical Markov processes
provides a unified mathematical structure which
encapsulates most successful methods in these areas.
These methods enable global descriptions of objects
that satisfy microscopic relations (as in calculus).
We relate these ideas to methods of classical harmonic
analysis, such as Calderón-Zygmund theory, in which Fourier
analysis and multiscale geometry merge.
This simple point is illustrated below.
Each puzzle piece is linked to its neighbors (in feature
space); the network of links forms a sphere. A
parametrization of the sphere can be obtained from the
eigenvectors of the inference relation (diffusion operator).
A simple empirical diffusion matrix A can be constructed as follows.
Let $X_i$ represent normalized data, $\|X_i\| = 1$. We “soft truncate” the covariance
matrix as
$$A_0 = [X_i \cdot X_j]_\varepsilon = \exp\{-(1 - X_i \cdot X_j)/\varepsilon\}$$
A is a renormalized Markov version of this matrix.
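The construction can be sketched in a few lines of NumPy. This is a minimal sketch, not the author's implementation: the function name `diffusion_matrix`, the random test data, and the choice of row normalization as the "Markov renormalization" are assumptions made for illustration.

```python
import numpy as np

def diffusion_matrix(X, eps):
    """Row-stochastic matrix built from the soft-truncated covariance."""
    X = np.asarray(X, dtype=float)
    # Normalize each data point to unit length, so X_i . X_j <= 1.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Soft truncation: A0_ij = exp(-(1 - X_i . X_j) / eps).
    A0 = np.exp(-(1.0 - X @ X.T) / eps)
    # Assumed renormalization: divide each row by its sum so rows sum to 1.
    return A0 / A0.sum(axis=1, keepdims=True)

# Hypothetical data: 50 random points in 3 dimensions.
rng = np.random.default_rng(0)
points = rng.normal(size=(50, 3))
A = diffusion_matrix(points, eps=0.1)
```

A small `eps` keeps only links between nearly parallel data vectors; a large `eps` links every point to every other point almost uniformly.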
The eigenvectors of this matrix provide a local nonlinear principal
component analysis of the data; their entries are the diffusion coordinates.
These are also the eigenfunctions of the discrete graph Laplace operator.
$$A = \sum_l \lambda_l^2 \, \varphi_l(X_i) \, \varphi_l(X_j)$$
$$X_i^{(t)} = \left( \lambda_1^t \varphi_1(X_i),\ \lambda_2^t \varphi_2(X_i),\ \lambda_3^t \varphi_3(X_i),\ \ldots \right)$$
This map is a diffusion (at time t) embedding into Euclidean space.
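A hedged numerical sketch of this embedding follows. It assumes a symmetric normalization of the soft-truncated matrix (so that the eigenvectors are orthonormal and the eigenvalues are nonnegative); the name `diffusion_map` and the sample data are hypothetical. With this choice, inner products of the embedded points reproduce the entries of the t-th power of the matrix.

```python
import numpy as np

def diffusion_map(X, eps, t=1):
    """Return (S, lam, coords), where column l of coords is lam_l**t * phi_l."""
    X = np.asarray(X, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    A0 = np.exp(-(1.0 - X @ X.T) / eps)
    d = A0.sum(axis=1)
    # Assumed symmetric normalization: S = D^{-1/2} A0 D^{-1/2}.
    S = A0 / np.sqrt(np.outer(d, d))
    mu, phi = np.linalg.eigh(S)            # eigenvalues in ascending order
    order = np.argsort(mu)[::-1]           # reorder to descending
    mu = np.clip(mu[order], 0.0, None)     # remove tiny negative round-off
    phi = phi[:, order]
    lam = np.sqrt(mu)                      # so that S = sum_l lam_l^2 phi_l phi_l^T
    coords = (lam ** t) * phi              # row i is the embedded point X_i^(t)
    return S, lam, coords

rng = np.random.default_rng(1)
pts = rng.normal(size=(40, 3))
S, lam, coords = diffusion_map(pts, eps=0.5, t=2)
# Drop the first (trivial) coordinate and keep the next two for a 2D plot.
embedding_2d = coords[:, 1:3]
```

Keeping only the first few columns of `coords` gives a low-dimensional map, since the factors `lam_l**t` decay quickly for large `l` or large `t`.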
The first two eigenfunctions organize the small images, which were
provided in random order, in effect assembling the 3D puzzle.
A two-dimensional map created by the Diffusion Map algorithm for
400 MMPI-2 examinees.
The distance between two people was measured as the difference
between their responses. The color corresponds to the score each
examinee received on the depression scale. New subjects need to be
placed in this tabulation of responders.
The following image indicates that graphs may have
clusters at different scales.
A very simple way to build a hierarchical multiscale
structure is as follows.
We define the diffusion distance between two subsets E
and F as:
$$d_t^2(E, F) = \int \left| \int k_t(x, y)\,[\chi_E(y) - \chi_F(y)]\,dy \right|^2 dx$$
Start by considering small disjoint clusters of nearest
neighbors. Form a graph of these clusters, where the
distance is defined with t = 1. Repeat on the graph of
these clusters, doubling the time, and so on.
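On a finite graph the integrals become sums, and the distance between two subsets can be sketched as below. This is an illustration under stated assumptions: the kernel at time t is taken to be the t-th power of a row-stochastic matrix K, and the toy graph, the subsets E, F, G, and the name `diffusion_distance` are hypothetical.

```python
import numpy as np

def diffusion_distance(K, E, F, t=1):
    """Discrete d_t(E, F) for index sets E and F on a graph with kernel K."""
    n = K.shape[0]
    chi_E = np.zeros(n)
    chi_F = np.zeros(n)
    chi_E[list(E)] = 1.0                   # indicator function of E
    chi_F[list(F)] = 1.0                   # indicator function of F
    Kt = np.linalg.matrix_power(K, t)      # assumed: k_t is the t-step kernel
    return np.linalg.norm(Kt @ (chi_E - chi_F))

# Toy graph: two groups of 4 nodes, strong links inside a group (weight 1),
# weak links across groups (weight 0.01).
W = np.full((8, 8), 0.01)
W[:4, :4] = 1.0
W[4:, 4:] = 1.0
K = W / W.sum(axis=1, keepdims=True)

E, F, G = [0, 1], [2, 3], [4, 5]
for t in (1, 2, 4):                        # doubling the time at each level
    print(t, diffusion_distance(K, E, F, t), diffusion_distance(K, E, G, t))
```

In this toy graph, E and F lie in the same group and are at a much smaller diffusion distance than E and G, which lie in different groups. That is the property the hierarchical construction relies on when it merges clusters.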
4 Gaussian Clouds
A simple application of signal processing on data, or data filters,
is feature-based diffusion algorithms.
Given an image, associate with each pixel p a vector v(p) of
features: for example a spectrum, the 5x5 subimage centered
at the pixel, or any combination of features. Define a Markov
filter as
$$A_{p,q} = \frac{\exp\left(-\|v(p) - v(q)\|^2/\varepsilon\right)}{\sum_{q'} \exp\left(-\|v(p) - v(q')\|^2/\varepsilon\right)}$$
The various powers of A, or polynomials in A, provide filters
which account for feature similarity between pixels.
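A minimal sketch of such a filter on a small synthetic image is given below. The helper names `patch_features` and `markov_filter`, the reflective padding at the image border, the value of `eps`, and the test image are all assumptions made for illustration. The dense matrix has one row per pixel, so this form is only practical for small images.

```python
import numpy as np

def patch_features(img, radius=2):
    """Feature vector v(p): the (2*radius+1)^2 subimage centered at pixel p."""
    padded = np.pad(img, radius, mode="reflect")   # assumed border handling
    h, w = img.shape
    size = 2 * radius + 1
    feats = np.empty((h * w, size * size))
    for i in range(h):
        for j in range(w):
            feats[i * w + j] = padded[i:i + size, j:j + size].ravel()
    return feats

def markov_filter(feats, eps):
    """A[p, q] = exp(-|v(p) - v(q)|^2 / eps), normalized so each row sums to 1."""
    sq = np.sum(feats ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * feats @ feats.T
    d2 = np.maximum(d2, 0.0)                       # guard against round-off
    W = np.exp(-d2 / eps)
    return W / W.sum(axis=1, keepdims=True)

# Hypothetical test image: a vertical step edge plus Gaussian noise.
rng = np.random.default_rng(0)
clean = np.zeros((16, 16))
clean[:, 8:] = 1.0
noisy = clean + 0.2 * rng.normal(size=clean.shape)

A = markov_filter(patch_features(noisy), eps=1.0)
denoised = (A @ noisy.ravel()).reshape(noisy.shape)
```

Applying the filter twice, `A @ (A @ noisy.ravel())`, corresponds to the second power of A; other polynomials in A can be applied the same way without forming the matrix powers explicitly.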
Feature diffusion filtering of the noisy Lenna image is
achieved by associating with each pixel a feature vector
(say the 5x5 subimage centered at the pixel). This defines
a Markov diffusion matrix which is used to filter the
image, as was done for the spiral in the preceding
slide.
The data is given as a random cloud; the filter organizes the
data.
The colors are not part of the data.