A very short introduction

A tiny review on Manifold
Embedding techniques
Felipe Orihuela-Espina
References are only examples of sources that I have used. They do not always correspond to the original publication on a topic, nor are they necessarily the best ones.
I.- Mathematical Background
Topological Space

• A topological space is a set of points $X = \{x_i \mid i = 1, \ldots, n\}$ together with a collection of subsets $T = \{S_j \subseteq X\}$ satisfying the following axioms:
  - The empty set $\emptyset$ and $X$ are in $T$: $\emptyset \in T \wedge X \in T$
  - The union of an arbitrary collection of sets in $T$ is also in $T$: $\bigcup_i S_i \in T$
  - The intersection of a finite collection of sets in $T$ is also in $T$: $\bigcap_{i=1}^{k} S_i \in T$
• The set $T$ is called the topology of $X$.
• Basically, a topological space is a geometric object, and a topology is a structure imposed on it.
[Wolfram, World of Maths]
Manifold (I)

• A manifold is a topological space that is locally Euclidean.
• The concept of a manifold is the generalisation of the traditional Euclidean (linear) space to accommodate non-Euclidean topologies.
• Note that "locally Euclidean" does not mean that the manifold is constrained to a Euclidean metric globally, only that it is locally homeomorphic to a Euclidean space.
• In other words, a manifold is an object placed in an n-dimensional ambient space.
• A k-dimensional manifold is a submanifold with k degrees of freedom, i.e. one that can be described with only k coordinates.
Manifold (II)

• If the manifold is infinitely differentiable then it is called a smooth manifold.
• A smooth manifold with a metric imposed on it to induce the topology is called a Riemannian manifold.
• A submanifold is a subset of a manifold which is itself a manifold.
[Wolfram, World of Maths]
[Carreira-Perpiñán, 1997]
Homeomorphism and diffeomorphism

• A homeomorphism is a continuous bijective transformation between topological spaces $X$ and $Y$: $f: X \rightarrow Y$
  - The fact that it is continuous means that points which are close in $X$ are also close in $Y$, and points which are far apart in $X$ are also far apart in $Y$.
  - The fact that it is bijective (or one-to-one) means that it is injective and surjective, which also implies that the inverse $f^{-1}: Y \rightarrow X$ exists.
• If the homeomorphism is differentiable, i.e. if the derivative exists and so does that of its inverse, then it is called a diffeomorphism.
[Wolfram, World of Maths]
Figure from Wikipedia
Embedding

• An embedding is a map $f: X \rightarrow Y$ such that $f$ is a diffeomorphism from $X$ to $f(X)$, and $f(X)$ is a smooth submanifold of $Y$.
• An embedding is the representation of a topological object (e.g. a manifold, graph, lattice, etc.) in a certain (sub-)space so that its topology is preserved.
  - In particular, for manifolds, it preserves the open sets in $T$.
[Bonatti, 2006]
Summarizing…

• A manifold is any object which is locally linear (flat).
• An embedding is a function from one space to another such that the topology (shape) is preserved through deformations (twisting and stretching).
II.- Manifold embedding
Manifold Embedding

• Dimensionality reduction is a particular case of manifold embedding, in which the dimension of the destination space is lower than that of the original data space.
  - Domain-specific data are often distributed on (lie on, or close to) a low-dimensional manifold in a high-dimensional space [Yang, 2004].
  - Topology or structure is retained/preserved if the pairwise distances in the low-dimensional space approximate the corresponding pairwise distances in the feature space [Sammon, 1969].
Manifold Embedding

• The problem of dimension reduction, or data embedding, has been defined in several similar ways:
  - The search for a low-dimensional manifold that embeds the high-dimensional data [Carreira-Perpiñán, 1997]
  - Finding/recovering meaningful low-dimensional structures hidden in high-dimensional data [Tenenbaum, 2002]
  - Detecting and identifying inherent "structure" (i.e. clusters or relationships between vectors) [Sammon, 1969]
  - …
Manifold Embedding

• In data embedding, there are methods for:
  - Estimating the intrinsic dimensionality of the data, without actually projecting the data.
  - Generating a lower dimensional configuration by means of a projection (data projection methods).
Manifold Embedding: Variants

• Multiple manifold embedding
  - Data lie on more than one manifold.
• Multi-class manifold embedding
  - Data lie on a single manifold, but the sampling contains large gaps, perhaps even fragmenting connected components.
Manifold Embedding: Nomenclature

• Manifold embedding is also called
  - Manifold learning [Souvernir, 2005]
  - Multivariate data projection [[Mao, 1995] in Demartines, 1997] or simply projection [Venna, 2007]
• The origin space is sometimes called:
  - High dimensional (input) space [Tenenbaum, 2000][Demartines, 1997][Venna, 2007]
  - Vector space [Roweis, 2000][Sammon, 1969][Brand, 2003]
  - Data space [Souvernir, 2005]
  - Observation space [Silva, 2002]
  - Domain space [Yang, 2004, 2005]
  - Feature space (usually in the context of pattern analysis)
• The destination space is usually more consistently called
  - Low-dimensional space
  - But other names include output space [Demartines, 1997][Venna, 2007], and I personally like… embedded space [Leff, 2007]
Manifold Embedding: Applications

• Some applications of manifold embedding / dimensionality reduction are:
  - Regression and smoothing in statistics
  - Data compression and coding in information theory
  - Visualization and representation of data in general
  - Feature extraction in pattern analysis
  - Determination of latent variables in causal models
  - Complexity reduction in algorithmics
  - Data exploration in statistics
  - A step prior to clustering in machine learning
III.- Estimating intrinsic dimensionality without projection
Intrinsic dimensionality (ID)

• The intrinsic dimensionality (ID) of a manifold has been defined as "the number of independent variables that explains satisfactorily" that manifold.
  - Determination of the ID eliminates the possibility of over- or under-fitting.
  - Since it is always possible to find a manifold of any dimension which passes through all the points in a data set given enough parameters, the problem of estimating the ID of a dataset is ill-posed in the Hadamard sense.
    · Note that this is the case in interpolation, which finds a 1-D curve to fit a dataset!
[Carreira-Perpiñán, 1997]
Figure modified from [Carreira-Perpiñán, 1997]
Intrinsic dimensionality vs. topological dimension

• Topological dimension is the "local" dimensionality at every point
  - i.e. the dimension of the tangent space.
  - The topological dimension is a lower bound of the ID.
• Example: sphere
  - ID: 3
  - Topological dimension: 2 (at every point the sphere can be approximated by a surface)
[Camastra, 2003]
Examples of methods for estimating the intrinsic dimensionality of data (without projection)

• Bennet's algorithm [Bennet, 1969]
• Fukunaga and Olsen's algorithm [Fukunaga et al., 1971]
  - Local eigenvalue estimator [Verveer et al., 1995]
  - Bruske and Sommer's work based on topology preserving maps [Bruske et al., 1998]
• Trunk's statistical approach (near neighbour techniques) [Trunk, 1968] [[Trunk, 1976] in [Camastra, 2003]]
  - Pettis' algorithm: adds the assumption of uniformly distributed sampling to derive a simple expression.
  - Near neighbour estimator [Verveer et al., 1995]
• Fractal based methods [review by Camastra, 2003]
• Broomhead's topological dimension of a time series [Broomhead, 1987]
Bennet’s algorithm


Let’s define the probability density function of the interpoint
distances
Observation: A displacement/perturbation in the dataset which
increment variance in interpoint distances tend to reduce the
dimensionality of the data set

[Bennet, 1969]
A perturbation of this kind is reduce the small distances (those
smaller than the mean interpoint distance) and increase the
large distances.
20
Bennet’s algorithm
1. dataset  original data
2. Repeated until change in variance is smaller than a given
threshold
1. Increase the variance of the dataset (dataset+=Δ)
2. For each point

Restore the ranking order of local distances
3. Calculate PCA on the obtained configuration
1. ID is the number of non-zero eigenvalues.
Noise can make that no eigenvalues is zero so
in practice a threshold is needed
[Bennet, 1969]
21
Fukunaga and Olsen's algorithm

• Observation: for vectors embedded in a linear subspace, the dimension is equal to the number of non-zero eigenvalues of the covariance matrix.
  - This is PCA… which is step 3 of Bennet's algorithm!
• Idea: apply PCA locally.
  - The original formulation is based on a Taylor expansion of small subregions and the calculation of PCA in each small region.
[Fukunaga, 1971]
[Camastra, 2003]
Fukunaga and Olsen's algorithm

1. Divide the data set into a number of small subsets or hyperspherical regions (neighbourhoods) containing a fixed number of points.
2. Compute the Taylor expansion of the local neighbourhoods.
3. Count the number of significant terms (i.e. different order derivatives) in the expansion.
   - This step is actually solved by applying PCA to each small region and counting the eigenvalues larger than a threshold. This count is close to the local ID.
   - This formulation is not robust to noise. Noise can mean that no eigenvalue is exactly zero, so a threshold is needed.
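As an illustration of the local-PCA idea behind this estimator, the sketch below splits the data into small neighbourhoods, runs PCA in each one, and counts the eigenvalues above a threshold. The neighbourhood size and the relative threshold are assumptions of this illustration, not values from the original paper.

```python
# Hedged sketch: local PCA intrinsic-dimensionality estimates, one per point.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_id_estimates(X, n_neighbors=15, rel_threshold=0.05):
    """X: (n_samples, n_features). Returns a local ID estimate per point."""
    nbrs = NearestNeighbors(n_neighbors=n_neighbors).fit(X)
    _, idx = nbrs.kneighbors(X)
    ids = []
    for neighbourhood in idx:
        local = X[neighbourhood] - X[neighbourhood].mean(axis=0)
        # Eigenvalues of the local covariance via the SVD of the centred block.
        eigvals = np.linalg.svd(local, compute_uv=False) ** 2
        # Count eigenvalues larger than a fraction of the largest one.
        ids.append(int(np.sum(eigvals > rel_threshold * eigvals[0])))
    return np.array(ids)
```

Reporting the distribution of these local estimates, rather than a single number, mirrors the "summary table" style of answer the algorithm produces.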
Fukunaga and Olsen's algorithm

• The partition of the dataset does not necessarily need to be complete.
• Step 1 of the algorithm permits two variants:
  1. The variant with overlapping regions is more representative of the data.
  2. The variant without overlapping regions is faster.
• A particularity is that a single final answer is not given; instead it produces summary tables with "local" answers.
  - These give upper and lower bounds (a range) for the dimensionality.
• Neighbourhood sizes and the threshold value are difficult to choose.
Fukunaga and Olsen's algorithm: a slightly different formulation with the Optimally Topology Preserving Map (OTPM)

1. Compute the Voronoi tessellation so that regions can be considered locally linear (using the LBG vector quantization algorithm).
   1. Construct a graph G corresponding to the induced Delaunay triangulation (the optimally topology preserving map, OTPM).
2. For each Voronoi cell:
   1. Compute the difference vectors from each local point to its generating vector.
   2. Apply PCA to the set of difference vectors.
   3. The local ID is the number of eigenvalues larger than a given threshold.
[Bruske and Sommer, 1998]
Image simulated with VoroGlide 2.0
Verveer and Duin’s Local Eigenvalue Estimator

A modification to Fukunaga’s algorithm to
automatically choose the threshold.

Requires high sampling of the manifold

Assumes uniform sampling density in the local
neighburhoods
[Verveer, 1995]
26
Trunk’s statistical approach (near neighbour)

Iteratively look for the more likely local dimensionality
examining invariant statistics.

Computes the topological dimensionality from the
distribution of distances.
 Involves many ad-hoc assumptions:
 Linearity and independence
 All distance ratios and angles are independent
 Ad-hoc density distribution for the angles
 The correct answer for ID is not guaranteed! ([Trunk, 1968:pg 519])
[Trunk, 1968]
[Camastra, 2003]
27
Trunk’s statistical approach (near neighbour):
Iterative variation for noisy data
1. Initialise k
2. Repeat
1. Construct the k-NN graph
2. For each point i
1. Find the (k+1)th nearest neighbour of point i
2. i  Calculate the angle between the (k+1)th nearest
neighbour and the k neighbourhood “flat” hypersurface
3. avgAngle  mean(i)
4. If (avgAngle > )
1. Increment k
3. until (avgAngle < )
[[Trunk,1976] in[Camastra, 2003]
28
Trunk’s statistical approach (near neighbour)

How to choose  is not clear.
 At each iteration, computed k neighbourhoods are assumed
locally linear
 Iterative description is only conceptual
 The iterative algorithm is not computationally efficient.
 Trunk’s original statistical approach was not iterative;
much more efficient, but also more complicated and
sensitive to noise
[Trunk, 1968]
[Camastra, 2003]
29
Verveer and Duin’s Near Neighbour Estimator

A non-iterative approximation to Trunk’s algorithm for noisy
data, in which ID is directly calculated by a derived
(cumbersome) formula, namely the near neighbour estimator.

[Verveer, 1995]
[Camastra, 2003]
A correction to Pettis’ algorithm (which they show lead to
incorrect answer)

The result is a real value, so it must be rounded to nearest
integer.

It usually underestimate the ID (when the real ID is high)

Sensitive to noise, outliers, edge effect, sampling density and
distribution, etc…
30
Broomhead’s topological dimension of a time-series

Observation: For a small neighbourhood of any
point x, the effects of a curvature becomes
unimportant (the manifold is by definition locally
linear), and therefore the manifold can be well
approximated by its tangent space.
[Broomhead, 1987]
31
Broomhead’s topological dimension of a time-series

For a point x construct a neighbourhood matrix B e (x )
T
whose rows are vectors (x j - x ) such that x j are the e
-neighbours ( x j - x < e) of x

Note that this is similar to consider x as the origin of
a coordinate system and expressing all the neighbours x j
as vectors “centered” at this origin.

For sufficiently small e , B e (x ) represents the tangent
vector to the manifold, i.e. the tangent space.

The rank of B e (x ) is the dimension of the manifold.

The rank (number of linearly independent rows)
of a matrix is the dimension of that matrix
[Broomhead, 1987]
32
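A minimal sketch of this rank idea is given below, assuming the data points are rows of X and that eps defines the neighbourhood; the tolerance passed to matrix_rank stands in for the decision of which singular values count as zero.

```python
# Hedged sketch: local topological dimension as the rank of B_eps(x).
import numpy as np

def local_topological_dimension(X, x, eps, tol=1e-6):
    diffs = X - x                                  # vectors (x_j - x)
    mask = np.linalg.norm(diffs, axis=1) < eps     # keep the eps-neighbours of x
    B = diffs[mask]                                # neighbourhood matrix B_eps(x)
    return np.linalg.matrix_rank(B, tol=tol)       # dimension of the tangent space
```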
Broomhead’s topological dimension of a time-series
The construction of the tangent
matrix B e (x ) , is also used by



Laplacian Eigenmaps
and Hessian eigenmaps.
How the manifold is created:


For us, patterns (points in the
manifold) are columns, i.e. 1
signal with all its time samples
For Broomhead’s patterns are
rows, i.e. 1 time sample across
all signals or observed
variables.
Time samples or observations

Signals/subjects/variables
[Broomhead, 1987]
33
IV.- Data Projection Methods
Figure from http://people.hofstra.edu/Stefan_Waner/diff_geom/pics/Chart.gif
Data Projection Methods

• They can be coarsely classified as:
  - Linear vs. non-linear
  - Global vs. local vs. topology neutral
• There is no single BEST method for all cases.
  - Different techniques deal with different types of manifolds.
Data Projection Methods: Common things

• The manifold is always considered to be placed in an "ambient" Euclidean space $\mathbb{R}^n$.
• Most of the methods require a priori knowledge (or estimation) of the intrinsic dimensionality.
• Virtually every projection solves an optimization problem.
  - The embedded solution is not unique.
  - The embedded solution normally implies a deformation.
Data Projection Methods: Common things

• A number of techniques require the construction of an (undirected, weighted) neighbourhood graph, using either
  - k nearest neighbours, or
  - an ε-radius neighbourhood.
• Often projections are based on accepting the first n eigenvectors of the eigendecomposition of a matrix derived from distances.
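As an illustration of the two neighbourhood-graph constructions, the sketch below uses scikit-learn's graph builders; the values of k and the radius are illustrative assumptions.

```python
# Hedged sketch: the two usual neighbourhood-graph constructions.
import numpy as np
from sklearn.neighbors import kneighbors_graph, radius_neighbors_graph

X = np.random.rand(500, 10)                                      # toy data
G_knn = kneighbors_graph(X, n_neighbors=10, mode='distance')     # k-NN graph
G_eps = radius_neighbors_graph(X, radius=0.5, mode='distance')   # eps-radius graph
# Both are sparse matrices whose non-zero entries are edge weights (Euclidean
# distances), ready for shortest-path or graph-Laplacian computations.
```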
IV.a.- Linear Data Projection Methods
Examples of Data projection methods: Linear

• PCA (Principal Component Analysis) [LOTS!!]
• MDS (Multidimensional Scaling, a.k.a. Principal Coordinate Analysis) (LOTS!! – [Kruskal, 1974][Cox, 1994])
• ICA (Independent Component Analysis) [Comon, 1994]
• CCA (Canonical Correlation Analysis) [Friman, 2002]
• PP (Projection Pursuit) [Carreira-Perpiñán, 1997]
PCA – Principal Component Analysis

• PCA rotates the axes of the original space so that the new axes maximise the variance captured by each component.
• As a side effect, discarding the least significant components (those with minimum variance) results in a dimension reduction with minimal loss of information.
(Figure: 2D representation of the PCA transformation. Figure from [web.media.mit.edu])
PCA – Principal Component Analysis

• Let $X_{K \times M}$ be the dataset with $M$ vectors as columns and $K$ observations or dimensions as rows. First we centre (mean-correct) the data:
  $X' = X - E\{X\} \Rightarrow E\{X'\} = 0$
• Define a matrix $T = \frac{1}{\sqrt{m-1}} X'^T$ such that the covariance matrix of the dataset can be expressed as:
  $\mathrm{Cov}_{XX} = \frac{X' X'^T}{m-1} = T^T T$
• The columns of $V$ in the SVD of $T = U S V^T$ are basically the eigenvectors of $X' X'^T$, i.e. of the covariance matrix, which are the principal components of $X'$.
[Schlens, 2005], [Fodor, 2002]
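A minimal numpy sketch of this SVD route to PCA follows, assuming (as in the slide) that dimensions are rows and observations are columns; the toy data and number of components are illustrative.

```python
# Hedged sketch: PCA via the SVD of T = X'^T / sqrt(M - 1).
import numpy as np

def pca_svd(X, n_components):
    Xc = X - X.mean(axis=1, keepdims=True)      # centre each dimension (row)
    T = Xc.T / np.sqrt(X.shape[1] - 1)          # so that T^T T = Cov(X)
    U, S, Vt = np.linalg.svd(T, full_matrices=False)
    components = Vt[:n_components]              # principal directions (rows)
    projected = components @ Xc                 # low-dimensional coordinates
    return components, projected, S**2          # S**2 are the component variances

X = np.random.rand(5, 100)                      # 5 dimensions, 100 observations
components, Y, variances = pca_svd(X, 2)
```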
PCA – Principal Component Analysis

• Possibly the most well known and widely applied technique.
• Assumes linearity and orthogonality between the components.
• PCA is invariant to both translation and rotation of the dataset.
• PCA tends to overestimate the intrinsic dimensionality (ID).
MDS – Multidimensional Scaling

• MDS finds the best spatial representation in a k-dimensional space for a cloud of points, given the pairwise distances $d_{ij}$ between them in the original space.
• The aim is to provide a map which minimises the discrepancy between the distances $d_{ij}$ in the original space and the distances $\hat{d}_{ij}$ in the destination space, thus maximizing the quality of the mapping.
[Kruskal, 1978], [Cox, 1994]
Figure reproduced from [Cox, 1994]
MDS – Multidimensional Scaling

• MDS comprises a number of variants with different:
  - Cost functions → determine the output configuration
  - Optimization algorithms → define the computational procedure
• MDS variants can be split into:
  - Metric → distances in the embedded space are related linearly to distances in the feature space
  - Non-metric → distances in the embedded space are only related to distances in the feature space by a monotonically increasing function
[Kruskal, 1978], [Cox, 1994]
MDS – Multidimensional Scaling

• According to the number of difference (distance) criteria used, MDS is:
  - Two-way → a single measure
  - Three-way → multiple measures
[Arabie, 1987], [Cox, 1994]
MDS – Multidimensional Scaling

• Classical MDS
  - Is a particular case of two-way metric MDS.
  - Uses the cost function known as strain:
    $\mathrm{strain} = J'(\hat{D}^2 - D^2)J$, with $D = \{d_{ij}\}$, $\hat{D} = \{\hat{d}_{ij}\}$, and $J$ the centring matrix used in the eigendecomposition.
  - Is a dual form of PCA (Principal Component Analysis) → yields the same solution.
    · This is why classical MDS is a.k.a. Principal Coordinate Analysis.
    · PCA starts from the locations of the points; MDS starts from the distances between points.
[Kruskal, 1978], [Cox, 1994]
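The sketch below shows classical MDS in its usual computational form (double-centre the squared distance matrix, then keep the top eigenvectors); D is assumed to be an (n, n) matrix of pairwise distances and the choice of two output dimensions is illustrative.

```python
# Hedged sketch: classical MDS / principal coordinate analysis.
import numpy as np

def classical_mds(D, n_components=2):
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centring matrix
    B = -0.5 * J @ (D**2) @ J                  # double-centred squared distances
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Coordinates are eigenvectors scaled by the square roots of the eigenvalues.
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))
```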
MDS – Multidimensional Scaling

• Other cost functions include:
  - F-stress, or simply stress:
    $f\text{-stress} = \|\hat{D} - D\|_F^2 = \frac{\sum_i \sum_j (f(d_{ij}) - \hat{d}_{ij})^2}{\text{scale factor}}$, usually with scale factor $= \sum_i \sum_j d_{ij}$
  - F-sstress (squared stress):
    $f\text{-sstress} = \|\hat{D}^2 - D^2\|_F^2$
  - Sammon's cost function:
    $E_{\text{Sammon}} = \frac{1}{\sum_i \sum_j d_{ij}} \sum_i \sum_j \frac{(d_{ij} - \hat{d}_{ij})^2}{d_{ij}}$
    · The use of Sammon's cost function leads to results comparable to Sammon's NLM, despite differences in the algorithm.
MDS – Multidimensional Scaling

• Although commonly classified as linear, MDS can perform as non-linear (depending on the cost function).
  - When using Sammon's cost function it performs similarly to NLM.
• It is the seed for a number of other techniques.
  - E.g. Isomap.
PP – Projection Pursuit

• Automatically picks an interesting low dimensional projection.
  - What is interesting is defined by an objective function called the projection index.
• You can think of PP as rotating a 3D scatter plot on the PC screen to "best" see the structure of the data.
  - PP automatically searches for the most interesting viewpoint for you.
(Figure: in this example, the projection onto the Y-Z plane is more interesting than the projection onto the X-Y plane, since the clusters are easily visible.)
PP – Projection Pursuit

• Let $X$ be the dataset with distribution $F$, and let $A$ be a projection matrix with distribution $F_A$ and component vectors $a_i$.
• The projection index $Q$ (the optimization function) is a real function of the distribution of the projection of the dataset: $Q: F_A \rightarrow \mathbb{R}$
• PP finds a projection direction $a$ which, for a given distribution $F$, produces a local optimum of $Q$.
  - The normal distribution is the least interesting density distribution.
  - To find other local minima of $Q$, run the optimization algorithm again (suppressing the current solution).
[Carreira-Perpiñán, 1997], [Fodor, 2002]
PP – Projection Pursuit

• Particularly effective as a step previous to clustering.
• It is linear and its projections are orthogonal.
• It is computationally expensive.
IV.b.- Non-Linear Data Projection Methods
Examples of Data projection methods: Non-Linear

• Sammon's non-linear mapping (NLM) [Sammon, 1969]
  - GeoNLM [Yang, 2004b]
• Kohonen's self-organising maps (SOM) [Kohonen, 1997], a.k.a. topologically continuous maps, or Kohonen maps
  - Temporal Kohonen maps [Chappell, 1993]
• Laplacian eigenmaps [Belkin, 2002, 2003]
  - Laplacian eigenmaps with fast N-body methods [Wang, 2006]
• PCA based:
  - Non-linear PCA [Fodor, 2002], Kernel PCA [Scholkopf, 1998], Principal Curves [Carreira-Perpiñán, 1997], space partitioning with locally applied PCA [Olsen and Fukunaga, 1973]
Examples of Data projection methods: Non-Linear

• Isomap [Tenenbaum, 2000]
  - FR-Isomap [Lekadir, 2006], S-Isomap [Geng, 2005], ST-Isomap [Jenkins, 2004], L-Isomap [Silva, 2002], C-Isomap [Silva, 2002]
• Locally linear embedding (LLE) [Roweis, 2000]
  - Hessian Eigenmaps, a.k.a. Hessian Locally Linear Embedding [Donoho, 2003]
• Curvilinear Component Analysis [Demartines, 1997]
  - Curvilinear Distance Analysis (CDA) [Lee, 2002, 2004]
Examples of Data projection methods: Non-Linear

• Kernel ICA [Bach, 2003]
• Manifold charting [Brand, 2003]
• Stochastic neighbour embedding [Hinton, 2002]
• Triangulation method [Lee, 1977]
• Tetrahedral methods: distance preserving projection [Yang, 2004]
Examples of Data projection methods: Non-Linear

• Others…
  - Semidefinite embedding (SDE)
    · Minimum Volume Embedding [Shaw, 2007]
  - Conformal Eigenmaps [Maaten, 2007]
    · Maximally angle preserving
    · A variant of LLE
  - Maximum Variance Unfolding (MVU) [Maaten, 2007]
  - Diffusion Maps (DM)
    · Based on a Markov random walk on the high dimensional graph to get a measure of proximity between the data.
Sammon’s NLM – Non Linear Mapping

Seeks to preserve the
structure (clusters and non
linear relationships)

Based on the famous
Sammon’s cost function

[Sammon, 1969]
Favour better mapping of
smaller (local) distances.
57
Sammon’s NLM – Non Linear Mapping
1. Compute all pairwise distances dij in the data space
2. Generate an initial low dimension configuration Y k´ N

Randomly, using PCA or other
3. Compute all pairwise distances d ij in the low dimensional space
for Y k´ N
4. Iteratively optimize Y k´ N (hence change D = {dij } ) so to
minimize the cost function
E (Y ) =
1
å å
i

[Sammon, 1969]
dij
å å
i
j
(dij - dij )2
dij
j
NLM uses steepest descent procedure for optimization
58
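A compact sketch of this procedure is given below: it starts from a PCA configuration and minimises Sammon's stress. The original paper uses steepest descent; handing the cost to a generic scipy optimiser is an assumption of this illustration, chosen for brevity.

```python
# Hedged sketch: Sammon mapping by direct minimisation of Sammon's stress.
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import pdist

def sammon(X, n_components=2):
    d = pdist(X)                       # pairwise distances in the data space
    # Note: duplicate points (d == 0) would break the cost, as the slides warn.
    scale = d.sum()
    Xc = X - X.mean(axis=0)            # initial configuration from PCA
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Y0 = Xc @ Vt[:n_components].T

    def stress(y_flat):
        dy = pdist(y_flat.reshape(-1, n_components))
        return np.sum((d - dy)**2 / d) / scale

    res = minimize(stress, Y0.ravel(), method='L-BFGS-B')
    return res.x.reshape(-1, n_components)
```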
Sammon’s NLM – Non Linear Mapping



Highly efficient at identifying hyperspherical/
hyperellipsoidal structures.
Points close too close in the input space dij » 0
may badly disturb cost function
Has strong similarities with MDS.

As argued by Sammon, “mathematical formulation
are similar, [but] the underlying mapping criterions
are quite different”.
[Sammon, 1969],[Demartines,1997]
59
Kohonen’s SOM – Self Organising Maps

Map a given input dataset to a discretized lattice
(grid) of given shape regardless of the actual
shape of the manifold.

Linked to a Neural Network implementation

Detailed final equations have proved to be difficult,
and only exist for extremely simplified cases.
I really do not understand this very well to
be honest…
[Kohonen, 1990]
60
Kohonen’s SOM – Self Organising Maps

Let X = {x i (t )} be the high dimensional data set, and the low
dimensional reference vectors upon a lattice M = {m i (t )}

Initially select m i (0) randomly

Iteratively,


[Kohonen, 1990]
Link x i (t ) to the closest matching m i (t )
Update m i (t + 1) = m i (t ) + h(t )[x (t ) - m i (t )] where h (t ) is a
kernel function
61
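The sketch below illustrates only this update rule. The kernel $h(t)$ is taken here as a learning rate combined with a Gaussian neighbourhood on the lattice that shrinks over time, which is one common choice and not necessarily Kohonen's exact schedule; the grid size, rate and radius are illustrative assumptions, and the data are assumed scaled to [0, 1].

```python
# Hedged sketch: the SOM update loop m_i(t+1) = m_i(t) + h(t)[x(t) - m_i(t)].
import numpy as np

def som_train(X, grid_shape=(10, 10), n_iter=1000, lr=0.5, radius=3.0):
    rng = np.random.default_rng(0)
    rows, cols = grid_shape
    M = rng.random((rows * cols, X.shape[1]))              # reference vectors m_i(0)
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    for t in range(n_iter):
        x = X[rng.integers(len(X))]                        # present one sample x(t)
        winner = np.argmin(np.linalg.norm(M - x, axis=1))  # closest matching unit
        d2 = np.sum((coords - coords[winner])**2, axis=1)  # lattice distances to winner
        h = lr * np.exp(-d2 / (2 * radius**2)) * (1 - t / n_iter)  # shrinking kernel h(t)
        M += h[:, None] * (x - M)                          # update all reference vectors
    return M.reshape(rows, cols, -1)
```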
Laplacian Eigenmaps

• Aim to preserve neighbour distances.
  - The cost from the "first" nearest neighbour contributes more to the cost function than that from the "second" nearest neighbour, and so on up to the k-th nearest neighbour, to produce the low-dimensional representation.
  - Large weights correspond to small distances, and hence contribute more to the cost function.
[Belkin, 2002, 2003]
[Maaten, 2007]
Laplacian Eigenmaps

• Conceptual algorithm
  1. Construct a weighted neighbour graph.
     - Edge weights are computed using a Gaussian kernel function, $w_{ij} = e^{-\|x_i - x_j\|^2 / 2\sigma^2}$, or a heat kernel function, $w_{ij} = e^{-\|x_i - x_j\|^2 / t}$
  2. Minimize the cost function
     $F(Y) = \sum_i \sum_j w_{ij} (y_i - y_j)^2$
• Mathematical formulation
  - In practice the solution is expressed in terms of the graph Laplacian of the weighted neighbour graph.
  - This allows expressing the solution of the minimization problem as an eigendecomposition.
Laplacian Eigenmaps

• Mathematical formulation
  - Let $W = \{w_{ij}\}$ be the weighted graph matrix.
  - Let $M = \{m_{ii}\}$ be the degree matrix such that $m_{ii} = \sum_j w_{ij}$
  - The graph Laplacian is computed as $L = M - W$
  - The cost function can now be rewritten as
    $F(Y) = \sum_i \sum_j w_{ij} (y_i - y_j)^2 = 2\, Y^T L Y$
  - The low dimensional projection is calculated from the eigendecomposition of $L$:
    $L v = \lambda M v$
    · The eigenvectors of $L$ are the projection components.
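A minimal sketch of this eigendecomposition step follows: build heat-kernel weights, form $L = M - W$, and solve the generalised eigenproblem $Lv = \lambda M v$, discarding the trivial constant eigenvector. Using dense weights over the full graph (rather than a sparse kNN graph) and a single kernel width are simplifications assumed here for brevity.

```python
# Hedged sketch: Laplacian eigenmap via the generalised eigenproblem L v = lambda M v.
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import pdist, squareform

def laplacian_eigenmap(X, n_components=2, t=1.0):
    W = np.exp(-squareform(pdist(X, 'sqeuclidean')) / t)   # heat-kernel weights w_ij
    np.fill_diagonal(W, 0.0)
    M = np.diag(W.sum(axis=1))                              # degree matrix
    L = M - W                                               # graph Laplacian
    eigvals, eigvecs = eigh(L, M)                           # generalised eigendecomposition
    return eigvecs[:, 1:n_components + 1]                   # skip the constant eigenvector
```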
Laplacian Eigenmaps

• Particularly suitable when there are clusters.
• But when there are no clusters, it does not excel…
  - Holes emerge in the visualization.
(Figures: Laplacian Eigenmaps result, with PCA for comparison.)
Laplacian Eigenmaps

• Local technique
  - Laplacian Eigenmaps can be seen as a variant of cMDS that tries to preserve "commuting" time [[Ham, 2004] in [Venna, 2007]].
• Distance magnification distortion effect
  - Small distances tend to get magnified, which results in a "global" distortion.
• Since the weighted neighbourhood graph from which the graph Laplacian is calculated is a sparse matrix, the computation is fast.
Isomap – Isometric Feature Mapping

• Seeks to preserve the global geometry of the manifold based on geodesic distances.
  - Geodesic distances are approximated from Euclidean distances as a number of short hops.
  - Classical MDS is then applied to the geodesic distances.
• The geodesic is "forced" over the manifold hypersurface.
• Suitable for convex manifolds.
[Tenenbaum, 2000]
Isomap – Isometric Feature Mapping

1. Data → feature space.
2. Construct the complete weighted graph ($e_{ij}$ = Euclidean distance).
3. Prune the graph so that only the nearest neighbours remain (k neighbours, or an ε radius).
4. Compute the shortest paths (Dijkstra or Floyd) to reconstruct a complete weighted graph with $\delta_{ij}$ = geodesic distance.
5. Embed with metric MDS (classical MDS, STRAIN), giving $\hat{d}_{ij}$ = embedded Euclidean distance.
6. Embedded data → embedded space.
[Tenenbaum, 2000]
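For reference, scikit-learn packages this whole pipeline (kNN graph, shortest paths, classical MDS) in a single estimator; the neighbourhood size, target dimensionality and toy data below are illustrative assumptions.

```python
# Hedged sketch: Isomap usage via scikit-learn.
import numpy as np
from sklearn.manifold import Isomap

X = np.random.rand(1000, 10)             # toy data in the feature space
iso = Isomap(n_neighbors=10,             # prune the complete graph to k neighbours
             n_components=2)             # dimensionality of the embedded space
Y = iso.fit_transform(X)                 # geodesic distances + classical MDS
```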
Isomap – Isometric Feature Mapping

• Some Isomap variants:
  - Fixed Reference (FR)-Isomap [Lekadir, 2006]
    · Allows consistent embedding and a reduction of the computational cost.
  - Supervised (S)-Isomap [Geng, 2005]
    · A supervised version to tackle topological instability.
  - Spatio-Temporal (ST)-Isomap [Jenkins, 2004]
    · Continuous ST-Isomap: suitable for uncovering data exhibiting temporal coherence.
    · Segmented ST-Isomap: uncovers spatio-temporal clusters in segmented data.
  - Landmark (L)-Isomap [Silva, 2002]
    · Reduces the computational cost by only embedding the landmarks and then simply locating the rest of the points.
  - Conformal (C)-Isomap [Silva, 2002]
    · Aims to deal with certain curved manifolds, such as the fishbowl.
FR-Isomap

(Flow diagram: the original dataset in the feature space is split into a set of references and the points not selected as references; the references are embedded with Isomap to give reference coordinates in the embedded space, which are then used to place the remaining points and new datasets.)
[Lekadir, 2006]
S-Isomap

• A supervised method which aims to enhance robustness against noise (topological instability).
  - Uses class information to guide the embedding.
• Rather than simply using pairwise Euclidean distances, the neighbourhood graph is constructed according to a similarity designed to integrate class information.
  - One parameter allows points of the same class to be closer than their simple Euclidean distance would suggest.
  - Another parameter accounts for the density of sampling of the points, and is set to the average Euclidean distance over all pairwise distances.
[Geng, 2005]
ST-Isomap

1. Break the data into temporal blocks.
   - Each block is a data point.
2. Compute the nearest neighbour matrix based on Euclidean distance.
3. Locally identify the temporal neighbours (among those which are already neighbours in the feature space).
4. Reduce the distances between temporal neighbours (several possible criteria).
5. Compute the geodesic distances.
6. Apply classical MDS.
[Jenkins, 2004]
C-Isomap

• Instead of approximating the geodesic directly from the pairwise Euclidean distances $\|x_i - x_j\|$, it weights these Euclidean distances with the mean distances $M(x)$ of each point to its neighbours:
  $\frac{\|x_i - x_j\|}{\sqrt{M(x_i)\, M(x_j)}}$
[Silva, 2002]
L-Isomap

1. Select a number of landmarks.
2. Embed those landmarks with plain Isomap (using classical MDS).
3. Embed the rest of the points by using the known distances to the landmarks as constraints.
   - Uses a modified landmark-MDS.
• It is somewhat similar to FR-Isomap, but changes the way it includes those points which are not references or landmarks.
[Silva, 2002]
LLE – Locally Linear Embedding

• Aims to preserve local neighbourhoods, in two steps:
  - Represent each point as a weighted combination of its neighbours.
  - Project a map trying to distort each point's neighbourhood as little as possible.
[Roweis, 2000]
LLE – Locally Linear Embedding

• Calculate the reconstruction weights:
  1. Compute pairwise distances and find the K nearest neighbours.
  2. For each point $x_i$, calculate the optimum reconstruction weights $w_{ij}$ from its neighbours according to the weights cost function
     $\varepsilon(W) = \sum_i \left\| x_i - \sum_j w_{ij} x_j \right\|^2$
     subject to the constraints
     - $w_{ij} = 0 \Leftrightarrow x_j \notin$ nearest neighbours of $x_i$
     - $\sum_j w_{ij} = 1$
• Embed the neighbourhoods:
  3. Optimize the embedding from the eigenvectors of the cost matrix, by minimizing the embedding cost function
     $F(Y) = \sum_i \left\| y_i - \sum_j w_{ij} y_j \right\|^2$
[Roweis, 2000]
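For reference, scikit-learn implements exactly this two-step procedure; the neighbourhood size and toy data below are illustrative assumptions.

```python
# Hedged sketch: LLE usage via scikit-learn.
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

X = np.random.rand(1000, 10)
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, method='standard')
Y = lle.fit_transform(X)   # steps 1-2: reconstruction weights; step 3: embedding
```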
LLE – Locally Linear Embedding

• Works for non-convex manifolds.
• It is weak for manifolds with a high intrinsic dimensionality or with holes.
• It presents a "collapsing" effect.
[Roweis, 2000]
[Maaten, 2007]
LLE – Locally Linear Embedding

• It requires two optimizations: one for the best reconstruction weights, and another one for the projection.
• However, it is efficient because it exploits sparse computation.
HLLE – Hessian Eigenmaps, a.k.a. Hessian LLE

• Minimizes the "curviness" of the high dimensional manifold when embedded.
• A modification of LLE within the mathematical framework of Laplacian Eigenmaps and of Broomhead's work on finding the topological dimensionality.
  - Similar to Laplacian Eigenmaps, but the Laplacian is substituted by a quadratic form based on the Hessian.
[Donoho, 2003]
HLLE – Hessian Eigenmaps

• For a point $x_i$ construct a neighbourhood matrix $B_\varepsilon(x)$ whose rows are the vectors $v = (x_j - x_i)^T$ such that the $x_j$ are the $\varepsilon$-neighbours ($\|x_j - x_i\| < \varepsilon$) of $x_i$.
  - Note that this is similar to considering $x_i$ as the origin of a coordinate system and expressing all the neighbours $x_j$ as vectors $v$ "centered" at this origin.
  - For sufficiently small $\varepsilon$, $B_\varepsilon(x)$ represents the tangent space to the manifold.
  - The original algorithm applies SVD here, but that is not strictly necessary.
• Based on the Taylor expansion, construct a smooth tangent function
  $f(v) = f(x) + v\,\frac{\partial f}{\partial x_i} + \frac{v^T v}{2}\,\frac{\partial^2 f}{\partial x_i^2}$
  where $H = \frac{\partial^2 f}{\partial x_i^2}$ is the Hessian, and the partial derivatives are applied to each tangent component separately.
[Donoho, 2003]
HLLE – Hessian Eigenmaps

• Construct a matrix with the quadratic symmetric form of the Hessian:
  $H^2 = \sum_i \sum_j H_{ij} H_{ji}$
• Calculate the eigendecomposition of the matrix $H^2$.
  - The eigenvectors are the projection components.
[Donoho, 2003]
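For reference, scikit-learn exposes Hessian eigenmaps as a variant of its LLE estimator (method='hessian'); the parameters below are illustrative, with the caveat that n_neighbors must exceed n_components * (n_components + 3) / 2 for this method.

```python
# Hedged sketch: Hessian LLE usage via scikit-learn.
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

X = np.random.rand(1000, 10)
hlle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, method='hessian')
Y = hlle.fit_transform(X)
```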
HLLE – Hessian Eigenmaps

• A local technique aimed at non-convex manifolds.
• Example: Swiss roll with a hole (non-convex)
  - In Isomap the non-convexity causes a strong dilation,
  - …but HLLE responds reasonably well.
[Donoho, 2003]
CCA – Curvilinear Component Analysis

• Embeds some landmarks and then interpolates the rest of the points.
  - Although a radically different approach, it is conceptually similar to FR-Isomap or L-Isomap.
• Global unfolding of strongly non-linear, non-convex and closed structures.
  - It only attempts to preserve a subset of the distances (rather than the whole set, as Isomap does).
• The mapping is invertible; once the map is learned, it can be used both ways.
[Demartines, 1997]
[Lee, 2002, 2004]
CCA – Curvilinear Component Analysis

• Learning stage (flow: original dataset → vector quantization selects a set of prototypes in the feature space → non-linear embedding of the prototype coordinates into the embedded space):
  - Select prototypes by vector quantization.
    · Vector quantization consists of partitioning the space into Voronoi regions and representing each region by its generating centroid.
  - Non-linear embedding of the prototypes by minimizing
    $E = \frac{1}{2} \sum_i \sum_{j \neq i} (d_{ij} - \hat{d}_{ij})^2\, F(\hat{d}_{ij}, \lambda)$
• Continuous mapping: points not selected as prototypes, and new points, are interpolated by optimization (stochastic gradient descent).
• Several options are available for the weighting function $F(\hat{d}_{ij}, \lambda)$.
  - The parameter λ controls the size of what is considered "local".
[Demartines, 1997]
CCA – Curvilinear Component Analysis

• Implemented as a two-layer neural network:
  - One layer performs the vector quantization.
  - The second layer performs the non-linear embedding.
• Initially proposed as a continuous improvement of Kohonen's SOM.
• Sammon's NLM (and possibly Laplacian Eigenmaps) can be modelled as particular cases by manipulating the weighting function $F(\hat{d}_{ij}, \lambda)$.
[Demartines, 1997]
[Lee, 2002, 2004]
CCA – Curvilinear Component Analysis

• It has been said to outperform Isomap [Lee, 2002, 2004] [Venna, 2007].
  - I think [Lee, 2002, 2004] was biasing his results by using non-convex manifolds, for which Isomap is known to fail [Donoho, 2003]…
  - …however, [Venna, 2007] seems to reach the same conclusion with different datasets.
• Convergence to a global minimum has not yet been proved (as far as I know).
[Venna, 2007]
[Lee, 2002, 2004]
Manifold charting

• Minimizes the loss of information about data density and location.
• Topology neutral.
• Intuition: take a piece of paper and crumple it into a ball. In order to unfold it,
  - break it into tiny pieces (patches) which have not been affected by any folding (locally linear), and
  - then make a collage with all the tiny patches.
• First step: estimate the intrinsic dimensionality and the neighbourhood size by counting the number of points within a ball of radius r.
[Brand, 2003]
Manifold charting

1. Estimate the intrinsic dimensionality using balls of radius r (see the previous slide).
   - This also allows selecting the approximate size of the neighbourhoods (charts).
2. Create a partition of the data set into locally linear neighbourhoods (charts), minimizing the loss in the connection between neighbouring charts.
   - Use a Gaussian mixture model (GMM), solved with the Bayesian expectation maximization (EM) algorithm, for this optimization.
   - The GMM produces a soft partitioning of the dataset into neighbourhoods with mean $\mu_k$ and covariance $\Sigma_k$.
3. Connect (sew) the charts, i.e. compute a minimal distortion merger (connection) of all the charts.
   - Use weighted least-squares optimization.
[Brand, 2003]
Manifold charting

(Figure from www.merl.com/projects/images/charting.jpg)
[Brand, 2004]
Manifold charting

• Aimed at convex manifolds,
  - …but it seems to perform well on the fishbowl (non-convex).
• And because it is topology neutral, it is not largely affected by noise.
[Brand, 2003]
SNE – Stochastic Neighbour Embedding

• Preserves neighbour identities
  - i.e. the n-th neighbour in the high dimensional space is still the n-th neighbour in the low dimensional space (up to the N-th neighbour).
• Basically, it defines a p.d.f. over the neighbourhood in the high dimensional space and tries to preserve that p.d.f. as well as possible in the low dimensional space.
• A variant of MDS minimizing the Kullback-Leibler divergence.
[Hinton and Roweis, 2002]
SNE – Stochastic Neighbour Embedding

• At each point $x_i$ define a Gaussian probability (heat kernel based) representing the probability that $x_i$ picks $x_j$ as its neighbour over the rest of the points:
  $p_{ij} = \frac{e^{-\|x_i - x_j\|^2 / 2\sigma_i^2}}{\sum_{k \neq i} e^{-\|x_i - x_k\|^2 / 2\sigma_i^2}}$
[Hinton and Roweis, 2002]
SNE – Stochastic Neighbour Embedding

• …and an induced probability function for the low dimensional images, giving at each image point $y_i$ the probability that image $y_i$ picks image $y_j$ as its neighbour over the rest of the image points:
  $q_{ij} = \frac{e^{-\|y_i - y_j\|^2 / 2\sigma_i^2}}{\sum_{k \neq i} e^{-\|y_i - y_k\|^2 / 2\sigma_i^2}}$
[Hinton and Roweis, 2002]
SNE – Stochastic Neighbour Embedding

• The cost function to minimise is the sum of Kullback-Leibler divergences between the original and induced distributions over neighbours:
  $C = \sum_i KLD(P_i, Q_i) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$
  - This cost function only allows one image per point.
  - To allow for more images, a more cumbersome variant of this cost function is used.
• SNE uses steepest descent to minimise the cost function.
[Hinton and Roweis, 2002]
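The sketch below computes only these ingredients (the neighbour probabilities and the KL cost), not the full optimiser; using a single global bandwidth sigma for both spaces is a simplification assumed here, whereas SNE tunes one bandwidth per point.

```python
# Hedged sketch: SNE neighbour probabilities and the Kullback-Leibler cost.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def neighbour_probabilities(Z, sigma=1.0):
    d2 = squareform(pdist(Z, 'sqeuclidean'))
    P = np.exp(-d2 / (2 * sigma**2))
    np.fill_diagonal(P, 0.0)                  # a point never picks itself
    return P / P.sum(axis=1, keepdims=True)   # normalise over candidate neighbours

def sne_cost(X, Y, sigma=1.0, eps=1e-12):
    P = neighbour_probabilities(X, sigma)     # p_ij in the data space
    Q = neighbour_probabilities(Y, sigma)     # q_ij in the embedded space
    return np.sum(P * np.log((P + eps) / (Q + eps)))   # sum of KL divergences
```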
SNE – Stochastic Neighbour Embedding

• Allows one point in the high dimensional space to have more than one image in the low dimensional space.
  - This is particularly interesting at "cut points".
  - It avoids the "collapsing" effect of LLE.
• Computationally expensive.
[Hinton and Roweis, 2002]
Triangulation method

• Sequential mapping of high dimensional points specifically onto a plane.
  - It can be thought of as a 2D view of the minimal spanning tree (MST) of the dataset.
  - It only attempts to preserve exactly three distances for each point. In particular, all distances in the MST are preserved.
• Emphasizes a particular viewpoint. The viewpoint selected depends on the chosen root of the constructed MST.
  - No global information is kept.
[Lee, 1977]
Triangulation method

1. Create the minimal spanning tree (MST).
2. Select a node (point) as the root.
3. Traverse the tree (pre-order).
4. Project the root to the origin and its first, leftmost child at the appropriate distance (usually to the right, but not necessarily).
5. For every new node, use the previous two nodes as references to create a triangle, and place the new point at one of the two possible intersections.
[Lee, 1977]
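The first step of this procedure can be sketched as below; the triangulated placement on the plane is omitted, and the toy data are an assumption of the illustration.

```python
# Hedged sketch: the MST over the complete interpoint-distance graph.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

X = np.random.rand(100, 10)
D = squareform(pdist(X))                 # complete weighted graph of distances
mst = minimum_spanning_tree(D)           # sparse matrix keeping the n-1 MST edges
edges = np.transpose(mst.nonzero())      # pairs of points whose distances are preserved
```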
Triangulation method

(Figure: example minimal spanning tree (MST) of a dataset.)
Distance Preserving Projection

• An extension of the triangulation method, not necessarily confined to a plane.
• The number of distances preserved depends on the dimensionality of the target space.
  - Key observation: the triangle is the simplex of a 2D space. In a triangle, three (2+1) distances are preserved. In an N-dimensional space, the corresponding simplex can preserve exactly (N+1) distances.
  - The extension from the triangulation method is trivial.
Some final remarks

• [Venna, 2007] compares methods based on trustworthiness and continuity:
  - A projection is trustworthy if proximate points in the visualization (low dimension) are also proximate in the original space (high dimension).
  - A projection is continuous if originally proximate points (high dimension) remain proximate (low dimension).
[Venna, 2007]
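scikit-learn implements the trustworthiness measure directly; computing continuity by swapping the roles of the original and embedded data is a common trick, stated here as an assumption rather than part of that API, and the parameters below are illustrative.

```python
# Hedged sketch: trustworthiness (and continuity via the swapped call).
import numpy as np
from sklearn.manifold import Isomap, trustworthiness

X = np.random.rand(500, 10)
Y = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
t = trustworthiness(X, Y, n_neighbors=12)   # proximities in Y also hold in X
c = trustworthiness(Y, X, n_neighbors=12)   # "continuity" via the swapped arguments
```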
Some final remarks
[Venna, 2007]
Some final remarks
http://www.math.umn.edu/~wittman/mani/mani_gui2.jpg