Prof. Richard C. Wilson
Dept. of Computer Science
University of York
Background
• Typically objects are characterised by features
– Face images
– SIFT features
– Object spectra
– ...
• If we measure n features → n-dimensional space
• The arena for our problem is an n-dimensional vector space
• Example: Eigenfaces
Background
• Raw pixel values: n by m gives nm features
• Feature space is space of all n by m images
Background
• The space of all face-like images is smaller than the space of all images
• Assumption is faces lie on a smaller manifold embedded in the global space
Manifold learning
All objects should be on the manifold, non-objects outside
Part I: Euclidean Space
Position, Similarity and Distance
Manifold Learning in Euclidean space
Some famous techniques
Part II: Non-Euclidean Manifolds
Assessing Data
Nature and Properties of Manifolds
Data Manifolds
Learning some special types of manifolds
Part III: Advanced Techniques
Methods for intrinsically curved manifolds
Thanks to Edwin Hancock, Eliza Xu, Bob Duin for contributions
And support from the EU SIMBAD project
Position
– A set of n well-defined features collected into a vector in ℝⁿ
– Feature vector → position
Similarity
  ⟨x, y⟩ = Σ_i x_i y_i
• The inner-product ⟨x, y⟩ can be considered to be a similarity between x and y
Induced norm
• The self-similarity ⟨x, x⟩ is the (square of) the ‘size’ of x and gives rise to the induced norm, i.e. the length of x:
  ‖x‖² = ⟨x, x⟩
• Finally, the length of x allows the definition of a distance in our vector space as the length of the vector joining x and y:
  d(x, y) = ‖x − y‖ = √⟨x − y, x − y⟩
• Inner product also gets us distance
Euclidean space
• If we have a vector space for features, and the usual inner product, all three are connected:
  d(x, y) = ‖x − y‖ = √( ⟨x, x⟩ − 2⟨x, y⟩ + ⟨y, y⟩ )
non-Euclidean Inner Product
• If the inner-product has the form
  ⟨x, y⟩ = xᵀy = Σ_i x_i y_i
  then the vector space is Euclidean
• Note we recover all the expected stuff for Euclidean space, i.e.
  ‖x‖² = x₁² + x₂² + ...
  d²(x, y) = (x₁ − y₁)² + (x₂ − y₂)² + ... + (x_n − y_n)²
• The inner-product doesn’t have to be like this; for example in Einstein’s special relativity, the inner-product of spacetime is
  ⟨x, y⟩ = x₁y₁ + x₂y₂ + x₃y₃ − x₄y₄
The Golden Trio
• In Euclidean space, the concepts of position, similarity and distance are elegantly connected
Position X ↔ Similarity K ↔ Distance D
Point position matrix
• In a normal manifold learning problem, we have a set of samples X = { x₁, x₂, ..., x_m }
• These can be collected together in a matrix X, one point per row:
  X = [ x₁ᵀ ; x₂ᵀ ; ... ; x_mᵀ ]
• I use this convention, but others may write them vertically
Centreing
– Centred points behave better
  X̄ = X − (1/m) JX
– J is the all-ones m × m matrix
  C = I − (1/m) J,  X̄ = CX
– C is the centreing matrix (and is symmetric, C = Cᵀ)
Position-Similarity
• The similarity matrix K is defined as
  K_ij = ⟨x_i, x_j⟩
• From the definition of X, we simply get
  K = XXᵀ
• The Gram matrix is the similarity matrix of the centred points (from the definition of X̄)
  K_c = CXXᵀCᵀ = CKC
  – i.e. a centring operation on K
• K_c is really a kernel matrix for the points (linear kernel)
Position-Similarity
• To go from K to X, we need to consider the eigendecomposition of K
  K = XXᵀ = UΛUᵀ
• As long as we can take the square root of Λ then we can find X as
  X = UΛ^(1/2)
Kernel embedding
First manifold learning method – kernel embedding
Finds a Euclidean manifold from object similarities
  K = UΛUᵀ
  X = UΛ^(1/2)
• Embeds a kernel matrix into a set of points in Euclidean space (the points are automatically centred)
• K must have no negative eigenvalues, i.e. it is a kernel matrix (Mercer condition)
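A minimal numpy sketch of kernel embedding (illustrative only; the function name and tolerance are my own choices):

```python
import numpy as np

def kernel_embedding(K, dims=2, tol=1e-10):
    """Embed a (centred) kernel/similarity matrix K into Euclidean coordinates X = U Lambda^(1/2)."""
    K = 0.5 * (K + K.T)                       # symmetrise against numerical noise
    evals, evecs = np.linalg.eigh(K)          # ascending eigenvalues
    order = np.argsort(evals)[::-1]           # largest first
    evals, evecs = evals[order], evecs[:, order]
    if evals.min() < -tol:
        print("Warning: K has negative eigenvalues -- not a Mercer kernel")
    lam = np.clip(evals[:dims], 0, None)      # keep the leading non-negative eigenvalues
    return evecs[:, :dims] * np.sqrt(lam)     # X = U Lambda^(1/2)

# Example: the linear kernel of some centred points recovers them up to rotation
X = np.random.randn(10, 3)
X -= X.mean(axis=0)
K = X @ X.T
Y = kernel_embedding(K, dims=3)
```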
Similarity-Distance
  D_s,ij = d(x_i, x_j)² = ⟨x_i − x_j, x_i − x_j⟩ = ⟨x_i, x_i⟩ + ⟨x_j, x_j⟩ − 2⟨x_i, x_j⟩ = K_ii + K_jj − 2K_ij
• We can easily determine D_s from K
Similarity-Distance
• What about finding K from D_s?
  D_s,ij = K_ii + K_jj − 2K_ij
• Looking at the top equation, we might imagine that K = −½ D_s is a suitable choice
• Not centred; the relationship is actually
  K = −½ C D_s C
Classic MDS
• Classic Multidimensional Scaling embeds a (squared) distance matrix into Euclidean space
• Using what we have so far, the algorithm is simple
  1. Compute the kernel:         K = −½ C D_s C
  2. Eigendecompose the kernel:  K = UΛUᵀ
  3. Embed the kernel:           X = UΛ^(1/2)
• This is MDS
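A minimal classical-MDS sketch following these three steps (numpy; `classical_mds` and its arguments are my own names, and D_s is assumed to be a squared-distance matrix):

```python
import numpy as np

def classical_mds(Ds, dims=2):
    """Classical MDS: embed a squared-distance matrix Ds into dims-dimensional Euclidean space."""
    m = Ds.shape[0]
    C = np.eye(m) - np.ones((m, m)) / m        # centring matrix C = I - J/m
    K = -0.5 * C @ Ds @ C                      # kernel from squared distances
    evals, evecs = np.linalg.eigh(K)
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order][:dims], evecs[:, order][:, :dims]
    return evecs * np.sqrt(np.clip(evals, 0, None))   # X = U Lambda^(1/2)

# Usage: squared Euclidean distances of some points should be reproduced
P = np.random.randn(8, 2)
Ds = ((P[:, None, :] - P[None, :, :]) ** 2).sum(-1)
X = classical_mds(Ds, dims=2)
```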
The Golden Trio
MDS: Distance D → Similarity K → Position X
  D_s,ij = K_ii + K_jj − 2K_ij
  K = −½ C D_s C
  Kernel embedding: X = UΛ^(1/2)
Kernel methods
• A kernel is a function k(i, j) which computes an inner-product k(i, j) = ⟨x_i, x_j⟩
  – But without needing to know the actual points (the space is implicit)
• Using a kernel function we can directly compute K without knowing X
  Kernel function → Similarity K
Kernel methods
• The implied space may be very high dimensional, but a true kernel will always produce a positive semidefinite K and the implied space will be Euclidean
• Many (most?) PR algorithms can be kernelized
– Made to use K rather than X or D
• The trick is to note that any interesting vector should lie in the space spanned by the examples we are given
• Hence it can be written as a linear combination
  u = α₁x₁ + α₂x₂ + ... + α_m x_m = Xᵀα
• Look for α instead of u
Kernel PCA
• What about PCA? PCA solves the following problem
  u* = arg max_u uᵀΣu = arg max_u (1/n) uᵀXᵀXu   (subject to ‖u‖ = 1)
• Let’s kernelize, substituting u = Xᵀα:
  (1/n) uᵀXᵀXu = (1/n) (Xᵀα)ᵀ XᵀX (Xᵀα) = (1/n) αᵀ XXᵀXXᵀ α = (1/n) αᵀK²α
Kernel PCA
• K² has the same eigenvectors as K, so the eigenvectors of kernel PCA are the same as the eigenvectors of K
• The eigenvalues of PCA are related to the eigenvalues of K by
  λ_PCA = (1/n) λ_K²
• Kernel PCA is a kernel embedding with an externally provided kernel matrix
Kernel PCA
• So kernel PCA gives the same solution as kernel embedding
– The eigenvalues are modified a bit
• They are essentially the same thing in Euclidean space
• MDS uses the kernel and kernel embedding
• MDS and PCA are essentially the same thing in Euclidean space
• Kernel embedding, MDS and PCA all give the same answer for a set of points in Euclidean space
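A quick numerical check of this equivalence for a linear kernel (a sketch; the data, dimensions and tolerances are arbitrary):

```python
import numpy as np

# For a linear kernel, kernel embedding of K = X X^T matches PCA on X (up to per-axis sign).
X = np.random.randn(20, 5)
X -= X.mean(axis=0)                        # centre the data

# Ordinary PCA: eigenvectors of the covariance matrix
cov = X.T @ X / X.shape[0]
w, V = np.linalg.eigh(cov)
pca_scores = X @ V[:, ::-1][:, :2]         # project onto the top 2 components

# Kernel embedding of the linear kernel
K = X @ X.T
lam, U = np.linalg.eigh(K)
kpca_scores = U[:, ::-1][:, :2] * np.sqrt(np.clip(lam[::-1][:2], 0, None))

# The two embeddings agree up to a sign flip of each axis
print(np.allclose(np.abs(pca_scores), np.abs(kpca_scores), atol=1e-8))
```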
Some useful observations
• Your similarity matrix is Euclidean iff it has no negative eigenvalues (i.e. it is a kernel matrix and PSD)
• By similar reasoning, your distance matrix is Euclidean iff the similarity matrix derived from it is PSD
• If the feature space is small but the number of samples is large, then the covariance matrix is small and it is better to do normal PCA (on the covariance matrix)
• If the feature space is large and the number of samples is small, then the kernel matrix will be small and it is better to do kernel embedding
Non-linear data
– The space of all images of a face is a subspace of the space of all possible images
– The subspace is highly non-linear but low dimensional
(described by a few parameters)
Non-linear data
• This cannot be exploited by the linear subspace methods like PCA
– These assume that the subspace is a Euclidean space as well
• A classic example is the
‘swiss roll’ data:
‘Flat’ Manifolds
• Fundamentally different types of data, for example:
• The embedding of this data into the high-dimensional space is highly curved
– This is called extrinsic curvature, the curvature of the manifold with respect to the embedding space
• Now imagine that this manifold was a piece of paper; you could unroll the paper into a flat plane without distorting it
– No intrinsic curvature, in fact it is homeomorphic to Euclidean space
Curved manifold
• This manifold is different:
• It must be stretched to map it onto a plane
– It has non-zero intrinsic curvature
• A flatlander living on this manifold can tell that it is curved, for example by measuring the ratio of the radius to the circumference of a circle
• In the first case, we might still hope to find a Euclidean embedding
• We can never find a distortion-free Euclidean embedding of the second (in the sense that the distances will always have errors)
Intrinsically Euclidean Manifolds
• We cannot use the previous methods on the second type of manifold, but there is still hope for the first
• The manifold is embedded in Euclidean space, but
Euclidean distance is not the correct way to measure distance
• The Euclidean distance ‘shortcuts’ the manifold
• The geodesic distance calculates the shortest path along the manifold
Geodesics
• The geodesic generalizes the concept of distance to curved manifolds
– The shortest path joining two points which lies completely within the manifold
• If we can correctly compute the geodesic distances, and the manifold is intrinsically flat, we should get Euclidean distances which we can plug into our Euclidean geometry machine:
  geodesic distances → Distance D → Similarity K → Position X
ISOMAP
• ISOMAP is exactly such an algorithm
• Approximate geodesic distances are computed for the points from a graph
• Nearest neighbours graph
– For neighbours, Euclidean distance ≈ geodesic distance
– For non-neighbours, geodesic distance approximated by shortest distance in graph
• Once we have distances
D , can use MDS to find Euclidean embedding
ISOMAP
– Neighbourhood graph
– Shortest path algorithm
– MDS
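A compact ISOMAP sketch along these lines, using scipy’s shortest-path routine (the function name, neighbourhood size and other details are my own choices):

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def isomap(X, k=7, dims=2):
    """ISOMAP sketch: kNN graph -> graph shortest paths -> classical MDS."""
    m = X.shape[0]
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))   # Euclidean distances
    # Keep only each point's k nearest neighbours (symmetrised); inf marks "no edge"
    G = np.full((m, m), np.inf)
    for i in range(m):
        nn = np.argsort(D[i])[1:k + 1]
        G[i, nn] = D[i, nn]
    G = np.minimum(G, G.T)
    # Approximate geodesics by shortest paths through the graph (assumes a connected graph)
    geo = shortest_path(G, method='D', directed=False)
    # Classical MDS on the squared geodesic distances
    Ds = geo ** 2
    C = np.eye(m) - np.ones((m, m)) / m
    K = -0.5 * C @ Ds @ C
    lam, U = np.linalg.eigh(K)
    idx = np.argsort(lam)[::-1][:dims]
    return U[:, idx] * np.sqrt(np.clip(lam[idx], 0, None))
```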
Laplacian Eigenmap
• The Laplacian Eigenmap is another graph-based method of embedding non-linear manifolds into Euclidean space
• As with ISOMAP, form a neighbourhood graph for the datapoints
• Find the graph Laplacian as follows
• The adjacency matrix A is
  A_ij = e^(−d_ij² / t)  if i and j are connected, 0 otherwise
• The ‘degree’ matrix D is the diagonal matrix
  D_ii = Σ_j A_ij
• The normalized graph Laplacian is
  L = I − D^(−1/2) A D^(−1/2)
Laplacian Eigenmap
• We find the Laplacian eigenmap embedding using the eigendecomposition of L
  L = UΛUᵀ
• The embedded positions are
  X = D^(−1/2) U
• Similar to ISOMAP
– Structure preserving not distance preserving
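A short sketch of the Laplacian eigenmap as described above (numpy only; the kNN construction and parameter names are my own):

```python
import numpy as np

def laplacian_eigenmap(D, k=7, t=1.0, dims=2):
    """Laplacian eigenmap sketch: kNN graph -> normalized Laplacian -> bottom eigenvectors."""
    m = D.shape[0]
    A = np.zeros((m, m))
    for i in range(m):
        nn = np.argsort(D[i])[1:k + 1]                 # k nearest neighbours of i
        A[i, nn] = np.exp(-D[i, nn] ** 2 / t)          # heat-kernel weights
    A = np.maximum(A, A.T)                             # symmetrise
    deg = A.sum(axis=1)
    Dinv_sqrt = np.diag(1.0 / np.sqrt(deg))
    L = np.eye(m) - Dinv_sqrt @ A @ Dinv_sqrt          # normalized graph Laplacian
    lam, U = np.linalg.eigh(L)                         # ascending eigenvalues
    # Skip the trivial (near-zero) eigenvector, keep the next `dims`
    return Dinv_sqrt @ U[:, 1:dims + 1]                # X = D^(-1/2) U
```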
Locally-Linear Embedding
• Locally-linear Embedding is another classic method which also begins with a neighbourhood graph
• We make point i (in the original data) from a weighted sum of the neighbouring points
  x̂_i = Σ_j W_ij x_j
• W_ij is 0 for any point j not in the neighbourhood (and for i = j)
• We find the weights by minimising the reconstruction error
  min_W Σ_i | x̂_i − x_i |²
  – Subject to the constraints that the weights are non-negative and sum to 1:
    W_ij ≥ 0,  Σ_j W_ij = 1
• Gives a relatively simple closed-form solution
Locally-Linear Embedding
• These weights encode how well a point j represents a point i and can be interpreted as the adjacency between i and j
• A low dimensional embedding is found by then finding points y_i to minimise the error
  min_Y Σ_i | ŷ_i − y_i |²,  where ŷ_i = Σ_j W_ij y_j
• In other words, we find a low-dimensional embedding which preserves the adjacency relationships
• The solution to this embedding problem turns out to be simply the eigenvectors of the matrix M
  M = (I − W)ᵀ(I − W)
• LLE is scale-free: the final points have the covariance matrix I
– Unit scale
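A sketch of LLE along these lines; for simplicity it uses only the sum-to-one constraint on the weights (the standard closed form), not the non-negativity constraint mentioned above:

```python
import numpy as np

def lle(X, k=7, dims=2, reg=1e-3):
    """LLE sketch: local reconstruction weights, then eigenvectors of M = (I-W)^T (I-W)."""
    m = X.shape[0]
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.zeros((m, m))
    for i in range(m):
        nn = np.argsort(D[i])[1:k + 1]
        Z = X[nn] - X[i]                         # neighbours relative to x_i
        C = Z @ Z.T
        C += reg * np.trace(C) * np.eye(k)       # regularise the local Gram matrix
        w = np.linalg.solve(C, np.ones(k))
        W[i, nn] = w / w.sum()                   # weights sum to one
    M = (np.eye(m) - W).T @ (np.eye(m) - W)
    lam, U = np.linalg.eigh(M)
    return U[:, 1:dims + 1]                      # drop the constant eigenvector
```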
Comparison
• LLE might seem like quite a different process to the previous two, but actually very similar
• We can interpret the process as producing a kernel matrix followed by scale-free kernel embedding
  K = (λ_max − 1)I − (λ_max/n)J + W + Wᵀ − WᵀW
  K = UΛUᵀ
  X = U
  (where λ_max is the largest eigenvalue of M = (I − W)ᵀ(I − W), so K is a centred version of λ_max I − M)
                    ISOMAP                    Lap. Eigenmap           LLE
Representation      Neighbourhood graph       Neighbourhood graph     Neighbourhood graph
Similarity matrix   From geodesic distances   Graph Laplacian         Reconstruction weights
Embedding           X = UΛ^(1/2)              X = D^(−1/2)U           X = U
Comparison
• ISOMAP is the only method which directly computes and uses the geodesic distances
– The other two depend indirectly on the distances through local structure
• LLE is scale-free, so the original distance scale is lost, but the local structure is preserved
• Computing the necessary local dimensionality to find the correct nearest neighbours is a problem for all such methods
Non-Euclidean data
• Data is Euclidean iff
K is psd
• Unless you are using a kernel function, this is often not true
• Why does this happen?
What type of data do I have?
• Starting point: distance matrix
• However we do not know a priori if our measurements are representable on a manifold
  – We will call them dissimilarities
• Our starting point to answer the question “What type of data do I have?” will be a matrix of dissimilarities D between objects
• Types of dissimilarities
– Euclidean (no intrinsic curvature)
– Non-Euclidean, metric (curved manifold)
– Non-metric (no point-like manifold representation)
Causes
• Example: Chicken pieces data
• Distance by alignment
• Global alignment of everything could find Euclidean distances
• Only local alignments are practical
Causes
1. D_ij ≥ 0 (non-negativity)
2. D_ij = 0 iff i = j (identity of indiscernibles)
3. D_ij = D_ji (symmetry)
4. D_ij ≤ D_ik + D_kj (triangle inequality)
Causes
• Symmetry violations: D_ij ≠ D_ji (the alignment of i onto j need not equal the alignment of j onto i)
Causes
• Triangle violations
  D_ij ≤ D_ik + D_kj
• ‘Extended objects’: an object k can be close to both i and j even though i and j are far apart
  D_ik ≈ 0, D_kj ≈ 0, but D_ij > 0
• Finally, noise in the measurement of D can cause all of these effects
Tests(1)
• Find the similarity matrix
  K = −½ C D_s C
• The data is Euclidean iff K is positive semidefinite (no negative eigenvalues)
– K is a kernel, explicit embedding from kernel embedding
• We can then use K in a kernel algorithm
Tests(2)
• Negative eigenfraction (NEF)
  NEF = Σ_(λ_i < 0) |λ_i| / Σ_i |λ_i|
• Between 0 and 0.5
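A small sketch combining the two tests: build K from the squared dissimilarities and report the NEF (function name is my own):

```python
import numpy as np

def negative_eigenfraction(Ds):
    """Build K = -1/2 C Ds C from squared dissimilarities and return the negative eigenfraction."""
    m = Ds.shape[0]
    C = np.eye(m) - np.ones((m, m)) / m
    K = -0.5 * C @ Ds @ C
    lam = np.linalg.eigvalsh(K)
    nef = np.abs(lam[lam < 0]).sum() / np.abs(lam).sum()
    return nef          # 0 for Euclidean data, up to 0.5 for strongly non-Euclidean data
```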
Tests(3)
1. D_ij ≥ 0 (non-negativity)
2. D_ij = 0 iff i = j (identity of indiscernibles)
3. D_ij = D_ji (symmetry)
4. D_ij ≤ D_ik + D_kj (triangle inequality)
– Check these for your data (the triangle inequality involves checking all triples)
– Metric data is embeddable on a (curved) Riemannian manifold
Corrections
• If the data is non-metric or non-Euclidean, we can ‘correct it’
• Symmetry violations
– Average: D_ij, D_ji ← ½(D_ij + D_ji), or take whichever of D_ij, D_ji is appropriate
• Triangle violations
  – Constant offset: D_ij ← D_ij + c (i ≠ j)
  – This will also remove non-Euclidean behaviour for large enough c
• Euclidean violations
  – Discard negative eigenvalues
• There are many other approaches
*
* “ On Euclidean corrections for non-Euclidean dissimilarities”, Duin, Pekalska, Harol,
Lee and Bunke, S+SSPR 08
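A sketch of these corrections applied in sequence (symmetrise, optional constant offset, discard negative eigenvalues); the function and its arguments are my own:

```python
import numpy as np

def correct_dissimilarities(D, offset=None):
    """Correct a dissimilarity matrix and return Euclidean positions (a sketch)."""
    D = 0.5 * (D + D.T)                          # symmetrise by averaging
    if offset is not None:                       # constant offset for triangle violations
        D = D + offset * (1 - np.eye(D.shape[0]))
    # Euclidean correction: embed the squared dissimilarities, discard negative eigenvalues
    m = D.shape[0]
    C = np.eye(m) - np.ones((m, m)) / m
    K = -0.5 * C @ (D ** 2) @ C
    lam, U = np.linalg.eigh(K)
    lam = np.clip(lam, 0, None)                  # discard negative eigenvalues
    return U * np.sqrt(lam)                      # corrected Euclidean positions
```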
Known Manifolds
• Sometimes we have data which lies on a known but non-
Euclidean manifold
• Examples in Computer Vision
– Surface normals
– Rotation matrices
– Flow tensors (DT-MRI)
• This is not Manifold Learning, as we already know what the manifold is
• What tools do we need to be able to process data like this?
– As before, distances are the key
Example: 2D direction
• Direction of an edge in an image, encoded as a unit vector x = (x₁, x₂)ᵀ
• The average of the direction vectors isn’t even a direction vector (not unit length), let alone the correct ‘average’ direction
• The normal definition of the mean, x̄ = (1/n) Σ_i x_i, is not correct
  – Because the manifold is curved
Tangent space
• The tangent space (T_P) is the Euclidean space which is parallel to the manifold (M) at a particular point (P)
• The tangent space is a very useful tool because it is Euclidean
Exponential Map
• Exponential map: Exp_P : T_P → M
• Exp_P maps a point X on the tangent plane onto a point A on the manifold
  – P is the centre of the mapping and is at the origin on the tangent space
  – The mapping is one-to-one in a local region of P
  – The most important property of the mapping is that the distances to the centre P are preserved:
    d(X, P) on T_P = d(A, P) on M
  – The geodesic distance on the manifold equals the Euclidean distance on the tangent plane (for distances to the centre only)
Exponential map
• The log map goes the other way, from manifold to tangent plane
  Log_P : M → T_P,  X = Log_P(A)
Exponential Map
• Example on the circle: Embed the circle in the complex plane
• The manifold representing the circle is a complex number with magnitude 1 and can be written x + iy = exp(iθ), e.g. P = e^(iθ_P)
• In this case it turns out that the map is related to the normal exp and log functions:
  X = Log_P(A) = −i log(A/P) = −i log( e^(iθ_A) e^(−iθ_P) ) = θ_A − θ_P
  A = Exp_P(X) = P exp(iX) = e^(iθ_P) exp( i(θ_A − θ_P) ) = exp(iθ_A)
Intrinsic mean
• The mean of a set of samples is usually defined as the sum of the samples divided by the number
– This is only true in Euclidean space
• A more general formula:
  x̄ = arg min_x Σ_i d_g²(x, x_i)
• Minimises the distances from the mean to the samples (equivalent in Euclidean space)
Intrinsic mean
• We can compute this intrinsic mean using the exponential map
• If we knew what the mean was, then we can use the mean as the centre of a map
  X_i = Log_M(A_i)
• From the properties of the Exp-map, the distances are the same
  d_e(X_i, M) = d_g(A_i, M)
• So the mean on the tangent plane is equal to the mean on the manifold
Intrinsic mean
• Start with a guess at the mean and move towards correct answer
• This gives us the following algorithm
– Guess at a mean M₀
1. Map onto the tangent plane using M_k
2. Compute the mean on the tangent plane to get a new estimate M_(k+1)
  M_(k+1) = Exp_(M_k)( (1/n) Σ_i Log_(M_k)(A_i) )
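A small sketch of this iteration for the circle example given earlier, using the complex-exponential Exp/Log maps (the function names, initial guess and tolerance are my own choices):

```python
import numpy as np

def log_map(P, A):
    """Circle Log map: angle of A relative to P (a coordinate on the tangent line at P)."""
    return np.angle(A / P)

def exp_map(P, X):
    """Circle Exp map: rotate P by the tangent coordinate X."""
    return P * np.exp(1j * X)

def intrinsic_mean(A, iters=100, tol=1e-9):
    """Iterate: map samples to the tangent plane at the current estimate, average, map back."""
    M = A[0]                                   # initial guess
    for _ in range(iters):
        X_bar = np.mean([log_map(M, a) for a in A])
        M_new = exp_map(M, X_bar)
        if abs(M_new - M) < tol:
            break
        M = M_new
    return M

# Usage: angles clustered around +/-pi, where the naive average of angles (0) is wrong
angles = np.array([3.0, -3.0, 3.1, -3.1])
A = np.exp(1j * angles)
print(np.angle(intrinsic_mean(A)))             # approximately +/-pi
```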
Intrinsic Mean
• For many manifolds, this procedure will converge to the intrinsic mean
– Convergence not always guaranteed
• Other statistics and probability distributions on manifolds are problematic.
– Can hypothesise a normal distribution on the tangent plane, but distortions are inevitable
Some useful manifolds and maps
• Some useful manifolds and exponential maps
• Directional vectors (surface normals etc.): unit vectors, ⟨a, a⟩ = 1
  x = (θ / sin θ)(a − p cos θ)   (Log map)
  a = p cos θ + x (sin θ / θ)   (Exp map)
  where cos θ = ⟨a, p⟩ (equivalently θ = ‖x‖)
• a, p are unit vectors; x lies in an (n−1)-dimensional space
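A sketch of these two maps for unit vectors (numpy; function names and the small-angle guard are my own):

```python
import numpy as np

def sphere_log(p, a, eps=1e-12):
    """Log map for unit vectors: x = (theta/sin theta)(a - p cos theta), cos theta = <a, p>."""
    cos_t = np.clip(np.dot(a, p), -1.0, 1.0)
    theta = np.arccos(cos_t)
    if theta < eps:
        return np.zeros_like(p)                 # a == p: maps to the origin
    return (theta / np.sin(theta)) * (a - p * cos_t)

def sphere_exp(p, x, eps=1e-12):
    """Exp map for unit vectors: a = p cos theta + x (sin theta / theta), theta = ||x||."""
    theta = np.linalg.norm(x)
    if theta < eps:
        return p.copy()
    return p * np.cos(theta) + x * (np.sin(theta) / theta)

# Round trip: Exp_p(Log_p(a)) recovers a
p = np.array([0.0, 0.0, 1.0])
a = np.array([1.0, 1.0, 1.0]) / np.sqrt(3)
print(np.allclose(sphere_exp(p, sphere_log(p, a)), a))
```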
Some useful manifolds and maps
• Symmetric positive definite matrices (covariance, flow tensors etc)
• A is SPD if uᵀAu > 0 for all u ≠ 0
  X = P^(1/2) log( P^(−1/2) A P^(−1/2) ) P^(1/2)   (Log map)
  A = P^(1/2) exp( P^(−1/2) X P^(−1/2) ) P^(1/2)   (Exp map)
• A is symmetric positive definite, X is just symmetric
• log is the matrix log defined as a generalized matrix function
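A sketch of the SPD Exp/Log maps using eigendecomposition-based matrix functions (the helper names are my own):

```python
import numpy as np

def _sym_fun(S, fun):
    """Apply a scalar function to a symmetric matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * fun(w)) @ V.T

def spd_log(P, A):
    """Log map at P: X = P^(1/2) log(P^(-1/2) A P^(-1/2)) P^(1/2). X is symmetric."""
    P_half = _sym_fun(P, np.sqrt)
    P_ihalf = _sym_fun(P, lambda w: 1.0 / np.sqrt(w))
    return P_half @ _sym_fun(P_ihalf @ A @ P_ihalf, np.log) @ P_half

def spd_exp(P, X):
    """Exp map at P: A = P^(1/2) exp(P^(-1/2) X P^(-1/2)) P^(1/2). A is SPD."""
    P_half = _sym_fun(P, np.sqrt)
    P_ihalf = _sym_fun(P, lambda w: 1.0 / np.sqrt(w))
    return P_half @ _sym_fun(P_ihalf @ X @ P_ihalf, np.exp) @ P_half

# Round trip on random SPD matrices: spd_exp(P, spd_log(P, A)) recovers A
B = np.random.randn(4, 4); P = B @ B.T + 4 * np.eye(4)
C = np.random.randn(4, 4); A = C @ C.T + 4 * np.eye(4)
print(np.allclose(spd_exp(P, spd_log(P, A)), A))
```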
Some useful manifolds and maps
• Orthogonal matrices (rotation matrices, eigenvector matrices)
• A is orthogonal: AAᵀ = I
  X = log(Pᵀ A)   (Log map)
  A = P exp(X)   (Exp map)
• A orthogonal, X antisymmetric ( X + X T =0)
• These are the matrix exp and log functions as before
• In fact there are multiple solutions to the matrix log
– Only one is the required real antisymmetric matrix; not easy to find
– Rest are complex
Embedding on S n
• On S² (the surface of a sphere in 3D) the following parameterisation is well known
  x = ( r sin θ cos φ, r sin θ sin φ, r cos θ )ᵀ
• The distance between two points x and y (the length of the geodesic joining them) is the arc length d_xy
More Spherical Geometry
• But on a sphere, the distance is the arc-length between the points
  – Much neater to use the inner-product:
    ⟨x, y⟩ = r² cos θ_xy,  d_xy = r θ_xy = r cos⁻¹( ⟨x, y⟩ / r² )
  – And works in any number of dimensions
Spherical Embedding
• Say we had the distances between some objects ( d ij
), measured on the surface of a [hyper]sphere of dimension n
• The sphere (and objects) can be embedded into an n +1 dimensional space
– Let
X be the matrix of point positions
• Z = XXᵀ is a kernel matrix
• But d_ij = r cos⁻¹( ⟨x_i, x_j⟩ / r² )
• And so
  Z_ij = ⟨x_i, x_j⟩ = r² cos( d_ij / r )
• We can compute Z from D and find the spherical embedding!
Spherical Embedding
• But wait, we don’t know what r is!
• The distances D are non-Euclidean, and if we use the wrong radius, Z is not a kernel matrix
– Negative eigenvalues
• Use this to find the radius
– Choose r to minimise the negative eigenvalues
  r* = arg min_r Σ_(λ_i(Z(r)) < 0) |λ_i(Z(r))|
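A sketch of this procedure: build Z(r) over a grid of candidate radii, keep the radius with the least negative eigenvalue mass, then embed (the grid and function name are my own choices):

```python
import numpy as np

def spherical_embedding(D, radii=np.linspace(0.5, 5.0, 50)):
    """Sketch: Z(r) = r^2 cos(D / r); pick r minimising the negative eigenvalue mass, then embed."""
    def neg_mass(r):
        Z = r ** 2 * np.cos(D / r)
        lam = np.linalg.eigvalsh(Z)
        return np.abs(lam[lam < 0]).sum()

    r = min(radii, key=neg_mass)                     # crude 1-D search over candidate radii
    Z = r ** 2 * np.cos(D / r)
    lam, U = np.linalg.eigh(Z)
    idx = np.argsort(lam)[::-1]
    lam, U = np.clip(lam[idx], 0, None), U[:, idx]
    X = U * np.sqrt(lam)                             # points in the (n+1)-dimensional space
    return X, r
```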
Example: Texture Mapping
• As an alternative to unwrapping object onto a plane and texture-mapping the plane
• Embed onto a sphere and texture-map the sphere
Plane
Sphere
Backup slides
Laplacian and related processes
• As well as embedding objects onto manifolds, we can model many interesting processes on manifolds
• Example: the way ‘heat’ flows across a manifold can be very informative
• The heat equation is
  du/dt = ∇²u
• ∇² is the Laplacian, and in 3D Euclidean space it is
  ∇² = ∂²/∂x² + ∂²/∂y² + ∂²/∂z²
• On a sphere (radius r) it is
  ∇² = 1/(r² sin θ) ∂/∂θ( sin θ ∂/∂θ ) + 1/(r² sin² θ) ∂²/∂φ²
Heat flow
• Heat flow allows us to do interesting things on a manifold
• Smoothing: Heat flow is a diffusion process (will smooth the data)
• Characterising the manifold (heat content, heat kernel coefficients...)
• The Laplacian depends on the geometry of the manifold
– We may not know this
– It may be hard to calculate explicitly
• Graph Laplacian
Graph Laplacian
• Given a set of datapoints on the manifold, describe them by a graph
– Vertices are datapoints, edges are adjacency relation
• Adjacency matrix (for example)
  A_ij = e^(−d_ij²)
• The degree matrix is V_ii = Σ_j A_ij
• Then the graph Laplacian is
  L = V − A
  which approximates the manifold Laplacian as the sampling becomes dense
Heat Kernel
• Using the graph Laplacian, we can easily implement heat-flow methods on the manifold using the heat kernel
  du/dt = −Lu
  H = e^(−Lt)
• Can diffuse a function on the manifold by f′ = Hf
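A sketch of heat-kernel smoothing on a graph Laplacian (the example graph and time parameter are my own):

```python
import numpy as np

def heat_kernel(L, t):
    """Heat kernel H = exp(-L t) of a symmetric graph Laplacian, via eigendecomposition."""
    lam, U = np.linalg.eigh(L)
    return (U * np.exp(-t * lam)) @ U.T

# Usage: diffuse a spike signal over a small path graph
A = np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)   # 5-node path graph adjacency
L = np.diag(A.sum(1)) - A                              # L = V - A
f = np.zeros(5); f[0] = 1.0                            # 'heat' concentrated at node 0
print(heat_kernel(L, t=1.0) @ f)                       # f' = H f, smoothed over the graph
```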