
Similarities, Distances and Manifold Learning

Prof. Richard C. Wilson

Dept. of Computer Science

University of York

Background

• Typically objects are characterised by features

– Face images

– SIFT features

– Object spectra

– ...

• If we measure n features → n-dimensional space

• The arena for our problem is an n-dimensional vector space

• Example: Eigenfaces

Background

• Raw pixel values: n by m gives nm features

• Feature space is space of all n by m images

Background

• The space of all face-like images is smaller than the space of all images

• Assumption is faces lie on a smaller manifold embedded in the global space

[Figure: face images form a small subset of the space of all images]

Manifold: a space which locally looks Euclidean

Manifold learning: finding the manifold representing the objects we are interested in

All objects should be on the manifold, non-objects outside

Part I: Euclidean Space

Position, Similarity and Distance

Manifold Learning in Euclidean space

Some famous techniques

Part II: Non-Euclidean Manifolds

Assessing Data

Nature and Properties of Manifolds

Data Manifolds

Learning some special types of manifolds

Part III: Advanced Techniques

Methods for intrinsically curved manifolds

Thanks to Edwin Hancock, Eliza Xu, Bob Duin for contributions

And support from the EU SIMBAD project

Part I: Euclidean Space

Position

The main arena for pattern recognition and machine learning problems is vector space

– A set of n well defined features collected into a vector

ℝⁿ

Also defined are addition of vectors and multiplication by a scalar

Feature vector → position

Similarity

To make meaningful progress, we need a notion of similarity

Inner product:

$\langle x, y \rangle = \sum_i x_i y_i$

• The inner-product $\langle x, y \rangle$ can be considered to be a similarity between x and y

Induced norm

• The self-similarity $\langle x, x \rangle$ is the square of the 'size' of x and gives rise to the induced norm, or length, of x:

$\|x\| = \sqrt{\langle x, x \rangle}$

• Finally, the length of x allows the definition of a distance in our vector space as the length of the vector joining x and y:

$d(x, y) = \|x - y\| = \sqrt{\langle x - y,\; x - y \rangle}$

• Inner product also gets us distance

Euclidean space

• If we have a vector space for features, and the usual inner product, all three are connected:

[Diagram: Position $x$, $y$ ↔ Similarity $\langle x, y \rangle$ ↔ Distance $d(x, y)$]

non-Euclidean Inner Product

• If the inner-product has the form

$\langle x, y \rangle = x^T y = \sum_i x_i y_i$

then the vector space is Euclidean

• Note we recover all the expected stuff for Euclidean space, i.e.

$\|x\| = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$

$d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2}$

• The inner-product doesn't have to be like this; for example in Einstein's special relativity, the inner-product of spacetime is

$\langle x, y \rangle = x_1 y_1 + x_2 y_2 + x_3 y_3 - x_4 y_4$

The Golden Trio

• In Euclidean space, the concepts of position, similarity and distance are elegantly connected

[Diagram: Position $X$ ↔ Similarity $K$ ↔ Distance $D$]

Point position matrix

• In a normal manifold learning problem, we have a set of samples

$X = \{x_1, x_2, \ldots, x_m\}$

• These can be collected together in a matrix $X$ whose rows are the (transposed) feature vectors:

$X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_m^T \end{pmatrix}$

I use this convention, but others may write them vertically

Centreing

A common and important operation is centreing – moving the mean to the origin

– Centred points behave better

The matrix of means is $\bar{X} = JX/m$, where $J$ is the all-ones matrix

This can be done in one step with

$C = I - J/m, \qquad X_c = CX$

$C$ is the centreing matrix (and is symmetric, $C = C^T$)
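A minimal NumPy sketch of this centreing operation, assuming points are stored as the rows of X (the convention used above); the function name is illustrative:

    import numpy as np

    def centre(X):
        """Centre the point-position matrix X (one point per row)."""
        m = X.shape[0]
        C = np.eye(m) - np.ones((m, m)) / m   # centreing matrix C = I - J/m
        return C @ X                          # same result as X - X.mean(axis=0)

    # usage: the centred points have zero mean
    X = np.random.rand(5, 3)
    print(np.allclose(centre(X).mean(axis=0), 0.0))   # True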

Position-Similarity

• The similarity matrix $K$ is defined as

$K_{ij} = \langle x_i, x_j \rangle$

• From the definition of $X$, we simply get

$K = XX^T$

• The Gram matrix is the similarity matrix of the centred points (from the definition of $X$)

$K_c = CXX^T C^T = CKC$

– i.e. a centring operation on $K$


$K_c$ is really a kernel matrix for the points (a linear kernel)

Position-Similarity

• To go from $K$ to $X$, we need to consider the eigendecomposition of $K$

$K = U \Lambda U^T$

• Since $K = XX^T$, as long as we can take the square root of $\Lambda$ then we can find $X$ as

$X = U \Lambda^{1/2}$

Kernel embedding

First manifold learning method – kernel embedding

Finds a Euclidean manifold from object similarities

$K = U \Lambda U^T, \qquad X = U \Lambda^{1/2}$

• Embeds a kernel matrix into a set of points in Euclidean space (the points are automatically centred)

• K must have no negative eigenvalues, i.e. it is a kernel matrix (Mercer condition)
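A minimal sketch of kernel embedding, under the assumption that K is (numerically close to) positive semidefinite; tiny negative eigenvalues from rounding are clipped to zero:

    import numpy as np

    def kernel_embedding(K, dim=None):
        """Embed a kernel (similarity) matrix K into points X with K ~ X X^T.
        If K is the centred Gram matrix CKC, the embedded points are centred too."""
        evals, evecs = np.linalg.eigh(K)
        order = np.argsort(evals)[::-1]           # largest eigenvalues first
        evals, evecs = evals[order], evecs[:, order]
        evals = np.clip(evals, 0.0, None)         # Mercer condition: no negative eigenvalues
        X = evecs * np.sqrt(evals)                # X = U Lambda^{1/2}
        return X[:, :dim] if dim is not None else X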

Similarity-Distance

$D_{s,ij} = d(x_i, x_j)^2 = \langle x_i - x_j,\; x_i - x_j \rangle = \langle x_i, x_i \rangle + \langle x_j, x_j \rangle - 2\langle x_i, x_j \rangle = K_{ii} + K_{jj} - 2K_{ij}$

• We can easily determine $D_s$ from $K$

Similarity-Distance

What about finding $K$ from $D_s$?

$D_{s,ij} = K_{ii} + K_{jj} - 2K_{ij}$

Looking at the top equation, we might imagine that $K = -\frac{1}{2} D_s$ is a suitable choice

• Not centred; the relationship is actually

$K = -\frac{1}{2} C D_s C$

Classic MDS

• Classic Multidimensional Scaling embeds a (squared) distance matrix into Euclidean space

• Using what we have so far, the algorithm is simple

$K = -\frac{1}{2} C D_s C$  (compute the kernel)

$K = U \Lambda U^T$  (eigendecompose the kernel)

$X = U \Lambda^{1/2}$  (embed the kernel)

• This is MDS

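A compact sketch of classic MDS built from the steps above (double-centre the squared distances, then kernel-embed); the function name and default output dimension are illustrative choices:

    import numpy as np

    def classic_mds(Ds, dim=2):
        """Classic MDS: embed a matrix of squared distances Ds into dim dimensions."""
        m = Ds.shape[0]
        C = np.eye(m) - np.ones((m, m)) / m       # centreing matrix
        K = -0.5 * C @ Ds @ C                     # compute the kernel
        evals, evecs = np.linalg.eigh(K)          # eigendecompose the kernel
        order = np.argsort(evals)[::-1]
        evals = np.clip(evals[order], 0.0, None)
        evecs = evecs[:, order]
        return (evecs * np.sqrt(evals))[:, :dim]  # embed: X = U Lambda^{1/2}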

The Golden Trio

[Diagram: Position $X$, Distance $D$ and Similarity $K$, linked by MDS and kernel embedding]

$K = -\frac{1}{2} C D_s C, \qquad D_{s,ij} = K_{ii} + K_{jj} - 2K_{ij}$

Kernel methods

• A kernel is a function $k(i, j)$ which computes an inner-product $k(i, j) = \langle x_i, x_j \rangle$

– But without needing to know the actual points (the space is implicit)

• Using a kernel function we can directly compute K without knowing X


Kernel methods

• The implied space may be very high dimensional, but a true kernel will always produce a positive semidefinite K and the implied space will be Euclidean

• Many (most?) PR algorithms can be kernelized

– Made to use $K$ rather than $X$ or $D$

• The trick is to note that any interesting vector should lie in the space spanned by the examples we are given

• Hence it can be written as a linear combination

$u = \alpha_1 x_1 + \alpha_2 x_2 + \cdots + \alpha_m x_m = X^T \alpha$

• Look for α instead of u

Kernel PCA

• What about PCA? PCA solves the following problem (for unit-length $u$):

$u^* = \arg\max_u u^T \Sigma u = \arg\max_u \frac{1}{n} u^T X^T X u$

• Let's kernelize, substituting $u = X^T \alpha$:

$\frac{1}{n} u^T X^T X u = \frac{1}{n} (X^T \alpha)^T X^T X (X^T \alpha) = \frac{1}{n} \alpha^T (XX^T)(XX^T) \alpha = \frac{1}{n} \alpha^T K^2 \alpha$

Kernel PCA

$K^2$ has the same eigenvectors as $K$, so the eigenvectors of PCA are the same as the eigenvectors of $K$

• The eigenvalues of PCA are related to the eigenvalues of $K$ by

$\lambda_{\mathrm{PCA}} = \frac{1}{n} \lambda_K^2$

• Kernel PCA is a kernel embedding with an externally provided kernel matrix

Kernel PCA

• So kernel PCA gives the same solution as kernel embedding

– The eigenvalues are modified a bit

• They are essentially the same thing in Euclidean space

• MDS uses the kernel and kernel embedding

• MDS and PCA are essentially the same thing in Euclidean space

• Kernel embedding, MDS and PCA all give the same answer for a set of points in Euclidean space

Some useful observations

• Your similarity matrix is Euclidean iff it has no negative eigenvalues (i.e. it is a kernel matrix and PSD)

• By similar reasoning, your distance matrix is Euclidean iff the similarity matrix derived from it is PSD

• If the feature space is small but the number of samples is large, then the covariance matrix is small and it is better to do normal PCA (on the covariance matrix)

• If the feature space is large and the number of samples is small, then the kernel matrix will be small and it is better to do kernel embedding

Part II: Non-Euclidean Manifolds

Non-linear data

• Much of the data in computer vision lies in a high-dimensional feature space but is constrained in some way

– The space of all images of a face is a subspace of the space of all possible images

– The subspace is highly non-linear but low dimensional

(described by a few parameters)

Non-linear data

• This cannot be exploited by the linear subspace methods like PCA

– These assume that the subspace is a Euclidean space as well

• A classic example is the 'swiss roll' data:

‘Flat’ Manifolds

• Fundamentally different types of data, for example:

• The embedding of this data into the high-dimensional space is highly curved

– This is called extrinsic curvature, the curvature of the manifold with respect to the embedding space

• Now imagine that this manifold was a piece of paper; you could unroll the paper into a flat plane without distorting it

– No intrinsic curvature, in fact it is homeomorphic to Euclidean space

Curved manifold

• This manifold is different:

• It must be stretched to map it onto a plane

– It has non-zero intrinsic curvature

• A flatlander living on this manifold can tell that it is curved, for example by measuring the ratio of the radius to the circumference of a circle

• In the first case, we might still hope to find a Euclidean embedding

• We can never find a distortion-free Euclidean embedding of the second (in the sense that the distances will always have errors)

Intrinsically Euclidean Manifolds

• We cannot use the previous methods on the second type of manifold, but there is still hope for the first

• The manifold is embedded in Euclidean space, but Euclidean distance is not the correct way to measure distance

• The Euclidean distance ‘shortcuts’ the manifold

• The geodesic distance calculates the shortest path along the manifold

Geodesics

• The geodesic generalizes the concept of distance to curved manifolds

– The shortest path joining two points which lies completely within the manifold

• If we can correctly compute the geodesic distances, and the manifold is intrinsically flat, we should get Euclidean distances which we can plug into our Euclidean geometry machine

[Diagram: geodesic distances provide Distance $D$, which links to Similarity $K$ and Position $X$]

ISOMAP

• ISOMAP is exactly such an algorithm

• Approximate geodesic distances are computed for the points from a graph

• Nearest-neighbours graph

– For neighbours, Euclidean distance ≈ geodesic distance

– For non-neighbours, geodesic distance is approximated by the shortest path distance in the graph

• Once we have the distances $D$, we can use MDS to find a Euclidean embedding

ISOMAP

• ISOMAP:

– Neighbourhood graph

– Shortest path algorithm

– MDS

• ISOMAP is distance-preserving – embedded distances should be close to geodesic distances
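A rough sketch of the ISOMAP pipeline using SciPy's shortest-path routine; n_neighbours is a free parameter, and classic_mds is the sketch given earlier (both are illustrative, not a reference implementation):

    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from scipy.sparse.csgraph import shortest_path

    def isomap(X, n_neighbours=10, dim=2):
        """ISOMAP: neighbourhood graph -> graph shortest paths -> classic MDS."""
        D = squareform(pdist(X))                   # Euclidean distances between samples
        G = np.full_like(D, np.inf)                # inf = no edge
        for i in range(len(D)):
            nn = np.argsort(D[i])[1:n_neighbours + 1]
            G[i, nn] = D[i, nn]                    # keep the k nearest neighbours
        G = np.minimum(G, G.T)                     # symmetrise the graph
        geo = shortest_path(G, method='D', directed=False)   # approximate geodesics
        return classic_mds(geo ** 2, dim)          # MDS on squared geodesic distances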

Laplacian Eigenmap

• The Laplacian Eigenmap is another graph-based method of embedding non-linear manifolds into Euclidean space

• As with ISOMAP, form a neighbourhood graph for the datapoints

• Find the graph Laplacian as follows

• The adjacency matrix $A$ is

$A_{ij} = \begin{cases} e^{-d_{ij}^2/t} & \text{if } i \text{ and } j \text{ are connected} \\ 0 & \text{otherwise} \end{cases}$

• The 'degree' matrix $D$ is the diagonal matrix

$D_{ii} = \sum_j A_{ij}$

• The normalized graph Laplacian is

$L = I - D^{-1/2} A D^{-1/2}$

Laplacian Eigenmap

• We find the Laplacian eigenmap embedding using the eigendecomposition of L

$L = U \Lambda U^T$

• The embedded positions are

$X = D^{-1/2} U$

• Similar to ISOMAP

– Structure preserving not distance preserving
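A minimal sketch of the Laplacian eigenmap as described above, assuming a precomputed pairwise distance matrix D and a heat-kernel width t (both illustrative parameter names):

    import numpy as np

    def laplacian_eigenmap(D, n_neighbours=10, t=1.0, dim=2):
        """Laplacian eigenmap from a pairwise distance matrix D."""
        m = D.shape[0]
        A = np.zeros((m, m))
        for i in range(m):
            nn = np.argsort(D[i])[1:n_neighbours + 1]   # neighbourhood graph
            A[i, nn] = np.exp(-D[i, nn] ** 2 / t)       # A_ij = exp(-d_ij^2 / t)
        A = np.maximum(A, A.T)                          # symmetrise
        Dinv = np.diag(1.0 / np.sqrt(A.sum(axis=1)))    # D^{-1/2} (degree matrix)
        L = np.eye(m) - Dinv @ A @ Dinv                 # normalized graph Laplacian
        evals, evecs = np.linalg.eigh(L)                # eigenvalues in ascending order
        U = evecs[:, 1:dim + 1]                         # skip the trivial constant eigenvector
        return Dinv @ U                                 # X = D^{-1/2} U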

Locally-Linear Embedding

• Locally-linear Embedding is another classic method which also begins with a neighbourhood graph

• We make point $i$ (in the original data) from a weighted sum of the neighbouring points

$\hat{x}_i = \sum_j W_{ij} x_j$

• $W_{ij}$ is 0 for any point $j$ not in the neighbourhood (and for $i = j$)

• We find the weights by minimising the reconstruction error

$\min_W \sum_i |\hat{x}_i - x_i|^2$

– Subject to the constraints that the weights are non-negative and sum to 1

$W_{ij} \geq 0, \qquad \sum_j W_{ij} = 1$

• Gives a relatively simple closed-form solution

Locally-Linear Embedding

• These weights encode how well a point j represents a point i and can be interpreted as the adjacency between i and j

• A low-dimensional embedding is then found by finding points $y_i$ which minimise the error

$\min_Y \sum_i |\hat{y}_i - y_i|^2, \qquad \hat{y}_i = \sum_j W_{ij} y_j$

• In other words, we find a low-dimensional embedding which preserves the adjacency relationships

• The solution to this embedding problem turns out to be simply the eigenvectors of the matrix $M$

$M = (I - W)^T (I - W)$

• LLE is scale-free: the final points have the covariance matrix I

– Unit scale
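A rough sketch of LLE under the description above; for simplicity it enforces only the sum-to-one constraint on the weights (a common simplification) and uses a small regulariser in the local solve:

    import numpy as np

    def lle(X, n_neighbours=10, dim=2, reg=1e-3):
        """Locally-linear embedding: reconstruction weights, then eigenvectors of M."""
        m = X.shape[0]
        D = np.linalg.norm(X[:, None] - X[None], axis=2)
        W = np.zeros((m, m))
        for i in range(m):
            nn = np.argsort(D[i])[1:n_neighbours + 1]
            Z = X[nn] - X[i]                           # neighbours relative to x_i
            G = Z @ Z.T                                # local Gram matrix
            G += reg * np.trace(G) * np.eye(len(nn))   # regularise for stability
            w = np.linalg.solve(G, np.ones(len(nn)))
            W[i, nn] = w / w.sum()                     # weights sum to 1
        M = (np.eye(m) - W).T @ (np.eye(m) - W)
        evals, evecs = np.linalg.eigh(M)
        return evecs[:, 1:dim + 1]                     # skip the constant eigenvector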

Comparison

• LLE might seem like quite a different process to the previous two, but actually very similar

• We can interpret the process as producing a kernel matrix followed by scale-free kernel embedding

$K = (k-1)I - \frac{k}{n}J + W + W^T - W^T W$  (where $k$ is the largest eigenvalue of $M$)

$K = U \Lambda U^T, \qquad X = U$

                     ISOMAP                     Lap. Eigenmap            LLE
Representation       Neighbourhood graph        Neighbourhood graph      Neighbourhood graph
Similarity matrix    From geodesic distances    Graph Laplacian          Reconstruction weights
Embedding            $X = U \Lambda^{1/2}$      $X = D^{-1/2} U$         $X = U$

Comparison

• ISOMAP is the only method which directly computes and uses the geodesic distances

– The other two depend indirectly on the distances through local structure

• LLE is scale-free, so the original distance scale is lost, but the local structure is preserved

• Computing the necessary local dimensionality to find the correct nearest neighbours is a problem for all such methods

Non-Euclidean data

• Data is Euclidean iff $K$ is positive semidefinite (psd)

• Unless you are using a kernel function, this is often not true

• Why does this happen?

What type of data do I have?

• Starting point: distance matrix

• However we do not know a priori if our measurements are representable on a manifold

– We will call them dissimilarities

• Our starting point to answer the question "What type of data do I have?" will be a matrix of dissimilarities $D$ between objects

• Types of dissimilarities

– Euclidean (no intrinsic curvature)

– Non-Euclidean, metric (curved manifold)

– Non-metric (no point-like manifold representation)

Causes

• Example: Chicken pieces data

• Distance by alignment

• Global alignment of everything could find Euclidean distances

• Only local alignments are practical

Causes

Dissimilarities may also be non-metric

The data is metric if it obeys the metric conditions

1. $D_{ij} \geq 0$ (non-negativity)

2. $D_{ij} = 0$ iff $i = j$ (identity of indiscernibles)

3. $D_{ij} = D_{ji}$ (symmetry)

4. $D_{ij} \leq D_{ik} + D_{kj}$ (triangle inequality)

Reasonable dissimilarities should meet 1 & 2

Causes

• Symmetry: $D_{ij} = D_{ji}$

• May not be symmetric by definition

• Alignment: aligning $i$ onto $j$ may find a better solution than aligning $j$ onto $i$

Causes

• Triangle violations: $D_{ij} \leq D_{ik} + D_{kj}$

• 'Extended objects': if an extended object $k$ overlaps both $i$ and $j$, we can have $D_{ik} \approx 0$ and $D_{kj} \approx 0$ while $D_{ij} > 0$, violating the triangle inequality

• Finally, noise in the measurement of $D$ can cause all of these effects

Tests(1)

• Find the similarity matrix

$K = -\frac{1}{2} C D_s C$

• The data is Euclidean iff K is positive semidefinite (no negative eigenvalues)

– K is a kernel, explicit embedding from kernel embedding

• We can then use K in a kernel algorithm

Tests(2)

• Negative eigenfraction (NEF)

$\mathrm{NEF} = \dfrac{\sum_{\lambda_i < 0} |\lambda_i|}{\sum_i |\lambda_i|}$

• Between 0 and 0.5
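A small sketch for testing a matrix of squared dissimilarities: build the candidate kernel and report the negative eigenfraction (the function name is illustrative):

    import numpy as np

    def negative_eigenfraction(Ds):
        """NEF of a squared-dissimilarity matrix: 0 means Euclidean."""
        m = Ds.shape[0]
        C = np.eye(m) - np.ones((m, m)) / m
        K = -0.5 * C @ Ds @ C                     # candidate kernel
        evals = np.linalg.eigvalsh(K)
        return np.abs(evals[evals < 0]).sum() / np.abs(evals).sum()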

Tests(3)

1. $D_{ij} \geq 0$ (non-negativity)

2. $D_{ij} = 0$ iff $i = j$ (identity of indiscernibles)

3. $D_{ij} = D_{ji}$ (symmetry)

4. $D_{ij} \leq D_{ik} + D_{kj}$ (triangle inequality)

– Check these for your data (the 4th involves checking all triples)

– Metric data is embeddable on a (curved) Riemannian manifold

Corrections

• If the data is non-metric or non-Euclidean, we can ‘correct it’

• Symmetry violations

– Average: $D_{ij} \leftarrow \frac{1}{2}(D_{ij} + D_{ji})$, or take the min or max of $D_{ij}$ and $D_{ji}$, as appropriate

• Triangle violations

– Constant offset: $D_{ij} \leftarrow D_{ij} + c$ (for $i \neq j$)

– This will also remove non-Euclidean behaviour for large enough c

• Euclidean violations

– Discard negative eigenvalues

• There are many other approaches

* "On Euclidean corrections for non-Euclidean dissimilarities", Duin, Pekalska, Harol, Lee and Bunke, S+SSPR 08
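A small sketch of the two simplest corrections listed above (symmetrise by averaging, then add a constant off-diagonal offset); the choice of c is left to the user:

    import numpy as np

    def correct_dissimilarities(D, c=0.0):
        """Symmetrise a dissimilarity matrix by averaging, then add an offset c off the diagonal."""
        D = 0.5 * (D + D.T)                           # fix symmetry violations
        if c > 0:
            D = D + c * (1.0 - np.eye(D.shape[0]))    # constant offset for i != j
        return D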

Part III: Advanced techniques for non-Euclidean embeddings

Known Manifolds

• Sometimes we have data which lies on a known but non-Euclidean manifold

• Examples in Computer Vision

– Surface normals

– Rotation matrices

– Flow tensors (DT-MRI)

• This is not Manifold Learning, as we already know what the manifold is

• What tools do we need to be able to process data like this?

– As before, distances are the key

Example: 2D direction

Direction of an edge in an image, encoded as a unit vector

[Figure: unit direction vectors $x_1$, $x_2$ on a circle and their Euclidean average, which lies inside the circle]

The average of the direction vectors,

$\bar{x} = \frac{1}{n}\sum_i x_i,$

isn't even a direction vector (not unit length), let alone the correct 'average' direction

The normal definition of the mean is not correct

– Because the manifold is curved

Tangent space

• The tangent space ($T_P$) is the Euclidean space which is parallel to the manifold ($M$) at a particular point ($P$)

[Figure: tangent plane $T_P$ touching the manifold $M$ at the point $P$]

• The tangent space is a very useful tool because it is Euclidean

Exponential Map

• Exponential map:

$\mathrm{Exp}_P : T_P \to M, \qquad A = \mathrm{Exp}_P(X)$

• $\mathrm{Exp}_P$ maps a point $X$ on the tangent plane onto a point $A$ on the manifold

• $P$ is the centre of the mapping and is at the origin of the tangent space

– The mapping is one-to-one in a local region of $P$

– The most important property of the mapping is that the distances to the centre $P$ are preserved

$d(X, P) = d(A, P)$

– The geodesic distance on the manifold equals the Euclidean distance on the tangent plane (for distances to the centre only)

Exponential map

• The log map goes the other way, from the manifold to the tangent plane:

$\mathrm{Log}_P : M \to T_P, \qquad X = \mathrm{Log}_P(A)$

Exponential Map

• Example on the circle: Embed the circle in the complex plane

• The manifold representing the circle is a complex number with magnitude 1, which can be written $x + iy = e^{i\theta}$

[Figure: the unit circle in the complex plane, with $P = e^{i\theta_P}$ on the circle and the tangent line $T_P$ at $P$]

• In this case it turns out that the map is related to the normal exp and log functions

$X = \mathrm{Log}_P(A) = -i\log\frac{A}{P} = -i\log\left(e^{i\theta_A} e^{-i\theta_P}\right) = \theta_A - \theta_P$

$A = \mathrm{Exp}_P(X) = P\exp(iX) = e^{i\theta_P}\exp\left(i(\theta_A - \theta_P)\right) = e^{i\theta_A}$

Intrinsic mean

• The mean of a set of samples is usually defined as the sum of the samples divided by the number

– This is only true in Euclidean space

• A more general formula is

$\hat{x} = \arg\min_x \sum_i d_g^2(x, x_i)$

• Minimises the squared geodesic distances from the mean to the samples (equivalent in Euclidean space)

Intrinsic mean

• We can compute this intrinsic mean using the exponential map

• If we knew what the mean was, then we can use the mean as the centre of a map

$X_i = \mathrm{Log}_M(A_i)$

• From the properties of the Exp-map, the distances are the same

$d_e(X_i, M) = d_g(A_i, M)$

• So the mean on the tangent plane is equal to the mean on the manifold

Intrinsic mean

• Start with a guess at the mean and move towards the correct answer

• This gives us the following algorithm:

– Guess at a mean $M_0$

1. Map the points onto the tangent plane using the current estimate $M_k$

2. Compute the mean on the tangent plane and map it back to get the new estimate $M_{k+1}$:

$M_{k+1} = \mathrm{Exp}_{M_k}\!\left(\frac{1}{n}\sum_i \mathrm{Log}_{M_k}(A_i)\right)$
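A minimal sketch of this iteration for unit vectors on the sphere, using the Log/Exp maps for directional data given on a later slide; the tolerance and iteration cap are arbitrary choices:

    import numpy as np

    def sphere_log(p, a):
        """Log map on the unit sphere: tangent vector at p pointing towards a."""
        theta = np.arccos(np.clip(np.dot(p, a), -1.0, 1.0))
        if theta < 1e-12:
            return np.zeros_like(p)
        return theta * (a - p * np.cos(theta)) / np.sin(theta)

    def sphere_exp(p, x):
        """Exp map on the unit sphere: move from p along the tangent vector x."""
        theta = np.linalg.norm(x)
        if theta < 1e-12:
            return p
        return p * np.cos(theta) + x * np.sin(theta) / theta

    def intrinsic_mean(points, iters=50, tol=1e-9):
        """Iterative intrinsic (Karcher) mean of unit vectors on the sphere."""
        M = points[0] / np.linalg.norm(points[0])      # initial guess M_0
        for _ in range(iters):
            T = np.mean([sphere_log(M, a) for a in points], axis=0)   # mean on tangent plane
            M = sphere_exp(M, T)                                      # map back to the manifold
            if np.linalg.norm(T) < tol:                               # converged
                break
        return M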

Intrinsic Mean

• For many manifolds, this procedure will converge to the intrinsic mean

– Convergence not always guaranteed

• Other statistics and probability distributions on manifolds are problematic.

– We can hypothesise a normal distribution on the tangent plane, but distortions are inevitable

Some useful manifolds and maps

• Some useful manifolds and exponential maps

• Directional vectors (surface normals etc.): unit vectors, $\langle a, a \rangle = 1$

$X = \frac{\theta}{\sin\theta}(a - p\cos\theta)$, with $\theta = \cos^{-1}\langle a, p \rangle$  (Log map)

$a = p\cos\theta + \frac{\sin\theta}{\theta}X$, with $\theta = \|X\|$  (Exp map)

• $a$, $p$ are unit vectors; $X$ lies in an $(n-1)$-dimensional space

Some useful manifolds and maps

• Symmetric positive definite matrices (covariance, flow tensors etc):

$A$ with $u^T A u > 0$ for all $u \neq 0$

$X = P^{1/2}\log\!\left(P^{-1/2} A P^{-1/2}\right)P^{1/2}$  (Log map)

$A = P^{1/2}\exp\!\left(P^{-1/2} X P^{-1/2}\right)P^{1/2}$  (Exp map)

• A is symmetric positive definite, X is just symmetric

• log is the matrix log defined as a generalized matrix function
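A short sketch of these Log/Exp maps for SPD matrices, using SciPy's matrix exp/log and assuming P and A are genuinely symmetric positive definite:

    import numpy as np
    from scipy.linalg import expm, logm

    def _spd_sqrt(P):
        """P^{1/2} and P^{-1/2} for an SPD matrix, via its eigendecomposition."""
        w, V = np.linalg.eigh(P)
        return (V * np.sqrt(w)) @ V.T, (V / np.sqrt(w)) @ V.T

    def spd_log(P, A):
        """Log map at P: symmetric X = P^{1/2} log(P^{-1/2} A P^{-1/2}) P^{1/2}."""
        Ph, Phi = _spd_sqrt(P)
        return Ph @ logm(Phi @ A @ Phi) @ Ph

    def spd_exp(P, X):
        """Exp map at P: SPD A = P^{1/2} exp(P^{-1/2} X P^{-1/2}) P^{1/2}."""
        Ph, Phi = _spd_sqrt(P)
        return Ph @ expm(Phi @ X @ Phi) @ Ph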

Some useful manifolds and maps

• Orthogonal matrices (rotation matrices, eigenvector matrices)

$A$, with $AA^T = I$

$X = \log\!\left(P^T A\right)$  (Log map)

$A = P\exp(X)$  (Exp map)

• A orthogonal, X antisymmetric ( X + X T =0)

• These are the matrix exp and log functions as before

• In fact there are multiple solutions to the matrix log

– Only one is the required real antisymmetric matrix; not easy to find

– Rest are complex

Embedding on $S^n$

• On $S^2$ (the surface of a sphere in 3D) the following parameterisation is well known

$x = (r\sin\theta\cos\phi,\; r\sin\theta\sin\phi,\; r\cos\theta)^T$

• The distance between two points (the length of the geodesic) is

$d_{xy} = r\cos^{-1}\!\left[\sin\theta_x\sin\theta_y\cos(\phi_x - \phi_y) + \cos\theta_x\cos\theta_y\right]$

More Spherical Geometry

• But on a sphere, the distance is the highlighted arc-length

– Much neater to use the inner-product

$\langle x, y \rangle = \|x\|\|y\|\cos\theta_{xy} = r^2\cos\theta_{xy}$

$d_{xy} = r\,\theta_{xy} = r\cos^{-1}\frac{\langle x, y \rangle}{r^2}$

– And this works in any number of dimensions

[Figure: points $x$ and $y$ on the sphere separated by angle $\theta_{xy}$, arc length $r\theta_{xy}$]

Spherical Embedding

• Say we had the distances between some objects ($d_{ij}$), measured on the surface of a [hyper]sphere of dimension $n$

• The sphere (and objects) can be embedded into an $(n+1)$-dimensional space

– Let $X$ be the matrix of point positions; $Z = XX^T$ is a kernel matrix

• But

$d_{ij} = r\cos^{-1}\frac{\langle x_i, x_j \rangle}{r^2}$

• And so

$Z_{ij} = \langle x_i, x_j \rangle = r^2\cos\frac{d_{ij}}{r}$

• We can compute $Z$ from $D$ and find the spherical embedding!

Spherical Embedding

• But wait, we don’t know what r is!

• The distances D are non-Euclidean, and if we use the wrong radius, Z is not a kernel matrix

– Negative eigenvalues

• Use this to find the radius

– Choose $r$ to minimise the negative eigenvalues

$r^* = \arg\min_r \sum_{\lambda_i < 0}\left|\lambda_i\!\left(Z(r)\right)\right|$
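A rough sketch of the spherical embedding idea: score a candidate radius by the size of the negative eigenvalues of Z(r) and pick the best r over a simple grid (the grid search is an illustrative choice, not prescribed by the slides):

    import numpy as np

    def neg_eig_energy(D, r):
        """Sum of |negative eigenvalues| of Z(r) = r^2 cos(D / r)."""
        Z = r ** 2 * np.cos(D / r)
        evals = np.linalg.eigvalsh(Z)
        return np.abs(evals[evals < 0]).sum()

    def spherical_embedding(D, radii, dim=3):
        """Choose the radius minimising the negative eigenvalues, then embed Z(r*)."""
        r = min(radii, key=lambda rad: neg_eig_energy(D, rad))
        Z = r ** 2 * np.cos(D / r)
        evals, evecs = np.linalg.eigh(Z)
        order = np.argsort(evals)[::-1]
        evals = np.clip(evals[order], 0.0, None)
        X = (evecs[:, order] * np.sqrt(evals))[:, :dim]   # points in the ambient space
        return X, r

    # usage: radii should satisfy r >= D.max() / pi so that D / r stays within [0, pi]
    # X, r = spherical_embedding(D, radii=np.linspace(D.max() / np.pi, 5 * D.max(), 50))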

Example: Texture Mapping

• As an alternative to unwrapping an object onto a plane and texture-mapping the plane

• Embed onto a sphere and texture-map the sphere

[Figure: texture-mapping results on the plane vs. on the sphere]

Backup slides

Laplacian and related processes

• As well as embedding objects onto manifolds, we can model many interesting processes on manifolds

• Example: the way ‘heat’ flows across a manifold can be very informative

• The heat equation is

$\frac{\partial u}{\partial t} = -\nabla^2 u$

• $\nabla^2$ is the Laplacian, and in 3D Euclidean space it is

$\nabla^2 = \frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2} + \frac{\partial^2}{\partial z^2}$

• On a sphere it is

$\nabla^2 = \frac{1}{r^2\sin\theta}\frac{\partial}{\partial\theta}\!\left(\sin\theta\,\frac{\partial}{\partial\theta}\right) + \frac{1}{r^2\sin^2\theta}\frac{\partial^2}{\partial\phi^2}$

Heat flow

• Heat flow allows us to do interesting things on a manifold

• Smoothing: Heat flow is a diffusion process (will smooth the data)

• Characterising the manifold (heat content, heat kernel coefficients...)

• The Laplacian depends on the geometry of the manifold

– We may not know this

– It may be hard to calculate explicitly

• Graph Laplacian

Graph Laplacian

• Given a set of datapoints on the manifold, describe them by a graph

– Vertices are datapoints, edges are adjacency relation

• Adjacency matrix (for example)

$A_{ij} = \exp\!\left(-d_{ij}^2/\sigma^2\right)$

• Then the graph Laplacian is

$L = V - A$, where $V_{ii} = \sum_j A_{ij}$

• As the datapoints sample the manifold more densely, the graph Laplacian approaches the manifold Laplacian

Heat Kernel

• Using the graph Laplacian, we can easily implement heat-flow methods on the manifold using the heat kernel

$\frac{du}{dt} = -Lu$  (heat equation)

$H = \exp(-Lt)$  (heat kernel)

• We can diffuse a function on the manifold by

$f' = Hf$
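A small sketch of heat-kernel smoothing on a neighbourhood graph, assuming a precomputed adjacency matrix A (e.g. with the Gaussian weights above) and a function f sampled at the vertices:

    import numpy as np
    from scipy.linalg import expm

    def heat_kernel_smooth(A, f, t=1.0):
        """Diffuse a vertex function f for time t: f' = exp(-L t) f."""
        V = np.diag(A.sum(axis=1))      # degree matrix
        L = V - A                       # graph Laplacian L = V - A
        H = expm(-L * t)                # heat kernel
        return H @ f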
