Geometry of data sets
Alexander Gorban
University of Leicester, UK
Plan
• The problem
• Approximation of multidimensional data by low-dimensional objects
• Self-simplification of essentially high-dimensional sets
• Terra incognita between low-dimensional sets and self-simplified high-dimensional ones
Change of era
From Einstein's "flight from miracle":
"… The development of this world of thought is in a certain sense a continuous flight from 'miracle'."
To the struggle with complexity:
"I think the next century will be the century of complexity."
Stephen Hawking
Two main approaches in our struggle with complexity:
1. Reduction: from a large space with something interesting inside to a "minimal" space with this interesting content.
2. Self-simplification: in high dimensionality many different things become similar, if we choose the proper point of view.
Principal Component Analysis (Karl Pearson, 1901)
Approximation by straight lines: find the 1st principal axis, subtract the projection, and repeat.
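A minimal numpy sketch of this "subtract the projection and repeat" (deflation) scheme; the function name `principal_axes` and its defaults are illustrative, not from the talk:

```python
import numpy as np

def principal_axes(X, n_axes=2, n_iter=500):
    """Principal axes by deflation: fit the best straight line (1st
    principal axis), subtract each point's projection onto it, repeat."""
    Xc = X - X.mean(axis=0)                # centre the data cloud
    rng = np.random.default_rng(0)
    axes = []
    for _ in range(n_axes):
        v = rng.standard_normal(Xc.shape[1])
        for _ in range(n_iter):            # power iteration on the covariance
            v = Xc.T @ (Xc @ v)
            v /= np.linalg.norm(v)
        axes.append(v)
        Xc -= np.outer(Xc @ v, v)          # subtract the projection and repeat
    return np.array(axes)
```

A single call to `np.linalg.svd` would return the same axes at once; the explicit loop only mirrors the slide's description.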
Principal points (K-means)
Approximation by smaller finite sets: centres y(i), data points x(j).
1. Select several centres;
2. Attach data points to the closest centres by springs;
3. Minimize the energy;
4. Repeat 2 & 3 until convergence.
Steinhaus, 1956; Lloyd, 1957; MacQueen, 1967
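A short sketch of these four steps (Lloyd's iterations) in numpy; the function name and defaults are my own:

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Principal points by Lloyd's iterations."""
    rng = np.random.default_rng(seed)
    # 1. select several centres (random data points)
    centres = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(max_iter):
        # 2. attach each data point to the closest centre ("springs")
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. minimize the spring energy: move centres to cluster means
        new_centres = np.array([X[labels == i].mean(axis=0)
                                if np.any(labels == i) else centres[i]
                                for i in range(k)])
        # 4. repeat 2 & 3 until the centres stop moving
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return centres, labels
```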
Approximation by algebraic curves and surfaces
The 1st principal axis: are we happy with this approximation?
Extend the space by values of additional functions and apply PCA.
[Figure: a curved cloud in the (x, y) plane; adding the coordinate x² turns the algebraic curve y + a + bx + cx² = 0 into a hyperplane in the extended space (x, x², y).]
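A sketch of this trick for a quadratic curve, assuming we append x² as the extra coordinate: in the extended space the direction of smallest variance gives the implicit equation of the curve. `fit_quadratic_curve` is a hypothetical helper name:

```python
import numpy as np

def fit_quadratic_curve(x, y):
    """Extend the space with x**2 and apply PCA there: the direction of
    smallest variance gives an implicit relation a + b*x + c*x**2 + d*y = 0,
    an algebraic curve in the original (x, y) plane."""
    Z = np.column_stack([x, x**2, y])      # extended space (x, x^2, y)
    mean = Z.mean(axis=0)
    _, V = np.linalg.eigh(np.cov((Z - mean).T))
    normal = V[:, 0]                       # eigenvector of smallest eigenvalue
    b, c, d = normal
    a = -mean @ normal                     # constant term restores the offset
    return a, b, c, d
```

For points scattered around y = x², the recovered coefficients come out proportional to a + bx + cx² + dy = 0 with (a, b, c, d) ∝ (0, 0, 1, −1).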
Illustration: nonlinear happiness (COUNTRY = 1…192, YEAR = 1989,…,2005)
x = (gross product per person, $/person;
     life expectancy, years;
     infant mortality, cases/1,000;
     tuberculosis incidence, cases/100,000)
The one-dimensional index runs from Quality of Life = −1 to Quality of Life = +1; the plot traces the trajectory of Russia.
The linear index explains 76% of the variance; the non-linear index explains 93%.
[Figure: principal curve through the country cloud, with Russia's trajectory marked year by year, 1989–2005.]
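The "explains 76% / 93%" numbers are fractions of variance explained by the one-dimensional index. A minimal sketch of how such a fraction is computed, assuming a hypothetical `project` helper that maps a data point to its closest point on the index line or curve (the country data themselves are not reproduced here):

```python
import numpy as np

def fraction_explained(X, project):
    """Fraction of variance explained by a one-dimensional index:
    1 - (residual variance after projection) / (total variance).
    `project` (hypothetical) maps a point to its closest point on the
    line or curve defining the index."""
    residuals = X - np.apply_along_axis(project, 1, X)
    total = ((X - X.mean(axis=0))**2).sum()
    return 1.0 - (residuals**2).sum() / total
```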
Constructing elastic nets
[Figure: an elastic net with nodes y, an edge with endpoints E(0), E(1), and a rib with nodes R(1), R(0), R(2).]
Definition of elastic energy (we borrow this approach from splines):

U^{(Y)} = \frac{1}{N} \sum_{i} \sum_{x(j) \in K^{(i)}} \| X_j - y^{(i)} \|^2   (approximation energy; K^{(i)} is the set of data points closest to node y^{(i)})

U^{(E)} = \sum_{i=1}^{s} \lambda_i \| E^{(i)}(1) - E^{(i)}(0) \|^2   (stretching energy of the edges)

U^{(R)} = \sum_{i=1}^{r} \mu_i \| R^{(i)}(1) + R^{(i)}(2) - 2 R^{(i)}(0) \|^2   (bending energy of the ribs)

\lambda_i \ge 0, \; \mu_i \ge 0;   U = U^{(Y)} + U^{(E)} + U^{(R)} \to \min.
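A direct numpy transcription of the three terms for the simplest elastic graph, a chain of nodes (a principal-curve grid); taking the coefficients λ_i and μ_i uniform is an assumption made for brevity:

```python
import numpy as np

def elastic_energy(X, Y, lam, mu):
    """Elastic energy of a chain of nodes Y (a principal-curve grid).
    U^(Y): mean squared distance from each data point to its closest node;
    U^(E): stretching energy of the edges between consecutive nodes;
    U^(R): bending energy of the ribs (triples of consecutive nodes)."""
    dists = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
    closest = dists.argmin(axis=1)                    # taxon K^(i) of each point
    U_Y = ((X - Y[closest])**2).sum() / len(X)
    U_E = lam * ((Y[1:] - Y[:-1])**2).sum()           # ||E(1) - E(0)||^2 terms
    U_R = mu * ((Y[2:] + Y[:-2] - 2 * Y[1:-1])**2).sum()  # ||R(1)+R(2)-2R(0)||^2
    return U_Y + U_E + U_R                            # minimized over Y
```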
Are non-linear projections better than linear projections?
Breast cancer (Wang et al., 2005); bladder cancer (Dyrskjot et al., 2003).
Principal graphs?
[Figure: graphs embedded in R^N, grown by attaching 2-stars and 3-stars.]
Generalization: what is a principal graph?
Ideal object: pluriharmonic graph embedment.
Elastic k-star (k edges, k + 1 nodes), with centre y_0 and leaves y_1, …, y_k. The branching energy is

u^{(i)}_{k\text{-star}} = \mu_k \left\| y_0 - \frac{1}{k} \sum_{i=1}^{k} y_i \right\|^2

[Figure: 2-stars (ribs) and 3-stars; a negative (repulsing) spring pulls the star centre S_0 toward its ideal position, the mean point of the star's leaves.]

Primitive elastic graph: all non-terminal nodes with k edges are elastic k-stars. The graph energy is

U^G = \sum_{\text{edges}} u_{\text{edge}} + \sum_{k} \sum_{k\text{-stars}} u_{k\text{-star}}

Pluriharmonic graph embedments generalize the straight line, the rectangular grid (with a proper choice of k-stars), etc.
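A sketch of these two sums for an arbitrary elastic graph; the data layout (index pairs for edges, centre-plus-leaves tuples for stars) and the uniform coefficients are my own choices:

```python
import numpy as np

def star_energy(y0, leaves, mu_k):
    """Branching energy of an elastic k-star: mu_k * ||y0 - mean(leaves)||^2.
    It vanishes exactly when the centre sits at the mean point of the
    star's leaves -- the ideal (pluriharmonic) position of S_0."""
    return mu_k * np.sum((y0 - leaves.mean(axis=0))**2)

def graph_energy(nodes, edges, stars, lam, mu):
    """U^G = sum of edge energies + sum of k-star energies.
    nodes: (m, d) array; edges: list of (i, j) index pairs;
    stars: list of (centre_index, [leaf indices])."""
    u_edges = sum(lam * np.sum((nodes[i] - nodes[j])**2) for i, j in edges)
    u_stars = sum(star_energy(nodes[c], nodes[list(ls)], mu)
                  for c, ls in stars)
    return u_edges + u_stars
```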
Principal harmonic dendrites (trees) approximating complex data structures.
[Figure panels: linear PCA, non-linear PCA, branching PCA.]
Visualization of the 7-cluster genome sequence structure
[Figure: algorithm iterations; 3D PCA plot; "metro map" of the principal tree.]
Clusters that overlap on the 3D PCA plot are in fact well separated, and the principal tree reveals this.
And much more for low-dimensional subsets:
• Locally Linear Embedding
• Isomap
• Laplacian Eigenmaps
• Nonlinear Multidimensional Scaling
• Independent Component Analysis
• Persistent cohomology
• …
Measure concentration effects
For large n, almost all of the volume of the ball B^n lies in a thin shell near its boundary sphere S^{n-1}; on the sphere S^n, individual coordinates follow the Maxwell (Gaussian) distribution.
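A quick numeric check of the shell concentration: the fraction of the unit ball's volume lying outside radius 1 − ε is 1 − (1 − ε)^n:

```python
# Fraction of the unit ball B^n outside radius 1 - eps: 1 - (1 - eps)**n.
eps = 0.01
for n in (3, 10, 100, 1000):
    print(f"n = {n:4d}: {1 - (1 - eps)**n:.4f} of the volume "
          f"lies in the outer 1% shell")
# At n = 1000, about 99.996% of the volume hugs the boundary sphere.
```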
Self-simplification in large dimension
(Maxwell, Gibbs, Milman, Talagrand, Gromov, …)
The projection ("shadow") of the uniform distribution on the sphere S^n onto a line is nearly Gaussian with standard deviation 1/\sqrt{n}: the density of the shadow.
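A Monte Carlo check of the Gaussian shadow, assuming the standard recipe of sampling the sphere uniformly by normalizing Gaussian vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 100_000                  # sphere S^n in R^(n+1), sample size
Z = rng.standard_normal((m, n + 1))  # uniform points on the sphere:
S = Z / np.linalg.norm(Z, axis=1, keepdims=True)
shadow = S[:, 0]                     # project onto one coordinate axis
print(shadow.std(), 1 / np.sqrt(n + 1))  # both ~0.0995: a Gaussian shadow
```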
A 3D representation of an 8D hypercube
The body has the same radial distribution and the same number of vertices as the hypercube. A very small fraction of the mass lies near a vertex; also, most of the interior is void.
(Illustration by Hamprecht & Agrell, 2002)
Self-simplification in large dimension: strange properties of high-dimensional sets
[Figure: observable diameter of the sphere S^n, n = 3, 10, 100, 2500 (illustrations by V. Pestov, 2005); distribution of distances between pairs of points in the unit hypercube I^n, n = 3, 10, 100, 1000, for random samples of 10,000 pairs.]
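The distance-concentration panel is easy to reproduce; a sketch that re-samples 10,000 random pairs in I^n for the same dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
for n in (3, 10, 100, 1000):
    a, b = rng.random((2, 10_000, n))      # 10,000 random pairs in I^n
    d = np.linalg.norm(a - b, axis=1)
    print(f"n = {n:4d}: mean = {d.mean():6.2f}, "
          f"relative spread = {d.std() / d.mean():.3f}")
```

The mean distance grows like √(n/6) while the spread stays bounded, so the relative spread shrinks: pairwise distances "self-simplify".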
Three provinces of the Complexity Land:
• Reducible models (principal components, …)
• Wild complexity ???
• Self-simplification (statistical physics, …)