Geometry of data sets
Alexander Gorban
University of Leicester, UK

Plan
• The problem
• Approximation of multidimensional data by low-dimensional objects
• Self-simplification of essentially high-dimensional sets
• Terra Incognita between low-dimensional sets and self-simplified high-dimensional ones

Change of era
From Einstein’s “flight from miracle”:
«… The development of this world of thought is in a certain sense a continuous flight from “miracle”.»
To the struggle with complexity:
"I think the next century will be the century of complexity." (Stephen Hawking)

Two main approaches in our struggle with complexity
• Reduction: from a large space with something interesting inside to a “minimal” space with this interesting content.
• Self-simplification: in high dimensionality many different things become similar, if we choose the proper point of view.

Principal Component Analysis (Karl Pearson, 1901)
Approximation by straight lines: find the 1st principal axis, subtract the projection, and repeat.

Principal points (K-means)
Approximation by smaller finite sets: centres y(i), data points x(j).
1. Select several centres;
2. Attach data points to the closest centres by springs;
3. Minimize the energy;
4. Repeat 2 and 3 until convergence.
(Steinhaus, 1956; Lloyd, 1957; MacQueen, 1967)

Approximation by algebraic curves and surfaces
1st principal axis: are we happy with this approximation?
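To make the question concrete, here is a minimal sketch (not the author's code) of the "find the axis, subtract the projection, repeat" procedure, applied to points lying on a parabola. The helper names (`centre`, `first_axis`, `subtract_projection`, `pca`), the use of power iteration to find each axis, and the test data are all assumptions chosen for illustration.

```python
import math

def centre(data):
    """Subtract the mean of every coordinate."""
    n = len(data)
    mean = [sum(col) / n for col in zip(*data)]
    return [[v - m for v, m in zip(row, mean)] for row in data]

def first_axis(data, iters=200):
    """Leading principal axis of centred data, found by power iteration."""
    dim = len(data[0])
    w = [1.0 / math.sqrt(dim)] * dim  # arbitrary starting direction (assumption)
    for _ in range(iters):
        # v = X^T (X w), without forming the covariance matrix explicitly
        scores = [sum(a * b for a, b in zip(row, w)) for row in data]
        v = [sum(s * row[d] for s, row in zip(scores, data)) for d in range(dim)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm == 0.0:  # no variance left
            break
        w = [c / norm for c in v]
    return w

def subtract_projection(data, w):
    """Remove from every point its component along the unit vector w."""
    out = []
    for row in data:
        s = sum(a * b for a, b in zip(row, w))
        out.append([a - s * b for a, b in zip(row, w)])
    return out

def total_variance(data):
    return sum(c * c for row in data for c in row) / len(data)

def pca(data, n_axes):
    """Return the axes and the fraction of variance each one explains."""
    residual = centre(data)
    total = total_variance(residual)
    axes, explained = [], []
    for _ in range(n_axes):
        w = first_axis(residual)
        reduced = subtract_projection(residual, w)
        explained.append((total_variance(residual) - total_variance(reduced)) / total)
        axes.append(w)
        residual = reduced
    return axes, explained

# Hypothetical test data: 101 points on the parabola y = x^2, x in [-1, 1].
xs = [i / 50 - 1 for i in range(101)]
parabola = [[x, x * x] for x in xs]
axes, explained = pca(parabola, 2)
print("1st principal axis:", axes[0])
print("fraction of variance explained by each axis:", explained)
```

For this data the first axis lies along x and accounts for about 79% of the variance, although the points sit exactly on a one-dimensional curve: a straight line cannot capture that.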
Extend the space by values of additional functions and apply PCA.
Example: adding the coordinate $x^2$ to $(x, y)$ turns the parabola $y + a + bx + cx^2 = 0$ into a plane in the space with axes $x$, $x^2$, $y$.

Illustration: Nonlinear happiness
Data: COUNTRY = 1…192, YEAR = 1989,…,2005. The vector x consists of:
• Gross product per person, $/person
• Life expectancy, years
• Infant mortality, cases/1000
• Tuberculosis incidence, cases/100000
Linear index explains 76%.
Non-linear index explains 93%.
[Figure: principal curve running from Quality of Life = -1 to Quality of Life = +1, with the trajectory of Russia, 1989–2005.]

Constructing elastic nets
[Figure: a net with nodes y, an edge E with end nodes E(0) and E(1), and a rib R with central node R(0) and end nodes R(1) and R(2).]

Definition of elastic energy (we borrow this approach from splines)
For data points $x^{(j)}$, $j = 1, \dots, N$, and a net with $p$ nodes $y^{(i)}$, $s$ edges $E^{(i)}$ and $r$ ribs $R^{(i)}$:

$$U^{(Y)} = \frac{1}{N} \sum_{i=1}^{p} \sum_{x^{(j)} \in K^{(i)}} \left\| x^{(j)} - y^{(i)} \right\|^2$$

$$U^{(E)} = \sum_{i=1}^{s} \lambda_i \left\| E^{(i)}(1) - E^{(i)}(0) \right\|^2$$

$$U^{(R)} = \sum_{i=1}^{r} \mu_i \left\| R^{(i)}(1) + R^{(i)}(2) - 2R^{(i)}(0) \right\|^2$$

$$U = U^{(Y)} + U^{(E)} + U^{(R)} \to \min, \qquad \lambda_i \ge 0, \; \mu_i \ge 0$$

Here $K^{(i)}$ is the set of data points for which $y^{(i)}$ is the closest node.

Are non-linear projections better than linear projections?
Datasets: breast cancer (Wang et al., 2005); bladder cancer (Dyrskjot et al., 2003).

Principal graphs?
[Figure: a 2-star and a 3-star embedded in $R^N$.]
Generalization: what is a principal graph?

Ideal object: pluriharmonic graph embedment
Elastic k-star (k edges, k+1 nodes). The branching energy is

$$u_{k\text{-star}} = \mu_k \left\| y_0 - \frac{1}{k} \sum_{i=1}^{k} y_i \right\|^2$$

where $y_0$ is the central node of the star and $y_1, \dots, y_k$ are its leaves.
[Figure: 2-stars (ribs) and 3-stars; the ideal position of the central node S0 is the mean point of the star’s leaves; negative (repulsing) spring.]

Primitive elastic graph: all non-terminal nodes with k edges are elastic k-stars. The graph energy is

$$U^G = \sum_{\text{edges}} u_{\text{edge}} + \sum_{k} \sum_{k\text{-stars}} u_{k\text{-star}}$$

Pluriharmonic graph embedments generalize the straight line, the rectangular grid (with a proper choice of k-stars), etc.
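The graph energy above can be evaluated directly for a small embedded graph. The sketch below is illustrative only: the function name `graph_energy`, the single shared moduli `lam` and `mu`, and the edge term `lam * |y_a - y_b|^2` (taken by analogy with the edge energy of elastic nets) are assumptions, not details given on the slides.

```python
def sq_norm(v):
    return sum(c * c for c in v)

def graph_energy(nodes, edges, lam=1.0, mu=1.0):
    """Energy of a primitive elastic graph, split into (edge part, star part).

    nodes: list of coordinate tuples; edges: list of (a, b) node-index pairs.
    Every non-terminal node with k >= 2 edges is treated as the centre of a k-star.
    Assumption: one stretching modulus lam and one branching modulus mu for all terms.
    """
    neighbours = {i: [] for i in range(len(nodes))}
    u_edges = 0.0
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
        u_edges += lam * sq_norm([p - q for p, q in zip(nodes[a], nodes[b])])

    u_stars = 0.0
    for centre, leaves in neighbours.items():
        k = len(leaves)
        if k < 2:  # terminal node: not a star centre
            continue
        dim = len(nodes[centre])
        # Ideal position of the centre: the mean point of the star's leaves.
        mean = [sum(nodes[j][d] for j in leaves) / k for d in range(dim)]
        u_stars += mu * sq_norm([c - m for c, m in zip(nodes[centre], mean)])
    return u_edges, u_stars

# A straight, equally spaced chain has zero branching energy ...
print(graph_energy([(0, 0), (1, 0), (2, 0)], [(0, 1), (1, 2)]))
# ... while bending the middle node away from the mean of its neighbours costs energy.
print(graph_energy([(0, 0), (1, 1), (2, 0)], [(0, 1), (1, 2)]))
```

The star part vanishes exactly when every non-terminal node sits at the mean point of its neighbours, which is the sense in which a straight, equally spaced line is an ideal embedment.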
Principal harmonic dendrites (trees) approximating complex data structures
[Figure panels: Linear PCA; Non-linear PCA; Branching PCA.]

Visualization of 7-cluster genome sequence structure
[Figure panels: algorithm iterations; 3D PCA plot; metro map.]
Here clusters that overlap on the 3D PCA plot are in fact well separated, and the principal tree reveals this fact.

And much more for low-dimensional subsets:
• Locally Linear Embedding
• Isomap
• Laplacian Eigenmaps
• Nonlinear Multidimensional Scaling
• Independent Component Analysis
• Persistent cohomology
• …

Measure concentration effects
For large $n$: $B^n \approx S^n \approx S^{n-1}$ (almost all of the volume of the ball lies near its sphere, and almost all of the sphere lies near its equator).
[Figure: the Maxwell distribution on $S^n$.]

Self-simplification in large dim
Maxwell, Gibbs, Milman, Talagrand, Gromov, …
[Figure: projection of the sphere onto a line; the density of the shadow is Gaussian, with width of order $1/\sqrt{n}$.]

A 3D representation of an 8D hypercube
The body has the same radial distribution and the same number of vertices as the hypercube. A very small fraction of the mass lies near a vertex. Also, most of the interior is void. (Illustration by Hamprecht & Agrell, 2002)

Strange properties of high-dimensional sets
• Observable diameter of the sphere $S^n$, $n$ = 3, 10, 100, 2500. (Illustrations by V. Pestov, 2005)
• Distribution of distances for pairs of points in the unit hypercube $I^n$, $n$ = 3, 10, 100, 1000 (for random samples of 10,000 pairs).

Three provinces of the Complexity Land
• Wild complexity ???
• Reducible models (Principal Components, …)
• Self-simplification (Statistical Physics, …)
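The distance experiment for the unit hypercube $I^n$ mentioned above is easy to reproduce. The following is a rough sketch under stated assumptions: points are drawn uniformly and independently, the function name `distance_stats` is hypothetical, and the default of 2,000 pairs (rather than the 10,000 used for the illustration) is chosen only to keep pure Python fast.

```python
import math
import random
import statistics

def distance_stats(n, pairs=2000, seed=0):
    """Mean distance and relative spread (std / mean) for random pairs in I^n."""
    rng = random.Random(seed)
    dists = []
    for _pair in range(pairs):
        d2 = 0.0
        for _coord in range(n):
            # Difference of one coordinate of two independent uniform points
            diff = rng.random() - rng.random()
            d2 += diff * diff
        dists.append(math.sqrt(d2))
    mean = statistics.mean(dists)
    return mean, statistics.pstdev(dists) / mean

# Example call: compare a low and a high dimension.
for n in (3, 100):
    mean, rel = distance_stats(n)
    print(n, round(mean, 3), round(rel, 3))
```

The mean distance grows like $\sqrt{n/6}$, because each coordinate contributes $1/6$ to the expected squared distance, while the spread of the distances stays roughly constant. The relative spread therefore shrinks as $n$ grows: in high dimension almost all pairs of points are nearly equidistant.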