Hailuoto Workshop
UNC, Stat & OR

Object Oriented Data Analysis, III
J. S. Marron
Dept. of Statistics and Operations Research, University of North Carolina
March 12, 2016

HDLSS Space is a Weird Place, I
Maximal Data Piling (with J. Y. Ahn)
In HDLSS Binary Discrimination, there is a direction where:
- Class +1 projections pile at one pt.
- Class -1 projections pile at another

HDLSS Space is a Weird Place, II
Maximal Data Piling Mathematics:
- Exists w.p. 1, when data dist’n is abs. cont. w.r.t. Lebesgue measure
- Unique within subspace generated by data
- Formula very similar to FLD (pooled within cov. replaced by global cov.)
- Same as FLD when d < n

HDLSS Space is a Weird Place, III
- ~$2^n$ MDP Dirns
- Useful for clustering?
- Hard Optimization…

HDLSS Space is a Weird Place, IV
Parallel Directions (with X. Liu)

Time Series of Curves
- Chemical Spectra, evolving over time (with J. Wendelberger & E. Kober)
- Mortality curves changing in time (with Andres Alonzo)
Visualization:
- Similar tools, PCA & Dirns
- But color according to time

Chemical Spectra, I to VIII
[Figure slides]

Demography Data
- Mortality, as a function of age
- “Chance of dying”, for Males of each 1-year age group
- Curves are years 1908 - 2002
- PCA of the family of curves

Demography Data
PCA of the family of curves for Males
- Babies & elderly “most mortal” (Raw)
- All getting better over time (Raw & PC1)
- Except 1918 - Influenza Pandemic (see Color Scale)
- Middle age most mortal (PC2):
  - 1918
  - Early 1930s - Spanish Civil War
  - 1980 - 1994 (then better): auto wrecks
- Decade Rounding (several places)

Demography Data
PCA for Males in Switzerland
- Most aspects similar
- No decade rounding (better records)
- 1918 Flu - Different Color (PC2) (see Color Scale)
- No War Changes
- Steady improvement until 70s (PC2), when auto accidents kicked in

Demography Data
Dual PCA
- Idea: Rows and Columns trade places
- Demographic Primal View: Curves are Years, Coord’s are Ages
- Demographic Dual View: Curves are Ages, Coord’s are Years
Dual PCA View, Spanish Males

Demography Data
Dual PCA View, Spanish Males
- Old people have const. mortality (raw)
- But improvement for rest (raw)
- Bad for 1918 (flu) & Spanish Civil War, but generally improving (mean)
- Improves for ages 1-6, then worse (PC1)
- Big Improvement for young (PC2)
(Age Color Key)

Discrimination for m-reps
- Classification for Lie Groups - Symm. Spaces
- S. K. Sen & S. Joshi
- What is “separating plane” (for SVM-DWD)?

Trees as Data Points, I
Brain Blood Vessel Trees - E. Bullitt & H. Wang
- Statistical Understanding of Population?
- Mean? PCA?
- Challenge: Very Non-Euclidean

Trees as Data Points, II
- Mean of Tree Population: Fréchet Approach
- PCA on Trees (based on “tree lines”)
- Theory in Place - Implementation?

HDLSS Asymptotics: Simple Paradoxes, I
For d dim’al “Standard Normal” dist’n:
  $Z = (Z_1, \ldots, Z_d)^T \sim N_d(0, I_d)$
Euclidean Distance to Origin (as $d \to \infty$):
  $\|Z\| = \sqrt{d} + O_p(1)$
- Data lie roughly on surface of sphere of radius $\sqrt{d}$
- Yet origin is point of “highest density”???
- Paradox resolved by: “density w. r. t. Lebesgue Measure”

HDLSS Asymptotics: Simple Paradoxes, II
For d dim’al “Standard Normal” dist’n:
  $Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$
Euclidean Dist. between $Z_1$ and $Z_2$ (as $d \to \infty$):
Distance tends to non-random constant:
  $\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$
- Can extend to $Z_1, \ldots, Z_n$
- Where do they all go??? (we can only perceive 3 dim’ns)

HDLSS Asymptotics: Simple Paradoxes, III
For d dim’al “Standard Normal” dist’n:
  $Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$
High dim’al Angles (as $d \to \infty$):
  $\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$
- “Everything is orthogonal”???
- Where do they all go??? (again our perceptual limitations)
- Again 1st order structure is non-random

HDLSS Asy’s: Geometrical Representation, I
Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Subspace Generated by Data
a. Hyperplane through 0, of dimension $n$
b. Points are “nearly equidistant to 0”, & dist $\approx \sqrt{d}$
c. Within plane, can “rotate towards $\sqrt{d} \times$ Unit Simplex”
d. All Gaussian data sets are “near Unit Simplex Vertices”!!!
“Randomness” appears only in rotation of simplex
With P. Hall & A. Neeman

HDLSS Asy’s: Geometrical Representation, II
Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Hyperplane Generated by Data
a. $n - 1$ dimensional hyperplane
b. Points are pairwise equidistant, dist ~ $\sqrt{2d}$
c. Points lie at vertices of “regular n-hedron”
d. Again “randomness in data” is only in rotation
e. Surprisingly rigid structure in data?

HDLSS Asy’s: Geometrical Representation, III
Simulation View: shows “rigidity after rotation”

HDLSS Asy’s: Geometrical Representation, III
Straightforward Generalizations:
- non-Gaussian data: only need moments
- non-independent: use “mixing conditions”
- Mild Eigenvalue condition on Theoretical Cov. (with J. Ahn, K. Muller & Y. Chi)
All based on simple “Laws of Large Numbers”

HDLSS Asy’s: Geometrical Representation, IV
Explanation of Observed (Simulation) Behavior: “everything similar for very high d”
- 2 popn’s are 2 simplices (i.e. regular n-hedrons)
- All are same distance from the other class
- i.e. everything is a support vector
- i.e. all sensible directions show “data piling”
- so “sensible methods are all nearly the same”
- Including 1-NN

HDLSS Asy’s: Geometrical Representation, V
Further Consequences of Geometric Representation
1. Inefficiency of DWD for uneven sample size (motivates “weighted version”, work in progress)
2. DWD more “stable” than SVM (based on “deeper limiting distributions”) (reflects intuitive idea “feeling sampling variation”) (something like “mean vs. median”)
3. 1-NN rule inefficiency is quantified.

The Future of Geometrical Representation?
- HDLSS version of “optimality” results?
- “Contiguity” approach? Params depend on d?
- Rates of Convergence?
- Improvements of DWD? (e.g. other functions of distance than inverse)
- It is still early days …

Some Carry Away Lessons
- Atoms of the Analysis: Object Oriented
- HDLSS contexts deserve further study
- DWD is attractive for HDLSS classification
- “Randomness” in HDLSS data is only in rotations (Modulo rotation, have constant simplex shape)
- How to put HDLSS asymptotics to work?
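
Simulation sketch for the Simple Paradoxes and the Geometrical Representation
A minimal check of the three first-order facts stated above, assuming i.i.d. standard normal data in d dimensions. The function names (`simulate_gaussian_cloud`, `hdlss_summary`) and the default sizes n = 5, d = 20,000 are illustrative assumptions, not taken from the slides. For large d, each norm divided by sqrt(d) and each pairwise distance divided by sqrt(2d) should be close to 1, and each pairwise angle close to 90 degrees.

```python
import math
import random


def simulate_gaussian_cloud(n, d, seed=0):
    """Draw n independent d-dimensional standard normal vectors."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def hdlss_summary(n=5, d=20_000, seed=0):
    """Rescaled norms, rescaled pairwise distances, and pairwise angles."""
    Z = simulate_gaussian_cloud(n, d, seed)
    norms = [math.sqrt(dot(z, z)) for z in Z]

    dists, angles = [], []
    for i in range(n):
        for j in range(i + 1, n):
            inner = dot(Z[i], Z[j])
            # ||Zi - Zj||^2 = ||Zi||^2 + ||Zj||^2 - 2 <Zi, Zj>
            sq_dist = norms[i] ** 2 + norms[j] ** 2 - 2.0 * inner
            dists.append(math.sqrt(max(sq_dist, 0.0)))
            # Clamp the cosine to [-1, 1] to guard against rounding error.
            cosine = max(-1.0, min(1.0, inner / (norms[i] * norms[j])))
            angles.append(math.degrees(math.acos(cosine)))

    return {
        "norm_over_sqrt_d": [r / math.sqrt(d) for r in norms],
        "dist_over_sqrt_2d": [r / math.sqrt(2 * d) for r in dists],
        "angle_deg": angles,
    }


if __name__ == "__main__":
    for name, values in hdlss_summary().items():
        print(name, [round(v, 3) for v in values])
```

Rerunning with a larger d should tighten all three summaries around 1, 1, and 90, which is the "rigid simplex, random only in rotation" picture. Changing the seed changes the rotation but not the shape.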