U. C. Davis, F. R. G. Workshop
UNC, Stat & OR

Object Oriented Data Analysis
J. S. Marron
Dept. of Statistics and Operations Research, University of North Carolina
March 17, 2016

Object Oriented Data Analysis, I
What is the “atom” of a statistical analysis?
- 1st Course: Numbers
- Multivariate Analysis Course: Vectors
- Functional Data Analysis: Curves
- More generally: Data Objects

Object Oriented Data Analysis, II
Examples:
- Medical Image Analysis
  - Images as Data Objects?
  - Shape Representations as Objects
- Micro-arrays
  - Just multivariate analysis?

Object Oriented Data Analysis, III
Typical Goals:
- Understanding population variation
  - Principal Component Analysis +
- Discrimination (a.k.a. Classification)
- Time Series of Data Objects

Object Oriented Data Analysis, IV
Major Statistical Challenge, I: High Dimension Low Sample Size (HDLSS)
- Dimension d >> sample size n
- “Multivariate Analysis” nearly useless
  - Can’t “normalize the data”
- Land of Opportunity for Statisticians
  - Need for “creative statisticians”

Object Oriented Data Analysis, V
Major Statistical Challenge, II: Data may live in non-Euclidean space
- Lie Group / Symmetric Spaces
- Trees/Graphs as data objects
Interesting Issues:
- What is “the mean” (pop’n center)?
- How do we quantify “pop’n variation”?

Statistics in Image Analysis, I
First Generation Problems:
- Denoising
- Segmentation
- Registration
(all about single images)

Statistics in Image Analysis, II
Second Generation Problems:
- Populations of Images
  - Understanding Population Variation
  - Discrimination (a.k.a. Classification)
- Complex Data Structures (& Spaces)
- HDLSS Statistics

HDLSS Statistics in Imaging
Why HDLSS (High Dim, Low Sample Size)?
- Complex 3-d Objects Hard to Represent
  - Often need d = 100’s of parameters
- Complex 3-d Objects Costly to Segment
  - Often have n = 10’s cases

Object Representation
- Landmarks (hard to find)
- Boundary Rep’ns (no correspondence)
- Medial representations
  - Find “skeleton”
  - Discretize as “atoms” called M-reps

3-d m-reps
Bladder – Prostate – Rectum (multiple objects, J. Y. Jeong)
- Medial Atoms provide “skeleton”
- Implied Boundary from “spokes” → “surface”

Illuminating Viewpoint
- Object Space: focus here on collection of data objects
- Feature Space: here conceptualize population structure via “point clouds”

PCA for m-reps, I
Major issue: m-reps live in $\mathbb{R}^3 \times \mathbb{R}^+ \times SO(3) \times SO(2)$
(locations, radius and angles)
E.g. “average” of: 2°, 3°, 358°, 359° = ???
Natural Data Structure is: Lie Groups ~ Symmetric spaces
(smooth, curved manifolds)

PCA for m-reps, II
PCA on non-Euclidean spaces? (i.e. on Lie Groups / Symmetric Spaces)
T. Fletcher: Principal Geodesic Analysis
Idea: replace “linear summary of data” with “geodesic summary of data”…

PGA for m-reps, Bladder-Prostate-Rectum
Bladder – Prostate – Rectum, 1 person, 17 days
[Figure panels: PG 1, PG 2, PG 3]
(analysis by Ja Yeon Jeong)

HDLSS Classification (i.e. Discrimination)
Background: Two Class (Binary) version:
- Using “training data” from Class +1, and from Class -1
- Develop a “rule” for assigning new data to a Class
Canonical Example: Disease Diagnosis
- New Patients are “Healthy” or “Ill”
- Determined based on measurements

HDLSS Classification (Cont.)
Ineffective Methods:
- Fisher Linear Discrimination
- Gaussian Likelihood Ratio
Less Useful Methods:
- Nearest Neighbors
- Neural Nets
(“black boxes”, no “directions” or intuition)

HDLSS Classification (Cont.)
Currently Fashionable Methods:
- Support Vector Machines
- Trees Based Approaches
New High Tech Method:
- Distance Weighted Discrimination (DWD)
  - Specially designed for HDLSS data
  - Avoids “data piling” problem of SVM
  - Solves more suitable optimization problem

HDLSS Classification (Cont.)
Comparison of Linear Methods (toy data):
$N_d(\mu, I)$, $\mu_1 = 2.2$, $n_1 = n_2 = 20$, $d = 50$
- Optimal Direction: excellent, but need dir’n in dim = 50
- Maximal Data Piling (J. Y. Ahn, D. Peña): great separation, but generalizability???
- Support Vector Machine: more separation, gen’ity, but some data piling?
- Distance Weighted Discrimination: avoids data piling, good gen’ity, Gaussians?

Distance Weighted Discrimination
Maximal Data Piling

Maximal Data Piling
Mind boggling?
- J. Y. Ahn has characterized
- Formula ~ FLD
- There are many
- Publishable???

Distance Weighted Discrimination
Based on Optimization Problem:
$$\min_{w,b} \sum_{i=1}^{n} \frac{1}{r_i}$$
More precisely: work in appropriate penalty for violations
Optimization Method (Michael Todd): Second Order Cone Programming
- Still convex gen’tion of quadratic prog’ing
- Fast greedy solution
- Can use existing software

Simulation Comparison
E.G. Above Gaussians:
- Wide array of dim’s
- SVM subst’ly worse
- MD – Bayes Optimal
- DWD close to MD

Simulation Comparison
E.G. Outlier Mixture:
- Disaster for MD
- SVM & DWD much more solid
- Dir’ns are “robust”
- SVM & DWD similar

Simulation Comparison
E.G. Wobble Mixture:
- Disaster for MD
- SVM less good
- DWD slightly better
Note: All methods come together for larger d ???
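
The “data piling” discussed in the comparison of linear methods above can be reproduced in a few lines. What follows is a minimal sketch, not the code behind the slides. It assumes the two classes sit at +2.2 and -2.2 in the first coordinate (the slides give only the value 2.2), and it takes as the maximal data piling direction the pseudo-inverse of the global sample covariance applied to the difference of class means, which is one reading of the “Formula ~ FLD” remark. The function names are hypothetical.

```python
import numpy as np

def mdp_direction(X, y):
    """Assumed MDP direction: pinv(global covariance) @ (difference of class means)."""
    Xc = X - X.mean(axis=0)                      # center with the overall mean
    S = Xc.T @ Xc / X.shape[0]                   # global (not within-class) covariance
    diff = X[y == 1].mean(axis=0) - X[y == -1].mean(axis=0)
    w = np.linalg.pinv(S, 1e-10) @ diff          # explicit cutoff drops the zero eigenvalues
    return w / np.linalg.norm(w)

def within_class_spread(X, y, w):
    """Largest within-class standard deviation of the projections onto w."""
    proj = X @ w
    return max(proj[y == 1].std(), proj[y == -1].std())

# Toy data in the spirit of the slides: d = 50, n1 = n2 = 20.
rng = np.random.default_rng(0)
d, n_per_class, shift = 50, 20, 2.2
y = np.repeat([1, -1], n_per_class)
X = rng.standard_normal((2 * n_per_class, d))
X[:, 0] += shift * y                             # assumption: classes at +/- shift

# Mean difference (MD) direction for comparison.
md = X[y == 1].mean(axis=0) - X[y == -1].mean(axis=0)
w_md = md / np.linalg.norm(md)
w_mdp = mdp_direction(X, y)

spread_md = within_class_spread(X, y, w_md)
spread_mdp = within_class_spread(X, y, w_mdp)
proj_mdp = X @ w_mdp
gap_mdp = proj_mdp[y == 1].mean() - proj_mdp[y == -1].mean()

print("within-class spread, MD direction :", spread_md)
print("within-class spread, MDP direction:", spread_mdp)
print("gap between piled classes (MDP)   :", gap_mdp)
```

Along the MDP direction every training point of a class projects to the same value, which is the “great separation, but generalizability???” behavior: the piling is a property of the training sample and need not carry over to new data.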

DWD in Face Recognition, I
Face Images as Data (with M. Benito & D. Peña)
- Registered using landmarks
- Male – Female Difference?
- Discrimination Rule?

DWD in Face Recognition, II
DWD Direction
- Good separation
- Images “make sense”
- Garbage at ends? (extrapolation effects?)

DWD in Face Recognition, III
Interesting summary: Jump between means (in DWD direction)
- Clear separation of Maleness vs. Femaleness

DWD in Face Recognition, IV
Fun Comparison: Jump between means (in SVM direction)
- Also distinguishes Maleness vs. Femaleness
- But not as well as DWD

DWD in Face Recognition, V
Analysis of difference: Project onto normals
- SVM has “small gap” (feels noise artifacts?)
- DWD “more informative” (feels real structure?)

DWD in Face Recognition, VI
Current Work:
- Focus on “drivers” (regions of interest)
- Relation to Discr’n?
- Which is “best”?
- Lessons for human perception?

DWD & Microarrays for Gene Expression
Skip due to time pressure
Some have seen this…
DWD provides excellent tool for:
- Combining Data Sets (caBIG funded)
- Visualization of HDLSS data
- HDLSS hypothesis testing
Let’s talk informally if you are interested

Discrimination for m-reps
Classification for Lie Groups – Symm. Spaces
S. K. Sen & S. Joshi
What is “separating plane” (for SVM-DWD)?

Trees as Data Points, I
Brain Blood Vessel Trees - E. Bullitt & H. Wang
Statistical Understanding of Population?
- Mean?
- PCA?
Challenge: Very Non-Euclidean

Trees as Data Points, II
- Mean of Tree Population: Fréchet Approach
- PCA on Trees (based on “tree lines”)
- Theory in Place - Implementation?

HDLSS Asymptotics: Simple Paradoxes, I
For d dim’al “Standard Normal” dist’n:
$$Z = (Z_1, \dots, Z_d)^T \sim N_d(0, I_d)$$
Euclidean Distance to Origin (as $d \to \infty$):
$$\|Z\| = \sqrt{d} + O_p(1)$$
- Data lie roughly on surface of sphere of radius $\sqrt{d}$
- Yet origin is point of “highest density”???
- Paradox resolved by: “density w. r. t. Lebesgue Measure”

HDLSS Asymptotics: Simple Paradoxes, II
For d dim’al “Standard Normal” dist’n: $Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$
Euclidean Dist. between $Z_1$ and $Z_2$ (as $d \to \infty$):
Distance tends to non-random constant:
$$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$$
Can extend to $Z_1, \dots, Z_n$
Where do they all go??? (we can only perceive 3 dim’ns)

HDLSS Asymptotics: Simple Paradoxes, III
For d dim’al “Standard Normal” dist’n: $Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$
High dim’al Angles (as $d \to \infty$):
$$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$$
- “Everything is orthogonal”???
- Where do they all go??? (again our perceptual limitations)
- Again 1st order structure is non-random

HDLSS Asy’s: Geometrical Representation, I
Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Subspace Generated by Data
a. Hyperplane through 0, of dimension n
b. Points are “nearly equidistant to 0”, & dist ~ $\sqrt{d}$
c. Within plane, can “rotate towards $\sqrt{d} \times$ Unit Simplex”
d. All Gaussian data sets are “near Unit Simplex Vertices”!!!
“Randomness” appears only in rotation of simplex
With P. Hall & A. Neeman

HDLSS Asy’s: Geometrical Representation, II
Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Hyperplane Generated by Data
a. $n - 1$ dimensional hyperplane
b. Points are pairwise equidistant, dist ~ $\sqrt{2d}$
c. Points lie at vertices of “regular n-hedron”
d. Again “randomness in data” is only in rotation
e. Surprisingly rigid structure in data?

HDLSS Asy’s: Geometrical Representation, III
Simulation View: shows “rigidity after rotation”

HDLSS Asy’s: Geometrical Representation, III
Straightforward Generalizations:
- non-Gaussian data: only need moments
- non-independent: use “mixing conditions”
- Mild Eigenvalue condition on Theoretical Cov. (with J. Ahn, K. Muller & Y. Chi)
All based on simple “Laws of Large Numbers”

HDLSS Asy’s: Geometrical Representation, IV
Explanation of Observed (Simulation) Behavior: “everything similar for very high d”
- 2 popn’s are 2 simplices (i.e. regular n-hedrons)
- All are same distance from the other class
- i.e. everything is a support vector
- i.e. all sensible directions show “data piling”
- so “sensible methods are all nearly the same”
- Including 1-NN

HDLSS Asy’s: Geometrical Representation, V
Further Consequences of Geometric Representation
1. Inefficiency of DWD for uneven sample size
   (motivates “weighted version”, work in progress)
2. DWD more “stable” than SVM
   (based on “deeper limiting distributions”)
   (reflects intuitive idea “feeling sampling variation”)
   (something like “mean vs. median”)
3. 1-NN rule inefficiency is quantified.

The Future of Geometrical Representation?
- HDLSS version of “optimality” results?
- “Contiguity” approach?
- Rates of Convergence?
- Improvements of DWD? (e.g. other functions of distance than inverse)
- Params depend on d?
It is still early days …

Some Carry Away Lessons
- Atoms of the Analysis: Object Oriented
- HDLSS contexts deserve further study
- DWD is attractive for HDLSS classification
- “Randomness” in HDLSS data is only in rotations
  (Modulo rotation, have constant simplex shape)
- How to put HDLSS asymptotics to work?
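
The simple paradoxes from the HDLSS asymptotics slides (norms near $\sqrt{d}$, pairwise distances near $\sqrt{2d}$, pairwise angles near 90°) are easy to check by simulation. The following is a minimal Python sketch under the standard Gaussian assumption used on those slides. The function name `hdlss_geometry`, the sample size n = 10, the seed and the dimensions tried are illustrative choices, not taken from the talk.

```python
import numpy as np

def hdlss_geometry(d, n=10, seed=0):
    """Draw n points from N_d(0, I_d); return scaled norms, scaled pairwise distances, angles."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, d))
    norms = np.linalg.norm(Z, axis=1)
    i, j = np.triu_indices(n, k=1)                       # all pairs i < j
    dists = np.linalg.norm(Z[i] - Z[j], axis=1)
    cosines = np.einsum("kd,kd->k", Z[i], Z[j]) / (norms[i] * norms[j])
    angles = np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0)))
    # Ratios near 1 and angles near 90 degrees indicate the "rigid simplex" geometry.
    return norms / np.sqrt(d), dists / np.sqrt(2 * d), angles

results = {d: hdlss_geometry(d) for d in (10, 1_000, 50_000)}

for d, (r_norm, r_dist, ang) in results.items():
    print(
        f"d={d:>6}: "
        f"norm/sqrt(d) in [{r_norm.min():.3f}, {r_norm.max():.3f}], "
        f"dist/sqrt(2d) in [{r_dist.min():.3f}, {r_dist.max():.3f}], "
        f"angle in [{ang.min():.1f}, {ang.max():.1f}] deg"
    )
```

As d grows with n fixed, the printed ranges should tighten around 1, 1 and 90°, which is the sense in which the data sit near the vertices of a regular simplex and the remaining randomness is only in its rotation.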