Isaac Newton Institute - Cambridge
UNC, Stat & OR

Object Oriented Data Analysis
J. S. Marron
Dept. of Statistics and Operations Research, University of North Carolina
March 22, 2016

Personal Opinions on Mathematical Statistics
What is Mathematical Statistics?
  Validation of existing methods
  Asymptotics (n → ∞) & Taylor expansion
  Comparison of existing methods
  (requires hard math, but really “accounting”???)

Personal Opinions on Mathematical Statistics
What could Mathematical Statistics be?
  Basis for invention of new methods
  Complicated data → mathematical ideas
  Do we value creativity?
  Since we don’t do this, others do…
  (where are the ₤₤₤s???)

Personal Opinions on Mathematical Statistics
Since we don’t do this, others do…
  Pattern Recognition
  Artificial Intelligence
  Neural Nets
  Data Mining
  Machine Learning
  ???

Personal Opinions on Mathematical Statistics
Possible Litmus Test: Creative Statistics
  Clinical Trials Viewpoint: Worst Imaginable Idea
  Mathematical Statistics Viewpoint: ???

Object Oriented Data Analysis, I
What is the “atom” of a statistical analysis?
  1st Course: Numbers
  Multivariate Analysis Course: Vectors
  Functional Data Analysis: Curves
  More generally: Data Objects

Object Oriented Data Analysis, II
Examples:
  Medical Image Analysis
    Images as Data Objects?
    Shape Representations as Objects
  Micro-arrays
    Just multivariate analysis?

Object Oriented Data Analysis, III
Typical Goals:
  Understanding population variation
    Visualization
    Principal Component Analysis +
  Discrimination (a.k.a. Classification)
  Time Series of Data Objects

Object Oriented Data Analysis, IV
Major Statistical Challenge, I:
High Dimension Low Sample Size (HDLSS)
  Dimension d >> sample size n
  “Multivariate Analysis” nearly useless
  Can’t “normalize the data”
  Land of Opportunity for Statisticians
  Need for “creative statisticians”

Object Oriented Data Analysis, V
Major Statistical Challenge, II:
Data may live in non-Euclidean space
  Lie Group / Symmet’c Spaces (manifold data)
  Trees/Graphs as data objects
Interesting Issues:
  What is “the mean” (pop’n center)?
  How do we quantify “pop’n variation”?

Statistics in Image Analysis, I
First Generation Problems:
  Denoising
  Segmentation
  Registration
(all about single images)

Statistics in Image Analysis, II
Second Generation Problems:
  Populations of Images
  Understanding Population Variation
  Discrimination (a.k.a. Classification)
  Complex Data Structures (& Spaces)
  HDLSS Statistics

HDLSS Statistics in Imaging
Why HDLSS (High Dim, Low Sample Size)?
  Complex 3-d Objects Hard to Represent
    Often need d = 100’s of parameters
  Complex 3-d Objects Costly to Segment
    Often have n = 10’s cases

Medical Imaging – A Challenging Example
Male Pelvis
  Bladder – Prostate – Rectum
  How do they move over time (days)?
  Critical to Radiation Treatment (cancer)
Work with 3-d CT
  Very Challenging to Segment
  Find boundary of each object?
  Represent each Object?

Male Pelvis – Raw Data
One CT Slice (in 3d image)
  Coccyx (Tail Bone)
  Rectum
  Prostate

Male Pelvis – Raw Data
Prostate: manual segmentation
  Slice by slice
  Reassembled

Male Pelvis – Raw Data
Prostate: Slices: Reassembled in 3d
How to represent?
Thanks: Ja-Yeon Jeong

Object Representation
  Landmarks (hard to find)
  Boundary Rep’ns (no correspondence)
  Medial representations
    Find “skeleton”
    Discretize as “atoms” called M-reps

3-d m-reps
Bladder – Prostate – Rectum (multiple objects, J. Y. Jeong)
• Medial Atoms provide “skeleton”
• Implied Boundary from “spokes” → “surface”

3-d m-reps
M-rep model fitting
• Easy, when starting from binary (blue)
• But very expensive (30 – 40 minutes technician’s time)
• Want automatic approach
• Challenging, because of poor contrast, noise, …
• Need to borrow information across training sample
• Use Bayes approach: prior & likelihood → posterior
• ~Conjugate Gaussians, but there are issues:
  • Major HDLSS challenges
  • Manifold aspect of data

PCA for m-reps, I
Major issue: m-reps live in $\mathbb{R}^3 \times \mathbb{R}_+ \times SO(3) \times SO(2)$
(locations, radius and angles)
E.g. “average” of: 2°, 3°, 358°, 359° = ???
Natural Data Structure is: Lie Groups ~ Symmetric spaces
(smooth, curved manifolds)

PCA for m-reps, II
PCA on non-Euclidean spaces? (i.e. on Lie Groups / Symmetric Spaces)
T. Fletcher: Principal Geodesic Analysis
Idea: replace “linear summary of data” with “geodesic summary of data”…

PGA for m-reps, Bladder-Prostate-Rectum
Bladder – Prostate – Rectum, 1 person, 17 days
PG 1   PG 2   PG 3
(analysis by Ja Yeon Jeong)

HDLSS Classification (i.e. Discrimination)
Background: Two Class (Binary) version:
  Using “training data” from Class +1, and from Class -1
  Develop a “rule” for assigning new data to a Class
Canonical Example: Disease Diagnosis
  New Patients are “Healthy” or “Ill”
  Determined based on measurements

HDLSS Classification (Cont.)
Ineffective Methods:
  Fisher Linear Discrimination
  Gaussian Likelihood Ratio
Less Useful Methods:
  Nearest Neighbors
  Neural Nets
  (“black boxes”, no “directions” or intuition)

HDLSS Classification (Cont.)
Currently Fashionable Methods:
  Support Vector Machines
  Tree-Based Approaches
New High Tech Method
  Distance Weighted Discrimination (DWD)
    Specially designed for HDLSS data
    Avoids “data piling” problem of SVM
    Solves more suitable optimization problem

HDLSS Classification (Cont.)
Currently Fashionable Methods:
  Tree-Based Approaches
  Support Vector Machines:

Distance Weighted Discrimination
Maximal Data Piling

Distance Weighted Discrimination
Based on Optimization Problem:
  $\min_{w,b} \sum_{i=1}^{n} \frac{1}{r_i}$
More precisely: work in appropriate penalty for violations
Optimization Method (Michael Todd):
  Second Order Cone Programming
  Still Convex gen’tion of quadratic prog’ing
  Fast greedy solution
  Can use existing software

DWD Bias Adjustment for Microarrays
Microarray data: Simult. Measur’ts of “gene expression”
Intrinsically HDLSS
  Dimension d ~ 1,000s – 10,000s
  Sample Sizes n ~ 10s – 100s
My view: Each array is “point in cloud”

DWD Batch and Source Adjustment
For Perou’s Stanford Breast Cancer Data
Analysis in Benito, et al (2004) Bioinformatics
https://genome.unc.edu/pubsup/dwd/
Adjust for Source Effects
  Different sources of mRNA
Adjust for Batch Effects
  Arrays fabricated at different times

(Sequence of figure slides, titles only:)
DWD Adj: Raw Breast Cancer data
DWD Adj: Source Colors
DWD Adj: Batch Colors
DWD Adj: Biological Class Colors
DWD Adj: Biological Class Colors & Symbols
DWD Adj: Biological Class Symbols
DWD Adj: Source Colors
DWD Adj: PC 1-2 & DWD direction
DWD Adj: DWD Source Adjustment
DWD Adj: Source Adj’d, PCA view
DWD Adj: Source Adj’d, Class Colored
DWD Adj: Source Adj’d, Batch Colored
DWD Adj: Source Adj’d, 5 PCs
DWD Adj: S. Adj’d, Batch 1,2 vs. 3 DWD
DWD Adj: S. & B1,2 vs. 3 Adjusted
DWD Adj: S. & B1,2 vs. 3 Adj’d, 5 PCs
DWD Adj: S. & B Adj’d, B1 vs. 2 DWD
DWD Adj: S. & B Adj’d, B1 vs. 2 Adj’d
DWD Adj: S. & B Adj’d, 5 PC view
DWD Adj: S. & B Adj’d, 4 PC view
DWD Adj: S. & B Adj’d, Class Colors
DWD Adj: S. & B Adj’d, Adj’d PCA

DWD Bias Adjustment for Microarrays
Effective for Batch and Source Adj.
Also works for cross-platform Adj.
  E.g. cDNA & Affy
  Despite literature claiming contrary
  “Gene by Gene” vs. “Multivariate” views
Funded as part of caBIG
  “Cancer BioInformatics Grid”
  “Data Combination Effort” of NCI

Interesting Benchmark Data Set
NCI 60 Cell Lines
  Interesting benchmark, since same cells
  Data Web available: http://discover.nci.nih.gov/datasetsNature2000.jsp
  Both cDNA and Affymetrix Platforms
8 Major cancer subtypes
Use DWD now for visualization

NCI 60: Views using DWD Dir’ns (focus on biology)

DWD in Face Recognition, I
Face Images as Data (with M. Benito & D. Peña)
Registered using landmarks
Male – Female Difference?
Discrimination Rule?

DWD in Face Recognition, II
DWD Direction
Good separation
Images “make sense”
Garbage at ends? (extrapolation effects?)

Blood vessel tree data
Marron’s brain:
  Segmented from MRA
  Reconstruct trees in 3d
  Rotate to view

Blood vessel tree data
Now look over many people (data objects)
Structure of population (understand variation?)
PCA in strongly non-Euclidean Space???

Blood vessel tree data
Possible focus of analysis:
• Connectivity structure only (topology)
• Location, size, orientation of segments
• Structure within each vessel segment

Blood vessel tree data
Present Focus: Topology only
  Already challenging
  Later address others
  Then add attributes to tree nodes
  And extend analysis

Strongly Non-Euclidean Spaces
Statistics on Population of Tree-Structured Data Objects?
• Mean???
• Analog of PCA???
Strongly non-Euclidean, since:
• Space of trees not a linear space
• Not even approximately linear (no tangent plane)

Strongly Non-Euclidean Spaces
PCA on Tree Space?
Key Idea (Jim Ramsay):
• Replace 1-d subspace that best approximates data
• By 1-d representation that best approximates data
Wang and Marron (2007) define notion of Treeline (in structure space)

PCA for blood vessel tree data
Data Analytic Goals: Age, Gender
See these? No…

Preliminary Tree-Curve Results
First Correlation of Structure to Age! (Back Trees)

HDLSS Asymptotics
Why study asymptotics?
An interesting (naïve) quote:
  “I don’t look at asymptotics, because I don’t have an infinite sample size”
Suggested perspective:
  Asymptotics are a tool for finding simple structure underlying complex entities

HDLSS Asymptotics
Which asymptotics?
  n → ∞ (classical, very widely done)
  d → ∞ ???
    Sensible?
    Follow typical “sampling process”?
    Say anything, as noise level increases???

HDLSS Asymptotics
Which asymptotics?
  n → ∞ & d → ∞
    n >> d: a few results around (still have classical info in data)
    n ~ d: random matrices (Iain J., et al) (nothing classically estimable)
  HDLSS asymptotics: n fixed, d → ∞

HDLSS Asymptotics
HDLSS asymptotics: n fixed, d → ∞
Follow typical “sampling process”?
  Microarrays: # genes bounded
  Proteomics, SNPs, …
A moot point, from perspective:
  Asymptotics are a tool for finding simple structure underlying complex entities

HDLSS Asymptotics
HDLSS asymptotics: n fixed, d → ∞
Say anything, as noise level increases???
  Yes, there exists simple, perhaps surprising, underlying structure

HDLSS Asymptotics: Simple Paradoxes, I
For d dim’al “Standard Normal” dist’n:
  $Z = (Z_1, \ldots, Z_d)^t \sim N_d(0, I_d)$
Euclidean Distance to Origin (as $d \to \infty$):
  $\|Z\| = \sqrt{d} + O_p(1)$
- Data lie roughly on surface of sphere of radius $\sqrt{d}$
- Yet origin is point of “highest density”???
- Paradox resolved by: “density w. r. t. Lebesgue Measure”

HDLSS Asymptotics: Simple Paradoxes, II
For d dim’al “Standard Normal” dist’n:
  $Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$
Euclidean Dist. between $Z_1$ and $Z_2$ (as $d \to \infty$):
Distance tends to non-random constant:
  $\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$
Can extend to $Z_1, \ldots, Z_n$
Where do they all go??? (we can only perceive 3 dim’ns)

HDLSS Asymptotics: Simple Paradoxes, III
For d dim’al “Standard Normal” dist’n:
  $Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$
High dim’al Angles (as $d \to \infty$):
  $\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$
- “Everything is orthogonal”???
- Where do they all go??? (again our perceptual limitations)
- Again 1st order structure is non-random

HDLSS Asy’s: Geometrical Representation, I
Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Subspace Generated by Data
a. Hyperplane through 0, of dimension n
b. Points are “nearly equidistant to 0”, & dist ~ $\sqrt{d}$
c. Within plane, can “rotate towards $\sqrt{d}$ × Unit Simplex”
d. All Gaussian data sets are “near Unit Simplex Vertices”!!!
“Randomness” appears only in rotation of simplex
With P. Hall & A. Neeman

HDLSS Asy’s: Geometrical Representation, II
Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Hyperplane Generated by Data
a. n − 1 dimensional hyperplane
b. Points are pairwise equidistant, dist ~ $\sqrt{2d}$
c. Points lie at vertices of “regular n-hedron”
d. Again “randomness in data” is only in rotation
e. Surprisingly rigid structure in data?

HDLSS Asy’s: Geometrical Representation, III
Simulation View: shows “rigidity after rotation”

HDLSS Asy’s: Geometrical Representation, III
Straightforward Generalizations:
  non-Gaussian data: only need moments
  non-independent: use “mixing conditions”
  (with P. Hall & A. Neeman)
  Mild Eigenvalue condition on Theoretical Cov.
  (with J. Ahn, K. Muller & Y. Chi)
All based on simple “Laws of Large Numbers”

HDLSS Asy’s: Geometrical Representation, IV
Explanation of Observed (Simulation) Behavior:
“everything similar for very high d”
  2 popn’s are 2 simplices (i.e. regular n-hedrons)
  All are same distance from the other class
  i.e. everything is a support vector
  i.e. all sensible directions show “data piling”
  so “sensible methods are all nearly the same”
  Including 1-NN

HDLSS Asy’s: Geometrical Representation, V
Further Consequences of Geometric Representation
1. Inefficiency of DWD for uneven sample size
   (motivates “weighted version”, work in progress)
2. DWD more “stable” than SVM
   (based on “deeper limiting distributions”)
   (reflects intuitive idea “feeling sampling variation”)
   (something like “mean vs. median”)
3. 1-NN rule inefficiency is quantified.
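
The geometric representation above (norms near sqrt(d), pairwise distances near sqrt(2d), pairwise angles near 90 degrees) can be checked numerically. The following is only a minimal simulation sketch, not the simulation shown in the talk: the function name `hdlss_geometry` and the choices n = 5, d = 20000 are illustrative assumptions, and a finite d can only suggest the limiting behavior.

```python
import numpy as np


def hdlss_geometry(n=5, d=20000, seed=0):
    """Simulate n points from N_d(0, I_d) and return rescaled geometry summaries.

    Hypothetical helper for illustration only. Returns:
      norms  : ||Z_i|| / sqrt(d)           (expected near 1)
      dists  : ||Z_i - Z_j|| / sqrt(2d)    (expected near 1, i < j)
      angles : Angle(Z_i, Z_j) in degrees  (expected near 90, i < j)
    """
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, d))      # rows are the data vectors Z_1..Z_n

    gram = Z @ Z.T                       # n x n matrix of inner products
    sq_norms = np.diag(gram)             # squared lengths ||Z_i||^2
    norms = np.sqrt(sq_norms) / np.sqrt(d)

    i, j = np.triu_indices(n, k=1)       # index pairs with i < j
    sq_dists = sq_norms[i] + sq_norms[j] - 2.0 * gram[i, j]
    dists = np.sqrt(sq_dists) / np.sqrt(2.0 * d)

    cosines = gram[i, j] / np.sqrt(sq_norms[i] * sq_norms[j])
    angles = np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0)))
    return norms, dists, angles


if __name__ == "__main__":
    norms, dists, angles = hdlss_geometry()
    print("norms / sqrt(d):        ", np.round(norms, 3))
    print("distances / sqrt(2d):   ", np.round(dists, 3))
    print("pairwise angles (deg):  ", np.round(angles, 2))
```

Re-running with different seeds changes the individual points but leaves the rescaled lengths, distances and angles nearly unchanged, which is the sense in which the “randomness” is only in the rotation of the simplex.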

2nd Paper on HDLSS Asymptotics
Ahn, Marron, Muller & Chi (2007) Biometrika
Assume 2nd Moments (and Gaussian)
Assume no eigenvalues too large in sense:
  For $\epsilon = \frac{\left(\sum_{j=1}^{d} \lambda_j\right)^2}{d \sum_{j=1}^{d} \lambda_j^2}$, assume $\epsilon^{-1} = o(d)$,
  i.e. $\epsilon \gg 1/d$ (min possible)
(much weaker than previous mixing conditions…)

HDLSS Math. Stat. of PCA, I
Consistency & Strong Inconsistency:
Spike Covariance Model (Johnstone & Paul)
For Eigenvalues: $\lambda_{1,d} = d^{\alpha}, \lambda_{2,d} = 1, \ldots, \lambda_{d,d} = 1$
1st Eigenvector: $u_1$
How good are empirical versions, $\hat{\lambda}_{1,d}, \ldots, \hat{\lambda}_{d,d}, \hat{u}_1$, as estimates?

HDLSS Math. Stat. of PCA, II
Consistency (big enough spike):
  For $\alpha > 1$, $\mathrm{Angle}(u_1, \hat{u}_1) \to 0$
Strong Inconsistency (spike not big enough):
  For $\alpha < 1$, $\mathrm{Angle}(u_1, \hat{u}_1) \to 90^\circ$

HDLSS Math. Stat. of PCA, III
Consistency of eigenvalues?
  $\frac{\hat{\lambda}_{1,d}}{\lambda_{1,d}} \xrightarrow{L} \frac{\chi^2_n}{n}$
Eigenvalues Inconsistent
But known distribution
Unless $n \to \infty$ as well

HDLSS Work in Progress, II
Canonical Correlations: Myung Hee Lee
Results similar to those for PCA
Singular values inconsistent
But directions converge under a much milder spike assumption.

HDLSS Work in Progress, III
Conditions for Geo. Rep’n & PCA Consist.:
John Kent example:
  $X \sim \frac{1}{2} N_d(0_d, I_d) + \frac{1}{2} N_d(0_d, 100 \cdot I_d)$
Can only say:
  $\|X\| = O_p(d^{1/2})$, with $\|X\| \approx d^{1/2}$ w. p. 1/2 and $\|X\| \approx 10 d^{1/2}$ w. p. 1/2
  not deterministic
Conclude: need some flavor of mixing
Challenge: Classical mixing conditions require notion of time ordering
  Not always clear, e.g. microarrays

HDLSS Work in Progress, III
Conditions for Geo. Rep’n & PCA Consist.:
Sungkyu Jung Condition:
  $X_d \sim (0_d, \Sigma_d)$
Define: $Z_d = \Lambda_d^{-1/2} U_d^t X_d$, where $\Sigma_d = U_d \Lambda_d U_d^t$
Assume: ∃ a permutation, $\pi_d$, so that $\pi_d Z_d$ is ρ-mixing

HDLSS Deep Open Problem
In PCA Consistency:
  Strong Inconsistency - spike $\alpha < 1$
  Consistency - spike $\alpha > 1$
What happens at boundary ($\alpha = 1$)???

The Future of HDLSS Asymptotics?
1. Address your favorite statistical problem…
2. HDLSS versions of classical optimality results?
3. Contiguity Approach
4. Rates of convergence?
5. Improved Discrimination Methods?
(~Random Matrices)
It is early days…

Some Carry Away Lessons
Atoms of the Analysis: Object Oriented
Viewpoint: Object Space ↔ Feature Space
DWD is attractive for HDLSS classification
“Randomness” in HDLSS data is only in rotations
  (Modulo rotation, have constant simplex shape)
How to put HDLSS asymptotics to work?
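
One small way to put the PCA results from the “HDLSS Math. Stat. of PCA” slides to work is to simulate the spike covariance model and look at the angle between the true and the empirical first eigenvector. The sketch below is an illustration under stated assumptions, not the authors' code: the function name `first_pc_angle` is hypothetical, the data are taken to be Gaussian with known mean zero (so no centering is done), the spike direction is taken to be the first coordinate axis, and n = 10, d = 5000 are arbitrary finite choices that can only hint at the d → ∞ limits.

```python
import numpy as np


def first_pc_angle(alpha, n=10, d=5000, seed=0):
    """Angle (degrees) between the true and empirical first PC direction.

    Hypothetical helper for illustration only. Spike model assumed here:
    eigenvalues d**alpha, 1, ..., 1, with true first eigenvector u_1 = e_1.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))          # n observations in d dimensions
    X[:, 0] *= np.sqrt(float(d) ** alpha)    # first coordinate gets variance d**alpha

    # Rows of vt are the right singular vectors of X, i.e. the eigenvectors of
    # the (uncentered) sample second-moment matrix X^t X / n; vt[0] is u_1-hat.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    u_hat = vt[0]

    # u_1 = e_1, so the cosine of the angle is |first coordinate of u_hat|
    # (the absolute value removes the sign ambiguity of eigenvectors).
    cosine = min(1.0, abs(float(u_hat[0])))
    return float(np.degrees(np.arccos(cosine)))


if __name__ == "__main__":
    for alpha in (0.2, 0.5, 1.0, 1.5, 1.8):
        print(f"alpha = {alpha:.1f}: angle = {first_pc_angle(alpha):.1f} degrees")
```

For spike exponents well above 1 the angle should be small, and for exponents well below 1 it should be close to 90 degrees; values near the boundary α = 1 are exactly the case the slides list as an open problem, so no particular behavior is claimed there.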