OODA-Davis-8-15-05 - The University of North Carolina at

U. C. Davis, F. R. G. Workshop
UNC, Stat & OR
Object Oriented Data Analysis
J. S. Marron
Dept. of Statistics and Operations Research,
University of North Carolina
March 17, 2016
1
Object Oriented Data Analysis, I
What is the “atom” of a statistical analysis?

1st Course: Numbers

Multivariate Analysis Course: Vectors

Functional Data Analysis: Curves

More generally: Data Objects
2
Object Oriented Data Analysis, II
Examples:


Medical Image Analysis

Images as Data Objects?

Shape Representations as Objects
Micro-arrays

Just multivariate analysis?
3
Object Oriented Data Analysis, III
Typical Goals:

Understanding population variation

Principal Component Analysis +

Discrimination (a.k.a. Classification)

Time Series of Data Objects
4
Object Oriented Data Analysis, IV
Major Statistical Challenge, I:
High Dimension Low Sample Size (HDLSS)

Dimension d >> sample size n

“Multivariate Analysis” nearly useless


Can’t “normalize the data”
Land of Opportunity for Statisticians

Need for “creative statisticians”
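One plausible reading of "Can't normalize the data" is that the sample covariance matrix is singular whenever d > n, so the data cannot be sphered by its inverse. The sketch below checks the rank numerically; the sizes d = 100 and n = 20 are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 100, 20                      # illustrative HDLSS sizes (assumed)
X = rng.standard_normal((n, d))     # each row is one data object

S = np.cov(X, rowvar=False)         # d x d sample covariance
rank = np.linalg.matrix_rank(S)

# The rank is at most n - 1, far below d, so S has no inverse.
print(S.shape, rank)
```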
5
Object Oriented Data Analysis, V
Major Statistical Challenge, II:


Data may live in non-Euclidean space

Lie Group / Symmetric Spaces

Trees/Graphs as data objects
Interesting Issues:

What is “the mean” (pop’n center)?

How do we quantify “pop’n variation”?
6
Statistics in Image Analysis, I
First Generation Problems:

Denoising

Segmentation

Registration
(all about single images)
7
Statistics in Image Analysis, II
Second Generation Problems:

Populations of Images

Understanding Population Variation

Discrimination (a.k.a. Classification)

Complex Data Structures (& Spaces)

HDLSS Statistics
8
HDLSS Statistics in Imaging
Why HDLSS (High Dim, Low Sample Size)?

Complex 3-d Objects Hard to Represent


Often need d = 100’s of parameters
Complex 3-d Objects Costly to Segment

Often have n = 10’s cases
9
Object Representation



Landmarks (hard to find)
Boundary Rep’ns (no correspondence)
Medial representations


Find “skeleton”
Discretize as “atoms” called M-reps
10
3-d m-reps
Bladder – Prostate – Rectum
(multiple objects, J. Y. Jeong)
• Medial Atoms provide “skeleton”
• Implied Boundary from “spokes” → “surface”
11
Illuminating Viewpoint
Object Space: Focus here on collection of data objects

Feature Space: Here conceptualize population structure via “point clouds”
12
PCA for m-reps, I
Major issue: m-reps live in $\mathbb{R}^3 \times \mathbb{R}^+ \times SO(3) \times SO(2)$
(locations, radius and angles)
E.g. “average” of:
2°, 3°, 358°, 359° = ???
Natural Data Structure is:
Lie Groups ~ Symmetric spaces
(smooth, curved manifolds)
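The angle example can be checked numerically. The sketch below contrasts the ordinary arithmetic mean with a circular mean, which averages the corresponding points on the unit circle. The circular mean is one standard choice for a single angle and is used here only as an illustration; it is not the m-rep method itself.

```python
import numpy as np

angles_deg = np.array([2.0, 3.0, 358.0, 359.0])

# Treating angles as plain numbers lands on the opposite side of the circle.
naive_mean = angles_deg.mean()

# Map each angle to the unit circle, average the points, map back to an angle.
theta = np.deg2rad(angles_deg)
circ_mean = np.rad2deg(np.arctan2(np.sin(theta).mean(),
                                  np.cos(theta).mean())) % 360.0

print(naive_mean, circ_mean)
```

The arithmetic mean is 180.5°, while the circular mean is close to 0.5°, which is the answer intuition expects.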
13
PCA for m-reps, II
PCA on non-Euclidean spaces?
(i.e. on Lie Groups / Symmetric Spaces)
T. Fletcher: Principal Geodesic Analysis
Idea: replace “linear summary of data” with “geodesic summary of data”…
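A minimal sketch of the geodesic idea, assuming the unit sphere S² as a stand-in for the m-rep space and using the tangent-space version of the computation: find an intrinsic mean, map the data to the tangent plane at that mean, and run ordinary PCA there. The helper names (`log_map`, `exp_map`, `intrinsic_mean`) and the simulated data are hypothetical, chosen for illustration only.

```python
import numpy as np

def log_map(p, x):
    """Tangent vector at p pointing toward x, with length = geodesic distance."""
    cos_t = np.clip(x @ p, -1.0, 1.0)
    v = x - cos_t * p                       # part of x orthogonal to p
    norm = np.linalg.norm(v)
    return np.zeros_like(p) if norm < 1e-12 else np.arccos(cos_t) * v / norm

def exp_map(p, v):
    """Point reached by following the geodesic from p with initial velocity v."""
    t = np.linalg.norm(v)
    return p if t < 1e-12 else np.cos(t) * p + np.sin(t) * v / t

def intrinsic_mean(X, iters=50):
    """Fixed-point iteration for the Frechet mean on the sphere."""
    mu = X[0]
    for _ in range(iters):
        mu = exp_map(mu, np.mean([log_map(mu, x) for x in X], axis=0))
    return mu

rng = np.random.default_rng(0)
# Hypothetical data: points near the north pole, spread mostly along one direction.
coords = rng.standard_normal((30, 2)) * np.array([0.4, 0.05])
north = np.array([0.0, 0.0, 1.0])
X = np.array([exp_map(north, np.array([a, b, 0.0])) for a, b in coords])

mu = intrinsic_mean(X)
V = np.array([log_map(mu, x) for x in X])   # data in the tangent plane at mu
evals, evecs = np.linalg.eigh(V.T @ V / len(X))
pg1 = evecs[:, -1]                          # first principal geodesic direction
print(mu, evals[::-1], pg1)
```

Following `exp_map(mu, s * pg1)` over a range of `s` traces the first principal geodesic through the mean.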
14
PGA for m-reps, Bladder-Prostate-Rectum
Bladder – Prostate – Rectum, 1 person, 17 days
PG 1
PG 2
PG 3
(analysis by Ja Yeon Jeong)
17
HDLSS Classification (i.e. Discrimination)
Background: Two Class (Binary) version:
Using “training data” from Class +1 and from Class -1,
develop a “rule” for assigning new data to a Class
Canonical Example: Disease Diagnosis

New Patients are “Healthy” or “Ill”

Determined based on measurements
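As a concrete instance of such a rule, the sketch below fits a mean difference classifier: project onto the line joining the two class means and threshold at the midpoint. The function names and the simulated training data are assumptions made for illustration.

```python
import numpy as np

def fit_mean_difference(X_pos, X_neg):
    """Return (w, b) for the rule sign(x @ w + b)."""
    m_pos, m_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
    w = m_pos - m_neg                       # direction joining the class means
    b = -w @ (m_pos + m_neg) / 2.0          # threshold at the midpoint
    return w, b

def classify(X_new, w, b):
    """Assign each row of X_new to Class +1 or Class -1."""
    return np.where(X_new @ w + b >= 0, 1, -1)

rng = np.random.default_rng(1)
# Hypothetical training data: two Gaussian clouds shifted in the first coordinate.
shift = np.zeros(10)
shift[0] = 3.0
X_pos = rng.standard_normal((20, 10)) + shift
X_neg = rng.standard_normal((20, 10)) - shift

w, b = fit_mean_difference(X_pos, X_neg)
labels = classify(np.vstack([shift, -shift]), w, b)
print(labels)
```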
18
HDLSS Classification (Cont.)

Ineffective Methods:



Fisher Linear Discrimination
Gaussian Likelihood Ratio
Less Useful Methods:


Nearest Neighbors
Neural Nets
(“black boxes”, no “directions” or intuition)
19
HDLSS Classification (Cont.)

Currently Fashionable Methods:



Support Vector Machines
Tree Based Approaches
New High Tech Method

Distance Weighted Discrimination (DWD)
Specially designed for HDLSS data
- Avoids “data piling” problem of SVM
- Solves more suitable optimization problem

20
HDLSS Classification (Cont.)
Currently Fashionable Methods:
Support Vector Machines
Tree Based Approaches
21
HDLSS Classification (Cont.)
Comparison of Linear Methods (toy data):
$N_d(\mu, I)$, $\mu_1 = \pm 2.2$, $n_1 = n_2 = 20$, $d = 50$
Optimal Direction
- Excellent, but need dir’n in dim = 50
Maximal Data Piling (J. Y. Ahn, D. Peña)
- Great separation, but generalizability???
Support Vector Machine
- More separation, gen’ity, but some data piling?
Distance Weighted Discrimination
- Avoids data piling, good gen’ity, Gaussians?
22
Distance Weighted Discrimination
Maximal Data Piling
23
Maximal Data Piling
Mind boggling?
J. Y. Ahn has characterized
Formula ~ FLD
There are many
Publishable???
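Data piling can be reproduced in a few lines. The sketch below simulates two Gaussian classes with 20 points each in d = 50 (the ±2.2 shift in the first coordinate is an assumed reading of the toy setup) and finds the minimum-norm direction whose projections take exactly one value per class. That direction exists because d ≥ n − 1. Algebraically it is proportional to the pseudo-inverse of the global sample covariance applied to the mean difference, which appears to be what "Formula ~ FLD" alludes to.

```python
import numpy as np

rng = np.random.default_rng(2)
n_per_class, d = 20, 50
mu = np.zeros(d)
mu[0] = 2.2                                # assumed mean shift in first coordinate
X = np.vstack([rng.standard_normal((n_per_class, d)) + mu,
               rng.standard_normal((n_per_class, d)) - mu])
y = np.repeat([1.0, -1.0], n_per_class)

Xc = X - X.mean(axis=0)                    # center at the overall mean
yc = y - y.mean()
# Minimum-norm solution of Xc @ w = yc (least squares via SVD).
w = np.linalg.lstsq(Xc, yc, rcond=1e-10)[0]
w /= np.linalg.norm(w)

proj = X @ w                               # projections onto the piling direction
print(proj[y > 0].std(), proj[y < 0].std())
```

Both within-class standard deviations are zero up to rounding error: all 20 points of each class project to a single value.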
24
Distance Weighted Discrimination
Based on Optimization Problem:
$\min_{w,b} \sum_{i=1}^{n} \frac{1}{r_i}$
More precisely: work in appropriate penalty for violations
Optimization Method (Michael Todd):
Second Order Cone Programming



Still Convex gen’tion of quadratic prog’ing
Fast greedy solution
Can use existing software
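For reference, the penalized form of the problem is usually written as follows. The symbols $x_i$ (data vectors), $y_i \in \{-1, +1\}$ (class labels), $\xi_i$ (violation terms) and $C$ (penalty weight) do not appear on the slide; they are assumed here, following the commonly published DWD formulation.

```latex
\min_{w,\,b,\,\xi}\ \sum_{i=1}^{n} \frac{1}{r_i} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
r_i = y_i\,(x_i^{T} w + b) + \xi_i \ge 0,\quad
\xi_i \ge 0,\quad
\|w\| \le 1 .
```

Each $r_i$ is the signed distance of point $i$ from the separating hyperplane, adjusted by $\xi_i$ when the point falls on the wrong side. Because every point contributes $1/r_i$, all of the data influence the direction, not only the points nearest the boundary.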
25
Simulation Comparison
E.g. Above Gaussians:
- Wide array of dim’s
- SVM subst’ly worse
- MD – Bayes Optimal
- DWD close to MD
26
Simulation Comparison
E.G. Outlier Mixture:

Disaster for MD

SVM & DWD much more solid

Dir’ns are “robust”

SVM & DWD similar
27
Simulation Comparison
E.G. Wobble Mixture:

Disaster for MD

SVM less good

DWD slightly better
Note: All methods come together for larger d ???
28
DWD in Face Recognition, I
Face Images as Data
(with M. Benito & D. Peña)
Registered using landmarks
Male – Female Difference?
Discrimination Rule?
29
DWD in Face Recognition, II

DWD Direction

Good separation

Images “make sense”

Garbage at ends?
(extrapolation effects?)
30
DWD in Face Recognition, III

Interesting summary:

Jump between means
(in DWD direction)

Clear separation of
Maleness vs. Femaleness
31
DWD in Face Recognition, IV

Fun Comparison:

Jump between means
(in SVM direction)

Also distinguishes
Maleness vs. Femaleness

But not as well as DWD
32
DWD in Face Recognition, V
Analysis of difference: Project onto normals

SVM has “small gap”
(feels noise artifacts?)

DWD “more informative” (feels real structure?)
33
DWD in Face Recognition, VI

Current Work:

Focus on “drivers”:
(regions of interest)

Relation to Discr’n?

Which is “best”?

Lessons for human
perception?
34
DWD & Microarrays for Gene Expression

Skip due to time pressure

Some have seen this…

DWD provides excellent tool for:


Combining Data Sets (caBIG funded)

Visualization of HDLSS data

HDLSS hypothesis testing
Let’s talk informally if you are interested
35
Discrimination for m-reps
Classification for Lie Groups – Symm. Spaces
S. K. Sen & S. Joshi
What is “separating plane” (for SVM-DWD)?
36
Trees as Data Points, I
Brain Blood Vessel Trees - E. Bullitt & H. Wang
Statistical Understanding of Population?
Mean? PCA?
Challenge: Very Non-Euclidean
37
Trees as Data Points, II
Mean of Tree Population: Fréchet Approach
PCA on Trees (based on “tree lines”)
Theory in Place - Implementation?
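For readers unfamiliar with the Fréchet approach, the mean is defined as the point that minimizes the sum of squared distances to the data. In the sketch below, $\mathcal{T}$ denotes the space of trees, $t_1, \ldots, t_n$ the observed trees, and $\delta$ a metric on $\mathcal{T}$. All three symbols are assumptions, and the slides do not specify which tree metric is used.

```latex
\bar{t} \;=\; \arg\min_{t \in \mathcal{T}} \sum_{i=1}^{n} \delta(t, t_i)^2
```

In Euclidean space with the usual distance this reduces to the ordinary sample mean, which is why it serves as a natural generalization.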
38
HDLSS Asymptotics: Simple Paradoxes, I
For d dim’al “Standard Normal” dist’n:
$Z = (Z_1, \ldots, Z_d)^T \sim N_d(0, I_d)$
Euclidean Distance to Origin (as $d \to \infty$):
$\|Z\| = \sqrt{d} + O_p(1)$
- Data lie roughly on surface of sphere of radius $\sqrt{d}$
- Yet origin is point of “highest density”???
- Paradox resolved by:
“density w. r. t. Lebesgue Measure”
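A short derivation sketch of both points, using only standard facts about the chi-squared distribution. The first two lines show why the distance to the origin concentrates near $\sqrt{d}$; the last line gives the density of the radius, which is what the Lebesgue-measure remark points to.

```latex
\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 \sim \chi^2_d,
\qquad E\|Z\|^2 = d, \qquad \mathrm{Var}\,\|Z\|^2 = 2d
\;\Longrightarrow\;
\|Z\|^2 = d + O_p(d^{1/2}),
\\
\|Z\| = \sqrt{d}\,\bigl(1 + O_p(d^{-1/2})\bigr)^{1/2} = \sqrt{d} + O_p(1),
\\
f_{\|Z\|}(r) \;\propto\; r^{\,d-1} e^{-r^2/2},
\quad \text{maximized at } r = \sqrt{d-1}.
```

The density per unit volume is highest at the origin, but the volume of a thin shell at radius $r$ grows like $r^{d-1}$, so almost all of the probability mass sits near radius $\sqrt{d}$.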
39
HDLSS Asymptotics: Simple Paradoxes, II
For d dim’al “Standard Normal” dist’n:
$Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$
Euclidean Dist. between $Z_1$ and $Z_2$ (as $d \to \infty$):
Distance tends to non-random constant:
$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$
Can extend to $Z_1, \ldots, Z_n$
Where do they all go???
(we can only perceive 3 dim’ns)
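The distance claim is easy to check by simulation. The sketch below draws n = 5 standard normal vectors (an illustrative sample size, not from the slides) and prints every pairwise distance divided by $\sqrt{2d}$ for increasing d.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5                                        # illustrative sample size (assumed)
for d in (10, 1_000, 100_000):
    Z = rng.standard_normal((n, d))
    i, j = np.triu_indices(n, k=1)           # index pairs with i < j
    dists = np.linalg.norm(Z[i] - Z[j], axis=1)
    # Ratios approach 1 as d grows: all pairs become nearly equidistant.
    print(d, np.round(dists / np.sqrt(2 * d), 3))
```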
40
HDLSS Asymptotics: Simple Paradoxes, III
For d dim’al “Standard Normal” dist’n:
$Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$
High dim’al Angles (as $d \to \infty$):
$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$
- “Everything is orthogonal”???
- Where do they all go???
(again our perceptual limitations)
- Again 1st order structure is non-random
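The same kind of simulation illustrates the angle result. This sketch draws two independent standard normal vectors for several dimensions and prints the angle between them in degrees; the chosen dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
for d in (10, 1_000, 100_000):
    Z1, Z2 = rng.standard_normal((2, d))     # two independent N_d(0, I_d) draws
    cos_angle = Z1 @ Z2 / (np.linalg.norm(Z1) * np.linalg.norm(Z2))
    angle_deg = np.degrees(np.arccos(cos_angle))
    # The angle settles near 90 degrees as d grows.
    print(d, round(float(angle_deg), 2))
```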
41
HDLSS Asy’s: Geometrical Representation, I
Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Subspace Generated by Data
a. Hyperplane through 0, of dimension n
b. Points are “nearly equidistant to 0”, & dist ~ $\sqrt{d}$
c. Within plane, can “rotate towards $\sqrt{d} \times$ Unit Simplex”
d. All Gaussian data sets are “near Unit Simplex Vertices”!!!
“Randomness” appears only in rotation of simplex
With P. Hall & A. Neeman
42
HDLSS Asy’s: Geometrical Representation, II
Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Hyperplane Generated by Data
a. n − 1 dimensional hyperplane
b. Points are pairwise equidistant, dist ~ $\sqrt{2d}$
c. Points lie at vertices of “regular n-hedron”
d. Again “randomness in data” is only in rotation
e. Surprisingly rigid structure in data?
43
HDLSS Asy’s: Geometrical Representation, III
Simulation View: shows “rigidity after rotation”
44
HDLSS Asy’s: Geometrical Representation, III
Straightforward Generalizations:

non-Gaussian data: only need moments

non-independent: use “mixing conditions”

Mild Eigenvalue condition on Theoretical Cov.
(with J. Ahn, K. Muller & Y. Chi)
All based on simple “Laws of Large Numbers”
45
HDLSS Asy’s: Geometrical Representation, IV
Explanation of Observed (Simulation) Behavior:
“everything similar for very high d”

2 popn’s are 2 simplices (i.e. regular n-hedrons)

All are same distance from the other class

i.e. everything is a support vector

i.e. all sensible directions show “data piling”

so “sensible methods are all nearly the same”

Including 1-NN
46
HDLSS Asy’s: Geometrical Representation, V
Further Consequences of Geometric Representation
1. Inefficiency of DWD for uneven sample size
(motivates “weighted version”, work in progress)
2. DWD more “stable” than SVM
(based on “deeper limiting distributions”)
(reflects intuitive idea “feeling sampling variation”)
(something like “mean vs. median”)
3. 1-NN rule inefficiency is quantified.
47
The Future of Geometrical Representation?

HDLSS version of “optimality” results?

“Contiguity” approach?

Rates of Convergence?

Improvements of DWD?
Params depend on d?
(e.g. other functions of distance than inverse)
It is still early days …
48
Some Carry Away Lessons

Atoms of the Analysis: Object Oriented

HDLSS contexts deserve further study

DWD is attractive for HDLSS classification

“Randomness” in HDLSS data is only in rotations
(Modulo rotation, have constant simplex shape)

How to put HDLSS asymptotics to work?
49