OODA-HailuotoIII-6-1.. - The University of North Carolina at Chapel Hill

Hailuoto Workshop
UNC, Stat & OR
Object Oriented Data Analysis, III
J. S. Marron
Dept. of Statistics and Operations Research,
University of North Carolina
March 12, 2016
HDLSS Space is a Weird Place, I
Maximal Data Piling
(with J. Y. Ahn)
In HDLSS Binary Discrimination:
There is a direction where:
Class +1 projections pile at one pt.
Class -1 projections pile at another
HDLSS Space is a Weird Place, II
Maximal Data Piling Mathematics:

Exists w.p. 1, when abs. cont. w.r.t. Lebesgue measure
Unique within subspace gen'd by Data
Formula very similar to FLD (pooled within cov. → global cov.)
Same as FLD when d < n (non-HDLSS case)
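A minimal numerical sketch of the piling behaviour, assuming the MDP direction is computed as the pseudo-inverse of the global (total) sample covariance applied to the class mean difference, i.e. the FLD formula with the global covariance in place of the pooled within-class covariance. The function name `mdp_direction`, the simulated Gaussian data and the sizes n = 20, d = 200 are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def mdp_direction(X, y):
    """Unit vector along pinv(global covariance) @ (class mean difference).

    X is an (n, d) data matrix with d > n; y holds labels +1 / -1.
    """
    Xc = X - X.mean(axis=0)                  # center at the overall mean
    global_cov = Xc.T @ Xc / (len(X) - 1)    # d x d, rank at most n - 1
    delta = X[y == 1].mean(axis=0) - X[y == -1].mean(axis=0)
    # rcond discards the numerically zero eigenvalues of the singular covariance
    v = np.linalg.pinv(global_cov, rcond=1e-10) @ delta
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
n, d = 20, 200                               # HDLSS: d much larger than n
X = rng.standard_normal((n, d))              # pure noise, no real class difference
y = np.repeat([1, -1], n // 2)

v = mdp_direction(X, y)
proj = X @ v
spread_plus = proj[y == 1].max() - proj[y == 1].min()
spread_minus = proj[y == -1].max() - proj[y == -1].min()
gap = proj[y == 1].mean() - proj[y == -1].mean()
print(spread_plus, spread_minus, gap)
```

Both within-class spreads should be zero up to rounding while the gap between the two piles stays positive, even though the labels here are arbitrary.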
HDLSS Space is a Weird Place, III
~$2^n$ MDP Dir'ns
Useful for clustering?
Hard Optimization…
HDLSS Space is a Weird Place, IV
Parallel Directions (with X. Liu)
Time Series of Curves

Chemical Spectra, evolving over time
(with J. Wendelberger & E. Kober)

Mortality curves changing in time
(with Andres Alonzo)
Visualization:

Similar tools, PCA & Dirns

But color according to time
Chemical Spectra, I
Chemical Spectra, II
Chemical Spectra, III
Chemical Spectra, IV
Chemical Spectra, V
Chemical Spectra, VI
Chemical Spectra, VII
Chemical Spectra, VIII
Demography Data
Mortality, as a function of age
“Chance of dying”, for Males of each 1-year age group
Curves are years 1908 - 2002
PCA of the family of curves
Demography Data
PCA of the family of curves for Males
Babies & elderly “most mortal” (Raw)
All getting better over time (Raw & PC1)
Except 1918 - Influenza Pandemic (see Color Scale)
Middle age most mortal (PC2):
1918
Late 1930s - Spanish Civil War
1980 – 1994 (then better) auto wrecks
Decade Rounding (several places)
Demography Data
PCA for Males in Switzerland

Most aspects similar

No decade rounding (better records)

1918 Flu – Different Color (PC2)
(see Color Scale)

No War Changes

Steady improvement until 70s (PC2)

When auto accidents kicked in
Demography Data
Dual PCA
Idea: Rows and Columns trade places
Demographic Primal View:
Curves are Years, Coord’s are Ages
Demographic Dual View:
Curves are Ages, Coord’s are Years
Dual PCA View, Spanish Males
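The primal/dual swap can be sketched as running the same PCA routine on the data matrix and on its transpose. The helper name `pca_scores`, the random stand-in table and the age range are hypothetical; only the year range 1908 - 2002 comes from the slides, and the numbers are not real mortality data.

```python
import numpy as np

def pca_scores(M, k=2):
    """Scores of the rows of M (the data objects) on the first k PCs."""
    Mc = M - M.mean(axis=0)                   # subtract the mean curve
    U, s, Vt = np.linalg.svd(Mc, full_matrices=False)
    return U[:, :k] * s[:k]                   # one row of scores per data object

# Synthetic stand-in for a years-by-ages mortality table (NOT real data)
rng = np.random.default_rng(1)
years = np.arange(1908, 2003)                 # 95 yearly curves
ages = np.arange(0, 99)                       # hypothetical 1-year age groups
table = rng.standard_normal((years.size, ages.size))

primal = pca_scores(table)                    # curves are years, coordinates are ages
dual = pca_scores(table.T)                    # curves are ages, coordinates are years
print(primal.shape, dual.shape)
```

Note that the two views center differently: the primal view subtracts the mean over years, the dual view the mean over ages, so the dual scores are not simply the primal loadings.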
Demography Data
Dual PCA View, Spanish Males

Old people have const. mortality (raw)

But improvement for rest (raw)

Bad for 1918 (flu) & Spanish Civil War,
but generally improving (mean)

Improves for ages 1-6, then worse (PC1)

Big Improvement for young (PC2)
(Age Color Key)
Discrimination for m-reps
Classification for Lie Groups – Symm. Spaces
S. K. Sen & S. Joshi
What is “separating plane” (for SVM-DWD)?
Trees as Data Points, I
Brain Blood Vessel Trees - E. Bullitt & H. Wang
Statistical Understanding of Population?
Mean? PCA?
Challenge: Very Non-Euclidean
Trees as Data Points, II
Mean of Tree Population: Fréchet Approach
PCA on Trees (based on “tree lines”)
Theory in Place - Implementation?
HDLSS Asymptotics: Simple Paradoxes, I
For d dim’al “Standard Normal” dist’n:
$Z = (Z_1, \dots, Z_d)^T \sim N_d(0, I_d)$
Euclidean Distance to Origin (as $d \to \infty$):
$\|Z\| = \sqrt{d} + O_p(1)$
- Data lie roughly on surface of sphere of radius $\sqrt{d}$
- Yet origin is point of “highest density”???
- Paradox resolved by:
“density w. r. t. Lebesgue Measure”
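The distance-to-origin statement follows from a short chi-squared calculation. The steps below are a standard argument, sketched here rather than taken from the slides; the last line shows how the volume factor $r^{d-1}$ moves the mass away from the origin even though the Lebesgue density peaks there.

```latex
\begin{align*}
\|Z\|^2 &= \sum_{j=1}^{d} Z_j^2 \sim \chi^2_d,
  \qquad E\|Z\|^2 = d, \quad \operatorname{Var}\|Z\|^2 = 2d, \\
\|Z\|^2 &= d + O_p(d^{1/2}), \\
\|Z\| &= \sqrt{d}\,\bigl(1 + O_p(d^{-1/2})\bigr)^{1/2}
       = \sqrt{d} + O_p(1), \\
f_{\|Z\|}(r) &\propto r^{d-1} e^{-r^2/2},
  \quad \text{maximized at } r = \sqrt{d-1}.
\end{align*}
```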
HDLSS Asymptotics: Simple Paradoxes, II
For d dim’al “Standard Normal” dist’n:
$Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$
Euclidean Dist. between $Z_1$ and $Z_2$ (as $d \to \infty$):
Distance tends to non-random constant:
$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$
Can extend to $Z_1, \dots, Z_n$
Where do they all go???
(we can only perceive 3 dim’ns)
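A quick simulation makes the pairwise-distance claim concrete. This is a sketch under assumed sizes (n = 10 points in d = 100,000 dimensions, chosen only for illustration); it compares every pairwise distance with the constant sqrt(2d).

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 10, 100_000                       # illustrative sizes
Z = rng.standard_normal((n, d))          # rows are Z_1, ..., Z_n

# Squared pairwise distances from the Gram matrix:
# ||Zi - Zj||^2 = ||Zi||^2 + ||Zj||^2 - 2 <Zi, Zj>
G = Z @ Z.T
sq = np.diag(G)
D2 = sq[:, None] + sq[None, :] - 2 * G
iu = np.triu_indices(n, k=1)             # each unordered pair once
dists = np.sqrt(D2[iu])

target = np.sqrt(2 * d)
max_dev = np.abs(dists - target).max()   # stays O(1) while target grows with d
print(target, max_dev)
```

All 45 distances should differ from sqrt(2d) by only a few units, which is negligible relative to the distance itself.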
HDLSS Asymptotics: Simple Paradoxes, III
For d dim’al “Standard Normal” dist’n:
$Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$
High dim’al Angles (as $d \to \infty$):
$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$
- “Everything is orthogonal”???
- Where do they all go???
(again our perceptual limitations)
- Again 1st order structure is non-random
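The shrinking spread of the angle can be seen by simulation. The sketch below, with an assumed helper `angle_spread` and arbitrarily chosen dimensions, draws independent Gaussian pairs and reports the mean and standard deviation of the angle in degrees.

```python
import numpy as np

def angle_spread(d, pairs=200, seed=3):
    """Mean and std (in degrees) of the angle between independent N_d(0, I_d) pairs."""
    rng = np.random.default_rng(seed)
    Z1 = rng.standard_normal((pairs, d))
    Z2 = rng.standard_normal((pairs, d))
    cos = np.sum(Z1 * Z2, axis=1) / (
        np.linalg.norm(Z1, axis=1) * np.linalg.norm(Z2, axis=1)
    )
    angles = np.degrees(np.arccos(cos))
    return angles.mean(), angles.std()

results = {d: angle_spread(d) for d in (10, 1_000, 10_000)}
for d, (mean_deg, std_deg) in results.items():
    print(d, round(mean_deg, 2), round(std_deg, 2))
```

The mean stays near 90 degrees at every dimension, while the spread falls roughly like d^{-1/2}.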
HDLSS Asy’s: Geometrical Representation, I
Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Subspace Generated by Data
a. Hyperplane through 0, of dimension $n$
b. Points are “nearly equidistant to 0”, & dist $\sim \sqrt{d}$
c. Within plane, can “rotate towards $\sqrt{d} \times$ Unit Simplex”
d. All Gaussian data sets are “near Unit Simplex Vertices”!!!
“Randomness” appears only in rotation of simplex
With P. Hall & A. Neeman
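One way to check the simplex picture numerically is through the scaled Gram matrix: if the rescaled points Z_i / sqrt(d) are nearly orthonormal, then some rotation carries them close to the standard basis vectors, i.e. the vertices of the unit simplex. The sizes below are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 5, 200_000                    # illustrative sizes
Z = rng.standard_normal((n, d))      # rows are Z_1, ..., Z_n

gram = Z @ Z.T / d                   # n x n inner products of the Z_i / sqrt(d)
max_err = np.abs(gram - np.eye(n)).max()
print(np.round(gram, 3))
print(max_err)
```

A Gram matrix close to the identity says the configuration is determined up to rotation, which is the sense in which the only randomness left is the orientation of the simplex.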
HDLSS Asy’s: Geometrical Representation, II
Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Hyperplane Generated by Data
a. $n - 1$ dimensional hyperplane
b. Points are pairwise equidistant, dist $\sim \sqrt{2d}$
c. Points lie at vertices of “regular $n$-hedron”
d. Again “randomness in data” is only in rotation
e. Surprisingly rigid structure in data?
HDLSS Asy’s: Geometrical Representation, III
Simulation View: shows “rigidity after rotation”
HDLSS Asy’s: Geometrical Representation, III
Straightforward Generalizations:
non-Gaussian data: only need moments
non-independent: use “mixing conditions”
Mild Eigenvalue condition on Theoretical Cov.
(with J. Ahn, K. Muller & Y. Chi)
All based on simple “Laws of Large Numbers”
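The non-Gaussian case can be illustrated with skewed entries. This sketch assumes i.i.d. centered exponential coordinates (mean 0, variance 1), a choice made here for illustration only; it does not cover the non-independent case with mixing conditions.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 8, 100_000                                # illustrative sizes
X = rng.exponential(scale=1.0, size=(n, d)) - 1  # skewed, mean 0, variance 1

# Ratio of each pairwise distance to sqrt(2d); the law of large numbers
# applied to the squared coordinate differences drives this ratio to 1.
ratios = np.array([
    np.linalg.norm(X[i] - X[j]) / np.sqrt(2 * d)
    for i in range(n) for j in range(i + 1, n)
])
max_rel_dev = np.abs(ratios - 1).max()
print(ratios.min(), ratios.max())
```

The same near-equidistance appears as in the Gaussian case, using only the first few moments of the coordinates.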
HDLSS Asy’s: Geometrical Representation, IV
Explanation of Observed (Simulation) Behavior:
“everything similar for very high d”

2 popn’s are 2 simplices (i.e. regular n-hedrons)

All are same distance from the other class

i.e. everything is a support vector

i.e. all sensible directions show “data piling”

so “sensible methods are all nearly the same”

Including 1 - NN
HDLSS Asy’s: Geometrical Representation, V
Further Consequences of Geometric Representation
1. Inefficiency of DWD for uneven sample size
(motivates “weighted version”, work in progress)
2. DWD more “stable” than SVM
(based on “deeper limiting distributions”)
(reflects intuitive idea “feeling sampling variation”)
(something like “mean vs. median”)
3. 1-NN rule inefficiency is quantified.
The Future of Geometrical Representation?

HDLSS version of “optimality” results?

“Contiguity” approach?

Rates of Convergence?

Improvements of DWD?
Params depend on d?
(e.g. other functions of distance than inverse)
It is still early days …
Some Carry Away Lessons

Atoms of the Analysis: Object Oriented

HDLSS contexts deserve further study

DWD is attractive for HDLSS classification

“Randomness” in HDLSS data is only in rotations
(Modulo rotation, have constant simplex shape)

How to put HDLSS asymptotics to work?