Joint Distribution of Internet Flow Sizes and Durations

Isaac Newton Institute - Cambridge
UNC, Stat & OR
Object Oriented Data Analysis
J. S. Marron
Dept. of Statistics and Operations Research,
University of North Carolina
March 22, 2016
1
Personal Opinions on Mathematical Statistics
What is Mathematical Statistics?

Validation of existing methods

Asymptotics (n → ∞) & Taylor expansion

Comparison of existing methods
(requires hard math, but
really “accounting”???)
2
Personal Opinions on Mathematical Statistics
What could Mathematical Statistics be?

Basis for invention of new methods

Complicated data → mathematical ideas

Do we value creativity?

Since we don’t do this, others do…
(where are the ₤₤₤s???)
3
Personal Opinions on Mathematical Statistics

Since we don’t do this, others do…

Pattern Recognition

Artificial Intelligence

Neural Nets

Data Mining

Machine Learning

???
4
Personal Opinions on Mathematical Statistics
Possible Litmus Test:
Creative Statistics

Clinical Trials Viewpoint:
Worst Imaginable Idea

Mathematical Statistics Viewpoint:
???
5
Object Oriented Data Analysis, I
What is the “atom” of a statistical analysis?

1st Course: Numbers

Multivariate Analysis Course: Vectors

Functional Data Analysis: Curves

More generally: Data Objects
6
Object Oriented Data Analysis, II
Examples:


Medical Image Analysis

Images as Data Objects?

Shape Representations as Objects
Micro-arrays

Just multivariate analysis?
7
Object Oriented Data Analysis, III
Typical Goals:

Understanding population variation

Visualization

Principal Component Analysis +

Discrimination (a.k.a. Classification)

Time Series of Data Objects
8
Object Oriented Data Analysis, IV
Major Statistical Challenge, I:
High Dimension Low Sample Size (HDLSS)

Dimension d >> sample size n

“Multivariate Analysis” nearly useless


Can’t “normalize the data”
Land of Opportunity for Statisticians

Need for “creative statisticians”
9
Object Oriented Data Analysis, V
Major Statistical Challenge, II:


Data may live in non-Euclidean space

Lie Group / Symmet’c Spaces (manifold data)

Trees/Graphs as data objects
Interesting Issues:

What is “the mean” (pop’n center)?

How do we quantify “pop’n variation”?
10
Statistics in Image Analysis, I
First Generation Problems:

Denoising

Segmentation

Registration
(all about single images)
11
Statistics in Image Analysis, II
Second Generation Problems:

Populations of Images

Understanding Population Variation

Discrimination (a.k.a. Classification)

Complex Data Structures (& Spaces)

HDLSS Statistics
12
HDLSS Statistics in Imaging
Why HDLSS (High Dim, Low Sample Size)?

Complex 3-d Objects Hard to Represent


Often need d = 100’s of parameters
Complex 3-d Objects Costly to Segment

Often have n = 10’s cases
13
Medical Imaging – A Challenging Example

Male Pelvis





Bladder – Prostate – Rectum
How do they move over time (days)?
Critical to Radiation Treatment (cancer)
Work with 3-d CT
Very Challenging to Segment


Find boundary of each object?
Represent each Object?
14
Male Pelvis – Raw Data
One CT Slice
(in 3d image)
Coccyx
(Tail Bone)
Rectum
Prostate
15
Male Pelvis – Raw Data
Prostate:
manual
segmentation
Slice by slice
Reassembled
16
Male Pelvis – Raw Data
Prostate:
Slices:
Reassembled in 3d
How to represent?
Thanks: Ja-Yeon Jeong
17
Object Representation



Landmarks (hard to find)
Boundary Rep’ns (no correspondence)
Medial representations


Find “skeleton”
Discretize as “atoms” called M-reps
18
3-d m-reps
Bladder – Prostate – Rectum
(multiple objects, J. Y. Jeong)
• Medial Atoms provide “skeleton”
• Implied Boundary from “spokes” → “surface”
19
3-d m-reps
M-rep model fitting
• Easy, when starting from binary (blue)
• But very expensive (30 – 40 minutes technician’s time)
• Want automatic approach
• Challenging, because of poor contrast, noise, …
• Need to borrow information across training sample
• Use Bayes approach: prior & likelihood → posterior
• ~Conjugate Gaussians, but there are issues:
• Major HDLSS challenges
• Manifold aspect of data
20
PCA for m-reps, I
Major issue: m-reps live in $\mathbb{R}^3 \times \mathbb{R}^+ \times SO(3) \times SO(2)$
(locations, radius and angles)
E.g. “average” of:
$2^\circ, 3^\circ, 358^\circ, 359^\circ$ = ???
Natural Data Structure is:
Lie Groups ~ Symmetric spaces
(smooth, curved manifolds)
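The angle example can be checked numerically. The sketch below is an illustration only: it compares the ordinary arithmetic mean with a mean direction obtained by averaging unit vectors on the circle (one common choice of "average" for angles, not necessarily the one used for m-reps). The function name `circular_mean_deg` is hypothetical.

```python
import numpy as np

def circular_mean_deg(angles_deg):
    """Mean direction of angles (in degrees), found by averaging unit vectors."""
    a = np.deg2rad(np.asarray(angles_deg, dtype=float))
    mean_angle = np.arctan2(np.sin(a).mean(), np.cos(a).mean())
    return np.rad2deg(mean_angle) % 360.0

angles = [2, 3, 358, 359]
naive = float(np.mean(angles))           # treats angles as plain numbers
circular = circular_mean_deg(angles)     # respects the wrap-around at 360
print(naive, circular)
```

The arithmetic mean lands on the opposite side of the circle from all four angles, while the mean direction sits among them, which is the reason a Euclidean average is the wrong summary for this kind of data.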
21
PCA for m-reps, II
PCA on non-Euclidean spaces?
(i.e. on Lie Groups / Symmetric Spaces)
T. Fletcher: Principal Geodesic Analysis
Idea: replace “linear summary of data”
With “geodesic summary of data”…
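As a rough illustration of the "geodesic summary" idea, the sketch below works on the unit sphere rather than on the m-rep space. It assumes the common tangent-space approximation: find an intrinsic (Fréchet) mean, map the data to the tangent plane there with the log map, and run ordinary PCA on the mapped data. All function names are hypothetical, and this is not presented as Fletcher's implementation.

```python
import numpy as np

def exp_map(p, v):
    """Move from point p on the unit sphere along tangent vector v."""
    t = np.linalg.norm(v)
    if t < 1e-12:
        return p
    return np.cos(t) * p + np.sin(t) * v / t

def log_map(p, q):
    """Tangent vector at p pointing to q, with length equal to geodesic distance."""
    w = q - np.dot(p, q) * p                 # part of q orthogonal to p
    nw = np.linalg.norm(w)
    if nw < 1e-12:
        return np.zeros_like(p)
    theta = np.arccos(np.clip(np.dot(p, q), -1.0, 1.0))
    return theta * w / nw

def frechet_mean(points, iters=50):
    """Intrinsic mean by repeatedly stepping along the average log vector."""
    mu = points[0]
    for _ in range(iters):
        step = np.mean([log_map(mu, q) for q in points], axis=0)
        mu = exp_map(mu, step)
    return mu

def tangent_pga(points):
    """Tangent-space approximation: PCA of log-mapped data at the intrinsic mean."""
    mu = frechet_mean(points)
    logs = np.array([log_map(mu, q) for q in points])
    _, s, vt = np.linalg.svd(logs, full_matrices=False)
    return mu, vt, s ** 2 / len(points)

# Toy data: points spread along one great circle through the north pole.
north = np.array([0.0, 0.0, 1.0])
rng = np.random.default_rng(0)
ts = np.linspace(-0.5, 0.5, 11)
pts = [exp_map(north, np.array([t, 0.02 * e, 0.0]))
       for t, e in zip(ts, rng.standard_normal(11))]
mu, directions, variances = tangent_pga(pts)
print(mu, directions[0], variances)
```

The first direction, pushed back through `exp_map`, traces a geodesic through the mean, which plays the role the first principal component line plays in Euclidean PCA.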
22
PGA for m-reps, Bladder-Prostate-Rectum
Bladder – Prostate – Rectum, 1 person, 17 days
PG 1
PG 2
PG 3
(analysis by Ja Yeon Jeong)
23
HDLSS Classification (i.e. Discrimination)
Background: Two Class (Binary) version:
Using “training data” from Class +1, and
from Class -1
Develop a “rule” for assigning new data to
a Class
Canonical Example: Disease Diagnosis

New Patients are “Healthy” or “Ill”

Determined based on measurements
26
HDLSS Classification (Cont.)

Ineffective Methods:



Fisher Linear Discrimination
Gaussian Likelihood Ratio
Less Useful Methods:


Nearest Neighbors
Neural Nets
(“black boxes”, no “directions” or intuition)
27
HDLSS Classification (Cont.)

Currently Fashionable Methods:



Support Vector Machines
Trees Based Approaches
New High Tech Method

Distance Weighted Discrimination (DWD)
Specially designed for HDLSS data
 Avoids “data piling” problem of SVM
 Solves more suitable optimization problem

28
HDLSS Classification (Cont.)
Currently Fashionable Methods:
Trees Based Approaches
Support Vector Machines
29
Distance Weighted Discrimination
Maximal Data Piling
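Data piling is easy to reproduce when d >> n. The sketch below is a toy illustration under stated assumptions: simulated Gaussian data, and a direction obtained as the minimum-norm least-squares solution of "centered data times w equals centered labels". That construction is chosen for brevity; it is not claimed to be how the slide's figure was produced.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_class, d = 10, 200                    # HDLSS: d >> n
X = rng.standard_normal((2 * n_per_class, d))
y = np.repeat([1.0, -1.0], n_per_class)
X[y > 0, 0] += 2.0                          # toy signal in coordinate 0 only

Xc = X - X.mean(axis=0)                     # center with the overall mean
yc = y - y.mean()
# Minimum-norm w with Xc @ w = yc (solvable because d exceeds n).
w, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
w /= np.linalg.norm(w)

proj = X @ w
spread_pos = proj[y > 0].std()              # within-class spread of projections
spread_neg = proj[y < 0].std()
gap = proj[y > 0].mean() - proj[y < 0].mean()
print(spread_pos, spread_neg, gap)
```

Each class projects to (numerically) a single point, even though the data are mostly noise. Perfect separation of the training data in this direction therefore says little about new data.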
30
Distance Weighted Discrimination
Based on Optimization Problem:
$\min_{w,b} \sum_{i=1}^{n} \frac{1}{r_i}$
(where $r_i$ is the distance from data point $i$ to the separating hyperplane)
More precisely, work in an appropriate penalty for violations
Optimization Method (Michael Todd):
Second Order Cone Programming
Still convex generalization of quadratic programming
Fast greedy solution
Can use existing software
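For reference, one standard way to write the penalized problem is sketched below. The slack variables $\xi_i$, the penalty constant $C$, and the labels $y_i \in \{+1, -1\}$ are notation assumed here, since the slide only says that a penalty for violations is worked in.

```latex
\min_{w,\,b,\,\xi}\; \sum_{i=1}^{n} \frac{1}{r_i} \;+\; C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
r_i = y_i\,(x_i^{\top} w + b) + \xi_i \ge 0, \qquad
\xi_i \ge 0, \qquad
\|w\| \le 1 .
```

The constraint $\|w\| \le 1$ together with the reciprocal terms is what makes this a second order cone program rather than a quadratic program.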
31
DWD Bias Adjustment for Microarrays
Microarray data:
 Simult. Measur’ts of “gene expression”
 Intrinsically HDLSS


Dimension d ~ 1,000s – 10,000s
Sample Sizes n ~ 10s – 100s
My view:
Each array is “point in cloud”
32
DWD Batch and Source Adjustment



For Perou’s Stanford Breast Cancer Data
Analysis in Benito, et al (2004) Bioinformatics
https://genome.unc.edu/pubsup/dwd/
Adjust for Source Effects


Different sources of mRNA
Adjust for Batch Effects

Arrays fabricated at different times
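The adjustment step itself is simple once a separating direction is available: project each subpopulation onto the direction and subtract its projected mean along that direction. The sketch below assumes simulated data and uses the difference of batch means as a stand-in direction; in the actual method the direction would come from DWD. The function name `adjust_along_direction` is hypothetical.

```python
import numpy as np

def adjust_along_direction(X, batch, w):
    """Shift each batch along direction w so the projected batch means agree."""
    w = w / np.linalg.norm(w)
    X_adj = X.astype(float).copy()
    for b in np.unique(batch):
        rows = batch == b
        mean_proj = (X[rows] @ w).mean()
        X_adj[rows] -= mean_proj * w        # remove the batch mean along w only
    return X_adj

rng = np.random.default_rng(1)
n, d = 15, 500
X = rng.standard_normal((2 * n, d))
batch = np.repeat([0, 1], n)
X[batch == 1] += 0.5                        # toy additive batch effect

# Stand-in direction (assumption): difference of batch means, not a DWD fit.
w = X[batch == 1].mean(axis=0) - X[batch == 0].mean(axis=0)
w_unit = w / np.linalg.norm(w)
X_adj = adjust_along_direction(X, batch, w)

gap_before = abs((X[batch == 1] @ w_unit).mean() - (X[batch == 0] @ w_unit).mean())
gap_after = abs((X_adj[batch == 1] @ w_unit).mean() - (X_adj[batch == 0] @ w_unit).mean())
print(gap_before, gap_after)
```

Only the component along the chosen direction is changed, so variation in all orthogonal directions (ideally the biology) is left untouched.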
33
DWD Adj: Raw Breast Cancer data
34
DWD Adj: Source Colors
35
DWD Adj: Batch Colors
36
DWD Adj: Biological Class Colors
37
DWD Adj: Biological Class Colors & Symbols
38
DWD Adj: Biological Class Symbols
39
DWD Adj: Source Colors
40
DWD Adj: PC 1-2 & DWD direction
41
DWD Adj: DWD Source Adjustment
42
DWD Adj: Source Adj’d, PCA view
43
DWD Adj: Source Adj’d, Class Colored
44
DWD Adj: Source Adj’d, Batch Colored
45
DWD Adj: Source Adj’d, 5 PCs
46
DWD Adj: S. Adj’d, Batch 1,2 vs. 3 DWD
47
DWD Adj: S. & B1,2 vs. 3 Adjusted
48
DWD Adj: S. & B1,2 vs. 3 Adj’d, 5 PCs
49
DWD Adj: S. & B Adj’d, B1 vs. 2 DWD
50
DWD Adj: S. & B Adj’d, B1 vs. 2 Adj’d
51
DWD Adj: S. & B Adj’d, 5 PC view
52
DWD Adj: S. & B Adj’d, 4 PC view
53
DWD Adj: S. & B Adj’d, Class Colors
54
DWD Adj: S. & B Adj’d, Adj’d PCA
55
DWD Bias Adjustment for Microarrays


Effective for Batch and Source Adj.
Also works for cross-platform Adj.


E.g. cDNA & Affy
Despite literature claiming contrary
“Gene by Gene” vs. “Multivariate” views

Funded as part of caBIG
“Cancer BioInformatics Grid”

“Data Combination Effort” of NCI
56
Interesting Benchmark Data Set

NCI 60 Cell Lines
Interesting benchmark, since same cells
 Data Web available:
http://discover.nci.nih.gov/datasetsNature2000.jsp
 Both cDNA and Affymetrix Platforms


8 Major cancer subtypes

Use DWD now for visualization
57
NCI 60: Views using DWD Dir’ns (focus on biology)
58
DWD in Face Recognition, I
Face Images as Data
(with M. Benito & D. Peña)
Registered using landmarks
Male – Female Difference?
Discrimination Rule?
59
DWD in Face Recognition, II

DWD Direction

Good separation

Images “make sense”

Garbage at ends?
(extrapolation effects?)
60
Blood vessel tree data
Marron’s brain:
 Segmented from MRA
 Reconstruct trees
 in 3d
 Rotate to view
61
Blood vessel tree data
Now look over many people (data objects)
Structure of population (understand variation?)
PCA in strongly non-Euclidean Space???
67
Blood vessel tree data
Possible focus of analysis:
• Connectivity structure only (topology)
• Location, size, orientation of segments
• Structure within each vessel segment
68
Blood vessel tree data
Present Focus:
Topology only
 Already challenging
 Later address others
 Then add attributes
 To tree nodes
 And extend analysis
69
Strongly Non-Euclidean Spaces
Statistics on Population of Tree-Structured
Data Objects?
• Mean???
• Analog of PCA???
Strongly non-Euclidean, since:
• Space of trees not a linear space
• Not even approximately linear
(no tangent plane)
70
Strongly Non-Euclidean Spaces
PCA on Tree Space?
Key Idea (Jim Ramsay):
• Replace 1-d subspace that best approximates data
• By 1-d representation that best approximates data
Wang and Marron (2007) define notion of
Treeline (in structure space)
71
PCA for blood vessel tree data
Data Analytic Goals: Age, Gender
See these?
No…
72
Preliminary Tree-Curve Results
First Correlation of Structure to Age!
(Back Trees)
73
HDLSS Asymptotics
Why study asymptotics?
 An interesting (naïve) quote:
“I don’t look at asymptotics, because
I don’t have an infinite sample size”
 Suggested perspective:
Asymptotics are a tool for finding simple
structure underlying complex entities
76
HDLSS Asymptotics
Which asymptotics?
 n → ∞ (classical, very widely done)
 d → ∞ ???
 Sensible?
 Follow typical “sampling process”?
 Say anything, as noise level increases???
77
HDLSS Asymptotics
Which asymptotics?
 n → ∞ & d → ∞
 n >> d:
a few results around
(still have classical info in data)
 n ~ d:
random matrices (Iain J., et al)
(nothing classically estimable)
 HDLSS asymptotics: n fixed, d → ∞
78
HDLSS Asymptotics
HDLSS asymptotics: n fixed, d → ∞
 Follow typical “sampling process”?
 Microarrays: # genes bounded
 Proteomics, SNPs, …
 A moot point, from perspective:
Asymptotics are a tool for finding simple
structure underlying complex entities
80
HDLSS Asymptotics
HDLSS asymptotics: n fixed, d → ∞
 Say anything, as noise level increases???
Yes, there exists simple, perhaps
surprising, underlying structure
82
HDLSS Asymptotics: Simple Paradoxes, I
For d dim’al “Standard Normal” dist’n:
$Z = (Z_1, \ldots, Z_d)^T \sim N_d(0, I_d)$
Euclidean Distance to Origin (as $d \to \infty$):
$\|Z\| = \sqrt{d} + O_p(1)$
- Data lie roughly on surface of sphere of radius $\sqrt{d}$
- Yet origin is point of “highest density”???
- Paradox resolved by:
“density w. r. t. Lebesgue Measure”
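A quick simulation makes the first paradox concrete. This is only a sketch: the dimensions and the number of draws (50) are arbitrary choices, and the data are simulated standard normals.

```python
import numpy as np

rng = np.random.default_rng(0)
results = {}
for d in (10, 1_000, 100_000):
    Z = rng.standard_normal((50, d))          # 50 draws from N_d(0, I_d)
    norms = np.linalg.norm(Z, axis=1)
    results[d] = (norms.mean(), norms.std())  # center and spread of ||Z||
    print(d, np.sqrt(d), results[d])
```

The mean of the norms tracks the square root of d while the spread stays of order one, so relative to the radius the data concentrate on a thin shell far from the origin.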
83
HDLSS Asymptotics: Simple Paradoxes, II
For d dim’al “Standard Normal” dist’n:
$Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$
Euclidean Dist. between $Z_1$ and $Z_2$ (as $d \to \infty$):
Distance tends to non-random constant:
$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$
Can extend to $Z_1, \ldots, Z_n$
Where do they all go???
(we can only perceive 3 dim’ns)
84
HDLSS Asymptotics: Simple Paradoxes, III
For d dim’al “Standard Normal” dist’n:
$Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$
High dim’al Angles (as $d \to \infty$):
$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$
- “Everything is orthogonal”???
- Where do they all go???
(again our perceptual limitations)
- Again 1st order structure is non-random
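The distance and angle statements for two independent standard normal vectors can be checked the same way. The dimension below is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 100_000
Z1, Z2 = rng.standard_normal((2, d))          # two independent N_d(0, I_d) draws

# Distance between the two points, relative to sqrt(2d).
dist_ratio = np.linalg.norm(Z1 - Z2) / np.sqrt(2 * d)

# Angle between the two vectors, in degrees.
cos_angle = Z1 @ Z2 / (np.linalg.norm(Z1) * np.linalg.norm(Z2))
angle_deg = np.degrees(np.arccos(cos_angle))
print(dist_ratio, angle_deg)
```

Both quantities are random, yet to first order they are fixed numbers: the ratio is close to one and the angle is close to a right angle.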
85
HDLSS Asy’s: Geometrical Representation, I
Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Subspace Generated by Data
a. Hyperplane through 0, of dimension n
b. Points are “nearly equidistant to 0”, & dist ~ $\sqrt{d}$
c. Within plane, can “rotate towards $\sqrt{d} \times$ Unit Simplex”
d. All Gaussian data sets are “near Unit Simplex Vertices”!!!
“Randomness” appears only in rotation of simplex
With P. Hall & A. Neeman
86
HDLSS Asy’s: Geometrical Representation, II
Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Hyperplane Generated by Data
a. $n - 1$ dimensional hyperplane
b. Points are pairwise equidistant, dist ~ $\sqrt{2d}$
c. Points lie at vertices of “regular n-hedron”
d. Again “randomness in data” is only in rotation
e. Surprisingly rigid structure in data?
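The rigid simplex structure shows up directly in the Gram matrix of a small Gaussian sample in high dimension. The sketch below uses arbitrary values of n and d; it checks that the scaled Gram matrix is close to the identity (equal lengths, near-orthogonality) and that all pairwise distances, scaled by the square root of 2d, are close to one.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 100_000
Z = rng.standard_normal((n, d))               # n points from N_d(0, I_d)

# Scaled Gram matrix: diagonal = squared lengths / d, off-diagonal = inner products / d.
G = Z @ Z.T / d
gram_error = np.abs(G - np.eye(n)).max()

# All pairwise distances, scaled by sqrt(2d).
diffs = Z[:, None, :] - Z[None, :, :]
pair = np.linalg.norm(diffs, axis=2) / np.sqrt(2 * d)
off_diag = pair[~np.eye(n, dtype=bool)]
print(gram_error, off_diag.min(), off_diag.max())
```

A Gram matrix equal to the identity describes the vertices of a regular simplex, so what remains random is only how that simplex is rotated within the ambient space.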
87
HDLSS Asy’s: Geometrical Representation, III
Simulation View: shows “rigidity after rotation”
88
HDLSS Asy’s: Geometrical Representation, III
Straightforward Generalizations:
non-Gaussian data: only need moments
non-independent: use “mixing conditions”
(with P. Hall & A. Neeman)
Mild Eigenvalue condition on Theoretical Cov.
(with J. Ahn, K. Muller & Y. Chi)
All based on simple “Laws of Large Numbers”
89
HDLSS Asy’s: Geometrical Representation, IV
Explanation of Observed (Simulation) Behavior:
“everything similar for very high d”

2 popn’s are 2 simplices (i.e. regular n-hedrons)

All are same distance from the other class

i.e. everything is a support vector

i.e. all sensible directions show “data piling”

so “sensible methods are all nearly the same”

Including 1 - NN
90
HDLSS Asy’s: Geometrical Representation, V
Further Consequences of Geometric Representation
1. Inefficiency of DWD for uneven sample size
(motivates “weighted version”, work in progress)
2. DWD more “stable” than SVM
(based on “deeper limiting distributions”)
(reflects intuitive idea “feeling sampling variation”)
(something like “mean vs. median”)
3. 1-NN rule inefficiency is quantified.
91
2nd Paper on HDLSS Asymptotics
Ahn, Marron, Muller & Chi (2007) Biometrika
Assume 2nd Moments (and Gaussian)
Assume no eigenvalues too large in sense:
For $\epsilon \equiv \frac{\left(\sum_{j=1}^{d} \lambda_j\right)^2}{d \sum_{j=1}^{d} \lambda_j^2}$ assume $\epsilon^{-1} = o(d)$,
i.e. $\epsilon \gg \frac{1}{d}$ (min possible)
(much weaker than previous mixing conditions…)
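The eigenvalue condition is easy to evaluate for a given spectrum. The sketch below computes the sphericity-type measure, written here as the squared sum of the eigenvalues divided by d times the sum of squared eigenvalues, for a few hypothetical spectra; the function names and the single-spike examples are assumptions made for illustration.

```python
import numpy as np

def sphericity(eigenvalues):
    """(sum of eigenvalues)^2 / (d * sum of squared eigenvalues); lies in [1/d, 1]."""
    lam = np.asarray(eigenvalues, dtype=float)
    return lam.sum() ** 2 / (lam.size * np.sum(lam ** 2))

def spiked(d, alpha):
    """Hypothetical spectrum: first eigenvalue d**alpha, all others equal to 1."""
    lam = np.ones(d)
    lam[0] = float(d) ** alpha
    return lam

d = 10_000
eps_identity = sphericity(np.ones(d))                  # all eigenvalues equal
eps_single = sphericity(np.r_[1.0, np.zeros(d - 1)])   # one nonzero eigenvalue
d_eps_mild = d * sphericity(spiked(d, 0.5))            # large: condition plausible
d_eps_strong = d * sphericity(spiked(d, 1.0))          # stays bounded: condition fails
print(eps_identity, eps_single, d_eps_mild, d_eps_strong)
```

The condition asks that d times the measure grow without bound. A single spike of size d to a power below one leaves that product large, while a spike of size d keeps it bounded.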
92
HDLSS Math. Stat. of PCA, I
Consistency & Strong Inconsistency:
Spike Covariance Model (Johnstone & Paul)
For Eigenvalues: $\lambda_{1,d} = d^{\alpha}, \lambda_{2,d} = 1, \ldots, \lambda_{d,d} = 1$
1st Eigenvector: $u_1$
How good are empirical versions, $\hat\lambda_{1,d}, \ldots, \hat\lambda_{d,d}, \hat u_1$
as estimates?
HDLSS Math. Stat. of PCA, II
Consistency (big enough spike):
For $\alpha > 1$, $\mathrm{Angle}(u_1, \hat u_1) \to 0$
Strong Inconsistency (spike not big enough):
For $\alpha < 1$, $\mathrm{Angle}(u_1, \hat u_1) \to 90^\circ$
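A small simulation shows both regimes. The sketch assumes a single-spike Gaussian model in which the first eigenvalue is d raised to a power `alpha` (a symbol chosen here), the first population eigenvector is the first coordinate axis, and the mean is known to be zero so no centering is done. The values of n, d, and the two exponents are arbitrary illustrative choices.

```python
import numpy as np

def first_pc_angle_deg(alpha, n=20, d=20_000, seed=0):
    """Angle between the first population and first sample eigenvector."""
    rng = np.random.default_rng(seed)
    scale = np.ones(d)
    scale[0] = np.sqrt(float(d) ** alpha)     # sd of coordinate 0, so variance d**alpha
    X = rng.standard_normal((n, d)) * scale   # n observations, known mean zero
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    u1_hat = vt[0]                            # first sample eigenvector
    cos = min(abs(u1_hat[0]), 1.0)            # |inner product with e_1|
    return np.degrees(np.arccos(cos))

angle_big_spike = first_pc_angle_deg(alpha=1.5)
angle_small_spike = first_pc_angle_deg(alpha=0.2)
print(angle_big_spike, angle_small_spike)
```

With the large spike the sample direction nearly coincides with the population direction; with the small spike it is nearly perpendicular, even though the first eigenvalue is still the largest one.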
94
HDLSS Math. Stat. of PCA, III
Consistency of eigenvalues?
$\frac{\hat\lambda_{1,d}}{\lambda_{1,d}} \xrightarrow{L} \frac{\chi^2_n}{n}$
Eigenvalues Inconsistent
But known distribution
Unless $n \to \infty$ as well
95
HDLSS Work in Progress, II
Canonical Correlations:
Myung Hee Lee

Results similar to those for PCA

Singular values inconsistent

But directions converge under a much
milder spike assumption.
96
HDLSS Work in Progress, III
Conditions for Geo. Rep’n & PCA Consist.:
John Kent example:
$X \sim \frac{1}{2} N_d(0_d, I_d) + \frac{1}{2} N_d(0_d, 100 \cdot I_d)$
Can only say:
$\|X\| = O_p(d^{1/2}) = \begin{cases} d^{1/2} & \text{w. p. } \frac{1}{2} \\ 10\, d^{1/2} & \text{w. p. } \frac{1}{2} \end{cases}$
not deterministic
Conclude: need some flavor of mixing
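The Kent example can be simulated in a few lines. The sketch draws from the equal-weight mixture of the two spherical Gaussians (variances 1 and 100) and looks at the norm divided by the square root of d; the sample size and dimension are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 10_000, 200
# Each draw picks a mixture component: sd 1, or sd 10 (variance 100), with equal chance.
sd = np.where(rng.random(n) < 0.5, 1.0, 10.0)
X = rng.standard_normal((n, d)) * sd[:, None]
scaled = np.linalg.norm(X, axis=1) / np.sqrt(d)

near_one = np.abs(scaled - 1.0) < 0.1
near_ten = np.abs(scaled - 10.0) < 1.0
print(near_one.sum(), near_ten.sum())
```

Every scaled norm falls near one of two values, so the length of an observation does not settle to a single constant as d grows. All coordinates share one random scale, which is exactly the kind of dependence a mixing condition rules out.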
97
HDLSS Work in Progress, III
Conditions for Geo. Rep’n & PCA Consist.:
Conclude: need some flavor of mixing
Challenge: Classical mixing conditions
require notion of time ordering
Not always clear, e.g. microarrays
98
HDLSS Work in Progress, III
Conditions for Geo. Rep’n & PCA Consist.:
Sungkyu Jung Condition:
$X_d \sim (0_d, \Sigma_d)$
Define: $Z_d = \Lambda_d^{-1/2} U_d^t X_d$, where $\Sigma_d = U_d \Lambda_d U_d^t$
Assume: ∃ a permutation $\pi_d$, so that $\pi_d Z_d$ is ρ-mixing
99
HDLSS Deep Open Problem
In PCA Consistency:
Strong Inconsistency - spike $\alpha < 1$
Consistency - spike $\alpha > 1$
What happens at boundary ($\alpha = 1$)???
100
The Future of HDLSS Asymptotics?
1. Address your favorite statistical problem…
2. HDLSS versions of classical optimality results?
3. Contiguity Approach
4. Rates of convergence?
5. Improved Discrimination Methods?
(~Random Matrices)
It is early days…
101
Some Carry Away Lessons
Atoms of the Analysis: Object Oriented
Viewpoint: Object Space ↔ Feature Space
DWD is attractive for HDLSS classification
“Randomness” in HDLSS data is only in rotations
(Modulo rotation, have constant simplex shape)
How to put HDLSS asymptotics to work?
102