Object Oriented Data Analysis, Last Time
• Statistical Smoothing
– Histograms – Density Estimation
– Scatterplot Smoothing – Nonpar. Regression
• SiZer Analysis
– Replaces bandwidth selection
– Scale Space
– Statistical Inference:
Which bumps are “really there”?
– Visualization
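A rough sketch of the SiZer idea for density estimation may help here: at every bandwidth and location, estimate the slope of the kernel smooth and ask whether it is significantly positive or negative. This is a simplification under stated assumptions, not the published procedure: it uses a Gaussian kernel and a fixed pointwise normal quantile (`z = 1.96`), with no adjustment for inference across many locations. The function name `sizer_map` and the toy data are illustrative.

```python
import numpy as np

def sizer_map(data, grid, bandwidths, z=1.96):
    """Classify the slope of a Gaussian kernel density estimate at each
    (bandwidth, location): +1 significantly increasing, -1 significantly
    decreasing, 0 not significant. Pointwise intervals only (assumption)."""
    data = np.asarray(data, dtype=float)
    grid = np.asarray(grid, dtype=float)
    n = data.size
    result = np.zeros((len(bandwidths), grid.size), dtype=int)
    for i, h in enumerate(bandwidths):
        u = (grid[:, None] - data[None, :]) / h
        # Derivative in x of the kernel (1/h) * phi((x - X_i) / h).
        dk = -u * np.exp(-0.5 * u**2) / (np.sqrt(2 * np.pi) * h**2)
        slope = dk.mean(axis=1)                     # estimated derivative
        se = dk.std(axis=1, ddof=1) / np.sqrt(n)    # its standard error
        result[i, slope - z * se > 0] = 1
        result[i, slope + z * se < 0] = -1
    return result

# Toy data with two well separated bumps (synthetic, for illustration only).
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(2, 0.5, 300)])
grid = np.linspace(-4, 4, 81)
bandwidths = [0.2, 0.5, 4.0]
smap = sizer_map(data, grid, bandwidths)   # one row per bandwidth (scale)
```

Each row of `smap` is one scale. A bump that is "really there" shows up as a run of +1 followed by a run of -1 in that row.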
Kernel Density Estimation
Choice of bandwidth (window width)?
• Very important to performance
Fundamental Issue:
Which modes are “really there”?
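To see why the bandwidth matters so much, the following sketch computes a Gaussian kernel density estimate of the same synthetic sample at three bandwidths and counts the modes of each. The data, the bandwidth values, and the helper names are assumptions chosen for illustration.

```python
import numpy as np

def gaussian_kde(data, grid, h):
    """Kernel density estimate on `grid`, Gaussian kernel, bandwidth h."""
    u = (grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (data.size * h * np.sqrt(2 * np.pi))

def count_modes(density):
    """Count interior local maxima of a density evaluated on a grid."""
    d = np.diff(density)
    return int(np.sum((d[:-1] > 0) & (d[1:] <= 0)))

# Synthetic sample from two normal components (illustrative only).
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(2, 0.5, 200)])
grid = np.linspace(-5, 5, 401)

# Undersmoothed, moderate, and oversmoothed bandwidths.
modes = {h: count_modes(gaussian_kde(data, grid, h)) for h in (0.05, 0.4, 3.0)}
```

The small bandwidth produces many spurious bumps, while the large one merges the two components into a single mode, so the number of modes depends on the scale chosen.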
SiZer Background
Fun Scale Spaces Views (Incomes Data)
Surface View
SiZer Background
SiZer analysis of British Incomes data:
SiZer Background
Finance "tick data":
(time, price) of single stock transactions
Idea: "on line" version of SiZer
for viewing and understanding trends
Notes:
• trends depend heavily on scale
• double points and more
• background color transition
(flop over at top)
SiZer Background
Internet traffic data analysis:
SiZer analysis of time series of packet times
at internet hub (UNC)
Hannig, Marron, and Riedi (2001)
• across very wide range of scales
• needs more pixels than screen allows
• thus do zooming view
(zoom in over time)
– zoom in to yellow bd’ry in next frame
– readjust vertical axis
SiZer Background
Internet traffic data analysis (cont.)
Insights from SiZer analysis:
• Coarse scales: amazing amount of significant structure
• Evidence of self-similar fractal type process?
• Fewer significant features at small scales
• But they exist, so not Poisson process
• Poisson approximation OK at small scale???
• Smooths (top part) stable at large scales?
Dependent SiZer
Rondonotti, Marron, and Park (2007)
• SiZer compares data with white noise
• Inappropriate in time series
• Dependent SiZer compares data with an
assumed model
• Visual Goodness of Fit test
Dep’ent SiZer : 2002 Apr 13 Sat 1 pm – 3 pm
Zoomed view (to red region, i.e. “flat top”)
Further Zoom: finds very periodic behavior!
Possible Physical Explanation
IP “Port Scan”
• Common device of hackers
• Searching for “break in points”
• Send query to every possible (within
UNC domain):
– IP address
– Port Number
• Replies can indicate system
weaknesses
Internet Traffic is hard to model
SiZer Overview
Would you like to try a SiZer analysis?
• Matlab software:
http://www.unc.edu/depts/statistics/postscript/papers/marron/Matlab6Software/Smoothing/
• JAVA version (demo, beta): Follow the SiZer link from the Wagner Associates home page:
http://www.wagner.com/www.wagner.com/SiZer/
• More details, examples and discussions:
http://www.stat.unc.edu/faculty/marron/DataAnalyses/SiZer_Intro.html
PCA to find clusters
Return to PCA of Mass Flux Data:
PCA to find clusters
SiZer analysis of Mass Flux, PC1
Conclusion:
• Found 3 significant clusters!
• Correspond to 3 known “cloud types”
• Worth deeper investigation
Recall Yeast Cell Cycle Data
• “Gene Expression” – Micro-array data
• Data (after major preprocessing):
Expression “level” of:
• thousands of genes
(d ~ 1,000s)
• but only dozens of “cases” (n ~ 10s)
• Interesting statistical issue:
High Dimension Low Sample Size data
(HDLSS)
Yeast Cell Cycle Data, FDA View
Central question:
Which genes are “periodic” over 2 cell cycles?
Yeast Cell Cycle Data, FDA View
Periodic genes? Naïve approach: Simple PCA
Yeast Cell Cycle Data, FDA View
• Central question: which genes are “periodic” over 2 cell cycles?
• Naïve approach: Simple PCA
• No apparent (2 cycle) periodic structure?
• Eigenvalues suggest large amount of “variation”
• PCA finds “directions of maximal variation”
• Often, but not always, same as “interesting directions”
• Here need better approach to study periodicities
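The point that maximal variation need not be the interesting direction can be illustrated with a toy example. The sketch below runs PCA (via the SVD) on synthetic curves that mix a high-variance trend with a low-variance frequency-2 component. The data, the 18-point time grid, and all names are assumptions for illustration, not the yeast data.

```python
import numpy as np

def pca(X, k=2):
    """PCA via SVD. Rows of X are data objects, columns are features."""
    mean = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    eigenvalues = s**2 / (X.shape[0] - 1)    # variance along each direction
    directions = Vt[:k]                       # unit vectors, one per row
    scores = U[:, :k] * s[:k]                 # projections of centered data
    return mean, directions, scores, eigenvalues

# Toy curves: a large-variance trend plus a small frequency-2 component.
rng = np.random.default_rng(0)
t = np.arange(18) / 18
trend = (t - t.mean()) / np.linalg.norm(t - t.mean())
wave = np.sin(2 * np.pi * 2 * t)
periodic = wave / np.linalg.norm(wave)
X = (rng.normal(0, 5, (300, 1)) * trend
     + rng.normal(0, 1, (300, 1)) * periodic
     + rng.normal(0, 0.1, (300, 18)))

mean, directions, scores, eigenvalues = pca(X)
alignment = abs(directions[0] @ trend)   # PC1 follows the trend, not the cycle
```

Here the first direction is dominated by the trend, so the periodic structure is not what PC1 displays, even though it is present in every curve.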
Yeast Cell Cycles, Freq. 2 Proj.
PCA on Freq. 2 Periodic Component of Data
Frequency 2 Analysis
Frequency 2 Analysis
• Project data onto 2-dim space of sin and cos (freq. 2)
• Useful view: scatterplot
• Angle (in polar coordinates) shows phase
• Colors: Spellman’s cell cycle phase classification
• Black was labeled “not periodic”
• Within class phases approx’ly same, but notable differences
• Now try to improve “phase classification”
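The projection described above can be sketched as follows. This assumes each gene's time course is sampled on an equally spaced grid that covers exactly two cycles, which is an idealization. The function name `freq2_projection` and the toy curves are hypothetical.

```python
import numpy as np

def freq2_projection(curves, n_cycles=2):
    """Project each row (one time course) onto the cos and sin of the
    given frequency; return 2-d coordinates, radius, and phase in degrees."""
    n_time = curves.shape[1]
    t = np.arange(n_time) / n_time
    basis = np.vstack([np.cos(2 * np.pi * n_cycles * t),
                       np.sin(2 * np.pi * n_cycles * t)])
    basis /= np.linalg.norm(basis, axis=1, keepdims=True)
    centered = curves - curves.mean(axis=1, keepdims=True)
    coords = centered @ basis.T                      # shape (n_genes, 2)
    radius = np.hypot(coords[:, 0], coords[:, 1])    # strength of periodicity
    phase = np.degrees(np.arctan2(coords[:, 1], coords[:, 0])) % 360
    return coords, radius, phase

# Three toy "genes" with known phases (illustrative values).
t = np.arange(18) / 18
true_phase = np.array([30.0, 120.0, 250.0])
curves = np.array([2.0 * np.cos(2 * np.pi * 2 * t - np.radians(p)) + 5.0
                   for p in true_phase])
coords, radius, phase = freq2_projection(curves)
```

In the scatterplot of `coords`, distance from the origin measures how strongly periodic a gene is, and the polar angle is its phase.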
Yeast Cell Cycle
Revisit “phase classification”, approach:
• Use outer 200 genes
(other numbers tried, less resolution)
• Study distribution of angles
• Use SiZer analysis
(finds significant bumps, etc., in histogram)
• Carefully redrew boundaries
• Check by studying k.d.e. angles
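The first steps of this approach can be sketched as below: keep the genes farthest from the origin in the frequency-2 scatterplot, then smooth their angles with a kernel that respects the wrap-around at 360 degrees. The kernel uses the shortest angular distance, so it is only approximately normalized when the bandwidth is small relative to 360. The synthetic coordinates and all names are assumptions.

```python
import numpy as np

def outer_angles(coords, n_outer=200):
    """Indices and angles (degrees) of the n_outer points with largest radius."""
    radius = np.hypot(coords[:, 0], coords[:, 1])
    keep = np.argsort(radius)[-n_outer:]
    angles = np.degrees(np.arctan2(coords[keep, 1], coords[keep, 0])) % 360
    return keep, angles

def circular_kde(angles_deg, grid_deg, h_deg):
    """Gaussian kernel smooth of angles using the shortest angular distance."""
    diff = (grid_deg[:, None] - angles_deg[None, :] + 180) % 360 - 180
    u = diff / h_deg
    return np.exp(-0.5 * u**2).sum(axis=1) / (
        angles_deg.size * h_deg * np.sqrt(2 * np.pi))

def polar(r, a_deg):
    a = np.radians(a_deg)
    return np.column_stack([r * np.cos(a), r * np.sin(a)])

# Synthetic scatterplot: 800 weak points, 200 strong points in two phase groups.
rng = np.random.default_rng(2)
weak = polar(np.abs(rng.normal(0, 0.3, 800)), rng.uniform(0, 360, 800))
strong = polar(3 + rng.normal(0, 0.2, 200),
               np.concatenate([rng.normal(10, 8, 100), rng.normal(200, 8, 100)]))
coords = np.vstack([weak, strong])

keep, angles = outer_angles(coords)
grid = np.linspace(0, 360, 361)
density = circular_kde(angles, grid, h_deg=10)
```

Bumps in `density` suggest candidate phase groups. A SiZer-style analysis would then ask which of those bumps are significant.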
SiZer Study of Dist’n of Angles
Reclassification of Major Genes
Compare to Previous Classif’n
New Subpopulation View
OODA in Image Analysis
First Generation Problems:
• Denoising
• Segmentation (find object boundaries)
• Registration (align objects)
(all about single images)
OODA in Image Analysis
Second Generation Problems:
• Populations of Images
– Understanding Population Variation
– Discrimination (a.k.a. Classification)
• Complex Data Structures (& Spaces)
• HDLSS Statistics
HDLSS Data in Image Analysis
Why HDLSS (High Dim, Low Sample Size)?
• Complex 3-d Objects Hard to Represent
– Often need d = 100’s of parameters
• Complex 3-d Objects Costly to Segment
– Often have n = 10’s of cases
Image Object Representation
Major Approaches for Images:
• Landmark Representations
• Boundary Representations
• Medial Representations
Landmark Representations
Main Idea:
• On each object find important points
• Treat point locations as features
• I.e. represent objects by vectors of point locations (in 2-d or 3-d)
(Fits in OODA framework)
Landmark Representations
Basis of Field of Statistical Shape Analysis:
(important precursor of FDA & OODA)
Main References:
• Kendall (1981, 1984)
• Bookstein (1984)
• Dryden and Mardia (1998)
(most readable and comprehensive)
Landmark Representations
Nice Example:
• Fly Wing Data (Drosophila fruit flies)
• From George Gilchrist, W. & M. U.
http://gwgilc.people.wm.edu/
• Graphic Illustrating Landmarks (next page)
– Same veins appear in all flies
– And always have same relationship
– I.e. all landmarks always identifiable
Landmark Representations
Landmarks for fly wing data:
Landmark Representations
Important issue for landmark approaches:
Location, i. e. Registration
Illustration with Fly Wing Data (next slide)
Problem:
• coordinates are “locations in photo”
• & unclear where wing is positioned…
Landmark Representations
Illustration of Registration, with Fly Wing Data
Landmark Representations
Standard Approach to Registration Problem:
Procrustes Analysis
Idea: mod out location
• Can also mod out rotation
• Can also mod out size
Recommended reference:
Dryden and Mardia (1998)
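As a hedged illustration of "modding out" location, size, and rotation, the sketch below aligns one landmark configuration to a fixed reference using centering, scaling to unit size, and an SVD-based rotation. It is a pairwise alignment only. Analysis of a whole population would typically align to an evolving mean shape, which is omitted here. The function name and the toy landmarks are assumptions.

```python
import numpy as np

def procrustes_align(shape, reference):
    """Align a k x 2 (or k x 3) landmark configuration to a reference by
    removing location, size, and rotation. Returns both normalized shapes."""
    def normalise(x):
        x = x - x.mean(axis=0)             # mod out location
        return x / np.linalg.norm(x)       # mod out size
    a = normalise(np.asarray(shape, dtype=float))
    b = normalise(np.asarray(reference, dtype=float))
    # Best rotation of a onto b in the least squares sense, via the SVD.
    u, _, vt = np.linalg.svd(a.T @ b)
    if np.linalg.det(u @ vt) < 0:          # exclude reflections
        u[:, -1] *= -1
    return a @ (u @ vt), b                 # mod out rotation

# Toy check: a shifted, scaled, rotated copy of the reference.
rng = np.random.default_rng(3)
reference = rng.normal(size=(6, 2))
theta = np.radians(40)
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta), np.cos(theta)]])
shape = 2.5 * reference @ rotation + np.array([10.0, -4.0])
aligned, ref_normalised = procrustes_align(shape, reference)
```

After alignment the copy coincides with the normalized reference, so any remaining differences between real objects can be read as shape differences.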
Landmark Representations
Procrustes Results for Fly Wing Data
Landmark Representations
Effect of Procrustes Analysis:
Study Difference Between Continents
• Flies from Europe & South America
• Look for important differences
• Project onto mean difference direction
• Visualize with movie
– Equal time spacing
– Through range of data
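The movie construction above can be sketched as follows: compute the unit vector between the two group means, project all objects onto it, and generate frames at equally spaced positions through the range of the projected data. The synthetic groups and all names are assumptions. For the fly wings, each row would be a flattened vector of landmark coordinates.

```python
import numpy as np

def mean_difference_movie(group_a, group_b, n_frames=5):
    """Rows are data objects. Returns the unit mean difference direction,
    the projection scores, and equally spaced frames along that direction."""
    all_data = np.vstack([group_a, group_b])
    center = all_data.mean(axis=0)
    direction = group_b.mean(axis=0) - group_a.mean(axis=0)
    direction = direction / np.linalg.norm(direction)
    scores = (all_data - center) @ direction
    # Equal spacing through the range of the projected data.
    steps = np.linspace(scores.min(), scores.max(), n_frames)
    frames = center + steps[:, None] * direction
    return direction, scores, frames

# Two synthetic groups that differ in the first coordinate only.
rng = np.random.default_rng(4)
group_a = rng.normal(0, 0.3, (50, 4))
group_b = rng.normal(0, 0.3, (50, 4)) + np.array([3.0, 0.0, 0.0, 0.0])
direction, scores, frames = mean_difference_movie(group_a, group_b)
```

Each row of `frames` is one object along the direction. Drawing the rows in sequence gives the movie.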
Landmark Representations
No Procrustes Adjustment:
Movies on Difference Between Continents
Landmark Representations
Effect of Procrustes Analysis:
Movies on Difference Between Continents
• Raw Data
– Driven by location effects
– Strongly feels size
– Hard to understand shape
Landmark Representations
Location, Rotation, Scale Procrustes:
Movies on Difference Between Continents
Landmark Representations
Effect of Procrustes Analysis:
Movies on Difference Between Continents
• Raw Data
– Driven by location effects
– Strongly feels size
– Hard to understand shape
• Full Procrustes
– Mods out location, size, rotation
– Allows clear focus on shape
Landmark Representations
Major Drawback of Landmarks:
• Need to always find each landmark
• Need same relationship
• I.e. Landmarks need to correspond
• Often fails for medical images
• E.g. How many corresponding landmarks on a set of kidneys, livers or brains???
Landmark Representations
Landmarks for brains???
(thanks to Liz Bullitt)
Very hard to identify
Landmark Representations
Look across people:
Some structure in common
But “folds” are different
Consistent Landmarks???
Boundary Representations
Major sets of ideas:
• Triangular Meshes
– Survey: Owen (1998)
• Active Shape Models
– Cootes, et al (1993)
• Fourier Boundary Representations
– Kelemen, et al (1997 & 1999)
Boundary Representations
Example of triangular mesh rep’n:
From: www.geometry.caltech.edu/pubs.html
Boundary Representations
Example of triangular mesh rep’n for a brain:
From: meshlab.sourceforge.net/SnapMeshLab.brain.jpg
Boundary Representations
Main Drawback: Correspondence
• For OODA (on vectors of parameters):
Need to “match up points”
• Easy to find triangular mesh
– Lots of research on this driven by gamers
• Challenge: match mesh across objects
– There are some interesting ideas…
Medial Representations
Main Idea:
Represent Objects as:
• Discretized skeletons (medial atoms)
• Plus spokes from center to edge
• Which imply a boundary
Very accessible early reference:
• Yushkevich, et al (2001)
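To make the "atoms plus spokes imply a boundary" idea concrete, here is a minimal 2-d sketch. It assumes each atom stores a position, a radius, the direction that bisects its two spokes, and the half-angle between them. This parameterization and all names are illustrative assumptions, not taken from the cited reference.

```python
import numpy as np

def implied_boundary(positions, radii, directions_deg, half_angles_deg):
    """For each 2-d medial atom, return the end points of its two spokes.
    Each spoke has length radius and sits at +/- half_angle from direction."""
    phi = np.radians(directions_deg)
    th = np.radians(half_angles_deg)
    upper = positions + radii[:, None] * np.column_stack(
        [np.cos(phi + th), np.sin(phi + th)])
    lower = positions + radii[:, None] * np.column_stack(
        [np.cos(phi - th), np.sin(phi - th)])
    return upper, lower

# Toy skeleton: five atoms along the x-axis, radius 1, spokes straight up/down.
positions = np.column_stack([np.arange(5.0), np.zeros(5)])
radii = np.ones(5)
upper, lower = implied_boundary(positions, radii, np.zeros(5), np.full(5, 90.0))
```

Joining the `upper` points and the `lower` points gives the two sides of the implied boundary, in this toy case a strip of half-width 1 around the skeleton.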
Medial Representations
2-d M-Rep Example:
Corpus Callosum
(Yushkevich)
Atoms - Spokes - Implied Boundary
Medial Representations
3-d M-Rep Example: From Ja-Yeon Jeong
Bladder – Prostate - Rectum
Atoms - Spokes - Implied Boundary
Medial Representations
3-d M-reps: there are several variations
Two choices:
From Fletcher (2004)
Medial Representations
Statistical Challenge
• M-rep parameters are:
– Locations $\in \mathbb{R}^2, \mathbb{R}^3$
– Radii $> 0$
– Angles (not comparable)
• Stuffed into a long vector
• I.e. many direct products of these
Medial Representations
Statistical Challenge:
• How to analyze angles as data?
• E.g. what is the average of:
3°, 4°, 358°, 359°
– 181° ??? (average of the numbers)
– 1° (of course!)
• Correct View of angular data:
Consider as points on the unit circle
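The unit circle view can be checked numerically: map each angle to a point on the circle, average the points, and read off the angle of the average. The sketch below compares this with the ordinary arithmetic mean for the four angles above. The function name is an assumption.

```python
import math

def circular_mean_deg(angles_deg):
    """Mean direction: average the unit-circle points, then take the angle."""
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(s, c)) % 360

angles = [3, 4, 358, 359]
arithmetic = sum(angles) / len(angles)    # 181.0, points the wrong way
circular = circular_mean_deg(angles)      # approximately 1.0
```

The four angles are symmetric about 1°, which is why the circular mean lands there while the arithmetic mean points in nearly the opposite direction.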

Medial Representations
What is the average (181°?) or (1°?) of:
3°, 4°, 358°, 359°
Medial Representations
Statistical Analysis of Directional Data:
• Common Examples:
– Wind Directions (0-360)
– Magnetic Fields (0-360)
– Cracks (0-180)
• There is a literature (monographs):
– Mardia (1972, 2000)
– Fisher, et al (1987, 1993)
Medial Representations
Statistical Challenge
• Many direct products of:
– Locations $\in \mathbb{R}^2, \mathbb{R}^3$
– Radii $> 0$
– Angles (not comparable)
• Appropriate View:
Data Lie on Curved Manifold
Embedded in higher dim’al Eucl’n Space
Medial Representations
Data on Curved Manifold Toy Example:
Medial Representations
Data on Curved Manifold Viewpoint:
• Very Simple Toy Example (last movie)
• Data on a Cylinder = $\mathbb{R}^1 \times S^1$
• Notes:
– Simplest non-Euclidean Example
– 2-d data, embedded on manifold in $\mathbb{R}^3$
– Can flatten the cylinder, to a plane
– Have periodic representation
– Movie by: Suman Sen
• Same idea for more complex direct prod’s
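A small sketch of the cylinder example: each data point is a (height, angle) pair, embedded in 3-d on a unit cylinder, and flattening recovers the pair with the angle taken modulo 360. The function names and sample values are assumptions for illustration.

```python
import numpy as np

def embed_cylinder(height, angle_deg):
    """Map (height, angle) pairs to points on the unit cylinder in 3-d."""
    theta = np.radians(angle_deg)
    return np.column_stack([np.cos(theta), np.sin(theta), height])

def flatten_cylinder(points):
    """Inverse map: recover (angle in [0, 360), height) from 3-d points."""
    angle = np.degrees(np.arctan2(points[:, 1], points[:, 0])) % 360
    return np.column_stack([angle, points[:, 2]])

height = np.array([0.5, -1.0, 2.0])
angle = np.array([10.0, 170.0, 350.0])
points = embed_cylinder(height, angle)    # 2-d data living in 3-d
flat = flatten_cylinder(points)           # flattened, periodic in the angle
```

In the flattened view the angles 350° and 10° look far apart, yet on the cylinder they are neighbors. This is the periodic representation noted above.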