Limn, to present an image or lifelike
imitation of;
Limit-n, to explore any data set to its
limits
Di Cook
Statistics, Iowa State University
Joint with Peter Sutherland, Manuel Suarez, Vasant Honavar,
Doina Caragea, Les Miller
Outline
What is the Role of Viz in Nonparametrics?
(especially Classification)
Movie Technology for Large Data
Case studies
FBI Bullet Data: feed-forward back-propagation neural network. (Course project of
Jennifer Schumi.)
Crabs data: CART tree, Support Vector
Machines.
FBI Forensic Bullet Data
4 manufacturers, 50 bullets from each, 5 trace elements measured on each: Copper, Arsenic, Bismuth,
Silver, Antimony.
Can we recognize the manufacturer based on the concentrations of trace elements?
Note: With graphics it is easy to build a classification rule for all 4 classes, but with CART, LDA and
feed-forward neural networks it was not possible to
perfectly classify all 4 groups. There was confusion between two groups.
Data provided to Hal Stern and Alicia Carriquiry by the FBI. This
analysis was conducted by Jennifer Schumi as a class project.
Feed-forward neural networks (Ripley's
S code)
Where's the boundary?
[Figure: plots of the bullet data; axes Copper, Arsenic, Bismuth, Silver, Antimony.]
Case Study: Australian Leptograpsus Crabs
50 specimens of each of 4 classes (2 species, Orange and Blue, by 2 sexes, M and F).
Preserved specimens lose their color, so classify on morphological measurements.
Source: Campbell and Mahon, 1974.
CART
Classification tree:
tree(formula = factor(crabsg) ~ ., data = data.frame(crabsd))
Number of terminal nodes: 28
Residual mean deviance: 0.5903 = 101.5 / 172
Misclassification error rate: 0.125 = 25 / 200
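The summary figures can be checked by hand. The sketch below assumes that the residual mean deviance is the total deviance divided by the number of observations minus the number of terminal nodes, which matches the 172 in the output. The variable names are illustrative, and the deviance is taken as printed (rounded), so the result agrees with 0.5903 only up to rounding.

```python
# Values copied from the tree summary.
n_obs = 200            # 4 classes x 50 crabs
n_leaves = 28          # number of terminal nodes
deviance = 101.5       # total residual deviance, as printed (rounded)
n_misclassified = 25

# Assumption: residual degrees of freedom = observations - terminal nodes.
resid_df = n_obs - n_leaves
mean_deviance = deviance / resid_df
error_rate = n_misclassified / n_obs

print(resid_df)       # 172
print(error_rate)     # 0.125
print(mean_deviance)  # close to the printed 0.5903
```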
[Figure: the fitted classification tree, with splits on FL, RW, CW, CL, and BD and terminal nodes labeled by class 1-4.]
Graphics
[Figure: plots of the crabs measurements FL, RW, CL, CW, and BD, including standardized versions (x-m)/s.]
Graphics
[Figure: scatterplots of the crabs measurements; axes RW, CL, FL, CW.]
Graphics
Strong correlation between variables.
Strong separation between species. Some
confusion in sexes of smaller crabs.
Males separated from females in plots of
CL vs RW (males have a higher CL:RW
ratio), and BD vs RW, and CW vs RW.
The two species can be separated in the
plots of FL vs CW, and BD vs CW.
Support Vector Machines
Method     Species   Sexes   Error Rate
NN (SVM)   0         2       0.01
[Figure: plot of the crabs data; axes RW, CW, FL, BD.]
Large amounts of data?
Reduce data or scale up methods?
We want to look at the full resolution!
Scaling up tour methods: tour algorithm is
linear in n!
Exploit technological advances.
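One way to see the linear cost: projecting n points in p variables onto a d-dimensional frame takes a single pass over the points, roughly n*p*d multiplications per frame. The sketch below is a minimal illustration, not the talk's software, and the function names are hypothetical.

```python
def project(points, frame):
    """Project n points (lists of p values) onto a frame of d axes (lists of p values)."""
    return [[sum(x * a for x, a in zip(point, axis)) for axis in frame]
            for point in points]


def multiply_count(n, p, d):
    """Multiplications needed for one projected frame: one per point, variable, and axis."""
    return n * p * d


# A frame for p = 5 variables that shows variable 1 against variable 3.
frame = [[1, 0, 0, 0, 0], [0, 0, 1, 0, 0]]
points = [[1.0, 2.0, 3.0, 4.0, 5.0], [6.0, 7.0, 8.0, 9.0, 10.0]]
print(project(points, frame))        # [[1.0, 3.0], [6.0, 8.0]]

# Doubling n doubles the work per frame.
print(multiply_count(178080, 5, 2))  # 1780800
```

The count 178080 with 5 variables is the size of the buoy data set shown later in the talk.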
Using Video Technology
Two steps:
1. Create an animation sequence, and save
it as a QuickTime movie or as a series of
JPEG images.
2. View the animation, and interact with
it by brushing small subsets, and highlighting the brushed points with overlays
on the images.
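The two steps can be sketched as follows. This is a rough illustration under stated assumptions, not the software used in the talk: each frame is binned into a small grid of counts, the brushed subset is drawn in red on top, and the result is encoded as a plain-text PPM image standing in for JPEG. All function names and the toy coordinates are hypothetical.

```python
def cell(x, y, width, height, bounds):
    """Map a projected point to a (row, col) grid cell, clamped to the image."""
    xmin, xmax, ymin, ymax = bounds
    col = int((x - xmin) / (xmax - xmin) * width)
    row = int((ymax - y) / (ymax - ymin) * height)
    return min(max(row, 0), height - 1), min(max(col, 0), width - 1)


def render_frame(xy, width, height, bounds):
    """Step 1: bin the projected points of one frame into a grid of counts."""
    grid = [[0] * width for _ in range(height)]
    for x, y in xy:
        r, c = cell(x, y, width, height, bounds)
        grid[r][c] += 1
    return grid


def to_ppm(grid, brushed_cells=frozenset()):
    """Step 2: encode the frame as plain PPM, with brushed cells overlaid in red."""
    height, width = len(grid), len(grid[0])
    peak = max(max(row) for row in grid) or 1
    lines = ["P3", f"{width} {height}", "255"]
    for r, row in enumerate(grid):
        pixels = []
        for c, count in enumerate(row):
            if (r, c) in brushed_cells:
                pixels.append("255 0 0")
            else:
                shade = 255 - round(255 * count / peak)  # denser cells are darker
                pixels.append(f"{shade} {shade} {shade}")
        lines.append(" ".join(pixels))
    return "\n".join(lines) + "\n"


# Toy coordinates for illustration only (not real data).
bounds = (0.0, 1.0, 0.0, 1.0)
points = [(0.1, 0.1), (0.1, 0.15), (0.9, 0.9)]
brushed = [(0.9, 0.9)]
grid = render_frame(points, 4, 4, bounds)
overlay = {cell(x, y, 4, 4, bounds) for x, y in brushed}
ppm = to_ppm(grid, overlay)
```

Because the overlay is computed separately from the frame, the full-data frames can be rendered once and reused while only the small brushed subset is redrawn.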
Approaches to Generating the
Animations
Little Tour: interpolation path between all
pairs of variables.
Grand Tour: interpolation between random
orthonormal bases.
From File: interpolation path between bases
read from file, takes advantage of work by
Neil Sloane on fixed size tours.
PP Guidance: generates projection pursuit indices of interest for each projection, which could be used
later to restrict the tour to interesting
subspaces.
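The little tour and grand tour differ only in where the target bases come from. The sketch below builds both kinds of target and moves between two of them. It is a simplified illustration with hypothetical function names: the path is a linear blend of the two frames followed by re-orthonormalization, which is an assumption made for brevity and not the geodesic interpolation a full tour implementation would use.

```python
import math
import random


def orthonormalize(frame):
    """Gram-Schmidt on a list of d vectors of length p."""
    out = []
    for v in frame:
        w = list(v)
        for u in out:
            dot = sum(a * b for a, b in zip(w, u))
            w = [a - dot * b for a, b in zip(w, u)]
        norm = math.sqrt(sum(a * a for a in w))
        out.append([a / norm for a in w])
    return out


def random_frame(p, d=2, rng=random):
    """Grand tour target: a random orthonormal d-frame in p dimensions."""
    return orthonormalize([[rng.gauss(0, 1) for _ in range(p)] for _ in range(d)])


def little_tour_frames(p):
    """Little tour targets: one axis-aligned 2-D frame per pair of variables."""
    frames = []
    for i in range(p):
        for j in range(i + 1, p):
            e_i = [1.0 if k == i else 0.0 for k in range(p)]
            e_j = [1.0 if k == j else 0.0 for k in range(p)]
            frames.append([e_i, e_j])
    return frames


def interpolate(start, end, steps):
    """Frames from start to end: linear blend, then re-orthonormalize (a simplification)."""
    path = []
    for s in range(steps + 1):
        t = s / steps
        blend = [[(1 - t) * a + t * b for a, b in zip(u, v)]
                 for u, v in zip(start, end)]
        path.append(orthonormalize(blend))
    return path


rng = random.Random(0)
grand = [random_frame(5, rng=rng) for _ in range(3)]
little = little_tour_frames(5)          # 10 pairs for 5 variables
path = interpolate(little[0], little[1], steps=10)
```

Each frame in `path` can then be used to project the data for one image of the animation.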
Example: Tropical Atmosphere Ocean
Array of Buoys
Monitoring El Niño,
70 buoys in the Pacific,
5 variables, with time (1980-1994) and
space components.
Example
Tour movie of the entire data set: 178080 points from
1980-1998, on the variables zonal winds, meridional winds,
humidity, air temperature and sea surface temperature.
Overlays: Dec 1993 (normal), Dec 1997 (El Niño).
Discussion
Overlays on video: Binned resolutions, intelligently selected subsets, models.
How do we link between multiple views in
real-time: create indexing when plots are
constructed?
How do we visually represent weighted data?
Glyph size, color?
Summary
Visualization can help understand and simplify results of non-parametric methods.
Video technology provides some potential
for scaling methods up.