Graphics - an Ace up a Statistician's Sleeve
Heike Hofmann, [email protected]
IOWA STATE UNIVERSITY
2003 WNAR/IMS meeting
• Bad graphics
• Beginning of Statistical Graphics
• Milestones in Graphics
• Interactive Graphics
BAD Graphics
Guidelines for a bad graphic:
(Howard Wainer)
• don’t show much data
• show the data inaccurately
• obfuscate the data
Criteria for bad graphics:
(Edward Tufte)
• Lie Factor
size of effect in graphic / size of effect in data
Lie Factor
Increase in Mileage
The Lie Factor
(from Tufte, 1983, p.57)
gif image by Clay Helberg, Pitfalls of Data Analysis
This graph, from the NY Times, purports to show the mandated fuel
economy standards set by the US Department of Transportation. The
standard required an increase in mileage from 18 to 27.5 miles per
gallon, an increase of 53%. The magnitude of increase shown in the
graph is 783%, for a whopping lie factor of 783/53 = 14.8!
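The arithmetic behind this lie factor can be checked with a few lines of Python. This is a minimal sketch: the function names are made up for illustration, and the 783% figure is taken as quoted above rather than measured from the graphic.

```python
def percent_change(first, last):
    """Relative change from first to last, in percent."""
    return 100.0 * (last - first) / first


def lie_factor(effect_in_graphic, effect_in_data):
    """Tufte's ratio: size of effect in graphic / size of effect in data."""
    return effect_in_graphic / effect_in_data


# Effect in the data: mandated mileage rises from 18 to 27.5.
data_effect = percent_change(18.0, 27.5)

# Effect in the graphic: 783 percent, as quoted in the text above.
graphic_effect = 783.0

print(f"effect in data:    {data_effect:.0f}%")
print(f"effect in graphic: {graphic_effect:.0f}%")
print(f"lie factor:        {lie_factor(graphic_effect, data_effect):.1f}")
```

A lie factor of 1 would mean the graphic shows the effect at its true size.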
BAD Graphics
Goals for a bad graphic:
(Howard Wainer)
• don’t show much data
• show the data inaccurately
• obfuscate the data
Criteria for bad graphics:
(Edward Tufte)
• Lie Factor
size of effect in graphic / size of effect in data
• Data-Ink Ratio
data ink / total ink used in graphic
“Worst Graphic ever ... in print” (Tufte)
Age Structure of College Enrollment
Art or Artifice?
As a substitute for substance, one can try lots of color, 3D effects, or
disguised redundancy. This graph uses all three techniques to display just
five numbers. Note the clever use of mirror-imaging (the top series is
just 100 minus the bottom series) and the interesting use of curved lines,
front and back, to avoid the appearance that there's a lot less here than
meets the eye.
Tufte (1983, p.118) says, “This may well be the worst graphic ever to find
its way into print.”
Beginnings of Graphics
Beginnings of Statistical Graphics
William Playfair (1759 - 1823)
Scottish economist
author of “The Commercial and Political Atlas” (1786)
includes 44 charts:
time series plots, one bar chart
simple in design, yet data rich.
Charles Joseph Minard (1781 - 1870)
Civil engineer, École Nationale des Ponts et Chaussées (ENPC)
1844 - 1870: draws maps and data flow graphs
Playfair: Price of Wheat
Price of a quarter of wheat (a grain measure of 8 bushels)
from 1565 to 1821
in comparison to weekly wages
with a time-line of reigns of different rulers
Minard: Napoleon's Russian Campaign 1812
6-dimensional data on the army:
geographic location (2 dimensions)
size of army
time
temperature
direction of movement
Overlaid Maps
Cholera Outbreak in Central London
September 1854
Dr John Snow plotted
deaths by dots
crosses for water pumps
Overlaid Maps
Armoring Airplanes
during WWII
Abraham Wald was asked where to add extra
armor to airplanes,
based on the pattern of bullet holes in
returning aircraft
Wald determined where the returning planes had been shot
(planes hit anywhere else had not made it back)
conclusion: put extra armor every place else!
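The reasoning behind that conclusion can be illustrated with a small simulation. This is only a sketch under invented assumptions: the four sections, the loss probabilities and the one-hit-per-plane rule are hypothetical and are not taken from Wald's actual analysis.

```python
import random


def simulate_returning_hits(n_planes=10_000, seed=0):
    """Count hits per section on all planes and on returning planes only."""
    # Hypothetical chance that a single hit in this section downs the plane.
    loss_prob = {"engine": 0.8, "cockpit": 0.6, "fuselage": 0.1, "wings": 0.1}
    sections = list(loss_prob)
    rng = random.Random(seed)
    all_hits = dict.fromkeys(sections, 0)
    seen_hits = dict.fromkeys(sections, 0)
    for _ in range(n_planes):
        section = rng.choice(sections)  # one hit per plane, placed uniformly
        all_hits[section] += 1
        if rng.random() >= loss_prob[section]:  # plane survives and returns
            seen_hits[section] += 1
    return all_hits, seen_hits


all_hits, seen_hits = simulate_returning_hits()
for s in all_hits:
    print(f"{s:9s} hits: {all_hits[s]:5d}   seen on returning planes: {seen_hits[s]:5d}")
```

In this toy setup every section is hit about equally often, yet the returning planes show few hits in the sections where a hit is most dangerous. The sparse areas on the observed pattern are therefore the ones that need the armor.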
Train schedule: Paris - Lyon
Marey's Plot (1880)
today's TGV
Modern Dark Ages (1900-1949)
only a few innovations, rise of "classical" statistics:
distributions, hypothesis tests, parameter estimates, ...
Re-Birth of Statistical Graphics (1950-1974)
John W. Tukey
variety of new simple graphics: Exploratory Data Analysis
Jacques Bertin
Sémiologie Graphique: organizes the visual and perceptual elements of graphics
http://viscog.beckman.uiuc.edu/djs_lab/demos.html
Computers become available
High Dimensional, High Interaction Graphics
With increasing computer power, processing of high-dimensional data becomes possible
High-interaction graphics with new paradigms:
selection, linked highlighting, brushing, logical zooming
New Methods:
for continuous variables
Scatterplot Matrix, Grand Tour + Projection Pursuit, Parallel Coordinate Plots
for categorical variables
Mosaic Plots, Tree Maps
Ever Expanding Application to New Areas
Wide Range of Commercial and Free Software
DataDesk, Spotfire, Statistica, JMP, Visual Insights
GGobi, Manet, Mondrian
Biplots & Grand Tour
Biplots (Gabriel 1971)
idea:
scatterplot of 1st & 2nd principal component,
add original variables as lines
[Figure: biplot with axes 1. pc and 2. pc; original variables drawn as lines: %weaver 18, %unknown 18, %baker 18, %patrician 18, %merchants 18, %women 18, %widow 18, %goldsmith 18, %textiles 18]
Grand Tour (Asimov 1984)
walk along path of ALL POSSIBLE d-dimensional projections
additional indices for optimization: Projection Pursuit
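The biplot idea (scatterplot of the first two principal components, with the original variables added as lines) can be sketched with NumPy. This is one common scaling out of several and is not necessarily the one used for the figure in the talk. The function name and the random data are made up for illustration.

```python
import numpy as np


def biplot_coordinates(X):
    """Return (scores, loadings) for a 2-d principal component biplot.

    scores:   one row per observation, coordinates on the 1st & 2nd pc
    loadings: one row per original variable, end point of its line
    """
    Xc = X - X.mean(axis=0)  # center each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :2] * s[:2]
    loadings = Vt[:2].T
    return scores, loadings


# Made-up data: 50 observations of 4 variables, two of them correlated.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X[:, 1] += 2 * X[:, 0]

scores, loadings = biplot_coordinates(X)
print(scores.shape, loadings.shape)
```

To draw the plot, put a point at each row of `scores` and a line from the origin to each row of `loadings`, usually rescaled so that points and lines fit in the same frame.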
Parallel Coordinates
allow visualization of high-dimensional data (Ed Wegman, Al Inselberg)
Non-Euclidean Geometry (point-line duality): points map to lines and lines map to points
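The duality can be checked numerically for two variables. In this sketch the two parallel axes are assumed to stand at horizontal positions 0 and d. Each data point (x1, x2) becomes the segment from (0, x1) to (d, x2), and all points on a line x2 = m*x1 + b with m != 1 give segments through one common point. The function names are made up for illustration.

```python
def segment_height(x1, x2, u, d=1.0):
    """Height at horizontal position u of the segment drawn for (x1, x2)."""
    return x1 + (x2 - x1) * u / d


def dual_point(m, b, d=1.0):
    """Common intersection of the segments for all points on x2 = m*x1 + b."""
    if m == 1:
        raise ValueError("slope 1 gives parallel segments, no finite intersection")
    return d / (1 - m), b / (1 - m)


m, b = -2.0, 3.0
u, h = dual_point(m, b)

# Several points on the line: their segments all pass through (u, h).
heights = [segment_height(x, m * x + b, u) for x in (-1.0, 0.0, 0.5, 2.0)]
print(u, h, heights)
```

For a negative slope the common point lies between the two axes, which is why negatively correlated variables show up as a crossing pattern in a parallel coordinate plot.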
Mosaic Plots
Visualization of high-dimensional contingency tables (Hartigan & Kleiner)
further development (Friendly) and variations (Hofmann)
area based plots:
one rectangle for each cell in the table,
area is proportional to cell size
[Figure: mosaic plot; axis labels Sex (F, M) and Age (Child, Adult)]
Variation: Double Decker Plot
[Figure: double decker plot; axis labels Class (First, Second, Third, Crew), Sex (Female, Male), Age (Child, Adult)]
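The area rule for mosaic plots can be sketched for a two-way table. The unit square is first split into columns by the share of the first variable, then each column is split by the conditional share of the second variable. The function name and the counts are made up for illustration, and every row total is assumed to be non-zero.

```python
def mosaic_rectangles(table):
    """Map a two-way table {row: {col: count}} to rectangles in the unit square.

    Returns {(row, col): (x, y, width, height)}.
    """
    total = sum(sum(cols.values()) for cols in table.values())
    rects = {}
    x = 0.0
    for row, cols in table.items():
        row_total = sum(cols.values())
        width = row_total / total  # first split: marginal share
        y = 0.0
        for col, count in cols.items():
            height = count / row_total  # second split: conditional share
            rects[(row, col)] = (x, y, width, height)
            y += height
        x += width
    return rects


# Made-up counts, for illustration only.
counts = {"F": {"Child": 10, "Adult": 30}, "M": {"Child": 20, "Adult": 40}}
rects = mosaic_rectangles(counts)
for cell, (x, y, w, h) in rects.items():
    print(cell, "area =", round(w * h, 3))
```

Each rectangle's area equals its cell's share of the table total, so differences in height between neighbouring columns point to an association between the two variables.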
Tree Maps
Ben Shneiderman
Splits on same level can be according to
different variables
not all cells are on same level
aspect ratio optimized (close to 1)
squares are easier to compare than
skew rectangles
green-shading indicates development of
stock
Where do new ideas come from?
stimulated by applications, new data types
here: network data
Network Graphs
Graham Wills
problems
overview vs close-ups
layout
Application: Gene Expression Data
Experimental Setup
2 genotypes: Wildtype, growth impaired mutant
2 treatments: cure (not) added to soil
2 replicates each
[Figure: 2 x 2 design; cure added? (no, yes) by genotype (WT, mutant)]
Goal: identify genes with
changes in gene expression
due to treatment or genotype or both
Classical ANOVA Problem!
ANOVA Model
Model Setup (for each gene):
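$Y_{ijk} = \mu + \lambda_i^C + \lambda_j^T + \lambda_{ij}^{CT} + \varepsilon_{ijk}$
$Y_{ijk}$: gene expression level
$\mu$: base expression level (average)
$\lambda_i^C$: effect of cure
$\lambda_j^T$: effect of genotype
$\lambda_{ij}^{CT}$: interaction effect of cure & genotype
[Figure: 2 x 2 design; cure added (no, yes) by genotype (WT, mutant), annotated with the effects $\lambda_i^C$, $\lambda_j^T$, $\lambda_{ij}^{CT}$]
Compute F statistics, get P-values
-> P-values of < 5% show significant effects
... or NOT??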
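For a single gene, the F statistics of this model can be computed directly from the sums of squares. This is a minimal sketch for a balanced design. The function name and the expression values are invented for illustration and are not data from the talk.

```python
from statistics import mean


def two_way_anova_f(y):
    """F statistics for a balanced two-factor design with interaction.

    y maps (cure_level, genotype_level) -> list of replicate measurements.
    Returns (f, ss, df): dictionaries keyed by effect name.
    """
    cures = sorted({c for c, _ in y})
    types = sorted({t for _, t in y})
    n = len(next(iter(y.values())))  # replicates per cell
    grand = mean(v for cell in y.values() for v in cell)
    cell_mean = {k: mean(v) for k, v in y.items()}
    cure_mean = {c: mean(cell_mean[c, t] for t in types) for c in cures}
    type_mean = {t: mean(cell_mean[c, t] for c in cures) for t in types}

    ss = {
        "cure": n * len(types) * sum((cure_mean[c] - grand) ** 2 for c in cures),
        "genotype": n * len(cures) * sum((type_mean[t] - grand) ** 2 for t in types),
        "interaction": n * sum(
            (cell_mean[c, t] - cure_mean[c] - type_mean[t] + grand) ** 2
            for c in cures for t in types
        ),
        "error": sum((v - cell_mean[k]) ** 2 for k, cell in y.items() for v in cell),
    }
    df = {
        "cure": len(cures) - 1,
        "genotype": len(types) - 1,
        "interaction": (len(cures) - 1) * (len(types) - 1),
        "error": len(cures) * len(types) * (n - 1),
    }
    mse = ss["error"] / df["error"]
    f = {k: (ss[k] / df[k]) / mse for k in ("cure", "genotype", "interaction")}
    return f, ss, df


# Hypothetical expression values for one gene: 2 x 2 cells, 2 replicates each.
gene = {
    ("no", "WT"): [8.0, 8.4],
    ("no", "mutant"): [6.1, 5.9],
    ("yes", "WT"): [8.2, 8.0],
    ("yes", "mutant"): [7.9, 8.3],
}
f, ss, df = two_way_anova_f(gene)
for effect, value in f.items():
    print(f"{effect:12s} F = {value:.2f} on ({df[effect]}, {df['error']}) df")
```

With 2 replicates per cell, each F statistic is compared against an F distribution with 1 and 4 degrees of freedom to obtain its P-value. In the setup above this computation is repeated for every gene.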
Summary
Statistical Graphics have a beautiful & interesting past
Successful Applications, in some cases saved human lives
Development goes through Cycles
graphics seem to be once more in a highly productive phase
Stimulation from application areas
massive data sets, new areas with problems on new scale
data mining / knowledge discovery
Sources
Howard Wainer: "Visual Revelations - Graphical Tales of Fate and Deception from Napoleon Bonaparte to Ross Perot"
Edward Tufte: "The Visual Display of Quantitative Information"
Michael Friendly's Data Visualization Gallery and Milestones Project: http://www.math.yorku.ca/SCS/Gallery/milestone/