Interactive and Dynamic Graphics for Data Analysis using XGobi

Dianne Cook, Deborah F. Swayne, Andreas Buja
Copyright 1999 D. Cook, D. F. Swayne, A. Buja
DRAFT
Objectives
By the end of this course, I would hope that you have:
gained some understanding of the power of visual tools used in the process
of data analysis.
learned the basics of XGobi sufficiently to use it with your own data problems, and where to find more information as needed.
become inspired to develop new visual methods to address your data analyses.
Contents
1 Introduction
  1.1 Data Visualization
  1.2 Statistical Graphics
  1.3 Representing High-Dimensional Data
  1.4 Restructuring Data
  1.5 The Role of Visualization in Data Analysis
  1.6 Introduction to the Case Studies
  1.7 Getting Started with XGobi
2 Missing Values
  2.1 Background
  2.2 Case Study: Tropical Atmosphere Ocean (TAO) array
    2.2.1 Data description
    2.2.2 Exploring Missingness
    2.2.3 Exercises
3 Clustering and Classification
  3.1 Background
  3.2 Supervised Classification
    3.2.1 Case Study: Australian Leptograpsus Crabs
    3.2.2 Exercises
    3.2.3 Case Study: Italian Olive Oils
    3.2.4 Exercises
  3.3 Unsupervised Classification
    3.3.1 Case Study: Italian Olive Oils
4 Time
  4.1 Case Study: River
  4.2 Case Study: UK Pig Production
  4.3 Case Study: TAO Buoys
    4.3.1 Data description
    4.3.2 Querying Drift
    4.3.3 Cycling over time
    4.3.4 Correlation Tour
  4.4 Case Study: Telephone Usage
5 Space
  5.1 Background
  5.2 Case Study: Comprehensive Ocean-Atmosphere Data
    5.2.1 Data Description
    5.2.2 Comparing Traditional with Interactive
    5.2.3 Exploring Other Variables
    5.2.4 Exercises
  5.3 Case Study: Baker Field Data
    5.3.1 Data description
    5.3.2 Scatterplots
    5.3.3 Exercises
    5.3.4 Relating Soil Variables to Each Other
    5.3.5 A Grand Tour of the Soil Variables
    5.3.6 Linking Yield to Spatial Location
    5.3.7 Exploring Spatial Dependence
    5.3.8 Exercises
6 Multivariate Longitudinal Data
  6.1 Introduction
  6.2 Data Structuring for Interactive Analysis
  6.3 Case Study I: Liver toxicity
    6.3.1 About the data
    6.3.2 Data Restructuring for XGobi
    6.3.3 Exploring Liver Toxicity Measures
    6.3.4 Drilling down on interesting cases
    6.3.5 Inspecting drug relationships
    6.3.6 Time-constrained touring
    6.3.7 Generating a set of points uniformly distributed on a p-dimensional confidence ellipse
    6.3.8 Data files for XGobi
7 Compositional Data
  7.1 Case Study: Insect Populations
    7.1.1 Data Description
    7.1.2 Comparing the means
8 Categorical Variables and ANOVA
9 Multivariate Function Visualization
10 Inference for Data Visualization
  10.1 Really There?
  10.2 Significance Levels
  10.3 Case Study: Baker Field data
    10.3.1 Examining Yield and Boron
    10.3.2 Exercises
    10.3.3 Discussion about the Structure in Boron
11 Large Data
12 Appendix
  12.1 Gallery Index of Main XGobi Controls
  12.2 Input File Formats
  12.3 Limitations of Windows 95 version
  12.4 Man page
  12.5 Solutions to Selected Exercises
  12.6 S function for xgobi
  12.7 S function for linked dendrogram plots
  12.8 Generating a set of points uniformly distributed on a p-dimensional confidence ellipse
  12.9 S code used to restructure data into files for XGobi
  12.10 Answers to challenges
13 Web Resources
Preface
Background
Computer-aided visualization is a young discipline, having arisen and grown in
the last fifteen years. Its goal is to create computerized visual tools for analyzing and communicating information with pictures. The computer age is
drowning us in information, and it becomes clearer every day that all this information is worthless if it can't be digested by humans. If computers brought
on the challenge, they have become new partners to an old solution: presenting information to the human eye. The eye can absorb immense amounts of
information if provided with "informative" pictures. While this is old hat in
principle, computers since the 1980s have brought powerful facilities to bear for
picturing information.
This book is about visualizing a particular type of information, namely,
data. In this book, the term "data" is used for information that exists in
some schematic form such as a table or a list. Data is often but not always
quantitative, and some translation of unstructured information is often required
to derive the data. It always includes some attributes or variables such as the
number of hits on web sites, frequencies of words in text samples, weight in
pounds, mileage in miles per gallon, income per household in dollars, years of
education, acidity on the pH scale, sulfur emissions in tons per year, or scores
on standardized tests. Characteristic of data visualization is the concern with
abstract relationships among such variables: for example, the degree to which
income increases with education, or the question of whether certain astronomical
measurements indicate grouping and therefore hint at new classes of celestial
objects.
In contrast, other areas of visualization are mostly concerned with the display of objects and phenomena in physical 3-D space. Examples are volume
visualization (e.g., for the display of human organs in medicine), surface visualization (e.g., for manufacturing cars or animated movies), flow visualization
(e.g., for aeronautics or meteorology), and cartography. In these areas one often
strives for physical realism or the display of great detail in space, as in the visual display of a new car design, or of a developing hurricane in a meteorological
simulation.
Data visualization as we use the term requires the emphasis to be on variables
and their relationships in the abstract. Variables are typically thought
of as detached from physical location, even if physical location is part of the
data. The general task of data visualization is to make pictures that reflect
relationships among variables. To this end, one maps the variables to axes in a
plot and the variable values to locations on the axes. In effect, one uses space
to code non-spatial information. The goal of data visualization is then not realism, which is meaningless, but rendering of data in space for the purpose of
visual consumption. Paraphrasing Galileo Galilei, who said "Measure what is
measurable, and make measurable what is not so," we may say "render in space
what is spatial, make spatial what is not so."
This program of data visualization, spatial representation of data, immediately brings up a limitation: Plotting surfaces such as paper or computer screens
are merely 2-dimensional, and physical space is just 3-dimensional. The eye can
be tricked into seeing 3-dimensional virtual space with perspective and motion,
which for visualization is as good as, or even better than, physical space. Only
a few people have claimed to "see" 4-dimensional space (actually we are among
them). In data visualization each variable requires a spatial axis or dimension,
which poses the question of how we are to picture more than two or three variables at a time. The limitation to a 3-dimensional display space is fine if the
objects are 3-dimensional, as in most other visualization areas, but in data visualization the number of axes required to code variables can be large: five to
ten are common, and 50 or even hundreds arise in real applications.
This then is the challenge of data visualization: to overcome the 2-D and 3-D
barriers.
To meet this challenge, computers allow us to implement some powerful visualization tools. They mimic and amplify a paradigm familiar from photography:
take pictures from multiple directions so the shape of an object can be understood in its entirety. This is called the "multiple views" paradigm, where the
term "views" is just a synonym for "pictures". In our 3-D world the paradigm
works superbly: the human eye is very adept at inferring the true shape of an
object from a few directional views. Unfortunately, the same is often not true
for views of abstract data! The chasm between different views of data, however,
can be actively bridged with computer technology: Unlike the passive paper
medium, computers allow us to manipulate pictures, to pull and push their content in continuous motion with a similar effect as a moving video camera, or to
poke at objects in one picture and see them light up in other pictures. Motion
links pictures in time; poking links them across space. This book features many
illustrations of the power of these linking technologies. The diligent reader may
come away "seeing" high-dimensional data spaces!
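To make the two linking ideas concrete, the following is a minimal Python sketch, not XGobi code. It assumes toy Gaussian data and hypothetical helper names (random_projection_basis, project, brush): each "view" is a 2-D projection of the same p-dimensional cases, and "poking" is modeled as selecting row indices in one view and looking up the same rows in another.

```python
import math
import random


def random_projection_basis(p, rng):
    """Return two orthonormal p-vectors spanning a random 2-D viewing plane."""
    u = [rng.gauss(0.0, 1.0) for _ in range(p)]
    v = [rng.gauss(0.0, 1.0) for _ in range(p)]
    # Gram-Schmidt: normalise u, remove u's component from v, normalise v.
    norm_u = math.sqrt(sum(x * x for x in u))
    u = [x / norm_u for x in u]
    dot_uv = sum(a * b for a, b in zip(u, v))
    v = [b - dot_uv * a for a, b in zip(u, v)]
    norm_v = math.sqrt(sum(x * x for x in v))
    v = [x / norm_v for x in v]
    return u, v


def project(rows, basis):
    """Map each p-dimensional row to (x, y) coordinates in the viewing plane."""
    u, v = basis
    return [
        (sum(a * b for a, b in zip(row, u)), sum(a * b for a, b in zip(row, v)))
        for row in rows
    ]


def brush(view, xmin, xmax, ymin, ymax):
    """Return the set of row indices whose points fall inside a rectangle."""
    return {
        i for i, (x, y) in enumerate(view)
        if xmin <= x <= xmax and ymin <= y <= ymax
    }


rng = random.Random(0)
# Toy data (an assumption for illustration): 200 cases, 5 variables.
data = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(200)]

view_a = project(data, random_projection_basis(5, rng))
view_b = project(data, random_projection_basis(5, rng))

# "Poking" in view A selects cases; the same row indices identify them in view B.
selected = brush(view_a, 0.0, 1.0, 0.0, 1.0)
highlighted_in_b = [view_b[i] for i in sorted(selected)]
```

The design point is that the views share nothing but the case index, which is enough to carry a selection from one picture to another.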
Talking about a technology is vacuous if it is not backed up by an implementation. For the methods shown in this book, the implementation is the
freely available XGobi software designed and written by us, the authors. While
other excellent software systems for data visualization exist, XGobi is particularly strong for the type of high-dimensional data explorations that we describe
in this book. XGobi implements what amounts to video cameras for directly
viewing high-dimensional data with dynamic projections, and it has tools for
poking at and slicing data across multiple views. It permits creative rendering
of data and creative manipulation of views of data. XGobi will be introduced
in sufficient detail that this book can serve as a primer to the system.
The discipline of data visualization has multiple homes, one in the field of
statistics, and others in computer science and engineering. Statistics has had a
strand of data visualization research since the mid-1970s under the rubric
"statistical graphics," while work outside statistics came into focus a bit later,
following the visualization initiative by the National Science Foundation in the
late 1980s.
The seminal research in statistics was PRIM-9, the work of Fisherkeller,
Friedman and Tukey in 1974. PRIM-9, designed and implemented at the Stanford Linear Accelerator Center, was the first interactive data visualization system. It was followed by further pioneering systems at the Swiss Federal Institute
of Technology (PRIM-ETH), at Harvard University (PRIM-H), and at Stanford University (ORION), in the late 1970s and early 1980s. Research picked up in the
following few years in many places, including AT&T Bell Labs, Bellcore, the
University of Washington, the University of Minnesota, MIT, CMU, Battelle
(Richland, WA), George Mason University, Rice University, and several more.
Our work on XGobi grew out of the third author's work at Stanford and the
University of Washington, followed by the joint work of the three of us at Bellcore starting in the early 1990s. We see XGobi as a partial sum of statisticians'
research in data visualization.
Statisticians aren't the only researchers who are interested in the visualization of abstract high-dimensional data, but statistical data visualization has
some unique features. Statisticians are always concerned with variability in observations and error in measurements, both of which cause uncertainty about
conclusions drawn from data. Dealing with this uncertainty is at the heart of
classical statistics, and statisticians have developed a huge body of inference
methods that allow us to quantify uncertainty. Inference used to be statisticians' sole preoccupation, but this changed under John W. Tukey's towering
influence. He championed "exploratory data analysis" (EDA), which focuses on
discovery and allows for the unexpected, unlike inference, which progresses from
pre-conceived hypotheses. EDA has always depended heavily on graphics, even
before the term "data visualization" was coined. Our favorite quote from John
Tukey's rich legacy is that to "force the unexpected upon us," we need good
pictures. In the past, EDA and inference were sometimes seen as incompatible, but they are not mutually exclusive. In this book, we will present visual
methods for assessing uncertainty and performing inference, that is, deciding
whether what we see is "really there."
Data visualization is an attractive area because of its broad applicability.
Wherever data is collected, creative analysis is possible with all the fun that
creativity engenders. Data are gathered, stored and analyzed in huge amounts
by all areas of science, by governments, by industries such as finance, retail,
health, telecommunications, and service industries in general. We hope the
large number of data examples in this book will give the reader an impression
of the power and wide applicability of data visualization.
About this book
This book may be used as a textbook for courses in data visualization and exploratory data analysis and it can also serve as a reference for current XGobi
users. It may be used as a companion text for courses in applied multivariate analysis, but it may even be helpful for courses in theoretical multivariate
analysis because it builds up intuitions for high-dimensional data spaces. We
recommend the book furthermore for innovative teaching of data mining as the
XGobi software can scale up to over 100,000 cases or records. Each chapter of
the book focuses on specific data applications. Data sets for the examples are
available from the web page for the book. The book will produce the greatest
benefit if it is not just passively read, but actively followed by working with the
software and either our datasets or the reader's own.
About the software
XGobi is publicly available visualization software for viewing and interacting
with high-dimensional data. Its primary views are scatterplots augmented with
line drawings; other view types are scatterplot matrices and parallel coordinate
plots. Multiple views are automatically linked; they are interactive, in the sense
that they can be directly manipulated and queried. Also featured in XGobi are
dynamic methods, such as high-dimensional rotations, including the grand tour
and correlation tour, augmented with manual controls and automatic guidance
through projection pursuit. XGobi can deal with missing values through elimination or imputation. It can also be used as a rudimentary high-dimensional
drawing program. Finally, a sister program, called XGvis, is provided for nonlinear dimension reduction, multidimensional scaling, and high-dimensional graph
layout. XGobi and XGvis can be run on almost every platform, and they support interprocess communication with other packages.
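As a rough illustration of what a "tour" does, the sketch below generates a smooth sequence of 2-D projection frames that passes through randomly chosen target planes. It is a simplified stand-in and not XGobi's algorithm: the function names are hypothetical, and frames are blended linearly and then re-orthonormalized, whereas a proper grand tour would interpolate along geodesics between planes.

```python
import math
import random


def orthonormalize(u, v):
    """Gram-Schmidt: return an orthonormal pair spanning the same plane as (u, v)."""
    norm_u = math.sqrt(sum(x * x for x in u))
    u = [x / norm_u for x in u]
    dot_uv = sum(a * b for a, b in zip(u, v))
    v = [b - dot_uv * a for a, b in zip(u, v)]
    norm_v = math.sqrt(sum(x * x for x in v))
    return u, [x / norm_v for x in v]


def random_frame(p, rng):
    """Draw a random 2-D projection frame in p dimensions."""
    return orthonormalize(
        [rng.gauss(0.0, 1.0) for _ in range(p)],
        [rng.gauss(0.0, 1.0) for _ in range(p)],
    )


def tour_frames(p, n_targets, steps, rng):
    """Yield a smooth sequence of frames passing through random target planes."""
    current = random_frame(p, rng)
    for _ in range(n_targets):
        target = random_frame(p, rng)
        for s in range(1, steps + 1):
            t = s / steps
            # Blend the two frames, then restore orthonormality.
            u = [(1 - t) * a + t * b for a, b in zip(current[0], target[0])]
            v = [(1 - t) * a + t * b for a, b in zip(current[1], target[1])]
            yield orthonormalize(u, v)
        current = target


# 3 target planes, 10 interpolation steps each, for 4-variable data.
frames = list(tour_frames(p=4, n_targets=3, steps=10, rng=random.Random(1)))
```

Projecting the data onto each frame in turn and redrawing the scatterplot would give the impression of continuous rotation through the high-dimensional space.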
The authors can be contacted by email at:
dicook@iastate.edu
dfs@research.att.com
andreas@research.att.com
and the XGobi software can be downloaded from the XGobi web site:
http://www.research.att.com/areas/stat/xgobi/
Chapter 1
Introduction
Computer-aided visualization is a child of the age of technology. While the
technological age is also drowning us with information, computers bring the
provisions for humans to digest it. Computers have brought powerful facilities
to life for drawing and interacting with pictures to describe information. This
book is about drawing and interacting with pictures about a particular type
of information: data. This book describes the application of data visualization
tools that utilize high levels of interaction and motion to a broad array of case
studies from different disciplines. In this chapter we describe the history of data
visualization and the role of data visualization in the process of data analysis.
We also explain methods to be used in later chapters, and the software used for
demonstrating the methods.
1.1 Data Visualization
In this book, the term "data" is used for information that exists in some
schematic form such as a table or a list. Data is often but not always quantitative, and some translation of unstructured information is often required to
derive the data. It always includes some attributes or variables such as the
number of hits on web sites, frequencies of words in text samples, weight in
pounds, mileage in miles per gallon, income per household in dollars, years of
education, acidity on the pH scale, sulfur emissions in tons per year, or scores
on standardized tests. Characteristic of data visualization is the concern with
abstract relationships among such variables: for example, the degree to which
income increases with education, or the question of whether certain astronomical measurements indicate grouping and therefore hint at new classes of celestial
objects.
In contrast, other areas of visualization are mostly concerned with the display of objects and phenomena in physical 3-D space. Examples are volume
visualization (e.g., for the display of human organs in medicine), surface visualization (e.g., for manufacturing cars or animated movies), flow visualization