This talk focuses on the use of interactive and dynamic graphics in the process of data analysis. We use the XGobi software for demonstrating techniques.
This material was prepared jointly by Deborah Swayne, Andreas Buja and myself.
It is based in part on draft notes for a book of similar title.
1
Have you seen this painting? It was painted by Hans Holbein in the early
1500s, and is currently on display in the National Gallery in London. It is well worth a visit if you are in London. There is a curious "elliptical blur" in the front
of the painting. Holbein often hid things in his paintings. Here it looks like he
was toying with cartographic projections. If you cut out a rectangle around the
ellipse, and warp the rectangle by simple scaling, then you see something more
familiar: a skull. This is analogous to the power of interactive visual tools, which
re-shape a view to focus attention on important features of the data.
2
Moving along to today's subject matter. At the end of this course, I would
hope that you have gained some understanding of the power of visual tools
used in the process of data analysis. The more ambitious of you might even
go away and download the software we use, and perhaps invent new graphical
methods for your analyses.
3
Computer-aided visualization is a young discipline, having arisen and grown
in the last fifteen years. Its goal is to create computerized visual tools for
analyzing and communicating information with pictures.
This talk is about visualizing a particular type of information, namely, data.
In this talk, the term "data" is used for information that exists in some schematic
form such as a table or a list. Data is often but not always quantitative, and
some translation of unstructured information is often required to derive the
data. It always includes some attributes or variables such as the number of hits
on web sites, frequencies of words in text samples, weight in pounds, mileage
in miles per gallon, income per household in dollars, years of education, acidity
on the pH scale, sulfur emissions in tons per year, or scores on standardized
tests. Characteristic of data visualization is the concern with abstract relationships among such variables: for example, the degree to which income increases
with education, or the question of whether certain astronomical measurements
indicate grouping and therefore hint at new classes of celestial objects.
In contrast, other areas of visualization are mostly concerned with the display of objects and phenomena in physical 3-D space. Examples are volume
visualization (e.g., for the display of human organs in medicine), surface visualization (e.g., for manufacturing cars or animated movies), flow visualization
(e.g., for aeronautics or meteorology), and cartography. In these areas one often
strives for physical realism or the display of great detail in space, as in the visual display of a new car design, or of a developing hurricane in a meteorological
simulation.
Data visualization as we use the term requires the emphasis to be on variables
and their relationships in the abstract. Variables are typically thought of as
detached from physical location, even if physical location is part of the data. The
general task of data visualization is to make pictures that reflect relationships
among variables. To this end, one maps the variables to axes in a plot and
the variable values to locations on the axes. In effect, one uses space to code
non-spatial information. The goal of data visualization is then not realism,
which is meaningless, but rendering of data in space for the purpose of visual
consumption.
4
This program of data visualization, spatial representation of data, immediately brings up a limitation: Plotting surfaces such as paper or computer screens
are merely 2-dimensional, and physical space is just 3-dimensional. The eye can
be tricked into seeing 3-dimensional virtual space with perspective and motion,
which for visualization is as good as, or even better than, physical space. Only
a few people have claimed to "see" 4-dimensional space (actually we are among
them). In data visualization each variable requires a spatial axis or dimension,
which poses the question of how we are to picture more than two or three variables at a time. The limitation to a 3-dimensional display space is fine if the
objects are 3-dimensional, as in most other visualization areas, but in data visualization the number of axes required to code variables can be large: five to
ten are common, and 50 and even hundreds arise in very real contexts also.
This then is the challenge of data visualization: to overcome the 2-D and 3-D
barriers.
To meet this challenge, computers allow us to implement some powerful
visualization tools. They mimic and amplify a paradigm familiar from photography: take pictures from multiple directions so the shape of an object can
be understood in its entirety. This is called the "multiple views" paradigm,
where the term "views" is just a synonym for "pictures". In our 3-D world the
paradigm works superbly: the human eye is very adept at inferring the true
shape of an object from a few directional views. Unfortunately, the same is
often not true for views of abstract data! The chasm between different views of
data, however, can be actively bridged with computer technology: Unlike the
passive paper medium, computers allow us to manipulate pictures, to pull and
push their content in continuous motion with an effect similar to that of a moving video
camera, or to poke at objects in one picture and see them light up in other
pictures. Motion links pictures in time; poking links them across space. This
talk features many illustrations of the power of these linking technologies. The
diligent listener may come away "seeing" high-dimensional data spaces!
5
The discipline of data visualization has multiple homes, one in the field of
statistics, and others in computer science and engineering. Statistics has had a
strand of data visualization research since the mid-1970s under the rubric
"statistical graphics," while work outside statistics came into focus a bit later,
following the visualization initiative by the National Science Foundation in the
late 1980s.
The seminal research in statistics was PRIM-9, the work of Fisherkeller,
Friedman and Tukey in 1974. PRIM-9, designed and implemented at the Stanford Linear Accelerator Center, was the first interactive data visualization system. It was followed by further pioneering systems at the Swiss Federal Institute
of Technology (PRIM-ETH), at Harvard (PRIM-H) and at Stanford (ORION), in the late 1970s and early 1980s. Research picked up in the
following few years in many places, including AT&T Bell Labs, Bellcore, the
University of Washington, the University of Minnesota, MIT, CMU, Battelle
(Richland, WA), George Mason University, Rice University, and several more.
Our work on XGobi grew out of the third author's work at Stanford and the
University of Washington, followed by the joint work of the three of us at Bellcore starting in the early 1990s. We see XGobi as a partial sum of statisticians'
research in data visualization.
Statisticians aren't the only researchers who are interested in the visualization of abstract high-dimensional data, but statistical data visualization has
some unique features. Statisticians are always concerned with variability in observations and error in measurements, both of which cause uncertainty about
conclusions drawn from data. Dealing with this uncertainty is at the heart of
classical statistics, and statisticians have developed a huge body of inference
methods that allow us to quantify uncertainty. Inference used to be statisticians' sole preoccupation, but this changed under John W. Tukey's towering
influence. He championed "exploratory data analysis" (EDA), which focuses
on discovery and allows for the unexpected, unlike inference, which progresses
from pre-conceived hypotheses. EDA has always depended heavily on graphics,
even before the term "data visualization" was coined. Our favorite quote from
John Tukey's rich legacy is that to "force the unexpected upon us," we need
good pictures. In the past, EDA and inference were sometimes seen as incompatible, but they are not mutually exclusive. In this talk, we will present visual
methods for assessing uncertainty and performing inference, that is, deciding
whether what we see is "really there."
6
This talk focuses on interactive and dynamic graphics. We begin by describing the terms more concretely. There are many definitions of the word
"interaction". To many it simply means typing at a command line to change the
characteristics of a plot or to revise the plot, as with graphics in S-Plus. In
this talk, interaction means, more precisely, direct manipulation of graphics in a plot,
which includes activities such as:
- linked brushing of points, regions or lines in multiple views: highlighting
  them in one plot using a brush, and watching where the points light up in
  the other plots.
- querying the id of a point or group of points.
- dragging a scrollbar to change the value of a parameter.
- clicking a button to change the variables viewed in a plot.
Interaction facilitates tasks in exploration and discovery.
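The idea behind linked brushing can be sketched without any graphics at all: every view shows the same cases, and a single shared set of highlighted case indices is what links them. The class and variable names below (LinkedViews, brush, view, and the toy variables a, b, c) are hypothetical and chosen for illustration; this is a minimal sketch of the concept, not a description of how XGobi is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class LinkedViews:
    """Several scatterplot 'views' of the same cases sharing one brush state."""
    cases: list                                  # one dict of variable -> value per case
    brushed: set = field(default_factory=set)    # indices of highlighted cases

    def brush(self, xvar, yvar, xmin, xmax, ymin, ymax):
        """Highlight the cases that fall inside a rectangular brush in one view."""
        self.brushed = {
            i for i, c in enumerate(self.cases)
            if xmin <= c[xvar] <= xmax and ymin <= c[yvar] <= ymax
        }
        return self.brushed

    def view(self, xvar, yvar):
        """Return (x, y, highlighted) for every case as seen in another view."""
        return [(c[xvar], c[yvar], i in self.brushed)
                for i, c in enumerate(self.cases)]


# Hypothetical toy data, not taken from the talk.
cases = [
    {"a": 1.0, "b": 2.0, "c": 9.0},
    {"a": 2.0, "b": 1.0, "c": 7.0},
    {"a": 8.0, "b": 9.0, "c": 1.0},
    {"a": 9.0, "b": 8.0, "c": 2.0},
]
views = LinkedViews(cases)
views.brush("a", "b", 0.0, 3.0, 0.0, 3.0)   # brush the low corner of the (a, b) plot
linked = views.view("a", "c")               # the same cases light up in the (a, c) plot
```

Because the brush state is stored per case rather than per plot, any number of views can be redrawn from it, which is what makes the brush usable as a conditioning tool.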
7
Dynamic graphics are motion graphics, for example:
- cycling between plots.
- 3D rotating plots.
- tour methods: grand/random, guided, manual.
In statistical terms, there are specific technical interpretations of linked
brushing and motion graphics. Linked brushing can be considered to be exploring conditional distributions of variables, where the brush is a conditioning
tool. Motion graphics can be considered to be exploring joint distributions,
because the motion facilitates perception of the "shape" of the data from the
sequence of marginal views. If we know the distribution of all low-dimensional
projections of the data then we also know the joint multivariate distribution,
following a result of Cramér and Wold (Mardia, Kent & Bibby 1979).
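For reference, the Cramér-Wold result can be stated for one-dimensional projections as follows. The notation is ours, not the talk's: X and Y are random vectors in p dimensions, a is a projection direction, and the symbol over the equals sign means "equal in distribution".

```latex
\[
  a^{\top} X \overset{d}{=} a^{\top} Y
  \quad \text{for all } a \in \mathbb{R}^{p}
  \qquad \Longleftrightarrow \qquad
  X \overset{d}{=} Y .
\]
```

In words: two multivariate distributions agree exactly when all of their one-dimensional projections agree, which is why a sequence of low-dimensional views can, in principle, convey the full joint distribution.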
8
The eye can absorb immense amounts of information if provided with "informative" pictures.
In our experience there are situations where graphics can provide advantages
in the process of data analysis. Graphics are good for finding small departures
from the trend, local anomalies, and sparse structure in high-dimensional spaces.
They can also be used in conjunction with numerical results to refine solutions,
and to understand or interpret the results. In the next several slides we will
provide examples demonstrating these.
9
The first case study uses a very simple data set, collected by one waiter over
a period of a few months in a restaurant in the mid-west USA. Several variables
were collected: total tip, total bill, sex of the bill payer, smoking party or not,
day of the week, time of day and size of the party. The data was reported in
Bryant & Smith (1995), a collection of case studies for business statistics. The
primary question related to the data is: "What are the factors that affect tipping
behavior?" The solution in the manual is to re-structure tip and total bill into a
new variable, tip rate, and model it against the remaining variables. The result is a
model with tip rate as the response and size of the party as the one explanatory
variable. It is a very empty result! Even so simple a data set has enormous
richness in structure, which we will show with graphics.
10
The figure displays a very common graphic: a histogram. The small multiples are generated from the data by using an array of bin widths to calculate the
histogram. The bin widths range from $1 at top to 10c at bottom. The multiple-panel
layout illustrates how different information is gained from examining the
data at different resolutions. The largest bin width ($1) shows tip to
have a unimodal and skewed distribution, which for the data means that most
tips are of the smaller amounts, with fewer and fewer larger tips. As the bin width
is reduced the shape of the distribution becomes multimodal. At the smallest
bin width (10c) it is clear that there are modes at the full dollar and fifty cent
amounts. This means that the customers tend to round the tip amount to the
nearest fifty cents or dollar. It is important to emphasize that the salient features are not found with one ideal bin width; one must use multiple bin
widths to extract different features of data.
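The effect of bin width is easy to reproduce in a few lines. The sketch below bins a small set of tips at $1, 50c and 10c and prints text histograms. The tip values are invented for illustration and are not the waiter's data; the function name histogram and the choice to work in integer cents (to avoid floating-point trouble at bin edges) are our own.

```python
from collections import Counter


def histogram(tips_dollars, bin_width_cents):
    """Return {left bin edge in cents: count} for bins [edge, edge + width)."""
    counts = Counter()
    for tip in tips_dollars:
        cents = round(tip * 100)          # integer cents avoid float edge effects
        edge = (cents // bin_width_cents) * bin_width_cents
        counts[edge] += 1
    return dict(sorted(counts.items()))


# Hypothetical tips in dollars, made up for illustration.
tips = [1.00, 1.50, 2.00, 2.00, 2.00, 2.50, 3.00, 3.00, 3.48, 4.00, 5.00, 1.01]

for width in (100, 50, 10):               # $1, 50c and 10c bins
    print(f"bin width {width}c")
    for edge, count in histogram(tips, width).items():
        print(f"  {edge / 100:5.2f} {'*' * count}")
```

At the $1 width the toy tips look like one lump; at 10c the pile-ups at whole and half dollars become visible, which is the same qualitative change described for the real data.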
11
The figure describes drilling down further into the data. On the left, Total
Tip is plotted against Total Bill. The strong lower right triangular shape suggests that on average there are more cheap tippers than generous tippers. There
are several outlying exceptions, one in particular giving a $5 tip on an $8 bill.
On the right, the scatterplot of tip vs total bill is conditioned by Sex of the Bill
Payer and Smoker (whether it was a smoking party of diners or not). Inspecting
these plots reveals numerous features: (1) for smoking parties, there is almost
no relationship between tip and total bill, (2) when a female non-smoker paid
the bill, the tip was a very consistent percentage of the total bill, with the exceptions of three dining parties, (3) larger total bills mostly had a male paying
them.
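Conditioning a plot on two categorical variables amounts to splitting the cases into cells and looking at each cell separately. A numerical analogue is sketched below, assuming that "tip rate" means tip divided by total bill. The records are invented for illustration and are not the Bryant & Smith data, and the function name tip_rate_by_group is hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records (tip, total bill, sex of payer, smoking party),
# invented for illustration only.
records = [
    (2.00, 12.50, "F", "no"),
    (3.00, 20.00, "F", "no"),
    (1.50, 10.00, "M", "no"),
    (5.00, 40.00, "M", "no"),
    (1.00, 15.00, "M", "yes"),
    (4.00, 16.00, "F", "yes"),
]


def tip_rate_by_group(rows):
    """Mean tip rate (tip / total bill) for each (sex, smoker) cell."""
    cells = defaultdict(list)
    for tip, bill, sex, smoker in rows:
        cells[(sex, smoker)].append(tip / bill)
    return {cell: mean(rates) for cell, rates in cells.items()}


rates = tip_rate_by_group(records)
for (sex, smoker), rate in sorted(rates.items()):
    print(f"sex={sex} smoker={smoker}: mean tip rate {rate:.1%}")
```

A single mean per cell hides exactly the kind of detail the conditioned scatterplots reveal (outlying parties, absence of a relationship), so a summary like this complements the plots rather than replacing them.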
Further drill down into other variables, such as time of day, day of week
and size of party reveals enough information to almost locate the restaurant.
This is the power of graphics. Even such a simple data set can reveal the
most interesting complexities, a much richer body of complexities than can be
extracted with numerical methods alone.
As an aside, with an interactive system the user would be able to drag a
slider to actively change the bin width in a histogram, or click on a variable
label to condition the plot by the categorical variable.
12
The next few examples will involve demonstrations using the XGobi software
(Swayne, Cook & Buja 1998). XGobi is a visualization system for viewing high-dimensional data. It is freely available. The primary visual constructions are
points and lines, which form the basic viewing type, scatterplots, which can
be enhanced with line drawings. A high level of direct manipulation of plots is
supported, with linking of points and/or lines between multiple views. Motion
graphics such as rotation, touring and projection pursuit are available. It is
especially designed for the exploration of multivariate data and high-dimensional
spaces.
XGobi was written using the X Window System, so it is primarily a Unix-based software package, but it can be run under Windows NT or Windows 95/98 using a variety of X server/client packages. It supports interprocess
communication to share information with other software, so it can be run directly from ArcView, S/S-Plus, R and Omegahat.
The graphics and tools that are available in XGobi include:
- Cycling rapidly through two-variable scatter plots.
- Three-dimensional rotation.
- Grand tours and correlation tours: random, guided, and manual control.
- Brushing.
- Hiding (excluding) groups of points.
- Dynamic re-scaling.
- Interactive identification of cases.
- Linked views: Brushing, identification and touring are linked; that is,
  actions in the window of one XGobi process are immediately reflected in
  another XGobi window displaying the same data.
- Line editing.
- Moving points.
- Fitting curves using smoothers.
- Drawing subsamples.
- Jittering of categorical measurements.
- Parallel coordinate display.
- Scatterplot matrix display.
- Case and variable label lists.
- Missing values are accepted and can be dealt with by imputation of constant values, random values, or user-supplied imputed values. Missing
  value patterns can be examined in a separate linked XGobi window.
- On-the-fly variable transformations.
- Quality PostScript output.
- Online help with the click of the right mouse button.
13
The figure illustrates the strength of our graphical methods in detecting
sparse structure in high-dimensional space. This data is from a particle physics
experiment where there are 7 measured variables which describe the outcome
state of the experiment. We will demonstrate how the structure can be seen.
We begin half-way through the exploration process after visually extracting
clusters. The points are touring in 7D space, with 7 color groups representing
cluster structure discovered prior to this stage. An intuitive explanation of
"touring" is rotation in arbitrary dimensions. The algorithm is complicated to
explain, so in brief, what you see is a continuous sequence of 2D orthogonal projections
of the 7D space. Rotation in 3D is a special case of touring. Tour
methods are an example of linking of multiple plots using time. Motion, such
as this, is particularly good for examining joint distributions of variables. We
will see how.
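To make "a continuous sequence of 2D projections" concrete, here is a minimal sketch in plain Python. It picks two random orthonormal pairs of directions (frames) in 7D and moves from one to the other in small steps, projecting the data at each step. All function names are hypothetical, the data is random noise made up for illustration, and the blend-then-orthonormalize step is a naive stand-in: it is an assumption of this sketch and not necessarily the interpolation XGobi uses.

```python
import math
import random


def orthonormalize(u, v):
    """Gram-Schmidt: normalize u, remove u's component from v, normalize v."""
    nu = math.sqrt(sum(x * x for x in u))
    u = [x / nu for x in u]
    dot = sum(a * b for a, b in zip(u, v))
    v = [b - dot * a for a, b in zip(u, v)]
    nv = math.sqrt(sum(x * x for x in v))
    return u, [x / nv for x in v]


def random_frame(p, rng):
    """Two orthonormal p-vectors spanning a random 2D projection plane."""
    u = [rng.gauss(0, 1) for _ in range(p)]
    v = [rng.gauss(0, 1) for _ in range(p)]
    return orthonormalize(u, v)


def project(points, frame):
    """2D coordinates of each p-dimensional point in the given frame."""
    u, v = frame
    return [(sum(a * b for a, b in zip(pt, u)),
             sum(a * b for a, b in zip(pt, v))) for pt in points]


def tour_step(start, target, t):
    """Frame a fraction t of the way from start to target (naive blend)."""
    u = [(1 - t) * a + t * b for a, b in zip(start[0], target[0])]
    v = [(1 - t) * a + t * b for a, b in zip(start[1], target[1])]
    return orthonormalize(u, v)


rng = random.Random(0)
points = [[rng.gauss(0, 1) for _ in range(7)] for _ in range(5)]  # made-up 7D data
a, b = random_frame(7, rng), random_frame(7, rng)
frames = [tour_step(a, b, k / 10) for k in range(11)]   # 11 frames from a to b
views = [project(points, f) for f in frames]            # one 2D scatterplot per frame
```

Drawing the entries of views one after another gives the impression of the point cloud rotating smoothly; choosing a new random target each time the current one is reached gives a random tour.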
With all the points included, the point cloud looks very complex and messy.
To simplify the process we reduce the number of points, by removing all color
groups except one. Watch this shape for a few minutes. You will see that
the points tend to concentrate in a triangle shape, which is roughly in a 2D
subspace of the 7D space. With the line drawing facility in XGobi we have drawn
lines between points in the vertices of the triangle, and this shape matches the
distribution of points quite beautifully.
Now that this structure is identified, we add one more color group back in,
and continue to watch the points touring. The second group of points concentrates in a linear shape, which is mostly disjoint from the first group, except they
appear connected at one vertex of the triangle.
We continue in this manner to sequentially add each color group back in.
The underlying shape of the data is exposed. The points lie on a structure
comprised of connected low-dimensional pieces: a 2D triangle, with 2 linear
pieces extending from each vertex. The data was already 20 years old by the
time this discovery was made with visual methods, so the
meaning of the structure, if there is one, is lost.
Discovering the structure required a combination of touring, guided touring,
using a method called projection pursuit, and brushing.
14
This example illustrates the use of graphics to refine the results of two numerical algorithms. The data comes from a study on Italian Olive Oils (Forina,
Armanino, Lanteri & Tiscornia 1983). Samples from different regions were analyzed for their fatty acid content. The eight variables measure the percentage
of each of the eight fatty acids found in the olive oil samples: Palmitic,
Palmitoleic, Stearic, Oleic, Linoleic, Linolenic, Arachidic, Eicosenoic. For quality control purposes, it is important to be able to classify an oil sample into
its region of manufacture based on the fatty acid composition.
15
The left plot in the figure displays the solution of a Classification and Regression Tree (which is also similar to the linear discriminant analysis solution): the
presence or absence of eicosenoic acid separates oils from region 1 from the other
two regions, and oils from region 3 have low levels of linoleic acid, whereas
oils from region 2 have high levels of linoleic acid. The right plot illustrates the
result of rotating small amounts of oleic and arachidic acid into the projection.
We demonstrate this interactively. With the small adjustment we gain a much
sweeter separation of the oils from the 3 regions. It is easy to understand why
CART and linear discriminant analysis don't effectively discriminate between
the two regions. CART is distracted by the correlation and linear discriminant
analysis is distracted by the heterogeneous group variances. Neural networks
can do a perfect classification job here, but neural network solutions are difficult to interpret. With graphics we can understand how the network does its
classification, which variables dominate the solution, and, if there are
misclassifications, where the regions of uncertainty are. It is important to note
that neural networks do not perform so well in further drill down to classify the
areas within regions. With graphics, we can understand the cluster structure
fairly completely, and how the numerical methods work.
16
Another application of graphics to a fairly similar problem is in cluster analysis. This figure illustrates an average linkage dendrogram for the olive oil data.
Colors represent the true regions. What is interesting is that the dendrogram
looks like it neatly separates out several clusters, but observing the mixed results
of region classification it is clear that it didn't produce clusters that correspond
to the regions. In a situation where the groups were unknown ahead of time,
linked brushing between the dendrogram and scatterplot views could be used
to explore the results. More generally, predictions from any clustering algorithm can be represented by color and glyph. The cluster structure can then be
explored by multivariate graphics.
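When the true groups are known, as with the olive oil regions, the mismatch between clusters and regions that the colored dendrogram shows can also be summarized in a cross-tabulation. The sketch below uses invented labels for eight cases, not the olive oil data; the function names and the "purity" summary are our own additions for illustration and do not come from the talk.

```python
from collections import Counter


def cross_tabulate(clusters, regions):
    """Count cases for every (cluster label, true region) pair."""
    return Counter(zip(clusters, regions))


def purity(table):
    """Fraction of cases that fall in the majority region of their cluster."""
    best = {}
    for (cluster, _region), n in table.items():
        best[cluster] = max(best.get(cluster, 0), n)
    return sum(best.values()) / sum(table.values())


# Hypothetical labels for eight cases, invented for illustration.
clusters = [1, 1, 1, 2, 2, 2, 3, 3]
regions = [1, 1, 2, 2, 2, 3, 3, 1]

table = cross_tabulate(clusters, regions)
for (cluster, region), n in sorted(table.items()):
    print(f"cluster {cluster}, region {region}: {n}")
print(f"purity: {purity(table):.3f}")
```

A table like this says how badly the clusters mix regions, but not why; linked brushing between the dendrogram and the scatterplots is what shows where in the data space the mixing happens.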
17
The next few slides describe current developments in graphical methods,
for exploring the distribution of missing values, and for developing inferential
statements.
The data used to demonstrate methods for missing values contains monthly
average measurements from 13 buoys moored in the Pacific Ocean for the period of
March 1980 to May 1998. There are 5 variables: zonal winds, meridional winds,
relative humidity, air temperature, sea surface temperature. There are 2184
data points, and many missing values. The figure displays plots of the variables
in this data with missing values imputed in different ways.
The typical approach to plotting missing values is to plot them on the margins of a plot. Here, the marginal distributions of air temperature for missings
and non-missings on humidity appear a little different: the missings appear to
have a slightly lower mean value. This approach does not work for multivariate
plots. We demonstrate this by touring on three active variables: humidity, air temperature and sea surface temperature. The missing values
appear as bands of points that float around, which is quite distracting and not
helpful in understanding the nature of missingness.
Imputing using the variable mean value causes more problems. In a scatterplot of sea surface temperature against humidity we can see the missings appear
as a lovely cross shape in the plot. If you change to rotation or tour of the three
variables, you will see a nice 3D cross structure.
Similarly, random imputation is not particularly successful when there is
correlation between variables present. The destruction of correlation structure
is even more pronounced in multivariate space.
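Why mean imputation produces a cross is easy to see numerically: every case missing on one variable is placed at that variable's mean, so those cases line up along one vertical and one horizontal line through the means. The values below are invented for illustration (None marks a missing value) and are not the buoy data; impute_mean is a hypothetical helper.

```python
from statistics import mean


def impute_mean(column):
    """Replace None by the mean of the observed values in the column."""
    m = mean(x for x in column if x is not None)
    return [m if x is None else x for x in column]


# Hypothetical, strongly correlated measurements with gaps (None = missing).
sea_temp = [24.0, 25.0, None, 27.0, 28.0, None, 30.0]
humidity = [70.0, None, 76.0, 79.0, None, 85.0, 88.0]

x, y = impute_mean(sea_temp), impute_mean(humidity)

for i, (xi, yi) in enumerate(zip(x, y)):
    missing = sea_temp[i] is None or humidity[i] is None
    print(f"{xi:5.1f} {yi:5.1f}  {'imputed' if missing else 'observed'}")
```

In the printed pairs, the imputed cases share either the same x value or the same y value, while the observed cases follow the diagonal trend: the two arms of the cross cut straight across the correlation structure.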
18
A better approach to exploring the nature of missingness is to use brushing.
This plot shows a scatterplot of air temperature against sea surface temperature
where cases missing on humidity are highlighted as green crosses. The distribution is similar to the non-missings, especially in the correlation structure,
although the missings do look like they have a slightly different mean. This
approach can be easily extended to higher dimensions. The right plot shows a
tour view of 4 variables with missings on humidity highlighted.
19
This next example relates to inference. You are now my test audience! Can
you tell which is the real data plot in this array of plots? Stare at the plots for
a while, and decide if one is different from the others. Hands up if you think
it is the top left plot? The top line, second from left? The top line third from
left? The top right? The second from top left plot? ...
This is data from an agricultural field trial in Iowa. The response is corn
yield, and the explanatory variables are soil nutrients present in soil samples. We
look at Boron here. The real plot of Yield against Boron is the one in the top
row, second from the right. It is different from the others in two ways. First, in
terms of the outlier, at bottom right of this plot, case 200. None of the other
plots have a value so extreme. Second, the sharpness of the skewness, or the
lack of points in the bottom right, makes it different from the other plots. It is
somewhat similar to the left plot and the right plot in the top row, but has a
more pronounced structure.
Judging reality of structure in the presence of skewness presents more difficulties than in symmetric data, because of confounding with sample size.
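An array of plots like the one shown to the audience can be built by hiding the real data among "null" data sets in which no relationship exists. The talk does not say how its comparison plots were generated; the sketch below assumes one common choice, randomly permuting the response, and the number of panels (20), the function name lineup and the measurements are all invented for illustration.

```python
import random


def lineup(x, y, n_plots=20, seed=None):
    """Hide the real (x, y) pairs among panels with y randomly permuted.

    Returns (panels, position): panels is a list of n_plots lists of (x, y)
    pairs, and position is the index of the panel holding the real data.
    """
    rng = random.Random(seed)
    position = rng.randrange(n_plots)
    panels = []
    for k in range(n_plots):
        if k == position:
            panels.append(list(zip(x, y)))
        else:
            shuffled = list(y)
            rng.shuffle(shuffled)          # breaks any x-y association
            panels.append(list(zip(x, shuffled)))
    return panels, position


# Hypothetical measurements, invented for illustration (not the Iowa data).
boron = [0.2, 0.4, 0.5, 0.7, 0.9, 1.1, 1.4, 2.8]
corn_yield = [150.0, 148.0, 152.0, 149.0, 151.0, 147.0, 150.0, 95.0]

panels, position = lineup(boron, corn_yield, n_plots=20, seed=1)
```

If viewers who do not know position pick the real panel much more often than one time in n_plots, that is evidence the structure they see is "really there"; permutation keeps both marginal distributions fixed, so only the association between the variables differs between panels.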
20
In summary, data visualization is an attractive area because of its broad
applicability. Wherever data is collected, creative analysis is possible with all
the fun that creativity engenders. Data are gathered, stored and analyzed in
huge amounts by all areas of science, by governments, by industries such as
finance, retail, health, telecommunications, and service industries in general.
We hope the range of data examples discussed here will give the reader an
impression of the power and wide applicability of data visualization.
21
The authors can be contacted by email at:
dicook@iastate.edu
dfs@research.att.com
andreas@research.att.com
and the XGobi software can be downloaded from the XGobi web site:
http://www.research.att.com/areas/stat/xgobi/
22
References
Bryant, P. G. & Smith, M. A. (1995), Practical Data Analysis: Case Studies in
Business Statistics, Richard D. Irwin Publishing, Homewood, IL.
Forina, M., Armanino, C., Lanteri, S. & Tiscornia, E. (1983), Classification
of olive oils from their fatty acid composition, in H. Martens & H. Russwurm Jr., eds, 'Food Research and Data Analysis', Applied Science Publishers, London, pp. 189-214.
Mardia, K. V., Kent, J. T. & Bibby, J. M. (1979), Multivariate Analysis, Academic Press, London.
Swayne, D. F., Cook, D. & Buja, A. (1998), 'XGobi: Interactive Dynamic
Graphics in the X Window System', Journal of Computational and Graphical Statistics 7(1), 113-130.
23