Chapter 1
Introduction
Computer-aided visualization is a child of the age of technology. While the
technological age is also drowning us with information, computers bring the
provisions for humans to digest it. Computers have brought powerful facilities
to life for drawing and interacting with pictures to describe information. This
book is about drawing and interacting with pictures about a particular type
of information: data. This book describes the application of data visualization
tools that use high levels of interaction and motion to a broad array of case
studies from different disciplines. In this chapter we describe the history of data
visualization and the role of data visualization in the process of data analysis.
We also explain methods to be used in later chapters, and the software used for
demonstrating the methods.
1.1 Data Visualization
In this book, the term "data" is used for information that exists in some
schematic form such as a table or a list. Data is often but not always quantitative, and some translation of unstructured information is often required to
derive the data. It always includes some attributes or variables such as the
number of hits on web sites, frequencies of words in text samples, weight in
pounds, mileage in miles per gallon, income per household in dollars, years of
education, acidity on the pH scale, sulfur emissions in tons per year, or scores
on standardized tests. Characteristic of data visualization is the concern with
abstract relationships among such variables: for example, the degree to which
income increases with education, or the question of whether certain astronomical measurements indicate grouping and therefore hint at new classes of celestial
objects.
In contrast, other areas of visualization are mostly concerned with the display of objects and phenomena in physical 3-D space. Examples are volume
visualization (e.g., for the display of human organs in medicine), surface visualization (e.g., for manufacturing cars or animated movies), flow visualization
(e.g., for aeronautics or meteorology), and cartography. In these areas one often
strives for physical realism or the display of great detail in space, as in the visual display of a new car design, or of a developing hurricane in a meteorological
simulation.
Data visualization as we use the term requires the emphasis to be on variables
and their relationships in the abstract. Variables are typically thought of as
detached from physical location, even if physical location is part of the data. The
general task of data visualization is to make pictures that reflect relationships
among variables. To this end, one maps the variables to axes in a plot and the
variable values to locations on the axes. In effect, one uses space on a page to
encode non-spatial information. The goal of data visualization is then not realism,
which is meaningless for abstract variables, but rendering of data in space for
the purpose of visual consumption.
This program of data visualization, the spatial representation of data, immediately brings up a limitation: plotting surfaces such as paper or computer
screens are merely 2-dimensional, and physical space is just 3-dimensional. The
eye can be tricked into seeing 3-dimensional virtual space with perspective and
motion, which for visualization is as good as, or even better than, physical space.
In data visualization each variable requires a spatial axis or dimension, which
poses the question of how we are to picture more than two or three variables at
a time. The limitation to a 3-dimensional display space is fine if the objects are
3-dimensional, as in most other visualization areas, but in data visualization the
number of axes required to code variables can be large: five to ten are common,
and 50 or even hundreds arise in real applications. This then is the
challenge of data visualization: to overcome the 2-D and 3-D barriers.
To meet this challenge, computers allow us to implement some powerful
visualization tools. They mimic and amplify a paradigm familiar from photography: take pictures from multiple directions so the shape of an object can
be understood in its entirety. This is called the "multiple views" paradigm,
where the term "views" is just a synonym for "pictures". In our 3-D world the
paradigm works superbly: the human eye is very adept at inferring the true
shape of an object from just a few directional views. Unfortunately, the same
is often not true for views of abstract data! The chasm between different views
of data, however, can be actively bridged with computer technology: Unlike the
passive paper medium, computers allow us to manipulate pictures, to pull and
push their content in continuous motion with a similar effect as a moving video
camera, or to poke at objects in one picture and see them light up in other
pictures. Motion links pictures in time; poking links them across space. This
book features many illustrations of the power of these linking technologies. The
diligent reader may come away "seeing" high-dimensional data spaces!
1.2 Statistical Graphics
The discipline of data visualization has multiple homes, one in the field of statistics, and others in computer science and engineering. Statistics has had a strand
of data visualization research since the mid-1970s under the rubric "statistical graphics," while work outside statistics came into focus a bit later, following
the visualization initiative by the National Science Foundation in the late 1980s.
The seminal research in statistics was PRIM-9, the work of Fisherkeller,
Friedman and Tukey in 1974. PRIM-9, designed and implemented at the Stanford Linear Accelerator Center, was the first interactive data visualization system. It was followed by further pioneering systems at the Swiss Federal Institute
of Technology (PRIM-ETH), at Harvard University (PRIM-H), and at Stanford University (ORION), in the late 1970s and early 1980s. Research picked up in the
following few years in many places, including AT&T Bell Labs, Bellcore, the
University of Washington, the University of Minnesota, MIT, CMU, Battelle
(Richland, WA), George Mason University, Rice University, and several more.
Statisticians aren't the only researchers who are interested in the visualization of abstract high-dimensional data, but statistical data visualization has
some unique features. Statisticians are always concerned with variability in observations and error in measurements, both of which cause uncertainty about
conclusions drawn from data. Dealing with this uncertainty is at the heart of
classical statistics, and statisticians have developed a huge body of inference
methods that allow us to quantify uncertainty. Inference used to be statisticians' sole preoccupation, but this changed under John W. Tukey's towering
influence. He championed "exploratory data analysis" (EDA), which focuses on
discovery and allows for the unexpected, unlike inference, which progresses from
pre-conceived hypotheses. EDA has always depended heavily on graphics, even
before the term "data visualization" was coined. Our favorite quote from John
Tukey's rich legacy is that to "force the unexpected upon us," we need good
pictures. In the past, EDA and inference were sometimes seen as incompatible, but they are not mutually exclusive. In this book, we will present visual
methods for assessing uncertainty and performing inference, that is, deciding
whether what we see is "really there."
1.3 Representing High-Dimensional Data
The approach to drawing plots of data is called the "multiple views" paradigm
(Buja, Cook & Swayne 1996). Multiple different plots, corresponding to views
of the data from different aspects, are shown to the user simultaneously. Interaction tools facilitate linking information between plots. The methods adhere
to very specific guidelines for graphics: use simple, easy-to-read plots, with a
healthy collection of interaction tools, such as refocusing, linking between plots,
easy rearrangement of plots, and automated sequences of views. We describe
the approach in depth in this section.
At the root of the graphical methods is a division of data visualization into
three areas:
Rendering, or what to show in a plot;
Manipulation, or what to do with plots;
Linking, or what information to share between plots.
The first area, rendering of data, comprises all decisions that go into the production of a static image. Rendering is concerned with appropriate representation
of information in data variables: a scatterplot, a density plot, a time series plot
or a parallel coordinate plot are examples of renderings. Wegman & Carr (1993)
give an excellent introduction to the wide array of rendering methodology in statistical data visualization. The second area, manipulation of plot elements, refers
to how we operate on individual plots and how we organize multiple plots. The
purpose of these manipulations is to support the search for structure in data.
The third area, linking, refers to the connection of elements from one rendering
to another rendering. We most often think of linked brushing, where points are
given the same appearance (color, glyph) between renderings. More generally,
linking can include matching scales, matching axes, or linking a point to a record
in a database or to an index of chemical compounds. In the practice of data visualization, there usually exists a larger context of open-ended problem solving. In
such contexts, data visualization systems are most useful if they provide plot
manipulation tools that support extensive searching and linking of information.
Behind the process of rendering is the concept of a data pipeline, first described by Buja, Asimov, Hurley & McDonald (1988). The data pipeline is
the conveyor belt which takes the raw data through a series of transformations
to go from p-dimensional data to a d-dimensional rendering. (Here
p represents the number of variables in the data, and d < p, most commonly
d = 2.) The stages of transformation typically include some type of variable
standardization and dimension reduction. Examples of standardization are standardizing each variable to mean 0, variance 1, and ordering a time variable. The
dimension reduction may be done using variable selection, or through projection methods such as principal components or discriminant coordinates. Motion
graphics, such as tours, can also be applied to address the dimension reduction problem. Although the number of variables may remain the same, the tour
algorithm provides a continuous sequence of low-dimensional projections to produce a "movie" which shows the data "from all sides". A tour is also a display
of multiple views using time order. In general, the multiple views approach
implies that the data pipeline is rather more like a river delta, piping the data
out into multiple renderings.
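The two pipeline stages named above, standardization followed by projection from p to d dimensions, can be sketched in a few lines of Python. This is only an illustration under our own assumptions: it uses NumPy, synthetic random data, and principal components as the projection method; the function names are hypothetical and are not part of any software described in this book.

```python
import numpy as np

def standardize(X):
    """Scale each variable (column) to mean 0 and variance 1."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def principal_component_basis(Z, d=2):
    """Return a p x d orthonormal basis for the top-d principal components."""
    Zc = Z - Z.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(Zc, full_matrices=False)
    return vt[:d].T

def pipeline(X, d=2):
    """Raw p-dimensional data -> standardized data -> d-dimensional rendering."""
    Z = standardize(X)
    A = principal_component_basis(Z, d)
    return Z @ A, A

# Synthetic data (an assumption for illustration): n = 200 cases, p = 5 variables
# measured on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([1.0, 10.0, 0.1, 5.0, 2.0])
Y, A = pipeline(X, d=2)   # Y holds the 2-D coordinates that would be plotted
```

Replacing the basis A by a different orthonormal p x d matrix gives a different view of the same standardized data, which is the sense in which one pipeline can feed several renderings.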
With multiple renderings the difficulty is to define appropriate mechanisms
to link information from one plot to another. In the simplest case, where one
view has a scatterplot of variable 1 vs variable 2, and another view has a scatterplot of variable 3 vs variable 4, the points in each plot are linked one-to-one:
a point in one plot corresponds to exactly one point in the other plot. More complex types of linking often arise, for example, if there is a time or spatial
component to the data. Where there is a spatial component we
need to explore the spatial dependence between sample points. So we may have
a point in one view, as an element of a variogram cloud, corresponding to a pair
of locations in another view, the map. In the situation of time, there may be
a multivariate longitudinal study, where there are both demographic variables
for each patient, and multiple measurements over a study period for each patient.
Here it is desirable to link a point in one view to a time series in another view.
We could think of this as one-to-many, or indeed many-to-one. Other types of
plot element linking are common as well, for example, linking axis scales and projection
coefficients.
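One simple way to realize one-to-many linking is to give both views a shared identifier and build an index from it. The sketch below assumes a toy longitudinal data set that we invented for illustration (patient records and repeated measurements); the names and values are hypothetical and the code is not taken from any particular visualization system.

```python
from collections import defaultdict

# Hypothetical toy data: one demographic record per patient ...
patients = [
    {"id": "p1", "age": 34},
    {"id": "p2", "age": 51},
    {"id": "p3", "age": 47},
]
# ... and several time-stamped measurements per patient.
measurements = [
    {"id": "p1", "week": 0, "value": 5.1},
    {"id": "p1", "week": 4, "value": 4.8},
    {"id": "p2", "week": 0, "value": 6.3},
    {"id": "p2", "week": 4, "value": 6.0},
    {"id": "p2", "week": 8, "value": 5.7},
    {"id": "p3", "week": 0, "value": 5.5},
]

def build_link(rows):
    """Map each patient id to the row numbers of that patient's measurements."""
    index = defaultdict(list)
    for row_number, row in enumerate(rows):
        index[row["id"]].append(row_number)
    return index

def brush(records, link, predicate):
    """Return the measurement rows linked to every record selected by predicate."""
    selected = []
    for record in records:
        if predicate(record):
            selected.extend(link[record["id"]])
    return selected

link = build_link(measurements)
# Brushing the patients older than 45 in the demographic view highlights
# all of their measurements (their whole time series) in the other view.
highlighted = brush(patients, link, lambda p: p["age"] > 45)
```

One-to-one linking is the special case in which every identifier maps to exactly one row in the other view.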
Figure 1.1: Monthly average Willamette river flow levels over a period of 10
years, at two different aspect ratios (axes: Flow vs Time). Top plot shows 1:1 ratio (contracted in
time), which reveals long term trends, such as the up then down. Bottom plot
shows long time axis, revealing local seasonal trends. The 5th year (between
observations 60 and 70) has a smaller peak flow than other years. The last year
(beginning at observation 120) appears to start with higher than usual flow.
Figure 1.2: Comprehensive Ocean-Atmosphere Data: long term means of in
situ weather observations taken by merchant marines, gridded and cleaned.
(Top) The atmospheric science view: wind direction and speed displayed as
arrows on the map allow easy insight into the global trends: easterly winds
prevail in most parts of the Pacific. (Bottom) Brushing north-westerly, northerly,
and south-easterly winds in the wind direction plots (axes S(WndDir) and
C(WndDir)) allows inspection of local
anomalies: winds off the north American coast are mostly north-westerly and
northerly, and off the south American coast mostly south-easterly on average.
When plot elements are linked, manipulation of elements in
one plot directly affects the representation of the data in the other plots. The
taxonomy of manipulations described in Buja et al. (1996) is as follows:
Focusing views: By focusing we mean any operation that is an extension
of manipulating a camera, such as deciding from which side to look at the
object and at which magnification and level of detail. Focusing views includes
choosing the variables or (more generally) the projections for viewing, but
also choosing aspect ratio, zoom and pan. (Figure 4.1 illustrates the
effect of zoom and pan on structure detection.)
Posing queries: In graphical data analysis it is natural to pose queries
graphically, for example with the familiar brushing techniques: coloring or
otherwise highlighting a subset of the data means issuing a query about
this subset. It is then equally natural that the response to the query
be given graphically. This is achieved by showing information about the
highlighted subset in other views. It is therefore desirable that the view
where the query is posed and the views that present the response are
linked. Ideally, responses to queries are instantaneous. (Figure 1.2 illustrates the difference between a static map and interactive querying for
structure detection.)
Arranging many views: It is often a powerful informal technique to
arrange large numbers of related plots for simultaneous comparison. The
most useful arrangements are matrix-like, such as in scatterplot matrices
of pairwise variable plots, but other arrangements can be useful as well.
A very commonly used method for posing queries is "linked brushing":
points in one plot are highlighted using color or glyph, and the corresponding
points in other plots are simultaneously highlighted. In statistical terms, linked
brushing can be considered to be exploring conditional distributions of variables,
where the brush is a conditioning tool. Motion
graphics such as the tour, on the other hand, facilitate exploring the joint distribution, because the
motion facilitates perception of the "shape" of the data from the sequence of
marginal views. If we know the distribution of all low-dimensional projections
of the data then we also know the joint multivariate distribution, following a
result of Cramér and Wold (Mardia, Kent & Bibby 1979).
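To make the idea of the brush as a conditioning tool concrete, the following sketch treats a brush as a logical mask on one variable and compares a summary of a second variable inside the brush with its overall summary. The data are synthetic and the names are our own; we assume NumPy and make no claim that an interactive system computes anything this way.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y is positively associated with x.
rng = np.random.default_rng(1)
n = 1000
x = rng.normal(size=n)
y = 0.8 * x + 0.6 * rng.normal(size=n)

def brush(values, low, high):
    """A one-dimensional brush: True for the points lying inside [low, high]."""
    return (values >= low) & (values <= high)

# Brushing the right-hand side of the x plot conditions on 1 <= x <= 3.
mask = brush(x, 1.0, 3.0)

marginal_mean = y.mean()            # summary of y over all points
conditional_mean = y[mask].mean()   # summary of y for the brushed points only
```

In a linked display the same mask would simply recolor the brushed points in every other plot; moving the brush changes the conditioning set and therefore the conditional distribution being viewed.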
1.4 Restructuring Data
Clever restructuring of variables is perhaps one of the most hidden yet valuable
tools for data visualization. The case studies described in this book make substantial use of restructured variables. Particular types of data lend themselves
to obvious approaches to restructuring: data with a time or space component, modular variables such as wind direction, or compositional data where
the variables are subject to a constraint. When there is a time or space component it
is important to explore the time or space dependency using lag plots or variogram cloud plots. With time, it is also likely that there are different scales of
time (daily/weekly/yearly) to explore. Regrouping these different resolutions
into different variables facilitates exploring trend (yearly or weekly). A variable
such as wind direction can be better handled in the interactive setting by using sine and cosine values. This allows the user to brush around the compass
points to explore the relationship with other variables. Compositional data is
best approached by pre-projecting the data into a subspace orthogonal to the
variable constraint, called a ternary diagram in 2D. In situations where modeling is a part of the analysis, components associated with the model are useful
to append to the data set, and appending samples or quantiles from standard
distributions facilitates inference. When the modeling is classification, exploring a dendrogram in association with the variables may illuminate clustering
in the data. With any type of prediction, appending the predictions, residuals
or diagnostics to the data can help improve the model.
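Two of these restructurings are easy to sketch in Python. The first replaces a direction in degrees by its sine and cosine, so that directions close on the compass are also close in the plot; the second builds the pairs needed for a lag plot. The function names, and the convention that 0 degrees is north with angles increasing clockwise, are our assumptions for illustration.

```python
import math

def compass_to_sin_cos(degrees):
    """Replace a direction in degrees by (sine, cosine).

    Plotting sine horizontally and cosine vertically puts north at the top and
    east at the right, assuming 0 degrees is north and angles increase clockwise.
    """
    theta = math.radians(degrees)
    return math.sin(theta), math.cos(theta)

def lagged_pairs(series, lag=1):
    """Pairs (value at time t, value at time t + lag) for a lag plot; lag >= 1."""
    return list(zip(series[:-lag], series[lag:]))

# 350 and 10 degrees are both nearly north: far apart as raw numbers,
# but neighbours on the circle once restructured.
a = compass_to_sin_cos(350.0)
b = compass_to_sin_cos(10.0)
gap_on_circle = math.dist(a, b)

pairs = lagged_pairs([3.0, 4.0, 6.0, 5.0], lag=1)
```

Because the restructured points lie on a circle, a rectangular brush moved around that circle selects contiguous ranges of direction, including ranges that straddle north.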
1.5 The Role of Visualization in Data Analysis
It is commonly said that "a picture is worth a thousand words". Translated
to the area of data analysis, this becomes "a good plot can present a concise summary
of data that many pages of explanation cannot match". The eye can absorb
immense amounts of information if provided with "informative" pictures. A
well-constructed plot can provide considerably more information about the relationships between variables than summary numerical statistics. In the process
of analyzing data, graphics are used to assess the integrity of the data, provide
diagnostics for model fitting, and provide convenient summaries of important
findings. Graphics provide distinct advantages for detecting structure in data
and also for refining results from numerical algorithms. In this section we discuss
why and where graphics are useful in data analysis, using three examples.
The first example demonstrates finding local anomalies with graphics. The
data is very simple. It contains recordings made by one waiter over a period of a
few months in a restaurant in the mid-west USA (Bryant & Smith 1995). Several
variables were collected: total tip, total bill, sex of the bill payer, smoking party
or not, day of the week, time of day and size of the party. The data was reported
in a collection of case studies for business statistics. The primary question
related to the data is: "What are the factors that affect tipping behavior?" The
solution in the manual is to restructure tip and total bill into a new variable,
tip rate, and model it against the remaining variables. The result is a model with
tip rate as the response and size of the party as the one explanatory variable.
It is a very empty result! For such a simple data set it has enormous richness
in structure, which we will show with graphics.
Figure 1.3 displays the variable total tip as a histogram. Six histograms were
constructed using different bin widths: $1, 50c, 33c, 25c, 20c, 10c. The small
multiples illustrate how different information can be gained from examining
the data at different resolutions. At the largest bin width, the shape of the
distribution is unimodal and skewed, which indicates the predominance of lower
total tips, and fewer larger total tips. As the bin width is reduced the shape
becomes multimodal, and at the smallest bin width it is clear that there are large
peaks at the full dollars and smaller peaks at the half dollars. This suggests that
the customers tend to round the tip to the nearest fifty cents or dollar. These
multiple plots also serve to emphasize that the salient features are not found
with one ideal bin width. Different resolutions provide ways to focus attention
on different features. A large bin width smooths out the noise, and allows
the reader to focus on large or global trends. In contrast, a smaller bin width
allows the reader to extract intricate features from the trend. The human eye
re-focuses itself often, from one object to another, from far to near. Squinting
and peripheral vision help to extract larger objects from a mixed background.
Providing plots of data in different resolutions is analogous to the way the eye
re-focuses.
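The effect of bin width can be reproduced on made-up data. The sketch below does not use the restaurant data: it simulates tips in cents, rounds a majority of them to the nearest half dollar (our assumption, chosen to mimic the rounding habit described above), and counts them at several bin widths. Working in whole cents keeps the bin boundaries exact.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

# Synthetic tips in cents (NOT the restaurant data): a right-skewed amount ...
raw = rng.gamma(shape=4.0, scale=75.0, size=n).astype(int) + 100
# ... of which roughly 60% are rounded to the nearest 50 cents.
rounded = (np.round(raw / 50.0) * 50).astype(int)
tips = np.where(rng.random(n) < 0.6, rounded, raw)

def histogram(cents, width):
    """Counts per bin [k * width, (k + 1) * width), with bins starting at 0."""
    return np.bincount(cents // width)

for width in (100, 50, 25, 10):
    counts = histogram(tips, width)
    print(f"bin width {width:>3}c: {len(counts)} bins, tallest bin holds {counts.max()}")

# Share of tips that fall in a 10c bin starting at a whole or half dollar.
share_on_half_dollars = np.mean((tips // 10) % 5 == 0)
```

At the $1 width the counts trace a single skewed hump; at the 10c width the bins that start at whole and half dollars hold most of the data, which is the multimodal pattern the small multiples make visible.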
Figure 1.4 shows the result of drilling down further into the tipping data. On the left,
Total Tip is plotted against Total Bill. The predominating pattern is that the
points mostly lie in the lower right triangle of the plot. This is the region where
tips are relatively low compared to the total bill, which suggests that on average
there are more "cheap tippers" than generous tippers. There are a couple of
notable exceptions. One point has a tip value of about $5 for a $7 bill, a tip
rate of about 70%. The horizontal stripes of points are the peaks at the full
dollar and half dollar tips. This is the marginal structure that was discovered using
the histograms above. At the right of the main plot are 4 smaller plots where tip vs
total bill is plotted conditionally on Sex of the Bill Payer and Smoker (whether
it was a smoking party of diners or not). Inspecting these plots reveals numerous
features: (1) for smoking parties, there is almost no relationship between tip
and total bill, (2) when a female non-smoker paid the bill, the tip was a very
consistent percentage of the total bill, with the exception of three dining parties,
(3) larger total bills mostly had a male paying them.
Further drill down into other variables, such as time of day, day of week and
size of party reveals enough information to almost locate the restaurant. This
is the power of graphics.
The second example illustrates the strength of graphical methods in detecting
sparse structure in high-dimensional space. The data is from a particle physics
experiment where there are 7 measured variables which describe the outcome
state of the experiment. A combination of interactive brush controls and motion
graphics reveals that the points lie on a structure composed of connected low-dimensional pieces: a 2D triangle, with 2 linear pieces extending from each
vertex (Cook, Buja, Cabrera & Hurley 1995). Figure 1.5 shows several 2D
projections of the data illustrating the shape. The points are shown in the
left side views, and to the right is a wire frame diagram illustrating the shape.
Various graphical tools specifically facilitated the discovery of the structure:
plots of low-dimensional projections of the 7D data allowed discovery of the
low-dimensional pieces, highlighting allowed the pieces to be recorded or marked,
and animating many projections into a movie over time allowed the pieces to
be reconstructed into the full shape. The data was old by relative standards,
20 years old by the time this discovery was made with visual methods, so the
meaning of the structure, if there is one, is lost.
Figure 1.3: Histograms of actual tips with differing bin widths: $1, 50c, 33c, 25c,
20c, 10c. An interactive system allows the bin width to be changed
with a slider.
Figure 1.4: (Left) Scatterplot of Total Tip vs Total Bill: More points in the
bottom right indicate more cheap tippers than generous tippers. (Right) Total
Tip vs Total Bill by Sex and Smoker (panels: Male Smokers, Female Smokers,
Male Non-smokers, Female Non-smokers): There is almost no association between tip
and total bill in the smoking parties, and, with the exception of 3 dining parties,
when a female non-smoker paid the bill the tip was extremely consistent.
The third example illustrates the use of graphics to refine the results of two numerical algorithms. The data comes from a study on Italian olive oils (Forina,
Armanino, Lanteri & Tiscornia 1983). Samples from different regions were analyzed for their fatty acid content. The eight variables measure the percentage
of each of the eight fatty acids found in the olive oil samples: palmitic,
palmitoleic, stearic, oleic, linoleic, linolenic, arachidic and eicosenoic acid. For quality control purposes, it is important to be able to classify an oil sample into
its region of manufacture based on the fatty acid composition.
The left plot in Figure 1.6 displays the solution of a Classification and Regression Tree (CART), which is similar to the linear discriminant analysis solution: the
presence or absence of eicosenoic acid separates oils from region 1 from the other
two regions, and oils from region 3 have low levels of linoleic acid, whereas
oils from region 2 have high levels of linoleic acid. The right plot illustrates the
result of rotating small amounts of oleic and arachidic acid into the projection.
We demonstrate this interactively. With the small adjustment we gain a much
cleaner separation of the oils from the 3 regions. It is easy to understand why
CART and linear discriminant analysis don't effectively discriminate between
regions 2 and 3: CART is distracted by the correlation and linear discriminant
analysis is distracted by the heterogeneous group variances. Neural networks
can do a perfect classification job here, but neural network solutions are difficult to interpret. With graphics we can understand how the network does its
classification, which variables dominate the solution, and, if there are
misclassifications, where the regions of uncertainty are. It is important to note
that neural networks do not perform so well in further drill down to classify the
areas within regions. With graphics, we can understand the cluster structure
fairly completely, and how the numerical methods work.
The human eye has both the ability to detect small, localized departures
from a trend, and also to ignore peripheral noise, a robust quality. It is also
adept at detecting patterns using motion, especially cluster structure.
Figure 1.5: Three projections of 7D particle physics data (axes Var 1 to Var 7): (left) points, (right)
the corresponding wire frame illustrating the underlying structure.
Figure 1.6: (Left) Solution obtained by CART (eicosenoic vs linoleic acid) leaves confusion between groups
2 and 3. (Right) A small rotation of two other variables, oleic and arachidic
acid, into the projection gives a much neater solution.
1.6 Introduction to the Case Studies
(Briefly describe the range of data examples used in this book. The appendix
should hold the full data descriptions.)
1.7 Getting Started with XGobi
XGobi is publicly available visualization software for viewing and interacting
with high-dimensional data. Its primary views are scatterplots augmented with
line drawings; other view types are scatterplot matrices and parallel coordinate
plots. Multiple views are automatically linked; they are interactive, in the sense
that they can be directly manipulated and queried. Also featured in XGobi are
dynamic methods, such as high-dimensional rotations, including the grand tour
and correlation tour, augmented with manual controls and automatic guidance
through projection pursuit. XGobi can deal with missing values through elimination or imputation. It can also be used as a rudimentary high-dimensional
drawing program. Finally, a sister program, called XGvis, is provided for nonlinear dimension reduction, multidimensional scaling, and high-dimensional graph
layout. XGobi and XGvis can be run on almost every platform, and they support interprocess communication with other packages.
Start up XGobi on the flea beetles data. What you will see is a window like
that in Figure 1.7. The window is laid out in 4 regions: a central plot
region, a control panel specific to the view mode at left, controls for variables
at the right (there are 6 variables in this data), and a series of menus at the
top giving access to the major tools in XGobi.
Figure 1.7: Main XGobi window, displaying a central plot region, a control
panel at left, controls for variables at right, and a series of menus at the top
facilitating the major tools in XGobi.
Take a closer look at the series of menus at the top. These contain selections
which allow switching between the major modes and tools available in XGobi.
Take a look at each one. The File menu contains typical file input/output, printing and exiting choices. The View menu contains choices of major plot types
(1DPlot, XYPlot, 3D Rotation, Grand Tour, and Correlation Tour), and major
plot interaction methods (Scale, Brush, Identify, Line Editing, Move Points).
The Tools menu contains a heterogeneous selection of tools: Hide or exclude,
Subset, Smooth, Jitter, Parallel coordinates, Scatterplot Matrix, Variable transformation, Variable and Case lists, and Missing Values. The Display menu contains selections for modifying the look of the plot: Display axes/gridlines, center
axes, plot the points/lines. The Info menu gives basic instructions about the
help facilities.
The colors and glyphs of the points were read in from startup files, and are
set to identify species of beetles in this data.
On start-up, the view mode is XYPlot, so a scatterplot of the first two
variables appears in the plot window. The bars in the variable circles at right
indicate how the variables are displayed in the plot: tars1 is horizontal and tars2
is vertical. Click on an empty circle with the left mouse button to change the
horizontal variable. Click with the middle mouse button (ALT-Left on a two
button PC mouse) on an empty circle to change the vertical variable. This can
be animated to cycle through all pairs of variables, using the control panel at
left. Click on Cycle to start cycling, and drag the scrollbar to change the speed
of change. Fix X sets the horizontal variable to stay the same during cycling,
and similarly Fix Y sets the vertical variable.
Figure 1.8: Main XGobi window in 1DPlot View mode, with jitter control panel.
Change to 1DPlot view mode. Notice that the control panel at left changes.
Each View mode has a unique control panel. The plot window now displays an
average shifted histogram of the variable tars1. The scrollbar labelled "ASH
Smoothness" controls the smoothness of the density; dragging to the right
smooths the plot. Using the left mouse button on the variable circles allows
switching the variable in the plot. If you select the variable aede2, you'll notice
it is discrete: there are just 9 distinct values. Go into the Tools menu and select
Jitter. A control panel (Figure 1.8) pops up in a separate window, which allows
you to add small amounts of random noise to the values. In this case it allows
us to peek at what is plotted underneath.
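For readers who want to see what the 1DPlot controls compute, here is a minimal sketch of an average shifted histogram and of jittering. It is not XGobi's implementation. The function names, the parameter m (standing in for the smoothness setting), the choice of uniform noise, and the data values are all assumptions made for illustration.

```python
import math
import random


def ash_density(values, lo, hi, n_fine_bins, m):
    """Average shifted histogram on [lo, hi).

    The range is cut into n_fine_bins narrow bins of width delta. Averaging
    m histograms of bin width h = m * delta, each shifted by delta, amounts
    to a triangular weighting of neighboring fine-bin counts. A larger m
    gives a smoother estimate; m = 1 is an ordinary histogram.
    """
    n = len(values)
    delta = (hi - lo) / n_fine_bins
    h = m * delta
    counts = [0] * n_fine_bins
    for v in values:
        k = math.floor((v - lo) / delta)
        if 0 <= k < n_fine_bins:
            counts[k] += 1
    density = []
    for k in range(n_fine_bins):
        total = 0.0
        for i in range(1 - m, m):
            j = k + i
            if 0 <= j < n_fine_bins:
                total += (1 - abs(i) / m) * counts[j]
        density.append(total / (n * h))
    return density


def jitter(values, amount, rng):
    """Add uniform noise in [-amount, amount] to each value."""
    return [v + rng.uniform(-amount, amount) for v in values]


# Made-up discrete measurements, not the beetle data.
discrete = [8, 9, 9, 10, 10, 10, 11, 11, 12]

rough = ash_density(discrete, lo=0.0, hi=20.0, n_fine_bins=40, m=1)
smooth = ash_density(discrete, lo=0.0, hi=20.0, n_fine_bins=40, m=4)

rng = random.Random(0)  # seeded so the sketch is repeatable
jittered = jitter(discrete, amount=0.3, rng=rng)
```

Tied values such as the three 10s would be drawn on top of each other; after jittering they land at slightly different positions, so each case becomes visible.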
Change to Grand Tour view mode. The points in the plot window now
rotate in the first 3 variables. The lines in the variable circles track the rotation,
indicating how each variable contributes to the projection displayed in the plot.
Figure 1.9: Main XGobi window in Grand Tour mode, with projection pursuit
guidance window.
The control panel is huge! We'll explain its controls in more detail below. In the
Display menu, set Center axes in 3D+ modes to off, to shift the axes out
of the center. Click on the variable circles with either left or middle mouse
buttons to toggle variables in and out. (If you reduce the number of variables
to 2, the tour will stop, and you will need to re-start it with the Pause button
when more variables are added.) When more than 3 variables are included, the
grand tour in XGobi displays a continuous random sequence of 2D projections
of the data. This is particularly useful for this data, as the 3 species separate
out very neatly in some projections when all variables are included in the tour,
and the motion of points further indicates 3 different patterns corresponding to
the 3 species. The control panel contains three different tour methods: the grand
tour, guided tour, and manual tour. The guided tour is activated by clicking on
ProjPrst and Optimz. Choose the Holes index from the menu of indices, and
watch the guided tour find a very nice projection of the data identifying the 3
species. A separate plot window tracks the value of the Holes index. From the
axes, you can see the separation is due primarily to 4 variables, tars1, tars2,
aede1, aede3 (Figure 1.9). Also, in terms of principal component axes, the
separation is due to PC1, PC2 and PC6. Turn ProjPrst off. Now focus on the
manual tour. With the tour paused, in the plot window drag the mouse with
the left button held down, to change the coefficient of tars1 in the projection.
The thin inner circle in the variable circle panel indicates which variable is to
be manipulated. It can be changed by holding the Shift key down and clicking
on a variable circle.
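Each frame of a tour is a 2D projection of the data, and the lines in the variable circles correspond to the coefficients of each variable in that projection. The sketch below draws one random projection using Gram-Schmidt orthonormalization. It is an illustration under stated assumptions, not XGobi's algorithm: the function names and the data are made up, and a real tour also interpolates smoothly between successive projections, which is omitted here.

```python
import math
import random


def random_frame(p, rng):
    """Return two orthonormal vectors of length p spanning a random plane."""
    a = [rng.gauss(0.0, 1.0) for _ in range(p)]
    b = [rng.gauss(0.0, 1.0) for _ in range(p)]
    norm_a = math.sqrt(sum(x * x for x in a))
    a = [x / norm_a for x in a]
    # Remove the component of b along a, then normalize what is left.
    dot = sum(x * y for x, y in zip(a, b))
    b = [y - dot * x for x, y in zip(a, b)]
    norm_b = math.sqrt(sum(y * y for y in b))
    b = [y / norm_b for y in b]
    return a, b


def project(rows, frame):
    """Project each p-dimensional row onto the two frame vectors."""
    a, b = frame
    return [
        (sum(r * x for r, x in zip(row, a)), sum(r * y for r, y in zip(row, b)))
        for row in rows
    ]


# Made-up data with 5 variables, not the beetle measurements.
rows = [
    [1.0, 2.0, 0.5, 3.0, 1.5],
    [2.0, 1.0, 1.5, 2.5, 0.5],
    [0.5, 3.0, 2.0, 1.0, 2.5],
]

rng = random.Random(1)  # seeded so the sketch is repeatable
frame = random_frame(5, rng)
projected = project(rows, frame)

# Coefficient of each variable on the horizontal and vertical screen axes.
for j, (h_coef, v_coef) in enumerate(zip(*frame)):
    print(f"variable {j}: horizontal={h_coef:+.3f} vertical={v_coef:+.3f}")
```

In this picture, the manual tour's drag on a single variable amounts to changing that variable's pair of coefficients while keeping the frame orthonormal.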
Change to Brush mode. The brush control panel allows choice of color,
glyph, point or line brushing, hiding and excluding of points. Use the color
menu to select a new color. The rectangle in the corner is the brush. Moving
the mouse around the screen with the left button depressed moves the brush, and
points that are under the brush are painted the new color. This is transient
brush mode. This mode is more useful when multiple XGobi windows are visible on the
screen and linked so that the colors in each plot also change according to the
brush action in one. Click on Persistent to make the brush permanently change
the color of the point. This kind of highlighting is used to mark features in the
data. The middle mouse button allows the size of the brush to be changed. Click
on the Hide or exclude panel (Figure 1.10). Choosing a color/glyph combination
as Hidden hides it from the plot region. Choosing exclude removes it from the
scale calculations, too, so the plot will rescale when groups are excluded.
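The difference between transient and persistent brushing, and between hiding and excluding, can be summarized in a few lines. This is a minimal sketch with hypothetical function names and made-up points; it only mirrors the behavior described above and is not XGobi's code.

```python
def inside(point, brush):
    """True if (x, y) lies in the brush rectangle (x0, y0, x1, y1)."""
    x, y = point
    x0, y0, x1, y1 = brush
    return x0 <= x <= x1 and y0 <= y <= y1


def paint(points, colors, base_colors, brush, new_color, persistent):
    """Return the colors shown once the brush sits at `brush`.

    Transient: points not under the brush revert to base_colors.
    Persistent: points keep whatever color they were painted earlier.
    """
    start = colors if persistent else base_colors
    return [new_color if inside(p, brush) else c for p, c in zip(points, start)]


def plot_range(values, excluded):
    """Min and max over cases that are not excluded.

    Hidden cases are not passed as excluded, so they still affect the scale.
    """
    kept = [v for v, ex in zip(values, excluded) if not ex]
    return min(kept), max(kept)


# Made-up points; the brush visits the first point, then the last.
points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
base = ["red", "red", "red"]
first_brush = (-0.5, -0.5, 0.5, 0.5)
second_brush = (4.5, 4.5, 5.5, 5.5)

transient = paint(points, base, base, first_brush, "blue", persistent=False)
transient = paint(points, transient, base, second_brush, "blue", persistent=False)

persistent = paint(points, base, base, first_brush, "blue", persistent=True)
persistent = paint(points, persistent, base, second_brush, "blue", persistent=True)

xs = [p[0] for p in points]
range_hidden = plot_range(xs, excluded=[False, False, False])
range_excluded = plot_range(xs, excluded=[False, False, True])
```

After both brush positions, the transient run has only the last point painted, while the persistent run has both visited points painted. Excluding the last point shrinks the horizontal range, which is why the plot rescales.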
In the Tools menu select Variable transformation. This panel allows you to
interactively transform variables or groups of variables, and make some major
changes such as sorting or permuting values (Figure 1.11). Sorting is neat
because it allows QQ-plots to be generated on the fly.
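The reason sorting yields a QQ-plot is that sorting two variables independently pairs the smallest value of one with the smallest value of the other, and so on up the order statistics. The sketch below shows this with made-up numbers; the function name qq_pairs is hypothetical, and it assumes both variables have the same number of cases, as they do when they come from the same data table.

```python
def qq_pairs(x, y):
    """Pair the sorted values of x with the sorted values of y."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of cases")
    return list(zip(sorted(x), sorted(y)))


# Made-up values, not the beetle measurements.
x = [3, 1, 2]
y = [30, 10, 20]

pairs = qq_pairs(x, y)
for qx, qy in pairs:
    print(qx, qy)
```

Plotting these pairs as a scatterplot gives the empirical QQ-plot. Points falling near a straight line suggest the two variables have distributions of similar shape.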
Now is a good time to play around with selecting modes and tools, exploring the
functionality of XGobi, or seeing if you can break it.
Figure 1.10: Main XGobi window in Brush mode, with Hide or exclude control
panel.
Figure 1.11: QQ-plot of aede1 and aede3, obtained by using the variable transformation controls.