Chapter 1
Introduction
This set of notes provides a brief overview of interactive and
dynamic graphics methods, and an introduction to the graphical exploration
of data, using the software XGobi for the examples. Beyond the standard
graphical tools available in XGobi, with creative thinking and re-structuring of
data, you can make an astonishing array of visual renderings of your data that
can provide insights that were previously impossible.
1.1 What is Interactive and Dynamic Data Visualization?
Interaction, or to be more precise, direct manipulation, of graphics includes
activities such as:
• linked brushing points/regions/lines in multiple views, by highlighting
  them in one plot using a brush, and watching where the highlighted points
  are in the other plots.
• querying the id of a point or group of points.
• dragging a scrollbar to change the value of a parameter.
• clicking a button to change the variables viewed in a plot.
Interaction facilitates tasks in exploration and discovery.
Dynamic graphics are motion graphics, for example:
• 3D rotating plots.
• tour methods: grand/random, guided, manual.
In statistical terms, there are specific technical interpretations of linked
brushing and motion graphics. Linked brushing can be considered to be exploring
conditional distributions of variables, where the brush is a conditioning
tool. Motion graphics can be considered to be exploring joint distributions,
because the motion facilitates perception of the "shape" of the data from the
sequence of marginal views. If we know the distribution of all low-dimensional
projections of the data then we also know the joint multivariate distribution,
following a result of Cramér and Wold (Mardia, Kent & Bibby 1979).
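The conditioning interpretation of linked brushing can be sketched in a few lines of Python. This is only an illustration, not XGobi's implementation: the function names brush and conditional_summary, the rectangular brush representation, and the toy data are all assumptions made for this sketch.

```python
from statistics import mean

def brush(points, xlim, ylim):
    """Return indices of the points that fall inside a rectangular brush.

    points     : list of (x, y) pairs shown in the plot being brushed
    xlim, ylim : (low, high) extents of the hypothetical brush rectangle
    """
    return [i for i, (x, y) in enumerate(points)
            if xlim[0] <= x <= xlim[1] and ylim[0] <= y <= ylim[1]]

def conditional_summary(values, selected):
    """Mean of another variable, restricted to the brushed cases."""
    return mean(values[i] for i in selected)

# Toy data (made up for illustration): two plotted variables and a third
# variable that would be shown in a linked view.
xy = [(1, 1), (2, 1), (3, 4), (4, 5), (5, 5)]
z = [10, 12, 30, 34, 36]

selected = brush(xy, xlim=(2.5, 5.5), ylim=(3.5, 5.5))
print(selected)                          # indices of the highlighted cases
print(conditional_summary(z, selected))  # mean of z given the brushed region
```

Moving the brush changes the conditioning region, so the summary of z in the linked view changes with it.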
1.2 What Makes Graphics Special in the Data Analysis Process?
Here we present three case studies to demonstrate the power of graphics, and
to illustrate the thought processes that led us in this direction of research.
The first case study uses a very simple data set, collected by one waiter over a
period of a few months in a restaurant in the mid-west USA. Figure 1.1 displays
the simplest graphical form: a histogram, with small multiples representing the
data in histograms constructed with different bin widths. The bin widths range
from $1 at top to 10c at bottom. The small multiples illustrate the different
information that is gained from examining the data at different resolutions.
With the smallest bin width you can see that customers tend to round
the tip amount to the nearest fifty cents or dollar. These simple plots also serve
to emphasize that the salient features are not found with one ideal bin width;
one must use multiple bin widths to extract different features of the data.
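The effect of bin width can be tried out with a small text histogram. The sketch below is a minimal illustration in Python; the function name histogram and the tip values are made up for this example and are not the waiter's data.

```python
from collections import Counter
import math

def histogram(values, bin_width):
    """Count values per bin; bin k covers [k*bin_width, (k+1)*bin_width)."""
    # The small epsilon guards against floating-point quotients that fall
    # just below a bin edge they should sit exactly on.
    return Counter(math.floor(v / bin_width + 1e-9) for v in values)

# Made-up tips in dollars: most are rounded to the nearest 50 cents or
# dollar, as the text describes, and two are not.
tips = [1.00, 1.50, 2.00, 2.00, 2.00, 2.50, 3.00, 3.00, 3.48, 4.00, 5.00, 2.23]

for width in (1.00, 0.50, 0.10):
    counts = histogram(tips, width)
    print(f"bin width ${width:.2f}")
    for k in sorted(counts):
        print(f"  {k * width:5.2f} {'*' * counts[k]}")
```

With the $1 bins the pile-up at round amounts is invisible; with the 10c bins the spikes at whole and half dollars stand out from the isolated unrounded tips.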
Figure 1.2 describes drilling down further into the data. On the left, Total
Tip is plotted against Total Bill. The strong lower right triangular shape
suggests that there are more "cheap tippers" than generous ones, with a
couple of outlying exceptions. On the right, the scatterplot of tip vs total bill
is broken out by Sex of the Bill Payer and Smoker (whether it was a smoking
party of diners or not). Inspecting these plots reveals numerous features: (1) for
smoking parties, there is almost no relationship between tip and total bill, (2)
when a female non-smoker paid the bill, the tip was a very consistent percentage
of the total bill, with the exception of three dining parties, (3) larger total bills
were mostly paid by a male.
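A numerical counterpart to breaking the scatterplot out by Sex and Smoker is to compute the tip/bill correlation within each group. The following Python sketch assumes a hypothetical record layout (sex, smoker, bill, tip) and uses made-up values chosen to mimic features (1) and (2); it is not the waiter's data.

```python
from collections import defaultdict
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def correlation_by_group(records):
    """Group (sex, smoker, bill, tip) records and correlate tip with bill."""
    groups = defaultdict(lambda: ([], []))
    for sex, smoker, bill, tip in records:
        bills, tips = groups[(sex, smoker)]
        bills.append(bill)
        tips.append(tip)
    return {key: pearson(b, t) for key, (b, t) in groups.items()}

# Made-up records for illustration only.
records = [
    ("F", "no", 10.0, 1.5), ("F", "no", 20.0, 3.0), ("F", "no", 30.0, 4.5),
    ("M", "yes", 10.0, 3.0), ("M", "yes", 20.0, 1.0), ("M", "yes", 30.0, 3.0),
]
result = correlation_by_group(records)
for key, r in result.items():
    print(key, round(r, 2))
```

A single number per group hides the outlying dining parties that the plots make obvious, which is the point of looking at the conditioned scatterplots as well.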
Further drill down into other variables, such as time of day, day of week
and size of party reveals enough information to almost locate the restaurant.
This is the power of graphics. Even such a simple data set can reveal the
most interesting complexities, a much richer body of complexities than can be
extracted with numerical methods alone.
As an aside, with an interactive system the user would be able to drag a
slider to actively change the bin width in a histogram, or click on a variable
label to condition the plot on a categorical variable. Scaling these methods
to extremely large data sets is straightforward: pre-process the data into
bins and subsets.
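The pre-processing step for large data can be sketched as a single pass that keeps only bin counts. In the Python below, bin_2d and bill_tip_stream are hypothetical names, and the generated values merely stand in for a data source too large to hold in memory.

```python
from collections import Counter

def bin_2d(pairs, x_width, y_width):
    """Reduce a stream of (x, y) pairs to counts on a regular grid.

    Only the grid of counts is kept in memory, so the input can be any
    iterator, however long.
    """
    counts = Counter()
    for x, y in pairs:
        counts[(int(x // x_width), int(y // y_width))] += 1
    return counts

def bill_tip_stream():
    """Stand-in generator for a large data source (made-up values)."""
    for i in range(100_000):
        bill = 5 + (i % 45)              # bills cycle through 5..49
        yield bill, round(0.15 * bill, 2)

grid = bin_2d(bill_tip_stream(), x_width=10, y_width=2)
for cell in sorted(grid):
    print(cell, grid[cell])
```

A plot drawn from the grid (for example with glyph size or shade proportional to count) then costs the same regardless of how many cases were binned.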
Figure 1.1: Histograms of Actual Tips with differing bar width: $1, 50c, 33c, 25c,
20c, 10c. The power of an interactive system allows the bin width to be changed
with a slider.
The second case study illustrates the importance of user interaction in views
of high-dimensional data. The data comes from a study on Italian Olive Oils.
Samples from different regions were analyzed for their fatty acid content. The
eight variables measure the percentage of each of the eight fatty acids found in
the olive oil samples. It is of interest to classify an oil sample based on its fatty
acid composition. The left plot in Figure 1.3 displays the solution of a
Classification and Regression Tree (CART), which is also similar to the linear
discriminant analysis solution: the presence or absence of eicosenoic acid
separates oils from region 1 from those of the other two regions, and oils from
region 3 have low levels of linoleic acid whereas oils from region 2 have high
levels. The right plot illustrates the result of rotating small amounts of oleic
and arachidic acid into the projection. With this small adjustment we gain a much
cleaner separation of the oils from the 3 regions. CART is distracted by the
correlation and linear discriminant analysis is distracted by the heterogeneous
group variances. Neural networks can do a perfect classification job here. With
graphics we can understand how the network does its classifying, which variables
dominate the solution, and, if there are misclassifications, where the regions
of uncertainty lie. It is important to note that neural networks do not perform so
well when further drill down into sub-regions is needed. With graphics, we can
understand the cluster structure, and why this happens.
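Why a small rotation can beat an axis-parallel split is easy to see on artificial data. The Python sketch below is not CART and does not use the olive oil data: it uses a made-up pair of correlated groups and a hypothetical helper, min_errors_one_threshold, that finds the best single cut on one variable.

```python
def min_errors_one_threshold(values, labels):
    """Fewest misclassifications achievable by one cut on `values`.

    Tries a cut between every pair of adjacent distinct values and both
    orientations (class "a" below the cut, or class "a" above it).
    """
    order = sorted(set(values))
    cuts = [(lo + hi) / 2 for lo, hi in zip(order, order[1:])]
    total_a, total_b = labels.count("a"), labels.count("b")
    best = len(values)
    for cut in cuts:
        below_a = sum(1 for v, g in zip(values, labels) if v < cut and g == "a")
        below_b = sum(1 for v, g in zip(values, labels) if v < cut and g == "b")
        above_a, above_b = total_a - below_a, total_b - below_b
        # Call "below" class a (errors: below_b + above_a) or class b.
        best = min(best, below_b + above_a, below_a + above_b)
    return best

# Made-up two-group data with strong correlation between x and y:
# group "a" lies on y = x + 1, group "b" on y = x - 1.
xs = list(range(10)) * 2
ys = [x + 1 for x in range(10)] + [x - 1 for x in range(10)]
labels = ["a"] * 10 + ["b"] * 10

rotated = [y - x for x, y in zip(xs, ys)]   # project onto direction (-1, 1)

print(min_errors_one_threshold(xs, labels))       # cut on x alone
print(min_errors_one_threshold(ys, labels))       # cut on y alone
print(min_errors_one_threshold(rotated, labels))  # cut after a rotation
```

No cut on either original variable separates the groups, while a cut on the rotated combination does so exactly; in a tour, this is the view in which the two groups pull apart.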
Figure 1.2: (Left) Scatterplot of Total Tip vs Total Bill: more points in the
bottom right indicate more cheap tippers than generous tippers. (Right) Total
Tip vs Total Bill by Sex and Smoker, in four panels (Male Smokers, Female
Smokers, Male Non-smokers, Female Non-smokers): there is almost no association
between tip and total bill in the smoking parties, and, with the exception of 3
dining parties, when a female non-smoker paid the bill the tip was extremely
consistent.
Figure 1.3: (Left) Solution obtained by CART, shown as eicosenoic against
linoleic acid with points labelled by region (1, 2, 3), doesn't separate groups
2 and 3 very well. (Right) A small rotation of two other variables - oleic and
arachidic acid - into the projection gives a much neater solution.
The third case study illustrates the strength of our graphical methods in
detecting structure in sparse high-dimensional space. This data is from a
particle physics experiment where there are 7 measured variables which describe
the states of the experiment. Visual inspection using a combination of touring,
guided touring, and brushing revealed that the points lie on a structure
comprising connected low-dimensional pieces: a 2D triangle, with 2 linear pieces
extending from each vertex (Figure 1.4). The data was old by relative standards:
20 years old by the time this discovery was made with visual methods.
These three case studies are described to give the reader an indication of the
types of insights that can be made through interactive and dynamic graphics.
1.3 What is XGobi?
XGobi is a visualization system for viewing high-dimensional data. It is freely
available. The primary visual constructions are points and lines, which form
the basic viewing type, the scatterplot, which can be enhanced with line drawings. A
high level of direct manipulation of plots is supported, with linking of points
and/or lines between multiple views. Motion graphics such as rotation, touring
and projection pursuit are available. It is especially designed for the exploration
of multivariate data and high-dimensional spaces.
XGobi was written using the X Window System, so it is primarily a Unix-based
software package, but it can be run directly under Windows NT or Windows 95/98
using a variety of X software packages. It supports interprocess
communication to share information with other software, so it can be run
directly from ArcView, S/S-Plus, R, and Omegahat.
The graphics and tools that are available in XGobi include:
• Cycling rapidly through two-variable scatter plots.
• Three-dimensional rotation.
• Grand tours and correlation tours: random, guided, and manual control.
• Brushing.
• Hiding (excluding) groups of points.
• Dynamic re-scaling.
• Interactive identification of cases.
• Linked views: Brushing, identification and touring are linked; that is,
  actions in the window of one XGobi process are immediately reflected in
  another XGobi window displaying the same data.
• Line editing.
• Moving points.
• Fitting curves using smoothers.
• Drawing subsamples.
• Jittering of categorical measurements.
• Parallel coordinate display.
• Scatterplot matrix display.
• Case and variable label lists.
• Missing values are accepted and can be dealt with by imputation of constant
  values, random values, or user-supplied imputed values. Missing
  value patterns can be examined in a separate linked XGobi window.
• On-the-fly variable transformations.
• Quality postscript output.
• Online help with the click of the right mouse button.
Figure 1.4: (Left) One projection of the 7D particle physics data (variables X1
to X7), and (right) the corresponding wire frame illustrating the underlying
structure.
The XGobi web site can be found at http://www.research.att.com/areas/stat/xgobi/.
1.4 Getting Started with XGobi
Start up XGobi on the flea beetles data. What you will see is a window like
that in Figure 1.5. The window is laid out in 4 regions: a central plot
region, a control panel specific to the view mode at left, controls for variables
at the right (there are 6 variables in this data), and a series of menus at the
top giving access to the major tools in XGobi.
Figure 1.5: Main XGobi window, displaying a central plot region, a control
panel at left, controls for variables at right, and a series of menus at the top
facilitating the major tools in XGobi.
Take a closer look at the series of menus at the top. These contain selections
which allow switching between the major modes and tools available in XGobi.
Take a look at each one. The File menu contains typical file input/output,
printing and exiting choices. The View menu contains choices of major plot types
(1DPlot, XYPlot, 3D Rotation, Grand Tour, and Correlation Tour), and major
plot interaction methods (Scale, Brush, Identify, Line Editing, Move Points).
The Tools menu contains a heterogeneous selection of tools: Hide or exclude,
Subset, Smooth, Jitter, Parallel coordinates, Scatterplot Matrix, Variable
transformation, Variable and Case lists, and Missing Values. The Display menu
contains selections for modifying the look of the plot: display axes/gridlines,
center axes, plot the points/lines. The Info menu gives basic instructions about
the help facilities.
The colors and glyphs of the points were read in from startup files, and are
set to identify species of beetles in this data.
On start-up, the view mode is XYPlot, so a scatterplot of the first two
variables appears in the plot window. The bars in the variable circles at right
indicate how the variables are displayed in the plot: tars1 is horizontal and tars2
is vertical. Click on an empty circle with the left mouse button to change the
horizontal variable. Click with the middle mouse button (ALT-Left on a
two-button PC mouse) on an empty circle to change the vertical variable. This can
be animated to cycle through all pairs of variables, using the control panel at
left. Click on Cycle to start cycling, and drag the scrollbar to change the speed
of change. Fix X keeps the horizontal variable the same during cycling,
and similarly Fix Y keeps the vertical variable the same.
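The set of views visited by cycling can be enumerated with a short Python sketch. The notes do not say in what order XGobi visits the pairs, so the ordering here is an assumption, as are the function name cycle_pairs and the variable name "head" (only tars1, tars2, aede1, aede2 and aede3 are named in the text).

```python
from itertools import combinations

def cycle_pairs(variables, fix_x=None):
    """List the (horizontal, vertical) variable pairs visited in one cycle.

    With fix_x set, the horizontal variable stays put and only the
    vertical one changes, mimicking the Fix X toggle described above.
    """
    if fix_x is not None:
        return [(fix_x, v) for v in variables if v != fix_x]
    return list(combinations(variables, 2))

# "head" is assumed here to be the name of the sixth variable.
variables = ["tars1", "tars2", "head", "aede1", "aede2", "aede3"]

all_pairs = cycle_pairs(variables)
fixed = cycle_pairs(variables, fix_x="tars1")
print(len(all_pairs))   # number of distinct two-variable scatterplots
print(fixed)
```

With 6 variables there are 15 distinct pairs, which is few enough to cycle through by eye; fixing one axis cuts this to 5.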
Figure 1.6: Main XGobi window in 1DPlot View mode, with jitter control panel.
Change to 1DPlot view mode. Notice that the control panel at left changes.
Each View mode has a unique control panel. The plot window now displays an
average shifted histogram (ASH) of the variable tars1. The scrollbar labelled
"ASH Smoothness" controls the smoothness of the density estimate; dragging it to
the right smooths the plot. Using the left mouse button on the variable circles
allows switching the variable in the plot. If you select the variable aede2,
you'll notice it is discrete: there are just 9 distinct values. Go into the Tools
menu and select Jitter. A control panel (Figure 1.6) pops up in a separate window
which allows you to add small amounts of random noise to the values. In this case
it allows us to peek at what is plotted underneath.
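For readers unfamiliar with the average shifted histogram, the following Python sketch shows the standard construction: average m histograms of bin width h, each shifted by h/m, which is equivalent to a triangular weighting of counts on a fine grid. It is an illustration under that assumption; XGobi's own code and the exact meaning of its smoothness scrollbar may differ, and the function name ash and the data are made up.

```python
import math

def ash(values, h, m, lo, hi):
    """Average shifted histogram density estimate on a fine grid.

    h      : bin width of each histogram being averaged
    m      : number of shifted histograms
    lo, hi : range covered by the fine grid, whose bins have width h / m
    Returns a list of (bin_centre, density) pairs.
    """
    delta = h / m
    nbins = round((hi - lo) / delta)
    fine = [0] * nbins
    for v in values:
        k = math.floor((v - lo) / delta)
        if 0 <= k < nbins:
            fine[k] += 1
    n = len(values)
    estimate = []
    for k in range(nbins):
        total = 0.0
        # Triangular weights 1 - |i|/m over the 2m - 1 neighbouring fine bins.
        for i in range(1 - m, m):
            if 0 <= k + i < nbins:
                total += (1 - abs(i) / m) * fine[k + i]
        estimate.append((lo + (k + 0.5) * delta, total / (n * h)))
    return estimate

data = [2.0, 2.1, 2.2, 3.0, 5.0]          # made-up measurements
est = ash(data, h=1.0, m=4, lo=0.0, hi=8.0)
area = sum(density for _, density in est) * (1.0 / 4)
print(round(area, 6))                      # area under the estimate
```

Increasing h (or m) spreads each observation over more fine bins, which is the kind of smoothing the scrollbar controls; as long as the data sit away from the grid edges, the estimate integrates to one.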
Change to Grand Tour view mode. The points in the plot window now
rotate in the first 3 variables. The lines in the variable circles track the rotation,
indicating how each variable contributes to the projection displayed in the plot.
The control panel is huge! We'll explain its controls in more detail below. In the
Display menu, turn Center axes in 3D+ modes off, to shift the axes out
of the center. Click on the variable circles with either the left or middle mouse
button to toggle variables in and out. (If you reduce the number of variables
to 2, the tour will stop, and you will need to re-start it with the Pause button
when more variables are added.) When more than 3 variables are included, the
grand tour in XGobi displays a continuous random sequence of 2D projections
of the data. This is particularly useful for this data, as the 3 species separate
out very neatly in some projections when all variables are included in the tour,
and the motion of points further indicates 3 different patterns corresponding to
the 3 species.
The control panel contains three different tour methods: the grand
tour, guided tour and manual tour. The guided tour is activated by clicking on
ProjPrst and Optimz. Choose the Holes index from the menu of indices, and
watch the guided tour find a very nice projection of the data identifying the 3
species. A separate plot window tracks the value of the Holes index. From the
axes, you can see the separation is due primarily to 4 variables, tars1, tars2,
aede1, aede3 (Figure 1.7). Also, in terms of principal component axes, the
separation is due to PC1, PC2 and PC6. Turn ProjPrst off.
Now focus on the manual tour. With the tour paused, drag the mouse in the plot
window with the left button held down, to change the coefficient of tars1 in the
projection. The thin inner circle in the variable circle panel indicates which
variable is to be manipulated. It can be changed by holding the Shift key down
and clicking on a variable circle.
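A single 2D projection of the kind shown during a tour can be sketched in Python as follows. This only draws one random projection plane and projects a case onto it; the smooth interpolation between planes that makes the tour continuous, and the projection pursuit optimization used by the guided tour, are omitted. The function names random_frame and project and the data row are assumptions made for this illustration.

```python
import random
from math import sqrt

def random_frame(p, rng):
    """Draw a random orthonormal pair of p-vectors (a 2D projection plane)."""
    a = [rng.gauss(0, 1) for _ in range(p)]
    b = [rng.gauss(0, 1) for _ in range(p)]
    norm_a = sqrt(sum(x * x for x in a))
    a = [x / norm_a for x in a]
    # Gram-Schmidt: remove from b its component along a, then normalise.
    dot = sum(x * y for x, y in zip(a, b))
    b = [y - dot * x for x, y in zip(a, b)]
    norm_b = sqrt(sum(y * y for y in b))
    b = [y / norm_b for y in b]
    return a, b

def project(row, frame):
    """Project one p-dimensional case onto the plane spanned by frame."""
    a, b = frame
    return (sum(x * w for x, w in zip(row, a)),
            sum(x * w for x, w in zip(row, b)))

rng = random.Random(0)                   # fixed seed so the sketch is repeatable
frame = random_frame(6, rng)             # 6 variables, as in the flea beetle data
row = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]     # one made-up case
print(project(row, frame))
```

The entries of the two vectors play the role of the projection coefficients that the manual tour lets you change one variable at a time.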
Change to Brush mode. The brush control panel allows choice of color,
glyph, point or line brushing, and hiding and excluding of points. Use the color
menu to select a new color. The rectangle in the corner is the brush. Moving
the mouse around the screen with the left button depressed moves the brush, and
points that are under the brush are painted the new color. This is transient
brush mode. It is most useful when multiple XGobi windows are visible on the
screen and linked, so that the colors in each plot also change according to the
brush action in one. Click on Persistent to make the brush permanently change
the color of the points. This kind of highlighting is used to mark features in the
data. The middle mouse button allows the size of the brush to be changed. Click
on the Hide or exclude panel (Figure 1.8). Choosing a color/glyph combination
as Hidden hides it from the plot region. Choosing exclude removes it from the
scale calculations, too, so the plot will rescale when groups are excluded.
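The difference between hiding and excluding can be made concrete with a small Python sketch. The functions plot_range and visible and the per-case boolean flags are hypothetical; they only mirror the behaviour described above and are not how XGobi stores its state.

```python
def plot_range(values, excluded):
    """Axis limits computed from the cases that are not excluded."""
    kept = [v for v, ex in zip(values, excluded) if not ex]
    return min(kept), max(kept)

def visible(values, hidden, excluded):
    """Cases that are actually drawn: neither hidden nor excluded."""
    return [v for v, h, ex in zip(values, hidden, excluded) if not (h or ex)]

values = [1.0, 2.0, 3.0, 50.0]          # made-up data with one extreme case
flag = [False, False, False, True]      # marks the extreme case
none = [False] * len(values)

# Hiding: the point is not drawn, but the axis still spans 1..50.
print(visible(values, hidden=flag, excluded=none), plot_range(values, none))
# Excluding: the point is not drawn and the axis rescales to 1..3.
print(visible(values, hidden=none, excluded=flag), plot_range(values, flag))
```

Excluding is therefore the choice to make when an extreme group squashes the remaining points into a corner of the plot.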
Figure 1.7: Main XGobi window in Grand Tour mode, with projection pursuit
guidance window.
Figure 1.8: Main XGobi window in Brush mode, with Hide or exclude control
panel.
Figure 1.9: QQ-plot of aede1 and aede3, obtained by using the variable
transformation controls.
In the Tools menu select Variable transformation. This panel allows you to
interactively transform variables or groups of variables, and make some major
changes such as sorting or permuting values (Figure 1.9). Sorting is neat because
it allows QQ-plots to be generated "on-the-fly".
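The reason sorting yields a QQ-plot is that plotting one sorted variable against another pairs up their empirical quantiles. A minimal Python sketch, with a hypothetical function name qq_pairs and made-up values standing in for two variables such as aede1 and aede3:

```python
def qq_pairs(x, y):
    """Pair the sorted values of two equal-length variables.

    Plotting the pairs gives a quantile-quantile plot of x against y.
    """
    if len(x) != len(y):
        raise ValueError("this simple sketch needs equal-length variables")
    return list(zip(sorted(x), sorted(y)))

x = [5, 3, 9, 1]
y = [40, 10, 30, 20]
print(qq_pairs(x, y))   # → [(1, 10), (3, 20), (5, 30), (9, 40)]
```

Because both variables are measured on the same cases, they have the same length and no interpolation of quantiles is needed.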
Now is a good time to play around with selecting modes and tools, exploring the
functionality of XGobi, or seeing if you can break it.