Principles of Graphics for High-Dimensional Data Department of Statistics Iowa State University

Principles of Graphics for
High-Dimensional Data
Di Cook
Department of Statistics
Iowa State University
dicook@iastate.edu
http://www.public.iastate.edu/~dicook
Copyright 1998 Dianne Cook. These notes may not be circulated without
the prior written consent of the author.
Abstract
We will discuss interactive and dynamic graphics, using the following principles for visualizing high-dimensional data:
(1) focusing: use simple, easy-to-read plots,
(2) linking: provide mechanisms for relating information in one
plot to the information in another,
(3) arranging: provide mechanisms for shifting and reformatting
the layout of multiple plots.
The software package used is XGobi, a public-domain dynamic graphics
package. XGobi provides multivariate graphical tools that embody the above
visualization principles and can be applied to problems involving cluster
analysis, discriminant analysis, principal component analysis, factor
analysis, outlier detection and multivariate analysis of variance.
1 Graphics for Multivariate Data
1.1 Notation
We consider a multivariate data set to have a matrix form as follows:
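$$
X = [X_1 \;\; X_2 \;\; \cdots \;\; X_n]
  = \begin{bmatrix}
      X_{11} & X_{12} & \cdots & X_{1n} \\
      X_{21} & X_{22} & \cdots & X_{2n} \\
      \vdots & \vdots & \ddots & \vdots \\
      X_{p1} & X_{p2} & \cdots & X_{pn}
    \end{bmatrix}_{p \times n}
$$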
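XGobi reads data of the form $X'$ (the transpose of $X$).
A 1-D projection of the data onto a vector $\alpha_{p \times 1}$ takes the form:
$$
\begin{aligned}
\alpha' X &= [\alpha' X_1 \;\; \alpha' X_2 \;\; \cdots \;\; \alpha' X_n] \\
          &= [\alpha_1 X_{11} + \alpha_2 X_{21} + \cdots + \alpha_p X_{p1} \;\; \cdots \;\; \alpha_1 X_{1n} + \alpha_2 X_{2n} + \cdots + \alpha_p X_{pn}]_{1 \times n}
\end{aligned}
$$
where $\|\alpha\| = \sqrt{\alpha_1^2 + \cdots + \alpha_p^2} = 1$. A 2-D projection of
the data can be generated by expanding $\alpha$ to
$A_{p \times 2} = [\boldsymbol{\alpha}_1 \;\; \boldsymbol{\alpha}_2]$, where the columns
(each a unit vector of length p) are orthonormal, so
$\boldsymbol{\alpha}_1' \boldsymbol{\alpha}_2 = 0$. Similarly, this notation can be
expanded to represent d-D projections. In XGobi we work with 1-D and 2-D
projections only. In VRGobi, and the beginnings of JGobi, we work with 3-D,
or arbitrary dimensional projections.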
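The projection notation above can be illustrated with a small numerical sketch. This is not XGobi code. The data are random, and the sizes and variable names are assumptions chosen for illustration. The data matrix is stored with one column per observation, matching the p x n layout of X.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 4, 6                      # hypothetical sizes: p variables, n observations
X = rng.normal(size=(p, n))      # columns are observations, as in the notation above

# 1-D projection: a unit-length vector alpha of length p.
alpha = rng.normal(size=p)
alpha /= np.linalg.norm(alpha)   # enforce ||alpha|| = 1
proj_1d = alpha @ X              # shape (n,): one projected value per observation

# 2-D projection: A (p x 2) with orthonormal columns, obtained here via QR.
A, _ = np.linalg.qr(rng.normal(size=(p, 2)))
proj_2d = A.T @ X                # shape (2, n): two coordinates per observation
```

Using a QR factorization is just one convenient way to obtain orthonormal columns. Any orthonormal pair of p-vectors defines a valid 2-D projection.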
1.2 Plotting Methods
Beyond using simple univariate plots, such as dotplots, histograms and
boxplots, to graph multivariate data variable by variable, a wealth of
methods has been devised to draw multiple variables in a single graphic. Most
notable are matrices of pairwise plots, Andrews' curves and parallel
coordinate plots, and icon plots like Chernoff faces and star charts.
Andrews' curves and parallel coordinate plots use a line to represent each
observation, where the line tracks the observation's values on each variable
or combination of variables. Icon plots represent each observation by a single
icon, and the observation's value on each variable is represented by a
different feature in the icon. For example, with Chernoff faces each
observation is represented as a face, and the facial features correspond to
different variables: the length of the nose might represent the numerical
value of the observation on variable 1. These methods truly attempt to provide
graphics for revealing multivariate structure, most notably clustering. In
contrast, matrices of pairwise scatterplots simply extend the univariate
graphics by an additional dimension so that pairwise relationships between
variables can be examined.
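As an illustration of how a line can track "a combination of variables", here is a minimal sketch of the usual Andrews' curve construction, in which each observation x = (x_1, ..., x_p) is mapped to the function x_1/sqrt(2) + x_2 sin(t) + x_3 cos(t) + x_4 sin(2t) + ... on the interval from -pi to pi. The function name and the example observation are hypothetical.

```python
import numpy as np

def andrews_curve(x, t):
    """Evaluate Andrews' curve for one observation x at the points t."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(t, dtype=float)
    curve = np.full_like(t, x[0] / np.sqrt(2.0))
    for j in range(1, len(x)):
        k = (j + 1) // 2                        # frequency: 1, 1, 2, 2, 3, ...
        basis = np.sin if j % 2 == 1 else np.cos
        curve += x[j] * basis(k * t)
    return curve

t = np.linspace(-np.pi, np.pi, 201)
obs = [1.0, 0.5, -0.3, 2.0]                     # hypothetical 4-variable observation
y = andrews_curve(obs, t)                       # one curve; plot y against t
```

Drawing one such curve per observation on common axes gives the Andrews' plot. Observations that are close in the multivariate space produce curves that stay close together.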
Complementary to these unique plotting methods, numerical techniques
are often used to reduce the dimensionality of a data set in a logical fashion
so that a low-dimensional representation comprehensively summarizes the
multivariate data. For example, multidimensional scaling is often used to find
a lower-dimensional representation that preserves the interpoint distances of
the data. A cluster tree can be used to summarize the natural grouping
of points in the multivariate space. Principal component analysis reduces
dimensionality by determining the low-dimensional subspace that explains
the most variation in the data.
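A minimal sketch of the principal component reduction described above, using a singular value decomposition of the centered data. The data here are simulated and stored observations-by-variables (the X' layout). All names and sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 5
# Hypothetical data: n observations on p correlated variables.
data = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))

centered = data - data.mean(axis=0)
# Rows of Vt are the principal directions, ordered by decreasing variance.
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
explained = s**2 / np.sum(s**2)        # proportion of variance per component
scores_2d = centered @ Vt[:2].T        # n x 2 low-dimensional representation
```

The first two columns of the scores give the 2-D subspace that explains the most variation, and `explained` indicates how much of the total variation that subspace captures.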
1.3 Interaction and motion
Take any of these plotting methods, add interaction and motion linking
multiple graphics, and we make huge gains in what we can understand and
absorb about the variable relationships. The key to multivariate visualization
is three modes of interacting:
Focusing: placing attention on a particular aspect of the data by selecting
subsets (panning and zooming, or slicing) or by dimension reduction
(projection or variable selection), in particular using simple, easy-to-read
plots.
Linking: connecting multiple focused views, either in parallel
(simultaneously) by brushing or as a sequence over time by using animation
and motion.
Arranging: shifting and regrouping multiple pictures to provide a more
informative layout of the information.
These principles are discussed in detail in Buja, Cook & Swayne (1996).
In XGobi we concentrate on particularly simple graphics: almost all the plots
are based on scatterclouds, but we add numerous ways to interact with the
plots. This tutorial is an introduction to using the interaction tools.
Note that some similar tools are available in other software packages such
as XLispStat (Tierney 1991) (http://stat.umn.edu/luke/xls/xlsinfo/xlsinfo.html),
Data Desk (http://www.datadesk.com/datadesk/),
XmdvTool (http://cs.wpi.edu/matt/research/XmdvTool/index.html),
Data Explorer (http://www.almaden.ibm.com/dx/) and SAS JMP (http://www.sas.com/).
Some of these packages are commercial and some are freeware.
1.4 Linked Brushing (and Identification)
Linked brushing is the dynamic changing of symbols (glyphs) or colors in one
plot, which simultaneously changes the corresponding points in other plots.
The classic example is brushing in a matrix of pairwise scatterplots. Linked
identification is where brushed points are identified, for example, by labels
rather than by color.
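The mechanism behind linked brushing can be sketched as shared case-level state. A brush in one plot produces a selection over the cases (rows), and every linked plot draws its points from that same selection. The sketch below is an illustration under that assumption, not a description of XGobi's internals. The data, function name and color names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8
data = rng.uniform(0.0, 1.0, size=(n, 3))   # hypothetical data: n cases, 3 variables

def brush(data, xvar, yvar, xlim, ylim):
    """Return a boolean mask of the cases inside a rectangular brush."""
    x, y = data[:, xvar], data[:, yvar]
    return (x >= xlim[0]) & (x <= xlim[1]) & (y >= ylim[0]) & (y <= ylim[1])

# Brush in the (variable 0, variable 1) scatterplot ...
selected = brush(data, 0, 1, xlim=(0.0, 0.5), ylim=(0.0, 1.0))

# ... and the shared per-case state updates every linked plot at once.
colors = np.where(selected, "red", "grey")                 # one color per case
labels = [f"case {i}" for i in np.flatnonzero(selected)]   # linked identification
```

Because the mask is indexed by case rather than by plot, a scatterplot of variables 1 and 2 drawn with `colors` highlights exactly the cases brushed in the plot of variables 0 and 1.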
1.5 Rotations and Tours
While linked brushing provides information on conditional distributions, tours
provide information on joint distributions. They are particularly useful for
detecting clusters, outliers, distributional shape, including covariance, and
some non-linear structure.
Grand Tour
Definition 1 A grand tour is a continuous 1-parameter family of d-dimensional
projections of p-dimensional data which is dense in the set of all d-dimensional
projections in $\mathbb{R}^p$. The parameter is usually thought of as time.
This means that each projection shown can be indexed by a time parameter.
As time is allowed to wander off to infinity, the grand tour will show, or
come arbitrarily close to, all possible d-dimensional projections of the data,
which is the meaning of "dense in the set of all projections".
A grand tour offers a multitude of aspects simultaneously in relationship
to one another. If the data is intrinsically 0-, 1-, or 2-dimensional (that is,
clusters, curves or surfaces) the human eye can pick up the "gestalt" almost
instantly. (We are adept at detecting and recognizing moving objects.)
Three-dimensional rotation can be considered a special case of the tour,
where the dimension of the data is p = 3.
The grand tour provides the overview.
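A rough sketch of how a sequence of tour projections might be generated: pick random target frames and move smoothly from one to the next. This is a simplification for illustration only. It blends frames linearly and re-orthonormalizes, which is an assumption of this sketch and not necessarily the interpolation scheme used in XGobi. All function names and parameters are hypothetical.

```python
import numpy as np

def orthonormalize(m):
    """Orthonormalize columns via QR, fixing signs so the result varies smoothly."""
    q, r = np.linalg.qr(m)
    return q * np.sign(np.diag(r))

def random_frame(p, d, rng):
    """Random p x d matrix with orthonormal columns (a projection frame)."""
    return orthonormalize(rng.normal(size=(p, d)))

def tour_path(p, d=2, n_targets=3, steps=10, seed=0):
    """Yield orthonormal frames moving between successive random target frames."""
    rng = np.random.default_rng(seed)
    current = random_frame(p, d, rng)
    for _ in range(n_targets):
        target = random_frame(p, d, rng)
        for s in range(1, steps + 1):
            w = s / steps
            # Linear blend, then re-orthonormalize so each frame is a valid projection.
            yield orthonormalize((1.0 - w) * current + w * target)
        current = target

frames = list(tour_path(p=5))
```

With data stored observations-by-variables in an n x 5 array `data`, each view of the animation would be the scatterplot of `data @ frame` for successive frames.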
Figure 1: Intuitive picture of the approach to generating dynamic projections of data.
Guided Tour
To find more specific types of structure, intelligent search engines can be
connected to the tour, which can automatically provide more informative
views than the random ones provided by the grand tour.
The guided tour leads the user to rare views.
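One way to picture such a search engine is as an optimizer that scores each projection with a numerical index and steers toward higher-scoring views. The sketch below uses a simple random-search hill climb over 1-D projections. The index (absolute excess kurtosis of the projected data) and all names are assumptions chosen for illustration, not the indexes or optimizer used in XGobi.

```python
import numpy as np

def kurtosis_index(alpha, data):
    """Hypothetical index: absolute excess kurtosis of the 1-D projection."""
    z = data @ alpha
    z = (z - z.mean()) / z.std()
    return abs(np.mean(z**4) - 3.0)

def guided_search(data, n_iter=200, step=0.3, seed=0):
    """Random-search hill climbing over unit-length projection vectors."""
    rng = np.random.default_rng(seed)
    p = data.shape[1]
    alpha = rng.normal(size=p)
    alpha /= np.linalg.norm(alpha)
    best = kurtosis_index(alpha, data)
    history = [best]
    for _ in range(n_iter):
        candidate = alpha + step * rng.normal(size=p)
        candidate /= np.linalg.norm(candidate)
        value = kurtosis_index(candidate, data)
        if value > best:                 # accept only views that improve the index
            alpha, best = candidate, value
        history.append(best)
    return alpha, history

rng = np.random.default_rng(3)
data = rng.normal(size=(300, 4))         # hypothetical data, observations by variables
data[:150, 0] += 6.0                     # two clusters separated along variable 0
alpha, history = guided_search(data)
```

The index value never decreases along `history`, so the sequence of accepted projections moves toward views that depart from normality, such as the one separating the two simulated clusters.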
Manual Tour
Prior knowledge can be incorporated with manually controlled tours. The
user can increase or decrease the contribution of a particular variable to a
view to examine how that variable contributes to any structure. In
addition, manual tools allow us to assess the sensitivity of the structure to a
particular variable, or to sharpen or refine structure exposed with the grand
or guided tour.
The manual tour refines the views.
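For a 1-D projection, changing one variable's contribution can be sketched as setting that component of the projection vector and rescaling the remaining components so the vector keeps unit length. This is a simplified illustration under that assumption, not the manual control algorithm of Cook & Buja (1997). The function name is hypothetical.

```python
import numpy as np

def set_contribution(alpha, j, new_value):
    """Set component j of a unit vector, rescaling the others to keep norm 1."""
    if not -1.0 <= new_value <= 1.0:
        raise ValueError("new_value must lie in [-1, 1]")
    alpha = np.asarray(alpha, dtype=float).copy()
    others = np.delete(np.arange(alpha.size), j)
    rest_norm = np.linalg.norm(alpha[others])
    if rest_norm == 0.0:
        raise ValueError("cannot rescale: all other components are zero")
    # Shrink or stretch the other components to absorb the change in component j.
    alpha[others] *= np.sqrt(1.0 - new_value**2) / rest_norm
    alpha[j] = new_value
    return alpha

alpha = np.array([0.5, 0.5, 0.5, 0.5])         # unit-length starting projection
alpha_more = set_contribution(alpha, 0, 0.9)   # increase variable 0's contribution
alpha_none = set_contribution(alpha, 0, 0.0)   # remove variable 0 from the view
```

Comparing the projected data under `alpha_more` and `alpha_none` shows how sensitive any visible structure is to variable 0.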
References
Asimov, D. (1985), 'The Grand Tour: A Tool for Viewing Multidimensional
Data', SIAM Journal on Scientific and Statistical Computing 6(1), 128-143.
Buja, A. & Asimov, D. (1986), Grand Tour Methods: An Outline, in D. M.
Allen, ed., 'Proceedings of the 17th Symposium on the Interface between
Computing Science and Statistics', Elsevier, Lexington, KY, pp. 63-67.
Buja, A., Cook, D., Asimov, D. & Hurley, C. (1997), Dynamic Projections in
High-Dimensional Visualization: Theory and Computational Methods,
Technical report, AT&T Labs, Florham Park, NJ.
Buja, A., Cook, D. & Swayne, D. (1996), 'Interactive High-Dimensional Data
Visualization', Journal of Computational and Graphical Statistics 5(1), 78-99.
See also www.research.att.com/andreas/xgobi/heidel/.
Cleveland, W. S. & McGill, M. E., eds (1988), Dynamic Graphics for Statistics,
Wadsworth, Monterey, CA.
Cook, D. & Buja, A. (1997), 'Manual Controls For High-Dimensional Data
Projections', Journal of Computational and Graphical Statistics 6(4), 464-480.
Also see www.public.iastate.edu/dicook/research/papers/manip.html.
Cook, D., Buja, A. & Cabrera, J. (1993), 'Projection Pursuit Indexes Based
on Orthonormal Function Expansions', Journal of Computational and
Graphical Statistics 2(3), 225-250.
Cook, D., Buja, A., Cabrera, J. & Hurley, C. (1995), 'Grand Tour and
Projection Pursuit', Journal of Computational and Graphical Statistics
4(3), 155-172.
Fisherkeller, M., Friedman, J. H. & Tukey, J. (1974), PRIM-9: An Interactive
Multidimensional Data Display and Analysis System, Technical Report
SLAC-PUB-1408, Stanford Linear Accelerator Center, Stanford, CA.
Hurley, C. & Buja, A. (1990), 'Analyzing High-Dimensional Data with Motion
Graphics', SIAM Journal on Scientific and Statistical Computing
11(6), 1193-1211.
Inselberg, A. (1985), 'The Plane with Parallel Coordinates', The Visual
Computer 1, 69-91.
McDonald, J. A. (1982), Interactive Graphics for Data Analysis, Technical
Report Orion II, Statistics Department, Stanford University.
Newton, C. (1978), Graphica: From Alpha to Omega in Data Analysis, in
P. C. C. Wang, ed., 'Graphical Representation of Multivariate Data',
Academic Press, New York, NY, pp. 59-92.
Stuetzle, W. (1987), 'Plot windows', Journal of the American Statistical
Association 82, 466-475.
Swayne, D. & Buja, A. (1998), 'Missing Data in Interactive High-Dimensional
Data Visualization', Computational Statistics.
Tierney, L. (1991), LispStat: An Object-Oriented Environment for Statistical
Computing and Dynamic Graphics, Wiley, New York, NY.
Wegman, E. (1990), 'Hyperdimensional Data Analysis Using Parallel
Coordinates', Journal of the American Statistical Association 85, 664-675.