Principles of Graphics for High-Dimensional Data

Di Cook
Department of Statistics
Iowa State University
dicook@iastate.edu
http://www.public.iastate.edu/~dicook

Copyright 1998 Dianne Cook. These notes may not be circulated without the prior written consent of the author.

Abstract

We will discuss interactive and dynamic graphics, using the following principles for visualizing high-dimensional data: (1) focusing: use simple, easy-to-read plots, (2) linking: provide mechanisms for relating information in one plot to the information in another, (3) arranging: provide mechanisms for shifting and reformatting the layout of multiple plots. The software package used is XGobi, a public-domain dynamic graphics package. XGobi provides multivariate graphical tools, encompassing the above visualization principles, that can be applied to problems involving cluster analysis, discriminant analysis, principal component analysis, factor analysis, outlier detection and multivariate analysis of variance.

1 Graphics for Multivariate Data

1.1 Notation

We consider a multivariate data set to have a matrix form as follows:

\[
X = [\, X_1 \;\; X_2 \;\; \cdots \;\; X_n \,]
= \begin{bmatrix}
X_{11} & X_{12} & \cdots & X_{1n} \\
X_{21} & X_{22} & \cdots & X_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
X_{p1} & X_{p2} & \cdots & X_{pn}
\end{bmatrix}_{p \times n}
\]

Each column $X_i$ holds the $p$ variable values of one observation. XGobi reads data of the form $X'$, that is, with one row per observation.

A 1-D projection of the data onto a vector $\boldsymbol{\alpha}_{p \times 1}$ takes the form:

\[
\boldsymbol{\alpha}'X = [\, \boldsymbol{\alpha}'X_1 \;\; \boldsymbol{\alpha}'X_2 \;\; \cdots \;\; \boldsymbol{\alpha}'X_n \,]
= [\, \alpha_1 X_{11} + \alpha_2 X_{21} + \cdots + \alpha_p X_{p1} \;\;\; \cdots \;\;\; \alpha_1 X_{1n} + \alpha_2 X_{2n} + \cdots + \alpha_p X_{pn} \,]_{1 \times n}
\]

where $\|\boldsymbol{\alpha}\| = \sqrt{\alpha_1^2 + \cdots + \alpha_p^2} = 1$.

A 2-D projection of the data can be generated by expanding $\boldsymbol{\alpha}$ to $A_{p \times 2} = [\, \boldsymbol{\alpha}_1 \;\; \boldsymbol{\alpha}_2 \,]$, where the columns are orthonormal, $\boldsymbol{\alpha}_1'\boldsymbol{\alpha}_2 = 0$. Similarly, this notation can be expanded to represent d-D projections. In XGobi we work with 1-D and 2-D projections only. In VRGobi, and in the early development of JGobi, we work with 3-D or arbitrary-dimensional projections.
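The projections defined in the notation above can be sketched numerically. This is a minimal illustration in Python with NumPy, not XGobi code. The names `alpha`, `project_1d`, `orthonormal_frame` and `project_2d` are hypothetical. The notes do not say how the orthonormal columns are obtained, so Gram-Schmidt is used here purely as an assumption.

```python
import numpy as np

def project_1d(X, alpha):
    """Project the n columns of a p x n data matrix X onto the direction alpha."""
    alpha = np.asarray(alpha, dtype=float)
    alpha = alpha / np.linalg.norm(alpha)   # enforce the unit-length constraint
    return alpha @ X                        # 1 x n projection, returned with shape (n,)

def orthonormal_frame(a1, a2):
    """Build a p x 2 matrix with orthonormal columns from two directions.

    Assumption: Gram-Schmidt is one convenient way to do this; the notes
    only require that the columns be orthonormal.
    """
    a1 = np.asarray(a1, dtype=float)
    a2 = np.asarray(a2, dtype=float)
    a1 = a1 / np.linalg.norm(a1)
    a2 = a2 - (a1 @ a2) * a1                # remove the component along a1
    norm2 = np.linalg.norm(a2)
    if norm2 < 1e-12:
        raise ValueError("the two directions must not be parallel")
    return np.column_stack([a1, a2 / norm2])

def project_2d(X, A):
    """Project the columns of X onto the plane spanned by the columns of A."""
    return A.T @ X                          # 2 x n

# Synthetic data: p = 4 variables (rows), n = 10 observations (columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 10))

y = project_1d(X, [1, 1, 0, 0])
A = orthonormal_frame([1, 1, 0, 0], [1, 0, 1, 0])
Y = project_2d(X, A)
```

Here `y` holds one projected value per observation and `Y` holds the two plotting coordinates of each observation, which is what a scatterplot of a 2-D projection would display.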
1.2 Plotting Methods

Beyond using simple univariate plots, such as dotplots, histograms and boxplots, to graph multivariate data variable by variable, many methods have been devised to draw multiple variables in a single graphic. The most notable are matrices of pairwise plots, Andrews' curves and parallel coordinate plots, and icon plots like Chernoff faces and star charts.

Andrews' curves and parallel coordinate plots use a line to represent each observation, where the line tracks the observation's values on each variable or combination of variables. Icon plots represent each observation by a single icon, and the observation's value on each variable is represented by a different feature of the icon. For example, with Chernoff faces each observation is represented as a face, and the facial features correspond to different variables; for instance, the length of the nose may represent the numerical value of the observation on variable 1. These methods truly attempt to provide graphics for revealing multivariate structure, most notably clustering. In contrast, matrices of pairwise scatterplots simply extend the univariate graphics by an additional dimension so that pairwise relationships between variables can be examined.

Complementary to these plotting methods, numerical techniques are often used to reduce the dimensionality of a data set in a logical fashion, so that a low-dimensional representation comprehensively summarizes the multivariate data. For example, multidimensional scaling is often used to find a lower-dimensional representation that preserves the interpoint distances of the data. A cluster tree can be used to summarize the natural grouping of points in the multivariate space. Principal component analysis reduces dimensionality by determining the low-dimensional subspace that explains the most variation in the data.
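As an illustration of the last technique, principal component analysis can be sketched with a singular value decomposition. This is a minimal example under stated assumptions, not part of XGobi: the function name `pca_project` and the synthetic data are hypothetical, and the data matrix is taken to be p x n (variables in rows) as in Section 1.1.

```python
import numpy as np

def pca_project(X, d=2):
    """Project a p x n data matrix onto its first d principal components.

    Returns the d x n projected data, the p x d orthonormal basis, and the
    fraction of total variation explained by each retained component.
    """
    Xc = X - X.mean(axis=1, keepdims=True)        # center each variable (row)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    A = U[:, :d]                                  # directions of largest variation
    explained = s**2 / np.sum(s**2)               # variance share per component
    return A.T @ Xc, A, explained[:d]

# Synthetic example: three variables, two of them driven by one latent factor t.
rng = np.random.default_rng(1)
n = 200
t = rng.normal(size=n)
X = np.vstack([
    3.0 * t,
    t + 0.1 * rng.normal(size=n),
    0.1 * rng.normal(size=n),
])

Y, A, frac = pca_project(X, d=2)
```

In this constructed data set almost all of the variation lies along a single direction, so the first entry of `frac` should be close to 1. With real data the explained fractions indicate how much structure a low-dimensional view can be expected to retain.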
1.3 Interaction and Motion

Take any of these plotting methods, add interaction and motion linking multiple graphics, and we make huge gains in what we can understand and absorb about the relationships between variables. The key to multivariate visualization is three modes of interaction:

Focusing: placing attention on a particular aspect of the data by selecting subsets (panning and zooming, or slicing) or by dimension reduction (projection or variable selection), in particular using simple, easy-to-read plots.

Linking: connecting multiple focused views, either in parallel (simultaneously) by brushing, or as a sequence over time by using animation and motion.

Arranging: shifting and regrouping multiple pictures to provide a more informative layout of the information.

These principles are discussed in detail in Buja, Cook & Swayne (1996). In XGobi we concentrate on particularly simple graphics: almost all the plots are based on scatterclouds, but we add numerous ways to interact with the plots. This tutorial is an introduction to using the interaction tools.

Note that some similar tools are available in other software packages such as XLispStat (Tierney 1991) (http://stat.umn.edu/~luke/xls/xlsinfo/xlsinfo.html), Data Desk (http://www.datadesk.com/datadesk/), XmdvTool (http://cs.wpi.edu/~matt/research/XmdvTool/index.html), Data Explorer (http://www.almaden.ibm.com/dx/) and SAS JMP (http://www.sas.com/). Some of these packages are commercial and some are freeware.

1.4 Linked Brushing (and Identification)

Linked brushing is the dynamic changing of symbols (glyphs) or colors of points in one plot, which simultaneously changes the corresponding points in other plots. The classic example is brushing in a matrix of pairwise scatterplots. In linked identification, the brushed points are identified by labels, for example, rather than by color.

1.5 Rotations and Tours

While linked brushing provides information on conditional distributions, tours provide information on joint distributions.
They are particularly useful for detecting clusters, outliers, distributional shape (including covariance), and some non-linear structure.

Grand Tour

Definition 1 A grand tour is a continuous 1-parameter family of d-dimensional projections of p-dimensional data which is dense in the set of all d-dimensional projections in $\mathbb{R}^p$. The parameter is usually thought of as time.

This means that each projection shown can be indexed by a time parameter. As time is allowed to wander off to $\infty$, the grand tour passes arbitrarily close to every possible d-dimensional projection of the data, which is the meaning of "dense in the set of all projections".

A grand tour offers a multitude of aspects simultaneously, in relationship to one another. If the data is intrinsically 0-, 1-, or 2-dimensional (that is, clusters, curves or surfaces) the human eye can pick up the "gestalt" almost instantly. (We are adept at detecting and recognizing moving objects.) Three-dimensional rotation can be considered a special case of the tour, where the dimension of the data is p = 3. The grand tour provides the overview.

Figure 1: Intuitive picture of the approach to generating dynamic projections of data.

Guided Tour

To find more specific types of structure, intelligent search engines can be connected to the tour, which can then automatically provide more informative views than the random ones provided by the grand tour. The guided tour leads the user to rare views.

Manual Tour

Prior knowledge can be incorporated with manually controlled tours. The user can increase or decrease the contribution of a particular variable to a view, to examine how that variable contributes to any structure. In addition, manual tools allow us to assess the sensitivity of the structure to a particular variable, or to sharpen or refine structure exposed with the grand or guided tour. The manual tour refines the views.

References

Asimov, D.
(1985), `The Grand Tour: A Tool for Viewing Multidimensional Data', SIAM Journal on Scientific and Statistical Computing 6(1), 128-143.

Buja, A. & Asimov, D. (1986), Grand Tour Methods: An Outline, in D. M. Allen, ed., `Proceedings of the 17th Symposium on the Interface between Computing Science and Statistics', Elsevier, Lexington, KY, pp. 63-67.

Buja, A., Cook, D., Asimov, D. & Hurley, C. (1997), Dynamic Projections in High-Dimensional Visualization: Theory and Computational Methods, Technical report, AT&T Labs, Florham Park, NJ.

Buja, A., Cook, D. & Swayne, D. (1996), `Interactive High-Dimensional Data Visualization', Journal of Computational and Graphical Statistics 5(1), 78-99. See also www.research.att.com/~andreas/xgobi/heidel/.

Cleveland, W. S. & McGill, M. E., eds (1988), Dynamic Graphics for Statistics, Wadsworth, Monterey, CA.

Cook, D. & Buja, A. (1997), `Manual Controls For High-Dimensional Data Projections', Journal of Computational and Graphical Statistics 6(4), 464-480. Also see www.public.iastate.edu/~dicook/research/papers/manip.html.

Cook, D., Buja, A. & Cabrera, J. (1993), `Projection Pursuit Indexes Based on Orthonormal Function Expansions', Journal of Computational and Graphical Statistics 2(3), 225-250.

Cook, D., Buja, A., Cabrera, J. & Hurley, C. (1995), `Grand Tour and Projection Pursuit', Journal of Computational and Graphical Statistics 4(3), 155-172.

Fisherkeller, M., Friedman, J. H. & Tukey, J. (1974), PRIM-9: An Interactive Multidimensional Data Display and Analysis System, Technical Report SLAC-PUB-1408, Stanford Linear Accelerator Center, Stanford, CA.

Hurley, C. & Buja, A. (1990), `Analyzing High-Dimensional Data with Motion Graphics', SIAM Journal on Scientific and Statistical Computing 11(6), 1193-1211.

Inselberg, A. (1985), `The Plane with Parallel Coordinates', The Visual Computer 1, 69-91.

McDonald, J. A. (1982), Interactive Graphics for Data Analysis, Technical Report Orion II, Statistics Department, Stanford University.
Newton, C. (1978), Graphica: From Alpha to Omega in Data Analysis, in P. C. C. Wang, ed., `Graphical Representation of Multivariate Data', Academic Press, New York, NY, pp. 59-92.

Stuetzle, W. (1987), `Plot windows', Journal of the American Statistical Association 82, 466-475.

Swayne, D. & Buja, A. (1998), `Missing Data in Interactive High-Dimensional Data Visualization', Computational Statistics.

Tierney, L. (1991), LispStat: An Object-Oriented Environment for Statistical Computing and Dynamic Graphics, Wiley, New York, NY.

Wegman, E. (1990), `Hyperdimensional Data Analysis Using Parallel Coordinates', Journal of the American Statistical Association 85, 664-675.