Interactive and Dynamic Graphics for Data Analysis using XGobi

Dianne Cook, Deborah F. Swayne, Andreas Buja

Copyright 1999 D. Cook, D. F. Swayne, A. Buja

DRAFT

Objectives

By the end of this course, I would hope that you have:

- gained some understanding of the power of visual tools used in the process of data analysis.
- learned the basics of XGobi sufficiently to use it with your own data problems, and where to find more information as needed.
- become inspired to develop new visual methods to address your data analyses.

Contents

1 Introduction
    1.1 Data Visualization
    1.2 Statistical Graphics
    1.3 Representing High-Dimensional Data
    1.4 Restructuring Data
    1.5 The Role of Visualization in Data Analysis
    1.6 Introduction to the Case Studies
    1.7 Getting Started with XGobi
2 Missing Values
    2.1 Background
    2.2 Case Study: Tropical Atmosphere Ocean (TAO) array
        2.2.1 Data description
        2.2.2 Exploring Missingness
        2.2.3 Exercises
3 Clustering and Classification
    3.1 Background
    3.2 Supervised Classification
        3.2.1 Case Study: Australian Leptograpsus Crabs
        3.2.2 Exercises
        3.2.3 Case Study: Italian Olive Oils
        3.2.4 Exercises
    3.3 Unsupervised Classification
        3.3.1 Case Study: Italian Olive Oils
4 Time
    4.1 Case Study: River
    4.2 Case Study: UK Pig Production
    4.3 Case Study: TAO Buoys
        4.3.1 Data description
        4.3.2 Querying Drift
        4.3.3 Cycling over time
        4.3.4 Correlation Tour
    4.4 Case Study: Telephone Usage
5 Space
    5.1 Background
    5.2 Case Study: Comprehensive Ocean-Atmosphere Data
        5.2.1 Data Description
        5.2.2 Comparing Traditional with Interactive
        5.2.3 Exploring Other Variables
        5.2.4 Exercises
    5.3 Case Study: Baker Field Data
        5.3.1 Data description
        5.3.2 Scatterplots
        5.3.3 Exercises
        5.3.4 Relating Soil Variables to Each Other
        5.3.5 A Grand Tour of the Soil Variables
        5.3.6 Linking Yield to Spatial Location
        5.3.7 Exploring Spatial Dependence
        5.3.8 Exercises
6 Multivariate Longitudinal Data
    6.1 Introduction
    6.2 Data Structuring for Interactive Analysis
    6.3 Case Study I: Liver toxicity
        6.3.1 About the data
        6.3.2 Data Restructuring for XGobi
        6.3.3 Exploring Liver Toxicity Measures
        6.3.4 Drilling down on interesting cases
        6.3.5 Inspecting drug relationships
        6.3.6 Time-constrained touring
        6.3.7 Generating a set of points uniformly distributed on a p-dimensional confidence ellipse
        6.3.8 Data files for XGobi
7 Compositional Data
    7.1 Case Study: Insect Populations
        7.1.1 Data Description
        7.1.2 Comparing the means
8 Categorical Variables and ANOVA
9 Multivariate Function Visualization
10 Inference for Data Visualization
    10.1 Really There?
    10.2 Significance Levels
    10.3 Case Study: Baker Field data
        10.3.1 Examining Yield and Boron
        10.3.2 Exercises
        10.3.3 Discussion about the Structure in Boron
11 Large Data
12 Appendix
    12.1 Gallery Index of Main XGobi Controls
    12.2 Input File Formats
    12.3 Limitations of Windows 95 version
    12.4 Man page
    12.5 Solutions to Selected Exercises
    12.6 S function for xgobi
    12.7 S function for linked dendrogram plots
    12.8 Generating a set of points uniformly distributed on a p-dimensional confidence ellipse
    12.9 S code used to restructure data into files for XGobi
    12.10 Answers to challenges
13 Web Resources

Preface

Background

Computer-aided visualization is a young discipline, having arisen and grown in the last fifteen years. Its goal is to create computerized visual tools for analyzing and communicating information with pictures. The computer age is drowning us in information, and it becomes clearer every day that all this information is worthless if it can't be digested by humans. If computers brought on the challenge, they have also become new partners to an old solution: presenting information to the human eye. The eye can absorb immense amounts of information if provided with "informative" pictures. While this is old hat in principle, computers since the 1980s have brought powerful facilities to bear for picturing information. This book is about visualizing a particular type of information, namely, data.

In this book, the term "data" is used for information that exists in some schematic form such as a table or a list. Data is often but not always quantitative, and some translation of unstructured information is often required to derive the data. It always includes some attributes or variables such as the number of hits on web sites, frequencies of words in text samples, weight in pounds, mileage in miles per gallon, income per household in dollars, years of education, acidity on the pH scale, sulfur emissions in tons per year, or scores on standardized tests.
Characteristic of data visualization is the concern with abstract relationships among such variables: for example, the degree to which income increases with education, or the question of whether certain astronomical measurements indicate grouping and therefore hint at new classes of celestial objects. In contrast, other areas of visualization are mostly concerned with the display of objects and phenomena in physical 3-D space. Examples are volume visualization (e.g., for the display of human organs in medicine), surface visualization (e.g., for manufacturing cars or animated movies), flow visualization (e.g., for aeronautics or meteorology), and cartography. In these areas one often strives for physical realism or the display of great detail in space, as in the visual display of a new car design, or of a developing hurricane in a meteorological simulation.

Data visualization as we use the term requires the emphasis to be on variables and their relationships in the abstract. Variables are typically thought of as detached from physical location, even if physical location is part of the data. The general task of data visualization is to make pictures that reflect relationships among variables. To this end, one maps the variables to axes in a plot and the variable values to locations on the axes. In effect, one uses space to code non-spatial information. The goal of data visualization is then not realism, which is meaningless, but rendering of data in space for the purpose of visual consumption. Paraphrasing Galileo Galilei, who said "Measure what is measurable, and make measurable what is not so," we may say "render in space what is spatial, make spatial what is not so."

This program of data visualization, the spatial representation of data, immediately brings up a limitation: plotting surfaces such as paper or computer screens are merely 2-dimensional, and physical space is just 3-dimensional.
The eye can be tricked into seeing 3-dimensional virtual space with perspective and motion, which for visualization is as good as, or even better than, physical space. Only a few people have claimed to "see" 4-dimensional space (actually we are among them). In data visualization each variable requires a spatial axis or dimension, which poses the question of how we are to picture more than two or three variables at a time. The limitation to a 3-dimensional display space is fine if the objects are 3-dimensional, as in most other visualization areas, but in data visualization the number of axes required to code variables can be large: five to ten are common, and 50 and even hundreds arise in very real contexts also. This then is the challenge of data visualization: to overcome the 2-D and 3-D barriers.

To meet this challenge, computers allow us to implement some powerful visualization tools. They mimic and amplify a paradigm familiar from photography: take pictures from multiple directions so the shape of an object can be understood in its entirety. This is called the "multiple views" paradigm, where the term "views" is just a synonym for "pictures". In our 3-D world the paradigm works superbly: the human eye is very adept at inferring the true shape of an object from a few directional views. Unfortunately, the same is often not true for views of abstract data! The chasm between different views of data, however, can be actively bridged with computer technology: Unlike the passive paper medium, computers allow us to manipulate pictures, to pull and push their content in continuous motion with a similar effect as a moving video camera, or to poke at objects in one picture and see them light up in other pictures. Motion links pictures in time; poking links them across space. This book features many illustrations of the power of these linking technologies. The diligent reader may come away "seeing" high-dimensional data spaces!
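
The idea of taking pictures of high-dimensional data from multiple directions can be sketched in a few lines of Python. This is an illustration only and not XGobi's implementation: the function names `random_projection_frame` and `project`, the toy data, and the choice of random viewing directions are all assumptions made for the example. Each "view" is the data projected onto a plane spanned by two orthonormal directions in the space of the variables.

```python
import math
import random


def random_projection_frame(p, rng):
    """Return two orthonormal p-vectors spanning a random 2-D viewing plane.

    Hypothetical helper for illustration; not part of XGobi.
    """
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    a = unit([rng.gauss(0.0, 1.0) for _ in range(p)])
    b = [rng.gauss(0.0, 1.0) for _ in range(p)]
    # Gram-Schmidt step: remove the component of b along a, then normalize.
    dot = sum(x * y for x, y in zip(a, b))
    b = unit([y - dot * x for x, y in zip(a, b)])
    return a, b


def project(points, frame):
    """Project p-dimensional points onto the plane spanned by frame."""
    a, b = frame
    return [
        (sum(x * u for x, u in zip(pt, a)), sum(x * v for x, v in zip(pt, b)))
        for pt in points
    ]


rng = random.Random(0)
# Made-up data: 5 cases measured on 4 variables.
data = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(5)]
# Three different "camera directions" onto the same 4-D point cloud.
views = [project(data, random_projection_frame(4, rng)) for _ in range(3)]
```

Each entry of `views` holds 2-D plotting coordinates for the same five cases. Drawing the entries side by side gives multiple static views; moving smoothly from one viewing plane to the next is what produces the impression of motion described above.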
Talking about a technology is vacuous if it is not backed up by an implementation. For the methods shown in this book, the implementation is the freely available XGobi software designed and written by us, the authors. While other excellent software systems for data visualization exist, XGobi is particularly strong for the type of high-dimensional data explorations that we describe in this book. XGobi implements what amounts to video cameras for directly viewing high-dimensional data with dynamic projections, and it has tools for poking at and slicing data across multiple views. It permits creative rendering of data and creative manipulation of views of data. XGobi will be introduced in sufficient detail that this book can serve as a primer to the system.

The discipline of data visualization has multiple homes, one in the field of statistics, and others in computer science and engineering. Statistics has had a strand of data visualization research since the mid-1970s under the rubric "statistical graphics," while work outside statistics came into focus a bit later, following the visualization initiative by the National Science Foundation in the late 1980s. The seminal research in statistics was PRIM-9, the work of Fisherkeller, Friedman and Tukey in 1974. PRIM-9, designed and implemented at the Stanford Linear Accelerator Center, was the first interactive data visualization system. It was followed by further pioneering systems at the Swiss Federal Institute of Technology (PRIM-ETH), at Harvard (PRIM-H), and at Stanford (ORION) in the late 1970s and early 1980s. Research picked up in the following few years in many places, including AT&T Bell Labs, Bellcore, the University of Washington, the University of Minnesota, MIT, CMU, Battelle (Richland, WA), George Mason University, Rice University, and several more.
Our work on XGobi grew out of the third author's work at Stanford and the University of Washington, followed by the joint work of the three of us at Bellcore starting in the early 1990s. We see XGobi as a partial sum of statisticians' research in data visualization.

Statisticians aren't the only researchers who are interested in the visualization of abstract high-dimensional data, but statistical data visualization has some unique features. Statisticians are always concerned with variability in observations and error in measurements, both of which cause uncertainty about conclusions drawn from data. Dealing with this uncertainty is at the heart of classical statistics, and statisticians have developed a huge body of inference methods that allow us to quantify uncertainty. Inference used to be statisticians' sole preoccupation, but this changed under John W. Tukey's towering influence. He championed "exploratory data analysis" (EDA), which focuses on discovery and allows for the unexpected, unlike inference, which progresses from pre-conceived hypotheses. EDA has always depended heavily on graphics, even before the term "data visualization" was coined. Our favorite quote from John Tukey's rich legacy is that to "force the unexpected upon us," we need good pictures. In the past, EDA and inference were sometimes seen as incompatible, but they are not mutually exclusive. In this book, we will present visual methods for assessing uncertainty and performing inference, that is, deciding whether what we see is "really there."

Data visualization is an attractive area because of its broad applicability. Wherever data is collected, creative analysis is possible with all the fun that creativity engenders. Data are gathered, stored and analyzed in huge amounts by all areas of science, by governments, and by industries such as finance, retail, health, telecommunications, and service industries in general.
We hope the large number of data examples in this book will give the reader an impression of the power and wide applicability of data visualization.

About this book

This book may be used as a textbook for courses in data visualization and exploratory data analysis, and it can also serve as a reference for current XGobi users. It may be used as a companion text for courses in applied multivariate analysis, but it may even be helpful for courses in theoretical multivariate analysis because it builds up intuitions for high-dimensional data spaces. We furthermore recommend the book for innovative teaching of data mining, as the XGobi software can scale up to over 100,000 cases or records. Each chapter of the book focuses on specific data applications. Data sets for the examples are available from the web page for the book. The book will produce the greatest benefit if it is not just passively read, but actively followed by working with the software and with our datasets or the reader's own.

About the software

XGobi is publicly available visualization software for viewing and interacting with high-dimensional data. Its primary views are scatterplots augmented with line drawings; other view types are scatterplot matrices and parallel coordinate plots. Multiple views are automatically linked; they are interactive, in the sense that they can be directly manipulated and queried. Also featured in XGobi are dynamic methods, such as high-dimensional rotations, including the grand tour and correlation tour, augmented with manual controls and automatic guidance through projection pursuit. XGobi can deal with missing values through elimination or imputation. It can also be used as a rudimentary high-dimensional drawing program. Finally, a sister program, called XGvis, is provided for nonlinear dimension reduction, multidimensional scaling, and high-dimensional graph layout.
XGobi and XGvis can be run on almost every platform, and they support interprocess communication with other packages.

The authors can be contacted by electronic mail at:

dicook@iastate.edu
dfs@research.att.com
andreas@research.att.com

and the XGobi software can be downloaded from the XGobi web site:

http://www.research.att.com/areas/stat/xgobi/

Chapter 1 Introduction

Computer-aided visualization is a child of the age of technology. While the technological age is drowning us in information, computers also provide the means for humans to digest it. Computers have brought to life powerful facilities for drawing and interacting with pictures that describe information. This book is about drawing and interacting with pictures of a particular type of information: data. It describes the application of data visualization tools that use high levels of interaction and motion to a broad array of case studies from different disciplines.

In this chapter we describe the history of data visualization and the role of data visualization in the process of data analysis. We also explain methods to be used in later chapters, and the software used for demonstrating the methods.

1.1 Data Visualization

In this book, the term "data" is used for information that exists in some schematic form such as a table or a list. Data is often but not always quantitative, and some translation of unstructured information is often required to derive the data. It always includes some attributes or variables such as the number of hits on web sites, frequencies of words in text samples, weight in pounds, mileage in miles per gallon, income per household in dollars, years of education, acidity on the pH scale, sulfur emissions in tons per year, or scores on standardized tests.
Characteristic of data visualization is the concern with abstract relationships among such variables: for example, the degree to which income increases with education, or the question of whether certain astronomical measurements indicate grouping and therefore hint at new classes of celestial objects. In contrast, other areas of visualization are mostly concerned with the display of objects and phenomena in physical 3-D space. Examples are volume visualization (e.g., for the display of human organs in medicine), surface visualization (e.g., for manufacturing cars or animated movies), flow visualization