
This talk focuses on the use of interactive and dynamic graphics in the process of data analysis. We use the XGobi software to demonstrate the techniques. The material is provided jointly by myself, Deborah Swayne and Andreas Buja. It is based in part on draft notes for a book of similar title.

Have you seen this painting? It was painted by Hans Holbein in the early 1500s, and is currently on display in the National Gallery in London. It is well worth a visit if you are in London. There is a curious "elliptical blur" in the front of the painting. Holbein often hid things in his paintings. Here it looks like he was toying with cartographic projections. If you cut out a rectangle around the ellipse and warp the rectangle by simple scaling, you see something more familiar: a skull. This is analogous to the power of interactive visual tools to re-shape a display and so focus attention on important features of the data.

Moving along to today's subject matter. By the end of this course, I hope that you will have gained some understanding of the power of visual tools in the process of data analysis. The more ambitious of you might even go away and download the software we use, and perhaps invent new graphical methods for your own analyses.

Computer-aided visualization is a young discipline, having arisen and grown in the last fifteen years. Its goal is to create computerized visual tools for analyzing and communicating information with pictures. This talk is about visualizing a particular type of information, namely, data. In this talk, the term "data" is used for information that exists in some schematic form such as a table or a list. Data is often but not always quantitative, and some translation of unstructured information is often required to derive the data.
It always includes some attributes or variables such as the number of hits on web sites, frequencies of words in text samples, weight in pounds, mileage in miles per gallon, income per household in dollars, years of education, acidity on the pH scale, sulfur emissions in tons per year, or scores on standardized tests. Characteristic of data visualization is the concern with abstract relationships among such variables: for example, the degree to which income increases with education, or the question of whether certain astronomical measurements indicate grouping and therefore hint at new classes of celestial objects. In contrast, other areas of visualization are mostly concerned with the display of objects and phenomena in physical 3-D space. Examples are volume visualization (e.g., for the display of human organs in medicine), surface visualization (e.g., for manufacturing cars or animated movies), flow visualization (e.g., for aeronautics or meteorology), and cartography. In these areas one often strives for physical realism or the display of great detail in space, as in the visual display of a new car design, or of a developing hurricane in a meteorological simulation. Data visualization as we use the term requires the emphasis to be on variables and their relationships in the abstract. Variables are typically thought of as detached from physical location, even if physical location is part of the data. The general task of data visualization is to make pictures that reflect relationships among variables. To this end, one maps the variables to axes in a plot and the variable values to locations on the axes. In effect, one uses space to code non-spatial information. The goal of data visualization is then not realism, which is meaningless for abstract variables, but the rendering of data in space for the purpose of visual consumption.
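
The mapping of variable values to locations on plot axes can be sketched as a simple linear rescaling. This is only an illustration of the idea, not how XGobi implements it; the function name `to_axis`, the pixel ranges, and the sample values are all hypothetical.

```python
def to_axis(values, lo_px, hi_px):
    """Linearly map data values onto the pixel range [lo_px, hi_px]."""
    vmin, vmax = min(values), max(values)
    span = vmax - vmin
    if span == 0:
        # All values identical: place them at the middle of the axis.
        mid = (lo_px + hi_px) / 2
        return [mid for _ in values]
    return [lo_px + (v - vmin) / span * (hi_px - lo_px) for v in values]


# Hypothetical data: years of education and household income in dollars.
education = [10, 12, 16, 18]
income = [25000, 32000, 54000, 61000]

# Screen y coordinates grow downward, so the y pixel range is reversed.
xs = to_axis(education, 50, 450)
ys = to_axis(income, 350, 50)
points = list(zip(xs, ys))
```

Each observation becomes one point whose position encodes two non-spatial quantities, which is the sense in which space is used to code non-spatial information.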
This program of data visualization, the spatial representation of data, immediately brings up a limitation: plotting surfaces such as paper or computer screens are merely 2-dimensional, and physical space is just 3-dimensional. The eye can be tricked into seeing 3-dimensional virtual space with perspective and motion, which for visualization is as good as, or even better than, physical space. Only a few people have claimed to "see" 4-dimensional space (actually we are among them). In data visualization each variable requires a spatial axis or dimension, which poses the question of how we are to picture more than two or three variables at a time. The limitation to a 3-dimensional display space is fine if the objects are 3-dimensional, as in most other visualization areas, but in data visualization the number of axes required to code variables can be large: five to ten are common, and 50 or even hundreds arise in very real contexts. This then is the challenge of data visualization: to overcome the 2-D and 3-D barriers. To meet this challenge, computers allow us to implement some powerful visualization tools. They mimic and amplify a paradigm familiar from photography: take pictures from multiple directions so the shape of an object can be understood in its entirety. This is called the "multiple views" paradigm, where the term "views" is just a synonym for "pictures". In our 3-D world the paradigm works superbly: the human eye is very adept at inferring the true shape of an object from a few directional views. Unfortunately, the same is often not true for views of abstract data! The chasm between different views of data, however, can be actively bridged with computer technology: unlike the passive paper medium, computers allow us to manipulate pictures, to pull and push their content in continuous motion with an effect similar to that of a moving video camera, or to poke at objects in one picture and see them light up in other pictures.
Motion links pictures in time; poking links them across space. This talk features many illustrations of the power of these linking technologies. The diligent listener may come away "seeing" high-dimensional data spaces!

The discipline of data visualization has multiple homes, one in the field of statistics, and others in computer science and engineering. Statistics has had a strand of data visualization research since the mid-1970s under the rubric "statistical graphics," while work outside statistics came into focus a bit later, following the visualization initiative by the National Science Foundation in the late 1980s. The seminal research in statistics was PRIM-9, the work of Fisherkeller, Friedman and Tukey in 1974. PRIM-9, designed and implemented at the Stanford Linear Accelerator Center, was the first interactive data visualization system. It was followed by further pioneering systems at the Swiss Federal Institute of Technology (PRIM-ETH), at Harvard University (PRIM-H) and at Stanford University (ORION), in the late 1970s and early 1980s. Research picked up in the following few years in many places, including AT&T Bell Labs, Bellcore, the University of Washington, the University of Minnesota, MIT, CMU, Battelle (Richland, WA), George Mason University, Rice University, and several more. Our work on XGobi grew out of the third author's work at Stanford and the University of Washington, followed by the joint work of the three of us at Bellcore starting in the early 1990s. We see XGobi as a partial sum of statisticians' research in data visualization. Statisticians aren't the only researchers who are interested in the visualization of abstract high-dimensional data, but statistical data visualization has some unique features. Statisticians are always concerned with variability in observations and error in measurements, both of which cause uncertainty about conclusions drawn from data.
Dealing with this uncertainty is at the heart of classical statistics, and statisticians have developed a huge body of inference methods that allow us to quantify uncertainty. Inference used to be statisticians' sole preoccupation, but this changed under John W. Tukey's towering influence. He championed "exploratory data analysis" (EDA), which focuses on discovery and allows for the unexpected, unlike inference, which progresses from preconceived hypotheses. EDA has always depended heavily on graphics, even before the term "data visualization" was coined. Our favorite quote from John Tukey's rich legacy is that to "force the unexpected upon us," we need good pictures. In the past, EDA and inference were sometimes seen as incompatible, but they are not mutually exclusive. In this talk, we will present visual methods for assessing uncertainty and performing inference, that is, deciding whether what we see is "really there."

This talk focuses on interactive and dynamic graphics. We begin by describing the terms more concretely. There are many definitions of the word "interaction". To many it simply means typing at a command line to change the characteristics of a plot or to revise it, as with graphics in S-Plus. For this talk, interaction means, more precisely, the direct manipulation of graphics in a plot, which includes activities such as: linked brushing of points, regions or lines in multiple views, that is, highlighting them in one plot using a brush and watching where the points light up in the other plots; querying the id of a point or group of points; dragging a scrollbar to change the value of a parameter; and clicking a button to change the variables viewed in a plot. Interaction facilitates tasks in exploration and discovery.

Dynamic graphics are motion graphics, for example: cycling between plots; 3D rotating plots; and tour methods (grand/random, guided, manual). In statistical terms, there are specific technical interpretations of linked brushing and motion graphics.
Linked brushing can be considered to be exploring conditional distributions of variables, where the brush is a conditioning tool. Motion graphics can be considered to be exploring joint distributions, because the motion facilitates perception of the "shape" of the data from the sequence of marginal views. If we know the distribution of all low-dimensional projections of the data then we also know the joint multivariate distribution, following a result of Cramér and Wold (Mardia, Kent & Bibby 1979).

The eye can absorb immense amounts of information if provided with "informative" pictures. In our experience, graphics provide advantages in several situations in the process of data analysis. Graphics are good for finding small departures from the trend, local anomalies, and sparse structure in high-dimensional spaces. They can also be used in conjunction with numerical results to refine solutions, and to understand or interpret the results. In the next several slides we will provide examples demonstrating these.

The first case study uses a very simple data set, collected by one waiter over a period of a few months in a restaurant in the Midwest of the USA. Several variables were collected: total tip, total bill, sex of the bill payer, smoking party or not, day of the week, time of day and size of the party. The data was reported in (Bryant & Smith 1995), a collection of case studies for business statistics. The primary question related to the data is: "What are the factors that affect tipping behavior?" The solution in the manual is to re-structure tip and total bill into a new variable, tip rate, and model it against the remaining variables. The result is a model with tip rate as the response and size of the party as the one explanatory variable. It is a very empty result! Yet for such a simple data set there is an enormous richness of structure, which we will show with graphics.

The figure displays a very common graphic: a histogram.
The small multiples are generated from the data by using an array of bin widths to calculate the histogram. The bin widths range from $1 at the top to 10c at the bottom. The multiple layout illustrates how different information is gained from examining the data at different resolutions like this. The largest bin width ($1) shows tip to have a unimodal and skewed distribution, which for the data means that most tips are of the smaller amounts, with fewer and fewer larger tips. As the bin width is reduced the shape of the distribution becomes multimodal. At the smallest bin width (10c) it is clear that there are modes at the full dollar and fifty cent amounts. This means that the customers tend to round the tip amount to the nearest fifty cents or dollar. It is important to emphasize that the salient features are not found with one ideal bin width; one must use multiple bin widths to extract different features of the data.

The figure describes drilling down further into the data. On the left, Total Tip is plotted against Total Bill. The strong lower right triangular shape suggests that on average there are more cheap tippers than generous tippers. There are several outlying exceptions, one in particular giving a $5 tip on an $8 bill. On the right, the scatterplot of tip vs total bill is conditioned by Sex of the Bill Payer and Smoker (whether it was a smoking party of diners or not). Inspecting these plots reveals numerous features: (1) for smoking parties, there is almost no relationship between tip and total bill, (2) when a female non-smoker paid the bill, the tip was a very consistent percentage of the total bill, with the exception of three dining parties, (3) larger total bills mostly had a male paying them. Further drill down into other variables, such as time of day, day of week and size of party, reveals enough information to almost locate the restaurant. This is the power of graphics.
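
The bin-width comparison described for the histograms can be sketched in a few lines. This is a minimal text-based illustration, not the code behind the figure; the tip amounts are made up for the example (they are not the Bryant & Smith data) and the function name `histogram` is hypothetical. Amounts are held in integer cents to avoid floating-point binning problems.

```python
from collections import Counter


def histogram(values_cents, bin_width_cents):
    """Count values per bin; bins are [k*w, (k+1)*w), keyed by left edge."""
    counts = Counter(
        (v // bin_width_cents) * bin_width_cents for v in values_cents
    )
    return dict(sorted(counts.items()))


# Hypothetical tip amounts in cents, invented for illustration only.
tips = [100, 150, 150, 200, 200, 200, 217, 250, 300, 300, 348, 350, 400, 500]

for width in (100, 50, 10):  # $1, 50c and 10c bins
    print(f"bin width {width}c:")
    for left, n in histogram(tips, width).items():
        print(f"  {left / 100:5.2f}  {'*' * n}")
```

With the widest bins the counts form a single smooth hump; narrowing the bins separates the whole-dollar and fifty-cent amounts from the values in between, which is the kind of rounding pattern that the small multiples reveal.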
Even such a simple data set can reveal the most interesting complexities, a much richer body of complexities than can be extracted with numerical methods alone. As an aside, with an interactive system the user would be able to drag a slider to actively change the bin width in a histogram, or click on a variable label to condition the plot by the categorical variable.

The next few examples will involve demonstrations using the XGobi software (Swayne, Cook & Buja 1998). XGobi is a visualization system for viewing high-dimensional data. It is freely available. The primary visual constructions are points and lines, which form the basic view type, the scatterplot, which can be enhanced with line drawings. A high level of direct manipulation of plots is supported, with linking of points and/or lines between multiple views. Motion graphics such as rotation, touring and projection pursuit are available. It is especially designed for the exploration of multivariate data and high-dimensional spaces. XGobi was written using the X Window System, so it is primarily a Unix-based software package, but it can be run under Windows NT or Windows 95/98 using a variety of X server/client packages. It supports interprocess communication to share information with other software, so it can be run directly from ArcView, S/S-Plus, R and Omegahat. The graphics and tools that are available in XGobi include: Cycling rapidly through two-variable scatterplots. Three-dimensional rotation. Grand tours and correlation tours: random, guided, and manual control. Brushing. Hiding (excluding) groups of points. Dynamic re-scaling. Interactive identification of cases. Linked views: brushing, identification and touring are linked; that is, actions in the window of one XGobi process are immediately reflected in another XGobi window displaying the same data. Line editing. Moving points. Fitting curves using smoothers. Drawing subsamples. Jittering of categorical measurements. Parallel coordinate display.
Scatterplot matrix display. Case and variable label lists. Missing values are accepted and can be dealt with by imputation of constant values, random values, or user-supplied imputed values. Missing value patterns can be examined in a separate linked XGobi window. On-the-fly variable transformations. Quality PostScript output. Online help with the click of the right mouse button.

The figure illustrates the strength of our graphical methods in detecting sparse structure in high-dimensional space. This data is from a particle physics experiment in which 7 measured variables describe the outcome state of the experiment. We will demonstrate how the structure can be seen. We begin half-way through the exploration process, after visually extracting clusters. The points are touring in 7D space, with 7 color groups representing cluster structure discovered prior to this stage. An intuitive explanation of "touring" is rotation in arbitrary dimensions. The algorithm is complicated to explain, so in brief, what you see is a continuous sequence of 2D projections of the 7D space. Rotation in 3D is a special case of touring. Tour methods are an example of linking multiple plots using time. Motion such as this is particularly good for examining joint distributions of variables. We will see how. With all the points included, the point cloud looks very complex and messy. To simplify the process we reduce the number of points, by removing all color groups except one. Watch this shape for a few minutes. You will see that the points tend to concentrate in a triangle shape, which lies roughly in a 2D subspace of the 7D space. With the line drawing facility in XGobi we have drawn lines between points at the vertices of the triangle, and this shape matches the distribution of points quite beautifully. Now that this structure is identified, we add one more color group back in, and continue to watch the points touring.
The second group of points concentrates in a linear shape, which is mostly disjoint from the first group, except that the two appear connected at one vertex of the triangle. We continue in this manner, sequentially adding each color group back in. The underlying shape of the data is exposed. The points lie on a structure composed of connected low-dimensional pieces: a 2D triangle, with 2 linear pieces extending from each vertex. The data was old by relative standards, 20 years old by the time this discovery was made with visual methods, so the meaning of the structure, if there is one, is lost. Discovering the structure required a combination of touring, guided touring using a method called projection pursuit, and brushing.

This example illustrates the use of graphics to refine the results of two numerical algorithms. The data comes from a study on Italian olive oils (Forina, Armanino, Lanteri & Tiscornia 1983). Samples from different regions were analyzed for their fatty acid content. The eight variables measure the percentage of each of the eight fatty acids found in the olive oil samples: palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic and eicosenoic acid. For quality control purposes, it is important to be able to classify an oil sample into its region of manufacture based on the fatty acid composition.

The left plot in the figure displays the solution of a Classification and Regression Tree (which is also similar to the linear discriminant analysis solution): the presence or absence of eicosenoic acid separates oils from region 1 from those of the other two regions, and among the remaining oils, those from region 3 have low levels of linoleic acid while those from region 2 have high levels. The right plot illustrates the result of rotating small amounts of oleic and arachidic acid into the projection. We demonstrate this interactively. With the small adjustment we gain a much sweeter separation of the oils from the 3 regions.
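
One way to picture "rotating small amounts" of extra variables into a projection is as a small change to the coefficients of a 2D projection basis, followed by re-orthonormalization. The sketch below assumes this interpretation; it is not XGobi's manual tour algorithm, the coefficient 0.2 is arbitrary, and the helper names (`axis`, `orthonormalize`, `project`) are hypothetical.

```python
import math

VARIABLES = ["palmitic", "palmitoleic", "stearic", "oleic",
             "linoleic", "linolenic", "arachidic", "eicosenoic"]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(u):
    n = math.sqrt(dot(u, u))
    return [a / n for a in u]


def orthonormalize(b1, b2):
    """Gram-Schmidt: return an orthonormal pair spanning the same plane."""
    e1 = normalize(b1)
    proj = dot(b2, e1)
    e2 = normalize([b - proj * a for a, b in zip(e1, b2)])
    return e1, e2


def axis(**coefs):
    """Build a coefficient vector from variable-name keywords."""
    return [coefs.get(name, 0.0) for name in VARIABLES]


def project(row, basis):
    """Project one 8-variable observation onto the 2D basis."""
    return tuple(dot(row, e) for e in basis)


# Starting view: eicosenoic on the horizontal axis, linoleic on the vertical.
start = orthonormalize(axis(eicosenoic=1.0), axis(linoleic=1.0))

# Adjusted view: small, arbitrary contributions of arachidic and oleic.
adjusted = orthonormalize(axis(eicosenoic=1.0, arachidic=0.2),
                          axis(linoleic=1.0, oleic=0.2))
```

Applying `project` to each (suitably scaled) oil sample with `start` and then with `adjusted` gives the two kinds of view being compared: a plot of two variables, and a nearby projection in which two more variables contribute a little to each axis.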
It is easy to understand why CART and linear discriminant analysis don't effectively discriminate between the two regions. CART is distracted by the correlation and linear discriminant analysis is distracted by the heterogeneous group variances. Neural networks can do a perfect classification job here, but neural network solutions are difficult to interpret. With graphics we can understand how the network does its classification, which variables dominate the solution, and, if there are misclassifications, where the regions of uncertainty are. It is important to note that neural networks do not perform so well in further drill down to classify the areas within regions. With graphics, we can understand the cluster structure fairly completely, and how the numerical methods work.

Another application of graphics to a fairly similar problem is in cluster analysis. This figure illustrates an average linkage dendrogram for the olive oil data. Colors represent the true regions. What is interesting is that the dendrogram looks like it neatly separates out several clusters, but the mixed colors within them make it clear that the clusters do not correspond to the regions. In a situation where the groups were unknown ahead of time, linked brushing between the dendrogram and scatterplot views could be used to explore the results. More generally, predictions from any clustering algorithm can be represented by color and glyph. The cluster structure can then be explored with multivariate graphics.

The next few slides describe current developments in graphical methods, for exploring the distribution of missing values, and for developing inferential statements. The data used to demonstrate methods for missing values contains monthly average measurements from 13 buoys moored in the Pacific Ocean for the period March 1980 to May 1998. There are 5 variables: zonal winds, meridional winds, relative humidity, air temperature, sea surface temperature.
There are 2184 data points, and many missing values. The figure displays plots of the variables in this data with missing values imputed in different ways. The typical approach to plotting missing values is to plot them on the margins of a plot. Here, the marginal distributions of air temperature for cases missing and not missing on humidity appear a little different: the missings appear to have a slightly lower mean value. This approach does not work for multivariate plots. We demonstrate touring with the three variables humidity, air temperature and sea surface temperature as the active variables. The missing values appear as bands of points that float around, which is quite distracting and not helpful in understanding the nature of the missingness. Imputing using the variable mean value causes more problems. In a scatterplot of sea surface temperature against humidity we can see the missings appear as a lovely cross shape in the plot. If you change to rotation or a tour of the three variables, you will see a nice 3D cross structure. Similarly, random imputation is not particularly successful when there is correlation between variables. The destruction of correlation structure is even more pronounced in multivariate space.

A better approach to exploring the nature of missingness is to use brushing. This plot shows a scatterplot of air temperature against sea surface temperature where cases missing on humidity are highlighted as green crosses. The distribution is similar to that of the non-missings, especially in the correlation structure, although the missings do look like they have a slightly different mean. This approach can be easily extended to higher dimensions. The right plot shows a tour view of 4 variables with missings on humidity highlighted.

This next example relates to inference. You are now my test audience! Can you tell which is the real data plot in this array of plots? Stare at the plots for a while, and decide if one is different from the others.
Hands up if you think it is the top left plot? The top line, second from left? The top line, third from left? The top right? The second from top, left plot? ... This is data from an agricultural field trial in Iowa. The response is corn yield, and the explanatory variables are soil nutrients measured in soil samples. We look at Boron here. The real plot of Yield against Boron is the one in the top row, second from the right. It is different from the others in two ways. First, there is the outlier at the bottom right of this plot, case 200. None of the other plots has a value so extreme. Second, the sharpness of the skewness, or the lack of points in the bottom right, makes it different from the other plots. It is somewhat similar to the left plot and the right plot in the top row, but has a more pronounced structure. Judging the reality of structure in the presence of skewness presents more difficulties than in symmetric data, because of confounding with sample size.

In summary, data visualization is an attractive area because of its broad applicability. Wherever data is collected, creative analysis is possible with all the fun that creativity engenders. Data are gathered, stored and analyzed in huge amounts by all areas of science, by governments, by industries such as finance, retail, health, telecommunications, and service industries in general. We hope the range of data examples discussed here will give the reader an impression of the power and wide applicability of data visualization.

The authors can be contacted by email at: dicook@iastate.edu dfs@research.att.com andreas@research.att.com and the XGobi software can be downloaded from the XGobi web site: http://www.research.att.com/areas/stat/xgobi/

References

Bryant, P. G. & Smith, M. A. (1995), Practical Data Analysis: Case Studies in Business Statistics, Richard D. Irwin Publishing, Homewood, IL.

Forina, M., Armanino, C., Lanteri, S. & Tiscornia, E. (1983), Classification of olive oils from their fatty acid composition, in H. Martens & H. Russwurm Jr., eds, `Food Research and Data Analysis', Applied Science Publishers, London, pp. 189–214.

Mardia, K. V., Kent, J. T. & Bibby, J. M. (1979), Multivariate Analysis, Academic Press, London.

Swayne, D. F., Cook, D. & Buja, A. (1998), `XGobi: Interactive Dynamic Graphics in the X Window System', Journal of Computational and Graphical Statistics 7(1), 113–130.
