Chapter 1 Introduction

These notes give a brief overview of interactive and dynamic graphics methods and an introduction to the graphical exploration of data, using the software XGobi for the examples. Beyond the standard graphical tools available in XGobi, creative thinking and re-structuring of data let you make an astonishing array of visual renderings that can provide insights that were previously out of reach.

1.1 What is Interactive and Dynamic Data Visualization?

Interaction, or to be more precise, direct manipulation, of graphics includes activities such as:

- linked brushing of points, regions or lines in multiple views: highlighting them in one plot with a brush and watching where the highlighted points fall in the other plots.
- querying the id of a point or group of points.
- dragging a scrollbar to change the value of a parameter.
- clicking a button to change the variables viewed in a plot.

Interaction facilitates tasks in exploration and discovery.

Dynamic graphics are motion graphics, for example:

- 3D rotating plots.
- tour methods: grand/random, guided, manual.

In statistical terms, there are specific technical interpretations of linked brushing and motion graphics. Linked brushing can be considered to be exploring conditional distributions of variables, where the brush is a conditioning tool. Motion graphics can be considered to be exploring joint distributions, because the motion facilitates perception of the "shape" of the data from the sequence of marginal views. If we know the distribution of all low-dimensional projections of the data then we also know the joint multivariate distribution, following the Cramér-Wold result (Mardia, Kent & Bibby 1979).

1.2 What Makes Graphics Special in the Data Analysis Process?

Here we present three case studies to demonstrate the power of graphics and the thought processes that led us in this direction of research.
The first case study uses a very simple data set, collected by one waiter over a period of a few months in a restaurant in the mid-west USA. Figure 1.1 displays the simplest graphical form, a histogram, as small multiples: the same tip data drawn with different bin widths, ranging from $1 at top to 10c at bottom. The small multiples illustrate the different information that is gained from examining the data at different resolutions. With the smallest bin width you can see that customers tend to round the tip amount to the nearest fifty cents or dollar. These simple plots also emphasize that the salient features are not found with one ideal bin width; one must use multiple bin widths to extract different features of the data.

Figure 1.2 describes drilling down further into the data. On the left, Total Tip is plotted against Total Bill. The strong lower right triangular shape suggests that on average there are more "cheap tippers" than generous ones, with a couple of outlying exceptions. On the right, the scatterplot of tip vs total bill is broken out by Sex of the Bill Payer and Smoker (whether it was a smoking party of diners or not). Inspecting these plots reveals numerous features: (1) for smoking parties, there is almost no relationship between tip and total bill, (2) when a female non-smoker paid the bill, the tip was a very consistent percentage of the total bill, with the exception of three dining parties, (3) larger total bills were mostly paid by a male. Further drill down into other variables, such as time of day, day of week and size of party, reveals enough information to almost locate the restaurant.

This is the power of graphics. Even such a simple data set can reveal the most interesting complexities, a much richer body of complexities than can be extracted with numerical methods alone.
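The point about multiple bin widths can be checked numerically with a minimal sketch. The tip amounts below are made-up values in cents, not the waiter's data, and `bin_counts` is a hypothetical helper, not part of XGobi. Working in integer cents is a choice made here to avoid floating-point effects at bin boundaries.

```python
from collections import Counter


def bin_counts(values_cents, width_cents):
    """Count values per bin; each bin is keyed by its lower edge in cents."""
    if width_cents <= 0:
        raise ValueError("bin width must be positive")
    return Counter((v // width_cents) * width_cents for v in values_cents)


# Hypothetical tips in cents (illustrative only, not the data in Figure 1.1).
tips_cents = [100, 150, 200, 200, 250, 300, 300, 300,
              348, 350, 400, 500, 500, 516]

# The same data at three resolutions: $1, 50c and 10c bins.
for width in (100, 50, 10):
    counts = bin_counts(tips_cents, width)
    print(f"bin width {width}c:")
    for lower in sorted(counts):
        # One star per tip falling in the bin that starts at `lower`.
        print(f"  {lower / 100:5.2f} {'*' * counts[lower]}")
```

With $1 bins the bin starting at 3.00 holds five tips and reads as a single mode. With 10c bins the tips at exactly 3.00 and 3.50 separate from the unrounded 3.48, which is the kind of rounding behavior that only the finer resolution exposes.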
As an aside, with an interactive system the user would be able to drag a slider to actively change the bin width in a histogram, or click on a variable label to condition the plot by the categorical variable. Scaling these methods to extremely large data sets is straightforward: pre-process the data into bins and subsets.

Figure 1.1: Histograms of Actual Tips with differing bar width: $1, 50c, 33c, 25c, 20c, 10c. An interactive system allows the bin width to be changed with a slider.

The second case study illustrates the importance of user interaction in views of high-dimensional data. The data comes from a study on Italian Olive Oils. Samples from different regions were analyzed for their fatty acid content. The eight variables measure the percentage of each of the eight fatty acids found in the olive oil samples. It is of interest to classify an oil sample based on its fatty acid composition. The left plot in Figure 1.3 displays the solution of a Classification and Regression Tree (CART), which is also similar to the linear discriminant analysis solution: the presence or absence of eicosenoic acid separates oils from region 1 from those of the other two regions; oils from region 3 have low levels of linoleic acid, whereas oils from region 2 have high levels. The right plot illustrates the result of rotating small amounts of oleic and arachidic acid into the projection. With this small adjustment we gain a much cleaner separation of the oils from the 3 regions. CART is distracted by the correlation among the variables, and linear discriminant analysis is distracted by the heterogeneous group variances. Neural networks can do a perfect classification job here.
With graphics we can understand how the network does its classifying, which variables dominate the solution, and, if there are misclassifications, where the regions of uncertainty lie. It is important to note that neural networks do not perform so well when further drill down into sub-regions is needed. With graphics, we can understand the cluster structure, and why this happens.

Figure 1.2: (Left) Scatterplot of Total Tip vs Total Bill: More points in the bottom right indicate more cheap tippers than generous tippers. (Right) Total Tip vs Total Bill by Sex and Smoker: There is almost no association between tip and total bill in the smoking parties, and, with the exception of 3 dining parties, when a female non-smoker paid the bill the tip was extremely consistent.

The third case study illustrates the strength of our graphical methods in detecting structure in sparse high-dimensional space. This data is from a particle physics experiment in which 7 measured variables describe the states of the experiment. Visual inspection using a combination of touring, guided touring, and brushing revealed that the points lie on a structure comprising connected low-dimensional pieces: a 2D triangle, with 2 linear pieces extending from each vertex (Figure 1.4).
The data was old by relative standards: 20 years old by the time this discovery was made with visual methods.

These three case studies are described to give the reader an indication of the types of insights that can be gained through interactive and dynamic graphics.

Figure 1.3: (Left) Solution obtained by CART doesn't separate groups 2 and 3 very well. (Right) A small rotation of two other variables - linoleic and arachidic acid - into the projection gives a much neater solution.

1.3 What is XGobi?

XGobi is a visualization system for viewing high-dimensional data. It is freely available. The primary visual constructions are points and lines, which form the basic view type, the scatterplot, which can be enhanced with line drawings. A high level of direct manipulation of plots is supported, with linking of points and/or lines between multiple views. Motion graphics such as rotation, touring and projection pursuit are available. It is especially designed for the exploration of multivariate data and high-dimensional spaces. XGobi was written using the X Window System, so it is primarily a Unix-based software package, but it can also be run under Windows NT or Windows 95/98 using a variety of X software packages.
It supports interprocess communication to share information with other software, so it can be run directly from ArcView, S/S-Plus, R and omegahat.

Figure 1.4: (Left) One projection of the 7D particle physics data, and (right) the corresponding wire frame illustrating the underlying structure.

The graphics and tools that are available in XGobi include:

- Cycling rapidly through two-variable scatter plots.
- Three-dimensional rotation.
- Grand tours and correlation tours: random, guided, and manual control.
- Brushing.
- Hiding (excluding) groups of points.
- Dynamic re-scaling.
- Interactive identification of cases.
- Linked views: Brushing, identification and touring are linked; that is, actions in the window of one XGobi process are immediately reflected in another XGobi window displaying the same data.
- Line editing.
- Moving points.
- Fitting curves using smoothers.
- Drawing subsamples.
- Jittering of categorical measurements.
- Parallel coordinate display.
- Scatterplot matrix display.
- Case and variable label lists.
- Missing values are accepted and can be dealt with by imputation of constant values, random values, or user-supplied imputed values. Missing value patterns can be examined in a separate linked XGobi window.
- On-the-fly variable transformations.
- Quality postscript output.
- Online help with the click of the right mouse button.

The XGobi web site can be found at http://www.research.att.com/areas/stat/xgobi/.

1.4 Getting Started with XGobi

Start up XGobi on the flea beetles data. What you will see is a window like that in Figure 1.5. The window is laid out in 4 regions: a central plot region; a control panel specific to the view mode at left; controls for variables at right (there are 6 variables in this data); and a series of menus at the top giving access to the major tools in XGobi.
Figure 1.5: Main XGobi window, displaying a central plot region, a control panel at left, controls for variables at right, and a series of menus at the top facilitating the major tools in XGobi.

Take a closer look at the series of menus at the top. These contain selections which allow switching between the major modes and tools available in XGobi. Take a look at each one.

- The File menu contains typical file input/output, printing and exiting choices.
- The View menu contains choices of major plot types (1DPlot, XYPlot, 3D Rotation, Grand Tour, and Correlation Tour), and major plot interaction methods (Scale, Brush, Identify, Line Editing, Move Points).
- The Tools menu contains a heterogeneous selection of tools: Hide or exclude, Subset, Smooth, Jitter, Parallel coordinates, Scatterplot Matrix, Variable transformation, Variable and Case lists, and Missing Values.
- The Display menu contains selections for modifying the look of the plot: Display axes/gridlines, center axes, plot the points/lines.
- The Info menu gives basic instructions about the help facilities.

The colors and glyphs of the points were read in from startup files, and are set to identify the species of beetles in this data.

On start-up, the view mode is XYPlot, so a scatterplot of the first two variables appears in the plot window. The bars in the variable circles at right indicate how the variables are displayed in the plot: tars1 is horizontal and tars2 is vertical. Click on an empty circle with the left mouse button to change the horizontal variable. Click with the middle mouse button (ALT-Left on a two button PC mouse) on an empty circle to change the vertical variable. This can be animated to cycle through all pairs of variables, using the control panel at left. Click on Cycle to start cycling, and drag the scrollbar to change the cycling speed. Fix X keeps the horizontal variable the same during cycling, and similarly Fix Y keeps the vertical variable the same.
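The cycling described above can be pictured as stepping through a list of variable pairs. The sketch below is only an illustration: the order in which XGobi visits pairs is not stated in these notes, so the enumeration order, the function name `cycle_pairs` and its `fix_x` parameter are all assumptions. Only the five variable names mentioned in this chapter are used.

```python
from itertools import combinations


def cycle_pairs(variables, fix_x=None):
    """Return (horizontal, vertical) variable pairs to step through.

    With fix_x set, the horizontal variable stays the same and only the
    vertical variable changes, mimicking the idea behind Fix X.
    """
    if fix_x is None:
        # Every unordered pair of distinct variables, in listing order.
        return list(combinations(variables, 2))
    if fix_x not in variables:
        raise ValueError(f"unknown variable: {fix_x}")
    return [(fix_x, y) for y in variables if y != fix_x]


# Variable names taken from the text; the sixth variable is not named there.
variables = ["tars1", "tars2", "aede1", "aede2", "aede3"]

for x, y in cycle_pairs(variables, fix_x="tars1"):
    print(f"horizontal={x}, vertical={y}")
```

Even with five variables there are ten distinct pairs, which is why an automated cycle is more practical than clicking through the variable circles by hand.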
Figure 1.6: Main XGobi window in 1DPlot View mode, with jitter control panel.

Change to 1DPlot view mode. Notice that the control panel at left changes. Each View mode has a unique control panel. The plot window now displays an average shifted histogram of the variable tars1. The scrollbar labelled "ASH Smoothness" controls the smoothness of the density; dragging it to the right smooths the plot. Using the left mouse button on the variable circles switches the variable in the plot. If you select the variable aede2, you'll notice it is discrete: there are just 9 distinct values. Go into the Tools menu and select Jitter. A control panel (Figure 1.6) pops up in a separate window which allows you to add small amounts of random noise to the values. In this case it allows us to peek at what is plotted underneath.

Change to Grand Tour view mode. The points in the plot window now rotate in the first 3 variables. The lines in the variable circles track the rotation, indicating how each variable contributes to the projection displayed in the plot. The control panel is huge! We'll explain its controls in more detail below. In the Display menu, set Center axes in 3D+ modes to off, to shift the axes out of the center. Click on the variable circles with either the left or middle mouse button to toggle variables in and out. (If you reduce the number of variables to 2, the tour will stop, and you will need to re-start it with the Pause button when more variables are added.) When more than 3 variables are included, the grand tour in XGobi displays a continuous random sequence of 2D projections of the data. This is particularly useful for this data, as the 3 species separate out very neatly in some projections when all variables are included in the tour, and the motion of points further indicates 3 different patterns corresponding to the 3 species.

The control panel contains three different tour methods: the grand tour, the guided tour and the manual tour.
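To make "a random 2D projection of the data" concrete, here is a minimal sketch that draws one random projection plane and projects a single case onto it. It is not XGobi's algorithm: the tour moves continuously between planes, and that interpolation is omitted here. The names `random_frame` and `project`, the use of Gram-Schmidt on Gaussian vectors, and the example values are assumptions made for illustration.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def random_frame(p, rng):
    """Return two orthonormal p-vectors spanning a random 2D plane."""
    if p < 2:
        raise ValueError("need at least 2 variables for a 2D projection")
    a = [rng.gauss(0.0, 1.0) for _ in range(p)]
    b = [rng.gauss(0.0, 1.0) for _ in range(p)]
    norm_a = math.sqrt(dot(a, a))
    a = [x / norm_a for x in a]
    # Gram-Schmidt step: remove from b its component along a, then normalise.
    ab = dot(a, b)
    b = [y - ab * x for x, y in zip(a, b)]
    norm_b = math.sqrt(dot(b, b))
    b = [y / norm_b for y in b]
    return a, b


def project(point, frame):
    """Project one p-dimensional case onto the plane; returns (x, y)."""
    a, b = frame
    return dot(point, a), dot(point, b)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
frame = random_frame(6, rng)
case = [1.0, 2.0, 0.5, -1.0, 0.0, 3.0]  # hypothetical 6-variable case
print(project(case, frame))
```

The entries of the two frame vectors play the role of the lines drawn in the variable circles: each entry says how much one variable contributes to the horizontal or vertical direction of the current view.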
The guided tour is activated by clicking on ProjPrst and Optimz. Choose the Holes index from the menu of indices, and watch the guided tour find a very nice projection of the data identifying the 3 species. A separate plot window tracks the value of the Holes index. From the axes, you can see the separation is due primarily to 4 variables: tars1, tars2, aede1 and aede3 (Figure 1.7). Also, in terms of principal component axes, the separation is due to PC1, PC2 and PC6. Turn ProjPrst off. Now focus on the manual tour. With the tour paused, drag the mouse in the plot window with the left button held down to change the coefficient of tars1 in the projection. The thin inner circle in the variable circle panel indicates which variable is to be manipulated. It can be changed by holding the Shift key down and clicking on a variable circle.

Change to Brush mode. The brush control panel allows choice of color, glyph, point or line brushing, and hiding and excluding of points. Use the color menu to select a new color. The rectangle in the corner is the brush; moving the mouse around the screen with the left button depressed moves the brush, and points that are under the brush are painted the new color. This is transient brush mode. This mode is more useful when multiple XGobi windows are visible on the screen and linked, so that the colors in each plot also change according to the brush action in one. Click on Persistent to make the brush permanently change the color of the points. This kind of highlighting is used to mark features in the data. The middle mouse button allows the size of the brush to be changed. Click on the Hide or exclude panel (Figure 1.8). Choosing a color/glyph combination as Hidden hides it from the plot region. Choosing exclude also removes it from the scale calculations, so the plot will rescale when groups are excluded.

In the Tools menu select Variable transformation.
This panel allows you to interactively transform variables or groups of variables, and make some major changes such as sorting or permuting values (Figure 1.9). Sorting is neat because it allows QQ-plots to be generated "on-the-fly".

Figure 1.7: Main XGobi window in Grand Tour mode, with projection pursuit guidance window.

Figure 1.8: Main XGobi window in Brush mode, with Hide or exclude control panel.

Figure 1.9: QQ-plot of aede1 and aede3, obtained by using the variable transformation controls.

Now is a good time to play around selecting modes and tools, exploring the functionality of XGobi, or seeing if you can break it.
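The remark that sorting yields QQ-plots can be made concrete: sorting each variable separately and then plotting the sorted values against each other pairs up their empirical quantiles. The following is a minimal sketch, assuming two variables with the same number of cases; `qq_pairs` and the numbers are hypothetical, and this does not describe how XGobi implements its sort transformation.

```python
def qq_pairs(x, y):
    """Pair the sorted values of x with the sorted values of y.

    Plotting the returned pairs as a scatterplot gives a QQ-plot of the
    two variables. The original case-by-case pairing is discarded.
    """
    if len(x) != len(y):
        raise ValueError("both variables need the same number of cases")
    return list(zip(sorted(x), sorted(y)))


# Made-up measurements standing in for two variables of equal length.
x_values = [136, 120, 141, 128]
y_values = [104, 95, 107, 99]

for qx, qy in qq_pairs(x_values, y_values):
    print(qx, qy)
```

If the paired points fall close to a straight line, the two variables have distributions of similar shape, differing mainly in location and scale.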