Chapter 1

Introduction

Computer-aided visualization is a child of the age of technology. While the technological age is drowning us in information, computers also give humans the means to digest it. Computers have brought to life powerful facilities for drawing and interacting with pictures that describe information. This book is about drawing and interacting with pictures of a particular type of information: data. It describes the application of data visualization tools that use high levels of interaction and motion to a broad array of case studies from different disciplines. In this chapter we describe the history of data visualization and the role of data visualization in the process of data analysis. We also explain methods to be used in later chapters, and the software used for demonstrating the methods.

1.1 Data Visualization

In this book, the term "data" is used for information that exists in some schematic form such as a table or a list. Data is often but not always quantitative, and some translation of unstructured information is often required to derive the data. It always includes some attributes or variables such as the number of hits on web sites, frequencies of words in text samples, weight in pounds, mileage in miles per gallon, income per household in dollars, years of education, acidity on the pH scale, sulfur emissions in tons per year, or scores on standardized tests.

Characteristic of data visualization is the concern with abstract relationships among such variables: for example, the degree to which income increases with education, or the question of whether certain astronomical measurements indicate grouping and therefore hint at new classes of celestial objects. In contrast, other areas of visualization are mostly concerned with the display of objects and phenomena in physical 3-D space.
Examples are volume visualization (e.g., for the display of human organs in medicine), surface visualization (e.g., for manufacturing cars or animated movies), flow visualization (e.g., for aeronautics or meteorology), and cartography. In these areas one often strives for physical realism or the display of great detail in space, as in the visual display of a new car design, or of a developing hurricane in a meteorological simulation.

Data visualization as we use the term requires the emphasis to be on variables and their relationships in the abstract. Variables are typically thought of as detached from physical location, even if physical location is part of the data. The general task of data visualization is to make pictures that reflect relationships among variables. To this end, one maps the variables to axes in a plot and the variable values to locations on the axes. In effect, one uses space on a page to code non-spatial information. The goal of data visualization is then not realism, which is meaningless, but rendering of data in space for the purpose of visual consumption.

This program of data visualization, the spatial representation of data, immediately brings up a limitation: plotting surfaces such as paper or computer screens are merely 2-dimensional, and physical space is just 3-dimensional. The eye can be tricked into seeing 3-dimensional virtual space with perspective and motion, which for visualization is as good as, or even better than, physical space. In data visualization each variable requires a spatial axis or dimension, which poses the question of how we are to picture more than two or three variables at a time. The limitation to a 3-dimensional display space is fine if the objects are 3-dimensional, as in most other visualization areas, but in data visualization the number of axes required to code variables can be large: five to ten are common, and 50 or even hundreds also arise in very real contexts.
This then is the challenge of data visualization: to overcome the 2-D and 3-D barriers. To meet this challenge, computers allow us to implement some powerful visualization tools. They mimic and amplify a paradigm familiar from photography: take pictures from multiple directions so the shape of an object can be understood in its entirety. This is called the "multiple views" paradigm, where the term "views" is just a synonym for "pictures". In our 3-D world the paradigm works superbly: the human eye is very adept at inferring the true shape of an object from just a few directional views. Unfortunately, the same is often not true for views of abstract data! The chasm between different views of data, however, can be actively bridged with computer technology: unlike the passive paper medium, computers allow us to manipulate pictures, to pull and push their content in continuous motion with a similar effect as a moving video camera, or to poke at objects in one picture and see them light up in other pictures. Motion links pictures in time; poking links them across space. This book features many illustrations of the power of these linking technologies. The diligent reader may come away "seeing" high-dimensional data spaces!

1.2 Statistical Graphics

The discipline of data visualization has multiple homes, one in the field of statistics, and others in computer science and engineering. Statistics has had a strand of data visualization research since the mid-1970s under the rubric "statistical graphics," while work outside statistics came into focus a bit later, following the visualization initiative by the National Science Foundation in the late 1980s. The seminal research in statistics was PRIM-9, the work of Fisherkeller, Friedman and Tukey in 1974. PRIM-9, designed and implemented at the Stanford Linear Accelerator Center, was the first interactive data visualization system.
It was followed by further pioneering systems at the Swiss Federal Institute of Technology (PRIM-ETH), at Harvard University (PRIM-H), and at Stanford University (ORION), in the late 1970s and early 1980s. Research picked up in the following few years in many places, including AT&T Bell Labs, Bellcore, the University of Washington, the University of Minnesota, MIT, CMU, Battelle (Richland, WA), George Mason University, Rice University, and several more.

Statisticians aren't the only researchers who are interested in the visualization of abstract high-dimensional data, but statistical data visualization has some unique features. Statisticians are always concerned with variability in observations and error in measurements, both of which cause uncertainty about conclusions drawn from data. Dealing with this uncertainty is at the heart of classical statistics, and statisticians have developed a huge body of inference methods that allow us to quantify uncertainty. Inference used to be statisticians' sole preoccupation, but this changed under John W. Tukey's towering influence. He championed "exploratory data analysis" (EDA), which focuses on discovery and allows for the unexpected, unlike inference, which progresses from pre-conceived hypotheses. EDA has always depended heavily on graphics, even before the term "data visualization" was coined. Our favorite quote from John Tukey's rich legacy is that to "force the unexpected upon us," we need good pictures. In the past, EDA and inference were sometimes seen as incompatible, but they are not mutually exclusive. In this book, we will present visual methods for assessing uncertainty and performing inference, that is, deciding whether what we see is "really there."

1.3 Representing High-Dimensional Data

The approach to drawing plots of data is called the "multiple views" paradigm (Buja, Cook & Swayne 1996). Multiple different plots, corresponding to views of the data from different aspects, are shown to the user simultaneously.
Interaction tools facilitate linking information between plots. The methods adhere to very specific guidelines for graphics: use simple, easy-to-read plots, with a healthy collection of interaction tools, such as refocusing, linking between plots, easy rearrangement of plots, and automated sequences of views. We describe the approach in depth in this section.

At the root of the graphical methods is a division of data visualization into three areas: rendering, or what to show in a plot; manipulation, or what to do with plots; and linking, or what information to share between plots.

The first area, rendering of data, comprises all decisions that go into the production of a static image. Rendering is concerned with appropriate representation of information in data variables: a scatterplot, a density plot, a time series plot or a parallel coordinate plot are examples of renderings. Wegman & Carr (1993) give an excellent introduction to the wide array of rendering methodology in statistical data visualization.

The second area, manipulation of plot elements, refers to how we operate on individual plots and how we organize multiple plots. The purpose of these manipulations is to support the search for structure in data.

The third area, linking, refers to the connection of elements from one rendering to another rendering. We most often think of linked brushing, where points are given the same appearance (color, glyph) between renderings. More generally, linking can include matching scales, matching axes, or linking a point to a record in a database or to an index of chemical compounds.

In the practice of data visualization, there usually exists a larger context of open-ended problem solving. In such contexts, data visualization systems are most useful if they provide plot manipulation tools that support extensive searching and linking of information.

Behind the process of rendering is the concept of a data pipeline, first described by Buja, Asimov, Hurley & McDonald (1988).
The data pipeline is the conveyor belt which takes the raw data through a series of transformations to go from p-dimensional data to a d-dimensional rendering. (The symbol p represents the number of variables in the data, and d < p, most commonly d = 2.) The stages of transformation typically include some type of variable standardization and dimension reduction. Examples of standardization are standardizing each variable to mean 0 and variance 1, and ordering a time variable. The dimension reduction may be done using variable selection, or through projection methods such as principal components or discriminant coordinates. Motion graphics, such as tours, can also be applied to address the dimension reduction problem. Although the number of variables may remain the same, the tour algorithm provides a continuous sequence of low-dimensional projections to produce a "movie" which shows the data "from all sides". A tour is also a display of multiple views using time order.

In general, the multiple views approach specifies that the data pipeline is rather more like a river delta, piping the data out into multiple renderings. With multiple renderings the difficulty is to define appropriate mechanisms to link information from one plot to another. In the simplest case, where one view has a scatterplot of variable 1 vs variable 2, and another view has a scatterplot of variable 3 vs variable 4, the points in the two plots are linked one-to-one: each point in one plot corresponds to exactly one point in the other plot. More complex types of linking often arise, for example, if there is a time or spatial component to the data. In the situation where there is a spatial component we need to explore the spatial dependence between sample points. So we may have a point in one view, as an element of a variogram cloud, corresponding to a pair of locations in another view, the map.
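
The two pipeline stages named above, standardization and projection from p dimensions down to d = 2, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the pipeline of Buja et al. or of any particular software: the function names are hypothetical, the data is synthetic, and the projection basis is drawn at random (as a single frame of a tour might be) and orthonormalized with a QR decomposition.

```python
import numpy as np

def standardize(X):
    """Scale each variable (column) to mean 0 and variance 1."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

def random_basis(p, d, rng):
    """Return a p x d matrix with orthonormal columns: one projection frame."""
    A = rng.standard_normal((p, d))
    Q, _ = np.linalg.qr(A)  # reduced QR, so Q has shape (p, d)
    return Q

def project(X, basis):
    """Map n x p data to its n x d rendering coordinates."""
    return X @ basis

rng = np.random.default_rng(0)
# Synthetic data (an assumption for illustration): n = 100 cases, p = 5 variables
# measured on very different scales.
X = rng.normal(size=(100, 5)) * np.array([1.0, 10.0, 100.0, 0.1, 5.0])

Z = standardize(X)           # stage 1: variable standardization
B = random_basis(5, 2, rng)  # stage 2: choose a 2-D projection
Y = project(Z, B)            # the coordinates that would be drawn
```

Replacing `random_basis` with the first two principal component directions, or with a smoothly varying sequence of bases, would give the other dimension reduction choices mentioned in the text.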
In the situation of time, there may be a multivariate longitudinal study, where there are both demographic variables for each patient, and multiple measurements over a study period for each patient. Here it is desirable to link a point in one view to a time series in another view. We could think of this as one-to-many, or indeed many-to-one. Other types of plot element linking are common as well, for example, axis scale and projection coefficients.

Figure 1.1: Monthly average Willamette river flow levels over a period of 10 years, at two different aspect ratios. The top plot shows a 1:1 ratio (contracted in time), which reveals long-term trends, such as the rise followed by a fall. The bottom plot shows a long time axis, revealing local seasonal trends. The 5th year (between observations 60 and 70) has a smaller peak flow than other years. The last year (beginning at observation 120) appears to start with higher than usual flow. (Both panels plot Flow against Time.)

Linking plot elements ensures that manipulation of elements in one plot directly affects the representation of the data in the other plots. The taxonomy of manipulations described in Buja et al. (1996) is as follows.

Focusing views: By focusing we mean any operation that is an extension of manipulating a camera, such as deciding from which side to look at the object and at which magnification and level of detail. Focusing views includes choosing the variables or (more generally) the projections for viewing, but also choosing aspect ratio, zoom and pan. (Figure 4.1 illustrates the effect of zoom and pan on structure detection.)

Posing queries: In graphical data analysis it is natural to pose queries graphically, for example with the familiar brushing techniques: coloring or otherwise highlighting a subset of the data means issuing a query about this subset. It is then equally natural that the response to the query be given graphically. This is achieved by showing information about the highlighted subset in other views. It is therefore desirable that the view where the query is posed and the views that present the response are linked. Ideally, responses to queries are instantaneous. (Figure 1.2 illustrates the difference between a static map and interactive querying for structure detection.)

Arranging many views: It is often a powerful informal technique to arrange large numbers of related plots for simultaneous comparison. The most useful arrangements are matrix-like, such as in scatterplot matrices of pairwise variable plots, but other arrangements can be useful as well.

Figure 1.2: Comprehensive Ocean-Atmosphere Data: long-term means of in situ weather observations taken by merchant marines, gridded and cleaned. (Top) The atmospheric science view: wind direction and speed displayed as arrows on the map allow easy insight into the global trends: easterly winds prevail in most parts of the Pacific. (Bottom) Brushing north-westerly, northerly, and south-easterly winds in the wind direction plots allows inspection of local anomalies: winds off the North American coast are mostly north-westerly and northerly, and off the South American coast mostly south-easterly on average. (The wind direction plots show C(WndDir) against S(WndDir).)

A very commonly used method for posing queries is "linked brushing": points in one plot are highlighted using color or glyph, and the corresponding points in other plots are simultaneously highlighted. In statistical terms, linked brushing can be considered to be exploring conditional distributions of variables, where the brush is a conditioning tool.
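
The reading of linked brushing as conditioning can be made concrete with a small sketch. The assumptions here are that the brush is a rectangle in one scatterplot and that the two views share cases row by row (one-to-one linking); the function name `brush` and the four-case data set are hypothetical and exist only to show the mechanics.

```python
import numpy as np

def brush(x, y, xlim, ylim):
    """Boolean mask of the points that fall inside a rectangular brush."""
    x = np.asarray(x)
    y = np.asarray(y)
    return (x >= xlim[0]) & (x <= xlim[1]) & (y >= ylim[0]) & (y <= ylim[1])

# Hypothetical data: four cases, four variables. Row i is the same case in
# every view, which is what makes the linking one-to-one.
data = np.array([
    [1.0, 2.0, 10.0, 0.5],
    [2.0, 3.0, 20.0, 0.4],
    [3.0, 1.0, 30.0, 0.9],
    [4.0, 4.0, 40.0, 0.1],
])
v1, v2, v3, v4 = data.T

# Query posed in the view of variable 1 vs variable 2.
selected = brush(v1, v2, xlim=(1.5, 3.5), ylim=(0.5, 3.5))

# Response shown in the linked view of variable 3 vs variable 4: the same
# mask picks out the highlighted points, i.e. the distribution of
# (variable 3, variable 4) conditional on the brushed region.
linked_points = data[selected][:, 2:]
```

The shared mask is the whole linking mechanism in this sketch: the second view never sees the brush rectangle, only which cases are currently selected.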
Motion graphics such as the tour, on the other hand, facilitate exploring joint distributions, because the motion facilitates perception of the "shape" of the data from the sequence of marginal views. If we know the distribution of all low-dimensional projections of the data then we also know the joint multivariate distribution, following a result of Cramér and Wold (Mardia, Kent & Bibby 1979).

1.4 Restructuring Data

Clever restructuring of variables is perhaps one of the most hidden yet valuable tools for data visualization. The case studies described in this book make substantial use of restructured variables. Particular types of data lend themselves to obvious approaches to restructuring: data with a time or space component, modular variables such as wind direction, or compositional data where variables are subject to a constraint.

When there is a time or space component it is important to explore the time or space dependency using lag plots or variogram cloud plots. With time, it is also likely that there are different scales of time (daily/weekly/yearly) to explore. Regrouping these different resolutions into different variables facilitates exploring trend (yearly or weekly). A variable such as wind direction can be better handled in the interactive setting by using its sine and cosine values. This allows the user to brush around the compass points to explore the relationship with other variables. Compositional data is best approached by pre-projecting the data into a subspace orthogonal to the variable constraint, called a ternary diagram in 2D.

In situations where modeling is a part of the analysis, components associated with the model are useful to append to the data set, and appending samples or quantiles from standard distributions facilitates inference. When the modeling is classification, exploring a dendrogram in association with the variables may illuminate clustering in the data.
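
The sine and cosine restructuring of a modular variable such as wind direction, mentioned in this section, can be sketched as follows. The assumption is that direction is recorded in compass degrees; the function names are hypothetical. The point of the example is that 350 degrees and 10 degrees are far apart as raw numbers but neighbours on the compass, and the (sine, cosine) pair restores that neighbourhood.

```python
import math

def direction_to_sin_cos(degrees):
    """Map a direction in degrees to the pair (sin, cos) on the unit circle."""
    radians = math.radians(degrees)
    return math.sin(radians), math.cos(radians)

def distance(a, b):
    """Euclidean distance between two (sin, cos) pairs."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

just_west_of_north = direction_to_sin_cos(350)
just_east_of_north = direction_to_sin_cos(10)
east = direction_to_sin_cos(90)

# Raw degrees put 350 and 10 at opposite ends of the axis; on the circle
# they are closer to each other than 10 is to 90.
gap_across_north = distance(just_west_of_north, just_east_of_north)
gap_to_east = distance(just_east_of_north, east)
```

Plotting the two new variables against each other places every observation on a circle, so a brush moved around that circle selects contiguous ranges of direction, including ranges that straddle north.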
With any type of prediction, appending the predictions, residuals or diagnostics to the data can help improve the model.

1.5 The Role of Visualization in Data Analysis

It is commonly said that "a picture is worth a thousand words". Translated to the area of data analysis, this becomes: "a good plot can present a concise summary of data that many pages of explanation cannot match". The eye can absorb immense amounts of information if provided with "informative" pictures. A well-constructed plot can provide considerably more information about the relationships between variables than numerical summary statistics. In the process of analyzing data, graphics are used to assess the integrity of the data, provide diagnostics for model fitting, and provide convenient summaries of important findings. Graphics provide distinct advantages for detecting structure in data and also for refining results from numerical algorithms. In this section we discuss why and where graphics are useful in data analysis, using three examples.

The first example demonstrates finding local anomalies with graphics. The data is very simple. It contains recordings made by one waiter over a period of a few months in a restaurant in the Midwestern USA (Bryant & Smith 1995). Several variables were collected: total tip, total bill, sex of the bill payer, smoking party or not, day of the week, time of day and size of the party. The data was reported in a collection of case studies for business statistics. The primary question related to the data is: "What are the factors that affect tipping behavior?" The solution in the manual is to restructure tip and total bill into a new variable, tip rate, and model it against the remaining variables. The result is a model with tip rate as the response and size of the party as the one explanatory variable. It is a very empty result! For such a simple data set, it has an enormous richness of structure, which we will show with graphics.

Figure 1.3 displays the variable total tip as a histogram.
Six histograms were constructed using different bin widths: $1, 50c, 33c, 25c, 20c, 10c. The small multiples illustrate how different information can be gained from examining the data at different resolutions. At the largest bin width, the shape of the distribution is unimodal and skewed, which indicates the predominance of lower total tips, and fewer larger total tips. As the bin width is reduced the shape becomes multimodal, and at the smallest bin width it is clear that there are large peaks at the full dollars and smaller peaks at the half dollars. This suggests that the customers tend to round the tip to the nearest fifty cents or dollar.

These multiple plots also serve to emphasize that the salient features are not found with one ideal bin width. Different resolutions provide ways to focus attention on different features. A large bin width smooths out the noise, and allows the reader to focus on large or global trends. In contrast, a smaller bin width allows the reader to extract intricate features from the trend. The human eye refocuses itself often, from one object to another, from far to near. Squinting and peripheral vision help to extract larger objects from a mixed background. Providing plots of data in different resolutions is analogous to the way the eye refocuses.

Figure 1.4 describes drilling down further into the tipping data. On the left, Total Tip is plotted against Total Bill. The predominating pattern is that the points mostly lie in the lower right triangle of the plot. This is the region where tips are relatively low compared to the total bill, which suggests that on average there are more "cheap tippers" than generous tippers. There are a couple of notable exceptions. One point has a tip value of about $5 for a $7 bill, a tip rate of about 70%. The horizontal stripes of points are the peaks at the full dollar and half dollar tips. This is the marginal structure that was discovered using the histograms above.
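
The effect of bin width on what a histogram can show, discussed above for the tip data, can be reproduced numerically. The sketch below uses a short list of hypothetical tip amounts invented purely for illustration (it is not the waiter data), and works in integer cents so that bin edges are exact. The function name `histogram_counts` is likewise an assumption.

```python
import numpy as np

# Hypothetical tips in cents, invented for illustration only.
tips_cents = np.array([100, 150, 200, 200, 200, 250, 300, 300, 325, 350, 400, 500])

def histogram_counts(values, bin_width, upper):
    """Count values in bins of the given width covering 0 to upper."""
    edges = np.arange(0, upper + bin_width, bin_width)
    counts, _ = np.histogram(values, bins=edges)
    return counts

# The same data at several resolutions: $1, 50c, 25c and 10c bins.
counts_by_width = {
    width: histogram_counts(tips_cents, width, upper=600)
    for width in (100, 50, 25, 10)
}

# The rounding behaviour that only the finer resolutions can reveal.
whole_dollar = int(np.sum(tips_cents % 100 == 0))
half_dollar = int(np.sum(tips_cents % 100 == 50))
```

With $1 bins, a $2.00 tip and a $2.50 tip fall into the same bar, so any preference for round amounts is invisible; with 10c bins they separate, and the counts at whole and half dollars stand out as isolated peaks.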
At the right of the main plot are 4 smaller plots where tip vs total bill is plotted conditionally on Sex of the Bill Payer and Smoker (whether it was a smoking party of diners or not). Inspecting these plots reveals numerous features: (1) for smoking parties, there is almost no relationship between tip and total bill, (2) when a female non-smoker paid the bill, the tip was a very consistent percentage of the total bill, with the exception of three dining parties, (3) larger total bills were mostly paid by a male. Further drill-down into other variables, such as time of day, day of week and size of party, reveals enough information to almost locate the restaurant. This is the power of graphics.

The second example illustrates the strength of our graphical methods in detecting sparse structure in high-dimensional space. This data is from a particle physics experiment where there are 7 measured variables which describe the outcome state of the experiment. A combination of interactive brush controls and motion graphics reveals that the points lie on a structure composed of connected low-dimensional pieces: a 2D triangle, with 2 linear pieces extending from each vertex (Cook, Buja, Cabrera & Hurley 1995). Figure 1.5 shows several 2D projections of the data illustrating the shape. The points are shown in the left side views, and to the right is a wire frame diagram illustrating the shape. Various graphical tools specifically facilitated the discovery of the structure: plots of low-dimensional projections of the 7D data allowed discovery of the low-dimensional pieces, highlighting allowed the pieces to be recorded or marked, and animating many projections into a movie over time allowed the pieces to be reconstructed into the full shape. The data was already 20 years old by the time this discovery was made with visual methods, so the meaning of the structure, if there is one, is lost.
Figure 1.3: Histograms of Actual Tips with differing bar width: $1, 50c, 33c, 25c, 20c, 10c. The power of an interactive system allows the bin width to be changed with a slider. (The horizontal axis of each panel is Tips.)

Figure 1.4: (Left) Scatterplot of Total Tip vs Total Bill: more points in the bottom right indicate more cheap tippers than generous tippers. (Right) Total Tip vs Total Bill by Sex and Smoker, in four panels (Male Smokers, Female Smokers, Male Non-smokers, Female Non-smokers): there is almost no association between tip and total bill in the smoking parties, and, with the exception of 3 dining parties, when a female non-smoker paid the bill the tip was extremely consistent.

This example illustrates the use of graphics to refine the results of two numerical algorithms. The data comes from a study on Italian olive oils (Forina, Armanino, Lanteri & Tiscornia 1983). Samples from different regions were analyzed for their fatty acid content. The eight variables measure the percentage of each of the eight fatty acids found in the olive oil samples: palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic, and eicosenoic acid. For quality control purposes, it is important to be able to classify an oil sample into its region of manufacture based on its fatty acid composition.
The left plot in Figure 1.6 displays the solution of a Classification and Regression Tree (which is also similar to the linear discriminant analysis solution): the presence or absence of eicosenoic acid separates oils from region 1 from those of the other two regions, and oils from region 3 have low levels of linoleic acid, while oils from region 2 have high levels of linoleic acid. The right plot illustrates the result of rotating small amounts of oleic and arachidic acid into the projection. We demonstrate this interactively. With the small adjustment we gain a much cleaner separation of the oils from the 3 regions. It is easy to understand why CART and linear discriminant analysis don't effectively discriminate between regions 2 and 3. CART is distracted by the correlation, and linear discriminant analysis is distracted by the heterogeneous group variances.

Neural networks can do a perfect classification job here, but neural network solutions are difficult to interpret. With graphics we can understand how the network does its classification, which variables dominate the solution, and, if there are misclassifications, where the regions of uncertainty are. It is important to note that neural networks do not perform so well in the further drill-down to classify the areas within regions. With graphics, we can understand the cluster structure fairly completely, and how the numerical methods work.

The human eye has both the ability to detect small, localized departures from a trend, and the ability to ignore peripheral noise, a robust quality. It is also adept at detecting patterns using motion, especially cluster structure.

Figure 1.5: Three projections of 7D particle physics data: (left) points, (right) the corresponding wire frame illustrating the underlying structure.

Figure 1.6: (Left) The solution obtained by CART, plotted as eicosenoic against linoleic acid, leaves confusion between groups 2 and 3. (Right) A small rotation of two other variables, oleic and arachidic acid, into the projection gives a much neater solution.

1.6 Introduction to the Case Studies

(Briefly describe the range of data examples used in this book. The appendix should hold the full data descriptions.)

1.7 Getting Started with XGobi

XGobi is publicly available visualization software for viewing and interacting with high-dimensional data. Its primary views are scatterplots augmented with line drawings; other view types are scatterplot matrices and parallel coordinate plots. Multiple views are automatically linked; they are interactive, in the sense that they can be directly manipulated and queried. Also featured in XGobi are dynamic methods, such as high-dimensional rotations, including the grand tour and correlation tour, augmented with manual controls and automatic guidance through projection pursuit. XGobi can deal with missing values through elimination or imputation. It can also be used as a rudimentary high-dimensional drawing program.
Finally, a sister program, called XGvis, is provided for nonlinear dimension reduction, multidimensional scaling, and high-dimensional graph layout. XGobi and XGvis can be run on almost every platform, and they support interprocess communication with other packages.

Start up XGobi on the flea beetles data. What you will see is a window like that in Figure 1.7. The window is laid out in 4 regions. There is a central plot region, a control panel specific to the view mode at left, controls for variables at right (there are 6 variables in this data), and a series of menus at the top giving access to the major tools in XGobi.

Figure 1.7: Main XGobi window, displaying a central plot region, a control panel at left, controls for variables at right, and a series of menus at the top facilitating the major tools in XGobi.

Take a closer look at the series of menus at the top. These contain selections which allow switching between the major modes and tools available in XGobi. Take a look at each one. The File menu contains typical file input/output, printing and exiting choices. The View menu contains choices of major plot types (1DPlot, XYPlot, 3D Rotation, Grand Tour, and Correlation Tour), and major plot interaction methods (Scale, Brush, Identify, Line Editing, Move Points). The Tools menu contains a heterogeneous selection of tools: Hide or exclude, Subset, Smooth, Jitter, Parallel coordinates, Scatterplot Matrix, Variable transformation, Variable and Case lists, and Missing Values. The Display menu contains selections for modifying the look of the plot: Display axes/gridlines, center axes, plot the points/lines. The Info menu gives basic instructions about the help facilities.

The colors and glyphs of the points were read in from startup files, and are set to identify species of beetles in this data. On start-up, the view mode is XYPlot, so a scatterplot of the first two variables appears in the plot window.
The bar in the variable circles at right indicates how the variables are displayed in the plot: tars1 is horizontal and tars2 is vertical. Click on an empty circle with the left mouse button to change the horizontal variable. Click with the middle mouse button (ALT-Left on a two button PC mouse) on an empty circle to change the vertical variable. This can be animated to cycle through all pairs of variables, using the control panel at left. Click on Cycle to start cycling, and drag the scrollbar to change the speed of cycling. Fix X keeps the horizontal variable the same during cycling, and similarly Fix Y keeps the vertical variable the same.

Figure 1.8: Main XGobi window in 1DPlot View mode, with jitter control panel.

Change to 1DPlot view mode. Notice that the control panel at left changes. Each View mode has a unique control panel. The plot window now displays an average shifted histogram of the variable tars1. The scrollbar labelled "ASH Smoothness" controls the smoothness of the density; dragging it to the right smooths the plot. Using the left mouse button on the variable circles allows switching the variable in the plot. If you select the variable aede2, you'll notice it is discrete: there are just 9 distinct values. Go into the Tools menu and select Jitter. A control panel (Figure 1.8) pops up in a separate window which allows you to add small amounts of random noise to the values. In this case it allows us to peek at what is plotted underneath.

Change to Grand Tour view mode. The points in the plot window now rotate in the first 3 variables. The lines in the variable circles track the rotation, indicating how each variable contributes to the projection displayed in the plot.

Figure 1.9: Main XGobi window in Grand Tour mode, with projection pursuit guidance window.

The control panel is huge! We'll explain its controls in more detail below. In the Display menu, set Center axes in 3D+ modes to off, to shift the axes out of the center.
Click on the variable circles with either the left or middle mouse button to toggle variables in and out. (If you reduce the number of variables to 2, the tour will stop, and you will need to restart it with the Pause button when more variables are added.) When more than 3 variables are included, the grand tour in XGobi displays a continuous random sequence of 2D projections of the data. This is particularly useful for this data, as the 3 species separate out very neatly in some projections when all variables are included in the tour, and the motion of points further indicates 3 different patterns corresponding to the 3 species.

The control panel contains three different tour methods: the grand tour, the guided tour and the manual tour. The guided tour is activated by clicking on ProjPrst and Optimz. Choose the Holes index from the menu of indices, and watch the guided tour find a very nice projection of the data identifying the 3 species. A separate plot window tracks the value of the Holes index. From the axes, you can see the separation is due primarily to 4 variables, tars1, tars2, aede1, aede3 (Figure 1.9). Also, in terms of principal component axes, the separation is due to PC1, PC2 and PC6. Turn ProjPrst off.

Now focus on the manual tour. With the tour paused, drag the mouse in the plot window with the left button held down to change the coefficient of tars1 in the projection. The thin inner circle in the variable circle panel indicates which variable is to be manipulated. It can be changed by holding the Shift key down and clicking on a variable circle.

Change to Brush mode. The brush control panel allows choice of color, glyph, point or line brushing, and hiding and excluding of points. Use the color menu to select a new color. The rectangle in the corner is the brush. Moving the mouse around the screen with the left button depressed moves the brush, and points that are under the brush are painted the new color. This is transient brush mode.
This mode is more useful when multiple XGobi windows are visible on the screen and linked, so that the colors in each plot also change according to the brush action in one. Click on Persistent to make the brush permanently change the color of the points. This kind of highlighting is used to mark features in the data. The middle mouse button allows the size of the brush to be changed.

Click on the Hide or exclude panel (Figure 1.10). Choosing a color/glyph combination as Hidden hides it from the plot region. Choosing exclude removes it from the scale calculations, too, so the plot will rescale when groups are excluded.

In the Tools menu select Variable transformation. This panel allows you to interactively transform variables or groups of variables, and make some major changes such as sorting or permuting values (Figure 1.11). Sorting is neat because it allows QQ-plots to be generated on the fly.

Now is a good time to play around with selecting modes and tools, exploring the functionality of XGobi, and seeing whether you can break it.

Figure 1.10: Main XGobi window in Brush mode, with Hide or exclude control panel.

Figure 1.11: QQ-plot of aede1 and aede3, obtained by using the variable transformation controls.
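
The remark that sorting allows QQ-plots to be generated on the fly rests on a simple fact: when two variables with the same number of values are each sorted independently, the i-th sorted value of one is paired with the i-th sorted value of the other, and those pairs are matching empirical quantiles. A minimal sketch follows; the function name and the small arrays are hypothetical, and nothing here is taken from XGobi's own implementation, which the text does not describe.

```python
import numpy as np

def qq_pairs(a, b):
    """Pair the sorted values of two equal-length variables.

    Plotting column 0 against column 1 gives an empirical QQ-plot.
    """
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    if a.shape != b.shape:
        raise ValueError("both variables need the same number of values")
    return np.column_stack([a, b])

# Hypothetical values standing in for two measured variables.
first = [3.0, 1.0, 2.0]
second = [30.0, 10.0, 20.0]
pairs = qq_pairs(first, second)
```

If the two variables have the same distributional shape, the paired points fall close to a straight line; curvature or steps in the plot indicate differences in shape.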