Chapter 9 Inference for Data Visualization

Good revealing plots often provoke the question "Is what we see really there?" To date, it has been very difficult to address this question, but if inference is possible with numbers, why not with visual features? To begin, we need to understand what "really there" really means. This chapter develops the concepts and describes approaches for making inference with pictures. It discusses ways to overcome the subjectiveness of the eye and the tendency to overinterpret structure.

9.1 Really There?

Adapting from the statistical testing literature, we could consider "really there" to mean:

Under scenarios where the underlying feature is absent, the visible feature in the data is too unlikely to have arisen by chance.

In classical hypothesis testing language, the null hypothesis would be that the "underlying" feature is absent, and the alternative hypothesis would be that the underlying feature is present. The test statistic would be the visible feature itself. The problem in exploratory data analysis is that we don't know what feature we'll detect, so we have to include them all, which leads to:

Null hypothesis: "absence of all features."
Alternative: "presence of some features."

For example, in a simple linear regression scenario with two variables X, Y, examine the plots in Figure 9.1. We are naturally interested in dependence between X and Y, so the natural null hypothesis is the independence of X and Y. With graphics we detect not only a linear trend but virtually any other trend (nonlinear, decreasing, discontinuous, ...) as well. That is, we easily detect many different types of dependence with visual methods.

Figure 9.1: Dependence between X and Y.

Figure 9.2: Independence between X and Y.

However, the eye can be easily distracted. If we are interested in dependence between X and Y we must try to ignore marginal structure. The plots in Figure 9.2 differ only in the marginal structure of X.
X and Y are independent; there is no dependent structure. In general, it may be difficult to tailor visual detection to the structure of interest. It depends on defining the null scenario clearly, and on understanding human perception better.

9.2 Significance Levels

A recipe to establish a visual significance level is as follows:

1. If a null hypothesis can be simulated, create a large number (N - 1) of views of simulated null data.
2. Randomly insert the view of the actual data, to give N views.
3. Ask an uninvolved person to select the most special-looking view.
4. If the selected view shows the actual data, the existence of a feature is significant at the level α = 1/N.

There are several easy null scenarios to generate:

1. Any univariate distributional assumption, for example normality, to test a univariate structure.
2. For independence assumptions, shuffle the X-values against the Y-values, as in a permutation test.
3. For exact tests (null hypotheses with Neyman structure), simulate the conditional distribution given the sufficient statistic.

9.3 Case Study: Baker Field data

9.3.1 Examining Yield and Boron

This is the data used as a case study in the chapter on space and time variables. At the time we weren't sure whether yield and boron were related or not. Which of the plots in Figure 9.3 is the real plot of yield against boron? Can you tell? The answer is in the appendix.

Figure 9.4 (left plot) displays a QQ-plot of Boron against quantiles from a normal distribution, where the transformation panel was used to sort the Boron values. Taking a log transformation makes the variable appear very close to a normal sample (right plot). These plots were made possible by appending the theoretical quantile values from a normal distribution to the data file, and by the two-stage transformation panel in XGobi.

9.3.2 Exercises

1. For the nutrient Sodium (Na) use the permute transformation to determine if the relationship with Yield is real or not.
2.
For the nutrient log(Copper) use the permute transformation to determine if the relationship with Yield is real or not.
3. Examine the normality of the nutrient variables, and yield, using the variable containing the normal quantiles and the transformation panel.

Figure 9.3: Which is the real plot of Yield vs Boron?

Figure 9.4: QQ-plots of Boron, and log(Boron), made by appending normal quantiles to the data file and the transformation panel in XGobi. (Axes: Norm Quant, with sort(B) in the left plot and sort(ln(B)) in the right plot.)

9.3.3 Discussion about the Structure in Boron

The real plot of Yield against Boron is the one in the top row, second from the right. It is different from the others in two ways. First, there is the outlier at the bottom right of this plot, case 200, identified in Chapter 4; none of the other plots has a value so extreme. Second, the sharpness of the skewness, or the lack of points in the bottom right, makes it different from the other plots. It is somewhat similar to the left plot and the right plot in the top row, but has a more pronounced structure. Judging the reality of structure in the presence of skewness presents more difficulties than in symmetric data, because of confounding with sample size.
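
The recipe of Section 9.2, combined with the permutation null suggested there for independence assumptions, can be sketched in code. What follows is a minimal illustration in Python, not the XGobi procedure used in this chapter. The function names make_lineup and lineup_verdict, the default of 20 views, and the seed parameter are assumptions introduced here for illustration. Plotting is left out; each returned view would be drawn as one scatterplot.

```python
import random


def make_lineup(x, y, n_views=20, seed=None):
    """Build n_views datasets: n_views - 1 permutation nulls plus the actual data.

    Hypothetical helper (not from the chapter). Returns (views, true_position);
    true_position must be kept hidden from the person judging the plots.
    """
    if n_views < 2:
        raise ValueError("need at least one null view plus the actual data")
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    rng = random.Random(seed)

    views = []
    for _ in range(n_views - 1):
        # Step 1: simulate the null by shuffling the X-values against the
        # Y-values. Both marginals are unchanged; any dependence is broken.
        shuffled_x = list(x)
        rng.shuffle(shuffled_x)
        views.append((shuffled_x, list(y)))

    # Step 2: randomly insert the view of the actual data, giving n_views views.
    true_position = rng.randrange(n_views)
    views.insert(true_position, (list(x), list(y)))
    return views, true_position


def lineup_verdict(selected, true_position, n_views):
    """Steps 3 and 4: compare the viewer's pick with the hidden true position.

    Returns (is_significant, alpha), where alpha = 1 / n_views.
    """
    alpha = 1.0 / n_views
    return selected == true_position, alpha
```

With 20 views, a correct pick by the uninvolved person corresponds to significance at the level 1/20 = 0.05. Because only X is shuffled, every view shares the same marginal distributions, so the viewer cannot identify the actual data from marginal structure alone.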