Chapter 9
Inference for Data Visualization
Good revealing plots often provoke the question "Is what we see really
there?" To date, it has been very difficult to address this question, but
if inference is possible with numbers, why not with visual features? To begin,
we need to understand what "really there" really means. This chapter develops
the concepts and describes approaches for making inference with pictures. It
discusses ways to overcome the subjectivity of the eye and the tendency to
overinterpret structure.
9.1 Really There?
Adapting from the statistical testing literature, we could consider "really there" to
be:
Under scenarios where the underlying feature is absent, the visible
feature in the data is too unlikely to have arisen by chance.
In the language of classical hypothesis testing, the null hypothesis would
be that the "underlying" feature is absent, and the alternative hypothesis would be
that the underlying feature is present. The test statistic would be the visible
feature itself. The problem in exploratory data analysis is that we don't know
in advance what feature we'll detect, so we have to include them all, which leads to:
Null hypothesis: "absence of all features."
Alternative: "presence of some features."
For example, in a simple linear regression scenario with two variables X,
Y, examine the plots in Figure 9.1. We are naturally interested in dependence
between X and Y, so the natural null hypothesis is the independence of X and Y.
With graphics we detect not only a linear trend but virtually any other trend
(nonlinear, decreasing, discontinuous, ...) as well. That is, visual methods
easily detect many different types of dependence.
Figure 9.1: Dependence between X and Y.
Figure 9.2: Independence between X and Y.
However, the eye can be easily distracted. If we are interested in dependence
between X and Y, we must try to ignore marginal structure. The plots in Figure
9.2 differ only in the marginal structure of X. X and Y are independent; there is
no dependence structure.
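Data of this kind are easy to simulate. The following is a minimal sketch, not the procedure used to make Figure 9.2: the three marginal shapes for X, the sample size, and the helper name `pearson` are all assumptions chosen for illustration. Y is drawn without reference to X, so the two are independent by construction, whatever the marginal shape of X.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation, written out to keep the sketch self-contained."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(1)  # fixed seed so the sketch is reproducible
n = 2000                # assumed sample size

# Y is generated without reference to X: independent by construction.
y = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Three hypothetical marginal shapes for X (assumptions, not from the text).
x_marginals = {
    "uniform": [rng.random() for _ in range(n)],
    "skewed": [rng.expovariate(1.0) for _ in range(n)],
    "bimodal": [rng.gauss(rng.choice((-2.0, 2.0)), 0.5) for _ in range(n)],
}

# The correlation only measures linear dependence, but it should be near zero here.
correlations = {name: pearson(xs, y) for name, xs in x_marginals.items()}
```

Plotting each version of X against y gives scatterplots that look quite different from one another even though none of them contains any dependence.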
In general, it may be difficult to tailor visual detection to the structure of
interest. It depends on defining the null scenario clearly and on understanding
human perception better.
9.2 Significance Levels
A recipe to establish a visual significance level is as follows:
1. If a null hypothesis can be simulated, create a large number (N - 1) of
views of simulated null data.
2. Randomly insert the view of the actual data, to give N views.
3. Ask an uninvolved person to select the most special looking view.
4. If the selected view shows the actual data, the existence of a feature is
significant at the level α = 1/N.
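The bookkeeping in this recipe can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not software described in this chapter: the function names `build_lineup` and `visual_significance`, the callable `simulate_null`, and the example data are all hypothetical. Step 3 needs a human observer and is not coded.

```python
import random


def build_lineup(real_data, simulate_null, n_views, rng):
    """Return (views, real_position) for a lineup of n_views data sets.

    simulate_null is a hypothetical callable that draws one data set
    under the null hypothesis, using the random generator it is given.
    """
    # Step 1: N - 1 simulated null data sets.
    views = [simulate_null(rng) for _ in range(n_views - 1)]
    # Step 2: insert the real data at a random position, giving N views.
    real_position = rng.randrange(n_views)
    views.insert(real_position, real_data)
    return views, real_position


def visual_significance(selected_position, real_position, n_views):
    """Step 4: return the level 1/N if the observer picked the real view, else None."""
    if selected_position == real_position:
        return 1.0 / n_views
    return None


rng = random.Random(0)

# Made-up data with an obvious trend, for illustration only.
real = [(x, 2 * x + rng.gauss(0, 1)) for x in range(30)]


def null_sim(r):
    """Null data for the example: Y is pure noise, unrelated to X."""
    return [(x, r.gauss(0, 1)) for x in range(30)]


views, pos = build_lineup(real, null_sim, 20, rng)
```

If the real data were just another draw from the null, the observer would be equally likely to pick any of the N views, so the chance of picking the real one would be 1/N. With N = 20 that corresponds to the 0.05 level.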
There are several easy null scenarios to generate:
1. Any univariate distributional assumption, for example, normality, to test
a univariate structure.
2. For independence assumptions, shuffle the X-values against the Y-values,
as in a permutation test.
3. For exact tests (null hypotheses with Neyman structure), simulate the conditional distribution given the sufficient statistic.
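The first two scenarios can be sketched as follows. The function names and the example values are hypothetical, and using the sample mean and standard deviation as the parameters of the normal null is an assumption of this sketch, since the text does not say how the parameters are chosen. The third scenario depends on the model at hand and is not sketched.

```python
import random
import statistics


def normal_null(values, rng):
    """Scenario 1: a sample of the same size from a normal distribution.

    Assumption: the mean and standard deviation are estimated from the data.
    """
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [rng.gauss(mu, sigma) for _ in values]


def permutation_null(xs, ys, rng):
    """Scenario 2: shuffle the X-values against the Y-values.

    Both marginal distributions are preserved exactly; only the
    pairing between X and Y is broken.
    """
    shuffled = list(xs)  # copy so the caller's data is left untouched
    rng.shuffle(shuffled)
    return list(zip(shuffled, ys))


rng = random.Random(42)

# Made-up values for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]

null_pairs = permutation_null(xs, ys, rng)
null_sample = normal_null(ys, rng)
```

Because shuffling keeps both marginals fixed, any difference between the real plot and a permuted one can only come from the relationship between X and Y.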
9.3 Case Study: Baker Field data
9.3.1 Examining Yield and Boron
This is the data used as a case study in the chapter on space and time variables.
At the time we weren't sure whether yield and boron were related or not. Of
the plots in Figure 9.3, which is the real plot of yield against boron? Can you
tell? The answer is in the appendix.
Figure 9.4 (left plot) displays a QQ-plot of Boron against quantiles from a normal
distribution, where the transformation panel was used to sort the Boron values.
Taking a log transformation makes the variable appear very close to a normal
sample (right plot). These plots were made possible by appending the theoretical
quantile values from a normal distribution to the data file, and by using the
two-stage transformation panel in XGobi.
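The same construction can be sketched outside XGobi. This is an illustration, not XGobi's implementation: the plotting positions (i - 0.5)/n are an assumption, since the text does not say which positions were used to compute the appended quantiles, and the values in `boron` are made up.

```python
import math
from statistics import NormalDist


def normal_quantiles(n):
    """Standard normal quantiles at the assumed plotting positions (i - 0.5) / n."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]


def qq_pairs(values, log=False):
    """Pair each theoretical quantile with the corresponding sorted data value.

    With log=True the values are log-transformed before sorting, which
    mirrors the two stages (transform, then sort) described in the text.
    """
    if log:
        values = [math.log(v) for v in values]
    return list(zip(normal_quantiles(len(values)), sorted(values)))


# Made-up, right-skewed values standing in for Boron; not the Baker Field data.
boron = [2.1, 2.4, 2.6, 3.0, 3.3, 3.9, 4.8, 6.5, 10.7]

pairs_raw = qq_pairs(boron)            # analogue of the left plot
pairs_log = qq_pairs(boron, log=True)  # analogue of the right plot
```

Plotting the second element of each pair against the first gives the QQ-plot; points lying close to a straight line suggest the sample is close to normal.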
9.3.2 Exercises
1. For the nutrient Sodium (Na) use the permute transformation to determine if the relationship with Yield is real or not.
2. For the nutrient log(Copper) use the permute transformation to determine
if the relationship with Yield is real or not.
3. Examine the normality of the nutrient variables, and yield, using the variable containing the normal quantiles, and the transformation panel.
Figure 9.3: Which is the real plot of Yield vs Boron?
Figure 9.4: QQ-plots of Boron, and log(Boron) made by appending normal
quantiles to the data file and the transformation panel in XGobi. Axes: Norm Quant
against sort(B) (left plot) and against sort(ln(B)) (right plot).
9.3.3 Discussion about the Structure in Boron
The real plot of Yield against Boron is the one in the top row, second from the
right. It is different from the others in two ways. First, there is the outlier
at the bottom right of this plot, case 200, identified in Chapter 4. None of the other
plots has a value so extreme. Second, the sharpness of the skewness, or the
lack of points in the bottom right, makes it different from the other plots. It is
somewhat similar to the left plot and the right plot in the top row, but has a
more pronounced structure.
Judging the reality of structure in the presence of skewness presents more difficulties than in symmetric data, because of confounding with sample size.