Building A Better Line: InterQuartile Lines

advertisement
Building A Better Datatrap: Quartile-Quartile Lines
Given a set of (x, y) data, the most common interpretation of the idea of finding
the line that is the best fit to the data, and hence the basis of the best linear model, is the
regression line. It is based on the idea of finding the one line that comes closest to all of
the data points in the sense that the sum of the squares of the vertical distances between
the points and the line is a minimum. However, one of the major drawbacks of the
regression line is the fact that the sum of the squares is very sensitive to the presence of
outliers – data points that are relatively far from the line. Any such points contribute
greatly to the sum of the squares and hence have a disproportionate effect on the
parameters of the regression line.
The alternative that is used in many classes is the median-median line, which is
also built into the curve fitting routines of most graphing calculators. Like the median
itself, the median-median line is not affected by a few outliers and so is much more
robust. Also, since use of the median is stressed in modern statistics education, it seems a
natural way to approach the idea of creating a line to fit a set of data. Unfortunately, the
rationale behind the median-median line is not something that is easy to explain to
students, either at the secondary or the beginning college level. Instead, it is likely often
applied either without explanation or without understanding even when it is explained.
Before continuing, we briefly review the rationale used for the TI calculator
routine for the median-median line. First, the data set must be ordered so that the xvalues are in increasing order; the corresponding y-values are “carried along”. Second,
the data set is then subdivided into three more or less equal groups; the highest and the
smallest groups have the same number of entries (equal to INT((n + 2)/3 + 0.5), where
INT is the greatest integer function). For example, if there are n = 20 data points, 7
points would be assigned to the highest and smallest groups and the remaining 6 points
would be assigned to the middle group. Third, the median of the x-values in each of the
three groups is calculated and the line through the first (x1, y1) and the third (x3, y3) of the
resulting points is calculated. Fourth, this line is then translated one-third of the distance
D from the line toward the median point (xM, yM) of the middle group while keeping the
slope the same. (See Figure 1.) The resulting line is the median-median line.
Figure 1: The Median-Median Line
Personally, the present author would not expect most of his students to understand
this to the extent of being able to use the idea effectively without the use of a calculator
routine. Even for those who do understand the reasoning, the amount of calculations
entailed in implementing the procedure by hand or with simple computational support is
not reasonable. Thus, the use of the median-median line likely turns into a buttonpushing routine without any underlying understanding beyond the fact that one is trying
to find a line that fits well to a set of data.
In this article, we present a different approach to constructing such a line, one that
is considerably simpler both to explain and to implement, one that appears to be as
effective in fitting a line to data as either the least-squares line or the median-median line,
and one that also builds on ideas and methods that students will have seen previously.
The Quartile-Quartile Line
Rather than focusing on the median, we look at the first and third quartiles of the data. In
particular, we find these two quartiles for the x-values and the y-values separately. They
can be found under the STAT menu of any of the TI graphing calculators (select CALC
and then the one-variable statistics option for each list). Alternatively, with Excel,
assuming there are 50, say, data points in columns A and B, simply type, in any desired
empty cell in the spreadsheet, “=quartile(A1:A50,1)” for the 1st quartile and
“=quartile(A1:A50,3)” for the 3rd quartile of the x-values). Suppose we call the resulting
points (x1, y1) and (x3, y3). We call the line that passes through these two points the
quartile-quartile line or perhaps the InterQuartile Line. Finding its equation is a simple
application of the two-point formula for a line.
Some readers might wonder why we find the first and third quartiles of both the
x’s and the y’s. First, quartiles often are not exactly equal to one of the values, so there
would be no corresponding value for the other variable. Second, even if the x-quartile
exactly matches one of the x-values, there might be multiple corresponding y-values, and
so no unique value to be used. Third, even if there is a unique y-value, that value might
be an outlier and so would distort the resulting line.
There is one complication with the quartile-quartile line. If the data values are
trending upward, then the slope of the quartile-quartile line should be positive. If the data
values are trending downward, the slope should be negative.
However, the usual
expression for the slope,
m
y3  y1
,
x3  x1
will always be positive because the third quartile y3 is always greater than y1. Therefore,
it is always necessary to look at the scatterplot of the data to decide on whether the trend
is increasing or decreasing and then decide on whether the value for the slope should be
positive or negative. But, this is not necessarily a bad step for students because it
repeatedly reinforces several fundamental ideas. Perhaps then, it might be better to write
the formula for the slope of the quartile-quartile line as
m
y3  y1
x3  x1
to emphasize the need to make a judgment call.
There are several natural questions. First, how well does this line fit the data?
Second, how does it compare to either the least-squares line or the median-median line?
To investigate these questions, the author has developed a dynamic interactive
spreadsheet in Excel [1] that generates a random sample from an underlying bivariate
population with a choice of sample size between n = 4 and n = 40, calculates the least
squares (regression) line, the median-median line, and the quartile-quartile line for the
sample, and displays the results both graphically and numerically. The results of two
different samples of size n = 4 are shown in Figure 2a and 2b; comparable results for a
sample of size n = 24 are shown in Figure 3. The data points are shown with the black
dots, while the median points used for the median-median line are in red.
10
10
Figure 1
5
5
0
0
0
2
4
6
8
10
0
1
2
3
4
5
6
7
8
9
Figure 2
10
5
0
0
1
2
3
4
5
6
7
8
9
10
Figure 3
For the two displays in Figure 2, notice that the three lines are quite different in
one of the images, but extremely close in the other. The corresponding equations of the
three lines in Figure 2a are
y = 1.63x – 0.56 (least squares),
y = 0.42x + 1.85 (median-median),
y = 1.00x + 0.93 (quartile-quartile);
10
for those in Figure 2b, the equations are, respectively, y = 1.65x – 1.10, y = 1.79x – 1.43,
and y = 1.72x – 1.24. However, as the sample size increases, the three lines appear to be
more consistent with one another. For instance, with the larger sample size of n = 25 in
Figure 3, notice that the three lines are extremely close even though there is still a
considerable amount of spread in the data points. The corresponding equations are,
respectively, y = 1.48x – 0.14, y = 1.73x – 0.99, and y = 1.69x – 1.18.
These conclusions seem to be typical. For small sample sizes, there can be rather
extensive differences among the three lines, but as the sample size increases, the three
lines tend to be closer together. Perhaps a more telling display is based on repeated
random samples where the quartile-quartile line of each sample is drawn. In Figure 4, we
show 25 different quartile-quartile lines based on random samples of size n = 4. Notice
how much most of the lines differ from one another, because of the choice of the four
points, although many tend to be reasonably similar to the population quartile-quartile
line shown in red. In Figure 5, we similarly show 25 sample quartile-quartile lines based
on random samples of size n = 15 points. Observe how much closer all of the sample
lines are to one another, as well as how close most of them are to the population quartilequartile line. In this case, the effect of any single outlier is relatively minimal. We note
that this Excel simulation is also [2] available to interested readers and their students who
want to investigate this kind of simulation in more depth.
Sample Quartile-Quartile Lines
Population Quartile-Quartile Line in Red
Figure 4: Samples with n = 4
Sample Quartile-Quartile Lines
Population Quartile-Quartile Line in Red
Figure 5: Samples with n = 15
We note that the results here with the sample quartile-quartile lines essentially
parallels what happens with sample least-squares lines, though, if anything, the lack of
consistency among the latter tends to be considerably more extreme for small sample
sizes. Thus, the quartile-quartile line seems to be somewhat more consistent from one
sample to another.
Conclusions
The quartile-quartile line seems to be comparable to both the least-squares line
and the median-median line in its ability to serve as a linear model to fit a set of data that
fall into a roughly linear pattern. However, the fact that it is conceptually much simpler
than the median-median line makes it a very attractive alternative to the latter,
particularly in introductory courses.
References
1. Author, Comparing Regression Lines, website URL. (to be added)
2. Author, Quartile-Quartile Simulation, website URL. (to be added)
Download