Building A Better Datatrap: Quartile-Quartile Lines Given a set of (x, y) data, the most common interpretation of the idea of finding the line that is the best fit to the data, and hence the basis of the best linear model, is the regression line. It is based on the idea of finding the one line that comes closest to all of the data points in the sense that the sum of the squares of the vertical distances between the points and the line is a minimum. However, one of the major drawbacks of the regression line is the fact that the sum of the squares is very sensitive to the presence of outliers – data points that are relatively far from the line. Any such points contribute greatly to the sum of the squares and hence have a disproportionate effect on the parameters of the regression line. The alternative that is used in many classes is the median-median line, which is also built into the curve fitting routines of most graphing calculators. Like the median itself, the median-median line is not affected by a few outliers and so is much more robust. Also, since use of the median is stressed in modern statistics education, it seems a natural way to approach the idea of creating a line to fit a set of data. Unfortunately, the rationale behind the median-median line is not something that is easy to explain to students, either at the secondary or the beginning college level. Instead, it is likely often applied either without explanation or without understanding even when it is explained. Before continuing, we briefly review the rationale used for the TI calculator routine for the median-median line. First, the data set must be ordered so that the xvalues are in increasing order; the corresponding y-values are “carried along”. Second, the data set is then subdivided into three more or less equal groups; the highest and the smallest groups have the same number of entries (equal to INT((n + 2)/3 + 0.5), where INT is the greatest integer function). For example, if there are n = 20 data points, 7 points would be assigned to the highest and smallest groups and the remaining 6 points would be assigned to the middle group. Third, the median of the x-values in each of the three groups is calculated and the line through the first (x1, y1) and the third (x3, y3) of the resulting points is calculated. Fourth, this line is then translated one-third of the distance D from the line toward the median point (xM, yM) of the middle group while keeping the slope the same. (See Figure 1.) The resulting line is the median-median line. Figure 1: The Median-Median Line Personally, the present author would not expect most of his students to understand this to the extent of being able to use the idea effectively without the use of a calculator routine. Even for those who do understand the reasoning, the amount of calculations entailed in implementing the procedure by hand or with simple computational support is not reasonable. Thus, the use of the median-median line likely turns into a buttonpushing routine without any underlying understanding beyond the fact that one is trying to find a line that fits well to a set of data. In this article, we present a different approach to constructing such a line, one that is considerably simpler both to explain and to implement, one that appears to be as effective in fitting a line to data as either the least-squares line or the median-median line, and one that also builds on ideas and methods that students will have seen previously. The Quartile-Quartile Line Rather than focusing on the median, we look at the first and third quartiles of the data. In particular, we find these two quartiles for the x-values and the y-values separately. They can be found under the STAT menu of any of the TI graphing calculators (select CALC and then the one-variable statistics option for each list). Alternatively, with Excel, assuming there are 50, say, data points in columns A and B, simply type, in any desired empty cell in the spreadsheet, “=quartile(A1:A50,1)” for the 1st quartile and “=quartile(A1:A50,3)” for the 3rd quartile of the x-values). Suppose we call the resulting points (x1, y1) and (x3, y3). We call the line that passes through these two points the quartile-quartile line or perhaps the InterQuartile Line. Finding its equation is a simple application of the two-point formula for a line. Some readers might wonder why we find the first and third quartiles of both the x’s and the y’s. First, quartiles often are not exactly equal to one of the values, so there would be no corresponding value for the other variable. Second, even if the x-quartile exactly matches one of the x-values, there might be multiple corresponding y-values, and so no unique value to be used. Third, even if there is a unique y-value, that value might be an outlier and so would distort the resulting line. There is one complication with the quartile-quartile line. If the data values are trending upward, then the slope of the quartile-quartile line should be positive. If the data values are trending downward, the slope should be negative. However, the usual expression for the slope, m y3 y1 , x3 x1 will always be positive because the third quartile y3 is always greater than y1. Therefore, it is always necessary to look at the scatterplot of the data to decide on whether the trend is increasing or decreasing and then decide on whether the value for the slope should be positive or negative. But, this is not necessarily a bad step for students because it repeatedly reinforces several fundamental ideas. Perhaps then, it might be better to write the formula for the slope of the quartile-quartile line as m y3 y1 x3 x1 to emphasize the need to make a judgment call. There are several natural questions. First, how well does this line fit the data? Second, how does it compare to either the least-squares line or the median-median line? To investigate these questions, the author has developed a dynamic interactive spreadsheet in Excel [1] that generates a random sample from an underlying bivariate population with a choice of sample size between n = 4 and n = 40, calculates the least squares (regression) line, the median-median line, and the quartile-quartile line for the sample, and displays the results both graphically and numerically. The results of two different samples of size n = 4 are shown in Figure 2a and 2b; comparable results for a sample of size n = 24 are shown in Figure 3. The data points are shown with the black dots, while the median points used for the median-median line are in red. 10 10 Figure 1 5 5 0 0 0 2 4 6 8 10 0 1 2 3 4 5 6 7 8 9 Figure 2 10 5 0 0 1 2 3 4 5 6 7 8 9 10 Figure 3 For the two displays in Figure 2, notice that the three lines are quite different in one of the images, but extremely close in the other. The corresponding equations of the three lines in Figure 2a are y = 1.63x – 0.56 (least squares), y = 0.42x + 1.85 (median-median), y = 1.00x + 0.93 (quartile-quartile); 10 for those in Figure 2b, the equations are, respectively, y = 1.65x – 1.10, y = 1.79x – 1.43, and y = 1.72x – 1.24. However, as the sample size increases, the three lines appear to be more consistent with one another. For instance, with the larger sample size of n = 25 in Figure 3, notice that the three lines are extremely close even though there is still a considerable amount of spread in the data points. The corresponding equations are, respectively, y = 1.48x – 0.14, y = 1.73x – 0.99, and y = 1.69x – 1.18. These conclusions seem to be typical. For small sample sizes, there can be rather extensive differences among the three lines, but as the sample size increases, the three lines tend to be closer together. Perhaps a more telling display is based on repeated random samples where the quartile-quartile line of each sample is drawn. In Figure 4, we show 25 different quartile-quartile lines based on random samples of size n = 4. Notice how much most of the lines differ from one another, because of the choice of the four points, although many tend to be reasonably similar to the population quartile-quartile line shown in red. In Figure 5, we similarly show 25 sample quartile-quartile lines based on random samples of size n = 15 points. Observe how much closer all of the sample lines are to one another, as well as how close most of them are to the population quartilequartile line. In this case, the effect of any single outlier is relatively minimal. We note that this Excel simulation is also [2] available to interested readers and their students who want to investigate this kind of simulation in more depth. Sample Quartile-Quartile Lines Population Quartile-Quartile Line in Red Figure 4: Samples with n = 4 Sample Quartile-Quartile Lines Population Quartile-Quartile Line in Red Figure 5: Samples with n = 15 We note that the results here with the sample quartile-quartile lines essentially parallels what happens with sample least-squares lines, though, if anything, the lack of consistency among the latter tends to be considerably more extreme for small sample sizes. Thus, the quartile-quartile line seems to be somewhat more consistent from one sample to another. Conclusions The quartile-quartile line seems to be comparable to both the least-squares line and the median-median line in its ability to serve as a linear model to fit a set of data that fall into a roughly linear pattern. However, the fact that it is conceptually much simpler than the median-median line makes it a very attractive alternative to the latter, particularly in introductory courses. References 1. Author, Comparing Regression Lines, website URL. (to be added) 2. Author, Quartile-Quartile Simulation, website URL. (to be added)