MATH 2441 Probability and Statistics for Biological Sciences

The Five-Number Summary, Boxplots, and Outliers

To wrap up our coverage of methods of descriptive statistics, we present one more numerical summary method (the five-number summary) and one more graphical method (the boxplot, or box-and-whisker plot), and we address an important issue that arises in statistical data collection and analysis: what to do with unusual observations (or outliers).

The Five-Number Summary

The so-called five-number summary is most often prepared as the first step in constructing a boxplot. Most commonly, the five-number summary consists of the five values

$$\{\ \text{minimum},\ Q_1,\ \tilde{x},\ Q_3,\ \text{maximum}\ \}$$

Here, 'minimum' refers to the smallest value in the data set, 'maximum' refers to the largest value in the data set, $Q_1$ and $Q_3$ are the lower and upper quartiles respectively, and $\tilde{x}$ is the median. Some workers use the lower and upper hinges in place of the two quartiles, but the result is much the same.

This set of five numbers gives some direct insight into the distribution of the data. You know the smallest and largest values observed. The median gives you an idea of the center of the distribution, and the difference between the upper and lower quartiles (or the corresponding hinges) gives a sense of how spread out the values are.

Boxplots

Boxplots are basically a way of graphing the five-number summary. The simplest type of boxplot has the form:

[Figure: a horizontal boxplot drawn above an x-axis, with its features labelled minimum value, Q1, median, Q3, and maximum value.]

Only the horizontal scale has meaning. The features of the boxplot are located horizontally relative to that scale. In the center of the boxplot is the box. Its left edge is drawn at $x = Q_1$ and its right edge is drawn at $x = Q_3$. A vertical line through the interior of the box is drawn at $x = \tilde{x}$, the median. Also plotted are vertical lines at the positions x = minimum value and x = maximum value.
The centers of these two vertical lines are joined to the centers of the left and right edges of the box by horizontal lines. These features remind some people of whiskers, hence the alternative name "box-and-whisker plot."

David W. Sabo (1999) Boxplots Page 1 of 4

A glance at a boxplot then tells you visually the span of values between the smallest and largest observations, the interval containing the middle 50% of the observations (the box), and the location of the center of the data, as represented by the median. Further, the position of the median line inside the box gives you some sense of the symmetry of the distribution. If the median line lies to the left of the center of the box, the distribution is skewed to the right, and vice versa. Note that the horizontal width of the box is equal to the IQR (the interquartile range).

Although single boxplots by themselves are informative, probably the most common use of the boxplot is in comparing two or more sets of data. This is done by drawing a boxplot for each set of data relative to the same horizontal scale. (Some people call these side-by-side boxplots.) We will illustrate this technique in the example below. The relative horizontal positions of the side-by-side boxplots allow you to compare the general distributions of the various data sets. Unlike back-to-back stemplots, which permit the comparison of at most two sets of data, you can stack as many boxplots side-by-side as you have room for on the page.

Outliers

A difficult issue in statistical work is the question of what to do about suspicious or very unusual observations. Such unusual observations are often referred to as outliers (pronounced owt-ly-ers, from the mental image that they "lie way out," far away from the other observations). Of course, when you are aware of a procedural mistake in taking an observation or measurement, that data value should be discarded immediately.
This situation might arise when it is clear that a specimen has become contaminated, or an instrument has not been properly calibrated before use, or someone notices a certain degree of carelessness in performing the experimental operation.

Sometimes unusual or unexpected observations occur and there is no direct evidence that they are the result of a procedural blunder of some sort. Yet they may be so different from the other observations that one suspects such a mistake may have occurred. If a mistake did occur, then including that observation in the data set subjected to further analysis may introduce an error into any conclusions made later. On the other hand, it is possible that the unusual observation is not due to a mistake but reflects some unexpected property of the population. Excluding that observation may then mean missing an important aspect of the population under study.

There is no absolute solution to this problem, because that would require people to recognize mistakes they do not know they have made. The next best thing is to develop an objective rule that everybody will follow in deciding when to label an observation as an outlier. Then at least calling an observation an outlier won't depend on the mood of individual researchers from day to day.

One of the most commonly used rules for identifying outliers is based on the five-number summary and the IQR. (This is a reasonable starting point because the values of $Q_1$, $\tilde{x}$, and $Q_3$, and hence of the IQR, are reasonably insensitive to the presence of one or just a few highly unusual observations.) The so-called 1.5 IQR rule starts out by calculating the following four quantities:

$$\begin{aligned}
\text{lower inner fence} &= Q_1 - 1.5 \times \text{IQR} \\
\text{upper inner fence} &= Q_3 + 1.5 \times \text{IQR} \\
\text{lower outer fence} &= Q_1 - 3 \times \text{IQR} \\
\text{upper outer fence} &= Q_3 + 3 \times \text{IQR}
\end{aligned} \tag{BOX-1}$$

Thus, the inner fences are located a distance of 1.5 times the IQR to the left and right of the corresponding edges of the box in the boxplot.
The outer fences are located a distance of 3 times the IQR to the left and right of the corresponding edges of the box. Often dotted vertical lines are drawn at these positions on the boxplot to indicate the locations of the four fences. This is illustrated in the figure for the SalmonCa example below.

Observations which fall within the interval bounded by the inner fences are not suspicious. Observations which fall between the inner fences and the outer fences are regarded as mild outliers or possible outliers. Observations lying outside the outer fences are regarded as extreme outliers or probable outliers. Although there is no rule that says you must delete either mild or extreme outliers from your data before proceeding, most references recommend that the validity of both types of outliers be examined first, with particular effort in the case of extreme outliers.

Example: The SalmonCa Data

To illustrate what has been described so far, consider the three sets of data collected in the SalmonCa experiment described in the standard data sets document. The five-number summary and other relevant quantities for each are easily determined to be:

Quantity             SalmonCa0   SalmonCa20   SalmonCa100
minimum                  29.00        37.00         45.00
Q1                       59.75        59.00         75.00
median                   70.50        67.50         84.00
Q3                       90.75        73.25        101.00
maximum                 129.00        86.00        120.00
IQR                      31.00        14.25         26.00
lower inner fence        13.25        37.63         36.00
upper inner fence       137.25        94.63        140.00
lower outer fence       -33.25        16.25         -3.00
upper outer fence       183.75       116.00        179.00

The resultant boxplots are:

[Figure: side-by-side boxplots titled "SalmonCa" for SalmonCa100, SalmonCa20, and SalmonCa0, drawn against a common horizontal axis labelled "ppm Ca in fillets" that runs from -50.00 to 200.00.]

Compare the numbers in the table above with the positions of the various elements in the chart. The fences are rendered as dotted vertical lines, and the boxplots themselves are rendered in solid lines.
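The fence values in the table above can be checked with a short calculation. The following is a minimal sketch in Python, assuming only the formulas labelled (BOX-1); the function name `fences` and the dictionary layout are illustrative choices, not part of the original handout. The Q1 and Q3 values are copied from the table.

```python
def fences(q1, q3):
    """Return the IQR and the four fences of the 1.5 IQR rule (BOX-1)."""
    iqr = q3 - q1
    return {
        "IQR": iqr,
        "lower inner fence": q1 - 1.5 * iqr,
        "upper inner fence": q3 + 1.5 * iqr,
        "lower outer fence": q1 - 3 * iqr,
        "upper outer fence": q3 + 3 * iqr,
    }


# (Q1, Q3) pairs copied from the SalmonCa table.
quartiles = {
    "SalmonCa0": (59.75, 90.75),
    "SalmonCa20": (59.00, 73.25),
    "SalmonCa100": (75.00, 101.00),
}

for name, (q1, q3) in quartiles.items():
    print(name)
    for label, value in fences(q1, q3).items():
        # Three decimals are shown so that unrounded values are visible.
        print(f"  {label:<18}{value:>9.3f}")
```

For SalmonCa20 the unrounded inner fences are 37.625 and 94.625; the table reports them to two decimal places as 37.63 and 94.63.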
(We've let the horizontal axis extend into negative values here so that we could plot the two lower outer fence values that turn out to be negative. Of course, in this context, the ppm of Ca in salmon fillets could never be negative. Such observations would have to be the result of a blunder of one sort or another!)

You can see immediately that there is a considerable degree of overlap in values among the three data sets. In the context of the original experiment, this result would tell the technologist that the amount of calcium present in the processed salmon fillets does not appear to be strongly affected by the concentration of chlorine dioxide used to sanitize the fillets. The fillets treated with 20 ppm ClO2 seem to have a less variable Ca content (an IQR of 14.25, compared with 31.00 and 26.00) and, on average, a somewhat lower one than both the unsanitized fillets and those treated with 100 ppm ClO2, but the difference is very slight.

There is only one outlier visible in these three boxplots, and that is the minimum value observed in the SalmonCa20 data set. The lower inner fence has a value of 37.63, and the smallest value observed for those fillets was 37 ppm, which lies just below that fence. This observation is thus the mildest of mild outliers, and probably does not require further examination.

Before leaving this example, make sure you understand how each of the features in the figure above was obtained.

Finally, we mention two variations on the procedure described and illustrated above. First, it is not uncommon for people to use the values of the hinges instead of the quartiles in constructing a boxplot. In the treatment of outliers, the interquartile range would then be replaced by the so-called H-spread, the difference between the upper and lower hinges. Since the hinge values are often very similar to the quartile values, this variation results in diagrams very similar to those illustrated here.
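The H-spread can be sketched in a few lines of Python. This handout does not restate how hinges are found, so the sketch assumes Tukey's usual definition: the hinges are the medians of the lower and upper halves of the ordered data, with the overall median counted in both halves when the number of observations is odd. The function name `hinges` and the sample values are hypothetical and are not taken from the SalmonCa data.

```python
from statistics import median


def hinges(values):
    """Return (lower hinge, upper hinge), assuming Tukey's definition."""
    xs = sorted(values)
    n = len(xs)
    # Size of each half; the middle value is shared when n is odd.
    half = (n + 1) // 2
    return median(xs[:half]), median(xs[n - half:])


# Hypothetical sample, for illustration only.
sample = [12, 40, 44, 47, 50, 52, 55, 58, 61, 75, 120]

lower_hinge, upper_hinge = hinges(sample)
h_spread = upper_hinge - lower_hinge
print(lower_hinge, upper_hinge, h_spread)
```

For this sample the hinges are 45.5 and 59.5, so the H-spread is 14.0. The fences would then be computed exactly as before, with the hinges in place of Q1 and Q3 and the H-spread in place of the IQR.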
A second variation in the way boxplots are drawn involves making outliers a bit more explicit. First, define the most extreme data values which are still inside the inner fences to be the lower adjacent value and the upper adjacent value, respectively. The whiskers of the boxplot are drawn to these adjacent values rather than to the minimum and maximum values (though, of course, if there are no outliers, then the adjacent values are the minimum and maximum values). Then the outliers are plotted as individual points, sometimes with the extreme outliers denoted by different symbols than the mild outliers.
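The adjacent values and the two classes of outliers can be found mechanically once Q1 and Q3 are known. The following is a minimal sketch in Python under several assumptions that are not part of the handout: the sample values are hypothetical, the quartiles come from the default method of `statistics.quantiles` (which may differ slightly from the convention used in this course), and an observation lying exactly on a fence is treated as being inside it. The function name `classify` is illustrative.

```python
from statistics import quantiles


def classify(values, q1, q3):
    """Find the adjacent values and the mild and extreme outliers."""
    iqr = q3 - q1
    lower_inner, upper_inner = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    lower_outer, upper_outer = q1 - 3 * iqr, q3 + 3 * iqr

    # Assumption: a value exactly on a fence counts as inside that fence.
    inside = [v for v in values if lower_inner <= v <= upper_inner]
    mild = [v for v in values
            if lower_outer <= v < lower_inner or upper_inner < v <= upper_outer]
    extreme = [v for v in values if v < lower_outer or v > upper_outer]

    return {
        "lower adjacent value": min(inside),
        "upper adjacent value": max(inside),
        "mild outliers": sorted(mild),
        "extreme outliers": sorted(extreme),
    }


# Hypothetical sample, for illustration only (not the SalmonCa data).
sample = [12, 40, 44, 47, 50, 52, 55, 58, 61, 75, 120]

# Quartile convention is an assumption: Python's default 'exclusive' method.
q1, _, q3 = quantiles(sample, n=4)
result = classify(sample, q1, q3)
for label, value in result.items():
    print(f"{label}: {value}")
```

Here Q1 = 44 and Q3 = 61, so the inner fences are 18.5 and 86.5 and the outer fences are -7 and 112. The whiskers would be drawn to the adjacent values 40 and 75, with 12 plotted as a mild outlier and 120 as an extreme outlier.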