2.3 Measures of Center and Spread

In this section you will learn to compute exact values of the same summary statistics you estimated _______________ in the previous sections.

Measures of Center

The Mean of a sample: $\bar{x}$ ("x-bar"), the "average value"

$\bar{x} = \frac{\text{sum of all the } x \text{ values}}{\text{number of values}} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x}{n}$

From a frequency table (discrete data): $\bar{x} = \frac{\sum x \cdot f}{n}$, where $f$ is the frequency of the value $x$

From a frequency table (continuous data): $\bar{x} = \frac{\sum x \cdot f}{n}$, using the midpoint of each class interval as $x$

The mean can be estimated visually on a dot plot or histogram by finding the __________ ________ of the distribution (where you would have to place a finger below the horizontal axis in order to balance the distribution).

The Median: M, the "_________ value"
1. Arrange all the values in order.
2. If the number of values is odd, the median is the middle one. The middle value is in position $\frac{n+1}{2}$.
3. If the number of values is even, the median is the average of the _____ ________ _________.

Visually, the median is the value on the horizontal axis that separates the histogram into two parts of equal area, with 50% of the total area on each side.

Mean v. Median

The mean and median of a _______________ distribution are close together. If the distribution is exactly symmetric, the mean and median are ____________.

In a skewed distribution, the values in the "tail" pull the mean up or down, so the mean generally lies farther out in the tail than the median. In fact, the mean can be very sensitive to the presence of even a single outlier, which can make it suspect as a measure of center.

Example: 40 students were enrolled in a course at Cal Poly. One month after the course began, the instructor requested a report that indicated how many times each student had accessed a web page on the class site. The 40 observations, in increasing order, were:

 0   0   0   0   0   0   3   4   4   4
 5   5   7   7   8   8   8  12  12  13
13  13  14  14  16  18  19  19  20  20
21  22  23  26  36  36  37  42  84 331

a) Compute the values of the mean and the median of this data set.
b) Of the mean and median, which does the better job of describing a typical value for this data set? Explain.

Measures of Spread

Two AP Statistics classes took the same test. Here are their results:

Class 1: 78, 78, 78, 78, 78, 78, 78, 78, 78, 78
Class 2: 60, 64, 66, 74, 77, 79, 84, 90, 92, 94

What are the median and mean scores for both classes?

Mean Class 1:          Median Class 1:
Mean Class 2:          Median Class 2:

Can you conclude that the classes performed in the same way given only these measures of their centers?

The Range

The range is the simplest measure of variability. It is defined as:

range = largest value – smallest value

The Interquartile Range: IQR

The IQR is a measure of variability that is resistant to the effects of outliers. It is based on ___________.
1. Arrange the values in increasing order.
2. Find the median (Q2).
3. Find the quartiles:
   lower quartile (Q1) = median of the lower half
   upper quartile (Q3) = median of the upper half
   If there is an odd number of values, the median is excluded from both halves when finding the quartiles.
   Note: there is no standard rule for finding the quartiles, so different statistical software packages use different procedures that can give slightly different values.
4. Calculate the IQR:
   interquartile range = upper quartile – lower quartile

Five-Number Summaries

The following collection of summary measures is often referred to as the five-number summary.
1. Minimum: the smallest value.
2. Lower quartile: the median of the lower half of the ordered values.
3. Median: the value that divides the ordered values into halves.
4. Upper quartile: the median of the upper half of the ordered values.
5. Maximum: the largest value.

These values give a reasonably complete description of center and spread. They also lead to another visual representation of a distribution, the boxplot.
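The quartile procedure and the five-number summary described above can be sketched in Python as a way to check hand computations. This is only a minimal sketch: the function name `five_number_summary` is made up for illustration, it assumes at least two values, and it follows the "median of each half" rule from these notes, so other software may report slightly different quartiles.

```python
from statistics import mean, median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule.

    Assumes at least two values (an assumption of this sketch).
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # lower half of the ordered values
    upper = ordered[half + (n % 2):]  # skips the median itself when n is odd
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

# Test scores for the two classes from the notes.
class_1 = [78] * 10
class_2 = [60, 64, 66, 74, 77, 79, 84, 90, 92, 94]

for name, scores in (("Class 1", class_1), ("Class 2", class_2)):
    low, q1, med, q3, high = five_number_summary(scores)
    print(name)
    print("  mean:", mean(scores), " median:", med)
    print("  range:", high - low, " IQR:", q3 - q1)
    print("  five-number summary:", (low, q1, med, q3, high))
```

Running it on both classes prints the center and spread measures side by side, which makes it easy to compare them with the values you computed by hand.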
Boxplots

A boxplot is a compact display that provides information about the center, spread, and symmetry or __________ of the data. There are two types of boxplots: the regular boxplot and the ____________ boxplot (the latter is the one we typically use).

Regular Boxplot

Modified Boxplot

Like the regular boxplot, except the whiskers extend only to the largest and smallest non-outliers. Any outliers appear as individual dots or other symbols.

An outlier is a value that is more than 1.5 times the IQR from the nearest quartile, that is, below Q1 – 1.5 ∙ IQR or above Q3 + 1.5 ∙ IQR. Other methods of identifying outliers exist, but the 1.5 ∙ IQR method is the most common.

The Standard Deviation of a sample: s

The most common measures of variability describe the extent to which the values deviate from the mean. The deviations from the sample mean are the differences

$(x_1 - \bar{x}),\ (x_2 - \bar{x}),\ (x_3 - \bar{x}),\ \ldots,\ (x_n - \bar{x})$

It is always true that $\sum (x - \bar{x}) = 0$, so the _____________ deviation cannot be used as a measure of variability. Instead, the deviations are squared to prevent the negative and positive deviations from "cancelling out" when summed (absolute values would also work, but squares are much easier to work with).

$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$          $s^2$ = variance

From a frequency table: $s = \sqrt{\frac{\sum (x - \bar{x})^2 \cdot f}{n - 1}}$

Why divide by n – 1?

Essentially, the reason is to adjust for working with a ___________. The population standard deviation is estimated by the sample standard deviation. Because the deviations are measured from the sample mean rather than the population mean, the variability in a random sample tends to be less than in the entire population. Thus, you divide by n – 1 rather than n to ___________ the estimate of the population standard deviation a bit. As the sample size increases, it makes little difference whether you divide by n or by n – 1.
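The two standard deviation formulas above can be sketched in Python. This is an illustrative sketch, not a prescribed method: the function names `sample_sd` and `sample_sd_from_table` are hypothetical, the frequency table is assumed to be a dictionary mapping each value x to its frequency f, and the data are the Class 2 test scores from earlier in this section.

```python
import math
from collections import Counter
from statistics import stdev

def sample_sd(values):
    """Sample standard deviation: square root of sum of (x - x_bar)^2 over (n - 1)."""
    n = len(values)
    x_bar = sum(values) / n
    deviations = [x - x_bar for x in values]
    # The deviations themselves always sum to zero (up to rounding),
    # which is why they are squared before being added up.
    return math.sqrt(sum(d ** 2 for d in deviations) / (n - 1))

def sample_sd_from_table(table):
    """Same formula from a frequency table {x: f}, weighting each square by f."""
    n = sum(table.values())
    x_bar = sum(x * f for x, f in table.items()) / n
    total = sum((x - x_bar) ** 2 * f for x, f in table.items())
    return math.sqrt(total / (n - 1))

class_2 = [60, 64, 66, 74, 77, 79, 84, 90, 92, 94]

s = sample_sd(class_2)
s_table = sample_sd_from_table(Counter(class_2))  # Counter builds the {x: f} table
s_divide_by_n = s * math.sqrt((len(class_2) - 1) / len(class_2))

print("s (divide by n - 1):", round(s, 3))
print("s from frequency table:", round(s_table, 3))
print("library stdev for comparison:", round(stdev(class_2), 3))
print("dividing by n instead:", round(s_divide_by_n, 3))
```

The last line shows the effect of the divisor: dividing by n gives a slightly smaller value than dividing by n – 1, and the gap shrinks as the sample size grows.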
Example: page 70, P21

Stemplot of average mammal longevities
[Stemplot not reproduced; stems run from 0 to 4, with ∙ marking the second line of a split stem. Key: 1 | 5 stands for 15 years.]

(a) Five-number summary:

(b) IQR =

(c) Q1 – 1.5 ∙ IQR =          Low end outliers:

(d) Q3 + 1.5 ∙ IQR =          High end outliers:

(e) Draw a modified boxplot.
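Parts (c) through (e) apply the 1.5 ∙ IQR rule. As a rough sketch of the same check in Python, the block below uses the 40 web page access counts from the Cal Poly example earlier in this section rather than the mammal data. The function names are hypothetical, and the quartiles follow the "median of each half" rule from these notes, so other software may give slightly different fences.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    # When n is odd, the median itself is left out of both halves.
    return median(ordered[:half]), median(ordered[half + (n % 2):])

def modified_boxplot_parts(values):
    """Fences, outliers, and whisker ends for a modified boxplot."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [x for x in sorted(values) if x < low_fence or x > high_fence]
    inside = [x for x in values if low_fence <= x <= high_fence]
    return {
        "fences": (low_fence, high_fence),
        "outliers": outliers,
        # Whiskers stop at the smallest and largest non-outliers.
        "whiskers": (min(inside), max(inside)),
    }

# Web page access counts for the 40 students in the Cal Poly example.
accesses = [0, 0, 0, 0, 0, 0, 3, 4, 4, 4,
            5, 5, 7, 7, 8, 8, 8, 12, 12, 13,
            13, 13, 14, 14, 16, 18, 19, 19, 20, 20,
            21, 22, 23, 26, 36, 36, 37, 42, 84, 331]

parts = modified_boxplot_parts(accesses)
for label, value in parts.items():
    print(label, value)
```

The whisker ends, together with Q1, the median, and Q3, are what you need to draw the modified boxplot; each value in the outlier list is plotted as its own symbol.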