Topic 9 Measures of Spread
Rossman/Chance, Workshop Statistics, 3/e Solutions, Unit 2, Topic 9

In-Class Activities

Activity 9-1: Baseball Lineups 9-1, 9-2, 9-13, 9-17
a. observational units = baseball players
   explanatory variable = team; type: binary categorical
   response variable = age; type: quantitative
b. [baseballdotplots.pdf] The average age of both teams seems to be about the same (about 30 years), but the spreads are quite different. The 2006 Tigers are much closer together in age than the 2006 Yankees.
c. Yankees mean: 29.7 years, median: 31.5 years
   Tigers mean: 30 years, median: 29.5 years
d. No – although the centers are nearly the same, the spreads are very different, with the Yankees having the youngest and oldest players in the two distributions and not much consistency in the ages of their players.
e. The Yankees appear to have greater variability.
f. Oldest: 35 Youngest: 22 Difference: 13 years
g. Oldest: 34 Youngest: 25 Difference: 9 years
h. lower quartile: 28 years, upper quartile: 32 years, IQR: 4 years
i. The Yankees have the greater age range and greater IQR. These values are consistent with our answer to part e.
j. The average age of the starting lineups on both teams is about 30 years, but the Tigers’ ages are fairly tightly clustered from 28-34 years with the exception of one player (Granderson) who is only 25 years old. In comparison, the Yankees’ ages range from a low of 22 years to a high of 35 years and also have a larger interquartile range. It is difficult to judge the shape with these small sample sizes, but the distribution of the ages for the Tigers appears more symmetric, while the distribution of ages for the Yankees is more skewed to the left.

Activity 9-2: Baseball Lineups 9-1, 9-2, 9-13, 9-17
a.
Player          Age   Deviation from Mean   Absolute Deviation   Squared Deviation
I. Rodriguez    34    34-30 = 4             4                    16
Casey           32    2                     2                    4
Perez           33    3                     3                    9
Inge            29    -1                    1                    1
Guillen         30    0                     0                    0
Monroe          29    -1                    1                    1
Granderson      25    -5                    5                    25
Gomez           28    -2                    2                    4
Young           32    2                     2                    4
Robertson       28    -2                    2                    4
Total           300   0                     22                   68
b. This sum is zero. This makes sense because the positive deviations from the mean ‘cancel out’ the negative deviations from the mean.
c. Sum of absolute deviations = 22 years
d. MAD = 22 / 10 = 2.2 years
e. Sum of squared deviations = 68 years²
f. 68/9 = 7.56 years²
g. √7.56 = 2.749 years
h. Tigers standard deviation = 2.749 years; Yankees standard deviation = 4.62 years. The Yankees’ standard deviation is larger, as expected, since their ages tend to be located further from the average age.
i. Answers will vary by student expectation. This change will definitely affect the range and the standard deviation, but it should have little or no effect on the IQR as we are changing only an extreme value (endpoint).
j. See the table below.
                            Range   IQR   Standard Deviation
Original Data               9       4     2.749
With Large Outlier (43)     18      4     4.86
With Huge Outlier (134)     109     4     33.1
k. See the table above. These results demonstrate that the IQR is resistant, but the range and standard deviation are not. The value of the IQR does not change with the size of the outlier, while the other two measures are dramatically affected.

Activity 9-3: Value of Statistics 2-9, 2-10, 9-3
a. Prediction. Many students will incorrectly pick F, focusing on the irregularity in the heights of the bars.
b. Prediction. Many students will incorrectly predict Class J to have a large amount of variation since more of the possible data values appear. They may also incorrectly see Class H as having more variability because of the differences in the heights of the bars.
c.
                      Class F   Class G   Class H   Class I   Class J
Range                 6         8         8         8         8
Interquartile Range   2.75      3         0         8         4.5
Standard Deviation    1.769     2.041     1.18      4         2.657
d. Class G has more variability than F according to these measures of spread since it has more data values further from the mean.
e. Class I has the most variability and H has the least according to these measures of spread. Class I has more of its data values at the extremes (and far from the center) while most of H’s observations are close to the mean. Class J is in between.
f. Class F has more bumpiness in its histogram, but has less variability than G.
g. Class J has the greatest number of distinct values but does not have the most variability among H, I and J.
h. No – based on the two previous questions, variability does not measure either bumpiness or variety. It measures spread from the center (mean). A distribution can be very ‘bumpy’ without having a great deal of variability and vice versa. It is more important to consider the overall tendency for data values to be far from the center.
i. Many answers are possible, but all 10 values need to be the same so that the standard deviation is zero.
j. This time only one answer is possible: {1, 1, 1, 1, 1, 9, 9, 9, 9, 9}. This dataset maximizes the distances of observations from the mean and has a standard deviation of 4.22. Any other combination will have a smaller standard deviation. (Note: if we did not balance the 1’s and 9’s, the mean would shift away from 5 and would put the more frequent values closer to the mean.)

Activity 9-4: Placement Exam Scores 7-16, 9-4
a. Yes – this distribution appears to be roughly symmetric and mound-shaped.
b. x̄ − s = 6.362 and x̄ + s = 14.08
c. The scores in this interval are 7, 8, 9, 10, 11, 12, 13 and 14. There are 16+15+17+32+17+21+12 = 146 of them. 146/213 = .685
d. The scores in this interval are 3, 4, … 15, 16, 17.
There are 202 scores in this interval and this proportion would be 202/213 = .948.
e. This would include scores from 0 to 21, that is, all the scores. Thus 100% of the scores fall within 3 standard deviations of the mean.

Activity 9-5: SATs and ACTs 9-5, 12-11
a. 1740 – 1500 = 240 points
b. 30 – 21 = 9 points
c. No – the scales on these two tests are different, so we cannot conclude that Bobby outperformed Kathy simply because he scored more points above the mean than Kathy did. The 240 and the 9 cannot be directly compared.
d. 240/240 = 1 standard deviation
e. 9/6 = 1.5 standard deviations
f. Kathy has the higher z-score.
g. Kathy performed better on her admissions test relative to her peers because her z-score is higher.
h. Peter: z = (1380-1500)/240 = -.5
   Kelly: z = (15-21)/6 = -1
i. Peter has the higher z-score (less negative; his observation is not as far below the mean).
j. When the observation is below the mean, the z-score will turn out to be negative.

Activity 9-6: Marriage Ages 8-17, 9-6, 16-19, 17-22, 23-1, 23-12, 26-4, 29-17, 29-18
Solution
a. For husbands the median age is 30.5 years and the mean is 35.7 years. For wives the median age is 29 years and the mean is 33.8 years. Husbands tend to be a little less than two years older than their wives.
b. For husbands, the lower quartile is 25 years and the upper quartile is 44.5 years, so the IQR is 19.5 years. For wives, the lower quartile is 24 years and the upper quartile is 41.5 years, so the IQR is 17.5 years. The standard deviations are 14.6 and 13.6 years for husbands and wives, respectively. These calculations indicate that the middle 50% of husbands’ ages cover a slightly greater distance (by 2 years) than the wives’ ages and that the husbands’ ages typically lie slightly farther from the mean, by approximately 1 year on average.
c.
The age distributions are quite similar for husbands and wives. Both are skewed to the right, centered around the low 30s or so, with considerable variability from the upper teens through low 70s. The husbands are a bit older on average, and their ages are a bit more spread out than the wives’ ages.
d. The median of the difference in the couples’ ages is 1 year and the mean difference is 1.9 years. Notice that the mean of the age differences is equal to the difference in mean ages between husbands and wives: 1.9 = 35.7 – 33.8, but this property does not quite hold for the median.
e. The quartiles are –0.5 and 3, so the IQR is 3.5 years. The standard deviation of these age differences is 4.8 years. The IQR and standard deviation of the differences calculated here are not the same as the difference in the IQRs (19.5 – 17.5) and the difference in the standard deviations (14.56 – 13.56) calculated in part b.
f. To be within one standard deviation of the mean is to be within 1.9 ± 4.8 years, which means between –2.9 and 6.7 years. Seventeen of the age differences fall within this interval, which is a proportion of 17/24 or .708, or 70.8%. This percentage is quite close to 68%, which is what the empirical rule predicts. Because the distribution of the age differences does look fairly symmetric and mound-shaped, this outcome is not surprising.
g. The mean and median indicate that, on average, people marry someone within a couple of years of their own age. More importantly, the measures of spread are fairly small for the differences, much smaller than for individual ages. This result suggests that there is not much variability in the differences, which suggests that couples are fairly consistent in the age gap between the partners.
h.
The differences have less variability because even though people get married from their teens to seventies (and beyond), they tend to marry people within a few years of their own age.

Homework Activities

Activity 9-7: February Temperatures 2-5, 8-19, 9-7
a. Student predictions.
b. Lincoln standard deviation = 15.9°F, San Luis Obispo standard deviation = 9.8°F, Sedona standard deviation = 6.7°F.

Activity 9-8: Social Acquaintances 9-8, 9-9, 10-13, 10-14, 19-9, 19-10, 20-12
a. Answers will vary by class. Here are some example answers.
    1 | 1134579
    2 | 1345568
    3 | 56778
    4 | 00248
    5 | 012468
    6 | 59
    7 | 6
    8 |
    9 |
   10 | 5
   11 |
   12 | 4
   leaf unit = one person
b. median = 37 people, QU = 52, QL = 23, IQR = 29 people
c. mean = 40.83 people, standard deviation = 25.25 people
d. 40.83 ± 25.25 = [15.58, 66.08]
   26/35 = .743 of the students’ results fall within one standard deviation of the mean.
e. This proportion is more than what the empirical rule predicts (68%), but is reasonably close to it.
f. Yes – these class results are consistent with Gladwell’s findings, demonstrating considerable variability. There are several values less than 20 and one as high as 124.

Activity 9-9: Social Acquaintances 9-8, 9-9, 10-13, 10-14, 19-9, 19-10, 20-12
Answers will vary by class. Here are some example answers.
[social.pdf: Number of Acquaintances for the Cal Poly class and our class, axis from 0 to 140]
The distributions of data collected from these two classes are very similar. The mean number of acquaintances for the Cal Poly class is 36.1 people, while for our class it was 40.8 people. Both classes had minimums below 20 people (6 and 11 respectively) and high outliers above 100 people.
The standard deviation for both classes was 25.25 people and the IQR for the Cal Poly class was 27 people while for our class it was 29 people. Both distributions appear roughly symmetric apart from the outliers (histograms or dotplots would be better graphs for examining shape).

Activity 9-10: Hypothetical Quiz Scores
a. Smallest standard deviation: Student A, then Student C, then Student D, then Student B. Student A has all her values equal to the mean and Student C has a tight cluster around 8 points. Student D has a similar range but not much consistency in responses right at the mean. Finally, Student B has all of her values as far from the mean as possible.
b. The standard deviation of the quiz scores for Student A is zero. This is because all her scores were the same value (8), which is the mean. There was no deviation from the mean, so the average deviation from the mean is zero.
c. Student C’s mean = 5. Each deviation is ± 5. Each squared deviation is 25. There are 16 squared deviations, so the sum of squared deviations is 16×25 = 400. The variance is therefore 400/15 = 26.67, and the standard deviation = √26.67 ≈ 5.164.

Activity 9-11: Baby Weights
a. z = (13.9-12.5)/1.5 = .93. At 3 months, Baby Ben was not quite one standard deviation above the average weight.
b. .93 = (x – 17.25)/2, so x = 19.11 lbs. If Ben weighs 19.11 lbs at 6 months, he would again be .93 standard deviations above the mean weight at that age.

Activity 9-12: Student and Faculty Ages
a. Answers will vary by school, but most likely the teachers’ ages are more variable than the students’ because the students’ ages probably range (generally) from 14-19 or 18-25 years while the teachers’ ages could range from 24-70 years.
b. Answers will vary by school, but at many schools a reasonable guess would be between 1 and 2 years.
If the ages range from 18-25 years, then it makes sense that roughly 2/3 of the observations would be between 20-22 years.

Activity 9-13: Baseball Lineups 9-1, 9-2, 9-13, 9-17
a. Prediction; students should be considering issues of resistance.
b. The Yankee mean and median ages have increased by 2 years to 31.7 and 33.5 years respectively.
c. Student expectation – answers will vary.
d. The IQR and standard deviation did not change.
e. Student predictions – answers will vary.
f. New mean = 60 years, median = 59 years, IQR = 8.5 years, standard deviation = 5.50 years. All of these values have doubled.

Activity 9-14: Pregnancy Durations 9-14, 12-6
a. Approximately 68% of human pregnancies will last between 250 and 282 days.
b. Approximately 95% of human pregnancies will last between 234 and 298 days. This is roughly 7.8-9.93 months.
c. A horse is more likely to have a pregnancy that lasts within ± 6 days of its mean. In fact 95% of all horse pregnancies will last within ± 6 days of the mean (366 days) since the standard deviation for horse pregnancies is 3 days.

Activity 9-15: Sampling Words 4-1, 4-2, 4-3, 4-4, 4-7, 4-8, 8-9, 9-15, 14-6
a. Answers will vary. The following are from one particular running of the applet.
   [activity9-15soln.pdf]
   The mean is 4.29 words and the standard deviation is 0.95 words.
b. The mean is 4.31 words and the standard deviation is 0.46 words.
c. The standard deviation was roughly cut in half.
d. Yes – for the samples based on a sample of size 20, the empirical rule should hold fairly closely because the sampling distribution is approximately symmetric and mound-shaped.

Activity 9-16: Tennis Simulations 7-22, 8-21, 9-16, 22-18
a. Based on the frequency tables the no-ad system appears to have the least variability and the standard system appears to have the most variability.
b.
standard IQR = 8-5 = 3 points, no-ad IQR = 7-5 = 2 points, handicap IQR = 6-4 = 2 points
c. standard s = 2.74 points, no-ad s = 1.022 points, handicap s = 1.458 points
d. Yes – both the IQR and standard deviation say that the standard system has the most variability and that the no-ad system has the least variability.

Activity 9-17: Baseball Lineups 9-1, 9-2, 9-13, 9-17
[baseballsalaries.pdf]
The Yankees’ salaries are generally much higher than those of the Tigers and they also exhibit much more variability. The Yankees have a mean salary of $10.96 million and an even larger median of $12.5 million! Their salaries range from a low of $.3 million to a high of $25.6 million and have a standard deviation of $9.45 million and an IQR of more than $20 million. In contrast, the mean Tiger salary is only $4.14 million and the median is a lowly $2.90 million with salaries ranging from $.3 million to a high of only $10.6 million (half of the Yankees make more). The standard deviation of the Tigers’ salaries is $3.74 million and their IQR is $7.76 million, reflecting the smaller variability in this distribution.

Activity 9-18: Population Growth 7-4, 8-15, 9-18, 10-10
a. The western states have more variability in population growth percentages, as the values are not as tightly clustered and there is also the extreme outlier (Nevada!).
b. East IQR = 14.4 – 5.5 = 8.9%, West IQR = 21.95 – 8.7 = 13.25%. The western states have a substantially larger IQR than the eastern states, confirming the larger variability among the population growth percentages in the west.
c. The standard deviation of the western states should decrease considerably if Nevada were removed from the analysis. Nevada is a tremendous outlier and makes a substantial contribution to the standard deviation when its large deviation is averaged in. (With Nevada the standard deviation is 14.07%; without it, the standard deviation is 9.7%.)
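
The effect described in part c, where a single extreme value inflates the standard deviation while a resistant measure barely moves, cannot be recomputed here because the state growth percentages are not reproduced in these solutions. As a minimal Python sketch of the same idea, the code below uses the Tigers’ ages listed in Activity 9-2. Two things are assumptions rather than facts from the text: that the outlier replaces the oldest age (34), and that quartiles are taken as the medians of the lower and upper halves of the sorted data (a rule that happens to reproduce the quartiles 28 and 32 reported in Activity 9-1). The function names are hypothetical.

```python
from statistics import median, stdev


def quartiles(values):
    """Lower and upper quartiles as medians of the two halves of the sorted data.

    Assumption: this half-splitting rule is used for illustration; statistical
    software often interpolates and can give slightly different quartiles.
    """
    data = sorted(values)
    half = len(data) // 2  # the middle value is excluded when n is odd
    return median(data[:half]), median(data[-half:])


def spread_summary(values):
    """Range, IQR, and sample standard deviation (n - 1 divisor)."""
    q1, q3 = quartiles(values)
    return {
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "sd": round(stdev(values), 3),
    }


def with_oldest_replaced(ages, new_age):
    """Return a copy of ages with the maximum replaced by new_age (assumed setup)."""
    changed = list(ages)
    changed[changed.index(max(changed))] = new_age
    return changed


# Tigers' ages as listed in the Activity 9-2 table
tigers = [34, 32, 33, 29, 30, 29, 25, 28, 32, 28]

for label, ages in [
    ("original", tigers),
    ("large outlier (43)", with_oldest_replaced(tigers, 43)),
    ("huge outlier (134)", with_oldest_replaced(tigers, 134)),
]:
    print(label, spread_summary(ages))
```

Under these assumptions the printed summaries should agree with the table in Activity 9-2: the IQR stays at 4 in all three cases, while the range and standard deviation grow with the size of the outlier.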

Activity 9-19: Memorizing Letters 5-5, 7-15, 8-13, 9-19, 10-9, 22-3
Answers will vary by class. Those given here are examples.
[memorizinglettersdotplot.pdf]
The JFK group showed greater variability in their scores than the JFKC group did. Their standard deviation was 6.45 letters and their IQR was 12 letters while the JFKC group had a standard deviation of only 5.86 letters and an IQR of only 9 letters.

Activity 9-20: Monthly Temperatures 9-20, 26-12, 27-12
a. [mediantemps.pdf]
b. Raleigh median = 59.5 degrees, San Francisco median = 57 degrees. Yes – these medians are fairly close.
c. No – you cannot conclude there is not much difference between these two cities with regard to monthly temperatures just because their centers are close. Their spreads are very different.
d. Raleigh appears to have more variability in its monthly temperatures.
e. Raleigh’s range = 39 degrees, San Francisco’s range = 16 degrees.
f. Raleigh’s IQR = 26 degrees, San Francisco’s IQR = 10 degrees.
g. Raleigh’s mean absolute deviation = 11.83 degrees, San Francisco’s mean absolute deviation = 4.92 degrees.
h. Raleigh’s standard deviation = 14.17 degrees, San Francisco’s standard deviation = 5.75 degrees.

Activity 9-21: Nicotine Lozenge 1-16, 2-18, 5-6, 9-21, 19-11, 20-15, 20-19, 21-6, 22-8
a. The mean number of cigarettes smoked per day has more variability. We can tell because the standard deviations are 2-3 times as large as those for the age of initiation variable.
b. The researchers provide the means and standard deviations so that readers can compare the distributions of the two treatment groups on these baseline characteristics. Showing that these summary statistics are similar adds evidence that there are no confounding variables between the two treatment groups, which strengthens our causal conclusions from the study if a difference is observed later for the response variable.
c.
Yes – the empirical rule probably holds for some of these variables – in particular for the age and weight variables. They are likely to have mound-shaped distributions, and it is likely that roughly 68% of these smokers were between the ages of 29 and 53, that 95% of them were between the ages of 17 and 65 and that virtually all of them were between the ages of 5 and 77. Similarly it is likely that roughly 68% of these smokers weighed between 58.4 and 92.8 kg (129-205 lbs), that 95% of them weighed between 41 and 110 kg (91-242.5 lbs) and that virtually all of them weighed between 24 and 127 kg (53-280 lbs). It is less likely that the age of initiation is symmetric since mean – 2 × SD gives an age of 8.3 years which is hopefully too small to be realistic. Similarly, the number of cigarettes smoked per day must be truncated at zero and would not match the empirical rule since mean – 3 × SD < 0. It also makes sense that the extreme chain smokers would skew the distribution to the right.

Activity 9-22: Hypothetical Exam Scores 7-12, 8-22, 8-23, 9-22, 10-22, 27-28
Many answers are possible. Some hints:
a. {1, 1, 2, 4, 5, 6, 7, 9, 9, 10}
b. All the values cannot be the same, but the 3rd through 8th values must be the same.
c. {4, 4, 4, 4, 4, 6, 6, 6, 6, 6}
d. {0, 0, 0, 0, 0, 100, 100, 100, 100, 100} (This is the only possible answer.)
e. The first three ordered values must be zeros, and the last three must be 100’s. The remaining values can be any numbers between 1 and 99.
f. {0, 0, 0, 0, 0, 100, 100, 100, 100, 100}; range = 100, mean absolute deviation = 50.

Activity 9-23: More Measures
a. The midhinge and midrange are both measures of center because they give the midpoints of the upper and lower quartiles and minimum and maximum values respectively. This “averaging” should place the result roughly in the middle of the distribution.
We would need to look at differences between values (e.g., max – min) to have a measure of spread.
b. Yes – adding a constant value to all the values in a dataset will change the midhinge and the midrange by that amount. This is further confirmation that these are measures of center since their values change to reflect a shift in the distribution.
c. The midhinge is resistant to outliers because it uses only the upper and lower quartiles in its calculation and these values are not usually outliers. The midrange would not be resistant to outliers as it uses the maximum and minimum values in its calculation and these are the values which could be outliers.
d. Yankee midrange = (35+22)/2 = 28.5 years, midhinge = (32+26)/2 = 29 years
   Tiger midrange = (34+25)/2 = 29.5 years, midhinge = (32+28)/2 = 30 years

Activity 9-24: Hypothetical ATM Withdrawals 9-24, 19-21, 22-5
a. [hypoatmdotplots.pdf] Yes – each distribution is perfectly symmetric.
b. Yes – the mean for each machine is $70, and the standard deviation for each machine is $30.3. They are identical.
c. No – the distributions for each machine are not identical – they are quite different. So this indicates that the mean and standard deviation do not provide a complete summary of a distribution of data.

Activity 9-25: Guessing Standard Deviations
a. Student guesses – answers will vary.
b. Data A: mean = 64.454, standard deviation = 9.598
   Data B: mean = 202.52, standard deviation = 51.88
   Data C: mean = .99947, standard deviation = .04952
   Data D: mean = 5.405, standard deviation = 4.714
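
The standard deviations quoted throughout this topic follow the steps worked by hand in Activity 9-2: find each deviation from the mean, square it, sum the squares, divide by n – 1, and take the square root. As a hedged illustration, the Python sketch below applies those steps to three datasets whose standard deviations these solutions state. The function name `sample_sd` is hypothetical, and the third dataset (eight 0’s and eight 10’s) is a reconstruction from the description in Activity 9-10 part c (mean 5, sixteen deviations of ± 5), not data given in the text.

```python
from math import sqrt


def sample_sd(values):
    """Sample standard deviation computed step by step with an n - 1 divisor."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    variance = sum(squared_deviations) / (n - 1)
    return sqrt(variance)


# Tigers' ages from the Activity 9-2 table (stated standard deviation: 2.749)
tigers = [34, 32, 33, 29, 30, 29, 25, 28, 32, 28]

# Dataset from Activity 9-3 part j (stated standard deviation: 4.22)
extremes = [1] * 5 + [9] * 5

# Assumed reconstruction of Activity 9-10 part c: sixteen scores with mean 5
# and every deviation equal to +5 or -5 (stated standard deviation: 5.164)
plus_minus_five = [0] * 8 + [10] * 8

for name, data in [
    ("Tigers", tigers),
    ("Activity 9-3 j", extremes),
    ("Activity 9-10 c", plus_minus_five),
]:
    print(name, round(sample_sd(data), 3))
```

Dividing by n instead of n – 1 would give smaller values that do not match the answers above, which is one quick way to check which convention a calculator or software package is using.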