Topic 9
Measures of Spread
In-Class Activities
Activity 9-1: Baseball Lineups
9-1, 9-2, 9-13, 9-17
a.
observational units = baseball players
explanatory variable = team
type: binary categorical
response variable = age
type: quantitative
b.
[baseballdotplots.pdf]
The average age of both teams seems to be about the same (about 30 years), but the spreads are
quite different. The 2006 Tigers are much closer together in age than the 2006 Yankees.
c.
Yankees
mean: 29.7 years
median: 31.5 years
Tigers
mean: 30 years
median: 29.5 years
d.
No – although the centers are the same, the spreads are very different, with the Yankees
having the youngest and oldest players in the two distributions and not much consistency in the
ages of their players.
e.
The Yankees appear to have greater variability.
f.
Oldest: 35
Youngest: 22
Difference: 13 years
g.
Oldest: 34
Youngest: 25
Difference: 9 years
h.
lower quartile: 28 years
upper quartile: 32 years
IQR: 4 years
Rossman/Chance, Workshop Statistics, 3/e
Solutions, Unit 2, Topic 9
i.
The Yankees have the greater age range and greater IQR. These values are consistent
with our answer to part e.
j.
The average age of the starting lineups on both teams is about 30 years, but the Tigers’
ages are fairly tightly clustered from 28-34 years with the exception of one player (Granderson)
who is only 25 years old. In comparison, the Yankees’ ages range from a low of 22 years to a
high of 35 years and also have a larger inter-quartile range. It is difficult to judge the shape with
these small sample sizes but the distribution of the ages for the Tigers appears more symmetric,
while the distribution of ages for the Yankees is more skewed to the left.
Activity 9-2: Baseball Lineups
9-1, 9-2, 9-13, 9-17
a.
Player         Age   Deviation from Mean   Absolute Deviation   Squared Deviation
I. Rodriquez   34    34-30 = 4             4                    16
Casey          32    2                     2                    4
Perez          33    3                     3                    9
Inge           29    -1                    1                    1
Guillen        30    0                     0                    0
Monroe         29    -1                    1                    1
Granderson     25    -5                    5                    25
Gomez          28    -2                    2                    4
Young          32    2                     2                    4
Robertson      28    -2                    2                    4
Total          300   0                     22                   68
b.
This sum is zero. This makes sense because the positive deviations from the mean
‘cancel out’ the negative deviations from the mean.
c.
Sum of absolute deviations = 22 years
d.
MAD = 22 / 10 = 2.2 years
e.
Sum of squared deviations = 68 years²
f.
68/9 = 7.56 years²
g.
2.749 years
h.
Tigers standard deviation = 2.749 years; Yankees standard deviation = 4.62 years.
The Yankees’ standard deviation is larger as expected since the ages tend to be located further
from the average age.
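The hand calculations in parts b through g can be checked with a short script. This is a minimal sketch that uses the Tigers' ages from the table in part a; the variable names are illustrative only.

```python
import math

# Tigers' ages, copied from the table in part a.
tigers = [34, 32, 33, 29, 30, 29, 25, 28, 32, 28]

n = len(tigers)
mean = sum(tigers) / n                       # 300 / 10 = 30
deviations = [age - mean for age in tigers]

sum_dev = sum(deviations)                    # part b: deviations cancel out
sum_abs = sum(abs(d) for d in deviations)    # part c: sum of absolute deviations
mad = sum_abs / n                            # part d: mean absolute deviation
sum_sq = sum(d ** 2 for d in deviations)     # part e: sum of squared deviations
variance = sum_sq / (n - 1)                  # part f: divide by n - 1, not n
sd = math.sqrt(variance)                     # part g: standard deviation

print(sum_dev, sum_abs, mad, sum_sq)
print(round(variance, 2), round(sd, 3))
```

Note that the mean absolute deviation divides by n = 10, while the variance divides by n - 1 = 9, matching the 22/10 and 68/9 in parts d and f.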
i.
Answers will vary by student expectation. This change will definitely affect the range
and the standard deviation, but it should have little or no effect on the IQR as we are changing
only an extreme value (endpoint).
j.
See table below.
                          Range   IQR   Standard Deviation
Original Data             9       4     2.749
With Large Outlier (43)   18      4     4.86
With Huge Outlier (134)   109     4     33.1
k.
See table above. These results demonstrate that the IQR is resistant, but not the range or
standard deviation. We see this in that the value of the IQR is not changing with the size of the
outlier while the other two are dramatically affected.
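The resistance comparison can be reproduced with a rough sketch like the one below. Two assumptions are made here that the solutions do not state explicitly: the outlier replaces the oldest Tiger's age (34), which is consistent with the ranges of 18 and 109 in the table, and quartiles are taken as the medians of the lower and upper halves of the ordered data. The helper names `quartiles` and `spread_summary` are hypothetical.

```python
import statistics

def quartiles(values):
    """Lower and upper quartiles as medians of the lower and upper halves.

    When n is odd, the middle value is left out of both halves.
    (Assumed convention; software packages may use other rules.)
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    return statistics.median(ordered[:half]), statistics.median(ordered[-half:])

def spread_summary(values):
    """Range, IQR, and sample standard deviation of a list of numbers."""
    q_lower, q_upper = quartiles(values)
    return {
        "range": max(values) - min(values),
        "iqr": q_upper - q_lower,
        "sd": statistics.stdev(values),
    }

tigers = [34, 32, 33, 29, 30, 29, 25, 28, 32, 28]
# Assumption: the oldest age (34) is the value replaced by the outlier.
large = [43 if age == 34 else age for age in tigers]
huge = [134 if age == 34 else age for age in tigers]

for label, ages in [("original", tigers), ("large", large), ("huge", huge)]:
    print(label, spread_summary(ages))
```

Only the maximum changes between the three datasets, so the quartiles (28 and 32) stay put while the range and standard deviation grow with the outlier.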
Activity 9-3: Value of Statistics
2-9, 2-10, 9-3
a.
Prediction. Many students will incorrectly pick F, focusing on the irregularity in the
heights of the bars.
b.
Prediction. Many students will incorrectly predict Class J to have a large amount of
variation since more of the possible data values appear. They again may incorrectly see Class H
as having more variability in looking at the differences in the heights of the bars.
c.
                      Class F   Class G   Class H   Class I   Class J
Range                 6         8         8         8         8
Interquartile Range   2.75      3         0         8         4.5
Standard Deviation    1.769     2.041     1.18      4         2.657
d.
Class G has more variability than F according to these measures of spread since it has
more data values further from the mean.
e.
Class I has the most variability and H has the least according to these measures of spread.
Class I has more of its data values at the extremes (and far from the center) while most of H’s
observations are close to the mean. Class J is in between.
f.
Class F has more bumpiness in its histogram, but has less variability than G.
g.
Class J has the greatest number of distinct values but does not have the most variability
among H, I and J.
h.
No – based on the two previous questions, variability does not measure either bumpiness
or variety. It measures spread from the center (mean). A distribution can be very ‘bumpy’
without having a great deal of variability and vice versa. It is more important to consider the
overall tendency for data values to be far from the center.
i.
Many answers are possible, but all 10 values need to be the same so that the standard
deviation is zero.
j.
This time only one answer is possible. {1, 1, 1, 1, 1, 9, 9, 9, 9, 9}. This dataset
maximizes the distances of observations from the mean and has a standard deviation of 4.22.
Any other combination will have a smaller standard deviation. (Note, if we did not balance the
1’s and 9’s, the mean would shift away from 5 and would put the more frequent values closer to
the mean.)
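The claim in part j can be illustrated numerically. The sketch below compares the balanced dataset with one hypothetical unbalanced alternative (four 1's and six 9's); it assumes, as the answer implies, that values are restricted to the range 1 to 9. It checks one alternative only and is not a proof that no other dataset does better.

```python
import statistics

balanced = [1, 1, 1, 1, 1, 9, 9, 9, 9, 9]
# Hypothetical comparison dataset: one 1 swapped for a 9.
unbalanced = [1, 1, 1, 1, 9, 9, 9, 9, 9, 9]

sd_balanced = statistics.stdev(balanced)      # mean is 5, every deviation is 4
sd_unbalanced = statistics.stdev(unbalanced)  # mean shifts to 5.8, toward the 9's

print(round(sd_balanced, 2), round(sd_unbalanced, 2))
```

In the unbalanced version the mean moves toward the more frequent value, so six of the ten observations sit closer to the mean and the standard deviation drops.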
Activity 9-4: Placement Exam Scores
7-16, 9-4
a.
Yes – this distribution appears to be roughly symmetric and mound-shaped.
b.
x̄ – s = 6.362 and x̄ + s = 14.08
c.
The scores in this interval are 7, 8, 9, 10, 11, 12, 13 and 14. There are
16+15+17+32+17+21+12+16 = 146 of them. 146/213 = .685
d.
The scores in this interval are 3, 4, … 15, 16, 17.
There are 202 scores in this interval
and this proportion would be 202/213 = .948.
e.
This would include scores from 0 to 21, that is, all the scores. Thus 100% of the scores
fall within 3 standard deviations of the mean.
Activity 9-5: SATs and ACTs
9-5, 12-11
a.
1740-1500 = 240 points
b.
30 – 21 = 9 points
c.
No – the scales on these two tests are different so we cannot conclude that Bobby
outperformed Kathy simply because he scored more points above the mean than Kathy did. The
240 and the 9 cannot be directly compared.
d.
240/240 = 1 standard deviation
e.
9/6 = 1.5 standard deviations
f.
Kathy has the higher z-score.
g.
Kathy performed better on her admissions test relative to her peers because her z-score is
higher.
h.
Peter: z = (1380-1500)/240 = -.5
Kelly: z = (15-21)/6 = -1
i.
Peter has the higher z-score (less negative, observation is not as far below the mean).
j.
When the observation is below the mean, the z-score will turn out to be negative.
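The z-score comparisons in parts d through j follow one formula, z = (observation - mean) / standard deviation. A minimal sketch, using the means and standard deviations that appear in the calculations above (SAT: 1500 and 240; ACT: 21 and 6); the function name `z_score` is illustrative:

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

SAT_MEAN, SAT_SD = 1500, 240
ACT_MEAN, ACT_SD = 21, 6

z = {
    "Bobby": z_score(1740, SAT_MEAN, SAT_SD),  # SAT score of 1740
    "Kathy": z_score(30, ACT_MEAN, ACT_SD),    # ACT score of 30
    "Peter": z_score(1380, SAT_MEAN, SAT_SD),  # SAT score of 1380
    "Kelly": z_score(15, ACT_MEAN, ACT_SD),    # ACT score of 15
}

for name, value in z.items():
    print(name, value)
```

Because z-scores are unit-free, Kathy's 1.5 can be compared directly with Bobby's 1.0, and Peter's -0.5 with Kelly's -1.0, even though the raw point differences cannot.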
Activity 9-6: Marriage Ages
8-17, 9-6, 16-19, 17-22, 23-1, 23-12, 26-4, 29-17, 29-18
Solution
a.
For husbands the median age is 30.5 years and the mean is 35.7 years. For wives the
median age is 29 years and the mean is 33.8 years. Husbands tend to be a little less than two
years older than their wives.
b.
For husbands, the lower quartile is 25 years and the upper quartile is 44.5 years, so the
IQR is 19.5 years. For wives, the lower quartile is 24 years and the upper quartile is 41.5 years,
so the IQR is 17.5 years. The standard deviations are 14.6 and 13.6 years for husbands and
wives, respectively. These calculations indicate that the middle 50% of husbands’ ages cover a
slightly greater distance than the wives’ ages by 2 years and that the husbands’ ages typically lie
slightly farther from the mean, by approximately 1 year on average.
c.
The age distributions are quite similar for husbands and wives. Both are skewed to the
right, centered around the low 30s or so, with considerable variability from the upper teens
through low 70s. The husbands are a bit older on average, and their ages are a bit more spread
out than the wives’ ages.
d.
The median of the difference in the couples’ ages is 1 year and the mean difference is 1.9
years. Notice that the mean of the age differences is equal to the difference in mean ages between
husbands and wives: 1.9 = 35.7 – 33.8, but this property does not quite hold for the median.
e.
The quartiles are –0.5 and 3, so the IQR is 3.5 years. The standard deviation of these age
differences is 4.8 years. The IQR and the standard deviation of the differences calculated here
are not the same as the difference in the IQRs (19.5 – 17.5 = 2) or the difference in the
standard deviations (14.56 – 13.56 = 1) calculated in part b.
f.
To be within one standard deviation of the mean is to be within 1.9 ± 4.8 years, which
means between –2.9 and 6.7 years. Seventeen of the age differences fall within this interval,
which is a proportion of 17/24 or .708, or 70.8%. This percentage is quite close to 68%, which is
what the empirical rule predicts. Because the distribution of the age differences does look fairly
symmetric and mound-shaped, this outcome is not surprising.
g.
The mean and median indicate that, on average, people marry someone within a couple
years of their own age. More importantly, the measures of spread are fairly small for the
differences, much smaller than for individual ages. This result suggests that there is not much
variability in the differences, which suggests that couples are fairly consistent in the age gap
between the partners.
h.
The differences have less variability because even though people get married from their
teens to seventies (and beyond), they tend to marry people within a few years of their own age.
Homework Activities
Activity 9-7: February Temperatures
2-5, 8-19, 9-7
a.
Student predictions.
b.
Lincoln standard deviation = 15.9° F, San Luis Obispo standard deviation = 9.8° F,
Sedona standard deviation = 6.7°F.
Activity 9-8: Social Acquaintances
9-8, 9-9, 10-13, 10-14, 19-9, 19-10, 20-12
a.
Answers will vary by class. Here are some example answers.
 1 | 1134579
 2 | 1345568
 3 | 56778
 4 | 00248
 5 | 012468
 6 | 59
 7 | 6
 8 |
 9 |
10 | 5
11 |
12 | 4
leaf unit = one person
b.
median = 37 people, QU = 52, QL = 23, IQR = 29 people
c.
mean = 40.83 people, standard deviation = 25.25 people
d.
40.83 ± 25.25 = [15.58, 66.08]
26/35 = .743 of the students’ results fall within one
standard deviation of the mean.
e.
This proportion is more than what the empirical rule predicts, but is reasonably close to it
(68%).
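For the example class data, the median, quartiles, and the count within one standard deviation can be read off the stemplot with a short script. This is a sketch under two assumptions: the stems and leaves are transcribed from the example stemplot in part a, and quartiles are the medians of the lower and upper halves with the overall median excluded. The mean and standard deviation are taken from part c rather than recomputed.

```python
import statistics

# Stems (tens) and leaves (ones) from the example stemplot in part a.
stemplot = {
    1: "1134579", 2: "1345568", 3: "56778", 4: "00248",
    5: "012468", 6: "59", 7: "6", 10: "5", 12: "4",
}
data = sorted(10 * stem + int(leaf)
              for stem, leaves in stemplot.items()
              for leaf in leaves)

n = len(data)
median = statistics.median(data)
half = n // 2                                # n is odd, so the median is excluded
q_lower = statistics.median(data[:half])
q_upper = statistics.median(data[-half:])
iqr = q_upper - q_lower

# Interval from part d, using the mean and SD reported in part c.
low, high = 40.83 - 25.25, 40.83 + 25.25
within = sum(low <= x <= high for x in data)

print(n, median, q_lower, q_upper, iqr)
print(within, round(within / n, 3))
```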
f.
Yes – these class results are consistent with Gladwell’s findings demonstrating
considerable variability. There are several values less than 20 and one as high as 124.
Activity 9-9: Social Acquaintances
9-8, 9-9, 10-13, 10-14, 19-9, 19-10, 20-12
Answers will vary by class. Here are some example answers.
[social.pdf] (groups: Cal Poly, Our Class; horizontal axis: Number of Acquaintances, 0 to 140)
The distributions of data collected from these two classes are very similar. The mean number
of acquaintances for the Cal Poly class is 36.1 people, while for our class it was 40.8 people.
Both classes had minimums below 20 people (6 and 11 respectively) and high outliers above 100
people. The standard deviation for both classes was 25.25 people and the IQR for the Cal Poly
class was 27 people while for our class it was 29 people. Both distributions appear (histograms
or dotplots would be better graphs to use to examine shape) roughly symmetric apart from the
outliers.
Activity 9-10: Hypothetical Quiz Scores
a.
Smallest standard deviation: Student A, then Student C, then Student D, then Student B.
Student A has all her values equal to the mean and Student C has a tight cluster around 8 points.
Student D has a similar range but not much consistency in responses right at the mean. Finally,
Student B has all of her values as far from the mean as possible.
b.
The standard deviation of the quiz scores for student A is zero. This is because all her
scores were the same value (8) – the mean. There was no deviation from the mean, so the
average deviation from the mean is zero.
c.
Student B’s mean = 5. Each deviation is ± 5. Each squared deviation is 25. There are 16
squared deviations, so the sum of squared deviations is 16×25 = 400. The variance is therefore
400/15 = 26.67, and the standard deviation = √26.67 ≈ 5.164.
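The arithmetic in part c can be checked with a small sketch. The individual quiz scores are not listed in these solutions; the dataset below (eight 0's and eight 10's) is an assumption inferred from the stated mean of 5, the deviations of ± 5, and the count of 16 scores.

```python
import math

# Assumed scores: 16 quizzes, each 5 points above or below a mean of 5.
scores = [0] * 8 + [10] * 8

n = len(scores)
mean = sum(scores) / n
sum_sq = sum((x - mean) ** 2 for x in scores)   # 16 squared deviations of 25 each
variance = sum_sq / (n - 1)                     # divide by n - 1 = 15
sd = math.sqrt(variance)

print(mean, sum_sq, round(variance, 2), round(sd, 3))
```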
Activity 9-11: Baby Weights
a.
z = (13.9-12.5)/1.5 = .93. At 3 months, Baby Ben was not quite one standard deviation
above the average weight.
b.
.93 = (x – 17.25)/2, so x = 19.11 lbs. If Ben weighs 19.11 lbs at 6 months, he would again be
.93 standard deviations above the mean weight at that age.
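Part b runs the z-score formula in reverse: given z, solve for the observation as x = mean + z × SD. A minimal sketch using the means and standard deviations from the calculations above (12.5 and 1.5 lbs at 3 months; 17.25 and 2 lbs at 6 months); the function names are illustrative, and z is rounded to two places as in the solution.

```python
def z_score(value, mean, sd):
    """Standardize: how many SDs the value lies from the mean."""
    return (value - mean) / sd

def value_at(z, mean, sd):
    """Invert the z-score: the observation that sits z SDs from the mean."""
    return mean + z * sd

z_3mo = round(z_score(13.9, 12.5, 1.5), 2)   # part a
x_6mo = value_at(z_3mo, 17.25, 2)            # part b

print(z_3mo, round(x_6mo, 2))
```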
Activity 9-12: Student and Faculty Ages
a.
Answers will vary by school, but most likely the teachers’ ages are more variable than the
students’ because the students’ ages probably range (generally) from 14-19 or 18-25 years while
the teachers’ ages could range from 24-70 years.
b.
Answers will vary by school, but at many schools a reasonable guess would be between 1
and 2 years. If the ages range from 18-25 years, then it makes sense that roughly 2/3 of the
observations would be between 20-22 years.
Activity 9-13: Baseball Lineups
9-1, 9-2, 9-13, 9-17
a.
Prediction, should be considering issues of resistance.
b.
The Yankee mean and median ages have increased by 2 years to 31.7 and 33.5 years
respectively.
c.
Student expectation – answers will vary.
d.
The IQR and standard deviation did not change.
e.
Student predictions – answers will vary.
f.
New mean = 60 years, median = 59 years, IQR = 8.5 years, standard deviation = 5.50
years. All of these values have doubled.
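The effect of shifting and rescaling can be sketched numerically. The Yankees' individual ages are not listed in these solutions, so this example uses the Tigers' ages from Activity 9-2 as a stand-in. The IQR is left out because its value depends on the quartile convention the software uses.

```python
import statistics

tigers = [34, 32, 33, 29, 30, 29, 25, 28, 32, 28]
shifted = [age + 2 for age in tigers]   # everyone two years older
doubled = [age * 2 for age in tigers]   # every age doubled

def summary(values):
    """Mean, median, and sample standard deviation."""
    return (statistics.mean(values),
            statistics.median(values),
            statistics.stdev(values))

mean0, median0, sd0 = summary(tigers)
mean_s, median_s, sd_s = summary(shifted)
mean_d, median_d, sd_d = summary(doubled)

print(mean0, median0, round(sd0, 3))
print(mean_s, median_s, round(sd_s, 3))   # centers move by 2, spread unchanged
print(mean_d, median_d, round(sd_d, 2))   # centers and spread all double
```

Adding a constant moves every value and the mean by the same amount, so the deviations from the mean are unchanged; multiplying by a constant multiplies every deviation by that constant.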
Activity 9-14: Pregnancy Durations
9-14, 12-6
a.
Approximately 68% of human pregnancies will last between 250 and 282 days.
b.
Approximately 95% of human pregnancies will last between 234 and 298 days. This is
roughly 7.8-9.93 months.
c.
A horse is more likely to have a pregnancy that lasts within ± 6 days of its mean. In fact
95% of all horse pregnancies will last within ± 6 days of the mean (366 days) since the standard
deviation for horse pregnancies is 3 days.
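The intervals in parts a and b come from applying the empirical rule at one and two standard deviations. In the sketch below, the human mean (266 days) and standard deviation (16 days) are inferred from the midpoints and half-widths of the stated intervals rather than quoted from the activity, and the conversion to months assumes 30-day months.

```python
def empirical_intervals(mean, sd):
    """Intervals expected to hold about 68%, 95%, and 99.7% of a
    mound-shaped, symmetric distribution."""
    return {pct: (mean - k * sd, mean + k * sd)
            for k, pct in ((1, 68), (2, 95), (3, 99.7))}

# Assumed values, inferred from the 250-282 and 234-298 day intervals.
human = empirical_intervals(266, 16)

low95, high95 = human[95]
months = (round(low95 / 30, 2), round(high95 / 30, 2))  # assumes 30-day months

# Part c: with a standard deviation of 3 days, two SDs span 6 days.
horse_half_width_95 = 2 * 3

print(human[68], human[95])
print(months, horse_half_width_95)
```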
Activity 9-15: Sampling Words
4-1, 4-2, 4-3, 4-4, 4-7, 4-8, 8-9, 9-15, 14-6
a.
Answers will vary. The following are from one particular running of the applet
[activity9-15soln.pdf]
The mean is 4.29 words and the standard deviation is 0.95 words.
b.
The mean is 4.31 words and the standard deviation is 0.46 words.
c.
The standard deviation was roughly cut in half.
d.
Yes – for the samples of size 20, the empirical rule should hold fairly
closely because the sampling distribution is approximately symmetric and mound-shaped.
Activity 9-16: Tennis Simulations
7-22, 8-21, 9-16, 22-18
a.
Based on the frequency tables the no-ad system appears to have the least variability and
the standard system appears to have the most variability.
b.
standard IQR = 8-5 = 3 points, no-ad IQR = 7-5 = 2 points, handicap IQR = 6-4 = 2
points
c.
standard s = 2.74 points, no-ad s = 1.022 points, handicap s = 1.458 points
d.
Yes – both the IQR and standard deviation say that the standard system has the most
variability and that the no-ad system has the least variability.
Activity 9-17: Baseball Lineups
9-1, 9-2, 9-13, 9-17
[baseballsalaries.pdf]
The Yankees’ salaries are generally much higher than those of the Tigers and they also
exhibit much more variability. The Yankees have a mean salary of $10.96 million and an even
larger median of $12.5 million! Their salaries range from a low of $.3 million to a high of $25.6
million and have a standard deviation of $9.45 million and an IQR of more than $20 million. In
contrast, the mean Tiger salary is only $4.14 million and the median is a lowly $2.90 million
with salaries ranging from $.3 million to a high of only $10.6 million (half of the Yankees make
more). The standard deviation of the Tigers’ salaries is $3.74 million and their IQR is $7.76
million, reflecting the smaller variability in this distribution.
Activity 9-18: Population Growth
7-4, 8-15, 9-18, 10-10
a.
The western states have more variability in population growth percentages as the values
are not as tightly clustered and there is also the extreme outlier (Nevada!)
b.
East IQR = 14.4-5.5 = 8.9%, West IQR = 21.95 – 8.7 = 13.25%. The western states
have a substantially larger IQR than the eastern states, confirming the larger variability among
the population growth percentages in the west.
c.
The standard deviation of the western states should decrease considerably if Nevada were
removed from the analysis. Nevada is a tremendous outlier and would be making a substantial
contribution to the standard deviation when its large deviation is averaged in. (With Nevada the
standard deviation is 14.07%, without it is 9.7%.)
Activity 9-19: Memorizing Letters
5-5, 7-15, 8-13, 9-19, 10-9, 22-3
Answers will vary by class. Those given here are examples.
[memorizinglettersdotplot.pdf]
The JFK group showed greater variability in their scores than the JFKC group did. Their
standard deviation was 6.45 letters and their IQR was 12 letters while the JFKC group had a
standard deviation of only 5.86 letters and an IQR of only 9 letters.
Activity 9-20: Monthly Temperatures
9-20, 26-12, 27-12
a.
[mediantemps.pdf]
b.
Raleigh median = 59.5 degrees, San Francisco median = 57 degrees. Yes – these
medians are fairly close.
c.
No – you cannot conclude there is not much difference between these two cities with
regard to monthly temperatures just because their centers are close. Their spreads are very
different.
d.
Raleigh appears to have more variability in its monthly temperatures.
e.
Raleigh’s range = 39 degrees, San Francisco’s range = 16 degrees.
f.
Raleigh’s IQR = 26 degrees, San Francisco’s IQR = 10 degrees.
g.
Raleigh’s mean absolute deviation = 11.83 degrees, San Francisco’s mean absolute
deviation = 4.92 degrees.
h.
Raleigh’s standard deviation = 14.17 degrees, San Francisco’s standard deviation = 5.75
degrees.
Activity 9-21: Nicotine Lozenge
1-16, 2-18, 5-6, 9-21, 19-11, 20-15, 20-19, 21-6, 22-8
a.
The mean number of cigarettes smoked per day has more variability. We can tell because
the standard deviations are 2-3 times as large as those for the age of initiation variable.
b.
The researchers provide the means and standard deviations so that readers can compare
the distributions of the two treatment groups on these baseline characteristics. By showing these
summary statistics are similar, this adds evidence to the lack of confounding variables between
the two treatment groups, strengthening our causal conclusions from the study if a difference is
observed later for the response variable.
c.
Yes – the empirical rule probably holds for some of these variables – in particular for the
age and weight variables. They are likely to have mound-shaped distributions, and it is likely
that roughly 68% of these smokers were between the ages of 29 and 53, that 95% of them were
between the ages of 17 and 65 and that virtually all of them were between the ages of 5 and 77.
Similarly it is likely that roughly 68% of these smokers weighed between 58.4 and 92.8 kg
(129-205 lbs.), that 95% of them weighed between 41 and 110 kg (91-242.5 lbs) and that virtually all
of them weighed between 24 and 127 kg (53-280 lbs). It is less likely that the age of initiation is
symmetric since mean – 2 × SD gives an age of 8.3 years which is hopefully too small to be
realistic. Similarly, the number of cigarettes smoked per day must be truncated at zero and would
not match the empirical rule since mean – 3 × SD < 0. It also makes sense that the extreme chain
smokers would skew the distribution to the right.
Activity 9-22: Hypothetical Exam Scores
7-12, 8-22, 8-23, 9-22, 10-22, 27-28
Many answers are possible. Some hints:
a.
{1, 1, 2, 4, 5, 6, 7, 9, 9, 10}
b.
All the values cannot be the same, but the 3rd through 8th values must be the same.
c.
{4, 4, 4, 4, 4, 6, 6, 6, 6, 6}
d.
{0,0,0,0,0,100,100,100,100,100} (This is the only possible answer.)
e.
The first three ordered values must be zeros, and the last three must be 100’s. The
remaining values can be any numbers between 1 and 99.
f.
{0,0,0,0,0,100,100,100,100,100} range = 100, mean absolute deviation = 50.
Activity 9-23: More Measures
a.
The midhinge and midrange are both measures of center because they give the midpoints
of the upper and lower quartiles and minimum and maximum values respectively. This
“averaging” should place the result roughly in the middle of the distribution. We would need to
look at differences between values (e.g., max – min) to have a measure of spread.
b.
Yes – adding a constant value to all the values in a dataset will change the midhinge and
the midrange by that amount. This is further confirmation that these are measures of center since
their values change to reflect a shift in the distribution.
c.
The midhinge is resistant to outliers because it uses only the upper and lower quartiles in
its calculation and these values are not usually outliers. The midrange would not be resistant to
outliers as it uses the maximum and minimum values in its calculation and these are the values
which could be outliers.
d.
Yankee midrange = (35+22)/2 = 28.5 years
midhinge = (32+26)/2 = 29 years
Tiger midrange = (34+25)/2 = 29.5 years
midhinge = (32+28)/2 = 30 years
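Both measures in part d are simple midpoints, which a short sketch makes explicit. The minimum, maximum, and quartile values are the ones used in the calculations above; the function names are illustrative.

```python
def midrange(minimum, maximum):
    """Midpoint of the smallest and largest values (not resistant to outliers)."""
    return (minimum + maximum) / 2

def midhinge(q_lower, q_upper):
    """Midpoint of the lower and upper quartiles (resistant to outliers)."""
    return (q_lower + q_upper) / 2

yankee_midrange = midrange(22, 35)
yankee_midhinge = midhinge(26, 32)
tiger_midrange = midrange(25, 34)
tiger_midhinge = midhinge(28, 32)

print(yankee_midrange, yankee_midhinge, tiger_midrange, tiger_midhinge)
```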
Activity 9-24: Hypothetical ATM Withdrawals
9-24, 19-21, 22-5
a.
[hypoatmdotplots.pdf]
Yes – each distribution is perfectly symmetric.
b.
Yes – the mean for each machine is $70, and the standard deviation for each machine is
$30.3. They are identical.
c.
No – the distributions for each machine are not identical – they are quite different. So
this indicates that the mean and standard deviation do not provide a complete summary of a
distribution of data.
Activity 9-25: Guessing Standard Deviations
a.
Student guesses – answers will vary.
b.
Data A: mean = 64.454, standard deviation = 9.598
Data B: mean = 202.52, standard deviation = 51.88
Data C: mean = .99947, standard deviation = .04952
Data D: mean = 5.405, standard deviation = 4.714