Statistics Instructor Guide Sample Chapter

CHAPTER 3
Relationships Between Two Quantitative Variables
Overview
Goals
The overall goal of Chapter 3 is to develop a systematic way to uncover
information about bivariate data (two variables per case). We can pursue this goal
as we did in Chapter 2 for univariate data: by using graphical displays and then
finding a measure of center and spread to summarize the distribution. The basic
plot used is the scatterplot. The basic summary measures are the regression line
(which can be thought of as the measure of center) and the correlation (which
can be thought of as the measure of spread).
The five sections of this chapter will teach students to
• make a scatterplot and determine what its shape tells about the relationship
between the two variables
• find and interpret the equation of the least squares regression line
• find and interpret the correlation coefficient
• use diagnostic tools such as residual plots to determine whether the linear
regression line is appropriate or a transformation is needed first
• make transformations that re-express a curved relationship as a linear one
Content Overview
Statistical Software
Statistical software is important for this chapter—one of the most
computationally intensive in this textbook. Students can use statistical software
to construct scatterplots quickly, accurately, and flexibly. But more to the
instructional point, they can use the software to change the location of points
and then observe the corresponding change in the correlation and the regression
line to further their understanding of the properties of these two measures.
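A minimal sketch of this kind of exploration, for instructors who want to preview it without a statistics package. The data values are hypothetical (not from the student book), and the helper name `summarize` is an illustrative assumption. The sketch computes the least squares line and the correlation, moves one point off the trend, and recomputes both.

```python
from math import sqrt


def summarize(xs, ys):
    """Return (slope, intercept, r) for the least squares line through paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


# Hypothetical data that lie close to a line with slope 2.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
before = summarize(xs, ys)

# Move only the last point far below the trend and recompute.
ys_moved = ys[:-1] + [2.0]
after = summarize(xs, ys_moved)

print("before: slope=%.2f intercept=%.2f r=%.3f" % before)
print("after:  slope=%.2f intercept=%.2f r=%.3f" % after)
```

Moving a single point at the edge of the x-range lowers both the slope and the correlation noticeably, which is the behavior students are meant to discover with the software.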
A Note on Terminology
In the statistical community, the language of correlation and regression has not
been standardized. The words in the chart on the next page are sometimes used
interchangeably.
Terminology
Concept: A Bivariate Relationship
Words Used: Relationship; Association; Correlation (used only when the trend is linear)
Concept: The Degree of Spread of the Points About the Line
Words Used: Correlation; Measure of strength of the relationship; Scatter; Variation
Concept: A Summary Line
Words Used: Line of best fit; Regression line; Trend line; Model; Fitted line; Least squares line
Comparing Analyses of One-Variable and Two-Variable Distributions
Thinking of the parallels between exploring a one-variable distribution and
exploring a two-variable relationship is helpful. You may want to discuss this
table with your students, which you can show them on the overhead projector or
give them as a handout (master on page 92).
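Key Idea
Chapter 2: One Variable: Distribution
Chapter 3: Two Variables: Relationship (association)
Plots
Chapter 2: One Variable: Dot plot; Stemplot; Boxplot; Histogram
Chapter 3: Two Variables: Scatterplot
Shape
Chapter 2: One Variable: Normal, uniform, or skewed; Symmetric; Clusters, gaps, and outliers
Chapter 3: Two Variables: Linear or curved; Constant strength; Clusters, gaps, and outliers
Ideal Shape
Chapter 2: One Variable: Normal
Chapter 3: Two Variables: Linear (oval/ellipse)
Measure of Center
Chapter 2: One Variable: Mean; Median
Chapter 3: Two Variables: Regression line
Measure of Spread from the Center
Chapter 2: One Variable: Standard deviation; Interquartile range
Chapter 3: Two Variables: Correlation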
Notes for AP Teachers
Preparing for the AP Exam Throughout the Year
Describing Distributions in Context: Shape, Trend, and Strength
As in Chapter 2, AP Exam free-response problems that cover the content of
this chapter frequently ask the student to describe distributions of bivariate
data in context. With bivariate data, instead of shape, center, and spread,
we use shape, trend, and strength.
Statistics in Action Instructor’s Guide, Volume 1
© 2008 Key Curriculum Press
Section 3.1 is a great place to start students focusing on a complete description
of a scatterplot in context. Even though they can’t yet compute the regression
equation or the correlation, they can still describe a scatterplot’s trend and
strength, just by looking at the data. When they do learn how to do the
calculations, the results serve as additional evidence for the conclusions they
reach by observing and analyzing a scatterplot as described in Section 3.1.
In the past, the question on the AP Exam has been stated along these lines,
“Describe the nature of the relationship between the two variables.” (See
AP Exam 2004 B Question 1, 2002 Question 4, 2000 Question 1.) For the
answer to be “essentially correct,” the student needs to include four
important pieces:
1. Shape: Whether it was linear or nonlinear—if there was a pattern, that should
be stated.
2. Trend: Is there a positive or negative trend?
3. Strength: If there is a trend, is it strong or weak? Does the relationship vary in
strength or is it more constant?
4. Context: The context needs to be included!
For example, if you consider Display 3.1 on page 106 of the student book, there
is a moderate positive association between the year an employee at Westvaco
was hired and the year that employee was born, and that trend is fairly linear
without curvature. There is more variability among values of y for larger values
of x than for smaller. (The student book gives another example of a description.)
The student who loses points in this area usually either leaves out one of the big
three (shape, trend, strength) or completely forgets the context. For example, for
Display 3.1, the student could write,
“There is a positive and moderate association between the variables. As one
variable increases, so does the other. There are clusters in the x 75–90’s
range for different y groupings, and a possible outlier at (43, 28).”
Although this is a fairly nice description of most features, the student forgot
to mention that there is a linear trend and never once mentioned the actual
variables involved. This description would lose points for both omissions.
Comparing Distributions: Differences in Shape, Trend, and Strength
When students are asked to compare two scatterplots (or other distributions) on
the AP Statistics Exam, they cannot simply describe each scatterplot separately.
They should say how the shape, trend, and strength of one differ from those
of the other. For example, “While both shapes are linear with a positive trend, the
points in Scatterplot A cluster more closely to the regression line than the points
in Scatterplot B.”
Thinking in Context
After the basic review of linear equations in Section 3.2, use the variable names
instead of x and y. That way, the context is always part of the discussion, and
that does seem to help students remember to include context—especially
in interpreting the slope in context. For example, after the basic review of
linear equations, the very first line the class discusses is the line they get from
Activity 3.2a, “Pinching Pages.” Use for the slope:
\[
\text{slope} = \frac{\text{change in response variable}}{\text{change in predictor variable}} = \frac{\Delta\,\text{response}}{\Delta\,\text{predictor}} = \frac{\Delta\,\text{thickness}}{\Delta\,\text{number of pages}}
\]
and for the equation of the line:
\[
\text{thickness} = (\text{thickness-intercept}) + \text{slope} \cdot \text{pages}
\]
This practice constantly encourages students to think “in context.” Instead
of thinking in terms of x’s and y’s, they’ll be thinking in terms of thickness
and pages.
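The same habit carries over to software, where variables can be named in context. The sketch below is illustrative only: the page counts and thicknesses are hypothetical stand-ins for what a pair of students might record in Activity 3.2a, and the helper `least_squares` is an assumed name, not part of any package the book uses.

```python
def least_squares(predictor, response):
    """Return (slope, intercept) of the least squares line for paired data."""
    n = len(predictor)
    mean_p = sum(predictor) / n
    mean_r = sum(response) / n
    sxy = sum((p - mean_p) * (r - mean_r) for p, r in zip(predictor, response))
    sxx = sum((p - mean_p) ** 2 for p in predictor)
    slope = sxy / sxx
    intercept = mean_r - slope * mean_p
    return slope, intercept


# Hypothetical measurements: number of pages pinched and thickness in millimeters.
pages = [50, 100, 150, 200, 250]
thickness_mm = [6.0, 10.5, 15.5, 20.0, 25.0]

slope, thickness_intercept = least_squares(pages, thickness_mm)

# Report the line with the variable names, not x and y.
equation = f"thickness = {thickness_intercept:.2f} + {slope:.3f} * pages"
print(equation)
```

With these made-up numbers the slope reads as "about 0.095 millimeter of thickness per additional page," which is the kind of in-context interpretation the AP Exam rewards.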
If you start with both of these techniques, students will, with luck,
add the context automatically by the time they are interpreting r and r².
If students use x and y on the AP Exam, they must define each variable
in context.
AP Exam Practice
As you work through the chapter, you may wish to assign questions from past
AP Statistics Exams. The questions that correspond to topics in Chapter 3
are listed in the table below. To get the free-response questions (FR), go to AP
Central, apcentral.collegeboard.com, where the free-response questions and
investigative tasks for all exams since 1997 are posted. (You will need to print the
test and then literally cut and paste if you want to use only some questions.) The
multiple-choice questions (MC) have been released only for 1997, 2002, and 2007.
(The 2007 Exam was not yet available at the time of this printing, so those items
are not listed in the table below.) These “released exams” may be purchased at AP
Central. You can also use the sample multiple-choice and free-response questions
in the AP Statistics Course Description, available as a download at AP Central.
The College Board gives teachers permission to use all of these questions with
their students.
Section
1997
MC:
28
FR:
6a
FR:
4a
3.4
MC:
31
FR:
6d–e
3.5
FR:
6b,
c, e
FR:
4b
2000
2001
FR:
1a–c
FR:
1b–d
FR:
6c
3.3
1999
FR:
2c
3.1
3.2
1998
FR:
1a, e
MC:
20
FR:
6c
FR:
6d
2002
2002
Form B
2003
FR:
6a, c
MC:
6, 34
FR:
4b–c
FR:
1b–d
MC:
28
2004
2004
Form B
2005
2005
Form B
2006
2006
Form B
FR:
1a
MC:
31
FR:
4a, c
MC:
17
FR:
4c
2003
Form B
FR:
1a–b
FR: 1a
FR:
3b, d
FR:
1b
FR:
3c
FR:
1c
FR:
3a
FR:
5a–b
FR:
2a
FR:
1a
Time Required
Section 3.1. Traditional Schedule: 1–2 days. Block: 1 day. 4 × 4 Block: 1 long.
Day 1: Describing a scatterplot
Day 2: Summary, exercises
Section 3.2. Traditional Schedule: 2–3 days. Block: 2 days. 4 × 4 Block: 1 long, 1 short.
Day 1: Activity 3.2a, lines as summaries and prediction
Day 2: Least squares regression, reading computer output
Day 3: Summary, exercises
Section 3.3. Traditional Schedule: 2–3 days. Block: 2 days. 4 × 4 Block: 2 long.
Day 1: Activity 3.3a, estimating r, formula, appropriateness of linear model
Day 2: Relation to slope, causation, interpreting r²
Day 3: Regression to the mean (optional), summary, exercises
Section 3.4. Traditional Schedule: 1–2 days. Block: 1.5 days. 4 × 4 Block: 1 long, 1 short.
Day 1: Activity 3.4a, influence, residual plots
Day 2: Summary, exercises
Section 3.5. Traditional Schedule: 2–3 days. Block: 2 days. 4 × 4 Block: 2 long.
Day 1: Activity 3.5a, exponential growth and decay, log transformations
Day 2: Log-log transformations, power functions
Day 3: Summary, exercises
Review. Traditional Schedule: 1–2 days. Block: 1.5 days. 4 × 4 Block: 1 long, 1 short.
Materials
Section 3.1: None
Section 3.2: For Activity 3.2a, a ruler with a millimeter scale and a textbook
for each pair of students
Section 3.3: For Activity 3.3a, a measuring tape, a yardstick, or a meterstick
for each pair of students
Section 3.4: For Activity 3.4a, a piece of paper for recording data
Section 3.5: For Activity 3.5a, a paper cup and 200 pennies for each student
(or each pair of students)
Suggested Assignments
Classwork
Section
Essential
Recommended
3.1
D1, D2
P1, P2
3.2
Activity 3.2a
D3–5, D8
P3–8
D6, D7
3.3
D9–13, D17, D18, D21
P9–19
Activity 3.3a
D14, D20, D23
3.4
D26–30
P22–25
Activity 3.4a
D31
3.5
Activity 3.5a
D35, D40
D32, D33, D36–38, D41, D43 P30–32, P35, P37,
P38
P26–29, P33, P34
Optional
D15, D16, D19,
D22, D24, D25
P20, P21
Chapter 3 Quiz 1
D34, D39, D42
P36, P39
Chapter 3 Quiz 2
Homework
Section 3.1. Essential: E1, E3, E5, E7. Recommended: E4. Optional: E2, E6, E8.
Section 3.2. Essential: E9, E11, E13, E15, E17, E19, E21. Recommended: E14, E22, E23. Optional: E10, E12, E16, E18, E20, E24–26.
Section 3.3. Essential: E27, E31, E33, E34, E37, E40. Recommended: E28, E29, E36, E39. Optional: E30, E32, E35, E38, E41, E42.
Section 3.4. Essential: E43, E45, E47, E49. Recommended: E44, E50. Optional: E46, E48, E51–54.
Section 3.5. Essential: E55–57, E59. Recommended: E60, E65. Optional: E58, E61–64, E66.
Chapter Summary. Essential: E67–70, E73, E81. Recommended: E74–76, E78, E80. Optional: E71, E72, E77, E79, E82–83.
For AP Students: AP1–10
3.1 Scatterplots
Objectives
• to make a scatterplot and describe its basic shape in terms of linearity,
curvature, clusters, and outliers
• to describe whether the trend in a scatterplot is positive or negative
• to describe whether the strength of the relationship is strong, moderate, or
weak and whether the strength is constant across all values of x
• to decide whether the pattern in a scatterplot can be generalized to other cases
and to propose possible explanations for the pattern
Important Terms and Concepts
• bivariate data
• scatterplot
• variables and cases for bivariate
data
• shape of a scatterplot: linear or
curved
• strength of an association or
relationship: strong, moderate,
or weak
• constant strength versus varying
strength
• lurking variable
• trend: positive or negative
Alignment with the AP Statistics Topic Outline
This section aligns with the listed items of the AP Statistics Topic Outline
as described here. The actual text of the AP Statistics Topic Outline and the
complete correlation begin on page xxi.
ID1 Students construct and interpret scatterplots.
ID2 Students examine bivariate data for correlation and linearity.
Lesson Planning
Class Time
One to two days
Materials
None
Suggested Assignments
Classwork
Essential
Recommended
Optional
D1, D2
P1, P2
Homework
Essential
E1, E3, E5, E7
Recommended
E4
Optional
E2, E6, E8
Lesson Notes: Describing the Pattern in a Scatterplot
Making Comparisons
As mentioned in the Overview, when students are asked to compare two
scatterplots (or other distributions) on the AP Statistics Exam, they cannot just
describe the shape, trend, and strength of each scatterplot separately. They should
say, for example, which is more linear, which has the greater slope, and which has
the stronger relationship.
To help students understand the concept of strength, ask them, “If you look at
the plot, do you see more trend or more variation?”
Linearity
Students often find it very difficult to get used to the idea that any scatterplot
where the points fall within an oval is called “linear.” Linear does not mean that
all points lie on or even near a line. For example, the points in the next scatterplot
have a linear relationship even though they are not clustered closely about the line.
On the other hand, the points on this next scatterplot do not follow a linear
pattern even though they are clustered rather closely to the line. This pattern
is curved.
Outliers in Two Variable Relationships
The question of what is an outlier in a scatterplot is more complicated than
with univariate data, where we could use the Q1 − 1.5 · IQR or Q3 + 1.5 · IQR
rule as a guideline for identifying possible outliers. A scatterplot can have
several kinds of outliers:
• a point with an extreme value of x or an extreme value of y or both
• a point that does not follow the general trend
In this next plot, the point represented by the x is an outlier in the sense that it
has an extreme x-value. However, its y-value is not extreme and this point follows
the curved pattern.
[Scatterplot: y versus x]
In the next plot, the point represented by the x has neither an extreme x-value
nor an extreme y-value. However, it does not follow the general trend and is
considered an outlier.
[Scatterplot: y versus x]
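The two kinds of outliers described in this subsection can be told apart numerically. The following is a rough sketch under stated assumptions, not a method from the student book: it applies the univariate 1.5 · IQR guideline separately to the x-values, the y-values, and the residuals from a least squares line. Using that guideline on residuals is an assumption made for illustration, the data are hypothetical, and the function names are invented here.

```python
from statistics import quantiles


def outside_fences(values):
    """Indices of values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [i for i, v in enumerate(values) if v < low or v > high]


def residuals(xs, ys):
    """Residuals from the least squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def flag_outliers(xs, ys):
    """Classify possible outliers as extreme in x, extreme in y, or off the trend."""
    return {
        "extreme_x": outside_fences(xs),
        "extreme_y": outside_fences(ys),
        "off_trend": outside_fences(residuals(xs, ys)),
    }


# Hypothetical plot A: the point at index 4 sits well above the trend,
# but neither its x-value nor its y-value is extreme.
xs_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys_a = [1.1, 2.0, 2.9, 4.2, 9.0, 6.1, 6.9, 8.0, 9.1, 9.9]
flags_a = flag_outliers(xs_a, ys_a)

# Hypothetical plot B: the last point has an extreme x-value but follows the trend.
xs_b = xs_a + [30]
ys_b = [1.1, 2.0, 2.9, 4.2, 5.0, 6.1, 6.9, 8.0, 9.1, 9.9, 30.2]
flags_b = flag_outliers(xs_b, ys_b)

print(flags_a)
print(flags_b)
```

One caution worth sharing with students: the fitted line is itself pulled toward unusual points, so a residual check like this one can miss an outlier that has a great deal of influence.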
Varying Strength (Heteroscedasticity)
Instead of using the word heteroscedasticity, you can substitute “fan-shaped” or
other synonyms for the word. But students like the way this word sounds, and it
is fun to use in class. It is pronounced just as it is spelled: hetero-sce-das-ticity,
where “sce” is pronounced like the “scu” part of scuff and “das” is pronounced so
it rhymes with class.
If a plot is not heteroscedastic, it is homoscedastic, which means “having the
same variance.”
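One informal way to check for a fan shape is to compare the spread of the residuals for small x with the spread for large x. The sketch below assumes a made-up data set built so that the scatter grows with x; the comparison of two halves is an illustrative shortcut, not a formal test.

```python
from statistics import pstdev

# Hypothetical fan-shaped data: the deviation from the line y = 2x grows with x.
xs = list(range(1, 21))
ys = [2 * x + 0.2 * x * (1 if x % 2 == 0 else -1) for x in xs]

# Fit the least squares line.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x
resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Compare residual spread in the lower and upper halves of the x-range.
half = n // 2
spread_low = pstdev(resid[:half])
spread_high = pstdev(resid[half:])
ratio = spread_high / spread_low

print(f"spread for small x: {spread_low:.2f}")
print(f"spread for large x: {spread_high:.2f}")
print(f"ratio: {ratio:.2f}")
```

A ratio well above 1 suggests the plot fans out to the right; a ratio near 1 is consistent with constant strength.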
A summary of the various ways to describe the pattern in a scatterplot is
shown on the next page, and a blackline master is provided on page 93.
Describing Bivariate Data
Trend: Positive; Negative; None
Strength: Strong; Moderate; Weak
Variability: Constant; Varying Strength: Fan to Right; Varying Strength: Fan to Left
Pattern: Linear; Curved; None
Influential Point: Outlier in x; Outlier in y; Outlier in the Residuals
Notes for AP Teachers
Modeling Good Answers
Once students have completed E4, show them the model answer as an example
of what is expected on the AP Exam. This item is on the AP Quiz for Chapters
2–3, on pages 41–42 of the Instructor’s Resource Book. Don’t assign this problem
yet if you plan to use that quiz; review the model answer as you go over the
quiz. A PDF file containing the question and the model answer is available at
www.keypress.com/keyonline.
Solutions
Discussion
D1. a. Display 3.1 illustrates the commonsense
idea that people born earlier typically become
employed before people who were born later.
Display 3.2 shows the same idea but uses age
instead of birth year. A larger birth year means a
smaller age, so the relationship in Display 3.2 is
the reverse of Display 3.1: Older people were hired
earlier and younger people were hired later, so the
association is negative.
b. All the points are in the lower-right half of the
plot because people cannot be hired until they are
about 18 years old. Specifically, a person’s year of
birth must be at least 18 years before the year of
hire, so all points (but one) are below the line
y = x − 18. The one point above this diagonal line
is a person who was hired into the company at a
very young age during the 1940s. Although there
is a lot of variation, in general people were born 25
or 30 years before they were hired.
Note on D1b: As you discuss the graph in the context
of a real situation, introduce the practice of writing
models with variable names that are in the context of
the problem. For example, the line y = x − 18 could be
stated as year of birth = year of hire − 18. The practice of
using intelligible variable names is consistent with many
statistics software programs but not with calculators.
Mixing words and symbols can be a problem at lower
levels, but should not present difficulty for your students.
c. No, this is not correct. The ages plotted are not
the ages of the employees when they were hired
but their ages when layoffs began. This idea will
be explored further in E7.
D2. a. For these data, the cases are the states and the
variables are the number of people per thousand
living in dorms and the proportion of the state
population living in cities.
The shape of the cluster is linear (roughly
oval or elliptical), except for three points (VT,
RI, and MA) that lie relatively far away from the
main cloud of points. Vermont, a rural state, has
a large number of colleges and a higher dorm
proportion than would be anticipated. Rhode
Island and Massachusetts also have a relatively
high proportion living in dorms, but they are
essentially urban states.
The trend is negative, as a larger proportion
of people living in cities tends to mean a smaller
proportion in dorms.
There is a lot of variability in dorm proportions
for any particular proportion living in cities, and
thus the strength of the association between the
two variables is only moderate. Other than the
three apparent outliers, the strength is relatively
constant across all values of proportion of
population living in cities.
This scatterplot shows the data for all of
the 50 states, so there is no larger population
to generalize to. What you see is all there is for
this particular year. However, it is reasonable to
generalize to a previous or subsequent year.
A possible explanation for this pattern is that
states with a high proportion of the population
living in cities also have a high proportion of their
colleges located in cities. In an urban area, there is
little need for students to live in dorms. They can
commute from home or get off-campus housing
in nearby apartments. Thus, for highly urbanized
states, a lower proportion of students need to live
in dorms.
b. The positive trend in the original data comes
from the fact that states with a large number of
people tend to have a large number of colleges and
universities and a large number of people living
in cities. A possible explanation for the negative
trend in the proportion data is given in part a.
Practice
P1. a. You may have to remind students that a
scatterplot without labels and units on the axes
is meaningless. Emphasize the importance of
appropriate labeling. Here is the scatterplot.
[Scatterplot: Height (in.) versus Age (yr)]
b. These data are not very interesting to describe.
The x-axis shows ages 2 to 7 years, and the y-axis
shows the median height of children at each age.
The shape is linear, the trend is positive, and the
strength is very strong. That is, the scatterplot
shows a very strong positive linear trend. Students
may mention that a typical child grows about
2.7 inches per year.
c. The linear trend could reasonably be expected
to hold for another year. However, median height
could not be expected to increase at this rate to
age 50, as people typically stop growing around
age 20.
d. In the background is something called
“growing up” that happens over time during the
early years of life. That is, an increase in age is
associated with an increase in height.
P2. a. Delta has the worst record for baggage handling during
this period, and Northwest has the lowest
on-time percentage among the airlines.
b. Airlines with a high percentage of on-time
arrivals and a low rate of mishandled baggage
would fall in the upper left of the plot. United and
America West are the top two in both categories,
so they are the best overall.
c. False. American’s baggage mishandling rate of
6.5 mishandled bags per thousand was not twice
Southwest’s rate of a little under 4.5. It appears to
be more than twice because the scale on the x-axis
starts at 3.75, not 0.
d. The relationship between the two variables
is negative and weak. The negative relationship
shows that an airline that is “bad” on one variable
tends to be “bad” on the other as well.
e. No. These are the largest carriers in the United
States. (The airlines included in this plot are all of
the U.S. carriers with at least 1% of total domestic
scheduled-service passenger revenues.) The other
airlines that might be added would be small
regional carriers. There is no reason to expect
their pattern to be the same as that of the large
national airlines. If we were to plot these same
airlines for the previous or following year, it would
probably look much the same but would likely be
quite different from a plot of similar variables for
ten years ago. With the increase in airport security
since the terrorist attacks of 2001, you would
expect the situation to have changed in some way.
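The visual effect described in P2c can be made concrete with a little arithmetic. This sketch uses the values quoted in that solution, rounding Southwest's rate to 4.5 (the solution says "a little under 4.5"), so the results are approximate; the function name is an illustrative assumption.

```python
def apparent_ratio(a, b, axis_start):
    """Ratio of the plotted distances from the start of the axis, not from zero."""
    return (a - axis_start) / (b - axis_start)


american = 6.5    # mishandled bags per thousand, from the P2c solution
southwest = 4.5   # approximation of "a little under 4.5"
axis_start = 3.75  # where the x-axis scale begins

actual = american / southwest
apparent = apparent_ratio(american, southwest, axis_start)

print(f"actual ratio of rates:   {actual:.2f}")
print(f"apparent ratio on plot:  {apparent:.2f}")
```

The actual ratio is well under 2, while the distances on the plot differ by a factor of more than 3, which is why the truncated axis makes American's rate look more than twice Southwest's.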
Exercises
E1. Plot a shows a positive relationship that is strong
and linear. There is fairly uniform variation across
all values of x.
Plot b shows a negative relationship that is
strong and linear, again with fairly uniform
variation across all values of x.
Plot c shows a positive relationship that is
moderate and linear with fairly uniform variation
across all values of x. One point lies a short
distance from the bulk of the data.
Plot d shows a negative relationship that is
moderate and linear with fairly uniform variation
across all values of x. Again, there is one outlier.
Plot e shows a positive relationship that
is strong and linear except for the outlier. As
students will learn, the one outlier has dramatic
influence on the strength of this relationship.
There is fairly uniform variation across all values
of x.
Plot f shows a negative relationship that is very
strong and curved. One point on the far right
lies in the general pattern but far away from the
remainder of the data, which accentuates the
strong relationship. Another outlier lies below the
bulk of the data on the left.
Plot g shows a negative relationship that is
strong and curved. The two points at either end
of the array accentuate the curvature. There is a
bit more variability among values of y for smaller
values of x than for larger values of x.
Plot h shows a positive relationship that is
strong and curved. Again, the outlier on the
extreme right accentuates the curved pattern and
would have dramatic influence on where a trend
line might be placed. The variability in y is fairly
constant across all values of x.
E2. a. Positive and strong: As eggs get bigger, both
length and width increase proportionally.
b. Positive and moderate: Most students tend to
score relatively high on both parts of the exam,
middling on both parts, or relatively low on both
parts of the exam.
c. Positive and strong: Trees produce one new
ring each year.
d. Negative and moderate: People tend to lose
flexibility as they get older.
e. Positive and strong: The number of
representatives is roughly proportional to the
population of a state.
f. Positive and weak: Large countries tend to
have large populations, but there are notable
exceptions such as Canada and Australia. Also,
some small countries in area have very large
populations, such as Indonesia.
g. Negative and strong (but curved): Winning
times tend to improve (get smaller) over the years.
E3. a. II
b. IV (The larger states, mainly in the West,
tend to have fewer people per square mile.)
c. III
d. I (Heavier cars tend to get lower gas mileage.)
E4. a. Iowa 5% and Illinois 10%; about 92% of
New York students took the SAT I, and they
averaged 511.
b. The overall trend is negative, moderately
strong, and curved. The gap between 40 and
50% in the middle of the scatterplot suggests
two groups of states: one with low percentages
and high average scores and another with high
percentages and low average scores. One state
stands out a bit from the rest: West Virginia, with
only 20% taking the SAT I and a relatively low
average of 511.
c. Yes, the distribution of the percentage taking
the SAT I looks bimodal because there is a cluster
of percentages around 10 and a second around 65
to 75.
[Histogram: Frequency versus Percentage Taking Exam]
The distribution of SAT I math scores also
looks like it may be bimodal. There appears to be
a cluster of scores around 510 and another cluster
around 560.
[Histogram: Average SAT Math Score versus Frequency]
In the histogram of the average SAT I math
scores, even though you can see two peaks around
510 and 560, the shape is more skewed toward the
larger values than bimodal.
Note on E4c: Ask students to visualize all of the points
dropping onto the x-axis so that they can see the
distribution of percentage taking SAT. A histogram
of percentages is shown above. It makes a nice
demonstration to display the scatterplot in Fathom
and remove the y attribute and watch the dots drop to
the x-axis. (The data are available as a Fathom file on
the Instructor’s Resource Book CD.) To visualize the
distribution of SAT scores, ask students to imagine all
the points dropping onto the y-axis. (Again, the Fathom
file makes a nice demonstration.)
d. There are no more states to add, so this is the
complete picture for the given year. You might
generalize, however, to the previous year and the
next year. In fact, the plots for the last 20 or so
years look similar to this one. These numbers do
not change rapidly from year to year.
e. The Midwestern states are predominantly
ACT states. In these states, only small percentages
of students take the SAT I, and these tend to be
the better students who are trying for admission
to exclusive colleges, perhaps outside the Midwest.
If only a few students in a state are taking the
SAT I, they are probably the better students in
the state and their average scores would then be
higher than the average scores for other states.
Thus, as the percentage of students taking the
SAT I increases, the average score tends to decrease.
Although this explanation makes sense, we cannot
be sure from these data alone.
E5. a. Plots A, B, and C are the most linear. Plot
D is not linear because of the seven universities
in the lower right, which may be different from
the rest. Plot A, of graduation rate versus alumni
giving rate, gives some impression of downward
curvature. However, if you disregard the point in
the upper right, the impression of any curvature
disappears.
Plots A, B, and C all have just one cluster.
However, plot D, the plot of graduation rate versus
top 10% in high school, has two clusters. Most of
the points follow the upward linear trend, but the
cluster of seven points in the lower right with the
highest percentage of freshmen in the top 10%
shows little relationship with the graduation rate.
Plots A and C have possible outliers. Plot A,
the plot of graduation rate versus alumni giving
rate has a possible outlier in the upper right. The
point is below the general trend and its x-value
(but not its y-value) is unusually large. In plot C,
Section 3.1 Solutions
87
88
Section 3.1 Solutions
100
90
Graduation Rate
80
70
60
1200
Top 25%
1300
1400
1500
SAT 75th Percentile
1600
26%–50%
Note on E5d: You may wish to mention that these
relationships hold for universities and that the data
provide no evidence about whether the relationships
hold for individuals.
e. Graduation rates may increase as SAT scores
increase because better prepared students may
be more successful in college courses, so a
university with a greater number of prepared
students will, on average, graduate a higher
percentage of students. Alumni giving rates may
increase as graduation rates increase because the
university has produced happy alumni. These
data do not “prove” this claim, however, because
of other possible explanations. These types of
observational studies cannot prove claims; the
proof of a claim requires an experiment, which is
one of the topics of the next chapter.
Note on E6: If your students aren’t working with
computers, you may wish to provide them with copies of
the three scatterplots.
E6. All three have positive association, but
circumference appears to have the strongest
positive association with hat size. Students
may discover this rule: hat size is equal to
circumference divided by p and then rounded to
the closest eighth.
The measurements tend to come aligned in
vertical strips, indicating that the students made
the measurements only to the nearest quarter of
an inch.
Hat Size
the plot of graduation rate versus SAT 75th
percentile, the points toward the upper left and the
middle right should be examined because they
are farther from the general trend than the other
points, although neither their x-values nor their
y-values are unusual. The point in the lower right
of plot D should also be examined, along with the
other six points nearby.
b. Plots A and C have similar moderate positive
linear trends. Plot D shows wide variation in both
variables, with little or no trend. Plot B is the only
plot that shows a negative trend.
c. Among these four variables, it appears that
the alumni giving rate is the best predictor of the
graduation rate, and SAT scores (as measured by
the 75th percentile) is second best. However, both
of these relationships are moderate and neither is
a strong predictor of graduation rate. Ranking in
high school class (as measured by the top 10%) is
almost useless as a predictor of college graduation
rate. Plot A, of graduation rate versus alumni
giving rate, owes part of the impression of a strong
relationship to the point in the upper right. This
plot shows some heteroscedasticity, with the
graduation rate varying more with smaller alumni
giving rates.
Students should understand that even though
the relationship between, say, the graduation rate
and the student/faculty ratio is negative, that’s
not what makes the student/faculty ratio a poor
predictor of the graduation rate. Given a specific
student/faculty ratio, we can predict a graduation
rate. The problem is the great deal of uncertainty
about how close the actual rate would be to the
predicted rate because the range of graduate rates
is large for any given student/faculty ratio.
d. The relationships could change considerably
when looking at all universities because these
are highly rated universities, so the values of all
variables tend to be “good.” A larger collection
of universities may have more spread in the
values of all variables and, possibly, a stronger
relationship between graduation rate and other
variables.
To explain to students how this could be, show
students the version that follows of the graduation
rate versus SAT 75th percentile plot. On this plot,
the x’s represent the top 25 universities and the
closed circles represent the next 25 most highly
rated universities. Note that if you look at either
group, there is very little upward trend. However,
putting the two groups together gives a stronger
linear trend. If another group of 25 universities
were added, the trend probably would be stronger.
This phenomenon is sometimes called the effect of
a restricted range.
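If you would like to demonstrate the restricted-range effect with software, here is a minimal Python sketch. It uses simulated data, not the university data: the linear pattern, the amount of scatter, the seed, and the cutoff of 75 are all assumptions chosen for illustration. The correlation is computed once for all cases and once for only the cases with the largest x-values.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Simulated data (an assumption, not the university data):
# y rises linearly with x, plus random scatter.
rng = random.Random(1)
x_all = [rng.uniform(0, 100) for _ in range(2000)]
y_all = [x + rng.gauss(0, 20) for x in x_all]

# Restrict the range: keep only the cases in the top quarter of x.
top = [(x, y) for x, y in zip(x_all, y_all) if x >= 75]
x_top = [x for x, _ in top]
y_top = [y for _, y in top]

r_all = pearson_r(x_all, y_all)
r_top = pearson_r(x_top, y_top)
print(f"all cases:   r = {r_all:.2f}")
print(f"top quarter: r = {r_top:.2f}")
```

The scatter about the trend is the same in both groups. Only the spread in x differs, and that is enough to make the restricted group look much more weakly related.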
[Scatterplot: hat size versus circumference (in.)]
Statistics in Action Instructor’s Guide, Volume 1
© 2008 Key Curriculum Press
[Scatterplots: hat size versus major axis length (in.); hat size versus minor axis length (in.)]
E7. a. The cases are the individual employees
at the time of the layoffs, and the variables are
the employee’s age at hire and the year of hire.
The triangular shape of this plot indicates
heteroscedasticity. The direction of the cloud
is upward to the right, showing weak positive
association between age at hire and year of hire.
There are no points in the upper-left side of the
plot because these employees would have reached
retirement age.
b. This plot does not help us decide this
question. Because so many people who were
hired in the early years (and even recently)
would have retired, we do not know whether
older people were hired then or not. To
determine whether age discrimination in hiring
may have existed, we need a plot of the age
at hire of all people hired, not just those who
remained at the time of layoffs.
c. From this plot, it appears that people hired
earliest were more subject to layoff, not necessarily
the older employees. Everyone hired before the
early 1960s was laid off, but not all of the older
people were laid off. Perhaps, then, it was higher
salaries because of seniority or obsolete job skills
that resulted in a greater proportion of older
employees being laid off.
Note on E8: A computer should be used for this exercise.
This open-ended investigation can be very time
consuming if students do all three parts. We suggest that
you assign just one of the three parts to each student
(or small group of students) and have them share their
results. Or you could also give each student a copy of
the scatterplot matrix on page 91 (also reproduced
on page 94). This graphic was made with statistical
software. All of the relationships requested in this
exercise are shown in one large plot. For example, the
plots requested in part a, with cost per hour on the
vertical axis, are shown across the bottom row.
E8. a. cost per hour
i. Students might point out that all of the
associations are positive. As the variable on
the x-axis increases, so does the cost per
hour. Although all relationships are strong,
the strongest are cost per hour versus fuel
consumption per hour and versus number
of seats. The other three relationships are
weaker and show less constant strength. The
scatterplot of cost per hour against cargo
space is the most fan-shaped.
Relationships of cost per hour to fuel
consumption per hour and to number of
seats are the most linear. The relationship
of cost per hour to speed is the most curved.
As speed increases, the cost per hour stays
relatively constant up to about 460 miles
per hour and then increases rapidly with
increasing speed.
There appear to be no outliers in
cost per hour. However, one plane is an
outlier in flight length, the B747-400.
ii. The patterns in the plots do show that
bigger planes cost more per hour to operate,
but this may be perfectly efficient given that
they carry more people and more cargo (and
go faster).
One variable that might measure cost
efficiency for the airplanes is cost per hour
per seat (cost/seat). (Students may come up
with others.) A plot showing this variable
plotted against its denominator is shown
here. Notice cost per hour per seat remains
somewhat constant (but with a lot of
variability) across the number of passengers
carried. That is, larger planes tend to cost
about the same to fly a passenger for an hour
as smaller planes. However, larger planes also
tend to go faster and take less time to travel
the same number of miles. Considering that,
larger planes may be more efficient.
[Scatterplot: cost per hour per seat versus number of seats]
The next scatterplot shows cost per
passenger mile (cost per hour, divided by speed
in miles per hour, divided by number of seats or
passengers carried) versus number of seats.
[Scatterplot: cost per passenger per mile versus number of seats]
b. flight length
i. Students might discover some of these
things from their scatterplots or from the
scatterplots in the fourth column of the
scatterplot matrix:
Number of seats has a stronger
relationship with flight length than does
cargo. These are all passenger aircraft, and
their primary purpose is to carry passengers.
Cargo is added as space (and weight) is
available. The weak relationship in the cargo
versus flight length plot is due primarily to
two aircraft, the large A300-600 (Airbus),
used primarily to carry passengers and cargo
over the short routes in Europe, and the
B747-400, used to carry large numbers of
passengers (and relatively less freight) over
the very long international routes.
Planes with the longest flight lengths (the
B747’s) have the most seats but are not at the
top of the cargo carried. This can be seen in
the plots of seats and cargo versus length, but
you have to look back at the data to identify
the planes.
A description of the scatterplot of speed
versus flight length follows.
Cases and variables: The cases are the
planes in the data set, and the variables are
the airborne speed in miles per hour and the
length of flight in miles.
Shape: The shape shows a single cluster
of points in a thin, curved cloud that opens
downward, with one point (the B747-400) as
an outlier on the length axis.
Trend: The direction of the relationship is
positive.
Strength: The relationship is very strong;
a pattern is quite obvious.
Generalization: It seems reasonable that
a similar pattern might appear even if other
planes were added to the study. The general
pattern of speed versus flight length should
not depend entirely on the specific planes
being studied here.
Explanation: As the typical length of
flight increases, airlines tend to use faster
planes because it is inefficient to use a fast
plane on a short flight and inconvenient
for travelers to use a slow plane on a long
flight. But there is a maximum speed that
can be achieved by the designs used for
commercial aircraft so that few planes
fly much over 500 miles per hour, which
causes the leveling off of the plot for the
longer flights.
ii. The faster planes used on the longer flights
use more fuel per hour, but they cover
many more miles in an hour than do the
slower planes. So perhaps they use less
fuel per mile.
To compute gallons of fuel used per mile,
divide fuel in gallons per hour by speed
in miles per hour. This plot shows fuel
consumption in gallons per mile plotted
against flight length.
[Scatterplot: gallons of fuel per mile versus flight length]
Apparently, the planes that are capable of
flying longer distances use more gallons of
fuel per mile than do planes that fly shorter
distances. This isn’t surprising, as they carry
more passengers and cargo. Further, flying
faster may take more gallons per mile (as with
an automobile).
In the plot of cost per passenger per mile versus
number of seats from part a.ii, we see that larger
planes do tend to be somewhat more cost efficient
per passenger mile. However, the relationship is weak.
c. speed, seats, and cargo
i. Here are some things students may discover:
The curvature is more pronounced in the
relationship between speed and cargo. The
slower (and smaller) planes carry little cargo,
and the plane that carries the biggest cargo
has only about medium speed. The planes
that carry the biggest cargo carry only a
moderate number of passengers.
The A300-600 (Airbus) is unusually slow,
both for the amount of cargo it carries (it has
the largest cargo capacity of all the planes on
the list) and for the number of seats it has.
ii. The flat part to the left of the plot of cargo
versus seats reveals that passenger planes
with a relatively small number of seats (up
to nearly 200) carry very little cargo. The fan
shape to the right reveals that as the planes
get bigger and the number of seats increases
beyond 200, the amount of cargo carried by
these planes also increases. However, the
variation in the cargo-carrying capacity of a
plane also increases as the number of seats
increases.
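The derived efficiency measures used in parts a.ii and b.ii of E8 are simple ratios, and students working in software can compute them directly. The Python sketch below shows the arithmetic. The function names and the sample plane are hypothetical (the figures are not taken from the aircraft data set); only the definitions of the ratios come from the solutions above.

```python
def cost_per_seat_hour(cost_per_hour, seats):
    """Cost per hour divided by number of seats."""
    return cost_per_hour / seats

def cost_per_passenger_mile(cost_per_hour, speed_mph, seats):
    """Cost per hour, divided by speed (mi/h), divided by seats."""
    return cost_per_hour / speed_mph / seats

def gallons_per_mile(gallons_per_hour, speed_mph):
    """Fuel in gallons per hour divided by speed in miles per hour."""
    return gallons_per_hour / speed_mph

# Hypothetical plane, for illustration only (not from the data set).
plane = {
    "cost_per_hour": 6000.0,     # dollars per hour
    "speed_mph": 500.0,          # airborne speed, miles per hour
    "seats": 300,
    "gallons_per_hour": 2500.0,
}

seat_hour = cost_per_seat_hour(plane["cost_per_hour"], plane["seats"])
passenger_mile = cost_per_passenger_mile(
    plane["cost_per_hour"], plane["speed_mph"], plane["seats"]
)
fuel_mile = gallons_per_mile(plane["gallons_per_hour"], plane["speed_mph"])

print(f"cost per seat per hour:  ${seat_hour:.2f}")
print(f"cost per passenger mile: ${passenger_mile:.3f}")
print(f"gallons of fuel per mile: {fuel_mile:.1f}")
```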
[Scatterplot matrix of the aircraft data, with variables Seats, Cargo, Speed, FlLength, Fuel, and Cost]
Distributions and Relationships

                       | Chapter 2: One Variable | Chapter 3: Two Variables
Key Idea               | Distribution | Relationship (association)
Plots                  | Dot plot; stemplot; boxplot; histogram | Scatterplot
Shape                  | Normal, uniform, or skewed; symmetric; clusters, gaps, and outliers | Linear or curved; constant strength; clusters, gaps, and outliers
Ideal Shape            | Normal | Linear (oval/ellipse)
Measure of Center      | Mean; median | Regression line
Measure of Spread from the Center | Standard deviation; interquartile range | Correlation

Describing Bivariate Data

Trend             | Positive; Negative; None
Strength          | Strong; Moderate; Weak
Variability       | Constant; Varying Strength: Fan to Right; Varying Strength: Fan to Left
Pattern           | Linear; Curved; None
Influential Point | Outlier in x; Outlier in y; Outlier in the Residuals
Matrix Plot of Aircraft Data
[Scatterplot matrix of the aircraft data, with variables Seats, Cargo, Speed, FlLength, Fuel, and Cost]
3.2 Getting a Line on the Pattern
Objectives
After reviewing the definition of slope and the slope-intercept form of a linear
equation, students will learn
• to interpret the slope and y-intercept in the context of the situation
• to understand when it is appropriate to use a fitted line to model a relationship
and to predict y when the value of x is known
• to understand that interpolation is more trustworthy than extrapolation
• to compute and interpret residuals and draw them on the scatterplot
• to understand that the least squares regression line minimizes the sum of the
squared errors (residuals)
• to compute the least squares regression line
• to read regression output from various statistical software packages
• to understand various properties of the least squares regression line
Important Terms and Concepts
• slope; y-intercept
• least squares regression line; fitted line
• predictor or explanatory variable x
• predicted value ŷ
• response variable y
• observed value y
• interpolation; extrapolation
• residual
• sum of squared errors (SSE)
Alignment with the AP Statistics Topic Outline
This section aligns with the listed items of the AP Statistics Topic Outline
as described here. The actual text of the AP Statistics Topic Outline and the
complete correlation begin on page xxi.
ID2 Students examine bivariate data for correlation and linearity.
ID3 Students examine least squares regression lines for bivariate data.
Lesson Planning
Class Time
Two to three days
Materials
For Activity 3.2a, a ruler with a millimeter scale and a textbook for each pair
of students
Suggested Assignments
Classwork
Essential
Recommended
Activity 3.2a
D3–5, D8
P3–8
Optional
D6, D7
Homework
Essential
Recommended
E9, E11, E13, E15, E17, E19, E14, E22, E23
E21
Optional
E10, E12, E16, E18, E20,
E24–26
Lesson Notes: Lines as Summaries
Students should be able to find the equation of a line, given two points or a point
and a slope. However, many have not interpreted an equation of a line in the
context of a situation, especially its slope. See Activity 3.2a step 5 for an example.
Why Do We Use $y = b_0 + b_1 x$ in Statistics Rather Than $y = mx + b$?
In multiple regression, there can be many explanatory variables: $x_1, x_2, x_3, \ldots, x_n$.
For example, if you want to predict college grade point average, you might have
a linear combination of variables such as high school GPA, number of Advanced
Placement courses, SAT score, number of mathematics courses taken in high
school, and so on. The fitted “plane” would then be of the form
$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \cdots + b_n x_n$$
which is a straightforward generalization of
$$\hat{y} = b_0 + b_1 x$$
If we used the form $y = mx + b$, we would have no obvious way to generalize the
symbols to the case of more than one explanatory variable.
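One way to see how naturally the $b_0, b_1, \ldots, b_n$ notation generalizes is to write the prediction as a single function that works for any number of explanatory variables. The Python sketch below is illustrative only: the function name `predict` and the coefficient values are made up for the example.

```python
def predict(b0, slopes, xs):
    """Predicted value y-hat = b0 + b1*x1 + b2*x2 + ... + bn*xn."""
    if len(slopes) != len(xs):
        raise ValueError("need one slope per explanatory variable")
    return b0 + sum(b * x for b, x in zip(slopes, xs))

# One explanatory variable: y-hat = b0 + b1*x (made-up coefficients).
one_variable = predict(1.0, [0.5], [3.0])

# Two explanatory variables: same function, same notation.
two_variables = predict(1.0, [0.5, 0.1], [3.0, 2.0])

print(one_variable)
print(two_variables)
```

With the $y = mx + b$ form there is no comparable way to name the coefficients of a second or third variable.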
Notes on Calculator Use
When using a calculator to get a least squares regression line, there is a new
wrinkle: The equation of the regression line may be written $y = a + bx$. So if
your students are used to $y = mx + b$, with m as the slope and b as the intercept,
warn them to be careful: b can be the slope. Alternatively (or additionally), a
calculator may write the equation in the form $y = ax + b$.
When a graphing calculator computes the regression line, it gives values of a
and b precise to many decimal places. If you round these values, any calculations
you do with them (such as interpolating or extrapolating) will carry that rounding error.
To avoid this, use the stored values for a and b from your calculator. For the
TI-83 Plus or TI-84 Plus, they can be found in the variables menu (VARS). Select
5:Statistics… then arrow over to EQ. The a and b listed are the values from the most
recently calculated regression equation.
The TI-84 Plus calculator has a statistical feature called Manual-Fit, which
allows you to place a line on the screen and adjust it by changing the slope and
y-intercept right from the graph screen. This feature could be used to quickly
create the lines for parts b and c. You can access this command by pressing STAT,
arrowing over to CALC , and selecting D: MANUALFIT. If you press while the
manual line is on the screen, the equation is stored into Y1.
Activity 3.2a: Pinching Pages
The Pinching Pages activity can be done quickly and requires only the textbook
and a ruler marked with millimeters for each student or pair of students. This
activity gives students experience with an easily comprehended interpretation of
the slope and y-intercept in context.
Note: This activity is essential for AP students; it will help them realize that
the line of best fit need not contain any of the data points—to determine
the equation of the line, students must use points on the line, which are not
necessarily points from the data set.
1–2. A set of data for five “pinches” of a standard textbook is given below.
(The sheets may have different thickness from the book your students have.)
The thickness measurements are in millimeters. Students who subtract
page numbers instead of counting sheets will get estimates that are half the
thickness of a sheet (because there are two “pages” per sheet).
Row   Sheets   Thickness (mm)
1     50       6.0
2     100      11.0
3     150      12.5
4     200      17.0
5     250      21.0
3–4. This scatterplot shows a nearly straight line. It should be linear as an
increase of one sheet adds a fixed amount to the thickness.
[Scatterplot: total thickness (mm) versus number of sheets]
5. The line passes near the points (50, 6) and (250, 21), so its slope is about
0.075. Slope measures change in y per unit change in x, and because a unit
change in x is one sheet and y is thickness, this slope is an estimate of the
thickness of one sheet. The y-intercept can be found by using any point on the
line and solving for $b_0$:
$$6 = 0.075(50) + b_0$$
$$b_0 = 2.25$$
The y-intercept is the thickness if no sheets are included, so it is the
approximate thickness of the cover of the book.
You can write this equation as
$$\text{total thickness} = \text{y-intercept} + \text{slope} \cdot (\text{number of sheets})$$
6. Because the points lie so close to the line, we believe the measurements
are fairly precise. However, because we measured only to the nearest
0.5 millimeter, we expect some inaccuracy in each measurement—some will
be too small and others too large. By using the line to estimate the slope, we
are averaging out those errors.
7. If the pinched pages did not include the front cover, the y-intercept should be
very close to 0. It may not be exactly 0 because of measurement error.
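The arithmetic in step 5 can be checked with a few lines of Python. This is a sketch only, using the sample pinch data given for steps 1–2 and the two points read from the line in step 5; the helper name `line_through` is made up for the example. It also lists the residuals, which makes the point in step 6 concrete: the individual measurements miss the line in both directions.

```python
# Sample pinch data from steps 1-2: (number of sheets, thickness in mm).
pinches = [(50, 6.0), (100, 11.0), (150, 12.5), (200, 17.0), (250, 21.0)]

def line_through(p, q):
    """Return (intercept, slope) of the line through points p and q."""
    (x1, y1), (x2, y2) = p, q
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return intercept, slope

# Two points read off the fitted line, as in step 5.
b0, b1 = line_through((50, 6.0), (250, 21.0))
print(f"slope (mm per sheet): {b1:.3f}")
print(f"y-intercept (mm): {b0:.2f}")

# Residual = observed thickness - thickness predicted by the line.
residuals = [y - (b0 + b1 * x) for x, y in pinches]
for (x, y), e in zip(pinches, residuals):
    print(f"{x:4d} sheets: observed {y:5.2f} mm, residual {e:+.2f} mm")
```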
Minimum Wage Example
In the example on page 118, students are asked to estimate the slope of a line. The
data for Display 3.14 are given in the accompanying table. The equation of the
regression line of minimum wage against year is
$$\text{minimum wage} = 0.1009\,\text{year} - 197$$
When estimating the slope of the line we used two points on the line,
(1960, 0.80) and (2000, 4.80), with the y-values approximated from the plot.
Neither of these two points is an actual data value and, in fact, none of the actual
data values are on the line for which we want to estimate the slope.
Year   Minimum Wage ($)
1960   1.00
1965   1.25
1970   1.60
1975   2.10
1980   3.10
1985   3.35
1990   3.80
1995   4.25
2000   5.15
2005   5.15
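If you want to confirm the regression equation from the table, the Python sketch below computes the least squares slope and intercept by hand (sum of cross products over sum of squared deviations in x) and compares the slope with the two-point estimate described above. The function name `least_squares` is made up for the example; the data are the ten rows of the table.

```python
years = [1960, 1965, 1970, 1975, 1980, 1985, 1990, 1995, 2000, 2005]
wages = [1.00, 1.25, 1.60, 2.10, 3.10, 3.35, 3.80, 4.25, 5.15, 5.15]

def least_squares(xs, ys):
    """Return (intercept, slope) of the least squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The line passes through the point of averages (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return intercept, slope

b0, b1 = least_squares(years, wages)
print(f"minimum wage = {b1:.4f} year + ({b0:.1f})")

# Slope estimated from two points read off the line, as in the text.
eye_slope = (4.80 - 0.80) / (2000 - 1960)
print(f"two-point slope estimate: {eye_slope:.2f}")
```

The two-point estimate of about 0.10 agrees with the computed slope to two decimal places, even though neither point is a data value.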
Why Points Farther Apart Give Better Estimates of the Slope
Here is an example to show why picking points with values of x that are farther
apart works better when estimating the slope. Suppose the equation of the line
is actually $y = x$, so the slope is 1. Suppose also that your estimate of y tends to
be off by about 5 units. If you select 1 and 2 as values for x, you might estimate
the y-values to be −4 rather than 1, and 7 rather than 2. These values give an
estimate for the slope of
$$\frac{7 - (-4)}{2 - 1} = 11$$
which is wildly off. If you pick 1 and 101 as the values for x, you might estimate
the y-values to be −4 and 106. This gives an estimate for the slope of
$$\frac{106 - (-4)}{101 - 1} = 1.1$$
which is fairly close.
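The same calculation can be repeated for any spacing between the two x-values. The Python sketch below assumes, as in the example, that the true line is y = x and that the two y-values are misread by −5 and +5 units; the function name and the extra spacing of 10 are illustrative additions.

```python
def slope_estimate(x1, x2, error1=-5.0, error2=5.0):
    """Slope through two misread points when the true line is y = x."""
    y1 = x1 + error1  # first y-value read 5 units too low
    y2 = x2 + error2  # second y-value read 5 units too high
    return (y2 - y1) / (x2 - x1)

for x1, x2 in [(1, 2), (1, 11), (1, 101)]:
    estimate = slope_estimate(x1, x2)
    print(f"x = {x1} and {x2}: estimated slope {estimate:.1f} (true slope 1)")
```

The reading errors contribute a fixed 10 units to the rise, so their effect on the estimated slope shrinks in proportion to the run.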
Lesson Notes: Using Lines for Prediction
Students should be able to determine whether a regression model tends to
overestimate or underestimate the value of y for particular values of x by viewing
the residual plot. For example, if the observed value is above the line, the estimate
is an underestimate because the predicted value on the line is below (less than)
the observed value. As shown in Display 3.16 of the student book, a residual is the
vertical distance between the actual (observed) value and the predicted value at a
particular value of the explanatory variable.
Once students have found a regression line, some will rely on it to the point
of forgetting about the variability in the original data. It is not uncommon
for students to predict a value but then not be able to see that it may be an
underestimate, overestimate, or about right. Instead of going back and looking at
how the data points vary around that predicted value, students might state that
the value is a good estimate because it falls on the regression line and the line is
an appropriate model for these data. Even after you show them various clusters
that are below or above the line and tell them that the estimates from the line can
be much too high or too low, students may still have a hard time accepting that a
good model can produce estimates that are either too high or too low.
You may want to connect the term explanatory variable to the independent
variable and the term response variable to the dependent variable to help students
connect the terminology with what they understand about functions.
Lesson Notes: Least Squares Regression Lines
Why the SSE?
Why do we minimize the SSE, the sum of the squared errors (residuals), not just
the sum of the errors or the sum of the absolute value of the errors?
• Minimizing the sum of the residuals does not give a unique line. As students
will see in E20, any line through the point of averages has residuals that sum
to 0, and some of these lines clearly are not good fits to the data.
• The same reason applies to the sum of the absolute errors. For example,
consider the four points (0, 0), (0, 1), (1, 1), and (1, 2). The lines $y = 1$, $y = x$,
$y = 1 + x$, $y = 2x$, and $y = 0.5 + 1.5x$ all have a sum of absolute residuals
equal to 2. (See also D7.)
• The sum of squared errors is a recurring theme in statistics. Students first met
this idea with the standard deviation, which is constructed from the sum of the
squared “errors” of the data values from the mean. The mean, in fact, is the
measure of the center that minimizes the sum of the squared errors, much as
the regression line is the “center” line that minimizes that sum.
• A nice formula results when we minimize the SSE. Although students may not
yet appreciate it, the formula for the slope of the least squares regression line
is easy to use, and it is easy to prove that it minimizes the sum of the squared
errors. There is no simple formula for minimizing the sum of the absolute
values of the residuals.
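• A sum of squared errors is like our usual measure of Euclidean distance:
$$\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}$$
To see how this applies to regression, think of the data $(x_i, y_i)$, $i = 1, 2, \ldots, n$, as
two vectors
$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \quad \text{and} \quad \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$
in n-dimensional space. If the points $(x_i, y_i)$ are not collinear, then vector $\mathbf{y}$ does
not lie in the plane spanned by the vectors $\mathbf{I} = [1 \; 1 \; \cdots \; 1]^T$ and $\mathbf{x}$. That is, $\mathbf{y}$ is
not equal to $b_0 \mathbf{I} + b_1 \mathbf{x}$ for any $b_0$ and $b_1$. The SSE for the least squares line is the
square of the Euclidean distance between the endpoint of vector $\mathbf{y}$ and the plane
spanned by $\mathbf{I}$ and $\mathbf{x}$.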
These two articles give more on this geometric interpretation:
• “The Geometry of Linear Regression” by Richard Parris in Consortium 58
(Summer 1996): 8–9, or math.exeter.edu/rparris/documents.html
• “The ‘Naturalness’ of Squaring in Linear Regression,” by Dan Teague, at
courses.ncssm.edu/math/TALKS
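The four-point example in the list above can be verified numerically. The Python sketch below checks that the five lines named there tie on the sum of absolute residuals, and then shows that the sum of squared residuals does single out one line. The least squares line used for comparison, $\hat{y} = 0.5 + x$, is computed from the four points and is not stated in the text; the helper names are made up for the example.

```python
points = [(0, 0), (0, 1), (1, 1), (1, 2)]

# (intercept, slope) pairs for the five lines named in the text.
lines = {
    "y = 1": (1.0, 0.0),
    "y = x": (0.0, 1.0),
    "y = 1 + x": (1.0, 1.0),
    "y = 2x": (0.0, 2.0),
    "y = 0.5 + 1.5x": (0.5, 1.5),
}

def residuals(b0, b1):
    """Observed y minus predicted y for each of the four points."""
    return [y - (b0 + b1 * x) for x, y in points]

def sum_abs(b0, b1):
    return sum(abs(e) for e in residuals(b0, b1))

def sse(b0, b1):
    return sum(e * e for e in residuals(b0, b1))

for name, (b0, b1) in lines.items():
    print(f"{name:15s} sum |residual| = {sum_abs(b0, b1):.1f}, SSE = {sse(b0, b1):.2f}")

# Least squares line for these points, computed separately: y-hat = 0.5 + x.
ls_line = (0.5, 1.0)
print(f"least squares   sum |residual| = {sum_abs(*ls_line):.1f}, SSE = {sse(*ls_line):.2f}")
```

All six lines have a sum of absolute residuals of 2, but only the least squares line reaches the smallest SSE.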
Properties of the Least Squares Regression Line
The boxed information on page 125 of the student book gives a concise summary
of the important facts about residuals from a least squares regression line. In
addition, it provides formulas and a procedure by which a student can calculate
the equation of a least squares regression line.
The third property of the least squares regression line is equivalent to saying
that the sum of the squares of the deviations of the values of y from their
predicted values ŷ is as small as possible, or, that the SSE is as small as
possible.
Note: The AP Statistics Exam is unlikely to ask a student to calculate a regression
equation by hand in this way. However, it is important for all students to
experience the process to deepen their understanding.
Proof of the Regression Formula
Many multivariate calculus books include a proof that the procedure given in
the text does indeed give the equation of the line that minimizes the sum of the
squared errors. A readable one may be found in William G. McCallum et al.,
Multivariable Calculus, 3rd ed. (New York: Wiley, 2002), page 714.
For a linear algebra proof, see David C. Lay, Linear Algebra and Its
Applications, 2nd ed. (Reading, Mass.: Addison-Wesley, 2000), pages 404–416.
For a noncalculus-based approach, see Dan Kalman, Elementary
Mathematical Models (Washington, D.C.: Mathematical Association of America,
1997), Chapter 8.
Note on Calculator Use
Some graphing calculators automatically calculate residuals and store them every
time a regression is performed. See Calculator Note 3D in Calculator Notes for
the Texas Instruments TI-83 Plus and TI-84 Plus.
Lesson Notes: Reading Computer Output
Complete regression analyses are supplied even though students will not learn
to interpret most of the details of these analyses until later. AP students should
learn to pick out the needed computations from the entire analysis.
Typically, software writes equations using variable names instead of using
x’s and y’s. For example, in the minimum wage example on pages 118–119 of
the student book the equation $y = -195.20 + 0.10x$ could be written as
$\text{minimum wage} = -195.20 + 0.10\,\text{year}$. Note how the software packages use the
variable names supplied by the user.
Notes for AP Teachers
Regression Lines on the AP Exam
Be sure that students use ŷ (not y) when writing equations of regression lines.
Points have been deducted for this error in the grading of AP Exams.
Interpreting the Slope
The student book uses two different wordings for interpreting the slope of a
regression line. To illustrate the first wording, consider the situation where the
height of a child is the explanatory variable and her weight is the response variable.
The measurements of her height and weight are taken each month from her third
birthday to her tenth birthday. A case (point on the scatterplot) is a specific month.
The slope, 5, of the regression line can be interpreted as: “For every 1 inch increase
in her height, her weight tended to increase by 5 pounds.” It makes sense to talk
about her height increasing and how her weight tended to increase with it.
On the other hand, in the situation where the adult students in a statistics
class are the cases, the explanatory variable is height, and the response variable
is weight, the slope, 5, of the regression line should be interpreted like this: “A
student who is 1 inch taller than another student tends to be 5 pounds heavier.”
It wouldn’t be quite right to say, “For every 1 inch increase in height, a student’s
weight tends to increase by 5 pounds.” The height of a student isn’t increasing;
nor is the weight. As always, a good interpretation depends on how a case is
defined.
Reading the Computer Output
The AP Exam requires students to be adept at reading and interpreting
computer printouts from popular statistical software packages. This chapter
has many opportunities for students to develop skill in reading computer output.
Statistics in Action with Fathom is a resource for using dynamic data software with
the activities in the student book. As you give students opportunities to become
comfortable using and interpreting statistical output from Fathom, Minitab,
and other software packages, you might use resources on reading statistics
software, such as Chapter 8 of AP Statistics: Preparing for the Advanced Placement
Examination, by James F. Bohan (New York: AMSCO School Publications, 2000).
Modeling Good Answers
Once students have completed P5, show them the answer as a model of what is
expected on the AP Exam. The answer to P5 gives lots of details about the fitted
line and the error. A PDF file containing the questions and the model answers is
available at www.keypress.com/keyonline.
Solutions
Discussion
Note on D3: The interpretation of slope as a rate may be
new to some students. During the discussion, provide
several more examples of rate from students’ experience,
for example, miles per gallon or miles per hour.
D7. a. The plot and a table that includes the
absolute deviations are shown. This line fits best
because it passes through the mean value of y at
each value of x. It also minimizes the variation
among the residuals.
D3. a. cost of purchase versus number of gallons
purchased
b. miles driven versus number of gallons used
c. weight in pounds versus volume of the liquid
D4. Answers will vary a little, but a summarizing line
should come close to the points (1970, 110) and
(2005, 590) for a slope of approximately
$$\frac{590 - 110}{2005 - 1970} \approx 13.7$$
This slope tells us that the CPI tended to increase
at the rate of approximately $13.70 per year across
this time span.
The y-intercept can be found by solving
$$110 = b_0 + 13.7(1970)$$
$$b_0 = -26879$$
The equation of the line is
$$\text{predicted CPI} = -26879 + 13.7\,\text{year}$$
D5. a. Far below the line; the point lies on the line.
b. The fitted line is too low. It lies below all of
the points. Move the line up so that there are both
positive and negative residuals.
c. The fitted line is $\text{mean income} = -8300.6 + 4.2248\,\text{year}$. A literal
interpretation of the intercept
would be that in year 0, the mean income was a
negative $8300. It makes no sense to extrapolate
this far backward in time, so the intercept does not
have a literal interpretation in context.
D6. The arithmetic is fine. The reasoning is an
amusing example of the folly of extreme
extrapolation. The equation is
$$\text{length} = 6460 - 1.375\,\text{year}$$
Table for D7a, with the fitted line ŷ = 1 + x:
x    y    ŷ    |y − ŷ|
0    0    1    1
0    2    1    1
2    2    3    1
2    4    3    1
Σ|y − ŷ| = 4
[Scatterplot: y versus x for the four points, with the fitted line]
b. Another such line is $y = 1.5 + 0.5x$. The
residuals are −1.5, 0.5, −0.5, and 1.5, which sum
in absolute value to 4.
c. Another such line is $y = 0.2 + 1.6x$. The
residuals are −0.2, 1.8, −1.4, and 0.6, which sum
in absolute value to 4.
d. Such a line is $y = 2.5 + 0.5x$. The residuals are
−2.5, −0.5, −1.5, and 0.5, which sum in absolute
value to 5.
e. The original line, $y = 1 + x$, is the least
squares line. The sum of squared residuals is 4.
The sums of squared residuals for the lines in b,
c, and d are, respectively, 5.0, 5.6, and 9.0. The
least squares line minimizes the sum of squared
residuals among these lines.
f. The standard deviation of the residuals for the
least squares line is 1.15. For the lines in b, c, and
d, the standard deviations are, respectively, 1.2910,
1.3466, and 1.2910. The least squares line also
minimizes the standard deviation of the residuals
among these lines.
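The figures quoted in parts a through f of D7 can be reproduced with a short Python sketch. It assumes the four data points are (0, 0), (0, 2), (2, 2), and (2, 4), as read from the table of absolute deviations, and uses the sample standard deviation (divisor n − 1) from Python's standard `statistics` module. The dictionary keys and the function name are made up for the example.

```python
from statistics import stdev

# The four points from D7, read from the table of absolute deviations.
points = [(0, 0), (0, 2), (2, 2), (2, 4)]

# (intercept, slope) for the line in each part of D7.
lines = {
    "a": (1.0, 1.0),   # y = 1 + x, the least squares line
    "b": (1.5, 0.5),   # y = 1.5 + 0.5x
    "c": (0.2, 1.6),   # y = 0.2 + 1.6x
    "d": (2.5, 0.5),   # y = 2.5 + 0.5x
}

def summarize(b0, b1):
    """Return (sum of |residuals|, sum of squared residuals, SD of residuals)."""
    res = [y - (b0 + b1 * x) for x, y in points]
    return sum(abs(e) for e in res), sum(e * e for e in res), stdev(res)

summary = {part: summarize(*coef) for part, coef in lines.items()}
for part, (abs_sum, sq_sum, sd) in summary.items():
    print(f"{part}: sum |res| = {abs_sum:.1f}, SSE = {sq_sum:.1f}, SD = {sd:.4f}")
```

Lines a, b, and c tie on the sum of absolute residuals, which is why that criterion cannot pick a single best line, while the SSE and the standard deviation of the residuals are both smallest for line a.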
D8. a. $\text{income} = -8300.58 + 4.22478\,\text{year}$.
Estimating the SSE from the plot, you get about
$$(-4)^2 + 1^2 + (-1)^2 + (-2)^2 + (-3)^2 + 3^2 + 7^2 + 5^2 + 3^2 + (-8)^2 = 187$$
From the printout, you get 182.340.
b. Minitab gives the equation of the regression line
as well as giving the coefficients in a table. There are
also a few differences in the analysis of variance section,
but basically the same information is presented.
Note on D8: With Fathom, students find the equation
of the least squares line by creating a scatterplot and
choosing Least-Squares Line from the Graph menu. They
find the SSE by choosing Show Squares from the Graph
menu. The output looks like this.
[Fathom scatter plot: GeneralFamily_Practice (mean net income) versus Year, with the least squares line and squares shown]
GeneralFamily_Practice = 4.2248Year − 8300.6; r^2 = 0.91
Sum of squares = 182.3
Practice
P3. The points (40, 91) and (240, 88.3) lie on or near
the regression line, so the slope is about −0.0135.
Each day, the eraser tended to lose around
0.0135 gram of weight.
P4. a. About 0.8.
b. If one student has a hand length 1 inch longer
than another student, we would expect the first
student to have a hand width that is about 0.8 inch
greater than the second student’s.
c. $\text{hand width} = 1.7 + 0.8\,\text{hand length}$
d. It looks like these students measured their hand
spans with their fingers together rather than spread
apart. If these points were removed, the regression
line would move up slightly at the end for smaller
hand lengths and move up a bit more at the end for
longer hand lengths. In fact, the equation becomes
$\text{hand width} = 1.03\,\text{hand length} + 0.65$.
P5. a. The student/faculty ratio is the predictor or
explanatory variable, and the alumni giving rate
(in percent) is the response variable.
b. As you can see from the plot, a run of 5 in the
direction of the x-axis corresponds to a drop of about
10 percentage points in the direction of the y-axis.
[Scatterplot: alumni giving percentage versus student/faculty ratio, with a run of 5 and a drop of 10 marked on the fitted line]
The slope of the fitted line is −2, which is equal to
$$\frac{\text{rise}}{\text{run}} = \frac{-10}{5} = -2$$
c. The y-intercept would mean that a university
that has a student/faculty ratio of 0 (i.e., no
students) would have a giving rate of 55%. This
makes no sense in this context. Extrapolation is
not reasonable in this case.
d. The giving rate would be about 55 − 2(16), or
23%. The error probably is rather large because
the points are not clustered closely about the
line. There is quite a lot of variation around the
line, especially for the universities with smaller
student/faculty ratios. On the average, a prediction
would be off by about 7 percentage points.
e. The point is just above the line, and the
residual is about 1 or 2. The residual for the point
with the highest giving rate is about 56 − 40 = 16.
f. The fitted value is ŷ = 55 − 2(6) = 43. The
residual is 32 − 43 = −11.
g. Because the line has a negative slope, the
largest possible predicted giving rate occurs when
the student/faculty ratio is as small as possible.
The smallest the student/faculty ratio could be
is 0 (if there were no students). This ratio gives a
predicted giving rate of 55 − 2(0) = 55. This is
the largest possible predicted giving rate. The rate
at Piranha State is larger than 55%, so the residual
for Piranha State will be positive.
P6. a. and b.
[Plot: Calories versus Fat (g), with the fitted line]
b. The equation is ŷ = 279.75 + 2.75x. The slope
and y-intercept are found using this table.
Pizza    x     y    x − x̄   y − ȳ   (x − x̄)(y − ȳ)   (x − x̄)²
1        9    305    −2      −5          10              4
2       11    309     0      −1           0              0
3       13    316     2       6          12              4
Sum     33    930     0       0          22              8
Mean    11    310
b1 = 22/8 = 2.75
b0 = 310 − 2.75(11) = 279.75
c. The slope of 2.75 means that if one pizza has
one gram of fat more than another, it tends to have
2.75 more calories. The y-intercept means
5 ounces of pizza with no fat is predicted to have
279.75 calories, which may be reasonable.
d. The point of averages is (11, 310), which
satisfies the equation.
279.75 + 2.75(11) = 310
e. The residuals are
305 − [279.75 + 2.75(9)] = 0.5
309 − [279.75 + 2.75(11)] = −1.0
316 − [279.75 + 2.75(13)] = 0.5
The sum is 0. (When computing, the sum of the
residuals typically won’t be exactly 0 because the
coefficients usually must be rounded.)
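The table method in P6b can be checked with a short script. What follows is a minimal sketch in Python, not something from the student book; the function name least_squares is our own, and the data are the three pizzas from the table.

```python
def least_squares(xs, ys):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Column sums from the table: products of deviations and squared deviations
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

fat = [9, 11, 13]            # grams of fat, from the P6 table
calories = [305, 309, 316]   # calories, from the P6 table

b0, b1 = least_squares(fat, calories)
residuals = [y - (b0 + b1 * x) for x, y in zip(fat, calories)]
print("intercept:", b0, "slope:", b1)
print("residuals:", residuals, "sum:", sum(residuals))
```

Students can change one data value and rerun the script to see how the slope, the intercept, and the residuals respond.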
P7.
[Calculator scatterplot; window [3.5, 8.5, 1, 65, 85, 1]]
The equation of the least squares regression line is
% on time = 87.2 − 2.15 mishandled baggage.
In order, the residuals are 4.08, 2.31, 0.71, 6.51,
1.56, 0.66, 0.11, 0.18, 2.99, and 8.47.
P8. Yes, the regression equation and the mean of
the response variable, 310, are the same as you
computed by hand. The SSE is 0.5² + (−1)² + 0.5² = 1.5
and is found in the Analysis of Variance
table in row “Error,” column “Sum of Squares.”
Exercises
E9. a. I—E; II—C; III—A; IV—D; V—B
b. I—A; II—E; III—B; IV—D; V—C
E10. a. Pizza Hut’s Hand Tossed and Little Caesar’s
Original Round have the fewest calories. They
also have the least fat. The right side of the graph
contains pizzas with the most fat.
b. I. E
II. D
III. A
IV. C
V. B
c.
i. A: The line lies above all the points.
ii. E: The line lies below most of the points.
iii. B: The line lies over most points on the
left and under most points on the right.
iv. D: The line lies under most points on the
left and over most points on the right.
v. C: This line fits best overall, going
through the middle of the points on both
the left and right.
E11. a. The slope is about (67 − 37)/(14 − 2), or
2.5 inches per year.
b. The median height of boys of a given age
tends to increase about 2.5 inches per year from
the ages of 2 to 14.
c. Answers will vary, but using this slope and
the point (3, 39), the equation is approximately
median height = 31.5 + 2.5 age.
d. The y-intercept of 31.5 would mean that
the median length of an average newborn is
31.5 inches. Because this is clearly too long, this
extrapolation is not valid.
E12. a. The calorie prediction for a pizza with 10.5
grams of fat is about 270 calories. The calorie
prediction for a pizza with 15 grams of fat is about
335 calories.
b. The slope of the line is approximately
(335 − 270)/(15 − 10.5) ≈ 14.4. Using the point (15, 335), the
equation is approximately ŷ = 119 + 14.4x.
c. The estimated slope is quite a bit higher than
9. Other ingredients must add calories, which also
increase as the fat content increases.
E13. a. The points in this scatterplot lie perfectly on a
straight line.
[Plot: Reaction Distance (ft) versus Speed (mi/h)]
b. The y-intercept should be 0 because if the car
has 0 speed, the reaction distance is 0.
c. The distance increases at the rate of 11 feet
for every increase of 10 mi/h in speed. Thus,
the slope of the line is 11/10 = 1.1. This means that
for every 1 mi/h increase in speed, the reaction
distance is an additional 1.1 feet.
d. The equation of the line is ŷ = 1.1x, where ŷ is
predicted reaction distance in feet and x is speed
in miles per hour.
e. The predictions are ŷ = 1.1(55) = 60.5 ft
and ŷ = 1.1(75) = 82.5 ft.
f. If the reaction time were longer, the reaction
distances would be greater, with more than
11 feet between each successive value. Thus, the
slope of the line would increase and the predicted
distances would be longer than the corresponding
predicted distances for the model given here. The
equation would be ŷ = 1.47x.
Note: The formula for the reaction distance in feet
for a given speed in miles per hour and reaction
time in seconds is
reaction distance =
[(reaction time (in s))(speed (in mi/h)) / (3600 s/h)] × 5280 ft/mi
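The unit conversion in the note can be checked numerically. This is a minimal sketch in Python; the function name is our own, and the reaction times of 0.75 s and 1 s are inferred from the slopes 1.1 and 1.47 in the solution rather than quoted from the exercise.

```python
FEET_PER_MILE = 5280
SECONDS_PER_HOUR = 3600

def reaction_distance_ft(reaction_time_s, speed_mph):
    """Distance traveled (ft) during the reaction time at a constant speed."""
    return reaction_time_s * speed_mph * FEET_PER_MILE / SECONDS_PER_HOUR

# The slope of the line is the distance covered per 1 mi/h of speed.
slope_075 = reaction_distance_ft(0.75, 1)   # assumed 0.75 s reaction time
slope_1 = reaction_distance_ft(1.0, 1)      # assumed 1 s reaction time
print(round(slope_075, 2), round(slope_1, 2))
print(round(reaction_distance_ft(0.75, 55), 1),
      round(reaction_distance_ft(0.75, 75), 1))
```

Under these assumed reaction times, the slopes round to 1.1 and 1.47, matching parts d and f.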
E14. a. Fuel consumption rate is the explanatory
variable. Operating cost is the response variable.
b. The slope is approximately 2.5. If one plane
uses 1 gallon per hour more than another, its
operating cost tends to be about $2.50 per hour
more. This could be the cost of 1 gallon of fuel.
c. This value means that if an aircraft used no fuel,
the cost would be $470 per hour. While
it doesn’t make sense for an aircraft to use no fuel,
it does make sense that there are costs besides fuel;
the y-intercept would represent the cost per hour of
running an aircraft in addition to fuel costs.
d. The cost per hour for a plane that consumes
1500 gallons per hour of fuel is approximately
470 + 2.50(1500), or $4220.
E15. a. The predictor variable is the arsenic
concentration in the well water. The response
variable is the concentration of arsenic in the
toenails of people who use the well water.
b. There is a moderate positive linear relationship
between arsenic concentration in the toenails of well
water users and the arsenic concentration in the well
water. There is a cluster around well water arsenic
concentrations of 0 to 0.005 parts per million.
c. The largest residual is about 0.3 parts per
million.
d. The concentration of arsenic in this person’s
toenails is about 0.4 parts per million.
e. Seven of the 21 wells are above this standard.
E16. a.
Residual   Pizza
−40.58     Pizza Hut’s Pan
−17.66     Domino’s Deep Dish
−15.95     Pizza Hut’s Hand Tossed
 −1.03     Little Caesar’s Original Round
 14.28     Domino’s Hand Tossed
 26.44     Little Caesar’s Deep Dish
 34.50     Pizza Hut’s Stuffed Crust
b. The residual of −40.58 shows that Pizza Hut’s
Pan Pizza has more than 40 fewer calories than
would be predicted for a pizza with that pizza’s
amount of fat.
c. You can see that both are negative because
both points are below the regression line.
E17. a. x̄ = 2, ȳ = 26
b1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²
   = [(1 − 2)(31 − 26) + (2 − 2)(28 − 26) + (3 − 2)(19 − 26)]
     / [(1 − 2)² + (2 − 2)² + (3 − 2)²]
   = (−5 + 0 − 7) / (1 + 0 + 1) = −6
b0 = ȳ − b1 x̄ = 26 − (−6)(2) = 38
So the equation is ŷ = 38 − 6x, where ŷ is the
predicted number of days with AQI > 100 in
Detroit and x is the number of years after 2000.
b. The number of days with AQI greater than 100
in Detroit tended to decrease by 6 per year.
c. The residuals for each year are calculated in
this table:
x    y    ŷ                   Residual y − ŷ
1    31   38 − 6 · 1 = 32     −1
2    28   38 − 6 · 2 = 26      2
3    19   38 − 6 · 3 = 20     −1
The largest residual is 2, for the year 2002.
d. SSE = (−1)² + 2² + (−1)² = 6
e. −1 + 2 − 1 = 0
f. The equation for this line is y = 40 − 6x.
(Since the residual for this point was 2, simply
shift the line up two units.) The SSE for this
line would be (−3)² + 0² + (−3)² = 18. This is
larger than the SSE for the least squares line, so
according to the least squares criterion, the first
line was better. Students should agree that the least
squares line fits better because it passes through
the middle of the set of points. The line through
the point for 2002 is too high.
g. This equation would be y = 37 − 6x. The
fitted value for 2002 would be 37 − 6 · 2 = 25.
The only nonzero residual would be for this point,
so the SSE would be 3² = 9.
h. As the plot shows, the least squares line is a
better indicator of the trend of all the points than
either of the others. For both other lines, all points
are on or to the same side of the line.
[Plot: Number of Days AQI > 100 versus Years Since 2000, with the three lines]
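The comparison of the three lines in E17 parts d, f, and g can be reproduced with a few lines of code. This is a minimal sketch in Python; the function name sse and the dictionary labels are our own, and the data are the three (year, days) pairs from the solution.

```python
years = [1, 2, 3]      # years after 2000
days = [31, 28, 19]    # days with AQI > 100

def sse(b0, b1, xs, ys):
    """Sum of squared residuals for the line y = b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

lines = {
    "least squares, 38 - 6x": (38, -6),
    "through the 2002 point, 40 - 6x": (40, -6),
    "through the 2001 and 2003 points, 37 - 6x": (37, -6),
}
results = {name: sse(b0, b1, years, days) for name, (b0, b1) in lines.items()}
for name, value in results.items():
    print(name, "SSE =", value)
```

The least squares line has the smallest SSE of the three, as the least squares criterion requires.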
E18. a. x̄ = 13.1, ȳ = 307.143
x (Fat)   y (Calories)   x − x̄   y − ȳ      (x − x̄)(y − ȳ)   (x − x̄)²
 9.0      230            −4.1    −77.143     316.2863         16.81
19.5      385             6.4     77.857     498.2848         40.96
14.0      280             0.9    −27.143     −24.4287          0.81
12.0      305            −1.1     −2.143       2.3573          1.21
 8.0      230            −5.1    −77.143     393.4299         26.01
14.2      350             1.1     42.857      47.1427          1.21
15.0      370             1.9     62.857     119.4283          3.61
b1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² = 1352.5/90.62 ≈ 14.93
So the equation is calories = 111.56 + 14.93 fat.
The estimate in E12 was very close to this
equation for the slope but not for the y-intercept.
b. From the scatterplot, Pizza Hut’s Pan Pizza has
a residual of about −40. From this point alone, the
SSE must be at least 40² = 1600. Thus, 4307 is the
only possible SSE.
E19. a. The equation is height = 31.57 + 2.43 age.
Here is the plot.
[Calculator scatterplot; window [1, 15, 1, 30, 7, 5]]
b. The residual is positive because the point is
above the line. The actual residual is 59.5 − 58.3 = 1.20,
which is positive. (The residual is 1.21 with
no rounding.)
c. The mean age is 8 years, and the mean height
is 51 inches. Substitute into the regression equation
to see that this point is on the regression line:
51 ≟ 31.57 + 2.43 · 8
51 ≈ 51.01
Note on E19c: This small discrepancy is due to rounding
error. See the note on page 96.
d. This equation is very close to that for boys
from E11.
Note on E20: For these proofs to make sense, students
must realize the following about working with sums:
• For a constant a, Σa = n · a, since that means
adding up n instances of a.
• The mean of a set of numbers is a constant, so
Σx̄ = n · x̄, where n · x̄, in turn, is equal to the
sum of the individual x values xi because
n · x̄ = n · (Σxi / n) = Σxi.
E20. a. A horizontal line has an equation of the form
y = a. The residual for a point (xi, yi) would be
yi − a. Adding these up for n points, you get
Σ(yi − a) = Σyi − Σa = Σyi − n · a. Assume
this sum is zero.
Σyi − n · a = 0
Σyi = n · a
Σyi / n = n · a / n
ȳ = a
So, a = ȳ. The horizontal line has equation
y = ȳ, so it passes through (x̄, ȳ).
Conversely, if we start with the assumption
that the horizontal line passes through the point
(x̄, ȳ), the equation of the line is y = ȳ. The residual
for a point (xi, yi) would be yi − ȳ. Summing these
residuals we get
Σ(yi − ȳ) = Σyi − Σȳ = Σyi − n · ȳ
          = Σyi − n · (Σyi / n) = Σyi − Σyi = 0.
b. Let the line be y = a + bx. For a point (xi, yi),
the predicted y-value is a + b · xi. So the residual
is yi − a − b · xi. Assume that the sum of these
residuals is zero.
Σ(yi − a − b · xi) = 0
Σyi − n · a − b · Σxi = 0
Σyi = n · a + b · Σxi
Σyi / n = n · a / n + b · Σxi / n
ȳ = a + b · x̄
This means that the point (x̄, ȳ) must satisfy the
equation of the line if the residuals are to sum
to zero.
Conversely, if we start with the assumption
that the line passes through (x̄, ȳ), that means the
equation ȳ = a + b · x̄ must be true. So a = ȳ − b · x̄.
The residual for the point (xi, yi) would be
yi − (ȳ − b · x̄) − b · xi = yi − ȳ + b · x̄ − b · xi.
The sum of these residuals then is
Σyi − n · ȳ + n · b · x̄ − b · Σxi
= n · ȳ − n · ȳ + n · b · x̄ − n · b · x̄ = 0
c. As shown in parts a and b, any line that passes
through the point (x̄, ȳ) makes the sum of the
residuals equal to zero.
E21. a. height = 31.5989 + 2.47418 age. Answers will
vary according to students’ original estimates. It
should be fairly close.
b. The SSE is 2.4. This does seem reasonable
from the scatterplot. The residuals are all quite
small.
E22. a. percent alumni giving = 54.979 − 1.9455 student/faculty ratio
b. The largest residual is 22.31. The student/
faculty ratio for that university is 13.
c. The fit should be 54.979 − 1.9455 · 13 = 29.6875.
The table has the fit calculated correctly.
The actual y-value (also given in the table) is 52.
The residual, 52 − 29.69 = 22.31, is calculated
correctly in the table.
d. The SSE is 3578.5. The value is large because
there are many points and they are scattered
widely around the regression line.
E23. height = 31.57 + 2.43 age
Age   Height   Predicted Height             Residual
2     35.1     31.57 + 2.43 · 2 = 36.43     35.1 − 36.43 = −1.33
8     51.7     31.57 + 2.43 · 8 = 51.01     51.7 − 51.01 = 0.69
14    63.6     31.57 + 2.43 · 14 = 65.59    63.6 − 65.59 = −1.99
With no rounding, the residuals are −1.325, 0.7,
and −1.975, respectively.
The first point is below the regression line,
the second is above the line, and the third
is below the line. This pattern of residuals
suggests a curve in the trend of the data. In the
scatterplot of all the data, the points at the far
left are below the regression line, the points in
the middle region lie mostly above the line, and
the points at the right lie below the line. This
suggests that a line may not be the best model
for these data. You will learn more about this in
Section 3.4.
E24. a. Yes; the ratio of price to gallons is the price per
gallon, which is the same for all four purchasers.
Another way to look at this is to observe that
the price, y, increases at a constant rate for each
additional gallon of gas purchased.
b. The relationship between average speed x,
time y, and distance is speed · time = distance, or
x · y = 80 in this scenario. Thus, plotting
y (time) against x (speed) means plotting y = 80/x,
which will be a curve (it’s a rotated hyperbola).
Plotting y* = 1/y against x results in a straight line
because
80 = xy
1/y = (1/80) x
y* = (1/80) x
which is a linear equation.
E25. a. In the plot of calories versus fat, the association
shows a positive trend that is moderately strong.
Even though a few points lie relatively far from the
pattern, fat content could be used as a reasonably
good predictor of calories. The equation of the
regression line is ŷ = 194.75 + 10.05x, where ŷ
is the predicted number of calories and x is the
number of grams of fat. The slope means that if
one pizza has 1 more gram of fat than another, it
tends to have 10.05 additional calories.
[Plot: Calories versus Fat (g)]
b. In the plot of fat versus cost, there is a very
weak positive association between fat and cost.
The equation of the regression line is ŷ = 10.66 + 2.41x,
where ŷ is the predicted number of grams
of fat and x is the cost. The slope means that if one
pizza costs $1 more than another, it tends to have
2.41 more grams of fat. (You will be able to check
to see if this association is “real” in Chapter 11.)
[Plot: Fat (g) versus Cost ($)]
c. In the plot there appears to
be no linear association between calories and cost.
[Plot: Calories versus Cost ($)]
d. In the analysis of the pizza data, calories has a
moderately strong positive association with fat—
the two tend to rise or fall together, which makes
sense because fat has a lot of calories. Fat has a
weak positive association with cost. There appears
to be no association between cost and calories.
Note on E26: This open-ended investigation requires
a computer. This exercise involves issues that might
be sensitive for some students. You have the chance to
emphasize strongly that “association does not imply
causation.”
E26. The four scatterplots are shown here. There
appears to be little or no association between the
percentage living below the poverty line and the
percentage living in metropolitan areas. Likewise,
there appears to be little association between the
poverty rate and the percentage of whites. The
two outliers in this plot are regions with low
percentages of whites, namely, Washington, D.C.
(high poverty rate) and Hawaii (low poverty rate).
The poverty rate is negatively associated with
the percentage of high school graduates; as the
latter goes up, the percentage living in poverty
generally goes down. Finally, poverty appears to be
only weakly associated with percentage of families
headed by a single parent (with Washington, D.C.,
again as an outlier). On the surface, it looks as
though increasing graduation rates would have the
largest effect on decreasing the poverty rate. Keep
in mind, however, that the problem of poverty
is much more complex than that, and many
other variables are lurking in the background.
Association is not the same as cause and effect.
Simply increasing high school graduation rates,
although it might be a good thing to do, will not
automatically elevate all of those living below the
poverty line to a better economic condition. It is
even possible that a lower poverty rate may cause
a higher high school graduation rate, or that a
third “lurking” variable is influencing both of
these variables.
[Plots: Poverty (%) versus Metropolitan Residence (%); Poverty (%) versus
White (%); Poverty (%) versus Graduates (%); Poverty (%) versus Single
Parent (%)]
[Blackline masters for Section 3.2: Alumni Giving Percentage Versus
Student/Faculty Ratio; Operating Cost ($/h) Versus Fuel Consumption
Rate (gal/h)]
3.3 Correlation: The Strength of a Linear Trend
Objectives
• to estimate correlation from a scatterplot
• to understand that correlation should not be computed from data that are not
linear and that a high correlation does not mean that the data are linear
• to understand r as the average product of the z-scores
• to use the relationship between r and the slope of the regression line
• to be aware of possible lurking variables and not assume that correlation
implies causation
• to interpret r² as the proportion of the variation in the values of y that can be
explained by x
The section “Regression Toward the Mean” is optional. The objectives for that
section are
• to visualize the regression line as the line of the means of the values of y for
fixed values of x
• to recognize the regression effect
Important Terms and Concepts
• correlation coefficient
• lurking variable
• r², the coefficient of determination
• average product of z-scores
• correlation versus causation
Optional Terms and Concepts
• regression line as the line of means
• regression toward the mean (the regression effect)
Alignment with the AP Statistics Topic Outline
This section aligns with the listed items of the AP Statistics Topic Outline
as described here. The actual text of the AP Statistics Topic Outline and the
complete correlation begin on page xxi.
ID2 Students examine bivariate data for correlation and linearity.
Lesson Planning
Class Time
Two days if the optional section (“Regression Toward the Mean”) is not covered.
Three days if the optional section is covered.
Materials
For Activity 3.3a, a measuring tape, a yardstick, or a meterstick for each pair
of students
Suggested Assignments
Classwork
Essential: D9–13, D17, D18, D21; P9–19
Recommended: Activity 3.3a; D14, D20, D23
Optional: D15, D16, D19, D22, D24, D25; P20, P21
Chapter 3 Quiz 1
Homework
Essential: E27, E31, E33, E34, E37, E40
Recommended: E28, E29, E36, E39
Optional: E30, E32, E35, E38, E41, E42
Lesson Notes: Estimating the Correlation
Once a linear trend is established (positive or negative) the strength of the
association can be measured using the correlation coefficient. The correlation is a
useful measure of strength only if the data are linear—clustered either loosely or
tightly about a line.
Types of Correlation
Francis Galton is credited with being the first to understand the idea of
correlation (1888), which he called “co-relation.” The correlation coefficient used
in the text is called Pearson’s correlation coefficient, after Karl Pearson, who
introduced it in 1896. Other correlations not covered in this text include the
rank correlation formulas of Charles Edward Spearman and Maurice G. Kendall.
These give the correlation for paired observations that are ranks, such as the
ranking of ten gymnasts by Judge A paired with the ranking of the same ten
gymnasts by Judge B. If there are no ties in the ranks, Spearman’s formula gives
the same value as Pearson’s.
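The last claim is easy to verify numerically. This is a minimal sketch in Python; the two judges' rankings are made up for illustration, the function names are our own, and the rank formula used is the usual no-ties form of Spearman's coefficient, 1 − 6Σd²/(n(n² − 1)).

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation computed from sums of deviations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def spearman_no_ties(rank_a, rank_b):
    """Spearman's rank correlation; valid only when no ranks are tied."""
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical rankings of ten gymnasts by two judges (no ties)
judge_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
judge_b = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]

r_pearson = pearson_r(judge_a, judge_b)
r_spearman = spearman_no_ties(judge_a, judge_b)
print(r_pearson, r_spearman)
```

Introducing a tie (for example, two gymnasts sharing rank 2.5) breaks the shortcut formula, so the two values generally no longer agree.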
What Happened to Normality?
The guiding features of analyzing data from Chapter 2 were
plot → shape → center → spread
In this section, they are interpreted for bivariate data as
scatterplot → shape→ trend → strength
In Chapter 2, students were told that “normal” was the “ideal” shape—if a
distribution is approximately normal, the mean and standard deviation generally
are useful measures of center and variability. In this section, students are told
that the ideal shape for a scatterplot is elliptical. With an elliptically shaped cloud
of points, the regression line and the correlation are generally useful measures of
center and variability. What happened to normality?
Look at Display 3.50 (page 153 of the student book) of a younger sister’s height
plotted against her older sister’s height. The two “marginal” distributions, the
separate distributions of the heights of the older sisters and the heights of the
younger sisters, are approximately normal. The scatterplot of the joint behavior
of the two variables shows an elliptical cloud that is more dense toward the
middle than toward the edges. A three-dimensional plot of these data could
show the density of data points everywhere in the x-y plane. Such a plot could be
modeled by a bivariate normal distribution similar to the one pictured below. (In
this case, the distributions of x and y both have mean 0 and standard deviation
1; the correlation is 0.6.) This is called the joint distribution of x and y. One
characteristic of bivariate normal distributions that will be important in
Chapter 11: Inference for Regression, is that if you slice the distribution at any
fixed value of x, the “conditional” distribution of the y’s corresponding to that x
is normal. A three-dimensional diagram of this is shown in Display 11.4.
Activity 3.3a: Was Leonardo Correct?
This activity is highly recommended. To orchestrate the sharing of data, you may
wish to provide the blackline master provided at the end of this section on page 124
and then have each student read his or her set of four values. Or you can make
an overhead transparency and have students come up and write in their
measurements.
1. Student measurements will vary. Sample results from one group of children
and teens are given in the following answers. These measurements are in
centimeters.
2. If Leonardo is correct, the points should lie near the lines:
arm span = height
kneeling height = (3/4) height
hand length = (1/9) height
Looking at the plots, these rules appear to be approximately correct. The lines
are the least squares regression lines computed in step 3.
[Plots: Arm Span (cm) versus Height (cm); Kneeling Height (cm) versus
Height (cm); Hand Length (cm) versus Height (cm)]
3. The least squares regression equation for predicting the arm span from the
height is arm span 5.81 1.03 height; r 0.99.
The least squares regression equation for predicting the kneeling height
from the height is kneeling height 2.19 0.73 height; r 0.989.
The least squares regression equation for predicting the hand length from
the height is hand length 2.97 0.12 height; r 0.96. In each case, the
correlation is quite high, at least 0.96.
On many graphing calculators, you must perform the regression in order to
calculate the correlation. See Calculator Note 3H in Calculator Notes for the Texas
Instruments TI-83 Plus and TI-84 Plus.
4. For the first plot, the slope is 1.03. This means that if one student is 1 cm
taller than another, his or her arm span tends to be 1.03 cm longer. Leonardo
predicted a difference of 1 cm.
For the second plot, the slope is 0.73. This means that if one student is
1 cm taller than another, his or her kneeling height tends to be 0.73 cm taller.
Leonardo predicted a difference of 0.75 cm.
For the third plot, the slope is 0.12. This means that if one student is
1 cm taller than another, his or her hand length tends to be 0.12 cm longer.
Leonardo predicted a difference of 1/9, or 0.11 cm.
In each case, the points are packed tightly about the regression line, so
there is a strong correlation.
5. Yes. The slopes are about what he predicted, the y-intercepts are close to 0 in
each case, and the correlations are strong.
Lesson Notes: A Formula for the Correlation, r
You can use the spreadsheet capability of the TI-83 Plus or TI-84 Plus to speed
the computation for the formula for the correlation on page 142 of the student
book. See Calculator Note 3H in Calculator Notes for the Texas Instruments
TI-83 Plus and TI-84 Plus.
The conversion of the x-y plane to the zx-zy plane places (x̄, ȳ) at the new
origin. The relative position of the points does not change; but their coordinates
become their respective z-scores. This transformation of the plane facilitates
a visual understanding of correlation as “the average of the products of the
z-scores.” Depending on the backgrounds of your students, this is a great
opportunity to discuss correlation as an application of transformations.
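The "average product of the z-scores" idea can be made concrete with a short computation. This is a minimal sketch in Python, assuming the student book's formula divides the sum of the products by n − 1 and uses sample standard deviations; the function name is our own, and the data are the three Detroit air-quality points from E17 in Section 3.2.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r as the sum of products of z-scores divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)   # sample standard deviations
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

years = [1, 2, 3]      # years after 2000
days = [31, 28, 19]    # days with AQI > 100

r = correlation(years, days)
print(round(r, 4))
```

Points in the second and fourth quadrants of the zx-zy plane contribute negative products, which is why r is negative for this downward trend.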
Lesson Notes: Correlation Does Not Imply Causation
Students will learn in Chapter 4 that to establish that one variable “causes”
another, you must perform a randomized comparative experiment. Using
observational data, such as the fact that smokers get more lung cancer than
nonsmokers, does not establish that smoking causes lung cancer because other
factors are not controlled, by randomization or otherwise.
For example, to establish that breathing cigarette smoke causes tumors in rats,
you would have to randomly assign two treatments—exposed to no smoke and
exposed to smoke—to different rats and see whether the rats exposed to smoke
develop significantly more tumors. If they do and the rats were otherwise treated
alike, you can say that cigarette smoke causes tumors in rats.
Causation is a tricky issue at this stage, not only because students haven’t yet
studied experimental design but also because of the various meanings of the word
“cause.” In some cases, known physical reasons do establish cause and effect, such
as fire “causing” smoke. Scientists did not do a randomized comparative experiment
to establish that fire “causes” smoke. In other cases, we can’t determine a physical
link but rely on statistical evidence, as we do when we say that receiving love and
praise during childhood “causes” good behavior. In the best of all worlds, causation
is established using both kinds of evidence: a probable physical link along with
statistical evidence, preferably in the form of a randomized comparative experiment.
For example, both kinds of evidence have been used to establish that cigarette
smoking causes cancer. Often the statistical evidence leads scientists to look for a
physical link. Once that link is established, we do not need more experiments. The
issues of causation, lurking variables, and confounding will reappear in Section 4.3.
Lesson Notes: Interpreting r 2
In the regression examples in this textbook, there is one predictor variable, x,
and one response variable, y. But many situations have more than one predictor
variable. For example, in predicting the probability an entering student will
graduate from a given college, the college may want to include such predictor
variables as high school GPA, SAT scores, and number of hours worked per
week in the regression. One way to account for many predictor variables in a
prediction model is to use multiple regression.
The formula for r does not generalize to regression where there is more than
one predictor variable, but the idea of r² as the proportion of variance accounted
for by the regression does.
One of the reasons that statistical software gives the value of r², or R-squared,
rather than the correlation r is that R-squared does generalize to multiple
regression. The values for SSE and SST can still be computed, the former as the
sum of the squared residuals from the fitted plane, Σ(y − ŷ)², and the latter still as
the sum of the squares of the differences of the response from its mean, Σ(y − ȳ)².
If you have Fathom, you can find the SST (SS Total) and SSE (SS Residual) by
using multiple regression in the Model menu to find the regression equation.
What Is R-sq(adjusted)?
Statistical software gives not only the value of r² (sometimes seen as R-squared
or R-sq) but also a value for “R-squared(adjusted),” which is slightly smaller.
Although introductory students should ignore this for now, the adjusted value is
actually the better value to use if your data are from a sample and you would like
an approximately unbiased estimate of the r² for the entire population.
Just as r² looks at the proportional reduction in the total sum of squared
errors, SST, by comparing SST to SSE, the adjusted R-squared, r_a², looks at the
proportional reduction in total mean squared error. Mean squared errors are
sums of squared errors divided by the appropriate degrees of freedom, so that in
simple linear regression
MST = SST/(n − 1)   and   MSE = SSE/(n − 2)
Note also that MST = s_y², the variance of the y’s, and MSE = s², the variance of
the residuals. Finally,
r_a² = (MST − MSE)/MST = (s_y² − s²)/s_y²
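The two measures can be compared side by side on a small data set. This is a minimal sketch in Python for simple linear regression only; the function name is our own, and the data are the three pizzas from P6 in Section 3.2, chosen because the sums are easy to check by hand (SST = 62, SSE = 1.5).

```python
def r_squared_and_adjusted(xs, ys):
    """Return (r^2, adjusted r^2) for a simple linear regression of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - y_bar) ** 2 for y in ys)
    mst = sst / (n - 1)   # total mean square, the variance of the y's
    mse = sse / (n - 2)   # error mean square, the variance of the residuals
    return 1 - sse / sst, (mst - mse) / mst

fat = [9, 11, 13]
calories = [305, 309, 316]
r2, r2_adj = r_squared_and_adjusted(fat, calories)
print(round(r2, 4), round(r2_adj, 4))
```

With only three points the gap between the two values is exaggerated; with larger samples the adjusted value is usually only slightly smaller than r².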
Notes for AP Teachers
Interpreting r2
Even though r² is not currently in the AP Statistics Topic Outline, it has appeared
on some AP Exams. Students should know that the generic interpretation of r²
is: <r²> of the variation in <response variable> can be explained by the linear
relationship with <explanatory variable>.
Virtually all technology tools return r². Many software packages return only
r², rather than r.
Relationship Between r and the Slope
The AP Exam may expect students to calculate and interpret the linear
regression equation for two variables if they are given only x̄, ȳ, s_x, s_y, and r.
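A calculation of this kind relies on the standard relationships b1 = r · s_y/s_x and b0 = ȳ − b1 · x̄. This is a minimal sketch in Python; the function name is our own, and the summary statistics are computed from the three Detroit air-quality points in E17 of Section 3.2 (x̄ = 2, ȳ = 26, s_x = 1, s_y = √39).

```python
from math import sqrt

def line_from_summary(x_bar, y_bar, s_x, s_y, r):
    """Return (b0, b1) of the least squares line from summary statistics."""
    b1 = r * s_y / s_x        # slope: r times the ratio of standard deviations
    b0 = y_bar - b1 * x_bar   # the line passes through (x_bar, y_bar)
    return b0, b1

s_y = sqrt(39)
r = -12 / (2 * s_y)   # correlation for the E17 data, about -0.9608
b0, b1 = line_from_summary(x_bar=2, y_bar=26, s_x=1, s_y=s_y, r=r)
print(round(b0, 2), round(b1, 2))
```

The result agrees with the equation ŷ = 38 − 6x found from the raw data in E17.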
Modeling Good Answers
Once students have completed E39, show them the model answer as an
example of what is expected on the AP Statistics Exam. Students will see how
the correlation coefficient is interpreted in context. A PDF file containing the
question and the model answer is available at www.keypress.com/keyonline.
116
Section 3.3
Statistics in Action Instructor’s Guide, Volume 1
© 2008 Key Curriculum Press
Solutions
Discussion
D9. a. 0.783 b. –0.783 c. 0.999
d. 0.906
D10. a. Relationships II and III are positive.
Relationship IV is negative because the more
socks in a bag, the cheaper they tend to be per
sock. Relationship II is the strongest, with $r = 1$.
Relationship I is the weakest; r is almost 0.
b. For I, there should be no relationship between
these two variables. That is, for any day of the
month picked, range of haircut costs will be about
the same. This range will depend on local prices,
and knowing the day of the month a person was
born tells you nothing further about the cost of
a haircut. Thus, there is virtually no correlation
between these two variables.
For II, y does vary with x, but for a fixed x, there
will be no variation in y. Each y will be exactly equal
to px. The correlation is 1.
For III, generally, the more socks in a bag, the
higher the price of the bag. There will be some
variation in price because some brands of socks
are more expensive than others, so there would
be a strong correlation, but not a perfect one.
Knowing the number of socks in the package does
assist you in predicting a price range for the bag.
For IV, there will be some variation. Generally,
the more socks in a bag, the cheaper the cost per
sock. Knowing how many socks are in a bag can
assist you in your prediction of the price range
per sock.
All else being equal, the larger the variation in y
at each value of x, the lower the correlation.
D11. a. For America West,
$$z_x = \frac{x - \bar{x}}{s_x} = \frac{4.36 - 5.739}{1.3977} = -0.98662$$
$$z_y = \frac{y - \bar{y}}{s_y} = \frac{81.9 - 74.85}{5.0535} = 1.39507$$
$$z_x \cdot z_y = (-0.98662)(1.39507) = -1.37640$$
b. Delta, in the lower right corner of Quadrant
IV, has the largest product, $z_x \cdot z_y$, in absolute
value.
c. JetBlue, near the point $(\bar{x}, \bar{y})$, has the smallest
value of $|z_x \cdot z_y|$. A point will make a small
contribution if it is either near $(\bar{x}, \bar{y})$ or near one of
the lines $x = \bar{x}$ and $y = \bar{y}$.
D12. a. The correlation measures the strength of
a linear association by measuring how tightly
packed the data points are about a straight line. Its
size is affected most strikingly by points far away
from $(\bar{x}, \bar{y})$ and points not near the new coordinate
axes, $x = \bar{x}$ and $y = \bar{y}$.
b. Correlation is a unitless quantity because it
is the average product of z-scores, which have no
units. For example, the unit for mishandled bags
is bags per thousand passengers. When computing
the z-scores, the units cancel out. For America
West, this would be
$$z_x = \frac{x - \bar{x}}{s_x} = \frac{4.36 \text{ bags/thousand} - 5.739 \text{ bags/thousand}}{1.3977 \text{ bags/thousand}} = -0.98662$$
c. No. Because it is based on a symmetric
calculation, $z_x \cdot z_y = z_y \cdot z_x$, r does not depend on
which variable is chosen as x and which as y.
D13. a. For well-behaved data, the correlation will
be positive because most of the products $z_x \cdot z_y$
are positive. It is not necessarily the case that r is
positive, however. The scatterplot here has more
points in Quadrants I and III, but the correlation
is negative, $r = -0.237$.
[Scatterplot of y versus x, with reference lines at $\bar{x} = 0$ and $\bar{y} = 0$]
b. For well-behaved data, the correlation will be
negative because most of the products $z_x \cdot z_y$ are
negative. As in part a, it is not necessarily the case
that r is negative.
c. For well-behaved data, the correlation will be
near 0.
D14. When the correlation is small, the error in
prediction will be larger than if the correlation
were larger. A larger correlation (near 1 or –1)
means the points are generally nearer the line, and
predictions made using the line will be relatively
close to the observed values.
Even though the error in prediction is large
when the correlation is small, having a regression
line is better than having no line as long as there is
a definite linear trend in the data. You could give
an example from the quiz scores in Display 3.45.
A student who scored 10 on the second quiz
would be predicted to get about 15 on the
third quiz, but we wouldn’t be surprised if this
prediction was off by 10 points or so. A student
who scored 25 on the first quiz would be expected
to get about 23 on the second quiz, again with
about the same estimated error. These lead to two
different predictions for the two students: 15 ± 10
is different from 23 ± 10 even though there
is some overlap in what we would think of as
reasonable bounds on the predictions.
Note on D14: Many students will say that the prediction
from the regression equation is better than no prediction
at all. To lead students toward the idea of $r^2$ introduced on
page 148, ask them what they mean by “no prediction.”
That is, what would their estimate be if they had full
information about the scores on Quiz 2 and Quiz 3,
but no information about the relationship between
the two?
D15. a. Scenarios could be situations in which
there is not a definite linear trend in the data
( y appears to be unrelated to x), along with large
variation in the y-values. Age versus month of
birth for a large group of adults might be an
example.
b. Scenarios could be situations in which
there is a definite linear trend, but where there is
much variation in the y-values at each level of x.
SAT math score versus score on the first college
calculus test for a group of college students could
be an example. Family grocery bill per month
versus number of people in the family could be
another.
c. Scenarios could be situations in which the
data points fit closely to a line but the cloud of
points has considerable curvature, so as to make
the straight line a poor measure of the center, or a
poor description of the nature of the association
between the variables. Height versus age for
trees could be an example, because height levels
off for older trees. You have seen other curved
relationships earlier in this chapter.
d. Scenarios could be situations in which the
data points fit closely to a line and the line has a
slope that is not close to zero. Height versus age for
growing children could be an example, as could
height versus shoe size for adults.
D16. The growth rate will probably begin to slow down
at some point, if it hasn’t already. New blogs will
continue to appear, but probably not at the same
rate they did initially.
D17. An estimate of the slope is
$$b_1 = r \cdot \frac{s_y}{s_x} = 0.7 \cdot \frac{113}{115} \approx 0.69$$
To find the y-intercept, use the fact that the point
(520, 508) is on the regression line:
$$y = \text{slope} \cdot x + \text{y-intercept}$$
$$508 = 0.69(520) + \text{y-intercept}$$
$$\text{y-intercept} = 149.2$$
The equation is
$$\text{critical reading} = 149.2 + 0.69 \cdot \text{math}$$
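The same two steps (slope from $r$, $s_x$, and $s_y$, then intercept from the point of means) can be sketched in a few lines of code. This is only an illustration using the D17 values; the function name and the optional rounding argument are assumptions made for this sketch, with the rounding added to mirror the solution above, which rounds the slope to 0.69 before finding the intercept.

```python
def line_from_summary(x_bar, y_bar, s_x, s_y, r, slope_digits=None):
    """Return (slope, intercept) of the least squares line from summary statistics."""
    slope = r * s_y / s_x
    if slope_digits is not None:
        # Round the slope first, as a hand calculation typically does.
        slope = round(slope, slope_digits)
    # The regression line always passes through the point of means.
    intercept = y_bar - slope * x_bar
    return slope, intercept


# D17: x = math score, y = critical reading score.
slope, intercept = line_from_summary(520, 508, 115, 113, 0.7, slope_digits=2)
print(f"critical reading = {intercept:.1f} + {slope:.2f} * math")

# Carrying full precision changes the intercept slightly.
slope_full, intercept_full = line_from_summary(520, 508, 115, 113, 0.7)
print(f"unrounded: slope = {slope_full:.4f}, intercept = {intercept_full:.1f}")
```

Students who keep all the digits in their calculators will get an intercept near 150.3 rather than 149.2; both answers come from the same method and differ only in when the rounding happens.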
D18. Having planes arrive late could result in more bags
not making their connecting flight and so end
up mishandled. This would result in a negative
correlation between percentage of on-time arrivals
and number of mishandled bags. It is also possible
that there is no direct link between the variables;
both could be a result of the lurking variable of
whether or not the airline is generally well run.
D19. At first glance, it would seem that the more highly
rated the university, the lower its acceptance rate.
One possible explanation is that few students apply
to the most selective of these universities unless
they are pretty sure they will be admitted. You
would not say that one variable causes the other,
rather that they are both associated with the most
savvy students. Another possible lurking variable
is that the very best students apply to more colleges
because they are shopping for the college that will
offer them the best financial aid deal. As a result,
the more highly rated colleges must accept a high
percentage of the students that apply because they
know the students have applied many places and
are likely to go elsewhere even if they are admitted.
D20. a. Some might say that a high percentage of
males “causes” higher salaries because men are
more favorably treated than women when it comes
to salary. On the other hand, the lurking variable
may be how quantitative the subject is. There tend
to be far fewer graduates in quantitative subjects,
and business and industry want to hire them also;
therefore, these faculty positions are harder to fill.
This competition for fewer graduates may be the
reason the people in quantitative subjects (who
tend to be male) are more highly paid.
b. Some people might say that a high number of
hate groups “causes” a large number of people on
death row because members of hate groups tend to
commit murders. On the other hand, an obvious
lurking variable here is the size of the state’s
population. Larger states have more of everything.
In fact, in percentage terms, hate crimes account
for a small proportion of those on death row. To
determine whether more hate groups in a state
results in more people on death row because of
hate crimes, you might look at a scatterplot of the
percentage of people on death row because of hate
crimes against the percentage of people in the
state who belong to hate groups.
c. Some people might say that a high rate of gun
ownership “causes” lower rates of violent crime
by making criminals reluctant to commit violent
crime for fear the person will protect himself
or herself with a gun. On the other hand, the
explanation may be the lurking variable of how
rural the state is. Rural areas have higher rates of
gun ownership, presumably for hunting, and have
lower reported crime.
D21. a. The best prediction would be the mean IQ
of 101.
b. approximately 1 point
c. $0.997(54) + 45 \approx 98.8$. You should not have
much faith in this prediction because the variation
around the line is so great.
d. Approximately 2% of the variation is
accounted for by taking head circumference into
account. The regression equation is not of much
practical help in making the prediction.
e. Answers will vary. It seems likely that the
regression line would become less steep and the
correlation would decrease, indicating little to no
correlation between head circumference and IQ.
D22. Rewrite the ratio as follows:
$$r^2 = \frac{\mathrm{SST} - \mathrm{SSE}}{\mathrm{SST}} = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}$$
Because SSE is less than or equal to SST and both
are positive, the ratio $\frac{\mathrm{SSE}}{\mathrm{SST}}$ will always be between
0 and 1 inclusive. Thus, $r^2$ will always be between
0 and 1 inclusive. So $r$ must always be between
–1 and 1 inclusive.
D23. The number of “heating units” used by a house
varies from year to year. A good predictor of
how many units that will be is temperature. The
investigator is saying that temperature accounts
for 70% ($r^2 = 0.7$) of the year-to-year variation.
D24. The regression line is a “line of means” because it
attempts to go through the mean value of the y’s
at each fixed value of x. That is, the regression line
estimates the mean value of y for each fixed value
of x.
D25. Two older sisters 1 inch apart in height will, on
the average, have two younger sisters who are
only 0.337 inches apart. For the line $y = x$, the
interpretation would be that if one older sister is
1 inch taller than another, her younger sister also
tends to be 1 inch taller than the younger sister of
the other older sister. The latter interpretation is
what most people would expect. But the element
of chance involved results in the regression effect
that the younger sister is not as tall as expected.
Practice
P9. a. 0.5 b. 0.5
c. 0.95
d. 0
e. 0.95
P10. a. Guesses will vary but should be positive and
close to 1.
b. $r = 0.908$
P11. This problem isn’t as much work as it looks like.
For the first four tables, the means are all 0 and the
standard deviations are all 1, so they have already
been standardized. You can find the average product
in your head. Table e has the same correlation as
table a, table f has the same as table c, table g has the
same as table b, and table h has the same as table c.
a. 1 b. 0.5 c. 0.5 d. 1
e. 1 f. 0.5 g. 0.5 h. 0.5
P12. In the table below, all but one of the products are
positive, resulting in a positive correlation. The
first and last pizza contribute a large amount,
making the correlation quite strong.
$$r = \frac{5.44700}{6} = 0.9078$$
This is about the same value as in P10 and differs
only because of rounding.
         Original Units              Standard Units (z-scores)     Product
         Fat (g), x    Calories, y   z_x           z_y             z_x · z_y
         19.5          385           1.64681       1.21863         2.00685
         15            370           0.48890       0.98385         0.48100
         14            280           0.23158       –0.42478        –0.09837
         12            305           –0.28305      –0.03349        0.00948
         9             230           –1.05499      –1.20736        1.27375
         14.2          350           0.28305       0.67082         0.18987
         8             230           –1.31230      –1.20736        1.58442
Total    91.7          2150          0             0               5.44700
Mean     13.1          307.14        0             0
SD       3.8863        63.8916       1             1
Here $z_x = \frac{x - \bar{x}}{s_x}$ and $z_y = \frac{y - \bar{y}}{s_y}$.
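The P12 table can be reproduced with a short script, which some students find helpful for seeing that $r$ really is the sum of the products $z_x \cdot z_y$ divided by $n - 1$. The sketch below uses the fat and calorie values from the table; the helper names are illustrative, and the standard deviations use the $n - 1$ divisor, as the table does.

```python
def sample_sd(values):
    """Standard deviation with the n - 1 divisor."""
    n = len(values)
    m = sum(values) / n
    return (sum((v - m) ** 2 for v in values) / (n - 1)) ** 0.5


def z_scores(values):
    """Convert a list of values to standard units."""
    m = sum(values) / len(values)
    s = sample_sd(values)
    return [(v - m) / s for v in values]


def correlation(xs, ys):
    """Return r and the list of products z_x * z_y."""
    products = [zx * zy for zx, zy in zip(z_scores(xs), z_scores(ys))]
    return sum(products) / (len(xs) - 1), products


fat = [19.5, 15, 14, 12, 9, 14.2, 8]            # grams, from the P12 table
calories = [385, 370, 280, 305, 230, 350, 230]  # from the P12 table

r, products = correlation(fat, calories)
for x, y, p in zip(fat, calories, products):
    print(f"{x:5.1f} {y:4d} {p:9.5f}")
print(f"sum of products = {sum(products):.5f}, r = {r:.4f}")
```

Only one product is negative, and the first and last pizzas supply the two largest products, which matches the explanation in P12.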
P13. a. positive
b. The point at the extreme upper right of the
plot at about (4.8, 5) will make the largest positive
contribution to the correlation because it is
farthest away from the new origin $(\bar{x}, \bar{y})$ and from
the new coordinate axes ($x = \bar{x}$ and $y = \bar{y}$) and
so has a large $z_x \cdot z_y$.
c. In Quadrants I and III; 20 have a positive
product.
d. In Quadrants II and IV; 7 have a negative
product.
P14. The plot on the top has a strong curvature. A line
would not be appropriate here. The plot on the
bottom is linear. The cloud of points is roughly
elliptical. A line would be appropriate for this plot.
P15. a. The correlation is 0.650:
$$b_1 = r \cdot \frac{s_y}{s_x}$$
$$0.368 = r \cdot \frac{7}{12.37}$$
$$r = 0.650$$
b. The regression equation is Exam 2 = 48.94 + 0.368 · Exam 1. The predicted Exam 2 score is 78.38.
c. The regression equation is Exam 1 = –14.1 + 1.149 · Exam 2.
d. [Scatterplot: Exam 2 versus Exam 1]
P16. a. the size of the city’s population
b. Divide each number by the population of the
city to get the number of fast-food franchises per
person and the proportion of the people who get
stomach cancer.
P17. An obvious lurking variable is the age of the child.
Parents tend to give higher allowances to older
children, and vocabulary is larger for older children
than for younger.
P18. A careless conclusion would be that people are
too busy watching television to have babies. The
lurking variable is how affluent the people in the
country are. More affluent people tend to have
more televisions and have fewer children.
P19. a. The formula relating these quantities is
$$r^2 = \frac{\mathrm{SST} - \mathrm{SSE}}{\mathrm{SST}} = \frac{480.25 - 212.37}{480.25} = 0.5578$$
(as given in the output), so $r = \pm 0.747$. Because the slope of
the regression line is negative (the scatterplot goes
downhill), $r = -0.747$.
b. The value of $r^2$ means that 55.8% of the
variability from state to state in the percentage
of families living in poverty can be “explained”
by the percentage of adults who are high school
graduates. In other words, there is 55.8% less
variability in the differences between $y$ and $\hat{y}$
than between $y$ and $\bar{y}$. So by knowing the high
school graduation percentage for a state and
using the regression line, you tend to do a better
job of predicting $y$ than if you used just $\bar{y}$ as the
prediction for that state.
c. No. Although that conclusion may seem
reasonable, the existence of a negative correlation
alone does not allow us to say that a state can
reduce its poverty rate by increasing graduation
rates. There might be a lurking variable behind the
negative correlation, such as the type of industry
in the state. Some industries may encourage
people to stay in school to get needed training and
also pay well enough to keep people above the
poverty line.
d. Value x is in percentage of high school
graduates, y is in percentage of families living in
poverty, $b_1$ is in percentage of families living in
poverty per percentage of high school graduates,
and r has no units.
P20. The plot should look similar to one of the two
options shown here. The regression line is flatter
than the line connecting the endpoints of the
ellipse. This plot shows the regression effect as
well. This time, it is the older sisters of the taller
younger sisters who tend to be less tall than their
younger sisters!
[Two scatterplots: Older Sister’s Height (in.) versus Younger Sister’s Height (in.), both axes scaled from 56 to 74, each with six points marked X]
P21. There is evidence of regression to the mean. An
ellipse around the cloud of points will have a
major axis that is steeper than the regression line.
The slope of the regression line is only about 0.4,
much less than 1. Also, for the vertical strip containing exam 1 scores above 95, the mean exam 2
score is about 93. For exam 1 scores less than 70,
the mean exam 2 score is about 76.
Exercises
E27. The correlations of the scatterplots are
a. 0.66
b. 0.25
c. 0.06
d. 0.40
e. 0.85
f. 0.52
g. 0.90
h. 0.74
E28. a. about –0.37. An estimate between –0.50 and
–0.25 is a good one.
b. about 0.65. An estimate between 0.50 and 0.80
is a good one.
c. about 0.53. An estimate between 0.40 and 0.65
is a good one.
E29. a. r 0.707
b. r 0.707
E30. a. 1
b. 0.5
c. 0.5
d. 1 e. 1
f. 0.5
g. 0.5
h. –0.5
E31. a. about 0.94
b. All but three of the points lie in Quadrants I
and III (based on the “origin” $(\bar{x}, \bar{y})$). In
Quadrant I, both $z_x$ and $z_y$ are positive, and in
Quadrant III, both $z_x$ and $z_y$ are negative. In
both of these quadrants the products $z_x \cdot z_y$ are
positive. The three points whose products $z_x \cdot z_y$
are negative are all close to either $x = \bar{x}$ or $y = \bar{y}$,
so their products are small negative values. Thus, the
correlation is positive.
c. The point in the lower left corner of
Quadrant III makes the largest contribution. It
is the most extreme point for both the x- and the
y-values, giving it the largest (in absolute value)
z-score for both variables.
d. The point just below $(\bar{x}, \bar{y})$ makes the smallest
contribution. Both $z_x$ and $z_y$ are near zero, so the
product of these z-scores will be quite small.
E32. a. I. B
II. C
III. A
b. i.
ii.
E33. a. No, because the units will be different. For
example, for the group that measures in chirps
per second and uses temperature for x, the units
of the slope will be chirps per second per degree
temperature. For the group that measures in
chirps per minute, the units will be chirps per
minute per degree temperature. So the slope for
the second group should be 60 times that of the
first group. For a group that measures in chirps
per minute and uses chirps for x, the units of the
slope will be degrees temperature per chirps per
minute. Even if they use the same units, groups
that interchange x and y will get different slopes
(chirps per minute per degrees Celsius, or degrees
Celsius per chirps per minute).
b. Yes, the correlation is the same no matter
what the units or what you use for x and for y. The
correlation is the same because r is equal to the
average product of z-scores, which have no units,
and $z_x \cdot z_y = z_y \cdot z_x$.
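A quick numerical check can make both parts of E33 concrete. The sketch below uses made-up cricket data (the temperatures and chirp rates are hypothetical, chosen only for illustration, and are not from the student book) to show that changing units or interchanging x and y changes the slope but not the correlation.

```python
def summary(xs, ys):
    """Return (slope, r) for the least squares line predicting ys from xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx, sxy / (sxx * syy) ** 0.5


# Hypothetical measurements, for illustration only.
temp_f = [60, 65, 70, 75, 80, 85]                    # degrees Fahrenheit
chirps_per_sec = [1.0, 1.3, 1.5, 1.9, 2.1, 2.4]
chirps_per_min = [60 * c for c in chirps_per_sec]    # same data, new units

slope_sec, r_sec = summary(temp_f, chirps_per_sec)
slope_min, r_min = summary(temp_f, chirps_per_min)
slope_swapped, r_swapped = summary(chirps_per_min, temp_f)

print(f"chirps/sec on temperature: slope = {slope_sec:.4f}, r = {r_sec:.4f}")
print(f"chirps/min on temperature: slope = {slope_min:.4f}, r = {r_min:.4f}")
print(f"temperature on chirps/min: slope = {slope_swapped:.4f}, r = {r_swapped:.4f}")
```

The slope for chirps per minute is 60 times the slope for chirps per second, the slope changes again when x and y are interchanged, and all three correlations agree.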
E34. a. An estimate of the slope is
$$b_1 = r \cdot \frac{s_y}{s_x} = (-0.5) \cdot \frac{0.083}{4.3} \approx -0.00965$$
To find the y-intercept, use the fact that the point
$(\bar{x}, \bar{y}) = (11.7, 0.827)$ is on the regression line:
$$y = \text{slope} \cdot x + \text{y-intercept}$$
$$0.827 = -0.00965(11.7) + \text{y-intercept}$$
$$\text{y-intercept} = 0.93991$$
The equation is $\hat{y} = -0.00965x + 0.93991$.
b. An estimate of the slope is
$$b_1 = r \cdot \frac{s_y}{s_x} = (-0.5) \cdot \frac{4.3}{0.083} \approx -25.90$$
To find the y-intercept, use the fact that the point
$(\bar{x}, \bar{y}) = (0.827, 11.7)$ is on the regression line:
$$y = \text{slope} \cdot x + \text{y-intercept}$$
$$11.7 = -25.90(0.827) + \text{y-intercept}$$
$$\text{y-intercept} = 33.12$$
The equation is $\hat{y} = -25.90x + 33.12$.
E35. a. True. This is a direct result of the formula
$$b_1 = r \cdot \frac{s_y}{s_x}$$
If $s_x < s_y$, the factor on the right, $\frac{s_y}{s_x}$, will be greater
than 1, which makes $b_1$ greater in absolute value
than $r$.
b. The formula
$$1.6 = 0.8 \cdot \frac{s_y}{s_x}$$
$$2 = \frac{s_y}{s_x}$$
can be true only if $s_x = 25$ and $s_y = 50$.
c.
$$b_1 = r \cdot \frac{s_{\text{estimated}}}{s_{\text{actual}}}$$
$$0.36 = r \cdot \frac{4.12}{0.93}$$
$$r = 0.081$$
d.
$$b_1 = r \cdot \frac{s_{\text{actual}}}{s_{\text{estimated}}}$$
$$b_1 = 0.081 \cdot \frac{0.93}{4.12}$$
$$b_1 = 0.0183$$
E36. An estimate of the slope of the regression line is
$$b_1 = r \cdot \frac{s_y}{s_x} = 0.8 \cdot \frac{8}{30} \approx 0.21$$
To find the y-intercept, use the fact that $(\bar{x}, \bar{y}) = (280, 75)$ is on the regression line.
$$75 = b_0 + 0.21 \cdot 280$$
$$b_0 = 16.2$$
The regression equation is
$$\hat{y} = 16.2 + 0.21x$$
Julie’s final exam score is predicted to be
$16.2 + 0.21 \cdot 300 = 79.2$.
E37. a. A large brain helps animals live smarter and
therefore longer. The lurking variable is overall
size of the animal. Larger animals tend to develop
more slowly, from gestation to “childhood”
through old age, and larger animals tend to live
longer than smaller ones.
b. If we kept the price of cheeseburgers down,
college would be more affordable. The lurking
variable is inflation over the years: all costs have
gone up over the years.
c. The Internet is good for business. The lurking
variable is years. Stock prices generally go up due
to inflation over the years. The Internet is new
technology, and so the number of Internet sites
also is increasing over the years.
E38. a. (calories, fat), 0.95
(calories, saturated fat), 0.95
(calories, sodium), –0.5
(fat, saturated fat), 0.95
(fat, sodium), –0.5
(saturated fat, sodium), –0.5
b. If you use more salt in your food, it will reduce
the calories and the fat.
c. The lurking variable is whether the cheese is
low fat. Fat makes cheese taste good (and adds
calories). To make up for this loss of flavor in low
fat cheese, manufacturers may add more salt.
E39. a.
$$r^2 = \frac{\mathrm{SST} - \mathrm{SSE}}{\mathrm{SST}} = \frac{0.088740 - 0.016304}{0.088740} = 0.8163$$
So $r = 0.903$. The value of $r^2$ above is equal to
R-sq in the regression analysis.
b. The largest residual occurs for the point
(55, 0.318). Its value is
$$y - \hat{y} = 0.318 - [0.202 + 0.00306(55)] = 0.318 - 0.3703 = -0.0523$$
c. Yes, hot weather causes people to want to eat
something cold, but the correlation alone does not
tell us that.
d. degrees Fahrenheit, pints per person, pints per
person per degree Fahrenheit, and no units
e. MS was computed by dividing SS by DF.
E40. The center of this cloud of points should not be
modeled by a straight line. This scatterplot can
be considered a plot with curvature or a plot with
two very influential points in the upper right
corner. In either case, some adjustments should
be made to the data before attempting to fit a line
to it. One possibility is to transform the data by
techniques to be learned in Section 3.5. Because
$r^2$ is a measure of how closely the points cluster
about the regression line, it would not make sense
in this context.
E41. Scoring exceptionally well, for example, on a test
involves more than just studying the material.
A certain amount of randomness is involved,
too—the teacher asked questions about what the
student knew, the student was feeling well that
day, the student was not distracted, and so on.
It’s unlikely that this combination of knowing the
material and good luck will happen again on the
next test for this same student. The student probably
will get a lower, but still high, score on the next test
even if he or she doesn’t slack off. However, it would
appear as if doing well the first time and getting
praised prompted the student to relax and study
less. On the second test, the student’s place at the
top of the class may be taken by another student
who knew just as much for the first test but was also
affected by randomness on the first test—perhaps
unlucky in the questions the teacher chose or
unlucky in another way at the time.
At the other end, a student who scores
exceptionally poorly on the first test also has a
bit of randomness involved—bad luck this time.
Whether or not he or she is praised, the student
scoring exceptionally poorly probably won’t have
all of the random factors go against him or her on
the next test and the student’s score will tend to
be higher.
E42. This disappointing development is an example of
the regression effect. The explanation is similar
to that of E41. The same students are not likely to
score at the top of their class two times in a row.
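The explanation in E41 and E42 can be demonstrated with a small simulation. The sketch below rests on an assumed model that is not from the student book: each score is a stable “knowledge” component plus independent luck on the day of the test, with made-up means and standard deviations. No student in the simulation slacks off or works harder, yet the top group still drops and the bottom group still rises.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

N_STUDENTS = 1000
GROUP_SIZE = 100  # top and bottom 10% on the first test

# Assumed model: score = stable knowledge + luck on the day of the test.
knowledge = [random.gauss(75, 8) for _ in range(N_STUDENTS)]
test1 = [k + random.gauss(0, 6) for k in knowledge]
test2 = [k + random.gauss(0, 6) for k in knowledge]


def average(values):
    return sum(values) / len(values)


# Rank students by their first-test score only.
ranked = sorted(range(N_STUDENTS), key=lambda i: test1[i])
bottom, top = ranked[:GROUP_SIZE], ranked[-GROUP_SIZE:]

top_t1 = average([test1[i] for i in top])
top_t2 = average([test2[i] for i in top])
bottom_t1 = average([test1[i] for i in bottom])
bottom_t2 = average([test2[i] for i in bottom])
overall_t2 = average(test2)

print(f"top group:    test 1 mean {top_t1:.1f}, test 2 mean {top_t2:.1f}")
print(f"bottom group: test 1 mean {bottom_t1:.1f}, test 2 mean {bottom_t2:.1f}")
print(f"all students: test 2 mean {overall_t2:.1f}")
```

The top group’s second-test mean falls but stays above the class mean, which is the “lower, but still high” score described in E41.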
Activity 3.3a: Was Leonardo Correct?
[Table headings: Height, Kneeling Height, Arm Span, Hand Length]
Relationships Between Poverty Rates and Four Variables
[Scatterplot matrix of Met Residence, White, Graduates, Poverty, and Single Parent]
Heights of Sisters
[Plot grid: Older Sister’s Height (in.) versus Younger Sister’s Height (in.), both axes scaled from 56 to 74]
3.4 Diagnostics: Looking for Features That the Summaries Miss
Objectives
• to identify a potential influential point by examining a scatterplot
• to determine whether a point is influential by excluding it when computing the
correlation and equation of the regression line
• to make and interpret a residual plot
Important Terms and Concepts
• influential point
• outlier
• residual plot
Alignment with the AP Statistics Topic Outline
This section aligns with the listed items of the AP Statistics Topic Outline
as described here. The actual text of the AP Statistics Topic Outline and the
complete correlation begin on page xxi.
ID4 Students explore residual plots, outliers, and influential points for bivariate data.
Lesson Planning
Class Time
One to two days
Materials
For Activity 3.4a, a piece of paper for recording data
Suggested Assignments
Classwork
Essential
D26–30
P22–25
Recommended
Optional
Activity 3.4a
D31
Homework
Essential
E43, E45, E47, E49
Recommended
E44, E50
Optional
E46, E48, E51–54
Lesson Notes: Which Points Have the Influence?
You identify an influential point by temporarily removing the candidate from
the data set and then recomputing the correlation and regression equation to see
how much they change. If they change only a little, the point isn’t influential. You
should not infer that a large change means that the point should be permanently
discarded. Typically, a point should remain in the data set, but you should
conclude that correlation and linear regression might not be suitable summary
statistics for those data. If a point is an outlier, like all outliers, it should be
carefully examined for a possible error in recording data.
Outliers and influential points are two separate ideas. In particular, some
outliers may not be influential points. The example on animal longevity on
page 164 in the student book covers both outliers and influential points.
Activity 3.4a Near and Far
You can have each student bring an index card or a piece of paper and have students
list their six locations, their estimates, and their actual step count in columns.
Answers will vary because students analyze their own data. For some students the
far point will be very influential on the slope (in either direction), and for others
it won’t be. If the far point is far removed from the data, it will almost always have
considerable influence on the correlation, even if it does not affect the slope.
1–3. A sample set of data from one student is shown next. He or she
consistently underestimated distances.
Estimate (x)   Actual (y)
3              3
21             34
25             48
28             63
40             146
180            340
4–6. The scatterplots with and without the “far” point are shown here, along with
the regression summaries and the correlations. The trend appears linear in the
first plot only because of the far point. In the second plot, the trend is a curve.
Proportionally, the far point was not off by as much as some of the others,
which show increasingly bad estimates with farther distance.
[Scatterplot: Actual versus Estimate, all six points]
Dependent variable is: Actual
No Selector
R squared = 94.6%   R squared (adjusted) = 93.2%
s = 32.40 with 6 – 2 = 4 degrees of freedom

Source       Sum of Squares   df   Mean Square   F-ratio
Regression   73163.2          1    73163.2       69.7
Residual     4198.14          4    1049.53

Variable     Coefficient   s.e. of Coeff   t-ratio   prob
Constant     13.6176       17.22           0.791     0.4733
Estimate     1.85958       0.2227          8.35      0.0011

r = 0.972
[Scatterplot: Actual versus Estimate, with the “far” point removed]
Dependent variable is: Actual
No Selector
6 total cases of which 1 is missing
R squared = 84.8%   R squared (adjusted) = 79.7%
s = 24.14 with 5 – 2 = 3 degrees of freedom

Source       Sum of Squares   df   Mean Square   F-ratio
Regression   9718.15          1    9718.15       16.7
Residual     1748.65          3    582.885

Variable     Coefficient   s.e. of Coeff   t-ratio   prob
Constant     –27.0973      23.65           –1.15     0.3349
Estimate     3.67083       0.8990          4.08      0.0265

r = 0.921
7. Yes. Removing the far point decreased the correlation and almost doubled
the slope. Although the far point doesn’t follow the curved pattern of the
other points, its distance from them results in a large $z_x \cdot z_y$, and so including
it increases the correlation.
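If students do not have statistical software handy, the with-and-without comparison for this sample data set can be sketched as follows. The code uses the six (estimate, actual) pairs from the sample student above; `fit_line` is an illustrative helper written for this sketch, not a function from any particular package.

```python
def fit_line(xs, ys):
    """Return (intercept, slope, r) for the least squares line predicting ys from xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r = sxy / (sxx * syy) ** 0.5
    return intercept, slope, r


estimate = [3, 21, 25, 28, 40, 180]  # the last pair is the "far" point
actual = [3, 34, 48, 63, 146, 340]

with_far = fit_line(estimate, actual)
without_far = fit_line(estimate[:-1], actual[:-1])  # temporarily drop the far point

for label, (b0, b1, r) in [("all six points", with_far),
                           ("far point removed", without_far)]:
    print(f"{label:18s} intercept = {b0:8.4f}  slope = {b1:.5f}  r = {r:.3f}")
```

The slope roughly doubles and the correlation drops when the far point is left out, in line with the two regression summaries shown above.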
Lesson Notes: Residual Plots
Students can add a horizontal line to their handmade residual plots where the
residual value equals 0. This line will make it easier to visualize and evaluate
where the model overestimates and underestimates. On a TI-83 Plus or TI-84
Plus, the residual plot will display the “zero residual line” automatically.
Residual plot A in D30 presents a “good” residual plot, that is, one that
displays random scatter and a fairly constant spread in the y’s across all values
of x.
The residual plots C and D in D30 are plots that show a pattern, whereas
residual plot I in Display 3.77 is an example where the spread in y is not constant
across all values of x. These residual plots show that a linear model is not the
right model for the data.
Make sure students see the connection between the scatterplot with its
regression line and the residual plot. Comparing the residual plot with the
scatterplot will help students see the diagnostic information that a residual
plot contains.
A misconception some students have as they look at residual plots—
confusing them with the original scatterplot—is to think patterns are good in
residuals. You might emphasize that if the residual plot still has a pattern, the
model is not taking care of everything and students need to look for another
model that will reduce or eliminate that pattern, too. With a good model,
residuals should be without patterns and randomly scattered about the zero line.
Students are beginning to conceptualize random error about a line, and they
will get a chance to see this again in Chapter 11.
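To connect a residual plot back to its scatterplot, it can help to compute the residuals directly. The sketch below reuses the Activity 3.4a sample data with the far point removed, where the lesson notes describe the trend as a curve; the helper code is illustrative only.

```python
estimate = [3, 21, 25, 28, 40]   # Activity 3.4a sample data, far point removed
actual = [3, 34, 48, 63, 146]

n = len(estimate)
x_bar, y_bar = sum(estimate) / n, sum(actual) / n
sxx = sum((x - x_bar) ** 2 for x in estimate)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(estimate, actual))
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Residual = observed y minus predicted y.
residuals = [y - (intercept + slope * x) for x, y in zip(estimate, actual)]
signs = ["+" if e > 0 else "-" for e in residuals]
sse = sum(e ** 2 for e in residuals)

for x, e, s in zip(estimate, residuals, signs):
    print(f"x = {x:3d}  residual = {e:8.3f}  {s}")
print(f"sum of residuals = {sum(residuals):.6f}")
print(f"SSE = {sse:.2f}")
```

The residuals are positive at both ends and negative in the middle. A run of signs like this is the kind of pattern that tells students the line is not capturing a curve, even though the residuals still sum to zero.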
Notes for the AP Teacher
Modeling Good Answers
Once students have completed E45, show them the model answer with its good
discussion of residuals. A PDF file containing the question and the model answer
is available at www.keypress.com/keyonline.
Discussion
Note on D26–28: Students might work in groups of four,
with each group calculating the correlation coefficient
and the regression equation for one of the four
Anscombe data sets. Groups can then share their
results. Seeing similar summaries for quite different
graphs will reinforce the fact that numerical summaries
are not sufficient to describe any data set, either
univariate or bivariate.
D26. Plot I shows a positive linear trend that is
moderately strong. Plot II shows points that lie
along a curve. Plot III shows all points lying on a
straight line except for the one point near the right
end. (This point will have the effect of raising that
end of the regression line.) Plot IV has all but one
of the points stacked up at the same value of x,
although the outlier gives it a slight positive trend.
(The regression line will go through the middle
of the points on the left and through the isolated
point on the right.)
a. A straight line is a good summary only for plot
I. However, in all four plots the regression line has
a slope of about 0.5 and an intercept of about 3.0.
(See D27.)
b. The correlation is about 0.8 for plot I and
should not be used to describe the others because it
is a measure of the strength of a linear relationship.
D27. There is no way to tell which plot produced the given
summary statistics. In fact, the summary statistics
are essentially the same for all four plots. The moral?
Draw a picture before you summarize data!
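The point of D26 and D27 can be checked numerically. The sketch below assumes the four data sets are the standard published Anscombe quartet (the values are not reproduced in this guide, so treat them as an assumption) and uses a hypothetical helper, `fit_line`, to compute the least squares line and correlation for each set.

```python
import math

def fit_line(xs, ys):
    """Return (intercept, slope, r) for the least squares line of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r = sxy / math.sqrt(sxx * syy)
    return intercept, slope, r

# Assumed data: the standard Anscombe quartet (plots I-III share the same x-values).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# One (intercept, slope, r) summary per plot.
summaries = {name: fit_line(xs, ys) for name, (xs, ys) in ANSCOMBE.items()}
for name, (a, b, r) in summaries.items():
    print(f"Plot {name}: y-hat = {a:.2f} + {b:.3f}x, r = {r:.3f}")
```

All four summaries come out nearly identical, which is why the plots, not the numbers, have to carry the diagnosis.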
D28. a. Plots III and IV both have influential
observations, but plot IV contains the more
influential point. The influential point in plot IV
is more isolated from the data and completely
controls the slope of the regression line.
b. This influential point lies on the regression
line through these data.
c. If the influential point (the isolated point) is
removed from plot IV, all of the data points will
stack up at the same value of x. Thus, a regression
slope and correlation cannot be computed.
D29. Delta produces the residual on the extreme
right of the plot, at point (8.03, 0.18). Northwest
produces the fifth residual from the left and the
largest in absolute value at point (5.36, 8.5).
D30. a. A—I; B—IV; C—II; D—III
b. The scatterplot shows the actual values of y
and upward or downward trend but may obscure
patterns in the residuals (or at least appear to
diminish them a bit). The residual plots do not
show the values of y or trend in the original data,
but they do show the values of the residuals, and
they make departures from linearity easier to see.
D31. This question gives further information on plotting
residuals versus x–values or predicted values, ŷ.
The only difference between the two
residual plots is a matter of scaling. The second
residual plot has essentially the same scale on
the horizontal axis as the scatterplot, with x
beginning at zero. Thus, it must be the one with
residuals plotted versus x. The points on the first
residual plot have horizontal scale values of 0.5,
1.0, and 1.5, which are the fitted values that are
obtained when $\hat{y} = 0.5 + 0.5x$ and x is 0, 1, or
2, respectively. (See E52, page 178 of the student
book, for more on this idea.)
Practice
P22. a. In the scatterplot with the regression line, the
data show no pattern except for the one outlier in
the upper right.
[Scatterplot: International ($ millions) versus Domestic ($ millions), with regression line]
[Scatterplot: International ($ millions) versus Domestic ($ millions)]
P23. a. The student did not predict very well. The
estimates were consistently low.
b. (180, 350) appears to be the most influential
point. It is an outlier in both variables and is not
aligned with the other points.
c. With the point (180, 350) the regression
equation is actual 12.23 1.92 estimate, and
r 0.975. Without this point, the equation is
actual 27.10 3.67 estimate, with r 0.921.
This point pulls the right end of the regression
line down, decreasing the slope and increasing the
correlation.
P24. a. The scatterplot with the regression line is
shown here.
c. The residual plot is shown here.
[Residual plot: Residual versus x]
d. The residual plot straightens out the tilt in
the scatterplot so that the residuals can be seen
as deviations above and below zero rather than
above and below a tilted line. The symmetry of the
residuals in this example shows up better on the
residual plot.
P25. a. A—IV; B—II; C—I; D—III
b.
i. The residual plot will open upward, like a
cup, as in plot II.
ii. The residual plot has a fan shape, as in
plot I.
iii. The residual plot will open downward, like
an inverted cup. No plot in this example
shows this pattern.
iv. The residual plot will be V–shaped, as in
plot III. The pattern is more clear if you
ignore the point with a residual of about 10.
c. Plot D and residual plot III show a scatterplot
that looks as though it should be modeled by two
different straight lines. The V shape can be seen
in the scatterplot, but it may be more obvious
to many in the residual plot because the overall
linear trend (tilt of the V) has been removed.
Exercises
E43. a. The scatterplot appears here (with the
regression line). A line is not a good model
because the cloud of points is not elliptical and has
one extremely influential point.
[Scatterplot: y versus x, with regression line]
b. The table is shown here.
x    y    Predicted Value    Residual
0    0    0.5                −0.5
0    1    0.5                0.5
1    1    1                  0
3    2    2                  0
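A table of predicted values and residuals like this one can be reproduced in a few lines. This is a minimal sketch, assuming the four points are (0, 0), (0, 1), (1, 1), and (3, 2); the helper name `least_squares` is illustrative, not something from the student book.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the least squares line of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

xs = [0, 0, 1, 3]
ys = [0, 1, 1, 2]

intercept, slope = least_squares(xs, ys)
fitted = [intercept + slope * x for x in xs]
# A residual is observed minus predicted: y - y-hat.
residuals = [y - f for y, f in zip(ys, fitted)]

for x, y, f, e in zip(xs, ys, fitted, residuals):
    print(f"x={x}  y={y}  predicted={f:.1f}  residual={e:+.1f}")
```

The residuals sum to zero, as they must for any least squares line with an intercept.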
b. The regression equation is $\hat{y} = -680 + 2.85x$
and the correlation is $r = 0.7$. The slope is positive,
and the correlation is moderately strong.
c. The plot without the influential point (Titanic)
is shown here. The regression equation is
$\hat{y} = 1350 - 2.14x$ and the correlation is $r = -0.50$.
Now the slope is negative and the correlation is
moderate. Titanic has a huge influence (as a big
ship should).
[Scatterplot: Minimum Temperature (°F) versus Maximum Temperature (°F), with regression line]
b. The regression equation is $\hat{y} = -161.90 + 0.954x$, and $r = 0.49$.
Source        Sum of Squares    df    Mean Square    F-ratio
Regression    113.348           1     113.348        11.2
Residual      80.7516           8     10.0939

Variable    Coefficient    s.e. of Coeff    t-ratio    prob
Constant    −1.63057       2.299            −0.709     0.4984
x           0.745223       0.2224           3.35       0.0101
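Students often treat a regression printout as a list of unrelated numbers. The sketch below, which uses only values read from the E45 printout (10 cases used, regression and residual sums of squares, slope and its standard error), shows how R squared, s, the F-ratio, and the t-ratio follow from one another. The variable names are illustrative.

```python
import math

# Values read from the E45 regression printout.
ss_regression = 113.348
ss_residual = 80.7516
n_cases = 10               # 13 total cases, 3 missing
slope = 0.745223
se_slope = 0.2224

df_residual = n_cases - 2  # two estimated coefficients: intercept and slope

# R squared is the share of total variation explained by the line.
r_squared = ss_regression / (ss_regression + ss_residual)

# s is the square root of the residual mean square.
mean_square_residual = ss_residual / df_residual
s = math.sqrt(mean_square_residual)

# With one predictor, the regression mean square equals ss_regression (df = 1).
f_ratio = ss_regression / mean_square_residual

# t-ratio for the slope; in simple regression its square is the F-ratio.
t_slope = slope / se_slope

print(f"R squared = {r_squared:.1%}, s = {s:.3f}")
print(f"F = {f_ratio:.1f}, t = {t_slope:.2f}, t squared = {t_slope ** 2:.1f}")
```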
Here is the completed table:
[Scatterplot: Minimum Temperature (°F) versus Maximum Temperature (°F)]
E44. a. The point (0.137, 2.252) appears to be the
most influential because it is an outlier in both
variables. It appears that if this point were
eliminated, the right end of the line would drop,
decreasing the slope. The correlation would
probably also decrease.
Eliminating (0.137, 2.252) does indeed
decrease the slope of the regression line, from 13.0
to 7.46. The correlation also decreases, from 0.896
to 0.634.
b. A point near the regression line and near
$(\bar{x}, \bar{y})$ is likely to have little effect on the slope or
correlation. One such point would be (0.0194, 0.517).
Eliminating this point actually leaves the regression
equation almost the same and very slightly increases
the correlation from 0.896 to 0.897.
c. Removing the point with the largest residual
in absolute value, (0.0764, 0.433), would cause
a slight increase in slope but a large increase in
correlation. The actual result is to cause the slope
to increase from 13.0 to 15.4 and the correlation to
increase from 0.896 to 0.969.
E45. a. The plot with the regression line and the
regression summary are shown here. The equation
of the line is $\hat{y} = -1.63 + 0.745x$.
[Scatterplot: Count After Treatment versus Count Before Treatment, with regression line]
Dependent variable is: y
No Selector
13 total cases of which 3 are missing
R squared = 58.4%    R squared (adjusted) = 53.2%
s = 3.177 with 10 − 2 = 8 degrees of freedom
x     y     Predicted Values    Residual
11    6     6.567               −0.567
8     0     4.331               −4.331
5     2     2.096               −0.096
14    8     8.803               −0.803
19    11    12.529              −1.529
6     4     2.841               1.159
10    13    5.822               7.178
6     1     2.841               −1.841
11    8     6.567               1.433
3     0     0.605               −0.605
b. The residual plot shown here is a little unusual
in that it shows more variability in the middle
than at either end. This is partly because there are
more cases in the middle.
[Residual plot: Residual versus Count Before Treatment]
c. With Antarctica removed, the slope of the
regression line changes from positive (0.954) to
negative (−1.869) and the correlation becomes
negative, $r = -0.45$. The plot appears as shown
here. (Notice that a new potential influential point
has appeared: Oceania.) Without a plot of the
data, you might come to the following incorrect
conclusion: In general, continents tend to be
“warm” or “cold”; that is, continents with higher
maximums also tend to have higher minimums.
In fact, there is little relationship.
c. The disinfectant appears to be unusually
effective for the person with the large negative
residual, the point (8, 0) on the original
scatterplot. It is seemingly ineffective for the
person with the large positive residual, the point
(10, 13) on the scatterplot.
E46. a. The slope of the line is very close to 1, and the
y-intercept is −3.57. Because the slope is about
1, the y-intercept means that textbooks bought
online tend to cost about $3.57 less than those
bought at the college bookstore. (In fact, the mean
cost of the college textbooks is $47.04, and the
mean cost of the online textbooks is $45.03. Their
difference is not $3.57 because the slope is not
College    Online    Fitted     Residual
93.40      94.18     92.9281    1.2519
9.95       7.96      6.7092     1.2508
46.70      48.75     44.6786    4.0714
76.00      94.15     74.9508    19.1992
86.70      80.95     86.0058    −5.0558
7.95       6.36      4.6428     1.7171
24.00      16.80     21.2254    −4.4254
12.70      10.66     9.5505     1.1095
66.00      45.50     64.619     −19.119
The next residual plot shows that a line is a
reasonable model for these data. The points are
scattered randomly above and below 0, except
that the points fan out to the right. This pattern
indicates that the points lie farther from the
regression line as the prices increase.
[Residual plot: Residual versus College Bookstore Price ($)]
c. The plot with the line $y = x$ is shown next.
A point above the line represents a textbook that
costs more at the online bookstore. A point below
the line represents a textbook that costs less at the
online bookstore. A point on the line represents a
book that costs the same at both stores.
[Scatterplot: Online Bookstore Price ($) versus College Bookstore Price ($), with the line $y = x$]
b. The largest positive residual belongs to Pizza
Hut’s Stuffed Crust, which has more calories than
would be predicted from a simple linear model. The
largest negative residual belongs to Pizza Hut’s Pan.
None appear so far away that it would be called an
exception, or outlier. In fact, you can check this by
making a boxplot of the residuals, as shown here.
[Boxplot of the residuals, on a scale from −40 to 40]
c. The complete data set shows a moderately
strong positive trend with a slope of about 14.9
calories per gram of fat and correlation of 0.908.
The most influential data point would be the
one farthest away from the main cloud of points
on the x–axis (Domino’s Deep Dish). Removing
Domino’s Deep Dish yields a slope of around
18 calories per gram of fat and correlation
of 0.893. None of the other points have nearly
as much influence on the slope.
d. The boxplot is centered slightly above 0,
indicating that most college bookstore prices
tend to be slightly higher than the online prices.
However, most of the differences are close to 0,
so there is little difference in price. The median
difference is about $2 more for the college
bookstore. Two outliers are shown, which means
that for two textbooks, the prices vary greatly,
in one case being less expensive at the college
bookstore and in the other case being less
expensive at the online bookstore. The overall
lesson is that with more expensive textbooks,
it pays to shop around.
E47. a. The residual plot for these data looks like this:
[Residual plot: Residual versus Fat (g)]
exactly equal to 1.) The slope is 1.03, which means
that for every $1.00 increase in price for a book
sold through the college bookstore, there tends to
be a $1.03 increase, on average, for the same book
bought online.
b. The table for computing the residuals is shown
here.
[Scatterplot: Calories versus Fat (g)]
E51. a. The regression equation is $\hat{y} = 366.67 + 16x$.
The completed table is shown here.
Aircraft    Seats    Cost    Predicted    Residual
ERJ–145     50       1100    1166.67      −66.67
DC9         100      2100    1966.67      133.33
MD–90       150      2700    2766.67      −66.67
b. The plot of the residuals versus x (seats) is
given first followed by the plot of the residuals
versus ŷ. The only difference between the two
plots is the scaling on the horizontal axis.
[Residual plot: Residual versus Seats]
[Residual plot: Residual versus Predicted]
E48. a. The pattern of the scatterplot is basically
linear, so the slope is constant across the numbers
of seats.
b. The spread in the flight lengths increases as
the number of seats increases. The points fan out
to the right.
c. A good guess is likely only in the first case.
Predicting flight length for planes with fewer
seats would be easier because there
is less variation in flight lengths for the smaller
planes than for the larger. When the number
of seats is between 50 and 150, the values for
flight length vary from about 175 to about
1065 miles, whereas when the number of seats is
between 200 and 300, the values for flight length
vary from 947 to 3559 miles.
d. The residual plot for this scatterplot also fans
out (spreads out more) as the number of seats
increases. In fact, the fan shape may be seen better
in the residual plot.
e. Planes that carry more passengers have more
variation in their average flight lengths because
they tend to fly longer distances and there is a
larger spread of numbers over which to vary. In
general, larger numbers usually show a larger
absolute variation than smaller ones. But the
relative variation may be fairly constant. If you
make a new variable defined as (flight length)/
(number of seats) and plot it versus number of
seats, the fan shape disappears.
E49. A—I; B—IV; C—III; D—II. A linear model would
be appropriate for C and D. Both C and D show
a random scatter of points in the residual plot, but
the slope of the regression line is almost zero for
plot C, and there appears to be little correlation.
Plot B does not have a random scatter around
the line; the pattern appears to be cyclical.
This pattern is typical of a situation in which
something changes approximately linearly over
time. What happens next usually depends on
what just happened, causing this up-and-down
pattern in the residuals.
E50. No for plot A. The pattern is impossible for
residuals because the major linear trend would
already have been removed by fitting the regression
line. Residuals show what is left over after any
linear trend is removed from the data.
No for plot B. These residuals do not have a
mean of zero.
E52. The fitted value $\hat{y}$ is a linear transformation of x;
that is, $\hat{y} = a + bx$. Thus, using $\hat{y}$ rather than x on
the horizontal axis does not change the relative
distance of the values from each other; it just
translates and stretches the horizontal axis.
Consider this example for a regression line
$\hat{y} = 1 + 2x$ with values of x = 0, 1, 3, 4, 6, 6. Then
the values of $\hat{y}$ are 1, 3, 7, 9, 13, 13. These two sets
of points are plotted on horizontal scales in the
next plot. Note that the relative spacing of the
points is exactly the same on each scale.
[Two number lines: x from 0 to 6 and $\hat{y}$ from 1 to 13, showing the same relative spacing of the points]
If the regression line has a negative slope, then
the larger $\hat{y}$'s correspond to the smaller x's and one
residual plot will be the mirror image of the other.
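The claim in E52 can be verified directly. The sketch below uses the example line with intercept 1 and slope 2 and the x-values 0, 1, 3, 4, 6, 6; the helper `relative_positions` is a hypothetical name that rescales a set of values to run from 0 to 1 so the spacing on two different axes can be compared.

```python
def relative_positions(values):
    """Rescale values so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

xs = [0, 1, 3, 4, 6, 6]

# Positive slope: fitted values keep the same relative spacing as x.
fitted = [1 + 2 * x for x in xs]
same_spacing = all(
    abs(p - q) < 1e-12
    for p, q in zip(relative_positions(xs), relative_positions(fitted))
)

# Negative slope (an assumed line, 1 - 2x, for illustration): the spacing is mirrored.
fitted_neg = [1 - 2 * x for x in xs]
mirrored = all(
    abs(p - (1 - q)) < 1e-12
    for p, q in zip(relative_positions(fitted_neg), relative_positions(xs))
)

print(fitted, same_spacing, mirrored)
```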
It is difficult to see the strong V shape of the
residuals in a scatterplot.
E54. Scatterplots of the data, without and with the
regression line, are shown here. Begin by drawing
in the regression line. Then use the residual plot to
determine how far above or below the regression
line each point should be placed. Note that a
linear model is not a good one for predicting life
expectancy from GNP.
[Scatterplot: Life Expectancy (yr) versus Gross National Product ($ thousands per capita)]
E53. Scatterplots of the original data, without and with
the regression line, follow the commentary.
To estimate the recommended weight for a
person whose height is 64 inches, add the fitted
weight (given to be 145 pounds) to the residual of
about 1.2 to get 146.2. The slope of the regression
line must be about $(187 - 145)/(76 - 64) = 3.5$. For the second
height, 65 inches, the fitted weight would be
$145 + 3.5(1) = 148.5$. The residual is about 0.9
pounds. Thus, the recommended weight is about
$148.5 + 0.9 = 149.4$ pounds. You could continue
point by point to get the next plot, but a rough
sketch can be obtained by making use of the linear
patterns in the residuals (and hence in the original
scatterplot). The points on the scatterplot must
form a straight line up to a height of 71, where the
weight must be about $145 + 7(3.5) - 1.5 = 168.0$.
The remainder of the points must form
(approximately) another straight line up to a
height of 76, where the weight must be about
$145 + 12(3.5) + 2.2 = 189.2$.
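The reconstruction in E53 is just "observed = fitted + residual" applied point by point. A small sketch, assuming the residuals quoted above (1.2, 0.9, −1.5, and 2.2, all read approximately from the residual plot) and a hypothetical helper `recommended_weight`:

```python
# Fitted weights are given as 145 lb at 64 in. and 187 lb at 76 in.
slope = (187 - 145) / (76 - 64)  # pounds per inch

def recommended_weight(height, residual):
    """Observed value = fitted value on the line + residual."""
    fitted = 145 + slope * (height - 64)
    return fitted + residual

# (height, approximate residual read from the residual plot)
points = [(64, 1.2), (65, 0.9), (71, -1.5), (76, 2.2)]
weights = [recommended_weight(h, e) for h, e in points]

for (h, e), w in zip(points, weights):
    print(f"height {h} in.: recommended weight about {w:.1f} lb")
```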
[Scatterplot: Life Expectancy (yr) versus Gross National Product ($ thousands per capita), with regression line]
[Scatterplots: Weight (lb) versus Height (in.), without and with the regression line]
3.5 Shape-Changing Transformations
Objectives
• to understand that removing the curvature from the shape of a scatterplot
requires a nonlinear transformation
• to use a log transformation to straighten data that follow an exponential
pattern
• to use a log-log transformation to straighten data that follow an unknown
power relationship
• to use a power transformation to straighten data that follow a known or
suspected power relationship
Important Terms and Concepts
• log-log transformation
• homogeneous residuals
• power transformations
• exponential growth and decay
• log transformation
• power function
Alignment with the AP Statistics Topic Outline
This section aligns with the listed items of the AP Statistics Topic Outline
as described here. The actual text of the AP Statistics Topic Outline and the
complete correlation begin on page xxi.
ID5 Students use logarithmic and power transformations to achieve linearity.
Lesson Planning
Class Time
Two to three days
Materials
For Activity 3.5a, a paper cup and 200 pennies per student (or pair of students)
Suggested Assignments
Classwork
Essential
Recommended
Optional
Activity 3.5a
D35, D40
D34, D39, D42
D32, D33, D36–38, D41,
D43
P30–32, P35, P37, P38
P36, P39
Chapter 3 Quiz 2
P26–29, P33, P34
Homework
Essential
E55–57, E59
Recommended
E60, E65
Optional
E58, E61–64, E66
Lesson Notes: Exponential Growth and Decay
An exponential pattern of growth or decay is one of the most common nonlinear
patterns seen in science and social science. Activity 3.5a introduces a common
type of exponential decay so that students can study the pattern made by a
quantity over time. This situation is fully analyzed after the activity.
Activity 3.5a Copper Flippers
Activity 3.5a is essential. This activity can be done in pairs—one to toss the coins
and count and one to record.
1–3. The data from one student’s results are shown here.
Toss    Heads
1       91
2       59
3       27
4       9
5       7
6       4
4. A scatterplot from the sample student results is shown in Display 3.91 on
page 181 of the student book. The pattern appears to be exponential, not
linear.
5. The explanation is given on pages 181–182 of the student book.
Have students save their data to use in D22 on page 397.
Lesson Notes: Exponential Functions
and Log Transformations
Although the AP Statistics Topic Outline does not explicitly require that
students understand the algebra of log and log-log transformations, it should
not be out of the reach of AP students to complete the conversion. Discuss
these algebraic conversions, and expect students to complete them. This work
provides an excellent opportunity for students to see that logarithms can be
very useful.
The basic rules of logarithms and powers are shown in this table. Students
may need to be reminded of these rules.
Logarithms                              Powers
$\log_b(mn) = \log_b m + \log_b n$      $a^m a^n = a^{m+n}$
$\log_b m^n = n \log_b m$               $(a^m)^n = a^{mn}$
$\log_b b^m = m$                        $a^{\log_a m} = m$
Most commonly, scientists use natural logarithms, which have a base of e and are
denoted by the symbol ln. The number e and functions of e are accessible on any
graphing calculator.
To understand why a log transformation straightens points that follow an
exponential pattern, begin with the exponential model
$y = ab^x$
Take the log of both sides:
$\ln y = \ln(ab^x)$
$\ln y = \ln a + \ln b^x$
$\ln y = \ln a + x \ln b$
You now have an equation that is linear in x, with slope ln b and intercept ln a.
So a plot of ln y versus x will be linear. The value of ln b can be exponentiated
to find an estimate of b. Students should also understand that $b = 1 + (\text{growth rate})$
for exponential growth and $b = 1 - (\text{decay rate})$ for exponential decay.
In the exponential growth example on pages 182–183, notice that in the
computation, $e^{0.0148} \approx 1.015$. Whenever you are computing $e^x$, where x is
close to 0, it always will be the case that $e^x \approx 1 + x$. As another example, in E64,
students will find that $e^{-0.05645} \approx 1 + (-0.055) = 0.945$, so the decay rate is 5.5%.
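The algebra above can be tried on the sample Copper Flippers results from Activity 3.5a (tosses 1 through 6 with 91, 59, 27, 9, 7, and 4 heads). This is a sketch only: the helper `least_squares` is an illustrative name, and a calculator or statistics package would normally do the fit.

```python
import math

# Sample results from Activity 3.5a: toss number and number of heads remaining.
tosses = [1, 2, 3, 4, 5, 6]
heads = [91, 59, 27, 9, 7, 4]

def least_squares(xs, ys):
    """Return (intercept, slope) of the least squares line of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Fit ln(heads) = ln a + (ln b) * toss, then exponentiate to recover a and b.
log_heads = [math.log(h) for h in heads]
ln_a, ln_b = least_squares(tosses, log_heads)
a = math.exp(ln_a)
b = math.exp(ln_b)
decay_rate = 1 - b  # b = 1 - (decay rate)

print(f"heads is roughly {a:.0f} * ({b:.3f})^toss; decay rate about {decay_rate:.0%}")

# For x close to 0, e^x is close to 1 + x.
print(math.exp(0.0148), 1 + 0.0148)
```

With fair coins the estimated decay rate should land near one-half, since about half of the pennies are removed on each toss.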
Lesson Notes: Log-Log Transformations
of Power Functions
In this section students should recognize the form of a power model and the
method of using a log-log transformation to linearize this curve, as shown in
the box on page 188. It is virtually impossible to distinguish between a set of
data best fit by an exponential model and one best fit by a power model just by
observing the pattern in a scatterplot. The investigator should think, first, about
which makes more sense in the context of the problem. Growths of populations
are more logically exponential while relationships between weight and height are
more logically ones of power.
The Algebra of Log-Log Transformations
If we take the log of both sides of a power function $y = ax^b$, we end up with
$\log y = \log a + b \log x$
Again this is linear, but the independent variable here is the log of our original
independent variable, x. And the new dependent variable is the log of the original
dependent variable, y. The slope is b and the intercept is log a.
Here are two examples.
1. To solve $\log y = -0.484 + 2.10 \log x$ for y, exponentiate both sides and use
the rules in the table on page 137.
$10^{\log y} = 10^{-0.484 + 2.10 \log x}$
$y = 10^{-0.484} \cdot 10^{2.10 \log x}$    using the 1st and 3rd power rules
$y = 0.328 \cdot 10^{\log x^{2.10}}$    using the 2nd log rule and calculating $10^{-0.484}$
$y = 0.328\,x^{2.10}$    using the 3rd power rule
2. Solve $\ln y = -25.61 + 0.0151x$ for y.
$e^{\ln y} = e^{-25.61 + 0.0151x}$    exponentiating both sides
$y = e^{-25.61} \cdot e^{0.0151x}$    using the 1st and 3rd power rules
$y = e^{-25.61}(1.015)^x$    using the 2nd power rule and calculating $e^{0.0151}$
Note: $e^{-25.61}$ is a constant.
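Both conversions can be checked with a few lines of code. The two function names below are hypothetical helpers written for this sketch; they simply exponentiate the fitted coefficients using the rules of logarithms and powers.

```python
import math

def power_model_from_loglog(intercept, slope, base=10.0):
    """Convert log(y) = intercept + slope * log(x) to y = a * x**b; return (a, b)."""
    return base ** intercept, slope

def exponential_model_from_log(intercept, slope, base=math.e):
    """Convert log(y) = intercept + slope * x to y = a * b**x; return (a, b)."""
    return base ** intercept, base ** slope

# Example 1: log y = -0.484 + 2.10 log x  (base 10)
a1, b1 = power_model_from_loglog(-0.484, 2.10)
print(f"y = {a1:.3f} * x^{b1}")

# Example 2: ln y = -25.61 + 0.0151 x  (base e)
a2, b2 = exponential_model_from_log(-25.61, 0.0151)
print(f"y = e^(-25.61) * ({b2:.3f})^x")

# Sanity check at an arbitrary x: both forms of each model should agree.
x = 5.0
lhs = math.log10(a1 * x ** b1)
rhs = -0.484 + 2.10 * math.log10(x)
```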
Lesson Notes: Power Transformations
Power transformations are appropriate when your data follow a power model
of the form $y = ax^b$. However, a power transformation (as well as any other
transformation) should be employed only when it is consistent with the
physical situation in the problem. In thinking through the physical situation,
sometimes a simple power transformation such as a square or square root is
all that is needed.
Unfortunately, some students will keep trying transformations until they get
a “good” fit regardless of its appropriateness. A correlation r close to ±1 does not
imply a “good” fit automatically. For example, in the study of tree diameter versus
age, the correlation is 0.89, but the scatterplot shows enough curvature in the
data to make the straight line a rather poor model.
Many graphing calculators have spreadsheet capabilities that make the
transformation of data quite easy. For example, on a TI-83 Plus or TI-84 Plus,
you can load the values of x in list L1, the values of y in list L2, and then, say, define
$L_3 = (L_2)^2$. A linear regression can then be done on (x, y) and $(x, y^2)$ and appropriate
comparisons made.
Notes on Software
Some software packages will show the regression as log y = . . . . Graphing
calculators generally do not. That is, if students are doing a regression on
(log x, log y), the calculator will return the coefficients of $y = a + bx$. It is up
to the students to realize that they must use the appropriate meaning of those
calculated coefficients, that is, $\log y = a + b \log x$.
Notes for the AP Teacher
Modeling Good Answers
Once students have completed E65, show them the model answer as an example
of what is expected on the AP Exam. A PDF file containing the questions and the
model answers is available at www.keypress.com/keyonline.
Solutions
Discussion
D32. If the chance of “living” was 0.6, then the rate of
decay (death rate) would be 0.4 and the line would
drop less steeply.
D33. The residuals indicate a lot of scatter around the
trend line for the years up to 1850 and then a
trend that increases faster than the model would
indicate, up to about 1920, followed by an increase
slower than the model would indicate. This same
pattern can be seen in the original scatterplot if
you look closely.
Flight Length    ln(flight length)
3960             8.28
904              6.81
175              5.16
plane) is removed, the data show a strong, positive
linear trend and the residuals have no trend.
Fitting a new line to the reduced data set will
show that the outlier has some influence on
the slope of the regression line and even more
influence on the correlation.
D39. It doesn’t make sense to say that speed causes
flight length. The reason for the positive
association is the lurking variable of the size of
the plane. In general, larger planes go faster and
also have longer average flight lengths. In fact,
the larger and faster planes tend to be deployed
on longer flights intentionally; that is a
management decision.
D40. The graph of the power function is shown
below. The graph does not increase as quickly
as the exponential function, once x gets
above 3.21.
[Graph: the power function $y = x^2$ for x from 0 to 12]
D34. In 1803, the Louisiana Purchase doubled the
size of the United States but added relatively
few people. Thus, the population density
dropped dramatically below what any model
would have predicted. Similarly, the drop in
density from 1840 to 1850 may be accounted for
by the addition of much of the Southwest to the
United States.
D35. a. The overall trend is curved upward in the
original data, so the correlation coefficient is
not a useful measure of the strength of the
relationship.
b. These are time series data, with one
observation per time period. In time series data,
each observation is highly correlated with the
preceding observation (things cannot change
very much from one time period to the next), so
it is not surprising to see points closely clustered
about the linear trend in the transformed data. Of
much more interest is the nonlinear component of
the overall trend and the fluctuations around this
trend, as revealed in the residuals.
D36. From the equation $\hat{y} = e^{-25.118}(1.0148)^x$, you can
see that the growth rate is estimated to be about
1.5% per year. For a 10-year period, think of going
from x = 0 to x = 10. If at year 0 the population is
$\hat{y} = a$, at year 10 it will be $\hat{y} = a(1.015)^{10} = a(1.161)$,
for a 16% gain. Make sure students
observe that the 10-year growth rate is not
(10)(1.5%) = 15%; it is greater than that because
of the compounding.
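The compounding in D36 takes one line to confirm. A minimal sketch, using the rounded annual factor 1.015 from the solution (the variable names are illustrative):

```python
annual_factor = 1.015          # growth of about 1.5% per year
years = 10

# Compounded growth multiplies the factor by itself once per year.
ten_year_factor = annual_factor ** years
compounded_gain = ten_year_factor - 1

# The tempting but incorrect shortcut adds the yearly rates instead.
naive_gain = years * (annual_factor - 1)

print(f"compounded: {compounded_gain:.1%}, naive: {naive_gain:.1%}")
```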
D37. Some of the flight lengths are in the low hundreds,
whereas others are in the mid-thousands. A
jump from 100 to 1000 is a jump of one order
of magnitude. Note the effect that the log
transformation has on these three flight lengths,
which were picked because they are the longest,
median, and shortest. The logs are more evenly
spread out than the original flight lengths.
[Graph: $y = x^2$ and $y = 2^x$ plotted together for x from 0 to 12]
The graph above shows the power model and the
exponential model together.
Both functions increase at an increasing
rate. However, the growth of the exponential
model eventually dwarfs that of the power
model. This is true in general. Exponential
growth models (b > 1) will eventually outgrow
any power model.
D38. Both the scatterplot and the residual plot show an
outlier on the speed variable. The other residuals
have a linear trend. Once this outlier (the slowest
D41. We’ll calculate the weight of a 100-inch alligator
using the base 10 log and compare to the example
on page 189 where the base e log was used. The
table shows the lengths and weights, along with
the base 10 log of each.
Length    Weight    log(length)    log(weight)
94        130       1.97313        2.11394
74        51        1.86923        1.70757
147       640       2.16732        2.80618
58        28        1.76343        1.44716
86        80        1.9345         1.90309
94        110       1.97313        2.04139
63        33        1.79934        1.51851
86        90        1.9345         1.95424
69        36        1.83885        1.5563
72        38        1.85733        1.57978
128       366       2.10721        2.56348
85        84        1.92942        1.92428
82        80        1.91381        1.90309
86        83        1.9345         1.91908
88        70        1.94448        1.8451
72        61        1.85733        1.78533
74        54        1.86923        1.73239
61        44        1.78533        1.64345
90        106       1.95424        2.02531
89        84        1.94939        1.92428
68        39        1.83251        1.59106
76        42        1.88081        1.62325
114       197       2.0569         2.29447
90        102       1.95424        2.0086
78        57        1.89209        1.75587
Here is the scatterplot of log(weight) versus log(length).
[Scatterplot: log(weight) versus log(length)]
The prediction is $\log(\text{weight}) = 3.286 \log(100) - 4.42 = 2.152$.
The predicted weight is found by solving
$\log(\text{weight}) = 2.152$
$\text{weight} = 10^{2.152} = 141.9$ pounds
The small difference from the value 141.3
reported in the text is due to rounding error.
A small rounding error in an exponent can have a
rather large effect on the result. Calculating with
rounded numbers is always risky, especially when
one of those numbers is an exponent.
The properties of logarithms explain why this
works. If you have an equation such as
$\ln y = \ln x + a$
and you want to rewrite it in base 10, first
exponentiate each side and simplify.
$e^{\ln y} = e^{\ln x + a}$
$e^{\ln y} = e^{\ln x} e^{a}$
$y = x e^{a}$
Now take the log base 10 of each side.
$\log y = \log(x e^{a})$
$\log y = \log x + \log e^{a}$
Note that $e^a$ is a constant. In the alligator example,
the constant in the first regression equation was
$-10.2$. The equation above says to calculate
$\log(e^{-10.2}) = -4.43$. This value is very close to
the constant in the regression equation using
base 10 logs, or $-4.42$. Again, rounding error
has caused a small difference in the last
decimal place.
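The base change in D41 is easy to confirm numerically. The sketch below uses the rounded coefficients quoted in this solution (slope 3.286, base 10 constant −4.42, base e constant −10.2); the function and variable names are illustrative. Because the constants are rounded, the two predictions differ by a few pounds, which is the rounding sensitivity the solution warns about.

```python
import math

SLOPE = 3.286               # the same in either base
INTERCEPT_BASE_10 = -4.42   # rounded constant for the base 10 fit
INTERCEPT_BASE_E = -10.2    # rounded constant for the natural log fit

def predict_weight_base10(length):
    """Predict weight from log10(weight) = SLOPE * log10(length) + constant."""
    return 10 ** (SLOPE * math.log10(length) + INTERCEPT_BASE_10)

def predict_weight_base_e(length):
    """Predict weight from ln(weight) = SLOPE * ln(length) + constant."""
    return math.exp(SLOPE * math.log(length) + INTERCEPT_BASE_E)

# Converting the base e constant to base 10: log10(e^a) = a * log10(e).
converted_intercept = INTERCEPT_BASE_E * math.log10(math.e)

weight_10 = predict_weight_base10(100)
weight_e = predict_weight_base_e(100)
print(f"base 10: {weight_10:.1f} lb, base e: {weight_e:.1f} lb")
print(f"converted constant: {converted_intercept:.2f}")
```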
D42. Perhaps the cross-sectional area is growing at
a constant rate across time. This area would be
proportional to the square of the diameter.
D43. The diameters at the left end of the scatterplot
show less variation than those at the right end.
Thus, you would be able to get a more precise
prediction of the diameter for 10-year-old trees
than you would for 40-year-old trees. This should
seem intuitively reasonable; older trees would
naturally have more variability in size than
younger trees. (The same is true of children.)
Practice
P26. a. This plot shows the data for Dying Dice.
P27. The Florida population shows a definite
nonlinear trend that could represent exponential
growth.
[Scatterplot: Population versus Roll Number]
[Scatterplot: Population versus Year]
b. The transformed data show a nice linear trend.
a. The log transformation transforms the pattern
to a linear one that can be summarized by a
straight line.
[Scatterplot: ln(population) versus Roll Number]
[Scatterplot: ln(Population) versus Year]
c. The equation of the line in part b is
$\widehat{\ln y} = 5.22 - 0.435x$
or
$\ln(\text{pop}) = 5.22 - 0.435 \cdot \text{roll number}$
Because $e^{-0.435} = 0.647$, the rate of dying is
estimated to be $1 - 0.647$, or about 0.353, or
35%, per time period. This rate is close to the
theoretical probability of dying, set up to be 1/3.
d. The residual plot shows some curvature,
indicating a death rate of a little more than 0.35
in the early stages and a little less than 0.35 in the
later stages. This kind of pattern would not be
unusual in data on real animals.
0.30
0.20
Residual
0.10
b. The equation of the line is
ln(pop) 54.9342 0.03583 year
so that
pop e54.9382(e0.03583)year e54.9382 (1.036)year
for a growth rate of 3.6% per year, which is a high
rate of growth.
c. The residual plot shows a pattern. Florida grew
less rapidly than the model predicts up to about
1845, then grew more rapidly than predicted, then
less, then more. A big jump in growth occurred
between 1950 and 1960. Then in 2000, there was a
big drop in growth.
0
Residual
–0.10
–0.20
0 2 4 6 8 10 12
Roll Number
17
16
15
14
13
12
11
10
0.15
0.10
0.05
0.00
–0.05
–0.10
–0.15
–0.20
1820 1860 1900 1940 1980 2020
Year
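The step from a fitted slope to a rate of change per period, used in both P26 and P27, can be sketched as follows. This is only an illustration under the assumption that the model has the form ln(y) = a + b·x; the helper name is hypothetical.

```python
import math

def rate_per_period(slope: float) -> float:
    """Proportional change in y per unit of x implied by ln(y) = a + slope * x."""
    return math.exp(slope) - 1.0

# Slopes quoted in the solutions above.
dice_rate = rate_per_period(-0.435)       # P26: dying dice
florida_rate = rate_per_period(0.03583)   # P27: Florida population

print(f"{dice_rate:.3f}")      # negative value: a decay of about 35% per roll
print(f"{florida_rate:.3f}")   # positive value: growth of about 3.6% per year
```

A negative slope gives a multiplier below 1 (decay), and a positive slope gives a multiplier above 1 (growth).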
P28.
x     y      log y
2     1000   3
1     100    2
0     10     1
−1    1      0
The plot does indeed show a straight line.
[Scatterplot: log y versus x]
The equation of this line is log y = 1 + x, so the slope is 1 and the y-intercept is 1.
P29. The tables and plots are shown here.
a.
xa    ya     log ya
6     1000   3
4     100    2
2     10     1
0     1      0
[Scatterplot: log ya versus xa]
The equation of the line is log ya = 0 + 0.5xa, so the slope is 0.5 and the y-intercept is 0.
b.
xb    yb       log yb
5     0.0001   −4
6     0.01     −2
7     1        0
8     100      2
[Scatterplot: log yb versus xb]
The equation of the line is log yb = −14 + 2xb, so the slope is 2 and the y-intercept is −14.
P30. If log y = c + dx, then by rules of logarithms, y = (10^c)(10)^(dx) = (10^c)(10^d)^x = ab^x, where a = 10^c and b = 10^d.
For P28, y = 10(10)^x.
For P29a, y = 1(10^0.5)^x ≈ 3.16^x.
For P29b, y = 10^(−14) (10^2)^x = 10^(−14) · 100^x.
P31. Using a ln(flight length) transformation gives the following printout, which agrees with the first part of Display 3.98 in the student book.
The regression equation is
ln(FlightLength) = 1.57 + 0.0120 speed
Predictor   Coef        Stdev       t-ratio   p
Constant    1.5730      0.3281      4.79      0.000
Speed       0.0119958   0.0007396   16.22     0.000
s = 0.2691   R–sq = 89.5%   R–sq(adj) = 89.1%
Analysis of Variance
SOURCE       DF   SS       MS       F        p
Regression   1    19.058   19.058   263.09   0.000
Error        31   2.246    0.072
Total        32   21.304
The answer is, then:
ln(flight length) = 1.57 + 0.012 · speed
flight length = e^1.57 · e^(0.012 · speed) = 4.807(1.012)^speed
P32. A plot of the data shows a cluster at the left with little trend and three points at the right that, as a group, are potentially influential. The points in the residual plot show curvature.
[Scatterplot and residual plot: Consumption (g) versus Trips]
Taking the natural log of the consumption
straightens the residual plot. However, these
transformed points don’t form an elliptical
cloud. Some fishers’ families eat essentially
no fish, even if the person fishes as many as
11 times a month.
[Scatterplot and residual plot: ln(Consumption) versus Trips]
The equation of the regression line is
ln(consumption) 0.39 0.143 trips with a
correlation of about 0.69.
As it happens, removing the three points to the
right has a small effect on the slope, increasing
it from 0.143 to 0.177. However, the correlation
drops considerably, from 0.69 to 0.37, with the
regression line now passing between the two
remaining clusters of points.
P33. The prediction is ln(weight) = 3.29 ln(75) − 10.2 = 4.005. Solving,
ln(weight) = 4.005
weight = e^4.005 ≈ 54.9 pounds (or 54.8 with no rounding)
P34. a. There is a strong positive curved relationship between depth and velocity. The curve is concave down with the rate of change in velocity decreasing as the depth increases.
b. The curve is not exponential, but could be a power function with a power less than 1. Taking the log of both variables results in this plot, shown with its residual plot. A reasonable model is ln(velocity) = 0.146 + 0.175 ln(depth). Solving for velocity gives velocity = 1.157 · depth^0.175.
[Scatterplot and residual plot (Tidal Velocity): ln(velocity) versus ln(depth); lnVelocity = 0.175lnDepth + 0.15; r^2 = 0.89]
P35. a. i.
[Scatterplot: ya versus xa]
ii. The y-scale must be shrunk more for larger values of x than for smaller values of x. The cube root transformation will straighten them.
iii.
[Scatterplot: ya^(1/3) versus xa]
b. i.
[Scatterplot: yb versus xb]
ii. The y-scale must be shrunk more for smaller values of x than for larger values of x. The reciprocal transformation (power of −1) will work.
iii.
[Scatterplot: yb^(−1) versus xb]
c. i.
[Scatterplot: yc versus xc]
ii. The y-scale must be expanded more for larger values of x than for smaller values of x. A square transformation (power of 2) will straighten the points.
iii.
[Scatterplot: yc^2 versus xc]
P36. For the points in P35a, the x-scale must be expanded. A cubic power transformation will straighten this relationship.
[Scatterplot: ya versus xa^3]
For the points in P35b, the x-scale must be shrunk a bit. A reciprocal transformation (power of −1) will do the trick.
[Scatterplot: yb versus xb^(−1)]
For the points in P35c, again, the x-scale must shrink. This time a square root transformation (power of 0.5) will straighten the points.
[Scatterplot: yc versus xc^0.5]
The exponents in P36 are the reciprocals of the exponents in P35.
P37. a. The area is proportional to the square of the radius (y = πx^2), so y would have to be raised to the power 1/2 (square root).
b. The volume is proportional to the cube of a side (y = x^3), so y would have to be raised to the power of 1/3 (cube root).
c. The volume is proportional to the square of the diameter (y = 8π(x/2)^2 or y = 2πx^2), so y would have to be raised to the power 1/2 (square root).
P38. The following plots show the diameter plotted against the square root of age and the residuals from the regression line. The regression analysis is also provided. This transformation results in a scatterplot with less of a fan shape than the one for diameter^2 versus age. If you want to predict diameters along the full range of ages, this transformation will allow more even precision in the predictions.
[Scatterplot and residual plot: Diameter versus Sqrt(age); Diameter = 1.47 · Sqrt(age) − 1.86; r^2 = 0.83]
Dependent variable is: Diameter
No Selector
R squared = 83.2%   R squared (adjusted) = 82.5%
s = 0.911 with 27 − 2 = 25 degrees of freedom
Source       Sum of Squares   df   Mean Square   F-ratio
Regression   102.855          1    102.855       124
Residual     20.7969          25   0.831876
Variable   Coefficient   s.e. of Coeff   t-ratio   prob
Constant   −1.85727      0.6175          −3.01     0.0059
√Age       1.46516       0.1318          11.1      ≤ 0.0001
P39. a. The plot of the data suggests that you must expand the y-scale or shrink the x-scale, so the power transformation on x will have to be a power less than 1. Students may suggest a square root transformation on y because it has been successful in the past.
b. The log-log transformation yields a nearly linear plot.
[Scatterplot and residual plot: log(brain weight) versus log(body weight)]
The regression equation is
logBrain = 0.908 + 0.760 logBody
Predictor   Coef      Stdev     t-ratio   p
Constant    0.90754   0.04967   18.27     0.000
logBody     0.76020   0.03162   24.04     0.000
s = 0.3156   R-sq = 92.6%   R-sq(adj) = 92.5%
Analysis of Variance
SOURCE       DF   SS       MS       F        p
Regression   1    57.577   57.577   577.96   0.000
Error        46   4.583    0.100
Total        47   62.159
c. The regression equation is log(brain) = 0.908 + 0.76 log(body), or brain = 8.10(body)^0.76 (or 8.08 with no rounding).
The slope of the line, 0.76, agrees with the insight that the x-scale must be transformed by a power less than 1.
Exercises
E55. The plot of the square root of flight length versus speed still retains obvious curvature; this transformation is less satisfactory than the log transformation.
[Scatterplot: Sqrt(flight length) versus Speed (mi/h)]
E56. The plots of the data and the regression analysis are shown next.
[Scatterplot: Population versus Roll Number]
[Scatterplot and residual plot: ln(population) versus Roll Number]
The equation of the regression line is ln ŷ = 5.142 − 0.885x. Because e^(−0.885) ≈ 0.413, the estimated rate of decay is 1 − 0.413 = 0.587. The curved, V-shaped residual plot shows that the rate of decay is greater than the estimated value during the first and last time periods and less than the estimated value over the middle time periods.
E57. The plot of the data shows curvature. Although
the log-log transformation helps, it does not
remove the curvature. (Neither will any other
power transformation.) The residual plots suggest
dividing the data into two groups, as the trends are
more linear within each group. So a good way to
model these data is to split them into two groups
(perhaps with ages 2 through 8 in one group and
9 through 14 in the other). Then although the
points in each group still have some curvature, the
residuals are much smaller. The younger group
has a weight gain per inch of height that is lower
than the overall average, whereas the older group
has a weight gain per inch of height that is higher
than the average.
[Scatterplot and residual plot (Median Heights): Weight (lb) versus Height (in.); Weight = 2.76Height − 80; r^2 = 0.96]
[Scatterplot and residual plot: ln(weight) versus ln(height)]
Note on E58: Using the Fathom data file makes this question much easier. On the plot of cost/seat/mile versus flight length, highlight the points for one group. The same subjects will be highlighted on the other graphs and this question is more easily answered.
E58. The group on the right in the plot of cost/seat/mile versus flight length also happens to be the larger planes, judging from the numbers of seats, and they use more fuel and are the planes with the highest flight speeds. (Refer to the scatterplot matrix of airline data for E8 on page 91 of this Instructor’s Guide.)
E59. a. There is a strong positive relationship between size of the hunting party and success rates. The plot looks fairly linear, but a look at the residual plot makes it apparent that there is some curvature in the data.
[Scatterplot and residual plot: Percent Successful versus Number of Chimps]
b. A line is not a bad fit here and would predict reasonably well for parties of 16 or fewer chimps because the residuals are small. However, a line is not the most appropriate model and would probably not work as well to predict for parties much more than 16 chimps strong.
c. A log-log transformation works pretty well. The model would be ln(percent) = 0.524 ln(chimps) + 2.9575, or percent = 19.25 · chimps^0.524.
[Scatterplot and residual plot: ln(percent) versus ln(chimps)]
Note that an exponential model fit using a log transformation does exactly the wrong thing, bending the plot in the wrong direction.
[Scatterplot and residual plot: ln(percent) versus Chimps]
d. The residual plot shown in part c shows a random scatter, which is good. There is more spread for the smaller hunting parties than for the larger ones, so a transformation that reduces this would be better.
E60. The data show a curved pattern of growth over time, which could perhaps be modeled as exponential growth as we often hear that population is growing exponentially.
[Scatterplot: U.S. Population versus Year]
a. The next plot shows the growth in population for each decade. The increase in population wasn’t constant from decade to decade but increased in a linear way. When change in population grows linearly, the population grows quadratically. Thus, we can predict that an exponential model isn’t appropriate and that taking the square root of each population will linearize the original scatterplot.
[Scatterplot: Population Growth versus Year]
b. Taking the log of the population overcompensates for the curve in the original data; the growth is not really exponential growth. A better transformation is the square root of the population. The plot of these data, along with the regression analysis and residual plot, is presented next. The residuals still have some pattern, as is expected for time series data, but it is not very pronounced. The regression equation is √population = −138840 + 77.7 · year.
[Scatterplot and residual plot: Sqrt(U.S. population) versus Year; Sqrt(Pop) = 77.7 Year − 138800; r^2 = 1.00]
c. The pattern in the immigration data is quite cyclical, which is another common time series pattern. No simple power transformation will straighten this out. There is more than one “bend” in the data; power transformations only work well for a single bend.
[Scatterplot: Immigration (thousands) versus Year]
E61. The scatterplot of the original data appears here.
[Scatterplot: Price ($) versus Days in Advance]
Because you would expect the price to go up
as flight time gets nearer, a reciprocal
transformation of the price (or 1/price) might
linearize data such as these. Actually, this
transformation does a good job. The plot and
residual plot are shown here.
[Scatterplot and residual plot: Reciprocal of Price versus Days in Advance]
The regression equation is 1/price = 0.00541 + 0.000039 · days. If you solve for price, you get
price = 1 / (0.00541 + 0.000039 · days)
That is, the number of days affects the price very
little according to this model. (As students will
learn later, the slope of the regression equation is
not significantly different from 0.)
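To see how little the number of days moves the predicted price under the reciprocal model, the fitted equation can be evaluated at a few values. This is a rough sketch that assumes both coefficients enter with a plus sign (1/price = 0.00541 + 0.000039 · days), which is the sign pattern consistent with prices rising as the flight nears; the function name is hypothetical.

```python
def predicted_price(days: float) -> float:
    """Back-transform the reciprocal model 1/price = 0.00541 + 0.000039 * days."""
    return 1.0 / (0.00541 + 0.000039 * days)

# Evaluate across the range of the horizontal axis (0 to 80 days in advance).
for days in (0, 10, 30, 80):
    print(days, round(predicted_price(days), 2))
```

Because the slope is tiny relative to the intercept, the denominator changes slowly and so does the predicted price.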
You get a linear relationship with some negative
trend by plotting (ln(days), ln(price)). First,
substitute 1/2 day for 0 days before taking logs.
This regression equation is ln(price) = 5.65 − 0.166 ln(days). The plot and residual plot are shown
next. If you solve the equation for the price, you get
price = 284.29 · days^(−0.166). Notice that for the range
10 days to 30 days, when most of the purchases
were made, the range of prices is only from $161.64
to $193.98. Once again, days accounts for little of
the variation in price. (And, again, the slope of the
regression line is not statistically significant.)
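The back-transformation of the log-log fit and the two prices quoted above can be reproduced with a few lines. This is a minimal sketch using only the coefficients given in the text; the function name is an assumption made for illustration.

```python
import math

# Coefficients of the fit ln(price) = 5.65 - 0.166 * ln(days), from the text.
INTERCEPT = 5.65
SLOPE = -0.166

def price_from_days(days: float) -> float:
    """Power model obtained by exponentiating both sides of the log-log fit."""
    return 284.29 * days ** SLOPE

# The multiplier 284.29 is e raised to the intercept.
print(round(math.exp(INTERCEPT), 2))

# Predicted prices at the ends of the 10-to-30-day range.
print(round(price_from_days(10), 2))
print(round(price_from_days(30), 2))
```

The same pattern (exponentiate the intercept, keep the slope as the power) applies to the other log-log fits in this section.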
[Scatterplot and residual plot: ln(price) versus ln(days)]
Here’s the reason no transformation will give us a good model of predicting price from day: Five of the passengers paid a lot more for their tickets than did the other passengers, and they bought them 3, 4, 8, 9, and 9 days before the flight. (See the previous scatterplot.) But other passengers bought their tickets even closer to flight time and paid just about the same as passengers who bought their tickets months before. If the five passengers who paid extremely high prices are left out, the relationship is reasonably linear but flat. The correlation between days and price for the remaining passengers is 0.034, or practically nonexistent. Thus, the best model is to say that there is no relationship between the day these passengers bought their tickets and the price they paid, with the exception of five passengers who bought their tickets within 9 days of the flight and who paid more than double any other passenger.
E62. a. It appears that brain oxygen versus body mass could be modeled by exponential decay, but a quick check will show that a log transformation does little to straighten the plot. The log-log transformation does well, once again.
[Scatterplot: Oxygen Use in Brain versus Body Mass (kg)]
[Scatterplot and residual plot: ln(brain oxygen) versus ln(body mass)]
The equation of this regression line is
ln(brain oxygen) = 3.26 − 0.07 ln(body mass)
which implies that ln(brain oxygen) decreases, on the average, by 0.07 units for every 1 unit increase in ln(body mass).
b. Using the log-log transformation, the relationship between lung oxygen consumption and body mass has a similar linear trend except for a couple of stray points.
[Scatterplot and residual plot: ln(lung oxygen) versus ln(body mass)]
The equation of this line is
ln(lung oxygen) = 1.976 − 0.0951 ln(body mass)
which implies that ln(lung oxygen) decreases, on the average, by 0.0951 units for every 1 unit increase in ln(body mass).
c. If this theory is true and oxygen consumption depends on the relative size of the organ, then the lung oxygen consumption should decrease less rapidly than the brain oxygen consumption. But the data show that the lung oxygen consumption decreases more rapidly than that of the brain. There must be another explanation as to why the brain seems to use more oxygen, relative to its size, than do other organs.
E63. a. The scatterplot of these data shows a marked decrease in the birthrate as the GNP increases. The relationship is nonlinear but does not look like exponential decay.
[Scatterplot: GNP versus Birthrate]
b. The log transformation works quite well here and gives a plot that seems appropriate for a regression line. The residual plot looks rather like random scatter and further supports this choice of a statistical model.
[Scatterplot and residual plot: log(GNP) versus Birthrate; log(GNP) = −0.0674 Birthrate + 1.87; r^2 = 0.60]
Dependent variable is: logGNP
No Selector
R squared = 59.7%   R squared (adjusted) = 58.0%
s = 0.4383 with 25 − 2 = 23 degrees of freedom
Source       Sum of Squares   df   Mean Square   F-ratio
Regression   6.54964          1    6.54964       34.1
Residual     4.41913          23   0.192136
Variable    Coefficient   s.e. of Coeff   t-ratio   prob
Constant    1.86903       0.2276          8.21      ≤ 0.0001
Birthrate   −0.067360     0.0115          −5.84     ≤ 0.0001
The regression equation is
log(GNP) = 1.87 − 0.0674 · birthrate
or
GNP = 10^1.87 (10^(−0.0674 · birthrate)) = 74.13 · 0.856^birthrate
To interpret the slope and intercept of the model we must use the linear version on the log scale. log(GNP) decreases, on the average, 0.0674 units for every 1 unit increase in birthrate.
E64. a. The decreasing trend has a slight curvature, especially toward the later years, so perhaps a log transformation (exponential decay) will work. This would make the interpretation of the results quite easy.
[Scatterplot: 18+ versus Year]
[Scatterplot and residual plot: ln(18+) versus Year]
The residual plot still has a good deal of curvature. Another look at the original plot shows that the curve seems to be approaching 21, not 0, as an exponential decay function would.
An exponential model could be used by subtracting 21 from the percentage before taking the log. This graph shows the results.
[Scatterplot and residual plot: ln(18+ − 21) versus Year]
This exponential decay model fits well and shows that the level of smoking above 21% is decreasing at a rate of about 5.5% per year because e^(−0.05645) ≈ 0.945. (Remember, this rate of decrease is a percent of a percent because the original measurements are percentages.)
b. This plot poses a difficulty because the trend changes abruptly about 1991. One equation will not work.
[Scatterplot: 18–24 versus Year]
The rate of smoking seems to decrease linearly until about 1990, then it begins increasing linearly. These plots show lines for the two parts of the plot separately.
[Scatterplots and residual plots: 18–24 versus Year, for 1965 to 1995 and for 1990 to 2002]
The regression equation for the first plot is percentage = 2311 − 1.1492 · year, and for the second plot it is percentage = −872.00 + 0.514 · year.
c. The pattern of decrease for the 65 and older category is much more linear; in fact, the log transformation will make things worse instead of better.
[Scatterplot: 65+ versus Year]
The regression equation is percentage = 1062 − 0.5255 · year.
Predictor   Coef       Stdev     t–ratio   p
Constant    1061.83    53.06     20.01     0.000
Year        −0.52550   0.02666   −19.71    0.000
s = 1.100   R–sq = 95.8%   R–sq(adj) = 95.6%
Analysis of Variance
SOURCE       DF   SS       MS       F        p
Regression   1    470.16   470.16   388.57   0.000
Error        17   20.57    1.21
Total        18   490.73
This nearly constant (linear) rate of decrease amounts to about half a percentage point per year.
E65. a. See the plots below. The amount of CO2 is definitely increasing over the years, and the upward curvature makes it reasonable to suspect exponential growth. But note that the log transformation is not much help here.
[Scatterplot: CO2 versus Year]
[Scatterplot: ln(CO2) versus Year]
b. The fitted line appears in the first plot in part a. The residual plot from the original data shows that CO2 increased at a rate lower than the overall average from 1967 to about 1994 and at a higher rate than the overall average before 1967 and after 1994.
[Residual plot: resid(CO2) versus Year]
c. The pattern of the residuals suggests an abrupt change around 1976. A better way to model these data might be to use two straight lines with different slopes, one line covering the period from 1959 to about 1976 and the other from about 1977 to 2003. The first two plots below show the regression line and residual plots for years up to 1976 and for years after 1976, respectively. The third plot shows the two lines on the original plot.
[Scatterplot and residual plot: CO2 versus Year, 1958 to 1978]
[Scatterplot and residual plot: CO2 versus Year, 1975 to 2005]
[Scatterplot: CO2 versus Year, 1955 to 2005, with both lines]
The regression equation for the first plot is CO2 = −1559.3 + 0.957 · year. For the second plot it is CO2 = −2776.7 + 1.573 · year.
Another possibility would be to recognize that although an exponential model has an asymptote at zero, the CO2 level in the atmosphere was never near 0. One estimate is that pre-industrial levels of CO2 in the atmosphere were around 250 ppm. We can adjust for this by taking the natural log of (CO2 level − 250).
[Scatterplot and residual plot: Adjusted ln(CO2) versus Year]
The regression equation for this plot is ln(CO2 − 250) = −25.281 + 0.0150 · year.
Once again, notice that the residual plots have an oscillating pattern typical of time series data.
d. The linear model gives an average increase of about 1.57 ppm CO2 per year for years after 1976. Using the exponential model with an asymptote at 250 ppm, the amount of CO2 in the atmosphere above 250 ppm is multiplied by e^0.0150 ≈ 1.015 each year, for a growth rate of about 1.5% per year.
E66. The plot of average SAT math score versus percentage taking exam shows a decreasing trend with a curvature. A log-log transformation straightens this out nicely, and the regression analysis of ln(SAT math score) versus ln(percentage taking exam) provides a good model for prediction, although the left end of this line is pulled down a bit by a couple of states that have both low percentages and relatively low scores.
[Scatterplot: Score versus Percent]
[Scatterplot and residual plot: ln(score) versus ln(percent)]
The complete regression analysis is shown here.
The regression equation is
lnScore = 6.46 − 0.0509 lnPercent
Predictor   Coef        Stdev      t–ratio   p
Constant    6.45586     0.01351    477.99    0.000
lnPercent   −0.050937   0.003967   −12.84    0.000
s = 0.02921   R–sq = 77.5%   R–sq(adj) = 77.0%
Analysis of Variance
SOURCE       DF   SS        MS        F        p
Regression   1    0.14072   0.14072   164.89   0.000
Error        48   0.04096   0.00085
Total        49   0.18168
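The exponential models with a nonzero asymptote used in E64 (smoking above 21%) and E65 (CO2 above 250 ppm) can be sketched as follows. This is an illustration only: the function name is hypothetical, and the CO2 intercept is taken as −25.281, the sign that gives levels near 330 ppm around 1980.

```python
import math

def shifted_exponential(year: float, intercept: float, slope: float,
                        asymptote: float) -> float:
    """Level predicted by ln(level - asymptote) = intercept + slope * year."""
    return asymptote + math.exp(intercept + slope * year)

# Yearly multipliers for the amount above the asymptote (slopes from the text).
smoking_multiplier = math.exp(-0.05645)   # E64: about 5.5% decrease per year
co2_multiplier = math.exp(0.0150)         # E65: about 1.5% increase per year
print(round(smoking_multiplier, 3), round(co2_multiplier, 3))

# Predicted CO2 level for 1980 under the assumed intercept of -25.281.
co2_1980 = shifted_exponential(1980, -25.281, 0.0150, 250.0)
print(round(co2_1980, 1))
```

Only the part of the level above the asymptote grows or decays exponentially, which is why the percentage rates apply to "the level above 21%" or "the amount above 250 ppm" rather than to the raw measurement.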
Chapter Summary
Homework
Essential         E67–70, E73, E81
Recommended       E74, E75, E76, E78, E80
Optional          E71, E72, E77, E79, E82–83
For AP Students   AP1–10
AP Exam Practice
As you worked through this chapter, you may have assigned relevant items from past AP Statistics Exams, which are listed in the table on page 144. As you review and assess the chapter, you may wish to assign additional items.
You can also use the Chapter 3 AP Practice Quiz or the Chapters 2–3 AP Practice Quiz on pages 39–42 of the Instructor’s Resource Book. Each quiz includes five multiple-choice items similar to those on AP Statistics Exams and one free-response question from the Statistics in Action student book. Once students have taken the quiz, go over the elements of a good answer. You might display the model answers to the free-response questions downloadable from www.keypress.com/keyonline.
Solutions
Review Exercises
E67. a. If Leonardo is correct, the points should lie near the lines:
arm span = height
kneeling height = (3/4) height
hand length = (1/9) height
Looking at the next plots, these rules appear to be approximately correct.
[Scatterplot: Arm Span (cm) versus Height (cm)]
[Scatterplot: Kneeling Height (cm) versus Height (cm)]
[Scatterplot: Hand Length (cm) versus Height (cm)]
The least squares regression equation for predicting the arm span from the height is ŷ = −5.81 + 1.03x.
The least squares regression equation for predicting the kneeling height from the height is ŷ = 2.19 + 0.73x.
The least squares regression equation for predicting the hand length from the height is ŷ = −2.97 + 0.12x.
b. If one student is 1 cm taller than another, their arm span tends to be 1.03 cm larger.
If one student is 1 cm taller than another, their kneeling height tends to be 0.73 cm larger.
If one student is 1 cm taller than another, their hand length tends to be 0.12 cm larger.
c. In each case, the points are packed tightly about the regression line and so there is a very strong correlation. The correlations are
arm span and height: 0.992 (strongest)
kneeling height and height: 0.989
hand length and height: 0.961 (weakest)
E68. a. This scatterplot shows no obvious association between temperature and number of distressed O-rings. Although the highest number of distressed O-rings was at the lowest temperature, the second highest number was at the highest temperature.
b. It is difficult to look at the scatterplot of the complete set of data, shown next, and not see that any risk is almost entirely at lower temperatures. The correlations are r = −0.263 for the incomplete set of data but r = −0.567 after all points are included, which is only moderately strong at any rate.
[Scatterplot: Number of O-Rings with Some Distress versus Temperature (°F)]
This is a tragic example of scientists and engineers not asking the right question: Do I have all of the data?
Background information: After each launch, the two rocket motors on the sides of the shuttle were recovered and inspected. Each rocket motor was made in four pieces, which were fit together with O-rings to seal the small spaces between them. The O-rings were 37.5 feet in diameter and 0.28 inches thick.
The Rogers Commission, which was appointed by President Reagan to find the cause of the accident, noted that the flights with zero incidents were left off the plot because it was felt that these flights did not contribute any information about the temperature effect. The Commission concluded that “A careful analysis of the flight history of O-ring performance would have revealed the correlation of O-ring damage in low temperature.” [Report of the Presidential Commission on the Space Shuttle Challenger Accident (Washington, D.C., 1986), page 148.]
E69. a. Yes, the student who scored 52 on the first exam and 83 on the second lies away from the general pattern. This student scored much higher on the second exam than would have been expected. This point will be influential because the value of x is extreme on the low side and the point lies away from the regression line. The point sticks out on the residual plot. There is a pattern in the rest of the points; they have a positive correlation.
b. The slope should increase, and the correlation should increase. In fact, the slope increases from 0.430 to 0.540, and the correlation increases from 0.756 to 0.814.
c. The residual plot appears next. The residuals now appear scattered, without any obvious pattern, so a linear model fits the points well when point (52, 83) is removed.
[Residual plot: Residual versus Exam 1 Without (52, 83)]
d. Yes, there is regression to the mean in any elliptical cloud of points whenever the correlation is not perfect. For example, the student who scored lowest on Exam 1 did much better on Exam 2. The highest scorer on Exam 1 was not the highest scorer on Exam 2. A line fit through this cloud of points does have slope less than 1.
E70. a. Use the formula
sy
b1 r __
s
x
to obtain
11.6 0.845
r 0.51 ____
7.0 Chapter 3 Summary
155
_ _
b. Using the fact that (x, y) is on the regression
line,
_
_
ŷ y b1(x x)
be made for a negative slope and a zero slope.
Alternatively, the truth of this statement can be
seen from the relationship
87.8 0.51(x 82.3)
sy
b1 r __
s
45.83 0.51x
x
E72. a. Correlations of 0 occur between A and B,
B and E, B and F, C and D, and E and F. The
correlation between D and E is 0.02. The pairs
with zero correlation all have scatterplots that are
symmetric around a center vertical line or center
horizontal line or both.
b. A and E, D and F, and A and C
c. A and D, A and F, B and C, and C and F
d. Student responses may include the following:
The scatterplot of A and B shows that there can be
a pattern in the points (a shape) even though the
correlation is zero. That the scatterplots of A and
D and of A and F have about the same correlation
also shows that the correlation does not tell
anything about their quite different shapes.
For your information, the complete correlation
matrix is shown here.
       A       B       C       D       E
B      0
C      0.447   0.258
D      0.224   0.129   0
E      0.875   0       0.091   0.018
F      0.258   0       0.289   0.577   0
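The point made in E72 part d, that a clear pattern can coexist with a correlation of zero, can be illustrated with a minimal sketch. The points below are made up (they are not the lettered plots from the exercise) and `pearson_r` is a hypothetical helper; the points lie exactly on a parabola that is symmetric about a vertical line.

```python
def pearson_r(xs, ys):
    """Correlation coefficient computed from sums of squared deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical points on y = x^2, symmetric about the vertical line x = 0.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_parabola = pearson_r(xs, ys)
print(f"r = {r_parabola:.3f}")
```

The positive and negative products of deviations cancel because of the symmetry, so the correlation says nothing about the curved shape.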
E73. a. True. Both measure how closely the points
cluster about the “center” of the data. For
univariate data, that center is the mean; for
bivariate data, the center is the regression line.
b. True. Refer to E72 for examples.
c. False. For example, picture an elliptical cloud
of points with major axis along the y-axis. The
correlation will be 0, but there will be a wide
variation in the values of y for any given x.
d. True. Intuitively, a positive slope means that as
x increases, y tends to increase. This is equivalent
to a positive correlation. Similar statements can
Because the standard deviations are always
positive, b1 and r must have the same sign.
E74. a. The correlation is quite high, about –0.83.
b. There are two clusters of points—one of states
with a small percentage of students taking the SAT
and one of states with more than 50% taking the
SAT. The second cluster has almost no correlation
and would have a relatively flat regression line,
whereas the first cluster has a strong negative
relationship. Combining these two clusters results
in summary statistics that do not adequately
describe either one of them.
c. A residual plot would be U-shaped with points
above zero on the left, below zero in the middle,
and above zero again on the right. The actual
residual plot is shown here.
[Residual plot: Residual versus Percentage Taking Exam]
E71. a. Each value should be matched with itself.
b. Match each value with itself, except match 0.5
(or –0.5) with 0: (–1.5, –1.5), (–0.5, –0.5), (0, 0),
(0, 0.5), (0.5, 0), (1.5, 1.5) for a correlation of 0.950.
c. (–1.5, 0), (–0.5, –0.5), (0, –1.5), (0, 1.5),
(0.5, 0.5), (1.5, 0) has a correlation of 0.1.
d. Match the biggest with the smallest, the next
biggest with the next smallest, etc.: (–1.5, 1.5),
(–0.5, 0.5), (0, 0), (0, 0), (0.5, –0.5), (1.5, –1.5).
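The pairings in E71 can be verified numerically. This sketch assumes the six values are –1.5, –0.5, 0, 0, 0.5, and 1.5, with the negative signs (lost in some printings) restored so that the stated correlations come out; `pearson_r` and the list names are hypothetical.

```python
def pearson_r(xs, ys):
    """Correlation coefficient computed from sums of squared deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

x = [-1.5, -0.5, 0, 0, 0.5, 1.5]

y_a = [-1.5, -0.5, 0, 0, 0.5, 1.5]    # a: each value matched with itself
y_b = [-1.5, -0.5, 0, 0.5, 0, 1.5]    # b: swap 0.5 with a 0
y_c = [0, -0.5, -1.5, 1.5, 0.5, 0]    # c: pairing with correlation near 0
y_d = [1.5, 0.5, 0, 0, -0.5, -1.5]    # d: biggest matched with smallest

r_a, r_b, r_c, r_d = (pearson_r(x, y) for y in (y_a, y_b, y_c, y_d))
print(round(r_a, 3), round(r_b, 3), round(r_c, 3), round(r_d, 3))
```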
Note on E72: Students should not have to actually
compute the correlations to answer these questions.
E75. a. Public universities that have the highest in-state
tuition also tend to be the universities with the
highest out-of-state tuition, and public universities
that have the lowest in-state tuition also tend to be
the universities with the lowest out-of-state tuition.
This relationship is quite strong.
b. No, the correlation does not change with a
linear transformation of one or both variables.
However, if you were to take logarithms of the
tuition costs, the correlation would change.
c. The slope would not change with the first
transformation. Consider the formula for the slope:
\(b_1 = r \frac{s_y}{s_x}\)
The correlation remains unchanged with the change
of units. The standard deviations would each be \(\frac{1}{1000}\)
as large as previously, but the factor of \(\frac{1}{1000}\) would
be in both the numerator and denominator and so
would cancel out. But if you were to take logarithms
of the tuition costs, the slope would change because
the proportion \(\frac{s_y}{s_x}\) would be different.
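The argument in E75 parts b and c can be tried on numbers. The tuition figures below are invented for illustration (they are not the data from the exercise), and `slope_and_r` is a hypothetical helper. Rescaling both variables from dollars to thousands of dollars leaves the slope and the correlation unchanged, while taking logarithms changes the slope.

```python
import math

def slope_and_r(xs, ys):
    """Return (least squares slope, correlation) for ys regressed on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sxx, sxy / math.sqrt(sxx * syy)

# Hypothetical in-state and out-of-state tuition, in dollars.
in_state = [3000, 4500, 5200, 6100, 7400, 9000]
out_state = [9000, 12500, 13000, 16000, 18500, 24000]

slope_dollars, r_dollars = slope_and_r(in_state, out_state)
slope_thousands, r_thousands = slope_and_r(
    [x / 1000 for x in in_state], [y / 1000 for y in out_state]
)
slope_logs, r_logs = slope_and_r(
    [math.log10(x) for x in in_state], [math.log10(y) for y in out_state]
)

print(f"dollars:   slope = {slope_dollars:.4f}, r = {r_dollars:.4f}")
print(f"thousands: slope = {slope_thousands:.4f}, r = {r_thousands:.4f}")
print(f"log-log:   slope = {slope_logs:.4f}, r = {r_logs:.4f}")
```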
Solving for r, the formula becomes
\(r = b_1 \frac{s_x}{s_y}\)
These correlations are A: 0.5; B: 0.3; C: 0.25.
So from weakest to strongest, they are ordered
C, B, A.
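The rearranged formula can be checked against the numbers given in E70, where the slope is 0.51, the ratio of standard deviations is 11.6/7.0, and the point of averages is (82.3, 87.8). In this sketch the ratio is read as s_x/s_y, which is an assumption based on how the formula is used there; the variable names are hypothetical.

```python
# Values quoted in E70; s_x and s_y are assigned on the assumption
# that the printed ratio 11.6 / 7.0 is s_x / s_y.
b1 = 0.51
s_x, s_y = 11.6, 7.0
x_bar, y_bar = 82.3, 87.8

r = b1 * s_x / s_y            # solve b1 = r * s_y / s_x for r
intercept = y_bar - b1 * x_bar  # the line passes through (x_bar, y_bar)

print(f"r = {r:.3f}")
print(f"y-hat = {intercept:.2f} + {b1}x")
```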
E78. a. No. It is possible for one bookstore to be a
lot more expensive than the other. For example,
suppose your local bookstore sold each book for
$10 less than the online price. The correlation
would be 1.
b. The main reason for the high correlation is
that the bookstores pay approximately the same
wholesale cost for a book. They then add on
an amount to cover their overhead costs and to
give them a profit. To have a cause-and-effect
relationship, a change in one variable should
trigger a change in the other variable. That is not
necessarily the case with these prices; however,
if the online bookstore lowers its prices, it might
force local bookstores to do the same.
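The example in E78 part a is easy to confirm. The online prices below are made up, and `pearson_r` is a hypothetical helper; the local price is set to exactly $10 less than the online price for every book.

```python
def pearson_r(xs, ys):
    """Correlation coefficient computed from sums of squared deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical online prices in dollars; local prices are $10 lower.
online = [24.0, 35.5, 48.0, 61.25, 80.0]
local = [price - 10 for price in online]

r_prices = pearson_r(online, local)
print(f"r = {r_prices:.3f}")
```

A correlation of 1 shows only that the prices move together in a straight line, not that the two stores charge similar amounts.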
E79. For example, stocks that do the best in one quarter
may not be the ones that do the best in the next
quarter. As another example, the best 20 and worst
20 hitters this year in Major League Baseball are
not likely to repeat this kind of performance again
next year.
E80. This matrix gives the correlations between all
pairs of variables in this exercise.
Max Long
Max Long
Ave Long
0.769
Gestation
0.577
0.761
0.215
0.237
Speed
Gestation
a. In general, animals with longer maximum
longevity have longer gestation periods. The trend
is reasonably linear, with a moderate correlation.
The plot shows heteroscedasticity, with the points
fanning out as maximum longevity increases.
The elephant is an outlier, although it follows the
general linear trend.
[Scatterplot: Gestation Period (days) versus Average Longevity (yr)]
c. The three relatively slow animals (elephant,
hippopotamus, and grizzly bear) in the lower-right
corner give these variables a slight negative
correlation of –0.215. However, the rest of the
animals show a positive trend, with longer average
longevity associated with greater speed. The
lurking variable is the size of the animal. Larger
animals tend to live longer and be faster (unless
they get very big, like an elephant).
0.018
[Scatterplot: Gestation Period (days) versus Maximum Longevity (yr)]
b. The pattern here is similar: The animals with
longer average longevity have longer gestation
periods. However, the relationship is not as strong
as in part a, so maximum longevity is the better
predictor of gestation period. There are two
outliers, the elephant and the hippopotamus.
[Scatterplot: Speed (mi/h) versus Average Longevity (yr)]
E76. a. Quadrant I: +, +, +; Quadrant II: –, +, –;
Quadrant III: –, –, +; Quadrant IV: +, –, –
b. The points near the origin or near one of the
new axes make the smallest contributions. The
contributions are small because either \(z_x\) or \(z_y\)
would be close to 0.
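The quadrant reasoning in E76 can be traced on a few points. This sketch uses five made-up points and assumes the usual definition of the correlation as the sum of the products of z-scores divided by n – 1; the names `z_x`, `z_y`, and `products` are hypothetical.

```python
import statistics as st

# Hypothetical data; the mean of each variable is 3.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]

mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
sx, sy = st.stdev(xs), st.stdev(ys)

z_x = [(x - mx) / sx for x in xs]
z_y = [(y - my) / sy for y in ys]
products = [zx * zy for zx, zy in zip(z_x, z_y)]

# Correlation as the sum of z-score products divided by n - 1.
r = sum(products) / (len(xs) - 1)

for x, y, p in zip(xs, ys, products):
    print(f"({x}, {y}): product of z-scores = {p:+.3f}")
print(f"r = {r:.3f}")
```

The points (3, 4) and (4, 3) lie on the new axes through the point of averages, so one of their z-scores is 0 and they contribute nothing to the sum.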
E77. You can compute the values of r using the formula
\(b_1 = r \frac{s_y}{s_x}\)
E81. a. The next scatterplot shows a very strong
linear relationship between the expenditures for
police officers and the number of police officers
per state. This relationship makes sense. There
is one outlier and influential point, California,
which has by far the largest population of any
state listed.
[Scatterplot and residual plot: Spending ($ millions) versus Officers (in thousands)]
b. The scatterplot again shows a very strong
linear relationship, with California as an outlier
and influential point. The larger the population
of the state, the more police officers. This time
the correlation is 0.987, and the equation of the
regression line is
number of police 1.47 2.91 population
That is, for every increase of 1 million in the
population, the number of police officers tends
to go up by 2910. If the outlier, California, is
removed, the slope of the line increases a little to
3.15 and the correlation decreases a little to 0.976.
Here, California is not very influential on the
slope or the correlation.
[Scatterplot and residual plot: Officers (in thousands) versus Population (in millions)]
So for every additional thousand police officers,
costs tend to go up by $73,800,000, or $73,800
each. If the outlier, California, is removed,
the slope of the line decreases to 57.2 but the
correlation increases to 0.984. Here, California is
very influential on the slope.
[Scatterplot: Violent Crime Rate (per 100,000 population) versus Officers (in thousands)]
A log-log transformation straightens these
points quite well, as shown in this scatterplot and
residual plot.
expPolice 402.8 73.8 number of police
c. The scatterplot shows that there is a moderate
positive but possibly curved relationship. For this
scatterplot, it is not appropriate to compute the
correlation or equation of the regression line. But,
in general, the larger the number of police, the
higher the rate of violent crime. (Note that this is
the rate per 100,000 people in the state, not the
number of violent crimes.) Almost equivalently,
the larger the population of a state, the higher the
violent crime rate. It is not at all obvious why this
should be the case. Why would larger states (more
police, more population) tend to have higher
rates of violent crime? (Because of the strong
relationship between the number of police officers
per state and the population, a scatterplot of the
crime rate versus the number of police officers per
100,000 population looks about the same.)
The correlation is 0.976, and the equation of the
regression line is
[Scatterplot and residual plot: log(violent) versus log(officers)]
Note on E82: This exploration will be much more
efficient using Fathom and the data on the Instructor’s
Resource CD rather than a graphing calculator.
E82. a. A linear model works fairly well, as shown in
this scatterplot and residual plot. There is some
heteroscedasticity, however, so the residuals for
houses with a large number of square feet are
larger. The equation is price 25.2 75.6 area,
and the correlation is 0.899.
[Scatterplot and residual plot: Price ($ thousands) versus Area (thousand square feet)]
Separating the houses into new houses and old
houses, the corresponding regression equations
are price 48.4 96.0 area and price 16.6 66.6 area. These equations are very
different, so you should not use one equation for
both groupings. New houses cost quite a bit more
per square foot. You can see the two relationships
in this plot.
[Scatterplot: Price ($ thousands) versus Area (thousand square feet), by Age (New, Old)]
b. The two largest houses clearly are influential
points, as could be the third largest house and
the smallest house in the lower-left corner. If the
two largest are removed from the data set, the
correlation drops to 0.867 and the regression
equation changes to price 10.3 65.7 area.
This is quite a change in the model—the price is
now increasing $10,000 less per increase of 1,000
square feet.
c. Using the equation from part a, the price for
an old house of 1,000 square feet would be
\(\text{price} = -16.6 + 66.6 \cdot \text{area} = -16.6 + 66.6(1) = 50\), or
$50,000. The price for a house of 2,000 square feet
would be
\(\text{price} = -16.6 + 66.6 \cdot \text{area} = -16.6 + 66.6(2) = 116.6\), or
$116,600. You should have
more confidence in the first prediction because
the spread in the prices is less for the smaller
houses than for the larger houses.
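The two predictions in part c are a direct substitution into the equation for old houses. In the sketch below the intercept is taken as –16.6, the sign that makes the printed arithmetic come out to 50 and 116.6; the function name `predict_old_house_price` is hypothetical.

```python
def predict_old_house_price(area_thousand_sqft):
    """Predicted price in $ thousands from the old-house line in E82."""
    return -16.6 + 66.6 * area_thousand_sqft

price_1000 = predict_old_house_price(1)
price_2000 = predict_old_house_price(2)

print(f"1,000 sq ft: ${price_1000 * 1000:,.0f}")
print(f"2,000 sq ft: ${price_2000 * 1000:,.0f}")
```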
d. As seen in these boxplots, the number of
bathrooms is strongly related to the selling price.
A lurking variable here is the number of square
feet in the house, which is very strongly related
to both the price and the number of bathrooms.
A regression line is not appropriate here mostly
because of the skewness in the prices for houses
with three bathrooms and because you can do
something better. You can compute the mean (or
the median) price of a house with one bathroom,
with two bathrooms, and with three bathrooms:
$40,320 ($33,400); $99,290 ($98,500); and
$201,400 ($170,000). This process is equivalent
to regression but does not require a linear
relationship.
[Boxplots: Price ($ thousands) by Number of Bathrooms (1, 2, 3)]
E83. a. The next scatterplot shows that expenditures
per pupil (ExpPP) and average teacher salary
(TeaSal) are moderately correlated: r = 0.66. The
equation of the regression line is ExpPP 1144.8
182.7 TeaSal. So, if one state’s average teacher’s
salary is $1000 more than another’s, its per-pupil
expenditure tends to be $183 more. In this case,
there is a cause-and-effect relationship because if
teachers are paid more, on average, the cost per
pupil has to go up unless class sizes are increased
proportionally.
[Scatterplot: ExpPP ($) versus TeaSal ($ thousands)]
b. This next scatterplot has per capita
expenditure on schools plotted against the
average teacher salary. Again, you would expect
a positive relationship, and r = 0.577, so it is
a bit weaker than in part a. It isn’t surprising
that the two correlations are so close because
the number of pupils in the state is pretty much
proportional to the number of people in the
state, so you would expect expenditure per pupil
and expenditure per capita to have roughly the
same correlation with teacher salary (or any
other variable).
[Scatterplot: ExpPC ($) versus TeaSal ($ thousands)]
c. Shown below are the correlations and a
scatterplot matrix (the JMP version, which
automatically draws ellipses). All of the
relationships with percentage of dropouts have
correlations close to zero and in the matrix plot,
their ellipses are quite round and fat. So, no,
none of the variables are good predictors of the
percentage of dropouts.
Pearson Product-Moment Correlation
           Dropout   ExpPP   ExpPC   TeaSal   Enroll   Teachers
Dropout    1.000
ExpPP      0.179     1.000
ExpPC      0.060     0.912   1.000
TeaSal     0.024     0.660   0.577   1.000
Enroll     0.052     0.096   0.125   0.438    1.000
Teachers   0.098     0.165   0.175   0.437    0.975    1.000
[Scatterplot Matrix (Chapter 3 Review, E83c): ExpPP, ExpPC, TeaSal, %Dropout, Enrollment, Teachers]
AP1. A. There aren’t any points in the upper left-hand
corner because the oldest child has to be older
(or the same age, in the case of twins) than the
youngest child. Thus all points must lie on or
below the line y = x.
AP2. C. The predicted birthrate is \(-0.38 \cdot 60 + 53.5 = 30.7\), so the residual is the actual birthrate of 47
minus this prediction, \(47 - 30.7 = 16.3\).
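The residual in AP2 is the observed value minus the predicted value. A minimal sketch of that arithmetic follows, with the slope taken as –0.38 (the sign that reproduces the printed prediction of 30.7); the variable names are hypothetical.

```python
slope, intercept = -0.38, 53.5
x_value = 60
actual_birthrate = 47

predicted_birthrate = slope * x_value + intercept
residual = actual_birthrate - predicted_birthrate  # observed minus predicted

print(f"predicted = {predicted_birthrate:.1f}, residual = {residual:.1f}")
```

A positive residual means the point lies above the regression line, so the prediction was too small.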
AP3. C. Curvature in the residual plot of a linear
regression is a sign of curvature in the original
plot, so statement I is true. When points in the
residual plot lie below the line y = 0, the points
in the original scatterplot lie below the regression
line and so the prediction is too large. Thus,
statement II is true. Statement III is false because,
for example, the pattern could be exponential with
a high correlation.
AP4. A. Outliers should not be removed permanently
from a data set simply because they are outliers.
Further investigation is needed, as described in B,
C, and D.
AP5. A. B is incorrect because the slope of the
regression equation is positive, so the correlation
is 0.228. C is incorrect because the value of \(R^2\)
doesn’t give any information about linearity versus
curvature. E is incorrect because it implies that
each person’s satisfaction tends to increase over
their stay in the hospital. Instead, there may be a
lurking variable of age: older people have to stay
longer and they also tend to be more satisfied
with their care. Or, the lurking variable might be
severity of the problem. The more seriously ill a
patient is, the longer they tend to have to stay, and
the more grateful they are for the care they were
given.
AP6. D. In the year 2000, \(t = 50\), so \(\log_{10}(\text{population}) = 0.01 \cdot 50 + 7 = 7.5\), and thus \(\text{population} = 10^{7.5} \approx 31{,}622{,}777\), and 31,600,000 is the closest.
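Undoing the logarithm in AP6 takes one step: evaluate the linear equation, then raise 10 to that power. The sketch below follows the solution's numbers, with t = 50 for the year 2000; the variable names are hypothetical.

```python
t = 50  # years, the value used in the solution for the year 2000

log10_population = 0.01 * t + 7
population = 10 ** log10_population  # back-transform from the log scale

print(f"log10(population) = {log10_population:.1f}")
print(f"population = {population:,.0f}")
```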
AP7. E. Choice A is a poor choice because each point
represents a different Barbarian, and so does not
establish trends in a particular Barbarian. Choice
B is closer to an interpretation of the intercept
than to the slope. Choice C might be close to
correct if the y-intercept was near zero, but here
it’s far from zero. For choice D, you would have
to know the scores on the two sections had
equal standard deviations before making this
interpretation.
AP8. D. Note that the correct statement E is equivalent
to saying that 81% of the variation in the number
of raids among Barbarians is explained by
personal cleanliness.
AP9. a. i. From the plot, this line looks to be a
good fit.
[Scatterplot: Body Fat (%) versus Density]
ii. The regression equation is \(\hat{y} = 505.254 - 460.678x\) and the analysis is as follows.
Dependent variable is: % Fat
No Selector
R squared = 99.9%
R squared (adjusted) = 99.9%
s = 0.2443 with 15 – 2 = 13 degrees of freedom

Source       Sum of Squares   df   Mean Square   F-ratio
Regression   1206.48          1    1206.48       20223
Residual     0.775560         13   0.059658

Variable   Coefficient   s.e. of Coeff   t-ratio   prob
Constant   505.254       3.386           149       ≤0.0001
Density    –460.678      3.239           –142      ≤0.0001
iii. The \(r^2\) value of 0.999 seems to confirm that
this model is a good fit.
iv. The residual plot uncovers some problems;
perhaps we could do better!
[Residual plot: Residual versus Density]
AP Practice Test
b. i. As the percentage of fat increases, the body
density decreases. Perhaps the positive
association between the reciprocal of density
and the percentage of fat would be easier
to model. The pertinent plots and the
regression analysis are shown on the next
page.
this time the regression equation is
\(\%\,\text{body fat} = -453.7 + 498.97 \cdot \frac{1}{\text{density}}\),
which is similar to Siri’s
model but not nearly as close as the equation for
women. Thus, the model fits better for women
than for men.
[Scatterplot and residual plot: Body Fat (%) versus Reciprocal of Density]
Source       Sum of Squares   df   Mean Square   F-ratio
Regression   1206.88          1    1206.88       42246
Residual     0.371389         13   0.028568

Variable    Coefficient   s.e. of Coeff   t-ratio   prob
Constant    –450.632      2.309           –195      ≤0.0001
1/Density   495.654       –0.412          206       ≤0.0001
The residuals show less pattern; the plot is
more like one of random scatter, suggesting
that this is a better model.
The regression line has the equation
\(\%\,\text{body fat} = -450.63 + 495.65 \cdot \frac{1}{\text{density}}\)
which is very close to the Siri equation.
ii. The correlation is close to 1 for both models,
but the second proves to be a better fitting
model than the first. Moral: Never use
correlation as the only criterion for choosing
a model.
iii. Percent body fat as a function of log(density)
works almost as well as Siri’s model. The
residual plot, however, has a hint of a
pattern.
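The comparison in part b can be made concrete by evaluating the fitted equations over the range of densities in the plots. This sketch assumes the commonly cited form of the Siri equation, % body fat = 495/density – 450, which is not printed in this solution; the function names are hypothetical, and the coefficients are the rounded ones reported above.

```python
def siri(density):
    """Siri equation in its commonly cited form (an assumption here)."""
    return 495.0 / density - 450.0

def fitted_reciprocal(density):
    """Fitted model using the reciprocal of density (part b)."""
    return -450.63 + 495.65 / density

def fitted_linear(density):
    """Fitted straight-line model on density itself (part a)."""
    return 505.254 - 460.678 * density

densities = [1.00, 1.02, 1.04, 1.06, 1.08]

gap_reciprocal = max(abs(fitted_reciprocal(d) - siri(d)) for d in densities)
gap_linear = max(abs(fitted_linear(d) - siri(d)) for d in densities)

print(f"largest gap, reciprocal model vs. Siri: {gap_reciprocal:.3f}")
print(f"largest gap, linear model vs. Siri:     {gap_linear:.3f}")
```

Both models have a correlation near 1, yet only the reciprocal model tracks the Siri equation closely across the whole range, which is the moral of part b ii.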
AP10. a. For women, the regression equation was
\(\%\,\text{body fat} = -450.63 + 495.65 \cdot \frac{1}{\text{density}}\),
almost identical to Siri’s model. (See AP9.)
The relationship is very strong and linear, with
correlation almost equal to 1 and no pattern in the
residual plot.
Using the same variables for men, the
relationship is again extraordinarily linear (see the
next scatterplot), with a correlation near 1. But
b. For women, this scatterplot of density against
skinfold is quite linear and does not require
re-expression. Thus, a reasonable model is the
regression equation, \(\text{density} = 1.084 - 0.000311 \cdot \text{skinfold}\).
The correlation is –0.897.
[Scatterplot: Density versus Skinfold (mm)]
For men, the relationship is less strong and has
some curvature (see the scatterplot); however, the
linear model is an adequate one. The equation is
\(\text{density} = 1.105 - 0.000295 \cdot \text{skinfold}\). The residual
plot shows some heteroscedasticity.
[Scatterplot and residual plot: Density versus Skinfold (mm)]
[Scatterplot: % Fat versus Reciprocal of Density]
Dependent variable is: % Fat
No Selector
R squared = 100.0%
R squared (adjusted) = 100.0%
s = 0.1690 with 15 – 2 = 13 degrees of freedom