y - Cengage Learning

Section V
Additional Opportunities to Learn from Data
16
Understanding
Relationships—Numerical
Data Part 2
Daniel M. Nagy/Shutterstock.com
Preview
Chapter Learning Objectives
16.1 The Simple Linear Regression Model
16.2 Inferences Concerning the Slope of the Population Regression Line
16.3 Checking Model Adequacy
Are You Ready to Move On? Chapter 16 Review Exercises
Technology Notes
AP* Review Questions for Chapter 16
Preview
In Chapter 4, you learned how to describe relationships between two numerical
variables. When the relationship was judged to be linear, you found the equation
of the least squares regression line and assessed the quality of the fit using the
scatterplot, the residual plot, and the values of the coefficient of determination ($r^2$)
and the standard deviation about the least squares line ($s_e$). In this chapter you will
learn how to make inferences about the slope of the population regression line.
Chapter Learning Objectives
Conceptual Understanding
After completing this chapter, you should be able to
C1 Understand how probabilistic and deterministic models differ.
C2 Understand that the simple linear regression model provides a basis for making inferences about
linear relationships.
Mastering the Mechanics
After completing this chapter, you should be able to
M1 Interpret the parameters of the simple linear regression model in context.
M2 Use scatterplots, residual plots, and normal probability plots to assess the credibility of the
assumptions of the simple linear regression model.
M3 Know the conditions for appropriate use of methods for making inferences about $\beta$.
M4 Compute the margin of error when the sample slope $b$ is used to estimate a population slope $\beta$.
M5 Use the five-step process for estimation problems (EMC$^3$) and computer output to construct and
interpret a confidence interval estimate for the slope of a population regression line.
M6 Use the five-step process (HMC$^3$) to test hypotheses about the slope of the population
regression line.
M7 Use graphs to identify potential outliers and influential points.
Putting It into Practice
After completing this chapter, you should be able to
P1 Interpret a confidence interval for a population slope in context.
P2 Carry out the model utility test and interpret the result in context.
Preview Example
Premature Babies
Babies born prematurely (before the 37th week of pregnancy) often have low birth
weights. Is a low birth weight related to factors that affect brain function? The
authors of the paper “Intrauterine Growth Restriction Affects the Preterm Infant’s
Hippocampus” (Pediatric Research [2008]: 438-43) hoped to use data from a study of
premature babies to answer this question. They measured x = birth weight (in grams)
and y = hippocampus volume (in mL) for 26 premature babies. The hippocampus
is a part of the brain that is important in the development of both short- and long-term
memory. The sample correlation coefficient for their data is r = 0.4722 and the
equation of the least squares regression line is $\hat{y} = 1.67 + 0.0026x$. The pattern in the
scatterplot (Figure 16.1) suggests there may be a positive linear relationship. However,
the correlation coefficient is not very large, and the value of the slope is close to zero.
Could the pattern observed in the scatterplot—and the nonzero slope—be plausibly
explained by chance? That is, is it plausible that there is no relationship between birth
weight and hippocampus volume in the population of all premature babies? Or does
the sample provide convincing evidence of a linear relationship between these two
variables? If there is evidence of a meaningful relationship between these two variables,
the regression line could be used to predict the hippocampus volume. If the predicted
volume was sufficiently small, early cognitive therapy could be recommended. On the
other hand, if there is no meaningful relationship between these variables, low birth
weight should not automatically trigger potentially expensive therapy.
Figure 16.1 Scatterplot of birth weight versus hippocampus volume. [Horizontal axis: birth weight; vertical axis: hippocampus volume.]
In this chapter, you will learn methods that will help you determine if there is a
real and useful linear relationship between two variables or if the pattern in the data
could be simply due to chance differences that occur when a sample is selected from a
population.
Section 16.1 The Simple Linear Regression Model
A deterministic relationship between two variables x and y is one in which the value of y is
completely determined by the value of the independent variable x. A deterministic relationship can be described, or “modeled,” using mathematical notation, such as $y = f(x)$, where $f(x)$
is a particular function of x. This relationship is deterministic in the sense that the value of the
independent variable is all that is needed to determine the value of the dependent variable.
For example, you might convert x = temperature in degrees centigrade to y = temperature in
degrees Fahrenheit using $y = f(x)$, where $f(x) = \frac{9}{5}x + 32$. Once the centigrade temperature
is known, the Fahrenheit temperature is completely determined. Or you might determine
y = amount of money in a savings account after x years, using the compound interest formula
$$y = P\left(1 + \frac{r}{n}\right)^{nx}$$
where P is the principal (the amount of money deposited), r is the
interest rate, and n is the number of times each year the interest is compounded. The number
of years you leave the principal in the bank determines the amount in the account.
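The two deterministic relationships just described can be sketched in a few lines of Python. The function names and the sample deposit values below are illustrative assumptions, not part of the text; the point is only that the same input always produces the same output.

```python
def fahrenheit(centigrade):
    """Deterministic conversion f(x) = (9/5)x + 32."""
    return 9 / 5 * centigrade + 32


def balance(principal, rate, periods_per_year, years):
    """Compound interest formula y = P(1 + r/n)^(nx)."""
    return principal * (1 + rate / periods_per_year) ** (periods_per_year * years)


# Hypothetical inputs chosen only for illustration.
print(fahrenheit(0))                           # 32.0
print(round(balance(1000, 0.05, 12, 10), 2))   # amount after 10 years
```

Calling either function twice with the same arguments returns the same value, which is exactly what "deterministic" means here.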
In many situations the variables of interest are not deterministically related. For example,
the value of y = first-year college grade point average is not determined solely by x = high
school grade point average, and y = crop yield is determined partly by factors other than x =
amount of fertilizer used. The relationship between two variables, x and y, that are not deterministically related can be described by extending the deterministic model to specify a probabilistic model. The general form of a probabilistic model allows y to be larger or smaller
than $f(x)$ by a random amount e. The model equation for a probabilistic model has the form
$$y = (\text{deterministic function of } x) + (\text{random deviation}) = f(x) + e$$
In a scatterplot of y versus x, some of the data points will fall above the graph of $f(x)$
and some will fall below. Thinking geometrically, if $e > 0$, the corresponding point in the
scatterplot will lie above the graph of the function $y = f(x)$. If $e < 0$, the corresponding
point will fall below the graph of $f(x)$.
For example, consider the probabilistic model
$$y = \underbrace{50 - 10x + x^2}_{f(x)} + e$$
The graph of the function $y = 50 - 10x + x^2$ is shown as the orange curve in Figure 16.2.
The observed point (4, 30) is also shown in the figure. Because $f(4) = 50 - 10(4) + 4^2 =
50 - 40 + 16 = 26$ for this point, you can write $y = f(x) + e$, where $e = 4$. The point
(4, 30) falls 4 above the graph of the function $y = 50 - 10x + x^2$.
Figure 16.2 A deviation from the deterministic part of a probabilistic model. [The graph of $y = 50 - 10x + x^2$, with the observation (4, 30) lying $e = 4$ above the curve, whose height at x = 4 is 26.]
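The arithmetic for the observation (4, 30) can be checked with a short Python sketch. The helper names `f` and `deviation` are chosen here for illustration only.

```python
def f(x):
    """Deterministic part of the model: f(x) = 50 - 10x + x^2."""
    return 50 - 10 * x + x ** 2


def deviation(x, y):
    """Random deviation e = y - f(x) for an observed point (x, y)."""
    return y - f(x)


print(f(4), deviation(4, 30))  # 26 4
```

A positive deviation places the point above the curve; a point such as (4, 20) would give a negative deviation and fall below it.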
Simple Linear Regression Model
The simple linear regression model is a special case of the general probabilistic model in
which the deterministic function, f (x), is linear (so its graph is a straight line).
Definition
The simple linear regression model assumes that there is a line with vertical or
y intercept $\alpha$ and slope $\beta$, called the population regression line. When a value of
the independent variable x is fixed and an observation on the dependent variable y
is made,
$$y = \alpha + \beta x + e$$
Without the random deviation e, all observed (x, y) points would fall exactly on
the population regression line. The inclusion of e in the model equation recognizes
that points will deviate from the line by a random amount.
Figure 16.3 shows two observations in relation to the population regression line.
Figure 16.3 Two observations and deviations from the population regression line. [The line has vertical intercept $\alpha$ and slope $\beta$; the observation at $x = x_1$ lies above the line (positive deviation $e_1$) and the observation at $x = x_2$ lies below it (negative deviation $e_2$).]
Unless otherwise noted, all content on this page is © Cengage Learning.
Before you actually observe a value of y for any particular value of x, you are
uncertain about the value of e. It could be negative, positive, or even 0. Also, e might
be quite large in magnitude (resulting in a point far from the population regression line)
or quite small (resulting in a point very close to the line). The simple linear regression
model makes some assumptions about the distribution of e at any particular x value in
the population.
Basic Assumptions of the Simple Linear Regression Model
1. The distribution of e at any particular x value has mean value 0. That is, $\mu_e = 0$.
2. The standard deviation of e (which describes the spread of its distribution)
is the same for any particular value of x. This standard deviation is denoted
by $\sigma_e$.
3. The distribution of e at any particular x value is normal.
4. The random deviations $e_1, e_2, \ldots, e_n$ associated with different observations are
independent of one another.
The simple linear regression model assumptions about the variability in the values
of e in the population imply that there is also variability in the y values observed at any
particular value of x. Consider y when x has some fixed value x*, so that
$$y = \alpha + \beta x^* + e$$
Because $\alpha$ and $\beta$ are fixed (they are unknown population values), $\alpha + \beta x^*$ is also a
fixed number. The sum of a fixed number and a normally distributed variable (e) is
also a normally distributed variable (the bell-shaped curve is simply shifted), so y
itself has a normal distribution. Furthermore, $\mu_e = 0$ implies that the mean value of y
is $\alpha + \beta x^*$, the height of the population regression line for the value x = x*. Finally,
because there is no variability in the fixed number $\alpha + \beta x^*$, the standard deviation of
y is the same as the standard deviation of e. These properties are summarized in the
following box.
At any fixed value x*, y has a normal distribution, with
$$(\text{mean } y \text{ value for } x^*) = (\text{height of the population regression line above } x^*) = \alpha + \beta x^*$$
and
$$(\text{standard deviation of } y \text{ for a fixed value } x^*) = \sigma_e$$
The slope $\beta$ of the population regression line is the mean or expected change
in y associated with a 1-unit increase in x. The y intercept $\alpha$ is the height of
the population line when x = 0.
The value of $\sigma_e$ determines how much the (x, y) observations deviate vertically
from the population line; when $\sigma_e$ is small, most observations will be close to
the line, but when $\sigma_e$ is large, the observations will tend to deviate more from
the line.
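A small simulation can make the boxed properties concrete. The sketch below assumes hypothetical population values (intercept 2.0, slope 0.5, standard deviation 1.0) that do not come from the text, draws many y values at several fixed x values, and reports the sample mean and standard deviation at each one.

```python
import random
import statistics

# Hypothetical population values, chosen only for illustration.
ALPHA, BETA, SIGMA_E = 2.0, 0.5, 1.0


def simulate_y(x_star, n, rng):
    """Draw n observations y = alpha + beta*x_star + e, with e normal(0, sigma_e)."""
    return [ALPHA + BETA * x_star + rng.gauss(0, SIGMA_E) for _ in range(n)]


rng = random.Random(1)
for x_star in (4, 10, 16):
    ys = simulate_y(x_star, 20000, rng)
    # The mean should be near alpha + beta*x_star; the spread near sigma_e.
    print(x_star, round(statistics.mean(ys), 2), round(statistics.stdev(ys), 2))
```

Under these assumptions the printed means should track the line (about 4, 7, and 10), while the printed standard deviations should stay near 1 regardless of x.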
The key features of the model are illustrated in Figures 16.4 and 16.5. Notice that
the three normal curves in Figure 16.4 have identical spreads. This is a consequence of
$\sigma_e$ being the same at any value of x, which implies that the variability in the y values at a
particular value of x is constant: the variability does not depend on the value of x.
Figure 16.4 Illustration of the simple linear regression model. [The population regression line $y = \alpha + \beta x$ (line of mean values), with a normal curve at each of three different x values $x_1$, $x_2$, $x_3$; each curve has mean value $\alpha + \beta x_i$ and standard deviation $\sigma_e$.]
Figure 16.5 The simple linear regression model: (a) small $\sigma_e$; (b) large $\sigma_e$.
Example 16.1 Stand on Your Head to Lose Weight?
The authors of the article “On Weight Loss by Wrestlers Who Have Been Standing on Their
Heads” (paper presented at the Sixth International Conference on Statistics, Combinatorics,
and Related Areas, Forum for Interdisciplinary Mathematics, 1999, with the data also
appearing in A Quick Course in Statistical Process Control, Mick Norton, 2005) state that
“amateur wrestlers who are overweight near the end of the weight certification period, but
just barely so, have been known to stand on their heads for a minute or two, get on their
feet, step back on the scale, and establish that they are in the desired weight class. Using
a headstand as the method of last resort has become a fairly common practice in amateur
wrestling.”
Does this really work? Data were collected in an experiment where weight loss was
recorded for each wrestler after exercising for 15 minutes and then doing a headstand for
1 minute 45 sec. Based on these data, the authors of the article concluded that there was in
fact a demonstrable weight loss that was greater than that for a control group that exercised
for 15 minutes but did not do the headstand. (The authors give a plausible explanation for
why this might be the case based on the way blood and other body fluids collect in the head
during the headstand and the effect of weighing while these fluids are draining immediately after standing.) The authors also concluded that a simple linear regression model was
a reasonable way to describe the relationship between the variables
y = weight loss (in pounds)
and
x = body weight prior to exercise and headstand (in pounds)
Suppose that the actual model equation has $\alpha = 0$, $\beta = 0.001$, and $\sigma_e = 0.09$ (these
values are consistent with the findings in the article). The population regression line is
shown in Figure 16.6.
Figure 16.6 The population regression line for Example 16.1. [The line $y = 0.001x$, with the mean y value of 0.19 marked at x = 190.]
If the distribution of the random errors at any fixed weight (x value) is normal, then
the variable y = weight loss is normally distributed with
$$\mu_y = 0 + 0.001x$$
$$\sigma_y = 0.09$$
For example, when x = 190 (corresponding to a 190-pound wrestler), weight loss has
mean value
$$\mu_y = 0 + 0.001(190) = 0.19 \text{ pounds}$$
Because the standard deviation of y is $\sigma_y = 0.09$, the interval $0.19 \pm 2(0.09) = (0.01, 0.37)$
includes y values that are within 2 standard deviations of the mean value for y when
x = 190. Roughly 95% of the weight loss observations made for 190-lb wrestlers will be in
this range. The slope $\beta = 0.001$ can be interpreted as the mean change in weight loss associated
with each additional pound of body weight.
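The calculation for a 190-pound wrestler can be reproduced with Python's standard library. This is a sketch using the model values assumed in the example; `NormalDist` is used only to confirm the "roughly 95%" figure for an interval of 2 standard deviations on either side of the mean.

```python
from statistics import NormalDist

# Model values assumed in Example 16.1.
alpha, beta, sigma_e = 0.0, 0.001, 0.09

x = 190
mean_y = alpha + beta * x
low, high = mean_y - 2 * sigma_e, mean_y + 2 * sigma_e

# Share of weight-loss values expected inside the interval under normality.
dist = NormalDist(mu=mean_y, sigma=sigma_e)
share_within = dist.cdf(high) - dist.cdf(low)

print(round(mean_y, 2), round(low, 2), round(high, 2))  # 0.19 0.01 0.37
print(round(share_within, 4))                           # 0.9545
```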
More insight into model properties can be gained by thinking of the population of all
(x, y) pairs as consisting of many smaller subpopulations. Each subpopulation contains
pairs for which x has a fixed value. Suppose, for example, that in a large population of
college students the variables
x = grade point average in major courses
and
y = starting salary after graduation
are related according to the simple linear regression model. Then you can think about the
subpopulation of all pairs with x = 3.20 (corresponding to all students with a grade point
average of 3.20 in major courses), the subpopulation of all pairs having x = 2.75, and so
on. The model assumes that for each of these subpopulations, y is normally distributed
with the same standard deviation, and that the mean y value (rather than y itself) is linearly
related to x.
In practice, the judgment of whether the simple linear regression model is
appropriate (that is, the judgment about the credibility of the assumptions underlying the
linear model) must be based on knowledge of how the data were collected, as well as an
inspection of various plots of the data and the residuals. The sample observations should be
independent of one another, which will be the case if the data are from a random sample.
In addition, the scatterplot should show a linear rather than a curved pattern, and the vertical spread of points should be very similar throughout the range of x values. Figure 16.7
shows plots with three different patterns; only the first pattern is consistent with the simple
linear regression model assumptions.
Figure 16.7 Some commonly encountered patterns in scatterplots: (a) consistent with the simple linear regression model; (b) suggests a nonlinear probabilistic model; (c) suggests that variability in y changes with x.
Estimating the Population Regression Line
In Section 16.3, you will see how to check whether the basic assumptions of the simple
linear regression model are reasonable. When this is the case, the values of $\alpha$ and $\beta$ (y
intercept and slope of the population regression line) can be estimated from sample data.
The estimates of $\alpha$ and $\beta$ are denoted by a and b, respectively. These estimates are
the values of the intercept and slope of the least squares regression line. Recall
that the least squares regression line is the line for which the sum of squared vertical
deviations of points in the scatterplot from the line is smaller than for any other line.
The estimates of the slope and the y intercept of the population regression line are
the slope and y intercept, respectively, of the least squares line. That is,
$$b = \text{estimate of } \beta = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$
$$a = \text{estimate of } \alpha = \bar{y} - b\bar{x}$$
The values of a and b are usually obtained using statistical software or a graphing
calculator. If the slope and intercept are calculated by hand, you can use the following computational formula:
$$b = \frac{\sum xy - \dfrac{(\sum x)(\sum y)}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}}$$
The estimated regression line is the familiar least squares line
$$\hat{y} = a + bx$$
Let x* denote a specified value of the independent variable x. Then $a + bx^*$ has
two different interpretations:
1. It is a point estimate of the mean y value when x = x*.
2. It is a point prediction of an individual y value to be observed when x = x*.
Example 16.2 Mother’s Age and Baby’s Birth Weight
Medical researchers have noted that adolescent females are much more likely to deliver
low-birth-weight babies than are adult females. (Low birth weight in humans is generally
defined as a weight below 2,500 grams.) Because low-birth-weight babies have higher
mortality rates, a number of studies have examined the relationship between birth weight
and mother’s age for babies born to young mothers.
One such study is described in the article “Body Size and Intelligence in 6-Year-Olds:
Are Offspring of Teenage Mothers at Risk?” (Maternal and Child Health Journal [2009]:
847-856). The following data on
x = maternal age (in years)
and
y = birth weight of baby (in grams)
are consistent with summary values given in the article and also with data published by the
National Center for Health Statistics.
Observation  1      2      3      4      5      6      7      8      9      10
x            15     17     18     15     16     19     17     16     18     19
y            2,289  3,393  3,271  2,648  2,897  3,327  2,970  2,535  3,138  3,573
A scatterplot of the data is given in Figure 16.8. The scatterplot shows a linear pattern,
and the spread in the y values appears to be similar across the range of x values. This
supports the appropriateness of the simple linear regression model.
Figure 16.8 Scatterplot of birth weight versus maternal age for Example 16.2. [Horizontal axis: mother’s age (yr); vertical axis: baby’s weight (g).]
For these data, the equation of the estimated regression line was found using statistical
software, resulting in
$$\hat{y} = a + bx = -1{,}163.45 + 245.15x$$
An estimate of the mean birth weight of babies born to 18-year-old mothers results
from substituting x = 18 into the estimated equation:
$$\text{estimated mean } y \text{ for 18-year-old mothers} = a + bx = -1{,}163.45 + 245.15(18) = 3{,}249.25 \text{ grams}$$
Similarly, you would predict the birth weight of a baby to be born to a particular
18-year-old mother to be
$$\hat{y} = \text{predicted } y \text{ value when } x = 18 = a + b(18) = 3{,}249.25 \text{ grams}$$
The estimate of the mean weight and the prediction of an individual baby weight are
identical, because the same x value was used in each calculation. However, their interpretations differ. One is the prediction of the weight of a single baby whose mother is 18, whereas
the other is an estimate of the mean weight of all babies born to 18-year-old mothers.
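The estimates in this example can be reproduced from the ten observations with a short Python sketch. The function names are illustrative; `least_squares` uses the definitional formulas for b and a, and `slope_shortcut` uses the computational formula, so the two slopes can be compared.

```python
ages = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]
weights = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]


def least_squares(x, y):
    """Return (a, b) using b = Sxy / Sxx and a = y_bar - b * x_bar."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b


def slope_shortcut(x, y):
    """Slope from the computational (hand-calculation) formula."""
    n = len(x)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    return (sum_xy - sum(x) * sum(y) / n) / (sum_x2 - sum(x) ** 2 / n)


a, b = least_squares(ages, weights)
print(round(a, 2), round(b, 2))   # -1163.45 245.15
print(round(a + b * 18, 2))       # 3249.25
```

The same number, 3,249.25 grams, serves as both the estimated mean and the individual prediction at x = 18; only the interpretation differs.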
In Example 16.2, the x values in the sample ranged from 15 to 19. The estimated
regression equation should not be used to make an estimate or prediction for any x value
much outside this range. Without sample data for such values, or some clear theoretical
reason for expecting the relationship to be linear outside the observed range of x values,
you have no reason to believe that the estimated linear relationship continues outside the
range from 15 to 19. Making predictions outside this range can be misleading, and statisticians refer to this as the danger of extrapolation.
Estimating $\sigma_e^2$ and $\sigma_e$
The value of $\sigma_e$ determines the extent to which observed points (x, y) tend to fall close to
or far away from the population regression line. A point estimate of $\sigma_e$ is based on
$$\text{SSResid} = \sum (y - \hat{y})^2$$
where $\hat{y}_1 = a + bx_1, \ldots, \hat{y}_n = a + bx_n$ are the fitted or predicted y values and the residuals
are $y_1 - \hat{y}_1, \ldots, y_n - \hat{y}_n$. SSResid is a measure of the extent to which the sample data spread
out around the estimated regression line.
Definition
The statistic for estimating the variance $\sigma_e^2$ is
$$s_e^2 = \frac{\text{SSResid}}{n - 2}$$
where
$$\text{SSResid} = \sum (y - \hat{y})^2 = \sum y^2 - a\sum y - b\sum xy$$
The subscript in $\sigma_e^2$ and $s_e^2$ is a reminder that you are estimating the variance of the
“errors” or residuals.
The estimate of $\sigma_e$ is the estimated standard deviation
$$s_e = \sqrt{s_e^2}$$
The number of degrees of freedom associated with estimating $\sigma_e^2$ or $\sigma_e$ in simple
linear regression is $n - 2$.
The estimates and number of degrees of freedom here have analogs in previous
work involving a single sample $x_1, x_2, \ldots, x_n$. The sample variance $s^2$ had a numerator of
$\sum (x - \bar{x})^2$, a sum of squared deviations (residuals), and denominator $n - 1$, the number of
degrees of freedom associated with $s^2$ and $s$. The use of $\bar{x}$ as an estimate of $\mu$ in the formula
for $s^2$ reduces the number of degrees of freedom by 1, from $n$ to $n - 1$. In simple linear
regression, estimation of two quantities, $\alpha$ and $\beta$, results in a loss of 2 degrees of freedom,
leaving $n - 2$ as the number of degrees of freedom associated with SSResid, $s_e^2$, and $s_e$.
Once the estimated regression equation has been found, the usefulness of this model
is evaluated using a residual plot and the values of $s_e$ and the coefficient of determination,
$r^2$. Recall from Chapter 4 that the values of $s_e$ and $r^2$ are interpreted as described in the
following box.
The coefficient of determination, $r^2$, is the proportion of variability in y that can be
explained by the approximate linear relationship between x and y.
The value of $s_e$, the estimated standard deviation about the population regression
line, is interpreted as the typical amount by which an observation deviates from
the population regression line.
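As a worked illustration of these definitions, the sketch below applies them to the maternal age and birth weight data of Example 16.2, taking the estimated line reported there (a = -1,163.45, b = 245.15) as given. The resulting values of SSResid, $s_e$, and $r^2$ are computed here and are not reported in the text; the variable names are illustrative.

```python
import math

ages = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]
weights = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]
a, b = -1163.45, 245.15  # estimated line reported in Example 16.2

n = len(ages)
residuals = [y - (a + b * x) for x, y in zip(ages, weights)]
ss_resid = sum(r ** 2 for r in residuals)

# Shortcut form: sum(y^2) - a*sum(y) - b*sum(xy)
shortcut = (sum(y ** 2 for y in weights) - a * sum(weights)
            - b * sum(x * y for x, y in zip(ages, weights)))

# Estimated standard deviation uses n - 2 degrees of freedom.
s_e = math.sqrt(ss_resid / (n - 2))

y_bar = sum(weights) / n
ss_total = sum((y - y_bar) ** 2 for y in weights)
r_squared = 1 - ss_resid / ss_total

print(round(ss_resid, 2), round(s_e, 1), round(r_squared, 3))  # 337212.45 205.3 0.781
```

Under these inputs, a typical birth weight deviates from the line by roughly 205 grams, and about 78% of the variability in birth weight is explained by the linear relationship with maternal age.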
Example 16.3 Estimating Elk Weight
Wildlife biologists monitor the ecological health of animals. For large animals whose habitat is relatively inaccessible, this can present some practical problems. The Rocky Mountain
elk is the fourth largest deer species and is a case in point. Males range up to 7.5 feet in
length and over 500 pounds in weight. The equipment, manpower, and time needed to weigh
these creatures make direct measurement of weight difficult and expensive. The authors of
the paper “Estimating Elk Weight From Chest Girth” (Wildlife Society Bulletin [1996]: 58-611)
found they could reliably estimate elk weights by a much more practical method: measuring
the chest girth and then using linear regression to estimate the weight. They measured the
chest girth and weight of 19 Rocky Mountain elk in Custer State Park, South Dakota. The
resulting data (from a scatterplot in the paper) are given in the accompanying table. The table
also includes the predicted values and residuals for the estimated regression line.
Girth (cm)  Weight (kg)  Predicted y Value  Residual
96          87           136.266            -38.2661
105         196          161.069            34.9314
108         163          169.336            -6.3361
109         196          172.092            23.9080
110         183          174.848            8.1522
114         171          185.871            -14.8711
121         230          205.162            24.8380
124         225          213.429            11.5705
131         211          232.720            -21.7203
135         231          243.744            -12.7436
137         225          249.255            -24.2553
138         266          252.011            13.9889
140         241          257.523            -16.5228
142         264          263.034            0.9655
157         284          304.372            -20.3720
157         292          304.372            -12.3720
159         300          309.884            -9.8837
155         337          298.860            38.1397
162         339          318.151            20.8488
The scatterplot (Figure 16.9) gives evidence of a strong positive linear relationship between
x = chest girth (in cm)
and
y = weight (in kg)
Figure 16.9 Scatterplot of weight versus chest girth for Example 16.3. [Horizontal axis: girth (cm); vertical axis: weight (kg).]
Partial Minitab regression output is shown here.
Regression Analysis: Weight versus Girth

The regression equation is
Weight = - 136 + 2.81 Girth

Predictor   Coef      SE Coef   T       P
Constant    -135.51   35.75     -3.79   0.001
Girth       2.8063    0.2686    10.45   0.000

S = 23.6626   R-Sq = 86.5%   R-Sq(adj) = 85.7%
From the output,
$$\hat{y} = -136 + 2.81x$$
$$r^2 = 0.865$$
$$s_e = 23.6626$$
Approximately 86.5% of the observed variation in elk weight y can be attributed to the
linear relationship between weight and chest girth. The magnitude of a typical deviation
from the least-squares line is about 23.6626 kg, which is relatively small in comparison to
the y values themselves.
Another important assumption of the simple linear regression model is that the
random deviations at any particular x value are normally distributed. In Section 16.3,
you will see how the residuals can be used to determine whether this assumption is
plausible.
Section 16.1 Exercises
Each exercise set assesses the following chapter learning objectives: C1, M1
Section 16.1 Exercise Set 1
16.1 Identify the following relationships as deterministic
or probabilistic:
a. The relationship between the length of the sides of a
square and its perimeter.
b. The relationship between the height and weight of an adult.
c. The relationship between SAT score and college freshman
GPA.
d. The relationship between tree height in centimeters and
tree height in inches.
16.2 Let x be the size of a house (in square feet) and y be
the amount of natural gas used (therms) during a specified
period. Suppose that for a particular community, x and y are
related according to the simple linear regression model with
$\beta$ = slope of population regression line = 0.017
$\alpha$ = y intercept of population regression line = $-5.0$
Houses in this community range in size from 1000 to
3000 square feet.
a. What is the equation of the population regression line?
b. Graph the population regression line by first finding the
point on the line corresponding to x = 1000 and then
the point corresponding to x = 2000, and drawing a line
through these points.
c. What is the mean value of gas usage for houses with
2100 sq. ft. of space?
d. What is the average change in usage associated with a 1
sq. ft. increase in size?
e. What is the average change in usage associated with a
100 sq. ft. increase in size?
f. Would you use the model to predict mean usage for a 500
sq. ft. house? Why or why not?
16.3 Suppose that a simple linear regression model is
appropriate for describing the relationship between y =
house price (in dollars) and x = house size (in square feet)
for houses in a large city. The population regression line is
$y = 23{,}000 + 47x$ and $\sigma_e = 5000$.
a. What is the average change in price associated with one
extra square foot of space? With an additional 100 sq. ft.
of space?
b. Approximately what proportion of 1800 sq. ft. homes
would be priced over $110,000? Under $100,000?
Section 16.1 Exercise Set 2
16.4 Identify the following relationships as deterministic
or probabilistic:
a. The relationship between height at birth and height at one
year of age.
b. The relationship between a positive number and its
square root.
c. The relationship between temperature in degrees
Fahrenheit and degrees centigrade.
d. The relationship between adult shoe size and shirt size.
16.5 The flow rate in a device used for air quality measurement depends on the pressure drop x (inches of water) across
the device’s filter. Suppose that for x values between 5 and
20, these two variables are related according to the simple
linear regression model with population regression line
$y = -0.12 + 0.095x$.
a. What is the mean flow rate for a pressure drop of
10 inches? A drop of 15 inches?
b. What is the average change in flow rate associated with
a 1 inch increase in pressure drop? Explain.
16.6 The paper “Predicting Yolk Height, Yolk Width,
Albumen Length, Eggshell Weight, Egg Shape Index, Eggshell
Thickness, Egg Surface Area of Japanese Quails Using
Various Egg Traits as Regressors” (International Journal of
Poultry Science [2008]: 85–88) suggests that the simple
linear regression model is reasonable for describing the
relationship between y = eggshell thickness (in micrometers) and x = egg length (mm) for quail eggs. Suppose
that the population regression line is $y = 0.135 + 0.003x$
and that $\sigma_e = 0.005$. Then, for a fixed x value, y has a normal distribution with mean $0.135 + 0.003x$ and standard
deviation 0.005.
a. What is the mean eggshell thickness for quail eggs that
are 15 mm in length? For quail eggs that are 17 mm in
length?
b. What is the probability that a quail egg with a length of
15 mm will have a shell thickness that is greater than
0.18 mm?
c. Approximately what proportion of quail eggs of length
14 mm has a shell thickness of greater than 0.175? Less
than 0.178?
Additional Exercises
16.7 Tom and Ray are managers of electronics stores with
slightly different pricing strategies for USB drives. In Tom’s
store, customers pay the same amount, c, for each USB
drive. In Ray’s store, it is a little more exciting. The customer pays an up-front cost of $1.00. Ray charges the same
price per USB drive, c, but at the register the customer flips
a coin. If the coin lands heads up, the customer gets his or
her $1.00 back, plus another dollar off the total cost of the
USB drives purchased.
a. Which of these pricing strategies can be expressed as a
deterministic model?
b. Using mathematical notation, specify a model using
Tom’s pricing strategy that relates y = total cost to x =
number of USB drives purchased.
c. Using mathematical notation, specify a model using
Ray’s pricing strategy that relates y = total cost to x =
number of USB drives purchased.
d. Describe the distribution of e for the probabilistic model
described above. What is the mean of the distribution
of e? What is the standard deviation of e?
16.8 Identify the following relationships as deterministic or
probabilistic:
a. The relationship between the speed limit and a driver’s
speed.
b. The relationship between the price in dollars and the
price in Euros of an object.
c. The relationship between the number of pages and the
number of words in a text book.
d. The relationship between the possible numbers of pennies and nickels in a pile if no other coins are in the
pile and the amount of money in the pile is $3.00.
16.9 Hormone replacement therapy (HRT) is thought to
increase the risk of breast cancer. The accompanying data
on x = percent of women using HRT and y = breast cancer
incidence (cases per 100,000 women) for a region in
Germany for 5 years appeared in the paper “Decline in Breast
Cancer Incidence after Decrease in Utilisation of Hormone
Replacement Therapy” (Epidemiology [2008]: 427–430). The
authors of the paper used a simple linear regression model to
describe the relationship between HRT use and breast cancer
incidence.
HRT Use                   46.30   40.60   39.50   36.60   30.00
Breast Cancer Incidence   103.30  105.00  100.00  93.80   83.50
a. What is the equation of the estimated regression line?
b. What is the estimated average change in breast cancer
incidence associated with a 1 percentage point increase
in HRT use?
c. What would you predict the breast cancer incidence to be
in a year when HRT use was 40%?
d. Should you use this regression model to predict breast
cancer incidence for a year when HRT use was 20%?
Explain.
e. Calculate and interpret the value of $r^2$.
f. Calculate and interpret the value of $s_e$.
16.10 Consider the accompanying data on x = advertising
share and y = market share for a particular brand of soft drink
during 10 randomly selected years.
x 0.103 0.072 0.071 0.077 0.086 0.047 0.060 0.050 0.070 0.052
y 0.135 0.125 0.120 0.086 0.079 0.076 0.065 0.059 0.051 0.039
a. Construct a scatterplot for these data. Do you
think the simple linear regression model would be
appropriate for describing the relationship between
x and y?
b. Calculate the equation of the estimated regression line
and use it to obtain the predicted market share when the
advertising share is 0.09.
c. Compute $r^2$. How would you interpret this value?
d. Calculate a point estimate of $\sigma_e$. How many degrees of
freedom are associated with this estimate?
16.11 The authors of the paper “Weight-Bearing Activity
During Youth Is a More Important Factor for Peak Bone
Mass than Calcium Intake” (Journal of Bone and Mineral
Research [1994]: 1089–1096) studied a number of
variables they thought might be related to bone mineral
density (BMD). The accompanying data on x = weight
at age 13 and y = bone mineral density at age 27 are
consistent with summary quantities for women given in the
paper.

Weight (kg)   BMD (g/cm^2)
54.4          1.15
59.3          1.26
74.6          1.42
62.0          1.06
73.7          1.44
70.8          1.02
66.8          1.26
66.7          1.35
64.7          1.02
71.8          0.91
69.7          1.28
64.7          1.17
62.1          1.12
68.5          1.24
58.3          1.00

The accompanying computer output is from JMP.

[Figure: scatterplot of BMD (g/cm^2) versus Weight (kg) with linear fit]

Linear Fit
BMD (g/cm^2) = 0.5584011 + 0.0094363*Weight (kg)

Summary of Fit
RSquare                      0.121081
RSquare Adj                  0.053472
Root Mean Square Error       0.155141
Mean of Response             1.18
Observations (or Sum Wgts)   15

Lack of Fit
Analysis of Variance

Parameter Estimates
Term          Estimate    Std Error   t Ratio   Prob>|t|
Intercept     0.5584011   0.466212    1.20      0.2524
Weight (kg)   0.0094363   0.007051    1.34      0.2038

a. What percentage of observed variation in BMD at age 27
can be explained by the simple linear regression model?
b. Give a point estimate of $\sigma_e$ and interpret this estimate.
c. Give an estimate of the average change in BMD associated with a 1 kg increase in weight at age 13.
d. Compute a point estimate of the mean BMD at age 27 for
women whose age 13 weight was 60 kg.
16.12 The production of pups and their survival are the most
significant factors contributing to gray wolf population growth.
The causes of early pup mortality are unknown and difficult
to observe. The pups are concealed within their dens for 3
weeks after birth, and after they emerge it is difficult to confirm
their parentage. Researchers recently used portable ultrasound
equipment to investigate some factors related to reproduction (“Diagnosing Pregnancy, in Utero Litter Size, and Fetal
Growth with Ultrasound in Wild, Free-Ranging Wolves,” Journal
of Mammalogy [2006]: 85–92). A scatterplot of y = embryonic
sac diameter (in cm) and x = gestational age
(in days) is shown below. Computer output from a regression
analysis is also given.

[Figure: Bivariate Fit of Emb Ves Diam (cm) By Gest Age (days); scatterplot with linear fit]

Linear Fit
Emb Ves Diam (cm) = –3.497279 + 0.1903121*Gest Age (days)

Summary of Fit
RSquare                      0.792803
RSquare Adj                  0.780615
Root Mean Square Error       0.450587
Mean of Response             2.482526
Observations (or Sum Wgts)   19

Lack of Fit
Analysis of Variance

Parameter Estimates
Term              Estimate     Std Error   t Ratio   Prob>|t|
Intercept         –3.497279    0.748605    –4.67     0.0002*
Gest Age (days)   0.1903121    0.023597    8.07      <.0001*

a. What is the equation of the estimated regression line?
b. What is the estimated embryonic sac diameter for a
gestational age of 30 days?
c. What is the average change in sac diameter associated
with a 1-day increase in gestational age?
d. What is the average change in sac diameter associated
with a 5-day increase in gestational age?
e. Would you use this model to predict the mean embryonic sac diameter for all gestational ages from conception
to birth? Why or why not?
Section 16.2 Inferences Concerning the Slope of the Population Regression Line
The slope coefficient $\beta$ in the simple linear regression model represents the average or expected
change in the response variable y that is associated with a 1-unit increase in the value of the
independent variable x. For example, consider x = the size of a house (in square feet) and y =
selling price of the house. If the simple linear regression model is appropriate for the population
of houses in a particular city, $\beta$ would be the average increase in selling price associated with a
1-square-foot increase in size. As another example, if x = amount of time per week a computer
system is used and y = the resulting annual maintenance expense, then $\beta$ would be the expected
change in expense associated with using the computer system one additional hour per week.
Because the value of $\beta$ is almost always unknown, it must be estimated from sample
data. The slope of the least squares regression line, b, provides an estimate. In some situations, the value of the statistic b may vary greatly from sample to sample, and the value of b
computed from a single sample may be quite different from the value of the population slope,
$\beta$. In other situations, almost all possible samples result in a value of b that is quite close to
$\beta$. The sampling distribution of b provides information about the behavior of this statistic.
AP* exam tip
Inferences about the slope
of the population regression line are based on the
sampling distribution of
the statistic b. The properties given here depend on
the four basic assumptions
of the linear regression
model being met. In Section 16.3, you will see how
to determine if these assumptions are reasonable.
Properties of the Sampling Distribution of b
When the four basic assumptions of the simple linear regression model are satisfied:
1. The mean value of the sampling distribution of b is $\beta$. That is, $\mu_b = \beta$, so the
sampling distribution of b is always centered at the value of $\beta$. This means that b
is an unbiased statistic for estimating $\beta$.
2. The standard deviation of the sampling distribution of the statistic b is
$$\sigma_b = \frac{\sigma_e}{\sqrt{\sum (x - \bar{x})^2}}$$
3. The statistic b has a normal distribution (a consequence of the model assumption that the random deviation e is normally distributed).
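The first two properties can be checked informally by simulation. The following is a minimal sketch, not part of the original text: the population values `ALPHA`, `BETA`, `SIGMA_E`, and the x values are hypothetical, chosen only for illustration. It repeatedly draws samples from the model y = α + βx + e, computes the least squares slope b for each sample, and compares the mean and standard deviation of the simulated slopes with $\beta$ and $\sigma_b$.

```python
import math
import random
import statistics


def least_squares_slope(xs, ys):
    """Return the slope b of the least squares line for paired data."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return s_xy / s_xx


def simulate_slopes(alpha, beta, sigma_e, xs, n_samples, seed=0):
    """Draw n_samples samples from y = alpha + beta*x + e; return their slopes."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_samples):
        # e is normal with mean 0 and standard deviation sigma_e at every x
        ys = [alpha + beta * x + rng.gauss(0.0, sigma_e) for x in xs]
        slopes.append(least_squares_slope(xs, ys))
    return slopes


# Hypothetical population values (assumptions made for this illustration only).
ALPHA, BETA, SIGMA_E = 1.0, 2.0, 1.5
XS = [float(x) for x in range(1, 11)]

slopes = simulate_slopes(ALPHA, BETA, SIGMA_E, XS, n_samples=5000)

# Theoretical standard deviation of b from property 2.
x_bar = statistics.fmean(XS)
sigma_b = SIGMA_E / math.sqrt(sum((x - x_bar) ** 2 for x in XS))

print("mean of simulated slopes:", round(statistics.fmean(slopes), 3))
print("sd of simulated slopes:  ", round(statistics.stdev(slopes), 3))
print("sigma_b from the formula:", round(sigma_b, 3))
```

With these settings the simulated mean should land close to the chosen slope and the simulated standard deviation close to the formula value, although the exact figures depend on the random seed.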
The fact that b is unbiased tells you that the sampling distribution is centered at the right
place, but it gives no information about variability. If $\sigma_b$ is large, the sampling distribution of
b will be quite spread out around $\beta$ and an estimate far from the value of $\beta$ could result. For
$$\sigma_b = \frac{\sigma_e}{\sqrt{\sum (x - \bar{x})^2}}$$
to be small, the numerator $\sigma_e$ should be small (little variability about the
population line) and/or the denominator $\sqrt{\sum (x - \bar{x})^2}$ should be large. Because $\sum (x - \bar{x})^2$ is a
measure of how much the observed x values spread out, $\beta$ tends to be more precisely estimated
when the x values in the sample are spread out rather than when they are close together. The
normality of the sampling distribution of b implies that the standardized variable
$$z = \frac{b - \beta}{\sigma_b}$$
has a standard normal distribution. However, inferential methods cannot be based on this
statistic, because the value of $\sigma_b$ is not known (the unknown $\sigma_e$ appears in the
numerator of $\sigma_b$). One way to proceed is to estimate $\sigma_e$ with $s_e$ to obtain an estimate of $\sigma_b$.
The estimated standard deviation of the statistic b is
$$s_b = \frac{s_e}{\sqrt{\sum (x - \bar{x})^2}}$$
AP* exam tip
For inferences about the
slope of the population regression line, df = n − 2.
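The effect of the spread of the x values on the precision of b can be illustrated numerically. This is a minimal sketch under stated assumptions: the value of `SIGMA_E` and both sets of x values are hypothetical, and the helper `slope_standard_deviation` is introduced here only for illustration. It evaluates $\sigma_b = \sigma_e / \sqrt{\sum (x - \bar{x})^2}$ for x values that are close together and for x values that are spread out.

```python
import math


def slope_standard_deviation(sigma_e, xs):
    """sigma_b = sigma_e / sqrt(sum of squared deviations of x from its mean)."""
    x_bar = sum(xs) / len(xs)
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return sigma_e / math.sqrt(s_xx)


SIGMA_E = 1.0  # hypothetical standard deviation about the population line

clustered = [4.0, 4.5, 5.0, 5.5, 6.0]    # x values close together
spread_out = [1.0, 3.0, 5.0, 7.0, 9.0]   # same center, four times the spread

sd_clustered = slope_standard_deviation(SIGMA_E, clustered)
sd_spread = slope_standard_deviation(SIGMA_E, spread_out)

print("sigma_b, clustered x values: ", round(sd_clustered, 4))
print("sigma_b, spread-out x values:", round(sd_spread, 4))
```

Because the second set of x values is four times as spread out as the first, its $\sigma_b$ is one fourth as large, even though $\sigma_e$ and the sample size are unchanged.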
When the four basic assumptions of the simple linear regression model are satisfied,
the probability distribution of the standardized variable
$$t = \frac{b - \beta}{s_b}$$
is the t distribution with df = n − 2.
In the same way that $t = \dfrac{\bar{x} - \mu}{s / \sqrt{n}}$ was used in Chapter 12 to develop a confidence interval
for $\mu$, the t variable in the preceding box can be used to obtain a confidence interval for $\beta$.
Confidence Interval for $\beta$
When the four basic assumptions of the simple linear regression model are satisfied, a confidence interval for $\beta$, the slope of the population regression line, has
the form
$$b \pm (t \text{ critical value})\, s_b$$
where the t critical value is based on df = n − 2. Appendix Table 3 gives critical
values corresponding to the most frequently used confidence levels.
The interval estimate of $\beta$ is centered at b and extends out from the center by an amount
that depends on the sampling variability of b. When $s_b$ is small, the interval is narrow, implying that the investigator has relatively precise knowledge of the value of $\beta$. Calculation of a
confidence interval for the slope of a population regression line is illustrated in Example 16.4.
In Section 7.2, you learned four key questions that guide the decision about what statistical inference method to consider in any particular situation. In Section 7.3, a five-step
process for estimation problems was introduced.
The four key questions of Section 7.2 were

Q   Question Type                     Estimation or hypothesis testing?
S   Study Type                        Sample data or experiment data?
T   Type of Data                      One variable or two? Categorical or numerical?
N   Number of Samples or Treatments   How many samples or treatments?

When the answers to these questions are
Q: estimation
S: sample data
T: two numerical variables
N: one sample
the method you will want to consider in a regression setting is the confidence interval for
the slope of a population regression line.
Once you have selected the confidence interval for the slope of a population regression line as the method you want to consider, because this is an estimation problem you
would follow the five-step process for estimation problems (EMC3).
Example 16.4 The Bison of Yellowstone Park
The dedicated work of conservationists for over 100 years has brought the bison in
Yellowstone National Park from near extinction to a herd of over 3,000 animals. This
recovery is a mixed blessing. Many bison have been exposed to the bacteria that cause
brucellosis, a disease that infects domestic cattle, and there are many domestic cattle herds
near Yellowstone. Because of concerns that free-ranging bison can infect nearby cattle, it
is important to monitor and manage the size of the bison population and, if possible, keep
bison from transmitting these bacteria to ranch cattle. The article “Reproduction and Survival
of Yellowstone Bison” (The Journal of Wildlife Management [2007]: 2365-2372) described a
large multiyear study of the factors that influence bison movement and herd size. The
researchers studied a number of environmental factors to better understand the relationship between bison reproduction and the environment. One factor thought to influence
reproduction is stress due to accumulated snow, which makes foraging more difficult for
the pregnant bison. Data from 1981–1997 on
y = spring calf ratio (SCR)
and
x = previous fall snow-water equivalent (SWE)
are shown in the accompanying table. Spring calf ratio is the ratio of calves to adults, a
measure of reproductive success. The researchers were interested in estimating the mean
change in spring calf ratio associated with each additional cm in snow-water equivalent.
Let’s answer the four key questions for this problem.
SCR    SWE      SCR    SWE
0.19   1,933    0.22   3,317
0.14   4,906    0.22   3,332
0.21   3,072    0.18   3,511
0.23   2,543    0.21   3,907
0.26   3,509    0.25   2,533
0.19   3,908    0.19   4,611
0.29   2,214    0.22   6,237
0.23   2,816    0.17   7,279
0.16   4,128
Q   Question Type                     Estimation or hypothesis testing?                Estimation
S   Study Type                        Sample data or experiment data?                  Sample data
T   Type of Data                      One variable or two? Categorical or numerical?   Two numerical variables
N   Number of Samples or Treatments   How many samples or treatments?                  One sample (regression)

The answers are estimation, sample data, two numerical variables, one sample. This
combination of answers suggests considering a confidence interval for the slope of a population regression line. You can now use the five-step process (EMC3) to estimate the slope
of the population regression line.
Step
Estimate
In this example, the value of $\beta$, the mean increase in spring calf ratio for each
additional 1 cm of snow-water equivalent, will be estimated.
Method
Because the answers to the four key questions are estimation, sample data,
two numerical variables, one sample, a confidence interval for $\beta$, the slope of
the population regression line, will be considered.
For this example, a 95% confidence level will be used.
Check
The four basic assumptions of the simple linear regression model need to be
met in order to use the confidence interval.
The investigators collected data from 17 successive years. To proceed, you
would need to assume that these years are representative of yearly circumstances at Yellowstone, and that each year’s reproduction and snowfall is
independent of previous years. You should keep this in mind when you get
to the step that involves interpretation.
A scatterplot of the data is shown here. The pattern in the plot looks linear
and the spread does not seem to be different for different values of x.
[Figure: scatterplot of SCR versus SWE]
A box plot of the residuals is also shown.
[Figure: boxplot of the residuals]
Because the boxplot is approximately symmetric and there are no
outliers, it is reasonable to think that the distribution of e is
approximately normal.
Calculate
JMP regression output is shown here:

Linear Fit
SCR = 0.2606561 − 0.0136639*SWE

Summary of Fit
RSquare                      0.257644
RSquare Adj                  0.208153
Root Mean Square Error       0.033513
Mean of Response             0.209412
Observations (or Sum Wgts)   17

Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   0.2606561   0.023885    10.91     <.0001*
SWE         −0.013664   0.005989    −2.28     0.0375*

The Std Error entry for SWE, 0.005989, is $s_b$.
df = n − 2 = 17 − 2 = 15
The t critical value for a 95% confidence level and df = 15 is 2.13.
$$b \pm (t \text{ critical value})\, s_b = -0.0137 \pm (2.13)(0.00599) = (-0.0265, -0.0009)$$
Communicate Results
Confidence interval:
You can be 95% confident that the true average change in spring calf
ratio associated with an increase of 1 cm in the snow-water equivalent is
between −0.0265 and −0.0009.
Confidence level:
The method used to construct this interval estimate is successful in
capturing the actual value of the slope of the population regression line about
95% of the time.
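The Calculate step of Example 16.4 can also be carried out directly from the data. The sketch below is an illustration rather than the method used in the text, which relies on JMP output. Two assumptions are made: the SCR and SWE values are paired in the order listed in the table, and the printed SWE values are divided by 1000 (1,933 becomes 1.933), because that scaling is the one that appears to reproduce the slope and standard error in the JMP output. The t critical value of 2.13 is taken from the example rather than computed, and the function name `slope_confidence_interval` is hypothetical.

```python
import math

# Data from the Example 16.4 table, paired in the order listed.
# ASSUMPTION: printed SWE values are divided by 1000 (1,933 -> 1.933).
scr = [0.19, 0.14, 0.21, 0.23, 0.26, 0.19, 0.29, 0.23, 0.16,
       0.22, 0.22, 0.18, 0.21, 0.25, 0.19, 0.22, 0.17]
swe = [1.933, 4.906, 3.072, 2.543, 3.509, 3.908, 2.214, 2.816, 4.128,
       3.317, 3.332, 3.511, 3.907, 2.533, 4.611, 6.237, 7.279]


def slope_confidence_interval(xs, ys, t_critical):
    """Return b, s_b, and the interval b +/- (t critical value) * s_b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx                      # slope of the least squares line
    a = y_bar - b * x_bar                # intercept of the least squares line
    ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s_e = math.sqrt(ss_resid / (n - 2))  # estimate of sigma_e, df = n - 2
    s_b = s_e / math.sqrt(s_xx)          # estimated standard deviation of b
    margin = t_critical * s_b
    return b, s_b, (b - margin, b + margin)


T_CRITICAL = 2.13  # 95% confidence, df = 15, as given in the example
b, s_b, (lower, upper) = slope_confidence_interval(swe, scr, T_CRITICAL)

print("b   =", round(b, 6))
print("s_b =", round(s_b, 6))
print("95% confidence interval:", (round(lower, 4), round(upper, 4)))
```

Because this sketch carries full precision through the calculation, its endpoints can differ from the example's in the last reported digit; the example rounds b and $s_b$ before forming the interval.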
Hypothesis Tests Concerning $\beta$
Hypotheses about $\beta$ can be tested using a t test similar to the t tests introduced in Chapters 12
and 13. The null hypothesis states that $\beta$ has a specified hypothesized value. The t statistic
results from standardizing b, the estimate of $\beta$, under the assumption that $H_0$ is true. When
$H_0$ is true, the sampling distribution of this statistic is the t distribution with df = n − 2.
Hypothesis Test for the Slope of the Population Regression Line, $\beta$
Appropriate when the four basic assumptions of the simple linear regression
model are reasonable:
1. The distribution of e at any particular x value has mean value 0 (that is,
$\mu_e = 0$).
2. The standard deviation of e is $\sigma_e$, which does not depend on x.
3. The distribution of e at any particular x value is normal.
4. The random deviations $e_1, e_2, e_3, \ldots, e_n$ associated with different observations are
independent of one another.
When these conditions are met, the following test statistic can be used:
$$t = \frac{b - \beta_0}{s_b}$$
where $\beta_0$ is the hypothesized value from the null hypothesis.
Form of the null hypothesis: $H_0: \beta = \beta_0$
When the assumptions of the simple linear regression model are reasonable and
the null hypothesis is true, the t test statistic has a t distribution with df = n − 2.
Associated P-value:
When the alternative hypothesis is…   The P-value is…
$H_a: \beta > \beta_0$                Area to the right of the computed t under the appropriate t curve
$H_a: \beta < \beta_0$                Area to the left of the computed t under the appropriate t curve
$H_a: \beta \ne \beta_0$              2(area to the right of t) if t is positive, or
                                      2(area to the left of t) if t is negative
This test is a method you should consider when the answers to the four key questions
are hypothesis testing, sample data, two numerical variables, one sample. You would carry
out this test using the five-step process for hypothesis testing problems (HMC3).
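The test statistic and P-value rules in the box above can be sketched in code. This is an illustration under stated assumptions, not the text's procedure, which reads the P-value from computer output or a table. The function names are hypothetical, and the area under the t curve is approximated numerically with Simpson's rule so that only the standard library is needed. The approach is suitable for moderate df; `math.gamma` overflows for very large df.

```python
import math


def t_density(t, df):
    """Density of the t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def upper_tail_area(t, df, steps=2000):
    """Area under the t curve to the right of t (Simpson's rule, steps even)."""
    a = abs(t)
    h = a / steps
    total = t_density(0.0, df) + t_density(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    area_0_to_a = total * h / 3
    right_of_abs_t = 0.5 - area_0_to_a       # uses symmetry about 0
    return right_of_abs_t if t >= 0 else 1.0 - right_of_abs_t


def slope_t_test(b, s_b, n, beta_0=0.0, alternative="two-sided"):
    """Return (t, df, P-value) for a test of H0: beta = beta_0."""
    df = n - 2
    t = (b - beta_0) / s_b
    right = upper_tail_area(t, df)
    if alternative == "greater":             # Ha: beta > beta_0
        p_value = right
    elif alternative == "less":              # Ha: beta < beta_0
        p_value = 1.0 - right
    else:                                    # Ha: beta != beta_0
        p_value = 2 * min(right, 1.0 - right)
    return t, df, p_value


# Usage with the slope and standard error from the Example 16.4 JMP output.
t, df, p_value = slope_t_test(b=-0.013664, s_b=0.005989, n=17)
print("t =", round(t, 2), " df =", df, " P-value =", round(p_value, 4))
```

For the Example 16.4 values, the result should agree closely with the t Ratio and Prob>|t| entries that JMP reports for SWE.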
Inference for a population slope generally focuses on two questions:
(1) Is the population slope different from zero?
(2) What are plausible values for the population slope?
The question of plausible values can be addressed by calculating a confidence interval for
the population slope. The question of whether a population slope is equal to zero can be
answered by using the hypothesis testing procedure with a null hypothesis $H_0: \beta = 0$. This
test of $H_0: \beta = 0$ versus $H_a: \beta \ne 0$ is called the model utility test for simple linear regression.
The default computer output for inference for a regression slope is for the model utility test.
When the null hypothesis of the model utility test is true, the population regression
line is a horizontal line, and the value of y in the simple linear regression model does not
depend on x. That is,
$$\begin{aligned} y &= \alpha + \beta x + e \\ &= \alpha + 0x + e \\ &= \alpha + e \end{aligned}$$
If $\beta$ is in fact equal to 0, knowledge of x will be of no use; it will have no “utility”
for predicting y. On the other hand, if $\beta$ is different from 0, there is a useful linear relationship between x and y, and knowledge of x is useful for predicting y. This is illustrated by
the scatterplots in Figure 16.10.
[Figure 16.10: scatterplots of y versus x for (a) $\beta = 0$ (slope = 0) and (b) $\beta \ne 0$ (nonzero slope)]
The Model Utility Test for Simple Linear Regression
The model utility test for simple linear regression is the test of
$H_0: \beta = 0$
versus
$H_a: \beta \ne 0$
The null hypothesis specifies that there is no useful linear relationship between
x and y, whereas the alternative hypothesis specifies that there is a useful linear
relationship between x and y. If $H_0$ is rejected, you can conclude that the simple
linear regression model is useful for predicting y.
The test statistic is the t ratio
$$t = \frac{b - 0}{s_b} = \frac{b}{s_b}$$
It is recommended that the model utility test be carried out before using the estimated
regression line to make inferences.
Example 16.5 The British (Musical) Invasion
Have you experienced a sudden flood of memory when scanning from station to station on
your car radio and recognized a song from your past? Perhaps you could remember the title
of the song, the artist, and even when the song was released. From a seemingly small amount
of information you were able to recover a great deal of the song’s context from memory. The
article “Plink: ‘Thin slices’ of Music” (Krumhansl, C. Music Perception [2010]:337-354) describes
a study of this phenomenon. The investigator compiled a list of songs from Rolling Stone,
Billboard, and Blender lists of songs plus some recent songs familiar to college students.
Twenty-three college students were then exposed to 56 clips of songs. Most of these students
had had musical training, and they listened to popular music for an average of 21.7 hours
per week. After hearing three short clips from a song (only 400 ms in duration), the students
were asked in what year each of the songs was released. The accompanying table shows the
actual release year and the average of the release years given by the students. The actual
release years ranged from 1965 (The Beatles, “Help”) to 2008 (Katy Perry, “I Kissed a Girl”).

Actual and Judged Release Years
Actual    Judged     Actual    Judged     Actual    Judged     Actual    Judged
Release   Release    Release   Release    Release   Release    Release   Release
1998      1997.2     1976      1983.3     1976      1988.0     1970      1985.4
1967      1973.7     2008      1995.0     2006      1996.7     1975      1985.9
1998      1996.3     1971      1979.8     1974      1985.4     1991      1993.3
1999      1993.3     1965      1976.8     2007      1999.8     2008      1995.4
1983      1985.4     1967      1975.0     1976      1987.2     1965      1977.6
1982      1988.0     1971      1978.0     1974      1977.6     1987      1990.7
1965      1970.2     1967      1978.0     1970      1982.8     1975      1986.3
1991      1992.8     1984      1983.3     1971      1976.3     1968      1986.7
1983      1984.1     1984      1989.8     1999      1988.5     1987      1988.0
1976      1979.3     1968      1976.7     1997      1994.1     2008      1990.2
1971      1975.4     1965      1978.5     2006      1995.4     1982      1991.1
1981      1984.6     1965      1977.2     1981      1989.3     1979      1983.7
1967      1973.7     1979      1986.7     2008      1993.7     2000      1989.8
2007      1997.2     1997      1996.3     1965      1981.1     2000      1991.1

Is there a relationship between the judged and actual release year for these songs? A
scatterplot of the data (Figure 16.11) suggests that there is a linear relation between these
two variables, but this can be confirmed using the model utility test.
With x = actual release year and y = judged release year, the equation of the estimated
regression line is $\hat{y} = 1095 + 0.449x$. The five-step process for hypothesis testing
can be used to carry out the model utility test.
[Figure 16.11: Scatterplot of judged release year versus actual release year]
Process Step
H Hypotheses
In the model utility test, the null hypothesis is that there is no useful relationship between the actual and the judged
release year: $H_0: \beta = 0$.
The alternative hypothesis specifies that there is a useful relationship: $\beta \ne 0$.
Hypotheses:
Null hypothesis: $H_0: \beta = 0$
Alternative hypothesis: $H_a: \beta \ne 0$
M Method
Because the answers to the four key questions are hypothesis testing, sample data, two numerical variables in a
regression setting and one sample, a hypothesis test for the slope of a population regression line will be considered.
The test statistic for this test is
$$t = \frac{b - 0}{s_b} = \frac{b}{s_b}$$
The value of 0 in the test statistic is the hypothesized value from the null hypothesis.
For this example, a significance level of 0.05 will be used.
Significance level:
$\alpha = 0.05$
C Check
In Section 16.3, you will see how to check to see if the four assumptions of the simple linear regression model
are reasonable. For this example, you can assume that these assumptions are reasonable and proceed with the
model utility test.
C Calculate
JMP output is shown here:

Linear Fit
Judged Release = 1095.1525 + 0.449281*Actual Release

Summary of Fit
RSquare                      0.771
RSquare Adj                  0.766759
Root Mean Square Error       3.59844
Mean of Response             1986.013
Observations (or Sum Wgts)   56

Lack of Fit
Analysis of Variance

Parameter Estimates
Term             Estimate    Std Error   t Ratio   Prob>|t|
Intercept        1095.1525   66.07159    16.58     <.0001*
Actual Release   0.449281    0.033321    13.48     <.0001*

The Std Error entry for Actual Release, 0.033321, is $s_b$.
Test statistic:
$$t = \frac{b - 0}{s_b} = \frac{0.449 - 0}{0.0333} = 13.48$$
Associated P-value:
P-value = twice the area under the t curve to the right of 13.48
= 2P(t > 13.48)
≈ 0
C Communicate Results
Because the P-value is less than the selected significance level, the null hypothesis is rejected.
Decision: Reject H0.
Conclusion: The sample data provide convincing evidence that there is a useful linear relationship between the
actual release year and the judged release year.
Because the model utility test confirms that there is a useful linear relationship
between judged release year and actual release year, it would be reasonable to use the
estimated regression model to predict the judged release year for a given song based on
its actual release year. Of course, before you do this, you would also want to evaluate the
accuracy of predictions by looking at the value of $s_e$.
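As a consistency check on computer output like that in Example 16.5, the t ratio for the model utility test can be computed two ways. The sketch below is illustrative only. The first calculation is the ratio $b / s_b$ from the Parameter Estimates table. The second uses the identity $t^2 = (n - 2)\, r^2 / (1 - r^2)$, which is not stated in this section; it is a standard algebraic consequence of the formulas for b, $s_b$, and $r^2$ in simple linear regression, and is introduced here as an assumption of the example. The function name `t_from_r_squared` is hypothetical.

```python
import math


def t_from_r_squared(r_squared, n):
    """Magnitude of the model utility t ratio computed from r^2 and n."""
    return math.sqrt((n - 2) * r_squared / (1 - r_squared))


# Values read from the JMP output in Example 16.5.
b = 0.449281          # Estimate for Actual Release
s_b = 0.033321        # Std Error for Actual Release
r_squared = 0.771     # RSquare
n = 56                # Observations

t_from_output = b / s_b
t_from_r2 = t_from_r_squared(r_squared, n)

print("t from b / s_b:    ", round(t_from_output, 2))
print("t from r^2 and n:  ", round(t_from_r2, 2))
```

The two values should agree up to rounding in the reported output. The identity gives only the magnitude of t; the sign of t is the sign of b.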
When $H_0: \beta = 0$ cannot be rejected using the model utility test at a reasonably small significance level, the search for a useful model must continue. One possibility is to relate y
to x using a nonlinear model — an appropriate strategy if the scatterplot shows curvature.
Section 16.2 Exercises
Each exercise set assesses the following chapter learning objectives: C2, M3, M4, M5, M6, P1, P2
Section 16.2 Exercise Set 1
16.13 The standard deviation of the errors, $\sigma_e$, is an important part of the linear regression model.
a. What is the relationship between the value of $\sigma_e$ and the
value of the test statistic in a test of hypotheses about $\beta$?
b. What is the relationship between the value of $\sigma_e$ and the
width of a confidence interval for $\beta$?
16.14 A journalist is reporting about some research on
appropriate amounts of sleep for people 9 to 19 years of
age. In that research, a linear regression model is used to
describe the relationship between alertness and number of
hours of sleep the night before. The researchers reported a
95% confidence interval, but newspapers usually report an
estimate and a margin of error.
a. In order to calculate a margin of error from the reported
confidence interval, what additional conditions, if any,
need to be verified?
b. In order to calculate a margin of error from the reported
confidence interval, what additional information, if any,
is needed?
16.15 A nursing student has completed his final project, and
is preparing for a meeting with his project advisor. The subject
of his project was the relationship between systolic blood pressure (SBP) and body mass index (BMI). The last time he met
with his advisor he had completed his measurements, but only
entered half his data into his statistical software. For the data he
had entered, the necessary conditions for inference for $\beta$ were
met. In a short paragraph, explain, using appropriate statistical
terminology, which of the conditions below must be rechecked.
1. The standard deviation of e is the same for all values of x.
2. The distribution of e at any particular x value is normal.
16.16 Consider the accompanying data on x = research
and development expenditure (thousands of dollars) and y =
growth rate (% per year) for eight different industries.

x   2024   5038   905    3572   1157   327     378    191
y   1.90   3.96   2.44   0.88   0.37   −0.90   0.49   1.01

a. Would a simple linear regression model provide useful
information for predicting growth rate from research and
development expenditure? Use a .05 level of significance.
b. Use a 90% confidence interval to estimate the average
change in growth rate associated with a $1000 increase in
expenditure. Interpret the resulting interval.
16.17 The paper “The Effects of Split Keyboard Geometry
on Upper Body Postures” (Ergonomics [2009]: 104–111)
describes a study to determine the effects of several keyboard characteristics on typing speed. One of the variables
considered was the front-to-back surface angle of the
keyboard. Minitab output resulting from fitting the simple
linear regression model with x = surface angle (degrees)
and y = typing speed (words per minute) is given below.

Regression Analysis: Typing Speed versus Surface Angle
The regression equation is
Typing Speed = 60.0 + 0.0036 Surface Angle

Predictor       Coef      SE Coef   T        P
Constant        60.0286   0.2466    243.45   0.000
Surface Angle   0.00357   0.03823   0.09     0.931

S = 0.511766   R-Sq = 0.3%   R-Sq(adj) = 0.0%

Analysis of Variance
Source           DF   SS       MS       F      P
Regression       1    0.0023   0.0023   0.01   0.931
Residual Error   3    0.7857   0.2619
Total            4    0.7880

a. Suppose that the basic assumptions of the simple linear
regression model are met. Carry out a hypothesis test
to decide if there is a useful linear relationship between
x and y.
b. Are the values of $s_e$ and $r^2$ consistent with the conclusion
from Part (a)? Explain.
16.18 Do taller adults make more money? The authors
of the paper “Stature and Status: Height, Ability, and Labor
Market Outcomes” (Journal of Political Economy [2008]:
499–532) investigated the association between height and
earnings. They used the simple linear regression model to
describe the relationship between x = height (in inches) and
y = log(weekly gross earnings in dollars) in a very large
sample of men. The logarithm of weekly gross earnings was
used because this transformation resulted in a relationship
that was approximately linear. The paper reported that the
slope of the estimated regression line was b = 0.023 and
the standard deviation of b was $s_b = 0.004$. Carry out a
hypothesis test to decide if there is convincing evidence of a
useful linear relationship between height and the logarithm
of weekly earnings. You can assume that the basic assumptions of the simple linear regression model are met.
16.19 The effects of grazing animals on grasslands have
been the focus of numerous investigations by ecologists.
One such study, reported in “The Ecology of Plants, Large
Mammalian Herbivores, and Drought in Yellowstone National
Park” (Ecology [1992]: 2043–2058), proposed using the simple
linear regression model to relate y = green biomass concentration (g/cm³) to x = elapsed time since snowmelt (days).
a. The estimated regression equation was given as $\hat{y} = 106.3 - 0.640x$.
What is the estimate of average change in
biomass concentration associated with a 1-day increase
in elapsed time?
b. What value of biomass concentration would you predict
when elapsed time is 40 days?
c. The sample size was n 5 58, and the reported value of
the coefficient of determination was 0.470. What does
this tell you about the linear relationship between the two
variables?
85241_ch16_ptg01.indd 757
757
Section 16.2
Exercise Set 2
16.20 Consider a test of hypotheses about β, the population
slope in a linear regression model.
a. If you reject the null hypothesis, β = 0, what does this
mean in terms of a linear relationship between x and y?
b. If you fail to reject the null hypothesis, β = 0, what does
this mean in terms of a linear relationship between x and y?
16.21 Researchers studying pleasant touch sensations measured the firing frequency (impulses per second) of nerves that
were stimulated by a light brushing stroke on the forearm and
also recorded the subject’s numerical rating of how pleasant
the sensation was. The accompanying data was read from a
graph in the paper “Coding of Pleasant Touch by Unmyelinated
Afferents in Humans” (Nature Neuroscience, April 12, 2009).
Firing Frequency   Pleasantness Rating
23                 0.2
24                 1.0
22                 1.2
25                 1.2
27                 1.0
28                 2.0
34                 2.3
33                 2.2
36                 2.4
34                 2.8
a. Estimate the mean change in pleasantness rating associated with an increase of 1 impulse per second in firing
frequency using a 95% confidence interval. Interpret the
resulting interval.
b. Carry out a hypothesis test to decide if there is convincing
evidence of a useful linear relationship between firing
frequency and pleasantness rating.
16.22 The largest commercial fishing enterprise in the
southeastern United States is the harvest of shrimp. In a
study described in the paper “Long-term Trawl Monitoring
of White Shrimp, Litopenaeus setiferus (Linnaeus), Stocks
within the ACE Basin National Estuarine Research Reserve,
South Carolina” ( Journal of Coastal Research [2008]:193-199),
researchers monitored variables thought to be related to the
abundance of white shrimp. One variable the researchers
thought might be related to abundance is the amount of oxygen in the water. The relationship between mean catch per tow
of white shrimp and oxygen concentration was described by
fitting a regression line using data from ten randomly selected
offshore sites. (The “catch” per tow is the number of shrimp
caught in a single outing.) Computer output is shown below.
The regression equation is
Mean catch per tow = −5859 + 97.2 O2 Saturation
Predictor       Coef     SE Coef   T       P
Constant        −5859    2394      −2.45   0.040
O2 Saturation   97.22    34.63     2.81    0.023
S = 481.632   R-Sq = 49.6%   R-Sq(adj) = 43.3%
a. Is there convincing evidence of a useful linear relationship between the shrimp catch per tow and oxygen concentration density? Explain.
b. Would you describe the relationship as strong? Why or
why not?
c. Construct a 95% confidence interval for β and interpret
it in context.
d. What margin of error is associated with the confidence
interval in Part (c)?
16.23 The authors of the paper “Decreased Brain Volume in
Adults with Childhood Lead Exposure” (Public Library of Science
Medicine [May 27, 2008]: e112) studied the relationship between
childhood environmental lead exposure and a measure of
brain volume change in a particular region of the brain. Data
were given for x = mean childhood blood lead level (μg/dL)
and y = brain volume change (BVC, in percent). A subset of
data read from a graph that appeared in the paper was used to
produce the accompanying Minitab output.
Regression Analysis: BVC versus Mean Blood Lead Level
The regression equation is
BVC = −0.00179 − 0.00210 Mean Blood Lead Level
Predictor               Coef         SE Coef     T       P
Constant                −0.001790    0.008303    −0.22   0.830
Mean Blood Lead Level   −0.0021007   0.0005743   −3.66   0.000
Carry out a hypothesis test to decide if there is convincing
evidence of a useful linear relationship between x and y. You
can assume that the basic assumptions of the simple linear
regression model are met.
Additional Exercises
16.24 a. Explain the difference between the line y = α + βx
and the line ŷ = a + bx.
b. Explain the difference between β and b.
c. Let x* denote a particular value of the independent variable.
Explain the difference between α + βx* and a + bx*.
d. Explain the difference between σ and s_e.
16.25 What is the distinction between σ_e and s_e?
16.26 The accompanying data were read from a plot
(and are a subset of the complete data set) given in the
article “Cognitive Slowing in Closed-Head Injury” (Brain and
Cognition [1996]: 429–440). The data represent the mean
response times for a group of individuals with closed-head
injury (CHI) and a matched control group without head
injury on 10 different tasks. Each observation was based on
a different study, and used different subjects, so it is reasonable
to assume that the observations are independent.
a. Fit a linear regression model that would allow you to
predict the mean response time for those suffering a
closed-head injury from the mean response time on the
same task for individuals with no head injury.
b. Do the sample data support the hypothesis that there is a
useful linear relationship between the mean response time
for individuals with no head injury and the mean response
time for individuals with CHI? Test the appropriate
hypotheses using α = 0.05.
Mean Response Time
Study   Control   CHI
1       250       303
2       360       491
3       475       659
4       525       683
5       610       922
6       740       1044
7       880       1421
8       920       1329
9       1010      1481
10      1200      1815
16.27 The article “Photocharge Effects in Dye Sensitized
Ag[Br,I] Emulsions at Millisecond Range Exposures”
(Photographic Science and Engineering [1981]: 138–144) gave
the accompanying data on x = % light absorption and y =
peak photovoltage.
x   4.0    8.7    12.7   19.1   21.4   24.6   28.9   29.8   30.5
y   0.12   0.28   0.55   0.68   0.85   1.02   1.15   1.34   1.29
JMP output for these data is shown below.
Bivariate Fit of PeakPhotoVoltage By %LightAbsorption
[Scatterplot of PeakPhotoVoltage versus %LightAbsorption with the fitted line]
Linear Fit
PeakPhotoVoltage = −0.082594 + 0.0446485*%LightAbsorption
Summary of Fit
RSquare                      0.982731
RSquare Adj                  0.980264
Root Mean Square Error       0.061117
Mean of Response             0.808889
Observations (or Sum Wgts)   9
Analysis of Variance
Parameter Estimates
Term               Estimate    Std Error   t Ratio   Prob>|t|
Intercept          −0.082594   0.049093    −1.68     0.1364
%LightAbsorption   0.0446485   0.002237    19.96     <.0001*
a. What does the scatterplot suggest about the relationship
between the peak photovoltage and the percent of light
absorption?
b. What is the equation of the estimated regression line?
c. How much of the observed variation in peak photovoltage can be explained by the model relationship?
d. Predict peak photovoltage when percent absorption is 19.1,
and compute the value of the corresponding residual.
e. The authors claimed that there is a useful linear relationship between the two variables. Do you agree? Carry out
a formal test.
f. Give an estimate of the average change in peak photovoltage associated with a 1 percentage point increase in
light absorption. Your estimate should convey information about the precision of estimation.
16.28 In anthropological studies, an important characteristic of fossils is cranial capacity. Frequently skulls
are at least partially decomposed, so it is necessary to
use other characteristics to obtain information about
capacity. One measure that has been used is the length
of the lambda-opisthion chord. The article “Vertesszollos
and the Presapiens Theory” (American Journal of Physical
Anthropology [1971]) reported the accompanying data for
n = 7 Homo erectus fossils.
x (chord length in mm)   y (capacity in cm³)
78                       850
75                       775
78                       750
81                       975
84                       915
86                       1015
87                       1030
Suppose that from previous evidence, anthropologists had
believed that for each 1-mm increase in chord length, cranial
capacity would be expected to increase by 20 cm³. Do
these new experimental data provide convincing evidence
against this prior belief?
16.29 Suppose you are given the computer output
shown. You are interested in testing the null hypothesis
β = 1.0 versus an alternative hypothesis of β > 1.0.
Describe how you would use the given computer output
to test these hypotheses.
Linear Fit
y = 5.6452776 + 0.9797401*x
Summary of Fit
RSquare                      0.985289
RSquare Adj                  0.984954
Root Mean Square Error       12.48525
Mean of Response             0.791304
Observations (or Sum Wgts)   46
Lack of Fit
Analysis of Variance
Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   5.6452776   1.84302     3.06      0.0037*
x           0.9797401   0.018048    54.29     <.0001*
Section 16.3 Checking Model Adequacy
Section 16.2 introduced methods for estimating and testing hypotheses about β, the slope
in the simple linear regression model
\[ y = \alpha + \beta x + e \]
In this model, e represents the random deviation of a y value from the population
regression line α + βx. The methods presented in Section 16.2 require that some assumptions about the random deviations in the simple linear regression model be met in order
for inferences to be valid. These assumptions include:
1. At any particular x value, the distribution of e is normal.
2. At any particular x value, the standard deviation of e is σ_e, which is constant over all
values of x (that is, σ_e does not depend on x).
Inferences based on the simple linear regression model are still appropriate if model
assumptions are slightly violated (for example, mild skew in the distribution of e).
However, interpreting a confidence interval or the result of a hypothesis test when assumptions are seriously violated can result in misleading conclusions. For this reason, it is
important to be able to detect any serious violations.
Residual Analysis
If the deviations e1, e2, …, en from the population line were available, they could be examined for any inconsistencies with model assumptions. For example, a normal probability
plot of these deviations would suggest whether or not the normality assumption was plausible. However, because these deviations are
\[ e_1 = y_1 - (\alpha + \beta x_1), \quad \ldots, \quad e_n = y_n - (\alpha + \beta x_n) \]
they can be calculated only if α and β are known. In practice, this will almost never be the
case. Instead, diagnostic checks must be based on the residuals
\[ y_1 - \hat{y}_1 = y_1 - (a + b x_1), \quad \ldots, \quad y_n - \hat{y}_n = y_n - (a + b x_n) \]
which are the deviations from the estimated regression line. When all model assumptions are
met, the mean value of the residuals at any particular x value is 0. Any observation that gives
a large positive or negative residual should be examined carefully for any unusual circumstances, such as a recording error or nonstandard experimental condition. Identifying residuals with unusually large magnitudes is made easier by inspecting standardized residuals.
Recall that a quantity is standardized by subtracting its mean value (0 in this case) and
dividing by its actual or estimated standard deviation:
\[ \text{standardized residual} = \frac{\text{residual}}{\text{estimated standard deviation of residual}} \]
The value of a standardized residual tells you the distance (in standard deviations) of the
corresponding residual from its expected value, 0.
Because residuals at different x values have different standard deviations (depending on the value of x for that observation),¹ computing the standardized residuals can be
tedious. Fortunately, many computer regression programs provide standardized residuals.
Example 16.6 Revisiting the Elk
Example 16.3 introduced data on
x = chest girth (in cm)
and
y = weight (in kg)
for a sample of 19 Rocky Mountain elk. (See Example 16.3 for a more detailed description
of the study.)
Inspection of the scatterplot in Figure 16.12 suggests the data are consistent with the
assumptions of the simple linear regression model.
Figure 16.12 Scatterplot for the elk data (vertical axis: Weight (kg); horizontal axis: Girth (cm))
¹ The estimated standard deviation of the ith residual, y_i − ŷ_i, is
\[ s_e \sqrt{1 - \frac{1}{n} - \frac{(x_i - \bar{x})^2}{\sum (x - \bar{x})^2}} \]
The data, residuals, and the standardized residuals (computed using Minitab) are
given in Table 16.1. For the positive residual with the largest magnitude, 38.1397, the standardized
residual is 1.81294. That is, the residual is approximately 1.8 standard deviations above
its expected value of 0. This value is not particularly unusual in a sample of this size. Also
notice that for the negative residual with the largest magnitude, −38.2661, the standardized residual is −1.92313, still not unusual in a sample of this size. On the standardized
scale, no residual here is surprisingly large.
Table 16.1 Data, residuals, and standardized residuals for the elk data
Observation   Girth (cm) x   Weight (kg) y   ŷ         Residual   Standardized Residual
1             96             98              136.266   −38.2661   −1.92313
2             105            196             161.069   34.9314    1.68004
3             108            163             169.336   −6.3361    −0.30135
4             109            196             172.092   23.9080    1.13323
5             110            183             174.848   8.1522     0.38517
6             114            171             185.871   −14.8711   −0.69477
7             121            230             205.162   24.8380    1.14452
8             124            225             213.429   11.5705    0.53117
9             131            211             232.720   −21.7203   −0.99323
10            135            231             243.744   −12.7436   −0.58320
11            137            225             249.255   −24.2553   −1.11135
12            138            266             252.011   13.9889    0.64147
13            140            241             257.523   −16.5228   −0.75921
14            142            264             263.034   0.9655     0.04448
15            157            284             304.372   −20.3720   −0.97540
16            157            292             304.372   −12.3720   −0.59236
17            159            300             309.884   −9.8837    −0.47699
18            155            337             298.860   38.1397    1.81294
19            162            339             318.151   20.8488    1.01967
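The standardized residuals in Table 16.1 can be reproduced from the raw data. The sketch below is one way to do this in Python, assuming the standardized residual is the residual divided by the estimated standard deviation given in the footnote formula. The function name `standardized_residuals` is an illustrative choice, not something from the text or from Minitab.

```python
import math

# Elk data from Table 16.1: girth (cm) and weight (kg) for 19 elk.
girth = [96, 105, 108, 109, 110, 114, 121, 124, 131, 135,
         137, 138, 140, 142, 157, 157, 159, 155, 162]
weight = [98, 196, 163, 196, 183, 171, 230, 225, 211, 231,
          225, 266, 241, 264, 284, 292, 300, 337, 339]


def standardized_residuals(x, y):
    """Return (residuals, standardized residuals) for the least squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx                      # estimated slope
    a = y_bar - b * x_bar              # estimated intercept
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    # s_e uses n - 2 degrees of freedom in simple linear regression.
    s_e = math.sqrt(sum(r ** 2 for r in resid) / (n - 2))
    # Footnote formula: each residual has its own estimated standard deviation.
    std = [r / (s_e * math.sqrt(1 - 1 / n - (xi - x_bar) ** 2 / sxx))
           for r, xi in zip(resid, x)]
    return resid, std


resid, std_resid = standardized_residuals(girth, weight)
for i, (r, s) in enumerate(zip(resid, std_resid), start=1):
    print(f"{i:2d}  residual = {r:9.4f}  standardized = {s:8.5f}")
```

Observations with x values far from the mean girth get a smaller estimated standard deviation, which is why the first elk (girth 96 cm) has a standardized residual slightly larger in magnitude than its raw residual alone would suggest.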
​​
Next, consider the assumption of the normality of e’s. Figure 16.13 shows box plots of
the residuals and standardized residuals. The box plots are approximately symmetric and
there are no outliers, so the assumption of normally distributed errors seems reasonable.
Figure 16.13 Boxplots of residuals and standardized residuals for the elk data.
Notice that the boxplots of the residuals and standardized residuals are nearly identical. While it is preferable to work with the standardized residuals, if you do not have access
to a computer package or calculator that will produce standardized residuals, a plot of the
unstandardized residuals should suffice.
A normal probability plot of the standardized residuals (or the residuals) is
another way to assess whether it is reasonable to assume that e1, e2,..., en all come
from the same normal distribution. An advantage of the normal probability plot, shown
in Figure 16.14, is that the value of each residual can be seen, which provides more
information about the distribution. The pattern in the normal probability plot of the
standardized residuals and the pattern in the normal probability plot of the residuals
for the elk data are reasonably straight, confirming that the assumption of normality of
the error distribution is reasonable. Also notice that the pattern in both normal probability plots is similar, so you don’t need to construct both; either plot could be used.
Figure 16.14 Normal probability plots of residuals and standardized residuals for the elk data
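A normal probability plot pairs each ordered residual with a normal score, and a nearly straight pattern corresponds to a correlation close to 1 between the two. The sketch below illustrates this for the elk residuals. It assumes Blom's approximation for the normal scores, which may differ slightly from the scores a particular software package uses, and the helper names are illustrative.

```python
import math
from statistics import NormalDist

# Residuals for the elk data, copied from Table 16.1.
residuals = [-38.2661, 34.9314, -6.3361, 23.9080, 8.1522, -14.8711, 24.8380,
             11.5705, -21.7203, -12.7436, -24.2553, 13.9889, -16.5228, 0.9655,
             -20.3720, -12.3720, -9.8837, 38.1397, 20.8488]


def normal_scores(n):
    """Approximate expected standard normal order statistics (Blom's formula)."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]


def correlation(u, v):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    suv = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    suu = sum((a - u_bar) ** 2 for a in u)
    svv = sum((b - v_bar) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


ordered = sorted(residuals)
scores = normal_scores(len(ordered))
r = correlation(ordered, scores)

# The (residual, normal score) pairs are the points of the plot.
for res, z in zip(ordered, scores):
    print(f"{res:9.4f}  {z:7.3f}")
print(f"correlation between ordered residuals and normal scores: {r:.3f}")
```

Because standardizing changes each residual only by a positive factor that varies modestly from observation to observation, running the same calculation on the standardized residuals gives a very similar picture.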
Plotting the Residuals
AP* exam tip
When considering linear regression, your first step should be to study the scatterplot and a residual plot. These two plots provide important information about whether a linear model is appropriate.
A plot of the (x, residual) pairs is called a residual plot, and a plot of the (x, standardized
residual) pairs is a standardized residual plot. Residual and standardized residual plots
typically exhibit the same general shapes. If you are using a computer package or graphing
calculator that calculates standardized residuals, the standardized residual plot is recommended. If not, it is acceptable to use the unstandardized residual plot instead.
A standardized residual plot or a residual plot is often helpful in identifying unusual
or highly influential observations and in checking for violations of model assumptions. A
desirable plot is one that exhibits no particular pattern (such as curvature or a much greater
spread in one part of the plot than in another) and that has no point that is far removed from
all the others. A point in the residual plot falling far above or far below the horizontal line
at height 0 corresponds to a large residual, which can indicate unusual behavior, such as a
recording error, a nonstandard experimental condition, or an atypical experimental subject.
A point with an x value that differs greatly from others in the data set could have exerted
excessive influence in determining the estimated regression line.
A standardized residual plot, such as the one pictured in Figure 16.15(a) is desirable,
because no point lies much outside the horizontal band between −2 and 2 (so there is no
unusually large residual corresponding to an outlying observation). There is no point far to
the left or right of the others (which could indicate an observation that might greatly influence the estimated line), and there is no pattern to indicate that the model should somehow
be modified. When the plot has the appearance of Figure 16.15(b), the fitted model should
be changed to incorporate curvature (a nonlinear model).
The increasing spread from left to right in Figure 16.15(c) suggests that the
variance of y is not the same at each x value but rather increases with x. A straightline model may still be appropriate, but the best-fit line should be obtained by using
weighted least squares rather than ordinary least squares. This involves giving more
weight to observations in the region exhibiting low variability and less weight to
observations in the region exhibiting high variability. A specialized regression analysis
textbook or a statistician should be consulted for more information on using weighted
least squares.
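To make the idea of weighting concrete, here is a minimal sketch of weighted least squares for a straight line. The data values and the choice of weights (proportional to 1/x², which assumes the standard deviation of y grows in proportion to x) are hypothetical and chosen only for illustration. In practice the weights should come from knowledge of how the variability changes.

```python
def weighted_least_squares(x, y, w):
    """Fit y = a + b*x by minimizing sum of w_i * (y_i - a - b*x_i)**2."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted mean of x
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw   # weighted mean of y
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


# Hypothetical data in which the spread of y increases with x.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.6, 10.9, 11.0, 15.5, 14.2]

# Assumed weights: low-variability observations (small x) count more.
weights = [1 / xi ** 2 for xi in x]

a_wls, b_wls = weighted_least_squares(x, y, weights)
a_ols, b_ols = weighted_least_squares(x, y, [1.0] * len(x))  # equal weights = ordinary LS

print(f"weighted:  y-hat = {a_wls:.3f} + {b_wls:.3f} x")
print(f"ordinary:  y-hat = {a_ols:.3f} + {b_ols:.3f} x")
```

With equal weights the function reduces to ordinary least squares, so the two printed lines show how much the assumed weights change the fitted line for these illustrative data.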
The standardized residual plots of Figures 16.15(d) and 16.15(e) show an outlier (a
point with a large standardized residual) and a potentially influential observation, respectively. Consider deleting the observation corresponding to such a point from the data set
and refitting a line. Substantial changes in estimates and various other quantities are a
signal that a more careful analysis should be carried out before proceeding.
AP* exam tip
Notice the difference between an outlier (an observation that is far removed from the other observations in the y direction) and a potentially influential observation (an observation that is far removed from the other observations in the x direction).
FIGURE 16.15 Examples of residual plots: (a) satisfactory plot; (b) plot suggesting that a curvilinear regression model is needed; (c) plot indicating nonconstant variance; (d) plot showing a large residual; (e) plot showing a potentially influential observation.
Example 16.7 Snow Cover and Temperature
The article “Snow Cover and Temperature Relationships in North America and Eurasia”
( Journal of Climate and Applied Meteorology [1983]: 460–469) explored the relationship
between October–November continental snow cover (x, in millions of square kilometers)
and December–February temperature ( y, in °C). The following data refer to Eurasia during
the n = 13 time periods (1969–1970, 1970–1971, …, 1981–1982):
x       y       Standardized Residual
13.00   −13.5   −0.11
12.75   −15.7   −2.19
16.70   −15.5   −0.36
18.85   −14.7   1.23
16.60   −16.1   −0.91
15.35   −14.6   −0.12
13.90   −13.4   0.34
22.40   −18.9   −1.54
16.20   −14.8   0.04
16.70   −13.6   1.25
13.65   −14.0   −0.28
13.90   −12.0   1.54
14.75   −13.5   0.58
A simple linear regression analysis described in the article included r² = 0.52 and
r = −0.72, suggesting a significant linear relationship. This is confirmed by a model
utility test. The scatterplot and standardized residual plot are displayed in Figure 16.16.
There are no unusual patterns, although one standardized residual, −2.19, is a bit on the
large side. The most interesting feature is the observation (22.40, −18.9), corresponding
to a point far to the right of the others in these plots. This observation may have had a
substantial influence on the estimated regression line. The estimated slope when all 13
observations are included is b = −0.459, and s_b = 0.133. When the potentially influential observation is deleted, the estimate of β based on the remaining 12 observations is
b = −0.228. The change in slope is
change in slope = original b − new b
= −0.459 − (−0.228)
= −0.231
The change expressed in standard deviations is −0.231/0.133 = −1.74. Because b
has changed by substantially more than 1 standard deviation, the observation under consideration appears to be highly influential.
Figure 16.16 Plots for the data of Example 16.7: (a) scatterplot of TEMP versus SNOW; (b) standardized residual plot (STRESID versus SNOW), with the potentially influential observation labeled
In addition, r² based just on the 12 observations is only 0.13, and the t ratio for testing
β = 0 is not significant. Evidence for a linear relationship is much less conclusive in light
of this analysis. The investigators should seek a climatological explanation for the influential observation and collect more data, which could be used to find a more useful model.
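The influence check in this example can be carried out directly from the data: fit the line with all 13 observations, refit without the point (22.40, −18.9), and express the change in slope in units of s_b. The following Python sketch does this. The function name `fit_line` is illustrative, and the computed values agree with those quoted above only up to rounding.

```python
import math

# Snow cover (x) and temperature (y) data from Example 16.7.
snow = [13.00, 12.75, 16.70, 18.85, 16.60, 15.35, 13.90,
        22.40, 16.20, 16.70, 13.65, 13.90, 14.75]
temp = [-13.5, -15.7, -15.5, -14.7, -16.1, -14.6, -13.4,
        -18.9, -14.8, -13.6, -14.0, -12.0, -13.5]


def fit_line(x, y):
    """Return (slope b, estimated standard deviation s_b, r-squared)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    sse = syy - b * sxy                 # residual sum of squares
    s_e = math.sqrt(sse / (n - 2))
    return b, s_e / math.sqrt(sxx), 1 - sse / syy


b_all, sb_all, r2_all = fit_line(snow, temp)

# Refit after deleting the potentially influential observation.
drop = snow.index(22.40)
snow_reduced = snow[:drop] + snow[drop + 1:]
temp_reduced = temp[:drop] + temp[drop + 1:]
b_reduced, _, r2_reduced = fit_line(snow_reduced, temp_reduced)

change_in_sd = (b_all - b_reduced) / sb_all

print(f"all 13 observations: b = {b_all:.3f}, s_b = {sb_all:.3f}, r^2 = {r2_all:.2f}")
print(f"12 observations:     b = {b_reduced:.3f}, r^2 = {r2_reduced:.2f}")
print(f"change in slope, in standard deviations of b: {change_in_sd:.2f}")
```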
Example 16.8 Treadmill Time and Ski Time
The paper “Physiological Characteristics and Performance of Top U.S. Biathletes” (Medicine
and Science in Sports and Exercise [1995]: 1302–1310) describes a study of the relationship
between cardiovascular fitness (as measured by time to exhaustion running on a treadmill)
and performance on a 20-kilometer ski race. Data on
x = treadmill time to exhaustion (in minutes)
and
y = 20-km ski time (in minutes)
for 11 athletes are shown in Table 16.2. Residuals and standardized residuals are also given.
Is it reasonable to use the given data to construct a confidence interval or test hypotheses
about β, the average change in ski time associated with a 1-min increase in treadmill time?
It depends on whether the assumptions that the distribution of the deviations from the
population regression line at any fixed x is approximately normal and that the variance of
this distribution does not depend on x are reasonable. Constructing a normal probability
plot of the standardized residuals and a standardized residual plot will provide insight into
whether these assumptions are in fact reasonable.
AP* exam tip
Don’t forget to check assumptions. If you are used to checking assumptions before doing much in the way of calculation, it is sometimes easy to forget to check them in a regression setting. Be sure to step back and think about whether the four basic assumptions of the linear regression model are reasonable before making inferences about the population slope or using the estimated model to make predictions.
Table 16.2 Data, Residuals, and Standardized Residuals for Example 16.8
Observation   Treadmill   Ski Time   Residual   Standardized Residual
1             7.7         71.0       0.172      0.10
2             8.4         71.4       2.206      1.13
3             8.7         65.0       −3.494     −1.74
4             9.0         68.7       0.906      0.44
5             9.6         64.4       −1.994     −0.96
6             9.6         69.4       3.006      1.44
7             10.0        63.0       −2.461     −1.18
8             10.2        64.6       −0.394     −0.19
9             10.4        66.9       2.373      1.16
10            11.0        62.6       −0.527     −0.27
11            11.7        61.7       0.206      0.12
Figure 16.17 shows a normal probability plot of the standardized residuals and a standardized residual plot. The normal probability plot is quite straight, and the standardized
residual plot does not show evidence of any patterns or of increasing spread.
Figure 16.17 Plots for Example 16.8: (a) normal probability plot of standardized residuals; (b) standardized residual plot (standardized residual versus treadmill time)
Example 16.9 A new pediatric tracheal tube
The article “Appropriate Placement of Intubation Depth Marks in a New Cuffed,
Paediatric Tracheal Tube” (British Journal of Anaesthesia [2004]: 80-87) describes a study
of the use of tracheal tubes in newborns and infants. Newborns and infants have small
trachea, and there is little margin for error when inserting tracheal tubes. Using X-rays
of a large number of children aged 2 months to 14 years, the researchers examined
the relationships between appropriate trachea tube insertion depth and other variables
such as height, weight, and age. A scatterplot and a standardized residual plot constructed using data on the insertion depth and height of the children (both measured in
cm) are shown in Figure 16.18.
Figure 16.18 (a) Scatterplot for insertion depth vs. height data of Example 16.9; (b) standardized residual plot.
Residual plots like the one pictured in Figure 16.18(b) are desirable. No point lies
much outside the horizontal band between −2 and 2 (so there are no unusually large
residuals corresponding to outliers). There is no point far to the left or right of the others
(no observation that might be influential), and there is no pattern of curvature or differences in the variability of the residuals for different height values to indicate that the model
assumptions are not reasonable.
But consider what happens when the relationship between insertion depth and weight is
examined. A scatterplot of insertion depth and weight (kg) is shown in Figure 16.19(a), and a
standardized residual plot in Figure 16.19(b). While some curvature is evident in the original
scatterplot, it is even more clearly visible in the standardized residual plot. A careful inspection
of these plots suggests that along with curvature, the residuals may be more variable at larger
weights. When plots have this curved appearance and increasing variability in the residuals, the
linear regression model is not appropriate.
Figure 16.19 (a) Scatterplot for insertion depth vs. weight data of Example 16.9; (b) standardized residual plot.
Example 16.10 Looking for love in all the right... trees?
Treefrogs’ search for mating partners was examined in the article “The Cause of
Correlations Between Nightly Numbers of Male and Female Barking Treefrogs (Hyla gratiosa)
Attending Choruses” (Behavioral Ecology [2002]: 274–281). A lek, in the world of animal
behavior, is a cluster of males gathered in a relatively small area to exhibit courtship displays.
The “female preference” hypothesis asserts that females will prefer larger leks over smaller
leks, presumably because there are more males to choose from. The scatterplot and residual
plot in Figure 16.20 show the relationship between the number of females and the number
of males in observed leks of barking treefrogs. You can see that the unequal variance, which
is noticeable in the scatterplot, is even more evident in the residual plot. This indicates that
the assumptions of the linear regression model are not reasonable in this situation.
Figure 16.20 (a) Scatterplot for treefrog data of Example 16.10 (number of females versus number of males); (b) residual plot
Section 16.3 Exercises
Each exercise set assesses the following chapter learning objectives: M2, M7
Exercise Set 1
16.30 The following graphs are based on data from an
experiment to assess the effects of logging on a squirrel
population in British Columbia (“Effects of Logging Pattern
and Intensity on Squirrel Demography,” The Journal of
Wildlife Management [2007]: 2655–2663). Plots of land,
each nine hectares in area, were subjected to different
percentages of logging, and the squirrel population density
for each plot was measured after 3 years. The scatterplot,
residual plot, and a boxplot of the residuals are shown
here.
[Scatterplot of squirrels per plot versus %Logged; residual plot of residual versus %Logged; boxplot of the residuals]
Does it appear that the assumptions of the simple linear
regression model are plausible? Explain your reasoning in
a few sentences.
16.31 The clutch size (number of eggs laid) for turtles
is known to be influenced by body size, latitude, and
average environmental temperature. Researchers gathered data on Gopher tortoises in Okeeheelee County Park
in Florida to further understand the factors that affect
reproduction in these animals (“Geographic Variation in
Body and Clutch Size of Gopher Tortoises,” Copeia [2007]:
355–363). The scatterplot, residual plot, and a normal
probability plot of the residuals for the least squares
regression line with x = body length and y = clutch size
are shown here.
[Scatterplot of ClutchSize versus Length (mm); residual plot of residuals versus Length (mm); normal quantile plot of the residuals]
Does it appear that the assumptions of the simple linear
regression model are plausible? Explain your reasoning in
a few sentences.
16.32 Carbon aerosols have been identified as a contributing factor in a number of air quality problems.
In a chemical analysis of diesel engine exhaust, x =
mass (μg/cm²) and y = elemental carbon (μg/cm²) were
recorded (“Comparison of Solvent Extraction and Thermal
Optical Carbon Analysis Methods: Application to Diesel
Vehicle Exhaust Aerosol” Environmental Science Technology
[1984]: 231–234). The estimated regression line for this data
set is ŷ = 31 + 0.737x. The accompanying table gives the
observed x and y values and the corresponding standardized residuals.
x       y     St. resid.
164.2   181   2.52
156.9   156   0.82
109.8   115   0.27
111.4   132   1.64
87.0    96    0.08
161.8   170   1.72
230.9   193   −0.73
106.5   110   0.05
97.6    94    −0.77
79.7    77    −1.11
118.7   106   −1.07
248.8   204   −0.95
102.4   98    −0.73
64.2    76    −0.20
89.4    89    −0.68
108.1   102   −0.75
89.4    91    −0.51
76.4    97    0.85
131.7   128   0.00
100.8   88    −1.49
78.9    86    −0.27
387.8   310   −0.89
135.0   141   0.91
82.9    90    −0.18
117.9   130   1.05
a. Construct a standardized residual plot. Are there any
unusually large residuals? Do you think that there are any
influential observations?
b. Is there any pattern in the standardized residual plot that
would indicate that the simple linear regression model is
not appropriate?
c. Based on your plot in Part (a), do you think that it is
reasonable to assume that the variance of y is the same at
each x value? Explain.
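As a starting point for the plot in Part (a), the ordinary residuals can be computed directly from the estimated line. The following is a minimal Python sketch, assuming the line ŷ = 31 + 0.737x stated in the exercise and using only the first three (x, y) pairs from the table; the function names are illustrative and not part of the text. It computes ordinary residuals only. The standardized residuals listed in the table also depend on each observation's distance from the mean of x, so they are not reproduced here.

```python
# Sketch: ordinary residuals for a few observations from Exercise 16.32.
# Assumes the estimated regression line y-hat = 31 + 0.737x given in the exercise.

def predicted_carbon(mass):
    """Predicted elemental carbon from the estimated regression line."""
    return 31 + 0.737 * mass

def residual(mass, carbon):
    """Observed value minus predicted value."""
    return carbon - predicted_carbon(mass)

# First three (x, y) pairs from the exercise's table.
observations = [(164.2, 181), (161.8, 170), (118.7, 106)]
residuals = [residual(x, y) for x, y in observations]

for (x, y), r in zip(observations, residuals):
    print(f"x = {x}, y = {y}, residual = {r:.2f}")
```

A positive residual means the observed point lies above the estimated line, and a negative residual means it lies below.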
16.33 The article “Vital Dimensions in Volume Perception:
Can the Eye Fool the Stomach?” (Journal of Marketing
Research [1999]: 313–326) gave the accompanying data on
the dimensions of 27 representative food products (Gerber
baby food, Cheez Whiz, Skippy Peanut Butter, and Ahmed’s
tandoori paste, to name a few).
Product   Maximum Width (cm)   Minimum Width (cm)
1          2.50                 1.80
2          2.90                 2.70
3          2.15                 2.00
4          2.90                 2.60
5          3.20                 3.15
6          2.00                 1.80
7          1.60                 1.50
8          4.80                 3.80
9          5.90                 5.00
10         5.80                 4.75
11         2.90                 2.80
12         2.45                 2.10
13         2.60                 2.20
14         2.60                 2.60
15         2.70                 2.60
16         3.10                 2.90
17         5.10                 5.10
18        10.20                10.20
19         3.50                 3.50
20         2.70                 1.20
21         3.00                 1.70
22         2.70                 1.75
23         2.50                 1.70
24         2.40                 1.20
25         4.40                 1.20
26         7.50                 7.50
27         4.25                 4.25
a. Fit the simple linear regression model that would allow
prediction of the maximum width of a food container
based on its minimum width.
b. Calculate the standardized residuals (or just the residuals if you don’t have access to a computer program that
gives standardized residuals) and make a residual plot to
determine whether there are any outliers.
c. The data point with the largest residual is for a 1-liter
Coke bottle. Delete this data point and refit the regression. Did deletion of this point result in a large change in
the equation of the estimated regression line?
d. For the regression line of Part (c), interpret the estimated
slope and, if appropriate, the intercept.
e. For the data set with the Coke bottle deleted, do you
think that the assumptions of the simple linear regression
model are reasonable? Give statistical evidence for your
answer.
16.34 Models of climate change predict that global temperatures and precipitation will increase in the next 100
years, with the largest changes occurring during winter
in northern latitudes. Researchers gathered data on the
potential effects of climate change for flowering plants
in Norway (“Climatic Variability, Plant Phenology, and
Northern Ungulates,” Ecology [1999]: 1322–1339). The table
below gives data for one flower species. Range of flowering
dates and elevation for different sites in Norway were used
to construct the given scatterplot. A potentially influential
point is indicated on the scatterplot.
[Figure: Bivariate Fit of Flowering Date Range by Elevation; scatterplot of flowering date range versus elevation, with one potentially influential point indicated]
Flowering Range versus Elevation: Tussilago Farfara
Elevation (Meters Above Sea Level)   Flowering Date Range
 23.3                                 33.4
  5.6                                 32.0
 55.6                                 31.9
140.0                                 31.3
 31.1                                 28.1
112.2                                 29.3
106.7                                 28.4
 42.2                                 26.6
 75.6                                 24.9
176.7                                 25.7
126.7                                 24.7
126.7                                 23.5
176.7                                 23.2
201.1                                 21.8
133.3                                 22.3
 90.0                                 21.4
 41.1                                 19.7
125.6                                 17.6
477.8                                 17.6
a. Fit a linear regression model using all 19 observations.
What are the values of a, b, r², and se?
b. Fit a linear regression model with the indicated point
omitted. What are the values of a, b, r², and se?
c. In a few sentences, describe any differences you found in
Parts (a) and (b).
d. The researchers could use the estimated regression equation based on all 19 observations to make predictions for
elevations ranging from 0 to 200 meters; or they could
use the estimated regression equation based on the 18
observations (omitting the observation identified by an
arrow) to make predictions for elevations ranging from
0 to 500 meters. Which strategy would you recommend,
and why?
Section 16.2
Exercise Set 2
16.35 In the study described in Exercise 16.31, the effect
of latitude on mean clutch size was investigated. Data
from various locations in Florida, Georgia, Alabama, and
Mississippi on y = mean clutch size and x = latitude were
measured. The scatterplot, standardized residual plot, and
several graphs of the standardized residuals are shown
below.
Does it appear that the assumptions of the simple linear
regression model are plausible? Explain your reasoning in
a few sentences.
[Figures: scatterplot of mean clutch size versus latitude; standardized residual plot versus latitude; histogram of the residuals; normal probability plot of the residuals]
16.36 Exercise 6.21 gave data on x = nerve firing
frequency and y = pleasantness rating when nerves were
stimulated by a light brushing stroke on the forearm. The x
values and the corresponding residuals from a simple linear
regression are as follows:
Firing Frequency, x   Standardized Residual
23                    -1.83
24                     0.04
22                     1.45
25                     0.20
27                    -1.07
28                     1.19
34                    -0.24
33                    -0.13
36                    -0.81
34                     1.17
a. Construct a standardized residual plot. Does the plot
exhibit any unusual features?
b. A normal probability plot of the standardized residuals
follows. Based on this plot, do you think it is reasonable
to assume that the error distribution is approximately
normal? Explain.
[Figure: normal probability plot of the standardized residuals (normal score versus standardized residual)]
16.37 The accompanying scatterplot, based on 34 sediment samples with x = sediment depth (cm) and y = oil
and grease content (mg/kg), appeared in the article “Mined
Land Reclamation Using Polluted Urban Navigable Waterway
Sediments” (Journal of Environmental Quality [1984]: 415–422).
[Figure: scatterplot of oil and grease (mg/kg) versus subsample mean depth (cm)]
Discuss the effect that the observation (20, 33,000) will have on
the estimated regression line. If this point were omitted, what
do you think will happen to the slope of the estimated regression line compared to the slope when this point is included?
16.38 Investigators in northern Alaska periodically monitored radio-collared wolves in 25 wolf packs over 4 years,
keeping track of the packs’ home ranges (“Population
Dynamics and Harvest Characteristics of Wolves in the Central
Brooks Range, Alaska,” Wildlife Monographs [2008]: 1–25).
The home range of a pack is the area typically covered by
its members in a specified amount of time. The investigators
noticed that wolf packs with larger home ranges tended to
be located more often by monitoring equipment. The investigators decided to explore the relationship between home
range and the number of locations per pack. A scatterplot
and standardized residual plot of the data are shown below,
as well as plots of the standardized residuals.
Does it appear that the assumptions of the simple linear
regression model are plausible? Explain your reasoning in
a few sentences.
[Figures: scatterplot of Home Range versus Locations/Pack; standardized residual plot versus Locations/Pack; additional plots of the standardized residuals, including a histogram]
Additional Exercises
16.39 Carbon aerosols have been identified as a contributing factor in a number of air quality problems. In a chemical
analysis of diesel engine exhaust, x = mass (mg/cm²) and
y = elemental carbon (mg/cm²) were recorded (“Comparison
of Solvent Extraction and Thermal Optical Carbon Analysis
Methods: Application to Diesel Vehicle Exhaust Aerosol”
Environmental Science Technology [1984]: 231–234). The estimated regression line for this data set is ŷ = 31 + 0.737x.
A scatterplot of the data and a standardized residual plot are
shown below.
[Figures: Bivariate Fit of carbon By mass (scatterplot of carbon versus mass); standardized residual plot versus mass]
a. Are there any unusually large residuals? Do you think
that there are any influential observations?
b. Is there any pattern in the standardized residual plot that
would indicate that the simple linear regression model is
not appropriate?
c. Based on the scatterplot and the standardized residual
plot, do you think that it is reasonable to assume that the
variance of y is the same at each x value? Explain.
16.40 Models of climate change predict that global
temperatures and precipitation will increase in the next
100 years, with the largest changes occurring during
winter in northern latitudes. Researchers recently gathered data on the potential effects of climate change
for flowering plants in Norway (“Climatic Variability,
Plant Phenology, and Northern Ungulates,” Ecology [1999]:
1322–1339). The table below gives data for one flower species. A scatterplot of the “range of flowering dates” versus
latitude for different sites in Norway is also shown. Two
points that are potentially influential are indicated on the
scatterplot.
[Figure: scatterplot of mean flowering date range versus latitude (N), with two potentially influential points indicated]
Flowering Range Versus Latitude: Anemone Hepatica
Latitude (N)   Flowering Date Range
58.7           46.1
58.2           35.9
58.2           34.7
59.4           32.3
60.0           33.0
59.4           29.7
59.1           26.9
59.3           26.2
59.5           25.6
59.5           27.6
59.7           19.1
59.8           24.4
60.8           26.2
60.9           26.8
63.4           28.7
63.4           19.2
60.5           22.5
60.7           17.9
60.7           12.9
61.1           11.8
a. Fit a linear regression model using all 20 observations.
What are the values of a, b, r², and se?
b. Fit a linear regression model with the two observations
identified by arrows omitted. What are the values of a, b,
r², and se?
c. In a few sentences, describe any differences you found in
Parts (a) and (b).
d. The researchers could use the estimated regression
equation based on all 20 observations to make predictions for latitudes ranging from 58 to 64, or they could
use the estimated regression equation based on the 18
observations (omitting the two observations identified by
arrows) to make predictions for latitudes ranging from 58
to 62. Which strategy would you recommend, and why?
16.41 The sand scorpion is a predator that always hunts
from a motionless resting position outside its own burrow.
When prey appears on the horizon, within say 20 cm, the
scorpion assumes an alert posture; it determines the angular
position of the prey, makes a quick rotation, and runs after it.
In a recent study of the scorpion’s accuracy, the angular position (0 degrees = right in front) of the prey and the turning
angle of the scorpion were recorded for 23 attacks. A simple
regression model relating the response angle of the predator
to the target angle position of the prey, r̂ = a + b(t), was fit.
The resulting residual plot is shown. Describe the locations of
any outliers you see in the residual plot.
[Figure: residual plot of residual versus target angle]
16.42 The production of pups and their survival are the
most significant factors contributing to gray wolf population
growth. The causes of early pup mortality are unknown, and
difficult to observe. The pups are concealed within their dens
for 3 weeks after birth, and after they emerge it is difficult to
confirm their parentage. Researchers recently used portable
ultrasound equipment to investigate some factors related to
reproduction (“Diagnosing Pregnancy, in Utero Litter Size, and
Fetal Growth with Ultrasound in Wild, Free-Ranging Wolves,”
Journal of Mammalogy [2006]: 85–92).
A scatterplot and linear regression of the length of a wolf
fetus (in cm, measured from crown to rump) and gestational age (in days) is shown below. Identify the point
that has the largest residual by giving its approximate
coordinates.
[Figure: scatterplot with fitted regression line of crown–rump length (cm) versus gestational age (days)]
16.43 The authors of the article “Age, Spacing and
Growth Rate of Tamarix as an Indication of Lake Boundary
Fluctuations at Sebkhet Kelbia, Tunisia” (Journal of Arid
Environments [1982]: 43–51) used a simple linear regression model to describe the relationship between y = vigor
(average width in centimeters of the last two annual rings)
and x = stem density (stems/m²). The estimated model was
based on the following data. Also given are the standardized
residuals.
x     y      St. resid.
4     0.75   -0.28
5     1.20    1.92
6     0.55   -0.90
9     0.60   -0.28
14    0.65    0.54
15    0.55    0.24
15    0.00   -2.05
19    0.35   -0.12
21    0.45    0.60
22    0.40    0.52
a. What assumptions are required for the simple linear
regression model to be appropriate?
b. Construct a normal probability plot of the standardized
residuals. Does the assumption that the random deviation
distribution is normal appear to be reasonable? Explain.
c. Construct a standardized residual plot. Are there any
unusually large residuals?
d. Is there anything about the standardized residual plot
that would cause you to question the use of the simple
linear regression model to describe the relationship
between x and y?
Are You Ready to Move On?
Chapter 16 Review Exercises
All chapter learning objectives are assessed in these exercises. The learning objectives assessed
in each exercise are given in parentheses.
16.44 (C1)
Describe what distinguishes a deterministic model from a
probabilistic model.
16.45 (C2)
In the context of the simple linear regression model,
explain the difference between α and a, between β and b,
and between σe and se.
16.46 (M1)
The SAT and ACT exams are often used to predict a
student’s first-term college grade point average (GPA).
Different formulas are used for different colleges and
majors. Suppose that a student is applying to State U with
an intended major in civil engineering. Also suppose that
for this college and this major, the following model is used
to predict first term GPA.
GPA = α + β(ACT)
α = 0.5
β = 0.1
a. In this context, what would be the appropriate interpretation of α?
b. In this context, what would be the appropriate interpretation of β?
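To see how the model in this exercise turns an ACT score into a predicted GPA, it can help to evaluate it at a couple of scores. The sketch below is only an illustration: it uses the intercept value 0.5 and slope value 0.1 given in the exercise, while the ACT scores 20 and 21 and the function name are hypothetical choices made for the example.

```python
# Sketch: evaluating the GPA model from Exercise 16.46 at two ACT scores.
# The intercept (0.5) and slope (0.1) come from the exercise;
# the ACT scores 20 and 21 are hypothetical values chosen for illustration.

INTERCEPT = 0.5
SLOPE = 0.1

def mean_gpa(act_score):
    """Predicted first-term GPA for a given ACT score under the stated model."""
    return INTERCEPT + SLOPE * act_score

gpa_at_20 = mean_gpa(20)
gpa_at_21 = mean_gpa(21)
difference = gpa_at_21 - gpa_at_20

print(f"ACT 20 -> {gpa_at_20:.2f}, ACT 21 -> {gpa_at_21:.2f}, change = {difference:.2f}")
```

Comparing the two predictions shows how much the predicted GPA changes when the ACT score increases by 1 point.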
16.47 (M2)
Theropods were carnivorous dinosaurs, characterized by
short forelimbs, living in the Jurassic and Cretaceous periods. (Tyrannosaurus rex is classified as a Theropod.) What
scientists know about theropods is based on studying incomplete skeletal remains. In a study described in the paper
“My Theropod is Bigger than Yours…or not: Estimating Body
Size from Skull Length in Theropods” (Journal of Vertebrate
Paleontology [2007]: 108–115), researchers used data from
skeletons to develop a model describing the relationship
between body length and skull length. JMP was used to
produce the following graphical displays and computer
output. When you evaluate the fit of an estimated regression
line, all of the information below is considered as a whole.
However, the summary statistics in the computer output and
the different plots each convey some specific information.
[Figures: Bivariate Fit of BodyLength By SkullLength (scatterplot with fitted line); Bivariate Fit of Residuals By SkullLength (residual plot); normal quantile plot and boxplot of the residuals]
Linear Fit
BodyLength = 0.7061088 + 7.791973*SkullLength
Summary of Fit
RSquare                       0.953929
RSquare Adj                   0.951218
Root Mean Square Error        0.801042
Mean of Response              5.859474
Observations (or Sum Wgts)    19
Parameter Estimates
Term          Estimate    Std Error   t Ratio   Prob>|t|
Intercept     0.7061088   0.330485     2.14     0.0475*
SkullLength   7.791973    0.415318    18.76     <.0001*
a. Using only the scatterplot, do you think a linear model
does a good job of describing the relationship? Explain
why or why not.
b. Using only the residual plot, what can you determine
about whether the basic assumptions of the linear
regression model are met?
c. Using only the normal probability plot and boxplot of
the residuals, what can you determine about whether the
basic assumptions of the linear regression model are met?
d. Using only the values of r² and se, what can you say about
the quality of the fit of the linear model for these data?
16.48 (M3)
There are 4 basic assumptions necessary for making inferences about β, the slope of the population regression line.
a. What are the four assumptions?
b. Which assumptions can be checked using sample data?
c. What statistics or graphs would be used to check each of
the assumptions you listed in Part (b)?
16.49 (M3, M4, M5, P1, P2)
Ruffed grouse are a species of birds that nest on the ground.
Because of this, chick survival at night in the first few
weeks of life depends on avoiding predators. Biologists
have theorized that protection from predators might be
supplied by the mother hen’s choice of brooding sites.
One variable that biologists thought might be related to
survival is the density of vegetation in the vicinity of the
nest. Dense vegetation would possibly reduce the ability of
predators to detect the nests. The paper “Nocturnal Roost
Habitat Selection by Ruffed Grouse Broods” (Journal of Field
Ornithology [2005]: 168–174) describes a study in which
researchers monitored the survival of the brood (number
of chicks surviving/number of eggs hatched) in 23 nests in
different vegetation densities (thousands of stems/hectare).
Computer output (from JMP) is shown below.
Linear Fit
BroodSurvival = 0.9468008 - 0.0261902*StemDensity
Summary of Fit
RSquare                       0.193788
RSquare Adj                   0.155397
Root Mean Square Error        0.287538
Mean of Response              0.436043
Observations (or Sum Wgts)    23
Parameter Estimates
Term          Estimate    Std Error   t Ratio   Prob>|t|
Intercept     0.9468008   0.235108     4.03     0.0006*
StemDensity   -0.02619    0.011657    -2.25     0.0355*
a. Is there convincing evidence of a useful linear relationship between brood survival and stem density?
Explain.
b. Would you describe the relationship as strong? Why or
why not?
c. Construct a 95% confidence interval for β and interpret
it in context.
d. What margin of error is associated with the confidence
interval in Part (c)?
16.50 (M7)
Researchers in Hawaii have recently documented a large
increase in the prevalence of a bird parasite known as chewing lice (“Explosive Increase in Ectoparasites in Hawaiian
Forest Birds,” The Journal of Parasitology [2008]: 1009–1021).
Current data suggest that the prevalence of chewing lice
may be less for bird species with a high degree of bill
overhang. A species is said to have bill overhang when
the upper bill extends downward in front of the end of
the lower bill. The following scatterplot shows the
relationship between the prevalence of chewing lice and
bill overhang for 8 bird species in the Hawaiian Islands. A
residual plot is also shown. Use these plots to identify any
outliers or potentially influential observations. For each
point you identify, assess its influence on the estimated
slope of the regression line.
[Figures: scatterplot of lice prevalence versus bill overhang; residual plot of residual versus bill overhang]
16.51 (M6)
Suppose you are given the computer output shown. You
want to test the hypothesis β = 1.0. Describe how you
would use the computer output to test this hypothesis.
Linear Fit
y = 5.6452776 + 0.9797401*x
Summary of Fit
RSquare                       0.985289
RSquare Adj                   0.984954
Root Mean Square Error        12.48525
Mean of Response              0.791304
Observations (or Sum Wgts)    46
Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   5.6452776   1.84302      3.06     0.0037*
x           0.9797401   0.018048    54.29     <.0001*
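The arithmetic behind such a test can be sketched in a few lines of Python. This is one possible approach, not a worked solution from the text: it assumes the usual t statistic for a slope, (b minus the hypothesized value) divided by the standard error of b, with n - 2 degrees of freedom, and it reads b, its standard error, and n from the output above. The variable names are illustrative.

```python
# Sketch: t statistic for testing a hypothesized slope of 1.0,
# using values read from the computer output in Exercise 16.51.

b = 0.9797401             # estimated slope (row "x", column "Estimate")
se_b = 0.018048           # standard error of the slope (column "Std Error")
n = 46                    # number of observations
hypothesized_slope = 1.0

# Test statistic for the hypothesized value of 1.0.
t_stat = (b - hypothesized_slope) / se_b
df = n - 2

# The "t Ratio" printed in the output corresponds to a hypothesized slope of 0.
t_ratio_for_zero = b / se_b

print(f"t = {t_stat:.3f} with df = {df}")
print(f"t ratio for a hypothesized slope of 0: {t_ratio_for_zero:.2f}")
```

The printed t ratio and P-value in the output cannot be used directly here, because they refer to a hypothesized slope of 0. A P-value for the hypothesized value 1.0 would come from the t distribution with the degrees of freedom computed above.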
Technology Notes
Regression Test
TI-83/84
1. Enter the data for the independent variable into L1 (In order
to access lists press the STAT key, highlight the option called
Edit… then press ENTER)
2. Enter the data for the dependent variable into L2
3. Press STAT
4. Highlight TESTS
5. Highlight LinRegTTest… and press ENTER
6. Next to β & ρ select the appropriate alternative hypothesis
7. Highlight Calculate
TI-Nspire
1. Enter the data into two separate data lists (In order to access
data lists select the spreadsheet option and press enter)
Note: Be sure to title the lists by selecting the top row of the
column and typing a title.
2. Press the menu key and select 4:Stat Tests then 4:Stats
Tests then A:Linear Reg t Test… and press enter
3. In the box next to X List choose the list title where you
stored your independent data from the drop-down menu
4. In the box next to Y List choose the list title where you
stored your dependent data from the drop-down menu
5. In the box next to Alternate Hyp choose the appropriate
alternative hypothesis from the drop-down menu
6. Press OK
JMP
1. Input the data for the dependent variable into the first
column
2. Input the data for the independent variable into the second
column
3. Click Analyze and select Fit Y by X
4. Select the dependent variable (Y) from the box under Select
Columns and click on Y, Response
5. Select the independent variable (X) from the box under
Select Columns and click on X, Factor
6. Click the red arrow next to Bivariate Fit of… and select
Fit Line
MINITAB
1. Input the data for the dependent variable into the first
column
2. Input the data for the independent variable into the second
column
3. Select Stat then Regression then Regression…
4. Highlight the name of the column containing the dependent
variable and click Select
5. Highlight the name of the column containing the independent variable and click Select
6. Click OK
Note: You may need to scroll up in the Session window to view
the t-test results for the regression analysis.
SPSS
1. Input the data for the dependent variable into one column
2. Input the data for the independent variable into a second
column
3. Click Analyze then click Regression then click Linear…
4. Select the name of the dependent variable and click the
arrow to move the variable to the box under Dependent:
5. Select the name of the independent variable and click the
arrow to move the variable to the box under Independent(s):
6. Click OK
Note: The p-value for the regression test can be found in the
Coefficients table in the row with the independent variable name.
Excel
1. Input the data for the dependent variable into the first column
2. Input the data for the independent variable into the second
column
3. Select Analyze then choose Regression then choose
Linear…
4. Highlight the name of the column containing the dependent
variable
5. Click the arrow button next to the Dependent box to move
the variable to this box
6. Highlight the name of the column containing the independent variable
7. Click the arrow button next to the Independent box to move
the variable to this box
8. Click OK
Note: The test statistic and p-value for the regression test for the
slope can be found in the third table of output. These values are
listed in the row titled with the independent variable name and
the columns entitled t Stat and P-value.
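Python
For readers without access to the packages above, the quantities reported by a regression t test can be computed by hand in plain Python. The sketch below is illustrative only: the function name and the five (x, y) pairs are hypothetical, not data from the text. It computes the least squares intercept a and slope b, the standard deviation about the line se, the standard error of the slope, and the t statistic for a hypothesized slope of 0, with n - 2 degrees of freedom.

```python
import math

def slope_t_test(x, y):
    """Least squares fit plus the t statistic for a hypothesized slope of 0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = s_xy / s_xx                      # estimated slope
    a = y_bar - b * x_bar                # estimated intercept
    ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(ss_resid / (n - 2))  # standard deviation about the line
    s_b = s_e / math.sqrt(s_xx)          # standard error of the slope
    return {"a": a, "b": b, "s_e": s_e, "s_b": s_b, "t": b / s_b, "df": n - 2}

# Hypothetical data, used only to show the calculation.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
result = slope_t_test(x, y)

print(f"a = {result['a']:.3f}, b = {result['b']:.3f}")
print(f"t = {result['t']:.2f} with df = {result['df']}")
```

The sketch stops at the test statistic. The P-value would then be found from the t distribution with the stated degrees of freedom, using a t table or a statistics library.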
AP* Review Questions for Chapter 16
Use the following information for questions 1–6.
A study was carried out to investigate the relationship between x = the number of components needing repair and
y = the time of the service call (in minutes) for a computer
repair company. The number of components and the service
time for a random sample of 20 service calls was used to fit
a simple linear regression model. Partial computer output is
shown below.
The regression equation is
Time = 37.2 + 9.97 Number
Predictor   Coef     SE Coef   T       P
Constant    37.213   7.985      4.66   0.000
Number      9.9695   0.7218    13.81   0.000
S = 18.7534   R-Sq = 89.7%   R-Sq(adj) = 89.2%
1. Which of the following statements is a correct interpretation of the value 9.97?
(A) The average number of components needing repair
goes up 9.97 for each 1 minute increase in the service time of a call.
(B) On average, the service call time goes up 9.97 minutes
for each additional component needing repair.
(C) The service call time is 9.97 minutes when there
are 0 components to repair.
(D) Approximately 9.97% of the observed variation in
the service call times can be explained by the linear
relationship between service time and number of
components requiring repair.
(E) If this regression equation is used to predict service
call times, we can expect predictions to be within
9.97 minutes of the actual time.
2. Which of the following statements is a correct interpretation of the value 89.7%?
(A) On average, the service call time goes up 89.7 minutes
for each additional component needing repair.
(B) The magnitude of a typical difference between an
observed service call time and the service call time
predicted by the linear model is approximately
89.7 minutes.
(C) The correlation between service call time and number of components needing repair is 89.7%.
(D) Approximately 89.7% of the observed variation in
service call time can be explained by the linear relationship between service call time and number of
components needing repair.
(E) If this regression equation is used to predict service
call times, we can expect predictions to be within
89.7 minutes of the actual time.
3. The value of se is 18.75. Which of the following is an
appropriate interpretation of this value?
(A) 18.75% of the variability in service time can be explained by the linear relationship between service
call time and number of components needing repair.
(B) There is a positive correlation between service call
time and number of components needing repair.
(C) For every 1-component increase in the number of
components needing repair, the predicted service
call time increases by about 18.75 minutes.
(D) The magnitude of a typical difference between an observed service time and the service call time predicted
by the linear model is approximately 18.75 minutes.
(E) The average service call time is 18.75 minutes.
4. The value of se is 18.75. If the assumptions of the
simple linear regression model are satisfied, which of
the following is correct?
(A) The width of a 95% confidence interval for the slope
of the population regression line is 2(18.75) = 37.50.
(B) It would be unlikely that a prediction based on the
regression line will be greater than 18.75 minutes.
(C) It would be unlikely that a prediction based on the
regression line will differ from the actual value by
more than 2(18.75) = 37.50 minutes.
(D) Errors associated with predictions based on the regression line will always be less than 18.75 minutes.
(E) The value of se does not provide any information
about the anticipated magnitude of prediction errors.
5. Which of the following is a 95% confidence interval for
the change in service time associated with a 1-unit
increase in the number of components needing repair?
(A) 37.21 ± (1.96)(7.985)
(B) 37.21 ± (2.910)(7.985)
(C) 9.97 ± (1.96)(0.7218)
(D) 9.97 ± (2.10)(0.7218)
(E) 9.97 ± 2(18.7534)
6. If the basic assumptions of the simple linear regression
model are reasonable, what conclusion should be
reached regarding model utility if a significance level of
0.05 is used for the model utility test?
(A) There is convincing evidence of a negative linear
relationship between service call time and number
of components needing repair.
(B) There is convincing evidence that the model is not
useful for predicting service call time.
(C) There is convincing evidence that the model is useful for predicting service call time.
(D) There is not convincing evidence that the model is
useful for predicting service call time.
(E) A conclusion cannot be reached based on the given
information.
AP* and the Advanced Placement Program are registered trademarks of the College Entrance Examination Board, which was not involved in the production of, and does not endorse, this product.
7. If there is a positive linear relationship between two
variables x and y, which of the following must be true
of β, the slope of the population regression line?
(A) β < 0
(B) β > 0
(C) β = 0
(D) β > 1
(E) -1 < β < 1
8. The plots shown are residual plots resulting from fitting
a linear regression. Which of these plots indicates that
the relationship between the two variables used to fit
the line may not be linear?
[Figures: three standardized residual plots, each showing standardized residual versus x]
(A) I only
(B) II only
(C) III only
(D) I and III only
(E) II and III only
Use the scatterplot below to answer questions 9 and 10.
[Figure: scatterplot of y versus x with four labeled points A, B, C, and D]
9. Which of the labeled points would have the largest residual when a linear model is fit to the data?
(A) A
(B) B
(C) C
(D) D
(E) Both C and D
10. Which of the labeled points corresponds to a potentially
influential observation if a linear model is to be fit to
the data?
(A) A
(B) B
(C) C
(D) D
(E) Both C and D
11. If there is evidence of a linear relationship between x
and y, what decision will be made in a test of H0: β = 0
versus Ha: β ≠ 0?
(A) Reject H0 and conclude that there is no evidence
that the linear model is useful
(B) Reject H0 and conclude that there is evidence that
the linear model is useful
(C) Fail to reject H0 and conclude that there is no evidence that the linear model is useful
(D) Fail to reject H0 and conclude that there is evidence
that the linear model is useful
(E) Not enough information to say.
Use the following information to answer questions 12
and 13.
As part of a study of the swimming speed of sharks, a random sample of 18 lemon sharks (Triakis semifasciata) were
observed in a laboratory sea tunnel. Body lengths and
maximum sustainable swimming speeds (“MSSS,” reported
in body lengths per second) were measured for each shark.
The computer output from a regression with y = MSSS and
x = body length is given below.
Linear Fit
MSSS = 1.8928955 - 0.0104278*Length
Summary of Fit
RSquare            0.526395
RSquare Adj        0.496794
S                  0.272031
Mean of Response   1.24
N Observations     18
Analysis of Variance
Source   DF   Sum of Squares   Mean Square   F Ratio
Model     1   1.3159870        1.31599       17.7834
Error    16   1.1840130        0.07400       Prob > F
Total    17   2.5000000                      0.0007*
Parameter Estimates
Term         Estimate    Std Error   t Ratio   Prob>|t|
Intercept    1.8928955   0.167575    11.30     <.0001*
Length(cm)   -0.010428   0.002473    -4.22     0.0007*
12. For this data set, the model utility test is based on how
many degrees of freedom?
(A) 15
(B) 16
(C) 17
(D) 18
(E) 19
13. What is the P-value associated with the model utility
test?
(A) 0.0001
(B) 0.0007
(C) 0.07400
(D) 0.167575
(E) 0.526395
14. Which of the following is not an assumption that is
made about the random deviation e in a simple linear
regression model?
(A) The distribution of e is normal.
(B) The standard deviation of e, σe, depends upon the
particular value of x.
(C) The mean value of e is 0.
(D) The random deviations, e1, e2, …, en, associated
with different observations are independent of one
another.
(E) The standard deviation of e, σe, is the same for
each x value.
15. The residual plot below indicates that one or more
of the assumptions of the linear regression model may not
be met. Which of the following is a reasonable conclusion based on this residual plot?
[Figure: standardized residual plot of standardized residual versus x]
(A) The residual plot clearly indicates a non-linear
model would be more appropriate.
(B) There is evidence that the residuals are not normally distributed.
(C) The slope of the regression line is non-zero.
(D) The correlation between x and y is non-zero.
(E) There is evidence the residuals do not have the
same variance for all x values.