Section 3.2 Least-Squares Regression

Linear relationships between two quantitative variables are pretty common and easy to understand. Correlation
measures the direction and strength of these relationships. When a scatterplot shows a linear relationship, we’d
like to summarize the overall pattern by drawing a line on the scatterplot. A _______________________________
_________________ summarizes the relationship between two variables, but only in a specific setting: when one
of the variables helps explain or predict the other. Regression, unlike correlation, requires that we have an
explanatory variable and a response variable.
DEFINITION: Regression line
A regression line is a line that describes how a response variable y changes as an explanatory variable x changes.
We often use a regression line to predict the value of y for a given value of x.
Example
Does Fidgeting Keep You Slim? (Regression lines as models)
Some people don’t gain weight even when they overeat. Perhaps fidgeting and other “nonexercise activity” (NEA)
explains why – some people may spontaneously increase nonexercise activity when fed more. Researchers
deliberately overfed 16 healthy young adults for 8 weeks. They measured fat gain (in kilograms) as the response
variable and change in energy use (in calories) from activity other than deliberate exercise – fidgeting, daily living,
and the like – as the explanatory variable. Here are the data.
Do people with larger increases in NEA tend to gain less fat?
The figure below is a scatterplot of these data. The plot shows a moderately strong, negative linear association
between NEA change and fat gain with no outliers. The correlation is r = -0.7786. The line on the plot is a
regression line for predicting fat gain from change in NEA.
Interpreting a Regression Line
A regression line is a model for the data. The equation of a regression line gives a compact mathematical
description of what this model tells us about the relationship between the response variable y and the explanatory
variable x.
DEFINITION: Regression line, predicted value, slope, y-intercept
Suppose that y is a response variable and x is an explanatory variable. A regression line relating y to x has an
equation of the form

ŷ = a + bx

In this equation,
- ŷ (read "y hat") is the __________________________ of the response variable y for a given value of the explanatory variable x.
- b is the __________________________, the amount by which y is predicted to change when x increases by one unit.
- a is the __________________________________, the predicted value of y when x = 0.
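The equation in the definition box can be sketched in a few lines of code. A minimal illustration; the intercept a = 2 and slope b = 0.5 are made-up values, not from any data in this section:

```python
def predict(x, a, b):
    """Predicted value y-hat = a + b*x for a regression line."""
    return a + b * x

# With y intercept a = 2 and slope b = 0.5, raising x by one unit
# changes the prediction by exactly the slope, b = 0.5.
print(predict(10, a=2, b=0.5))  # 7.0
print(predict(11, a=2, b=0.5))  # 7.5
```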
Example
Does Fidgeting Keep You Slim? (Interpreting the slope and y intercept)
The regression line shown in the figure below is ŷ = 3.505 – 0.00344x, where ŷ is the predicted fat gain (kg) and x is the change in NEA (cal).
Problem: Identify the slope and y-intercept of the regression line. Interpret each value in context.
The slope of a regression line is an important numerical description of the relationship between the two variables.
Note: You can’t say how important a relationship is by looking at the size of the slope of the regression line.
Prediction
We can use a regression line to predict the response
how we do it.
for a specific value of the explanatory variable x. Here’s
Example
Does Fidgeting Keep You Slim? (Predicting with a regression line)
For the NEA and fat gain data, the equation of the regression line is ŷ = 3.505 – 0.00344x. If a person's NEA increases by 400 calories when she overeats, substitute x = 400 in the equation. The predicted fat gain is

ŷ = 3.505 – 0.00344(400) = 2.13 kg
This prediction is illustrated in the figure below.
The accuracy of predictions from a regression line depends on how much the data scatter about the line. In this
case, fat gains for similar changes in NEA show a spread of 1 or 2 kilograms. The regression line summarizes the
pattern but gives only roughly accurate predictions.
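The substitution above is easy to check directly. A quick sketch using the fitted coefficients from the text:

```python
# y-hat = 3.505 - 0.00344*x, with x = NEA change in calories (from the text)
a, b = 3.505, -0.00344
fat_gain = a + b * 400   # predicted fat gain for a 400-cal NEA increase
print(round(fat_gain, 2))  # 2.13
```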
Can we predict the fat gain for someone whose NEA increases by 1500 calories when she overeats? We can certainly substitute 1500 calories into the equation of the line. The prediction is

ŷ = 3.505 – 0.00344(1500) = –1.66 kg

By looking at the figure, we can see that an NEA increase of 1500 calories is far outside the set of x-values for our data. We can't say whether increases this large ever occur, or whether the relationship remains linear at such extreme values. Predicting fat gain when NEA increases by 1500 calories is an _____________________________
of the relationship beyond what the data show.
DEFINITION: Extrapolation
Extrapolation is the use of a regression line for prediction far outside the interval of values of the explanatory
variable x used to obtain the line. Such predictions are often not accurate.
Few relationships are linear for all values of the explanatory variable. Don’t make predictions using values
of x that are much larger or much smaller than those that actually appear in your data.
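To see why this extrapolation fails, substitute x = 1500 into the line (a sketch; coefficients from the text):

```python
a, b = 3.505, -0.00344
fat_gain_1500 = a + b * 1500
# The line predicts a *negative* fat gain (about -1.66 kg), a physically
# implausible result -- a warning sign of extrapolation.
print(fat_gain_1500 < 0)  # True
```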
CHECK FOR UNDERSTANDING
Some data were collected on the weight of a male white laboratory rat for the first 25 weeks after its birth. A
scatterplot of the weight (in grams) and time since birth (in weeks) shows a fairly strong, positive linear
relationship. The linear regression equation
models the data fairly well.
1. What is the slope of the regression line? Explain what it means in context.
2. What's the y intercept? Explain what it means in context.
3. Predict the rat's weight after 16 weeks. Show your work.
4. Should you use this line to predict the rat's weight at age 2 years? Use the equation to make the prediction and think about the reasonableness of the result. (There are 454 grams in a pound.)
Residuals and the Least-Squares Regression Line
In most cases, no line will pass exactly through all the points in a scatterplot. Because we use the line to predict y
from x, the prediction errors we make are errors in y, the vertical direction in a scatterplot. A good regression line
makes the vertical distances of the points from the line as small as possible.
The scatterplot below gives the relationship between body weight and backpack weight for a group of 8 hikers.
The regression line is added. The vertical deviations represent “leftover” variation in the response variable after
fitting the regression line. For that reason, they are called residuals.
DEFINITION: Residual
A residual is the difference between an observed value of the response variable and the value predicted by the
regression line. That is,

residual = observed y – predicted y = y – ŷ
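In code, the definition is a one-liner. A sketch with illustrative numbers (observed y = 2.7, predicted ŷ = 3.04; any observed/predicted pair works the same way):

```python
observed_y = 2.7    # observed value of the response variable
predicted_y = 3.04  # value predicted by the regression line (y-hat)
residual = observed_y - predicted_y
print(round(residual, 2))  # -0.34: negative, so the point lies below the line
```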
Example
Back to the Backpackers (Finding a residual)
Problem: Find and interpret the residual for the hiker who weighed 187 pounds.
If we add up the prediction errors for all 8 hikers, the positive and negative residuals cancel out. We’ll solve this
problem by squaring the residuals. The regression line we want is the one that minimizes the sum of the squared
residuals. That’s what the line shown in the figure below does for the hiker data, which is why we call it the
_______________________________________________________________.
DEFINITION: Least-squares regression line
The least-squares regression line of y on x is the line that makes the sum of the squared residuals as small as
possible.
The figure below gives a geometric interpretation of the least-squares idea for the hiker data. The least-squares
regression line shown minimizes the sum of the squared prediction errors, 30.90. No other regression line would
give a smaller sum of squared residuals.
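The least-squares property can be verified numerically. The sketch below fits the line to the hiker data with the standard formulas and checks that its sum of squared residuals (about 30.90, matching the figure) beats nearby alternative lines:

```python
body = [120, 187, 109, 103, 131, 165, 158, 116]  # body weight (lb)
pack = [26, 30, 26, 24, 29, 35, 31, 28]          # pack weight (lb)
n = len(body)
xbar, ybar = sum(body) / n, sum(pack) / n

# Least-squares slope and intercept
b = sum((x - xbar) * (y - ybar) for x, y in zip(body, pack)) \
    / sum((x - xbar) ** 2 for x in body)
a = ybar - b * xbar

def ssr(a, b):
    """Sum of squared residuals for the line y-hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(body, pack))

print(round(ssr(a, b), 2))           # 30.9
print(ssr(a, b) < ssr(a + 0.5, b))   # True: shifting the line does worse
print(ssr(a, b) < ssr(a, b + 0.01))  # True: tilting the line does worse
```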
Least-squares Regression Lines on the Calculator
CHECK YOUR UNDERSTANDING
It’s time to practice your calculator regression skills. Using the familiar hiker data in the table below, find the
equation of the regression line.
Body weight (lb):      120  187  109  103  131  165  158  116
Backpack weight (lb):   26   30   26   24   29   35   31   28
Calculating the Equation of the Least-Squares Line
DEFINITION: Equation of the least-squares regression line
We have data on an explanatory variable x and a response variable y for n individuals. From the data, calculate the means x̄ and ȳ and the standard deviations sx and sy of the two variables and their correlation r. The least-squares regression line is the line ________________________________ with slope

b = r(sy/sx)

and y intercept

a = ȳ – bx̄
What does the slope of the least-squares line tell us? The figure below shows the regression line for the hiker data. There are four more lines added to the graph: a vertical line at the mean body weight x̄, a vertical line at x̄ + sx (one standard deviation above the mean body weight), a horizontal line at the mean pack weight ȳ, and a horizontal line at ȳ + sy (one standard deviation above the mean pack weight). Note that the regression line passes through (x̄, ȳ) as expected. From the graph, the slope of the line is

slope = ??/sx

From the definition box, we know that the slope is

b = r(sy/sx)

Setting the two formulas equal to each other, we have

??/sx = r(sy/sx)

So the unknown distance ?? in the figure must be equal to r × sy. In other words, for an increase of one standard deviation in the value of the explanatory variable x, the least-squares regression line predicts an increase of r standard deviations in the response variable y.
Example
Fat Gain and NEA (Calculating the Least-squares Regression Line)
Look at the data below from the study of nonexercise activity and fat gain. The mean and standard deviation of
the 16 changes in NEA are x̄ = 324.8 calories (cal) and sx = 257.66 cal. For the 16 fat gains, the mean and standard deviation are ȳ = 2.388 kg and sy = 1.1389 kg. The correlation between fat gain and NEA change is r = -0.7786.
Problem: (a) Find the equation of the least-squares regression line for predicting fat gain from NEA change. Show
your work.
(b) What change in fat gain does the regression line predict for each additional 257.66 cal of NEA? Explain.
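Part (a) can be reproduced directly from the summary statistics, using the slope and intercept formulas from the definition box (a sketch, not calculator output):

```python
r = -0.7786               # correlation
xbar, sx = 324.8, 257.66  # NEA change (cal)
ybar, sy = 2.388, 1.1389  # fat gain (kg)

b = r * sy / sx      # slope:     b = r * (sy / sx)
a = ybar - b * xbar  # intercept: a = ybar - b * xbar

print(round(b, 5))  # -0.00344
print(round(a, 2))  # 3.51 (the text's 3.505 comes from rounding b first)
```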
What happens if we standardize both variables?
Standardizing a variable converts its mean to 0 and its standard deviation to 1. Doing this to both x and y will
transform the point (x̄, ȳ) to (0, 0). So the least-squares line for the standardized values will pass through (0, 0).
What about the slope of this line? From the formula, it’s b = rsy/sx. Since we standardized, sx = sy = 1. That means
b = r. In other words, the slope is equal to the correlation.
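This fact is easy to confirm numerically with the hiker data from this section: standardize both variables, refit, and the slope comes out equal to r (a sketch):

```python
body = [120, 187, 109, 103, 131, 165, 158, 116]
pack = [26, 30, 26, 24, 29, 35, 31, 28]

def mean(v):
    return sum(v) / len(v)

def sd(v):
    """Sample standard deviation."""
    m = mean(v)
    return (sum((t - m) ** 2 for t in v) / (len(v) - 1)) ** 0.5

def slope(xs, ys):
    """Least-squares slope for predicting ys from xs."""
    mx, my = mean(xs), mean(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

zx = [(x - mean(body)) / sd(body) for x in body]  # standardized x
zy = [(y - mean(pack)) / sd(pack) for y in pack]  # standardized y

r = sum(u * v for u, v in zip(zx, zy)) / (len(zx) - 1)  # correlation
print(abs(slope(zx, zy) - r) < 1e-9)  # True: slope of standardized data = r
```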
How Well the Line Fits the Data: Residual Plots
One of the first principles of data analysis is to look for an overall pattern and for striking departures from the
pattern. A regression line describes the overall pattern of a linear relationship between an explanatory variable
and a response variable. We see departures from this pattern by looking at the residuals.
Example
Does Fidgeting Keep You Slim? (Examining Residuals)
Let’s return to the fat gain and NEA study involving 16 young people who volunteered to overeat for 8 weeks.
Those whose NEA rose substantially gained less fat than others. In the Technology Corner, we confirmed that the
least-squares regression line for these data is ŷ = 3.505 – 0.00344(NEA change). The calculator screen
shot below shows a scatterplot of the data with the least-squares line added.
One subject’s NEA rose by 135 cal. That subject gained 2.7 kg of fat. (This point is marked in the screen shot with
an X.) The predicted fat gain for 135 cal is
ŷ = 3.505 – 0.00344(135) = 3.04 kg
The residual for this subject is therefore
residual = observed y – predicted y = y – ŷ = 2.7 – 3.04 = –0.34 kg
The residual is negative because the data point lies below the line.
The 16 data points used in calculating the least-squares line produced 16 residuals. Rounded to two decimal
places, they are
0.37 -0.70 0.10 -0.34 0.19 0.61 -0.26 -0.98
1.64 -0.18 -0.23 0.54 -0.54 -1.11 0.93 -0.03
You can see the residuals in the scatterplot below by looking at the vertical deviations of the points from the line.
The residual plot in figure (b) makes it easier to study the residuals by plotting them against the explanatory
variable, change in NEA. Because the mean of the residuals is always zero, the horizontal line at zero in figure (b)
helps orient us. This “residual = 0” line corresponds to the regression line in figure (a).
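The claim that the residuals average to zero can be checked against the 16 residuals listed above (the sum is not exactly zero only because each residual was rounded to two decimal places):

```python
residuals = [0.37, -0.70, 0.10, -0.34, 0.19, 0.61, -0.26, -0.98,
             1.64, -0.18, -0.23, 0.54, -0.54, -1.11, 0.93, -0.03]
print(round(sum(residuals), 2))  # 0.01 -- essentially zero
```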
DEFINITION: Residual plot
A residual plot is a scatterplot of the residuals against the explanatory variable. Residual plots help us assess how
well a regression line fits the data.
CHECK FOR UNDERSTANDING
Refer to the nonexercise activity and fat gain data.
1. Find the residual for the subject who increased NEA by 620 calories. Show your work.
2. Interpret the value of this subject's residual in context.
3. For which subject did the regression line overpredict fat gain by the most? Justify your answer.
Examining residual plots A residual plot in effect turns the regression line horizontal. It magnifies the
deviations of the points from the line, making it easier to see unusual observations and patterns. If the
regression line captures the overall pattern of the data, there should be no pattern in the residuals. Figure (a)
shows a residual plot with a clear curved pattern. A straight line is not an appropriate model for these data, as
figure (b) confirms.
Here are two important things to look for when you examine a residual plot.
1. The residual plot should show no obvious pattern.
2. The residuals should be relatively small in size.
Standard deviation of the residuals: We have already seen that the average prediction error is 0 whenever we use
a least-squares regression line. That’s because the positive and negative residuals “balance out.” But that doesn’t
tell us how far off the predictions are, on average. Instead, we use the ___________________________________
______________________________:
DEFINITION: Standard deviation of the residuals (s)
If we use a least-squares line to predict the values of a response variable y from an explanatory variable x, the standard deviation of the residuals (s) is given by

s = √( Σ residuals² / (n – 2) ) = √( Σ(yᵢ – ŷᵢ)² / (n – 2) )

This value gives the approximate size of a "typical" or "average" prediction error (residual).
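The formula can be applied to the 16 NEA residuals listed earlier (those values were rounded to two decimals, so the s computed here is approximate):

```python
from math import sqrt

residuals = [0.37, -0.70, 0.10, -0.34, 0.19, 0.61, -0.26, -0.98,
             1.64, -0.18, -0.23, 0.54, -0.54, -1.11, 0.93, -0.03]
n = len(residuals)
s = sqrt(sum(e * e for e in residuals) / (n - 2))  # note: divide by n - 2
print(round(s, 2))  # 0.74: a typical prediction error of about 0.74 kg
```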
Residual plots and s on the calculator
CHECK YOUR UNDERSTANDING
In the Check Your Understanding on page 171, we asked you to perform least-squares regression on the familiar
hiker data shown in the table below. The graph shown is a residual plot for the least-squares regression of pack
weight on body weight for the 8 hikers.
1. The residual plot does not show a random scatter. Describe the pattern you see.
2. For this regression, s = 2.27. Interpret this value in context.
How Well the Line Fits the Data: The Role of r² in Regression
A residual plot is a graphical tool for evaluating how well a regression line fits the data. There is a numerical quantity that tells us how well the least-squares line predicts values of the response variable y. It is r², the ___________________________________________________. Although it's true that r² is equal to the square of r, there is much more to this story.
Example
Pack weight and body weight (How can we predict y if we don't know x?)
Suppose a new student is assigned at the last minute to our group of 8 hikers. What would we predict for his pack
weight? The figure below shows a scatterplot of the hiker data that we have studied throughout this chapter. The least-squares line is drawn on the plot in green. Another line has been added in blue: a horizontal line at the mean y-value, ȳ = 28.625 lb. If we don't know this new student's body weight, then we can't use the regression line to
make a prediction. What should we do? Our best strategy is to use the mean pack weight of the other 8 hikers as
our prediction.
Figure (a) shows the prediction errors if we use the average pack weight as our prediction for the original group
of 8 hikers. We can see that the sum of the squared residuals for this line is SST = Σ(y – ȳ)² = 83.87. SST measures the total variation in the y-values.
If we learn our new hiker’s body weight, then we could use the least-squares line to predict his pack weight. How
much better does the regression line do at predicting pack weights than simply using the average pack weight of
all 8 hikers? Figure (b) above reminds us that the sum of squared residuals for the least-squares line is Σ(y – ŷ)² = 30.90. We'll call this SSE, for sum of squared errors.
The ratio SSE/SST tells us what proportion of the total variation in y still remains after using the regression line to
predict the values of the response variable. In this case, SSE/SST = 30.90/83.87 = 0.368.
This means that ________% of the variation in pack weight is unaccounted for by the least-squares regression line.
Taking this one step further, the proportion of the total variation in y that is accounted for by the regression line is r² = 1 – SSE/SST = 1 – 30.90/83.87 = 0.632.
We interpret this by saying that “63.2% of the variation in backpack weight is accounted for by the linear model
relating pack weight to body weight.”
DEFINITION: The coefficient of determination: r² in regression
The coefficient of determination r² is the fraction of the variation in the values of y that is accounted for by the least-squares regression line of y on x. We can calculate r² using the following formula:

r² = 1 – SSE/SST

where SSE = Σ(y – ŷ)² and SST = Σ(y – ȳ)²
2
If all points fall directly on the least-squares line, SSE = 0 and r = 1. SSE can never be more than SST. In the worstcase scenario, the least-squares line does no better at predicting y than y = does. When you report a regression,
2
give r as a measure of how successful the regression was in explaining the response. When you see a correlation,
square it to get a better feel for the strength of the linear relationship.
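Using SSE = 30.90 (the sum of squared residuals for the hiker line quoted earlier) and SST = 83.87 (a value consistent with the 63.2% quoted above), r² follows in one line:

```python
SSE = 30.90  # sum of squared residuals for the least-squares line
SST = 83.87  # total variation in y (squared deviations from the mean)
r_squared = 1 - SSE / SST
print(round(r_squared, 3))  # 0.632, i.e. 63.2% of the variation accounted for
```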
CHECK FOR UNDERSTANDING
Multiple choice: Select the best answer.
1. For the least-squares regression of fat gain on NEA, r² = 0.606. Which of the following gives a correct interpretation of this value in context?
(a) 60.6% of the points lie on the least-squares regression line.
(b) 60.6% of the fat gain values are accounted for by the least-squares line.
(c) 60.6% of the variation in fat gain is accounted for by the least-squares line.
(d) 77.8% of the variation in fat gain is accounted for by the least-squares line.
2. A recent study discovered that the correlation between the age at which an infant first speaks and the child's score on an IQ test given upon entering elementary school is -0.68. A scatterplot of the data shows a linear form. Which of the following statements about this finding is correct?
(a) Infants who speak at very early ages will have higher IQ scores by the beginning of elementary school
than those who begin to speak later.
(b) 68% of the variation in IQ test scores is explained by the least-squares regression of age at first
spoken word and IQ score.
(c) Encouraging infants to speak before they are ready can have a detrimental effect later in life, as
evidenced by their lower IQ scores.
(d) There is moderately strong, negative linear relationship between age at first spoken word and later IQ
test score for the individuals in this study.
Interpreting Computer Regression Output
The figure below displays the basic regression output for the NEA data. Each output records the slope and y
intercept of the least-squares line. The software also provides some information that we will not use. Be sure that
you can locate the slope, the y intercept, and the values of s and r² on both of the computer outputs.
Example
Beer and Blood Alcohol (Interpreting regression output)
How well does the number of beers a person drinks predict his or her blood alcohol content (BAC)? Sixteen
volunteers with an initial BAC of 0 drank a randomly assigned number of cans of beer. Thirty minutes later, a
police officer measured their BAC. Least squares regression was performed on the data. A scatterplot with the
regression line added, a residual plot, and some computer output from the regression line are shown below.
Problem:
(a) What is the equation of the least-squares regression line that describes the relationship between beers
consumed and blood alcohol content? Define any variables you use.
(b) Interpret the slope of the regression line in context.
(c) Find the correlation.
(d) Is a line an appropriate model to use for these data? What information tells you this?
(e) What was the BAC reading for the person who consumed 9 beers? Show your work.
Correlation and Regression Wisdom
Correlation and regression are powerful tools for describing the relationship between two variables. When you
use these tools, you should be aware of their limitations.
1. The distinction between explanatory and response variables is important in regression.
Example
Predicting Fat Gain, Predicting NEA (Two different regression lines)
Figure (a) below repeats the scatterplot of the NEA data with the least-squares regression line for predicting fat
gain from change in NEA added. We might also use the data on these 16 subjects to predict the NEA change for
another subject from that subject’s fat gain when overfed for 8 weeks. Now the roles of the variables are
reversed: fat gain is the explanatory variable and change in NEA is the response variable. Figure (b) shows a
scatterplot of these data with the least-squares line for predicting NEA change from fat gain. The two regression
lines are very different. However, no matter which variable we put on the x axis, r² = 0.606 and the correlation is r = -0.778.
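This can be checked with the hiker data from earlier in the section: the line predicting y from x and the line predicting x from y have very different slopes, yet the product of the two slopes always equals r² (a sketch):

```python
body = [120, 187, 109, 103, 131, 165, 158, 116]
pack = [26, 30, 26, 24, 29, 35, 31, 28]

def slope(xs, ys):
    """Least-squares slope for predicting ys from xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

b_pack_on_body = slope(body, pack)  # predict pack weight from body weight
b_body_on_pack = slope(pack, body)  # roles reversed: very different slope
r_squared = b_pack_on_body * b_body_on_pack
print(round(r_squared, 3))  # 0.632 -- the same r^2 either way
```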
2. Correlation and regression lines describe only linear relationships.
3. Correlation and least-squares regression lines are not resistant.
Example
Gesell Scores (Dealing with unusual points in regression)
Does the age at which a child begins to talk predict a later score on a test of mental ability? A study of the
development of young children recorded the age in months at which each of 21 children spoke their first word and
their Gesell Adaptive Score, the result of an aptitude test taken much later. The data appear in the table below.
State:
Plan:
Do:
Conclude:
In the previous example, Child 18 and Child 19 were identified as outliers in the scatterplot. These points are also
marked in the residual plot. Child 19 has a very large residual because this point lies far from the regression line.
However, Child 18 has a pretty small residual. That's because Child 18's point is close to the line. How do these
two outliers affect the regression line? The figure below shows the results of removing each of these points on the
correlation and the regression line. The graph adds two more regression lines, one calculated after leaving out
Child 18 and the other after leaving out Child 19.
You can see that removing the point for Child 18 moves the line quite a bit. Because of Child 18’s extreme position
on the age scale, this point has a strong influence on the position of the regression line. However, removing Child
19 has little effect on the regression line. A point that is extreme in the x direction with no other points near it
pulls the line toward itself. We call such points ______________________________________.
DEFINITION: Outliers and influential observations in regression
An outlier is an observation that lies outside the overall pattern of the other observations. Points that are outliers
in the y direction but not the x direction of a scatterplot have large residuals. Other outliers may not have large
residuals. An observation is influential for a statistical calculation if removing it would markedly change the result
of the calculation. Points that are outliers in the x direction of a scatterplot are often influential for the least-squares regression line.
4. Association does not imply causation.
Example
Does Having More Cars Make You Live Longer? (Association, not causation)
A serious study once found that people with two cars live longer than people who own only one car. Owning three
cars is even better, and so on. There is a substantial positive correlation between number of cars x and length of
life y.
The basic meaning of causation is that by changing x we can bring about a change in y. Could we lengthen our lives
by buying more cars? No. The study used number of cars as a quick indicator of wealth. Well-off people tend to
have more cars. They also tend to live longer, probably because they are better educated, take better care of
themselves, and get better medical care. The cars have nothing to do with it. There is no cause-and-effect tie
between number of cars and length of life.