Describing location in distributions - JuabMath

AP Statistics
Name_________________________
Unit 3: Examining Relationships
Date____________
Hour____________
3.2B Residual Plots
1. In what way is a regression line a mathematical model?
We are trying to model a real world situation with an equation, and the equation of that regression
line is what models the behavior of the data.
2. What is extrapolation and why is it dangerous?
We only really know the behavior of the data for the range that we gathered it for. Extrapolation
means that you go outside the bounds of your data to predict. The trend that we see may not continue
before or after the data we have collected.
3. What is a least-squares regression line?
A least-squares regression line is the line that minimizes the sum of the squared vertical distances
(the residuals) from the data points to the line.
4. What is the formula for the equation of the least-squares regression line?
ŷ = a + bx, where the line goes through the point (x̄, ȳ) and the slope is b = r(s_y / s_x).
5. The least-squares regression line always passes through the point …?
(x̄, ȳ)
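As a quick numerical check of questions 4 and 5, here is a minimal Python sketch. The five data points are made up purely for illustration (they are not from this worksheet), and the helper names `correlation` and `lsrl_from_summary` are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r, computed from deviations about the means."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def lsrl_from_summary(xs, ys):
    """Return (a, b) for y-hat = a + b*x, using b = r*sy/sx and a = y_bar - b*x_bar."""
    b = correlation(xs, ys) * stdev(ys) / stdev(xs)
    a = mean(ys) - b * mean(xs)
    return a, b

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = lsrl_from_summary(xs, ys)
print(f"y-hat = {a:.2f} + {b:.2f}x")

# Evaluating the fitted line at x-bar should give back y-bar
print(a + b * mean(xs), mean(ys))
```

Choosing the intercept as a = ȳ − b·x̄ is what forces the line through (x̄, ȳ), so the last line should print two values that agree up to round-off.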
6. What is a residual?
The difference between what actually happened and what the model predicted would happen:
residual = observed y − predicted y = y − ŷ
7. How can you calculate residuals on your calculator and use this to produce a residual plot?
Run a linear regression. The calculator automatically stores the residuals in a list named RESID,
which you can find in the LIST menu. To copy them into one of your own lists, highlight the name of
that list, go to the LIST menu, and select RESID. Then go to STAT PLOT and create a scatterplot with
your explanatory variable list on the horizontal axis and your residuals list on the vertical axis.
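For anyone working without a calculator, the same steps can be sketched in Python. This is only an illustration: the lists `l1` and `l2` hold made-up values standing in for the calculator's L1 and L2, and `fit_line` is a hypothetical helper name.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b

# Hypothetical lists: l1 = explanatory variable, l2 = response variable
l1 = [1, 2, 3, 4, 5]
l2 = [2, 4, 5, 4, 5]

a, b = fit_line(l1, l2)

# residual = observed y - predicted y (the values the calculator keeps in RESID)
resid = [y - (a + b * x) for x, y in zip(l1, l2)]

# Each (x, residual) pair is one point of the residual plot
for x, e in zip(l1, resid):
    print(f"x = {x}, residual = {e:+.2f}")
```

Plotting these pairs with x on the horizontal axis gives the residual plot.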
8. If a least-squares regression line fits the data well, what characteristics should the residual plot
exhibit?
It should have a scattered random look to it with no apparent pattern.
9. The table below gives data on the number of powerboats registered in Florida and the number of
manatees killed by boats in the years from 1977 to 1990.
Year    Powerboat registrations    Manatees
        (in thousands)             killed
1977    447                        13
1978    460                        21
1979    481                        24
1980    498                        16
1981    513                        24
1982    512                        20
1983    526                        15
1984    559                        34
1985    585                        33
1986    614                        33
1987    645                        39
1988    675                        43
1989    711                        50
1990    719                        47
a) Use your calculator to make a scatterplot of these data.
Sketch the graph below:
b) Find the equation of the least-squares regression line and overlay that line on your scatterplot.
Let y = number of manatees killed and x = number of powerboat registrations (in thousands). The
least-squares regression equation is ŷ = −41.43 + 0.1249x.
c) Predict the number of manatees that will be killed in a year when 716,000 powerboats are
registered.
When 716,000 powerboats are registered (x = 716), the predicted number of manatees killed is
−41.43 + 0.1249 × 716 ≈ 48.0, or about 48 manatees.
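Parts (b) and (c) can also be checked away from the calculator. The following Python sketch fits the line to the 1977 to 1990 data from the table and evaluates it at x = 716; the variable names are our own. Because it keeps the unrounded coefficients, the prediction may differ slightly from one computed with the rounded equation.

```python
from statistics import mean

# Data from the table in question 9 (1977-1990)
boats = [447, 460, 481, 498, 513, 512, 526, 559, 585, 614, 645, 675, 711, 719]  # thousands
killed = [13, 21, 24, 16, 24, 20, 15, 34, 33, 33, 39, 43, 50, 47]

x_bar, y_bar = mean(boats), mean(killed)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(boats, killed))
sxx = sum((x - x_bar) ** 2 for x in boats)

b = sxy / sxx            # slope
a = y_bar - b * x_bar    # intercept
print(f"y-hat = {a:.2f} + {b:.4f}x")

# x is measured in thousands of boats, so 716,000 registrations means x = 716
predicted = a + b * 716
print(f"predicted manatees killed: {predicted:.1f}")
```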
d) Here are four more years of manatee data.
Year    Powerboat registrations    Manatees
        (in thousands)             killed
1991    716                        53
1992    716                        38
1993    716                        35
1994    735                        49
Add these points to your scatterplot. Florida took
stronger measures to protect manatees during these
years. Do you see evidence that these measures
succeeded?
Yes, the measures seem to be succeeding: three of the four new points
are below the regression line, indicating that fewer manatees than
predicted were killed. Additional evidence of success is provided by
the two points for 1992 and 1993; they fall well below the overall
pattern.
e) In part (c) you predicted the number of manatee deaths in a year with 716,000 powerboat
registrations. In fact, powerboat registrations were 716,000 for three years. Compare the mean manatee
deaths in these three years with your prediction from part (c). How accurate was your prediction?
The mean number of manatee deaths for the years with 716,000 powerboat registrations is
(53 + 38 + 35)/3 = 42. The prediction of 48 was too high by about 6 manatees.
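The comparison in part (e) is simple arithmetic, sketched here in Python using the rounded regression equation from part (b). The variable names are our own.

```python
# Manatee deaths in the three later years with 716 thousand registrations
deaths_716 = [53, 38, 35]
mean_deaths = sum(deaths_716) / len(deaths_716)

# Prediction from the rounded regression equation in part (b)
predicted = -41.43 + 0.1249 * 716

print(f"mean observed deaths: {mean_deaths:.0f}")
print(f"predicted deaths:     {predicted:.0f}")
print(f"prediction minus observed mean: {predicted - mean_deaths:.1f}")
```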
10. The table below gives data on the number of new birds y and percent of returning birds x for 13
sparrowhawk colonies. Enter the data into your calculator.
a) Use your calculator’s regression function to find the equation of the least-squares regression line.
The least-squares regression line is ŷ = 31.9 − 0.304x. The calculator output (and
Minitab output) is shown below:
Percent returning    New adults
74                   5
66                   6
81                   8
52                   11
73                   12
62                   15
52                   16
45                   17
62                   18
46                   18
60                   19
46                   20
38                   20
b) Use your calculator to find the mean and standard deviation of both x and y and their
correlation r.
The means, standard deviations, and correlation are: x̄ = 58.23%, s_x = 13.03%, ȳ = 14.23 new birds,
s_y = 5.29 new birds, r = −0.748.
c) Calculate the slope b and y intercept a of the regression line following the method of Example
3.13 on page 208 in your book. Verify that your equation is the same as the one you obtained
in part (a) except for slight rounding errors.
The slope is b = −0.748(5.29/13.03) = −0.304 and the intercept is a = 14.23 − b × 58.23 = 31.9.
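Parts (b) and (c) can be reproduced from the raw data. The Python sketch below computes the summary statistics and then applies b = r(s_y / s_x) and a = ȳ − b·x̄; the variable names are our own. Small differences from the hand calculation are round-off.

```python
from math import sqrt
from statistics import mean, stdev

# Data from the table in question 10
pct_returning = [74, 66, 81, 52, 73, 62, 52, 45, 62, 46, 60, 46, 38]  # x
new_adults = [5, 6, 8, 11, 12, 15, 16, 17, 18, 18, 19, 20, 20]        # y

x_bar, y_bar = mean(pct_returning), mean(new_adults)
s_x, s_y = stdev(pct_returning), stdev(new_adults)  # sample standard deviations

# Correlation r from deviations about the means
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(pct_returning, new_adults))
sxx = sum((x - x_bar) ** 2 for x in pct_returning)
syy = sum((y - y_bar) ** 2 for y in new_adults)
r = sxy / sqrt(sxx * syy)

b = r * s_y / s_x        # slope
a = y_bar - b * x_bar    # intercept

print(f"x-bar = {x_bar:.2f}, s_x = {s_x:.2f}, y-bar = {y_bar:.2f}, s_y = {s_y:.2f}")
print(f"r = {r:.3f}, slope b = {b:.3f}, intercept a = {a:.1f}")
```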
d) Explain in words what the slope and y-intercept of the regression line tell us.
The slope tells us that as the percent of returning birds increases by one percentage point, the number of new
birds decreases by 0.304 on average. The y intercept provides a prediction that we will see 31.9 new adults in a new colony
when the percent of returning birds is zero. This value is clearly outside the range of values studied for the 13
colonies of sparrowhawks and has no practical meaning in this situation.
e) An ecologist uses the line to predict how many birds will join another colony of sparrowhawks,
to which 60% of the adults return from the previous year. What is the prediction?
The predicted value for the number of new adults is 31.9 − 0.304 × 60 ≈ 13.7, or about 14.
11. A study recorded data on number of beers consumed and blood alcohol content (BAC) for 16
students. Page 212 in your book, problem 3.35 gives partial computer output from Minitab
relating to these data:
a) Use the computer output to write the equation of the least-squares line.
Let y = Blood Alcohol Content (BAC) and x = Number of Beers. The least-squares regression line is
ŷ = −0.0127 + 0.017964x.
b) Interpret the slope and y-intercept of the equation in this setting.
The slope indicates that on average, the BAC will increase by 0.017964 for each additional beer consumed. The
intercept suggests that the average BAC will be −0.0127 if no beers are consumed, which is impossible because BAC cannot be negative.
c) What blood alcohol level would your equation predict for a student who consumed 6 beers?
The predicted BAC for a student who consumed 6 beers is −0.0127 + 0.017964×6 = 0.0951.
d) The one student in the study who consumed 6 beers had a BAC of 0.10. What is your prediction
error in part (c)?
The prediction error is 0.10 − 0.0951 = 0.0049.
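The arithmetic in parts (c) and (d) can be checked with a short Python sketch. The function name `predict_bac` is our own; the coefficients are the ones from part (a).

```python
def predict_bac(beers):
    """Predicted BAC from the least-squares line in part (a)."""
    return -0.0127 + 0.017964 * beers

predicted = predict_bac(6)
observed = 0.10                  # BAC of the one student who consumed 6 beers
residual = observed - predicted  # prediction error = observed - predicted

print(f"predicted BAC:    {predicted:.4f}")
print(f"prediction error: {residual:.4f}")
```

A positive prediction error means the line underestimated this student's BAC.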
e) Problem 3.37 in your book on page 213 gives a screen shot from a TI-84 produced using the data.
Use the method described in Example 3.13 on page 208 in your book to calculate the slope and
y-intercept of the least-squares regression line. Compare with your results from part (a).
12. The Trans-Alaska Oil pipeline is a tube formed from ½ inch thick steel that carries oil across 800
miles of sensitive arctic and sub-arctic terrain. The pipe and the welds that join the segments were
carefully examined before installation. How accurate are field measurements of the depth of small
defects? The scatterplot below compares the results of measurements on 100 defects made in the field
with measurements of the same defects made in the laboratory. The line y = x is drawn on the
scatterplot. The second plot is a residual plot for these data.
a) Describe the overall pattern you see in the scatterplot, as well as any deviations from that pattern.
There is a positive, linear association between the two variables. There is more variation in the field
measurements for larger laboratory measurements. The values are scattered above and below the line y = x for
small and moderate depths, indicating strong agreement, but the field measurements tend to be smaller than the
laboratory measurements for large depths.
b) If field and laboratory measurements all agree, then the points should fall on the y=x line drawn
on the scatterplot, except for small variations in the measurements. But this is not the case.
Explain.
The points for the larger depths fall systematically below the line y = x showing that the field measurements are
too small compared to the laboratory measurements.
c) The line drawn on the scatterplot (y=x) is not the least-squares regression line. How would the
slope and y-intercept of the least-squares line compare? Justify your answer.
In order to minimize the sum of the squared vertical distances from the points to the regression line, the top right
part of the y = x line (the blue line in the figure) would need to be pulled down to go through the “middle” of the
group of points that currently fall below it. Thus, the slope would decrease and the intercept would increase.
d) Discuss what the residual plot tells you about how well the least-squares regression line fits the
data.
The residual plot clearly shows that the prediction errors increase for larger laboratory measurements. In other
words, the variability in the field measurements increases as the laboratory measurements increase. The least
squares line does not provide a great fit, especially for larger depths.
13. Data on the fuel consumption y of a car at various speeds x is given on page 182 in your book. Fuel
consumption is measured in liters of gasoline per 100 kilometers driven and speed is measured in
kilometers per hour. A statistical software package gives the least-squares regression line and
also the residuals. The regression line is:
𝑦̂ = 11.058 − 0.01466𝑥
The residuals in the same order as the observations are:
10.09, 2.24, −0.62, −2.47, −3.33, −4.28, −3.73, −2.94,
−2.17, −1.32, −0.42, 0.57, 1.64, 2.76, 3.97
a) Make a scatterplot of the observations and draw the least-squares regression line on your plot.
A scatterplot with the least squares regression line is shown below.
b) Would you use the regression line to predict y from x? Justify your answer.
We would certainly not use the regression line to predict fuel consumption. The scatterplot shows a nonlinear
relationship.
c) Check that the residuals have sum zero (up to round-off error)
The residuals provided sum to −0.01, which differs from zero only because of round-off error.
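The sum in part (c) is easy to confirm with a few lines of Python. The 15 residuals are listed from smallest to largest here, since the order does not affect the sum.

```python
# The 15 residuals given in the problem, sorted from smallest to largest
residuals = [-4.28, -3.73, -3.33, -2.94, -2.47, -2.17, -1.32, -0.62,
             -0.42, 0.57, 1.64, 2.24, 2.76, 3.97, 10.09]

total = sum(residuals)
print(f"{len(residuals)} residuals, sum = {total:.2f}")
```

The residuals of a least-squares line always sum to exactly zero, so the small leftover comes from rounding each residual to two decimal places.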
d) Make a plot of residuals against the values of x. Draw a horizontal line at height zero on your plot.
Notice that the residuals show the same pattern about this line as the data points show about the
regression line in the scatterplot in (a). What do you conclude about the residual plot?
The residual plot indicates that the regression line underestimates fuel consumption for slow and fast speeds and
overestimates fuel consumption for moderate speeds. The quadratic pattern in the residual plot indicates that the
regression model is not appropriate for these data.
15. Below are four sets of data prepared by the statistician Frank Anscombe to illustrate the dangers of
calculating without first plotting the data.
a) Without making scatterplots, find the correlation and the least-squares regression line for all four
data sets. What do you notice? Use the regression line to predict y for x = 10 for all four data sets.
b) Make a scatterplot for each of the data sets and add the regression line to each plot.
c) Now make a sketch of the residual plot for each of the four data sets.
d) In which of the four cases would you be willing to use the regression line to describe the
dependence of y on x? Explain your answer in each case.