AP Statistics

Exam Review on Chapters 5 & 13 Regression
For Chapter 5, read “A Word to the Wise: Cautions and Limitations” on pages 257-258. Also read the
Summary of Key Concepts and Formulas on pages 259-260.
Key Terms/Ideas
• Response variables measure the outcome of a study
• Explanatory variables attempt to explain the observed outcomes
• Use the same principles to examine bivariate data that you used for univariate data
o Start with a graph (scatterplots make a heck of a lot of sense here)
o Describe the strength, direction, and form of the relationship (e.g., strong negative linear
relationship or moderate positive linear relationship)
o Give some numerical description of the data (that is, your equation for the LSRL, along with
your r and $r^2$ values)
• Avoid using the term “causes” unless the data are the result of an experiment where there is
overwhelming evidence that changes in the explanatory variable actually do cause changes in the
response variable
• A better way to describe a relationship is “x is strongly associated with y…”
• To estimate r values, decide how narrow an ellipse (oval) you can draw around the points. The
narrower the ellipse, the stronger the relationship
o r values fall between –1 and 1. The closer to –1 or 1, the stronger the linear relationship. r
values are used only to describe a linear relationship; they are not used for non-linear
relationships, although they can describe the transformed data in a non-linear relationship.
o Correlation is strongly affected by outliers
o The formula for r involves standardizing the x and y values, so it doesn’t matter what the
units are (ask about this if you aren’t sure what I mean).
• Influential points are the ones that, if you were to remove them, would greatly change the
position of the regression line.
o An influential point is not the same as an outlier. An outlier lies outside of the overall
pattern of the scatterplot. Influential points tend to lie at the far left or far right of the
scatterplot.
• The least squares regression line minimizes the sum of the squared vertical distances from the
observed points to the line of best fit.
• An LSRL should be written as: $\hat{y} = a + bx$, where y-hat describes our predictions
• Residual = observed – predicted
o The sum of the residuals for an LSRL is always 0. This is why we use the sum of the squares
of the residuals (Least Squares Regression Line).
o $r^2$ is called the coefficient of determination and is strongly linked to the idea of
minimizing the vertical distances
o $r^2$ tells us what percent of the variation in the “y” variable can be explained by the
LSRL.
• Residual plots are excellent for helping us determine whether the “linear model” is appropriate.
If you notice a distinct pattern (like a parabola) rather than a random scattering of
points, that is an indication that the “linear model” may not be the best predictor, since it
will underestimate in certain regions and overestimate in others. Study the graphs on page 702
regarding residuals for inference.
• Any time you calculate the equation for an LSRL on your calculator, the residuals are stored
in a list called RESID. You can use this list to construct a residual plot.
• When dealing with an LSRL, the point $(\bar{x}, \bar{y})$ is always on the LSRL
• The equations for the LSRL are all grouped together on the sheet
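The LSRL facts listed above can be checked numerically. The following is a minimal Python sketch, not part of the course materials: it borrows the age and income data from the free-response practice later in this review, and the helper names (`mean`, `sd`, `correlation`, `lsrl`) are made up for illustration. It computes r from standardized values, builds the LSRL from r, and then checks that the residuals sum to 0, that the point of means lies on the line, and that $r^2$ matches the fraction of variation explained.

```python
from math import sqrt

# Age (years) and income ($1,000) from the free-response practice data.
age = [20, 25, 30, 35, 40, 45, 50, 55, 60]
income = [18.5, 23.6, 29.8, 38.5, 49, 64.1, 78.5, 102.0, 130.8]


def mean(v):
    return sum(v) / len(v)


def sd(v):
    # Sample standard deviation (divides by n - 1).
    m = mean(v)
    return sqrt(sum((t - m) ** 2 for t in v) / (len(v) - 1))


def correlation(x, y):
    # r = sum of products of z-scores, divided by n - 1.
    mx, my, sx, sy = mean(x), mean(y), sd(x), sd(y)
    products = (((xi - mx) / sx) * ((yi - my) / sy) for xi, yi in zip(x, y))
    return sum(products) / (len(x) - 1)


def lsrl(x, y):
    # Slope b = r * s_y / s_x; intercept a = y-bar - b * x-bar.
    b = correlation(x, y) * sd(y) / sd(x)
    a = mean(y) - b * mean(x)
    return a, b


r = correlation(age, income)
a, b = lsrl(age, income)
residuals = [y - (a + b * x) for x, y in zip(age, income)]  # observed - predicted

sse = sum(e ** 2 for e in residuals)                 # unexplained variation
sst = sum((y - mean(income)) ** 2 for y in income)   # total variation in y

# Changing units (age in months, income in dollars) should leave r unchanged.
r_rescaled = correlation([12 * x for x in age], [1000 * y for y in income])

print(f"r = {r:.4f}, r with rescaled units = {r_rescaled:.4f}")
print(f"LSRL: y-hat = {a:.3f} + {b:.3f}x")
print(f"sum of residuals = {sum(residuals):.10f}")
print(f"y-hat at x-bar = {a + b * mean(age):.4f}, y-bar = {mean(income):.4f}")
print(f"r^2 = {r ** 2:.4f}, 1 - SSE/SST = {1 - sse / sst:.4f}")
```

The rescaling check is the point of the “units don’t matter” bullet: standardizing removes the units before r is computed.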
Cautions!!!
Never (yes, never) rely on the numerical summaries alone to describe the data in an explanatory/response
relationship. This is dangerous!!! Very dangerous! Always look at the graph as well.
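As one illustration of this caution, here is a small Python sketch (an added example, not from the textbook) that again uses the age and income data from the free-response practice in this review. The correlation looks impressive on its own, yet the signs of the residuals fall into a clear pattern that the number alone would hide.

```python
# Age (years) and income ($1,000) from the free-response practice data.
age = [20, 25, 30, 35, 40, 45, 50, 55, 60]
income = [18.5, 23.6, 29.8, 38.5, 49, 64.1, 78.5, 102.0, 130.8]

n = len(age)
x_bar = sum(age) / n
y_bar = sum(income) / n

sxx = sum((x - x_bar) ** 2 for x in age)
syy = sum((y - y_bar) ** 2 for y in income)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, income))

r = sxy / (sxx * syy) ** 0.5   # correlation coefficient
b = sxy / sxx                  # LSRL slope
a = y_bar - b * x_bar          # LSRL intercept

# Residual = observed - predicted, listed in order of increasing age.
residuals = [y - (a + b * x) for x, y in zip(age, income)]
signs = "".join("+" if e > 0 else "-" for e in residuals)

print(f"r = {r:.3f}")
for x, e in zip(age, residuals):
    print(f"age {x}: residual {e:+.2f}")
print("residual signs by age:", signs)
```

The residuals run positive, then negative, then positive again as age increases, which is the curved pattern a residual plot would show even though r is close to 1.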
Inference for the Slope of Least Squares Regression Lines & Confidence Intervals: Chapters 13.1 – 13.3
The material in Chapter 5 helps you determine how strong a relationship exists between two variables.
You could find a correlation between any two quantitative variables, whether it turns out to be weak or
strong.
The material in chapter 5 did not let you know if your relationship would be useful to you or not.
Since we are using regression in order to make some prediction, I will define useful in this context to
mean that our LSRL will allow us to predict the value of the response variable based on the explanatory
variable.
So our LSRL would be useful if different values of the explanatory variable yielded different values of
the response variable. The only time this wouldn’t be true is if the LSRL were horizontal (slope of 0).
The hypothesis test for the LSRL is essentially:
$H_0: \beta = 0$, while $H_a: \beta \neq 0$, where $\beta$ is the true slope of the regression line.
Be sure you understand how to calculate a confidence interval for slope and interpret the interval.
ALWAYS interpret the slope in the context of the problem.
Be sure you know the assumptions related to inference on slope.
Be sure to be able to read computer output.
Read summary on p. 723 and the first line on page 724.
Do p. 724 #61a and create a 90% confidence interval.
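To practice reading computer output, here is a hedged Python sketch that works from the numbers in the 1998 free-response output later in this review (not the textbook problem above). It assumes the slope coefficient for No. Teaspoons is positive (24.3929), which matches the printed t-ratio of 20.30, and it takes the critical value t* = 1.782 for 90% confidence with 12 degrees of freedom from a standard t table rather than computing it.

```python
from math import sqrt

# Values read from the 1998 free-response computer output.
n = 14
df = n - 2                # degrees of freedom for inference on slope
b = 24.3929               # slope coefficient (assumed positive, see lead-in)
se_b = 1.204              # standard error of the slope
ss_regression = 8330.160
ss_residual = 243.589

# Test statistic for H0: beta = 0 is t = b / SE_b.
t_stat = b / se_b

# 90% confidence interval for the slope: b +/- t* x SE_b.
T_STAR_90_DF12 = 1.782    # t table value for 90% confidence, df = 12
margin = T_STAR_90_DF12 * se_b
ci_90 = (b - margin, b + margin)

# Two other lines of the output can be rebuilt from the sums of squares.
r_squared = ss_regression / (ss_regression + ss_residual)
s = sqrt(ss_residual / df)   # standard deviation of the residuals

print(f"t = {t_stat:.2f} with {df} degrees of freedom")
print(f"90% CI for slope: ({ci_90[0]:.3f}, {ci_90[1]:.3f})")
print(f"R squared = {r_squared:.1%}, s = {s:.3f}")
```

In context, an interval like this would be read as: we are 90% confident that each additional teaspoon of weed killer per gallon is associated with an increase in the percent of weeds killed somewhere between the two endpoints. Because the interval does not contain 0, it agrees with rejecting $H_0: \beta = 0$.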
Practice on LSRL and Transformations to Achieve Linearity
1. The residual value of $(\bar{x}, \bar{y})$ in a linear regression is
a. negative
b. 0
c. positive
d. dependent on the value of r
e. the value cannot be determined
2. If (12, 60) is an influential point for the regression line $\hat{y} = 7.908 + 4.098x$, then which
of the following must be true?
a. removal of (12, 60) will improve r
b. removal of (12, 60) will not affect r
c. removal of (12, 60) will change the value of the slope of the regression line
d. (12, 60) has a large residual
e. none of these
3. Suppose a data set is transformed using (x, y) → (x, log y) and a least squares linear
regression procedure is performed on the transformed data. If the residual plot of this
regression shows a curved pattern, which of the following is an appropriate conclusion?
a. A quadratic model should be used with the original data
b. A square root transformation should be applied to the transformed data
c. The correlation coefficient of the set of transformed data is 0
d. The exponential transformation is not appropriate
e. None of these is appropriate
4. After data are collected from an agricultural experiment, suppose a transformation is
performed on the bivariate set (inches of water, total plant growth). If the linear
regression of the transformed data has the equation:
log(growth) = 0.7 + 1.93 log(water)
The regression model of the original data is:
a. growth = 0.7 + 1.93(water)
b. growth = 5.01 + 1.93(water)
c. growth = (5.01)(1.93)^water
d. growth = 5.01(water)^1.93
e. none of these
Free Response (Do on another sheet)
Complete a regression analysis for the following age and income data as indicated.
Age (years):      20    25    30    35    40    45    50    55     60
Income ($1,000):  18.5  23.6  29.8  38.5  49    64.1  78.5  102.0  130.8
1. Construct and label a scatterplot of the data.
2. Perform a linear regression on the data; plot the regression line on the scatterplot.
3. Discuss the goodness of fit of the linear regression, referencing the correlation
coefficient and its residual plot.
The correlation coefficient indicates a strong positive linear relationship between
age and income. However, the residual plot shows a definite curvature, indicating
a better model exists.
4. Perform the following transformations: exponential and power.
5. Perform the linear regression on both sets of transformed data.
6. Discuss the goodness of fit of these linear regressions, referencing the correlation
coefficients and each of their residual plots.
Looking at the transformed data sets, the exponential plot has the largest
correlation coefficient and the smallest sum of residuals squared.
NOTE: The sum of the residuals squared here is on the transformed data.
7. Transform the linear models into the exponential and power models and plot each on
the original scatterplot.
8. Comment on which of the three regression models fits the data the best. Explain your
answer.
The exponential model is the best model since it minimizes the sum of the residuals squared and
its residual plot is the best of the three models.
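The transformation steps above can be sketched in Python as a check on calculator work. This is an illustrative sketch under stated assumptions, not the official solution: it uses natural logs (the handout does not say which base to use, and the comparison comes out the same either way), and the names `fit_line`, `exp_model`, and `pow_model` are invented for this example. The exponential model regresses ln(income) on age; the power model regresses ln(income) on ln(age).

```python
from math import exp, log, sqrt

# Age (years) and income ($1,000) from the table above.
age = [20, 25, 30, 35, 40, 45, 50, 55, 60]
income = [18.5, 23.6, 29.8, 38.5, 49, 64.1, 78.5, 102.0, 130.8]


def fit_line(x, y):
    """Least squares fit of y on x; returns (a, b, r, sse) for y-hat = a + b*x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    r = sxy / sqrt(sxx * syy)
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return a, b, r, sse


ln_age = [log(x) for x in age]
ln_income = [log(y) for y in income]

a_lin, b_lin, r_lin, sse_lin = fit_line(age, income)        # original data
a_exp, b_exp, r_exp, sse_exp = fit_line(age, ln_income)     # exponential: (x, ln y)
a_pow, b_pow, r_pow, sse_pow = fit_line(ln_age, ln_income)  # power: (ln x, ln y)


def exp_model(x):
    # ln y = a + b*x  becomes  y = e^a * (e^b)^x
    return exp(a_exp) * exp(b_exp) ** x


def pow_model(x):
    # ln y = a + b*ln x  becomes  y = e^a * x^b
    return exp(a_pow) * x ** b_pow


print(f"linear:      r = {r_lin:.4f}")
print(f"exponential: r = {r_exp:.4f}, SSE on transformed data = {sse_exp:.4f}")
print(f"power:       r = {r_pow:.4f}, SSE on transformed data = {sse_pow:.4f}")
print(f"predicted income at age 40: exponential {exp_model(40):.1f}, power {pow_model(40):.1f}")
```

As the NOTE says, the two sums of squared residuals are on the transformed scale, so they can be compared with each other but not directly with the sum of squared residuals from the untransformed linear fit.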
Review websites:
Online quiz from Yates: http://www.whfreeman.com/yates1e/ (Chapters 3, 4, & 14)
Online quiz from Olsen: http://sstaff.hinsdale86.org/~rcazzato/apstats/index.htm
and click on our book. Chapters 5 & 13.
Go to the site listed below and test your skills on guessing correlations.
http://www.stat.uiuc.edu/~stat100/java/GCApplet/GCAppletFrame.html
1998 Free-Response Question 4
In a study of the application of a certain type of weed killer, 14 fields containing large numbers
of weeds were treated. The weed killer was prepared at seven different strengths by adding 1,
1.5, 2, 2.5, 3, 3.5, or 4 teaspoons to a gallon of water. Two randomly selected fields were treated
with each strength of weed killer. After a few days, the percentage of weeds killed on each field
was measured. The computer output obtained from fitting a least squares regression line to the
data is shown below. A plot of the residuals is provided as well.
Dependent variable is: percent killed
R squared = 97.2% R squared (adjusted) = 96.9%
s = 4.505 with 14 - 2 = 12 degrees of freedom
Source       Sum of Squares   df   Mean Square   F-ratio
Regression   8330.160          1   8330.1600     410
Residual      243.589         12     20.2990

Variable        Coefficient   s.e. of Coeff   t-ratio   Prob
Constant        -20.5893      3.242           -6.35     0.0001
No. Teaspoons    24.3929      1.204           20.30     0.0001
a. What is the equation of the least squares regression line given by this analysis? Define any variables used
in this equation.
b. If someone uses this equation to predict the percentage of weeds killed when 2.6 teaspoons of weed killer are used,
which of the following would you expect?
o The prediction will be too large.
o The prediction will be too small.
o A prediction cannot be made based on the information given on the computer output.
Explain your reasoning.
To check the multiple choice, go to the following link and select chapters 3 and then 4.
http://www.whfreeman.com/yates1e/