AP Statistics Exam Review on Chapters 5 & 13: Regression

For Chapter 5, read “A Word to the Wise: Cautions and Limitations” on pages 257-258. Also read Summary of Key Concepts and Formulas on pages 259-260.

Key Terms/Ideas
• Response variables measure the outcome of a study.
• Explanatory variables attempt to explain the observed outcomes.
• Use the same principles to examine bivariate data that you used for univariate data:
  o Start with a graph (scatterplots make a lot of sense here).
  o Describe the strength, direction, and form of the relationship (e.g., strong negative linear relationship or moderate positive linear relationship).
  o Give a numerical description of the data (that is, your equation for the LSRL, along with your r and r² values).
• Avoid using the term “causes” unless the data are the result of an experiment where there is overwhelming evidence that changes in the explanatory variable actually do cause changes in the response variable. A better way to describe a relationship is “x is strongly associated with y…”
• To estimate r values, decide how narrow an ellipse (oval) you can draw around the points. The narrower the ellipse, the stronger the relationship.
  o r values fall between -1 and 1. The closer to -1 or 1, the stronger the linear relationship. r describes only linear relationships; it is not used for non-linear relationships, although it can describe the transformed (linearized) data from a non-linear relationship.
  o Correlation is strongly affected by outliers.
  o The formula for r involves standardizing the x and y values, so it doesn’t matter what the units are (ask about this if you aren’t sure what I mean).
• Influential points are the ones that, if you were to remove them, would greatly change the position of the regression line.
  o An influential point is not the same as an outlier. An outlier lies outside the overall pattern of the scatterplot. Influential points tend to lie far to the left or right of the scatterplot (extreme x-values).
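The two claims above about r (units don't matter because the values are standardized, and outliers strongly affect correlation) can be checked with a short sketch. The height and weight numbers below are made up purely for illustration, and the function names are hypothetical helpers, not anything from the textbook.

```python
import math

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    # Sample standard deviation (divides by n - 1)
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(xs, ys):
    # r = (1 / (n - 1)) * sum of (z-score of x) * (z-score of y)
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sx, sy = sample_sd(xs), sample_sd(ys)
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / (n - 1)

# Made-up data (assumption): heights in inches, weights in pounds
heights_in = [60, 62, 65, 68, 70, 72]
weights_lb = [110, 120, 140, 155, 160, 180]

r_original = correlation(heights_in, weights_lb)

# Change units: inches -> centimeters, pounds -> kilograms
heights_cm = [h * 2.54 for h in heights_in]
weights_kg = [w * 0.4536 for w in weights_lb]
r_converted = correlation(heights_cm, weights_kg)

# Add one made-up outlier: a very tall, very light person
r_with_outlier = correlation(heights_in + [74], weights_lb + [100])

print("r in original units: ", round(r_original, 4))
print("r in converted units:", round(r_converted, 4))
print("r with one outlier:  ", round(r_with_outlier, 4))
```

The first two printed values should agree, since standardizing removes the units, while the third should drop noticeably because of the single added point.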
• The least squares regression line minimizes the sum of the squared vertical distances from the observed points to the line of best fit.
• A LSRL should be written as: ŷ = a + bx, where ŷ (y-hat) describes our predictions.
• Residual = observed - predicted
  o The sum of the residuals for a LSRL is always 0 (the positive and negative residuals cancel). This is why we use the sum of the squares of the residuals (Least Squares Regression Line).
  o r² is called the coefficient of determination and is strongly linked to the idea of minimizing the squared vertical distances.
  o r² tells us what percent of the variation in the “y” variable can be explained by the LSRL.
• Residual plots are excellent for helping us determine if the “linear model” is appropriate. If you notice a distinct pattern (like a parabola, etc.) rather than a random scattering of points, that is an indication that the “linear model” may not be the best predictor, since it will underestimate certain regions and overestimate other regions. Study the graphs on page 702 regarding residuals for inference.
• Any time you calculate the equation for a LSRL on your calculator, the residuals are stored in a list called RESID. You can use this list to construct a residual plot.
• When dealing with a LSRL, the point (x̄, ȳ) is always on the LSRL.
• The equations for the LSRL are all grouped together on the formula sheet.
• Cautions!!! Never (yes, never) just take the numerical summaries to describe the data in an explanatory/response relationship. This is dangerous!!! Very dangerous! Always look at the graph as well.

Inference for the Slope of Least Squares Regression Lines & Confidence Intervals: Chapter 13.1-13.3

The material in Chapter 5 helps you determine how strong a relationship exists between two variables. You could find a correlation between any two quantitative variables, whether it turns out weak or strong. The material in Chapter 5 did not let you know if your relationship would be useful to you or not.
Since we are using regression in order to make predictions, I will define useful in this context to mean that our LSRL will allow us to predict the value of the response variable based on the explanatory variable. So our LSRL would be useful if different values of the explanatory variable yielded different values of the response variable. The only time this wouldn’t be true is if the LSRL was horizontal (slope = 0).

• The hypothesis test for the LSRL is essentially: H0: β = 0, while Ha: β ≠ 0, where β = slope of the LSRL.
• Be sure you understand how to calculate a confidence interval for slope and interpret the interval. ALWAYS interpret the slope in the context of the problem.
• Be sure you know the assumptions related to inference on slope.
• Be sure to be able to read computer output.
• Read the summary on p. 723 and the first line on page 724.
• Do p. 724 #61a and create a 90% confidence interval.

AP Statistics
Practice on LSRL and Transformations to Achieve Linearity

1. The residual value of (x̄, ȳ) in a linear regression is
   a. negative
   b. 0
   c. positive
   d. dependent on the value of r
   e. the value cannot be determined

2. If (12, 60) is an influential point for the regression line ŷ = 7.908 + 4.098x, then which of the following must be true?
   a. removal of (12, 60) will improve r
   b. removal of (12, 60) will not affect r
   c. removal of (12, 60) will change the value of the slope of the regression line
   d. (12, 60) has a large residual
   e. none of these

3. Suppose a data set is transformed using (x, y) → (x, log y) and a least squares linear regression procedure is performed on the transformed data. If the residual plot of this regression shows a curved pattern, which of the following is an appropriate conclusion?
   a. A quadratic model should be used with the original data
   b. A square root transformation should be applied to the transformed data
   c. The correlation coefficient of the set of transformed data is 0
   d. The exponential transformation is not appropriate
   e. None of these is appropriate

4.
After data are collected from an agricultural experiment, suppose a transformation is performed on the bivariate set (inches of water, total plant growth). If the linear regression of the transformed data has the equation

   log(growth) = 0.7 + 1.93 log(water)

the regression model of the original data is:
   a. growth = 0.7 + 1.93(water)
   b. growth = 5.01 + 1.93(water)
   c. growth = (5.01)(1.93)^water
   d. growth = 5.01(water)^1.93
   e. none of these

Free Response (Do on another sheet)

Complete a regression analysis for the following age and income data as indicated.

   Age (years)    Income ($1,000)
   20             18.5
   25             23.6
   30             29.8
   35             38.5
   40             49
   45             64.1
   50             78.5
   55             102.0
   60             130.8

1. Construct and label a scatterplot of the data.
2. Perform a linear regression on the data; plot the regression line on the scatterplot.
3. Discuss the goodness of fit of the linear regression, referencing the correlation coefficient and its residual plot.
   The correlation coefficient indicates a strong positive linear relationship between age and income. However, the residual plot shows a definite curvature, indicating a better model exists.
4. Perform the following transformations: exponential and power.
   NOTE: The sum of the residuals squared here is on the transformed data.
5. Perform the linear regression on both sets of transformed data.
6. Discuss the goodness of fit of these linear regressions, referencing the correlation coefficients and each of their residual plots.
   Looking at the transformed data sets, the exponential plot has the largest correlation coefficient and the smallest sum of residuals squared.
7. Transform the linear models into the exponential and power models and plot each on the original scatterplot.
8. Comment on which of the three regression models fits the data the best. Explain your answer.
   The exponential model is the best model since it minimizes the sum of the residuals squared and its residual plot is the best of the three models.
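The free-response steps above can be sketched in code as a way to check calculator work. This is a minimal sketch, not the required method: it assumes base-10 logs for both transformations, and the helper names (fit_line, exponential_model, power_model) are made up for illustration. Only the age and income data come from the problem.

```python
import math

# Data from the free-response problem
ages = [20, 25, 30, 35, 40, 45, 50, 55, 60]
incomes = [18.5, 23.6, 29.8, 38.5, 49.0, 64.1, 78.5, 102.0, 130.8]

def fit_line(xs, ys):
    """Return (a, b, r, sse) for the least squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope
    a = y_bar - b * x_bar              # intercept: line passes through (x-bar, y-bar)
    r = sxy / math.sqrt(sxx * syy)     # correlation coefficient
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # sum of squared residuals
    return a, b, r, sse

log_ages = [math.log10(x) for x in ages]
log_incomes = [math.log10(y) for y in incomes]

linear = fit_line(ages, incomes)              # (age, income)
exponential = fit_line(ages, log_incomes)     # (age, log income)
power = fit_line(log_ages, log_incomes)       # (log age, log income)

def exponential_model(age):
    # Undo the log: log(y-hat) = a + b*x  ->  y-hat = 10^a * (10^b)^x
    a, b = exponential[0], exponential[1]
    return 10 ** (a + b * age)

def power_model(age):
    # Undo the logs: log(y-hat) = a + b*log(x)  ->  y-hat = 10^a * x^b
    a, b = power[0], power[1]
    return (10 ** a) * age ** b

for name, fit in [("linear", linear), ("exponential", exponential), ("power", power)]:
    print(name, "r =", round(fit[2], 4), "SSE =", round(fit[3], 5))
```

Note that the SSE values for the exponential and power fits are on the transformed (log) scale, as the NOTE in step 4 says, so they can be compared with each other but not directly with the linear fit's SSE. A residual plot for each fit is still needed before choosing a model.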
Review websites:
• Online quiz from Yates: Chapters 3, 4, & 14
  http://www.whfreeman.com/yates1e/
• Online quiz from Olsen: http://sstaff.hinsdale86.org/~rcazzato/apstats/index.htm and click on our book. Chapters 5 & 13.
• Go to the site listed below and test your skills on guessing correlations.
  http://www.stat.uiuc.edu/~stat100/java/GCApplet/GCAppletFrame.html

1998 Free-Response Question 4

In a study of the application of a certain type of weed killer, 14 fields containing large numbers of weeds were treated. The weed killer was prepared at seven different strengths by adding 1, 1.5, 2, 2.5, 3, 3.5, or 4 teaspoons to a gallon of water. Two randomly selected fields were treated with each strength of weed killer. After a few days, the percentage of weeds killed on each field was measured. The computer output obtained from fitting a least squares regression line to the data is shown below. A plot of the residuals is provided as well.

   Dependent variable is: percent killed
   R squared = 97.2%     R squared (adjusted) = 96.9%
   s = 4.505 with 14 - 2 = 12 degrees of freedom

   Source        Sum of Squares    df    Mean Square    F-ratio
   Regression    8330.160          1     8330.1600      410
   Residual      243.589           12    20.2990

   Variable         Coefficient    s.e. of Coeff    t-ratio    Prob
   Constant         -20.5893       3.242            -6.35      0.0001
   No. Teaspoons    24.3929        1.204            20.30      0.0001

a. What is the equation of the least squares regression line given by this analysis? Define any variables used in this equation.
b. If someone uses this equation to predict the percentage of weeds killed when 2.6 teaspoons of weed killer are used, which of the following would you expect?
   o The prediction will be too large.
   o The prediction will be too small.
   o A prediction cannot be made based on the information given on the computer output.
   Explain your reasoning.

To check the multiple choice, go to the following link and select chapters 3 and then 4.
http://www.whfreeman.com/yates1e/
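As practice reading computer output, the weed killer numbers from the 1998 question can be plugged into the usual slope formulas. This is only a sketch under stated assumptions: the slope is taken as 24.3929 (positive, which is what the t-ratio of 20.30 and standard error of 1.204 imply), and the t* critical values for 12 degrees of freedom are copied from a standard t table rather than computed. The function names are made up for illustration.

```python
# Values read from the computer output
intercept = -20.5893   # "Constant" coefficient
slope = 24.3929        # "No. Teaspoons" coefficient (assumed positive; see lead-in)
se_slope = 1.204       # s.e. of the slope coefficient
df = 14 - 2            # degrees of freedom = n - 2

# Critical values from a standard t table for df = 12 (assumption: table lookup)
t_star_90 = 1.782
t_star_95 = 2.179

def predict_percent_killed(teaspoons):
    # Point prediction from the LSRL: y-hat = a + b*x
    return intercept + slope * teaspoons

def slope_interval(t_star):
    # Confidence interval for the slope: b +/- t* * SE(b)
    margin = t_star * se_slope
    return slope - margin, slope + margin

# Test statistic for H0: beta = 0 is t = b / SE(b)
t_ratio = slope / se_slope

prediction_at_2_6 = predict_percent_killed(2.6)
low_95, high_95 = slope_interval(t_star_95)
low_90, high_90 = slope_interval(t_star_90)

print("t-ratio for slope:", round(t_ratio, 2))
print("Predicted percent killed at 2.6 tsp:", round(prediction_at_2_6, 2))
print("95% CI for slope:", round(low_95, 3), "to", round(high_95, 3))
print("90% CI for slope:", round(low_90, 3), "to", round(high_90, 3))
```

The point prediction alone does not settle part b; that answer depends on what the residual plot shows near 2.6 teaspoons. If an interval for the slope excludes 0, that agrees with rejecting H0: β = 0, and the interval should still be interpreted in context (percentage points of weeds killed per additional teaspoon).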