Statistics 215
Regression worksheet
At the scene of the crime*

Criminal investigators often need to predict unobserved characteristics of individuals from observed characteristics. For example, if a footprint is left at the scene of a crime, how accurately can we estimate that person’s height based on the length of the footprint?

1. Identify the observational units, explanatory variable, and response variable in this study.

The observational units are the individual students in Statistics 215. Most important in answering this question is the CONTEXT of the problem. Investigators seek to use footprint data at the scene of the crime to estimate crime suspects’ heights. So foot length is the explanatory variable and height is the response variable.

Following are the self-reported heights (in inches) of the students in Statistics 215, along with summary statistics from SPSS.

64 64 67 71 67 75 69 67 68 61 68 60
70 71 76 73 64 64 64 64 73 70 72 68
68 64 66 70.5 65 66 72 73 73 67 74 74

Height Stem-and-Leaf Plot

 Frequency    Stem &  Leaf
     2.00        6 .  01
      .00        6 .
     8.00        6 .  44444455
     6.00        6 .  667777
     5.00        6 .  88889
     5.00        7 .  00011
     6.00        7 .  223333
     2.00        7 .  45
     1.00        7 .  6

Descriptives: Height

                                                  Statistic   Std. Error
 Mean                                               68.2714       .68026
 95% Confidence Interval for Mean   Lower Bound     66.8890
                                    Upper Bound     69.6539
 5% Trimmed Mean                                    68.3016
 Median                                             68.0000
 Variance                                            16.196
 Std. Deviation                                     4.02445
 Minimum                                              60.00
 Maximum                                              76.00
 Range                                                16.00
 Interquartile Range                                   7.00
 Skewness                                              .011         .398
 Kurtosis                                             -.742         .778

2. If you were trying to predict the height of a random statistics student based on these observations, what value would you report?

Either the mean or the median, with a preference for the median. So 68 inches.

3. About how accurate would you be using such a prediction method? Fill in the blanks:

The standard deviation gives a measure of the spread, so I would predict a random student’s height to be about 68 inches, give or take about 4 inches.

4.
When you were sleeping last night, my investigators entered your room and recorded your foot length (in centimeters). Here is a scatterplot of height and foot length. Describe the association in this scatterplot (remember type, direction, strength, clusters, outliers, etc.). Without looking on the next page, guess the correlation. Without looking on the next page, draw (lightly) on the scatterplot what looks like the best fit line for the data.

There is a moderate to strong positive linear association in the data. There are no apparent clusters or outliers. Answers will vary on correlation guesstimates and on where to draw the line, but reasonable guesses for the correlation should fall between 0.65 and 0.90.

5. Following is SPSS regression and correlation output. What is the correlation? Give the regression equation and draw it as accurately as possible on the scatterplot.

The regression equation from the SPSS output Coefficients table is Height = 41.839 + .953 * Foot. (A mathematician, as opposed to a statistician, would write y = 41.839 + .953*x.) The correlation is 0.841. To draw the line on the scatterplot we find two points on the line and connect the dots. Since the foot variable varies from about 20 to 34, we’ll use those two x-values. Since 41.839 + .953*20 = 60.899, the point (20, 60.899) is on the line. Since 41.839 + .953*34 = 74.241, the point (34, 74.241) is on the line. Joining those two points gives the regression line on the scatterplot.

Model Summary(b)

 Model    R         R Square   Adjusted R Square   Std. Error of the Estimate
 1        .841(a)   .707       .698                2.21233

 a. Predictors: (Constant), Foot
 b. Dependent Variable: Height

Coefficients(a)

                 Unstandardized Coefficients   Standardized Coefficients
 Model           B          Std. Error         Beta                        t        Sig.
 1 (Constant)    41.839     2.988                                          14.003   .000
   Foot          .953       .107               .841                        8.917    .000

 a. Dependent Variable: Height

Residuals Statistics(a)

                         Minimum     Maximum    Mean      Std. Deviation   N
 Predicted Value         61.8470     74.2331    68.2714   3.38316          35
 Residual                -3.56365    4.48357    .00000    2.17955          35
 Std. Predicted Value    -1.899      1.762      .000      1.000            35
 Std. Residual           -1.611      2.027      .000      .985             35

 a. Dependent Variable: Height

6. The point (x-bar, y-bar) always lies on the regression line. Find this point and determine the approximate mean foot length.

We know that y-bar (the mean of height) is 68.2714 inches. From the scatterplot, it looks like the mean foot length is about 27 cm. We could also solve 68.2714 = 41.839 + .953*(x-bar) to get x-bar = (68.2714 - 41.839)/.953 ≈ 27.736. (The difference between that and the exact value of 27.7429 we attribute to rounding errors.)

7. On page 169 of the textbook are the formulas for the slope and intercept of the regression equation. Verify these formulas using the above summary statistics. Note that the actual mean and standard deviation of foot length are, respectively, 27.7429 cm and 3.55083 cm.

The slope is r * sy / sx = (.841) * 4.02445 / 3.55083 = 0.953, which agrees with SPSS. The intercept is y-bar - slope*x-bar = 68.2714 - .953*27.7429 = 41.83242, whose difference from the stated value of 41.839 we again attribute to round-off errors.

8. Use the regression line to predict the height of a person with a 28 cm foot length. Then repeat for a person with a 29 cm foot length. Calculate the difference in these two height predictions. Does this value look familiar? Explain.

For a 28 cm foot length, the regression predicts a height of 41.839 + .953*28 = 68.523 inches. For a 29 cm foot length, the regression predicts a height of 41.839 + .953*29 = 69.476 inches. The difference of these two heights is 0.953, which is the slope, as expected.

9. Provide an interpretation of the slope coefficient in context.

Predicted height increases by 0.953 inches for each additional centimeter of foot length.

10. Provide an interpretation of the intercept coefficient in context. Is such a prediction meaningful for these data? Explain.

The intercept, 41.839 inches, is the predicted height for a foot length of 0 cm. Such a prediction is not meaningful: a foot length of 0 cm is far outside the range of the data (about 20 to 34 cm).

11. Predict the height of someone whose foot length is 44 cm.
Explain why you would not be as comfortable making this prediction as the one in (8).

The model predicts a height of 41.839 + .953*44 = 83.771 inches, which is almost 7 feet. Because 44 cm is far outside the range of foot lengths in the data, we don’t expect the model to do well there.

12. Zach, a student in statistics, has a foot length of 25 cm and is 67 inches tall. What is Zach’s residual? In general, what is the meaning of a positive residual? Of a negative residual?

The predicted height for Zach is 41.839 + .953*25 = 65.664 inches. So the residual is observed height - predicted height = 67 - 65.664 = 1.336 inches. The model underestimated Zach’s height by 1.336 inches. A positive residual corresponds to an underestimate by the model. A negative residual corresponds to an overestimate.

13. Here is a residual plot of predicted values versus residuals. We will discuss the residuals and R-squared. So take a break here and (if there’s time) we’ll talk about the following in class.

14. What fraction of the variability in heights is explained by the linear regression model?

This is given by R-squared. That is, 70.7% of the variability in heights is explained by the regression model (so 29.3% of that variability is not).

15. Do you think that the linear model is appropriate for these data?

Yes. The scatterplot shows a fairly strong linear association. Also, the residual plot shows no particular structure or pattern to the residuals. The points are spread fairly evenly around the line y = 0. Being a little more nit-picky, because of some slightly higher residuals towards the center of the plot, there is a slight tendency for the model to underestimate heights when the predicted height is around 69 inches.

16. Guesstimate the standard deviation of the residuals. How much do the data points spread about the regression line?

From the residual plot, look at the y-values. Imagine projecting the points onto the y-axis. We see that the spread is about 2 inches.
A reasonable guesstimate of the standard deviation of the residuals is about 2 inches. This is the measure of spread about the regression line. We expect points to be about 2 inches above or below the line.

17. Consider how height varies about its mean. Compare your answers in (3) and (16). Is the regression model superior to just estimating height based on its mean and standard deviation?

In (3), we said that height was about 68 inches, give or take 4 inches. The regression model says that height is about 41.839 + .953*foot length, give or take about 2 inches. Since the typical prediction error drops from about 4 inches to about 2 inches, the regression model is the better predictor.

*This worksheet is based, in part, on Investigation 6.3.3 in Investigating Statistical Concepts, Applications, and Methods by Beth Chance and Allan Rossman.
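
The arithmetic in (7), (8), (12), and (16) can be checked with a short script. What follows is a minimal Python sketch, not part of the original worksheet. It uses only the summary statistics quoted in the SPSS output, because the individual foot lengths are not listed. The function and variable names are made up for illustration. The last step relies on the standard identity that the standard deviation of the residuals equals sy * sqrt(1 - r^2), which the worksheet itself does not state.

```python
import math

# Summary statistics quoted in the worksheet (SPSS output and question 7).
r = 0.841              # correlation between foot length and height
r_squared = 0.707      # R Square from the Model Summary table
mean_height = 68.2714  # inches
sd_height = 4.02445    # inches
mean_foot = 27.7429    # cm
sd_foot = 3.55083      # cm

# (7) Slope and intercept from the summary statistics.
slope = r * sd_height / sd_foot
intercept = mean_height - slope * mean_foot


def predict_height(foot_cm, b0=41.839, b1=0.953):
    """Predicted height in inches, using the rounded SPSS coefficients."""
    return b0 + b1 * foot_cm


def residual(observed_height, foot_cm):
    """Observed minus predicted height; positive means the model underestimated."""
    return observed_height - predict_height(foot_cm)


# (16) Assumed identity (not from the worksheet): SD of residuals = sy * sqrt(1 - r^2).
resid_sd = sd_height * math.sqrt(1 - r_squared)

print(f"slope     = {slope:.3f}")
print(f"intercept = {intercept:.3f}")
print(f"height at 28 cm = {predict_height(28):.3f}")
print(f"height at 29 cm = {predict_height(29):.3f}")
print(f"Zach's residual = {residual(67, 25):.3f}")
print(f"residual SD     = {resid_sd:.3f}")
```

The computed intercept and residual standard deviation differ slightly from the SPSS values (41.839 and 2.17955) because the inputs r and R-squared are rounded to three decimal places.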