Regression and Prediction
Chapter 15 plus extra
May 2, 2012

Outline: Prediction; Vertical Chimneys; Regression Line; Equation of the Regression Line; Regression and Least Squares; Regression Fallacy

1.0 Prediction

If we have two quantitative variables X and Y that are linearly related to each other, then knowing the value of X for one individual can help us to estimate (or predict) the value of Y for that individual. We will explore what the best prediction of the response variable (Y) is, given a value of the explanatory variable (X), and what the likely size of the prediction error is.

1.1 Fundamental Principle of Prediction

Incoming students at a large law school have an average L.S.A.T. score of 163 and an S.D. of 8. You may assume the histogram of these data follows a normal curve approximately. Tomorrow one of these students will be chosen at random. What is your best guess for their score? The guess will be compared to their actual score to see how far off it is. What is the likely size of the error in your guess?

2.0 Vertical Chimneys in a Scatterplot

[Figure: scatterplots of son's height (inches) against father's height (inches)]

The graph of averages shows the average son's height for each father's height. It is close to a straight line in the middle; at the ends, it is quite bumpy.

2.1 Prediction in a Scatterplot

Use the mean of the relevant sub-group of data (the vertical strip, or "chimney") as the predictor. The S.D. of that sub-group gives the "likely size" of the error in the prediction.

3.0 Regression Line

[Figure: scatterplot of son's height (inches) against father's height (inches) with the regression line]

The regression line is a line fit to the graph of averages. It smooths away some of the chance variation in the data. If the graph of averages is close to a straight line, then we use the regression line to predict Y for a given X. If the graph of averages is non-linear, it is better to use the graph of averages itself.

3.1 Predicting using a Regression Line

Estimate the average weight of the men whose height is 69 inches. If you used the regression method to estimate weight from height, would your estimates generally be a little too high, a little too low, or about right for men in the sample with heights between 72 in. and 74 in.?

4.0 The Regression Line

The regression line for predicting Y from X passes through the point of averages (X̄, Ȳ) and has slope

  r × (S.D. of Y)/(S.D. of X).

5.0 The Equation of the Regression Line

The regression line for predicting Y from X has the form

  Y = a + bX = intercept + slope × X,

where

  b = slope = r × (S.D. of Y)/(S.D. of X),
  a = intercept = Ȳ − bX̄ = Ȳ − r × (S.D. of Y)/(S.D. of X) × X̄.

5.1 Prediction from a Regression Line

The predicted value of Y for a given value of X, say X*, has the form

  Ŷ = a + bX* = Ȳ − r × (S.D. of Y)/(S.D. of X) × X̄ + r × (S.D. of Y)/(S.D. of X) × X*,

or, equivalently,

  Ŷ = Ȳ + r × (S.D. of Y)/(S.D. of X) × (X* − X̄).

5.2 Predicting Sons' Heights

Heights were measured for 1,078 father-son pairs.

- Average height of fathers ≈ 68 in.
- S.D. of fathers' heights ≈ 2.7 in.
- Average height of sons ≈ 69 in.
- S.D. of sons' heights ≈ 2.8 in.
- r ≈ 0.5.

What are the co-ordinates of the point of averages? What is the slope of the regression line? What is the intercept of the regression line? Write the equation of the regression line. Suppose a father has a height of 72 inches. What would you predict for his son's height? Suppose a father has a height of 62 inches. What would you predict for his son's height?
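As a quick numerical check of these questions, here is a minimal Python sketch that computes the slope, intercept, and predictions directly from the summary statistics above (the variable and function names are illustrative, not from the original slides):

```python
# Regression line from the father-son summary statistics above.
mean_x, sd_x = 68.0, 2.7   # fathers: average height and S.D. (in.)
mean_y, sd_y = 69.0, 2.8   # sons: average height and S.D. (in.)
r = 0.5                    # correlation coefficient

slope = r * sd_y / sd_x              # r × (S.D. of Y)/(S.D. of X) ≈ 0.52
intercept = mean_y - slope * mean_x  # Ȳ − slope × X̄ ≈ 33.7

def predict_son_height(father_height):
    """Height predicted by the regression line, in inches."""
    return intercept + slope * father_height

print(predict_son_height(72))  # ≈ 71.1 in.
print(predict_son_height(62))  # ≈ 65.9 in.
```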
5.3 Interpreting the Regression Coefficients

Associated with a unit increase in X, there is some average change in Y; the slope of the regression line estimates this change. The formula for the slope is

  r × (S.D. of Y)/(S.D. of X).

That is, associated with an increase of one S.D. in X, there is an increase of r S.D.s in Y, on the average. The intercept is simply the predicted value of Y when X equals zero, so be wary of extrapolation.

6.0 Regression and Least Squares

The regression line is familiarly referred to as the least squares line, because it minimizes the sum of the squares of the vertical distances from the data points to the line.

[Figure: a data point and its vertical distance to the regression line]

7.0 The Regression Fallacy

In virtually every scatterplot with less-than-perfect correlation, the data points that are extreme along the x-axis tend not to be as extreme along the y-axis. This is called the regression effect.

Definition. Thinking that the regression effect must be due to something important, not just chance error, is called the regression fallacy.

7.1 Example

An instructor standardizes both her midterm and her final each semester so that the class average is 50 and the S.D. is 10 on both tests. The correlation between the two tests is around 0.5. One semester she took all the students who scored below 30 on the midterm and gave them special tutoring. On average, they gained 10 points on the final. She claims that her tutoring worked. Can you give her an alternative explanation?
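To see why chance error alone can explain the gain, here is a minimal simulation sketch in Python with NumPy. It assumes the scores are roughly bivariate normal with the stated averages, S.D.s, and correlation, and it builds in no tutoring effect whatsoever:

```python
import numpy as np

# Simulate midterm and final scores: average 50, S.D. 10 on both tests,
# correlation 0.5, and no tutoring effect at all.
rng = np.random.default_rng(0)
mean = [50.0, 50.0]
cov = [[100.0, 50.0],   # variances 10**2; covariance = 0.5 * 10 * 10
       [50.0, 100.0]]
midterm, final = rng.multivariate_normal(mean, cov, size=100_000).T

# Look only at the students who scored below 30 on the midterm.
low = midterm < 30
gain = final[low].mean() - midterm[low].mean()
print(f"average gain with no tutoring: {gain:.1f} points")
```

Under these assumptions the low scorers gain on the order of ten points on the final purely from the regression effect; that is the alternative explanation the instructor is overlooking.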