Chapter 10: Regression

We are interested in predicting how many publications a faculty member has based on the number of years that have passed since completing his/her PhD. We can do this by using regression!

Regression: "the prediction of one variable from knowledge of one or more other variables"

In this class, we will limit ourselves to linear regression: "regression in which the relationship is linear"

Chapter 10: Page 1

We've already seen that scatterplots visually convey the relationship between two quantitative variables. On a scatterplot, we could draw a straight line through the data points that approximates the relationship between the Y and X variables.

It is only useful to draw such a line when the X variable is thought to explain, cause, or predict the Y variable. In this case, the X variable is called an explanatory variable, and the Y variable is called a response variable.

Chapter 10: Page 2

We sampled 20 Miami faculty and recorded the years since receiving their PhD ("years") and the number of publications they have ("pubs").
A scatterplot of these data is shown below:

Years:  3   6   3   8   9   6  16  10   2   5   5   8   6   6   2   1   4   5  12  11
Pubs:   7   3   4  17  11   6  24  29   9  18  19  19  11   8   3   4  15   9  30  31

[Scatterplot: PUBS (0 to 40) on the vertical axis against YEARS (0 to 18) on the horizontal axis, with a fitted straight line through the points]

Chapter 10: Page 3

We asked SPSS to place a line on the scatterplot that represents the relationship between years since PhD (X variable) and publications (Y variable). Much of the remainder of this lecture will be a discussion of how we find that line and why it is useful.

Finding the 'Best' Regression Line

When you observe a scatterplot, you can 'guess' which line best summarizes the relationship between Y and X. However, this method is highly subjective from person to person, and it also might be affected by the way the scatterplot is constructed. Thus, we have mathematical ways to determine the best line.

Chapter 10: Page 4

The Least-Squares Regression Line

A 'good' regression line comes as close as possible to all the data points in the scatterplot. The points along the regression line represent our best predictions for the value of the Y variable at each level of the X variable. In this case, the points along the line represent our predictions for the number of publications a faculty member will have, given a specific number of years since completing the PhD. You'll notice that very few of the actual data points fall on the line, but most are fairly "close" to the line.

Chapter 10: Page 5

Because we would like to predict the Y variable from the X variable, we would like the (vertical) distance between the points on the graph and the line to be as small as possible. The vertical distance between the predicted value (the point on the line) and the observed (actual) value is called an error or a residual. The error or residual is the difference between the predicted value and the observed value.
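The line SPSS draws can be recomputed directly from the 20 (Years, Pubs) pairs in the table above. Here is a minimal Python sketch using the standard least-squares formulas (the variable names are my own, not from the lecture):

```python
# Fit a least-squares line to the 20 sampled faculty.
# Data read from the Years/Pubs table above.
years = [3, 6, 3, 8, 9, 6, 16, 10, 2, 5, 5, 8, 6, 6, 2, 1, 4, 5, 12, 11]
pubs  = [7, 3, 4, 17, 11, 6, 24, 29, 9, 18, 19, 19, 11, 8, 3, 4, 15, 9, 30, 31]

n = len(years)
mean_x = sum(years) / n          # 6.4
mean_y = sum(pubs) / n           # 13.85

# Standard least-squares formulas:
#   b = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
#   a = mean_y - b * mean_x
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, pubs))
sxx = sum((x - mean_x) ** 2 for x in years)
b = sxy / sxx
a = mean_y - b * mean_x

print(round(b, 3), round(a, 3))  # 1.863 1.927 -- matches the SPSS output
```

These two numbers are exactly the B coefficients SPSS reports for YEARS and the constant.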
Residuals are found by the following equation:

residual = y − ŷ

where ŷ is the "predicted value".

Chapter 10: Page 6

The best regression line is the one that has the smallest residuals. One common way to obtain the smallest residuals is through the least-squares approach. The least-squares regression line is the one that makes the sum of the squared vertical distances between the data points and the line [the residuals] as small as possible.

The equation for the least-squares regression line is:

ŷ = bX + a

ŷ: predicted value of the response variable (Y)
X: explanatory variable
a: intercept; the predicted value of Y when X = 0
b: slope; the change in the predicted value with a 1-unit increase in X

Chapter 10: Page 7

By knowing the regression line, we can predict the values of the response variable for a given level of the explanatory variable. The regression output from SPSS:

Coefficients(a)

Model 1       Unstandardized B   Std. Error   Standardized Beta      t     Sig.
(Constant)         1.927            2.705                           .713   .485
YEARS              1.863             .366           .768           5.090   .000

a. Dependent Variable: PUBS

The regression equation is: ŷ = 1.863X + 1.927

Chapter 10: Page 8

We can predict ŷ at a given value of X simply by solving the equation:

 X    ŷ = 1.863X + 1.927
 3         7.516
 6        13.105
 3         7.516
 8        16.831
 9        18.694
 6        13.105
16        31.735
10        20.557
 2         5.653
 5        11.242
 5        11.242
 8        16.831
 6        13.105
 6        13.105
 2         5.653
 1         3.790
 4         9.379
 5        11.242
12        24.283
11        22.420

Chapter 10: Page 9

Notice that our actual values of Y are fairly close on average to the predicted values of Y:

 X    ŷ = 1.863X + 1.927     Y      Y − ŷ
 3         7.516             7      -.516
 6        13.105             3    -10.105
 3         7.516             4     -3.516
 8        16.831            17       .169
 9        18.694            11     -7.694
 6        13.105             6     -7.105
16        31.735            24     -7.735
10        20.557            29      8.443
 2         5.653             9      3.347
 5        11.242            18      6.758
 5        11.242            19      7.758
 8        16.831            19      2.169
 6        13.105            11     -2.105
 6        13.105             8     -5.105
 2         5.653             3     -2.653
 1         3.790             4       .210
 4         9.379            15      5.621
 5        11.242             9     -2.242
12        24.283            30      5.717
11        22.420            31      8.580
                                Sum:  .00

Chapter 10: Page 10

You'll notice that the sum of the residuals is zero.
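The prediction and residual columns above can be reproduced in a few lines of Python; this sketch uses the rounded SPSS coefficients (1.863 and 1.927), so the sum of residuals is zero up to rounding error:

```python
# Reproduce the prediction/residual table from the fitted equation.
years = [3, 6, 3, 8, 9, 6, 16, 10, 2, 5, 5, 8, 6, 6, 2, 1, 4, 5, 12, 11]
pubs  = [7, 3, 4, 17, 11, 6, 24, 29, 9, 18, 19, 19, 11, 8, 3, 4, 15, 9, 30, 31]

def predict(x):
    """y-hat = 1.863*X + 1.927 (coefficients from the SPSS output)."""
    return 1.863 * x + 1.927

residuals = [y - predict(x) for x, y in zip(years, pubs)]

print(round(predict(3), 3))     # 7.516  -- first row of the table
print(round(residuals[0], 3))   # -0.516
# Positive and negative prediction errors cancel: the residuals
# sum to (essentially) zero.
print(abs(round(sum(residuals), 1)))  # 0.0
```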
Because the positive and negative residuals cancel, their plain sum cannot tell a good line from a bad one. Thus, we find the regression line that minimizes the sum of the squared residuals.

We can use the regression equation to make predictions. So, if I wanted to predict how many publications a faculty member would have who completed his/her PhD 15 years ago:

ŷ = 1.863X + 1.927
ŷ = 1.863(15) + 1.927 = 29.872

Chapter 10: Page 11

Accuracy in Prediction

We can always construct a regression line. The critical issue is: how well does that line actually predict the Y values from the X values? The "error" in our predictions is captured by the following:

s(Y − ŷ) = √[ Σ(Y − ŷ)² / (N − 2) ]

This is the standard error of the estimate: "the average of the squared deviations about the regression line". It is the standard deviation of the errors we make in prediction.

Chapter 10: Page 12

Regression and Correlation

There is a conceptual relationship between correlation and regression. Specifically, if we square the correlation coefficient (r), we find the "fraction of the variation in the values of y that is explained by the least-squares regression of y on x":

r² = proportion of variance in Y explained by relationship with X

Model Summary

Model 1:   R = .768(a)   R Square = .590   Adjusted R Square = .567   Std. Error of the Estimate = 6.0453

a. Predictors: (Constant), YEARS

Chapter 10: Page 13

Hypothesis Testing and Regression

If X can reliably predict Y, then there will be a non-zero slope. Thus, we can test the following hypotheses:

H0: β = 0
H1: β ≠ 0

β is the population counterpart of b. These hypotheses are tested with a t test. Conceptually, we take b and divide it by the standard error of b. We will allow SPSS to do these calculations for us.

Chapter 10: Page 14

Coefficients(a)

Model 1       Unstandardized B   Std. Error   Standardized Beta      t     Sig.
(Constant)         1.927            2.705                           .713   .485
YEARS              1.863             .366           .768           5.090   .000

a. Dependent Variable: PUBS

In this output, B for YEARS is the slope (b) coefficient, Std. Error is the standard error of the slope coefficient, t is the test of whether the slope coefficient differs from zero, and Sig. is the p-value of that test. The t-test of the slope coefficient has n − 2 df.

Chapter 10: Page 15

As usual, if the obtained t equals or surpasses a critical value of t, then we'd reject the null hypothesis. If the obtained t did not equal or surpass a critical value of t, then we'd fail to reject the null hypothesis. In the above case, we rejected the null hypothesis. Our conclusion would be: "The number of years since Miami faculty have earned their PhD predicts the number of publications they have, b = 1.863, t(18) = 5.09, p ≤ .05."

Chapter 10: Page 16

Regression and Outliers

Like means, variances, and standard deviations, the regression line is sensitive to outliers. Be sure to always plot your data first to see if there are points that are far away from the regression line. Suppose I added one outlier to the previous dataset: a faculty member who earned his/her PhD 25 years ago but only has 2 publications.

Chapter 10: Page 17

[Two scatterplots of PUBS (0 to 40) against YEARS, each with its fitted regression line: the original data (YEARS 0 to 18) and the data with the outlier added (YEARS 0 to 30)]

Coefficients(a) — with the outlier included

Model 1       Unstandardized B   Std. Error   Standardized Beta      t     Sig.
(Constant)         9.677            3.371                          2.871   .010
YEARS               .495             .373           .292           1.328   .200

a. Dependent Variable: PUBS

Chapter 10: Page 18

Notice how much the slope of the regression line has shifted downward to accommodate the new point. The slope coefficient is no longer significant! Always plot your data!

Chapter 10: Page 19
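The full analysis in this chapter — the fit, the standard error of the estimate, the t test of the slope, and the effect of the outlier — can be sketched end-to-end in plain Python. The helper function and variable names below are my own; the printed values match the SPSS tables up to rounding:

```python
import math

# Worked example: fit, accuracy, t test of the slope, outlier sensitivity.
years = [3, 6, 3, 8, 9, 6, 16, 10, 2, 5, 5, 8, 6, 6, 2, 1, 4, 5, 12, 11]
pubs  = [7, 3, 4, 17, 11, 6, 24, 29, 9, 18, 19, 19, 11, 8, 3, 4, 15, 9, 30, 31]

def fit(xs, ys):
    """Least-squares slope b, intercept a, and Sxx = sum((x - mean_x)^2)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return b, my - b * mx, sxx

b, a, sxx = fit(years, pubs)
n = len(years)

# Standard error of the estimate: sqrt(sum((Y - yhat)^2) / (N - 2))
sse = sum((y - (b * x + a)) ** 2 for x, y in zip(years, pubs))
s_est = math.sqrt(sse / (n - 2))      # ~ 6.045, as in the Model Summary

# t test of the slope: t = b / SE(b), with SE(b) = s_est / sqrt(Sxx),
# on n - 2 = 18 df
t = b / (s_est / math.sqrt(sxx))      # ~ 5.09

# Refit after adding the outlier: PhD 25 years ago, only 2 publications
b_out, _, _ = fit(years + [25], pubs + [2])

print(round(b, 3), round(s_est, 3), round(t, 2), round(b_out, 3))
# 1.863 6.045 5.09 0.495 -- one point pulls the slope down sharply
```

Rerunning `fit` with and without the extra point makes the chapter's closing warning concrete: a single unusual observation moves the slope from a significant 1.863 to a non-significant 0.495.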