Lecture Notes Regression – Corty Chapter 14

Regression Analysis: Use of a relationship between X-Y pairs to explain or predict variation in Y in terms of differences in the Xs.

Prediction: Use of a relationship between X-Y pairs to predict values of Y based on knowledge of X. For example, since I know that your high school GPA was 3.7, I predict that your college GPA will be about 3.1.

Regression Sample
A sample for which you have X-Y pairs with no missing members of either pair. It is used to develop a prediction equation, a simple equation relating predicted Ys to Xs.

The prediction equation
Predicted Y = Additive constant + Multiplicative constant * X

We'll use this: Predicted Y = a + bX or, equivalently, bX + a. The second version, bX + a, is best when you're doing hand computations.

Regression line
The prediction equation forms a straight line on the scatterplot of Y vs. X. That line is called the regression line or line of best fit.

b and a and the regression line
The constant b is the slope of the regression line on the scatterplot. The constant a is the y-intercept of the line.

Prediction from the equation
For persons for whom you have X but not Y, simply plug their X value into the equation (assuming you've obtained values of a and b) to generate the predicted Y for each one.

Why do regression analysis?
1. Economy in prediction. If you have 1000s of Xs, it would be very difficult to examine all of them to obtain a predicted value for someone. But with the equation, it's easy.
2. Theory. It may be of theoretical interest to know that there is a relationship between Ys and Xs that is expressed by the simple equation: Predicted Y = a + bX.
3. Objectivity in prediction. Without the equation, we might argue about what the predicted Y should be for a person. With it, we all get the same number.

Prediction Example

The data

Pair No.   1   2   3   4   5   6
X          1   4   2   6   3   4
Y          4  14  12  22   6  20

1. The Eyeball Method
Identify a dataset for which you have sufficient X-Y pairs.
A. Create a scatterplot of the X,Y pairs in the regression sample.
B. Draw the best fitting straight line through the scatterplot.
C. For each X value for which a predicted Y is desired, that predicted Y is the height of the best fitting line above the X value.

[Scatterplot of the six X,Y pairs with a best fitting straight line drawn by eye. Note that the best fitting straight line does not necessarily pass through the origin. For example, for X = 3, the predicted value of Y read from the line is about 12 or 13.]

Problem with the eyeball method: Eyeballs differ, so different people will get different prediction equations. It is also not easily computerized.

2. The Formula Method, Predicted Y = a + b*X or, equivalently, b*X + a

A. Compute the slope, b, of the best fitting straight line through the scatterplot.

Slope = b = (N*ΣXY - (ΣX)(ΣY)) / (N*ΣX² - (ΣX)²), or equivalently b = r * (SY/SX)

where N is the number of pairs and SX and SY are the standard deviations of X and Y.

B. Compute the Y-intercept, a, of the best fitting straight line.

Y-intercept = a = Y-bar - Slope * X-bar
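Before working the example by hand, here is a minimal Python sketch of steps A and B. It is an illustration only; the function and variable names are made up, and only the standard library is used.

```python
# Formula method: compute slope b and intercept a from a list of X-Y pairs.

def regression_coefficients(xs, ys):
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    # Step A: slope b = (N*SumXY - SumX*SumY) / (N*SumX2 - SumX**2)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Step B: intercept a = Y-bar - b * X-bar
    a = sum_y / n - b * (sum_x / n)
    return a, b

# The six pairs from the prediction example
xs = [1, 4, 2, 6, 3, 4]
ys = [4, 14, 12, 22, 6, 20]
a, b = regression_coefficients(xs, ys)
print(round(b, 2), round(a, 2))   # 3.52 1.26
```

The unrounded intercept is 1.26; the hand computation below gets 1.27 because b and X-bar are rounded before the subtraction.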
For the example data . . .

Pair No.    X    Y    X²    XY
1           1    4     1     4
2           4   14    16    56
3           2   12     4    24
4           6   22    36   132
5           3    6     9    18
6           4   20    16    80
Sum        20   78    82   314

Slope = (N*ΣXY - (ΣX)(ΣY)) / (N*ΣX² - (ΣX)²) = (6*314 - (20)(78)) / (6*82 - 20²) = 324 / 92 = 3.52

Y-intercept = Y-bar - Slope * X-bar = 13 - 3.52*3.33 = 1.27

C. For each X value for which a predicted Y is desired, that predicted Y is obtained using the following prediction formula.

Predicted Y = Y' = 3.52*X + 1.27

For example, if X = 3, Predicted Y = 3.52*3 + 1.27 = 10.56 + 1.27 = 11.83

Putting the best fitting straight line on a scatterplot
1. Compute Predicted Y for the smallest X.
2. Plot the point (Smallest X, Predicted Y) on the scatterplot.
3. Compute Predicted Y for the largest X.
4. Plot the point (Largest X, Predicted Y) on the scatterplot.
5. Connect the two points with a straight line.

In-Class Example Problem on Regression Analysis

Suppose a manufacturing company is interested in being able to predict how well prospective employees will perform running a machine which bends metal parts into a predetermined shape. A test of eye-hand coordination is given to fourteen persons applying for employment. Scores on the test can range from 0, representing little eye-hand coordination, to 10, representing very good coordination. All 14 are hired, and after six months on the job the performance of each person is measured. The performance measure is the number of parts produced to specification in a one-hour period. Scores on the performance measure could range from 0, representing no parts produced to specification, to 26 or 27, the maximum number the company's best machine operators can produce. The data are as follows:

ID           1   2   3   4   5   6   7   8   9  10  11  12  13  14
Test Score   1   4   2   6   3   4   5   7   3   0   3   5   2   1
Mach Score   4  14  12  22   6  20  15  25  14   3   9  18   7   4

[Hand-drawn scatterplot of Mach Score vs. Test Score for the 14 employees.]

Use the scatterplot to look for nonlinearity and outliers.

[SPSS-generated scatterplot of JOBPERF vs. TESTSCOR.]

Descriptive Statistics

            Mean      Std. Deviation    N
TESTSCOR    3.2857    2.0164           14
JOBPERF     12.3571   7.1426           14

Correlations (Pearson r)

                                  TESTSCOR    JOBPERF
TESTSCOR   Pearson Correlation    1.000        .922
           Sig. (2-tailed)        .            .000
           N                      14           14
JOBPERF    Pearson Correlation     .922       1.000
           Sig. (2-tailed)         .000       .
           N                      14           14

The Sig. value is the p-value for a test of the null hypothesis that the population r = 0.

Hand computation of the prediction equation (N is the number of pairs):

b = r * SY/SX = .922 * 7.1426 / 2.0164 = .922 * 3.5423 = 3.2660, about 3.27

a = Y-bar - b * X-bar = 12.3571 - 3.2660 * 3.2857 = 12.3571 - 10.7310 = 1.63

Predicted Y = a + b*X = 1.63 + 3.27*X, or 3.27*X + 1.63 for ease of computation.
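With the means, standard deviations, and r in hand, the b = r * SY/SX route can be double-checked in a few lines of Python. This is only a sketch; the variable names are made up, and only the standard library is used.

```python
# Check of b = r * SY/SX and a = Y-bar - b * X-bar for the 14 employees.
from statistics import mean, stdev   # stdev = sample (n - 1) standard deviation

test = [1, 4, 2, 6, 3, 4, 5, 7, 3, 0, 3, 5, 2, 1]
perf = [4, 14, 12, 22, 6, 20, 15, 25, 14, 3, 9, 18, 7, 4]

n = len(test)
mx, my = mean(test), mean(perf)
sx, sy = stdev(test), stdev(perf)

# Pearson r: sum of deviation cross-products divided by (n - 1) * SX * SY
r = sum((x - mx) * (y - my) for x, y in zip(test, perf)) / ((n - 1) * sx * sy)

b = r * sy / sx        # slope
a = my - b * mx        # Y-intercept
print(round(r, 3), round(b, 3), round(a, 2))   # 0.922 3.265 1.63
```

The unrounded slope, 3.265, matches the SPSS Coefficients table below; the hand value of 3.27 differs only because r and the standard deviations were rounded first.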
Using the SPSS REGRESSION procedure

1. Enter the data into SPSS.
2. Analyze -> Regression -> Linear.
3. Put the Y variable into the Dependent: field and X into the Independent(s): field.
4. The results . . .

Regression

Variables Entered/Removed(a)

Model   Variables Entered   Variables Removed   Method
1       test(b)             .                   Enter
a. Dependent Variable: machine
b. All requested variables entered.

Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .922(a)   .850       .837                2.884
a. Predictors: (Constant), test
(R here is the Pearson r.)

ANOVA(a)   (Ignore this table for this semester.)

Model 1       Sum of Squares   df   Mean Square   F        Sig.
Regression    563.422           1   563.422       67.752   .000(b)
Residual       99.792          12     8.316
Total         663.214          13
a. Dependent Variable: machine
b. Predictors: (Constant), test

Coefficients(a)

Model 1       Unstandardized B   Std. Error   Standardized Beta   t       Sig.
(Constant)    1.630              1.514                            1.076   .303
test          3.265               .397        .922                8.231   .000
a. Dependent Variable: machine

The B for (Constant) is the intercept, a. The B for test is the slope, b.

Another Example: Predicting College GPA from High School GPA

This example is based on about 4000 students.

Analyze -> Regression -> Linear

Regression

[DataSet1] G:\MDBR\FFROSH\Ffroshnm.sav

Variables Entered/Removed(a)

Model   Variables Entered   Variables Removed   Method
1       hsgpa(b)            .                   Enter
a. Dependent Variable: ogpa1 1ST SEM GPA EXCL FSEM
b. All requested variables entered.

Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .493(a)   .243       .243                .79268
a. Predictors: (Constant), hsgpa

ANOVA(a)   (Ignore the ANOVA table. It's useful only when you have two or more predictors.)

Model 1       Sum of Squares   df     Mean Square   F          Sig.
Regression     960.505            1   960.505       1528.624   .000(b)
Residual      2985.273         4751      .628
Total         3945.778         4752
a. Dependent Variable: ogpa1 1ST SEM GPA EXCL FSEM
b. Predictors: (Constant), hsgpa

Coefficients(a)

Model 1       Unstandardized B   Std. Error   Standardized Beta   t        Sig.
(Constant)    .154               .064                             2.424    .015
hsgpa         .816               .021         .493                39.098   .000
a. Dependent Variable: ogpa1 1ST SEM GPA EXCL FSEM

So Predicted College GPA = 0.154 + 0.816*HSGPA.

The p-value in the lower right corner of the Coefficients table indicates that the population correlation is different from 0. The relationship is positive in the population.

Interpretation of the regression coefficients

Intercept, "a": Expected (predicted) value of Y when X = 0.
Slope, "b": Expected difference in Y between two people who differ by 1 on X.

Example test question: The prediction equation is Predicted Y = 3 + 4*X. Fred scored X = 10. John scored X = 12. What is the predicted difference between their Y values? (They differ by 2 on X, so the predicted difference is 2 * 4 = 8.)

Measuring prediction accuracy

Most people use r², the square of the Pearson r.
r² = 1: Prediction in the regression sample is perfect.
r² = .5: Prediction is about half "of perfection".
r² = 0: Prediction is no better than random guessing.

Residuals: Errors of prediction

Residual = Observed Y - Predicted Y = Y - Y', using Corty's designation, Y', for predicted Y.

Positive residual: Observed Y is bigger than predicted (Y' < Y). The person overachieved, doing better than expected.

Negative residual: Observed Y is smaller than predicted (Y < Y'). The person did worse than expected.
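To make residuals concrete, here is a short Python sketch for the machine-operator data. It computes each person's predicted score and residual from the SPSS coefficients, and recovers r² as 1 - SSresidual/SStotal. Variable names are illustrative.

```python
# Predicted values, residuals, and r-squared for the machine-operator data.

test = [1, 4, 2, 6, 3, 4, 5, 7, 3, 0, 3, 5, 2, 1]
perf = [4, 14, 12, 22, 6, 20, 15, 25, 14, 3, 9, 18, 7, 4]

a, b = 1.630, 3.265   # intercept and slope from the SPSS Coefficients table

predicted = [a + b * x for x in test]                     # Y' = a + b*X
residuals = [y - yp for y, yp in zip(perf, predicted)]    # Y - Y'

# r-squared = 1 - SSresidual/SStotal; matches the Model Summary value, .850
ss_res = sum(e * e for e in residuals)
mean_y = sum(perf) / len(perf)
ss_tot = sum((y - mean_y) ** 2 for y in perf)
print(f"{1 - ss_res / ss_tot:.3f}")   # 0.850

# Positive residual = did better than predicted; negative = worse.
for person, e in enumerate(residuals[:3], start=1):
    print(f"ID {person}: residual = {e:+.2f}")
```

For example, ID 3 scored 12 but was predicted to score about 8.2, a positive residual of roughly +3.8: an overachiever by the equation's standard.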