Lecture 5 - Regression Analysis

Regression analysis: The development of a rule or formula relating a dependent variable, Y, to one or more independent or predictor variables, X1, X2, . . ., XK, in order 1) to predict Y values for cases for whom we have only X(s) or 2) to explain differences in Y's in terms of the X(s). Note that there is a clear dependent (Y) vs. independent (X) separation here.

In Linear Regression, the formula relating Y to X has the following form:

Various Forms of the Prediction Formulas

                                   Raw Score formula                                    Z or Standard Score formula
One X: Simple Regression           Predicted Y = a + b*X  (or  Predicted Y = b*X + a)   Predicted ZY = r*ZX
Multiple Xs: Multiple Regression   Predicted Y = B0 + B1*X1 + B2*X2 + . . . + BK*XK     Predicted ZY = β1*ZX1 + β2*ZX2 + . . . + βK*ZXK

Computing the Simple Regression coefficients

Get a sample of complete X-Y pairs. Let’s call that the Regression Sample.

Raw Score regression coefficients

    b = (NΣXY - (ΣX)(ΣY)) / (NΣX² - (ΣX)²) = r * (SY / SX)

    a = Y-bar - b*X-bar = -b*X-bar + Y-bar

Z-score regression coefficient

If the Ys are Zs and the Xs are Zs, then b = Good ol' Pearson r.

Computing the Multiple Regression coefficients

The formulas become quite complicated. We’ll use the computer. The complexity of the formulas for multiple regression analysis is, in my view, one of the reasons that it was not used (or presented to students) until computers became ubiquitous.

Copyright © 2005 by Michael Biderman RegAnal.doc - 1 8/17/3

Regression Analysis Example

It is well known that scores on achievement tests, for example, the SAT, the ACT, or the GRE, predict academic performance. Suppose an investigator was interested in the relationship of general cognitive ability, as measured by a standard IQ test, to academic performance. A regression analysis begins with a set of data for which you have both X and Y values for each person. The relationship of Ys to Xs is found for this regression sample.
That relationship may then be applied to explain or predict Y values for persons for whom only X is available. The data are below. WPTQScore is X, a Quick-score measure of cognitive ability; GPA_s is Y, cumulative GPA at the end of the semester.

[Data listing: 299 cases with columns id2, WPTQScore, GPA_s. The first rows are

id2  WPTQScore  GPA_s
1    20         3.18
2    17         2.76
3    26         2.91
4    22         4.00
5    29         3.75
. . .

Missing GPA_s values appear as '.'. The remainder of the listing has been omitted to save space.]
Getting the Regression coefficients using the Correlation Procedure

Analyze -> Correlate -> Bivariate

Descriptive Statistics

             Mean      Std. Deviation   N
GPA_s        3.16238   .510189          288
WPTQScore    23.49     4.344            288

Correlations (Listwise N=288)

                            GPA_s    WPTQScore
GPA_s       Pearson r       1        .192**
            Sig. (2-tailed)          .001
WPTQScore   Pearson r       .192**   1
            Sig. (2-tailed) .001

**. Correlation is significant at the 0.01 level (2-tailed).

The b of the regression is r * SY/SX = .192 * 0.510189 / 4.344 = 0.0225.
The a of the regression is Y-bar - b*X-bar = 3.16238 - 0.0225*23.49 = 2.634.

So the prediction formula is: Predicted GPA = 2.634 + 0.0225*WPTQScore

Way too much work. Let’s have the computer do everything.

Getting the Regression Coefficients (and other stuff) using SPSS REGRESSION

The REGRESSION procedure output

Check means and SD's for unusual values. Check sample size.

Pearson r between variables. Since there are only two here, this is not really needed, but it comes automatically with “Descriptives.”

Adjusted R2 is an estimate of the population R2 taking into account the number of predictors. The more predictors, the smaller the adjusted R2. More on this when we get to multiple regression.

Standard error of estimate is the standard deviation of the differences between the observed Ys and predicted Ys.

The ANOVA F tests the significance of the relationship of Y to the whole collection of independent variables. Since there is only one independent variable in simple regression, the ANOVA F is redundant with what appears below. But SPSS always prints it.
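The hand computations above can be checked with a few lines of code. This is an illustrative sketch, not part of the SPSS analysis: the five X-Y pairs are just the first five cases of the data listing, and the summary values (r, the SDs, and the means) are copied from the Correlate output above.

```python
# Raw-score formula: b = (N*SumXY - SumX*SumY) / (N*SumX^2 - (SumX)^2),
# illustrated on the first five X-Y pairs of the regression sample.
X = [20, 17, 26, 22, 29]
Y = [3.18, 2.76, 2.91, 4.00, 3.75]
N = len(X)

sum_x, sum_y = sum(X), sum(Y)
sum_xy = sum(x * y for x, y in zip(X, Y))
sum_x2 = sum(x * x for x in X)

b5 = (N * sum_xy - sum_x * sum_y) / (N * sum_x2 - sum_x ** 2)
a5 = sum_y / N - b5 * (sum_x / N)      # a = Y-bar - b*X-bar

# Equivalent route for the full sample, from the summary statistics:
# b = r * SY/SX, then a = Y-bar - b*X-bar (values from the tables above).
r, SY, SX = 0.192, 0.510189, 4.344
Y_bar, X_bar = 3.16238, 23.49

b = r * SY / SX
a = Y_bar - b * X_bar
print(f"Predicted GPA = {a:.3f} + {b:.4f}*WPTQScore")
```

For the full sample this reproduces the hand-computed coefficients (a near 2.634, b near .0225) to within rounding.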
Coefficients (a. Dependent Variable: GPA_s End of semester actual GPA)

                         Unstandardized Coefficients   Standardized
                         B        Std. Error           Beta    t        Sig.
(Constant)               2.634    .163                         16.175   .000
WPTQScore WPT-Q Score    .022     .007                 .192    3.300    .001

Whenever you see a leading 0 in a displayed value, double-click on the cell to get SPSS to display the full value.

The t tests the hypothesis that in the population, the regression coefficient = 0.

Beta is the standardized regression coefficient - what the coefficient would be if the regression involved only Z-scores. In simple regression it is equal to Pearson r.

Std. Error is the standard error (estimated standard deviation across repeated samples) of the regression coefficient.

So the prediction equation is Predicted GPA = 2.634 + .022493*WPT. Recall from the hand computation above, it was Predicted GPA = 2.634 + 0.0225*WPT. So the two results are equal to within rounding error, as they must be.

Interpretation of Coefficients

(Constant) - Regression Y-intercept, a, or the additive constant. Expected value of Y when X = 0. Of little value in most psychological tests, but it is always estimated.

WPTQScore - Regression slope, b, or multiplicative constant.
1st Interpretation: Expected Y-difference between two people who differ by 1 on X.
2nd Interpretation: Expected change in Y associated with a 1-unit increase in X.

So a difference of 1 point on the WPT would be associated with a difference of .022493 in GPA. Often, we multiply the X-difference by 10 (or whatever number works) to achieve a more palatable interpretation. For example, a difference of 10 points on the IQ test would be associated with a difference of about .2 GPA points.

Regression Arcana

Start here on 9/22/15.

Predicted Y's, symbolized as Y-hat or Y'

Residuals

Residual = Y - Predicted Y, i.e., Y - Y-hat or Y - Y'

Positive residual: Y better than predicted = Y is overachievement.
Negative residual: Y worse than predicted = Y is underachievement.

Standard Error of Estimate

Simply the standard deviation of the residuals.
SYX = 0: Best possible fit - every Y was exactly equal to its predicted value.
SYX > 0: Poorer fit.

Z's of Residuals

ZResid = (Y - Y-hat)/SYX. Z's of residuals are used to assess individual performance.
ZResid >= 1.96 => The Y value was "significantly" greater than predicted.
ZResid ≈ 0 => The Y value is about what it was predicted to be.
ZResid <= -1.96 => The Y value was "significantly" smaller than predicted.

r-squared, r2

Proportion of variance of Y's linearly related to X's. Also called the Coefficient of Determination.
r2 = 0: Worst possible fit. Y's not related to X's.
r2 = 1: Best possible fit. Y's perfectly linearly related to X's.

SPSS Output Continued . . . More than you ever wanted to know about the residuals

Residuals Statistics (a. Dependent Variable: GPA_s End of semester actual GPA)

                       Minimum     Maximum   Mean      Std. Deviation   N
Predicted Value        2.76890     3.37622   3.16238   .097713          288
Residual               -2.009288   .885166   .000000   .500744          288
Std. Predicted Value   -4.027      2.188     .000      1.000            288
Std. Residual          -4.006      1.765     .000      .998             288

Graphs

Histogram of residuals. This plot is essentially unimodal and pretty much symmetric. It’s pretty much what we hope our residuals plots will look like.

Look for skewness. If it is highly skewed, you may have to transform your dependent or independent variables.

Look for bimodality. If it is clearly bimodal, you may have to break your data into subgroups or otherwise deal with whatever is causing the bimodality.

Normal P-P Plot. This is a plot of the Expected Cumulative Probability of the Zs of the residuals vs. the Observed Cumulative Probability of the Zs of the residuals. The “expected” means expected if the Zs of residuals were perfectly normally distributed. If the Zs of residuals are normally distributed, the plot will be linear.
This one is pretty darn near linear.

A plot of residuals vs. predicted Ys. This plot should be essentially random, with no discernible bend or fanout.

A plot of Y vs. X. (Obtained from the Graph menu, not from REGRESSION.) While GPAs are related to the Wonderlic scores, the relationship is not as strong in these data as it is typically found to be. There is considerable variation in GPA not related to WPT.

Assessing Model Assumptions

Linearity: The scatterplot is essentially linear.

[Scatterplots: Linear - LG10SAL vs. LG10BEG; Not linear - LG10SAL vs. Beginning Salary]

Homoscedasticity: The variability of Y's about the best fitting straight line is the same for those pairs with small X's and those pairs with large X's.

Homoscedastic – ACT vs. WPT. Heteroscedastic – Sal vs. Sal Beg.

Normality of Residuals: The distribution of residuals is essentially that of the normal distribution.

Essentially normal – ACT vs. WPT. Positively skewed – Sal vs. SalBeg.

[Histogram of Regression Standardized Residual; Dependent Variable: Current Salary. Std. Dev = 1.00, Mean = 0.00, N = 474.00]

Relating the Raw Score Regression Equation to the Scatterplot

The prediction equation defines a best fitting straight line (BFSL) through the scatterplot of Y's vs. X's. The slope of the line is equal to b and the y-intercept is equal to a. Predicted Y for an X is the height of the line above the X-value.

For the example data . . . Predicted Y = 2.634 + 0.0225 * X.

Slope, i.e., b = Change in Y / Change in X.

I edited the chart so that the point X=0, Y=0 would appear on the graph.
Example 2 of SPSS REGRESSION

Example: Predicting P510/511 Performance from Formula Scores

The data for this example are scores in the P510/511 course expressed as a percentage of total possible points and the formula score used to determine eligibility for admission. The data are taken from several previous classes.

The issue here is this: Of what use is the formula score? If it doesn’t predict performance in the graduate courses, why do we use it? If it does predict performance in graduate courses, is that prediction such that we don’t need any other predictors, or is it such that we should search for other predictors in addition to the formula score?

The data are as follows . . .

[Data listing: 303 cases with columns newform (formula score) and p511g (P511 grade as a proportion of points). The first rows are

newform  p511g
1135     .89
1055     .85
1130     .90
1020     .87
1235     .83
. . .

The remainder of the listing has been omitted to save space.]
Univariate Statistics on Each Variable

Analyze -> Descriptive Statistics -> Frequencies

These are the Syntax commands which would give the output below.

FREQUENCIES VARIABLES=formula p511g
  /STATISTICS=MEAN MEDIAN
  /HISTOGRAM .

Frequencies

Statistics
            newform   p511g
N Valid     303       303
  Missing   0         0

The Frequency table has been omitted to save space.
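The MEAN and MEDIAN statistics requested in the syntax above are easy to sanity-check by hand. An illustrative sketch only, using just the first five newform scores from the listing rather than all 303 cases:

```python
import statistics

newform_excerpt = [1135, 1055, 1130, 1020, 1235]  # first five formula scores

mean_nf = statistics.mean(newform_excerpt)      # what FREQUENCIES reports as Mean
median_nf = statistics.median(newform_excerpt)  # what FREQUENCIES reports as Median
print(mean_nf, median_nf)
```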
Lecture 6 Regression Analysis- 13 8/17/3

Computing the Regression Equation

Analyze -> Regression -> Linear

Specifying which variables to analyze. Click on the “Statistics…” button to tell SPSS to print descriptive statistics on each variable. Click on the “Plots…” button to tell SPSS that you want diagnostic plots created.

Specifying diagnostic plots . . . It’s always advisable to plot residuals vs. predicted values. Here we’re requesting standardized residuals (ZRESID) vs. standardized predicted values (ZPRED).

Output of SPSS's Regression Procedure

Regression

This is the Pearson r between the criterion (P511g) and the predictor (NEWFORM). Note that REGRESSION automatically prints the Pearson r for the regression. For simple regression analyses (1 predictor), it is the same as that above. For multiple regressions, it will be different from the r printed in the Descriptives table.

The Coefficients Table is the meat of the regression analysis. The line labeled "(Constant)" gives information on the Y-intercept. The other line gives information on the predictor, NEWFORM, in this example.

Each t tests the hypothesis that the population value of the regression coefficient equals 0.

Beta is the standardized multiplicative constant. It equals r in simple regression.

The Std. Error column gives the standard errors of the estimates displayed at the left.

Double-click on the table, then again on a cell, to get its actual value. The displayed B for NEWFORM is 4.49E-4, i.e., 4.49 x 10^-4 (move the decimal point as many places + or - as the value after E); the full value is .0004491824.

So, Predicted P511G = .339 + .000449*NEWFORM.

A couple of selected predictions . .
If a student’s formula score was 1200: Predicted P511g = .339 + .000449*1200 = .877, almost an A.
If a student’s formula score was 1300: Predicted P511g = .339 + .000449*1300 = .922, a low A.
If a student’s formula score was 1600: Predicted P511g = .339 + .000449*1600 = 1.057, a super A.

If you’re interested in residuals statistics . . .

Charts – These are the diagnostic plots requested above.

The distribution of residuals should be approximately normal. The distribution is not horribly nonnormal.

The Normal P-P plot should be linear. This one is acceptably so.

The scatterplot of residuals vs. predicted values should be a "classic" zero correlation scatterplot. Look for heteroscedasticity and nonlinearity.

Graphical Representation of Regression Analysis

Graph -> Legacy Dialogs -> Scatter/Dot -> Simple -> Define. Put p511g in the Y-axis field and newform in the X-axis field.

To put a best fitting straight line on the scatterplot:
1. Double-click on the chart.
2. Click on .
3. A line will appear on the scatterplot. In addition, a Properties dialog box will open. More on it later.
4. Checking Individual in the Confidence Intervals section yields the scatterplot a couple of pages down from here . . .

Notes on the graph of observed vs. predicted Ys (P511G vs. FORMULA; Rsq = 0.2520; Predicted P511G = .000449*FORMULA + .339):

Points above the line represent students who performed better than predicted by the formula (positive residuals); the highest points performed substantially better than predicted.
Points below the line represent students who performed worse than predicted by the formula (negative residuals); the lowest points performed substantially worse than predicted.
For a given X, e.g., X = 1070, the height of the line is the Predicted Y, and the height of the point is the actual Y.
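The selected predictions above come straight from the fitted equation. A small sketch (the function name predicted_p511g is mine, not SPSS's; the page above truncates the last decimal rather than rounding):

```python
def predicted_p511g(formula_score):
    # Fitted simple-regression equation from the Coefficients table
    return 0.339 + 0.000449 * formula_score

for score in (1200, 1300, 1600):
    print(score, round(predicted_p511g(score), 3))
```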
Representing 95% Confidence Intervals About the Regression Line

Form a scatterplot and then click on . Check the Individual button in the Confidence Intervals section.

Putting the 95% Individual Confidence Intervals on the scatterplot is a quick way to identify persons who performed quite a bit better or quite a bit poorer than predicted. They’ll be the points outside the upper and lower Confidence Interval bands. Points above the upper band represent students who performed much better than expected based on their formula scores. Points below the lower band represent students who performed much worse than expected based on their formula scores.

The Scatterplot with Origin Included

After creating the chart, I double-clicked on it and edited it to force the origin to appear on the graph.

[Scatterplot: P511G vs. FORMULA with the origin (0, 0) included; Rsq = 0.2520]

Imagine the points that are not on the scatterplot above. Those would be the points of persons whose formula scores were not high enough to allow them to be admitted to the program. The r2 is .25 for the above data. However, if persons with lower formula scores were admitted and took the course, it is likely that their P511G scores would also be lower, resulting in a scatterplot "ellipse" that was considerably more elongated than that above - filling in the space between the points in the above scatterplot and the origin of the plot. See the outlined ellipse in the figure above. It would be expected that the r2 for such a sample would be considerably larger than the r2 for the truncated sample of those who were actually admitted to the program. This is a problem that confronts analysts predicting performance in selective programs like our MS program. It’s called the problem of range restriction.
The restriction of range causes r to be closer to 0 than it would have been had the whole population been included in the analysis.

Using Regression to evaluate individual performance: Performance of UTC's Development Office

One of the tasks of a university development office is to seek funds from public and private donors to support university functions. Recently, a report was released which included the number of employees in the development offices of several of UTC’s ‘comparable’ institutions along with the total contributions received by those offices. The data are below.

CON98_99: Total contributions in millions of $. TOTEMPS: Total no. of employees – officers and staff.

INST    CON98_99   TOTEMPS
ecu     2.70       45.50
eiu     1.58       20.50
gsu     5.60       38.00
jmu     2.90       26.00
msu     2.30       16.25
ru      3.00       48.00
sfasu   7.20       33.00
unca    2.00       9.00
uni     9.70       60.00
utc     7.40       20.50
uwlc    2.10       24.00
wcup    1.60       40.00
wiu     4.10       36.00
wku     5.70       55.00
uncg    8.80       66.00
asu     9.80       52.00

Number of cases read: 16. Number of cases listed: 16.

Regression

Variables Entered/Removed (b. Dependent Variable: CON98_99 Contributions in 98,99)
Model 1: TOTEMPS Total Dev Office Employees entered (Method: Enter). All requested variables entered.

Model Summary (a. Predictors: (Constant), TOTEMPS Total Dev Office Employees)

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .618a   .382       .338                2.41963

ANOVA (a. Predictors: (Constant), TOTEMPS Total Dev Office Employees; b. Dependent Variable: CON98_99 Contributions in 98,99)

Model 1      Sum of Squares   df   Mean Square   F       Sig.
Regression   50.747           1    50.747        8.668   .011a
Residual     81.965           14   5.855
Total        132.712          15

Coefficients (a. Dependent Variable: CON98_99 Contributions in 98,99)

                                     B      Std. Error   Beta   t       Sig.
(Constant)                           .723   1.505               .480    .639
TOTEMPS Total Dev Office Employees   .110   .037         .618   2.944   .011

So, the relationship of contributions in millions to total employees is

Predicted contributions in millions of $ = 0.723 M$ + 0.110 * Total no. of employees.

This means we would expect an increase in contributions of about $110,000 for each additional employee. The intercept of the equation suggests that universities might expect to receive over $700,000 without any development office at all. But this conclusion depends on an extrapolation of the curve downward toward 0 employees. We don’t actually know what it would do in that region.

What does the equation tell us? Development office employees count. The more employees an office has, the more contributions it can expect. If you add a development office employee, don’t pay him/her more than $110,000.

Focusing on the residuals . . . How is UTC doing relative to what it would be expected to do?

[Scatterplot: Total contributions vs. Total No. of Employees for the 16 institutions, with the regression line. UTC's point lies well above the line; the vertical distance is the difference between what UTC received (7+ $M) and what it would have been expected to receive (about 3 $M) based on its total no. of employees.]

The figure is based on data distributed by Margaret Kelley at the 2/24/00 meeting of the Planning, Budgeting, & Evaluation Committee. The data were prepared by Cindy Jones of Appalachian State University. The figure excludes the data of two universities (University of Northern Iowa and Radford University) who did not report DO's and Staff separately. The line through the figure represents the expected total contributions at each value of No. of employees.
Points above the line represent universities whose development offices received more contributions than they would have been expected to have received based on the no. of development office employees. Points below the line represent universities who received fewer contributions than they would have been expected to have received based on the number of development office personnel.

Summary of uses of regression analysis

1. Prediction of performance. From the above example, Predicted P511 = .339 + .000449*Formula. A graduate student with a formula score of 1200 would be predicted to obtain .877, almost an A, in P511.

2. Explanation of differences between Y values. Why does one student have a 3.5 GPA while another has a 2.5? Based on the relationship between GPAs and the Wonderlic, part of the reason might be cognitive ability, as measured, for example, by the Wonderlic. We can also see from the scatter about the regression line in that example that there are probably other reasons for differences in GPA.

3. Evaluation of performance relative to expectations based on the regression.

Example 1: Did UTC’s Development Office perform well? The office solicited far more $ than would have been expected based on its size.

Example 2: A student for whom I wrote a letter of reference had GRE scores that weren’t super for a Ph.D. program. I pointed out in my letter that his performance in my class was more than 1 standard deviation above that which would have been expected of him based on those scores. Hopefully this helped convince the doctoral admissions committee that the test scores were not an accurate reflection of his ability. He was admitted, and he now makes more money than I do.

Institutional vs.
Individual Emphases

The prediction of P511 scores from the I/O formula score is a good example of the difference between what might be called an institutional emphasis in regression analysis and an individual emphasis. Consider the relationship of P511G to Formula illustrated in the following scatterplot . . .

[Scatterplot: P511G vs. FORMULA, Rsq = 0.2520. Annotations: Individual emphasis - individuals do perform better or worse than predicted. Institutional emphasis - a generally positive relationship; persons with high FORMULA scores generally score higher in P511.]

Institutional Emphasis:
*Focuses on the regression line - the fact that the overall relationship of P511G to FORMULA is positive.
*It suggests that FORMULA is useful for the I/O program to select students.
*Those students with high formula scores will generally perform better than those students with lower formula scores.
*The individual differences between points and the regression line will be ignored.

Individual Emphasis:
*Focuses on the residuals – the fact that virtually all individuals scored either above or below the line.
*Emphasis would focus on the fact that most of the points would be mispredicted by a greater or lesser amount by the regression equation.
*This emphasis would focus on the differences between actual points and the predicted points.
*It would emphasize that it is possible to perform better than expected and that it is possible to perform worse than expected. In fact, most of the persons represented above did exactly that - perform better or worse than expected.
*This emphasis causes us to remember that even though a person is predicted to perform in a certain way, in virtually all real prediction situations, r2 is not 1, so almost every prediction will be somewhat in error.
*It forces us to remember that a person who could be denied might actually be a star performer in the program, while a person who might easily be accepted could turn out to be a horrible student.
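As a numeric footnote to the development-office example a few pages back, UTC's residual works out in two lines. This is my sketch, not SPSS output; the coefficient values are copied from the Coefficients table of that example.

```python
a, b = 0.723, 0.110           # intercept and slope from the TOTEMPS regression
utc_totemps = 20.5            # UTC's total development-office employees
utc_actual = 7.40             # UTC's contributions, in millions of $

utc_expected = a + b * utc_totemps        # about 3 million expected
utc_residual = utc_actual - utc_expected  # UTC beat expectation by over 4 million
```

This is exactly the "individual emphasis": the institution's standing is judged by its residual, not by the line.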
Multiple regression analysis

We’ll spend several weeks in the Spring semester covering multiple regression.

Simple example of multiple regression . . . Predicting gpa from Conscientiousness (gencon) and Inconsistency (meanGenV).

To perform a multiple regression, invoke the REGRESSION dialog box. Then put two or more variables in the “Independent(s):” field.

The output

Variables Entered/Removed (a. Dependent Variable: eosgpa)
Model 1: meanGenV, gencon entered (Method: Enter). All requested variables entered.

The Model Summary Table shows the relationship (Pearson R) of the dependent variable to the combination of predictors.

Model Summary (a. Predictors: (Constant), meanGenV, gencon)

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .267a   .071       .065                .579975

The ANOVA Table gives the significance of the relationship of the dependent variable to the combination of predictors.

ANOVA (a. Dependent Variable: eosgpa; b. Predictors: (Constant), meanGenV, gencon)

Model 1      Sum of Squares   df    Mean Square   F        Sig.
Regression   8.395            2     4.198         12.479   .000b
Residual     109.657          326   .336
Total        118.052          328

Coefficients (a. Dependent Variable: eosgpa)

             Unstandardized Coefficients   Standardized
             B       Std. Error            Beta    t        Sig.
(Constant)   2.588   .212                          12.185   .000
gencon       .157    .039                  .215    4.007    .000
meanGenV     -.338   .100                  -.182   -3.390   .001

The prediction equation is: Predicted Y = 2.588 + .157*gencon - .338*meanGenV.

Although it looks as if meanGenV is the stronger predictor, a comparison of the Standardized Coefficients (Beta weights) suggests that it is gencon that is stronger. More on that next semester.

Key issues in multiple regression

1. Does our ability to predict increase with the addition of predictors? In this case, the evidence suggests that it does.

2.
Does the nature (sign, strength, linearity) of the relationship of a dependent variable to an independent variable change when it is paired (or tripled or quadrupled) with other predictor(s)? In this case the relationships were about the same regardless of whether they were considered singly in simple regressions or in the multiple regression.

3. Does the interpretation of a relationship change when put with other predictors? Yes. It is now interpreted with the phrase “. . . controlling for the other predictors.” Much more on that next semester.
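The multiple-regression prediction equation above can be wrapped in a small function. A sketch only: the function name and the example input values (gencon = 5, meanGenV = 1) are mine and purely illustrative, not actual cases from the data set.

```python
def predicted_eosgpa(gencon, mean_gen_v):
    # Coefficients from the Coefficients table above
    return 2.588 + 0.157 * gencon - 0.338 * mean_gen_v

# Illustrative inputs only
pred = predicted_eosgpa(5, 1)

# Comparing the raw Bs (.157 vs. -.338) misleads because the predictors have
# different SDs; the Beta weights (.215 vs. -.182) are on a common Z-score
# scale, which is why gencon is the stronger predictor.
```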