Lecture 5 - Regression Analysis

Regression analysis: The development of a rule or formula relating a dependent
variable, Y, to one or more independent or predictor variables, X1, X2, . . ., XK in order
1) to predict Y values for cases for whom we have only X(s) or
2) to explain differences in Y's in terms of the X(s).
Note that there is a clear dependent (Y) vs. independent (X) separation here.
In Linear Regression, the formula relating Y to X has one of the following forms:

Various Forms of the Prediction Formulas

One X: Simple Regression
    Raw Score formula:          Predicted Y = a + b*X   or   Predicted Y = b*X + a
    Z (Standard Score) formula: Predicted ZY = r*ZX

Multiple Xs: Multiple Regression
    Raw Score formula:          Predicted Y = B0 + B1*X1 + B2*X2 + . . . + BK*XK
    Z (Standard Score) formula: Predicted ZY = β1*ZX1 + β2*ZX2 + . . . + βK*ZXK
Computing the Simple Regression coefficients
Get a sample of complete X-Y pairs. Let’s call that the Regression Sample.
Raw Score regression coefficients.
    b = (N ΣXY - (ΣX)(ΣY)) / (N ΣX² - (ΣX)²) = r * (SY / SX)

    a = Y-bar - b*X-bar
Z-score regression coefficient
If the Ys are Zs and the Xs are Zs, then
b = Good ol' Pearson r
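The raw-score formulas above can be sketched in Python. The (X, Y) pairs below are invented illustration values, not the lecture's data set; the assertion checks that the computational formula for b and the equivalent r*SY/SX form agree, as they must algebraically.

```python
import math

# Hypothetical regression sample of (X, Y) pairs -- illustration only.
xs = [20.0, 17.0, 26.0, 22.0, 29.0, 15.0, 24.0, 28.0]
ys = [3.18, 2.76, 2.91, 3.40, 3.75, 2.37, 3.41, 3.46]
n = len(xs)

sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

# Raw-score slope: b = (N*SumXY - SumX*SumY) / (N*SumX^2 - (SumX)^2)
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# Intercept: a = Y-bar - b*X-bar
mean_x, mean_y = sum_x / n, sum_y / n
a = mean_y - b * mean_x

# Equivalent form of the slope: b = r * SY / SX
sx = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
sy = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
r = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n * sx * sy)
assert abs(b - r * sy / sx) < 1e-9  # the two formulas agree
```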
Computing the Multiple Regression coefficients
Formulas become quite complicated. We’ll use the computer.
The complexity of formulas for multiple regression analysis is, in my view, one of the
reasons that it was not used (or presented to students) until computers became ubiquitous.
Copyright © 2005 by Michael Biderman
RegAnal.doc - 1
8/17/3
Regression Analysis Example
It is well known that scores on achievement tests, for example the SAT, the
ACT, or the GRE, predict academic performance. Suppose an investigator was interested
in the relationship of general cognitive ability, as measured by a standard IQ test, to
academic performance.
A regression analysis begins with a set of data for which you have both X and Y values
for each person.
The relationship of Ys to Xs is found for this regression sample.
That relationship may then be applied to explain or predict Y values for persons for
whom only X is available.
The data are below.
WPTQScore is X, a Quick-score measure of cognitive ability;
GPA_s is Y, cumulative GPA at the end of the semester.
id2   WPTQScore   GPA_s
[Data listing omitted here: 299 cases (id2 = 1 to 299), with WPTQScore ranging from 6 to 33 and GPA_s ranging from 1.14 to 4.00. Eleven cases have a missing GPA_s (shown as "."), leaving N = 288 complete pairs.]
Getting the Regression coefficients using the Correlation Procedure
Analyze -> Correlate -> Bivariate
Descriptive Statistics

            Mean      Std. Deviation   N
GPA_s       3.16238   .510189          288
WPTQScore   23.49     4.344            288

Correlations (b)

                                  GPA_s    WPTQScore
GPA_s       Pearson Correlation   1        .192**
            Sig. (2-tailed)                .001
WPTQScore   Pearson Correlation   .192**   1
            Sig. (2-tailed)       .001

**. Correlation is significant at the 0.01 level (2-tailed).
b. Listwise N=288
The b of the regression is r * SY/SX = .192 * 0.510189 / 4.344 = 0.0225.
The a of the regression is Y-bar - b*X-bar = 3.16238 - 0.0225*23.49 = 2.634.
So the prediction formula = Predicted GPA = 2.634 + 0.0225*WPTQScore
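The hand computation above can be reproduced in a few lines of Python, using only the printed summary statistics (rounding at the same points as the handout does):

```python
# Reproducing the hand computation from the printed summary statistics.
r = 0.192                        # Pearson r (WPTQScore with GPA_s)
sy, sx = 0.510189, 4.344         # SDs of GPA_s and WPTQScore
mean_y, mean_x = 3.16238, 23.49  # means of GPA_s and WPTQScore

b = round(r * sy / sx, 4)        # slope, rounded as in the handout
a = round(mean_y - b * mean_x, 3)  # intercept

print(b, a)  # 0.0225 2.634
```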
Way too much work. Let’s have the computer do everything.
Getting the Regression Coefficients (and other stuff) using SPSS REGRESSION
The REGRESSION procedure output
Check means and SD's for
unusual values.
Check sample size.
Pearson r between variables.
Since there are only two here,
this is not really needed, but
it comes automatically with
“Descriptives.”
Adjusted R2 is an estimate of the population
R2 taking into account the number of
predictors. The more predictors, the smaller
the adjusted R2. More on this when we get to
multiple regression.
Standard error of estimate is the standard
deviation of the differences between the
observed Ys and predicted Ys.
The ANOVA F tests the
significance of the
relationship of Y to the
whole collection of
independent variables.
Since there is only one
independent variable in
simple regression, the
ANOVA F is redundant
with what appears below.
But SPSS always prints it.
Coefficients (a)

                           Unstandardized Coefficients   Standardized Coefficients
Model                      B        Std. Error           Beta                        t        Sig.
1  (Constant)              2.634    .163                                             16.175   .000
   WPTQScore WPT-Q Score   .022     .007                 .192                        3.300    .001

a. Dependent Variable: GPA_s  End of semester actual GPA

Notes on the table:
Whenever a displayed value appears rounded (e.g., .022), double-click on the cell to get SPSS to display the full value.
Std. Error is the standard error (estimated standard deviation across repeated samples) of the regression coefficient.
Each t tests the hypothesis that, in the population, the regression coefficient = 0.
Beta is the standardized regression coefficient: what the coefficient would be if the regression involved only Z-scores. In simple regression it is equal to Pearson r.
So the prediction equation is Predicted GPA = 2.634 + .022493*WPT.
Recall from the above hand computation, it was Predicted GPA = 2.634 + 0.0225*WPT
So the two results are equal to within rounding error, as they must be.
Interpretation of Coefficients
(Constant)
Regression Y-intercept
a or the additive constant
Expected value of Y when X = 0.
Of little value in most psychological tests, but it is always estimated.
WPTQScore
Regression slope
b or multiplicative constant
1st Interpretation:
Expected Y-difference between two people who differ by 1 on X.
2nd Interpretation:
Expected change in Y associated with a 1-unit increase in X.
So a difference of 1 point on the WPT would be associated with a difference of
.022493 in GPA.
Often, we multiply the X-difference by 10 (or whatever number works) to achieve
a more palatable interpretation. For example, a difference of 10 points on the IQ test
would be associated with a difference of about .2 GPA points.
Regression Arcana
Predicted Y's, symbolized as Y-hat (Ŷ) or Y'

Residuals

    Residual = Y - Predicted Y = Y - Ŷ = Y - Y'
Positive residual: Y better than predicted = Y is overachievement.
Negative residual: Y worse than predicted = Y is underachievement.
Standard Error of Estimate
Simply the standard deviation of the residuals.
SYX = 0: Best possible fit - every Y was exactly equal to its predicted value.
SYX > 0: Poorer fit.
Z's of Residuals
ZResid = (Y - Ŷ)/SYX. Z's of residuals are used to assess individual performance.
ZResid >= 1.96  =>  The Y value was "significantly" greater than predicted.
ZResid ≈ 0      =>  The Y value is about what it was predicted to be.
ZResid <= -1.96 =>  The Y value was "significantly" smaller than predicted.
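The residual and standard-error-of-estimate definitions above can be sketched as follows. The coefficients are the ones from this lecture's GPA example; the five (X, Y) pairs are invented illustration values, not a specific subset of the data set, so the mean residual here is not exactly zero as it would be in the full regression sample.

```python
import math

# Coefficients from the lecture's GPA example; illustrative (X, Y) pairs.
b, a = 0.0225, 2.634
pairs = [(20, 3.18), (17, 2.76), (29, 3.75), (15, 2.37), (30, 2.25)]

# Residual = Y - Predicted Y
residuals = [y - (a + b * x) for x, y in pairs]

# Standard error of estimate = standard deviation of the residuals
mean_res = sum(residuals) / len(residuals)
see = math.sqrt(sum((e - mean_res) ** 2 for e in residuals) / len(residuals))

# Z of each residual; |Z| >= 1.96 flags "significant" over/under-achievement
for (x, y), e in zip(pairs, residuals):
    z = (e - mean_res) / see
    verdict = "over" if z >= 1.96 else ("under" if z <= -1.96 else "about as predicted")
    print(x, y, round(e, 3), verdict)
```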
r-squared, r2
Proportion of variance of Y's linearly related to X's.
Also called the Coefficient of Determination.
r2 = 0: Worst possible fit. Y's not related to X's.
r2 = 1: Best possible fit. Y's perfectly linearly related to X's.
SPSS Output Continued. . . .
More than you ever wanted to know about the residuals
Residuals Statistics (a)

                       Minimum     Maximum    Mean      Std. Deviation   N
Predicted Value        2.76890     3.37622    3.16238   .097713          288
Residual               -2.009288   .885166    .000000   .500744          288
Std. Predicted Value   -4.027      2.188      .000      1.000            288
Std. Residual          -4.006      1.765      .000      .998             288

a. Dependent Variable: GPA_s  End of semester actual GPA
Graphs
Histogram of residuals
This plot is essentially unimodal and
pretty much symmetric.
It’s pretty much what we hope our
residuals plots will look like.
Look for skewness. If it is highly
skewed, you may have to transform your
dependent or independent variables.
Look for bimodality. If it is clearly
bimodal, you may have to break your
data into subgroups or otherwise deal
with whatever is causing the bimodality.
Normal P-P Plot.
This is a plot of Expected Cumulative
Probability of Zs of the residuals vs.
Observed Cumulative Probability of
the Zs of the residuals.
The “expected” means expected if the
Zs of residuals were perfectly normally
distributed.
If the Zs of residuals are normally
distributed, the plot will be linear.
This one is pretty darn near linear.
A plot of residuals vs. predicted Ys.
This plot should be essentially random, with no discernible bend or fanout.
A plot of Y vs. X. (Obtained from the Graph menu, not from REGRESSION.)
While GPAs are related to the Wonderlic scores, the relationship is not as strong in these
data as it is typically found to be. There is considerable variation in GPA not related to
WPT.
Assessing Model Assumptions
Linearity: The scatterplot is essentially linear.
[Two example scatterplots omitted: "Linear" (LG10SAL vs. LG10BEG, an essentially linear plot) and "Not linear" (LG10SAL vs. Beginning Salary, 0 to 80000, a clearly curved plot).]
Homoscedasticity: The variability of Y's about the best fitting straight line is the
same for those pairs with small X's and those pairs with large X's.
Homoscedastic – ACT vs. WPT
Heteroscedastic – Sal vs. Sal Beg
Normality of Residuals: The distribution of residuals is essentially that of the
normal distribution.
Essentially normal - ACT vs. WPT
Positively skewed - Sal vs. SalBeg
[Histogram omitted: Dependent Variable: Current Salary; Regression Standardized Residual on the X axis, Frequency on the Y axis. Std. Dev = 1.00, Mean = 0.00, N = 474.00. The distribution is positively skewed.]
Relating the Raw Score Regression
Equation to the Scatterplot
The prediction equation defines a best fitting straight line (BFSL) through the
scatterplot of Y's vs. X's.
The slope of the line is equal to b and the y-intercept is equal to a.
Predicted Y for an X is the height of the line above the X-value.
For the example data . . .
Predicted Y = 2.634 +0.0225 * X.
Change in Y / Change in X = slope, i.e., b
I edited the chart so that the point X=0, Y=0
would appear on the graph.
Example 2 of SPSS REGRESSION
Example: Predicting P510/511 Performance from Formula Scores
The data for this example are scores in the P510/511 course expressed as a percentage of total possible
points and the formula score used to determine eligibility for admission. The data are taken from several
previous classes.
The issue here is this: Of what use is the formula score? If it doesn’t predict performance in the graduate
courses, why do we use it? If it does predict performance in graduate courses, is that prediction such that
we don’t need any other predictors or is it such that we should search for other predictors in addition to the
formula score?
The data are as follows . .
newform   p511g
[Data listing omitted here: 303 cases, with newform ranging from 845 to 1474 and p511g ranging from .55 to 1.07. No missing values.]
Univariate Statistics on Each Variable
Analyze -> Descriptive Statistics -> Frequencies
These are the Syntax commands which
would give the output below.
FREQUENCIES
  VARIABLES=newform p511g
  /STATISTICS=MEAN MEDIAN
  /HISTOGRAM .
Frequencies
Statistics

              newform   p511g
N   Valid     303       303
    Missing   0         0

The Frequency table has been omitted to save space.
Lecture 6 Regression Analysis- 13
8/17/3
Computing the Regression Equation
Analyze -> Regression -> Linear
Specifying which variables to analyze
Click on the
“Statistics…”
button to tell SPSS
to print descriptive
statistics on each
variable.
Click on the
“Plots…” button to
tell SPSS that you
want diagnostic
plots created
Specifying diagnostic plots . . .
It’s always advisable
to plot residuals vs.
predicted values.
Here we’re requesting
standardized residuals
(ZRESID) vs.
standardized predicted
values (ZPRED).
Output of SPSS's Regression Procedure
Regression
This is the Pearson r
between the criterion
(P511g) and the predictor
(NEWFORM).
Note that REGRESSION
automatically prints the Pearson
r for the regression. For simple
regression analyses (1
predictor), it is the same as that
above. For multiple regressions,
it will be different from the r
printed in the Descriptives table.
The Coefficients Table is the meat of the regression analysis.
The line labeled "(Constant)" gives information on the Y-intercept.
The other line gives information on the predictor, NEWFORM, in this example.
Each t tests the hypothesis that
the population value of the
regression coefficient equals 0.
Standardized multiplicative constant.
Equals r in simple regression.
Standard errors of the estimates that are displayed at the
left.
Double-click on the table, then again on the cell, to get its actual value.
The B for NEWFORM is displayed in scientific notation, 4.49E-4; move the decimal point
as many places + or - as the value after the E, giving 4.49 x 10^-4 = .000449 (full value .0004491824).
So, Predicted P511G = 0.339 + .000449*NEWFORM.
A couple of selected predictions . .
If a student’s formula score was 1200: Predicted P511g = .339+.000449*1200 = .877, almost an A
If a student’s formula score was 1300: Predicted P511g = .339+.000449*1300 = .922, a low A
If a student’s formula score was 1600: Predicted P511g = .339+.000449*1600 = 1.057, a super A
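The selected predictions above can be generated with a small helper using the reported coefficients (the handout's printed values differ only in the last rounded digit):

```python
# Prediction function using the coefficients reported for this example.
def predicted_p511g(formula_score: float) -> float:
    return 0.339 + 0.000449 * formula_score

for score in (1200, 1300, 1600):
    print(score, round(predicted_p511g(score), 3))
```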
If you’re interested in residuals statistics . . .
Charts –
These are the diagnostic plots requested
above.
The distribution of residuals should be
approximately normal.
The distribution is not horribly nonnormal.
The Normal P-P plot should be linear.
This one is acceptably so.
The scatterplot of residuals vs. predicted values should be a "classic" zero correlation scatterplot.
Look for heteroscedasticity
and nonlinearity.
Graphical Representation of Regression Analysis
Graph -> Legacy Dialogs -> Scatter/Dot -> Simple -> Define
Put p511g in the Y-axis field and newform in the X-axis field.
To put a best fitting straight line on the scatterplot.
1. Double-click on the chart.
2. Click on the fit-line button (icon image not shown here).
3. A line will appear on the scatterplot. In addition, a Properties dialog box will open. More on it later.
4. Checking Individual in the Confidence Intervals section yields the scatterplot a couple of pages down from here . . .
Notes on the graph of observed vs. predicted Ys.
[Scatterplot of P511G (Y axis, .7 to 1.0) vs. FORMULA (X axis, 1000 to 1500), with the best fitting straight line, Predicted P511G = .339 + .000449*FORMULA, and Rsq = 0.2520. Annotations on the plot:
Points above the line represent students who performed better than predicted by the formula (positive residuals); one point is marked "Performed substantially better than predicted."
Points below the line represent students who performed worse than predicted by the formula (negative residuals); one point is marked "Performed substantially worse than predicted."
For X = 1070, the height of the line is the predicted Y, and the plotted point is the actual Y.]
Representing 95% Confidence Intervals About the Regression Line
Form a scatterplot and then click on the fit-line button (icon image not shown here).
Check the Individual button in the Confidence Intervals section.
Putting the 95% Individual Confidence
Intervals on the scatterplot is a quick
way to identify persons who performed
quite a bit better or quite a bit poorer
than predicted. They’ll be the points
outside the upper and lower Confidence
Interval bands.
Points above the upper band represent students who performed much better than expected based on their
formula scores.
Points below the lower band represent students who performed much worse than expected based on their
formula scores.
The Scatterplot with Origin Included
After creating the chart, I double-clicked on it and edited it to force the origin to appear on the graph.
[Scatterplot of P511G (Y axis, 0.0 to 1.0) vs. FORMULA (X axis, 0 to 1600), edited so that the origin appears on the graph. Rsq = 0.2520.]
Imagine the points that are not on the scatterplot above. Those would be the points of persons whose
formula scores were not high enough to allow them to be admitted to the program.
The r2 is .25 for the above data. However, if persons with lower formula scores were admitted and took the
course, it is likely that their P511G scores would also be lower, resulting in a scatterplot "ellipse" that was
considerably more elongated than that above - filling in the space between the points in the above scatterplot
and the origin of the plot. See the outlined ellipse in the figure above.
It would be expected that the r2 for such a sample would be considerably larger than the r2 for the truncated
sample of those who were actually admitted to the program. This is a problem that confronts analysts
predicting performance in selective programs like our MS program. It’s called the problem of range
restriction. The restriction of range causes r to be closer to 0 than it would have been had the whole population
been included in the analysis.
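The range-restriction effect can be demonstrated with a small simulation. All numbers below are invented for illustration; the point is only that truncating the X range of a linearly related X-Y pool pulls the correlation toward 0.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(1)

# A simulated "full applicant pool": Y depends linearly on X plus noise.
xs = [random.gauss(1150, 100) for _ in range(5000)]
ys = [0.0005 * x + random.gauss(0, 0.05) for x in xs]
r_full = pearson_r(xs, ys)

# "Admit" only applicants with X at or above the pool mean, as a selective program does.
kept = [(x, y) for x, y in zip(xs, ys) if x >= 1150]
r_restricted = pearson_r([x for x, _ in kept], [y for _, y in kept])

# Restricting the range of X pulls r toward 0.
print(round(r_full, 2), round(r_restricted, 2))
```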
Using Regression to evaluate individual performance:
Performance of UTC's Development Office
One of the tasks of a university development office is to seek funds from public and private donors to support
university functions. Recently, a report was released which included the number of employees in the
development offices of several of UTC’s ‘comparable’ institutions along with the total contributions received
by those offices.
The data are below.
CON98_99 Total contributions in millions of $.
TOTEMPS Total no. of employees – officers and staff.
INST    CON98_99   TOTEMPS
ecu     2.70       45.50
eiu     1.58       20.50
gsu     5.60       38.00
jmu     2.90       26.00
msu     2.30       16.25
ru      3.00       48.00
sfasu   7.20       33.00
unca    2.00       9.00
uni     9.70       60.00
utc     7.40       20.50
uwlc    2.10       24.00
wcup    1.60       40.00
wiu     4.10       36.00
wku     5.70       55.00
uncg    8.80       66.00
asu     9.80       52.00
Number of cases read: 16    Number of cases listed: 16
Regression
Variables Entered/Removed (b)

Model   Variables Entered                        Variables Removed   Method
1       TOTEMPS  Total Dev Office Employees(a)   .                   Enter

a. All requested variables entered.
b. Dependent Variable: CON98_99  Contributions in 98,99

Model Summary

Model   R        R Square   Adjusted R Square   Std. Error of the Estimate
1       .618(a)  .382       .338                2.41963

a. Predictors: (Constant), TOTEMPS  Total Dev Office Employees

ANOVA (b)

Model          Sum of Squares   df   Mean Square   F       Sig.
1  Regression  50.747           1    50.747        8.668   .011(a)
   Residual    81.965           14   5.855
   Total       132.712          15

a. Predictors: (Constant), TOTEMPS  Total Dev Office Employees
b. Dependent Variable: CON98_99  Contributions in 98,99

Coefficients (a)

                                         Unstandardized Coefficients   Standardized Coefficients
Model                                    B      Std. Error             Beta                        t       Sig.
1  (Constant)                            .723   1.505                                              .480    .639
   TOTEMPS  Total Dev Office Employees   .110   .037                   .618                        2.944   .011

a. Dependent Variable: CON98_99  Contributions in 98,99
So, the relationship of contributions in millions to total employees is
Predicted contributions in millions of $ = 0.723 M$ + 0.110 * Total no. of employees.
This means we would expect an increase in contributions of about $110,000 for each additional employee.
The intercept of the equation suggests that universities might expect to receive over $700,000 without any
development office at all. But this conclusion depends on an extrapolation of the curve downward toward 0
employees. We don’t actually know what it would do in that region.
What does the equation tell us?
Development office employees count.
The more employees an office has, the more contributions it can expect.
If you add a development office employee, don’t pay him/her more than $110,000.
Focusing on the residuals . . .
How is UTC doing relative to what it would be expected to do?
[Scatterplot omitted: Contributions in millions of $ (Y axis, 0 to 12) vs. Total No. of Employees (X axis, 0 to 70), with the regression line and each institution labeled (asu, uncg, utc, sfasu, wku, gsu, wiu, jmu, unca, msu, ecu, uwlc, eiu, wcup). An arrow marks UTC's residual: the difference between what UTC received (7+ $M) and what it would have been expected to receive (about 3 $M) based on its total no. of employees.]
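UTC's residual can be computed directly from the coefficients reported above:

```python
# UTC's residual from the development-office regression reported above.
a, b = 0.723, 0.110            # intercept and slope (millions of $ per employee)
utc_employees, utc_actual = 20.50, 7.40

expected = a + b * utc_employees   # what UTC would be expected to receive
residual = utc_actual - expected   # how far above expectation UTC fell

print(round(expected, 2), round(residual, 2))  # 2.98 4.42
```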
The figure is based on data distributed by Margaret Kelley at the 2/24/00 meeting of the Planning, Budgeting, &
Evaluation Committee. The data were prepared by Cindy Jones of Appalachian State University. The figure
excludes the data of two universities (University of Northern Iowa and Radford University) that did not report
DO's and Staff separately.
The line though the figure represents the expected total contributions at each value of No. of employees. Points
above the line represent universities whose development offices received more contributions than they would
have been expected to have received based on the no. of development office employees. Points below the line
represent universities who received fewer contributions than they would have been expected to have received
based on the number of development office personnel.
Summary of uses of regression analysis
1. Prediction of performance.
From the above example,
Predicted P511 = .339 + .000449*Formula.
A graduate student with a formula score of 1200 would be predicted to obtain about .878, almost an A, in P511.
2. Explanation of differences between Y values.
Why does one student have a 3.5 point GPA while another has a 2.5?
Based on the relationship between GPAs and the Wonderlic, part of the reason might be cognitive ability, as
measured, for example, by the Wonderlic. We can also see from the scatter about the regression line in that
example that there are probably other reasons for differences in GPA.
3. Evaluation of performance relative to expectations based on the regression.
Example 1: Did UTC’s Development Office perform well?
The office solicited far more $ than would have been expected based on its size.
Example 2: A student for whom I wrote a letter of reference had GRE scores that weren’t super for a Ph.D.
program. I pointed out in my letter that his performance in my class was more than 1 standard deviation above
that which would have been expected of him based on those scores. Hopefully this helped convince the
doctoral admissions committee that the test scores were not an accurate reflection of his ability. He was
admitted, and he now makes more money than I.
Institutional vs. Individual Emphases
The prediction of P511 scores from the I/O formula score is a good example of the difference between what
might be called an institutional emphasis in regression analysis and an individual emphasis. Consider the
relationship of P511G to Formula illustrated in the following scatterplot . .
[Scatterplot of P511G (Y axis, .7 to 1.0) vs. FORMULA (X axis, 1000 to 1500) with the regression line; Rsq = 0.2520. Annotations:
Institutional emphasis: a generally positive relationship. Persons with high FORMULA scores generally score higher in P511.
Individual emphasis: individuals do perform better (or worse) than predicted.]
Institutional Emphasis:
*Focuses on the regression line - the fact that the overall relationship of P511G to FORMULA is positive.
*It suggests that FORMULA is useful for the I/O program to select students.
*Those students with high formula scores will generally perform better than those students with lower formula
scores.
*The individual differences between points and the regression line will be ignored.
Individual Emphasis:
*Focuses on the residuals – the fact that virtually all individuals scored either above or below the line.
*Emphasis would focus on the fact that most of the points would be mispredicted by a greater or lesser
amount by the regression equation.
*This emphasis would focus on the differences between actual points and the predicted points.
*It would emphasize that it is possible to perform better than expected and that it is possible to perform
worse than expected. In fact, most of the persons represented above did exactly that - perform better or
worse than expected.
*This emphasis causes us to remember that even though a person is predicted to perform in a certain way, in
virtually all real prediction situations, r2 is not 1, so almost every prediction will be somewhat in error.
*It forces us to remember that a person who could be denied might actually be a star performer in the program,
while a person who might easily be accepted could turn out to be a horrible student.
Multiple regression analysis
We’ll spend several weeks in the Spring semester covering multiple regression.
Simple example of multiple regression . . .
Predicting gpa from Conscientiousness (gencon) and Inconsistency (meanGenV).
To perform a multiple regression,
invoke the REGRESSION dialog box.
Then put two or more variables in the
“Independent(s):” field.
The output
Variables Entered/Removed (a)

Model   Variables Entered     Variables Removed   Method
1       meanGenV, gencon(b)   .                   Enter

a. Dependent Variable: eosgpa
b. All requested variables entered.

Model Summary

Model   R        R Square   Adjusted R Square   Std. Error of the Estimate
1       .267(a)  .071       .065                .579975

a. Predictors: (Constant), meanGenV, gencon

The Model Summary Table shows the relationship (Pearson R) of the dependent variable to the combination of predictors.
The ANOVA Table gives the significance of the relationship of the dependent variable to the combination of
predictors.
ANOVA (a)

Model          Sum of Squares   df    Mean Square   F        Sig.
1  Regression  8.395            2     4.198         12.479   .000(b)
   Residual    109.657          326   .336
   Total       118.052          328

a. Dependent Variable: eosgpa
b. Predictors: (Constant), meanGenV, gencon
Coefficients (a)

                Unstandardized Coefficients   Standardized Coefficients
Model           B       Std. Error            Beta                        t        Sig.
1  (Constant)   2.588   .212                                              12.185   .000
   gencon       .157    .039                  .215                        4.007    .000
   meanGenV     -.338   .100                  -.182                       -3.390   .001

a. Dependent Variable: eosgpa
The prediction equation is: Predicted Y = 2.588 + .157*gencon - .338*meanGenV.
Although it looks as if meanGenV is the stronger predictor, a comparison of the Standardized Coefficients (Beta
weights) suggests that it is gencon that is stronger. More on that next semester.
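The relationship between unstandardized B's and standardized Beta weights can be sketched with least squares on simulated data. The variables below are invented stand-ins for gencon, meanGenV, and eosgpa (the coefficients are chosen only to resemble the signs in the output above); the key line is the conversion Beta_k = B_k * SD(X_k) / SD(Y).

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-ins for gencon, meanGenV, and eosgpa -- illustration only.
n = 300
gencon = rng.normal(5.0, 1.0, n)
mean_gen_v = rng.normal(1.5, 0.4, n)
gpa = 2.6 + 0.16 * gencon - 0.34 * mean_gen_v + rng.normal(0, 0.55, n)

# Unstandardized coefficients: least squares on [1, X1, X2]
X = np.column_stack([np.ones(n), gencon, mean_gen_v])
b, *_ = np.linalg.lstsq(X, gpa, rcond=None)

# Standardized (Beta) weights: Beta_k = B_k * SD(X_k) / SD(Y),
# i.e., the coefficients the regression would have on Z-scores.
betas = b[1:] * X[:, 1:].std(axis=0) / gpa.std()
print(np.round(b, 3), np.round(betas, 3))
```

Comparing the betas, rather than the raw B's, is what justifies the statement above that gencon is the stronger predictor despite meanGenV's larger unstandardized coefficient.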
Key issues in multiple regression
1. Does our ability to predict increase with the addition of predictors?
In this case, the evidence suggests that it does.
2. Does the nature (sign, strength, linearity) of the relationship of a dependent variable to an independent
variable change when it is paired (or tripled or quadrupled) with other predictor(s)?
In this case the relationships were about the same regardless of whether they were considered singly in
simple regressions or in the multiple regression.
3. Does the interpretation of a relationship change when put with other predictors?
Yes. It is now interpreted with the phrase ". . . controlling for the other predictors." Much more on that next
semester.