Example of Multiple Correlation/Regression With Three Predictors

Example of Three Predictor Multiple Regression/Correlation Analysis: Checking
Assumptions, Transforming Variables, and Detecting Suppression
The data are from Guber, D.L. (1999). Getting what you pay for: The debate over
equity in public school expenditures. Journal of Statistics Education, 7, 1-8. The
research units are the fifty states in the USA. We shall pretend they represent a
random sample from a population of interest. The criterion variable is mean SAT in the
state. The predictors are Expenditure ($ spent per student), Salary (mean salary of
teachers), and Teacher/Pupil Ratio. If we consider the predictor variables to be fixed
(the regression model), then we do not worry about the shape of the distributions of the
predictor variables. If we consider the predictor variables to be random (the correlation
model) we do. It turns out that each of the predictors has a distinct positive skewness,
which can be greatly reduced by a negative reciprocal transformation (-1/X). The
transformed variables carry the suffix _nr in the tables.
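As a rough illustration of the transformation, here is a minimal Python sketch (the analysis itself was run in SPSS, so the function names here are my own and purely illustrative). It applies -1/X to the smallest and largest Expenditure scores and, assuming the usual large-sample formulas for the standard errors of skewness and kurtosis, recomputes the standard errors for N = 50 reported in the Descriptive Statistics table.

```python
import math


def neg_reciprocal(x):
    """Negative reciprocal transformation: -1/x (assumes positive scores)."""
    if x <= 0:
        raise ValueError("the transformation assumes positive scores")
    return -1.0 / x


def se_skewness(n):
    """Standard error of the sample skewness statistic for sample size n."""
    return math.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


def se_kurtosis(n):
    """Standard error of the sample (excess) kurtosis statistic for sample size n."""
    return 2.0 * se_skewness(n) * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))


# Minimum and maximum of Expenditure, copied from the Descriptive Statistics table.
expend_min, expend_max = 3.66, 9.77
print(round(neg_reciprocal(expend_min), 2), round(neg_reciprocal(expend_max), 2))

# Standard errors depend only on N, which is 50 for every variable here.
print(round(se_skewness(50), 3), round(se_kurtosis(50), 3))
```

Because -1/X preserves the order of the scores (unlike 1/X, which reverses it), the signs of the correlations with the transformed variables can be read the same way as for the original variables.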
Descriptive Statistics

                                                 Skewness               Kurtosis
Variable             N    Minimum   Maximum   Statistic  Std. Error  Statistic  Std. Error
Expenditure          50      3.66      9.77     1.107      .337        1.279      .662
Expend_nr            50      -.27      -.10     -.109      .337         .009      .662
salary               50     25.99     50.05      .757      .337         .028      .662
Salary_nr            50      -.04      -.02      .090      .337        -.620      .662
SAT                  50    844.00   1107.00      .236      .337       -1.309      .662
TeachPerPup          50     13.80     24.30     1.334      .337        2.583      .662
TeachPerPup_nr       50      -.07      -.04      .490      .337         .220      .662
Valid N (listwise)   50

Here are the zero-order correlations for the untransformed variables:

Correlations (N = 50 for every pair)

                                   Expenditure   salary    TeachPerPup   SAT
Expenditure   Pearson Correlation     1          .870**     -.371**     -.381**
              Sig. (2-tailed)                    .000        .008        .006
salary        Pearson Correlation    .870**       1         -.001       -.440**
              Sig. (2-tailed)        .000                    .994        .001
TeachPerPup   Pearson Correlation   -.371**     -.001         1          .081
              Sig. (2-tailed)        .008        .994                    .575
SAT           Pearson Correlation   -.381**     -.440**      .081         1
              Sig. (2-tailed)        .006        .001        .575

**. Correlation is significant at the 0.01 level (2-tailed).
Here is a regression analysis with the untransformed variables. I asked SPSS
for a plot of the standardized residuals versus the standardized predicted scores. I also
asked for a histogram of the residuals.
Model Summary(b)

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .458(a)   .210       .158                68.65350

a. Predictors: (Constant), TeachPerPup, salary, Expenditure
b. Dependent Variable: SAT
ANOVA(b)

Model 1       Sum of Squares   df   Mean Square   F       Sig.
Regression       57495.745      3    19165.248    4.066   .012(a)
Residual        216811.9       46     4713.303
Total           274307.7       49

a. Predictors: (Constant), TeachPerPup, salary, Expenditure
b. Dependent Variable: SAT
Coefficients(a)

                Unstandardized Coefficients   Standardized Coefficients
Model 1            B          Std. Error              Beta                  t       Sig.
(Constant)      1069.234       110.925                                     9.639    .000
Expenditure       16.469        22.050                .300                  .747    .459
salary            -8.823         4.697               -.701                -1.878    .067
TeachPerPup        6.330         6.542                .192                  .968    .338

a. Dependent Variable: SAT
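The entries in the three tables above are tied together by simple arithmetic. The following Python sketch (not part of the SPSS run; the variable names are mine) recomputes R Square, Adjusted R Square, F, the standard error of the estimate, and the t ratios from the sums of squares and coefficients as printed, so small discrepancies in the last decimal place are to be expected from rounding.

```python
import math

# Values copied from the ANOVA and Coefficients tables (untransformed predictors).
ss_regression, ss_total = 57495.745, 274307.7
df_regression, df_residual = 3, 46
n = df_regression + df_residual + 1  # 50 states

ss_residual = ss_total - ss_regression
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

r_square = ss_regression / ss_total
adj_r_square = 1 - (1 - r_square) * (n - 1) / df_residual
f_ratio = ms_regression / ms_residual
std_error_estimate = math.sqrt(ms_residual)

# Each t is the unstandardized coefficient divided by its standard error.
coefficients = {
    "(Constant)": (1069.234, 110.925),
    "Expenditure": (16.469, 22.050),
    "salary": (-8.823, 4.697),
    "TeachPerPup": (6.330, 6.542),
}
t_ratios = {name: b / se for name, (b, se) in coefficients.items()}

print(round(r_square, 3), round(adj_r_square, 3), round(f_ratio, 3))
print(round(std_error_estimate, 2))
for name, t in t_ratios.items():
    print(name, round(t, 3))
```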
If you compare the beta weights with the zero-order correlations, it is obvious that
we have some suppression taking place. The beta for expenditure is positive (.300), but
the zero-order correlation between SAT and expenditure was negative (-.381). For the
other two predictors, the absolute value of beta exceeds the absolute value of the
zero-order correlation with SAT.
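The comparison just described can be written as a small screening routine. This is only a sketch in Python: the function name suppression_signs is hypothetical, and the two checks (a sign reversal, or a beta larger in absolute value than r) are the informal warning signs used in this handout, not a formal test.

```python
# Beta weights and zero-order correlations with SAT, copied from the
# Coefficients and Correlations tables for the untransformed variables.
predictors = {
    "Expenditure": {"beta": 0.300, "r": -0.381},
    "salary": {"beta": -0.701, "r": -0.440},
    "TeachPerPup": {"beta": 0.192, "r": 0.081},
}


def suppression_signs(beta, r):
    """Return the warning signs of suppression shown by one predictor."""
    signs = []
    if beta * r < 0:
        signs.append("beta and r have opposite signs")
    if abs(beta) > abs(r):
        signs.append("|beta| exceeds |r|")
    return signs


for name, values in predictors.items():
    print(name, suppression_signs(values["beta"], values["r"]))
```

Run on these values, the routine flags Expenditure for the sign reversal and the other two predictors for the inflated beta.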
Here is a histogram of the residuals with a normal curve superimposed:
The residuals appear to be approximately normally distributed. The plot of
standardized residuals versus standardized predicted scores will allow us visually to
check for heterogeneity of variance, nonlinear trends, and normality of the residuals
across values of the predicted variable. I have drawn in the regression line (error = 0). I
see no obvious problems here.
Under the homoscedasticity assumption there should be no correlation between
the predicted scores and error variance. The vertical spread of the dots in the plot
above should not vary as we move left to right. I squared the residuals and correlated
them with the predicted values. If the residuals were increasing in variance as the
predicted values increase this correlation would be positive. It is close to zero,
confirming my eyeball conclusion that there is no problem with that fairly common sort
of heteroscedasticity.
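Here is a minimal Python sketch of that check. The raw SAT data are not reproduced in this handout, so the sketch uses hypothetical toy data with a single predictor, deliberately built so that the errors grow with the predictor; the helper names are my own. With data like these, the correlation between the predicted scores and the squared residuals comes out clearly positive, which is the pattern we did not find in the SAT analysis.

```python
import math


def mean(values):
    return sum(values) / len(values)


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fit_line(x, y):
    """Ordinary least squares intercept and slope for one predictor."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope


# Hypothetical toy data (NOT the SAT data): the error grows with x.
x = list(range(1, 11))
y = [2 + 3 * xi + (-1) ** xi * 0.3 * xi for xi in x]

intercept, slope = fit_line(x, y)
predicted = [intercept + slope * xi for xi in x]
squared_residuals = [(yi - pi) ** 2 for yi, pi in zip(y, predicted)]

# A clearly positive value signals variance increasing with the predicted score.
r_check = pearson_r(predicted, squared_residuals)
print(round(r_check, 3))
```

Note that this check only detects variance that rises or falls steadily with the predicted score; a fan that widens in the middle, for example, could still produce a correlation near zero.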
Correlations

                                       Residual2
Predicted_SAT   Pearson Correlation      .093
Now let us look at the results using the transformed data.
Correlations (N = 50 for every pair)

                                      Expend_nr   Salary_nr   TeachPerPup_nr   SAT
Expend_nr        Pearson Correlation     1         .816**       -.425**       -.398**
                 Sig. (2-tailed)                   .000          .002          .004
Salary_nr        Pearson Correlation    .816**      1            .015         -.467**
                 Sig. (2-tailed)        .000                     .920          .001
TeachPerPup_nr   Pearson Correlation   -.425**     .015           1            .089
                 Sig. (2-tailed)        .002       .920                        .537
SAT              Pearson Correlation   -.398**    -.467**        .089           1
                 Sig. (2-tailed)        .004       .001          .537

**. Correlation is significant at the 0.01 level (2-tailed).
The correlation matrix looks much like it did with the untransformed data.
Model Summary(b)

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .482(a)   .232       .182                67.65259

a. Predictors: (Constant), TeachPerPup_nr, Salary_nr, Expend_nr
b. Dependent Variable: SAT
R² has increased a bit, from .210 to .232.
ANOVA(b)

Model 1       Sum of Squares   df   Mean Square   F       Sig.
Regression       63771.502      3    21257.167    4.644   .006(a)
Residual        210536.2       46     4576.873
Total           274307.7       49

a. Predictors: (Constant), TeachPerPup_nr, Salary_nr, Expend_nr
b. Dependent Variable: SAT
Coefficients(a)

                   Unstandardized Coefficients   Standardized Coefficients
Model 1               B           Std. Error             Beta                  t       Sig.
(Constant)           850.240        130.425                                   6.519    .000
Expend_nr            367.276        692.442              .181                  .530    .598
Salary_nr          -9823.521       4920.257             -.618                -1.997    .052
TeachPerPup_nr      1805.969       2031.564              .176                  .889    .379

a. Dependent Variable: SAT
The transformation caused no major changes, which is comforting. Trust me
that the residual plots still look OK too.
I wonder what high school teachers would think about the negative relationship
between average state salary for teachers and average state SAT score? If we want
better education should we lower teacher salaries? There is an important state
characteristic that we should have but have not included in our model. Check out the
JSE article to learn what that characteristic is.
Now, can we figure out what sort of suppression is going on here?
Coefficients(a)

Model 1            Standardized Coefficient (Beta)      r
Expend_nr                      .181                   -.398
Salary_nr                     -.618                   -.467
TeachPerPup_nr                 .176                    .089

a. Dependent Variable: SAT
It looks like the expenditures variable is suppressing irrelevant variance in one or
both or a linear combination of the other two predictors. Put another way, if we hold
constant the effects of teacher salary and number of teachers per pupil, then the
relationship between expenditures and SAT goes from negative to positive. Maybe the
money is best spent on things other than hiring more teachers or better paid teachers?
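To see where the sign change comes from, the beta weights can be recovered from the correlations alone by solving the standardized normal equations (predictor intercorrelation matrix times betas = correlations with SAT). The Python sketch below does this with a small hand-written solver (solve is my own helper, not an SPSS or library routine). Because the correlations are entered as rounded to three decimals and two of the predictors are highly correlated, the results should agree with the SPSS betas only to within about .01.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Move the row with the largest entry in this column into pivot position.
        pivot = max(range(col, n), key=lambda i: abs(aug[i][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Eliminate this column from every other row.
        for row in range(n):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [a - factor * b for a, b in zip(aug[row], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


# Correlations among Expend_nr, Salary_nr, TeachPerPup_nr (transformed data).
r_xx = [
    [1.000, 0.816, -0.425],
    [0.816, 1.000, 0.015],
    [-0.425, 0.015, 1.000],
]
# Zero-order correlations of each predictor with SAT.
r_xy = [-0.398, -0.467, 0.089]

betas = solve(r_xx, r_xy)
# R-squared is the sum of each beta times its zero-order correlation with SAT.
r_square = sum(b * r for b, r in zip(betas, r_xy))
print([round(b, 3) for b in betas], round(r_square, 3))
```

The first beta comes out positive even though the corresponding zero-order correlation is negative, which is the reversal discussed above.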
Let us look at two-predictor models.
Coefficients(a)

Model 1            Standardized Coefficient (Beta)      r
Expend_nr                     -.049                   -.398
Salary_nr                     -.428                   -.467

a. Dependent Variable: SAT
No suppression between expenditures and teacher salary.
Coefficients(a)

Model 1            Standardized Coefficient (Beta)      r
Expend_nr                     -.439                   -.398
TeachPerPup_nr                -.097                    .089

a. Dependent Variable: SAT
A little bit of classical suppression here, but not dramatic.
Coefficients(a)

Model 1            Standardized Coefficient (Beta)      r
Salary_nr                     -.469                   -.467
TeachPerPup_nr                 .096                    .089

a. Dependent Variable: SAT
A little bit of cooperative suppression here, but not dramatic.
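With only two predictors, each beta has a simple closed form: beta1 = (ry1 - ry2*r12) / (1 - r12^2), and likewise for beta2 with the roles swapped. The Python sketch below (the function name is mine) applies that formula to the three pairs of transformed predictors, using the correlations as printed, so the results match the SPSS betas only to within rounding.

```python
def two_predictor_betas(r_y1, r_y2, r_12):
    """Standardized weights for predicting y from two correlated predictors."""
    denominator = 1 - r_12 ** 2
    beta_1 = (r_y1 - r_y2 * r_12) / denominator
    beta_2 = (r_y2 - r_y1 * r_12) / denominator
    return beta_1, beta_2


# Correlations from the transformed-variable matrix.
r_sat = {"Expend_nr": -0.398, "Salary_nr": -0.467, "TeachPerPup_nr": 0.089}
r_between = {
    ("Expend_nr", "Salary_nr"): 0.816,
    ("Expend_nr", "TeachPerPup_nr"): -0.425,
    ("Salary_nr", "TeachPerPup_nr"): 0.015,
}

results = {}
for (first, second), r_12 in r_between.items():
    results[(first, second)] = two_predictor_betas(r_sat[first], r_sat[second], r_12)
    b1, b2 = results[(first, second)]
    print(f"{first}: {b1:.3f}   {second}: {b2:.3f}")
```

The formula makes the pattern easy to read: when the two predictors are nearly uncorrelated (Salary_nr and TeachPerPup_nr, r = .015), each beta stays close to its zero-order r, while a substantial correlation between predictors is what allows a beta to grow or change sign.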
Maybe the expenditures variable is suppressing irrelevant variance in a linear
combination of teacher salary and teacher/pupil ratio. I predicted SAT from salary and
teacher/pupil ratio and saved the predicted scores as “predicted23.” Those predicted
scores are a linear combination of teacher salary and teacher/pupil ratio, with lower
salaries and higher teacher/pupil ratios being associated with higher SAT scores. When
I correlate predicted23 with SAT I get .477, the R for SAT predicted from salary and
teacher/pupil ratio. Watch what happens when I add the expenditures variable to the
predicted23 combination.
Coefficients(a)

Model 1            Standardized Coefficient (Beta)      r
predicted23                    .586                    .477
Expend_nr                      .122                   -.398

a. Dependent Variable: SAT
As you can see, the expenditures variable suppresses irrelevant variance in the
predicted23 combination of the other two predictor variables. When you hold total
amount of expenditures constant, there is an increase in the predictive value of a linear
combination of teacher salary and teacher/pupil ratio.
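One way to check these numbers against the earlier three-predictor model is the identity that R-squared equals the sum, across predictors, of beta times the zero-order correlation with the criterion. The short Python sketch below (the function name is hypothetical) applies it to the values printed in the tables, so agreement is only to within rounding.

```python
def r_square_from_betas(betas, correlations):
    """R-squared as the sum of each beta times its zero-order r with the criterion."""
    return sum(b * r for b, r in zip(betas, correlations))


# predicted23 plus Expend_nr (the two-predictor model in the last table).
combined = r_square_from_betas([0.586, 0.122], [0.477, -0.398])

# All three transformed predictors entered separately.
full = r_square_from_betas([0.181, -0.618, 0.176], [-0.398, -0.467, 0.089])

# predicted23 alone: R = .477, so R-squared is its square.
alone = 0.477 ** 2

print(round(combined, 3), round(full, 3), round(alone, 3))
```

Both routes give an R-squared of about .23, slightly more than predicted23 achieves on its own, even though the expenditures term by itself contributes a negative product (a positive beta times a negative r).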
Karl L. Wuensch
East Carolina University, Dept. of Psychology
March, 2011