Example of Three Predictor Multiple Regression/Correlation Analysis: Checking Assumptions, Transforming Variables, and Detecting Suppression

The data are from Guber, D.L. (1999). Getting what you pay for: The debate over equity in public school expenditures. Journal of Statistics Education, 7, 1-8. The research units are the fifty states in the USA. We shall pretend they represent a random sample from a population of interest. The criterion variable is mean SAT in the state. The predictors are Expenditure ($ spent per student), Salary (mean salary of teachers), and Teacher/Pupil Ratio.

If we consider the predictor variables to be fixed (the regression model), then we do not worry about the shape of the distributions of the predictor variables. If we consider the predictor variables to be random (the correlation model), we do. It turns out that each of the predictors has a distinct positive skewness which can be greatly reduced by a negative reciprocal transformation (-1/X). The transformed variables carry the suffix _nr.

Descriptive Statistics

Variable             N    Minimum    Maximum    Skewness   Std. Error   Kurtosis   Std. Error
Expenditure          50      3.66       9.77       1.107      .337         1.279      .662
Expend_nr            50      -.27       -.10       -.109      .337          .009      .662
salary               50     25.99      50.05        .757      .337          .028      .662
Salary_nr            50      -.04       -.02        .090      .337         -.620      .662
SAT                  50    844.00    1107.00        .236      .337        -1.309      .662
TeachPerPup          50     13.80      24.30       1.334      .337         2.583      .662
TeachPerPup_nr       50      -.07       -.04        .490      .337          .220      .662
Valid N (listwise)   50

Here are the zero-order correlations for the untransformed variables:

Correlations (Pearson r, with two-tailed Sig. in parentheses; N = 50 for every pair)

               Expenditure       salary            TeachPerPup       SAT
Expenditure    1                 .870** (.000)     -.371** (.008)    -.381** (.006)
salary         .870** (.000)     1                 -.001 (.994)      -.440** (.001)
TeachPerPup    -.371** (.008)    -.001 (.994)      1                 .081 (.575)
SAT            -.381** (.006)    -.440** (.001)    .081 (.575)       1

**. Correlation is significant at the 0.01 level (2-tailed).

Here is a regression analysis with the untransformed variables. I asked SPSS for a plot of the standardized residuals versus the standardized predicted scores. I also asked for a histogram of the residuals.

Model Summary(b)

Model    R         R Square    Adjusted R Square    Std. Error of the Estimate
1        .458(a)   .210        .158                 68.65350

a. Predictors: (Constant), TeachPerPup, salary, Expenditure
b. Dependent Variable: SAT

ANOVA(b)

Model 1       Sum of Squares    df    Mean Square    F        Sig.
Regression    57495.745          3    19165.248      4.066    .012(a)
Residual      216811.9          46    4713.303
Total         274307.7          49

a. Predictors: (Constant), TeachPerPup, salary, Expenditure
b. Dependent Variable: SAT

Coefficients(a)

Model 1        B           Std. Error    Beta     t         Sig.
(Constant)     1069.234    110.925                9.639     .000
Expenditure    16.469      22.050        .300     .747      .459
salary         -8.823      4.697         -.701    -1.878    .067
TeachPerPup    6.330       6.542         .192     .968      .338

B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.
a. Dependent Variable: SAT

If you compare the beta weights with the zero-order correlations, it is obvious that we have some suppression taking place. The beta for expenditure is positive but the zero-order correlation between SAT and expenditure was negative. For the other two predictors the absolute value of beta exceeds the absolute value of their zero-order correlation with SAT.

Here is a histogram of the residuals with a normal curve superimposed:

[Figure: histogram of the residuals with a normal curve superimposed]

The residuals appear to be approximately normally distributed.

The plot of standardized residuals versus standardized predicted scores will allow us to check visually for heterogeneity of variance, nonlinear trends, and normality of the residuals across values of the predicted variable. I have drawn in the regression line (error = 0).

[Figure: standardized residuals versus standardized predicted scores, with the line error = 0 drawn in]

I see no obvious problems here.
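The comparison of beta weights with zero-order correlations made above can be reproduced from the correlation matrix alone, because the standardized weights solve the system Rxx b = rxy. The following is only a minimal sketch under that assumption, not the SPSS procedure: the `solve` helper and the variable names are illustrative, and the inputs are the rounded correlations from the table above, so the results agree with the SPSS output only to about two decimal places.

```python
"""Standardized regression weights (betas) from a correlation matrix."""


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Bring the largest remaining entry of this column onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        # Clear this column in every other row.
        for row in range(n):
            if row != col:
                factor = m[row][col] / m[col][col]
                m[row] = [x - factor * y for x, y in zip(m[row], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


names = ["Expenditure", "salary", "TeachPerPup"]

# Correlations among the predictors (rounded values from the table above).
r_xx = [
    [1.000, 0.870, -0.371],
    [0.870, 1.000, -0.001],
    [-0.371, -0.001, 1.000],
]
# Zero-order correlation of each predictor with SAT.
r_xy = [-0.381, -0.440, 0.081]

betas = solve(r_xx, r_xy)
# With standardized variables, R squared is the sum of beta times r.
r_squared = sum(b * r for b, r in zip(betas, r_xy))

for name, beta, r in zip(names, betas, r_xy):
    sign_flip = beta * r < 0          # beta and r have opposite signs
    inflated = abs(beta) > abs(r)     # beta is larger in magnitude than r
    print(f"{name:12s} beta={beta:6.3f}  r={r:6.3f}  "
          f"sign_flip={sign_flip}  |beta|>|r|={inflated}")
print(f"R^2 = {r_squared:.3f}")
```

A sign flip, or a beta larger in magnitude than its zero-order r, is the pattern the text reads as a sign of suppression. With raw data in hand, a linear algebra routine such as `numpy.linalg.solve` would replace the hand-written solver.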
Under the homoscedasticity assumption there should be no correlation between the predicted scores and error variance. The vertical spread of the dots in the plot above should not vary as we move left to right. I squared the residuals and correlated them with the predicted values. If the residuals were increasing in variance as the predicted values increase, this correlation would be positive. It is close to zero, confirming my eyeball conclusion that there is no problem with that fairly common sort of heteroscedasticity.

Correlations

Predicted_SAT with Residual2 (the squared residuals): Pearson Correlation = .093

Now let us look at the results using the transformed data.

Correlations (Pearson r, with two-tailed Sig. in parentheses; N = 50 for every pair)

                  Expend_nr         Salary_nr         TeachPerPup_nr    SAT
Expend_nr         1                 .816** (.000)     -.425** (.002)    -.398** (.004)
Salary_nr         .816** (.000)     1                 .015 (.920)       -.467** (.001)
TeachPerPup_nr    -.425** (.002)    .015 (.920)       1                 .089 (.537)
SAT               -.398** (.004)    -.467** (.001)    .089 (.537)       1

**. Correlation is significant at the 0.01 level (2-tailed).

The correlation matrix looks much like it did with the untransformed data.

Model Summary(b)

Model    R         R Square    Adjusted R Square    Std. Error of the Estimate
1        .482(a)   .232        .182                 67.65259

a. Predictors: (Constant), TeachPerPup_nr, Salary_nr, Expend_nr
b. Dependent Variable: SAT

The R² has increased a bit, from .210 to .232.

ANOVA(b)

Model 1       Sum of Squares    df    Mean Square    F        Sig.
Regression    63771.502          3    21257.167      4.644    .006(a)
Residual      210536.2          46    4576.873
Total         274307.7          49

a. Predictors: (Constant), TeachPerPup_nr, Salary_nr, Expend_nr
b. Dependent Variable: SAT

Coefficients(a)

Model 1           B            Std. Error    Beta     t         Sig.
(Constant)        850.240      130.425                6.519     .000
Expend_nr         367.276      692.442       .181     .530      .598
Salary_nr         -9823.521    4920.257      -.618    -1.997    .052
TeachPerPup_nr    1805.969     2031.564      .176     .889      .379

B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.
a. Dependent Variable: SAT

No major changes caused by the transformation, which is comforting. Trust me that the residual plots still look OK too.

I wonder what high school teachers would think about the negative relationship between average state salary for teachers and average state SAT score? If we want better education should we lower teacher salaries? There is an important state characteristic that we should have but have not included in our model. Check out the JSE article to learn what that characteristic is.

Now, can we figure out what sort of suppression is going on here?

Coefficients(a)

Model 1           Beta     r
Expend_nr         .181     -.398
Salary_nr         -.618    -.467
TeachPerPup_nr    .176     .089

a. Dependent Variable: SAT

It looks like the expenditures variable is suppressing irrelevant variance in one or both or a linear combination of the other two predictors. Put another way, if we hold constant the effects of teacher salary and number of teachers per pupil, then the relationship between expenditures and SAT goes from negative to positive. Maybe the money is best spent on things other than hiring more teachers or better paid teachers?

Let us look at two-predictor models.

Coefficients(a)

Model 1      Beta     r
Expend_nr    -.049    -.398
Salary_nr    -.428    -.467

a. Dependent Variable: SAT

No suppression between expenditures and teacher salary.

Coefficients(a)

Model 1           Beta     r
Expend_nr         -.439    -.398
TeachPerPup_nr    -.097    .089

a. Dependent Variable: SAT

A little bit of classical suppression here, but not dramatic.

Coefficients(a)

Model 1           Beta     r
Salary_nr         -.469    -.467
TeachPerPup_nr    .096     .089

a. Dependent Variable: SAT

A little bit of cooperative suppression here, but not dramatic.

Maybe the expenditures variable is suppressing irrelevant variance in a linear combination of teacher salary and teacher/pupil ratio.
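The three two-predictor models above can be checked by hand, since with two predictors the standardized weights have a closed form: beta1 = (ry1 - ry2 r12) / (1 - r12²), and likewise for beta2 with the subscripts swapped. The sketch below is only an illustration under that formula. The function names `two_predictor_betas` and `flags` are made up for this example, and the inputs are the rounded correlations for the transformed variables, so the betas match the SPSS tables only to within rounding.

```python
"""Two-predictor standardized weights computed from zero-order correlations."""


def two_predictor_betas(r_y1, r_y2, r_12):
    """Return (beta1, beta2) for y regressed on two standardized predictors."""
    denom = 1.0 - r_12 ** 2
    return (r_y1 - r_y2 * r_12) / denom, (r_y2 - r_y1 * r_12) / denom


def flags(beta, r):
    """List the ways beta departs from r that the text treats as suppression signs."""
    notes = []
    if beta * r < 0:
        notes.append("sign differs from r")
    if abs(beta) > abs(r):
        notes.append("|beta| > |r|")
    return notes


# Zero-order correlations with SAT (rounded values from the table above).
r_y = {"Expend_nr": -0.398, "Salary_nr": -0.467, "TeachPerPup_nr": 0.089}

# Correlations between each pair of predictors.
r_xx = {
    ("Expend_nr", "Salary_nr"): 0.816,
    ("Expend_nr", "TeachPerPup_nr"): -0.425,
    ("Salary_nr", "TeachPerPup_nr"): 0.015,
}

results = {}
for (a, b), r_ab in r_xx.items():
    beta_a, beta_b = two_predictor_betas(r_y[a], r_y[b], r_ab)
    results[(a, b)] = (beta_a, beta_b)
    print(f"Model: {a} + {b}")
    for name, beta in ((a, beta_a), (b, beta_b)):
        notes = ", ".join(flags(beta, r_y[name])) or "no sign of suppression"
        print(f"  {name:15s} beta={beta:6.3f}  r={r_y[name]:6.3f}  {notes}")
```

The flags only report the two comparisons made in the text (a sign change, or a beta larger in magnitude than r). They do not try to label the suppression as classical, net, or cooperative, which takes the judgment applied in the surrounding discussion.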
I predicted SAT from salary and teacher/pupil ratio and saved the predicted scores as “predicted23.” Those predicted scores are a linear combination of teacher salary and teacher/pupil ratio, with lower salaries and higher teacher/pupil ratios being associated with higher SAT scores. When I correlate predicted23 with SAT I get .477, the R for SAT predicted from salary and teacher/pupil ratio. Watch what happens when I add the expenditures variable to the predicted23 combination.

Coefficients(a)

Model 1        Beta    r
predicted23    .586    .477
Expend_nr      .122    -.398

a. Dependent Variable: SAT

As you can see, the expenditures variable suppresses irrelevant variance in the predicted23 combination of the other two predictor variables. When you hold total amount of expenditures constant, there is an increase in the predictive value of a linear combination of teacher salary and teacher/pupil ratio.

Karl L. Wuensch
East Carolina University, Dept. of Psychology
March, 2011