THINGS TO KNOW ABOUT REGRESSION

In a bivariate regression, the regression coefficient (b) measures the change in the predicted Y for
a one-unit increase in X. In multiple regression, b_i (the subscript i refers to the i-th independent
variable) measures the change in the predicted Y for a one-unit increase in X_i, holding all other
independent variables constant. In writing the results of a regression analysis, one could say
(a) the net effect of X_i on Y is b_i;
(b) when the other independent variables are held constant, Y changes by b_i for each one-unit change in X_i; or
(c) controlling for the other independent variables, the effect of X_i on Y is b_i.
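As a minimal illustration of this interpretation (the data, variable names, and coefficient values below are hypothetical, simulated only for the sketch), a two-predictor regression can be fit in Python with statsmodels and its slopes read exactly as described above:

import numpy as np
import statsmodels.api as sm

# hypothetical data: two predictors and an outcome
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 1.5 * x1 - 0.8 * x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))   # intercept column plus X_1, X_2
fit = sm.OLS(y, X).fit()
b0, b1, b2 = fit.params
print(f"b_0 (intercept) = {b0:.2f}")
print(f"b_1 = {b1:.2f}: change in predicted Y per one-unit increase in X_1, holding X_2 constant")
print(f"b_2 = {b2:.2f}: change in predicted Y per one-unit increase in X_2, holding X_1 constant")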
The intercept has a substantive interpretation only when the value of zero is included in the range
of the X_i values. That is, don't tell a football team that they can win b_0 games if they don't pass,
don't rush, and their opponents don't rush!!!
The total sum of squares may be decomposed into two components: a portion explained by the
independent variables and a portion unexplained. That is:
SS_total = SS_reg + SS_res
The ratio SS_reg / SS_total is the proportion of the total variation in Y that is explained by the set of
independent variables. This value is called the coefficient of determination, commonly
symbolized as R² and referred to as the R-square.
The values of the regression coefficients, b_i, are estimated in such a way as to minimize the squared errors
of prediction, which is why the set of procedures we are studying is called ordinary least squares
(OLS) regression. Minimizing the error variance is the same as maximizing the explained variance,
which is the same as maximizing R², the proportion of variance explained in the dependent
variable.
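A short numpy sketch on simulated (hypothetical) data can verify both points: the sums of squares add up, R² equals SS_reg / SS_total, and nudging the least-squares coefficients away from their OLS values only increases the residual sum of squares.

import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])  # intercept, X_1, X_2
y = X @ np.array([2.0, 1.5, -0.8]) + rng.normal(size=n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS estimates
yhat = X @ b

ss_total = np.sum((y - y.mean()) ** 2)
ss_reg = np.sum((yhat - y.mean()) ** 2)     # explained portion
ss_res = np.sum((y - yhat) ** 2)            # unexplained portion
print(np.isclose(ss_total, ss_reg + ss_res))        # True: SS_total = SS_reg + SS_res
print("R-square =", ss_reg / ss_total)

# nudging any coefficient away from the OLS solution increases SS_res
b_perturbed = b + np.array([0.0, 0.1, 0.0])
print(np.sum((y - X @ b_perturbed) ** 2) > ss_res)  # True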
The test for the significance of the regression is an F-test. This test can be thought of as testing
either of two null hypotheses. The first is that each of the regression coefficients is zero in the
population against the alternative that at least one coefficient is non-zero. If the null hypothesis is
rejected, we then look at the t-tests on the individual coefficients to determine which are not zero.
The second way to conceptualize this F-test is that it tests whether a significant proportion of
variance in the dependent variable is explained by the linear combination of the independent
variables. If the overall F is significant, we then look at the tests on the individual coefficients to
determine which contribute to the explanation of variance.
This F is computed by first computing the mean squares:
MS_reg = SS_reg / df_reg, where df_reg = k (the number of independent variables)
MS_res = SS_res / df_res, where df_res = N - k - 1
and the F-test is
F = MS_reg / MS_res
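As a quick arithmetic sketch (the SS values, k, and N below are hypothetical), the F and its p-value can be computed directly, with the p-value taken from the upper tail of the F distribution via scipy:

from scipy import stats

# hypothetical values: k = 2 independent variables, N = 200 cases
ss_reg, ss_res = 420.0, 180.0
k, N = 2, 200

df_reg = k
df_res = N - k - 1
ms_reg = ss_reg / df_reg
ms_res = ss_res / df_res
F = ms_reg / ms_res
sig_F = stats.f.sf(F, df_reg, df_res)   # p-value: upper tail of the F distribution

print(f"F({df_reg}, {df_res}) = {F:.2f}, sig F = {sig_F:.4g}")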
If you are interested in measuring the true effect of an independent variable X_i on the dependent
variable Y, it is important to avoid a misspecified regression equation. That is, don't estimate Y
= b_0 + b_1X_1 when the real model is Y = b_0 + b_1X_1 + b_2X_2. If you do, the b_1 in the first equation
will be a biased estimate of the effect of X_1 on Y. But how do we know that we have included all
the important independent variables? Ah, there’s the rub!! We may not simply include
EVERYTHING as independent variables because we soon become involved in a nasty little
problem called multicollinearity (not to mention running out of degrees of freedom). Yet we have
to include all the important independent variables. This requires a thorough knowledge of one’s
subject matter, and a fine appreciation of which other independent variables are important. Here
statistics becomes more of an art. It is what some people have called the hard part of doing
research, as opposed to the easy part, which merely involves running an SPSS statistical analysis,
which anyone can master.
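The bias from a misspecified equation can be seen in a small simulation (all numbers hypothetical): when X_2 truly affects Y and is correlated with X_1, leaving X_2 out of the model distorts the estimate of b_1.

import numpy as np

rng = np.random.default_rng(0)
n = 5000
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(size=n)                    # X_2 is correlated with X_1
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)    # true model includes both

full = np.column_stack([np.ones(n), x1, x2])
short = np.column_stack([np.ones(n), x1])             # misspecified: X_2 omitted
b_full, *_ = np.linalg.lstsq(full, y, rcond=None)
b_short, *_ = np.linalg.lstsq(short, y, rcond=None)

print("b_1 with X_2 included:", round(b_full[1], 2))  # close to the true 2.0
print("b_1 with X_2 omitted:", round(b_short[1], 2))  # biased upward, roughly 2 + 3*0.7 ≈ 4.1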
The first thing you look at on your printout is the significance of the F statistic in the ANOVA
table (or the R-square, if you want to first decide whether the proportion of variance explained is
of any substantive importance before you determine whether it is significantly different from 0).
The R-square gives you the proportion of variance in the dependent variable that is explained by
the set of independent variables. The significance of the F simply tells you whether that amount
of variance explained is different from 0. It is up to you to decide whether it is of any substantive
importance. We follow the same rules in reading the significance level of the F as we did in all
other statistical tests (a sketch of reading both the F and the individual t values from a fitted model follows the rules for the t statistics below).
If sig F > .05 (.01), the R-square is considered to be essentially 0 and we are not explaining a significant proportion of variance in the dependent variable.
If sig F ≤ .05 (.01), the R-square is considered to be greater than 0 and we are explaining a significant proportion of variance in the dependent variable.
If the R-square is significant, we then want to know which of the independent variables are
contributing to that explanation of variance and what effect the variables have on the dependent
variable. We do this by looking at the information contained in the Coefficients portion of the
printout. First, to see which variables are contributing to the explanation of variance we look at
the significance level of the t statistics.
If sig t > .05 (.01), that variable does not contribute to the explanation of variance and has no effect on the dependent variable in the presence of the other independent variables.
If sig t ≤ .05 (.01), that variable does contribute to the explanation of variance and has an effect on the dependent variable in the presence of the other independent variables.
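Both checks can be read straight off a fitted model. The sketch below uses simulated (hypothetical) data and statsmodels rather than SPSS, but the quantities are the same: the R-square, the significance of the overall F, and the significance of each coefficient's t.

import numpy as np
import statsmodels.api as sm

# hypothetical data: X_1 matters, X_2 does not
rng = np.random.default_rng(1)
x1, x2 = rng.normal(size=300), rng.normal(size=300)
y = 1.0 + 0.6 * x1 + rng.normal(size=300)

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print("R-square =", round(fit.rsquared, 3))
print("sig F    =", round(fit.f_pvalue, 4))      # overall test of the regression
print("sig t    =", np.round(fit.pvalues, 4))    # intercept, X_1, X_2 in that order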
If the variable does have an effect, we look at the unstandardized regression coefficient, b, to
determine the magnitude of the effect. This coefficient represents the amount of change in the
dependent variable for a one-unit change in the independent variable, holding all other variables
constant. Since the variables are usually measured on different scales or metrics, to determine the
relative importance of the variables in influencing the dependent variable (or the relative
magnitude of effect) we examine the standardized regression coefficients, betas. The betas are
our effect sizes for regression analyses since these coefficients represent the effects of
independent variables on the dependent variable that would occur if we had standardized all
variables to z-values so that all variables were in the same scale of measurement. Their numerical
value represents the number of standard deviations the dependent variable would change if the
independent variable changed by one standard deviation, again holding other variables in the
equation constant. We're usually not so interested in the actual numerical value of either the
unstandardized or standardized coefficients; we mainly want to know which are significant, whether the variable has
a positive or negative effect on the dependent variable, and the relative magnitude of influence of
the independent variables.
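The betas can be obtained by refitting with all variables standardized, or equivalently by rescaling the unstandardized coefficients as beta_i = b_i * SD(X_i) / SD(Y). A sketch on hypothetical data, in which two predictors have very different scales but similar standardized effects:

import numpy as np

rng = np.random.default_rng(0)
n = 400
x1 = rng.normal(scale=5.0, size=n)     # measured on a wide scale
x2 = rng.normal(scale=0.5, size=n)     # measured on a narrow scale
y = 3.0 + 0.4 * x1 + 4.0 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# beta_i = b_i * SD(X_i) / SD(Y): the effect expressed in standard-deviation units
betas = b[1:] * np.array([x1.std(), x2.std()]) / y.std()
print("unstandardized b:", np.round(b[1:], 2))    # not comparable across scales
print("standardized betas:", np.round(betas, 2))  # comparable relative magnitudes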
However, one instance in which we are more interested in the actual size of the unstandardized
coefficient is when the variable is dichotomous, representing two groups. In this
instance, the unstandardized regression coefficient represents the difference between the two
group means on the dependent variable, controlling for differences between the groups on the
other independent variables.
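A small sketch of this point with a hypothetical 0/1 group indicator: the raw difference in group means mixes in the groups' difference on the covariate, while the regression coefficient on the indicator gives the adjusted difference.

import numpy as np

rng = np.random.default_rng(0)
n = 1000
group = rng.integers(0, 2, size=n)       # dichotomous predictor coded 0/1
x = rng.normal(size=n) + 0.5 * group     # the groups also differ on this covariate
y = 1.0 + 2.0 * group + 1.5 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), group, x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

print("raw difference in group means:", round(y[group == 1].mean() - y[group == 0].mean(), 2))
print("coefficient on the 0/1 variable:", round(b[1], 2))   # adjusted difference, near the true 2.0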
SUBSTANTIVE IMPORTANCE VS. STATISTICAL SIGNIFICANCE
When statistical significance is found, one must then address the issue of the substantive
importance of the findings. It is commonly known that large sample sizes contribute to the
likelihood of finding statistically significant effects in any type of statistical analysis, and we
often use survey data that represent responses by hundreds if not thousands of subjects. It is the
researcher’s responsibility to determine what magnitude of effect is substantively meaningful,
given the nature of the data gathered and the question being addressed. Some authors seek to give
guidelines for criteria of importance, but an effect of a certain magnitude that is important in one
setting is not necessarily important in other settings. For example, Cohen (1977) suggests that an
R² of .01 could be viewed as a small, meaningful effect, but few would agree that explaining only
1% of the variance in the dependent variable using a collection of independent variables is of any
importance. Comparing the R² obtained in a study to the R² values reported in similar studies and careful
consideration of the magnitude of the betas (standardized coefficients) can help place substantive
importance on findings. Betas of .05 or less can hardly be argued to be meaningful, given that such a beta
represents a change of only 5/100 of a standard deviation in the dependent variable for a one standard deviation
change in the independent variable, holding other effects constant.
References
Cohen, J. (1977). Statistical power analysis for the behavioral sciences (Rev. ed.). New York:
Academic Press.
Ethington, C. A., Thomas, S. L., & Pike, G. R. (2002). Back to the basics: Regression as it
should be. In J. C. Smart (Ed.), Higher education: Handbook of theory and research,
Vol. 17. New York: Algora Publishing.