Miscellaneous Regression Notes
1. Variances and Squared Correlations
Squared correlations (of various kinds) always index variance that is shared (or overlapping) between
the variables involved in the correlation. Variances of variables, and their overlap, is often illustrated by
a Venn diagram. Consider a multiple regression model where Y is the dependent variable (DV), or
criterion, and there are two independent variables (IVs), or predictors, denoted by X and Z.
This regression model can be written as follows:
Y = b1X + b2Z + b0
[Figure: Venn diagram. A square box represents the total variance of the DV Y; two overlapping circles inside it represent the variance Y shares with X and with Z. The four regions are labelled A (Y variance outside both circles), B (shared with X only), C (shared with both X and Z), and D (shared with Z only).]
The total area contained by the square box illustrates the total variance of the DV Y. The areas of the 4
separate area-elements in the figure (which are labelled A, B, C, and D respectively) therefore reflect
parts of the variance of Y. The total variance of Y is thus the sum of the 4 areas, i.e. A+B+C+D. (This
can be standardised so that Y has a variance of 1, but we won’t do so here.) The amount of variance in
Y that is shared with the IV X is thus B+C, while the amount that is shared with Z is C+D. The
amount of variance in Y that is accounted for by the complete regression model is thus B+C+D (we
don’t count C twice). A is thus the variance of Y that is not accounted for by the regression model (the
bit of the box that is outside either circle). The part of Y’s variance that is uniquely¹ accounted for by X
is B, and the part uniquely accounted for by Z is D. Area C represents Y variance that is accounted
for by the regression model but is common to both X and Z. We are now in a position to
consider several kinds of correlations in terms of the above figure and the areas A, B, C and D.

¹ “Uniquely” here refers to the context of the specific regression model being considered. So, variance uniquely accounted
for by one IV in the model does not overlap with variance accounted for by any of the other IVs in the model. Obviously, other
IVs (not included in the model) may overlap with some portion of the variance uniquely accounted for by an IV in the model.
1.1. Simple (Zero-order) Bivariate Correlations. This type of correlation between Y
and X can be denoted as rYX, where the subscripts denote the variables involved in the correlation.
The square of this correlation, r2YX, is equivalent to the variance in Y that overlaps with (is accounted
for by) X, expressed as a proportion of the total variance of Y. From the figure, therefore, r2YX =
(B+C)/(A+B+C+D). What is the area for the equivalent correlation between Y and Z?
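To make this concrete, here is a minimal Python/NumPy sketch (the simulated variables and coefficients are illustrative assumptions, not taken from anything above): it generates an overlapping X and Z, builds Y from both, and computes rYX and its square.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 500

    # Illustrative data: X and Z overlap, and Y depends on both (plus noise).
    z = rng.normal(size=n)
    x = 0.6 * z + rng.normal(size=n)
    y = x + z + rng.normal(size=n)

    r_yx = np.corrcoef(y, x)[0, 1]          # zero-order correlation rYX
    print("rYX   =", round(r_yx, 3))
    print("rYX^2 =", round(r_yx ** 2, 3))   # the proportion (B+C)/(A+B+C+D)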
1.2. Multiple Correlations from the Model. This correlation is denoted with a capital
R. We can denote a set of IVs from a regression model by writing the variables inside a pair of angle brackets,
e.g. <XZ>. The multiple R from our model is thus the correlation between Y and <XZ> and so it
might therefore be more informatively written as RY<XZ>. (Howell has an alternative notation which I
think is likely to lead to confusion, particularly when combined with his notation for partial and
semipartial correlations -- see below. Howell would write this multiple correlation as RY.XZ.) The
squared multiple correlation, SMC or R2Y<XZ>, is the proportion of variance in Y that is accounted
for by the set of variables <XZ> in the complete regression model. The model F statistic tests
R2Y<XZ> to determine whether the model predicts the DV better than would be expected by chance.
Thus, from the figure, we can see that:
R2Y<XZ> = (B+C+D)/(A+B+C+D)
The proportion of variance in Y that is not explained by the model is called the unexplained (or
residual) variance and this is 1 - R2Y<XZ>. From the figure, we can see that the proportion of residual
Y variance is A/(A+B+C+D). Note the important fact that R2Y<XZ> is NOT equivalent to the sum
of the squares of the pairwise correlations between the DV and each of the IVs; whenever the IVs
overlap, as in the figure (C > 0), we have:
R2Y<XZ> < (r2YX + r2YZ)
In fact, from the above figure we can see that:
R2Y<XZ> = r2YX + r2YZ - C/(A+B+C+D)
As already noted, C/(A+B+C+D) is the proportion of Y variance accounted for by the model which is
common to both X and Z. This “C” portion of variance is included in the overall model and thus if it is
large then the model F statistic may well be significant. However, area C is not included in the t-tests
applied to the regression coefficients for each of the individual IVs in the model. These t-tests assess
whether each particular IV uniquely explains an above-chance portion of DV variance. For our model
these t-tests address areas B and D respectively (as we will see below). However, the role of the “C”
portion can lead to the unusual situation in which the overall model is significantly better than chance,
and yet none of the individual IVs uniquely explains any significant amount of DV variance. In this
case, it would not be correct to conclude that these IVs explain only a nonsignificant amount of
variance in the DV: they may well have a significant zero-order correlation with the DV (and thus
explain a significant amount of DV variance).
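These relations are easy to verify numerically. The sketch below (the same illustrative simulation as before; every name is an assumption for demonstration, not part of the notes) fits the two-IV model by least squares, computes R2Y<XZ> and the model F statistic, and recovers the area proportions B, C and D from R2 and the zero-order r2 values.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 500
    z = rng.normal(size=n)
    x = 0.6 * z + rng.normal(size=n)        # correlated IVs, so C > 0
    y = x + z + rng.normal(size=n)

    # Fit Y = b1*X + b2*Z + b0 by ordinary least squares.
    design = np.column_stack([x, z, np.ones(n)])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef

    r2_model = 1 - np.var(resid) / np.var(y)   # R2Y<XZ> = (B+C+D)/total
    r2_yx = np.corrcoef(y, x)[0, 1] ** 2       # (B+C)/total
    r2_yz = np.corrcoef(y, z)[0, 1] ** 2       # (C+D)/total

    k = 2                                      # number of IVs
    f_stat = (r2_model / k) / ((1 - r2_model) / (n - k - 1))

    print("R2          =", round(r2_model, 3))
    print("r2YX + r2YZ =", round(r2_yx + r2_yz, 3))   # exceeds R2 when C > 0
    print("C/total     =", round(r2_yx + r2_yz - r2_model, 3))
    print("B/total     =", round(r2_model - r2_yz, 3))  # unique to X
    print("D/total     =", round(r2_model - r2_yx, 3))  # unique to Z
    print("model F     =", round(f_stat, 1))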
1.3. Partial and Semipartial Correlations. (Note that SPSS refers to semipartial
correlations as part correlations.) From the figure, we can see that B/(A+B+C+D) is the variance in Y
that is uniquely explained by X expressed as a proportion of the total variance of Y. In fact, this
proportion of variance is equivalent to the squared semipartial correlation between Y and X after
controlling for Z. What ratio of areas in the figure relates to the semipartial correlation between Y and
Z after controlling for X? However, we might just as reasonably have decided to express the portion of
Y variance, uniquely accounted for by X, as a proportion of the amount of Y variance that is not
accounted for by the rest of the IVs in the model (i.e., Y variance that excludes the portion
accounted for by Z). From the figure, the portion of variance uniquely accounted for by X, expressed
as a proportion of the Y variance not accounted for by the rest of the IVs in the model, is B/(Total Y
variance - Y variance explained by Z). This is B/([A+B+C+D] - [C+D]), which can be further
rewritten as B/[A+B]. This proportion of Y variance is equivalent to the squared partial correlation
between Y and X after controlling for Z. What ratio of areas in the figure relates to the squared partial
correlation between Y and Z after controlling for X?
It may help to develop a notation for partial and semipartial correlations. We need to be able to write
an expression for the variable which results when we partial out, from a variable W, the variance
accounted for by another variable (V). The resulting variable will be denoted (W.V). The variable to
the right of the “dot” is partialled out from the variable to the left of the dot. Thus, the correlation
between another variable U and (W.V) will be denoted by rU(W.V). Using this notation, rY(X.Z) is the
semipartial correlation between Y and X after controlling for Z. (Howell uses the identical notation.) In
our notation, we might remove from Y the variance accounted for by Z, resulting in the variable (Y.Z).
Thus, we could denote a correlation between (Y.Z) and (X.Z) by the term r(Y.Z)(X.Z). This correlation
is the partial correlation between Y and X after controlling for Z. (Howell uses a shorter, but potentially
ambiguous, notation for partial correlations: he would write r(Y.Z)(X.Z) as rYX.Z. I prefer my
version.)
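This notation maps directly onto a computation. In the minimal NumPy sketch below (again with illustrative simulated data), (X.Z) and (Y.Z) are built as regression residuals; the squares of the two resulting correlations match the area ratios given earlier.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 500
    z = rng.normal(size=n)
    x = 0.6 * z + rng.normal(size=n)
    y = x + z + rng.normal(size=n)

    def residualize(target, predictor):
        # Residuals of target after a simple regression on predictor.
        design = np.column_stack([predictor, np.ones(len(target))])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        return target - design @ coef

    x_dot_z = residualize(x, z)               # the variable (X.Z)
    y_dot_z = residualize(y, z)               # the variable (Y.Z)

    sr = np.corrcoef(y, x_dot_z)[0, 1]        # semipartial rY(X.Z)
    pr = np.corrcoef(y_dot_z, x_dot_z)[0, 1]  # partial r(Y.Z)(X.Z)

    print("sr^2 = B/(A+B+C+D) =", round(sr ** 2, 3))
    print("pr^2 = B/(A+B)     =", round(pr ** 2, 3))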
There are two final points in this section. First, variables, such as (Y.Z) described above, can be
calculated by finding the residual Y scores left after subtracting the predicted Y scores from the actual
Y scores. The predicted Y scores are conventionally denoted by Y with a hat over it (Ŷ; it is called “y
hat”). Ŷ is based on the simple linear regression using Z as an IV (i.e., Ŷ = b1Z + b0). Second, imagine
we had a multiple regression with 3 IVs: Y = b1X + b2Z + b3T + b0. The partial and semipartial (part)
correlations printed out by a stats package such as SPSS would correspond to the correlation between
the DV and each IV after partialling out the other 2 IVs. Thus, for the model above, the semipartial
correlation between Y and X would partial out both Z and T. We would write this, using our notation,
as rY(X.<ZT>), denoting that the set of variables <ZT> has been collectively partialled out. The
equivalent partial correlation would thus be written (in my notation) as r(Y.<ZT>)(X.<ZT>).
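The same residualizing approach extends to a set of IVs: regress on every variable in the set and keep the residuals. A final sketch (illustrative data again; the helper name is hypothetical) computes rY(X.<ZT>) and r(Y.<ZT>)(X.<ZT>), i.e. the part and partial correlations a package such as SPSS would report for X in the three-IV model.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 500
    t = rng.normal(size=n)
    z = 0.5 * t + rng.normal(size=n)
    x = 0.5 * z + rng.normal(size=n)
    y = x + z + t + rng.normal(size=n)

    def residualize(target, predictors):
        # Residuals of target after regressing it on a set of predictors.
        design = np.column_stack(predictors + [np.ones(len(target))])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        return target - design @ coef

    x_res = residualize(x, [z, t])            # (X.<ZT>)
    y_res = residualize(y, [z, t])            # (Y.<ZT>)

    semipartial = np.corrcoef(y, x_res)[0, 1]        # rY(X.<ZT>)
    partial = np.corrcoef(y_res, x_res)[0, 1]        # r(Y.<ZT>)(X.<ZT>)
    print("semipartial (SPSS 'part') =", round(semipartial, 3))
    print("partial                   =", round(partial, 3))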