Paraphrasing Will Hopkins's original post:

You are interested in the effect of physical activity on health. You do a cross-sectional study in
which you measure health, physical activity, and other variables. In particular, you measure
socioeconomic status (SES) and find that SES and physical activity are both positively correlated
with health. Further, you find a high correlation between SES and activity. High correlations
among the X variables are referred to as collinearity, or as multicollinearity when more than two
X variables are involved. People of high SES eat good food, live in toxin-free, classy parts of town,
read Time magazine, and think they're alpha in every way. All these things could account for their good
health. How do you analyze your data to address this potential for the effect of activity on health
to be "confounded" by SES? How do you express the magnitude of the resulting effect of activity
on health?
Pedhazur (1997) suggests calculating the partial or semi-partial correlations among the variables
by asking specific questions about their relationships.
When the X variables are correlated, the part of Y that they explain overlaps to some degree, and
not as much of Y is explained as when the X variables are uncorrelated. Venn diagrams can be
used to illustrate this concept:
In this example, Y is the dependent variable, and X1 and X2 are the two
independent variables. r12 is defined as the simple, zero-order correlation
between variables X1 and X2 (i.e., collinearity). r² represents the coefficient of
determination and is defined as the proportion of variance explained.
When r12 = 0, we have:

$$r_{y1}^2 + r_{y2}^2 = R_{y.12}^2$$

[Venn diagrams: Y with two non-overlapping circles for X1 and X2 (r12 = 0), and Y with two overlapping circles for X1 and X2 (r12 ≠ 0).]

When r12 ≠ 0, the two independent variables are explaining the same part of Y to some degree.
Partial and Semi-partial Correlations
Because in multiple regression the correlations among the X variables influence the regression
coefficients (b-values), it is often of interest to determine what the relationship between Y and
some X variable would be if the other X variables were not in the equation.
This is referred to as controlling for the effects of the other X variables.
In experimental research, the independent variables are controlled by design. This results in
uncorrelated independent variables. Other forms of control through research design are
“matching” and “subject selection.”
For example, using Venn diagrams, below are three possibilities for visualizing the relationships
among socioeconomic status (SES), physical activity (PA), and health (H) presented by Dr. Hopkins:
[Venn diagrams (a), (b), and (c), each showing circles for SES, PA, and H: (a) minimal collinearity between SES and PA; (b) moderate collinearity between SES and PA; (c) high collinearity between SES and PA.]
If we could control for the effects of SES, we would probably find a moderate to high correlation
between PA and H. So, Venn diagram (b) or (c) is more likely in this scenario.
"More likely" might confuse folks. In any case, (c) indicates a low correlation
between H and PA after controlling for SES. It's certainly lower than when SES
isn't in the model.
One way to control for SES would be to use only participants from one SES classification. We
could also control for the effects of SES by a statistical method known as partialing.
A partial correlation is a measure of the correlation between two variables with the effects of a
third variable removed or “partialed out” (i.e., controlled for).
Partial correlations allow us to see what the correlations between two variables would be if the
third variable were not there.
It’s important to realize that in actuality, the third variable is still there – its effects are just
removed statistically, not physically.
Another way of talking about a partial correlation is to say that it is a correlation between X1 and
Y when everyone has the same value on X2. X2 would be the variable that is being partialed
out.
The idea statistically is to correlate the parts of Y (health, H) and X1 (physical activity, PA) that
are not related to X2 (socioeconomic status, SES). This would involve removing the parts of H
and PA that overlap with SES, and correlating on the parts of these variables that do not overlap
with SES.
[Venn diagram: circles for H, PA, and SES, with the SES overlap removed. Caption: Remove SES from the correlation between PA and H.]
Recall that when Y is regressed on X, the residuals (Y – Y') are uncorrelated with X. Therefore,
the parts of PA and H that are not related to SES are the residuals we would get by predicting H
and PA from SES.
Statistical control or partialing is therefore accomplished by:
1. Predicting H and PA from SES, where SES is the control variable.
2. Correlating the residuals from these two regressions: (H – H’) and (PA – PA’). These are
the parts of H and PA that are not related to SES.
Because it involves the use of residuals, this process is sometimes referred to as residualizing Y
and X1 with respect to X2.
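To make the two-step procedure concrete, here is a minimal sketch in Python (simulated data and made-up variable names and effect sizes, not values from the text): H and PA are each regressed on SES, and the residuals from the two regressions are then correlated.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical simulated data: SES influences both PA and H.
ses = rng.normal(size=n)
pa = 0.5 * ses + rng.normal(size=n)
h = 0.4 * ses + 0.3 * pa + rng.normal(size=n)

def residuals(y, x):
    """Residuals (y - y') from a simple linear regression of y on x."""
    b = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    a = y.mean() - b * x.mean()
    return y - (a + b * x)

# Step 1: predict H and PA from the control variable SES; keep the residuals.
h_res = residuals(h, ses)
pa_res = residuals(pa, ses)

# Step 2: correlate the residuals -- the partial correlation of PA and H with SES partialed out.
print(np.corrcoef(pa_res, h_res)[0, 1])
```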
The notation ry1.2 means the partial correlation of X1 (PA) and Y (H) with X2 (SES) partialed
out. This is referred to as a first-order partial because only one variable is partialed out. Any
number of variables could conceivably be partialed out, but first-order partials are probably most
common.
The formula for the first-order partial ry1.2 is:

$$r_{y1.2} = \frac{r_{y1} - r_{y2}\, r_{12}}{\sqrt{(1 - r_{y2}^2)(1 - r_{12}^2)}}$$
Partial correlations may be larger or smaller than the corresponding zero-order correlations.
They may even be a different sign.
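As a quick illustration of that point, here is a minimal sketch (the zero-order correlations are hypothetical, chosen only for illustration) in which the zero-order correlation is positive but the partial is negative:

```python
from math import sqrt

def partial_r(r_y1, r_y2, r_12):
    """First-order partial correlation r_y1.2 from zero-order correlations."""
    return (r_y1 - r_y2 * r_12) / sqrt((1 - r_y2**2) * (1 - r_12**2))

# Hypothetical zero-order correlations: r_y1 is positive,
# but the partial is negative once X2 is partialed out.
print(partial_r(r_y1=0.20, r_y2=0.60, r_12=0.50))  # about -0.14
```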
Semi-partial Correlations
In partial correlations, the effects of the control variable are removed from both the Y and the
other X variable. Semi-partial correlations, however, are used when we want to remove the
effects of the control variable from only one of the variables, usually the other X variable.
It took me a while to understand it in these terms. I think I prefer to think about
the partial as the correlation for subjects with the other predictors held constant,
whereas the semi-partial is the correlation due only to the predictor but for all the
subjects. Or something.
This would be useful in answering questions such as:
What does X2 contribute to explaining Y over and above what X1 contributes?
or
What does SES contribute to explaining H over and above what PA contributes?
or
What is the additional variance accounted for by SES if PA is already in the equation?
The notation r²y(2.1) represents the squared semi-partial correlation between X2 (SES) and Y (H)
with X1 (PA) partialed out only from X2 (SES).
The formula for a semi-partial correlation is:

$$r_{y(1.2)} = \frac{r_{y1} - r_{y2}\, r_{12}}{\sqrt{1 - r_{12}^2}}$$
Using Venn diagrams:

[Venn diagrams with circles for H, SES, and PA. Partial: r²y1.2, with the SES overlap removed from both H and PA; semi-partial: r²y(1.2), with the SES overlap removed from PA only.]
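For comparison, here is a minimal sketch computing both coefficients from the same hypothetical zero-order correlations; because the semi-partial's denominator is at least as large as the partial's, its magnitude is never larger.

```python
from math import sqrt

def partial_r(r_y1, r_y2, r_12):
    """Partial correlation r_y1.2 (X2 removed from both Y and X1)."""
    return (r_y1 - r_y2 * r_12) / sqrt((1 - r_y2**2) * (1 - r_12**2))

def semipartial_r(r_y1, r_y2, r_12):
    """Semi-partial correlation r_y(1.2) (X2 removed from X1 only)."""
    return (r_y1 - r_y2 * r_12) / sqrt(1 - r_12**2)

# Hypothetical zero-order correlations.
r_y1, r_y2, r_12 = 0.50, 0.40, 0.30
print(round(partial_r(r_y1, r_y2, r_12), 3))      # about 0.435
print(round(semipartial_r(r_y1, r_y2, r_12), 3))  # about 0.398
```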
When to Use Partial and Semi-partial Correlations
Semi-partial correlations are used to determine whether a variable explains
any additional variance, or whether its contribution to the overall R² is
significant when the effects of the other X variables have been taken into account.
Partial Correlations are used to determine what the effects of one variable on
another are when a third variable has been removed or controlled for.
I did get to this point in my original posting, but I did not repeat it in my latest
posting:
If your focus is the extent to which activity accounts for blood pressure in
the total population, then the correlation is root(16/100) = 0.40. (This
correlation is called a semi-partial of Type II in SAS and a part
correlation in SPSS.) But if your focus is the extent to which activity
accounts for blood pressure in people of the same age, the correlation is
root(16/(100-49)) = 0.56. This correlation is called a partial correlation
of Type II in SAS and a partial correlation in SPSS.
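Making the arithmetic in that quote explicit (reading the 16 as the additional percentage of blood-pressure variance explained by activity and the 49 as the percentage explained by age, which is the reading implied by the formulas above):

$$r_{\text{semi-partial}} = \sqrt{16/100} = 0.40, \qquad r_{\text{partial}} = \sqrt{16/(100 - 49)} = \sqrt{16/51} \approx 0.56$$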
Maybe what I am asking is unanswerable. Perhaps a better question is this: what
do epidemiologists usually provide in their papers, or don’t they provide partials
in general? If they provide regression coefficients, are they aware of the problem
that collinearity is not properly controlled for when you try to interpret
magnitude of the regression coefficient?
Examples:
1. Does physical activity (PA) have an effect on health (H) that is over and
above that of socioeconomic status (SES)?
To answer this question, one would calculate the semi-partial correlation:
$$r_{y(1.2)} = \frac{r_{y1} - r_{y2}\, r_{12}}{\sqrt{1 - r_{12}^2}}$$

where y = health, 1 = physical activity, and 2 = socioeconomic status. Therefore ry1, for
example, is the simple Pearson product-moment correlation between health and physical activity.
To gauge the proportion of variance accounted for, one would square the result: r²y(1.2).
2. For participants at the same level of socioeconomic status (SES), is there
a correlation between physical activity (PA) and health (H)?
To answer this question, one would calculate the partial correlation:
$$r_{y1.2} = \frac{r_{y1} - r_{y2}\, r_{12}}{\sqrt{(1 - r_{y2}^2)(1 - r_{12}^2)}}$$

where y = health, 1 = physical activity, and 2 = socioeconomic status. Therefore ry1, for
example, is the simple Pearson product-moment correlation between health and physical activity.
To gauge the proportion of variance accounted for, one would square the result: r²y1.2.
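Putting the two examples together, here is a minimal end-to-end sketch in Python (simulated, hypothetical data; the variable names and effect sizes are assumptions, not values from the text) that computes the zero-order correlations among H, PA, and SES and then the squared semi-partial and squared partial:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Hypothetical simulated data: SES influences both PA and H; PA also influences H.
ses = rng.normal(size=n)
pa = 0.6 * ses + rng.normal(size=n)
h = 0.5 * ses + 0.4 * pa + rng.normal(size=n)

# Zero-order correlations, with y = H, 1 = PA, 2 = SES.
r = np.corrcoef([h, pa, ses])
r_y1, r_y2, r_12 = r[0, 1], r[0, 2], r[1, 2]

# Example 1: squared semi-partial -- variance in H explained by PA over and above SES.
semi = (r_y1 - r_y2 * r_12) / np.sqrt(1 - r_12**2)
print("r^2 y(1.2) =", round(semi**2, 3))

# Example 2: squared partial -- the PA-H relationship for participants at the same SES level.
partial = (r_y1 - r_y2 * r_12) / np.sqrt((1 - r_y2**2) * (1 - r_12**2))
print("r^2 y1.2 =", round(partial**2, 3))
```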
References:
1. Bandalos, Deborah L. EDPS 942: Intermediate Statistics: Correlational Methods. Spring 1999.
2. Pedhazur, Elazar J. Multiple Regression in Behavioral Research: Explanation and Prediction
(3rd ed.). Orlando, FL: Harcourt Brace College Publishers, 1997. pp. 156-194.