You are interested in the effect of physical activity on health. You do a cross-sectional study in which you measure health, physical activity, and other variables. In particular, you measure socioeconomic status (SES) and find that SES and physical activity are both positively correlated with health. Further, you find a high correlation between SES and activity. High correlations among the X variables are referred to as collinearity, or multicollinearity when several X variables are involved. People with high SES eat good food, live in toxin-free classy parts of town, read Time magazine, and think they're alpha in every way. All these things could account for their good health. How do you analyze your data to address this potential for the effect of activity on health to be "confounded" by SES? How do you express the magnitude of the resulting effect of activity on health?

Pedhazur (1997) suggests calculating the partial or semi-partial correlations among the variables, guided by specific questions about their relationships. When the X variables are correlated, the parts of Y that they explain overlap to some degree, and usually not as much of Y is explained as when the X variables are uncorrelated. Venn diagrams can be used to illustrate this concept. In this example, Y is the dependent variable, and X1 and X2 are the two independent variables. r12 is the simple, zero-order correlation between X1 and X2 (i.e., the collinearity). r² is the coefficient of determination, defined as the proportion of variance explained.

When r12 = 0, we have:

[Venn diagram: circles X1 and X2 each overlap Y but not each other. The two overlap areas are r²y1 and r²y2, and together they make up R²y.12.]

When r12 ≠ 0, the two independent variables explain the same part of Y to some degree.
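To put rough numbers on this overlap, here is a minimal Python sketch based on the standard two-predictor expression for R² in terms of the zero-order correlations. The function name and the correlation values (0.5, 0.4, 0.6) are made up for illustration; they are not results from the study described above.

```python
def r_squared_two_predictors(r_y1, r_y2, r_12):
    """R^2 for Y regressed on X1 and X2, computed from zero-order correlations.

    Standard two-predictor identity:
        R^2 = (r_y1^2 + r_y2^2 - 2*r_y1*r_y2*r_12) / (1 - r_12^2)
    """
    return (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)


# Hypothetical correlations of Y with X1 and X2 (illustrative values only).
r_y1, r_y2 = 0.5, 0.4

# Uncorrelated predictors: R^2 is simply the sum of the two squared correlations.
r2_uncorrelated = r_squared_two_predictors(r_y1, r_y2, 0.0)

# Correlated predictors: the explained parts of Y overlap, so R^2 is smaller here.
r2_correlated = r_squared_two_predictors(r_y1, r_y2, 0.6)

print(f"r12 = 0.0: R^2 = {r2_uncorrelated:.3f}")  # 0.410
print(f"r12 = 0.6: R^2 = {r2_correlated:.3f}")    # 0.266
```

With these particular values, the same two predictors explain 41% of the variance in Y when they are uncorrelated, but only about 27% when they correlate 0.6 with each other.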
Partial and Semi-partial Correlations

Because in multiple regression the correlations among the X variables influence the regression coefficients (b-values), it is often of interest to determine what the relationship between Y and some X variable would be if the other X variables were not in the equation. This is referred to as controlling for the effects of the other X variables. In experimental research, the independent variables are controlled by design, which results in uncorrelated independent variables. Other forms of control through research design are “matching” and “subject selection.”

For example, using Venn diagrams, below are three possibilities for visualizing the relationships among socioeconomic status (SES), physical activity (PA), and health (H) presented by Dr. Hopkins:

[Venn diagrams of H, SES, and PA: (a) minimal collinearity between SES and PA; (b) moderate collinearity between SES and PA; (c) high collinearity between SES and PA.]

If we could control for the effects of SES, we would probably find a moderate to high correlation between PA and H. So, Venn diagram (b) or (c) is more likely in this scenario. "More likely" might confuse folks. In any case, (c) indicates a low correlation between H and PA after controlling for SES. It's certainly lower than when SES isn't in the model.

One way to control for SES would be to use only participants from one SES classification. We could also control for the effects of SES by a statistical method known as partialing. A partial correlation is a measure of the correlation between two variables with the effects of a third variable removed or “partialed out” (i.e., controlled for). Partial correlations allow us to see what the correlation between two variables would be if the third variable were not there. It’s important to realize that in actuality the third variable is still there; its effects are just removed statistically, not physically.
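As a rough illustration of statistical partialing, the Python sketch below regresses two variables on a third, correlates the residuals, and compares the result with the standard closed-form expression for a first-order partial correlation. The data are simulated, and the function names, sample size, and coefficients are assumptions chosen only for illustration; nothing here comes from a real study.

```python
import numpy as np


def residuals(y, x):
    """Residuals from a simple least-squares regression of y on x (with intercept)."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)


def partial_corr_by_residuals(y, x1, x2):
    """Correlate the parts of y and x1 that are not linearly related to x2."""
    return np.corrcoef(residuals(y, x2), residuals(x1, x2))[0, 1]


def partial_corr_by_formula(y, x1, x2):
    """First-order partial correlation from the three zero-order correlations."""
    r_y1 = np.corrcoef(y, x1)[0, 1]
    r_y2 = np.corrcoef(y, x2)[0, 1]
    r_12 = np.corrcoef(x1, x2)[0, 1]
    return (r_y1 - r_y2 * r_12) / np.sqrt((1 - r_y2**2) * (1 - r_12**2))


# Simulated (made-up) data: SES influences both PA and H; PA also influences H.
rng = np.random.default_rng(0)
n = 500
ses = rng.normal(size=n)
pa = 0.7 * ses + rng.normal(size=n)
h = 0.5 * ses + 0.3 * pa + rng.normal(size=n)

r_zero_order = np.corrcoef(h, pa)[0, 1]
r_resid = partial_corr_by_residuals(h, pa, ses)
r_formula = partial_corr_by_formula(h, pa, ses)

print(f"zero-order r(H, PA):          {r_zero_order:.3f}")
print(f"partial r via residuals:      {r_resid:.3f}")
print(f"partial r via formula:        {r_formula:.3f}")
```

The residual-based and formula-based values should agree to within floating-point error, since they are two routes to the same quantity.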
Another way of talking about a partial correlation is to say that it is the correlation between X1 and Y when everyone has the same value on X2, where X2 is the variable being partialed out. The idea statistically is to correlate the parts of Y (health, H) and X1 (physical activity, PA) that are not related to X2 (socioeconomic status, SES). This involves removing the parts of H and PA that overlap with SES, and correlating the parts of these variables that do not overlap with SES.

[Venn diagram of H, SES, and PA: remove SES from the correlation between PA and H.]

Recall that when Y is regressed on X, the residuals (Y − Y') are uncorrelated with X. Therefore, the parts of PA and H that are not related to SES are the residuals we get by predicting H and PA from SES. Statistical control or partialing is therefore accomplished by:

1. Predicting H and PA from SES, where SES is the control variable.
2. Correlating the residuals from these two regressions: (H − H') and (PA − PA'). These are the parts of H and PA that are not related to SES.

Because it involves the use of residuals, this process is sometimes referred to as residualizing Y and X1 with respect to X2.

The notation ry1.2 means the partial correlation of X1 (PA) and Y (H) with X2 (SES) partialed out. This is referred to as a first-order partial because only one variable is partialed out. Any number of variables could conceivably be partialed out, but first-order partials are probably most common. The formula for the first-order partial ry1.2 is:

$$r_{y1.2} = \frac{r_{y1} - r_{y2}\, r_{12}}{\sqrt{1 - r_{y2}^{2}}\,\sqrt{1 - r_{12}^{2}}}$$

Partial correlations may be larger or smaller than the corresponding zero-order correlations. They may even have a different sign.

Semi-partial Correlations

In partial correlations, the effects of the control variable are removed from both Y and the other X variable.
Semi-partial correlations, however, are used when we only want to remove the effects of the control variable from one other variable, usually the other X variable. It took me a while to understand it in these terms. I think I prefer to think about the partial as the correlation for subjects with the other predictors held constant, whereas the semi-partial is the correlation due only to the predictor but for all the subjects. Or something.

This would be useful in answering questions such as "What does X2 contribute to explaining Y over and above what X1 contributes?", "What does SES contribute to explaining H over and above what PA contributes?", or "What is the additional variance accounted for by SES if PA is already in the equation?"

The notation r²y(2.1) represents the squared semi-partial correlation between X2 (SES) and Y (H) with X1 (PA) partialed out only from X2 (SES). The formula for a semi-partial correlation, written here for X1 with X2 partialed out of X1 only, is:

$$r_{y(1.2)} = \frac{r_{y1} - r_{y2}\, r_{12}}{\sqrt{1 - r_{12}^{2}}}$$

Using Venn diagrams:

[Venn diagram of H, SES, and PA showing the areas corresponding to the partial, r²y1.2, and the semi-partial, r²y(1.2).]

When to Use Partial and Semi-partial Correlations

Semi-partial correlations are used to determine whether a variable explains any additional variance, that is, whether its contribution to the overall R² is significant when the effects of the other X variables have been taken into account. Partial correlations are used to determine what the effect of one variable on another is when a third variable has been removed or controlled for.

I did get to this point in my original posting, but I did not repeat it in my latest posting: If your focus is the extent to which activity accounts for blood pressure in the total population, then the correlation is root(16/100) = 0.40. (This correlation is called a semi-partial of Type II in SAS and a part correlation in SPSS.) But if your focus is the extent to which activity accounts for blood pressure in people of the same age, the correlation is root(16/(100-49)) = 0.56.
This correlation is called a partial correlation of Type II in SAS and a partial correlation in SPSS.

Maybe what I am asking is unanswerable. Perhaps a better question is this: what do epidemiologists usually provide in their papers, or don’t they provide partials in general? If they provide regression coefficients, are they aware of the problem that collinearity is not properly controlled for when you try to interpret the magnitude of a regression coefficient?

Examples:

1. Does physical activity (PA) have an effect on health (H) that is over and above that of socioeconomic status (SES)? To answer this question, one would calculate the semi-partial correlation:

$$r_{y(1.2)} = \frac{r_{y1} - r_{y2}\, r_{12}}{\sqrt{1 - r_{12}^{2}}}$$

where y = health, 1 = physical activity, and 2 = socioeconomic status. Therefore ry1, for example, is the simple Pearson product-moment correlation between health and physical activity. To gauge the proportion of variance that is accounted for, one would square the result: r²y(1.2).

2. For participants at the same level of socioeconomic status (SES), is there a correlation between physical activity (PA) and health (H)? To answer this question, one would calculate the partial correlation:

$$r_{y1.2} = \frac{r_{y1} - r_{y2}\, r_{12}}{\sqrt{1 - r_{y2}^{2}}\,\sqrt{1 - r_{12}^{2}}}$$

where y = health, 1 = physical activity, and 2 = socioeconomic status. Again, ry1 is the simple Pearson product-moment correlation between health and physical activity. To gauge the proportion of variance that is accounted for, one would square the result: r²y1.2.

References:

1. Bandalos, Deborah L. EDPS 942: Intermediate Statistics: Correlational Methods. Spring, 1999.
2. Pedhazur, Elazar J. Multiple Regression in Behavioral Research: Explanation and Prediction (3rd Ed.). Harcourt Brace College Publishers: Orlando, FL. 1997. pp. 156-194.