Chapter 2-14. Pearson Correlation Coefficient With Clustered Data The ordinary Pearson correlation coefficient, as well as the nonparametric correlation coefficients, assume that each x-y pair of numbers is independent from the other x-y pairs. Both the interpretation of the correlation coefficient itself, as well as the significance test for the coefficient, require this assumption. Bland and Altman published three one-page articles discussing this situation. In their first paper (Bland and Altman, 1994), they simply present the idea that it is incorrect to compute a correlation coefficient on data that uses more than one x-y pair per study subject (the repeated measurement situation). In their second paper (Bland and Altman, 1995a), they describe how to analyze such data if the interest is a “within subjects” research question. For example, suppose you measured pH and Paco2 in the same patient, and you repeated this approximately four times during their hospital stay. You have a “within subjects” research question if you want to know whether an increase in pH within the same individual was associated with an increase in Paco2. (Paco2 is the symbol for partial pressure of carbon dioxide in the arterial blood.) In their third paper (Bland and Altman, 1995b), they describe how to analyze such data if the interest is a “between subjects” research question. For this same dataset, you have a “between subjects” research question if you want to know whether subjects with high values of pH also tend to have high values of Paco2. In this chapter, both Bland and Altman’s within subjects approach and their between subjects approach is described, along with some Stata programs for doing them. _________________ Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript] University of Utah School of Medicine, 2011. http://www.ccts.utah.edu/biostats/?pageId=5385 Chapter 2-14 (revision 16 May 2011) p. 1 Significance test for Pearson Correlation Coefficient We will need some Stata code for computing the significance test for the correlation coefficient when we get to the “between subjects” analysis below. Stata does not have a command where you can provide Stata with a correlation coefficent and sample size, after which it gives you the p value. The hypothesis test that the population correlation coefficient is different from zero is (Rosner, 2006, p.496): H0: ρ = 0 vs H1: ρ 0 test statistic: t r n2 1 r2 , where ρ is the population correlation coefficient , where r is the sample correlation coefficient For a two-sided level α test, reject H0 if t tn 2, 1 / 2 or t tn 2, 1 / 2 , where n is the sample size Thus, for a two-sided α = 0.05 comparison, statistical significance (p < 0.05) is achieved when t tn2, 0.975 To compute the two-tailed, or two-sided comparison, p value for a Pearson correlation coefficient, for a given sample size, cut-and-paste the following into the Stata do-file editor and run it. This loads the program to run in your current Stata session. * program to compute 2-tailed p value, passing * it the Pearson r and the sample size capture program drop pearsonpval program define pearsonpval args r n local t=`r'*sqrt(`n'-2)/sqrt(1-`r'^2) local p=2*ttail(`n'-2,abs(`t')) disp as result _newline "r = " `r' disp as result "n = " `n' disp as result "p = " %6.4f `p' end * syntax for use: pearsonpval r n Chapter 2-14 (revision 16 May 2011) p. 2 Now, to compute the p value for an r = 0.25 based on a sample size of n = 100, you use pearsonpval .25 100 r = .25 n = 100 p = 0.0121 This result agrees with Rosner (2006, p.497), so you can feel confident it is programmed correctly. Chapter 2-14 (revision 16 May 2011) p. 3 Within Subjects Pearson Correlation Coefficient for Repeated Measures Data As stated above, Bland and Altman (1995a) describe how to analyze repeated measures data if the interest is a “within subjects” research question. For example, you have a “within subjects” research question if you want to know whether an increase in pH within the same individual was associated with an increase in Paco2. This section will work through the process described in Bland and Altman (1995a). Bland and Altman (1995a) illustrate with a dataset they provide in their article (their Table 1). To read this dataset into Stata, File Open Find the directory where you copied the course CD: Change to the subdirectory datasets & do-files Single click on bland&altmanBMJ1995Table1.dta Open use "C:\Documents and Settings\u0032770.SRVR\Desktop\ regressionclass\datasets & do-files\ bland&altmanBMJ1995Table1.dta", clear * which must be all on one line, or use: cd "C:\Documents and ettings\u0032770.SRVR\Desktop" cd "\regressionclass\datasets & do-files" use bland&altmanBMJ1995Table1, clear Listing the first three of the n=8 subjects, list , sepby(subject) , if subject<=3 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. +------------------------+ | subject ph paco2 | |------------------------| | 1 6.68 3.97 | | 1 6.53 4.12 | | 1 6.43 4.09 | | 1 6.33 3.97 | |------------------------| | 2 6.85 5.27 | | 2 7.06 5.37 | | 2 7.13 5.41 | | 2 7.17 5.44 | |------------------------| | 3 7.4 5.67 | | 3 7.42 3.64 | | 3 7.41 4.32 | | 3 7.37 4.73 | | 3 7.34 4.96 | | 3 7.35 5.04 | | 3 7.28 5.22 | | 3 7.3 4.82 | | 3 7.34 5.07 | +------------------------+ Chapter 2-14 (revision 16 May 2011) p. 4 We see that each subject has multiple observations of pH and Paco2, with a variable number of observations per subject. Suppose we are interested in a “within subjects” research question, where we want to know whether an increase in pH within the same individual was associated with an increase in Paco2. This can be done with analysis of covariance (ANCOVA). We make one of the two variables the outcome variable and the other the continuous predictor variable. In addition, we make subject a predictor variable, by using indicator variables (7 indictor, or dummy, variables to represent the 8 subjects). The anova command automatically creates these indictor variables, without showing them to you, for any variable that is not listed in the continuous( ) option. anova ph paco2 subject , continuous(paco2) Number of obs = 47 Root MSE = .093715 R-squared = Adj R-squared = 0.8993 0.8781 Source | Partial SS df MS F Prob > F -----------+---------------------------------------------------Model | 2.98016151 8 .372520188 42.42 0.0000 | paco2 | .115324822 1 .115324822 13.13 0.0008 subject | 2.96606639 7 .42372377 48.25 0.0000 | Residual | .333732534 38 .008782435 -----------+---------------------------------------------------Total | 3.31389404 46 .072041175 This ANCOVA table shows how the variability in pH can be partitioned into components due to different sources. Chapter 2-14 (revision 16 May 2011) p. 5 To show graphically what the model did, we ask for predicted values from the model predict ph_pred and then overlay line graphs through these predicted values for each subject onto the scatterplot of observations, 7 6.8 6.6 6.4 Intramural pH 7.2 7.4 #delimit ; twoway (scatter ph paco2 , symbol(circle) color(black)) (line ph_pred paco2 if subject==1, color(black) clstyle(solid)) (line ph_pred paco2 if subject==2, color(black) clstyle(solid)) (line ph_pred paco2 if subject==3, color(black) clstyle(solid)) (line ph_pred paco2 if subject==4, color(black) clstyle(solid)) (line ph_pred paco2 if subject==5, color(black) clstyle(solid)) (line ph_pred paco2 if subject==6, color(black) clstyle(solid)) (line ph_pred paco2 if subject==7, color(black) clstyle(solid)) (line ph_pred paco2 if subject==8, color(black) clstyle(solid)) , legend(off) ytitle("Intramural pH") xtitle(PaCO2) ; #delimit cr 3 4 5 PaCO2 6 7 We see that the model fit parallel lines through each subject’s data. Chapter 2-14 (revision 16 May 2011) p. 6 The residual sum of squares in the ANCOVA table (Residual SS, 0.333732534) represents the variation around these regression lines. The between subjects variation, along with any unmeasured “nuisance variables” which make subjects differ from each other, is reflected in the subjects term of the ANCOVA table (Subjects SS, 2.96606639). We remove that source of variation by not including it in our estimation of the within-subjects correlation. We express the within subjects variation in observations of pH due to Paco2 as a proportion of what’s left: Sum of squares for Paco 2 Sum of squares for Paco 2 residual sum of squares This term is analogous to a coefficient of determination. When you square a Pearson correlation coefficient, you have the “coefficient of determination (r2)” which represents the proportion of variation in the outcome variable that is explained by the predictor variable. To obtained the magnitude of the “correlation coefficient within subjects”, we take the square root of this proportion. Sum of squares for Paco 2 Sum of squares for Paco 2 residual sum of squares Filling in the equation from the ANCOVA table, Number of obs = 47 Root MSE = .093715 R-squared = Adj R-squared = 0.8993 0.8781 Source | Partial SS df MS F Prob > F -----------+---------------------------------------------------Model | 2.98016151 8 .372520188 42.42 0.0000 | paco2 | .115324822 1 .115324822 13.13 0.0008 subject | 2.96606639 7 .42372377 48.25 0.0000 | Residual | .333732534 38 .008782435 -----------+---------------------------------------------------Total | 3.31389404 46 .072041175 0.1153 0.51 0.1153 0.3337 Chapter 2-14 (revision 16 May 2011) p. 7 The sign of the correlation coefficient comes from the sign of the regression coefficient for Paco2 when this ANCOVA is obtained from linear regression. This model fitted with linear regression is xi: regress ph paco2 i.subject i.subject _Isubject_1-8 (naturally coded; _Isubject_1 omitted) Source | SS df MS -------------+-----------------------------Model | 2.98016151 8 .372520188 Residual | .333732534 38 .008782435 -------------+-----------------------------Total | 3.31389404 46 .072041175 Number of obs F( 8, 38) Prob > F R-squared Adj R-squared Root MSE = = = = = = 47 42.42 0.0000 0.8993 0.8781 .09371 -----------------------------------------------------------------------------ph | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------paco2 | -.108323 .0298928 -3.62 0.001 -.1688378 -.0478082 _Isubject_2 | .7046112 .0773549 9.11 0.000 .5480145 .861208 _Isubject_3 | .9500127 .0610954 15.55 0.000 .8263315 1.073694 _Isubject_4 | .9715577 .0735091 13.22 0.000 .8227464 1.120369 _Isubject_5 | .8603818 .0583954 14.73 0.000 .7421664 .9785972 _Isubject_6 | .9264284 .0659945 14.04 0.000 .7928296 1.060027 _Isubject_7 | .6921055 .1049093 6.60 0.000 .4797277 .9044834 _Isubject_8 | .7033361 .0615714 11.42 0.000 .5786913 .8279809 _cons | 6.929854 .129469 53.53 0.000 6.667758 7.19195 ------------------------------------------------------------------------------ From the coefficient for paco2 (B = -.108323), we see the assocation is negative. The p value for the correlation coefficient can be taken from the t test for the regression slope (p = 0.001) or from the F test for the paco2 term in the ANCOVA table ( p = 0.0008). Number of obs = 47 Root MSE = .093715 R-squared = Adj R-squared = 0.8993 0.8781 Source | Partial SS df MS F Prob > F -----------+---------------------------------------------------Model | 2.98016151 8 .372520188 42.42 0.0000 | paco2 | .115324822 1 .115324822 13.13 0.0008 subject | 2.96606639 7 .42372377 48.25 0.0000 | Residual | .333732534 38 .008782435 -----------+---------------------------------------------------Total | 3.31389404 46 .072041175 It makes no difference which variable we choose for the dependent variable, as either way gives an identical correlation coefficient and p value. Chapter 2-14 (revision 16 May 2011) p. 8 These steps are contained in the following program. Simply cut-and-paste this into the do-file editor, and run it. This sets up the “withincorr” commmand to work during your current Stata session. * --------------------------------------* Bland and Altman (1995) * correlation coefficient within subjects capture program drop withincorr program define withincorr version 10 syntax varlist(min=3 max=3) [if] [in] tokenize `varlist' local yvar `1' local xvar `2' local subjectid `3' tempname touse mark `touse' `wgt' `if' `in' preserve quietly keep if `touse' quietly anova `yvar' `xvar' `subjectid' /// , continuous(`xvar') local r = sqrt(e(ss_1)/(e(ss_1)+e(rss))) quietly xi: regress `yvar' `xvar' i.`subjectid' if _b[`xvar']<0 { local r = -`r' } local p = 2*ttail(`e(N)'-2,abs(_b[`xvar']/_se[`xvar'])) display as result _newline "within r = " /// %4.2f `r' " , n = " `e(N)' " , p = " %5.4f `p' restore end * syntax: withincorr y x subjectid * ---------------------------------------- After executing the above lines to set up the “withincorr” command for the duration of your current Stata session, to compute a within subjects correlation coefficient, you run the following command, Syntax: withincorr yvar xvar subjectid [the subject ID variable must come last] withincorr ph paco2 subject within r = -0.51 , n = 47 , p = 0.0007 This is what is found in the Bland and Altman (1995a) paper, so you can feel confident it was programmed correctly. The command also works with the “if” option and “in” option. To illustrate, withincorr ph paco2 subject if subject<5 withincorr ph paco2 subject in 1/10 within r = -0.07 , n = 22 , p = 0.7628 within r = 0.02 , n = 10 , p = 0.9656 Chapter 2-14 (revision 16 May 2011) p. 9 Here is a suggested way to report this in your article. Article Suggestion Statistical Methods Pearson correlation coefficients are reported. For variables involving repeated measurements for each study subject, an ordinary correlation coefficient is not appropriate (Bland and Altman, 1994). For these variables, a “within subjects” correlation coefficient was reported, which accounts for the lack of independence among the repeated measurements by removing the variation between subjects (Bland and Altman, 1995a). A within subjects correlation coefficient examines whether an increase in a variable within the same individual is associated with an increase in another variable (Bland and Altman, 1995a). Results In the Results section, just report it like you would any other correlation coefficient. I doubt the reader would want to be bothered with knowing which correlation coefficients are ordinary coefficients and which are within subjects coefficients when reading the results. Chapter 2-14 (revision 16 May 2011) p. 10 Between Subjects Pearson Correlation Coefficient for Repeated Measures Data In their third paper (Bland and Altman, 1995b), the authors describe how to obtain a correlation coefficient for repeated measures data if the interest is a “between subjects” research question. Using the same dataset provided for ther within subjects correlation, you have a “between subjects” research question if you want to know whether subjects with high values of pH also tend to have high values of Paco2. Bland and Altman (1995b) illustrate with a dataset they provided in their second article (Bland and Altman, 1995a). To read this dataset into Stata, File Open Find the directory where you copied the course CD: Change to the subdirectory datasets & do-files Single click on bland&altmanBMJ1995Table1.dta Open use "C:\Documents and Settings\u0032770.SRVR\Desktop\ regressionclass\datasets & do-files\ bland&altmanBMJ1995Table1.dta", clear * which must be all on one line, or use: cd "C:\Documents and ettings\u0032770.SRVR\Desktop" cd "\regressionclass\datasets & do-files" use bland&altmanBMJ1995Table1, clear Listing the first three of the n=8 subjects, list , sepby(subject) , if subject<=3 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. +------------------------+ | subject ph paco2 | |------------------------| | 1 6.68 3.97 | | 1 6.53 4.12 | | 1 6.43 4.09 | | 1 6.33 3.97 | |------------------------| | 2 6.85 5.27 | | 2 7.06 5.37 | | 2 7.13 5.41 | | 2 7.17 5.44 | |------------------------| | 3 7.4 5.67 | | 3 7.42 3.64 | | 3 7.41 4.32 | | 3 7.37 4.73 | | 3 7.34 4.96 | | 3 7.35 5.04 | | 3 7.28 5.22 | | 3 7.3 4.82 | | 3 7.34 5.07 | +------------------------+ Chapter 2-14 (revision 16 May 2011) p. 11 We see that each subject has multiple observations of pH and Paco2, with a variable number of observations per subject. For the between subjects correlation, we first convert these data to a mean value for each subject, keeping track of how many observations were originally available per subject. capture drop number gen number=1 collapse (mean) ph paco2 (sum) number , by(subject) list 1. 2. 3. 4. 5. 6. 7. 8. +----------------------------------------+ | subject ph paco2 number | |----------------------------------------| | 1 6.4925 4.0375 4 | | 2 7.0525 5.3725 4 | | 3 7.356667 4.83 9 | | 4 7.326 5.312 5 | | 5 7.31375 4.39875 8 | |----------------------------------------| | 6 7.323333 4.92 6 | | 7 6.906667 6.603333 3 | | 8 7.115 4.78375 8 | +----------------------------------------+ The dataset is now reduced to 8 lines, with represent the mean ph and mean paco2 for each subject, along with the original number of observations per subject, which we call number. The number variable will be used to computed a weighted correlation coefficient. This dataset is what is found in the third Bland and Altman paper (Bland and Altman, 1995b), except they use only two decimal places. So that we get the same answer as they did, let’s first round to two decimal places. replace ph=round(ph,.01) replace paco2=round(paco2,.01) list 1. 2. 3. 4. 5. 6. 7. 8. +---------------------------------+ | subject ph paco2 number | |---------------------------------| | 1 6.49 4.04 4 | | 2 7.05 5.37 4 | | 3 7.36 4.83 9 | | 4 7.33 5.31 5 | | 5 7.31 4.4 8 | |---------------------------------| | 6 7.32 4.92 6 | | 7 6.91 6.6 3 | | 8 7.11 4.78 8 | +---------------------------------+ Chapter 2-14 (revision 16 May 2011) p. 12 Stata already has a command for computing the weighted correlation coefficient. Specifying the “frequency weight” as the number variable, pwcorr ph paco2 [fw=number] , obs sig | ph paco2 -------------+-----------------ph | 1.0000 | | 47 | paco2 | 0.0765 1.0000 | 0.6095 | 47 47 As pointed out by Bland and Altman, we cannot use this p value. The p value was based on n = 47 independent observations. We only have n = 8 independent observations. That is, we reduced our correlated repeated measurements into a single observation per subject, which resulted in 8 independent observations. To compute the p value, we use the program provided on page 2 above. Running these lines in the do-file editor sets it up for us. * program to compute 2-tailed p value, passing * it the Pearson r and the sample size capture program drop pearsonpval program define pearsonpval args r n local t=`r'*sqrt(`n'-2)/sqrt(1-`r'^2) local p=2*ttail(`n'-2,abs(`t')) disp as result _newline "r = " `r' disp as result "n = " `n' disp as result "p = " %6.4f `p' end * syntax for use: pearsonpval r n To get the p value for an r = 0.0765 based on 8 independent observations, we use, pearsonpval .0765 8 r = .0765 n = 8 p = 0.8571 This is what Bland and Altman reported in their paper (r = .08 , p = 0.9). Chapter 2-14 (revision 16 May 2011) p. 13 These steps are contained in the following program. Simply cut-and-paste this into the do-file editor, and run it. This sets up the “betweencorr” commmand to work during your current Stata session. * --------------------------------------* Bland and Altman (1995) * correlation coefficient between subjects capture program drop betweencorr program define betweencorr version 10 syntax varlist(min=3 max=3) [if] [in] tokenize `varlist' local yvar `1' local xvar `2' local subjectid `3' tempname touse mark `touse' `wgt' `if' `in' preserve quietly keep if `touse' quietly drop if `subjectid'==. tempvar number gen `number'=1 collapse (mean) `yvar' `xvar' (sum) `number' , by(`subjectid') quietly corr `yvar' `xvar' [fw=`number'] // returns r(rho) local r=r(rho) quietly count if `yvar'~=. & `xvar'~=. // returns r(N) local t=`r'*sqrt(r(N)-2)/sqrt(1-`r'^2) local p=2*ttail(r(N)-2,abs(`t')) display as result _newline "between r = " /// %4.2f `r' " , n = " `r(N)' " , p = " %5.4f `p' restore end * syntax: betweencorr y x subjectid * ---------------------------------------- After executing the above lines to set up the “withincorr” command for the duration of your current Stata session, to compute a within subjects correlation coefficient, you run the following command, Syntax: betweencorr yvar xvar subjectid [the subject ID variable must come last] betweencorr ph paco2 subject between r = 0.07 , n = 8 , p = 0.8696 This is what is found in the Bland and Altman (1995b) paper, so you can feel confident it was programmed correctly. (We get r=0.07, instead of 0.08 this time because we did not round the data off to two decimals to begin with.) Chapter 2-14 (revision 16 May 2011) p. 14 Here is a suggested way to report this in your article. Article Suggestion Statistical Methods Pearson correlation coefficients are reported. For variables involving repeated measurements for each study subject, an ordinary correlation coefficient is not appropriate (Bland and Altman, 1994). For variables with repeated measurements, a “between subjects” correlation coefficient was reported, where the repeated measurements were first converted to means, so that each subject contributed only one observation. The Pearson correlation coefficient was then calculated as a weighted average correlation coefficient, weighted by the number of repeated measurements originally available for each subject. The p values were computed, however, based on the number of subjects, or means, rather than the original number of repeated measurements. This “between subjects” correlation coefficient examines whether subjects with a high value on one variable also tend to have a high value on the other variable, similar to the ordinary correlation coefficient based on independent observations (Bland and Altman, 1995b). Results In the Results section, just report it like you would any other correlation coefficient. I doubt the reader would want to be bothered with knowing which correlation coefficients are ordinary coefficients and which were originally repeated measurements when reading the results. Chapter 2-14 (revision 16 May 2011) p. 15 References Bland JM, Altman DG. (1994). Correlation, regression, and repeated data. BMJ 308:896. Bland JM, Altman DG. (1995a). Calculating correlation coefficients with repeated observations: Part 1—correlation within subjects. BMJ 310:446. Bland JM, Altman DG. (1995b). Calculating correlation coefficients with repeated observations: Part 2—correlation between subjects. BMJ 310:633. Rosner B. (2006). Fundamentals of Biostatistics, 6th ed. Belmont CA, Thomson Brooks/Cole. Chapter 2-14 (revision 16 May 2011) p. 16