Guided Exercise
Analysis of correlated data using STATA
Bandit Thinkhamrop, PhD.

1. Data

    id   pstill   pwalk   pstand   nwalk   nstand
     1       64      70       94      75       80
     2       60      65       87      65       88
     3       58      62       88      60       65
     4       71      78      110      84       69
     5       65      71       98      80       66
     6       59      66      114      94       84
     7       68      77       97      67       53
     8       57      63      121     102       98
     9       67      79      100      98      102
    10       68      78      105      90      111
    11

2. Data descriptions

2.1 Variables

    id     = Identification number
    pstill = Pulse rate (per minute) while staying still
    pwalk  = Pulse rate (per minute) immediately after quick walking for half a minute
    pstand = Pulse rate (per minute) immediately after quick sitting down and standing up for half a minute
    nwalk  = Total number of walking steps
    nstand = Total number of standing-and-sitting movements

2.2 Case report form (CRF)

    Questionnaire B                                   ID …………………………..

    Please measure your pulse rate by counting the pulse for 15 seconds, then multiply by 4 to obtain the rate per minute.

    Move 1  While resting
            Please measure your pulse rate now.
            Pulse rate ………………. beats per minute

    Move 2  After walking
            Please walk in place for half a minute and count your number of steps. Then measure your pulse rate.
            2.1 Number of steps ……………….
            2.2 Pulse rate ………………. beats per minute

    Move 3  After standing
            Please sit down and stand up quickly for half a minute and count the number of times. Then measure your pulse rate.
            3.1 Number of sitting-standing ……………
            3.2 Pulse rate ………………. beats per minute

3. Preparation of data file format

Do: Enter the data from section 1 into STATA.

3.1 Wide form

3.1.1 Note that one respondent has one record (one ID, one row)

    . li

                id    pstill     pwalk    pstand     nwalk    nstand
      1.         1        64        70        94        75        80
      2.         2        60        65        87        65        88
      3.         3        58        62        88        60        65
      4.         4        71        78       110        84        69
      5.         5        65        71        98        80        66
      6.         6        59        66       114        94        84
      7.         7        68        77        97        67        53
      8.         8        57        63       121       102        98
      9.         9        67        79       100        98       102
     10.        10        68        78       105        90       111

    . su

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
              id |        10         5.5     3.02765          1         10
          pstill |        10        63.7    4.900113         57         71
           pwalk |        10        70.9    6.707376         62         79
          pstand |        10       101.4     11.0775         87        121
           nwalk |        10        81.5    14.59262         60        102
          nstand |        10        81.6    18.54244         53        111

    . save "Example 1.dta"
    file Example 1.dta saved

3.1.2 Calculate the pulse rate difference before and after taking a walk
(This is an example of the paired-data approach to statistical analysis)

    . gen pdiff = pwalk - pstill
    . li id pwalk pstill pdiff

                id     pwalk    pstill     pdiff
      1.         1        70        64         6
      2.         2        65        60         5
      3.         3        62        58         4
      4.         4        78        71         7
      5.         5        71        65         6
      6.         6        66        59         7
      7.         7        77        68         9
      8.         8        63        57         6
      9.         9        79        67        12
     10.        10        78        68        10

    . ci pdiff

        Variable |       Obs        Mean    Std. Err.       [95% Conf. Interval]
    -------------+---------------------------------------------------------------
           pdiff |        10         7.2    .7717225        5.454243    8.945757

3.1.3 Calculate the mean of the three pulse rates
(This is an example of the summary-measure approach to statistical analysis)

Read the command syntax: help egen

    . egen meanp = rmean(pstill pwalk pstand)
    . li pstill pwalk pstand meanp

            pstill     pwalk    pstand      meanp
      1.        64        70        94         76
      2.        60        65        87   70.66666
      3.        58        62        88   69.33334
      4.        71        78       110   86.33334
      5.        65        71        98         78
      6.        59        66       114   79.66666
      7.        68        77        97   80.66666
      8.        57        63       121   80.33334
      9.        67        79       100         82
     10.        68        78       105   83.66666

    . ci meanp

        Variable |       Obs        Mean    Std. Err.       [95% Conf. Interval]
    -------------+---------------------------------------------------------------
           meanp |        10    78.66667    1.704026        74.81189    82.52144

Remarks: The analytical methods in 3.1.2 and 3.1.3 are not efficient, because each reduces the repeated measurements to a single number per respondent. In addition, the variables pstill, pwalk, and pstand are all "pulse rate" and should not be treated as separate variables.

3.2 Long form

3.2.1 Learn how to do it using the reshape command in STATA

Read the command syntax: help reshape

reshape converts data from wide to long form and vice versa. Think of the data as a collection of observations x_ij.
One such collection might be

    (wide form)
        -i-   --------- x_ij ----------
         id   sex   inc80   inc81   inc82
        ---------------------------------
          1     0    5000    5500    6000
          2     1    2000    2200    3300
          3     0    3000    2000    1000

    (long form)
        -i-   -j-          -x_ij-
         id   year   sex     inc
        ------------------------
          1     80     0    5000
          1     81     0    5500
          1     82     0    6000
          2     80     1    2000
          2     81     1    2200
          2     82     1    3300
          3     80     0    3000
          3     81     0    2000
          3     82     0    1000

reshape converts data from one form to the other:

    . reshape long inc, i(id) j(year)     (goes from top form to bottom)
    . reshape wide inc, i(id) j(year)     (goes from bottom form to top)

3.2.2 Reshape the data file format

Read the command syntax: help rename

    . rename pstill p1
    . rename pwalk p2
    . rename pstand p3
    . rename nwalk move2
    . rename nstand move3
    . gen move1 = 0

Read the command syntax: help drop

    . drop pdiff meanp
    . li p1 p2 p3 move1 move2 move3

                p1        p2        p3     move1     move2     move3
      1.        64        70        94         0        75        80
      2.        60        65        87         0        65        88
      3.        58        62        88         0        60        65
      4.        71        78       110         0        84        69
      5.        65        71        98         0        80        66
      6.        59        66       114         0        94        84
      7.        68        77        97         0        67        53
      8.        57        63       121         0       102        98
      9.        67        79       100         0        98       102
     10.        68        78       105         0        90       111

Read the command syntax: help reshape

    . reshape long p move, i(id) j(visit)
    (note: j = 1 2 3)

    Data                               wide   ->   long
    -----------------------------------------------------------------------------
    Number of obs.                       10   ->      30
    Number of variables                   7   ->       4
    j variable (3 values)                     ->   visit
    xij variables:
                                   p1 p2 p3   ->   p
                          move1 move2 move3   ->   move
    -----------------------------------------------------------------------------

Note that one respondent now has more than one record. All variables keep the same meaning as in the original data.

    . li id visit p move

                id     visit         p      move
      1.         1         1        64         0
      2.         1         2        70        75
      3.         1         3        94        80
      4.         2         1        60         0
      5.         2         2        65        65
      6.         2         3        87        88
      7.         3         1        58         0
      8.         3         2        62        60
      9.         3         3        88        65
      - - - Skip some records - - -
     22.         8         1        57         0
     23.         8         2        63       102
     24.         8         3       121        98
     25.         9         1        67         0
     26.         9         2        79        98
     27.         9         3       100       102
     28.        10         1        68         0
     29.        10         2        78        90
     30.        10         3       105       111

3.2.3 Perform the desired statistical data analysis

The following are examples of common practice in the analysis of data of this kind.

A. Ignore clustering within subject (This is not appropriate!)

    . regress p move

          Source |       SS       df       MS              Number of obs =      30
    -------------+------------------------------           F(  1,    28) =   15.40
           Model |  3454.65093     1  3454.65093           Prob > F      =  0.0005
        Residual |  6282.01573    28  224.357705           R-squared     =  0.3548
    -------------+------------------------------           Adj R-squared =  0.3318
           Total |  9736.66667    29  335.747126           Root MSE      =  14.979

    ------------------------------------------------------------------------------
               p |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            move |    .264589    .067428     3.92   0.001     .1264691     .402709
           _cons |   64.28184   4.573504    14.06   0.000     54.91344    73.65024
    ------------------------------------------------------------------------------

B. Adjusted for clustering within subject (This is appropriate!)

Read the command syntax: help xt

    . xtgee p move, i(id) t(visit) fam(gaussian) link(identity) corr(exchangeable) robust

    Iteration 1: tolerance = .00170277
    Iteration 2: tolerance = 4.718e-07

    GEE population-averaged model                   Number of obs      =        30
    Group variable:                         id      Number of groups   =        10
    Link:                             identity      Obs per group: min =         3
    Family:                           Gaussian                     avg =       3.0
    Correlation:                  exchangeable                     max =         3
                                                    Wald chi2(1)       =     64.24
    Scale parameter:                  209.4074      Prob > chi2        =    0.0000

                                 (standard errors adjusted for clustering on id)
    ------------------------------------------------------------------------------
                 |             Semi-robust
               p |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            move |   .2625438   .0327578     8.01   0.000     .1983398    .3267478
           _cons |   64.39303   3.099198    20.78   0.000     58.31872    70.46735
    ------------------------------------------------------------------------------

4. Combining the data file

Obtain the data from Questionnaire A

    Questionnaire A                                   ID …………………………..

    Please fill in the following questions

    1. Name …………………………………………………………..
    2. Gender   [ ] 1. Male   [ ] 2. Female
    3. Weight ……………….. kilograms
    4.
       Height …………………. centimeters
    5. Clasp your hands together in your usual manner, then look at your thumbs. The thumb on top is:
       [ ] 1. the right thumb   [ ] 2. the left thumb

    id   sex   wt    ht   finger
     1     1   55   151        1
     2     1   62   165        0
     3     1   60   162        1
     4     1   61   165        0
     5     1   56   145        1
     6     0   42   150        0
     7     0   45   165        1
     8     0   51   158        0
     9     0   48   150        1
    10     0   68   180        0

4.1 Combine the data using the STATA command joinby

4.1.1 Sort by id and save the data in long form, then close the data file

    . sort id
    . save "Example 1 long.dta"
    file Example 1 long.dta saved
    . clear

4.1.2 Open the master file, then sort by id and save the data file

Read the command syntax: help sort

    . use "Example 1.dta", clear
    . sort id
    . save "Example 1.dta", replace
    file Example 1.dta saved

Read the command syntax: help joinby

    . joinby id using "Example 1 long.dta"
    . li

                id   sex    wt    ht   finger   visit     p   move
      1.         1     1    55   151        1       1    64      0
      2.         1     1    55   151        1       2    70     75
      3.         1     1    55   151        1       3    94     80
      4.         2     1    62   165        0       1    60      0
      5.         2     1    62   165        0       2    65     65
      6.         2     1    62   165        0       3    87     88
      - - - Skip some records - - -
     25.         9     0    48   150        1       1    67      0
     26.         9     0    48   150        1       2    79     98
     27.         9     0    48   150        1       3   100    102
     28.        10     0    68   180        0       1    68      0
     29.        10     0    68   180        0       2    78     90
     30.        10     0    68   180        0       3   105    111

4.1.3 Save the data file

    . save Master.dta
    file Master.dta saved

Remarks: This data file contains 8 variables. Presume that this research has "pulse rate" as the primary outcome and that the independent variables are sex, wt, ht, finger, visit, and move. Among these independent variables, move is a "time-dependent covariate", while sex, wt, ht, and finger are "time-independent covariates".

4.2 Perform the desired statistical data analysis

The following are examples of common practice in the analysis of data of this kind.

    . xtgee p sex move, i(id) t(visit) fam(gaussian) link(identity) corr(exchangeable) robust

    Iteration 1: tolerance = .05179417
    Iteration 2: tolerance = .00075073
    Iteration 3: tolerance = .00001251
    Iteration 4: tolerance = 2.089e-07

    GEE population-averaged model                   Number of obs      =        30
    Group variable:                         id      Number of groups   =        10
    Link:                             identity      Obs per group: min =         3
    Family:                           Gaussian                     avg =       3.0
    Correlation:                  exchangeable                     max =         3
                                                    Wald chi2(2)       =     86.74
    Scale parameter:                  208.4974      Prob > chi2        =    0.0000

                                 (standard errors adjusted for clustering on id)
    ------------------------------------------------------------------------------
                 |             Semi-robust
               p |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             sex |  -2.470965   3.003347    -0.82   0.411    -8.357417    3.415487
            move |   .2451229   .0299111     8.20   0.000     .1864982    .3037477
           _cons |   66.57563   2.750068    24.21   0.000      61.1856    71.96567
    ------------------------------------------------------------------------------

Read the command syntax: help longplot

    . longplot p visit, i(id)
    [Figure: graph produced by longplot; y-axis p (60 to 120), x-axis visit (1 to 3)]

    . longplot p visit, i(id) by(sex)
    [Figure: graph produced by longplot with panels sex = 0 and sex = 1; y-axis p (60 to 120), x-axis visit (1 to 3)]

Read the command syntax: help xtgraph

    . xtgraph p
    [Figure: graph produced by xtgraph; y-axis p, x-axis visit]

    . xtgraph p, group(sex)
    [Figure: graph produced by xtgraph with groups sex = 0 and sex = 1; y-axis p, x-axis visit (1 to 3)]

4.3 STATA commands for data analysis using the summary-measure approach

A. Generate a variable containing a "running number" for each record

    . gen num = _n
    . li id num

                id       num
      1.         1         1
      2.         1         2
      3.         1         3
      4.         2         4
      5.         2         5
      6.         2         6
      - - - Skip some records - - -
     28.        10        28
     29.        10        29
     30.        10        30

B. Generate the summary measure (in this example we use the mean)

    . egen meanp = mean(p), by(id)
    . li num id p meanp

               num        id         p      meanp
      1.         1         1        64         76
      2.         2         1        70         76
      3.         3         1        94         76
      4.         4         2        60   70.66666
      5.         5         2        65   70.66666
      6.         6         2        87   70.66666
      - - - Skip some records - - -
     28.        28        10        68   83.66666
     29.        29        10        78   83.66666
     30.        30        10       105   83.66666

C. Keep only one record per ID

    . egen minnum = min(num), by(id)
    . li num minnum id p meanp

               num    minnum        id         p      meanp
      1.         1         1         1        64         76
      2.         2         1         1        70         76
      3.         3         1         1        94         76
      4.         4         4         2        60   70.66666
      5.         5         4         2        65   70.66666
      6.         6         4         2        87   70.66666
      - - - Skip some records - - -
     28.        28        28        10        68   83.66666
     29.        29        28        10        78   83.66666
     30.        30        28        10       105   83.66666

Read the command syntax: help keep

    . keep if minnum == num
    (20 observations deleted)

    . li num minnum id sex visit p meanp move

               num   minnum    id   sex   visit     p      meanp   move
      1.         1        1     1     1       1    64         76      0
      2.         4        4     2     1       1    60   70.66666      0
      3.         7        7     3     1       1    58   69.33334      0
      4.        10       10     4     1       1    71   86.33334      0
      5.        13       13     5     1       1    65         78      0
      6.        16       16     6     0       1    59   79.66666      0
      7.        19       19     7     0       1    68   80.66666      0
      8.        22       22     8     0       1    57   80.33334      0
      9.        25       25     9     0       1    67         82      0
     10.        28       28    10     0       1    68   83.66666      0

D. Analyse the data using ordinary statistical methods

    . ttest meanp, by(sex)

    Two-sample t test with equal variances
    ------------------------------------------------------------------------------
       Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
           0 |       5    81.26667    .7102423     1.58815    79.29472    83.23861
           1 |       5    76.06667    3.030219    6.775775    67.65343     84.4799
    ---------+--------------------------------------------------------------------
    combined |      10    78.66667    1.704026    5.388603    74.81189    82.52144
    ---------+--------------------------------------------------------------------
        diff |            5.199998    3.112341               -1.977074    12.37707
    ------------------------------------------------------------------------------
    Degrees of freedom: 8

                      Ho: mean(0) - mean(1) = diff = 0

         Ha: diff < 0               Ha: diff ~= 0              Ha: diff > 0
           t =   1.6708                t =   1.6708              t =   1.6708
       P < t =   0.9333          P > |t| =   0.1333          P > t =   0.0667

5. GEE

Longitudinal, repeated-measures, or clustered data are commonly encountered in clinical research. Correlations between observations on a given subject may exist and need to be accounted for in the statistical analysis. Ordinary statistical methods assume that the observations are independent. Where observations are likely to be correlated, this is not usually a reasonable assumption.
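As an illustration of why the independence assumption matters, the variance of a subject's mean can be worked out under a simple correlation model. This is a standard textbook result rather than part of the exercise, and the symbols are assumptions introduced here: n measurements y_1, ..., y_n on one subject, a common variance \sigma^2, and a common pairwise correlation \rho (an exchangeable structure).

```latex
\operatorname{Var}(\bar{y})
  = \frac{1}{n^{2}}\Bigl[\,\sum_{j=1}^{n}\operatorname{Var}(y_{j})
      + \sum_{j \neq k}\operatorname{Cov}(y_{j},y_{k})\Bigr]
  = \frac{1}{n^{2}}\Bigl[\,n\sigma^{2} + n(n-1)\rho\,\sigma^{2}\Bigr]
  = \frac{\sigma^{2}}{n}\,\bigl[\,1+(n-1)\rho\,\bigr]
```

With n = 3 measurements per subject, as in this exercise, and a purely hypothetical \rho = 0.5, the bracket equals 2, so a formula that assumes independence (\sigma^2 / n) would report only half of this variance.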
When the assumption of independent observations is violated, the standard errors estimated by ordinary statistical methods are incorrect and thus lead to incorrect inferences. The method of generalized estimating equations (GEE) can be used to account for such correlations among observations.

The GEE method first estimates the regression parameters assuming that the observations are independent. After fitting this model, the correlations among observations are estimated from the residuals. The correlation estimates are then used to obtain new estimates of the regression parameters. This process is repeated until the change between two successive estimates is very small.

GEE can be implemented in STATA using the xtgee command. It allows the user to specify different correlation structures for the repeated observations, and to fit other generalized linear models, such as Poisson or negative binomial regression, in addition to linear and logistic regression. The allowable options for the xtgee command are listed below.

    Families               Links                  Correlation structures
    -------------------------------------------------------------------
    Bernoulli/binomial     Cloglog                Independent
    Gamma                  Identity               Exchangeable
    Gaussian               Log                    Autoregressive
    Inverse gaussian       Logit                  Stationary
    Negative binomial      Negative binomial      Nonstationary
    Poisson                Odds power             Unstructured
                           Power                  User-specified
                           Probit
                           Reciprocal

Assuming an independent correlation structure amounts to ignoring the panel structure of the data. Under this assumption, xtgee will produce answers that are already provided by Stata's nonpanel estimation commands. Examples of when xtgee provides answers that are the same as an existing command are given in the following table.

    Family      Link        Correlation     Equivalent Stata command
    ----------------------------------------------------------------
    gaussian    identity    independent     regress
    gaussian    identity    exchangeable    xtreg, re (see note 1)
    gaussian    identity    exchangeable    xtreg, pa
    binomial    cloglog     independent     cloglog (see note 2)
    binomial    cloglog     exchangeable    xtclog, pa
    binomial    logit       independent     logit or logistic
    binomial    logit       exchangeable    xtlogit, pa
    binomial    probit      independent     probit (see note 3)
    binomial    probit      exchangeable    xtprobit, pa
    nbinomial   nbinomial   independent     nbreg (see note 4)
    poisson     log         independent     poisson
    poisson     log         exchangeable    xtpois, pa
    gamma       log         independent     ereg (see note 5)
    family      link        independent     glm (see note 6)

Note 1  These methods produce the same results only in the case of balanced panels.

Note 2  For cloglog estimation, xtgee with corr(independent) and cloglog will produce the same coefficients, but the standard errors will be only asymptotically equivalent because cloglog is not the canonical link for the binomial family.

Note 3  For probit estimation, xtgee with corr(independent) and probit will produce the same coefficients, but the standard errors will be only asymptotically equivalent because probit is not the canonical link for the binomial family. If the binomial denominator is not 1, the equivalent maximum-likelihood command is bprobit.

Note 4  Fitting a negative binomial model using xtgee (or using glm) will yield results conditional on the specified value of alpha. The nbreg command, however, estimates that parameter as well and provides unconditional estimates.

Note 5  xtgee with corr(independent) can be used to estimate exponential regressions, but this requires specifying scale(1). As with probit, the xtgee-reported standard errors will be only asymptotically equivalent to those produced by ereg because log is not the canonical link for the gamma family. xtgee cannot be used to estimate exponential regressions on censored data.

Note 6  Using the independent correlation structure, the xtgee command will estimate the same model as the glm command, provided the family-link combination is the same.

If the xtgee command is equivalent to another command, then using corr(independent) together with the robust option in xtgee corresponds to using both the robust option and the cluster(varname) option in the equivalent command, where varname is the i() group variable.
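The correspondence between xtgee with corr(independent) and regress with a cluster option can be checked on the pulse data from section 1. The following is a minimal sketch, not part of the original exercise: the matrix names b_gee and b_ols are made up for illustration, and the option syntax follows the older style used throughout this handout (in recent Stata releases the panel would normally be declared with xtset id visit and the variance requested with vce()).

```stata
* Sketch only: re-enter the wide-form pulse data from section 1
clear
input id p1 p2 p3 move2 move3
 1 64 70  94  75  80
 2 60 65  87  65  88
 3 58 62  88  60  65
 4 71 78 110  84  69
 5 65 71  98  80  66
 6 59 66 114  94  84
 7 68 77  97  67  53
 8 57 63 121 102  98
 9 67 79 100  98 102
10 68 78 105  90 111
end

* No movement is counted at rest, as in section 3.2.2
gen move1 = 0

* Convert to long form: one record per respondent per visit
reshape long p move, i(id) j(visit)

* GEE with an independent working correlation and semi-robust standard errors
xtgee p move, i(id) fam(gaussian) link(identity) corr(independent) robust
matrix b_gee = e(b)

* Ordinary regression with standard errors clustered on the same group variable
regress p move, robust cluster(id)
matrix b_ols = e(b)

* Relative difference between the two coefficient vectors (hypothetical names)
display mreldif(b_gee, b_ols)
```

The coefficients from the two commands should agree, and they should match those from regress p move in section 3.2.3 A, since clustering changes only the standard errors. The standard errors of the two commands may still differ slightly if the commands apply different finite-sample adjustments.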