1. Data - CEU, KKU

Guided Exercise
Analysis of correlated data using STATA
Bandit Thinkhamrop, PhD.
1. Data
id   pstill   pwalk   pstand   nwalk   nstand
 1       64      70       94      75       80
 2       60      65       87      65       88
 3       58      62       88      60       65
 4       71      78      110      84       69
 5       65      71       98      80       66
 6       59      66      114      94       84
 7       68      77       97      67       53
 8       57      63      121     102       98
 9       67      79      100      98      102
10       68      78      105      90      111
2. Data descriptions
2.1 Variables
id     = Identification number
pstill = Pulse rate (per minute) at rest
pwalk  = Pulse rate (per minute) immediately after walking briskly in place for half a minute
pstand = Pulse rate (per minute) immediately after repeatedly sitting down and standing up for half a minute
nwalk  = Total number of walking steps
nstand = Total number of sit-stand repetitions
2.2 Case report form (CRF)
Questionnaire B
ID…………………………..
Please measure your pulse rate by counting the pulse for 15 seconds, then multiply by 4
to obtain the rate per minute.
Move 1 While resting
Please measure your pulse rate now ………………. beats per minute
Move 2 After walking
Please walk in place for half a minute and count your number of steps. Then
measure your pulse rate.
2.1 Number of steps ……………….
2.2 Pulse rate ………………. beats per minute
Move 3 After standing
Please sit down and stand up repeatedly for half a minute and count the number of repetitions. Then
measure your pulse rate.
3.1 Number of sitting-standing ……………
3.2 Pulse rate ………………. beats per minute
3. Preparation of data file format
Do: Enter the data from Section 1 into STATA
3.1 Wide form
3.1.1 Note that one respondent has one record (one ID, one row)
. li
        id   pstill   pwalk   pstand   nwalk   nstand
  1.     1       64      70       94      75       80
  2.     2       60      65       87      65       88
  3.     3       58      62       88      60       65
  4.     4       71      78      110      84       69
  5.     5       65      71       98      80       66
  6.     6       59      66      114      94       84
  7.     7       68      77       97      67       53
  8.     8       57      63      121     102       98
  9.     9       67      79      100      98      102
 10.    10       68      78      105      90      111
. su
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
          id |        10         5.5     3.02765          1         10
      pstill |        10        63.7    4.900113         57         71
       pwalk |        10        70.9    6.707376         62         79
      pstand |        10       101.4     11.0775         87        121
       nwalk |        10        81.5    14.59262         60        102
      nstand |        10        81.6    18.54244         53        111
. save "Example 1.dta"
file Example 1.dta saved
3.1.2 Calculate the pulse rate difference before and after taking a walk (This is an example of the
paired-data approach to statistical analysis)
. gen pdiff = pwalk - pstill
. li id pwalk pstill pdiff
        id   pwalk   pstill   pdiff
  1.     1      70       64       6
  2.     2      65       60       5
  3.     3      62       58       4
  4.     4      78       71       7
  5.     5      71       65       6
  6.     6      66       59       7
  7.     7      77       68       9
  8.     8      63       57       6
  9.     9      79       67      12
 10.    10      78       68      10
. ci pdiff
    Variable |       Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
       pdiff |        10         7.2    .7717225        5.454243    8.945757
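The paired-difference summary in 3.1.2 can be cross-checked outside STATA. The following is a minimal Python sketch, not part of the original exercise. It recomputes the mean difference, its standard error, and the 95% confidence interval from the data in Section 1. The critical value 2.262157 (the 0.975 quantile of the t distribution with 9 degrees of freedom) is hard-coded as an assumption so that only the standard library is needed.

```python
import math
import statistics

# Pulse rates from Section 1 (ids 1 to 10)
pstill = [64, 60, 58, 71, 65, 59, 68, 57, 67, 68]
pwalk = [70, 65, 62, 78, 71, 66, 77, 63, 79, 78]

# Paired difference for each respondent, as in "gen pdiff = pwalk - pstill"
pdiff = [w - s for w, s in zip(pwalk, pstill)]

n = len(pdiff)
mean_diff = statistics.mean(pdiff)
std_err = statistics.stdev(pdiff) / math.sqrt(n)  # sample SD divided by sqrt(n)

T_CRIT_9DF = 2.262157  # assumed constant: t quantile (0.975, 9 df)
ci_low = mean_diff - T_CRIT_9DF * std_err
ci_high = mean_diff + T_CRIT_9DF * std_err

print(pdiff)
print(round(mean_diff, 4), round(std_err, 7), round(ci_low, 4), round(ci_high, 4))
```

The printed values should agree with the "ci pdiff" output up to rounding.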
3.1.3 Calculate the mean of the three pulse rates (This is an example of the summary-measure
approach to statistical analysis)
Read the command syntax : help egen
(In newer versions of STATA this egen function is named rowmean().)
. egen meanp = rmean(pstill pwalk pstand)
. li pstill pwalk pstand meanp
       pstill   pwalk   pstand      meanp
  1.       64      70       94         76
  2.       60      65       87   70.66666
  3.       58      62       88   69.33334
  4.       71      78      110   86.33334
  5.       65      71       98         78
  6.       59      66      114   79.66666
  7.       68      77       97   80.66666
  8.       57      63      121   80.33334
  9.       67      79      100         82
 10.       68      78      105   83.66666
. ci meanp
    Variable |       Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
       meanp |        10    78.66667    1.704026        74.81189    82.52144
Remarks:
The analytical methods in 3.1.2 and 3.1.3 are not efficient, because each uses only part of the
information in the repeated measurements. In addition, the variables pstill, pwalk, and pstand
are all "pulse rate" and should not be treated as separate variables.
3.2 Long form
3.2.1 Learn how to do it using the reshape command in STATA
Read the command syntax : help reshape
reshape converts data from wide to long form and vice versa. Think of the
data as a collection of observations x_ij. One such collection might be

        (wide form)
        -i-         ------- x_ij -------
         id   sex   inc80   inc81   inc82
        ---------------------------------
          1     0    5000    5500    6000
          2     1    2000    2200    3300
          3     0    3000    2000    1000

        (long form)
        -i-   -j-         -x_ij-
         id   year   sex     inc
        ------------------------
          1     80     0    5000
          1     81     0    5500
          1     82     0    6000
          2     80     1    2000
          2     81     1    2200
          2     82     1    3300
          3     80     0    3000
          3     81     0    2000
          3     82     0    1000

reshape converts data from one form to the other:
. reshape long inc, i(id) j(year)     (goes from top-form to bottom)
. reshape wide inc, i(id) j(year)     (goes from bottom-form to top)
3.2.2 Reshape the data file format
Read the command syntax : help rename
. rename pstill p1
. rename pwalk p2
. rename pstand p3
. rename nwalk move2
. rename nstand move3
. gen move1 = 0
Read the command syntax : help drop
. drop pdiff meanp
. li p1 p2 p3 move1 move2 move3
        p1   p2    p3   move1   move2   move3
  1.    64   70    94       0      75      80
  2.    60   65    87       0      65      88
  3.    58   62    88       0      60      65
  4.    71   78   110       0      84      69
  5.    65   71    98       0      80      66
  6.    59   66   114       0      94      84
  7.    68   77    97       0      67      53
  8.    57   63   121       0     102      98
  9.    67   79   100       0      98     102
 10.    68   78   105       0      90     111
Read the command syntax : help reshape
. reshape long p move, i(id) j(visit)
(note: j = 1 2 3)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                       10   ->      30
Number of variables                   7   ->       4
j variable (3 values)                     ->   visit
xij variables:
                               p1 p2 p3   ->   p
                      move1 move2 move3   ->   move
-----------------------------------------------------------------------------
Note that one respondent now has more than one record. All variables keep the same meaning as in the
original data.
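To make the mechanics of the wide-to-long conversion concrete, here is a minimal Python sketch of the same idea. It is an illustration only, not how STATA implements reshape. The function name reshape_long and its arguments are hypothetical, and only the first three respondents are included to keep the example short.

```python
# Wide-form records: one dictionary per respondent (first three ids from Section 1)
wide = [
    {"id": 1, "p1": 64, "p2": 70, "p3": 94, "move1": 0, "move2": 75, "move3": 80},
    {"id": 2, "p1": 60, "p2": 65, "p3": 87, "move1": 0, "move2": 65, "move3": 88},
    {"id": 3, "p1": 58, "p2": 62, "p3": 88, "move1": 0, "move2": 60, "move3": 65},
]


def reshape_long(records, stubs, i, j, j_values):
    """Turn one wide record per subject into one long record per subject and j value."""
    long_rows = []
    for rec in records:
        for jv in j_values:
            row = {i: rec[i], j: jv}
            for stub in stubs:
                # e.g. stub "p" and jv 2 picks up the wide variable "p2"
                row[stub] = rec[f"{stub}{jv}"]
            long_rows.append(row)
    return long_rows


long_rows = reshape_long(wide, stubs=["p", "move"], i="id", j="visit", j_values=[1, 2, 3])
for row in long_rows:
    print(row)
```

Each respondent contributes three rows, one per visit, which mirrors the change from 10 to 30 observations reported by reshape for the full data.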
. li
        id   visit     p   move
  1.     1       1    64      0
  2.     1       2    70     75
  3.     1       3    94     80
  4.     2       1    60      0
  5.     2       2    65     65
  6.     2       3    87     88
  7.     3       1    58      0
  8.     3       2    62     60
  9.     3       3    88     65
 - - - Skip some records - - -
 22.     8       1    57      0
 23.     8       2    63    102
 24.     8       3   121     98
 25.     9       1    67      0
 26.     9       2    79     98
 27.     9       3   100    102
 28.    10       1    68      0
 29.    10       2    78     90
 30.    10       3   105    111
3.2.3. Perform desired statistical data analysis
The following are examples of common practice in the analysis of data of this kind.
A. Ignore clustering within subject (This is not appropriate!)
. regress p move
      Source |       SS       df       MS              Number of obs =      30
-------------+------------------------------           F(  1,    28) =   15.40
       Model |  3454.65093     1  3454.65093           Prob > F      =  0.0005
    Residual |  6282.01573    28  224.357705           R-squared     =  0.3548
-------------+------------------------------           Adj R-squared =  0.3318
       Total |  9736.66667    29  335.747126           Root MSE      =  14.979
------------------------------------------------------------------------------
           p |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        move |    .264589    .067428     3.92   0.001     .1264691     .402709
       _cons |   64.28184   4.573504    14.06   0.000     54.91344    73.65024
------------------------------------------------------------------------------
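The coefficients from this regression, which treats all 30 records as independent, can be reproduced with the ordinary least-squares formulas. The following Python sketch is an added cross-check, not part of the original exercise. It rebuilds the long-form variables p and move from the Section 1 data (move is 0 at rest) and computes the slope and intercept only, not the standard errors.

```python
# Data from Section 1; move is 0 at rest, nwalk after walking, nstand after sit-stand
pstill = [64, 60, 58, 71, 65, 59, 68, 57, 67, 68]
pwalk = [70, 65, 62, 78, 71, 66, 77, 63, 79, 78]
pstand = [94, 87, 88, 110, 98, 114, 97, 121, 100, 105]
nwalk = [75, 65, 60, 84, 80, 94, 67, 102, 98, 90]
nstand = [80, 88, 65, 69, 66, 84, 53, 98, 102, 111]

# Stack the three visits; row order does not matter for least squares
move = [0] * 10 + nwalk + nstand
p = pstill + pwalk + pstand

n = len(p)
mean_move = sum(move) / n
mean_p = sum(p) / n

# Slope = Sxy / Sxx, intercept = mean(p) - slope * mean(move)
sxy = sum((m - mean_move) * (y - mean_p) for m, y in zip(move, p))
sxx = sum((m - mean_move) ** 2 for m in move)
slope = sxy / sxx
intercept = mean_p - slope * mean_move

print(round(slope, 6), round(intercept, 5))
```

The result should match the move and _cons coefficients above. Nothing in this calculation uses id, which is exactly why the approach ignores the clustering within subject.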
B. Adjusted for clustering within subject (This is appropriate!)
Read the command syntax : help xt
. xtgee p move, i(id) t(visit) fam(gaussian) link(identity) corr(exchangeable) robust
Iteration 1: tolerance = .00170277
Iteration 2: tolerance = 4.718e-07

GEE population-averaged model                   Number of obs      =        30
Group variable:                         id      Number of groups   =        10
Link:                             identity      Obs per group: min =         3
Family:                           Gaussian                     avg =       3.0
Correlation:                  exchangeable                     max =         3
                                                Wald chi2(1)       =     64.24
Scale parameter:                  209.4074      Prob > chi2        =    0.0000

                             (standard errors adjusted for clustering on id)
------------------------------------------------------------------------------
             |             Semi-robust
           p |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        move |   .2625438   .0327578     8.01   0.000     .1983398    .3267478
       _cons |   64.39303   3.099198    20.78   0.000     58.31872    70.46735
------------------------------------------------------------------------------
4. Combining the data file
Obtain the data from Questionnaire A
Questionnaire A
ID…………………………..
Please fill in the following questions
1. Name …………………………………………………………..
2. Gender
[ ]1. Male
[ ]2. Female
3. Weight ……………….. kilograms
4. Height …………………. centimeters
5. Clasp your hands together in your usual manner, then look at your thumbs. The thumb on top is:
[ ]1. the right thumb [ ]2. the left thumb
id   sex   wt    ht   finger
 1     1   55   151        1
 2     1   62   165        0
 3     1   60   162        1
 4     1   61   165        0
 5     1   56   145        1
 6     0   42   150        0
 7     0   45   165        1
 8     0   51   158        0
 9     0   48   150        1
10     0   68   180        0
4.1 Combine the data using STATA command: joinby
4.1.1 Sort by id and save the data in long form, then close the data file
Read the command syntax : help sort
. sort id
. save "Example 1 long.dta"
file Example 1 long.dta saved
. clear
4.1.2 Open the master file, then sort by id and save the data file
. use "Example 1.dta", clear
. sort id
. save "Example 1.dta", replace
file Example 1.dta saved
Read the command syntax : help joinby
. joinby id using "Example 1 long.dta"
. li
        id   sex   wt    ht   finger   visit     p   move
  1.     1     1   55   151        1       1    64      0
  2.     1     1   55   151        1       2    70     75
  3.     1     1   55   151        1       3    94     80
  4.     2     1   62   165        0       1    60      0
  5.     2     1   62   165        0       2    65     65
  6.     2     1   62   165        0       3    87     88
 - - - Skip some records - - -
 25.     9     0   48   150        1       1    67      0
 26.     9     0   48   150        1       2    79     98
 27.     9     0   48   150        1       3   100    102
 28.    10     0   68   180        0       1    68      0
 29.    10     0   68   180        0       2    78     90
 30.    10     0   68   180        0       3   105    111
4.1.3 Save the data file
. save Master.dta
file Master.dta saved
Remarks:
This data file contains 8 variables. Suppose that this research has "pulse rate" (p) as
the primary outcome and that the independent variables are sex, wt, ht, finger, visit, and
move. Among these independent variables, move is a "time-dependent covariate", visit indexes the
measurement occasion, and sex, wt, ht, and finger are "time-independent covariates".
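The effect of combining the two files can be pictured as copying each respondent's time-independent values onto every one of that respondent's long-form records. The Python sketch below illustrates this for the first two respondents. It is an added illustration under that reading, not a description of how joinby works internally, and the names baseline and joined are hypothetical.

```python
# Time-independent covariates from Questionnaire A (first two ids), keyed by id
baseline = {
    1: {"sex": 1, "wt": 55, "ht": 151, "finger": 1},
    2: {"sex": 1, "wt": 62, "ht": 165, "finger": 0},
}

# Long-form pulse records: one row per respondent and visit
long_rows = [
    {"id": 1, "visit": 1, "p": 64, "move": 0},
    {"id": 1, "visit": 2, "p": 70, "move": 75},
    {"id": 1, "visit": 3, "p": 94, "move": 80},
    {"id": 2, "visit": 1, "p": 60, "move": 0},
    {"id": 2, "visit": 2, "p": 65, "move": 65},
    {"id": 2, "visit": 3, "p": 87, "move": 88},
]

# Attach the baseline values to every visit of the matching id
joined = [
    {"id": row["id"], **baseline[row["id"]], **row}
    for row in long_rows
    if row["id"] in baseline
]

for row in joined:
    print(row)
```

Each joined record carries 8 variables: sex, wt, ht, and finger repeat across the three visits of a respondent, while visit, p, and move change from row to row.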
4.2 Perform desired statistical data analysis
The following is an example of common practice in the analysis of data of this kind.
. xtgee p sex move, i(id) t(visit) fam(gaussian) link(identity) corr(exchangeable) robust
Iteration 1: tolerance = .05179417
Iteration 2: tolerance = .00075073
Iteration 3: tolerance = .00001251
Iteration 4: tolerance = 2.089e-07

GEE population-averaged model                   Number of obs      =        30
Group variable:                         id      Number of groups   =        10
Link:                             identity      Obs per group: min =         3
Family:                           Gaussian                     avg =       3.0
Correlation:                  exchangeable                     max =         3
                                                Wald chi2(2)       =     86.74
Scale parameter:                  208.4974      Prob > chi2        =    0.0000

                             (standard errors adjusted for clustering on id)
------------------------------------------------------------------------------
             |             Semi-robust
           p |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sex |  -2.470965   3.003347    -0.82   0.411    -8.357417    3.415487
        move |   .2451229   .0299111     8.20   0.000     .1864982    .3037477
       _cons |   66.57563   2.750068    24.21   0.000      61.1856    71.96567
------------------------------------------------------------------------------
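Read the command syntax : help longplot
. longplot p visit , i(id)
[Figure: p (vertical axis, 60 to 120) against visit (horizontal axis, 1 to 3)]
. longplot p visit , i(id) by(sex)
[Figure: p (60 to 120) against visit (1 to 3), in separate panels for sex = 0 and sex = 1]
Read the command syntax : help xtgraph
. xtgraph p
[Figure: output of xtgraph p; vertical axis p, upper value 109.324]
. xtgraph p, group(sex)
[Figure: p (57.1481 to 119.773) against visit (1 to 3), for sex = 0 and sex = 1]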
4.3 STATA commands for data analysis using the summary measure approach
A. Generate a variable containing a "running number" for each record
. gen num = _n
. li id num
        id   num
  1.     1     1
  2.     1     2
  3.     1     3
  4.     2     4
  5.     2     5
  6.     2     6
 - - - Skip some records - - -
 28.    10    28
 29.    10    29
 30.    10    30
B. Generate the summary measure (in this example we use the mean)
. egen meanp = mean(p), by(id)
. li num id p meanp
       num   id     p      meanp
  1.     1    1    64         76
  2.     2    1    70         76
  3.     3    1    94         76
  4.     4    2    60   70.66666
  5.     5    2    65   70.66666
  6.     6    2    87   70.66666
 - - - Skip some records - - -
 28.    28   10    68   83.66666
 29.    29   10    78   83.66666
 30.    30   10   105   83.66666
C. Keep only one record per ID
. egen minnum = min(num), by(id)
. li num minnum id p meanp
       num   minnum   id     p      meanp
  1.     1        1    1    64         76
  2.     2        1    1    70         76
  3.     3        1    1    94         76
  4.     4        4    2    60   70.66666
  5.     5        4    2    65   70.66666
  6.     6        4    2    87   70.66666
 - - - Skip some records - - -
 28.    28       28   10    68   83.66666
 29.    29       28   10    78   83.66666
 30.    30       28   10   105   83.66666
Read the command syntax : help keep
. keep if minnum == num
(20 observations deleted)
. li num minnum id sex visit p meanp move
       num   minnum   id   sex   visit    p      meanp   move
  1.     1        1    1     1       1   64         76      0
  2.     4        4    2     1       1   60   70.66666      0
  3.     7        7    3     1       1   58   69.33334      0
  4.    10       10    4     1       1   71   86.33334      0
  5.    13       13    5     1       1   65         78      0
  6.    16       16    6     0       1   59   79.66666      0
  7.    19       19    7     0       1   68   80.66666      0
  8.    22       22    8     0       1   57   80.33334      0
  9.    25       25    9     0       1   67         82      0
 10.    28       28   10     0       1   68   83.66666      0
D. Analyse the data using ordinary statistical methods
. ttest meanp, by(sex)
Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |       5    81.26667    .7102423     1.58815    79.29472    83.23861
       1 |       5    76.06667    3.030219    6.775775    67.65343     84.4799
---------+--------------------------------------------------------------------
combined |      10    78.66667    1.704026    5.388603    74.81189    82.52144
---------+--------------------------------------------------------------------
    diff |            5.199998    3.112341               -1.977074    12.37707
------------------------------------------------------------------------------
Degrees of freedom: 8

                  Ho: mean(0) - mean(1) = diff = 0

     Ha: diff < 0               Ha: diff ~= 0              Ha: diff > 0
       t =   1.6708                t =   1.6708              t =   1.6708
   P < t =   0.9333          P > |t| =   0.1333          P > t =   0.0667
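Steps B to D of the summary measure approach can be cross-checked with a short Python sketch. This is an added illustration, not part of the original exercise. It computes the mean pulse rate per respondent, splits the means by sex, and forms the equal-variance two-sample t statistic. P-values are not computed because they would need the t distribution, which is outside the standard library.

```python
import math
import statistics

# Data from Section 1 and sex from Questionnaire A (ids 1 to 10)
pstill = [64, 60, 58, 71, 65, 59, 68, 57, 67, 68]
pwalk = [70, 65, 62, 78, 71, 66, 77, 63, 79, 78]
pstand = [94, 87, 88, 110, 98, 114, 97, 121, 100, 105]
sex = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Summary measure: one mean pulse rate per respondent
meanp = [(a + b + c) / 3 for a, b, c in zip(pstill, pwalk, pstand)]

group0 = [m for m, s in zip(meanp, sex) if s == 0]
group1 = [m for m, s in zip(meanp, sex) if s == 1]
n0, n1 = len(group0), len(group1)

# Equal-variance two-sample t test: pooled variance, then standard error of the difference
diff = statistics.mean(group0) - statistics.mean(group1)
pooled_var = (
    (n0 - 1) * statistics.variance(group0) + (n1 - 1) * statistics.variance(group1)
) / (n0 + n1 - 2)
std_err = math.sqrt(pooled_var * (1 / n0 + 1 / n1))
t_stat = diff / std_err
df = n0 + n1 - 2

print(round(diff, 4), round(std_err, 4), round(t_stat, 4), df)
```

The difference, its standard error, the t statistic, and the degrees of freedom should agree with the ttest output up to rounding.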
5. GEE
Longitudinal, repeated-measures, or clustered data are commonly encountered in clinical
research. Observations on the same subject are often correlated, and this correlation needs to be
accounted for in statistical analysis. Ordinary statistical methods assume that the observations are
independent, which is usually not a reasonable assumption when observations are likely to be
correlated. When the independence assumption is violated, the
standard errors estimated by ordinary statistical methods are incorrect and thus lead to
incorrect inferences. The method of generalized estimating equations (GEE) can be used to
account for such correlations among observations. The GEE method first estimates the regression
parameters assuming that the observations are independent. After fitting the model, the
correlations among observations are estimated from the residuals. The correlation
estimates are then used to obtain new estimates of the regression parameters. This process is
repeated until the change between two successive estimates is very small.
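The iteration just described can be sketched in Python for the simplest case used in this exercise: Gaussian family, identity link, exchangeable correlation, and three visits per respondent. This is an added illustration under stated assumptions, not the xtgee implementation. The functions fit and estimate_alpha are hypothetical names, the common correlation alpha is estimated by a simple moment formula (average product of residual pairs within a respondent divided by the average squared residual), and robust standard errors are not computed.

```python
# Data from Section 1, arranged as one list of (move, p) pairs per respondent
pstill = [64, 60, 58, 71, 65, 59, 68, 57, 67, 68]
pwalk = [70, 65, 62, 78, 71, 66, 77, 63, 79, 78]
pstand = [94, 87, 88, 110, 98, 114, 97, 121, 100, 105]
nwalk = [75, 65, 60, 84, 80, 94, 67, 102, 98, 90]
nstand = [80, 88, 65, 69, 66, 84, 53, 98, 102, 111]
clusters = [
    [(0, pstill[i]), (nwalk[i], pwalk[i]), (nstand[i], pstand[i])]
    for i in range(10)
]
N_VISITS = 3


def fit(clusters, alpha):
    """Weighted least squares of p on move under exchangeable correlation alpha."""
    # The inverse of an exchangeable correlation matrix is proportional to I - c*J
    c = alpha / (1 + (N_VISITS - 1) * alpha)
    a11 = a12 = a22 = b1 = b2 = 0.0
    for obs in clusters:
        sx = sum(x for x, _ in obs)
        sy = sum(y for _, y in obs)
        a11 += N_VISITS - c * N_VISITS * N_VISITS
        a12 += sx - c * N_VISITS * sx
        a22 += sum(x * x for x, _ in obs) - c * sx * sx
        b1 += sy - c * N_VISITS * sy
        b2 += sum(x * y for x, y in obs) - c * sx * sy
    det = a11 * a22 - a12 * a12
    return (b1 * a22 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det


def estimate_alpha(clusters, intercept, slope):
    """Assumed moment estimator of the common within-respondent correlation."""
    pair_sum = sq_sum = 0.0
    n_pairs = n_obs = 0
    for obs in clusters:
        r = [y - intercept - slope * x for x, y in obs]
        total, squares = sum(r), sum(v * v for v in r)
        pair_sum += (total * total - squares) / 2  # sum of r_j * r_k over pairs j < k
        sq_sum += squares
        n_pairs += len(r) * (len(r) - 1) // 2
        n_obs += len(r)
    return (pair_sum / n_pairs) / (sq_sum / n_obs)


# Step 1: start from the independence fit (alpha = 0, identical to regress)
alpha = 0.0
intercept, slope = fit(clusters, alpha)
# Steps 2 and 3: re-estimate alpha from residuals, refit, repeat until stable
for iteration in range(100):
    alpha = estimate_alpha(clusters, intercept, slope)
    new_intercept, new_slope = fit(clusters, alpha)
    change = max(abs(new_intercept - intercept), abs(new_slope - slope))
    intercept, slope = new_intercept, new_slope
    if change < 1e-8:
        break

print(round(intercept, 4), round(slope, 5), round(alpha, 4))
```

With these data the coefficients should settle close to the xtgee estimates for the model of p on move in 3.2.3, although exact agreement depends on how the correlation is estimated. The estimated correlation comes out negative here, because within a respondent the walking residuals tend to be negative and the sit-stand residuals positive.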
GEE can be implemented in STATA using the xtgee command. It allows the user to specify
different correlation structures for the repeated observations and to fit generalized
linear models such as Poisson, negative binomial, or gamma regression in
addition to linear and logistic regression. The available options are listed below.
The allowable options for the xtgee command are
Families
- Bernoulli/binomial
- Gamma
- Gaussian
- Inverse gaussian
- Negative binomial
- Poisson
Links
- Cloglog
- Identity
- Log
- Logit
- Negative binomial
- Odds power
- Power
- Probit
- Reciprocal
Correlation Structures
- Independent
- Exchangeable
- Autoregressive
- Stationary
- Nonstationary
- Unstructured
- User-specified
Assuming an independent correlation structure amounts to ignoring the panel structure of the data.
Under this assumption, xtgee will produce answers that are already provided by Stata's nonpanel
estimation commands. Examples of when xtgee provides answers that are the same as an existing
command are given in the following table.
Family      Link        Correlation     Equivalent Stata command
gaussian    identity    independent     regress
gaussian    identity    exchangeable    xtreg, re (see note 1)
gaussian    identity    exchangeable    xtreg, pa
binomial    cloglog     independent     cloglog (see note 2)
binomial    cloglog     exchangeable    xtclog, pa
binomial    logit       independent     logit or logistic
binomial    logit       exchangeable    xtlogit, pa
binomial    probit      independent     probit (see note 3)
binomial    probit      exchangeable    xtprobit, pa
nbinomial   nbinomial   independent     nbreg (see note 4)
poisson     log         independent     poisson
poisson     log         exchangeable    xtpois, pa
gamma       log         independent     ereg (see note 5)
family      link        independent     glm (see note 6)
Note 1  These methods produce the same results only in the case of balanced panels.
Note 2  For cloglog estimation, xtgee with corr(independent) and cloglog will produce the
        same coefficients, but the standard errors will be only asymptotically equivalent because
        cloglog is not the canonical link for the binomial family.
Note 3  For probit estimation, xtgee with corr(independent) and probit will produce the same
        coefficients, but the standard errors will be only asymptotically equivalent because probit
        is not the canonical link for the binomial family. If the binomial denominator is not 1,
        the equivalent maximum-likelihood command is bprobit.
Note 4  Fitting a negative binomial model using xtgee (or using glm) will yield results
        conditional on the specified value of alpha. The nbreg command, however, fits that
        parameter as well as providing unconditional estimates.
Note 5  xtgee with corr(independent) can be used to estimate exponential regressions, but this
        requires specifying scale(1). As with probit, the xtgee-reported standard errors will be
        only asymptotically equivalent to those produced by ereg because log is not the
        canonical link for the gamma family. xtgee cannot be used to estimate exponential
        regressions on censored data.
Note 6  Using the independent correlation structure, the xtgee command will estimate the same
        model as estimated with the glm command provided the family-link combination is the
        same.

If the xtgee command is equivalent to another command, then the use of
corr(independent) and the robust option with xtgee corresponds to using both the
robust option and the cluster(varname) option in the equivalent command, where
varname corresponds to the i() group variable.