Bayesian and Penalised Regression Methods for Epidemiological Analysis
Lab 3. Offset priors for conditional, Cox and Poisson regression
1. Conditional logistic regression
For conditional logistic regression, the prior data table would consist of two pairs of records: one
containing an exposed case and an unexposed non-case; the other with an unexposed case and an
exposed non-case. Each record would contain the following (Greenland, 2007):
i. The prior count A, which is used as a weight to increase the contribution of the prior record
to the data, or else as a repetition count to multiply the number of records of the given form
(which may require using only whole-number A). A is calculated from the prior as before; i.e.
A = (2/v_prior)S^2, where v_prior is the prior variance and S is the scaling factor used to
improve normality of the prior.
ii. M, the total number of subjects in the record. As before, this is equal to 2A for a
symmetric prior, and the outcome is modelled using the events/trials syntax.
iii. There is no intercept in conditional logistic regression, so no command is needed to remove
it.
iv. The variable to which the prior applies, Xj, is set to Xj = 1/S for the exposed case and
exposed non-case records, and to 0 for the unexposed records. All other regressors (X_not-j)
are set to 0 in all 4 records for the Xj prior.
v. If there is a nonzero prior median m, an offset H is included, calculated as H = −m/S for
exposed pair members (the ones with X = 1/S) and 0 for unexposed pair members (the ones with
X = 0).
vi. A matched set identifier, MSID, to indicate the pair, unique to each matched set. In the
actual data this identifies the matched sets. The data are stratified by this variable using the
<strata> statement.
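The steps above can be sketched in a few lines of Python (used here only for illustration; the lab itself is run in SAS). This is a minimal sketch, not a definitive implementation: the function name conditional_prior_records, the column name "case", and the starting MSID of 9001 are assumptions introduced for the example, and the layout is one weighted record per pair member with a 0/1 case indicator rather than the events/trials layout that uses M.

```python
import math

def conditional_prior_records(m, v, S, msid_start, name="X"):
    """Build the four prior records (two matched pairs) for one coefficient.

    m: prior median of the log odds ratio, v: prior variance, S: scaling factor.
    msid_start is a hypothetical identifier chosen not to clash with real MSIDs.
    """
    A = (2.0 / v) * S**2   # prior count, used as a weight
    x = 1.0 / S            # rescaled regressor value for exposed pair members
    H = -m / S             # offset for exposed pair members
    rows = []
    # pair 0: exposed case + unexposed non-case
    # pair 1: unexposed case + exposed non-case
    for pair, exposed_is_case in enumerate([True, False]):
        for is_case in (1, 0):
            exposed = (is_case == 1) == exposed_is_case
            rows.append({
                "MSID": msid_start + pair,
                "case": is_case,
                "A": A,
                name: x if exposed else 0.0,
                "H": H if exposed else 0.0,
            })
    return rows

rows = conditional_prior_records(m=math.log(2), v=0.5, S=10, msid_start=9001)
for r in rows:
    print(r)

# 95% prior limits for the odds ratio implied by normal(ln(2), 0.5)
sd = math.sqrt(0.5)
limits = (math.exp(math.log(2) - 1.96 * sd), math.exp(math.log(2) + 1.96 * sd))
print(limits)
```

Printing the implied 95% prior limits alongside the records is a quick way to confirm that the prior entered is the one intended before augmenting any data.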
a) Construct the prior data table for a normal(ln(2),0.5) prior following this method (use a
scaling factor of 10). Check the prior is entered correctly by running the same events/trials
syntax as before, but this time you need to also specify the strata statement to identify the
matched sets.
b) Examine your table, looking for any redundancies in the rows. Can you see how the prior
could be simplified?
c) Open the LBW dataset (Hosmer & Lemeshow, 1989). These data are available on the SAS
website as an example of how to conduct conditional logistic regression:
http://support.sas.com/documentation/cdl/en/statug/63962/HTML/default/viewer.htm#statug_phreg_sect050.htm
The study included 189 women, of whom 59 had low-birth-weight babies and 130 had
normal-weight babies. Under investigation are the following risk factors: maternal weight at
the last menstrual period (LWT)*, presence of hypertension (HT), smoking status during
pregnancy (Smoke), and presence of uterine irritability (UI). For HT, Smoke, and UI, a value
of 1 indicates "yes" and a value of 0 indicates "no". The woman's age (Age) is used as the
matching variable. The data set LBW contains a subset of the data corresponding to women
between the ages of 16 and 32. Weight is currently in pounds; a difference of one pound is
clinically negligible, so rescale weight to 10-kilogram (22-pound) units by dividing LWT by
22.0462. Finally, re-centre LWT by subtracting 5 so that 50 kg is now 0.
*We note maternal weight is a poor choice for a study variable, since “normal” values vary
with height; BMI and height (together) would have been more relevant clinically, but height
is not available in the example data set.
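The rescaling of LWT is simple arithmetic, sketched here in Python for illustration only (the lab itself uses SAS). The function name rescale_lwt and the constant name are assumptions introduced for this example.

```python
LB_PER_10KG = 22.0462  # pounds in 10 kilograms, as given in the text

def rescale_lwt(lwt_pounds):
    """Convert weight in pounds to 10-kg units, centred so that 50 kg maps to 0."""
    return lwt_pounds / LB_PER_10KG - 5.0

# A woman weighing 50 kg (about 110.2 lb) maps to approximately 0,
# and one weighing 60 kg (about 132.3 lb) maps to approximately 1.
print(rescale_lwt(5 * LB_PER_10KG))
print(rescale_lwt(6 * LB_PER_10KG))
```

After this transformation, the odds ratio for LWT is read as the odds ratio per 10 kg of maternal weight.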
d) Obtain the maximum likelihood estimates (and 95% Wald and Profile Likelihood Limits) using
conditional logistic regression. The conditional ML estimates are below:
Odds Ratio
Exposure      Point Estimate   95% Wald Confidence Limits   95% Profile Likelihood Limits
LWT, 10 Kg    0.717            0.527     0.975              0.516     0.959
Smoke         2.244            1.091     4.615              1.099     4.686
HT            5.763            1.353    24.544              1.399    26.863
UI            2.419            0.944     6.202              0.94      6.272
e) Create a prior data table (S=10) for these data using the steps listed above for collapsed
prior pairs. You can select priors based on your own knowledge, or for ease we suggest
normal(ln(2),0.5) for all variables except Smoke, which is normal(ln(4),0.5). N.B. Your MSID
will need to identify a unique risk set, so ensure it does not overlap with an existing MSID.
f) Augment the data with the prior and obtain posterior estimates. If using the suggested
priors, your results should be similar to these:
Odds Ratio
Exposure      Point Estimate   95% Wald Confidence Limits   95% Profile Likelihood Limits
LWT, 10 Kg    0.772            0.582     1.025              0.571     1.011
Smoke         2.494            1.322     4.704              1.329     4.750
HT            3.281            1.229     8.761              1.226     8.836
UI            2.258            1.04      4.903              1.037     4.919
g) Compare your results with those you would obtain using the BAYES statement in PROC
PHREG.
2. Cox proportional hazards regression
Data augmentation for Cox proportional hazards regression parallels conditional logistic regression.
There are 2 pairs of records: 1 containing an exposed failure and an unexposed survivor; the other
with an unexposed failure and an exposed survivor. Each record contains:
i. The prior count A, used as a weight to increase the contribution of the prior record to the
data and calculated as before; i.e. A = (2/v_prior)S^2.
ii. Failure time, FT, which can be 1 for prior records and is the real failure-or-censoring time
for the actual data.
iii. Intercepts are not estimated in Cox regression, so no command is needed to remove them.
iv. The variable to which the prior applies, Xj, is set to Xj = 1/S for the exposed failure and
exposed survivor records, and to 0 for the unexposed records. Note that, as for conditional
logistic regression, all other regressors (X_not-j) are set to 0 in all 4 records for the Xj prior.
v. If there is a nonzero prior median m, an offset H is included, calculated as H = −m/S for
exposed pair members (the ones with X = 1/S) and 0 for unexposed pair members (the ones with
X = 0), as for conditional logistic regression.
vi. A matched risk set identifier, MRSID, to indicate the pair, unique to each matched set. This
is a unique value for each risk set (which is each observation in the actual data). The data are
stratified by this variable using the <strata> statement.
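The choice A = (2/v_prior)S^2 can be checked numerically. The Python sketch below (illustration only; the lab uses SAS) assumes each prior risk set contains exactly two members and one failure, so that its partial-likelihood contribution is exp(lp_failure)/(exp(lp_exposed) + exp(lp_unexposed)), with linear predictor beta*X + H. The function names prior_loglik and implied_variance are hypothetical. Under these assumptions the weighted prior records should have their mode at m and a curvature corresponding to a variance of approximately v_prior.

```python
import math

def prior_loglik(beta, m, v, S):
    """Weighted log partial likelihood contributed by the two prior risk sets."""
    A = (2.0 / v) * S**2
    u = beta / S - m / S                 # exposed member: beta*X + H with X=1/S, H=-m/S
    log_denominator = math.log(math.exp(u) + 1.0)   # unexposed member has predictor 0
    ll_exposed_fails = u - log_denominator          # risk set 1: exposed failure
    ll_unexposed_fails = 0.0 - log_denominator      # risk set 2: unexposed failure
    return A * (ll_exposed_fails + ll_unexposed_fails)

def implied_variance(m, v, S, h=0.01):
    """Variance implied by the curvature at the mode (central finite difference)."""
    f = lambda b: prior_loglik(b, m, v, S)
    second_derivative = (f(m + h) - 2.0 * f(m) + f(m - h)) / h**2
    return -1.0 / second_derivative

m, v, S = math.log(2), 0.5, 10
print(implied_variance(m, v, S))   # should be close to v
```

The same check applies to the conditional logistic prior in section 1, since a 1:1 matched pair contributes a term of the same form.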
a) Use this method to complete the below table for a normal(ln(2),0.5) prior with scaling factor
S of 10.
b) Open the myeloma dataset. These data are from a study of multiple myeloma among 65
patients treated with alkylating agents (Krall, Uthoff, & Harley, 1975) and are available on
the SAS website:
http://support.sas.com/documentation/cdl/en/statug/63962/HTML/default/viewer.htm#statug_genmod_sect070.htm
The primary endpoint was survival time in months and there were 9 explanatory variables.
Obtain hazard ratios for these data using PHREG.
c) Create a normal(ln(2),0.5) prior for two of the regressors, log blood urea nitrogen (BUN) and
serum calcium at baseline. (Note: these priors were chosen arbitrarily to illustrate the
method; any serious analysis should choose its priors more carefully.)
d) Augment the data with the prior and obtain posterior estimates.
e) Compare your results with those you would obtain using the BAYES statement in PHREG.
3. Poisson regression
For Poisson regression, one can multiply all the person-time by a very large number N that is at least
100,000 times the size of the largest outcome count among the actual-data records, then use
grouped-data logistic regression by treating the person-time as the total number in the record. If
there is no person-time variable (as in the example below) one instead adds to each actual-data
record a total-count variable N. Either way, this trick changes only the intercept, which will be
shifted by –ln(N).
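The effect of this trick can be seen without fitting software when the model has a single binary regressor, because the maximum likelihood estimates then have closed forms. The Python sketch below (illustration only) uses the two-row table of cases and person-years given in the additional exercise at the end of this section; the choice N = 1e8 and all variable names are assumptions made for the example.

```python
import math

# Cases and person-years by exposure X, from the additional exercise in this section
cases = {1: 236, 0: 279}
pyears = {1: 46068.0, 0: 45163.0}

# N must be at least 100,000 times the largest count (279); 1e8 is an arbitrary choice
N = 1e8

# Poisson regression with log person-time offset: closed-form ML estimates
pois_intercept = math.log(cases[0] / pyears[0])
pois_slope = math.log(cases[1] / pyears[1]) - pois_intercept

def logit(events, trials):
    """Log odds of events out of trials."""
    return math.log(events / (trials - events))

# Grouped logistic regression treating N * person-time as the number of trials
logit_intercept = logit(cases[0], N * pyears[0])
logit_slope = logit(cases[1], N * pyears[1]) - logit_intercept

print("log rate ratio (Poisson):", pois_slope)
print("log odds ratio (logistic):", logit_slope)
print("intercept shift:", logit_intercept - pois_intercept, "vs -ln(N):", -math.log(N))
```

With N this large the odds are numerically indistinguishable from the rates divided by N, which is why only the intercept moves.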
For normal priors, however, provided S is large enough (at least 30), one can use a
Poisson-regression program with a single record added for each coefficient prior, set up as
follows (Greenland, 2007):
i. For actual records A is the observed count, while for prior records A is the prior count
after rescaling by a factor of S^2; i.e. A = S^2/v_prior. For adequate normality we recommend
at least S = 30 because the prior imposed by Poisson records is too skewed for smaller S.
ii. If person-time is being entered as part of the actual data, no offset is needed; instead the
person-time for the prior data is set to e^H = A/exp(m/S), which is just the prior count A when
the prior mean is 0.
iii. If no person-time is being entered and there are nonzero prior means, we need an offset:
H = 0 for all actual-data records and H = ln(A) − m/S for all prior records.
iv. The variable X to which the prior applies is set to X = 1/S; all other regressors are set
to 0. No change is made to the actual records.
v. As with logistic regression, it is necessary to replace the intercept by a variable, Const,
which is set to 1 for all actual records and for the prior record for the intercept (if
included), and is set to 0 for all other prior records. Const is included as a regressor in the
model, and the programme is instructed not to fit its own intercept by using the noint option.
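The quantities in this list can be computed directly. The Python sketch below is for illustration only, since the lab itself is run in SAS. The function names poisson_prior_record and prior_limits and the dictionary keys are assumptions introduced for this example. It builds the single prior record for one coefficient, then confirms that the Poisson score for that record is zero at beta = m, so the prior is centred where intended.

```python
import math

def poisson_prior_record(m, v, S):
    """Single Poisson prior record for one coefficient with a normal(m, v) prior."""
    A = S**2 / v                 # prior count after rescaling by S^2
    X = 1.0 / S                  # rescaled regressor value
    H = math.log(A) - m / S      # offset when no person-time is entered
    return {"A": A, "X": X, "H": H, "Const": 0, "person_time": math.exp(H)}

def prior_limits(m, v, z=1.96):
    """Approximate 95% prior limits on the ratio scale."""
    sd = math.sqrt(v)
    return math.exp(m - z * sd), math.exp(m + z * sd)

m, v, S = math.log(4), 0.125, 30
rec = poisson_prior_record(m, v, S)
print(rec)
print(prior_limits(m, v))

# Score of the Poisson log-likelihood A*(beta*X + H) - exp(beta*X + H) at beta = m
score_at_m = rec["A"] * rec["X"] - rec["X"] * math.exp(m * rec["X"] + rec["H"])
print(score_at_m)   # approximately 0
```

The curvature of the same log-likelihood at beta = m is A/S^2 = 1/v_prior, which is why A is set to S^2/v_prior.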
a) Construct the prior data table for a normal(ln(4), 0.125) prior, which produces 95% prior
limits for RR of 2 and 8, with scaling factor of 30 using this method.
b) Open the liver dataset. These data are provided on the SAS website as an example of how
to do a Bayesian Poisson regression on count data with an informative prior, using a dataset
from Ibrahim, Chen, & Lipsitz (1999):
http://support.sas.com/documentation/cdl/en/statug/63962/HTML/default/viewer.htm#statug_genmod_sect070.htm
The outcome modelled is the number of cancerous liver nodes when a patient enters a clinical
trial (Nodes). The number of nodes is modelled by a Poisson regression model using PROC
GENMOD with six baseline characteristics as explanatory variables. Rescale BMI so that a
1-unit increase in the rescaled variable represents a 10-unit increase in BMI. Obtain
maximum likelihood estimates for these data.
c) Use the prior created in a) to place a prior on BMI. Augment the data with the prior and
obtain posterior estimates. N.B. This prior is different from that used in the example on the
SAS website.
d) Compare your results with those you would obtain using the BAYES statement in GENMOD.
e) Additional – try the trick for using logistic regression to run a Poisson model using the
following data. Hint: you will need to use a grouped regression (as in lab 2).
Cases   Person-years   X
236     46068          1
279     45163          0
References
Greenland, S. (2007). Bayesian perspectives for epidemiological research. II. Regression analysis. Int J
Epidemiol, 36(1), 195-202.
Hosmer, D. W., Jr., & Lemeshow, S. (1989). Applied Logistic Regression. New York: John Wiley &
Sons.
Ibrahim, J. G., Chen, M. H., & Lipsitz, S. R. (1999). Monte Carlo EM for missing covariates in
parametric regression models. Biometrics, 55(2), 591-596.
Krall, J. M., Uthoff, V. A., & Harley, J. B. (1975). A step-up procedure for selecting variables associated
with survival. Biometrics, 31(1), 49-57.