Bayesian and Penalised Regression Methods for Epidemiological Analysis

Lab 3. Offset priors for conditional, Cox and Poisson regression

1. Conditional logistic regression

For conditional logistic regression, the prior data table consists of two pairs of records: one containing an exposed case and an unexposed non-case; the other containing an unexposed case and an exposed non-case. Each record contains the following (Greenland, 2007):

i. The prior count A, which is used either as a weight to increase the contribution of the prior record to the data or as a repetition count to multiply the number of records of the given form (which may require a whole-number A). A is calculated from the prior as before, i.e. $A = (2/v_{\text{prior}})S^2$, where $v_{\text{prior}}$ is the prior variance and S is the scaling factor used to improve normality of the prior.
ii. M, the total number of subjects in the record. As before, this equals 2A for a symmetric prior, and the outcome is modelled using the events/trials syntax.
iii. There is no intercept in conditional logistic regression, so no command is needed to remove it.
iv. The variable that the prior represents, $X_j$, is set to $X_j = 1/S$ for the exposed case and exposed non-case records, and to 0 for the unexposed records. All other regressors ($X_{\text{not } j}$) are set to 0 in all 4 records for the $X_j$ prior.
v. If there is a nonzero prior median m, an offset H is included, calculated as $H = -m/S$ for exposed pair members (those with X = 1/S) and 0 for unexposed pair members (those with X = 0).
vi. A matched set identifier, MSID, which indicates the pair and is unique to each matched set. In the actual data this identifies the matched sets. The data are stratified by this variable using the <strata> statement.

a) Construct the prior data table for a normal(ln(2),0.5) prior following this method (use a scaling factor of 10). Check that the prior is entered correctly by running the same events/trials syntax as before, this time also specifying the strata statement to identify the matched sets.
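The arithmetic behind steps i to vi can be checked with a short script before the table is typed into the analysis software. The sketch below is only an illustration in Python, not part of the lab's own workflow: the function name `prior_pair_records`, the dictionary layout, the `case` indicator and the starting identifier 9001 are all assumptions made for this example, and how A and M enter the events/trials syntax depends on the software used.

```python
import math


def prior_pair_records(prior_median, prior_variance, scale, first_msid):
    """Build the four prior records (two matched pairs) for one coefficient.

    Hypothetical helper: the field names mirror the symbols in the text
    (A, M, X, H, MSID); the 'case' flag is an assumption for illustration.
    """
    a = 2.0 * scale ** 2 / prior_variance  # prior count A = (2 / v_prior) * S^2
    m_total = 2.0 * a                      # M = 2A for a symmetric prior
    x_exposed = 1.0 / scale                # X = 1/S for exposed pair members
    h_exposed = -prior_median / scale      # offset H = -m/S for exposed members

    # (MSID, case indicator, exposed?) for each of the four records
    layout = [
        (first_msid, 1, True),       # pair 1: exposed case
        (first_msid, 0, False),      # pair 1: unexposed non-case
        (first_msid + 1, 1, False),  # pair 2: unexposed case
        (first_msid + 1, 0, True),   # pair 2: exposed non-case
    ]
    return [
        {
            "MSID": msid,
            "case": case,
            "X": x_exposed if exposed else 0.0,
            "H": h_exposed if exposed else 0.0,
            "A": a,
            "M": m_total,
        }
        for msid, case, exposed in layout
    ]


# normal(ln(2), 0.5) prior with scaling factor S = 10
records = prior_pair_records(math.log(2), 0.5, 10, first_msid=9001)
for rec in records:
    print(rec)
```

For this prior the script gives A = 400, M = 800, X = 0.1 and H of about -0.0693 for the exposed pair members. All other regressors would be added as columns of zeros.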
b) Examine your table, looking for any redundancies in the rows. Can you see how the prior could be simplified?

c) Open the LBW dataset (Hosmer & Lemeshow, 1989). These data are available on the SAS website as an example of how to conduct conditional logistic regression: http://support.sas.com/documentation/cdl/en/statug/63962/HTML/default/viewer.htm#statug_phreg_sect050.htm. The study included 189 women, of whom 59 had low-birth-weight babies and 130 had normal-weight babies. The risk factors under investigation are maternal weight at the last menstrual period (LWT)*, presence of hypertension (HT), smoking status during pregnancy (Smoke), and presence of uterine irritability (UI). For HT, Smoke, and UI, a value of 1 indicates "yes" and a value of 0 indicates "no". The woman's age (Age) is used as the matching variable. The data set LBW contains a subset of the data corresponding to women between the ages of 16 and 32. Weight is currently in pounds; a difference of one pound is clinically negligible, so rescale weight to 10-kilogram (22-pound) units by dividing LWT by 22.0462. Finally, re-centre LWT by subtracting 5 so that 50 kg becomes 0.

*We note that maternal weight is a poor choice for a study variable, since "normal" values vary with height; BMI and height (together) would have been more relevant clinically, but height is not available in the example data set.

d) Obtain the maximum likelihood estimates (and 95% Wald and profile likelihood limits) using conditional logistic regression. The conditional ML estimates are below:

Exposure      Odds ratio        95% Wald             95% profile
              point estimate    confidence limits    likelihood limits
LWT, 10 kg    0.717             0.527, 0.975         0.516, 0.959
Smoke         2.244             1.091, 4.615         1.099, 4.686
HT            5.763             1.353, 24.544        1.399, 26.863
UI            2.419             0.944, 6.202         0.94, 6.272

e) Create a prior data table (S=10) for these data using the steps listed above for collapsed prior pairs.
You can select priors based on your own knowledge or, for ease, we suggest normal(ln(2),0.5) for all variables except Smoke, for which we suggest normal(ln(4),0.5). N.B. Your MSID will need to identify a unique risk set, so ensure it does not overlap with an existing MSID.

f) Augment the data with the prior and obtain posterior estimates. If using the suggested priors, your results should be similar to these:

Exposure      Odds ratio        95% Wald             95% profile
              point estimate    confidence limits    likelihood limits
LWT, 10 kg    0.772             0.582, 1.025         0.571, 1.011
Smoke         2.494             1.322, 4.704         1.329, 4.750
HT            3.281             1.229, 8.761         1.226, 8.836
UI            2.258             1.04, 4.903          1.037, 4.919

g) Compare your results with those you would obtain using the BAYES statement in PROC PHREG.

2. Cox proportional hazards regression

Data augmentation for Cox proportional hazards regression parallels that for conditional logistic regression. There are 2 pairs of records: one containing an exposed failure and an unexposed survivor; the other containing an unexposed failure and an exposed survivor. Each record contains:

i. The prior count A, used as a weight to increase the contribution of the prior record to the data and calculated as before, i.e. $A = (2/v_{\text{prior}})S^2$.
ii. Failure time, FT, which can be 1 for prior records and is the real failure-or-censoring time for the actual data.
iii. Intercepts are not estimated in Cox regression, so no command is needed to remove them.
iv. The variable that the prior represents, $X_j$, is set to $X_j = 1/S$ for the exposed failure and exposed survivor records, and to 0 for the unexposed records. As for conditional logistic regression, all other regressors ($X_{\text{not } j}$) are set to 0 in all 4 records for the $X_j$ prior.
v. If there is a nonzero prior median m, an offset H is included, calculated as $H = -m/S$ for exposed pair members (those with X = 1/S) and 0 for unexposed pair members (those with X = 0), as for conditional logistic regression.
vi. A matched risk set identifier, MRSID, which indicates the pair and is unique to each matched set.
This is a unique value for each risk set (which is each observation in the actual data). The data are stratified by this variable using the <strata> statement.

a) Use this method to complete the table below for a normal(ln(2),0.5) prior with a scaling factor S of 10.

b) Open the myeloma dataset. These data are from a study of multiple myeloma among 65 patients treated with alkylating agents (Krall, Uthoff, & Harley, 1975) and are available on the SAS website: http://support.sas.com/documentation/cdl/en/statug/63962/HTML/default/viewer.htm#statug_genmod_sect070.htm. The primary endpoint was survival time in months and there were 9 explanatory variables. Obtain hazard ratios for these data using PHREG.

c) Create a normal(ln(2),0.5) prior for two of the regressors, log blood urea nitrogen (BUN) and serum calcium at baseline. (Note: these priors were chosen arbitrarily to illustrate the method; any serious analysis should choose its priors more carefully.)

d) Augment the data with the prior and obtain posterior estimates.

e) Compare your results with those you would obtain using the BAYES statement in PHREG.

3. Poisson regression

For Poisson regression, one can multiply all the person-time by a very large number N that is at least 100,000 times the size of the largest outcome count among the actual-data records, then use grouped-data logistic regression, treating the person-time as the total number in the record. If there is no person-time variable (as in the example below), one instead adds a total-count variable N to each actual-data record. Either way, this trick changes only the intercept, which is shifted by $-\ln(N)$. For normal priors, however, provided S is large enough (at least 30), one can use a Poisson-regression program with a single record added for each coefficient prior, as follows (Greenland, 2007):

i. For actual records A is the observed count, while for prior records A is the prior count after rescaling by a factor of $S^2$, i.e. $A = S^2/v_{\text{prior}}$.
For adequate normality we recommend at least S=30, because the prior imposed by Poisson records is too skewed for smaller S.
ii. If person-time is being entered as part of the actual data, no offset is needed; instead the person-time for the prior data is set to $e^H = A/\exp(m/S)$, which is just the prior count A when the prior mean is 0.
iii. If no person-time is being entered and there are nonzero prior means, an offset is needed: H = 0 for all actual-data records and $H = \ln(A) - m/S$ for all prior records.
iv. The variable X to which the prior applies is set to X = 1/S; all other regressors are set to 0. No change is made to the actual records.
v. As with logistic regression, it is necessary to replace the intercept with a variable, Const, which is set to 1 for all actual records and for the prior record for the intercept (if included), and to 0 for all other prior records. Const is entered as a regressor in the model, and the programme is instructed not to fit its own intercept by using the noint option.

a) Using this method, construct the prior data table for a normal(ln(4), 0.125) prior, which produces 95% prior limits for the RR of 2 and 8, with a scaling factor of 30.

b) Open the liver dataset. These data are provided on the SAS website as an example of how to do a Bayesian Poisson regression on count data with an informative prior, using a dataset from Ibrahim, Chen, & Lipsitz (1999): http://support.sas.com/documentation/cdl/en/statug/63962/HTML/default/viewer.htm#statug_genmod_sect070.htm. The outcome modelled is the number of cancerous liver nodes when a patient enters a clinical trial (Nodes). The number of nodes is modelled by a Poisson regression model using PROC GENMOD with six baseline characteristics as explanatory variables. Rescale BMI so that a 1-unit increase in the rescaled variable represents a 10-unit increase in BMI. Obtain maximum likelihood estimates for these data.

c) Use the prior created in a) to place a prior on BMI. Augment the data with the prior and obtain posterior estimates. N.B.
This prior is different from that used in the example on the SAS website.

d) Compare your results with those you would obtain using the BAYES statement in GENMOD.

e) Additional: try the trick for using logistic regression to run a Poisson model using the following data. Hint: you will need to use a grouped regression (as in lab 2).

Cases    Person-years    X
236      46068           1
279      45163           0

References

Greenland, S. (2007). Bayesian perspectives for epidemiological research. II. Regression analysis. Int J Epidemiol, 36(1), 195-202.
Hosmer, D. W., Jr., & Lemeshow, S. (1989). Applied Logistic Regression. New York: John Wiley & Sons.
Ibrahim, J. G., Chen, M. H., & Lipsitz, S. R. (1999). Monte Carlo EM for missing covariates in parametric regression models. Biometrics, 55(2), 591-596.
Krall, J. M., Uthoff, V. A., & Harley, J. B. (1975). A step-up procedure for selecting variables associated with survival. Biometrics, 31(1), 49-57.