HW ID#: 2037
Biost 536: Categorical Data Analysis in Epidemiology
Emerson, Fall 2014
Homework #2
October 5, 2014

For these questions, we will be considering adjustment for age and sex using both stratified and regression analyses. For the stratified analyses, it will be necessary to use an appropriate categorization of age.

1. We are interested in analyzing associations between 5 year mortality and prevalence of ASCVD at study enrollment using statistical methods appropriate for binary response variables. The observation time for death among these subjects is potentially subject to censoring. Provide a statistical analysis demonstrating that such methods as logistic regression can be used to answer this question.

Methods: In this study, the observation time (variable “obstime”) was measured as the number of days from the date of first MRI (variable “mridate”) until death or September 16, 1997 (the end date of the study). There is also a binary (yes/no) variable for death during follow-up (variable “death”). To verify that no participant is missing observation time, we created a date variable holding the study end date for all subjects (“enddate”). We calculated the number of days between first MRI and the end date of the study (“datediff”), then created a new binary variable (“censored”) indicating whether datediff = obstime for each study subject. We then compared “censored” with “death” for all study subjects.

Inference: In order to use a method like logistic regression to assess potential associations between 5-year mortality and prevalence of ASCVD at study enrollment, observation time must be complete for all study subjects (i.e., there is no censoring in the data).
From the above analysis, we found that every study subject either 1) had a number of days between MRI and the end of the study equal to the number of observation days, or 2) died during follow-up, demonstrating that follow-up time is complete for all subjects in this study.

2. Using the risk difference (RD) as a measure of association, provide statistical inference regarding an association between 5 year survival and baseline prevalence of ASCVD, adjusting for age and sex.

a. Answer the question using a stratified analysis (e.g., using Stata command cs or an equivalent analysis in R).

Methods: To estimate a risk difference for the association between 5-year survival and baseline prevalence of ASCVD (adjusted for age and sex), we first created a binary categorical variable for age group. Before the analysis was performed, we chose to divide age into two groups: 65-77 years and 78+ years. We then performed an analysis stratified on age group and sex and estimated a standardized risk difference, using the ASCVD+ group as the standard distribution. We also calculated chi-squared p-values and 95% confidence intervals.

Inference: Using stratification for age and sex, we estimated a 19% absolute increase in risk of five-year death among subjects who were ASCVD positive at baseline (95% CI 12%-26%). The two-sided p-value is <.001, so we can reject the null hypothesis that there is no difference in risk of five-year death.

b. Answer the question using an appropriate regression model.

Methods: To estimate age- and sex-adjusted risk differences for baseline ASCVD and 5-year mortality, we used linear regression with robust standard errors. The binary outcome variable was 5-year mortality, the predictor of interest was prevalence of ASCVD history at baseline as defined in question 1, and sex (binary) and age (continuous) were included as adjustment covariates.
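
As a rough illustration of the regression described in 2b, the following Python sketch fits a linear model to a binary outcome by ordinary least squares and computes heteroskedasticity-robust (HC0 sandwich) standard errors, so that the coefficient on the exposure is an adjusted risk difference. The function names and the eight-row toy data set are hypothetical and are not the study data. The sketch applies no small-sample correction, so its standard errors can differ slightly from those of statistical packages that apply one.

```python
import math

def invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    # Augment the matrix with the identity.
    m = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        pv = m[col][col]
        m[col] = [v / pv for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [row[n:] for row in m]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def linear_risk_difference(X, y):
    """OLS coefficients and HC0 robust standard errors.

    X: rows of covariates, with a leading 1 for the intercept.
    y: 0/1 outcomes (here, death within five years).
    """
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    bread = invert(xtx)
    beta = [sum(bread[i][j] * xty[j] for j in range(p)) for i in range(p)]
    resid = [yi - sum(b * x for b, x in zip(beta, r)) for r, yi in zip(X, y)]
    # "Meat" of the sandwich: sum of squared residual times outer product of x.
    meat = [[sum(e * e * r[i] * r[j] for r, e in zip(X, resid))
             for j in range(p)] for i in range(p)]
    cov = matmul(matmul(bread, meat), bread)
    se = [math.sqrt(cov[i][i]) for i in range(p)]
    return beta, se

# Hypothetical toy data: (intercept, ascvd, male, age) and 5-year death indicator.
X = [(1, 1, 1, 80), (1, 1, 0, 72), (1, 1, 1, 68), (1, 1, 0, 85),
     (1, 0, 1, 70), (1, 0, 0, 66), (1, 0, 1, 83), (1, 0, 0, 75)]
y = [1, 1, 0, 1, 0, 0, 1, 0]

beta, se = linear_risk_difference(X, y)
rd, rd_se = beta[1], se[1]
print("adjusted RD: %.3f (95%% CI %.3f to %.3f)"
      % (rd, rd - 1.96 * rd_se, rd + 1.96 * rd_se))
```

With only an intercept and the exposure in the model, the fitted coefficient reduces to the crude difference in proportions, which is a convenient sanity check on the implementation.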
Inference: Using the above linear regression model adjusting for sex and age, we estimated a 19% absolute increase in risk of five-year death among subjects who were ASCVD positive at baseline (95% CI 12%-26%). The two-sided p-value is <.001, so we can reject the null hypothesis that there is no difference in risk of five-year death.

c. What is the difference in the statistical models you used? That is, how would you explain any differences between the two analysis approaches?

The stratified analysis used a binary age group variable to adjust for age, while linear regression allowed age to be included as a continuous adjustment variable. Additionally, by creating strata by age and sex and then standardizing using a weighted average, the stratified analysis adjusts for potential interaction between the stratification variables. In the linear regression model we can specify whether or not an interaction term between adjustment variables is included.

3. Using the odds ratio (OR) as a measure of association, provide statistical inference regarding an association between 5 year survival and baseline prevalence of ASCVD, adjusting for age and sex.

a. Answer the question using a stratified analysis (e.g., using Stata command cc or an equivalent analysis in R).

Methods: To calculate age- and sex-stratified odds ratios, we used the categorical age variable from question 2a and created a combined “agesex” categorical variable defining four strata across all values of age group and sex. We then used the Mantel-Haenszel method to estimate an odds ratio controlling for “agesex.” We also calculated chi-squared p-values and 95% confidence intervals.

Inference: Presence of ASCVD at baseline is estimated to be associated with 3.39-fold higher odds of death within five years, using Mantel-Haenszel adjustment for age group and sex (95% CI 2.23-fold to 5.13-fold higher odds).
With a p-value <.0001, we reject the null hypothesis that there is no difference in odds of five-year death between the ASCVD positive and negative groups.

b. Answer the question using an appropriate regression model.

Methods: To estimate adjusted odds ratios, we used a logistic regression model. A 95% confidence interval and p-value were computed assuming an approximate normal distribution for the regression parameter estimates. The binary outcome variable was 5-year mortality, the predictor of interest was prevalence of ASCVD history at baseline as defined in question 1, and sex (binary) and age (continuous) were included as adjustment covariates.

Inference: Based on a logistic regression model adjusting for patient sex and age, presence of ASCVD at baseline is estimated to be associated with 3.57-fold higher odds of death within five years (95% CI 2.36-fold to 5.59-fold higher odds). With a two-sided p-value <.0001, we reject the null hypothesis that there is no difference in odds of five-year death between the ASCVD positive and negative groups.

c. What is the difference in the statistical models you used? That is, how would you explain any differences between the two analysis approaches?

The stratified analysis used a binary age group variable to adjust for age, while logistic regression allowed age to be included as a continuous adjustment variable. In the regression model we can specify whether or not an interaction term between adjustment variables is included.

4. Using the risk ratio (RR) as a measure of association, provide statistical inference regarding an association between 5 year survival and baseline prevalence of ASCVD, adjusting for age and sex.

a. Answer the question using a stratified analysis (e.g., using Stata command ir or an equivalent analysis in R).
Methods: To calculate age- and sex-stratified risk ratios, we used the categorical age variable from question 2a and the “agesex” stratification variable from question 3a. We then performed an analysis stratified on age group and sex and estimated a standardized risk ratio, using the ASCVD+ group as the standard distribution. We also calculated chi-squared p-values and 95% confidence intervals.

Inference: From the standardization method described above, we found that, adjusting for age and sex, ASCVD history is estimated to be associated with 2.54-fold higher risk of five-year death compared with no ASCVD (95% CI 1.85-fold to 3.51-fold higher). Based on a two-sided p-value <.0001, we can reject the null hypothesis of a risk ratio of 1 for this comparison.

b. Answer the question using an appropriate regression model.

Methods: To estimate adjusted risk ratios, we performed a Poisson regression. A 95% confidence interval and p-value were computed using robust standard errors. The binary outcome variable was 5-year mortality, the predictor of interest was prevalence of ASCVD history at baseline as defined in question 1, and sex (binary) and age (continuous) were included as adjustment covariates.

Inference: From the above Poisson regression model adjusting for age and sex, ASCVD history is estimated to be associated with 2.72-fold higher risk of five-year death compared with no ASCVD (robust 95% CI 1.96-fold to 3.76-fold higher). Based on a two-sided p-value <.0001, we can reject the null hypothesis of a risk ratio of 1 for this comparison.

c. What is the difference in the statistical models you used? That is, how would you explain any differences between the two analysis approaches?

The stratified analysis used a binary age group variable to adjust for age, while Poisson regression allowed age to be included as a continuous adjustment variable.
In the regression model we can specify whether or not an interaction term between adjustment variables is included.

5. Comment very briefly on the similarity or differences among the three approaches. Which would you tend to prefer in general? Why?

The three approaches all estimate associations between the exposure and outcome, but differ in their scale. In this instance, we would prefer a ratio measure such as the odds ratio or risk ratio because we believe it is easier to interpret and communicate. Between the OR and RR, we would choose the RR because the outcome is common enough here that the OR is not a good approximation of the RR, which can cause confusion when interpreting results.

Question 6 pertains to the analysis of colorectal cancer incidence for whites living in the U.S. as a function of birthplace (U.S. born vs foreign born) (see datafile surveillance.txt and documentation surveillance.doc on the class web pages).

6. Using the incidence ratio as a measure of association, provide inference for an association between incidence of colorectal cancer and birthplace, after adjustment for age, sex, and SEER.

a. Answer the question using directly standardized rates, with standardization to the U.S. population.

Methods: For this analysis, we created a combined SEER site, sex, and age group variable on which to stratify and apply standard weights. Because of the small number of observations in the younger age groups, we restricted the analysis to those over 45 years old (162 US-born, 162 foreign-born). We standardized the estimates to the US white population as a whole and calculated exact 95% confidence intervals.

Inference: From the analysis described above, adjusting for sex, age group, and SEER site, we found that being born outside the US is estimated to be associated with a colorectal cancer incidence 1.01 times that of the US-born (95% CI 0.978 to 1.04).
Because the confidence interval includes 1, we cannot reject the null hypothesis of an incidence ratio of 1 for this comparison.

b. Answer the question using an appropriate regression model.

Methods: To estimate adjusted incidence rate ratios, we performed a Poisson regression. A 95% confidence interval and p-value were computed using robust standard errors. The binary outcome variable was incidence of colorectal cancer, the predictor of interest was place of birth (binary, US vs. non-US), and sex (binary) and age group (categorical, 5-year age ranges >45 years) were included as adjustment covariates.

Inference: From the above Poisson regression model adjusting for age group, sex, and SEER site, foreign birth is estimated to be associated with a colorectal cancer incidence 1.05 times that of US birth (robust 95% CI 0.994 to 1.10). Based on a two-sided p-value of 0.08, we fail to reject the null hypothesis of an incidence rate ratio of 1 for this comparison.

c. What is the difference in the statistical models you used? That is, how would you explain any differences between the two analysis approaches?

Because we modeled age as categorical in the Poisson regression model, the estimates from the two approaches are very similar.
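
The direct standardization described in 6a can be sketched in Python as follows. This is a minimal illustration, not the analysis that produced the results above: the function names, the assumption that each stratum supplies a case count and person-years, and the two-stratum toy numbers are all hypothetical. The interval uses a normal approximation on the log scale with Poisson variances, whereas the analysis in 6a reports exact intervals.

```python
import math

def standardized_rate(cases, person_years, std_pop):
    """Directly standardized rate and its Poisson-based variance.

    cases, person_years: per-stratum counts for one comparison group.
    std_pop: per-stratum size of the standard population (same stratum order).
    """
    total = float(sum(std_pop))
    weights = [w / total for w in std_pop]
    # Weighted average of stratum-specific rates.
    rate = sum(w * c / py for w, c, py in zip(weights, cases, person_years))
    # Var(c) is approximated by c under a Poisson model for each stratum.
    var = sum(w * w * c / (py * py)
              for w, c, py in zip(weights, cases, person_years))
    return rate, var

def standardized_rate_ratio(group1, group0, std_pop, z=1.96):
    """Ratio of directly standardized rates with an approximate CI."""
    r1, v1 = standardized_rate(group1["cases"], group1["py"], std_pop)
    r0, v0 = standardized_rate(group0["cases"], group0["py"], std_pop)
    ratio = r1 / r0
    se_log = math.sqrt(v1 / r1 ** 2 + v0 / r0 ** 2)  # delta method on log scale
    return ratio, ratio * math.exp(-z * se_log), ratio * math.exp(z * se_log)

# Hypothetical toy data: two strata (for example, two age groups).
foreign_born = {"cases": [120, 310], "py": [50000.0, 40000.0]}
us_born = {"cases": [100, 300], "py": [45000.0, 41000.0]}
standard = [600000, 400000]  # hypothetical standard population sizes

ratio, lower, upper = standardized_rate_ratio(foreign_born, us_born, standard)
print("standardized incidence ratio: %.2f (95%% CI %.2f to %.2f)"
      % (ratio, lower, upper))
```

Because both groups are weighted by the same standard population, the ratio compares them as if they shared one age, sex, and site distribution, which is the same adjustment the categorical covariates provide in the Poisson regression.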