Biost 536, Fall 2014 Homework #2 October 5, 2014, Page 1 of 5 Biost 536: Categorical Data Analysis in Epidemiology Emerson, Fall 2014 Homework #2 October 5, 2014 CODE 2467 Written problems: To be submitted as a MS-Word compatible file to the class Catalyst dropbox by 5:30 pm on Sunday, October 12, 2014. See the instructions for peer grading of the homework that are posted on the web pages. On this (as all homeworks) Stata / R code and unedited Stata / R output is TOTALLY unacceptable. Instead, prepare a table of statistics gleaned from the Stata output. The table should be appropriate for inclusion in a scientific report, with all statistics rounded to a reasonable number of significant digits. (I am interested in how statistics are used to answer the scientific question.) In all problems requesting “statistical analyses” (either descriptive or inferential), you should present both • Methods: A brief sentence or paragraph describing the statistical methods you used. This should be using wording suitable for a scientific journal, though it might be a little more detailed. A reader should be able to reproduce your analysis. DO NOT PROVIDE Stata OR R CODE. • Inference: A paragraph providing full statistical inference in answer to the question. Please see the supplementary document relating to “Reporting Associations” for details. Questions 1-5 refer to analyses of at the data in the file mri.txt that is located on the class webpages. In those questions we are interested in associations between 5 year mortality and prevalence of atherosclerotic cardiovascular disease (ASCVD) as defined by history of prior angina, myocardial infarction, transient ischemic attacks, or stroke. You will likely find it useful to create a new variable indicating ASCVD. This variable can be derived from the chd and stroke variables. For these questions, we will be considering adjustment for age and sex using both stratified and regression analyses. For the stratified analyses, it will be necessary to use an appropriate categorization of age. 1. We are interested in analyzing associations between 5 year mortality and prevalence of ASCVD at study enrollment using statistical methods appropriate for binary response variables. The observation time for death among these subjects is potentially subject to censoring. Provide a statistical analysis demonstrating that such methods as logistic regression can be used to answer this question. Among those censored, the minimum observation time was 1827days, or just over 5 years. Therefore, all patients who were observed for less than 5 years are known to have died, and this dichotomization is not affected by censoring. 2. Using the risk difference (RD) as a measure of association, provide statistical inference regarding an association between 5 year survival and baseline prevalence of ASCVD, adjusting for age and sex. Biost 536, Fall 2014 Homework #2 October 5, 2014, Page 2 of 5 a. Answer the question using a stratified analysis (e.g., using Stata command cs or an equivalent analysis in R). Methods: To explore an association of the risk difference between 5 year survival and baseline prevalence of ASCVD (as defined by stroke or cardiovascular heart disease) adjusting for age and sex we stratify by age and sex into six categories: Males ages 65-74, Males 75-84, Males 85-100, Females ages 65-74, Females 75-84, Females 85-100. The risk differences within each strata are combined using weights which standardize the estimate to the exposed group (i.e. the group that had ASCVD at baseline). A 95% exact confidence interval is presented and used to determine significance at the two-sided .05 level. Inference: This analysis estimates that prevalence of ASCVD at baseline is associated with a 0.1065 absolute increase in risk of mortality within five years of study enrollment if age and sex are held constant. This difference is statistical significant, and would not be unusual if the ASCVD was associated with an increase in risk between .0083 and .2045. b. Answer the question using an appropriate regression model. Methods: To explore an association of the risk difference between 5 year survival and baseline prevalence of ASCVD (as defined by stroke or cardiovascular heart disease) adjusting for age and sex we regress the binary variable 5 year survival on the binary variable presence of ASCVD at baseline while controlling for age and sex linearly in the model. Note that in this model age is not categorized as in part a. Wald inference was based on robust (Huber White sandwich) standard errors that allow for appropriate adjustment for the mean-variance relationship inherent in binary outcomes and using a normal approximation. The regression coefficient for ASCVD was tested at the 5% level for a risk difference different from 0. Inference: This analysis estimates that prevalence of ASCVD at baseline is associated with a 0.1035 absolute increase in risk of mortality within five years of study enrollment if age and sex are held constant. This difference is statistical significant, and would not be unusual if the ASCVD was associated with an increase in risk between .0039 and .2031 c. What is the difference in the statistical models you used? That is, how would you explain any differences between the two analysis approaches? Regression borrows information across groups, whereas stratified analysis estimate RD individually, and then combine them according to set weights, in this case standardized weights. Stratified analyses also allow for effect modification by strata, whereas the regression used here does not. Also, as mentioned earlier age was treated as a continuous variable in the regression but was categorized in the stratified analysis - this is a important strength of the regression methods. 3. Using the risk difference (OR) as a measure of association, provide statistical inference regarding an association between 5 year survival and baseline prevalence of ASCVD, adjusting for age and sex. Biost 536, Fall 2014 Homework #2 October 5, 2014, Page 3 of 5 a. Answer the question using a stratified analysis (e.g., using Stata command cc or an equivalent analysis in R). Methods: To explore an association using odds ratio between 5 year survival and baseline prevalence of ASCVD (as defined by stroke or cardiovascular heart disease) adjusting for age and sex we stratify by age and sex into six categories: Males ages 65-74, Males 75-84, Males 85-100, Females ages 65-74, Females 75-84, Females 85-100. The risk differences within each strata are combined using statistically efficient weights using the Mantel Haenszel statistic. A 95% exact confidence interval is presented and used to determine significance at the two-sided .05 level for the test that the odds ratio between patients with similar age and sex categories but differing in ASCVD status is not 1. Inference: This analysis estimates that prevalence of ASCVD at baseline is associated with a 1.94-fold increase in the odds ratio of mortality within five years of study enrollment if age and sex are held constant. This difference is statistically significant, and would not be unusual if the ASCVD was associated with an increase in odds ratio between 1.13 and 3.32-fold for the group with ASCVD within the same age and sex category. b. Answer the question using an appropriate regression model. Methods: To explore an association of the odds ratio between 5 year survival and baseline prevalence of ASCVD (as defined by stroke or cardiovascular heart disease) adjusting for age and sex we use a logistic regression of the binary variable 5 year survival on the binary variable presence of ASCVD at baseline while controlling for age and sex linearly in the model. Note that in this model age is not categorized as in part a. The regression coefficient for ASCVD was tested at the 5% level for a odds ratio different from 1, and used a normal approximation for the sampling distribution of the estimate. Inference: This analysis estimates that prevalence of ASCVD at baseline is associated with a 1.93-fold increase in the odds of mortality within five years of study enrollment if age and sex are held constant. This difference is statistically significant, and would not be unusual if the ASCVD was associated with an increase in odds ratio between 1.13 and 3.31-fold for the group with ASCVD with the same age and sex. c. What is the difference in the statistical models you used? That is, how would you explain any differences between the two analysis approaches? Regression borrows information across groups, whereas stratified analysis estimate OR individually, and then combine them according to set weights. Stratified analyses also allow for effect modification by strata, whereas the regression used here does not. Also, as mentioned earlier age was treated as a continuous variable in the regression but was categorized in the stratified analysis - this is a important strength of the regression methods. 4. Using the risk difference (RR) as a measure of association, provide statistical inference regarding an association between 5 year survival and baseline prevalence of ASCVD, adjusting for age and sex. a. Answer the question using a stratified analysis (e.g., using Stata command ir or an equivalent analysis in R). Methods: To explore an association of the risk ratio between 5 year survival and baseline prevalence of ASCVD (as defined by stroke or cardiovascular heart disease) adjusting Biost 536, Fall 2014 Homework #2 October 5, 2014, Page 4 of 5 for age and sex we use a poisson regression of the binary variable 5 year survival on the binary variable presence of ASCVD at baseline while controlling for age and sex linearly in the model. Note that in this model age is not categorized as in part a. The regression coefficient for ASCVD was tested at the 5% level for a risk ratio different from 1, and used a normal approximation for the sampling distribution of the estimate with Huber-white sandwich estimators. Inference: This analysis estimates that prevalence of ASCVD at baseline is associated with a 1.64-fold increase in the risk of mortality within five years of study enrollment if age and sex are held constant. This difference is statistically significant, and would not be unusual if the ASCVD was associated with an increase in odds ratio between 1.12 and 2.42-fold for the group with ASCVD, with the same age and sex. b. Answer the question using an appropriate regression model. Methods: To explore an association of the risk ratio between 5 year survival and baseline prevalence of ASCVD (as defined by stroke or cardiovascular heart disease) adjusting for age and sex we use a logistic regression of the binary variable 5 year survival on the binary variable presence of ASCVD at baseline while controlling for age and sex linearly in the model. Note that in this model age is not categorized as in part a. The regression coefficient for ASCVD was tested at the 5% level for a risk ratio different from 1, and used a normal approximation for the sampling distribution of the estimate as well as Huber White sandwich estimators. Inference: This analysis estimates that prevalence of ASCVD at baseline is associated with a 1.66-fold increase in the risk of mortality within five years of study enrollment if age and sex are held constant. This difference is statistically significant, and would not be unusual if the ASCVD was associated with an increase in odds ratio between 1.05 and 2.61-fold for the group with ASCVD, with the same age and sex. c. What is the difference in the statistical models you used? That is, how would you explain any differences between the two analysis approaches? Regression borrows information across groups, whereas stratified analysis estimate RR individually, and then combine them according to set weights. Stratified analyses also allow for effect modification by strata, whereas the regression used here does not. Also, as mentioned earlier age was treated as a continuous variable in the regression but was categorized in the stratified analysis - this is a important strength of the regression methods. 5. Comment very briefly on the similarity or differences among the three approaches. Which would you tend to prefer in general? Why? These three approaches explore an association between ASCVD and mortality within 5 years within the same age and sex values using three different scales. I would be inclined to use risk difference for the public health interpretation as it aids in thinking about disease burden. However, the odds ratio tends to be less affected by effect modification, so that could be used as well. Biost 536, Fall 2014 Homework #2 October 5, 2014, Page 5 of 5 Question 6 pertains to the analysis of colorectal cancer incidence for whites living in the U.S. as a function of birthplace (U.S. born vs foreign born) (see datafile surveillance.txt and documentation surveillance.doc on the class web pages). 6. Using the incidence ratio as a measure of association, provide inference for an association between incidence of colorectal cancer and birthplace, after adjustment for age, sex, and SEER. a. Answer the question using directly standardized rates, with standardization to the U.S. population. Methods: Incidence rate ratio of colorectal cancer between US and foreign born cases was tested to understand a possible association between colorectal cancer and birthplace. The sample was stratified according to sex, age, and SEER site. Strata were combined so as to standardize to the US population. Wald-based 95% CI were used to define statistical significance at a .05 level, and used a normal approximation. Inference: The analysis estimated an incidence rate 1.6% higher (relatively) in foreign born than US born people of the same age, sex, and SEER site. A 95% CI of 2.14% lower to 4.6% higher does not give evidence that the incidence rate differs by birthplace. b. Answer the question using an appropriate regression model. Methods: A poisson regression was used to compare test the association between colorectal cancer and birthplace (US or foreign). The model also accounted linearly for age and sex, as well as dummy variables for SEER site. Wald-based 95% CI were used to define statistical significance at a .05 level, and used a normal approximation. Inference:The analysis estimated an incidence rate 4.7% lower (relatively) in foreign born than US born people of the same age, sex, and SEER site. A 95% CI of 14.0% lower to 5.6% higher does not give evidence that the incidence rate differs by birthplace. c. What is the difference in the statistical models you used? That is, how would you explain any differences between the two analysis approaches? As in the other questions, regression borrows information across groups, whereas stratified analysis estimate RR individually, and then combine them according to set weights. Stratified analyses also allow for effect modification by strata, whereas the regression used here does not. Also, the weights used in part a correspond to US population, while part b does not use such information.