HW ID#: 2037
Biost 536: Categorical Data Analysis in Epidemiology
Emerson, Fall 2014
Homework #2
October 5, 2014
For these questions, we will be considering adjustment for age and sex using both stratified and
regression analyses. For the stratified analyses, it will be necessary to use an appropriate categorization of
age.
1. We are interested in analyzing associations between 5 year mortality and prevalence of
ASCVD at study enrollment using statistical methods appropriate for binary response
variables. The observation time for death among these subjects is potentially subject to
censoring. Provide a statistical analysis demonstrating that such methods as logistic
regression can be used to answer this question.
Methods: In this study, the observation time (variable “obstime”) was measured as the
number of days from the date of first MRI (variable “mridate”) until death or September
16, 1997 (the end date of the study). There is also a binary (yes/no) variable for death
during followup (variable “death”). In order to validate that there is no missing
observation time for any participants, we created a date variable with the study end date
value for all subjects (“enddate”). We calculated the number of days between first MRI
and the end date of the study (“datediff”), then created a new binary variable (“censored”)
that showed whether or not datediff=obstime for each study subject. We then compared
“censored” with “death” for all study subjects.
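The check described above can be sketched as follows. This is a minimal illustration, not the code used for the homework: the three records are hypothetical, and the function name followup_complete is an assumption. Only the study end date (September 16, 1997) and the variable names come from the Methods text.

```python
from datetime import date

STUDY_END = date(1997, 9, 16)  # study end date stated in the Methods

# Hypothetical records: (mridate, obstime in days, death indicator)
subjects = [
    (date(1992, 3, 10), (STUDY_END - date(1992, 3, 10)).days, 0),
    (date(1992, 7, 1), 800, 1),
    (date(1993, 1, 15), (STUDY_END - date(1993, 1, 15)).days, 0),
]

def followup_complete(mridate, obstime, death, enddate=STUDY_END):
    """True if the subject either died or was followed to the study end date."""
    datediff = (enddate - mridate).days   # days from first MRI to study end
    censored = int(obstime == datediff)   # 1 if observed until the study end
    return censored == 1 or death == 1

# Follow-up is complete only if every subject passes the check.
all_complete = all(followup_complete(*s) for s in subjects)
print(all_complete)
```

A subject who neither died nor reached the study end date (for example, obstime of 100 days with death = 0) would fail the check and would indicate loss to follow-up.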
Inference: In order to use a method like logistic regression to assess potential associations
between 5-year mortality and prevalence of ASCVD at study enrollment, observation time
must be complete for all study subjects (i.e. there is no censoring in the data). From the
above analysis, we found that all study subjects either 1) had the number of days between
their MRI and the end of the study correspond to the number of observation days, or 2)
died during follow up, demonstrating that follow up time is complete for all subjects in this
study.
2. Using the risk difference (RD) as a measure of association, provide statistical inference
regarding an association between 5 year survival and baseline prevalence of ASCVD,
adjusting for age and sex.
a. Answer the question using a stratified analysis (e.g., using Stata command cs or
an equivalent analysis in R).
Methods: In order to estimate a risk difference for an association between 5-year survival
and baseline prevalence of ASCVD (adjusted for age and sex), we first created a binary
categorical variable for age group. Before the analysis was performed we chose to divide
age into two groups: 65-77 years, and 78+ years old. We then performed a stratified
analysis on age and sex and estimated a standardized risk difference, using the ASCVD+
group as the distribution according to which we standardized. We also calculated chi-squared p-values and confidence intervals at the 95% confidence level.
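As a sketch of the standardization step, the stratum-specific risk differences can be averaged with weights proportional to the number of ASCVD+ subjects in each stratum. The counts below are hypothetical (they are not the study data), and the function name standardized_rd is an assumption made for illustration.

```python
# Hypothetical counts per age-sex stratum:
# (deaths among ASCVD+, n ASCVD+, deaths among ASCVD-, n ASCVD-)
strata = {
    "F 65-77": (10, 50, 20, 200),
    "F 78+":   (12, 30, 15, 60),
    "M 65-77": (15, 60, 25, 180),
    "M 78+":   (16, 40, 14, 50),
}

def standardized_rd(strata):
    """Risk difference standardized to the stratum distribution of the exposed group."""
    total_exposed = sum(n1 for _, n1, _, _ in strata.values())
    rd = 0.0
    for d1, n1, d0, n0 in strata.values():
        weight = n1 / total_exposed          # share of ASCVD+ subjects in this stratum
        rd += weight * (d1 / n1 - d0 / n0)   # weighted stratum-specific risk difference
    return rd

print(round(standardized_rd(strata), 4))
```

Because the weights come from the ASCVD+ group, the result equals the crude risk in the ASCVD+ group minus the risk that group would be expected to have under the stratum-specific risks of the ASCVD- group.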
Inference: Using stratification for age and sex, we estimated a 19% absolute increase in
risk of five-year death among subjects ASCVD positive at baseline (95%CI 12%-26%).
The two-sided p-value is <.001, so we can reject the null hypothesis that there is no
difference in risk of five year death.
b. Answer the question using an appropriate regression model.
Methods: In order to estimate age and sex adjusted risk differences for baseline ASCVD
and 5-year mortality, we used a linear regression with robust standard errors. The binary
outcome variable was 5-year mortality, the predictor of interest was prevalence of ASCVD
history at baseline as defined in question 1, and sex (binary) and age (continuous) variables
were included as adjustment covariates.
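To illustrate why a linear regression with robust standard errors yields a risk difference, the sketch below fits ordinary least squares to a binary outcome with a single binary exposure and computes the HC0 (sandwich) standard error by hand. It deliberately omits the sex and age covariates to stay short, the data are hypothetical, and the function name ols_robust_slope is an assumption. A statistical package would normally be used for the adjusted model.

```python
import math

def ols_robust_slope(x, y):
    """OLS of y on an intercept and one predictor x; returns slope and its HC0 robust SE."""
    n = len(x)
    sx, sxx = sum(x), sum(v * v for v in x)
    sy, sxy = sum(y), sum(a * b for a, b in zip(x, y))
    det = n * sxx - sx * sx
    # (X'X)^-1 for the design matrix X = [1, x]
    inv = [[sxx / det, -sx / det], [-sx / det, n / det]]
    b0 = inv[0][0] * sy + inv[0][1] * sxy
    b1 = inv[1][0] * sy + inv[1][1] * sxy
    # "Meat" of the sandwich: sum of squared residuals times x_i x_i'
    m00 = m01 = m11 = 0.0
    for xi, yi in zip(x, y):
        e2 = (yi - b0 - b1 * xi) ** 2
        m00 += e2
        m01 += e2 * xi
        m11 += e2 * xi * xi
    # Variance of the slope: second row of (X'X)^-1, times meat, times the same row
    r0, r1 = inv[1]
    var_b1 = r0 * (r0 * m00 + r1 * m01) + r1 * (r0 * m01 + r1 * m11)
    return b1, math.sqrt(var_b1)

# Hypothetical data: 16 deaths among 40 exposed, 20 deaths among 100 unexposed
x = [1] * 40 + [0] * 100
y = [1] * 16 + [0] * 24 + [1] * 20 + [0] * 80
rd, se = ols_robust_slope(x, y)
print(rd, se)
```

In this unadjusted case the slope equals the difference in observed risks, and the robust variance equals p1(1-p1)/n1 + p0(1-p0)/n0, which allows the two exposure groups to have different outcome variances.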
Inference: Using the above linear regression model adjusting for sex and age, we
estimated a 19% absolute increase in risk of five-year death among subjects ASCVD
positive at baseline (95%CI 12%-26%). The two-sided p-value is <.001, so we can reject
the null hypothesis that there is no difference in risk of five year death.
c. What is the difference in the statistical models you used? That is, how would you
explain any differences between the two analysis approaches?
The stratified analysis used a binary age group variable to adjust for age, while linear
regression allowed for age to be included as a continuous adjustment variable.
Additionally, by creating strata by age and sex and then standardizing using a weighted
average, the stratified analysis adjusts for potential interaction between the stratification
variables. In the linear regression model we can specify whether or not an interaction term
between adjustment variables is included.
3. Using the odds ratio (OR) as a measure of association, provide statistical inference
regarding an association between 5 year survival and baseline prevalence of ASCVD,
adjusting for age and sex.
a. Answer the question using a stratified analysis (e.g., using Stata command cc or
an equivalent analysis in R).
Method: In order to calculate age and sex stratified odds ratios, we used the categorical age
variable from question 2a and created a combined “agesex” categorical variable to create
four strata across all values of age group and sex. We then used the Mantel-Haenszel method
to estimate an odds ratio controlling for “agesex.” We also calculated chi-squared p-values
and confidence intervals at the 95% confidence level.
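The Mantel-Haenszel pooled odds ratio over the four “agesex” strata can be sketched as below. The 2x2 counts are hypothetical and the function name mantel_haenszel_or is an assumption. The formula itself is the standard Mantel-Haenszel estimator, sum(a*d/n) / sum(b*c/n).

```python
# Hypothetical 2x2 tables, one per agesex stratum:
# (a, b, c, d) = (exposed deaths, exposed survivors, unexposed deaths, unexposed survivors)
tables = [
    (10, 40, 20, 180),
    (12, 18, 15, 45),
    (15, 45, 25, 155),
    (16, 24, 14, 36),
]

def mantel_haenszel_or(tables):
    """Mantel-Haenszel pooled odds ratio across strata."""
    num = 0.0
    den = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        num += a * d / n   # concordant product, weighted by stratum size
        den += b * c / n   # discordant product, weighted by stratum size
    return num / den

print(round(mantel_haenszel_or(tables), 3))
```

With a single stratum the estimator reduces to the familiar cross-product ratio a*d/(b*c). With several strata it falls between the smallest and largest stratum-specific odds ratios.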
Inference: Presence of ASCVD at baseline is estimated to be associated with 3.39-fold
higher odds of death within five years, using Mantel-Haenszel adjustment for age group
and sex (95%CI 2.23-fold higher odds to 5.13-fold higher odds). With a p-value <.0001 we
reject the null hypothesis that there is no difference in odds of five year death between
ASCVD positive and negative groups.
b. Answer the question using an appropriate regression model.
Method: In order to estimate adjusted odds ratios, we used a logistic regression model. A
95% confidence interval and p value were computed assuming the approximate normal
distribution for the regression parameter estimates. The binary outcome variable was 5-year mortality, the predictor of interest was prevalence of ASCVD history at baseline as
defined in question 1, and sex (binary) and age (continuous) variables were included as
adjustment covariates.
Inference: Based on a logistic regression model adjusting for patient sex and age,
presence of ASCVD at baseline is estimated to be associated with 3.57-fold higher odds of
death within five years (95% CI 2.36-fold to 5.59-fold higher odds). With a two-sided p
value<.0001, we reject the null hypothesis that there is no difference in odds of five year
death between ASCVD positive and negative groups.
c. What is the difference in the statistical models you used? That is, how would you
explain any differences between the two analysis approaches?
The stratified analysis used a binary age group variable to adjust for age, while logistic
regression allowed for age to be included as a continuous adjustment variable. In the
regression model we can specify whether or not an interaction term between adjustment
variables is included.
4. Using the risk ratio (RR) as a measure of association, provide statistical inference
regarding an association between 5 year survival and baseline prevalence of ASCVD,
adjusting for age and sex.
a. Answer the question using a stratified analysis (e.g., using Stata command ir or
an equivalent analysis in R).
Methods: In order to calculate age and sex adjusted risk ratios, we used the categorical
age variable from question 2a and the “agesex” stratification variable from question 3a.
We then performed a stratified analysis on age and sex and estimated a standardized risk
ratio, using the ASCVD+ group as the distribution according to which we
standardized. We also calculated chi-squared p-values and confidence intervals at the 95%
confidence level.
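A standardized risk ratio that uses the ASCVD+ group as the standard can be sketched as the observed number of deaths in that group divided by the number expected if the stratum-specific risks of the ASCVD- group applied. The counts are hypothetical and the function name standardized_rr is an assumption.

```python
# Hypothetical counts per agesex stratum:
# (deaths among ASCVD+, n ASCVD+, deaths among ASCVD-, n ASCVD-)
strata = [
    (10, 50, 20, 200),
    (12, 30, 15, 60),
    (15, 60, 25, 180),
    (16, 40, 14, 50),
]

def standardized_rr(strata):
    """Risk ratio standardized to the stratum distribution of the exposed group."""
    observed = sum(d1 for d1, _, _, _ in strata)
    # Deaths expected among the exposed under the unexposed stratum-specific risks
    expected = sum(n1 * d0 / n0 for _, n1, d0, n0 in strata)
    return observed / expected

print(round(standardized_rr(strata), 3))
```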
Inference: From the above described standardization method, we found that, adjusting for
age and sex, ASCVD history is estimated to be associated with 2.54 times greater risk of
five-year death compared with no ASCVD (95%CI 1.85-fold higher to 3.51-fold higher).
Based on a two-sided p-value<.0001, we can reject the null hypothesis of a risk ratio of 1 for
this comparison.
b. Answer the question using an appropriate regression model.
Methods: In order to estimate adjusted risk ratios, we performed a Poisson regression. A
95% confidence interval and p value were computed using robust standard errors. The
binary outcome variable was 5-year mortality, the predictor of interest was prevalence of
ASCVD history at baseline as defined in question 1, and sex (binary) and age (continuous)
variables were included as adjustment covariates.
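The reason robust standard errors matter when Poisson regression is applied to a binary outcome can be illustrated in the simplest case of one binary exposure and no covariates, where both variances have closed forms. This is a sketch under that simplifying assumption, with hypothetical counts and an assumed function name. The adjusted model in the Methods would be fit with statistical software.

```python
import math

def log_rr_standard_errors(d1, n1, d0, n0):
    """Risk ratio with robust and model-based SEs of log(RR) for one binary exposure."""
    p1, p0 = d1 / n1, d0 / n0
    rr = p1 / p0
    # Sandwich (robust) variance: uses the binomial variance p(1-p) of a binary outcome
    robust_se = math.sqrt((1 - p1) / (n1 * p1) + (1 - p0) / (n0 * p0))
    # Model-based Poisson variance: assumes variance equals the mean, so it is too large here
    model_se = math.sqrt(1 / d1 + 1 / d0)
    return rr, robust_se, model_se

# Hypothetical data: 16 deaths among 40 exposed, 20 deaths among 100 unexposed
rr, robust_se, model_se = log_rr_standard_errors(16, 40, 20, 100)
ci = (math.exp(math.log(rr) - 1.96 * robust_se),
      math.exp(math.log(rr) + 1.96 * robust_se))
print(rr, robust_se, model_se, ci)
```

The model-based standard error is larger than the robust one because a binary outcome has variance p(1-p), which is smaller than the Poisson variance p. The gap widens as the outcome becomes more common.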
Inference: From the above Poisson regression model adjusting for age and sex, ASCVD
history is estimated to be associated with 2.72 times greater risk of five-year death
compared with no ASCVD (robust 95%CI 1.96-fold higher to 3.76-fold higher). Based on
a two-sided p-value<.0001, we can reject the null hypothesis of a risk ratio of 1 for this
comparison.
c. What is the difference in the statistical models you used? That is, how would you
explain any differences between the two analysis approaches?
The stratified analysis used a binary age group variable to adjust for age, while Poisson
regression allowed for age to be included as a continuous adjustment variable. In the
regression model we can specify whether or not an interaction term between adjustment
variables is included.
5. Comment very briefly on the similarity or differences among the three approaches.
Which would you tend to prefer in general? Why?
The three approaches all estimate associations between the exposure and outcome, but
differ in their scale. In this instance, we would prefer to use a ratio measure like odds ratio
or risk ratio because we believe it is easier to interpret and communicate. Between the OR
and RR, we would choose the RR in this instance because the OR is not a good estimate of
the RR in this model due to the frequency of the outcome, which can cause confusion when
interpreting results.
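A short numerical illustration of this point, using hypothetical risks rather than the study data: when the outcome is common, the odds ratio overstates the risk ratio, while for a rare outcome the two nearly coincide.

```python
def risk_ratio(p1, p0):
    """Ratio of risks in the exposed (p1) and unexposed (p0) groups."""
    return p1 / p0

def odds_ratio(p1, p0):
    """Ratio of odds in the exposed (p1) and unexposed (p0) groups."""
    return (p1 / (1 - p1)) / (p0 / (1 - p0))

# Hypothetical risks: a common outcome and a rare outcome, both with RR = 2
common = (0.40, 0.20)
rare = (0.02, 0.01)

for label, (p1, p0) in (("common", common), ("rare", rare)):
    print(label, round(risk_ratio(p1, p0), 3), round(odds_ratio(p1, p0), 3))
```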
Question 6 pertains to the analysis of colorectal cancer incidence for whites living in the U.S. as
a function of birthplace (U.S. born vs foreign born) (see datafile surveillance.txt and
documentation surveillance.doc on the class web pages).
6. Using the incidence ratio as a measure of association, provide inference for an association
between incidence of colorectal cancer and birthplace, after adjustment for age, sex, and
SEER.
a. Answer the question using directly standardized rates, with standardization to the
U.S. population.
Method: For this analysis, we created a combined SEER site, sex and age group variable
upon which to stratify and apply standard weights. Due to the small number of observations
in the lower age groups, we restricted the analysis to only those over 45 years old (162 US-born, 162 foreign-born). We standardized the estimates against the US white population as
a whole and calculated exact 95% confidence intervals.
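Direct standardization can be sketched as a weighted average of stratum-specific rates, with weights taken from the standard population, followed by a ratio of the two standardized rates. The strata, weights, counts, and person-years below are hypothetical (the real analysis stratifies on SEER site, sex, and age group), and the function name is an assumption.

```python
# Hypothetical standard-population weights by stratum (must sum to 1)
weights = {"45-64": 0.6, "65+": 0.4}

# Hypothetical (cases, person-years) by stratum for each birthplace group
us_born = {"45-64": (100, 100_000), "65+": (300, 100_000)}
foreign_born = {"45-64": (60, 50_000), "65+": (150, 50_000)}

def directly_standardized_rate(group, weights):
    """Weighted average of stratum-specific rates using standard-population weights."""
    return sum(weights[k] * cases / pyears for k, (cases, pyears) in group.items())

rate_us = directly_standardized_rate(us_born, weights)
rate_foreign = directly_standardized_rate(foreign_born, weights)
standardized_ratio = rate_foreign / rate_us
print(rate_us, rate_foreign, standardized_ratio)
```

Because both groups are weighted to the same standard population, the ratio compares rates as if the two groups shared one stratum distribution.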
Inference: From the above described analysis adjusting for sex, age group, and SEER site,
we found that being born outside the US is estimated to be associated with a 1.01-fold
higher incidence of colorectal cancer compared with being US-born (95%CI 0.978-fold to
1.04-fold). Because the confidence interval includes 1, we cannot reject the null
hypothesis of an incidence ratio of 1 for this comparison.
b. Answer the question using an appropriate regression model.
Methods: In order to estimate adjusted incidence rate ratios, we performed a Poisson regression.
A 95% confidence interval and p value were computed using robust standard errors. The
binary outcome variable was incidence of colorectal cancer, the predictor of interest was
place of birth (binary, US vs. non-US), and sex (binary) and age group (categorical, 5-year age
ranges >45 years) variables were included as adjustment covariates.
Inference: From the above Poisson regression model adjusting for age group, sex and
SEER site, foreign birth is estimated to be associated with a 1.05-fold higher incidence of colorectal cancer compared with US birth (robust 95%CI 0.994-fold to 1.10-fold).
Based on a two-sided p-value=0.08, we fail to reject the null hypothesis of an incidence rate
ratio of 1 for this comparison.
c. What is the difference in the statistical models you used? That is, how would you
explain any differences between the two analysis approaches?
Because we modeled age as categorical in the Poisson regression model, the estimates
from the two models are very similar.