Biost 536, Fall 2014
Homework #2
October 5, 2014, Page 1 of 5
Biost 536: Categorical Data Analysis in Epidemiology
Emerson, Fall 2014
Homework #2
October 5, 2014
CODE 2467
Written problems: To be submitted as a MS-Word compatible file to the class Catalyst dropbox by 5:30
pm on Sunday, October 12, 2014. See the instructions for peer grading of the homework that are posted
on the web pages.
On this (as all homeworks) Stata / R code and unedited Stata / R output is TOTALLY
unacceptable. Instead, prepare a table of statistics gleaned from the Stata output. The table
should be appropriate for inclusion in a scientific report, with all statistics rounded to a
reasonable number of significant digits. (I am interested in how statistics are used to answer the
scientific question.)
In all problems requesting “statistical analyses” (either descriptive or inferential), you should
present both
• Methods: A brief sentence or paragraph describing the statistical methods you used.
This should be using wording suitable for a scientific journal, though it might be a little
more detailed. A reader should be able to reproduce your analysis. DO NOT PROVIDE
Stata OR R CODE.
• Inference: A paragraph providing full statistical inference in answer to the question.
Please see the supplementary document relating to “Reporting Associations” for
details.
Questions 1-5 refer to analyses of the data in the file mri.txt that is located on the class webpages.
In those questions we are interested in associations between 5 year mortality and prevalence of
atherosclerotic cardiovascular disease (ASCVD) as defined by history of prior angina, myocardial
infarction, transient ischemic attacks, or stroke. You will likely find it useful to create a new variable
indicating ASCVD. This variable can be derived from the chd and stroke variables.
For these questions, we will be considering adjustment for age and sex using both stratified and
regression analyses. For the stratified analyses, it will be necessary to use an appropriate categorization of
age.
1. We are interested in analyzing associations between 5 year mortality and prevalence of
ASCVD at study enrollment using statistical methods appropriate for binary response
variables. The observation time for death among these subjects is potentially subject to
censoring. Provide a statistical analysis demonstrating that such methods as logistic
regression can be used to answer this question.
Among those censored, the minimum observation time was 1827 days, or just over 5
years. Therefore, every subject who was observed for less than 5 years is known to have died,
and the dichotomization of 5 year mortality is not affected by censoring.
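The censoring check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records are made-up (observation time in days, death indicator) pairs rather than values from mri.txt, and the helper names are hypothetical.

```python
# Hypothetical records: (observation time in days, death indicator).
# These are NOT values from mri.txt; they only illustrate the check.
records = [(1827, 0), (2100, 0), (400, 1), (1900, 1), (1826, 1)]

FIVE_YEARS = 5 * 365.25  # 1826.25 days


def min_censored_time(records):
    """Shortest follow-up among subjects who were censored (death == 0)."""
    return min(t for t, died in records if died == 0)


def died_within_5_years(obstime, died):
    """1 if the subject is known to have died within 5 years, else 0."""
    return int(died == 1 and obstime <= FIVE_YEARS)


# The binary outcome is well defined only if nobody was censored before 5 years.
censoring_ok = min_censored_time(records) >= FIVE_YEARS
outcomes = [died_within_5_years(t, d) for t, d in records]
```

If `censoring_ok` were false, some subjects would have unknown 5 year vital status and a binary-outcome analysis would not be appropriate without further handling.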
2. Using the risk difference (RD) as a measure of association, provide statistical inference
regarding an association between 5 year survival and baseline prevalence of ASCVD,
adjusting for age and sex.
a. Answer the question using a stratified analysis (e.g., using Stata command cs or
an equivalent analysis in R).
Methods: To assess the association between 5 year survival and baseline prevalence of ASCVD
(defined as a history of stroke or coronary heart disease) on the risk difference scale, adjusting
for age and sex, we stratified subjects by age and sex into six categories: males ages 65-74, males
75-84, males 85-100, females ages 65-74, females 75-84, and females 85-100. The risk differences
within each stratum were combined using weights that standardize the estimate to the exposed
group (i.e., the group that had ASCVD at baseline). A 95% exact confidence interval is presented
and used to determine significance at the two-sided .05 level.
Inference: This analysis estimates that prevalence of ASCVD at baseline is associated with a
0.1065 absolute increase in the risk of mortality within five years of study enrollment among
subjects in the same age and sex category. This difference is statistically significant, and the
observed data would not be unusual if ASCVD were associated with an absolute increase in risk
between .0083 and .2045.
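As a rough illustration of how stratum-specific risk differences can be standardized to the exposed group, the sketch below weights each stratum by its number of exposed subjects. The counts are hypothetical (not taken from mri.txt), the function name is made up for this example, and no confidence interval is computed.

```python
# Each stratum: (deaths_exposed, n_exposed, deaths_unexposed, n_unexposed).
# Hypothetical counts for three strata; not taken from mri.txt.
strata = [
    (10, 50, 20, 200),
    (15, 40, 30, 150),
    (8, 10, 12, 30),
]


def rd_standardized_to_exposed(strata):
    """Weighted average of stratum risk differences, weights = exposed counts."""
    total_weight = sum(n1 for _, n1, _, _ in strata)
    weighted_sum = sum(n1 * (d1 / n1 - d0 / n0) for d1, n1, d0, n0 in strata)
    return weighted_sum / total_weight


adjusted_rd = rd_standardized_to_exposed(strata)
```

Weighting by the exposed counts means the summary answers the question "how much higher is the risk among people like those who had ASCVD", which is one of several reasonable standardization choices.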
b. Answer the question using an appropriate regression model.
Methods: To assess the association between 5 year survival and baseline prevalence of ASCVD
(defined as a history of stroke or coronary heart disease) on the risk difference scale, adjusting
for age and sex, we fit a linear regression of the binary variable 5 year survival on the binary
variable presence of ASCVD at baseline while controlling for age and sex linearly in the model.
Note that in this model age is not categorized as in part a. Wald inference was based on robust
(Huber-White sandwich) standard errors, which allow for appropriate adjustment for the
mean-variance relationship inherent in binary outcomes, and used a normal approximation. The
regression coefficient for ASCVD was tested at the 5% level for a risk difference different from 0.
Inference: This analysis estimates that prevalence of ASCVD at baseline is associated with a
0.1035 absolute increase in the risk of mortality within five years of study enrollment if age and
sex are held constant. This difference is statistically significant, and the observed data would
not be unusual if ASCVD were associated with an absolute increase in risk between .0039 and
.2031.
c. What is the difference in the statistical models you used? That is, how would you
explain any differences between the two analysis approaches?
Regression borrows information across groups, whereas stratified analysis estimates the RD
within each stratum individually and then combines the estimates according to set weights, in
this case standardized weights. Stratified analyses also allow for effect modification by strata,
whereas the regression used here does not. Also, as mentioned earlier, age was treated as a
continuous variable in the regression but was categorized in the stratified analysis; this is an
important strength of the regression methods.
3. Using the odds ratio (OR) as a measure of association, provide statistical inference
regarding an association between 5 year survival and baseline prevalence of ASCVD,
adjusting for age and sex.
a. Answer the question using a stratified analysis (e.g., using Stata command cc or
an equivalent analysis in R).
Methods: To assess the association between 5 year survival and baseline prevalence of ASCVD
(defined as a history of stroke or coronary heart disease) on the odds ratio scale, adjusting for
age and sex, we stratified subjects by age and sex into six categories: males ages 65-74, males
75-84, males 85-100, females ages 65-74, females 75-84, and females 85-100. The odds ratios
within each stratum were combined using the statistically efficient Mantel-Haenszel weights.
A 95% exact confidence interval is presented and used to determine significance at the two-sided
.05 level for the test of whether the odds ratio comparing subjects in the same age and sex
category but differing in ASCVD status differs from 1.
Inference: This analysis estimates that prevalence of ASCVD at baseline is associated with
1.94-fold higher odds of mortality within five years of study enrollment if age and sex are held
constant. This association is statistically significant, and the observed data would not be
unusual if ASCVD were associated with odds of mortality between 1.13-fold and 3.32-fold higher
for the group with ASCVD within the same age and sex category.
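The Mantel-Haenszel pooled odds ratio referred to above has a simple closed form, sketched here in Python. The 2x2 counts are hypothetical and chosen so that every stratum has an odds ratio of exactly 2; the function name is an assumption for this example, and the exact confidence interval is not computed.

```python
# Each stratum is a 2x2 table given as (a, b, c, d):
#   a = exposed who died,   b = exposed who survived,
#   c = unexposed who died, d = unexposed who survived.
# Hypothetical counts; both strata have a stratum-specific OR of 2.
strata = [
    (10, 10, 5, 10),
    (20, 5, 10, 5),
]


def mantel_haenszel_or(strata):
    """Pooled OR = sum(a*d/n) / sum(b*c/n) over strata, n = stratum size."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator


pooled_or = mantel_haenszel_or(strata)
```

Because the stratum-specific odds ratios are identical here, the pooled estimate equals the common value; with heterogeneous strata it is a weighted compromise among them.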
b. Answer the question using an appropriate regression model.
Methods: To assess the association between 5 year survival and baseline prevalence of ASCVD
(defined as a history of stroke or coronary heart disease) on the odds ratio scale, adjusting for
age and sex, we used a logistic regression of the binary variable 5 year survival on the binary
variable presence of ASCVD at baseline while controlling for age and sex linearly in the model.
Note that in this model age is not categorized as in part a. The regression coefficient for ASCVD
was tested at the 5% level for an odds ratio different from 1, using a normal approximation for
the sampling distribution of the estimate.
Inference: This analysis estimates that prevalence of ASCVD at baseline is associated with
1.93-fold higher odds of mortality within five years of study enrollment if age and sex are held
constant. This association is statistically significant, and the observed data would not be
unusual if ASCVD were associated with odds of mortality between 1.13-fold and 3.31-fold higher
for the group with ASCVD with the same age and sex.
c. What is the difference in the statistical models you used? That is, how would you
explain any differences between the two analysis approaches?
Regression borrows information across groups, whereas stratified analysis estimates the OR
within each stratum individually and then combines the estimates according to set weights.
Stratified analyses also allow for effect modification by strata, whereas the regression used
here does not. Also, as mentioned earlier, age was treated as a continuous variable in the
regression but was categorized in the stratified analysis; this is an important strength of the
regression methods.
4. Using the risk ratio (RR) as a measure of association, provide statistical inference
regarding an association between 5 year survival and baseline prevalence of ASCVD,
adjusting for age and sex.
a. Answer the question using a stratified analysis (e.g., using Stata command ir or
an equivalent analysis in R).
Methods: To assess the association between 5 year survival and baseline prevalence of ASCVD
(defined as a history of stroke or coronary heart disease) on the risk ratio scale, adjusting
for age and sex, we used a Poisson regression of the binary variable 5 year survival on the binary
variable presence of ASCVD at baseline while controlling for age and sex linearly in the model.
Note that in this model age is not categorized as in part a. The regression coefficient for ASCVD
was tested at the 5% level for a risk ratio different from 1, using a normal approximation for
the sampling distribution of the estimate with Huber-White sandwich estimators.
Inference: This analysis estimates that prevalence of ASCVD at baseline is associated with a
1.64-fold increase in the risk of mortality within five years of study enrollment if age and sex
are held constant. This association is statistically significant, and the observed data would not
be unusual if ASCVD were associated with a risk of mortality between 1.12-fold and 2.42-fold
higher for the group with ASCVD with the same age and sex.
b. Answer the question using an appropriate regression model.
Methods: To assess the association between 5 year survival and baseline prevalence of ASCVD
(defined as a history of stroke or coronary heart disease) on the risk ratio scale, adjusting
for age and sex, we used a logistic regression of the binary variable 5 year survival on the binary
variable presence of ASCVD at baseline while controlling for age and sex linearly in the model.
Note that in this model age is not categorized as in part a. The regression coefficient for ASCVD
was tested at the 5% level for a risk ratio different from 1, using a normal approximation for
the sampling distribution of the estimate as well as Huber-White sandwich estimators.
Inference: This analysis estimates that prevalence of ASCVD at baseline is associated with a
1.66-fold increase in the risk of mortality within five years of study enrollment if age and sex
are held constant. This association is statistically significant, and the observed data would not
be unusual if ASCVD were associated with a risk of mortality between 1.05-fold and 2.61-fold
higher for the group with ASCVD with the same age and sex.
c. What is the difference in the statistical models you used? That is, how would you
explain any differences between the two analysis approaches?
Regression borrows information across groups, whereas stratified analysis estimates the RR
within each stratum individually and then combines the estimates according to set weights.
Stratified analyses also allow for effect modification by strata, whereas the regression used
here does not. Also, as mentioned earlier, age was treated as a continuous variable in the
regression but was categorized in the stratified analysis; this is an important strength of the
regression methods.
5. Comment very briefly on the similarity or differences among the three approaches. Which
would you tend to prefer in general? Why?
These three approaches explore an association between ASCVD and mortality
within 5 years within the same age and sex values using three different scales. I would be
inclined to use risk difference for the public health interpretation as it aids in thinking about
disease burden. However, the odds ratio tends to be less affected by effect modification, so that
could be used as well.
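One way to see how the three scales relate is to write the regression models in a common form. The notation below is introduced only for this sketch and does not appear in the homework: D is the indicator of death within 5 years, and the link function g is assumed to be the identity, logit, or log for the risk difference, odds ratio, and risk ratio analyses respectively.

```latex
g\bigl(\Pr[D = 1 \mid \mathrm{ASCVD}, \mathrm{age}, \mathrm{male}]\bigr)
  = \beta_0 + \beta_1\,\mathrm{ASCVD} + \beta_2\,\mathrm{age} + \beta_3\,\mathrm{male}

\begin{aligned}
g(p) &= p                      &&\Rightarrow\quad \mathrm{RD} = \beta_1 \\
g(p) &= \log\frac{p}{1-p}      &&\Rightarrow\quad \mathrm{OR} = e^{\beta_1} \\
g(p) &= \log p                 &&\Rightarrow\quad \mathrm{RR} = e^{\beta_1}
\end{aligned}
```

In each case the comparison is between subjects of the same age and sex who differ in ASCVD status; only the scale on which the difference is summarized changes.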
Question 6 pertains to the analysis of colorectal cancer incidence for whites living in the U.S. as
a function of birthplace (U.S. born vs foreign born) (see datafile surveillance.txt and
documentation surveillance.doc on the class web pages).
6. Using the incidence ratio as a measure of association, provide inference for an association
between incidence of colorectal cancer and birthplace, after adjustment for age, sex, and
SEER.
a. Answer the question using directly standardized rates, with standardization to the
U.S. population.
Methods: The incidence rate ratio of colorectal cancer comparing foreign born to US born
people was estimated to assess a possible association between colorectal cancer incidence and
birthplace. The sample was stratified according to sex, age, and SEER site, and the strata were
combined so as to standardize to the US population. Wald-based 95% confidence intervals,
using a normal approximation, were used to define statistical significance at the .05 level.
Inference: The analysis estimated an incidence rate 1.6% higher (in relative terms) in
foreign born than in US born people of the same age, sex, and SEER site. A 95% CI of 2.14% lower
to 4.6% higher does not give evidence that the incidence rate differs by birthplace.
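Direct standardization, as used in part a, can be sketched as follows. Everything here is an assumption made for illustration: the two strata, the standard population counts, the case counts, and the person-years are hypothetical and do not come from surveillance.txt, and no standard error is computed.

```python
# Hypothetical strata (e.g., age-by-sex-by-site cells); not from surveillance.txt.
# Each entry: standard population count, then (cases, person-years) per group.
strata = [
    {"std_pop": 500, "us": (10, 1000), "foreign": (20, 1000)},
    {"std_pop": 500, "us": (40, 1000), "foreign": (40, 1000)},
]


def directly_standardized_rate(strata, group):
    """Average of stratum rates, weighted by the standard population share."""
    total_pop = sum(s["std_pop"] for s in strata)
    return sum(
        s["std_pop"] / total_pop * s[group][0] / s[group][1] for s in strata
    )


def standardized_rate_ratio(strata, numerator="foreign", denominator="us"):
    """Ratio of two directly standardized rates."""
    return directly_standardized_rate(strata, numerator) / directly_standardized_rate(
        strata, denominator
    )


rate_ratio = standardized_rate_ratio(strata)
```

Both groups are weighted by the same standard population, so the ratio compares the rates the two groups would have if they shared that population's stratum distribution.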
b. Answer the question using an appropriate regression model.
Methods: A Poisson regression was used to test the association between colorectal cancer
incidence and birthplace (US or foreign). The model also adjusted linearly for age and sex and
included dummy variables for SEER site. Wald-based 95% confidence intervals, using a normal
approximation, were used to define statistical significance at the .05 level.
Inference: The analysis estimated an incidence rate 4.7% lower (in relative terms) in
foreign born than in US born people of the same age, sex, and SEER site. A 95% CI of 14.0% lower
to 5.6% higher does not give evidence that the incidence rate differs by birthplace.
c. What is the difference in the statistical models you used? That is, how would you
explain any differences between the two analysis approaches?
As in the other questions, regression borrows information across groups, whereas
stratified analysis estimates the RR within each stratum individually and then combines the
estimates according to set weights. Stratified analyses also allow for effect modification by
strata, whereas the regression used here does not. Also, the weights used in part a correspond
to the US population, while part b does not use such information.