Biost 536: Categorical Data Analysis in Epidemiology
Emerson, Fall 2014
Homework #4
November 4, 2014
Written problems: To be submitted as a MS-Word compatible file to the class Catalyst dropbox by
11:30 pm on Sunday, November 9, 2014. See the instructions for peer grading of the homework that are
posted on the web pages.
On this (as all homeworks) Stata / R code and unedited Stata / R output is TOTALLY
unacceptable. Instead, prepare a table of statistics gleaned from the Stata output. The table
should be appropriate for inclusion in a scientific report, with all statistics rounded to a
reasonable number of significant digits. (I am interested in how statistics are used to answer the
scientific question.)
In all problems requesting “statistical analyses” (either descriptive or inferential), you should
present both
• Methods: A brief sentence or paragraph describing the statistical methods you used.
This should be using wording suitable for a scientific journal, though it might be a
little more detailed. A reader should be able to reproduce your analysis. DO NOT
PROVIDE Stata OR R CODE.
• Inference: A paragraph providing full statistical inference in answer to the question.
Please see the supplementary document relating to “Reporting Associations” for
details.
Questions refer to analyses of the data in the file infarcts.txt that is located on the class webpages.
We are interested in associations between prevalence of infarct-like lesions on MRI and various
predictors. For this homework, we will presume that any missing data is missing completely at random
(MCAR) in this dataset and hence ignorable.
1. Fit a logistic regression model investigating prevalence of infarcts as a function of age
(modeled continuously) and coronary heart disease (modeled as dummy variables).
Provide a scientific interpretation of each of the regression coefficients, including a
description of the intercept in the model. (You do not need to describe the methods, or
provide CI or p values.)
Answer:
Interpretation for the intercept: The estimated odds of having infarcts identified by MRI are
0.016 among individuals who are age 0 and do not have coronary heart disease. This
extrapolation to newborns has no practical scientific relevance.
Interpretation for the parameter of age: Among individuals with the same coronary heart
disease history who differ in age by one year, the odds of having infarcts identified by MRI
are 1.046 times as high (4.6% higher) in the older individuals as in the younger ones.
Interpretation for the parameter of angina (chd=1): Among individuals of the same age, the
odds of having infarcts identified by MRI are 1.28 times as high (28% higher) among
individuals with angina prior to MRI as among individuals without any coronary heart
disease prior to MRI.
Interpretation for the parameter of myocardial infarction (chd=2): Among individuals of the
same age, the odds of having infarcts identified by MRI are 1.77 times as high (77% higher)
among individuals with myocardial infarction prior to MRI as among individuals without
coronary heart disease prior to MRI.
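As a worked illustration of how these coefficients combine, the following Python sketch multiplies the rounded odds ratios quoted above to obtain fitted odds and the corresponding probability. The function names and the example age of 70 are assumptions chosen for illustration only. Python is used here purely for arithmetic; it is not the analysis software named in the assignment.

```python
# Rounded estimates quoted in the answer to problem 1.
BASELINE_ODDS = 0.016                  # exponentiated intercept: age 0, no CHD
OR_PER_YEAR = 1.046                    # odds ratio per one-year difference in age
OR_CHD = {0: 1.0, 1: 1.28, 2: 1.77}    # 0 = none, 1 = angina, 2 = myocardial infarction


def fitted_odds(age, chd):
    """Fitted odds of an MRI-identified infarct under the main-effects model."""
    return BASELINE_ODDS * OR_PER_YEAR ** age * OR_CHD[chd]


def odds_to_probability(odds):
    """Convert odds to a probability."""
    return odds / (1.0 + odds)


if __name__ == "__main__":
    # Age 70 is a hypothetical example value, not taken from the homework.
    for chd, label in [(0, "no CHD"), (1, "angina"), (2, "MI")]:
        odds = fitted_odds(70, chd)
        prob = odds_to_probability(odds)
        print(f"age 70, {label}: odds {odds:.2f}, probability {prob:.2f}")
```

Because the model has no interaction, the ratio of fitted odds between two CHD groups is the same at every age, and the ratio between two ages one year apart is the same in every CHD group.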
2. Fit a logistic regression model investigating prevalence of infarcts as a function of age
(modeled continuously), coronary heart disease (modeled as dummy variables), and their
multiplicative interaction. Provide a scientific interpretation of each of the regression
coefficients, including a description of the intercept in the model. (You do not need to
describe the methods, or provide CI or p values.)
Answer:
Interpretation for the intercept: The estimated odds of having infarcts identified by MRI are
0.013 among individuals who are age 0 and do not have coronary heart disease. This
extrapolation to newborns has no practical scientific relevance.
Interpretation for the parameter of age: Among individuals without coronary heart disease
prior to MRI who differ in age by one year, the odds of having infarcts identified by MRI are
1.048 times as high (4.8% higher) in the older group as in the one-year younger group.
Interpretation for the parameter of angina (chd=1): Among individuals of age 0, the odds of
having infarcts identified by MRI are 1.512 times as high among those who had angina prior
to MRI as among those who did not have coronary heart disease prior to MRI. Because it
refers to age 0, this comparison has no practical scientific relevance.
Interpretation for the parameter of myocardial infarction (chd=2): Among individuals of age
0, the odds of having infarcts identified by MRI are 10.073 times as high among those who
had myocardial infarction prior to MRI as among those who did not have coronary heart
disease prior to MRI. This comparison also refers to age 0 and has no practical scientific
relevance.
Interpretation for the parameter of interaction term between age and angina: The odds ratio
for infarcts identified by MRI per one-year increase in age among individuals with angina
prior to MRI is 0.2% lower (ratio of odds ratios = 0.998) than the corresponding odds ratio
among individuals without coronary heart disease prior to MRI.
Interpretation for the parameter of interaction term between age and myocardial infarction:
The odds ratio for infarcts identified by MRI per one-year increase in age among individuals
with myocardial infarction prior to MRI is 2.3% lower (ratio of odds ratios = 0.977) than the
corresponding odds ratio among individuals without coronary heart disease prior to MRI.
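The interaction coefficients are easier to read once they are combined with the main effects. The sketch below, which uses only the rounded estimates reported in this answer, computes the odds ratio comparing each coronary heart disease group with the reference group at a given age, and the per-year age odds ratio within each group. The function names are hypothetical and chosen for illustration.

```python
# Rounded estimates quoted in the answer to problem 2.
OR_AGE_NO_CHD = 1.048                   # per-year odds ratio among those without CHD
OR_AT_AGE_0 = {1: 1.512, 2: 10.073}     # CHD vs. none at age 0 (1 = angina, 2 = MI)
RATIO_PER_YEAR = {1: 0.998, 2: 0.977}   # exponentiated age-by-CHD interaction terms


def chd_odds_ratio(age, chd):
    """Odds ratio comparing CHD group `chd` with no CHD at a common age."""
    return OR_AT_AGE_0[chd] * RATIO_PER_YEAR[chd] ** age


def age_odds_ratio(chd):
    """Odds ratio per one-year difference in age within CHD group `chd`."""
    if chd == 0:
        return OR_AGE_NO_CHD
    return OR_AGE_NO_CHD * RATIO_PER_YEAR[chd]


if __name__ == "__main__":
    # Age 75 is an illustrative value, not taken from the homework.
    print(round(chd_odds_ratio(75, 1), 2), round(chd_odds_ratio(75, 2), 2))
    print(round(age_odds_ratio(1), 3), round(age_odds_ratio(2), 3))
```

At an illustrative age of 75 the implied odds ratios are roughly 1.30 for angina and 1.76 for myocardial infarction, close to the 1.28 and 1.77 estimated in problem 1. The large age-0 odds ratios are therefore artifacts of extrapolation rather than meaningful effects.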
3. Fit a logistic regression model that investigates the linearity of the association between
the log odds of presence of infarcts and age, after adjustment for coronary heart disease.
(Here you do need to describe your methods and results as they relate to the specific
question.)
Answer:
Methods: No subjects are missing age data, and one subject is missing infarcts data. Since we
assume data are missing completely at random for this analysis, we excluded this subject from
the regression analysis. A logistic regression model was fit to the binary outcome of infarcts
identified by MRI, with continuous untransformed age, a squared age variable, and dummy
variables for coronary heart disease as predictors. The estimated standard errors of the
regression parameters were used with asymptotic normal theory to compute two-sided p values
from Wald tests. A hierarchical testing scheme was predefined such that, in the presence of a
statistically significant primary Wald test for association, a secondary test for linearity of
association would be performed using the coefficient for the squared age term: a coefficient
for the squared term significantly different from zero would be interpreted as evidence that
the association between the log odds of infarcts identified by MRI and age was not linear in
age, after adjusting for coronary heart disease. Because the overall test of association is
used as a “gate-keeper” in this testing strategy, the experiment-wise type I error of the test
for nonlinearity is preserved. The significance level is 0.05.
Results: A logistic regression of infarcts identified by MRI on age using a quadratic model,
adjusted for coronary heart disease, shows a statistically significant association between
infarcts and age (two-sided p<0.0005). Because we found a statistically significant
association, we further considered whether the regression model presented evidence of a
nonlinear association. In this analysis, the regression coefficient for the squared age term
was not statistically significant (two-sided p=0.651). We therefore do not reject the null
hypothesis that the log odds of infarcts is linear in age after adjusting for coronary heart
disease.
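The Wald p values and the gate-keeper rule used in this answer can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names are hypothetical, the normal approximation is computed with the standard library's complementary error function, and the value 0.0005 stands in for the reported "p<0.0005" because the exact value is not given.

```python
import math


def wald_two_sided_p(estimate, std_error):
    """Two-sided p value for H0: coefficient = 0, using the normal approximation."""
    z = estimate / std_error
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


def linearity_conclusion(p_association, p_squared_term, alpha=0.05):
    """Apply the hierarchical (gate-keeper) rule described in the methods."""
    if p_association >= alpha:
        return "no evidence of association"
    if p_squared_term < alpha:
        return "evidence of nonlinearity"
    return "no evidence of nonlinearity"


if __name__ == "__main__":
    # 0.0005 is an upper bound standing in for the reported p<0.0005.
    print(linearity_conclusion(0.0005, 0.651))
```

The secondary test is only consulted when the primary test rejects, which is what keeps the overall type I error at the stated level.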
4. Fit a logistic regression model that investigates whether there is a U-shaped association
between the log odds of presence of infarcts and ldl, after adjustment for age. (Here you
do need to describe your methods and results as they relate to the specific question.)
Answer:
Methods: One subject is missing infarcts data, and 45 subjects are missing ldl values. Since we
assume data are missing completely at random for this analysis, we excluded these subjects
from the regression analysis. To evaluate whether there is a U-shaped association between the
log odds of presence of infarcts and ldl after adjustment for age, we created linear spline
variables for ldl based on scientifically relevant cutpoints (lower than normal: ldl<100;
normal range: 100≤ldl≤189; higher than normal: ldl>189). A logistic regression model was fit
to the binary outcome of infarcts identified by MRI, with the linear spline ldl variables and
an untransformed continuous age variable as predictors. The estimated standard errors of the
regression parameters were used with asymptotic normal theory to compute two-sided p values
from Wald tests. The parameters for all the ldl linear spline variables were tested
simultaneously for association using a Wald test. A hierarchical testing scheme was
predefined such that, in the presence of a statistically significant primary test for
association, the slopes in the lower than normal and higher than normal ldl groups would be
evaluated. If the slopes were in opposite directions (either OR<1 for the lower and OR>1 for
the higher group, or OR>1 for the lower and OR<1 for the higher group), a secondary
assessment for a U-shaped association would be performed using the coefficients for the lower
and higher than normal ldl groups: if both coefficients were significantly different from
zero (both p values<0.05), that would be interpreted as evidence that the association between
infarcts identified by MRI and ldl was U-shaped in ldl, after adjusting for age. Because the
overall test of association is used as a “gate-keeper” in this testing strategy, the
experiment-wise type I error of the test for a U-shaped association is preserved. The
significance level is 0.05.
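The linear spline variables described in the methods can be constructed as follows. This is a sketch assuming one common parameterization, in which each coefficient is the slope within its own ldl interval; the answer does not state which parameterization was actually used, and the function name is hypothetical.

```python
# Knots taken from the cutpoints stated in the methods.
KNOTS = (100.0, 189.0)


def ldl_spline_terms(ldl, knots=KNOTS):
    """Split an ldl value into three piecewise-linear terms.

    The terms sum to the original ldl value, so a coefficient on each term
    is the slope of the log odds within that interval.
    """
    lower, upper = knots
    below = min(ldl, lower)                          # varies only when ldl < 100
    middle = min(max(ldl, lower), upper) - lower     # varies only in 100..189
    above = max(ldl, upper) - upper                  # varies only when ldl > 189
    return below, middle, above


if __name__ == "__main__":
    # Example ldl values are hypothetical.
    for value in (50, 150, 200):
        print(value, ldl_spline_terms(value))
```

With this construction the fitted curve is continuous at the knots, and opposite signs on the first and third coefficients correspond to the falling and rising arms of a U shape.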
Results: A logistic regression of infarcts identified by MRI on LDL using a linear spline
model, adjusted for age, shows a statistically significant association between infarcts and
LDL, based on a Wald test of all the LDL linear spline parameters simultaneously (two-sided
p=0.0464). We therefore further considered whether the regression model presented evidence of
a U-shaped association. Among individuals of the same age in the lower LDL group, the odds
ratio for infarcts per unit increase in LDL level is below 1 but not statistically
significant (OR=0.999, p=0.741). Among individuals of the same age in the higher LDL group,
the odds ratio per unit increase in LDL level is above 1 and statistically significant
(OR=1.015, p=0.034). Although the two estimated slopes point in opposite directions, only the
slope in the higher group differs significantly from zero. Taking both parameters into
account, the results do not provide evidence at the 5% significance level that the
association between infarcts and LDL level after adjusting for age is U-shaped.
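The predefined decision rule from the methods can be written out explicitly and applied to the reported estimates. The sketch below is an illustration of that rule only; the function name and the returned labels are hypothetical.

```python
def u_shape_conclusion(p_overall, or_lower, p_lower, or_upper, p_upper, alpha=0.05):
    """Apply the gate-keeper rule for a U-shaped association.

    p_overall : p value of the joint test of all spline terms
    or_lower  : per-unit odds ratio in the lower than normal ldl group
    or_upper  : per-unit odds ratio in the higher than normal ldl group
    """
    if p_overall >= alpha:
        return "no evidence of association"
    opposite = (or_lower < 1 < or_upper) or (or_upper < 1 < or_lower)
    if opposite and p_lower < alpha and p_upper < alpha:
        return "evidence of U-shaped association"
    return "association, but no evidence of U shape"


if __name__ == "__main__":
    # Values as reported in the results paragraph.
    print(u_shape_conclusion(0.0464, 0.999, 0.741, 1.015, 0.034))
```

Requiring both outer slopes to be individually significant is a conservative criterion: a single non-significant arm is enough to withhold the U-shape conclusion.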