Biost 536: Categorical Data Analysis in Epidemiology
Emerson, Fall 2014
Homework #4
November 4, 2014

Written problems: To be submitted as a MS-Word compatible file to the class Catalyst dropbox by 11:30 pm on Sunday, November 9, 2014. See the instructions for peer grading of the homework that are posted on the web pages.

On this (as all homeworks) Stata / R code and unedited Stata / R output is TOTALLY unacceptable. Instead, prepare a table of statistics gleaned from the Stata output. The table should be appropriate for inclusion in a scientific report, with all statistics rounded to a reasonable number of significant digits. (I am interested in how statistics are used to answer the scientific question.)

In all problems requesting “statistical analyses” (either descriptive or inferential), you should present both

Methods: A brief sentence or paragraph describing the statistical methods you used. This should be using wording suitable for a scientific journal, though it might be a little more detailed. A reader should be able to reproduce your analysis. DO NOT PROVIDE Stata OR R CODE.

Inference: A paragraph providing full statistical inference in answer to the question. Please see the supplementary document relating to “Reporting Associations” for details.

Questions refer to analyses of the data in the file infarcts.txt that is located on the class webpages. We are interested in associations between prevalence of infarct-like lesions on MRI and various predictors. For this homework, we will presume that any missing data is missing completely at random (MCAR) in this dataset and hence ignorable.

1. Fit a logistic regression model investigating prevalence of infarcts as a function of age (modeled continuously) and coronary heart disease (modeled as dummy variables). Provide a scientific interpretation of each of the regression coefficients, including a description of the intercept in the model. (You do not need to describe the methods, or provide CI or p values.)
Answer:

Interpretation for the intercept: The odds of having infarcts identified by MRI are 0.016 among individuals who are age 0 and do not have coronary heart disease. This is not really relevant in practice.

Interpretation for the parameter of age: Among individuals with the same coronary heart disease history who are one year apart in age, the odds of having infarcts identified by MRI are 1.046 times as high in the older individuals as in those one year younger.

Interpretation for the parameter of angina (chd=1): Among individuals of the same age, the odds of having infarcts identified by MRI are 1.28 times as high among individuals with angina prior to MRI as among individuals without any coronary heart disease prior to MRI.

Interpretation for the parameter of myocardial infarction (chd=2): Among individuals of the same age, the odds of having infarcts identified by MRI are 1.77 times as high among individuals who had a myocardial infarction prior to MRI as among individuals without coronary heart disease prior to MRI.

2. Fit a logistic regression model investigating prevalence of infarcts as a function of age (modeled continuously), coronary heart disease (modeled as dummy variables), and their multiplicative interaction. Provide a scientific interpretation of each of the regression coefficients, including a description of the intercept in the model. (You do not need to describe the methods, or provide CI or p values.)

Answer:

Interpretation for the intercept: The odds of having infarcts identified by MRI are 0.013 among individuals who are age 0 and do not have coronary heart disease. This is not really relevant in practice.

Interpretation for the parameter of age: Among individuals without coronary heart disease prior to MRI who are one year apart in age, the odds of having infarcts identified by MRI are 1.048 times as high in the older group as in the group one year younger.
Interpretation for the parameter of angina (chd=1): Among individuals of age 0, the odds of having infarcts identified by MRI are 1.512 times as high among those who had angina prior to MRI as among those who did not have coronary heart disease prior to MRI. This is not really relevant in practice.

Interpretation for the parameter of myocardial infarction (chd=2): Among individuals of age 0, the odds of having infarcts identified by MRI are 10.073 times as high among those who had a myocardial infarction prior to MRI as among those who did not have coronary heart disease prior to MRI. This is not really relevant in practice either.

Interpretation for the parameter of the interaction term between age and angina: The odds ratio for infarcts identified by MRI associated with a one-year increase in age is relatively 0.2% lower among individuals with angina prior to MRI (ratio of odds ratios 0.998) than the corresponding odds ratio among individuals without coronary heart disease prior to MRI.

Interpretation for the parameter of the interaction term between age and myocardial infarction: The odds ratio for infarcts identified by MRI associated with a one-year increase in age is relatively 2.3% lower among individuals with a myocardial infarction prior to MRI (ratio of odds ratios 0.977) than the corresponding odds ratio among individuals without coronary heart disease prior to MRI.

3. Fit a logistic regression model that investigates the linearity of the association between the log odds of presence of infarcts and age, after adjustment for coronary heart disease. (Here you do need to describe your methods and results as they relate to the specific question.)

Answer:

Methods: There are no missing data for age. One subject is missing infarcts data. Since we assume the data are missing completely at random for this analysis, we exclude this subject from the regression analysis.
A logistic regression model was fitted for the binary outcome of infarcts identified by MRI, including continuous untransformed age, a squared age term, and dummy variables for coronary heart disease. The estimated standard errors of the regression parameters were used with asymptotic normal theory to compute two-sided p values from Wald tests. A hierarchical testing scheme was predefined: in the presence of a statistically significant primary Wald test for association between infarcts and age, a secondary test for linearity of the association would be performed using the coefficient for the squared age term. If the coefficient for the squared term was significantly different from zero, that would be interpreted as evidence that the association between the log odds of infarcts identified by MRI and age was not linear in age, after adjusting for coronary heart disease. Because the overall test of association is used as a “gate-keeper” in this testing strategy, the experiment-wise type I error of the test for nonlinearity is preserved. The significance level is 0.05.

Results: Logistic regression analysis of the odds of infarcts as a quadratic function of age, adjusted for coronary heart disease, found a statistically significant association between infarcts identified by MRI and age (two-sided p<0.0005). Because we found a statistically significant association between infarcts identified by MRI and age, we further considered whether the regression model presented evidence of a nonlinear association. In this analysis, the regression coefficient for the squared age term was not statistically significant (two-sided p=0.651). Thus we do not reject the null hypothesis that the association between the log odds of infarcts and age, after adjusting for coronary heart disease, is linear.

4.
Fit a logistic regression model that investigates whether there is a U-shaped association between the log odds of presence of infarcts and ldl, after adjustment for age. (Here you do need to describe your methods and results as they relate to the specific question.)

Answer:

Methods: One subject is missing infarcts data, and 45 subjects are missing ldl values. Since we assume the data are missing completely at random for this analysis, we exclude these subjects from the regression analysis. To evaluate whether there is a U-shaped association between the log odds of presence of infarcts and ldl after adjustment for age, we created linear spline variables for ldl based on scientifically relevant cutpoints (lower than normal: ldl<100; normal range: 100≤ldl≤189; higher than normal: ldl>189). A logistic regression model was fitted for the binary outcome of infarcts identified by MRI, including the linear spline ldl variables and an untransformed continuous age variable. The estimated standard errors of the regression parameters were used with asymptotic normal theory to compute two-sided p values from Wald tests. The parameters for all the ldl linear spline variables were tested simultaneously for association using a Wald test. A hierarchical testing scheme was predefined: in the presence of a statistically significant primary test for association, the slopes in the lower than normal and higher than normal ldl groups would be evaluated. If they were in opposite directions (either OR<1 for the lower and OR>1 for the higher group, or OR>1 for the lower and OR<1 for the higher group), a secondary assessment for a U-shaped association would be performed using the coefficients for the lower and higher than normal ldl groups: if both coefficients were significantly different from zero (both p values<0.05), that would be interpreted as evidence that the association between infarcts identified by MRI and ldl was U-shaped in ldl, after adjusting for age.
Because the overall test of association is used as a “gate-keeper” in this testing strategy, the experiment-wise type I error of the test for a U-shaped association is preserved. The significance level is 0.05.

Results: Logistic regression analysis of the odds of infarcts across LDL levels using a linear spline model, adjusted for age, found a statistically significant association between infarcts identified by MRI and LDL, based on a Wald test of all the parameters for the LDL linear spline variables simultaneously (two-sided p=0.0464). We therefore further considered whether the regression model presented evidence of a U-shaped association. Among individuals of the same age in the lower ldl group, the odds ratio for infarcts per unit increase in LDL level was below 1 but not statistically significantly different from 1 (OR=0.999, p=0.741). Among individuals of the same age in the higher ldl group, the odds ratio for infarcts per unit increase in LDL level was significantly greater than 1 (OR=1.015, p=0.034). The point estimates of the two slopes are therefore in opposite directions, but only the slope in the higher ldl group is statistically significant. Because the predefined criterion required both coefficients to be significantly different from zero, the results do not provide evidence that the association between infarcts and LDL level, after adjusting for age, is U-shaped at the 5% significance level.
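As an illustration of the spline construction and the hierarchical rule described in the Methods for problem 4, the following is a minimal Python sketch, not the code used for the analysis. It assumes the linear spline variables are parameterized so that each coefficient is the slope within its own ldl segment (knots at 100 and 189, as in the Methods). The function names linear_spline_terms and u_shape_evidence are hypothetical and were introduced only for this example.

```python
import math

# Knots taken from the cutpoints named in the Methods (ldl 100 and 189).
KNOTS = (100.0, 189.0)


def linear_spline_terms(ldl, knots=KNOTS):
    """Split an ldl value into three linear spline terms.

    Assumed parameterization: each term counts the ldl units that fall
    inside its segment, so each regression coefficient is the slope of
    the log odds within that segment. The three terms sum to ldl.
    """
    lower, upper = knots
    s_low = min(ldl, lower)                      # units below the lower knot
    s_mid = min(max(ldl, lower), upper) - lower  # units between the knots
    s_high = max(ldl, upper) - upper             # units above the upper knot
    return (s_low, s_mid, s_high)


def u_shape_evidence(or_low, p_low, or_high, p_high, alpha=0.05):
    """Apply the secondary rule from the Methods.

    Evidence is declared only if the two outer slopes point in opposite
    directions on the log odds scale and both p values are below alpha.
    This presumes the primary (gate-keeper) test was already significant.
    """
    b_low, b_high = math.log(or_low), math.log(or_high)
    opposite = (b_low < 0 < b_high) or (b_high < 0 < b_low)
    both_significant = (p_low < alpha) and (p_high < alpha)
    return opposite and both_significant


if __name__ == "__main__":
    for value in (80, 150, 200):
        print(value, linear_spline_terms(value))
    # Values reported in the Results for the lower and higher ldl groups.
    print(u_shape_evidence(0.999, 0.741, 1.015, 0.034))
```

With the values reported in the Results (OR=0.999, p=0.741 for the lower group; OR=1.015, p=0.034 for the higher group), the rule returns False because the slope in the lower group is not statistically significant.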