Biost 513, Spring 2000
Professor Breslow

HOMEWORK #3
(Due Friday, April 21, in class)

Reading: Rosner 11.8, 11.14
For Reference: “Use of the logistic and related models in longitudinal studies of chronic disease risk” (Coursepak Readings)

Problems:

1. The two previous homework assignments used the grouped data from the Ille-et-Vilaine case-control study (either tuyns.dat or esoph.raw) to study the relationship between alcohol and tobacco consumption and esophageal cancer. Those analyses ignored age. Yet we know from the class notes that age is strongly related to the cancer outcome (p. 32715) and is at least moderately associated with both alcohol and tobacco consumption in the population at risk (p. 32717). This suggests that, as in many such situations, age should be treated as a confounder. (Age may be not so much a causal risk factor for cancer as a surrogate for the cumulative effects of many unmeasured risk factors.)

Create a binary “exposure” variable tobexp coded 0 for 0-19 cigarettes per day and 1 for 20+ cigarettes per day (see p. 32913 of the class notes for one way to do this). Similarly, create a binary alcohol exposure variable alcexp coded 0 for 0-79 gm/day and 1 for 80+ gm/day (see p. 40507). Denote cancer status by cc=1 for cases and cc=0 for controls.

a) Analyze the relationship between cancer (cc) and tobacco (tobexp) by creating a single 2 × 2 table. Quote and interpret the odds ratio estimate and a 95% confidence interval for the odds ratio. Why is this called the “crude” estimate?

b) Now adjust for age by stratification into the 6 age categories. Determine and interpret an adjusted odds ratio for tobexp and a 95% confidence interval using the Mantel-Haenszel method. State in simple terms how the meaning of this estimate differs from that calculated in (a).

c) Is the assumption of a common odds ratio, which implicitly underlies the calculations in (b), plausible? Present evidence to support your conclusion.
d) Repeat parts (b) and (c), this time adjusting simultaneously for age and alcexp.

e) Is there evidence that alcohol and tobacco consumption are associated? Are they associated after adjustment for age? Why is it best to examine this association using the control population only?

2. Logistic regression can be used to compute odds ratio estimates after adjusting for other variables. Consider the analyses in question 1 that focused on tobexp as the exposure variable of interest:

a) What would be the dependent variable in a logistic regression for the Ille-et-Vilaine data?

b) Define (write down the equation for) a logistic regression model that would characterize the unadjusted (crude) odds ratio that was estimated in question 1(a).

c) The output below is from a logistic regression of cc on both tobexp and age, with 5 dummy variables used for the effects of each higher age group relative to baseline. Compute and interpret the estimated odds ratio for tobexp and its 95% confidence interval. Compare the point and interval estimates to those obtained in question 1(b). Are they similar? Do they have similar interpretations? Why or why not?

    . xi: logit cc tobexp i.age [fweight=freq]
    i.age             Iage_1-6    (naturally coded; Iage_1 omitted)

    Logit estimates                          Number of obs   =        975
                                             LR chi2(6)      =     139.27
                                             Prob > chi2     =     0.0000
    Log likelihood = -425.10698              Pseudo R2       =     0.1408

          cc |      Coef.   Std. Err.     [95% Conf. Interval]
    ---------+--------------------------------------------------
      tobexp |   .8339739   .1929275       .455843    1.212105
      Iage_2 |   1.713168   1.061829     -.3679782    3.794314
      Iage_3 |    3.47941   1.019266      1.481686    5.477135
      Iage_4 |   3.999044   1.015121      2.009444    5.988644
      Iage_5 |   4.217314   1.020075      2.218005    6.216624
      Iage_6 |   3.990266   1.060016      1.912672     6.06786
       _cons |  -5.008187   1.008153     -6.984131   -3.032244

3. Suppose you are interested in describing whether social status, as measured by a (0,1) variable called SOC, is associated with cardiovascular disease mortality (within 10 years), as defined by a (0,1) variable called CVD.
Suppose further that you have carried out a 12-year follow-up study of 200 men who are 60 years old or older. In assessing the relationship between SOC and CVD, you decide to control for smoking status [SMK, a (0,1) variable] and systolic blood pressure (SBP, a continuous variable).

In analyzing your data, you decide to fit two logistic models, each involving the dependent variable CVD, but with different sets of independent variables. The variables involved in each model and their estimated coefficients are listed below:

    Model 1
    Variable      Coefficient
    Intercept       -1.180
    SOC             -0.520
    SBP              0.040
    SMK             -0.560
    SOC × SBP       -0.033
    SOC × SMK        0.175

    Model 2
    Variable      Coefficient
    Intercept       -1.190
    SOC             -0.500
    SBP              0.010
    SMK             -0.420

a) For each of the models fitted above, state the form of the logistic model that was used: state the dependent variable, the interpretation of the probability π(X), and the model for π(X) in terms of the (unknown) population parameters and the independent variables.

b) For each of the models in (a), state the form of the estimated log odds function, logit π̂(X).

c) Using Model 1, compute the estimated risk of CVD death (i.e., CVD=1) for a high social class (SOC=1) smoker (SMK=1) with SBP=150 (person 1), and for a low social class (SOC=0) smoker (SMK=1) with SBP=150 (person 2). What is the estimated relative risk comparing these individuals?

d) Repeat part (c) using Model 2. Why is the estimate different?

e) What is the estimated odds ratio comparing SOC=1 to SOC=0 for non-smokers (SMK=0) with SBP=150 under Model 1 and under Model 2? (Note: use the coefficients directly rather than calculating π̂(X).)

f) If the study had been a (retrospective) case-control study, which risk estimate would you report (RR or OR)? Justify your answer.
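
Computational note: the quantities requested in the problems above (an odds ratio with Wald limits from a logit coefficient, a fitted risk from a linear predictor, and a Mantel-Haenszel summary odds ratio) can be sketched in a few lines of code. The sketch below is in Python rather than Stata, which the course uses. The function names and all numeric inputs are hypothetical illustrations; none are taken from the Ille-et-Vilaine data or from the models listed above.

```python
import math


def inv_logit(eta):
    """Convert a linear predictor (log odds) into a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def odds_ratio_ci(coef, se, z=1.96):
    """Odds ratio and Wald limits from a logit coefficient and its standard error.

    The limits are computed on the log-odds scale and then exponentiated.
    """
    return math.exp(coef), math.exp(coef - z * se), math.exp(coef + z * se)


def mantel_haenszel_or(strata):
    """Mantel-Haenszel summary odds ratio over a list of 2 x 2 tables.

    Each stratum is a tuple (a, b, c, d):
      a = exposed cases,    b = unexposed cases,
      c = exposed controls, d = unexposed controls.
    """
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den


# Hypothetical inputs, for illustration only.
or_hat, lower, upper = odds_ratio_ci(coef=0.5, se=0.2)
risk = inv_logit(-1.0 + 0.5)  # hypothetical intercept plus one coefficient
mh = mantel_haenszel_or([(10, 20, 30, 40), (5, 15, 10, 70)])
print(or_hat, lower, upper, risk, mh)
```

Note that the Wald limits are symmetric about the coefficient on the log scale, so they are not symmetric about the odds ratio itself.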