Stat 557, Fall 2002
Solutions to Assignment 4

1. (a) No three factor interaction model

$\log(m_{ijk}) = \lambda + \lambda^{A}_{i} + \lambda^{B}_{j} + \lambda^{C}_{k} + \lambda^{AB}_{ij} + \lambda^{AC}_{ik} + \lambda^{BC}_{jk}$

No two factors are conditionally independent given the level of the third factor, but the conditional association between any two factors, as measured by the odds ratio, is the same at each level of the third factor.

(b) Complete independence of the three variables

$\log(m_{ijk}) = \lambda + \lambda^{A}_{i} + \lambda^{B}_{j} + \lambda^{C}_{k}$

Maximum likelihood estimates of expected counts are:

$\hat{m}_{ijk} = Y_{i++} Y_{+j+} Y_{++k} / Y_{+++}^{2}$

(c) Variables 1 and 3 are jointly independent of variable 2

$\log(m_{ijk}) = \lambda + \lambda^{A}_{i} + \lambda^{B}_{j} + \lambda^{C}_{k} + \lambda^{AC}_{ik}$

Maximum likelihood estimates of expected counts are:

$\hat{m}_{ijk} = Y_{+j+} Y_{i+k} / Y_{+++}$

(d) Conditional independence of variables 2 and 3 given the level of variable 1

$\log(m_{ijk}) = \lambda + \lambda^{A}_{i} + \lambda^{B}_{j} + \lambda^{C}_{k} + \lambda^{AB}_{ij} + \lambda^{AC}_{ik}$

Maximum likelihood estimates of expected counts are:

$\hat{m}_{ijk} = Y_{ij+} Y_{i+k} / Y_{i++}$

(e) Conditional independence of variables 1, 2 and 3 given the level of variable 4

$\log(m_{ijkl}) = \lambda + \lambda^{A}_{i} + \lambda^{B}_{j} + \lambda^{C}_{k} + \lambda^{D}_{l} + \lambda^{AD}_{il} + \lambda^{BD}_{jl} + \lambda^{CD}_{kl}$

Maximum likelihood estimates of expected counts are:

$\hat{m}_{ijkl} = Y_{i++l} Y_{+j+l} Y_{++kl} / Y_{+++l}^{2}$

(f) No three factor interaction. No two factors are conditionally independent given the level of the third factor, but the conditional association between factors 1 and 2, as measured by the odds ratios, is the same at each level of the third factor.

$\log(m_{ijk}) = \lambda + \lambda^{A}_{i} + \lambda^{B}_{j} + \lambda^{C}_{k} + \lambda^{AB}_{ij} + \lambda^{AC}_{ik} + \lambda^{BC}_{jk}$

2. These are only examples of the infinitely many correct answers. Each table lists cell probabilities, which sum to one over the eight cells.

(a) Complete independence

            k=1                k=2
        i=1     i=2        i=1     i=2
j=1     1/24    2/24       1/24    2/24
j=2     3/24    6/24       3/24    6/24

(b) Variables 2 and 3 are jointly independent of variable 1.

            k=1                k=2
        i=1     i=2        i=1     i=2
j=1     1/30    2/30       3/30    6/30
j=2     4/30    8/30       2/30    4/30

(c) Variables 1 and 2 are conditionally independent given the level of variable 3.

            k=1                k=2
        i=1     i=2        i=1     i=2
j=1     1/32    2/32       4/32    1/32
j=2     3/32    6/32      12/32    3/32

3.
You can count degrees of freedom for lack-of-fit tests by either (i) summing the degrees of freedom for the λ-terms left out of the model, or (ii) computing 72 - (sum of degrees of freedom for the λ-terms in the model, including one for the intercept λ).

(a)

$\log(m_{ijkl}) = \lambda + \lambda^{A}_{i} + \lambda^{B}_{j} + \lambda^{C}_{k} + \lambda^{D}_{l} + \lambda^{AB}_{ij} + \lambda^{AD}_{il} + \lambda^{BC}_{jk} + \lambda^{BD}_{jl} + \lambda^{CD}_{kl} + \lambda^{ABD}_{ijl} + \lambda^{BCD}_{jkl}$

df = 27, sufficient statistics are $\{Y_{ij+l}\}$, $\{Y_{+jkl}\}$. Maximum likelihood estimates of expected counts are $\hat{m}_{ijkl} = Y_{ij+l} Y_{+jkl} / Y_{+j+l}$.

This model implies that type of housing (A) and level of contact with neighbors (C) are conditionally independent for each pair of levels of satisfaction (B) and influence (D). Association between satisfaction with housing conditions (B) and housing type (A) depends on the level of perceived influence with management (D). Associations between housing type (A) and perceived influence with management (D) are not consistent across levels of satisfaction (B). Associations between level of satisfaction with housing conditions (B) and perceived influence with management (D) depend on both the level of contact with other residents (C) and the housing type (A). Associations between satisfaction with housing conditions (B) and level of contact with other residents (C) depend on the perceived influence with management (D). Associations between level of contact with other residents (C) and the level of perceived influence with management (D) are not consistent across levels of satisfaction with housing conditions (B).

(b)

$\log(m_{ijkl}) = \lambda + \lambda^{A}_{i} + \lambda^{B}_{j} + \lambda^{C}_{k} + \lambda^{D}_{l} + \lambda^{AB}_{ij} + \lambda^{AD}_{il} + \lambda^{BC}_{jk} + \lambda^{CD}_{kl}$

df = 47, sufficient statistics are $\{Y_{++kl}\}$, $\{Y_{+jk+}\}$, $\{Y_{ij++}\}$, $\{Y_{i++l}\}$. There are no closed form expressions for the maximum likelihood estimates for the expected counts.

This model implies that contact with other residents (C) is conditionally independent of housing type (A), given the levels of the other two variables.
It also implies that satisfaction with housing conditions (B) is conditionally independent of perceived influence with management (D), given the joint levels of housing type (A) and contact with other residents (C). As measured by odds ratios, associations between satisfaction with housing conditions (B) and housing type (A) are consistent across the joint levels of contact with other residents (C) and perceived influence with management (D). Associations between satisfaction with housing conditions (B) and level of contact (C) with other residents are consistent across the joint levels of housing type (A) and perceived influence with management (D). Associations between type of housing (A) and perceived influence with management (D) are consistent across the joint levels of satisfaction with housing conditions (B) and level of contact (C) with other residents. Associations between level of contact with other residents (C) and perceived influence with management (D) are consistent across the joint levels of satisfaction with housing conditions (B) and housing type (A).

(c)

$\log(m_{ijkl}) = \lambda + \lambda^{A}_{i} + \lambda^{B}_{j} + \lambda^{C}_{k} + \lambda^{D}_{l} + \lambda^{AB}_{ij} + \lambda^{AC}_{ik} + \lambda^{AD}_{il} + \lambda^{BC}_{jk} + \lambda^{BD}_{jl} + \lambda^{CD}_{kl} + \lambda^{ABC}_{ijk} + \lambda^{BCD}_{jkl}$

df = 30, sufficient statistics are $\{Y_{i++l}\}$, $\{Y_{+jkl}\}$ and $\{Y_{ijk+}\}$. There are no closed form expressions for the maximum likelihood estimates for the expected counts.

This model implies that the association between the perceived level of influence with management (D) and housing type (A) is consistent across the levels of the other two factors. Associations between level of satisfaction (B) and type of housing (A), however, change across the level of contact with other residents (C) but are consistent across the levels of influence with management (D). Associations between level of satisfaction (B) and level of contact with other residents (C) depend on both the type of housing (A) and the level of perceived influence with management (D).
Associations between level of satisfaction (B) and level of perceived influence with management (D) depend on level of contact with other residents (C) but not on housing type (A). Associations between level of contact with other residents (C) and perceived influence with management (D) depend on the level of satisfaction with housing conditions (B) but not on housing type (A). Associations between level of contact with other residents (C) and type of housing (A) change across levels of satisfaction with housing (B) but are consistent across the levels of influence with management (D).

(d)

$\log(m_{ijkl}) = \lambda + \lambda^{A}_{i} + \lambda^{B}_{j} + \lambda^{C}_{k} + \lambda^{D}_{l} + \lambda^{AB}_{ij} + \lambda^{AC}_{ik} + \lambda^{AD}_{il} + \lambda^{BC}_{jk} + \lambda^{BD}_{jl} + \lambda^{CD}_{kl} + \lambda^{BCD}_{jkl}$

df = 36, sufficient statistics are $\{Y_{ij++}\}$, $\{Y_{i+k+}\}$, $\{Y_{i++l}\}$, $\{Y_{+jkl}\}$. There are no closed form expressions for the maximum likelihood estimates for the expected counts.

This model implies that associations between housing type (A) and any other factor do not depend on the level of either of the other two factors. Associations between level of satisfaction (B) and level of contact with other residents (C) depend on the level of perceived influence with management (D) but not on housing type (A). Associations between level of satisfaction (B) and level of perceived influence with management (D) depend on level of contact with other residents (C) but not on housing type (A). Associations between level of contact with other residents (C) and perceived influence with management (D) depend on the level of satisfaction with housing conditions (B) but not on housing type (A).

(e)

$\log(m_{ijkl}) = \lambda + \lambda^{A}_{i} + \lambda^{B}_{j} + \lambda^{C}_{k} + \lambda^{D}_{l} + \lambda^{BC}_{jk} + \lambda^{BD}_{jl} + \lambda^{CD}_{kl} + \lambda^{BCD}_{jkl}$

df = 51 (that is, 72 - 21, with 21 = 1 + 8 main-effect + 8 two-factor + 4 three-factor parameters), sufficient statistics are $\{Y_{i+++}\}$ and $\{Y_{+jkl}\}$. Maximum likelihood estimates of expected counts are $\hat{m}_{ijkl} = Y_{i+++} Y_{+jkl} / Y_{++++}$.

Type of housing (A) is independent of all of the other three factors.
Associations between level of satisfaction (B) and level of contact with other residents (C) depend on the level of perceived influence with management (D) but not on housing type (A). Associations between level of satisfaction (B) and level of perceived influence with management (D) depend on level of contact with other residents (C) but not on housing type (A). Associations between level of contact with other residents (C) and perceived influence with management (D) depend on the level of satisfaction with housing conditions (B) but not on housing type (A).

3. (B) Several models were reported by the students in this class. One strategy was to fit a log-linear model with just main effects, fit another model with all 2-factor interactions, fit another model with all 3-factor interactions, find a model that fits, and then delete insignificant terms from that model with a backward elimination procedure. Others started with some reasonable model and let the step( ) function in S-PLUS or the searching capability of the HILOGLINEAR procedure in SPSS search for a model. The simplest model selected was

(model 1) $\log(m_{ijkl}) = \lambda + \lambda^{A}_{i} + \lambda^{B}_{j} + \lambda^{C}_{k} + \lambda^{D}_{l} + \lambda^{AB}_{ij} + \lambda^{AC}_{ik} + \lambda^{BC}_{jk} + \lambda^{BD}_{jl} + \lambda^{CD}_{kl}$

For this model $X^2 = 58.36$ on 46 d.f. with p-value = 0.104 and $G^2 = 57.6$ with p-value = 0.117. However, there are four adjusted residuals larger than 2 in absolute value and one adjusted residual larger than 3. Also, the $\lambda^{AD}_{il}$ interaction is significant at the .033 level when added to this model. After adding the $\lambda^{AD}_{il}$ interaction to the model, the $\lambda^{ABD}_{ijl}$ interaction becomes significant at the 0.0396 level. Many students selected

(model 2) $\log(m_{ijkl}) = \lambda + \lambda^{A}_{i} + \lambda^{B}_{j} + \lambda^{C}_{k} + \lambda^{D}_{l} + \lambda^{AB}_{ij} + \lambda^{AC}_{ik} + \lambda^{BC}_{jk} + \lambda^{BD}_{jl} + \lambda^{CD}_{kl} + \lambda^{AD}_{il} + \lambda^{ABD}_{ijl}$

There are no large adjusted residuals for this model. This model fits the data well, $X^2 = 22.47$ on 28 d.f. with p-value = 0.78 and $G^2 = 22.13$ with p-value = 0.76.
Other students simplified this model by deleting the $\lambda^{BC}_{jk}$ interaction to obtain

(model 3) $\log(m_{ijkl}) = \lambda + \lambda^{A}_{i} + \lambda^{B}_{j} + \lambda^{C}_{k} + \lambda^{D}_{l} + \lambda^{AB}_{ij} + \lambda^{AC}_{ik} + \lambda^{BD}_{jl} + \lambda^{CD}_{kl} + \lambda^{AD}_{il} + \lambda^{ABD}_{ijl}$

For this model, $G^2 = 38.12$ on 30 d.f. with p-value = 0.147. I did not check the residuals from fitting this model.

Finally, some students used the step( ) function in S-PLUS to maximize the penalized log-likelihood known as the AIC criterion. This resulted in

(model 4) $\log(m_{ijkl}) = \lambda + \lambda^{A}_{i} + \lambda^{B}_{j} + \lambda^{C}_{k} + \lambda^{D}_{l} + \lambda^{AB}_{ij} + \lambda^{AC}_{ik} + \lambda^{BC}_{jk} + \lambda^{BD}_{jl} + \lambda^{CD}_{kl} + \lambda^{AD}_{il} + \lambda^{ABC}_{ijk}$

This model also fits the data well, $X^2 = 31.72$ on 34 d.f. and $G^2 = 22.13$ with p-value = 0.58. It is well known that selecting a model by the AIC criterion can sometimes lead to overfitting. In this case, the $\lambda^{ABC}_{ijk}$ interaction term is significant at the 0.058 level. Some students deleted this three factor interaction and selected the model with all six of the 2-factor interactions:

(model 5) $\log(m_{ijkl}) = \lambda + \lambda^{A}_{i} + \lambda^{B}_{j} + \lambda^{C}_{k} + \lambda^{D}_{l} + \lambda^{AB}_{ij} + \lambda^{AC}_{ik} + \lambda^{BC}_{jk} + \lambda^{BD}_{jl} + \lambda^{CD}_{kl} + \lambda^{AD}_{il}$

This model also fits the data well, $X^2 = 43.95$ on 40 d.f. and $G^2 = 44.18$ with p-value = 0.31. Finally, a few students deleted the $\lambda^{AD}_{il}$ interaction from model 5 and ended up with model 1. The S-PLUS search that led to model 4, however, did not lead students to consider model 2.

A good case could be made for any of these models, although the residual analysis indicates that model 1 matches the observed data less well than the other models. The data do not provide sufficient information to clearly distinguish between these models. Choosing among these models requires additional information or expertise that is not provided by the observed data. Inferences about the 2-factor associations $\lambda^{AB}_{ij}$, $\lambda^{AC}_{ik}$, $\lambda^{BD}_{jl}$, $\lambda^{CD}_{kl}$, $\lambda^{AD}_{il}$ are essentially the same for these models, however, because the interactions $\lambda^{BC}_{jk}$, $\lambda^{ABC}_{ijk}$, or $\lambda^{ABD}_{ijl}$, if they truly exist, are relatively weak.
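The degrees of freedom quoted for models 1 through 5 can be checked by counting parameters. The Python sketch below is not part of the original solutions. It assumes the 4 x 3 x 2 x 3 layout of the housing table (factors A, B, C, D with 4, 3, 2 and 3 levels, 72 cells in all) and describes each hierarchical model by its highest-order terms. The names `LEVELS`, `hierarchical_terms` and `lack_of_fit_df` are illustrative.

```python
from itertools import combinations
from math import prod

# Assumed factor levels for the housing table: 4 * 3 * 2 * 3 = 72 cells.
LEVELS = {"A": 4, "B": 3, "C": 2, "D": 3}


def hierarchical_terms(generators):
    """Return every non-empty factor subset implied by the generating terms."""
    terms = set()
    for gen in generators:
        for size in range(1, len(gen) + 1):
            for combo in combinations(sorted(gen), size):
                terms.add(combo)
    return terms


def lack_of_fit_df(generators, levels=LEVELS):
    """Number of cells minus number of free parameters (the 1 is the intercept)."""
    n_cells = prod(levels.values())
    n_params = 1 + sum(
        prod(levels[f] - 1 for f in term)
        for term in hierarchical_terms(generators)
    )
    return n_cells - n_params


# Each model is written by its generating (highest-order) terms.
models = {
    "model 1": ["AB", "AC", "BC", "BD", "CD"],
    "model 2": ["ABD", "AC", "BC", "CD"],
    "model 3": ["ABD", "AC", "CD"],
    "model 4": ["ABC", "AD", "BD", "CD"],
    "model 5": ["AB", "AC", "AD", "BC", "BD", "CD"],
}
for name, gens in models.items():
    print(name, lack_of_fit_df(gens))
```

Writing a model by its generators keeps the hierarchy principle automatic: listing "ABD" brings in AB, AD, BD and the three main effects without naming them.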
As an example, I will examine and interpret the interaction terms in Model 2 because it provides an opportunity to discuss a three factor interaction. Listings of the maximum likelihood estimates of the terms in the model produced by PROC GENMOD in SAS are given below. The results from the glm function in S-PLUS are also shown, using constraints where parameters sum to zero across the levels of any single factor.

AC interaction: $\hat{\lambda}^{AC}_{ik}$ values (summation constraints)

         i=1      i=2      i=3      i=4
k=1      .303    -.005    -.137    -.161
k=2     -.303     .005     .137     .162

AC interaction: $\hat{\lambda}^{AC}_{ik}$ values (SAS GENMOD constraints)

         i=1      i=2      i=3      i=4
k=1      .9256    .3092    .0458    0
k=2      0        0        0        0

Conditional on any level of satisfaction (j) and any level of influence on management (l), there is a tendency for less contact with other residents for people who live in the Towers (i=1) and more contact with other residents in atrium (i=3) and terrace (i=4) housing. More specifically, the odds of low contact with other residents are about 2.5 times greater for people who live in Towers than for people who live in terrace housing. (The estimate of the log odds ratio is $\hat{\lambda}^{AC}_{11} - \hat{\lambda}^{AC}_{41} - \hat{\lambda}^{AC}_{12} + \hat{\lambda}^{AC}_{42} = 0.9256$ with standard error .1662. This is an estimable quantity, so it is the same for either set of constraints. An approximate 95% confidence interval for the log odds ratio is (.5999, 1.2514). The corresponding estimate of the odds ratio is exp(0.9256) = 2.52 with approximate 95% confidence interval (1.82, 3.50).) You could examine other estimates of odds ratios for this interaction. We will not provide this level of detail in the remainder of this discussion and only report conclusions based on patterns in the estimated λ parameters.
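The claim that the log odds ratio is the same under either set of constraints can be checked numerically from the two AC listings. The following Python sketch is illustrative only: the table values are the rounded estimates printed in the listings, the names `SUM_TO_ZERO`, `REFERENCE_CELL` and `log_odds_ratio` are hypothetical, and terrace housing (i=4) is taken as the comparison level. Because the summation-constraint estimates are rounded to three decimals, agreement is expected only to about two decimal places.

```python
# Rounded AC interaction estimates; rows are k = 1, 2 and columns are i = 1..4.
SUM_TO_ZERO = [
    [0.303, -0.005, -0.137, -0.161],
    [-0.303, 0.005, 0.137, 0.162],
]
REFERENCE_CELL = [
    [0.9256, 0.3092, 0.0458, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]


def log_odds_ratio(table, i, ref_i=3):
    """Contrast lambda[1,i] - lambda[1,ref] - lambda[2,i] + lambda[2,ref].

    Indices are zero-based, so ref_i=3 is housing type i=4 (terrace).
    """
    return table[0][i] - table[0][ref_i] - table[1][i] + table[1][ref_i]


for i in range(3):
    from_sum = log_odds_ratio(SUM_TO_ZERO, i)
    from_ref = log_odds_ratio(REFERENCE_CELL, i)
    print(f"i={i + 1} vs i=4: {from_sum:.3f} (sum-to-zero), {from_ref:.4f} (reference cell)")
```

Under the reference-cell coding every term in the last row and last column is zero, so the contrast reduces to the single printed estimate; under the sum-to-zero coding all four terms contribute.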
BC interaction: (summation constraints)

         j=1      j=2      j=3
k=1      .140    -.039    -.101
k=2     -.140     .039     .101

BC interaction: (SAS GENMOD constraints)

         j=1      j=2      j=3
k=1      .4818    .1251    0
k=2      0        0        0

Conditional on any particular type of housing and any level of influence with management, people with less contact with other residents tend to be less satisfied with their housing conditions.

CD interaction: (summation constraints)

         l=1      l=2      l=3
k=1     -.150    -.031     .181
k=2      .150     .031    -.181

CD interaction: (SAS GENMOD constraints)

         l=1      l=2      l=3
k=1     -.6645   -.4260    0
k=2      0        0        0

Conditional on any type of housing and any level of satisfaction, residents with less contact with other residents tend to feel they have more influence on management.

Interactions involving housing type (A), satisfaction (B), and influence with management (D) must be interpreted more carefully in the presence of the three factor interaction. First look at the three factor interaction.

ABD interaction: (summation constraints)

                        l=1                       l=2                       l=3
                 j=1     j=2     j=3       j=1     j=2     j=3       j=1     j=2     j=3
Housing   i=1   -.270    .082    .188      .220   -.061   -.159      .050   -.021   -.029
Type (A)  i=2    .161   -.011   -.172     -.086   -.040    .126     -.075    .029    .046
          i=3   -.041   -.030    .071     -.151    .025    .126      .192    .005   -.197
          i=4    .150   -.063   -.087      .017    .076   -.093     -.167   -.013    .180

ABD interaction: (SAS GENMOD constraints)

                        l=1                       l=2                       l=3
                 j=1     j=2     j=3       j=1     j=2     j=3       j=1     j=2     j=3
Housing   i=1   -1.12   -.330    0        -.150   -.267    0         0       0       0
Type (A)  i=2   -.130   -.547    0        -.018   -.509    0         0       0       0
          i=3   -1.08   -1.12    0        -.519   -.663    0         0       0       0
          i=4    0       0       0         0       0       0         0       0       0

These estimates must be added to the estimates of the interactions between type of housing and satisfaction to determine how those interactions change across levels of factor D, perceived influence with management. Add the table of $\hat{\lambda}^{AB}_{ij}$ values to each of the above tables of $\hat{\lambda}^{ABD}_{ijl}$ values.
The results are:

AB interactions: (summation constraints)

                        l=1                       l=2                       l=3
                 j=1     j=2     j=3       j=1     j=2     j=3       j=1     j=2     j=3
Housing   i=1   -.557    .016    .541     -.067   -.127    .194     -.236   -.087    .323
Type (A)  i=2    .250   -.066   -.184      .003   -.117    .114      .014   -.048    .034
          i=3   -.205    .127    .078     -.315    .182    .133      .028    .162   -.190
          i=4    .512   -.077   -.435      .379    .062   -.441      .194   -.027   -.167

AB interaction: (SAS GENMOD constraints)

                        l=1                       l=2                       l=3
                 j=1     j=2     j=3       j=1     j=2     j=3       j=1     j=2     j=3
Housing   i=1   -2.05   -.883    0        -1.07   -.820    0        -.927   -.553    0
Type (A)  i=2   -.514   -.769    0        -.402   -.731    0        -.384   -.222    0
          i=3   -1.23   -.900    0        -.664   -.452    0        -.149   -.211    0
          i=4    0       0       0         0       0       0         0       0       0

For any level of interaction with other residents (C), the following patterns are present. For any level of perceived influence with management (D), residents of towers tend to be more satisfied with their housing than residents of other types of housing, and residents of terraced housing tend to be less satisfied with their housing. This pattern is most pronounced when the perceived level of influence with management is low.

The absence of any 3 or 4 factor interaction involving level of interaction (C) implies that associations among the housing type (A), satisfaction (B), and influence on management (D) variables are consistent across the two levels of interaction (C).
BD interaction: (add the $\hat{\lambda}^{BD}_{jl}$ terms to the $\hat{\lambda}^{ABD}_{ijl}$ terms) (summation constraints)

              (i=1)                   (i=2)                   (i=3)                   (i=4)
        l=1    l=2    l=3       l=1    l=2    l=3       l=1    l=2    l=3       l=1    l=2    l=3
j=1     .074   .217  -.291      .504  -.089  -.415      .304  -.154  -.150     -.494  -.066  -.428
j=2     .110   .016  -.126      .039  -.036  -.075     -.001   .101  -.100      .045  -.152  -.197
j=3    -.184  -.233   .417     -.543   .053   .490     -.303   .053   .250     -.539  -.086   .625

BD interaction: (add the $\hat{\lambda}^{BD}_{jl}$ terms to the $\hat{\lambda}^{ABD}_{ijl}$ terms) (SAS GENMOD constraints)

              (i=1)                   (i=2)                   (i=3)                   (i=4)
        l=1    l=2    l=3       l=1    l=2    l=3       l=1    l=2    l=3       l=1    l=2    l=3
j=1     .970  1.162   0        1.958   .766   0        1.001   .192   0        2.088  1.312   0
j=2     .840   .798   0        1.152   .556   0         .651   .402   0        1.170  1.065   0
j=3     0      0      0        0      0      0         0      0      0         0      0      0

At both levels of interaction with other residents (C), the following pattern occurs. Lower levels of satisfaction are associated with lower perceived level of influence on management for each of the four types of housing, although this pattern is strongest for apartment residents (i=2) and terraced housing residents (i=4), and less strong for atrium residents (i=3) and tower residents (i=1). This helps to account for the high level of satisfaction of tower residents in spite of the tendency of tower residents to feel they do not strongly influence management. The perceived level of influence on management is not as strongly related to satisfaction for tower residents as it is for some other types of housing.

One could describe, in a similar way, how relationships between type of housing and perceived level of influence on management change across levels of satisfaction within each level of interaction with other residents.

Note that the interpretations of interactions in the model are all conditional, i.e., they describe associations within strata formed by levels of at least one additional factor. It may also be of interest to examine associations in marginal distributions for the different pairs of factors.
You can do this with PROC GENMOD or PROC CATMOD in SAS, the glm( ) function in S-PLUS, or LOGLINEAR in SPSS. For these data, examination of the 2-dimensional marginal tables leads to the same interpretation of the data, but this will not be the case for all data sets.

4. (a)

Strata
Gestational Age    Mother's Age    Estimate of Conditional    Approximate 95%
                                   Odds Ratio                 Confidence Interval
197-260 days       < 30 years      1.42                       (0.65, 3.10)
197-260 days       ≥ 30 years      1.30                       (0.39, 4.31)
≥ 261 days         < 30 years      2.18                       (0.89, 5.37)
≥ 261 days         ≥ 30 years      0.92                       (0.12, 7.04)

The Mantel-Haenszel estimator for a common odds ratio is $\hat{\alpha}_{MH} = 1.525$ with approximate 95% confidence interval (0.918, 2.534). These data do not reveal a significant association between 12-month survival and mother's smoking habit.

(b)

$\log(m_{ijkl}) = \lambda + \lambda^{G}_{i} + \lambda^{A}_{j} + \lambda^{S}_{k} + \lambda^{M}_{l} + \lambda^{GA}_{ij} + \lambda^{GS}_{ik} + \lambda^{GM}_{il} + \lambda^{AS}_{jk} + \lambda^{AM}_{jl} + \lambda^{SM}_{kl} + \lambda^{GAS}_{ijk} + \lambda^{GAM}_{ijl}$

(c) $G^2 = 0.95$ with 3 d.f. and p-value = .81 suggests that the model in part (b) is suitable for these data. Hence, the hypothesis of homogeneous association between 12-month mortality and mother's smoking status (as measured by the odds ratio) is not rejected.

(d) Breslow-Day = 0.964 with 3 d.f. and p-value = 0.81. $T_4 = -1.08$ (d.f. = 1) with p-value = 0.86. The effect of mother's smoking habit on 12-month mortality appears to be nearly consistent across the gestational age and mother's age categories.

(e) Three models were selected by students in this class. In order of decreasing complexity they are:

(i)

$\log(m_{ijkl}) = \lambda + \lambda^{M}_{i} + \lambda^{S}_{j} + \lambda^{A}_{k} + \lambda^{G}_{l} + \lambda^{AG}_{kl} + \lambda^{MG}_{il} + \lambda^{SA}_{jk} + \lambda^{MA}_{ik} + \lambda^{MS}_{ij}$

The value of the lack-of-fit test is $G^2 = 1.82$ with 6 d.f. and p-value = .94, or $X^2 = 1.83$ with 6 d.f. and p-value = .93. Clearly, there is no need to add terms to this model. For this model $\hat{\lambda}^{MS}_{ij} = .1109$ with standard error = .0612 (S-PLUS summation constraints), or $\hat{\lambda}^{MS}_{ij} = .4436$ with standard error = .24467 (SAS GENMOD constraints). The odds of 12-month mortality when the mother smokes at least 5 cigarettes per day, divided by the odds of mortality when the mother smokes fewer than 5 cigarettes per day, is estimated as 1.56 with approximate 95% confidence interval (0.96, 2.57). Although the point estimate of the odds ratio indicates that smoking 5 or more cigarettes per day could increase the odds of infant mortality by 56%, this result is not quite significant at the .05 level and it could be due to sampling variation.

(ii)

$\log(m_{ijkl}) = \lambda + \lambda^{M}_{i} + \lambda^{S}_{j} + \lambda^{A}_{k} + \lambda^{G}_{l} + \lambda^{AG}_{kl} + \lambda^{MG}_{il} + \lambda^{SA}_{jk} + \lambda^{MA}_{ik}$

The value of the lack-of-fit test for this model is $G^2 = 4.79$ with 7 d.f. and p-value = .69, or $X^2 = 5.67$ with 7 d.f. and p-value = .58. Comparing models (i) and (ii) provides a test of the hypothesis that 12-month mortality is conditionally independent of mother's smoking habit: $G^2 = 4.79 - 1.82 = 2.98$ with 1 d.f. and p-value = 0.07. The data do not indicate a strong link between mother's smoking habit and 12-month infant mortality, after adjusting for mother's age and gestational age. Furthermore, smoking does not appear to have a significant link to gestational age.

(iii) Most students reported the model

$\log(m_{ijkl}) = \lambda + \lambda^{M}_{i} + \lambda^{S}_{j} + \lambda^{A}_{k} + \lambda^{G}_{l} + \lambda^{MG}_{il} + \lambda^{SA}_{jk} + \lambda^{MA}_{ik}$

with $G^2 = 7.72$ on 8 d.f. and p-value = 0.46. Also, $X^2 = 8.73$ on 8 d.f. and p-value = 0.37. This model implies that 12-month infant mortality is conditionally independent of the mother's smoking habit, given the age of the mother and the gestational age of the baby at time of birth. 12-month infant mortality has some association with mother's age and gestational age.
Using SAS GENMOD constraints, $\hat{\lambda}^{MA}_{ik} = -0.5506$ with standard error = .1692, and the estimate of the odds that a baby dies within the first 12 months is about 73% higher when the mother is at least 30 than when the mother is younger than 30; an approximate 95% confidence interval for the odds ratio is (1.25, 2.42). Also, $\hat{\lambda}^{MG}_{il} = 3.328$ with standard error = .1843, and the estimate of the odds that a baby dies within the first 12 months is about 28 times greater when the gestational age is between 197 and 260 days than when gestational age exceeds 260 days. An approximate 95% confidence interval for the odds ratio, computed as exp(3.328 ± 1.96 × .1843), is (19.4, 40.0). Since $\hat{\lambda}^{SA}_{jk} = 0.4043$ with standard error = .0994, the estimated odds that the mother smokes is about 50% higher for women under 30. An approximate 95% confidence interval for the odds ratio, computed as exp(0.4043 ± 1.96 × .0994), is (1.23, 1.82).

This is an observational study and it does not consider many other factors that could be related to infant mortality, such as economic status, diet, level of pre-natal care, alcohol or other drug use, etc. It may have been more informative to include additional categories for the mother's age, such as under 18, where rates of premature births are high, or additional smoking habit categories, such as never smoked.
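The interval arithmetic used for these odds ratios can be sketched in a few lines of Python. This is an illustration rather than part of the original solutions: it assumes a Wald interval with the normal critical value 1.96, uses the rounded estimates and standard errors quoted above as inputs, and takes the absolute value of the mother's age estimate so that the odds ratio is oriented as in the text (mother at least 30 versus younger than 30). The names `odds_ratio_ci` and `estimates` are hypothetical.

```python
import math


def odds_ratio_ci(log_or, se, z=1.96):
    """Return (odds ratio, lower, upper) from a log odds ratio and its standard error.

    The interval is exp(log_or - z * se) to exp(log_or + z * se).
    """
    half_width = z * se
    return (
        math.exp(log_or),
        math.exp(log_or - half_width),
        math.exp(log_or + half_width),
    )


# (log odds ratio, standard error); inputs are the rounded values from the text.
estimates = {
    "mortality by mother's age (MA)": (0.5506, 0.1692),
    "mortality by gestational age (MG)": (3.328, 0.1843),
    "smoking by mother's age (SA)": (0.4043, 0.0994),
}
for label, (log_or, se) in estimates.items():
    point, lower, upper = odds_ratio_ci(log_or, se)
    print(f"{label}: OR = {point:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

Because the inputs are rounded, the last printed digit can differ slightly from an interval computed by the fitting software from unrounded estimates.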