Stat 557 Solutions to... Fall 2002

Solutions to Assignment 4

1. (a) No three factor interaction model

$$\log(m_{ijk}) = \lambda + \lambda^A_i + \lambda^B_j + \lambda^C_k + \lambda^{AB}_{ij} + \lambda^{AC}_{ik} + \lambda^{BC}_{jk}$$

No two factors are conditionally independent given the level of the third factor, but the conditional association between any two factors, as measured by the odds ratio, is the same at each level of the third factor.

(b) Complete independence of the three variables

$$\log(m_{ijk}) = \lambda + \lambda^A_i + \lambda^B_j + \lambda^C_k$$

Maximum likelihood estimates of expected counts are:

$$\hat{m}_{ijk} = \frac{Y_{i++}\, Y_{+j+}\, Y_{++k}}{Y_{+++}^2}$$

(c) Variables 1 and 3 are jointly independent of variable 2

$$\log(m_{ijk}) = \lambda + \lambda^A_i + \lambda^B_j + \lambda^C_k + \lambda^{AC}_{ik}$$

Maximum likelihood estimates of expected counts are:

$$\hat{m}_{ijk} = \frac{Y_{+j+}\, Y_{i+k}}{Y_{+++}}$$

(d) Conditional independence of variables 2 and 3 given the level of variable 1

$$\log(m_{ijk}) = \lambda + \lambda^A_i + \lambda^B_j + \lambda^C_k + \lambda^{AB}_{ij} + \lambda^{AC}_{ik}$$

Maximum likelihood estimates of expected counts are:

$$\hat{m}_{ijk} = \frac{Y_{ij+}\, Y_{i+k}}{Y_{i++}}$$

(e) Conditional independence of variables 1, 2 and 3 given the level of variable 4

$$\log(m_{ijkl}) = \lambda + \lambda^A_i + \lambda^B_j + \lambda^C_k + \lambda^D_l + \lambda^{AD}_{il} + \lambda^{BD}_{jl} + \lambda^{CD}_{kl}$$

Maximum likelihood estimates of expected counts are:

$$\hat{m}_{ijkl} = \frac{Y_{i++l}\, Y_{+j+l}\, Y_{++kl}}{Y_{+++l}^2}$$

(f) No three factor interaction. No two factors are conditionally independent given the level of the third factor, but the conditional association between factors 1 and 2, as measured by the odds ratios, is the same at each level of the third factor.

$$\log(m_{ijk}) = \lambda + \lambda^A_i + \lambda^B_j + \lambda^C_k + \lambda^{AB}_{ij} + \lambda^{AC}_{ik} + \lambda^{BC}_{jk}$$
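The closed-form estimate in part (d) can be checked numerically. The following is a minimal sketch, assuming a hypothetical 2 x 2 x 2 table of counts (the values are made up for illustration and are not data from the assignment). It computes the fitted counts for conditional independence of variables 2 and 3 given variable 1 and confirms that the fitted conditional odds ratio equals 1 at each level of variable 1.

```python
from itertools import product

# Hypothetical 2 x 2 x 2 table of counts Y[i][j][k]; not data from the assignment.
Y = [[[10, 20], [30, 40]],
     [[5, 15], [25, 35]]]
I, J, K = 2, 2, 2


def fit_cond_indep(Y):
    """Fitted counts for model (d): m_ijk = Y_ij+ * Y_i+k / Y_i++."""
    m = [[[0.0] * K for _ in range(J)] for _ in range(I)]
    for i in range(I):
        y_i = sum(Y[i][j][k] for j in range(J) for k in range(K))  # Y_{i++}
        for j, k in product(range(J), range(K)):
            y_ij = sum(Y[i][j][kk] for kk in range(K))             # Y_{ij+}
            y_ik = sum(Y[i][jj][k] for jj in range(J))             # Y_{i+k}
            m[i][j][k] = y_ij * y_ik / y_i
    return m


m_hat = fit_cond_indep(Y)

# Conditional odds ratio between variables 2 and 3 within each level of variable 1.
odds_ratios = [m_hat[i][0][0] * m_hat[i][1][1] / (m_hat[i][0][1] * m_hat[i][1][0])
               for i in range(I)]
print(odds_ratios)
```

The same pattern (product of the sufficient margins divided by the margin they share) gives the estimates in parts (b), (c) and (e).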
2. These are only examples of the infinitely many correct answers.

(a) Complete independence

               k=1                k=2
           i=1    i=2         i=1    i=2
    j=1   1/48   2/48        1/48   2/48
    j=2   3/48   6/48        3/48   6/48

(b) Variables 2 and 3 are jointly independent of variable 1.

               k=1                k=2
           i=1    i=2         i=1    i=2
    j=1   1/30   2/30        3/30   6/30
    j=2   4/30   8/30        2/30   4/30

(c) Variables 1 and 2 are conditionally independent given the level of variable 3.

               k=1                k=2
           i=1    i=2         i=1    i=2
    j=1   1/30   2/30        4/30   1/30
    j=2   3/30   6/30       12/30   3/30
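A table like the one in part (b) can be verified directly. The sketch below, which uses only the numerators of the part (b) example and renormalizes them, checks that every cell probability factors as the product of the variable 1 margin and the joint margin of variables 2 and 3, and that variables 2 and 3 are themselves associated. The dictionary layout and names are illustrative choices, not part of the assignment.

```python
from itertools import product

# Numerators of the 2(b) example, indexed by (i, j, k) with levels 1 and 2.
num = {(1, 1, 1): 1, (2, 1, 1): 2, (1, 2, 1): 4, (2, 2, 1): 8,
       (1, 1, 2): 3, (2, 1, 2): 6, (1, 2, 2): 2, (2, 2, 2): 4}
total = sum(num.values())
p = {cell: n / total for cell, n in num.items()}

levels = (1, 2)
p_i = {i: sum(p[i, j, k] for j in levels for k in levels) for i in levels}
p_jk = {(j, k): sum(p[i, j, k] for i in levels) for j in levels for k in levels}

# Joint independence of (2, 3) from 1: p_ijk = p_i++ * p_+jk in every cell.
joint_indep = all(abs(p[i, j, k] - p_i[i] * p_jk[j, k]) < 1e-12
                  for i, j, k in product(levels, repeat=3))

# Variables 2 and 3 are associated: their marginal odds ratio differs from 1.
or_23 = p_jk[1, 1] * p_jk[2, 2] / (p_jk[1, 2] * p_jk[2, 1])
print(joint_indep, or_23)
```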
3. You can count degrees of freedom for lack-of-fit tests either by (i) summing the degrees of freedom for the λ-terms left out of the model, or (ii) computing 72 - (sum of degrees of freedom for the λ-terms in the model, including the constant term), where 72 = 4 × 3 × 2 × 3 is the number of cells in the table.
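Counting rule (ii) is easy to automate. The following is a rough sketch, assuming the factors have 4, 3, 2 and 3 levels (as the interaction tables later in these solutions indicate) and that a hierarchical model is described by its highest-order terms. The function name `lack_of_fit_df` and the string encoding of terms are illustrative choices.

```python
from itertools import combinations
from math import prod

# Number of levels of each factor in the 4 x 3 x 2 x 3 table (72 cells).
LEVELS = {"A": 4, "B": 3, "C": 2, "D": 3}


def lack_of_fit_df(generators):
    """Lack-of-fit degrees of freedom for a hierarchical log-linear model.

    `generators` lists the highest-order terms, e.g. ["ABD", "BCD"];
    every lower-order term contained in a generator is implied.
    """
    terms = set()
    for g in generators:
        for r in range(len(g) + 1):          # r = 0 contributes the constant term
            terms.update(combinations(sorted(g), r))
    # Each term contributes the product of (levels - 1) over its factors.
    n_params = sum(prod(LEVELS[f] - 1 for f in t) for t in terms)
    return prod(LEVELS.values()) - n_params


print(lack_of_fit_df(["ABD", "BCD"]))            # the model in part (a)
print(lack_of_fit_df(["AB", "AD", "BC", "CD"]))  # the model in part (b)
```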
(a)
$$\log(m_{ijkl}) = \lambda + \lambda^A_i + \lambda^B_j + \lambda^C_k + \lambda^D_l + \lambda^{AB}_{ij} + \lambda^{AD}_{il} + \lambda^{BC}_{jk} + \lambda^{BD}_{jl} + \lambda^{CD}_{kl} + \lambda^{ABD}_{ijl} + \lambda^{BCD}_{jkl}$$
df = 27; sufficient statistics are $\{Y_{ij+l}\}$ and $\{Y_{+jkl}\}$. Maximum likelihood estimates of
expected counts are $\hat{m}_{ijkl} = Y_{ij+l}\, Y_{+jkl} / Y_{+j+l}$. This model implies that type of housing and
level of contact with neighbors are conditionally independent for each pair of levels for
satisfaction and influence. Association between satisfaction with housing conditions (B) and
housing type (A) depends on the level of perceived influence with management (D).
Associations between housing type (A) and perceived influence with management (D) are not
consistent across levels of satisfaction (B). Associations between level of satisfaction with
housing conditions (B) and perceived influence with management (D) depend on both the level of
contact with other residents (C) and the housing type (A). Associations between satisfaction
with housing conditions (B) and level of contact with other residents (C) depend on the perceived
influence with management (D).
Associations between level of contact with other residents (C) and the level of perceived
influence with management (D) are not consistent across levels of satisfaction with housing
conditions (B).
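The closed-form estimate for model (a) can be illustrated on made-up data. This sketch assumes a hypothetical 4 x 3 x 2 x 3 table of random counts (the assignment's data are not reproduced here) and checks that the fitted table $Y_{ij+l} Y_{+jkl} / Y_{+j+l}$ reproduces both sets of sufficient statistics. The helper `margin` is an illustrative name.

```python
from itertools import product
import random

random.seed(0)
# Hypothetical positive counts; tuple positions 0..3 stand for A, B, C, D.
Y = {cell: random.randint(5, 50)
     for cell in product(range(4), range(3), range(2), range(3))}


def margin(table, keep):
    """Sum a table over every factor whose position is not listed in `keep`."""
    out = {}
    for cell, value in table.items():
        key = tuple(cell[p] for p in keep)
        out[key] = out.get(key, 0) + value
    return out


y_abd, y_bcd, y_bd = margin(Y, (0, 1, 3)), margin(Y, (1, 2, 3)), margin(Y, (1, 3))

# m_hat[i, j, k, l] = Y_{ij+l} * Y_{+jkl} / Y_{+j+l}
m_hat = {(i, j, k, l): y_abd[i, j, l] * y_bcd[j, k, l] / y_bd[j, l]
         for (i, j, k, l) in Y}

# The fitted table should reproduce both sets of sufficient statistics.
fit_abd, fit_bcd = margin(m_hat, (0, 1, 3)), margin(m_hat, (1, 2, 3))
print(max(abs(fit_abd[key] - y_abd[key]) for key in y_abd))
```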
(b)
$$\log(m_{ijkl}) = \lambda + \lambda^A_i + \lambda^B_j + \lambda^C_k + \lambda^D_l + \lambda^{AB}_{ij} + \lambda^{AD}_{il} + \lambda^{BC}_{jk} + \lambda^{CD}_{kl}$$
df = 47; sufficient statistics are
$\{Y_{++kl}\}$, $\{Y_{+jk+}\}$, $\{Y_{ij++}\}$, $\{Y_{i++l}\}$. There are no closed
form expressions for the maximum likelihood estimates for the expected counts. This model
implies that contact with other residents (C) is conditionally independent of housing type (A),
given the levels of the other two variables. It also implies that satisfaction with housing
conditions (B) is conditionally independent of perceived influence with management (D), given
the joint levels of housing type (A) and contact with other residents (C). As measured by odds
ratios, associations between satisfaction with housing conditions (B) and housing type (A) are
consistent across the joint levels of contact with other residents (C) and perceived influence with
management (D). Associations between satisfaction with housing conditions (B) and level of
contact (C) with other residents are consistent across the joint levels of housing type (A) and
perceived influence with management (D). Associations between type of housing (A) and
perceived influence with management (D) are consistent across the joint levels of satisfaction
with housing conditions (B) and level of contact (C) with other residents. Associations between
level of contact with other residents (C) and perceived influence with management (D) are
consistent across the joint levels of satisfaction with housing conditions (B) and housing type
(A).
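When no closed form exists, fitted counts can be obtained by iterative proportional fitting. The following is a minimal sketch for the model with AB, AD, BC and CD terms, assuming a hypothetical table of random counts rather than the assignment's data. The function names and the fixed number of cycles are illustrative choices; a production routine would stop on a convergence criterion instead.

```python
from itertools import product
import random

random.seed(1)
# Hypothetical 4 x 3 x 2 x 3 counts; tuple positions 0..3 stand for A, B, C, D.
Y = {cell: random.randint(5, 50)
     for cell in product(range(4), range(3), range(2), range(3))}


def margin(table, keep):
    """Sum a table over every factor whose position is not listed in `keep`."""
    out = {}
    for cell, value in table.items():
        key = tuple(cell[p] for p in keep)
        out[key] = out.get(key, 0.0) + value
    return out


def ipf(Y, generators, n_iter=100):
    """Iterative proportional fitting to the margins listed in `generators`."""
    m = {cell: 1.0 for cell in Y}
    targets = {g: margin(Y, g) for g in generators}
    for _ in range(n_iter):
        for g in generators:
            current = margin(m, g)
            for cell in m:
                key = tuple(cell[p] for p in g)
                m[cell] *= targets[g][key] / current[key]
    return m


# Two-factor margins AB, AD, BC, CD.
gens = [(0, 1), (0, 3), (1, 2), (2, 3)]
m_hat = ipf(Y, gens)

# Largest discrepancy between fitted and observed sufficient margins.
max_gap = 0.0
for g in gens:
    fitted, observed = margin(m_hat, g), margin(Y, g)
    max_gap = max(max_gap, max(abs(fitted[key] - observed[key]) for key in observed))
print(max_gap)
```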
(c)
$$\log(m_{ijkl}) = \lambda + \lambda^A_i + \lambda^B_j + \lambda^C_k + \lambda^D_l + \lambda^{AB}_{ij} + \lambda^{AC}_{ik} + \lambda^{AD}_{il} + \lambda^{BC}_{jk} + \lambda^{BD}_{jl} + \lambda^{CD}_{kl} + \lambda^{ABC}_{ijk} + \lambda^{BCD}_{jkl}$$
df = 30; sufficient statistics are $\{Y_{i++l}\}$, $\{Y_{+jkl}\}$ and $\{Y_{ijk+}\}$. There are no closed
form expressions for the maximum likelihood estimates for the expected counts. This model
implies that the association between the perceived level of influence with management (D) and
housing type (A) is consistent across the levels of the other two factors. Associations between
level of satisfaction (B) and type of housing (A), however, change across the level of contact
with other residents (C) but are consistent across the levels of influence with management (D).
Associations between level of satisfaction (B) and level of contact with other residents (C)
depend on both the type of housing (A) and the level of perceived influence with management
(D). Associations between level of satisfaction (B) and level of perceived influence with
management (D) depend on level of contact with other residents (C) but not on housing type (A).
Associations between level of contact with other residents (C) and perceived influence with
management (D) depend on the level of satisfaction with housing conditions (B) but not on
housing type (A). Associations between level of contact with other residents (C) and type of
housing (A) change across levels of satisfaction with housing (B) but are consistent across the
levels of influence with management (D).
(d)
$$\log(m_{ijkl}) = \lambda + \lambda^A_i + \lambda^B_j + \lambda^C_k + \lambda^D_l + \lambda^{AB}_{ij} + \lambda^{AC}_{ik} + \lambda^{AD}_{il} + \lambda^{BC}_{jk} + \lambda^{BD}_{jl} + \lambda^{CD}_{kl} + \lambda^{BCD}_{jkl}$$
df = 36; sufficient statistics are $\{Y_{ij++}\}$, $\{Y_{i+k+}\}$, $\{Y_{i++l}\}$, $\{Y_{+jkl}\}$. There are no closed
form expressions for the maximum likelihood estimates for the expected counts. This model
implies that associations between housing type (A) and any other factor do not depend on the
level of either of the other two factors. Associations between level of satisfaction (B) and level
of contact with other residents (C) depend on the level of perceived influence with management
(D) but not on housing type (A). Associations between level of satisfaction (B) and level of
perceived influence with management (D) depend on level of contact with other residents (C) but
not on housing type (A). Associations between level of contact with other residents (C) and
perceived influence with management (D) depend on the level of satisfaction with housing
conditions (B) but not on housing type (A).
(e)
$$\log(m_{ijkl}) = \lambda + \lambda^A_i + \lambda^B_j + \lambda^C_k + \lambda^D_l + \lambda^{BC}_{jk} + \lambda^{BD}_{jl} + \lambda^{CD}_{kl} + \lambda^{BCD}_{jkl}$$
df = 51; sufficient statistics are $\{Y_{i+++}\}$ and $\{Y_{+jkl}\}$. Maximum
likelihood estimates of expected counts are $\hat{m}_{ijkl} = Y_{i+++}\, Y_{+jkl} / Y_{++++}$. Type of housing (A)
is independent of all of the other three factors. Associations between level of satisfaction (B)
and level of contact with other residents (C) depend on the level of perceived influence with
management (D) but not on housing type (A). Associations between level of satisfaction (B) and
level of perceived influence with management (D) depend on level of contact with other
residents (C) but not on housing type (A). Associations between level of contact with other
residents (C) and perceived influence with management (D) depend on the level of satisfaction
with housing conditions (B) but not on housing type (A).
3. (B) Several models were reported by the students in this class. One strategy was to fit a log-linear
model with just main effects, fit another model with all 2-factor interactions, fit another model
with all 3-factor interactions, find a model that fits and then delete insignificant terms from the
model with a backward elimination procedure. Others started with some reasonable model and
let the step( ) function in S-PLUS or the searching capability of the HILOGLINEAR procedure in
SPSS search for a model. The simplest model selected was
(model 1)
$$\log(m_{ijkl}) = \lambda + \lambda^A_i + \lambda^B_j + \lambda^C_k + \lambda^D_l + \lambda^{AB}_{ij} + \lambda^{AC}_{ik} + \lambda^{BC}_{jk} + \lambda^{BD}_{jl} + \lambda^{CD}_{kl}$$
For this model $X^2$ = 58.36 on 46 d.f. with p-value = 0.104 and $G^2$ = 57.6 with p-value = 0.117.
However, there are four adjusted residuals larger than 2 in absolute value and one adjusted residual
larger than 3. Also, the $\lambda^{AD}_{il}$ interaction is significant at the .033 level when added to this
model. After adding the $\lambda^{AD}_{il}$ interaction to the model, the $\lambda^{ABD}_{ijl}$ interaction becomes
significant at the 0.0396 level. Many students selected
(model 2)
$$\log(m_{ijkl}) = \lambda + \lambda^A_i + \lambda^B_j + \lambda^C_k + \lambda^D_l + \lambda^{AB}_{ij} + \lambda^{AC}_{ik} + \lambda^{BC}_{jk} + \lambda^{BD}_{jl} + \lambda^{CD}_{kl} + \lambda^{AD}_{il} + \lambda^{ABD}_{ijl}$$
There are no large adjusted residuals for this model. This model fits the data well, $X^2$ = 22.47
on 28 d.f. with p-value = 0.78 and $G^2$ = 22.13 with p-value = 0.76. Other students simplified this
model by deleting the $\lambda^{BC}_{jk}$ interaction to obtain
(model 3)
$$\log(m_{ijkl}) = \lambda + \lambda^A_i + \lambda^B_j + \lambda^C_k + \lambda^D_l + \lambda^{AB}_{ij} + \lambda^{AC}_{ik} + \lambda^{BD}_{jl} + \lambda^{CD}_{kl} + \lambda^{AD}_{il} + \lambda^{ABD}_{ijl}$$
For this model, $G^2$ = 38.12 on 30 d.f. with p-value = 0.147. I did not check the residuals
from fitting this model.
Finally, some students used the step( ) function in S-PLUS to maximize the penalized log-likelihood known as the AIC criterion. This resulted in
(model 4)
$$\log(m_{ijkl}) = \lambda + \lambda^A_i + \lambda^B_j + \lambda^C_k + \lambda^D_l + \lambda^{AB}_{ij} + \lambda^{AC}_{ik} + \lambda^{BC}_{jk} + \lambda^{BD}_{jl} + \lambda^{CD}_{kl} + \lambda^{AD}_{il} + \lambda^{ABC}_{ijk}$$
This model also fits the data well, $X^2$ = 31.72 on 34 d.f. and $G^2$ = 22.13 with p-value = 0.58. It
is well known that selecting a model by AIC can sometimes lead to overfitting. In this case,
the $\lambda^{ABC}_{ijk}$ interaction term is significant at the 0.058 level. Some students deleted this three
factor interaction and selected the model with all six of the 2-factor interactions:
(model 5)
$$\log(m_{ijkl}) = \lambda + \lambda^A_i + \lambda^B_j + \lambda^C_k + \lambda^D_l + \lambda^{AB}_{ij} + \lambda^{AC}_{ik} + \lambda^{BC}_{jk} + \lambda^{BD}_{jl} + \lambda^{CD}_{kl} + \lambda^{AD}_{il}$$
This model also fits the data well, $X^2$ = 43.95 on 40 d.f. and $G^2$ = 44.18 with p-value = 0.31.
Finally, a few students deleted the $\lambda^{AD}_{il}$ interaction from model 5 and ended up with model 1.
The S-PLUS search that led to model 4, however, did not lead students to consider model 2.
A good case could be made for any of these models, although the residual analysis indicates that
model 1 matches the observed data less well than the other models. The data do not provide
sufficient information to clearly distinguish between these models. Choosing among these
models requires additional information or expertise that is not provided by the observed data.
Inferences about the 2-factor associations $\lambda^{AB}_{ij}$, $\lambda^{AC}_{ik}$, $\lambda^{BD}_{jl}$, $\lambda^{CD}_{kl}$, $\lambda^{AD}_{il}$ are essentially the same
for these models, however, because the interactions $\lambda^{BC}_{jk}$, $\lambda^{ABC}_{ijk}$, or $\lambda^{ABD}_{ijl}$, if they truly exist,
are relatively weak.
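Nested models in this list can be compared through differences in $G^2$. The sketch below is an illustration only: it uses the $G^2$ values and degrees of freedom quoted above for models 1, 2 and 5, and a closed-form chi-square tail probability that is valid only for even degrees of freedom (which happens to cover these comparisons). Because the quoted statistics are rounded, the p-values it prints may differ slightly from significance levels reported by the software.

```python
import math


def chi2_sf_even(x, df):
    """Upper-tail chi-square probability; this closed form holds for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k          # (x/2)^k / k!
        total += term
    return math.exp(-half) * total


# (G2, df) as quoted in the text.
fits = {"model1": (57.6, 46), "model2": (22.13, 28), "model5": (44.18, 40)}


def compare(smaller, larger):
    """G2 difference, its df, and p-value for a pair of nested models."""
    g2 = fits[smaller][0] - fits[larger][0]
    df = fits[smaller][1] - fits[larger][1]
    return g2, df, chi2_sf_even(g2, df)


# Model 1 vs model 5 tests the AD term; model 1 vs model 2 tests AD and ABD jointly.
for pair in [("model1", "model5"), ("model1", "model2")]:
    print(pair, compare(*pair))
```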
As an example, I will examine and interpret the interaction terms in Model 2 because it provides an
opportunity to discuss a three factor interaction. Listings of the maximum likelihood estimates of the
terms in the model produced by PROC GENMOD in SAS are given below. The results from the glm
function in S-PLUS are also shown, using constraints where parameters sum to zero across the levels of
any single factor.
AC interaction: (summation constraints), $\hat{\lambda}^{AC}_{ik}$ values

            i=1     i=2     i=3     i=4
    k=1    .303   -.005   -.137   -.161
    k=2   -.303    .005    .137    .162

AC interaction: (SAS GENMOD constraints), $\hat{\lambda}^{AC}_{ik}$ values

            i=1     i=2     i=3     i=4
    k=1   .9256   .3092   .0458       0
    k=2       0       0       0       0
Conditional on any level of satisfaction (j) and any level of influence on management (l), there
is a tendency for less contact with other residents for people who live in the Towers (i=1) and
more contact with other residents in atrium (i=3) and terrace (i=4) housing. More specifically,
the odds of low contact with other residents are about 2.5 times as large for people who live in
Towers as for people who live in terrace housing. (The estimate of the log odds ratio is
$\hat{\lambda}^{AC}_{11} - \hat{\lambda}^{AC}_{41} - \hat{\lambda}^{AC}_{12} + \hat{\lambda}^{AC}_{42} = 0.9256$ with standard error .1662. This is an estimable quantity, so it
is the same for either set of constraints. An approximate 95% confidence interval for the log
odds ratio is (.5999, 1.2514). The corresponding estimate of the odds ratio is exp(0.9256) = 2.52
with approximate 95% confidence interval (1.82, 3.50).) You could examine other estimates of
odds ratios for this interaction. We will not provide this level of detail in the remainder of this
discussion and only report conclusions based on patterns in the estimated λ parameters.
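The interval arithmetic in the parenthetical remark can be reproduced in a few lines. This is a small sketch, assuming the usual Wald interval with the normal quantile 1.96; the function name `odds_ratio_ci` is illustrative.

```python
import math


def odds_ratio_ci(log_or, se, z=1.96):
    """Wald interval for a log odds ratio, also mapped to the odds-ratio scale."""
    lo, hi = log_or - z * se, log_or + z * se
    return (lo, hi), math.exp(log_or), (math.exp(lo), math.exp(hi))


# Estimate and standard error quoted for towers vs. terrace housing.
log_ci, or_hat, or_ci = odds_ratio_ci(0.9256, 0.1662)
print(log_ci, or_hat, or_ci)
```

Exponentiating the endpoints of the log-scale interval, rather than building a symmetric interval around 2.52, keeps the interval inside the positive range where odds ratios live.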
BC interaction: (summation constraints)

            j=1     j=2     j=3
    k=1    .140   -.039   -.101
    k=2   -.140    .039    .101

BC interaction: (SAS GENMOD constraints)

            j=1     j=2     j=3
    k=1   .4818   .1251       0
    k=2       0       0       0
Conditional on any particular type of housing and any level of influence with management, people
with less contact with other residents tend to be less satisfied with their housing conditions.
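The two sets of constraints describe the same interaction, and one can be converted to the other. The sketch below, using the rounded sum-to-zero BC estimates, re-expresses them with the last level of each factor as the reference cell, which is the convention the GENMOD tables here appear to follow. Agreement with the GENMOD values is only up to rounding of the three-decimal inputs.

```python
# Sum-to-zero estimates of the BC interaction: rows j = 1..3, columns k = 1, 2.
bc_sum = [[0.140, -0.140],
          [-0.039, 0.039],
          [-0.101, 0.101]]


def to_reference_cell(table):
    """Re-express interaction terms so the last row and last column are zero."""
    last_r, last_c = len(table) - 1, len(table[0]) - 1
    return [[table[r][c] - table[r][last_c] - table[last_r][c] + table[last_r][last_c]
             for c in range(len(table[0]))]
            for r in range(len(table))]


bc_ref = to_reference_cell(bc_sum)
for row in bc_ref:
    print([round(v, 3) for v in row])
```

Each converted entry is a log odds ratio against the reference levels, which is why it is the same estimable quantity under either parameterization.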
CD interaction: (summation constraints)

            l=1     l=2     l=3
    k=1   -.150   -.031    .181
    k=2    .150    .031   -.181

CD interaction: (SAS GENMOD constraints)

             l=1      l=2     l=3
    k=1   -.6645   -.4260       0
    k=2        0        0       0
Conditional on any type of housing and any level of satisfaction, residents with less contact with
other residents tend to feel they have more influence on management.
Interactions involving housing type (A), satisfaction (B), and influence with management (D) must be
interpreted more carefully in the presence of the three factor interaction. First look at the three factor
interaction.
ABD interaction: (summation constraints)

    Housing            l=1                    l=2                    l=3
    Type (A)    j=1    j=2    j=3      j=1    j=2    j=3      j=1    j=2    j=3
      i=1     -.270   .082   .188     .220  -.061  -.159     .050  -.021  -.029
      i=2      .161   .011  -.172    -.086  -.040   .126    -.075   .029   .046
      i=3     -.041  -.030   .071    -.151   .025   .126     .192   .005  -.197
      i=4      .150  -.063  -.087     .017   .076  -.093    -.167  -.013   .180

ABD interaction: (SAS GENMOD constraints)

    Housing            l=1                    l=2                    l=3
    Type (A)    j=1    j=2    j=3      j=1    j=2    j=3      j=1    j=2    j=3
      i=1     -1.12  -.330      0    -.150  -.267      0        0      0      0
      i=2     -.130  -.547      0    -.018  -.509      0        0      0      0
      i=3     -1.08  -1.12      0    -.519  -.663      0        0      0      0
      i=4         0      0      0        0      0      0        0      0      0
These estimates must be added to the estimates of the interactions between type of housing and
satisfaction to determine how those interactions change across levels of factor D, perceived
influence with management. Add the table of $\hat{\lambda}^{AB}_{ij}$ values to each of the above tables of
$\hat{\lambda}^{ABD}_{ijl}$ values. The results are:
AB interactions: (summation constraints)

                       l=1                    l=2                    l=3
       A        j=1    j=2    j=3      j=1    j=2    j=3      j=1    j=2    j=3
      i=1     -.557   .016   .541    -.067  -.127   .194    -.236  -.087   .323
      i=2      .250  -.066  -.184     .003  -.117   .114     .014  -.048   .034
      i=3     -.205   .127   .078    -.315   .182   .133     .028   .162  -.190
      i=4      .512  -.077  -.435     .379   .062  -.441     .194  -.027  -.167
AB interaction: (SAS GENMOD constraints)

    Housing            l=1                    l=2                    l=3
    Type (A)    j=1    j=2    j=3      j=1    j=2    j=3      j=1    j=2    j=3
      i=1     -2.05  -.883      0    -1.07  -.820      0    -.927  -.553      0
      i=2     -.514  -.769      0    -.402  -.731      0    -.384  -.222      0
      i=3     -1.23  -.900      0    -.664  -.452      0    -.149  -.211      0
      i=4         0      0      0        0      0      0        0      0      0
For any level of interaction with other residents (C), the following patterns are present. For any
level of perceived influence with management (D), residents of towers tend to be more satisfied
with their housing than residents of other types of housing and residents of terraced housing tend
to be less satisfied with their housing. This pattern is most pronounced when perceived level of
influence with management is low.
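The last statement can be made concrete with the GENMOD-constraint AB table. In the sketch below, the three log values are the entries for towers (i=1) and the first satisfaction level (j=1) at each level of influence. It assumes, as the surrounding discussion suggests, that j=1 is the lowest satisfaction level and l=1 the lowest influence level, and that terrace housing (i=4) and j=3 are the reference levels, so each entry is a conditional log odds ratio against those references.

```python
import math

# Entries for towers (i=1), j=1 from the GENMOD-constraint AB table,
# one for each level l = 1, 2, 3 of perceived influence with management.
log_or_towers_low = {1: -2.05, 2: -1.07, 3: -0.927}

# Conditional odds ratios of j=1 vs. j=3 satisfaction, towers vs. terrace.
odds_ratios = {l: math.exp(v) for l, v in log_or_towers_low.items()}
print({l: round(v, 2) for l, v in odds_ratios.items()})
```

All three odds ratios fall below 1, and the one at l=1 is the furthest from 1, which matches the description of the pattern as most pronounced at the first influence level.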
The absence of any 3 or 4 factor interaction involving level of interaction (C) implies that
associations among housing type (A), satisfaction (B), and influence on management (D)
variables are consistent across the two levels of interaction (C).
BD interaction: (add the $\hat{\lambda}^{BD}_{jl}$ terms to the $\hat{\lambda}^{ABD}_{ijl}$ terms) (summation constraints)

                 (i=1)                  (i=2)                  (i=3)                  (i=4)
            l=1    l=2    l=3      l=1    l=2    l=3      l=1    l=2    l=3      l=1    l=2    l=3
    j=1    .074   .217  -.291     .504  -.089  -.415     .304  -.154  -.150     .494  -.066  -.428
    j=2    .110   .016  -.126     .039   .036  -.075    -.001   .101  -.100     .045   .152  -.197
    j=3   -.184  -.233   .417    -.543   .053   .490    -.303   .053   .250    -.539  -.086   .625
BD interaction: (add the $\hat{\lambda}^{BD}_{jl}$ terms to the $\hat{\lambda}^{ABD}_{ijl}$ terms) (SAS GENMOD constraints)

                 (i=1)                  (i=2)                  (i=3)                  (i=4)
            l=1    l=2    l=3      l=1    l=2    l=3      l=1    l=2    l=3      l=1    l=2    l=3
    j=1    .970  1.162      0    1.958   .766      0    1.001   .192      0    2.088  1.312      0
    j=2    .840   .798      0    1.152   .556      0     .651   .402      0    1.170  1.065      0
    j=3       0      0      0        0      0      0        0      0      0        0      0      0
At both levels of interaction with other residents (C), the following pattern occurs. Lower levels
of satisfaction are associated with lower perceived level of influence on management for each of
the four types of housing, although this pattern is strongest for apartment residents (i=2) and
terraced housing residents (i=4), less strong for atrium residents (i=3) and tower residents (i=1).
This helps to account for the high level of satisfaction of tower residents in spite of the tendency
of tower residents to feel they do not strongly influence management. The perceived level of
influence on management is not as strongly related to satisfaction for tower residents as it is for
some other types of housing.
One could describe how relationships between type of housing and perceived level of influence on
management change across levels of satisfaction within each level of interaction with other residents in a
similar way.
Note that the interpretations of interactions in the model are all conditional, i.e., they describe
associations within strata formed by levels of at least one additional factor. It may also be of interest to
examine associations in marginal distributions for the different pairs of factors. You can do this with
PROC GENMOD or PROC CATMOD in SAS, the glm( ) function in S-PLUS, or LOGLINEAR in SPSS.
For these data, examination of the 2-dimensional marginal tables leads to the same interpretation of the
data, but this will not be the case for all data sets.
4. (a)

    Strata                               Estimate of Conditional    Approximate 95%
    Gestational Age    Mother's Age      Odds Ratio                 Confidence Interval
    197-260 days       < 30 years        1.42                       (0.65, 3.10)
    197-260 days       ≥ 30 years        1.30                       (0.39, 4.31)
    ≥ 261 days         < 30 years        2.18                       (0.89, 5.37)
    ≥ 261 days         ≥ 30 years        0.92                       (0.12, 7.04)

The Mantel-Haenszel estimator for a common odds ratio is $\hat{\alpha}_{MH}$ = 1.525 with approximate
95% confidence interval (0.918, 2.534). Because this interval contains 1, these data do not reveal a significant association
between 12 month survival and mother's smoking habit.
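For reference, the Mantel-Haenszel estimator pools the stratum-specific 2 x 2 tables as $\sum_h (a_h d_h / n_h) \big/ \sum_h (b_h c_h / n_h)$. The sketch below illustrates the formula on two hypothetical strata (made-up counts, not the assignment's data), each with odds ratio 4, so the pooled estimate is also 4.

```python
def mantel_haenszel_or(tables):
    """Mantel-Haenszel common odds ratio from 2x2 tables given as ((a, b), (c, d))."""
    num = sum(a * d / (a + b + c + d) for (a, b), (c, d) in tables)
    den = sum(b * c / (a + b + c + d) for (a, b), (c, d) in tables)
    return num / den


# Hypothetical strata; each table has odds ratio (a*d)/(b*c) = 4.
strata = [((10, 5), (5, 10)), ((4, 2), (2, 4))]
mh = mantel_haenszel_or(strata)
print(mh)
```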
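(b)
$$\log(m_{ijkl}) = \lambda + \lambda^G_i + \lambda^A_j + \lambda^S_k + \lambda^M_l + \lambda^{GA}_{ij} + \lambda^{GS}_{ik} + \lambda^{GM}_{il} + \lambda^{AS}_{jk} + \lambda^{AM}_{jl} + \lambda^{SM}_{kl} + \lambda^{GAS}_{ijk} + \lambda^{GAM}_{ijl}$$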
(c)
$G^2$ = 0.95 with 3 d.f. and p-value = .81 suggests that the model in part (b) is suitable for
these data. Hence, the hypothesis of homogeneous association between 12 month mortality
and mother's smoking status (as measured by odds ratio) is not rejected.
(d)
Breslow-Day = 0.964 with 3 d.f. and p-value = 0.81. $T_4$ = -1.08 with 1 d.f. and p-value
= 0.86. The effect of mother's smoking habit on 12 month mortality appears to be nearly
consistent across the gestational age and mother's age categories.
(e)
Three models were selected by students in this class. In order of decreasing complexity they
are:
(i)
$$\log(m_{ijkl}) = \lambda + \lambda^M_i + \lambda^S_j + \lambda^A_k + \lambda^G_l + \lambda^{AG}_{kl} + \lambda^{MG}_{il} + \lambda^{SA}_{jk} + \lambda^{MA}_{ik} + \lambda^{MS}_{ij}$$
The value of the lack-of-fit test is $G^2$ = 1.82 with 6 d.f. and p-value = .94, or $X^2$ =
1.83 with 6 d.f. and p-value = .93. Clearly, there is no need to add terms to this
model. For this model $\hat{\lambda}^{MS}_{ij}$ = .1109 with standard error = .0612 (S-PLUS
summation constraints), or $\hat{\lambda}^{MS}_{ij}$ = .4436 with standard error = .24467 (SAS GENMOD
constraints). The odds of 12 month mortality when the
mother smokes at least 5 cigarettes per day divided by the odds of mortality when the
mother smokes fewer than 5 cigarettes per day is estimated as 1.56 with approximate
95% confidence interval (0.96, 2.57). Although the point estimate of the odds ratio
indicates that smoking at least 5 cigarettes per day could increase the odds of infant
mortality by 56%, this result is not quite significant at the .05 level and it could be due
to sampling variation.
(ii)
$$\log(m_{ijkl}) = \lambda + \lambda^M_i + \lambda^S_j + \lambda^A_k + \lambda^G_l + \lambda^{AG}_{kl} + \lambda^{MG}_{il} + \lambda^{SA}_{jk} + \lambda^{MA}_{ik}$$
The value of the lack-of-fit test for this model is $G^2$ = 4.79 with 7 d.f. and p-value =
.69 or $X^2$ = 5.67 with 7 d.f. and p-value = .58. Comparing models (i) and (ii)
provides a test of the hypothesis that 12 month mortality is conditionally independent of
mother's smoking habit. $G^2$ = 4.79 - 1.82 = 2.97 with 1 d.f. and p-value = 0.07. The
data do not indicate a strong link between mother's smoking habit and 12-month infant
mortality, after adjusting for mother's age and gestational age. Furthermore, smoking
does not appear to have a significant link to gestational age.
(iii) Most students reported the model
$$\log(m_{ijkl}) = \lambda + \lambda^M_i + \lambda^S_j + \lambda^A_k + \lambda^G_l + \lambda^{MG}_{il} + \lambda^{SA}_{jk} + \lambda^{MA}_{ik}$$
with $G^2$ = 7.72 on 8 d.f. and p-value = 0.46. Also, $X^2$ = 8.73 on 8 d.f. and p-value = 0.37. This model implies that 12-month infant mortality is conditionally
independent of the mother's smoking habit, given the age of the mother and the
gestational age of the baby at time of birth. 12-month infant mortality has some
association with mother's age and gestational age. Using SAS GENMOD constraints,
$\hat{\lambda}^{MA}_{ik}$ = -0.5506 with standard error = .1692, and the estimate of the odds that a baby
dies within the first 12 months is about 73% higher when the mother is at least 30 than
when the mother is younger than 30; an approximate 95% confidence interval for the
odds ratio is (1.25, 2.42). Also, $\hat{\lambda}^{MG}_{il}$ = 3.328 with standard error = .1843 and the
estimate of the odds that a baby dies within the first 12 months is about 28 times greater
when the gestational age is between 197 and 260 days than when gestational age exceeds
260 days. An approximate 95% confidence interval for the odds ratio is (23.2, 33.5).
Since $\hat{\lambda}^{SA}_{jk}$ = 0.4043 with standard error = .0994, the estimated odds that the mother
smokes is about 50% higher for women under 30. An approximate 95% confidence
interval for the odds ratio is (1.36, 1.65).
This is an observational study and it does not consider many other factors that could be
related to infant mortality, such as economic status, diet, level of pre-natal care, alcohol or
other drug use, etc… It may have been more informative to include additional categories for
the mother’s age, such as under 18 where rates of premature births are high, or additional
smoking habit categories, such as never smoked.