Stat 557                    Assignment #4                    Fall 2002

Reading Assignment: Chapter 3 and Sections 6.1 and 6.3 in Lloyd’s book

Written Assignment: Due Monday, October 21, in class. Solutions will be posted
on Monday at 5 pm. No late assignments will be accepted.

Midterm Exam: Thursday, October 24, 7:00-9:00 p.m., in 2245 Coover.

1. For each of the following tables of true cell probabilities, give a formula for the most
parsimonious log-linear model and provide a verbal description of associations, independence, or
conditional independence among variables. The cell probabilities are obtained by dividing the
number in each cell by the sum of the numbers in the entire table.
You should be able to do this without using a computer to fit models to the tables (although you
can use a computer to check your answer). Determine the model by comparing odds ratios and
comparing distributions across columns, rows, or tables. For example, in the following table
divide each entry by 40

        (k=1)                  (k=2)
        j=1   j=2              j=1   j=2
i=1      5     1        i=1     15     3
i=2      2     2        i=2      6     6

to obtain the joint multinomial probabilities. Note that the odds ratio at each level of Factor C
(k= 1,2) is 5, indicating that variables A and B are not conditionally independent given the level
of variable C. Hence, the log-linear model should include an interaction between Factors A and
B, but no three factor interaction. Further inspection shows that the 2x2 table for k=2 is a
multiple of the 2x2 table for k=1, so the joint distribution for variables A and B is the same at
each level of variable C. Hence, variables A and B are jointly independent of variable C and the
most parsimonious log-linear model is

$\log(m_{ijk}) = \lambda + \lambda_i^A + \lambda_j^B + \lambda_k^C + \lambda_{ij}^{AB}.$

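The two checks used in the example above (equal conditional odds ratios, and identical A-B distributions at each level of C) can be sketched in a few lines of Python. This is only an illustrative check on the example table; the names `counts`, `odds_ratio`, and `conditional_dist` are made up for this sketch.

```python
from fractions import Fraction

# Example table from the text, stored as counts[k][i][j];
# each entry is divided by the grand total (40) to get a probability.
counts = {
    1: [[5, 1], [2, 2]],
    2: [[15, 3], [6, 6]],
}

def odds_ratio(t):
    """Odds ratio of a 2x2 table t[i][j], kept as an exact fraction."""
    return Fraction(t[0][0] * t[1][1], t[0][1] * t[1][0])

def conditional_dist(t):
    """Distribution of (i, j) within one level of k."""
    s = sum(v for row in t for v in row)
    return [[Fraction(v, s) for v in row] for row in t]

total = sum(v for t in counts.values() for row in t for v in row)
ratios = {k: odds_ratio(t) for k, t in counts.items()}

# True when the A-B distribution is the same at every level of C,
# i.e. (A, B) is jointly independent of C.
same_ab_dist = conditional_dist(counts[1]) == conditional_dist(counts[2])

print(total, ratios, same_ab_dist)
```

Exact fractions are used so that the comparison of the two conditional distributions does not depend on floating-point rounding.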
(a) Each entry in the following 2x2x2 table should be divided by 23.

        (k=1)                  (k=2)
        j=1   j=2              j=1   j=2
i=1      2     4        i=1      4     2
i=2      3     4        i=2      3     1

(b) Each number in the following 2x2x3 table should be divided by 72.

        (k=1)                  (k=2)                  (k=3)
        j=1   j=2              j=1   j=2              j=1   j=2
i=1      3     1        i=1      9     3        i=1      6     2
i=2      6     2        i=2     18     6        i=2     12     4

(c) Each number in the following 2x2x3 table should be divided by 54.

        (k=1)                  (k=2)                  (k=3)
        j=1   j=2              j=1   j=2              j=1   j=2
i=1      2     4        i=1      3     6        i=1      5    10
i=2      5    10        i=2      2     4        i=2      1     2

(d) Each number in the following 3x3x2 table should be divided by 60.

           (k=1)                        (k=2)
        j=1  j=2  j=3                j=1  j=2  j=3
i=1      2    4    1          i=1     4    8    2
i=2      2    8    2          i=2     1    4    1
i=3      6    2    6          i=3     3    1    3

(e) Each number in the following 2x3x2x2 table should be divided by 192.

        (k=1, l=1)                   (k=2, l=1)
        j=1  j=2  j=3                j=1  j=2  j=3
i=1      2    1    3          i=1     6    3    9
i=2      4    2    6          i=2    12    6   18

        (k=1, l=2)                   (k=2, l=2)
        j=1  j=2  j=3                j=1  j=2  j=3
i=1      4   12   16          i=1     8   24   32
i=2      1    3    4          i=2     2    6    8

(f) Each number in the following 2x2x4 table should be divided by 49.
i=1
i=2
(k=1)
j=1 j=2
1
3
2
2
(k=2)
j=1 j=2
i=1
i=2
4
2
6
1
(k=3)
j=1 j=2
i=1
i=2
5
3
(k=4)
j=1 j=2
5
1
i=1
i=2
2
6
3
3
2. Construct a 2x2x2 table of expected counts for each of the following probability models. Your
table of expected counts should not conform to a more parsimonious model.
(a) Complete independence: $\pi_{ijk} = \pi_{i++}\pi_{+j+}\pi_{++k}$.
(b) $\pi_{ijk} = \pi_{i++}\pi_{+jk}$, but not complete independence.
(c) Conditional independence: $\pi_{ijk} = \pi_{i+k}\pi_{+jk}/\pi_{++k}$.
3. The data in this exercise come from a survey taken in Copenhagen, Denmark, which provides
information about satisfaction with housing conditions. A total of 1681 respondents were
cross-classified with respect to four factors.

Factor A:      Factor B:       Factor C:            Factor D: Influence
Housing Type   Satisfaction    Contact with
                               other residents      Low    Medium    High
Tower          Low             Low                   21      34       10
                               High                  14      17        3
               Medium          Low                   21      22       11
                               High                  19      23        5
               High            Low                   28      36       36
                               High                  37      40       23
Apartment      Low             Low                   61      43       26
                               High                  78      48       15
               Medium          Low                   23      35       18
                               High                  46      45       25
               High            Low                   17      40       54
                               High                  43      86       62
Atrium         Low             Low                   13       8        6
                               High                  20      10        7
               Medium          Low                    9       8        7
                               High                  23      22       10
               High            Low                   10      12        9
                               High                  20      24       21
Terraced       Low             Low                   18      15        7
                               High                  57      31        5
               Medium          Low                    6      13        5
                               High                  23      21        6
               High            Low                    7      13       11
                               High                  13      13       13

The primary objective of the study was to analyze the relationships between type of housing and
the other three factors. One might view type of housing as an explanatory factor and the other
three factors as response factors. Use symbols A, B, C, and D to label terms in the models you
consider in the analyses of these data. The data have been posted as madsen.dat with one line of
data for each cell in the table. On each line the levels of factors A, B, C, D are followed by the
count.
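A minimal sketch of reading a file in the layout just described and forming marginal counts is given below. It assumes whitespace-separated fields with integer level codes, which the assignment does not state; `read_cells`, `margin`, and the three sample lines are hypothetical and used only for illustration.

```python
from collections import defaultdict

def read_cells(lines):
    """Parse lines of the form 'a b c d count' into {(a, b, c, d): count}."""
    cells = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        *levels, count = fields
        cells[tuple(int(x) for x in levels)] = int(count)
    return cells

def margin(cells, keep):
    """Sum counts over every factor except those at the positions in `keep`."""
    totals = defaultdict(int)
    for levels, count in cells.items():
        totals[tuple(levels[p] for p in keep)] += count
    return dict(totals)

# Made-up lines in the assumed format. With the real file one would use:
#   with open("madsen.dat") as f:
#       cells = read_cells(f)
sample = ["1 1 1 1 21", "1 1 2 1 14", "1 2 1 1 21"]
cells = read_cells(sample)

# Marginal counts over factors A and B (positions 0 and 1).
ab_margin = margin(cells, keep=(0, 1))
print(ab_margin)
```

The same `margin` helper gives any other marginal table by changing `keep`, for example `keep=(2,)` for the one-way margin of factor C.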
(A) This part of the problem can be done without the use of a computer. Do not fit any of the
models specified below to obtain maximum likelihood estimates for model parameters.
Consider only the largest hierarchical log-linear model that satisfies the specified condition
and report the following information for the model:
(i) A formula for the log-linear model.
(ii) Degrees of freedom for the $X^2$ and $G^2$ tests of fit for the specified model against the
general multinomial alternative.
(iii) Sets of marginal counts that provide minimal sufficient statistics, e.g.,
$\{Y_{ij++}\}$, $\{Y_{++k+}\}$, $\{Y_{+++l}\}$
(iv) State what each model implies about independence or conditional independence of the
four factors:
A: Type of housing
B: Level of satisfaction with current housing conditions
C: Level of contact with other residents
D: Level of influence on housing management
Answer (i), (ii), (iii) and (iv) for each of the models shown below.
(a) $\lambda_{ik}^{AC} = 0$ for all $(i,k)$
(b) $\lambda_{ik}^{AC} = 0$ for all $(i,k)$ and $\lambda_{jl}^{BD} = 0$ for all $(j,l)$
(c) $\lambda_{ijl}^{ABD} = 0$ for all $(i,j,l)$ and $\lambda_{ikl}^{ACD} = 0$ for all $(i,k,l)$
(d) $\lambda_{ijk}^{ABC} = 0$ for all $(i,j,k)$ and $\lambda_{ikl}^{ACD} = 0$ for all $(i,k,l)$
and $\lambda_{ijl}^{ABD} = 0$ for all $(i,j,l)$
(e) $\lambda_{ij}^{AB} = 0$ for all $(i,j)$ and $\lambda_{il}^{AD} = 0$ for all $(i,l)$
and $\lambda_{ik}^{AC} = 0$ for all $(i,k)$.
(B) Use any computer software package of your choice to find the most parsimonious log-linear
model that provides a good description of the housing satisfaction data.
(i) Write a paragraph describing your strategy for selecting a model. Include a brief
description of how you checked the fit of the model. Say just enough to convince me
that you have a good model.
(ii) Report estimates of λ-terms for your model and corresponding standard errors. You may
attach computer printout for this part, but do not submit any other computer print out in
your answer to this problem.
(iii) Explain what your model implies about associations among the four factors that were
examined in this study.
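The degrees of freedom asked for in part (A)(ii) follow from counting parameters: the number of cells minus the number of free λ-terms in the hierarchical model. The sketch below shows that count under the usual convention that a term involving a set of factors contributes the product of (levels - 1) over those factors. The function name `loglinear_df` and the generator notation (strings such as "AB") are assumptions made for this illustration, and the example models are deliberately not the ones listed in part (A).

```python
from itertools import combinations
from math import prod

def loglinear_df(generators, levels):
    """Residual degrees of freedom of a hierarchical log-linear model.

    generators: highest-order terms, e.g. ["AB", "C"]
    levels:     number of levels per factor, e.g. {"A": 2, "B": 2, "C": 2}
    """
    # Hierarchy: every non-empty subset of a generator is also in the model.
    terms = set()
    for g in generators:
        for r in range(1, len(g) + 1):
            terms.update(combinations(sorted(g), r))
    # One intercept plus prod(levels - 1) free parameters per term.
    n_params = 1 + sum(prod(levels[f] - 1 for f in t) for t in terms)
    n_cells = prod(levels.values())
    return n_cells - n_params

# Model from the worked example in problem 1: (AB, C) on a 2x2x2 table.
print(loglinear_df(["AB", "C"], {"A": 2, "B": 2, "C": 2}))

# Mutual independence of four factors with 4, 3, 2 and 3 levels.
housing_levels = {"A": 4, "B": 3, "C": 2, "D": 3}
print(loglinear_df(["A", "B", "C", "D"], housing_levels))
```

The saturated model, written here as `["ABCD"]`, leaves zero degrees of freedom, which is a quick sanity check on the counting.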
4. A random sample of 6851 medical records of women who gave birth was examined in a study of
the effects of smoking on infant mortality. The data were cross-classified into the following
2x2x2x2 contingency table with respect to the variables:
Mother’s Age (A): < 30 years, or > 30 years
Smoking Habit (S): 5 or fewer cigarettes per day, or more than 5 cigarettes per day
Gestational Age of the fetus at time of birth (G): 197-260 days, or more than 260 days
Survival of the infant at the end of one year (M): yes or no

                                                    Survival at the end of one year
Gestational Age    Mother’s Age    Smoking Habit      No (i=1)     Yes (i=2)
197-260 (l=1)      < 30 (k=1)      > 5 (j=1)               9            40
                                   ≤ 5 (j=2)              50           315
                   > 30 (k=2)      > 5                     4            11
                                   ≤ 5                    41           147
≥ 261 (l=2)        < 30            > 5                     6           459
                                   ≤ 5                    24          4012
                   > 30            > 5                     1           124
                                   ≤ 5                    14          1594

In this study one might consider the mother’s age and smoking habit as explanatory variables
and survival as a response variable. Gestational age at birth could be considered as either a
response or an explanatory variable. Use the letters A, S, G, and M in your notation for
describing log-linear models.
(a) Using the four combinations of mother’s age and gestational age categories to partition
the data into four 2x2 tables, compute the Mantel-Haenszel estimator for the ratio of odds
of infant mortality for the two smoking categories.
Mantel-Haenszel estimator = ___________
95% confidence interval = ________
(b) Write out the formula for the largest log-linear model that satisfies the assumption of
homogeneous odds ratios used in part (a).
(c) Report the values of the $X^2$ and $G^2$ tests of the fit of the model in part (b) against the
general alternative. Also report degrees of freedom and p-values. State your conclusion.
(d) Report values of the Breslow-Day and T4 tests of homogeneity of odds ratios. Report
degrees of freedom and p-values for each test. Do these results agree with the results in
part (c)? Is this to be expected? Explain.
(e) Examine log-linear models to determine how mother’s age and smoking habit are
associated with gestational age at birth and infant mortality at the end of one year.
Identify the model that you think is most appropriate for these data. Report estimates for
the parameters in your model and their standard errors. Interpret your results with respect
to what they indicate about the relative risk of premature births (gestational age less than
261 days) and infant mortality at the end of one year. State your conclusions in words
that health care professionals could understand. Feel free to comment on what you
perceive to be the main limitations of this study.
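
The Mantel-Haenszel estimator in part (a) and the fit statistics in part (c) have simple closed forms, sketched below. The strata used for the estimator are the two 2x2 tables from the worked example in problem 1, not the infant mortality data, and the observed and expected counts passed to `fit_statistics` are made-up numbers. The function names are hypothetical, and the confidence interval is not covered here.

```python
from math import log

def mantel_haenszel(strata):
    """Mantel-Haenszel common odds ratio estimate.

    Each stratum is a 2x2 table given as (a, b, c, d), read row by row,
    so the stratum odds ratio is a*d / (b*c).
    """
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

def fit_statistics(observed, expected):
    """Pearson X^2 and deviance G^2 for matching lists of cell counts."""
    x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Cells with an observed count of zero contribute nothing to G^2.
    g2 = 2 * sum(o * log(o / e) for o, e in zip(observed, expected) if o > 0)
    return x2, g2

# Strata from the example in problem 1; both have odds ratio 5,
# so the pooled estimate should be close to 5 as well.
strata = [(5, 1, 2, 2), (15, 3, 6, 6)]
mh = mantel_haenszel(strata)

# Made-up observed and expected counts, for illustration only.
x2, g2 = fit_statistics([10, 20], [15, 15])
print(mh, x2, g2)
```

In practice the expected counts would come from the fitted log-linear model, and the p-values from a chi-square distribution with the model's residual degrees of freedom.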