Stat 557, Fall 2002
Assignment #4

Reading Assignment: Chapter 3 and Sections 6.1 and 6.3 in Lloyd’s book

Written Assignment: Due Monday, October 21, in class. Solutions will be posted on Monday at 5 pm. No late assignments will be accepted.

Midterm Exam: Thursday, October 24, 7:00-9:00 p.m., in 2245 Coover.

1. For each of the following tables of true cell probabilities, give a formula for the most parsimonious log-linear model and provide a verbal description of the associations, independence, or conditional independence among the variables. The cell probabilities are obtained by dividing the number in each cell by the sum of the numbers in the entire table. You should be able to do this without using a computer to fit models to the tables (although you can use a computer to check your answers). Determine the model by comparing odds ratios and comparing distributions across columns, rows, or tables.

For example, in the following table divide each entry by 40 to obtain the joint multinomial probabilities.

             (k=1)                      (k=2)
           j=1  j=2                   j=1  j=2
      i=1    5    1              i=1   15    3
      i=2    2    2              i=2    6    6

Note that the odds ratio at each level of Factor C (k = 1, 2) is 5. Because this value differs from 1, variables A and B are not conditionally independent given the level of variable C; because it is the same at both levels of C, no three-factor interaction is needed. Hence, the log-linear model should include an interaction between Factors A and B, but no three-factor interaction. Further inspection shows that the 2x2 table for k=2 is a multiple of the 2x2 table for k=1, so the joint distribution of variables A and B is the same at each level of variable C. Hence, variables A and B are jointly independent of variable C and the most parsimonious log-linear model is

      $\log(m_{ijk}) = \lambda + \lambda_i^{A} + \lambda_j^{B} + \lambda_k^{C} + \lambda_{ij}^{AB}$.

(a) Each entry in the following 2x2x2 table should be divided by 23.

             (k=1)                      (k=2)
           j=1  j=2                   j=1  j=2
      i=1    2    4              i=1    4    2
      i=2    3    4              i=2    3    1

(b) Each number in the following 2x2x3 table should be divided by 72.

             (k=1)                      (k=2)                      (k=3)
           j=1  j=2                   j=1  j=2                   j=1  j=2
      i=1    3    1              i=1    9    3              i=1    6    2
      i=2    6    2              i=2   18    6              i=2   12    4

(c) Each number in the following 2x2x3 table should be divided by 54.

             (k=1)                      (k=2)                      (k=3)
           j=1  j=2                   j=1  j=2                   j=1  j=2
      i=1    2    4              i=1    3    6              i=1    5   10
      i=2    5   10              i=2    2    4              i=2    1    2

(d) Each number in the following 3x3x2 table should be divided by 60.

               (k=1)                           (k=2)
           j=1  j=2  j=3                   j=1  j=2  j=3
      i=1    2    4    1              i=1    4    8    2
      i=2    2    8    2              i=2    1    4    1
      i=3    6    2    6              i=3    3    1    3

(e) Each number in the following 2x3x2x2 table should be divided by 192.

             (k=1, l=1)                      (k=2, l=1)
           j=1  j=2  j=3                   j=1  j=2  j=3
      i=1    2    1    3              i=1    6    3    9
      i=2    4    2    6              i=2   12    6   18

             (k=1, l=2)                      (k=2, l=2)
           j=1  j=2  j=3                   j=1  j=2  j=3
      i=1    4   12   16              i=1    8   24   32
      i=2    1    3    4              i=2    2    6    8

(f) Each number in the following 2x2x4 table should be divided by 49.

             (k=1)                      (k=2)
           j=1  j=2                   j=1  j=2
      i=1    1    3              i=1    4    2
      i=2    2    2              i=2    6    1

             (k=3)                      (k=4)
           j=1  j=2                   j=1  j=2
      i=1    5    3              i=1    2    6
      i=2    5    1              i=2    3    3

2. Construct a 2x2x2 table of expected counts for each of the following probability models. Your table of expected counts should not conform to a more parsimonious model.

(a) Complete independence: $\pi_{ijk} = \pi_{i++}\,\pi_{+j+}\,\pi_{++k}$.
(b) $\pi_{ijk} = \pi_{i++}\,\pi_{+jk}$, but not complete independence.
(c) Conditional independence: $\pi_{ijk} = \pi_{i+k}\,\pi_{+jk} / \pi_{++k}$.

3. The data in this exercise come from a survey taken in Copenhagen, Denmark, which provides information about satisfaction with housing conditions. A total of 1681 respondents were cross-classified with respect to four factors.

                                     Factor C:
      Factor A:       Factor B:      Contact with         Factor D: Influence
      Housing Type    Satisfaction   Other Residents      Low   Medium   High
      Tower           Low            Low                   21       34     10
                                     High                  14       17      3
                      Medium         Low                   21       22     11
                                     High                  19       23      5
                      High           Low                   28       36     36
                                     High                  37       40     23
      Apartment       Low            Low                   61       43     26
                                     High                  78       48     15
                      Medium         Low                   23       35     18
                                     High                  46       45     25
                      High           Low                   17       40     54
                                     High                  43       86     62
      Atrium          Low            Low                   13        8      6
                                     High                  20       10      7
                      Medium         Low                    9        8      7
                                     High                  23       22     10
                      High           Low                   10       12      9
                                     High                  20       24     21
      Terraced        Low            Low                   18       15      7
                                     High                  57       31      5
                      Medium         Low                    6       13      5
                                     High                  23       21      6
                      High           Low                    7       13     11
                                     High                  13       13     13

The primary objective of the study was to analyze the relationships between type of housing and the other three factors. One might view type of housing as an explanatory factor and the other three factors as response factors. Use the symbols A, B, C, and D to label terms in the models you consider in the analyses of these data. The data have been posted as madsen.dat with one line of data for each cell in the table. On each line the levels of factors A, B, C, D are followed by the count.

(A) This part of the problem can be done without the use of a computer. Do not fit any of the models specified below to obtain maximum likelihood estimates for model parameters. Consider only the largest hierarchical log-linear model that satisfies the specified condition and report the following information for the model:

(i) A formula for the log-linear model.
(ii) Degrees of freedom for the $X^2$ and $G^2$ tests of fit for the specified model against the general multinomial alternative.
(iii) Sets of marginal counts that provide minimal sufficient statistics, e.g., $\{Y_{ij++}\}$, $\{Y_{++k+}\}$, $\{Y_{+++l}\}$.
(iv) State what each model implies about independence or conditional independence of the four factors:
      A: Type of housing
      B: Level of satisfaction with current housing conditions
      C: Level of contact with other residents
      D: Level of influence on housing management

Answer (i), (ii), (iii) and (iv) for each of the models shown below.

(a) $\lambda_{ik}^{AC} = 0$ for all (i, k)
(b) $\lambda_{ik}^{AC} = 0$ for all (i, k) and $\lambda_{il}^{AD} = 0$ for all (i, l)
(c) $\lambda_{ijl}^{ABD} = 0$ for all (i, j, l) and $\lambda_{ikl}^{ACD} = 0$ for all (i, k, l)
(d) $\lambda_{ijk}^{ABC} = 0$ for all (i, j, k) and $\lambda_{ikl}^{ACD} = 0$ for all (i, k, l) and $\lambda_{ijl}^{ABD} = 0$ for all (i, j, l)
(e) $\lambda_{ij}^{AB} = 0$ for all (i, j) and $\lambda_{jl}^{BD} = 0$ for all (j, l) and $\lambda_{ik}^{AC} = 0$ for all (i, k)

(B) Use any computer software package of your choice to find the most parsimonious log-linear model that provides a good description of the housing satisfaction data.

(i) Write a paragraph describing your strategy for selecting a model. Include a brief description of how you checked the fit of the model. Say just enough to convince me that you have a good model.
(ii) Report estimates of the λ-terms for your model and the corresponding standard errors. You may attach computer printout for this part, but do not submit any other computer printout in your answer to this problem.
(iii) Explain what your model implies about associations among the four factors that were examined in this study.

4. A random sample of 6851 medical records of women who gave birth was examined in a study of the effects of smoking on infant mortality. The data were cross-classified into the following 2x2x2x2 contingency table with respect to the variables:

      Mother’s Age (A): < 30 years, or > 30 years
      Smoking Habit (S): 5 or fewer cigarettes per day, or more than 5 cigarettes per day
      Gestational Age of the fetus at time of birth (G): 197-260 days, or more than 260 days
      Survival of the infant at the end of one year (M): yes or no

                                                          Survival at the end of one year
      Gestational Age    Mother’s Age    Smoking Habit      No (i=1)      Yes (i=2)
      197-260 (l=1)      < 30 (k=1)      > 5 (j=1)                 9             40
                                         ≤ 5 (j=2)                50            315
                         > 30 (k=2)      > 5                       4             11
                                         ≤ 5                      41            147
      ≥ 261 (l=2)        < 30            > 5                       6            459
                                         ≤ 5                      24           4012
                         > 30            > 5                       1            124
                                         ≤ 5                      14           1594

In this study one might consider the mother’s age and smoking habit as explanatory variables and survival as a response variable.
Gestational age at birth could be considered as either a response or an explanatory variable. Use the letters A, S, G, and M in your notation for describing log-linear models.

(a) Using the four combinations of mother’s age and gestational age categories to partition the data into four 2x2 tables, compute the Mantel-Haenszel estimator for the ratio of odds of infant mortality for the two smoking categories.

      Mantel-Haenszel estimator = ___________
      95% confidence interval = ___________

(b) Write out the formula for the largest log-linear model that satisfies the assumption of homogeneous odds ratios used in part (a).

(c) Report the values of the $X^2$ and $G^2$ tests of the fit of the model in part (b) against the general alternative. Also report degrees of freedom and p-values. State your conclusion.

(d) Report values of the Breslow-Day and T4 tests of homogeneity of odds ratios. Report degrees of freedom and p-values for each test. Do these results agree with the results in part (c)? Is this to be expected? Explain.

(e) Examine log-linear models to determine how mother’s age and smoking habit are associated with gestational age at birth and infant mortality at the end of one year. Identify the model that you think is most appropriate for these data. Report estimates for the parameters in your model and their standard errors. Interpret your results with respect to what they indicate about the relative risk of premature births (gestational age less than 261 days) and infant mortality at the end of one year. State your conclusions in words that health care professionals could understand. Feel free to comment on what you perceive to be the main limitations of this study.
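
The odds-ratio comparison used in the worked example of Problem 1, and the Mantel-Haenszel estimator named in Problem 4(a), can be illustrated with a short Python sketch. It uses only the example table from Problem 1 (the one whose entries are divided by 40). The layout `tables[k][i][j]` and the function names `odds_ratio`, `mantel_haenszel`, and `cell_proportions` are assumptions made for this illustration, not notation from the assignment. The estimator is the usual Mantel-Haenszel form, sum(a*d/n) / sum(b*c/n), taken over the 2x2 slices.

```python
from fractions import Fraction

# Example table from Problem 1, stored as tables[k][i][j]:
# one 2x2 slice per level k of C, rows i (variable A), columns j (variable B).
tables = [
    [[5, 1], [2, 2]],    # k = 1
    [[15, 3], [6, 6]],   # k = 2
]


def odds_ratio(t):
    """Cross-product ratio (n11 * n22) / (n12 * n21) of one 2x2 slice."""
    (a, b), (c, d) = t
    return Fraction(a * d, b * c)


def mantel_haenszel(slices):
    """Mantel-Haenszel estimate of a common odds ratio across 2x2 slices."""
    num = Fraction(0)
    den = Fraction(0)
    for (a, b), (c, d) in slices:
        n = a + b + c + d
        num += Fraction(a * d, n)
        den += Fraction(b * c, n)
    return num / den


def cell_proportions(t):
    """Within-slice proportions, used to compare the A-B distribution across k."""
    n = sum(sum(row) for row in t)
    return [[Fraction(x, n) for x in row] for row in t]


conditional_ors = [odds_ratio(t) for t in tables]
common_or = mantel_haenszel(tables)
same_distribution = cell_proportions(tables[0]) == cell_proportions(tables[1])

print(conditional_ors)    # odds ratio within each level of C
print(common_or)          # Mantel-Haenszel common odds ratio
print(same_distribution)  # True when the k=2 slice is a multiple of the k=1 slice
```

For this table both conditional odds ratios and the Mantel-Haenszel estimate equal 5, and the two slices have identical within-slice proportions, which matches the reasoning in the worked example. Applying `mantel_haenszel` to Problem 4(a) would require the four 2x2 smoking-by-survival tables, one for each combination of mother’s age and gestational age. The confidence interval is not computed in this sketch.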