Stat 557 Fall 2000 Exam 2 Solutions

Problem 1

As expected, most students used a single log-linear model to analyze these data. Other approaches were also used. Some students fit two separate log-linear models, one where the table was collapsed across the levels of use of other health care providers (O) and the other where the table was collapsed across the levels of use of doctor visits (D). The data can also be analyzed with logistic regression models, in which case the model searching methods in the SAS LOGISTIC procedure could be used. One approach would fit two logistic regression models, one using the log-odds of a doctor visit as the response and the other using the log-odds of a visit to an alternative health care provider as the response. The joint response for visits to doctors and other health care providers, however, has four categories, and three logistic regression models would be needed to replicate the log-linear model analysis. A key part of this approach is choosing an informative set of logits. We will only comment on the log-linear model approach in this solution.

(a) Summary: There is essentially no difference in demand for doctor visits for single adults covered by the government and private health insurance plans in Australia. The data provide only tentative evidence of a small increase in the demand for services from other health care providers by single adults covered by private health insurance. These results hold after making adjustments for age, sex, and the presence or absence of chronic illness and recent short term illness. About 55.7% of the respondents were covered by government health insurance. Government health insurance covers a slightly higher proportion of males (58.6%) than females (53.0%). Overall, females are 60

(b) Before fitting complicated models, a good way to start the analysis is an examination of percentages and two-way tables.
This provides the following information:

In this sample of single adults in Australia, 20.2% of the respondents visited a doctor and 9.1% visited some other health care provider in the last two weeks.

20.37% of the 2892 single adults covered by government programs and

8.85% of the 2892 single adults covered by government programs and

Females (24.4

Females (11.6

Demand for both doctor and other health care services tends to increase with age.

Respondents with recent illness are about 4 times more likely to visit a doctor and 3 times more likely to visit other health care providers than respondents without recent illness.

Respondents with chronic conditions are about 2 times more likely to visit a doctor and 3 times more likely to visit other health care providers than respondents without chronic conditions.

About 55.7% of the respondents were covered by government health insurance.

A slightly higher proportion of males (58.6%) than females (53.0%) were covered by government health insurance.

There was essentially no difference in the proportion of respondents with chronic conditions (55.5%) and the proportion of respondents without chronic conditions (56.0%) who were covered by government health insurance.

There was also very little difference in the proportion of respondents with recent illness (56.3%) and the proportion of respondents with no recent illness (54.3%) who were covered by government health insurance.

The government provided health care coverage for 56% of the 20-29 year old respondents, about 40% of the 30-49 year old respondents, 52% of the 50-59 year old respondents, 61% of the 60-69 year old respondents, and 66% of the respondents over 70 years old.

These results suggest that a log-linear model will have to account for effects of age, sex, recent illness and chronic illness on demand for services from doctors and other health care providers, but there may be little difference between government and private insurance coverage.
They also suggest that a log-linear analysis will have to account for age and sex differences in coverage rates for the government and private insurance plans. These results give little insight into higher order associations. Of course, conditional associations identified by a log-linear analysis could differ from the marginal associations. A few students made good use of mosaic plots to visually examine conditional associations with the demand for doctor visits or visits with other health care providers.

(c) Most people used the step( ) or stepAIC( ) functions in S-PLUS to choose a model, starting with some simple model like complete independence. For these data, this approach tended to make the model too complicated. A number of interaction terms that are only close to being significant at the .05 level are put into the model. Consequently, there is a rather wide range of models that seem to provide a good description of these data. It was okay to select one of the more complicated models in this range for your answer as long as you clearly distinguished between the substantial effects and the weaker effects.
I took a model identified by the stepAIC( ) function and used backward elimination, dropping the least significant interaction at each step without violating the hierarchical modeling criterion, to select a simpler model:

\[
\begin{aligned}
\log(m_{ijklmrt}) = {} & \lambda + \lambda^{S}_{i} + \lambda^{A}_{j} + \lambda^{H}_{k} + \lambda^{I}_{l} + \lambda^{C}_{m} + \lambda^{D}_{r} + \lambda^{O}_{t} \\
& + \lambda^{SH}_{ik} + \lambda^{SC}_{im} + \lambda^{HC}_{km} + \lambda^{SA}_{ij} + \lambda^{SI}_{il} + \lambda^{AI}_{jl} \\
& + \lambda^{IC}_{lm} + \lambda^{AO}_{jt} + \lambda^{SO}_{it} + \lambda^{SD}_{ir} + \lambda^{AD}_{jr} + \lambda^{AC}_{jm} \\
& + \lambda^{AH}_{jk} + \lambda^{ID}_{lr} + \lambda^{IO}_{lt} + \lambda^{CD}_{mr} + \lambda^{CO}_{mt} + \lambda^{DO}_{rt} \\
& + \lambda^{SHC}_{ikm} + \lambda^{SAI}_{ijl} + \lambda^{SAD}_{ijr} + \lambda^{AIC}_{jlm}
\end{aligned}
\]

Using the coding given in the statement of the problem,

S = sex
A = age group
H = type of health insurance
I = presence/absence of recent illness
C = presence/absence of chronic illness
D = doctor visits
O = visits to other health care providers

This model implies that, conditional on the levels of the other factors, demands for doctor services are about the same for single adults covered by government and private insurance plans. It also implies that, conditional on the levels of other factors, demands for services from other health care providers are about the same for single adults covered by government and private insurance plans. This coincides with results from the two-way tables of marginal counts.

There are three two-factor interaction terms with doctor visits in the model that are not involved in higher order interactions. The strength and direction of these associations are consistent across the levels of the other factors. Estimates were obtained from the GENMOD procedure in SAS, where the interaction terms are constrained to be zero at the highest level of any factor involved in the interaction.

Recent illness: $\hat\lambda^{ID}_{11} = 1.38$ corresponds to an estimated odds ratio of 3.97, and an approximate 95% confidence interval for the odds ratio is (3.19, 4.94). This implies that single adults with a recent illness are from 3 to 5 times as likely to visit a doctor as single adults without a recent illness.
Chronic illness: $\hat\lambda^{CD}_{11} = 0.396$ corresponds to an estimated odds ratio of 1.49, and an approximate 95% confidence interval for the odds ratio is (1.20, 1.84). This implies that single adults with a chronic illness are from 20 to 84 percent more likely to visit a doctor than single adults without a chronic illness.

Visit to other health care providers: $\hat\lambda^{DO}_{11} = 0.241$ corresponds to an estimated odds ratio of 1.27, and an approximate 95% confidence interval for the odds ratio is (1.08, 1.50). This implies that single adults who visited another health care provider in the last two weeks are from 8 to 50 percent more likely to visit a doctor than single adults who did not visit some other health care provider.

Sex and Age are involved in a three-way interaction with demand for doctor visits. This implies, for example, that changes in demand for doctor visits across age groups are not the same for single males and single females in Australia. The following table of PROC GENMOD estimates of the $\hat\lambda^{AD}_{jr}$ terms shows that demand for doctor visits by single males in Australia tends to become stronger as age increases.

visit          20-29    30-39    40-49    50-59    60-69    70+
yes (r = 1)   -1.158   -0.951   -0.747   -0.521   -0.726    0.00
no  (r = 2)    0.00     0.00     0.00     0.00     0.00     0.00

Applying the exponential function to these estimates to obtain mle's of odds ratios, we see that males older than 70 are about 3 times as likely as 20-39 year old males to visit doctors and about 2 times as likely as 40-69 year old males to visit doctors.

There is a weaker and slightly different trend across age groups for single women. This is seen by computing the corresponding table of values for $\hat\lambda^{AD}_{jr} + \hat\lambda^{SAD}_{1jr}$ shown below.
visit          20-29    30-39    40-49    50-59    60-69    70+
yes (r = 1)   -0.46    -0.42    -0.74    -0.17    -0.30     0.00
no  (r = 2)    0.00     0.00     0.00     0.00     0.00     0.00

Applying the exponential function to these estimates to obtain mle's of odds ratios, we see that females older than 70 are about 50% more likely to visit doctors than 20-39 year old females, about 2 times as likely to visit doctors as 40-49 year old females, and about 20% to 35% more likely to visit doctors than 50-69 year old females.

Alternatively, you could examine how differences between male and female demands for doctor services differ across age groups. We will not show those results here.

For the model we selected, the $\lambda^{HO}_{kt}$ term was not quite significant at the .05 level. There is only weak evidence that demand of other health care providers by single adults was higher (about 15

Other two-factor interactions involving demand for other health care providers did not involve three-factor interactions. Hence, these two-factor associations are approximately consistent across the levels of the other factors.

Sex: $\hat\lambda^{SO}_{11} = 0.28$ corresponds to an estimated odds ratio of 1.32, and an approximate 95% confidence interval for the odds ratio is (1.07, 1.64). This implies that single females are from 7 to 64 percent more likely to visit other health care providers than single males.

Age: The following table of PROC GENMOD estimates of the $\lambda^{AO}_{jt}$ terms shows that demand for visits to other health care providers by single adults in Australia is weakest in the 20-29 age group and strongest in the 70+ age group.

visit          20-29    30-39    40-49    50-59    60-69    70+
yes (t = 1)   -0.867   -0.614   -0.578   -0.656   -0.431    0.00
no  (t = 2)    0.00     0.00     0.00     0.00     0.00     0.00

Recent illness: $\hat\lambda^{IO}_{11} = 0.729$ corresponds to an estimated odds ratio of 2.07, and an approximate 95% confidence interval for the odds ratio is (1.55, 2.77).
This implies that single adults with a recent illness are approximately 2 times as likely to visit a non-doctor health care provider as single adults without a recent illness.

Chronic illness: $\hat\lambda^{CO}_{11} = 0.590$ corresponds to an estimated odds ratio of 1.80, and an approximate 95% confidence interval for the odds ratio is (1.42, 2.29). This implies that single adults with a chronic illness are from 42 to 129 percent more likely to visit a non-doctor health care provider than single adults without a chronic illness.

Visit with a doctor: $\hat\lambda^{DO}_{11} = 0.241$ corresponds to an estimated odds ratio of 1.27, and an approximate 95% confidence interval for the odds ratio is (1.08, 1.50). This implies that single adults who visited a doctor in the last two weeks are from 8 to 50 percent more likely to visit some other health care provider than single adults who did not visit a doctor.

The log-linear analysis also provides insight into differences among single adults covered by government and private health insurance in Australia with respect to sex, age, recent illness, and chronic illness. These differences would have been of greater interest if the conditional associations between insurance plans and demand for doctor visits and visits to other health care providers had not agreed so well with the results from the two-way marginal tables of counts. We will briefly report the results implied by the log-linear model we selected for these data.

Age: The following table of PROC GENMOD estimates of the $\lambda^{AH}_{jk}$ terms shows that the enrollment rate in private health insurance is highest for the 30-39 age group and decreases as age increases. This corresponds to what was seen in the corresponding two-way marginal table of counts.

insurance      20-29    30-39    40-49    50-59    60-69    70+
gov. (k = 1)  -0.583   -1.305   -1.084   -0.642   -0.253    0.00
pri. (k = 2)   0.00     0.00     0.00     0.00     0.00     0.00

Recent illness: The $\lambda^{HI}_{kl}$ interaction was deleted from the model because it was not significant at the 0.15 level. This implies that incidence rates of illness in the last two weeks were about the same for single adults covered by the government and private insurance plans.

Sex and chronic illness are involved in a three-way interaction with health care coverage. This implies, for example, that the difference between incidence rates of chronic disease for single adults enrolled in the government and private health insurance plans is not the same for males and females. The PROC GENMOD estimate $\hat\lambda^{HC}_{11} = -0.256$ corresponds to an estimated odds ratio of 0.774, and an approximate 95% confidence interval for the odds ratio is (0.66, 0.91). The incidence of chronic disease is lower for single males covered by government insurance than for single males covered by private health insurance. Since $\hat\lambda^{HC}_{11} + \hat\lambda^{SHC}_{111} = -0.256 + 0.329 = 0.073$ corresponds to an estimated odds ratio of 1.07, the incidence of chronic disease is slightly higher among single females covered by government insurance than among single females covered by private health insurance.

(d) Since over 50% of the estimates of the mean counts are smaller than 5 and many are smaller than 0.5, the chi-square approximation to the Pearson $X^2$ statistic or the $G^2$ statistic may not provide a reliable p-value for testing the fit of the selected model against the general alternative of 384 independent Poisson counts with arbitrary positive means. Nevertheless, for this model $X^2 = 279.56$ and $G^2 = 297.32$ are both smaller than 314, the degrees of freedom for the chi-square approximation. Hence, the model appears to provide a good summary of the variation in the observed counts.
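The 384 cells and 314 degrees of freedom quoted in part (d) can be checked by counting parameters in the selected hierarchical model. The following is a minimal sketch, assuming six age groups and two levels for every other factor (as the tables above suggest) and the usual rule that a term contributes the product of (levels - 1) over its factors. The names `levels`, `terms`, and `term_df` are illustrative and are not part of the original S-PLUS or SAS analysis.

```python
from math import prod

# Assumed number of levels per factor: six age groups, two levels otherwise.
levels = {"S": 2, "A": 6, "H": 2, "I": 2, "C": 2, "D": 2, "O": 2}

# Terms in the selected model: main effects, two-factor and three-factor interactions.
terms = [
    "S", "A", "H", "I", "C", "D", "O",
    "SH", "SC", "HC", "SA", "SI", "AI",
    "IC", "AO", "SO", "SD", "AD", "AC",
    "AH", "ID", "IO", "CD", "CO", "DO",
    "SHC", "SAI", "SAD", "AIC",
]

def term_df(term):
    """Free parameters for one term: product of (levels - 1) over its factors."""
    return prod(levels[factor] - 1 for factor in term)

n_cells = prod(levels.values())                  # cells in the 7-way table
n_params = 1 + sum(term_df(t) for t in terms)    # intercept plus all model terms
resid_df = n_cells - n_params                    # residual degrees of freedom

# Ratios of the reported fit statistics to the residual degrees of freedom.
x2_ratio = 279.56 / resid_df
g2_ratio = 297.32 / resid_df

print(n_cells, n_params, resid_df)
print(round(x2_ratio, 3), round(g2_ratio, 3))
```

Both ratios fall below 1, which is the informal check the solution relies on when the chi-square approximation is in doubt.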
There is no indication of extra-Poisson variation, and it is not necessary to entertain negative binomial distributions for the counts or some other distribution that allows variances of the counts to be larger than the means. An examination of the various types of residuals produced by the GENMOD procedure in SAS revealed no extreme values. A plot of the observed counts versus the estimated means shows little variation from a 45 degree line.

Problem 2

(a) Summary: For both species, the proportion of eggs that produce male turtles decreases as temperature increases. The change occurs over a smaller temperature interval for species 1 than for species 2. For species 1 eggs collected in Iowa, the proportion of male hatchlings decreases from 95 percent to 5 percent as the incubation temperature increases from about 27.69 °C to 29.04 °C, but for species 2 eggs collected in Iowa, the proportion of male hatchlings decreases from 95 percent to 5 percent as the incubation temperature increases from about 26.95 °C to 30.25 °C. The results are similar for eggs collected in Louisiana, but the temperature intervals are shifted by about 1.5 °C for both species.

(b) A search for a good model might begin by plotting the observed proportion of male hatchlings against temperature for each species within each location. One could also examine plots of empirical logits, omitting cases where the observed percentage of male hatchlings was either 0 or 100 percent. The next step is to fit four separate logistic regression models, one for each species within each location. For species 2 eggs, these preliminary analyses revealed that logistic regression models with just a linear temperature trend were sufficient to model the decrease in the proportion of male hatchlings as incubation temperature increases. Furthermore, the curves for species 2 were nearly parallel, suggesting that the temperature coefficient might be the same at both locations.
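The empirical-logit step described in part (b) can be sketched as follows. This is an illustration only: the counts in `example` are hypothetical and are not the exam data, and the function name `empirical_logits` is an assumption made for this sketch. Cases with 0 or 100 percent males are dropped because the logit is undefined there, as the solution suggests.

```python
import math

def empirical_logits(records):
    """Return (temperature, empirical logit) pairs.

    Each record is (temperature, number of males, number of eggs).
    Cases with 0% or 100% males are omitted because log(p / (1 - p))
    is undefined at those proportions.
    """
    result = []
    for temperature, males, eggs in records:
        if males == 0 or males == eggs:
            continue
        p = males / eggs
        result.append((temperature, math.log(p / (1.0 - p))))
    return result

# Hypothetical counts for one species at one location (NOT the exam data).
example = [(27.0, 10, 10), (28.0, 8, 10), (29.0, 5, 10), (30.0, 0, 10)]
logits = empirical_logits(example)

for temperature, value in logits:
    print(temperature, round(value, 3))
```

Plotting these values against temperature for each species and location gives a rough check on whether a linear or a quadratic trend in the logit is needed.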
A logistic regression model with different intercepts for the two locations and the same coefficient for the linear temperature component was found to provide an adequate model for the species 2 eggs. The preliminary analyses revealed a much sharper temperature effect for the species 1 eggs. At each location, both linear and quadratic temperature trends were needed in the logistic regression models. Further comparison of those logistic regression models showed that the same intercept could be used for both locations and the coefficient for the quadratic temperature effect could also be the same for both locations, but the linear temperature effects required different coefficients for the two locations.

(c) The final model is shown below with a standard error shown in parentheses beneath each estimated coefficient. Here $\pi_{ij}$ denotes the conditional probability that a male turtle emerges from an egg collected from the i-th species in the j-th location when it is incubated at the specified temperature.

Species 1 in Iowa:
\[
\log\left(\frac{\hat\pi_{11}}{1 - \hat\pi_{11}}\right) = \underset{(1.5854)}{2.0157} + \underset{(1.3996)}{3.3088}\,(\text{temperature} - 26) - \underset{(0.3160)}{1.6286}\,(\text{temperature} - 26)^2
\]

Species 1 in Louisiana:
\[
\log\left(\frac{\hat\pi_{12}}{1 - \hat\pi_{12}}\right) = \underset{(1.5854)}{2.0157} + \underset{(1.6201)}{6.0170}\,(\text{temperature} - 26) - \underset{(0.3160)}{1.6286}\,(\text{temperature} - 26)^2
\]

Species 2 in Iowa:
\[
\log\left(\frac{\hat\pi_{21}}{1 - \hat\pi_{21}}\right) = \underset{(0.2876)}{4.6402} - \underset{(0.0997)}{1.7840}\,(\text{temperature} - 26)
\]

Species 2 in Louisiana:
\[
\log\left(\frac{\hat\pi_{22}}{1 - \hat\pi_{22}}\right) = \underset{(0.4258)}{7.2936} - \underset{(0.0997)}{1.7840}\,(\text{temperature} - 26)
\]

These curves are displayed in the following figure.

For species 2 eggs, from either Iowa or Louisiana, a one degree C increase in incubation temperature corresponds to about an 83 percent decrease in the odds for males (a factor of exp(-1.784) = 0.168). The intercepts for the Iowa and Louisiana curves, 4.64 and 7.29, respectively, correspond to hatch rates of more than 99 percent males at 26 °C. For species 1 eggs, an intercept of 2.0157 corresponds to hatch rates of about 88 percent males at 26 °C, but this is not well estimated.
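The figures quoted for these curves (hatch rates at 26 °C, the odds factor for species 2, and the temperatures at which 95 and 5 percent of hatchlings are male) can be reproduced from the reported coefficients. The sketch below is a check on the arithmetic, not the original SAS or S-PLUS code. The names `COEF`, `prob_male`, and `temp_at` are illustrative, and for species 1 the quadratic is solved on its descending branch, which is an assumption about which root is of interest.

```python
import math

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    """Log-odds of a probability p."""
    return math.log(p / (1.0 - p))

# (intercept, linear, quadratic) coefficients in x = temperature - 26,
# copied from the fitted model reported in part (c).
COEF = {
    ("species 1", "Iowa"): (2.0157, 3.3088, -1.6286),
    ("species 1", "Louisiana"): (2.0157, 6.0170, -1.6286),
    ("species 2", "Iowa"): (4.6402, -1.7840, 0.0),
    ("species 2", "Louisiana"): (7.2936, -1.7840, 0.0),
}

def prob_male(key, temperature):
    """Fitted probability of a male hatchling at the given temperature."""
    b0, b1, b2 = COEF[key]
    x = temperature - 26.0
    return expit(b0 + b1 * x + b2 * x * x)

def temp_at(key, p):
    """Temperature at which the fitted proportion of males equals p.

    For the quadratic (species 1) curves the larger root is returned,
    i.e. the branch on which the proportion of males is decreasing.
    """
    b0, b1, b2 = COEF[key]
    target = logit(p)
    if b2 == 0.0:
        x = (target - b0) / b1
    else:
        disc = b1 * b1 - 4.0 * b2 * (b0 - target)
        x = (-b1 - math.sqrt(disc)) / (2.0 * b2)
    return 26.0 + x

odds_factor = math.exp(-1.7840)  # change in odds per degree C, species 2

for key in COEF:
    print(key, round(prob_male(key, 26.0), 4),
          round(temp_at(key, 0.95), 2), round(temp_at(key, 0.05), 2))
```

The same functions can be used to compare the Iowa and Louisiana curves at any temperature of interest.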
For species 1 eggs from Iowa, an increase in the incubation temperature from T to T + 1 corresponds to a change of about $1.680 - 3.2572(T - 26)$ in the log-odds that a male hatches from an egg. This follows from the fitted curve because $(T + 1 - 26)^2 - (T - 26)^2 = 2(T - 26) + 1$, so the change is $3.3088 - 1.6286\,[2(T - 26) + 1]$. For species 1 eggs from Louisiana, an increase in the incubation temperature from T to T + 1 corresponds to a change of about $4.388 - 3.2572(T - 26)$ in the log-odds that a male hatches from an egg.

(d) Plots of the observed proportions overlaid with the estimated curves described above do not reveal any obvious deficiencies in the proposed model. Examination of residuals does not reveal any obvious deficiencies either. In this case, it would be a good idea to make a separate plot of the residuals versus incubation temperature for each species in each location and to use a data smoother to put a smooth curve on each residual plot. (Nobody did this.)

The deviance value is $G^2 = 21.72$ for testing the null hypothesis that the four curves provided by our model are appropriate against the general alternative that there are 40 independent binomial random variables for the observed number of eggs that produce male turtles (for the ten incubation temperatures used with eggs collected from each of the two species at each of the two locations). This test has 33 degrees of freedom. The presence of some small estimates of expected counts for either males or females may prevent a large sample chi-square approximation from providing a reliable p-value. Nevertheless, the $G^2$ value is smaller than the 33 degrees of freedom, which suggests that the observed counts are consistent with the proposed model and there is no need to consider models to account for extra-binomial variation.

Scores: Here is a stem-leaf display of the scores for this exam. Each problem was scored on a 50 point scale.

9 | 1
8 | 888
8 | 111122
7 | 6667888999
7 | 222234
6 | 56688
6 | 02