STAT 557                    FINAL EXAM                    FALL 1996

NAME ______________

Instructions: Show your work and present your solutions in the space provided on this exam. Attach extra sheets of paper if more space is needed, but clearly indicate where this is done. You may use a calculator, pencils, erasers, and the formula sheets provided with this exam.

1. Respondents in a simple random sample of high school students in Ohio were cross-classified into the following table.

                                                        Occupational Aspirations
   Sex      Area of Residence   Socio-Economic Status       High       Low
   Male     Rural               High                         117        47
                                Low                           54        87
            Small Urban         High                         350        80
                                Low                           70        85
            Large Urban         High                         151        31
                                Low                           27        23
   Female   Rural               High                         102        69
                                Low                           52       119
            Small Urban         High                         338        96
                                Low                           44        99
            Large Urban         High                         148        35
                                Low                           17        39

A. Write out the log-linear model corresponding to the null hypothesis that the level of occupational aspirations is conditionally independent of the sex of the respondent given area of residence and socio-economic status. Use the following symbols in your formula.

   A denotes occupational aspiration (1 = high, 2 = low)
   E denotes economic status (1 = high, 2 = low)
   R denotes area of residence (1 = rural area, 2 = small urban area, 3 = large urban area)
   S denotes sex of respondent (1 = male, 2 = female)

B. The following log-linear model was fit to these data:

$$\log(m_{ijkl}) = \lambda + \lambda^{A}_{i} + \lambda^{E}_{j} + \lambda^{R}_{k} + \lambda^{S}_{l} + \lambda^{AE}_{ij} + \lambda^{AR}_{ik} + \lambda^{AS}_{il} + \lambda^{ER}_{jk} + \lambda^{AER}_{ijk}$$

The standard constraints were used: the main effects sum to zero, and the interaction effects sum to zero across all levels of any subscript.

(i) What are the degrees of freedom for testing the fit of this model against the general alternative?

(ii) All terms in this model are significant and this model fits the data well. What does this imply about independence or associations among the four factors: occupational aspirations (A), socio-economic status (E), area of residence (R), and sex of respondent (S)?

(iii) Maximum likelihood estimates of the $\lambda^{AE}$ and $\lambda^{AER}$ terms are shown below.

   Term                              Estimate    Std. error
   $\hat\lambda^{AE}_{11}$             0.416        .027
   $\hat\lambda^{AER}_{111}$          −0.091        .036
   $\hat\lambda^{AER}_{112}$           0.044        .034

Describe what these estimates reveal about the association between occupational aspirations and socio-economic status among high school students.

C. Using occupational aspirations (A) as the response variable, show how the model in part (B) would be written as a logistic regression model. Just give a formula for the logistic regression model; do not try to obtain numerical values for its parameters.

2. Forty corn fields were examined in a study to determine if the species of grass growing along the borders of the fields affects the average number of larvae of a certain insect species found in corn plants. Grass species A grew along the borders of 20 of these fields and grass species B grew along the borders of the other 20 fields. In each field, 20 corn plants were randomly selected from the plants growing at a distance of 3 meters from the borders of the field. The total number of larvae found in these 20 plants (Y) was counted. This produced a set of 40 counts, one count for each field. The researchers also recorded the mean daily temperature (T) and the mean daily rainfall (R) during the previous 30 days in each field. Temperature and rainfall are known to affect the number of larvae present in corn fields.

Explain how you would determine if the type of grass growing along the border of the field has any association with the mean number of insect larvae in corn plants growing 3 meters from the boundaries. Outline the steps you would take in performing a test or developing a model. Show formulas for tests or models you would use.

3. In a study of parental attitudes toward violence in movies, a random sample of n = 400 families was taken out of the population of families in Iowa that have a father, a mother, and a boy in high school. After watching a certain movie with some violent content, each member of the family was asked if the movie was too violent for viewing by teenagers.
Responses were coded as

   Yes   for too violent for teenagers
   No    for suitable for viewing by teenagers

Show how you would construct 95% confidence intervals for

A. The proportion of teenage boys who think the film is too violent for viewing by teenagers.

B. The difference in proportions of fathers and mothers who think the film is too violent for viewing by teenagers.

4. By mailing offers to a large number of people, a bank was able to add 14,565 new credit card users to its business. After two years, 762 of these credit card users had defaulted on their loans. These people were classified as "bad" outcomes, their credit cards were taken away, and their debt was written off as a loss to the bank. The other 13,803 customers were classified as "good" outcomes. In a first attempt to develop a model for predicting "bad" loans, the following logistic regression model was fit:

$$\log\left(\frac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 X_{1i}, \qquad i = 1, 2, \ldots, 14565$$

where $X_{1i}$ is a value of a composite score of financial variables (called the FICO score) for the i-th individual when their credit card was first issued and $\pi_i$ is the conditional probability of a "bad" outcome within the first two years. Use the computer output on page 6 to help answer the following questions.

A. Given that the maximum likelihood estimate for $\beta_1$ is reported as $\hat\beta_1 = -.488$, explain how you would interpret the association between the conditional probability of a "bad" outcome and the FICO score at the time when the credit card was first issued.

B. Give a formula for the log-likelihood function that was maximized to get the estimate in part A.

C. Carefully state the probability model that yields the log-likelihood function you stated in part B.

D. What does the concordant = 49.2% result on the output measure? Explain how you would interpret the Gamma = .076 result.

E. The FICO values in the data set range from 6.4 through 8.2. Evaluate the maximum likelihood estimate of $\pi$, the conditional probability of a "bad" outcome when the FICO score is 6.6.

F. Construct a 95% confidence interval for the conditional probability of a "bad" outcome when the FICO score is 6.6.

5. The data set from Problem 4 (the credit card data) actually contains seven explanatory variables that can be used to estimate the probability of a "bad" outcome. The variables are

   X1   FICO score when credit card is first issued
   X2   Number of trade lines opened in the first 6 months
   X3   Total high credits
   X4   (loan balance)/(credit limit) ratio
   X5   Time since most recent trade line was opened
   X6   (FICO score at mailing) − (FICO score when card is issued)
   X7   Total number of trades

The maximum likelihood estimate of the formula for the logistic regression model selected by one researcher is

$$\log\left(\frac{\pi_i}{1-\pi_i}\right) = \underset{(.975)}{1.147} - \underset{(.140)}{0.592}\,X_1 + \underset{(.053)}{0.199}\,X_2 + \underset{(.00043)}{.00129}\,X_4 - \underset{(.100)}{1.08}\,X_6 - \underset{(.0133)}{.0589}\,X_7$$

A. A local mean deviance plot for this model is shown on page 10. Explain how such a plot is constructed.

B. What does this local mean deviance plot tell you about the data and the model shown above?

C. Partial residual plots for this model are also shown on page 10 of this exam. Describe what these plots reveal.

EXAM SCORE                    COURSE GRADE