Statistics 404 Fall 2009 SECOND EXAM Solutions Name _______________________________

The U.S. Census Bureau estimates there are about 96,000 centenarians (.03% of the population) in the US, a number that is predicted to more than quadruple by 2030, reaching 1.15 million by 2050. Is it the glass of sherry or wine they drink each evening before going to bed, or is it centenarians’ genetic makeup that accounts for their longevity? Or is it centenarians’ cultural background? (For example, although the proportion of centenarians in the Japanese population is about the same as that in the US, in 2050 this proportion is projected to be double that of the US!) Then there is the issue of centenarians’ quality of life. For instance, although centenarians are much more likely to be female than male, men who reach 100 years of age are usually healthier than are same-aged women.

You have been hired by the National Centenarian Awareness Project (NCAP) to investigate centenarians’ “secrets to longevity.” The NCAP provides you funds sufficient to conduct a cross-cultural study based on face-to-face interviews with parallel random samples of centenarians in the US and Japan. You conduct interviews with 50 female and 50 male centenarians in each country, yielding a total sample size of 200. Beyond variables that indicate each subject’s nationality (N=1 if US, N=2 if Japan) and gender (G=1 if male, G=2 if female), you have data on these 200 subjects’ responses to the following 9 questions:

BIRTHYR (B) = In what year were you born? (values range from 1895 to 1909)
ADMITHOS (A) = How many times have you been admitted as a patient in a hospital during the past year? (values range from 0 to 8)
HAPPY (H) = On a scale from 1 to 10, where 1 is the least happy of persons and 10 is the most happy of persons, how happy would you say you are? (values range from 1 to 10)
MDEATH (M) = How old was your mother when she died?
FDEATH (F) = How old was your father when he died?
RECACTIV (R) = How often have you joined others in recreational activities (e.g., golf, card playing, shuffleboard, etc.) during the past week?
SHERRY (S) = Do you generally drink a glass of sherry or wine daily? (values: 1=’yes’ or 0=’no’)
ENJOY (E) = Do you generally enjoy being alone? (values: 1=’yes’ or 0=’no’)
OTHLIKE (O) = How many visitors have you had during the past week?

a. You decide to use BIRTHYR, ADMITHOS, and HAPPY (i.e., variables associated with centenarians’ responses to the first 3 of the above questions) as the dependent variables in your investigation. However, even before selecting independent variables for your regression models, you notice that you will probably need to transform one of these 3 variables to ensure that your data are homoscedastic. Which variable is this? In the space provided below, sketch the heteroscedastic pattern that you suspect this variable will produce if used as a dependent variable. Explain why you suspect the variable will produce this pattern. What variable transformation would you use to correct for this pattern? (Hint: Be sure to label the axes in your sketch! Also note that in subsequent parts of the exam, analyses with this “suspected variable” will have been performed using the required variance stabilizing transformation.) [weight 3]

Variable that will probably produce a heteroscedastic pattern: ADMITHOS

Sketch of the pattern you suspect: [Sketch: residuals $\hat{e}$ on the vertical axis plotted against predicted values $\hat{A}$ on the horizontal axis; the vertical spread of the points widens as $\hat{A}$ increases.]

Why you suspect the above-sketched pattern: ADMITHOS is a Poisson random variable (a measure of counts [hospital admissions] within a fixed time period [the past year]). As such, its variance increases linearly with the magnitude of its mean.
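The mean-variance relation described above can be checked with a small simulation. This is a minimal sketch, assuming admissions really are Poisson distributed; the three mean values, the seed, and the helper names `poisson_draw` and `sample_variance` are illustrative assumptions and not part of the exam data.

```python
import math
import random


def poisson_draw(rng, mean):
    """Draw one Poisson variate using Knuth's multiplication method."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def sample_variance(values):
    """Unbiased sample variance."""
    avg = sum(values) / len(values)
    return sum((v - avg) ** 2 for v in values) / (len(values) - 1)


rng = random.Random(404)  # fixed seed so the run is repeatable
variances = {}
for mean in (0.5, 2.0, 6.0):  # hypothetical mean admission counts
    draws = [poisson_draw(rng, mean) for _ in range(20000)]
    variances[mean] = sample_variance(draws)
    # Each sample variance should land close to its mean.
    print(f"mean={mean}: variance={variances[mean]:.2f}")
```

Because the spread of the simulated counts grows with their mean, residuals from a regression on such counts would be expected to fan out as the predicted value rises.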
Transformation used to correct this pattern: The appropriate variable transformation would be $Y = \sqrt{ADMITHOS}$ (or, if some values of ADMITHOS equal zero, $Y = \sqrt{ADMITHOS} + \sqrt{ADMITHOS + 1}$), to be used as the dependent variable in regression models instead of ADMITHOS.

b. You begin your analysis by investigating the “genetics argument” that people live longer because they inherited genes from parents who themselves lived long. Following this line of thinking, you regress BIRTHYR (B) on FDEATH (F) and MDEATH (M) and obtain the following regression equation:

$\hat{B} = 1906 + .5F - .4M$

Express the meaning of the partial slope between BIRTHYR and FDEATH (i.e., $\hat{b}_F = .5$) in words. [weight 6]

After adjusting the centenarians’ fathers’ longevity as if their mothers all died at the same age (i.e., after removing from their fathers’ “genetic tendency to live longer” these fathers’ tendency to have, perhaps instinctually, selected a wife with a similar genetic tendency toward longevity), one would estimate centenarians to have been born a half-year later (i.e., to have been a half-year younger) for each additional year that their father lived.

c. If you interpreted $\hat{b}_F$ correctly in part b, it likely sounds counterintuitive to you. (Shouldn’t centenarians live longer if their fathers lived longer?) Upon closer examination, you note that FDEATH and MDEATH are strongly correlated ($r_{FM} = .85$), suggesting to you that centenarians nearly always inherit longevity from both parents. This said, explain the problem in how you specified the regression model in part b. Three techniques for remedying this problem are mentioned in your lecture notes. Please name 2 of them. [weight 3]

The problem is one of repetitiveness.
If centenarians’ genetic-based longevity consistently originates from both parents, FDEATH and MDEATH need to be combined into a single measure of centenarians’ “genetic inheritance of longevity.” Three techniques for remedying repetitiveness are (1) summing variables, (2) factor analysis, and (3) principal components analysis.

d. You next investigate factors that contribute to centenarians’ physiological quality of life (as measured [negatively] by ADMITHOS). Your thinking is that exercise (as measured by RECACTIV) and the artery-cleaning effects of drinking sherry or wine (as measured by SHERRY) have health benefits that will keep centenarians out of hospitals. You regress ADMITHOS (A) on RECACTIV (R) and SHERRY (S) both separately and in a multiple regression, yielding the following 3 regression equations:

$\hat{A} = 1 + .5R$
$\hat{A} = 3 + 2S$
$\hat{A} = 4 - .7R + 6S$

In the space below, sketch a plot of A, R, and S that is consistent with the numbers provided in these 3 equations. (Hint: Indicate data points as “1” if S=1, and as “0” if S=0.) [weight 3]

[Sketch: ADMITHOS (A) on the vertical axis (scale to 8) plotted against RECACTIV (R) on the horizontal axis (scale to 15). Points marked “1” lie high in the plot and toward larger values of R; points marked “0” lie low in the plot and toward smaller values of R. Within each group the points slope downward as R increases, while the two groups taken together trend upward.]

e. Referring to the first and third regression equations listed in part d, give a theoretical explanation for why in the first equation the bivariate slope between A and R is positive, whereas in the third equation the partial slope between A and R is negative. (Hint: Be sure that your explanation is consistent with the sketch you drew in answering part d.) [weight 3]

Centenarians’ sherry or wine drinking distorts the negative relation between the frequency of their weekly activities and the (square root, as per part a) frequency of their annual hospital admissions. Our findings in part d suggest that drinking sherry or wine daily has detrimental health effects for centenarians, not health benefits (e.g., from cleaning their arteries).
That is, the detrimental health effects of drinking are likely the reason why centenarians are admitted to hospitals more frequently if they drink sherry or wine each day than if they do not. However, since recreationally active centenarians are more likely to drink sherry or wine than less recreationally active centenarians, the fact that “the more recreationally active centenarians ended up in hospitals” is due not to their recreational activities but to their drinking. Among centenarians who drink the same (i.e., who exclusively either do or do not drink a glass of sherry or wine daily), recreational activity decreases the frequency with which they were admitted to a hospital during the past year.

f. Your next step is to examine centenarians’ psychological wellbeing (as measured by HAPPY). Your thinking is that centenarians’ happiness will be enhanced if they are liked by others (as measured by OTHLIKE). Unfortunately, such an analysis would be complicated by the fact that centenarians’ visits from others and their own psychological wellbeing are both influenced by their physical health. That is, not only are healthy people more likely to be happy, others may have reasons besides liking (e.g., selfish reasons such as hopes of an inheritance) for visiting centenarians when they are unhealthy. As a result, the effect of OTHLIKE on HAPPY may be due to this common prior cause (namely, health, a variable that [you should be sure to assume that] you have no adequate measure of in your data set). Explain how you might proceed with an analysis of the effect of OTHLIKE on HAPPY in a way that would not violate the assumption on $X^T \tilde{e}$ (i.e., that the independent variables are uncorrelated with the errors). (Hints: What instrumental variable might you use? Why would this variable make a good instrument? How would you use the variable to ensure that this assumption is not violated?) [weight 3]

Two-stage least squares (2SLS) is called for here.
In this case, an instrumental variable is needed that is (a) related to others’ liking but (b) unrelated to the centenarians’ health. Accordingly, ENJOY (E) might work as an instrumental variable for the following reasons:
(a) Someone who enjoys solitude is less likely to be liked by others than someone who enjoys others’ company. (Note: An instrumental variable may have a positive or negative linear association with the variable for which it is an instrument.)
(b) If the enjoyment of being alone is a general character trait of the centenarian, it would not vary according to her or his health.

The 2SLS procedure could be implemented as follows:
(1) Stage 1: Regress OTHLIKE on ENJOY, and obtain the predicted values (O-hat) from this regression.
(2) Stage 2: Regress HAPPY on the O-hat values obtained in Stage 1.
Note: Unlike OTHLIKE, O-hat will not be linearly associated with health-related variance in HAPPY.

g. Finally, you decide to describe differences in psychological wellbeing among Japanese women, Japanese men, US women, and US men. Average scores on the HAPPY variable for each of these four groups are as follows:

Japanese women    Japanese men    US women    US men
      8                 7             4          5

Keeping in mind that there are exactly 50 people in each of these groups, do the following while treating the four groups as if they comprise a single nominal-level variable with 4 attributes of “Japanese woman,” “Japanese man,” “US woman,” and “US man” (i.e., do not consider them as the two distinct variables of nationality [N] and gender [G]): First, explain how you would construct effect measures from this variable. Second, obtain constant and slope estimates from the regression of HAPPY on these effect measures. Finally, after computing the “effect” associated with each of the four groups, explain the meaning of each effect in words. (Hint: Be sure to show how the effect variables were constructed and how you calculated your estimates.)
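[weight 4]

Effect measures:

$E_{JW}$ = 1 if Japanese woman, $-1$ if US man, 0 otherwise
$E_{JM}$ = 1 if Japanese man, $-1$ if US man, 0 otherwise
$E_{UW}$ = 1 if US woman, $-1$ if US man, 0 otherwise

Constant and slope estimates:

$\hat{a} = \frac{50 \cdot 8 + 50 \cdot 7 + 50 \cdot 4 + 50 \cdot 5}{200} = 6$

$\hat{H}_{JW} = 8 = \hat{a} + \hat{b}_{JW} E_{JW} + \hat{b}_{JM} E_{JM} + \hat{b}_{UW} E_{UW} = 6 + \hat{b}_{JW}(1) + \hat{b}_{JM}(0) + \hat{b}_{UW}(0) = 6 + \hat{b}_{JW}$, so $\hat{b}_{JW} = 2$

$\hat{H}_{JM} = 7 = \hat{a} + \hat{b}_{JW} E_{JW} + \hat{b}_{JM} E_{JM} + \hat{b}_{UW} E_{UW} = 6 + \hat{b}_{JW}(0) + \hat{b}_{JM}(1) + \hat{b}_{UW}(0) = 6 + \hat{b}_{JM}$, so $\hat{b}_{JM} = 1$

$\hat{H}_{UW} = 4 = \hat{a} + \hat{b}_{JW} E_{JW} + \hat{b}_{JM} E_{JM} + \hat{b}_{UW} E_{UW} = 6 + \hat{b}_{JW}(0) + \hat{b}_{JM}(0) + \hat{b}_{UW}(1) = 6 + \hat{b}_{UW}$, so $\hat{b}_{UW} = -2$

The regression model is thus $\hat{H} = 6 + 2E_{JW} + 1E_{JM} - 2E_{UW}$.

The effects and “meanings in words” associated with each group are as follows:

$\hat{b}_{JW} = 2$: Japanese women’s happiness scores were 2 points above the overall average.
$\hat{b}_{JM} = 1$: Japanese men’s happiness scores were 1 point above the overall average.
$\hat{b}_{UW} = -2$: US women’s happiness scores were 2 points below the overall average.
$\hat{b}_{UM} = -\sum_{i=1}^{3} \hat{b}_i = -(2 + 1 - 2) = -1$: US men’s happiness scores were 1 point below the overall average.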
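The part g estimates can be checked numerically. The sketch below rests on one stated assumption: with four groups and four coefficients the model is saturated, so the least-squares fit reproduces each group mean exactly and the estimates solve a 4x4 linear system. The helper `solve` and all variable names are illustrative, not taken from the lecture notes.

```python
from fractions import Fraction

# Group means of HAPPY from part g (50 respondents per group).
group_means = {"JW": 8, "JM": 7, "UW": 4, "UM": 5}

# Effect coding, one row per group: [constant, E_JW, E_JM, E_UW].
# US men are the omitted group and are coded -1 on every effect measure.
design = {
    "JW": [1, 1, 0, 0],
    "JM": [1, 0, 1, 0],
    "UW": [1, 0, 0, 1],
    "UM": [1, -1, -1, -1],
}


def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Find a row with a nonzero entry in this column and move it up.
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        pivot_value = m[col][col]
        m[col] = [v / pivot_value for v in m[col]]
        # Clear this column in every other row.
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [row[-1] for row in m]


groups = ["JW", "JM", "UW", "UM"]
coefficients = solve([design[g] for g in groups], [group_means[g] for g in groups])

# The omitted group's effect is minus the sum of the three slopes.
effect_us_men = -sum(coefficients[1:])

print([int(c) for c in coefficients], int(effect_us_men))
```

The solution returns the constant 6 and slopes 2, 1, and -2, and the implied US men effect of -1, matching the hand calculation.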