Department of IOMS
Inference and Regression
Final Examination, 2012

This course and this examination are governed by the Stern Honor Code.

Instructions

Please write your name at the top of this page. Please answer all questions on this question book. Do not turn in a blue book. Please do not separate the pages of this exam booklet. Where a computation is required to answer a question, please show your work. (I cannot give partial credit for an incorrect numerical answer unless the work provided shows a partially correct computation.)

Grading: There are 10 questions in this exam. The point values for the questions are

Question   1.   2.   3.   4.   5.   6.   7.   8.   9.   10.   Total
Points     30   40   10   10   10   15   10   10   15   10    160

There are 200 points in total. The questions add to 160. You may take the other 40 and allocate them to any 3 questions you wish. The grade will be the proportion of the increased total. For example, if you add 20 points to question 2, and you had 30 of the 40 to begin with, your score will be (30/40)(40 + 20) = 45. Add a maximum of 20 points and a minimum of 10 to each of 3 questions, for a total of 40 additional points. Indicate in the table above where you wish to add the points.

[30] Part I. Continuing the Baseball Saga.

I decided to rebuild the dynamic baseball regression that we discussed in class on day 9. My model now involves the following variables:

ATTEND(i,t) = total attendance for team i in year t.
ATTEND1(i,t) = last year's total attendance for team i, that is, attendance in year t-1.
WINS(i,t) = number of wins by team i in year t.
AVGSALRY(i,t) = average player salary, in millions, for team i in year t. (This variable takes values like 1.0 or 2.0 or maybe 5.0.)
RANK1(i,t) = team rank in standings at the end of the previous season. Takes values 1,2,3,4,5. Rank 1 is lowest, last place. Higher rank is better.
ALLSTARS(i,t) = number of all stars on the team in year t.
MGR_EXP = number of years of experience possessed by the manager.
I computed two regressions. The basic model contains ATTEND1, WINS and AVGSALRY. My 'advanced' model includes these variables plus RANK1, ALLSTARS and MGR_EXP. Note, in the results below, some values are shown in scientific notation. .19767E+14 means .19767 times 10 to the 14th power.

----------------------------------------------------------------------------
Ordinary least squares regression ............
LHS=ATTEND   Mean                 = 2220848.36156
             Standard deviation   =  432586.34965
             No. of observations  =           437   DegFreedom   Mean square
Regression   Sum of Squares       =   .593019E+14            3    .19767E+14
Residual     Sum of Squares       =   .222872E+14          433    .51472E+11
Total        Sum of Squares       =   .815891E+14          436    .18713E+12
             Standard error of e  =  226873.65637   Root MSE    225832.94603
Fit          R-squared            =        .72684   R-bar squared     .72494
--------+-------------------------------------------------------------------
        |                  Standard            Prob.      95% Confidence
  ATTEND|  Coefficient       Error       t    |t|>T*         Interval
--------+-------------------------------------------------------------------
Constant|    -115868       91590.73    -1.27  .2065     -295383      63645
 ATTEND1|     .62705***      .03066    20.45  .0000      .56695     .68714
    WINS|    10713.0***    1047.100    10.23  .0000      8660.7    12765.3
AVGSALRY|    94367.7***    19665.55     4.80  .0000     55823.9   132911.5
----------------------------------------------------------------------------

----------------------------------------------------------------------------
             No. of observations  =           437   DegFreedom   Mean square
Regression   Sum of Squares       =   .596418E+14            6    .99403E+13
Residual     Sum of Squares       =   .219473E+14          430    .51040E+11
Total        Sum of Squares       =   .815891E+14          436    .18713E+12
             Standard error of e  =  225920.63524   Root MSE    224103.89760
Fit          R-squared            =        .73100   R-bar squared     .72725
Model test   F[  6,  430]         =     194.75471   Prob F > F*       .00000
--------+-------------------------------------------------------------------
        |                  Standard            Prob.      95% Confidence
  ATTEND|  Coefficient       Error       t    |t|>T*         Interval
--------+-------------------------------------------------------------------
Constant|   -73235.3       118602.6     -.62  .5372   -305692.1   159221.6
 ATTEND1|     .62093***      .03260    19.04  .0000      .55703     .68483
    WINS|    9235.51***    1267.029     7.29  .0000     6752.17   11718.84
AVGSALRY|    99474.1***    19746.76     5.04  .0000     60771.2   138177.1
ALLSTARS|    25961.4**     10675.53     2.43  .0154      5037.7    46885.0
 MGR_EXP|    50.9956       1943.689      .03  .9791  -3758.5641  3860.5552
   RANK1|    8163.80       7122.781     1.15  .2524    -5796.60   22124.19
--------+-------------------------------------------------------------------
Note: ***, **, * ==> Significance at 1%, 5%, 10% level.
----------------------------------------------------------------------------

1. Using an F test, test the hypothesis that the three slope coefficients (not the constant) in the first regression are equal to zero.
$F = (R^2/3)/((1 - R^2)/433)$, with 3 and 433 degrees of freedom. Use table to test.

2. Using an F test, test the hypothesis that the coefficients on RANK1, ALLSTARS and MGR_EXP are all zero in the second equation.
$R^2$ = fit in large regression, $R_0^2$ = fit in small regression. $F = ((R^2 - R_0^2)/3)/((1 - R^2)/430)$, with 3 and 430 degrees of freedom. Use table to test.

3. According to the regression results, in the second regression, ALLSTARS is "significant" at the 5% level, but not at the 1% level, while MGR_EXP is not significant at all. What is meant by these statements of significance?
See class notes.

4. The adjusted R squared (R-bar squared) reported for the first regression is .72494. What is R-bar squared, and why is it computed?
See class notes.

5. Front office management believes that all they have to do to make more fans come to the games is to pay the players more. Suppose they raise player salaries so that the average salary rises by 2 million. How many more fans will ultimately come to the games (in the long run)?
2 × 99474.1 / (1 - .62093)

6. The team is always in last place (rank = 1). Just throwing money at the players is a nice idea, but it probably won't work.
Suppose we are optimistic, and assume the team finds a way to win 10 more games every year, and that is enough to get them to first place (rank = 5). How many more fans can be expected to come to the games if they do this?
10 × 9235.51 / (1 - .62093)

[40] Part II. Family Matters

Many observers believe that the sex of a child is completely beyond the control of the parents or environmental factors; the probability of a female child is (within rounding error) 0.5. Others believe that some factors, which we will cleverly denote x, are relevant to the sex of a child. I intend to test the theory statistically. Here is the model we will use to test our theory: (We are going to build a model using a random sample of data.)

$K_i$ = the number of children in family i. $K_i > 0$. (We don't sample any families with zero children.)
$d_i$ = the number of female children in family i. $0 \le d_i \le K_i$.
$x_i$ = the set of factors that we think are relevant (age, income, education, country of ancestry).
$\pi_i$ = the probability that any particular child born in family i is female. Note, this is family specific.

As stated so far, the model for the number of daughters is a binomial distribution.

$$\text{Prob}(D = d_i \mid x_i) = \binom{K_i}{d_i} \pi_i^{d_i} (1 - \pi_i)^{K_i - d_i}, \quad d_i = 0, \ldots, K_i$$

To complete the model, I will claim that $\pi_i$ depends on $x_i$. Remembering that $\pi_i$ is a probability, I formulate this as

$$\pi_i = \frac{\exp(\alpha + \beta' x_i)}{1 + \exp(\alpha + \beta' x_i)}$$

So, $\pi_i$ is the type of logistic probability that we discussed in class in session 10. There are three theories in this model:

Theory A: The factors listed matter. This means that $\alpha$ is free to vary and at least some elements of $\beta$ are different from zero. (I.e., at least some variables 'matter.')
Theory B: The factors don't matter. This means that $\alpha$ is free to vary but all elements of $\beta$ are zero. (None of the variables make any difference.)
Theory Z: Nothing matters; the probability is .50. This means that $\alpha = 0$ and $\beta = 0$.

After gathering the data I needed from 1,000 randomly chosen families, I computed the two regressions below.
(Reader, please note, I did not really sample any data. This exercise and the data described here are entirely fictitious.) Model A gives the maximum likelihood estimates of the full model that corresponds to Theory A above. Model B corresponds to Theory B above. The data consist of 1,000 observations on $K_i$, $d_i$ and the x variables:

Age = age of the mother in years; ranges from 25 to 40, average = 33
Educ = years of education of the mother; ranges from 12 to 20, average = 15
Income = income in thousands of dollars; ranges from 5 to about 100, average = 60
Country = dummy variable that is 1 if the mother is native to (born in) North America. About 75% of the sample were native North Americans, average = .75

----------------------------------------------------------------------------
Binomial (Loglinear) Regression Model A
Dependent variable                   DI
Log likelihood function     -1016.87044
Restricted log likelihood   -1545.58656
--------+-------------------------------------------------------------------
        |                 Standard            Prob.      95% Confidence
      DI|  Coefficient      Error       z    |z|>Z*         Interval
--------+-------------------------------------------------------------------
        |Parameters in conditional mean function
Constant|    -.39981        .43870     -.91  .3621    -1.25965     .46003
     AGE|     .09933***     .00955    10.40  .0000      .08061     .11805
    EDUC|    -.31472***     .01902   -16.55  .0000     -.35200    -.27745
  INCOME|     .04859***     .00207    23.51  .0000      .04454     .05264
 COUNTRY|    -.21079**      .09923    -2.12  .0336     -.40527    -.01631
----------------------------------------------------------------------------
Binomial (Loglinear) Regression Model B
Dependent variable                   DI
Log likelihood function     -1545.58656
--------+-------------------------------------------------------------------
        |                 Standard            Prob.      95% Confidence
      DI|  Coefficient      Error       z    |z|>Z*         Interval
--------+-------------------------------------------------------------------
        |Parameters in conditional mean function
Constant|     .46033***     .03427    13.43  .0000      .39316     .52749
--------+-------------------------------------------------------------------

1. For a family that has $K_i$ children, what is the expected number of female children?
$K_i$ times $\pi_i$.

2. How does the expected number of female children depend on education? Derive the formula for the effect of an increase in education of one year on the expected number of female children. (Hint: This is not a simple number, as it depends on the other variables and on the parameters. Second hint: where $\pi_i$ is as defined above, $\partial \pi_i / \partial x_{ik} = \pi_i (1 - \pi_i) \beta_k$.)
The estimate is b = -.31472. The expected value is $K_i \pi_i$, so the effect is $K_i \pi_i (1 - \pi_i) \beta_k$. If $\pi_i$ is about .5, the effect would be about $.25 K_i \beta_k$, or roughly $-.079 K_i$. It depends on $K_i$.

3. Form the log likelihood function for estimation of the parameters $\alpha$ and $\beta$. (Hint, this model resembles, but is not the same as the logistic model we discussed in class 10.)
Sum the logs of these probabilities over the observations:

$$\text{Prob}(D = d_i \mid x_i) = \binom{K_i}{d_i} \pi_i^{d_i} (1 - \pi_i)^{K_i - d_i}, \quad d_i = 0, \ldots, K_i, \qquad \pi_i = \frac{\exp(\alpha + \beta' x_i)}{1 + \exp(\alpha + \beta' x_i)}$$

4. Obtain the first order (necessary) conditions for maximizing the log likelihood with respect to the parameters $\alpha$ and $\beta$.
$\log P_i = \log \binom{K_i}{d_i} + d_i \log \pi_i + (K_i - d_i) \log(1 - \pi_i)$. The derivative with respect to $\beta$ is $\left( d_i/\pi_i - (K_i - d_i)/(1 - \pi_i) \right) \pi_i (1 - \pi_i) x_i$, which simplifies to $(d_i - K_i \pi_i) x_i$; the derivative with respect to $\alpha$ is the same expression with $x_i$ replaced by 1. Sum over observations and set the sums equal to zero.

5. Using the two sets of regression results given above, for Model A and Model B, test the hypothesis of Model B as a restriction on Model A using a likelihood ratio test. (Hint: We talked about likelihood ratio tests on day 5 of class.)
Log likelihood function = -1016.87044. Restricted log likelihood = -1545.58656. Twice the difference, 2(1545.58656 - 1016.87044) = 1057.43, is chi squared with 4 degrees of freedom. This is far above the 5% critical value of 9.49, so Model B is rejected.

6. Theory Z states that $\pi_i = .5$ for every family. What is the value of the log likelihood function for this sample assuming Theory Z is correct?
(Hints: the sample average of $\log \binom{K_i}{d_i}$ for this sample is .8502. The sample sum of $K_i$ is 3590.)
The log likelihood assuming $\pi_i = .5$ is

$$\sum_i \left[ d_i \log .5 + (K_i - d_i) \log .5 \right] + \sum_i \log \binom{K_i}{d_i} = \sum_i K_i \log .5 + \sum_i \log \binom{K_i}{d_i}$$

The sum of $K_i$ is 3590, so the first term is 3590(-.693). The second sum is 1000 × .8502. The log likelihood is the sum, -1637.67.

7. Using the facts in part 6, test the null hypothesis of Theory Z against the alternative hypothesis of Theory A using a likelihood ratio test.
Chi squared = 2(1637.67 - 1016.87) = 1241.6, with 5 degrees of freedom (the constant and four slopes). This is very large. The hypothesis (Theory Z) is rejected.

8. Experience might suggest that the gender of children runs in streaks; after having three daughters in a row, one might begin to suspect it is less than random at that point. Suppose these did run in streaks. What would streaks imply for the statistical approach we used above to test Theory A? (Hint, this question does not have a single right answer. It asks you to think like a statistician, and question your methods if your assumptions are not met. In two or three sentences, suggest what this would imply for your model and your approach.)
If there really are streaks, then the observations are not independent. The way we constructed the log likelihoods above assumes independence. So, the results would be questionable.

[10] Part III. Regression basics.

The regression of V on X gave this Minitab output:

Regression Analysis: V versus X

The regression equation is
V = 99.3 - 1.02 X

Predictor      Coef   SE Coef       T      P
Constant     99.301     2.478   40.08  0.000
X           -1.0169    0.2130   -4.77  0.000

S = 4.023   R-Sq = 33.6%   R-Sq(adj) = 32.1%

Analysis of Variance
Source        DF        SS       MS      F      P
Regression     1    368.89   368.89  22.79  0.000
Residual      45    728.26    16.18
Total         46   1097.15

Answer T (true) or F (false) to each of the following. Explain your answer in one short sentence.

1. __F___ The total sum of squares, namely 1,097.15, provides solid evidence of the effect of a significant regression.
2. __T___ Most of the residuals are between -8 and +8.
3. __F___ The data provide convincing evidence that increasing X by one unit will cause a decrease in V of 1.0169 units. Not cause.
4. __T___ The correlation between the variables X and V is negative.
5. __F___ The regression slope would be regarded as not statistically significant.

[10] Part IV. More regression basics

A regression analysis of variance produced the following table. Some of the positions are blank, and these are the subjects of the questions that follow.

Source of variation   Degrees of freedom   Sum Squares   Mean Square   F
Regression            (a) 5                80.00         16.00         (b) 8
Error                 100                  (c) 200       (d) 2.00
Total                 (e) 105              280.00

Supply the numbers that go in the positions marked (a) through (e).

[10] Part V. Essential Theory

Explain the difference between unbiasedness and consistency applied to an estimator.
See class notes.

[15] Part VI. Heteroscedastic Regression

I am interested in the regression model

$$\text{Health} = \alpha + \beta_1 \text{Age} + \beta_2 \text{Insurance} + \varepsilon.$$

'Insurance' is a binary variable that equals 1.0 if the person has insurance and 0.0 if they do not. The regression is ordinary save for the fact that I know that $\text{Var}[\varepsilon] = \sigma^2$ for men and $\text{Var}[\varepsilon] = c\sigma^2$ for women, where 0 < c < 1. (I know the value of c.)

1. Ordinary least squares is unbiased but inefficient. Explain.
See class notes.

2. Weighted least squares would be more efficient. How would I compute the weighted least squares estimator based on a sample that contains a mixture of men and women? (Hint: remember, c is known.)
For women, use c; for men, use c = 1. Then regress Health/sqrt(c) on 1/sqrt(c), Age/sqrt(c) and Insurance/sqrt(c).

3. Suppose the statement of the model is correct, but I do not know the value of c. Can you suggest a way that I might do weighted least squares?
Use least squares. Using observations on women only, compute $(1/n_w) \sum_i e_i^2$. This estimates $c\sigma^2$. Do the same for the observations on men. This estimates $\sigma^2$. The ratio gives the estimate of c that I need.

[10] Part VII.
Statistical Theory

Suppose the density of x is

$$f(x) = \frac{1}{\sqrt{2\pi}} x^{-1/2} \exp\left( -\frac{1}{2} x \right), \quad x \ge 0.$$

What is the density of z = sqrt(x)? (Hint: does there seem to be a stray 2 in your answer? Note that z = -value and z = +value are produced by the same x. So, z ranges from $-\infty$ to $+\infty$, with half the probability mass on either side of zero.)
See class notes and first exam. Change of variable.

[10] Part VIII. Very Basic Statistics

[Histogram of Q: Frequency on the vertical axis (0 to 64), Q on the horizontal axis (observed values from .001 to 1.384).]

The histogram above describes 1,000 observations on the variable q.

1. Provide a guess of the sample mean, and explain how you obtained it.
About .62 or so. Right of median.

2. Provide a guess of the sample median and explain how you obtained it.
About .6. Less than mean.

3. Provide a guess of the sample standard deviation and explain precisely how you obtained it.
The range from .001 to 1.384 is about 6 standard deviations, so the standard deviation is about 1.383/6, or roughly .23.

4. Are these data skewed to the left or to the right, or not at all?
Right.

5. What is kurtosis? (And can it be cured?)
Thick tails. With patience and care.

[15] Part IX. Bivariate Outcomes.

During our discussions of the American Express study, I guessed that the credit scoring agency revealed a preference for business employed applicants over self employed applicants in the acceptance decisions. The tables below give a frequency count followed by the corresponding sample proportions. (SelfEmpl = 1 means self employed. CardHldr = 1 means accepted.)

+--------------------------------+
|            CARDHLDR            |
+--------+----------------+------+
|SELFEMPL|     0       1  | Total|
+--------+----------------+------+
|       0|  2729    9936  | 12665|
|       1|   216     563  |   779|
+--------+----------------+------+
|   Total|  2945   10499  | 13444|
+--------------------------------+

--------+------------------------------------------
Percent |   CARDH=0     CARDH=1       Total
--------+------------------------------------------
SELFE=0 |   .202990     .739066     .942056
SELFE=1 |  .0160666    .0418774    .0579441
   Total|   .219057     .780943     1.00000

1. Test the hypothesis that cardholder status and self employment status are independent against the hypothesis that they are dependent.
Contingency table test. See class notes in part 10.

2. The following are the results of a logistic regression of cardholder status in which the only variable that explains the accept/reject decision is whether the applicant is self employed or not.

----------------------------------------------------------------------------
Binary Logit Model for Binary Choice based on 13,444 applications
Dependent variable             CARDHLDR
--------+-------------------------------------------------------------------
        |                 Standard            Prob.      95% Confidence
CARDHLDR|  Coefficient      Error       z    |z|>Z*         Interval
--------+-------------------------------------------------------------------
        |Characteristics in numerator of Prob[CARDHL=1]
Constant|    1.29223***     .02161    59.79  .0000     1.24987    1.33459
SELFEMPL|    -.33423***     .08290    -4.03  .0001     -.49671    -.17174
--------+-------------------------------------------------------------------

Are these results consistent with your findings in part 1? Explain.
Self employment does appear to be related to cardholder status.

3. To follow up the analysis in parts 1 and 2, I examined the applicants whose applications were accepted, that is, those who were given a credit card. Let's see if self employed people default more often than those who are business employed. Here is the logistic regression model based on the 10,499 cardholders. Do these results validate the reluctance to accept applications from the self employed?

----------------------------------------------------------------------------
Binary Logit Model for Binary Choice
Dependent variable              DEFAULT
--------+-------------------------------------------------------------------
        |                 Standard            Prob.      95% Confidence
 DEFAULT|  Coefficient      Error       z    |z|>Z*         Interval
--------+-------------------------------------------------------------------
        |Characteristics in numerator of Prob[DEFAUL=1]
Constant|   -2.24696***     .03412   -65.86  .0000    -2.31383   -2.18009
SELFEMPL|    -.17244        .15760    -1.09  .2739     -.48133     .13645
--------+-------------------------------------------------------------------

They do not. The sign is negative and the variable is not significant in the model.

[10] Part X. Function of a random estimator

1. In logistic regression settings, researchers often compute 'odds ratios' for dummy variables such as SELFEMPL. The 'odds ratio,' $\exp(\beta)$, gives the change in Prob(accept)/Prob(reject). Compute the odds ratio for SELFEMPL in the logistic regression for DEFAULT and discuss your finding.
exp(-.17244) = .84.

2. The standard error for the estimator of $\beta$ in part 1 above is 0.15760. How would you compute the standard error for the odds ratio, exp(b)?
Delta method. The derivative is exp(b). Square it, multiply by the variance of b, then take the square root: sqrt(.84² × .1576²) = .84 × .1576, or about .132.
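
The delta-method computation in Part X, question 2 can be checked numerically. The following is a minimal sketch in Python, not part of the original answer key; the function name `odds_ratio_delta_se` is hypothetical and is introduced here only for illustration. It uses the SELFEMPL coefficient and standard error reported in the DEFAULT logit.

```python
import math


def odds_ratio_delta_se(b, se_b):
    """Return (exp(b), delta-method standard error of exp(b)).

    Hypothetical helper for illustration only.
    """
    odds_ratio = math.exp(b)
    # g(b) = exp(b), so g'(b) = exp(b); the delta method gives
    # se[g(b)] approximately equal to |g'(b)| * se(b).
    se_odds_ratio = odds_ratio * se_b
    return odds_ratio, se_odds_ratio


# Estimates reported for SELFEMPL in the DEFAULT logit.
b_selfempl = -0.17244
se_selfempl = 0.15760

odds_ratio, se_odds_ratio = odds_ratio_delta_se(b_selfempl, se_selfempl)
print(f"odds ratio = {odds_ratio:.4f}, delta-method s.e. = {se_odds_ratio:.4f}")
```

Carrying exp(b) at full precision rather than rounding it to .84 first changes the standard error only in the third decimal place, so the result agrees with the hand calculation at roughly .13.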