2012 final exam with solutions

Department of IOMS
Inference and Regression
Final Examination, 2012
 This course and this examination are governed by the Stern Honor Code.
 Instructions
 Please write your name at the top of this page.
 Please answer all questions on this question book. Do not turn in a blue book.
 Please do not separate the pages of this exam booklet.
 Where a computation is required to answer a question, please show your work.
(I cannot give partial credit for an incorrect numerical answer unless the work
provided shows a partially correct computation.)
Grading:
There are 10 questions in this exam. The point values for the questions are
    Question    Points
        1         30
        2         40
        3         10
        4         10
        5         10
        6         15
        7         10
        8         10
        9         15
       10         10
    Total        160
There are 200 points in total. The questions add to 160. You may take the other 40 and
allocate them to any 3 questions you wish. The grade will be the proportion of the
increased total. For example, if you add 20 points to question 2, and you had 30 of the 40
to begin with, your score on that question will be 30/40 × (40 + 20) = 45. Add a maximum of 20 points and a
minimum of 10 to each of 3 questions, for a total of 40 additional points. Indicate in the
table above where you wish to add the points.
[30] Part I. Continuing the Baseball Saga.
I decided to rebuild the dynamic baseball regression that we discussed in class on day 9. My
model now involves the following variables:
ATTEND(i,t)   = total attendance for team i in year t.
ATTEND1(i,t)  = last year's total attendance for team i in year t.
WINS(i,t)     = number of wins by team i in year t.
AVGSALRY(i,t) = average player salary, in millions, for team i in year t.
                (This variable takes values like 1.0 or 2.0 or maybe 5.0.)
RANK1(i,t)    = team rank in the standings at the end of the previous season. Takes values
                1, 2, 3, 4, 5. Rank 1 is lowest (last place); a higher rank is better.
ALLSTARS(i,t) = number of all stars on the team in year t.
MGR_EXP(i,t)  = number of years of experience possessed by the manager.
I computed two regressions: a basic model that contains ATTEND1, WINS and AVGSALRY, and an
'advanced' model that includes these variables plus RANK1, ALLSTARS and MGR_EXP. Note,
in the results below, some values are shown in scientific notation: .19767E+14 means .19767
times 10 to the 14th power.
-----------------------------------------------------------------------------
Ordinary least squares regression
LHS=ATTEND       Mean               =  2220848.36156
                 Standard deviation =   432586.34965
No. of observations = 437
                  Sum of Squares   DegFreedom   Mean square
Regression           .593019E+14        3       .19767E+14
Residual             .222872E+14      433       .51472E+11
Total                .815891E+14      436       .18713E+12
Standard error of e =  226873.65637   Root MSE      =  225832.94603
Fit    R-squared    =        .72684   R-bar squared =        .72494
--------+--------------------------------------------------------------------
        |              Standard               Prob.       95% Confidence
  ATTEND|  Coefficient    Error        t     |t|>T*          Interval
--------+--------------------------------------------------------------------
Constant|   -115868      91590.73    -1.27    .2065      -295383      63645
 ATTEND1|    .62705***     .03066    20.45    .0000       .56695     .68714
    WINS|   10713.0***   1047.100    10.23    .0000       8660.7    12765.3
AVGSALRY|   94367.7***   19665.55     4.80    .0000      55823.9   132911.5
--------+--------------------------------------------------------------------
-----------------------------------------------------------------------------
No. of observations = 437
                  Sum of Squares   DegFreedom   Mean square
Regression           .596418E+14        6       .99403E+13
Residual             .219473E+14      430       .51040E+11
Total                .815891E+14      436       .18713E+12
Standard error of e =  225920.63524   Root MSE      =  224103.89760
Fit    R-squared    =        .73100   R-bar squared =        .72725
Model test  F[ 6, 430] =  194.75471   Prob F > F*   =        .00000
--------+--------------------------------------------------------------------
        |              Standard               Prob.       95% Confidence
  ATTEND|  Coefficient    Error        t     |t|>T*          Interval
--------+--------------------------------------------------------------------
Constant|  -73235.3     118602.6      -.62    .5372    -305692.1   159221.6
 ATTEND1|    .62093***     .03260    19.04    .0000       .55703     .68483
    WINS|   9235.51***   1267.029     7.29    .0000      6752.17   11718.84
AVGSALRY|   99474.1***   19746.76     5.04    .0000      60771.2   138177.1
ALLSTARS|   25961.4**    10675.53     2.43    .0154       5037.7    46885.0
 MGR_EXP|   50.9956      1943.689      .03    .9791   -3758.5641  3860.5552
   RANK1|   8163.80      7122.781     1.15    .2524     -5796.60   22124.19
--------+--------------------------------------------------------------------
Note: ***, **, * ==> Significance at 1%, 5%, 10% level.
-----------------------------------------------------------------------------
1. Using an F test, test the hypothesis that the three slope coefficients (not the constant) in the
first regression are equal to zero.
F[3, 433] = (R²/3) / ((1 - R²)/433), with R² = .72684 from the first regression. Use the F table to test.
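As a quick check, a sketch in Python using only the R-squared and degrees of freedom reported above:

```python
from scipy import stats

R2, k, df_resid = 0.72684, 3, 433       # first regression: R-squared, slopes, residual df
F = (R2 / k) / ((1 - R2) / df_resid)    # F statistic for all three slopes equal to zero
p = stats.f.sf(F, k, df_resid)          # upper-tail p-value
print(F, p)                             # F is roughly 384, far above the 5% critical value (about 2.6)
```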
2. Using an F test, test the hypothesis that the coefficients on RANK1, ALLSTARS and
MGR_EXP are all zero in the second equation.
R² = fit of the large regression = .73100
R0² = fit of the small regression = .72684
F[3, 430] = ((R² - R0²)/3) / ((1 - R²)/430). Use the F table to test.
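The same check for the subset test, a sketch with values taken from the two outputs above:

```python
from scipy import stats

R2_big, R2_small = 0.73100, 0.72684     # fits of the large and small regressions
q, df_resid = 3, 430                    # 3 restrictions; residual df of the large model
F = ((R2_big - R2_small) / q) / ((1 - R2_big) / df_resid)
p = stats.f.sf(F, q, df_resid)
print(F, p)                             # F is about 2.2; the 5% critical value of F[3,430] is roughly 2.6
```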
3. According to the regression results, in the second regression, ALLSTARS is “significant” at
the 5% level, but not at the 1% level, while MGR_EXP is not significant at all. What is meant by
these statements of significance?
See class notes
4. The adjusted R squared (R-bar squared) reported for the first regression is .72494. What is R-bar squared, and why is it computed?
See class notes
5. Front office management believes that all they have to do to make more fans come to the
games is to pay the players more. Suppose they raise player salaries so that the average salary
rises by 2 million. How many more fans will ultimately come to the games (in the long run)?
2 × 99474.1 / (1 - .62093) ≈ 524,800. The long-run multiplier is 1/(1 - .62093) because last year's attendance appears on the right-hand side. (See the sketch after question 6.)
6. The team is always in last place (rank = 1). Just throwing money at the players is a nice idea,
but it probably won't work. Suppose we are optimistic, and assume the team finds a way to win
10 more games every year, and that is enough to get them to first place (rank = 5). How many
more fans can be expected to come to the games if they do this?
10 × 9235.51 / (1 - .62093) ≈ 243,600.
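A sketch of the long-run calculations for questions 5 and 6, using the coefficients from the second regression:

```python
b_attend1 = 0.62093                     # coefficient on last year's attendance (ATTEND1)
multiplier = 1 / (1 - b_attend1)        # long-run multiplier in the dynamic model

salary_effect = 2 * 99474.1 * multiplier    # question 5: average salary rises by 2 million
wins_effect = 10 * 9235.51 * multiplier     # question 6: 10 more wins every year
print(round(salary_effect), round(wins_effect))   # roughly 524,800 and 243,600 more fans
```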
[40] Part II. Family Matters
2. Many observers believe that the sex of a child is completely beyond the control of the parents
or environmental factors – the probability is (within rounding error) 0.5. Others believe that
some factors, which we will cleverly denote x, are relevant to the sex of a child. I intend to test
the theory statistically. Here is the model we will use to test our theory. (We are going to build a
model using a random sample of data.)
Ki = the number of children in family i. Ki > 0. (We don't sample any families with zero
children.)
di = the number of female children in family i. 0 ≤ di ≤ Ki.
xi = the set of factors that we think are relevant (age, income, education, country of ancestry).
πi = the probability that any particular child born in family i is female. Note, this is family
specific.
As stated so far, the model for the number of daughters is a binomial distribution:
Prob(D = di | xi) = [Ki choose di] πi^di (1 - πi)^(Ki - di),   di = 0, ..., Ki.
To complete the model, I will claim that πi depends on xi. Remembering that πi is a probability, I
formulate this as
πi = exp(α + β′xi) / [1 + exp(α + β′xi)]
So, πi is the type of logistic probability that we discussed in class in session 10.
There are three theories in this model:
Theory A: The factors listed matter. This means that β is free to vary and at least some elements
of β are different from zero. (I.e., at least some variables 'matter.')
Theory B: The factors don't matter. This means that α is free to vary but all elements of β are
zero. (None of the variables make any difference.)
Theory Z: Nothing matters – the probability is .50. This means that α = 0 and β = 0.
After gathering the data I needed from 1,000 randomly chosen families, I computed the two
regressions below. (Reader, please note, I did not really sample any data. This exercise and the
data described here are entirely fictitious.) Model A is the maximum likelihood estimates of the
full model that corresponds to Theory A above. Model B corresponds to Theory B above. The
data consist of 1,000 observations on Ki, di and the x variables:
Age = age of the mother in years – ranges from 25 to 40, average = 33
Educ = years of education of the mother – ranges from 12 to 20, average = 15
Income = income in thousands of dollars – ranges from 5 to about 100, average = 60
Country = dummy variable that is 1 if the mother is native to (born in) North America. About
75% of the sample were native North Americans, average = .75
-----------------------------------------------------------------------------
Binomial (Loglinear) Regression Model A
Dependent variable                DI
Log likelihood function           -1016.87044
Restricted log likelihood         -1545.58656
--------+--------------------------------------------------------------------
        |              Standard               Prob.       95% Confidence
      DI|  Coefficient    Error        z     |z|>Z*          Interval
--------+--------------------------------------------------------------------
        |Parameters in conditional mean function
Constant|   -.39981       .43870      -.91    .3621     -1.25965     .46003
     AGE|    .09933***    .00955     10.40    .0000       .08061     .11805
    EDUC|   -.31472***    .01902    -16.55    .0000      -.35200    -.27745
  INCOME|    .04859***    .00207     23.51    .0000       .04454     .05264
 COUNTRY|   -.21079**     .09923     -2.12    .0336      -.40527    -.01631
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
Binomial (Loglinear) Regression Model B
Dependent variable                DI
Log likelihood function           -1545.58656
--------+--------------------------------------------------------------------
        |              Standard               Prob.       95% Confidence
      DI|  Coefficient    Error        z     |z|>Z*          Interval
--------+--------------------------------------------------------------------
        |Parameters in conditional mean function
Constant|    .46033***    .03427     13.43    .0000       .39316     .52749
--------+--------------------------------------------------------------------
1. For a family that has Ki children, what is the expected number of female children?
Ki times πi.
2. How does the expected number of female children depend on education? Derive the formula
for the effect of an increase in education of one year on the expected number of female children. (Hint:
This is not a simple number, as it depends on the other variables and on the parameters. Second
hint: where πi is as defined above, ∂πi/∂xik = πi(1 - πi)βk.)
βEDUC = -.31472. The expected value is Kiπi, so the effect is Ki × πi(1 - πi) × (-.31472).
If πi is about .5, then πi(1 - πi) is about .25, so the effect is about .25 × (-.31472) × Ki ≈ -.08Ki. It depends on Ki and, through πi, on the other variables.
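As an illustration, a sketch that evaluates this effect at one hypothetical set of x values (the family characteristics below are assumptions chosen near the sample averages, not data from the exam):

```python
import numpy as np

# Model A coefficients from the output above
b = {"const": -0.39981, "age": 0.09933, "educ": -0.31472,
     "income": 0.04859, "country": -0.21079}

# Hypothetical family, roughly at the sample averages described in the problem
age, educ, income, country, K = 33, 15, 60, 1, 3

xb = (b["const"] + b["age"] * age + b["educ"] * educ
      + b["income"] * income + b["country"] * country)
pi = np.exp(xb) / (1 + np.exp(xb))      # logistic probability that a child is a daughter

effect = K * pi * (1 - pi) * b["educ"]  # d E[daughters] / d Educ = K * pi * (1 - pi) * beta_educ
print(pi, effect)
```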
3. Form the log likelihood function for estimation of the parameters α and β. (Hint: this model
resembles, but is not the same as, the logistic model we discussed in class 10.)
Sum the logs of the probabilities
Prob(D = di | xi) = [Ki choose di] πi^di (1 - πi)^(Ki - di),   di = 0, ..., Ki,   with
πi = exp(α + β′xi) / [1 + exp(α + β′xi)],
which gives
logL = Σi [ log(Ki choose di) + di log πi + (Ki - di) log(1 - πi) ].
4. Obtain the first order (necessary) conditions for maximizing the log likelihood with respect to
the parameters α and β.
log Pi = log(Ki choose di) + di log πi + (Ki - di) log(1 - πi).
The derivative with respect to (α, β) is [di/πi - (Ki - di)/(1 - πi)] × πi(1 - πi) × (1, xi).
Simplify and sum over observations: the first order conditions are Σi (di - Kiπi)(1, xi) = 0.
5. Using the two sets of regression results given above, for Model A and Model B, test the
hypothesis of Model B as a restriction on Model A using a likelihood ratio test. (Hint: We talked
about likelihood ratio tests on day 5 of class.)
Log likelihood function (Model A)   =  -1016.87044
Restricted log likelihood (Model B) =  -1545.58656
Twice the difference, 2(1545.58656 - 1016.87044) = 1057.43, is chi squared with 4 degrees of freedom.
It is far larger than the 5% critical value of 9.49, so the restriction (Model B) is rejected.
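A sketch of the computation:

```python
from scipy import stats

logL_A = -1016.87044                    # Model A (unrestricted)
logL_B = -1545.58656                    # Model B (constant only)
LR = 2 * (logL_A - logL_B)              # likelihood ratio statistic, about 1057
p = stats.chi2.sf(LR, df=4)             # 4 restrictions: AGE, EDUC, INCOME, COUNTRY
print(LR, p)                            # the restriction (Model B) is rejected
```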
6
6. Theory Z states that πi = .5 for every family. What is the value of the log likelihood function
for this sample assuming Theory Z is correct? (Hints: the sample average of log(Ki choose di) for this
sample is .8502. The sample sum of Ki is 3590.)
The log likelihood assuming πi = .5 is
Σi [ di log .5 + (Ki - di) log .5 + log(Ki choose di) ]  =  Σi Ki log .5 + Σi log(Ki choose di).
The sum of Ki is 3590, so the first term is 3590 × (-.693) = -2487.9. The second sum is 1000 × .8502 = 850.2.
The log likelihood is the sum, -1637.67.
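A quick check of the arithmetic, using the two sample facts given in the hint:

```python
import numpy as np

sum_K = 3590                            # sample sum of Ki
avg_log_binom = 0.8502                  # sample average of log(Ki choose di)
n = 1000                                # number of families

logL_Z = sum_K * np.log(0.5) + n * avg_log_binom
print(logL_Z)                           # about -1638.2 (the -1637.67 above rounds log .5 to -.693)
```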
7. Using the facts in part 6, test the null hypothesis of Theory Z against the alternative
hypothesis of Theory A using a likelihood ratio test.
Chi squared = 2(1637.67 - 1016.87) = 1241.6, with 5 degrees of freedom (Theory Z fixes α and all four
elements of β). This is very large. The hypothesis (Theory Z) is rejected.
8. Experience might suggest that the gender of children runs in streaks – after having three daughters
in a row, one might begin to suspect it is less than random at that point. Suppose births did run in
streaks. What would streaks imply for the statistical approach we used above to test Theory A?
(Hint: this question does not have a single right answer. It asks you to think like a statistician,
and question your methods if your assumptions are not met. In two or three sentences suggest
what this would imply for your model and your approach.)
If there really are streaks, then the individual births are not independent. The way we constructed the
log likelihood above assumes independence – that is what makes the count of daughters binomial. So, the results would be questionable.
[10] Part III. Regression basics.
The regression of V on X gave this Minitab output:
Regression Analysis: V versus X
The regression equation is V = 99.3 - 1.02 X
Predictor      Coef   SE Coef       T      P
Constant     99.301     2.478   40.08  0.000
X           -1.0169    0.2130   -4.77  0.000

S = 4.023   R-Sq = 33.6%   R-Sq(adj) = 32.1%
Analysis of Variance
Source            DF        SS       MS       F      P
Regression         1    368.89   368.89   22.79  0.000
Residual          45    728.26    16.18
Total             46   1097.15
Answer T (true) or F (false) to each of the following. Explain your answer in one short sentence.
1. __F___  The total sum of squares, namely 1,097.15, provides solid evidence of the effect
of a significant regression. The total sum of squares by itself says nothing about significance.
2. __T___  Most of the residuals are between -8 and +8. S = 4.023, so ±2S is about ±8.
3. __F___  The data provide convincing evidence that increasing X by one unit will cause a
decrease in V of 1.0169 units. Not cause – the regression shows association, not causation.
4. __T___  The correlation between the variables X and V is negative. The slope estimate is negative.
5. __F___  The regression slope would be regarded as not statistically significant. t = -4.77 with
P = 0.000, so it is clearly significant.
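A quick consistency check of the quantities used in these answers, taken from the Minitab output above:

```python
import math

ss_reg, ss_resid, ss_total = 368.89, 728.26, 1097.15   # ANOVA table
ms_reg, ms_resid = ss_reg / 1, ss_resid / 45

print(ss_reg / ss_total)        # R-Sq, about 0.336 (33.6%)
print(ms_reg / ms_resid)        # F, about 22.8
print(math.sqrt(ms_resid))      # S, about 4.02, so +/- 2S is roughly +/- 8 (item 2)
```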
[10] Part IV. More regression basics
A regression analysis of variance produced the following table. Some of the positions are blank,
and these are the subjects of the questions that follow.
Source of variation    Degrees of freedom    Sum Squares    Mean Square      F
Regression                  (a)   5             80.00          16.00      (b)  8
Error                          100           (c) 200.00     (d)  2.00
Total                       (e) 105            280.00

Supply the numbers that go in the positions marked (a) through (e).
(a) = 80.00/16.00 = 5;  (c) = 280.00 - 80.00 = 200.00;  (d) = 200.00/100 = 2.00;
(b) = 16.00/2.00 = 8;  (e) = 5 + 100 = 105.
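The same arithmetic as a short sketch:

```python
ss_reg, ms_reg = 80.00, 16.00           # given cells
df_error, ss_total = 100, 280.00        # given cells

a = ss_reg / ms_reg                     # (a) regression degrees of freedom = 5
c = ss_total - ss_reg                   # (c) error sum of squares = 200
d = c / df_error                        # (d) error mean square = 2
b = ms_reg / d                          # (b) F statistic = 8
e = a + df_error                        # (e) total degrees of freedom = 105
print(a, b, c, d, e)
```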
[10] Part V. Essential Theory
Explain the difference between unbiasedness and consistency applied to an estimator.
See class notes
[15] Part VI. Heteroscedastic Regression
I am interested in the regression model
Health = α + β1 Age + β2 Insurance + ε.
‘Insurance’ is a binary variable that equals 1.0 if the person has insurance and 0.0 if they do not.
The regression is ordinary save for the fact that I know that Var[ε] = σ² for men and Var[ε] = cσ²
for women, where 0 < c < 1. (I know the value of c.)
1. Ordinary least squares is unbiased but inefficient. Explain.
See class notes
2. Weighted least squares would be more efficient. How would I compute the weighted least
squares estimator based on a sample that contains a mixture of men and women? (Hint:
remember, c is known.)
For women, divide by sqrt(c); for men, use c = 1. That is,
regress Health/sqrt(c) on 1/sqrt(c), Age/sqrt(c) and Insurance/sqrt(c), with c = 1 for the male observations.
3. Suppose the statement of the model is correct, but I do not know the value of c. Can you
suggest a way that I might do weighted least squares?
Use least squares first. Using the observations on women only, compute (1/nw) Σi ei². This estimates cσ².
Do the same for the observations on men; this estimates σ². The ratio of the two gives the estimate of c
that I need, and I can then do weighted least squares as in part 2.
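A minimal sketch of this two-step idea, using entirely made-up data (the variable names and numbers below are assumptions for illustration, not anything from the exam):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
age = rng.uniform(20, 70, n)
insurance = rng.integers(0, 2, n)
female = rng.integers(0, 2, n)                      # 1 = woman (variance c * sigma^2)
sigma, c_true = 5.0, 0.4
eps = rng.normal(0.0, sigma * np.where(female == 1, np.sqrt(c_true), 1.0))
health = 10 + 0.2 * age + 3.0 * insurance + eps

X = np.column_stack([np.ones(n), age, insurance])

# Step 1: ordinary least squares, then estimate c from the group residual variances
b_ols, *_ = np.linalg.lstsq(X, health, rcond=None)
e = health - X @ b_ols
c_hat = np.mean(e[female == 1] ** 2) / np.mean(e[female == 0] ** 2)

# Step 2: weighted least squares -- divide each observation by the square root
# of its relative variance (sqrt(c_hat) for women, 1 for men)
w = np.where(female == 1, np.sqrt(c_hat), 1.0)
b_wls, *_ = np.linalg.lstsq(X / w[:, None], health / w, rcond=None)
print(c_hat, b_wls)
```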
[10] Part VII. Statistical Theory
Suppose the density of x is f(x) = (1/2) x^(-1/2) exp(-(1/2)x), x ≥ 0.
What is the density of z = sqrt(x)? (Hint: does there seem to be a stray 2 in your answer? Note
that z = -value and z = +value are produced by the same x. So, z ranges from -∞ to +∞, with half
the probability mass on either side of zero.)
See class notes and first exam. Change of variable
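A sketch of the change-of-variable step, assuming the density as reconstructed above: for z > 0, x = z² and dx/dz = 2z, so g(z) = f(z²) × 2z = (1/2) z⁻¹ exp(-z²/2) × 2z = exp(-z²/2). Since z = -sqrt(x) and z = +sqrt(x) come from the same x, half of the probability mass goes to each side of zero, giving g(z) = (1/2) exp(-z²/2), -∞ < z < ∞.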
[10] Part VIII. Very Basic Statistics
[Histogram of the variable Q: frequency axis from 0 to about 64; Q ranges from about .001 to 1.384, with axis ticks at 0.0, .35, .70, 1.05, 1.40.]
The histogram above describes 1,000 observations on the variable q.
1. Provide a guess of the sample mean, and explain how you obtained it.
About .62 or so. Right of median
2. Provide a guess of the sample median and explain how you obtained it.
About .6. Less than mean.
3. Provide a guess of the sample standard deviation and explain precisely how you obtained it.
The range from .001 to 1.384 is about 6 standard deviations, so s is roughly 1.383/6 ≈ .23.
4. Are these data skewed to the left or to the right, or not at all?
Right
5. What is kurtosis? (And can it be cured?)
Thick tails. With patience and care.
[15] Part IX. Bivariate Outcomes.
During our discussions of the American Express study, I guessed that the credit scoring agency
revealed a preference for business employed applicants over self employed applicants in the
acceptance decisions. The tables below give a frequency count followed by the corresponding
sample proportions. (SelfEmpl = 1 means self employed. CardHldr = 1 means accepted.)
+--------+-------------------+--------+
|        |      CARDHLDR     |        |
|SELFEMPL|      0         1  |  Total |
+--------+-------------------+--------+
|       0|   2729      9936  |  12665 |
|       1|    216       563  |    779 |
+--------+-------------------+--------+
|   Total|   2945     10499  |  13444 |
+--------+-------------------+--------+
--------+--------------------------------------------
 Percent|   CARDH=0      CARDH=1       Total
--------+--------------------------------------------
 SELFE=0|   .202990      .739066     .942056
 SELFE=1|   .0160666     .0418774    .0579441
   Total|   .219057      .780943     1.00000
--------+--------------------------------------------
1. Test the hypothesis that cardholder status and self employment status are independent against
the hypothesis that they are dependent.
Contingency table test. See class notes in part 10.
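A sketch of the contingency table test using the counts above:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: SELFEMPL = 0, 1; columns: CARDHLDR = 0, 1
table = np.array([[2729, 9936],
                  [216,   563]])
chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(chi2, p, dof)             # 1 degree of freedom; a small p-value rejects independence
```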
2. The following are the results of a logistic regression of cardholder status in which the only
variable that explains the accept/reject decision is whether the applicant is self employed or not.
-----------------------------------------------------------------------------
Binary Logit Model for Binary Choice based on 13,444 applications
Dependent variable                CARDHLDR
--------+--------------------------------------------------------------------
        |              Standard               Prob.       95% Confidence
CARDHLDR|  Coefficient    Error        z     |z|>Z*          Interval
--------+--------------------------------------------------------------------
        |Characteristics in numerator of Prob[CARDHL=1]
Constant|   1.29223***    .02161     59.79    .0000      1.24987    1.33459
SELFEMPL|   -.33423***    .08290     -4.03    .0001      -.49671    -.17174
--------+--------------------------------------------------------------------
Are these results consistent with your findings in part 1? Explain.
Yes. Self employment does appear to be related to cardholder status: the SELFEMPL coefficient is
negative and highly significant (z = -4.03), which matches the dependence found in the contingency table.
3. To follow up the analysis in parts 1 and 2, I examined the applicants whose applications were
accepted – they were given a credit card. Let's see if self employed people default more often
than those who are business employed. Here is the logistic regression model based on the 10,499
cardholders. Do these results validate the reluctance to accept applications from the self employed?
-----------------------------------------------------------------------------
Binary Logit Model for Binary Choice
Dependent variable                DEFAULT
--------+--------------------------------------------------------------------
        |              Standard               Prob.       95% Confidence
 DEFAULT|  Coefficient    Error        z     |z|>Z*          Interval
--------+--------------------------------------------------------------------
        |Characteristics in numerator of Prob[DEFAUL=1]
Constant|  -2.24696***    .03412    -65.86    .0000     -2.31383   -2.18009
SELFEMPL|   -.17244       .15760     -1.09    .2739      -.48133     .13645
--------+--------------------------------------------------------------------
They do not. The sign is negative and the variable is not significant in the model.
[10] Part X. Function of a random estimator
1. In logistic regression settings, researchers often compute 'odds ratios' for dummy variables
such as SELFEMPL. The 'odds ratio,' exp(β), gives the multiplicative change in the odds,
Prob(accept)/Prob(reject), when the dummy variable changes from 0 to 1. Compute the odds ratio
for SELFEMPL in the logistic regression for DEFAULT and discuss your finding.
exp(-.17244) = .84. The estimated odds of default for a self employed cardholder are about 84% of the
odds for a business employed cardholder, and the coefficient is not statistically significant.
2. The standard error for the estimator of β in part 1 above is 0.15760. How would you compute
the standard error for the odds ratio, exp(b)?
Delta method. The derivative of exp(b) with respect to b is exp(b). Square it, multiply by the variance
of b, then take the square root: sqrt(exp(b)² × .15760²) = exp(b) × .15760 = .84 × .1576 ≈ .13.
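A sketch of both computations:

```python
import numpy as np

b, se = -0.17244, 0.15760       # SELFEMPL coefficient and standard error, DEFAULT logit
odds_ratio = np.exp(b)          # about .84
se_or = odds_ratio * se         # delta method: |d exp(b)/db| evaluated at b, times se
print(odds_ratio, se_or)        # roughly .84 and .13
```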