Mid-term exam Practice problems

Most problems are “short answer” problems. You receive points for the answer and
the explanation. Full points require both, unless otherwise specified. Explaining your
answer is never a bad idea.
A little practice on stuff from last semester in here; that stuff is always relevant, so I
want you to make sure that knowledge hasn’t left you. Plus, answering these questions
should be good practice on how to answer test questions. Most questions rely on a mix
of knowledge that you have obtained over the last two semesters. Question 16 (EKC
question) is especially about issues we have not discussed this semester, but relies on
knowledge you should have. If you want, you can skip this one and concentrate on
other questions (but again, you should ultimately not find this to be a question that you
cannot answer).
1 (5 points)
True or false (no explanation necessary)
In the multiple regression model, the goodness-of-fit measure R-squared always increases
(or remains the same) when an additional explanatory variable is added.
True
2 (3 points)
True or false (no explanation necessary)
You estimate a regression model with three independent variables and get an R2 = 0.86.
True or false: Such a high R2 in a regression model with three independent variables
indicates that each of the explanatory variables must be (individually) statistically significant.
False
3 (3 points)
True or false (no explanation necessary)
True or false: When using a probit model, the effect of a change in x on the probability
that y = 1 depends on the value of x at which you evaluate the change.
True
4 (5 points)
True or false (no explanation necessary)
Estimated OLS regression coefficients (the β̂’s) always have the same sign as the
correlation between y and the respective x’s.
True
5 (5 points)
(this is a hard one; requires some thinking)
Multiple choice — circle all that apply (no explanation necessary)
In an OLS model with a constant, endogeneity is characterized by:
(a) E(u|x) ≠ 0
(b) Var(u|x) ≠ 0
(c) Var(u|x) ≠ σ²
(d) E(u|x) = E(u)
The answer is (a). Only (a). If you didn’t recall exactly what endogeneity meant,
you could have found a major clue in the later problem about omitted variable bias
(in which OVB was defined). The (b) and (c) choices are about the variance of the
unobservable term, so we can discount them right away. The choice (d) could have
been tricky. (d) says that the expected value of u, conditional on x is equal to the
unconditional expectation of u. In any OLS regression model with a constant, E(u) = 0.
So (d) implies that E(u|x) = 0. The conditional expectation of u being zero does not
characterize endogeneity. Endogeneity is characterized by a situation in which x and
u are correlated, i.e. knowing something about x would change our expectation of the
value of u. And our expectation of the value of u is always 0 to begin with. So (a) is
the only correct answer.
6 (3 points)
Multiple choice (no explanation necessary)
In a linear probability model all of the following are true except:
(a) the estimated coefficients can be interpreted directly as marginal effects
(b) R2 is a good measure of how well the model fits the data
(c) the predicted probability can be negative
(d) the errors are always heteroskedastic
The answer is (b). If you didn’t already know that (b) was false, by process of elimination:
(a) is true, since the coefficients of a LPM model are equal to marginal effects (whereas
in a probit model the mfx are a nonlinear function of the coefficients); (c) is true because
a linear probability model is nothing but an OLS model, and an OLS model predicts
values above 1 and below 0; (d) is true by construction of the LPM — the fit of the model
is such that the variance must change as x changes. In order to gain some intuition as
to why (b) is false, think about what R2 is a measure of: the proportion of variation in
y that is explained by our model. If we are using an LPM, y is binary (0,1). If y
is binary, its values are artificially constrained to be either 0 or 1. Our predictions will
almost never be exactly equal to zero or one, i.e. ŷ typically lies strictly between zero
and one (or even outside that range, as noted in (c)).
Compare this situation to a continuous y. With the LPM, R2 gives a much less useful
(more artificial, if you will) gauge of the variation explained by our model.
7 (5 points)
Multiple choice (no explanation necessary)
In the probit model all of the following are lies except:
(a) β0 cannot be negative since probabilities have to lie between 0 and 1
(b) βj tells you the effect of a unit increase in xj on the probability that y = 1
(c) βj does not have a simple interpretation (i.e. cannot be interpreted directly)
(d) β0 is the probability of observing y when all the x variables are 0
The answer is (c) — the marginal effects are a nonlinear function of the coefficients βj ,
but the βj themselves don’t tell us much.
8 (5 points)
Short answer — provide an answer and an explanation
You plan to run an instrumental variables regression (a two-stage least squares model).
You run a first-stage regression in order to examine the relationship between your candidate instrument and the variable you suspect of being endogenous in the main regression.
Which concern would you have if the first-stage regression results were as indicated:
(a) the relationship between the candidate instrument and the endogenous regressor
seems weak — the t-stat is low, the R2 is low, and the F-stat is low. Consequently, you
suspect that the instrument might not be valid.
(b) the relationship between the candidate instrument and the endogenous regressor
seems strong — conditions opposite of those listed in (a). Consequently, you suspect
that the instrument might not be valid.
(c) the relationship between the candidate instrument and the endogenous regressor
seems weak. Consequently, you suspect that the instrument might lead to imprecise
estimates of the effect of interest in the second stage.
The answer is (c). A poor relationship between an instrument and an endogenous
variable (i.e. the variable that is being instrumented for) is reason to be concerned about
the amount of variation that will be left once you “clean” the endogenous variable. A
weak relationship is NOT an indication of validity or lack of validity of the instrument.
9 (6 points)
Very short answer (don’t get stuck with overly long explanations on this one) (Ballantine
question) Suppose there are three variables, y, x1 , and x2 . The variation in y is represented by the circle made up of area A+B+C+D in the figure below; x1 is represented
by area B+C+E+F; x2 is represented by area C+D+F+G. Three questions: (a) For the
model
y = β0 + β1 x1 + β2 x2 + u,
what area of the Ballantine figure represents u?
(b) In the model
x1 = α0 + α1 x2 + w,
what area of the Ballantine figure represents w?
(c) Finally, in the model
y = γ0 + γ1 w + z,
where w refers to the error term from the model in part (b), what area of the Ballantine
figure represents z? Answer all three questions:
(a) what area of the Ballantine figure represents u? (2 points)
(b) what area of the Ballantine figure represents w? (2 points)
(c) what area of the Ballantine figure represents z? (2 points)
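[Figure: Ballantine (Venn) diagram of three overlapping circles representing y, x1, and x2, with regions labeled A through G.]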
The area of the Ballantine that represents u is the area A. The area of the Ballantine
that represents w is B+E. The area of the Ballantine that represents z is A+C+D.
10 (5 points)
Show your work or explain your answer. Suppose you have data on y and x. You specify
the following regression model
y = β0 + β1 x + u
and obtain the following results from OLS
β̂0 = 3, β̂1 = 4.
You then add 10 to the y data. That is, you create
y∗ = y + 10.
What will be the regression coefficients when you estimate the model
y∗ = α0 + α1 x + w?
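That is, what will be α̂0 and α̂1?
α̂0 = 13 and α̂1 = 4. Adding a constant to every value of y shifts the intercept by that
constant and leaves the slope unchanged.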
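If you want to convince yourself of this answer, a quick numerical check helps. The sketch below is only an illustration: the data are hypothetical, constructed so that a regression of y on x gives an intercept of 3 and a slope of 4, and the helper ols_simple is a name made up for this example.

```python
def ols_simple(x, y):
    """Return (intercept, slope) from a bivariate OLS regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical data: the noise has mean zero and is uncorrelated with x
# in this sample, so OLS recovers intercept 3 and slope 4.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
noise = [0.5, -0.5, 0.0, -0.5, 0.5]
y = [3.0 + 4.0 * xi + e for xi, e in zip(x, noise)]

# Create y* = y + 10 and rerun the regression.
y_star = [yi + 10.0 for yi in y]

b0, b1 = ols_simple(x, y)
a0, a1 = ols_simple(x, y_star)
print("original:", b0, b1)
print("shifted: ", a0, a1)
```

The slope depends only on deviations from means, and adding 10 to every y leaves those deviations untouched. The intercept absorbs the whole shift.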
11 (5 points)
Short answer. There is a class of models sometimes referred to as “polychotomous dependent variable models” that can be used to analyze data when there are a number
of possible discrete outcomes. Match the models listed below to the appropriate example.
Model
Ordered probit
Multinomial logit
Poisson model
Use
transportation choice
number of bankruptcies
bond rating height
Ordered probit matches to bond rating; Multinomial logit matches to transportation
choice; Poisson model matches to number of bankruptcies.
12 (10 points)
Short answer. Explain the difference between censored data and truncated data. Use the
example of tickets to a sporting event (treat this as the y-variable) as an example of one
of these two types of data. Use your own example of the other type of data.
“Censoring” of data is when some part of the data is hidden from us — we cannot
observe y data beyond certain bounds (above or below some cut-point). We do, however,
observe the x data corresponding to these censored values of y. Truncated data, on the
other hand, exists when we do not observe anything at all when y is above or below
the bound. We do not observe y and we do not observe x. The demand for tickets
sold to a sporting event is an example of censored data. When there is a sellout, we
do not observe what demand might have been — all we observe is that there was a
sellout, so demand is equal to the maximum capacity of a stadium. We observe other
things that might help drive demand, such as the record of the team, the population
and income of the home city, the opponent, etc. (Another very similar example here:
http://link.springer.com/article/10.1007%2FBF02031947). Another example might be
constructed by talking about “top-coding” of survey data (see Wooldridge if you don’t
recall what top-coding is). An example of a truncated sample exists when a whole
observation is missing due to the y variable being at or beyond a bound. An example of
this might be survey data on poverty when observations are not collected on individuals
with income above the poverty line. No data is collected when y is above this bound.
13 (22 points)
You are interested in estimating the impact of a household head’s level of education
on the food security status of his or her household. (Roughly stated, the issue of food
security concerns whether a person or household can count on having access to a sufficient
number of calories) You have data from a 2008 household survey run in Senegal. The
survey is nationally representative and includes over 20,000 households and the following
variables:
varname      description
foodSecure   dummy for households whose monthly income lies above the food
             poverty line (the line at which the household would achieve 2100
             calories per person per day)
educHd       years of education of household head
rural        dummy if household is situated in a rural area; zero otherwise
employHd     dummy if household head is currently employed
members      number of individuals living in the household
kids         number of children in the household
nMale        number of men in the household
nFem         number of women in the household
(a) Write down a model that you would run using OLS. What is the marginal effect of
educHd on the food security status of the household in this model? (3 points)
A simple linear probability model would be written
foodSecure = β0 + β1 educHd + β2 rural + β3 employHd +
β4 members + β5 kids + β6 nMale + u
There are several ways that you could have written this model that I would have
considered reasonable. Although the question only tells you explicitly that you would
be interested in the relationship between foodSecure and educHd, I would be surprised
if you didn’t include some of the other variables, as they seem to sensibly belong in
the DGP of foodSecure. With regard to the four variables members, kids, nMale,
and nFem, I could see you including each of them on the assumption that they each
measure unique things, or I could see you omitting one on the assumption that nMale
was the number of ADULT males, nFem was the number of ADULT females, and thus
kids + nMale + nFem = members, i.e. there would be perfect collinearity if you included
all of the variables. But this isn’t what the question was about. Just saying.
The part about interpretation should be straightforward: the marginal effect of
educHd on the food security status of the household is given by β1.
(b) Specify an alternative model that you would run using maximum likelihood estimation (you need not write down a specific functional form based on a specific distribution;
you may use the notion of a generic function “G” to represent the functional form you
have named using words). How would you obtain the marginal effect of educHd on the
food security status of the household in your new model? (3 points)
One alternative model that we could run using ML estimation is the logit. The other
is a probit. Using the direct analog of the model above I would write the alternative
model as
foodSecure = G(β0 + β1 educHd + β2 rural + β3 employHd +
β4 members + β5 kids + β6 nMale)
To get the marginal effect of educHd on the food security status we would need to
take a derivative (or have Stata do it for us). No need to actually take the derivative to
get points. As long as you realized that the coefficient itself is not the marginal effect
(and said so), I’m good with the answer. If you want to get all fancy, use the generic
“G” function and show me you know how to apply the chain rule.
G′(stuff) β1
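To see the chain rule at work, here is a minimal sketch that takes G to be the standard normal CDF (i.e. a probit, one of the two options named above). Everything numerical in it is an assumption for illustration: the coefficients beta0 and beta1 are made up, and the other regressors are dropped to keep the index simple.

```python
import math


def normal_cdf(z):
    """Standard normal CDF; plays the role of G in a probit."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_pdf(z):
    """Standard normal density; this is G'(z) for the probit."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


# Hypothetical coefficients (not estimates from any real data).
beta0 = -1.0
beta1 = 0.1


def index(educ):
    """The 'stuff' inside G, with educHd as the only regressor."""
    return beta0 + beta1 * educ


# Marginal effect of educHd on P(foodSecure = 1): G'(stuff) * beta1.
mfx = {educ: normal_pdf(index(educ)) * beta1 for educ in (0, 10, 20)}
for educ, effect in mfx.items():
    print(educ, "years of education -> marginal effect", round(effect, 4))
```

The coefficient beta1 is the same in every row, but the marginal effect is not: it is largest where the index is near zero and shrinks in the tails. In the LPM the marginal effect would simply be β1 everywhere.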
(c) What are the pros and cons of model (a) vs. model (b)? Name all that you can. (3
points)
There are 4 pros/cons that we talked about. If you got 3, I was happy. The cons of the
LPM are most prominent: heteroskedastic; predictions above 1 and below 0; poor fit to
the data; constant mfx might not make sense in this case. The virtues are simplicity
and ease of interpretation of the coefficients.
(d) A colleague suggests that your estimate of the effect of household head’s level of
education suffers from omitted variable bias since you do not have a measure of household
income. Describe the circumstances under which your colleague would be correct. Give
an example and be specific about the nature of the bias that you would expect. (5
points)
Hey colleague — go get your own problems. There are two ways you could go with this
problem, either of which is fine with me: (1) you could argue that your colleague has a
point, because the omitted variable is correlated with both food security and education;
(2) you could argue that our food security variable is simply a noisy (i.e. discrete) version
of household income, and thus observing household income would perfectly predict our
y-variable.
Option 1: I assume that household income is positively correlated both with food security
and with education. If this is so, then we would expect education to get too much credit
for improving food security in a regression that omitted household income. That is, we
expect a positive bias.
Option 2: I don’t find the claim very compelling. If we could measure household income
we might be able to get a finer measure of food security, in the sense that we could
measure more directly the ability of individuals to afford food. However, the dependent
variable that we DO have is a coarse measure of income. Having a finer measure would
not cause us to put this variable on the right hand side of our equation. I think the
colleague is a bit of a jerk.
(e) Your colleague suggests using the education level of the household head’s father as
an instrumental variable for household head’s level of education. Do you believe that
this instrument is valid ? Explain, being as specific as possible. (8 points)
The issue of validity is strictly concerned with whether or not the instrument is correlated
with the endogenous regressor (whether or not the father’s education, call it educHF, is
correlated with educHd) and
simultaneously uncorrelated with the source of the endogeneity. Therefore you can
answer this question regardless of how you answered the previous part. In our case, the
source of the alleged endogeneity is an omitted variable: household income (inc). The
primary argument
is whether educHF is uncorrelated with household income. Does this seem likely?
In my opinion, the argument is tenuous. I would accept either answer if it was well
argued (welcome to the ambiguity that is advanced econometrics!). An argument that
stated that the instrument was invalid might go something like this: the household
head’s father’s level of education was a determinant of his children’s opportunities, and thus
their education and future income. An argument that stated that the instrument was
valid might go something like this: the household head’s father’s level of education
is a determinant of his children’s education, as an educated household is more likely to
foster an environment in which education is valued; however, more education in the
parent generation does not lead to income in the child generation, so educHF is likely
to be uncorrelated with inc.
14 (5 points)
Short answer
We estimate a regression of y on x:
y = β0 + β1 x + u,
and obtain estimates β̂0 and β̂1, and an R-squared coefficient of 0.24. If you were to
change the units of x by multiplying x by 10, what is the new R-squared? Explain why
your answer makes intuitive sense.
If you change the units of (any number of) variables in the equation you do nothing
to change the fundamental relationship. The R-squared is exactly the same as it was
(0.24). Try it for yourself using any regression you like in order to prove it to yourself.
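Taking the "try it for yourself" suggestion literally, here is a small sketch. The data are hypothetical and the helper fit_simple is a name invented for this example; any bivariate data set would do.

```python
def fit_simple(x, y):
    """Bivariate OLS of y on x. Returns (intercept, slope, R-squared)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    ssr = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return intercept, slope, 1.0 - ssr / sst


# Hypothetical data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]

# Change the units of x by multiplying by 10.
x_scaled = [10.0 * xi for xi in x]

_, slope, r2 = fit_simple(x, y)
_, slope_scaled, r2_scaled = fit_simple(x_scaled, y)

print("R-squared:", r2, "vs", r2_scaled)
print("slope:    ", slope, "vs", slope_scaled)
```

The slope shrinks by a factor of 10 to compensate for the new units, so the fitted values (and hence the R-squared) do not move.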
15 (10 points)
Short answer
Suppose that you are attempting to build a model that explains aggregate savings behavior in the United States as a function of the level of interest rates. Would you rather
construct your data sample during a period of fluctuating interest rates or a period in
which interest rates were relatively constant? Why?
More variation! We want more variation! You would like to sample during a period of
fluctuating interest rates, as the relationship between movements in interest rates and
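movements in aggregate savings is most easily observed when . . . you guessed it, interest
rates are moving. More variation in an independent variable reduces the variability of
the coefficient estimate:
Var(β̂j) = σ² / [SSTj (1 − Rj²)],
where SSTj is the total sample variation in xj and Rj² is the R-squared from a regression
of xj on the other independent variables. A larger SSTj means a smaller variance.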
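A rough numerical illustration of the variance formula, for the bivariate case where there are no other regressors (so the R-squared term in the denominator is zero). Both interest-rate series and the error variance sigma_sq are hypothetical numbers chosen for the example.

```python
def total_variation(x):
    """SST of x: the sum of squared deviations from the sample mean."""
    x_bar = sum(x) / len(x)
    return sum((xi - x_bar) ** 2 for xi in x)


def slope_variance(x, sigma_sq):
    """Variance of the OLS slope with a single regressor: sigma^2 / SST."""
    return sigma_sq / total_variation(x)


sigma_sq = 1.0  # hypothetical error variance, held fixed across periods

# Hypothetical interest rates (in percent) for two sample periods.
rates_flat = [4.9, 5.0, 5.1, 5.0, 4.9, 5.1]
rates_moving = [1.0, 3.0, 5.0, 7.0, 9.0, 5.0]

var_flat = slope_variance(rates_flat, sigma_sq)
var_moving = slope_variance(rates_moving, sigma_sq)

print("slope variance, flat rates:  ", var_flat)
print("slope variance, moving rates:", var_moving)
```

With the same error variance and the same number of observations, the period with fluctuating rates gives a far more precise slope estimate, purely because SST is larger.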
16 (15 points)
Short answer
The Environmental Kuznets Curve (EKC) is a theory relating income and productivity
to environmental conditions. The theory (roughly) goes something like this. Environmental conditions degrade as income increases, because the act of producing lots of
goods is taxing on the environment. At some point of development, however, the relationship changes. As people become richer, they start to develop preferences for a
cleaner environment. (When you are poor, you’d rather have food and shelter and smog
than no food, no shelter, and no smog, but when you nail down the food and shelter
thing, you prefer less smog to a fancier shelter). The (rough) shape of a theoretical EKC
relationship is given in the figure below.
[Figure: theoretical EKC relationship. Vertical axis: Pollution; horizontal axis: GDP.]
We gather data on sulfur-dioxide emissions and GDP in 100 countries and estimate
an environmental Kuznets curve model
SO2 = β0 + β1 GDP + β2 GDP² + u,
where SO2 represents sulfur-dioxide emissions, GDP is, well, GDP, and GDP² is GDP squared. Write down the hypothesis test (or tests) that you would use to help you
determine whether the data is consistent with the EKC theory. That is, choose the
hypothesis test or tests that you feel would help you falsify the EKC theory, or, should
you not reject the null, support it.
The EKC would be fulfilled if the slope on GDP (β1) were positive, indicating a positive
relationship between GDP and SO2, and the slope on GDP² (β2) were negative, indicating that at some point as GDP increases, SO2 decreases. Other answers are possible,
but that would be my answer. To write this in terms of formal null and alternative hypotheses, I would express the above in the following way:
H0: β1 = 0 and β2 = 0    HA: H0 not true
I would use an F-test to test whether the two coefficients were jointly non-zero, since
the EKC requires both coefficients to simultaneously be non-zero. If you wanted to test
each coefficient individually, you might do so using a pair of t-tests. The hypotheses
(with one-sided alternatives) appropriate for such a pair of nulls are below.
H0: β1 = 0    HA: β1 > 0
H0: β2 = 0    HA: β2 < 0.
17 (15 points)
Short answer
We collect variables y (the dependent variable), x1, and x2. If we run a multiple regression of y on x1 and x2
y = β0 + β1 x1 + β2 x2 + u
we obtain estimated coefficients β̂0, β̂1, and β̂2. The estimated slope coefficients β̂1 and
β̂2 are interpreted as independent effects. That is, β̂1 is interpreted to be the effect of
x1 on y, independent of the effect of x2 on y. And vice-versa. Another way to say the
same thing is that β̂1 is the effect of x1 on y, controlling for the effect of x2 on y.
Use what is sometimes referred to as the “partialling-out procedure” to explain what,
exactly, is meant by “independent effect” or “control.” That is, demonstrate what β̂1
represents in the multiple-regression context by describing a series of bivariate regressions
(regressions with one dependent variable and one independent variable). You should be
able to arrive at the results of a bivariate regression in which one of the coefficient
estimates is exactly equal to β̂1 in the multiple regression above. Explain how this
procedure highlights the “independent effect” interpretation of β̂1. [Hint: you don’t
need to remember the “partialling-out” procedure to complete this question. The β̂1
coefficient in the multiple regression above represents the independent effect of x1 on
y, when the effect of x2 on y has already been accounted for. Use this knowledge to
reconstruct β̂1 with a series of bivariate regressions involving y, x1, and x2.]
This question is a gimme if you remember the procedure we went over in class. I tried
to describe the procedure as directly as I could without revealing the exact steps. I
didn’t receive any clarifying questions, so I assume everyone knew what I was talking
about. Thus, the explanation is what really matters here. You want to make clear that,
essentially, you understood the point of the exercise! Here is what I would give for an
answer:
β̂1 is an estimate of the independent effect of x1 on y. To replicate this estimate
using a series of bivariate regressions, we ultimately want to regress y on the part of
x1 that is independent of x2. We need to isolate the variation in x1 that influences y,
but is not related to variation in x2. x1 can be separated into two parts: parts that are
correlated with x2 and parts that are not. To find the part that is correlated with x2,
we regress x1 on x2:
x1 = α0 + α1 x2 + v
and obtain estimated coefficients α̂0 and α̂1. We use these estimated coefficients to
obtain predictions of x1:
x̂1 = α̂0 + α̂1 x2.
These predictions represent the variation in x1 that is explained by x2. That is, the
prediction x̂1 represents the variation in x1 that is related to variation in x2. If x1 is
a variable that contains all the variation, and x̂1 is a variable that contains only the
variation related to variation in x2, then x1 − x̂1 is a variable that contains only the
variation not related to variation in x2. We have a name for this variable. This variable,
x1 − x̂1, is exactly the residual from the regression of x1 on x2. Let’s call it r1. When
we then regress y on r1
y = δ0 + δ1 r1 + w
we get estimated coefficients δ̂0 and δ̂1. The estimated coefficient δ̂1 represents the effect
of r1 on y. r1 is a variable that contains all the variation in x1 that is not explained by
x2. Therefore, δ̂1 is equal to β̂1 from the original multivariate regression. This procedure
highlights the fact that β̂1 is an estimate of the effect of x1 on y, with the effect of x2
removed entirely. This is what is meant by independent effect.
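The partialling-out steps above can be checked numerically. The sketch below uses hypothetical data and made-up helper names (cross, multiple_beta1); the multiple-regression slope is computed from the standard two-regressor closed form rather than from a regression package.

```python
def mean(v):
    return sum(v) / len(v)


def cross(a, b):
    """Sum of cross-products of deviations from means."""
    a_bar, b_bar = mean(a), mean(b)
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))


def multiple_beta1(y, x1, x2):
    """OLS coefficient on x1 in a regression of y on a constant, x1, and x2."""
    s11, s22, s12 = cross(x1, x1), cross(x2, x2), cross(x1, x2)
    s1y, s2y = cross(x1, y), cross(x2, y)
    return (s1y * s22 - s2y * s12) / (s11 * s22 - s12 ** 2)


# Hypothetical data.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0, 7.0]
y = [3.0, 4.0, 8.0, 7.0, 12.0, 13.0]

# Step 1: regress x1 on x2 and keep the residual r1 = x1 - x1_hat.
alpha1 = cross(x2, x1) / cross(x2, x2)
alpha0 = mean(x1) - alpha1 * mean(x2)
r1 = [a - (alpha0 + alpha1 * b) for a, b in zip(x1, x2)]

# Step 2: regress y on r1; the slope is delta1.
delta1 = cross(r1, y) / cross(r1, r1)

beta1 = multiple_beta1(y, x1, x2)
print("beta1 from multiple regression:", beta1)
print("delta1 from partialling out:   ", delta1)
```

The two numbers agree up to floating-point rounding, which is the whole point: the multiple-regression coefficient on x1 only uses the variation in x1 that x2 cannot account for.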
18 (15 points)
Short answer
“Omitted variable bias” is a term that refers to a situation when omitting an important independent variable (an “x-variable”) from a regression equation causes us to
mis-estimate the effect of another independent variable that is included in the regression. Explain why omitted variable bias is an example of endogeneity. (Endogeneity is
when an independent variable is correlated with the unobservable, or “error,” term in a
regression).
Use the model below when constructing your explanation.
y = β0 + β1 x1 + β2 x2 + u
Explain how omitting x2 causes x1 to be endogenous (i.e. correlated with the unobservable in a regression of y on x1 ). (No algebra required! The answer can be given in
an intuitive discussion. If you find it easier to work with a real-variable example rather
than generic x and y variables, that is fine) [Hint: If you don’t know how to get started,
work out how omitting x2 would cause omitted variable bias in your estimation of β1
(the relationship between y and x1 ) under different scenarios describing the relationship
between y, x1 , and x2 .]
Our task is to explain how leaving x2 out of a regression causes x1 to be related to the
error term. You can get this directly from the statement of the problem, where I gave
you the definition of endogeneity. If the true model is
y = β0 + β1 x1 + β2 x2 + u,
but we regress
y = β0 + β1 x1 + w
instead, then we are ignoring the impact of x2 . Doing this means that when x2 moves,
causing y to move, we do not have any way to control for the independent effect of x1
on y (see previous question). When x2 moves, we assign all the credit (or “blame,” as
the case may be) to x1 . Consider the regression
y = β0 + β1 x1 + w.
All of the variation in y that cannot be explained by x1 is contained in the unobservable
term w. What is in w? Well, if x2 really matters, then the effect of x2 is in w. In fact,
if the true model is
y = β0 + β1 x1 + β2 x2 + u,
then w is equal to all the stuff we left out of the equation:
y = β0 + β1 x1 + (β2 x2 + u).
That is, w = (β2 x2 + u). We get omitted variable bias when x1 and x2 are correlated. If
x1 and x2 are correlated, then x1 and w are correlated, since w contains x2 . Omitting
x2 causes x1 to be correlated with the error term w, which is another way of saying that
x1 is endogenous.
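A small made-up example shows the argument in numbers. Everything here is an assumption for illustration: the true coefficients are set to β0 = 1, β1 = 2, β2 = 3, and u is set to zero in every observation so that the only thing in w is β2 x2.

```python
def mean(v):
    return sum(v) / len(v)


def cross(a, b):
    """Sum of cross-products of deviations from means."""
    a_bar, b_bar = mean(a), mean(b)
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))


# Hypothetical true model: y = 1 + 2*x1 + 3*x2 + u, with u = 0 for simplicity.
beta0, beta1, beta2 = 1.0, 2.0, 3.0
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]  # positively correlated with x1
y = [beta0 + beta1 * a + beta2 * b for a, b in zip(x1, x2)]

# The error term of the short regression contains the omitted variable.
w = [beta2 * b for b in x2]

# Regress y on x1 alone.
short_slope = cross(x1, y) / cross(x1, x1)

print("true beta1:             ", beta1)
print("slope when x2 is omitted:", short_slope)
print("sample co-movement of x1 and w:", cross(x1, w))
```

Because x1 and x2 move together, x1 and w move together, and the short regression hands x1 the credit for what x2 is doing: the estimated slope lands well above the true value of 2. If x1 and x2 were uncorrelated in the sample, the two slopes would coincide.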