5. Assessing studies based on multiple regression

Questions of this section:
• What makes a study using multiple regression (un)reliable?
• When does multiple regression provide a useful estimate of
the causal effect under consideration?
Conceptual framework:
• Internal and external validity
114
Definition 5.1: (Internal and external validity)
A statistical analysis is said to have internal validity if the statistical inferences about causal effects are valid for the population
being studied. The analysis is said to have external validity if
its inferences and conclusions can be generalized from the population and setting studied to other populations and settings.
Terminology:
• The population studied is the population of entities (people,
companies, ...) from which the sample was drawn
• The population of interest is the population of entities to
which the causal inferences from the study are to be applied
• By setting we mean the institutional, legal, social, and economic environment
115
Threats to internal validity:
• The estimator of a causal effect should be unbiased and
consistent
• Hypothesis tests should have the desired significance level
• Confidence intervals should have the desired confidence levels
−→ Requirements for internal validity are that the OLS estimator is unbiased and consistent and that standard errors
are computed in a way that makes confidence intervals
have the desired confidence levels
• These requirements might not be met for various reasons
−→ Threats to internal validity that lead to failures of our
OLS assumptions from Slide 18
116
Threats to external validity:
• Differences between the population being studied and the
population of interest
Example:
Medical studies that use animal populations like mice (the
population being studied), but aim at transferring the results to human populations (the population of interest)
• Even if the population studied and the population of interest
are identical, the study results may not be generalized due
to differences in the settings
Example:
The effect of an antidrinking advertising campaign on the
drinking behaviour of a group of first-term students might
differ at two universities if the legal penalties for drinking
differ at both universities
117
Threats to external validity: [continued]
• Important questions with respect to external validity are:
- How to assess the external validity of a study?
- How to design an externally valid study?
• Both issues require specific knowledge of the populations and
settings being studied and those of interest
• A rigorous treatment of both issues is beyond the scope of
this lecture
(cf. Shadish et al., 2002, for details)
−→ We focus on aspects of internal validity
118
5.1. Threats to internal validity of multiple regression analysis
Objectives:
• Survey of five reasons why the OLS estimator of a multiple
regression coefficient may be biased even in large samples
(Sections 5.1.1.–5.1.5.)
• All five sources of bias arise because the regressor is correlated with the error term in the population regression
−→ Violation of the first OLS assumption from Slide 18
• What can be done to reduce this bias?
119
5.1.1. Omitted variable bias
Omitted variable bias:
• If an omitted variable is a determinant of Yi and if it is correlated with at least one of the regressors, then the OLS
estimator of at least one of the coefficients will have omitted variable bias
(see Definition 2.5 on Slide 27)
• This bias persists even in large samples
−→ OLS estimator(s) is (are) inconsistent
• Mathematically, under omitted variable bias at least one of
the regressors is correlated with the error term ui implying
that
E(ui|X1i, . . . , Xki) ≠ 0
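The mechanism can be illustrated with a small simulation. This is only a sketch under a hypothetical data-generating process (the variable W, all coefficient values and the helper `ols` are assumptions made for illustration, not part of the lecture): W is a determinant of Y and is correlated with X1, so leaving it out should push the estimate of β1 = 2 towards roughly 2 + 3 · 0.8/1.64 ≈ 3.46, even with a very large sample.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # large sample: the bias does not vanish as n grows

# Hypothetical DGP: W determines Y and is correlated with X1.
w = rng.normal(size=n)
x1 = 0.8 * w + rng.normal(size=n)
u = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 3.0 * w + u  # true beta1 = 2


def ols(y, *regressors):
    """Return OLS coefficients (intercept first) via least squares."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


b_short = ols(y, x1)     # W omitted: X1 is correlated with the error term
b_long = ols(y, x1, w)   # W included: first OLS assumption holds again

print("beta1 estimate, W omitted :", b_short[1])
print("beta1 estimate, W included:", b_long[1])
```

In this setup the error term of the short regression is 3 · W + u, which is correlated with X1; that correlation is exactly what the inequality above expresses.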
120
Mitigation of omitted variable bias:
• Inclusion of control variables in the regression equation
Definition 5.2: (Control variable)
A control variable is not the object of interest in a regression
analysis, but is rather a regressor included to hold constant factors that, if neglected, could lead the estimated causal effect of
interest to suffer from omitted variable bias.
Remarks:
• Up to now: OLS assumptions on Slide 18 treat all regressors
symmetrically
• Now: explicit distinction between regressors of interest and
control variables
121
Mathematical motivation:
• Consider a regression with two variables X1i (the regressor)
and X2i (the control variable)
Yi = β0 + β1 · X1i + β2 · X2i + ui
• Replace the first OLS assumption
E(ui|X1i, X2i) = 0
by the so-called conditional-mean-independence assumption
E(ui|X1i, X2i) = E(ui|X2i)    (5.1)
−→ β1 has a causal interpretation, but β2 does not
(see class for details)
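One way to see why β1 keeps its causal interpretation while β2 does not is the following sketch. It additionally assumes that the conditional mean of ui is linear in X2i; the coefficients γ0 and γ2 are hypothetical symbols introduced only for this illustration.

```latex
\begin{align*}
E(Y_i \mid X_{1i}, X_{2i})
  &= \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + E(u_i \mid X_{1i}, X_{2i}) \\
  &= \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + E(u_i \mid X_{2i})
     && \text{by (5.1)} \\
  &= (\beta_0 + \gamma_0) + \beta_1 X_{1i} + (\beta_2 + \gamma_2) X_{2i}
     && \text{if } E(u_i \mid X_{2i}) = \gamma_0 + \gamma_2 X_{2i}
\end{align*}
```

Under these assumptions the coefficient on X1i in the conditional mean is still β1, whereas the coefficient on X2i is β2 + γ2, a mix of the direct effect of X2i and the effect of the omitted factors correlated with it.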
122
Intuition of (5.1):
• The inclusion of the control variable X2i makes the regressor
X1i uncorrelated with ui so that the OLS estimator β̂1 can
estimate the causal effect on Yi of a change in X1i
• By contrast, the control variable X2i remains correlated with
ui so that its coefficient β2 is subject to omitted variable bias
and does not have a causal interpretation
• The control variable X2i is included because
- it controls for omitted factors that affect Yi and are correlated with X1i
- it might (but need not) have a causal effect itself
• When a control variable is used, it controls for both (1)
its own direct causal effect (if any) and (2) the effect of
correlated omitted factors
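The intuition can be checked numerically. The following is a minimal sketch under an assumed data-generating process (all names and numbers are hypothetical): an unobserved factor W affects Y, the control X2 is a noisy proxy for W with no causal effect of its own, and X1 depends on W only through X2, so that conditional mean independence holds.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical DGP satisfying conditional mean independence.
w = rng.normal(size=n)               # unobserved factor (part of the error term)
x2 = w + rng.normal(size=n)          # control variable: correlated with W
x1 = 0.5 * x2 + rng.normal(size=n)   # regressor of interest: related to W only via X2
y = 1.0 + 2.0 * x1 + 0.0 * x2 + 3.0 * w + rng.normal(size=n)
# True causal effects: beta1 = 2, beta2 = 0.


def ols(y, *regressors):
    """Return OLS coefficients (intercept first) via least squares."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


b_noctrl = ols(y, x1)      # control omitted
b_ctrl = ols(y, x1, x2)    # control included

print("beta1 without control:", b_noctrl[1])  # biased away from 2
print("beta1 with control   :", b_ctrl[1])    # close to the causal effect 2
print("coefficient on X2    :", b_ctrl[2])    # not 0: picks up the effect of W
```

In this setup E(W|X2) = 0.5 · X2, so the coefficient on X2 should settle near 3 · 0.5 = 1.5 although X2 has no causal effect at all, while the coefficient on X1 recovers the causal effect.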
123
Terminology:
• Complete phrasing:
The coefficient β1 on the regressor X1i is the causal effect on
Yi of a change in X1i using the control variable X2i both (1)
to hold constant the direct effect of X2i, and (2) to control
for factors correlated with X1i
• Conventional, less awkward phrasing:
The coefficient β1 on X1i is the effect on Yi controlling for
X2i
124
Example:
• Consider the student-performance dataset
• Consider the regression results of TEST_SCORE on STR and PCTEL
(see left panel on Slide 126)
• A potentially omitted factor is ‘outside learning opportunities’
• This factor is difficult to measure, but it is correlated with
the students’ economic background
−→ Include a measure of economic background to control
for omitted income-related determinants of TEST_SCORE like
‘outside learning opportunities’
• Such a control variable is MEAL_PCT, which measures the percentage
of students receiving a free or subsidized lunch
125
TEST_SCORE regression results with and without the control variable MEAL_PCT
Left panel (without MEAL_PCT):

Dependent Variable: TEST_SCORE
Method: Least Squares
Date: 19/05/12 Time: 17:52
Sample: 1 420
Included observations: 420
White heteroskedasticity-consistent standard errors & covariance

Variable     Coefficient    Std. Error    t-Statistic    Prob.
C            686.0322       8.728224      78.59930       0.0000
STR          -1.101296      0.432847      -2.544307      0.0113
PCTEL        -0.649777      0.031032      -20.93909      0.0000

R-squared             0.426431     Mean dependent var       654.1565
Adjusted R-squared    0.423680     S.D. dependent var       19.05335
S.E. of regression    14.46448     Akaike info criterion    8.188387
Sum squared resid     87245.29     Schwarz criterion        8.217246
Log likelihood        -1716.561    Hannan-Quinn criter.     8.199793
F-statistic           155.0136     Durbin-Watson stat       0.685575
Prob(F-statistic)     0.000000

Right panel (with MEAL_PCT):

Dependent Variable: TEST_SCORE
Method: Least Squares
Date: 19/05/12 Time: 18:00
Sample: 1 420
Included observations: 420
White heteroskedasticity-consistent standard errors & covariance

Variable     Coefficient    Std. Error    t-Statistic    Prob.
C            700.1500       5.568450      125.7352       0.0000
STR          -0.998309      0.270080      -3.696348      0.0002
PCTEL        -0.121573      0.032832      -3.702926      0.0002
MEAL_PCT     -0.547346      0.024107      -22.70460      0.0000

R-squared             0.774516     Mean dependent var       654.1565
Adjusted R-squared    0.772890     S.D. dependent var       19.05335
S.E. of regression    9.080079     Akaike info criterion    7.259521
Sum squared resid     34298.30     Schwarz criterion        7.298000
Log likelihood        -1520.499    Hannan-Quinn criter.     7.274730
F-statistic           476.3064     Durbin-Watson stat       1.437595
Prob(F-statistic)     0.000000
126
Example: [continued]
• Including the control variable MEAL_PCT
- does not substantially change the effect of STR on TEST_SCORE
(β̂1 changes from −1.1013 to −0.9983)
- changes the size (but not the sign) of the effect of PCTEL
on TEST_SCORE
(β̂2 changes from −0.6498 to −0.1216)
• A causal interpretation of the estimated coefficient β̂3 = −0.5473
is not reasonable: if it were causal, we could boost TEST_SCORE by
eliminating the reduced-price lunch programme
−→ Do not treat β3 as causal
127
Solutions to omitted variable bias:
• Distinction between two situations, namely (1) one in which
data on the omitted variable or on adequate control variables
are available, and (2) one in which data are not available
Situation #1:
• If data on the omitted variable are available, include the variable in the
regression equation
• If you have data on adequate control variables (with the hope
of achieving conditional mean independence), include these
variables in the regression equation
128
Trade-off:
• Adding a variable to a regression has both costs and benefits
• On the one hand, omitting the variable could result in omitted variable bias
• On the other hand, including a variable that is not a relevant
regressor (that is, when its population regression coefficient
is zero) reduces the precision of the estimators of the other
regression coefficients
(in the form of higher variances of the OLS estimators)
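The cost side of this trade-off can be made visible with a small Monte Carlo experiment. The sketch below rests on hypothetical choices (sample size, number of replications, and an irrelevant regressor Z with population coefficient zero that is strongly correlated with X1); it compares the sampling spread of the estimator of β1 with and without Z.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps, rho = 50, 2000, 0.9  # hypothetical design choices

b1_without_z = np.empty(reps)
b1_with_z = np.empty(reps)
for r in range(reps):
    x1 = rng.normal(size=n)
    # Z is irrelevant for Y (coefficient zero) but correlated with X1.
    z = rho * x1 + np.sqrt(1.0 - rho**2) * rng.normal(size=n)
    y = 1.0 + 2.0 * x1 + rng.normal(size=n)

    X_small = np.column_stack([np.ones(n), x1])
    X_big = np.column_stack([np.ones(n), x1, z])
    b1_without_z[r] = np.linalg.lstsq(X_small, y, rcond=None)[0][1]
    b1_with_z[r] = np.linalg.lstsq(X_big, y, rcond=None)[0][1]

print("mean of beta1 estimates:", b1_without_z.mean(), b1_with_z.mean())
print("std. dev. without Z    :", b1_without_z.std())
print("std. dev. with Z       :", b1_with_z.std())
```

Both estimators are centred on the true value 2, but the one from the regression that includes Z is noticeably more dispersed; the loss of precision grows with the correlation between Z and X1.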
129
Situation #2:
• If no data are available, there are three potential ways of
mitigating the omitted variable bias:
- The use of panel data
(see Section 6)
- The use of instrumental variables
(see Section 8)
- The conduct of randomized controlled experiments
(see Stock & Watson, 2011, Section 13)
130
Guidelines for deciding whether to include an additional variable:
• Be specific about the coefficients of interest
• Use a priori reasoning to identify the most important potential sources of omitted variable bias
−→ Consider a baseline specification and some questionable
variables
• Test whether additional questionable variables have nonzero
coefficients
• Provide full-disclosure, representative tabulations of your results so that others can see the effect of including questionable variables on the coefficients of interest
• Check whether your results change after including a questionable
control variable
131
5.1.2. Misspecification of the functional form of
the regression function
Definition 5.3: (Functional form misspecification)
Functional form misspecification arises when the functional form
of the estimated function differs from the (true) functional form
of the population regression function.
Two aspects of misspecification:
• If the (true) population regression function is nonlinear, but
we estimate a linear regression equation, then the estimators
of the coefficients suffer from omitted variable bias
• If the (true) population regression function is linear, but we
estimate a nonlinear regression equation, then we estimate
non-existent coefficients
132
Solutions to functional form misspecification:
• Detection of misspecification by using statistical specification
tests, for example the Regression Specification Error Test
(RESET)
(see Econometrics I)
• Plot the data and the estimated regression function
• Correct the misspecification by trying alternative functional
forms of the regression function
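A detection step along these lines can be sketched as follows. The example assumes a hypothetical quadratic population regression function and uses a simple RESET-type variant that adds only the squared fitted values to the linear regression and computes the homoskedasticity-only F-statistic by hand; the helper `fit` and all numbers are illustrative assumptions, and a full RESET implementation may use further powers of the fitted values.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x = rng.uniform(0.0, 3.0, size=n)
y = 1.0 + 2.0 * x + 1.5 * x**2 + rng.normal(size=n)  # hypothetical nonlinear truth


def fit(X, y):
    """Return OLS coefficients and the sum of squared residuals."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return coef, resid @ resid


# Step 1: estimate the (misspecified) linear regression.
X_lin = np.column_stack([np.ones(n), x])
coef_lin, ssr_lin = fit(X_lin, y)
y_hat = X_lin @ coef_lin

# Step 2: RESET-type auxiliary regression with the squared fitted values.
X_reset = np.column_stack([X_lin, y_hat**2])
_, ssr_reset = fit(X_reset, y)
q = 1  # number of restrictions tested
f_stat = ((ssr_lin - ssr_reset) / q) / (ssr_reset / (n - X_reset.shape[1]))

# Step 3: try an alternative functional form (quadratic in x).
X_quad = np.column_stack([np.ones(n), x, x**2])
coef_quad, ssr_quad = fit(X_quad, y)

print("RESET-type F-statistic:", f_stat)  # compare with 3.84 (5% level, q = 1, large n)
print("SSR linear / quadratic:", ssr_lin, ssr_quad)
print("quadratic coefficients:", coef_quad)
```

A large F-statistic signals that the linear specification misses systematic curvature; the quadratic specification then lowers the sum of squared residuals and recovers a coefficient on x² near the assumed value 1.5.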
133
5.1.3. Measurement error and errors-in-variables
bias
Definition 5.4: (Errors-in-variables bias)
Errors-in-variables bias in the OLS estimator arises when an independent variable is measured imprecisely. The bias depends
on the nature of the measurement error and persists even if the
sample size is large.
Sources of measurement errors:
• Wrong answer of a respondent to a survey question
(e.g. about her/his income)
• Typographical errors in data collected from computerized administrative records
134
Consequence:
• Consider a regression with a single regressor Xi
• Xi is imprecisely measured by X̃i
• The true population regression equation is
Yi = β0 + β1 · Xi + ui
= β0 + β1 · X̃i + [β1 · (Xi − X̃i) + ui]
= β0 + β1 · X̃i + vi
where vi = β1 · (Xi − X̃i) + ui
−→ The population regression equation in terms of X̃i has an
error term containing the measurement error Xi − X̃i
−→ If Xi − X̃i is correlated with X̃i, then the regressor X̃i will
be correlated with vi
135
Consequence: [continued]
−→ Violation of OLS assumption 1 on Slide 18
−→ β̂1 will be biased and inconsistent
Classical measurement error model:
• Suppose the measured value X̃i equals the unmeasured value
Xi plus a purely random component wi with expected value
0 and variance σ²w
• Suppose further that Corr(wi, Xi) = 0 and Corr(wi, ui) = 0
• It then follows that
plim β̂1 = [σ²X / (σ²X + σ²w)] · β1
(see class for details)
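The probability limit can be checked by simulation. This is a minimal sketch with hypothetical parameter values (σ²X = 1, σ²w = 0.5, β1 = 2), for which the formula predicts an attenuated limit of (1/1.5) · 2 ≈ 1.33.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
sigma_x2, sigma_w2 = 1.0, 0.5   # hypothetical variances of X and of the error w
beta0, beta1 = 1.0, 2.0         # hypothetical true coefficients

x = rng.normal(scale=np.sqrt(sigma_x2), size=n)   # true regressor (unobserved)
w = rng.normal(scale=np.sqrt(sigma_w2), size=n)   # classical measurement error
u = rng.normal(size=n)

x_tilde = x + w                 # what the researcher actually observes
y = beta0 + beta1 * x + u

X = np.column_stack([np.ones(n), x_tilde])
b = np.linalg.lstsq(X, y, rcond=None)[0]

predicted_plim = sigma_x2 / (sigma_x2 + sigma_w2) * beta1
print("OLS slope using the mismeasured regressor:", b[1])
print("limit predicted by the formula           :", predicted_plim)
```

Because the factor σ²X / (σ²X + σ²w) lies between 0 and 1, the estimate is pulled towards zero under the classical model, and more so the noisier the measurement.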
136
Solutions to errors-in-variables bias:
• Try to obtain an accurate measure of X
(if possible)
• Use instrumental variables
(see Section 8)
137
5.1.4. Missing data and sample selection
We consider three cases:
1. Data are missing completely at random
2. Data are missing based on a regressor
3. Data are missing because of a selection process that is related
to Y beyond depending on X
(sample selection bias)
138
Case #1:
• When the data are missing completely at random (for reasons
unrelated to the values of X or Y ) the effect is to reduce the
sample size but not introduce bias
Case #2:
• When the data are missing based on the value of a regressor,
the effect also is to reduce the sample size but not introduce
bias
139
Case #3:
• If the data are missing because of a selection process that
is related to the value of the dependent variable Y beyond
depending on the regressors X1, . . . , Xk then this selection
process can introduce correlation between the error term and
the regressors
−→ Bias in the OLS estimators that persists in large samples
Definition 5.5: (Sample selection bias)
Sample selection bias arises when a selection process influences
the availability of data and that process is related to the dependent variable, beyond depending on the regressors.
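The three cases can be compared in one simulation. The sketch below assumes a hypothetical model Yi = 1 + 2 · Xi + ui and three illustrative selection rules (keep a random half, keep observations with X > 0, keep observations with Y > 1); none of these rules comes from the lecture.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
x = rng.normal(size=n)
u = rng.normal(size=n)
y = 1.0 + 2.0 * x + u  # hypothetical population regression, beta1 = 2


def slope(mask):
    """OLS slope of Y on X using only the observations selected by mask."""
    X = np.column_stack([np.ones(mask.sum()), x[mask]])
    return np.linalg.lstsq(X, y[mask], rcond=None)[0][1]


keep_random = rng.random(n) < 0.5   # Case 1: missing completely at random
keep_on_x = x > 0.0                 # Case 2: missing based on the regressor
keep_on_y = y > 1.0                 # Case 3: selection related to Y itself

b_random = slope(keep_random)
b_on_x = slope(keep_on_x)
b_on_y = slope(keep_on_y)

print("Case 1 slope:", b_random)  # close to 2, smaller sample only
print("Case 2 slope:", b_on_x)    # close to 2, smaller sample only
print("Case 3 slope:", b_on_y)    # clearly below 2: sample selection bias
```

In Case 3 an observation with a low X stays in the sample only if its error term is large, which induces correlation between the regressor and the error term in the selected sample.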
140
Example:
• We consider the question as to whether stock mutual funds
outperform the market
• To this end, many studies compare future returns on mutual
funds that had high returns over the past year to future
returns on other funds and on the market as a whole
• Some databases include historical data on funds currently
available for purchase
• This approach implies that the most poorly performing funds
are omitted from the dataset because they went out of business or were merged into other funds
−→ Only the better funds survive to be in the data set
(survivorship bias)
141
Solutions to sample selection bias:
• Beyond the scope of this lecture
142
5.1.5. Simultaneous causality
Definition 5.6: (Simultaneous causality bias)
Simultaneous causality bias, also called simultaneous equation
bias, arises in a regression of Y on X when, in addition to the
causal link of interest from X to Y , there is a causal link from
Y to X.
Remark:
• The reverse causality makes X correlated with the error term
−→ Bias in the OLS estimators that persists in large samples
143
Example:
• Consider the following two regression equations that hold
simultaneously:
Yi = β0 + β1 · Xi + ui    (5.2)
Xi = γ0 + γ1 · Yi + vi    (5.3)
• From Eq. (5.2) it follows that if ui < 0 then Yi decreases
−→ If γ1 > 0, then a low value of Yi leads to a low value of Xi
in Eq. (5.3)
−→ If γ1 > 0, then Corr(Xi, ui) > 0 in Eq. (5.2)
(see class for details)
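A simulation of the two-equation system illustrates the argument. The sketch assumes hypothetical parameter values (β1 = γ1 = 0.5, so that γ1 · β1 < 1), independent errors ui and vi with unit variance, and generates Xi from the reduced form obtained by substituting Eq. (5.2) into Eq. (5.3).

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200_000
beta0, beta1 = 1.0, 0.5     # Eq. (5.2), hypothetical values
gamma0, gamma1 = 0.0, 0.5   # Eq. (5.3), hypothetical values

u = rng.normal(size=n)
v = rng.normal(size=n)

# Reduced form: substitute (5.2) into (5.3) and solve for X.
x = (gamma0 + gamma1 * beta0 + gamma1 * u + v) / (1.0 - gamma1 * beta1)
y = beta0 + beta1 * x + u   # Eq. (5.2)

X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
corr_xu = np.corrcoef(x, u)[0, 1]

print("Corr(X, u)         :", corr_xu)  # positive because gamma1 > 0
print("OLS estimate, beta1:", b[1])     # well above the true value 0.5
```

With these values Cov(Xi, ui) = 0.5/0.75 and Var(Xi) = 1.25/0.75², so the OLS slope should settle near 0.5 + 0.3 = 0.8 rather than at the causal effect 0.5.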
144
Solutions to simultaneous causality bias:
• Use instrumental variables
(see Section 8)
145
5.1.6. Sources of inconsistency of OLS standard
errors
OLS standard errors:
• Inconsistent standard errors pose a threat to internal validity
• Even if the OLS estimators are consistent and the sample
size is large, inconsistent standard errors will produce invalid
hypothesis tests and confidence intervals
Main reasons for inconsistent standard errors:
• Heteroskedasticity of the error terms ui
• Autocorrelation among the error terms ui
146
Remedies:
• Use heteroskedasticity-consistent standard errors
(see Section 3.1.1.)
• Use heteroskedasticity-autocorrelation consistent (HAC) standard errors
(see Section 3.1.2.)
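For the heteroskedasticity case, the difference between the two kinds of standard errors can be computed by hand. The sketch below assumes a hypothetical model in which Var(ui|Xi) = Xi² and uses the basic White (HC0) formula without any small-sample correction; software packages may apply a degrees-of-freedom adjustment, and HAC standard errors are not covered here.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20_000
x = rng.normal(size=n)
u = x * rng.normal(size=n)       # heteroskedastic errors: Var(u|x) = x**2
y = 1.0 + 2.0 * x + u            # hypothetical regression, beta1 = 2

X = np.column_stack([np.ones(n), x])
k = X.shape[1]
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

# Homoskedasticity-only covariance matrix: s^2 * (X'X)^-1
V_homo = (resid @ resid / (n - k)) * XtX_inv

# White (HC0) covariance matrix: (X'X)^-1 X' diag(e_i^2) X (X'X)^-1
meat = (X * resid[:, None] ** 2).T @ X
V_hc0 = XtX_inv @ meat @ XtX_inv

se_homo = np.sqrt(np.diag(V_homo))
se_hc0 = np.sqrt(np.diag(V_hc0))

print("slope estimate             :", beta_hat[1])
print("homoskedasticity-only SE   :", se_homo[1])
print("heteroskedasticity-robust SE:", se_hc0[1])
```

In this design the homoskedasticity-only standard error of the slope is far too small, so confidence intervals built from it would cover the true coefficient less often than their nominal level suggests.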
147
5.2. Summary
Five threats to internal validity:
1. Omitted variables
2. Functional form misspecification
3. Errors-in-variables
4. Sample selection
5. Simultaneous causality
148
Remarks:
• Each of these, if present, results in failure of the first OLS
assumption from Slide 18:
E(ui|X1i, . . . , Xki) ≠ 0
−→ Bias in the OLS estimators that persists in large samples
• Incorrect calculation of standard errors poses a further threat
to internal validity
• Applying this list of threats to a multiple regression study
provides a systematic way to assess the internal validity of
that study
149