5. Assessing studies based on multiple regression

Questions of this section:
• What makes a study using multiple regression (un)reliable?
• When does multiple regression provide a useful estimate of the causal effect under consideration?

Conceptual framework:
• Internal and external validity

114

Definition 5.1: (Internal and external validity)
A statistical analysis is said to have internal validity if the statistical inferences about causal effects are valid for the population being studied. The analysis is said to have external validity if its inferences and conclusions can be generalized from the population and setting studied to other populations and settings.

Terminology:
• The population studied is the population of entities (people, companies, ...) from which the sample was drawn
• The population of interest is the population of entities to which the causal inferences from the study are to be applied
• By setting we mean the institutional, legal, social, and economic environment

115

Threats to internal validity:
• The estimator of a causal effect should be unbiased and consistent
• Hypothesis tests should have the desired significance level
• Confidence intervals should have the desired confidence levels
−→ Requirements for internal validity are that the OLS estimator is unbiased and consistent and that standard errors are computed in a way that gives confidence intervals the desired confidence levels
• These requirements might not be met for various reasons
−→ These are threats to internal validity that lead to failures of our OLS assumptions from Slide 18

116

Threats to external validity:
• Differences between the population being studied and the population of interest
Example: Medical studies that use animal populations like mice (the population being studied), but aim at transferring the results to human populations (the population of interest)
• Even if the population studied and the population of interest are identical, the study results may not be generalizable due to differences in the settings
Example: The effect of an antidrinking advertising campaign on the drinking behaviour of a group of first-term students might differ between two universities if the legal penalties for drinking differ at those universities

117

Threats to external validity: [continued]
• Important questions with respect to external validity are: How to assess the external validity of a study? How to design an externally valid study?
• Both issues require specific knowledge of the populations and settings being studied and those of interest
• A rigorous treatment of both issues is beyond the scope of this lecture (cf. Shadish et al., 2002, for details)
−→ We focus on aspects of internal validity

118

5.1. Threats to internal validity of multiple regression analysis

Objectives:
• Survey of five reasons why the OLS estimator of a multiple regression coefficient may be biased even in large samples (Sections 5.1.1.–5.1.5.)
• All five sources of bias arise because the regressor is correlated with the error term in the population regression
−→ Violation of the first OLS assumption from Slide 18
• What can be done to reduce this bias?

119

5.1.1. Omitted variable bias

Omitted variable bias:
• If an omitted variable is a determinant of Yi and is correlated with at least one of the regressors, then the OLS estimator of at least one of the coefficients will have omitted variable bias (see Definition 2.5 on Slide 27)
• This bias persists even in large samples
−→ The OLS estimator(s) is (are) inconsistent
• Mathematically, under omitted variable bias at least one of the regressors is correlated with the error term ui, implying that E(ui | X1i, ..., Xki) ≠ 0

120

Mitigation of omitted variable bias:
• Inclusion of control variables in the regression equation

Definition 5.2: (Control variable)
A control variable is not the object of interest in a regression analysis, but is rather a regressor included to hold constant factors that, if neglected, could lead the estimated causal effect of interest to suffer from omitted variable bias.

Remarks:
• Up to now: the OLS assumptions on Slide 18 treat all regressors symmetrically
• Now: explicit distinction between regressors of interest and control variables

121

Mathematical motivation:
• Consider a regression with two variables X1i (the regressor) and X2i (the control variable)
Yi = β0 + β1 · X1i + β2 · X2i + ui
• Replace the first OLS assumption E(ui | X1i, X2i) = 0 by the so-called conditional-mean-independence assumption
E(ui | X1i, X2i) = E(ui | X2i)    (5.1)
−→ β1 has a causal interpretation, but β2 does not (see class for details)

122

Intuition of (5.1):
• The inclusion of the control variable X2i makes the regressor X1i uncorrelated with ui, so that the OLS estimator β̂1 can estimate the causal effect on Yi of a change in X1i
• By contrast, the control variable X2i remains correlated with ui, so that its coefficient β2 is subject to omitted variable bias and does not have a causal interpretation
• The control variable X2i is included because it controls for omitted factors that affect Yi and are correlated with X1i; it might (but need not) have a causal effect itself
• When a control variable is used, it controls for both (1) its own direct causal effect (if any) and (2) the effect of correlated omitted factors

123

Terminology:
• Complete phrasing: The coefficient β1 on the regressor X1i is the causal effect on Yi of a change in X1i, using the control variable X2i both (1) to hold constant the direct effect of X2i and (2) to control for factors correlated with X1i
• Conventional, less awkward phrasing: The coefficient β1 on X1i is the effect on Yi controlling for X2i

124
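The mechanics of slides 120 to 124 can be sketched in a short simulation (all numbers are assumed for illustration; this is not part of the lecture's example): omitting a determinant of Yi that is correlated with X1i biases β̂1, while including it as a control variable recovers the causal coefficient.

```python
import numpy as np

# Illustrative simulation (assumed parameter values): omitted variable
# bias and its mitigation by a control variable.
rng = np.random.default_rng(0)
n = 100_000

# Data-generating process: x2 is a determinant of y that is correlated
# with the regressor of interest x1.
x2 = rng.normal(size=n)                        # control variable / omitted factor
x1 = 0.8 * x2 + rng.normal(scale=0.6, size=n)  # regressor of interest, Var(x1) = 1
u = rng.normal(size=n)
y = 2.0 + 1.5 * x1 - 0.5 * x2 + u              # true causal effect of x1 is 1.5

def ols(y, *regressors):
    """OLS with intercept; returns the coefficient vector (constant first)."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

b_short = ols(y, x1)      # x2 omitted -> omitted variable bias
b_long = ols(y, x1, x2)   # x2 included as control -> consistent for beta1

# Large-sample bias of the short regression:
# beta2 * Cov(x1, x2) / Var(x1) = -0.5 * 0.8 / 1.0 = -0.4
print(f"beta1 without control: {b_short[1]:.3f}")  # approx 1.5 - 0.4 = 1.1
print(f"beta1 with control:    {b_long[1]:.3f}")   # approx 1.5
```

Under conditional mean independence the coefficient on the control x2 need not be causal; only the coefficient on x1 is.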
Example:
• Consider the student-performance dataset
• Consider the regression results of TEST_SCORE on STR and PCTEL (see the first regression output on Slide 126)
• A potentially omitted factor could be 'outside learning opportunities'
• The factor 'outside learning opportunities' is difficult to measure, but correlated with the students' economic background
−→ Include a measure of economic background to control for omitted income-related determinants of TEST_SCORE like 'outside learning opportunities'
• Such a control variable is MEAL_PCT, measuring the percentage of students receiving a free or subsidized lunch

125

TEST_SCORE regression results with and without the control variable MEAL_PCT

Without MEAL_PCT:
Dependent Variable: TEST_SCORE
Method: Least Squares | Date: 19/05/12 | Time: 17:52 | Sample: 1 420 | Included observations: 420
White heteroskedasticity-consistent standard errors & covariance

Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           686.0322      8.728224     78.59930      0.0000
STR         -1.101296     0.432847     -2.544307     0.0113
PCTEL       -0.649777     0.031032     -20.93909     0.0000

R-squared 0.426431 | Adjusted R-squared 0.423680 | S.E. of regression 14.46448 | Sum squared resid 87245.29 | Log likelihood -1716.561 | F-statistic 155.0136 | Prob(F-statistic) 0.000000 | Mean dependent var 654.1565 | S.D. dependent var 19.05335 | Akaike info criterion 8.188387 | Schwarz criterion 8.217246 | Hannan-Quinn criter. 8.199793 | Durbin-Watson stat 0.685575

With MEAL_PCT:
Dependent Variable: TEST_SCORE
Method: Least Squares | Date: 19/05/12 | Time: 18:00 | Sample: 1 420 | Included observations: 420
White heteroskedasticity-consistent standard errors & covariance

Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           700.1500      5.568450     125.7352      0.0000
STR         -0.998309     0.270080     -3.696348     0.0002
PCTEL       -0.121573     0.032832     -3.702926     0.0002
MEAL_PCT    -0.547346     0.024107     -22.70460     0.0000

R-squared 0.774516 | Adjusted R-squared 0.772890 | S.E. of regression 9.080079 | Sum squared resid 34298.30 | Log likelihood -1520.499 | F-statistic 476.3064 | Prob(F-statistic) 0.000000 | Mean dependent var 654.1565 | S.D. dependent var 19.05335 | Akaike info criterion 7.259521 | Schwarz criterion 7.298000 | Hannan-Quinn criter. 7.274730 | Durbin-Watson stat 1.437595

126

Example: [continued]
• Including the control variable MEAL_PCT
– does not substantially change the effect of STR on TEST_SCORE (β̂1 changes from −1.1013 to −0.9983)
– changes the size (but not the sign) of the effect of PCTEL on TEST_SCORE (β̂2 changes from −0.6498 to −0.1216)
• The estimated coefficient β̂3 = −0.5473 is not plausible as a causal effect, since if it were, we could boost TEST_SCORE by eliminating the reduced-price lunch programme
−→ Do not treat β3 as causal

127

Solutions to omitted variable bias:
• Distinction between two situations, namely (1) one in which data on the omitted variable or on adequate control variables are available, and (2) one in which such data are not available

Situation #1:
• If data on the omitted variable are available, include it in the regression equation
• If you have data on adequate control variables (with the hope of achieving conditional mean independence), include these variables in the regression equation

128

Trade-off:
• Adding a variable to a regression has both costs and benefits
• On the one hand, omitting the variable could result in omitted variable bias
• On the other hand, including a variable that is not a relevant regressor (that is, whose population regression coefficient is zero) reduces the precision of the estimators of the other regression coefficients (in the form of higher variances of the OLS estimators)

129

Situation #2:
• If no data are available, there are three potential ways of mitigating the omitted variable bias:
– The use of panel data (see Section 6)
– The use of instrumental variables (see Section 8)
– The conduct of randomized controlled experiments (see Stock & Watson, 2011, Section 13)

130

Guidelines for deciding whether to include an additional variable:
• Be specific about the coefficients of interest
• Use a priori reasoning to identify the most important potential
sources of omitted variable bias
−→ Consider a baseline specification and some questionable variables
• Test whether the additional questionable variables have nonzero coefficients
• Provide full-disclosure representative tabulations of your results so that others can see the effect of including the questionable variables on the coefficients of interest
• Check whether your results change after including a questionable control variable

131

5.1.2. Misspecification of the functional form of the regression function

Definition 5.3: (Functional form misspecification)
Functional form misspecification arises when the functional form of the estimated regression function differs from the (true) functional form of the population regression function.

Two aspects of misspecification:
• If the (true) population regression function is nonlinear, but we estimate a linear regression equation, then the estimators of the coefficients suffer from omitted variable bias
• If the (true) population regression function is linear, but we estimate a nonlinear regression equation, then we estimate non-existent coefficients

132

Solutions to functional form misspecification:
• Detect misspecification by using statistical specification tests, for example the Regression Specification Error Test (RESET) (see Econometrics I)
• Plot the data and the estimated regression function
• Correct the misspecification by trying alternative functional forms of the regression function

133

5.1.3. Measurement error and errors-in-variables bias

Definition 5.4: (Errors-in-variables bias)
Errors-in-variables bias in the OLS estimator arises when an independent variable is measured imprecisely. The bias depends on the nature of the measurement error and persists even if the sample size is large.

Sources of measurement errors:
• Wrong answer of a respondent to a survey question (e.g. about her/his income)
• Typographical errors in data collected from computerized administrative records

134
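Before the bias is derived formally on the next slides, a small simulation (all numbers assumed for illustration; not part of the lecture) shows the effect described in Definition 5.4: adding purely random noise to the regressor pulls the OLS slope toward zero.

```python
import numpy as np

# Illustrative simulation (assumed numbers): classical measurement error
# in the regressor attenuates the OLS slope estimate.
rng = np.random.default_rng(4)
n = 200_000

x = rng.normal(size=n)              # true regressor, Var(X) = 1
w = rng.normal(size=n)              # measurement error, Var(w) = 1
x_tilde = x + w                     # mismeasured regressor
y = 2.0 + 1.5 * x + rng.normal(size=n)   # true slope is 1.5

def ols_slope(x, y):
    """Slope of an OLS regression of y on x (with intercept)."""
    X = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

# Attenuation factor Var(X) / (Var(X) + Var(w)) = 1/2 here,
# so the slope estimated from the noisy regressor is about 0.75.
print(f"slope with true X:  {ols_slope(x, y):.3f}")        # ~1.5
print(f"slope with noisy X: {ols_slope(x_tilde, y):.3f}")  # ~0.75
```

The factor 1/2 is exactly the large-sample attenuation factor of the classical measurement error model.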
Consequence:
• Consider a regression with a single regressor Xi
• Xi is imprecisely measured by X̃i
• The true population regression equation is
Yi = β0 + β1 · Xi + ui
   = β0 + β1 · X̃i + β1 · (Xi − X̃i) + ui
   = β0 + β1 · X̃i + vi
where vi = β1 · (Xi − X̃i) + ui
−→ The population regression equation in terms of X̃i has an error term vi containing the measurement error Xi − X̃i
−→ If Xi − X̃i is correlated with X̃i, then the regressor X̃i will be correlated with vi

135

Consequence: [continued]
−→ Violation of OLS assumption 1 on Slide 18
−→ β̂1 will be biased and inconsistent

Classical measurement error model:
• Suppose the measured value X̃i equals the unmeasured value Xi plus a purely random component wi with expected value 0 and variance σw²
• Suppose further that Corr(wi, Xi) = 0 and Corr(wi, ui) = 0
• It then follows that
plim β̂1 = σX² / (σX² + σw²) · β1
(see class for details)

136

Solutions to errors-in-variables bias:
• Try to obtain an accurate measure of X (if possible)
• Use instrumental variables (see Section 8)

137

5.1.4. Missing data and sample selection

We consider three cases:
1. Data are missing completely at random
2. Data are missing based on a regressor
3. Data are missing because of a selection process that is related to Y beyond depending on X (sample selection bias)

138

Case #1:
• When the data are missing completely at random (for reasons unrelated to the values of X or Y), the effect is to reduce the sample size but not to introduce bias

Case #2:
• When the data are missing based on the value of a regressor, the effect also is to reduce the sample size but not to introduce bias

139

Case #3:
• If the data are missing because of a selection process that is related to the value of the dependent variable Y beyond depending on the regressors X1, ..., Xk, then this selection process can introduce correlation between the error term and the regressors
−→ Bias in the OLS estimators that persists in large samples
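The contrast between the three cases can be illustrated with a short simulation (illustrative numbers, not from the lecture): dropping observations based on a regressor leaves the slope estimate intact, while dropping them based on the dependent variable biases it.

```python
import numpy as np

# Illustrative simulation (assumed numbers): missing data and selection.
rng = np.random.default_rng(1)
n = 200_000

x = rng.normal(size=n)
u = rng.normal(size=n)
y = 2.0 + 1.5 * x + u          # true slope is 1.5

def ols_slope(x, y):
    """Slope of an OLS regression of y on x (with intercept)."""
    X = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

keep_x = x > 0     # Case #2: selection based on the regressor only
keep_y = y > 2.0   # Case #3: selection based on the dependent variable

print(f"full sample:        {ols_slope(x, y):.3f}")                  # ~1.5
print(f"selected on X only: {ols_slope(x[keep_x], y[keep_x]):.3f}")  # ~1.5
print(f"selected on Y:      {ols_slope(x[keep_y], y[keep_y]):.3f}")  # well below 1.5
```

In the third case, observations with low x survive only when u is large, so x and u are negatively correlated in the selected sample and the slope is biased toward zero.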
Definition 5.5: (Sample selection bias)
Sample selection bias arises when a selection process influences the availability of data and that process is related to the dependent variable, beyond depending on the regressors.

140

Example:
• We consider the question as to whether stock mutual funds outperform the market
• To this end, many studies compare future returns on mutual funds that had high returns over the past year to future returns on other funds and on the market as a whole
• Some databases include historical data only on funds currently available for purchase
• This implies that the most poorly performing funds are omitted from the dataset because they went out of business or were merged into other funds
−→ Only the better funds survive to be in the dataset (survivorship bias)

141

Solutions to sample selection bias:
• Beyond the scope of this lecture

142

5.1.5. Simultaneous causality

Definition 5.6: (Simultaneous causality bias)
Simultaneous causality bias, also called simultaneous equations bias, arises in a regression of Y on X when, in addition to the causal link of interest from X to Y, there is a causal link from Y to X.

Remark:
• The reverse causality makes X correlated with the error term
−→ Bias in the OLS estimators that persists in large samples

143

Example:
• Consider the following two regression equations that hold simultaneously:
Yi = β0 + β1 · Xi + ui    (5.2)
Xi = γ0 + γ1 · Yi + vi    (5.3)
• From Eq. (5.2) it follows that if ui < 0, then Yi decreases
−→ If γ1 > 0, then a low value of Yi leads to a low value of Xi in Eq. (5.3)
−→ If γ1 > 0, then Corr(Xi, ui) > 0 in Eq. (5.2) (see class for details)

144

Solutions to simultaneous causality bias:
• Use instrumental variables (see Section 8)

145
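The argument on Slide 144 can be checked numerically (parameter values assumed for illustration; not part of the lecture): solving equations (5.2) and (5.3) for their reduced form, simulating the system, and running OLS shows the upward bias implied by γ1 > 0.

```python
import numpy as np

# Illustrative simulation (assumed parameter values) of simultaneous
# causality bias in the system (5.2)-(5.3).
b0, b1 = 2.0, 0.8   # Y_i = b0 + b1*X_i + u_i   (5.2)
g0, g1 = 1.0, 0.5   # X_i = g0 + g1*Y_i + v_i   (5.3)

rng = np.random.default_rng(2)
n = 200_000
u = rng.normal(size=n)
v = rng.normal(size=n)

# Reduced form: substitute (5.2) into (5.3) and solve for X.
x = (g0 + g1 * b0 + g1 * u + v) / (1.0 - g1 * b1)
y = b0 + b1 * x + u

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Since g1 > 0, Corr(X_i, u_i) > 0 and OLS overestimates b1. Here
# plim beta1_hat = b1 + g1*(1 - g1*b1)/(g1**2 + 1) = 0.8 + 0.24 = 1.04
# (with Var(u) = Var(v) = 1).
print(f"true beta1: {b1}, OLS estimate: {beta[1]:.3f}")  # biased upward, ~1.04
```
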
5.1.6. Sources of inconsistency of OLS standard errors

OLS standard errors:
• Inconsistent standard errors pose a threat to internal validity
• Even if the OLS estimators are consistent and the sample size is large, inconsistent standard errors will produce invalid hypothesis tests and confidence intervals

Main reasons for inconsistent standard errors:
• Heteroskedasticity of the error terms ui
• Autocorrelation among the error terms ui

146

Remedies:
• Use heteroskedasticity-consistent standard errors (see Section 3.1.1.)
• Use heteroskedasticity- and autocorrelation-consistent (HAC) standard errors (see Section 3.1.2.)

147

5.2. Summary

Five threats to internal validity:
1. Omitted variables
2. Functional form misspecification
3. Errors-in-variables
4. Sample selection
5. Simultaneous causality

148

Remarks:
• Each of these threats, if present, results in a failure of the first OLS assumption from Slide 18: E(ui | X1i, ..., Xki) ≠ 0
−→ Bias in the OLS estimators that persists in large samples
• Incorrect calculation of standard errors poses a further threat to internal validity
• Applying this list of threats to a multiple regression study provides a systematic way to assess the internal validity of that study

149
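The heteroskedasticity-consistent standard errors mentioned in Section 5.1.6 (and used in the outputs on Slide 126) can be computed directly from the sandwich formula. A minimal sketch with assumed numbers, using the HC0 variant of the White estimator:

```python
import numpy as np

# Minimal sketch (assumed numbers): White/HC0 heteroskedasticity-consistent
# standard errors versus the classical homoskedasticity-only formula.
rng = np.random.default_rng(3)
n = 50_000

x = rng.normal(size=n)
u = rng.normal(size=n) * np.sqrt(0.5 + x**2)  # error variance depends on x
y = 1.0 + 2.0 * x + u

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
e = y - X @ beta                              # OLS residuals

# Classical variance estimator: s^2 * (X'X)^-1
s2 = e @ e / (n - 2)
se_classical = np.sqrt(np.diag(s2 * XtX_inv))

# White/HC0 sandwich: (X'X)^-1 * (sum_i e_i^2 * x_i x_i') * (X'X)^-1
meat = (X * e[:, None] ** 2).T @ X
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print(f"slope SE, classical: {se_classical[1]:.4f}")
print(f"slope SE, robust:    {se_robust[1]:.4f}")  # larger here: classical SE is invalid
```

With this data-generating process the classical standard error understates the sampling uncertainty, so tests based on it would reject too often; the robust version restores valid inference.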