Stat 521: Notes 6. Measurement Error, Instrumental Variables

Reading: Wooldridge, Chapter 5.1, 18.1-18.2.

I. Note on Homework 1

In question 6, I asked you to consider the effect on mean earnings ($Y$) of a policy in which everyone who has less than a high school education instead got a high school education. The model for earnings with constant variance is

$$\log Y_i = \beta_0 + \beta_1 \mathrm{educ}_i + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2),$$

which implies

$$E(Y_i \mid \mathrm{educ}_i) = \exp\left(\beta_0 + \beta_1 \mathrm{educ}_i + \frac{\sigma^2}{2}\right).$$

If there is nonconstant variance, $\varepsilon_i \sim N(0, \sigma_i^2)$, then

$$E(Y_i \mid \mathrm{educ}_i) = \exp\left(\beta_0 + \beta_1 \mathrm{educ}_i + \frac{\sigma_i^2}{2}\right)$$

and we need a model for the distribution of $\varepsilon_i$ (in particular, for how $\sigma_i^2$ varies across individuals) to determine the effect of the policy on mean earnings. We will study this later in the course.

II. Measurement Error (Chapter 4.4)

The Classical Measurement Error Model: Suppose we would like to estimate the regression

$$Y_i = \beta_0 + \beta_1 X_{i1}^* + \varepsilon_i,$$

but we only measure $X_{i1}$, which contains measurement error:

$$X_{i1} = X_{i1}^* + u_i.$$

The classical measurement error model assumes that the measurement error $u_i$ is independent of the true $X_{i1}^*$ and of $\varepsilon_i$: the measurement error arises completely at random.

Let the regression of $Y$ on $X_{i1}$ be

$$Y_i = \gamma_0 + \gamma_1 X_{i1} + \nu_i. \quad (1.1)$$

We showed that

$$\gamma_1 = \beta_1 \frac{\mathrm{Cov}(X_{i1}^*, X_{i1})}{\mathrm{Var}(X_{i1})} = \beta_1 \frac{\sigma_{X^*}^2}{\sigma_{X^*}^2 + \sigma_u^2}.$$

Summary: Classical measurement error leads to a bias towards zero (this is called the attenuation bias). The quantity

$$\frac{\sigma_{X^*}^2}{\sigma_{X^*}^2 + \sigma_u^2},$$

which attenuates the regression coefficient, is called the reliability ratio.

The following table shows the reliability ratios of a few key variables in economics:

    Variable               Reliability Ratio   Data Set                 Reference
    Log Annual Hours       0.63                PSID Validation Study    Bound et al. (1994)
    Log Annual Earnings    0.76                PSID Validation Study    Duncan and Hill (1986)
    Years of Education     0.90                Twinsburg Twins Study    Ashenfelter and Krueger (1994)

When the interest is in $\beta_1$ in the multiple regression

$$Y_i = \beta_0 + \beta_1 X_{i1}^* + \beta_2 X_{i2} + \cdots + \beta_K X_{iK} + \varepsilon_i$$

and the classical measurement error model holds for $X_{i1}^*$, then $\gamma_1$ in the model

$$Y_i = \gamma_0 + \gamma_1 X_{i1} + \gamma_2 X_{i2} + \cdots + \gamma_K X_{iK} + \nu_i$$

satisfies

$$\gamma_1 = \beta_1 \frac{\sigma_{r_{X^*}}^2}{\sigma_{r_{X^*}}^2 + \sigma_u^2},$$

where $r_{X^*}$ is the error from the regression of $X_{i1}^*$ on $X_{i2}, \ldots, X_{iK}$ and $\sigma_{r_{X^*}}^2$ is its variance.
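The attenuation result can be checked with a small simulation. The following is a minimal sketch, not course data: the sample size, the coefficients, and the standard deviations `sigma_xstar` and `sigma_u` are all illustrative assumptions. With these settings the reliability ratio is 1 / (1 + 0.75^2) = 0.64, so the slope on the mismeasured regressor should be close to 0.64 times the true slope.

```r
# Simulation sketch of attenuation bias under classical measurement error.
# All numeric settings are illustrative assumptions, not course data.
set.seed(521)
n <- 100000
beta0 <- 1
beta1 <- 2
sigma_xstar <- 1     # sd of the true regressor
sigma_u <- 0.75      # sd of the measurement error

xstar <- rnorm(n, mean = 0, sd = sigma_xstar)  # true regressor
u <- rnorm(n, mean = 0, sd = sigma_u)          # error, independent of xstar and eps
eps <- rnorm(n)                                # regression error
y <- beta0 + beta1 * xstar + eps
x <- xstar + u                                 # what we actually measure

# Reliability ratio implied by the assumed variances
reliability <- sigma_xstar^2 / (sigma_xstar^2 + sigma_u^2)

slope_true <- unname(coef(lm(y ~ xstar))[2])   # regression on the true regressor
slope_obs <- unname(coef(lm(y ~ x))[2])        # regression on the mismeasured regressor

print(c(slope_true = slope_true,
        slope_obs = slope_obs,
        predicted = beta1 * reliability))
```

Increasing `sigma_u` lowers the reliability ratio and pulls `slope_obs` further towards zero, while `slope_true` is unaffected.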
Although classical measurement error is a good starting point, in some important situations classical measurement error is not possible. For example, for a binary variable, classical measurement error cannot hold because there are only two ways the variable can be mismeasured: a true 0 can be measured as a 1 and a true 1 can be measured as a 0. Consequently the error depends on the true value. Aigner (1972) shows that random misclassification of a binary variable still biases the regression coefficients towards zero. But in general, when measurement error is not classical, the ratio of the estimated coefficient to the true coefficient could be greater or less than one. We will discuss approaches to correcting for measurement error later.

III. Causal Effects and Potential Outcomes Framework (Section 18.2, Wooldridge).

In many economic studies, we are interested in estimating the causal effect of a policy or decision on an individual.

Example: Does military service raise or lower earnings? We will consider this question specifically for men in the World War II era using data from the 1980 Census.

For a specific man, military service in WWII might interrupt education or career and so cause that man to earn less than he would have if he had not served. Alternatively, for a specific man, if the labor market favored veterans, or if various veterans' programs conferred advantages, then military service in WWII might cause that man to earn more than he would have had he not served.

The familiar distinction between association and causation applies here: these questions concern the effects caused by military service for the same man; they do not simply compare the men who happened to be WWII veterans with those who happened not to be. Men may be rejected for military service for reasons of ill health or criminal behavior, and others may legally avoid or illegally evade military service, so we would expect veterans and nonveterans of WWII to differ in earnings quite apart from any effect caused by military service.
Potential Outcomes Framework:

$Y_i^{(1)}$ = earnings in 1980 man $i$ would have if he had served in the military during WWII
$Y_i^{(0)}$ = earnings in 1980 man $i$ would have if he did not serve in the military during WWII

For man $i$, the causal effect of military service on 1980 earnings is $Y_i^{(1)} - Y_i^{(0)}$. However, we can only observe one of $Y_i^{(1)}$ or $Y_i^{(0)}$.

Let $D_i$ equal 1 or 0 according to whether man $i$ served in the military or did not serve in the military. We observe earnings $Y_i = Y_i^{(D_i)}$ for man $i$.

Consider a model in which military service has the same causal effect for everybody:

$$Y_i^{(1)} = Y_i^{(0)} + \beta.$$

The causal effect of military service on earnings is $\beta$. The observed earnings are

$$Y_i = Y_i^{(0)} + \beta D_i.$$

Let $X_i$ be a vector of covariates, e.g., race and state of birth, and suppose $E(Y_i^{(0)} \mid X_i) = \alpha' X_i$, i.e.,

$$Y_i^{(0)} = \alpha' X_i + \eta_i, \quad E(\eta_i \mid X_i) = 0.$$

Then

$$Y_i = \beta D_i + \alpha' X_i + \eta_i = \beta D_i + \alpha' X_i + E(\eta_i \mid D_i, X_i) + \varepsilon_i,$$

where $\varepsilon_i = \eta_i - E(\eta_i \mid D_i, X_i)$.

Consider running the regression of $Y_i$ on $D_i, X_i$:

$$E(Y_i \mid D_i, X_i) = \tilde{\beta} D_i + \tilde{\alpha}' X_i.$$

Suppose $E(\eta_i \mid D_i, X_i) = \lambda D_i$. [Recall: $E(\eta_i \mid X_i) = 0$; if there is no interaction between $D_i$ and $X_i$ in affecting $\eta_i$, then it follows that $E(\eta_i \mid D_i, X_i)$ does not depend on $X_i$.] Using the omitted variable bias formula with $E(\eta_i \mid D_i, X_i)$ as the omitted variable, we have

$$\tilde{\beta} = \beta + \lambda.$$

When does $\lambda = 0$? If $D_i$ is independent of $Y_i^{(0)}$ conditional on $X_i$. This is called selection on observables. It means that all the variables that determine the selection of $D_i$ are recorded in $X_i$. If $D_i$ is randomly assigned as in a randomized experiment, then selection on observables is assured.

Summary: We can estimate the causal effect of $D$ on $Y$ by regression if we record all the covariates $X$ which affect the selection of $D$.

The Census data do not contain detailed personal characteristics beyond race, sex, place of birth and education, so selection on observables is unlikely to be satisfied.
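To see what selection on observables buys us, here is a minimal simulation sketch. Everything in it is an illustrative assumption rather than the Census data: the constant effect `beta`, the recorded covariate `x`, and a hypothetical unrecorded trait `a` that affects earnings. In the first case treatment depends only on `x`, so regression on the treatment and `x` should recover the causal effect; in the second case treatment also depends on `a`, and the regression coefficient picks up the selection term as well.

```r
# Simulation sketch: regression under selection on observables vs. unobservables.
# All names and numeric settings are illustrative assumptions.
set.seed(521)
n <- 100000
beta <- 1000                      # constant causal effect of treatment
x <- rbinom(n, 1, 0.5)            # recorded covariate
a <- rnorm(n)                     # unrecorded trait affecting earnings

# Potential outcome without treatment
y0 <- 20000 + 2000 * x + 3000 * a + rnorm(n, sd = 5000)

# Case 1: treatment depends on the recorded covariate only
d_obs <- rbinom(n, 1, plogis(-0.5 + x))
y_obs <- y0 + beta * d_obs

# Case 2: treatment also depends on the unrecorded trait
d_sel <- rbinom(n, 1, plogis(-0.5 + x + a))
y_sel <- y0 + beta * d_sel

b_obs <- unname(coef(lm(y_obs ~ d_obs + x))["d_obs"])  # close to beta
b_sel <- unname(coef(lm(y_sel ~ d_sel + x))["d_sel"])  # beta plus selection bias

print(c(truth = beta, selection_on_observables = b_obs, selection_on_unobservables = b_sel))
```

In the second case, adding `a` to the regression would remove the bias, which is exactly what we cannot do when the trait is not recorded.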
A strategy for estimating a causal effect when selection on observables does not hold is to look for an instrumental variable.

Instrumental variable (IV): A variable $Z_i$ is an instrumental variable if it satisfies the following three conditions:

(1) Relevance: $Z_i$ is correlated with $D_i$ conditional on $X_i$.

(2) Random assignment: $Z_i$ is independent of the potential outcomes conditional on $X_i$, i.e., the instrument is as good as randomly assigned.

(3) Exclusion restriction: $Z_i$ has no direct effect on $Y$ other than through its influence on $D_i$. To explain this assumption, we allow the potential outcomes to depend on both $Z$ and $D$: $Y_i^{(z,d)}$ = the potential outcome that would be observed for unit $i$ if the instrumental variable $Z$ was set to $z$ and the treatment variable $D$ was set to $d$. The exclusion restriction says that the potential outcomes depend only on $d$: $Y_i^{(z,d)} = Y_i^{(z',d)}$ for all possible $z, z', d$.

How do we make use of instrumental variables? Let's first consider the case without any other covariates $X_i$. Consider the additive linear constant effect (ALICE) model:

$$Y_i^{(z,d)} = Y_i^{(0,0)} + \beta d + \delta z,$$

which leads to the observed data model

$$Y_i = Y_i^{(0,0)} + \beta D_i + \delta Z_i.$$

The exclusion restriction assumption says that $\delta = 0$. The random assignment assumption says that $Y_i^{(0,0)}$ is independent of $Z_i$. Thus,

$$E(Y_i \mid Z_i) = E(Y_i^{(0,0)} \mid Z_i) + \beta E(D_i \mid Z_i) + \delta Z_i = E(Y_i^{(0,0)}) + \beta E(D_i \mid Z_i).$$

If we knew $E(D_i \mid Z_i)$, then we could estimate $\beta$ by least squares regression of $Y_i$ on $E(D_i \mid Z_i)$. Although we don't know $E(D_i \mid Z_i)$, we can estimate it.

Two stage least squares estimate of $\beta$:

1. First Stage: Regress $D_i$ on $Z_i$ using least squares to obtain $\hat{E}(D_i \mid Z_i)$.
2. Second Stage: Regress $Y_i$ on $\hat{E}(D_i \mid Z_i)$. The slope on $\hat{E}(D_i \mid Z_i)$ is the two stage least squares instrumental variable estimate of $\beta$.
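The two stages can be tried out on simulated data. This is a minimal sketch under stated assumptions, not the Census analysis: the effect `beta`, the binary instrument `z`, and the hypothetical unobserved confounder `a` are all made up for illustration, and `z` is generated so that it has no direct effect on the outcome. The sketch also computes the ratio of the difference in mean outcomes to the difference in mean treatment rates between the two instrument groups, which for a binary instrument with no covariates gives the same number as the second-stage slope.

```r
# Simulation sketch of two stage least squares with a binary instrument.
# All names and numeric settings are illustrative assumptions.
set.seed(521)
n <- 200000
beta <- -1000                               # constant causal effect of d on y
z <- rbinom(n, 1, 0.5)                      # instrument, randomly assigned
a <- rnorm(n)                               # unobserved confounder
d <- rbinom(n, 1, plogis(-1 + 2 * z + a))   # treatment depends on z and a
y <- 20000 + beta * d + 3000 * a + rnorm(n, sd = 5000)  # z has no direct effect

# Naive regression of y on d is biased by the confounder
ols <- unname(coef(lm(y ~ d))[2])

# First stage: estimate E(d | z)
dhat <- fitted(lm(d ~ z))

# Second stage: regress y on the first-stage fitted values
tsls <- unname(coef(lm(y ~ dhat))[2])

# Ratio of group mean differences (binary instrument, no covariates)
wald <- (mean(y[z == 1]) - mean(y[z == 0])) /
  (mean(d[z == 1]) - mean(d[z == 0]))

print(c(truth = beta, ols = ols, tsls = tsls, wald = wald))
```

Note that the standard errors reported by `summary()` for the second-stage `lm` fit treat `dhat` as if it were observed data, so this sketch should only be read for the point estimate.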
If there are covariates $X_i$, we have

$$E(Y_i \mid Z_i, X_i) = \beta E(D_i \mid Z_i, X_i) + \alpha' X_i$$

and we just include the covariates in each stage, i.e., in the first stage we regress $D_i$ on $Z_i$ and $X_i$ to obtain $\hat{E}(D_i \mid Z_i, X_i)$, and in the second stage we regress $Y_i$ on $\hat{E}(D_i \mid Z_i, X_i)$ and $X_i$.

Example: Angrist and Krueger (1994, "Why Do World War II Veterans Earn More than Non-Veterans," Journal of Labor Economics) propose year/quarter of birth dummy variables as instrumental variables for estimating the causal effect of military service during WWII on earnings. As an example, let's consider in particular $Z_i = 1$ if a man was born in quarter 3 or 4 of 1926 and $Z_i = 0$ if a man was born in quarter 3 or 4 of 1928. The data are from the 5% microsample of the 1980 Census (the Census only asked about detailed data such as earnings for 5% of respondents).

    # Read the data and pull out earnings, veteran status and the instrument
    ak_veteran_data=read.table("ak_veteran_data.csv",sep=",",header=TRUE)
    earnings=ak_veteran_data$earnings
    veteran=ak_veteran_data$veteran
    z=ak_veteran_data$z

Discussion of IV Assumptions:

(1) Relevance: Men born in 1926 turned 18 (draft-eligible age) in 1944, the middle of the war, while men born in 1928 turned 18 in 1946, after the war was over.

    # Proportion of veterans among men born in 1926 (z = 1)
    mean(veteran[z==1])
    [1] 0.7550714
    # Proportion of veterans among men born in 1928 (z = 0)
    mean(veteran[z==0])
    [1] 0.2431429

Men born in 1926 were much more likely to be veterans. The IV is relevant.

(2) Random assignment: $Y_i^{(z=0,d=0)}$ is the earnings man $i$ would have in 1980 if he were born in 1928 and did not serve in the military. The random assignment assumption says that the actual year the man was born is independent of this earnings potential. The random assignment assumption would be violated if there are gradual long-term trends in fertility, health, education, apprenticeship and employment which make the earnings potential of the men born in 1928 different on average from that of the men born in 1926. However, with birth cohorts only two years apart, the random assignment assumption seems plausible.
(3) Exclusion restriction: This says that $Y_i^{(z=0,d=0)} = Y_i^{(z=1,d=0)}$ and $Y_i^{(z=0,d=1)} = Y_i^{(z=1,d=1)}$, i.e., holding fixed whether a man served in the military, his earnings in 1980 would be the same regardless of whether he was born in 1926 or 1928. This would be violated if there are differences in earnings by age. The relationship between age and earnings is relatively flat for the age range 52-54 (those born in 1926 would be 54 in 1980 and those born in 1928 would be 52), but there is likely to be a small effect of age on earnings. Angrist and Krueger do a variety of things to control for this in their paper.

Empirical Results:

Usual regression:

    linreg=lm(earnings~veteran)
    summary(linreg)

    Call:
    lm(formula = earnings ~ veteran)

    Residuals:
       Min     1Q Median     3Q    Max
    -22114  -8003  -2184   4657  54652

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  20348.3      108.1  188.21   <2e-16 ***
    veteran       1840.8      153.0   12.03   <2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 12800 on 27998 degrees of freedom
    Multiple R-squared: 0.005141, Adjusted R-squared: 0.005106
    F-statistic: 144.7 on 1 and 27998 DF, p-value: < 2.2e-16

If selection on observables held, then we would have strong evidence that military service increases a man's earnings (by an estimated 1840.8 in 1980 earnings).

IV analysis:

    # First stage regression
    fsreg=lm(veteran~z)
    veteranhat=predict(fsreg)
    # Second stage regression
    ssreg=lm(earnings~veteranhat)
    betahat=coef(ssreg)[2]
    betahat
    veteranhat
     -1010.588

The IV analysis estimates that military service decreases a man's earnings (by about 1010.6 in 1980 earnings), the opposite sign from the usual regression.
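With a binary instrument and no covariates, the second-stage slope equals the difference in mean earnings between the two birth cohorts divided by the difference in their veteran rates. The worked arithmetic below uses only the first-stage means and the IV estimate printed in these notes. The symbols $\bar{Y}_{z}$ and $\bar{D}_{z}$ (cohort means of earnings and of veteran status) are notation introduced here for illustration, and the implied difference in mean earnings is back-computed from the estimate, not read from the data.

```latex
\hat{\beta}_{IV}
  = \frac{\bar{Y}_{z=1} - \bar{Y}_{z=0}}{\bar{D}_{z=1} - \bar{D}_{z=0}},
\qquad
\bar{D}_{z=1} - \bar{D}_{z=0} = 0.7550714 - 0.2431429 = 0.5119285 .

% Back-computing the numerator from the reported estimate:
\bar{Y}_{z=1} - \bar{Y}_{z=0}
  = \hat{\beta}_{IV} \times 0.5119285
  = -1010.588 \times 0.5119285
  \approx -517 .
```

So the estimate says that men born in 1926 earned roughly 517 less on average in 1980 than men born in 1928, and the IV estimate attributes that gap entirely to the 51 percentage point difference in veteran rates, which is why the exclusion restriction (no age effect on earnings) matters so much here.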