Stat 521: Notes 1 (Revised)
Ordinary Least Squares I: Estimation, Inference and Predicting Outcomes

Reading: Wooldridge, Chapter 4.1, 4.2.1-4.2.2.

Linear model: Let us review the basics of the linear model. We have $N$ units (individuals, firms, or other economic agents) drawn randomly from a large population. On each unit $i$ we observe an outcome $Y_i$ and a $K$-dimensional vector of explanatory variables $X_i = (X_{i1}, X_{i2}, \ldots, X_{iK})'$ (where typically the first covariate is a constant, $X_{i1} = 1$ for all $i = 1, \ldots, N$). We are interested in explaining the distribution of $Y_i$ in terms of the explanatory variables $X_i$ using a linear model:

$$ Y_i = \beta' X_i + \varepsilon_i. $$

In matrix notation, $Y = X\beta + \varepsilon$, or, avoiding vector and matrix notation completely,

$$ Y_i = \beta_1 X_{i1} + \cdots + \beta_K X_{iK} + \varepsilon_i = \sum_{k=1}^{K} \beta_k X_{ik} + \varepsilon_i. $$

We assume that the residuals $\varepsilon_i$ are independent of the covariates (or regressors) and normally distributed with mean zero and variance $\sigma^2$.

Assumption 1: $\varepsilon_i$ is independent of $X_i$ and normally distributed with mean zero and variance $\sigma^2$.

The regression model with Assumption 1 is called the normal linear regression model. We can weaken this considerably. First, we could relax normality and only assume independence:

Assumption 2: $\varepsilon_i$ is independent of $X_i$ with mean zero.

We can weaken this further by requiring only mean independence:

Assumption 3: $E(\varepsilon_i \mid X_i) = 0$.

Assumption 3 allows the variance of $\varepsilon_i$ to depend on $X_i$; that is, it allows for heteroskedasticity. Under Assumption 1, 2 or 3, we have

$$ E(Y_i \mid X_i) = \beta' X_i. $$

We will also assume that the observations are drawn randomly from some population. We could instead do most of the analysis treating the covariates as fixed, but this complicates matters for some results and does not help very much. See the discussion of fixed versus random covariates in Wooldridge (page 9).

Assumption 4: The pairs $(X_i, Y_i)$ are independent draws from some distribution, with the first two moments of $X_i$ finite.

The ordinary least squares (OLS) estimator for $\beta$ solves

$$ \min_{\beta} \sum_{i=1}^{N} (Y_i - \beta' X_i)^2. $$

This leads to

$$ \hat\beta = (X'X)^{-1}(X'Y). $$

Under Assumption 1, the exact distribution of the OLS estimator (conditional on the covariates) is

$$ \hat\beta \sim N\big(\beta,\ \sigma^2 (X'X)^{-1}\big). $$

Without assuming that the $\varepsilon_i$ are normally distributed, it is difficult to derive the exact distribution of $\hat\beta$. However, under the independence Assumption 2 and a second moment condition on $\varepsilon_i$ (variance finite and equal to $\sigma^2$), we can establish asymptotic normality:

$$ \sqrt{N}\big(\hat\beta - \beta\big) \xrightarrow{d} N\big(0,\ \sigma^2 E[X_i X_i']^{-1}\big). $$

Typically we do not know $\sigma^2$. We can estimate it consistently as

$$ \hat\sigma^2 = \frac{1}{N-K} \sum_{i=1}^{N} \big(Y_i - \hat\beta' X_i\big)^2. $$

Dividing by $N-K$ rather than $N$ corrects for the fact that $K$ parameters are estimated before calculating the residuals $\hat\varepsilon_i = Y_i - \hat\beta' X_i$. This correction does not matter in large samples, and in fact the maximum likelihood estimator

$$ \hat\sigma^2_{MLE} = \frac{1}{N} \sum_{i=1}^{N} \big(Y_i - \hat\beta' X_i\big)^2 $$

is a perfectly reasonable alternative. In practice, we use the following approximate distribution for $\hat\beta$:

$$ \hat\beta \approx N(\beta, \hat V), \qquad \text{where } V = \sigma^2 \big(N\, E[X_i X_i']\big)^{-1} = \frac{\sigma^2}{N} E[X_i X_i']^{-1} $$

and

$$ \hat V = \hat\sigma^2 \Big(\sum_{i=1}^{N} X_i X_i'\Big)^{-1}. $$

Often we are interested in one particular coefficient, say $\beta_k$. In that case we have

$$ \hat\beta_k \approx N(\beta_k, \hat V_{kk}), $$

where $V_{ij}$ is the $(i,j)$ element of the matrix $V$. We can use this to construct confidence intervals for a particular coefficient. A 95% confidence interval for $\beta_k$ is

$$ \Big( \hat\beta_k - 1.96\sqrt{\hat V_{kk}},\ \ \hat\beta_k + 1.96\sqrt{\hat V_{kk}} \Big). $$

We can also use this to test whether a particular coefficient is equal to some preset number. For example, if we want to test whether $\beta_k$ is equal to 0.1, we construct the t-statistic

$$ t = \frac{\hat\beta_k - 0.1}{\sqrt{\hat V_{kk}}}. $$

Under the null hypothesis that $\beta_k$ is equal to 0.1, $t$ has approximately a standard normal distribution. We reject the null hypothesis at the 0.05 level if $|t| > 1.96$.
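To make these formulas concrete, here is a minimal R sketch (not part of the original notes) that computes $\hat\beta = (X'X)^{-1}X'Y$, $\hat\sigma^2$, $\hat V$ and 95% confidence intervals directly from the matrix expressions. The simulated design and all variable names are purely illustrative.

# Minimal sketch of the matrix formulas above, on simulated data
# (illustrative only; not part of the NLSY example that follows)
set.seed(1);
N=1000; K=3;
X=cbind(1,rnorm(N),rnorm(N));            # first column is the constant
beta=c(1,2,-0.5);
Y=X%*%beta+rnorm(N);                     # Assumption 1: normal errors
betahat=solve(t(X)%*%X)%*%t(X)%*%Y;      # (X'X)^{-1} X'Y
ehat=Y-X%*%betahat;                      # residuals
sig2hat=sum(ehat^2)/(N-K);               # divide by N-K, not N
Vhat=sig2hat*solve(t(X)%*%X);            # estimated variance of betahat
se=sqrt(diag(Vhat));                     # standard errors
cbind(betahat-1.96*se,betahat+1.96*se);  # 95% confidence intervals

The built-in lm() function, used in the example below, produces the same estimates and standard errors.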
Example: Returns to Education

The following regressions are estimated on data from the National Longitudinal Survey of Youth (NLSY). The data set used here consists of 935 observations on usual weekly earnings, years of education, and experience (calculated as age minus education minus six). The particular data set consists of men between 28 and 38 years of age at the time wages were measured.

nlsdata=read.table("nlsdata.txt");
logwage=nlsdata$V1;
educ=nlsdata$V2;
exper=nlsdata$V3;
age=nlsdata$V4;
wage=exp(logwage);

Summary statistics:

> summary(wage);
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  57.73  288.50  389.10  421.00  500.10 2001.00
> summary(logwage);
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.056   5.665   5.964   5.945   6.215   7.601
> summary(educ);
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   9.00   12.00   12.00   13.47   16.00   18.00
> summary(exper);
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   5.00   11.00   13.00   13.62   17.00   23.00
> summary(age);
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  28.00   30.00   33.00   33.09   36.00   38.00

We will use these data to look at returns to education. Mincer developed a model that leads to the following relation between log earnings, education and experience for individual $i$:

$$ \log(\text{earnings})_i = \beta_1 + \beta_2\, \text{educ}_i + \beta_3\, \text{exper}_i + \beta_4\, \text{exper}_i^2 + \varepsilon_i. $$

Estimating this on the NLSY data leads to:

# Returns to education regression
expersq=exper^2;
model1=lm(logwage~educ+exper+expersq);
summary(model1)

Call:
lm(formula = logwage ~ educ + exper + expersq)

Residuals:
     Min       1Q   Median       3Q      Max
-1.70122 -0.25923  0.02022  0.27467  1.68870

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.0305512  0.2230182  18.073   <2e-16 ***
educ         0.0925137  0.0075808  12.204   <2e-16 ***
exper        0.0767769  0.0250032   3.071   0.0022 **
expersq     -0.0018859  0.0008723  -2.162   0.0309 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4106 on 926 degrees of freedom
Multiple R-squared: 0.1429,    Adjusted R-squared: 0.1401
F-statistic: 51.44 on 3 and 926 DF,  p-value: < 2.2e-16

The estimated standard deviation of the residuals is $\hat\sigma = 0.41$.

We will examine later whether Assumption 1 approximately holds for these data; for now, let us assume that it does. Using the estimates and standard errors, we can construct confidence intervals. For example, a 95% confidence interval for the returns to education, measured by the parameter $\beta_2$, is

$$ \big( 0.0925 - 1.96 \times 0.0076,\ \ 0.0925 + 1.96 \times 0.0076 \big) \approx (0.078,\ 0.107). $$

The t-statistic for testing $\beta_2 = 0.1$ is

$$ t = \frac{0.0925 - 0.1}{0.0076} \approx -0.99. $$

Since $|t| < 1.96$, we do not reject the hypothesis that $\beta_2$ is equal to 0.1 at the 0.05 level.
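These hand calculations can also be read off the fitted model directly. The following small check is not part of the original output; it only uses coef() and vcov() applied to model1.

# Quick check of the interval and t-statistic above (not in the original notes)
b2=coef(model1)["educ"];
se2=sqrt(vcov(model1)["educ","educ"]);
c(b2-1.96*se2,b2+1.96*se2);   # 95% confidence interval, approx (0.078, 0.107)
(b2-0.1)/se2;                 # t-statistic for H0: beta2 = 0.1, approx -0.99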
Now suppose we wish to use these estimates for predicting a more complex change. For example, suppose we want to see what the estimated effect on the log of weekly earnings is of increasing a person's education by one year. Because changing an individual's education also changes their experience (in this case it automatically reduces it by one year), this effect depends on more than just $\beta_2$. To make this specific, let us focus on an individual with 12 years of education (high school) and ten years of experience (so that exper$^2$ is equal to 100).

The expected value of this person's log earnings is

$$ \hat E(\log \text{earnings} \mid \text{educ}=12, \text{exper}=10) = \hat\beta_1 + \hat\beta_2 \cdot 12 + \hat\beta_3 \cdot 10 + \hat\beta_4 \cdot 100 = 5.7199. $$

> # Expected value of log earnings for a person with 12 years of education
> # (high school) and ten years of experience
> coef(model1)[1]+coef(model1)[2]*12+coef(model1)[3]*10+coef(model1)[4]*100;
(Intercept)
   5.719899

Now change this person's education to 13. Their experience goes down to 9, and exper$^2$ goes down to 81. Hence, their expected log earnings is

$$ \hat E(\log \text{earnings} \mid \text{educ}=13, \text{exper}=9) = \hat\beta_1 + \hat\beta_2 \cdot 13 + \hat\beta_3 \cdot 9 + \hat\beta_4 \cdot 81 = 5.7715. $$

> # Expected value of log earnings for a person with 13 years of education
> # and nine years of experience
> coef(model1)[1]+coef(model1)[2]*13+coef(model1)[3]*9+coef(model1)[4]*81;
(Intercept)
   5.771467

The difference between these two predictions is $\hat\beta_2 - \hat\beta_3 - 19\,\hat\beta_4 \approx 0.0516$; this is the expected gain from an additional year of education, taking into account the induced changes in experience and experience squared.

Now the question is what the standard error for this prediction is. The general way to answer this is as follows. The vector of estimated coefficients $\hat\beta$ is approximately normal with mean $\beta$ and variance $V$. We are interested in a linear combination of the $\beta$'s, namely

$$ \beta_2 - \beta_3 - 19\,\beta_4 = \lambda'\beta, \qquad \text{where } \lambda = (0, 1, -1, -19)'. $$

Therefore,

$$ \lambda'\hat\beta \approx N\big(\lambda'\beta,\ \lambda' V \lambda\big). $$

We estimate $V$ by $\hat V = \hat\sigma^2 \big(\sum_{i=1}^{N} X_i X_i'\big)^{-1}$. The command vcov(model) in R returns $\hat V$.

> vcov(model1)
              (Intercept)             educ           exper          expersq
(Intercept)  0.0497371031 -0.0011370835603  -0.00468369754  0.0001476767044
educ        -0.0011370836  0.0000574686547   0.00003473218 -0.0000005506074
exper       -0.0046836975  0.0000347321798   0.00062516139 -0.0000214799425
expersq      0.0001476767 -0.0000005506074  -0.00002147994  0.0000007609628

> # Standard error and confidence interval for the effect on expected log earnings
> # of increasing education by 1 when education starts at 12 and experience
> # starts at ten
> lambda=matrix(c(0,1,-1,-19),ncol=1);
> betahat=matrix(coef(model1),ncol=1);
> estimate=t(lambda)%*%betahat;
> var.est=t(lambda)%*%vcov(model1)%*%lambda;
> se.est=sqrt(var.est);
> lci=estimate-1.96*se.est;
> uci=estimate+1.96*se.est;
> estimate
           [,1]
[1,] 0.05156802
> se.est
            [,1]
[1,] 0.009620735
> lci
           [,1]
[1,] 0.03271138
> uci
           [,1]
[1,] 0.07042466

Thus, the 95% confidence interval for the change in expected log earnings from the one-year increase in education, when education starts at 12 and experience at ten, is (0.0327, 0.0704).

The Delta Method

The method above works for finding the standard error of a linear function of the coefficients in the regression model. The delta method is a method for finding the standard error of a nonlinear function of the coefficients.

Example: Suppose we just consider the regression of log earnings on education.

> model2=lm(logwage~educ);
> summary(model2);

Call:
lm(formula = logwage ~ educ)

Residuals:
     Min       1Q   Median       3Q      Max
-1.79050 -0.25453  0.01991  0.27309  1.68779

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.040344   0.085044   59.27   <2e-16 ***
educ        0.067168   0.006231   10.78   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4177 on 928 degrees of freedom
Multiple R-squared: 0.1113,    Adjusted R-squared: 0.1103
F-statistic: 116.2 on 1 and 928 DF,  p-value: < 2.2e-16

The estimated regression function is

$$ \hat E(\log \text{earnings} \mid \text{educ} = x) = 5.0403 + 0.0672\, x. $$

The estimate for $\sigma^2$ is $\hat\sigma^2 = 0.4177^2 = 0.1745$. Let's assume Assumption 1 holds (we will examine this later).

Suppose we are interested in the average effect of increasing education by one year for an individual with currently eight years of education, not on the log of earnings but on the level of earnings. Then, using the fact that if $Z \sim N(\mu, \sigma^2)$, then $E[\exp(Z)] = \exp(\mu + \sigma^2/2)$, we have

$$ E(\text{earnings} \mid \text{educ} = x) = \exp(\beta_1 + \beta_2\, x + \sigma^2/2). $$
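As a quick illustration (not part of the original notes), a short simulation with arbitrary values of $\mu$ and $\sigma^2$ confirms the lognormal-mean fact used here.

# Simulation check of E[exp(Z)] = exp(mu + sigma^2/2) for Z ~ N(mu, sigma^2)
# (the values of mu and sig2 below are arbitrary illustrative choices)
set.seed(2);
mu=5.5; sig2=0.17;
mean(exp(rnorm(10^6,mean=mu,sd=sqrt(sig2))));  # simulated average of exp(Z)
exp(mu+sig2/2);                                # theoretical value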
The parameter of interest (the effect on expected earnings of increasing education from eight to nine) is therefore

$$ \theta = \exp(\beta_1 + \beta_2 \cdot 9 + \sigma^2/2) - \exp(\beta_1 + \beta_2 \cdot 8 + \sigma^2/2). $$

Getting an estimate for $\theta$ is easy. Just plug in the estimates for $\beta$ and $\sigma^2$ to get

$$ \hat\theta = \exp(\hat\beta_1 + \hat\beta_2 \cdot 9 + \hat\sigma^2/2) - \exp(\hat\beta_1 + \hat\beta_2 \cdot 8 + \hat\sigma^2/2) = 20.0480. $$

# Estimated effect of increasing education from eight to nine on earnings
sigmahatsq=.4177^2;
thetahat=exp(coef(model2)[1]+coef(model2)[2]*9+sigmahatsq/2)-exp(coef(model2)[1]+coef(model2)[2]*8+sigmahatsq/2);
> thetahat
(Intercept)
   20.04802

However, we also need a standard error for this estimate. Let us write the parameter of interest more generally as $\theta = g(\gamma)$, where $\gamma = (\beta', \sigma^2)'$.

Delta Method: Suppose that for some estimate $\hat\gamma$ of a true parameter $\gamma$, we have

$$ \sqrt{N}\big(\hat\gamma - \gamma\big) \xrightarrow{d} N(0, \Lambda). $$

Then

$$ \sqrt{N}\big(g(\hat\gamma) - g(\gamma)\big) \xrightarrow{d} N\Big(0,\ \frac{\partial g}{\partial \gamma}(\gamma)'\, \Lambda\, \frac{\partial g}{\partial \gamma}(\gamma)\Big). $$

Another way of stating the delta method is the following: suppose $\hat\gamma - \gamma \approx N(0, \Lambda)$, where $\Lambda \to 0$ as $N \to \infty$. Then

$$ g(\hat\gamma) - g(\gamma) \approx N\Big(0,\ \frac{\partial g}{\partial \gamma}(\gamma)'\, \Lambda\, \frac{\partial g}{\partial \gamma}(\gamma)\Big). $$

In our case,

$$ g(\gamma) = \exp(\beta_1 + 9\beta_2 + \sigma^2/2) - \exp(\beta_1 + 8\beta_2 + \sigma^2/2) $$

and

$$ \frac{\partial g}{\partial \gamma}(\gamma) = \begin{pmatrix} \exp(\beta_1 + 9\beta_2 + \sigma^2/2) - \exp(\beta_1 + 8\beta_2 + \sigma^2/2) \\ 9\exp(\beta_1 + 9\beta_2 + \sigma^2/2) - 8\exp(\beta_1 + 8\beta_2 + \sigma^2/2) \\ \tfrac{1}{2}\exp(\beta_1 + 9\beta_2 + \sigma^2/2) - \tfrac{1}{2}\exp(\beta_1 + 8\beta_2 + \sigma^2/2) \end{pmatrix}. $$

We estimate this by substituting the estimated values for the parameters, so we get

$$ \frac{\partial g}{\partial \gamma}(\hat\gamma) = \begin{pmatrix} \exp(\hat\beta_1 + 9\hat\beta_2 + \hat\sigma^2/2) - \exp(\hat\beta_1 + 8\hat\beta_2 + \hat\sigma^2/2) \\ 9\exp(\hat\beta_1 + 9\hat\beta_2 + \hat\sigma^2/2) - 8\exp(\hat\beta_1 + 8\hat\beta_2 + \hat\sigma^2/2) \\ \tfrac{1}{2}\exp(\hat\beta_1 + 9\hat\beta_2 + \hat\sigma^2/2) - \tfrac{1}{2}\exp(\hat\beta_1 + 8\hat\beta_2 + \hat\sigma^2/2) \end{pmatrix}. $$

partialg.partialgamma.1=exp(coef(model2)[1]+coef(model2)[2]*9+sigmahatsq/2)-exp(coef(model2)[1]+coef(model2)[2]*8+sigmahatsq/2);
partialg.partialgamma.2=9*exp(coef(model2)[1]+coef(model2)[2]*9+sigmahatsq/2)-8*exp(coef(model2)[1]+coef(model2)[2]*8+sigmahatsq/2);
partialg.partialgamma.3=.5*exp(coef(model2)[1]+coef(model2)[2]*9+sigmahatsq/2)-.5*exp(coef(model2)[1]+coef(model2)[2]*8+sigmahatsq/2);
> partialg.partialgamma.1
(Intercept)
   20.04802
> partialg.partialgamma.2
(Intercept)
   468.9977
> partialg.partialgamma.3
(Intercept)
   10.02401
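As a sanity check (not part of the original notes), the analytic derivatives can be compared with numerical finite differences. The function g and the vector gammahat below simply re-code the parameter of interest and the estimates already computed above.

# Numerical (finite-difference) check of the analytic derivatives above
g=function(gam){exp(gam[1]+gam[2]*9+gam[3]/2)-exp(gam[1]+gam[2]*8+gam[3]/2)};
gammahat=c(coef(model2),sigmahatsq);
eps=1e-6;
numgrad=sapply(1:3,function(j){gp=gammahat; gp[j]=gp[j]+eps; (g(gp)-g(gammahat))/eps});
numgrad;   # should be close to the three partial derivatives above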
To get the variance for $\hat\theta = g(\hat\gamma)$, we also need the full covariance matrix of $\hat\gamma = (\hat\beta_1, \hat\beta_2, \hat\sigma^2)'$. We have the following distributional fact: under the normal linear regression model, $\hat\beta$ is independent of $\hat\sigma^2$, and

$$ (N-K)\frac{\hat\sigma^2}{\sigma^2} = \frac{\sum_{i=1}^{N} \big(Y_i - \hat\beta' X_i\big)^2}{\sigma^2} \sim \chi^2_{N-K} $$

(Searle, Linear Models, Chapter 5.3). The variance of a $\chi^2_m$ random variable (chi-squared distribution with $m$ degrees of freedom) is $2m$. Thus,

$$ \mathrm{Var}(\hat\sigma^2) = \frac{\sigma^4}{(N-K)^2}\, 2(N-K) = \frac{2\sigma^4}{N-K}, $$

and we can estimate $\mathrm{Var}(\hat\sigma^2)$ by $2\hat\sigma^4/(N-K)$. Hence we have

$$ \widehat{\mathrm{Var}}(\hat\sigma^2) = \frac{2 \times 0.4177^4}{928} \approx 0.000066. $$

We also have

> vcov(model2);
              (Intercept)           educ
(Intercept)  0.0072324477 -0.00052296635
educ        -0.0005229664  0.00003882174

so

v=vcov(model2);
Lambdahat=matrix(c(v[1,1],v[2,1],0,v[1,2],v[2,2],0,0,0,.000066),ncol=3);
> Lambdahat
              [,1]          [,2]    [,3]
[1,]  0.0072324477 -5.229664e-04 0.0e+00
[2,] -0.0005229664  3.882174e-05 0.0e+00
[3,]  0.0000000000  0.000000e+00 6.6e-05

and the variance for $\hat\theta$ is

$$ \widehat{\mathrm{Var}}(\hat\theta) = \frac{\partial g}{\partial \gamma}(\hat\gamma)'\, \hat\Lambda\, \frac{\partial g}{\partial \gamma}(\hat\gamma). $$

partialg.partialgamma=matrix(c(partialg.partialgamma.1,partialg.partialgamma.2,partialg.partialgamma.3),ncol=1);
varthetahat=t(partialg.partialgamma)%*%Lambdahat%*%partialg.partialgamma;
> varthetahat
         [,1]
[1,] 1.618347

Hence, a 95% confidence interval for $\theta$ (the effect on expected earnings of increasing education from 8 to 9) is

lci.theta=thetahat-1.96*sqrt(varthetahat);
uci.theta=thetahat+1.96*sqrt(varthetahat);
> lci.theta
         [,1]
[1,] 17.55462
> uci.theta
         [,1]
[1,] 22.54142

that is, (17.55, 22.54).
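Finally, as a cross-check of the delta-method calculation (not part of the original notes), one can draw values of $\gamma$ from the approximate distribution $N(\hat\gamma, \hat\Lambda)$ and look at the implied spread of $g(\gamma)$. This is a simulation analogue of the delta method rather than the method used in these notes, and it requires the MASS package for mvrnorm.

# Simulation analogue of the delta-method calculation (illustrative only)
library(MASS);                                 # for mvrnorm
g=function(gam){exp(gam[1]+gam[2]*9+gam[3]/2)-exp(gam[1]+gam[2]*8+gam[3]/2)};
gammahat=c(coef(model2),sigmahatsq);
set.seed(3);
draws=mvrnorm(100000,mu=gammahat,Sigma=Lambdahat);  # draws of gamma
gdraws=apply(draws,1,g);                            # implied values of g(gamma)
sd(gdraws);                       # compare with sqrt(varthetahat), approx 1.27
quantile(gdraws,c(.025,.975));    # compare with the delta-method interval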