Research Method Lecture 3 (Ch4): Inferences

Sampling distribution of OLS estimators

We have learned that MLR.1 through MLR.4 guarantee that the OLS estimators are unbiased. In addition, we have learned that, by adding MLR.5, you can estimate the variance of the OLS estimators. However, in order to conduct hypothesis tests, we need to know the sampling distribution of the OLS estimators.

To do so, we introduce one more assumption.

Assumption MLR.6
(i) The population error u is independent of the explanatory variables x1, x2, …, xk, and (ii) u ~ N(0, σ²).

Classical linear model assumptions

Assumptions MLR.1 through MLR.6 are called the classical linear model (CLM) assumptions. Note that MLR.6(i) automatically implies MLR.4 (provided E(u) = 0, which we always assume), but MLR.4 does not necessarily imply MLR.6(i). In this sense, MLR.4 is redundant. However, to emphasize that we are making an additional assumption, MLR.1 through MLR.6 are called the CLM assumptions.

Theorem 4.1
Conditional on X, we have
(a) β̂j ~ N(βj, Var(β̂j)), and
(b) (β̂j − βj)/sd(β̂j) ~ N(0, 1).
Proof: See the front board.

Hypothesis testing

Consider the following multiple linear regression:
y = β0 + β1x1 + β2x2 + … + βkxk + u
Now, I present a well-known theorem.

Theorem 4.2: t-distribution for the standardized estimators
Under MLR.1 through MLR.6 (the CLM assumptions), we have
(β̂j − βj)/se(β̂j) ~ t(n−k−1)
This means that the standardized coefficient follows the t-distribution with n−k−1 degrees of freedom.
Proof: See the front board.

One-sided test

A one-sided test has the following form.
Null hypothesis: H0: βj = 0
Alternative hypothesis: H1: βj > 0

Test procedure
1. Set the significance level α. Typically, it is set at 0.05.
2. Compute the t-statistic under H0:
t-stat = (β̂j − βj)/se(β̂j) = β̂j/se(β̂j)
Note: under H0, βj = 0, so the t-statistic simplifies to the second expression.
3. Find the cutoff number t(n−k−1, α), the value that a t-distributed variable with n−k−1 degrees of freedom exceeds with probability α.
4. Reject the null hypothesis if the t-statistic falls in the rejection region.

(Figure: density of the t-distribution with n−k−1 degrees of freedom; the area of size α to the right of the cutoff t(n−k−1, α) is the rejection region.)

If the t-statistic falls in the rejection region, you reject the null hypothesis. Otherwise, you fail to reject the null hypothesis.

Note: if you want to test whether βj is negative, you have the following hypotheses:
H0: βj = 0
H1: βj < 0
Then the rejection region is on the negative side, to the left of −t(n−k−1, α). Nothing else changes.

Example

The next table shows the estimated log-salary equation for 338 Japanese economists. (Estimation is done by STATA.) The estimated regression is
log(salary) = β0 + β1(female) + δ(other variables) + u

STATA output (Model SS = SSE, Residual SS = SSR, Total SS = SST):

Source   | SS          df    MS          |  Number of obs = 338
Model    | 19.7186266  11    1.79260242  |  F(11, 326)    = 60.94
Residual | 9.58915481  326   .029414585  |  Prob > F      = 0.0000
Total    | 29.3077814  337   .08696671   |  R-squared     = 0.6728
                                         |  Adj R-squared = 0.6618
                                         |  Root MSE      = .17151

lsalary      | Coef.      Std. Err.  t       P>|t|  [95% Conf. Interval]
female       | -.0725573  .0258508   -2.81   0.005  -.1234127  -.021702
fullprof     | .3330248   .0505602   6.59    0.000  .2335594   .4324903
assocprof    | .1665502   .0397755   4.19    0.000  .0883011   .2447994
experience   | .0214346   .0042789   5.01    0.000  .0130168   .0298524
experiencesq | -.0003603  .0000925   -3.90   0.000  -.0005423  -.0001783
evermarried  | .0847564   .027398    3.09    0.002  .0308573   .1386556
kids6        | .0051719   .0224497   0.23    0.818  -.0389927  .0493364
phdabroad    | .0442625   .0310316   1.43    0.155  -.016785   .10531
extgrant     | .0001081   .0000506   2.14    0.033  8.56e-06   .0002076
privuniv     | .1675923   .0199125   8.42    0.000  .1284191   .2067654
phdoffer     | .0751014   .0202832   3.70    0.000  .035199    .1150039
_cons        | 6.200925   .0412649   150.27  0.000  6.119746   6.282104

Q1. Test whether female salary is lower than male salary at the 5% significance level (i.e., α = 0.05).
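Q1 can be checked numerically. Here is a minimal sketch in Python; scipy (which is not part of the lecture, where estimation is done in STATA) replaces the t table, and the numbers are copied from the STATA output above:

```python
from scipy import stats

# Numbers from the STATA output: coefficient and standard error on female
beta_hat = -0.0725573
se = 0.0258508
n, k = 338, 11          # observations and slope coefficients
df = n - k - 1          # 326 degrees of freedom

# H0: beta1 = 0 vs H1: beta1 < 0 (one-sided, lower tail)
t_stat = beta_hat / se
cutoff = stats.t.ppf(0.05, df)   # negative cutoff, i.e. -t(326, 0.05)

reject = t_stat < cutoff
print(round(t_stat, 2), round(cutoff, 2), reject)
```

The t-statistic matches the −2.81 reported by STATA, and it falls below the cutoff, so the null is rejected in favor of H1: β1 < 0.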
That is, test
H0: β1 = 0
H1: β1 < 0

Two-sided test

A two-sided test has the following form.
Null hypothesis: H0: βj = 0
Alternative hypothesis: H1: βj ≠ 0

Test procedure
1. Set the significance level α. Typically, it is set at 0.05.
2. Compute the t-statistic under H0:
t-stat = (β̂j − βj)/se(β̂j) = β̂j/se(β̂j)
Note: under H0, βj = 0, so the t-statistic simplifies to the second expression.
3. Find the cutoff number t(n−k−1, α/2). The rejection region now consists of both tails of the t-distribution with n−k−1 degrees of freedom: the area above t(n−k−1, α/2) and the area below −t(n−k−1, α/2), each of size α/2.
4. Reject the null hypothesis if the t-statistic falls in the rejection region.

When you reject the null hypothesis H0: βj = 0 using a two-sided test, we say that the variable xj is statistically significant.

Exercise

Consider again the following regression:
log(salary) = β0 + β1(female) + δ(other variables) + u
This time, test whether the female coefficient is equal to zero using a two-sided test at the 5% significance level. That is, test
H0: β1 = 0
H1: β1 ≠ 0
(The STATA output is the same as shown earlier: β̂1 = −.0725573 with standard error .0258508, n = 338.)

The p-value

The p-value is the minimum significance level α at which the coefficient is statistically significant. STATA computes this value automatically: it is the P>|t| column of the salary regression output shown earlier. For example, the p-value for female is 0.005, so female is statistically significant at any significance level of 0.005 or above.

Other hypotheses about βj

You can test other hypotheses, such as βj = 1 or βj = −1. Consider the null hypothesis
H0: βj = a
Then all you have to do is compute the t-statistic as
t-stat = (β̂j − a)/se(β̂j)
The rest of the test procedure is exactly the same.

Consider the following regression results (standard errors in parentheses):
log(crime) = −6.63 + 1.27 log(enroll)
            (1.03)  (0.11)
n = 97, R² = 0.585
Now, test whether the coefficient on log(enroll) is equal to 1, using a two-sided test at the 5% significance level.
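The crime-regression test above can be sketched in Python. As before, scipy stands in for the t table (it is not part of the lecture), and the numbers come from the regression results above:

```python
from scipy import stats

# Crime regression from the slide: log(crime) = -6.63 + 1.27*log(enroll)
beta_hat, se = 1.27, 0.11
n, k = 97, 1
df = n - k - 1          # 95 degrees of freedom

# H0: beta = 1 vs H1: beta != 1 (two-sided)
t_stat = (beta_hat - 1) / se
cutoff = stats.t.ppf(1 - 0.05 / 2, df)     # upper cutoff t(95, 0.025)
p_value = 2 * stats.t.sf(abs(t_stat), df)  # two-sided p-value

reject = abs(t_stat) > cutoff
print(round(t_stat, 2), round(cutoff, 2), reject)
```

The t-statistic of about 2.45 exceeds the cutoff of about 1.99, so at the 5% level we reject H0: the coefficient on log(enroll) is statistically different from 1.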
The F-test: testing general linear restrictions

You are often interested in more complicated hypothesis tests. First, I will show you some examples of such tests using the salary regression example.

Example 1: Modified salary equation.
log(salary) = β0 + β1(female) + β2(female)×(Exp>20) + δ(other variables) + u
where (Exp>20) is the dummy variable for those with experience greater than 20 years. Then it is easy to show that the gender salary gap among those with experience greater than 20 years is given by β1 + β2. Then you want to test
H0: β1 + β2 = 0
H1: β1 + β2 ≠ 0

Example 2: More on the modified salary equation.
log(salary) = β0 + β1(female) + β2(female)×(Exp) + δ(other variables) + u
where Exp is years of experience. Then, if you want to test whether there is a gender salary gap at experience equal to 5, you test
H0: β1 + 5β2 = 0
H1: β1 + 5β2 ≠ 0

Example 3: The price of houses.
log(price) = β0 + β1(assessed price) + β2(lot size) + β3(square footage) + β4(# bedrooms) + u
Then you would be interested in
H0: β1 = 1, β2 = 0, β3 = 0, β4 = 0
H1: H0 is not true.
Note that in this case there are 4 equations in H0.

The procedure for the F-test

Linear restrictions are tested using the F-test. The general procedure can be explained using the following example:
y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + u --------(1)
Suppose you want to test
H0: β1 = 1, β2 = β3, β4 = 0
H1: H0 is not true.

First, plug the hypothetical coefficient values given by H0 into equation (1). Then you get
y = β0 + 1·x1 + β2x2 + β2x3 + 0·x4 + u
(y − x1) = β0 + β2(x2 + x3) + u --------(2)
Equation (2) is called the restricted model. The original equation (1) is called the unrestricted model. In the restricted model, the dependent variable is (y − x1), and there is now only one explanatory variable, (x2 + x3).

Now, I can describe the testing procedure.

Step 1: Estimate the unrestricted model (1), and compute its SSR. Call this SSRur.
Step 2: Estimate the restricted model (2), and compute its SSR. Call this SSRr.
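Steps 1 and 2 can be sketched with simulated data. This is a hypothetical illustration, not the lecture's example: numpy's least-squares routine stands in for STATA, the data are randomly generated, and the true coefficients are chosen so that H0 holds. The two SSRs are then combined into the F-statistic [(SSRr − SSRur)/q] / [SSRur/(n−k−1)]:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical simulated data (not from the lecture)
x = rng.normal(size=(n, 4))
u = rng.normal(scale=0.5, size=n)
# True coefficients satisfy H0: beta1 = 1, beta2 = beta3, beta4 = 0
y = 2.0 + 1.0 * x[:, 0] + 0.7 * x[:, 1] + 0.7 * x[:, 2] + 0.0 * x[:, 3] + u

def ssr(dep, regressors):
    """Residual sum of squares from an OLS fit with an intercept."""
    X = np.column_stack([np.ones(len(dep)), regressors])
    beta, *_ = np.linalg.lstsq(X, dep, rcond=None)
    resid = dep - X @ beta
    return resid @ resid

# Step 1: unrestricted model (1): y on x1, x2, x3, x4
ssr_ur = ssr(y, x)
# Step 2: restricted model (2): (y - x1) on (x2 + x3)
ssr_r = ssr(y - x[:, 0], (x[:, 1] + x[:, 2]).reshape(-1, 1))

# q = 3 restrictions, k = 4 regressors in the unrestricted model
q, k = 3, 4
F = ((ssr_r - ssr_ur) / q) / (ssr_ur / (n - k - 1))
print(ssr_ur, ssr_r, F)
```

Because the restricted model is the unrestricted model with the constraints imposed, SSRr can never be smaller than SSRur, so F is always non-negative; here H0 is true by construction, so F should be small.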
Step 3: Compute the F-statistic as
F = [(SSRr − SSRur)/q] / [SSRur/(n−k−1)]
where q is the number of equations in H0.
q = numerator degrees of freedom
n−k−1 = denominator degrees of freedom

Step 4: It is known that the F-statistic follows the F-distribution with degrees of freedom (q, n−k−1). That is,
F ~ F(q, n−k−1)
with numerator degrees of freedom q and denominator degrees of freedom n−k−1.

Step 5: Set the significance level α. (Usually it is set at 0.05.)
Step 6: Find the cutoff value c such that P(F(q, n−k−1) > c) = α.
Step 7: Reject H0 if the F-statistic falls in the rejection region, i.e., if F > c.

(Figure: density of F(q, n−k−1); the area of size α to the right of the cutoff c is the rejection region. The cutoff points can be found in the F-distribution table. Copyright © 2009 South-Western/Cengage Learning.)

Example

log(salary) = β0 + β1(female) + β2(female)×(Exp>20) + δ(other variables) + u -----(1)
Now, let us test
H0: β1 + β2 = 0
H1: β1 + β2 ≠ 0

Under H0, β2 = −β1, so the restricted model is
log(salary) = β0 + β1[(female) − (female)×(Exp>20)] + δ(other variables) + u -----(2)
The following output shows the estimated unrestricted and restricted models.

The unrestricted model (SSRur = 9.54090327):

Source   | SS          df    MS          |  Number of obs = 338
Model    | 19.7668781  12    1.64723984  |  F(12, 325)    = 56.11
Residual | 9.54090327  325   .029356625  |  Prob > F      = 0.0000
Total    | 29.3077814  337   .08696671   |  R-squared     = 0.6745
                                         |  Adj R-squared = 0.6624
                                         |  Root MSE      = .17134

lsalary      | Coef.      Std. Err.  t       P>|t|  [95% Conf. Interval]
female       | -.0873223  .0282769   -3.09   0.002  -.1429511  -.0316935
female_exp20 | .0824519   .0643129   1.28    0.201  -.0440702  .2089739
fullprof     | .3418992   .0509825   6.71    0.000  .2416019   .4421966
assocprof    | .1713122   .0399095   4.29    0.000  .0927986   .2498259
experience   | .0202163   .0043791   4.62    0.000  .0116014   .0288312
experiencesq | -.0003409  .0000936   -3.64   0.000  -.0005251  -.0001566
evermarried  | .0888877   .02756     3.23    0.001  .0346692   .1431062
kids6        | .0051322   .0224276   0.23    0.819  -.0389894  .0492537
phdabroad    | .0459212   .031028    1.48    0.140  -.0151199  .1069623
extgrant     | .0000922   .000052    1.77    0.078  -.0000102  .0001946
privuniv     | .1661488   .0199247   8.34    0.000  .1269512   .2053464
phdoffer     | .0747895   .0202647   3.69    0.000  .0349231   .114656
_cons        | 6.205193   .0413584   150.03  0.000  6.123829   6.286557

The restricted model (SSRr = 9.54110486), where f_minus_fe20 = female − female×(Exp>20):

Source   | SS          df    MS          |  Number of obs = 338
Model    | 19.7666765  11    1.79697059  |  F(11, 326)    = 61.40
Residual | 9.54110486  326   .029267193  |  Prob > F      = 0.0000
Total    | 29.3077814  337   .08696671   |  R-squared     = 0.6745
                                         |  Adj R-squared = 0.6635
                                         |  Root MSE      = .17108

lsalary      | Coef.      Std. Err.  t       P>|t|  [95% Conf. Interval]
f_minus_fe20 | -.0872393  .028216    -3.09   0.002  -.1427477  -.0317308
fullprof     | .3424814   .0504192   6.79    0.000  .2432934   .4416694
assocprof    | .1716157   .0396806   4.32    0.000  .0935533   .2496781
experience   | .0201523   .0043037   4.68    0.000  .0116856   .0286189
experiencesq | -.0003397  .0000925   -3.67   0.000  -.0005217  -.0001577
evermarried  | .0891255   .0273684   3.26    0.001  .0352845   .1429665
kids6        | .0051464   .0223927   0.23    0.818  -.0389061  .0491989
phdabroad    | .0461026   .0309036   1.49    0.137  -.0146931  .1068982
extgrant     | .000091    .00005     1.82    0.070  -7.40e-06  .0001894
privuniv     | .1659818   .0197922   8.39    0.000  .1270452   .2049183
phdoffer     | .0748104   .0202322   3.70    0.000  .0350082   .1146126
_cons        | 6.205144   .0412911   150.28  0.000  6.123913   6.286374

Since we have only one equation in H0, q = 1, and (n−k−1) = (338 − 12 − 1) = 325. So
F = [(9.54110486 − 9.54090327)/1] / [9.54090327/325] = 0.0069
The cutoff point at the 5% significance level is 3.84. Since the F-statistic does not fall in the rejection region, we fail to reject the null hypothesis. In other words, we did not find evidence of a gender gap among those with experience greater than 20 years.

In fact, STATA does the F-test automatically. After estimating the unrestricted model, type
. test female + female_exp20 = 0
and STATA reports
( 1) female + female_exp20 = 0
F( 1, 325) = 0.01
Prob > F = 0.9340
(The accompanying regression output is the same unrestricted model shown above.)

F-test for a special case: the exclusion restrictions

Consider the following model:
y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + u -----(1)
Often you would like to test whether a subset of the coefficients are all equal to zero. This type of restriction is called `the exclusion restrictions'.

Suppose you want to test whether β2, β3, and β4 are jointly equal to zero. Then you test
H0: β2 = 0, β3 = 0, β4 = 0
H1: H0 is not true.

In this special type of F-test, the unrestricted and restricted equations look like
y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + u -----(1)
y = β0 + β1x1 + u -----(2)
In this special case, the F-statistic has the following representation:
F = [(SSRr − SSRur)/q] / [SSRur/(n−k−1)] = [(R²ur − R²r)/q] / [(1 − R²ur)/(n−k−1)]
Proof: See the front board.

When we reject this type of null hypothesis, we say that x2, x3, and x4 are jointly significant.

Example of the test of exclusion restrictions

Suppose you are estimating a salary equation for baseball players:
log(salary) = β0 + β1(years in league) + β2(average games played) + β3(batting average) + β4(homeruns) + β5(runs batted) + u

Do batting average, homeruns, and runs batted matter for salary after years in league and average games played are controlled for? To answer this question, you test
H0: β3 = 0, β4 = 0, β5 = 0
H1: H0 is not true.

Unrestricted model
Variable               | Coefficient | Standard error
Years in league        | 0.0689***   | 0.0121
Average games played   | 0.0126***   | 0.0026
Batting average        | 0.00098     | 0.0011
Homeruns               | 0.0144      | 0.016
Runs batted            | 0.0108      | 0.0072
Constant               | 11.19***    | 0.29
# obs: 353, R-squared: 0.6278, SSR: 181.186

As can be seen, batting average, homeruns, and runs batted do not have statistically significant t-statistics at the 5% level.

Restricted model
Variable               | Coefficient | Standard error
Years in league        | 0.0713***   | 0.0125
Average games played   | 0.0202***   | 0.0013
Constant               | 11.22***    | 0.11
# obs: 353, R-squared: 0.5971, SSR: 198.311

The F-statistic is
F = [(198.311 − 181.186)/3] / [181.186/(353 − 5 − 1)] = 10.932
The cutoff number is about 2.60, so we reject the null hypothesis at the 5% significance level. This is a reminder that even if each coefficient is individually insignificant, the coefficients may be jointly significant.
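The arithmetic above can be checked numerically. A minimal sketch in Python, where scipy replaces the F table (scipy is not part of the lecture) and the SSRs are the slide's numbers:

```python
from scipy import stats

# Numbers from the baseball-salary slides
ssr_r, ssr_ur = 198.311, 181.186   # restricted and unrestricted SSRs
n, k, q = 353, 5, 3                # obs, regressors in (1), restrictions

F = ((ssr_r - ssr_ur) / q) / (ssr_ur / (n - k - 1))
cutoff = stats.f.ppf(0.95, q, n - k - 1)   # 5% critical value of F(3, 347)
p_value = stats.f.sf(F, q, n - k - 1)

print(round(F, 3), round(cutoff, 2), F > cutoff)
```

The F-statistic reproduces the 10.932 computed by hand, the cutoff agrees with the "about 2.60" read from the F table, and the tiny p-value confirms that the three variables are jointly significant.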