VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

Consequences of variable misspecification

In this sequence we will investigate the consequences of misspecifying the regression model in terms of explanatory variables. To keep the analysis simple, we will assume that there are only two possibilities: either Y depends only on X2, or it depends on both X2 and X3.

    Y = β1 + β2X2 + u        OR        Y = β1 + β2X2 + β3X3 + u

If Y depends only on X2 and we fit the simple regression model, we will not encounter any problems, assuming of course that the regression model assumptions are valid. If, however, Y depends on both X2 and X3 and we fit the simple regression, omitting X3, the estimator of β2 is in general biased. As a further consequence of the misspecification, the standard errors, t tests and F test are invalid.

EXAMPLE I

1. First model: multiple regression with two explanatory variables, ASVABC and SM.

. regress S ASVABC SM

      Source |       SS       df       MS              Number of obs =     500
-------------+------------------------------           F(  2,   497) =  105.58
       Model |   1119.3742     2  559.687101           Prob > F      =  0.0000
    Residual |   2634.6478   497  5.30110221           R-squared     =  0.2982
-------------+------------------------------           Adj R-squared =  0.2954
       Total |    3754.022   499  7.52309018           Root MSE      =  2.3024

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      ASVABC |   1.377521   .1229142    11.21   0.000     1.136026    1.619017
          SM |   .1919575   .0416913     4.60   0.000     .1100445    .2738705
       _cons |    11.8925   .5629644    21.12   0.000     10.78642    12.99859
------------------------------------------------------------------------------

We will illustrate the bias using an educational attainment model. To keep the analysis simple, we will assume that in the true model S depends only on ASVABC and SM. The output above shows the corresponding regression using EAWE Data Set 21.

We test the correlation of the independent variables:

. cor SM ASVABC
(obs=500)

             |       SM   ASVABC
-------------+------------------
          SM |   1.0000
      ASVABC |   0.3594   1.0000

The two explanatory variables are positively correlated (r = 0.36): respondents with high ASVABC scores tend to have mothers with more years of schooling.

Now we will run the regression a second time, omitting SM. Before we do this, we will try to predict the direction of the bias in the coefficient of ASVABC.

It is reasonable to suppose, as a matter of common sense, that β3, the coefficient of SM, is positive. This assumption is strongly supported by the fact that its estimate in the multiple regression is positive and highly significant. The correlation between ASVABC and SM is positive, so the numerator of the bias term must be positive. The denominator is automatically positive, since it is a sum of squares and there is some variation in ASVABC. Hence the bias should be positive; the expression below makes the reasoning explicit.
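For reference, the "bias term" used in this argument is the standard omitted-variable-bias result for this setup: if the true model is Y = β1 + β2X2 + β3X3 + u but we regress Y on X2 alone, the expectation of the fitted slope b2 is

    E(b2) = β2 + β3 · Σ(X2i − X̄2)(X3i − X̄3) / Σ(X2i − X̄2)²

The second term is the bias: β3 multiplied by the slope from a regression of the omitted X3 on the included X2. Its sign is therefore the sign of β3 times the sign of the sample covariance between X2 and X3.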
2. Modified model: simple regression omitting SM.

. regress S ASVABC

      Source |       SS       df       MS              Number of obs =     500
-------------+------------------------------           F(  1,   498) =  182.56
       Model |  1006.99534     1  1006.99534           Prob > F      =  0.0000
    Residual |  2747.02666   498   5.5161178           R-squared     =  0.2682
-------------+------------------------------           Adj R-squared =  0.2668
       Total |    3754.022   499  7.52309018           Root MSE      =  2.3486

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      ASVABC |   1.580901   .1170059    13.51   0.000     1.351015    1.810787
       _cons |   14.43677   .1097335   131.56   0.000     14.22117    14.65237
------------------------------------------------------------------------------

As you can see, the coefficient of ASVABC is indeed higher when SM is omitted (1.58 against 1.38). Part of the difference may be due to pure chance, but part is attributable to the bias.

3. Modified model: simple regression omitting ASVABC.

. regress S SM

      Source |       SS       df       MS              Number of obs =     500
-------------+------------------------------           F(  1,   498) =   68.44
       Model |  453.551645     1  453.551645           Prob > F      =  0.0000
    Residual |  3300.47036   498  6.62745051           R-squared     =  0.1208
-------------+------------------------------           Adj R-squared =  0.1191
       Total |    3754.022   499  7.52309018           Root MSE      =  2.5744

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          SM |   .3598719   .0435019     8.27   0.000     .2744021    .4453417
       _cons |   9.992614   .6002469    16.65   0.000     8.813286    11.17194
------------------------------------------------------------------------------

Here is the regression omitting ASVABC instead of SM. We would expect the coefficient of SM to be upward biased: we anticipate that β2, the coefficient of ASVABC, is positive, and we know that both the numerator and the denominator of the other factor in the bias expression are positive.

In this case the bias is quite dramatic: the coefficient of SM has nearly doubled (0.36 against 0.19). The reason for the bigger effect is that the variation in SM is much smaller than that in ASVABC, while b2 and b3 are similar in size, judging by their estimates. A quick numerical check of the bias mechanism is sketched below.
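The gap between the simple and multiple regression estimates can be checked exactly, because for OLS with one included regressor the decomposition b2(simple) = b2 + b3·d holds in the sample, where d is the slope from regressing the omitted variable on the included one. A minimal Stata sketch (the scalar names b2, b3 and d are ours; regress, scalar, _b[] and display are standard Stata):

. regress S ASVABC SM
. scalar b2 = _b[ASVABC]
. scalar b3 = _b[SM]
. regress SM ASVABC
. scalar d = _b[ASVABC]
. display b2 + b3*d    // reproduces, up to rounding, the 1.5809 slope of the simple regression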
Finally, we will investigate how R² behaves when a variable is omitted. Comparing the three regressions above: R² is 0.2982 in the multiple regression, 0.2682 in the simple regression of S on ASVABC, and 0.1208 in the simple regression of S on SM. Does this imply that ASVABC explains 27% of the variance in S and SM 12%?

No, because the multiple regression reveals that their joint explanatory power is 0.30, not 0.27 + 0.12 = 0.39. In the second regression, ASVABC is partly acting as a proxy for SM, and this inflates its apparent explanatory power. Similarly, in the third regression, SM is partly acting as a proxy for ASVABC, again inflating its apparent explanatory power.

EXAMPLE II

However, it is also possible for omitted variable bias to lead to a reduction in the apparent explanatory power of a variable. This will be demonstrated using a simple earnings function model, supposing the logarithm of hourly earnings to depend on S and EXP. First we generate log_EARNINGS in Stata:

. gen log_EARNINGS = log(EARNINGS)

. regress log_EARNINGS S EXP

      Source |       SS       df       MS              Number of obs =     500
-------------+------------------------------           F(  2,   497) =   40.12
       Model |  21.2104059     2  10.6052029           Prob > F      =  0.0000
    Residual |  131.388814   497  .264363811           R-squared     =  0.1390
-------------+------------------------------           Adj R-squared =  0.1355
       Total |   152.59922   499   .30581006           Root MSE      =  .51416

------------------------------------------------------------------------------
log_EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   .0916942   .0103338     8.87   0.000     .0713908    .1119976
         EXP |   .0405521    .009692     4.18   0.000     .0215098    .0595944
       _cons |   1.199799   .1980634     6.06   0.000     .8106537    1.588943
------------------------------------------------------------------------------

We test the correlation of the independent variables:

. cor EXP S
(obs=500)

             |      EXP        S
-------------+------------------
         EXP |   1.0000
           S |  -0.5836   1.0000

This time the explanatory variables are negatively correlated (r = −0.58): those who stayed longer in school have had less time to accumulate work experience.
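Before looking at the simple regressions, the direction of each bias can be predicted from the expression given in Example I, assuming, as the multiple regression suggests, that the true coefficients of S and EXP are both positive:

    bias in b_S when EXP is omitted  =  β_EXP × (slope from regressing EXP on S)  <  0
    bias in b_EXP when S is omitted  =  β_S × (slope from regressing S on EXP)    <  0

Both auxiliary slopes are negative because the correlation between S and EXP is negative, so both simple-regression coefficients should be biased downward.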
Here are the two simple regressions:

. regress log_EARNINGS S

      Source |       SS       df       MS              Number of obs =     500
-------------+------------------------------           F(  1,   498) =   60.71
       Model |  16.5822819     1  16.5822819           Prob > F      =  0.0000
    Residual |  136.016938   498  .273126381           R-squared     =  0.1087
-------------+------------------------------           Adj R-squared =  0.1069
       Total |   152.59922   499   .30581006           Root MSE      =  .52261

------------------------------------------------------------------------------
log_EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   .0664621   .0085297     7.79   0.000     .0497034    .0832207
       _cons |    1.83624   .1289384    14.24   0.000      1.58291    2.089571
------------------------------------------------------------------------------

. regress log_EARNINGS EXP

      Source |       SS       df       MS              Number of obs =     500
-------------+------------------------------           F(  1,   498) =    1.30
       Model |  .396095486     1  .396095486           Prob > F      =  0.2555
    Residual |  152.203124   498  .305628763           R-squared     =  0.0026
-------------+------------------------------           Adj R-squared =  0.0006
       Total |   152.59922   499   .30581006           Root MSE      =  .55284

------------------------------------------------------------------------------
log_EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         EXP |  -.0096339   .0084625    -1.14   0.255    -.0262605    .0069927
       _cons |   2.886352   .0598796    48.20   0.000     2.768704    3.003999
------------------------------------------------------------------------------

As can be seen, the coefficients of S and EXP are indeed lower in the simple regressions. In the third regression, the negative bias is sufficient to wipe out the positive effect of EXP: its coefficient is actually negative, though not significantly different from zero.

A comparison of R² for the three regressions shows that the sum of R² in the simple regressions (0.1087 + 0.0026 = 0.1113) is actually less than R² in the multiple regression (0.1390).

VARIABLE MISSPECIFICATION II: INCLUSION OF AN IRRELEVANT VARIABLE

Rewrite the true model adding X3 as an explanatory variable, with a coefficient of 0:

    Y = β1 + β2X2 + 0·X3 + u

Now the true model and the fitted model coincide. Hence b2 will be an unbiased estimator of β2, and b3 will be an unbiased estimator of 0.

. regress log_EARNINGS S EXP MALE

      Source |       SS       df       MS              Number of obs =     500
-------------+------------------------------           F(  3,   496) =   33.26
       Model |  25.5575266     3  8.51917554           Prob > F      =  0.0000
    Residual |  127.041693   496  .256132446           R-squared     =  0.1675
-------------+------------------------------           Adj R-squared =  0.1624
       Total |   152.59922   499   .30581006           Root MSE      =   .5061

------------------------------------------------------------------------------
log_EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |    .097249   .0102607     9.48   0.000     .0770893    .1174088
         EXP |   .0414485   .0095424     4.34   0.000     .0227001     .060197
        MALE |   .1885338   .0457636     4.12   0.000     .0986193    .2784483
       _cons |   1.017176   .1999318     5.09   0.000     .6243587    1.409994
------------------------------------------------------------------------------

The table shows the output from a logarithmic regression of hourly earnings on years of schooling, years of work experience, and a male dummy variable. Next, AGE is added as an explanatory variable.

. regress log_EARNINGS S EXP MALE AGE

      Source |       SS       df       MS              Number of obs =     500
-------------+------------------------------           F(  4,   495) =   24.94
       Model |  25.5961696     4  6.39904241           Prob > F      =  0.0000
    Residual |   127.00305   495  .256571818           R-squared     =  0.1677
-------------+------------------------------           Adj R-squared =  0.1610
       Total |   152.59922   499   .30581006           Root MSE      =  .50653

------------------------------------------------------------------------------
log_EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   .0985747   .0108227     9.11   0.000     .0773106    .1198389
         EXP |   .0437575   .0112521     3.89   0.000     .0216497    .0658653
        MALE |   .1895216   .0458735     4.13   0.000     .0993907    .2796525
         AGE |  -.0074013   .0190712    -0.39   0.698    -.0448718    .0300691
       _cons |   1.196229   .5028946     2.38   0.018     .2081574      2.1843
------------------------------------------------------------------------------

There is no particular reason to suppose that age is a relevant explanatory variable in this model, and its coefficient is small and insignificant.

. cor S EXP MALE AGE
(obs=500)

             |        S      EXP     MALE      AGE
-------------+------------------------------------
           S |   1.0000
         EXP |  -0.5836   1.0000
        MALE |  -0.1453   0.0664   1.0000
         AGE |  -0.0362   0.4492   0.0400   1.0000

The correlations of AGE with S, EXP, and MALE are -0.04, 0.45, and 0.04, respectively. Its inclusion does not cause the coefficients of those variables to be biased, and they are little changed. The effect on the standard errors of the coefficients of S and MALE is likewise negligible, as would be expected, given their very low correlations with AGE. However, the correlation of EXP with AGE is large enough to cause a substantial increase in its standard error (from 0.0095 to 0.0113), reflecting a loss of efficiency. Both point estimates of the coefficient of EXP are unbiased, but the one in the regression without AGE will tend to be closer to the true value.
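One way to quantify this kind of efficiency loss is with variance inflation factors, which report 1/(1 − R²_j) for each explanatory variable j, where R²_j comes from regressing that variable on the other explanatory variables. A minimal Stata sketch using the built-in post-estimation command estat vif:

. regress log_EARNINGS S EXP MALE AGE
. estat vif    // a regressor highly correlated with the others gets a large VIF
               // and hence an inflated standard error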
VARIABLE MISSPECIFICATION III: CONSEQUENCES FOR DIAGNOSTICS

. regress log_EARNINGS S EXP HEIGHT

      Source |       SS       df       MS              Number of obs =     500
-------------+------------------------------           F(  3,   496) =   28.68
       Model |  22.5581024     3  7.51936748           Prob > F      =  0.0000
    Residual |  130.041117   496  .262179672           R-squared     =  0.1478
-------------+------------------------------           Adj R-squared =  0.1427
       Total |   152.59922   499   .30581006           Root MSE      =  .51203

------------------------------------------------------------------------------
log_EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   .0933581   .0103172     9.05   0.000     .0730873    .1136289
         EXP |   .0409265   .0096533     4.24   0.000     .0219602    .0598928
      HEIGHT |   .0128517   .0056685     2.27   0.024     .0017146    .0239889
       _cons |   .3008412   .4428508     0.68   0.497    -.5692536    1.170936
------------------------------------------------------------------------------

Here is a regression of the logarithm of hourly earnings on years of schooling, years of work experience, and height in inches. The height coefficient implies that an extra inch leads to a 1.29% increase in earnings. Can you really believe this?
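For reference, the 1.29% figure follows directly from the semi-log specification: a one-unit increase in HEIGHT multiplies earnings by e raised to the coefficient, so

    100 × (e^0.0128517 − 1) ≈ 1.29%

and for a coefficient this small, 100 × b itself is a close approximation. The implausibility of so large an effect of height, despite its apparent significance at the 5% level, is itself a diagnostic warning: it suggests that HEIGHT is acting as a proxy for something omitted from the model rather than having a genuine causal effect.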