
Chapter I

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE
Consequences of variable misspecification
In this sequence we will investigate the consequences of misspecifying the regression model in terms of its explanatory variables.
To keep the analysis simple, we will assume that there are only two possibilities. Either Y depends only
on X2, or it depends on both X2 and X3.
$$Y = \beta_1 + \beta_2 X_2 + u$$
or
$$Y = \beta_1 + \beta_2 X_2 + \beta_3 X_3 + u$$
If Y depends only on X2 and we fit a simple regression model, we will not encounter any problems, assuming of course that the regression model assumptions are valid. Likewise, there are no problems if Y depends on both X2 and X3 and we fit the multiple regression. The problems arise when Y depends on both variables but we fit the simple regression, omitting X3. In that case the coefficient of X2 is biased and, as a further consequence of the misspecification, the standard errors, t tests, and F test are invalid.
EXAMPLE I
1. First model: multiple regression with two independent variables, ASVABC and SM
. regress S ASVABC SM

      Source |       SS       df       MS              Number of obs =     500
-------------+------------------------------           F(  2,   497) =  105.58
       Model |   1119.3742     2  559.687101           Prob > F      =  0.0000
    Residual |   2634.6478   497  5.30110221           R-squared     =  0.2982
-------------+------------------------------           Adj R-squared =  0.2954
       Total |    3754.022   499  7.52309018           Root MSE      =  2.3024

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      ASVABC |   1.377521   .1229142    11.21   0.000     1.136026    1.619017
          SM |   .1919575   .0416913     4.60   0.000     .1100445    .2738705
       _cons |    11.8925   .5629644    21.12   0.000     10.78642    12.99859
------------------------------------------------------------------------------
We will illustrate the bias using an educational attainment model. To keep the analysis simple, we
will assume that in the true model S depends only on ASVABC and SM. The output above shows the
corresponding regression using EAWE Data Set 21.
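In the notation above, the assumed true model is
$$S = \beta_1 + \beta_2\,\mathit{ASVABC} + \beta_3\,\mathit{SM} + u$$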
We test the correlation between the independent variables:
. cor SM ASVABC
(obs=500)
             |       SM   ASVABC
-------------+------------------
          SM |   1.0000
      ASVABC |   0.3594   1.0000
The two independent variables are positively correlated (r = 0.36): respondents with better-educated mothers tend to have higher ASVABC scores. This correlation is what determines the direction of the bias when one variable is omitted.
Now we will run the regression a second time, omitting SM. Before we do this, we will try to predict the direction of the bias in the coefficient of ASVABC.
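The prediction rests on the standard omitted variable bias result: if the true model is Y = β1 + β2 X2 + β3 X3 + u, but X3 is omitted and Y is regressed on X2 alone, then
$$E(b_2) = \beta_2 + \beta_3\,\frac{\sum_{i=1}^{n}(X_{2i}-\bar X_2)(X_{3i}-\bar X_3)}{\sum_{i=1}^{n}(X_{2i}-\bar X_2)^2}$$
so the sign of the bias is the sign of β3 times the sign of the sample covariance between X2 and X3.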
Before omitting SM: it is reasonable to suppose, as a matter of common sense, that β3 is positive. This assumption is strongly supported by the fact that its estimate in the multiple regression is positive and highly significant.
The correlation between ASVABC and SM is positive, so the numerator of the bias term must be positive. The denominator is automatically positive, since it is a sum of squares and there is some variation in ASVABC. Hence the bias should be positive.
2. Modified model: simple regression omitting SM
. regress S ASVABC
      Source |       SS       df       MS              Number of obs =     500
-------------+------------------------------           F(  1,   498) =  182.56
       Model |  1006.99534     1  1006.99534           Prob > F      =  0.0000
    Residual |  2747.02666   498   5.5161178           R-squared     =  0.2682
-------------+------------------------------           Adj R-squared =  0.2668
       Total |    3754.022   499  7.52309018           Root MSE      =  2.3486

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      ASVABC |   1.580901   .1170059    13.51   0.000     1.351015    1.810787
       _cons |   14.43677   .1097335   131.56   0.000     14.22117    14.65237
------------------------------------------------------------------------------
As you can see, the coefficient of ASVABC is indeed higher when SM is omitted. Part of the
difference may be due to pure chance, but part is attributable to the bias.
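The mechanics can be checked with a small simulation. The following sketch is not part of the original notes: the data are artificial and the names X2, X3, Y, and u are purely illustrative. It generates a true model with two positively correlated regressors and shows that omitting one biases the coefficient of the other upward.

* Minimal Monte Carlo sketch with artificial data (illustrative only)
clear
set seed 12345
set obs 500
gen X2 = rnormal()              // first regressor
gen X3 = 0.5*X2 + rnormal()     // second regressor, positively correlated with X2
gen u  = rnormal()              // disturbance term
gen Y  = 10 + 2*X2 + X3 + u     // true model: beta2 = 2, beta3 = 1
regress Y X2 X3                 // correctly specified: b2 close to 2
regress Y X2                    // X3 omitted: b2 biased upward, toward 2.5

With β2 = 2, β3 = 1, and Cov(X2, X3)/Var(X2) = 0.5, the bias formula predicts E(b2) = 2.5 in the misspecified regression, and the simulated estimate should come out close to that value.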
3. Modified model: simple regression omitting ASVABC

. regress S SM
      Source |       SS       df       MS              Number of obs =     500
-------------+------------------------------           F(  1,   498) =   68.44
       Model |  453.551645     1  453.551645           Prob > F      =  0.0000
    Residual |  3300.47036   498  6.62745051           R-squared     =  0.1208
-------------+------------------------------           Adj R-squared =  0.1191
       Total |    3754.022   499  7.52309018           Root MSE      =  2.5744

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          SM |   .3598719   .0435019     8.27   0.000     .2744021    .4453417
       _cons |   9.992614   .6002469    16.65   0.000     8.813286    11.17194
------------------------------------------------------------------------------
Here is the regression omitting ASVABC instead of SM. We would expect b3, the coefficient of SM, to be upward biased. We anticipate that β2 is positive, and we know that both the numerator and the denominator of the other factor in the bias expression are positive.
In this case the bias is quite dramatic: the coefficient of SM has nearly doubled. The reason for the bigger effect is that the variation in SM is much smaller than that in ASVABC, while β2 and β3 are similar in size, judging by their estimates.
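The bias expression makes the size difference explicit. With ASVABC omitted, the expected value of the SM coefficient is
$$E(b_3) = \beta_3 + \beta_2\,\frac{\sum_{i=1}^{n}(X_{2i}-\bar X_2)(X_{3i}-\bar X_3)}{\sum_{i=1}^{n}(X_{3i}-\bar X_3)^2}$$
where X2 is ASVABC and X3 is SM. The denominator is now the (small) variation in SM, so the factor multiplying β2 is large, and the bias is correspondingly large.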
Finally, we will investigate how R² behaves when a variable is omitted. In the simple regression of S on ASVABC, R² is 0.27, and in the simple regression of S on SM it is 0.12. Does this imply that ASVABC explains 27% of the variance in S and SM 12%? No, because the sum of those figures, 0.39, exceeds the joint explanatory power of the two variables in the multiple regression, which is only 0.30.
In the second regression, ASVABC is partly acting as a proxy for SM, and this inflates its
apparent explanatory power. Similarly, in the third regression, SM is partly acting as a proxy
for ASVABC, again inflating its apparent explanatory power.
EXAMPLE II
. gen log_EARNINGS = log(EARNINGS)   (in Stata, we first generate log_EARNINGS)
However, it is also possible for omitted variable bias to lead to a reduction in the apparent
explanatory power of a variable. This will be demonstrated using a simple earnings function
model, supposing the logarithm of hourly earnings to depend on S and EXP.
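In equation form, the model is
$$\log(\mathit{EARNINGS}) = \beta_1 + \beta_2 S + \beta_3\,\mathit{EXP} + u$$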
. regress log_EARNINGS S EXP
      Source |       SS       df       MS              Number of obs =     500
-------------+------------------------------           F(  2,   497) =   40.12
       Model |  21.2104059     2  10.6052029           Prob > F      =  0.0000
    Residual |  131.388814   497  .264363811           R-squared     =  0.1390
-------------+------------------------------           Adj R-squared =  0.1355
       Total |   152.59922   499   .30581006           Root MSE      =  .51416

------------------------------------------------------------------------------
log_EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   .0916942   .0103338     8.87   0.000     .0713908    .1119976
         EXP |   .0405521    .009692     4.18   0.000     .0215098    .0595944
       _cons |   1.199799   .1980634     6.06   0.000     .8106537    1.588943
------------------------------------------------------------------------------
We test the correlation between the independent variables:
. cor EXP S
(obs=500)
             |      EXP        S
-------------+------------------
         EXP |   1.0000
           S |  -0.5836   1.0000
. regress log_EARNINGS S

      Source |       SS       df       MS              Number of obs =     500
-------------+------------------------------           F(  1,   498) =   60.71
       Model |  16.5822819     1  16.5822819           Prob > F      =  0.0000
    Residual |  136.016938   498  .273126381           R-squared     =  0.1087
-------------+------------------------------           Adj R-squared =  0.1069
       Total |   152.59922   499   .30581006           Root MSE      =  .52261

------------------------------------------------------------------------------
log_EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   .0664621   .0085297     7.79   0.000     .0497034    .0832207
       _cons |    1.83624   .1289384    14.24   0.000      1.58291    2.089571
------------------------------------------------------------------------------
. regress log_EARNINGS EXP
      Source |       SS       df       MS              Number of obs =     500
-------------+------------------------------           F(  1,   498) =    1.30
       Model |  .396095486     1  .396095486           Prob > F      =  0.2555
    Residual |  152.203124   498  .305628763           R-squared     =  0.0026
-------------+------------------------------           Adj R-squared =  0.0006
       Total |   152.59922   499   .30581006           Root MSE      =  .55284

------------------------------------------------------------------------------
log_EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         EXP |  -.0096339   .0084625    -1.14   0.255    -.0262605    .0069927
       _cons |   2.886352   .0598796    48.20   0.000     2.768704    3.003999
------------------------------------------------------------------------------
As can be seen, the coefficients of S and EXP are indeed lower in the simple regressions. In the
third regression, the negative bias is sufficient to wipe out the positive effect of EXP.
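The signs follow from the bias expression given earlier. In the regression of log earnings on S alone,
$$E(b_2) = \beta_2 + \beta_3\,\frac{\sum_{i=1}^{n}(S_i-\bar S)(\mathit{EXP}_i-\overline{\mathit{EXP}})}{\sum_{i=1}^{n}(S_i-\bar S)^2}$$
Here β3 is positive but the covariance between S and EXP is negative (r = -0.58), so the bias is negative. The same argument with the roles of S and EXP reversed shows that the coefficient of EXP in the third regression is also biased downward, in this case enough to make the estimate negative.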
A comparison of R² for the three regressions shows that the sum of R² in the simple regressions (0.1087 + 0.0026 = 0.1113) is actually less than R² in the multiple regression (0.1390).
VARIABLE MISSPECIFICATION II: INCLUSION OF AN IRRELEVANT VARIABLE
Suppose the true model is Y = β1 + β2 X2 + u, but we fit the model with X3 included as well. Rewrite the true model adding X3 as an explanatory variable with a coefficient of zero; now the true model and the fitted model coincide. Hence b2 will be an unbiased estimator of β2, and b3 will be an unbiased estimator of 0.
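Unbiasedness comes at a price in efficiency. In the two-regressor case, the standard result for the variance of b2 is
$$\sigma_{b_2}^2 = \frac{\sigma_u^2}{\sum_{i=1}^{n}(X_{2i}-\bar X_2)^2}\cdot\frac{1}{1-r_{X_2 X_3}^2}$$
where r_{X2X3} is the sample correlation between X2 and X3: the higher the correlation of the irrelevant variable with X2, the greater the loss of precision.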
. regress log_EARNINGS S EXP MALE
      Source |       SS       df       MS              Number of obs =     500
-------------+------------------------------           F(  3,   496) =   33.26
       Model |  25.5575266     3  8.51917554           Prob > F      =  0.0000
    Residual |  127.041693   496  .256132446           R-squared     =  0.1675
-------------+------------------------------           Adj R-squared =  0.1624
       Total |   152.59922   499   .30581006           Root MSE      =   .5061

------------------------------------------------------------------------------
log_EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |    .097249   .0102607     9.48   0.000     .0770893    .1174088
         EXP |   .0414485   .0095424     4.34   0.000     .0227001     .060197
        MALE |   .1885338   .0457636     4.12   0.000     .0986193    .2784483
       _cons |   1.017176   .1999318     5.09   0.000     .6243587    1.409994
------------------------------------------------------------------------------
The table shows the output from a logarithmic regression of hourly earnings on years of schooling,
years of work experience, and a male dummy variable, using STATA.
After this, AGE is added as an explanatory variable.
. regress log_EARNINGS S EXP MALE AGE

      Source |       SS       df       MS              Number of obs =     500
-------------+------------------------------           F(  4,   495) =   24.94
       Model |  25.5961696     4  6.39904241           Prob > F      =  0.0000
    Residual |   127.00305   495  .256571818           R-squared     =  0.1677
-------------+------------------------------           Adj R-squared =  0.1610
       Total |   152.59922   499   .30581006           Root MSE      =  .50653

------------------------------------------------------------------------------
log_EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   .0985747   .0108227     9.11   0.000     .0773106    .1198389
         EXP |   .0437575   .0112521     3.89   0.000     .0216497    .0658653
        MALE |   .1895216   .0458735     4.13   0.000     .0993907    .2796525
         AGE |  -.0074013   .0190712    -0.39   0.698    -.0448718    .0300691
       _cons |   1.196229   .5028946     2.38   0.018     .2081574      2.1843
------------------------------------------------------------------------------
There is no particular reason to suppose that AGE is a relevant explanatory variable, and indeed its coefficient is small and insignificant.
. cor S EXP MALE AGE
(obs=500)
             |        S      EXP     MALE      AGE
-------------+------------------------------------
           S |   1.0000
         EXP |  -0.5836   1.0000
        MALE |  -0.1453   0.0664   1.0000
         AGE |  -0.0362   0.4492   0.0400   1.0000
Its correlations with S, EXP, and MALE are –0.04, 0.45, and 0.04, respectively.
Its inclusion does not cause the coefficients of those variables to be biased, and indeed they are little changed. The effects on the standard errors of the coefficients of S and MALE are likewise negligible, as would be expected given their very low correlations with AGE.
However, the correlation of EXP with AGE is large enough to cause a substantial increase in its
standard error, reflecting a loss of efficiency. Both point estimates of the coefficient of EXP will be
unbiased, but that in the first regression will tend to be closer to the true value.
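As a rough back-of-envelope check, we can use the square root of the two-regressor variance factor shown earlier as an approximation (the exact factor involves the multiple correlation of EXP with all of the other regressors):
$$\frac{1}{\sqrt{1-0.4492^2}} \approx 1.12, \qquad \frac{0.0112521}{0.0095424} \approx 1.18$$
so the observed increase in the standard error of EXP is of the same order as that implied by its correlation with AGE alone.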
VARIABLE MISSPECIFICATION III: CONSEQUENCES FOR DIAGNOSTICS
. regress log_EARNINGS S EXP HEIGHT
      Source |       SS       df       MS              Number of obs =     500
-------------+------------------------------           F(  3,   496) =   28.68
       Model |  22.5581024     3  7.51936748           Prob > F      =  0.0000
    Residual |  130.041117   496  .262179672           R-squared     =  0.1478
-------------+------------------------------           Adj R-squared =  0.1427
       Total |   152.59922   499   .30581006           Root MSE      =  .51203

------------------------------------------------------------------------------
log_EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   .0933581   .0103172     9.05   0.000     .0730873    .1136289
         EXP |   .0409265   .0096533     4.24   0.000     .0219602    .0598928
      HEIGHT |   .0128517   .0056685     2.27   0.024     .0017146    .0239889
       _cons |   .3008412   .4428508     0.68   0.497    -.5692536    1.170936
------------------------------------------------------------------------------
Here is a regression of the logarithm of hourly earnings on years of schooling, years of experience, and height in inches. The height coefficient implies that an extra inch leads to a 1.29% increase in earnings. Can you really believe this?
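For reference, the 1.29% figure is the usual semi-logarithmic interpretation of the HEIGHT coefficient:
$$100\left(e^{0.0128517}-1\right) \approx 1.29\%$$
For small coefficients this is approximately 100 times the coefficient itself.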