
2789960 2784678 case1 QRM3

Monday February 27, 2023
Quantitative Research Methods III
Economics and Finance
Case 1
Explaining wages
Group 11
Babette Hensen, 2789960, 2789960@student.vu.nl
Sebastiaan Jonker, 2784678, 2784678@student.vu.nl
Part I: Explaining wages
1A.
We created a table with summary statistics of the variables ID, wage, age, ttlexp, tenure,
collgrad, married, smsa, south, union, industry and lnwage.
Table 1
               (1)     (2)      (3)      (4)       (5)
VARIABLES      N       mean     sd       min       max
ID             621     311      179.4    1         621
wage           621     7.627    5.673    1.005     40.75
age            621     39.11    3.108    34        46
ttlexp         621     12.12    4.613    0.885     28.88
tenure         616     5.828    5.475    0         24.75
collgrad       621     0.256    0.437    0         1
married        621     0.638    0.481    0         1
smsa           621     0.670    0.471    0         1
south          621     0.425    0.495    0         1
union          522     0.238    0.426    0         1
industry       618     8.149    3.017    1         12
lnwage         621     1.842    0.594    0.00494   3.707
Three variables have missing values (out of 621 observations):
- tenure: 5 missing values, 0.81% of the observations
- industry: 3 missing values, 0.48% of the observations
- union: 99 missing values, 15.94% of the observations
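The missing-value counts and shares can be produced in one pass. The loop below is a minimal sketch that assumes the case data are already loaded in Stata with the variable names used in this report.

```stata
* Sketch: number and share of missing values per variable
foreach v of varlist wage age ttlexp tenure collgrad married smsa south union industry {
    quietly count if missing(`v')
    display "`v': " r(N) " missing (" %5.2f 100*r(N)/_N "% of observations)"
}
```

Here `_N` is the number of observations in memory, so the shares are relative to all 621 rows.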
1B.
We created a scatterplot of wage on the y-axis and ttlexp on the x-axis.
Scatterplot - wage
Then we estimated a univariate linear regression model (labeled Model 1) with wage as the
dependent variable and ttlexp as the independent variable.
Table 2
                 Model 1
VARIABLES        wage
ttlexp           0.283***
                 (0.0481)
Constant         4.194***
                 (0.624)
Observations     621
R-squared        0.053
F-stat           34.67
Prob > F         6.41e-09
Df               619
Standard errors in parentheses
*** p<0.01, ** p<0.05, * p<0.1
1C.
The interpretation of the estimated ttlexp coefficient is: if total work experience goes up by
1 year, the hourly wage goes up by 0.2832524 dollars (about $0.28) on average.
1D.
To compute the standard error of the ttlexp coefficient by hand, we subtract the mean 12.1199
from each ttlexp observation, square the deviations and sum them, which gives 13190.71.
Multiplying this by 1/621 gives 21.24107. Dividing the standard deviation of ttlexp, 4.612519,
by (621*21.24107) gives a standard error of 0.00035. This differs from the standard error in the
regression output (Table 2), which is 0.0481066. The gap comes from the hand calculation: the
numerator should be the standard error of the regression (the residual standard deviation)
rather than the standard deviation of ttlexp, and the denominator should be the square root of
the sum of squared deviations.
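The regression's standard error can be reproduced from the reported statistics. The sketch below assumes the homoskedasticity-only formula SE(b1) = RMSE / sqrt(sum of squared deviations of ttlexp), which is what Stata's default `regress` output uses. The inputs are the rounded values from Table 1 and Model 1, so the result is approximate (about 0.048). The scalar names are illustrative.

```stata
* Reported inputs (rounded values from Table 1 and Model 1)
scalar n    = 621
scalar sd_x = 4.612519          // standard deviation of ttlexp
scalar sd_y = 5.673             // standard deviation of wage
scalar r2   = 0.053             // R-squared of Model 1

scalar sxx  = (n - 1)*sd_x^2            // sum of squared deviations of ttlexp
scalar ssr  = (1 - r2)*(n - 1)*sd_y^2   // residual sum of squares
scalar rmse = sqrt(ssr/(n - 2))         // standard error of the regression
display "Sxx    = " sxx
display "RMSE   = " rmse
display "SE(b1) = " rmse/sqrt(sxx)

* Same calculation from the data, assuming the case data are in memory
quietly regress wage ttlexp
quietly summarize ttlexp if e(sample)
display "SE(b1) = " e(rmse)/sqrt((r(N) - 1)*r(Var))
```

If the course formula is the heteroskedasticity-robust one, the result will differ from Stata's default standard error.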
Then we perform a linear regression with wage as the dependent variable and ttlexp as the
independent variable to test whether ttlexp has a statistically significant effect on wage. We
get a t-value of 5.89 and a p-value of 0.000. The p-value is below the 5% significance level,
so we reject the null hypothesis H0: β1 = 0. Total work experience therefore has a significant
effect on the wage a person earns.
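The t-value follows directly from the coefficient and its standard error. The sketch below uses the reported Model 1 estimates and the 619 residual degrees of freedom; the scalar names are illustrative.

```stata
* t-test of H0: beta1 = 0 from the reported Model 1 estimates
scalar b1  = 0.2832524
scalar se1 = 0.0481066
scalar t   = b1/se1
display "t = " %5.2f t
display "two-sided p-value = " 2*ttail(619, abs(t))
display "5% critical value = " invttail(619, 0.025)
```

The null hypothesis is rejected when the absolute t-value exceeds the critical value, which is the same decision as comparing the p-value with 0.05.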
1E.
We estimated a linear regression model (Model 2) with lnwage as dependent variable and
ttlexp as independent variable.
                 Model 2
VARIABLES        lnwage
ttlexp           0.0427***
                 (0.00488)
Constant         1.325***
                 (0.0633)
Observations     621
R-squared        0.110
F-stat           76.36
Prob > F         0
Df               619
Standard errors in parentheses
*** p<0.01, ** p<0.05, * p<0.1
Then we created two histograms: one for the residuals of Model 1 and one for the residuals of
Model 2.
Histogram 1 - wage
Histogram 2 - lnwage
The black line is a normal density plot.
Assumption 6 is “Ui is normally distributed.”
So, in the end we prefer Model 2. As the histograms above show, the residuals of Model 2 are
approximately normally distributed, whereas the residuals of Model 1 are not.
The formula of the residuals for Model 2 is:
residuali = ln(Yi) - (β̂0 + β̂1ttlexpi), where Yi is the wage and β̂0 + β̂1ttlexpi is the fitted value of ln(Yi)
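The visual comparison can be complemented with a formal check. The sketch below builds the Model 2 residuals by hand and applies Stata's skewness and kurtosis test for normality. It assumes the case data are in memory; `fitted_lnwage` and `resid_manual` are hypothetical variable names, and the test is an addition that is not part of the original analysis.

```stata
* Residuals of Model 2 by hand: observed lnwage minus fitted lnwage
regress lnwage ttlexp
predict double fitted_lnwage, xb
generate double resid_manual = lnwage - fitted_lnwage

* Skewness and kurtosis test; H0: the residuals are normally distributed
sktest resid_manual
```

A small p-value would point to non-normal residuals; in a sample of this size even mild departures can be flagged, so the histogram remains useful.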
1F.
We estimated a linear regression model (Model 3) with lnwage as dependent variable and
age, ttlexp, tenure, south, smsa, and union as independent variables.
                 Model 2      Model 3
VARIABLES        lnwage       lnwage
age                           -0.0148**
                              (0.00642)
ttlexp           0.0427***    0.0317***
                 (0.00488)    (0.00559)
tenure                        0.0160***
                              (0.00459)
south                         -0.149***
                              (0.0411)
smsa                          0.273***
                              (0.0426)
union                         0.147***
                              (0.0472)
Constant         1.325***     1.802***
                 (0.0633)     (0.256)
Observations     621          519
R-squared        0.110        0.281
F-stat           76.36        33.32
Prob > F         0            0
Df               619          512
Standard errors in parentheses
*** p<0.01, ** p<0.05, * p<0.1
Then we investigated whether the model suffers from multicollinearity by using the
Variance Inflation Factor.
Variable     VIF      1/VIF
tenure       1.67     0.598080
ttlexp       1.62     0.617499
south        1.06     0.941504
union        1.04     0.962828
smsa         1.02     0.976410
age          1.01     0.985875
Mean VIF     1.24
The VIF measures how much the variance of an estimated coefficient is inflated by the
correlation of that independent variable with the other independent variables.
First of all, the model does not include a complete set of category dummies together with the
constant, so the dummy variable trap does not arise.
The VIF of every variable lies between 1 and 5. Following the rules of thumb for
multicollinearity, this indicates at most a moderate correlation between each independent
variable and the other independent variables in the model. This correlation is not severe
enough to require attention, so the model does not suffer from problematic multicollinearity.
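Each VIF equals 1/(1 - R²) from an auxiliary regression of that variable on the other regressors. The sketch below reproduces the VIF of tenure this way. It assumes the case data are in memory and restricts the auxiliary regression to the Model 3 estimation sample.

```stata
* Model 3, then the auxiliary regression for tenure on the same sample
regress lnwage age ttlexp tenure south smsa union
regress tenure age ttlexp south smsa union if e(sample)
display "auxiliary R-squared = " e(r2)
display "VIF(tenure)         = " 1/(1 - e(r2))
```

The reported 1/VIF of 0.598080 for tenure implies an auxiliary R² of about 0.40.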
1G.
We created industry dummies and estimated a new model (Model 4).
                 Model 2      Model 3      Model 4
VARIABLES        lnwage       lnwage       lnwage
age                           -0.0148**    -0.0149**
                              (0.00642)    (0.00643)
ttlexp           0.0427***    0.0317***    0.0317***
                 (0.00488)    (0.00559)    (0.00560)
tenure                        0.0160***    0.0159***
                              (0.00459)    (0.00460)
south                         -0.149***    -0.150***
                              (0.0411)     (0.0411)
smsa                          0.273***     0.275***
                              (0.0426)     (0.0427)
union                         0.147***     0.146***
                              (0.0472)     (0.0473)
dummyindustry                              -0.0230
                                           (0.0469)
Constant         1.325***     1.802***     1.821***
                 (0.0633)     (0.256)      (0.260)
Observations     621          519          519
R-squared        0.110        0.281        0.281
F-stat           76.36        33.32        28.56
Prob > F         0            0            0
Df               619          512          511
Standard errors in parentheses
*** p<0.01, ** p<0.05, * p<0.1
The exact theoretical model for Model 4 is:
lnwagei = β0 + β1agei + β2ttlexpi + β3tenurei + β4southi + β5smsai + β6unioni +
β7dummyindustryi + εi
The coefficient of dummyindustry is -0.0230. This means that when the industry dummy is 1,
the wage is approximately 2.3% lower (-0.0230 x 100%) than in the reference category
(industry dummy is 0), holding all other variables constant. The coefficient is not
statistically significant at the 10% level.
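For a dummy variable in a log-wage model, multiplying the coefficient by 100 is an approximation; the exact relative change is 100 x (exp(b) - 1). The sketch below compares the two for the reported dummyindustry coefficient.

```stata
* Approximate versus exact percentage effect of a dummy in a log-wage model
scalar b = -0.0230
display "approximate % change: " %6.2f 100*b
display "exact % change:       " %6.2f 100*(exp(b) - 1)
```

For a coefficient this small the two figures are nearly identical (about -2.3%); the difference grows for larger coefficients such as that of smsa.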
1H.
If we perform a linear regression with lnwage as the dependent variable and industry as the
independent variable, we get a t-value of 2.80 and a p-value of 0.005. The p-value is below
the 5% significance level, so we reject the null hypothesis H0: β1 = 0. The industry someone
is employed in therefore has a significant effect on the wage a person earns.
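The regression above treats the industry code as a single numeric variable. As an alternative that is not part of the original analysis, industry can be entered as a set of category dummies and tested jointly with an F-test. The sketch assumes the case data are in memory.

```stata
* Alternative sketch: industry as a categorical variable, tested jointly
regress lnwage i.industry
testparm i.industry
```

The `i.` prefix lets Stata create the dummies and omit a base category automatically, and `testparm` tests whether all industry coefficients are zero at once.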
1I.
We estimated a new model (Model 5).
                 Model 5
VARIABLES        lnwage
age              -0.0111*
                 (0.00631)
ttlexp           0.00862
                 (0.0201)
tenure           0.0101**
                 (0.00464)
south            -0.131***
                 (0.0402)
smsa             0.260***
                 (0.0422)
union            0.124**
                 (0.0482)
tindustry1       -0.0231
                 (0.0265)
tindustry3       0.0259
                 (0.0229)
tindustry4       0.0253
                 (0.0201)
tindustry5       0.0312
                 (0.0209)
tindustry6       0.0110
                 (0.0201)
tindustry7       0.0343*
                 (0.0203)
tindustry8       0.0259
                 (0.0221)
tindustry9       -0.00743
                 (0.0212)
tindustry10      0.0353
                 (0.0268)
tindustry11      0.0270
                 (0.0199)
tindustry12      0.0361*
                 (0.0206)
Constant         1.682***
                 (0.252)
Observations     519
R-squared        0.338
F-stat           15.07
Prob > F         0
Df               501
Standard errors in parentheses
*** p<0.01, ** p<0.05, * p<0.1
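To calculate the impact ttlexp has on ln(wage) per industry, we first generate a dummy
variable per industry using “generate industryX = industry==X” for each industry. The
dummy equals 1 when the observation belongs to the respective industry and 0 when it does
not. We then generate a slope dummy using the “generate tindustryX = ttlexp* industryX” command.
As the reference category we take industry 2, although the data contain no person in industry
2. The coefficient of ttlexp is then the slope for the reference category, and each tindustryX
coefficient is the difference between the slope in industry X and that reference slope.
Because no observation is in industry 2, the reference slope is likely to be poorly
identified, which would be consistent with the large standard error of the ttlexp coefficient
(0.0201).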
The theoretical model for Model 5 is:
lnwagei = β0 + β1ttlexpi + β2industry1ittlexpi + β3industry3ittlexpi + β4industry4ittlexpi +
β5industry5ittlexpi + β6industry6ittlexpi + β7industry7ittlexpi + β8industry8ittlexpi +
β9industry9ittlexpi + β10industry10ittlexpi + β11industry11ittlexpi + β12industry12ittlexpi +
γ1agei + γ2tenurei + γ3southi + γ4smsai + γ5unioni + εi
The marginal effect of ttlexp for industry 11 can be expressed in isolation (leaving out the
other terms) as:
lnwagei = β0 + β1ttlexpi + β11industry11ittlexpi
lnwagei = β0 + ttlexpi (β1 + β11industry11i)
This means that β11, estimated at 0.0270, is the additional effect ttlexp has on lnwage in
industry 11 relative to the reference category, holding all other variables constant. The
total estimated effect of one extra year of experience in industry 11 is β1 + β11 = 0.00862 +
0.0270 = 0.0356.
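The total slope for industry 11, with a standard error and confidence interval, can be obtained with `lincom` after estimating Model 5. The sketch assumes the case data are in memory and the tindustryX slope dummies have been generated as described.

```stata
* Model 5, then the total experience slope in industry 11 (beta1 + beta11)
regress lnwage age ttlexp tenure south smsa union tindustry1 tindustry3 tindustry4 ///
    tindustry5 tindustry6 tindustry7 tindustry8 tindustry9 tindustry10 tindustry11 tindustry12
lincom ttlexp + tindustry11
```

Unlike adding the two coefficients by hand, `lincom` also accounts for the covariance between the two estimates when it computes the standard error.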
1J.
We estimated a new model (Model 6).
                 Model 6
VARIABLES        lnwage
age              -0.0149**
                 (0.00643)
ttlexp           0.0317***
                 (0.00560)
tenure           0.0159***
                 (0.00461)
south            -0.144**
                 (0.0705)
smsa             0.279***
                 (0.0597)
south_smsa       -0.00899
                 (0.0862)
union            0.146***
                 (0.0475)
dummyindustry    -0.0224
                 (0.0473)
Constant         1.818***
                 (0.262)
Observations     519
R-squared        0.281
F-stat           24.94
Prob > F         0
Df               510
Standard errors in parentheses
*** p<0.01, ** p<0.05, * p<0.1
The coefficient of south is -0.144. Because the model includes the interaction south_smsa,
this means that a respondent who lives in the south but not in a Standard Metropolitan
Statistical Area (SMSA) earns a wage that is approximately 14.4% lower (-0.144 x 100%) than
a respondent who lives neither in the south nor in an SMSA (the reference category), holding
all other variables constant.
The coefficient of smsa is 0.279. This means that a respondent who lives in an SMSA outside
the south earns a wage that is approximately 27.9% higher (0.279 x 100%) than the reference
category, holding all other variables constant.
The coefficient of south_smsa is -0.00899. This means that for a respondent who lives in the
south AND in an SMSA, the effect on lnwage is the sum of the south and smsa coefficients
plus an additional effect of -0.00899: -0.144 + 0.279 - 0.00899 = 0.126, so approximately
12.6% more than the reference category, holding all other variables constant. The interaction
coefficient itself is not statistically significant.
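The combined effect of living in the south and in an SMSA can be estimated with its standard error after Model 6. The sketch assumes the case data are in memory and that south_smsa and dummyindustry have already been generated.

```stata
* Model 6, then the combined effect of south = 1 and smsa = 1
regress lnwage age ttlexp tenure south smsa south_smsa union dummyindustry
lincom south + smsa + south_smsa

* Exact percentage difference relative to the reference category
display "exact % difference: " %6.2f 100*(exp(r(estimate)) - 1)
```

`lincom` stores its point estimate in `r(estimate)`, which the last line converts from a log difference into an exact percentage.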
1K.
We use the linear regression with lnwage as the dependent variable and age, ttlexp, tenure,
south, smsa, south_smsa, union and dummyindustry as the independent variables (Model 6).
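The null hypothesis is H0: σi² = σ² for all i (homoskedasticity).
The alternative hypothesis is H1: σi² ≠ σ² for at least one i (heteroskedasticity).
We compute the White test statistic, which is nR². Strictly, this R² should come from the
auxiliary regression of the squared residuals on the regressors; here we used the R² of
Model 6:
nR² = 519 x 0.2812 = 145.9428
k-1 = 8
The critical value of χ²(8) at the 5% level is 15.507.
Since 145.9428 > 15.507, we reject the null hypothesis of homoskedasticity at the 5% level.
Because the statistic is based on the R² of Model 6 rather than of the auxiliary regression,
this conclusion should be confirmed with the auxiliary regression.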
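The White statistic is normally built from an auxiliary regression of the squared residuals. The sketch below shows a levels-only version with 8 degrees of freedom, followed by Stata's built-in White test. It assumes the case data are in memory with south_smsa and dummyindustry generated; `uhat` and `uhat2` are hypothetical variable names.

```stata
* Model 6 and its squared residuals
regress lnwage age ttlexp tenure south smsa south_smsa union dummyindustry
predict double uhat, residuals
generate double uhat2 = uhat^2

* Auxiliary regression (levels only): n*R-squared is chi2(8) under H0
regress uhat2 age ttlexp tenure south smsa south_smsa union dummyindustry
display "nR2               = " e(N)*e(r2)
display "5% critical value = " invchi2tail(8, 0.05)
display "p-value           = " chi2tail(8, e(N)*e(r2))

* Built-in White test (adds squares and cross-products, so more degrees of freedom)
quietly regress lnwage age ttlexp tenure south smsa south_smsa union dummyindustry
estat imtest, white
```

The two versions can give different statistics because the built-in test includes the squares and cross-products of the regressors.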
Part II: Explaining union workers
1L.
We estimated a linear regression model with union as dependent variable and age, married,
collgrad, ttlexp and south as independent variables.
Table 4
                 OLS
VARIABLES        union
age              0.00293
                 (0.00602)
married          -0.0536
                 (0.0386)
collgrad         0.134***
                 (0.0423)
ttlexp           0.00178
                 (0.00415)
south            -0.100***
                 (0.0371)
Constant         0.143
                 (0.243)
Observations     522
R-squared        0.041
F-stat           4.420
Prob > F         0.000597
Df               516
Standard errors in parentheses
*** p<0.01, ** p<0.05, * p<0.1
In this linear probability model, the coefficient of age is 0.00293. This means that when the
age of the respondent goes up by 1 year, the probability of being a union worker goes up by
0.00293, i.e. about 0.29 percentage points, holding all other variables constant. The
coefficient is not statistically significant.
1M.
We estimated a logit model with union as the binary variable and age, married, collgrad,
ttlexp and south as independent variables.
The exact theoretical model equation of the logit model is:
Zi = β0 + β1agei + β2marriedi + β3collgradi + β4ttlexpi + β5southi
P(unioni = 1 | Xi) = F(Zi) = exp(Zi) / (exp(Zi) + 1)
Because F is non-linear in Zi, the estimated south coefficient only shows the direction of the
effect. Its size in terms of probability cannot be interpreted without deriving the marginal
effect.
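For the logit model the marginal effect of a regressor is its coefficient times F(Zi)(1 - F(Zi)), so it differs per observation. The sketch below computes it by hand for south, treating south as continuous. It assumes the case data are in memory; `p_union` and `me_south` are hypothetical variable names.

```stata
* Logit model, predicted probabilities, and observation-level marginal effects
logit union age married collgrad ttlexp south
predict double p_union if e(sample), pr
generate double me_south = _b[south]*p_union*(1 - p_union)

* The mean of me_south is the average marginal effect of south
summarize me_south
```

Because south enters the model as a plain variable rather than a factor variable, this mean should match what `margins, dydx(south)` reports for the same model.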
1N.
Then we computed the average marginal effects implied by the estimated logit model of all
independent variables.
The marginal effect of the variable south for an average person means that for a one-unit
increase (going from 0 to 1, so from not living in the south to living in the south), the
probability of being a union worker decreases by 0.102, i.e. about 10.2 percentage points,
keeping all other variables constant.
Then we compare the estimated south coefficient of question 1L (-0.100) to the estimated
marginal effect of south at the mean (-0.103). The difference is small. The average marginal
effect is the effect of south averaged over all data points, keeping all other variables
constant. Since the logit function is an S-shaped curve, the effect (slope) of south is smaller
near probabilities of 0 and 1 and larger near the middle of the curve. Averaged over the
sample, the slope of the S-shaped curve is therefore similar to the slope of the straight line
fitted by OLS, so the OLS coefficient approximates the average marginal effect.
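Since south is a 0/1 variable, an alternative is to declare it as a factor variable so that `margins` reports the discrete change in probability from 0 to 1 instead of a derivative. This is a sketch of that variation, not the specification used in the report, and assumes the case data are in memory.

```stata
* Logit with south as a factor variable: margins reports the 0 -> 1 change
logit union age married collgrad ttlexp i.south
margins, dydx(south)            // averaged over the sample
margins, dydx(south) atmeans    // evaluated at the means of the covariates
```

The two calls correspond to the average marginal effect and the marginal effect at the mean discussed above.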
Do file:
clear
cd "C:\Users\sfdjo\OneDrive\Documenten\QRM\Case I"
import excel "C:\Users\sfdjo\OneDrive\Documenten\QRM\Case I\data_Case_I_group_11.xls", sheet("Data") firstrow
**1a
count if missing(wage)
count if missing(age)
count if missing(ttlexp)
count if missing(tenure)
count if missing(collgrad)
count if missing(married)
count if missing(smsa)
count if missing(south)
count if missing(union)
count if missing(industry)
** 1a (continued): log wage and summary statistics
gen lnwage = ln(wage)
sum ID wage age ttlexp tenure collgrad married smsa south union industry lnwage
outreg2 using TableI, word title(Table 1) replace sum(log)
** 1b: scatterplot and Model 1
twoway (scatter wage ttlexp), graphregion(color(white)) bgcolor(white)
reg wage ttlexp
outreg2 using Table2, word addstat("F-stat", e(F), "Prob > F", e(p), "Df", e(df_r)) title(Table 2) replace
**1e: Model 2 and residual histograms for Model 1 and Model 2
reg lnwage ttlexp
outreg2 using table3, word addstat("F-stat", e(F), "Prob > F", e(p), "Df", e(df_r)) replace
reg wage ttlexp
predict resid_wage, residuals
histogram resid_wage, title("Residuals Model 1") graphregion(color(white)) bgcolor(white) ///
    normal normopts(lcolor(black) lwidth(thick)) kdensity kdenopts(lcolor(red) lwidth(thick))
graph export "C:\Users\sfdjo\OneDrive\Documenten\QRM\Case I\Hist1.png", as(png) name("Graph") replace
reg lnwage ttlexp
predict resid_lnwage, residuals
histogram resid_lnwage, title("Residuals Model 2") graphregion(color(white)) bgcolor(white) ///
    normal normopts(lcolor(black) lwidth(thick)) kdensity kdenopts(lcolor(red) lwidth(thick))
graph export "C:\Users\sfdjo\OneDrive\Documenten\QRM\Case I\Hist2.png", as(png) name("Graph") replace
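**1f: Model 3 and variance inflation factors
reg lnwage age ttlexp tenure south smsa union
outreg2 using table3, word addstat("F-stat", e(F), "Prob > F", e(p), "Df", e(df_r)) append
estat vif
**1g: Model 4 with an industry dummy
* Note: Stata treats missing values as larger than any number, so observations
* with a missing industry get dummyindustry = 1 here.
gen dummyindustry = (industry>=6)
reg lnwage age ttlexp tenure south smsa union dummyindustry
outreg2 using table3, word addstat("F-stat", e(F), "Prob > F", e(p), "Df", e(df_r)) append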
**1i: industry dummies and slope dummies (industry 2 is the reference category)
foreach k of numlist 1 3/12 {
    generate industry`k' = industry==`k'
    generate tindustry`k' = ttlexp*industry`k'
}
reg lnwage age ttlexp tenure south smsa union tindustry1 tindustry3 tindustry4 tindustry5 ///
    tindustry6 tindustry7 tindustry8 tindustry9 tindustry10 tindustry11 tindustry12
outreg2 using Model5, word addstat("F-stat", e(F), "Prob > F", e(p), "Df", e(df_r)) replace
**1j: Model 6 with the south x smsa interaction
generate south_smsa = south*smsa
reg lnwage age ttlexp tenure south smsa south_smsa union dummyindustry
outreg2 using 1j, word addstat("F-stat", e(F), "Prob > F", e(p), "Df", e(df_r)) replace
**1l: linear probability model
reg union age married collgrad ttlexp south
outreg2 using Table4, word addstat("F-stat", e(F), "Prob > F", e(p), "Df", e(df_r)) title(Table 4) replace
**1m: logit model
logit union age married collgrad ttlexp south
**1n: marginal effects at the means, then average marginal effects
margins, dydx(*) atmeans
margins, dydx(*)