Monday February 27, 2023
Quantitative Research Methods III Economics and Finance
Case 1: Explaining wages
Group 11
Babette Hensen (2789960, 2789960@student.vu.nl)
Sebastiaan Jonker (2784678, 2784678@student.vu.nl)

Part I: Explaining wages

1A. We created a table with summary statistics of the variables ID, wage, age, ttlexp, tenure, collgrad, married, smsa, south, union, industry and lnwage.

Table 1
VARIABLES     (1) N    (2) mean    (3) sd     (4) min     (5) max
ID            621      311         179.4      1           621
wage          621      7.627       5.673      1.005       40.75
age           621      39.11       3.108      34          46
ttlexp        621      12.12       4.613      0.885       28.88
tenure        616      5.828       5.475      0           24.75
collgrad      621      0.256       0.437      0           1
married       621      0.638       0.481      0           1
smsa          621      0.670       0.471      0           1
south         621      0.425       0.495      0           1
union         522      0.238       0.426      0           1
industry      618      8.149       3.017      1           12
lnwage        621      1.842       0.594      0.00494     3.707

There are missing values:
- tenure: 5 missing values, 0.8% of the total number of observations
- industry: 3 missing values, 0.48% of the total number of observations
- union: 99 missing values, 15.94% of the total number of observations

1B. We created a scatterplot with wage on the y-axis and ttlexp on the x-axis.

[Scatterplot: wage against ttlexp]

Then we estimated a univariate linear regression model (labeled Model 1) with wage as the dependent variable and ttlexp as the independent variable.

Table 2
VARIABLES       Model 1: wage
ttlexp          0.283*** (0.0481)
Constant        4.194*** (0.624)
Observations    621
R-squared       0.053
F-stat          34.67
Prob > F        6.41e-09
Df              619
Standard errors in parentheses
*** p<0.01, ** p<0.05, * p<0.1

1C. The interpretation of the estimated ttlexp coefficient is: if total work experience goes up by one year, the hourly wage in dollars is predicted to go up by 0.2832524 (about $0.28).

1D. The exact formula of the standard error of the ttlexp coefficient is:

SE(b1) = sqrt( s^2 / sum_i (ttlexp_i - mean(ttlexp))^2 ),  where s^2 = sum_i e_i^2 / (n - 2) is the estimated error variance and e_i are the residuals.

To fill in this formula we subtract the mean 12.1199 from every observation of ttlexp, square these deviations and sum them, which gives 13190.71. Multiplying this by 1/621 gives 21.24107. We then divide the standard deviation 4.612519 by (621 x 21.24107) and obtain a standard error of 0.00035. This standard error differs from the standard error reported in the regression output in Table 2, which is 0.0481066.

Then we perform a linear regression with wage as the dependent variable and ttlexp as the independent variable, to see whether ttlexp statistically affects wages. We obtain a t-value of 5.89 and a p-value of 0.000. The p-value is below the required 5% significance level, so we reject the null hypothesis H0: β1 = 0. Therefore the total work experience someone has does have a significant effect on the wage a person will earn.

1E. We estimated a linear regression model (Model 2) with lnwage as the dependent variable and ttlexp as the independent variable.

VARIABLES       Model 2: lnwage
ttlexp          0.0427*** (0.00488)
Constant        1.325*** (0.0633)
Observations    621
R-squared       0.110
F-stat          76.36
Prob > F        0
Df              619
Standard errors in parentheses
*** p<0.01, ** p<0.05, * p<0.1

Then we created two histograms: one for the residuals of Model 1 and one for the residuals of Model 2.

[Histogram 1: residuals of Model 1 (wage). Histogram 2: residuals of Model 2 (lnwage). The black line is a normal density plot.]

Assumption 6 is "ui is normally distributed." As the histograms show, the residuals of Model 2 are approximately normally distributed, unlike those of Model 1. We therefore prefer Model 2. The theoretical formula of the residuals for Model 2 is:

residual_i = ln(Yi) - (b0 + b1*ttlexp_i), the difference between observed ln(wage) and the value fitted by Model 2.
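As a complement to the visual comparison, the residuals can also be checked with a formal normality test; a minimal sketch, assuming the residual variables resid_wage and resid_lnwage from the do-file at the end of this report (if the do-file has already been run, the reg/predict lines can be skipped because the variables already exist):

* residuals of Model 1 and Model 2, mirroring the 1e block of the do-file
quietly reg wage ttlexp
predict resid_wage, residuals
quietly reg lnwage ttlexp
predict resid_lnwage, residuals
* skewness/kurtosis test for normality: a small p-value rejects normality
sktest resid_wage resid_lnwage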
1F. We estimated a linear regression model (Model 3) with lnwage as the dependent variable and age, ttlexp, tenure, south, smsa and union as independent variables.

VARIABLES       Model 2: lnwage        Model 3: lnwage
age                                    -0.0148** (0.00642)
ttlexp          0.0427*** (0.00488)    0.0317*** (0.00559)
tenure                                 0.0160*** (0.00459)
south                                  -0.149*** (0.0411)
smsa                                   0.273*** (0.0426)
union                                  0.147*** (0.0472)
Constant        1.325*** (0.0633)      1.802*** (0.256)
Observations    621                    519
R-squared       0.110                  0.281
F-stat          76.36                  33.32
Prob > F        0                      0
Df              619                    512
Standard errors in parentheses
*** p<0.01, ** p<0.05, * p<0.1

Then we investigated whether the model suffers from multicollinearity by using the Variance Inflation Factor (VIF).

Variable    VIF     1/VIF
tenure      1.67    0.598080
ttlexp      1.62    0.617499
south       1.06    0.941504
union       1.04    0.962828
smsa        1.02    0.976410
age         1.01    0.985875
Mean VIF    1.24

The VIF measures how much the variance of a coefficient is inflated by the correlation of its regressor with the other independent variables. First of all, not all dummies and the constant are included at the same time, so there is no dummy-variable trap in this situation. The VIF of every variable lies between 1 and 5, which by the usual rule of thumb means there is only moderate correlation between each independent variable and the other independent variables in the model. This moderate correlation is not severe enough to require attention, so the model does not suffer from multicollinearity.

1G. We created an industry dummy (dummyindustry, equal to 1 for industry codes 6 and higher, as generated in the do-file) and estimated a new model (Model 4).

VARIABLES        Model 2: lnwage        Model 3: lnwage        Model 4: lnwage
age                                     -0.0148** (0.00642)    -0.0149** (0.00643)
ttlexp           0.0427*** (0.00488)    0.0317*** (0.00559)    0.0317*** (0.00560)
tenure                                  0.0160*** (0.00459)    0.0159*** (0.00460)
south                                   -0.149*** (0.0411)     -0.150*** (0.0411)
smsa                                    0.273*** (0.0426)      0.275*** (0.0427)
union                                   0.147*** (0.0472)      0.146*** (0.0473)
dummyindustry                                                  -0.0230 (0.0469)
Constant         1.325*** (0.0633)      1.802*** (0.256)       1.821*** (0.260)
Observations     621                    519                    519
R-squared        0.110                  0.281                  0.281
F-stat           76.36                  33.32                  28.56
Prob > F         0                      0                      0
Df               619                    512                    511
Standard errors in parentheses
*** p<0.01, ** p<0.05, * p<0.1

The exact theoretical model for Model 4 is:

lnwagei = β0 + β1agei + β2ttlexpi + β3tenurei + β4southi + β5smsai + β6unioni + β7dummyindustryi + εi

The coefficient of dummyindustry is -0.0230. This means that when the industry dummy equals 1, the wage is approximately -0.0230 x 100% = 2.3% lower, with respect to the reference category and corrected for all other variables.

1H. If we perform a linear regression with lnwage as the dependent variable and industry as an independent variable, we get a t-value of 2.80 and a p-value of 0.005. The p-value is below the required 5% significance level, so we reject the null hypothesis H0: β1 = 0. Therefore the industry someone is employed in does have a significant effect on the wage a person will earn.
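A side note on the dummy interpretation in 1G (this refinement is ours): in a model with lnwage as the dependent variable, the exact relative wage difference implied by a dummy coefficient b is exp(b) - 1 rather than b itself; for a coefficient as small as -0.0230 the two are nearly identical, so the approximate reading of about -2.3% is harmless. A quick check in Stata:

* exact versus approximate relative effect of the Model 4 industry dummy
display "exact: " 100*(exp(-0.0230) - 1) "%   approximate: " 100*(-0.0230) "%"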
1I. We estimated a new model (Model 5), in which the effect of ttlexp is allowed to differ per industry.

To calculate the impact ttlexp has on ln(wage) per industry, we first generate a dummy variable per industry using "generate industryX = industry==X" for each industry X. The dummy equals 1 when the observation belongs to that industry and 0 when it does not. We then generate a slope dummy using the command "generate tindustryX = ttlexp*industryX". As the reference category we take industry 2. The data contain no respondent in industry 2, but by taking it as the reference category industry 2 is still represented as the omitted baseline.

VARIABLES       Model 5: lnwage
age             -0.0111* (0.00631)
ttlexp          0.00862 (0.0201)
tenure          0.0101** (0.00464)
south           -0.131*** (0.0402)
smsa            0.260*** (0.0422)
union           0.124** (0.0482)
tindustry1      -0.0231 (0.0265)
tindustry3      0.0259 (0.0229)
tindustry4      0.0253 (0.0201)
tindustry5      0.0312 (0.0209)
tindustry6      0.0110 (0.0201)
tindustry7      0.0343* (0.0203)
tindustry8      0.0259 (0.0221)
tindustry9      -0.00743 (0.0212)
tindustry10     0.0353 (0.0268)
tindustry11     0.0270 (0.0199)
tindustry12     0.0361* (0.0206)
Constant        1.682*** (0.252)
Observations    519
R-squared       0.338
F-stat          15.07
Prob > F        0
Df              501
Standard errors in parentheses
*** p<0.01, ** p<0.05, * p<0.1

The theoretical model, writing out the slope dummies (the control variables age, tenure, south, smsa and union are also included in the estimation but omitted from the notation for readability), is:

lnwagei = β0 + β1ttlexpi + β2(industry1i x ttlexpi) + β3(industry3i x ttlexpi) + β4(industry4i x ttlexpi) + β5(industry5i x ttlexpi) + β6(industry6i x ttlexpi) + β7(industry7i x ttlexpi) + β8(industry8i x ttlexpi) + β9(industry9i x ttlexpi) + β10(industry10i x ttlexpi) + β11(industry11i x ttlexpi) + β12(industry12i x ttlexpi) + εi

The marginal effect for industry 11 can be isolated mathematically as:

lnwagei = β0 + β1ttlexpi + β11(industry11i x ttlexpi) = β0 + (β1 + β11industry11i)ttlexpi

meaning that β11, the estimated marginal effect of 0.0270, is the additional effect ttlexp has on lnwage for industry 11 with respect to the reference category, corrected for all other variables.

1J. We estimated a new model (Model 6), in which south is interacted with smsa.

VARIABLES       Model 6: lnwage
age             -0.0149** (0.00643)
ttlexp          0.0317*** (0.00560)
tenure          0.0159*** (0.00461)
south           -0.144** (0.0705)
smsa            0.279*** (0.0597)
south_smsa      -0.00899 (0.0862)
union           0.146*** (0.0475)
dummyindustry   -0.0224 (0.0473)
Constant        1.818*** (0.262)
Observations    519
R-squared       0.281
F-stat          24.94
Prob > F        0
Df              510
Standard errors in parentheses
*** p<0.01, ** p<0.05, * p<0.1

The coefficient of south is -0.144. This means that when the respondent lives in the South (and not in an SMSA), the wage is approximately -0.144 x 100% = 14.4% lower, with respect to the reference category and corrected for all other variables.
The coefficient of smsa is 0.279. This means that when the respondent lives in a Standard Metropolitan Statistical Area (outside the South), the wage is approximately 27.9% higher, with respect to the reference category and corrected for all other variables.
The coefficient of south_smsa is -0.00899. This means that when the respondent lives in the South AND in a Standard Metropolitan Statistical Area, the effect on lnwage is the sum of the south and smsa coefficients plus an additional interaction effect of approximately -0.00899 x 100% = -0.9%, corrected for all other variables.

1K. We perform a linear regression with lnwage as the dependent variable and age, ttlexp, tenure, south, smsa, south_smsa, union and dummyindustry as the independent variables (Model 6, shown above).

The null hypothesis is H0: σi² = σ² for all i (homoskedasticity). The alternative hypothesis is H1: σi² ≠ σ² for some i.
We compute the White test statistic nR² of the regression:
nR² = 519 x 0.2812 = 145.9428
k - 1 = 8
The critical value of χ²(8) at the 5% level is 15.507.
Since 145.9428 > 15.507, we reject the null hypothesis of homoskedasticity at the 5% level, so there is evidence of heteroskedasticity.
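As an aside (these commands are ours and not part of the original do-file): Stata computes its own version of the White test from the auxiliary regression of the squared residuals with estat imtest, white, and given the evidence of heteroskedasticity the model could be re-estimated with heteroskedasticity-robust (White) standard errors. A minimal sketch:

* White test based on the auxiliary regression of the squared residuals
quietly reg lnwage age ttlexp tenure south smsa south_smsa union dummyindustry
estat imtest, white
* Model 6 with heteroskedasticity-robust standard errors
reg lnwage age ttlexp tenure south smsa south_smsa union dummyindustry, vce(robust)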
Part II: Explaining union workers

1L. We estimated a linear regression model with union as the dependent variable and age, married, collgrad, ttlexp and south as independent variables.

Table 4
VARIABLES       OLS: union
age             0.00293 (0.00602)
married         -0.0536 (0.0386)
collgrad        0.134*** (0.0423)
ttlexp          0.00178 (0.00415)
south           -0.100*** (0.0371)
Constant        0.143 (0.243)
Observations    522
R-squared       0.041
F-stat          4.420
Prob > F        0.000597
Df              516
Standard errors in parentheses
*** p<0.01, ** p<0.05, * p<0.1

In this linear probability model the coefficient of age is 0.00293. This means that when the age of the respondent goes up by one year, the probability of being a union worker goes up by 0.00293, i.e. by about 0.29 percentage points, keeping all other variables constant.

1M. We estimated a logit model with union as the binary dependent variable and age, married, collgrad, ttlexp and south as independent variables. The exact theoretical model equation of the logit model is:

Zi = β0 + β1agei + β2marriedi + β3collgradi + β4ttlexpi + β5southi
P(unioni = 1 | Xi) = F(Zi) = exp(Zi) / (exp(Zi) + 1)

Because of the non-linear function F, the estimated south coefficient cannot be interpreted directly as a change in the probability; one first has to derive the marginal effect, which for a dummy variable such as south is the difference in predicted probabilities F(Zi with southi = 1) - F(Zi with southi = 0) and therefore depends on the values of the other regressors.

1N. Then we computed the average marginal effects implied by the estimated logit model for all independent variables. The marginal effect of the variable south for an average person means that for a one-unit increase (going from 0 to 1, so going from not living in the South to living in the South), the probability of being a union worker decreases by 0.102, i.e. by about 10.2 percentage points, keeping all other variables constant.

Then we compare the estimated south coefficient of question 1L (-0.100) to the estimated marginal effect of south at the mean (-0.103). The difference is small. The average marginal effect is the average effect of south over all data points, keeping all other variables constant. Since the logit function is an S-shaped curve, the effect (slope) of south is small near probabilities of 0 and 1 and largest near the middle of the curve. The average marginal effect is therefore well approximated by the coefficient of the regular OLS: the average slope of the S-shaped curve is similar to the slope of the linear probability model.
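The marginal effect of south at the mean can also be obtained as a discrete change in predicted probabilities; a minimal sketch (these margins calls are ours and complement the dydx calls already in the do-file below):

* predicted probability of union membership at the sample means, for south = 0 and south = 1
quietly logit union age married collgrad ttlexp south
margins, at(south=(0 1)) atmeans
* slope-form marginal effect of south at the means, as discussed in 1N
margins, dydx(south) atmeans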
Do file:

clear
cd "C:\Users\sfdjo\OneDrive\Documenten\QRM\Case I"
import excel "C:\Users\sfdjo\OneDrive\Documenten\QRM\Case I\data_Case_I_group_11.xls", sheet("Data") firstrow

**1a
count if missing(wage)
count if missing(age)
count if missing(ttlexp)
count if missing(tenure)
count if missing(collgrad)
count if missing(married)
count if missing(smsa)
count if missing(south)
count if missing(union)
count if missing(industry)

** 1b
gen lnwage = ln(wage)
sum ID wage age ttlexp tenure collgrad married smsa south union industry lnwage
outreg2 using TableI, word title (Table 1) replace sum(log)
twoway (scatter wage ttlexp), graphregion(color(white)) bgcolor(white)
reg (wage ttlexp)
outreg2 using Table2, word addstat("F-stat", e(F), "Prob > F", e(p), "Df", e(df_r)) title (Table 2) replace

**1e
reg (lnwage ttlexp)
outreg2 using table3, word addstat("F-stat", e(F), "Prob > F", e(p), "Df", e(df_r)) replace
reg (wage ttlexp)
predict resid_wage, r
histogram resid_wage, title("Residuals Model 1") graphregion(color(white)) bgcolor(white) normal normopts(lcolor(black) lwidth(thick)) kdensity kdenopts(lcolor(red) lwidth(thick))
graph export "C:\Users\sfdjo\OneDrive\Documenten\QRM\Case I\Hist1.png", as(png) name("Graph") replace
reg (lnwage ttlexp)
predict resid_lnwage, r
histogram resid_lnwage, title("Residuals Model 2") graphregion(color(white)) bgcolor(white) normal normopts(lcolor(black) lwidth(thick)) kdensity kdenopts(lcolor(red) lwidth(thick))
graph export "C:\Users\sfdjo\OneDrive\Documenten\QRM\Case I\Hist2.png", as(png) name("Graph") replace

**1f
reg (lnwage age ttlexp tenure south smsa union)
outreg2 using table3, word addstat("F-stat", e(F), "Prob > F", e(p), "Df", e(df_r)) append
vif

**1g
gen dummyindustry = (industry>=6)
reg( lnwage age ttlexp tenure south smsa union dummyindustry)
outreg2 using table3, word addstat("F-stat", e(F), "Prob > F", e(p), "Df", e(df_r)) append

**1i
generate industry1 = industry==1
generate industry3 = industry==3
generate industry4 = industry==4
generate industry5 = industry==5
generate industry6 = industry==6
generate industry7 = industry==7
generate industry8 = industry==8
generate industry9 = industry==9
generate industry10 = industry==10
generate industry11 = industry==11
generate industry12 = industry==12
generate tindustry1 = ttlexp* industry1
generate tindustry3 = ttlexp* industry3
generate tindustry4 = ttlexp* industry4
generate tindustry5 = ttlexp* industry5
generate tindustry6 = ttlexp* industry6
generate tindustry7 = ttlexp* industry7
generate tindustry8 = ttlexp* industry8
generate tindustry9 = ttlexp* industry9
generate tindustry10 = ttlexp* industry10
generate tindustry11 = ttlexp* industry11
generate tindustry12 = ttlexp* industry12
reg lnwage age ttlexp tenure south smsa union tindustry1 tindustry3 tindustry4 tindustry5 tindustry6 tindustry7 tindustry8 tindustry9 tindustry10 tindustry11 tindustry12
outreg2 using Model5, word addstat("F-stat", e(F), "Prob > F", e(p), "Df", e(df_r)) replace

**1j
generate south_smsa = south*smsa
reg( lnwage age ttlexp tenure south smsa south_smsa union dummyindustry)
outreg2 using 1j, word addstat("F-stat", e(F), "Prob > F", e(p), "Df", e(df_r)) replace

**1l
reg union age married collgrad ttlexp south
outreg2 using Table4, word addstat("F-stat", e(F), "Prob > F", e(p), "Df", e(df_r)) title (Table 4) replace

**1m
logit union age married collgrad ttlexp south
margins, dydx(*) atmeans
margins, dydx(*)
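The do-file above has no block for the manual standard-error calculation in 1D. A minimal sketch of how the textbook formula SE(b1) = sqrt( s^2 / sum_i (ttlexp_i - mean(ttlexp))^2 ) could be reproduced in Stata (the scalar names are ours); the result should match the standard error that regress reports for ttlexp:

* manual computation of the standard error of the ttlexp slope (cf. 1D)
quietly reg wage ttlexp
scalar s_hat = e(rmse)                 // root MSE: sqrt(SSR/(n-2))
quietly summarize ttlexp
scalar Sxx = r(Var)*(r(N)-1)           // sum of squared deviations of ttlexp from its mean
display "SE(b_ttlexp) = " sqrt(s_hat^2/Sxx)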