Econ 3710 Assignment 2
Student ID number ends with two.

> library(tidyverse)
> library(haven)
> library(stargazer)
> dall <- read_dta("LFS-71M0001-E-2022-October_F1.dta")

Question 1

1a.
> head(dall$AGE_12)
> dall <- dall %>% filter((AGE_12 >= 2 & AGE_12 <= 11))
> dall <- dall %>% filter(AGE_12 != 11)

Together the two filters keep age groups 2 to 10. There are 74333 observations in total in my sample.

1b.
There are no observations with zero hourly wages: the minimum hourly wage is 3.33 and the mean is 32.52. However, 23672 observations are missing. HRLYEARN is recorded for employees only, so the missing values are most likely people who do not earn a wage from a paid job, for example retired people, students, homemakers, the unemployed and the self-employed.

> summary(dall$HRLYEARN)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   3.33   21.00   28.85   32.52   40.87  115.38   23672
> dall <- dall %>% mutate(lwage = log(HRLYEARN))
> summary(dall$lwage)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  1.203   3.045   3.362   3.382   3.710   4.748   23672

The mean log wage is 3.382. The mean hourly wage is 32.52.
The histogram (pictures below) of hourly wages is skewed to the right and does not look like a normal distribution. The histogram of log wages, on the other hand, looks almost like a bell curve and is much closer to a normal distribution.

> dall <- dall %>% filter(!is.na(HRLYEARN))

1c.
> head(dall$HRLYEARN)
[1] 21.15 20.14 26.71 23.39 39.00 21.86
> head(dall$AGE_12)
<labelled<double>[6]>: Five-year age group of respondent
[1]  3  5  6  4 10  3

Labels:
 value          label
     1 15 to 19 years
     2 20 to 24 years
     3 25 to 29 years
     4 30 to 34 years
     5 35 to 39 years
     6 40 to 44 years
     7 45 to 49 years
     8 50 to 54 years
     9 55 to 59 years
    10 60 to 64 years
    11 65 to 69 years
    12    70 and over
> head(dall$EDUC)
<labelled<double>[6]>: Highest educational attainment
[1] 4 4 5 5 4 4

Labels:
 value                                label
     0                         0 to 8 years
     1                     Some high school
     2                 High school graduate
     3                   Some postsecondary
     4 Postsecondary certificate or diploma
     5                    Bachelor's degree
     6              Above bachelor's degree
> head(dall$SEX)
<labelled<double>[6]>: Sex of respondent
[1] 2 2 2 2 1 2

Labels:
 value  label
     1   Male
     2 Female
> head(dall$COWMAIN)
<labelled<double>[6]>: Class of worker, main job
[1] 2 2 2 2 2 2

Labels:
 value                                        label
     1                      Public sector employees
     2                     Private sector employees
     3   Self-employed incorporated, with paid help
     4     Self-employed incorporated, no paid help
     5 Self-employed unincorporated, with paid help
     6   Self-employed unincorporated, no paid help
     7                         Unpaid family worker
> table(dall$COWMAIN)

    1     2
15019 35642
> head(dall$UNION)
<labelled<double>[6]>: Union status, employees only
[1] 3 1 3 3 3 3

Labels:
 value                                                        label
     1                                                 Union member
     2 Not a member but covered by a union contract or collective a
     3                                                Non-unionized
> glimpse(dall$HRLYEARN)
 num [1:50661] 21.1 20.1 26.7 23.4 39 ...
 - attr(*, "label")= chr "Usual hourly wages, employees only"
 - attr(*, "format.stata")= chr "%6.2f"

EDUC, AGE_12, SEX, COWMAIN and UNION are categorical (factor) variables. Only the hourly wage, HRLYEARN, is a continuous variable.

1d.
> head(dall$EDUC)
> table(dall$EDUC)
> prop.table(table(dall$EDUC))

 Label                                   Value
 0 to 8 years                            0.010
 Some high school                        0.041
 High school graduate                    0.174
 Some post secondary                     0.046
 Post secondary certificate or diploma   0.385
 Bachelor’s degree                       0.234
 Above bachelor’s degree                 0.109

The largest education category is Post secondary certificate or diploma, with 38.5% of the sample.

1e.
> head(dall$SEX)
<labelled<double>[6]>: Sex of respondent
[1] 2 2 2 2 1 2

Labels:
 value  label
     1   Male
     2 Female
> dall <- dall %>% mutate(sex = SEX - 1)
> table(dall$EDUC)

    0     1     2     3     4     5     6
  522  2093  8817  2347 19484 11869  5529
> table(dall$AGE_12)

   2    3    4    5    6    7    8    9   10
4179 5374 6117 6272 6388 6056 6171 5736 4368
> table(dall$COWMAIN)

    1     2
15019 35642
> model_1e <- lm(lwage ~ factor(EDUC)+factor(AGE_12)+sex+factor(COWMAIN), dall)

The base categories for the factor variables are: EDUC = 0 (0 to 8 years), AGE_12 = 2 (20 to 24 years), COWMAIN = 1 (public sector employees) and, for sex, male (sex = 0, i.e. SEX = 1).

1e. (ii) Reporting the model in table format using the stargazer library.

> summary(model_1e)

Call:
lm(formula = lwage ~ factor(EDUC) + factor(AGE_12) + sex + factor(COWMAIN),
    data = dall)

Residuals:
     Min       1Q   Median       3Q      Max
-2.39832 -0.25370 -0.01157  0.25060  1.54464

Coefficients:
                                       Estimate Std. Error t value Pr(>|t|)
(Intercept)                            3.000637   0.018396 163.114  < 2e-16 ***
Some high school                       0.070508   0.018759   3.759 0.000171 ***
High school graduate                   0.138425   0.017297   8.003 1.24e-15 ***
Some postsecondary                     0.174339   0.018667   9.340  < 2e-16 ***
Post secondary certificate or diploma  0.268499   0.017052  15.746  < 2e-16 ***
Bachelor’s Degree                      0.436309   0.017252  25.290  < 2e-16 ***
Above Bachelor’s degree                0.541368   0.017675  30.630  < 2e-16 ***
25 to 29 years                         0.192787   0.008026  24.019  < 2e-16 ***
30 to 34 years                         0.293741   0.007862  37.363  < 2e-16 ***
35 to 39 years                         0.340646   0.007835  43.476  < 2e-16 ***
40 to 44 years                         0.368100   0.007802  47.180  < 2e-16 ***
45 to 49 years                         0.371832   0.007872  47.233  < 2e-16 ***
50 to 54 years                         0.375515   0.007832  47.944  < 2e-16 ***
55 to 59 years                         0.336037   0.007935  42.350  < 2e-16 ***
60 to 64 years                         0.295222   0.008410  35.104  < 2e-16 ***
Female                                -0.183452   0.003481 -52.694  < 2e-16 ***
Private sector employees              -0.177315   0.003916 -45.279  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3833 on 50644 degrees of freedom
Multiple R-squared:  0.2553, Adjusted R-squared:  0.2551
F-statistic:  1085 on 16 and 50644 DF,  p-value: < 2.2e-16

The R-squared is 0.2553, so the model explains about 25.5% of the variation in log hourly wages. The fit is modest: most of the variation in log wages is left unexplained, even though every coefficient is individually significant.

1f. Interpreting the coefficients from the wage regression.
Each coefficient on a factor variable is measured relative to that variable's base category, holding the other regressors constant. All EDUC coefficients are positive and rise with the level of education: for example, a worker with a bachelor's degree earns about 0.436 log points more than a worker with 0 to 8 years of schooling. All AGE_12 coefficients are also positive relative to the 20 to 24 group; they rise to a peak of 0.376 for the 50 to 54 group and then decline. The coefficients on Female and on Private sector employees are negative: women earn about 0.183 log points (roughly 17%) less than men, and private sector employees earn about 0.177 log points (roughly 16%) less than public sector employees.

Question 2.
> # Unionization
> head(dall$UNION)
> dall <- mutate(dall, unionized = if_else(UNION == 3, 0, 1))
> # PUBLIC/PRIVATE
> dall <- dall %>% mutate(public = if_else(COWMAIN == 1, 1, 0))

2a.
> table(dall$unionized, dall$public)

        0     1
  0 29626  3102
  1  6016 11917

Row 0 is non-unionized and row 1 is unionized. Column 0 is the private sector and column 1 is the public sector.
The unionized workforce is 6016+11917 = 17933.
The total workforce is 29626+3102+6016+11917 = 50661.
The unionized share of the workforce = 17933/50661*100 = 35.40%.

> prop.table(table(dall$unionized, dall$public),2)

            0         1
  0 0.8312104 0.2065384
  1 0.1687896 0.7934616
> prop.table(table(dall$unionized, dall$public),1)

             0          1
  0 0.90521877 0.09478123
  1 0.33547092 0.66452908
> table(dall$public)

    0     1
35642 15019

The fraction of the workforce in the public sector is 15019/(35642+15019) = 15019/50661 = 29.65%.
From the column proportions, 16.88% of the private sector is unionized (6016/35642) and 79.35% of the public sector is unionized (11917/15019). The row proportions answer a different question: of all unionized workers, 33.55% are in the private sector and 66.45% are in the public sector.

2b.
> dallunion <- dall %>% filter(unionized == 1)
> dallnonun <- dall %>% filter(unionized == 0)
> summary(dallunion$lwage)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.270   3.219   3.483   3.481   3.766   4.726
> summary(dallnonun$lwage)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.203   2.975   3.258   3.327   3.650   4.748

The mean log wage for unionized workers is 3.481.
The mean log wage for non-unionized workers is 3.327.
The union premium = 3.481-3.327 = 0.154 log points.

2c.
> dallpublic <- dall %>% filter(public == 1)
> dallprivate <- dall %>% filter(public == 0)
> summary(dallpublic$lwage)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.270   3.258   3.555   3.550   3.836   4.726
> summary(dallprivate$lwage)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.203   2.983   3.258   3.311   3.618   4.748

The mean log wage is 3.550 in the public sector and 3.311 in the private sector.
The public premium = 3.550-3.311 = 0.239 log points.

2d.
> lm(log(HRLYEARN)~unionized+public, data=dall)

Call:
lm(formula = log(HRLYEARN) ~ unionized + public, data = dall)

Coefficients:
(Intercept)    unionized       public
    3.30600      0.02832      0.22104

Holding sector constant, the union premium is 0.028 log points, or about 3%. Holding union status constant, a person who works in the public sector earns about 0.221 log points (roughly 22%) more than a person who works in the private sector. The intercept (3.31) is the predicted mean log wage of non-unionized private sector workers. Public sector employment is correlated with being unionized.

> lm(log(HRLYEARN)~unionized, data=dall)

Call:
lm(formula = log(HRLYEARN) ~ unionized, data = dall)

Coefficients:
(Intercept)    unionized
     3.3269       0.1543

> lm(log(HRLYEARN)~public, data=dall)

Call:
lm(formula = log(HRLYEARN) ~ public, data = dall)

Coefficients:
(Intercept)       public
     3.3108       0.2387

When unionized is the only regressor, its coefficient is 0.1543, and when public is the only regressor, its coefficient is 0.2387. These match the differences in means from 2b and 2c. Because unionization and public sector employment are positively correlated and both raise wages, omitting one of them from the regression causes omitted variable bias: the included variable picks up part of the effect of the omitted one. When both are included, the coefficients shrink (from 0.1543 to 0.0283 for unionized and from 0.2387 to 0.2210 for public), which reflects the removal of that bias.

2e.
> model_2e <- lm(lwage ~ factor(EDUC)+sex+factor(AGE_12)+unionized +public, data=dall)
> summary(model_2e)

Call:
lm(formula = lwage ~ factor(EDUC) + sex + factor(AGE_12) + unionized +
    public, data = dall)

Residuals:
     Min       1Q   Median       3Q      Max
-2.40386 -0.25419 -0.01312  0.24947  1.54970

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)       2.819679   0.017911 157.430  < 2e-16 ***
factor(EDUC)1     0.070548   0.018752   3.762 0.000169 ***
factor(EDUC)2     0.138393   0.017291   8.004 1.23e-15 ***
factor(EDUC)3     0.174904   0.018660   9.373  < 2e-16 ***
factor(EDUC)4     0.267728   0.017046  15.707  < 2e-16 ***
factor(EDUC)5     0.437500   0.017247  25.367  < 2e-16 ***
factor(EDUC)6     0.543906   0.017673  30.777  < 2e-16 ***
sex              -0.182305   0.003485 -52.311  < 2e-16 ***
factor(AGE_12)3   0.191197   0.008028  23.818  < 2e-16 ***
factor(AGE_12)4   0.291731   0.007865  37.091  < 2e-16 ***
factor(AGE_12)5   0.338941   0.007837  43.248  < 2e-16 ***
factor(AGE_12)6   0.366112   0.007806  46.904  < 2e-16 ***
factor(AGE_12)7   0.369607   0.007877  46.920  < 2e-16 ***
factor(AGE_12)8   0.373404   0.007837  47.648  < 2e-16 ***
factor(AGE_12)9   0.333938   0.007939  42.064  < 2e-16 ***
factor(AGE_12)10  0.293433   0.008412  34.884  < 2e-16 ***
unionized         0.027960   0.004471   6.253 4.05e-10 ***
public            0.159357   0.004855  32.823  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3832 on 50643 degrees of freedom
Multiple R-squared:  0.2559, Adjusted R-squared:  0.2556
F-statistic:  1024 on 17 and 50643 DF,  p-value: < 2.2e-16

The results suggest that both unionization and public sector employment have a positive and statistically significant association with wages, even after controlling for education, age and sex. The coefficient on unionized is 0.028 and the coefficient on public is 0.159, so the public premium is smaller than the 0.221 estimated in 2d once these controls are added.

2f.
> model_2f <- lm(lwage ~ factor(EDUC)+sex+factor(AGE_12)+ TENURE+ unionized+public+ factor(MARSTAT)+factor(NOC_10), data=dall)
> summary(model_2f)

Call:
lm(formula = lwage ~ factor(EDUC) + sex + factor(AGE_12) + TENURE +
    unionized + public + factor(MARSTAT) + factor(NOC_10), data = dall)

Residuals:
    Min      1Q  Median      3Q     Max
-2.5094 -0.2127 -0.0098  0.2108  1.4094

Coefficients:
                   Estimate Std. Error  t value Pr(>|t|)
(Intercept)       3.383e+00  1.727e-02  195.886  < 2e-16 ***
factor(EDUC)1     6.934e-02  1.639e-02    4.232 2.32e-05 ***
factor(EDUC)2     1.022e-01  1.513e-02    6.753 1.46e-11 ***
factor(EDUC)3     1.313e-01  1.635e-02    8.026 1.03e-15 ***
factor(EDUC)4     1.786e-01  1.495e-02   11.947  < 2e-16 ***
factor(EDUC)5     2.998e-01  1.521e-02   19.714  < 2e-16 ***
factor(EDUC)6     3.872e-01  1.561e-02   24.798  < 2e-16 ***
sex              -1.268e-01  3.453e-03  -36.735  < 2e-16 ***
factor(AGE_12)3   1.088e-01  7.139e-03   15.237  < 2e-16 ***
factor(AGE_12)4   1.590e-01  7.277e-03   21.844  < 2e-16 ***
factor(AGE_12)5   1.706e-01  7.475e-03   22.819  < 2e-16 ***
factor(AGE_12)6   1.743e-01  7.585e-03   22.983  < 2e-16 ***
factor(AGE_12)7   1.637e-01  7.743e-03   21.140  < 2e-16 ***
factor(AGE_12)8   1.549e-01  7.813e-03   19.827  < 2e-16 ***
factor(AGE_12)9   1.142e-01  7.998e-03   14.281  < 2e-16 ***
factor(AGE_12)10  8.115e-02  8.429e-03    9.628  < 2e-16 ***
TENURE            9.390e-04  2.088e-05   44.981  < 2e-16 ***
unionized         5.527e-02  4.074e-03   13.565  < 2e-16 ***
public            7.872e-02  4.524e-03   17.400  < 2e-16 ***
factor(MARSTAT)2  3.471e-03  4.334e-03    0.801  0.42317
factor(MARSTAT)3 -4.957e-02  1.663e-02   -2.980  0.00288 **
factor(MARSTAT)4 -2.604e-02  9.525e-03   -2.734  0.00626 **
factor(MARSTAT)5 -6.657e-03  7.567e-03   -0.880  0.37898
factor(MARSTAT)6 -6.328e-02  4.277e-03  -14.795  < 2e-16 ***
factor(NOC_10)2  -4.021e-01  6.471e-03  -62.141  < 2e-16 ***
factor(NOC_10)3  -1.677e-01  7.204e-03  -23.273  < 2e-16 ***
factor(NOC_10)4  -3.502e-01  7.646e-03  -45.796  < 2e-16 ***
factor(NOC_10)5  -3.642e-01  7.029e-03  -51.817  < 2e-16 ***
factor(NOC_10)6  -4.308e-01  1.297e-02  -33.204  < 2e-16 ***
factor(NOC_10)7  -6.552e-01  6.546e-03 -100.092  < 2e-16 ***
factor(NOC_10)8  -3.830e-01  7.017e-03  -54.586  < 2e-16 ***
factor(NOC_10)9  -4.457e-01  1.158e-02  -38.503  < 2e-16 ***
factor(NOC_10)10 -5.173e-01  8.899e-03  -58.133  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3346 on 50628 degrees of freedom
Multiple R-squared:  0.4326, Adjusted R-squared:  0.4323
F-statistic:  1206 on 32 and 50628 DF,  p-value: < 2.2e-16

The education coefficients are relative to the base category 0 (0 to 8 years). The age coefficients are relative to the base category 2 (20 to 24 years). The coefficient on sex is -0.1268, so women earn about 0.127 log points (roughly 12%) less than men, holding the other regressors constant.

> stargazer(model_1e, model_2e, model_2f, type="text")

In this model as well, unionization and public sector employment have a positive effect on log wages. This model fits better because it adds relevant controls (tenure, marital status and occupation). Compared with model_2e, the coefficient on unionized increases from 0.028 to 0.055, while the coefficient on public decreases from 0.159 to 0.079. The R-squared also increases, from about 26% in model_2e to 43% in model_2f.

2g.
> step1 <- lm(public ~ factor(EDUC)+sex+factor(AGE_12)+TENURE+unionized+factor(MARSTAT)+factor(NOC_10), data=dall)
> dall <- dall %>% mutate(rhat = resid(step1))
> #step2
> lm(lwage~rhat, dall)

Call:
lm(formula = lwage ~ rhat, data = dall)

Coefficients:
(Intercept)         rhat
    3.38155      0.07872

I chose public as the variable of interest. In step 1, public is regressed on all the other regressors from model_2f, so the residual rhat is the part of public that is not explained by those regressors. In step 2, log wage is regressed on rhat alone. The intuition behind this two-step procedure (the Frisch-Waugh-Lovell result) is that the coefficient on rhat is exactly the same as the coefficient on public in the full model_2f regression, because both use only the variation in public that is left after controlling for the other variables. It is equal to 0.07872 in both cases.
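
The two-step result in 2g can be checked on data where the true coefficients are known. The following is a minimal sketch on simulated data, not the LFS sample: the variables x1, x2 and y, the sample size and the coefficients 0.3 and 0.8 are all made up for illustration. It regresses x1 on x2, keeps the residual, and compares the coefficient on that residual with the coefficient on x1 from the full regression.

```r
# Hypothetical simulated data (assumption: not related to the LFS file)
set.seed(1)
n  <- 1000
x2 <- rnorm(n)                             # a control variable
x1 <- 0.5 * x2 + rnorm(n)                  # regressor of interest, correlated with x2
y  <- 1 + 0.3 * x1 + 0.8 * x2 + rnorm(n)   # outcome

# Full regression: coefficient on x1 controlling for x2
full <- lm(y ~ x1 + x2)

# Step 1: regress x1 on the control and keep the residual
step1 <- lm(x1 ~ x2)
rhat  <- resid(step1)

# Step 2: regress the outcome on the residual only
step2 <- lm(y ~ rhat)

b_full <- unname(coef(full)["x1"])
b_fwl  <- unname(coef(step2)["rhat"])
print(c(full = b_full, two_step = b_fwl))

# The two estimates agree up to floating-point error
stopifnot(isTRUE(all.equal(b_full, b_fwl)))
```

Only the point estimate carries over from the full model. The standard error reported by the step 2 regression is generally not the same as in the full regression, because step 2 does not remove the controls from the outcome and uses different degrees of freedom.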