9 Hypothesis tests and uncertainty in regression

Download slides (./lecture_slides/lecture8.pdf)
Download seminar/homework data (./data/motherhood_revisited.csv)

9.1 Overview

In the lecture this week we continued our discussion of statistical inference, focussing on hypothesis tests and uncertainty in regression estimates. We learned about the different steps of conducting a hypothesis test, and about how to interpret both t-statistics and p-values. We saw the close connection between hypothesis tests and confidence intervals, and drew attention to the fact that observing a “statistically significant” result may not tell us anything about the substantive significance of that result.

We also discussed uncertainty in regression models, and saw that our estimated regression coefficients are a quantity of interest that will vary from sample to sample, just as with the difference in means. Accordingly, we saw that we can also construct and interpret standard errors, t-statistics, p-values, and confidence intervals for our regression estimates.

In seminar this week, we will:

1. Practice conducting hypothesis tests for the difference in means.
2. Practice conducting hypothesis tests for regression coefficients.
3. Construct confidence intervals for regression coefficients.
4. Revisit fixed-effect models for panel data.

Before coming to the seminar

1. Please read chapter 6, “Probability” and chapter 7, “Uncertainty” in Quantitative Social Science: An Introduction.

9.2 Seminar

In this seminar, we return to the example that we used in the midterm. In that assignment, you used survey data to investigate the size of the wage penalty that mothers face in the USA. Here, we will use an expanded version of that dataset, which you can download from the link above. The data file is motherhood_revisited.csv, which is a CSV file. Store this file in your data folder as you have done in previous weeks.
Then load the data into R:

    motherhood <- read.csv("data/motherhood_revisited.csv")

The names and descriptions of variables are:

    Name               Description
    PUBID              ID of woman
    year               Year of observation
    wage               Hourly wage, in dollars
    numChildren        Number of children that the woman has (in this wave)
    age                Age in years
    region             Name of region (North East = 1, North Central = 2, South = 3, West = 4)
    urban              Geographical classification (urban = 1, otherwise = 0)
    marstat            Marital status
    educ               Level of education
    school             School enrollment (enrolled = TRUE, otherwise = FALSE)
    experience         Experience since 14 years old, in days
    tenure             Current job tenure, in years
    tenure2            Current job tenure in years, squared
    fullTime           Employment status (employed full-time = TRUE, otherwise = FALSE)
    firmSize           Size of the firm
    multipleLocations  Multiple locations indicator (firm with multiple locations = 1, otherwise = 0)
    unionized          Job unionization status (job is unionized = 1, otherwise = 0)
    industry           Job’s industry type
    hazardous          Hazard measure for the job (between 1 and 2)
    regularity         Regularity measure for the job (between 1 and 5)
    competitiveness    Competitiveness measure for the job (between 1 and 5)
    autonomy           Autonomy measure for the job (between 1 and 5)
    teamwork           Teamwork requirements measure for the job (between 1 and 5)

Question 1

What years are included in the data? How many women are included, and how many person-years are included?

Reveal answer

    # Number of years
    length(unique(motherhood$year))
    # Number of women
    length(unique(motherhood$PUBID))
    # Number of observations
    nrow(motherhood)

    ## [1] 16
    ## [1] 1569
    ## [1] 18214

There are 16 unique years in this dataset. There are 1569 women in the data and 18214 person-year observations.

Question 2

As in the midterm, create a new variable, isMother, that takes a value of 1 if the woman has at least one child and a value of 0 otherwise.

    motherhood$isMother <- ifelse(motherhood$numChildren > 0, 1, 0)

a.
Calculate the difference in mean wages between women with children and women without children.

Reveal answer

    wage_mothers <- mean(motherhood$wage[motherhood$isMother == 1], na.rm = TRUE)
    wage_not_mothers <- mean(motherhood$wage[motherhood$isMother == 0], na.rm = TRUE)
    mother_not_mother_diff <- wage_mothers - wage_not_mothers
    mother_not_mother_diff

    ## [1] 1.247316

In this sample, mothers earn on average 1.25 dollars more per hour than non-mothers.

b. Calculate the standard error for the difference in means.

Reveal answer

The formula for the standard error of the difference in means is

$$SE(\hat{Y}_{X=1} - \hat{Y}_{X=0}) = \sqrt{\frac{Var(Y_{X=1})}{n_{X=1}} + \frac{Var(Y_{X=0})}{n_{X=0}}}$$

    # Standard error
    treat_var <- var(motherhood$wage[motherhood$isMother == 1], na.rm = TRUE)
    control_var <- var(motherhood$wage[motherhood$isMother == 0], na.rm = TRUE)
    treat_n <- sum(motherhood$isMother == 1, na.rm = TRUE)
    control_n <- sum(motherhood$isMother == 0, na.rm = TRUE)
    st_err <- sqrt(treat_var/treat_n + control_var/control_n)
    st_err

    ## [1] 0.1007549

c. Calculate the t-statistic for the difference in means.

Reveal answer

    # T-statistic
    t_stat <- mother_not_mother_diff/st_err
    t_stat

    ## [1] 12.37971

d. At the 95% confidence level, can we reject the null hypothesis that there is no difference in the wage levels of mothers and non-mothers in the population?

Reveal answer

Yes. The t-statistic is much greater than 1.96, the critical value for the 95% confidence level, so we can reject the null hypothesis of no difference. The intuition here is that it is extremely unlikely that we would observe a difference in means this large in our sample if it were true that there were no difference between mothers and non-mothers in the population.

e. Use the t.test() function to conduct the same hypothesis test that you just conducted manually. What is the p-value? Does the 95% confidence interval include the value of 0?
Reveal answer

    t.test(x = motherhood$wage[motherhood$isMother==1],
           y = motherhood$wage[motherhood$isMother==0],
           conf.level = 0.95)

    ##
    ##  Welch Two Sample t-test
    ##
    ## data:  motherhood$wage[motherhood$isMother == 1] and motherhood$wage[motherhood$isMother == 0]
    ## t = 12.38, df = 13709, p-value < 2.2e-16
    ## alternative hypothesis: true difference in means is not equal to 0
    ## 95 percent confidence interval:
    ##  1.049823 1.444810
    ## sample estimates:
    ## mean of x mean of y
    ##  11.45436  10.20704

The p-value here is very small (less than 2.2e-16 = 0.00000000000000022), which is consistent with the large t-statistic we calculated above. The confidence interval also does not, of course, include zero. Confidence intervals and hypothesis tests will always produce the same result for a given confidence level.

Question 2

a. Run a regression with wage as the outcome variable and numChildren as the explanatory variable. What is the estimated coefficient on the variable numChildren? Provide a brief substantive interpretation of the coefficient.

Reveal answer

    simple_ols_model <- lm(wage ~ numChildren, data = motherhood)

The coefficient on the variable numChildren implies that each additional child that a woman has is associated with an increase of 43 cents in a woman’s hourly wage.

b. What is the standard error of the coefficient for numChildren?

Reveal answer

We can find the values of the standard error associated with each regression coefficient by using the summary() function:

    summary(simple_ols_model)

    ##
    ## Call:
    ## lm(formula = wage ~ numChildren, data = motherhood)
    ##
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max
    ## -11.531  -4.138  -1.962   2.112  49.612
    ##
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)
    ## (Intercept) 10.38796    0.05755 180.509   <2e-16 ***
    ## numChildren  0.43424    0.05052   8.596   <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## Residual standard error: 6.583 on 18197 degrees of freedom
    ##   (15 observations deleted due to missingness)
    ## Multiple R-squared:  0.004044,  Adjusted R-squared:  0.003989
    ## F-statistic: 73.89 on 1 and 18197 DF,  p-value: < 2.2e-16

The standard error for the numChildren coefficient is 0.051.

c. Using the estimated coefficient and standard error for the numChildren variable, conduct a hypothesis test where the null hypothesis is that this coefficient is equal to zero in the population. Can you reject the null hypothesis at the 95% confidence level? Can you reject the null hypothesis at the 99% confidence level?

Reveal answer

The formula for the test statistic for testing a null hypothesis that a regression coefficient is equal to zero is:

$$t = \frac{\hat{\beta} - \beta_{H_0}}{\hat{\sigma}_{\hat{\beta}}} = \frac{\hat{\beta}}{\hat{\sigma}_{\hat{\beta}}}$$

So, to calculate t, we simply divide the estimated coefficient by the standard error:

    t_stat <- 0.43424/0.05052

The test statistic for the numChildren variable is 8.595, which is far larger than the critical values for either the 95% (1.96) or 99% (2.58) confidence levels. Accordingly, we can easily reject the null hypothesis that the association between the number of children and a mother’s wage in the population is equal to zero.

d. What is the meaning of rejecting the null hypothesis in this exercise? Does this provide evidence of a causal relationship between the number of children and the wage level of mothers?

Reveal answer

Whether or not we reject the null hypothesis of no effect is a different question to whether the coefficient represents a causal effect. Here, rejecting the null hypothesis means that we are confident that the relationship between the number of children and the wage of the mother that we observe in our sample of data is very unlikely to have arisen by chance if the association between those two quantities is zero in the population.
However, we should not forget that the association we observe in our sample, however precisely estimated it may be, is still subject to confounding by omitted variables. There are many ways in which women who have more children differ from women with fewer children. For instance, women with more children may also be older on average, or they may have more experience, or different living situations. Each of these characteristics may also be associated with higher wage levels, and therefore even though we can reject the null hypothesis, we cannot conclude that our regression estimate gives us an unbiased estimate of the causal effect of children on their mothers’ wages.

Question 3

a. Create a box plot which depicts the distribution of wage for every year in the data. What do you observe?

Reveal answer

    boxplot(wage ~ year, data = motherhood, xlab = "Year", ylab = "Wage")

There is a clear association between wage and year: women are on average paid more in more recent years in the sample than in earlier years.

b. Create a box plot which depicts the distribution of numChildren for every year in the data. What do you observe?

Reveal answer

    boxplot(numChildren ~ year, data = motherhood, xlab = "Year", ylab = "Number of children")

There is a clear association between the number of children a woman has and the sample year: women on average have more children in more recent years in the sample than in earlier years.

Question 4

The analysis above reveals that there is significant over-time variation in women’s average wages in our sample, and that there is also a strong relationship between time and the number of children a woman has. It is therefore probable that “time” is an important omitted variable in this analysis, and something that we might want to control for.
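
The over-time patterns seen in the box plots can also be summarised numerically. The sketch below is illustrative only: it uses a small made-up data frame (toy, a hypothetical stand-in for motherhood, with invented values) and base R's aggregate() to compute the mean wage and mean number of children in each year.

```r
# Hypothetical toy data standing in for the motherhood data frame
# (the values are invented purely for illustration)
toy <- data.frame(
  year = c(1998, 1998, 1999, 1999, 2000, 2000),
  wage = c(8, 10, 9, 11, 12, 14),
  numChildren = c(0, 1, 0, 2, 1, 3)
)

# Mean wage and mean number of children within each year
yearly_means <- aggregate(cbind(wage, numChildren) ~ year,
                          data = toy, FUN = mean)
yearly_means
```

Replacing toy with motherhood should give the corresponding yearly averages for the seminar data; the formula interface of aggregate() drops rows with missing values by default.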
In addition, we saw last week that when we are working with panel data, a powerful strategy for overcoming omitted variable bias is to use a fixed-effect model, where we include a different intercept term for each of the units in our data. In this example, we have a panel where each woman represents a unit, and we have repeated observations of the same women over time. There may be many factors that vary across women, but that are stable within women over time, that are related to both wage level and the number of children a woman has, and so a fixed-effect model may again be helpful for ruling out omitted variable bias here.

Given this discussion, it seems natural that we might want to include two sets of fixed effects here: one set for units (women), and the other for time (year). This reflects a general form of model for working with panel data called the two-way fixed effects model, in which there is a fixed effect for each unit and a fixed effect for each time period.

Run a two-way fixed-effect regression where the outcome is the wage and the predictor is the number of children that a woman has. Include fixed effects for each woman and each year. To do this, include the relevant variables within the factor() function as a part of the model formula, as below:

    two_way_fe_model <- lm(wage ~ numChildren + factor(PUBID) + factor(year), data = motherhood)

Note that this regression may take a minute or two to run!

Why do we use factor() here? Because both PUBID and year are stored as numeric variables in the motherhood data, R will treat these as regular explanatory variables by default. However, we want R to estimate a separate intercept term for each unique value of these variables, and that is what factor() tells R to do.

Create a table of your fixed-effect model using screenreg() from the texreg package.
To avoid printing out the coefficients for all of the fixed effects, set omit.coef = "year|PUBID". Interpret the coefficient associated with numChildren in both statistical and substantive terms.

Reveal answer

    library(texreg)
    screenreg(list(two_way_fe_model), omit.coef = "year|PUBID")

    ##
    ## =========================
    ##              Model 1
    ## -------------------------
    ## (Intercept)     -0.00
    ##                 (1.68)
    ## numChildren     -1.04 ***
    ##                 (0.06)
    ## -------------------------
    ## R^2              0.60
    ## Adj. R^2         0.56
    ## Num. obs.    18199
    ## RMSE             4.38
    ## =========================
    ## *** p < 0.001, ** p < 0.01, * p < 0.05

The coefficient on the variable numChildren implies that each additional child that a woman has is associated with a decrease in hourly wages of 1.041 dollars. The standard error for the numChildren coefficient is 0.065, which implies a test-statistic value of -16.019, and therefore that we can reject the null hypothesis of no effect at all conventional confidence levels.

It is important to note that in this model, where we control for baseline differences between women using the unit fixed effects and differences in wages over time using the time fixed effects, the numChildren coefficient is now negative. That is, once we account for the various forms of omitted variable bias using the fixed-effect model, we find that there is a negative and significant effect of children on women’s wages. This is the opposite of the conclusion that we would have drawn from the naive analysis in question 2.

Question 5

Estimate a new regression model, which still includes fixed effects for woman and year, but which also includes the following variables:

Location (region, urban)
Marital Status (marstat)
Human Capital (educ, school, experience, tenure, tenure2)
Job Characteristics (fullTime, firmSize, multipleLocations, unionized)

Report the coefficient and standard error associated with the numChildren variable in this model. Is the coefficient still statistically significant?
Provide a brief substantive interpretation of this coefficient and the coefficients for any two other variables.

Reveal answer

    two_way_fe_model_2 <- lm(wage ~ numChildren +
                               factor(region) + urban +
                               marstat +
                               educ + school + experience + tenure + tenure2 +
                               fullTime + firmSize + multipleLocations + unionized +
                               factor(year) + factor(PUBID),
                             data = motherhood)

    library(texreg)
    screenreg(list(two_way_fe_model, two_way_fe_model_2), omit.coef = "year|PUBID")

    ##
    ## ====================================================
    ##                           Model 1       Model 2
    ## ----------------------------------------------------
    ## (Intercept)                  -0.00          3.46
    ##                              (1.68)        (2.05)
    ## numChildren                  -1.04 ***     -0.30 **
    ##                              (0.06)        (0.09)
    ## factor(region)2                            -2.22 ***
    ##                                            (0.46)
    ## factor(region)3                            -1.44 ***
    ##                                            (0.37)
    ## factor(region)4                            -0.07
    ##                                            (0.44)
    ## urban                                       0.20
    ##                                            (0.15)
    ## marstatMarried                              0.75 ***
    ##                                            (0.16)
    ## marstatNo romantic union                   -0.26
    ##                                            (0.14)
    ## educ2.High school                          -0.89 ***
    ##                                            (0.21)
    ## educ3.Some college                          0.33
    ##                                            (0.35)
    ## educ4.College                               3.24 ***
    ##                                            (0.31)
    ## schoolTRUE                                 -0.88 ***
    ##                                            (0.13)
    ## experience                                  0.33 ***
    ##                                            (0.04)
    ## tenure                                      0.31 ***
    ##                                            (0.06)
    ## tenure2                                    -0.02 ***
    ##                                            (0.01)
    ## fullTimeTRUE                                1.00 ***
    ##                                            (0.11)
    ## firmSize2. 30-299                          -0.06
    ##                                            (0.11)
    ## firmSize3. 300+                             1.32 ***
    ##                                            (0.15)
    ## multipleLocations                           0.37 ***
    ##                                            (0.11)
    ## unionized                                   1.24 ***
    ##                                            (0.18)
    ## ----------------------------------------------------
    ## R^2                           0.60          0.71
    ## Adj. R^2                      0.56          0.66
    ## Num. obs.                 18199         10688
    ## RMSE                          4.38          3.97
    ## ====================================================
    ## *** p < 0.001, ** p < 0.01, * p < 0.05

The coefficient for numChildren is -0.3 and the estimated standard error is 0.09. We can tell that this is statistically significant at the 95% confidence level by noting that the standard error is well under half the coefficient magnitude, that the absolute value of the t-statistic is well above 1.96, or that the p-value (0.001) is well below the standard 0.05 threshold (these three things are equivalent).
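
The equivalence of these checks can be verified by hand. The following is a minimal sketch, assuming the rounded coefficient (-0.30) and standard error (0.09) printed in the table and a normal approximation to the sampling distribution; because the inputs are rounded, the results will differ slightly from the exact values R computes from the fitted model.

```r
# Rounded values copied from the Model 2 column of the regression table
beta_hat <- -0.30
se_hat <- 0.09

# Test statistic under the null hypothesis that the coefficient is zero
t_stat <- beta_hat / se_hat

# Two-sided p-value using the normal approximation (an assumption that is
# reasonable here because the sample is large)
p_value <- 2 * pnorm(-abs(t_stat))

# Approximate 95% confidence interval for the coefficient
ci_95 <- beta_hat + c(-1, 1) * 1.96 * se_hat

t_stat
p_value
ci_95
```

If the absolute value of t_stat exceeds 1.96, then p_value falls below 0.05 and ci_95 excludes zero, so all three checks lead to the same decision.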
The coefficient suggests that each additional child that a woman has (keeping constant all other characteristics included in the model) is associated with a decrease of 30 cents in her hourly wage. This implies that even when accounting for these additional control variables, in addition to the time and unit fixed effects, the effect of additional children on women’s wages appears to be negative.

The following is an example interpretation of marital status, a categorical variable. The baseline category is “Cohabiting”. The coefficient for “Married” is 0.75 and significant, meaning that we expect married women to earn 75 cents more per hour than otherwise comparable cohabiting women. Women “Not in a romantic union”, by contrast, on average earn 26 cents less per hour than comparable cohabiting women in our sample. However, we can see from the small t-statistic (-1.9) or relatively large p-value (0.057) that the coefficient is not significantly different from zero at the 95% confidence level. That is, the uncertainty around this estimate is too large for us to reject the null hypothesis that the true difference between cohabiting and no-union women is actually zero in the population.

9.3 Homework

There is no homework this week because of the final assessment.