Uploaded by Jinling Yang

9 Hypothesis tests and uncertainty in regression | Introduction to Quantitative Methods

9 Hypothesis tests and uncertainty in regression
Download slides (./lecture_slides/lecture8.pdf)
Download seminar/homework data (./data/motherhood_revisited.csv)
9.1 Overview
In the lecture this week we continued our discussion of statistical inference, and
particularly focussed on hypothesis tests and uncertainty in regression estimates.
We learned about the different steps of conducting a hypothesis test, and about
how to interpret both t-statistics and p-values. We saw the close connection
between hypothesis tests and confidence intervals, and drew attention to the fact
that observing a “statistically significant” result may not tell us anything about the
substantive significance of that result. We also discussed uncertainty in regression
models, and saw that our estimated regression coefficients are a quantity of
interest that will vary from sample to sample, just as with the difference in means.
Accordingly, we saw that we can also construct and interpret standard errors, t-statistics, p-values, and confidence intervals for our regression estimates.
In seminar this week, we will:
1. Practice conducting hypothesis tests for the difference in means.
2. Practice conducting hypothesis tests for regression coefficients.
3. Practice constructing confidence intervals for regression coefficients.
4. Revisit fixed-effect models for panel data.
Before coming to the seminar
1. Please read chapter 6, “Probability” and chapter 7, “Uncertainty” in
Quantitative Social Science: An Introduction
9.2 Seminar
In this seminar, we return to the example that we used in the midterm. In that
assignment, you used survey data to investigate the size of the wage penalty that
mothers face in the USA. Here, we will use an expanded version of that dataset,
which you can download from the link above.
The data file is motherhood_revisited.csv , which is a CSV file. Store this file in
your data folder as you have done in previous weeks. Then load the data into R:
motherhood <- read.csv("data/motherhood_revisited.csv")
The names and descriptions of variables are:
Name               Description
PUBID              ID of woman
year               Year of observation
wage               Hourly wage, in dollars
numChildren        Number of children that the woman has (in this wave)
age                Age in years
region             Name of region (North East = 1, North Central = 2, South = 3, West = 4)
urban              Geographical classification (urban = 1, otherwise = 0)
marstat            Marital status
educ               Level of education
school             School enrollment (enrolled = TRUE, otherwise = FALSE)
experience         Experience since 14 years old, in days
tenure             Current job tenure, in years
tenure2            Current job tenure in years, squared
fullTime           Employment status (employed full-time = TRUE, otherwise = FALSE)
firmSize           Size of the firm
multipleLocations  Multiple locations indicator (firm with multiple locations = 1, otherwise = 0)
unionized          Job unionization status (job is unionized = 1, otherwise = 0)
industry           Job’s industry type
hazardous          Hazard measure for the job (between 1 and 2)
regularity         Regularity measure for the job (between 1 and 5)
competitiveness    Competitiveness measure for the job (between 1 and 5)
autonomy           Autonomy measure for the job (between 1 and 5)
teamwork           Teamwork requirements measure for the job (between 1 and 5)
Question 1
What years are included in the data? How many women are included, and how
many person-years are included?
Reveal answer
# Number of years
length(unique(motherhood$year))
# Number of women
length(unique(motherhood$PUBID))
# Number of observations
nrow(motherhood)
## [1] 16
## [1] 1569
## [1] 18214
There are 16 unique years in this dataset. There are 1569 women in the
data and 18214 person-year observations.
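The counts above say how many distinct years there are, but not which ones. The following is a minimal sketch of how one might list them, run here on a small made-up data frame so that it is self-contained. The `toy` object and its values are hypothetical; with the real data you would substitute `motherhood`.

```r
# Hypothetical toy panel standing in for the motherhood data
toy <- data.frame(
  PUBID = c(1, 1, 1, 2, 2, 3),
  year  = c(2001, 2002, 2003, 2001, 2003, 2002)
)

# Which years appear in the data, in ascending order
years_included <- sort(unique(toy$year))
years_included

# First and last year in the data
range(toy$year)

# Number of person-year observations contributed by each woman
obs_per_woman <- table(toy$PUBID)
obs_per_woman
```

Tabulating observations per woman is also a quick way to see whether the panel is balanced, that is, whether every woman is observed in every year.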
Question 2
As in the midterm, create a new variable – isMother – that takes a value of 1 if
the woman has at least one child and a value of 0 otherwise.
motherhood$isMother <- ifelse(motherhood$numChildren > 0, 1, 0)
a. Calculate the difference in mean wages between women with children and
women without children.
Reveal answer
wage_mothers <- mean(motherhood$wage[motherhood$isMother == 1], na.rm = TRUE)
wage_not_mothers <- mean(motherhood$wage[motherhood$isMother == 0], na.rm = TRUE)
mother_not_mother_diff <- wage_mothers - wage_not_mothers
mother_not_mother_diff
## [1] 1.247316
In this sample, mothers earn on average 1.25 dollars more per hour
than non-mothers.
b. Calculate the standard error for the difference in means.
Reveal answer
The formula for the standard error of the difference in means is
$$SE(\hat{Y}_{X=1} - \hat{Y}_{X=0}) = \sqrt{\frac{Var(Y_{X=1})}{n_{X=1}} + \frac{Var(Y_{X=0})}{n_{X=0}}}$$
## Standard error
treat_var <- var(motherhood$wage[motherhood$isMother == 1], na.rm = TRUE)
control_var <- var(motherhood$wage[motherhood$isMother == 0], na.rm = TRUE)
treat_n <- sum(motherhood$isMother == 1, na.rm = TRUE)
control_n <- sum(motherhood$isMother == 0, na.rm = TRUE)
st_err <- sqrt(treat_var/treat_n + control_var/control_n)
st_err
## [1] 0.1007549
c. Calculate the t-statistic for the difference in means.
Reveal answer
# T-statistic
t_stat <- mother_not_mother_diff/st_err
t_stat
## [1] 12.37971
d. At the 95% confidence level, can we reject the null hypothesis that there is no
difference in the wage levels of mothers and non-mothers in the population?
Reveal answer
Yes, the t-statistic is much greater than 1.96, implying that we can
reject the null hypothesis of no difference. The intuition here is that it is
extremely unlikely that we would observe a difference in means this
large in our sample if it were true that there were no difference between
mothers and non-mothers in the population.
e. Use the t.test() function to conduct the same hypothesis test that you just
conducted manually. What is the p-value? Does the 95% confidence interval
include the value of 0?
Reveal answer
t.test(x = motherhood$wage[motherhood$isMother==1],
y = motherhood$wage[motherhood$isMother==0],
conf.level = 0.95)
##
##  Welch Two Sample t-test
##
## data:  motherhood$wage[motherhood$isMother == 1] and motherhood$wage[motherhood$isMother == 0]
## t = 12.38, df = 13709, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.049823 1.444810
## sample estimates:
## mean of x mean of y
##  11.45436  10.20704
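The p-value here is very small (below 2.2e-16, which is what R prints for
any p-value too small to display), which is consistent with the large
t-statistic we calculated above. The 95% confidence interval, which runs
from about 1.05 to 1.44, does not include zero. This is no coincidence: a
95% confidence interval excludes zero exactly when the corresponding
two-sided hypothesis test rejects the null hypothesis of no difference at
the 95% confidence level.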
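To make the link between the confidence interval and the hypothesis test concrete, here is a minimal sketch that rebuilds both from the difference in means and the standard error reported in the earlier answers. It uses the normal critical value rather than the Welch degrees of freedom that t.test() uses, so the interval should agree with the t.test() output only approximately.

```r
# Values reported in the answers above
diff_means <- 1.247316   # difference in mean wages
st_err     <- 0.1007549  # standard error of the difference

# 95% confidence interval using the normal critical value (about 1.96)
crit_95  <- qnorm(0.975)
ci_lower <- diff_means - crit_95 * st_err
ci_upper <- diff_means + crit_95 * st_err
c(ci_lower, ci_upper)

# Two-sided p-value from the t-statistic (normal approximation)
t_stat  <- diff_means / st_err
p_value <- 2 * pnorm(-abs(t_stat))
p_value

# The interval excludes zero exactly when |t| exceeds the critical value
excludes_zero <- ci_lower > 0 | ci_upper < 0
rejects_null  <- abs(t_stat) > crit_95
excludes_zero == rejects_null
```

The last comparison holds for any estimate and standard error, because "the estimate is more than 1.96 standard errors from zero" is the same condition in both cases.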
Question 2
a. Run a regression with wage as the outcome variable and numChildren as the
explanatory variable. What is the estimated coefficient on the variable
numChildren ? Provide a brief substantive interpretation of the coefficient.
Reveal answer
simple_ols_model <- lm(wage ~ numChildren, data = motherhood)
The coefficient on the variable numChildren implies that each
additional child that a woman has is associated with an increase of 43
cents in a woman’s hourly wage.
b. What is the standard error of the coefficient for numChildren ?
Reveal answer
We can find the values of the standard error associated with each
regression coefficient by using the summary() function:
summary(simple_ols_model)
##
## Call:
## lm(formula = wage ~ numChildren, data = motherhood)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -11.531  -4.138  -1.962   2.112  49.612
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.38796    0.05755 180.509   <2e-16 ***
## numChildren  0.43424    0.05052   8.596   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.583 on 18197 degrees of freedom
##   (15 observations deleted due to missingness)
## Multiple R-squared:  0.004044,  Adjusted R-squared:  0.003989
## F-statistic: 73.89 on 1 and 18197 DF,  p-value: < 2.2e-16
The standard error for the numChildren coefficient is 0.051.
c. Using the estimated coefficient and standard error for the numChildren
variable, conduct a hypothesis test where the null hypothesis is that this coefficient
is equal to zero in the population. Can you reject the null hypothesis at the 95%
confidence level? Can you reject the null hypothesis at the 99% confidence level?
Reveal answer
The formula for the test statistic for testing a null hypothesis that a
regression coefficient is equal to zero is:
$$t = \frac{\hat{\beta} - \beta_{H_0}}{\hat{\sigma}_{\hat{\beta}}} = \frac{\hat{\beta}}{\hat{\sigma}_{\hat{\beta}}}$$
So, to calculate t, we simply divide the estimated coefficient by the
standard error:
t_stat <- 0.43424/0.05052
The test statistic for the numChildren variable is 8.595, which is far
larger than the critical values for either the 95% (1.96) or 99% (2.58)
confidence levels. Accordingly, we can easily reject the null hypothesis
that the association between the number of children and a mother’s
wage in the population is equal to zero.
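The same estimate and standard error can also be turned into confidence intervals. The sketch below is one way to do this by hand, using only the numbers printed by summary(simple_ols_model) above. The variable names are illustrative choices, not part of the original answer.

```r
# Estimate and standard error reported by summary(simple_ols_model) above
beta_hat <- 0.43424
se_beta  <- 0.05052
df_resid <- 18197   # residual degrees of freedom from the same output

t_stat <- beta_hat / se_beta

# Critical values from the t distribution (very close to 1.96 and 2.58 here)
crit_95 <- qt(0.975, df = df_resid)
crit_99 <- qt(0.995, df = df_resid)

# Confidence intervals: estimate plus or minus critical value times SE
ci_95 <- beta_hat + c(-1, 1) * crit_95 * se_beta
ci_99 <- beta_hat + c(-1, 1) * crit_99 * se_beta
ci_95
ci_99

# Two-sided p-value for the null hypothesis that the coefficient is zero
p_value <- 2 * pt(-abs(t_stat), df = df_resid)
p_value
```

Neither interval contains zero, which matches the rejection of the null hypothesis at both confidence levels. With the fitted model available, `confint(simple_ols_model, "numChildren", level = 0.95)` should give the same 95% interval up to rounding.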
d. What is the meaning of rejecting the null hypothesis in this exercise? Does this
provide evidence of a causal relationship between the number of children and the
wage level of mothers?
Reveal answer
Whether or not we reject the null hypothesis of no association is a
different question from whether the coefficient represents a causal
effect. Here, rejecting the null hypothesis means that the relationship
between the number of children and the wage of the mother that we
observe in our sample would be very unlikely to arise by chance if the
association between those two quantities were zero in the population.
However, we should not forget that the association we observe in our
sample, however precisely estimated it may be, is still subject to
confounding by omitted variables. There are many ways in which
women who have more children differ from women with fewer children.
For instance, women with more children may also be older on average,
or they may have more experience, or different living situations. Each of
these characteristics may also be associated with higher wage levels,
and therefore even though we can reject the null hypothesis, we cannot
conclude that our regression estimate gives us an unbiased estimate of
the causal effect of children on their mothers’ wages.
Question 3
a. Create a box plot which depicts the distribution of wage for every year in the
data. What do you observe?
Reveal answer
boxplot(wage ~ year,
data = motherhood,
xlab = "Year",
ylab = "Wage")
There is a clear association between wage and year – women are on
average paid more in more recent years in the sample than in earlier
years.
b. Create a box plot which depicts the distribution of numChildren for every
year in the data. What do you observe?
Reveal answer
boxplot(numChildren ~ year,
data = motherhood,
xlab = "Year",
ylab = "Number of children")
There is a clear association between the number of children a woman
has and the sample year: women on average have more children in
more recent years in the sample than in earlier years.
Question 4
The analysis above reveals that there is significant over-time variation in women’s
average wages in our sample, and that there is also a strong relationship between
time and the number of children a woman has. It is therefore probable that “time” is
an important omitted variable in this analysis, and something that we might want to
control for.
In addition, we saw last week that when we are working with panel data, a powerful
strategy for overcoming omitted variable bias is to use a fixed-effect model, where
we include a different intercept term for each of the units in our data. In this
example, we have a panel where each woman represents a unit, and we have
repeated observations of the same women over time. There may be many factors
that vary across women, but that are stable within women over time, that are
related to both wage level and the number of children a woman has, and so a
fixed-effect model may again be helpful for ruling out omitted variable bias here.
Given this discussion, it seems natural that we might want to include two sets of
fixed-effects here: one set for units (women), and the other for time (year). This
reflects a general form of model for working with panel data called the two-way
fixed effects model, in which there is a fixed effect for each unit and a fixed effect
for each time period.
Run a two-way fixed-effect regression where the outcome is the wage and the
predictor is the number of children that a woman has. Include fixed effects for each
woman and each year. To do this, include the relevant variables within the
factor() function as a part of the model formula, as below:
two_way_fe_model <- lm(wage ~ numChildren + factor(PUBID) + factor(year), data = motherhood)
Note that this regression may take a minute or two to run!
Why do we use factor() here? Because both PUBID and year are stored as
numeric variables in the motherhood data, R will treat these as regular
explanatory variables by default. However, we want R to estimate a separate
intercept term for each unique value of these variables, and that is what
factor() tells R to do.
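To see what factor() changes, and why the fixed effects matter, it can help to try the idea on simulated data where the true answer is known. The following is a minimal sketch on a made-up panel, not the motherhood data. All names and parameter values are hypothetical, and the true effect of `x` on `y` is set to -1 by construction.

```r
set.seed(1)  # arbitrary seed, for reproducibility

# Simulated toy panel (all names and values here are hypothetical)
n_units   <- 50
n_periods <- 10
unit   <- rep(1:n_units, each = n_periods)
period <- rep(1:n_periods, times = n_units)

# Stable unit-level trait that raises both x and y (a confounder)
unit_trait <- rnorm(n_units, sd = 3)[unit]

# x and y both trend upwards over time; the true effect of x on y is -1
x <- unit_trait + 0.3 * period + rnorm(n_units * n_periods)
y <- 2 * unit_trait + 0.5 * period - 1 * x + rnorm(n_units * n_periods)
toy_panel <- data.frame(unit, period, x, y)

# Pooled regression that ignores the panel structure
pooled <- lm(y ~ x, data = toy_panel)

# Treating the identifiers as numbers estimates just one slope for each
numeric_ids <- lm(y ~ x + unit + period, data = toy_panel)

# factor() estimates a separate intercept per unit and per period
two_way <- lm(y ~ x + factor(unit) + factor(period), data = toy_panel)

coef(pooled)["x"]
coef(numeric_ids)["x"]
coef(two_way)["x"]
length(coef(numeric_ids))  # 4 coefficients
length(coef(two_way))      # 1 intercept + 1 slope + 49 unit + 9 period dummies
```

In this simulation the pooled slope comes out positive even though the true effect is negative, while the two-way fixed-effect estimate should land close to -1. This mirrors the kind of sign reversal that fixed effects can produce when stable unit characteristics are related to both the predictor and the outcome.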
Create a table of your fixed-effect model using screenreg() from the texreg
package. To avoid printing out the coefficients for all of the fixed effects, set
omit.coef = "year|PUBID". Interpret the coefficient associated with
numChildren in both statistical and substantive terms.
Reveal answer
library(texreg)
screenreg(list(two_way_fe_model),
omit.coef = "year|PUBID")
##
## =========================
##              Model 1
## -------------------------
## (Intercept)     -0.00
##                 (1.68)
## numChildren     -1.04 ***
##                 (0.06)
## -------------------------
## R^2              0.60
## Adj. R^2         0.56
## Num. obs.    18199
## RMSE             4.38
## =========================
## *** p < 0.001, ** p < 0.01, * p < 0.05
The coefficient on the variable numChildren implies that each
additional child that a woman has is associated with a decrease in
hourly wages of 1.041 dollars. The standard error for the numChildren
coefficient is 0.065 (the table rounds these two values to -1.04 and
0.06), which implies a test statistic of -16.019. We can therefore
reject the null hypothesis of no effect at every conventional
confidence level.
It is important to note that in this model, where we control for baseline
differences between women using the unit fixed-effects and differences
in wages over time using the time fixed-effects, the numChildren
coefficient is now negative. That is, once we account for the various
forms of omitted variable bias using the fixed-effect model, we find that
there is a negative and significant effect of children on women’s wages.
This is the opposite of the conclusion that we would have drawn from
the naive analysis in question 2.
Question 5
Estimate a new regression model, which still includes fixed effects for woman and
year, but which also includes the following variables:
Location ( region , urban )
Marital Status ( marstat )
Human Capital ( educ , school , experience , tenure , tenure2 )
Job Characteristics ( fullTime , firmSize , multipleLocations ,
unionized )
Report the coefficient and standard error associated with the numChildren
variable in this model. Is the coefficient still statistically significant? Provide a brief
substantive interpretation of this coefficient and the coefficients for any two other
variables.
Reveal answer
# Two-way fixed-effect model with location, marital status, human capital and job controls
two_way_fe_model_2 <- lm(wage ~ numChildren + factor(region) + urban + marstat + educ + school +
                           experience + tenure + tenure2 + fullTime + firmSize + multipleLocations +
                           unionized + factor(year) + factor(PUBID), data = motherhood)
library(texreg)
screenreg(list(two_way_fe_model, two_way_fe_model_2),
omit.coef = "year|PUBID")
##
## ====================================================
##                            Model 1       Model 2
## ----------------------------------------------------
## (Intercept)                   -0.00          3.46
##                               (1.68)        (2.05)
## numChildren                   -1.04 ***     -0.30 **
##                               (0.06)        (0.09)
## factor(region)2                             -2.22 ***
##                                             (0.46)
## factor(region)3                             -1.44 ***
##                                             (0.37)
## factor(region)4                             -0.07
##                                             (0.44)
## urban                                        0.20
##                                             (0.15)
## marstatMarried                               0.75 ***
##                                             (0.16)
## marstatNo romantic union                    -0.26
##                                             (0.14)
## educ2.High school                           -0.89 ***
##                                             (0.21)
## educ3.Some college                           0.33
##                                             (0.35)
## educ4.College                                3.24 ***
##                                             (0.31)
## schoolTRUE                                  -0.88 ***
##                                             (0.13)
## experience                                   0.33 ***
##                                             (0.04)
## tenure                                       0.31 ***
##                                             (0.06)
## tenure2                                     -0.02 ***
##                                             (0.01)
## fullTimeTRUE                                 1.00 ***
##                                             (0.11)
## firmSize2. 30-299                           -0.06
##                                             (0.11)
## firmSize3. 300+                              1.32 ***
##                                             (0.15)
## multipleLocations                            0.37 ***
##                                             (0.11)
## unionized                                    1.24 ***
##                                             (0.18)
## ----------------------------------------------------
## R^2                            0.60          0.71
## Adj. R^2                       0.56          0.66
## Num. obs.                  18199         10688
## RMSE                           4.38          3.97
## ====================================================
## *** p < 0.001, ** p < 0.01, * p < 0.05
The coefficient for ‘numChildren’ is -0.30 and the estimated standard
error is 0.09. We can tell that this is statistically significant at the 95%
confidence level by noting that the standard error is well less than half
the coefficient magnitude, that the absolute value of the t-statistic is
well above 1.96, or that the p-value (0.001) is well below the standard
0.05 threshold (these three things are equivalent). The coefficient
suggests that each additional child that a woman has (keeping constant
all other characteristics included in the model) is associated with a
decrease of 30 cents in her hourly wage.
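The three equivalent checks described above can be written out as a short sketch. It uses the rounded coefficient and standard error from the table and a normal approximation, both of which are simplifying assumptions, so the t-statistic and p-value it produces will differ slightly from those of the unrounded model.

```r
# Rounded values from the regression table above
beta_hat <- -0.30
se_beta  <- 0.09
crit_95  <- qnorm(0.975)   # about 1.96

# Check 1: coefficient magnitude versus roughly twice the standard error
check_se <- abs(beta_hat) > crit_95 * se_beta

# Check 2: absolute t-statistic versus the critical value
t_stat  <- beta_hat / se_beta
check_t <- abs(t_stat) > crit_95

# Check 3: two-sided p-value versus 0.05 (normal approximation)
p_value <- 2 * pnorm(-abs(t_stat))
check_p <- p_value < 0.05

c(check_se, check_t, check_p)
```

All three checks return the same answer because each is a rearrangement of the same inequality, |estimate| > 1.96 × standard error.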
This implies that even when accounting for these additional control
variables, in addition to the time and unit fixed-effects, the effect of
additional children on women’s wages appears to be negative.
The following is an example interpretation of marital status, a
categorical variable. The baseline category is “Cohabiting”. The
coefficient for “Married” is 0.75 and significant, meaning that we expect
married women to earn 75 cents more per hour than otherwise
comparable cohabiting women. Women “Not in a romantic union”, by
contrast, on average earn 26 cents less per hour than comparable
cohabiting women in our sample. However, we can see from the small
t-statistic (-1.9) or relatively large p-value (0.057) that the coefficient is
not significantly different from zero at the 95% confidence level. That is,
the uncertainty around this estimate is too large for us to reject the null
hypothesis that the true difference between cohabiting and no-union
women is actually zero in the population.
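Because every coefficient for a categorical variable is a comparison with the baseline category, changing the baseline changes the coefficients without changing the model. The sketch below illustrates this on made-up data. The `toy` data frame, its group means and the `status_m` variable are all hypothetical. By default R uses the first factor level as the baseline, which for text categories is the first in alphabetical order.

```r
set.seed(2)  # arbitrary seed, for reproducibility

# Hypothetical toy data: wages for three marital-status groups
status <- rep(c("Cohabiting", "Married", "No romantic union"), each = 40)
wage   <- c(rnorm(40, mean = 10), rnorm(40, mean = 11), rnorm(40, mean = 9.5))
toy    <- data.frame(wage, status = factor(status))

# Default baseline is the first level alphabetically: "Cohabiting"
fit_cohab <- lm(wage ~ status, data = toy)

# Same model with "Married" as the baseline category
toy$status_m <- relevel(toy$status, ref = "Married")
fit_married  <- lm(wage ~ status_m, data = toy)

coef(fit_cohab)    # Married and no-union gaps relative to cohabiting
coef(fit_married)  # Cohabiting and no-union gaps relative to married
```

The "Married" coefficient in the first model and the "Cohabiting" coefficient in the second are the same gap with opposite signs. The fitted values are identical, and only the reference point for interpretation changes.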
9.3 Homework
There is no homework this week because of the final assessment.