Homework 1

Eco420, Prof. Bill Even
OLS REGRESSION ASSIGNMENT, Fall 2013
The assignment is due Friday 9/20 by 4 p.m. Submit your assignment via email by the deadline. Late assignments will be penalized at the rate of 20 percentage points for every day (or part thereof) that the assignment is
overdue. All team members will receive the same grade unless someone convinces me that I should do otherwise.
Provide a type-written response to all the questions. Paste the relevant portion of the Stata log (both the Stata
commands and the output) beneath the relevant part of each question in this Word document and then provide a type-written explanation (or leave adequate space for handwritten explanations beneath the relevant Stata
code and results). Be sure to include enough Stata code that I can determine exactly how you generated
your data, variables, and results. If I am unable to determine what you did, I will assume that it is wrong.
The data set you will use for this exercise is contained in g:\eco\evenwe\marchcps. It is the same data set that you
used in our in-class Stata lab (cpsmar2011.dta). You will use the same sample and some of the same variables that you created there.
In addition to the variables you already created for wage, age, sex, race, and education, create the following:
1. Union: a dummy variable indicating whether a worker is covered by a union (based on a_unioncov in codebook).
2. Four dummy variables to describe a person’s marital status (based on a_maritl in codebook):
a. Married: if a_maritl>=1 and a_maritl<=3
b. Widowed: if a_maritl=4
c. Divorced: if a_maritl=5 or 6
d. Single: if a_maritl=7
3. A weight variable (based on a_fnlwgt in codebook); drop observations with a weight of zero.
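The variable definitions above could be coded along the following lines. This is only a sketch: the variable names union, married, widowed, divorced, and single are illustrative, and the assumption that a_unioncov equals 1 for covered workers must be checked against the codebook.

```stata
* Sketch only. ASSUMPTION: a_unioncov == 1 means "covered by a union";
* verify the actual coding in the codebook before using this.
gen union = (a_unioncov == 1)

* Marital status dummies, following the cutoffs listed in the assignment
gen married  = inrange(a_maritl, 1, 3)
gen widowed  = (a_maritl == 4)
gen divorced = inlist(a_maritl, 5, 6)
gen single   = (a_maritl == 7)

* Sampling weight; observations with zero weight are dropped
gen weight = a_fnlwgt
drop if weight == 0
```

Note that each logical expression evaluates to 0 when the underlying variable is missing, so any missing values of a_unioncov or a_maritl would need separate handling.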
QUESTIONS (relevant stata routines are provided in italics.)
1. (16 points)
a. Estimate the mean of the wage rate for men and women (see summarize).
b. Estimate a regression of the wage rate on an intercept and the female dummy (see reg).
c. How do the coefficients in your regression relate to the mean wages calculated in 1a? Explain the connection
clearly.
d. Using this simple regression, show that the mean of predicted wages for the entire sample, the male sample, and
the female sample match the actual means. [see predict command for reg.]
e. The CPS is a stratified random sample and weights are provided to allow the researcher to generate weighted averages that should match the population. The weight that you saved above is an “inverse probability weight” (i.e.
the inverse of the probability that a particular household is sampled). Hence, if a household has probability p of being sampled, it should count as 1/p (= weight) households when estimating the population mean. If the CPS were a
pure random sample, 1/p would be constant across households.
Type “help weight” in Stata to learn about the different types of weights allowed in Stata. Repeat (a)-(d) using
weights for each calculation. Note that summarize does not allow for pweights, but does allow for aweights or
fweights. If you use fweights, the weight must first be converted to an integer [replace weight=int(weight) ].
In calculation of simple means, the choice of aweight, pweight, or fweight doesn’t matter. In all cases, the result is
a simple weighted mean. In the case of regressions, the choice of weights matters for standard errors and t-statistics, but not coefficient estimates. Since the weights in the CPS are inverse probability weights, you should
use pweights for the regression.
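As a rough illustration of the weight syntax described above, the following sketch computes weighted means and a weighted regression. The names hourwage, female, and wagehat are assumptions; substitute whatever you called your own variables.

```stata
* Sketch only: hourwage, female, and wagehat are assumed variable names.
* Weighted means by sex (summarize accepts aweights, not pweights)
summarize hourwage [aweight = weight] if female == 0
summarize hourwage [aweight = weight] if female == 1

* Weighted regression using the inverse probability weights
regress hourwage female [pweight = weight]

* Predicted wages from the weighted regression, then their weighted means
predict wagehat
summarize wagehat [aweight = weight]
summarize wagehat [aweight = weight] if female == 0
summarize wagehat [aweight = weight] if female == 1
```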
2. (12 points)
a. Without weights, estimate the following two wage equations:
- Specification 1: include dummy variables for sex, union membership, and education.
- Specification 2: include the same variables as in Specification 1 plus age and age-squared.
Summarize your regression results (coefficients, t-stats) in a single table using outreg2 (see footnote 1). You may copy/paste the
output from outreg2 into this document. [Be sure to use outreg2 for this part of the exercise because I want you
to have the ability to summarize regression results for the remainder of the semester.]
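A minimal sketch of how outreg2 might be used to put two regressions in one table is shown here. The output file name hw1_table and the regressor names (hourwage, female, union, and the education dummies hsgrad, somecoll, collgrad) are placeholders, not names taken from the lab.

```stata
* Sketch only: file name and variable names below are placeholders.
* Install outreg2 once (it is a user-written program distributed via SSC)
ssc install outreg2

* Specification 1: first column of the table
regress hourwage female union hsgrad somecoll collgrad
outreg2 using hw1_table, word replace tstat ctitle(Spec 1)

* Specification 2: adds age and age-squared, appended as a second column
regress hourwage female union hsgrad somecoll collgrad age c.age#c.age
outreg2 using hw1_table, word append tstat ctitle(Spec 2)
```

The tstat option reports t-statistics beneath the coefficients instead of standard errors, and word writes a file that can be opened in Word.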
b. Based upon what we know about omitted variables bias, why does the coefficient on the union dummy change in
the observed direction when the age variables are added to the regression? Use a regression of union membership
on the omitted variables to show that the relevant conditions required for the observed change exist in this data set.
[Note: Whenever you run a regression, a variable called e(sample) is created as an indicator for which observations
were used in the regression sample. You will want to be sure that you are using the same sample for all parts of this
problem. For example, after you run a regression, gen insamp=e(sample) will create a dummy indicating whether
the observation was in that regression. In subsequent regressions, you can restrict it to the same sample with reg y x
if insamp==1]
3. (12 points)
a. Using the least educated group as the reference group for education and Specification 2 from question 2a, test the
null hypothesis that the intercept in the earnings equation is identical across all education groups. Interpret the result
(i.e., indicate whether the null is rejected and at what significance level). (See the test command in Stata.)
b. Repeat the test in 3a using the most educated group as the reference group.
c. How do the results of the tests in 3a and 3b compare? Be as precise as possible in your comparison. Why
should you have expected this?
d. Show how the coefficients from the first specification (3a) could have been used to estimate the coefficient on the
high school graduate dummy that you estimated in the second specification (3b). Explain.
4. (8 points)
Re-estimate the complete specification from (2a) using the natural log of wage in place of wage. Compare the coefficient on the female dummy in the wage and ln(wage) equations. How does the interpretation of the coefficient on
the female dummy change when you switch the dependent variable from wage to ln(wage)? Do the two coefficients seem to suggest similar quantitative differences between male and female wages? Provide a numerical comparison of the implied effect of sex on wages from the two specifications.
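One possible way to put the two female coefficients on a comparable footing is sketched here. The variable names are placeholders, and expressing the level gap relative to the mean male wage is just one reasonable choice of base, not the required one.

```stata
* Sketch only: hourwage, female, union, hsgrad, somecoll, collgrad are assumed names.
* Level equation: the female coefficient is a gap in wage units
regress hourwage female union hsgrad somecoll collgrad age c.age#c.age
summarize hourwage if female == 0 & e(sample)
display "gap as % of mean male wage: " 100*_b[female]/r(mean)

* Log equation: the female coefficient is approximately a proportional gap
gen lnwage = ln(hourwage)
regress lnwage female union hsgrad somecoll collgrad age c.age#c.age
display "approximate % gap: " 100*_b[female]
display "exact % gap:       " 100*(exp(_b[female]) - 1)
```

The last line uses the standard conversion of a dummy coefficient b in a log equation into a percentage effect, 100*(exp(b) - 1).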
5. (10 points)
Using the complete specification in 2b as the starting point, test the null hypothesis that the effect of union coverage
is identical for men and women while allowing for different intercepts by gender but constraining all other coefficients to be equal across gender. Interpret your results. [Note – interactions are useful here.]
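The interaction approach hinted at in the note can be sketched as follows. The variable names, including the interaction term fem_union, are illustrative only.

```stata
* Sketch only: all variable names are assumed, fem_union is a hypothetical name.
* Interaction lets the union effect differ by sex
gen fem_union = female*union

* female shifts the intercept; fem_union shifts the union coefficient for women;
* all other coefficients are constrained to be equal across gender
regress hourwage female union fem_union hsgrad somecoll collgrad age c.age#c.age

* Null hypothesis: the union effect is the same for men and women
test fem_union
```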
6. (7 points) Test the hypothesis that all coefficients including the intercept (using the complete specification in 2b)
are equal for men and women using ln(wage) as the dependent variable. Discuss the implications of your test statistic.
Footnote 1: The downloadable program “outreg2” is very useful for generating tables and t-stats. It requires a little effort
now, but saves a lot of work for you in the future. To download a third-party program, inside of Stata go to
“Help > Search”, choose “Search all”, and type in the name of the program you’re interested in. You will see a link that allows
you to install the software along with a help file. To get help with the program, type “help outreg2” into the command window.
7. (12 points)
To illustrate the effect of errors-in-variables, define 2 new variables for age.
gen bage1=age+10*invnorm(uniform());
gen bage2=age+50*invnorm(uniform());
invnorm(uniform()) generates a random draw from a N(0,1) distribution. (The trailing semicolons assume that #delimit ; is in effect.)
Notice that the variance of the noise in bage2 is 25 times that in bage1.
a. Estimate a wage equation with your female dummy, education dummies, union coverage, and age only. What
happens to the coefficient on age as noise is added (i.e. if actual age is replaced by bage1 or bage2)? Explain why
you should have expected this.
b. What happens to the coefficient on union coverage? What does this tell you about the relationship between union coverage and worker age? Use the stata command “correlate” to examine your prediction.
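The effect of errors-in-variables can also be seen in a small simulation that does not use the CPS data at all. The sketch below is purely illustrative: the seed, sample size, and parameter values are arbitrary assumptions, and because it begins with clear it should be run in a separate session so that it does not wipe your working data.

```stata
* Sketch only: simulated data with arbitrary, assumed parameter values.
clear
set seed 420
set obs 5000

* True regressor with standard deviation 10, and an outcome with true slope 0.5
gen x = rnormal(40, 10)
gen y = 5 + 0.5*x + rnormal(0, 5)

* Noisy versions of x, mimicking bage1 and bage2
gen x1 = x + rnormal(0, 10)
gen x2 = x + rnormal(0, 50)

* The slope should shrink toward zero as the noise grows
regress y x
regress y x1
regress y x2
```

Under classical measurement error, the slope converges to the true slope times var(x)/(var(x)+var(noise)), which is 0.5 for x1 and about 0.04 for x2 in this setup.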
8. (20 points) Use Stata to perform an Oaxaca-Blinder decomposition of the wage gap between men and women
using the complete specification employed in 2b (without the female dummy). Use the results to identify
a. how much of the wage gap between men and women can be accounted for by gender differences in all of the control variables.
b. how much of the wage gap between men and women can be accounted for by gender differences in
i. the level of education
ii. union membership
Note: While there are canned routines available for doing Oaxaca-Blinder decompositions in Stata, I want you to
learn how to use the matrix programming features. So you are required to use matrix (not canned routines) for this
problem.
After a regression, you can import the coefficient estimates into a matrix (call it beta1), the variance-covariance matrix into v1, and the matrix of means for a list of variables as follows:
regress hourwage age;
matrix beta1=get(_b); *puts coefficients in a row vector;
matrix v1=get(VCE); *gets variance-covariance matrix for beta1;
matrix accum xx=age female, means(xbar); *puts means of age & female in a matrix called xbar;
(Note: xbar will automatically include a column of ones in the last column. )
You can create a vector of means for females only with
matrix accum xx=age female if female==1, means(xbar2);
You can use matrix commands to manipulate the matrices. For example, when xbar and beta1 cover the same variables in the same order, the predicted mean at xbar is
matrix ybar=xbar*beta1';
You can also extract subvectors of a matrix. For example,
matrix xbarj=xbar[1,2];
creates a matrix containing the element in the first row and second column of xbar. Alternatively,
matrix xbarfem=xbar[1...,"female"];
creates a matrix containing the elements corresponding to the column with the female variable in it.
For other matrix commands, see the chapter on matrix programming in the Stata manual or go to
http://www.stata.com/help.cgi?matrix .
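Putting the matrix commands above together, a bare-bones decomposition might be sketched as follows. This is only an illustration with two regressors: the names hourwage, female, and union are assumed, the male coefficients are used as the reference (one of several possible choices), and e(b) is used as the current equivalent of get(_b).

```stata
* Sketch only: variable names are assumed; male coefficients are the chosen reference.
* Male equation: coefficients and regressor means over the estimation sample
regress hourwage age union if female == 0
matrix bm = e(b)
matrix accum xxm = age union if e(sample), means(xm)

* Female equation: same regressors in the same order
regress hourwage age union if female == 1
matrix bf = e(b)
matrix accum xxf = age union if e(sample), means(xf)

* Total gap in mean wages, and its split into two parts
matrix gap         = xm*bm' - xf*bf'
matrix explained   = (xm - xf)*bm'
matrix unexplained = xf*(bm - bf)'

matrix list gap
matrix list explained
matrix list unexplained
```

Because xm and xf each end with a 1 for the constant, they line up with the coefficient vectors, and explained plus unexplained equals gap by construction. The contribution of a single variable is the product of its element of (xm - xf) and its element of bm.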