Stata Self-Teaching Guide

Stata Guide and Assignments
Professor Thornton
Economics 515
Econometrics
INTRODUCTION
This guide provides an explanation of Stata commands required to do the in-class assignments and
homework for Econ 515. It does not explain all Stata capabilities or commands. Stata is a very powerful
statistical package, and if you desire to learn its full capabilities you should consult the Stata User’s
Manuals.
DATA SETS
In this guide, Stata is explained in the context of examples. The examples use the following data sets.
Each data set is contained in an Excel file. It is assumed that the Excel file is located on a USB flash drive in
drive E:.
WAGE
The data file WAGE contains a cross-section of 935 males. The variables are as follows. WAGE =
monthly earnings in dollars (year 2007 dollars). HOURS = average hours worked per week. IQ = IQ
score. KWW = knowledge of world work score. EDU = years of education. EXPER = years of work
experience. TENURE = years with current employer. AGE = age in years. MARRIED = dummy variable
for marital status (1 if married, 0 otherwise). BLACK = dummy variable for race (1 if black, 0 otherwise).
SOUTH = dummy variable for region of country where worker lives (1 if individual lives in south, 0
otherwise). URBAN = dummy variable for urban area (1 if individual lives in Standard Metropolitan
Statistical Area, 0 otherwise). SIBS = number of siblings. BRTHORD = birth order. MEDUC = mother’s
years of education. FEDUC = father’s years of education. Note that missing observations are denoted by
the number -999.
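Stata does not treat -999 as missing on its own, so the code would otherwise enter every calculation as an actual value. A minimal sketch of one way to recode it, using a made-up three-row data set (the values are invented for illustration and are not taken from WAGE):

```stata
* Hypothetical three-row data set; values are invented for illustration
clear
input wage meduc feduc
769  8    8
808 14   14
825 14 -999
end

* Convert the code -999 to Stata's missing value (.) in every variable
mvdecode _all, mv(-999)

* Count the observations that now have a missing value for feduc
count if missing(feduc)
```

After this step, commands such as summarize and regress leave out the missing observations automatically.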
SMOKE
The data file SMOKE contains a cross-section of 807 consumers. The variables are as follows. EDUC =
years of schooling. CIGPRIC = price of cigarettes in cents per pack. WHITE = dummy variable for race
(1 if white, 0 otherwise). AGE = age in years. INCOME = annual income in dollars. CIGS = number of
cigarettes smoked per day. RESTAURN = dummy variable for state smoking restrictions (1 if state has
restaurant smoking restrictions, 0 otherwise).
AUTO
The data file AUTO contains annual data for General Motors and Chrysler Corporation for the period
1935 to 1954. The variables are as follows. Year = year. IGM = investment spending for General Motors.
PGM = expected profit for General Motors measured by market capitalization. CGM = desired capital
stock for General Motors measured by actual capital stock. IC = investment spending for Chrysler. PC =
expected profit for Chrysler measured by market capitalization. CC = desired capital stock for Chrysler
measured by actual capital stock. All variables except year are in millions of dollars.
WINE
The data file WINE consists of annual time-series data for the wine industry in Australia for the period 1956
to 1975. The variables are as follows. Q = real per capita amount of wine bought and sold. S = an index of
wine storage costs for producers. PW = real price of wine measured by the price of wine relative to the
consumer price index. PB = real price of beer measured by the price of beer relative to the consumer price
index. A = real per capita advertising expenditure on wine. Y = real per capita disposable income.
HEALTHPANEL
The data file HEALTHPANEL consists of panel data for 50 states for the years 1991 to 2000. The
variables are as follows. STATE is an id number for each state. YEAR is year. SPEND is healthcare
spending per capita in dollars. INC is income per capita in dollars. AGE65 is the percent of the population
65 years of age or older. INS is the percent of the population with health insurance coverage.
INMATE
The data file INMATE contains data on 1445 inmates in a North Carolina prison who served time, were
released, and were then followed for a period of time to determine whether they would be arrested again and, if so, the
amount of time that elapsed until that arrest. The variables are as follows. BLACK = 1 if black, 0
otherwise. ALCOHOL = 1 if a history of alcohol problems, 0 otherwise. DRUGS = 1 if a history of drug
problems, 0 otherwise. SUPER = 1 if release from prison was supervised, 0 otherwise. MARRIED = 1 if
married when sent to prison, 0 otherwise. FELON = 1 if a felony sentence, 0 otherwise. WORKPRG = 1
if a member of a prison work program, 0 otherwise. PROPERTY = 1 if a property crime, 0 otherwise.
PERSON = 1 if a crime against a person, 0 otherwise. PRIORS = the number of prior convictions. EDUC
= years of schooling. RULES = the number of rules violations in prison. AGE = age in months.
TSERVED = time served in prison in months. FOLLOW = length of time followed after release from
prison in months. DURAT = amount of time until arrested after release from prison, or until the inmate
was no longer followed, in months. CENS = 1 if DURAT is right censored, 0 otherwise.
STARTING STATA
To start Stata click on the Stata icon on the computer desktop. When Stata starts you will see five separate
windows. 1) Command window. 2) Results window. 3) Review window. 4) Variables window. 5)
Properties window. The command window is where you type commands. After typing a command, to
execute it press ENTER. The results window shows the output produced by the commands. The review
window lists the commands that have been entered in the command window. If you click on a command
you will move it back into the command window where you can edit and execute it. The variables
window lists all variables that are currently in memory. If you click on a variable its name is placed in the
command window. The properties window provides information on the characteristics of the variables
and data.
EXECUTING STATA COMMANDS AND PROGRAMS
Stata commands and Stata programs can be written and executed in three ways. 1) In the command window. 2) In a do file. 3)
Through the pull-down menus and dialogue boxes. The basic Stata language syntax is
command [varlist] [=exp] [if exp] [in range] [weight] [, options]
Square brackets denote optional qualifiers. Italics denote information that you provide depending on what
you want Stata to do. Command denotes a command name. Varlist denotes a list of individual variables.
Exp denotes an expression: an algebraic expression after the equals sign, or a logical expression after if. Range denotes an observation range. Weight denotes a weighting
expression. Options (preceded by a comma) denote a list of options. The shortest commands have a
command name only. Stata is case sensitive, and command names must be written in lowercase letters. Each command must be
written on its own line; a carriage return designates the end of a command. To allow a command to be
written on two or more consecutive lines in a do file, type #delimit ; on a line of its own. This tells Stata that you will end each command
with a semicolon. To turn off the delimiter, type #delimit cr. Many Stata commands can be abbreviated, but the shortest
allowed abbreviation differs from command to command. For example, summarize can be shortened to sum and regress to reg.
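The following sketch illustrates the semicolon delimiter and an abbreviated command name. It should be saved and run as a do file, because #delimit is not accepted in the command window. The variables x and y are invented for the illustration.

```stata
* Small invented data set so the do file runs on its own
clear
set obs 20
generate x = _n
generate y = 2*x

#delimit ;
summarize x
          y ;
#delimit cr

* sum is an accepted abbreviation of summarize
sum x
```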
STATA ASSIGNMENTS
The first Stata assignment involves learning how to create and open a Stata data file, create variables,
construct histograms and scatter diagrams, and calculate descriptive statistics. The data file wage will be
used for this assignment.
CREATING A STATA DATA FILE
Most data are stored in either an external Excel file or an ASCII (plain text) file and must be imported into Stata in the
form of a Stata data file.
Example
Create and save a Stata data file from the Excel data file named wage located on your USB flash drive in drive
e:. This file contains 935 observations on 16 variables and the names of the variables. The first row of the
Excel file contains the variable names. In the column under each variable name is the data for that
variable.
Steps
Launch Stata. Use the menu bar and select File → Import → Excel Spreadsheet. Click Browse… Go to
drive E:. Click on the file named wage. Click Import First Row As Variable Names. Click OK. To save
the Stata file named wage.dta, select File → Save as and type wage.dta in the File Name box. Click Save.
Comments
1. Versions that preceded Stata 12 do not allow you to import an Excel spreadsheet directly. In these
versions, launch Excel and save the Excel file wage.xls as a tab delimited text file wage.txt. Select
File → Import → ASCII data created by a spreadsheet and fill in the dialogue box. Alternatively,
type the following command in the command window: insheet using e:wage.txt.
2. If the data file is a tab delimited text file that contains data but no variable names, then you would use
the following command: infile wage hours iq kww edu exper tenure age married black south urban
sibs brthord meduc feduc using e:wage.txt. Note that the infile command is followed by the variable
names. Alternatively, you can use the menu bar and select File → Import → Unformatted ASCII data,
and fill in the dialogue box.
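In Stata 12 and later, the menu steps for an Excel file can also be carried out with the import excel command. The sketch below first writes a small invented workbook named wage_demo.xlsx (a hypothetical file name) to the current working directory so that it runs on its own. For the course file, the corresponding command would presumably be import excel using "e:\wage.xls", firstrow clear, assuming that is the file's location and extension.

```stata
* Build a tiny example workbook so the sketch runs on its own
clear
input wage hours
769 40
808 50
end
export excel using "wage_demo.xlsx", firstrow(variables) replace

* Read it back; firstrow treats the first row as variable names
import excel using "wage_demo.xlsx", firstrow clear
describe

* Save the imported data as a Stata data file
save "wage_demo.dta", replace
```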
OPENING A STATA DATA FILE
Example
Launch Stata and open the Stata data file named wage.dta.
Steps
Use the menu bar to select File → Open. Go to drive E: and click on wage. Click Open.
CREATING NEW VARIABLES
Existing variables in a data file can be used to create new variables by using the generate command.
Some often used operators are given below.
+        addition
-        subtraction
/        division
*        multiplication
^        power
ln()     natural logarithm
exp()    exponential
sqrt()   square root
Example
Create two new variables. 1) Logarithm of wage named logwage 2) Age squared named agesq
Commands
generate logwage=ln(wage)
generate agesq=age*age
CREATING DUMMY VARIABLES
To create dummy variables, use the generate command and logical expressions. Logical expressions are
also used when you add the if qualifier to a command line. Stata uses the following nine relational
and logical operators.
==       equal to
!=       not equal to
>        greater than
<        less than
>=       greater than or equal to
<=       less than or equal to
&        and
|        or
!        not
Note that the relational operator for equal to is a double equal sign.
Example
Create a dummy variable named college that takes a value of 1 if a worker has education beyond a high
school degree and 0 otherwise.
Command
generate college = (edu>12)
Comment
When Stata is given a logical expression, such as edu>12, it evaluates the expression and assigns a value
of 1 if true and 0 if false.
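One caution: Stata treats a missing value as larger than any number, so a logical expression such as edu>12 is true when edu is missing. The sketch below, on a made-up four-row data set, shows the difference an if qualifier makes. The variable name college_naive is introduced only for the comparison.

```stata
* Invented data: three workers with known education, one with edu missing
clear
input edu
10
12
16
.
end

* Without a qualifier, the worker with missing edu is coded college = 1
generate college_naive = (edu > 12)

* With the qualifier, college stays missing when edu is missing
generate college = (edu > 12) if !missing(edu)

list
```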
Example
Use the quantitative variable edu to create a qualitative variable with four educational categories. 1) Less
than high school (lhs). 2) High school (hs). 3) Some college, but no degree (scol). 4) College degree (col).
To do this you must create 4 dummy variables, one for each educational category. The commands are
generate lhs = (edu<12)
generate hs = (edu==12)
generate scol = (edu>12 & edu<16)
generate col = (edu>=16)
HISTOGRAMS AND SCATTER DIAGRAMS
The commands used to construct histograms and scatter diagrams are histogram and graph.
Example
Use the data file wage.dta and construct the histogram of the absolute frequency distribution for the
variable wage.
Command
histogram wage, frequency
Example
Construct a scatter diagram for the variables wage and edu.
Command
graph twoway scatter wage edu
DESCRIPTIVE STATISTICS
The commands used to calculate descriptive statistics are summarize and tabstat.
Example
Calculate the standard set of descriptive statistics (mean, standard deviation, minimum and maximum
values, and number of observations) for the variables wage edu exper married iq tenure age black south
urban.
Command
summarize wage edu exper married iq tenure age black south urban
Comment
If you include the option ,detail Stata will calculate additional descriptive statistics such as the median,
variance, percentiles, etc.
Example
Calculate specific descriptive statistics. These include the mean, variance, standard deviation, coefficient
of variation, maximum and minimum values, and number of observations for the variables wage edu
exper married iq tenure.
Command
tabstat wage edu exper married iq tenure, stats(mean variance sd cv min max n)
Example
Decompose the sample into the following three subsamples and calculate descriptive statistics for each
subsample. 1) Less than a high school education. 2) High school education. 3) Post high school education.
Commands
summarize wage edu exper married iq tenure if edu<12
summarize wage edu exper married iq tenure if edu==12
summarize wage edu exper married iq tenure if edu>12
Comments
1. The if qualifier specifies the observations to use based on the values that the variable edu takes.
2. The if edu==12 qualifier uses a double equals sign since a qualifier is a logical expression.
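The three subsample summaries can also be produced in a single table by building a category variable and using the by( ) option of tabstat. The sketch below uses simulated data as a stand-in for the wage file. The values are invented, and the name educat is a hypothetical choice.

```stata
* Simulated stand-in for the wage data (values are invented)
clear
set seed 515
set obs 300
generate edu  = 9 + floor(10*runiform())
generate wage = 300 + 50*edu + rnormal(0, 100)

* 1 = less than high school, 2 = high school, 3 = post high school
generate educat = 1 + (edu >= 12) + (edu > 12)

* One table with the statistics for each subsample
tabstat wage edu, by(educat) stats(mean sd min max n)
```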
Example
Calculate the sample correlation coefficients for the variables wage edu exper married iq tenure.
Command
correlate wage edu exper married iq tenure
CLASSICAL LINEAR REGRESSION MODEL
This Stata assignment involves learning how to estimate a classical linear regression model using the
ordinary least squares (OLS) estimator. The data file wage will be used for this assignment. The
command used to estimate a classical linear regression model using OLS is regress. Other useful
commands that can be used along with regress are predict, correlate, and mfx.
Example
Use the OLS estimator to run a linear regression of wage on edu, exper, tenure, iq and married.
Command
regress wage edu exper married iq tenure
Example
For the previous regression, save the predicted (fitted) values of wage and name this variable fitwage,
save the residuals and name this variable residuals, and display the variance covariance matrix of
estimates.
Commands
predict fitwage, xb
predict residuals, resid
estat vce
Comments
1. The predict and estat vce commands use information from the most recently estimated model.
2. The predict options xb and resid tell Stata to save the predicted values and residuals, respectively.
3. If you click on the data browser icon at the top of Stata, this will open a spreadsheet that contains
your data. You will see two new variables named fitwage and residuals.
Example
For the previous regression, estimate the elasticities of all variables, calculate standard errors for the
elasticity estimates, and t-statistics for the zero null hypothesis.
Command
mfx, eyex
Comments
1. The mfx command with option eyex uses information from the most recently estimated model.
2. The t-statistics are asymptotic t-statistics and are labeled z in the table reported by Stata.
3. If you use the mfx command without the option eyex, Stata will report estimates of the marginal
effects for the most recently estimated model.
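In Stata 11 and later, the margins command provides the same kind of calculation as mfx. The sketch below uses simulated data as a stand-in for the wage file (the values and the two-regressor model are invented for illustration). It computes elasticities at the sample means and then checks the elasticity for edu by hand as the coefficient times mean(edu) divided by mean(wage).

```stata
* Simulated stand-in for the wage data (values are invented)
clear
set seed 515
set obs 500
generate edu   = 9 + floor(10*runiform())
generate exper = 1 + floor(20*runiform())
generate wage  = 300 + 50*edu + 10*exper + rnormal(0, 100)

regress wage edu exper

* Elasticities evaluated at the sample means of the regressors
margins, eyex(edu exper) atmeans

* Check for edu by hand: coefficient times mean(edu) / mean(wage)
quietly summarize edu
scalar mean_edu = r(mean)
quietly summarize wage
display _b[edu]*mean_edu/r(mean)
```

The hand calculation works because, in an OLS regression with a constant, the fitted value at the regressor means equals the mean of the dependent variable.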
Example
Use the OLS estimator to run a linear regression of the logarithm of wage on edu, exper, tenure, iq and
married. Calculate estimates of the elasticities of all variables.
Commands
generate logwage=ln(wage)
regress logwage edu exper married iq tenure
mfx, dyex
Comment
When estimating a log-linear functional form, to calculate estimates of elasticities use the option dyex
with the mfx command.
HYPOTHESIS TESTING
This Stata assignment involves learning how to test hypotheses using the t-test, F-test, asymptotic t-test,
Wald test, Likelihood ratio test, and Lagrange multiplier test. The data file wage will be used for this
assignment. The commands used to calculate test statistics are test, testnl, lrtest, lincom, and nlcom. Other
useful commands for testing hypotheses and imposing restrictions are estimates store, constraint, and
cnsreg.
Example
Run a regression of wage on edu, exper, tenure, iq, married, and test the following hypotheses using a t-test. 1) Job tenure has no effect on the wage. 2) The effect of one additional year of education on the wage
is equal to the effect of one additional year of work experience on the wage.
Commands
regress wage edu exper tenure iq married
lincom edu - exper
Comments
1. The regress command reports the t-statistic for the zero null hypothesis for each variable.
2. The lincom command estimates the value of a new coefficient that is equal to the difference between
the coefficients of edu and exper, calculates an estimate of the standard error of this new coefficient
estimate, and reports the t-statistic for the zero null hypothesis (the difference between the
coefficients of edu and exper is zero).
3. The lincom command uses information from the most recently estimated model.
Example
Run a regression of wage on edu, exper, tenure, iq, married, and test the following hypothesis using an
asymptotic t-test. 1) The effect of one additional year of education on the wage is equal to the square of
the effect of one additional year of job tenure on the wage.
Commands
regress wage edu exper married tenure iq
nlcom _b[edu] - _b[tenure]^2
Comments
1. The nlcom command estimates the value of a new coefficient that is equal to the difference between
the coefficient of edu and the square of the coefficient of tenure, calculates an estimate of the
asymptotic standard error of this new coefficient estimate, and reports the asymptotic t-statistic for the
zero null hypothesis (the difference between the coefficient of edu and the square of the coefficient of
tenure is zero).
2. The notation _b[edu] and _b[tenure] designates the coefficients of the variables edu and tenure.
3. The nlcom command uses information from the most recently estimated model.
Example
Run a regression of wage on edu, exper, tenure, iq, married, and test the following hypotheses using an F-test. 1) Education, experience, and marital status have no joint effect on the wage. 2) The effect of one
additional year of education on the wage is equal to the effect of ten additional iq points on the wage. 3)
The effect of one additional year of work experience on the wage is equal to the effect of one additional
year of job tenure on the wage and the effect of being married on the wage is equal to the effect of three
additional years of education on the wage.
Commands
regress wage edu exper married tenure iq
test (edu=0) (exper=0) (married=0)
test edu=10*iq
test (exper=tenure) (married=3*edu)
Comments
1. The test command calculates an F-statistic for one or more linear hypotheses for the most recently
estimated model.
2. When testing two or more hypotheses, each hypothesis is placed within parentheses.
Example
The unrestricted model is the regression of wage on edu exper tenure iq married. Test the following
hypothesis using a likelihood ratio test. 1) iq and job tenure have no joint effect on the wage, and
therefore should be dropped from the model.
Commands
regress wage edu exper married tenure iq
estimates store unrestricted
regress wage edu exper married
estimates store restricted
lrtest unrestricted restricted
Comments
1. The estimates store command saves the results for the prior regression and names the results
unrestricted and restricted. Any names can be chosen for the saved results. For example, unrestricted
could have been named A and restricted B.
2. The command lrtest uses the saved results to calculate the likelihood ratio statistic. The names of the
unrestricted and restricted model results can be listed in either order.
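The likelihood ratio statistic can also be checked by hand as twice the difference between the two log likelihoods, which regress saves in e(ll). The sketch below uses simulated data as a stand-in for the wage file. The values are invented and married is left out to keep the example short. The scalar names ll_u, ll_r, lr, and lr_stata are hypothetical.

```stata
* Simulated stand-in for the wage data (values are invented)
clear
set seed 515
set obs 500
generate edu    = 9 + floor(10*runiform())
generate exper  = 1 + floor(20*runiform())
generate tenure = floor(15*runiform())
generate iq     = rnormal(100, 15)
generate wage   = 300 + 50*edu + 10*exper + 5*tenure + 2*iq + rnormal(0, 100)

regress wage edu exper tenure iq
estimates store unrestricted
scalar ll_u = e(ll)

regress wage edu exper
estimates store restricted
scalar ll_r = e(ll)

lrtest unrestricted restricted
scalar lr_stata = r(chi2)

* Same statistic by hand; 2 degrees of freedom for the 2 restrictions
scalar lr = 2*(ll_u - ll_r)
display "LR = " lr "   p-value = " chi2tail(2, lr)
```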
Example
Run a regression of wage on edu, exper, tenure, iq, married, and test the following hypothesis using a
Wald test. 1) The effect of one additional year of education on the wage is equal to the square of the
effect of one additional year of job tenure on the wage and the effect of one additional year of work
experience on the wage is equal to the effect of three additional iq points on the wage.
Commands
regress wage edu exper married tenure iq
testnl (_b[edu] = _b[tenure]^2) (_b[exper]=_b[iq]*3)
Comments
1. The testnl command calculates a Wald-like statistic that has an approximate F-distribution in large
samples. This can be used to test nonlinear and/or linear hypotheses.
2. The testnl command uses information from the most recently estimated model.
Example
The restricted model is the regression of wage on edu, exper, married. Test the following hypothesis using
a Lagrange multiplier test. 1) iq and job tenure have no joint effect on the wage, and therefore should not
be added to the model.
Commands
regress wage edu exper married
predict residuals, resid
regress residuals edu exper married iq tenure
generate lm=e(r2)*e(N)
display lm
Comments
1. The first four commands perform the four steps required to calculate the Lagrange multiplier test
statistic. 1) Estimate the restricted model. 2) Save the residuals from the restricted model. 3) Regress
the residuals on all explanatory variables in the unrestricted model using OLS. 4) Calculate the LM
statistic by multiplying the r-squared statistic from the auxiliary regression by the sample size. The
fifth command displays the value of the test statistic.
2. The notation e(r2) and e(N) are the saved results for the r-squared statistic and sample size from the
most recent regression.
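Because the LM statistic is a single number, it can be stored as a scalar instead of a variable, and its p-value can be obtained from the chi-square distribution with degrees of freedom equal to the number of restrictions (2 here). The sketch below uses simulated data as a stand-in for the wage file. The values are invented, the model is shortened, and the names uhat and lm are hypothetical.

```stata
* Simulated stand-in for the wage data (values are invented)
clear
set seed 515
set obs 500
generate edu    = 9 + floor(10*runiform())
generate tenure = floor(15*runiform())
generate iq     = rnormal(100, 15)
generate wage   = 300 + 50*edu + 5*tenure + 2*iq + rnormal(0, 100)

* Steps 1 and 2: restricted model and its residuals
regress wage edu
predict uhat, resid

* Step 3: auxiliary regression on all explanatory variables
regress uhat edu iq tenure

* Step 4: LM statistic as a scalar, with its chi-square p-value
scalar lm = e(r2)*e(N)
display "LM = " lm "   p-value = " chi2tail(2, lm)
```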
Example
The unrestricted model is the regression of wage on edu, exper, tenure, iq, married. Impose the following
restriction on the unrestricted model. 1) The effect of one additional year of work experience on the wage
is equal to the effect of one additional year of job tenure on the wage.
Commands
constraint define 1 exper=tenure
cnsreg wage edu exper married tenure iq, constraints(1)
Example
The unrestricted model is the regression of wage on edu, exper, tenure, iq, married. Impose the following
two restrictions on the unrestricted model. 1) The effect of one additional year of work experience on the
wage is equal to the effect of one additional year of job tenure on the wage. 2) The effect of being married
on the wage is equal to the effect of three additional years of education on the wage.
Commands
constraint define 1 exper=tenure
constraint define 2 married=3*edu
cnsreg wage edu exper married tenure iq, constraints(1-2)
Comments
1. The constraint define command defines the constraint(s) to be imposed on the model. The constraint
define command is followed by the constraint number and the constraint expression.
2. The cnsreg command tells Stata to estimate a constrained regression. The constraints( ) option
identifies the constraint(s) to be imposed on the model.
GENERAL LINEAR REGRESSION MODEL
This Stata assignment involves heteroscedasticity and the general linear regression model. The data file
smoke will be used for this assignment. The command used to estimate a general linear regression model
with heteroscedasticity using a feasible generalized least squares estimator (weighted least squares estimator) is vwls.
Commands used to test for heteroscedasticity are estat hettest and estat imtest. A very useful command
option is robust, which is used to calculate White’s heteroscedasticity-robust standard errors.
Example
Create a new variable named agesq defined as the square of the variable age. Run a regression of cigs on
cigpric, income, educ, age, agesq, restaurn. Test the null hypothesis of no heteroscedasticity against the
alternative hypothesis of linear heteroscedasticity using the Breusch-Pagan test. Test the null hypothesis
of no heteroscedasticity against the alternative hypothesis of general heteroscedasticity using the White
test.
Commands
generate agesq=age*age
regress cigs cigpric income educ age agesq restaurn
estat hettest cigpric income educ age agesq restaurn
estat imtest, white
Comments
1. The command estat hettest performs a Breusch-Pagan test. All explanatory variables or a subset of
explanatory variables can be included in the variable list.
2. The command and option estat imtest, white perform a White test.
Example
Run a regression of cigs on cigpric, income, educ, age, agesq, restaurn. Test the null hypothesis of no
heteroscedasticity against the alternative hypothesis of general heteroscedasticity using the Wooldridge
test.
Commands
regress cigs cigpric income educ age agesq restaurn
predict f,xb
predict r,resid
generate f2=f*f
generate r2=r*r
regress r2 f f2
generate lm=e(r2)*e(N)
display lm
Comments
1. Stata does not have a command for the Wooldridge test, and therefore it is necessary to write the code
for the test yourself.
2. The definitions of the variables created with the above code are as follows. f = fitted values for cigs. r =
residuals. f2 = fitted values squared. r2 = residuals squared. lm = Lagrange multiplier test statistic.
Example
Run a regression of cigs on cigpric, income, educ, age, agesq, restaurn. Calculate White’s
heteroscedasticity-robust standard errors.
Command
regress cigs cigpric income educ age agesq restaurn, robust
Comment
The standard errors reported for this OLS regression are White’s heteroscedasticity-robust standard
errors.
Example
Delete the variables f, f2, r, r2 created in the previous example. Run a regression of cigs on cigpric,
income, educ, age, agesq, restaurn using a feasible generalized least squares estimator (weighted least squares
estimator) assuming an exponential form of heteroscedasticity.
Commands
drop f f2 r r2
regress cigs cigpric income educ age agesq restaurn
predict r,resid
generate r2=r*r
generate lr2=ln(r2)
regress lr2 cigpric income educ age agesq restaurn
predict f,xb
generate vare=exp(f)
generate sde=sqrt(vare)
vwls cigs cigpric income educ age agesq restaurn,sd(sde)
Comments
1. Command lines 2 through 9 produce an estimate of the standard deviation of the error term for the
regression of cigs on cigpric income educ age agesq restaurn. The definitions of the variables created
are as follows. r = residuals. r2 = residuals squared. lr2 = logarithm of squared residuals. f = fitted
values for lr2. vare = estimate of variance of error. sde = estimate of standard deviation of error.
2. The vwls command and option sd(sde) tells Stata to use the estimate of the standard deviation of the
error that you created (sde) to run a weighted least squares regression.
3. When analyzing state-level data that are averages of individuals within a state, the correct weight to
use for the weighted least squares estimator is the square root of the state population. Stata will take
the square root of the population and use it as the weight in the weighted least squares estimator if you include the qualifier
[aweight = population] after the variable list in the regress command. Population is the variable that
has data on state population; it can have any name you choose to give it.
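The sketch below illustrates the aweight qualifier on simulated state-level data. The variables spend, inc, and population and all of their values are invented. As a check, it also multiplies every variable, including the constant, by the square root of population and runs OLS without a constant, which should reproduce the same slope estimate.

```stata
* Simulated state-level averages (values are invented)
clear
set seed 515
set obs 50
generate population = 500000 + floor(10000000*runiform())
generate inc   = rnormal(30000, 4000)
* Error standard deviation shrinks with the square root of population
generate spend = 1000 + 0.05*inc + rnormal()*50000/sqrt(population)

* Weighted least squares through the aweight qualifier
regress spend inc [aweight = population]
scalar b_aw = _b[inc]

* Same estimator by hand: weight every term by sqrt(population)
generate double w       = sqrt(population)
generate double spend_w = spend*w
generate double inc_w   = inc*w
regress spend_w inc_w w, noconstant
display "aweight slope = " b_aw "   by-hand slope = " _b[inc_w]
```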
SEEMINGLY UNRELATED REGRESSIONS MODEL
This Stata assignment involves learning how to estimate the parameters of a seemingly unrelated
regressions model and test hypotheses. The data file auto will be used for this assignment. The command
to estimate the seemingly unrelated regressions model is sureg. Commands and options to test hypotheses
are also given in the examples that follow.
Example
Estimate the parameters of two investment demand equations jointly using Zellner’s SUR estimator.
Report the correlation matrix of residuals for the equations and do a Breusch-Pagan test to test the null
hypothesis that the errors for the two equations are not contemporaneously correlated against the
alternative hypothesis that the errors are correlated. For the General Motors investment demand equation,
the dependent variable is igm and the explanatory variables are pgm, cgm. For the Chrysler investment
demand equation, the dependent variable is ic and the explanatory variables are pc, cc.
Command
sureg (igm pgm cgm) (ic pc cc), corr
Comments
1. The variables for each equation are enclosed in parentheses with the dependent variable listed first.
2. The option corr tells Stata to report the residual correlation matrix and do a Breusch-Pagan test.
3. To estimate the equations using Zellner’s iterated SUR estimator add the option isure.
4. The command and option mfx, eyex can be used to calculate elasticities for the equations.
Example
Estimate the two investment demand equations jointly using Zellner’s iterated SUR estimator. Use a
Wald test to test the following two cross-equation restrictions. 1) The marginal effect of expected profit
on investment spending is the same for General Motors and Chrysler. 2) The marginal effect of desired
capital stock on investment spending is the same for General Motors and Chrysler.
Commands
sureg (igm pgm cgm) (ic pc cc), isure
test ([igm]pgm=[ic]pc) ([igm]cgm=[ic]cc)
Comments
1. When using the test command for the seemingly unrelated regressions model, a Wald test is
performed.
2. The expression for each restriction (hypothesis) is enclosed in parentheses. Each variable in the
expression must be preceded by the dependent variable in brackets of the equation in which that
variable resides.
3. To test one or more nonlinear restrictions with a Wald test use the command testnl.
4. The command lrtest can be used to do a likelihood ratio test. The commands lincom and nlcom can be
used to calculate point estimates and standard errors for linear and nonlinear combinations of
parameters. See assignment on hypothesis testing for details.
Example
Estimate the two investment demand equations jointly using Zellner’s iterated SUR estimator. Impose the
following two cross-equation restrictions. 1) The marginal effect of expected profit on investment
spending is the same for General Motors and Chrysler. 2) The marginal effect of desired capital stock on
investment spending is the same for General Motors and Chrysler.
Commands
constraint define 1 [igm]pgm=[ic]pc
constraint define 2 [igm]cgm=[ic]cc
sureg (igm pgm cgm) (ic pc cc), isure constraints(1-2)
IV ESTIMATION AND THE SIMULTANEOUS EQUATIONS STATISTICAL MODEL
This Stata assignment involves learning how to use the instrumental variable estimators two-stage least
squares (2sls) and three-stage least squares (3sls), estimate the parameters of a simultaneous equations
statistical model, and test hypotheses. The data file wine will be used for this assignment. The commands
for the 2sls and 3sls estimators are ivreg and reg3. Commands required to do a variety of tests are also
given in the examples that follow.
Example
Create six new variables that are the logarithms of the variables in the data file wine.
Commands
generate lq=ln(q)
generate ls=ln(s)
generate lpw=ln(pw)
generate lpb=ln(pb)
generate la=ln(a)
generate ly=ln(y)
Example
Estimate the parameters of the supply equation for wine and the inverse demand equation for wine using the
2sls estimator. Use a double log functional form for each equation. Treat lq and lpw as endogenous
variables and ls, ly, lpb, la as exogenous variables. For the supply equation, the dependent variable is lq
and the explanatory variables are lpw, ls. For the inverse demand equation, the dependent variable is lpw
and the explanatory variables are lq, ly, lpb, la.
Commands
ivreg lq ls (lpw=ly lpb la)
ivreg lpw ly lpb la (lq=ls)
Comments
1. The endogenous right-hand side variable is enclosed in parentheses followed by an equals sign and
the list of identifying instrumental variables.
2. If there is more than one endogenous right-hand side variable, then enclose all endogenous right-hand
side variables in parentheses followed by an equals sign and the list of identifying instrumental
variables. For example, if we treat both lpw and ls as endogenous in the supply equation, then the
command is ivreg lq (ls lpw=ly lpb la).
3. To report the first-stage regression, add the option first.
4. To calculate elasticities, use the command mfx, eyex.
5. To calculate White’s heteroscedasticity-robust standard errors, add the option robust.
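In Stata 10 and later, ivreg has been superseded by ivregress, which takes the estimator name (2sls) as its first argument and otherwise uses the same layout for the variable list. The sketch below runs the supply equation on simulated data that stand in for the wine file. The values and coefficients are invented and the sample is larger than the actual 20 years.

```stata
* Simulated stand-in for the wine data (values are invented)
clear
set seed 515
set obs 200
generate ls  = rnormal()
generate ly  = rnormal()
generate lpb = rnormal()
generate la  = rnormal()
generate u   = rnormal()
* Price responds to the demand shifters and is correlated with the supply error u
generate lpw = 0.5*ly + 0.3*lpb + 0.2*la + 0.5*u + rnormal()
generate lq  = 1 + 0.8*lpw - 0.4*ls + u

* 2SLS supply equation with first-stage output and robust standard errors
ivregress 2sls lq ls (lpw = ly lpb la), first vce(robust)

* First-stage summary statistics for instrument relevance
estat firststage
```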
Example
Estimate the parameters of the supply equation for wine. Test the following hypothesis using an
approximate F-test. 1) The price elasticity of supply of wine is equal to the storage cost elasticity of
supply of wine.
Commands
ivreg lq ls (lpw=ly lpb la)
test ls=lpw
Comments
1. To test one or more nonlinear restrictions with a Wald test use the command testnl.
2. The commands lincom and nlcom can be used to calculate point estimates and standard errors for
linear and nonlinear combinations of parameters. See assignment on hypothesis testing for details.
Example
Assess the relevance of the identifying instruments in the supply and inverse demand equations for wine.
To do this, calculate the F-statistic for the zero null hypothesis for each set of identifying instruments in
the first-stage regressions.
Commands
regress lpw ls ly lpb la
test (ly=0) (lpb=0) (la=0)
regress lq ls ly lpb la
test (ls=0)
Example
Use a Lagrange multiplier test to test the overidentifying restrictions for the wine supply equation.
Commands
ivreg lq ls (lpw=ly lpb la)
predict residuals, resid
regress residuals ls ly lpb la
generate lm=e(r2)*e(N)
display lm
Comment
These commands calculate the LM test statistic. This statistic must be compared to the critical value for a
chi square statistic with degrees of freedom equal to the number of overidentifying restrictions, which in
this example is 2.
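Instead of looking up the critical value, you can have Stata report the p-value directly. The following is a minimal sketch; the names uhat and lmstat are hypothetical and chosen only for illustration.

```stata
* LM test of the overidentifying restrictions in the supply equation
ivreg lq ls (lpw=ly lpb la)
predict uhat, resid                 // 2SLS residuals (hypothetical name)
regress uhat ls ly lpb la           // regress residuals on all exogenous variables
scalar lmstat = e(N)*e(r2)          // LM statistic = N times R-squared
display "LM statistic = " lmstat
* chi2tail(df, x) is the upper-tail probability of a chi-square with df = 2
display "p-value = " chi2tail(2, lmstat)
```

A small p-value is evidence against the overidentifying restrictions.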
Example
Use a Hausman test to test the null hypothesis that lpw is exogenous against the alternative hypothesis
that it is endogenous in the supply equation for wine.
Commands
regress lpw ls ly lpb la
predict residuals, resid
regress lq lpw ls residuals
Comment
The test statistic for the Hausman test is the t-statistic for the coefficient of the residual variable. To
perform the test, compare this t-statistic to the appropriate critical value for the t-statistic.
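The same test can be reported as an F-test by following the second regression with the test command. In this sketch the residual variable is named vhat, a hypothetical name used to avoid a clash with any variable called residuals created in an earlier example.

```stata
* Regression-based Hausman test for the endogeneity of lpw
regress lpw ls ly lpb la            // first-stage (reduced form) regression
predict vhat, resid                 // first-stage residuals (hypothetical name)
regress lq lpw ls vhat              // supply equation augmented with the residuals
test vhat                           // F-statistic equals the squared t-statistic on vhat
```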
Example
Estimate the parameters of the supply and inverse demand equations for wine jointly using 3sls.
Commands
reg3 (lq lpw ls) (lpw lq ly lpb la)
Comments
1. Each equation is enclosed within parentheses with the dependent variable listed first followed by the
explanatory variables. This tells Stata which variables are endogenous and exogenous in the system.
2. To add exogenous variables not included in the equations being estimated as instruments, add the
option exog( ), with the variable names listed within parentheses.
3. To indicate endogenous right-hand side variables that are not also dependent variables, add the option
endog( ), with the variable names listed within parentheses.
4. To test cross-equation restrictions, use the same commands as those used for the seemingly unrelated
regressions model.
5. To estimate the equations using ordinary least squares, two-stage least squares, or Zellner’s SUR
estimator, add the option ols, 2sls, or sure.
6. To estimate the equations using an iterated 3sls estimator, add the option ireg3.
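Comments 4 and 5 can be sketched as follows. Stata names each equation after its dependent variable, so coefficients are referenced as [lq]varname and [lpw]varname. The restriction tested here is purely illustrative and is not taken from the assignment.

```stata
* System estimated equation by equation with 2SLS instead of 3SLS
reg3 (lq lpw ls) (lpw lq ly lpb la), 2sls

* 3SLS followed by a cross-equation Wald test (illustrative restriction only)
reg3 (lq lpw ls) (lpw lq ly lpb la)
test [lq]lpw = [lpw]lq
```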
FIXED-EFFECTS AND RANDOM-EFFECTS REGRESSION MODELS FOR PANEL DATA
This Stata assignment involves learning how to estimate panel data regression models. The data file
healthpanel will be used for this assignment. The commands for estimating a fixed effects or random
effects model are xtreg and xtivreg. Commands required to test hypotheses and calculate descriptive
statistics are also given in the examples that follow.
Example
Prepare the panel data file to be analyzed.
Commands
iis state
tis year
Comment
Before using panel data commands, you must give Stata an index for units and time in your data file. The
command iis is the index for units, which in the data file healthpanel is states. The command tis is the
index for time, which in the data file healthpanel is year.
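If you are working with Stata 10 or later (an assumption about your installation), the single command xtset does the work of both iis and tis. A minimal sketch, assuming state and year are stored as numeric variables:

```stata
* Declare the panel: state is the unit index, year is the time index
xtset state year

* Optional check of the panel structure (number of units, periods, gaps)
xtdescribe
```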
Example
Calculate the mean, standard deviation, maximum and minimum values for the variables in your sample.
Command
xtsum
Comment
Stata reports the overall mean for each variable. It also reports three alternative measures of the standard
deviation, maximum, and minimum values: overall, between, and within. For example, the standard
deviations for the variable income measure the overall dispersion in income, the amount of dispersion in
income across states, and the amount of dispersion in income within states over time.
Example
Estimate a fixed-effects model for medical care spending using the fixed effects estimator. The dependent
variable is spend. The explanatory variables are inc, ins.
Command
xtreg spend inc ins, fe
Comments
1. The xtreg command uses the fixed-effects estimator, and therefore does not report estimates of the
state dummy variables. If you want to obtain estimates of the coefficients of the state dummy
variables you can use a least-squares dummy variable estimator.
2. Stata reports the F-statistic for the test of no fixed effects as standard output. The null hypothesis is no
fixed effects (classical linear regression model is appropriate). The alternative hypothesis is fixed
effects (fixed-effects model is appropriate).
3. To calculate elasticities, use the command mfx, eyex.
4. To calculate White’s heteroscedasticity-robust standard errors, add the option robust.
5. The test and testnl commands can be used to test linear and nonlinear restrictions using an F-test and
a Wald test.
6. The commands lincom and nlcom can be used to calculate point estimates and standard errors for
linear and nonlinear combinations of parameters.
Example
Estimate a fixed-effects model for medical care spending using the least squares dummy variable
estimator. Report the estimates of the coefficients of the state dummy variables.
Command
xi: regress spend inc ins i.state
Comments
1. The xi prefix to the regress command along with the i.state variable tells Stata to create 50 dummy
variables, one for each state, drop the dummy variable for state #1, and include the remaining 49
dummy variables as explanatory variables in the regression.
2. The estimates of the coefficients of inc and ins will be identical when using the fixed-effects
estimator or least-squares dummy variable estimator.
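One way to check comment 2 is to run both estimators back to back and compare the coefficients of inc and ins. The sketch below also uses areg, which is not covered elsewhere in this guide; it absorbs the state dummy variables without reporting them and should give the same slope estimates.

```stata
* Three routes to the same slope estimates for inc and ins
xtreg spend inc ins, fe                 // fixed-effects (within) estimator
xi: regress spend inc ins i.state       // least-squares dummy variable estimator
areg spend inc ins, absorb(state)       // dummy variables absorbed, not reported
```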
Example
Estimate a fixed-effects model for medical care spending using the two-stage least squares estimator.
Treat ins as the endogenous right-hand side variable and use age65 as the identifying instrumental
variable.
Command
xtivreg spend inc (ins=age65), fe
Example
Estimate a fixed-effects model with time effects for medical care spending that controls for unobserved
factors that differ across states but not over time, and unobserved factors that vary over time but not
across states.
Command
xi: xtreg spend inc ins i.year, fe
Comment
The xi prefix to the xtreg command along with the i.year variable tells Stata to create 10 dummy
variables, one for each year, drop the dummy variable for year 1991, and include the remaining 9 dummy
variables as explanatory variables in the regression along with an intercept. Stata then uses a fixed-effects
estimator to estimate the coefficients of this model. It reports the estimates of the time dummy variables,
but not state dummy variables. If you also want to obtain estimates of the state dummy variables you can
use the least-squares dummy variable estimator. The command is xi: regress spend inc ins i.year
i.state
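After estimating the model with time effects, you may want to test whether the year dummy variables are jointly significant. A minimal sketch, relying on the fact that xi names the variables it creates with the prefix _Iyear:

```stata
* Fixed-effects model with time effects
xi: xtreg spend inc ins i.year, fe

* Joint test that all year dummy coefficients are zero
testparm _Iyear*
```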
Example
Estimate a random-effects model for medical care spending.
Command
xtreg spend inc ins, re
Example
Estimate a random-effects model with time effects for medical care spending that accounts for correlated
errors and unobserved factors that vary over time but not across states.
Command
xi: xtreg spend inc ins i.year, re
Example
Estimate a random-effects model for medical care spending using the two-stage least squares estimator.
Treat ins as the endogenous right-hand side variable and use age65 as the identifying instrumental
variable.
Command
xtivreg spend inc (ins=age65), re
Example
Estimate a random-effects model and test the null-hypothesis of no random effects (the unit specific
errors are not correlated, and therefore the classical linear regression model is appropriate) against the
alternative hypothesis of random effects (the unit specific errors are correlated, and therefore the random
effects model is appropriate).
Commands
xtreg spend inc ins, re
xttest0
Example
Estimate two models of medical care spending: a fixed-effects model and a random-effects model. Use a
Hausman test to test which is the appropriate model. The null-hypothesis is that the random effects model
is appropriate (unit specific errors are not correlated with the right-hand side variables) against the
alternative hypothesis that the fixed-effects model is appropriate (unit specific errors are correlated with
the right-hand side variables).
xtreg spend inc ins, fe
estimates store fixed
xtreg spend inc ins, re
estimates store random
hausman fixed random
BINARY DISCRETE CHOICE REGRESSION MODELS
This Stata assignment involves learning how to estimate linear probability, probit, and logit discrete
choice regression models. The data file smoke will be used for this assignment. The commands for
estimating these models are regress, probit, dprobit, logit, and logistic. Commands required to test
hypotheses and calculate measures of goodness-of-fit are also given in the examples that follow.
Example
Create a dummy variable called smoker that takes a value of 1 if an individual smokes and 0 if an
individual does not smoke. Estimate a linear probability model for smoking. The dependent variable is
smoker and the explanatory variables are cigpric, restaurn, income, age, educ. Do White’s test for
heteroscedasticity. Estimate the linear probability model for smoking again and report White’s
heteroscedasticity-robust standard errors.
Commands
generate smoker=(cigs>0)
regress smoker cigpric restaurn income age educ
estat imtest, white
regress smoker cigpric restaurn income age educ, robust
Example
Estimate a probit model for smoking and report the estimates of the coefficients of the index function.
The dependent variable is smoker and the explanatory variables are cigpric, restaurn, income, age, educ.
Calculate the percent of correct predictions. Estimate the probit model for smoking again and report the
estimates of the marginal effects.
Commands
probit smoker cigpric restaurn income age educ
estat class
dprobit smoker cigpric restaurn income age educ
Comments
The command probit reports maximum likelihood estimates of the coefficients of the index function. The
dprobit command reports estimates of marginal effects. The estat class command calculates and reports
the percent of correct predictions.
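In Stata 11 or later (an assumption about your installation), marginal effects are usually obtained with the margins command after probit rather than with dprobit. The sketch below evaluates the effects at the sample means, which is roughly comparable to what dprobit reports for continuous explanatory variables.

```stata
* Probit coefficients of the index function
probit smoker cigpric restaurn income age educ

* Marginal effects of every explanatory variable, evaluated at the sample means
margins, dydx(*) atmeans

* Average marginal effects over the sample, a common alternative
margins, dydx(*)
```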
Example
Estimate the probit model for smoking. Use a Wald test to test the hypothesis that the effect of one more
year of education on the probability of smoking is equal to the effect of one more year of age.
Commands
probit smoker cigpric restaurn income age educ
test educ=age
Comments
1. To test one or more nonlinear restrictions with a Wald test use the command testnl.
2. The command lrtest can be used to do a likelihood ratio test. See assignment on hypothesis testing for
details.
3. The commands lincom and nlcom can be used to calculate point estimates and standard errors for
linear and nonlinear combinations of parameters. See assignment on hypothesis testing for details.
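As a sketch of comment 2, a likelihood ratio test compares an unrestricted model with a restricted model estimated on the same sample. The hypothesis tested here (age and educ have no joint effect on smoking) and the stored names unrestricted and restricted are chosen only for illustration.

```stata
* Unrestricted probit model
probit smoker cigpric restaurn income age educ
estimates store unrestricted

* Restricted model: age and educ excluded
probit smoker cigpric restaurn income
estimates store restricted

* Likelihood ratio test of the two exclusion restrictions
lrtest unrestricted restricted
```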
Example
Estimate a logit model for smoking and report the estimates of the coefficients of the index function.
Estimate the logit model for smoking again and report the estimates of the odds ratios.
logit smoker cigpric restaurn income age educ
logistic smoker cigpric restaurn income age educ
Comments
The command logit reports maximum likelihood estimates of the coefficients of the index function, while
the command logistic reports estimates of the odds ratios.
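The odds ratios are the exponentiated coefficients of the index function. A short sketch of two ways to see this, using the or option of logit and the stored coefficient _b[educ]:

```stata
* The or option makes logit report odds ratios, like logistic
logit smoker cigpric restaurn income age educ, or

* The stored coefficients are still those of the index function,
* so exponentiating one reproduces the odds ratio for educ
display exp(_b[educ])
```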
DURATION MODELS
This Stata assignment involves learning how to estimate duration models. The data file inmate will be
used for this assignment. The commands for estimating a Weibull parametric duration model, Cox
proportional hazards model, and Kaplan-Meier nonparametric survival function are streg, stcox, and sts
graph. Commands required to test hypotheses and plot curves are also given in the examples that follow.
Example
Prepare the duration data file to be analyzed.
Commands
recode cens (0=1) (1=0)
stset durat, failure(cens)
Comments
1. The recode command recodes the dummy variable cens so that cens=1 indicates an uncensored
observation and cens=0 indicates a censored observation. Stata requires that the indicator variable for
censored observations take a value of 1 if an observation is uncensored and a value of 0 if an
observation is censored.
2. Before using duration model commands, you must tell Stata what variable is the duration variable
and what observations are censored. The command stset tells Stata that durat is the duration variable.
The option failure(cens) tells Stata that the variable cens indicates which observations are uncensored
(cens=1) and which observations are censored (cens=0). Stata calls an uncensored observation a
failure.
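An alternative that avoids recoding is to tell stset which value of cens marks a failure. This sketch assumes cens still has its original coding (cens=1 for a censored observation), so it should be run instead of the two commands above, not after them.

```stata
* Run INSTEAD of the recode and stset commands above.
* In the original coding, cens==0 marks an uncensored observation (a failure).
stset durat, failure(cens==0)

* Optional summary of the duration data (number of failures, time at risk)
stsum
```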
Example
Estimate a Weibull parametric duration model of criminal recidivism. The dependent variable is durat.
The explanatory variables are tserved, educ, married. Report the estimates of the parameters of the model.
Plot the survival function and hazard function evaluated at the sample mean values of the variables. Test
the null hypothesis that educ and married have no joint effect on durat using a Wald test.
Commands
streg tserved educ married, d(weibull) time
stcurve, survival
stcurve, hazard
test (educ=0) (married=0)
Comments
1. The option d(weibull) tells Stata to estimate a parametric duration model that assumes the variable
durat has a Weibull distribution.
2. The option time tells Stata to report the estimates of the parameters of the model. These are also the
estimates of the parameters of the log median duration function. If you want Stata to report estimates
of the time ratios, replace the option time with the option tr. The time ratios are the exponentiated
estimates of the parameters of the log median duration function.
3. The options survival and hazard for the stcurve command tell Stata to plot survival and hazard
functions.
4. The testnl command can be used to test nonlinear restrictions using a Wald test.
5. The commands lincom and nlcom can be used to calculate point estimates and standard errors for
linear and nonlinear combinations of parameters.
Example
Estimate a Weibull parametric duration model of criminal recidivism. Report the estimates of the hazard
ratios. Use a Wald test to test the null hypothesis that the effect of one additional year of education on the
hazard rate is equal to the effect of 12 additional months of time served on the hazard rate.
Commands
streg tserved educ married, d(weibull)
test educ=12*tserved
Comment
If the option time or tr is not used, then Stata reports estimates of the hazard ratios. The hazard ratios are
the exponentiated estimates of the parameters of the log hazard function.
Example
Estimate a Cox proportional hazards model of criminal recidivism. The dependent variable is durat. The
explanatory variables are tserved, educ, married. Report the estimates of the parameters of the Cox
proportional hazard function. Estimate this model a second time. Report the estimates of the hazard
ratios.
Commands
stcox tserved educ married, nohr
stcox tserved educ married
Comments
1. The option nohr tells Stata to report the estimates of the parameters of the Cox proportional hazard
function. If this option is omitted, then Stata reports the estimates of the hazard ratios, which are the
exponentiated estimates of the parameters.
2. The test and testnl commands can be used to test linear and nonlinear restrictions using a Wald test.
3. The commands lincom and nlcom can be used to calculate point estimates and standard errors for
linear and nonlinear combinations of parameters.
Example
Construct a nonparametric Kaplan-Meier survival function for the variable durat.
Command
sts graph
Example
Construct two nonparametric Kaplan-Meier survival functions: one for married inmates and one for single
inmates. Use a log rank test to test the equality of these two survival functions.
Commands
sts graph if married==1
sts graph if married==0
sts test married
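To compare the two survival functions visually, it can help to plot them on the same graph. A minimal sketch using the by() option of sts graph:

```stata
* Kaplan-Meier survival functions for married and single inmates on one graph
sts graph, by(married)

* Log rank test of equality of the two survival functions
sts test married
```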