Stata Guide and Assignments
Professor Thornton
Economics 515 Econometrics 1

INTRODUCTION

This guide provides an explanation of Stata commands required to do the in-class assignments and homework for Econ 515. It does not explain all Stata capabilities or commands. Stata is a very powerful statistical package, and if you desire to learn its full capabilities you should consult the Stata User’s Manuals.

DATA SETS

In this guide, Stata is explained in the context of examples. The examples use the following data sets. Each data set is contained in an Excel file. It is assumed that the Excel file is located on a stick disk in drive E:.

WAGE

The data file WAGE contains a cross-section of 935 males. The variables are as follows.

WAGE = monthly earnings in dollars (year 2007 dollars).
HOURS = average hours worked per week.
IQ = IQ score.
KWW = knowledge of world work score.
EDU = years of education.
EXPER = years of work experience.
TENURE = years with current employer.
AGE = age in years.
MARRIED = dummy variable for marital status (1 if married, 0 otherwise).
BLACK = dummy variable for race (1 if black, 0 otherwise).
SOUTH = dummy variable for region of country where worker lives (1 if individual lives in south, 0 otherwise).
URBAN = dummy variable for urban area (1 if individual lives in Standard Metropolitan Statistical Area, 0 otherwise).
SIBS = number of siblings.
BRTHORD = birth order.
MEDUC = mother’s years of education.
FEDUC = father’s years of education.

Note that missing observations are denoted by the number -999.

SMOKE

The data file SMOKE contains a cross-section of 807 consumers. The variables are as follows.

EDUC = years of schooling.
CIGPRIC = price of cigarettes in cents per pack.
WHITE = dummy variable for race (1 if white, 0 otherwise).
AGE = age in years.
INCOME = annual income in dollars.
CIGS = number of cigarettes smoked per day.
RESTAURN = dummy variable for state smoking restrictions (1 if state has restaurant smoking restrictions, 0 otherwise).

AUTO

The data file AUTO contains annual data for General Motors and Chrysler Corporation for the period 1935 to 1954. The variables are as follows.

Year = year.
IGM = investment spending for General Motors.
PGM = expected profit for General Motors measured by market capitalization.
CGM = desired capital stock for General Motors measured by actual capital stock.
IC = investment spending for Chrysler.
PC = expected profit for Chrysler measured by market capitalization.
CC = desired capital stock for Chrysler measured by actual capital stock.

All variables except year are in millions of dollars.

WINE

The data file WINE consists of annual time-series data for the wine industry in Australia for the period 1956 to 1975. The variables are as follows.

Q = real per capita amount of wine bought and sold.
S = an index of wine storage costs for producers.
PW = real price of wine measured by the price of wine relative to the consumer price index.
PB = real price of beer measured by the price of beer relative to the consumer price index.
A = real per capita advertising expenditure on wine.
Y = real per capita disposable income.

HEALTHPANEL

The data file HEALTHPANEL consists of panel data for 50 states for the years 1991 to 2000. The variables are as follows.

STATE is an id number for each state.
YEAR is year.
SPEND is healthcare spending per capita in dollars.
INC is income per capita in dollars.
AGE65 is the percent of the population 65 years of age or older.
INS is the percent of the population with health insurance coverage.

INMATE

The data file INMATE contains data on 1445 inmates in a North Carolina prison who served time, were released, and were followed for a period of time to determine whether they would be arrested again and, if so, how much time elapsed until that arrest. The variables are as follows.

BLACK = 1 if black, 0 otherwise.
ALCOHOL = 1 if a history of alcohol problems, 0 otherwise.
DRUGS = 1 if a history of drug problems, 0 otherwise.
SUPER = 1 if release from prison was supervised, 0 otherwise.
MARRIED = 1 if married when sent to prison, 0 otherwise.
FELON = 1 if a felony sentence, 0 otherwise.
WORKPRG = 1 if a member of a prison work program, 0 otherwise.
PROPERTY = 1 if a property crime, 0 otherwise.
PERSON = 1 if a crime against a person, 0 otherwise.
PRIORS = the number of prior convictions.
EDUC = years of schooling.
RULES = the number of rules violations in prison.
AGE = age in months.
TSERVED = time served in prison in months.
FOLLOW = length of time followed after release from prison in months.
DURAT = amount of time until arrested after release from prison, or until the inmate was no longer followed, in months.
CENS = 1 if DURAT is right censored, 0 otherwise.

STARTING STATA

To start Stata, click on the Stata icon on the computer desktop. When Stata starts you will see five separate windows.

1) Command window.
2) Results window.
3) Review window.
4) Variables window.
5) Properties window.

The command window is where you type commands. After typing a command, press ENTER to execute it. The results window shows the output produced by the commands. The review window lists the commands that have been entered in the command window. If you click on a command, you will move it back into the command window, where you can edit and execute it. The variables window lists all variables that are currently in memory. If you click on a variable, its name is placed in the command window. The properties window provides information on the characteristics of the variables and data.

EXECUTING STATA COMMANDS AND PROGRAMS

Stata commands and a Stata program can be executed in three ways.

1) Command window.
2) Do file.
3) Dialogue box.

Commands can be written and executed in the command window, in a do file, or by using pull-down menus and dialogue boxes. The basic Stata language syntax is

    command [varlist] [=exp] [if exp] [in range] [weight] [, options]

Square brackets denote optional qualifiers.
Italics denote information that you provide depending on what you want Stata to do. Command denotes a command name. Varlist denotes a list of individual variables. Exp denotes an expression (algebraic in =exp, logical in if exp). Range denotes an observation range. Weight denotes a weighting expression. Options (preceded by a comma) denote a list of options. The shortest commands have a command name only.

All commands must be written in lowercase letters. Each command must be written on its own line; a carriage return designates the end of a command. To allow a command to be written on two or more consecutive lines, type #delimit ; before the command. This tells Stata that you will end each command with a semicolon. To turn off the delimiter, type #delimit cr. Many Stata commands can be abbreviated. For example, the summarize command can be shortened to sum.

STATA ASSIGNMENTS

The first Stata assignment involves learning how to create and open a Stata data file, create variables, construct histograms and scatter diagrams, and calculate descriptive statistics. The data file wage will be used for this assignment.

CREATING A STATA DATA FILE

Most data is contained in either an external Excel file or an ASCII file and must be imported into Stata in the form of a Stata data file.

Example

Create and save a Stata data file from the Excel data file named wage located on your stick disk in drive E:. This file contains 935 observations on 16 variables and the names of the variables. The first row of the Excel file contains the variable names. In the column under each variable name is the data for that variable.

Steps

Launch Stata.
Use the menu bar and select File → Import → Excel Spreadsheet.
Click Browse…
Go to drive E:.
Click on the file named wage.
Click Import First Row As Variable Names.
Click OK.
To save the Stata file named wage.dta, select File → Save as and type wage.dta in the File Name box.
Click Save.

Comments

1. Versions that preceded Stata 12 do not allow you to import an Excel spreadsheet directly. In these versions, launch Excel and save the Excel file wage.xls as a tab delimited text file wage.txt. Select File → Import → ASCII data created by a spreadsheet and fill in the dialogue box. Alternatively, type the following command in the command window:

    insheet using e:wage.txt

2. If the data file is a tab delimited text file that contains data but no variable names, then you would use the following command:

    infile wage hours iq kww edu exper tenure age married black south urban sibs brthord meduc feduc using e:wage.txt

Note that the infile command is followed by the variable names. Alternatively, you can use the menu bar and select File → Import → Unformatted ASCII data, and fill in the dialogue box.

OPENING A STATA DATA FILE

Example

Launch Stata and open the Stata data file named wage.dta.

Steps

Use the menu bar to select File → Open.
Go to drive E: and click on wage.
Click Open.

CREATING NEW VARIABLES

Existing variables in a data file can be used to create new variables by using the generate command. Some often used operators and functions are given below.

    +       addition
    -       subtraction
    /       division
    *       multiplication
    ^       power
    ln      natural logarithm
    exp     exponential
    sqrt    square root

Example

Create two new variables.

1) Logarithm of wage named logwage.
2) Age squared named agesq.

Commands

    generate logwage=ln(wage)
    generate agesq=age*age

CREATING DUMMY VARIABLES

To create dummy variables, use the generate command and logical expressions. Logical expressions are also used when you add the in or if qualifier to a command line. Stata uses the following relational and logical operators.

    ==      equal to
    !=      not equal to
    >       greater than
    <       less than
    >=      greater than or equal to
    <=      less than or equal to
    &       and
    |       or
    !       not

Note that the relational operator for equal to is a double equal sign.
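
Because the WAGE data set denotes missing observations by -999, a logical expression treats that code as an ordinary number unless it is recoded first. The following is a minimal sketch, assuming wage.dta is already open in memory; the variable names hsgrad and somecoll are hypothetical and chosen only for illustration. It uses mvdecode to convert -999 to Stata's missing value, then shows the difference between the single equals sign (assignment) and the double equals sign (comparison).

```stata
* Sketch only: assumes wage.dta is already open in memory.
* Recode the -999 missing-value code to Stata's system missing value (.)
mvdecode _all, mv(-999)

* A single equals sign assigns; a double equals sign compares
generate hsgrad = (edu == 12) if !missing(edu)

* Stata treats a missing value as larger than any number, so a
* "greater than" comparison is guarded to keep missing observations
* from being coded as 1 (hypothetical variable name)
generate somecoll = (edu > 12) if !missing(edu)

* Relational and logical operators combined in an if qualifier
count if married == 1 & south == 0
```

The if !missing() qualifier leaves the new variable missing wherever edu is missing, which is usually preferable to silently assigning a 0 or a 1.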

Example

Create a dummy variable named college that takes a value of 1 if a worker has education beyond a high school degree and 0 otherwise.

Command

    generate college = (edu>12)

Comment

When Stata is given a logical expression, such as edu>12, it evaluates the expression and assigns a value of 1 if true and 0 if false.

Example

Use the quantitative variable edu to create a qualitative variable with four educational categories.

1) Less than high school (lhs).
2) High school (hs).
3) Some college, but no degree (scol).
4) College degree (col).

To do this you must create 4 dummy variables, one for each educational category. The commands are

    generate lhs = (edu<12)
    generate hs = (edu==12)
    generate scol = (edu>12 & edu<16)
    generate col = (edu>=16)

HISTOGRAMS AND SCATTER DIAGRAMS

The commands used to construct histograms and scatter diagrams are histogram and graph.

Example

Use the data file wage.dta and construct the histogram of the absolute frequency distribution for the variable wage.

Command

    histogram wage, frequency

Example

Construct a scatter diagram for the variables wage and edu.

Command

    graph twoway scatter wage edu

DESCRIPTIVE STATISTICS

The commands used to calculate descriptive statistics are summarize and tabstat.

Example

Calculate the standard set of descriptive statistics (mean, standard deviation, minimum and maximum values, and number of observations) for the variables wage edu exper married iq tenure age black south urban.

Command

    summarize wage edu exper married iq tenure age black south urban

Comment

If you include the option ,detail Stata will calculate additional descriptive statistics such as the median, variance, percentiles, etc.

Example

Calculate specific descriptive statistics. These include the mean, variance, standard deviation, coefficient of variation, maximum and minimum values, and number of observations for the variables wage edu exper married iq tenure.

Command

    tabstat wage edu exper married iq tenure, stats(mean variance sd cv min max n)

Example

Decompose the sample into the following three subsamples and calculate descriptive statistics for each subsample.

1) Less than a high school education.
2) High school education.
3) Post high school education.

Commands

    summarize wage edu exper married iq tenure if edu<12
    summarize wage edu exper married iq tenure if edu==12
    summarize wage edu exper married iq tenure if edu>12

Comments

1. The if qualifier specifies the observations to use based on the values that the variable edu takes.
2. The if edu==12 qualifier uses a double equals sign since a qualifier is a logical expression.

Example

Calculate the sample correlation coefficients for the variables wage edu exper married iq tenure.

Command

    correlate wage edu exper married iq tenure

CLASSICAL LINEAR REGRESSION MODEL

This Stata assignment involves learning how to estimate a classical linear regression model using the ordinary least squares (OLS) estimator. The data file wage will be used for this assignment. The command used to estimate a classical linear regression model using OLS is regress. Other useful commands that can be used along with regress are predict, correlate, and mfx.

Example

Use the OLS estimator to run a linear regression of wage on edu, exper, tenure, iq and married.

Command

    regress wage edu exper married iq tenure

Example

For the previous regression, save the predicted (fitted) values of wage and name this variable fitwage, save the residuals and name this variable residuals, and display the variance-covariance matrix of the estimates.

Commands

    predict fitwage, xb
    predict residuals, resid
    estat vce

Comments

1. The predict and estat vce commands use information from the most recently estimated model.
2. The predict options xb and resid tell Stata to save the predicted values and residuals, respectively.
3. If you click on the data browser icon at the top of Stata, this will open a spreadsheet that contains your data. You will see two new variables named fitwage and residuals.

Example

For the previous regression, estimate the elasticities of all variables, calculate standard errors for the elasticity estimates, and t-statistics for the zero null hypothesis.

Command

    mfx, eyex

Comments

1. The mfx command with option eyex uses information from the most recently estimated model.
2. The t-statistics are asymptotic t-statistics and are labeled Z in the table reported by Stata.
3. If you use the mfx command without the option eyex, Stata will report estimates of the marginal effects for the most recently estimated model.

Example

Use the OLS estimator to run a linear regression of the logarithm of wage on edu, exper, tenure, iq and married. Calculate estimates of the elasticities of all variables.

Commands

    generate logwage=ln(wage)
    regress logwage edu exper married iq tenure
    mfx, dyex

Comment

When estimating a log-linear functional form, to calculate estimates of elasticities use the option dyex with the mfx command.

HYPOTHESIS TESTING

This Stata assignment involves learning how to test hypotheses using the t-test, F-test, asymptotic t-test, Wald test, likelihood ratio test, and Lagrange multiplier test. The data file wage will be used for this assignment. The commands used to calculate test statistics are test, testnl, lrtest, lincom, and nlcom. Other useful commands for testing hypotheses and imposing restrictions are estimates store, constraint, and cnsreg.

Example

Run a regression of wage on edu, exper, tenure, iq, married, and test the following hypotheses using a t-test.

1) Job tenure has no effect on the wage.
2) The effect of one additional year of education on the wage is equal to the effect of one additional year of work experience on the wage.

Commands

    regress wage edu exper tenure iq married
    lincom edu - exper

Comments

1. The regress command reports the t-statistic for the zero null hypothesis for each variable.
2. The lincom command estimates the value of a new coefficient that is equal to the difference between the coefficients of edu and exper, calculates an estimate of the standard error of this new coefficient estimate, and reports the t-statistic for the zero null hypothesis (the difference between the coefficients of edu and exper is zero).
3. The lincom command uses information from the most recently estimated model.

Example

Run a regression of wage on edu, exper, tenure, iq, married, and test the following hypothesis using an asymptotic t-test.

1) The effect of one additional year of education on the wage is equal to the square of the effect of one additional year of job tenure on the wage.

Commands

    regress wage edu exper married tenure iq
    nlcom _b[edu] - _b[tenure]^2

Comments

1. The nlcom command estimates the value of a new coefficient that is equal to the difference between the coefficient of edu and the square of the coefficient of tenure, calculates an estimate of the asymptotic standard error of this new coefficient estimate, and reports the asymptotic t-statistic for the zero null hypothesis (the difference between the coefficient of edu and the square of the coefficient of tenure is zero).
2. The notations _b[edu] and _b[tenure] designate the coefficients of the variables edu and tenure.
3. The nlcom command uses information from the most recently estimated model.

Example

Run a regression of wage on edu, exper, tenure, iq, married, and test the following hypotheses using an F-test.

1) Education, experience, and marital status have no joint effect on the wage.
2) The effect of one additional year of education on the wage is equal to the effect of ten additional iq points on the wage.
3) The effect of one additional year of work experience on the wage is equal to the effect of one additional year of job tenure on the wage and the effect of being married on the wage is equal to the effect of three additional years of education on the wage.

Commands

    regress wage edu exper married tenure iq
    test (edu=0) (exper=0) (married=0)
    test edu=10*iq
    test (exper=tenure) (married=3*edu)

Comments

1. The test command calculates an F-statistic for one or more linear hypotheses for the most recently estimated model.
2. When testing two or more hypotheses, each hypothesis is placed within parentheses.

Example

The unrestricted model is the regression of wage on edu exper tenure iq married. Test the following hypothesis using a likelihood ratio test.

1) iq and job tenure have no joint effect on the wage, and therefore should be dropped from the model.

Commands

    regress wage edu exper married tenure iq
    estimates store unrestricted
    regress wage edu exper married
    estimates store restricted
    lrtest unrestricted restricted

Comments

1. The estimates store command saves the results for the prior regression and names the results unrestricted and restricted. Any names can be chosen for the saved results. For example, unrestricted could have been named A and restricted B.
2. The command lrtest uses the saved results to calculate the likelihood ratio statistic. The name of the unrestricted model results must precede the name of the restricted model results.

Example

Run a regression of wage on edu, exper, tenure, iq, married, and test the following hypothesis using a Wald test.

1) The effect of one additional year of education on the wage is equal to the square of the effect of one additional year of job tenure on the wage and the effect of one additional year of work experience on the wage is equal to the effect of three additional iq points on the wage.

Commands

    regress wage edu exper married tenure iq
    testnl (_b[edu] = _b[tenure]^2) (_b[exper]=_b[iq]*3)

Comments

1. The testnl command calculates a Wald-like statistic that has an approximate F-distribution in large samples. This can be used to test nonlinear and/or linear hypotheses.
2. The testnl command uses information from the most recently estimated model.

Example

The restricted model is the regression of wage on edu, exper, married. Test the following hypothesis using a Lagrange multiplier test.

1) iq and job tenure have no joint effect on the wage, and therefore should not be added to the model.

Commands

    regress wage edu exper married
    predict residuals, resid
    regress residuals edu exper married iq tenure
    generate lm=e(r2)*e(N)
    display lm

Comments

1. The first four commands perform the four steps required to calculate the Lagrange multiplier test statistic. 1) Estimate the restricted model. 2) Save the residuals from the restricted model. 3) Regress the residuals on all explanatory variables in the unrestricted model using OLS. 4) Calculate the LM statistic by multiplying the r-squared statistic from the auxiliary regression by the sample size. The fifth command displays the value of the test statistic.
2. The notations e(r2) and e(N) are the saved results for the r-squared statistic and sample size from the most recent regression.

Example

The unrestricted model is the regression of wage on edu, exper, tenure, iq, married. Impose the following restriction on the unrestricted model.

1) The effect of one additional year of work experience on the wage is equal to the effect of one additional year of job tenure on the wage.

Commands

    constraint define 1 exper=tenure
    cnsreg wage edu exper married tenure iq, constraints(1)

Example

The unrestricted model is the regression of wage on edu, exper, tenure, iq, married. Impose the following two restrictions on the unrestricted model.

1) The effect of one additional year of work experience on the wage is equal to the effect of one additional year of job tenure on the wage.
2) The effect of being married on the wage is equal to the effect of three additional years of education on the wage.

Commands

    constraint define 1 exper=tenure
    constraint define 2 married=3*edu
    cnsreg wage edu exper married tenure iq, constraints(1-2)

Comments

1. The constraint define command defines the constraint(s) to be imposed on the model. The constraint define command is followed by the constraint number and the constraint expression.
2. The cnsreg command tells Stata to estimate a constrained regression. The constraints( ) option identifies the constraint(s) to be imposed on the model.

GENERAL LINEAR REGRESSION MODEL

This Stata assignment involves heteroscedasticity and the general linear regression model. The data file smoke will be used for this assignment. The command used to estimate a general linear regression model with heteroscedasticity using a feasible least squares estimator (weighted least squares estimator) is vwls. Commands used to test for heteroscedasticity are estat hettest and estat imtest. A very useful command option is robust, which is used to calculate White’s heteroscedasticity-robust standard errors.

Example

Create a new variable named agesq defined as the square of the variable age. Run a regression of cigs on cigpric, income, educ, age, agesq, restaurn. Test the null hypothesis of no heteroscedasticity against the alternative hypothesis of linear heteroscedasticity using the Breusch-Pagan test. Test the null hypothesis of no heteroscedasticity against the alternative hypothesis of general heteroscedasticity using the White test.

Commands

    generate agesq=age*age
    regress cigs cigpric income educ age agesq restaurn
    estat hettest cigpric income educ age agesq restaurn
    estat imtest, white

Comments

1. The command estat hettest performs a Breusch-Pagan test. All explanatory variables or a subset of explanatory variables can be included in the variable list.
2. The command and option estat imtest, white performs a White test.
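
A Lagrange multiplier form of the Breusch-Pagan statistic can also be computed by hand with an auxiliary regression, in the same way this guide computes its other LM statistics. The following is a minimal sketch, assuming the smoke data are in memory and agesq has already been created; the names uhat, uhat2, and bp_lm are hypothetical. The resulting N times r-squared statistic should correspond to estat hettest run with its iid option rather than to the default output, so the two numbers reported by the commands above and by this sketch need not match.

```stata
* Sketch only: assumes the smoke data are in memory and agesq exists.
regress cigs cigpric income educ age agesq restaurn

* Save the OLS residuals and square them (hypothetical variable names)
predict double uhat, resid
generate double uhat2 = uhat^2

* Auxiliary regression of the squared residuals on the explanatory variables
regress uhat2 cigpric income educ age agesq restaurn

* LM statistic = sample size times r-squared from the auxiliary regression;
* degrees of freedom = number of explanatory variables, saved in e(df_m)
scalar bp_lm = e(N)*e(r2)
display "LM statistic = " bp_lm
display "p-value      = " chi2tail(e(df_m), bp_lm)
```

After this block runs, the most recently estimated model is the auxiliary regression, so the original regression must be re-estimated before using estat or predict on it again.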

Example

Run a regression of cigs on cigpric, income, educ, age, agesq, restaurn. Test the null hypothesis of no heteroscedasticity against the alternative hypothesis of general heteroscedasticity using the Wooldridge test.

Commands

    regress cigs cigpric income educ age agesq restaurn
    predict f,xb
    predict r,resid
    generate f2=f*f
    generate r2=r*r
    regress r2 f f2
    generate lm=e(r2)*e(N)
    display lm

Comments

1. Stata does not have a command for the Wooldridge test, and therefore it is necessary to write the code for the test yourself.
2. The definitions of the variables created with the above code are as follows.

    f = fitted values for cigs.
    r = residuals.
    f2 = fitted values squared.
    r2 = residuals squared.
    lm = Lagrange multiplier test statistic.

Example

Run a regression of cigs on cigpric, income, educ, age, agesq, restaurn. Calculate White’s heteroscedasticity-robust standard errors.

Command

    regress cigs cigpric income educ age agesq restaurn, robust

Comment

The standard errors reported for this OLS regression are White’s heteroscedasticity-robust standard errors.

Example

Delete the variables f, f2, r, r2 created in the previous example. Run a regression of cigs on cigpric, income, educ, age, agesq, restaurn using a feasible least squares estimator (weighted least squares estimator) assuming an exponential form of heteroscedasticity.

Commands

    drop f f2 r r2
    regress cigs cigpric income educ age agesq restaurn
    predict r,resid
    generate r2=r*r
    generate lr2=ln(r2)
    regress lr2 cigpric income educ age agesq restaurn
    predict f,xb
    generate vare=exp(f)
    generate sde=sqrt(vare)
    vwls cigs cigpric income educ age agesq restaurn,sd(sde)

Comments

1. Command lines 2 through 9 produce an estimate of the standard deviation of the error term for the regression of cigs on cigpric income educ age agesq restaurn. The definitions of the variables created are as follows.

    r = residuals.
    r2 = residuals squared.
    lr2 = logarithm of squared residuals.
    f = fitted values for lr2.
    vare = estimate of variance of error.
    sde = estimate of standard deviation of error.

2. The vwls command and option sd(sde) tell Stata to use the estimate of the standard deviation of the error that you created (sde) to run a weighted least squares regression.
3. When analyzing state-level data that are averages of individuals within a state, the correct weight to use for the weighted least squares estimator is the square root of the state population. Stata will take the square root of the population and use it as the weight for the weighted least squares estimator if you include the qualifier [aweight = population] after the variable list in the regress command. Population is the variable that has data on state population; it can have any name you choose to give it.

SEEMINGLY UNRELATED REGRESSIONS MODEL

This Stata assignment involves learning how to estimate the parameters of a seemingly unrelated regressions model and test hypotheses. The data file auto will be used for this assignment. The command to estimate the seemingly unrelated regressions model is sureg. Commands and options to test hypotheses are also given in the examples that follow.

Example

Estimate the parameters of two investment demand equations jointly using Zellner’s SUR estimator. Report the correlation matrix of residuals for the equations and do a Breusch-Pagan test to test the null hypothesis that the errors for the two equations are not contemporaneously correlated against the alternative hypothesis that the errors are correlated. For the General Motors investment demand equation, the dependent variable is igm and the explanatory variables are pgm, cgm. For the Chrysler investment demand equation, the dependent variable is ic and the explanatory variables are pc, cc.

Command

    sureg (igm pgm cgm) (ic pc cc), corr

Comments

1. The variables for each equation are enclosed in parentheses with the dependent variable listed first.
2. The option corr tells Stata to report the residual correlation matrix and do a Breusch-Pagan test.
3. To estimate the equations using Zellner’s iterated SUR estimator, add the option isure.
4. The command and option mfx, eyex can be used to calculate elasticities for the equations.

Example

Estimate the two investment demand equations jointly using Zellner’s iterated SUR estimator. Use a Wald test to test the following two cross-equation restrictions.

1) The marginal effect of expected profit on investment spending is the same for General Motors and Chrysler.
2) The marginal effect of desired capital stock on investment spending is the same for General Motors and Chrysler.

Commands

    sureg (igm pgm cgm) (ic pc cc), isure
    test ([igm]pgm=[ic]pc) ([igm]cgm=[ic]cc)

Comments

1. When using the test command for the seemingly unrelated regressions model, a Wald test is performed.
2. The expression for each restriction (hypothesis) is enclosed in parentheses. Each variable in the expression must be preceded by the dependent variable, in brackets, of the equation in which that variable resides.
3. To test one or more nonlinear restrictions with a Wald test, use the command testnl.
4. The command lrtest can be used to do a likelihood ratio test. The commands lincom and nlcom can be used to calculate point estimates and standard errors for linear and nonlinear combinations of parameters. See the assignment on hypothesis testing for details.

Example

Estimate the two investment demand equations jointly using Zellner’s iterated SUR estimator. Impose the following two cross-equation restrictions.

1) The marginal effect of expected profit on investment spending is the same for General Motors and Chrysler.
2) The marginal effect of desired capital stock on investment spending is the same for General Motors and Chrysler.

Commands

    constraint define 1 [igm]pgm=[ic]pc
    constraint define 2 [igm]cgm=[ic]cc
    sureg (igm pgm cgm) (ic pc cc), isure constraints(1-2)

IV ESTIMATION AND THE SIMULTANEOUS EQUATIONS STATISTICAL MODEL

This Stata assignment involves learning how to use the instrumental variable estimators two-stage least squares (2sls) and three-stage least squares (3sls), estimate the parameters of a simultaneous equations statistical model, and test hypotheses. The data file wine will be used for this assignment. The commands for the 2sls and 3sls estimators are ivreg and reg3. Commands required to do a variety of tests are also given in the examples that follow.

Example

Create six new variables that are the logarithms of the variables in the data file wine.

Commands

    generate lq=ln(q)
    generate ls=ln(s)
    generate lpw=ln(pw)
    generate lpb=ln(pb)
    generate la=ln(a)
    generate ly=ln(y)

Example

Estimate the parameters of the supply equation for wine and the inverse demand equation for wine using the 2sls estimator. Use a double log functional form for each equation. Treat lq and lpw as endogenous variables and ls, ly, lpb, la as exogenous variables. For the supply equation, the dependent variable is lq and the explanatory variables are lpw, ls. For the inverse demand equation, the dependent variable is lpw and the explanatory variables are lq, ly, lpb, la.

Commands

    ivreg lq ls (lpw=ly lpb la)
    ivreg lpw ly lpb la (lq=ls)

Comments

1. The endogenous right-hand side variable is enclosed in parentheses followed by an equals sign and the list of identifying instrumental variables.
2. If there is more than one endogenous right-hand side variable, then enclose all endogenous right-hand side variables in parentheses followed by an equals sign and the list of identifying instrumental variables. For example, if we treat both lpw and ls as endogenous in the supply equation, then the command is ivreg lq (ls lpw=ly lpb la).
3. To report the first-stage regression, add the option first.
4. To calculate elasticities, use the command mfx, eyex.
5. To calculate White’s heteroscedasticity-robust standard errors, add the option robust.

Example

Estimate the parameters of the supply equation for wine. Test the following hypothesis using an approximate F-test.

1) The price elasticity of supply of wine is equal to the storage cost elasticity of supply of wine.

Commands

    ivreg lq ls (lpw=ly lpb la)
    test ls=lpw

Comments

1. To test one or more nonlinear restrictions with a Wald test, use the command testnl.
2. The commands lincom and nlcom can be used to calculate point estimates and standard errors for linear and nonlinear combinations of parameters. See the assignment on hypothesis testing for details.

Example

Assess the relevance of the identifying instruments in the supply and inverse demand equations for wine. To do this, calculate the F-statistic for the zero null hypothesis for each set of identifying instruments in the first-stage regressions.

Commands

    regress lpw ls ly lpb la
    test (ly=0) (lpb=0) (la=0)
    regress lq ls ly lpb la
    test (ls=0)

Example

Use a Lagrange multiplier test to test the overidentifying restrictions for the wine supply equation.

Commands

    ivreg lq ls (lpw=ly lpb la)
    predict residuals, resid
    regress residuals ls ly lpb la
    generate lm=e(r2)*e(N)
    display lm

Comment

These commands calculate the LM test statistic. This statistic must be compared to the critical value for a chi-square statistic with degrees of freedom equal to the number of overidentifying restrictions, which in this example is 2.

Example

Use a Hausman test to test the null hypothesis that lpw is exogenous against the alternative hypothesis that it is endogenous in the supply equation for wine.

Commands

    regress lpw ls ly lpb la
    predict residuals, resid
    regress lq lpw ls residuals

Comment

The test statistic for the Hausman test is the t-statistic for the coefficient of the residual variable. To perform the test, compare this t-statistic to the appropriate critical value for the t-statistic.
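
The comparison with a critical value in the Hausman test can be carried out inside Stata. The following is a minimal sketch, assuming the wine data are in memory and the log variables lq, ls, lpw, ly, lpb, la have been created. The name vhat is hypothetical and is used so that the first-stage residuals do not clash with a variable named residuals left over from an earlier example; the 5 percent significance level is likewise an assumption made for illustration.

```stata
* Sketch only: assumes the wine data are in memory and the log
* variables lq, ls, lpw, ly, lpb, la already exist.

* First-stage regression of lpw on all exogenous variables
regress lpw ls ly lpb la
predict double vhat, resid

* Supply equation augmented with the first-stage residuals
regress lq lpw ls vhat

* Two-sided 5 percent critical value for the t-statistic on vhat,
* using the residual degrees of freedom from the regression above
display invttail(e(df_r), 0.025)

* Equivalent F-test of the zero null hypothesis for the coefficient
* of vhat (with one restriction, F equals the t-statistic squared)
test vhat
```

If the absolute value of the t-statistic reported for vhat exceeds the displayed critical value, the null hypothesis that lpw is exogenous is rejected at the chosen level.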
Example
Estimate the parameters of the supply and inverse demand equations for wine jointly using three-stage least squares (3SLS).

Command
reg3 (lq lpw ls) (lpw lq ly lpb la)

Comments
1. Each equation is enclosed within parentheses with the dependent variable listed first followed by the explanatory variables. This tells Stata which variables are endogenous and exogenous in the system.
2. To add exogenous variables not included in the equations being estimated as instruments, add the option exog( ), with the variable names listed within parentheses.
3. To indicate endogenous right-hand side variables that are not also dependent variables, add the option endog( ), with the variable names listed within parentheses.
4. To test cross-equation restrictions, use the same commands as those used for the seemingly unrelated regressions model.
5. To estimate the equations using ordinary least squares, two-stage least squares, or Zellner’s SUR estimator, add the option ols, 2sls, or sure.
6. To estimate the equations using an iterated 3SLS estimator, add the option ireg3.

FIXED-EFFECTS AND RANDOM-EFFECTS REGRESSION MODELS FOR PANEL DATA

This Stata assignment involves learning how to estimate panel data regression models. The data file healthpanel will be used for this assignment. The commands for estimating a fixed-effects or random-effects model are xtreg and xtivreg. Commands required to test hypotheses and calculate descriptive statistics are also given in the examples that follow.

Example
Prepare the panel data file to be analyzed.

Commands
iis state
tis year

Comment
Before using panel data commands, you must give Stata an index for units and an index for time in your data file. The command iis sets the index for units, which in the data file healthpanel is state. The command tis sets the index for time, which in the data file healthpanel is year.

Example
Calculate the mean, standard deviation, maximum, and minimum values for the variables in your sample.
Command
xtsum

Comment
Stata reports the overall mean for each variable. It also reports three alternative measures of the standard deviation, maximum, and minimum values: overall, between, and within. For example, the standard deviations for the variable inc measure the overall dispersion in income, the amount of dispersion in income across states, and the amount of dispersion in income within states over time.

Example
Estimate a fixed-effects model for medical care spending using the fixed-effects estimator. The dependent variable is spend. The explanatory variables are inc and ins.

Command
xtreg spend inc ins, fe

Comments
1. With the option fe, the xtreg command uses the fixed-effects estimator, and therefore does not report estimates of the coefficients of the state dummy variables. If you want to obtain estimates of the coefficients of the state dummy variables, you can use a least-squares dummy variable estimator.
2. Stata reports the F-statistic for the test of no fixed effects as standard output. The null hypothesis is no fixed effects (classical linear regression model is appropriate). The alternative hypothesis is fixed effects (fixed-effects model is appropriate).
3. To calculate elasticities, use the command mfx, eyex.
4. To calculate White’s heteroscedasticity-robust standard errors, add the option robust.
5. The test and testnl commands can be used to test linear and nonlinear restrictions using an F-test and a Wald test.
6. The commands lincom and nlcom can be used to calculate point estimates and standard errors for linear and nonlinear combinations of parameters.

Example
Estimate a fixed-effects model for medical care spending using the least-squares dummy variable estimator. Report the estimates of the coefficients of the state dummy variables.

Command
xi: regress spend inc ins i.state

Comments
1.
The xi prefix to the regress command along with the i.state variable tells Stata to create 50 dummy variables, one for each state, drop the dummy variable for state #1, and include the remaining 49 dummy variables as explanatory variables in the regression.
2. The estimates of the coefficients of inc and ins will be identical when using the fixed-effects estimator or the least-squares dummy variable estimator.

Example
Estimate a fixed-effects model for medical care spending using the two-stage least squares estimator. Treat ins as the endogenous right-hand side variable and use age65 as the identifying instrumental variable.

Command
xtivreg spend inc (ins=age65), fe

Example
Estimate a fixed-effects model with time effects for medical care spending that controls for unobserved factors that differ across states but not over time, and unobserved factors that vary over time but not across states.

Command
xi: xtreg spend inc ins i.year, fe

Comment
The xi prefix to the xtreg command along with the i.year variable tells Stata to create 10 dummy variables, one for each year, drop the dummy variable for year 1991, and include the remaining 9 dummy variables as explanatory variables in the regression along with an intercept. Stata then uses a fixed-effects estimator to estimate the coefficients of this model. It reports the estimates of the coefficients of the time dummy variables, but not the state dummy variables. If you also want to obtain estimates of the coefficients of the state dummy variables, you can use the least-squares dummy variable estimator. The command is
xi: regress spend inc ins i.year i.state

Example
Estimate a random-effects model for medical care spending.

Command
xtreg spend inc ins, re

Example
Estimate a random-effects model with time effects for medical care spending that accounts for correlated errors and unobserved factors that vary over time but not across states.
Command
xi: xtreg spend inc ins i.year, re

Example
Estimate a random-effects model for medical care spending using the two-stage least squares estimator. Treat ins as the endogenous right-hand side variable and use age65 as the identifying instrumental variable.

Command
xtivreg spend inc (ins=age65), re

Example
Estimate a random-effects model and test the null hypothesis of no random effects (the unit-specific errors are not correlated, and therefore the classical linear regression model is appropriate) against the alternative hypothesis of random effects (the unit-specific errors are correlated, and therefore the random-effects model is appropriate).

Commands
xtreg spend inc ins, re
xttest0

Example
Estimate two models of medical care spending: a fixed-effects model and a random-effects model. Use a Hausman test to test which is the appropriate model. The null hypothesis is that the random-effects model is appropriate (unit-specific errors are not correlated with the right-hand side variables). The alternative hypothesis is that the fixed-effects model is appropriate (unit-specific errors are correlated with the right-hand side variables).

Commands
xtreg spend inc ins, fe
estimates store fixed
xtreg spend inc ins, re
estimates store random
hausman fixed random

Comment
The hausman command expects the estimates that are consistent under both hypotheses (fixed) to be listed first, followed by the estimates that are efficient under the null hypothesis (random).

BINARY DISCRETE CHOICE REGRESSION MODELS

This Stata assignment involves learning how to estimate linear probability, probit, and logit discrete choice regression models. The data file smoke will be used for this assignment. The commands for estimating these models are regress, probit, dprobit, logit, and logistic. Commands required to test hypotheses and calculate measures of goodness-of-fit are also given in the examples that follow.

Example
Create a dummy variable called smoker that takes a value of 1 if an individual smokes and 0 if an individual does not smoke. Estimate a linear probability model for smoking.
The dependent variable is smoker and the explanatory variables are cigpric, restaurn, income, age, and educ. Perform White’s test for heteroscedasticity. Estimate the linear probability model for smoking again and report White’s heteroscedasticity-robust standard errors.

Commands
generate smoker=(cigs>0)
regress smoker cigpric restaurn income age educ
estat imtest, white
regress smoker cigpric restaurn income age educ, robust

Example
Estimate a probit model for smoking and report the estimates of the coefficients of the index function. The dependent variable is smoker and the explanatory variables are cigpric, restaurn, income, age, and educ. Calculate the percent of correct predictions. Estimate the probit model for smoking again and report the estimates of the marginal effects.

Commands
probit smoker cigpric restaurn income age educ
estat class
dprobit smoker cigpric restaurn income age educ

Comments
The command probit reports maximum likelihood estimates of the coefficients of the index function. The dprobit command reports estimates of marginal effects. The estat class command calculates and reports the percent of correct predictions.

Example
Estimate the probit model for smoking. Use a Wald test to test the hypothesis that the effect of one more year of education on the probability of smoking is equal to the effect of one more year of age.

Commands
probit smoker cigpric restaurn income age educ
test educ=age

Comments
1. To test one or more nonlinear restrictions with a Wald test, use the command testnl.
2. The command lrtest can be used to do a likelihood ratio test. See assignment on hypothesis testing for details.
3. The commands lincom and nlcom can be used to calculate point estimates and standard errors for linear and nonlinear combinations of parameters. See assignment on hypothesis testing for details.

Example
Estimate a logit model for smoking and report the estimates of the coefficients of the index function.
Estimate the logit model for smoking again and report the estimates of the odds ratios.

Commands
logit smoker cigpric restaurn income age educ
logistic smoker cigpric restaurn income age educ

Comment
The command logit reports maximum likelihood estimates of the coefficients of the index function, while the command logistic reports estimates of the odds ratios.

DURATION MODELS

This Stata assignment involves learning how to estimate duration models. The data file inmate will be used for this assignment. The commands for estimating a Weibull parametric duration model, Cox proportional hazards model, and Kaplan-Meier nonparametric survival function are streg, stcox, and sts graph. Commands required to test hypotheses and plot curves are also given in the examples that follow.

Example
Prepare the duration data file to be analyzed.

Commands
recode cens (0=1) (1=0)
stset durat, failure(cens)

Comments
1. The recode command recodes the dummy variable cens so that cens=1 indicates an uncensored observation and cens=0 indicates a censored observation. Stata requires that the indicator variable for censored observations take a value of 1 if an observation is uncensored and a value of 0 if an observation is censored.
2. Before using duration model commands, you must tell Stata which variable is the duration variable and which observations are censored. The command stset tells Stata that durat is the duration variable. The option failure(cens) tells Stata that the variable cens indicates which observations are uncensored (cens=1) and which observations are censored (cens=0). Stata calls an uncensored observation a failure.

Example
Estimate a Weibull parametric duration model of criminal recidivism. The dependent variable is durat. The explanatory variables are tserved, educ, and married. Report the estimates of the parameters of the model. Plot the survival function and hazard function evaluated at the sample mean values of the variables.
Test the null hypothesis that educ and married have no joint effect on durat using a Wald test.

Commands
streg tserved educ married, d(weibull) time
stcurve, survival
stcurve, hazard
test (educ=0) (married=0)

Comments
1. The option d(weibull) tells Stata to estimate a parametric duration model that assumes the variable durat has a Weibull distribution.
2. The option time tells Stata to report the estimates of the parameters of the model. These are also the estimates of the parameters of the log median duration function. If you want Stata to report estimates of the time ratios, replace the option time with the option tr. The time ratios are the exponentiated estimates of the parameters of the log median duration function.
3. The options survival and hazard for the stcurve command tell Stata to plot the survival and hazard functions.
4. The testnl command can be used to test nonlinear restrictions using a Wald test.
5. The commands lincom and nlcom can be used to calculate point estimates and standard errors for linear and nonlinear combinations of parameters.

Example
Estimate a Weibull parametric duration model of criminal recidivism. Report the estimates of the hazard ratios. Use a Wald test to test the null hypothesis that the effect of one additional year of education on the hazard rate is equal to the effect of 12 additional months of time served on the hazard rate.

Commands
streg tserved educ married, d(weibull)
test educ=12*tserved

Comment
If neither the option time nor the option tr is used, then Stata reports estimates of the hazard ratios. The hazard ratios are the exponentiated estimates of the parameters of the log hazard function.

Example
Estimate a Cox proportional hazards model of criminal recidivism. The dependent variable is durat. The explanatory variables are tserved, educ, and married. Report the estimates of the parameters of the Cox proportional hazard function. Estimate this model a second time. Report the estimates of the hazard ratios.
Commands
stcox tserved educ married, nohr
stcox tserved educ married

Comments
1. The option nohr tells Stata to report the estimates of the parameters of the Cox proportional hazard function. If this option is omitted, then Stata reports the estimates of the hazard ratios, which are the exponentiated estimates of the parameters.
2. The test and testnl commands can be used to test linear and nonlinear restrictions using a Wald test.
3. The commands lincom and nlcom can be used to calculate point estimates and standard errors for linear and nonlinear combinations of parameters.

Example
Construct a nonparametric Kaplan-Meier survival function for the variable durat.

Command
sts graph

Example
Construct two nonparametric Kaplan-Meier survival functions: one for married inmates and one for single inmates. Use a log rank test to test the equality of these two survival functions.

Commands
sts graph if married==1
sts graph if married==0
sts test married
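
As an alternative to drawing the two Kaplan-Meier curves on separate graphs, the by() option of sts graph can place both on one graph, which makes the comparison easier to read. The following is a minimal sketch, assuming the inmate data are in memory and have already been declared as duration data with stset as in the first example of this assignment. The sts list step is an optional addition, not part of the original example.

```stata
* Sketch only: assumes the inmate data are in memory and that
* stset durat, failure(cens) has already been run.

* Plot the Kaplan-Meier survival functions for married and single
* inmates on the same graph
sts graph, by(married)

* Optional: list the estimated survival functions for each group
sts list, by(married)

* Log rank test of the equality of the two survival functions
sts test married
```

The sts test command uses the log rank test by default, so no additional option is needed for the test requested in the example.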