A First Tutorial in Stata Stan Hurn Queensland University of Technology National Centre for Econometric Research www.ncer.edu.au Stan Hurn (NCER) Stata Tutorial 1 / 66 Table of contents 1 Preliminaries 2 Loading Data 3 Basic Descriptive Statistics 4 Basic Plotting 5 Simple Data Manipulation 6 Simple Linear Regression 7 Using do files 8 Some Regression Examples Electricity Data California Schools Data Food Expenditure and Income 9 Instrumental Variables Estimation Wage Data Artificial Data Stan Hurn (NCER) Stata Tutorial 2 / 66 Preliminaries Stata Stata is a fast, powerful statistical package with smart data-management facilities, a wide array of up-to-date statistical techniques, and an excellent system for producing publication-quality graphs The bad news is that Stata is NOT as easy to use as some other statistical packages, but Version 12 has got a reasonable menu-driven interface. On the whole the advantages probably outweigh the steepness of the initial learning curve. Stan Hurn (NCER) Stata Tutorial 3 / 66 Preliminaries Stata Resources One of the major advantages to using Stata is that there are a large number of helpful resources to be found. For example: a good web-based tutorial can be found at http://data.princeton.edu/stata/default.html a useful introductory book is An Introduction to Modern Econometrics Using Stata by Christopher F. Baum published by Stata Press in 2006 Stan Hurn (NCER) Stata Tutorial 4 / 66 Preliminaries The Stata 12 Front End for Mac Stan Hurn (NCER) Stata Tutorial 5 / 66 Preliminaries The Stata 12 Front End for Windows Stan Hurn (NCER) Stata Tutorial 6 / 66 Preliminaries Stata 12 Front End Stata has an menu bar on the top and 5 internal windows. The main window is the one in the middle (1 on the previous slide). It gives you the all output of you operations in Stata. The Command window (2) executes commands. You can type commands directly in this window as an alternative to using the menu system. The Review window (3), lists all the operations preformed since opening Stata. If you click on one of your past commands, you will see the command being displayed in the Command window and you can re-run it by hitting the enter key. The Variables window (4) lists the variables in the current dataset (and their descriptions). When you double-click on the variable, it appears in the Command window. The Properties window (5) gives information about your dataset and your variables. Stan Hurn (NCER) Stata Tutorial 7 / 66 Preliminaries Changing the Working Directory To avoid having to specify the path each time you wish to load a data file or run a Stata program (saved in a ”do” file), it is useful to changed the working directory so that Stata looks in the directory that you are currently working in. Click File – Change Working Directory Browse for the correct directory and select it. The result is printed out in the Results window and the appropriate Stata command is echoed in Review window enabling you to reconstruct a ”do” file of you session. Stan Hurn (NCER) Stata Tutorial 8 / 66 Loading Data Loading an Existing Stata File Simply click File – Open and browse for an existing Stata data file. Stata data files have extensions dta. Open the file food.dta. You will note that two variables food exp and income appear in the Variables window of the Stata main page. In the Properties window you will see the filename food.dta together with some information about the file. This file has 2 variables, each with 40 observations and the size of the file in memory is also given. Stan Hurn (NCER) Stata Tutorial 9 / 66 Loading Data Loading an Excel File Stan Hurn (NCER) Stata Tutorial 10 / 66 Loading Data Loading an Excel File Load the Excel file US Macroeconomic Data.xls Click File – Import – Excel Spreadsheet Browse for the correct file in the working directory and open it. Remember to check the radio button asking if you want to use the first row as variable names. Changes variable names in Stata is something of a mystery when using the Menu. But using the command window is easy enough. rename oldname newname will do the trick. Try it. NOTE Case matters: if you use an uppercase letter where a lowercase letter belongs, or vice versa, an error message will display. Stan Hurn (NCER) Stata Tutorial 11 / 66 Loading Data Loading a CSV File Load the CSV file taylor.csv which contains data on the output gap, the inflation gap and the Federal Funds rate for the period 1961:Q1 to 1999:Q4. Click File – Import – Text data created by a spreadsheet Browse for the file and load it. You should have data on the variables ffr, infl and ygap. To specify this as time series data we need a series of dates. The date vector (called ”year”) is created using the following commands generate year = tq(1961q1) + _n-1 To make sure Data understands that this is a time series data set we need to tell it to use ”year” as the date vector. The command is tsset year, quarterly The Stata menu command is to do this is found on the next slide. Stan Hurn (NCER) Stata Tutorial 12 / 66 Loading Data Assigning a Date Vector Stan Hurn (NCER) Stata Tutorial 13 / 66 Basic Descriptive Statistics Summary Statistics Reload the file food.dta. Now click Statistics and then choose Summaries, tables, and tests – Summary and descriptive statistics. Sometimes it is useful to have a look at the histogram of the data. Click Graphics – Histogram and experiment with some of the options. Another useful visual tool is the box plot. Click Graphics – Box plot Stan Hurn (NCER) Stata Tutorial 14 / 66 Basic Plotting Simple Scatter Click File – Open and browse for food.dta. This is a Stata data file. Click Grahics – Twoway and create a simple scatter plot of weekly food expenditure versus weekly income. Stan Hurn (NCER) Stata Tutorial 15 / 66 Basic Plotting Time Series Plots Let’s work through a simple example to construct a plot of the Australian business cycle. Click File – Import – Excel Spreadsheet and use the first row as variable names. This will give you a variable gdp. Make a time series data set by creating a quarterly date vector from 1959:Q2 to 1996:Q1 and make a time-series data set using dates as the time vector. The commands are generate dates = tq(1959q2) + _n-1 tsset dates, quarterly Plot the data. Stan Hurn (NCER) Stata Tutorial 16 / 66 Basic Plotting Australian GDP Stan Hurn (NCER) Stata Tutorial 17 / 66 Simple Data Manipulation Data Transformations Stata’s basic commands for data transformation are generate and replace. generate creates a new variable. replace modifies an existing variable. Both commands are accessed via the Data menu item on the main Stata toolbar. Stan Hurn (NCER) Stata Tutorial 18 / 66 Simple Data Manipulation generate and replace Stan Hurn (NCER) Stata Tutorial 19 / 66 Simple Data Manipulation Growth rate of Australian GDP Create a growth rate of gdp using the L. operator (lag operator) generate g = log(gdp)-log(L1.gdp) Stan Hurn (NCER) Stata Tutorial 20 / 66 Simple Data Manipulation Australian Business Cycle While the plot of the growth rate of gdp is more informative than a plot of the level of the series, yet more information can be obtained by smoothing g. generate bcycle = (L3.g+L2.g+L1.g+g+F1.g+F2.g+F3.g )/7 Stan Hurn (NCER) Stata Tutorial 21 / 66 Simple Data Manipulation Load the food data set 1 Make sure you are in the right working directory (File – Change Working Directory) 2 Load the dataset in food.dta and look at the data characteristics. 3 You can experiment using Statistics – Summaries, tables, and tests – Summary and descriptive statistics but it is simpler to issue the following commands from the command window. describe list browse summarize summarize food exp, detail Stan Hurn (NCER) Stata Tutorial 22 / 66 Simple Data Manipulation Simple scatter plots 1 Use Grahics – Twoway to create a simple scatter plot of weekly food expenditure versus weekly income. 2 Issue the command twoway (scatter food exp income) 3 Issue the command twoway (scatter food exp income), title(Food Expenditure Data) 4 Issue the command twoway (scatter food exp income) (lfit food exp income), title(Fitted Regression Line) The line of best fit is obtained by linear regression of food expenditure on income. We will now explore this in more detail. Stan Hurn (NCER) Stata Tutorial 23 / 66 Simple Linear Regression A First Regression 1 Load the data set caschool.dta. 2 Run a regression of the test scores, testscr , against the student-teacher ratio, str . You do this by selecting Statistics – Linear models and related – Linear regression. 3 A dialogue box will pop up which will require you to fill in the dependent and independent variable. Stan Hurn (NCER) Stata Tutorial 24 / 66 Simple Linear Regression Regression dialogue box Stan Hurn (NCER) Stata Tutorial 25 / 66 Simple Linear Regression Regression Results Stata reports the regression results as follows: The regression predicts that if class size falls by one student, the test scores will increase by 2.28 points. Stan Hurn (NCER) Stata Tutorial 26 / 66 Simple Linear Regression Predicted Values and Residuals A common task after running a regression is storing the fitted values, yb, or the b. Here you must become familiar with the very useful Statistics – residuals, u Postestimation menu. One option to select is Predictions, residuals, etc which gives the dialogue box Stan Hurn (NCER) Stata Tutorial 27 / 66 Simple Linear Regression Predicted Values and Residuals 1 Note that the names you choose for the predicted values and/or residuals cannot already be taken. Use something obvious like yfit or yhat for the fitted values and res or uhat for the residuals. 2 You can also use the Postestimation option to obtain confidence intervals for the prediction using the option Standard errors of the prediction. Save this as yhatci. The commands . gen yhatu = yhat+1.96*yhatci . gen yhatl = yhat - 1.96*yhatci will now generate a 95% confidence interval for the prediction. 3 To be more precise you could use the t-distribution rather than hard-code 1.96. The commands are . gen ttail = invttail(e(df_r),0.975) . gen yhatu = yhat+ttail*yhatci Note that e(df_r) is the way Stata stores the degrees of freedom for the residuals and invtttail computes the relevant critical value from the t-distribution. Stan Hurn (NCER) Stata Tutorial 28 / 66 Simple Linear Regression Predictions with 95% Confidence Interval Stan Hurn (NCER) Stata Tutorial 29 / 66 Simple Linear Regression Out-of-sample Prediction Obtaining out-of-sample predictions is a bit clunky and using the command line is probably the way to go. Suppose there are 40 observations in the data sample and you want to obtain an out-of-sample prediction for a value of the explanatory variable income = 20. The code is // add observation to data file edit set obs 41 replace income=20 in 41 // obtain prediction predict yhat0 list income yhat0 in 41 Stan Hurn (NCER) Stata Tutorial 30 / 66 Simple Linear Regression You should explore other visualisation options Stan Hurn (NCER) Stata Tutorial 31 / 66 Using do files Using do files A nice thing about Stata is that there is a simple way to save all your work steps so you or others can easily reproduce your analysis. The way to do so is using a so-called do file. Remember that all Stata does is to execute commands, which you either clicked on using the menu or directly typed in the Command window. A command is just one line of text (or code). If you want to save this command for later use, just copy it (simply click on it in the Review window and copy the line of text that comes up in the Command window) and paste it into the do file. The next slides describe how you can open and use a do file. Stan Hurn (NCER) Stata Tutorial 32 / 66 Using do files Where to open a new do file You can open a new do file by clicking on the “New Do file Editor” button below the menu (or press Ctrl+9): Stan Hurn (NCER) Stata Tutorial 33 / 66 Using do files Using a do file A do file is just a list of commands. Each command has to start with a new line. Normally you will start your do file telling it which data to load in the first line. In the following lines you can then include analysis commands. If you leave a row empty – no problem. If you want to write comments or text, which are not Stata code, you have to start the row with // or a * symbol; using these symbols tell Stata that this line is not to be executed. Stan Hurn (NCER) Stata Tutorial 34 / 66 Using do files Executing commands with a do file If you want to re-run a command from the do file, just highlight the line and press the “Execute (do)” button (or press Ctrl+d). If you don’t mark any specific line, Stata will run all the commands in the do file you have currently opened from first to last. The results of the command(s) are displayed in the main view as if you were using the menu. Stan Hurn (NCER) Stata Tutorial 35 / 66 Some Regression Examples Electricity Data Demand for Residential Electricity The Excel file elecex.xls has quarterly data on the following variables from 1972:02 to 1993:04. RESKWH NOCUST PRICE CPI INCOME CDD HDD POP = = = = = = = = electricity sales to residential customers (million kilowatt-hours) number of customers (thousands) electricity tariff (cents/kwh) consumer price index nominal personal income (millions of dollars) cooling degree days heating degree days population (thousands) Import the data into Stata using the Import wizard. Take care to check the Radio Button asking whether or not to treat the first row as variable names! Once done you can save this as elecex.dta for your own convenience. Stan Hurn (NCER) Stata Tutorial 36 / 66 Some Regression Examples Electricity Data Time Series Data Most multiple regression exercises involve data manipulation. This is where writing ”do” files is a powerful way of ensuring that you can recover your previous work and others can reproduce it. 1 This is time series data, so we need to create a date vector set dates as the date vector. generate dates = tq(1972q2) + _n-1 tsset dates, quarterly Stan Hurn (NCER) Stata Tutorial 37 / 66 Some Regression Examples Electricity Data Data Manipulations 1 Generate the dependent variable: gen LKWH=log(RESKWH/NOCUST) 2 We want to explain this demand in terms of real per capita income so create the variable gen LY=log((100\ast INCOME)/(CPI\ast POP)) 3 Another important determinant is price — we want to use the real average cost of electricity gen LPRICE=log(100 \ast PRICE/CPI) Stan Hurn (NCER) Stata Tutorial 38 / 66 Some Regression Examples Electricity Data Getting a Feel for the Data You should always try to understand your data before beginning to model it. A useful starting point is the Graphics – Scatterplot matrix option. As the name suggests this creates a matrix of scatterplots of the variables against each other. Hopefully this reveals some pattern to the relationships between the dependent and explanatory variables and no discernible pattern between the explanatory variables themselves. Stan Hurn (NCER) Stata Tutorial 39 / 66 Some Regression Examples Electricity Data Matrix Plots Stan Hurn (NCER) Stata Tutorial 40 / 66 Some Regression Examples Electricity Data Regression Results The results from running the linear regression of the base model of demand on price, income and the weather variables are as follows: Stan Hurn (NCER) Stata Tutorial 41 / 66 Some Regression Examples Electricity Data ACF and PACF This is time series data, so one of the problems may be autocorrelation in the residuals. The autocorrelation function and partial autocorrelation function of the residuals look as follows Stan Hurn (NCER) Stata Tutorial 42 / 66 Some Regression Examples Electricity Data AR(1) Estimation Options The following dialogue box under the Time Series Prais-Winstein regression allows you to correct for autocorrelation in the residuals. Stan Hurn (NCER) Stata Tutorial 43 / 66 Some Regression Examples Electricity Data AR(1) output The results from running the linear regression of the AR(1) model of demand on price, income and the weather variables are as follows: Stan Hurn (NCER) Stata Tutorial 44 / 66 Some Regression Examples California Schools Data California Test Score Data 1 Load the file caschool.dta. 2 Run the regression relating test scores to the student teacher ratio testscr = β0 + β1 str + u 3 The concern is that this equation suffers omitted variable bias which we can correct using multiple regression. Try relating test scores to the student teacher ratio and the percentage of English learners testscr = β0 + β1 str + β2 el pct + u Note that the size of the effect of str is halved! 4 Now try adding expenditure per student to the regression testscr = β0 + β1 str + β2 el pct + β3 expn stu + u Stan Hurn (NCER) Stata Tutorial 45 / 66 Some Regression Examples California Schools Data Presenting Results This exercise has shown that the coefficient on str in the simple two variable model is biased. But the question remains as to how to present this in a reasonable way so that we can see the pattern immediately. The answer is to store the results of the regressions and then to use Stata’s Postestimation menu item to help organise the presentation of the results. Unfortunately this is going to involve estimating the regressions again and then using Statistics – Postestimation – Manage estimation results – Store in memory After each estimation you will need to name your model. Lets be original and call them Model1, Model2 and Model3. As you do this, watch how Stata echoes your command and think how easy it would be to use a ”do” file instead. Stan Hurn (NCER) Stata Tutorial 46 / 66 Some Regression Examples California Schools Data Table of Estimation Results 1 Show the results: estimates table Model1 Model2 Model3 Here both coefficients and standard errors of the various models are summarised in an accessible way and the reduction in the significance of str is clear. Stan Hurn (NCER) Stata Tutorial 47 / 66 Some Regression Examples California Schools Data Table of Estimation Results 2 Further detail on the results: estimates table Model1 Model2 Model3, star(.05 .01 .001) This is a particularly useful way of summarising the results as the significant coefficients are marked. Note how str is insignificant in Model 3. Essentially the t-tests on the individual coefficients are interpreted for you!! Stan Hurn (NCER) Stata Tutorial 48 / 66 Some Regression Examples California Schools Data Joint Significance Test Now let’s test the hypothesis that both str and exp stu are zero. The tests are to be found at: Statistics – Postestimation – Tests – Test linear hypotheses Obviously you are going to have to give Stata some information on which coefficients you wish to test. Once you have selected Test linear hypotheses, click on Create and the following dialogue box with appear. Stan Hurn (NCER) Stata Tutorial 49 / 66 Some Regression Examples California Schools Data Testing Joint Hypotheses The result shows that the p-value of the F-test of the joint hypothesis that β1 = β3 = 0 is 0.0004 so we would reject the null hypothesis. At least one of str and exp stu is a significant factor in the regression. Stan Hurn (NCER) Stata Tutorial 50 / 66 Some Regression Examples California Schools Data Testing Joint Hypotheses for Windows The result shows that the p-value of the F-test of the joint hypothesis that β1 = β3 = 0 is 0.0004 so we would reject the null hypothesis. At least one of str and exp stu is a significant factor in the regression. Stan Hurn (NCER) Stata Tutorial 51 / 66 Some Regression Examples Food Expenditure and Income Food Data Set Study the relationship between food expenditures and income reg food exp income and plot residuals Stan Hurn (NCER) Stata Tutorial 52 / 66 Some Regression Examples Food Expenditure and Income Functional Form It may be that a linear relationship between food expenditures and income is not a good choice. Let us try to fit a linear - log model. food exp = β0 + β1 ln(income) + u Unfortunately Stata doesn’t recognise ln(income) and you have to generate a new variable, say gen lincome = log(income) Stan Hurn (NCER) Stata Tutorial 53 / 66 Some Regression Examples Food Expenditure and Income Fitted Values Stan Hurn (NCER) Stata Tutorial 54 / 66 Some Regression Examples Food Expenditure and Income Elasticities Now you can calculate the percentage change in food expenditure given a 1 percent change in income using the marginal effects options on the Postestimation menu. Stan Hurn (NCER) Stata Tutorial 55 / 66 Instrumental Variables Estimation Wage Data Wage Data This example looks at wage data. The datafile is mroz.dta and the focus is on modelling the wage of married women only. The variables that are important are as follows: educ = years of schooling wage = estimated wage from earns., hours motheduc = mothers years of schooling fatheduc = fathers years of schooling exper = actual labor mkt exper lfp = 1 if in labor force, 1975 Stan Hurn (NCER) Stata Tutorial 56 / 66 Instrumental Variables Estimation Wage Data Estimating a Wage Equation Suppose we wish to estimate the equation that relates wages to education and experience: ln(wage) = β0 + β1 educ + β2 exper + β3 exper 2 + ut . The problem is that educ may be correlated with u because it is an imperfect proxy for ”ability” and that using OLS may therefore result in biased coefficient estimates. Stan Hurn (NCER) Stata Tutorial 57 / 66 Instrumental Variables Estimation Wage Data OLS Results Stan Hurn (NCER) Stata Tutorial 58 / 66 Instrumental Variables Estimation Wage Data The IV Estimator We can now try estimate the regression by IV using mothereduc as an instrument for educ. A mother’s education does not itself belong in the daughter’s wage equation, but it is reasonable to propose that more educated mothers are more likely to have educated daughters. Click Statistics – Edogenous Covariates – Single-equation instrumental-variables estimator This sequence will open a Dialogue Box which will prompt for more information like 1 2 dependent variable, independent variables, endogenous variables and instrumental variables; other options for the constant and standard error correction etc. Stan Hurn (NCER) Stata Tutorial 59 / 66 Instrumental Variables Estimation Wage Data The IV Estimator Stan Hurn (NCER) Stata Tutorial 60 / 66 Instrumental Variables Estimation Wage Data IV Results Stan Hurn (NCER) Stata Tutorial 61 / 66 Instrumental Variables Estimation Wage Data Some Observations 1 Although not shown here mothereduc is highly significant in the first-stage regression of the IV estimation indicating it is a strong instrument for educ. 2 The estimated return to education is about 10% lower than the OLS estimate. This is consistent with our earlier theoretical discussion that the OLS estimator tends to over-estimate the effect of a variable if that variable is positively correlated with the omitted factors present in the error term. 3 The standard error on the coefficient on educ is over 2.5 times larger than the standard error on the OLS estimate. This reflects the fact that even with a good instrument the IV estimator is not efficient. Of course this situation can be remedied slightly by adding more valid instruments for educ. Stan Hurn (NCER) Stata Tutorial 62 / 66 Instrumental Variables Estimation Artificial Data The Data The datafile is ivreg2.dta contains 500 artificially generated observations on x, y , z1 and z2 . The variable y is generated as y t = β0 + β1 x t + e t , β0 = 3, β1 = 1 with x ∼ N(0, 2) , e ∼ N(0, 1) , cov(x, e) = 0.9 . Note that ρz1 ,x = 0.5 Stan Hurn (NCER) ρz2 ,x = 0.3. Stata Tutorial 63 / 66 Instrumental Variables Estimation Artificial Data Summary of Estimation Results Table was generated by using the Postestimation menu option to store results and create a table. Stan Hurn (NCER) Stata Tutorial 64 / 66 Instrumental Variables Estimation Artificial Data Hausman Test To Implement the Hausman test assuming that you have stored the output from the IV and OLS regressions you click Postestimation – Tests – Hausman specification test Stan Hurn (NCER) Stata Tutorial 65 / 66 Instrumental Variables Estimation Artificial Data Hausman Test This indicates a strong rejection of the null hypothesis of exogeneity — indicating that cov(x, u) 6= 0 — which we know to be true by construction. Stan Hurn (NCER) Stata Tutorial 66 / 66