STATA Quick Introduction (January 2007)

This handout provides a short, quick introduction to Stata. To build your skills further, consult the manuals and use the built-in help whenever you need it.

OUTLINE:
  Starting and stopping Stata
  Reading data into Stata:
    Typing
    Getting external files
  Useful commands:
    List
    Summarize
    Tabulation
    Logical operators
  Functions and expressions
  Generating variables
  Graphics
  Simple linear regression
  Subsetting the data
  Linear restrictions
  Time series (lags and differences, DW test and Q-stats, autocorrelation, Dickey-Fuller test, Goldfeld-Quandt test)

Starting & Stopping Stata

Starting Stata on the PC. Click Start, then Programs, then Stata, then Intercooled Stata. You will find four windows:
  (a) 'Review' window on the upper left: past commands appear there;
  (b) 'Variables' window on the lower left: the variable list appears there;
  (c) 'Results' window: results are displayed there;
  (d) 'Command' window: where you type commands.

You will also see several buttons above the windows. Hold the mouse pointer over a button for a moment and a box will appear with a description of that button. The most frequently used buttons are:
  a. Open: open a Stata dataset;
  b. Save: save to disk the Stata dataset currently in memory;
  c. Do-file Editor: open the do-file editor or bring it to the front of the other Stata windows;
  d. Data Editor: open the data editor or bring it to the front of the other Stata windows.

Stopping Stata. Type exit in the Command window, or click the close button in the upper right corner of the Stata window.

Reading Data into Stata

Comma/Tab separated file with variable names on line 1

Suppose you have the following data in Excel format.

    name    hour    grade
    mary    10      90
    john    11      92
    jose    5       60
    lee     6       71
    mike    9       80

Here grade is the score in percentage points. Two common file formats for raw data are comma separated files and tab separated files. Such files are commonly made from spreadsheet programs like Excel.
For example, suppose you have a data set in a comma or tab separated file (you can save Excel data in .csv or .txt format), stored as C:\temp\study.csv, with variable names on the first row. This file has two characteristics. (1) The first line has the names of the variables, separated by commas/tabs. (2) The following lines have the values for the variables, also separated by commas/tabs. This kind of file can be read using the insheet command:

    insheet using C:\temp\study.csv      (comma separated)
or
    insheet using C:\temp\study.txt      (tab separated)

***NOTE: The insheet command cannot handle a file that uses a mixture of commas and tabs as delimiters.***

Comma/Tab separated file without variable names on line 1

Take the same data as above, except that there are no variable names on the first row. Where should Stata get the variable names? If Stata does not have names for the variables, it names them v1, v2, v3, etc., as you can see in the Variables window. We can also tell Stata the names of the variables in the insheet command itself:

    insheet name hour grade using C:\temp\study.csv

Space separated file

In this case, the file can be read with the infile command:

    infile str4 name hour grade using C:\temp\study.txt

(str4 means that the variable name is a character variable (a string) that can be up to 4 characters wide.)

Fixed format file

A file uses fixed column data when each variable is defined by the column(s) in which it is located. This type of file can be read with the infix command:

    infix str name 1-4 hour 5-6 grade 8-9 using C:\temp\study.csv

Creating a command file using do-file editor

Let's introduce the do-file editor window. Double-click the do-file editor icon, the fifth from the right, and a blank window will appear. Now you can type commands such as use filename, list, summ, etc., and save them as a do-file. Saving your commands in a do-file keeps them available after you quit Stata.
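As an illustration of the do-file idea above, here is a minimal sketch of what a saved do-file might contain. The file name study.do and its location are assumptions made for this example; the data file is the C:\temp\study.csv used earlier.

```stata
* study.do: a minimal do-file (the name study.do is hypothetical)
* clear any data currently in memory
clear
* read the comma separated file with variable names on line 1
insheet using C:\temp\study.csv
* show the observations and basic summary statistics
list
summarize
```

If the file were saved as C:\temp\study.do, typing do C:\temp\study.do in the Command window would run all of its commands in order.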
Writing Comments

You can write comments in three different ways:

1. '*': type your comment after the star (single line comment)
       * My project for Econ 509
2. <CODES> // comment on the same line as the code
3. Comments over multiple lines, e.g.
       /*My project for Econ 509
       999999999999999
       00000000000000
       8888888888888888
       77777777777777777777777 */

With/without Delimiter

You can work in Stata with or without a delimiter such as ';'. Always be consistent. If you tend to forget the ';' at the end of each line of Stata code, it is better to avoid it from the beginning, because once you have specified a delimiter, missing it will produce an error. If you are comfortable with it, go ahead.

Typing your codes

Open the do-file window and type the following commands. Anything starting with '*' is just a comment. Note that clear comes before #delimit, so it takes no semicolon; once #delimit ; is in effect, every command ends with ';', and a comment that starts with '*' also runs until the next ';'.

    *clearing memory every time
    clear
    *expects semi-colon at the end of each command line
    #delimit ;
    *storing output file in: filename.log;
    log using e:\teaching\409\stexp0.log, replace;
    *reading Excel .csv file with the variable names;
    insheet using e:\teaching\409\hour.csv;
    *alternatively, read a Stata .dta file instead (use one or the other): use a:\study.dta;
    *listing the observations;
    list;
    *graphing x-axis vs y-axis;
    graph hour grade, xlabel ylabel title("graph of study_hour vs grades");
    *running regression;
    regress grade hour;
    *predicting fitted value of dependent variable;
    predict fgrade;
    *calculating residuals and printing them;
    gen res = grade-fgrade;
    list res;
    *graph observed and fitted values against the independent variable;
    graph grade fgrade hour, connect(.l);
    *closing the output file;
    log close;

Do not worry about the individual commands yet; let's look at the steps of the procedure.
  a) First, we use clear to clear the memory;
  b) Second, it is better to use #delimit ; so that ';' marks the end of each command;
  c) Third, to save the output into a specified file, we create a log file at the beginning, such as log using e:\teaching\409\stexp0.log, replace; which is paired with log close; at the end.
Then we can either view or copy & paste the output file easily. Just click the start log icon, which is the fourth icon from the left on the menu, then open the log file you saved. If the data were read in correctly, you should see the variable names appear and some lines scroll through the Results window with no red letters (red letters indicate an error).

Some useful commands

a. Descriptive statistics (the simplest and most powerful commands are):

    list ;                  //list all or some of the observations

For example,

    list in 1/5 ;           //list the first five observations for all the variables
    list hour in f/l ;      //list "hour" from the first obs to the last one

You will see the following in the Results window:

          hour
    1.    10
    2.    11
    3.    5
    4.    6
    5.    9

    list name in -2/l ;     //list the last two observations of "name"

You will see the following in the Results window:

          name
    4.    lee
    5.    mike

Attention: in -2/l, the "l" stands for "last"; do not confuse it with the number one (1) in 1/5.

b. summarize (summ for short)

    summ        //gives the mean, standard deviation, minimum and maximum value
                //for all numerical variables

Examples: summ (gives the means of all numerical variables in your sample). The results should be:

        Variable |     Obs        Mean   Std. Dev.       Min        Max
    -------------+-----------------------------------------------------
            hour |       5         8.2   2.588436          5         11
           grade |       5        78.6   13.37161         60         92
            name |       0

    summ hour ;             //gives the summary statistics for study hour only

tabulate (tab for short)

    tab ;       //with one variable, gives a frequency distribution;
                //with two variables, gives a cross-tab

Examples: tab grade (gives the number of people, and the percentage, at each grade level). The results are shown in the following table:

          grade |      Freq.     Percent        Cum.
    ------------+-----------------------------------
             60 |          1       20.00       20.00
             71 |          1       20.00       40.00
             80 |          1       20.00       60.00
             90 |          1       20.00       80.00
             92 |          1       20.00      100.00
    ------------+-----------------------------------
          Total |          5      100.00

    tab hour grade;         //gives the number of people at each combination of
                            //study hour level and grade level

Be careful not to ask for a tab of a continuous variable such as income: you will get hundreds of values.

c. Logical operators

They are used to evaluate an expression and then do something depending on the outcome. The operators are:

    ==      equal to (you must use a double = sign)
    >=      greater than or equal to
    <=      less than or equal to
    >       greater than
    <       less than
    ~=      not equal to (!= also works)
    &       and: both conditions hold
    |       or: at least one condition holds

Examples:

    summ hour if grade >=80 & grade <=100
        //calculates the mean of study hour for students with a grade between 80 and 100
    list if (hour > 5) & (hour !=.)
        //lists all the variables if study hour is greater than 5 and is not missing

Functions and expressions

    Taking log:              gen lhour = log(hour)
    Raising to powers:       gen hournew = hour^3
    Taking square root:      gen sqgrade = sqrt(grade)
    Taking absolute value:   gen abshour = abs(hour)
    Taking lags:             gen xlag = x[_n-1]  or  gen xlagged = L.x   (lag 1 of x)
                             gen xlag2 = x[_n-2]                         (lag 2 of x)

(The L. operator requires that the data have been declared as time series with tsset.) The arithmetic operators +, -, *, / and ^ are also available.

Generating variables

New variables may be generated by using the commands generate or egen. The command generate (gen for short) sets a new variable equal to an expression, which is evaluated for each observation.

    generate minutes = hour*60 if grade>80

Generating dummy variables

Suppose we want to construct a dummy variable from a variable x, such as grade, that equals 1 for grade > 79 and 0 otherwise. The appropriate Stata command is:

    gen xdummy = (x > 79)

Stata assigns 1 where x > 79 and 0 where x is 79 or less; remember to put the condition in parentheses. Stata treats a missing value as larger than any number, so if x has missing values, add "if x != ." to the command to keep them out of the 1s.

The function egen provides an extension to generate. One advantage of egen is that some of its functions accept a variable list as an argument, whereas the functions for gen can only take simple expressions as arguments.
    egen average = rmean(x y z)

Another command that works in the same way as generate is replace, which allows an existing variable to be changed.

    replace grade = 0 if grade < 60

Graphics

The command graph may be used to plot a large number of different graphs. The basic syntax is

    graph varlist, options

where varlist is the list of variables and options are used to specify the type of graph. The examples here use the graph syntax of Stata 7 and earlier; in Stata 8 and later, the same syntax runs under the command name graph7. For example,

    graph hour, normal title("histogram of study hour")

[the normal option overlays a normal curve on the histogram; the text of the title option has to be quoted]

[Figure: histogram of study hour; x-axis: hour, y-axis: Fraction]

Now consider plotting more than one variable. For example, we are interested in the relationship between the grades and the study hours.

    graph grade hour, xlabel ylabel title("graph of study_hour vs grades")

[the xlabel and ylabel options cause the x- and y-axes to be labeled using round values; without them, only the minimum and maximum values are labeled]

[Figure: scatter plot of grade against hour, titled "graph of study_hour vs grades"]

If you want to save the graph, add the option saving(filename). There are many options for graphics in Stata; to explore further, see the Stata Graphics manual.

Simple linear regression

Let's begin with some examples of simple linear regression using Stata. In this type of regression, we have only one predictor variable, which can be either continuous or discrete. There is only one response (dependent) variable, and it is continuous. In Stata, the dependent variable is listed immediately after the regress command, followed by one or more predictor variables: regress [dep. var.] [predictors].

Let's examine the relationship between study hours and grades. The plot above shows a clear positive relationship between these two variables: the more hours students spend studying, the better their grades.
For this example, the model we are interested in is $\text{grade} = \alpha + \beta\,\text{hour} + e$, where grade is the dependent variable and hour is the predictor:

    regress grade hour

          Source |       SS       df       MS              Number of obs =       5
    -------------+------------------------------           F(  1,     3) =   65.93
           Model |  684.073134     1  684.073134           Prob > F      =  0.0039
        Residual |  31.1268657     3  10.3756219           R-squared     =  0.9565
    -------------+------------------------------           Adj R-squared =  0.9420
           Total |      715.20     4      178.80           Root MSE      =  3.2211

    ------------------------------------------------------------------------------
           grade |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            hour |   5.052239   .6222138     8.12   0.004     3.072077    7.032401
           _cons |   37.17164   5.301613     7.01   0.006     20.29954    54.04374
    ------------------------------------------------------------------------------

In addition to the regression table, we can also get the predicted values and plot them. For example,

    predict fgrade          [fgrade is the name of the new predicted variable]
    list grade fgrade       [take a look at the predicted values]

          grade     fgrade
    1.    90        87.69403
    2.    92        92.74627
    3.    60        62.43283
    4.    71        67.48508
    5.    80        82.64179

    graph grade fgrade hour, connect(.l)

[graph the original and fitted values against hour, connecting the fitted values with a line]

[Figure: grade and fitted values plotted against hour]

Subsetting data

We can subset data by keeping or dropping variables, and we can also subset data by keeping or dropping observations. Here we focus on keeping and dropping observations. For example, suppose we have the data below, either entered with the data editor or read from an external file. For gender, 0=female, 1=male; income is in units of $100; year is the number of years of experience in this kind of job; for degree, 0=no degree, 1=B.S.(B.A.), 2=M.S.(M.A.).
    gender    income    degree    year
    1         10        0         5
    0         30        1         7
    1         15        0         9
    1         19        0         10
    0         27        2         12
    0         23        1         10
    1         45        2         5
    1         27        1         9
    0         29        2         7
    0         30        2         8
    -999      22        1         15
    -999      19        1         9

Note that there are two values of -999 in gender, which stand for missing values. We can turn them into Stata missing values using

    replace gender=. if gender==-999        // [double equal ==]

Suppose we only care about the non-missing values; we can drop the observations with missing values using drop if (gender==.).

Now we want to split the sample into two parts, one for females and the other for males, and run a regression on each part separately using the model $\text{income} = \alpha + \beta_1\,\text{degree} + \beta_2\,\text{year}$. Here is one way to handle this. To extract only the females, use keep if gender==0 and save the result into a new data file such as D:\pr_stata\incomef.dta; then run the regression using the new data set.

    *extract only female and save it into new data
    keep if gender==0
    save D:\pr_stata\incomef.dta, replace
    *run the regression using new data
    use D:\pr_stata\incomef.dta
    reg income degree year
    *back to original data
    use D:\pr_stata\income.dta
    *extract only male, and save into another new data
    keep if gender==1
    save D:\pr_stata\incomem.dta, replace
    *run the regression using male new data
    use D:\pr_stata\incomem.dta
    reg income degree year

Linear Restrictions

Often we want to test linear hypotheses after model estimation; the most useful command is test. It performs F or $\chi^2$ tests of linear restrictions on the estimated parameters of the most recently estimated model, using a Wald test. There are several versions of the syntax; we introduce only two. For details, see the Stata manual [Reference: test].

    1) test expression = expression
    2) test coefficientlist     {note: coefficientlist can simply be a list of variable names}

Examples: Suppose we have estimated a model on 1980 Census data for the 50 states, recording the birth rate in each state (brate), the median age (medage) and its square (medagesq), and the region of the country in which each state is located.
The variable reg1 = 1 if the state is located in the Northeast, and 0 otherwise; likewise reg2 = 1 if the state is located in the North Central region, reg3 marks the South, and reg4 marks the West. First we estimate the following regression:

    reg brate medage medagesq reg2-reg4
    *reg2-reg4 is the abbreviation for reg2, reg3 and reg4

If we want to test (with an F test) whether the coefficient on medage is zero, just type:

    test medage = 0

If we want to test whether the coefficient on reg2 is the same as that on reg4, we can type

    test reg2 = reg4

We can also specify more complicated linear restrictions, like

    test 2*reg2-3*reg4 = reg3

However, the real power of test shows when we test joint hypotheses. Suppose we wish to test whether the region variables, taken as a whole, are significant. To perform tests of this kind, specify each constraint and accumulate it with the previous constraints:

    test reg2 = 0
    test reg3 = 0, accumulate
    test reg4 = 0, accumulate

Typing a separate test command for each constraint, as above, can be tedious. The second syntax allows us to perform the last test more conveniently:

    test reg2 reg3 reg4

Dealing with time series Data

a) Setting the time span

We have to tell Stata the time variable at the beginning; the command is tsset. There are two cases. In the first, our data already provide the time information. For example, we have annual exchange rate data consisting of two variables, year and exchrate, where "year" is the variable that indicates the time span; we simply type tsset year; and Stata will treat the data as an annual time series dataset. In the second case, we have to generate the time variable ourselves because it is not in our data.
For example, if we want to generate annual data starting in 1985:

    gen t = y(1985) + _n-1;     //1985 is the start time, y indicates year()
    tsset t;

If we want to generate quarterly data starting in 1973:II:

    gen t = q(1973q2) + _n-1;   /*1973q2 means the start time: the 2nd quarter of 1973; q() indicates quarterly()*/
    format t %tq;               //assign the Stata format for quarterly data to t
    tsset t;

If we want to generate monthly data starting in July 1995:

    gen t = m(1995m7) + _n-1;   /*1995m7 means the start time; m() indicates monthly()*/
    format t %tm;               /*assign the Stata format for monthly data to t*/
    tsset t;

If we want to generate weekly data starting from the 1st week of 1995:

    gen t = w(1995w1) + _n-1;   /*1995w1 indicates the start time; w() indicates weekly()*/
    format t %tw;               /*assign the Stata format for weekly data to t*/
    tsset t;

Generating lags and differences

Suppose x and y are time series variables. To create the first lag of y, we generate a new variable ylag1:

    gen ylag1 = y[_n-1];

In the same way, to create the second lag of y, we generate another new variable ylag2:

    gen ylag2 = y[_n-2];

We can also generate lags of x:

    gen xlag1 = x[_n-1];

To generate differences of variables, we use the D. operator. To generate the first-order difference of y, we create a new variable dy1:

    gen dy1 = D.y;

To generate the first-order difference of dy1 (the second-order difference of y), we create another new variable dy2:

    gen dy2 = D.D.y;

To generate a seasonal difference, we subtract the appropriate lag from the variable. For example, to create a new variable sdy4 for the seasonal difference of quarterly data:

    gen sdy4 = y-y[_n-4];

Durbin-Watson test and Q-stats

The Stata commands are dwstat and wntestq. These commands can be applied after estimation and storing residuals.
For example:

    reg y ylag1 ylag2 xlag1;
    predict uhat, residuals;    /*store residuals into uhat*/
    dwstat;
    wntestq uhat, lag(20);      /*Q-stat on residuals up to 20 lags*/

Autocorrelation

prais estimates a linear regression of depvar on varlist that is corrected for first-order serially correlated residuals, using the Prais-Winsten transformed regression estimator, the Cochrane-Orcutt transformed regression estimator, or a version of the search method suggested by Hildreth-Lu. Note that prais is for use with time-series data: you must tsset your data before using prais. For example,

    prais y x1 x2, corc;
        /*corc specifies that the Cochrane-Orcutt transformation be used to estimate the equation*/
    prais y x1 x2, corc ssesearch;
        /*ssesearch specifies that a search is performed for the value of rho that
          minimizes the sum of squared errors of the transformed equation*/

Dickey-Fuller and Unit Root

Let's keep using the variables created earlier. There are two ways to run the D-F test:

    reg dy1 ylag1;      /*test that the coefficient is zero*/
or
    reg y ylag1;        /*test that the coefficient is 1*/

To perform the augmented Dickey-Fuller test, we can use the dfuller command. This test performs a regression of the differenced variable on its lag and the user-specified number of lagged differences of the variable. For example,

    dfuller y, lags(5) regress;

Goldfeld-Quandt test

To perform the Goldfeld-Quandt test, we first need to split the observations into two parts. Usually we sort by the variable first and then run two regressions on the separate parts. For example, we have data on expenditure (exp) and income (inc), and we are interested in performing the Goldfeld-Quandt test for inc. We can proceed as follows:

    sort inc;               /*sorting the data first*/
    list inc exp in 1/5;    /*checking the sorted data*/
    reg exp inc in 1/10;    /*run OLS on the 1st part of the obs*/
    reg exp inc in 21/40;   /*run OLS on the 2nd part of the obs*/

Then extract the SSR from each regression to perform the test.
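The final step, extracting the SSR from each regression and forming the test statistic, might be sketched as follows. This is only an illustration under stated assumptions: it reuses the variables exp and inc and the subsamples 1/10 and 21/40 from the example, it assumes the alternative hypothesis is that the error variance is larger in the high-income group, and the scalar names (rss1, df1, rss2, df2, gq) are made up for this sketch. After regress, Stata stores the residual sum of squares in e(rss) and the residual degrees of freedom in e(df_r); Ftail() returns the upper-tail probability of the F distribution. The statistic computed is $F = \dfrac{RSS_2/df_2}{RSS_1/df_1}$.

```stata
* sort by the variable suspected of driving the error variance
sort inc

* OLS on the low-income part; keep its residual sum of squares and df
regress exp inc in 1/10
scalar rss1 = e(rss)
scalar df1  = e(df_r)

* OLS on the high-income part; keep its residual sum of squares and df
regress exp inc in 21/40
scalar rss2 = e(rss)
scalar df2  = e(df_r)

* ratio of the two estimated error variances, larger-variance group on top
scalar gq = (rss2/df2)/(rss1/df1)

* upper-tail p-value from the F(df2, df1) distribution
display "Goldfeld-Quandt F = " gq
display "p-value = " Ftail(df2, df1, gq)
```

A small p-value would point toward heteroskedasticity related to inc. Dividing each SSR by its own degrees of freedom keeps the ratio meaningful even though the two parts here contain different numbers of observations.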
Some Useful Extras

Merging

When you are dealing with complex data sets, your data might be spread across several different files. Before analyzing such data, you need to create a common file with the relevant variables from the different files. One way of combining such tables is using "merge". To merge two files, you need a common ID in both files. If that is the case, do the following:

    sort ID     // ID is the name of the common ID in both files
    *after sorting, save the file under a different name
    *Read the second file and sort with ID. Now you are ready to merge two data sets.
    merge ID using <file name that you just saved>
    *Now your two data sets are in one file.
    *To make sure you merged the files appropriately, type
    *If you have more files, just repeat those steps

Reshaping

Your data might be in one of the following forms:

    (wide form)

     i        ....... x_ij ........
    id  sex   inc80   inc81   inc82
    -------------------------------
     1   0     5000    5500    6000
     2   1     2000    2200    3300
     3   0     3000    2000    1000

    (long form)

     i    j          x_ij
    id  year  sex    inc
    -----------------------
     1   80    0    5000
     1   81    0    5500
     1   82    0    6000
     2   80    1    2000
     2   81    1    2200
     2   82    1    3300
     3   80    0    3000
     3   81    0    2000
     3   82    0    1000

    *To reshape the data from wide to long, do the following
    reshape long inc, i(id) j(year)     // goes from top form to bottom form
    *To reshape the data from long to wide, do the following
    reshape wide inc, i(id) j(year)     // goes from bottom form to top form

Collapsing

collapse converts the dataset in memory into a dataset of means, sums, medians, etc. Example: suppose you have a data set covering 50 states and you want summary statistics for each state. Use the following command:

    collapse age educ income, by(state)     // here age, educ and income are 3 variables

coll2, an alternative to collapse, also converts the data in memory into a data set of means, sums, medians, etc.

    coll2 age educ income, by(state)

In the above cases, you will get the mean values of age, educ and income for each state.
If you want a different statistic, such as the mode or the sum, you have to specify it:

    coll2 (mode) age educ income, by(state)

Weighting Issue

Most Stata commands can deal with weighted data. Weights are especially useful with survey data. Stata allows four kinds of weights:

1. fweights, or frequency weights, are weights that indicate the number of duplicated observations.
2. pweights, or sampling weights, are weights that denote the inverse of the probability that the observation is included due to the sampling design.
3. aweights, or analytic weights, are weights that are inversely proportional to the variance of an observation; i.e., the variance of the j-th observation is assumed to be sigma^2/w_j, where w_j are the weights. Typically, the observations represent averages and the weights are the number of elements that gave rise to the average. For most Stata commands, the recorded scale of aweights is irrelevant; Stata internally rescales them to sum to N, the number of observations in your data, when it uses them.
4. iweights, or importance weights, are weights that indicate the "importance" of the observation in some vague sense. iweights have no formal statistical definition; any command that supports iweights will define exactly how they are treated. In most cases, they are intended for use by programmers who want to produce a certain computation.

Example: Suppose you are running a regression of y on x1, x2, x3, and you have a variable called pop that you want to use as a weight. Then do the following:

    regress y x1 x2 x3 [aw=pop]     // remember the square brackets; 'aw' refers to analytic weight
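To make the meaning of frequency weights concrete, the following small sketch types in a made-up data set and compares a weighted summary with the same summary on the expanded data. The variable names y and n and all the values are invented for this illustration; input, expand and the [fweight=] syntax are standard Stata.

```stata
* made-up data: n says how many identical observations each row stands for
clear
input y n
10 2
12 1
15 3
end

* weighted summary: row 1 counts twice, row 2 once, row 3 three times
summarize y [fweight=n]

* physically duplicate each row n times, then summarize without weights
expand n
summarize y
```

Both summarize commands should report 6 observations and the same mean, which is what "number of duplicated observations" means for fweights. The other weight types do not have this simple duplication interpretation.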