LECTURE: STATA

Data in memory

Stata thinks of your data as a giant matrix with rows and columns. Each row is an observation and each column is a variable. Draw a picture. Some commands operate on certain columns, such as regress or tabulate. Other commands do something to every row, such as generating a new variable.

Stata keeps the entire data set in RAM all the time, which is what makes it fast. It also means that the size of the data sets you can examine is limited by the memory available.

Programs

Stata programs are called “.do” files. You can create your own do files. This is a good thing to do because you will likely want to run your commands many times, each time with slight changes. Stata has a built-in editor for creating do files, and you can also create them in Notepad. They are just plain text (ASCII) files.

Log files

The commands

    log using [filename], replace
    log close

will prove useful for writing a log file to disk. The first opens the log, overwriting any existing file of that name, and the second closes it.

Delimiters

Normally, each command can be at most one line long. If you want to use longer commands, you need to use a delimiter. That is CS jargon for a symbol that tells Stata where each command ends. For example,

    #delimit ;

at the start of a program indicates that you will use semicolons as delimiters, as in C or Pascal. You must then end each command with a semicolon or Stata will become confused.

Comments

Stata interprets lines starting with an asterisk as comments, which means it just ignores them. They are for you to document what you have done, or at least what you think you are doing.
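The log, delimiter and comment pieces above might fit together in a do file along the following lines. This is only a minimal sketch: the file name mylog.log is made up, and the capture prefix (which closes any log left open by an earlier run without raising an error) goes slightly beyond what these notes cover. Note that #delimit is meant for do files rather than for commands typed interactively.

```stata
* example.do: a hypothetical do file (all file names are made up)
capture log close
log using mylog.log, replace

#delimit ;

* With semicolon delimiters, comments also end with a semicolon ;
display "This command is split "
        "across two lines" ;

log close ;

#delimit cr
* Back to one command per line
```

The final #delimit cr switches back to the default, where each command ends at the end of its line.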
Data sets

To use a data set already in Stata format, you use the use command, as in

    use mydata.dta

To save a Stata data set to disk, use the save command, as in

    save mydata.dta

Note that in both cases you may need to give a full directory address, which would appear in double quotes, as in

    use "c:\important\422\data\mydata.dta"

To clear the working memory in preparation for loading another data set, use the clear command, as in

    clear

Generating variables

Creating a new variable

To create new variables, use the generate command, which can be abbreviated gen. For example, to create a new variable that is a constant equal to 5, you would say

    gen x = 5

This creates a variable called x and sets its value to 5 for every observation.

Using replace and if to create a new variable with multiple values

Sometimes you will want to create a new variable that has different values depending on the value of another variable. For example, suppose that you have a variable for years of schooling and you want to create a dummy variable for having at least a high school education. Then you would do the following:

    gen hs = 0
    replace hs = 1 if yrschl >= 12

Explain what the first generate command does. Explain what the replace command does. Note that Stata uses a double equals sign (==) for the logical equal, as opposed to the single equals sign (=) used for assignment. This difference helps Stata figure out what you mean.

Logical statements

Almost all Stata commands can include an if followed by conditions on the sample to be affected. The if allows a number of logical operations, including the following:

    ==    logical equals
    !=    not equals
    >     greater than
    <     less than
    >=    greater than or equal to
    <=    less than or equal to

Note that the logical statements following the if can include both variables and constants, as in the example of yrschl >= 12. You can also use & for “and” and | for “or” to create more complicated logical statements.
For example, you might just want to replace the values of one variable if both of two other variables meet specific conditions. The command

    replace hs = 1 if yrschl >= 12 & sex == 1

would set the high school dummy variable to one only for observations with both at least 12 years of school and sex equal to one.

Missing values

Stata’s missing value indicator is a period. Thus, to create a new variable and set all the values to missing, you would use the command:

    gen varname = .

If you use Stata’s missing value indicator, it will almost always handle the missing values the way that you would want. It does not include them in means or other descriptive statistics, and it drops observations with missing values from regressions and other analyses. One exception to keep in mind: Stata treats a missing value as larger than any number, so a condition such as yrschl >= 12 is true for observations where yrschl is missing.

Generating aggregate variables

The extended generate command, called egen, creates variables that are functions of more than one observation. The functions include means, standard deviations, percentiles and so on. For example,

    egen mearn = mean(earn)

will produce a new variable called mearn that contains the mean of the variable earn in the data set. Thus, this variable will have the same value for everyone.

The by option often makes egen much more useful. For example, to produce a new variable that contains the mean earnings for persons with the same sex as each observation, you would use the command

    egen mearn = mean(earn), by(sex)

Assuming two sexes, male observations would have a value of mearn equal to the mean of male earnings, and female observations would have a value of mearn equal to the mean of female earnings.

Descriptive statistics

You should always begin any empirical exercise by getting to know your data really well. This means spending some time just taking a look at the descriptive statistics and also looking at simple bivariate relationships. Looking at the data will help you find any data problems before you get going. Tell the story of the 99s.
It will also give you a basic sense of the observations in your data, be they individuals, firms, states or whatever.

The summarize command

The summarize command, abbreviated sum, presents means, standard deviations, minima and maxima. Used with no arguments, summarize will produce this information for every variable in the data set. You can also tell the summarize command that you want information only about specific variables, as in

    summarize [varlist]
    summarize age sex yrschl

If you want to know a lot about each variable, then you can use the detail option to summarize, as in

    summarize age, detail

Note that the option comes after a comma. This comma indicates to Stata that the main part of the command is over and that the options are now about to begin. The detail option gives you everything that the regular summarize command gives you plus the five largest and smallest values, various percentiles of the variable, and additional moments, such as the skewness and the kurtosis.

If you want to summarize a variable for certain subgroups, you can use the general by feature of Stata. Here you might say something like

    sort sex
    by sex: summarize age

to get a separate summary of the age variable for observations having each value of the sex variable. The by prefix requires the data to be sorted on the by variable, hence the sort command; writing bysort sex: does both steps at once. Note that the by syntax can be used with most Stata commands to apply them to various subgroups. Of course, the by variable should be one without very many values.

The tabulate command

The tabulate command, which can be abbreviated as tab, presents a table containing each value of a variable and the number of observations that have that value. The basic usage for a single variable is:

    tab varname

Note that the tabulate command is most useful for variables that take on only a few values. For continuous variables in large data sets, tabulate may fail because there are too many distinct values, or it may give you a very long log file in which every value has only one observation, which is not very useful.
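As a rough illustration of summarize, by and tabulate, the sketch below types in a small made-up data set and runs each command on it. The six observations and their values are invented purely for the example; the variable names follow the ones used in these notes.

```stata
* Made-up data for illustration only: age, sex (0/1), years of schooling
clear
input age sex yrschl
25 1 12
40 0 16
31 1 10
58 0 12
47 1 14
36 0 9
end

* Dummy for at least a high school education
gen hs = 0
replace hs = 1 if yrschl >= 12

* Means, standard deviations, minima and maxima for selected variables
summarize age sex yrschl

* Percentiles, extreme values, skewness and kurtosis for one variable
summarize age, detail

* Separate summaries by sex; bysort sorts the data first
bysort sex: summarize age

* Counts for a variable with only a few values
tab hs
```

With these numbers, four of the six observations have hs equal to one, which is the kind of quick check that tab hs makes easy.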
You can examine the bivariate relationship between two binary or categorical variables by doing cross-tabulations. For example, to see the relationship between sex and years of schooling you could do

    tab yrschl sex

This would give you a table with rows corresponding to each value of years of school and columns corresponding to each value of sex. The cells contain the number of observations with the corresponding values for each variable. You also get row and column totals. Do an example with hs and sex for a 2 x 2 table.

The tabulate command has a number of useful options, which include row, col, cell and chi2. The row, col and cell options add the row, column and cell percentages to the table, respectively. Often it is the percentages you are interested in, rather than, or in addition to, the count of observations in each cell. The chi2 option performs a chi-squared test of the null that the rows and columns in the table are independent.

The correlate command

For examining the bivariate relationship between two continuous variables, you may want to look at the sample correlation. The sample correlation between two variables can be obtained with the command

    correlate age earn

Note that you can get a whole table of correlations by including more than two variables in the variable list.

Testing differences in means

Stata has handy built-in commands for running t-tests of differences in means. There are three basic versions.

Testing whether the mean of a variable equals a constant

Here we want to test the null that the mean of a variable equals a particular number. For that task, we use a command like

    ttest varname == #

or, by example,

    ttest age == 32

Testing whether the mean of one variable equals that of another variable

Here we want to test the equality of the means of two variables. For example, we might have information on earnings measured in two ways, perhaps once from a survey and once from administrative records.
The general form of the command is

    ttest varname1 == varname2

A specific example is

    ttest aearn == searn

Because both variables are measured on the same observations, this is a paired test.

Testing whether subgroups have equal means of a particular variable

Here we might want to see if two subgroups have the same mean of, say, earnings. The general form is

    ttest varname, by(groupvar)

A specific example is

    ttest earn, by(sex)

Remarks

Each form gives the relevant sample means and estimated standard errors for the means, along with the t-statistic and related p-value from the test. The level() option allows you to pick a confidence level different from the default of 95 percent.

Running regressions

Running regressions in Stata is really simple. The general form of the command is given by:

    regress depvar [varlist]

where the depvar is the dependent variable and the [varlist] contains one or more independent variables. Thus, to estimate the model

    $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$,

you would use the command

    regress y x1 x2

The output includes the estimated coefficients, their estimated standard errors, p-values from tests of the nulls that each population coefficient equals zero, the p-value from an F-test that all of the population slope coefficients are zero, and so on. Stata includes a constant in the regression automatically; you do not need to include one in the variable list. The constant can be suppressed using the noconstant option. What could be simpler?

There are, of course, many embellishments to the basic regression command that we will discuss later on in the course.
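To see the t-test and regression commands in action, here is a minimal sketch on a small made-up data set. The eight observations are invented for illustration only and are not meant to produce sensible estimates; the variable names follow the ones used in these notes.

```stata
* Made-up data for illustration only: earnings, sex (0/1), age, schooling
clear
input earn sex age yrschl
30000 1 25 12
52000 0 40 16
28000 1 31 10
45000 0 58 12
41000 1 47 14
26000 0 36 9
38000 1 29 16
33000 0 52 11
end

* Test whether mean age equals 32
ttest age == 32

* Test whether mean earnings differ by sex, using a 90 percent level
ttest earn, by(sex) level(90)

* Regress earnings on schooling and age; Stata adds the constant itself
regress earn yrschl age

* Same regression with the constant suppressed
regress earn yrschl age, noconstant
```

Comparing the last two sets of output shows how much the coefficient estimates can change when the constant is dropped, which is one reason to leave it in unless you have a specific reason not to.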