Brandeis University
Maurice and Marilyn Cohen Center for Modern Jewish Studies

Using Stata More Effectively

Benjamin Phillips

    gen dobday=string(day,"%2.0f")
    replace dobday="0"+dobday if length(dobday)==1
    gen dobmonth=string(month,"%2.0f")
    replace dobmonth="0"+dobmonth if length(dobmonth)==1
    recode year (1962/1977=1978)(1998=1992)
    gen dobyear=string(year,"%4.0f")
    gen age=floor((date("1jan2010","DMY")- ///
        date((dobday+"-"+dobmonth+"-"+dobyear), "DMY"))/365.25)

August 2010
© 2010 Brandeis University Maurice and Marilyn Cohen Center for Modern Jewish Studies
Updated August 17, 2010

Table of Contents

Introduction ..... 1
Stata 11 ..... 1
Setting up Stata ..... 1
Working with directories ..... 2
Versions ..... 3
Running .do files within .do files or the command dialog ..... 3
Comments ..... 3
Breaking long lines ..... 4
Avoiding errors ..... 4
Renaming variables ..... 5
Changing variable order ..... 5
Computing variables with egen ..... 5
Macros ..... 5
Looping (foreach, forvalues, and while) ..... 7
Creating sets of dummy variables: the xi command ..... 11
The if and else commands ..... 13
Case order variables, sorting, and cross-case functions ..... 14
The duplicates command ..... 17
The list command ..... 17
The by command ..... 17
Data verification ..... 18
The in command ..... 18
Predictions from estimation commands ..... 18
Working with dates and times ..... 19
Numeric variable types ..... 22
String functions ..... 22
Importing .csv and other text files ..... 25
Exporting .csv, fixed format, and other text files ..... 26
Merging, appending, and reshaping ..... 27
Matrices and scalars ..... 30
Running Stata from the command line ..... 34
Programs ..... 37
The post and postfile commands ..... 38
The bootstrap ..... 39
Weird error messages ..... 39
Index ..... 41

Introduction

This file contains most of the collective wisdom of the Cohen Center regarding the effective use of Stata. It assumes a good working knowledge of basic Stata procedures and provides a guide to nonobvious shortcuts and other tricks of the trade. While I am the author of this document, I've incorporated others' discoveries as well, giving credit in text to the "discoverers" of new functionality.

Stata 11

Stata 11 introduces three very useful features: a variables manager, an improved .do file editor, and the full set of manuals in PDF format. The variables manager is very similar to the variable view in SPSS/PASW (now IBM SPSS). Besides providing more screen real estate for viewing variable labels, it shows which value label (if any) is attached to each variable. The .do file editor now allows collapsing of loops and colors commands, strings, locals, and comments, helping differentiate text. It also numbers rows and gives column numbers. The on-line manual is available from Help > PDF Documentation. For this to work properly, though, you need to use Adobe Acrobat or Acrobat Reader as your default PDF viewer. The reason is that the documentation is a set of linked PDF files, and third-party readers do not seem able to move from one file to another. If you have a third-party PDF viewer as the default, find a PDF file in My Computer or Windows Explorer, right click on it, select Open With > Choose Program, check the box next to "Always use the selected program to open this kind of file," choose Acrobat or Acrobat Reader, and then select OK.

Along with the good points, some syntax has changed. The syntax for merging datasets has arguably improved and is certainly very different from the previous version (see p. 27ff). If you don't want to rewrite old syntax, be sure to use the version command (see p. 3).
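To make the habits described in the rest of this document concrete, here is a sketch of the kind of preamble we might put at the top of a .do file. The path, log name, and dataset name are invented, and most of these commands are discussed in the sections that follow:

    version 11.1                 // interpret the syntax as written for this release
    capture log close            // close any log left open by an earlier crash
    set more off                 // don't pause output (see Setting up Stata)
    cd "C:\Cohen Center\BRI"     // work from the project directory
    log using myanalysis, replace text
    use mydata, clear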
Setting up Stata

Stata has default settings that some of us do not like. Here is a list of ways to correct them permanently.

Memory

Stata opens datasets in RAM (random access memory). If you don't have enough RAM, you can't open the dataset. But even if you do have enough RAM, you may not be able to open the dataset. Stata grabs a chunk of RAM when it is launched for opening and working with datasets. By default, this is a measly 10MB. To expand this to a more useful 200MB permanently:

    set mem 200m, perm

This can be expanded on a temporary basis to, say, 1GB as follows:

    set mem 1g

Note that you're limited by the RAM in your computer, the amount of memory used by other applications, and whether you are using a 32- or 64-bit operating system. Basically, a 32-bit operating system can only keep track of 2^32 memory addresses (4,294,967,296), roughly corresponding to 4GB. In Windows, 2GB (some of this may be "virtual memory" stored on the swap file) is allocated to the operating system and each application receives another 2GB. In practice, the maximum amount of RAM 32-bit Windows will allocate to Stata in a system with 2GB of RAM (the normal maximum for 32-bit OSs) is somewhere in the 200MB to 250MB range. In 64-bit OSs, the maximum number of memory addresses that can be tracked is 2^64. In theory, this would be 18,446,744,073,709,551,616 addresses, roughly corresponding to 16EB (exabytes). In practice, the 64-bit architecture used in most AMD and Intel chips limits addressable memory to 256TB (terabytes).

More

To turn off Stata's annoying characteristic of making you click to get the next page of results, use:

    set more off, perm

Scroll Buffer Size

Stata will only display a certain number of past results. In general, it's better to display more rather than less. The command to use is set scrollbufsize #, where # is a number of bytes between 10,000 and 2,000,000. The setting is permanent and does not take the , perm option. Stata must be closed and started again for it to take effect.

Working with directories

Stata works with directories in a similar fashion to DOS or Unix:

    cd "C:\Cohen Center\BRI"
    mkdir BRI20
    cd BRI20

If you are in the correct directory, you do not need to specify the full file path. Hence, instead of:

    use "C:\Cohen Center\BRI\BRI20\mydata.dta", clear

You can simply specify:

    use mydata, clear

The .dta extension is assumed and need not be specified. Files in the working directory can be listed:

    dir

Stata can also erase files:

    erase mydata.dta

This can be useful in situations where it is necessary to create temporary files (there is another way of doing this, tempfile, but it is most useful when writing commands).

Versions

Stata syntax changes from version to version. Generally, this isn't a problem, being limited to relatively obscure areas. Occasionally, though, a change affects analyses, causing strange error messages to appear. This is easily solved. Stata is smart enough to translate commands written for an earlier version of Stata into the present version. All this requires is a statement near the beginning of the .do file that lists the version of Stata the commands were written for:

    version 11.1

Be aware, though, that Stata usually changes syntax to provide greater functionality. Stata's survey commands prior to Stata 9, for example, didn't allow as many options for defining the characteristics of a complex survey sample.
Consequently, while Stata 8 commands would still run on later versions (provided the version command was used), they may less accurately estimate variance than if rewritten for version 10.0 or later. The merge commands also changed between 10.1 and 11. Running .do files within .do files or the command dialog .do files can be run inside another .do file or from the command dialog provided one is in the correct folder (see p. 2): do mydofile This was necessary in Stata 10 and before when there was a maximum number of lines for a .do file in the .do file editor. This is no longer the case in Stata 11, but this functionality may still be of use if there are modular segments of identical code that need to be run at multiple points in a file. While I’m well aware of the fact that many PCs run Stata too slowly to rerun the entire .do file as needed, this problem will be eventually addressed by Moore’s Law or (when I win the lottery) the Jodi and Benjamin Phillips Fund for Ridiculous Computing Initiatives. When it is, running the entire file is good practice because it avoids the common problem of having the .do file blow up at a certain point because we have been tinkering with the file and running it piecemeal. Comments A well-written .do file will have considerable commentary outlining what is being done, how it is being achieved, and why this is necessary. There are two types of comments, those that constitute a line in themselves or those that can be written in the middle of a command. To write a comment on a line, it simply needs to be prefaced with an asterisk. 3 Using Stata More Effectively You can add more asterisks and finish with an asterisk or not, depending on your preferences. It doesn’t matter as anything on the line after the initial asterisk is disregarded. As soon as you type in a carriage return, though, the next line will be considered part of the program unless it, too, is preceded by an asterisk. (Note that you can put spaces and tabs before the first asterisk, allowing one to create bullet-point lists of comments. In some cases, it might be useful to make comments within a command. Stata will stop paying attention as soon as it reaches /*. It will not pay attention to again until it reaches */. Anything in between will be ignored, even if it stretches across multiple lines with many carriage returns. Conversely, this could appear in the middle of a command and it would not disrupt the command itself. * Here is a comment that must go on one line /* Here is a comment that covers several lines now it is over */ tab vara varb /* Comment at the end of the line */ tab /* comments */ vara /* in the middle */ varb /* are confusing but syntactically acceptable */, col Breaking long lines Stata will accept very long lines of code. Unfortunately, this means that the entire line won’t be visible at once in the text editor and will break up in an ugly fashion in the display window and log files. The simplest way to break a line is ///, which tells Stata to ignore the carriage return (which normally tells Stata that the command—whatever it is— is finished and should be executed). 
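For instance, a sketch of how the long regression used in the next example could be broken with /// (the variable names are placeholders):

    reg vary varx1 varx2 varx3 varx4 varx5 varx6 varx7 ///
        varx8 varx9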
You can also use the comment indicator:

    reg vary varx1 varx2 varx3 varx4 varx5 varx6 varx7 /*
        */ varx8 varx9

An alternative (which I'm not fond of) is to use the #delim command, assigning a semicolon as the end-of-command marker (note that periods can't be used), e.g.:

    #delim ;
    reg vary varx1 varx2 varx3 varx4 varx5 varx6 varx7
        varx8 varx9 ;
    #delim cr

The last statement returns the delimiter to the default carriage return. The only options are the semicolon and the carriage return.

Avoiding errors

While the fact that Stata crashes as soon as it hits an error may be useful, there are times when what Stata regards as an error and what we would regard as an error diverge. Let's say we've been working with a file that defines some value labels and we switch to another dataset whose .do file creates value labels of the same name. This will bring Stata to a crashing halt. We could specify label drop mylabel, but that is (a) a pain in the neck and (b) will cause the .do file to crash if no such label has yet been defined. This can be avoided by using the capture prefix. Hence:

    capture label define mylabel 0 "No" 1 "Yes"

"Capture" refers to Stata "capturing" the error message.

Renaming variables

At times it is necessary to rename variables; this is simply done with rename. If you wish to rename variables with prefixes (for instance, changing w09* to w1*) you can use the renpfix command.

Changing variable order

Stata can change the order in which variables appear in a file. The order command sends the variables one specifies, in the order one specifies, to the front of the dataset. Any variables not included in the varlist of an order command appear in their original order immediately after the last specified variable in the varlist. Thanks to Michelle for finding this command.

Computing variables with egen

Stata's generate (usually shortened to gen) only handles simple mathematical operations like addition, subtraction, multiplication, division, exponentiation, and logarithms. While you can do a lot with these, there is an additional command called egen that offers functions that work across multiple cases or multiple variables. These include means, medians, sums (called total, not sum, for reasons I don't understand), minimums, maximums, and so on. Before leaping in, though, be aware that the default mode for egen is operations across cases within a single variable. Thus

    egen xbar=mean(x)

will create a new variable (xbar) that is identical for every case, containing the mean of the variable x. The within-case sum of a group of variables x1 x2 x3, by contrast, is

    egen sumx=rowtotal(x1 x2 x3)

which could be simplified to egen sumx=rowtotal(x1-x3) if the variables were located next to one another in the dataset.

Macros

Stata has a macro facility that can record arbitrary strings of characters. This is useful for situations where one wants blocks of text that can easily be substituted in, instead of having to be retyped or copied. The most useful form of Stata macro for our purposes is a local macro, which must be defined within your .do file. We typically have sets of interrelated dummy variables. Defining these as a macro makes specifying models easier:

    local denom "rereform conserve orthodox other"
    svy: ologit potrprelgpilg prtrpexprelgpilg landed15 ///
        kdmitzvot prmitzvot `denom', or

Macros can also be useful for complicated statements and so on.
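As a small illustration of that point, the same local can be dropped into several models without retyping the variable list. This is only a sketch; the outcome and control names are placeholders:

    local controls "landed15 kdmitzvot prmitzvot `denom'"
    svy: ologit poshabcan `controls', or
    svy: ologit poshabmeal `controls', or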
Note that the indexes used by foreach and forvalues (see the next section) are themselves local macros. A loop will overwrite a previously defined local macro of the same name, so use different names. Macros can also be expressed as:

    local macroname = macrocontents

It is recommended, however, that you stick to the form displayed above:

    local macroname macrocontents

This executes faster. However, if you want a mathematical expression evaluated as part of the macro, the equal sign is necessary. Hence a program that counts to two and displays the result on the screen:

    local y 1
    display `y'
    local y = `y' + 1
    display `y'

After being defined, local macros are referred to as `x' (assuming x is the name of the macro). Note very carefully that the left-hand quote is the character on the top left key of your keyboard, under the tilde ("~") and immediately to the left of the key for 1. The right-hand quote is the regular apostrophe, under the quotation mark, immediately left of the Enter key and right of the key for the colon and semicolon.

Advanced macro use

When running .do files from the command line (p. 3) or programs (p. 37), arguments after do myfile get entered as macros `1', `2', etc. These can then be referred to in the .do file itself. For this trivial.do file:

    tab `1' `2'

Thus:

    do trivial vara varb

is equivalent to:

    tab vara varb

Obviously, this isn't the sort of thing we would want to use on an everyday basis, but it could be helpful in certain complicated programming situations.

Globals

Locals are only one kind of macro. There are also global macros, which are ever-present. While one can add new global macros, this is not recommended. One neat global macro is $S_DATE, which contains the current date. Thus, to save a file with today's date:

    save "myfile $S_DATE.dta", replace

Take care with this, though. The sequence is very specific: "dd Mmm yyyy". If dd < 10, there will be a space in place of the first d. The month is the first three letters of its name, with the first letter capitalized. Thus:

    June 8, 2008     " 8 Jun 2008"
    May 22, 1975     "22 May 1975"

I haven't tried years < 1000 or > 9999, but as the date is drawn from your system clock, it is unlikely that you will have this problem. (If you're reading this in 10000 CE, you're probably up to speed on this, given the Y10K bug.)

Looping (foreach, forvalues, and while)

Stata supports looping and makes it very easy. There are three primary kinds of loops: foreach loops over lists of text strings, forvalues loops over sequences of numbers, and while loops as long as a logical condition holds. There are several simple rules to remember. First, after writing the specification of the loop, you have to put a left-hand brace "{" at the end of the line (i.e., immediately before the carriage return). It is good practice to then indent the lines of code that run within the loop (though the loop will run fine if you don't indent). Second, the loop is closed when it reaches a right-hand brace "}" on a line by itself. I like to keep this at the same level of indentation as the rest of the loop, but others may leave the right-hand brace unindented. Third, you need an index for the loop. In the examples below I use x for text strings and n for numbers, but these can be any letters (and more than one letter) you find convenient. They can even have the same name as a variable, but it is probably best to avoid the confusion this may cause. The index is declared at the beginning of the loop.
The index is itself a local macro, so be sure not to give your index the same name as a macro you are already using (or will use at a later point). Here is a loop over text strings:

    foreach x in shabcan shabmeal mitzvot {
        svy: ologit po`x' pr`x' landed, or
    }

And here is a loop over values:

    forvalues n=1/4 {
        svy, subpop(if region==`n'): mean age
    }

Note that Stata differentiates between mathematical equalities (=) and logical equalities (==). Here the equality in the forvalues statement is mathematical, while the equality in the if qualifier is logical. Stata will throw error messages if you confuse one with the other. If you want to loop over nonconsecutive, unevenly spaced numbers like 1, 3, 5, 6, and 9, you would enter these into foreach, as in "foreach n in 1 3 5 6 9". To loop over evenly spaced numbers, forvalues should be specified as forvalues n=2(2)10, which would yield the sequence 2 4 6 8 10. One can run loops within loops:

    foreach x in shabcan shabmeal mitzvot {
        forvalues n=1/5 {
            svy, subpop(if denom==`n'): ologit po`x' ///
                pr`x' landed, or
        }
    }

One small issue with running large loops or sets of loops, particularly for analysis commands, is that it can be difficult to keep track of what each piece of output represents. This can be solved by getting Stata to state which variable is being run under which conditions, using the display command. The "as text" option, discovered by Michelle, ensures the label displays nicely. You can also precede variable output with "as output" to conform to Stata's usual color scheme, and use _newline to force new lines. Here is the previous example:

    foreach x in shabcan shabmeal mitzvot {
        forvalues n=1/5 {
            display _newline as output ///
                "`x'" as text " if denom==" as output `n'
            svy, subpop(if denom==`n'): ologit po`x' ///
                pr`x' landed, or
        }
    }

For shabmeal and denom=3 this would display:

    shabmeal if denom==3

Of course, loops can also be very helpful in data manipulation, not just analysis. Here we z-score a group of variables (Stata has a user-written command called zscore that will do this, but we'll ignore it for the present):

    foreach x in busguide busgroup busmifgash buslearn {
        quietly summarize `x'
        gen z`x'=(`x'-r(mean))/r(sd)
    }

An excursus on silence and system variables

What on earth are quietly summarize and r(mean) and r(sd)? First, quietly tells Stata to suppress output. Generally, you don't want to do this, but it minimizes clutter in instances where you want to run a command but don't need the output. A block of commands can be run quietly, much as one would write a loop:

    quietly {
        command
        command
    }

Within such a block, one can always specify quietly's counterpart, noisily (who says computer programmers don't have a sense of humor?), for a given command in order to see its output. Second, summarize is an analysis command that reports the number of valid observations, mean, standard deviation, minimum, and maximum. Almost all Stata analysis commands store some information behind the scenes. An OLS regression will store the R-squared, the coefficients, and so on (type return list and ereturn list to see details). These are removed when the next analysis command is run. (See help return for details.) As it happens, summarize stores the mean and the standard deviation as r(mean) and r(sd). From there, we simply plug these pieces into the formula for a z-score, z = (x - xbar)/s, which is exactly what the gen line above computes.

Looping using while

An alternative means of looping through values is while.
In this instance, the index serves as a counter and the loop continues for a given case until the logical condition is specified. Note that this can lead to loops of infinite length is the logical condition is not set properly. Here is a loop to assign a value for the last cohort a given case is associated with using forvalues: gen lastround=. forvalues n=1/18 { replace lastround=`n' if round`n'==1&qualified`n'==1 } 9 Using Stata More Effectively Here it is using while: local i 0 while (`i++') <= 18 { replace lastround=`i' if round`i'==1&qualified`i'==1 } We create the local macro i with an initial value of 0. The logical statement “while (`i++' <= 18)” can be understood as follows. i++ increments i by one each loop (++i would achieve the same effect, while --i or i-- would decrement i by one each loop). When i reaches the value of 18, the loop is terminated for the given case and moves to the next case, until all cases are completed. Note that we don’t have to combine the increment and the logical statement as we did above. This could be specified as: local i 0 while `i' <= 18 { local i = `i++' We could also forgo i++ and recast the last line above as “local i = `i' + 1”. This, however, would be slower than ++i, according to the manual. It is not clear from Stata’s documentation whether forvalues or while ++i or --i is faster. I would guess forvalues has a slight advantage as while probably needs a 19th loop (in the above example to reach the point at which i > 18 while forvalues knows that it needs to loop 18 and only 18 times. In any case, as forvalues is easier to understand, it probably makes sense to stick with forvalues. Finally, here’s a program I wrote to calculate the average interitem correlation of a lower triangular matrix (note this contains some features I haven’t discussed yet): capture program drop interitem program interitem version 10.1 syntax varlist(min=2 numeric) corr `varlist' matrix corr = r(C) local nargs : word count `varlist' foreach x in sum cell n mean { matrix `x' = 0 } local a 0 local c 0 while (`a++') < `nargs' { local b `a' while (`b++') < `nargs' { matrix cell[1,1] = corr[`b',`a'] 10 Using Stata More Effectively matrix cell[1,1] = abs(cell[1,1]) matrix sum[1,1] = sum[1,1] + cell[1,1] local c = `c' + 1 matrix n[1,1] = `c' } } matrix mean[1,1] = sum[1,1]/n[1,1] matlist mean end Conditional breaks Because I am only interested in finding the last cohort the loop finds, iterating through all 18 possibilities for every case is wasteful of computer resources. A better alternative would be to start looking at the last cohort and work backwards. This requires stopping when I find the first cohort a case is associated with. Accordingly, I use an if statement to conditionally end the loop for a given case: gen lastround=. local i 19 while (`--i' > 0) { replace lastround=`i' if round`i'==1&qualified`i'==1 if (lastround != .) exit } That is to say that if lastround no longer has a missing value, the loop for that case is over, and it should proceed to the next case until all cases are complete. In my case, going forwards through all 18 possibilities took .64 seconds while going backwards and stopping at the first hit took .58 seconds, so there was a small benefit. (I got the timing by set rmsg on.) Benefits will be greater for very large loops, very large datasets, or very slow computers. 
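For completeness, Stata also has continue, break, which breaks out of the innermost foreach, forvalues, or while loop. Here is a hedged sketch of one way to use it with the same (assumed) round and qualified variables, stopping once every case has been assigned:

    gen lastround=.
    forvalues n=18(-1)1 {
        replace lastround=`n' if lastround==. & round`n'==1 & qualified`n'==1
        quietly count if lastround==.
        if r(N)==0 continue, break    // nothing left to fill in, so stop looping
    }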
Alternately, I could add a conditional break to a decrementing forvalues loop to achieve the same effect as the while loop:

    forvalues n=18(-1)1 {
        replace lastround=`n' if round`n'==1&qualified`n'==1
        if (lastround != .) exit
    }

Creating sets of dummy variables: the xi command

Creating a set of dummy variables is a common operation in data analysis. Unfortunately, it is an annoying chore and one that goes wrong occasionally. Michelle has found a better alternative in the xi command. Using this, instead of laboriously coding:

    recode denom (1=1)(2/7=0), gen(orthodox)
    recode denom (2=1)(1 3/7=0), gen(conserv)
    recode denom (3 4=1)(1 2 5/7=0), gen(rereform)
    recode denom (5 6=1)(1/4 7=0), gen(justjew)
    recode denom (7=1)(1/6=0), gen(otherjew)

One could simply code:

    recode denom (1=1)(2=2)(3 4=3)(5 6=4)(7=5), gen(newdenom)
    xi i.newdenom, noomit

The noomit option just means that one variable will be created for each category, compared to the default, where the category with the lowest value (here, Orthodox) is omitted. Of course, some labor is still required if you're going to have a clue what these variables mean:

    rename _Inewdenom_1 orthodox
    rename _Inewdenom_2 conserv
    rename _Inewdenom_3 rereform
    rename _Inewdenom_4 justjew
    rename _Inewdenom_5 otherjew

This could be sped up, too, using a loop:

    local i=0
    foreach x in orthodox conserv rereform justjew otherjew {
        local i=`i'+1
        rename _Inewdenom_`i' `x'
    }

xi can be used to create more complicated variables, too. See the documentation in the help file.

Using xi in estimation

xi can be used in estimation commands. For instance, the following command:

    reg y x conserv rereform justjew otherjew

could be recast as:

    xi: reg y x i.newdenom

Doing this essentially creates temporary versions of the dummy variables used in the analysis, which are then immediately dropped. The names of these temporary variables follow the logic of variable creation described above. You could specify noomit after xi, but that will cause problems, because a set of dummy variables needs to have one category excluded. This sounds great, but it's usually more trouble than it's worth. For one thing, you don't get to choose the omitted category. You could work around this, perhaps by recoding denomination so Conservative=1 and Orthodox=2, but that removes some of the labor-saving aspect. Perhaps more problematically, you (yes, you!) will have to remember precisely what _Isomevariable1 actually represents and type out _Isomevariable1 (and 2 and 3 and so on) in postestimation commands. In most cases, you're better off creating new variables and giving them meaningful names.
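For what it's worth, tabulate's generate() option is another way to get a full set of category dummies in one line; a minimal sketch using the newdenom variable from above (the dnm stub is arbitrary):

    tab newdenom, gen(dnm)
    * creates dnm1-dnm5, one 0/1 variable per category, which can then be renamed as above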
The if and else commands

These commands look superficially similar to the SPSS do if and else if commands. Unfortunately, where SPSS applies these case by case, so they can be used to branch to account for, say, skip patterns, Stata treats all cases alike. Here is a sample of SPSS syntax:

    do if pocomplete=1.
    + compute dadjew=podadjew.
    else if prcomplete=1.
    + compute dadjew=prdadjew.
    end if.

What we would like to be able to do in Stata is as follows:

    gen dadjew=.
    if pocomplete==1 {
        replace dadjew=podadjew
    }
    else {
        replace dadjew=prdadjew
    }

Note that else doesn't take conditions. What would happen, though, is that if the first case had completed the post-trip survey, then everyone would have dadjew=podadjew; if the first case had not completed the post-trip survey, every case would have dadjew=prdadjew. We could tell Stata to do this for every case:

    gen dadjew=.
    local n = _N
    forvalues i = 1/`n' {
        if pocomplete[`i']==1 {
            replace dadjew=podadjew[`i'] in `i'
        }
        else {
            replace dadjew=prdadjew[`i'] in `i'
        }
    }

However, it would be a lot easier to simply do:

    gen dadjew=podadjew if pocomplete==1
    replace dadjew=prdadjew if prcomplete==1

Or better yet:

    gen dadjew=.
    foreach x in po pr {
        replace dadjew=`x'dadjew if `x'complete==1
    }

Either of the latter two options would also run faster, because Stata executes them on the entire dataset at once, not case by case. Enthusiastic as I am about Stata, if is not a very useful command in most instances and is aimed at people writing new commands. It would be great if there were a parallel to the SPSS commands, but as far as I know there isn't.

Case order variables, sorting, and cross-case functions

SPSS has $casenum, a system variable that contains a unique positive integer for each case from 1 to n. This can be used to save the original order of cases prior to sorting. Stata has a similar system variable: _n. Hence, the original order of cases can be saved to a variable as follows:

    gen sortorder=_n

An excursus on sorting

One might think that when a sort command is issued, Stata will keep the relative order of cases within each sort category. Thus, if we sorted on sex, we would expect case 1 to remain ahead of case 3 among men and case 2 to remain ahead of case 4 among women. Not so! When sorting, Stata randomizes the order of cases within a given sort category. In general, this should cause no difficulty. If, however, there is a tacit assumption that the order within each sorting category is retained, there will be problems (I've spent days sorting out the messes this has created in sampling). This can be solved by saving the original order as above and then sort sex sortorder. If you are setting up a stratified random sample and require reproducibility, this can be solved by setting the seed of the random number generator ahead of the sort (e.g., set seed 1000). When one needs to sort in descending order, the sort command will not work; instead it is necessary to use gsort; the syntax is gsort -sex +age.

Lags and leads

Unlike SPSS, _n can also be used for lags and leads (cross-case comparisons within a single variable). Here, a case number is appended to a variable inside square brackets to indicate a particular case. Hence, sex[3] refers to the sex of the third case, while sex[_n] refers to the sex of the nth case. SPSS has a function called lag that can be used for the same ends. For instance, a variable identifying duplicate cases (though see the duplicates command below) could be constructed as:

    sort briusaid_1
    gen dupe=0
    replace dupe=1 if briusaid_1[_n]==briusaid_1[_n-1]

The lag or lead can be made, respectively, backward or forward by an arbitrary number of places by substituting, say, -2 or +1 for the -1 in the above example. Note that the [_n] on the left hand side of the logical equality is unnecessary. I include it for the sake of clarity. If we want to refer consistently to the nth case of the dataset, we put that case's row number in as:

    gen newvar = oldvar[1]

(oldvar[_N] always refers to the last row in the dataset; _N also happens to record the number of cases in the dataset.) These subscripts can be combined. For instance, we could reverse the values of oldvar as follows:

    gen newvar = oldvar[_N-_n+1]

(And, no, I didn't think that up myself.)
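One more illustration of what explicit subscripting buys you, with made-up variable names: after sorting, the change between each case and the one before it is a one-liner:

    sort testdate
    gen change = score - score[_n-1]    // difference from the previous case; missing for the first case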
One can substitute in a variable name, and Stata will refer to the row number designated by the value of that variable. Let's say we have a dataset with parents and children as individual cases and, for each child, ID variables holding the row number of each parent. (I will assume there is a variable called sortorder that makes sure the cases are in the correct order for these operations.) To add, say, each parent's denomination as a variable to the child's record, we could do as follows:

    sort sortorder
    foreach x in mom dad {
        gen `x'denom=denom[_`x'id]
    }

An extended example from a Stata lecture follows. By combining _n and _N with explicit indexing, we can produce truly amazing results. (Note the version command at the top of the file. This is needed for Stata 11 and later because this file uses the Stata 10 and earlier merge syntax.) For instance, let's assume we have a dataset that contains:

    personid    six-digit id number of person
    age         current age
    sex         sex (1=male, 2=female)
    weight      weight (lbs.)
    fatherid    six-digit id number of father (if in data)
    motherid    six-digit id number of mother (if in data)

    version 10
    capture log close
    log using crrel2, replace
    use relation, clear
    sort personid
    by personid: assert _N==1 /* see Exercise 9 */
    gen obsno = _n
    keep personid obsno
    rename personid id
    save mapping, replace
    use relation, clear
    gen id = fatherid
    sort id
    merge id using mapping
    keep if _merge==1 | _merge==3
    rename obsno f_n
    label var f_n "Father's obs. # when sorted"
    drop _merge id
    gen id = motherid
    sort id
    merge id using mapping
    keep if _merge==1 | _merge==3
    rename obsno m_n
    label var m_n "Mother's obs. # when sorted"
    drop _merge id
    sort personid
    save rel2, replace
    erase mapping.dta
    log close
    exit

Then, when I wanted, say, the father's age:

    sort personid /* if not already */
    gen fage = age[f_n]

and, if I wanted the mother's weight:

    gen mweight = weight[m_n]

The duplicates command

Charles correctly points out that my first example in the case order variable section reinvents the wheel. Stata has a built-in command called duplicates that handles just about anything one would like to do regarding duplicate cases. It can report all duplicates (cases with identical values for the variables specified in varlist), report only one example for each group of duplicates, create a new variable identifying duplicate observations, and delete duplicates (though caution is advised whenever using powerful commands that don't leave a record of what they dropped), and it has powerful controls for how the duplicate report tables are displayed. See help duplicates for details.
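For instance, a minimal sketch with the BRI ID variable used elsewhere in this document (the dupflag name is arbitrary):

    duplicates report briusaid_1
    duplicates tag briusaid_1, gen(dupflag)
    list briusaid_1 idmain if dupflag>0, clean noobs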
The list command

It's often helpful to look at some actual data to aid debugging. One way of going about this is to use the data browser. However, the variables one wants to compare are often far apart. A neat alternative is to use the list command, which will list cases onscreen (record them in a log file if you expect a lot of values). Here is a potential sequence of commands for finding and checking dupes in a BRI file (but see the duplicates command, above):

    log using dupecheck, replace text
    sort briusaid_1
    list briusaid_1 idmain idpanelmain if ///
        briusaid_1[_n]==briusaid_1[_n-1]| ///
        briusaid_1[_n]==briusaid_1[_n+1], clean noobs
    log close

Note the use of the if qualifier to limit the number of cases displayed and the use of forward and backward lags to ensure that both dupes are shown. clean and noobs respectively get rid of the frames around the items displayed and suppress observation numbers. list can also be used to quickly list answers for all items for a given respondent:

    list if token=="abc1234"

The by command

The by command in Stata is extremely helpful. It produces the same result as forming separate datasets for each unique set of values of varlist and running stata_cmd on each dataset separately. However, the data must first be sorted by the varlist used. This can be used for analysis:

    sort denomination
    by denomination: tab poshabcan prshabcan, col

If [_n] and [_N] are used with a by command, they refer to positions within each by grouping. Here by is used for data manipulation, creating bus averages for the bus guide scale:

    sort groupname
    by groupname: egen mnbusguide=total(busguide)
    by groupname: replace mnbusguide=mnbusguide/_N

The only thing to watch out for is that this divides the sum of the values of bus guide within a bus by the total number of people on that bus, which will be problematic if we don't have a response from each person. Of course, it is easier to simply do:

    sort groupname
    by groupname: egen mnbusguide=mean(busguide)

Data verification

Stata has a command called assert. This is followed by a logical expression. If the logical expression is contradicted, the program will throw an error message. Hence, looking for out-of-range values for an opinion question:

    assert prtripfree>=1&prtripfree<=4

If the system throws an error, we can find the offending cases in the following fashion:

    list briusaid prtripfree if !(prtripfree>=1&prtripfree<=4)

The exclamation mark specifies a logical not. The logical statement above is equivalent to (prtripfree<1|prtripfree>4), but the former is preferable as it is less likely to be mangled by human error, e.g., (prtripfree<=1|prtripfree>=4).

The in command

Stata can refer to lines in the dataset. Here is some syntax from Michelle that adds a line to the dataset (originally with 41,457 cases) and then assigns values to variables in that line:

    set obs 41458
    replace fedzip = "H9" in 41458
    replace fedcode = 11 in 41458

If you know the case number, you can also use in with the list command.

Predictions from estimation commands

Stata can generate predictions from estimation commands, most commonly regression models. This can be approached in several different ways. First, you can generate the predicted value of the dependent variable for every case included in the regression by using the predict command after the regression. This creates a new variable:

    reg vary varx1 varx2
    predict predvary

More often, we want to generate predictions for ideal types. For instance, we may regress posttrip attendance at Hillel activities on pretrip attendance at Hillel activities and being invited to posttrip Hillel activities. We might want to estimate the expected values of posttrip attendance for each pretrip frequency of attendance, holding invitations constant. Stata's built-in command for this is adjust. This, however, doesn't work well in the context of regressions for limited dependent variables. In this case, an ordinal logistic regression, it will return E(y*|x). In logistic regression, y* is generally unintelligible, and all the more so for ordinal logit. Instead we want Pr(y=1|x), Pr(y=2|x), Pr(y=3|x), Pr(y=4|x), and Pr(y=5|x). The best command for this purpose is a user-written command, prvalue. (If you do not have this on your computer, type net search spost.) We could then specify, following the regression:

    forvalues n=1/5 {
        prvalue, x(practhill=`n')
    }

This will return the estimates for each value of practhill.
The value of poinvhill will be set to its mean (spost is smart enough to calculate this for all variables not specified). This sounds good, but there's a catch. The mean value substituted for poinvhill is unweighted, when our data is in fact weighted. We really want to use the weighted mean. Here's how we accomplish it:

    svy: mean poinvhill practhill
    foreach x in poinvhill practhill {
        local mean_`x' = _b[`x']
    }
    svy: ologit poacthill poinvhill practhill, or
    forvalues n = 1/5 {
        prvalue, x(practhill=`n' poinvhill=`mean_poinvhill')
    }

But what is _b[poinvhill], which I save as local mean_poinvhill? In the section on looping, I used r(mean) and r(sd), which were statistics retained by Stata after an analysis command (there, summarize). Stata saves the coefficients of estimation commands as _b[varname]. For svy: mean, these are the means of the variables. Because I wish to use these after another estimation command, I have to store them somewhere else, because _b will be overwritten. I use a local macro for this purpose. There is one drawback with this approach: local macros only last for a single run of the .do file. As soon as the .do file has run, Stata forgets the local macros. This means that after making some changes, one has to rerun svy: mean, convert the means to local macros, rerun the estimation command, and then run prvalue. I have come up with an ugly kludge that gets around this, which is discussed on p. 33, as it involves matrices and scalars.

Working with dates and times

Thanks go to Graham for revising and expanding this section based on his (painful) experience working with times and dates.

Turning dates and times into numeric data

The clock() function converts string variables containing date and time information into numeric format. This is specified as:

    gen double newdate = clock(olddate, "MDYhm")

The double specifies that the variable created by the gen command is large enough to hold the (generally extremely large) date value that Stata creates (see Numeric variable types, p. 22, for a discussion of this issue). If the command were merely gen newdate = ..., then the variable created would not be able to hold all the data and would lose precision. The last argument specifies the order of the month, day, and so on. If the string variable holding the date lists these in a different order, you can shift around the "MDYhm" letters to compensate and Stata will understand. For example, if the date is stored as "year-month-day-hour-minute" then you would specify "YMDhm". The numeric variable created by this command is the number of milliseconds from 12 midnight on January 1, 1960 (this is why you need to store it in a double variable rather than the default float type: it's really big). To change the display from a number to a time and date, it is necessary to change the format to one of the time and date formats, of which there are various types (see help format):

    format newdate %tc

%tc ignores leap seconds. If you do want to track leap seconds (and, heaven knows, we all do), the correct format is %tC. Oftentimes, however, that level of precision is unnecessary and date-level information is sufficient. We can read in a string stored as DMY as follows:

    gen int newdate = date(olddate, "DMY")

The int specifies that this variable is to be stored as an integer type, which is generally the most efficient way to store integers. Similarly to clock variables, Stata counts days from January 1, 1960, except that the basic unit is the day, not the millisecond.
To get Stata to display this as a date, not a number, format it: format newdate %td At times, a time format will be more precise than one wants and can be converted to a date format (here, I overwrite the existing variable): replace submitdate_w2=dofc(submitdate_w2) There is the opposite function, cofd, which turns date variables into time variables, which is of limited use as the time of the day is always set to midnight (to the millisecond). 20 Using Stata More Effectively Extracting days, months, years, and more from date variables Stata can extract information from %td variables like the year, month, day (of month), day (of year), and day (of week). These are simple functions of generate, where d is the %td variable: year(d), month(d), day(d) (day of month), doy(d) (day of year), dow(d) (day of week), week(d) (week of year), halfyear(d) (half of year), and quarter(d) (quarter of year). For example, we extract year of birth from a birthdate variable: gen yrborn=year(birthdate) Recoding date variables While putting the date variable in a format Stata can recognize is all well and good, you may wish to recode this numeric variable to reflect a number of days, hours or minutes, as opposed to milliseconds. In order to do this, you must divide it by some set of numbers depending on what time unit you want it to display. For example, by dividing the variable by (1000*60*60*24)—which is the number of milliseconds in a second, the number of seconds in a minute, the number of minutes in an hour, and the number of hours in a day—will convert milliseconds into days. This is often useful when, for example, you are attempting to create binary date variables that select only cases with a date of a specific day or year. Turning separate numeric day, month, year variables into age The following code transforms variables day, month, and year into an age variable, when age is expressed in the standard Western form, i.e., ⌊𝑎𝑔𝑒⌋ or floor(age). gen dobday=string(day,"%2.0f") replace dobday="0"+dobday if length(dobday)==1 gen dobmonth=string(month,"%2.0f") replace dobmonth="0"+dobmonth if length(dobmonth)==1 recode year (1962/1977=1978)(1998=1992) gen dobyear=string(year,"%4.0f") gen age=floor((date("1jan2010","DMY")- /// date((dobday+"-"+dobmonth+"-"+dobyear), "DMY"))/365.25) Working with dates as strings In some (perhaps most) cases, despite all of Stata’s features, it may be easier to simply keep a string variable that codes a date as a string, and work with it that way. Often, if the data is sorted in a standardized way, you can use string functions (see p. 22ff for more) to extract the pertinent time unit (days, hours, years) from the string variable and work with these variables instead. For example, if the string date variable stores its values as “200802-15 11:32:15.20000” and all you want to know is the year, month and day, you can use the substring function described below to extract only that data, by selecting the first 10 characters of the string only: gen newdate=substr(date,1,10) 21 Using Stata More Effectively Once this command is run the resulting date variable will look like this: “2008-02-15” and can be dealt with much more easily. Numeric variable types Stata stores numeric variables with varying degrees of precision. The numeric variable type that occupies the least memory (both in terms of storage on a hard disk or other semipermanent medium and in RAM when the file is active) is a byte (i.e., 0 or 1). 
At the other end of the spectrum, a double (IEEE 754-1985 double precision floating-point binary format) occupies the most space. It can hold up to approximately 16 decimal digits. Naturally, this occupies the most space. When you convert a file to Stata using Stat/Transfer 10, it automatically chooses the type with the smallest memory footprint for each variable that does not lose any precision. If you are using a file from another source or simply want to check, simply use the command compress and Stata will automatically choose the type with the smallest memory footprint that loses the least data. By default, Stata stores variables in the float type. This has the smallest memory footprint for noninteger variables. You can always specify the kind of variable you want stored. While it might be good practice to specify byte for dummy variables (e.g., gen byte female=1 if sex==1), at worst this will chew up storage space and slow the system down very slightly. On the other hand, when generating a random variable for sorting order operations where there are a large number of cases, specifying double is a must because the less precision one has, the greater the odds of having cases with precisely the same value which will be randomly ordered each time a sort is run. In a dataset with approximately 90,000 cases, it turned out that even double was not sufficiently precise and additional precautions had to be taken to maintain the precise sort order. String functions Particularly when merging ID variables, string manipulation rears its ugly head. (A string is a variable containing characters that may include letters and symbols.) Things to remember include that missing is rendered as ″″ (i.e., a zero-width string) and the possibility of leading and/or trailing blanks (i.e., ″ ″). Transforming string variables to numeric There are at least three different kinds of string to numeric operations. (1) Transform a string variable containing words into a numeric variable with labels similar to those of the original variable. (2) Transform a variable containing words into a new variable with labels dissimilar to those of the original variable. (3) Transform a numeric string variable (i.e., no characters besides numerals, decimal points, and commas) into a numeric format. In IBM SPSS, Operations 1 and 2 could be accomplished using the recode command. Not so in Stata. An example of Operation 1 is transforming a string (e.g., name of university) into a numeric variable. This is handled by a command, encode, that automatically generates a new numeric variable with a user-specified name. (You cannot directly overwrite the variable.) Unless you tell Stata to do otherwise, it will create its own value labels. It’s often the case, however, that the value labels are messy and the values are not in the order 22 Using Stata More Effectively you like. Thus, typical usage will often be to run encode, drop the old variable, recode the new variable (perhaps to the old variable’s name), and assign new value labels. An example of Operation 2 is transforming a registration variable where gender is encoded as ″False″ for males and ″True″ for females (this is a product of the way SQL treats 0,1 variables). We would typically like this to be either gender (1=male, 2=female) or female (0=no, 1=yes). There is no elegant way I know how to do this. Brute force works best with two category variables like this. 
gen gender2=1 if gender==″False″ replace gender2=2 if gender==″True″ drop gender rename gender2 gender In Operation 3, the variable, though encoded as a string, is already in numeric format. Here we simply use destring. This command includes options to either overwrite (replace) the original variable or generate a new variable (you must choose one of these options). Alternately, one could use the real function of the generate command, which bases the numeric format of the resulting variable on the way it is displayed in the original string variable. Changing format will require using the format command. Transforming numeric variables to strings As with the alternate operation, there are several scenarios. (4) Transform a numeric variable with value labels to a string variable where the string is identical to the value labels. (5) Transform a numeric variable into a string variable. For Operation 4, one uses encode’s counterpart, decode. For Operation 5, there are multiple alternatives. In most cases, the best option is probably destring’s counterpart, tostring. Like destring, we can optionally replace the existing variable. You also have the option to specify a display format for the string. If the numeric to string transformation is part of a larger function, you can use the string function of the generate command. This takes a variable format specification, allowing one to control the formatting of the way in which the variable is displayed: gen agestr=string(age,″%2.0f″) Here I want to display in %2.0f format, like 24, 26, etc. If anyone in my sample was aged more than 99, I would have to use %3.0f format. If I wanted age in decimals, I would need to specify as %4.1f or similar. Searching for substrings It is sometimes necessary to search for substrings (i.e., strings of characters inside a larger string) as, for instance, in looking test cases in a dataset where people gave names with some variant of “test” in them (e.g., “Ben-test”). This can be accomplished using the regular expression function: 23 Using Stata More Effectively list if regexm(lower(namefirst),"test")==1 List is used rather than drop because this would catch nontest cases where “test” was part of the name (e.g., “Testa”). The lower case function (lower()) is used to deal with irregularities in capitalization. Naturally, far more ambitious regular expressions can be used (see my notes on using TextPad effectively). The following drops phone numbers with letters of the alphabet in them: replace phoneprim2="" if regexm(phoneprim2,"[A-Za-z]")==1 Other string functions Other useful string functions (see help string_functions for a comprehensive list) are given below. Note that these aren’t restricted to variable creation and can be used in conditional statements (see the example below for an illustration). Finally, s refers to a string (which can either be a string variable name or an actual string like ″Boston″). Keep only part of a string: substr(s,n1,n2) (n1 is the position the substring should begin at, n2 is the desired length of the substring, hence substr(s,2,4) begins at the second position and has a length of four, thus terminating at the fifth position) Trim leading blanks: ltrim(s) Trim trailing blanks: rtrim(s) Trim internal blanks: itrim(s) Trim leading and trailing blanks: trim(s) Length of string (returns length of string): length(s) Make all letters lower case: lower(s) 24 Using Stata More Effectively Make all letters upper case: upper(s) Real (returns numeric value if numeric, system missing (.) 
if not numeric): real(s) Note that decode, which is a command, not a function, is roughly speaking the counterpart to real. An example of transforming zip codes for the U.S. and Canada: replace zip=trim(zip) /* Removes leading and trailing blanks */ replace zip=upper(zip) /* Turn lower case Canadian postal codes to all upper case) */ replace zip=substr(zip,1,5) if real(substr(zip,1,1))~=. /* Turn U.S. zip+4 to regular zips with a length of 5) */ replace zip=substr(zip,1,2) if real(substr(zip,1,1))==. /* Keep only first two places of Canadian zips */ Importing .csv and other text files We work with text files more often than one might think. The basic form is: insheet using data.csv, comma clear .csv is, naturally, a comma-separated format: the delimiter (the character that separates columns/variables) is the comma and the text qualifier is the quotation mark. insheet supports specifying the delimiter: insheet using data.dat, delimiter("~") clear Here, the delimiter is the tilde (~). Why would we want a tilde for a delimiter? When working with open-ended responses, people often use commas. Now, a properly specified .csv shouldn’t have a problem with this because textual commas will be inside the text qualifier (i.e., the quotation marks). Usually, that’s fine. But what if people also use quotation marks in their open-ended responses? Things go to hell. To avoid Pandemonium, we thus specify some character that people are unlikely to use, hence the tilde. Where the text file is in fixed length format, the command is infix. Usage is as follows (numbers are column numbers for each variable and pertain to the variable name they follow): infix id 1-7 str blah 8-20 altwt2 21-27 using alt4wt.out 25 Using Stata More Effectively Stata has a dictionary format that may be useful when we need to repeatedly import complex text files. I won’t go into this further, as further details can be found in the manual. Dealing with delimiter issues on imports Occasionally, you will “insheet” a .csv file containing string variables to Stata and discover that commas in the string variables have caused the variable to be broken into multiple parts, i.e., the commas become separators between columns instead of part of the text contained in the variable. In this case, if you can put your string variable at the end of the dataset, Stata can fix the problem. First, put the commas back in the text this way: egen optaffil1=concat(v14 v15 v16 v17), punct(,) Then, trim the trailing commas: replace optaffil1=reverse(substr(reverse(optaffil1), indexnot(reverse(optaffil1), ","), .)) Exporting .csv, fixed format, and other text files We often need to get Stata to export files in some text format for cleaning, weighting, or other purposes. Exporting .csv files is very straightforward: outsheet vara varb using xyz.csv, comma nolabel replace By default, Stata exports text files with tab as the delimiter and quotation marks. It also has (in my opinion) the unfortunate habit of exporting value labels rather than the underlying numeric values of a variable unless told to do otherwise. You can also turn off quotation marks, change the delimiter, and not put variable names on the first row of the file. See help outsheet for further details. Stata can be hacked to export fixed format datasets, too. (Unlike SPSS, it doesn’t support this natively. There is no “outfix” command.) The important thing to bear in mind when doing this, though, is that all values within a variable must be the same length. 
Thus, if some values are shorter than others, you will need to turn them into strings and add enough leading blank spaces for each value to be the same length. For an ID variable with a maximum width of four characters, this would be as follows.
gen newid=string(id,"%4.0f")
replace newid=" "+newid if length(newid)==1
replace newid=" "+newid if length(newid)==2
replace newid=" "+newid if length(newid)==3
outsheet newid vara varb using blah.dat, ///
    delimiter(" ") nonames nolabel noquote replace
In some circumstances, like working with QBAL, our rim weighting program, it may be more helpful to use zeroes rather than spaces. Here is a code fragment that adds zeroes to the beginning of a weight variable whose values have different lengths by looping through and adding one zero at a time until each value reaches the correct length. This obviates the need to write out a line of code for each extant length of the string under the maximum string length.
gen nwrswtst=string(newrswt,"%8.1f")
forvalues i=3/7 {
    replace nwrswtst="0"+nwrswtst if length(nwrswtst)==`i'
}
Note that this hack produces a file with columns of spaces between variables, not the usual fixed format with no blank columns. Setting the delimiter to "" will not work; in that case, Stata just defaults to producing tab separated files.
Merging, appending, and reshaping
Before getting into the technical details of merging, it’s worth spending a moment thinking about the operation on a practical and conceptual level. A merge takes place when two datasets have at least some cases in common, but different substantive variables, which is usually why we want to merge in the first place. (While it’s OK from Stata’s perspective if both files have cases not found in the other one, or even no cases in common, generally it’s bad from a user point of view because values will be systematically missing from variables for cases found in one but not the other dataset.) We might for instance have a set of information derived from administrative data from program registration that we wish to add to a file containing survey data. An append occurs, conversely, when the two files have similar sets of variables but different sets of cases and we essentially want to add more cases at the bottom of the file. Thus, we might want to add together files from two similar surveys of different populations.
Merging
The pivot around which a merge takes place is a single variable or set of variables found in both files that identifies a case. In the most basic case, this will be an ID variable that identifies cases uniquely. In more complex cases, this might be a set of variables that do not uniquely identify cases. In either case, the merge operation uses the identifying variable or variables to determine which cases to link together.
In the simplest of all cases, there is a single ID variable with unique values (because this will be the basis of the merge, it must be in both files) and both files contain precisely the same cases and, other than the ID variable, have no other variables in common. The variables (other than the ID variable) in the using file (the file being merged into the currently open master file) will be appended to the right hand side of the dataset.
Making things a little more complicated (but still relatively simple), what happens if the ID variable has duplicates in either the using or the master file?
This might be the case, for instance, in a household survey where one file contains one record for every household member (each with the same household ID variable) containing individual-level demographic data, while the other file contains household-level data, with only a single case per household. In this instance, we would want to merge the variables from the household-level file into the individual-level file. While one could do the merge the other way around, it makes the most sense to have the individual-level file open as the master file and the household-level data as the using file. The same household-level variables are thus added to each individual-level record where the ID is the same.
To complicate the picture a little more, it’s often the case that there are some "orphan" cases that are found in one file and not the other. That doesn’t present problems for Stata. It will just leave missing values in the variables for cases for which no match exists, and it records the status of each case in a variable called _merge (this will be created regardless of whether all cases matched). This gives a value of 1 for cases found only in the master file, 2 for cases found only in the using file, and 3 for cases found in both files. You can then take the appropriate steps, which will vary depending on the situation. Where the orphan cases are salient, one keeps them. If they’re not, the best way of removing unmatched cases from the using file is to use the keep option (not to be confused with keepusing, discussed below), which Graham introduced to me. This is advantageous when memory is an issue, as it can massively reduce the amount of memory Stata needs to perform the merge.
Returning to _merge, it’s good practice to drop or rename _merge so it doesn’t create problems for the next merge. If there is no reason to generate _merge (and I’d argue that this is rarely the case, as you ought to run diagnostics after each merge to verify things went as expected, and they often don’t), you can choose the nogenerate option. You can also tell Stata to call _merge something else by using the generate option.
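By way of illustration, a minimal post-merge check and cleanup along these lines (for when you have not already restricted the merge with the keep option) might be:
tab _merge /* How many cases matched, and from which file did the orphans come? */
keep if _merge!=2 /* Drop orphan cases that exist only in the using file */
drop _merge /* Clear the way for the next merge */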
What if you have some variables in the file you will merge into the master file that aren’t needed? In general, my practice has been to chew up some hard disk space (which is cheap): drop the unneeded variables, save the file under a new name, and merge using the newly created file. Alternately, you can just drop them after the merge. Either way, this is wasteful of time and resources and creates file after file cluttering up your directories. A better and more elegant approach (which is what this document is about) is to use the keepusing option, which specifies the variables to be kept from the using file. This is especially useful when memory is tight, as it minimizes the amount of memory Stata requires to perform a merge.
Let’s get a little more complicated still and imagine a situation where there are some variables in common between the files other than the ID variable or variables. If one of the files has better values than the other, you probably want to drop the "worse" variable(s) from the master file prior to merging, or use the keepusing option to remove variables from the using file, as appropriate.
Where some cases have values on one file and some cases have values on the other file, Stata offers the options update (which overwrites missing values in the master file with values from the using file; note that this will produce additional values of _merge for matched cases where updates occurred, did not occur, or where the two files had conflicting nonmissing values) and update replace (which additionally overwrites nonmissing values in the master file with nonmissing values from the using file). You will want to think very hard before using this functionality because it’s very powerful and can overwrite values inappropriately without leaving any indication that it did so.
To muddy the waters further still, what happens if both files have duplicate values? You most likely shouldn’t be in this situation and, if you are, you presumably know what you are doing. Stata will create new cases for each combination of the duplicate values. As before, this is almost always undesirable and suggests that something is wrong with one or both files, your assumptions, or all of the above.
Finally, what about the possibility of using multiple variables to identify cases for merging? In general, I would recommend against this as a case of a powerful but difficult-to-trace function. While Stata will work happily with multiple matching variables (as long as they are all present in both files), you would do better to create a variable that concatenates the values of the various variables into a single variable (doing the same operation in each file) to use for the merge. This promotes clear thinking about what you are doing and makes searching for duplicate cases a snap. You can always drop the variable from the merged file later on.
As I noted previously, the code used to merge files together in Stata 11 has changed significantly from the "more primitive syntax" (see help merge) used in previous versions of Stata. The new code is far easier to understand. The type of merge is now specified immediately after the command:
merge 1:1
merge m:1
merge 1:m
merge m:m
The left hand side of the colon specifies the nature of observations in the currently open dataset. Are they unique (i.e., does each case in the file have a unique combination of values on the variables being merged on)? If so, 1. If not, m. Not surprisingly, the right hand side specifies the nature of the file that is being opened (used) in the merge. The ID variable(s) to merge on are specified immediately after the 1:1, m:1, etc. part of the command. See help merge for further details on the syntax.
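To make the new syntax concrete, here is a minimal sketch of the household scenario discussed above, merging household-level variables onto an individual-level file; the file and variable names (persons.dta, households.dta, hhid, and so on) are hypothetical:
use persons.dta, clear /* Individual-level master file; many cases per hhid */
merge m:1 hhid using households.dta, keepusing(hhincome hhsize) keep(match master)
drop _merge
Here keepusing() brings across only the two household-level variables named, and keep(match master) discards orphan cases found only in the using file.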
A final thing to bear in mind is that Stata, as of version 11.0, automatically sorts the cases on the ID variable(s) unless you tell it otherwise (sorted). This is an improvement, but remember that when Stata sorts a dataset and has cases of equal rank, it randomizes their order (see p. 14). In most cases, this should not present a problem. If, however, order is important to you (e.g., you create a reproducible sample based on random numbers after set seed), this will cause unpredictable, irreproducible variations. In such cases, you would do best to follow my earlier advice: create an unambiguous order variable (e.g., using _n), sort the cases prior to the merge by the ID variable(s) and then by the unambiguous order variable, and use the sorted option. This only works, of course, if no additional cases are being added from the using file. If cases are provided by both files and some of them have equal rank with respect to the ID variable(s), the problem remains and you would want to take additional steps to ensure that cases are placed in a predictable order.
Appends
By comparison, append is very simple. You simply specify the dataset(s) you wish to append to the master dataset (note that Stata doesn’t use the term "master" with respect to append), choose whether or not to have a variable record which file was the source of a given case (the generate option), and can limit the variables to be appended from the using dataset(s) via the keep option.
Reshaping
For cases like datasets containing information on multiple individuals in a household stored as variables (e.g., sex1 sex2 sex3) or datasets where cases are clustered (e.g., cases 1, 2, and 3 are all members of household 1, while cases 4 and 5 are members of household 2), Stata can reshape the dataset. I have used this command in the past but found it to be rather unintuitive, to put it mildly. I will expand this description when I next work with it (and so can you!).
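In the meantime, here is a minimal sketch of the wide-to-long case just described, assuming a household ID variable called hhid and person-level variables sex1, sex2, and sex3; reshape wide reverses the operation:
reshape long sex, i(hhid) j(person) /* sex1-sex3 become a single sex variable, with person taking the values 1, 2, 3 */
reshape wide sex, i(hhid) j(person) /* And back again */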
Matrices and scalars
As we’ve seen, besides the dataset, Stata can store and manipulate data in local macros, and some commands store additional information that can be accessed, like the coefficients of a regression model. As it happens, Stata has another way to store and manipulate information: the matrix. A matrix is simply an r × c grid that contains data (where r is rows and c is columns). This is a vital functionality for a statistics program, as most estimation commands involve the manipulation of matrices. But access to matrices is not limited to .ado files (the programs that define Stata commands). You can create your own matrices and use matrices created by estimation commands.
There are, in fact, two ways to do this. The first is Stata’s original matrix language. This is suitable for almost all operations one is likely to undertake; the only constraint that may rear its head is that the size of matrices is limited to 11,000 × 11,000. Since Stata 9, however, StataCorp has added a new language: Mata. Mata allows for far more powerful and complicated manipulation of matrices with no practical size limitations. I do not recommend using it, though, unless you absolutely have to. Mata is not really an organic part of Stata. To work with Mata, one either temporarily "pauses" Stata to work in Mata or invokes it through special Stata commands that function (to my mind) something like an API. The syntax for Mata is to my mind unintuitive and its manual is, frankly, a disaster, being divided into M-1, M-2, etc. components without really explaining what each one is or which commands are explained where. Thus, I shall say nothing more about Mata.
Before I begin, let me add a cautionary note. This section focuses on using matrices not for the matrix mathematics one can do with them but as a way of hacking Stata to get it to perform relatively basic operations that are difficult, clunky, or impossible using regular techniques. Sophisticated mathematicians and statisticians may want to puke. You have been warned.
A brief primer on matrices
The first thing to remember about matrices is that individual elements of the matrix are specified in precisely the opposite fashion to Cartesian coordinates: y,x not x,y. Of course, the standard way of stating this is r,c not y,x. I do not know why this idiotic convention arose any more than I know why keyboard and telephone number pads have different orders, but as with keyboards and phones, one has to learn to live with it.
In the title of this section, I mentioned scalars. A scalar is a single number (something like a 1 × 1 matrix) that has some special properties in Stata. From the point of view of regular datasets, a scalar is equivalent to a variable with a constant value for each case. (A scalar can, however, have the same name as a matrix.) Thus one might multiply each element of a variable in the dataset by a scalar. One commonly occurring problem is that the name of the scalar is treated as a variable. Thus, a scalar shouldn’t have the same name as a variable (the variable has precedence). Moreover, if a scalar has the same name as the unique leading characters of a given variable, that variable has precedence. For example, if we have a dataset containing a single variable, female, a scalar named f will give precedence to female. On the other hand, if we have a dataset with two variables, female and femage, the scalar will have precedence, because Stata knows it can’t distinguish female from femage. In practice, one does better to avoid this situation entirely by giving each scalar a unique name. Failing to heed this rule will lead to errors that are difficult to trace (see the short sketch at the end of this primer).
A final point to note about matrices is that one does not add them together or multiply them unthinkingly. Matrix operations like these follow the logic of matrix algebra. Unless you’re au fait with matrix algebra, the safe and certain way to perform operations on a matrix is to loop through rows and columns, performing the desired operation on each element in turn.
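Here is the short sketch promised above, illustrating the naming trap. It assumes the auto dataset that ships with Stata, which contains a variable called price; the scalar() pseudofunction is the escape hatch when a scalar does collide with a variable name:
sysuse auto, clear
scalar price = 10000
gen flag1 = price > 5000 /* Refers to the VARIABLE price, so flag1 varies across cases */
gen flag2 = scalar(price) > 5000 /* scalar() forces the scalar, so flag2 is 1 for every case */
In real work, of course, a distinctive name like sc_price sidesteps the ambiguity entirely, which is the point of the rule above.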
Accessing and addressing matrices
An existing matrix can be viewed with the command matrix list matname. New matrices can be created by directly entering data, but this is unlikely to be needed and you can look it up yourself. A more common occurrence is that one needs to define a matrix in advance so that one can later fill it with values. This is what it looks like:
matrix x = J(2,3,.)
This syntax creates an r × c matrix (here 2 × 3) called x (in mathematics, matrix names are typically expressed in unitalicized bold) with all the elements set to missing. To make the elements equal to 0, we could do matrix x = J(2,3,0). Note that setting a matrix equal to J(r,c,value) will overwrite an existing matrix of the same name, so be careful. You can always drop existing matrices, too, by matrix drop x. Except for neatness, this is unnecessary, as there is no limit on the number of matrices one can create and access.
So far, I’ve avoided addressing matrix elements. To access x[2,3] in our blank matrix and change it to 1, we simply do matrix x[2,3] = 1. Let’s now work through our matrix and assign values to each cell that are the sum of the row and column numbers (yes, it’s a dumb example, but it illustrates operations on individual elements):
forvalues r = 1/2 {
    forvalues c = 1/3 {
        matrix x[`r',`c'] = `r'+`c'
    }
}
Now let’s create a new matrix y with cells equal to the same cell in x plus 1:
matrix y = J(2,3,.)
forvalues r = 1/2 {
    forvalues c = 1/3 {
        matrix y[`r',`c'] = x[`r',`c'] + 1
    }
}
This is all well and good, but typically one wants to get an element or elements of the matrix to interact with the dataset. Stata doesn’t make this easy, because one can’t simply put a reference to a matrix into a regular command; gen varx = vary * x[1,1] won’t work. I’m sure there’s a better way, but my kludge is to copy the value of the element of the matrix to a scalar (remembering that scalars are treated as variables from a naming perspective):
scalar x = x[1,1]
gen varx = vary * x
Using matrices generated by estimation commands
Probably the most common cause for dirtying one’s hands with matrices is to make use of matrices created by estimation commands. After running an estimation command, do ereturn list. This will list the information generated by the command. Here’s what it looks like after an OLS regression of a variable of interest on two covariates:
. ereturn list
scalars:
                  e(N) =  221
               e(df_m) =  2
               e(df_r) =  218
                  e(F) =  .4511474285902355
                 e(r2) =  .0041219067975928
               e(rmse) =  2.823463192155305
                e(mss) =  7.193044431553062
                e(rss) =  1737.88387864537
               e(r2_a) =  -.0050145894703193
                 e(ll) =  -541.465233008145
               e(ll_0) =  -541.9216450006693
macros:
            e(cmdline) : "regress y x1 x2"
              e(title) : "Linear regression"
                e(vce) : "ols"
             e(depvar) : "y"
                e(cmd) : "regress"
         e(properties) : "b V"
            e(predict) : "regres_p"
              e(model) : "ols"
          e(estat_cmd) : "regress_estat"
matrices:
                  e(b) :  1 x 3
                  e(V) :  3 x 3
functions:
              e(sample)
At the top we can see some scalars that could be useful (RMSE, for example) and near the bottom we can see some matrices. The matrix e(b) contains the coefficients for the intercept and covariates, while the matrix e(V) is the variance-covariance matrix of the covariates (and intercept). For some calculations, it might be useful to save one or more of the coefficients and use them later. There are a few things to remember, though. First, these will be overwritten by the next estimation command. Second, it turns out that you can’t directly access the elements of e(b) or e(V). One can’t do the following, for instance, where z is a matrix we want to put a coefficient in for some reason or another:
matrix z[4,1]=e(b)[1,1]
Instead, one has to do the following, creating matrix x to hold e(b) so we can refer to the element we want to copy to z:
matrix x=e(b)
matrix z[4,1]=x[1,1]
matrix drop x
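The same overwritten-by-the-next-estimation caveat applies to the returned scalars, so if you want, say, the R-squared or RMSE for later use, copy them off straight away. A minimal sketch, with hypothetical variable names:
regress y x1 x2
scalar baser2 = e(r2) /* Saved before the next estimation command overwrites e() */
scalar basermse = e(rmse)
display "R-squared: " baser2 "  RMSE: " basermse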
Using matrices and scalars for predictions
I discussed generating predicted values when using weighted survey data previously (p. 18). At the time I presented a relatively elegant solution, albeit one with considerable drawbacks, while a more robust, but uglier, solution was deferred until matrices and scalars had been described. I have just described matrices and scalars, so it’s now time for the ugly kludge threatened previously. This relies on scalars to hold the (weighted) means of the predictor variables. The advantage of using a scalar over a local is that scalars do not cease to exist, kick the bucket, shuffle off this mortal coil, etc. as soon as a .do file has run. The disadvantage, though, is that they are a pain to bring to life in the first place and to revive each time one wants to use them. (Note that closing and restarting Stata will remove scalars and matrices from memory, as will deliberately dropping them.)
The first stage is to run svy: mean separately for each variable. (One could, in theory, run the mean of all explanatory variables simultaneously and refer to the column of e(b) in which each is to be found for subsequent operations. While this would save some lines of code, in my case at least, this potential efficiency is more than counterbalanced by the greater likelihood of referring to the wrong column.) After each svy: mean, one first converts the 1 × 1 matrix e(b) into a temporary matrix, which I call x, then writes the value of the only cell of x into a scalar (being careful to give it a unique name; see p. 31 for details):
svy: mean hoursjewed /* Taking mean of jedu hrs for calcs involving quadratic term */
matrix x=e(b)
scalar meanjeduhrs=x[1,1] /* Save scalar for use as local */
scalar meanjeduhrsq=meanjeduhrs^2 /* Contains square of mean hrs jew ed */
matrix drop x
Immediately before or after running the estimation command for which one wants to generate predictions, one then turns the scalars into locals and refers to the locals in prvalue:
svy: ologit conisr participant hoursjewed hoursjewedsq, or
local j=meanjeduhrs
local js=meanjeduhrsq
prvalue, x(participant=0 hoursjewed=`j' hoursjewedsq=`js')
prvalue, x(participant=1 hoursjewed=`j' hoursjewedsq=`js')
This way the weighted means can be used without having to re-estimate them each time a .do file is run.
Running Stata from the command line
Back in the day when researchers were researchers, analyses ran on mainframes, and the windows paradigm was just a curiosity at Xerox in Palo Alto, real social scientists ran analyses from the command line (the operating system command line, not Stata’s). While it’s buried in very obscure places in the Stata manual today, that functionality remains. Why bother, other than my own antiquarian interest in data analysis modes of the past? There are two scenarios where this may come in handy. The first is to run analyses on a schedule (i.e., schedule them with Windows so some analysis occurs overnight or at some other time) and the second is to run Stata and other programs in a predetermined sequence.
A hypothetical example that draws on both these scenarios is as follows. Suppose we have an ongoing survey on LimeSurvey for which we have cleaning syntax in Stata, rim weighting syntax in QBAL, and analytic syntax in Stata. In sequence, we probably want to get the server to feed us the latest data from Lime, open it in SPSS (PASW/IBM SPSS; we can use an SPSS script generated by Lime), run the script and save the file, use StatTransfer to turn it into a Stata file, run the cleaning syntax in Stata and export the data needed for weighting, rim weight the file, merge the rim weights back into the main Stata file, and then run the analyses we need. To do this, we write a DOS batch file that calls the various programs in sequence (SPSS can also be run from the command line) and tells them which files to run, and we set it to run at a specified time using the Windows scheduler.
I won’t go through invoking SPSS or the server from the command line, but I will give the contents of a batch file that weights the BRI long-term surveys:
REM Windows batch file for weighting BRI Long Term data
REM REM is a comment/remark
REM change working directory to data file
cd \Cohen Center\BRI\BRI Panel Study\Data\
REM runs initial part of file
"C:\Program Files (x86)\Stata10\wmpstata" /e do "C:\Cohen Center\BRI\BRI Panel Study\Syntax\bri_lt_weight_part1_060809.do"
qbal round3wt.qbs
qbal round5awt.qbs
REM this weight balances denoms
REM runs second part of file
"C:\Program Files (x86)\Stata10\wmpstata" /e do "C:\Cohen Center\BRI\BRI Panel Study\Syntax\bri_lt_weight_part2_060809.do"
qbal round5wt.qbs
qbal round7wt.qbs
qbal round9wt.qbs
REM runs third part of file
"C:\Program Files (x86)\Stata10\wmpstata" /e do "C:\Cohen Center\BRI\BRI Panel Study\Syntax\bri_lt_weight_part3_050809.do"
qbal round5awtx.qbs
REM runs fourth part of file
"C:\Program Files (x86)\Stata10\wmpstata" /e do "C:\Cohen Center\BRI\BRI Panel Study\Syntax\bri_lt_weight_part4_050809.do"
qbal round3wtx.qbs
qbal round5wtx.qbs
qbal round7wtx.qbs
qbal round9wtx.qbs
REM runs fifth part of file
"C:\Program Files (x86)\Stata10\wmpstata" /e do "C:\Cohen Center\BRI\BRI Panel Study\Syntax\bri_lt_weight_part5_050809.do"
There are a couple of things to note regarding calling Stata. Any file paths that contain spaces need to be enclosed in quotation marks. The first part of each call to Stata ("C:\Program Files (x86)\Stata10\wmpstata") simply tells Windows to start Stata by giving the location of Stata’s executable on your computer (both the exact location and the name of the Stata executable may differ for you). The /e tells Stata to run without stopping for you to hit enter on the termination of the file. Finally, do "C:\Cohen Center\BRI\BRI Panel Study\Syntax\bri_lt_weight_part5_050809.do" tells Stata which .do file to run. Note that all Stata (and QBAL) syntax files need to have at least one carriage return (i.e., a blank line) at the end of the file, otherwise the programs will blow up.
Increasing the size of the command prompt scrollback buffer
One problem with a fairly long batch file like the example above is that the Windows command prompt window only shows a set number of lines by default, much like Stata, and that number of lines is too small to see all the commands (and whether they blow up). Here’s how to fix this (courtesy of http://www.petri.co.il/customize_command_prompt_in_windows_xp_2000_2003.htm):
1. Open the command prompt.
2. Click the upper left hand corner of the command prompt window (i.e., the small black box containing C:\ in the title bar of the window) and select Properties.
3. Click the Options tab.
4. In Command History, set the Buffer Size to 999 and select 5 in Number of Buffers.
5. In the Edit Options box, select the Quick Edit Mode and Insert Mode check boxes.
6. Click the Layout tab.
7. In Screen Buffer Size, set the Height to 9999.
8. Optionally, increase the Height and/or Width of the window under Window Size.
9. When you apply the properties, save them for future windows with the same name.
Programs
A program has a very specific meaning in Stata. It is a sort of multiline macro that can be called upon within a .do file. The most powerful aspect of a program is its ability to use the arguments appended to the program name when it is called. This was mentioned briefly regarding macros.
Programs have a format quite similar to that of loops in that they must be embedded properly when they are defined. Let us say we wish to write a program that will simply display the first four things appended after it:
program show
    display "`1'"
    display "`2'"
    display "`3'"
    display "`4'"
end
Were we to type, or put in a .do file, show a b c d, Stata would return:
a
b
c
d
A program will remain after being defined until Stata is closed, even if its usefulness has passed. Programs can be removed by program drop programname. Where might a program be useful? Just as typing (or contemplating typing) line after line of code that varies only with respect to a variable name or a number is an indicator that it might be helpful to use a loop or loops instead, having large segments of a .do file that repeat may be an indicator that a program would be useful. Here is a program I wrote that generates a stacked column graph for weighted survey data:
capture program drop svychart
program svychart
    preserve /* Keep copy of unchanged dataset */
    label save `2' using label`2', replace
    quietly: svy: tab `1' `2'
    local r = e(r) /* Number of rows */
    local c = e(c) /* Number of cols */
    matrix prop = e(Prop) /* Copy matrix of weighted cell proportions */
    matrix coltot = J(1,`c',0)
    forvalues col = 1/`c' {
        forvalues row = 1/`r' { /* Calculate column total */
            matrix coltot[1,`col'] = coltot[1,`col'] + ///
                prop[`row',`col']
        }
    }
    matrix colpct = J(`r',`c',0)
    forvalues col = 1/`c' { /* Calculate within column percentages */
        forvalues row = 1/`r' {
            matrix colpct[`row',`col'] = prop[`row',`col'] ///
                / coltot[1,`col']
        }
    }
    matrix cname = e(Col) /* Create matrix to get col names */
    local cnames : colfullnames cname /* Local containing col names */
    matrix colnames colpct = `cnames' /* Add col names to matrix */
    matrix rname = e(Row)' /* Create transposed matrix to get row names */
    local rnames : rowfullnames rname /* Local containing row names */
    matrix rownames colpct = `rnames' /* Add row names to matrix */
    matrix rowpct = colpct' /* Transpose matrix */
    clear /* Clear dataset */
    svmat rowpct, names(col) /* Write matrix to dataset */
    gen `2' = _n /* Create a column containing group ids */
    run label`2' /* Create a label for group ids */
    label values `2' `2' /* Assign labels to group ids */
    graph bar `rnames', over(`2') percentages stack /* Graph command */
    restore /* Bring back original dataset */
end
Note the use of preserve at the beginning of the program and restore at the end. These allow me to make temporary changes to the dataset for the purposes of making calculations without having to actually save and use the dataset. Note also the quietly prefix to the svy: tab command to suppress unnecessary output.
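For what it’s worth, the program is called with the variable whose distribution you want plotted first and the grouping variable second, so a hypothetical invocation (after svyset has been declared, and assuming variables named denom and cohort) would be:
svychart denom cohort /* Stacked bars of denom percentages within each category of cohort */
Note that, as written, the program assumes the grouping variable carries a value label with the same name as the variable itself, since that is the label it saves and then reattaches.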
The post and postfile commands
These commands create a new dataset separately from the dataset one is working on. They can create files that are larger than Stata can currently handle in memory. The basic syntax to set up the file to be written to is:
postfile postname newvarlist using filename [, replace]
Once this has been created, results are sent to it as follows:
post postname (exp) (exp) ... (exp)
It is closed as:
postclose postname
An example of postfile used in a Monte Carlo simulation from the Stata course:
program doit
    postfile mysim b lb ub using simres, replace
    forvalues i = 1/1000 {
        drop _all /* construct a sample */
        set obs 25
        gen x = invnormal(uniform())*3 + 2
        ci x /* calculate statistics */
        post mysim (r(mean)) (r(lb)) (r(ub))
    }
    postclose mysim
end
Note that ci returns the mean and the 95 percent confidence interval. This can be changed from the 95 percent default to, say, 90 percent by set level 90. Stata will revert to its default the next time it is run.
The bootstrap
Rather than relying on the Gaussian distribution, one can estimate standard errors by bootstrapping, which resamples from the dataset a large number of times (the larger the number of resamples, the more reliable the bootstrap estimates). Because the resamples are random, in order to have reproducible results it is necessary to set the random number seed (set seed n, where n is some number). Although some commands include a bootstrap option, bootstrap can also be specified as a prefix. See help bootstrap for details. Note that bootstrap will include all observations in memory, so it is necessary to specify a keep or drop statement to restrict the analysis to the valid cases.
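A minimal sketch of the prefix form, with hypothetical variable names; the preliminary drop reflects the warning above about restricting the sample to valid cases:
drop if missing(y, x1, x2) /* bootstrap resamples everything in memory, so keep only valid cases */
set seed 20100817 /* Make the resampling reproducible */
bootstrap, reps(1000): regress y x1 x2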
Weird error messages
From time to time, Stata will generate weird error messages. Here we consider error messages users have experienced.
unrecognized command: _pecats
User-written commands are typically those obtained by typing net search nameofcommand. They are common candidates for weird error messages. The following is an example of an error message received from a perfectly well-written call to prvalue, a very useful command that is part of the estimable J. Scott Long’s SPost suite of postestimation commands. The error message read "unrecognized command: _pecats". This qualifies as a weird error message, no? You couldn’t make this stuff up. It turned out that the prvalue command didn’t work because the user had installed a version of SPost designed for an earlier version of Stata.
If you ever get really weird error messages like this one, it may be because a command is written using commands not supported in your version of Stata, or relies on things stored in matrices, scalars, global macros, or local macros that are no longer stored, or are stored under a different name, in your version of Stata. This should only happen with user-written commands. Naturally, if a command is written for a version of Stata that’s more recent than your own, you’re in trouble, but we keep up to date. Assuming the user-written command you’re having trouble with is written for an earlier version of Stata, the first thing you should do is check that you have the latest version (if the authors keep updating their command) or, as in the case of SPost, where there are multiple versions of the same commands available, the one that supports the most current version of Stata. In either case, you will come to a point where Stata tells you that the version of the files it is trying to install is not the version of the files currently installed. Simply tell it to force an installation at that point.
If that doesn’t do the trick, or your already installed version really was the up-to-date version, you have one more ace up your sleeve: the version command (see p. 3, above). All well-written .do files should have a version command at the top, and the same goes for .ado files. The version command tells Stata to interpret the commands that follow under the rules of the specified version, so that outdated syntax still runs in the present version of Stata (consider the merge command prior to 11.0 and from 11.0 onwards). The user-written file that is giving you problems may, however, not include one. Therefore, you should try the following: write a version command with the number of the version the user-written command was written for immediately prior to executing that command, and then write a version command for the version of Stata you’re using immediately afterward. If you don’t know which version the command was written for, start at the second most recent version of Stata and work your way backwards. This will occasionally cause a user-written command to work when it otherwise would not have.
'.' found where number expected
The context for this error message was generating predictions from a regression model, using the procedures outlined on p. 33 while failing to heed the commandment found on p. 31 ("Thou shalt not name thy scalar after thy variables"). For rabbinical scholars, note that this rule can be deduced in a straightforward fashion (tongue firmly planted in cheek) from the doctrine of kilayim (Lev. 19:19, Deut. 22:9-11, bt Tract. Kilayim).
Index
cd ....................................................... 2
Change directory ................................. See cd
Memory allocation .................................... 1
More message .......................................... 2
Scroll buffer size .................................... 2