Brandeis University
Maurice and Marilyn Cohen Center for Modern Jewish Studies
Using Stata More Effectively
Benjamin Phillips
August 2010
© 2010 Brandeis University
Updated August 17, 2010
Table of Contents
Introduction ......................................................................................................................... 1
Stata 11................................................................................................................................ 1
Setting up Stata ................................................................................................................... 1
Working with directories .................................................................................................... 2
Versions .............................................................................................................................. 3
Running .do files within .do files or the command dialog .................................................. 3
Comments ........................................................................................................................... 3
Breaking long lines ............................................................................................................. 4
Avoiding errors ................................................................................................................... 4
Renaming variables ............................................................................................................. 5
Changing variable order ...................................................................................................... 5
Computing variables with egen ........................................................................................ 5
Macros................................................................................................................................. 5
Looping (foreach, forvalues, and while) .............................................................. 7
Creating sets of dummy variables: the xi command ....................................................... 11
The if and else commands ........................................................................................... 13
Case order variables, sorting, and cross-case functions .................................................... 14
The duplicates command........................................................................................... 17
The list command ......................................................................................................... 17
The by command .............................................................................................................. 17
Data verification................................................................................................................ 18
The in command .............................................................................................................. 18
Predictions from estimation commands ............................................................................ 18
Working with dates and times........................................................................................... 19
Numeric variable types ..................................................................................................... 22
String functions ................................................................................................................. 22
Importing .csv and other text files .................................................................................... 25
Exporting .csv, fixed format, and other text files ............................................................. 26
Merging, appending, and reshaping .................................................................................. 27
Matrices and scalars .......................................................................................................... 30
Running Stata from the command line ............................................................................. 34
Programs ........................................................................................................................... 37
The post and postfile commands ............................................................................ 38
The bootstrap .............................................................................................................. 39
Weird error messages ........................................................................................................ 39
Index ................................................................................................................................. 41
Introduction
This file contains most of the collective wisdom of the Cohen Center regarding the
effective use of Stata. It assumes a good working knowledge of basic Stata procedures
and provides a guide to nonobvious shortcuts and other tricks of the trade. While I am the
author of this document, I’ve incorporated others’ discoveries as well, giving credit in
text to the “discoverers” of new functionalities.
Stata 11
Stata 11 introduces three very useful features: a variables manager, an improved .do file
editor, and the full set of manuals in PDF format. The variables manager is very similar to the variable view in SPSS (PASW, now IBM SPSS Statistics). Besides providing more screen real estate to view variable labels, it also shows which value label (if any) is attached to each variable. The .do file editor now allows collapsing of loops and colors commands, strings, locals, and comments, helping differentiate text. It also numbers lines and gives column numbers.
The on-line manual is available from Help > PDF Documentation. For this to work
properly, though, you need to use Adobe Acrobat or Acrobat Reader as your default PDF
viewer. The reason for this is that it is a set of linked PDF files and third party readers do
not seem able to move from one to the other. If you have a third party PDF viewer as the
default, find a PDF file in My Computer or Windows Explorer, right click on it, select
Open With > Choose Program, click the box next to Always choose the selected program
to open this type of file, choose Acrobat or Acrobat Reader, and then select O.K.
Along with the good points, some syntax has changed. The syntax for merging datasets
has arguably improved and is certainly very different from the previous version (see p.
27ff). If you don’t want to rewrite old syntax, be sure to use the version command (see p.
3).
Setting up Stata
Stata has default settings that some of us do not like. Here is a list of ways to permanently
correct them.
Memory
Stata opens datasets in RAM (random access memory). If you don’t have enough RAM,
you can’t open the dataset. But even if you do have enough RAM, you may not be able to
open the dataset. Stata grabs a chunk of RAM when it is launched for opening and
working with datasets. By default, this is a measly 10MB. To expand this to a more
useful 200MB permanently:
set mem 200m, perm
This can be expanded on a temporary basis to, say, 1GB as follows:
set mem 1g
Note that you’re limited by the RAM in your computer, the amount of memory used by
other applications, and whether you are using a 32- or 64-bit operating system. Basically,
a 32-bit operating system can only keep track of 2^32 memory addresses (4,294,967,296),
roughly corresponding to 4GB. In Windows, 2GB (some of this may be “virtual
memory” stored on the swap file) is allocated to the operating system and each
application receives another 2GB. In practice, the maximum amount of RAM 32-bit
Windows will allocate to Stata in a system with 2GB of RAM (the normal maximum for
32-bit OSs) is somewhere in the 200MB to 250MB range. In 64-bit OSs, the maximum
number of memory addresses that can be tracked is 2^64. In theory, this would include 18,446,744,073,709,551,616 addresses, roughly corresponding to 18EB (exabytes). In
practice, the 64-bit architecture used in most AMD and Intel chips limits addressable
memory to 256TB (terabytes).
More
To turn off Stata’s annoying characteristic of making you click to get the next page of
results, use:
set more off, perm
Scroll Buffer Size
Stata will only display a certain number of past results. In general, it’s better to display more rather than less. The command to use is set scrollbufsize #, where # is the buffer size in bytes, between 10,000 and 2,000,000. The setting is permanent and does not take the , perm option.
Stata must be closed and started again for this to take effect.
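For example, to enlarge the buffer to its maximum:
set scrollbufsize 2000000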
Working with directories
Stata works in a similar fashion to DOS or Unix with directories.
cd "C:\Cohen Center\BRI"
mkdir BRI20
cd BRI20
If you are in the correct directory, you do not need to specify the full file path. Hence,
instead of:
use "C:\Cohen Center\BRI\BRI20\mydata.dta", clear
You can simply specify:
use mydata, clear
The .dta is assumed and need not be specified.
Files in the working directory can be listed:
dir
Stata can also erase files:
erase mydata.dta
This can be useful in situations where it is necessary to create temporary files (there is
another way of doing this, tempfile, but it is most useful when creating commands).
Versions
Stata syntax changes from version to version. Generally, this isn’t a problem, being
limited to relatively obscure areas. Occasionally, though, this impacts analyses, causing
strange error messages to appear. This is easily solved. Stata is smart enough to be able to
translate your commands from an earlier version of Stata to the present version. All this
requires is a statement near the beginning of the .do file that lists the version of Stata the
command was written on:
version 11.1
Be aware, though, that Stata usually changes syntax to facilitate greater functionality.
Stata’s survey commands prior to Stata 9 didn’t allow as many options for defining the
characteristics of a complex survey sample. Consequently, while Stata 8 commands
would still run on later versions (provided the version command was used), they may estimate variances less accurately than if rewritten for version 10.0 or later. The merge
commands also changed between 10.1 and 11.
Running .do files within .do files or the command dialog
.do files can be run inside another .do file or from the command dialog provided one is in
the correct folder (see p. 2):
do mydofile
This was necessary in Stata 10 and before when there was a maximum number of lines
for a .do file in the .do file editor. This is no longer the case in Stata 11, but this
functionality may still be of use if there are modular segments of identical code that need
to be run at multiple points in a file.
While I’m well aware of the fact that many PCs run Stata too slowly to rerun the entire
.do file as needed, this problem will be eventually addressed by Moore’s Law or (when I
win the lottery) the Jodi and Benjamin Phillips Fund for Ridiculous Computing
Initiatives. When it is, running the entire file is good practice because it avoids the
common problem of having the .do file blow up at a certain point because we have been
tinkering with the file and running it piecemeal.
Comments
A well-written .do file will have considerable commentary outlining what is being done,
how it is being achieved, and why this is necessary. There are two types of comments,
those that constitute a line in themselves or those that can be written in the middle of a
command. To write a comment on a line, it simply needs to be prefaced with an asterisk.
You can add more asterisks and finish with an asterisk or not, depending on your preferences. It doesn’t matter, as anything on the line after the initial asterisk is disregarded. As soon as you type a carriage return, though, the next line will be considered part of the program unless it, too, is preceded by an asterisk. (Note that you can put spaces and tabs before the first asterisk, allowing one to create bullet-point lists of comments.) In some cases, it might be useful to make comments within a command. Stata will stop paying attention as soon as it reaches /*. It will not pay attention again until it reaches */. Anything in between will be ignored, even if it stretches across multiple lines with many carriage returns. Equally, such a comment can appear in the middle of a command without disrupting the command itself.
* Here is a comment that must go on one line
/* Here is a comment
that covers several lines
now it is over */
tab vara varb /* Comment at the end of the line */
tab /* comments */ vara /* in the middle */ varb /* are
confusing but syntactically acceptable */, col
Breaking long lines
Stata will accept very long lines of code. Unfortunately, this means that the entire line
won’t be visible at once in the text editor and will break up in an ugly fashion in the
display window and log files. The simplest way to break a line is ///, which tells Stata to
ignore the carriage return (which normally tells Stata that the command—whatever it is—
is finished and should be executed). You can also use the comment indicator:
reg vary varx1 varx2 varx3 varx4 varx5 varx6 varx7 /*
*/ varx8 varx9
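For comparison, the same command broken with /// instead of the comment indicator:
reg vary varx1 varx2 varx3 varx4 varx5 varx6 varx7 ///
    varx8 varx9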
An alternative (which I’m not fond of) is to use the #delim command, assigning a
semicolon as the end-of-command marker (note that periods can’t be used), e.g.,
#delim ;
reg vary varx1 varx2 varx3 varx4 varx5 varx6 varx7 varx8
varx9 ;
#delim CR
The last statement returns the delimiter to the default carriage return. The only options
are the semicolon or the carriage return.
Avoiding errors
While the fact that Stata crashes as soon as it hits an error may be useful, there are times
when what Stata regards as an error and what we would regard as an error diverge. Let’s
say we’ve been working with a file that defines some value labels and we switch to
another dataset which creates value labels of the same name. This will bring Stata to a
crashing halt. We could specify label drop mylabel, but that is (a) a pain in the neck
and (b) will itself cause the .do file to stop with an error if the label has not yet been defined. This
can be avoided by using the capture prefix. Hence:
capture label define mylabel 0 "No" 1 "Yes"
“Capture” refers to Stata “capturing” the error message.
Renaming variables
At times it is necessary to rename variables; this is simply done with rename. If you wish to rename a set of variables sharing a prefix (for instance, changing w09* to w1*), you can use the renpfix command.
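For instance (q23 and income here are made-up names):
rename q23 income
renpfix w09 w1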
Changing variable order
Stata can change the order in which the variables appear in a file. The order command
sends the variables one specifies, in the order one specifies, to the front of the dataset. Any
variables not included in the varlist of an order command appear in their original order
immediately after the last specified variable in the varlist. Thanks to Michelle for finding
this command.
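A minimal sketch (id, age, and sex are hypothetical variable names):
order id age sex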
Computing variables with egen
Stata’s generate (usually shortened to gen) only handles simple mathematical
operations like addition, subtraction, multiplication, division, exponentiation, and
logarithms. While you can do a lot with these, there’s an additional command called
egen that offers functions that work across multiple cases or multiple variables. These
include calculating means, medians, summing (called total, not sum, for reasons I don’t
understand), minimums, maximums, and so on. Before leaping in, though, be aware that
the default mode for egen is operations across cases within a single variable. Thus egen
xbar=mean(x) will create a new variable (xbar, i.e., 𝑥̅) containing the mean of the variable x, identical for every case. The within-case sum of a group of variables x1 x2 x3, by contrast, is egen sumx=rowtotal(x1 x2 x3), which could be
simplified to egen sumx=rowtotal(x1-x3) if the variables were located next to one
another in the dataset.
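To illustrate the distinction (x and x1-x3 are hypothetical variables):
egen xbar=mean(x)            /* mean of x across cases; same value in every case */
egen sumx=rowtotal(x1-x3)    /* within-case sum of x1, x2, and x3 */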
Macros
Stata has a macro function that can record arbitrary strings of characters. This can be
useful for situations where one wants to have blocks of text that can be easily substituted
in instead of having to be retyped or copied. The most useful form of Stata macro for our
purposes is a local macro, which must be defined within your .do file. We typically
have sets of interrelated dummy variables. Defining these as a macro would make
specifying models easier.
local denom "rereform conserve orthodox other"
svy: ologit potrprelgpilg prtrpexprelgpilg landed15 ///
    kdmitzvot prmitzvot `denom', or
Macros can also be useful for complicated statements and so on. Note that the indexes of foreach and forvalues loops (see next section) are themselves local macros. Stata will overwrite a previously defined local macro that shares a name with a loop index, so use different names.
Macros can also be expressed as:
local macroname = macrocontents
It is recommended, however, that you stick to the form displayed above:
local macroname macrocontents
This executes faster.
However, if you were to have a mathematical function as part of the macro, the equal
sign would be necessary. Hence a program that counts to two and displays it on the
screen:
local y 1
display `y'
local y = `y' + 1
display `y'
After being defined, local macros are referred to as `x' (assuming x is the name of the macro). Note very carefully that the left-hand apostrophe (`) is on the top left key of your keyboard, under the tilde (“~”), immediately to the left of the key for 1. The right apostrophe (') is the one under the regular quotation mark, immediately left of the Enter key and right of the key for the colon and semicolon.
Advanced macro use
When running .do files from the command line (p. 3) or programs (p. 37), arguments
after do myfile get entered as macros `1' `2' etc. These can then be referred to in
the .do file itself. For this trivial.do file:
tab `1' `2'
Thus:
do trivial vara varb
is equivalent to:
tab vara varb
Obviously, this isn’t the sort of thing we would want to use on an everyday basis, but it
could be helpful in certain complicated programming situations.
Globals
Locals are only one kind of macro. There are also global macros, which are ever-present. While one can define new global macros, this is not recommended. One neat
global macro is $S_DATE, which contains the current date. Thus, to save a file with
today’s date:
save "myfile $S_DATE.dta", replace
Take care with this, though. The sequence is very specific: “ dd Mmm yyyy”. There is a
leading space and, in addition, if dd < 10, there will be another space in place of the first
d. Month, of course, is the first three letters with the first being capitalized. Thus:
June 8, 2008 becomes “  8 Jun 2008”
May 22, 1975 becomes “ 22 May 1975”
I haven’t tried years < 1000 or > 9999 but as the date is drawn from your system clock, it
is unlikely that you will have this problem. (If you’re reading this in 10000 CE, you’re
probably up to speed on this, given the Y10K bug.)
Looping (foreach, forvalues, and while)
Stata supports looping and makes it very easy. There are three primary kinds of loops.
foreach loops through strings of text, forvalues loops through numbers, and while loops as long as a condition holds. There are several simple rules to remember. First, after writing the specifications
of the loop, you have to put a left-hand brace “{” at the end of the line (i.e. immediately
before the carriage return). It is good practice to then indent the lines of code that run
within the loop (though the loop will run fine if you don’t indent). Second, the loop is
closed when it reaches a right-hand brace “}” on a line by itself. I like to keep this at the
same level of indentation of the rest of the loop, but others may put the right-hand brace
unindented. Third, you need an index for the loop. In the examples below I use x for
text strings and n for numbers, but these can be any letters (and more than one letter) you
find convenient. They can even be the same name as variables, but it is probably best to
avoid the confusion this may cause. The index is declared at the beginning of the loop.
The index is a local macro, so be sure not to call your index the name of a macro you will
be calling (or will call at a later point).
Here is a loop over text strings:
foreach x in shabcan shabmeal mitzvot {
    svy: ologit po`x' pr`x' landed, or
}
And here is a loop over values:
forvalues n=1/4 {
    svy, subpop(if region==`n'): mean age
}
Note that Stata differentiates between mathematical equalities (=) and logical equalities
(==). Here the equality in the forvalues statement is mathematical while the equality of
the if qualifier is logical. Stata will throw error messages if you confuse one with the
other.
If you want to loop over nonconsecutive, unevenly spaced numbers like 1, 3, 5, 6, and 9,
you would enter these into foreach, as in “foreach n in 1 3 5 6 9”. To loop over
evenly spaced numbers forvalues should be specified as forvalues n=2(2)10,
which would yield the sequence 2 4 6 8 10.
One can run loops within loops:
foreach x in shabcan shabmeal mitzvot {
    forvalues n=1/5 {
        svy, subpop(if denom==`n'): ologit po`x' ///
            pr`x' landed, or
    }
}
One small issue with running large loops or sets of loops, particularly for analysis
commands, is that it can be difficult to keep track of what each piece of output represents.
This can be solved by getting Stata to specify which variable is being run under which
conditions using the display command. The “as txt” option, discovered by Michelle,
ensures it displays nicely. You can also precede variable output with “as output” to
conform to Stata’s usual scheme and _newline to force new lines. Here is the previous
example:
foreach x in shabcan shabmeal mitzvot {
    forvalues n=1/5 {
        display _newline as result ///
            "`x'" as text " if denom==" as result `n'
        svy, subpop(if denom==`n'): ologit po`x' ///
            pr`x' landed, or
    }
}
For shabmeal and denom=3 this would display:
.
. shabmeal if denom==3
Of course, loops can also be very helpful in data manipulation, not just analysis. Here we
Z-score a group of variables (Stata has a user-written command called zscore that will
do this, but we’ll ignore it for the present):
foreach x in busguide busgroup busmifgash buslearn {
    quietly summarize `x'
    gen z`x'=(`x'-r(mean))/r(sd)
}
An excursus on silence and system variables
What on earth is quietly summarize and r(mean) and r(sd)? First, quietly tells
Stata to suppress output. Generally, you don’t want to do this, but it minimizes clutter in
instances where you want to run a command but don’t need the output. A block of
commands can be set to quietly, much as one would do a loop:
quietly {
command
command
}
Within this loop, one could always specify quietly’s counterpart, noisily (who says
computer programmers don’t have a sense of humor?), for a given command to see its
output.
Second, summarize is an analysis command that reports the number of valid
observations, mean, standard deviation, minimum, and maximum. Almost all Stata
analysis commands store some information in a matrix. An OLS regression will store R²,
the coefficients, and so on (type return list and ereturn list to see details).
These are removed when the next analysis command is run. (See help return for
details.) As it happens, summarize stores the mean and the standard deviation. From
there, we simply plug these pieces into the formula for a z-score:
z = (x − x̄) / s, where x̄ is the mean of x and s is its standard deviation.
Looping using while
An alternative means of looping through values is while. In this instance, the index serves as a counter and the loop continues as long as the logical condition holds. Note that this can lead to loops of infinite length if the logical condition is not
set properly. Here is a loop to assign a value for the last cohort a given case is associated
with using forvalues:
gen lastround=.
forvalues n=1/18 {
    replace lastround=`n' if round`n'==1&qualified`n'==1
}
Here it is using while:
local i 0
while (`i++') < 18 {
    replace lastround=`i' if round`i'==1&qualified`i'==1
}
We create the local macro i with an initial value of 0. The logical statement “while (`i++' < 18)” can be understood as follows. Each time the condition is evaluated, `i++' expands to the current value of i and then increments i by one (`++i' increments first and then expands, while `--i' and `i--' decrement). Once the expanded value reaches 18, the condition fails and the loop ends, having run the replace for i = 1 through 18.
Note that we don’t have to combine the increment and the logical statement as we did
above. This could be specified as:
local i 0
while `i' < 18 {
    local ++i
    replace lastround=`i' if round`i'==1&qualified`i'==1
}
We could also forgo the ++ operator and write the increment line as “local i = `i' + 1”. This, however, would be slower than ++i, according to the manual.
It is not clear from Stata’s documentation whether forvalues or while with ++i or --i is faster. I would guess forvalues has a slight advantage, as while needs one extra evaluation of the condition (in the above example) to discover that it is finished, while forvalues knows in advance that it needs to loop 18 and only 18 times. In any case, as forvalues is easier to understand, it probably makes sense to stick with forvalues.
Finally, here’s a program I wrote to calculate the average interitem correlation of a lower
triangular matrix (note this contains some features I haven’t discussed yet):
capture program drop interitem
program interitem
    version 10.1
    syntax varlist(min=2 numeric)
    corr `varlist'
    matrix corr = r(C)
    local nargs : word count `varlist'
    foreach x in sum cell n mean {
        matrix `x' = 0
    }
    local a 0
    local c 0
    while (`a++') < `nargs' {
        local b `a'
        while (`b++') < `nargs' {
            matrix cell[1,1] = corr[`b',`a']
            matrix cell[1,1] = abs(cell[1,1])
            matrix sum[1,1] = sum[1,1] + cell[1,1]
            local c = `c' + 1
            matrix n[1,1] = `c'
        }
    }
    matrix mean[1,1] = sum[1,1]/n[1,1]
    matlist mean
end
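A hypothetical call, applying the program to the bus scale items used earlier:
interitem busguide busgroup busmifgash buslearn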
Conditional breaks
Because I am only interested in finding the last cohort the loop finds, iterating through all
18 possibilities for every case is wasteful of computer resources. A better alternative
would be to start looking at the last cohort and work backwards. This requires stopping
when I find the first cohort a case is associated with. Accordingly, I use an if statement
to conditionally end the loop for a given case:
gen lastround=.
local i 19
while (`--i' > 0) {
    replace lastround=`i' if round`i'==1&qualified`i'==1
    if (lastround != .) exit
}
That is to say that if lastround no longer has a missing value, the loop for that case is
over, and it should proceed to the next case until all cases are complete. In my case, going
forwards through all 18 possibilities took .64 seconds while going backwards and
stopping at the first hit took .58 seconds, so there was a small benefit. (I got the timing by
set rmsg on.) Benefits will be greater for very large loops, very large datasets, or very
slow computers.
Alternately, I could add in a conditional break to a decrementing forvalues loop to
achieve the same effect as the while loop:
forvalues n=18(-1)1 {
    replace lastround=`n' if round`n'==1&qualified`n'==1
    if (lastround != .) exit
}
Creating sets of dummy variables: the xi command
Creating a set of dummy variables is a common operation in data analysis. Unfortunately,
it is an annoying chore and one that goes wrong occasionally. Michelle has found a better
alternative in the xi command. Using this, instead of laboriously coding:
recode denom (1=1)(2/7=0), gen(orthodox)
recode denom (2=1)(1 3/7=0), gen(conserv)
recode denom (3 4=1)(1 2 5/7=0), gen(rereform)
recode denom (5 6=1)(1/4 7=0), gen(justjew)
recode denom (7=1)(1/6=0), gen(otherjew)
One could simply code:
recode denom (1=1)(2=2)(3 4=3)(5 6=4)(7=5), gen(newdenom)
xi i.newdenom, noomit
The noomit statement just means that one variable will be created for each category,
compared to the default state where the category with the lowest value (here, Orthodox)
is omitted. Of course, some labor is still required if you’re going to have a clue what
these variables mean:
rename _Inewdenom_1 orthodox
rename _Inewdenom_2 conserv
rename _Inewdenom_3 rereform
rename _Inewdenom_4 justjew
rename _Inewdenom_5 otherjew
This could be speeded up, too, using loops:
local i=0
foreach x in orthodox conserv rereform justjew otherjew {
    local i=`i'+1
    rename _Inewdenom_`i' `x'
}
xi can be used to create more complicated variables, too. See documentation in the help
file.
Using xi in estimation
xi can be used in estimation commands. For instance, the following command:
reg y x conserv rereform justjew otherjew
could be recast as:
xi: reg y x i.newdenom
Doing this essentially creates temporary dummy variables that are used in the analysis and then immediately dropped. The names of these temporary variables follow the logic of
variable creation. You could specify noomit after xi, but that will cause problems
because a set of dummy variables needs to have one category excluded.
This sounds great, but it’s usually more trouble than it’s worth. For one thing, you don’t
get to choose the omitted category. You could work around this, perhaps by recoding denomination so Conservative=1 and Orthodox=2, but that removes some of the labor
saving aspect. Perhaps more problematically, you (yes, you!) will have to remember
precisely what _Isomevariable1 actually represents and type out _Isomevariable1 (and 2
and 3 and so on) into postestimation commands. In most cases, you’re better off creating
new variables and giving them meaningful names.
The if and else commands
These commands look superficially similar to the SPSS do if and else if commands.
Unfortunately, where SPSS applies these case by case, so they can be used to branch to
account for, say, skip patterns, Stata treats all cases alike. Here is a sample of SPSS
syntax:
do if pocomplete=1.
+  compute dadjew=podadjew.
else if prcomplete=1.
+  compute dadjew=prdadjew.
end if.
What we would like to be able to do in Stata is as follows:
gen dadjew=.
if pocomplete==1 {
replace dadjew=podadjew
}
else {
replace dadjew=prdadjew
}
Note that else doesn’t take conditions. What would happen, though, is that if the first
case had completed the post-trip survey, then everyone would have dadjew=podadjew;
if the first case had not completed the post-trip survey, every case would have
dadjew=prdadjew. We could tell Stata to do this for every case:
gen dadjew=.
local n = _N
forvalues i = 1/`n' {
    if pocomplete[`i']==1 {
        replace dadjew=podadjew[`i'] in `i'
    }
    else {
        replace dadjew=prdadjew[`i'] in `i'
    }
}
However, it would be a lot easier to simply do:
gen dadjew=podadjew if pocomplete==1
replace dadjew=prdadjew if prcomplete==1
Or better yet:
gen dadjew=.
foreach x in po pr {
    replace dadjew=`x'dadjew if `x'complete==1
}
Either of the latter two options would also run faster, because Stata executes this on the
entire dataset at once, not case by case.
Enthusiastic as I am about Stata, this is not a very useful command for most instances and
is aimed at people writing new commands. It would be great if there was a parallel to the
SPSS commands, but as far as I know there isn’t.
Case order variables, sorting, and cross-case functions
SPSS has $casenum which is a system variable that contains a unique positive integer
for each case from 1 to n. This can be used to save the original order of cases prior to
sorting. Stata has a similar system variable: _n. Hence, the original order of cases can be
saved to a variable as follows:
gen sortorder=_n
An excursus on sorting
One might think that when a sort command is issued, Stata will keep the relative order of
cases within each sort category. Thus, if we sorted for sex, we would expect case 1 to
remain ahead of case 3 among men and case 2 to remain ahead of case 4 among women.
Not so! When sorting, Stata randomizes the order of cases within a given sort category.
In general, this should cause no difficulty. If, however, there is a tacit assumption that the
order within each sorting category is retained, there will be problems (I’ve spent days
sorting out the messes this has created in sampling). This can be solved by saving the
original order as above and then sort sex sortorder. If you are setting up a
stratified random sample and require reproducibility, this can be solved by setting the
seed of the random number generator ahead of the sort (e.g., set seed 1000). When
one needs to sort in descending order, the sort command will not work; instead it is
necessary to use gsort; the syntax is gsort -sex +age.
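Putting those pieces together, a sketch of a tie-stable sort:
gen sortorder=_n       /* preserve the original case order */
sort sex sortorder     /* ties within sex are broken by the original order */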
Lags and leads
Unlike SPSS, _n can also be used for lags and leads (cross-case comparisons within a
single variable). Here, _n is appended to a variable inside brackets (e.g., []) to indicate a
particular case. Hence, sex[3] refers to the sex of the third case, while sex[_n] refers
to the sex of the nth case. SPSS has a function called lag that can be computed for the
same ends. For instance, a variable identifying duplicate cases (though see the
duplicates command below) could be constructed as:
sort briusaid_1
gen dupe=0
replace dupe=1 if briusaid_1[_n]==briusaid_1[_n-1]
The lag or lead can be shifted backward or forward an arbitrary number of places by substituting, for example, -2 or +1 for the -1 in the above example. Note that the
[_n] on the left hand side of the logical equality is unnecessary. I include it for the sake
of clarity.
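For example, a one-case lead (nextsex is a made-up variable name):
gen nextsex=sex[_n+1]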
If we want to refer consistently to the nth case of the dataset, we put that case’s row
number in as:
gen newvar = oldvar[1]
(_N always refers to the last row in the dataset and thus also equals the number of cases in the dataset.)
These suffixes can be combined. For instance, we could reverse the values of oldvar as
follows:
gen newvar = oldvar[_N-_n+1]
(And, no, I didn’t think that up myself.)
One can substitute in a variable name and Stata will refer to the row number designated
by the value of that variable. Let’s say we have a dataset with parents and children as
individual cases and ID variables for each child holding the row number of each parent. (I will assume there is a variable called sortorder that keeps the cases in the correct order for these operations.) To add, say, each parent’s denomination as a variable on the child’s record, we could do as follows:
sort sortorder
foreach x in mom dad {
    gen `x'denom=denom[_`x'id]
}
An extended example from a Stata lecture follows.
By combining _n and _N with explicit indexing, we can produce truly amazing results.
(Note the version command at the top of the file. This is needed for Stata 11 and later
because this file uses the merge syntax of Stata 10 and earlier.) For instance, let’s assume we have a dataset that contains:
personid    six-digit id number of person
age         current age
sex         sex (1=male, 2=female)
weight      weight (lbs.)
fatherid    six-digit id number of father (if in data)
motherid    six-digit id number of mother (if in data)
version 10
capture log close
log using crrel2, replace
use relation, clear
sort personid
by personid: assert _N==1
/* see Exercise 9 */
gen obsno = _n
keep personid obsno
rename personid id
save mapping, replace
use relation, clear
gen id = fatherid
sort id
merge id using mapping
keep if _merge==1 | _merge==3
rename obsno f_n
label var f_n "Father's obs. # when sorted"
drop _merge id
gen id = motherid
sort id
merge id using mapping
keep if _merge==1 | _merge==3
rename obsno m_n
label var m_n "Mother's obs. # when sorted"
drop _merge id
sort personid
save rel2, replace
erase mapping.dta
log close
exit
Then, when I wanted, say, the father’s age:
sort personid   /* if not already */
gen fage = age[f_n]
and, if I wanted the mother's weight:
gen mweight = weight[m_n]
The duplicates command
Charles correctly points out that my first example in the case order variable section
reinvents the wheel. Stata has a built-in command called duplicates that handles just
about anything one would like to do regarding duplicate cases. It can report all
duplicates—cases with identical values for the variables specified in varlist, report only
one example for each group of duplicates, create a new variable identifying duplicate
observations, delete duplicates (though caution is advised whenever using powerful
commands that don’t leave a record of what they dropped), and control how the duplicate report tables are displayed. See help duplicates for details.
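A few illustrative calls, using the BRI ID variable from the lag example above (a sketch only):
duplicates report briusaid_1              /* table of duplicate counts */
duplicates examples briusaid_1            /* one example per group of duplicates */
duplicates tag briusaid_1, gen(dupe)      /* flag duplicate observations */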
The list command
It’s often helpful to look at some actual data to aid debugging. One way of going about
this is to use the data browser. However, the variables one wants to compare are often far
apart. A neat alternative is to use the list command, which will list onscreen (record in
a log file if you expect a lot of values). Here is a potential sequence of commands for
finding and checking dupes in a BRI file (but see the duplicates command, above).
log using dupecheck, replace text
sort briusaid_1
list briusaid_1 idmain idpanelmain if ///
    briusaid_1[_n]==briusaid_1[_n-1]| ///
    briusaid_1[_n]==briusaid_1[_n+1], clean noobs
log close
Note the use of the if qualifier to limit the number of cases displayed and the use of forward and backward lags to ensure that both dupes are shown. clean and noobs respectively get rid of the frames around the items displayed and suppress observation numbers.
list can also be used to quickly list answers for all items for a given respondent:
list if token=="abc1234"
The by command
The by command in Stata is extremely helpful. It produces the same result as forming separate datasets for each unique set of values of varlist and running stata_cmd on each dataset separately. However, the data must first be sorted by the varlist used. This can be
used for analysis:
sort denomination
by denomination: tab poshabcan prshabcan, col
If _n and _N are used with a by command, they refer to observations within each by group.
Here by is used for data manipulation, creating bus averages for the bus guide scale:
sort groupname
by groupname: egen totbusguide=total(busguide)
by groupname: gen mnbusguide=totbusguide/_N
The only thing to watch out for is that this will divide the sum of the values of busguide
within a bus by the total number of people on that bus, which will be problematic if we
don’t have a response from each person. Of course, it is easier to simply do:
sort groupname
by groupname: egen mnbusguide=mean(busguide)
Data verification
Stata has a command called assert. This is followed by a logical expression. If the
logical expression is contradicted, the program will throw an error message. Hence,
looking for out of range values for an opinion question:
assert prtripfree>=1&prtripfree<=4
If assert throws an error, we can find the offending cases in the following fashion:
list briusaid prtripfree if !(prtripfree>=1&prtripfree<=4)
The exclamation mark specifies a logical not. The logical statement above is equivalent
to (prtripfree<1|prtripfree>4) but the former is preferable as it is less likely to
be mangled by human error, e.g., (prtripfree<=1|prtripfree>=4).
The in command
Stata can refer to lines in the dataset. Here is some syntax from Michelle that adds a line
to the dataset (originally with 41,457 cases) and then assigns values to variables in that
line:
set obs 41458
replace fedzip = "H9" in 41458
replace fedcode = 11 in 41458
If you know the case number, you can also use this with the list command.
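For example, to check the line just added:
list fedzip fedcode in 41458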
Predictions from estimation commands
Stata can generate predictions from estimation commands, most commonly regression
models. This can be approached several different ways. First, you can generate the
predicted value of the dependent variable for every case included in the regression by
using the predict command after the regression. This creates a new variable:
reg vary varx1 varx2
predict predvary
More often, we want to generate predictions for ideal types. For instance, we may regress
posttrip attendance at Hillel activities on pretrip attendance at Hillel activities and being invited to posttrip Hillel activities. We might want to estimate the expected values of posttrip attendance at each pretrip frequency of attendance, holding invitations constant. Stata’s built-in command for this is adjust. This, however, doesn’t
work well in the context of regressions for limited dependent variables. In this case, an
ordinal logistic regression, it will return E(y*|x). In logistic regression, y* is generally
unintelligible, and all the more so for ordinal logit. Instead we want Pr(y=1|x), Pr(y=2|x),
Pr(y=3|x), Pr(y=4|x), and Pr(y=5|x). The best command for this purpose is a user-written
command, prvalue. (If you do not have this on your computer, type net search
spost.) We could then specify following the regression:
forvalues n=1/5 {
    prvalue, x(practhill=`n')
}
This will return the estimates for each value of practhill. The value of poinvhill will be
set to the mean (spost is smart enough to calculate this for all variables not specified).
This sounds good, but there’s a catch. The mean value substituted for poinvhill is
unweighted, when our data is in fact weighted. We really want to use the weighted mean.
Here’s how we accomplish it:
svy: mean poinvhill practhill
foreach x in poinvhill practhill {
    local mean_`x' = _b[`x']
}
svy: ologit poacthill poinvhill practhill, or
forvalues n = 1/5 {
    prvalue, x(practhill=`n' poinvhill=`mean_poinvhill')
}
But what is _b[poinvhill] that I save as local mean_poinvhill? In the section on
looping, I used r(mean) and r(sd), which were statistics retained by Stata after a command (there, summarize). Stata saves the coefficients of estimation
commands as _b[varname]. For svy: mean, these are the means of the variables.
Because I wish to use these after another estimation command, I have to store them
somewhere else, because _b will be overwritten. I use a local macro for this purpose.
There is one drawback with this approach: local macros only last for a single run of the
.do file. As soon as the .do file has run, Stata forgets the local macros. This means
that after making some changes, one has to rerun svy: mean, convert them to local
macros, rerun the estimation command, and then run prvalue. I have come up with an
ugly kludge that gets around this, which is discussed on p. 33, as it involves matrices and
scalars.
Working with dates and times
Thanks go to Graham for revising and expanding this section based on his (painful)
experience working with times and dates.
Turning dates and times into numeric data
The clock() function converts string variables containing date and time information
into numeric format. This is specified as:
gen double newdate = clock(olddate, "MDYhm")
The double specifies that the variable created by the gen command is large enough to
hold the (generally extremely large) date value that Stata creates (see Numerical variable
types, p. 22, for a discussion of this issue). If the command specified was merely gen
newdate = … then the variable created would not be able to hold all the data and would
lose precision.
The last argument specifies the order of the month, day, etc. If the string variable holding the date lists these in a different order, then you can shift around the "MDYhm" letters to compensate and Stata will understand. For example, if the date is stored as “year-month-day-hour-minute” then you would specify "YMDhm".
The numeric variable created by this command is the number of milliseconds from 12
midnight on January 1, 1960 (this is why you need to store it in a double variable rather
than the default float type…it’s really big). To change the display from a numeric to a
time and date format, it is necessary to change the variable’s display format, of
which there are various types (see help format):
format newdate %tc
%tc ignores leap seconds. If you do want to track leap seconds (and, heaven knows, we
all do), the correct format is %tC.
Oftentimes, however, that level of precision is unnecessary and date-level information
would be sufficient. We can read in a string stored as DMY as follows:
gen int newdate = date(olddate, "DMY")
The int specifies this variable is to be stored as an integer, which is generally the most
efficient way to store integers. Similarly to clock variables, Stata counts days from
January 1, 1960, except the basic unit is the day, not the millisecond. To get Stata to
display this as a date, not a number, format it:
format newdate %td
At times, a time format will be more precise than one wants and can be converted to a
date format (here, I overwrite the existing variable):
replace submitdate_w2=dofc(submitdate_w2)
There is the opposite function, cofd, which turns date variables into time variables,
which is of limited use as the time of the day is always set to midnight (to the
millisecond).
Extracting days, months, years, and more from date variables
Stata can extract information from %td variables like the year, month, day (of month),
day (of year), and day (of week). These are simple functions of generate, where d is the
%td variable: year(d), month(d), day(d) (day of month), doy(d) (day of year),
dow(d) (day of week), week(d) (week of year), halfyear(d) (half of year), and
quarter(d) (quarter of year). For example, we extract year of birth from a birthdate
variable:
gen yrborn=year(birthdate)
Recoding date variables
While putting the date variable in a format Stata can recognize is all well and good, you
may wish to recode this numeric variable to reflect a number of days, hours or minutes,
as opposed to milliseconds. In order to do this, you must divide it by some set of numbers
depending on what time unit you want it to display. For example, dividing the variable by (1000*60*60*24)—the number of milliseconds in a second times the number of seconds in a minute times the number of minutes in an hour times the number of hours in a day—converts milliseconds into days. This is often useful when, for example, you are
attempting to create binary date variables that select only cases with a date of a specific
day or year.
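A minimal sketch (eventtime is a hypothetical %tc variable):
gen double eventdays = eventtime/(1000*60*60*24)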
Turning separate numeric day, month, year variables into age
The following code transforms variables day, month, and year into an age variable, when
age is expressed in the standard Western form, i.e., ⌊𝑎𝑔𝑒⌋ or floor(age).
gen dobday=string(day,"%2.0f")
replace dobday="0"+dobday if length(dobday)==1
gen dobmonth=string(month,"%2.0f")
replace dobmonth="0"+dobmonth if length(dobmonth)==1
recode year (1962/1977=1978)(1998=1992)
gen dobyear=string(year,"%4.0f")
gen age=floor((date("1jan2010","DMY")- ///
date((dobday+"-"+dobmonth+"-"+dobyear), "DMY"))/365.25)
Working with dates as strings
In some (perhaps most) cases, despite all of Stata’s features, it may be easier to simply
keep a string variable that codes a date as a string, and work with it that way. Often, if the
date is stored in a standardized way, you can use string functions (see p. 22ff for more) to
extract the pertinent time unit (days, hours, years) from the string variable and work with
these variables instead. For example, if the string date variable stores its values as “2008-02-15 11:32:15.20000” and all you want to know is the year, month, and day, you can use
the substring function described below to extract only that data, by selecting the first 10
characters of the string only:
gen newdate=substr(date,1,10)
Once this command is run the resulting date variable will look like this: “2008-02-15”
and can be dealt with much more easily.
Numeric variable types
Stata stores numeric variables with varying degrees of precision. The numeric variable
type that occupies the least memory (both in terms of storage on a hard disk or other
semipermanent medium and in RAM when the file is active) is a byte, an integer type well suited to 0/1 dummy variables. At the other end of the spectrum, a double (IEEE 754-1985 double precision floating-point binary format) occupies the most space and can hold up to approximately 16 decimal digits. When you convert a file to Stata using
Stat/Transfer 10, it automatically chooses the type with the smallest memory footprint for
each variable that does not lose any precision. If you are using a file from another source
or simply want to check, use the command compress and Stata will automatically choose the type with the smallest memory footprint that does not lose any precision. By
default, Stata stores variables in the float type. This has the smallest memory footprint
for noninteger variables. You can always specify the kind of variable you want stored.
While it might be good practice to specify byte for dummy variables (e.g., gen byte female=1 if sex==1), failing to do so will at worst chew up storage space and slow the system down very slightly. On the other hand, when generating a random variable for sorting
order operations where there are a large number of cases, specifying double is a must
because the less precision one has, the greater the odds of having cases with precisely the
same value which will be randomly ordered each time a sort is run. In a dataset with
approximately 90,000 cases, it turned out that even double was not sufficiently precise
and additional precautions had to be taken to maintain the precise sort order.
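For the random sort key scenario, the generate statement might look like this (randkey is a made-up name; runiform() assumes Stata 10.1 or later, use uniform() in earlier versions):
gen double randkey = runiform()
sort randkey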
String functions
Particularly when merging ID variables, string manipulation rears its ugly head. (A string
is a variable containing characters that may include letters and symbols.) Things to
remember include that missing is rendered as "" (i.e., an empty string) and the possibility of leading and/or trailing blanks (i.e., " ").
Transforming string variables to numeric
There are at least three different kinds of string to numeric operations. (1) Transform a
string variable containing words into a numeric variable with labels similar to those of the
original variable. (2) Transform a variable containing words into a new variable with
labels dissimilar to those of the original variable. (3) Transform a numeric string variable
(i.e., no characters besides numerals, decimal points, and commas) into a numeric format.
In IBM SPSS, Operations 1 and 2 could be accomplished using the recode command. Not
so in Stata.
An example of Operation 1 is transforming a string (e.g., name of university) into a
numeric variable. This is handled by a command, encode, that automatically generates a
new numeric variable with a user-specified name. (You cannot directly overwrite the
variable.) Unless you tell Stata to do otherwise, it will create its own value labels. It’s
often the case, however, that the value labels are messy and the values are not in the order
you like. Thus, typical usage will often be to run encode, drop the old variable, recode
the new variable (perhaps to the old variable’s name), and assign new value labels.
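A sketch of that workflow (university and school are made-up names):
encode university, gen(school)
drop university
rename school university
* then recode the new variable and assign value labels as needed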
An example of Operation 2 is transforming a registration variable where gender is
encoded as "False" for males and "True" for females (this is a product of the way SQL treats 0,1 variables). We would typically like this to be either gender (1=male,
2=female) or female (0=no, 1=yes). There is no elegant way I know how to do this. Brute
force works best with two category variables like this.
gen gender2=1 if gender=="False"
replace gender2=2 if gender=="True"
drop gender
rename gender2 gender
In Operation 3, the variable, though encoded as a string, is already in numeric format.
Here we simply use destring. This command includes options to either overwrite
(replace) the original variable or generate a new variable (you must choose one of
these options). Alternately, one could use the real function of the generate command,
which bases the numeric format of the resulting variable on the way it is displayed in the
original string variable. Changing format will require using the format command.
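For example (idstr is a hypothetical string variable holding numbers):
destring idstr, gen(idnum)    /* or: destring idstr, replace */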
Transforming numeric variables to strings
As with the alternate operation, there are several scenarios. (4) Transform a numeric
variable with value labels to a string variable where the string is identical to the value
labels. (5) Transform a numeric variable into a string variable.
For Operation 4, one uses encode’s counterpart, decode.
For Operation 5, there are multiple alternatives. In most cases, the best option is probably
destring’s counterpart, tostring. Like destring, we can optionally replace the
existing variable. You also have the option to specify a display format for the string. If
the numeric to string transformation is part of a larger function, you can use the string
function of the generate command. This takes a variable format specification, allowing
one to control the formatting of the way in which the variable is displayed:
gen agestr=string(age,"%2.0f")
Here I want to display in %2.0f format, like 24, 26, etc. If anyone in my sample was
aged more than 99, I would have to use %3.0f format. If I wanted age in decimals, I
would need to specify as %4.1f or similar.
Searching for substrings
It is sometimes necessary to search for substrings (i.e., strings of characters inside a
larger string) as, for instance, in looking for test cases in a dataset where people gave names
with some variant of “test” in them (e.g., “Ben-test”). This can be accomplished using the
regular expression function:
list if regexm(lower(namefirst),"test")==1
List is used rather than drop because this would catch nontest cases where “test” was part
of the name (e.g., “Testa”). The lower case function (lower()) is used to deal with
irregularities in capitalization. Naturally, far more ambitious regular expressions can be
used (see my notes on using TextPad effectively). The following drops phone numbers
with letters of the alphabet in them:
replace phoneprim2="" if regexm(phoneprim2,"[A-Za-z]")==1
Other string functions
Other useful string functions (see help string_functions for a comprehensive list)
are given below. Note that these aren’t restricted to variable creation and can be used in
conditional statements (see the example below for an illustration). Finally, s refers to a
string (which can either be a string variable name or an actual string like "Boston").
Keep only part of a string:
substr(s,n1,n2)
(n1 is the position the substring should begin at, n2 is the desired length of the substring,
hence substr(s,2,4) begins at the second position and has a length of four, thus
terminating at the fifth position)
Trim leading blanks:
ltrim(s)
Trim trailing blanks:
rtrim(s)
Trim internal blanks:
itrim(s)
Trim leading and trailing blanks:
trim(s)
Length of string (returns length of string):
length(s)
Make all letters lower case:
lower(s)
Make all letters upper case:
upper(s)
Real (returns numeric value if numeric, system missing (.) if not numeric):
real(s)
Note that destring, which is a command, not a function, is roughly speaking the counterpart to real.
An example of transforming zip codes for the U.S. and Canada:
replace zip=trim(zip)  /* Remove leading and trailing blanks */
replace zip=upper(zip)  /* Turn lower case Canadian postal codes to all upper case */
replace zip=substr(zip,1,5) if real(substr(zip,1,1))~=.  /* Turn U.S. zip+4 into regular zips with a length of 5 */
replace zip=substr(zip,1,2) if real(substr(zip,1,1))==.  /* Keep only first two places of Canadian zips */
Importing .csv and other text files
We work with text files more often than one might think. The basic form is:
insheet using data.csv, comma clear
.csv is, naturally, a comma-separated format: the delimiter (the character that separates
columns/variables) is the comma and the text qualifier is the quotation mark.
insheet supports specifying the delimiter:
insheet using data.dat, delimiter("~") clear
Here, the delimiter is the tilde (~). Why would we want a tilde for a delimiter? When
working with open-ended responses, people often use commas. Now, a properly specified
.csv shouldn’t have a problem with this because textual commas will be inside the text
qualifier (i.e., the quotation marks). Usually, that’s fine. But what if people also use
quotation marks in their open-ended responses? Things go to hell. To avoid
Pandemonium, we thus specify some character that people are unlikely to use, hence the
tilde.
Where the text file is in fixed length format, the command is infix. Usage is as follows
(numbers are column numbers for each variable and pertain to the variable name they
follow):
infix id 1-7 str blah 8-20 altwt2 21-27 using alt4wt.out
Stata has a dictionary format that may be useful when we need to repeatedly import
complex text files. I won’t go into this here; details can be found in the manual.
Dealing with delimiter issues on imports
Occasionally, you will “insheet” a .csv file containing string variables into Stata and
discover that commas in the string variables have caused the variable to be broken into
multiple parts, i.e., the commas become separators between columns instead of part of the
text contained in the variable. In this case, if you can put your string variable at the end of
the dataset, Stata can fix the problem.
First, put the commas back in the text this way:
egen optaffil1=concat(v14 v15 v16 v17), punct(,)
Then, trim the trailing commas:
replace optaffil1=reverse(substr(reverse(optaffil1), ///
indexnot(reverse(optaffil1), ","), .))
(Reversing the string turns the trailing commas into leading ones, indexnot() finds the
position of the first non-comma character, substr() keeps everything from there to the end,
and the final reverse() restores the original order.)
Exporting .csv, fixed format, and other text files
We often need to get Stata to export files in some text format for cleaning, weighting, or
other purposes. Exporting .csv files is very straightforward:
outsheet vara varb using xyz.csv, comma nolabel replace
By default, Stata exports text files with tab as the delimiter and quotation marks. It also
has (in my opinion) the unfortunate habit of exporting value labels rather than the
underlying numeric values of a variable unless told to do otherwise. You can also turn off
quotation marks, change the delimiter, and not put variable names on the first row of the
file. See help outsheet for further details.
Stata can be hacked to export fixed format datasets, too. (Unlike SPSS, it doesn’t support
this natively. There is no “outfix” command.) The important thing to bear in mind
when doing this, though, is that all values within a variable must be the same length.
Thus, if some values are shorter than others, you will need to turn them into strings and
add enough leading blank spaces for each value to be the same length. For an ID variable
with a maximum width of four characters, this would be as follows.
gen newid=string(id,"%4.0f")
replace newid="   "+newid if length(newid)==1
replace newid="  "+newid if length(newid)==2
replace newid=" "+newid if length(newid)==3
outsheet newid vara varb using blah.dat, ///
delimiter(" ") nonames nolabel noquote replace
In some circumstances, like working with QBAL, our rim weighting program, it may be
more helpful to use zeroes rather than spaces. Here is a code fragment that adds zeroes to
the beginning of a weight variable with different lengths by looping through and adding
one zero at a time until it reaches the correct length. This obviates the need to write out a
separate line of code for each length that occurs below the maximum string length.
gen nwrswtst=string(newrswt,"%8.1f")
forvalues i=3/7 {
    replace nwrswtst="0"+nwrswtst if length(nwrswtst)==`i'
}
Note that this hack produces a file with columns of spaces between variables, not the
usual fixed format with no blank columns. Setting the delimiter to "" will not work. In
that case, Stata just defaults to producing tab separated files.
Merging, appending, and reshaping
Before getting into the technical details of merging, it’s worth spending a moment
thinking about the operation on a practical and conceptual level. A merge takes place
when two datasets have at least some cases in common, but different substantive
variables, which is usually why we want to merge in the first place. (While it’s OK from
Stata’s perspective if both files have cases not found in the other one, or even no cases in
common, generally it’s bad from a user point of view because values will be
systematically missing from variables for cases found on one but not the other dataset.)
We might for instance have a set of information derived from administrative data from
program registration that we wish to add to a file containing survey data. An append
occurs, conversely, when the two files have similar sets of variables but different sets of
cases and we essentially want to add more cases at the bottom of the file. Thus, we might
want to add files from two similar surveys of different populations together.
Merging
The pivot around which a merge takes place is a single variable or set of variables found
in both files that identify a case. In the most basic case, this will be an ID variable that
identifies cases uniquely. In more complex cases, this might be a set of variables that do
not uniquely identify cases. In either case, the merge operation uses the identifying
variable or variables to determine which cases to link together.
In the simplest of all cases, there is a single ID variable with unique values (because this
will be the basis of the merge, it must be in both files) and both files contain precisely
the same cases and, other than the ID variable, have no other variables in common. The
variables (other than the ID variable) in the using file (the file being merged into the
currently open master file) will be appended to the right hand side of the dataset.
Making things a little more complicated (but still relatively simple), what happens if the
ID variable has duplicates in either the using or the master file? This might be the
case, for instance, in a household survey where one file contains one record for every
household member (each with the same household ID variable) containing individual-level demographic data, while the other file contains household-level data, with only a
single case per household. In this instance, we would want to merge the variables from
the household-level file into the individual-level file. While one could do the merge the
other way around, it makes the most sense to have the individual-level file open as the
master file and the household-level data as the using file. The same household-level
variables are thus added to each individual-level record where the ID is the same.
To complicate the picture a little more, it’s often the case that there are some “orphan”
cases that are found in one file and not the other. That doesn’t present problems for Stata.
It will just leave missing values on those variables for cases for which no match exists, and it
records the status of each case in a variable called _merge (this will be created regardless
of whether all cases matched). This gives a value of 1 for cases only found in the master
file, 2 for cases found only in the using file, and 3 for cases found in both files. You can
then take the appropriate steps, which will vary depending on the situation. Where the
orphan cases are salient, one keeps them. If they’re not, the best way of removing
unmatched cases from the using file is to use the keep option (not to be confused with
keepusing, discussed below), which Graham introduced to me. This is advantageous
when memory is an issue as it can massively reduce the amount of memory Stata needs to
perform the action. Returning to _merge, it’s good practice to drop or rename _merge
so it doesn’t create problems for the next merge. If there is no reason to generate
_merge (and I’d argue that this is rarely the case, as you ought to run diagnostics after
each merge to verify things went as expected—and they often don’t) you can choose the
nogenerate option. You can also tell Stata to call _merge something else by using the
generate option.
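As a hedged sketch (id and admin.dta are hypothetical names; the 1:1 syntax is explained below), the keep option and a quick check of _merge might look like this:
merge 1:1 id using admin.dta, keep(master match)
tab _merge /* 1 = master only, 3 = matched in both; check before proceeding */
drop _merge /* Clear the decks for the next merge */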
What if you have some variables in the file you will merge into the master file that
aren’t needed? In general, my practice has been to chew up some hard disk space (which
is cheap): drop the unneeded variables, save the file under a new name, and merge
using the newly created file. Alternatively, you can just drop them after the merge. Both
approaches waste time and resources, and the first clutters up your directories with file after file. A
better and more elegant approach (which is what this document is about) is to use the
keepusing option, which specifies the variables to be kept from the using file. This is
especially useful when memory is tight as it minimizes the amount of memory Stata
requires to perform a merge.
Let’s get a little more complicated still and imagine a situation where there are some
variables in common between the files other than the ID variable or variables. If one of
the files has better values than the other, you probably want to drop the “worse”
variable(s) from the master file prior to merging or use the keepusing option to
remove variables from the using file as appropriate prior to the merge. Where some
cases have values on one file and some cases have values on the other file, Stata offers
the options update (overwrites the missing values in the master file with values from
the using file; note that this will produce additional values of _merge for matched cases
where updates occurred, did not occur, or where the two files had conflicting nonmissing
values) or update replace (overwrites all values in the master file with values from
the using file). You will want to think very hard before using this functionality because
it’s very powerful and can overwrite values inappropriately without leaving any
indication that it did so.
To muddy the waters further still, what happens if both files have duplicate values? You
most likely shouldn’t be in this situation and, if you are, presumably know what you are
doing. Stata will create new cases for each combination of the duplicate values. As
before, this is almost always undesirable and suggests that something is wrong with one
or both files, your assumptions, or all of the above.
Finally, what about the possibility of using multiple variables to identify cases for
merging? In general, I would recommend against this as a case of a powerful but difficult
to trace function. While Stata will work happily with multiple matching variables (as long
as they are all present in both files), you would do better to create some variable that
concatenates the values of the various variables into a single variable (doing the same
operation in each file) to use for the merge. This promotes clear thinking about what you
are doing and makes searching for duplicate cases a snap. You can always drop the
variable from the merged file later on.
As I noted previously, the code used to merge files together in Stata 11 has changed
significantly from the “more primitive syntax” (see help merge) used in previous versions
of Stata. The new code is far easier to understand. The type of merge is now specified
immediately after the command:
merge 1:1
merge m:1
merge 1:m
merge m:m
The left hand side of the colon specifies the nature of observations in the currently open
dataset. Are they unique (i.e., each case in the file has a unique combination in the values
of the variables being merged on)? If so, 1. If not, m. Not surprisingly, the right hand side
specifies the nature of the file that is being opened (used) in the merge. The ID
variable(s) to merge on are specified immediately after the 1:m etc. part of the command.
See help merge for further details on the syntax.
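For instance, the household example above might look like the following minimal sketch (persons.dta, households.dta, hhid, and hhincome are hypothetical names):
use persons.dta, clear /* Individual-level file as the master */
merge m:1 hhid using households.dta, keepusing(hhincome)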
A final thing to bear in mind is that Stata as of version 11.0 automatically sorts the cases
on the ID variable(s) unless you tell it otherwise (the sorted option). This is an improvement, but
remember that when Stata sorts a dataset and has cases of equal rank, it randomizes their
order (see p. 14). In most cases, this should not present a problem. If, however, order is
important to you (e.g., you create a reproducible sample based on random numbers after
set seed), this will cause unpredictable, irreproducible variations. In such cases, you
would best follow my earlier advice and create an unambiguous order variable (e.g.,
using _n), then sort the cases prior to the merge by the ID variable(s) and then by the
unambiguous order variable, and use the sorted option. This only works, of course, if
no additional cases are being added from the using file. If cases are provided by both
files and some of them have equal rank with respect to the ID variable(s), the problem
remains and you would want to take additional steps to ensure that cases are placed in a
predictable order.
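A sketch of that workaround, assuming a hypothetical key variable id and a using file that is already sorted by id:
gen order=_n /* Unambiguous case order */
sort id order
merge 1:1 id using lookup.dta, sorted /* sorted: do not re-sort the data */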
Appends
By comparison, append is very simple. You simply specify the dataset(s) which you wish
to append to the master dataset (note that Stata doesn’t use the term “master” with respect
to append), choose whether or not to have a variable record which file was the source for
a given case (generate), and can limit the variables to be appended from the using
dataset(s) via the keep option.
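A minimal sketch (survey2010.dta and the variable list are hypothetical):
append using survey2010.dta, generate(source) keep(id age sex)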
Reshaping
For cases like datasets containing information on multiple individuals in a household
stored as variables (e.g., sex1 sex2 sex3) or for datasets where cases are clustered (e.g.,
cases 1, 2, 3 are all members of household 1, while cases 4 and 5 are members of
household 2), Stata can reshape the dataset. I have used this command in the past but
found it to be rather unintuitive, to put it mildly. I will expand this description when I
next work with it (and so can you!).
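Still, as a hedged sketch for the household example just mentioned (hhid is a hypothetical household identifier and sex1–sex3 the per-person variables), the wide-to-long direction looks something like this:
reshape long sex, i(hhid) j(person) /* sex1 sex2 sex3 become sex, indexed by person */
reshape wide sex, i(hhid) j(person) /* And back again */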
Matrices and scalars
As we’ve seen, besides the dataset, Stata can store and manipulate data in local macros,
and some commands store additional information that can be accessed, like the
coefficients of a regression model. As it happens, Stata has another way to store and
manipulate information: the matrix. A matrix is simply an r × c grid that contains data
(where r is rows and c is columns). This is a vital functionality for a statistics program, as
most estimation commands involve the manipulation of matrices. But access to matrices
is not limited to .ado files (the programs that define Stata commands). You can create
your own matrices and use matrices created by estimation commands. There are, in fact,
two ways to do this. The first is Stata’s original matrix language. This is suitable for
almost all operations one is likely to undertake; the only constraint that may rear its head
is that the size of matrices is limited to 11,000 × 11,000. Since Stata 9, however,
StataCorp has added a new language: Mata. Mata allows for far more powerful and
complicated manipulation of matrices with no practical size limitations. I do not
recommend using it, though, unless you absolutely have to. Mata is not really an organic
part of Stata. To work with Mata, one either temporarily “pauses” Stata to work in Mata
or invokes it through special Stata commands that function (to my mind) something like
an API. Mata’s syntax is unintuitive and its manual is, frankly, a
disaster, being divided into M-1, M-2, etc. components without really explaining what
each one is or which commands are documented where. Thus, I shall say nothing more
about Mata.
Before I begin, let me add a cautionary note. This section focuses on using matrices not
for the matrix mathematics one can do but as a way of hacking Stata to get it to perform
relatively basic operations that are difficult, clunky, or impossible using regular
techniques. Sophisticated mathematicians and statisticians may want to puke. You have
been warned.
A brief primer on matrices
The first thing to remember about matrices is that individual elements of the matrix are
specified in precisely the opposite fashion to Cartesian coordinates: y,x not x,y. Of course,
the standard way of stating this is r,c not y,x. I do not know why this idiotic convention
arose any more than I know why keyboard and telephone number pads have different
orders, but as with keyboards and phones, one has to learn to live with it.
In the title of this section, I mentioned scalars. A scalar is a single number (something
like a 1 × 1 matrix) that has some special properties in Stata. From the point of view of
regular datasets, a scalar is equivalent to a variable with a constant value for each case.
(A scalar can, however, have the same name as a matrix.) Thus one might multiply each
element of a variable in the dataset by a scalar.
One commonly occurring problem is that the name of the scalar is treated as a variable.
Thus, a scalar shouldn’t have the same name as a variable (the variable has
precedence). Moreover, if a scalar’s name is an unambiguous abbreviation of a given
variable, that variable has precedence. For example, if we have a dataset containing a
single variable, female, a scalar named f will be trumped by female. On the other
hand, if we have a dataset with two variables, female and femage, the scalar f will have
precedence, because Stata cannot tell which of female and femage it abbreviates. In practice, one
does better to avoid this situation entirely by giving each scalar a unique name. Failing to
heed this rule will lead to errors that are difficult to trace.
A final point to note about matrices is that one does not add them together and multiply
them unthinkingly. Matrix operations like this follow the logic of matrix algebra. Unless
you’re au fait with matrix algebra, the safe and certain way to perform operations on a
matrix is to loop through rows and columns, performing the desired operation on each
element in turn.
Accessing and addressing matrices
An existing matrix can be viewed by the command matrix list matname. New
matrices can be created by directly entering data, but this is unlikely to be needed and
you can look it up yourself. A more common occurrence is that one needs to define a
matrix in advance so that one can later fill it with values. This is what it looks like:
matrix x = J(2,3,.)
This syntax creates an r × c matrix (here 2 × 3) called x (in mathematics, matrix names
are typically expressed in unitalicized bold) with all the elements set to missing. To make
the elements equal to 0, we could do matrix x = J(2,3,0). Note that setting a matrix
equal to J(r,c,x) will overwrite an existing matrix of the same name, so be careful.
You can always drop existing matrices, too, by matrix drop x. Except for neatness,
this is unnecessary, as there is no limit on the number of matrices one can create and access.
So far, I’ve avoided addressing matrix elements. To access x[2,3] in our blank matrix and
change it to a 1, we simply do matrix x[2,3] = 1. Let’s now work through our
matrix and assign values to each cell that are the sum of the row and column number
(yes, it’s a dumb example, but it illustrates operations on individual elements):
forvalues r = 1/2 {
    forvalues c = 1/3 {
        matrix x[`r',`c'] = `r'+`c'
    }
}
Now let’s create a new matrix y with cells equal to the same cell in x + 1.
matrix y = J(2,3,.)
forvalues r = 1/2 {
    forvalues c = 1/3 {
        matrix y[`r',`c'] = x[`r',`c'] + 1
    }
}
This is all well and good, but typically one wants to get an element or elements of the
matrix to interact with the dataset. Stata doesn’t make this easy, because one can’t simply
put a reference to a matrix into a regular command; gen varx = vary * x[1,1]
won’t work. I’m sure there’s a better way, but my kludge is to copy the value of the
element of the matrix to a scalar (remembering that scalars are treated as variables from a
naming perspective):
scalar x = x[1,1]
gen varx = vary * x
Using matrices generated by estimation commands
Probably the most common cause for dirtying one’s hands with matrices is to make use of
matrices created by estimation commands. After running an estimation command, do
ereturn list. This will list the information the command has stored. Here’s what it
looks like after an OLS regression of a variable of interest on two covariates:
. ereturn list
scalars:
               e(N) =  221
            e(df_m) =  2
            e(df_r) =  218
               e(F) =  .4511474285902355
              e(r2) =  .0041219067975928
            e(rmse) =  2.823463192155305
             e(mss) =  7.193044431553062
             e(rss) =  1737.88387864537
            e(r2_a) =  -.0050145894703193
              e(ll) =  -541.465233008145
            e(ll_0) =  -541.9216450006693

macros:
         e(cmdline) : "regress y x1 x2"
           e(title) : "Linear regression"
             e(vce) : "ols"
          e(depvar) : "y"
             e(cmd) : "regress"
      e(properties) : "b V"
         e(predict) : "regres_p"
           e(model) : "ols"
       e(estat_cmd) : "regress_estat"

matrices:
               e(b) :  1 x 3
               e(V) :  3 x 3

functions:
           e(sample)
At the top we can see some scalars that could be useful (RMSE, for example) and near
the bottom we can see some matrices. The matrix e(b) contains the coefficients for the
intercept and covariates while the matrix e(V) is the variance-covariance matrix of the
covariates (and intercept).
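To take a quick look at them, you can list them directly:
matrix list e(b) /* Coefficient row vector: x1, x2, and the constant */
matrix list e(V) /* Variance-covariance matrix of the estimates */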
For some calculations, it might be useful to save one or more of the coefficients and use
them later. There are a few things to remember, though. First, these will be overwritten
by the next estimation command. Second, it turns out that you can’t directly access the
elements of e(b) or e(V). One can’t do the following, for instance, where z is a matrix
we want to put a coefficient in for some reason or another:
matrix z[4,1]=e(b)[1,1]
Instead, one has to do the following, creating matrix x to hold e(b) so we can refer to the
element we want to copy to z:
matrix x=e(b)
matrix z[4,1]=x[1,1]
matrix drop x
Using matrices and scalars for predictions
I discussed generating predicted values when using weighted survey data previously
(p. 18). At the time I presented a relatively elegant solution, albeit with considerable
drawbacks, while a more robust, but uglier, solution was deferred until matrices and
scalars had been described. I have just described matrices and scalars, so it’s now time
for the ugly kludge threatened previously. This relies on scalars to hold the (weighted)
means of predictor variables. The advantage of using a scalar over a local is that
these do not cease to exist, kick the bucket, shuffle off this mortal coil, etc., as soon as a
.do file has run. The disadvantage, though, is that they are a pain to bring to life in the
first place and revive each time one wants to use them. (Note that closing and restarting
Stata will remove scalars and matrices from memory, as will deliberately dropping them.)
The first stage is to run svy: mean separately for each variable. (One could, in theory, run
the mean of all explanatory variables simultaneously, and refer to the column of e(b) in
which they are to be found for subsequent operations. While this would save some lines
of code, in my case at least, this potential efficiency is more than counterbalanced by the
greater likelihood of referring to the wrong column.) After each run, one first converts
the 1 × 1 matrix e(b) into a temporary matrix, which I call x, then writes the value of the
only cell of x into a scalar (being careful to give it a unique name—see p. 31 for details):
svy: mean hoursjewed /* Taking mean of jedu hrs for calcs involving quadratic term */
matrix x=e(b)
scalar meanjeduhrs=x[1,1] /* Save scalar for use as local */
scalar meanjeduhrsq=meanjeduhrs^2 /* Contains square of mean hrs jew ed */
matrix drop x
Immediately before or after running the estimation command for which one wants to
generate predictions, one then turns the scalars into locals and refers to the locals in
prvalue:
svy: ologit conisr participant hoursjewed hoursjewedsq, or
local j=meanjeduhrs
local js=meanjeduhrsq
prvalue, x(participant=0 hoursjewed=`j' hoursjewedsq=`js')
prvalue, x(participant=1 hoursjewed=`j' hoursjewedsq=`js')
This way the weighted means can be used without having to re-estimate them each
time a .do file is run.
Running Stata from the command line
Back in the day when researchers were researchers, analyses ran on mainframes, and the
windows paradigm was just a curiosity at Xerox in Palo Alto, real social scientists ran
analyses from the command line (the operating system command line, not Stata’s). While
it’s buried in very obscure places in the Stata manual today, that functionality remains.
Why bother, other than my own antiquarian interest in data analysis modes of the past?
There are two scenarios where this may come in handy. The first is to run analyses on a
schedule (i.e., schedule with Windows so some analysis occurs overnight or at some
other time) and the second is to run Stata and other programs in a predetermined
sequence. A hypothetical example that draws on both these scenarios is as follows.
Suppose we have an ongoing survey on LimeSurvey for which we have cleaning syntax
in Stata, rim weighting syntax in QBAL, and analytic syntax in Stata. In sequence, we
probably want to get the server to feed us the latest data from Lime, open it in SPSS
(a.k.a. PASW, a.k.a. IBM SPSS; we can use an SPSS script generated by Lime), run the script and
save the file, use StatTransfer to turn it into a Stata file, run the cleaning syntax in Stata
and export the data needed for weighting, rim weight the file, merge the rim weights back
into the main Stata file, and then run the analyses we need. To do this, we write a DOS batch
file that calls the various programs in sequence (SPSS can also run from the command
line) and tells them which files to run, and we use the Windows scheduler to set it to run at a
specified time. I won’t go through invoking SPSS or the server from the command line,
but I will put the contents of a batch file that weights the BRI long-term surveys:
REM Windows batch file for weighting BRI Long Term data
REM REM is a comment/remark
REM change working directory to data file
cd \Cohen Center\BRI\BRI Panel Study\Data\
REM runs initial part of file
"C:\Program Files (x86)\Stata10\wmpstata" /e do "C:\Cohen
Center\BRI\BRI Panel Study\Syntax\bri_lt_weight_part1_06080
9.do"
qbal round3wt.qbs
REM this weight balances denoms
qbal round5awt.qbs
REM runs second part of file
"C:\Program Files (x86)\Stata10\wmpstata" /e do "C:\Cohen
Center\BRI\BRI Panel Study\Syntax\bri_lt_weight_part2_06080
9.do"
qbal round5wt.qbs
qbal round7wt.qbs
qbal round9wt.qbs
REM runs third part of file
"C:\Program Files (x86)\Stata10\wmpstata" /e do "C:\Cohen
Center\BRI\BRI Panel Study\Syntax\bri_lt_weight_part3_05080
9.do"
qbal round5awtx.qbs
REM runs fourth part of file
"C:\Program Files (x86)\Stata10\wmpstata" /e do "C:\Cohen
Center\BRI\BRI Panel Study\Syntax\bri_lt_weight_part4_05080
9.do"
qbal round3wtx.qbs
qbal round5wtx.qbs
qbal round7wtx.qbs
qbal round9wtx.qbs
REM runs fifth part of file
"C:\Program Files (x86)\Stata10\wmpstata" /e do "C:\Cohen
Center\BRI\BRI Panel Study\Syntax\bri_lt_weight_part5_05080
9.do"
There are a couple of things to note regarding calling Stata. Any files with paths that have
spaces in them need to be enclosed in quotation marks. The first part of each call of Stata
("C:\Program Files (x86)\Stata10\wmpstata") simply tells Windows to start
Stata by giving the place Stata’s executable is to be found on your computer (both the
exact file location and name of the Stata executable may differ for you). The /e tells
Stata to run without stopping for you to hit enter on the termination of the file. Finally, do
"C:\Cohen Center\BRI\BRI Panel Study\Syntax\bri_lt_weight_part5_050809.do"
tells Stata what .do file to run. Note that all Stata (and QBAL) syntax files
need to have at least one carriage return (i.e., a blank line) at the end of the file,
otherwise the programs will blow up.
Increasing the size of the command prompt scrollback buffer
One problem with a fairly long batch file like the above example is that the Windows
command prompt window only shows a set number of lines by default, much like Stata,
and that number of lines is too small to see all commands (and whether they blow up).
Here’s how to fix this (courtesy of http://www.petri.co.il/customize_command_prompt_in_windows_xp_2000_2003.htm):
1. Open command prompt.
2. Click the upper left hand corner of the command prompt window (i.e. the small
black box containing C:\ in the title bar of the window).
3. Click the options tab.
4. In Command History, set the Buffer Size to 999 and select 5 in Number of
Buffers.
5. In the Edit Options box, select Quick Edit Mode and Insert Mode check boxes.
6. Click the layout tab.
7. In Screen Buffer Size, set the Height to 9999.
8. Optionally, increase the Height and/or Width of the window under Window Size.
9. When you apply the properties, choose to save them for future windows with the
same name.
Programs
A program has a very specific meaning in Stata. It is a sort of multiline macro that can
be called upon within a .do file. The most powerful aspect of a program is its ability to
use the arguments appended to the program name when it is called. This was mentioned
briefly regarding macros. Programs have a format quite similar to that of loops in that
their body must be properly enclosed when it is defined. Let us say we wish to write a
program that will simply display the first four things appended after it.
program show
    display "`1'"
    display "`2'"
    display "`3'"
    display "`4'"
end
Were we to type show a b c d (or include it in a .do file), Stata would return:
a
b
c
d
A program will remain after being defined until Stata is closed, even if its usefulness has
passed. Programs can be removed by program drop programname.
Where might a program be useful? Just as if typing (or contemplating typing) line after
line of code that only varies with respect to a variable name or a number is an indicator that it
might be helpful to use a loop or loops instead, having large segments of a .do file that
repeat may be an indicator that a program would be useful. Here is a program I wrote that
generates a stacked column graph for weighted survey data:
capture program drop svychart
program svychart
    preserve /* Keep copy of unchanged dataset */
    label save `2' using label`2', replace
    quietly: svy: tab `1' `2'
    local r = e(r) /* Number of rows */
    local c = e(c) /* Number of cols */
    matrix prop = e(Prop) /* Copy matrix of weighted cell proportions */
    matrix coltot = J(1,`c',0)
    forvalues col = 1/`c' {
        forvalues row = 1/`r' { /* Calculate column total */
            matrix coltot[1,`col'] = coltot[1,`col'] + ///
                prop[`row',`col']
        }
    }
    matrix colpct = J(`r',`c',0)
    forvalues col = 1/`c' { /* Calculate within column percentages */
        forvalues row = 1/`r' {
            matrix colpct[`row',`col'] = prop[`row',`col'] ///
                / coltot[1,`col']
        }
    }
    matrix cname = e(Col) /* Create matrix to get col names */
    local cnames : colfullnames(cname) /* Local containing col names */
    matrix colnames colpct = `cnames' /* Add col names to matrix */
    matrix rname = e(Row)' /* Create transposed matrix to get row names */
    local rnames : rowfullnames(rname) /* Local containing row names */
    matrix rownames colpct = `rnames' /* Add row names to matrix */
    matrix rowpct = colpct' /* Transpose matrix */
    clear /* Clear dataset */
    svmat rowpct, names(col) /* Write matrix to dataset */
    gen `2' = _n /* Create a column containing group ids */
    run label`2' /* Create a label for group ids */
    label values `2' `2' /* Assign labels to group ids */
    graph bar `rnames', over(`2') percentages stack /* Graph command */
    restore /* Bring back original dataset */
end
Note the use of preserve at the beginning of the program and restore at the end.
These allow me to make temporary changes to the dataset for the purposes of making
calculations without having to actually save and use the dataset. Note also the quietly
prefix to a svy: tab command to suppress unnecessary output.
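Outside a program, the same preserve/restore pattern is handy for any throwaway calculation; here is a sketch with hypothetical variables complete and region:
preserve
keep if complete==1 /* Temporary change to the data in memory */
tab region /* Whatever throwaway calculation is needed */
restore /* The original dataset comes back untouched */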
The post and postfile commands
These commands create a new dataset separately from the dataset one is working on. They
can create files that are larger than Stata can currently handle in memory.
The basic syntax to set up the file to be written to is:
postfile postname newvarlist using filename [, replace]
Once this has been created, results are sent to it as follows:
post postname (exp) (exp) ... (exp)
It is closed as follows:
postclose postname
An example of postfile used in a Monte Carlo simulation from the Stata course:
program doit
    postfile mysim b lb ub using simres, replace
    forvalues i = 1/1000 {
        drop _all
        /* construct a sample */
        set obs 25
        gen x = invnormal(uniform())*3 + 2
        ci x
        /* calculate statistics */
        post mysim (r(mean)) (r(lb)) (r(ub))
    }
    postclose mysim
end
Note that ci returns the mean and the 95 percent confidence intervals. This can be
changed from the 95 percent default to, say, 90 percent, by set level 90. Stata will
revert to its default the next time it is run.
The bootstrap
Rather than relying on the Gaussian distribution, one can estimate standard errors by
bootstrapping, which resamples from the dataset a large number of times (the larger the
number of resamples, the more reliable the bootstrap estimates). Because the resamples
are random, in order to have reproducible results, it is necessary to set the random
number seed as follows (set seed n, where n is some number). Although some
commands include a bootstrap option, bootstrap can be specified as a prefix. See help
bootstrap for details. Note that bootstrap will include all observations in memory, so it
is necessary to specify a keep or drop statement to restrict the analysis to the valid
cases.
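A hedged sketch (y, x1, and x2 are hypothetical variables):
set seed 20100817 /* Reproducible resamples */
drop if missing(y, x1, x2) /* Restrict to the valid cases first */
bootstrap, reps(1000): regress y x1 x2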
Weird error messages
From time to time, Stata will generate weird error messages. Here we consider error
messages users have experienced.
unrecognized command: _pecats
User-written commands are typically those obtained by typing net search
nameofcommand. They are common candidates for weird error messages. The following
was an example of an error message received from a perfectly well-written call to
prvalue, a very useful command that is part of the estimable J. Scott Long’s SPost suite
of postestimation commands. The error message read “unrecognized command:
_pecats”. This qualifies as a weird error message, no? You couldn’t make this stuff up.
It turned out that the prvalue command didn’t work because the user had installed a
version of SPost designed for an earlier version of Stata.
If you ever get really weird error messages like this one, it may be because a command is
written using commands not supported in your version of Stata or relies on things stored
in matrices, scalars, global macros, or local macros that are no longer stored or stored
under that name in your version of Stata. This should only happen in user-written
commands. Naturally, if it’s written for a version of Stata that’s more recent than your
own, you’re in trouble, but we keep up to date. Assuming the user-written command
you’re having trouble with is written for an earlier version of Stata, the first thing you
should do is to check that you have the latest version (if the authors keep updating their
command) or, as in the case of SPost, where there are multiple versions of the same
commands available, the one that supports the most current version of Stata. In either
case, you will come to a point when Stata tells you that the version of the files it is trying
to install is not the version of the files that are currently installed. Simply tell it to force
an installation at that point. If that doesn’t do the trick—or your already installed version
really was the up-to-date version—you have one more ace up your sleeve, the version
command (see p. 3, above). All well-written .do files should have a version command at
the top, and the same goes for .ado files. The version command means that Stata will try
to translate any outdated commands into the present version of the syntax at runtime
(consider the merge command prior to 11.0 and from 11.0 onwards). The user-written
file that is giving you problems may, however, not include one. Therefore, you should try
the following. Write a version command with the number of the version the user-written
command was written for immediately prior to executing that command and then write a
version command for the version of Stata you’re using immediately afterward. If you
don’t know which version the command was written for, start at the second most recent version of Stata and
work your way backwards. This will occasionally cause a user-written command to work
when it otherwise would not have.
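As a sketch, supposing the installed SPost files were written for Stata 9 while you run Stata 11 (the version numbers here are assumptions you would adapt):
version 9 /* The version the user-written command targets */
prvalue, x(participant=0) /* The command that was misbehaving */
version 11 /* Back to the version you are actually running */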
'.' found where number expected
The context for this error message was in generating predictions from a regression model,
using the procedures outlined on p. 33 while failing to heed the commandment found on
p. 31 (“Thou shalt not name thy scalar after thy variables”). For rabbinical scholars, note
that this rule can be deduced in a straightforward fashion (tongue firmly planted in cheek)
from the doctrine of kilayim (Lev. 19:19, Deut. 22:9-11, bt Tract. Kilayim).
Index
cd 2
Change directory ........................... See cd
Memory allocation ................................ 1
More message ....................................... 2
Scroll buffer size ................................... 2