© 2003-Present Hun Myoung Park (1/26/2013)
Statistical and Econometric Data Analyses in Stata: 1
CHAPTER 1
INTRODUCTION
This chapter introduces Stata by describing its basic features, installation, updates, memory
configuration, and online help.
1.1 What is Stata?
Stata is an integrated data analysis package for managing, analyzing, and graphing data. Like
SAS and SPSS, Stata supports most of the statistical and econometric analyses that are
frequently used in various fields. It also has many features for data management, graphics,
matrix operations, and programming. The statistical and econometric analyses Stata
supports include:
Linear Regression Models (Ordinary Least Squares)
Generalized Linear Models
Categorical Dependent Variable Models (Logit/Probit Models)
Panel Data Models
Event Count Data Models
Time Series Analysis
Tobit and Survival Analyses
T-test and ANOVA
Multivariate Analyses
Nonparametric Methods
Sampling and Simulations
Unlike SAS and SPSS, Stata is basically a command-driven package in which users type in a
command and hit ENTER to run it. This interactive mode provides a highly flexible and
efficient way of communicating with the software. Stata also supports a point-and-click
graphical user interface (GUI), batch processing (non-interactive mode), and programming;
for instance, users can write their own commands.
1.2 Stata Flavors
Stata is available on a variety of platforms and in several flavors. Stata runs on UNIX and
UNIX-like systems (e.g., Mac OS X, Linux, AIX, HP-UX, Irix, and Solaris) as well as Microsoft
Windows.
Stata has four different flavors. Stata/MP (Multiprocessor) and Stata/SE (Special Edition) are
the most powerful in the sense that they can handle large data sets and matrices fast. Stata/MP
supports parallel processing on multiple processors or multi-core processors (e.g., dual core
and quad core). Intercooled Stata (Stata/IC), the standard version between Stata/SE and Small
Stata, provides moderate capacity for ordinary users. Small Stata is very limited in its
capacity; for instance, it supports up to 99 variables. Table 1.1 summarizes the major features
of the three major flavors.
Table 1.1 Comparison of Three Major Flavors
                        Stata/MP               Stata/SE               Stata/IC
Observations            Limited by resources   Limited by resources   Limited by resources
Max # Variables         32,767                 32,767                 2,047
Max # Right-hand Vars   10,998                 10,998                 798
Dataset Width           393,192                393,192                24,564
Command                 1,081,527 characters   1,081,527 characters   67,800 characters
Macro                   1,081,511 characters   1,081,511 characters   67,800 characters
String Variable         244 characters         244 characters         80 characters
Matrices                11,000 by 11,000       11,000 by 11,000       800 by 800
One-way Table           12,000                 12,000                 3,000
Two-way Table           12,000 by 80           12,000 by 80           300 by 20
http://www.sonsoo.org
You may check the current version by executing the .about or .version command. Type
“about” in the Stata command window and hit ENTER to get the result. Note that the period
(.) is the Stata prompt.
. about
Stata/SE 12.1 for Mac (64-bit Intel)
Revision 18 Dec 2012
Copyright 1985-2011 StataCorp LP
45-user Stata network perpetual license:
…
1.3 Installing Stata
Make sure you have the serial number, license code, and authorization key. The information
is not required during installation, but should be provided when you first run Stata after
installation.
Once the Stata installer begins running, just follow the instructions provided. You are asked
to choose the directory in which Stata is installed; the default is C:\STATA8.
Then, you need to choose the flavor of Stata. If your license is for Intercooled Stata,
Stata/SE and Small Stata will not work. Click the icon that matches your license. Before
copying Stata files to your hard disk, Stata may ask you to choose the working directory; the
default is C:\DATA.[1]
Once installation is complete, you may run Stata to type in the serial number, license code,
and authorization key. You are asked to type in your name and institution so that they appear
on the screen when Stata is launched.
If you wish to verify that Stata is successfully installed, execute the .verinst command. If
the installation was performed correctly, you will see a message like the following.
. verinst
Stata/SE 12.1 for Mac (64-bit Intel)
Revision 18 Dec 2012
Copyright 1985-2011 StataCorp LP
45-user Stata network perpetual license:
[1] You may check the working directory at the bottom left of the Stata window.
There may be user-written ado files that you are interested in. You may search for them and
install them using the .net command. The following commands download and install the SPOST
module written by J. Scott Long for categorical dependent variable models.
. net from http://www.indiana.edu/~jslsoc/stata/
. net install spostado
See Section 1.9, Managing User-Written Files, for the details.
1.4 Starting and Terminating Stata
In Microsoft Windows, you can launch Stata by clicking the Stata icon from the Windows
Start menu. Under the X window system, you have to execute xstata at the X terminal prompt
to get Stata’s main windows.
$ xstata
On UNIX machines, you need to type stata at the UNIX prompt to start Stata in interactive
mode.
$ stata
If you want to run a batch job in non-interactive mode, at the UNIX prompt, type
$ stata -b do cigar.do
You need to replace “cigar.do” with your Do-file name. The default extension of “.do” can
be omitted.
To terminate Stata, type exit in the command window (GUI) or command line (UNIX).
Alternatively, you may choose File → Exit (Alt+F4) or click the close button at the upper
right corner of the Stata window under Microsoft Windows and X window.
. exit
. exit, clear
Note that if you wish to terminate Stata without saving any change, add the clear option as in
the second command above.
1.5 Stata Windows
In X window and Microsoft Windows, you have four default windows when Stata is first
launched: Stata Command window, Stata Results window, Variables window, and Review
window.
The Stata Command window is the place where you type in commands. The Stata Results window
is the place where results are displayed. The Variables window at the bottom left lists the
variable names in the current dataset. The Review window lists the commands executed so far.
There are three more windows. You may click the Window menu to see all windows available in
Stata (see the left screenshot). The Viewer window shows the contents of Stata help or text
files. The Data Editor window displays data in a spreadsheet style so that users can check and
correct them. Finally, the Do-file Editor window allows users to write .do or .ado programs.
1.6 Managing Memory
Stata loads a dataset into computer memory (including virtual memory), but it does not
automatically use all the memory available in your computer. Stata/SE by default assigns
10MB. To check the memory currently reserved for Stata, run the .memory command.
. memory
When you try to read a dataset larger than current memory size, Stata might give you a
warning message such as:
No room to add more observations
If this happens, you will need to increase the memory size using the .set memory command
so that Stata has enough room for the large data set. Recent versions of Stata manage memory
automatically by default. You can also change the maximum number of variables and the matrix
size.
. set maxvar 10000
. set matsize 1000
However, increasing memory size does not always improve the overall performance of Stata.
The optimal memory size depends upon computing resources and the size of the dataset.
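As a rough sketch of the memory commands mentioned above, the lines below are written as do-file lines (without the dot prompt). The 500m figure is an arbitrary illustration, and the behavior depends on the release: .set memory applies to older releases, while releases that manage memory automatically are expected to ignore it with a note.

```stata
* Show the current memory-related settings
query memory
* Request 500 MB for data (older releases only; the amount is an arbitrary example)
set memory 500m
* Report how memory is currently being used
memory
```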
1.7 Default extensions
The following table summarizes Stata’s default extensions that are often omitted.
Default   File Types                     Related Commands
.dta      Stata data file                .use and .save
.do       Stata do file                  .do and .doedit
.ado      Automatically loaded do file   .doedit
.log      Log file in text mode          .log
.smcl     Log file in SMCL format        .cmdlog
.raw      ASCII text file                .infile, .infix, and .insheet
.out      Files saved by the .outsheet   .outsheet
.dct      Stata dictionary file          .infix
.gph      Graph image                    .graph
1.8 Updating Official Stata Files (.update)
Stata supports Internet functionality through the .update, .net, and .ado commands. These
commands make it easier to update Stata files and user-written files through the Internet.
Accordingly, your computer needs to be connected to the network in order to use these
commands.
The .update command is used to update official Stata files, that is, the Stata executable
file and the ado files produced by StataCorp. The .update command without any option reports
the current update status and gives a recommendation, if necessary.
. update
Stata executable
    folder:               C:\Wins\Stata8\
    name of file:         wstata.exe
    currently installed:  24 Apr 2003
Ado-file updates
    folder:               C:\Wins\Stata8\ado\updates\
    names of files:       (various)
    currently installed:  09 Sep 2003
Recommendation
    Type -update query- to compare these dates with what is available from
    http://www.stata.com.
The all option compares the current ado files and the Stata executable file with those
available from StataCorp, and then downloads and installs the update files, if necessary.
. update all
You may also check the ado files or the Stata executable separately. The executable option
compares the current Stata executable file with the corresponding official update, and then
downloads it, if necessary. The ado option updates ado files only. The all option is more
convenient than the executable and ado options and is therefore recommended.
. update ado
. update executable
(contacting http://www.stata.com)
Executable update log
1. verifying "C:\Wins\Stata8\" is writeable
2. downloading new executable
New executable successfully downloaded
Instructions
1. Type -update swap-
Stata stores the new executable as wstata.bin in the folder where wstata.exe is located. You
may manually delete the old executable file and rename the new one, but the .update swap
command performs that task for you.
1.9 Managing User-Written Files (.net and .ado)
The .net command allows users to find useful user-written resources on the Internet or on
other media, and then download and install them in Stata. The resources include packages and
ancillary files. A package in Stata is a collection of ado and help files that provides a new
feature in the form of a command. Ancillary files are additional files, such as datasets and
example files.
. net search spost
The above command searches for packages associated with the keyword “spost,” and then lists
them with their URLs.
Once you find a useful one, you may view the contents of the site that hosts the package using
the .net from command.
. net from http://www.indiana.edu/~jslsoc/stata/
From the list of contents, choose the one you need. Then, use the .net describe command to
get more information about the package.
. net describe spostado
. net describe spostst8
You may check that “spostado” is a package including ado files and their help files, and that
“spostst8” contains a set of example files. You need to install the package “spostado” using
the .net install command and copy the set of ancillary files “spostst8” using the .net get
command. The two commands are not interchangeable.
. net install spostado
. net get spostst8
If you know the right URLs, specify them using the from option.
. net install spostado, from(http://www.indiana.edu/~jslsoc/stata)
. net get spostst8, from(http://www.indiana.edu/~jslsoc/stata)
Now you are ready to use the commands supported by the package. You may double-check whether
the package is available using the .ado command.
The .ado command manages the packages you have installed using the .net command. This
command allows you to list and remove the installed packages.
. ado
. ado describe spostado
. ado uninstall spostado
The first command lists the packages you have installed. The second shows the contents of
the package, while the third removes it.
You may load or copy a dataset from web pages.
. use http://mypage.iu.edu/~kucc625/documents/cancer.dta
. copy http://mypage.iu.edu/~kucc625/documents/cancer.dta c:\cancer.dta
. type http://mypage.iu.edu/~kucc625/documents/cancer.txt
Note that the last command is to view the contents of an ASCII text file “cancer.txt.”
1.10 Backward Compatibility
Stata 8.0 introduced new or substantially enhanced features that were not supported in
previous releases. Among these features are the point-and-click command mode and the .graph
command. You may check the differences across Stata releases by running the following
command.
. help version
The .graph command in Stata 8.0 provides higher quality graphs at the expense of changes in
its syntax. In other words, the old .graph syntax does not work correctly in release 8. If
you still wish to use the old style syntax in Stata 8, you can either set the command
interpreter to an earlier version with the .version command, or use the .graph7 command
instead of .graph.
. version 7
. graph score, bin(10) normal
The above commands draw a histogram of the variable “score” with a normal curve overlaid.
They are equivalent to the following command.
. graph7 score, bin(10) normal
You may change the command interpreter back to release 8.0 by specifying the version
number. The following .histogram command is equivalent to the above graph commands.
. version 8
. histogram score, bin(10) normal
1.11 On-line Help and Internet Resources
Stata’s help system is well organized and resourceful. The .help command lists the contents
of commands and functions in the Stata Results window. You may add the name of a command whose
usage you want to know. The second command below, for instance, shows the syntax,
explanation, and examples of the .regress command.
. help
. help regress
You may look at help in the point-and-click mode by running the .view help command. The
Stata Viewer allows you to navigate the entire command system organized in a hierarchical
order.
. view help
You may check what has been added since release 8.0 by running the following command.
. help whatsnew
Stata provides various services through the Internet.
http://www.stata.com (Stata Webpage)
http://www.stata.com/support/ (Support)
http://www.stata.com/support/faqs/ (Frequently Asked Questions)
http://www.stata-journal.com/ (Stata Journal)
http://www.stata-press.com/ (Stata Bookstore)
The following Internet resources are also very useful for using Stata.
http://www.princeton.edu/~erp/stata/main.html (Princeton University)
http://www.indiana.edu/~jslsoc (J. Scott Long, Indiana University)
http://sobek.colorado.edu/LAB/STATS/stata_help.html (University of Colorado)
CHAPTER 2
COMMAND, OPERATOR, AND FUNCTION
This chapter explains how to communicate with Stata in three different ways. Then, major
commands, operators, functions, and data types are listed. In addition, it explains how to
specify a subset of a dataset, how to repeat a command on groups, and how to use
the .display command as a calculator and a probability distribution table.
2.1 Interface Modes
Stata supports interactive, non-interactive (batch), and graphical user interface (GUI) modes.
2.1.1 Interactive mode
Basically, Stata is a command-driven application. In other words, users need to type in a
command and hit ENTER to run the command, as in UNIX and DOS. Then, Stata interprets
the command, processes the job, and returns its result to users.
This interactive mode enables users to communicate with Stata step by step. GAUSS, S-Plus,
Matlab, and Maple also use this interactive mode of communication. Stata’s systematic
grammar structure and abbreviation rules provide highly flexible and efficient ways of
communication. The following command runs a linear regression of “lung” on “cigar.”
. regress lung cigar
This mode has several advantages. First, it makes it efficient to perform many tasks, such as
recoding variables and listing observations. Stata comes in especially handy for
“data cooking.” You can even use Stata as a calculator or as a probability distribution table.
See Section 11 for the details.
The second strength originates from the way that interpreters work. Unlike compilers, the
Stata command interpreter keeps analysis results in memory even after executing commands so
that users can conduct necessary follow-up analyses without running entire analyses again.
For example, you can run a linear regression model, and then check its results. You may then
want to get predicted values using the .predict command and conduct hypothesis tests
using the .test command. This is possible because the coefficient matrix and the
variance-covariance matrix remain in memory. In SAS and SPSS, by contrast, you have to run the
regression again after making proper changes for predicted values and hypothesis testing.
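The follow-up workflow described above might look like the sketch below, written as do-file lines without the dot prompt. It uses the auto dataset that ships with Stata rather than the cigar data, and the variable name price_hat is a hypothetical choice.

```stata
* Load the example dataset shipped with Stata (an assumption; not the cigar data)
sysuse auto, clear
* Fit the model once; the estimates stay in memory
regress price mpg weight
* Predicted values from the model just fitted (price_hat is a hypothetical name)
predict price_hat
* Hypothesis tests that reuse the stored coefficients and covariance matrix
test mpg = 0
test mpg weight
```

Neither .predict nor .test refits the model; both rely on the results that .regress left in memory.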
2.1.2 Non-interactive mode (batch mode)
The non-interactive mode runs a set of commands written in a text file. Classical statistical
software such as SAS and SPSS uses this mode of communication. Stata’s non-interactive
mode supports two kinds of programs: “.do” and “.ado” files. Users can write a “.do” file, a
batch file, in which a set of Stata commands is organized. You may run all of the
commands individually in the Stata command window, but writing a do file is more efficient,
especially when you have a bundle of commands to be repeated many times.
Like C and Java source files, Stata programs (i.e., do and ado files) may be written in a
text editor (e.g., Notepad) or a word processor (e.g., WordPerfect), but they should be stored
in plain ASCII text format. Of course, you may use the Stata Do-file Editor by clicking
Window → Do-file Editor or pressing Ctrl+8. Alternatively, run the .doedit command or click
the Do-file Editor icon.
. doedit
. doedit cigar.do
The first command above creates a new .do file, while the second reads and edits an
existing .do file “cigar.do.”
Once a .do file is ready (edited and saved), you can execute the batch job by running the .do
command in the command window. Alternatively, as in SAS, you may choose the Tools → Do
menu (Ctrl+D) or click the corresponding icon in the Do-file Editor window. When you wish to
execute only part of the commands, highlight the block of commands using the mouse, and choose
the Tools → Do Selection menu.
. do cancer.do
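A do-file is simply a list of commands in a plain text file. A minimal sketch follows; the file name example.do and its contents are hypothetical, and it uses the auto dataset shipped with Stata so that it can run on its own.

```stata
* example.do: a hypothetical batch job
version 8                      // interpret the commands as Stata 8 syntax
sysuse auto, clear             // load the example dataset shipped with Stata
summarize price mpg weight     // descriptive statistics
regress price mpg weight       // linear regression
```

Saving these lines as example.do and typing do example in the command window would run them in order.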
Another type of program is the “.ado” file, the source of a Stata command. Put differently,
many Stata commands, such as .logit, .regress, and .recode, are based on ado files.
StataCorp provides basic .ado files that are installed under the “…stata8\ado\base” directory.
But users can write .ado programs as well. That is, users can add their own commands to Stata.
Unlike .do files, .ado programs need to be written in the Stata ado language, which looks
similar to C.
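To give a sense of what an .ado program looks like, here is a minimal sketch of a user-written command. The command name hello and its message are hypothetical; the file would be saved as hello.ado in a directory on the ado path.

```stata
*! hello.ado: a hypothetical user-written command
program define hello
    version 8
    * Print a fixed message in the Results window
    display "Hello from a user-written command"
end
```

Once the file is in place, typing hello in the command window should display the message.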
2.1.3 GUI (Graphical User Interface)
Users can benefit from the point-and-click environment, which has been supported since Stata
8.0. Users pull down Stata menus and select the appropriate item to invoke a dialog box that
runs a command.
Stata’s GUI builds a command on the basis of the information provided in dialog boxes. The
command is echoed in the Results window, allowing users to compare the point-and-click steps
with the corresponding command. The GUI mode is therefore particularly useful for Stata
beginners. Most statistical software (e.g., SAS and SPSS) nowadays supports this mode of
communication.
Users may use shortcuts instead of pointing and clicking menus. For example, Ctrl+S
(pressing the S key while the Ctrl key is held down) is equivalent to choosing File → Save.
Interestingly, you may invoke the proper dialog box by executing the .db command instead of
using pull-down menus.
. db regress
The above command is equivalent to choosing Statistics → Linear regression and
related → Linear regression (see the screenshot).
2.2 Rules of Commands
2.2.1 Stata is case-sensitive. Commands are written in lowercase. In other words, “REGRESS”
and “Regress” do not work at all; use “regress.”
2.2.2 Stata commands, variable names, and options can be abbreviated to the shortest
string of characters that uniquely identifies them. This highly flexible abbreviation
is one of Stata’s fascinating features. The minimum abbreviations are underlined in help and
manuals (e.g., ta for tabulate). However, some commands like .replace cannot be abbreviated.
For example, the .regress command can be abbreviated as .reg, .regr, .regre, and .regres.
Similarly, the nolabel option can be reduced to nol. A variable name “gender” may be
referred to as “gen” unless other variable names in the current dataset also begin with “gen”
(e.g., “gene” and “genre”). You may use wildcards (i.e., ?, *, and ~) when abbreviating
variable names. See Section 2.4 for the details of wildcards.
2.2.3 Syntax Structure: In general, a Stata command consists of (a) a command, (b) a list of
variables, (c) qualifiers, and (d) options. Some commands may have subcommands. The
in and if qualifiers are used to specify the subset of a dataset to which a command is applied.
See Section 2.8 for the details of the qualifiers.
. list
. list state-lung k*
. list if area==4
. list in 10/l
. list, nolabel noobs separator(10)
. list state-lung k* in 10/l if area==4, nol noo sep(10)
Omitting the list of variables implies all variables (the first command). You may use a range
of variables and wildcards when listing variables (the second). The third and fourth are
examples of the if and in qualifiers. The fifth shows how a series of options is listed. The
last combines all of these components of a command. See Chapter 7 for the details of the .list
command.
2.2.4 A dependent variable precedes a set of independent variables. In the following
example, “yesno” is the dependent variable, whereas “income,” “education,” and
“occupation” are independent variables.
. logit yesno income education occupation if gender==1, robust
2.2.5 Comma: A command and its options should be separated by a comma. However, no commas
are used within the list of variables or within the list of options.
. tabulate grade degree, chi2 expected gamma
Note that in the above .tabulate command, chi2, expected, and gamma are all options that
may be omitted.[2]
[2] The chi2 option conducts a chi-square test; the expected option computes the expected cell
frequencies; the gamma option shows the gamma statistic, a measure of association for ordinal
variables.
2.3 Major Commands
This section classifies major Stata commands in comparison with those of SAS and SPSS.
2.3.1 Descriptive Statistics
Stata Commands                 SAS Procedures
summarize; tabstat; inspect    UNIVARIATE; CAPABILITY
sktest; swilk; sfrancia        UNIVARIATE
summarize; tabstat             MEANS; SUMMARY
tabulate                       FREQ
tabulate                       TABULATE
list; browse                   PRINT; REPORT
graph; dotplot; histogram      CHART; PLOT
2.3.2 Data Management
Stata Commands                 SAS Procedures
describe                       CONTENTS
use; save; edit                DATA (SET)
generate; replace; recode      DATA
keep; drop                     DATA (KEEP; DROP)
label; format                  DATA; FORMAT
append; merge                  DATA (MERGE)
rename                         DATA (RENAME)
collapse; reshape              MEANS
infile; insheet; infix         DATA (INFILE); IMPORT
outfile; outsheet              EXPORT
odbc                           SQL (SAS/SQL)
sort; order                    SORT
2.3.3 Regression Models
Stata Commands                 SAS Procedures
regress                        REG
logistic; logit; probit        LOGISTIC; PROBIT
ologit; mlogit; clogit         GENMOD; CATMOD; MDC
nl                             NLIN
tobit; streg; stcox            LIFEREG; PHREG
ivreg; mvreg; reg3; sureg      SYSLIN
poisson; nbreg; zip; zinb      GENMOD
2.3.4 ANOVA and Multivariate
Stata Commands                 SAS Procedures
ttest                          TTEST
oneway; anova                  ANOVA
glm; manova                    GLM; CATMOD; GENMOD
factor; pca                    FACTOR; PRINCOMP
correlate; pwcorr; alpha       CORR
canon                          CANCORR
cluster                        CLUSTER
* ANCOVA is conducted by the .anova in Stata, but by GLM in SAS and SPSS
2.3.5 Nonparametrics and Others
Stata Commands                 SAS Procedures
ksmirnov; kwallis; ranksum     NPAR1WAY
tab; tabi; kappa               FREQ
--                             DISCRIM
matrix                         IML (SAS/IML)
2.4 Operators, Wildcards, System Variables
This section summarizes operators, wildcards, and system variables.
2.4.1 Operators
Types        Operators
Arithmetic   + (addition), - (subtraction), * (multiplication), / (division),
             ^ (raise to a power)
Relational   > (greater than), >= (greater than or equal), < (less than),
             <= (less than or equal), == (equal), != or ~= (not equal)
Logical      & (and), | (or), ~ (not)
Others       = (assignment), + (string concatenation),
             L#.variable (backward shift for time series data)
2.4.2 Wildcards and Other Symbols
Symbol     Meaning                                      Examples
*          Any characters                               re*
?          Any character                                measure?
~          Zero or more characters                      mil~um
-          Specifying range of variables                gender-rank
/          Specifying range of observations             in 1/100
           (in the in qualifier)
//         Comments (Programming)                       // to explain
///        Join the next line with the current line     .regress y x1 x2 x3, ///
           in do and ado programs (Programming)             beta robust // options
||         Overlapping graphs (Graphics)                || scatter …
/*... */   Comments (Programming)                       /* to explain */
2.4.3 System Variables
System Variable   Example        Meaning
_all              _all           All variables
_n                _n             Current observation number
_N                _N             Total number of observations
_coef (or _b)     _coef[cigar]   Coefficient of the variable “cigar”
_se               _se[cigar]     Standard error of the coefficient of the variable
_cons             _b[_cons]      Equal to 1 or the intercept term
_pi               _pi            Value of π
_pred             _pred
_rc               _rc            Return code from the capture command
_skip             _skip
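A few of the system variables above can be tried out as in the following sketch, written as do-file lines. It assumes the auto dataset shipped with Stata instead of the cigar data, and the variable names obs_id and n_total are hypothetical.

```stata
sysuse auto, clear
* _n is the current observation number; _N is the total number of observations
generate obs_id = _n
generate n_total = _N
* After estimation, _b[] and _se[] return coefficients and their standard errors
regress price mpg
display _b[mpg]
display _se[mpg]
display _b[_cons]
* _pi holds the value of pi
display _pi
```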
2.5 Functions
The following are lists of major functions that are commonly used.
2.5.1 Mathematical Functions
Functions                            Meaning
abs(x)                               Absolute value
sin(x), cos(x), tan(x)               Sine, cosine, tangent
ceil(x), floor(x)                    Smallest integer >= x; largest integer <= x
int(x), round(x)                     Truncation toward zero; rounding to the nearest integer
comb(n, k)                           Combinatorial function (n choose k)
exp(x)                               Exponential function
ln(x) or log(x)                      Natural logarithm
logit(x), invlogit(x)                Log of the odds and its inverse
max(x1, x2, ...), min(x1, x2, ...)   Maximum and minimum values
mod(x,y)                             Modulus of x with respect to y
sign(x)                              Sign
sqrt(x)                              Square root
sum(x)                               Running sum
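The mathematical functions can be tried directly with the .display command. The sketch below is written as do-file lines; the input values are arbitrary illustrations.

```stata
display sqrt(16)        // 4
display abs(-3.5)       // 3.5
display ceil(2.1)       // 3
display floor(2.9)      // 2
display int(-2.7)       // -2
display mod(7, 3)       // 1
display comb(5, 2)      // 10
display max(3, 8, 5)    // 8
```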
2.5.2 String functions
Functions                 Meaning
char(n)                   Character corresponding to ASCII code n
index(s, key)             Position in s at which key is first found; otherwise zero
length(s)                 The length of a string
lower(s)                  Lowercase string
ltrim(s)                  A string without leading blanks
real(s)                   To convert a string to a number
reverse(s)                A reversed string
rtrim(s)                  A string without trailing blanks
string(n), string(n, s)   To convert a number to a (formatted) string
substr(s, n1, n2)         Substring of s starting at n1 for a length of n2
trim(s)                   String without leading and trailing blanks
upper(s)                  Uppercase string
wordcount(s)              The number of words in a string
* Also see Chapter 9, Section 6, Handling String Variables
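String functions can be checked in the same way with .display. The sketch below is written as do-file lines, and the string literals are arbitrary illustrations.

```stata
display upper("stata")                 // STATA
display length("stata")                // 5
display substr("econometrics", 1, 4)   // econ
display reverse("abc")                 // cba
display trim("  panel  ")              // panel
display real("3.14") + 1               // 4.14
display string(12) + " obs"            // 12 obs
```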
2.5.3 Probability Functions
Functions         Meaning
binorm(h, k, p)   Joint cumulative distribution of bivariate normal
chi2(d, x)        Cumulative chi square distribution
chi2tail(d, x)    Reverse cumulative (upper-tail) chi square distribution
F(d1, d2, f)      Cumulative F distribution
Fden(d1, d2, f)   Probability density function of the F distribution
Ftail(d1, d2, f)  Reverse cumulative (upper-tail) F distribution
norm(z)           Cumulative standard normal distribution
normden(z)        Standard normal density
normden(z, s)     Rescaled standard normal density
tden(d, t)        Probability density function of Student’s t distribution
ttail(d, t)       Reverse cumulative (upper-tail) Student’s t distribution
* Also see Section 11, Using the .display Command
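The upper-tail functions above can stand in for a printed probability distribution table. In the sketch below, written as do-file lines, the degrees of freedom and test statistics are arbitrary illustrations rather than results from the text.

```stata
* Upper-tail p-value of a chi-square statistic of 3.84 with 1 degree of freedom
display chi2tail(1, 3.84)
* One-sided and two-sided p-values of t = 2.086 with 20 degrees of freedom
display ttail(20, 2.086)
display 2*ttail(20, 2.086)
* Upper-tail p-value of F = 3.32 with (2, 30) degrees of freedom
display Ftail(2, 30, 3.32)
```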
2.5.4 Other Useful Functions
Functions    Meaning
autocode()   Grouping observations
group(#)     Grouping observations
recode       Grouping observations
uniform()    Uniform pseudo-random numbers
2.6 Data Types
Stata has six different data types, which are grouped into real numbers, integers, and strings.
For efficient memory management, use the appropriate type for your data. For example,
int and byte are better than float and double for five-point Likert scale variables, because
the latter types consume more memory than the former.
Keyword   Type          Range
float     Real number   8.5 digits of precision
double    Real number   16.5 digits of precision
int       Integer       -32767 ~ 32740
long      Integer       -2,147,483,647 ~ 2,147,483,620
byte      Integer       -127 ~ 100
str#      String        str1 through str244*
* Intercooled Stata and Small Stata support up to str80.
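As a sketch of choosing storage types explicitly, the lines below (do-file style) assume the auto dataset shipped with Stata; the variable names high_mpg and price_k are hypothetical.

```stata
sysuse auto, clear
* A 0/1 indicator fits comfortably in a byte
generate byte high_mpg = mpg > 25
* Use double only where the extra precision is needed
generate double price_k = price/1000
* Check the storage types that were assigned
describe high_mpg price_k
* Let Stata demote variables to the smallest type that loses no information
compress
```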
2.7 Rules of Naming Variables
Naming is the starting point of data analysis. Bad names may bother you repeatedly during
the analysis. Please take enough time to choose good names; it will pay off soon.
(1) Use letters (a through z and A through Z), numbers (0 through 9), or the underscore (_).
Do not use special characters, such as space, -, $, #, @, &, and ~.
(2) A variable name should begin with a letter; a number cannot come first. Using an
underscore as the first character is not recommended, because the name may then resemble
a system variable (i.e., _all, _b, _coef, _cons, _n, _N, _pi, _pred, _rc, _skip, and _se).
(3) Avoid reserved words or keywords, such as byte, double, float, in, int, long, using, with,
regress, anova, display, and tabulate.
(4) Variable names should be meaningful, indicating what the variable is for.
(5) The shorter the better, although Stata allows up to 32 characters.
(6) Use lowercase unless uppercase is necessary or required. Keep in mind that Stata is
case-sensitive.
(7) Use group names so that you can take advantage of wildcards and ranges (e.g., score? and
score1-score9).
Category   Good                       Bad, If not Invalid
(1)        gnp_2002                   gnp of 2002; gnp-2002; gnp#2002; gnp~2002
(2)        score1; gnp_2003           1st_score; 2003_gnp
(2)        interest                   _interest
(3)        gender; education          double; int; using; logit; glm; ttest; tabulate
(4)        invest_2003                x; y; z; xxx; yyy; zmdje; ej93nx6
(5)        rInt_2003                  real_interest_rate_of_in_2003
(6)        income; sales_IBM          INCOME; InCoMe; sales_ibm
(6)        rInt_US2003                rint_us2003; RINT_US2003
(7)        score1; score2; score3…    math, physics, history, management…
2.8 Specifying a Subset of a Dataset (if and in Qualifiers)
The if and in qualifiers specify a subset of a dataset in different ways. The if qualifier
selects the observations to which a command is applied by imposing conditions that the
observations need to satisfy. You may use the logical operators & (and) and | (or) to combine
more than one condition. Consider the following examples.
. sum cigar-kidney if area==1
. list state cigar lung if (area==4) & (lung >= 10)
. regress bladder cigar if (area==2) | (area==3)
The in qualifier specifies the range of observations to which a command is applied. You may
use observation numbers or some keywords indicating particular observations.
. sum cigar-kidney in f/10
. list state cigar lung in 7
. regress bladder cigar in f/l
The first command returns summary statistics for the first ten observations. The second lists
the values of the 7th observation. The third command of regressing “bladder” on “cigar” is
applied to all observations (from the first through the last); so you may omit the “in f/l.”
Keywords     Example              Meaning
n            in 10                The 10th observation
-n           in -10               The 10th observation from the last
1 (or f)     in 1/10; in f/10     From the first observation through the 10th
-1 (or l)    in 15/-1; in 15/l    From the 15th observation through the last
However, you may not list more than one observation number without the / operator, nor
specify individual observation numbers and a range of observations at the same time.
Accordingly, the following commands do not work.
. list state cigar lung in 7 9 18 (invalid commands)
. regress bladder cigar in 7 9/-5 (invalid commands)
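The two qualifiers can be tried on any dataset. The following is a minimal sketch written as a Do-file; it assumes the auto example dataset that ships with Stata (loaded by sysuse), not the cancer data used in this chapter, so the variable names make, price, mpg, rep78, and foreign are those of that dataset.

```stata
* Sketch only: uses Stata's bundled auto dataset, an assumption of this example
sysuse auto, clear

* if: select observations by condition; & means "and", | means "or"
summarize price mpg if foreign==1
list make price mpg if (foreign==0) & (mpg >= 25)
count if (rep78==1) | (rep78==5)

* in: select observations by position
list make price in f/5        // the first five observations
list make price in -3/l       // the last three observations

* if and in may be combined in a single command
count if mpg > 20 in 1/10
```

When both qualifiers are given, the command is applied only to observations that fall in the range and also satisfy the condition.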
2.9 Repeating a Command on Groups (.bysort and .by)
You may wish to run the same command on each group instead of the entire dataset. Let us
get the summary statistics (e.g., mean and standard deviation) of variables “cigar” and “lung”
in each area.
. sum cigar lung if area==1
. sum cigar lung if area==2
. sum cigar lung if area==3
. sum cigar lung if area==4
This approach works, but it becomes burdensome when there are many groups. This is why
the .bysort (or .bys) and .by commands are needed. The .bysort command repeats a Stata
command on each group without the if qualifier. The data need to be sorted by the group
variables. There are three equivalent ways of repeating a command group by group.
. bysort area: sum cigar lung
_______________________________________________________________________________
-> area = 1
_______________________________________________________________________________
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       cigar |         8    27.94625    2.297881      23.78       31.1
        lung |         8    21.72375    4.262283      12.11      25.95
_______________________________________________________________________________
-> area = 2
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       cigar |        12    23.70667    2.762431      19.96      27.91
        lung |        12    18.31667     3.68153      12.12       22.8
...
The .bysort command first sorts the data by the variable “area” in ascending order, and then
repeats the command on each group. Note that a colon (:) separates .bysort or .by from the
command to be repeated.
. by area, sort: sum cigar lung
The .by command above gives the same result as the first command. You may omit
the sort (or s) option if you sort the data by the variable separately, as follows.
. sort area
. by area: sum cigar lung
You can also run various analysis commands with the .bysort and .by commands.
. bys gender: regress income education age occupation
However, not every Stata command can be used with the .bysort and .by commands.
The .sktest command, for example, cannot be combined with them.
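The three equivalent forms can be compared side by side. The sketch below assumes the auto example dataset bundled with Stata (an assumption of this example, not the chapter's data), with foreign as the group variable.

```stata
* Sketch only: uses Stata's bundled auto dataset, an assumption of this example
sysuse auto, clear

* Three equivalent ways of repeating a command on each group
bysort foreign: summarize price mpg

by foreign, sort: summarize price mpg

sort foreign
by foreign: summarize price mpg

* Estimation commands can be repeated in the same manner
bysort foreign: regress price mpg weight
```

After any of these commands the data remain sorted by foreign, so later .by commands on the same variable need no further sorting.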
When writing a .do or .ado program, you may need to repeat a set of commands. See
Chapter 5, Stata Programming, for looping commands (i.e., .while, .forvalues, .foreach) and
the .if command.
2.10 Using Explicit Subscripts
You may wish to refer to individual observations of variables. For example, “What is the value
of cigar in the tenth observation?” You may add a subscript enclosed in brackets to a
variable name as follows.
. display cigar[10]
20.1
The first command below creates the variable “cigar2” and then copies the value of “cigar” in the
10th observation. That is, the particular value 20.1 is copied to the variable “cigar2” for all
observations. The first differs from the second in that the latter copies each value of “cigar” to a
new variable “cigar3.”
. generate cigar2=cigar[10]
. generate cigar3=cigar
This feature of explicit subscripts enables users to easily create a variable that holds
observation numbers using the system variable _n. It is also straightforward to generate a
lagged variable if you use _n-1 as a subscript.
. gen serial=_n
. gen cigar_lag=cigar[_n-1]
However, a subscript that is zero, negative, or larger than _N (i.e., the total number of
observations) results in a missing value.
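The behavior of subscripts, including the missing value produced by an out-of-range subscript, is easy to verify. The sketch below assumes the auto example dataset bundled with Stata; the new variable names obs_no, price_lag, and price_1st are hypothetical.

```stata
* Sketch only: uses Stata's bundled auto dataset; new variable names are hypothetical
sysuse auto, clear

display price[1]                  // value in the first observation
display price[_N]                 // value in the last observation

generate obs_no    = _n           // observation number
generate price_lag = price[_n-1]  // price of the previous observation
generate price_1st = price[1]     // the same value in every observation

list make price price_lag price_1st in 1/3

* The first observation has no predecessor (subscript 0), so price_lag is missing there
count if missing(price_lag)
```

Because the lag is taken by position, the result depends on the current sort order of the data.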
2.11 Using the .display Command
The .display (or .di) command displays strings and values of various expressions. The first
example below displays the values of system variables _pi and _cons.
. display _pi " and " _cons
3.1415927 and 1
The .display command can list values of variables using explicit subscripts mentioned in the
previous section.
. display "The Cigar Consumption of " state[12] " State is " cigar[12]
The Cigar Consumption of IN State is 26.18
You can use Stata as a calculator using the .display command. Consider the following
examples. The three commands return 78.5, .02210445, and .44271887, respectively.
. display 5*5*3.14
. display (1.3)^(1/12)-1
. di (6.4-5.0)/sqrt(10)
You can also get p-values without referring to probability distribution tables. See Section 5
for detailed probability functions.
. di norm(1.96)
. di (1-norm(1.96))*2
The norm(z) function returns the cumulative probability of the standard normal distribution. So the
second command gives you the p-value of the z score 1.96 for a two-tail test. The above
commands return .9750021 and .04999579, respectively. Note that recent versions of Stata
document this function as normal(); norm() is its older name.
. di ttail(20, 2.086)
. di ttail(20, 2.086)*2
The ttail(df, t) function returns the reverse cumulative (upper-tail only) Student’s t distribution.
Because it already returns the upper-tail probability, the second command simply doubles it to
give the p-value of the t value 2.086 with 20 degrees of freedom for a two-tail test. Thus, the
above commands give you .02499818 and .04999636, respectively.
. di F(5, 10, 3.325)
. di Ftail(5, 10, 3.325)
. di Ftail(5, 10, 3.325)*2
The F(df1, df2, F) function returns the cumulative F distribution, while Ftail(df1, df2, F) returns
the reverse cumulative (upper-tail only) F distribution. Note that the F is uppercase and that the
first and second degrees of freedom are those of the numerator and denominator, respectively.
The second command gives the upper-tail p-value of the F value 3.325, and the third doubles it
for a two-tail test. The three commands above return .94996612, .05003388, and .10006777,
respectively.
. disp chi2(10, 18.307)
. disp chi2tail(10, 18.307)
Similarly, chi2(df, c) returns the cumulative chi-square distribution, while chi2tail(df, c)
gives you the reverse cumulative (upper-tail) chi-square distribution. The commands give
you .94999941 and .05000059, respectively.
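The probability functions above can be combined with scalars so that a test statistic is computed once and reused. The following is a minimal sketch, not a prescribed workflow: the scalar names zstat and tstat are hypothetical, and normal() is used as the current documented name of norm().

```stata
* Sketch only: scalar names are hypothetical; normal() is the newer name of norm()
scalar zstat = (6.4 - 5.0)/sqrt(10)
display "z statistic           = " zstat
display "two-tail p (normal)   = " 2*(1 - normal(abs(zstat)))

scalar tstat = 2.086
display "two-tail p (t, df=20) = " 2*ttail(20, abs(tstat))

display "upper-tail p (F)      = " Ftail(5, 10, 3.325)
display "upper-tail p (chi2)   = " chi2tail(10, 18.307)
```

Taking the absolute value before doubling keeps the two-tail p-value correct when the statistic is negative.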
2.12 Using the .format Command
The .format command specifies the format in which variables are displayed. This command
does not affect the actual values of variables. When a variable is copied, its format is also copied.
You may check the current display format of each variable by executing the .describe command,
which shows variable names, types, formats, and labels.
. describe
In general, a format begins with %, followed by a number (the total display width), a
period, a number (the number of digits after the decimal point), and letters indicating the type
of format.
Let us add commas to variables in order to make the numbers more readable. In the following
example, the first number “10” sets the total width of the output, including the decimal
point and commas, while the second number “2” sets the number of digits after the decimal
point. The letters “f” and “c” mean “fixed format” and “comma format,” respectively.
. format gnp2 gdp2 %10.2fc
. list gnp gnp2 gdp gdp2
     +-----------------------------------------------+
     |      gnp         gnp2        gdp         gdp2 |
     |-----------------------------------------------|
  1. | 1600.929     1,600.93    3420.02     3,420.02 |
  2. | 251.0714       251.07   3559.387     3,559.39 |
  3. |      469       469.00   3569.177     3,569.18 |
  4. | 227.7857       227.79   3910.404     3,910.40 |
  5. | 339.8571       339.86   4649.005     4,649.00 |
     ...
Note that “gnp” and “gdp” are displayed in their default format. The following is an example
of a numeric format without any digit below the decimal point.
. format l* %5.0f
If you wish to pad numbers with leading zeros, add “0” right after the %. Note that the wildcard *
and the range operator - are used to list variables efficiently.
. format cigar-kidney %010.2f
Now, you may want string variables to be left-justified. Use the “-” and “s” to indicate
“left-justified format” and “string format,” respectively. Again, the “15” indicates the total number
of characters of the variables to be displayed.
. format last_name first_name %-15s
You may take the “-” out in order to get back to the default right-justified format. But do not use
“+.”
. format last_name first_name %15s
For detailed formats, run the .help format command.
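The effect of a display format can be checked without the gnp and gdp variables. The sketch below assumes the auto example dataset bundled with Stata; it also uses a format directly inside .display, which this section does not otherwise cover.

```stata
* Sketch only: uses Stata's bundled auto dataset, an assumption of this example
sysuse auto, clear

describe price gear_ratio make    // check the current display formats

format price %10.2fc              // width 10, two decimals, commas
format gear_ratio %6.1f           // width 6, one decimal
format make %-20s                 // left-justified string, 20 characters

list make price gear_ratio in 1/5

* A format changes only the display; the stored value is untouched
display price[1]

* A format may also be given directly to the display command
display %12.3fc 1234567.891
```

Rerunning .describe afterward shows the new formats in its format column.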
2.13 Handling Missing Values
The missing value of a numeric variable is denoted by a single period (.). In a string variable,
a missing value is expressed as an empty string (“”). Any arithmetic operation on a missing value
results in a missing value.
You may wish to exclude missing values using the if qualifier. Because Stata treats a numeric
missing value as larger than any nonmissing number, you may ask whether a variable is less
than a period (.).
. sum cigar lung kidney if cigar<.
. list cigar lung kidney if (cigar==.) | (lung>.) | (kidney>=.)
The first command above produces summary statistics of those observations whose variable
“cigar” is not missing. The second lists the values of the three variables when any of its
conditions holds. Note that the three relational expressions are not fully equivalent: ==. matches
only the system missing value (.), >. matches only the extended missing values (.a through .z),
and >=. matches both. Use >=. to detect every kind of missing value.
You may want to detect observations that have missing values in any of the variables specified.
The .mark and .markout commands are useful for flagging such observations.
The former command creates a marking (dummy) variable to be used by the latter. The .markout
command sets the marking variable created by the .mark command to 0 if an observation
has a missing value in any of the variables listed.
. mark yn_miss // to create a marking variable (dummy)
. markout yn_miss cigar lung kidney // to set 0 if any of the variables is missing
. tab yn_miss, missing // to double-check flagging marks
. drop if yn_miss==0 // to drop observations with missing values
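The marking procedure can be tried on data that actually contain missing values. The sketch below assumes the auto example dataset bundled with Stata, in which rep78 is missing for some cars; the marking variable name complete is hypothetical, and the missing() function is an addition not discussed above.

```stata
* Sketch only: uses Stata's bundled auto dataset; "complete" is a hypothetical name
sysuse auto, clear

* Two ways of counting missing values in one variable
count if rep78 >= .
count if missing(rep78)

* Flag observations that are complete on all three variables
mark complete                       // 1 for every observation
markout complete price mpg rep78    // reset to 0 if any of the three is missing
tabulate complete, missing

* Use the flag as a condition instead of dropping observations
summarize price if complete
```

Keeping the flag and using it in an if qualifier leaves the original data intact, whereas .drop removes the observations permanently from memory.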
2.14 Using Comments
Comments in Do-files are very useful for documenting the files. Comments can also be
used to debug Do-files. Stata offers three ways of commenting. An asterisk (*) and a
double slash (//) put a comment on a single command line, while /*…*/, as in C and Java, can
enclose multiple lines of comments.
No command may come before *, whereas // may follow a command on the same line. The //
does not affect the command preceding it. If an asterisk is placed in front of a command, Stata
simply ignores the command. Consider the following examples.
. * This document is for statistical and econometric data analyses.
. recode year (1 2=0 ) (3 4=1), gen(class) // recoding to low and upper classes
. *recode year (1 2=0 ) (3 4=1), gen(class)
Note that the // does not work in the command window (interactive mode), but works only in
a Do-file.
Sometimes you may wish to put detailed information longer than a single line in a Do-file. As
in SAS, use /* … */ in this case. Any command or comment between /* and */ is ignored
when Stata executes the Do-file.
/*
This Do-file is to recode several key variables (recode_01142004.do)
Date: Wednesday, January 14, 2004
*/
use project003.dta, clear
...
2.15 Macros (.global and .local)
Like SAS/Macro, Stata macros enable you to use programming variables in do and ado programs.
As such, users can reduce human errors and tedious typing. A macro consists of a macro
name and its content. Once a macro is called, its content is substituted for the macro name. A
macro can hold a string or a number. Macros are either global or local.
. global js = 625 // to declare a numeric macro
. local fruit="Grape Pear Apple" // to declare a string macro
Local macros, which are used in most cases, exist only within the program or module in
which they are declared. A local macro is called by its name enclosed in a left quote (`) and a
right quote ('). Note that the left quote is typed with the ` key (the same key as the tilde, ~).
A global macro needs a $ in front of the macro name. You may use {} to clarify meaning or to
form nested constructions.
. di "`fruit'"
. local fruit=$js
. gen ph$js=id // equal to ph625=id
Macros are used in both expressions and commands.
. local LHS "gnp"
. local RHS "interest consume inflate"
. regress `LHS' `RHS'
If you want to see the list of macros declared, type in the .macro list command. The .macro
drop command removes the macro specified. In these commands a local macro is referred to
with a leading underscore (e.g., _fruit).
. macro list
. macro drop _fruit
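Local and global macros can be seen working together in a short Do-file. The sketch below assumes the auto example dataset bundled with Stata; the macro names depvar, indepvars, and cutoff are hypothetical.

```stata
* Sketch only: uses Stata's bundled auto dataset; macro names are hypothetical
sysuse auto, clear

local depvar "price"                  // local macro holding a variable name
local indepvars "mpg weight foreign"  // local macro holding a variable list
global cutoff = 25                    // global macro holding a number

* Double quotes are needed to display the content as a string
display "`depvar' on `indepvars'"

* Macros are substituted before the command is executed
regress `depvar' `indepvars' if mpg < $cutoff

* Braces mark where the macro name ends; this creates the variable mpg25
generate mpg${cutoff} = (mpg >= $cutoff)

macro list _depvar cutoff             // a local is listed with a leading underscore
macro drop cutoff                     // remove the global macro
```

Run the lines together as one Do-file, since a local macro exists only in the program or Do-file that declares it.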
2.16 Looping (.foreach and .forvalues)
If you wish to repeat a set of commands, take advantage of looping structures such
as the .foreach, .forvalues, and .while commands (see footnote 3). The .foreach command
executes a set of commands enclosed in braces for each element of the macro, variable list,
or number list specified.
Let us list numbers from 1 to 100. The keyword numlist indicates that what follows is a list of
numbers. You may also list all the numbers explicitly as 1 2 3 4 … 100.
. foreach n of numlist 1/100 {
disp `n'
}
Alternatively, you may benefit from using the .forvalues command, if repetition is not
determined by macros and variables, but by numbers. The following .forvalues command
works in exactly the same manner as the above.
. forvalues n= 1/100 { // from 1 through 100 in step of 1
disp `n'
}
3 The usage of the .foreach command is quite similar to that in PHP and Perl.
You may use three alternative ways of specifying the range of numbers, which can equally be
used in the numlist of the .foreach command. Note that .forv is an abbreviation of .forvalues.
forvalues n= 1(1)100 { // from 1 through 100 in step of 1
forv n= 1 2 : 100 { // from 1 through 100 in step of 2-1
forv n= 1 2 to 100 { // from 1 through 100 in step of 2-1
Let us move on to variables. The following commands produce identical results.
foreach var in cigar bladder lung kidney leukemi {
sum `var' if area==1
}
foreach var of varlist cigar-leukemi {
sum `var' if area==1
}
What is the difference between the in subcommand and the of subcommand? The former
takes a general list of values, variables, or macros as typed, while the latter must specify the
type of list (i.e., global, local, varlist, newlist, or numlist). Thus, varlist in the second command
should not be omitted. Consider the following commands, which create five random variables.
foreach var of newlist random1-random5 {
gen `var' = uniform()
}
foreach var in random1 random2 random3 random4 random5 {
gen `var' = uniform()
}
Note that in the first command the newlist to create new variables cannot be omitted and the
usage of “random1-random5” is not allowed in the second command.
Now, it is time for macros. The following three commands produce identical results. The
double quotes around `str' cannot be omitted since the values are strings.
local fruit "Grape Pear Apple"
foreach str of local fruit {
di "`str'"
}
foreach str in `fruit' {
di "`str'"
}
foreach str in "Grape" "Pear" "Apple" {
di "`str'"
}
Note that the macro name in the first command is not enclosed with single quotes, whereas it
was in the second command.
For information about the .while loop command and the .if conditional command,
see Chapter 5, Stata Programming.
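Loops become more useful when the loop variable is used to build variable names or to accumulate a result. The sketch below assumes the auto example dataset bundled with Stata; the names mpg_pow1 through mpg_pow3 and the macro total are hypothetical, and r(mean) is the mean stored by .summarize, which is not discussed in this section.

```stata
* Sketch only: uses Stata's bundled auto dataset; new names are hypothetical
sysuse auto, clear

* foreach over a variable list: report the mean of each variable
foreach var of varlist price mpg weight {
    quietly summarize `var'
    display "`var': mean = " r(mean)
}

* forvalues: use the counter to build new variable names
forvalues i = 1/3 {
    generate mpg_pow`i' = mpg^`i'
}

* forvalues: accumulate the sum 1 + 2 + ... + 100 in a local macro
local total = 0
forvalues i = 1/100 {
    local total = `total' + `i'
}
display `total'    // 5050
```

As with the macro examples, run the block as one Do-file so that the local macros remain defined throughout.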
2.17 Using Operating System Commands
The following table summarizes the useful operating system commands available in Stata.
Command          Meaning                                  Examples
.cd (or pwd)     to change a directory                    . cd ..\data
                                                          . cd ~/data
.copy            to copy files                            . copy a.dta b.dta
.dir (or ls)     to list directories and files            . dir *.dta
                                                          . ls ~/data/*.do
.erase (or .rm)  to remove files                          . erase ..\temp.dta
                                                          . rm ../data/temp.dta
.mkdir           to create a directory                    . mkdir cancer
.shell           to invoke operating system temporarily   . shell
.type            to view contents of a text file          . type cancer.dct
* Note that the .pwd and .rm respectively work only in Stata for Mac OS and UNIX.
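Several of these commands can be chained in a Do-file to set up and clean a working directory. The following is a sketch under stated assumptions: the directory stata_demo and the file names are hypothetical, the auto example dataset bundled with Stata supplies something to save, and c(pwd) and .rmdir are additions not listed in the table above.

```stata
* Sketch only: directory and file names are hypothetical
display "`c(pwd)'"                  // show the current working directory

mkdir stata_demo                    // create a directory
cd stata_demo                       // move into it

sysuse auto, clear
save auto_copy.dta, replace         // write a data file
copy auto_copy.dta auto_backup.dta, replace
dir *.dta                           // list the two data files

erase auto_backup.dta               // remove the files
erase auto_copy.dta
cd ..                               // go back to the parent directory
rmdir stata_demo                    // remove the now empty directory
```

Forward slashes in paths are accepted by Stata on every platform, which keeps Do-files portable between Windows and UNIX.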