STATA - workshop nov2011

Ann Arbor ASA
“Up and Running” Series:
Intro Stata
Prepared by volunteers of the Ann
Arbor Chapter of the American
Statistical Association
November 29, 2011
Agenda
• Why Stata?
• The Stata interface
• The Stata mindset
• data
• logging
• issuing commands via menus
• understanding command syntax
• Data management
• Descriptive statistics and estimation
• Graphing
• Adding user-written commands
• .do files
Ann Arbor ASA (Up and Running): Stata Intro
Why Stata
• General purpose, cross-platform package like R or SAS
• Command line interface combined with point-and-click
menus
• Intuitive and standardized command syntax that is well-documented with formulas, examples and references
• Many advanced user-written commands
• Easy to write your own code that is pretty fast
• Excellent corporate tech support and user community
Which Stata: MP, SE, IC or Small
• Stata is not sold in pieces; every flavor has the same
commands
• Most flavors available for 32- and 64-bit Windows, Mac,
and Unix/Linux platforms
• Stata/IC (Intercooled) can handle up to 2,047 variables
• Stata/SE (Special Edition) can handle up to 32,767
variables. Also allows longer string variables and larger
matrices
• Stata/MP has the same limits as Stata/SE, but is faster on multicore
and multiprocessor computers
• Small Stata is intended for students and is limited to
analyzing data sets with a maximum of 99 variables and
1,200 observations
• All of these versions can read each other’s files within
their size limits
The Stata Interface
• Results window: All output appears here, except for
graphs which will appear in a separate window. Note that
output is not automatically saved to a file
• Command window: Enter commands here interactively
• Variables window: All variables in the current dataset are
listed here. Clicking on a variable sends its name to the
command window
• Review window: Previously issued commands are listed
here and can be reissued by clicking on them.
• Buttons: Shortcuts for many common commands such
as log, browse, edit, etc.
• Menus: Convenient for learning Stata command syntax,
but time consuming
• Look and feel is customizable
Lab 1A
• Use the Stata File menu to open the
example dataset, auto.dta
The Stata Mindset
• Data
• Logging
• Issuing commands from menus
• Understanding command syntax
Data
• Stata reads an entire dataset into memory.
This is a fundamental difference from
other stat packages such as SAS and
SPSS
• Only one dataset at a time in a Stata
session
• This is why there are flavors of Stata – IC,
SE, Small
Reading data into memory
• Use the menus
– File, Open
• Use the command window
use "C:\...\sample.dta", clear
use sample.dta, clear
• Use the File Open button
• All methods produce the same result
Saving data
• Use the menus
– File, Save (or Save As…)
• Use the command window
save "C:\...\sample.dta" [, replace]
• Use the Save button
• All methods produce the same result
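As a rough illustration of the load/save round trip from the command window, the sketch below loads the auto dataset that ships with Stata via sysuse, saves a copy, and reads it back. The file name auto_copy.dta is hypothetical; it is written to the current working directory.

```stata
* Load the example dataset shipped with Stata, replacing whatever is in memory
sysuse auto, clear

* Save a copy; replace overwrites an existing file of the same name
* (auto_copy.dta is a hypothetical file name)
save "auto_copy.dta", replace

* Read the copy back into memory and take a quick look
use "auto_copy.dta", clear
describe, short
```

Without the clear option, use refuses to load a new dataset when the data in memory have unsaved changes.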
Logging
• Stata does not automatically write output to a file!
• You can do this by starting a log file at the start of your
analysis, and closing it at the end
• Use the menus
– File, Log
• Use the command window
log using "C:\...\analysis1.log"
• Use the Log button
• All methods produce the same result
• Logs can be created, replaced, suspended, resumed,
and appended
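The log life cycle described above might look like this in the command window. This is a minimal sketch: the log is written to the working directory rather than to the elided path on the slide, and the text option is used here so the log is plain text.

```stata
* Close any log that is already open (capture ignores the error if none is)
capture log close

* Start a plain-text log, overwriting an earlier file of the same name
log using "analysis1.log", replace text

sysuse auto, clear
summarize price

log off      // suspend logging
describe     // this output is not written to the log
log on       // resume logging

log close
```

To add to an existing log instead of overwriting it, use the append option in place of replace.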
Lab 1B
• Use the Stata menus to:
– change the color scheme
– change the working directory to your desktop
lab folder
– start a log file called “labs.log” in your
desktop lab folder
– save the example auto.dta dataset to your
desktop lab folder
Issuing Commands from Menus
• Menus are great for:
– Familiarizing yourself with Stata’s capabilities,
both big picture and command-specific
– Getting context-sensitive help
– Learning Stata command syntax
• The downside:
– time-consuming, especially for repetitive tasks
– not all functionality available through the
menus!
Lab 2
• To get a codebook for the auto.dta
dataset, use the following menu path:
Data, Describe data, Describe data
contents (codebook)
• You will see the codebook dialog. Inspect
it closely…
Anatomy of a Dialog Box
[Screenshot: dialog box with callouts]
• The Stata command (keyword) that will be submitted
• Multiple tabs
• Help, Reset, and Copy Command
• Submit and leave dialog open
• Submit and close dialog
Anatomy of a Dialog Box
[Screenshot: dialog box with callouts]
• Use if/in to filter rows
• Specify logical condition
• Specify row #s
Anatomy of a Dialog Box
[Screenshot: dialog box with callout]
• Command options available on additional tabs
Understanding Command Syntax
• The general syntax for all Stata commands is:
[prefix:] cmdname [varlist] [=exp] [if exp] [in exp]
[weight] [using filename] [, options]
• Elements in square brackets are optional for
some commands
• Sometimes cmdname is all that is required, for
example, codebook or describe
• The underlined portion of cmdname is shorthand
for the command
• Stata is case sensitive
Understanding Command Syntax:
cmdname
• cmdname is Stata’s keyword for a
command. Examples:
generate
replace
drop
regress
logistic
logit
scatter
graph bar
graph box
• Enter cmdname exactly as indicated,
taking care to use the proper case (usually
lower case for commands)
Understanding Command Syntax:
varlist
• You can apply the command to particular variables by
specifying a varlist
• Order of variables matters; can use hyphen to indicate a
series of variables in order as in:
codebook x1-x20
• Use wildcard notation for shorthand, such as
codebook x*
• Use _all to apply command to all variables
• Remember that Stata is case sensitive! Variables
gender and Gender are two different things to Stata
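The varlist shorthands above can be tried on the auto dataset that ships with Stata. This is only a sketch; the comments assume the variable order and names of that dataset.

```stata
sysuse auto, clear

* Hyphen: every variable from price through weight, in dataset order
describe price-weight

* Wildcard: every variable whose name starts with m (make and mpg here)
describe m*

* _all: every variable in the dataset
summarize _all
```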
Understanding Command Syntax:
=exp
• exp is short for expression
• exp is used by data management commands
such as generate and replace
• For example, to create a constant variable x
equal to 1, use:
generate x=1
• You can also use functions this way:
gen x2 = x^2
gen x_sq = x*x
gen logx=ln(x)
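For a version of these expressions that runs as-is, here is a small sketch on the auto dataset. The new variable names (one, weight_sq, log_price) are made up for illustration.

```stata
sysuse auto, clear

* A constant
generate one = 1

* Arithmetic and functions on existing variables
generate weight_sq = weight^2
generate log_price = ln(price)

* generate creates a new variable; replace changes an existing one
replace one = 2

summarize one weight_sq log_price
```

generate fails if the variable already exists, and replace fails if it does not, which guards against overwriting data by accident.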
Understanding Command Syntax:
if/in exp
• Without any options, commands apply to all observations/variables in the
dataset
• To filter observations, use the if exp clause:
codebook if (x==2 & z>=3) | w==2
• Note the parentheses!
• Also note the difference between = and == (assignment and test for
equality, respectively)
gen x=1 if y==2
list if gender=="F"
• Conditional operators in Stata are
== (equal to)           != (not equal to)
> (greater than)        >= (greater than or equal to)
< (less than)           <= (less than or equal to)
& (and)                 | (or)
• Use in exp to refer to particular row numbers in the dataset:
list in 1/10
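The if and in qualifiers can be tried out on the auto dataset. This is a sketch, and the cutoffs are arbitrary values chosen for illustration.

```stata
sysuse auto, clear

* if with a compound condition; parentheses make the grouping explicit
list make price mpg if (mpg >= 25 & price < 5000) | rep78 == 5

* == tests equality; = would be an assignment and is an error here
count if foreign == 1

* in picks observations by position in the current sort order
list make price in 1/5
```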
A Brief but Critical Detour: Missing
Data
• While we are talking about selecting cases using
an if exp clause, it is important to note that
Stata treats a missing numeric value as larger than
any nonmissing number
• Stata represents missing numeric values with
a dot
• Keep this in mind when filtering cases based on
a numeric variable:
replace hieduc = 1 if x>3 (potential problem)
replace hieduc = 1 if x>3 & x<. (playing it safe)
replace hieduc = 1 if x>3 & x!=. (playing it safe, unless extended missing values .a to .z are present)
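To see the effect on real data, the sketch below counts cases in the auto dataset, where rep78 has some missing values. The cutoff of 3 is arbitrary.

```stata
sysuse auto, clear

* Missing values sort above every number, so this count includes them
count if rep78 > 3
local naive = r(N)

* Excluding missing values explicitly
count if rep78 > 3 & rep78 < .
local safe = r(N)

* missing() is an equivalent, more readable test
count if rep78 > 3 & !missing(rep78)

* The difference is the number of observations with rep78 missing
display `naive' - `safe'
```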
Understanding Command Syntax:
weight
• Most Stata commands can deal with weighted data,
where the weight is a variable in the dataset
• You need to specify the type of weight and the weight
variable, using brackets, as in:
summarize x [iweight=weightvarname]
• Four types of weights:
– Frequency fweights, for replicated data
– Probability pweights, for observations sampled with unequal
probability of selection
– Analytic aweights, for data containing averages where the
average is weighted by the # obs used in calculating the average
– Importance iweights, defined by the specific command
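As a small worked example of frequency weights, the sketch below types in a made-up dataset in which each row stands for freq identical observations. The values and the variable names x and freq are invented for illustration.

```stata
clear
input x freq
1 3
2 1
5 2
end

* Each row counts freq times: equivalent to the six values 1 1 1 2 5 5
summarize x [fweight=freq]

display r(N)      // number of observations after weighting
display r(mean)   // weighted mean: (1*3 + 2*1 + 5*2)/6
```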
Understanding Command Syntax:
using filename
• Some commands read in data from external files, or
write to files
• These commands contain a using clause, in which the
path and filename appear
• Merging two datasets together is an example:
use "C:\…\master_data.dta", clear
merge 1:1 id using "C:\...\using_data.dta"
• This performs a 1:1 match using the key variable, id
(merge adds new variables). One-to-many (1:m) and many-to-one (m:1) merges are also
possible
• Similarly, to stack datasets:
use "C:\..\one.dta", clear
append using "C:\...\two.dta"
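A self-contained sketch of merge and append follows, using two tiny made-up datasets and a temporary file in place of the elided paths above. The variables and values are invented, and because tempfile names are local macros, the lines need to be run together (for example from a .do file).

```stata
* Build the "using" dataset and park it in a temporary file
clear
input id wage
1 500
2 550
3 490
end
tempfile wages
save `wages'

* Build the "master" dataset in memory
clear
input id age
1 30
2 41
3 25
end

* 1:1 merge on the key variable id; adds wage to the data in memory
merge 1:1 id using `wages'
tabulate _merge        // shows which observations matched

* append stacks observations instead of adding variables
drop _merge
append using `wages'
list
```

merge creates the variable _merge to record where each observation came from, so it has to be dropped or renamed before a second merge.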
Understanding Command Syntax:
prefix:
• Prefix commands operate on other Stata
commands. One common prefix is
bysort:
bysort gender: summarize wage
• The bysort prefix sorts and stratifies the
summarize command by the gender variable
• The bysort prefix is also very handy in a data
management context, for example, when computing group-level aggregates:
bysort gender: egen avg_wage = mean(wage)
• Not all commands permit the use of all or even
any prefixes
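The same pattern can be run on the auto dataset, with foreign standing in for gender and price for wage. This is a sketch; avg_price and n_in_group are illustrative names.

```stata
sysuse auto, clear

* Sort by foreign, then run summarize separately within each group
bysort foreign: summarize price

* Attach the group mean to every observation in the group
bysort foreign: egen avg_price = mean(price)

* Within a by group, _N is the number of observations in that group
by foreign: generate n_in_group = _N

list make foreign price avg_price n_in_group in 1/5
```

Unlike collapse, egen keeps every original observation and repeats the group statistic on each row.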
Understanding Command Syntax:
Where to get HELP
• If you know the name of a command, enter
help cmdname
• If you don’t know it, enter
findit word1 [word2]…
• This queries a keyword database and some of the official
internet sources (such as Stata FAQs, Stata Journal
articles)
• Google
• Email or call Stata Technical Services (really!)
• Statalist archives
• Email CSCAR Stata support at stata.help@umich.edu if
you are affiliated with the U-M as a grad student, staff or
faculty member
Lab 3
• Enter the appropriate commands in the command
window (no menus!):
– open the auto.dta dataset, clearing out what is in memory
– describe the dataset
– get the codebook for the first 5 variables in the dataset
– list out the first 10 observations
– try out the browse command
– browse the cases where price is greater than 5000 (but not
missing)
– summarize the price variable where foreign==0 (for domestic
cars)
– use the bysort prefix to summarize the price variable by levels
of the foreign variable
Data Management Commands
• We’ve already seen quite a few
use, save            //open/save data
browse, list         //view data
codebook, describe   //10,000 ft view
gen, egen, replace   //create/replace vars
merge, append        //merge/stack datasets
• Next up:
– importing
– exporting
– aggregating
– keeping/dropping
Data Management Commands:
importing files
• use reads Stata formatted (.dta) datasets.
• For data created in another software package:
– save the data in an Excel file, then use the import excel
command (new with Stata 12)
– save the data in a comma separated values file (.csv), or a
delimited file, then use the insheet command
– use the other package to save the data in .dta format (SPSS 17+
and SAS 9.2 can do this)
– use StatTransfer to convert the file to .dta
• .dta, delimited, and .csv files are the simplest file types to
get into Stata
• Stata will also import data in other formats, but it’s not
always straightforward
• To import a .csv file:
insheet using "C:\...\new_data.csv", comma clear
Data Management Commands:
exporting files
• save saves the data in .dta format
• To make the data usable by other software
packages:
– export the data to a comma separated values file
(.csv), or a delimited file using outsheet
– use the other package to open the .dta file and save it
in another format (SPSS 17+ and SAS 9.2 can do
this)
– use StatTransfer to convert the file from .dta to
something else
• To export data to a .csv file:
outsheet using "C:\...\out_data.csv", comma
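A .csv round trip with outsheet and insheet might look like the sketch below. The file name auto_export.csv is hypothetical and is written to the working directory. In Stata 13 and later, export delimited and import delimited supersede these two commands, although the older ones still run.

```stata
sysuse auto, clear

* Write a comma-separated file; replace overwrites an existing file
outsheet using "auto_export.csv", comma replace

* Read it back, clearing the data in memory first
insheet using "auto_export.csv", comma clear

describe, short
```

By default outsheet writes value labels rather than the underlying codes, so foreign comes back as a string variable; add the nolabel option to outsheet to export the numeric codes instead.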
Data Management Commands:
aggregating files
• It is a common exercise to aggregate data, or to make a dataset of
summary statistics
• Use the collapse command:
collapse (mean) mn_wage=wage (count) count=id, by(gender)
to turn data like this:
id   gender   wage
1    M        500
2    M        550
3    M        490
4    F        505
5    F        410
into this:
gender   count   mn_wage
M        3       513.33
F        2       457.50
• Use collapse to produce counts, means, medians, percentiles,
extrema, and standard deviations of your data.
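The five-row example above can be reproduced directly. One assumption in this sketch: it counts id rather than gender, since to my understanding (count) expects a numeric variable and a by() variable cannot also be collapsed.

```stata
* Enter the five observations by hand; gender is a 1-character string
clear
input id str1 gender wage
1 M 500
2 M 550
3 M 490
4 F 505
5 F 410
end

* One row per gender: mean wage and number of nonmissing ids
collapse (mean) mn_wage=wage (count) count=id, by(gender)

list
```

collapse replaces the data in memory, so save the original dataset first if it is still needed.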
Data Management Commands:
keep/drop
• To throw away variables, use
keep varlist
drop varlist
• To get rid of particular observations,
add an if or in clause with no
varlist:
drop if x==3
keep in 1/100
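Both uses, dropping variables and dropping observations, are sketched below on the auto dataset. The cutoff and the row range are arbitrary.

```stata
sysuse auto, clear

* Keep four variables and discard the rest
keep make price mpg foreign

* Drop observations by condition (guarding against missing values)
drop if price > 10000 & price < .

* Keep observations by position
keep in 1/20

describe, short
```

These changes affect only the copy in memory; the file on disk is untouched until you save.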
Lab 4
• Import the “auto.csv” dataset from your desktop
lab folder
• Save the file in your desktop lab folder as
“auto1.dta”
• Aggregate the dataset by levels of foreign,
obtaining the mean and median for price and
mpg
• Drop the median price and median mpg
variables
• Export the aggregated dataset to a .csv file in
your desktop lab folder
Descriptive Statistics and
Estimation
• We’ve already seen
summarize
• Next up:
– summarizing (with detail)
– tabulating
– estimation (modeling)
– post-estimation
Descriptive Statistics and Estimation:
summarizing with detail
• summarize gives descriptive statistics for
numeric variables
• Use the detail option to get additional
descriptive statistics
sum x1, detail
• summarize without a varlist will
summarize all numeric variables in the
dataset
Descriptive Statistics and Estimation:
tabulating
• tabulate gives one- and two-way tables
for categorical variables
• For two-way tables, use the chi2, row, and col options
to get a chi-square test, row %, column %
tab race
tab race treatment, chi2 col
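The summarize and tabulate commands can be tried on the auto dataset, with rep78 and foreign standing in for race and treatment. This is a sketch; the choice of variables is only for illustration.

```stata
sysuse auto, clear

* Percentiles, skewness, kurtosis, and extreme values
summarize price, detail
display r(p50)          // the median, left behind in r()

* One-way table of frequencies
tabulate rep78

* Two-way table with row and column percentages and a chi-square test
tabulate rep78 foreign, chi2 row col
```

Most commands leave their results in r(); type return list after a command to see what is available.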
Descriptive Statistics and Estimation:
estimation (modeling)
• Most estimation commands have the same
syntax
cmdname yvar(list) xvarlist [,options]
• Common estimation commands are
regress           //OLS
logit, logistic   //logistic
mlogit            //multinomial
ologit            //ordinal
poisson           //poisson
xtmixed           //mixed
• Example:
reg y x1 x2 x3
Descriptive Statistics and Estimation:
post-estimation
• After you get your estimates you can obtain predictions:
predict yhat1 if e(sample)
predict yhat2
predict resid, residuals
• Adjusting the estimated covariance matrix is
straightforward:
reg y x1 x2 x3, robust
reg y x1 x2 x3, cluster(clustervar)
• Testing hypotheses about parameters:
test x1=3
• Hypotheses can also be nonlinear (see testnl) and involve
combinations of parameters
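An end-to-end sketch of estimation followed by post-estimation on the auto dataset is shown below. The model itself (price on mpg, weight, and foreign) is arbitrary and chosen only to have something to predict and test.

```stata
sysuse auto, clear

* Fit an OLS model
regress price mpg weight foreign

* Fitted values for the estimation sample, and residuals
predict yhat if e(sample)
predict resid, residuals

* Same model with heteroskedasticity-robust standard errors
regress price mpg weight foreign, robust

* Wald tests on the most recent estimates
test mpg = 0
test mpg weight         // joint test that both coefficients are zero
```

predict and test always refer to the most recently fitted model, so the order of commands matters.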
Lab 5
• Using the auto.dta dataset:
– summarize the variables price and mpg
– tabulate the foreign variable
– regress price on mpg and foreign (OLS regression)
– save the predicted values in a new variable called
yhat
– save the studentized residuals in a new variable
called rstudent
Graphing
• Easily customized graphics
• Graphs can be created via menus or command
line
• Manual adjustment can be done after the graph
is generated, using the Graph Editor
• Graphs can be saved in various file formats
and/or pasted into documents
• Examples:
histogram y, normal
twoway (scatter y x) (lfit y x)
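For a runnable variant of these examples, the sketch below draws the graphs from the auto dataset and saves the last one. The title text and the file names price_mpg.gph and price_mpg.png are hypothetical; both files go to the working directory.

```stata
sysuse auto, clear

* Histogram with a normal density overlaid
histogram price, normal

* Scatterplot with a fitted regression line on top
twoway (scatter price mpg) (lfit price mpg), title("Price vs. mileage")

* Save in Stata's own format (can be reopened in the Graph Editor)
graph save "price_mpg.gph", replace

* Export to a format other programs can read
graph export "price_mpg.png", replace
```

graph save and graph export act on the current graph, which is the one drawn most recently.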
Lab 6
• Using the auto.dta dataset, create a scatterplot
of price on the y-axis, and mpg on the x-axis
• From the Graph window, start the Graph Editor.
Modify the plot titles and colors
• Save your graph as a Stata .gph file in your
desktop lab folder
• Copy the graph and paste it into a Word or
PowerPoint file
Adding User-written Commands
• You can install add-on packages, which are
user-written commands made publicly available
• You may run into these packages if you
– do a findit search
– Google
– go to Help, SJ and user-written programs
• Installation is usually as simple as clicking through
some links
• My personal most-used add-ons:
mvpatterns
gllamm
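Before installing, it can help to check whether a command is already on your system. The sketch below uses which for that and ado dir to list installed packages; the message text is only illustrative.

```stata
* which reports where a command lives; capture traps the error if it is absent
capture which mvpatterns
local rc = _rc

if `rc' != 0 {
    display "mvpatterns is not installed; try: findit mvpatterns"
}
else {
    display "mvpatterns is already installed"
}

* List the user-written packages Stata knows about
ado dir
```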
Lab 7
• Install the mvpatterns add-on package, by typing
findit mvpatterns
then click on the blue link starting with dm91
• Follow links to install
• Read the help file for mvpatterns
• Check the missing value patterns for the
variables make through rep78
• Close your log file
.do Files
• .do files are text files that contain sequences of Stata
commands (like a SAS command file, or an SPSS syntax
file)
• Create them using Stata’s .do file editor, or any text
editor.
– Copy from your Review window
– Type in the commands directly
• Saving your commands to a .do file is never a bad
idea. But use good habits:
– Comment liberally, using * or /* */ conventions
– Specify the version of Stata used
– Use set more off to opt out of Stata’s paging feature, if
appropriate
• You can run the entire .do file, or just a small part of it
• Stata will stop processing if an error is encountered
when commands from a .do file are submitted
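A .do file that follows these habits might be laid out like the sketch below. The file and log names are hypothetical, and version 12 is an assumption based on the Stata release mentioned elsewhere in this workshop; use the release you actually wrote the file under.

```stata
* analysis.do (hypothetical file name)
* Purpose: summarize and model car prices in the auto dataset

version 12              // assumption: written under Stata 12
set more off            // do not pause output at each screenful

capture log close       // close a log left open by an earlier run
log using "analysis.log", replace text

/* Load the data */
sysuse auto, clear

/* Describe and model */
summarize price mpg
regress price mpg foreign   // comments can also follow a command

log close
```

To run the whole file, type do analysis.do in the command window; to run part of it, highlight the lines in the do-file editor and execute the selection.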
Lab 8
• Open the sample.do file in your desktop lab
folder
• Can you describe what is happening in the .do
file?
• Copy all of the commands from tonight’s
session into a new .do file
• Run a small section of commands
• Run the entire file
Other Misc.
• To manage variable attributes, use the Variables
Manager.
• Type help cmdname to find out more about
these commands:
matrix    //matrix algebra
mata      //fancy matrix programming
foreach   //looping command
xt        //panel/longitudinal analysis
st        //survival analysis
svy       //analysis of complex survey data
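As a taste of one of these, here is a short foreach sketch on the auto dataset. The loop variable v and the choice of variables are arbitrary.

```stata
sysuse auto, clear

* Repeat the same commands for each variable in the list
foreach v of varlist price mpg weight {
    quietly summarize `v'
    display "`v': mean = " %9.2f r(mean)
}
```

Inside the loop, `v' (left quote, name, right quote) is replaced by the current variable name on each pass.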
Additional Resources
• Stata website, FAQs:
http://www.stata.com/support/faqs
• UCLA website
http://www.ats.ucla.edu/stat/stata/default.htm
• Christopher F. Baum’s Stata handouts
http://fmwww.bc.edu/GStat/docs/StataIntro.pdf
http://fmwww.bc.edu/GStat/docs/StataProg.pdf
http://fmwww.bc.edu/GStat/docs/StataMata.pdf
• Stata NetCourses
http://www.stata.com/netcourse/
• CSCAR workshops
http://www.umich.edu/~cscar/workshops/