Ann Arbor ASA “Up and Running” Series: Intro Stata
Prepared by volunteers of the Ann Arbor Chapter of the American Statistical Association
November 29, 2011

Agenda
• Why Stata?
• The Stata interface
• The Stata mindset
  – data
  – logging
  – issuing commands via menus
  – understanding command syntax
• Data management
• Descriptive statistics and estimation
• Graphing
• Adding user-written commands
• .do files

Why Stata
• General purpose, cross-platform package like R or SAS
• Command line interface combined with point-and-click menus
• Intuitive and standardized command syntax that is well-documented with formulas, examples and references
• Many advanced user-written commands
• Easy to write your own code that is pretty fast
• Excellent corporate tech support and user community

Which Stata: MP, SE, IC or Small
• Stata is not sold in pieces; every flavor has the same commands
• Most flavors are available for 32- and 64-bit Windows, Mac, and Unix/Linux platforms
• Stata/IC (Intercooled) can handle up to 2,047 variables
• Stata/SE (Special Edition) can handle up to 32,767 variables. It also allows longer string variables and larger matrices
• Stata/MP has the same limits, but is faster on multicore and multiprocessor computers
• Small Stata is intended for students and is limited to analyzing datasets with a maximum of 99 variables and 1,200 observations
• All of these versions can read each other’s files within their size limits

The Stata Interface
• Results window: All output appears here, except for graphs, which appear in a separate window. Note that output is not automatically saved to a file
• Command window: Enter commands here interactively
• Variables window: All variables in the current dataset are listed here.
  Clicking on a variable sends its name to the command window
• Review window: Previously issued commands are listed here and can be reissued by clicking on them
• Buttons: Shortcuts for many common commands such as log, browse, edit, etc.
• Menus: Convenient for learning Stata command syntax, but time-consuming
• Look and feel is customizable

Lab 1A
• Use the Stata File menu to open the example dataset, auto.dta

The Stata Mindset
• Data
• Logging
• Issuing commands from menus
• Understanding command syntax

Data
• Stata reads an entire dataset into memory. This is a fundamental difference from other stat packages such as SAS and SPSS
• Only one dataset can be open at a time in a Stata session
• This is why there are flavors of Stata (IC, SE, Small): they differ in how much data they can hold

Reading data into memory
• Use the menus: File, Open
• Use the command window
    use "C:\...\sample.dta", clear
    use sample.dta, clear
• Use the File Open button
• All methods produce the same result

Saving data
• Use the menus: File, Save (or Save As…)
• Use the command window
    save "C:\...\sample.dta" [, replace]
• Use the Save button
• All methods produce the same result

Logging
• Stata does not automatically write output to a file!
• To capture output, start a log file at the start of your analysis and close it at the end
• Use the menus: File, Log
• Use the command window
    log using "C:\...\analysis1.log"
• Use the Log button
• All methods produce the same result
• Logs can be created, replaced, suspended, resumed, and appended

Lab 1B
• Use the Stata menus to:
  – change the color scheme
  – change the working directory to your desktop lab folder
  – start a log file called “labs.log” in your desktop lab folder
  – save the example auto.dta dataset to your desktop lab folder

Issuing Commands from Menus
• Menus are great for:
  – Familiarizing yourself with Stata’s capabilities, both big picture and command-specific
  – Getting context-sensitive help
  – Learning Stata command syntax
• The downside:
  – time-consuming, especially for repetitive tasks
  – not all functionality is available through the menus!

Lab 2
• To get a codebook for the auto.dta dataset, use the following menu path:
    Data, Describe data, Describe data contents (codebook)
• You will see the codebook dialog.
  Inspect it closely…

Anatomy of a Dialog Box
[Screenshot of a Stata dialog box, with these callouts:]
  – The Stata command (keyword) that will be submitted
  – Multiple tabs
  – Submit and close dialog
  – Submit and leave dialog open
  – Help, Reset, and Copy Command

Anatomy of a Dialog Box
[Screenshot of a dialog tab, with these callouts:]
  – Use if/in to filter rows
  – Specify logical condition
  – Specify row #s

Anatomy of a Dialog Box
[Screenshot of a dialog tab, with this callout:]
  – Command options available on additional tabs

Understanding Command Syntax
• The general syntax for all Stata commands is:
    [prefix:] cmdname [varlist] [=exp] [if exp] [in range] [weight] [using filename] [, options]
• Elements in square brackets are optional, and not every command accepts every element
• Sometimes cmdname is all that is required, for example, codebook or describe
• In Stata's help files, the underlined portion of cmdname is the shortest abbreviation allowed for the command
• Stata is case sensitive

Understanding Command Syntax: cmdname
• cmdname is Stata’s keyword for a command
  Examples:
    generate    replace     drop
    regress     logistic    logit
    scatter     graph bar   graph box
• Enter cmdname exactly as indicated, taking care to use the proper case (usually lower case for commands)

Understanding Command Syntax: varlist
• You can apply the command to particular variables by specifying a varlist
• Order of variables matters; you can use a hyphen to indicate a series of variables in dataset order, as in:
    codebook x1-x20
• Use wildcard notation for shorthand, such as
    codebook x*
• Use _all to apply the command to all variables
• Remember that Stata is case sensitive!
  Variables gender and Gender are two different things to Stata

Understanding Command Syntax: =exp
• exp is short for expression
• exp is used by data management commands such as generate and replace
• For example, to create a constant variable x equal to 1, use:
    generate x=1
• You can also use operators and functions this way:
    gen x2 = x^2
    gen x_sq = x*x
    gen logx=ln(x)

Understanding Command Syntax: if/in exp
• Without an if or in qualifier, commands apply to all observations in the dataset
• To filter observations, use the if exp clause:
    codebook if (x==2 & z>=3) | w==2
• Note the parentheses!
• Also note the difference between = and == (assignment and condition equality, respectively)
    gen x=1 if y==2
    list if gender=="F"
• Conditional operators in Stata are
    ==  (equal to)         !=  (not equal to)
    >   (greater than)     >=  (greater than or equal to)
    <   (less than)        <=  (less than or equal to)
    &   (and)              |   (or)
• Use in range to refer to particular row numbers in the dataset:
    list in 1/10

A Brief but Critical Detour: Missing Data
• While we are talking about selecting cases using an if exp clause, it is important to note that Stata treats a missing numeric value as larger than any nonmissing number
• Stata represents missing numeric values with a dot
• Keep this in mind when filtering cases based on a numeric variable:
    replace hieduc = 1 if x>3            (potential problem: missing x also satisfies x>3)
    replace hieduc = 1 if x>3 & x<.      (playing it safe)
    replace hieduc = 1 if x>3 & x!=.     (playing it safe, provided x has no extended missing values such as .a)

Understanding Command Syntax: weight
• Most Stata commands can deal with weighted data, where the weight is a variable in the dataset
• You need to specify the type of weight and the weight variable, using brackets, as in:
    summarize x [iweight=weightvarname]
• Four types of weights:
  – Frequency fweights, for replicated data
  – Probability pweights, for observations sampled with unequal probability of selection
  – Analytic aweights, for data containing averages where the average is weighted by the # obs used in calculating the average
  – Importance iweights, defined by the specific command

Understanding Command Syntax: using filename
• Some commands read in data from external files, or write to files
• These commands contain a using clause, in which the path and filename appear
• Merging two datasets together is an example:
    use "C:\…\master_data.dta", clear
    merge 1:1 id using "C:\...\using_data.dta"
• This performs a 1:1 match using the key variable, id (merge adds new variables). One-to-many (1:m) and many-to-one (m:1) merges are also possible
• Similarly, to stack datasets:
    use "C:\...\one.dta", clear
    append using "C:\...\two.dta"

Understanding Command Syntax: prefix:
• Prefix commands operate on other Stata commands.
  One common prefix is bysort:
    bysort gender: summarize wage
• The bysort prefix sorts the data by the gender variable and runs summarize separately for each gender
• The bysort prefix is also very handy in a data management context, for example, aggregating:
    bysort gender: egen avg_wage = mean(wage)
• Not all commands permit the use of all or even any prefixes

Understanding Command Syntax: Where to get HELP
• If you know the name of a command, enter
    help cmdname
• If you don’t know it, enter
    findit word1 [word2]…
  – This queries a keyword database and some of the official internet sources (such as Stata FAQs, Stata Journal articles)
• Google
• Email or call Stata Technical Services (really!)
• Statalist archives
• Email CSCAR Stata support at stata.help@umich.edu if you are affiliated with the U-M as a grad student, staff or faculty member

Lab 3
• Enter the appropriate commands in the command window (no menus!):
  – open the auto.dta dataset, clearing out what is in memory
  – describe the dataset
  – get the codebook for the first 5 variables in the dataset
  – list out the first 10 observations
  – try out the browse command
  – browse the cases where price is greater than 5000 (but not missing)
  – summarize the price variable where foreign==0 (for domestic cars)
  – use the bysort prefix to summarize the price variable by levels of the foreign variable

Data Management Commands
• We’ve already seen quite a few
    use, save              //open/save data
    browse, list           //view data
    codebook, describe     //10,000 ft view
    gen, egen, replace     //create/replace vars
    merge, append          //merge/stack datasets
• Next up:
  – importing
  – exporting
  – aggregating
  – keeping/dropping

Data Management Commands: importing files
• use reads Stata formatted (.dta) datasets.
• For data created in another software package:
  – save the data in an Excel file, then use the import excel command (new with Stata 12)
  – save the data in a comma separated values file (.csv), or a delimited file, then use the insheet command
  – use the other package to save the data in .dta format (SPSS 17+ and SAS 9.2 can do this)
  – use StatTransfer to convert the file to .dta
• .dta, delimited, and .csv files are the simplest file types to get into Stata
• Stata will also import data in other formats, but it’s not always straightforward
• To import a .csv file:
    insheet using "C:\...\new_data.csv", comma clear

Data Management Commands: exporting files
• save saves the data in .dta format
• To make the data usable by other software packages:
  – export the data to a comma separated values file (.csv), or a delimited file, using outsheet
  – use the other package to open the .dta file and save it in another format (SPSS 17+ and SAS 9.2 can do this)
  – use StatTransfer to convert the file from .dta to something else
• To export data to a .csv file:
    outsheet using "C:\...\out_data.csv", comma

Data Management Commands: aggregating files
• It is a common exercise to aggregate data, or to make a dataset of summary statistics
• Use the collapse command:
    collapse (mean) mn_wage=wage (count) count=gender, by(gender)
  to turn data like this
    id   gender   wage
    1    M        500
    2    M        550
    3    M        490
    4    F        505
    5    F        410
  into this
    gender   count   mn_wage
    M        3       ##
    F        2       ##
• Use collapse to produce counts, means, medians, percentiles, extrema, and standard deviations of your data.
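The collapse step above can be tried on the five-row table from the slide. This is only a sketch under stated assumptions: the data are typed in with input rather than read from a file, gender is stored as a string, and the count is taken over id instead of gender, a swap made here so the count is computed on a numeric variable.

```stata
* Toy data typed in from the slide (gender stored as a string: an assumption)
clear
input id str1 gender wage
1 "M" 500
2 "M" 550
3 "M" 490
4 "F" 505
5 "F" 410
end

* One row per gender: mean wage and number of nonmissing ids.
* Counting id rather than gender is a choice made for this sketch.
collapse (mean) mn_wage=wage (count) count=id, by(gender)

* Show the aggregated dataset, which now replaces the data in memory
list
```

Because collapse replaces the dataset in memory, save the original data first if it will be needed again.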
Data Management Commands: keep/drop
• To keep or throw away variables, use
    keep varlist
    drop varlist
• To get rid of particular observations, add an if or in clause with no varlist:
    drop if x==3
    keep in 1/100

Lab 4
• Import the “auto.csv” dataset from your desktop lab folder
• Save the file in your desktop lab folder as “auto1.dta”
• Aggregate the dataset by levels of foreign, obtaining the mean and median for price and mpg
• Drop the median price and median mpg variables
• Export the aggregated dataset to a .csv file in your desktop lab folder

Descriptive Statistics and Estimation
• We’ve already seen summarize
• Next up:
  – summarizing (with detail)
  – tabulating
  – estimation (modeling)
  – post-estimation

Descriptive Statistics and Estimation: summarizing with detail
• summarize gives descriptive statistics for numeric variables
• Use the detail option to get additional descriptive statistics
    sum x1, detail
• summarize without a varlist will summarize all numeric variables in the dataset

Descriptive Statistics and Estimation: tabulating
• tabulate gives one- and two-way tables for categorical variables
• In two-way tables, use the chi2, row, and col options to get a chi-square test, row %, and column %
    tab race
    tab race treatment, chi2 row col

Descriptive Statistics and Estimation: estimation (modeling)
• Most estimation commands have the same syntax
    cmdname yvar(list) xvarlist [, options]
• Common estimation commands are
    regress            //OLS
    logit, logistic    //logistic
    mlogit             //multinomial
    ologit             //ordinal
    poisson            //poisson
    xtmixed            //mixed
• Example:
    reg y x1 x2 x3

Descriptive Statistics and Estimation: post-estimation
• After you get your estimates you can obtain
  predictions:
    predict yhat1 if e(sample)
    predict yhat2
    predict resid, residuals
• Adjusting the estimated covariance matrix is straightforward:
    reg y x1 x2 x3, robust
    reg y x1 x2 x3, cluster(clustervar)
• Testing hypotheses about parameters:
    test x1=3
• Hypotheses can also be nonlinear and involve combinations of parameters

Lab 5
• Using the auto.dta dataset:
  – summarize the variables price and mpg
  – tabulate the foreign variable
  – regress price on mpg and foreign (OLS regression)
  – save the predicted values in a new variable called yhat
  – save the studentized residuals in a new variable called rstudent

Graphing
• Easily customized graphics
• Graphs can be created via menus or command line
• Manual adjustment can be done after the graph is generated, using the Graph Editor
• Graphs can be saved in various file formats and/or pasted into documents
• Examples:
    histogram y, normal
    twoway (scatter y x) (lfit y x)

Lab 6
• Using the auto.dta dataset, create a scatterplot of price on the y-axis, and mpg on the x-axis
• From the Graph window, start the Graph Editor.
  Modify the plot titles and colors
• Save your graph as a Stata .gph file in your desktop lab folder
• Copy the graph and paste it into a Word or PowerPoint file

Adding User-written Commands
• You can install add-on packages, which are user-written commands made publicly available
• You may run into these packages if you
  – do a findit search
  – Google
  – go to Help, SJ and user-written programs
• Installation is usually as simple as clicking through some links
• My personal most-used add-ons:
    mvpatterns
    gllamm

Lab 7
• Install the mvpatterns add-on package by typing
    findit mvpatterns
  then click on the blue link starting with dm91
• Follow links to install
• Read the help file for mvpatterns
• Check the missing value patterns for the variables make through rep78
• Close your log file

.do Files
• .do files are text files that contain sequences of Stata commands (like a SAS command file, or an SPSS syntax file)
• Create them using Stata’s .do file editor, or any text editor.
  – Copy from your Review window
  – Type in the commands directly
• Saving your commands to one or more .do files is never a bad idea. But use good habits:
  – Comment liberally, using * or /* */ conventions
  – Specify the version of Stata used
  – Use set more off to opt out of Stata’s paging feature, if appropriate
• You can run the entire .do file, or just a small part of it
• Stata will stop processing if an error is encountered when commands from a .do file are submitted

Lab 8
• Open the sample.do file in your desktop lab folder
• Can you describe what is happening in the .do file?
• Copy all of the commands from tonight’s session into a new .do file
• Run a small section of commands
• Run the entire file

Other Misc.
• To manage variable attributes, use the Variables Manager.
• Type help cmdname to find out more about these commands:
    matrix     //matrix algebra
    mata       //fancy matrix programming
    foreach    //looping command
    xt         //panel/longitudinal analysis
    st         //survival analysis
    svy        //analysis of complex survey data

Additional Resources
• Stata website, FAQs:
    http://www.stata.com/support/faqs
• UCLA website
    http://www.ats.ucla.edu/stat/stata/default.htm
• Christopher F. Baum’s Stata handouts
    http://fmwww.bc.edu/GStat/docs/StataIntro.pdf
    http://fmwww.bc.edu/GStat/docs/StataProg.pdf
    http://fmwww.bc.edu/GStat/docs/StataMata.pdf
• Stata NetCourses
    http://www.stata.com/netcourse/
• CSCAR workshops
    http://www.umich.edu/~cscar/workshops/
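As a closing sketch, the habits listed on the “.do Files” slide might look like this when applied to the Lab 5 steps. Several pieces are assumptions rather than part of the handout: the version number, the use of sysuse auto to load the example dataset that ships with Stata, writing labs.log to the current working directory, and the capture log close guard.

```stata
* Sketch of a .do file for the Lab 5 steps (not the workshop's sample.do)
version 12                       // assumed release; state the one you actually use
set more off                     // opt out of paging so the file runs unattended
capture log close                // close any log left open by an earlier run
log using "labs.log", replace    // assumed log name, written to the working directory

sysuse auto, clear               // example dataset shipped with Stata

summarize price mpg
tabulate foreign

regress price mpg foreign        // OLS regression
predict yhat                     // predicted values
predict rstudent, rstudent       // studentized residuals

log close
```

If any command fails, Stata stops at that line, so the log shows how far the run got.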