An Introductory Course for Stata
By Dallas J. Bateman

I. Introduction

What is Stata?

Stata is a statistical package used mostly by businesses and academic institutions. It is widely used in economics, sociology, political science, and epidemiology. It is popular in these fields because its simple point-and-click features can carry out complex statistical analyses and produce publication-quality graphics. Stata can perform data management and statistical analysis, produce graphics, run simulations, and support custom programming. For a full list of the capabilities of Stata, please refer to the following website: http://www.stata.com/capabilities/. Although Stata has a point-and-click interface, this paper gives typed commands (Stata code) instead.

There are a few different versions of Stata, depending on the type of data with which one may work: a version for multiprocessor computers, a version for large databases, a standard version, and a smaller version for students. Since Stata is not free software, the software and a license must be purchased in order to install it on a personal computer. The software can cost around $600.00. Student versions can be significantly cheaper. For additional help files on getting acquainted with Stata, please visit http://www.stata.com/links/resources1.html.

Reading data into Stata

Stata has built-in datasets with which we may work. To locate a dataset:

1. Select File > Example Datasets….
2. Click on Example datasets installed with Stata.
3. Choose the dataset you would like to work with.

For outside data files, Stata can read in a file from a directory on the computer or a file from the internet.
Both methods require the use command followed by either the directory location or the web address, for example:

    use "H:\School\STAT 582\logit.dta", clear
    use http://www.ats.ucla.edu/stat/stata/dae/logit.dta, clear

The quotation marks around the first path are needed because it contains a space.

lookfor allows you to find variables whose names or labels contain a specified keyword. This is especially useful in large data sets with many variables. Abbreviated keywords are often the most helpful. To find a poverty variable, type lookfor pov.

describe tells you how specific variables are stored and labeled (storage type, display format, value label, and variable label), for example describe xvar yvar.

codebook xvar yvar will produce a nicely formatted codebook for those variables, which is especially useful if you have added variable labels with the label variable command. codebook by itself will list every variable in your data and generate a lot of output.

Once you have opened your data and are ready to begin, Stata has a way of opening help files specific to the commands that you would like to use. For example, say you want to begin using simple linear regression analysis, but you cannot remember the syntax for the regress command. By typing help regress in the Command window, you will be given a help file explaining the syntax and options of the regress command. Typing findit regress instead searches the help files and the web for material related to that keyword.

II. Common Statistical Analyses in Stata

Descriptive Statistics

summarize gives basic descriptive statistics for a variable. This is mostly useful for continuous variables.

    summarize xvar yvar

tabulate (or simply tab) gives a frequency distribution for your variable. This is useful for categorical variables.

    tabulate xvar

Linear Regression

To run a linear model in Stata, we are going to use the crime dataset. The variables are state id (sid), state name (state), violent crimes per 100,000 people (crime), percent of the population living under the poverty line (poverty), and percent of the population that are single parents (single). There are other variables in the dataset, but these are the ones that we will refer to for this example.
To load the data into Stata, type the following commands in the Command window:

    use http://www.ats.ucla.edu/stat/stata/webbooks/reg/crime, clear
    drop if sid == 51

The drop command removes Washington DC, since it is not a state. To fit a regression model, we will treat crime as the response and poverty and single as the predictors. Typing

    regress crime poverty single

in the Command window will produce regression output with an ANOVA table, model fit statistics (R2, Adj R2, Root MSE, etc.), and a table with the coefficients, their standard errors, significance tests, and confidence intervals.

Let us suppose for a moment that an additional predictor variable race, a categorical variable denoting race, was added to the model. To let Stata know that you want indicator variables for this categorical variable, add "i." before the variable name in the model above:

    regress crime poverty single i.race

A common desire is to obtain residuals or fitted values in order to check model assumptions such as normality. Stata makes this simple. The following code will store the residuals and the fitted values:

    predict res, residuals
    predict yhat

The predict statements must be run after the regress statement. The first line stores the residuals in a new variable called res. The second line stores the fitted values in a new variable called yhat, since fitted values are what predict produces when no option is given. To look at residual plots:

    scatter res yhat
    scatter res poverty
    scatter res single
    scatter res race

The last plot applies only when race is part of the model.

This is a very basic run-through of regression analysis. For more information on checking model assumptions, checking model fit, and searching for outliers please refer to the following website: http://www.ats.ucla.edu/stat/stata/dae/rreg.htm.
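The regression steps above can also be tried without an internet connection. The following is a minimal sketch rather than part of the original example: it assumes the auto dataset that ships with Stata (loaded with sysuse auto) in place of the crime data, and the choice of price, mpg, weight, and foreign as variables is purely illustrative.

```stata
* Sketch only: auto.dta (installed with Stata) stands in for the crime data.
sysuse auto, clear

* price is the response; mpg and weight are continuous predictors.
* i.foreign asks Stata to create indicator terms for the categorical variable.
regress price mpg weight i.foreign

* predict must be run after regress.
predict double res, residuals   // residuals
predict double yhat, xb         // fitted values (xb is also the default)

* Residuals plotted against fitted values, with a reference line at zero.
scatter res yhat, yline(0)

* With a constant in the model, the residuals should average to about zero.
summarize res
```

If the mean reported by summarize res is not essentially zero, the predict statements were probably run after a different model than intended.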
Categorical Variable Analysis

Tabulating two categorical variables together gives you a cross-tabulation of those variables, e.g.

    tabulate xvar yvar, row col chi2

pwcorr xvar yvar, sig gives you the pairwise correlation of two continuous variables, together with its significance level.

oneway xvar yvar, tabulate gives you a one-way ANOVA of a continuous variable (xvar) over a categorical factor (yvar).

As an example of logistic regression, we are going to use a hypothetical dataset about getting into graduate school. The data can be loaded into Stata via the following command:

    use http://www.ats.ucla.edu/stat/stata/dae/logit.dta, clear

This hypothetical data set has a binary response variable called admit denoting whether or not a student was admitted into graduate school. There are three predictor variables: gre, gpa, and topnotch, which is a binary predictor where 1 indicates that the undergraduate institution was "top notch" and 0 indicates that it was not. Typing tab admit topnotch will produce a crosstab of admit and topnotch:

               |       topnotch
         admit |         0          1 |     Total
    -----------+----------------------+----------
             0 |       238         35 |       273
             1 |        97         30 |       127
    -----------+----------------------+----------
         Total |       335         65 |       400

None of the cells is empty (has no cases) or too small, so it is safe to run a logistic model.

    logit admit gre topnotch gpa

Note again that the first variable listed after the logit command is automatically treated as the response, and all variables listed afterwards are the predictors. The logit command reports coefficients on the log-odds scale; the logistic command fits the same model but reports odds ratios. The logit command above will produce the following output (similar to that for linear regression):

    Logistic regression                               Number of obs   =        400
                                                      LR chi2(3)      =      21.85
                                                      Prob > chi2     =     0.0001
    Log likelihood = -239.06481                       Pseudo R2       =     0.0437

    ------------------------------------------------------------------------------
           admit |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             gre |   .0024768   .0010702     2.31   0.021     .0003792    .0045744
        topnotch |   .4372236   .2918532     1.50   0.134    -.1347983    1.009245
             gpa |   .6675556   .3252593     2.05   0.040     .0300592    1.305052
           _cons |  -4.600814   1.096379    -4.20   0.000    -6.749678   -2.451949
    ------------------------------------------------------------------------------

Again, this is a very basic run-through of logistic regression and producing contingency tables. For more information on this particular example please refer to the following website: http://www.ats.ucla.edu/stat/stata/dae/logit.htm.

General Data Manipulations

To restrict a command to the portion of the dataset that satisfies a condition, add an if qualifier:

    scatter y x if x < 10

This example will produce a scatterplot of y against x only for x-values less than 10.

Sample Size and Power Calculations

For this problem, we are going to be given only some results (no data). First, we are told that there are four groups in the study. Second, the largest group mean is 646 and the smallest group mean is 550 (the other two group means are taken to equal the grand mean for simplicity). Third, the standard deviation is the same in all four groups and equal to the population standard deviation of 80.

We will make use of the Stata command fpower to do the power analysis. fpower is a user-written command rather than part of official Stata; findit fpower should locate it for installation. The fpower command needs the following information in order to do the power analysis:

1. the number of levels (or groups)
2. the effect size (called delta)
3. the alpha level

From the information given above, we know that there are four groups, a=4.
We will set alpha = 0.05, and we will compute the effect size as the difference between the largest and smallest group means divided by the standard deviation:

    \delta = \frac{\max\{\mu_1,\dots,\mu_4\} - \min\{\mu_1,\dots,\mu_4\}}{\sigma} = \frac{646 - 550}{80} = 1.2

Hence, delta = 1.2. Now, we can apply fpower and get the corresponding output:

    fpower, a(4) delta(1.2) alpha(0.05)

    a = 4   b = 1   c = 1   r = 1   rho = 0   delta = 1.2

    nobs      power          nobs      power
       2   .0906746            16    .795521
       3   .1438119            18   .8478578
       4   .2013958            20   .8884002
       5   .2614601            25   .9512783
       6   .3224192            30   .9800673
       7   .3829314            35   .9922693
       8   .4419005            40   .9971333
       9     .49847            45    .998977
      10   .5520059            50   .9996469
      12   .6484047           100          1
      14   .7294912

If we wanted to obtain 80% power, then our sample size (or nobs) falls somewhere between 16 and 18 observations, since power is .7955 at 16 and .8479 at 18. To do the reverse, the same Stata code applies, but this time suppose that we have 40 subjects. Reading the row for nobs = 40, we would then see that we have a power of 99.71%.

III. Working with Graphics in Stata

histogram xvar will give you a nice display of one variable.

histogram xvar, by(yvar) may be useful for comparing the distribution of xvar across the categories of yvar.

histogram xvar, percent will scale the y-axis more intuitively in terms of percentages.

histogram xvar, discrete gives a nicer display for categorical variables.

twoway scatter yvar xvar gives you a twoway scatterplot of your data; scatter y x is a shorter way of writing the same kind of command.

sunflower yvar xvar gives you a sunflower plot of your data.

twoway lfit yvar xvar will give you a linear fit graph. The two syntaxes may be combined, e.g. twoway (scatter yvar xvar) (lfit yvar xvar).

graph bar xvar, over(yvar) is useful for creating a bar graph of a continuous or categorical variable graphed across the categories of a categorical variable.

For all graphs, options after a comma will be helpful in titling your graph, for example:

    twoway lfit yvar xvar, title("…") xtitle("…") ytitle("…")

A more detailed report on the graphics capabilities in Stata can be found at: http://www.stata.com/stata8/graphics.html. The code for the graphs shown there is not provided on that page; they are presented as the result of the point-and-click interface.
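The graph commands listed in this section can be tried together in a short do-file. This is a minimal sketch under stated assumptions: it uses the auto dataset that ships with Stata, and the variables (mpg, price, weight, foreign) and the title text are illustrative choices, not part of the original list. The /// continuation marks work in a do-file but not when typing directly in the Command window.

```stata
* Sketch only: auto.dta (installed with Stata) supplies the example variables.
sysuse auto, clear

* Histogram of mpg on a percent scale, one panel per category of foreign.
histogram mpg, percent by(foreign)

* Scatterplot with a linear fit overlaid; titles are given after the comma.
twoway (scatter price weight) (lfit price weight), ///
    title("Price versus weight") ///
    xtitle("Weight") ytitle("Price")

* Bar graph of mean price across the categories of foreign.
graph bar (mean) price, over(foreign)
```

Each graph command replaces the graph currently displayed, so the commands are easiest to compare when run one at a time.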
I am not personally familiar with the customization options of Stata when it comes to graphics, but the graphics page linked above appears to show several different ways to customize a publication-ready graph.

Notes

Much of the information for this write-up has been taken from the following resources:

1. http://www.stata.com/links/resources1.html (accessed 4/13/2010).
2. http://www-personal.umich.edu/~agrogan/stata/TwoPageStata.pdf (accessed 4/14/2010).
3. http://www.ats.ucla.edu/stat/stata/dae/ (accessed 4/13/2010).