An Introductory Course for Stata
By Dallas J. Bateman
I. Introduction
What is Stata?
Stata is a statistical package used mostly by businesses and academic institutions. It is
widely used in economics, sociology, political science, and epidemiology. Stata is popular
in these fields because its simple point-and-click features can carry out complex statistical
analyses and produce publication-quality graphics. Stata can manage data, perform statistical
analyses, produce graphics, run simulations, and even support custom programming. For a full
list of the capabilities of Stata, please refer to the following website:
http://www.stata.com/capabilities/.
Although Stata has a point-and-click interface, this paper gives the typed commands (Stata
code) rather than the menu steps.
There are a few different versions of Stata, and the right one depends on the type of data
with which one works: a version for multiprocessor computers, a version for large databases,
a standard version, and a smaller version for students.
Since Stata is not free software, the software and a license must be purchased in order to
install it on a personal computer. The software can cost around $600. Student versions can be
significantly cheaper.
For additional help files on getting acquainted with Stata, please visit
http://www.stata.com/links/resources1.html.
Reading data into Stata
Stata has built-in datasets with which we may work. To locate a dataset:
1. Select File > Example Datasets….
2. Click on Example datasets installed with Stata.
3. Choose the dataset you would like to work with.
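The menu steps above also have a command equivalent. The following is a minimal sketch; it assumes the auto dataset that ships with Stata, which is not otherwise used in this paper.

```stata
* List the example datasets installed with Stata
sysuse dir

* Load the shipped auto dataset, replacing any data currently in memory
sysuse auto, clear

* Confirm what was loaded (number of observations and variables)
describe, short
```

The clear option discards whatever data are in memory, so save your work first if you need it.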
For outside data files, Stata can read a file from a directory on the computer or from the
internet. Both methods use the use command followed by either the file path or the web
address. A path that contains spaces must be enclosed in double quotes. For example:
use "H:\School\STAT 582\logit.dta", clear
use http://www.ats.ucla.edu/stat/stata/dae/logit.dta, clear
lookfor finds variables whose names or labels contain a specified keyword. This is especially
useful in large datasets with many variables. Abbreviated keywords are often the most helpful:
to find a poverty variable, type lookfor pov.
describe reports the storage type, display format, and label of the variables you name, e.g.
describe xvar yvar.
codebook xvar yvar produces a nicely formatted codebook for those variables, which is
especially useful if you have added variable labels with the label variable command. codebook
by itself covers every variable in your data and generates a lot of output.
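As a brief illustration of these three commands, the sketch below uses the auto dataset shipped with Stata. The dataset and the label text are assumptions made for the example, not part of the original material.

```stata
sysuse auto, clear

* Find variables whose name or label contains "mil" (matches the label "Mileage (mpg)")
lookfor mil

* Storage type, display format, and label for selected variables
describe price mpg

* Attach a label, then produce a codebook for just these two variables
label variable price "Price in US dollars"
codebook price mpg
```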
Once you have opened your data and are ready to begin, Stata can open help files specific to
the commands that you would like to use. For example, say you want to begin using simple
linear regression analysis, but you cannot remember the syntax for the regress command.
Typing help regress in the Command window opens the help file explaining the syntax and
options of the regress command. Typing findit regress instead runs a broader keyword search
across help files, FAQs, and user-written resources.
II. Common Statistical Analyses in Stata
Descriptive Statistics
summarize gives basic descriptive statistics for a variable. This is mostly useful for
continuous variables.
summarize xvar yvar
tabulate (or simply tab) gives a frequency distribution for your variable. This is useful
for categorical variables.
tabulate xvar
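A short sketch of both commands on real variables follows. It assumes the auto dataset shipped with Stata, where price and mpg are continuous and foreign is categorical.

```stata
sysuse auto, clear

* Mean, standard deviation, minimum, and maximum for two continuous variables
summarize price mpg

* The detail option adds percentiles, skewness, and kurtosis
summarize price, detail

* Frequency distribution of a categorical variable
tabulate foreign
```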
Linear Regression
To run a linear model in Stata, we are going to use the crime dataset. The variables are
state id (sid), state name (state), violent crimes per 100,000 people (crime), percent of the
population living under the poverty line (poverty), and percent of the population that are single
parents (single). There are other variables in the dataset, but these are the ones that we will
refer to for this example.
To load the data into Stata, type the following commands in the Command window:
use http://www.ats.ucla.edu/stat/stata/webbooks/reg/crime, clear
drop if sid == 51
The drop command will drop Washington DC since it is not a state.
To fit a regression model, we will treat crime as the response and poverty and single as
the predictors. Typing regress crime poverty single in the Command window will
produce regression analysis output with an ANOVA table, model fit statistics (R2, Adj R2, Root
MSE, etc.), and a table with the coefficients, standard errors, significance tests and confidence
intervals of the respective coefficients.
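The quantities in the regression output can also be retrieved after estimation. The sketch below assumes the auto dataset shipped with Stata (so that it runs without an internet connection) and uses price, mpg, and weight in place of crime, poverty, and single.

```stata
sysuse auto, clear

* Response first, then the predictors
regress price mpg weight

* Stored results stay available until the next estimation command
display "R-squared    = " e(r2)
display "Root MSE     = " e(rmse)
display "Slope on mpg = " _b[mpg]
display "SE on mpg    = " _se[mpg]
```

Typing ereturn list after regress shows everything that was stored.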
Suppose for a moment that an additional predictor, race, a categorical variable denoting
race, were added to the model. To tell Stata that you want indicator variables for this
categorical variable, add "i." before the variable name in the model above:
regress crime poverty single i.race
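Because race is hypothetical here, the following sketch shows the same idea with a categorical variable that does exist: rep78 (repair record, coded 1 to 5) in the auto dataset shipped with Stata. It assumes Stata 11 or later, where the i. prefix is available.

```stata
sysuse auto, clear

* i. expands rep78 into indicators, omitting the lowest level as the base
regress price mpg weight i.rep78

* Joint test that all rep78 indicators are zero
testparm i.rep78

* ib#. selects a different base category, here level 3
regress price mpg weight ib3.rep78
```

Observations with a missing rep78 are dropped from the fit automatically.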
It is often useful to obtain residuals and fitted values in order to check model assumptions
such as normality of the errors. Stata makes this simple. The following code stores the
residuals and the fitted values:
predict res, residuals
predict yhat
The predict statements must be run after the regress statement. The first line stores the
residuals (the residuals option) in a new variable called res. The second line stores the
fitted values, which are the default for predict, in a new variable called yhat. To look at
residual plots:
scatter res yhat
scatter res poverty
scatter res single
scatter res race
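The residual checks can be sketched from start to finish as follows. The example assumes the auto dataset shipped with Stata rather than the crime data; the normal quantile plot (qnorm) is an addition that is not part of the steps above.

```stata
sysuse auto, clear
regress price mpg weight

* Residuals and fitted values, stored in double precision
predict double res, residuals
predict double yhat, xb

* Residuals against fitted values, with a reference line at zero
scatter res yhat, yline(0)

* Normal quantile plot of the residuals
qnorm res
```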
This is a very basic run-through of regression analysis. For more information on
checking model assumptions, checking model fit, and searching for outliers please refer to the
following website: http://www.ats.ucla.edu/stat/stata/dae/rreg.htm.
Categorical Variable Analysis
Tabulating two categorical variables together gives you a cross-tabulation of those
variables, e.g. tabulate xvar yvar, row col chi2
• pwcorr xvar yvar, sig gives you the pairwise correlation of two continuous variables,
along with its significance level.
• oneway xvar yvar, tabulate gives you a one-way ANOVA of a continuous variable (xvar)
over a categorical factor (yvar).
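These three commands can be tried on real variables. The sketch below assumes the auto dataset shipped with Stata, with foreign and rep78 as the categorical variables and price and mpg as the continuous ones.

```stata
sysuse auto, clear

* Cross-tabulation with row and column percentages and a chi-squared test
tabulate foreign rep78, row col chi2

* Pairwise correlation with its significance level
pwcorr price mpg, sig

* One-way ANOVA of price over the levels of rep78, with a summary table
oneway price rep78, tabulate
```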
As an example of logistic regression, we are going to use a hypothetical dataset about
getting into graduate school. The data can be loaded into Stata via the following command:
use http://www.ats.ucla.edu/stat/stata/dae/logit.dta, clear
This hypothetical dataset has a binary response variable called admit denoting whether
or not a student was admitted into graduate school. There are three predictor variables: gre,
gpa, and topnotch, which is a binary predictor where 1 indicates that the undergraduate
institution was "top notch" and 0 indicates that it was not.
tab admit topnotch
will produce a crosstab of admit and topnotch:
           |       topnotch
     admit |         0          1 |     Total
-----------+----------------------+----------
         0 |       238         35 |       273
         1 |        97         30 |       127
-----------+----------------------+----------
     Total |       335         65 |       400
None of the cells is too small or empty (that is, has no cases), so it is safe to run a
logistic model.
logit admit gre topnotch gpa
Note again that the first variable listed after the command is automatically treated as the
response, and all variables listed afterwards are the predictors. The logit command reports
coefficients on the log-odds scale; the logistic command fits the same model but reports odds
ratios. The logit command above will produce the following output (similar to that for linear
regression):
Logistic regression                               Number of obs   =        400
                                                  LR chi2(3)      =      21.85
                                                  Prob > chi2     =     0.0001
Log likelihood = -239.06481                       Pseudo R2       =     0.0437

------------------------------------------------------------------------------
       admit |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         gre |   .0024768   .0010702     2.31   0.021     .0003792    .0045744
    topnotch |   .4372236   .2918532     1.50   0.134    -.1347983    1.009245
         gpa |   .6675556   .3252593     2.05   0.040     .0300592    1.305052
       _cons |  -4.600814   1.096379    -4.20   0.000    -6.749678   -2.451949
------------------------------------------------------------------------------
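Stata offers two commands for this model, and they differ only in what they display: logit reports coefficients on the log-odds scale, while logistic reports odds ratios. The sketch below illustrates the relationship. It assumes the auto dataset shipped with Stata, with foreign as the binary response; the scalar name b_mpg is chosen for the example.

```stata
sysuse auto, clear

* logit reports coefficients on the log-odds scale
logit foreign mpg weight
scalar b_mpg = _b[mpg]

* logistic fits the same model but displays odds ratios
logistic foreign mpg weight

* An odds ratio is the exponentiated logit coefficient
display "Odds ratio for mpg = " exp(b_mpg)
```

The displayed value should match the odds ratio that logistic prints for mpg.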
Again, this is a very basic run-through of logistic regression and producing contingency
tables. For more information on this particular example please refer to the following website:
http://www.ats.ucla.edu/stat/stata/dae/logit.htm.
General Data Manipulations
To restrict a command to the portion of the dataset that satisfies a condition, add an if
qualifier:
scatter y x if x < 10
This example will produce a scatterplot of y against x using only the x-values less than 10.
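The if qualifier limits a single command, whereas keep if and drop if change the data in memory. The following sketch contrasts the two, assuming the auto dataset shipped with Stata and an arbitrary cutoff of 25 for mpg.

```stata
sysuse auto, clear

* if restricts one command; the data in memory are unchanged
count if mpg < 25
scatter price mpg if mpg < 25

* keep if changes the data, so wrap it in preserve/restore to undo it
preserve
keep if foreign == 1
summarize price
restore
```

After restore, the full dataset is back in memory.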
Sample Size and Power Calculations
For this problem, we are given only some summary results (no data). First, we are told
that there are four groups in the study. Second, the largest group mean is 646 and the smallest
group mean is 550 (the other two groups are taken to be equal to the grand mean for
simplicity). Third, the standard deviation is the same in all four groups and equal to the
population standard deviation of 80.
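We will make use of the Stata command fpower to do the power analysis. fpower is a
user-written command rather than part of official Stata, so it may need to be installed first
(typing findit fpower will locate it). The fpower command needs the following information
in order to do the power analysis: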
1. the number of levels (or groups)
2. the effect size (called delta)
3. the alpha level
From the information given above, we know that there are four groups, a=4. We will set
alpha = 0.05, and we will compute the effect size:
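$$\delta = \frac{\max\{\mu_1, \ldots, \mu_4\} - \min\{\mu_1, \ldots, \mu_4\}}{\sigma}$$
where $\sigma$ is the common standard deviation. Hence,
$$\delta = \frac{646 - 550}{80} = 1.2$$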
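The same arithmetic can be done inside Stata with scalars. This is only a convenience sketch; the scalar names mu_max, mu_min, and sigma are chosen for the example.

```stata
* Largest and smallest group means and the common standard deviation
scalar mu_max = 646
scalar mu_min = 550
scalar sigma  = 80

* Effect size: range of the group means in standard deviation units
scalar delta = (mu_max - mu_min) / sigma
display "delta = " delta
```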
Now, we can apply fpower and get the corresponding output:
fpower, a(4) delta(1.2) alpha(0.05)
a = 4    b = 1    c = 1    r = 1    rho = 0    delta = 1.2

  nobs       power         nobs       power
     2    .0906746           16     .795521
     3    .1438119           18    .8478578
     4    .2013958           20    .8884002
     5    .2614601           25    .9512783
     6    .3224192           30    .9800673
     7    .3829314           35    .9922693
     8    .4419005           40    .9971333
     9      .49847           45     .998977
    10    .5520059           50    .9996469
    12    .6484047          100           1
    14    .7294912
If we wanted to obtain 80% power, then the required sample size (nobs) falls somewhere
between 16 and 18 observations, since the power is 0.7955 at nobs = 16 and 0.8479 at
nobs = 18. To go in the other direction, the same Stata code applies: suppose that we have
40 subjects. The table then shows that we have a power of 99.71%.
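For readers who want to see where such a power value comes from, the sketch below computes the power of the one-way ANOVA F test directly from Stata's built-in distribution functions invFtail() and nFtail(). It rests on two assumptions that the original material does not confirm: that nobs is the number of observations per group, and that fpower uses the noncentrality parameter shown here (two means at the extremes, the other two at the grand mean). The scalar names are chosen for the example.

```stata
* Assumed design: 4 groups, 16 observations per group, delta = 1.2, alpha = 0.05
scalar k_grp  = 4
scalar n_per  = 16
scalar delta  = 1.2
scalar alpha  = 0.05

* Noncentrality: n * sum of squared deviations of the means / sigma^2,
* which equals n * delta^2 / 2 for this pattern of means
scalar lambda = n_per * delta^2 / 2

* Degrees of freedom for the one-way ANOVA F test
scalar df1 = k_grp - 1
scalar df2 = k_grp * (n_per - 1)

* Critical value under the null, then the upper tail of the noncentral F
scalar fcrit = invFtail(df1, df2, alpha)
scalar pwr   = nFtail(df1, df2, lambda, fcrit)
display "power = " pwr
```

If both assumptions hold, the result should be close to the fpower entry for nobs = 16; changing n_per traces out the rest of the table.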
III. Working with Graphics in Stata
histogram xvar will give you a nice display of one variable. histogram xvar,
by(yvar) may be useful for comparing the distribution of xvar across the categories of
yvar.
• histogram xvar, percent will scale the y-axis more intuitively in terms of
percentages.
• histogram xvar, discrete gives a nicer display for categorical variables.
• twoway scatter yvar xvar gives you a two-way scatterplot of your data.
• sunflower yvar xvar gives you a sunflower plot of your data.
• twoway lfit yvar xvar will give you a linear fit graph.
The two syntaxes may be combined, e.g. twoway (scatter yvar xvar) (lfit yvar xvar)
graph bar xvar, over(yvar) is useful for creating a bar graph of xvar (by default, its mean)
across the categories of a categorical variable yvar.
For all graphs, options after a comma are helpful for titling your graph. The titles must be
enclosed in straight double quotes, for example:
twoway lfit yvar xvar, title("…") xtitle("…") ytitle("…")
scatter y x
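Several of the graph commands above can be tried together. The sketch below assumes the auto dataset shipped with Stata, and the title text is made up for the example. The /// line continuation works in a do-file but not when typed directly in the Command window.

```stata
sysuse auto, clear

* Histogram scaled in percentages, drawn separately for each category
histogram mpg, percent by(foreign)

* Scatterplot with a fitted line overlaid, plus titles
twoway (scatter price mpg) (lfit price mpg), ///
    title("Price and mileage") xtitle("Mileage (mpg)") ytitle("Price (USD)")

* Mean of price over the categories of foreign
graph bar price, over(foreign)
```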
A more detailed overview of the graphics capabilities in Stata can be found at:
http://www.stata.com/stata8/graphics.html. The code for the graphs shown there is not
provided with that list; the graphs are presented as the result of the point-and-click GUI.
I am not personally familiar with the customization abilities of Stata when it comes to
graphics, but this link seems to show several different ways to customize any
publication-ready graph.
Notes
Much of the information for this write up has been taken from the following resources:
1. http://www.stata.com/links/resources1.html (accessed 4/13/2010).
2. http://www-personal.umich.edu/~agrogan/stata/TwoPageStata.pdf (accessed 4/14/2010).
3. http://www.ats.ucla.edu/stat/stata/dae/ (accessed 4/13/2010).