Stata - Data and Statistical Services

advertisement
Stata Tutorial
Dan Edelstein
Kristi Thompson
Data and Statistical Services
data@princeton.edu
http://dss.princeton.edu/
Stata
Stata is a statistical analysis package, used for exploring, graphing, summarizing and
manipulating data files. There are commands built into Stata that allow the user to do
statistical analysis such as cross-tabulation and regression on data sets.
Data Sets
A data set is just a file organized into rows and columns. The rows represent individual
units, or observations. An observation might be a person, a stock, a district, or a
company. The columns represent variables. A variable is a piece of information that has
been recorded about each observation.
As an example, we will be using a file with some data about people. Each observation is
a person, and the variables include each person's name, income, age, and race.
Getting Data into Stata
Stata has its own native format for datasets, and files in this format can be read directly
into Stata with the "use" command or by using the "Open" command on the File menu.
Files in Stata format have the file extension ".dta".
To read Excel Files into Stata:






Open the file in Excel.
Make sure that missing data values are coded as blank or as numeric values (e.g.,
999 or -1). Do not use periods (.) or any other character values (e.g., N/A) to
represent missing data. Simply leave the cells empty or code a number.
Make sure that there are no commas in the numbers. You can change this under
Format menu, then select Cells....
Make sure that variable names are included only in the first row of your
spreadsheet. There should be only one row of variable names (some files
produced by databases have several header rows). Variable names should be 32
characters or less, start with a letter and contain no special characters, such as $ or
&, except the underscore [_]. You should eliminate embedded blanks (spaces).
Under the File menu, select Save As. Then Save as type 'CSV' (comma separated
values). The file will be saved with a .csv extension, for example "sample.csv."
Read the file in to Stata using the "insheet" command
Other file types can be read into Stata, or converted to Stata format using conversion
programs, depending on the type of file. The data lab can provide assistance.
Problem Spreadsheet
Cleaned Spreadsheet
For rest of this tutorial we will be using a sample spreadsheet, sample.csv, that can be
downloaded from the AP501 blackboard site. Start by downloading this spreadsheet
from the web site and save it to your H: drive. In this tutorial, commands you should
type will be show like this:
.stata command
Output that stata gives you will be shown below the commands, on lines that do not start
with periods.
Commands, logs, help
Commands can be executed one at a time at the Stata prompt (command window in Stata
for Windows). Just type the command and hit the "Enter" key. Alternatively, groups of
commands can be entered into do files which can then be executed.
If you are not sure which command you need, you can type "search" and a keyword. Stata
will display a list of commands and other resources associated with that keyword, if there
are any. Click on the name of one of the commands or resources to display the help
screen.
Alternatively, you can use the "Help" menu. Click on "Stata Command" if you know the
command or "Search" if you don't.
Opening Stata
Stata starts in its default working folder or directory, typically C:\data or C:\stata. If you
don't change to something else, Stata will assume that any file name you type is in the
default directory. Since normally your data will be in other directories, you need the cd
(Change Directory) command:
. search cd
. cd "h:\"
"h:\" is where you saved the data spreadsheet. Now that you have told stata to work there,
you can read it in with a simple "insheet" command. But before you do that, you need to
open a log file:
. log using stata1.log
Because you first cd'd to your h:\ drive, the log file "stata1.log" will be saved there.
Everything you type into Stata (commands) and all the output Stata gives you will be
saved in this file, exactly as it appeared on the screen. This is important for two reasons:
1. You will have a record of your results to refer to and paste in to your report when
you write up your analysis.
2. You will be able to see exactly what happened if you realize you made a mistake.
A log file with extension ".log" is a plain text file. This means you can open and read it in
MS Word or almost any other program. If you just issued the command
. log using stata1
Stata would create the file "stata1.smcl", which is a file type that can only be opened by
Stata. This is usually inconvenient, so don't.
Describing and listing data
Now that a log file is open, read the data into Stata, then use the "describe" command to
find out something about it.
. insheet using sample.csv
. describe
The describe command gives the names of each variable in your dataset and also gives
the variable type.
There are two types of variables in Stata: numeric and string. String variables are
indicated by variable types that begin with "str", followed by a number indicating the
maximum length of the string: str9, str5, etc. Different sizes of number have different
labels in the variable type column: byte, int, long, double, float. Numeric variables are
simple – they contain numbers. String variables contain text which can contain any
characters on the keyboard: letters, numbers, and special characters. We can do numeric
calculations and statistical analysis on numeric variables – we can’t on string variables.
So it's important to check your data to make sure that variables that should be numeric
actually are!
The list command lists values of the different variables in your dataset.
. list name
Lists names for everyone in your dataset.
. list in 1/2
Lists the values of all variables for the first two observations. The lower number / higher
number syntax can be used with other commands too.
. list in 5/10
. list
Lists the value of every variable for every observation in the dataset-- usually not a good
idea!
. l sex income
Lists the value of sex and income for every observation. l (the letter L) is an abbreviation
for list.
. l name if income<50000
Lists the names of people with an income of less than 50,000. Note the use of the lessthan symbol to specify a condition. We'll see more of this later.
Basic descriptive statistics
Two commands that are useful for getting basic descriptive statistics for your variables
are summarize and tabulate (abbreviated sum and tab respectively). sum gives the
number of valid observations, mean, standard deviation, minimum and maximum values
for any variables you specify. You can do the entire dataset at once:
. sum
or just a subset of variables:
. sum income age
The tabulate (or tab) command gives you a frequency distribution (for one variable) or a
crosstabulation (for two).
. tab sex
sex |
Freq.
Percent
Cum.
------------+----------------------------------Female |
10
50.00
50.00
Male |
10
50.00
100.00
------------+----------------------------------Total |
20
100.00
The summarize command can be combined with the tabulate command to produce
summaries of one variable for each value of another. The following table shows separate
summaries of income for males and females.
. tab sex, sum(income)
|
Summary of income
sex |
Mean
Std. Dev.
Freq.
------------+-----------------------------------Female |
27100
12251.531
10
Male |
45444.444
24734.142
9
------------+-----------------------------------Total |
35789.474
20868.847
19
. tab sex race
Crosstabulation showing values of race for each level of sex.
. tab sex race, row col
Same as above, but with row and column percentages.
You can combine a crosstabulation with the summarize command, just like we earlier
combined a simple tab with summarize:
. tab sex race, sum(income)
Means, Standard Deviations and Frequencies of income
|
race
sex |
Black
Other
White |
Total
-----------+---------------------------------+---------Female | 19666.667
28500
31000 |
27100
| 577.35027 2121.3203 16583.124 | 12251.531
|
3
2
5 |
10
-----------+---------------------------------+---------Male | 32666.667
50000
52200 | 45444.444
| 12220.202
0
31051.57 | 24734.142
|
3
1
5 |
9
-----------+---------------------------------+---------Total | 26166.667 35666.667
41600 | 35789.474
| 10515.069 12503.333 25992.307 | 20868.847
|
6
3
10 |
19
. tab sex race, row col chi2
Same as the earlier crosstab, but calculates Pearson's chi-squared for the hypothesis that
race and sex are independent.
|
race
sex |
White
Black
Other |
Total
-----------+---------------------------------+---------Female |
5
3
2 |
10
|
50.00
30.00
20.00 |
100.00
|
50.00
42.86
66.67 |
50.00
-----------+---------------------------------+---------Male |
5
4
1 |
10
|
50.00
40.00
10.00 |
100.00
|
50.00
57.14
33.33 |
50.00
-----------+---------------------------------+---------Total |
10
7
3 |
20
|
50.00
35.00
15.00 |
100.00
|
100.00
100.00
100.00 |
100.00
Pearson chi2(2) =
0.4762
Pr = 0.788
(Notice that we've started getting into performing statistical tests!) The tab command
usually only makes sense for categorical variables. Trying to tab an income variable with
hundreds of different values wouldn't be very enlightening. Stata sensibly refuses to do a
two-way table when the table would be excessively large.
Statistical tests
To test whether there is a significant difference in the means between two groups, use the
command ttest:
. ttest income, by(sex)
Two-sample t test with equal variances
-----------------------------------------------------------------------------Group |
Obs
Mean
Std. Err.
Std. Dev.
[95% Conf. Interval]
---------+-------------------------------------------------------------------Female |
10
27100
3874.274
12251.53
18335.78
35864.22
Male |
9
45444.44
8244.714
24734.14
26432.1
64456.79
---------+-------------------------------------------------------------------combined |
19
35789.47
4787.642
20868.85
25731.01
45847.94
---------+-------------------------------------------------------------------diff |
-18344.44
8806.464
-36924.46
235.5702
-----------------------------------------------------------------------------diff = mean(Female) - mean(Male)
t = -2.0831
Ho: diff = 0
degrees of freedom =
17
Ha: diff < 0
Pr(T < t) = 0.0263
Ha: diff != 0
Pr(|T| > |t|) = 0.0527
Ha: diff > 0
Pr(T > t) = 0.9737
The by() construct is used to specify the variable that divides the dataset into two groups.
If the sex variable had three categories, say Male, Female, and Unknown, the ttest
command would not work. However, you could give the command
. ttest income if sex != "Unknown", by(sex)
which would cut the number of categories back down to two.
regress (abbreviated reg) is the command that runs simple linear regression. A binary
regression of income on age would look like this:
. reg income age
Source |
SS
df
MS
-------------+-----------------------------Model |
26697204
1
26697204
Residual | 7.8125e+09
17
459556511
-------------+-----------------------------Total | 7.8392e+09
18
435508772
Number of obs
F( 1,
17)
Prob > F
R-squared
Adj R-squared
Root MSE
=
19
=
0.06
= 0.8124
= 0.0034
= -0.0552
=
21437
-----------------------------------------------------------------------------income |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+---------------------------------------------------------------age | -78.67952
326.4362
-0.24
0.812
-767.3997
610.0406
_cons |
38891.1
13776.23
2.82
0.012
9825.808
67956.4
------------------------------------------------------------------------------
A multivariate regression might include sex as a control variable, as men and women
have different mean incomes. However, running
. reg income age sex
produces the error message
no observations
This is because sex is a string variable, with values of Male and Female. To include sex
in this regression, we need to modify the variable to make it useable.
Creating and modifying variables
The basic commands for creating new variables and modifying old ones in Stata are
generate (abbreviated gen), egen and replace.
The command gen variablename = something creates a new variable named
variablename and sets it equal to something. Something can be a simple number, a string,
a mathematical expression, or a function of other variables.
.
.
.
.
.
gen
gen
gen
gen
gen
one = 1
two = 1+1
three = one+two
bmi = (weight/(height*height)) * 703
approx_health_score = round(health_score,1)
There is another command, egen, that is also used to create new variables. egen works
with a different set of functions than gen. There is no particular logic as to why there are
two variable creation commands - it's just an oddity that has to do with the way Stata was
written. Most egen functions work across all observations to produce variables that
summarize other variables. For example:
. egen max_weight = max(weight)
. egen sum_of_all_weights=sum(weight)
For information on what gen functions there are, search for "functions" in Stata's online
help. For information on egen functions, search for "egen".
The if qualifier
The if qualifier is used to isolate a set of observations with variables meeting some
particular criteria. Values on variables in a dataset are compared to values on other
variables or to numbers or strings using logical comparison operators. This is very often
used to create "dummy variables", 0-1 indicators used to indicate whether something is
true or false.
Operator Meaning
==
equal to
>
greater than
>=
greater than or equal to
<
less than
<=
less than or equal to
!= or ~=
not equal to
Pay special attention to that double equals sign! If you are testing for equality, use a
double equals sign (==). A single equals sign (=) is used to set something equal to
something else.
For example, if you want to list all information for a person in your dataset whose first
name is Sarah, you would type:
. list if name=="Sarah"
String values always need to be put in quotes.
To display the names of people with an income less than or equal to $40,000:
. list name if income<=40000
Combining tests: and and or
if on its own is useful if you are interested in testing for only one thing at once, such as
the condition "Female". But let's say you want to find out the mean income for women in
your dataset between the ages of 25 and 34. What you need to do is take this series of
tests and combine them with the and operator, &.
. sum income if sex=="Female" & age >=25 & age <= 34
Note that the if statement is included only once, and then the tests are simply stated one
after another. Also note that you need to write out the entire test statement each time:
. sum income if age >=25 & <= 34
is not allowed.
If you want to look at cases where at least one of two or more conditions is met, the or
operator | is needed.
. gen child_of_immigrant=1 if birthplace_mother!="USA" |
birthplace_father !="USA"
It is possible to combine the & and | operators. It's good practice to group the statements
using parentheses. Since we didn't like the way we generated "child of immigrant" the
first time, we need to drop it using the drop command before recreating it.
. drop child_of_immigrant
. gen child_of_immigrant=1 if (birthplace_mother!="USA" |
birthplace_father !="USA") & (birthplace=="USA" &
birthplace_mother!="" & birthplace_father !="")
Notice that this time a test for missing birthplaces was included.
The replace command is used to make changes to existing variables:
. gen heavy=1 if bmi>=30
. replace heavy = 0 if bmi<30
This pair of commands looks correct, but is actually wrong, for a subtle, Stata-specific
reason. A peculiarity of Stata is that numerical missing, represented as a period (.), is
internally treated as an infinitely large number, the highest number possible. So if you are
testing for values greater than some number, missing values will always be included. This
can produce very strange results -- for example, people who show up as being both senior
citizens and in high school, because their age was missing. The correct sequence of
commands is:
. gen heavy=1 if bmi>=30 & bmi!=.
. replace heavy = 0 if bmi<30
The moral is, always check your variable creation statements and then ask yourself,
"What is happening to the missings?"
Missing string values are represented by the empty string, "".
Replace can be used with all gen functions to change the values of existing variables, but
not with egen functions. However, you can use replace to modify variables created by
egen as well as those created by gen.
You normally want to use replace for second and later steps in multi-step variable
creations, just as we used it here. It is bad practice to "write over" existing variables,
because if you make a mistake there's no way to get the original data back. For example,
even if you decided that you only cared about health scores rounded to the nearest
integer,
. replace health_score = round(health_score,1)
is not recommended. It's always better to create a new variable. Stata doesn't have an
"undo" command, so once something is gone, it's gone!
Finally, we're ready to return to the sex and regression problem. Variables like sex, where
the values have no numeric meaning, are called categorical. Variables like this are
included in regressions by creating dummy variables, which are variables with the values
of 0 and 1. For sex, we create a variable called female with a value of 1 meaning female
and a value of 0 meaning not female (or male).
. tab sex
sex |
Freq.
Percent
Cum.
------------+----------------------------------Female |
10
50.00
50.00
Male |
10
50.00
100.00
------------+----------------------------------Total |
20
100.00
. gen female = 1 if sex == "Female"
. replace female = 0 if sex == "Male"
. regress income age female
A similar procedure would be followed for race:
. tab race
race |
Freq.
Percent
Cum.
------------+----------------------------------Black |
7
36.84
36.84
Other |
2
10.53
47.37
White |
10
52.63
100.00
------------+----------------------------------Total |
19
100.00
.
.
.
.
gen black = 1
replace black
gen other = 1
replace other
if race == "Black"
= 0 if race != "Black" & race != ""
if race == "Other"
= 0 if race != "Other" & race != ""
. regress income age female black other
The save and use Commands
You can save your data in Stata format using the save command.
. save sample
This saves all the new variables you've created.
You can then open the Stata data set with the use command.
. use sample
Graphs
While Stata's graphs are often not the most visually appealing, they can be quite useful
for diagnosing data problems and for illustrating the results of analyses. For discovering
the different kinds of graphs available, it is often helpful to explore the menus, one of the
cases where Stata's menu interface is really helpful.
The basic command for drawing a bivariate graph in Stata is twoway. The command
twoway is followed by a keyword indicating the type of graph. To do a scatter plot
showing the relationship between age and income, type
. twoway scatter income age
To use the menus, click on graphics, twoway graph (scatterplot, line etc), then choose
scatter from the list on the right and select age and income as the x and y variables.
Twoway graphs can be overlaid; that is, you can draw two twoway graphs on the same
set of axes. A common use of this is to draw a scatterplot with a linear fit line laid
overtop of it to show how the regression line fits the data.
. twoway (scatter income age) (lfit income age)
To use the menus, click on graphics, overlaid twoway graphs, then choose a scatterplot
as above for plot 1 and a linear prediction with age and income as the x and y variables
for plot 2.
Download