Generating variables

LECTURE: STATA
Data in memory
Stata thinks of your data as a giant matrix with rows and columns. Each row is an
observation and each column is a variable.
Draw a picture.
Some commands do one thing with certain columns, such as regress or tabulate.
Other commands do something to every row, such as generating a new variable.
Stata keeps the entire data set in RAM all the time, which is what makes it fast. It also
means that you are limited in the size of the data sets you can examine.
Programs
Stata programs are called “.do” files.
You can create your own do files. This is a good thing to do because you will likely want
to run your commands many times, each time with slight changes.
Stata has a built-in editor for creating do files, and you can also create them in Notepad.
They are just plain text (ASCII) files.
Log files
The commands
log using [filename], replace
log close
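open and close a log file, which records your commands and their output on disk. The
replace option overwrites any existing log file with the same name.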
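A minimal sketch of how these two commands might bracket a session. The file name mylog.log is a placeholder, and the capture prefix and the text option are additions that the notes above do not discuss.

```stata
* Close any log left open by an earlier run. The capture prefix
* suppresses the error Stata would raise if no log is open.
capture log close

* Start a plain-text log, overwriting any existing file with this name.
* (mylog.log is a placeholder file name.)
log using mylog.log, replace text

* Everything printed from here on is also written to the log.
display "This line appears in the log."

* Stop logging and close the file.
log close
```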
Delimiters
Normally, your commands can be at most one line long. If you want to use longer
commands, you need to use a delimiter. That is CS jargon for a symbol that tells Stata
when each command ends. For example
#delimit ;
at the start of a program indicates that you will use semicolons as delimiters, as in C or
Pascal.
You must then end each command with a semicolon or Stata will become confused.
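As a sketch of what this looks like inside a do file: the example below uses auto, an example data set that ships with Stata, only so that it runs on its own; it is not part of these notes. Note that #delimit works in do files, not in commands typed interactively.

```stata
* Switch to semicolons as the command delimiter.
#delimit ;

* From here on, comments must also end with a semicolon. ;
sysuse auto, clear ;

* One command can now span several lines. ;
summarize price
          mpg
          weight ;

* Return to the default, where each line is one command. ;
#delimit cr

summarize price
```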
Comments
Stata interprets lines starting with an asterisk as comments, which means it just ignores
them. They are for you to document what you have done, or at least what you think you
are doing.
Data sets
To use a data set already in Stata format, you use the use command, as in
use mydata.dta
To save a Stata data set to disk, use the save command, as in
save mydata.dta
Note that in both cases you may need to give a full directory address, which would
appear in double quotes, as in,
use "c:\important\422\data\mydata.dta"
To clear the working memory in preparation for loading another data set, use the clear
command, as in
clear
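A short sketch of how these commands tend to fit together in practice. The file name mycopy.dta is a placeholder, and the auto data set shipped with Stata stands in for your own data so that the example runs by itself. The clear and replace options are additions to what the notes describe: use refuses to load a file over unsaved data unless you clear memory first, and save refuses to overwrite an existing file unless you say replace.

```stata
* Load an example data set shipped with Stata, discarding whatever
* is currently in memory.
sysuse auto, clear

* Save it to the working directory; replace allows overwriting an
* existing file of the same name (mycopy.dta is a placeholder).
save mycopy.dta, replace

* Empty the working memory.
clear

* Load the saved file. The clear option on use does the same job as
* a separate clear command issued just before it.
use mycopy.dta, clear
describe
```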
Generating variables
Creating a new variable
To create new variables use the generate command, which can be abbreviated gen.
For example, to create a new variable that is a constant equal to 5, you would say,
gen x = 5
where this creates a variable called x and sets its value to 5 for every observation.
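The right-hand side of generate can also be an expression in other variables. A small sketch, using a made-up three-observation data set (the values are invented purely for illustration) and the variable names age and earn:

```stata
* Toy data, invented for illustration only.
clear
input age earn
25 20000
40 45000
55 60000
end

* Age squared, computed observation by observation.
gen agesq = age^2

* Natural log of earnings.
gen lnearn = ln(earn)

list

* generate refuses to overwrite an existing variable; to change the
* values of agesq you would use replace instead.
```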
Using replace and if to create a new variable with multiple values
You will often want to create a new variable that takes different values depending on
the value of another variable.
For example, suppose that you have a variable for years of schooling and you want to
create a dummy variable for having at least a high school education. Then you would do
the following:
gen hs = 0
replace hs = 1 if yrschl >= 12
Explain what the first generate command does.
Explain what the replace command does.
Note that Stata uses a double equals sign (==) to test for equality in a condition, as
opposed to the single equals sign used for assignment. This difference helps Stata
figure out what you mean.
Logical statements
Almost all Stata commands can include the if along with conditions on the sample to be
affected.
The if allows a number of logical operations including the following:
“==” logical equals
“!=” not equals
“>” greater than
“<” less than
“>=” greater than or equal to
“<=” less than or equal to
Note that the logical statements following the if can include both variables and
constants, as in the example of yrschl >= 12.
You can also use “&” for “and” and “|” for “or” to create more complicated logical
statements. For example, you might just want to replace the values of one variable if
both of two other variables meet specific conditions. For example,
replace hs = 1 if yrschl >= 12 & sex == 1
would set the high school dummy variable to one only for observations with both at least
12 years of school and with sex equal to one.
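To make the operators concrete, here is a small sketch on a made-up four-observation data set (values invented for illustration). The variable flag is a hypothetical name, and the count command is used only as a convenient way to see how many observations satisfy a condition.

```stata
* Toy data, invented for illustration only.
clear
input yrschl sex
10 1
12 1
16 0
12 0
end

* "and": at least 12 years of school AND sex equal to one.
count if yrschl >= 12 & sex == 1

* "or": fewer than 12 years of school OR sex equal to zero.
count if yrschl < 12 | sex == 0

* Parentheses make the grouping explicit when mixing & and |.
gen flag = 0
replace flag = 1 if (yrschl >= 12 & sex == 1) | yrschl >= 16

list
```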
Missing values
Stata’s missing value indicator is a period. Thus, to create a new variable and set all the
values to missing, you would use the command:
gen varname = .
If you use Stata’s missing value indicator, it will almost always handle the missing values
the way that you would want. It does not include them in means or other descriptive
statistics and it does not include them in regressions or other analyses.
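One case where you do need to take care is comparisons: Stata treats a missing value as larger than any number, so a condition such as yrschl >= 12 is true for an observation with missing yrschl. The sketch below shows the problem and one possible guard using the missing() function. The data are made up for illustration, and hs_naive is a hypothetical variable name.

```stata
* Toy data, invented for illustration only; the third observation
* has missing years of schooling.
clear
input yrschl
10
12
.
end

* Naive version: the observation with missing yrschl gets hs_naive = 1,
* because missing counts as larger than 12.
gen hs_naive = 0
replace hs_naive = 1 if yrschl >= 12

* Guarded version: hs stays missing when yrschl is missing.
gen hs = 0 if !missing(yrschl)
replace hs = 1 if yrschl >= 12 & !missing(yrschl)

list
```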
Generating aggregate variables
The extended generate command, called egen, creates variables that are functions of
more than one observation.
The functions include means, standard deviations, percentiles and so on.
For example,
egen mearn = mean(earn)
will produce a new variable called mearn that contains the mean of the variable earn in
the data set. Thus, this variable will have the same value for everyone.
The by option often makes egen much more useful. For example, to produce a new
variable that contains the mean earnings for persons with the same sex as each
observation, you would use the command
egen mearn = mean(earn), by(sex)
Assuming two sexes, male observations would have a value of mearn equal to the
mean of male earnings, and female observations would have a value of mearn equal to
the mean of female earnings.
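A sketch of egen with groups, on a made-up four-observation data set (values invented for illustration). The names sdearn and devearn are hypothetical, and the bysort prefix is shown as an alternative way of writing the grouping.

```stata
* Toy data, invented for illustration only.
clear
input sex earn
0 20000
0 30000
1 40000
1 60000
end

* Mean earnings within each sex, using the by() option.
egen mearn = mean(earn), by(sex)

* The same kind of grouping written with the bysort prefix:
* standard deviation of earnings within each sex.
bysort sex: egen sdearn = sd(earn)

* A group-level variable combines naturally with generate, for
* example each person's deviation from the mean for their sex.
gen devearn = earn - mearn

list
```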
Descriptive statistics
You should always begin any empirical exercise by getting to know your data really well.
This means spending some time just taking a look at the descriptive statistics and also
looking at simple bivariate relationships.
Looking at the data will help you find any data problems before you get going. Tell the
story of the 99s.
It will also give you a basic sense of the observations in your data, be they individuals,
firms, states or whatever.
The summarize command
The summarize command, abbreviated sum, presents means, standard deviations,
minima and maxima.
Thus,
summarize
will produce this information for every variable in the data set.
You can also tell the summarize command that you want information only about specific
variables, as in
summarize [varlist]
summarize age sex yrschl
If you want to know a lot about each variable, then you can use the detail option to
summarize, as in
summarize age, detail
Note that the option comes after a comma. This comma indicates to Stata that the main
part of the command is over and that the options are now about to begin.
The detail option gives you everything that the regular summarize command gives
you plus the four largest and smallest values, various percentiles of the variable, and
additional moments, such as the skewness and the kurtosis.
If you want to summarize a variable for certain subgroups, you can use the general by
feature of Stata. Here you might say something like
by sex: summarize age
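to get a separate summary of the age variable for observations having each value of the
sex variable. The data must already be sorted by sex for this to work; otherwise Stata
stops with a "not sorted" error.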
Note that the by syntax can be used with most Stata commands to apply them to various
subgroups.
Of course, the by variable should be one without very many values.
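A sketch of the by prefix on a made-up data set (values invented for illustration). It shows the sort step that by requires and the bysort shorthand, which sorts and applies the command in one go.

```stata
* Toy data, invented for illustration only.
clear
input sex age
1 25
0 40
1 55
0 30
end

* by requires the data to be sorted by the grouping variable.
sort sex
by sex: summarize age

* bysort does the sort and the by in a single step.
bysort sex: summarize age
```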
The tabulate command
The tabulate command, which can be abbreviated as tab, presents a table containing
each value of a variable and the number of observations that have that value. The basic
usage for a single variable is:
tab varname
Note that the tabulate command is most useful for variables that take on only a few
values. Tabulate will not work for continuous variables in large data sets, or it may give
you a very long log file where every value has only one observation, which is not very
useful.
You can examine the bivariate relationship between two binary or categorical variables
by doing cross-tabulations. For example, to see the relationship between sex and years of
schooling you could do
tab yrschl sex
This would give you a table with rows corresponding to each value of years of school and
columns corresponding to each value of sex. The cells contain the number of
observations with the corresponding values for each variable. You also get row and
column totals.
Do an example with hs and sex for a 2 x 2 table.
The tabulate command has a number of useful options, which include row, col, cell and
chi.
The row, col and cell options add the row, column and cell percentages to the table,
respectively. Often it is the percentages you are interested in, rather than, or in addition
to, the count of observations in each cell.
The chi option performs a chi-squared test of the null that the rows and columns in the
table are independent.
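A sketch of a 2 x 2 cross-tabulation of hs and sex with these options, on a made-up six-observation data set (values invented for illustration). The code writes the test option under its full name, chi2.

```stata
* Toy data, invented for illustration only.
clear
input hs sex
0 0
0 1
1 0
1 1
1 1
0 0
end

* Plain 2 x 2 table of counts with row and column totals.
tab hs sex

* Add row and column percentages to each cell.
tab hs sex, row col

* Chi-squared test of independence between rows and columns.
tab hs sex, chi2
```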
The correlate command
For examining the bivariate relationship between two continuous variables, you may
want to look at the sample correlation. The sample correlation between two variables can
be obtained with the command
correlate age earn
Note that you can get a whole table of correlations by including more than two variables
in the variable list.
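A sketch with three variables, on a made-up data set (values invented for illustration). The pwcorr command and its sig and obs options are additions not covered in these notes: correlate drops any observation with a missing value in any listed variable, whereas pwcorr uses all observations available for each pair.

```stata
* Toy data, invented for illustration only; one observation has
* missing years of schooling.
clear
input age earn yrschl
25 20000 12
40 45000 16
55 60000 .
30 28000 10
45 52000 14
end

* Correlation matrix using only observations complete on all three.
correlate age earn yrschl

* Pairwise correlations, with p-values and the number of
* observations used for each pair.
pwcorr age earn yrschl, sig obs
```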
Testing differences in means
Stata has handy commands for running t-tests of differences in means built in. There are
three basic versions.
Testing whether the mean of a variable equals a constant
Here we want to test the null that the mean of a variable equals a particular number. For
that task, we use a command like
ttest varname == #
or, by example
ttest age == 32
Testing whether the mean of one variable equals that of another variable
Here we want to test the equality of the means of two variables. For example, we might
have information on earnings measured in two ways, perhaps once from a survey and
once from administrative records.
The general form of the command is
ttest varname1 == varname2
A specific example is
ttest aearn == searn
Testing whether subgroups have equal means of a particular variable
Here we might want to see if two subgroups have the same mean of, say, earnings.
The general form is
ttest varname, by(groupvar)
A specific example is
ttest earn, by(sex)
Remarks
Each form gives the relevant sample means and estimated standard errors for the means,
along with the t-statistic and related p-value from the test.
The level() option allows you to pick a confidence level different from the default of
95 percent.
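A sketch of the one-sample and two-group forms together with level(), on a made-up six-observation data set (values invented for illustration). The unequal option, which drops the assumption that the two groups share a common variance, is an addition not discussed above.

```stata
* Toy data, invented for illustration only.
clear
input sex earn
0 20000
0 30000
0 25000
1 40000
1 60000
1 35000
end

* Test the null that mean earnings equal 30000.
ttest earn == 30000

* Test the null that the two sexes have equal mean earnings.
ttest earn, by(sex)

* The same test, reporting 99 percent confidence intervals.
ttest earn, by(sex) level(99)

* The same test without assuming equal variances in the two groups.
ttest earn, by(sex) unequal
```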
Running regressions
Running regressions in Stata is really simple.
The general form of the command is given by:
regress depvar [varlist]
where the depvar is the dependent variable and the [varlist] contains one or more
independent variables.
Thus, to estimate the model
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$,
you would use the command
regress y x1 x2
The output includes the estimated coefficients, their estimated standard errors, p-values
from tests of the nulls that each population coefficient equals zero, the p-value from an
F-test that all of the population coefficients other than the constant are zero, and so on.
Stata includes a constant in the regression automatically; you do not need to include one
in the variable list. The constant can be suppressed using the noconstant option.
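A sketch of the basic command on a made-up five-observation data set (values invented for illustration, so the estimates mean nothing). The _b[] and _se[] expressions, which retrieve a coefficient and its standard error after estimation, are additions not covered in these notes.

```stata
* Toy data, invented for illustration only.
clear
input y x1 x2
1 1 2
3 2 1
4 3 5
6 4 3
8 5 4
end

* Regression of y on x1 and x2; the constant is included automatically.
regress y x1 x2

* After estimation, individual results can be retrieved by name.
display _b[x1]
display _se[x1]
display _b[_cons]

* The same regression with the constant suppressed.
regress y x1 x2, noconstant
```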
What could be simpler?
There are, of course, many embellishments to the basic regression command that we will
discuss later on in the course.