Stata Commands for Module 3

Stata Hand-out for Module 3
Stata 10
Now that Stata 10 is on Citrix, I will use Stata 10 rather than
8 for these handouts. I hope that the use of Stata 8 in the
previous module did not confuse people.
Odum Workshop Handouts
Please refer to these handouts from Cathy Zimmer’s Odum
Institute workshop on Stata; this information will be useful to
you in Soc 708:
http://www.odum.unc.edu/odum/content/pdf/statahandout1.pdf
http://www.odum.unc.edu/odum/content/pdf/statahandout2.pdf
http://www.odum.unc.edu/odum/content/pdf/statahandout3.pdf
Odum Workshop Offered Again in November
The workshop will be offered again November 4-6, 3:30-5:00 PM, in
Manning Hall Room 01 (the computer lab at Odum Institute).
Registration is not required. Bring a flash drive.
Addenda to Module 2:
(1) How to put a density curve on top of a histogram? Turns out
it’s very simple. For the example in Module 2, you would simply
include “kdensity” somewhere in the clause after the comma.
(Thanks Francois!)
This command will give you a histogram with the number of bins
that Stata chooses automatically:
. histogram infant, kdensity
This next command will give you a histogram like the one we did
before, with 14 bins, titles, and a different vertical axis
(type it as a single line in the command window):
. histogram infant if infant~=., bin(14) frequency ytitle(Count) xtitle(Infant Mortality Rate) title(Distribution of Infant Mortality Rates) kdensity
(2) Doing calculations in Stata:
. display (2+2+3+4+10)/5
And Stata gives you this output:
4.2
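The same “display” command works for other arithmetic too. Here is a minimal sketch of a few more calculations; the numbers are made up for illustration and only the first line comes from the example above.

```stata
* Stata as a calculator: display evaluates an expression and prints the result
* The mean of five numbers, as in the example above
display (2+2+3+4+10)/5
* Square root (prints 4)
display sqrt(16)
* Exponentiation (prints 8)
display 2^3
* Squared deviation of the value 10 from the mean 4.2
display (10 - 4.2)^2
```

Lines that start with * are comments, so you can paste the whole block into the command window or a do-file.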
Helpful tips:
1. To find out more about any Stata command, use the help function within Stata. Go to
Help, Choose “Stata command”, and type “lfit” (lfit is a command). Or just type “help
lfit” into the command window.
2. To use a variable name in a command, you have a few options. You can type the whole
variable name, or:
a. just type part of the variable name – if just a few characters can uniquely identify the
variable, Stata will supply the rest of the name.
b. click the variable as it appears in the variable window (or double-click, depending
on your set-up)
3. If you would like to re-do a command that you just did, you can press “page up” and the
previous command will appear. If you press “page up” again it will supply the command
before that, etc.
4. If you would like a record of what you did,
a. you can keep a log of your session, or part of a session. In Stata 10, go to “file”
and choose “log”, or click on the brown rectangle in the tool bar. Then it will ask
you where to save it and in what format.
b. You can save everything you have done during a Stata session that is in the
“Review” window by clicking on the white square in the top left corner and
choosing “Save review contents”.
c. You could use a ‘do-file’ – write all your commands in one lengthy text file and
then have Stata run them all at once. This is probably not useful to you for
homework, but is necessary for some
5. In an “if” clause, you are telling Stata to execute a command depending on the value of a
variable. Examples:
a. If you want Stata to only display points in a scatter plot if the value of variable
“age” equals 30, you would include this somewhere in the command: “if
age==30”. == is just two = signs so it’s like saying “if age equal equal 30”
b. When the variable you are using for your “if” clause is a “string” variable,
meaning it is not understood by Stata to be numerical, you put quotation marks
around the value of the variable. So, in the case of the scatterplot below where we
use an “if” clause for the variable “type”, you write “if type=="bc"”
c. If you would like Stata to run the command to EXCLUDE observations with a
certain value, use “~” – “if age~=30” means the command applies to observations
where age is not equal to 30. You only have to type one = sign after the ~.
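Tips 4 and 5 can be tried together in a short do-file. This is only a sketch: it uses the “auto” example dataset that ships with Stata rather than the course data, and the log file name module3_example.log is a made-up name.

```stata
* Sketch of a do-file: keep a log and try out "if" clauses
* Close any log that is already open (capture ignores the error if none is)
capture log close
* module3_example.log is a hypothetical file name
log using module3_example.log, text replace
* Load the example dataset that comes with Stata
sysuse auto, clear
* Numeric variable: == needs no quotation marks
twoway (scatter price mpg if foreign==1)
* Exclude observations with a certain value using ~=
tabulate rep78 if rep78~=3
* String variable: the value goes in quotation marks
summarize price if make=="AMC Concord"
log close
```

Save the lines in a file ending in .do and run it with “do filename”, or paste them into the Do-file Editor and run them from there.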
In Stata 10, to make a scatterplot you would go to Graphics menu, choose “twoway graph
(scatter, line etc.)”. Then choose “create” to define a new plot. Then choose “Basic plot” and
“scatter” and select Y and X variables – income is the y variable and education is the x variable.
You can select them from the drop-down boxes.
Stata produces this command and graph:
. twoway (scatter income education)
[Figure: scatterplot with income on the vertical axis (0 to 25000) and education on the horizontal axis (6 to 16).]
To identify each point in the scatter plot with information about a third variable “type”, you can
have Stata label the points. This seems to be the simplest way to get information about a third
variable in a scatterplot, but as you can see it can be hard to read!
To do it, you go back into your scatterplot and on the same screen where you selected the Y and
X variables, click “Marker properties”. Then check the box for “Add labels to markers” and
specify which variable it should use for labeling. In this case, we choose “type” as the variable.
The values of “type” are bc, prof, and wc, for blue collar, professional and white collar. (You can
confirm that these are the values with the command “tab type”, which will show you a table of
type with its three values, plus N/A.) For some reason, “type” was stored as a “string” variable –
instead of being coded as numbers they were stored as letters. Storing as a string variable can
have consequences for how data are handled in Stata (usually it’s better to code values as
numbers).
Here is the Stata command:
. twoway (scatter income education, mlabel(type))
[Figure: the same scatterplot of income (vertical axis, 0 to 25000) against education (horizontal axis, 6 to 16), with each point labeled bc, wc, prof, or NA according to “type”.]
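On the point that “type” is stored as a string: one possible way to get a numeric version is the “encode” command. The sketch below assumes the data from this handout are in memory, and type_num is a made-up name for the new variable.

```stata
* Assumes the handout's data are in memory and "type" is a string variable.
* type_num is a hypothetical name for the new numeric variable.
encode type, generate(type_num)
* Show which number was assigned to which value of type
label list type_num
* The table looks the same as "tab type" because the labels are displayed
tabulate type_num
* The nolabel option shows the underlying numeric codes instead
tabulate type_num, nolabel
```

Stata assigns the codes in the sort order of the original strings, so check the “label list” output before writing an “if” clause such as “if type_num==2”.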
To make a graph that looks like the colorful one in R (lecture notes), I had to create 3 plots,
use an “if” clause for each one, and choose a color and symbol for each. The “if” clause specifies
which value of “type” will be displayed (e.g., if type=="bc"). (This procedure is more
complicated than the one in R! And if you want to superimpose lines, you would use the “lfit”
command on 3 additional plots!)
. twoway (scatter income education if type=="bc", mcolor(red) msymbol(circle_hollow)) (scatter income education if type=="prof", mcolor(midgreen) msymbol(triangle_hollow)) (scatter income education if type=="wc", mcolor(blue) msymbol(plus)), legend(off)
[Figure: scatterplot of income (vertical axis, 0 to 25000) against education (horizontal axis, 6 to 16), with a different color and symbol for each value of “type”.]
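Superimposing a line for each group means adding 3 “lfit” plots to the 3 scatter plots. Here is one way it might look, assuming the handout's data (income, education, type) are in memory; the line colors are my choice to match the markers.

```stata
* Assumes the handout's data (income, education, type) are in memory.
* The /// at the end of a line continues the command, but only in a do-file.
* In the command window, type everything on one line without the ///.
twoway (scatter income education if type=="bc", mcolor(red) msymbol(circle_hollow)) ///
       (scatter income education if type=="prof", mcolor(midgreen) msymbol(triangle_hollow)) ///
       (scatter income education if type=="wc", mcolor(blue) msymbol(plus)) ///
       (lfit income education if type=="bc", lcolor(red)) ///
       (lfit income education if type=="prof", lcolor(midgreen)) ///
       (lfit income education if type=="wc", lcolor(blue)), ///
       legend(off)
```

Each “lfit” plot repeats the “if” clause of the scatter plot it belongs to, so each line is fitted only to that group's points.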
For the scatterplot of prestige and education, you will also use more than one plot. First make the
scatter plot, with “prestige” as the y-variable and “education” as the x-variable. Then make a
second plot, a “fit plot”, and choose “linear prediction”. This plot will put a line over the
scatterplot:
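. twoway (scatter prestige education) (lfit prestige education)
[Figure: scatterplot of prestige (vertical axis, 20 to 100) against education (horizontal axis, 6 to 16) with the fitted line; legend entries: prestige, Fitted values.]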
Correlation coefficients slides: These calculations can be done in Excel.
Here is how you get the correlation coefficient in Stata:
. pwcorr education prestige
And it produces this:
             | educat~n prestige
-------------+------------------
   education |   1.0000
    prestige |   0.8502   1.0000
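If you want to do more with the correlation than read it off the table, “pwcorr” leaves it behind as r(rho). A small sketch, assuming the handout's data are in memory:

```stata
* Assumes the handout's data (education, prestige) are in memory.
pwcorr education prestige
* pwcorr stores the coefficient in r(rho), so it can be reused right away
display r(rho)
* The square of r: with one x-variable this equals R-squared in a regression
display r(rho)^2
* The obs and sig options add the number of observations and the p-value
pwcorr education prestige, obs sig
```

With the value shown above, 0.8502 squared is about 0.7228.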
For the slide titled “Least Squares Regression”, you can refer
back to the scatterplot with line above. The Stata command is
“reg”. After “reg” (for regress) you list the y-variable
(a.k.a. dependent variable, response variable) first, then the
x-variable (independent variable, explanatory variable).
. reg prestige education
      Source |       SS       df       MS              Number of obs =     102
-------------+------------------------------           F(  1,   100) =  260.75
       Model |  21608.4361     1  21608.4361           Prob > F      =  0.0000
    Residual |     8286.99   100     82.8699           R-squared     =  0.7228
-------------+------------------------------           Adj R-squared =  0.7200
       Total |  29895.4261   101  295.994318           Root MSE      =  9.1033

------------------------------------------------------------------------------
    prestige |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   education |   5.360878   .3319882    16.15   0.000     4.702223    6.019533
       _cons |  -10.73198   3.677088    -2.92   0.004    -18.02722   -3.436743
------------------------------------------------------------------------------
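The coefficients define the fitted line: predicted prestige = _cons + (coefficient on education) x education. As a sketch of how to use them after “reg”, assuming the handout's data are in memory (the value 12 and the names yhat and resid are only for illustration):

```stata
* Assumes the handout's data (prestige, education) are in memory.
regress prestige education
* Predicted prestige at 12 years of education, using the stored coefficients
display _b[_cons] + _b[education]*12
* Fitted value for every observation; yhat is a hypothetical variable name
predict yhat
* Residual (observed minus fitted); resid is a hypothetical variable name
predict resid, residuals
summarize yhat resid
```

With the coefficients in the table, the prediction at 12 years works out to about 53.6 (that is, -10.73 + 5.36 x 12).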
For the slide with 2 lines of different colors representing 2
different equations, I did not provide Stata commands because
there are no R commands provided.
The next four plots use the data on the CD that came with the book. Use
ta02006. I downloaded it as an Excel file, then copied and pasted it into
Stata. To do this, you open Stata and go to the editor window
(or just type “edit”) and you see empty rows and columns. Then
just paste what you copied from Excel.
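. twoway (scatter y1 x) (lfit y1 x)
[Figure: scatterplot of y1 (vertical axis, 4 to 12) against x (horizontal axis, 4 to 14) with the fitted line; legend entries: y1, Fitted values.]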
[To save time, don’t use the graphic interface to do all 4 of
these. Just hit “page up” and change the variable names while
retaining the commands.]
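. twoway (scatter y2 x) (lfit y2 x)
[Figure: scatterplot of y2 (vertical axis, 2 to 10) against x (horizontal axis, 4 to 14) with the fitted line; legend entries: y2, Fitted values.]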
. twoway (scatter y3 x) (lfit y3 x)
[Figure: scatterplot of y3 (vertical axis, 4 to 12) against x with the fitted line; legend entries: y3, Fitted values.]
. twoway (scatter y4 x4) (lfit y4 x4)
[Figure: scatterplot of y4 (vertical axis, 6 to 14) against x4 (horizontal axis, 5 to 20) with the fitted line; legend entries: y4, Fitted values.]
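Another way to avoid retyping the four commands is a loop. This is a sketch only: it assumes y1, y2, y3, x, y4, and x4 from ta02006 are in memory, and the graph names g_y1 through g_y4 are made up.

```stata
* Assumes y1, y2, y3, x, y4, and x4 from ta02006 are in memory.
* The graph names g_y1 ... g_y4 are hypothetical.
* The first three plots share the same x-variable, so loop over the y-variables
foreach v in y1 y2 y3 {
    twoway (scatter `v' x) (lfit `v' x), name(g_`v', replace)
}
* The fourth plot uses its own x-variable, x4
twoway (scatter y4 x4) (lfit y4 x4), name(g_y4, replace)
* Put the four named graphs side by side in one figure
graph combine g_y1 g_y2 g_y3 g_y4
```

Inside the loop, `v' (a left quote, the name, then an apostrophe) is replaced by y1, y2, and y3 in turn.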
Seeing the influence of the single outlier again requires multiple plots in Stata. Plot 1 was a
scatterplot of height and weight, Plot 2 a line fit plot for height and weight, and Plot 3 a line fit
plot for height and weight that excludes observation #12. To see which one that is, you can type
“browse” or “br” to look at your data. To exclude observation 12 from the 3rd plot, you use an
“if” clause – “if id~=12”. The ~ means “not”, so this clause says to run the command only for
observations that do NOT have the id number 12.
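. twoway (scatter weight height) (lfit weight height) (lfit weight height if id~=12)
[Figure: scatterplot of weight (vertical axis, 50 to 200) against height (horizontal axis, 50 to 200) with two fitted lines; legend entries: weight, Fitted values, Fitted values.]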
Stata automatically made the lines two different colors. The red line excludes the outlier.
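Because both lines are labeled “Fitted values” in the legend, it may help to label them yourself. A sketch using the legend option, assuming the handout's data are in memory; the label text is my own wording.

```stata
* Assumes the handout's data (weight, height, id) are in memory.
* The /// at the end of a line continues the command, but only in a do-file.
* In the command window, type everything on one line without the ///.
twoway (scatter weight height) ///
       (lfit weight height) ///
       (lfit weight height if id~=12), ///
       legend(order(2 "All observations" 3 "Excluding id 12"))
```

The numbers in order() refer to the plots in the order they were typed, so 2 and 3 are the two fitted lines and the scatter plot (plot 1) is left out of the legend.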
Here are the relevant regression equations. The first includes observation 12, and the second
excludes it using “if id~=12”.
. reg weight height
      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  1,   198) =    7.39
       Model |  1630.88712     1  1630.88712           Prob > F      =  0.0072
    Residual |  43713.1129   198  220.773297           R-squared     =  0.0360
-------------+------------------------------           Adj R-squared =  0.0311
       Total |       45344   199  227.859296           Root MSE      =  14.858

------------------------------------------------------------------------------
      weight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      height |   .2384059   .0877159     2.72   0.007     .0654286    .4113832
       _cons |   25.26623   14.95042     1.69   0.093    -4.216263    54.74872
------------------------------------------------------------------------------
. reg weight height if id~=12
      Source |       SS       df       MS              Number of obs =     199
-------------+------------------------------           F(  1,   197) =  288.25
       Model |  20941.4871     1  20941.4871           Prob > F      =  0.0000
    Residual |  14312.0205   197  72.6498502           R-squared     =  0.5940
-------------+------------------------------           Adj R-squared =  0.5920
       Total |  35253.5075   198  178.048018           Root MSE      =  8.5235

------------------------------------------------------------------------------
      weight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      height |   1.149222   .0676889    16.98   0.000     1.015734     1.28271
       _cons |   -130.747   11.56271   -11.31   0.000    -153.5496   -107.9444
------------------------------------------------------------------------------
Extra Material:
As we saw with “histogram infant, kdensity”, you can add options after a command
with a comma. Here is an option that may be used with the “reg” command: beta.
“beta asks that standardized beta coefficients be reported instead of
confidence intervals. The beta coefficients are the regression
coefficients obtained by first standardizing all variables to have a
mean of 0 and a standard deviation of 1. beta may not be specified
with vce(cluster clustvar) or the svy prefix.”
Here is how you use it, and what it does:
. reg prestige education, beta
      Source |       SS       df       MS              Number of obs =     102
-------------+------------------------------           F(  1,   100) =  260.75
       Model |  21608.4361     1  21608.4361           Prob > F      =  0.0000
    Residual |     8286.99   100     82.8699           R-squared     =  0.7228
-------------+------------------------------           Adj R-squared =  0.7200
       Total |  29895.4261   101  295.994318           Root MSE      =  9.1033

------------------------------------------------------------------------------
    prestige |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
   education |   5.360878   .3319882    16.15   0.000                 .8501769
       _cons |  -10.73198   3.677088    -2.92   0.004                        .
------------------------------------------------------------------------------
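To see what “standardizing all variables” means, you can standardize the two variables yourself and rerun the regression. This is a sketch, assuming the handout's data are in memory and that neither variable has missing values; z_prestige and z_education are made-up names.

```stata
* Assumes the handout's data (prestige, education) are in memory.
* z_prestige and z_education are hypothetical variable names.
* std() rescales a variable to mean 0 and standard deviation 1
egen z_prestige = std(prestige)
egen z_education = std(education)
* Regress the standardized y on the standardized x
regress z_prestige z_education
* Compare the slope with the Beta column from the original regression
regress prestige education, beta
```

Under those assumptions the slope in the first regression should match the Beta of .8501769, and with only one x-variable it should also match the correlation coefficient of 0.8502 reported by “pwcorr”.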