Stata Hand-out for Module 4

Helpful tips
1. Creating a variable involves the “generate” (or “gen”) command. For example, “gen educ=.”
tells Stata to create a variable called “educ”. Because “.” is Stata’s numeric missing value, the new
variable is numeric and starts out missing for every observation.
See the extra material at the end of this module for more about generating variables.
2. If you don’t want to keep the variable you created, you can delete it by typing “drop” and
then the variable name. For example, “drop educ” tells Stata to drop the variable above.
3. If you have a really long, tedious command, like a graph with 6 overlaid plots, you will
probably NOT want to use the graphics interface, because typing the commands is faster once
you know what they are. Even better, copy and paste commands so you don’t have to keep
retyping the same ones. It sometimes helps to put the command in a text file (using WordPad
or NotePad) and use the search-and-replace function.
4. Some of the graphics commands are complicated in Stata. Don’t worry about this because
graphics are really not the point of the class!
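Tips 1 and 2 can be tried out without touching your real data. The following is a minimal sketch; the 5-observation scratch dataset is an assumption made only for illustration, and “clear” removes whatever data are currently in memory.

```stata
* Scratch dataset for practice (clear wipes the data in memory!)
clear
set obs 5

* "." is the numeric missing value, so educ is numeric and missing everywhere
gen educ = .
describe educ
count if missing(educ)

* Remove the variable again
drop educ
```

Here “count” should report 5, because every observation of the new variable is missing.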
Tables similar to those created in the slides can be made easily in Stata. To take an example that is
not in the lecture notes, if you open the dataset “Chile” you can make a table using two categorical
variables, “educ” and “vote”. I’m choosing these simply because they are categorical
variables; this is not theory-driven. Write “tab educ vote” to get a simple table. You
can add row percentages, column percentages, and/or cell percentages by writing
“row”, “col”, or “cell” (or all three, “row col cell”) after a comma. For example, to get row percentages:
From the dataset “Chile”
. tab educ vote, row

+----------------+
| Key            |
|----------------|
|   frequency    |
| row percentage |
+----------------+

           |                          vote
 education |         A          N         NA          U          Y |     Total
-----------+-------------------------------------------------------+----------
         1 |         0          0          0          1          0 |         1
           |      0.00       0.00       0.00     100.00       0.00 |    100.00
-----------+-------------------------------------------------------+----------
        NA |         0          2          1          3          5 |        11
           |      0.00      18.18       9.09      27.27      45.45 |    100.00
-----------+-------------------------------------------------------+----------
         P |        52        266         71        295        422 |     1,106
           |      4.70      24.05       6.42      26.67      38.16 |    100.00
-----------+-------------------------------------------------------+----------
        PS |        32        224         24         52        130 |       462
           |      6.93      48.48       5.19      11.26      28.14 |    100.00
-----------+-------------------------------------------------------+----------
         S |       103        397         72        237        311 |     1,120
           |      9.20      35.45       6.43      21.16      27.77 |    100.00
-----------+-------------------------------------------------------+----------
     Total |       187        889        168        588        868 |     2,700
           |      6.93      32.93       6.22      21.78      32.15 |    100.00
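The percentage options can be combined, and the “nofreq” option suppresses the counts so that only percentages are shown. A short sketch of these variations, assuming the Chile data are in memory:

```stata
* Row percentages only, without the frequency counts
tab educ vote, row nofreq

* Frequencies with both row and column percentages in each cell
tab educ vote, row col
```

With more than one kind of percentage in a cell, the Key box at the top of the output tells you the order in which they are printed.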
If there is another categorical variable, you may want to see what the same table
looks like for each value of that variable. For example, the prefix “by sex:” tells Stata to make two
tables, one for each value of sex. The data must first be sorted by that variable (sort
sex).
. sort sex
. by sex: tab educ vote, row

-------------------------------------------------------------------------------
-> sex = F

+----------------+
| Key            |
|----------------|
|   frequency    |
| row percentage |
+----------------+

           |                          vote
 education |         A          N         NA          U          Y |     Total
-----------+-------------------------------------------------------+----------
        NA |         0          2          0          1          4 |         7
           |      0.00      28.57       0.00      14.29      57.14 |    100.00
-----------+-------------------------------------------------------+----------
         P |        32        112         36        177        250 |       607
           |      5.27      18.45       5.93      29.16      41.19 |    100.00
-----------+-------------------------------------------------------+----------
        PS |        15         86          7         31         60 |       199
           |      7.54      43.22       3.52      15.58      30.15 |    100.00
-----------+-------------------------------------------------------+----------
         S |        57        163         27        153        166 |       566
           |     10.07      28.80       4.77      27.03      29.33 |    100.00
-----------+-------------------------------------------------------+----------
     Total |       104        363         70        362        480 |     1,379
           |      7.54      26.32       5.08      26.25      34.81 |    100.00

-------------------------------------------------------------------------------
-> sex = M

+----------------+
| Key            |
|----------------|
|   frequency    |
| row percentage |
+----------------+

           |                          vote
 education |         A          N         NA          U          Y |     Total
-----------+-------------------------------------------------------+----------
         1 |         0          0          0          1          0 |         1
           |      0.00       0.00       0.00     100.00       0.00 |    100.00
-----------+-------------------------------------------------------+----------
        NA |         0          0          1          2          1 |         4
           |      0.00       0.00      25.00      50.00      25.00 |    100.00
-----------+-------------------------------------------------------+----------
         P |        20        154         35        118        172 |       499
           |      4.01      30.86       7.01      23.65      34.47 |    100.00
-----------+-------------------------------------------------------+----------
        PS |        17        138         17         21         70 |       263
           |      6.46      52.47       6.46       7.98      26.62 |    100.00
-----------+-------------------------------------------------------+----------
         S |        46        234         45         84        145 |       554
           |      8.30      42.24       8.12      15.16      26.17 |    100.00
-----------+-------------------------------------------------------+----------
     Total |        83        526         98        226        388 |     1,321
           |      6.28      39.82       7.42      17.11      29.37 |    100.00
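The sort step and the “by” prefix can be combined using “bysort”. A one-line sketch, assuming the Chile data are in memory:

```stata
* Equivalent to "sort sex" followed by "by sex: tab educ vote, row"
bysort sex: tab educ vote, row
```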
From the dataset “Prestige”. Doing the regression of “prestige” on “income” and “type” requires a
little preparation in Stata. In module 2 we found out that we could not make a bar chart of “vote” in
Stata unless we created dummy variables for every value of vote. We need to do that again in this
module in order to analyze the relationship of prestige to income and type of occupation. (And again,
it turns out that R does not require this step.) Like “vote”, “type” is a categorical variable.
. tab type

       type |      Freq.     Percent        Cum.
------------+-----------------------------------
         NA |          4        3.92        3.92
         bc |         44       43.14       47.06
       prof |         31       30.39       77.45
         wc |         23       22.55      100.00
------------+-----------------------------------
      Total |        102      100.00
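The table shows there is a “NA” value, which we will want to turn into a missing value.
. replace type="" if type=="NA"
Generate dummy variables for the 3 remaining values. This uses “type_num”, the numeric version
of “type” (see the EXTRA section at the end of this handout for how to build it):
. tab type_num, gen(type)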
You will see that 3 new variables, type1, type2, and type3 have just been created. These are
“dummy” variables and correspond to the three categories of “type”. “Dummy” variables have
values of either 0 or 1 and are also called “indicator” variables and “dichotomous” variables.
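To see what the gen() option is doing, one of the dummies can be built by hand. This is only a sketch: it assumes the Prestige data are in memory and that “NA” has already been replaced by the empty string, and “bc_check” is a made-up variable name.

```stata
* 1 if blue collar, 0 otherwise; left missing where type is empty
gen byte bc_check = (type == "bc") if !missing(type)

* Cross-tabulate against the automatically created dummy to compare
tab bc_check type1, missing
```

The expression in parentheses is a logical test, which Stata evaluates to 1 when true and 0 when false.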
To see the variables we just created automatically, you can get tables of each variable. To save a
step, you can just write “tab1” and list all the variables you would like a table for in a single
command.
. tab1 type1 type2 type3

-> tabulation of type1

type_num==B |
 lue Collar |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         58       56.86       56.86
          1 |         44       43.14      100.00
------------+-----------------------------------
      Total |        102      100.00

-> tabulation of type2

type_num==P |
rofessional |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         71       69.61       69.61
          1 |         31       30.39      100.00
------------+-----------------------------------
      Total |        102      100.00

-> tabulation of type3

type_num==W |
hite Collar |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         79       77.45       77.45
          1 |         23       22.55      100.00
------------+-----------------------------------
      Total |        102      100.00
To make it easier on ourselves, let’s rename the variables so we can see at a glance what they refer
to. Do this with the command “rename” – rename oldvar newvar [“var” stands for “variable”]
. rename type1 type_BC
. rename type2 type_PROF
. rename type3 type_WC
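In Stata 12 and later, “rename” also accepts two parenthesized lists, so the three renames can be written as one command. A sketch, to be used in place of the three commands above rather than after them (once they have run, type1 to type3 no longer exist):

```stata
* Old names in the first list, new names in the same order in the second
rename (type1 type2 type3) (type_BC type_PROF type_WC)
```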
Now you can regress prestige on income and type. Because “type” is a categorical variable rather
than an ordinal or ratio variable, you have to put in the separate dummy variables. You can’t just put
in the original variable “type” because the categories are not numbers, they are categorical values:
blue collar, professional, white collar.
In this example, only type_PROF and type_WC are listed as independent variables. Leaving out
type_BC makes it the “reference category”: the coefficients on type_PROF and type_WC are
differences from blue collar.
. reg prestige income type_PROF type_WC

      Source |       SS       df       MS              Number of obs =     102
-------------+------------------------------           F(  3,    98) =  109.29
       Model |  23016.1084     3  7672.03612           Prob > F      =  0.0000
    Residual |  6879.31774    98  70.1971198           R-squared     =  0.7699
-------------+------------------------------           Adj R-squared =  0.7628
       Total |  29895.4261   101  295.994318           Root MSE      =  8.3784

------------------------------------------------------------------------------
    prestige |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      income |   .0014871   .0002428     6.12   0.000     .0010052    .0019691
   type_PROF |   24.42513   2.327585    10.49   0.000     19.80611    29.04414
     type_WC |   7.010142   2.125056     3.30   0.001     2.793037    11.22725
       _cons |   27.71983   1.749331    15.85   0.000     24.24834    31.19132
------------------------------------------------------------------------------
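In Stata 11 and later, factor-variable notation can build the dummies on the fly, provided the categorical variable is numeric. The following is a sketch rather than the method used in this handout. It assumes “type_num” is coded 0 = blue collar, 1 = professional, 2 = white collar, as in the EXTRA section, with the NA cases set to missing rather than 3.

```stata
* i. treats type_num as categorical; ib0. makes code 0 (blue collar) the base
reg prestige income ib0.type_num
```

The coefficients on the two non-base levels play the same role as those on type_PROF and type_WC. The string variable “type” itself cannot be used this way, because factor-variable notation requires a numeric variable.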
The following method is based on getting predictions based on the regression equation just obtained.
You can try this out if you’re feeling adventurous. (But don’t worry too much about graphics at this
stage):
Do the same regression (reg prestige income type_PROF type_WC)
Then these (as separate commands) – the first command obtains predicted values based on the
regression equation, the next command creates new variables for each category of “type”, and the
graph command tries to put this all together.
. predict yhat
. separate yhat, by(type)
The “predict” command stores the predicted y-values in a new variable, here called yhat.
The “separate” command creates 3 new yhat variables (yhat1, yhat2, yhat3) corresponding to the 3 values of type.
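After “reg”, the predicted value is the constant plus each coefficient times its variable, and Stata keeps the coefficients available as _b[varname]. The sketch below rebuilds the prediction by hand; it assumes it is run right after the regression above, and “yhat_check” is a made-up name. The “///” continuation only works in a do-file, so type the command on one line if you are working in the command window.

```stata
* Rebuild the prediction from the stored coefficients
gen yhat_check = _b[_cons] + _b[income]*income ///
    + _b[type_PROF]*type_PROF + _b[type_WC]*type_WC

* The two versions should agree up to rounding
summarize yhat yhat_check

* Predicted prestige for a professional with an income of 10000
display _b[_cons] + _b[income]*10000 + _b[type_PROF]
```

With the coefficients reported above, the last line works out to roughly 27.72 + 14.87 + 24.43, or about 67.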
This graph lets you see the different lines for different values of type by graphing the 3 yhats, and
then laying a regression line over the whole thing (lfit)
. graph twoway (scatter prestige yhat1 yhat2 yhat3 income, connect(i l l l) msymbol(o i i i) sort) (lfit prestige income)
[Figure: prestige (vertical axis, 0 to 100) plotted against income (horizontal axis, 0 to 25000). Legend: prestige; yhat, type == bc; yhat, type == prof; yhat, type == wc; Fitted values.]
To use an “interaction” in your regression equation, you can create a new variable for each
interaction. You use the “gen” command and create a new variable that is the product of the
two interacting variables. For this example, we want the interaction of income with two “levels”
(values) of the categorical variable “type”.
. gen PROF_income= type_PROF*income
. gen WC_income= type_WC*income
When you put these interaction terms in an equation you have to be sure to also include the regular
variables that were multiplied to get the interaction: income, type_PROF, and type_WC.
. reg prestige income type_PROF type_WC PROF_income WC_income

      Source |       SS       df       MS              Number of obs =     102
-------------+------------------------------           F(  5,    96) =   90.60
       Model |  24667.8187     5  4933.56373           Prob > F      =  0.0000
    Residual |  5227.60743    96   54.454244           R-squared     =  0.8251
-------------+------------------------------           Adj R-squared =  0.8160
       Total |  29895.4261   101  295.994318           Root MSE      =  7.3793

------------------------------------------------------------------------------
    prestige |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      income |   .0038699    .000492     7.87   0.000     .0028932    .0048466
   type_PROF |   43.60598   4.041315    10.79   0.000     35.58404    51.62793
     type_WC |    17.5677   5.174322     3.40   0.001     7.296751    27.83865
 PROF_income |  -.0030247   .0005512    -5.49   0.000    -.0041188   -.0019306
   WC_income |  -.0020176    .000947    -2.13   0.036    -.0038974   -.0001378
       _cons |   15.31756    2.77366     5.52   0.000     9.811886    20.82323
------------------------------------------------------------------------------
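In this model the coefficient on income is the slope for the reference category (blue collar), and each interaction coefficient is the difference in slope from blue collar. The “lincom” command adds coefficients together and reports a standard error for the sum. A sketch, assuming it is run right after the regression above:

```stata
* Slope of income for professionals: income + PROF_income
lincom income + PROF_income

* Slope of income for white collar workers: income + WC_income
lincom income + WC_income

* Joint test that both interaction terms are zero
test PROF_income WC_income
```

From the table above, the point estimates are .0038699 - .0030247 = .0008452 for professionals and .0038699 - .0020176 = .0018523 for white collar workers, so prestige rises with income most steeply among blue collar occupations.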
To see this as three lines conditional on “type” I had to predict yhat again. This time I labeled it
“yhat_inter” to indicate that these yhats were predicted based on the regression equation with an
interaction in it.
. predict yhat_inter
. separate yhat_inter, by(type)
. graph twoway scatter prestige yhat_inter1 yhat_inter2 yhat_inter3 income, connect(i l l l) msymbol(o i i i) sort
[Figure: prestige (vertical axis, 20 to 100) plotted against income (horizontal axis, 0 to 25000). Legend: prestige; yhat_inter, type == bc; yhat_inter, type == prof; yhat_inter, type == wc.]
Dot plots… don’t seem that great in Stata. Maybe one of you can improve on this!
From the dataset “Prestige”:
. graph dot (sum) type_BC type_PROF type_WC
This creates a (not very useful) graph, showing that there are about 23 white collar workers, 31
professionals, and 44 blue collar workers:
[Figure: dot plot with one row each for sum of type_BC, sum of type_PROF, and sum of type_WC, on a horizontal axis running from 0 to 40.]
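One possible improvement, offered only as a suggestion, is to let the over() option split the plot by a categorical variable instead of summing dummies. A sketch, assuming the Prestige data are in memory:

```stata
* One dot per occupation type showing mean prestige
graph dot (mean) prestige, over(type)

* Number of observations with non-missing prestige in each type, as horizontal bars
graph hbar (count) prestige, over(type)
```

The first command puts a substantive quantity (mean prestige) on the axis, and the second reproduces the counts without needing the dummy variables.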
EXTRA: Data Management Exercise: Turning a string variable into a numeric variable
As we saw earlier, having data in a string variable can sometimes make things a little harder in Stata.
We encountered this with the variable “type” in the Prestige dataset. This variable has values “bc”
“prof” and “wc”. We can make this variable into a numeric variable (though it doesn’t actually solve
the problems with making a bar chart or using it in regression; those things still appear to be easier in
R). Here is one way to do it.
This is our old variable.
. tab type

       type |      Freq.     Percent        Cum.
------------+-----------------------------------
         NA |          4        3.92        3.92
         bc |         44       43.14       47.06
       prof |         31       30.39       77.45
         wc |         23       22.55      100.00
------------+-----------------------------------
      Total |        102      100.00
First, generate a new variable. Let’s call it “type_num”. We want to make sure Stata knows it’s
numeric, so we tell Stata this as follows:
. gen type_num=.
If we had written “gen type_num=""” Stata would have created a string variable. Two quotation
marks with nothing between them are the empty string. We saw the same thing in the Module 3
handout, where the value of a string variable in an “if clause” had to go in quotation marks.
Now we need to make sure the correct values are assigned. We do this by telling Stata to replace a
value in the new variable depending on the value of the old variable. Here we are using an “if
clause” – and using the quotation marks because the old variable is a string variable.
. replace type_num=0 if type=="bc"
(44 real changes made)
The above command tells Stata that every time there is a “bc” value in the variable “type”, there
should now be a zero value in the variable “type_num”.
. replace type_num=1 if type=="prof"
(31 real changes made)
. replace type_num=2 if type=="wc"
(23 real changes made)
There was an “NA” value as well, so we might as well recode that too. Above we turned it into
missing data, but suppose that for some reason we want to represent it as a numeric value:
. replace type_num=3 if type=="NA"
(4 real changes made)
Now let’s see what we have by making a table of our new variable.
. tab type_num

   type_num |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         44       43.14       43.14
          1 |         31       30.39       73.53
          2 |         23       22.55       96.08
          3 |          4        3.92      100.00
------------+-----------------------------------
      Total |        102      100.00
It looks fine – all 102 observations are there and in 4 categories. But we need to label the numbers
now, using value labels, so that people who read this know that 0 stands in for “blue collar”. Here is
how you define a label, and then apply the label to the values of the variable. (See me if you have
questions.)
. label def type_num 0 "Blue Collar" 1 "Professional" 2 "White Collar" 3 "N/A"
. lab values type_num type_num
. tab type_num

    type_num |      Freq.     Percent        Cum.
-------------+-----------------------------------
 Blue Collar |         44       43.14       43.14
Professional |         31       30.39       73.53
White Collar |         23       22.55       96.08
         N/A |          4        3.92      100.00
-------------+-----------------------------------
       Total |        102      100.00
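Stata’s “encode” command does the generate, replace, and label steps in a single command. It is not the method used above and it behaves a little differently: it numbers the categories 1, 2, 3 in alphabetical order (not starting from 0) and uses the original strings as value labels. A sketch, assuming the NA cases should be missing; “type_code” is a made-up name.

```stata
* Treat "NA" as missing first so it does not become a category of its own
replace type = "" if type == "NA"

* Create a labeled numeric variable: 1 = bc, 2 = prof, 3 = wc
encode type, gen(type_code)

* Show the underlying codes, then the labels
tab type_code, nolabel
tab type_code
```

The step-by-step approach is still worth knowing, because it lets you choose the codes and the label text yourself.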
Alternative ways of graphing:
The graph below is actually 7 graphs, one on top of the other: 3 scatter plots for the different values
of “type”, 3 plots with a fitted line for the different values of “type”, and 1 plot showing the fitted
line for all values of “type”. I turned “legend” off (see end of command) because the legend was not
informative for this graph. This graph is not as good as the one that uses “yhat”.
twoway (scatter prestige income if type=="bc", mcolor(red) msymbol(circle_hollow))(lfit prestige
income if type=="bc", lcolor(red)) (scatter prestige income if type=="prof", mcolor(midgreen)
msymbol(triangle_hollow)) (lfit prestige income if type=="prof", lcolor(midgreen))(scatter prestige
income if type=="wc", mcolor(blue) msymbol(plus)) (lfit prestige income if type=="wc", lcolor(blue))
(lfit prestige income , lcolor(black)), legend(off)
[Figure: scatter plots and fitted lines of prestige (vertical axis, 0 to 100) against income (horizontal axis, 0 to 25000), with the legend turned off.]
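A shorter alternative, which is not the command used for the graph above, is the by() option. It draws one panel per value of “type” instead of overlaying everything in one plot. A sketch, assuming the Prestige data are in memory:

```stata
* One panel per occupation type, each with its own scatter and fitted line
twoway (scatter prestige income) (lfit prestige income), by(type)
```

This gives up the direct overlay of the lines but avoids repeating the “if” condition for every category.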