Stata Hand-out for Module 4

Helpful tips

1. Creating a variable involves the “generate” or “gen” command. For example, “gen educ=.” tells Stata to create a variable called “educ” and to make it numeric (the “.” tells Stata this). See the extra material at the end of this module for more about generating variables.

2. If you don’t want to keep the variable you created, you can delete it by typing “drop” and then the variable name. For example, “drop educ” tells Stata to drop the variable above.

3. If you have a really long tedious command, like a graphic with 6 overlaid plots, you will probably NOT want to use the graphics interface because typing the commands is faster once you know what they are. Even better, you will probably want to copy and paste commands so you don’t have to keep retyping the same ones. Sometimes it will help to put the command in a text file (using WordPad or NotePad) and use the search and replace function.

4. Some of the graphics commands are complicated in Stata. Don’t worry about this because graphics are really not the point of the class!

Tables similar to those created in the slides can be made easily in Stata. To take an example that is not in the lecture notes, if you open the dataset “Chile” you can make a table using two categorical variables, “educ” and “vote”. I’m choosing these simply because they are categorical variables; this is not theory-driven. You can simply write “tab educ vote” to get a simple table. You can choose to get the row percentages, column percentages, and/or cell percentages by writing “row”, “col”, “cell” (or all three, “row col cell”) after a comma. For example, to get row percentages from the dataset “Chile”:

. tab educ vote, row

+----------------+
| Key            |
|----------------|
|   frequency    |
| row percentage |
+----------------+

           |                         vote
 education |         A          N         NA          U          Y |     Total
-----------+-------------------------------------------------------+----------
         1 |         0          0          0          1          0 |         1
           |      0.00       0.00       0.00     100.00       0.00 |    100.00
-----------+-------------------------------------------------------+----------
        NA |         0          2          1          3          5 |        11
           |      0.00      18.18       9.09      27.27      45.45 |    100.00
-----------+-------------------------------------------------------+----------
         P |        52        266         71        295        422 |     1,106
           |      4.70      24.05       6.42      26.67      38.16 |    100.00
-----------+-------------------------------------------------------+----------
        PS |        32        224         24         52        130 |       462
           |      6.93      48.48       5.19      11.26      28.14 |    100.00
-----------+-------------------------------------------------------+----------
         S |       103        397         72        237        311 |     1,120
           |      9.20      35.45       6.43      21.16      27.77 |    100.00
-----------+-------------------------------------------------------+----------
     Total |       187        889        168        588        868 |     2,700
           |      6.93      32.93       6.22      21.78      32.15 |    100.00

If there is another categorical variable, you can see what the same table looks like for each value of that variable. For example, “by sex:” tells Stata to make two tables, one for each value of sex. But first I sorted the data according to sex (sort sex), which “by” requires.

. sort sex
. by sex: tab educ vote, row

-------------------------------------------------------------------------------
-> sex = F

+----------------+
| Key            |
|----------------|
|   frequency    |
| row percentage |
+----------------+

           |                         vote
 education |         A          N         NA          U          Y |     Total
-----------+-------------------------------------------------------+----------
        NA |         0          2          0          1          4 |         7
           |      0.00      28.57       0.00      14.29      57.14 |    100.00
-----------+-------------------------------------------------------+----------
         P |        32        112         36        177        250 |       607
           |      5.27      18.45       5.93      29.16      41.19 |    100.00
-----------+-------------------------------------------------------+----------
        PS |        15         86          7         31         60 |       199
           |      7.54      43.22       3.52      15.58      30.15 |    100.00
-----------+-------------------------------------------------------+----------
         S |        57        163         27        153        166 |       566
           |     10.07      28.80       4.77      27.03      29.33 |    100.00
-----------+-------------------------------------------------------+----------
     Total |       104        363         70        362        480 |     1,379
           |      7.54      26.32       5.08      26.25      34.81 |    100.00

-------------------------------------------------------------------------------
-> sex = M

+----------------+
| Key            |
|----------------|
|   frequency    |
| row percentage |
+----------------+

           |                         vote
 education |         A          N         NA          U          Y |     Total
-----------+-------------------------------------------------------+----------
         1 |         0          0          0          1          0 |         1
           |      0.00       0.00       0.00     100.00       0.00 |    100.00
-----------+-------------------------------------------------------+----------
        NA |         0          0          1          2          1 |         4
           |      0.00       0.00      25.00      50.00      25.00 |    100.00
-----------+-------------------------------------------------------+----------
         P |        20        154         35        118        172 |       499
           |      4.01      30.86       7.01      23.65      34.47 |    100.00
-----------+-------------------------------------------------------+----------
        PS |        17        138         17         21         70 |       263
           |      6.46      52.47       6.46       7.98      26.62 |    100.00
-----------+-------------------------------------------------------+----------
         S |        46        234         45         84        145 |       554
           |      8.30      42.24       8.12      15.16      26.17 |    100.00
-----------+-------------------------------------------------------+----------
     Total |        83        526         98        226        388 |     1,321
           |      6.28      39.82       7.42      17.11      29.37 |    100.00

From the dataset “Prestige”.

Doing the regression of “prestige” on “income” and “type” requires a little preparation in Stata. In module 2 we found out that we could not make a bar chart of “vote” in Stata unless we created dummy variables for every value of vote. We need to do that again in this module in order to analyze the relationship of prestige to income and type of occupation. (And again, it turns out that R does not require this step.) Like “vote”, “type” is a categorical variable.

. tab type

       type |      Freq.     Percent        Cum.
------------+-----------------------------------
         NA |          4        3.92        3.92
         bc |         44       43.14       47.06
       prof |         31       30.39       77.45
         wc |         23       22.55      100.00
------------+-----------------------------------
      Total |        102      100.00

The table shows there is an “NA” value, which we will want to turn into a missing value.

. replace type="" if type=="NA"

Generate new categorical variables using the 3 remaining values. (“type_num” is the numeric version of “type”; the EXTRA section at the end of this handout shows how to create it.)

. tab type_num, gen(type)

You will see that 3 new variables, type1, type2, and type3, have just been created. These are “dummy” variables and correspond to the three categories of “type”. “Dummy” variables have values of either 0 or 1 and are also called “indicator” variables and “dichotomous” variables. To see the variables we just created automatically, you can get tables of each variable. To save a step, you can just write “tab1” and list all the variables you would like a table for in a single command.

. tab1 type1 type2 type3

-> tabulation of type1

type_num==B |
 lue Collar |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         58       56.86       56.86
          1 |         44       43.14      100.00
------------+-----------------------------------
      Total |        102      100.00

-> tabulation of type2

type_num==P |
rofessional |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         71       69.61       69.61
          1 |         31       30.39      100.00
------------+-----------------------------------
      Total |        102      100.00

-> tabulation of type3

type_num==W |
hite Collar |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         79       77.45       77.45
          1 |         23       22.55      100.00
------------+-----------------------------------
      Total |        102      100.00

To make it easier on ourselves, let’s rename the variables so we can see at a glance what they refer to. Do this with the command “rename”: rename oldvar newvar (“var” stands for “variable”).

. rename type1 type_BC
. rename type2 type_PROF
. rename type3 type_WC

Now you can regress prestige on income and type. Because “type” is a categorical variable rather than an ordinal or ratio variable, you have to put in the separate dummy variables. You can’t just put in the original variable “type” because the categories are not numbers; they are categorical values: blue collar, professional, white collar. In this example, only type_PROF and type_WC are listed as independent variables, which tells Stata that type_BC is to be understood as the “reference category.”

. reg prestige income type_PROF type_WC

      Source |       SS       df       MS              Number of obs =     102
-------------+------------------------------           F(  3,    98) =  109.29
       Model |  23016.1084     3  7672.03612           Prob > F      =  0.0000
    Residual |  6879.31774    98  70.1971198           R-squared     =  0.7699
-------------+------------------------------           Adj R-squared =  0.7628
       Total |  29895.4261   101  295.994318           Root MSE      =  8.3784

------------------------------------------------------------------------------
    prestige |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      income |   .0014871   .0002428     6.12   0.000     .0010052    .0019691
   type_PROF |   24.42513   2.327585    10.49   0.000     19.80611    29.04414
     type_WC |   7.010142   2.125056     3.30   0.001     2.793037    11.22725
       _cons |   27.71983   1.749331    15.85   0.000     24.24834    31.19132
------------------------------------------------------------------------------

The following method uses predictions from the regression equation just obtained. You can try this out if you’re feeling adventurous.
(But don’t worry too much about graphics at this stage.)

Do the same regression (reg prestige income type_PROF type_WC), then run the commands below as separate commands. The first command obtains predicted values based on the regression equation, the next command creates new variables for each category of “type”, and the graph command puts this all together.

. predict yhat
. separate yhat, by(type)

The “predict” command predicts the y-values, which are called yhat. The “separate” command creates 3 new yhat variables corresponding to the 3 values of type. This graph lets you see the different lines for different values of type by graphing the 3 yhats, and then laying a regression line over the whole thing (lfit).

. graph twoway (scatter prestige yhat1 yhat2 yhat3 income, connect(i l l l) msymbol(o i i i) sort) (lfit prestige income)

[Figure: scatter plot of prestige against income with fitted lines. Legend: prestige; yhat, type == bc; yhat, type == prof; yhat, type == wc; Fitted values.]

To use an “interaction” in your regression equation, you can create a new variable for each interaction. You use the “gen” command and create a new variable that is the product of the two interacting variables. For this example, we want the interaction of income with two “levels” or “values” of the categorical variable “type”.

. gen PROF_income= type_PROF*income
. gen WC_income= type_WC*income

When you put these interaction terms in an equation you have to be sure to also include the regular variables that were multiplied to get the interaction: income, type_PROF, and type_WC.

. reg prestige income type_PROF type_WC PROF_income WC_income

      Source |       SS       df       MS              Number of obs =     102
-------------+------------------------------           F(  5,    96) =   90.60
       Model |  24667.8187     5  4933.56373           Prob > F      =  0.0000
    Residual |  5227.60743    96   54.454244           R-squared     =  0.8251
-------------+------------------------------           Adj R-squared =  0.8160
       Total |  29895.4261   101  295.994318           Root MSE      =  7.3793

------------------------------------------------------------------------------
    prestige |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      income |   .0038699    .000492     7.87   0.000     .0028932    .0048466
   type_PROF |   43.60598   4.041315    10.79   0.000     35.58404    51.62793
     type_WC |    17.5677   5.174322     3.40   0.001     7.296751    27.83865
 PROF_income |  -.0030247   .0005512    -5.49   0.000    -.0041188   -.0019306
   WC_income |  -.0020176    .000947    -2.13   0.036    -.0038974   -.0001378
       _cons |   15.31756    2.77366     5.52   0.000     9.811886    20.82323
------------------------------------------------------------------------------

To see this as three lines conditional on “type” I had to predict yhat again. This time I labeled it “yhat_inter” to indicate that these yhats were predicted based on the regression equation with an interaction in it.

. predict yhat_inter
. separate yhat_inter, by(type)
. graph twoway scatter prestige yhat_inter1 yhat_inter2 yhat_inter3 income, connect(i l l l) msymbol(o i i i) sort

[Figure: scatter plot of prestige against income with one fitted line per type. Legend: prestige; yhat_inter, type == bc; yhat_inter, type == prof; yhat_inter, type == wc.]

Dot plots… don’t seem that great in Stata. Maybe one of you can improve on this!
Prestige data:

. graph dot (sum) type_BC type_PROF type_WC

This creates the following (not very useful) graph, showing that there are about 23 white collar workers, 31 professionals, and 44 blue collar workers:

[Figure: dot plot on an axis running from 0 to 40. Legend: sum of type_BC; sum of type_PROF; sum of type_WC.]

EXTRA: Data Management Exercise: Turning a string variable into a numeric variable

As we saw earlier, having data in a string variable can sometimes make things a little harder in Stata. We encountered this with the variable “type” in the Prestige dataset. This variable has values “bc”, “prof”, and “wc”. We can make this variable into a numeric variable (though it doesn’t actually solve the problems with making a bar chart or using it in regression; those things still appear to be easier in R). Here is one way to do it.

This is our old variable.

. tab type

       type |      Freq.     Percent        Cum.
------------+-----------------------------------
         NA |          4        3.92        3.92
         bc |         44       43.14       47.06
       prof |         31       30.39       77.45
         wc |         23       22.55      100.00
------------+-----------------------------------
      Total |        102      100.00

First, generate a new variable. Let’s call it “type_num”. We want to make sure Stata knows it’s numeric, so we tell Stata this as follows:

. gen type_num=.

If we had written gen type_num="" Stata would have thought it was a string variable. Quotation marks indicate a string value. We saw this before, in the Module 3 handout, when we had to put quotation marks around the value of a string variable in an “if clause”.

Now we need to make sure the correct values are assigned. We do this by telling Stata to replace a value in the new variable depending on the value of the old variable. Here we are using an “if clause”, with quotation marks because the old variable is a string variable.

. replace type_num=0 if type=="bc"
(44 real changes made)

The above command tells Stata that every time there is a “bc” value in the variable “type”, there should now be a zero value in the variable “type_num”.

. replace type_num=1 if type=="prof"
(31 real changes made)

. replace type_num=2 if type=="wc"
(23 real changes made)

There was an “NA” value as well, so we might as well change that too. Above we turned it into missing data, but maybe for some reason we want to represent it as a numeric value… who knows.

. replace type_num=3 if type=="NA"
(4 real changes made)

Now let’s see what we have by making a table of our new variable.

. tab type_num

   type_num |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         44       43.14       43.14
          1 |         31       30.39       73.53
          2 |         23       22.55       96.08
          3 |          4        3.92      100.00
------------+-----------------------------------
      Total |        102      100.00

It looks fine: all 102 observations are there and in 4 categories. But we need to label the numbers now, using value labels, so that people who read this know that 0 stands in for “blue collar”. Here is how you define a label, and then apply the label to the values of the variable. (See me if you have questions.)

. label def type_num 0 "Blue Collar" 1 "Professional" 2 "White Collar" 3 "N/A"
. lab values type_num type_num
. tab type_num

    type_num |      Freq.     Percent        Cum.
-------------+-----------------------------------
 Blue Collar |         44       43.14       43.14
Professional |         31       30.39       73.53
White Collar |         23       22.55       96.08
         N/A |          4        3.92      100.00
-------------+-----------------------------------
       Total |        102      100.00

Alternative ways of graphing:

The graph below is actually 7 graphs, one on top of the other: 3 scatter plots for the different values of “type”, 3 plots with a fitted line for the different values of “type”, and 1 plot showing the fitted line for all values of “type”. I turned “legend” off (see end of command) because the legend was not informative for this graph. This graph is not as good as the one that uses “yhat”.
. twoway (scatter prestige income if type=="bc", mcolor(red) msymbol(circle_hollow)) (lfit prestige income if type=="bc", lcolor(red)) (scatter prestige income if type=="prof", mcolor(midgreen) msymbol(triangle_hollow)) (lfit prestige income if type=="prof", lcolor(midgreen)) (scatter prestige income if type=="wc", mcolor(blue) msymbol(plus)) (lfit prestige income if type=="wc", lcolor(blue)) (lfit prestige income, lcolor(black)), legend(off)

[Figure: the graph produced by this command, with income on the x-axis and the legend turned off.]
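The seven-layer graph command above is the kind of long command that tip 3 suggests keeping in a text file. Here is a minimal sketch of how it might look in a Stata do-file, with one layer per line. It assumes the Prestige data are already in memory and that “type” is still a string variable; the file name “module4_graph.do” is hypothetical and only used for illustration.

```stata
* Sketch of a do-file version of the seven-layer graph.
* Hypothetical file name: module4_graph.do
* Assumes the Prestige data are already in memory and "type" is a string variable.
* "///" continues a command onto the next line. This works in do-files,
* not when typing a command directly in the Command window.
twoway ///
    (scatter prestige income if type=="bc",   mcolor(red)      msymbol(circle_hollow))   ///
    (lfit    prestige income if type=="bc",   lcolor(red))                               ///
    (scatter prestige income if type=="prof", mcolor(midgreen) msymbol(triangle_hollow)) ///
    (lfit    prestige income if type=="prof", lcolor(midgreen))                          ///
    (scatter prestige income if type=="wc",   mcolor(blue)     msymbol(plus))            ///
    (lfit    prestige income if type=="wc",   lcolor(blue))                              ///
    (lfit    prestige income,                 lcolor(black)),                            ///
    legend(off)
```

If you save this as a file, you can run it by typing “do module4_graph.do” (again, a hypothetical name). Laying the command out this way makes it easier to use search and replace when you want the same graph for different variables.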