Stata Hand-out for Module 3 Stata 10 Now that Stata 10 is on Citrix, I will use Stata 10 rather than 8 for these handouts. I hope that the use of Stata 8 in the previous module did not confuse people. Odum Workshop Handouts Please refer to these handouts from Cathy Zimmer’s Odum Institute workshop on Stata; this information will be useful to you in Soc 708: http://www.odum.unc.edu/odum/content/pdf/statahandout1.pdf http://www.odum.unc.edu/odum/content/pdf/statahandout2.pdf http://www.odum.unc.edu/odum/content/pdf/statahandout3.pdf Odum Workshop Offered Again in November The workshop will be offered again November 4-6 3:30-5:00 PM, Manning Hall Room 01 (the computer lab at Odum Institute) Registration is not required, bring a flash drive. Addenda to Module 2: (1) How to put a density curve on top of a histogram? Turns out it’s very simple. For the example in Module 2, you would simply include “kdensity” somewhere in the clause after the comma. (Thanks Francois!) This command will give you a histogram with the number of bins Stata chooses automatically .histogram infant, kdensity This next command will give you a histogram like the one we did before, with 14 bins, titles, and a different vertical axis . histogram infant if infant~=., bin(14) frequency ytitle(Count) xtitle(Infant Mortality Rate) title(Distribution of Infant Mortality Rates) kdensity (2) Doing calculations in Stata: . display (2+2+3+4+10)/5 And Stata gives you this output: 4.2 Helpful tips: 1. To find out more about any Stata command, use the help function within Stata. Go to Help, Choose “Stata command”, and type “lfit” (lfit is a command). Or just type “help lfit” into the command window. 2. To use a variable name in a command, you have a few options. You can type the whole variable name, or: a. just type part of variable name – if just a few characters can uniquely identify the variable name, Stata will supply the rest of the variable. b. click the variable as it appears in the variable window (or double-click, depending on your set-up) 3. If you would like to re-do a command that you just did, you can press “page up” and the previous command will appear. If you press “page up” again it will supply the command before that, etc. 4. If you would like a record of what you did, a. you can keep a log of your session, or part of a session. In Stata 10, go to “file” and choose “log”, or click on the brown rectangle in the tool bar. Then it will ask you where to save it and in what format. b. You can save everything you have done during a Stata session. that is in the “Review” window by clicking on the white square in the top left corner and choosing “Save review contents” c. You could use a ‘do-file’ – write all your commands in one lengthy text file and then have Stata run them all at once. This is probably not useful to you for homework, but is necessary for some 5. In an “if” clause, you are telling Stata to execute a command depending on the value of a variable. Examples: a. If you want Stata to only display points in a scatter plot if the value of variable “age” equals 30, you would include this somewhere in the command: “if age==30”. == is just two = signs so it’s like saying “if age equal equal 30” b. When the variable you are using for your “if” clause is a “string” variable, meaning it is not understood by Stata to be numerical, you put quotation marks around the value of the variable. So, in the case of the scatterplot below where we use an “if” clause for the variable “type”, you write “if type==”bc” ” c. If you would like Stata to run the command to EXCLUDE observations with a certain value, use “~” – “if age~=30” means the command applies to observations where age is not equal to 30. You only have to type one = sign after the ~. In Stata 10, to make a scatterplot you would go to Graphics menu, choose “twoway graph (scatter, line etc.)”. Then choose “create” to define a new plot. Then choose “Basic plot” and “scatter” and select Y and X variables – income is the y variable and education is the x variable. You can select them from the drop-down boxes. Stata produces this command and graph: 0 5000 income 10000 15000 20000 25000 . twoway (scatter income education) 6 8 10 12 14 16 education To identify each point in the scatter plot with information about a third variable “type”, you can have Stata label the points. This seems to be the simplest way to get information about a third variable in a scatterplot, but as you can see it can be hard to read! To do it, you go back into your scatterplot and on the same screen where you selected the Y and X variables, click “Marker properties”. Then check the box for “Add labels to markers” and specify which variable it should use for labeling. In this case, we choose “type” as the variable. The values of “type” are bc, prof, and wc, for blue collar, professional and white collar. (You can confirm that these are the values with the command “tab type”, which will show you a table of type with its three values, plus N/A.) For some reason, “type” was stored as a “string” variable – instead of being coded as numbers they were stored as letters. Storing as a string variable can have consequences for how data are handled in Stata (usually it’s better to code values as numbers). Here is the Stata command: 25000 . twoway (scatter income education, mlabel(type)) prof 20000 prof prof income 10000 15000 prof prof prof prof prof prof prof prof prof bcwc prof NA bc bc bc wc wc wc wc bc prof wc bc bc bc bc bc bc bc wc prof bc prof bc bc bc wc bc bc wc bc wc bc wc wc bc prof bc bc wc wc bc bc wc wc bc bc bc wc NAbc bc bc bc bc wc wc wc bc bc bc wc wc wc bc bc bc NA NA 0 5000 bcbc 6 8 bc bc 10 12 education prof prof prof prof prof prof prof prof prof prof prof prof prof 14 prof 16 To make a graph that looks like the colorful one in R (lecture notes), I had to create 3 plots, and use an “if” clause for each one, and choose a color and symbol for each. The “if” clause specifies which value of TYPE will be displayed (e.g., if type==”bc”). (This procedure is more complicated than the one in R! And if you want to superimpose lines, you would use the “lfit” command on 3 additional plots!) 0 5000 income 10000 15000 20000 25000 twoway (scatter income education if type=="bc", mcolor(red) msymbol(circle_hollow)) (scatter income education if type=="prof", mcolor(midgreen) msymbol(triangle_hollow)) (scatter income education if type=="wc", mcolor(blue) msymbol(plus)), legend(off) 6 8 10 12 education 14 16 For the scatterplot of prestige and education, you will also use more than one plot. First make the scatter plot, with “prestige” as the y-variable and “education” as the x-variable. Then make a second plot, a “fit plot”, and choose “linear prediction”. This plot will put a line over the scatterplot: 20 40 60 80 100 . twoway (scatter prestige education) (lfit prestige education) 6 8 10 12 14 education prestige Fitted values Correlation coefficients slides: These calculations can be done in Excel. Here is how you get the correlation coefficient in Stata: . pwcorr education prestige And it produces this: | educat~n prestige -------------+-----------------education | 1.0000 prestige | 0.8502 1.0000 16 For the slide titled “Least Squares Regression”, you can refer back to the scatterplot with line above. The Stata command is “reg”. After “reg” (for regress) you list the y-variable (a.k.a. dependent variable, response variable) first, then the x-variable (independent variable, explanatory variable). . reg prestige education Source | SS df MS -------------+-----------------------------Model | 21608.4361 1 21608.4361 Residual | 8286.99 100 82.8699 -------------+-----------------------------Total | 29895.4261 101 295.994318 Number of obs F( 1, 100) Prob > F R-squared Adj R-squared Root MSE = = = = = = 102 260.75 0.0000 0.7228 0.7200 9.1033 -----------------------------------------------------------------------------prestige | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------education | 5.360878 .3319882 16.15 0.000 4.702223 6.019533 _cons | -10.73198 3.677088 -2.92 0.004 -18.02722 -3.436743 ------------------------------------------------------------------------------ For the slide with 2 lines of different colors representing 2 different equations, I did not provide Stata commands because there are no R commands provided. Using the data provided by the CD that came with the book. Use ta02006. I downloaded it as an Excel file, then cut and paste to Stata. To do this, you open Stata and go to the editor window (or just type “edit”) and you see empty rows and columns. Then just paste what you had from Excel. 4 6 8 10 12 . twoway (scatter y1 x) (lfit y1 x) 4 6 8 10 x y1 Fitted values 12 14 [To save time, don’t use the graphic interface to do all 4 of these. Just hit “page up” and change the variable names while retaining the commands.] 2 4 6 8 10 . twoway (scatter y2 x) (lfit y2 x) 4 6 8 10 12 14 x y2 Fitted values 4 6 8 10 12 . twoway (scatter y3 x) (lfit y3 x) 4 6 8 10 x y3 Fitted values . twoway (scatter y4 x4) (lfit y4 x4) 12 14 14 12 10 8 6 5 10 15 20 x4 y4 Fitted values To see the influence of the single outlier, it again requires multiple plots in Stata. Plot 1 was a scatterplot of height and weight, Plot 2 a line fit plot for height and weight, and Plot 3 a line fit plot for height and weight that excludes observation #12. To see which one that is, you can type “browse” or “br” to look at your data. To exclude observation 12 from the 3rd plot, you use an “if” clause – “if id~=12”. The ~ means “not”, so this clause says only do the command for observations that do NOT have the id number of 12. 50 100 150 200 . twoway (scatter weight height) (lfit weight height) (lfit weight height if id~=12) 50 100 150 height weight Fitted values Fitted values 200 Stata automatically made the lines two different colors. The red line excludes the outlier. Here are the relevant regression equations. The first includes observation 12, and the second excludes it using “if id~=12”. . reg weight height Source | SS df MS -------------+-----------------------------Model | 1630.88712 1 1630.88712 Residual | 43713.1129 198 220.773297 -------------+-----------------------------Total | 45344 199 227.859296 Number of obs F( 1, 198) Prob > F R-squared Adj R-squared Root MSE = = = = = = 200 7.39 0.0072 0.0360 0.0311 14.858 -----------------------------------------------------------------------------weight | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------height | .2384059 .0877159 2.72 0.007 .0654286 .4113832 _cons | 25.26623 14.95042 1.69 0.093 -4.216263 54.74872 -----------------------------------------------------------------------------. reg weight height if id~=12 Source | SS df MS -------------+-----------------------------Model | 20941.4871 1 20941.4871 Residual | 14312.0205 197 72.6498502 -------------+-----------------------------Total | 35253.5075 198 178.048018 Number of obs F( 1, 197) Prob > F R-squared Adj R-squared Root MSE = = = = = = 199 288.25 0.0000 0.5940 0.5920 8.5235 -----------------------------------------------------------------------------weight | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------height | 1.149222 .0676889 16.98 0.000 1.015734 1.28271 _cons | -130.747 11.56271 -11.31 0.000 -153.5496 -107.9444 -----------------------------------------------------------------------------Extra Material: As we saw with “histogram infant, kdensity”, you can add options after a command with a comma. Here is an option that may be used with the “reg” command: beta. “beta asks that standardized beta coefficients be reported instead of confidence intervals. The beta coefficients are the regression coefficients obtained by first standardizing all variables to have a mean of 0 and a standard deviation of 1. beta may not be specified with vce(cluster clustvar) or the svy prefix.” Here is how you use it, and what it does: . reg prestige education, beta Source | SS df MS -------------+-----------------------------Model | 21608.4361 1 21608.4361 Residual | 8286.99 100 82.8699 -------------+-----------------------------Total | 29895.4261 101 295.994318 Number of obs F( 1, 100) Prob > F R-squared Adj R-squared Root MSE = = = = = = 102 260.75 0.0000 0.7228 0.7200 9.1033 -----------------------------------------------------------------------------prestige | Coef. Std. Err. t P>|t| Beta -------------+---------------------------------------------------------------education | 5.360878 .3319882 16.15 0.000 .8501769 _cons | -10.73198 3.677088 -2.92 0.004 . ------------------------------------------------------------------------------