Optional Stata Day
Eleanor Boyle, Adam Lenart and Virginia Zarulli
Epidemiology and Biostatistics, SDU
Autumn 2013

Outline
- Introduction
- Exercises

Stata-Environment
Stata has several windows:
- Command: you type commands here, e.g. to compute and display the logarithm of 100:
    display log(100)
  To execute the command, press the ENTER key.
- Result window: shows the result.
- Review window: holds the history of all the commands you have typed.

Do-files
- Script file that holds your Stata commands
- Default extension: .do
- Open a new one via the task line
- Type a command like
    disp "Good Day"
- Mark at least part of the line and execute via CTRL D (CMD D for Mac users)
- To run the whole file: CTRL A and CTRL D

Do-files
- Allow you to run a sequence of commands several times
- Keep a record of the commands used to produce results
- Save via the menu or CTRL S (do it frequently!)
- Comments in do-files: * at the beginning of a line, or everything within /* and */
- Commands over multiple lines: split by ///
    display A B ///
        C D

Where to get help
- help correlate
- search covariance
- Stata cheat sheet
- Google
- Search for Stata programs by others: net search unitroot

Stata Data Editor
Open the data editor either via
- menu: Data -> Data editor -> Data editor (Edit)
- command: edit
The first colored row will contain the variable names. Assume the first variable is animal.
- First, put in the values of the variable (dog, cat, whale), pressing ENTER after each one.
- Second, go to the Variables window at the lower right and change the default variable name var1 to animal. This name should now appear in the colored row of the table.
- Now you can generate another variable, this time called weight, with the values 6.3, 8.9, and 8.

Stata’s .dta files
Save your data via File -> Save and provide the file name, for example animal. Press OK, and the file is saved in your working directory as animal.dta.
Now clear the memory with
    clear
and read the data back in via
    use animal, clear
Look at the data with list, browse, or edit.
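
The steps of the last slides (do-file comments, line continuation, saving and re-reading a .dta file) can be combined into one small do-file. This is only a sketch under assumptions: it uses the input command instead of the Data Editor, which the slides do not cover, and it reads the three weights as 6.3, 8.9 and 8.

```stata
* Sketch of a do-file for the animal example (assumption: data typed in
* with -input- rather than through the Data Editor)
clear
input str5 animal weight
"dog"   6.3
"cat"   8.9
"whale" 8
end

/* save into the working directory,
   then clear the memory and read the file back in */
save animal, replace
clear
use animal, clear
list

* one command split over two lines with ///
display "Good " ///
    "Day"
```

Marking all lines (CTRL A) and pressing CTRL D runs the whole file, so the dataset can be recreated at any time without retyping.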

Opening and saving data
Open
- Open the existing Stata data file animal.dta in your working directory:
    use animal, clear
- Open an example dataset from Stata (at http://www.stata-press.com/data/r12/):
    webuse lifeexp, clear
- Open a data file from a web address (here bodyfat.dta from our course server):
    use http://www.biostat.sdu.dk/courses/data/bodyfat, clear
Save into your working directory
- Save the current data in animalNew.dta:
    save animalNew, replace
Useful commands for the working directory
- Report the name of the current directory: pwd
- List the contents of the current directory: dir
- Change the working directory: cd c:/users/Biostat
- In case of memory problems: help memory

Data import: Excel Files
Import of Excel files:
- Menu-oriented: see http://stata.com/stata12/excel-import-export/
- Command, e.g.
    import excel C:/Biostat1/Data.xls, sheet("2012") cellrange(A1:H25) firstrow clear
  See also help import excel

Data import: ASCII (Text Files)
- Tab separated (-> marks a tab character)
    name -> age
    Ute Hansen -> 77.3
    Ib Ibsen -> 22.0
  Read with:
    insheet using mydata.txt, clear names
- Space separated (make sure that variable entries are separated by exactly ONE space (blank))
    name age
    "Ute Hansen" 77.3
    "Ib Ibsen" 22.0
  Read with:
    insheet using mydata.txt, clear names delimiter(" ")

Data import: ASCII (Text Files)
- Comma separated
    name,age
    Ute Hansen,77.3
    Ib Ibsen,22.0
  Read with:
    insheet using mydata.txt, clear names delimiter(",")
For more information see http://www.ats.ucla.edu/stat/stata/modules/input.htm

Data manipulation
    webuse lifeexp
- generate gnppc2 = gnppc^2
  Makes a new variable gnppc2 as the square of the old one
- gen logPopgrowth = log(popgrowth)
  Takes the logarithm (natural!)
- gen rich = -1 if gnppc >= 20000 & !missing(gnppc)
  replace rich = 0 if gnppc < 20000
  (Stata treats missing values as larger than any number, hence the !missing() check)
- recode rich -1=1

Data manipulation
Drop or keep variables
- drop rich
  Drops the variable rich
- keep region country popgrowth lexp gnppc
  Keeps only the listed variables in the data
Drop or keep observations
- drop if region==3
  Drops observations where region is equal to 3
- keep in 6/20
  Keeps observations number 6 to 20
Add summarizing variables to data
- Extensions to generate:
    egen mgnppc=mean(gnppc)
  Adds the overall mean of gnppc to the data. The by() option is important:
    egen mgnppcReg=mean(gnppc), by(region)
  Adds the region-specific mean of gnppc to the data.

Descriptive statistics
    webuse lifeexp
    * description of data in memory/variable properties
    describe
    * describe data contents
    codebook
    * report univariate summary statistics
    summarize lexp gnppc
    * frequency table (empirical distribution)
    tabulate safewater
    * report correlations or covariances
    correlate popgrowth lexp gnppc
    correlate popgrowth lexp gnppc, covariance

Graphs
    webuse lifeexp
    * scatter-plot
    scatter lexp gnppc
    * line-plot
    line lexp gnppc, sort
    * scatterplot and overlay regression line
    scatter lexp gnppc || lfit lexp gnppc
    * histogram
    histogram gnppc

Graphs: save
- PDF file
    graph export myplot.pdf, replace
- PNG file (for Word)
    graph export myplot.png, replace
- WMF file (for Word)
    graph export myplot.wmf, replace

Exercise 1
1. Create a new do-file.
2. Load the dataset water.dta from our course server.
3. Save the data under the name vand.dta into your working directory.
4. Look at summary measures of the data using e.g. the commands des, list, summarize, codebook, mean.
5. Make a scatterplot of mortality against calcium using scatter (try help scatter).
6. Make a histogram of mortality.

Exercise 2 - Part 1
1. Download the data:
    http://www.biostat.sdu.dk/courses/data/tabSepData.txt
    http://www.biostat.sdu.dk/courses/data/commaSepData.txt
    http://www.biostat.sdu.dk/courses/data/spaceSepData.txt
    http://www.biostat.sdu.dk/courses/data/excelData.txt
2. Set the working directory to where you saved the files (menu-based or using the command cd).
3. Use insheet to read in the data in "tabSepData.txt", "commaSepData.txt" and "spaceSepData.txt".
4. Use import excel to read in the data in "excelData.txt".
5. Finally, save the data as a Stata dataset named "StataData.dta".
6. Clear the data from memory and read it back in from "StataData.dta" using use.

Exercise 2 - Part 2
Note: Exercise parts marked with * are more difficult than the others and may be skipped.
8. Click into the Data Editor and type in the variable sex with values 1, 2, and 1.
9. Define value labels for sex (1=male, 2=female).
10. * Use generate to generate id, a subject index (from 1 to 3).
11. Use rename to rename the variables v1 to v3 to time1 to time3. (* Also try doing this in a loop using forvalues.)

Reshaping Data
Generally, data exist in two formats: wide and long. Assume we have measurements on j occasions for i subjects.
- wide: one line per subject id; each occasion j is represented by a variable weightj. Convert to long with
    reshape long weight, i(id) j(occ)
  Note: the variable occ does not exist in the wide table; it is first created in the long table.
- long: one line per subject and occasion. Convert to wide with
    reshape wide weight, i(id) j(occ)

Exercise 2 - Part 3
Note: Exercise parts marked with * are more difficult than the others and may be skipped.
12. * Use reshape to convert the dataset to long shape.
13. * Generate a variable d that is equal to the squared difference between the variable time at each occasion and the average of time for each subject.
14. Drop the observation corresponding to the third occasion for id=2 using the commands drop and if (see help if).
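
The wide/long conversion from the Reshaping Data slide can be tried out on a tiny dataset before tackling the exercise data. The following is a sketch only: the weights are invented for illustration, and the input command (not covered in the slides) is assumed here as a way to type in data from a do-file.

```stata
* Made-up wide data: one line per subject, one variable per occasion
clear
input id weight1 weight2 weight3
1 60 62 61
2 75 74 76
3 82 80 81
end

* wide -> long: creates the occasion variable occ (values 1, 2, 3)
reshape long weight, i(id) j(occ)
list, sepby(id)

* in long shape, per-subject summaries are easy to add with egen
egen mweight = mean(weight), by(id)

* long -> wide again; mweight is constant within id, so it is kept
reshape wide weight, i(id) j(occ)
list
```

After the first reshape the data have nine lines (three per subject); after the second they are back to three lines with weight1 to weight3.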

Exercise 3 (1.2 in the Stata book)
Download http://www.biostat.sdu.dk/courses/data/wagepan.dta
Data on wages and race for 545 young American males for 1980-1987. The variables considered here are:
    year: calendar year, 1980 to 1987
    lwage: natural log of hourly wage
    black: dummy variable for being black
    hisp: dummy variable for being Hispanic
1. Execute describe (or just des) to get an overview of the data.
2. Retain only the above 4 variables using the keep command.
3. Create a new variable equal to the exponential of lwage.
4. Collapse the data (using collapse) to obtain the mean wages by year and ethnic group.
5. Produce a line graph (using twoway line) showing the mean wages over time, separately for the groups.
6. * Improve the graph by defining labels, line patterns, legends etc. (compare the Stata book).

Exercise 4
- Load the file Symptoms.dta into Stata via
    use http://www.biostat.sdu.dk/courses/data/Symptoms.dta, clear
- Also download the do-file PlottingSymptomsOfDiseases.do:
    http://www.biostat.sdu.dk/courses/data/PlottingSymptomsOfDiseases.do
  (On some computers this download may cause problems. Mac users often have NOT unchecked "Open after download" under Safari -> Preferences, so the Mac tries to open the do-file, which the OS cannot do.)
- Try to understand what it is about and what the do-file is doing.
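
Exercise 3 relies on collapse, which the introduction slides do not demonstrate. As a hedged illustration on a different dataset (the lifeexp example data used earlier, not the exercise data), a minimal sketch of what collapse does might look like this:

```stata
* Sketch: replace the data in memory by group means
webuse lifeexp, clear

* one line per region, holding the regional means of lexp and gnppc
collapse (mean) lexp gnppc, by(region)
list
```

Note that collapse overwrites the data in memory, so the original dataset has to be loaded again with use or webuse if it is needed afterwards.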