Class 1 - Introduction; Overview of Stata -- LECTURE NOTES

Contents

1. Topics
2. Syllabus
3. Course objectives
4. Course organization
   4.1 Web site
   4.2 Userid and password
   4.3 Grading
   4.4 Data analysis project
5. Stata statistical package
   5.1 Introduction
   5.2 Flavors of Stata
   5.3 Requesting more memory for Stata
   5.4 On-line help
   5.5 Resources for learning about Stata
   5.6 Stata software pricing
   5.7 Customizing Stata
   5.8 Keeping Stata up-to-date
   5.9 Datasets
   5.10 Stata commands
   5.11 How to re-issue commands
   5.12 Program files - do files
   5.13 A special do-file - profile.do
   5.14 How to start Stata and set the working directory
   5.15 Keeping a log of your work
   5.16 Getting data into Stata
   5.17 Stata tutorial on data input
   5.18 Saving a Stata dataset
   5.19 Loading a Stata dataset
6. Stata programs - “do-files”
   6.1 What are and why use do-files
   6.2 “Hello Mom” program
   6.3 Start Stata do-file editor
   6.4 Edit and re-run “do” Program
   6.5 Another program
7. Using Stata to create “do” files
8. Stat/Transfer for importing/exporting data
9. Example 1: exploratory analysis of data from Altman’s Exercise 3-1
   9.1 Listing of data file
   9.2 Analysis Plan
   9.3 Box-Cox transform
   9.4 Techniques Illustrated
   9.5 Log Showing Commands and Output
10. Example 2: input and display of data from Altman’s exercise 3-2
   10.1 Source data from Altman
   10.2 Raw data - text file on disk
   10.3 Analysis plan
   10.4 Stata log
11. Common data analysis applications
   11.1 Descriptive statistics
   11.2 Stem-and-leaf charts
   11.3 Boxplots
   11.4 Confidence interval for a mean
   11.5 Confidence interval for a proportion
   11.6 Student’s t-test
   11.7 Test for binomial proportions
   11.8 Correlation
   11.9 Simple linear regression
   11.10 Analysis of variance
   11.11 Multiple linear regression
   11.12 Multiple logistic regression
   11.13 Epidemiologic calculations - epitab
   11.14 Sample size and power calculations

Biostatistics 624 © 2011 by JHU Biostatistics Dept. Sun, 27 Mar 2011 (6:47p)

1. Topics

- Outline course
- Overview of Stata
- Handouts
  - Website and Schedule
  - Lecture Notes #1
  - e-Quiz #1 (due Fri, 8 Apr 2011)

2. Syllabus

- Multiple regression models:
  - Linear
  - Logistic
  - Conditional logistic (case-control studies)
  - Log-linear (Poisson) for counts & rates
  - Log-linear for contingency tables
  - Cox proportional hazards
- Longitudinal data analysis (repeated measures), analysis of clustered data
- Random effects/mixed effects/multilevel models
- Model checking: analysis of residuals, measures of leverage and influence
- Special topics: methods for missing data; reliability, interrater agreement, diagnostic tests, reference intervals, sample size, regression for survey samples

3. Course objectives

- Students who master the course contents will be able to:
  - Frame a scientific question about the dependence of a continuous, binary, count, or time-to-event response on explanatory variables in terms of a linear, logistic, log-linear, or survival regression model whose parameters represent quantities of scientific interest
  - Design a tabular or graphical display of a dataset that makes apparent the association between explanatory variables and the response
  - Choose a specific linear, logistic, log-linear, or survival regression model appropriate to address a scientific question and correctly interpret the meaning of its parameters
  - Appreciate that the interpretation of a particular multiple regression coefficient depends on which other explanatory variables are in the model
  - Estimate the unknown coefficients and their standard errors using maximum (or partial) likelihood and perform tests of relevant null hypotheses about the association with the response of particular subsets of explanatory variables
  - Check whether a model fits the data well; identify ways to improve a model when necessary
  - Use several models for the analysis of a dataset to effectively answer the main scientific questions
  - Understand how longitudinal data differ from cross-sectional data and why special regression methods are sometimes needed for their analysis
  - Summarize in a table the results of linear, logistic, log-linear, and survival regressions and write a description of the statistical methods, results, and main findings for a scientific report
  - Perform data management, including input, editing, and merging of datasets, necessary to analyze data in Stata or equivalent statistical software
  - Complete a data analysis project, including data analysis and a written summary in the form of a scientific paper

4. Course organization

- The course contents, schedule, and procedures are summarized in course website pages:
  - “Home” page: organizational details
  - “Schedule” page: classes, e-quizzes, exam, project

4.1 Web site

- Web site URL:
      http://biostat.jhsph.edu/courses/bio624/

4.2 Userid and password

- Some parts of the course site require a Userid and Password, which are
      Userid: bio624
      Password: theedge

4.3 Grading

      20%  e-quizzes (5 of these)
      50%  Data analysis project
             1%  Preliminary abstract (must be on time)
            49%  Completed project
      30%  Examination (in-class; required for grade of A, otherwise optional)

4.4 Data analysis project

- Conduct an analysis to address a scientific topic using appropriate statistical methods
  - Students must identify topics and datasets independently, i.e., topics and datasets will not be assigned or provided
  - The analysis should involve regression modeling with at least two explanatory variables
  - The dataset and analysis should address a public health topic, with “public health” interpreted broadly
  - Typically, datasets will have between 100 and 100,000 observations; however, larger or smaller datasets may also be appropriate - ask if in doubt
  - Datasets with fewer than 50 observations are discouraged, but not prohibited
  - IMPORTANT: Conduct the final analysis and write the final report INDEPENDENTLY
  - However, CONSULTING/COLLABORATING with instructors, TAs, students or others about the data or analysis IS ENCOURAGED
  - It is also OK to share datasets, as long as the final analysis (do-file), tables, and report are done INDEPENDENTLY
- Prepare a report summarizing your findings in the form of a mini scientific paper in the following format:
  0. Title
  1. Abstract (structured)
  2. Introduction
  3. Methods (including sample size considerations)
  4. Results (including at least one figure and one table)
  5. Discussion
  6. Appropriate other tables, figures, etc
  7. Appendices (as applicable)
     a. Variable list
     b. Model checks (residuals, influential points)
     c. Sensitivity analyses (with/without influential points, etc)
     d. Step-wise variable selection
     e. Non-linearity checks
     f. Collinearity assessment
     g. Interaction assessment
     h. Confounding -- compare adjusted and unadjusted models
     i. Likelihood testing or F-tests for nested models
     j. Stata do-file(s) - REQUIRED
     k. Stata logs and graphs with enough results to confirm statements in the paper

4.4 Data analysis project (cont'd)

- Possible sources for datasets:
  - An important part of the project is to identify and gain access to an appropriate dataset
  - The best dataset is one that you are familiar with from past work that you can use to address questions that have not been addressed before
  - Next best is a dataset from an advisor or colleague, ideally one whose subject matter is of interest to you
  - It is OK to use datasets from other classes or the MPH capstone project if they include enough material to support a regression analysis; if in doubt, ask an instructor from this class
  - Online datasets. There are numerous datasets online that could be used for a project. Some links to possible sources for datasets are posted on the course website (“Other links” on the home page):
        http://www.biostat.jhsph.edu/courses/bio624/misc/datasets.htm
  - Government and institutional websites (a few are listed below) contain an enormous amount of data, but will require some exploration to find downloadable, raw data suitable for analysis:
        www.fedstats.gov    FEDSTATS (federal statistics locator)
        www.cdc.gov         Centers for Disease Control, including the National Center for Health Statistics
        www.cdc.gov/nchs/datawh/ftpserv/ftpdata/ftpdata.htm
                            NCHS public use data files and documentation
        www.census.gov      US Census Bureau
        www.who.ch          World Health Organization
        http://www.sph.emory.edu/bios/bioslist.html#database
                            Emory Biostatistics Dept excellent list of online databases
  - Statistical data warehouse with library of data and data stories (i.e., documentation):
        www.stat.cmu.edu (click DASL under Related Links)
    If you decide to use one of these datasets, you must consult source paper(s) for the dataset and attach them with the supporting materials for your project report
  - Some textbooks have collections of datasets that may be suitable for further analysis:
        LC Hamilton, Statistics with Stata:
            www.stata.com/bookstore/swsdl.html
        Duxbury publishing website - site contains datasets from health statistics textbooks (click “Data Library”):
            http://www.thomsonedu.com/statistics/discipline_content/dataLibrary.html
        Hosmer and Lemeshow, Applied Survival Analysis:
            ftp://ftp.wiley.com/public/sci_tech_med/survival/
        Hosmer and Lemeshow, Applied Logistic Regression Analysis: datasets are contained in the University of Massachusetts Datasets Archive, which contains links to other data resources (make sure to type the URL exactly as given below and then scroll down to the list of datasets by type of analysis - DO NOT USE the low birthweight dataset):
            http://www-unix.oit.umass.edu/~statdata/statdata/
        Moore and McCabe, Introduction to the Practice of Statistics (IPS), arguably the best introductory statistics text available. The applets help master statistical concepts. The datasets will require finding the source papers:
            http://www.whfreeman.com/ips/
    Again, if you decide to use one of these datasets, make sure to consult source paper(s) for the dataset and attach them with the supporting materials for your project report

5. Stata statistical package

5.1 Introduction

- Stata, according to its authors, is used for:
  - Managing data
  - Analyzing data
  - Graphing data
- Stata offers a common interface across different computers and operating systems: DOS, Windows, Macintosh, Unix, and others. Files created on one system may be used on another without any conversion
- The Stata interface is command-driven: “type a little, get a little.”
- But commands can be a pain at times, so Stata also offers a menu-based interface
- Stata is very fast, due mainly to storage of datasets in memory during processing (as opposed to disk processing). Graphics are not so fast!
- Stata is capable of processing a large variety of datasets with the sole restriction that the dataset must fit into available computer memory. This restriction rules out really large datasets such as Medicare or other health information systems.
- Data integrity: Stata works on a copy of your dataset in memory, making it “safe for interactive use.” You can still destroy your data by explicitly saving over it. Tip: always make copies of your key datasets before data handling activities that involve saving results. Note that analysis activities are “safe” with very little risk of harm to your data. Data management activities are “risky.”
- Stata is case-sensitive: the name “Myfile” is different from “myfile”. When in doubt, use lower case
- Stata is programmable: many parts of Stata are written in the Stata programming language. This language can be used to generate, in principle, any statistical analysis whether or not it is explicitly part of Stata (see “do” and “ado” files in the Manual)
- Stata has a very large and active on-line users group. Members meet via the Internet using a “listserv” e-mail system. Stata is continually updated and many updates come from users. You may submit questions to the “listserv” -- your questions go to all members of the “listserv”. Currently about 25 questions per day are submitted
- The Stata website (www.stata.com) has a good Support section, especially the FAQs
- Stata’s e-mail based user support is very responsive and helpful. Remember to provide your serial number in the e-mail along with your question

5.2 Flavors of Stata

- Stata 11 was released in 2009
  - Major revisions occur about every 3 years
  - Menus for nearly all commands
  - Vastly improved graphics
  - Enhancements to statistics, especially survival analysis
  - We will use Stata 11 in this course
  - We will try to accommodate Macintosh users, but some programs may not work with Macs
  - Macintosh users: see notes under “Other Links” on the home page: http://www.biostat.jhsph.edu/~courses/bio624
- Stata comes in four forms:
  - Stata IC (Intercooled - we use this)
  - Small Stata - not for this course
  - Stata/SE (Special Edition “super-size”)
  - Stata/MP (Multiple processors)
- Stata/SE
  - Can analyze datasets with as many as 32,767 variables, and the only limit on observations is the amount of RAM on your computer
  - Maximum length of a string variable is 244 characters
  - Matrices may be up to 11,000 x 11,000
- Intercooled (IC) Stata
  - Can analyze datasets with as many as 2,047 variables, and the only limit on observations is the amount of RAM on your computer
  - Maximum length of a string variable is 80 characters
  - Matrices may be up to 800 x 800
  - Computer should have at least 32 megabytes of RAM

5.3 Requesting more memory for Stata

- By default, Intercooled Stata starts with 1 megabyte of memory for datasets and work space. This can be increased in one of 2 ways:
  - Change memory for the current session. To change from 1 megabyte to 800 megabytes, give the following command:
        set memory 800m
  - To make the change permanent every time you start Stata:
        set memory 800m, permanently
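The memory commands from section 5.3 can be tried out in a short do-file. What follows is only a sketch, assuming Stata 11 or earlier (where set memory applies). The file name memcheck.do is made up for illustration, and the auto dataset loaded with sysuse is simply an example dataset that ships with Stata, not part of the course materials.

```stata
* memcheck.do - hypothetical file name; run with: do memcheck.do
clear                 // set memory is refused while data are in memory
set memory 800m       // request 800 megabytes for this session only
query memory          // display the current memory settings
sysuse auto, clear    // load an example dataset that ships with Stata
memory                // report how much of the allocation is in use
```

If the operating system cannot supply the amount requested, Stata reports an error and the previous setting stays in effect, so a smaller value may be worth trying on machines with limited RAM.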
5.4 On-line help

- Stata has lots of on-line help available -- all sections of the written documentation are on-line in “abbreviated” form (sometimes too abbreviated, especially for statistical techniques)
- A good way to access on-line help is via the Help pull-down menu, the portal to all Stata Help including the complete set of manuals in well-indexed PDF format
- If you know the name of the command, you can access on-line help via the help command. For example, to get help for the summarize command:
      help summarize
  Note, upper right: dialog: summarize - Nearly every Stata command has a dialog screen to construct the command
  Note: [R] summarize -- Summary statistics - Nearly every Stata command has an [R] link to the PDF Documentation entry
- If you want to look up a topic, use the “findit” command, which searches help files as well as internet resources at Stata. The results are hyperlinked for easy access. For example, to get information on “logistic regression”:
      findit logistic regression

5.5 Resources for learning about Stata

- The primary documentation now spans 5,000+ pages. The main components are the Reference Manual, the User’s Guide, and the Graphics Manual. While somewhat intimidating and irritating, these are now included in a PDF - a necessity for “serious” users of Stata
- Introductory materials (may be purchased using the Stata website):
  - Statistics with Stata by LC Hamilton - the best book on Stata
- The Stata Journal is a refereed journal and is published quarterly with articles about statistics, data analysis, teaching methods, and effective use of Stata’s language
- Net courses on Stata. These range in length from a few to 12 weeks. They are done via e-mail. There is a charge for the courses.

5.6 Stata software pricing

- Prices vary for academic institutions, businesses, and students. Prices also depend on whether the system will be used on a network and how many users there will be
- Manuals are purchased separately - some are available in the JHMI bookstore
- Subscriptions to the Stata Journal, which comes in both hard copy and PDF format, are also extra
- Unlike some other statistical packages such as SAS, Stata has no annual renewal fee, and it offers regular free updates containing fixes and extensions
- The Stata web site, www.stata.com, has the latest prices and information on how to purchase items
- BSPH has a GRADPLAN for purchasing the latest version of Stata by students. Online ordering is at www.stata.com/gpdirect

5.7 Customizing Stata

- Changing the size and fonts for Stata windows -- to improve readability
  - From the Edit menu, select:
        Preferences / Manage Preferences / Load Preferences / Maximized Window Settings
    ... Make font changes, etc. to taste
  - Demonstrate changing the font and font size by using the control button at the upper left of each window; the Results window is the most important one to change
    1. Click the control button and select Font
    2. Select a fixed-space font -- one of the larger Stata fonts or fixedsys are good choices
    3. Make sure the font size is at least 9
    4. IMPORTANT - save the windowing preferences or the changes disappear:
        Preferences / Manage Preferences / Load Preferences / New Preferences Set / YOUR INITIALS

5.8 Keeping Stata up-to-date

- MAKE SURE your Stata is up-to-date:
  - Updates are free
  - Updates fix and extend Stata
- The current version of Stata is updated frequently, about every two weeks. To see what version of Stata you are using, type the following commands:
      about
      query born
- To see if you need an update (you must be connected to the internet), either use the Help menu or type the command:
      update query
- This will advise you to do one of the following:
  1. Do nothing, all files up to date
  2. Update both the executable and ado files
        Click: update all
  3. Update only the executable
        Click: update executable
  4. Update only the ado files
        Click: update ado
- The new ado files are installed and ready to use as soon as the download is completed
- One extra step is required to install a new executable:
      Click: update swap
- After installing an update, you can find out what has been added or changed by typing:
      help whatsnew

5.9 Datasets

- In Stata, “Data” are a rectangular table of numbers and character strings
  - Each row is an “observation” on all the variables
  - Each column contains all the observations for a given variable
  - Variables (columns) are represented by names of up to 32 characters (8 characters in older versions of Stata)
  - Observations (rows) are numbered from 1 to _N
  - Schematic on how data are stored in Stata (columns = variable names, rows = observations):

               var1   var2   var3   ...   varn
          1
          2
          3
          ...
          N

        cell(i,j) = data value for variable j on observation i

  - Stata gives the following simple example of “Data”:

               Var1   Var2   Name
          1.      1      2   Bill
          2.      3      4   Mary
          3.      5      6   Pat
          4.      7      8   Roger
          5.      9     10   Sean

- In Stata, a “Dataset” is “Data” plus labels, formats, notes, and characteristics

5.10 Stata commands

- There are 200+ commands in Stata, many of which are commands to obtain specific statistical analyses
- An early User’s Guide lists 37 commands that “everyone should know” by function:
  - Getting on-line help: lookup, help (and pull down Help menu)
  - Operating system interface: pwd, cd
  - Using and saving data from disk: use, save, append, merge, compress
  - Inputting data into Stata: input, edit, infile, infix, insheet
  - Basic data reporting: describe, codebook, list, browse, count, inspect, table, tabulate
  - Data manipulation: generate, replace, recode, egen, rename, drop, keep, sort, encode, decode, order, by, reshape
  - Keeping track of your work: log, notes
  - Convenience: display
- Newer commands worth noting
  - Handling subsets (define/analyze summary statistics): collapse, contract, statsby
  - Tabulation (more compact results than tabulate or summarize): table, tabstat, tab_chi (use findit tab_chi for install/help)

5.11 How to re-issue commands

- Stata stores a long list of the commands you issue in the Review window
- These commands can be accessed and re-issued - VERY useful for correcting errors without re-typing the whole command. To retrieve commands, use either:
      Page Up/Page Down
  or
      Click the command in the Review window

5.12 Program files - do files

- “Do-files” contain a collection of Stata statements that perform a variety of tasks - called a Stata program
- Do-files will be used extensively in this course and by experienced Stata practitioners
- Do-files allow you to document your work by making it possible to exactly reproduce key analyses - a step towards “reproducible research”

5.13 A special do-file - profile.do

- When Stata begins, it looks for a file named profile.do, containing commands that are to be executed as Stata starts
- In particular, Stata looks for the profile.do file in c:\data, among other places, so you can execute a set of commands every time you start Stata by placing them in a text file named profile.do, which you store in c:\data
- The profile.do file recommended for this course is as follows and can be downloaded from various places on the e-Quizzes page on the course website:

      * profile.do for starting Stata
      * Place in C:\DATA or any working folder containing your files
      set memory 750m
      set linesize 75
      set more off

5.14 How to start Stata and set the working directory

- The “working directory” in Stata is the folder where Stata looks for data and program files. By default, the working directory is
      c:\data
- When you start Stata from the Stata icon, the working directory is set to the default: c:\data
- You can change the working directory to the folder containing your files:
      File / Change Working Directory ... Browse to folder
- Or, you can change the working directory by starting Stata by double-clicking a dataset or program (do-file) in the folder containing the files related to your chosen project - most prefer this method!

5.15 Keeping a log of your work

- For documentation of your work, you should keep log files, which are transcripts of what appears in a Stata session. The log command or the Log button on the toolbar is used to manage logs
- These logs can be kept in either of two formats:
      text (recommended - very easy to import into word processors)
  or
      smcl (a formatted log that preserves hyperlinks, fonts and colors)
- You can translate from one format to the other:
      translate mylog.smcl mylog.log
- You would usually store the log(s) in the same folder with your data files related to your work

5.16 Getting data into Stata

- The easiest way to enter a small amount of data into Stata is with the edit command. This is an interactive, spreadsheet-like process that is very intuitive (demonstrate)
- If the data are stored in a file on disk and have spaces between each variable, use infile as we have done in the example below
- Files with more complicated formats, such as variable items with no spaces between them or character strings with embedded blanks, require more complicated input via infile or infix with a data dictionary. Details are in the Reference Manual, User’s Guide and in on-line Help. By the way, Stata advises against the use of the data dictionary approach since there are other, easier ways to do it

5.17 Stata tutorial on data input

- In addition to the resources mentioned above, there is an old tutorial on data input -- still applies to Stata:

In this tutorial we show you how to enter your data into Stata.

    You can enter your data        by using
    --------------------------     --------------------------------------
    directly from the keyboard     edit   (Stata for Windows or Macintosh)
                                   input  (all versions of Stata)
    indirectly from a file         insheet
                                   infile
                                   infix
                                   a transfer program
    --------------------------     --------------------------------------
    Then you save your data        by using
                                   save

-------------------------------------------------------------------------------
edit is the easiest way to enter a small amount of data. You type

    . clear      (to drop any data in memory)
    . edit       (to enter the spreadsheet editor)

Only Stata for Windows and Stata for Macintosh users can use edit. We are not going to demonstrate it here. See the Getting Started manual or just try it.

input is available on all versions of Stata:
-------------------------------------------------------------------------------
    . clear
    . input id mpg weight price

                id        mpg     weight      price
      1. 1 22 2930 4099
      2. 2 17 3350 4749
      3. 3 22 2640 3799
      4. 4 20 3250 4816
      5. 5 15 4080 7827
      6. end
-------------------------------------------------------------------------------
input continues to accept observations until you type 'end'.
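For longer entry sessions, the same input block can live in a do-file and be wrapped in a log, combining the advice in sections 5.12 and 5.15. The sketch below is one possible arrangement rather than the course's own template: the file names entry.do and entry.log are made up for illustration, and the data are the five observations from the tutorial.

```stata
* entry.do - hypothetical do-file name; run with: do entry.do
capture log close                   // close any log left open by an earlier run
log using entry.log, text replace   // plain-text transcript in the working directory
clear
input id mpg weight price
1 22 2930 4099
2 17 3350 4749
3 22 2640 3799
4 20 3250 4816
5 15 4080 7827
end
describe                            // variable names and storage types
list                                // show the five observations
log close
```

After this do-file runs, the five observations are still in memory, so further interactive work (or a save) can follow directly.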
Once you have some data in memory, typing input by itself adds new observations: ------------------------------------------------------------------------------. input id mpg weight price 6. 6 26 2230 4453 7. end Only Stata for Windows and Stata for Macintosh users can use edit. We are not going to demonstrate it here. See the Getting Started manual or just try it. input is available on all versions of Stata: ------------------------------------------------------------------------------. clear . input id mpg weight price 1. 2. 3. 4. 5. 6. 1 22 2 17 3 22 4 20 5 15 end id 2930 3350 2640 3250 4080 mpg weight price 4099 4749 3799 4816 7827 ------------------------------------------------------------------------------input continues to accept observations until you type 'end'. Once you have some data in memory, typing input by itself adds new observations: ------------------------------------------------------------------------------. input id 6. 6 26 2230 4453 7. end mpg weight price ------------------------------------------------------------------------------Another way to enter this data would be to type it into a wordprocessor or an editor, save it in a file, and then read the file. We have such a file: ------------------------------------------------------------------------------. type "h:\stata\auto1.raw" make, mpg,weight, price AMC Concord, 22, 2930, 4099 AMC Pacer, 17, 3350, 4749 Biostatistics 624 © 2011 by JHU Biostatistics Dept. Sun, 27 Mar 2011 (6:47p) CLASS 1 - 10 Class 1 - Introduction; Overview of Stata -- LECTURE NOTES 5.17 Stata tutorial on data input (cont'd) AMC Spirit, 22, 2640, 3799 Buick Century, 20, 3250, 4816 Buick Electra, 15,4080, 7827 ------------------------------------------------------------------------------Our file has the variable names at the top (that is not required) and we used commas to separate values one from the other. To read this, we can type: ------------------------------------------------------------------------------. 
clear . insheet using "h:\stata\auto1.raw" (4 vars, 5 obs) . list make 1. AMC Concord 2. AMC Pacer 3. AMC Spirit 4. Buick Century 5. Buick Electra mpg 22 17 22 20 15 weight 2930 3350 2640 3250 4080 price 4099 4749 3799 4816 7827 ------------------------------------------------------------------------------It's easy. insheet will read comma- or tab-delimited files, so it will read text files created by spreadsheet and database programs. ------------------------------------------------------------------------------------------------------------------------------------------------------------If your values are separated by blanks rather than commas or tabs, you use infile to read it. Here is such a file: ------------------------------------------------------------------------------. type "h:\stata\autodata.raw" "AMC Concord" 22 2930 4099 "AMC Pacer" 17 3350 4749 "AMC Spirit" 22 2640 3799 "Buick Century" 20 3250 4816 "Buick Electra" 15 4080 7827 . clear . infile str14 make mpg weight price using "h:\stata\autodata" (5 observations read) . list in ½ 1. 2. make AMC Concord AMC Pacer mpg 22 17 weight 2930 3350 price 4099 4749 ------------------------------------------------------------------------------Finally, if you have a formatted file, you use infile or infix to read it: ------------------------------------------------------------------------------. type "h:\stata\auto3.raw" AMC Concord 2229304099 AMC Pacer 1733504749 AMC Spirit 2226403799 Buick Century 2032504816 Buick Electra 1540807827 . clear . infix 1: str make 1-18 2: mpg 1-2 weight 3-6 price 7-11 > using "h:\stata\auto3.raw" (5 observations read) Biostatistics 624 © 2011 by JHU Biostatistics Dept. Sun, 27 Mar 2011 (6:47p) CLASS 1 - 11 Class 1 - Introduction; Overview of Stata -- LECTURE NOTES . list 1. 2. 3. 4. 5. 
                make   mpg   weight   price
  1.     AMC Concord    22     2930    4099
  2.       AMC Pacer    17     3350    4749
  3.      AMC Spirit    22     2640    3799
  4.   Buick Century    20     3250    4816
  5.   Buick Electra    15     4080    7827

Saving data
-----------
After you have entered data into Stata, you can save it. The command is:

    save filename

If you do not specify the extension for the filename, Stata assumes the extension '.dta'. For instance, we could type 'save auto' to save this data. It would be saved in the file auto.dta. The command to retrieve previously saved data is:

    use filename [, clear]

Thus, the next time we want to use auto.dta, we could type 'use auto' or 'use auto, clear'. 'use auto' works only when the data in memory are empty or unchanged since they were last saved; 'use auto, clear' always works. Stata stores data in memory. The clear option tells Stata that it's okay to drop the data in memory in order to retrieve the new data.

5.18 Saving a Stata dataset
! To save the dataset in the current work space on disk, give the command below along with the appropriate path to the folder containing the file
! Command: save blah.dta, replace

5.19 Loading a Stata dataset
! To load a saved dataset from disk into the work area
! Command: use blah.dta, clear

6. Stata programs – “do-files”

6.1 What are and why use do-files
! “Do-files” contain a collection of Stata statements that perform a variety of tasks – called a Stata program
! Always use do-files to make your work “reproducible” and well-documented
! Note: you can enter commands interactively and then save the commands into a do-file by right clicking anywhere in the Review window: Select Save Review Contents... and navigate to the folder where you want to save the file
! For example, we include a do-file for each e-Quiz except the first containing all the commands to carry out the analyses: eq2.do, eq3.do, etc. Demonstrate how to "run" eq1.do
! “Do-files” document your work
! “Do-files” permit reproducible analyses
! “Do-files” make re-running a series of commands very easy – one step
! “Do-files” for particular tasks can be copied and modified to perform similar tasks – “do-files” serve as templates for future work
! See the Stata User’s Guide for full documentation on what “do-files” can accomplish

6.2 “Hello Mom” program
! This program simply displays the message “Hello Mom” -- an easy way to try the do-file approach
! The name of the program file will be mom.do
! Store the program in a folder: My Documents\bio624

6.3 Start Stata do-file editor
! To create a program file:
    Click: Start
    Click: Stata icon
    Click: Do-editor icon (envelope)
  Note: You can also use NOTEPAD, WORDPAD or even WORD -- anything that allows files to be read and written in “text” format
! Type the following Stata command into the file:
    display "Hello Mom"
  ... Make sure you press [Enter] after typing the line
! Save the file:
    Click File / Save As
    Type: MyDocuments\bio624\mom.do
! Run the “do” file:
    do mom.do (as a Stata command)
  or,
    Click: Do current file icon (in do-file editor)

6.4 Edit and re-run “do” Program
! Return to Do-file editor:
    Click mom.do on the Task Bar
! Make the fixes (change to “Hello Mother Dear”) and then (IMPORTANT) save the file
    Click File / Save
! Re-run the program:
    Click Intercooled... on the Task Bar
    do mom.do
  or (as above),
    Click: Do current file icon (in do-file editor)
! Repeat the “Edit - Run” cycle until done or tired

6.5 Another program
! This program is a little more complicated – try it for fun and practice in making do-files
! Open Stata by clicking profile.do in MyDocuments\bio624
! Input faculty IQ data and summarize it
! The name of the program will be blah.do
! The program is in folder: MyDocuments\bio624
! To create a program file:
    Click: File/New
  or start Stata do-file editor as shown above
! Type the following Stata commands into the do-file editor to enter the data and generate the summary statistics:

    * Turn off annoying -more- message
    set more off
    * Open log file on disk
    * Trick for automatically opening a log file in a do-file
    capture log close
    log using blah.log, replace
    input sno IQ
    1 138
    2 142
    3 136
    4 124
    5 158
    6 108
    7 116
    8 128
    9 125
    10 88
    end
    list
    summarize IQ, detail
    histogram IQ, bin(10) fraction normal
    graph export blah.wmf, replace
    log close

! Save the file:
    Click File / Save As
    Type: MyDocuments\bio624\blah.do
! Change the working directory to the folder containing the “do” program file, if needed -- the current working directory is shown on the lower left in the Status Bar:
    cd "MyDocuments\bio624"
  (Always change to the working directory, which will contain related datasets, graphs, etc.)
! Run the “do” file:
    do blah.do
  or,
    Click: Do current file
! Edit + re-run “do” Program
! Return to do-file editor:
    Click blah.do on the Task Bar
! Make the fixes and then save
    Click File / Save
! Re-run the program:
    Click Intercooled... on the Task Bar
    do blah.do
  or,
    Click Do current file

7. Using Stata to create “do” files
! A good way to make do-files is to enter the commands interactively and then copy them to a do-file for further work:
    Drag mouse to select commands (or select all)
    Right click anywhere in the Review window
    Click Save All or Save Selected
    Paste into the do-file editor (or into Notepad or Wordpad)

8. Stat/Transfer for importing/exporting data
! Most often data are entered and managed using software other than Stata.
This might be done in a spreadsheet such as Excel, a database such as Access or Oracle, or another statistical package such as SAS or SPSS
! In many cases, you can Copy/Paste the data from the outside source into the Stata Data Editor, which transfers the data in simple cases
! If worse comes to worst, data may be transferred to Stata for analysis by writing a space- or comma-delimited ASCII text file to disk and then reading that into Stata using insheet, infile or infix
! The best option for translating data into or from Stata format is a “transfer program” such as Stat/Transfer -- available in the PC Labs on the 3rd floor
! DEMO: To make the transfer, start Stat/Transfer, specify the input file and select its type, then specify the output file and select its type (Stata version). Note that you may also translate a Stata dataset into any of the other supported file formats, i.e., you could translate a Stata dataset for further analysis using SAS or SPSS, for example
— Example: translate the SAS dataset alt3-1.sd2 into a Stata dataset named alt3-1.dta
    Start Stat/Transfer: Start Button, Program, ... click the Stat/Transfer icon
    Click the About tab and verify the version is 5 or higher -- earlier versions of Stat/Transfer may not correctly transfer SAS datasets
    Select SAS for Windows/OS2 from the Input File Type selection box
    Click Browse; locate and select the SAS file alt3-1.sd2 for the input File Specification box
    Select Stata from the Output File Type selection box
    Type alt3-1.dta in the output File Specification box
    Click the Transfer button ... SAS dataset should be converted to Stata format
  To test the transfer: Start Stata and give the commands
    use alt3-1.dta, clear
    describe
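The plain-text fallback mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the file name mydata.csv is hypothetical, and the file is assumed to be comma-delimited with variable names in its first row. insheet is the command used in the data-input tutorial earlier in these notes; newer Stata releases also offer import delimited for the same job.

```stata
* Sketch only: mydata.csv is a hypothetical comma-delimited file
* with variable names in the first row
clear
insheet using "mydata.csv", comma names

* Check what was read before doing any analysis
describe
list in 1/5

* Save in Stata format so the text file need not be re-read
save mydata.dta, replace
```

Running describe immediately after the import is a cheap way to spot columns that were read as strings when numbers were expected.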
! Using the clipboard to import datasets
— Some datasets, such as spreadsheets, can be “copied” to the clipboard
— These can be “pasted” into the Stata Data Editor, which often is a very quick way to transfer data into Stata
— Demonstrate transfer from Excel to Stata
— Data can be exported from Stata using the clipboard by reversing the process

9. Example 1: exploratory analysis of data from Altman’s Exercise 3-1
! Data Source: The data come from Exercise 3.1 on p.45 of the well-written textbook Practical Statistics for Medical Research (Chapman & Hall) by Douglas Altman
! Data Story: The data concern 65 patients with rheumatoid arthritis: whether they experienced adverse drug reactions (REAC) to sodium aurothiomalate (SA), and whether age, dose, or an index (SI = sulphoxidation index) bears any relationship to the adverse reactions
! Data sheet: (image not reproduced)

9.1 Listing of data file
! Below is a listing of the contents of the file alt3-1ex.dat, which contains the raw data, one line (row) per patient
! The variables (columns) for each patient are as follows:

    Id Number                                  sno
    Reaction (1=Yes 2=No)                      react
    Age (years)                                age
    Dose (mg)                                  sadose
    Sulphoxidation Index (no units)            si
    Whether Index is censored (1=Yes 0=No)     censor

     1  2  44  1560   1.0  0
     2  2  65  1310   1.2  0
     3  2  58   850   1.2  0
     4  2  57  1250   1.7  0
     5  2  51   950   1.8  0
     6  2  64   850   1.8  0
     7  2  33  1200   1.9  0
     8  2  61  1390   2.0  0
     9  2  49  1450   2.3  0
    10  2  67  3300   2.8  0
    11  2  39  2760   2.8  0
    12  2  42   860   3.4  0
    13  2  35  1810   3.4  0
    14  2  31  1310   3.8  0
    15  2  37  1250   3.8  0
    16  2  43  1210   4.2  0
    17  2  39  1460   4.9  0
    18  2  53  2310   5.4  0
    19  2  44  1360   5.9  0
    20  2  41  1910   6.2  0
    21  2  72   910  12.0  0
    22  2  61  1410  18.8  0
    23  2  48  2460  47.0  0
    24  2  59  1350  70.0  0
    25  2  72   810  80.0  1
    26  2  59  1460  80.0  1
    27  2  71   760  80.0  1
    28  2  53   910  80.0  1
     1  1  53   360   2.0  0
     2  1  74  2010   2.0  0
     3  1  29  1390   2.0  0
     4  1  53   660   3.0  0
     5  1  67  1135   3.5  0
     6  1  67   510   5.3  0
     7  1  54   410   5.7  0
     8  1  51   910   6.5  0
     9  1  57   360  13.0  0
    10  1  62  1260  13.0  0
    11  1  51   560  13.9  0
    12  1  68  1135  14.7  0
    13  1  50  1410  15.4  0
    14  1  38  1110  15.7  0
    15  1  61   960  16.6  0
    16  1  59  1310  16.6  0
    17  1  68   910  16.6  0
    18  1  44  1235  22.0  0
    19  1  57  2950  22.3  0
    20  1  49   360  33.2  0
    21  1  49  1935  47.0  0
    22  1  63  1660  61.0  0
    23  1  29   435  65.0  0
    24  1  53   310  65.0  0
    25  1  53   310  80.0  1
    26  1  49   410  80.0  1
    27  1  42   690  80.0  1
    28  1  44   910  80.0  1
    29  1  59  1260  80.0  1
    30  1  51  1260  80.0  1
    31  1  46  1310  80.0  1
    32  1  46  1350  80.0  1
    33  1  41  1410  80.0  1
    34  1  39  1460  80.0  1
    35  1  62  1535  80.0  1
    36  1  49  1560  80.0  1
    37  1  53  2050  80.0  1

9.2 Analysis Plan
— Means, SDs, percentiles with summarize
— List data for checking with list
— Stem-and-leaf plots for continuous variables using stem
— Scatterplot matrix to show bivariate relationships among continuous variables using graph matrix
— Dot diagrams to show point distributions within groups using dotplot
— Boxplots by group using graph box
— Shapiro-Wilk test for normal distribution using swilk
— Diagnostic plots for normal distribution using qnorm
— Pick transformation using the Box-Cox transformation: boxcox

9.3 Box-Cox transform
! The Box-Cox transform is used to find a scale on which the response variable is approximately normally distributed -- it does not always work, but it is worth trying. Don’t accept the result without applying common sense to it
! It can be used in a regression model to find a transformation that makes the errors in the regression model approximately normally distributed
! The transform represents a family of “power” transformations commonly used in data analysis:

    y(theta) = (y^theta - 1) / theta    for theta not equal to 0
    y(theta) = ln(y)                    for theta = 0

  so theta = 1 leaves the scale essentially unchanged, theta = 0 is the log transform, and theta = -1 is the reciprocal
! See boxcox in the Stata reference manual for more details and examples

9.4 Techniques Illustrated
! Use of comment statements for documentation
! Clear Stata’s work space
! Change the working folder (directory) on disk from within Stata
! Make a folder from within Stata to help organize your work
! Print results by sending them to a file on disk so they can be incorporated into a word processor and printed
! Input free-format data from a data file on disk
! Label variables
! Label variable values
! List data
! Get summary statistics
! Get stem-and-leaf plots
! Get a scatterplot matrix
! Store Stata graphs on disk in “Windows metafile format” (.wmf) for incorporation into word processing programs and printing
! Get dot diagrams
! Get boxplots
! Generate the Shapiro-Wilk statistic for testing normality
! Produce a quantile-quantile plot for assessing goodness of fit to a normal distribution
! Use the Box-Cox transform to suggest a transformation to normality
! NOTE: The do-file and data file are on the website as alt3-1ex.do and alt3-1ex.dat

9.5 Log Showing Commands and Output

. * Turn off MORE feature
. set more off

. * Input data
. infile sno react age sadose si censor using alt3-1ex.dat
(65 observations read)

. * Variable labels
. label variable sno "Study No."
. label variable react "Adverse Reaction"
. label variable age "Age in years"
. label variable sadose "Dose of SA (mg)"
. label variable si "Sulphoxidation Index"

. * Value labels
. label define reactlbl 1 "Yes" 2 "No"
. label values react reactlbl

. * Save Stata dataset
. save alt3-1ex.dta, replace
file alt3-1ex.dta saved

. * List data for checking
. list in 1/10

     +-------------------------------------------+
     | sno   react   age   sadose    si   censor |
     |-------------------------------------------|
  1. |   1      No    44     1560     1        0 |
  2. |   2      No    65     1310   1.2        0 |
  3. |   3      No    58      850   1.2        0 |
  4. |   4      No    57     1250   1.7        0 |
  5. |   5      No    51      950   1.8        0 |
     |-------------------------------------------|
  6. |   6      No    64      850   1.8        0 |
  7. |   7      No    33     1200   1.9        0 |
  8. |   8      No    61     1390     2        0 |
  9. |   9      No    49     1450   2.3        0 |
 10. |  10      No    67     3300   2.8        0 |
     +-------------------------------------------+
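Listing the first ten rows is a visual check. As a complement, a few assert statements can verify the coding rules stated in the variable list; this is a hedged sketch and not part of the original alt3-1ex.do. It assumes alt3-1ex.dta has been saved as in the log above.

```stata
* Sketch: automated checks on the freshly read data
* (each assert stops the do-file with an error if the condition fails)
use alt3-1ex.dta, clear
assert _N == 65                      // 65 patients expected
assert inlist(react, 1, 2)           // 1=Yes, 2=No
assert inlist(censor, 0, 1)          // 1=censored, 0=not censored
assert !missing(age, sadose, si)     // no missing values expected

* Cross-tabulate to eyeball the group sizes
tabulate react censor
```

Checks like these cost one line each and keep a do-file from silently running on a mis-read data file.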
. * Descriptive Statistics
. summarize, detail

                          Study No.
-------------------------------------------------------------
      Percentiles      Smallest
 1%            1              1
 5%            2              1
10%            4              2       Obs                  65
25%            9              2       Sum of Wgt.          65

50%           17                      Mean           17.06154
                        Largest       Std. Dev.      9.974776
75%           25             34
90%           31             35       Variance       99.49615
95%           34             36       Skewness       .1632394
99%           37             37       Kurtosis       2.000031

                      Adverse Reaction
-------------------------------------------------------------
      Percentiles      Smallest
 1%            1              1
 5%            1              1
10%            1              1       Obs                  65
25%            1              1       Sum of Wgt.          65

50%            1                      Mean           1.430769
                        Largest       Std. Dev.      .4990375
75%            2              2
90%            2              2       Variance       .2490385
95%            2              2       Skewness       .2796164
99%            2              2       Kurtosis       1.078185

                        Age in years
-------------------------------------------------------------
      Percentiles      Smallest
 1%           29             29
 5%           33             29
10%           38             31       Obs                  65
25%           44             33       Sum of Wgt.          65

50%           53                      Mean           52.12308
                        Largest       Std. Dev.      11.19641
75%           61             71
90%           67             72       Variance       125.3596
95%           71             72       Skewness      -.0659275
99%           74             74       Kurtosis       2.326933

                       Dose of SA (mg)
-------------------------------------------------------------
      Percentiles      Smallest
 1%          310            310
 5%          360            310
10%          410            360       Obs                  65
25%          860            360       Sum of Wgt.          65

50%         1260                      Mean           1249.538
                        Largest       Std. Dev.      622.3134
75%         1460           2460
90%         2010           2760       Variance         387274
95%         2460           2950       Skewness       .9572716
99%         3300           3300       Kurtosis       4.426923

                    Sulphoxidation Index
-------------------------------------------------------------
      Percentiles      Smallest
 1%            1              1
 5%          1.7            1.2
10%          1.9            1.2       Obs                  65
25%          3.4            1.7       Sum of Wgt.          65

50%         14.7                      Mean           31.54308
                        Largest       Std. Dev.       33.2201
75%           80             80
90%           80             80       Variance       1103.575
95%           80             80       Skewness       .6044778
99%           80             80       Kurtosis       1.543044

                           censor
-------------------------------------------------------------
      Percentiles      Smallest
 1%            0              0
 5%            0              0
10%            0              0       Obs                  65
25%            0              0       Sum of Wgt.          65

50%            0                      Mean           .2615385
                        Largest       Std. Dev.      .4428926
75%            1              1
90%            1              1       Variance       .1961538
95%            1              1       Skewness       1.085217
99%            1              1       Kurtosis       2.177696

. * Stem and leaf
. stem age

Stem-and-leaf plot for age (Age in years)

  2. | 99
  3* | 13
  3. | 578999
  4* | 112234444
  4. | 66899999
  5* | 0111133333334
  5. | 77789999
  6* | 1112234
  6. | 577788
  7* | 1224

. stem sadose

Stem-and-leaf plot for sadose (Dose of SA (mg))

  0*** | 310,310,360,360,360
  0*** | 410,410,435,510,560
  0*** | 660,690,760
  0*** | 810,850,850,860,910,910,910,910,910,950,960
  1*** | 110,135,135
  1*** | 200,210,235,250,250,260,260,260,310,310,310,310,350,350,360,390,390
  1*** | 410,410,410,450,460,460,460,535,560,560
  1*** | 660
  1*** | 810,910,935
  2*** | 010,050
  2*** | 310
  2*** | 460
  2*** | 760
  2*** | 950
  3*** |
  3*** | 300

. stem si

Stem-and-leaf plot for si (Sulphoxidation Index)

si rounded to nearest multiple of .1
plot in units of .1

  0** | 10,12,12,17,18,18,19,20,20,20,20,23,28,28,30,34,34,35,38,38,42,49
  0** | 53,54,57,59,62,65
  1** | 20,30,30,39,47
  1** | 54,57,66,66,66,88
  2** | 20,23
  2** |
  3** | 32
  3** |
  4** |
  4** | 70,70
  5** |
  5** |
  6** | 10
  6** | 50,50
  7** | 00
  7** |
  8** | 00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00

. * Scatterplots Matrix
. graph box age, over (react) t1(AGE BOXPLOTS) t2(" ") l1(AGE) b1(REACTION)
(file alt3-1ex\boxplot1.gph saved)
. graph export alt3-1ex\scatmat.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-1ex\scatmat.wmf written in Windows Metafile format)

[Figure: SCATTERPLOT MATRIX of Adverse Reaction, Age in years, Dose of SA (mg), and Sulphoxidation Index]

. * Dot diagram
. sort react

. dotplot age , by (react) t1(AGE DOTPLOT) l1(AGE) b1(REACTION)
(file alt3-1ex\dotplot1.gph saved)

. graph export alt3-1ex\dotplot1.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-1ex\dotplot1.wmf written in Windows Metafile format)

[Figure: AGE DOTPLOT; Age in years (30 to 70) by Adverse Reaction (Yes, No)]

. dotplot sadose, by (react) t1(SA DOSE DOTPLOT) l1(SADOSE MG) b1(REACTION)
(file alt3-1ex\dotplot2.gph saved)

. graph export alt3-1ex\dotplot2.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-1ex\dotplot2.wmf written in Windows Metafile format)

[Figure: SA DOSE DOTPLOT; Dose of SA (mg) (0 to 4000) by Adverse Reaction (Yes, No)]

. dotplot si, by (react) t1(SI DOSE DOTPLOT) l1(SI) b1(REACTION)
(file alt3-1ex\dotplot3.gph saved)

. graph export alt3-1ex\dotplot3.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-1ex\dotplot3.wmf written in Windows Metafile format)

[Figure: SI DOSE DOTPLOT; Sulphoxidation Index (0 to 80) by Adverse Reaction (Yes, No)]

. * Letter values, outliers by reaction subgroup
. lv age if react==1 ,generate

 #     37                Age in years
             ---------------------------------
 M     19    |              53               |     spread   pseudosigma
 F     10    |      46     52.5       59     |         13     10.05177
 E    5.5    |    41.5    53.25       65     |       23.5     10.80392
 D      3    |      38       53       68     |         30     10.23727
 C      2    |      29     48.5       68     |         39     11.47614
 B    1.5    |      29       50       71     |         42     11.27376
        1    |      29     51.5       74     |         45     10.79743
             |                               |
             |                               |    # below      # above
 inner fence |    26.5              78.5     |          0            0
 outer fence |       7                98     |          0            0

. list age if react==1 & ( (age >=( r(u_F) + 1.5*(r(u_F) - r(l_F)))) | (age <=( r(l_F) - 1.5*(r(u_F
> ) - r(l_F)))) )

. lv age if react==2 ,generate

 #     28                Age in years
             ---------------------------------
 M   14.5    |              52               |     spread   pseudosigma
 F    7.5    |    41.5    51.25       61     |       19.5     14.65586
 E      4    |      37       52       67     |         30     13.28402
 D    2.5    |      34    52.75     71.5     |       37.5     13.11905
 C    1.5    |      32       52       72     |         40     11.51282
        1    |      31     51.5       72     |         41     10.41174
             |                               |
             |                               |    # below      # above
 inner fence |   12.25             90.25     |          0            0
 outer fence |     -17             119.5     |          0            0

. list age if react==2 & ( (age >=( r(u_F) + 1.5*(r(u_F) - r(l_F)))) | (age <=( r(l_F) - 1.5*(r(u_F)
>  - r(l_F)))) )

. lv sadose if react==1 ,generate

 #     37               Dose of SA (mg)
             ---------------------------------
 M     19    |            1135               |     spread   pseudosigma
 F     10    |     560      985     1410     |        850     657.2313
 E    5.5    |     385    997.5     1610     |       1225      563.183
 D      3    |     360     1185     2010     |       1650     563.0501
 C      2    |     310     1180     2050     |       1740     512.0124
 B    1.5    |     310     1405     2500     |       2190     587.8463
        1    |     310     1630     2950     |       2640     633.4493
             |                               |
             |                               |    # below      # above
 inner fence |    -715              2685     |          0            1
 outer fence |   -1990              3960     |          0            0

. list sadose if react==1 & ( (sadose >=( r(u_F) + 1.5*(r(u_F) - r(l_F)))) | (sadose <=( r(l_F) - 1
> .5*(r(u_F) - r(l_F)))) )

     +--------+
     | sadose |
     |--------|
 37. |   2950 |
     +--------+

. lv sadose if react==2 , generate

 #     28               Dose of SA (mg)
             ---------------------------------
 M   14.5    |            1330               |     spread   pseudosigma
 F    7.5    |     930     1220     1510     |        580     435.9179
 E      4    |     850     1580     2310     |       1460      646.489
 D    2.5    |     830     1720     2610     |       1780     622.7175
 C    1.5    |     785   1907.5     3030     |       2245      646.157
        1    |     760     2030     3300     |       2540     645.0197
             |                               |
             |                               |    # below      # above
 inner fence |      60              2380     |          0            3
 outer fence |    -810              3250     |          0            1

. list sadose if react==2 & ( (sadose >=( r(u_F) + 1.5*(r(u_F) - r(l_F)))) | (sadose <=( r(l_F) - 1
> .5*(r(u_F) - r(l_F)))) )

     +--------+
     | sadose |
     |--------|
 26. |   2460 |
 27. |   2760 |
 28. |   3300 |
     +--------+

. lv si if react==1 ,generate

 #     37             Sulphoxidation Index
             ---------------------------------
 M     19    |            22.3               |     spread   pseudosigma
 F     10    |      13     46.5       80     |         67     51.80529
 E    5.5    |     4.4     42.2       80     |       75.6     34.75644
 D      3    |       2       41       80     |         78     26.61691
 C      2    |       2       41       80     |         78     22.95228
 B    1.5    |       2       41       80     |         78     20.93699
        1    |       2       41       80     |         78     18.71555
             |                               |
             |                               |    # below      # above
 inner fence |   -87.5             180.5     |          0            0
 outer fence |    -188               281     |          0            0

. list si if react==1 & ( (si >=( r(u_F) + 1.5*(r(u_F) - r(l_F)))) | (si <=( r(l_F) - 1.5*(r(u_F)
>  - r(l_F)))) )

. lv si if react==2 ,generate

 #     28             Sulphoxidation Index
             ---------------------------------
 M   14.5    |             3.8               |     spread   pseudosigma
 F    7.5    |    1.95    8.675     15.4     |      13.45     10.10879
 E      4    |     1.7    40.85       80     |       78.3      34.6713
 D    2.5    |     1.2     40.6       80     |       78.8      27.5675
 C    1.5    |     1.1    40.55       80     |       78.9     22.70904
        1    |       1     40.5       80     |         79     20.06164
             |                               |
             |                               |    # below      # above
 inner fence | -18.225            35.575     |          0            6
 outer fence |   -38.4             55.75     |          0            5

. list si if react==2 & ( (si >=( r(u_F) + 1.5*(r(u_F) - r(l_F)))) | (si <=( r(l_F) - 1.5*(r(u_F)
>  - r(l_F)))) )

     +----+
     | si |
     |----|
 23. | 47 |
 24. | 70 |
 25. | 80 |
 26. | 80 |
 27. | 80 |
     |----|
 28. | 80 |
     +----+
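The list commands in this part of the log flag values on or beyond the inner fences, that is, at least 1.5 fourth-spreads outside the fourths that lv leaves behind in r(l_F) and r(u_F). A minimal sketch of the same calculation with the fences held in named scalars is given here; the scalar names fspread, lofence and hifence are illustrative and are not part of the original do-file, and alt3-1ex.dta is assumed to have been saved as in the log.

```stata
* Sketch: inner fences for sadose among patients with a reaction (react==1)
use alt3-1ex.dta, clear
quietly lv sadose if react == 1

scalar fspread = r(u_F) - r(l_F)          // fourth-spread: upper minus lower fourth
scalar lofence = r(l_F) - 1.5*fspread     // lower inner fence
scalar hifence = r(u_F) + 1.5*fspread     // upper inner fence
display "inner fences: " lofence " to " hifence

* Values on or beyond the inner fences are candidate outliers
list sno sadose if react == 1 & (sadose <= lofence | sadose >= hifence)
```

For this subgroup the log reports fourths of 560 and 1410, hence inner fences of -715 and 2685, and a single flagged dose of 2950 mg. Naming the fences makes the long list condition easier to read and reuse for the other variables.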
. * Boxplots
. sort react

. graph box age, over (react) t1(AGE BOXPLOTS) t2(" ") l1(AGE) b1(REACTION)
(file alt3-1ex\boxplot1.gph saved)

. graph export alt3-1ex\boxplot1.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-1ex\boxplot1.wmf written in Windows Metafile format)

[Figure: AGE BOXPLOTS; Age in years (30 to 70) by REACTION (Yes, No)]

. graph box sadose, over (react) t1(SA DOSE BOXPLOTS) t2(" ") l1(DOSE MG) b1(REACTION)
(file alt3-1ex\boxplot2.gph saved)

. graph export alt3-1ex\boxplot2.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-1ex\boxplot2.wmf written in Windows Metafile format)

[Figure: SA DOSE BOXPLOTS; Dose of SA (mg) (0 to 4,000) by REACTION (Yes, No)]

. graph box si, over (react) t1(SI DOSE BOXPLOTS) t2(" ") l1(SI) b1(REACTION)

. graph export alt3-1ex\boxplot3.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-1ex\boxplot3.wmf written in Windows Metafile format)

[Figure: SI DOSE BOXPLOTS; Sulphoxidation Index (0 to 80) by REACTION (Yes, No)]

. * Shapiro-Wilk Test for Normality
. swilk age sadose si

                   Shapiro-Wilk W test for normal data

    Variable |    Obs       W           V         z       Prob>z
-------------+--------------------------------------------------
         age |     65    0.98503      0.868    -0.307    0.62061
      sadose |     65    0.92756      4.199     3.107    0.00094
          si |     65    0.82921      9.901     4.964    0.00000

. * Diagnostic Plot for Normal Distribution (Q-Q plot)
. qnorm age , grid b1(AGE Q-Q PLOT) l1(AGE)
. graph export alt3-1ex\qqplot1.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-1ex\qqplot1.wmf written in Windows Metafile format)

[Figure: AGE Q-Q PLOT; Age in years against Inverse Normal. Grid lines are 5, 10, 25, 50, 75, 90, and 95 percentiles]

. qnorm sadose , grid b1(SA DOSE Q-Q PLOT) l1(SA DOSE)

. graph export alt3-1ex\qqplot2.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-1ex\qqplot2.wmf written in Windows Metafile format)

[Figure: SA DOSE Q-Q PLOT; Dose of SA (mg) against Inverse Normal. Grid lines are 5, 10, 25, 50, 75, 90, and 95 percentiles]

. qnorm si , grid b1(SI Q-Q PLOT) l1(SI)

. graph export alt3-1ex\qqplot3.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-1ex\qqplot3.wmf written in Windows Metafile format)

[Figure: SI Q-Q PLOT; Sulphoxidation Index against Inverse Normal. Grid lines are 5, 10, 25, 50, 75, 90, and 95 percentiles]

. * Box-Cox method to choose transformation to normality
. * nolog option suppresses iterations - nothing to do with logarithms
. boxcox age , nolog
Fitting comparison model
Fitting full model

                                                  Number of obs   =         65
                                                  LR chi2(0)      =       0.00
Log likelihood = -248.73918                       Prob > chi2     =          .

------------------------------------------------------------------------------
         age |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      /theta |   1.028826    .527121     1.95   0.051     -.004312    2.061964
------------------------------------------------------------------------------

Estimates of scale-variant parameters
----------------------------
             |      Coef.
-------------+--------------
Notrans      |
       _cons |    55.8456
-------------+--------------
      /sigma |   12.44209
----------------------------

---------------------------------------------------------
   Test         Restricted     LR statistic      P-Value
    H0:       log likelihood       chi2       Prob > chi2
---------------------------------------------------------
theta = -1      -256.76965        16.06          0.000
theta =  0      -250.73362         3.99          0.046
theta =  1      -248.74068         0.00          0.956
---------------------------------------------------------

. boxcox sadose, nolog
Fitting comparison model
Fitting full model

                                                  Number of obs   =         65
                                                  LR chi2(0)      =       0.00
Log likelihood = -505.33421                       Prob > chi2     =          .

------------------------------------------------------------------------------
      sadose |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      /theta |   .4100593   .1929563     2.13   0.034      .031872    .7882467
------------------------------------------------------------------------------

Estimates of scale-variant parameters
----------------------------
             |      Coef.
-------------+--------------
Notrans      |
       _cons |   41.58575
-------------+--------------
      /sigma |   9.273821
----------------------------

---------------------------------------------------------
   Test         Restricted     LR statistic      P-Value
    H0:       log likelihood       chi2       Prob > chi2
---------------------------------------------------------
theta = -1      -530.33416        50.00          0.000
theta =  0      -507.58528         4.50          0.034
theta =  1      -509.90097         9.13          0.003
---------------------------------------------------------

. boxcox si , nolog
Fitting comparison model
Fitting full model

                                                  Number of obs   =         65
                                                  LR chi2(0)      =       0.00
Log likelihood = -285.74575                       Prob > chi2     =          .

------------------------------------------------------------------------------
          si |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      /theta |   .0403967   .1055843     0.38   0.702    -.1665448    .2473382
------------------------------------------------------------------------------

Estimates of scale-variant parameters
----------------------------
             |      Coef.
-------------+--------------
Notrans      |
       _cons |   2.770815
-------------+--------------
      /sigma |    1.64801
----------------------------

---------------------------------------------------------
   Test         Restricted     LR statistic      P-Value
    H0:       log likelihood       chi2       Prob > chi2
---------------------------------------------------------
theta = -1       -333.2825        95.07          0.000
theta =  0      -285.81928         0.15          0.701
theta =  1       -319.4322        67.37          0.000
---------------------------------------------------------

. * Close the log -- may want to use for production runs
. *log close

10. Example 2: input and display of data from Altman’s exercise 3-2
! Data: These data are found on p.47 of Altman (Exercise 3.2). The data concern airplane accidents (counts, rates/1000, and rates per 100,000 flight hours) and how they relate to the occupation of the pilot
! Script of Stata commands contained in alt3-2ex.do
! NOTE: The script file and data file are on the class disk as alt3-2ex.do and alt3-2ex.dat

10.1 Source data from Altman

10.2 Raw data — text file on disk

Professional pilots           1302      15.9       0.2
Lawyers                         57      11.0       1.5
Farmers                        166      10.1       1.3
Sales representatives          137       9.0       1.2
Physicians                      76       8.7       1.8
Mechanics and repairmen         44       6.9       1.5
Policemen and detectives        48       6.6       1.8
Managers and administrators    643       6.0       0.7
Engineers                      125       4.7       1.1
Teachers                        43       4.2       1.1
Housewives                      29       3.7       3.2
Academic students              188       3.2       3.7
Armed Forces Members           111       1.6       0.7

10.3 Analysis plan
! Explore this simple dataset with several graphs using the graph command
— Show how counts of accidents are related to occupation of pilot
— Show how rates per 1000 pilots are related to occupation
— Show how rates per 100,000 flight hours are related to occupation
— Show how the two rates are related to one another
! Consider other approaches to analysis

10.4 Stata log

. * Turn off MORE feature
. set more off

. * Input data, embedded blanks in string
. infix str occup 1-29 accid 30-34 rate1 40-44 rate2 50-54 using alt3-2ex.dat
(13 observations read)

. * Variable labels
. label variable occup "Occupation"
. label variable accid "No. of Accidents"
. label variable rate1 "Rate per 1000"
. label variable rate2 "Rate per 100,000 hr"

. * List data for checking
. list

     +-----------------------------------------------------+
     |                       occup   accid   rate1   rate2 |
     |-----------------------------------------------------|
  1. |         Professional pilots    1302    15.9      .2 |
  2. |                     Lawyers      57      11     1.5 |
  3. |                     Farmers     166    10.1     1.3 |
  4. |       Sales representatives     137       9     1.2 |
  5. |                  Physicians      76     8.7     1.8 |
     |-----------------------------------------------------|
  6. |     Mechanics and repairmen      44     6.9     1.5 |
  7. |    Policemen and detectives      48     6.6     1.8 |
  8. | Managers and administrators     643       6      .7 |
  9. |                   Engineers     125     4.7     1.1 |
 10. |                    Teachers      43     4.2     1.1 |
     |-----------------------------------------------------|
 11. |                  Housewives      29     3.7     3.2 |
 12. |           Academic students     188     3.2     3.7 |
 13. |        Armed Forces Members     111     1.6      .7 |
     +-----------------------------------------------------+

. * Code occupations for graphs
. encode occup, gen(occup1)

. * Make shorter labels for graphs
. #delimit ;
delimiter now ;
. label define occuplab  1 "Acad"     2 "Armed For"   3 "Engin"
>                        4 "Farm"     5 "Housewife"   6 "Law"
>                        7 "Mgrs"     8 "Mech"        9 "MD"
>                       10 "Police"  11 "Pro Pilot"  12 "Sales"
>                       13 "Teach" ;
. #delimit cr
delimiter now cr
. label values occup1 occuplab

. * Save as Stata dataset
. save alt3-2ex.dta, replace
file alt3-2ex.dta saved

. * Bar graph, See Figure 1
. sort occup1

. graph hbar accid , over(occup1,sort(1)) ytitle(" ") l1(OCCUPATION) b1(No. of Accidents) t1 (AIRPLANE ACCIDENTS)

. graph export alt3-2ex\fig1.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-2ex\fig1.wmf written in Windows Metafile format)

[Figure 1: AIRPLANE ACCIDENTS; horizontal bars of No. of Accidents (0 to 1,500) by OCCUPATION, in ascending order: Housewife, Teach, Mech, Police, Law, MD, Armed For, Engin, Sales, Farm, Acad, Mgrs, Pro Pilot]

. * Bar graph, See Figure 2
. graph hbar rate1 , over(occup1,sort(1)) ytitle(" ") l1(OCCUPATION) b1(Rate per 1000 Pilots) t1 (AIRPLANE ACCIDENTS)
. graph export alt3-2ex\fig2.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-2ex\fig2.wmf written in Windows Metafile format)

[Figure 2: horizontal bar chart titled AIRPLANE ACCIDENTS. Rate per 1000 Pilots (axis 0 to 15) by OCCUPATION, bars in increasing order from Armed For to Pro Pilot.]

. * Bar graph See Figure 3
. graph hbar rate2 , over(occup1,sort(1)) ytitle(" ") l1(OCCUPAT
> ION) b1(Rate per 100000 hrs) t1 (AIRPLANE ACCIDENTS)
(file alt3-2ex\fig3.gph saved)
. graph export alt3-2ex\fig3.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-2ex\fig3.wmf written in Windows Metafile format)

[Figure 3: horizontal bar chart titled AIRPLANE ACCIDENTS. Rate per 100000 hrs (axis 0 to 4) by OCCUPATION, bars in increasing order from Pro Pilot to Acad.]

. * Scatterplot See Figure 4
. graph twoway scatter rate1 rate2, mlabel(occup1) t1(AIRPLANE ACCIDENT RATES)
. graph export alt3-2ex\fig4.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-2ex\fig4.wmf written in Windows Metafile format)

[Figure 4: scatterplot titled AIRPLANE ACCIDENT RATES. Rate per 1000 (vertical axis, 0 to 15) against Rate per 100,000 hr (horizontal axis, 0 to 4), points labeled by occupation.]

. log close

11. Common data analysis applications

- For simplicity of illustration, the rheumatoid arthritis data introduced earlier will be used in all the examples, some of which may be contrived or inappropriate
- The examples shown below assume that the Stata dataset has been loaded into the work space through input of the raw data or by loading a saved dataset (e.g., use alt3-1ex\alt3-1ex.dta)

11.1 Descriptive statistics

- Means, SDs, and other descriptive statistics
- Variables: age, sadose, and si
- Command:
    summarize age sadose si , detail

11.2 Stem-and-leaf charts

- Stem-and-leaf chart to show the distribution of a continuous variable; must do one variable at a time
- Variable: age
- Command:
    stem age

11.3 Boxplots

- Boxplot to show the distribution of a variable in subgroups of the data. Data must be sorted by the subgrouping variables. Store the graph in a folder (sub-directory) in metafile format (*.wmf), so it can be imported into a word processor for printing
- Variables:
  - Subgrouping: reac
  - Analysis: age
- Commands: [Type each command below on a single, long line]
    sort react
    graph box age, over(react) marker(1,mlab(sno)) t1(AGE BOXPLOTS) t2(" ") l1(AGE) b1(REACTION)
    graph export alt3-1ex\boxplot1.wmf,replace

11.4 Confidence interval for a mean

- Calculate a 95% confidence interval for the mean value of a variable
- Variable: age
- Command:
    ci age
- Immediate form of command, used as a “calculator” to produce a 95% CI from n, mean, and SD:
    cii 65 52.12 11.20

11.5 Confidence interval for a proportion

- Calculate a 95% confidence interval for the proportion positive in a binomial distribution. Stata calculates exact binomial limits.
  Note: Stata can also calculate limits for the mean of a Poisson distribution using the poisson option of the ci or cii commands.
- Variable: censor
- Command:
    ci censor , binomial
- Immediate form of command, used as a “calculator” to produce a 95% CI from n and # of events:
    cii 65 17
- Poisson example (27 deaths, 645 person-years):
    cii 645 27 , poisson

11.6 Student’s t-test

- Used to test equality of means. It comes in 3 forms:
  - Test that variable has a mean equal to a specific #: this is the one-sample t-test
  - Test that variable1 has the same mean as variable2: this is the paired t-test
  - Test that variable has the same mean within two groups defined by a grouping variable groupvar: this is the two-sample t-test
  Note: Stata gives p-values for the t-tests, but also gives 95% confidence intervals on means and differences in means
- Variables: age with reac as the subgrouping variable
- Commands:
  - One-sample t-test: Test mean age = 50
      ttest age = 50
  - Paired t-test: (Stupidly, for illustration) test mean sadose = si
      ttest sadose = si
  - Two-sample t-test: Test age means are equal within reaction groups
      ttest age , by(reac)
    or, without assuming equal variances,
      ttest age , by(reac) unequal
- Immediate forms of commands can be used as a “calculator” to get a t-test given summary data on n, and the observed means and standard deviations (sd):
  - One-sample test (n=24, observed mean=62.6, sd=15.8; test mean=75)
      ttesti 24 62.6 15.8 75
  - Paired t-test: there is no immediate command for this
  - Two-sample t-test (n1=20, m1=20, sd1=5; n2=32, m2=15, sd2=4; test means equal)
      ttesti 20 20 5 32 15 4

11.7 Test for binomial proportions

- Use to test equality of proportions within two subgroups
  Note: Stata gives the 2x2 chi-square test and p-value. It also gives the Fisher’s exact test p-value
- Variables: proportion censored (censor) within reactivity groups (reac)
- Commands:
    tab censor reac, chi2 exact
- Immediate forms of commands can be used as a “calculator” to test equality of proportions in a 2x2 table. Enter the rows of the table separated by a “\” character:
    tabi 24 24 \ 13 4 , chi2 exact

11.8 Correlation

- Obtain either the Pearson’s or Spearman’s (rank) estimated correlation coefficient of two measured responses x and y
- Variables: age and si
- Commands:
    corr age si
    spearman age si
  Note: Pairs of correlations among a set of variables may be obtained by specifying the list of variables. E.g., to obtain age-sadose, age-si, and sadose-si correlations:
    corr age sadose si

11.9 Simple linear regression

- Estimate a simple linear model relating a measured response (dependent) variable y to a fixed covariate (independent) variable x
  - y = α+βx+ε
  Stata produces an analysis of variance, p-values, coefficient estimates, standard errors, and 95% confidence intervals
- Variables: Dependent = si and independent = age
- Commands:
    regress si age
- Commands to obtain a graph of the data, fitted line, and 95% CIs: (Type the graph command on one line)
    graph twoway (scatter si age) || (lfitci si age), t1("si= 30.15+.0268age")
    graph export alt3-1ex\lreg.wmf,replace

11.10 Analysis of variance

- Used to test equality of means within two or more subgroups; usually 3 or more, as the t-test is usually used for 2 groups
- Variables: Dependent variable = si, subgrouping variable = reac (only 2 groups in this example)
- Command:
    oneway si reac

11.11 Multiple linear regression

- Use regress
- For details refer to the Reference Manual or
    help regress
- Also see Stata User’s Guide Chapters 26 and 35 (in the handout for Part 1) for more details on fitting regression models

11.12 Multiple logistic regression

- Use logistic for logistic regression for binary responses
- Use clogit for matched or highly stratified case-control studies (including “frequency-matched” studies)
- Use ologit for logistic regression for ordered responses with more than 2 categories
- Use mlogit for logistic regression for responses with more than 2 categories (not ordered)
- For details refer to the Reference Manual or
    help logistic
    help clogit
    help ologit
    help mlogit
- Also see Stata User’s Guide Chapters 26 and 35 (in the handout for Part 1) for more details on fitting regression models

11.13 Epidemiologic calculations - epitab

- Most of the common calculations for epidemiologic analysis have been included in Stata in a group of commands labeled “epitab” in the Reference Manual
- Most of the commands have an “immediate” form so that they may be applied to summary tables, rather than to the raw data, which may not be available
- Details may be found in the Manual or by typing
    help epitab
  For convenience, the Help text is included below
. help epitab
-------------------------------------------------------------------------------
help for epitab, ir, iri, cs, csi, cc, cci, mcc, mcci       (manual: [R] epitab)
-------------------------------------------------------------------------------

Tables for epidemiologists
--------------------------

    ir case_var ex_var time_var [weight] [if exp] [in range]
        [, level(#) tb by(varname) fast estandard istandard
        standard(varname) ird nocrude pool nohet ]

    iri #a #b #N1 #N2 [, level(#) tb ]

    cs case_var ex_var [weight] [if exp] [in range]
        [, level(#) exact tb woolf by(varname) fast or estandard istandard
        standard(varname) nocrude pool nohet ]

    csi #a #b #c #d [, level(#) exact or tb woolf ]

    cc case_var ex_var [weight] [if exp] [in range]
        [, level(#) exact tb woolf by(varname) fast estandard istandard
        standard(varname) nocrude pool nohet ]

    cci #a #b #c #d [, level(#) exact tb woolf ]

    mcc ex_case_var ex_cntl_var [weight] [if exp] [in range] [, level(#) tb ]

    mcci #a #b #c #d [, level(#) tb ]

Description
-----------

ir is used with incidence rate (incidence density or person-time) data; point estimates and confidence intervals for the incidence rate ratio and difference are calculated along with attributable or prevented fractions for the exposed and total population. iri is the immediate form of ir; see help immed. Also see help nbreg, help poisson and help stcox for related commands.

cs is used with cohort study data with equal follow-up time per subject and, in some cases, cross-sectional data. Risk is then the proportion of subjects who become cases. Point estimates and confidence intervals for the risk difference, risk ratio, and (optionally) the odds ratio are calculated along with attributable or prevented fractions for the exposed and total population. csi is the immediate form of cs; see help immed. Also see help logistic and help glogit for related commands.

cc is used with case-control and cross-sectional data. Point estimates and confidence intervals for the odds ratio are calculated along with attributable or prevented fractions for the exposed and total population. cci is the immediate form of cc; see help immed. Also see help logistic and help glogit for related commands.

mcc is used with matched case-control data. McNemar's chi-squared, point estimates and confidence intervals for the difference, ratio, and relative difference of the proportion with the factor, along with the odds ratio, are calculated. mcci is the immediate form of mcc; see help immed. Also see help clogit for a related command.

Options
-------

level(#) specifies in percent the confidence level for confidence intervals.

exact requests Fisher's exact P be calculated rather than the chi-squared and its significance level. We recommend specifying exact whenever samples are small. A conservative rule-of-thumb for 2x2 tables is to specify exact when the least-frequent cell contains fewer than 1,000 cases. Note that exact does not affect whether exact confidence intervals are calculated; commands always calculate exact confidence intervals where they can unless tb or woolf is specified.

by(varname) specifies that the tables are stratified on varname. Within-stratum statistics are shown then combined with Mantel-Haenszel weights. If estandard, istandard, or standard() is also specified (see below), the weights specified are used in place of Mantel-Haenszel weights.

fast specifies that calculations of within-stratum confidence intervals are not to be made. This speeds execution of the command, although in the case of ir, it makes little difference and for the remaining commands, woolf or tb are almost as fast.

or is allowed only with the cs and csi commands. Specified without by(), or reports the calculation of the odds ratio in addition to the risk ratio. With by(), or specifies that a Mantel-Haenszel estimate of the combined odds ratio be made rather than the Mantel-Haenszel estimate of the risk ratio. In either case, this is the same calculation as would be made by cc or cci and, typically, the use of those commands is to be preferred for obtaining odds ratios.

tb requests that test-based confidence intervals be calculated wherever appropriate in place of confidence intervals based on other approximations or exact confidence intervals. We recommend that test-based confidence intervals be used only for pedagogical purposes and never be used for research work.

woolf requests that the Woolf approximation, also known as the Taylor expansion, be used for calculating the standard error of the odds ratio. Otherwise, the Cornfield approximation is used. The Cornfield approximation takes substantially longer (a few seconds) to calculate than the Woolf approximation. This standard error is used in calculating a confidence interval for the odds ratio. (For matched case-control data, exact confidence intervals are always calculated.)

estandard, istandard, and standard(varname) request that within-stratum statistics are to be combined with external, internal, or user-specified weights to produce a standardized estimate. These options are mutually exclusive and can only be used when by() is also specified. (When by() is specified without one of these options, Mantel-Haenszel weights are used.)

    estandard external weights are the person-time for the unexposed (ir), the total number of unexposed (cs), or the number of unexposed controls (cc).

    istandard internal weights are person-time for the exposed (ir), the total number of exposed (cs), or the number of exposed controls (cc). istandard can be used, among other things, to produce standardized mortality ratios (SMRs).

    standard(varname) allows user-specified weights. varname must contain a constant within stratum and be nonnegative. The scale of varname is irrelevant.

ird may be used only with estandard, istandard, or standard(); it requests ir calculate the standardized incidence rate difference rather than the default incidence rate ratio.

rd may be used only with estandard, istandard, or standard(); it requests that cs calculate the standardized risk difference rather than the default risk ratio.

nocrude specifies that in a stratified analysis, the crude estimate -- the estimate one would obtain without regard to strata -- not be displayed. nocrude is relevant only if by() is also specified.

pool specifies that in a stratified analysis, the directly pooled estimate should also be displayed. The pooled estimate is a weighted average of the stratum-specific estimates using inverse-variance weights. pool is relevant only if by() is also specified.

nohet specifies that a chi-squared test for heterogeneity not be included in the output of a stratified analysis. This tests whether the exposure effect is the same across strata and can be performed for any pooled estimate -- directly pooled or Mantel-Haenszel. nohet is relevant only if by() is also specified.

Examples: incidence rate data
-----------------------------

The table for incidence rate data is

                   Exposed    Unexposed
      ------------+---------------------
      Cases       |   a           b
      Person-time |   N1          N0

The basic syntax (ignoring options) for iri is "iri #a #b #N1 #N2". For example:

    . iri 41 15 28010 19017
    . iri 41 15 28010 19017, level(90)
    . iri 41 15 28010 19017, level(90) tb

The basic syntax (ignoring options) for ir is "ir case_var ex_var time_var". case_var contains the number of cases represented by an observation. ex_var contains 0 if the observation represents unexposed and nonzero (e.g., 1) if the observation represents exposed. time_var contains the exposure time (e.g., person-years) represented by the observation. ir obtains the table by summing across observations. Observations with missing values are not used.

    . list

            cases   exposed    time
      1.       20         1   14000
      2.       21         1   14010
      3.       15         0   19017

    . ir cases exposed time, level(90)
    (output omitted)

To obtain Mantel-Haenszel combined IRR:

    . list

           agegrp   deaths   exposed   pyears
      1.        1       14         1     1516
      2.        1       10         0     1701
      3.        2       76         1      949
      4.        2      121         0     2245

    . ir deaths exposed pyears, by(agegrp)

To obtain internally standardized IRR:

    . ir deaths exposed pyears, by(agegrp) istandard

To weight each group equally:

    . gen wgt=1
    . ir deaths exposed pyears, by(agegrp) standard(wgt)

Examples: cohort-study data
---------------------------

The table for cohort-study data is

                   Exposed    Unexposed
      ------------+---------------------
      Cases       |   a           b
      Noncases    |   c           d

The basic syntax (ignoring options) for csi is "csi #a #b #c #d". For example:

    . csi 7 12 9 2
    . csi 7 12 9 2, exact
    . csi 7 12 9 2, exact level(90) tb

The basic syntax (ignoring options) for cs is "cs case_var ex_var". case_var contains 1 if the observation represents a case and 0 if it represents a noncase. ex_var contains 0 if the observation represents unexposed and nonzero (e.g., 1) if it represents exposed. Frequency weights are allowed.

    . list

            case   exp   pop
      1.       0     0     2
      2.       0     1     9
      3.       1     0    12
      4.       1     1     2
      5.       1     1     5

    . cs case exp [freq=pop]
    (output omitted)

If "[freq=pop]" is not specified, each observation contributes 1.

Stratified tables work as with ir. To obtain the Mantel-Haenszel combined risk ratio:

    . cs case exposed [freq=pop], by(age)

To obtain internally standardized risk ratio:

    . cs case exposed [freq=pop], by(age) istandard

To obtain externally standardized risk ratio:

    . cs case exposed [freq=pop], by(age) estandard

To weight each age group equally:

    . gen wgt=1
    . cs case exposed [freq=pop], by(age) standard(wgt)

Examples: case-control data
---------------------------

cc and cci work just like cs and csi. They differ in that they report the odds ratio rather than the risk ratio.

Examples: matched case-control data
-----------------------------------

mcc and mcci work just like cc and cci except that they report different statistics. Stratified tables are not allowed with mcc.

Also see
--------

     Manual:  [R] epitab
    On-line:  help for bitest, ci, clogit, dstdize, immed, logistic, nbreg, poisson, st, stcox, tabulate

11.14 Sample size and power calculations

- The Stata command sampsi performs sample size or power calculations for comparison of means or proportions
- Also see the free sample size software from Dupont and Plummer (“Other Links” on the course website Home page)
- For details, refer to sampsi in the Reference Manual or type
    help sampsi
  For convenience, the Help text is given below:
. help sampsi
-------------------------------------------------------------------------------
help for sampsi                                             (manual: [R] sampsi)
-------------------------------------------------------------------------------

Sample size and power determination
-----------------------------------

    sampsi #1 #2 [, alpha(#) power(#) n1(#) n2(#) ratio(#) sd1(#) sd2(#)
        onesample onesided ]

Description
-----------

sampsi estimates required sample size or power of tests for comparisons of means or proportions. If n1() or n2() is specified, sampsi computes power; otherwise, it computes sample size. sampsi is an immediate command; all of its arguments are numbers; see help immed.

sampsi computes sample size or power for four types of tests:

    1. Two-sample comparison of means. The postulated values of the means are #1 and #2. The postulated standard deviations are sd1() and sd2().

    2. One-sample comparison of mean to hypothesized value. Option onesample must be specified. The hypothesized value (null hypothesis) is #1. The postulated mean (alternative hypothesis) is #2. The postulated standard deviation is sd1().

    3. Two-sample comparison of proportions. The postulated values of the proportions are #1 and #2.

    4. One-sample comparison of proportion to hypothesized value. Option onesample must be specified. The hypothesized proportion (null hypothesis) is #1. The postulated proportion (alternative hypothesis) is #2.

Options
-------

alpha(#) specifies the significance level of the test; the default is alpha(.05). (More correctly, the default is 1-level/100 from set level, see help level.)

power(#) is power of the test. Default is power(.90).

n1(#) specifies the size of the first (or only) sample and n2(#) specifies the size of the second sample. If specified, sampsi reports the power calculation. If not specified, sampsi computes sample size.

ratio(#) is an alternative way to specify n2() in two-sample tests. In a two-sample test, if n2() is not specified, n2() is assumed to be n1()*ratio(). That is, ratio() = n2()/n1(). The default is ratio(1).

sd1(#) and sd2(#) are the standard deviations for comparison of means. If not specified, comparison of proportions is assumed. In two-sample cases, if only sd1() is specified, sd2() is assumed to equal sd1().

onesample indicates a one-sample test. The default is a two-sample test.

onesided indicates a one-sided test. The default is a two-sided test.

Examples
--------

1. Two-sample comparison of mean1 to mean2. Compute sample sizes with n2/n1 = 2:

    . sampsi 132.86 127.44, p(0.8) r(2) sd1(15.34) sd2(18.23)

   Compute power with n1 = n2, sd1 = sd2, and alpha = 0.01 one-sided:

    . sampsi 5.6 6.1, n1(100) sd1(1.5) a(0.01) onesided

2. One-sample comparison of mean to hypothesized value = 180. Compute sample size:

    . sampsi 180 211, sd(46) onesam

   One-sample comparison of mean to hypothesized value = 0. Compute power:

    . sampsi 0 -2.5, sd(4) n(25) onesam

3. Two-sample comparison of proportions. Compute sample size with n1 = n2 (i.e., ratio = 1, the default) and power = 0.9 (the default):

    . sampsi 0.25 0.4

   Compute power with n1 = 300 and ratio = n2/n1 = 0.5:

    . sampsi 0.25 0.4, n1(300) r(0.5)

4. One-sample comparison of proportion to hypothesized value = 0.5. Compute sample size:

    . sampsi 0.5 0.75, power(0.8) onesample

   Compute power:

    . sampsi 0.5 0.6, n(200) onesam

Also see
--------

     Manual:  [R] sampsi
    On-line:  help for immed
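
As a rough cross-check on the sampsi help text above, the two-sample comparison of means can be worked by hand. The sketch below is not taken from the sampsi documentation: it assumes the usual normal-approximation formula for equal group sizes and a two-sided test, n per group = (z[1-alpha/2] + z[power])^2 * (sd1^2 + sd2^2) / (mean1 - mean2)^2. The means and SDs are borrowed from example 1, alpha and power are the documented defaults, and the scalar names (s_alpha, s_n, and so on) are made up for illustration.

```stata
* Hand calculation of sample size for a two-sample comparison of means.
* ASSUMPTION: normal-approximation formula, equal allocation (ratio = 1),
* two-sided test. Scalar names are hypothetical, chosen for this sketch.
scalar s_alpha = 0.05      // default significance level from the help text
scalar s_power = 0.90      // default power from the help text
scalar s_m1    = 132.86    // postulated mean, group 1 (from example 1)
scalar s_m2    = 127.44    // postulated mean, group 2 (from example 1)
scalar s_sd1   = 15.34     // postulated SD, group 1
scalar s_sd2   = 18.23     // postulated SD, group 2

* Standard normal quantiles for the significance level and the power
scalar s_za = invnormal(1 - s_alpha/2)
scalar s_zb = invnormal(s_power)

* n per group = (za + zb)^2 * (sd1^2 + sd2^2) / (m1 - m2)^2
scalar s_n = (s_za + s_zb)^2 * (s_sd1^2 + s_sd2^2) / (s_m1 - s_m2)^2
display "n per group, rounded up: " ceil(s_n)

* The same inputs passed to sampsi, relying on its defaults, for comparison
sampsi 132.86 127.44, sd1(15.34) sd2(18.23)
```

If the hand calculation and the sampsi output disagree, the likeliest cause is that one of the assumptions above (equal allocation, two-sided test, default alpha or power) does not match the options actually in effect, for example a changed set level.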