Class 1 - Introduction; Overview of Stata -- LECTURE NOTES
Contents
1. Topics
2. Syllabus
3. Course objectives
4. Course organization
    4.1 Web site
    4.2 Userid and password
    4.3 Grading
    4.4 Data analysis project
5. Stata statistical package
    5.1 Introduction
    5.2 Flavors of Stata
    5.3 Requesting more memory for Stata
    5.4 On-line help
    5.5 Resources for learning about Stata
    5.6 Stata software pricing
    5.7 Customizing Stata
    5.8 Keeping Stata up-to-date
    5.9 Datasets
    5.10 Stata commands
    5.11 How to re-issue commands
    5.12 Program files - do files
    5.13 A special do-file – profile.do
    5.14 How to start Stata and set the working directory
    5.15 Keeping a log of your work
    5.16 Getting data into Stata
    5.17 Stata tutorial on data input
    5.18 Saving a Stata dataset
    5.19 Loading a Stata dataset
6. Stata programs – “do-files”
    6.1 What are and why use do-files
    6.2 “Hello Mom” program
    6.3 Start Stata do-file editor
    6.4 Edit and re-run “do” Program
    6.5 Another program
7. Using Stata to create “do” files
8. Stat/Transfer for importing/exporting data
9. Example 1: exploratory analysis of data from Altman’s Exercise 3-1
    9.1 Listing of data file
    9.2 Analysis Plan
    9.3 Box-Cox transform
    9.4 Techniques Illustrated
    9.5 Log Showing Commands and Output
10. Example 2: input and display of data from Altman’s exercise 3-2
    10.1 Source data from Altman
    10.2 Raw data: text file on disk
    10.3 Analysis plan
    10.4 Stata log
11. Common data analysis applications
    11.1 Descriptive statistics
    11.2 Stem-and-leaf charts
    11.3 Boxplots
    11.4 Confidence interval for a mean
    11.5 Confidence interval for a proportion
    11.6 Student’s t-test
    11.7 Test for binomial proportions
    11.8 Correlation
    11.9 Simple linear regression
    11.10 Analysis of variance
    11.11 Multiple linear regression
    11.12 Multiple logistic regression
    11.13 Epidemiologic calculations - epitab
    11.14 Sample size and power calculations

Biostatistics 624 © 2011 by JHU Biostatistics Dept.
Sun, 27 Mar 2011 (6:47p)
CLASS 1 - 1

1. Topics
! Outline course
! Overview of Stata
! Handouts
— Website and Schedule
— Lecture Notes #1
— e-Quiz #1 (due Fri, 8 Apr 2011)

2. Syllabus
! Multiple regression models:
— Linear
— Logistic
— Conditional logistic (case-control studies)
— Log-linear (Poisson) for counts & rates
— Log-linear for contingency tables
— Cox proportional hazards
! Longitudinal data analysis (repeated measures), analysis of
clustered data
! Random effects/mixed effects/multilevel models
! Model checking: analysis of residuals, measures of
leverage and influence
! Special topics: methods for missing data; reliability, inter-rater
agreement, diagnostic tests, reference intervals,
sample size, regression for survey samples

3. Course objectives
! Students who master the course contents will be able to:
— Frame a scientific question about the dependence of
a continuous, binary, count, or time-to-event
response on explanatory variables in terms of a
linear, logistic, log-linear, or survival regression
model whose parameters represent quantities of
scientific interest
— Design a tabular or graphical display of a dataset that
makes apparent the association between
explanatory variables and the response
— Choose a specific linear, logistic, log-linear, or
survival regression model appropriate to address a
scientific question and correctly interpret the
meaning of its parameters
— Appreciate that the interpretation of a particular
multiple regression coefficient depends on which
other explanatory variables are in the model
— Estimate the unknown coefficients and their standard
errors using maximum (or partial) likelihood and
perform tests of relevant null hypotheses about the
association with the response of particular subsets
of explanatory variables
— Check whether a model fits the data well; identify
ways to improve a model when necessary
— Use several models for the analysis of a dataset to
effectively answer the main scientific questions
— Understand how longitudinal data differ from
cross-sectional data and why special regression methods
are sometimes needed for their analysis
— Summarize in a table the results of linear, logistic,
log-linear, and survival regressions and write a
description of the statistical methods, results, and
main findings for a scientific report
— Perform data management, including input, editing,
and merging of datasets, necessary to analyze data
in Stata or equivalent statistical software
— Complete a data analysis project, including data
analysis and a written summary in the form of a
scientific paper

4. Course organization
! The course contents, schedule, and procedures are
summarized in course website pages:
— “Home” page: organizational details
— “Schedule” page: classes, e-quizzes, exam, project

4.1 Web site
! Web site URL:
    http://biostat.jhsph.edu/courses/bio624/

4.2 Userid and password
! Some parts of the course site require a Userid and
Password, which are
    Userid:   bio624
    Password: theedge

4.3 Grading
    20%  e-quizzes (5 of these)
    50%  Data analysis project
          1%  Preliminary abstract (must be on time)
         49%  Completed project
    30%  Examination (in-class; required for grade of A,
         otherwise optional)

4.4 Data analysis project
! Conduct an analysis to address a scientific topic using
appropriate statistical methods
— Students must identify topics and datasets
independently, i.e., topics and datasets will not be
assigned or provided
— The analysis should involve regression modeling with
at least two explanatory variables
— The dataset and analysis should address a public
health topic, with “public health” interpreted broadly
— Typically, datasets will have between 100 and
100,000 observations; however, larger or smaller
datasets may also be appropriate - ask if in doubt
— Datasets with fewer than 50 observations are
discouraged, but not prohibited
— IMPORTANT: Conduct the final analysis and write
the final report INDEPENDENTLY
— However, CONSULTING/COLLABORATING with
instructors, TAs, students or others about the data
or analysis IS ENCOURAGED
— It is also OK to share datasets, as long as the final
analysis (do-file), tables, and report are done
INDEPENDENTLY
! Prepare a report summarizing your findings in the form of a
mini scientific paper in the following format:
    0. Title
    1. Abstract (structured)
    2. Introduction
    3. Methods (including sample size considerations)
    4. Results (including at least one figure and one table)
    5. Discussion
    6. Appropriate other tables, figures, etc
    7. Appendices (as applicable)
        a. Variable list
        b. Stata do-file(s) - REQUIRED
        c. Stata logs and graphs with enough results to
           confirm statements in the paper
        d. Model checks (residuals, influential points)
        e. Sensitivity analyses (with/without influential
           points, etc)
        f. Step-wise variable selection
        g. Non-linearity checks
        h. Collinearity assessment
        i. Interaction assessment
        j. Confounding -- compare adjusted and unadjusted
           models
        k. Likelihood testing or F-tests for nested models

4.4 Data analysis project (cont'd)
! Possible sources for datasets:
— An important part of the project is to identify and gain
access to an appropriate dataset
— The best dataset is one that you are familiar with from
past work that you can use to address questions
that have not been addressed before
— Next best is a dataset from an advisor or colleague,
ideally one whose subject matter is of interest to
you
— It is OK to use datasets from other classes or the
MPH capstone project if they include enough
material to support a regression analysis; if in
doubt, ask an instructor from this class
— Online datasets. There are numerous datasets online
that could be used for a project. Some links to
possible sources for datasets are posted on the
course website (“Other links” on the home page):
    http://www.biostat.jhsph.edu/courses/bio624/misc/datasets.htm
— Government and institutional websites (a few are
listed below) contain an enormous amount of data but
will require some exploration to find downloadable
raw data suitable for analysis:
    www.fedstats.gov    FEDSTATS (federal statistics locator)
    www.cdc.gov         Centers for Disease Control, including the
                        National Center for Health Statistics
    www.cdc.gov/nchs/datawh/ftpserv/ftpdata/ftpdata.htm
                        NCHS public use data files and documentation
    www.census.gov      US Census Bureau
    www.who.ch          World Health Organization
— Statistical data warehouse with library of data and
data stories (i.e., documentation): www.stat.cmu.edu
(click DASL under Related Links)
    If you decide to use one of these datasets,
    you must consult source paper(s) for
    the dataset and attach with the supporting
    materials for your project report
— Some textbooks have collections of datasets that may
be suitable for further analysis
    Again, if you decide to use one of these
    datasets, make sure to consult source
    paper(s) for the dataset and attach with
    the supporting materials for your project
    report

    LC Hamilton, Statistics with Stata
    www.stata.com/bookstore/swsdl.html

    Duxbury publishing website - site contains datasets
    from health statistics textbooks: Click “Data Library”:
    http://www.thomsonedu.com/statistics/discipline_content/dataLibrary.html

    Hosmer and Lemeshow: Applied Survival Analysis:
    ftp://ftp.wiley.com/public/sci_tech_med/survival/

    Hosmer and Lemeshow: Applied Logistic
    Regression Analysis: Datasets are
    contained in the University of Massachusetts
    Datasets Archive, which contains links to other
    data resources (make sure to type the URL
    exactly as given below and then scroll down to
    the list of datasets by type of analysis - DO
    NOT USE the low birthweight dataset)
    http://www-unix.oit.umass.edu/~statdata/statdata/

    Moore and McCabe: Introduction to the Practice of
    Statistics (IPS), arguably the best introductory statistics
    text available. The applets help master statistical
    concepts. The datasets will require finding the source
    papers
    http://www.whfreeman.com/ips/

    Emory Biostatistics Dept excellent list of online
    databases
    http://www.sph.emory.edu/bios/bioslist.html#database

5. Stata statistical package

5.1 Introduction
! Stata, according to its authors, is used for:
— Managing data
— Analyzing data
— Graphing data
! Stata offers a common interface across different
computers and operating systems: DOS, Windows,
Macintosh, Unix, and others; files created on one
system may be used on another without any conversion
! The Stata interface is command-driven: “type a little,
get a little.”
! But commands can be a pain at times, so Stata offers a
menu-based interface
! Stata is very fast, due mainly to storage of datasets in
memory during processing (as opposed to disk
processing). Graphics are not so fast!
! Stata is capable of processing a large variety of datasets
with the sole restriction that the dataset must fit into
available computer memory. This restriction rules out
really large datasets such as Medicare or other health
information systems.
! Data integrity: Stata works on a copy of your dataset in
memory, making it “safe for interactive use.” You can still
destroy your data by explicitly saving over it.
    Tip: always make copies of your key datasets before
    data handling activities that involve saving
    results. Note that analysis activities are “safe”
    with very little risk of harm to your data. Data
    management activities are “risky.”
! Stata is case-sensitive: The name “Myfile” is different from
“myfile”; when in doubt, use lower case
! Stata is programmable: many parts of Stata are written
in the Stata programming language. This language can
be used to generate, in principle, any statistical analysis
whether or not it is explicitly part of Stata (see “do” and
“ado” files in the Manual)
! Stata has a very large and active on-line users group.
Members meet via the Internet using a “listserv” e-mail
system. Stata is continually updated and many updates
come from users. You may submit questions to the
“listserv” -- your questions go to all members of the
“listserv” – currently 25 questions per day are submitted
! The Stata website (www.stata.com) has a good Support
section, especially the FAQs
! Stata’s e-mail based user support is very responsive and
helpful. Remember to provide your serial number in the
e-mail along with your question

5.2 Flavors of Stata
! Stata 11 was released in 2009
— Major revisions occur about every 3 years
— Menus for nearly all commands
— Vastly improved graphics
— Enhancements to statistics, especially survival
analysis
— We will use Stata 11 in this course
— We will try to accommodate Macintosh users, but
some programs may not work with Macs
— Macintosh users: see notes under “Other Links” on
the home page:
    http://www.biostat.jhsph.edu/~courses/bio624
! Stata comes in four forms:
— Stata IC (Intercooled - we use this)
— Small Stata - not for this course
— Stata/SE (Special Edition “super-size”)
— Stata/MP (Multiple processors)
! Stata/SE
— Can analyze datasets with as many as 32,767
variables, and the only limit on observations is the
amount of RAM on your computer
— Maximum length of a string variable is 244 characters
— Matrices may be up to 11,000 x 11,000
! Intercooled (IC) Stata
— Can analyze datasets with as many as 2,047
variables, and the only limit on observations is the
amount of RAM on your computer
— Maximum length of a string variable is 80 characters
— Matrices may be up to 800 x 800
— Computer should have at least 32 megabytes of RAM

5.3 Requesting more memory for Stata
! By default, Intercooled Stata starts with 1 megabyte of
memory for datasets and work space. This can be
increased in one of 2 ways:
— Change memory:
    To change from 1 megabyte to 800 megabytes,
    give the following command:
        set memory 800m
    To make the change permanent every time you
    start Stata,
        set memory 800m , permanently
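
The memory request in section 5.3 can be tried as follows. This is a minimal sketch, assuming Stata 11 or earlier and a machine with at least 800 megabytes of free RAM; to the best of our knowledge, Stata 12 and later manage memory automatically and ignore the set memory command.

```stata
* Sketch only: assumes Stata 11 or earlier.
* set memory cannot be changed while a dataset is in memory,
* so drop any data first (save it beforehand if you need it).
clear
set memory 800m
* Report the current memory settings to confirm the change.
query memory
```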

5.4 On-line help
! Stata has lots of on-line help available -- all sections of the
written documentation are on-line in “abbreviated” form
(sometimes too abbreviated, especially for statistical
techniques)
! A good way to access on-line help is via the Help pull-down
menu - portal to all Stata Help including the complete set
of manuals in well-indexed PDF format.
! If you know the name of the command, you can access
online help via the help command. For example, to get
help for the summarize command:
    help summarize
    Note, upper right: dialog: summarize
    – Nearly every Stata command has a dialog
    screen to construct the command
    Note: [R] summarize -- Summary statistics
    - Nearly every Stata command has an [R] link
    to the PDF Documentation entry
! If you want to look up a topic, use the “findit” command,
which searches help files as well as internet resources at
Stata. The results are hyperlinked for easy access.
For example, to get information on “logistic
regression”:
    findit logistic regression

5.5 Resources for learning about Stata
! The primary documentation now spans 5,000+ pages. The
main components are the Reference Manual, the
User’s Guide, and the Graphics Manual. While
somewhat intimidating and irritating, these are now
included in a PDF - a necessity for “serious” users of
Stata
! Introductory materials (may be purchased using the Stata
website):
— Statistics with Stata by LC Hamilton: the best
book on Stata
! The Stata Journal is a refereed journal and is published
quarterly with articles about statistics, data analysis,
teaching methods, and effective use of Stata’s language
! Net courses on Stata. These range in length from a few to
12 weeks. They are done via e-mail. There is a charge for
the courses.

5.6 Stata software pricing
! Prices vary for academic institutions, businesses, and
students. Prices also depend on whether the system will
be used on a network and how many users there will be
! Manuals are purchased separately - some are available in
the JHMI bookstore
! Subscriptions to the Stata Journal, which comes in both
hard copy and PDF format, also cost extra
! Unlike some other statistical packages such as SAS, Stata
has no annual renewal fee, and it offers regular free
updates containing fixes and extensions
! The Stata web site, www.stata.com, has the latest prices
and information on how to purchase items
! BSPH has a GRADPLAN for purchasing the latest version
of Stata by students. Online ordering is at
www.stata.com/gpdirect

5.7 Customizing Stata
! Changing the size and fonts for Stata windows -- to
improve readability
— From the Edit menu, select:
    Preferences / Manage Preferences / Load
    Preferences / Maximized Window Settings
    ... Make font changes, etc. to taste
— Demonstrate changing the font and font size by using
the control button at the upper left of each window;
the Results window is the most important one
to change
    1. Click the control button and select Font
    2. Select a fixed-space font -- one of the larger
    Stata fonts or fixedsys are good choices
    3. Make sure the font size is at least 9
    4. IMPORTANT – save the windowing preferences
    or the changes disappear:
        Preferences / Manage Preferences / Load
        Preferences / New Preferences Set / YOUR
        INITIALS

5.8 Keeping Stata up-to-date
! MAKE SURE your Stata is up-to-date:
— Updates are free
— Fixes and extends Stata
! The current version of Stata is updated frequently, about
every two weeks. Updates are free. To see what
version of Stata you are using, type the following
commands:
    about
    query born
! To see if you need an update (you must be connected to
the internet), either use the Help menu or type the
command:
    update query
! This will advise you to do one of the following:
    1. Do nothing, all files up to date
    2. Update both the executable and ado files
    Click:
        update all
    3. Update only the executable
    Click:
        update executable
    4. Update only the ado files
    Click:
        update ado
! The new ado files are installed and ready to use as soon as
the download is completed
! One extra step is required to install a new executable:
    Click:
        update swap
! After installing an update, you can find out what has been
added or changed by typing:
    help whatsnew

5.9 Datasets
! In Stata, “Data” are a rectangular table of numbers and
character strings
— Each row is an “observation” on all the variables
— Each column contains all the observations for a given
variable
— Variables (columns) are represented by 8-character
names
— Observations (rows) are numbered from 1 to _N
— Schematic on how data are stored in Stata

                        Columns = variable names
                        var1   var2   var3   ...   varn
    Rows =          1
    observations    2
                    3      cell(i,j) = data value for variable j
                    ...                on observation i
                    N

— Stata gives the following simple example of “Data”

          Var1   Var2   Name
    1.       1      2   Bill
    2.       3      4   Mary
    3.       5      6   Pat
    4.       7      8   Roger
    5.       9     10   Sean

! In Stata, a “Dataset” is “Data” plus labels, formats, notes,
and characteristics

5.10 Stata commands
! There are 200+ commands in Stata, many of which are
commands to obtain specific statistical analyses
! An early User’s Guide lists 37 commands that “everyone
should know” by function:
— Getting on-line help
    lookup, help, (and pull down Help menu)
— Operating system interface
    pwd, cd
— Using and saving data from disk
    use, save
    append, merge
    compress
— Inputting data into Stata
    input
    edit
    infile
    infix
    insheet
— Basic data reporting
    describe
    codebook
    list
    browse
    count
    inspect
    table
    tabulate
— Data manipulation
    generate, replace
    recode
    egen
    rename
    drop, keep
    sort
    encode, decode
    order
    by
    reshape
— Keeping track of your work
    log
    notes
— Convenience
    display
! Newer commands worth noting
— Handling subsets: define/analyze summary statistics
    collapse
    contract
    statsby
— Tabulation - more compact results than tabulate
or summarize
    table
    tabstat
    tab_chi  (use findit tab_chi for install/help)

5.11 How to re-issue commands
! Stata stores a long list of the commands you issue in the
Review window
! These commands can be accessed and re-issued – VERY
useful for correcting errors without re-typing the whole
command
    To retrieve commands, use either:
        Page Up/Page Down
    or
        Click the command in the Review window

5.12 Program files - do files
! “Do-files” contain a collection of Stata statements that
perform a variety of tasks – called a Stata program
! Do-files will be used extensively in this course and by
experienced Stata practitioners
! Do-files allow you to document your work by making it
possible to exactly reproduce key analyses – a step
towards “reproducible research”

5.13 A special do-file – profile.do
! When Stata begins, it looks for a file named profile.do ,
containing commands that are to be executed as Stata
starts
! In particular, Stata looks for the profile.do file in c:\data,
among other places, so you can execute a set of
commands every time you start Stata by placing them in
a text file named profile.do , which you store in c:\data
! The profile.do file recommended for this course is as
follows and can be downloaded from various places on
the e-Quizzes page on the course website:
    * profile.do for starting Stata
    * Place in C:\DATA or any working folder containing your files
    set memory 750m
    set linesize 75
    set more off

5.14 How to start Stata and set the working directory
! The “working directory” in Stata is the folder where Stata
looks for data and program files. By default, the working
directory is
    c:\data
! When you start Stata from the Stata icon, the working
directory is set to the default:
    c:\data
! You can change the working directory to the folder
containing your files:
    File / Change Working Directory
    ... Browse to folder
! Or, you can change the working directory by starting Stata
by double-clicking a dataset or program (do-file) in the
folder containing the files related to your chosen project
– most prefer this method!
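
Several of the commands from sections 5.10 to 5.14 can be tied together in a short do-file. The following is a minimal sketch rather than a course program: it assumes the auto dataset that ships with Stata (loaded with sysuse), and the file name class1_sketch.do and the new variable names weight_kg, rep3, and mean_mpg are hypothetical.

```stata
* class1_sketch.do (hypothetical file name)
* Run it from the working directory with:  do class1_sketch

* Show the current working directory
pwd

* Load the example dataset shipped with Stata and describe it
sysuse auto, clear
describe

* Data manipulation: new variable, recoded copy, group mean
generate weight_kg = weight * 0.4536
recode rep78 (1 2 = 1) (3 = 2) (4 5 = 3), generate(rep3)
egen mean_mpg = mean(mpg), by(foreign)

* by requires data sorted on the grouping variable
sort foreign mpg
by foreign: summarize mpg price

* Compact table of summary statistics by group
tabstat mpg price, by(foreign) statistics(n mean sd)
```

Keeping the steps in a do-file rather than typing them interactively means the whole sequence can be re-run after any correction.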
5.15 Keeping a log of your work
! For documentation of your work, you should keep log files,
which are transcripts of what appears in a Stata session
– the log command or the Log button on the toolbar
are used to manage logs
! These logs can be kept in either of two formats
    text  (recommended – very easy to import
          into word processors)
    or
    smcl  (a formatted log that preserves
          hyperlinks, fonts and colors)
! You can translate from one format to the other:
    translate mylog.smcl mylog.log
! You would usually store the log(s) in the same folder with
your data files related to your work
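
A logging session along the lines described above might look like the following. This is a minimal sketch: mylog is a placeholder name, the auto dataset shipped with Stata stands in for your own data, and it assumes no other log is already open.

```stata
* Sketch only: mylog is a placeholder log name.
* Open a formatted (smcl) log, replacing any earlier copy
log using mylog.smcl, replace

* Everything shown in the Results window from here on is logged
sysuse auto, clear
summarize mpg weight price

* Close the log, then translate it to plain text
log close
translate mylog.smcl mylog.log, replace
```

To write a plain text log directly, the first command could instead be log using mylog.log, text replace.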
5.16 Getting data into Stata
! The easiest way to enter a small amount of data into Stata
is with the edit command. This is an interactive,
spreadsheet-like process that is very intuitive (demonstrate)
! If the data are stored in a file on disk and have spaces
between each variable, use infile as we have done in
the example below
! Files with more complicated formats such as variable items
with no spaces between them or character strings with
embedded blanks, require more complicated input via
infile or infix with a data dictionary — details are in
the Reference Manual, User’s Guide and in on-line Help.
By the way, Stata advises against the use of the data
dictionary approach since there are other, easier ways to
do it
5.17 Stata tutorial on data input
! In addition to the resources mentioned above, there is an
old tutorial on data input -- still applies to Stata:
In this tutorial we show you how to enter your data into Stata.

You can enter your data        by using
--------------------------     --------------------------------------
directly from the keyboard     edit (Stata for Windows or Macintosh)
                               input (all versions of Stata)
indirectly from a file         insheet
                               infile
                               infix
                               a transfer program
--------------------------     --------------------------------------
Then you save your data        by using save
------------------------------------------------------------------------------
edit is the easiest way to enter a small amount of data. You type

. clear       (to drop any data in memory)
. edit        (to enter the spreadsheet editor)

Only Stata for Windows and Stata for Macintosh users can use edit. We are
not going to demonstrate it here. See the Getting Started manual or just
try it. input is available on all versions of Stata:
------------------------------------------------------------------------------
. clear
. input id mpg weight price

            id        mpg     weight      price
  1. 1 22 2930 4099
  2. 2 17 3350 4749
  3. 3 22 2640 3799
  4. 4 20 3250 4816
  5. 5 15 4080 7827
  6. end
------------------------------------------------------------------------------
input continues to accept observations until you type 'end'. Once you have
some data in memory, typing input by itself adds new observations:
------------------------------------------------------------------------------
. input

            id        mpg     weight      price
  6. 6 26 2230 4453
  7. end
------------------------------------------------------------------------------
Another way to enter this data would be to type it into a wordprocessor or an
editor, save it in a file, and then read the file. We have such a file:
------------------------------------------------------------------------------
. type "h:\stata\auto1.raw"
make, mpg,weight, price
AMC Concord, 22, 2930, 4099
AMC Pacer, 17, 3350, 4749
AMC Spirit, 22, 2640, 3799
Buick Century, 20, 3250, 4816
Buick Electra, 15,4080, 7827
------------------------------------------------------------------------------
Our file has the variable names at the top (that is not required) and we used
commas to separate values one from the other. To read this, we can type:
------------------------------------------------------------------------------
. clear
. insheet using "h:\stata\auto1.raw"
(4 vars, 5 obs)
. list

              make   mpg   weight   price
  1.   AMC Concord    22     2930    4099
  2.     AMC Pacer    17     3350    4749
  3.    AMC Spirit    22     2640    3799
  4. Buick Century    20     3250    4816
  5. Buick Electra    15     4080    7827
------------------------------------------------------------------------------
It's easy. insheet will read comma- or tab-delimited files, so it will read
text files created by spreadsheet and database programs.
------------------------------------------------------------------------------
If your values are separated by blanks rather than commas or tabs, you use
infile to read it. Here is such a file:
------------------------------------------------------------------------------
. type "h:\stata\autodata.raw"
"AMC Concord" 22 2930
4099
"AMC Pacer" 17
3350 4749
"AMC Spirit" 22
2640 3799
"Buick Century"
20 3250 4816
"Buick Electra" 15 4080 7827
. clear
. infile str14 make mpg weight price using "h:\stata\autodata"
(5 observations read)
. list in ½
1.
2.
make
AMC Concord
AMC Pacer
mpg
22
17
weight
2930
3350
price
4099
4749
------------------------------------------------------------------------------Finally, if you have a formatted file, you use infile or infix to read it:
------------------------------------------------------------------------------. type "h:\stata\auto3.raw"
AMC Concord
2229304099
AMC Pacer
1733504749
AMC Spirit
2226403799
Buick Century
2032504816
Buick Electra
1540807827
. clear
. infix 1: str make 1-18 2: mpg 1-2 weight 3-6 price 7-11
> using "h:\stata\auto3.raw"
(5 observations read)
. list

              make   mpg   weight   price
  1.   AMC Concord    22     2930    4099
  2.     AMC Pacer    17     3350    4749
  3.    AMC Spirit    22     2640    3799
  4. Buick Century    20     3250    4816
  5. Buick Electra    15     4080    7827

Saving data
-----------
After you have entered data into Stata, you can save it. The command is:

    save filename

If you do not specify the extension for the filename, Stata assumes the
extension '.dta'. For instance, we could type 'save auto' to save this data.
It would be saved in the file auto.dta. The command to retrieve previously
saved data is:

    use filename [, clear]

Thus, the next time we want to use auto.dta, we could type 'use auto' or 'use
auto, clear'. Sometimes 'use auto' will work, but 'use auto, clear' will
always work. Stata stores data in memory. The clear option tells Stata that
it's okay to drop the data in memory in order to retrieve the new data.
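The difference between 'use auto' and 'use auto, clear' can be tried out with a short do-file fragment. This is a sketch under stated assumptions: it uses the auto example dataset that ships with Stata (loaded with sysuse) and a hypothetical file name, auto_copy.

```stata
* Sketch only: auto_copy is a hypothetical file name
sysuse auto, clear
save auto_copy, replace

* Change the data in memory so that it no longer matches the file on disk
replace mpg = mpg + 1 in 1

* Without clear, Stata refuses to discard the changed data;
* capture traps the error and _rc holds a nonzero return code
capture use auto_copy
display "return code without clear: " _rc

* With clear, the changed data are dropped and the saved file is loaded
use auto_copy, clear
describe, short
```

After the final use, the change to mpg is gone because the copy on disk was saved before the edit.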
5.18 Saving a Stata dataset
! To save the dataset in the current work space on disk, give
the command below along with the appropriate path to
the folder containing the file
! Command:
save blah.dta, replace
5.19 Loading a Stata dataset
! To load a saved dataset from disk into the work area
! Command:
use blah.dta, clear
6. Stata programs – “do-files”

6.1 What are and why use do-files
! “Do-files” contain a collection of Stata statements that
perform a variety of tasks – called a Stata program
! Always use do-files to make your work “reproducible” and
well-documented
! Note: you can enter commands interactively and then save
the commands into a do-file by right clicking anywhere in
the Review window: Select Save Review
Contents... and navigating to the folder where you
want to save the file
! For example, we include a do-file for each e-Quiz except
the first containing all the commands to carry out the
analyses: eq2.do, eq3.do, etc.
Demonstrate how to "run" eq1.do
! “Do-files” document your work
! “Do-files” permit reproducible analyses
! “Do-files” make re-running a series of commands very easy
– one step
! “Do-files” for particular tasks can be copied and modified to
perform similar tasks – “do-files” serve as templates for
future work
! See Stata User’s Guide, for full documentation on what
“do-files” can accomplish

6.2 “Hello Mom” program
! This program simply displays the message “Hello Mom” -
an easy way to try the do-file approach
! The name of the program file will be mom.do
! Store the program in a folder:
     My Documents\bio624

6.3 Start Stata do-file editor
! To create a program file:
     Click: Start
     Click: Stata icon
     Click: Do-editor icon (envelope)
     Note: You can also use NOTEPAD, WORDPAD or
     even WORD -- anything that allows files to be
     read and written in “text” format
! Type the following Stata command into the file:
     display "Hello Mom"
     ... Make sure you press [Enter] after typing the line
! Save the file:
     Click File / Save As
     Type: MyDocuments\bio624\mom.do
! Run the “do” file:
     do mom.do   (as a Stata command)
     or,
     Click: Do current file icon (in do-file editor)

6.4 Edit and re-run “do” Program
! Return to Do-file editor:
     Click mom.do on the Task Bar
! Make the fixes (change to “Hello Mother Dear”) and then
(IMPORTANT) save the file
     Click File / Save
! Re-run the program:
     Click Intercooled... on the Task Bar
     do mom.do
     or (as above),
     Click: Do current file icon (in do-file editor)
! Repeat the “Edit - Run” cycle until done or tired

6.5 Another program
! This program is a little more complicated – try it for fun
and practice in making do-files
! Open Stata by clicking profile.do in MyDocuments\bio624
! Input faculty IQ data and summarize it
! The name of the program will be blah.do
! The program is in folder:
     MyDocuments\bio624
! To create a program file:
     Click: File/New
     or start Stata do-file editor as shown above
! Type the following Stata commands into the do-file
editor to enter the data and generate the summary
statistics:

     * Turn off annoying – more – message
     set more off
     * Open log file on disk
     * Trick for automatically opening a log file in a do-file
     capture log close
     log using blah.log, replace
     input sno IQ
     1 138
     2 142
     3 136
     4 124
     5 158
     6 108
     7 116
     8 128
     9 125
     10 88
     end
     list
     summarize IQ , detail
     histogram IQ , bin(10) fraction norm
     graph export blah.wmf,replace
     log close

! Save the file:
     Click File / Save As
     Type: MyDocuments\bio624\blah.do
! Change the working directory to the folder containing the
“do” program file, if needed -- the current working
directory is shown on the lower left in the Status Bar:
     cd "MyDocuments\bio624"
     (Always change to the working directory, which will
     contain related datasets, graphs, etc.)
! Run the “do” file:
     do blah.do
     or,
     Click: Do current file
! Edit + re-run “do” Program
! Return to do-file editor:
     Click blah.do on the Task Bar
! Make the fixes and then save
     Click File / Save
! Re-run the program:
     Click Intercooled... on the Task Bar
     do blah.do
     or,
     Click Do current file

7. Using Stata to create “do” files
! A good way to make do-files is to enter the commands
interactively and then copy them to a do-file for further work:
     Drag mouse to select commands (or select all)
     Right click anywhere in the Review window
     Click Save All or Save Selected
     Paste into the do-file editor (or into Notepad or Wordpad)

8. Stat/Transfer for importing/exporting data
! Most often data are entered and managed using software other
than Stata. This might be done in a spreadsheet such as Excel, a
database such as Access or Oracle, or another statistical
package such as SAS or SPSS
! In many cases, you can Copy/Paste the data from the outside
source into the Stata Data Editor, which transfers the data in
simple cases
! If worse comes to worst, data may be transferred to Stata for
analysis by writing a space or comma delimited ASCII text file
to disk and then reading that into Stata using infile or infix
! The best option for translating the data into or from Stata
format is to use a “transfer program” such as Stat/Transfer,
available in the PC Labs on the 3rd floor
! DEMO: To make the transfer, start Stat/Transfer and specify the
input file and select its type, then select the output file and
select its type (Stata version). Note that you may also translate
a Stata dataset into any of the other supported file formats, i.e.,
you could translate a Stata dataset for further analysis using
SAS or SPSS, for example
— Example: translate the SAS dataset alt3-1.sd2 into a Stata
dataset named alt3-1.dta
Start Stat/Transfer: Start Button, Program, ... click the
Stat/Transfer icon
Click the About tab and verify the version is 5 or higher —
earlier versions of Stat/Transfer may not correctly transfer
SAS datasets
Select SAS for Windows/OS2 from the input File Type selection
box
Click Browse; locate and select the SAS file alt3-1.sd2
for the input File Specification box
Select Stata from the Output File Type selection box
Type alt3-1.dta in the output File Specification box
Click the Transfer button
... SAS dataset should be converted to Stata format
To test the transfer:
Start Stata and give the commands
use alt3-1.dta , clear
describe
! Using the clipboard to import datasets
— Some datasets, such as spreadsheets, can be “copied” to
the clipboard
— These can be “pasted” into the Stata Data Editor, which often is
a very quick way to transfer data into Stata
— Demonstrate transfer from Excel to Stata
— Data can be exported from Stata, using the clipboard by
reversing the process
9. Example 1: exploratory analysis of data from
Altman’s Exercise 3-1
! Data Source: The data comes from Exercise 3 on p.45 from the
well-written textbook Practical Statistics for Medical
Research (Chapman & Hall) by Douglas Altman
! Data Story: The data has to do with 65 patients with rheumatoid
arthritis, whether they experienced adverse drug reactions
(REAC) to sodium aurothiomalate (SA), and whether age,
dose, or an index (SI = sulphoxidation index) bear any
relationship to the adverse reactions
! Data sheet:
9.1 Listing of data file
! Below there is a listing of the contents of the file
alt3-1ex.dat,
which contains the raw data, one line (row) per patient
! The variables (columns) for each patient are as follows:
     Id Number                                  sno
     Reaction (1=Yes 2=No)                      react
     Age (years)                                age
     Dose (mg)                                  sadose
     Sulphoxidation Index (no units)            si
     Whether Index is censored (1=Yes 0=No)     censor

      1   2   44   1560    1.0   0
      2   2   65   1310    1.2   0
      3   2   58    850    1.2   0
      4   2   57   1250    1.7   0
      5   2   51    950    1.8   0
      6   2   64    850    1.8   0
      7   2   33   1200    1.9   0
      8   2   61   1390    2.0   0
      9   2   49   1450    2.3   0
     10   2   67   3300    2.8   0
     11   2   39   2760    2.8   0
     12   2   42    860    3.4   0
     13   2   35   1810    3.4   0
     14   2   31   1310    3.8   0
     15   2   37   1250    3.8   0
     16   2   43   1210    4.2   0
     17   2   39   1460    4.9   0
     18   2   53   2310    5.4   0
     19   2   44   1360    5.9   0
     20   2   41   1910    6.2   0
     21   2   72    910   12.0   0
     22   2   61   1410   18.8   0
     23   2   48   2460   47.0   0
     24   2   59   1350   70.0   0
     25   2   72    810   80.0   1
     26   2   59   1460   80.0   1
     27   2   71    760   80.0   1
     28   2   53    910   80.0   1
      1   1   53    360    2.0   0
      2   1   74   2010    2.0   0
      3   1   29   1390    2.0   0
      4   1   53    660    3.0   0
      5   1   67   1135    3.5   0
      6   1   67    510    5.3   0
      7   1   54    410    5.7   0
      8   1   51    910    6.5   0
      9   1   57    360   13.0   0
     10   1   62   1260   13.0   0
     11   1   51    560   13.9   0
     12   1   68   1135   14.7   0
     13   1   50   1410   15.4   0
     14   1   38   1110   15.7   0
     15   1   61    960   16.6   0
     16   1   59   1310   16.6   0
     17   1   68    910   16.6   0
     18   1   44   1235   22.0   0
     19   1   57   2950   22.3   0
     20   1   49    360   33.2   0
     21   1   49   1935   47.0   0
     22   1   63   1660   61.0   0
     23   1   29    435   65.0   0
     24   1   53    310   65.0   0
     25   1   53    310   80.0   1
     26   1   49    410   80.0   1
     27   1   42    690   80.0   1
     28   1   44    910   80.0   1
     29   1   59   1260   80.0   1
     30   1   51   1260   80.0   1
     31   1   46   1310   80.0   1
     32   1   46   1350   80.0   1
     33   1   41   1410   80.0   1
     34   1   39   1460   80.0   1
     35   1   62   1535   80.0   1
     36   1   49   1560   80.0   1
     37   1   53   2050   80.0   1

9.2 Analysis Plan
— Means, SDs , percentiles with summarize
— List data for checking with list
— Stem and Leafs for continuous variables using stem
— Scatterplot matrix to show bivariate relationships among
continuous variables using graph matrix
— Dot diagrams to show point distributions within groups using
dotplot
— Boxplots by group using graph box
— Shapiro-Wilk test for normal distribution using swilk
— Diagnostic plots for normal distribution using qnorm
— Pick transformation using the Box-Cox transformation:
boxcox
9.3 Box-Cox transform
! The Box-Cox transform is used to find a scale for the response
variable that is approximately normally distributed — does not
always work, but worth trying. Don’t apply this without applying
common sense to the result
! It can be used in a regression model to find a transformation that
makes the errors in the regression model approximately
normally distributed
! The transform represents a family of “power” transformations
commonly used in data analysis:
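The family is usually written in the standard Box-Cox form below. The symbol θ is chosen here to match the /theta parameter reported by Stata's boxcox command, and y is assumed to be strictly positive.

```latex
y^{(\theta)} =
\begin{cases}
\dfrac{y^{\theta} - 1}{\theta}, & \theta \neq 0 \\[1ex]
\ln y, & \theta = 0
\end{cases}
```

With this form, θ = 1 leaves the scale essentially unchanged (a shift by one), θ = 0 is the log transformation, and θ = -1 is the reciprocal. These are the three values that boxcox tests in its likelihood-ratio table.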
! See boxcox in the Stata reference manual for more details and
examples
9.4 Techniques Illustrated
! Use of comment statements for documentation
! Clear Stata’s work space
! Change the working folder (directory) on disks from Stata
! Make folder from Stata to help organize your work
! Print results by sending them to a file on disk so they can be
incorporated into a word processor and printed
! Input free-format data from a data file on disk
! Label variables
! Label variable values
! List data
! Get summary statistics
! Get stem-and-leaf plots
! Get a scatterplot matrix
! Store Stata graphs on disk in “Windows metafile format” (.wmf)
for incorporation into word processing programs and printing
! Get dot diagrams
! Get boxplots
! Generate the Shapiro-Wilk statistic for testing normality
! Produce a quantile-quantile plot for assessing goodness of fit to a
normal distribution
! Use the Box-Cox transform to suggest a transformation to
normality
! NOTE: The do-file and data file are on the website as
alt3-1ex.do and alt3-1ex.dat
9.5 Log Showing Commands and Output
.
. * Turn off MORE feature
.
. set more off
.
.
.
. * Input data
.
. infile sno react age sadose si censor using alt3-1ex.dat
(65 observations read)
.
.
.
. * Variable labels
. label variable sno    "Study No."
. label variable react  "Adverse Reaction"
. label variable age    "Age in years"
. label variable sadose "Dose of SA (mg)"
. label variable si     "Sulphoxidation Index"
.
.
.
. * Value labels
.
. label define reactlbl 1 "Yes" 2 "No"
.
. label values react reactlbl
.
.
.
.
. * Save Stata dataset
.
. save alt3-1ex.dta, replace
file alt3-1ex.dta saved
.
.
. * List data for checking
.
. list in 1/10

     +---------------------------------------------+
     | sno   react   age   sadose    si   censor   |
     |---------------------------------------------|
  1. |   1      No    44     1560     1        0   |
  2. |   2      No    65     1310   1.2        0   |
  3. |   3      No    58      850   1.2        0   |
  4. |   4      No    57     1250   1.7        0   |
  5. |   5      No    51      950   1.8        0   |
     |---------------------------------------------|
  6. |   6      No    64      850   1.8        0   |
  7. |   7      No    33     1200   1.9        0   |
  8. |   8      No    61     1390     2        0   |
  9. |   9      No    49     1450   2.3        0   |
 10. |  10      No    67     3300   2.8        0   |
     +---------------------------------------------+
.
.
.
. * Descriptive Statistics
.
. summarize , detail

                          Study No.
-------------------------------------------------------------
      Percentiles      Smallest
 1%            1              1
 5%            2              1
10%            4              2       Obs                  65
25%            9              2       Sum of Wgt.          65

50%           17                      Mean           17.06154
                        Largest       Std. Dev.      9.974776
75%           25             34
90%           31             35       Variance       99.49615
95%           34             36       Skewness       .1632394
99%           37             37       Kurtosis       2.000031

                      Adverse Reaction
-------------------------------------------------------------
      Percentiles      Smallest
 1%            1              1
 5%            1              1
10%            1              1       Obs                  65
25%            1              1       Sum of Wgt.          65

50%            1                      Mean           1.430769
                        Largest       Std. Dev.      .4990375
75%            2              2
90%            2              2       Variance       .2490385
95%            2              2       Skewness       .2796164
99%            2              2       Kurtosis       1.078185

                        Age in years
-------------------------------------------------------------
      Percentiles      Smallest
 1%           29             29
 5%           33             29
10%           38             31       Obs                  65
25%           44             33       Sum of Wgt.          65

50%           53                      Mean           52.12308
                        Largest       Std. Dev.      11.19641
75%           61             71
90%           67             72       Variance       125.3596
95%           71             72       Skewness      -.0659275
99%           74             74       Kurtosis       2.326933

                       Dose of SA (mg)
-------------------------------------------------------------
      Percentiles      Smallest
 1%          310            310
 5%          360            310
10%          410            360       Obs                  65
25%          860            360       Sum of Wgt.          65

50%         1260                      Mean           1249.538
                        Largest       Std. Dev.      622.3134
75%         1460           2460
90%         2010           2760       Variance         387274
95%         2460           2950       Skewness       .9572716
99%         3300           3300       Kurtosis       4.426923

                    Sulphoxidation Index
-------------------------------------------------------------
      Percentiles      Smallest
 1%            1              1
 5%          1.7            1.2
10%          1.9            1.2       Obs                  65
25%          3.4            1.7       Sum of Wgt.          65

50%         14.7                      Mean           31.54308
                        Largest       Std. Dev.       33.2201
75%           80             80
90%           80             80       Variance       1103.575
95%           80             80       Skewness       .6044778
99%           80             80       Kurtosis       1.543044

                           censor
-------------------------------------------------------------
      Percentiles      Smallest
 1%            0              0
 5%            0              0
10%            0              0       Obs                  65
25%            0              0       Sum of Wgt.          65

50%            0                      Mean           .2615385
                        Largest       Std. Dev.      .4428926
75%            1              1
90%            1              1       Variance       .1961538
95%            1              1       Skewness       1.085217
99%            1              1       Kurtosis       2.177696

.
.
.
. * Stem and leaf
. stem age
Stem-and-leaf plot for age (Age in years)

  2. | 99
  3* | 13
  3. | 578999
  4* | 112234444
  4. | 66899999
  5* | 0111133333334
  5. | 77789999
  6* | 1112234
  6. | 577788
  7* | 1224

. stem sadose
Stem-and-leaf plot for sadose (Dose of SA (mg))

  0*** | 310,310,360,360,360
  0*** | 410,410,435,510,560
  0*** | 660,690,760
  0*** | 810,850,850,860,910,910,910,910,910,950,960
  1*** | 110,135,135
  1*** | 200,210,235,250,250,260,260,260,310,310,310,310,350,350,360,390,390
  1*** | 410,410,410,450,460,460,460,535,560,560
  1*** | 660
  1*** | 810,910,935
  2*** | 010,050
  2*** | 310
  2*** | 460
  2*** | 760
  2*** | 950
  3*** |
  3*** | 300

. stem si
Stem-and-leaf plot for si (Sulphoxidation Index)
si rounded to nearest multiple of .1
plot in units of .1

  0** | 10,12,12,17,18,18,19,20,20,20,20,23,28,28,30,34,34,35,38,38,42,49
  0** | 53,54,57,59,62,65
  1** | 20,30,30,39,47
  1** | 54,57,66,66,66,88
  2** | 20,23
  2** |
  3** | 32
  3** |
  4** |
  4** | 70,70
  5** |
  5** |
  6** | 10
  6** | 50,50
  7** | 00
  7** |
  8** | 00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00

.
.
.
. * Scatterplots Matrix
. graph box age, over (react)
t1(AGE BOXPLOTS) t2(" ") l1(A
> GE) b1(REACTION)
(file alt3-1ex\boxplot1.gph saved)
.
. graph export alt3-1ex\scatmat.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-1ex\scatmat.wmf written in Windows Metafile format)
[Figure: SCATTERPLOT MATRIX; panels: Adverse Reaction, Age in years, Dose of SA (mg), Sulphoxidation Index]
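The command echoed in the log under the Scatterplots Matrix comment is a box-plot command, while the exported file is named scatmat.wmf. A scatterplot matrix of the four variables could be drawn along the following lines. This is a sketch and not the command from the original do-file; it assumes that alt3-1ex.dta, saved earlier in the log, is in the working directory, and the title text is simply copied from the figure.

```stata
* Sketch only: assumes alt3-1ex.dta is in the current working directory
use alt3-1ex.dta, clear

* One panel per pair of variables; half keeps only the lower triangle
graph matrix react age sadose si, half title("SCATTERPLOT MATRIX")
```

Dropping the half option draws both triangles, which matches the full matrix described by the figure residue more closely.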
.
.
. * Dot diagram
.
. sort react
.
. dotplot age , by (react) t1(AGE DOTPLOT) l1(AGE) b1(REACTION)
(file alt3-1ex\dotplot1.gph saved)
. graph export alt3-1ex\dotplot1.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-1ex\dotplot1.wmf written in Windows Metafile format)
[Figure: AGE DOTPLOT; Age in years (30 to 70) by Adverse Reaction (Yes, No)]
.
. dotplot sadose, by (react) t1(SA DOSE DOTPLOT) l1(SADOSE MG) b1(REACTION)
(file alt3-1ex\dotplot2.gph saved)
. graph export alt3-1ex\dotplot2.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-1ex\dotplot2.wmf written in Windows Metafile format)
[Figure: SA DOSE DOTPLOT; Dose of SA (mg) (0 to 4000) by Adverse Reaction (Yes, No)]
.
. dotplot si, by (react) t1(SI DOSE DOTPLOT) l1(SI) b1(REACTION)
(file alt3-1ex\dotplot3.gph saved)
. graph export alt3-1ex\dotplot3.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-1ex\dotplot3.wmf written in Windows Metafile format)
[Figure: SI DOSE DOTPLOT; Sulphoxidation Index (0 to 80) by Adverse Reaction (Yes, No)]
.
. * Letter values, outliers by reaction subgroup
.
. lv age if react==1 ,generate

 #     37                   Age in years
                 ------------------------------
 M     19        |            53              |      spread   pseudosigma
 F     10        |     46    52.5       59    |          13      10.05177
 E    5.5        |   41.5   53.25       65    |        23.5      10.80392
 D      3        |     38      53       68    |          30      10.23727
 C      2        |     29    48.5       68    |          39      11.47614
 B    1.5        |     29      50       71    |          42      11.27376
        1        |     29    51.5       74    |          45      10.79743
                 |                            |
                 |                            |     # below       # above
 inner fence     |   26.5             78.5    |           0             0
 outer fence     |      7               98    |           0             0

. list age if react==1 & ( (age >=( r(u_F) + 1.5*(r(u_F) - r(l_F)))) | (age <=( r(l_F) - 1.5*(r(u_F) - r(l_F)))) )
.
.
. lv age if react==2 ,generate

 #     28                   Age in years
                 ------------------------------
 M   14.5        |            52              |      spread   pseudosigma
 F    7.5        |   41.5   51.25       61    |        19.5      14.65586
 E      4        |     37      52       67    |          30      13.28402
 D    2.5        |     34   52.75     71.5    |        37.5      13.11905
 C    1.5        |     32      52       72    |          40      11.51282
        1        |     31    51.5       72    |          41      10.41174
                 |                            |
                 |                            |     # below       # above
 inner fence     |  12.25            90.25    |           0             0
 outer fence     |    -17            119.5    |           0             0

. list age if react==2 & ( (age >=( r(u_F) + 1.5*(r(u_F) - r(l_F)))) | (age <=( r(l_F) - 1.5*(r(u_F) - r(l_F)))) )
.
.
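The list commands above flag values beyond the inner fences, which lie 1.5 F-spreads outside the fourths returned by lv in r(l_F) and r(u_F). For age with react==1 the log gives fourths of 46 and 59, so the fences are 46 - 1.5*13 = 26.5 and 59 + 1.5*13 = 78.5. The sketch below repeats the calculation on a small made-up variable x so that it runs without the course data; the scalar names are illustrative.

```stata
* Sketch only: x is made-up data, and fspread, lo_in, hi_in are hypothetical names
clear
input x
1
2
3
4
5
6
7
8
9
100
end

* lv leaves the lower and upper fourths in r(l_F) and r(u_F)
quietly lv x
scalar fspread = r(u_F) - r(l_F)
scalar lo_in   = r(l_F) - 1.5*fspread
scalar hi_in   = r(u_F) + 1.5*fspread
display "inner fences: " lo_in " and " hi_in

* List the observations at or beyond the inner fences
list x if x <= lo_in | x >= hi_in
```

Storing the fences in scalars before calling list keeps the if condition short and avoids relying on r() results surviving later commands.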
. lv sadose if react==1 ,generate

 #     37                  Dose of SA (mg)
                 ------------------------------
 M     19        |          1135              |      spread   pseudosigma
 F     10        |    560     985     1410    |         850      657.2313
 E    5.5        |    385   997.5     1610    |        1225       563.183
 D      3        |    360    1185     2010    |        1650      563.0501
 C      2        |    310    1180     2050    |        1740      512.0124
 B    1.5        |    310    1405     2500    |        2190      587.8463
        1        |    310    1630     2950    |        2640      633.4493
                 |                            |
                 |                            |     # below       # above
 inner fence     |   -715             2685    |           0             1
 outer fence     |  -1990             3960    |           0             0

. list sadose if react==1 & ( (sadose >=( r(u_F) + 1.5*(r(u_F) - r(l_F)))) | (sadose <=( r(l_F) - 1.5*(r(u_F) - r(l_F)))) )

     +--------+
     | sadose |
     |--------|
 37. |   2950 |
     +--------+
.
. lv sadose if react==2 , generate

 #     28                  Dose of SA (mg)
                 ------------------------------
 M   14.5        |          1330              |      spread   pseudosigma
 F    7.5        |    930    1220     1510    |         580      435.9179
 E      4        |    850    1580     2310    |        1460       646.489
 D    2.5        |    830    1720     2610    |        1780      622.7175
 C    1.5        |    785  1907.5     3030    |        2245       646.157
        1        |    760    2030     3300    |        2540      645.0197
                 |                            |
                 |                            |     # below       # above
 inner fence     |     60             2380    |           0             3
 outer fence     |   -810             3250    |           0             1

. list sadose if react==2 & ( (sadose >=( r(u_F) + 1.5*(r(u_F) - r(l_F)))) | (sadose <=( r(l_F) - 1.5*(r(u_F) - r(l_F)))) )

     +--------+
     | sadose |
     |--------|
 26. |   2460 |
 27. |   2760 |
 28. |   3300 |
     +--------+
.
.
. lv si if react==1 ,generate

 #     37               Sulphoxidation Index
                 ------------------------------
 M     19        |          22.3              |      spread   pseudosigma
 F     10        |     13    46.5       80    |          67      51.80529
 E    5.5        |    4.4    42.2       80    |        75.6      34.75644
 D      3        |      2      41       80    |          78      26.61691
 C      2        |      2      41       80    |          78      22.95228
 B    1.5        |      2      41       80    |          78      20.93699
        1        |      2      41       80    |          78      18.71555
                 |                            |
                 |                            |     # below       # above
 inner fence     |  -87.5            180.5    |           0             0
 outer fence     |   -188              281    |           0             0

. list si if react==1 & ( (si >=( r(u_F) + 1.5*(r(u_F) - r(l_F)))) | (si <=( r(l_F) - 1.5*(r(u_F) - r(l_F)))) )
.
. lv si if react==2 ,generate

 #     28               Sulphoxidation Index
                 ------------------------------
 M   14.5        |           3.8              |      spread   pseudosigma
 F    7.5        |   1.95   8.675     15.4    |       13.45      10.10879
 E      4        |    1.7   40.85       80    |        78.3       34.6713
 D    2.5        |    1.2    40.6       80    |        78.8       27.5675
 C    1.5        |    1.1   40.55       80    |        78.9      22.70904
        1        |      1    40.5       80    |          79      20.06164
                 |                            |
                 |                            |     # below       # above
 inner fence     | -18.225          35.575    |           0             6
 outer fence     |  -38.4            55.75    |           0             5

. list si if react==2 & ( (si >=( r(u_F) + 1.5*(r(u_F) - r(l_F)))) | (si <=( r(l_F) - 1.5*(r(u_F) - r(l_F)))) )

     +----+
     | si |
     |----|
 23. | 47 |
 24. | 70 |
 25. | 80 |
 26. | 80 |
 27. | 80 |
     |----|
 28. | 80 |
     +----+
.
.
.
.
.
.
. * Boxplots
.
. sort react
.
. graph box age, over (react) t1(AGE BOXPLOTS) t2(" ") l1(AGE) b1(REACTION)
(file alt3-1ex\boxplot1.gph saved)
. graph export alt3-1ex\boxplot1.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-1ex\boxplot1.wmf written in Windows Metafile format)
[Figure: AGE BOXPLOTS; Age in years (30 to 70) by REACTION (Yes, No)]
.
. graph box sadose, over (react) t1(SA DOSE BOXPLOTS) t2(" ") l1(DOSE MG) b1(REACTION)
(file alt3-1ex\boxplot2.gph saved)
. graph export alt3-1ex\boxplot2.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-1ex\boxplot2.wmf written in Windows Metafile format)
[Figure: SA DOSE BOXPLOTS; Dose of SA (mg) (0 to 4,000) by REACTION (Yes, No)]
.
. graph box si, over (react) t1(SI DOSE BOXPLOTS) t2(" ") l1(SI) b1(REACTION)
. graph export alt3-1ex\boxplot3.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-1ex\boxplot3.wmf written in Windows Metafile format)
[Figure: SI DOSE BOXPLOTS; Sulphoxidation Index (0 to 80) by REACTION (Yes, No)]
.
.
. * Shapiro-Wilk Test for Normality
.
. swilk age sadose si

                   Shapiro-Wilk W test for normal data

    Variable |    Obs        W          V          z     Prob>z
-------------+-------------------------------------------------
         age |     65    0.98503      0.868     -0.307  0.62061
      sadose |     65    0.92756      4.199      3.107  0.00094
          si |     65    0.82921      9.901      4.964  0.00000
.
.
.
* Diagnostic Plot for Normal Distribution (Q-Q plot)
.
. qnorm age , grid b1(AGE Q-Q PLOT) l1(AGE)
. graph export alt3-1ex\qqplot1.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-1ex\qqplot1.wmf written in Windows Metafile format)
[Figure: AGE Q-Q PLOT; Age in years against Inverse Normal. Grid lines are 5, 10, 25, 50, 75, 90, and 95 percentiles]
. qnorm sadose , grid b1(SA DOSE Q-Q PLOT) l1(SA DOSE)
. graph export alt3-1ex\qqplot2.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-1ex\qqplot2.wmf written in Windows Metafile format)
[Figure: SA DOSE Q-Q PLOT; Dose of SA (mg) against Inverse Normal. Grid lines are 5, 10, 25, 50, 75, 90, and 95 percentiles]
.
. qnorm si , grid b1(SI Q-Q PLOT) l1(SI)
. graph export alt3-1ex\qqplot3.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-1ex\qqplot3.wmf written in Windows Metafile format)
[Figure: SI Q-Q PLOT; Sulphoxidation Index against Inverse Normal. Grid lines are 5, 10, 25, 50, 75, 90, and 95 percentiles]
.
.
* Box-Cox method to choose transformation to normality
.
. * nolog option suppresses iterations - nothing to do with logarithms
.
. boxcox age , nolog
Fitting comparison model
Fitting full model

                                                  Number of obs   =         65
                                                  LR chi2(0)      =       0.00
Log likelihood = -248.73918                       Prob > chi2     =          .

------------------------------------------------------------------------------
         age |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      /theta |   1.028826    .527121     1.95   0.051     -.004312    2.061964
------------------------------------------------------------------------------

Estimates of scale-variant parameters
----------------------------
             |      Coef.
-------------+--------------
Notrans      |
       _cons |    55.8456
-------------+--------------
      /sigma |   12.44209
----------------------------

---------------------------------------------------------
   Test         Restricted     LR statistic      P-Value
    H0:       log likelihood       chi2        Prob > chi2
---------------------------------------------------------
theta = -1      -256.76965         16.06          0.000
theta =  0      -250.73362          3.99          0.046
theta =  1      -248.74068          0.00          0.956
---------------------------------------------------------
. boxcox sadose, nolog
Fitting comparison model
Fitting full model

                                                  Number of obs   =         65
                                                  LR chi2(0)      =       0.00
Log likelihood = -505.33421                       Prob > chi2     =          .

------------------------------------------------------------------------------
      sadose |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      /theta |   .4100593   .1929563     2.13   0.034      .031872    .7882467
------------------------------------------------------------------------------

Estimates of scale-variant parameters
----------------------------
             |      Coef.
-------------+--------------
Notrans      |
       _cons |   41.58575
-------------+--------------
      /sigma |   9.273821
----------------------------

---------------------------------------------------------
   Test         Restricted     LR statistic      P-Value
    H0:       log likelihood       chi2        Prob > chi2
---------------------------------------------------------
theta = -1      -530.33416         50.00          0.000
theta =  0      -507.58528          4.50          0.034
theta =  1      -509.90097          9.13          0.003
---------------------------------------------------------

. boxcox si , nolog
Fitting comparison model
Fitting full model

                                                  Number of obs   =         65
                                                  LR chi2(0)      =       0.00
Log likelihood = -285.74575                       Prob > chi2     =          .

------------------------------------------------------------------------------
          si |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      /theta |   .0403967   .1055843     0.38   0.702    -.1665448    .2473382
------------------------------------------------------------------------------

Estimates of scale-variant parameters
----------------------------
             |      Coef.
-------------+--------------
Notrans      |
       _cons |   2.770815
-------------+--------------
      /sigma |    1.64801
----------------------------

---------------------------------------------------------
   Test         Restricted     LR statistic      P-Value
    H0:       log likelihood       chi2        Prob > chi2
---------------------------------------------------------
theta = -1       -333.2825         95.07          0.000
theta =  0      -285.81928          0.15          0.701
theta =  1       -319.4322         67.37          0.000
---------------------------------------------------------
.
.
.
.
.
. * Close the log -- may want to use for production runs
. *log close
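For si, the boxcox output above estimates theta at about 0.04 and does not reject theta = 0 (P = 0.701), which points to the log scale. A possible follow-up, not part of the original do-file, is to create the log-transformed variable and repeat the normality checks. The sketch assumes alt3-1ex.dta is in the working directory; logsi is a hypothetical variable name.

```stata
* Sketch only: assumes alt3-1ex.dta is in the current working directory
use alt3-1ex.dta, clear

* theta = 0 in the Box-Cox family corresponds to the natural log
generate logsi = ln(si)
label variable logsi "Log of Sulphoxidation Index"

* Compare the original and transformed scales
swilk si logsi
qnorm logsi
```

Common sense still applies: 17 of the 65 si values are censored at 80, so no power transformation can be expected to remove that spike.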
10. Example 2: input and display of data from
Altman’s exercise 3-2
Biostatistics 624 © 2011 by JHU Biostatistics Dept.
Sun, 27 Mar 2011 (6:47p)
CLASS 1 - 33
Class 1 - Introduction; Overview of Stata -- LECTURE NOTES
! Data: These data are found on p.47 of Altman (Exercise 3.2). The
data concerns airplane accidents (counts, rates/1000, and rates
per 100,000 flight hours) and how they relate to occupation of
the pilot
! Script of Stata commands contained in alt3-2ex.do
! NOTE:
The script file and data file are on the class disk as
alt3-2ex.do and alt3-2ex.dat
10.1 Source data from Altman
10.2 Raw data — text file on disk
Professional pilots           1302      15.9       0.2
Lawyers                         57      11.0       1.5
Farmers                        166      10.1       1.3
Sales representatives          137       9.0       1.2
Physicians                      76       8.7       1.8
Mechanics and repairmen         44       6.9       1.5
Policemen and detectives        48       6.6       1.8
Managers and administrators    643       6.0       0.7
Engineers                      125       4.7       1.1
Teachers                        43       4.2       1.1
Housewives                      29       3.7       3.2
Academic students              188       3.2       3.7
Armed Forces Members           111       1.6       0.7

10.3 Analysis plan
! Explore this simple dataset with several graphs using the graph
command
— Show how counts of accidents are related to occupation of
pilot
— Show how rates per 1000 pilots are related to occupation
— Show how rates per 100,000 flight hours are related to
occupation
— Show how the two rates are related to one another
! Consider other approaches to analysis
10.4 Stata log
.
.
. * Turn off MORE feature
.
. set more off
.
.
.
. * Input data, embedded blanks in string
.
. infix str occup 1-29 accid 30-34 rate1 40-44 rate2 50-54 using alt3-2ex.dat
(13 observations read)
.
.
.
. * Variable labels
. label variable occup "Occupation"
. label variable accid "No. of Accidents"
. label variable rate1 "Rate per 1000"
. label variable rate2 "Rate per 100,000 hr"
.
. * List data for checking
.
. list
     +-----------------------------------------------------+
     |                       occup   accid   rate1   rate2 |
     |-----------------------------------------------------|
  1. |         Professional pilots    1302    15.9      .2 |
  2. |                     Lawyers      57      11     1.5 |
  3. |                     Farmers     166    10.1     1.3 |
  4. |       Sales representatives     137       9     1.2 |
  5. |                  Physicians      76     8.7     1.8 |
     |-----------------------------------------------------|
  6. |     Mechanics and repairmen      44     6.9     1.5 |
  7. |    Policemen and detectives      48     6.6     1.8 |
  8. | Managers and administrators     643       6      .7 |
  9. |                   Engineers     125     4.7     1.1 |
 10. |                    Teachers      43     4.2     1.1 |
     |-----------------------------------------------------|
 11. |                  Housewives      29     3.7     3.2 |
 12. |           Academic students     188     3.2     3.7 |
 13. |        Armed Forces Members     111     1.6      .7 |
     +-----------------------------------------------------+
.
.
.
. * Code occupations for graphs
. encode occup, gen(occup1)
.
.
.
. * Make shorter labels for graphs
.
. #delimit ;
delimiter now ;
. label define occuplab 1 "Acad"    2 "Armed For"  3 "Engin"
>                       4 "Farm"    5 "Housewife"  6 "Law"
>                       7 "Mgrs"    8 "Mech"       9 "MD"
>                       10 "Police" 11 "Pro Pilot" 12 "Sales"
>                       13 "Teach" ;
. #delimit cr
delimiter now cr
.
. label values occup1 occuplab
.
.
.
.
. * Save as Stata dataset
.
. save alt3-2ex.dta, replace
file alt3-2ex.dta saved
.
.
. * Bar graph, See Figure 1
.
. sort occup1
.
. graph hbar accid , over(occup1,sort(1)) ytitle(" ") l1(OCCUPATION) b1(No. of Accidents) t1 (AIRPLANE ACCIDENTS)
. graph export alt3-2ex\fig1.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-2ex\fig1.wmf written in Windows Metafile format)
[Figure 1. AIRPLANE ACCIDENTS: horizontal bar chart of No. of Accidents (axis 0 to 1,500) by OCCUPATION, bars sorted from fewest (Housewife) to most (Pro Pilot)]
.
.
.
. * Bar graph, See Figure 2
.
. graph hbar rate1 , over(occup1,sort(1)) ytitle(" ") l1(OCCUPATION) b1(Rate per 1000 Pilots) t1 (AIRPLANE ACCIDENTS)
. graph export alt3-2ex\fig2.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-2ex\fig2.wmf written in Windows Metafile format)
[Figure 2. AIRPLANE ACCIDENTS: horizontal bar chart of Rate per 1000 Pilots (axis 0 to 15) by OCCUPATION, bars sorted from lowest (Armed For) to highest (Pro Pilot)]
.
. * Bar graph See Figure 3
.
. graph hbar rate2 , over(occup1,sort(1)) ytitle(" ") l1(OCCUPATION) b1(Rate per 100000 hrs) t1 (AIRPLANE ACCIDENTS)
(file alt3-2ex\fig3.gph saved)
. graph export alt3-2ex\fig3.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-2ex\fig3.wmf written in Windows Metafile format)
[Figure 3. AIRPLANE ACCIDENTS: horizontal bar chart of Rate per 100000 hrs (axis 0 to 4) by OCCUPATION, bars sorted from lowest (Pro Pilot) to highest (Acad)]
.
.
. * Scatterplot See Figure 4
.
. graph twoway scatter rate1 rate2, mlabel(occup1) t1(AIRPLANE ACCIDENT RATES)
. graph export alt3-2ex\fig4.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-2ex\fig4.wmf written in Windows Metafile format)
[Figure 4. AIRPLANE ACCIDENT RATES: scatterplot of Rate per 1000 (vertical axis, 0 to 15) against Rate per 100,000 hr (horizontal axis, 0 to 4), each point labeled with its short occupation label]
.
.
. log close
11. Common data analysis applications
! For simplicity of illustration, the data from the rheumatoid arthritis
data introduced earlier will be used in all the examples, some of
which may be contrived or inappropriate
! The examples shown below assume that the Stata dataset has
been loaded into the work space through input of the raw data
or by loading a saved dataset (e.g., use alt3-1ex\alt3-1ex.dta)
11.1 Descriptive statistics
! Means, SDs, and other descriptive statistics
! Variables: age, sadose, and si
! Command:
summarize age sadose si , detail
11.2 Stem-and-leaf charts
! Stem-and-Leaf to show distribution of continuous variable -- must
do one variable at a time
! Variable: age
! Command:
stem age
11.3 Boxplots
! Boxplot to show distribution of a variable in subgroups of the data.
Data must be sorted by the subgrouping variables. Store the
graph in a folder (sub-directory) in metafile format (*.wmf), so it
can be imported into a word processor for printing
! Variables:
— Subgrouping: reac
— Analysis: age
! Commands:
[Type each command below on a single, long line]
sort react
graph box age, over (react) marker(1,mlab(sno)) t1(AGE BOXPLOTS) t2(" ") l1(AGE) b1(REACTION)
graph export alt3-1ex\boxplot1.wmf,replace
11.4 Confidence interval for a mean
! Calculate a 95% confidence interval for the mean value of a
variable
! Variable: age
! Command:
ci age
! Immediate form of command — used as a “calculator” to produce
95% CI from n, mean, and SD
. cii 65 52.12 11.20
11.5 Confidence interval for a proportion
! Calculate a 95% confidence interval for the proportion positive in a
binomial distribution. Stata calculates exact binomial limits.
Note: Stata can also calculate limits for the mean of a Poisson
distribution using the poisson option of the ci or cii commands.
! Variable: censor
! Command:
ci censor , binomial
! Immediate form of command — used as a “calculator” to produce
95% CI from n, # of events
. cii 65 17
! Poisson example (27 deaths, 645 person-years):
cii 645 27 , poisson
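To make concrete what "exact binomial limits" means for the immediate command cii 65 17 in section 11.5, the sketch below finds the two-sided 95% limits numerically in Python. It assumes the usual Clopper-Pearson definition (each tail probability set to 2.5%) and uses simple bisection; the function names are illustrative, and no particular output from Stata is claimed here.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_decreasing(f, target):
    """Bisection on [0, 1] for f(p) = target, where f is decreasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid  # p is still too small
        else:
            hi = mid
    return (lo + hi) / 2


def exact_binomial_ci(x, n, level=0.95):
    """Clopper-Pearson limits: each tail has probability (1 - level) / 2."""
    alpha = 1 - level
    # Lower limit: P(X >= x | p) = alpha/2, i.e. P(X <= x-1 | p) = 1 - alpha/2
    lower = 0.0 if x == 0 else _solve_decreasing(
        lambda p: binom_cdf(x - 1, n, p), 1 - alpha / 2)
    # Upper limit: P(X <= x | p) = alpha/2
    upper = 1.0 if x == n else _solve_decreasing(
        lambda p: binom_cdf(x, n, p), alpha / 2)
    return lower, upper


lower, upper = exact_binomial_ci(17, 65)
print(f"proportion = {17 / 65:.4f}, exact 95% CI = ({lower:.4f}, {upper:.4f})")
```

The interval is not symmetric about 17/65, which is the practical difference from the familiar normal-approximation interval.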
11.6 Student’s t-test
! Used to test equality of means. It comes in 3 forms:
— Test that variable has a mean equal to specific # — this is
the one-sample t-test
— Test that variable1 has the same mean as variable2 — this
is the paired t-test
— Test that variable has the same mean within two groups
defined by a grouping variable groupvar — this is the two-sample t-test
Note: Stata gives p-values for the t-tests, but also gives 95%
confidence intervals on means and differences in means
! Variables: age with reac as the subgrouping variable
! Commands:
— One-sample t-test: Test mean age = 50
ttest age = 50
— Paired t-test: (Stupidly, for illustration) test mean sadose = si
ttest sadose = si
— Two-sample t-test: Test age means are equal within reaction
groups
ttest age ,by (reac)
or,
ttest age ,by (reac) unequal
(the unequal option does not assume equal variances)
! Immediate forms of commands can be used as a “calculator” to
get t-test given summary data on n, and the observed means
and standard deviations (sd):
— One-sample test (n=24, observed mean=62.6, sd=15.8; test
mean=75)
ttesti 24 62.6 15.8 75
— Paired t-test: there is no immediate command for this
— Two-sample t-test: (n1=20,m1=20,sd1=5;
n2=32,m2=15,sd2=4; test means equal)
ttesti 20 20 5 32 15 4
11.7 Test for binomial proportions
! Use to test equality of proportions within two subgroups
Note: Stata gives the 2x2 chi-square test and p-value. It also
gives the Fisher’s exact test p-value
! Variables: proportion censored (censor) within reactivity groups
(reac)
! Commands:
tab censor reac, chi2 exact
! Immediate forms of commands can be used as a “calculator” to
test equality of proportions in a 2x2 table. Enter the rows of the
table separated by a “\” character:
tabi 24 24 \ 13 4 , chi2 exact
11.8 Correlation
! Obtain either the Pearson’s or Spearman’s (rank) estimated
correlation coefficient of two measured responses x and y
! Variables: age and si
! Commands:
corr age si
spearman age si
Note: Pairs of correlations among a set of variables may be
obtained by specifying the list of variables. E.g., to obtain
age-sadose, age-si, and sadose-si correlations:
corr age sadose si
11.9 Simple linear regression
! Estimate simple linear model relating a measured response
(dependent) variable y to a fixed, covariate (independent)
variable x — y = α+βx+ε
Stata produces an analysis of variance, p-values, coefficient
estimates, standard errors, and 95% confidence intervals
! Variables: Dependent = si and independent = age
! Commands:
regress si age
! Commands to obtain a graph of the data, fitted line, and 95% CIs
(type the graph command on one line):
graph twoway (scatter si age) || (lfitci si age), t1("si= 30.15+.0268age")
graph export alt3-1ex\lreg.wmf,replace
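The immediate t-tests in section 11.6 (ttesti 24 62.6 15.8 75 and ttesti 20 20 5 32 15 4) work only from summary statistics, so the arithmetic behind the test statistic is easy to show. The following Python sketch assumes the standard textbook formulas (one-sample t, and two-sample t with a pooled variance, i.e. without the unequal option); it computes the t statistic and degrees of freedom only, not the p-value, and the function names are illustrative.

```python
from math import sqrt


def one_sample_t(n, mean, sd, mu0):
    """t statistic and degrees of freedom for H0: population mean = mu0."""
    se = sd / sqrt(n)
    return (mean - mu0) / se, n - 1


def two_sample_t_pooled(n1, m1, sd1, n2, m2, sd2):
    """Equal-variance (pooled) two-sample t statistic and degrees of freedom."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    se = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / se, df


# Same summary numbers as the two ttesti examples in these notes
t1, df1 = one_sample_t(24, 62.6, 15.8, 75)
t2, df2 = two_sample_t_pooled(20, 20, 5, 32, 15, 4)
print(f"one-sample: t = {t1:.3f} on {df1} df")
print(f"two-sample: t = {t2:.3f} on {df2} df")
```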
11.10 Analysis of variance
! Used to test equality of means within two or more subgroups —
usually 3 or more as the t-test is usually used for 2 groups
! Variables: Dependent variable = si, subgrouping variable = reac —
only 2 groups in this example
! Command:
oneway si reac
11.11 Multiple linear regression
! Use regress
! For details refer to the Reference Manual or
help regress
! Also see Stata User’s Guide Chapters 26 and 35 (in the handout
for Part 1) for more details on fitting regression models
11.12 Multiple logistic regression
! Use logistic for logistic regression for binary responses
! Use clogit for matched or highly stratified case-control studies
(including “frequency-matched” studies)
! Use ologit for logistic regression for ordered responses with more
than 2 categories
! Use mlogit for logistic regression for responses with more than 2
categories (not ordered)
! For details refer to the Reference Manual or
help logistic
help clogit
help ologit
help mlogit
! Also see Stata User’s Guide Chapters 26 and 35 (in the handout
for Part 1) for more details on fitting regression models
11.13 Epidemiologic calculations - epitab
! Most of the common calculations for epidemiologic analysis have
been included in Stata in a group of commands labeled “epitab”
in the Reference Manual
! Most of the commands have an “immediate” form so that they may
be applied to summary tables, rather than to the raw data,
which may not be available
! Details may be found in the Manual or by typing
help epitab
For convenience, the Help text is included below
. help epitab
-------------------------------------------------------------------------------
help for epitab, ir, iri, cs, csi, cc, cci, mcc, mcci      (manual: [R] epitab)
-------------------------------------------------------------------------------
Tables for epidemiologists
--------------------------
    ir    case_var ex_var time_var [weight] [if exp] [in range] [, level(#)
          tb by(varname) fast estandard istandard standard(varname) ird
          nocrude pool nohet ]
    iri   #a #b #N1 #N2 [, level(#) tb ]
    cs    case_var ex_var [weight] [if exp] [in range] [, level(#) exact tb
          woolf by(varname) fast or estandard istandard standard(varname)
          nocrude pool nohet ]
    csi   #a #b #c #d [, level(#) exact or tb woolf ]
    cc    case_var ex_var [weight] [if exp] [in range] [, level(#) exact tb
          woolf by(varname) fast estandard istandard standard(varname)
          nocrude pool nohet ]
    cci   #a #b #c #d [, level(#) exact tb woolf ]
    mcc   ex_case_var ex_cntl_var [weight] [if exp] [in range] [, level(#) tb ]
    mcci  #a #b #c #d [, level(#) tb ]
Description
-----------
ir is used with incidence rate (incidence density or person-time) data; point
estimates and confidence intervals for the incidence rate ratio and difference
are calculated along with attributable or prevented fractions for the exposed
and total population. iri is the immediate form of ir; see help immed.
Also see help nbreg, help poisson and help stcox for related commands.
cs is used with cohort study data with equal follow-up time per subject and,
in some cases, cross-sectional data. Risk is then the proportion of subjects
who become cases. Point estimates and confidence intervals for the risk difference, risk ratio, and (optionally) the odds ratio are calculated along with
attributable or prevented fractions for the exposed and total population. csi
is the immediate form of cs; see help immed. Also see help logistic and help
glogit for related commands.
cc is used with case-control and cross-sectional data. Point estimates and
confidence intervals for the odds ratio are calculated along with attributable
or prevented fractions for the exposed and total population. cci is the immediate form of cc; see help immed. Also see help logistic and help glogit
for related commands.
mcc is used with matched case-control data. McNemar's chi-squared, point estimates and confidence intervals for the difference, ratio, and relative difference of the proportion with the factor, along with the odds ratio, are calculated. mcci is the immediate form of mcc; see help immed. Also see help
clogit for a related command.
Options
-------
level(#) specifies in percent the confidence level for confidence intervals.
exact requests Fisher's exact P be calculated rather than the chi-squared and
its significance level. We recommend specifying exact whenever samples are
small. A conservative rule-of-thumb for 2x2 tables is to specify exact
when the least-frequent cell contains fewer than 1,000 cases. Note that
exact does not affect whether exact confidence intervals are calculated;
commands always calculate exact confidence intervals where they can unless
tb or woolf is specified.
by(varname) specifies that the tables are stratified on varname. Within-stratum
statistics are shown then combined with Mantel-Haenszel weights.
If estandard, istandard, or standard() is also specified (see below), the
weights specified are used in place of Mantel-Haenszel weights.
fast specifies that calculations of within-stratum confidence intervals are not
to be made. This speeds execution of the command, although in the case of
ir, it makes little difference and for the remaining commands, woolf or
tb are almost as fast.
or is allowed only with the cs and csi commands. Specified without by(), or
reports the calculation of the odds ratio in addition to the risk ratio.
With by(), or specifies that a Mantel-Haenszel estimate of the combined
odds ratio be made rather than the Mantel-Haenszel estimate of the risk
ratio. In either case, this is the same calculation as would be made by
cc or cci and, typically, the use of those commands is to be preferred
for obtaining odds ratios.
tb requests that test-based confidence intervals be calculated wherever
appropriate in place of confidence intervals based on other approximations
or exact confidence intervals. We recommend that test-based confidence
intervals be used only for pedagogical purposes and never be used for
research work.
woolf requests that the Woolf approximation, also known as the Taylor expansion, be used for calculating the standard error of the odds ratio. Otherwise, the Cornfield approximation is used. The Cornfield approximation
takes substantially longer (a few seconds) to calculate than the Woolf
approximation. This standard error is used in calculating a confidence
interval for the odds ratio. (For matched case-control data, exact confidence intervals are always calculated.)
estandard, istandard, and standard(varname) request that within-stratum statistics are to be combined with external, internal, or user-specified weights
to produce a standardized estimate. These options are mutually exclusive
and can only be used when by() is also specified. (When by() is specified
without one of these options, Mantel-Haenszel weights are used.)
estandard external weights are the person-time for the unexposed (ir),
the total number of unexposed (cs), or the number of unexposed controls
(cc).
istandard internal weights are person-time for the exposed (ir), the total
number of exposed (cs), or the number of exposed controls (cc). istandard
can be used, among other things, to produce standardized mortality
ratios (SMRs).
standard(varname) allows user-specified weights. varname must contain
a constant within stratum and be nonnegative. The scale of varname is
irrelevant.
ird may be used only with estandard, istandard, or standard(); it requests that
ir calculate the standardized incidence rate difference rather than the
default incidence rate ratio.
rd may be used only with estandard, istandard, or standard(); it requests that
cs calculate the standardized risk difference rather than the default risk
ratio.
nocrude specifies that in a stratified analysis, the crude estimate -- the
estimate one would obtain without regard to strata -- not be displayed.
nocrude is relevant only if by() is also specified.
pool specifies that in a stratified analysis, the directly pooled estimate
should also be displayed. The pooled estimate is a weighted average of
the stratum-specific estimates using inverse-variance weights. pool is
relevant only if by() is also specified.
nohet specifies that a chi-squared test for heterogeneity not be included in
the output of a stratified analysis. This tests whether the exposure
effect is the same across strata and can be performed for any pooled
estimate -- directly pooled or Mantel-Haenszel. nohet is relevant only
if by() is also specified.
Examples: incidence rate data
-----------------------------
The table for incidence rate data is
                  Exposed   Unexposed
    ------------+--------------------
    Cases       |    a          b
    Person-time |    N1         N0
The basic syntax (ignoring options) for iri is "iri #a #b #N1 #N2".
For example:
. iri 41 15 28010 19017
. iri 41 15 28010 19017, level(90)
. iri 41 15 28010 19017, level(90) tb
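As a check on what iri is summarizing, the point estimates can be worked out directly from the four numbers in the example (41 and 15 cases, 28010 and 19017 units of person-time). This Python sketch covers only the incidence rate ratio and difference, not the confidence intervals or attributable fractions, and the function name is illustrative.

```python
def incidence_rate_summary(a, b, n1, n0):
    """Point estimates from an incidence rate table.

    a, b   : cases among exposed and unexposed
    n1, n0 : person-time among exposed and unexposed
    """
    rate_exposed = a / n1
    rate_unexposed = b / n0
    ratio = rate_exposed / rate_unexposed
    difference = rate_exposed - rate_unexposed
    return rate_exposed, rate_unexposed, ratio, difference


# Same numbers as ". iri 41 15 28010 19017"
r1, r0, irr, ird = incidence_rate_summary(41, 15, 28010, 19017)
print(f"rate exposed   = {r1:.6f}")
print(f"rate unexposed = {r0:.6f}")
print(f"incidence rate ratio      = {irr:.4f}")
print(f"incidence rate difference = {ird:.6f}")
```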
The basic syntax (ignoring options) for ir is "ir case_var ex_var time_var".
case_var contains the number of cases represented by an observation. ex_var
contains 0 if the observation represents unexposed and nonzero (e.g., 1) if the
observation represents exposed. time_var contains the exposure time (e.g.,
person-years) represented by the observation. ir obtains the table by summing
across observations. Observations with missing values are not used.
. list
        cases   exposed    time
  1.       20         1   14000
  2.       21         1   14010
  3.       15         0   19017
. ir cases exposed time, level(90)
(output omitted)
To obtain Mantel-Haenszel combined IRR:
. list
       agegrp   deaths   exposed   pyears
  1.        1       14         1     1516
  2.        1       10         0     1701
  3.        2       76         1      949
  4.        2      121         0     2245
. ir deaths exposed pyears, by(agegrp)
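For the stratified listing above, the Mantel-Haenszel combination can be sketched by hand. The Python below assumes the standard Mantel-Haenszel estimator for person-time data, in which stratum i contributes a_i*T0_i/T_i to the numerator and b_i*T1_i/T_i to the denominator (a, b = exposed and unexposed deaths; T1, T0 = exposed and unexposed person-years; T = T1 + T0). It gives the point estimate only, and the variable names are illustrative.

```python
# One tuple per age group: (exposed deaths, exposed pyears,
#                           unexposed deaths, unexposed pyears)
strata = [
    (14, 1516, 10, 1701),   # agegrp 1
    (76, 949, 121, 2245),   # agegrp 2
]


def stratum_irr(a, t1, b, t0):
    """Incidence rate ratio within a single stratum."""
    return (a / t1) / (b / t0)


def mantel_haenszel_irr(strata):
    """Mantel-Haenszel combined incidence rate ratio across strata."""
    numerator = sum(a * t0 / (t1 + t0) for a, t1, b, t0 in strata)
    denominator = sum(b * t1 / (t1 + t0) for a, t1, b, t0 in strata)
    return numerator / denominator


stratum_ratios = [stratum_irr(*s) for s in strata]
mh_irr = mantel_haenszel_irr(strata)
for group, ratio in enumerate(stratum_ratios, start=1):
    print(f"agegrp {group}: IRR = {ratio:.4f}")
print(f"Mantel-Haenszel combined IRR = {mh_irr:.4f}")
```

The combined estimate is a weighted average of the stratum-specific ratios, so it always lies between the smallest and largest of them.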
To obtain internally standardized IRR:
. ir deaths exposed pyears, by(agegrp) istandard
To weight each group equally:
. gen wgt=1
. ir deaths exposed pyears, by(agegrp) standard(wgt)
Examples: cohort-study data
---------------------------
The table for cohort-study data is
                  Exposed   Unexposed
    ------------+--------------------
    Cases       |    a          b
    Noncases    |    c          d
The basic syntax (ignoring options) for csi is "csi #a #b #c #d".
For example:
. csi 7 12 9 2
. csi 7 12 9 2, exact
. csi 7 12 9 2, exact level(90) tb
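The point estimates behind the csi example can be computed directly from the four cells (a=7, b=12, c=9, d=2 in the table layout above). This Python sketch shows only the risks, risk difference, risk ratio, and odds ratio; confidence intervals and the exact test are left to Stata, and the function name is illustrative.

```python
def cohort_summary(a, b, c, d):
    """Point estimates from a 2x2 cohort table.

    a, b : cases among exposed and unexposed
    c, d : noncases among exposed and unexposed
    """
    risk_exposed = a / (a + c)
    risk_unexposed = b / (b + d)
    return {
        "risk_exposed": risk_exposed,
        "risk_unexposed": risk_unexposed,
        "risk_difference": risk_exposed - risk_unexposed,
        "risk_ratio": risk_exposed / risk_unexposed,
        "odds_ratio": (a * d) / (b * c),
    }


# Same numbers as ". csi 7 12 9 2"
result = cohort_summary(7, 12, 9, 2)
for name, value in result.items():
    print(f"{name:16s} = {value:.4f}")
```

The odds ratio computed here is the quantity that cc and cci report in place of the risk ratio.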
The basic syntax (ignoring options) for cs is "cs case_var ex_var". case_var
contains 1 if the observation represents a case and 0 if it represents a
noncase. ex_var contains 0 if the observation represents unexposed and
nonzero (e.g., 1) if it represents exposed. Frequency weights are allowed.
. list
        case   exp   pop
  1.       0     0     2
  2.       0     1     9
  3.       1     0    12
  4.       1     1     2
  5.       1     1     5
. cs case exp [freq=pop]
(output omitted)
If "[freq=pop]" is not specified, each observation contributes 1.
Stratified tables work as with ir. To obtain the Mantel-Haenszel combined
risk ratio:
. cs case exposed [freq=pop], by(age)
To obtain internally standardized risk ratio:
. cs case exposed [freq=pop], by(age) istandard
To obtain externally standardized risk ratio:
. cs case exposed [freq=pop], by(age) estandard
To weight each age group equally:
. gen wgt=1
. cs case exposed [freq=pop], by(age) standard(wgt)
Examples: case-control data
---------------------------
cc and cci work just like cs and csi. They differ in that they report the
odds ratio rather than the risk ratio.
Examples: matched case-control data
-----------------------------------
mcc and mcci work just like cc and cci except that they report different
statistics. Stratified tables are not allowed with mcc.
Also see
--------
Manual:   [R] epitab
On-line:  help for bitest, ci, clogit, dstdize, immed, logistic, nbreg,
          poisson, st, stcox, tabulate
11.14 Sample size and power calculations
! The Stata command sampsi performs sample size or power
calculations for comparison of means or proportions
! Also see the free sample size software from Dupont and Plummer
(see “Other Links” on the course website Home page)
! For details, refer to sampsi in the Reference Manual or type
help sampsi
For convenience, the Help text is given below:
. help sampsi
-------------------------------------------------------------------------------
help for sampsi                                            (manual: [R] sampsi)
-------------------------------------------------------------------------------
Sample size and power determination
-----------------------------------
    sampsi #1 #2 [, alpha(#) power(#) n1(#) n2(#) ratio(#)
           sd1(#) sd2(#) onesample onesided ]
Description
-----------
sampsi estimates required sample size or power of tests for comparisons of
means or proportions. If n1() or n2() is specified, sampsi computes power;
otherwise, it computes sample size. sampsi is an immediate command; all of
its arguments are numbers; see help immed.
sampsi computes sample size or power for four types of tests:
1. Two-sample comparison of means.
   The postulated values of the means are #1 and #2.
   The postulated standard deviations are sd1() and sd2().
2. One-sample comparison of mean to hypothesized value.
   Option onesample must be specified.
   The hypothesized value (null hypothesis) is #1.
   The postulated mean (alternative hypothesis) is #2.
   The postulated standard deviation is sd1().
3. Two-sample comparison of proportions.
   The postulated values of the proportions are #1 and #2.
4. One-sample comparison of proportion to hypothesized value.
   Option onesample must be specified.
   The hypothesized proportion (null hypothesis) is #1.
   The postulated proportion (alternative hypothesis) is #2.
Options
-------
alpha(#) specifies the significance level of the test; the default is
alpha(.05). (More correctly, the default is 1-level/100 from set level,
see help level.)
power(#) is power of the test. Default is power(.90).
n1(#) specifies the size of the first (or only) sample and n2(#) specifies
the size of the second sample. If specified, sampsi reports the power
calculation. If not specified, sampsi computes sample size.
ratio(#) is an alternative way to specify n2() in two-sample tests. In a
two-sample test, if n2() is not specified, n2() is assumed to be
n1()*ratio(). That is, ratio() = n2()/n1(). The default is
ratio(1).
sd1(#) and sd2(#) are the standard deviations for comparison of means. If
not specified, comparison of proportions is assumed. In two-sample
cases, if only sd1() is specified, sd2() is assumed to equal sd1().
onesample indicates a one-sample test. The default is a two-sample test.
onesided indicates a one-sided test. The default is a two-sided test.
Examples
--------
1. Two-sample comparison of mean1 to mean2. Compute sample sizes with
   n2/n1 = 2:
   . sampsi 132.86 127.44, p(0.8) r(2) sd1(15.34) sd2(18.23)
   Compute power with n1 = n2, sd1 = sd2, and alpha = 0.01 one-sided:
   . sampsi 5.6 6.1, n1(100) sd1(1.5) a(0.01) onesided
2. One-sample comparison of mean to hypothesized value = 180. Compute
   sample size:
   . sampsi 180 211, sd(46) onesam
   One-sample comparison of mean to hypothesized value = 0. Compute
   power:
   . sampsi 0 -2.5, sd(4) n(25) onesam
3. Two-sample comparison of proportions. Compute sample size with
   n1 = n2 (i.e., ratio = 1, the default) and power = 0.9 (the default):
   . sampsi 0.25 0.4
   Compute power with n1 = 300 and ratio = n2/n1 = 0.5:
   . sampsi 0.25 0.4, n1(300) r(0.5)
4. One-sample comparison of proportion to hypothesized value = 0.5:
   . sampsi 0.5 0.75, power(0.8) onesample
   Compute power:
   . sampsi 0.5 0.6, n(200) onesam
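To show the kind of calculation behind the two-sample proportions example (sampsi 0.25 0.4, with the default alpha = 0.05 two-sided and power = 0.90), here is a Python sketch using a common textbook formula for equal group sizes, with and without the usual continuity correction. Whether this matches sampsi's internal algorithm in every detail is an assumption, so treat the numbers as an approximate cross-check rather than a replacement for the Stata output. The function name is illustrative.

```python
from math import sqrt, ceil
from statistics import NormalDist


def n_per_group_two_proportions(p1, p2, alpha=0.05, power=0.90):
    """Per-group sample size for comparing two proportions (equal n).

    Returns (uncorrected n, continuity-corrected n), both unrounded.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    delta = abs(p1 - p2)
    n = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
         + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / delta**2
    # Continuity correction applied to the uncorrected n
    n_cc = (n / 4) * (1 + sqrt(1 + 4 / (n * delta))) ** 2
    return n, n_cc


n_plain, n_corrected = n_per_group_two_proportions(0.25, 0.40)
print(f"uncorrected n per group          = {ceil(n_plain)}")
print(f"continuity-corrected n per group = {ceil(n_corrected)}")
```

The correction always increases the required sample size, and the increase is largest when the difference between the two proportions is small.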
Also see
--------
Manual:   [R] sampsi
On-line:  help for immed