Biostatistics and Epidemiology Using Stata

Biostatistics and Epidemiology Using Stata: A Course Manual
Table of Contents
Section 1. Stata: Data Management, Graphics, and Programming
1-1 Installing Stata and recovering Stata windows
1. installing Stata
1. adding an icon to the desktop (PC Windows)
2. run Stata to finish setup
3. updating Stata after setup
3. recovering windows: load factory settings
1-2 Getting data into Stata and some other basics
1. opening a Stata formatted data file: 1) clicking on file icon
1. showing full directory path in Windows Explorer
1. showing file extensions in Windows Explorer
2. opening a Stata formatted data file: 2) File icon on menu bar
3. opening a Stata formatted data file: 3)
change directory (cd) command
directory list (dir) command
read in Stata data file (use) command
4. scrolling in Stata’s Results window
5. general syntax, or structure, of Stata commands
6. Stata help facility, help command
7. Stata Manuals
7. books on Stata
7. setting file attributes in Windows (turn off Read Only)
8. using do-files
9. suggested do-file structure
10. increasing memory size of Stata’s workspace: set memory (set mem) command
10. importing Excel file into Stata
11. reading in a *.csv or *.txt formatted file: insheet command
11. saving a Stata formatted data file: save command
12. saving a Stata formatted data file compatible with Stata version 8 or 9:
saveold command
1-3 Cleaning data
1. listing data: list command
2. block comment: /* … */
2. deleting variables: drop command
2. inline comment “//”
3. tabulation of values of variables using frequency table: tabulate (tab) command
3. examining 4 smallest and 4 largest values: summarize, detail (sum) command
3. replacing value of variable: replace command
assignment “=” and logical equals “==”
4. recoding values of a variable with generate (gen) and replace commands
5. keeping a command from crashing the do-file: capture command, e.g.,
capture drop
5. the “0 observations” result: attempting to do arithmetic on a string variable
6. describing variables: describe command
6. variable storage types
7. missing value for string variable: the null string ""
7. converting a string variable to numeric, destring command
8. recoding values of a variable: recode command
9. converting to all upper or lower case: upper and lower string functions
9. renaming the variable name: rename command
1-4 Merging files
1. adding a file to bottom of file in memory: append command
2. adding a file in rightmost columns of file in memory, one-to-one merge
without matching on some variable: merge command
3. merging files while matching on some variable such as a subject ID,
match merge: merge command
4. checking how well the matching worked: Stata’s _merge variable (values 1 to 3)
5. non-overwrite feature of merge command (the default)
6. non-overwrite of missing values feature of merge command (the default)
7. updating file in memory with another file by replacing missing values only:
update option
8. updating file in memory with another file by replacing both missing and
nonmissing values, update with replace: replace option
8. checking how well the update with replace worked: Stata’s _merge variable
(values 1 to 5)
1-5 Labeling variables and values
2. adding label to a variable: label variable command
3. adding labels to the values of a variable: label define and label values commands
3. listing value labels: label list command
4. suspending value labels in data browser and outputs, nolabel option
6. removing variable labels
7. removing value labels: label drop command
8. removing value labels: capture label drop command
9. displaying values and value labels
1-6 Basic graphics
1. using graphs from Stata version 7: graph7 and version 7 commands
1. redisplaying a graph: graph display command
2. scatterplot: graph twoway scatter command
3. abbreviated scatterplot commands: twoway scatter and scatter commands
3. side by side graph: by option
3. linear regression line graph: lfit command
3. overlaying graphs: “||” operator
3. overlaying linear regression line on scatterplot using || operator
4. overlaying graphs using binding notation: ( ) ( )
4. overlaying linear regression line on scatterplot using ( ) approach
5. generating a variable with rounding: round function
5. generating a mean across data rows for subgroups:
by specification with egen command with mean function
5. listing variables: list command
5. extending a command across several lines in do-file editor: #delimit command
5. line graph: line command
6. requirement to sort on x variable before plotting a line graph: sort command
7. table of descriptive statistics for a two variable crossclassification: table command
7. smooth line graph using fractional polynomial fit: fpfit command
7. fractional polynomial fit with covariates: fracpoly command
8. adding title to graph: title command
8. adding subtitle to graph: subtitle command
8. adding axis titles to graph: ytitle and xtitle commands
8. adding footnote to graph: note command
9. adding more tick marks and labels to axes: ylabel and xlabel commands
9. better labels for legend: legend command
10. list of choices for line graph line widths: graph query linewidthstyle command
10. changing connect line width of line graph: clwidth option
11. list of choices for graph scheme: graph query, schemes command
11. changing default graph scheme for current session or permanently:
set scheme command
11. changing graph scheme just for current graph: scheme option
12. basic black-and-white scheme for manuscripts: scheme(s1mono) option
13. eliminating border around graph: plotregion(style(none)) option
14. adding text to graph: text option
15. placement options for positioning text: placement option
16. adding space between x-axis title and x-axis tick labels: height(5) option in xtitle
17. changing color of connect line of line graph: clcolor option
17. turning off legend: legend(off) option
19. reading in graph data by putting data in do-file: input and end commands
19. adding error bars to graph: rcap command
20. overlaying error bars on scatterplot to get symbol with error bars:
twoway (rcap…) (scatter…) commands
21. adding white space to left and right side of graph: xlabel command
21. change tick mark labels to more descriptive labels: xlabel command
22. drop tick marks from graph while retaining labels: noticks option
23. adding horizontal or vertical reference lines: yline and xline options
24. list of choices for colors: help colorstyle command
24. list of choices for symbols: help symbolstyle command
24. changing marker symbol for scatterplot: msymbol option
24. changing color to marker symbol border line and inside fill:
mlcolor and mfcolor options
1-7 Looping, collapsing, and reshaping
1-8 Operators, ifs, dates, and times
1-9 More graphics: popular scientific graphs
1-10 Programming Stata
1-11 Compilation of frequently used variable generation and modifying
commands (a chapter for quick look up)
1-12 Homework problems
Section 2. Biostatistics
2-1 Describing variables, levels of measurement, and choice of descriptive
statistics
Describing a variable (distribution):
with tables: frequency tables
with graphs: histogram, boxplots
with descriptive statistics: mean, standard deviation, etc.
Levels of measurement (nominal, ordinal, ... categorical, continuous ...)
How to decide what descriptive statistic to use to describe a variable in the
“Table 1. Patient Characteristics” table of an article.
2-2 Logic of significance tests
What a probability distribution is
Logic of a significance test (same logic as a laboratory reference range)
Chance, randomness, sampling variability
Statistical regularity (the basis of statistical theory)
Strong Law of Large Numbers (formal statement of statistical regularity).
Deriving the form of statistical test (significance test) intuitively
Sampling distribution
p value
2-3 Choice of significance test
2-4 Comparison of two independent groups
Role of p values in a Table 1 Patients Characteristics table
Confounding variables
chi-square test
Fisher’s exact test
Asymptotic vs exact tests (parametric vs nonparametric tests)
Minimum expected frequency rule for choosing between chi-square test and
Fisher’s exact test
Barnard’s unconditional exact test
Fisher-Freeman-Halton test
Wilcoxon-Mann-Whitney test
Fisher-Pitman Permutation Test for Independent Samples
Central Limit Theorem
Levene’s test for equality of variances
t test (both equal and unequal variances)
Shapiro-Wilk test for normality
Reporting styles
Outliers
Prespecification of analysis
2-5 Basics of power analysis
definition of power
power increases as sample size increases
decision errors of significance tests [ Type I error (alpha), Type II error (beta) ]
Type II error and sample size paragraph in journal article
conclusions of equivalence
power of a significance test
effect of one- or two-sided comparison on power
effect of choice of alpha on power
effect of choice of minimum detectable effect size on power
effect of size of assumed standard deviation (SD) on power – coming up with an
SD estimate
effect of sample size on power
sample size and power calculations for an interval scaled outcome variable
what to do if you don’t know anything (no effect size or standard deviation
estimates):
the standard deviation units approach, Cohen’s d.
sample size calculation when a multiple comparison adjustment is planned
overfitting
switching the dependent and independent variables
sample size based on precision (desired width of confidence interval)
excessive power (sample size very large)
two group comparison of interval scale outcome sample size paragraph in study
protocol
2-6 More on levels of measurement
sums of ordinal scales produce interval scales
dichotomous scales are actually interval scales
can statistical tests that require interval scales be used with ordinal scales (the
ordinal-interval controversy in statistics)
2-7 Comparison of two paired groups
2-8 Multiplicity and the Comparison of 3+ Groups
multiplicity
multiple comparison problem
p value based multiple comparison procedures: family-wise error rate
(Bonferroni, Holm, Sidak, Holm-Sidak, Hochberg, Finner, Hommel,
Tukey-Ciminera-Heyse)
p value based multiple comparison procedures: false discovery rate
(Benjamini-Hochberg procedure)
how to get away without using multiple-comparison procedures
simultaneous comparison of 3+ groups (includes one-way analysis of variance)
sample size when multiple comparisons are planned
2-9 Correlation
2-10 Linear regression
how linear regression controls for covariates
2-11 Logistic regression and dummy variables
linear regression estimates risk difference (difference between proportions), but is
criticized because it can estimate predicted probabilities outside of the 0-1 range
logistic regression is designed to constrain the predicted probability between 0
and 1
definition of an odds ratio
assessing linearity of effect
dummy variables (indicator variables)
2-12 Survival analysis: Kaplan-Meier graphs, Log-rank Test, and Cox
regression
life tables
Kaplan-Meier survival probabilities & Kaplan-Meier curves
log-rank test
Cox regression
assessing goodness of fit with c-statistic (ROC area)
interpreting the c-statistic
testing proportional hazards assumption of Cox regression
2-13 Confidence intervals versus p values and trends toward significance
2-14 Pearson correlation coefficient with clustered data
2-15 Equivalence and noninferiority tests
2-16 Validity and reliability
2-17 Methods comparison studies
2-18 One sample tests
2-19 Homework problems
Section 3. Epidemiology
3-1 Introduction to epidemiologic thinking
3-2 Sufficient/component cause theory of disease
3-3 Hill’s causal criteria
3-4 Logic and errors
3-5 Effect measures
3-6 Study designs
3-7 Randomization using Excel
3-8 Bias and confounding
3-9 Random error and statistics
3-10 Crude analysis
3-11 Stratified analysis
3-12 Standardization
3-13 Sensitivity (bias) analysis
3-14 Case-cohort study design
3-15 Homework problems
Section 4. Power Analysis
Chapter 4-1. Sample Size Determination and Power Analysis for Specific
Applications
two independent group comparison of means (independent groups t test)
linear regression: comparing two groups adjusted for covariates
two independent groups comparison of dichotomous outcome variable (chi-square test,
Fisher’s exact test)
two independent groups comparison of a nominal outcome variable (chi-square test and
Fisher-Freeman-Halton test)
two independent groups comparison of ordinal outcome variable (Wilcoxon-Mann-Whitney test)
paired ordinal outcome variable (Wilcoxon signed-rank test)
repeated measurements or clustered studies (GEE, mixed, multilevel, hierarchical models)
power analysis using Monte Carlo simulation (independent samples t test)
power analysis using Monte Carlo simulation (2 × 2 table chi-square test)
power analysis using Monte Carlo simulation (Poisson regression with person-time)
power analysis using Monte Carlo simulation (2-way ANOVA, both factors with 2 levels,
neither of which is a repeated measurement)
logrank test
Section 5. Regression Models
5-1 What regression is and curvilinear correlation
5-2 Holding constant
5-3 Dichotomous predictor variables
5-4 Adjusted means, Analysis of Variance (ANOVA), and interaction
5-5 Deriving logistic regression
5-6 Exact logistic regression
5-7 Introducing Cox regression and Kaplan-Meier plots
5-8 Interaction
5-9 Missing data imputation
5-10 Linear regression robust to assumptions
5-11 Linear regression diagnostics and transformations
5-12 Variable selection and collinearity
5-13 Monte Carlo Simulation and Bootstrapping
5-14 Model Validation
5-15 Response feature (summary measure) analysis
5-16 Analysis of covariance (ANCOVA) versus change analysis
5-17 Conditional logistic regression
5-18 Repeated measures analysis of variance
5-19 Generalized estimating equations (GEE)
5-20 Multilevel (mixed effects) models
5-21 Regression post tests
5-22 Modeling cost
5-23 Cox regression proportional hazards assumption
5-24 Cluster analysis
5-25 Multilevel (mixed effects) logistic regression
5-26 Trend tests
5-27 Homework problems
Appendix 1. Dataset Descriptions
births.dta
Concerns 500 mothers who had singleton births in a large London
hospital.
evans.dta
From a cohort study in which n=609 white males were followed for 7
years, with coronary heart disease as the outcome of interest.
2.20.Framingham.dta
The dataset comes from a long-term follow-up study of cardiovascular risk
factors on 4699 patients living in the town of Framingham, Massachusetts.
LeeLife.dta
Concerns male patients with localized cancer of the rectum diagnosed in
Connecticut from 1935 to 1954. The research question is whether survival
improved for the 1945-1954 cohort of patients (cohort = 1) relative to the
earlier 1935-1944 cohort (cohort = 0).
mi.dta
From a 1:2 matched case-control study in which n=117 subjects are
formed into 39 matched strata.
rmr.dta
Data published by Nawata et al. (2004) (on course CD). The data were
created from the authors’ Figure 1, a scatterplot, and so only approximate
the actual values used by the authors.
smoke.csv
Concerns 234 smokers who expressed a willingness to quit smoking and were
followed for one year to estimate the proportion of recidivism (quit for a
time and then started again).
wright_lowbw.dta
The dataset concerns 900 birthweight outcomes and risk factors
attributable to the mother.
vaso.dta
The data were obtained in a carefully controlled study of the effect of the
RATE and VOLume of air inspired by human subjects on the occurrence
(coded 1) or non-occurrence (coded 0) of a transient vasoconstriction
RESPonse in the skin of the fingers.
PEPI Windows Programs
Some selected programs from the software package Programs for Epidemiologists (PEPI).
The software distribution CD grants permission to share the software, as long
as the person sharing it does not charge for it. The manual must be purchased, however, and
cannot be shared without permission.
These programs run in a DOS window of the Windows operating system.
adjustp.exe
Multiple comparison procedures (referred to in Chapter 2-8)
Holm’s procedure
Hommel’s procedure
Finner’s procedure
misclass.exe
sensitivity analysis for misclassification bias (referred to in Chapter 3-13).
powr.exe
power analysis for comparing two groups (not referred to in manual)
independent proportions (chi-square test)
related proportions (McNemar test)
ordered categories (Wilcoxon-Mann-Whitney test)
independent means (independent groups t test)
related sample means (paired t test)
sample.exe
single group sample size determination (not referred to in manual)
proportion
mean
precision of prevalence rate
samples.exe
two group sample size determination (not referred to in manual)
independent proportions (chi-square test)
related proportions (McNemar test)
ordered categories (Wilcoxon-Mann-Whitney test)
independent means (independent groups t test)
related sample means (paired t test)