Biostatistics Collaboration Center http://www.feinberg.northwestern.edu/sites/bcc/
Basic Biostatistics in Medical Research:
Emerging Trends
November 14, 2013
Leah J. Welty, PhD
Biostatistics Collaboration Center
Opening images: The New Yorker (http://www.newyorker.com/reporting/2010/12/13/101213fa_fact_lehrer) and The Economist.
Power:
• What is it?
• How do you compute it?
• Are we having a “power failure”?

Reproducible research:
• How did it start?
• What is it?
• Why practice it?
• Most granting agencies now require some sort of justification of sample size.
• A study with too much power will usually be costly, and will occasionally claim “significant” results that are not clinically relevant.
• A study that lacks power will not be “significant” – even if results are clinically meaningful. There is a known publication bias against studies with negative findings.
Slide credit: Dr. Mary Kwasny
• [Studies] should have sufficient statistical power (usually 80%) to detect (clinically meaningful) differences between groups.
• To be assured of this without compromising levels of significance, a sample size calculation should be considered early in the planning stages.
Friedman LM, Furberg CD, and DeMets DL. Fundamentals of Clinical Trials, 4th Edition. New York: Springer-Verlag, 2010.
Slide credit: Dr. Mary Kwasny
                              Reality
                              H0 true              H1 true
Test      Reject H0           Type I Error (α)     Power
Result    (p < 0.05)          α = 0.05 (5%)        0.80 (80%)

          Fail to reject H0   Confidence           Type II Error (β)
          (p > 0.05)          0.95 (95%)           0.20 (20%)

Power = conditional probability = Pr(Reject H0 | H1 true)

Slide credit: Dr. Mary Kwasny
Power is tied to testing a specific hypothesis, e.g., a clinical trial (Is drug A better than drug B?).

For descriptive studies there may be no central hypothesis, e.g., estimating the prevalence of autism; sample size calculations may then need to be based on the margin of error instead.

In practice, the power section of a grant is typically some combination of both.
Power = the probability that you reject the null hypothesis, given that the (specific) alternative is true = Pr(reject H0 | H1 true).

Acceptable power is usually 0.8 to 0.9 (80-90%): if your alternative hypothesis is true, you want a ‘good chance’ of detecting it.
What you need for a power/sample size calculation:
1. Null hypothesis and (a specific) alternative hypothesis
2. The appropriate statistical method to test the null hypothesis
3. Effect size, or variability
4. Level of statistical significance (usually α = 0.05; this should be decided before starting a study)
5. EITHER power or sample size (solve for the other)
Research Question: Do elderly smokers have a greater prevalence of depression than elderly nonsmokers?

Literature Review: The prevalence of depression among elderly nonsmokers is 0.20.
Power/Sample Size Example

1. Null hypothesis and (a specific) alternative hypothesis
   H0: the prevalence of depression is the same in elderly smokers and elderly nonsmokers
   H1: the prevalence of depression is different in elderly smokers and elderly nonsmokers (a two-sided alternative)
2. The appropriate statistical method to test the null hypothesis: chi-squared test
   (Talk to your friendly neighborhood statistician.)
3. Effect size, or variability: prevalence among elderly nonsmokers = 0.2; prevalence among elderly smokers = 0.3
   (From the literature, your past studies, pilot data, or even an educated guess. Cannot come from the study you’re trying to power!)
4. Level of statistical significance: α = 0.05
   (Typically 0.05; sometimes 0.01, for example in some clinical trials.)
5. EITHER power or sample size: 80% power (1 − power = β = 20%)
   (Usually 80% or 90%.)
Given ingredients #1-5, solve for sample size (or power) using:
• Your friendly neighborhood statistician
• Software (SAS, Stata, R, PASS)
• Tables
• Simulations

Here: 293 elderly nonsmokers & 293 elderly smokers (see the R sketch below).
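A minimal check of this answer in R, using power.prop.test() from the base stats package (a normal approximation to the two-sample comparison of proportions, closely matching the chi-squared test above):

    # Solve for n per group: two-sided test, alpha = 0.05, power = 0.80,
    # depression prevalence 0.2 (nonsmokers) vs. 0.3 (smokers)
    power.prop.test(p1 = 0.2, p2 = 0.3,
                    sig.level = 0.05, power = 0.80,
                    alternative = "two.sided")
    # n = 292.9 per group -> round up to 293 in each group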
“Why Most Published Research Findings Are False.”
“Power failure: why small sample size undermines the reliability of neuroscience.”
“Small sample size is not the real problem.”
Suppose H1 is true. If Pr(reject H0 | H1 true) is only ~10-20%, the chances of ‘uncovering’ H1 are small: you fail to reject the null when you should. Wasted effort, money, resources?
PPV = Pr(H1 true | reject H0)

Let R = pre-study odds = Pr(H1 true) / Pr(H0 true)

(Think of H1 and H0 not as single hypotheses but as randomly selected from the collection of all hypotheses in a given field.)

Assume α = 0.05 (Type I error), so Pr(reject H0 | H0 true) = 0.05.
PPV = Pr(H1 true | reject H0)

    = [Pr(reject H0 | H1 true) * Pr(H1 true)]
      / [Pr(reject H0 | H1 true) * Pr(H1 true) + Pr(reject H0 | H0 true) * Pr(H0 true)]
      ... Bayes’ Theorem

    = [Power * Pr(H1 true)] / [Power * Pr(H1 true) + 0.05 * Pr(H0 true)]
      ... definition of power and alpha

    = [Power * Pr(H1 true)/Pr(H0 true)] / [Power * Pr(H1 true)/Pr(H0 true) + 0.05]
      ... nifty trick: divide numerator and denominator by Pr(H0 true)

    = (Power * R) / (Power * R + 0.05)
      ... what we really care about
PPV = (Power * R) / (Power * R + 0.05)

Suppose you are in a field where 1 in 5 hypotheses is correct, so R = 1/4 = 0.25.

Power = 20%: PPV = 0.2 * 0.25 / (0.2 * 0.25 + 0.05) = 0.50
Power = 80%: PPV = 0.8 * 0.25 / (0.8 * 0.25 + 0.05) = 0.80
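The same arithmetic as a small R helper (a minimal sketch; ppv() is just the formula above):

    # PPV as a function of power and pre-study odds R, assuming alpha = 0.05
    ppv <- function(power, R, alpha = 0.05) (power * R) / (power * R + alpha)
    ppv(power = 0.2, R = 0.25)  # 0.50
    ppv(power = 0.8, R = 0.25)  # 0.80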
If you conduct a low-powered study but you (correctly) reject H0, it is likely that your estimated effect is larger than the true effect. This is called “effect inflation.” (A simulation sketch follows.)
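A quick simulation makes effect inflation concrete (a sketch under assumed values: a true effect of 0.3 SD with n = 20 per group, a design with roughly 15% power):

    set.seed(1)
    true_effect <- 0.3
    n <- 20  # per group; power is only ~15% for this design
    est <- replicate(5000, {
      x <- rnorm(n)                      # control group
      y <- rnorm(n, mean = true_effect)  # treated group
      tt <- t.test(y, x)
      # record the estimated effect only when it reaches "significance"
      if (tt$p.value < 0.05) unname(tt$estimate[1] - tt$estimate[2]) else NA
    })
    mean(est, na.rm = TRUE)  # averages well above the true effect of 0.3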
We have an extraordinary problem with selective reporting and publication bias. We may also (sub)consciously manipulate the design, analysis, and interpretation of studies.

There is an over-reliance on p-values; it is preferable to look at confidence intervals.

The winner’s curse is also a problem of selection, and it occurs even in adequately powered studies. (Think about regression to the mean.)

Power calculations are more nuanced than this discussion: the choice of the ‘true’ H1 matters, 80% is arbitrary, and the results of studies are rarely yes/no.
Ioannidis JPA (2005). Why most published research findings are false. PLoS Med 2(8): e124. doi:10.1371/journal.pmed.0020124
Button KS, Ioannidis JPA, Mokrysz C, Nosek BA, Flint J, Robinson ESJ, and Munafo MR (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14, 451. doi:10.1038/nrn3502. Published online 15 April 2013.
Bacchetti P (2013). Small sample size is not the real problem. Nature Reviews Neuroscience, 14, 585. doi:10.1038/nrn3475-c3. Published online 3 July 2013.
“In our laboratory (the Stanford Exploration Project or SEP) we noticed that after a few months or years, researchers were usually unable to reproduce their own work without considerable agony.”
– Jon Claerbout, describing his experience in the mid-1980s
“The published documents are merely the advertisement of scholarship whereas the computer programs, input data, parameter values, etc. embody the scholarship itself.”
– Claerbout et al. (2000)
Requirement “that data sets and computer code be made available to others for verifying published results and conducting alternative analyses.” – Peng (2009); also Buckheit & Donoho (1995)

Many journals have policies consistent with this practice, e.g. Biostatistics, Annals of Internal Medicine, Nature, Science.

‘Electronic lab notebook’ containing the final product as well as the research workflow: the final product (a dynamic document) AND an archive of what other approaches were pursued and abandoned, as well as the research decisions made along the way. – Nolan (2010)

This is a work in progress for medical research!
Reproducible: start with the same “raw” data; repeat the cleaning, manipulation, and analyses; and end up with exactly the same results (parameter estimates, numbers in tables, and figures).

Test: give someone else your “raw” data, programs, and the methods section of the manuscript. Would they be able to reproduce your findings?

From Nature: “. . . we will more systematically ensure that key methodological details are reported, and we will give more space to methods sections. We will examine statistics more closely and encourage authors to be transparent, for example by including their raw data.”

Replicable: duplicate the general findings in a different environment, i.e. in a different lab, research group, or under slightly different experimental conditions.
Good:
• Well-commented statistical programs, with log files or another record of execution
• Version control for data, manuscripts, analyses
• Systems for connecting the final manuscript to data and programs
• Software packages that bundle data and programs

Not so good:
• Analyses conducted on the command line with no record of the sequence of code
• Data stored in Excel, without a record of updates or corrections
• Published papers with no record of the final analyses or data used in the manuscript
• Data and programs unavailable to the investigator, reviewers, or colleagues for replication or verification
Using Excel (or other interactive approaches) for data capture, manipulation, or analysis results in little or no documentation of data provenance or analysis!

“The most simple problems are common.” When using Excel, it is especially easy to make off-by-one errors (e.g. accidentally deleting a cell in one column) or to mix up group labels (e.g. swapping sensitive/resistant).
– Baggerly and Coombes (2009)

Do you have an Excel disaster story?
REDCap (Research Electronic Data Capture)

REDCap is a secure web application “designed exclusively to support data capture for research studies.” http://project-redcap.org/

Northwestern is part of the REDCap consortium. REDCap is free!

REDCap features:
• Rapid set-up
• Web-based data collection
• Data validation
• Export to statistical programs
• Supports HIPAA compliance
Statistical Programs: SAS, Stata, R, SPSS

Keep a record of any and all manipulations of the data. If you have to correct an error in the data, write the fix in your code! All your analyses should exist as a set of programming commands, or at least as a record of the commands you executed (e.g. “log” files in Stata).
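For example, a minimal sketch in R (the file and variable names are hypothetical) of correcting a data error in code rather than by hand:

    # The raw data file is never edited; every correction lives in code.
    dat <- read.csv("raw_data.csv")
    # 2013-11-14: age for subject 1047 was keyed as 9; chart review
    # confirmed 79. Fixing it here leaves a record and re-runs reproducibly.
    dat$age[dat$id == 1047] <- 79
    write.csv(dat, "clean_data.csv", row.names = FALSE)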
“R” is freely available, open-source statistical software. It is one of the main (if not the main) programs in use by statisticians. It has many add-on ‘packages’ for analyzing particular types of data, and it is very popular for genomics and bioinformatics. See http://cran.us.r-project.org/

R may not be quite as user-friendly as Stata or SPSS, but it’s getting better; RStudio is a nice environment for working with R.
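For instance, installing and loading an add-on package (here knitr, which reappears below) takes two lines:

    install.packages("knitr")  # install once
    library(knitr)             # load in each session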
Reproducible research is becoming part of ethical statistical and scientific practice. After the start-up cost, it actually makes life a LOT easier.

Not conducting reproducible research may have serious consequences:
• Damage to career and professional reputation
• Retraction of scientific papers
• Loss of public confidence in medical research
• Harm to patients
1. You find an error in your analysis code or in your data.
2. You fix the error (in a way that leaves a record of the fix).
3. You update your tables, figures, and the manuscript, possibly by hand.

What if step 3 were eliminated and happened at the touch of a button? Programs like knitR and Sweave, although still accessible mostly to the statistical community, are making this possible (see the sketch below).
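As a sketch of how this works, a knitr dynamic document (a hypothetical file, report.Rmd) interleaves prose with code chunks; re-knitting regenerates every number, table, and figure straight from the data:

    # Contents of report.Rmd (hypothetical):
    #   ```{r, echo = FALSE}
    #   dat <- read.csv("clean_data.csv")
    #   p <- mean(dat$depressed[dat$smoker == 1])
    #   ```
    #   The prevalence of depression among smokers was `r round(p, 2)`.
    #
    # Rebuild the whole report at the touch of a button:
    library(knitr)
    knit2html("report.Rmd")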
A Reproducible Research System (RRS) combines:
• a Reproducible Research Environment (RRE): computational tools that track data and analyses and package results (tables, figures)
• a Reproducible Research Publisher (RRP): a document preparation system with an easy link to the RRE

E.g. the GenePattern-Word RRS, developed in collaboration with Microsoft Research. – Jill Mesirov (2010)
References and Links

Series of articles in Nature: http://www.nature.com/news/announcement-reducing-our-irreproducibility-1.12852

The “Simply Statistics” blog has many excellent posts, references, and discussions of many topics, including reproducibility: http://simplystatistics.org/?s=reproducibility

Keith A. Baggerly and Kevin R. Coombes. “Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology” (2009) Annals of Applied Statistics, 3(4), pp. 1309-1334.

More technical references:

Deborah Nolan, Roger D. Peng, and Duncan Temple Lang. “Enhanced Dynamic Documents for Reproducible Research” (2010) Biomedical Informatics for Cancer Research, pp. 335-345.

Jill P. Mesirov. “Accessible Reproducible Research” (2010) Science, pp. 415-416.

Matthias Schwab, Martin Karrenbach, and Jon Claerbout. “Making scientific computations reproducible” (2000) Computing in Science and Engineering, 2, pp. 61-67.

Friedrich Leisch. “Sweave: Dynamic Generation of Statistical Reports Using Literate Data Analysis” (2002) Compstat 2002 – Proceedings in Computational Statistics, pp. 575-580. Physica Verlag, Heidelberg.

Russell Lenth and Soren Hojsgaard. “SASweave: Literate Programming Using SAS” (2007) Journal of Statistical Software, 19(8), pp. 1-20.

Roger D. Peng. “Reproducible research and Biostatistics” (2009) Biostatistics, pp. 405-408.

Paul Thompson and Andrew Burnett. “Reproducible Research” (2012) CORE Issues in Professional and Research Ethics, Volume 1, Paper 6. Accessed from http://nationalethicscenter.org/content/article/175

Jonathan B. Buckheit and David L. Donoho. “WaveLab and Reproducible Research” (1995) Technical report, Stanford University. Accessed from http://statistics.stanford.edu/~ckirby/techreports/NSF/EFS%20NSF%20474.pdf, February 2013.
EpiBio 560: Statistical Consulting is a ‘statistics practicum’ offered in the winter quarter for students in the Master of Science in Epidemiology and Biostatistics (MSEB) program.
The instructor, Dr. Kwang-Youn Kim, is on the lookout for real projects to help these students hone their consultation and analysis skills. The consultation and analysis are provided free of charge.
If you’re interested in volunteering your project, please contact Dr. Kim at kykim@northwestern.edu.
Biostatistics Collaboration Center http://www.feinberg.northwestern.edu/sites/bcc/
Thank you!
Evaluation forms!