Presentation - Feinberg School of Medicine

Biostatistics Collaboration Center http://www.feinberg.northwestern.edu/sites/bcc/

Basic Biostatistics in Medical Research:

Emerging Trends

November 14, 2013

Leah J. Welty, PhD

Biostatistics Collaboration Center

[Image credits: The New Yorker, http://www.newyorker.com/reporting/2010/12/13/101213fa_fact_lehrer; cartoon from The Economist]

Emerging Trends in Biostatistics

Power:

What is it?

How do you compute it?

Are we having a “power failure”?

Reproducible research:

How did it start?

What is it?

Why practice it?

Why is power important?

• Most granting agencies now require some sort of justification of sample size.

• A study with too much power will usually be costly, and will occasionally claim “significant” results that are not clinically relevant.

• A study that lacks power will often fail to reach "significance" – even when results are clinically meaningful. There is a known publication bias against studies with negative findings.

Slide credit: Dr. Mary Kwasny

Fundamental point

• [Studies] should have sufficient statistical power (usually 80%) to detect (clinically meaningful) differences between groups.

• To be assured of this without compromising levels of significance, a sample size calculation should be considered early in the planning stages.

Friedman LM, Furberg CD, and DeMets DL. Fundamentals of Clinical Trials, 4th Edition. New York: Springer-Verlag, 2010.

Slide credit: Dr. Mary Kwasny

"Testing" quick review

                                      Reality
  Test result                   H0 true              H1 true
  ------------------------------------------------------------------
  Reject H0 (p < 0.05)          Type I Error (α)     Power
                                α = 0.05 (5%)        0.80 (80%)
  Fail to reject H0 (p > 0.05)  Confidence           Type II Error (β)
                                0.95 (95%)           0.20 (20%)

Power = conditional probability = Pr(reject H0 | H1 true)

Slide credit: Dr. Mary Kwasny

Power and Sample Size

Power is tied to testing a specific hypothesis, e.g., a clinical trial (Is drug A better than drug B?).

For descriptive studies, there may be no central hypothesis (e.g., estimating the prevalence of autism), so sample size calculations may need to be based on the margin of error instead.

In practice, the power section of a grant is typically some combination of both.
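For the margin-of-error case, a quick sketch (not from the slides; the 0.20 prevalence and ±3% margin are hypothetical) using the usual normal-approximation formula n = z² p(1 − p) / E² in R:

    # Sample size to estimate a prevalence to within a +/- 3% margin of error
    # (hypothetical numbers; normal approximation n = z^2 * p * (1 - p) / E^2)
    p <- 0.20               # anticipated prevalence, e.g. from the literature
    E <- 0.03               # desired half-width of the 95% confidence interval
    z <- qnorm(0.975)       # 1.96 for 95% confidence
    ceiling(z^2 * p * (1 - p) / E^2)   # 683 participants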


Power Defined

Power = the probability that you reject the null hypothesis, given that the (specific) alternative is true
      = Pr(reject H0 | H1 true)

Acceptable power is usually 0.8 to 0.9 (80–90%). If your alternative hypothesis is true, you want to have a 'good chance' of detecting it.
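Because power is just a probability, it can also be estimated by simulation: generate data under the specific alternative many times and count how often H0 is rejected. A minimal sketch in R (the two-sample t-test setup here is hypothetical):

    # Simulated power: fraction of studies, generated under H1, that reject H0
    # Hypothetical H1: true difference of 0.5 SD, n = 64 per group
    set.seed(42)
    reject <- replicate(5000, t.test(rnorm(64), rnorm(64, mean = 0.5))$p.value < 0.05)
    mean(reject)   # approximately 0.80, i.e. about 80% power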


Note

Power as stated is vague (conditional on what, exactly?). In defining a "reality," we posit either no effect (the null) or some effect (the alternative).

This is OK, but it makes the investigator decide on a specific alternative under which to estimate power.

Slide credit: Dr. Mary Kwasny

What you need for power/sample size

1. Null hypothesis and (a specific) alternative hypothesis

2. The appropriate statistical method to test the null hypothesis

3. Effect size, or variability

4. Level of statistical significance (usually α = 0.05; this should be decided before starting a study)

5. EITHER power or sample size (solve for the other)

Power Example: Smoking & Depression

Research Question: Do elderly smokers have a greater prevalence of depression than elderly nonsmokers?

Literature Review: Prevalence of depression among elderly nonsmokers is 0.20.

Power/Sample Size Example

1. Null hypothesis and (a specific) alternative hypothesis (two-sided alternative):
   H0: prevalence of depression is the same in elderly smokers and elderly nonsmokers
   H1: prevalence of depression is different in elderly smokers and elderly nonsmokers

2. The appropriate statistical method to test the null hypothesis: chi-squared test. Talk to your friendly neighborhood statistician.

3. Effect size, or variability: prevalence among elderly nonsmokers = 0.2; prevalence among elderly smokers = 0.3. This comes from the literature, your past studies, pilot data, or even an educated guess. It cannot come from the study you're trying to power!

4. Level of statistical significance: α = 0.05. Typically 0.05; sometimes 0.01, for example in some clinical trials.

5. EITHER power or sample size (solve for the other): 80% power (1 − power = β = 20%). Usually 80% or 90%.


Power/Sample Size Example

Given #1–5, solve for sample size or power using:

• Your friendly neighborhood statistician
• Software (SAS, Stata, R, PASS)
• Tables
• Simulations

Result: 293 elderly nonsmokers & 293 elderly smokers
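As a check (not shown on the slide), the 293-per-group figure can be reproduced with R's built-in power.prop.test, which uses the standard normal approximation for comparing two proportions:

    # Sample size per group: detect prevalence 0.2 vs 0.3,
    # two-sided alpha = 0.05, 80% power
    power.prop.test(p1 = 0.2, p2 = 0.3, sig.level = 0.05, power = 0.80)
    # gives n of about 293 per group (rounding up)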

Are we having a “power failure”?

Series of article titles:

“Why Most Published Research Findings Are False.”

“Power failure: why small sample size undermines the reliability of neuroscience.”

“Small sample size is not the real problem.”

Problems with Low Power, #1: False Negatives

Suppose H1 is true. If Pr(reject H0 | H1 true) is only 10–20%, the chances of 'uncovering' H1 are small.

You fail to reject the null when you should. Wasted effort, money, resources?

Problems with Low Power, #2: Low Positive Predictive Value

PPV = Pr(H1 true | reject H0)

Let R = pre-study odds = Pr(H1 true) / Pr(H0 true)

(Think of H1 and H0 not as single hypotheses but as randomly selected from the collection of all hypotheses in a given field.)

Assume α = 0.05 (Type I error), so Pr(reject H0 | H0 true) = 0.05.

Problems with Low Power, #2 cont'd: Low Positive Predictive Value

PPV = Pr(H1 true | reject H0)   (what we really care about)

    = [Pr(reject H0 | H1 true) × Pr(H1 true)] / [Pr(reject H0 | H1 true) × Pr(H1 true) + Pr(reject H0 | H0 true) × Pr(H0 true)]   (Bayes' Theorem)

    = [Power × Pr(H1 true)] / [Power × Pr(H1 true) + 0.05 × Pr(H0 true)]   (definition of power and α)

    = (Power × R) / (Power × R + 0.05)   (dividing numerator and denominator by Pr(H0 true) — nifty trick)


Problems with Low Power, #2 cont'd: Low Positive Predictive Value

PPV = (Power × R) / (Power × R + 0.05)

Suppose you are in a field where 1 in 5 hypotheses is correct, so R = 1/4 = 0.25.

Power = 20%: PPV = (0.2 × 0.25) / (0.2 × 0.25 + 0.05) = 0.50
Power = 80%: PPV = (0.8 × 0.25) / (0.8 × 0.25 + 0.05) = 0.80
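The two PPV values are easy to verify; a one-line sketch in R using the formula above:

    # PPV = Power * R / (Power * R + alpha), with pre-study odds R
    ppv <- function(power, R, alpha = 0.05) power * R / (power * R + alpha)
    ppv(0.20, 0.25)   # 0.50
    ppv(0.80, 0.25)   # 0.80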

Problems with Low Power, #3: Winner's Curse

If you conduct a low-powered study but you (correctly) reject H0, it is likely that your estimated effect is larger than the true effect. This is called "effect inflation."
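A small simulation (not from the slides; the effect size and sample size are hypothetical) makes the inflation visible: among low-powered studies that happen to reach p < 0.05, the average estimated effect is well above the truth:

    # Winner's curse: condition on significance in a low-powered design
    # True effect = 0.2 SD; n = 20 per group gives roughly 10% power
    set.seed(1)
    sims <- replicate(10000, {
      x <- rnorm(20); y <- rnorm(20, mean = 0.2)
      c(est = mean(y) - mean(x), p = t.test(y, x)$p.value)
    })
    mean(sims["est", ])                      # ~0.2: unconditionally unbiased
    mean(sims["est", sims["p", ] < 0.05])    # ~0.65: inflated among the "winners"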

Is it really a power failure?

We have an extraordinary problem with selective reporting and publication bias.

We may also (sub)consciously manipulate the design, analysis, and interpretation of studies.

There is an over-reliance on p-values; it is preferable to look at confidence intervals.

The Winner's Curse is also a problem of selection, and it occurs even in adequately powered studies. Think about regression to the mean.

Power calculations are more nuanced than this discussion suggests: the choice of the 'true' H1 matters, 80% is arbitrary, and the results of studies are rarely yes/no.

References

Ioannidis JPA (2005). Why Most Published Research Findings Are False. PLoS Med 2(8): e124. doi:10.1371/journal.pmed.0020124

Button KS, Ioannidis JPA, Mokrysz C, Nosek BA, Flint J, Robinson ESJ, and Munafo MR (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14, 451. doi:10.1038/nrn3502. Published online 15 April 2013.

Bacchetti P (2013). Small sample size is not the real problem. Nature Reviews Neuroscience, 14, 585. doi:10.1038/nrn3475-c3. Published online 3 July 2013.

Reproducible Research

Origins of Reproducible Research

"In our laboratory (the Stanford Exploration Project or SEP) we noticed that after a few months or years, researchers were usually unable to reproduce their own work without considerable agony."
– Claerbout, describing experience in the mid-1980s

"The published documents are merely the advertisement of scholarship whereas the computer programs, input data, parameter values, etc. embody the scholarship itself."
– Claerbout et al. (2000)


What is reproducible research?

Requirement "that data sets and computer code be made available to others for verifying published results and conducting alternative analyses." – Peng (2009); also Buckheit & Donoho (1995)

Many journals have policies consistent with this practice, e.g. Biostatistics, Annals of Internal Medicine, Nature, Science.

An 'electronic lab notebook' containing the final product as well as the research workflow: the final product (a dynamic document) AND an archive of the other approaches that were pursued and abandoned, along with the research decisions made along the way. – Nolan (2010)

This is a work in progress for medical research!

Reproducible vs Replicable Research

Reproducible: Start with the same "raw" data. Repeat the cleaning, manipulation, and analyses, and end up with exactly the same results (parameter estimates, numbers in tables, and figures).

Test: Give someone else your "raw" data, programs, and the methods section of the manuscript. Would they be able to reproduce your findings?

From Nature: ". . . we will more systematically ensure that key methodological details are reported, and we will give more space to methods sections. We will examine statistics more closely and encourage authors to be transparent, for example by including their raw data."

Replicable: Duplicate the general findings in a different environment, i.e. in a different lab or research group, or under slightly different experimental conditions.

Examples of Reproducible Research

Good:
• Well-commented statistical programs, with log files or other records of execution
• Version control for data, manuscripts, and analyses
• Systems for connecting the final manuscript to data and programs
• Software packages that bundle data and programs

Not so good:
• Analyses conducted on the command line with no record of the sequence of code
• Data stored in Excel, without a record of updates or corrections
• Published papers with no record of the final analyses or data used in the manuscript
• Data and programs unavailable to the investigator, reviewers, or colleagues for replication or verification


Problems with MS Excel

Using Excel (or other interactive approaches) for data capture, manipulation, or analysis results in little or no documentation of data provenance or analysis!

"The most simple problems are common." When using Excel, it is especially easy to make off-by-one errors (e.g. accidentally deleting a cell in one column) or to mix up group labels (e.g. swapping sensitive/resistant).
– Baggerly and Coombes (2009)

Do you have an Excel disaster story?

Alternatives to Excel for Data Capture

REDCap (Research Electronic Data Capture) is a secure web application "designed exclusively to support data capture for research studies." http://project-redcap.org/

Northwestern is part of the REDCap consortium. REDCap is free!

REDCap features:
• Rapid set-up
• Web-based data collection
• Data validation
• Export to statistical programs
• Supports HIPAA compliance

Alternatives to Excel for Data Analysis

Statistical programs: SAS, Stata, R, SPSS

These programs can keep a record of any and all manipulations of the data. If you have to correct an error in the data, write the correction into your code! All your analyses should exist as a set of programming commands, or at least as a copy of the executed commands, e.g. "log" files in Stata.
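For example, a hypothetical sketch (file name and subject ID invented) of what a scripted, self-documenting correction looks like in R, in contrast to silently editing a spreadsheet cell:

    # Read the raw data (never edited by hand)
    raw <- read.csv("depression_raw.csv")        # hypothetical file

    # Correction, 2013-11-14: subject 1042's age was keyed as 7 instead of 70
    # (verified against the paper form) -- the fix lives in the code, with a reason
    raw$age[raw$id == 1042] <- 70

    write.csv(raw, "depression_clean.csv", row.names = FALSE)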

Alternatives to Excel for Data Analysis

"R" is freely available, open-source statistical software. It is one of the main (if not the main) programs in use by statisticians. It has many add-on 'packages' for analyzing particular types of data, and it is very popular in genomics and bioinformatics. See http://cran.us.r-project.org/

R may not be quite as user-friendly as Stata or SPSS, but it's getting better. RStudio provides a nice environment for working with R.

Why strive for reproducible research?

Reproducible research is becoming part of ethical statistical and scientific practice.

After the start-up cost, it actually makes life a LOT easier.

Not conducting reproducible research may have serious consequences:
• Damage to career and professional reputation
• Retraction of scientific papers
• Loss of public confidence in medical research
• Harm to patients

Why strive for reproducible research?

1. You find an error in your analysis code or in your data.

2. You fix the error (in a way that leaves a record of the fix).

3. You update your tables, figures, and the manuscript, possibly by hand.

What if step 3 happened automatically, at the touch of a button?

Programs like knitr and Sweave, although still accessible mostly to the statistical community, are making this possible.
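To give the flavor (a hypothetical report.Rmd, not from the slides): in a knitr/R Markdown document, the numbers in the text are recomputed from the data each time the report is rebuilt, e.g. with rmarkdown::render("report.Rmd"), so fixing the data automatically updates the manuscript:

    ```{r}
    dat <- read.csv("depression_clean.csv")   # cleaned data from the analysis script
    prev <- mean(dat$depressed)               # recomputed on every rebuild
    ```

    The estimated prevalence of depression was `r round(100 * prev)`% in this sample.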

The future of reproducible research in a collaborative medical environment?

A Reproducible Research System (RRS) combines:

Reproducible Research Environment (RRE)
• Computational tools
• Tracks data and analyses
• Packages results (tables, figures)

Reproducible Research Publisher (RRP)
• Document preparation system
• Easy link to the RRE

E.g. the GenePattern-Word RRS, developed in collaboration with Microsoft Research.
– Jill Mesirov (2010)

References and Links

Series of articles in Nature: http://www.nature.com/news/announcement-reducing-our-irreproducibility-1.12852

The "Simply Statistics" blog has many excellent posts, references, and discussions of many topics, including reproducibility: http://simplystatistics.org/?s=reproducibility

Keith A. Baggerly and Kevin R. Coombes. "Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology" (2009) Annals of Applied Statistics, 3(4), pp. 1309–1334.

More technical references:

Deborah Nolan, Roger D. Peng, and Duncan Temple Lang. "Enhanced Dynamic Documents for Reproducible Research" (2010) Biomedical Informatics for Cancer Research, pp. 335–345.

Jill P. Mesirov. "Accessible Reproducible Research" (2010) Science, pp. 415–416.

Matthias Schwab, Martin Karrenbach, and Jon Claerbout. "Making scientific computations reproducible" (2000) Computing in Science and Engineering, 2, pp. 61–67.

Friedrich Leisch. "Sweave: Dynamic Generation of Statistical Reports Using Literate Data Analysis" (2002) Proceedings in Computational Statistics, pp. 575–580. Physica Verlag, Heidelberg.

Russell Lenth and Soren Hojsgaard. "SASweave: Literate Programming Using SAS" (2007) Journal of Statistical Software, 19(8), pp. 1–20.

Roger D. Peng. "Reproducible research and Biostatistics" (2009) Biostatistics, pp. 405–408.

Paul Thompson and Andrew Burnett. "Reproducible Research" (2012) CORE Issues in Professional and Research Ethics, Volume 1, Paper 6. Accessed from http://nationalethicscenter.org/content/article/175

Jonathan B. Buckheit and David L. Donoho. "WaveLab and Reproducible Research" (1995) Technical report, Department of Statistics, Stanford University. Accessed from http://statistics.stanford.edu/~ckirby/techreports/NSF/EFS%20NSF%20474.pdf, February 2013.

A plug for EpiBio 560

EpiBio 560: Statistical Consulting is a 'statistics practicum' offered in the winter quarter for students in the Master of Science in Epidemiology and Biostatistics (MSEB) program.

The instructor, Dr. Kwang-Youn Kim, is on the lookout for real projects to help these students hone their consultation and analysis skills. The consultation and analysis are provided free of charge.

If you're interested in volunteering your project, please contact Dr. Kim at kykim@northwestern.edu.

Biostatistics Collaboration Center http://www.feinberg.northwestern.edu/sites/bcc/

Thank you!

Evaluation forms!
