Clinical Trials & Design of Experiments

Introduction to Biostatistics
for Clinical and Translational
Researchers
KUMC Departments of Biostatistics & Internal Medicine
University of Kansas Cancer Center
FRONTIERS: The Heartland Institute of Clinical and Translational Research
Course Information
 Jo A. Wick, PhD
 Office Location: 5028 Robinson
 Email: jwick@kumc.edu
 Lectures are recorded and posted at
http://biostatistics.kumc.edu under ‘Events &
Lectures’
Objectives
 Understand the role of statistics in the scientific
process and how it is a core component of
evidence-based medicine
 Understand features, strengths and limitations of
descriptive, observational and experimental
studies
 Distinguish between association and causation
 Understand roles of chance, bias and
confounding in the evaluation of research
Course Calendar
 July 5: Introduction to Statistics: Core Concepts
 July 12: Quality of Evidence: Considerations for
Design of Experiments and Evaluation of Literature
 July 19: Hypothesis Testing & Application of
Concepts to Common Clinical Research Questions
 July 26: (Cont.) Hypothesis Testing & Application
of Concepts to Common Clinical Research
Questions
Why is there conflicting
evidence?
 Answer: There is no perfect research study.
 Every study has limitations.
 Every study has context.
 Medicine (and research!) is an art as well as a
science.
 Unfortunately, the literature is full of poorly
designed, poorly executed and improperly
interpreted studies—it is up to you, the
consumer, to critically evaluate its merit.
Critical Evaluation
Validity and Relevance
 Is the article from a peer-reviewed journal?
 How does the location of the study reflect the
larger context of the population?
 Does the sample reflect the targeted population?
 Is the study sponsored by an organization that may
influence the study design or results?
 Is the intervention feasible? available?
Critical Evaluation
Intent
 Therapy: testing the efficacy of drug treatments,
surgical procedures, alternative methods of delivery,
etc. (RCT)
 Diagnosis: demonstrating whether a new diagnostic
test is valid (Cross-sectional survey)
 Screening: demonstrating the value of tests which
can be applied to large populations and which pick up
disease at a presymptomatic stage (Cross-sectional
survey)
 Prognosis: determining what is likely to happen to
someone whose disease is picked up at an early stage
(Longitudinal cohort)
Critical Evaluation
 Causation: determining whether a harmful agent
is related to development of illness (Cohort or
case-control)
Critical Evaluation
Validity based on intent
 What is the study design? Is it appropriate and
optimal for the intent?
 Are all participants who entered the trial accounted
for in the conclusion?
 What protections against bias were put into place?
Blinding? Controls? Randomization?
 If there were treatment groups, were the groups
similar at the start of the trial?
 Were the groups treated equally (aside from the
actual intervention)?
Critical Evaluation
 If statistically significant, are the results clinically
meaningful?
 If negative, was the study powered prior to
execution?
 Were there other factors not accounted for that
could have affected the outcome?
Miser, WF Primary Care 2006.
Experimental Design
 Statistical analysis, no matter how intricate, cannot
rescue a poorly designed study.
 No matter how efficient, statistical analysis cannot
be done overnight.
 A researcher should plan and state what they are
going to do, do it, and then report those results.
 Be transparent!
Types of Samples
 Random sample: each person has an equal chance
of being selected.
 Convenience sample: persons are selected
because they are convenient or readily available.
 Systematic sample: persons are selected based on
a pattern.
 Stratified sample: persons are selected from within
subgroups.
The principal way to guarantee that the sample ...
[Figure: a sample drawn from a population]
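The systematic and stratified schemes above can be sketched in a few lines of Python. This is only an illustration: the function names, the unit IDs and the strata are hypothetical and do not come from the lecture.

```python
import random

def systematic_sample(units, step, start=0):
    """Select every `step`-th unit, beginning at index `start` (a fixed pattern)."""
    return units[start::step]

def stratified_sample(units, stratum_of, n_per_stratum, seed=None):
    """Draw a simple random sample of equal size from within each subgroup."""
    rng = random.Random(seed)
    strata = {}
    for unit in units:
        strata.setdefault(stratum_of(unit), []).append(unit)
    return {s: rng.sample(members, n_per_stratum) for s, members in strata.items()}

# Hypothetical example: 20 subject IDs, stratified by odd/even ID.
ids = list(range(20))
every_fifth = systematic_sample(ids, step=5)
by_parity = stratified_sample(ids, lambda u: u % 2, n_per_stratum=3, seed=1)
```

A convenience sample has no sampling mechanism to code, which is exactly why it offers no protection against selection bias.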
Random Sampling
 For studies, it is optimal (but not always possible)
for the sample providing the data to be
representative of the population under study.
 Simple random sampling provides a
representative sample (theoretically) and
protections against selection bias.
 A sampling scheme in which every possible sub-sample
of size n from a population is equally likely to be selected
 Assuming the sample is representative, the summary
statistics (e.g., mean) should be ‘good’ estimates of the
true quantities in the population.
• The larger n is, the better estimates will be.
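As a rough illustration of simple random sampling, the sketch below draws samples of increasing size from a made-up population and prints how far each sample mean lands from the population mean. The population values (mean 120, SD 15) and the function name are assumptions chosen only for the demonstration; the error tends to shrink as n grows, but any single draw can buck the trend.

```python
import random
import statistics

def simple_random_sample(population, n, seed=None):
    """Draw n units without replacement; every subset of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, n)

# Hypothetical population of 10,000 measurements (simulated, not real data).
_rng = random.Random(1)
population = [_rng.gauss(120, 15) for _ in range(10_000)]
true_mean = statistics.fmean(population)

for n in (10, 100, 1000):
    sample = simple_random_sample(population, n, seed=n)
    error = abs(statistics.fmean(sample) - true_mean)
    print(f"n = {n:4d}  |sample mean - population mean| = {error:.2f}")
```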
Random Samples
 The Fundamental Rule of Using Data for
Inference requires the use of random sampling or
random assignment.
 Random sampling or random assignment ensures
control over “nuisance” variables.
 We can randomly select individuals to ensure that
the population is well-represented.
 Equal sampling of males and females
 Equal sampling from a range of ages
 Equal sampling from a range of BMI, weight, etc.
Random Samples
 Randomly assigning subjects to treatment levels to
ensure that the levels differ only by the treatment
administered.
 weights
 ages
 risk factors
Nuisance Variation
 Nuisance variation is any undesired sources of
variation that affect the outcome.
 Can systematically distort results in a particular
direction—referred to as bias.
 Can increase the variability of the outcome being
measured—results in a less powerful test because of too
much ‘noise’ in the data.
Example: Albino Rats
 It is hypothesized that exposing albino rats to
microwave radiation will decrease their food
consumption.
 Intervention: exposure to radiation
 Levels (binary): exposure or non-exposure
 Levels (graded): 0, 20,000, 40,000, 60,000 µW
 Measurable outcome: amount of food consumed
 Possible nuisance variables: sex, weight,
temperature, previous feeding experiences
Experimental Design
 Types of data collected in a clinical trial:
 Treatment – the patient’s assigned treatment and actual
treatment received
 Response – measures of the patient’s response to
treatment including side-effects
 Prognostic factors (covariates) – details of the patient’s
initial condition and previous history upon entry into the
trial
Experimental Design
 Three basic types of outcome data:
 Qualitative – nominal or ordinal, success/failure, CR, PR,
Stable disease, Progression of disease
 Quantitative – interval or ratio, raw score, difference,
ratio, %
 Time to event – survival or disease-free time, etc.
Experimental Design
 Formulate statistical hypotheses that are germane
to the scientific hypothesis.
 Determine:
 experimental conditions to be used (independent
variable(s))
 measurements to be recorded
 extraneous conditions to be controlled (nuisance
variables)
Experimental Design
 Specify the number of subjects required and the
population from which they will be sampled.
 Power, Type I & II errors
 Specify the procedure for assigning subjects to the
experimental conditions.
 Determine the statistical analysis that will be
performed.
Experimental Design
 Considerations:
 Does the design permit the calculation of a valid estimate
of treatment effect?
 Does the data-collection procedure produce reliable
results?
 Does the design possess sufficient power to permit an
adequate test of the hypotheses?
Experimental Design
 Considerations:
 Does the design provide maximum efficiency within the
constraints imposed by the experimental situation?
 Does the experimental procedure conform to accepted
practices and procedures used in the research area?
• Facilitates comparison of findings with the results of other
investigations
Types of Studies
 Purpose of research
1) To explore
2) To describe or classify
3) To establish relationships
4) To establish causality
 Strategies for accomplishing these purposes:
1) Naturalistic observation
2) Case study
3) Survey
4) Quasi-experiment
5) Experiment
[Figure: the strategies run from most ambiguity (naturalistic observation) to most control (experiment)]
Generating Evidence
[Figure: classification of studies, ordered by increasing complexity and confidence]
 Descriptive studies
 Populations
 Individuals: case reports, case series, cross-sectional
 Analytic studies
 Observational: case-control, cohort
 Experimental: RCT
Observation versus Experiment
 A designed experiment involves the investigator
assigning (preferably randomly) some or all
conditions to subjects.
 An observational study includes conditions that
are observed, not assigned.
Example: Heart Study
 Question: How does serum total cholesterol vary
by age, gender, education, and use of blood
pressure medication? Does smoking affect any of
the associations?
 Recruit n = 3000 subjects over two years
 Take blood samples and have subjects answer a
CVD risk factor survey
 Outcome: Serum total cholesterol
 Factors: BP meds (observed, not assigned)
 Confounders?
Example: Diabetes
 Question: Will a new treatment help overweight
people with diabetes lose weight?
 N = 40 obese adults with Type II (non-insulin
dependent) diabetes (20 female/20 male)
 Randomized, double-blind, placebo-controlled
study of treatment versus placebo
 Outcome: Weight loss
 Factor: Treatment versus placebo
Cross-Sectional Studies
 Designed to assess the association between an
independent variable (exposure?) and a
dependent variable (disease?)
 Selection of study subjects is based on both their
exposure and outcome status, thus there is no
direction of inquiry
Cross-Sectional Studies
[Figure: from a defined population, gather data on exposure and disease at the same time, classifying each subject as exposed and diseased, exposed with no disease, not exposed and diseased, or not exposed with no disease]
Cross-Sectional Studies
 Cannot determine causal relationships between
exposure and outcome
 Cannot determine temporal relationship between
exposure and outcome
 Supported conclusion: “Exposure is associated with Disease”
 Not supported: “Exposure causes Disease” or “Disease follows Exposure”
Analysis of Cross-Sectional Data
              Disease +   Disease -
Exposure +        a           b
Exposure -        c           d
Prevalence of disease compared in exposed versus non-exposed groups:
$P(D^{+} \mid E^{+}) = \frac{a}{a+b}$ versus $P(D^{+} \mid E^{-}) = \frac{c}{c+d}$
Prevalence of exposure compared in diseased versus non-diseased groups:
$P(E^{+} \mid D^{+}) = \frac{a}{a+c}$ versus $P(E^{+} \mid D^{-}) = \frac{b}{b+d}$
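The four prevalences can be computed directly from the cells of the 2x2 table (rows are exposure +/-, columns are disease +/-). A minimal sketch, with hypothetical counts that are not taken from any study:

```python
def cross_sectional_prevalences(a, b, c, d):
    """2x2 table: a = E+D+, b = E+D-, c = E-D+, d = E-D-."""
    return {
        "P(D+|E+)": a / (a + b),  # prevalence of disease among the exposed
        "P(D+|E-)": c / (c + d),  # prevalence of disease among the non-exposed
        "P(E+|D+)": a / (a + c),  # prevalence of exposure among the diseased
        "P(E+|D-)": b / (b + d),  # prevalence of exposure among the non-diseased
    }

# Hypothetical counts, for illustration only.
prevalences = cross_sectional_prevalences(a=30, b=70, c=10, d=90)
for label, value in prevalences.items():
    print(f"{label} = {value:.3f}")
```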
Case-Control Studies
 Designed to assess the association between
disease and past exposures
 Selection of study subjects is based on their
disease status
 Direction of inquiry is backward
Case-Control Studies
[Figure: from a defined population, gather data on disease to select diseased and non-diseased subjects, then look backward in time (direction of inquiry) to classify each group as exposed or unexposed]
Analysis of Case-Control Data
              Disease +   Disease -   Total
Exposure +        a           b        a+b
Exposure -        c           d        c+d
Total            a+c         b+d
Odds ratio: odds of case exposure divided by odds of control exposure
$\mathrm{OR} = \frac{a/c}{b/d} = \frac{ad}{bc}$
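A short sketch of the odds ratio calculation, using the same cell labels as the table. The counts are hypothetical and chosen only to show that the two forms of the formula agree.

```python
def odds_ratio(a, b, c, d):
    """Odds of exposure among cases (a/c) divided by odds among controls (b/d)."""
    odds_cases = a / c
    odds_controls = b / d
    return odds_cases / odds_controls

# Hypothetical counts: 30 exposed cases, 70 exposed controls,
# 10 unexposed cases, 90 unexposed controls.
or_estimate = odds_ratio(a=30, b=70, c=10, d=90)
cross_product = (30 * 90) / (70 * 10)  # the ad/bc shortcut
print(f"OR = {or_estimate:.3f}, ad/bc = {cross_product:.3f}")
```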
Cohort Studies
 Designed to assess the association between
exposures and disease occurrence
 Selection of study subjects is based on their
exposure status
 Direction of inquiry is forward
Cohort Studies
[Figure: from a defined population, gather data on exposure, then follow exposed and not-exposed subjects forward in time (direction of inquiry) to observe disease or no disease in each group]
Cohort Studies
 Attrition or loss to follow-up
 Time and money!
 Inefficient for very rare outcomes
 Bias
 Outcome ascertainment
 Information bias
 Non-response bias
Analysis of Cohort Data
              Disease +   Disease -   Total
Exposure +        a           b        a+b
Exposure -        c           d        c+d
Total            a+c         b+d
Relative risk: risk of disease in exposed divided by risk of disease in unexposed
$\mathrm{RR} = \frac{a/(a+b)}{c/(c+d)}$
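The relative risk follows the same pattern, but conditions on exposure rather than disease. A minimal sketch with hypothetical counts:

```python
def relative_risk(a, b, c, d):
    """Risk of disease in the exposed, a/(a+b), over risk in the unexposed, c/(c+d)."""
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return risk_exposed / risk_unexposed

# Hypothetical cohort: 30 of 100 exposed and 10 of 100 unexposed develop disease.
rr_estimate = relative_risk(a=30, b=70, c=10, d=90)
print(f"RR = {rr_estimate:.2f}")
```

With these made-up counts the exposed group has three times the risk of the unexposed group.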
Randomized Controlled Trials
 Designed to test the association between
exposures and disease
 Selection of study subjects is based on their
assigned exposure status
 Direction of inquiry is forward
Randomized Controlled Trials
[Figure: from a defined population, randomize subjects to exposed (treated) or not exposed (control), then follow both groups forward in time (direction of inquiry) to observe disease or no disease]
Why do we randomize?
 Suppose we wish to compare surgery for CAD to
a drug used to treat CAD. We know that such
major heart surgery is invasive and complex—
some people die during surgery. We may assign
the patients with less severe CAD (on purpose or
not) to the surgery group.
 If we see a difference in patient survival, is it due to
surgery versus drugs or to less severe disease versus
more severe disease?
 Such a study would be inconclusive and a waste of time,
money and patients.
How could we fix it?
 Randomize!
 Randomization is critical because there is no way for a
researcher to be aware of all possible confounders.
 Observational studies have little to no formal control for
any confounders—thus we cannot conclude cause and
effect based on their results.
 Randomization forms the basis of inference.
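One simple way to randomize is to shuffle the subject list and deal subjects out to the arms in turn, so that allocation cannot depend on disease severity or any other characteristic. The sketch below is an assumption-laden illustration (hypothetical subject IDs, two arms, equal allocation), not a description of any particular trial's procedure.

```python
import random

def randomize(subject_ids, arms=("treatment", "control"), seed=None):
    """Shuffle subjects, then alternate arms so group sizes are as equal as possible."""
    rng = random.Random(seed)
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)
    return {sid: arms[i % len(arms)] for i, sid in enumerate(shuffled)}

# Hypothetical trial with 40 subjects numbered 1 to 40.
allocation = randomize(range(1, 41), seed=2024)
n_treated = sum(arm == "treatment" for arm in allocation.values())
print(f"{n_treated} treated, {len(allocation) - n_treated} control")
```

In practice the allocation list is generated before enrollment and concealed from the people recruiting patients.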
Other Protections Against Bias
 Blinding
 Single (patient only), double (patient and evaluator), and
triple (patient, evaluator, statistician) blinding is possible
 Eliminates biases that can arise from knowledge of
treatment
 Control
 Null (no treatment), placebo (no active treatment), active
(current standard of care) controls are used
 Eliminates biases that can arise from the natural
progression of disease (null control) or simply from the
act of being treated (placebo)
Analysis of RCT Data
 What kind of outcome do you have?
 Continuous? Categorical?
 How many samples (groups) do you have?
 Are they related or independent?
Types of Tests
 Parametric methods: make assumptions about
the distribution of the data (e.g., normally
distributed) and are suited for sample sizes large
enough to assess whether the distributional
assumption is met
 Nonparametric methods: make no assumptions
about the distribution of the data and are suitable
for small sample sizes or large samples where
parametric assumptions are violated
 Use ranks of the data values rather than actual data
values themselves
 Loss of power when parametric test is appropriate
Analysis of RCT Data
 Two independent percentages? Fisher’s Exact
test, chi-square test, logistic regression
 Two independent means? Mann-Whitney, two-sample
t-test, analysis of variance, linear regression
 Two independent time-to-event outcomes? Log-rank
test, Wilcoxon test, Cox regression
 Any adjustments for other prognostic factors can
be accomplished with the appropriate regression
models (e.g., logistic for yes/no outcomes, linear
for continuous, Cox for time-to-event)
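For the "two independent percentages" case, the Pearson chi-square test on a 2x2 table is small enough to write out by hand. The sketch below uses no continuity correction and relies on the fact that a chi-square variable with 1 degree of freedom is a squared standard normal; the counts are hypothetical. With small expected cell counts, Fisher's exact test would be the safer choice.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for a 2x2 table.

    Rows are the two groups; columns are event / no event.
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 df, P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical trial: 30/100 events on one arm versus 10/100 on the other.
stat, p_value = chi_square_2x2(a=30, b=70, c=10, d=90)
print(f"chi-square = {stat:.2f}, p = {p_value:.5f}")
```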
Threats to Valid Inference
 Statistical Conclusion Validity
• Low statistical power - failing to reject a false hypothesis because
of inadequate sample size, irrelevant sources of variation that are
not controlled, or the use of inefficient test statistics.
• Violated assumptions - test statistics have been derived
conditioned on the truth of certain assumptions. If their tenability is
questionable, incorrect inferences may result.
 Many methods are based on approximations to a
normal distribution or another probability
distribution that becomes more accurate as sample
size increases—using these methods for small
sample sizes may produce unreliable results.
Threats to Valid Inference
 Statistical Conclusion Validity
 Reliability of measures and treatment implementation.
 Random variation in the experimental setting and/or
subjects.
• Inflation of variability may result in not rejecting a false hypothesis
(loss of power).
Threats to Valid Inference
 Internal Validity
 Uncontrolled events - events other than the
administration of treatment that occur between the time
the treatment is assigned and the time the outcome is
measured.
 The passing of time - processes not related to treatment
that occur simply as a function of the passage of time that
may affect the outcome.
Threats to Valid Inference
 Internal Validity
 Instrumentation - changes in the calibration of a
measuring instrument, the use of more than one
instrument, shifts in subjective criteria used by observers,
etc.
• The “John Henry” effect - compensatory rivalry by subjects
receiving less desirable treatments.
• The “placebo” effect - a subject behaves in a manner consistent
with his or her expectations.
Threats to Valid Inference
 External Validity—Generalizability
 Reactive arrangements - subjects who are aware that
they are being observed may behave differently than
subjects who are not aware.
 Interaction of testing and treatment - pretests may
sensitize subjects to a topic and enhance the
effectiveness of a treatment.
Threats to Valid Inference
 External Validity—Generalizability
 Self-selection - the results may only generalize to
volunteer populations.
 Interaction of setting and treatment - results obtained in a
clinical setting may not generalize to the outside world.
Clinical Trials—Purpose
 Prevention trials look for more effective/safer
ways to prevent a disease in individuals who have
never had it, or to prevent a disease from recurring
in individuals who have.
 Screening trials attempt to identify the best
methods for detecting diseases or health
conditions.
 Diagnostic trials are conducted to distinguish
better tests or procedures for diagnosing a
particular disease or condition.
Clinical Trials—Purpose
 Treatment trials assess experimental treatments,
new combinations of drugs, or new approaches to
surgery or radiation therapy for efficacy and safety.
 Quality of life (supportive care) trials explore
means to improve comfort and quality of life for
individuals with chronic illness.
Classification according to the U.S. National Institutes of Health
Clinical Trials—Phases
 Pre-clinical studies involve in vivo and in vitro
testing of promising compounds to obtain
preliminary efficacy, toxicity, and pharmacokinetic
information to assist in making decisions about
future studies in humans.
Clinical Trials—Phases
 Phase 0 studies are exploratory, first-in-human
trials, that are designed to establish very early on
whether the drug behaves in human subjects as
was anticipated from preclinical studies.
 Typically utilizes N = 10 to 15 subjects to assess
pharmacokinetics and pharmacodynamics.
 Allows the go/no-go decision usually made from animal
studies to be based on preliminary human data.
Clinical Trials—Phases
 Phase I studies assess the safety, tolerability,
pharmacokinetics, and pharmacodynamics of a
drug in healthy volunteers (industry standard) or
patients (academic/research standard).
 Involves dose-escalation studies which attempt to identify
an appropriate therapeutic dose.
 Utilizes small samples, typically N = 20 to 80 subjects.
Clinical Trials—Phases
 Phase II studies assess the efficacy of the drug
and continue the safety assessments from phase I.
 Larger groups are usually used, N = 20 to 300.
 Their purpose is to confirm efficacy (i.e., estimation of
effect), not necessarily to compare experimental drug to
placebo or active comparator.
Clinical Trials—Phases
 Phase III studies are the definitive assessment of
a drug’s effectiveness and safety in comparison
with the current gold standard treatment.
 Much larger sample sizes are utilized, N = 300 to 3,000,
and multiple sites can be used to recruit patients.
 Because they are quite an investment, they are usually
randomized, controlled studies.
Clinical Trials—Phases
 Phase IV studies are also known as post-
marketing surveillance trials and involve the
ongoing or long-term assessment of safety in
drugs that have been approved for human use.
 Detect any rare or long-term adverse effects in a much
broader patient population
The Size of a Clinical Trial
 Lasagna’s Law
 Once a clinical trial has started, the number of suitable
patients dwindles to a tenth of what was calculated before
the trial began.
The Size of a Clinical Trial
 “How many patients do we need?”
 Statistical methods can be used to determine the
required number of patients to meet the trial’s
principal scientific objectives.
 Other considerations that must be accounted for
include availability of patients and resources and
the ethical need to prevent any patient from
receiving inferior treatment.
 We want the minimum number of patients required to
achieve our principal scientific objective.
The Size of a Clinical Trial
 Estimation trials involve the use of point and
interval estimates to describe an outcome of
interest.
 Hypothesis testing is typically used to detect a
difference between competing treatments.
The Size of a Clinical Trial
 Type I error rate (α): the risk of concluding a
significant difference exists between treatments
when the treatments are actually equally effective.
 Type II error rate (β): the risk of concluding no
significant difference exists between treatments
when the treatments are actually different.
The Size of a Clinical Trial
 Power (1 – β): the probability of correctly detecting
a difference between treatments—more commonly
referred to as the power of the test.
                 Truth: H1    Truth: H0
Conclude H1        1–β           α
Conclude H0         β           1–α
The Size of a Clinical Trial
 Setting three determines the fourth:
 For the chosen level of significance (α), a clinically
meaningful difference (δ) can be detected with a
minimally acceptable power (1 – β) with n subjects.
 Depending on the nature of the outcome, the same
applies: For the chosen level of significance (α), an
outcome can be estimated within a specified margin of
error (ME) with n subjects.
Example: Detecting a Difference
 The Anturane Reinfarction Trial Research Group
(1978) describe the design of a randomized
double-blind trial comparing anturan and placebo
in patients after a myocardial infarction.
 What is the main purpose of the trial?
 What is the principal measure of patient outcome?
 How will the data be analyzed to detect a treatment
difference?
 What type of results does one anticipate with standard
treatment?
 How small a treatment difference is it important to detect
and with what degree of uncertainty?
Example: Detecting a Difference
 Primary objective: To see if anturan is of value in
preventing mortality after a myocardial infarction.
 Primary outcome: Treatment failure is indicated by
death within one year of first treatment (0/1).
 Data analysis: Comparison of percentages of
patients dying within first year on anturan (π1)
versus placebo (π2) using a χ2 test at the α = 0.05
level of significance.
Example: Detecting a Difference
 Expected results under placebo: One would
expect about 10% of patients to die within a year
(i.e., π2 = .1).
 Difference to detect (δ): It is clinically interesting
to be able to determine if anturan can halve the
mortality—i.e., 5% of patients die within a year—
and we would like to be 90% sure that we detect
this difference as statistically significant.
Example: Detecting a Difference
 We have:
 H0: π1 = π2 versus H1: π1 ≠ π2 (two-sided test)
 α = 0.05
 1 – β = 0.90
 δ = π2 – π1 = 0.05
 n = 583 patients per group is required
 The estimate of power for this test is a function of
sample size:
$$1-\beta = P\left(z \le \frac{-z_{\alpha/2}\,\mathrm{SE}(\delta) - \delta}{\sqrt{p_1 q_1/n_1 + p_2 q_2/n_2}}\right) + P\left(z \ge \frac{z_{\alpha/2}\,\mathrm{SE}(\delta) - \delta}{\sqrt{p_1 q_1/n_1 + p_2 q_2/n_2}}\right)$$
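A back-of-the-envelope version of this calculation uses the common normal-approximation formula n = (z_{α/2} + z_β)² [π1(1 − π1) + π2(1 − π2)] / δ² per group. The sketch below applies it to the anturan design values. It returns a figure close to, but not necessarily equal to, the 583 quoted above; the source does not state which formula variant produced that number, so the function here is an assumption rather than the lecture's method.

```python
import math
from statistics import NormalDist

def n_per_group_two_proportions(p1, p2, alpha=0.05, power=0.90):
    """Approximate per-group sample size for a two-sided test of two proportions."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_beta = z.inv_cdf(power)           # quantile matching the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# Design values from the example: 5% mortality on anturan, 10% on placebo.
n = n_per_group_two_proportions(p1=0.05, p2=0.10, alpha=0.05, power=0.90)
print(f"approximately {n} patients per group")
```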
Example: Detecting a Difference
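[Figure: sampling distributions under H0 and H1. The two tails beyond -zα/2 and zα/2 (area α/2 each) are "Reject H0, conclude difference"; the central region (area 1-α) is "Fail to reject H0, conclude no difference"; under H1 the areas β and 1-β fall inside and outside the central region.]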
Power and Sample Size
 n is roughly inversely proportional to δ²; for fixed α and
β, halving the difference in rates to be detected
results in a fourfold increase in sample size.
 n depends on the choice of β such that an increase in
power from 0.5 to 0.95 requires around 3 times the
number of patients.
 Reducing α from 0.05 to 0.01 results in an increase in
sample size of around 40% when β is around 10%.
 Using a one-sided test reduces the required sample
size.
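These rules of thumb can be checked numerically. Under the usual normal approximation (an assumption of this sketch), the required n is proportional to (z_α + z_β)² / δ², so ratios of that multiplier show how n changes when α, β or the sidedness of the test changes.

```python
from statistics import NormalDist

def multiplier(alpha, beta, two_sided=True):
    """(z_alpha + z_beta)^2, the factor that n is proportional to for fixed delta."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2) if two_sided else z.inv_cdf(1 - alpha)
    z_beta = z.inv_cdf(1 - beta)
    return (z_alpha + z_beta) ** 2

# Power 0.50 -> 0.95 at alpha = 0.05
ratio_power = multiplier(0.05, 0.05) / multiplier(0.05, 0.50)
# alpha 0.05 -> 0.01 at beta = 0.10
ratio_alpha = multiplier(0.01, 0.10) / multiplier(0.05, 0.10)
# One-sided versus two-sided at alpha = 0.05, beta = 0.10
ratio_one_sided = multiplier(0.05, 0.10, two_sided=False) / multiplier(0.05, 0.10)
# Halving delta: n is proportional to 1 / delta^2
ratio_half_delta = (1 / 0.5) ** 2

print(f"{ratio_power:.2f} {ratio_alpha:.2f} {ratio_one_sided:.2f} {ratio_half_delta:.0f}")
```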
Example: Detecting a Difference
 Primary objective: To see if treatment A increases
outcome W.
 Primary outcome: The primary outcome, W, is
continuous.
 Data analysis: Comparison of mean response of
patients on treatment A (μ1) versus placebo (μ2)
using a two-sided t-test at the α = 0.05 level of
significance.
Example: Detecting a Difference
 Expected results under placebo: One would
expect a mean response of 10 (i.e., μ2 = 10).
 Difference to detect (δ): It is clinically interesting to
be able to determine if treatment A can increase
response by 10%—i.e., we would like to see a
mean response of 11 (10 + 1) in patients getting
treatment A and we would like to be 80% sure that
we detect this difference as statistically significant.
Example: Detecting a Difference
 We have:
 H0: μ1 = μ2 versus H1: μ1 ≠ μ2 (two-sided test)
 α = 0.05
 1 – β = 0.80
 δ = 1
 For continuous outcomes we need to determine
what difference would be clinically meaningful, but
specified in the form of an effect size which takes
into account the variability of the data.
Example: Detecting a Difference
 Effect size is the difference in the means divided
by the standard deviation, usually of the control or
comparison group, or the pooled standard
deviation of the two groups
$d = \frac{\mu_1 - \mu_2}{\sigma}$ where $\sigma = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$
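To see how effect size drives sample size for a continuous outcome, the sketch below uses the normal-approximation formula n = 2 (z_{α/2} + z_β)² / d² per group, where d is the difference in means divided by a common standard deviation. The standard deviation of 2 is purely hypothetical (the example does not give one), and an exact t-test calculation would ask for a subject or so more per group.

```python
import math
from statistics import NormalDist

def n_per_group_two_means(delta, sd, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-sample comparison of means."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    d = delta / sd  # standardized effect size (assumes a common SD in both groups)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)

# delta = 1 comes from the example; sd = 2 is an assumed value for illustration.
n_means = n_per_group_two_means(delta=1, sd=2, alpha=0.05, power=0.80)
print(f"approximately {n_means} patients per group")
```

Halving the assumed standard deviation doubles d and cuts the required n to roughly a quarter, which is why reducing measurement noise is so valuable.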
Example: Detecting a Difference
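[Figure: sampling distributions under H0 and H1. The two tails beyond -zα/2 and zα/2 (area α/2 each) are "Reject H0, conclude difference"; the central region (area 1-α) is "Fail to reject H0, conclude no difference"; under H1 the areas β and 1-β fall inside and outside the central region.]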
Example: Detecting a Difference
 Power Calculations: an interesting interactive
web-based tool to show the relationship between
power and the sample size, variability, and
difference to detect.
 A decrease in the variability of the data results in
an increase in power for a given sample size.
 An increase in the effect size results in a decrease
in the required sample size to achieve a given
power.
 Decreasing α results in an increase in the required
sample size to achieve a given power.
Prognostic Factors
 It is reasonable and sometimes essential to collect
information on personal characteristics and past
history at baseline when enrolling patients in a
clinical trial.
 These variables allow us to determine how
generalizable the results are.
Prognostic Factors
 Prognostic factors known to be related to the
desired outcome of the clinical trial must be
collected and in some cases randomization should
be stratified upon these variables.
 Many baseline characteristics may not be known
to be related to outcome, but may be associated
with outcome for a given trial.
Comparable Treatment Groups
 All baseline prognostic and descriptive factors of
interest should be summarized by treatment group
to ensure that the groups are comparable. It is
generally recommended that these be descriptive
comparisons only, not inferential.
 Note: Just because a factor is balanced does not
mean it will not affect outcome and vice versa.
Subgroup Analysis
 Does response differ for differing types of patients?
This is a natural question to ask.
 To answer this question one should test to see if
the factor that determines type of patient interacts
with treatment.
 Separate significance tests for different subgroups
do not provide direct evidence of whether a
prognostic factor affects the treatment difference:
a test for interaction is much more valid.
 Tests for interactions may also be designed a
priori.
Multiplicity of Data
 Multiple Treatments – the number of possible
treatment comparisons increases rapidly with the
number of treatments. (Newman-Keuls, Tukey’s
HSD or other adjustment should be designed)
 Multiple end-points – there may be multiple ways
to evaluate how a patient responds. (Bonferroni
adjustment, Multivariate test, combined score, or
reduce number of primary end-points)
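A quick numerical sketch shows why multiplicity matters. It counts the pairwise comparisons among k treatments, shows how the chance of at least one false positive grows with the number of tests (assuming, for simplicity, that the tests are independent), and applies the Bonferroni adjustment. The choice of k = 5 treatments is hypothetical.

```python
from math import comb

def pairwise_comparisons(k):
    """Number of distinct two-treatment comparisons among k treatments."""
    return comb(k, 2)

def familywise_error(alpha, m):
    """Chance of at least one false positive in m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m

def bonferroni_alpha(alpha, m):
    """Per-test significance level that keeps the overall error rate at or below alpha."""
    return alpha / m

m = pairwise_comparisons(5)            # 5 hypothetical treatments
unadjusted = familywise_error(0.05, m)
per_test = bonferroni_alpha(0.05, m)
print(f"{m} comparisons, familywise error {unadjusted:.2f}, per-test alpha {per_test}")
```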
Multiplicity of Data
 Repeated Measurements – patient’s progress
may be recorded at several fixed time points after
the start of treatment. One should aim for a single
summary measure for each patient outcome so
that only one significance test is necessary.
 Subgroup Analyses – patients may be grouped
into subgroups and each subgroup may be
analyzed separately.
 Interim Analyses – repeated interim analyses
may be performed after accumulating data while
the trial is in progress.
Summary
 Statistics plays a key role in pre-clinical and clinical
research
 Statistics helps us determine how ‘confident’ we
should be in the results of a study
 Confidence in a study is based on (1) the size of
the study, (2) its safeguards against biases
(complexity), (3) how it was actually undertaken
 Statistical support is available and should be
sought out as early as possible in the process of
designing a study
Next Time . . .
 Basic Descriptive and Inferential Methods
 Hypothesis Testing
 P-values
 Confidence Intervals
 Interpretation
 Examples