Statistical Methods in
Clinical Trials
Ziad Taib
Biostatistics
AstraZeneca
February 22, 2012
Types of Data
• Quantitative
  – Continuous: blood pressure, time to event
  – Discrete: number of relapses
• Qualitative
  – Categorical: sex
  – Ordered categorical: pain level
Types of data analysis (Inference)
• Parametric vs non-parametric
• Model based vs data driven
• Frequentist vs Bayesian
Steps in data analysis and
presentation of results
• Data cleaning.
• Descriptive analyses.
• Statistical inference.
• Reporting and presenting the results.
Continuous Data
Part I Classical Inference
Part II Bayesian Inference
Part I
Classical Inference
1. One sample
2. Paired data
3. Two samples
4. Many samples
   » One way ANOVA
   » Two way ANOVA
   » ANCOVA
One sample
From a population of patients we draw a sample of n patients and measure some response on each of them. We want to test whether the average response in this population is equal to (or is larger or smaller than) some predefined value μ0 (e.g. the corresponding value in the healthy population).
Estimation
• Assume $X_1, \dots, X_n$ is a sample (i.i.d.) of size n from some distribution F with mean μ and standard deviation σ. Then
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \quad \text{and} \quad s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$$
• are (i) unbiased and (ii) consistent estimators of μ and σ², i.e.
(i) $E[\bar{X}] = \mu$ and $E[s^2] = \sigma^2$,
(ii) $\bar{X} \to \mu$ and $s^2 \to \sigma^2$ as $n \to \infty$ (LLN).
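A minimal simulation sketch (not from the original slides, assuming numpy is available; all numbers are invented) illustrating unbiasedness of the sample mean and s²:

# Sketch: check by simulation that the sample mean and s^2 are unbiased.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 0.5, 2.0, 50, 20000

means = np.empty(reps)
vars_ = np.empty(reps)
for r in range(reps):
    x = rng.normal(mu, sigma, size=n)
    means[r] = x.mean()
    vars_[r] = x.var(ddof=1)   # ddof=1 gives the unbiased estimator s^2

print(means.mean())   # close to mu = 0.5
print(vars_.mean())   # close to sigma^2 = 4.0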
Estimation in general
• The maximum likelihood method
• The moment method
• Uniformly Minimum Variance Unbiased
estimators
• Least squares
• etc
Hypothesis testing
We can e.g. test the null hypothesis
$$H_0: \mu = \mu_0 \quad \text{against} \quad H_A: \mu \neq \mu_0$$
using the test statistic
$$T = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}.$$
When n is large, T follows, under the null hypothesis, the standard normal distribution (CLT).
Hypothesis testing
For moderate values of n, assuming the observations follow a normal distribution, the same statistic
$$T = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}$$
follows, under the null hypothesis, the t (Student) distribution with (n-1) degrees of freedom.
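A minimal sketch of this one-sample test in Python (not part of the slides; assumes scipy, and the reference value and data are invented):

# Sketch: one-sample t-test of H0: mu = mu0.
import numpy as np
from scipy import stats

mu0 = 120.0                                           # hypothetical mu0
x = np.array([118.2, 125.1, 121.4, 130.0, 117.5, 123.3, 128.8, 119.9])

t_stat, p_value = stats.ttest_1samp(x, popmean=mu0)   # two-sided by default
print(t_stat, p_value)   # compare T to the t distribution with n-1 d.f.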
The distribution of the average when n is large
[Figure: sampling distribution of the average, approximately normal; the true distribution among the patients is exponential with mean 0.5.]
Confidence intervals
• A (1-α)100% confidence interval for μ has the form
(i) $\bar{X} \pm z_{1-\alpha/2}\, \frac{s}{\sqrt{n}}$ when n is large,
(ii) $\bar{X} \pm t_{1-\alpha/2,\,n-1}\, \frac{s}{\sqrt{n}}$ when n is moderate but the sample is normally distributed.
A hypothesis test (or a confidence interval) can be one-sided or two-sided.
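A sketch of both intervals in Python (not from the slides; assumes scipy, data invented):

# Sketch: (1-alpha)100% confidence intervals for mu, forms (i) and (ii).
import numpy as np
from scipy import stats

x = np.array([0.42, 0.61, 0.38, 0.55, 0.47, 0.70, 0.52, 0.44, 0.59, 0.49])
n, alpha = len(x), 0.05
xbar, s = x.mean(), x.std(ddof=1)

z = stats.norm.ppf(1 - alpha / 2)          # (i) large-sample interval
print(xbar - z * s / np.sqrt(n), xbar + z * s / np.sqrt(n))

t = stats.t.ppf(1 - alpha / 2, df=n - 1)   # (ii) moderate n, normal sample
print(xbar - t * s / np.sqrt(n), xbar + t * s / np.sqrt(n))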
95% confidence interval
[Figure: the central 1-α = 0.95 regions of the normal distribution and of the t distribution.]
Correspondence Theorem
Hypothesis testing and confidence intervals
are two sides of the same coin!
Rejecting the null hypothesis at level α is equivalent to the parameter value specified by the null hypothesis not being included in the (1-α) confidence interval.
Paired data
• Sometimes we measure a response at baseline and at the end of the treatment. A suitable analysis can be to consider the endpoint changes from baseline, $D_i = Y_i - X_i$,
• and to use the (one sample) test statistic
$$T = \frac{\bar{D}}{s_D/\sqrt{n}}.$$
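A sketch of the paired analysis in Python (not from the slides; assumes scipy, numbers invented):

# Sketch: paired analysis as a one-sample test on the changes from baseline.
import numpy as np
from scipy import stats

baseline = np.array([150.0, 142.0, 160.0, 155.0, 148.0, 152.0])
endpoint = np.array([144.0, 140.0, 151.0, 150.0, 147.0, 145.0])

d = endpoint - baseline                       # changes from baseline
print(stats.ttest_1samp(d, popmean=0.0))      # one-sample test on the changes
print(stats.ttest_rel(endpoint, baseline))    # equivalent paired test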
Two independent samples
From two populations/distributions we draw two samples of patients, of sizes n1 and n2, and measure some response on each patient. We want to test whether the average responses, μ1 and μ2, in these two populations are equal. The population variances are denoted by σ1² and σ2² respectively.
$$H_0 \text{ (null)}: \mu_1 = \mu_2$$
$$H_A \text{ (alternative)}: \mu_1 \neq \mu_2, \quad \mu_1 > \mu_2, \quad \text{or} \quad \mu_1 < \mu_2$$
Two samples (continued)
• When both sample sizes are large we can use the central limit theorem, according to which the test statistic below follows the standard normal distribution:
$$Z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}.$$
• When the sample sizes are moderate but the data are normal with equal variances, we can use the t-statistic (having n1+n2-2 d.f.) and the pooled variance.
• The remaining case can be handled using a method known as Satterthwaite's confidence intervals.
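A sketch of both two-sample versions in Python (not from the slides; assumes scipy, simulated data):

# Sketch: pooled-variance t-test and the Welch/Satterthwaite version.
import numpy as np
from scipy import stats

x1 = np.random.default_rng(1).normal(20.0, 2.0, size=40)
x2 = np.random.default_rng(2).normal(21.0, 2.0, size=45)

print(stats.ttest_ind(x1, x2))                   # pooled variance, n1+n2-2 d.f.
print(stats.ttest_ind(x1, x2, equal_var=False))  # Welch/Satterthwaite version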
Many samples
One way analysis of variance
• Here we want to compare several (k) groups (treatments) with respect to the average response in each group. The following model is assumed:
$$y_{ij} = \mu + \tau_i + \varepsilon_{ij}, \quad i = 1, \dots, k, \quad j = 1, \dots, n_i, \quad \varepsilon_{ij} \sim N(0, \sigma^2) \text{ independent}.$$
Anova1
[Figures: simulated responses in k groups, one case where the group means are similar and one where they differ.]
• Main point: individuals differ with respect to only one criterion: group membership.
• Intuitively: looking at the pictures, comparing the variation between the groups with the variation within the groups tells us whether the group means really differ.
Anova1 continued
• The null hypothesis and the alternative hypothesis can be formulated as
$$H_0: \tau_1 = \tau_2 = \dots = \tau_k = 0 \quad \text{against} \quad H_A: \tau_i \neq 0 \text{ for at least one } i.$$
• The deviation of an individual observation from the overall mean can be decomposed as SST = SSA + SSE, where SST is the total sum of squares, SSA (also written SSB, between groups) measures the variation between the groups, and SSE (also written SSW, within groups) the variation within the groups.
Anova 1 continued
• Manipulating these gives the mean squares
$$MSA = \frac{SSA}{k-1}, \qquad MSE = \frac{SSE}{N-k}.$$
• When the null hypothesis is false we expect the ratio MSA/MSE to be large. To calculate exact significance levels we use the fact that it is F-distributed with (k-1, N-k) d.f.
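A sketch in the spirit of the examples below (not from the slides; assumes scipy, group sizes and effects invented):

# Sketch: one-way ANOVA for k = 3 groups.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
g1 = rng.normal(20.0, 3.0, size=400)
g2 = rng.normal(20.0, 3.0, size=400)
g3 = rng.normal(20.5, 3.0, size=400)   # one group shifted by 0.5

f0, p = stats.f_oneway(g1, g2, g3)     # F0 = MSA/MSE and its p-value
print(f0, p)                           # reject H0 when F0 is large / p small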
Anova table

Source of variation | Sum of squares | d.f. | Mean squares | F / p-value
Treatment           | SSA            | k-1  | MSA          | F0 = MSA/MSE / p
Error               | SSE            | N-k  | MSE          |
Total               | SST            | N-1  |              |
• Example: 3 groups with the same mean 20 and the same standard deviation 2. We reject for large values of F0. F is F-distributed and F0 is the observed value of MSA/MSE. Here we cannot reject.

Source of variation | Sum of squares | d.f. | Mean squares | F0 / p-value
Treatment           | 9.56           | 2    | 4.78         | 1.22 / 0.29
Error               | 4671.1         | 1197 | 3.9          |
Total               | 4680.65        | 1199 |              |
• Example: 3 groups with means 20, 20 and 20.5 and the same standard deviation 3. Here we can reject.

Source of variation | Sum of squares | d.f. | Mean squares | F0 / p-value
Treatment           | 59.4           | 2    | 29.7         | 3.25 / 0.039
Error               | 10942.8        | 1197 | 9.14         |
Total               | 11002.3        | 1199 |              |
Multiple comparisons
• When we reject the null hypothesis we only know that the groups are not all equal; some of them might still be. To find out more we have to consider all the pairwise comparisons between the groups. With k groups this gives m = k(k-1)/2 such comparisons. How do we do this and still keep a reasonable overall significance level? The simplest way is to use Bonferroni's inequality: if each of the m tests is performed at the α/m significance level, then the overall significance level of the tests taken simultaneously is at most α. We will deal with this problem in detail later on.
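A sketch of Bonferroni-corrected pairwise comparisons in Python (not from the slides; assumes scipy, groups invented):

# Sketch: all m = k(k-1)/2 pairwise t-tests, each at level alpha/m.
import itertools
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
groups = {"A": rng.normal(20.0, 3.0, 400),
          "B": rng.normal(20.0, 3.0, 400),
          "C": rng.normal(20.5, 3.0, 400)}

alpha = 0.05
pairs = list(itertools.combinations(groups, 2))
m = len(pairs)                       # k(k-1)/2 comparisons
for a, b in pairs:
    _, p = stats.ttest_ind(groups[a], groups[b])
    print(a, b, p, "reject" if p < alpha / m else "cannot reject")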
Two way anova
• Here we assume that individuals can differ with respect to two factors (two drugs, treatment and centre). The following model is usually suitable for this situation:
$$y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}.$$
• As before, the deviation of an individual observation from the overall average can be decomposed into terms related to the effects of treatment, centre, and treatment by centre interaction, as well as a pure random component.
Anova2
• We use a similar notation as before
• Tests of treatment effect, center effect or
treatment by center interaction can be
performed by using the appropriate ratio
between mean square values.
Anova2
• As an example, assume we want to test if there is a treatment effect:
$$H_0: \alpha_1 = \dots = \alpha_a = 0.$$
• This can be tested using the test statistic
$$F_0 = \frac{MSA}{MSE},$$
which has, under the null hypothesis, an F distribution with (a-1) and ab(n-1) degrees of freedom.
• The various tests are summarized in the
following table
Two-way anova

Source of variation | d.f.       | Mean squares | F0
Treatment (A)       | a-1        | MSA          | MSA/MSE
Centre (B)          | b-1        | MSB          | MSB/MSE
Interaction (AB)    | (a-1)(b-1) | MSAB         | MSAB/MSE
Error               | ab(n-1)    | MSE          |
Total               | abn-1      |              |
Analysis of covariance
• Here we assume that individuals can differ with respect to baseline values. It is sometimes desirable to adjust the model for the endpoint measures Y so that baseline values X are taken into account. The following model can then be useful:
$$y_{ij} = \mu + \tau_i + \beta (x_{ij} - \bar{x}) + \varepsilon_{ij}.$$
The analysis uses an F distribution based on mean sums of squares (cf. p. 319).
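A sketch of such a baseline-adjusted analysis in Python (not from the slides; assumes statsmodels and pandas, variable names and data invented):

# Sketch: ANCOVA adjusting the endpoint Y for the baseline covariate X.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 60
df = pd.DataFrame({
    "group": np.repeat(["placebo", "active"], n // 2),
    "baseline": rng.normal(150.0, 10.0, n),
})
effect = np.where(df["group"] == "active", -5.0, 0.0)   # invented effect
df["endpoint"] = 0.8 * df["baseline"] + effect + rng.normal(0.0, 5.0, n)

model = smf.ols("endpoint ~ C(group) + baseline", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))   # F tests for group and baseline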
Various forms of models and the relations between them

Classical statistics (observations are random, parameters are unknown constants):
• LM (linear model). Assumptions: 1) independence, 2) normality, 3) constant parameters.
• GLM (generalised linear model): assumption 2) is relaxed to the exponential family; inference by maximum likelihood.
• LMM (linear mixed model): assumptions 1) and 3) are modified; the model contains both random and constant parameters (mixed models). Used for repeated measures and longitudinal data.
• GLMM (generalised linear mixed model): assumption 2) is relaxed to the exponential family and assumptions 1) and 3) are modified.
• Non-linear models.

Bayesian statistics: parameters are random.
The Linear Mixed-effects Model
• The linear mixed effects model is quite flexible and does not require balance, independence, etc. Usually some version of maximum likelihood is used for the inference.
[Figure: fitted curves showing the average evolution and the subject-specific profiles.]
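A sketch of a linear mixed model in Python (not from the slides; assumes statsmodels and pandas, data simulated):

# Sketch: random-intercept model fitted by (RE)ML.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
subjects, times = 30, 4
df = pd.DataFrame({
    "subject": np.repeat(np.arange(subjects), times),
    "time": np.tile(np.arange(times), subjects),
})
b = rng.normal(0.0, 2.0, subjects)   # subject-specific random intercepts
df["y"] = 10.0 + 1.5 * df["time"] + b[df["subject"]] + rng.normal(0, 1, len(df))

fit = smf.mixedlm("y ~ time", data=df, groups=df["subject"]).fit()
print(fit.summary())   # average evolution (fixed) + subject-specific (random)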
Part II
Bayesian Inference
What is a probability?
1. The limit of a relative frequency.
Problem: not all events are repeatable!
2. The degree of plausibility, or of belief.
Problem: subjective?
[Images: an event with P = ? and a die with P = 1/6.]
Thomas Bayes
(1702 to 1761)
Bayes in brief
• Bayes' conclusions were accepted by Laplace in 1781.
• They were rediscovered by Condorcet, and remained unchallenged until Boole questioned them.
• Since then, Bayes' techniques have been subject to controversy.
Is Classical Inference
Really Classical?
• Only in the sense that it started a relatively long time ago, with Fisher around 1920.
• Bayesian statistics is even more classical, since it dominated until then.
• Classical inference was introduced as a way to bring objectivity into the scientific process.
Simplified scientific Process
1. A theory or a hypothesis needs to be tested.
2. We perform an experiment and obtain some data.
3. Are our data in agreement with our theory?
4. If the answer to the above question is no, we reject the theory. Otherwise we cannot reject it!
Classical Paradigm
vs
The Bayesian paradigm
• The classical paradigm is based on the consideration of
P[ Data | Theory ] (1)
• How likely are the data if the theory were true?
Classical vs Bayesian (cont’d)
• The Bayesian paradigm is based on the
consideration of
P[ Theory | Data] (2)
• How much support or belief is there in the
theory given the data?
Bayes’ Formula
States that
$$P[T \mid D] = \frac{P[D \mid T]\, P[T]}{P[D]} \qquad (3)$$
where T = Theory and D = Data.
A simple formula with many interesting implications.
Implications of Bayes’ Formula
1. (1) and (2) are not equivalent.
2. To work out (2) we have to estimate P[ T ]
i.e. we need to put a probability on our
belief in the theory we are testing.
3. We cannot make P[ T ] disappear. (Similar
to the uncertainty principle in quantum
mechanics?)
Is subjective always bad?
In our everyday life we are far from being
always objective. Not even when it comes
to scientific issues. Nevertheless we are
afraid of incorporating subjectivity into the
scientific process.
This is the reason why Bayesian inference is
perceived as being controversial.
How to deal with subjectivity?
1. Sometimes we do have some a priori information (earlier studies, similar drugs, expert opinions, etc.).
2. But even if we do not have any prior knowledge, we can study the effect of having various degrees of belief. (D. Spiegelhalter)
3. We can also invert the problem and ask what prior belief in the theory is required to make our data coherent with that theory.
How to deal …
4. The role of the a priori information becomes negligible as data accumulate. It can be shown mathematically that ultimately all investigators will reach the same conclusion regardless of what prior information they had from the beginning (for a mathematical argument cf. O'Hagan, 1994). In the meantime, the sceptic will just need more time/convincing/data!
A textbook example
• Let X be a sample from N(θ, s²) of size n.
• Let the a priori distribution p(θ) be N(φ, τ²), which is the so-called conjugate distribution.
• Then the a posteriori distribution has the following form:
$$p(\theta \mid x) = \frac{P(x \mid \theta)\, p(\theta)}{\int p(t)\, P(x \mid t)\, dt} = N\!\left(\gamma \bar{x} + (1 - \gamma)\varphi,\; \frac{s^2 \tau^2}{s^2 + n\tau^2}\right), \qquad \gamma = \frac{n\tau^2}{s^2 + n\tau^2}.$$
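A sketch of this conjugate update in Python (not from the slides; the symbols follow the reconstructed formula above, and all numbers are invented):

# Sketch: posterior mean and s.d. for the normal-normal conjugate model.
import numpy as np

s, n = 2.0, 25        # known sampling s.d. and sample size
phi, tau = 0.0, 1.0   # prior N(phi, tau^2)
xbar = 0.6            # observed sample mean

gamma = n * tau**2 / (s**2 + n * tau**2)
post_mean = gamma * xbar + (1 - gamma) * phi
post_var = (s**2 * tau**2) / (s**2 + n * tau**2)
print(post_mean, np.sqrt(post_var))   # the prior matters less as n grows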
Comments
• There are problems in choosing the a priori distribution.
• The importance of the a priori distribution decreases as the sample size increases.
Example: p-values
• Some toxin is associated with certain symptoms. θ denotes the toxin level in patients having the symptoms and θ0 the toxin level in healthy individuals. We want to test if there is a difference. A test can be based on a sample of size n through
$$z = \frac{\bar{x} - \theta_0}{s/\sqrt{n}}.$$
Under the null hypothesis
$$H_0: \theta = \theta_0,$$
z is an observation from a t-distribution with n-1 degrees of freedom.
If, moreover, the p-value is less than 0.05, it is customary to consider the result significant, i.e. we reject the null hypothesis.
We return to this in a moment!
Preliminaries
• Consider two hypotheses H0 and H1:
$$P[H_0 \mid D] = \frac{P[D \mid H_0]\, P[H_0]}{P[D]}, \qquad P[H_1 \mid D] = \frac{P[D \mid H_1]\, P[H_1]}{P[D]}$$
• We consider the odds ratio between the two:
$$\frac{P[H_0 \mid D]}{P[H_1 \mid D]} = \frac{P[H_0]}{P[H_1]} \cdot \frac{P[D \mid H_0]}{P[D \mid H_1]}$$
New odds = old odds × Bayes factor
(posterior odds = prior odds × likelihood ratio).
Result
$$P[H_0 \mid D] = \left(1 + \frac{1 - P[H_0]}{P[H_0]} \cdot \frac{1}{BF}\right)^{-1}$$
where
$$BF = \frac{P[D \mid H_0]}{P[D \mid H_1]}.$$
Back to the t-test
P. M. Lee, Bayesian Statistics: An Introduction, 2nd Ed. London: Arnold, p. 131 (1997), shows that in the case of the t-test and under quite general conditions:
$$BF \geq e^{-z^2/2}.$$
Consequences
• Assume that there is no agreement on the effect of the toxin, so we take P(H0) = 0.5. Then:
$$P[H_0 \mid D] \geq \left(1 + \frac{1}{e^{-z^2/2}}\right)^{-1} = \frac{1}{1 + e^{z^2/2}}.$$
• Assume further that the value of z in the experiment turns out to be 2. Since this leads to a p-value of 0.044, we conclude that the result is significant at the 0.05 level. Setting z = 2 in the formula above leads to:
$$P[H_0 \mid D] \geq \left(1 + e^{2}\right)^{-1} \approx 0.12.$$
[Figure: P[H0 | D] plotted against the prior P[H0] for z = 2.]
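A sketch reproducing this calculation in Python (not from the slides; assumes numpy, and uses the bound BF = exp(-z²/2) as an equality for illustration):

# Sketch: posterior probability of H0 from z and the prior P(H0).
import numpy as np

def posterior_h0(z, prior=0.5):
    bf = np.exp(-z**2 / 2)   # Lee's lower bound on the Bayes factor
    return 1.0 / (1.0 + (1.0 - prior) / prior / bf)

print(posterior_h0(2.0))                      # about 0.12, despite p = 0.044
for p0 in (0.25, 0.5, 0.75):
    print(p0, posterior_h0(2.0, prior=p0))    # how the prior shifts P(H0|D)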
Conclusions
• By introducing the element of degree of belief about a theory, we arrive at conclusions that do not agree with those obtained using the frequentist approach, i.e. prior knowledge matters.
• Prior knowledge is part of the Bayesian approach.
Backup slides
Bayesian Analysis in Clinical Trials
Number of Bayes Articles in Medical Journals
[Figure: chart of the number of Bayes articles in medical journals.]
In short, Bayesian inference provides a coherent, comprehensive and strikingly intuitive alternative to the flawed frequentist methods of statistical inference. It leads to results that are more easily interpreted, more useful, and which more accurately reflect the way science actually proceeds. It is, moreover, unique in its ability to deal explicitly and reliably with the provably ineluctable presence of subjectivity in science.
These features alone should motivate many working scientists to find out more about Bayesian inference in their own research.

Robert A. J. Matthews (1998). The use and abuse of subjectivity in scientific research – Part 2. Working paper for the European Science and Environment Forum.
What for?
• The greatest virtue of frequentist clinical trials is extreme rigour and focus on the experiment at hand.
• But they tend to become large and expensive, and some patients receive inferior experimental therapies.
• The Bayesian approach is gaining ground and has more and more advocates hoping to address these and other challenges.
Bayesian Framework
• The advantages:
  – Formal system for incorporating existing information
  – Natural approach to inference
  – Generally more efficient
  – Well suited for decision making
• The challenges:
  – Determining appropriate prior probabilities
  – Computational complexity
  – Lack of familiarity
  – Lack of software tools
What is the difference?
Approach          | Traditional                                                | Bayesian
Question          | How likely are the trial results given there really is no difference among treatments? | How likely is it that there is a true difference among treatments given the trial data?
Approval based on | Pivotal trial (hypothesis testing)                         | Weight of evidence (revising beliefs in light of new evidence)
Design            | Single stage                                               | Adaptive
In what settings have Bayesian designs been used?
• Pharmacokinetics
• Phase I (CRM)
• Phase II (Thall/Simon)
• Phase III (Spiegelhalter, Parmar, others)
• Meta-analysis
• Medical device clinical trials (guidelines from 2006)
Summary of applications
• Methods for incomplete data
• Adaptive designs (including adaptive
sample size)
• Confirmatory trials
• Evaluating a modified version of an approved product
• Synthesising data in post-market surveillance
Recommended Reading
1. Berry DA. Bayesian clinical trials. Nature Reviews: Drug Discovery. 5:27-36; 2006.
2. Goodman SN. Toward Evidence-Based Medical Statistics: The P Value Fallacy. Annals of Internal Medicine. 130:995-1021; 1999.
3. Spiegelhalter DJ, Abrams KR, Myles JP. Bayesian Approaches to Clinical Trials and Health-Care Evaluation. John Wiley & Sons, Ltd. 2004.
4. Winkler RL. Why Bayesian Analysis Hasn't Caught on in Healthcare Decision Making. International Journal of Technology Assessment in Health Care. 17:1, 56-66; 2001.
Questions or Comments?
Karl Popper (1902-1994)
A theory can be called scientific if it can be falsified!
We recognise this principle from the foundation of statistical hypothesis testing.
Falsifiability (or refutability or testability) is the logical
possibility that an assertion can be shown false by an
observation or a physical experiment.
For example, "all men are mortal" is unfalsifiable, since no amount of observation could ever demonstrate its falsehood. "All men are immortal," by contrast, is falsifiable, by the presentation of just one dead man.
[Images: a white mute swan, common to Eurasia and North America; two black swans, native to Australia.]
Proof
$$\frac{P[H_0 \mid D]}{P[H_1 \mid D]} = \frac{P[H_0]}{P[H_1]} \cdot \frac{P[D \mid H_0]}{P[D \mid H_1]}$$
Writing
$$C = \frac{P[H_0]}{P[H_1]} \cdot \frac{P[D \mid H_0]}{P[D \mid H_1]}$$
and using $P[H_1 \mid D] = 1 - P[H_0 \mid D]$:
$$P[H_0 \mid D] = C\,(1 - P[H_0 \mid D]) \;\Rightarrow\; P[H_0 \mid D]\,(1 + C) = C \;\Rightarrow\; P[H_0 \mid D] = \frac{C}{1 + C} = \left(1 + \frac{1}{C}\right)^{-1}.$$
Since
$$\frac{1}{C} = \frac{1 - P[H_0]}{P[H_0]} \cdot \frac{P[D \mid H_1]}{P[D \mid H_0]} = \frac{1 - P[H_0]}{P[H_0]} \cdot \frac{1}{BF},$$
we get
$$P[H_0 \mid D] = \left(1 + \frac{1 - P[H_0]}{P[H_0]} \cdot \frac{1}{BF}\right)^{-1}.$$