Lecture18 - Biostatistics

How many patients
do I need for my study?
Realistic Sample Size
Estimates for Clinical Trials
Sample Size Estimation
1. General considerations
2. Continuous response variable
– Parallel group comparisons
• Comparison of response after a specified period of follow-up
• Comparison of changes from baseline
– Crossover study
3. Success/failure response variable
– Impact of non-compliance, lag
– Realistic estimates of control event rate (Pc) and event rate pattern
– Use of epidemiological data to obtain realistic estimates of experimental group event rate (Pe)
4. Time to event designs and variable follow-up
Useful References
• Lachin JM, Cont Clin Trials, 2:93-113, 1981 (a general overview)
• Shih J, Cont Clin Trials, 16:395-407, 1995 (time to event studies with dropouts, dropins, and lag issues) – see size program on biostatistics network
• Farrington CP and Manning G, Stat Med, 9:1447-1454, 1990 (sample size for equivalence trials)
• Whitehead J, Stat Med, 12:2257-2271, 1993 (sample size for ordinal outcomes)
• Donner A, Amer J Epid, 114:906-914, 1981 (sample size for cluster randomized trials)
Key Points
• Sample size should be specified in advance (often it is not)
• Sample size estimation requires collaboration and some
time to do it right (not solely a statistical exercise)
• Often sample size is based on uncertain assumptions
(estimates should consider a range of values for key
parameters and the impact on power for small deviations in
final assumptions should be considered)
• Parameters that do not involve the treatment difference
(e.g., SD) on which sample size was based should be
evaluated by protocol leaders (who are blinded to treatment
differences) during the trial
• It pays to be conservative; however, the ultimate size and
duration of a study involve compromises, e.g., power,
costs, timeliness.
Some Evidence that Sample Size is Not Considered Carefully: A Survey of 71 “Negative” Trials
(Freiman et al., NEJM, 1978)
• Authors stated “no difference”
• P-value > 0.10 (2-sided)
• Success/failure endpoint
• Expected number of events >5 in control and experimental groups
• Using the stated Type I error and control group event
rate, power was determined corresponding to:
– 25% difference between groups
– 50% difference between groups
Frequency Distribution of Power Estimates for 71 “Negative” Trials: 25% Reduction
[Figure: histogram of power (1 - β), in bins from 0-9 to 90-99; vertical axis from 0 to 25; annotation: 5.63%. Reference: Freiman et al., NEJM 1978.]
Frequency Distribution of Power Estimates for 71 “Negative” Trials: 50% Reduction
[Figure: histogram of power (1 - β), in bins from 0-9 to 90-99; vertical axis from 0 to 25; annotation: 29.58%. Reference: Freiman et al., NEJM 1978.]
Implications of Review by Freiman et al.
• Many investigations do not estimate sample size in
advance
• Many studies should never have been initiated; some
were stopped too soon
• “Non-significant” difference does not mean there is
not an important difference
• Design estimates (in Methods) are important to
interpret study findings
• Confidence intervals should be used to summarize
treatment differences
Percent of Studies with at Least 80% Power
Studies with Power to Detect 25% and 50% Differences
[Figure: percent of studies (vertical axis, 0 to 50) by year (1975, 1980, 1985, 1990), with separate series for a 25% difference and a 50% difference. Moher et al., JAMA, 272:122-124, 1994.]
These Results Emphasize the Importance of
Understanding that the Size of P-Value
Depends on:
• Magnitude of difference
(strength of association); and
• Sample size
“Absence of evidence is not evidence of absence”, Altman
and Bland, BMJ 1995; 311:485.
Steps in Planning a Study
1) Specify the precise research question
2) Define target population
3) Assess feasibility of studying question
(compute sample size)
4) Decide how to recruit study participants,
e.g., single center, multi-center, and
make sure you have back-up plans
Beginning: A Protocol Stating Null and Alternative
Hypotheses Along with Significance Level and
Power
Null hypothesis (HO)
Hypothesis of no difference or no association
Alternative hypothesis (HA)
Hypothesis that there is a specified difference (Δ)
No direction specified (2-tailed)
A direction specified (1-tailed)
Significance Level (α): Type I Error
The probability of rejecting H0 given that H0 is true
Power = (1 - β): (β = Type II Error)
Power is the probability of rejecting H0 when the true difference is Δ
End: Test of Significance According to Protocol
Statistically Significant?
• Yes: Reject HO. Sampling variation is an unlikely explanation for the discrepancy.
• No: Do not reject HO. Sampling variation is a likely explanation for the discrepancy.
Normal Distribution
If Z is large (lies in yellow area), we assume difference in means is unlikely to have
come from a distribution with mean zero.
Continuous Outcome Example
Observations:
Many people have stage 1 (mild) hypertension (SBP
140-159 or DBP 90-99 mmHg)
For most, treatment is life-long
Many drugs which lower BP produce undesirable
symptoms and metabolic effects (new drugs are
needed)
Research Question:
Can new drug T adequately control BP for patients
with mild hypertension?
Objective:
To compare new drug T with diuretic treatment for
lowering diastolic blood pressure (DBP)
Parallel Group Design Comparing Average
Diastolic BP (DBP) After One Year
Hypothesis
HO: DBP after one year of treatment with
new drug T equals the DBP for patients
given a diuretic (control)
HA: DBP after one year is different for patients
given new drug T compared to diuretic
treatment (difference is 4 mmHg or more)
Study Population: Those with mild hypertension
• Drug T: DBP at year 1
• Diuretic: DBP at year 1
Parallel Group Design Comparing Average
Difference (Year 1 – Baseline) in DBP.
Hypothesis
HO: DBP change from baseline after one year of
treatment with new Drug T equals the DBP
change from baseline after one year for patients
given a diuretic (control)
HA: DBP change from baseline after one year of
treatment with new Drug T is different than the
DBP change from baseline after one year for
patients given a diuretic (control) treatment
(difference is 4 mmHg or more)
Study Population: Those with mild hypertension
• Drug T: Change in DBP (Year 1 – Baseline)
• Diuretic: Change in DBP (Year 1 – Baseline)
Why Δ= 4 mmHg? An important
difference on a population-wide basis
Observational studies (Lancet 2002;360:1903-13)
• 58 studies; 958,074 participants
• 5 mm Hg lower DBP among those 40-59 years
• 41% (30%) lower risk of death from stroke (CHD)
Clinical trials (Lancet 1990;335:827-38)
• 14 randomized trials; 36,908 participants
• 5-6 mmHg DBP difference (treatment vs. control)
• 28% reduction in fatal/non-fatal CVD
Considerations in Specifying
Treatment Effect (Delta)
• Smallest difference of clinical significance/interest
• Stage of research
• Realistic and plausible estimates based on:
– previous research
– expected non-compliance and switchover rates
– consideration of type of participants to be studied
• Resources (compromise)
Delta is a difference that is important NOT to miss if present.
Principal Determinants of Sample Size
• Size of difference considered important (Delta)
• Type I error (α) or significance level
• Type II error (β), or power (1 - β)
• Variability of response/frequency of event
(The Type I and Type II error terms enter the formula as constants.)
Sample Size for Two Groups: Equal Allocation
General Formula
$$N_{\text{per group}} = \frac{2 \times \text{Variability} \times [\text{Constant}(\alpha, \beta)]^2}{\text{Delta}^2}$$
Delta = Δ = clinically relevant and plausible treatment difference
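Sample Size Formula Derivation: One Sample Situation
Sample size has to satisfy:
$$\text{Prob}(|Z| \ge Z_{1-\alpha/2}) = \alpha \quad \text{if } H_O \text{ is true}$$
and
$$\text{Prob}(|Z| \ge Z_{1-\alpha/2}) = 1 - \beta \quad \text{if } H_A \text{ is true}$$
$$H_O: \mu = \mu_0; \quad \mu - \mu_0 = 0$$
$$H_A: \mu \ne \mu_0; \quad \mu - \mu_0 = \Delta$$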
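Sample Size Derivation (cont.)
$$Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{N}}$$
Reject HO if $|Z| \ge 1.96 = Z_{1-\alpha/2}$, for α = 0.05 (2-sided).
Taking Δ > 0, so that the lower tail is negligible:
$$\text{Prob}\left( \frac{\bar{X} - \mu_0}{\sigma / \sqrt{N}} \ge Z_{1-\alpha/2} \right) = 1 - \beta \quad \text{under } H_A$$
$$\text{Prob}\left( \frac{\bar{X} - \mu}{\sigma / \sqrt{N}} \ge Z_{1-\alpha/2} - \frac{\Delta}{\sigma / \sqrt{N}} \right) = 1 - \beta$$
Under HA the left-hand quantity is N(0,1), so the right-hand side must equal $Z_\beta$:
$$Z_\beta = Z_{1-\alpha/2} - \frac{\Delta}{\sigma / \sqrt{N}}$$
Note: $-Z_{1-\beta} = Z_\beta$; solve for N:
$$N = \frac{\sigma^2 (Z_{1-\alpha/2} + Z_{1-\beta})^2}{\Delta^2}$$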
Weighing the Errors
• β, Type 2 error: Sponsor’s concern
• α, Type 1 error: Regulator’s concern
Typical Values for $(Z_{1-\alpha/2} + Z_{1-\beta})^2$, Which Is the Numerator of the Sample Size Formula

Type I Error α ($Z_{1-\alpha/2}$), 2-sided test    Power 1 - β ($Z_{1-\beta}$)    $(Z_{1-\alpha/2} + Z_{1-\beta})^2$
0.05 (1.96)                                        0.80 (0.84)                    7.84
0.05 (1.96)                                        0.90 (1.28)                    10.50
0.05 (1.96)                                        0.95 (1.645)                   13.00
0.01 (2.575)                                       0.80 (0.84)                    11.67
0.01 (2.575)                                       0.90 (1.28)                    14.86
0.01 (2.575)                                       0.95 (1.645)                   17.81
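The constants in this table can be recomputed from standard normal quantiles. The sketch below is illustrative only: the function name `z_constant` is an assumption, not something from the lecture, and because it uses unrounded quantiles its output differs slightly from the tabulated values, which are based on rounded Z values.

```python
from statistics import NormalDist


def z_constant(alpha: float, power: float) -> float:
    """Return (Z_{1-alpha/2} + Z_{1-beta})^2 for a 2-sided test.

    Hypothetical helper for illustration; power = 1 - beta.
    """
    std_normal = NormalDist()  # mean 0, SD 1
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_beta = std_normal.inv_cdf(power)
    return (z_alpha + z_beta) ** 2


if __name__ == "__main__":
    for alpha in (0.05, 0.01):
        for power in (0.80, 0.90, 0.95):
            value = z_constant(alpha, power)
            print(f"alpha={alpha:.2f} power={power:.2f} constant={value:.2f}")
```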
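Example
Hypertension Study
[Figure: distributions of the difference under HO (centered at 0) and HA (centered at 4 mmHg), with the α and β regions marked.]
$$H_O: \mu_1 = \mu_2; \quad \mu_1 - \mu_2 = 0$$
$$H_A: \mu_1 \ne \mu_2; \quad \mu_1 - \mu_2 = 4 \text{ mmHg}$$
Usually formulated in terms of change from baseline (e.g., $H_O: D_1 - D_2 = 0$)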
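Another Derivation
Let d be the critical value of the observed difference, with $n_1 = n_2 = N$:
$$Z_{1-\alpha/2} = \frac{d - 0}{\sqrt{2\sigma^2 / N}}, \qquad \text{Prob}(|Z| \ge Z_{1-\alpha/2}) = \alpha \text{ under } H_O$$
$$Z_\beta = \frac{d - \Delta}{\sqrt{2\sigma^2 / N}}, \qquad \text{Prob}(Z \ge Z_\beta) = 1 - \beta \text{ under } H_A$$
Solve for N using these 2 equations and by noting that Δ is the sum of the 2 parts from the previous figure (with $-Z_\beta = Z_{1-\beta}$):
$$\Delta = (d - 0) + (\Delta - d) = Z_{1-\alpha/2} \sqrt{\frac{2\sigma^2}{N}} + Z_{1-\beta} \sqrt{\frac{2\sigma^2}{N}}$$
$$\Delta^2 = \frac{2\sigma^2 (Z_{1-\alpha/2} + Z_{1-\beta})^2}{N}$$
$$N = \frac{2\sigma^2 (Z_{1-\alpha/2} + Z_{1-\beta})^2}{\Delta^2}$$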
Sources of Variability of BP Measurements
Ref: Rose GA. Standardization of Observers in Blood Pressure Measurement. Lancet 1965;1:673-4.
Variability of blood pressure readings
• True variations in arterial pressure
– Known factors: recent physical activity; emotional state; position of subject and arm; room temperature and season of year
– Unknown factors
• Measurement errors
– Chiefly affecting the mean pressure estimate:
Instrument: inaccuracy of sphygmomanometer; cuff width and length
Observer: mental concentration; hearing acuity; confusion of auditory and visual; interpretation of sounds; rates of inflation and deflation; reading of moving column
– Distorting the frequency distribution curve (and sometimes affecting the mean): terminal digit preference; prejudice, e.g., excess of readings at 120/80
Estimates of Variability for Diastolic Blood Pressure Measurements (MRFIT)
Estimated Using Random-Zero (R-Z) Readings

Variance Component                  Estimate (mmHg)²
Between subject ($\sigma_s^2$)      58.4
Within subject ($\sigma_e^2$)       36.3
Estimates of Variability for Diastolic Blood Pressure Measurements
Estimated Using Random-Zero (R-Z) Readings at Screen 2 and Screen 3 in MRFIT (2 Readings at Each Visit)

Variance Component                  Estimate (mmHg)²
Between subject ($\sigma_s^2$)      58.4
Between visits ($\sigma_v^2$)       26.1
Between readings ($\sigma_e^2$)     10.2

Here the within-subject component is analyzed further into between-visit and between-reading components.
Consequences on Sample Size of Using Multiple Readings for Defining Diastolic BP
α = 0.05, 1 - β = 0.90
Inter-subject variability = 58.4 (mmHg)²
Between visit variability = 26.1 (mmHg)²
Within visit variability = 10.2 (mmHg)²

No. of visits    No. of readings/visit    N per group (∆=8)    N per group (∆=4)
1                1                        31                   124
1                2                        30                   118
2                1                        25                   100
2                2                        24                   97
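The effect of averaging over visits and readings can be sketched as follows. This assumes the usual variance-components form, in which the between-visit component is divided by the number of visits and the between-reading component by the total number of readings; the slide does not state that formula explicitly, and the function names are hypothetical. Under that assumption the unrounded results fall close to the tabulated N per group.

```python
def dbp_variance(visits: int, readings_per_visit: int,
                 between_subject: float = 58.4,
                 between_visit: float = 26.1,
                 between_reading: float = 10.2) -> float:
    """Variance of a subject's mean DBP over several visits and readings.

    Assumed form: sigma_s^2 + sigma_v^2 / v + sigma_e^2 / (v * r).
    Defaults are the MRFIT estimates quoted in the slides, in (mmHg)^2.
    """
    total_readings = visits * readings_per_visit
    return (between_subject
            + between_visit / visits
            + between_reading / total_readings)


def n_per_group(variance: float, delta: float, constant: float = 10.5) -> float:
    """Unrounded N per group: 2 * variance * constant / delta^2.

    constant = 10.5 corresponds to alpha = 0.05 (2-sided), power = 0.90.
    """
    return 2 * variance * constant / delta ** 2


if __name__ == "__main__":
    for visits, readings in [(1, 1), (1, 2), (2, 1), (2, 2)]:
        var = dbp_variance(visits, readings)
        print(visits, readings, round(var, 2),
              round(n_per_group(var, 8), 1), round(n_per_group(var, 4), 1))
```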
Parallel Group Studies Comparing Average DBP After One Year
1 measure, 1 visit (α = 0.05, β = 0.10)
$$\sigma^2 = 58.4 + 26.1 + 10.2 = 94.7$$
Δ = 4 mmHg:
$$H_O: \mu_T = \mu_C; \qquad H_A: \mu_T \ne \mu_C,\ \mu_C - \mu_T = 4$$
$$n = n_T = n_C = \frac{2\sigma^2 (z_{1-\alpha/2} + z_{1-\beta})^2}{\Delta^2} = \frac{2(94.7)(10.5)}{4^2} = 124.3 \approx 125$$
Δ = 8 mmHg:
$$H_O: \mu_T = \mu_C; \qquad H_A: \mu_T \ne \mu_C,\ \mu_C - \mu_T = 8$$
$$n = n_T = n_C = \frac{2\sigma^2 (z_{1-\alpha/2} + z_{1-\beta})^2}{\Delta^2} = \frac{2(94.7)(10.5)}{8^2} = 31.07 \approx 32$$
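The two calculations above can be reproduced with a small helper. This is a sketch under the slide's assumptions (equal allocation, common variance, normal approximation); `n_per_group` is a hypothetical name, and the result is rounded up to a whole patient as on the slide.

```python
import math
from statistics import NormalDist


def n_per_group(sigma2: float, delta: float,
                alpha: float = 0.05, power: float = 0.90) -> int:
    """N per group for a 2-sided comparison of two means, rounded up."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_beta = std_normal.inv_cdf(power)
    n = 2 * sigma2 * (z_alpha + z_beta) ** 2 / delta ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # sigma^2 = 94.7 (mmHg)^2: one visit, one reading
    print(n_per_group(94.7, delta=4))
    print(n_per_group(94.7, delta=8))
```

Halving Δ roughly quadruples the required sample size, which is why the choice of Δ deserves as much scrutiny as the variance estimate.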
Parallel Group Design Comparing Average
Difference (Year 1 – Baseline) in DBP.
Hypothesis
(2-Tailed)
HO: DBP change from baseline after one year of
treatment with new Drug T equals the DBP
change from baseline after one year for patients
given a diuretic (control)
HA: DBP change from baseline after one year of
treatment with new Drug T is different than the
DBP change from baseline after one year for
patients given a diuretic (control) treatment
(difference is 4 mmHg or more)
Study Population: Those with mild hypertension
• Drug T: Change in DBP (Year 1 – Baseline)
• Diuretic: Change in DBP (Year 1 – Baseline)
Estimate of Variability for Change Outcome
• Prior studies. (For MRFIT, SD of DBP change after 12 months = 9.0 mmHg [baseline is one visit, 2 readings; follow-up is one visit, 2 readings]. For comparison, SD of 12 month DBP is 9.5 mmHg.)
• Use correlation (ρ) of repeat readings for participants to estimate $\sigma_e^2$. (For MRFIT, correlation of DBP at baseline and 12 months is 0.55; note that the variance of the difference can be written as $2\sigma_T^2(1-\rho) = 2\sigma_e^2 = 2(81)(1-0.55) = 72.9$, so SD of change ≈ 8.5 mmHg.)
• Estimate of SD of change using analysis of covariance (regression of change on baseline). (For MRFIT, SD = 7.9 mmHg.)
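The correlation-based estimate in the second bullet can be checked numerically. The sketch below assumes equal variances at baseline and follow-up, as the bullet does; the function names are hypothetical.

```python
import math


def variance_of_change(total_variance: float, rho: float) -> float:
    """Var(follow-up - baseline) = 2 * sigma_T^2 * (1 - rho).

    Assumes the same total variance sigma_T^2 at both time points.
    """
    return 2 * total_variance * (1 - rho)


def change_beats_followup_only(rho: float) -> bool:
    """Change from baseline has smaller variance than a single
    follow-up value (sigma_T^2) exactly when 2 * (1 - rho) < 1."""
    return 2 * (1 - rho) < 1


if __name__ == "__main__":
    var_change = variance_of_change(81, 0.55)  # values quoted for MRFIT
    print(round(var_change, 1), round(math.sqrt(var_change), 2))
    print(change_beats_followup_only(0.55))
```

With ρ = 0.55 the change score is only slightly less variable than the follow-up value alone, because the break-even point is ρ = 0.50.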
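Let $y_B$ = baseline measurement, $y_F$ = follow-up measurement, and $\sigma_t^2 = \sigma_s^2 + \sigma_e^2$.
$$\text{Var}(y_F - y_B) = \text{Var}(y_F) + \text{Var}(y_B) - 2\,\text{cov}(y_F, y_B)$$
$$\text{cov}(y_F, y_B) = \rho\, \sigma_{y_F} \sigma_{y_B} = \rho\, \sigma_t^2 \quad \text{if } \sigma_{y_F} = \sigma_{y_B} = \sigma_t$$
$$\text{and } \text{Var}(y_F - y_B) = 2\sigma_t^2 - 2\rho\,\sigma_t^2 = 2\sigma_t^2 (1 - \rho)$$
so,
$$\text{Var}(y_F - y_B) = 2(\sigma_s^2 + \sigma_e^2)\left(1 - \frac{\sigma_s^2}{\sigma_s^2 + \sigma_e^2}\right) = 2(\sigma_s^2 + \sigma_e^2)\,\frac{\sigma_e^2}{\sigma_s^2 + \sigma_e^2} = 2\sigma_e^2$$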
Crossover Group Design Comparing Average Difference (Diuretic – Drug T) in DBP
Hypothesis
HO: Average of the paired differences, pooled over the two treatment sequences, is zero.
HA: Average is 4 mmHg or more.
Study Population: Those with mild hypertension
• Sequence: Drug T, Washout Period, Diuretic
• Sequence: Diuretic, Washout Period, Drug T
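Crossover Study Design

Sequence    Period 1    Period 2    Diff.
I           y1          y2          $d_I$
II          y1          y2          $d_{II}$

$$\text{Var}(d_I) = 2\sigma_e^2, \qquad \text{Var}(d_{II}) = 2\sigma_e^2$$
$$\Delta = T_T - T_C = E\left[\frac{\bar{d}_I + \bar{d}_{II}}{2}\right] = \frac{D_I + D_{II}}{2}$$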
With parallel group comparison we had:
$H_O: \mu_T = \mu_C$ or $H_O: D_T = D_C$, where $D_T$ and $D_C$ refer to the difference between follow-up and baseline levels of outcome
With crossover we have:
$$H_O: \frac{D_I + D_{II}}{2} = 0$$
or equivalently:
$$H_O: T_T - T_C = 0$$
Variance for Sample Size Formula:
$$\text{Var}\left(\frac{d_I + d_{II}}{2}\right) = \frac{1}{4}\left[\text{var}(d_I) + \text{var}(d_{II})\right] = \frac{1}{4}\left[2\sigma_e^2 + 2\sigma_e^2\right] = \sigma_e^2$$
Substitution into Sample Size Formula Gives:
$$n_c = n_I = n_{II} = \frac{\sigma_e^2 (z_{1-\alpha/2} + z_{1-\beta})^2}{\Delta^2}$$
$n_I = n_{II}$ = number randomly allocated to each sequence, I (AB) or II (BA).
This follows because the variance of the pooled treatment difference across the 2 sequences is $\frac{1}{4}(2\sigma_e^2 + 2\sigma_e^2)$.
Crossover Sample Size Compared to Parallel Design (no baseline)
$$\frac{n_c}{n} = \frac{\sigma_e^2 (z_{1-\alpha/2} + z_{1-\beta})^2 / \Delta^2}{2(\sigma_s^2 + \sigma_e^2)(z_{1-\alpha/2} + z_{1-\beta})^2 / \Delta^2} = \frac{\sigma_e^2}{2(\sigma_s^2 + \sigma_e^2)}$$
With
$$\rho = \frac{\sigma_s^2}{\sigma_s^2 + \sigma_e^2}$$
we have
$$\frac{n_c}{n} = \frac{\sigma_e^2}{2(\sigma_s^2 + \sigma_e^2)} = \frac{1 - \rho}{2}, \quad \text{since } 1 - \rho = \frac{\sigma_e^2}{\sigma_s^2 + \sigma_e^2}$$
$$n_c = \frac{(1 - \rho)\, n}{2}$$
But the crossover design will require twice the number of measurements per patient. So, if ρ = 0, then the number of measurements is equal, but the sample size for crossover is ½.
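The ratio above can be evaluated for any pair of variance components. The sketch below uses the MRFIT estimates quoted earlier (58.4 and 36.3 (mmHg)²) with Δ = 5 mmHg, α = 0.05 (2-sided) and power 0.95 as illustrative inputs; the function names are hypothetical, and the per-sequence crossover formula is the one given on these slides.

```python
import math
from statistics import NormalDist


def _constant(alpha: float, power: float) -> float:
    """(Z_{1-alpha/2} + Z_{1-beta})^2 for a 2-sided test."""
    std_normal = NormalDist()
    return (std_normal.inv_cdf(1 - alpha / 2) + std_normal.inv_cdf(power)) ** 2


def parallel_n(sigma_s2, sigma_e2, delta, alpha=0.05, power=0.95):
    """N per group, parallel design without baseline (unrounded)."""
    return 2 * (sigma_s2 + sigma_e2) * _constant(alpha, power) / delta ** 2


def crossover_n(sigma_e2, delta, alpha=0.05, power=0.95):
    """N per sequence, crossover design (unrounded)."""
    return sigma_e2 * _constant(alpha, power) / delta ** 2


if __name__ == "__main__":
    n = parallel_n(58.4, 36.3, delta=5)
    n_c = crossover_n(36.3, delta=5)
    rho = 58.4 / (58.4 + 36.3)
    print(math.ceil(n), math.ceil(n_c), round(n_c / n, 2), round((1 - rho) / 2, 2))
```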
Consider an Experiment with Diastolic BP Response
Type 1 error = 0.05 (2-sided) and Power = 0.95
$$\sigma_s^2 = 58.4\ (\text{mmHg})^2, \qquad \sigma_e^2 = 36.3\ (\text{mmHg})^2$$
$$\frac{n_c}{n} = \frac{36.3}{2(58.4 + 36.3)} = 0.19$$
About 5 times as many patients are needed for the parallel group design.
For Δ = 5: $n_c \approx 19$, $n = 99$
Examples

Response                                       $\sigma_s^2$    $\sigma_e^2$    ρ       $n_c/n$
DBP (mmHg)                                     58              36              0.62    0.19
Cholesterol (mg/dl)                            1200            400             0.75    0.125
Overnight urine excretion Na+ (meq/8 hours)    325             625             0.34    0.33
2 overnights                                   325             312             0.51    0.24
7 overnights                                   325             90              0.78    0.11
With parallel group comparisons with baseline we need to consider
$$\text{Var}(y_A - y_B) = 2(\sigma_s^2 + \sigma_e^2) \quad \text{or} \quad \text{Var}(d_A - d_B) = 4\sigma_e^2$$
With crossover we need to consider
$$\text{Var}\left(\frac{d_I + d_{II}}{2}\right) = \frac{1}{4}\left(2\sigma_e^2 + 2\sigma_e^2\right) = \sigma_e^2$$
$$\frac{n_c \text{ (crossover)}}{n \text{ (parallel with baseline)}} = \frac{\sigma_e^2 (z_{1-\alpha/2} + z_{1-\beta})^2 / \Delta^2}{4\sigma_e^2 (z_{1-\alpha/2} + z_{1-\beta})^2 / \Delta^2} = \frac{1}{4}$$
Regardless of what $\sigma_e^2$ or ρ is, 4 times as many patients are required for a parallel group design which uses baseline compared to crossover.
Sample size for α = .05 (2-sided) and β = .05

           Parallel,        Parallel,
           no baseline      baseline (ρ=0.75)    Crossover (number/sequence)
Δ/σ        (number/group)   (number/group)       ρ=0.00    ρ=0.25    ρ=0.50    ρ=0.75
0.4        163              80                   82        62        41        20
0.6        72               36                   36        27        18        9
0.8        41               20                   21        15        10        5
1.0        26               12                   13        10        7         3
Key Points
• Sample size should be specified in advance
• Sample size estimation requires collaboration
• Often sample size is based on uncertain
assumptions; therefore estimates should
consider a range of values for key parameters
(i.e., investigate the impact on power if sample
size and treatment effect are not achieved)
• Parameters on which sample size is based should
be evaluated during the trial
• It pays to be conservative; however, the ultimate size
and duration of a study involve compromises,
e.g., power, costs, timeliness.
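Power
A measure of how likely the study will detect a specific treatment difference (∆), if present.
$$\text{Prob}(\text{rej } H_O \mid H_A \text{ is true}) = 1 - \beta$$
$$1 - \beta = \text{Prob}(\text{rej } H_O \mid \mu_1 - \mu_2 = \Delta)$$
$$= \text{Prob}\left( \frac{\bar{x}_1 - \bar{x}_2 - 0}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}} \ge Z_{1-\alpha/2} \;\Big|\; \mu_1 - \mu_2 = \Delta \right) + \text{Prob}\left( \frac{\bar{x}_1 - \bar{x}_2 - 0}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}} \le -Z_{1-\alpha/2} \;\Big|\; \mu_1 - \mu_2 = \Delta \right)$$
Assume $\sigma_1^2 = \sigma_2^2 = \sigma^2$ and $n_1 = n_2 = n$.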
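Power (cont.)
$$1 - \beta = \text{Prob}\left( \frac{\bar{x}_1 - \bar{x}_2 - \Delta}{\sqrt{2\sigma^2/n}} \ge Z_{1-\alpha/2} + \frac{0 - \Delta}{\sqrt{2\sigma^2/n}} \;\Big|\; \mu_1 - \mu_2 = \Delta \right) + \text{Prob}\left( \frac{\bar{x}_1 - \bar{x}_2 - \Delta}{\sqrt{2\sigma^2/n}} \le -Z_{1-\alpha/2} + \frac{0 - \Delta}{\sqrt{2\sigma^2/n}} \;\Big|\; \mu_1 - \mu_2 = \Delta \right)$$
$$= \text{Prob}\left( Z \ge Z_{1-\alpha/2} - \frac{\Delta}{\sqrt{2\sigma^2/n}} \right) + \text{Prob}\left( Z \le -Z_{1-\alpha/2} - \frac{\Delta}{\sqrt{2\sigma^2/n}} \right)$$
Usually one of these probabilities will be very close to zero, depending on whether ∆ is positive or negative.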
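Power (cont.)
If Δ > 0, then the 2nd Prob ≈ 0, and
$$1 - \beta = \text{Prob}\left( Z \ge Z_{1-\alpha/2} - \frac{\Delta}{\sqrt{2\sigma^2/n}} \right) = \text{Prob}(Z \ge Z_c), \qquad Z_c = Z_{1-\alpha/2} - \frac{\Delta}{\sqrt{2\sigma^2/n}}$$
α = 0.05: $Z_{1-\alpha/2} = 1.96$
α = 0.01: $Z_{1-\alpha/2} = 2.575$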
Sensitivity of Power to Variations in Other Sample Size Parameters
(Assume σ² = 100)

α       ∆    n      Zc        Power
0.05    4    100    -0.87     0.81
0.01    4    100    -0.25     0.60
0.05    6    100    -2.28     0.99
0.01    6    100    -1.67     0.95
0.05    4    200    -2.04     0.98
0.01    4    200    -1.425    0.92
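The power values in this table follow from the two-tail expression derived on the preceding slides. A minimal sketch, assuming equal group sizes and a common variance; `power_two_means` is a hypothetical name.

```python
import math
from statistics import NormalDist


def power_two_means(delta: float, sigma2: float, n: int,
                    alpha: float = 0.05) -> float:
    """Power of a 2-sided test comparing two means with n per group.

    Sums both rejection tails; for delta > 0 the lower tail is ~0.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = delta / math.sqrt(2 * sigma2 / n)
    upper_tail = 1 - std_normal.cdf(z_crit - shift)  # Prob(Z >= Z_c)
    lower_tail = std_normal.cdf(-z_crit - shift)
    return upper_tail + lower_tail


if __name__ == "__main__":
    for alpha, delta, n in [(0.05, 4, 100), (0.01, 4, 100),
                            (0.05, 6, 100), (0.01, 6, 100),
                            (0.05, 4, 200), (0.01, 4, 200)]:
        print(alpha, delta, n, round(power_two_means(delta, 100, n, alpha), 2))
```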
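Unequal Sample Sizes
$N_T = 2N_c$ (2:1 allocation):
$$SE(\text{diff}) = \sigma \sqrt{\frac{1}{N_c} + \frac{1}{2N_c}}$$
$$N_c = \frac{1.5\,\sigma^2 (Z_{1-\alpha/2} + Z_{1-\beta})^2}{\Delta^2}$$
$N_T = kN_c$ (k:1 allocation):
$$N_c = \frac{\left(1 + \frac{1}{k}\right) \sigma^2 (Z_{1-\alpha/2} + Z_{1-\beta})^2}{\Delta^2}$$
Relative total sample size for k:1 versus 1:1 allocation:
$$\frac{2 + k + \frac{1}{k}}{4}$$
For k = 2: 4.5/4 = 1.125, i.e., a 12.5% increase.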
Another Formulation: Unequal Allocation
Comparison of means (Treatment C vs. E):
$$\text{Total } N = \left( \frac{1}{P_C} + \frac{1}{P_E} \right) \frac{\sigma^2 (Z_{1-\alpha/2} + Z_{1-\beta})^2}{\Delta^2}$$
$P_C$ and $P_E$ = fraction of patients assigned control (C) and experimental treatment (E); $P_C + P_E = 1$
Total Sample Size for Different Allocation Ratios
σ² = 94.7 (mm Hg)²; α = .05 (2-sided test); Power (1 - β) = 0.90

             Allocation Ratio (E:C)
∆ (mm Hg)    1:1    2:1    3:1    1:2
4            250    280    332    280
8            64     70     84     70

Sample sizes rounded up
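The total-N formula for unequal allocation can be evaluated directly. The sketch below returns unrounded totals using the constant 10.5 (α = 0.05 2-sided, power 0.90); the table's entries are rounded up, apparently to totals that split evenly by the allocation ratio, so they are slightly larger. The function name is hypothetical.

```python
def total_n(p_e: float, p_c: float, sigma2: float, delta: float,
            constant: float = 10.5) -> float:
    """Unrounded total N = (1/P_C + 1/P_E) * sigma^2 * constant / delta^2.

    p_e and p_c are the allocation fractions and must sum to 1.
    """
    if abs(p_e + p_c - 1.0) > 1e-9:
        raise ValueError("allocation fractions must sum to 1")
    return (1 / p_c + 1 / p_e) * sigma2 * constant / delta ** 2


if __name__ == "__main__":
    ratios = {"1:1": (1 / 2, 1 / 2), "2:1": (2 / 3, 1 / 3),
              "3:1": (3 / 4, 1 / 4), "1:2": (1 / 3, 2 / 3)}
    for label, (p_e, p_c) in ratios.items():
        print(label, round(total_n(p_e, p_c, 94.7, 4), 1),
              round(total_n(p_e, p_c, 94.7, 8), 1))
```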
Multiple Treatments and Unequal Allocation
Example: m experimental treatments and control; comparison of means
Let m = no. of experimental treatments
n = no. of patients on each experimental treatment
N - mn = no. of patients on control arm
$\bar{X}_c$ = response to control treatment
$\bar{X}_e$ = response to each experimental treatment
μ = E(x)
$$V(\bar{X}_e - \bar{X}_c) = \frac{\sigma_e^2}{n} + \frac{\sigma_c^2}{N - mn}$$
Problem: Find n which minimizes the variance
Solution: Take the derivative with respect to n and set it equal to zero. Then,
$$N - mn = n \sqrt{m}\, \frac{\sigma_c}{\sigma_e}$$
and if $\sigma_e = \sigma_c$,
$$N - mn = n \sqrt{m}$$
No. of patients in control group = no. of patients in each experimental group times the square root of the no. of treatments
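The allocation rule can be turned into arm sizes for a given total N. This sketch simply solves N - mn = n√m σc/σe for n; the function name and the example inputs (N = 600, m = 4) are illustrative assumptions, not values from the lecture.

```python
import math


def optimal_allocation(total_n: float, m: int,
                       sigma_e: float = 1.0, sigma_c: float = 1.0):
    """Return (n per experimental arm, n on control) minimizing
    Var(Xbar_e - Xbar_c) for a fixed total sample size.

    Solves N - m*n = n * sqrt(m) * sigma_c / sigma_e for n.
    Results are unrounded; round in practice.
    """
    n_experimental = total_n / (m + math.sqrt(m) * sigma_c / sigma_e)
    n_control = total_n - m * n_experimental
    return n_experimental, n_control


if __name__ == "__main__":
    n_exp, n_ctrl = optimal_allocation(600, m=4)
    print(n_exp, n_ctrl, n_ctrl / n_exp)
```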
Other Issues with Multiple Groups
• Multiple comparisons (α adjustment)
• Interim analyses – possible early termination of some, but not all, treatment groups
Minimum Clinically Important Difference (MCID)
For a given sample size (N), the null hypothesis (HO: difference in means = 0) will be rejected if the observed difference (d) satisfies
$$d \ge Z_{1-\alpha/2} \sqrt{\frac{2\sigma^2}{N}}$$
If N was determined so that
$$N = \frac{2\sigma^2 (Z_{1-\alpha/2} + Z_{1-\beta})^2}{(\text{MCID})^2}$$
then
$$d \ge \text{MCID} \left( \frac{Z_{1-\alpha/2}}{Z_{1-\alpha/2} + Z_{1-\beta}} \right)$$
For α = 0.05, β = 0.10, P ≤ 0.05 if d is 61% of MCID
d can be smaller than MCID and p<0.05!
Chuang-Stein C et al. Pharmaceutical Stat 2010.
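The fraction of the MCID that is just enough for statistical significance depends only on α and power. A short sketch, with a hypothetical function name:

```python
from statistics import NormalDist


def significant_fraction_of_mcid(alpha: float = 0.05,
                                 power: float = 0.90) -> float:
    """Smallest observed difference, as a fraction of the MCID, that
    reaches 2-sided significance when N was sized to detect the MCID."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_beta = std_normal.inv_cdf(power)
    return z_alpha / (z_alpha + z_beta)


if __name__ == "__main__":
    print(round(significant_fraction_of_mcid(0.05, 0.90), 3))
    print(round(significant_fraction_of_mcid(0.05, 0.80), 3))
```

Higher power lowers this fraction, so a well-powered trial can declare significance for observed effects well below the MCID.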
Sample Size for Dual Criteria: Statistical
Significance and Clinical Significance
• In some cases, you may want to establish
with high probability that the treatment effect
is as large as MCID
– For example, a new HIV vaccine might be
assumed to have 60% efficacy but the study is
designed to have sufficient power to rule out
efficacy lower than 30%
– This will require a larger sample size
– For example, if Δ=2MCID, then sample size is 4
times greater
Summary (General)
• It is important that sample size be large enough to achieve the goals of the study – too many studies are conducted which are underpowered.
• Sample size assumptions are frequently very rough so they should be re-evaluated as the study progresses.
• A good knowledge of the subject matter (background on disease and intervention, outcomes, and target population) is necessary to estimate sample size.
Summary (Crossover versus Parallel Group)
• Efficiency of crossover increases as ρ increases.
• Design using change from baseline as response is better than design which just uses follow-up responses if ρ > 0.50.
• With multiple measurements on each patient, to establish baseline and follow-up levels, sample size can be reduced.