Sample Size Determination

Chapter 4-1. Sample Size Determination and Power Analysis for Specific Applications
The Basics
The basics are covered in the Biostatistics Section, chapter 2-5.
This chapter contains specific applications only.
Two Independent Groups Comparison of Means (Independent Groups t Test)
When a comparison of means from a continuous, or interval scaled, outcome variable is planned
for two independent groups, an independent groups t test is appropriate.
To compute the sample size, you must provide the two expected means, the difference being the
minimally clinically interesting effect or the anticipated effect. You must also provide the two
assumed standard deviations and a choice for power (at least 80%).
In Stata, the command syntax for equal sample sizes in the two groups is
sampsi mean1 mean2 , sd1( ) sd2( ) power( )
By default, a two-sided comparison is used with an alpha = 0.05.
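For reference, the closed-form normal-approximation formula that underlies this calculation (a standard result, stated here for orientation rather than quoted from this manual) is, per group,

$$n = \frac{(z_{1-\alpha/2} + z_{1-\beta})^2\,(\sigma_1^2 + \sigma_2^2)}{(\mu_1 - \mu_2)^2}$$

For the example below, $(1.96 + 0.8416)^2\,(2^2 + 2.5^2)/(5-4)^2 \approx 80.5$, which rounds up to the n = 81 per group that Stata reports.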
_____________________
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript] University of Utah
School of Medicine, 2010.
For example, to compare mean±SDs of 4±2 vs 5±2.5 with a desired 80% power, using the Stata menus:
Statistics
Power and sample size
Tests of means and proportions
Main tab: Input: Two-sample comparison of means
Main tab: Mean one: 4
Main tab: Std. deviation one: 2
Main tab: Mean two: 5
Main tab: Std. deviation two: 2.5
Options tab: Output: Compute sample size
Options tab: Sample-based calculations: Ratio of sample sizes: 1
Options tab: Power of the test: .8
Options tab: Sides: Two-sided test
Options tab: Alpha (using 0.05): .05 or leave blank
OK
sampsi 4 5, sd1(2) sd2(2.5) power(.80)
Estimated sample size for two-sample comparison of means

Test Ho: m1 = m2, where m1 is the mean in population 1
                    and m2 is the mean in population 2
Assumptions:

         alpha =   0.0500  (two-sided)
         power =   0.8000
            m1 =        4
            m2 =        5
           sd1 =        2
           sd2 =      2.5
         n2/n1 =     1.00

Estimated required sample sizes:

            n1 =       81
            n2 =       81
To compute the power for a given sample size, you leave off the power( ) option and replace it with n1( ) and n2( ).
sampsi 4 5, sd1(2) sd2(2.5) n1(90) n2(90)
Estimated power for two-sample comparison of means

Test Ho: m1 = m2, where m1 is the mean in population 1
                    and m2 is the mean in population 2
Assumptions:

         alpha =   0.0500  (two-sided)
            m1 =        4
            m2 =        5
           sd1 =        2
           sd2 =      2.5
sample size n1 =       90
            n2 =       90
         n2/n1 =     1.00

Estimated power:

         power =   0.8421
Linear Regression: Comparing two groups adjusted for covariates
Usually when linear regression is applied in research, it involves the comparison of two groups
after adjusting for potential confounders. That is, it is an adjusted means comparison problem.
Sample size calculation, then, is simply a question of how large a sample is required to detect a difference between two means. You can use the same sample size calculation formula that you would use to compare two means with a t test. It is too difficult to know how much the means will change when covariates are added, so you simply do not attempt that much precision in your sample size determination.
However, if you have preliminary data on adjusted means, and their standard deviations, you could use those in your calculation. Usually, you are not going to know what these will be after you adjust for all of your covariates, so it is a generally accepted practice to use the unadjusted means and standard deviations.
<< Completely revise this section – excellent discussion of this topic in Steyerberg. >>
Steyerberg EW. (2009). Clinical Prediction Models: A Practical Approach to Development,
Validation, and Updating. New York, Springer. pp.25-27.
Two Independent Groups Comparison of Dichotomous Outcome Variable (chi-square test,
Fisher’s exact test)
For a given test statistic, sample size is determined by the following five things:
1) the effect size in the population
2) the standard deviation in the population
3) our choice of alpha
4) whether we will use a one-sided or two-sided comparison (one-tailed or two-tailed test)
5) desired power
It appears, then, that we need to specify the standard deviation to compute power for a test
statistic that compares two proportions. We don’t. The formula uses the standard deviations of
the proportions (specifically, Bernoulli variables), but it computes them from the proportions,
basically as
$$\text{std. dev.} = \sqrt{p\,(1 - p)}$$
This is computed internally by any sample size software once the proportion is specified.
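As a quick illustration (ours, not part of the original text), you can reproduce this internal computation yourself in Stata:

display sqrt(.3*(1-.3))
.45825757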
The power analysis for comparing two proportions in Stata is basically for the chi-square test
with continuity correction (precisely, the formula is the normal approximation with a continuity
correction). Thus, the result is closer to a Fisher’s exact test than a chi-square test, since the
continuity correction moves the p value in the direction of the Fisher’s exact test p value. That is
a good thing, however, since you do not know in advance if you’ll meet the expected frequency
rule for the chi-square test, and thus have to use Fisher’s exact test anyway.
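For reference, the continuity-corrected normal-approximation formula that reproduces Stata's answers here (a standard Fleiss-style form, stated as a check rather than quoted from the Stata documentation) is

$$n' = \frac{\left[z_{1-\alpha/2}\sqrt{2\,\bar{p}\bar{q}} + z_{1-\beta}\sqrt{p_1 q_1 + p_2 q_2}\right]^2}{(p_1 - p_2)^2}, \qquad \bar{p} = \frac{p_1 + p_2}{2},\ \ \bar{q} = 1 - \bar{p}$$

$$n = \frac{n'}{4}\left[1 + \sqrt{1 + \frac{4}{n'\,|p_1 - p_2|}}\right]^2$$

For $p_1 = 0.30$, $p_2 = 0.40$ with 90% power (the second example below), $n' \approx 476$ and $n \approx 496$, matching Stata's output.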
Suppose we want to conduct a study to detect a difference of 1.5% in preterm births between
males and females, based on published preterm births incidence proportions of 0.301 for males
and 0.316 for females. We would use:
Statistics
Summaries, tables & tests
Classical tests of hypotheses
Sample size & power determination
Main tab: Two-sample comparison of proportions (values in [0,1]):
Proportion one: 0.301
Proportion two: 0.316
Options tab: Output: Compute sample size
Power of the test: 0.90
OK
sampsi .301 .316 , alpha(.05) power(.90)
Estimated sample size for two-sample comparison of proportions

Test Ho: p1 = p2, where p1 is the proportion in population 1
                    and p2 is the proportion in population 2
Assumptions:

         alpha =   0.0500  (two-sided)
         power =   0.9000
            p1 =   0.3010
            p2 =   0.3160
         n2/n1 =     1.00

Estimated required sample sizes:

            n1 =    20056
            n2 =    20056
We see that such a study is only practical if we have a large database already available.
For sample sizes this large, the p values for the three statistical approaches (chi-square, corrected
chi-square, and Fisher’s exact test) will be equivalent, and so will be the required sample sizes.
Let’s try a larger effect size.
sampsi .300 .400 , alpha(.05) power(.90)
Estimated sample size for two-sample comparison of proportions

Test Ho: p1 = p2, where p1 is the proportion in population 1
                    and p2 is the proportion in population 2
Assumptions:

         alpha =   0.0500  (two-sided)
         power =   0.9000
            p1 =   0.3000
            p2 =   0.4000
         n2/n1 =     1.00

Estimated required sample sizes:

            n1 =      496
            n2 =      496
In SamplePower 2.0 we can compute the required sample specifically for each of the three
statistical tests, getting
uncorrected chi-square test:   n1 = n2 = 479
corrected chi-square test:     n1 = n2 = 498, which is very close to Stata's n = 496
Fisher's exact test:           n1 = n2 = 496
We see that Stata's sample size calculation is conservative, but it provides an adequate sample size for any of the three approaches. It works for both the uncorrected chi-square test and Fisher's exact test; since you cannot be sure in advance which test you will require, you might as well go with the larger sample size.
Two Independent Groups Comparison of a Nominal Outcome Variable (chi-square test and Fisher-Freeman-Halton test)
The r × c table case for power or sample size calculation is not available in Stata or PEPI 4.0. At first, you might think it makes sense to try collapsing the rows or columns to make it a 2 × 2 table, and then using the approach we took for the dichotomous case. For the 3 × 2 case, collapsing would amount to computing the sample size for one row of the table at a time:
           |         col
       row |         1          2 |     Total
-----------+----------------------+----------
         1 |        14          8 |        22
           |     14.89       8.60 |     11.76
-----------+----------------------+----------
         2 |        76         82 |       158
           |     80.85      88.17 |     84.49
-----------+----------------------+----------
         3 |         4          3 |         7
           |      4.26       3.23 |      3.74
-----------+----------------------+----------
     Total |        94         93 |       187
           |    100.00     100.00 |    100.00
sampsi .1489 .0860 , alpha(.05) power(.90)   // gives n1=n2=580
sampsi .8085 .8817 , alpha(.05) power(.90)   // gives n1=n2=539
sampsi .0426 .0323 , alpha(.05) power(.90)   // gives n1=n2=7332
We see that it is not clear what the correct sample size should be. Fortunately, the correct sample
size calculation, using the entire table simultaneously, can be done using SamplePower 2.0. For
power of 0.90, we need n1=n2=997.
Two Independent Groups Comparison of Ordinal Outcome Variable (Wilcoxon-Mann-Whitney test)
The power for the Wilcoxon-Mann-Whitney test can be computed with StatXact-5, but not with
Stata-8 or SamplePower 2.0. It can also be computed with the PEPI-4.0 (Abramson and
Gahlinger, 2001) SAMPLES.EXE program.
PEPI computes the required sample size for a Wilcoxon-Mann-Whitney test (adjusted for ties, which is what you would normally use). The computation is accurate only when it generates a moderate to large sample size (it is only asymptotically correct). It calculates the sample size using the procedure described by Whitehead (1993).
The Whitehead procedure is based on a proportional odds model, which assumes that when the 2 × k table displaying the data is converted to a 2 × 2 table by combining adjacent categories, the odds ratio is the same whatever cutting-point is used.
For a 2 × 2 table with cell counts a, b, c, and d,

     a   b
     c   d

the odds ratio is defined as odds ratio = ad/bc.
Example (Whitehead, 1993): Suppose that the following values are considered appropriate:

                 Highest                              Lowest
                 Very good    Good    Moderate    Poor
Control Group:      20%        50%       20%       10%
It is intuitive to consider the cutpoint as (Very good or Good) vs (Moderate or Poor). Thus 70%
will be in this upper category in the control group. We choose 85% in this upper category as the
minimal clinically relevant effect, which we feel is obtainable (likely true in the population).
Thus we have the following:
            Experimental    Control
Success          85%           70%
Failure          15%           30%

OR = (85 × 30)/(70 × 15) = 2.43
We assume that for any other cut-point for combining adjacent categories, the odds ratio will still
be 2.43 (proportional odds assumption).
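For reference, the sample size formula behind this procedure, as we read Whitehead (1993) for two equal-sized groups, is

$$n \text{ per group} = \frac{6\,(z_{1-\alpha/2} + z_{1-\beta})^2}{(\log \mathrm{OR})^2 \left(1 - \sum_{i=1}^{k} \bar{p}_i^{\,3}\right)}$$

where $\bar{p}_i$ is the average of the control and experimental proportions in category i. Plugging in OR = 2.43, 90% power, and the category percents of this example gives approximately 93.3, which rounds up to the n = 94 per group that PEPI reports below.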
Running PEPI SAMPLES (available in the PEPI subdirectory):

type 4 for Wilcoxon-Mann-Whitney test
level of significance: .05
power (%): 90
ratio of sample sizes: 1
how many categories: 4
category B % (controls):
   1: 20
   2: 50
   3: 20
   4: 10
odds ratio: 2.43
So that you can check that the proportional odds assumption is close to what you expect your cell percents to be in the experimental group, PEPI reports:

An odds ratio of 2.43 expressed the following findings:

               Group A           Group B
            (experimental)      (control)
Category          %                 %
    1            37.8              20
    2            47.2              50
    3            10.6              20
    4             4.4              10
If we are happy with these percents for our experimental group (they look like what we expect to observe), then the sample size is appropriate. If they are way off, then you cannot use PEPI to compute your sample size, because it must assume proportional odds.
On the next screen, PEPI reports
n=94 subjects are required in each group.
This matches the example in the Whitehead (1993) article (bottom of page 2261), so we can feel
confident that PEPI calculates the sample size for a Wilcoxon-Mann-Whitney test correctly.
Since we are using 4 categories with our ordinal scale, rather than only 2 categories with a 2 × 2 Fisher's exact test, we should have computed a smaller required sample size. This is the case.
PEPI SAMPLES for comparing the two independent proportions 0.70 to 0.85 reports that we
would need n=161 in each group to detect the same effect using that statistical approach.
Protocol
You could state,
The sample size for testing the Quality of Life outcome (or whatever the variable is)
using a Wilcoxon-Mann-Whitney test was computed using the procedure
reported by Whitehead (1993). We assumed the Quality of Life ordered category
percents for the control group will be 20% (very good), 50%, 20%, and 10% (poor), with
a proportional odds of a higher quality of life for the experimental group of 2.43. This
odds ratio corresponds to a response of 70% “good” or “very good” in the control group
and 85% “good” or “very good” in the experimental group, which we selected as our
minimal clinically relevant effect to be able to detect. For this proportional odds, the
expected quality of life category percents for the experimental group are 37.8%, 47.2%,
10.6%, and 4.4%, which are consistent with what we expect to observe. This sample size
provides 90% power to detect this effect with an alpha of 0.05 using a two-sided
comparison.
Paired Ordinal Outcome Variable (Wilcoxon signed ranks test)
The required sample size for the Wilcoxon signed ranks test cannot be computed with StatXact-6, Stata-8, or SamplePower 2.0. It can, however, be computed with the PEPI-4.0 (Abramson and Gahlinger, 2001) SAMPLES.EXE program.
PEPI computes the required sample size for the comparison of proportions in ordered categories
of matched samples (1 case to 1 control, or pre and post measures on the same individuals) under
the assumption of proportional odds. This approach, although not specifically tailored to the
Wilcoxon signed ranks test, provides a reasonable approximation for the required sample size of
the Wilcoxon signed ranks test. I am not aware of any other approach, except to derive the
sample size by simulation. PEPI calculates the sample size using the procedure described by
Julious and Campbell (1998). The procedure considers only the discordant pairs (different scores on pre and post test), throwing away the concordant pairs (same score on pre and post test), consistent with the Wilcoxon signed ranks test.
Consider, for example, a variable that is measured on a three-point scale. Calculating the change scores (post test minus pretest), the possible values are -2, -1, 0, 1, and 2. The 0s are ignored, leaving -2 and -1 as the negative discordant categories, and 1 and 2 as the positive discordant categories.
If no pilot data are available, we next assume a value for the odds ratio
OR = odds of a pair being positive = ratio of positive changes to negative changes.
Using the example in Julious and Campbell (1998), we wish to conduct a study, such as a
matched case-control or perhaps a cross-over trial, where the outcome is the Hospital Anxiety
and Depression Scale, which has three categories.
We might estimate the OR by expecting that we will observe 5 positive changes for each
negative change, OR=5/1=5.
We then must estimate the distribution of positives, which we might guess will be 0.8 for +1 and
0.2 for +2, which correctly sums to 1.
Under the proportional odds assumption,
$$\mathrm{OR} = \frac{C_{pi}\,(1 - C_{ni})}{C_{ni}\,(1 - C_{pi})}$$

which is fixed for all i < k, where k = number of categories. In this equation, $C_{pi}$ is the cumulative proportion of positive pairs in category i, and $C_{ni}$ is the cumulative proportion of negative pairs.
Plugging in OR = 5 and $C_{p2} = 0.2$, we get

$$5 = \frac{0.2\,(1 - C_{n2})}{C_{n2}\,(1 - 0.2)} = \frac{0.2\,(1 - C_{n2})}{0.8\,C_{n2}} = \frac{1}{4}\left(\frac{1 - C_{n2}}{C_{n2}}\right)$$

$$\Rightarrow\ 20 = \frac{1 - C_{n2}}{C_{n2}} = \frac{1}{C_{n2}} - 1 \ \Rightarrow\ 21 = \frac{1}{C_{n2}} \ \Rightarrow\ C_{n2} = \frac{1}{21} \approx 0.048$$
Since the cumulative proportions must sum to 1, by subtraction we get Cn1 = 1 – 0.048 = 0.952.
We now have the proportions that are conditional upon a change being positive or negative,
where the negatives sum to 1 and the positives sum to 1:
difference      -2             -1             +1            +2
              p_n2 = 0.048   p_n1 = 0.952   p_p1 = 0.8    p_p2 = 0.2
These can be converted to the unconditional expected proportions by multiplying the negatives
by 1/(OR+1) and the positives by OR/(OR+1):
difference      -2             -1             +1            +2
              p_n2 = 0.048   p_n1 = 0.952   p_p1 = 0.8    p_p2 = 0.2
p_i:          0.048(1/6) =   0.952(1/6) =   0.8(5/6) =    0.2(5/6) =
                0.008          0.159          0.667         0.167
where the pi sum to 1.
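This arithmetic is easy to script. Here is a minimal Stata sketch (our illustration, not part of PEPI) that solves the proportional odds equation for $C_{n2}$ and rebuilds the unconditional proportions; the macro names are made up for this example:

* assumed inputs: the odds ratio and the split of positive changes
local OR  = 5
local pp1 = 0.8                              // P(change = +1 | positive)
local pp2 = 0.2                              // P(change = +2 | positive), = Cp2
* solve OR = Cp2(1-Cn2)/[Cn2(1-Cp2)] for Cn2
local Cn2 = `pp2'/(`OR'*(1-`pp2') + `pp2')   // = 1/21 = 0.048
local pn1 = 1 - `Cn2'                        // = 0.952
* unconditional proportions: negatives weighted by 1/(OR+1), positives by OR/(OR+1)
display "p(-2) = " `Cn2'*(1/(`OR'+1)) "   p(-1) = " `pn1'*(1/(`OR'+1)) ///
        "   p(+1) = " `pp1'*(`OR'/(`OR'+1)) "   p(+2) = " `pp2'*(`OR'/(`OR'+1))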
Running PEPI SAMPLES:

type 5 for comparison of ordered categories, matched pairs
level of significance: .05
power (%): 80
how many categories: 3
odds ratio: 5
probability size of discrepancy (value of change score) 1: .8
probability size of discrepancy (value of change score) 2: .2
which produces the result that you need 12 discordant pairs to achieve this power.
If you assume that 1/3 of your sample will be discordant, then you need a total of 36 pairs as your actual sample size.
If pilot data are available, we can use these data to estimate the odds ratio for the proportional
odds assumption, as
OR = (proportion positive) / (proportion negative)
We would get these proportions by first computing the change scores and then generating a
frequency table of the change scores.
In the above example, using Stata, this would look like:
gen diff = posthads – prehads
tab diff if diff ~= 0
<- compute change scores
<- frequency table ignoring “no change” values
which will produce:

change score    frequency    percent
     -2              2          0.8
     -1             32         15.9
     +1            133         66.7
     +2             33         16.7
  total            200        100.0
Computing the proportional odds:
OR = (0.667+0.167)/(0.008+0.159) = 5
PEPI will require the proportion of positives in the +1 category, which is 133/(133+33)=0.8
and the proportion of positives in the +2 category, which is 33/(133+33)=0.2
Running PEPI SAMPLES, we input the same values as before:

type 5 for comparison of ordered categories, matched pairs
level of significance: .05
power (%): 80
how many categories: 3
odds ratio: 5
probability size of discrepancy (value of change score) 1: .8
probability size of discrepancy (value of change score) 2: .2
Protocol Suggestion
You could state,
The required sample size for testing the Hospital Anxiety and Depression Scale (or whatever the variable is) using a Wilcoxon signed ranks test was computed using the procedure reported by Julious and Campbell (1998). Based on pilot data, we assumed a proportional odds of positive discordant pairs to negative discordant pairs of 5.0, with proportions of discordant pairs of 0.008, 0.159, 0.667, and 0.167 for changes of -2, -1, +1, and +2, respectively. For 80% power with a two-sided 0.05 level test, this requires 12 discordant pairs. Assuming that only 33% of the pairs will be discordant, consistent with the pilot data, we require a total of N=36 pairs, or subjects. This approach is consistent with the Wilcoxon signed ranks test, which uses, and bases its sample size on, only the discordant pairs.
Interrater Reliability (Precision of Confidence Interval Around Intraclass Correlation
Coefficient)
Interrater reliability is how close two or more raters agree on the value they assign to a
measurement for the same subject. Intrarater reliability is how close the ratings are for the same
subjects assigned by the same rater on two or more occasions (such as test/re-test reliability). For
both, the reliability coefficient is the intraclass correlation coefficient (ICC).
It was stated in Chapter 2-5, page 32, that the sample size for an interrater reliability study is based on the desired width of the confidence interval around the ICC statistic, rather than on a hypothesis test that the ICC is different from zero (Bristol, 1989; Chow et al, 2008). Another decision that has to be made when designing the reliability study is how many raters to use, which also affects the sample size calculation.
A formula for estimating the required sample size is provided by Bonett (2002).
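The formula, as implemented in the Stata code below, is approximately

$$n \approx \frac{8\, z_{1-\alpha/2}^{2}\,(1-\rho)^2\,\left[1 + (k-1)\rho\right]^2}{k\,(k-1)\,w^2} + 1$$

where ρ is the planning value of the ICC, k is the number of raters, and w is the desired CI width; the code adds a small correction for the special case of k = 2 raters with ρ ≥ 0.7.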
Step 1) Copy the following into the Stata do-file editor, highlight it, and hit the run key (rightmost menu button). This will load the program, or sampicc command, into your current session
of Stata. Once loaded, it will execute as any other Stata command, for your current session of
Stata only.
* syntax: sampicc , icc(0.7) raters(5) width(0.2) level(.95)
*   where icc    = assumed ICC
*         raters = number of raters
*         width  = desired precision (upper minus lower limits)
*         level  = confidence level of CI, e.g., .95 for a 95% CI
capture program drop sampicc
program define sampicc , rclass
    version 10
    syntax [, icc(real 0.7) raters(real 2) width(real 0.2) ///
        level(real 0.95)]
    local rho = `icc'
    local k = `raters'
    local w = `width'
    local level = `level'
    local alpha = 1 - `level'
    local n = 8*(invnorm(1-`alpha'/2))^2*((1-`rho')^2 ///
        *(1+(`k'-1)*`rho')^2)/(`k'*(`k'-1)*`w'^2)+1
    if (`k'==2 & `rho'>=0.7) {
        local n = `n'+5*`rho'  // improved estimate for this special case
    }
    local n = round(`n'+0.5)   // round up to nearest integer
    display _newline
    display as text ///
        "Required N for desired precision of exact CI around ICC"
    display as text ///
        "-------------------------------------------------------"
    display as result "Assumed ICC: " `rho'
    display as result `level'*100 "% CI width (upper minus " _c
    display as result "lower limits): " `w'
    display as result "Number of raters: " `k'
    display as result "Required n: " `n'
    return scalar num_subjects = `n'   // required N
    return scalar level = `level'
    return scalar width = `w'
    return scalar num_raters = `k'
    return scalar icc = `rho'
end
The command "sampicc" has the following syntax:

sampicc , icc(#) raters(#) width(#) level(#)

where
   icc    = assumed ICC (expressed as a proportion between 0 and 1)
   raters = number of raters
   width  = desired precision (upper minus lower limits)
   level  = confidence level of the CI (expressed as a proportion
            between 0 and 1), e.g., .95 for a 95% CI

Example: For an assumed ICC=0.70, using 4 raters and a desired 95% CI width of 0.2 (upper bound minus lower bound), you would use:

sampicc , icc(0.7) raters(4) width(0.2) level(.95)
Step 2) Execute the command in do-file editor or command window.
sampicc , icc(0.85) raters(4) width(0.2) level(.95)
which outputs:
Required N for desired precision of exact CI around ICC
-------------------------------------------------------
Assumed ICC: .85
95% CI width (upper minus lower limits): .2
Number of raters: 4
Required n: 20
If the sample turns out to give an ICC of 0.85, using 4 raters, the width of the 95% CI will be
close to 0.20. The result agrees with the example given in the article the formula was taken from
(Bonett, 2002), so you can be confident it was programmed correctly.
To see what the sampicc command returns for use in programming, use
return list
scalars:
          r(icc) =  .85
   r(num_raters) =  4
        r(width) =  .2
        r(level) =  .95
 r(num_subjects) =  20
These returned values allow us to use the sampicc program, or command, in a loop to look at various combinations of CI width and number of raters. Copying the following into the Stata do-file editor and executing it will provide the required sample sizes for the selected combinations.
* Vary the number of raters from 2 to 6
* Vary 95% CI width from .1 to .4 in increments of 0.05
* Fix ICC at 0.7 and confidence level to 95%
preserve  // hold copy of original dataset
clear
quietly set obs 100
quietly gen _width=.
forval r=2/6 {
    quietly gen _raters`r'=.
    local row=0
    forval w=0.1(.05)0.4 {
        local row=`row'+1
        quietly sampicc , icc(.7) raters(`r') width(`w') level(.95)
        quietly replace _width = r(width) in `row'
        quietly replace _raters`r' = r(num_subjects) in `row'
    }
}
list _width _raters* if _raters2~=. , noobs sep(0) clean
restore  // return original dataset into memory
    _width   _raters2   _raters3   _raters4   _raters5   _raters6
        .1        405        267        223        201        188
       .15        183        120        100         90         84
        .2        105         68         57         51         48
       .25         69         44         37         33         31
        .3         49         31         26         24         22
       .35         38         23         20         18         17
        .4         30         18         15         14         13
The Stata-created variable _width holds the requested widths, _raters2 holds the required sample size when two raters are used, _raters3 for three raters, and so on. From this, if we can afford to assess interrater reliability (ICC) on n=31 subjects, we would probably decide to use 3 raters and accept a width of 0.3 for our 95% CI.
In Table 1 of Bonett (2002), the width was set to 0.2, the ICC was varied from 0.1, 0.2, …, 0.9, and the number of raters was selected as 2, 3, 5, and 10. To duplicate Bonett's Table 1, and to illustrate how to modify the looping structure of this Stata code, we would use:
* Vary the number of raters as 2, 3, 5, and 10
* Vary ICC from .1 to .9 in increments of 0.1
* Fix width at 0.2 and confidence level to 95%
preserve  // hold copy of original dataset
clear
quietly set obs 100
quietly gen _ICC=.
foreach r of numlist 2 3 5 10 {
    quietly gen _raters`r'=.
    local row=0
    forval i=.1(.1).9 {
        local row=`row'+1
        quietly sampicc , icc(`i') raters(`r') width(.2) level(.95)
        quietly replace _ICC = r(icc) in `row'
        quietly replace _raters`r' = r(num_subjects) in `row'
    }
}
list _ICC _raters* if _raters2~=. , noobs sep(0) abbrev(9) clean
restore  // return original dataset into memory
    _ICC   _raters2   _raters3   _raters5   _raters10
      .1        378        151         62          26
      .2        356        162         81          44
      .3        320        162         93          59
      .4        273        151         95          67
      .5        218        130         88          66
      .6        159        101         73          57
      .7        105         68         51          42
      .8         55         36         29          24
      .9         20         12         10           9
These estimates agree with Bonett's Table 1. For the case of two raters with ICC≥0.70, however, the adjustment described in the paragraph following Bonett's Table 1 has been applied to provide better estimates, so those three sample sizes intentionally differ from Bonett's Table 1.
Protocol Suggestion
Using the example above, where it was decided that it was feasible to use n=31 subjects and k=3
raters, with an anticipated ICC of 0.70, you could state something like the following in your
protocol:
Interrater reliability will be assessed with the intraclass correlation coefficient (ICC). To
assess interrater reliability of tumor size measurements from chest X-ray radiographs, k=3
radiologists will be used, each providing measurements from the same radiograph on the
same n=31 lung cancer patients. Interrater reliability will be assessed separately for the
posterior-anterior view and the lateral view. The precision approach is used for sample
size determination ( Bristol, 1989; Chow et al, 2008). Using the sample size
determination approach described by Bonett (2002), and assuming ICC=0.70, this sample
size and number of raters provides a 95% confidence interval around ICC of width 0.3
(ICC ±.15). This seems to be acceptable precision for our purposes, so the sample size
and number of raters are adequate.
-------------------Bristol DR. Sample size for constructing confidence intervals and testing hypotheses.
Statist Med 1989;8:803-811.
Chow S-C, Shao J, Wang H. Sample Size Calculations in Clinical Research. 2nd ed. New
York, Chapman & Hall/CRC, 2008.
Bonett DG. Sample size requirements for estimating intraclass correlations with desired
precision. Statistics in Medicine 2002;21:1331-1335.
Repeated Measures or Clustered Studies (GEE, mixed, multilevel, hierarchical models)
Studies that use multilevel models require a larger sample size than when ordinary linear
regression is used. Ordinary regression models assume that all observations are independent.
When we have repeated measurements on the same person, the repeated measurements are
correlated.
Suppose you have a study with two office visits for each patient, with one to three patients for
each physician. It might look something like this:
     +-----------------------------------------------+
     | patient_id   physician_id   visit2    y    x  |
     |-----------------------------------------------|
  1. |          1              1        0   12    1  |
  2. |          1              1        1   10    1  |
  3. |          2              1        0   13    2  |
  4. |          2              1        1   11    3  |
  5. |          3              1        0   14    2  |
  6. |          3              1        1    9    5  |
     |-----------------------------------------------|
  7. |          4              2        0   20    2  |
  8. |          4              2        1   18    3  |
  9. |          5              2        0   22    4  |
 10. |          5              2        1   17    5  |
     |-----------------------------------------------|
 11. |          6              3        0   25    4  |
 12. |          6              3        1   22    7  |
 13. |          7              3        0   23    7  |
 14. |          7              3        1   21   10  |
     |-----------------------------------------------|
 15. |          8              4        0   30    8  |
 16. |          8              4        1   27    9  |
     |-----------------------------------------------|
 17. |          9              5        0   30    1  |
 18. |          9              5        1   27    2  |
 19. |         10              5        0   32   11  |
 20. |         10              5        1   29   15  |
     +-----------------------------------------------+
In this example, ordinary linear regression assumes that there are N=20 independent pieces of information. With two observations taken on each patient, only N=10 patient clusters contributed information. To make matters worse, the patients within physicians were highly alike, so there are really only N=5 physician clusters contributing information. Is the sample size 20, 10, or 5, then?
If the correlation among the observations is 0, the sample size is 20. If the correlation is 1, the
sample size is 5, which is the number of physician clusters. Since the correlation is not going to
be 0 or 1, the sample size is somewhere in between.
The amount of correlation in the data is measured with an intraclass correlation coefficient (ICC).
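A standard way to quantify the "somewhere in between" (a textbook result, stated here for reference) is the effective sample size

$$n_{\mathrm{eff}} = \frac{n\,m}{1 + (m-1)\,\rho}$$

for n clusters of size m with intraclass correlation ρ: when ρ = 0 this equals all nm observations, and when ρ = 1 it collapses to the n clusters.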
I am not aware of any articles that discuss how to compute the sample size for more than one
level of clusters. Usually, statisticians simply choose one level for sample size determination.
Chapter 4-1 (revised 23 Jun 2010)
p. 20
First, the sample size is calculated for the naïve model, which is the ordinary regression model
that assumes independent observations. Then, this is adjusted for clustering by multiplying this
sample size by the design effect, which is (Campbell et al, 2000)
1 + [( average cluster size – 1)×ICC]
If you don't know what the ICC is when designing the study, Campbell (2000) provides some suggestions and is a good citation.
Consistent with our example, let's suppose we have N=5 physician clusters. Within each physician, we expect to have an average of 4 patients, each with two repeated measurements. We assume the ICC is 0.20. The design effect is then

1 + [(4 – 1)×0.20] = 1 + 3×0.20 = 1.6

We would have to use a sample size of patient observations that is 1.6 times the sample size required if all patient observations were independent.
Applying the design effect after calculating a sample size using sampsi can be done in Stata using sampclus. You will have to add it to Stata. While connected to the Internet, run the command
findit sampclus
and then follow the instructions to install it.
Example: Suppose we want to compare two groups of patients, where we intend to collect 10 patients from each provider. We assume group means±SDs of 4±2 vs 5±2.5. We assume an ICC=0.05, which is nonzero due to some patients seeing the same provider. We desire a power of 80%. To compute the sample size, we first use
sampsi 4 5 , sd1(2) sd2(2.5) power(.80)
Estimated sample size for two-sample comparison of means

Test Ho: m1 = m2, where m1 is the mean in population 1
                    and m2 is the mean in population 2
Assumptions:

         alpha =   0.0500  (two-sided)
         power =   0.8000
            m1 =        4
            m2 =        5
           sd1 =        2
           sd2 =      2.5
         n2/n1 =     1.00

Estimated required sample sizes:

            n1 =       81
            n2 =       81
If the ICC for patients nested within provider were 0, we could stop here. However, due to the nonzero correlation introduced by provider clusters, there are not n=81+81=162 independent pieces of information. There is something between that and the n=162/10=16.2 providers. We now multiply each group's sample size by the design effect
1 + [( average cluster size – 1)×ICC]
display 81*(1+((10-1)*0.05))
117.45
Thus we must collect n=118 observations (not subjects) in each study group to have 80% power. We get the same result using sampclus after sampsi, as follows:

sampsi 4 5 , sd1(2) sd2(2.5) power(.80)
sampclus , obsclus(10) rho(.05)
Sample Size Adjusted for Cluster Design

n1 (uncorrected) = 81
n2 (uncorrected) = 81
Intraclass correlation     = .05
Average obs. per cluster   = 10
Minimum number of clusters = 24

Estimated sample size per group:
n1 (corrected) = 118
n2 (corrected) = 118
Example: Suppose we want to compare two groups of patients, where we intend to collect 10 repeated measurements per patient. In this situation, the patient is now the cluster.
In Chapter 23, where we modeled forearm blood flow, the ICC was 0.53. You can expect very
high ICCs for repeated measures data. Assuming the same effect size with this higher ICC,
sampsi 4 5 , sd1(2) sd2(2.5) power(.80)
sampclus , obsclus(10) rho(.53)
Sample Size Adjusted for Cluster Design

n1 (uncorrected) = 81
n2 (uncorrected) = 81
Intraclass correlation     = .53
Average obs. per cluster   = 10
Minimum number of clusters = 94

Estimated sample size per group:
n1 (corrected) = 468
n2 (corrected) = 468
Thus, we need 468 observations divided by the number of observations per patient, 468/10=46.8,
or 47 patients per group.
We see that if we have a good estimate of the ICC, we can reduce our sample size. Frequently,
investigators are not willing to guess what the ICC will be, so they would just go with the n=81
patients per group to be on the safe side (safe from making a false estimate of the ICC).
Protocol Suggestion
In your protocol, you need to convince the reviewer that you understand an adjustment will be
necessary to account for the correlation structure in the data (the ICC) and that you have a
reasonable estimate of the ICC. The Campbell (2000) paper is good to cite, both as a reference for the design effect and for the assumption that the ICC will not exceed 0.05 for patient outcomes or 0.15 for provider processes. You might use something like this:
Ordinary sample size calculation assumes that all data points are independent. With a multi-level structure, the ordinary sample size estimate needs to be inflated by the design effect, $1 + (\bar{n} - 1)\rho$, where $\bar{n}$ is the average cluster size, each prescriber representing a cluster, and ρ is the estimated intra-cluster correlation coefficient (ICC). Sample size calculation proceeds by calculating the sample size for a naïve model, an ordinary model that assumes all observations are independent, and then inflating that sample size by multiplying it by the design effect, so that the sample size calculation applies to the multilevel model (Campbell, 2000). The ICCs related to the outcomes of this study are unknown. However, in a study exploring several datasets for a wide range of practitioner-related, or process, variables in primary care settings, the ICCs were in the range of 0.05 to 0.15, whereas patient-related outcome variables were generally less than 0.05 (Campbell, 2000). Conservatively, for prescriber outcomes, an ICC of 0.15 will be assumed, and for patient outcomes, an ICC of 0.05 will be assumed. The required sample sizes for the various outcomes are shown in Table 3.
Power Analysis Using Monte Carlo Simulation (Independent Samples t Test)
By definition, power is the probability that you will get a significant result in your sample data, for a given effect size in the sampled population. Formulas for computing power are based on the long-run average. That is just what Monte Carlo simulation is: a long-run average. You can get the same answer as the formula, then, using Monte Carlo simulation.
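In symbols, if B simulated datasets are generated under the assumed effect and yield p values $p_1, \ldots, p_B$, the simulated power is

$$\widehat{\mathrm{power}} = \frac{1}{B}\sum_{b=1}^{B} \mathbf{1}\{p_b < \alpha\}$$

which converges to the formula-based power as B grows.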
Suppose we wanted to compute the power for an independent samples t test, with the following
assumed parameters:
Group    mean    SD     N
  A        4     2     50
  B        5     2.5   50
For you Mstat students, we will be applying the inverse transformation method (see box).
Inverse Transformation Method (Ross, 1998, p. 455)

   "Let U be a uniform(0,1) random variable. For any continuous
   distribution function F, if we define the random variable Y by

        $Y = F^{-1}(U)$

   then the random variable Y has distribution function F. [$F^{-1}(x)$ is
   defined to equal that value y for which F(y) = x.]"

In Stata, the pseudo random number generator for a standard normal variable is

    invnorm(uniform())

which comes from applying the inverse transformation, where F is the normal distribution, $F^{-1}$ is the inverse normal distribution [the invnorm( ) function in Stata], and U is the uniform(0,1) distribution [the uniform( ) function in Stata].
This produces a standard normal distribution, with mean = 0 and SD = 1. To convert this to a normal distribution with a desired mean and SD, we use the following:

$$z = \frac{X - \text{Mean}}{SD} \sim N(0, 1) \quad\Longrightarrow\quad X = \text{Mean} + SD \times z$$

which is normal with the desired mean and SD.
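As a quick empirical check of this transformation (our illustration, not part of the original example), draw a large sample and verify that the mean and SD come out near the targets:

clear
set obs 10000
gen x = invnorm(uniform())*2.5 + 5   // target: mean = 5, SD = 2.5
summarize x                          // mean and SD should be close to 5 and 2.5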
For one iteration, just to see what is happening, we use the following code to create the variables
and compute an independent samples t test,
set seed 999 // use if want to be able to exactly reproduce the result
clear
set obs 100
gen group = 0 in 1/50
replace group = 1 in 51/100
gen y = invnorm(uniform())*2+4 in 1/50          // mean = 4 , SD = 2
replace y = invnorm(uniform())*2.5+5 in 51/100  // mean = 5 , SD = 2.5
ttest y , by(group)
Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |      50    4.173172    .2956474    2.090543    3.579046    4.767298
       1 |      50    5.229902    .2909314    2.057196    4.645254    5.814551
---------+--------------------------------------------------------------------
combined |     100    4.701537     .213067     2.13067    4.278766    5.124308
---------+--------------------------------------------------------------------
    diff |            -1.05673    .4147873               -1.879862   -.2335987
------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t =  -2.5476
Ho: diff = 0                                     degrees of freedom =       98

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0062         Pr(|T| > |t|) = 0.0124          Pr(T > t) = 0.9938
We see that we generated a sample with means and SDs close to those specified, apart from sampling variation. We need a way to save the p value (p = 0.0124).
For Stata commands that are not regression models, the results are saved in return list. To see
where the p value is being saved, we use,
return list
scalars:
          r(sd) =  2.130670072012433
        r(sd_2) =  2.057195742247358
        r(sd_1) =  2.09054282614709
          r(se) =  .4147872618554172
         r(p_u) =  .9938001171468509
         r(p_l) =  .0061998828531491
           r(p) =  .0123997657062982
           r(t) =  -2.547644488066585
        r(df_t) =  98
        r(mu_2) =  5.229902350902558
         r(N_2) =  50
        r(mu_1) =  4.173171869516373
         r(N_1) =  50
Matching this with the t test output, we discover the two-tailed p value is saved in r(p).
Here is the complete code for two iterations,
* create a file to hold significant results
clear
set obs 1
gen signif = .
save junk, replace
*
set seed 999 // if want to get same result when repeating this
*
* iterate and append results to file
forval i=1/2 {
clear
set obs 100
gen group = 0 in 1/50
replace group = 1 in 51/100
gen y = invnorm(uniform())*2+4 in 1/50          // mean = 4 , SD = 2
replace y = invnorm(uniform())*2.5+5 in 51/100  // mean = 5 , SD = 2.5
ttest y , by(group)
gen signif = cond(r(p)<0.05,1,0) in 1/1
keep in 1/1
keep signif
append using junk
save junk, replace
}
sum signif
Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |      50    4.173172    .2956474    2.090543    3.579046    4.767298
       1 |      50    5.229902    .2909314    2.057196    4.645254    5.814551
---------+--------------------------------------------------------------------
combined |     100    4.701537     .213067     2.13067    4.278766    5.124308
---------+--------------------------------------------------------------------
    diff |            -1.05673    .4147873               -1.879862   -.2335987
------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t =  -2.5476
Ho: diff = 0                                     degrees of freedom =       98

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0062         Pr(|T| > |t|) = 0.0124          Pr(T > t) = 0.9938

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |      50    3.937702    .2594191     1.83437     3.41638    4.459024
       1 |      50    4.860296    .3174238    2.244525    4.222409    5.498183
---------+--------------------------------------------------------------------
combined |     100    4.398999     .209139     2.09139    3.984022    4.813976
---------+--------------------------------------------------------------------
    diff |           -.9225939    .4099466                -1.73612   -.1090683
------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t =  -2.2505
Ho: diff = 0                                     degrees of freedom =       98

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      signif |         2           1           0          1          1
For two iterations, the power is 100%, which is the mean of the 0-1 variable signif, where 1 denotes p < 0.05 for a given iteration.
We don’t really want the results to display on our computer screen for each iteration. We can
turn this off using “quietly” in front of each command that produces output, or simply put it in
front of every command inside the for loop.
Let’s also request 1,000 iterations.
* create a file to hold significant results
clear
set obs 1
gen signif = .
save junk, replace
*
set seed 999
*
* iterate and append results to file
forval i=1/1000 {
quietly clear
quietly set obs 100
quietly gen group = 0 in 1/50
quietly replace group = 1 in 51/100
quietly gen y = invnorm(uniform())*2+4 in 1/50
quietly replace y = invnorm(uniform())*2.5+5 in 51/100
quietly ttest y , by(group)
quietly gen signif = cond(r(p)<0.05,1,0) in 1/1
quietly keep in 1/1
quietly keep signif
quietly append using junk
quietly save junk, replace
}
sum signif
Now, we just get the final result.
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      signif |      1000        .582      .493477          0          1
We see that our power is 58.2%.
Comparing this result to the closed-form formula approach,

sampsi 4 5 , sd1(2) sd2(2.5) n1(50) n2(50)

Estimated power:
         power =   0.5982
The simulated power (58.2%) is very close to power formula (59.8%).
Here are the results for various numbers of iterations:

 Formula power     # iterations    Simulation power
   59.82%               100             56.00%
                      1,000             58.20%
                     10,000             59.52%
Since your SD assumptions are estimates anyway, there is really no need for the extra precision
provided by 10,000 iterations. It is sufficient, and recommended, to just use 1,000 iterations. If
you are simulating a very complicated model and you are in a hurry, it is probably sufficient to
just use 100 iterations, since the convergence is quite good even with that, and your assumptions
will be off anyway.
To discover the sample size for a desired power requires varying the sample size and computing
power until you converge on the desired power. For that, you can use the above Stata code, but
use 50 or 100 iterations until you begin to get close, and then use 1,000 iterations for your final
sample size calculation.
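One way to organize that search (a sketch using our own conventions, not code from this chapter) is to wrap a single iteration in an rclass program and let Stata's simulate command do the looping; the program name and n() option are made up for this example:

capture program drop ttestsim
program define ttestsim , rclass
    syntax [, n(integer 50)]
    clear
    quietly set obs `=2*`n''
    gen group = _n > `n'                       // first n obs are group 0, rest group 1
    gen y = invnorm(uniform())*2 + 4           // group 0: mean = 4, SD = 2
    quietly replace y = invnorm(uniform())*2.5 + 5 if group  // group 1: mean = 5, SD = 2.5
    quietly ttest y , by(group)
    return scalar signif = (r(p) < 0.05)
end
* try a few candidate per-group sample sizes; power is the mean of signif
foreach n of numlist 50 60 70 {
    quietly simulate signif=r(signif), reps(200): ttestsim , n(`n')
    quietly sum signif
    display "n per group = `n' : power = " %5.3f r(mean)
}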
Power Analysis Using Monte Carlo Simulation (2 x 2 Table Chi-square Test)
For this situation, we are comparing two proportions. Suppose we anticipate the following 2 × 2
table.
                   Group
                A           B
Outcome Yes  20 (20%)    30 (30%)
        No   80          70
            100         100
Using the formula approach,
sampsi .20 .30 , n1(100) n2(100)
Estimated power:
power =
0.3108
Here is the code for the simulation,
* create a file to hold significant results
clear
set obs 1
gen signif = .
save junk, replace
*
set seed 999
*
* iterate and append results to file
forval i=1/100 {
quietly clear
quietly set obs 200
quietly gen group = 0 in 1/100
quietly replace group = 1 in 101/200
quietly gen y = uniform() in 1/100
quietly replace y = uniform() in 101/200
quietly replace y = cond(y<=0.2,1,0) in 1/100
quietly replace y = cond(y<=0.3,1,0) in 101/200
quietly tab y group , chi2
quietly gen signif = cond(r(p)<0.05,1,0) in 1/1
quietly keep in 1/1
quietly keep signif
quietly append using junk
quietly save junk, replace
}
sum signif
Here are the results for various numbers of iterations:

 Formula power     # iterations    Simulation power
   31.08%               100             29.00%
                      1,000             34.80%
                     10,000             36.86%
We notice that the simulation, with the recommended 1,000 iterations, and also with the 10,000
iterations, produces a higher power than the formula approach. The sampsi command for
proportions uses the Yates continuity correction in its calculation. The tab command does not
use the continuity correction. Few researchers use the Yates continuity correction anymore,
because it is known to be conservative (p values are too large)(see box). If the continuity
correction is not going to be applied in the data analysis (in particular, it is not applied in logistic
regression models), then the simulated power is the more correct estimate. In practice, however,
doing the simulation is a lot of work for a small gain, so it is fine to just use the formula
approach, which is the sampsi command.
Yates Continuity Correction Controversy (Agresti, 1990, p.68)
There is a controversy among statisticians on whether or not the Yates continuity correction
should be applied. One camp claims that the continuity correction should always be applied,
because the p value is more accurate and because it is closer to an exact p value (Fisher’s exact p
value). The other camp claims that the continuity correction should not be applied, because it
takes the p value closer to the Fisher’s exact test p value, and it is known that the Fisher’s exact p
value is conservative (does not drop below alpha, 0.05, often enough).
Power Analysis Using Monte Carlo Simulation (Poisson Regression with Person-Time)
Suppose we want to model the effect of an elementary school flu immunization program. Our
outcome variable is the number of days absent during the winter months, a readily available
outcome measure. In 10% of the schools we will immunize all the children before the start of the
flu season. In 90% of the schools, no school immunization program will be implemented. Since children can miss multiple days, we need to model this as a rate for each school.
absentee rate = (total days absence)/(number of students × number of school days)
In n=40 schools, each school will contribute one observation to the sample size. The three
variables are:
Immunization program: 1 = yes, 0 = no
Student days: total number of days that school was in session
Absent: total number of days which students were absent
For sample size determination, we were only able to get two schools.
clear
input school studentdays absent
1 39177 1405
2 41015 1810
end
list
sum
     +-------------------------------+
     | school   studentdays   absent |
     |-------------------------------|
  1. |      1         39177     1405 |
  2. |      2         41015     1810 |
     +-------------------------------+

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      school |         2         1.5     .7071068          1          2
 studentdays |         2       40096     1299.662      39177      41015
      absent |         2      1607.5     286.3782       1405       1810
We now have the problem of what to assume for a mean and standard deviation for studentdays and absent. With only two schools to estimate these from, we had better be careful. The estimates of the means are not as important as the estimates of the standard deviations. Let's use these estimates for the means and increase the standard deviations by 100%.

display 1300*2
2600

display 286*2
572
Chapter 4-1 (revised 23 Jun 2010)
p. 31
In the Monte Carlo simulation of the sample size, we will use
absent
mean = 1608 SD = 572
studentdays mean = 40096 SD = 2600
It is reasonable that studentdays and absent are correlated. Let’s assume they are correlated in
the amount of r = 0.30.
To draw a simulated random sample of two correlated normally distributed variables, we use the
drawnorm command. We have to provide this command with a vector (n × 1 matrix) of means,
a vector of standard deviations, and a correlation matrix (n × n).
In Stata, the computed values for estimation commands are stored in ereturn list, rather than return list. The regression coefficients and standard errors are stored in a matrix, but it is easier to use a shortcut: Stata also saves the regression coefficients in _b[ ] and _se[ ], where you put the variable name inside the brackets, spelled exactly as it appears in the regression output. The p value is not stored, but it can be computed by looking up the Wald statistic _b[ ]/_se[ ] in the standard normal distribution.
Here is the approach, using 1 iteration. We will assume an effect size of RR = 0.95, so the
immunization (immun) intervention reduces absenteeism by 5%.
clear
set obs 1
gen signif = .
save tempsignif, replace
set seed 999
forval i=1/1{
clear
set obs 50 // number of schools
matrix m = (1608 , 40096) // means
matrix sd = (572 , 2600 ) // standard deviations
matrix c = (1 , .3 \ .3 , 1) // correlation
drawnorm absent studentdays , n(50) means(m) sds(sd) corr(c)
replace absent=round(absent,1) // convert to integer
replace studentdays=round(studentdays,1)
replace absent=absent*.95 in 1/10
gen immun = 1 in 1/10
replace immun = 0 in 11/50
poisson absent immun , exposure(studentdays) irr
gen signif=(1-(normprob(abs(_b[immun]/_se[immun]))))*2 < 0.05 in 1/1
display (1-(normprob(abs(_b[immun]/_se[immun]))))*2
keep in 1/1
keep signif
append using tempsignif
save tempsignif, replace
}
sum
Poisson regression                              Number of obs   =         50
                                                LR chi2(1)      =     240.84
                                                Prob > chi2     =     0.0000
Log likelihood = -5188.1744                     Pseudo R2       =     0.0227

------------------------------------------------------------------------------
      absent |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       immun |   .8692226   .0079689   -15.29   0.000     .8537433    .8849826
 studentdays |  (exposure)
------------------------------------------------------------------------------

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      signif |        12           1            0         1          1
It is possible that sometimes the model will crash, so we want to make sure the simulation keeps running even when that happens, and then just not include the crashed models in the sum at the end. To do this, we put capture in front of the poisson command, which means it captures all output, error messages in particular. We then check the return code, _rc, to see if it is 0, meaning no error occurred, and continue with the rest of the iteration only if no error occurred.
Making this change,

clear
set obs 1
gen signif = .
save tempsignif, replace
set seed 999
forval i=1/1{
clear
set obs 50 // number of schools
matrix m = (1608 , 40096) // means
matrix sd = (572 , 2600 ) // standard deviations
matrix c = (1 , .3 \ .3 , 1) // correlation
drawnorm absent studentdays , n(50) means(m) sds(sd) corr(c)
replace absent=round(absent,1) // convert to integer
replace studentdays=round(studentdays,1)
replace absent=absent*.95 in 1/10
gen immun = 1 in 1/10
replace immun = 0 in 11/50
capture poisson absent immun , exposure(studentdays) irr
if _rc==0 {
gen signif=(1-(normprob(abs(_b[immun]/_se[immun]))))*2 < 0.05 in 1/1
display (1-(normprob(abs(_b[immun]/_se[immun]))))*2
keep in 1/1
keep signif
append using tempsignif
save tempsignif, replace
}
}
sum
Since it takes a long time to simulate a regression model, particularly a complex multi-level model, we can display the iteration number so we can tell how close we are to being finished. We will need to turn off the scrolling prompt so we don't have to hit the space bar every time the screen fills up with the iteration numbers.

Here is the change, along with putting quietly in front of all the commands, and changing it to run 100 iterations.
clear
set obs 1
gen signif = .
save tempsignif, replace
set seed 999
set more off
forval i=1/100 {
quietly clear
quietly set obs 50 // number of schools
quietly matrix m = (1608 , 40096) // means
quietly matrix sd = (572 , 2600 ) // standard deviations
quietly matrix c = (1 , .3 \ .3 , 1) // correlation
quietly drawnorm absent studentdays , n(50) means(m) sds(sd) corr(c)
quietly replace absent=round(absent,1) // convert to integer
quietly replace studentdays=round(studentdays,1)
quietly replace absent=absent*.95 in 1/10
quietly gen immun = 1 in 1/10
quietly replace immun = 0 in 11/50
quietly capture poisson absent immun , exposure(studentdays) irr
if _rc==0 {
quietly gen signif=(1-(normprob(abs(_b[immun]/_se[immun]))))*2 < 0.05 in 1/1
quietly display (1-(normprob(abs(_b[immun]/_se[immun]))))*2
quietly keep in 1/1
quietly keep signif
quietly append using tempsignif
quietly save tempsignif, replace
}
display "Now on iteration " `i'
}
set more on
sum
Now on iteration 1
Now on iteration 2
Now on iteration 3
…
Now on iteration 100

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      signif |        90    .9222222      .269322          0          1
We see that the power is 92.2%, based on 90 iterations. That means 10 times it crashed, so the
_rc trick was necessary.
Power Analysis Using Monte Carlo Simulation (2-way ANOVA, both factors with 2 levels,
neither of which is a repeated measurement)
This section is for a 2 × 2 factorial design,

                     Factor 2
                   Low    High
   Factor 1  Low
             High
where a 2-way ANOVA will be fitted with a Factor 1 x Factor 2 interaction term. Neither factor
can be a repeated measurement.
You specify the means, standard deviation, and sample size for each cell of the table, and the
power is returned for each main effect and for the interaction term.
First, cut-and-paste the following code into your Stata do-file, highlight it, and run it to set up the
program.
* power analysis for a 2x2 factorial design ANOVA
*   with factor 1 x factor 2 interaction term
capture program drop poweranova
program define poweranova
* 2 x 2 factorial design:
*   factor 1  low   mean1low  (sd1low)
*             high  mean1high (sd1high)
*   factor 2  low   mean2low  (sd2low)
*             high  mean2high (sd2high)
* syntax: poweranova mean1low SD1low N1low     ///
*                    mean1high SD1high N1high  ///
*                    mean2low SD2low N2low     ///
*                    mean2high SD2high N2high
args mean1low sd1low n1low mean1high sd1high n1high ///
     mean2low sd2low n2low mean2high sd2high n2high
preserve
clear
quietly set obs 1
quietly gen signif1 = .
quietly gen signif2 = .
quietly gen signif3 = .
quietly save poweranovatemp, replace
set seed 999
local n = `n1low'+`n1high'+`n2low'+`n2high'
local a = `n1low'
local b = `n1low'+1
local c = `n1low'+`n1high'
local d = `n1low'+`n1high'+1
local e = `n1low'+`n1high'+`n2low'
local f = `n1low'+`n1high'+`n2low'+1
* iterate and append results to file
forval i=1/1000 {
    quietly clear
    quietly set obs `n'
    quietly gen factor1 = 0 in 1/`c'
    quietly replace factor1 = 1 in `d'/`n'
    quietly gen factor2 = 0 in 1/`a'
    quietly replace factor2 = 1 in `b'/`c'
    quietly replace factor2 = 0 in `d'/`e'
    quietly replace factor2 = 1 in `f'/`n'
    quietly gen y = invnorm(uniform())* ///
        `sd1low'+`mean1low' in 1/`a'
    quietly replace y = invnorm(uniform())* ///
        `sd1high'+`mean1high' in `b'/`c'
    quietly replace y = invnorm(uniform())* ///
        `sd2low'+`mean2low' in `d'/`e'
    quietly replace y = invnorm(uniform())* ///
        `sd2high'+`mean2high' in `f'/`n'
    quietly anova y factor1 factor2 factor1*factor2
    quietly gen signif1 = ///
        cond(Ftail(e(df_1),e(df_r),e(F_1))<0.05,1,0) ///
        in 1/1   // factor 1 main effect p value
    quietly gen signif2 = ///
        cond(Ftail(e(df_2),e(df_r),e(F_2))<0.05,1,0) ///
        in 1/1   // factor 2 main effect p value
    quietly gen signif3 = ///
        cond(Ftail(e(df_3),e(df_r),e(F_3))<0.05,1,0) ///
        in 1/1   // interaction p value
    quietly keep in 1/1
    quietly keep signif1 signif2 signif3
    quietly append using poweranovatemp
    quietly save poweranovatemp , replace
}
display as result "Factor 1 (low):  mean = " `mean1low' ///
    " , SD = " `sd1low' " , n = " `n1low'
display as result "Factor 1 (high): mean = " `mean1high' ///
    " , SD = " `sd1high' " , n = " `n1high'
display as result "Factor 2 (low):  mean = " `mean2low' ///
    " , SD = " `sd2low' " , n = " `n2low'
display as result "Factor 2 (high): mean = " `mean2high' ///
    " , SD = " `sd2high' " , n = " `n2high'
quietly sum signif1
display as result "Power for Factor 1 main effect = " r(mean)*100 "%"
quietly sum signif2
display as result "Power for Factor 2 main effect = " r(mean)*100 "%"
quietly sum signif3
display as result "Power for Factor 1 x Factor 2 interaction = " ///
    r(mean)*100 "%"
capture erase poweranovatemp.dta
restore
end

* syntax:
*   poweranova mean1low SD1low N1low mean1high SD1high N1high ///
*              mean2low SD2low N2low mean2high SD2high N2high
Then, you run the command, poweranova, with 12 parameters, as follows:
Syntax:
poweranova mean1low SD1low N1low mean1high SD1high N1high ///
           mean2low SD2low N2low mean2high SD2high N2high
Example: You are conducting an animal experiment, with a study and a control group. The animals must be sacrificed to collect the histological measurement, so one set of animals is followed for 3 months and a second set of animals is followed for 6 months, for each of the groups. You estimate the following:
                             Factor 2 (time)
                           Low            High
                        (3 months)     (6 months)
 Factor 1    Low        Mean: 2.0      Mean: 3.0
 (group)     (control)  SD:   1.0      SD:   1.5
                        N:    7        N:    7
             High       Mean: 4.0      Mean: 7.0
             (study)    SD:   2.0      SD:   3.5
                        N:    7        N:    7
After loading the program into Stata, as described above, you run it using

Syntax:
poweranova mean1low SD1low N1low mean1high SD1high N1high ///
           mean2low SD2low N2low mean2high SD2high N2high
poweranova 2.0 1.0 7 3.0 1.5 7 4.0 2.0 7 7.0 3.5 7
The result is,
Factor 1 (low): mean = 2 , SD = 1 , n = 7
Factor 1 (high): mean = 3 , SD = 1.5 , n = 7
Factor 2 (low): mean = 4 , SD = 2 , n = 7
Factor 2 (high): mean = 7 , SD = 3.5 , n = 7
Power for Factor 1 main effect = 93.2%
Power for Factor 2 main effect = 63.1%
Power for Factor 1 x Factor 2 interaction = 24.5%
Sample Size for Survival Analysis
The Stata command stpower computes the sample size for survival analysis comparing two
survivor functions using the log-rank test, Cox regression, or the exponential parametric survival
test.
The syntax is:
Sample size determination
stpower cox [...] [, ...]
stpower logrank [...] [, ...]
stpower exponential [...] [, ...]
Power determination
stpower cox [...] , n(numlist) [...]
stpower logrank [...], n(numlist) [...]
stpower exponential [...], n(numlist) [...]
Effect-size determination
stpower cox , n(numlist) {power(numlist) | beta(numlist)} [...]
Example
Suppose you plan to do a log-rank test for an animal experiment (rabbits), where you plan to
have a study group (active antimicrobial with a bandage) and control group (just a bandage).
You intend to make an incision to provide a tunnel for infection and then add bacteria to the
wound. You expect all of the control group to have a bloodstream infection, and none of the
study group. You also expect 20% to 50% of the rabbits to drop out of the study before the end
of the four-week follow-up period. The “stpower logrank” command is based on the method of
Freedman (1982). Here is an example power analysis paragraph:
The planned sample size was based on the number of events, allowing for withdrawals,
and the use of the logrank test (Freedman, 1982). It was assumed that the control group
would have 100% infection, or 100% failure probability, and the treatment group would
have 0% infection. Assuming 20% withdrawals, the study had at least 80% power if
n=10 rabbits were studied in each group. Assuming 50% withdrawals, the study had
at least 80% power if n=10 rabbits were used in the control group and n=20 were used in
the treatment group.
----Reference:
Freedman LS, Tables of the number of patients required in clinical trials using the
logrank test. Statistics in Medicine 1982;1:121-129.
This comes from:
stpower logrank .99 .01 , power(.8) wdprob(.20)
Estimated sample sizes for two-sample comparison of survivor functions
Log-rank test, Freedman method
Ho: S1(t) = S2(t)

Input parameters:

         alpha =   0.0500  (two sided)
            s1 =   0.9900
            s2 =   0.0100
        hratio = 458.2106
         power =   0.8000
            p1 =   0.5000
    withdrawal =   20.00%

Estimated number of events and sample sizes:

             E =        8
             N =       20
            N1 =       10
            N2 =       10
The .99 is the survival probability for the test group (really 1, but Stata needs something between 0 and 1). The .01 is the survival probability for the control group. The wdprob( ) option is the withdrawal probability. It is fine to base these three probabilities on simple proportions anticipated at the end of the follow-up. In the output, "p1" is the proportion of the sample size in the control group.
Here’s the command for n=10 controls and n=20 treatment group. The nratio( ) is the ratio of the
sample sizes, treat:control.
stpower logrank .01 .99 , power(.8) wdprob(.50) nratio(2)
Estimated sample sizes for two-sample comparison of survivor functions
Log-rank test, Freedman method
Ho: S1(t) = S2(t)

Input parameters:

         alpha =   0.0500  (two sided)
            s1 =   0.0100
            s2 =   0.9900
        hratio =   0.0022
         power =   0.8000
            p1 =   0.3333
    withdrawal =   50.00%

Estimated number of events and sample sizes:

             E =        4
             N =       24
            N1 =        8
            N2 =       16
References
Abramson JH, Gahlinger PM. (2001). Computer Programs for Epidemiologists: PEPI Version
4.0. Salt Lake City, UT, Sagebrush Press.
The PEPI-4.0 software can be downloaded free from the Internet, although the manual
must be purchased.
http://www.sagebrushpress.com/pepibook.html
Agresti A. (1990). Categorical Data Analysis. New York, John Wiley & Sons.
Bonett DG. (2002). Sample size requirements for estimating intraclass correlations with desired
precision. Statistics in Medicine 21:1331-1335.
Bristol DR. (1989). Sample size for constructing confidence intervals and testing hypotheses.
Statist Med 8:803-811.
Campbell M, Grimshaw J, Steen N, et al. (2000). Sample size calculations for cluster randomised
trials. Journal of Health Services Research & Policy 5(1):12-16.
Chow S-C, Shao J, Wang H. (2008). Sample Size Calculations in Clinical Research. 2nd ed. New
York, Chapman & Hall/CRC.
Freedman LS. (1982). Tables of the number of patients required in clinical trials using the logrank test. Statistics in Medicine 1:121-129.
Julious SA, Campbell MJ. (1998). Sample size calculations for paired or matched ordinal data. Statist Med 17:1635-1642.
Ross S. (1998). A First Course in Probability, 5th ed. Upper Saddle River, NJ, Prentice Hall.
Whitehead J. (1993). Sample size calculations for ordered categorical data. Statistics in Medicine
12:2257-2271.
Appendix: Chapter Revision History
16 May 2010   Revision history first tracked.
14 Jun 2010   Added section, “Interrater Reliability (Precision of Confidence Interval Around Intraclass Correlation Coefficient)”