Chapter 4-1. Sample Size Determination and Power Analysis for Specific Applications

The Basics

The basics are covered in the Biostatistics Section, Chapter 2-5. This chapter contains specific applications only.

Two Independent Groups Comparison of Means (Independent Groups t Test)

When a comparison of means from a continuous, or interval-scaled, outcome variable is planned for two independent groups, an independent groups t test is appropriate. To compute the sample size, you must provide the two expected means, the difference being the minimally clinically interesting effect or the anticipated effect. You must also provide the two assumed standard deviations and a choice for power (at least 80%).

In Stata, the command syntax for equal sample sizes in the two groups is

    sampsi mean1 mean2 , sd1( ) sd2( ) power( )

By default, a two-sided comparison is used with alpha = 0.05.

_____________________
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript]. University of Utah School of Medicine, 2010.

For example, comparing mean±SDs of 4±2 vs 5±2.5, with a desired 80% power, using the Stata menus would be:

    Statistics
     Power and sample size
      Tests of means and proportions
       Main tab: Input: Two-sample comparison of means
       Main tab: Mean one: 4
       Main tab: Std. deviation one: 2
       Main tab: Mean two: 5
       Main tab: Std. deviation two: 2.5
       Options tab: Output: Compute sample size
       Options tab: Sample-based calculations: Ratio of sample sizes: 1
       Options tab: Power of the test: .8
       Options tab: Sides: Two-sided test
       Options tab: Alpha (using 0.05): .05 or leave blank
       OK

    sampsi 4 5, sd1(2) sd2(2.5) power(.80)

    Estimated sample size for two-sample comparison of means

    Test Ho: m1 = m2, where m1 is the mean in population 1
                        and m2 is the mean in population 2
    Assumptions:
             alpha =   0.0500  (two-sided)
             power =   0.8000
                m1 =        4
                m2 =        5
               sd1 =        2
               sd2 =      2.5
             n2/n1 =     1.00

    Estimated required sample sizes:
                n1 =       81
                n2 =       81

To compute the power for a given sample size, you leave off the power( ) option and replace it with n1( ) and n2( ).

    sampsi 4 5, sd1(2) sd2(2.5) n1(90) n2(90)

    Estimated power for two-sample comparison of means

    Test Ho: m1 = m2, where m1 is the mean in population 1
                        and m2 is the mean in population 2
    Assumptions:
             alpha =   0.0500  (two-sided)
                m1 =        4
                m2 =        5
               sd1 =        2
               sd2 =      2.5
    sample size n1 =       90
                n2 =       90
             n2/n1 =     1.00

    Estimated power:
             power =   0.8421

Linear Regression: Comparing two groups adjusted for covariates

Usually when linear regression is applied in research, it involves the comparison of two groups after adjusting for potential confounders. That is, it is an adjusted means comparison problem. Sample size calculation, then, is simply a matter of how big a sample is required to detect the difference between two means, so you can use the same sample size formula that you would use to compare two means with a t test. It is too difficult to know how much the means will change when covariates are added, so you just don't attempt that much precision in your sample size determination. However, if you have preliminary data on adjusted means, and their standard deviations, you could use those in your calculation. Usually you are not going to know what these will be after you adjust for all of your covariates, so it is a generally accepted practice to use the unadjusted means and standard deviations.
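If preliminary adjusted means were available, the calculation would look exactly like the t test case with the adjusted values substituted in. A minimal sketch (the adjusted means and SDs below are hypothetical, purely to illustrate the substitution):

    * hypothetical covariate-adjusted means and SDs from a pilot analysis
    * (adjusted SDs are often somewhat smaller than unadjusted SDs)
    sampsi 4.2 4.9 , sd1(1.8) sd2(2.2) power(.80)

The calculation is identical to the unadjusted case; only the inputs change.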
An excellent discussion of this topic is given in Steyerberg EW. (2009). Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. New York, Springer, pp. 25-27.

Two Independent Groups Comparison of Dichotomous Outcome Variable (chi-square test, Fisher's exact test)

For a given test statistic, sample size is determined by the following five things:

    1) the effect size in the population
    2) the standard deviation in the population
    3) our choice of alpha
    4) whether we will use a one-sided or two-sided comparison (one-tailed or two-tailed test)
    5) desired power

It appears, then, that we need to specify the standard deviation to compute power for a test statistic that compares two proportions. We don't. The formula uses the standard deviations of the proportions (specifically, Bernoulli variables), but it computes them from the proportions, basically as

    std. dev. = sqrt( p(1 - p) )

This is computed internally by any sample size software once the proportion is specified.

The power analysis for comparing two proportions in Stata is basically for the chi-square test with continuity correction (precisely, the formula is the normal approximation with a continuity correction). Thus, the result is closer to a Fisher's exact test than a chi-square test, since the continuity correction moves the p value in the direction of the Fisher's exact test p value. That is a good thing, however, since you do not know in advance whether you will meet the expected frequency rule for the chi-square test and thus have to use Fisher's exact test anyway.

Suppose we want to conduct a study to detect a difference of 1.5% in preterm births between males and females, based on published preterm birth incidence proportions of 0.301 for males and 0.316 for females. We would use:

    Statistics
     Summaries, tables & tests
      Classical tests of hypotheses
       Sample size & power determination
        Main tab: Two-sample comparison of proportions (values in [0,1]):
         Proportion one: 0.301
         Proportion two: 0.316
        Options tab: Output: Compute sample size
         Power of the test: 0.90
        OK

    sampsi .301 .316 , alpha(.05) power(.90)

    Estimated sample size for two-sample comparison of proportions

    Test Ho: p1 = p2, where p1 is the proportion in population 1
                        and p2 is the proportion in population 2
    Assumptions:
             alpha =   0.0500  (two-sided)
             power =   0.9000
                p1 =   0.3010
                p2 =   0.3160
             n2/n1 =     1.00

    Estimated required sample sizes:
                n1 =    20056
                n2 =    20056

We see that such a study is only practical if we have a large database already available. For sample sizes this large, the p values for the three statistical approaches (chi-square, corrected chi-square, and Fisher's exact test) will be equivalent, and so will be the required sample sizes.

Let's try a larger effect size.

    sampsi .300 .400 , alpha(.05) power(.90)

    Estimated sample size for two-sample comparison of proportions

    Test Ho: p1 = p2, where p1 is the proportion in population 1
                        and p2 is the proportion in population 2
    Assumptions:
             alpha =   0.0500  (two-sided)
             power =   0.9000
                p1 =   0.3000
                p2 =   0.4000
             n2/n1 =     1.00

    Estimated required sample sizes:
                n1 =      496
                n2 =      496

In SamplePower 2.0 we can compute the required sample size specifically for each of the three statistical tests, getting

    uncorrected chi-square test:  n1 = n2 = 479
    corrected chi-square test:    n1 = n2 = 498
    Fisher's exact test:          n1 = n2 = 496

The latter two are very close to Stata's n = 496.
We see that Stata's sample size calculation is conservative, but provides an adequate sample size for any of the three approaches. It works for both the uncorrected chi-square test and the Fisher's exact test—since you cannot be sure in advance which test you will require, you might as well go with the larger sample size.

Two Independent Groups Comparison of a Nominal Outcome Variable (chi-square test and Fisher-Freeman-Halton test)

The r × c table case for power or sample size calculation is not available in Stata or PEPI 4.0. At first, you might think it makes sense to try collapsing the rows or columns to make it a 2 × 2 table, and then use the approach we took for the dichotomous case. For the 3 × 2 case, collapsing would be the same as computing power for any row of the table.

               |         col
           row |      1          2 |     Total
    -----------+---------------------+----------
             1 |     14          8 |        22
               |  14.89       8.60 |     11.76
    -----------+---------------------+----------
             2 |     76         82 |       158
               |  80.85      88.17 |     84.49
    -----------+---------------------+----------
             3 |      4          3 |         7
               |   4.26       3.23 |      3.74
    -----------+---------------------+----------
         Total |     94         93 |       187
               | 100.00     100.00 |    100.00

    sampsi .1489 .0860 , alpha(.05) power(.90)   // gives n1=n2=580
    sampsi .8085 .8817 , alpha(.05) power(.90)   // gives n1=n2=539
    sampsi .0426 .0323 , alpha(.05) power(.90)   // gives n1=n2=7332

We see that it is not clear what the correct sample size should be. Fortunately, the correct sample size calculation, using the entire table simultaneously, can be done using SamplePower 2.0. For power of 0.90, we need n1=n2=997.

Two Independent Groups Comparison of Ordinal Outcome Variable (Wilcoxon-Mann-Whitney test)

The power for the Wilcoxon-Mann-Whitney test can be computed with StatXact-5, but not with Stata-8 or SamplePower 2.0. It can also be computed with the PEPI-4.0 (Abramson and Gahlinger, 2001) SAMPLES.EXE program. PEPI computes the required sample size for a Wilcoxon-Mann-Whitney test (adjusted for ties, which is what you would normally always use). The computation is only accurate if it generates a moderate to large sample size (it is only asymptotically correct). It calculates the sample size using the procedure described by Whitehead (1993).

The Whitehead procedure is based on a proportional odds model, which assumes that when the 2 × k table displaying the data is converted to a 2 × 2 table by combining adjacent categories, the odds ratio is the same whatever cutting-point is used. For a 2 × 2 table with cell counts

    a   b
    c   d

the odds ratio is defined as:

    odds ratio = ad/bc

Example (Whitehead, 1993): Suppose that the following values are considered appropriate for the control group, ordered from highest to lowest category:

    Very good (highest category)   20%
    Good                           50%
    Moderate                       20%
    Poor (lowest category)         10%

It is intuitive to consider the cutpoint as (Very good or Good) vs (Moderate or Poor). Thus 70% will be in this upper category in the control group. We choose 85% in this upper category as the minimal clinically relevant effect, which we feel is obtainable (likely true in the population). Thus we have the following:

                 Experimental   Control
    Success          85%          70%
    Failure          15%          30%

    OR = (85 × 30)/(70 × 15) = 2.43

We assume that for any other cut-point for combining adjacent categories, the odds ratio will still be 2.43 (proportional odds assumption).
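Under the proportional odds assumption, the experimental group's cumulative odds at every cutpoint are simply the control group's cumulative odds multiplied by the OR. A minimal sketch of that arithmetic in Stata (the cumulative control proportions 0.20, 0.70, and 0.90 come from the control percents above):

    * experimental-group cumulative proportions implied by proportional odds
    local OR = 2.43
    foreach C of numlist 0.20 0.70 0.90 {
        local odds = `OR' * `C'/(1 - `C')     // OR times control cumulative odds
        display %5.3f `odds'/(1 + `odds')     // experimental cumulative proportion
    }

This displays 0.378, 0.850, and 0.956; taking successive differences gives the experimental category percents 37.8, 47.2, 10.6, and 4.4, which is exactly the distribution PEPI reports below.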
Running PEPI SAMPLES (available in the PEPI subdirectory):

    type 4 for Wilcoxon-Mann-Whitney test
    level of significance: .05
    power (%): 90
    ratio of sample sizes: 1
    how many categories: 4

    (controls)
    category    B %
       1         20
       2         50
       3         20
       4         10

    odds ratio: 2.43

So that you can check that the proportional odds assumption is close to what you expect your cell percents to be in the experimental group, PEPI reports:

    An odds ratio of 2.43 expresses the following findings:

               (experimental)   (control)
                  Group A        Group B
    Category         %              %
       1            37.8            20
       2            47.2            50
       3            10.6            20
       4             4.4            10

If we are happy with these percents for our experimental group (they look like what we expect to observe), then the sample size is appropriate. If they are way off, then you cannot use PEPI to compute your sample size, because it must assume proportional odds.

On the next screen, PEPI reports that n=94 subjects are required in each group. This matches the example in the Whitehead (1993) article (bottom of page 2261), so we can feel confident that PEPI calculates the sample size for a Wilcoxon-Mann-Whitney test correctly.

Since we are using 4 categories with our ordinal scale, rather than only 2 categories with a 2 × 2 Fisher's exact test, we should have computed a smaller required sample size. This is the case. PEPI SAMPLES for comparing the two independent proportions 0.70 to 0.85 reports that we would need n=161 in each group to detect the same effect using that statistical approach.

Protocol

You could state,

    The sample size for testing the Quality of Life outcome (or whatever the variable is) using a Wilcoxon-Mann-Whitney test was computed using the procedure reported by Whitehead (1993). We assumed the Quality of Life ordered category percents for the control group will be 20% (very good), 50%, 20%, and 10% (poor), with a proportional odds of a higher quality of life for the experimental group of 2.43. This odds ratio corresponds to a response of 70% "good" or "very good" in the control group and 85% "good" or "very good" in the experimental group, which we selected as the minimal clinically relevant effect to be able to detect. For this proportional odds, the expected quality of life category percents for the experimental group are 37.8%, 47.2%, 10.6%, and 4.4%, which are consistent with what we expect to observe. This sample size provides 90% power to detect this effect with an alpha of 0.05 using a two-sided comparison.

Paired Ordinal Outcome Variable (Wilcoxon signed ranks test)

The required sample size for the Wilcoxon signed ranks test cannot be computed with StatXact-6, Stata-8, or SamplePower 2.0. It can, however, be computed with the PEPI-4.0 (Abramson and Gahlinger, 2001) SAMPLES.EXE program. PEPI computes the required sample size for the comparison of proportions in ordered categories of matched samples (1 case to 1 control, or pre and post measures on the same individuals) under the assumption of proportional odds. This approach, although not specifically tailored to the Wilcoxon signed ranks test, provides a reasonable approximation to the required sample size for that test. I am not aware of any other approach, except to derive the sample size by simulation.

PEPI calculates the sample size using the procedure described by Julious and Campbell (1998). The procedure only considers the discordant pairs (different scores on pre and post test), throwing away the concordant pairs (same score on pre and post test), consistent with the Wilcoxon signed ranks test.
Consider, for example, a variable that is a three-point scale. Calculating the change scores (post test minus pretest), the possible values are -2, -1, 0, 1, and 2. The 0s are ignored, leaving -2 and -1 as the negative discordant categories, and 1 and 2 as the positive discordant categories.

If no pilot data are available, we next assume a value for the odds ratio,

    OR = odds of a pair being positive = ratio of positive changes to negative changes.

Using the example in Julious and Campbell (1998), we wish to conduct a study, such as a matched case-control study or perhaps a cross-over trial, where the outcome is the Hospital Anxiety and Depression Scale, which has three categories. We might estimate the OR by expecting that we will observe 5 positive changes for each negative change, OR = 5/1 = 5. We then must estimate the distribution of positives, which we might guess will be 0.8 for +1 and 0.2 for +2, which correctly sums to 1.

Under the proportional odds assumption,

    OR = [C_pi (1 - C_ni)] / [C_ni (1 - C_pi)] ,

which is fixed for all i < k, where k = # of categories. In this equation, C_pi is the cumulative proportion of positive pairs in category i, and C_ni is the cumulative proportion of negative pairs.

Plugging in OR = 5 and C_p2 = 0.2, we get

    5 = [0.2 (1 - C_n2)] / [C_n2 (1 - 0.2)]
    5 × 0.8 × C_n2 = 0.2 (1 - C_n2)
    4 C_n2 + 0.2 C_n2 = 0.2
    C_n2 = 0.2/4.2 = 1/21 ≈ 0.048

Since the negative-category proportions must sum to 1, by subtraction we get p_n1 = 1 - 0.048 = 0.952. We now have the proportions that are conditional upon a change being positive or negative, where the negatives sum to 1 and the positives sum to 1:

    difference    conditional proportion
       -2           p_n2 = 0.048
       -1           p_n1 = 0.952
       +1           p_p1 = 0.8
       +2           p_p2 = 0.2

These can be converted to the unconditional expected proportions by multiplying the negatives by 1/(OR+1) and the positives by OR/(OR+1):

    difference    conditional         unconditional p_i
       -2         0.048 × (1/6)  =        0.008
       -1         0.952 × (1/6)  =        0.159
       +1         0.8   × (5/6)  =        0.667
       +2         0.2   × (5/6)  =        0.167

where the p_i sum to 1.

Running PEPI SAMPLES:

    type 5 for comparison of ordered categories, matched pairs
    level of significance: .05
    power (%): 80
    how many categories: 3
    odds ratio: 5
    probability size of discrepancy (value of change score) 1:  .8
    probability size of discrepancy (value of change score) 2:  .2

which produces the result that you need 12 discordant pairs to achieve this power. If you assume that 1/3 of your sample will be discordant, then you need a total of 36 pairs as your actual sample size.

If pilot data are available, we can use them to estimate the odds ratio for the proportional odds assumption, as

    OR = (proportion positive) / (proportion negative)

We would get these proportions by first computing the change scores and then generating a frequency table of the change scores. In the above example, using Stata, this would look like:

    gen diff = posthads - prehads      <- compute change scores
    tab diff if diff ~= 0              <- frequency table ignoring "no change" values

which will produce:

    change score    frequency    percent
        -2               2          0.8
        -1              32         15.9
        +1             133         66.7
        +2              33         16.7
      total            200        100.0

Computing the proportional odds:

    OR = (0.667 + 0.167)/(0.008 + 0.159) = 5

PEPI will require the proportion of positives in the +1 category, which is 133/(133+33) = 0.8, and the proportion of positives in the +2 category, which is 33/(133+33) = 0.2.
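The same quantities can also be computed directly from the pilot change scores rather than read off a frequency table. A minimal sketch, assuming the pilot variables posthads and prehads from above are in memory:

    * estimate OR and the positive-category proportions from pilot data
    gen diff = posthads - prehads
    quietly count if diff > 0 & diff < .      // positive discordant pairs
    local npos = r(N)
    quietly count if diff < 0                 // negative discordant pairs
    local nneg = r(N)
    display "OR = " `npos'/`nneg'
    quietly count if diff == 1
    display "proportion of positives at +1 = " r(N)/`npos'
    quietly count if diff == 2
    display "proportion of positives at +2 = " r(N)/`npos'

With the pilot frequencies above, this returns OR = (133+33)/(2+32) ≈ 4.9 and positive-category proportions of 0.8 and 0.2.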
Running PEPI SAMPLES, we input the same values as before:

    type 5 for comparison of ordered categories, matched pairs
    level of significance: .05
    power (%): 80
    how many categories: 3
    odds ratio: 5
    probability size of discrepancy (value of change score) 1:  .8
    probability size of discrepancy (value of change score) 2:  .2

Protocol Suggestion

You could state,

    The required sample size for testing the Hospital Anxiety and Depression Scale (or whatever the variable is) using a Wilcoxon signed ranks test was computed using the procedure reported by Julious and Campbell (1998). Based on pilot data, we assumed a proportional odds of positive discordant pairs to negative discordant pairs of 5.0, with discordant-pair percentages of 0.8%, 15.9%, 66.7%, and 16.7% for changes of -2, -1, +1, and +2, respectively. For 80% power with a two-sided 0.05 level test, this requires 12 discordant pairs. Assuming that only 33% of the pairs will be discordant, consistent with the pilot data, we require a total of N=36 pairs, or subjects. This approach is consistent with the Wilcoxon signed ranks test, which only uses, and bases its sample size on, the discordant pairs.

Interrater Reliability (Precision of Confidence Interval Around Intraclass Correlation Coefficient)

Interrater reliability is how closely two or more raters agree on the value they assign to a measurement for the same subject. Intrarater reliability is how close the ratings are for the same subjects assigned by the same rater on two or more occasions (such as test/re-test reliability). For both, the reliability coefficient is the intraclass correlation coefficient (ICC).

It was stated in Chapter 2-5, page 32, that the sample size for an interrater reliability study is based on the desired width of the confidence interval around the ICC statistic, rather than on a hypothesis test that the ICC is different from zero (Bristol, 1989; Chow et al, 2008). Another decision that has to be made when designing the reliability study is how many raters to use, which also affects the sample size calculation. A formula for estimating the required sample size is provided by Bonett (2002).

Step 1) Copy the following into the Stata do-file editor, highlight it, and hit the run key (rightmost menu button). This will load the program, or sampicc command, into your current session of Stata. Once loaded, it will execute as any other Stata command, for your current session of Stata only.
    * syntax: sampicc , icc(0.7) raters(5) width(0.2) level(.95)
    *  where icc    = assumed ICC
    *        raters = number of raters
    *        width  = desired precision (upper minus lower limits)
    *        level  = confidence level of CI, e.g., 95%
    capture program drop sampicc
    program define sampicc , rclass
        version 10
        syntax [, icc(real 0.7) raters(real 2) width(real 0.2) ///
            level(real 0.95)]
        local rho = `icc'
        local k = `raters'
        local w = `width'
        local level = `level'
        local alpha = 1 - `level'
        local n = 8*(invnorm(1-`alpha'/2))^2*((1-`rho')^2 ///
            *(1+(`k'-1)*`rho')^2)/(`k'*(`k'-1)*`w'^2)+1
        if (`k'==2 & `rho'>=0.7) {
            local n = `n'+5*`rho'   // improved estimate for this special case
        }
        local n = round(`n'+0.5)    // round up to nearest integer
        display _newline
        display as text ///
            "Required N for desired precision of exact CI around ICC"
        display as text ///
            "-------------------------------------------------------"
        display as result "Assumed ICC: " %4.3f "`rho'"
        display as result %2.0f `level'*100 "% CI width (upper minus" , _c
        display as result "lower limits): " %3.0f "`w'"
        display as result "Number of raters: " %3.0f "`k'"
        display as result "Required n: " %8.0f "`n'"
        return scalar num_subjects = `n'   // required N
        return scalar level = `level'
        return scalar width = `w'
        return scalar num_raters = `k'
        return scalar icc = `rho'
    end

The command sampicc has the following syntax:

    sampicc , icc(#) raters(#) width(#) level(#)

    where icc    = assumed ICC (expressed as a proportion between 0 and 1)
          raters = number of raters
          width  = desired precision (upper minus lower limits)
          level  = confidence level of CI, e.g., 95% (expressed as a proportion between 0 and 1)

Example: For an assumed ICC = 0.70, using 4 raters, and a desired 95% CI width of 0.2 (upper bound minus lower bound), you would use:

    sampicc , icc(0.7) raters(4) width(0.2) level(.95)

Step 2) Execute the command in the do-file editor or command window.

    sampicc , icc(0.85) raters(4) width(0.2) level(.95)

which outputs:

    Required N for desired precision of exact CI around ICC
    -------------------------------------------------------
    Assumed ICC: .85
    95% CI width (upper minus lower limits): .2
    Number of raters: 4
    Required n: 20

If the sample turns out to give an ICC of 0.85, using 4 raters, the width of the 95% CI will be close to 0.20. The result agrees with the example given in the article the formula was taken from (Bonett, 2002), so you can be confident it was programmed correctly.

To see what the sampicc command returns for use in programming, use

    return list

    scalars:
          r(icc)          = .85
          r(num_raters)   = 4
          r(width)        = .2
          r(level)        = .95
          r(num_subjects) = 20

These returned values allow us to use the sampicc program, or command, in a loop to look at various combinations of CI width and number of raters. Copying the following into the Stata do-file editor and executing it will provide the required sample sizes for the selected combinations.

    * Vary the number of raters from 2 to 6
    * Vary 95% CI width from .1 to .4 in increments of 0.05
    * Fix ICC at 0.7 and confidence level to 95%
    preserve                 // hold copy of original dataset
    clear
    quietly set obs 100
    quietly gen _width=.
    forval r=2/6 {
        quietly gen _raters`r'=.
        local row=0
        forval w=0.1(.05)0.4 {
            local row=`row'+1
            quietly sampicc , icc(.7) raters(`r') width(`w') level(.95)
            quietly replace _width = r(width) in `row'
            quietly replace _raters`r' = r(num_subjects) in `row'
        }
    }
    list _width _raters* if _raters2~=. , noobs sep(0) clean
    restore                  // return original dataset into memory
    _width   _raters2   _raters3   _raters4   _raters5   _raters6
        .1        405        267        223        201        188
       .15        183        120        100         90         84
        .2        105         68         57         51         48
       .25         69         44         37         33         31
        .3         49         31         26         24         22
       .35         38         23         20         18         17
        .4         30         18         15         14         13

The Stata-created variable _width holds the requested widths, _raters2 holds the required sample size when two raters are used, _raters3 when three raters are used, and so on. From this, if we can afford to assess interrater reliability (ICC) on n=31 subjects, we would probably decide to use 3 raters and accept a width of 0.3 for our 95% CI.

In Bonett's (2002) Table 1, the width was set to 0.2, the ICC was varied from 0.1, 0.2, ..., 0.9, and the number of raters was selected as 2, 3, 5, and 10. To duplicate Bonett's Table 1, for the purpose of illustrating how to modify the looping structure of this Stata code, we would use:

    * Vary the number of raters as 2, 3, 5, and 10
    * Vary ICC from .1 to .9 in increments of 0.1
    * Fix 95% CI width at 0.2 and confidence level to 95%
    preserve                 // hold copy of original dataset
    clear
    quietly set obs 100
    quietly gen _ICC=.
    foreach r of numlist 2 3 5 10 {
        quietly gen _raters`r'=.
        local row=0
        forval i=.1(.1).9 {
            local row=`row'+1
            quietly sampicc , icc(`i') raters(`r') width(.2) level(.95)
            quietly replace _ICC = r(icc) in `row'
            quietly replace _raters`r' = r(num_subjects) in `row'
        }
    }
    list _ICC _raters* if _raters2~=. , noobs sep(0) abbrev(9) clean
    restore                  // return original dataset into memory

    _ICC   _raters2   _raters3   _raters5   _raters10
      .1        378        151         62          26
      .2        356        162         81          44
      .3        320        162         93          59
      .4        273        151         95          67
      .5        218        130         88          66
      .6        159        101         73          57
      .7        105         68         51          42
      .8         55         36         29          24
      .9         20         12         10           9

These estimates agree with Bonett's Table 1. For the case of two raters with ICC ≥ 0.70, however, the adjustment described in the paragraph following Bonett's Table 1 has been applied to provide better estimates, so those three sample sizes intentionally differ from Bonett's Table 1.

Protocol Suggestion

Using the example above, where it was decided that it was feasible to use n=31 subjects and k=3 raters, with an anticipated ICC of 0.70, you could state something like the following in your protocol:

    Interrater reliability will be assessed with the intraclass correlation coefficient (ICC). To assess interrater reliability of tumor size measurements from chest X-ray radiographs, k=3 radiologists will be used, each providing measurements from the same radiograph on the same n=31 lung cancer patients. Interrater reliability will be assessed separately for the posterior-anterior view and the lateral view. The precision approach is used for sample size determination (Bristol, 1989; Chow et al, 2008). Using the sample size determination approach described by Bonett (2002), and assuming ICC=0.70, this sample size and number of raters provide a 95% confidence interval around the ICC of width 0.3 (ICC ± 0.15). This is acceptable precision for our purposes, so the sample size and number of raters are adequate.

    --------------------
    Bristol DR. Sample size for constructing confidence intervals and testing hypotheses. Statist Med 1989;8:803-811.
    Chow S-C, Shao J, Wang H. Sample Size Calculations in Clinical Research. 2nd ed. New York, Chapman & Hall/CRC, 2008.
    Bonett DG. Sample size requirements for estimating intraclass correlations with desired precision. Statistics in Medicine 2002;21:1331-1335.
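As a quick check of the numbers used in this protocol suggestion, you can run sampicc with the chosen design values; per the width-by-raters table above, it returns a required n of 31:

    sampicc , icc(0.7) raters(3) width(0.3) level(.95)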
Repeated Measures or Clustered Studies (GEE, mixed, multilevel, hierarchical models)

Studies that use multilevel models require a larger sample size than when ordinary linear regression is used. Ordinary regression models assume that all observations are independent. When we have repeated measurements on the same person, the repeated measurements are correlated.

Suppose you have a study with two office visits for each patient, with one to three patients for each physician. It might look something like this:

         +----------------------------------------------+
         | patient_id   physician_id   visit2    y    x |
         |----------------------------------------------|
     1.  |          1              1        0   12    1 |
     2.  |          1              1        1   10    1 |
     3.  |          2              1        0   13    2 |
     4.  |          2              1        1   11    3 |
     5.  |          3              1        0   14    2 |
     6.  |          3              1        1    9    5 |
         |----------------------------------------------|
     7.  |          4              2        0   20    2 |
     8.  |          4              2        1   18    3 |
     9.  |          5              2        0   22    4 |
    10.  |          5              2        1   17    5 |
         |----------------------------------------------|
    11.  |          6              3        0   25    4 |
    12.  |          6              3        1   22    7 |
    13.  |          7              3        0   23    7 |
    14.  |          7              3        1   21   10 |
         |----------------------------------------------|
    15.  |          8              4        0   30    8 |
    16.  |          8              4        1   27    9 |
         |----------------------------------------------|
    17.  |          9              5        0   30    1 |
    18.  |          9              5        1   27    2 |
    19.  |         10              5        0   32   11 |
    20.  |         10              5        1   29   15 |
         +----------------------------------------------+

In this example, ordinary linear regression assumes that there are N=20 independent pieces of information. With two observations taken on each patient, only N=10 patient clusters contributed information. To make matters worse, the patients within physicians were highly alike, so there are really only N=5 physician clusters contributing information. Is the sample size 20, 10, or 5, then? If the correlation among the observations is 0, the sample size is 20. If the correlation is 1, the sample size is 5, which is the number of physician clusters. Since the correlation is not going to be exactly 0 or 1, the effective sample size is somewhere in between. The amount of correlation in the data is measured with an intraclass correlation coefficient (ICC).

I am not aware of any articles that discuss how to compute the sample size for more than one level of clusters. Usually, statisticians simply choose one level for sample size determination.

First, the sample size is calculated for the naive model, which is the ordinary regression model that assumes independent observations. Then, this is adjusted for clustering by multiplying the sample size by the design effect, which is (Campbell et al, 2000)

    1 + [(average cluster size - 1) × ICC]

If you don't know what the ICC is when designing the study, Campbell et al (2000) provides some suggestions and is a good citation.

Consistent with our example, let's suppose we have N=5 physician clusters. Within each physician, we expect to have an average of 4 patients, each with two repeated measurements. We assume the ICC is 0.20. The design effect is then

    1 + [(4 - 1) × 0.20] = 1 + 3 × 0.20 = 1.6

We would have to use a sample size of patient observations that is 1.6 times the sample size required if all patient observations were independent.

Applying the design effect after calculating a sample size with sampsi can be done in Stata using sampclus. You will have to add it to Stata. While connected to the Internet, run the command

    findit sampclus

and then follow the instructions to install it.

Example

Suppose we want to compare two groups of patients, where we intend to collect 10 patients from each provider. We assume group means±SDs of 4±2 vs 5±2.5.
We assume an ICC=0.05, which is nonzero because some patients see the same provider. We desire a power of 80%. To compute the sample size, we first use

    sampsi 4 5 , sd1(2) sd2(2.5) power(.80)

    Estimated sample size for two-sample comparison of means

    Test Ho: m1 = m2, where m1 is the mean in population 1
                        and m2 is the mean in population 2
    Assumptions:
             alpha =   0.0500  (two-sided)
             power =   0.8000
                m1 =        4
                m2 =        5
               sd1 =        2
               sd2 =      2.5
             n2/n1 =     1.00

    Estimated required sample sizes:
                n1 =       81
                n2 =       81

If the ICC for patients nested within provider were 0, we could stop here. However, due to the nonzero correlation introduced by provider clusters, there are not n=81+81=162 independent pieces of information. There is something between that and the n=162/10=16.2 providers. We now multiply each group's sample size by the design effect, 1 + [(average cluster size - 1) × ICC]:

    display 81*(1+((10-1)*0.05))
    117.45

Thus we must collect n=118 observations (not subjects) in each study group to have 80% power. We get the same result using sampclus after sampsi, as follows:

    sampsi 4 5 , sd1(2) sd2(2.5) power(.80)
    sampclus , obsclus(10) rho(.05)

    Sample Size Adjusted for Cluster Design

       n1 (uncorrected) = 81
       n2 (uncorrected) = 81

       Intraclass correlation     = .05
       Average obs. per cluster   = 10
       Minimum number of clusters = 24

       Estimated sample size per group:
       n1 (corrected) = 118
       n2 (corrected) = 118

Example

Suppose we want to compare two groups of patients, where we intend to collect 10 repeated measurements per patient. In this situation, patient is now the cluster. In Chapter 23, where we modeled forearm blood flow, the ICC was 0.53. You can expect very high ICCs for repeated measures data. Assuming the same effect size with this higher ICC,

    sampsi 4 5 , sd1(2) sd2(2.5) power(.80)
    sampclus , obsclus(10) rho(.53)

    Sample Size Adjusted for Cluster Design

       n1 (uncorrected) = 81
       n2 (uncorrected) = 81

       Intraclass correlation     = .53
       Average obs. per cluster   = 10
       Minimum number of clusters = 94

       Estimated sample size per group:
       n1 (corrected) = 468
       n2 (corrected) = 468

Thus we need 468 observations per group; dividing by the number of observations per patient, 468/10 = 46.8, we need 47 patients per group. We see that if we have a good estimate of the ICC, we can reduce our sample size. Frequently, investigators are not willing to guess what the ICC will be, so they would just go with the n=81 patients per group to be on the safe side (safe from making a false estimate of the ICC).

Protocol Suggestion

In your protocol, you need to convince the reviewer that you understand that an adjustment is necessary to account for the correlation structure in the data (the ICC) and that you have a reasonable estimate of the ICC. The Campbell et al (2000) paper is good to cite, both as a reference for the design effect and for the assumption that the ICC will not exceed 0.05 for patient outcomes or 0.15 for provider processes. You might use something like this:

    Ordinary sample size calculation assumes that all data points are independent. With a multilevel structure, the ordinary sample size estimate needs to be inflated by the design effect, 1 + (n̄ - 1)ρ, where n̄ is the average cluster size, each prescriber representing a cluster, and ρ is the estimated intra-cluster correlation coefficient (ICC).
    Sample size calculation proceeds by calculating the sample size for a naive model, an ordinary model that assumes all observations are independent, and then inflating that sample size by multiplying it by the design effect, so that the sample size calculation applies to the multilevel model (Campbell et al, 2000). The ICCs related to the outcomes of this study are unknown. However, in a study exploring several datasets for a wide range of practitioner-related, or process, variables in primary care settings, the ICCs were in the range of 0.05 to 0.15, whereas the ICCs for patient-related outcome variables were generally less than 0.05 (Campbell et al, 2000). Conservatively, for prescriber outcomes, an ICC of 0.15 will be assumed, and for patient outcomes, an ICC of 0.05 will be assumed. The required sample sizes for the various outcomes are shown in Table 3.
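If pilot data with the clustered structure are available, the ICC can be estimated directly rather than assumed. A minimal sketch, using Stata's loneway command (a one-way ANOVA estimate of the ICC) with the variable names from the data listing above:

    * estimate the ICC of y within physician clusters from pilot data
    loneway y physician_id

The "Intraclass correlation" line of the loneway output is the ρ to plug into the design effect.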
Power Analysis Using Monte Carlo Simulation (Independent Samples t Test)

By definition, power is the probability that you will get a significant result in your sample data, for a given effect size in the sampled population. The formula for computing power is based on a long-run average. That is just what Monte Carlo simulation is: a long-run average. You can get the same answer as the formula, then, using Monte Carlo simulation.

Suppose we wanted to compute the power for an independent samples t test, with the following assumed parameters:

    Group   mean   SD    N
      A      4      2    50
      B      5     2.5   50

For you MStat students, we will be applying the inverse transformation method (see box).

    Inverse Transformation Method (Ross, 1998, p. 455)
    "Let U be a uniform(0,1) random variable. For any continuous distribution
    function F, if we define the random variable Y by Y = F^-1(U), then the
    random variable Y has distribution function F. [F^-1(x) is defined to
    equal that value y for which F(y) = x.]"

In Stata, the pseudorandom number generator for a standard normal variable is

    invnorm(uniform())

which comes from applying the inverse transformation, where
    F is the normal distribution,
    F^-1 is the inverse normal distribution [invnorm( ) function in Stata],
    U is the uniform(0,1) distribution [uniform( ) function in Stata].

This produces a standard normal distribution, with mean = 0 and SD = 1. To convert this to a normal distribution with a desired mean and SD, we use the following:

    z = (X - Mean)/SD , which is Normal with mean=0 and SD=1 (standard normal)
    X - Mean = SD × z
    X = Mean + SD × z , which is Normal with the desired mean and SD

For one iteration, just to see what is happening, we use the following code to create the variables and compute an independent samples t test:

    set seed 999   // use if you want to be able to exactly reproduce the result
    clear
    set obs 100
    gen group = 0 in 1/50
    replace group = 1 in 51/100
    gen y = invnorm(uniform())*2+4 in 1/50          // mean = 4 , SD = 2
    replace y = invnorm(uniform())*2.5+5 in 51/100  // mean = 5 , SD = 2.5
    ttest y , by(group)

    Two-sample t test with equal variances
    ------------------------------------------------------------------------------
       Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
           0 |      50    4.173172    .2956474    2.090543    3.579046    4.767298
           1 |      50    5.229902    .2909314    2.057196    4.645254    5.814551
    ---------+--------------------------------------------------------------------
    combined |     100    4.701537     .213067     2.13067    4.278766    5.124308
    ---------+--------------------------------------------------------------------
        diff |            -1.05673    .4147873               -1.879862   -.2335987
    ------------------------------------------------------------------------------
        diff = mean(0) - mean(1)                                  t =  -2.5476
        Ho: diff = 0                               degrees of freedom =       98

        Ha: diff < 0              Ha: diff != 0                 Ha: diff > 0
     Pr(T < t) = 0.0062      Pr(|T| > |t|) = 0.0124          Pr(T > t) = 0.9938

We see that we generated a sample with means and SDs as expected, apart from sampling variation. We need a way to save the p value (p = 0.0124). For Stata commands that are not regression models, the results are saved in return list. To see where the p value is being saved, we use

    return list

    scalars:
               r(sd)   =  2.130670072012433
               r(sd_2) =  2.057195742247358
               r(sd_1) =  2.09054282614709
               r(se)   =  .4147872618554172
               r(p_u)  =  .9938001171468509
               r(p_l)  =  .0061998828531491
               r(p)    =  .0123997657062982
               r(t)    =  -2.547644488066585
               r(df_t) =  98
               r(mu_2) =  5.229902350902558
               r(N_2)  =  50
               r(mu_1) =  4.173171869516373
               r(N_1)  =  50

Matching this with the t test output, we discover the two-tailed p value is saved in r(p).

Here is the complete code for two iterations:

    * create a file to hold significant results
    clear
    set obs 1
    gen signif = .
    save junk, replace
    *
    set seed 999   // if you want to get the same result when repeating this
    *
    * iterate and append results to file
    forval i=1/2 {
        clear
        set obs 100
        gen group = 0 in 1/50
        replace group = 1 in 51/100
        gen y = invnorm(uniform())*2+4 in 1/50          // mean = 4 , SD = 2
        replace y = invnorm(uniform())*2.5+5 in 51/100  // mean = 5 , SD = 2.5
        ttest y , by(group)
        gen signif = cond(r(p)<0.05,1,0) in 1/1
        keep in 1/1
        keep signif
        append using junk
        save junk, replace
    }
    sum signif
    Two-sample t test with equal variances
    ------------------------------------------------------------------------------
       Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
           0 |      50    4.173172    .2956474    2.090543    3.579046    4.767298
           1 |      50    5.229902    .2909314    2.057196    4.645254    5.814551
    ---------+--------------------------------------------------------------------
    combined |     100    4.701537     .213067     2.13067    4.278766    5.124308
    ---------+--------------------------------------------------------------------
        diff |            -1.05673    .4147873               -1.879862   -.2335987
    ------------------------------------------------------------------------------
        diff = mean(0) - mean(1)                                  t =  -2.5476
        Ho: diff = 0                               degrees of freedom =       98

        Ha: diff < 0              Ha: diff != 0                 Ha: diff > 0
     Pr(T < t) = 0.0062      Pr(|T| > |t|) = 0.0124          Pr(T > t) = 0.9938

    Two-sample t test with equal variances
    ------------------------------------------------------------------------------
       Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
           0 |      50    3.937702    .2594191     1.83437     3.41638    4.459024
           1 |      50    4.860296    .3174238    2.244525    4.222409    5.498183
    ---------+--------------------------------------------------------------------
    combined |     100    4.398999     .209139     2.09139    3.984022    4.813976
    ---------+--------------------------------------------------------------------
        diff |           -.9225939    .4099466                -1.73612   -.1090683
    ------------------------------------------------------------------------------
        diff = mean(0) - mean(1)                                  t =  -2.2505
        Ho: diff = 0                               degrees of freedom =       98

    Variable |   Obs       Mean   Std. Dev.   Min   Max
    ---------+------------------------------------------
      signif |     2          1           0     1     1

For two iterations, the power is 100%, which is the mean of the 0-1 variable signif, where 1 denotes p < 0.05 for a given iteration.

We don't really want the results to display on our computer screen for each iteration. We can turn this off by putting "quietly" in front of each command that produces output, or simply in front of every command inside the loop. Let's also request 1,000 iterations.

    * create a file to hold significant results
    clear
    set obs 1
    gen signif = .
    save junk, replace
    *
    set seed 999
    *
    * iterate and append results to file
    forval i=1/1000 {
        quietly clear
        quietly set obs 100
        quietly gen group = 0 in 1/50
        quietly replace group = 1 in 51/100
        quietly gen y = invnorm(uniform())*2+4 in 1/50
        quietly replace y = invnorm(uniform())*2.5+5 in 51/100
        quietly ttest y , by(group)
        quietly gen signif = cond(r(p)<0.05,1,0) in 1/1
        quietly keep in 1/1
        quietly keep signif
        quietly append using junk
        quietly save junk, replace
    }
    sum signif

Now we just get the final result.

    Variable |   Obs       Mean   Std. Dev.   Min   Max
    ---------+------------------------------------------
      signif |  1000       .582     .493477     0     1

We see that our power is 58.2%. Comparing this result to the closed-form formula approach,

    sampsi 4 5 , sd1(2) sd2(2.5) n1(50) n2(50)

    Estimated power:
             power =   0.5982

The simulated power (58.2%) is very close to the formula power (59.8%).

Here are the results for various numbers of iterations:

    Formula power: 59.82%

    Simulation
    # iterations    power
           100      56.00%
         1,000      58.20%
        10,000      59.52%

Since your SD assumptions are estimates anyway, there is really no need for the extra precision provided by 10,000 iterations. It is sufficient, and recommended, to just use 1,000 iterations. If you are simulating a very complicated model and you are in a hurry, it is probably sufficient to use 100 iterations, since the convergence is quite good even with that, and your assumptions will be off anyway.

To discover the sample size for a desired power requires varying the sample size and computing power until you converge on the desired power. For that, you can use the above Stata code, but use 50 or 100 iterations until you begin to get close, and then use 1,000 iterations for your final sample size calculation.
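The precision of a simulated power estimate can be quantified, since each iteration is a Bernoulli trial: the standard error of the estimated power is sqrt(p(1 - p)/B), where B is the number of iterations. A quick check for the run above:

    * Monte Carlo standard error of the simulated power (58.2%, 1,000 iterations)
    display sqrt(.582*(1-.582)/1000)

which displays about .0156. So 1,000 iterations pins down the power to within roughly ±3 percentage points (95% margin of error ≈ 1.96 × 0.0156 ≈ 0.031), supporting the recommendation that 1,000 iterations is enough.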
Power Analysis Using Monte Carlo Simulation (2 x 2 Table Chi-square Test)

For this situation, we are comparing two proportions. Suppose we anticipate the following 2 × 2 table:

                      Group
    Outcome        A           B
      Yes      20 (20%)    30 (30%)
      No          80          70
                 100         100

Using the formula approach,

    sampsi .20 .30 , n1(100) n2(100)

    Estimated power:
             power =   0.3108

Here is the code for the simulation:

    * create a file to hold significant results
    clear
    set obs 1
    gen signif = .
    save junk, replace
    *
    set seed 999
    *
    * iterate and append results to file
    forval i=1/100 {
        quietly clear
        quietly set obs 200
        quietly gen group = 0 in 1/100
        quietly replace group = 1 in 101/200
        quietly gen y = uniform() in 1/100
        quietly replace y = uniform() in 101/200
        quietly replace y = cond(y<=0.2,1,0) in 1/100
        quietly replace y = cond(y<=0.3,1,0) in 101/200
        quietly tab y group , chi2
        quietly gen signif = cond(r(p)<0.05,1,0) in 1/1
        quietly keep in 1/1
        quietly keep signif
        quietly append using junk
        quietly save junk, replace
    }
    sum signif

Here are the results for various numbers of iterations:

    Formula power: 31.08%

    Simulation
    # iterations    power
           100      29.00%
         1,000      34.80%
        10,000      36.86%

We notice that the simulation, with the recommended 1,000 iterations, and also with 10,000 iterations, produces a higher power than the formula approach. The sampsi command for proportions uses the Yates continuity correction in its calculation. The tab command does not use the continuity correction. Few researchers use the Yates continuity correction anymore, because it is known to be conservative (p values are too large) (see box). If the continuity correction is not going to be applied in the data analysis (in particular, it is not applied in logistic regression models), then the simulated power is the more correct estimate. In practice, however, doing the simulation is a lot of work for a small gain, so it is fine to just use the formula approach, which is the sampsi command.

    Yates Continuity Correction Controversy (Agresti, 1990, p. 68)
    There is a controversy among statisticians on whether or not the Yates
    continuity correction should be applied. One camp claims that the
    continuity correction should always be applied, because the p value is
    more accurate and closer to an exact p value (Fisher's exact p value).
    The other camp claims that the continuity correction should not be
    applied, because it takes the p value closer to the Fisher's exact test
    p value, and the Fisher's exact p value is known to be conservative (it
    does not drop below alpha, 0.05, often enough).

Power Analysis Using Monte Carlo Simulation (Poisson Regression with Person-Time)

Suppose we want to model the effect of an elementary school flu immunization program. Our outcome variable is the number of days absent during the winter months, a readily available outcome measure. In 10% of the schools we will immunize all the children before the start of the flu season. In 90% of the schools, no school immunization program will be implemented. Since children can miss multiple days, we need to model this as a rate for each school:

    absentee rate = (total days absent)/(number of students × number of school days)

In n=40 schools, each school will contribute one observation to the sample size. The three variables are:

    Immunization program: 1 = yes, 0 = no
    Student days: total number of days that school was in session
    Absent: total number of days which students were absent

For sample size determination, we were only able to get data on two schools.

    clear
    input school studentdays absent
    1 39177 1405
    2 41015 1810
    end
    list
    sum

    +-------------------------------+
    | school   studentdays   absent |
    |-------------------------------|
    |      1         39177     1405 |
    |      2         41015     1810 |
    +-------------------------------+

    Variable    |  Obs      Mean    Std. Dev.     Min     Max
    ------------+--------------------------------------------
    school      |    2       1.5     .7071068       1       2
    studentdays |    2     40096     1299.662   39177   41015
    absent      |    2    1607.5     286.3782    1405    1810
We now have the problem of what to assume for a mean and standard deviation for studentdays and absent. With only two schools to estimate these from, we had better be careful. The estimates of the means are not as important as the estimates of the standard deviations. Let's use these estimates for the means and increase the standard deviations by 100%:

    display 1300*2
    2600
    display 286*2
    572

In the Monte Carlo simulation of the sample size, we will use:

    absent       mean = 1608    SD = 572
    studentdays  mean = 40096   SD = 2600

It is reasonable to expect that studentdays and absent are correlated. Let's assume they are correlated in the amount of r = 0.30. To draw a simulated random sample of two correlated normally distributed variables, we use the drawnorm command. We have to provide this command with a vector (n × 1 matrix) of means, a vector of standard deviations, and a correlation matrix (n × n).

For regression commands, Stata stores the computed values in ereturn list, rather than return list. The regression coefficients and standard errors are stored in a matrix, but it is easier to use a shortcut: Stata also saves the regression coefficients in _b[ ] and the standard errors in _se[ ], where you put the variable name inside the brackets, spelled exactly as it appears in the regression output. The p value is not stored, but it can be computed by looking up the Wald statistic, _b[ ]/_se[ ], in the standard normal distribution.

Here is the approach, using 1 iteration. We will assume an effect size of RR = 0.95, so the immunization (immun) intervention reduces absenteeism by 5%.

    clear
    set obs 1
    gen signif = .
    save tempsignif, replace
    set seed 999
    forval i=1/1 {
        clear
        set obs 50                        // number of schools
        matrix m = (1608 , 40096)         // means
        matrix sd = (572 , 2600)          // standard deviations
        matrix c = (1 , .3 \ .3 , 1)      // correlation
        drawnorm absent studentdays , n(50) means(m) sds(sd) corr(c)
        replace absent=round(absent,1)    // convert to integer
        replace studentdays=round(studentdays,1)
        replace absent=absent*.95 in 1/10
        gen immun = 1 in 1/10
        replace immun = 0 in 11/50
        poisson absent immun , exposure(studentdays) irr
        gen signif=(1-(normprob(abs(_b[immun]/_se[immun]))))*2 < 0.05 in 1/1
        display (1-(normprob(abs(_b[immun]/_se[immun]))))*2
        keep in 1/1
        keep signif
        append using tempsignif
        save tempsignif, replace
    }
    sum

    Poisson regression                       Number of obs =         50
                                             LR chi2(1)    =     240.84
                                             Prob > chi2   =     0.0000
    Log likelihood = -5188.1744              Pseudo R2     =     0.0227

    ------------------------------------------------------------------------------
          absent |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
           immun |   .8692226   .0079689   -15.29   0.000     .8537433    .8849826
     studentdays | (exposure)
    ------------------------------------------------------------------------------

    Variable |  Obs   Mean   Std. Dev.   Min   Max
    ---------+-------------------------------------
      signif |   12      1           0     1     1

It is possible that sometimes the model will crash, so we want to make sure the simulation keeps running even when that happens, and then just not include the crashed models in the sum at the end. To do this, we put capture in front of the poisson command, which captures all output, error messages in particular.
We then check the return code, _rc, to see if it is 0, meaning no error occurred, and continue with the rest of the iteration only if no error occurred. Making this change,

    clear
    set obs 1
    gen signif = .
    save tempsignif, replace
    set seed 999
    forval i=1/1 {
        clear
        set obs 50                        // number of schools
        matrix m = (1608 , 40096)         // means
        matrix sd = (572 , 2600)          // standard deviations
        matrix c = (1 , .3 \ .3 , 1)      // correlation
        drawnorm absent studentdays , n(50) means(m) sds(sd) corr(c)
        replace absent=round(absent,1)    // convert to integer
        replace studentdays=round(studentdays,1)
        replace absent=absent*.95 in 1/10
        gen immun = 1 in 1/10
        replace immun = 0 in 11/50
        capture poisson absent immun , exposure(studentdays) irr
        if _rc==0 {
            gen signif=(1-(normprob(abs(_b[immun]/_se[immun]))))*2 < 0.05 in 1/1
            display (1-(normprob(abs(_b[immun]/_se[immun]))))*2
            keep in 1/1
            keep signif
            append using tempsignif
            save tempsignif, replace
        }
    }
    sum

Since it takes a long time to simulate a regression model, particularly a complex multilevel model, we can display the iteration number so we can tell how close we are to being finished. We will need to turn off the scrolling prompt so we don't have to hit the space bar every time the screen fills up with iteration numbers. Here is the change, along with putting quietly in front of all the commands, and changing it to run 100 iterations:

    clear
    set obs 1
    gen signif = .
    save tempsignif, replace
    set seed 999
    set more off
    forval i=1/100 {
        quietly clear
        quietly set obs 50                    // number of schools
        quietly matrix m = (1608 , 40096)     // means
        quietly matrix sd = (572 , 2600)      // standard deviations
        quietly matrix c = (1 , .3 \ .3 , 1)  // correlation
        quietly drawnorm absent studentdays , n(50) means(m) sds(sd) corr(c)
        quietly replace absent=round(absent,1)   // convert to integer
        quietly replace studentdays=round(studentdays,1)
        quietly replace absent=absent*.95 in 1/10
        quietly gen immun = 1 in 1/10
        quietly replace immun = 0 in 11/50
        quietly capture poisson absent immun , exposure(studentdays) irr
        if _rc==0 {
            quietly gen signif=(1-(normprob(abs(_b[immun]/_se[immun]))))*2 < 0.05 in 1/1
            quietly keep in 1/1
            quietly keep signif
            quietly append using tempsignif
            quietly save tempsignif, replace
        }
        display "Now on iteration " `i'
    }
    set more on
    sum

    Now on iteration 1
    Now on iteration 2
    Now on iteration 3
    ...
    Now on iteration 100

    Variable |  Obs       Mean   Std. Dev.   Min   Max
    ---------+-----------------------------------------
      signif |   90   .9222222     .269322     0     1

We see that the power is 92.2%, based on 90 iterations. That means the model crashed 10 times, so the _rc trick was necessary.

Power Analysis Using Monte Carlo Simulation (2-way ANOVA, both factors with 2 levels, neither of which is a repeated measurement)

This section is for a 2 × 2 factorial design,

                        Factor 2
                      Low    High
    Factor 1   Low
               High

where a 2-way ANOVA will be fitted with a Factor 1 × Factor 2 interaction term. Neither factor can be a repeated measurement. You specify the mean, standard deviation, and sample size for each cell of the table, and the power is returned for each main effect and for the interaction term.

First, cut-and-paste the following code into your Stata do-file editor, highlight it, and run it to set up the program.
    * power analysis for a 2x2 factorial design ANOVA
    * with factor 1 x factor 2 interaction term
    capture program drop poweranova
    program define poweranova
        * 2 x 2 factorial design
        *   factor 1   low    mean1low  (sd1low)
        *              high   mean1high (sd1high)
        *   factor 2   low    mean2low  (sd2low)
        *              high   mean2high (sd2high)
        * syntax: poweranova mean1low SD1low N1low    ///
        *                    mean1high SD1high N1high ///
        *                    mean2low SD2low N2low    ///
        *                    mean2high SD2high N2high
        args mean1low sd1low n1low mean1high sd1high n1high ///
            mean2low sd2low n2low mean2high sd2high n2high
        preserve
        clear
        quietly set obs 1
        quietly gen signif1 = .
        quietly gen signif2 = .
        quietly gen signif3 = .
        quietly save poweranovatemp, replace
        set seed 999
        local n = `n1low'+`n1high'+`n2low'+`n2high'
        local a = `n1low'
        local b = `n1low'+1
        local c = `n1low'+`n1high'
        local d = `n1low'+`n1high'+1
        local e = `n1low'+`n1high'+`n2low'
        local f = `n1low'+`n1high'+`n2low'+1
        * iterate and append results to file
        forval i=1/1000 {
            quietly clear
            quietly set obs `n'
            quietly gen factor1 = 0 in 1/`c'
            quietly replace factor1 = 1 in `d'/`n'
            quietly gen factor2 = 0 in 1/`a'
            quietly replace factor2 = 1 in `b'/`c'
            quietly replace factor2 = 0 in `d'/`e'
            quietly replace factor2 = 1 in `f'/`n'
            quietly gen y = invnorm(uniform())* ///
                `sd1low'+`mean1low' in 1/`a'
            quietly replace y = invnorm(uniform())* ///
                `sd1high'+`mean1high' in `b'/`c'
            quietly replace y = invnorm(uniform())* ///
                `sd2low'+`mean2low' in `d'/`e'
            quietly replace y = invnorm(uniform())* ///
                `sd2high'+`mean2high' in `f'/`n'
            quietly anova y factor1 factor2 factor1*factor2
            quietly gen signif1 = ///
                cond(Ftail(e(df_1),e(df_r),e(F_1))<0.05,1,0) ///
                in 1/1   // factor 1 main effect p value
            quietly gen signif2 = ///
                cond(Ftail(e(df_2),e(df_r),e(F_2))<0.05,1,0) ///
                in 1/1   // factor 2 main effect p value
            quietly gen signif3 = ///
                cond(Ftail(e(df_3),e(df_r),e(F_3))<0.05,1,0) ///
                in 1/1   // interaction p value
            quietly keep in 1/1
            quietly keep signif1 signif2 signif3
            quietly append using poweranovatemp
            quietly save poweranovatemp , replace
        }
        display as result "Factor 1 (low):  mean = " `mean1low' ///
            " , SD = " `sd1low' " , n = " `n1low'
        display as result "Factor 1 (high): mean = " `mean1high' ///
            " , SD = " `sd1high' " , n = " `n1high'
        display as result "Factor 2 (low):  mean = " `mean2low' ///
            " , SD = " `sd2low' " , n = " `n2low'
        display as result "Factor 2 (high): mean = " `mean2high' ///
            " , SD = " `sd2high' " , n = " `n2high'
        quietly sum signif1
        display as result "Power for Factor 1 main effect = " ///
            r(mean)*100 "%"
        quietly sum signif2
        display as result "Power for Factor 2 main effect = " ///
            r(mean)*100 "%"
        quietly sum signif3
        display as result "Power for Factor 1 x Factor 2 interaction = " ///
            r(mean)*100 "%"
        capture erase poweranovatemp.dta
        restore
    end

Then, you run the command, poweranova, with 12 parameters, as follows:

    Syntax: poweranova mean1low SD1low N1low mean1high SD1high N1high ///
                       mean2low SD2low N2low mean2high SD2high N2high

Example: You are conducting an animal experiment, with a study group and a control group. The animals must be sacrificed to collect the histological measurement, so one set of animals is followed for 3 months, and a second set of animals is followed for 6 months, for each of the groups.
You estimate the following cell means, SDs, and sample sizes:

                              Factor 2 (time)
                           Low             High
                        (3 months)      (6 months)
    Factor 1    Low     Mean: 2.0       Mean: 3.0
    (group)  (control)  SD:   1.0       SD:   1.5
                        N:    7         N:    7
                High    Mean: 4.0       Mean: 7.0
               (study)  SD:   2.0       SD:   3.5
                        N:    7         N:    7

After loading the program into Stata, as described above, you run it using

    poweranova 2.0 1.0 7 3.0 1.5 7 4.0 2.0 7 7.0 3.5 7

The result is:

    Factor 1 (low):  mean = 2 , SD = 1 , n = 7
    Factor 1 (high): mean = 3 , SD = 1.5 , n = 7
    Factor 2 (low):  mean = 4 , SD = 2 , n = 7
    Factor 2 (high): mean = 7 , SD = 3.5 , n = 7
    Power for Factor 1 main effect = 93.2%
    Power for Factor 2 main effect = 63.1%
    Power for Factor 1 x Factor 2 interaction = 24.5%

Sample Size for Survival Analysis

The Stata command stpower computes the sample size for survival analysis comparing two survivor functions using the log-rank test, Cox regression, or the exponential parametric survival test. The syntax is:

    Sample size determination
      stpower cox [...] [, ...]
      stpower logrank [...] [, ...]
      stpower exponential [...] [, ...]

    Power determination
      stpower cox [...] , n(numlist) [...]
      stpower logrank [...] , n(numlist) [...]
      stpower exponential [...] , n(numlist) [...]

    Effect-size determination
      stpower cox , n(numlist) {power(numlist) | beta(numlist)} [...]

Example

Suppose you plan to do a log-rank test for an animal experiment (rabbits), where you plan to have a study group (active antimicrobial with a bandage) and a control group (just a bandage). You intend to make an incision to provide a tunnel for infection and then add bacteria to the wound. You expect all of the control group to have a bloodstream infection, and none of the study group. You also expect 20% to 50% of the rabbits to drop out of the study before the end of the four-week follow-up period.

The "stpower logrank" command is based on the method of Freedman (1982). Here is an example power analysis paragraph:

    The planned sample size was based on the number of events, allowing for withdrawals, and the use of the log-rank test (Freedman, 1982). It was assumed that the control group would have 100% infection, or 100% failure probability, and the treatment group would have 0% infection. Assuming 20% withdrawals, the study had at least 80% power if n=10 rabbits were studied in each group. Assuming 50% withdrawals, the study had at least 80% power if n=10 rabbits were used in the control group and n=20 were used in the treatment group.

    ----
    Reference: Freedman LS. Tables of the number of patients required in clinical trials using the logrank test. Statistics in Medicine 1982;1:121-129.

This comes from:

    stpower logrank .99 .01 , power(.8) wdprob(.20)

    Estimated sample sizes for two-sample comparison of survivor functions
    Log-rank test, Freedman method
    Ho: S1(t) = S2(t)

    Input parameters:
            alpha =   0.0500  (two sided)
               s1 =   0.9900
               s2 =   0.0100
           hratio = 458.2106
            power =   0.8000
               p1 =   0.5000
       withdrawal =   20.00%

    Estimated number of events and sample sizes:
                E =        8
                N =       20
               N1 =       10
               N2 =       10

The .99 is the survival probability for the test group (really 1, but Stata needs something strictly between 0 and 1). The .01 is the survival probability for the control group. The wdprob( ) option is the withdrawal probability. It is fine to base these three probabilities on simple proportions anticipated at the end of the follow-up. In the output, "p1" is the proportion of the sample size in the control group.
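The hratio shown in the output is the hazard ratio implied by the two survival probabilities: under proportional hazards, S2(t) = S1(t)^HR, so HR = ln(S2)/ln(S1). A quick check of the value above:

    * hazard ratio implied by survival probabilities .99 and .01
    display ln(.01)/ln(.99)

which displays approximately 458.21, matching the hratio in the output.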
Here's the command for n=10 controls and n=20 in the treatment group. The nratio( ) option is the ratio of the sample sizes, treatment:control.

    stpower logrank .01 .99 , power(.8) wdprob(.50) nratio(2)

    Estimated sample sizes for two-sample comparison of survivor functions
    Log-rank test, Freedman method
    Ho: S1(t) = S2(t)

    Input parameters:
            alpha =   0.0500  (two sided)
               s1 =   0.0100
               s2 =   0.9900
           hratio =   0.0022
            power =   0.8000
               p1 =   0.3333
       withdrawal =   50.00%

    Estimated number of events and sample sizes:
                E =        4
                N =       24
               N1 =        8
               N2 =       16

References

Abramson JH, Gahlinger PM. (2001). Computer Programs for Epidemiologists: PEPI Version 4.0. Salt Lake City, UT, Sagebrush Press. The PEPI-4.0 software can be downloaded free from the Internet, although the manual must be purchased. http://www.sagebrushpress.com/pepibook.html

Agresti A. (1990). Categorical Data Analysis. New York, John Wiley & Sons.

Bonett DG. (2002). Sample size requirements for estimating intraclass correlations with desired precision. Statistics in Medicine 21:1331-1335.

Bristol DR. (1989). Sample size for constructing confidence intervals and testing hypotheses. Statist Med 8:803-811.

Campbell M, Grimshaw J, Steen N, et al. (2000). Sample size calculations for cluster randomised trials. Journal of Health Services Research & Policy 5(1):12-16.

Chow S-C, Shao J, Wang H. (2008). Sample Size Calculations in Clinical Research. 2nd ed. New York, Chapman & Hall/CRC.

Freedman LS. (1982). Tables of the number of patients required in clinical trials using the logrank test. Statistics in Medicine 1:121-129.

Julious SA, Campbell MJ. (1998). Sample size calculations for paired or matched ordinal data. Statist Med 17:1635-1642.

Ross S. (1998). A First Course in Probability, 5th ed. Upper Saddle River, NJ, Prentice Hall.

Whitehead J. (1993). Sample size calculations for ordered categorical data. Statistics in Medicine 12:2257-2271.

Appendix: Chapter Revision History

16 May 2010  Revision history first tracked.
14 Jun 2010  Added section, "Interrater Reliability (Precision of Confidence Interval Around Intraclass Correlation Coefficient)"