Power and Sample Size Analysis

David Huang, Dr.P.H.
Integrated Substance Abuse Treatment Programs, University of California, Los Angeles
November 18, 2011

Outline of This Presentation

I. Introduction to Sample Size Analysis
• Why estimating the required sample is important for a research proposal
• Rationales
• What is sample size analysis
• General framework: rules of thumb
• Components (e.g., effect size, statistical power, correlations)

II. Strategies and Examples
• Single-group estimation
• Two-group comparisons
• Multivariate analysis
• Nested (multi-level) samples
• Repeated measures over time

How Much Data Do We Need?

• How many subjects should be included in the research? Setting expenses aside, the more data the better.
• It is not feasible to collect data on the entire population of interest.
• Consider the collected data as a random sample of the population of interest.

Rules of Thumb

• Feasible in terms of budget and research time frame
• Sufficient data to ensure that results are accurate, efficient, and credible

What Is Sample Size Analysis?

Based on the statistical test for the main research question, sample size analysis determines the minimum amount of data (the sample size) required to detect a significant research finding.

Example: An Experimental Study

An experimental study will be conducted to compare the effectiveness of two treatment protocols (Suboxone vs. Methadone). Outcome measures will include urine test results, ASI score, and satisfaction score at three months after discharge.

Questions Related to Sample Size Calculation

• How will treatment effectiveness be measured? What measures or indicators will be employed?
• How accurate are the measures or indicators?
• How large a difference is expected? What is the smallest effect size that would be considered important?
• Is the new treatment expected to be better than the old treatment (1-sided or 2-sided test)?
• How reliable must the research findings be (power and alpha level)?
• Are subjects nested within treatment clinics?
• Are there follow-up data?

Components of Sample Size Calculation

• What test statistic will be employed?
• Hypothesis testing: the null hypothesis vs. the alternative hypothesis
• Alpha level (or desired accuracy; width of confidence interval)
• Power
• Effect size: expected differences and variation of outcome measures
• Sample size

Types of Statistics

• Means (e.g., ASI score)
-- Compare 2 means (t-test)
-- Compare 3 or more means (ANOVA)
• Proportions (e.g., abstinence rate)
-- Compare 2 proportions
• Bivariate relationship: correlation (r)
• Multiple regression: multiple R²
• Cluster sampling/multi-level

Hypothesis Testing

• The null hypothesis predicts that there is no effect on the variable of interest.
• The alternative hypothesis predicts that there is an effect on the variable of interest (or a difference between groups).
• Statistical tests look for evidence to reject the null hypothesis and conclude in favor of the alternative hypothesis (an effect exists).
• Sample size analysis determines the minimum amount of data required to do so.

Alpha Level and Power in Hypothesis Testing

                        True situation in the population
Research finding        H0 true             Ha true
Do not reject H0        Correct (1 - α)     Error (β)
Reject H0               Error (α)           Correct (1 - β), "power"

Alpha Level and Power

• Alpha level: the probability of incorrectly concluding (from sample data) that there is a significant effect when it does not really exist in the population (Type I error).
-- Alpha level is usually set at .05
• Power: the probability of correctly concluding (from sample data) that there is a significant effect when it really exists in the population.
-- Power is usually set at .80

Effect Size

Effect size: a standardized measure of the magnitude of a difference or relationship.
-- How big a difference or relationship (in a standardized metric) was detected in the analysis
-- Effect sizes (for the same type of statistic) are calculated on a common scale, which allows the effectiveness of different programs to be compared

In calculating sample size:
-- How big a difference or relationship do we want to detect?
-- How big a difference is considered clinically important?

Computing Effect Size

The formula depends on the type of statistic. For example, for a difference in means (t-test):

d = (mean1 - mean2) / standard deviation

Labels vary by statistic:
• d for a difference in two means
• w (h) for a difference in proportions
• r for correlations
• f for a difference in many means (e.g., one-way ANOVA)
• η² for variance explained
• R² for multiple regression

Determining Effect Size

• Based on substantive knowledge
• Based on findings from prior research
• Based on a pilot study
• Use conventions
-- e.g., small, medium, and large effect sizes as defined by Cohen

Magnitude of Effect Size

From Cohen (1988). The bigger the effect size, the easier the detection.

              Small    Medium   Large
d             0.20     0.50     0.80
w             0.10     0.30     0.50
f             0.10     0.25     0.40
η²            0.01     0.06     0.14
r             0.10     0.30     0.50
Partial R²    0.02     0.13     0.26

Steps for Sample Size Determination

• Decide the type of outcome statistic (e.g., mean, proportion, correlation)
• Specify a 1- or 2-tailed test
• Specify the desired alpha level and power
• Specify the desired effect size (from literature, pilot study, or best guess)

General Rules: Required Sample Size

• Detecting small effect sizes --> larger N
• Smaller alpha or greater power --> larger N
• 2-tailed test --> larger N than 1-tailed test
• Adding covariates (e.g., ANCOVA) reduces error variance, which increases the effect size and decreases N
• More parameters in the model --> larger N

[Figure: Effect Size vs. Number of Subjects per Group for a two-tailed t-test with α = .05, power = .80. x-axis: effect size (0.2 to 0.8); y-axis: sample size per group (0 to 500).]

[Figure: Power as a Function of Number of Subjects per Group for a two-tailed t-test with α = .05.]

General Rules: Required Sample Size

• Cluster sampling/multi-level data structure:
-- Larger N as the intra-class correlation increases
• Follow-up with repeated measures:
-- More repeated measures, smaller N per group

[Figure: Power vs. Intra-class Correlation in the Cluster Sample, given an effect size of 0.3, α of .05, and 20 clusters.]

[Figure: Power vs. Number of Subjects for testing a linear trend effect with repeated measures over time, given an effect size of 0.3 and α = .05. Series: time = 3 and time = 6; x-axis: power (0.1 to 0.9); y-axis: sample size (0 to 800).]

Limitations of Sample Size Analysis

• The analyses do not generalize very well.
• Because they rest on assumptions and educated guesses, the analyses give a "best case scenario" estimate of the necessary sample size.
• It is a good strategy to compute the required sample size at several levels of effect size (or alpha, or power) and present the results as a range instead of a single number.

Computational Software

• Erdfelder, E., Faul, F., & Buchner, A. (1996). GPOWER: A general power analysis program. Behavior Research Methods, Instruments, & Computers, 28, 1-11. (Free) http://www.psycho.uni-duesseldorf.de/abteilungen/aap/gpower3/
• Elashoff, J. (2008) nQuery Advisor.

Examples

Example 1: A Cross-sectional Evaluation

Proposition 36 (P36) was implemented in California. A study will be conducted to evaluate the system impact of P36 on treatment accessibility and outcomes among drug-abusing offenders. A total of 450 clients from 15 treatment programs will be screened and recruited for the study. Outcome measures will include urine test results during treatment, ASI score, and satisfaction score at three months after discharge.

Example 1: Research Aims

• Estimate the proportion of negative urine tests, and the ASI score and satisfaction score at three months after discharge.
• Examine differences in treatment outcomes between integrated and non-integrated programs.
• Compare treatment retention among integrated, women-only, and other types of treatment programs.
• Examine the correlation of length of treatment retention with ASI scores.
• Examine potential risk and protective factors associated with treatment outcomes (e.g., ASI score, negative urine test).

Example 1: Corresponding Sample Size Analysis
Description of means and rates (%)

• For estimating the mean of a continuous measure (e.g., ASI score), the proposed sample of 450 will be sufficient to estimate the ASI score within a range of ± 0.13 standard deviation.
• The proposed sample of N=450 will be sufficient to estimate a rate (e.g., negative urine test rate) of 10% within a range of ± 4%, a rate of 30% within a range of ± 6%, and a rate of 50% within a range of ± 7%.

Example 1: Corresponding Sample Size Analysis
Comparisons of outcome measures between integrated and non-integrated programs

• Compare means (independent t-test): The required sample for detecting a small-to-medium effect of d = 0.3 will be 176 per group. The proposed sample of 225 per group will allow detection of an effect size of 0.26.
• Compare rates (%): Given the sample of N=225 per group, the detectable difference in rates will be 13%, 12%, and 9% when rates in the study population are 50%, 30%, and 10%, respectively.

Example 1: Corresponding Sample Size Analysis
Comparisons of means among three or more groups (e.g., ANOVA)

• For a comparison of means among three groups (e.g., integrated, women-only, and others), the required total sample for detecting a medium effect of f = 0.25 will be 159. The proposed sample of 450 (e.g., 150 per group) will allow detection of an effect size of 0.15.

Example 1: Corresponding Sample Size Analysis
Correlation and linear regression

• In a simple regression (1 covariate), detecting a correlation of 0.2 requires a sample of 193. The sample of 450 will allow detection of a small correlation of 0.13 or larger.
• In a multiple regression with p covariates, the required sample will increase, depending on the partial correlation of the other p-1 covariates. The sample of 450 will allow detection of a small effect of R² = .02 for predicting ASI score from 5-10 covariates.

Example 1: Corresponding Sample Size Analysis
Logistic regression for categorical measures

• The sample size of 450 should allow detection of an odds ratio of 1.40 for a single-covariate model, given that the rate of the outcome in the study population is 0.2.
• The detectable effect size would increase within a multivariate model, depending on the correlation of the main predictor with the other covariates. The detectable odds ratio will range from 1.42 to 1.54 when the partial R² of the other covariates is 0.1 to 0.4.

Example 2: An Experimental Study

Under the main scope of the CTN project, an experimental study will be conducted to compare the effectiveness of two treatment protocols (Suboxone vs. Methadone). Subjects will be randomized to one protocol. Outcome measures will include negative urine tests during treatment, ASI score, and satisfaction score at three months after discharge.

Example 2: An Experimental Study

• In an experimental study, we need to think about "statistically significant" versus "clinically relevant" when considering effect size.
• A small effect size may not be clinically meaningful; for example, a reduction in blood pressure of two points.
• If an incorrect conclusion has potential adverse consequences for subjects, a lower alpha level and a higher power should be selected.

Example 3: A Nested Study Design and a Longitudinal Study

• When individual subjects are recruited from clusters (e.g., treatment programs, schools), the correlation among subjects within each cluster may need to be considered in the sample size analysis.
• A longitudinal study repeatedly collects data across time. Each individual will have repeated measures over time, and the repeated measures within each individual are correlated.
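The within-cluster correlation described above is commonly handled with the design effect, 1 + (m - 1)ρ, where m is the average cluster size and ρ is the intra-cluster correlation. The sketch below is an illustration only: the function names are hypothetical, and the ρ values are assumed for the example, not estimated from any study. It borrows the Example 1 layout (450 clients in 15 programs, so m = 30).

```python
import math


def design_effect(m: float, rho: float) -> float:
    """Design effect 1 + (m - 1) * rho for average cluster size m and ICC rho."""
    if m < 1 or not 0.0 <= rho <= 1.0:
        raise ValueError("need m >= 1 and 0 <= rho <= 1")
    return 1.0 + (m - 1.0) * rho


def effective_sample_size(n_total: float, m: float, rho: float) -> float:
    """Number of independent subjects that n_total clustered subjects are worth."""
    return n_total / design_effect(m, rho)


def inflated_sample_size(n_independent: int, m: float, rho: float) -> int:
    """Clustered sample needed to match n_independent independent subjects."""
    return math.ceil(n_independent * design_effect(m, rho))


if __name__ == "__main__":
    n_total, n_clusters = 450, 15      # layout taken from Example 1
    m = n_total / n_clusters           # average cluster size (30)
    for rho in (0.01, 0.05, 0.10):     # assumed ICC values, for illustration only
        deff = design_effect(m, rho)
        n_eff = effective_sample_size(n_total, m, rho)
        print(f"rho={rho:.2f}  design effect={deff:.2f}  effective N={n_eff:.1f}")
```

Reporting the effective N across several plausible ρ values, as the loop does, follows the earlier advice to present sample size requirements as a range instead of a single number.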
Example 3: Approaches and Software for Nested Data Structure

• The appropriate statistical analysis for nested (clustered) data is multi-level (hierarchical) modeling or generalized linear modeling.
• Software for computing sample size for multi-level or generalized linear models:
-- Spybrook, J., Raudenbush, S., Congdon, R., & Martinez, A. (2009) Optimal Design for Longitudinal and Multilevel Research: Documentation for the "Optimal Design" Software. Available at http://www.wtgrantfoundation.org/resources/overview/research_tools/research_tools

Example 3: Approaches to the Nested Data Structure

• The efficiency of the proposed sample will decrease because of the intra-cluster correlation.
• The effective sample size is the proposed sample size multiplied by 1/[1 + (m - 1)ρ]; equivalently, the required sample size must be inflated by the factor 1 + (m - 1)ρ.
• Here, m indicates the average number of subjects in a treatment program and ρ indicates the intra-cluster correlation.

Random Effect Regression for Repeated Measures

The target sample size of 120 per group will allow detection of small-to-medium effects of about d = .32 for a difference in patterns over time between the maintenance group and each of the other two groups, with power = .80 and one-tailed alpha = .05, assuming a moderate correlation of .50 over time and approximately 15% attrition (Hedeker et al., 1999).

Software

• Erdfelder, E., Faul, F., & Buchner, A. (1996). GPOWER: A general power analysis program. Behavior Research Methods, Instruments, & Computers, 28, 1-11. (Free) http://www.psycho.uni-duesseldorf.de/abteilungen/aap/gpower3/
• Elashoff, J. (2008) nQuery Advisor.
• Hedeker, D. RMASS2, Repeated Measures with Attrition: Sample Size for 2 Groups. (Free)
-- Old version available to download at http://tigger.uic.edu/%7Ehedeker/ml.html
-- New on-line version available at http://www.uic.edu/labs/biostat/projects.html
-- Software for computing sample size for general(ized) linear models with repeated measures.
• Spybrook, J., Raudenbush, S., Congdon, R., & Martinez, A. (2009) Optimal Design for Longitudinal and Multilevel Research: Documentation for the "Optimal Design" Software. Available at http://www.wtgrantfoundation.org/resources/overview/research_tools/research_tools
• SamplePower. SPSS ($$). SPSS module for computing power/sample size.
• Proc POWER and Proc GLMPOWER. SAS ($$). SAS procedures for computing power/sample size.
• PS: Power and Sample Size. (Free) Available to download at http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/PowerSampleSize
• Dennis, M. (1997) Power Analysis Worksheet. (Free) Available at http://www.chestnut.org/LI/downloads/index.html [Click on "Power Analysis Worksheet" partway down the web page.] Excel spreadsheet for calculating sample size for simple designs/analyses.

References

• Cohen, J. (1988) Statistical Power Analysis for the Behavioral Sciences. Hillsdale, NJ: Lawrence Erlbaum Assoc. Formulas and tables for computing power and sample size; very comprehensive and somewhat difficult, but a very well-respected reference.
• Fink, A. (2003) How to Sample in Surveys. Thousand Oaks, CA: Sage. Simple step-by-step guide to sampling issues and procedures, with a table for looking up approximate sample size for proportions from 2-category responses.
• Kraemer, H. C. & Thiemann, S. (1987) How Many Subjects? Newbury Park, CA: Sage. Formulas and tables for approximate sample size and some good descriptions of issues.
• Lipsey, M. W. (1990) Design Sensitivity. Newbury Park, CA: Sage. Nonmathematical discussion of many design/sample size issues.
• Rudy, E. & Kerr, M. (1991) Unraveling the mystique of power analysis. Heart & Lung, 20(5), 517-522. Description of power/sample size for researchers.
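
Worked check of the two-group calculation. The steps for sample size determination (choose the statistic, the number of tails, alpha, power, and effect size) can be sketched for a comparison of two means using the standard normal approximation n = 2[(z_alpha + z_power)/d]² per group. This is an approximation under stated assumptions (equal group sizes, equal variances, normal instead of t quantiles), not the method used by G*Power or nQuery; the function names are hypothetical.

```python
import math
from statistics import NormalDist


def _z_sum(alpha: float, power: float, two_tailed: bool) -> float:
    """Sum of the critical z for the test and the z corresponding to the power."""
    z = NormalDist()
    tail_area = alpha / 2 if two_tailed else alpha
    return z.inv_cdf(1 - tail_area) + z.inv_cdf(power)


def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80,
                two_tailed: bool = True) -> int:
    """Approximate subjects per group needed to detect standardized difference d."""
    if d <= 0:
        raise ValueError("d must be positive")
    return math.ceil(2 * (_z_sum(alpha, power, two_tailed) / d) ** 2)


def detectable_d(n: int, alpha: float = 0.05, power: float = 0.80,
                 two_tailed: bool = True) -> float:
    """Approximate smallest standardized difference detectable with n per group."""
    if n <= 0:
        raise ValueError("n must be positive")
    return _z_sum(alpha, power, two_tailed) * math.sqrt(2 / n)


if __name__ == "__main__":
    for d in (0.2, 0.3, 0.5, 0.8):
        print(f"d={d}: about {n_per_group(d)} per group")
    print(f"n=225 per group: detectable d is about {detectable_d(225):.2f}")
```

With d = 0.3 the approximation gives 175 per group, slightly below the 176 quoted in Example 1; a small gap of this kind is expected because t-based calculations account for estimating the standard deviation. With 225 per group it gives a detectable effect of about 0.26, in line with Example 1.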