Sample Design for Group-Randomized Trials
Howard S. Bloom
Chief Social Scientist
MDRC
Prepared for the IES/NCER Summer Research Training Institute held at
Northwestern University on July 9, 2008.
Today we will examine:
- Sample size determinants
- Precision requirements
- Sample allocation
- Covariate adjustments
- Matching and blocking
- Subgroup analyses
- Generalizing findings for sites and blocks
- Using two-level data for three-level situations

Part I:
The Basics
Statistical properties of group-randomized impact estimators

Unbiased estimates:

$$Y_{ij} = \alpha + B_0 T_j + e_j + \varepsilon_{ij}, \qquad E(b_0) = B_0$$

Less precise estimates:

$$\operatorname{Var}(\varepsilon_{ij}) = \sigma^2, \qquad \operatorname{Var}(e_j) = \tau^2, \qquad \rho = \frac{\tau^2}{\tau^2 + \sigma^2}$$

$$GEM = \sqrt{1 + (n-1)\rho} = \frac{SE_C(b_0)}{SE_I(b_0)}$$

where the group effect multiplier (GEM) is the ratio of the standard error of the impact estimator when groups are randomized, $SE_C(b_0)$, to the standard error when the same total number of individuals are randomized, $SE_I(b_0)$.
Design Effect
(for a given total number of individuals)

Individuals per    Intraclass Correlation (ρ)
Group (n)          0.01      0.05      0.10
-----------------------------------------------
10                 1.04      1.20      1.38
50                 1.22      1.86      2.43
500                2.48      5.09      7.13
-----------------------------------------------
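A minimal sketch of the calculation behind this table, using the GEM formula above (the function name is mine; the loop simply reproduces the table):

```python
import math

def design_effect(n: int, rho: float) -> float:
    """Standard-error inflation from randomizing groups rather than
    individuals: GEM = sqrt(1 + (n - 1) * rho)."""
    return math.sqrt(1 + (n - 1) * rho)

# Reproduce the design-effect table above.
for n in (10, 50, 500):
    row = "  ".join(f"{design_effect(n, r):.2f}" for r in (0.01, 0.05, 0.10))
    print(f"n = {n:3d}: {row}")
```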
Sample design parameters
- Number of randomized groups (J)
- Number of individuals per randomized group (n)
- Proportion of groups randomized to program status (P)
Reporting precision
- A minimum detectable effect (MDE) is the smallest true effect that has a “good chance” of being found to be statistically significant.
- We typically define an MDE as the smallest true effect that has 80 percent power for a two-tailed test of statistical significance at the 0.05 level.
- An MDE is reported in natural units, whereas a minimum detectable effect size (MDES) is reported in units of standard deviations.
Minimum Detectable Effect Sizes for a Group-Randomized Design with ρ = 0.05 and no Covariates

Randomized      Individuals per Group (n)
Groups (J)      10        50        500
-------------------------------------------
10              0.77      0.53      0.46
20              0.50      0.35      0.30
40              0.35      0.24      0.21
120             0.20      0.14      0.12
-------------------------------------------
Implications for sample design
- It is extremely important to randomize an adequate number of groups.
- It is often far less important how many individuals per group you have.
Part II:
Determining Required Precision
When assessing how much precision is needed, always ask “relative to what?”
- Program benefits
- Program costs
- Existing outcome differences
- Past program performance
Effect Size Gospel According to Cohen and Lipsey

Cohen (speculative)     Lipsey (empirical)
---------------------------------------------
Small  = 0.20σ          Small  = 0.15σ
Medium = 0.50σ          Medium = 0.45σ
Large  = 0.80σ          Large  = 0.90σ
Five-year impacts of the Tennessee class-size experiment

Treatment: 13-17 versus 22-26 students per class
Effect sizes: 0.11σ to 0.22σ for reading and math

Findings are summarized from Nye, Barbara, Larry V. Hedges and Spyros Konstantopoulos (1999) “The Long-Term Effects of Small Classes: A Five-Year Follow-up of the Tennessee Class Size Experiment,” Educational Evaluation and Policy Analysis, 21(2): 127-142.
Annual reading and math growth

Grade           Reading Growth    Math Growth
Transition      Effect Size       Effect Size
------------------------------------------------
K-1             1.52              1.14
1-2             0.97              1.03
2-3             0.60              0.89
3-4             0.36              0.52
4-5             0.40              0.56
5-6             0.32              0.41
6-7             0.23              0.30
7-8             0.26              0.32
8-9             0.24              0.22
9-10            0.19              0.25
10-11           0.19              0.14
11-12           0.06              0.01
------------------------------------------------
Based on work in progress using documentation on the national norming samples for the CAT5, SAT9, Terra Nova CTBS, Gates-MacGinitie (for reading only), MAT8, Terra Nova CAT, and SAT10. 95% confidence intervals range in reading from ±0.03 to ±0.15 and in math from ±0.03 to ±0.22.
Performance gap between “average” (50th percentile) and “weak” (10th percentile) schools

Subject and grade    District I    District II    District III    District IV
------------------------------------------------------------------------------
Reading
  Grade 3            0.31          0.18           0.16            0.43
  Grade 5            0.41          0.18           0.35            0.31
  Grade 7            0.25          0.11           0.30            NA
  Grade 10           0.07          0.11           NA              NA
Math
  Grade 3            0.29          0.25           0.19            0.41
  Grade 5            0.27          0.23           0.36            0.26
  Grade 7            0.20          0.15           0.23            NA
  Grade 10           0.14          0.17           NA              NA
------------------------------------------------------------------------------
Source: District I outcomes are based on ITBS scaled scores, District II on SAT 9 scaled scores, District III on MAT NCE scores, and District IV on SAT 8 NCE scores.
Demographic performance gap in reading and math: Main NAEP scores

Subject and     Black-     Hispanic-    Male-      Eligible-Ineligible for
grade           White      White        Female     free/reduced-price lunch
-----------------------------------------------------------------------------
Reading
  Grade 4       -0.83      -0.77        -0.18      -0.74
  Grade 8       -0.80      -0.76        -0.28      -0.66
  Grade 12      -0.67      -0.53        -0.44      -0.45
Math
  Grade 4       -0.99      -0.85         0.08      -0.85
  Grade 8       -1.04      -0.82         0.04      -0.80
  Grade 12      -0.94      -0.68         0.09      -0.72
-----------------------------------------------------------------------------
Source: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2002 Reading Assessment and 2000 Mathematics Assessment.
ES Results from Randomized Studies

Achievement Measure            n      Mean
----------------------------------------------
Elementary School              389    0.33
Standardized test (Broad)      21     0.07
Standardized test (Narrow)     181    0.23
Specialized Topic/Test         180    0.44
Middle Schools                 36     0.51
High Schools                   43     0.27
----------------------------------------------
Part III:
The ABCs of Sample Allocation
Sample allocation alternatives

Balanced allocation:
- maximizes precision for a given sample size;
- maximizes robustness to distributional assumptions.

Unbalanced allocation:
- precision erodes slowly with imbalance for a given sample size;
- imbalance can facilitate a larger sample;
- imbalance can facilitate randomization.
Variance relationships for the program and control groups
- Equal variances: when the program does not affect the outcome variance.
- Unequal variances: when the program does affect the outcome variance.
MDES for equal variances without covariates

$$MDES = M_{J-2}\sqrt{\frac{1}{P(1-P)J}\left(\rho + \frac{1-\rho}{n}\right)}$$
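A hedged sketch of this formula in Python. The function names are mine, and the degrees-of-freedom multiplier is taken to be $M_{J-2} = t_{1-\alpha/2} + t_{\text{power}}$ on J - 2 degrees of freedom, which is consistent with the footnote later in the talk (M ≈ 2.8 for a two-tail test with 20 or more degrees of freedom). It reproduces the MDES table from Part I:

```python
import math
from scipy.stats import t

def multiplier(df: int, power: float = 0.80, alpha: float = 0.05,
               two_tailed: bool = True) -> float:
    """Degrees-of-freedom multiplier M_df = t_alpha + t_power."""
    a = alpha / 2 if two_tailed else alpha
    return t.ppf(1 - a, df) + t.ppf(power, df)

def mdes(J: int, n: int, rho: float, P: float = 0.5,
         two_tailed: bool = True) -> float:
    """MDES for a group-randomized design: equal variances, no covariates."""
    M = multiplier(J - 2, two_tailed=two_tailed)
    return M * math.sqrt((rho + (1 - rho) / n) / (P * (1 - P) * J))

print(round(mdes(J=10, n=10, rho=0.05), 2))   # 0.77, as in the Part I table
print(round(mdes(J=40, n=50, rho=0.05), 2))   # 0.24
```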
How allocation affects MDES

The MDES scales with $\sqrt{1/(P(1-P))}$, which grows as the allocation departs from balance:

$$\sqrt{\frac{1}{.5(1-.5)}} = 2.00 \qquad \sqrt{\frac{1}{.6(1-.6)}} = 2.04 \qquad \sqrt{\frac{1}{.7(1-.7)}} = 2.18$$

$$\sqrt{\frac{1}{.8(1-.8)}} = 2.50 \qquad \sqrt{\frac{1}{.9(1-.9)}} = 3.33$$
Minimum Detectable Effect Size for Sample Allocations Given Equal Variances

Allocation    Example*    Ratio to Balanced Allocation
--------------------------------------------------------
0.5/0.5       0.54σ       1.00
0.6/0.4       0.55σ       1.02
0.7/0.3       0.59σ       1.09
0.8/0.2       0.68σ       1.25
0.9/0.1       0.91σ       1.67
--------------------------------------------------------
* Example is for n = 20, J = 10, ρ = 0.05, a one-tail hypothesis test, and no covariates.
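Reusing the mdes() sketch above with this slide's parameters (n = 20, J = 10, ρ = 0.05, one-tail test, no covariates) reproduces both columns of the table:

```python
# Continuation of the earlier sketch: mdes() is defined above.
base = mdes(J=10, n=20, rho=0.05, P=0.5, two_tailed=False)
for p in (0.5, 0.6, 0.7, 0.8, 0.9):
    m = mdes(J=10, n=20, rho=0.05, P=p, two_tailed=False)
    print(f"{p:.1f}/{1 - p:.1f}: MDES = {m:.2f} sd, ratio = {m / base:.2f}")
```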
Implications of unbalanced allocations with unequal variances

The true standard error of the impact estimator is

$$SE(b_0)_U = \sqrt{\frac{\sigma_P^2}{J_P} + \frac{\sigma_C^2}{J_C}}$$

while the conventional pooled-variance estimator has an expected value of approximately

$$E[se(b_0)] \approx \sqrt{\frac{\sigma_P^2}{J_C} + \frac{\sigma_C^2}{J_P}}$$

where σP² and σC² are the program- and control-group variances and JP and JC are the numbers of groups randomized to each condition. Note that the group counts are swapped in the second expression.
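A small simulation makes the direction of the bias concrete. The group counts, variances, and replication count below are illustrative assumptions, not values from the talk; the estimator is the conventional pooled equal-variance formula:

```python
import numpy as np

rng = np.random.default_rng(0)
JP, JC = 18, 2          # unbalanced: many program groups, few control groups
sdP, sdC = 1.0, 2.0     # the larger (program) sample has the smaller variance

true_se = np.sqrt(sdP**2 / JP + sdC**2 / JC)

estimates = []
for _ in range(20_000):
    yP = rng.normal(0.0, sdP, JP)      # group-level mean outcomes
    yC = rng.normal(0.0, sdC, JC)
    s2 = ((JP - 1) * yP.var(ddof=1) + (JC - 1) * yC.var(ddof=1)) / (JP + JC - 2)
    estimates.append(np.sqrt(s2 * (1 / JP + 1 / JC)))

# With the larger sample holding the smaller variance, the pooled
# estimate is biased downward, as the expectation formula predicts.
print(f"true SE = {true_se:.3f}, mean estimated SE = {np.mean(estimates):.3f}")
```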
Implications Continued

The estimated standard error is:
- unbiased when the allocation is balanced or when the variances are equal;
- biased upward when the larger sample has the larger variance;
- biased downward when the larger sample has the smaller variance.
Interim Conclusions
- Don’t use the equal-variance assumption for an unbalanced allocation with many degrees of freedom.
- Use a balanced allocation when there are few degrees of freedom.
References

Gail, Mitchell H., Steven D. Mark, Raymond J. Carroll, Sylvan B. Green and David Pee (1996) “On Design Considerations and Randomization-Based Inferences for Community Intervention Trials,” Statistics in Medicine, 15: 1069-1092.

Bryk, Anthony S. and Stephen W. Raudenbush (1988) “Heterogeneity of Variance in Experimental Studies: A Challenge to Conventional Interpretations,” Psychological Bulletin, 104(3): 396-404.
Part IV:
Using Covariates to Reduce Sample Size
Basic ideas
- Goal: reduce the number of clusters randomized.
- Approach: reduce the standard error of the impact estimator by controlling for baseline covariates.
- Alternative covariates:
  - individual-level or cluster-level;
  - pretests or other characteristics.
Impact Estimation with a Covariate

$$y_{ij} = \alpha + \beta_0 T_j + \beta_1 x_{ij} + e_j + \varepsilon_{ij}$$

$$y_{ij} = \alpha + \beta_0 T_j + \beta_1 X_j + e_j + \varepsilon_{ij}$$

where:
y_ij = the outcome for student i from school j
T_j = 1 for treatment schools and 0 for control schools
X_j = a covariate for school j
x_ij = a covariate for student i from school j
e_j = a random error term for school j
ε_ij = a random error term for student i from school j
Minimum Detectable Effect Size with a Covariate

$$MDES = M_{J-K}\sqrt{\frac{\rho(1-R_2^2)}{P(1-P)J} + \frac{(1-\rho)(1-R_1^2)}{P(1-P)nJ}}$$

where:
MDES = minimum detectable effect size
M_{J-K} = a degrees-of-freedom multiplier¹
J = the total number of schools randomized
n = the number of students in a grade per school
P = the proportion of schools randomized to treatment
ρ = the unconditional intraclass correlation (without a covariate)
R₁² = the proportion of variance across individuals within schools (at level 1) predicted by the covariate
R₂² = the proportion of variance across schools (at level 2) predicted by the covariate

¹ For 20 or more degrees of freedom, M_{J-K} equals 2.8 for a two-tail test and 2.5 for a one-tail test with statistical power of 0.80 and statistical significance of 0.05.
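A sketch extending the earlier mdes() function to include covariates. The treatment of K is an assumption on my part (K = 3 here, for the intercept, the treatment indicator, and one school-level covariate), and n = 60 students per school is likewise hypothetical; a school-level covariate leaves R₁² at zero:

```python
import math
from scipy.stats import t

def mdes_cov(J: int, n: int, rho: float, R1sq: float = 0.0,
             R2sq: float = 0.0, P: float = 0.5, K: int = 3) -> float:
    """MDES with a covariate (two-tail test, power 0.80)."""
    M = t.ppf(0.975, J - K) + t.ppf(0.80, J - K)
    return M * math.sqrt(rho * (1 - R2sq) / (P * (1 - P) * J)
                         + (1 - rho) * (1 - R1sq) / (P * (1 - P) * n * J))

# Illustration with the Grade 8 District C estimates shown below
# (rho = 0.23, school-level pretest R2^2 = 0.91) and 40 schools;
# the result is about 0.17, in the neighborhood of the MDES table below.
print(round(mdes_cov(J=40, n=60, rho=0.23, R2sq=0.91), 2))
```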
Questions Addressed Empirically about the Predictive Power of Covariates
- School-level vs. student-level pretests
- Earlier vs. later follow-up years
- Reading vs. math
- Elementary vs. middle vs. high school
- All schools vs. low-income schools vs. low-performing schools
Empirical Analysis
- Estimate ρ, R₂² and R₁² from data on thousands of students from hundreds of schools, during multiple years, at five urban school districts.
- Summarize these estimates for reading and math in grades 3, 5, 8 and 10.
- Compute implications for minimum detectable effect sizes.
Estimated Parameters for Reading with a School-Level Pretest Lagged One Year

                        School District
              A        B        C        D        E
-------------------------------------------------------
Grade 3
  ρ           0.20     0.15     0.19     0.22     0.16
  R₂²         0.31     0.77     0.74     0.51     0.75
Grade 5
  ρ           0.25     0.15     0.20     NA       0.12
  R₂²         0.33     0.50     0.81     NA       0.70
Grade 8
  ρ           0.18     NA       0.23     NA       NA
  R₂²         0.77     NA       0.91     NA       NA
Grade 10
  ρ           0.15     NA       0.29     NA       NA
  R₂²         0.93     NA       0.95     NA       NA
-------------------------------------------------------
Minimum Detectable Effect Sizes for Reading with a School-Level Pretest (Y-1) or a Student-Level Pretest (y-1) Lagged One Year

                      Grade 3    Grade 5    Grade 8    Grade 10
-----------------------------------------------------------------
20 schools randomized
  No covariate        0.57       0.56       0.61       0.62
  Y-1                 0.37       0.38       0.24       0.16
  y-1                 0.38       0.40       0.28       0.15
40 schools randomized
  No covariate        0.39       0.38       0.42       0.42
  Y-1                 0.26       0.26       0.17       0.11
  y-1                 0.26       0.27       0.19       0.10
60 schools randomized
  No covariate        0.32       0.31       0.34       0.34
  Y-1                 0.21       0.21       0.13       0.09
  y-1                 0.21       0.22       0.15       0.08
-----------------------------------------------------------------
Key Findings
- Using a pretest improves precision dramatically.
- This improvement increases appreciably from elementary school to middle school to high school, because R₂² increases.
- School-level pretests produce as much precision as do student-level pretests.
- The effect of a pretest declines somewhat as the time between it and the post-test increases.
- Adding a second pretest increases precision slightly.
- Using a pretest for a different subject increases precision substantially.
- Narrowing the sample to schools that are similar to each other does not improve precision beyond that achieved by a pretest.
Source: Bloom, Howard S., Lashawn Richburg-Hayes and Alison Rebeck Black (2007) “Using Covariates to Improve Precision for Studies that Randomize Schools to Evaluate Educational Interventions,” Educational Evaluation and Policy Analysis, 29(1): 30-59.
Part V:
The Putative Power of Pairing

A Tail of Two Tradeoffs
(“It was the best of techniques. It was the worst of techniques.” Who the dickens said that?)
Pairing

Why match pairs?
- for face validity
- for precision

How to match pairs? (a sketch follows below)
- rank-order clusters by a baseline covariate;
- pair adjacent clusters in the rank-ordered list;
- randomize clusters to treatment within each pair.
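A minimal sketch of this recipe; the function name and the example schools and baseline scores are hypothetical:

```python
import random

def pair_and_randomize(clusters, covariate, seed=12345):
    """Rank clusters on a baseline covariate, pair adjacent clusters in
    the ranked list, and randomize one member of each pair to treatment.
    Assumes an even number of clusters."""
    rng = random.Random(seed)
    ranked = sorted(clusters, key=covariate)
    assignment = {}
    for a, b in zip(ranked[::2], ranked[1::2]):
        treated = rng.choice([a, b])
        assignment[a] = 1 if treated == a else 0
        assignment[b] = 1 - assignment[a]
    return assignment

baseline = {"S1": 62.0, "S2": 55.5, "S3": 71.3,
            "S4": 58.9, "S5": 66.1, "S6": 54.2}
print(pair_and_randomize(list(baseline), covariate=baseline.get))
```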
When to pair?
- When the gain in predictive power outweighs the loss of degrees of freedom.
- Degrees of freedom: J - 2 without pairing; J/2 - 1 with pairing.
Deriving the Minimum Required Predictive Power of Pairing

Without pairing:

$$MDE(b_0)_{GR} = M_{J-2}\, SE(b_0)_{GR}$$

With pairing:

$$MDE(b_0)_{GR} = M_{J/2-1}\sqrt{1-R^2}\, SE(b_0)_{GR}$$

Breakeven $R^2$:

$$R^2_{min} = 1 - \frac{M^2_{J-2}}{M^2_{J/2-1}}$$
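The breakeven R² is straightforward to tabulate with the same t-based multiplier used in the earlier sketches; this reproduces the table on the next slide:

```python
from scipy.stats import t

def multiplier(df: int, power: float = 0.80, alpha: float = 0.05) -> float:
    """Two-tail degrees-of-freedom multiplier M_df."""
    return t.ppf(1 - alpha / 2, df) + t.ppf(power, df)

def breakeven_r2(J: int) -> float:
    """Minimum pairing R^2 that offsets dropping from J - 2 to
    J/2 - 1 degrees of freedom."""
    return 1 - (multiplier(J - 2) / multiplier(J // 2 - 1)) ** 2

for J in (6, 8, 10, 20, 30):
    print(J, round(breakeven_r2(J), 2))   # 0.52, 0.35, 0.26, 0.11, 0.07
```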
The Minimum Required Predictive Power of Pairing

Randomized      Required Predictive
Clusters (J)    Power (R²min)*
---------------------------------------
6               0.52
8               0.35
10              0.26
20              0.11
30              0.07
---------------------------------------
* For a two-tail test.
A few key points about blocking
- Blocking for face validity vs. blocking for precision
- Treating blocks as fixed effects vs. random effects
- Defining blocks using baseline information
Part VI:
Subgroup Analyses: Learning from Diversity
Purposes
- To assess generalizability through description (by exploring how impacts vary)
- To enhance generalizability through explanation (by exploring what predicts impact variation)
Considerations
- Research protocol: maximize ex ante specification through theory and thought to minimize ex post data mining.
- Assessment criteria: internal validity and precision.
Defining Features
- Program characteristics
- Randomized group characteristics
- Individual characteristics
Defining Subgroups by the Characteristics of Programs
- Base subgroups only on program features that were randomized.
- Thus one cannot use implementation quality.
Defining Subgroups by Characteristics of Randomized Groups
- Types of impacts: net impacts and differential impacts.
- Internal validity: only use pre-existing characteristics.
- Precision:
  - net impact estimates are limited by the reduced number of randomized groups;
  - differential impact estimates are triply limited (and often need four times as many randomized groups).
Defining Subgroups by Characteristics of Individuals
- Types of impacts: net impacts and differential impacts.
- Internal validity:
  - only use pre-existing characteristics;
  - only use subgroups with sample members from all randomized groups.
- Precision:
  - for net impacts: can be almost as good as for the full sample;
  - for differential impacts: can be even better than for the full sample.
Differential Impacts by Gender

            Program Group    Control Group
Boys        YPB              YCB
Girls       YPG              YCG
Gap         YPB - YPG        YCB - YCG

where YPB is the mean outcome for boys in the program group, and so on. The differential impact by gender is the difference between the two gaps: (YPB - YPG) - (YCB - YCG).
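In regression form, this differential impact is the coefficient on a treatment-by-gender interaction in the two-level model. A hedged sketch on simulated data using statsmodels; every variable name and parameter value here is an illustrative assumption:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
J, n = 40, 50
school = np.repeat(np.arange(J), n)
T = np.repeat((np.arange(J) < J // 2).astype(int), n)  # 20 treatment schools
female = rng.integers(0, 2, J * n)
y = (0.2 * T + 0.1 * T * female                        # impact larger for girls
     + rng.normal(0, 0.5, J)[school]                   # school random effects
     + rng.normal(0, 1.0, J * n))
df = pd.DataFrame({"y": y, "T": T, "female": female, "school": school})

# Random school intercepts; the T:female coefficient is the
# girls-minus-boys difference in impacts, i.e. the negative of
# (YPB - YPG) - (YCB - YCG) in the slide's notation.
fit = smf.mixedlm("y ~ T * female", data=df, groups="school").fit()
print(fit.params[["T", "T:female"]])
```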
Part VII:
Generalizing Results from Multiple Sites and Blocks
Fixed vs. Random Effects Inference
- Known vs. unknown populations
- Broader vs. narrower inferences
- Weaker vs. stronger precision
- Few vs. many sites or blocks
Weighting Sites and Blocks
- Implicitly, through a pooled regression
- Explicitly, based on the number of schools or the number of students
- Explicitly, based on precision, with fixed effects or random effects (a sketch follows below)
- Bottom line: the question addressed is what counts
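A minimal sketch of the explicit precision-weighting option (fixed-effects, inverse-variance weights); the site impacts and standard errors are made up for illustration:

```python
import numpy as np

impacts = np.array([0.25, 0.10, 0.40])   # hypothetical site impact estimates
ses = np.array([0.08, 0.05, 0.20])       # hypothetical standard errors

w = 1 / ses**2                           # fixed-effects precision weights
pooled = np.sum(w * impacts) / np.sum(w)
pooled_se = np.sqrt(1 / np.sum(w))
print(f"pooled impact = {pooled:.3f} (SE = {pooled_se:.3f})")
# Weighting by number of schools or students, or adding a between-site
# variance component (random effects), answers a different question.
```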
Part VIII:
Using Two-Level Data for Three-Level Situations
The Issue
- General question: what happens when you design a study with randomized groups that comprise three levels, based on data that do not account explicitly for the middle level?
- Specific example: what happens when you design a study that randomizes schools (with students clustered in classrooms within schools) based on data for students clustered in schools?
3-Level vs. 2-Level Variance Components

                                        3-Level Model                          2-Level Model
Outcome                                 School    Class     Student   Total    School    Student   Total
----------------------------------------------------------------------------------------------------------
Expressive vocab (spring)               19.84     32.45     306.18    358.48   38.15     321.11    359.26
Stanford 9 Total Math Scaled Score      115.14    36.40     1273.15   1424.69  131.39    1293.24   1424.63
Stanford 9 Total Reading Scaled Score   108.75    158.95    1581.86   1849.56  181.77    1666.48   1848.25
----------------------------------------------------------------------------------------------------------
Sources: The Chicago Literacy Initiative: Making Better Early Readers study (CLIMBERs) database and the School Breakfast Pilot Project (SBPP) database.
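Using only the numbers in the table above, the implied intraclass correlations show how the 2-level school component absorbs part of the classroom component:

```python
# Variance components from the table: (school3, class3, student3, school2, student2).
rows = {
    "Expressive vocab (spring)": (19.84, 32.45, 306.18, 38.15, 321.11),
    "Stanford 9 math":           (115.14, 36.40, 1273.15, 131.39, 1293.24),
    "Stanford 9 reading":        (108.75, 158.95, 1581.86, 181.77, 1666.48),
}
for name, (s3, c3, i3, s2, i2) in rows.items():
    tot3, tot2 = s3 + c3 + i3, s2 + i2
    print(f"{name}: 3-level school ICC = {s3 / tot3:.3f}, "
          f"class ICC = {c3 / tot3:.3f}; 2-level school ICC = {s2 / tot2:.3f}")
```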
3-Level vs. 2-Level MDES for the Original Sample

                                        3-Level Model                  2-Level Model
Outcome                                 Unconditional   Conditional    Unconditional   Conditional
------------------------------------------------------------------------------------------------------
Expressive vocab (spring)               0.482           0.386          0.495           0.311
Stanford 9 Total Math Scaled Score      0.259           0.184          0.259           0.184
Stanford 9 Total Reading Scaled Score   0.261           0.148          0.264           0.150
------------------------------------------------------------------------------------------------------
Sources: The Chicago Literacy Initiative: Making Better Early Readers study (CLIMBERs) database and the School Breakfast Pilot Project (SBPP) database.
Further References

Bloom, Howard S. (2005) “Randomizing Groups to Evaluate Place-Based Programs,” in Howard S. Bloom, editor, Learning More From Social Experiments: Evolving Analytic Approaches (New York: Russell Sage Foundation).

Bloom, Howard S., Lashawn Richburg-Hayes and Alison Rebeck Black (2005) “Using Covariates to Improve Precision: Empirical Guidance for Studies that Randomize Schools to Measure the Impacts of Educational Interventions” (New York: MDRC).

Donner, Allan and Neil Klar (2000) Cluster Randomization Trials in Health Research (London: Arnold).

Hedges, Larry V. and Eric C. Hedberg (2006) “Intraclass Correlation Values for Planning Group Randomized Trials in Education” (Chicago: Northwestern University).

Murray, David M. (1998) Design and Analysis of Group-Randomized Trials (New York: Oxford University Press).

Raudenbush, Stephen W., Andres Martinez and Jessaca Spybrook (2005) “Strategies for Improving Precision in Group-Randomized Experiments” (University of Chicago).

Raudenbush, Stephen W. (1997) “Statistical Analysis and Optimal Design for Cluster Randomized Trials,” Psychological Methods, 2(2): 173-185.