Howard S. Bloom
Chief Social Scientist
MDRC
Prepared for the IES/NCER Summer Research Training Institute held at
Northwestern University on July 27, 2010.
Sample size determinants
Precision requirements
Sample allocation
Covariate adjustments
Matching and blocking
Subgroup analyses
Generalizing findings for sites and blocks
Using two-level data for three-level situations
The Basics
Statistical properties of group-randomized impact estimators
Unbiased estimates
Y_ij = α + B_0·T_j + e_j + e_ij

E(b_0) = B_0
Less precise estimates
VAR(e_ij) = σ²
VAR(e_j) = τ²
ρ = τ²/(τ² + σ²)
The group effect multiplier (GEM) is the ratio of the standard error of the impact estimator with group randomization, SE_C(b_0), to that with individual randomization, SE_I(b_0), for a given total number of individuals:

GEM = SE_C(b_0)/SE_I(b_0) = √(1 + (n − 1)ρ)

______________________________________
Intraclass         Individuals per Group (n)
Correlation (ρ)    10       50       500
0.01               1.04     1.22     2.48
0.05               1.20     1.86     5.09
0.10               1.38     2.43     7.13
______________________________________
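As a quick check on the table, here is a minimal Python sketch, assuming the standard design-effect formula GEM = √(1 + (n − 1)ρ):

```python
import math

# Group effect multiplier: ratio of the standard error under group
# randomization to the standard error under individual randomization,
# holding the total number of individuals fixed.
def gem(n, rho):
    """n: individuals per group; rho: intraclass correlation."""
    return math.sqrt(1 + (n - 1) * rho)

# Reproduce two cells of the table (rho = 0.05):
print(f"{gem(10, 0.05):.2f}")   # 1.20
print(f"{gem(50, 0.05):.2f}")   # 1.86
```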
Number of randomized groups (J)
Number of individuals per randomized group (n)
Proportion of groups randomized to program status (P)
A minimum detectable effect (MDE) is the smallest true effect that has a “good chance” of being found to be statistically significant.
We typically define an MDE as the smallest true effect that has 80 percent power for a two-tailed test of statistical significance at the 0.05 level.
An MDE is reported in natural units, whereas a minimum detectable effect size (MDES) is reported in units of standard deviations.
Minimum Detectable Effect Sizes for a Group-Randomized Design with ρ = 0.05 and no Covariates

___________________________________
Randomized     Individuals per Group (n)
Groups (J)     10       50       500
10             0.77     0.53     0.46
20             0.50     0.35     0.30
40             0.35     0.24     0.21
120            0.20     0.14     0.12
___________________________________
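These values follow from the standard MDES formula for a balanced group-randomized design; a minimal Python sketch, assuming a degrees-of-freedom multiplier M ≈ 2.8 (two-tailed test, power 0.80, significance 0.05, 20 or more degrees of freedom — the small-J rows of the table use a somewhat larger multiplier):

```python
import math

# Minimum detectable effect size for a group-randomized design with no
# covariates: MDES = M * sqrt(rho/(P(1-P)J) + (1-rho)/(P(1-P)nJ)).
def mdes(J, n, rho, P=0.5, M=2.8):
    """J: randomized groups; n: individuals per group; rho: intraclass
    correlation; P: share of groups treated; M: df multiplier."""
    return M * math.sqrt(rho / (P * (1 - P) * J)
                         + (1 - rho) / (P * (1 - P) * n * J))

# e.g. J = 120, n = 500, rho = 0.05 gives roughly the 0.12 in the table
print(f"{mdes(120, 500, 0.05):.2f}")  # 0.12
```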
It is extremely important to randomize an adequate number of groups.
It is often far less important how many individuals per group you have.
Determining required precision

_______________________________________________
           (speculative)    (empirical)
Small      0.20σ            0.15σ
Medium     0.50σ            0.45σ
Large      0.80σ            0.90σ
_______________________________________________
Treatment: 13–17 versus 22–26 students per class
Effect sizes: 0.11σ to 0.22σ for reading and math

Findings are summarized from Nye, Barbara, Larry V. Hedges and Spyros Konstantopoulos (1999) "The Long-Term Effects of Small Classes: A Five-Year Follow-up of the Tennessee Class Size Experiment," Educational Evaluation and Policy Analysis, Vol. 21, No. 2: 127–142.
----------------------------------------------------------------
Grade          Reading Growth     Math Growth
Transition     Effect Size        Effect Size
----------------------------------------------------------------
K – 1          1.52               1.14
1 – 2          0.97               1.03
2 – 3          0.60               0.89
3 – 4          0.36               0.52
4 – 5          0.40               0.56
5 – 6          0.32               0.41
6 – 7          0.23               0.30
7 – 8          0.26               0.32
8 – 9          0.24               0.22
9 – 10         0.19               0.25
10 – 11        0.19               0.14
11 – 12        0.06               0.01
----------------------------------------------------------------
Based on work in progress using documentation on the national norming samples for the CAT5,
SAT9, Terra Nova CTBS, Gates MacGinitie (for reading only), MAT8, Terra Nova CAT, and SAT10.
95% confidence intervals range from ±0.03 to ±0.15 in reading and from ±0.03 to ±0.22 in math.
Performance gap between "average" (50th percentile) and "weak" (10th percentile) schools

_________________________________________________________________
Subject and grade   District I   District II   District III   District IV
Reading
 Grade 3            0.31         0.18          0.16           0.43
 Grade 5            0.41         0.18          0.35           0.31
 Grade 7            0.25         0.11          0.30           NA
 Grade 10           0.07         0.11          NA             NA
Math
 Grade 3            0.29         0.25          0.19           0.41
 Grade 5            0.27         0.23          0.36           0.26
 Grade 7            0.20         0.15          0.23           NA
 Grade 10           0.14         0.17          NA             NA
_________________________________________________________________
Source: District I outcomes are based on ITBS scaled scores, District II on SAT 9 scaled scores, District
III on MAT NCE scores, and District IV on SAT 8 NCE scores.
Demographic performance gap in reading and math: Main NAEP scores

_____________________________________________________________________________
Subject and   Black-    Hispanic-   Male-     Eligible-Ineligible for
grade         White     White       Female    free/reduced-price lunch
Reading
 Grade 4      -0.83     -0.77       -0.18     -0.74
 Grade 8      -0.80     -0.76       -0.28     -0.66
 Grade 12     -0.67     -0.53       -0.44     -0.45
Math
 Grade 4      -0.99     -0.85       0.08      -0.85
 Grade 8      -1.04     -0.82       0.04      -0.80
 Grade 12     -0.94     -0.68       0.09      -0.72
_____________________________________________________________________________
Source: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics,
National Assessment of Educational Progress (NAEP), 2002 Reading Assessment and 2000 Mathematics Assessment.
ES Results from Randomized Studies

_______________________________________________
Achievement Measure             n      Mean
Elementary School               389    0.33
 Standardized test (Broad)      21     0.07
 Standardized test (Narrow)     181    0.23
 Specialized Topic/Test         180    0.44
Middle Schools                  36     0.51
High Schools                    43     0.27
_______________________________________________
The ABCs of Sample Allocation
Balanced allocation
maximizes precision for a given sample size;
maximizes robustness to distributional assumptions.
Unbalanced allocation
precision erodes slowly with imbalance for a given sample size
imbalance can facilitate a larger sample
imbalance can facilitate randomization
Equal variances: when the program does not affect the outcome variance.
Unequal variances: when the program does affect the outcome variance.
MDES for equal variances without covariates
MDES = M_{J−2} · √(1/(P(1−P))) · √(ρ/J + (1−ρ)/(nJ))

The allocation factor 1/√(P(1−P)) is smallest for a balanced design:

_____________________________
P      1/√(P(1−P))
0.5    2.00
0.6    2.04
0.7    2.18
0.8    2.50
0.9    3.33
_____________________________
Minimum Detectable Effect Size for Sample Allocations Given Equal Variances

________________________________________
Allocation   Example*   Ratio to Balanced
                        Allocation
0.5/0.5      0.54σ      1.00
0.6/0.4      0.55σ      1.02
0.7/0.3      0.59σ      1.09
0.8/0.2      0.68σ      1.25
0.9/0.1      0.91σ      1.67
________________________________________
* Example is for n = 20, J = 10, ρ = 0.05, a one-tail hypothesis test, and no covariates.
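The "Ratio to Balanced Allocation" column depends on P alone, since only the 1/√(P(1−P)) term of the MDES formula changes with the allocation. A small sketch:

```python
import math

# MDES penalty of an unbalanced allocation relative to P = 0.5:
# sqrt(0.25 / (P * (1 - P))).
def allocation_penalty(P):
    return math.sqrt(0.25 / (P * (1 - P)))

for P in (0.5, 0.6, 0.7, 0.8, 0.9):
    print(f"{P:.1f}/{1 - P:.1f}: {allocation_penalty(P):.2f}")
```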
Implications of unbalanced allocations with unequal variances

True standard error of the impact estimator:

SE(b_0) = √( σ_P²/J_P + σ_C²/J_C )

Expected value of the conventional estimated standard error, which pools the program- and control-group variances:

E(se(b_0)) = √( σ_P²/J_C + σ_C²/J_P )

where σ_P² and σ_C² are the group-level outcome variances for the program and control groups, and J_P and J_C are the numbers of groups randomized to each condition. Pooling effectively swaps the denominators, which produces the biases summarized next.
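The direction of the bias can be checked numerically; a sketch assuming hypothetical group-level variances and group counts:

```python
import math

# True standard error when group-level variances differ between arms.
def true_se(var_p, var_c, j_p, j_c):
    return math.sqrt(var_p / j_p + var_c / j_c)

# Expected value of the pooled-variance estimated SE: pooling
# effectively swaps the denominators.
def expected_estimated_se(var_p, var_c, j_p, j_c):
    return math.sqrt(var_p / j_c + var_c / j_p)

# Larger arm (30 groups) with the larger variance -> estimate biased upward:
print(f"{true_se(4.0, 1.0, 30, 10):.3f}")                # true SE
print(f"{expected_estimated_se(4.0, 1.0, 30, 10):.3f}")  # larger than true
```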
Implications Continued
The estimated standard error is unbiased
When the allocation is balanced
When the variances are equal
The estimated standard error is biased upward
When the larger sample has the larger variance
The estimated standard error is biased downward
When the larger sample has the smaller variance
Don’t use the equal variance assumption for an unbalanced allocation with many degrees of freedom .
Use a balanced allocation when there are few degrees of freedom .
References

Gail, Mitchell H., Steven D. Mark, Raymond J. Carroll, Sylvan B. Green and David Pee (1996) "On Design Considerations and Randomization-Based Inferences for Community Intervention Trials," Statistics in Medicine, 15: 1069–1092.

Bryk, Anthony S. and Stephen W. Raudenbush (1988) "Heterogeneity of Variance in Experimental Studies: A Challenge to Conventional Interpretations," Psychological Bulletin, 104(3): 396–404.
Using Covariates to Reduce
Sample Size
Goal: Reduce the number of clusters randomized
Approach: Reduce the standard error of the impact estimator by controlling for baseline covariates
Alternative Covariates
Individual-level
Cluster-level
Pretests
Other characteristics
Impact Estimation with a Covariate

With a student-level covariate:

y_ij = α + B_0·T_j + B_1·x_ij + e_j + e_ij

With a school-level covariate:

y_ij = α + B_0·T_j + B_1·x_j + e_j + e_ij

where
y_ij = the outcome for student i from school j
T_j = 1 for treatment schools and 0 for control schools
x_j = a covariate for school j
x_ij = a covariate for student i from school j
e_j = a random error term for school j
e_ij = a random error term for student i from school j
Minimum Detectable Effect Size with a Covariate

MDES = M_{J−K} · √( ρ(1−R₂²)/(P(1−P)J) + (1−ρ)(1−R₁²)/(P(1−P)nJ) )

MDES = minimum detectable effect size
M_{J−K} = a degrees-of-freedom multiplier¹
J = the total number of schools randomized
n = the number of students in a grade per school
P = the proportion of schools randomized to treatment
ρ = the unconditional intraclass correlation (without a covariate)
R₁² = the proportion of variance across individuals within schools (at level 1) predicted by the covariate
R₂² = the proportion of variance across schools (at level 2) predicted by the covariate

¹ For 20 or more degrees of freedom, M_{J−K} equals 2.8 for a two-tail test and 2.5 for a one-tail test, with statistical power of 0.80 and statistical significance of 0.05.
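A direct transcription of this formula into Python (the parameter values in the example are illustrative, not taken from the district data below):

```python
import math

# MDES with a covariate. R2_2 / R2_1: shares of school- and student-level
# variance predicted by the covariate; M: df multiplier (about 2.8 for a
# two-tailed test with 20 or more degrees of freedom).
def mdes_cov(J, n, rho, R2_2=0.0, R2_1=0.0, P=0.5, M=2.8):
    return M * math.sqrt(
        rho * (1 - R2_2) / (P * (1 - P) * J)
        + (1 - rho) * (1 - R2_1) / (P * (1 - P) * n * J))

# A school-level pretest with R2_2 = 0.75 sharply cuts the detectable effect:
print(f"{mdes_cov(40, 60, 0.15):.2f} -> {mdes_cov(40, 60, 0.15, R2_2=0.75):.2f}")
# prints "0.36 -> 0.20"
```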
Questions Addressed Empirically about the
Predictive Power of Covariates
School-level vs. student-level pretests
Earlier vs. later follow-up years
Reading vs. math
Elementary vs. middle vs. high school
All schools vs. low-income schools vs. low-performing schools
Empirical Analysis
Estimate ρ, R₂², and R₁² from data on thousands of students from hundreds of schools, during multiple years, at five urban school districts
Summarize these estimates for reading and math in grades 3, 5, 8 and 10
Compute implications for minimum detectable effect sizes
Estimated Parameters for Reading with a School-Level Pretest Lagged One Year

___________________________________________________________________
                    School District
            A       B       C       D       E
___________________________________________________________________
Grade 3
 ρ          0.20    0.15    0.19    0.22    0.16
 R₂²        0.31    0.77    0.74    0.51    0.75
Grade 5
 ρ          0.25    0.15    0.20    NA      0.12
 R₂²        0.33    0.50    0.81    NA      0.70
Grade 8
 ρ          0.18    NA      0.23    NA      NA
 R₂²        0.77    NA      0.91    NA      NA
Grade 10
 ρ          0.15    NA      0.29    NA      NA
 R₂²        0.93    NA      0.95    NA      NA
___________________________________________________________________
Minimum Detectable Effect Sizes for Reading with a School-Level Pretest (Y₋₁) or a Student-Level Pretest (y₋₁) Lagged One Year

________________________________________________________
                     Grade 3   Grade 5   Grade 8   Grade 10
________________________________________________________
20 schools randomized
 No covariate        0.57      0.56      0.61      0.62
 Y₋₁                 0.37      0.38      0.24      0.16
 y₋₁                 0.38      0.40      0.28      0.15
40 schools randomized
 No covariate        0.39      0.38      0.42      0.42
 Y₋₁                 0.26      0.26      0.17      0.11
 y₋₁                 0.26      0.27      0.19      0.10
60 schools randomized
 No covariate        0.32      0.31      0.34      0.34
 Y₋₁                 0.21      0.21      0.13      0.09
 y₋₁                 0.21      0.22      0.15      0.08
________________________________________________________
Key Findings
Using a pretest improves precision dramatically.
This improvement increases appreciably from elementary school to middle school to high school because R₂² increases.
School-level pretests produce as much precision as do student-level pretests.
The effect of a pretest declines somewhat as the time between it and the post-test increases.
Adding a second pretest increases precision slightly.
Using a pretest for a different subject increases precision substantially.
Narrowing the sample to schools that are similar to each other does not improve precision beyond that achieved by a pretest.
Source

Bloom, Howard S., Lashawn Richburg-Hayes and Alison Rebeck Black (2007) "Using Covariates to Improve Precision for Studies that Randomize Schools to Evaluate Educational Interventions," Educational Evaluation and Policy Analysis, 29(1): 30–59.
A Tail of Two Tradeoffs
(“It was the best of techniques. It was the worst of techniques.”
Who the dickens said that?)
Why match pairs?
for face validity
for precision
How to match pairs?
rank order clusters by covariate
pair clusters in rank-ordered list
randomize clusters in each pair
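The three matching steps above can be sketched directly (school IDs and pretest values below are hypothetical):

```python
import random

# Pair matching for cluster randomization: rank clusters on a baseline
# covariate, pair adjacent clusters, then randomize within each pair.
def pair_and_randomize(clusters, rng=None):
    """clusters: list of (cluster_id, covariate) with an even length."""
    rng = rng or random.Random()
    ranked = sorted(clusters, key=lambda c: c[1])
    assignment = {}
    for i in range(0, len(ranked), 2):
        first, second = ranked[i][0], ranked[i + 1][0]
        if rng.random() < 0.5:          # coin flip within the pair
            first, second = second, first
        assignment[first] = "T"
        assignment[second] = "C"
    return assignment

schools = [("s1", 610), ("s2", 585), ("s3", 642),
           ("s4", 600), ("s5", 575), ("s6", 630)]
print(pair_and_randomize(schools, random.Random(0)))
```

Each pair contributes exactly one treatment and one control cluster, so treatment and control groups are balanced on the ranking covariate by construction.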
When to pair? When the gain in predictive power outweighs the loss of degrees of freedom
Degrees of freedom
J - 2 without pairing
J/2 - 1 with pairing
Deriving the Minimum Required Predictive Power of Pairing

Without pairing:

MDE(b_0)_GR = M_{J−2} · SE(b_0)_GR

With pairing:

MDE(b_0)_GR = M_{J/2−1} · √(1 − R²) · SE(b_0)_GR
Breakeven R²

R²_min = 1 − ( M_{J−2} / M_{J/2−1} )²
The Minimum Required Predictive Power of Pairing

Randomized      Required Predictive
Clusters (J)    Power (R²_min)*
6               0.52
8               0.35
10              0.26
20              0.11
30              0.07

* For a two-tail test.
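These breakeven values can be reproduced from t-distribution quantiles; a sketch using scipy, with the multiplier M_df = t_{0.975,df} + t_{0.80,df} for a two-tailed test at α = 0.05 with power 0.80:

```python
from scipy.stats import t

# Degrees-of-freedom multiplier for a two-tailed test.
def multiplier(df, alpha=0.05, power=0.80):
    return t.ppf(1 - alpha / 2, df) + t.ppf(power, df)

# Breakeven predictive power of pairing: pairing helps only if its R^2
# exceeds 1 - (M_{J-2} / M_{J/2-1})^2.
def r2_min(J):
    return 1 - (multiplier(J - 2) / multiplier(J // 2 - 1)) ** 2

for J in (6, 8, 10, 20, 30):
    print(J, f"{r2_min(J):.2f}")
```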
Blocking for face validity vs. blocking for precision
Treating blocks as fixed effects vs. random effects
Defining blocks using baseline information
Subgroup Analyses #1:
When to Emphasize Them
Confirmatory: Draw conclusions about the program’s effectiveness if results are
Consistent with theory and contextual factors
Statistically significant and large
And subgroup was pre-specified
Exploratory: Develop hypotheses for further study
Before the analysis, state that conclusions about the program will be based in part on findings for this set of subgroups
Pre-specification can be based on
Theory
Prior evidence
Policy relevance
When should we discuss subgroup findings?
Depends on
whether impacts differ significantly across subgroups
and perhaps on whether impacts for the full sample are statistically significant
Subgroup Analyses #2:
Creating Subgroups
Defining Features
Creating subgroups in terms of:
Program characteristics
Randomized group characteristics
Individual characteristics
Defining Subgroups by
Program Characteristics
Based only on program features that were randomized
Thus one cannot use implementation quality
Defining Subgroups by Characteristics
Of Randomized Groups
Types of impacts
Net impacts
Differential impacts
Internal validity
only use pre-existing characteristics
Precision
Net impact estimates are limited by reduced number of randomized groups
Differential impact estimates are triply limited (and often need four times as many randomized groups)
Defining Subgroups by
Characteristics of Individuals
Types of impacts
Net impacts
Differential impacts
Internal validity
Only use pre-existing characteristics
Only use subgroups with sample members from all randomized groups
Precision
For net impacts : can be almost as good as for full sample
For differential impacts: can be even better than for full sample
Differential Impacts by Gender

[Diagram: boys and girls within the program group and within the control group.]
The boy–girl gap in mean outcomes is computed within each experimental group: Y_PB − Y_PG for the program group and Y_CB − Y_CG for the control group. The differential impact by gender is the difference between these two gaps.
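A differential impact by gender contrasts the program-control impact for boys with that for girls; equivalently, it is (Y_PB − Y_PG) − (Y_CB − Y_CG). A tiny numeric sketch with made-up subgroup means:

```python
# Hypothetical subgroup means (Y_PB = program-group boys, etc.).
y_pb, y_pg = 72.0, 78.0   # program group: boys, girls
y_cb, y_cg = 68.0, 76.0   # control group: boys, girls

impact_boys = y_pb - y_cb                     # 4.0
impact_girls = y_pg - y_cg                    # 2.0
differential = (y_pb - y_pg) - (y_cb - y_cg)  # 2.0

# The two ways of writing the contrast agree:
print(impact_boys - impact_girls == differential)  # True
```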
Generalizing Results from
Multiple Sites and Blocks
Fixed vs. Random Effects Inference:
A Vexing Issue
Known vs. unknown populations
Broader vs. narrower inferences
Weaker vs. stronger precision
Few vs. many sites or blocks
Weighting sites or blocks:
Implicitly, through a pooled regression
Explicitly, based on the number of schools or the number of students
Explicitly, based on precision (fixed-effects or random-effects weights)
Bottom line: the question being addressed is what counts
Using Two-Level Data for Three-
Level Situations
The Issue
General Question: What happens when you design a study with randomized groups that comprise three levels, based on data that do not explicitly account for the middle level?
Specific Example: What happens when you design a study that randomizes schools (with students clustered in classrooms in schools) based on data for students clustered in schools?
3-level vs. 2-level Variance Components

_________________________________________________________________________
                                        3-Level Model
Outcome                                 School    Class    Student    Total
Expressive vocab (spring)               19.84     32.45    306.18     358.48
Stanford 9 Total Math Scaled Score      –         –        1273.15    1424.69
Stanford 9 Total Reading Scaled Score   –         –        1581.86    1849.56

                                        2-Level Model
Outcome                                 School    Student    Total
Expressive vocab (spring)               38.15     321.11     359.26
Stanford 9 Total Math Scaled Score      131.39    1293.24    1424.63
Stanford 9 Total Reading Scaled Score   181.77    1666.48    1848.25
_________________________________________________________________________
Sources: The Chicago Literacy Initiative: Making Better Early Readers study (CLIMBERs) database and the School Breakfast Pilot Project (SBPP) database.
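One implication of these variance components: when the classroom level is omitted, its variance is reabsorbed by the school and student levels, raising the apparent school-level intraclass correlation. For the expressive-vocabulary outcome:

```python
# Variance components for expressive vocabulary (spring), from the table above.
school3, class3, student3 = 19.84, 32.45, 306.18   # 3-level model
school2, student2 = 38.15, 321.11                  # 2-level model

# School-level share of total variance under each model:
rho_school_3 = school3 / (school3 + class3 + student3)
rho_school_2 = school2 / (school2 + student2)

print(f"{rho_school_3:.3f} {rho_school_2:.3f}")  # 0.055 0.106
```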
3-level vs. 2-level MDES for Original Sample

_____________________________________________________________________________
                                        3-Level Model            2-Level Model
Outcome                                 Uncond.    Cond.         Uncond.    Cond.
Expressive vocab (spring)               0.482      0.386         0.495      0.311
Stanford 9 Total Math Scaled Score      0.259      0.184         0.259      0.184
Stanford 9 Total Reading Scaled Score   0.261      0.148         0.264      0.150
_____________________________________________________________________________
Sources: The Chicago Literacy Initiative: Making Better Early Readers study (CLIMBERs) database and the School Breakfast Pilot Project (SBPP) database.
Further References
Bloom, Howard S. (2005) “Randomizing Groups to Evaluate Place-Based
Programs,” in Howard S. Bloom, editor, Learning More From Social
Experiments: Evolving Analytic Approaches (New York: Russell Sage
Foundation).
Bloom, Howard S., Lashawn Richburg-Hayes and Alison Rebeck Black (2005)
“Using Covariates to Improve Precision: Empirical Guidance for Studies that
Randomize Schools to Measure the Impacts of Educational Interventions” (New
York: MDRC).
Donner, Allan and Neil Klar (2000) Cluster Randomization Trials in Health
Research (London: Arnold).
Hedges, Larry V. and Eric C. Hedberg (2006) “Intraclass Correlation Values for
Planning Group Randomized Trials in Education” (Chicago: Northwestern
University).
Murray, David M. (1998) Design and Analysis of Group-Randomized Trials (New
York: Oxford University Press).
Raudenbush, Stephen W., Andres Martinez and Jessaca Spybrook (2005) “Strategies for Improving Precision in Group-Randomized Experiments” (University of
Chicago).
Raudenbush, Stephen W. (1997) “Statistical Analysis and Optimal Design for
Cluster Randomized Trials” Psychological Methods , 2(2): 173 – 185.
Schochet, Peter Z. (2005) “Statistical Power for Random Assignment Evaluations of
Education Programs,” (Princeton, NJ: Mathematica Policy Research).