Sample Design for Group-Randomized Trials
Howard S. Bloom
Chief Social Scientist
MDRC
Prepared for the IES/NCER Summer Research Training Institute held at
Northwestern University on July 9, 2008.
Today we will examine:
- Sample size determinants
- Precision requirements
- Sample allocation
- Covariate adjustments
- Matching and blocking
- Subgroup analyses
- Generalizing findings for sites and blocks
- Using two-level data for three-level situations

Part I:
The Basics
Statistical properties of group-randomized impact estimators

Unbiased estimates:

$$Y_{ij} = \alpha + B_0 T_j + e_j + \varepsilon_{ij}, \qquad E(b_0) = B_0$$

Less precise estimates:

$$\operatorname{Var}(\varepsilon_{ij}) = \sigma^2, \qquad \operatorname{Var}(e_j) = \tau^2, \qquad \rho = \frac{\tau^2}{\tau^2 + \sigma^2}$$

$$GEM = \sqrt{1 + (n-1)\rho} = \frac{SE_C(b_0)}{SE_I(b_0)}$$

where the group effect multiplier (GEM) is the ratio of the standard error of the impact estimator when groups are randomized, $SE_C(b_0)$, to the standard error when the same total number of individuals are randomized, $SE_I(b_0)$.
Design Effect
(for a given total number of individuals)

Individuals per    Intraclass Correlation (ρ)
Group (n)          0.01      0.05      0.10
-----------------------------------------------
10                 1.04      1.20      1.38
50                 1.22      1.86      2.43
500                2.48      5.09      7.13
-----------------------------------------------
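A minimal sketch of the calculation behind this table, using the GEM formula above (the function name is mine; the loop simply reproduces the table):

```python
import math

def design_effect(n: int, rho: float) -> float:
    """Standard-error inflation from randomizing groups rather than
    individuals: GEM = sqrt(1 + (n - 1) * rho)."""
    return math.sqrt(1 + (n - 1) * rho)

# Reproduce the design-effect table above.
for n in (10, 50, 500):
    row = "  ".join(f"{design_effect(n, r):.2f}" for r in (0.01, 0.05, 0.10))
    print(f"n = {n:3d}: {row}")
```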
Sample design parameters
- Number of randomized groups (J)
- Number of individuals per randomized group (n)
- Proportion of groups randomized to program status (P)
Reporting precision
- A minimum detectable effect (MDE) is the smallest true effect that has a “good chance” of being found to be statistically significant.
- We typically define an MDE as the smallest true effect that has 80 percent power for a two-tailed test of statistical significance at the 0.05 level.
- An MDE is reported in natural units, whereas a minimum detectable effect size (MDES) is reported in units of standard deviations.
Minimum Detectable Effect Sizes for a Group-Randomized Design with ρ = 0.05 and no Covariates

Randomized      Individuals per Group (n)
Groups (J)      10        50        500
-------------------------------------------
10              0.77      0.53      0.46
20              0.50      0.35      0.30
40              0.35      0.24      0.21
120             0.20      0.14      0.12
-------------------------------------------
Implications for sample design
- It is extremely important to randomize an adequate number of groups.
- It is often far less important how many individuals per group you have.
Part II:
Determining Required Precision
When assessing how much precision is needed, always ask “relative to what?”
- Program benefits
- Program costs
- Existing outcome differences
- Past program performance
Effect Size Gospel According to Cohen and Lipsey

Cohen (speculative)     Lipsey (empirical)
---------------------------------------------
Small  = 0.20σ          Small  = 0.15σ
Medium = 0.50σ          Medium = 0.45σ
Large  = 0.80σ          Large  = 0.90σ
Five-year impacts of the Tennessee class-size experiment

Treatment: 13-17 versus 22-26 students per class
Effect sizes: 0.11σ to 0.22σ for reading and math

Findings are summarized from Nye, Barbara, Larry V. Hedges and Spyros Konstantopoulos (1999) “The Long-Term Effects of Small Classes: A Five-Year Follow-up of the Tennessee Class Size Experiment,” Educational Evaluation and Policy Analysis, 21(2): 127-142.
Annual reading and math growth

Grade           Reading Growth    Math Growth
Transition      Effect Size       Effect Size
------------------------------------------------
K-1             1.52              1.14
1-2             0.97              1.03
2-3             0.60              0.89
3-4             0.36              0.52
4-5             0.40              0.56
5-6             0.32              0.41
6-7             0.23              0.30
7-8             0.26              0.32
8-9             0.24              0.22
9-10            0.19              0.25
10-11           0.19              0.14
11-12           0.06              0.01
------------------------------------------------
Based on work in progress using documentation on the national norming samples for the CAT5, SAT9, Terra Nova CTBS, Gates-MacGinitie (for reading only), MAT8, Terra Nova CAT, and SAT10. 95% confidence intervals range in reading from ±0.03 to ±0.15 and in math from ±0.03 to ±0.22.
Performance gap between “average” (50th percentile) and “weak” (10th percentile) schools

Subject and grade    District I    District II    District III    District IV
------------------------------------------------------------------------------
Reading
  Grade 3            0.31          0.18           0.16            0.43
  Grade 5            0.41          0.18           0.35            0.31
  Grade 7            0.25          0.11           0.30            NA
  Grade 10           0.07          0.11           NA              NA
Math
  Grade 3            0.29          0.25           0.19            0.41
  Grade 5            0.27          0.23           0.36            0.26
  Grade 7            0.20          0.15           0.23            NA
  Grade 10           0.14          0.17           NA              NA
------------------------------------------------------------------------------
Source: District I outcomes are based on ITBS scaled scores, District II on SAT 9 scaled scores, District III on MAT NCE scores, and District IV on SAT 8 NCE scores.
Demographic performance gap in reading and math: Main NAEP scores

Subject and     Black-     Hispanic-    Male-      Eligible-Ineligible for
grade           White      White        Female     free/reduced-price lunch
-----------------------------------------------------------------------------
Reading
  Grade 4       -0.83      -0.77        -0.18      -0.74
  Grade 8       -0.80      -0.76        -0.28      -0.66
  Grade 12      -0.67      -0.53        -0.44      -0.45
Math
  Grade 4       -0.99      -0.85         0.08      -0.85
  Grade 8       -1.04      -0.82         0.04      -0.80
  Grade 12      -0.94      -0.68         0.09      -0.72
-----------------------------------------------------------------------------
Source: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2002 Reading Assessment and 2000 Mathematics Assessment.
ES Results from Randomized Studies

Achievement Measure            n      Mean
----------------------------------------------
Elementary School              389    0.33
Standardized test (Broad)      21     0.07
Standardized test (Narrow)     181    0.23
Specialized Topic/Test         180    0.44
Middle Schools                 36     0.51
High Schools                   43     0.27
----------------------------------------------
Part III:
The ABCs of Sample Allocation
Sample allocation alternatives

Balanced allocation:
- maximizes precision for a given sample size;
- maximizes robustness to distributional assumptions.

Unbalanced allocation:
- precision erodes slowly with imbalance for a given sample size;
- imbalance can facilitate a larger sample;
- imbalance can facilitate randomization.
Variance relationships for the program and control groups
- Equal variances: when the program does not affect the outcome variance.
- Unequal variances: when the program does affect the outcome variance.
MDES for equal variances without covariates

$$MDES = M_{J-2}\sqrt{\frac{1}{P(1-P)J}\left(\rho + \frac{1-\rho}{n}\right)}$$
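A hedged sketch of this formula in Python. The function names are mine, and the degrees-of-freedom multiplier is taken to be $M_{J-2} = t_{1-\alpha/2} + t_{\text{power}}$ on J - 2 degrees of freedom, which is consistent with the footnote later in the talk (M ≈ 2.8 for a two-tail test with 20 or more degrees of freedom). It reproduces the MDES table from Part I:

```python
import math
from scipy.stats import t

def multiplier(df: int, power: float = 0.80, alpha: float = 0.05,
               two_tailed: bool = True) -> float:
    """Degrees-of-freedom multiplier M_df = t_alpha + t_power."""
    a = alpha / 2 if two_tailed else alpha
    return t.ppf(1 - a, df) + t.ppf(power, df)

def mdes(J: int, n: int, rho: float, P: float = 0.5,
         two_tailed: bool = True) -> float:
    """MDES for a group-randomized design: equal variances, no covariates."""
    M = multiplier(J - 2, two_tailed=two_tailed)
    return M * math.sqrt((rho + (1 - rho) / n) / (P * (1 - P) * J))

print(round(mdes(J=10, n=10, rho=0.05), 2))   # 0.77, as in the Part I table
print(round(mdes(J=40, n=50, rho=0.05), 2))   # 0.24
```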
How allocation affects MDES

The MDES scales with $\sqrt{1/(P(1-P))}$, which grows as the allocation departs from balance:

$$\sqrt{\frac{1}{.5(1-.5)}} = 2.00 \qquad \sqrt{\frac{1}{.6(1-.6)}} = 2.04 \qquad \sqrt{\frac{1}{.7(1-.7)}} = 2.18$$

$$\sqrt{\frac{1}{.8(1-.8)}} = 2.50 \qquad \sqrt{\frac{1}{.9(1-.9)}} = 3.33$$
Minimum Detectable Effect Size for Sample Allocations Given Equal Variances

Allocation    Example*    Ratio to Balanced Allocation
--------------------------------------------------------
0.5/0.5       0.54σ       1.00
0.6/0.4       0.55σ       1.02
0.7/0.3       0.59σ       1.09
0.8/0.2       0.68σ       1.25
0.9/0.1       0.91σ       1.67
--------------------------------------------------------
* Example is for n = 20, J = 10, ρ = 0.05, a one-tail hypothesis test, and no covariates.
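Reusing the mdes() sketch above with this slide's parameters (n = 20, J = 10, ρ = 0.05, one-tail test, no covariates) reproduces both columns of the table:

```python
# Continuation of the earlier sketch: mdes() is defined above.
base = mdes(J=10, n=20, rho=0.05, P=0.5, two_tailed=False)
for p in (0.5, 0.6, 0.7, 0.8, 0.9):
    m = mdes(J=10, n=20, rho=0.05, P=p, two_tailed=False)
    print(f"{p:.1f}/{1 - p:.1f}: MDES = {m:.2f} sd, ratio = {m / base:.2f}")
```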
Implications of unbalanced allocations with unequal variances

The true standard error of the impact estimator is

$$SE(b_0)_U = \sqrt{\frac{\sigma_P^2}{J_P} + \frac{\sigma_C^2}{J_C}}$$

while the conventional pooled-variance estimator has an expected value of approximately

$$E[se(b_0)] \approx \sqrt{\frac{\sigma_P^2}{J_C} + \frac{\sigma_C^2}{J_P}}$$

where σP² and σC² are the program- and control-group variances and JP and JC are the numbers of groups randomized to each condition. Note that the group counts are swapped in the second expression.
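A small simulation makes the direction of the bias concrete. The group counts, variances, and replication count below are illustrative assumptions, not values from the talk; the estimator is the conventional pooled equal-variance formula:

```python
import numpy as np

rng = np.random.default_rng(0)
JP, JC = 18, 2          # unbalanced: many program groups, few control groups
sdP, sdC = 1.0, 2.0     # the larger (program) sample has the smaller variance

true_se = np.sqrt(sdP**2 / JP + sdC**2 / JC)

estimates = []
for _ in range(20_000):
    yP = rng.normal(0.0, sdP, JP)      # group-level mean outcomes
    yC = rng.normal(0.0, sdC, JC)
    s2 = ((JP - 1) * yP.var(ddof=1) + (JC - 1) * yC.var(ddof=1)) / (JP + JC - 2)
    estimates.append(np.sqrt(s2 * (1 / JP + 1 / JC)))

# With the larger sample holding the smaller variance, the pooled
# estimate is biased downward, as the expectation formula predicts.
print(f"true SE = {true_se:.3f}, mean estimated SE = {np.mean(estimates):.3f}")
```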
Implications Continued

The estimated standard error is:
- unbiased when the allocation is balanced or when the variances are equal;
- biased upward when the larger sample has the larger variance;
- biased downward when the larger sample has the smaller variance.
Interim Conclusions
- Don’t use the equal-variance assumption for an unbalanced allocation with many degrees of freedom.
- Use a balanced allocation when there are few degrees of freedom.
References

Gail, Mitchell H., Steven D. Mark, Raymond J. Carroll, Sylvan B. Green and David Pee (1996) “On Design Considerations and Randomization-Based Inferences for Community Intervention Trials,” Statistics in Medicine, 15: 1069-1092.

Bryk, Anthony S. and Stephen W. Raudenbush (1988) “Heterogeneity of Variance in Experimental Studies: A Challenge to Conventional Interpretations,” Psychological Bulletin, 104(3): 396-404.
Part IV:
Using Covariates to Reduce Sample Size
Basic ideas
- Goal: reduce the number of clusters randomized.
- Approach: reduce the standard error of the impact estimator by controlling for baseline covariates.
- Alternative covariates:
  - individual-level or cluster-level;
  - pretests or other characteristics.
Impact Estimation with a Covariate

$$y_{ij} = \alpha + \beta_0 T_j + \beta_1 x_{ij} + e_j + \varepsilon_{ij}$$

$$y_{ij} = \alpha + \beta_0 T_j + \beta_1 X_j + e_j + \varepsilon_{ij}$$

where:
y_ij = the outcome for student i from school j
T_j = 1 for treatment schools and 0 for control schools
X_j = a covariate for school j
x_ij = a covariate for student i from school j
e_j = a random error term for school j
ε_ij = a random error term for student i from school j
Minimum Detectable Effect Size with a Covariate

$$MDES = M_{J-K}\sqrt{\frac{\rho(1-R_2^2)}{P(1-P)J} + \frac{(1-\rho)(1-R_1^2)}{P(1-P)nJ}}$$

where:
MDES = minimum detectable effect size
M_{J-K} = a degrees-of-freedom multiplier¹
J = the total number of schools randomized
n = the number of students in a grade per school
P = the proportion of schools randomized to treatment
ρ = the unconditional intraclass correlation (without a covariate)
R₁² = the proportion of variance across individuals within schools (at level 1) predicted by the covariate
R₂² = the proportion of variance across schools (at level 2) predicted by the covariate

¹ For 20 or more degrees of freedom, M_{J-K} equals 2.8 for a two-tail test and 2.5 for a one-tail test with statistical power of 0.80 and statistical significance of 0.05.
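A sketch extending the earlier mdes() function to include covariates. The treatment of K is an assumption on my part (K = 3 here, for the intercept, the treatment indicator, and one school-level covariate), and n = 60 students per school is likewise hypothetical; a school-level covariate leaves R₁² at zero:

```python
import math
from scipy.stats import t

def mdes_cov(J: int, n: int, rho: float, R1sq: float = 0.0,
             R2sq: float = 0.0, P: float = 0.5, K: int = 3) -> float:
    """MDES with a covariate (two-tail test, power 0.80)."""
    M = t.ppf(0.975, J - K) + t.ppf(0.80, J - K)
    return M * math.sqrt(rho * (1 - R2sq) / (P * (1 - P) * J)
                         + (1 - rho) * (1 - R1sq) / (P * (1 - P) * n * J))

# Illustration with the Grade 8 District C estimates shown below
# (rho = 0.23, school-level pretest R2^2 = 0.91) and 40 schools;
# the result is about 0.17, in the neighborhood of the MDES table below.
print(round(mdes_cov(J=40, n=60, rho=0.23, R2sq=0.91), 2))
```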
Questions Addressed Empirically about the Predictive Power of Covariates
- School-level vs. student-level pretests
- Earlier vs. later follow-up years
- Reading vs. math
- Elementary vs. middle vs. high school
- All schools vs. low-income schools vs. low-performing schools
Empirical Analysis
- Estimate ρ, R₂² and R₁² from data on thousands of students from hundreds of schools, during multiple years, at five urban school districts.
- Summarize these estimates for reading and math in grades 3, 5, 8 and 10.
- Compute implications for minimum detectable effect sizes.
Estimated Parameters for Reading with a School-Level Pretest Lagged One Year

                        School District
              A        B        C        D        E
-------------------------------------------------------
Grade 3
  ρ           0.20     0.15     0.19     0.22     0.16
  R₂²         0.31     0.77     0.74     0.51     0.75
Grade 5
  ρ           0.25     0.15     0.20     NA       0.12
  R₂²         0.33     0.50     0.81     NA       0.70
Grade 8
  ρ           0.18     NA       0.23     NA       NA
  R₂²         0.77     NA       0.91     NA       NA
Grade 10
  ρ           0.15     NA       0.29     NA       NA
  R₂²         0.93     NA       0.95     NA       NA
-------------------------------------------------------
Minimum Detectable Effect Sizes for Reading with a School-Level Pretest (Y-1) or a Student-Level Pretest (y-1) Lagged One Year

                      Grade 3    Grade 5    Grade 8    Grade 10
-----------------------------------------------------------------
20 schools randomized
  No covariate        0.57       0.56       0.61       0.62
  Y-1                 0.37       0.38       0.24       0.16
  y-1                 0.38       0.40       0.28       0.15
40 schools randomized
  No covariate        0.39       0.38       0.42       0.42
  Y-1                 0.26       0.26       0.17       0.11
  y-1                 0.26       0.27       0.19       0.10
60 schools randomized
  No covariate        0.32       0.31       0.34       0.34
  Y-1                 0.21       0.21       0.13       0.09
  y-1                 0.21       0.22       0.15       0.08
-----------------------------------------------------------------
Key Findings
- Using a pretest improves precision dramatically.
- This improvement increases appreciably from elementary school to middle school to high school, because R₂² increases.
- School-level pretests produce as much precision as do student-level pretests.
- The effect of a pretest declines somewhat as the time between it and the post-test increases.
- Adding a second pretest increases precision slightly.
- Using a pretest for a different subject increases precision substantially.
- Narrowing the sample to schools that are similar to each other does not improve precision beyond that achieved by a pretest.
Source: Bloom, Howard S., Lashawn Richburg-Hayes and Alison Rebeck Black (2007) “Using Covariates to Improve Precision for Studies that Randomize Schools to Evaluate Educational Interventions,” Educational Evaluation and Policy Analysis, 29(1): 30-59.
Part V:
The Putative Power of Pairing

A Tail of Two Tradeoffs
(“It was the best of techniques. It was the worst of techniques.” Who the dickens said that?)
Pairing

Why match pairs?
- for face validity
- for precision

How to match pairs? (a sketch follows below)
- rank-order clusters by a baseline covariate;
- pair adjacent clusters in the rank-ordered list;
- randomize clusters to treatment within each pair.
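A minimal sketch of this recipe; the function name and the example schools and baseline scores are hypothetical:

```python
import random

def pair_and_randomize(clusters, covariate, seed=12345):
    """Rank clusters on a baseline covariate, pair adjacent clusters in
    the ranked list, and randomize one member of each pair to treatment.
    Assumes an even number of clusters."""
    rng = random.Random(seed)
    ranked = sorted(clusters, key=covariate)
    assignment = {}
    for a, b in zip(ranked[::2], ranked[1::2]):
        treated = rng.choice([a, b])
        assignment[a] = 1 if treated == a else 0
        assignment[b] = 1 - assignment[a]
    return assignment

baseline = {"S1": 62.0, "S2": 55.5, "S3": 71.3,
            "S4": 58.9, "S5": 66.1, "S6": 54.2}
print(pair_and_randomize(list(baseline), covariate=baseline.get))
```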
When to pair?
- When the gain in predictive power outweighs the loss of degrees of freedom.
- Degrees of freedom: J - 2 without pairing; J/2 - 1 with pairing.
Deriving the Minimum Required Predictive Power of Pairing

Without pairing:

$$MDE(b_0)_{GR} = M_{J-2}\, SE(b_0)_{GR}$$

With pairing:

$$MDE(b_0)_{GR} = M_{J/2-1}\sqrt{1-R^2}\, SE(b_0)_{GR}$$

Breakeven $R^2$:

$$R^2_{min} = 1 - \frac{M^2_{J-2}}{M^2_{J/2-1}}$$
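The breakeven R² is straightforward to tabulate with the same t-based multiplier used in the earlier sketches; this reproduces the table on the next slide:

```python
from scipy.stats import t

def multiplier(df: int, power: float = 0.80, alpha: float = 0.05) -> float:
    """Two-tail degrees-of-freedom multiplier M_df."""
    return t.ppf(1 - alpha / 2, df) + t.ppf(power, df)

def breakeven_r2(J: int) -> float:
    """Minimum pairing R^2 that offsets dropping from J - 2 to
    J/2 - 1 degrees of freedom."""
    return 1 - (multiplier(J - 2) / multiplier(J // 2 - 1)) ** 2

for J in (6, 8, 10, 20, 30):
    print(J, round(breakeven_r2(J), 2))   # 0.52, 0.35, 0.26, 0.11, 0.07
```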
The Minimum Required Predictive Power of Pairing

Randomized      Required Predictive
Clusters (J)    Power (R²min)*
---------------------------------------
6               0.52
8               0.35
10              0.26
20              0.11
30              0.07
---------------------------------------
* For a two-tail test.
A few key points about blocking
- Blocking for face validity vs. blocking for precision
- Treating blocks as fixed effects vs. random effects
- Defining blocks using baseline information
Part VI:
Subgroup Analyses: Learning from Diversity
Purposes
- To assess generalizability through description (by exploring how impacts vary)
- To enhance generalizability through explanation (by exploring what predicts impact variation)
Considerations
- Research protocol: maximize ex ante specification through theory and thought to minimize ex post data mining.
- Assessment criteria: internal validity and precision.
Defining Features
- Program characteristics
- Randomized group characteristics
- Individual characteristics
Defining Subgroups by the Characteristics of Programs
- Base subgroups only on program features that were randomized.
- Thus one cannot use implementation quality.
Defining Subgroups by Characteristics of Randomized Groups
- Types of impacts: net impacts and differential impacts.
- Internal validity: only use pre-existing characteristics.
- Precision:
  - net impact estimates are limited by the reduced number of randomized groups;
  - differential impact estimates are triply limited (and often need four times as many randomized groups).
Defining Subgroups by Characteristics of Individuals
- Types of impacts: net impacts and differential impacts.
- Internal validity:
  - only use pre-existing characteristics;
  - only use subgroups with sample members from all randomized groups.
- Precision:
  - for net impacts: can be almost as good as for the full sample;
  - for differential impacts: can be even better than for the full sample.
Differential Impacts by Gender

            Program Group    Control Group
Boys        YPB              YCB
Girls       YPG              YCG
Gap         YPB - YPG        YCB - YCG

where YPB is the mean outcome for boys in the program group, and so on. The differential impact by gender is the difference between the two gaps: (YPB - YPG) - (YCB - YCG).
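In regression form, this differential impact is the coefficient on a treatment-by-gender interaction in the two-level model. A hedged sketch on simulated data using statsmodels; every variable name and parameter value here is an illustrative assumption:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
J, n = 40, 50
school = np.repeat(np.arange(J), n)
T = np.repeat((np.arange(J) < J // 2).astype(int), n)  # 20 treatment schools
female = rng.integers(0, 2, J * n)
y = (0.2 * T + 0.1 * T * female                        # impact larger for girls
     + rng.normal(0, 0.5, J)[school]                   # school random effects
     + rng.normal(0, 1.0, J * n))
df = pd.DataFrame({"y": y, "T": T, "female": female, "school": school})

# Random school intercepts; the T:female coefficient is the
# girls-minus-boys difference in impacts, i.e. the negative of
# (YPB - YPG) - (YCB - YCG) in the slide's notation.
fit = smf.mixedlm("y ~ T * female", data=df, groups="school").fit()
print(fit.params[["T", "T:female"]])
```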
Part VII:
Generalizing Results from Multiple Sites and Blocks
Fixed vs. Random Effects Inference
- Known vs. unknown populations
- Broader vs. narrower inferences
- Weaker vs. stronger precision
- Few vs. many sites or blocks
Weighting Sites and Blocks
- Implicitly, through a pooled regression
- Explicitly, based on the number of schools or the number of students
- Explicitly, based on precision, with fixed effects or random effects (a sketch follows below)
- Bottom line: the question addressed is what counts
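A minimal sketch of the explicit precision-weighting option (fixed-effects, inverse-variance weights); the site impacts and standard errors are made up for illustration:

```python
import numpy as np

impacts = np.array([0.25, 0.10, 0.40])   # hypothetical site impact estimates
ses = np.array([0.08, 0.05, 0.20])       # hypothetical standard errors

w = 1 / ses**2                           # fixed-effects precision weights
pooled = np.sum(w * impacts) / np.sum(w)
pooled_se = np.sqrt(1 / np.sum(w))
print(f"pooled impact = {pooled:.3f} (SE = {pooled_se:.3f})")
# Weighting by number of schools or students, or adding a between-site
# variance component (random effects), answers a different question.
```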
Part VIII:
Using Two-Level Data for Three-Level Situations
The Issue
- General question: what happens when you design a study with randomized groups that comprise three levels, based on data that do not account explicitly for the middle level?
- Specific example: what happens when you design a study that randomizes schools (with students clustered in classrooms within schools) based on data for students clustered in schools?
3-Level vs. 2-Level Variance Components

                                        3-Level Model                          2-Level Model
Outcome                                 School    Class     Student   Total    School    Student   Total
----------------------------------------------------------------------------------------------------------
Expressive vocab (spring)               19.84     32.45     306.18    358.48   38.15     321.11    359.26
Stanford 9 Total Math Scaled Score      115.14    36.40     1273.15   1424.69  131.39    1293.24   1424.63
Stanford 9 Total Reading Scaled Score   108.75    158.95    1581.86   1849.56  181.77    1666.48   1848.25
----------------------------------------------------------------------------------------------------------
Sources: The Chicago Literacy Initiative: Making Better Early Readers study (CLIMBERs) database and the School Breakfast Pilot Project (SBPP) database.
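Using only the numbers in the table above, the implied intraclass correlations show how the 2-level school component absorbs part of the classroom component:

```python
# Variance components from the table: (school3, class3, student3, school2, student2).
rows = {
    "Expressive vocab (spring)": (19.84, 32.45, 306.18, 38.15, 321.11),
    "Stanford 9 math":           (115.14, 36.40, 1273.15, 131.39, 1293.24),
    "Stanford 9 reading":        (108.75, 158.95, 1581.86, 181.77, 1666.48),
}
for name, (s3, c3, i3, s2, i2) in rows.items():
    tot3, tot2 = s3 + c3 + i3, s2 + i2
    print(f"{name}: 3-level school ICC = {s3 / tot3:.3f}, "
          f"class ICC = {c3 / tot3:.3f}; 2-level school ICC = {s2 / tot2:.3f}")
```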
3-Level vs. 2-Level MDES for the Original Sample

                                        3-Level Model                  2-Level Model
Outcome                                 Unconditional   Conditional    Unconditional   Conditional
------------------------------------------------------------------------------------------------------
Expressive vocab (spring)               0.482           0.386          0.495           0.311
Stanford 9 Total Math Scaled Score      0.259           0.184          0.259           0.184
Stanford 9 Total Reading Scaled Score   0.261           0.148          0.264           0.150
------------------------------------------------------------------------------------------------------
Sources: The Chicago Literacy Initiative: Making Better Early Readers study (CLIMBERs) database and the School Breakfast Pilot Project (SBPP) database.
Further References

Bloom, Howard S. (2005) “Randomizing Groups to Evaluate Place-Based Programs,” in Howard S. Bloom, editor, Learning More From Social Experiments: Evolving Analytic Approaches (New York: Russell Sage Foundation).

Bloom, Howard S., Lashawn Richburg-Hayes and Alison Rebeck Black (2005) “Using Covariates to Improve Precision: Empirical Guidance for Studies that Randomize Schools to Measure the Impacts of Educational Interventions” (New York: MDRC).

Donner, Allan and Neil Klar (2000) Cluster Randomization Trials in Health Research (London: Arnold).

Hedges, Larry V. and Eric C. Hedberg (2006) “Intraclass Correlation Values for Planning Group Randomized Trials in Education” (Chicago: Northwestern University).

Murray, David M. (1998) Design and Analysis of Group-Randomized Trials (New York: Oxford University Press).

Raudenbush, Stephen W., Andres Martinez and Jessaca Spybrook (2005) “Strategies for Improving Precision in Group-Randomized Experiments” (University of Chicago).

Raudenbush, Stephen W. (1997) “Statistical Analysis and Optimal Design for Cluster Randomized Trials,” Psychological Methods, 2(2): 173-185.