Howard S. Bloom
Chief Social Scientist
MDRC
Prepared for the IES/NCER Summer Research Training Institute held at
Northwestern University on July 27, 2010.
Sample size determinants
Precision requirements
Sample allocation
Covariate adjustments
Matching and blocking
Subgroup analyses
Generalizing findings for sites and blocks
Using two-level data for three-level situations
The Basics
Statistical properties of group-randomized impact estimators
Unbiased estimates
Y_ij = α + B_0·T_j + e_j + e_ij

E(b_0) = B_0
Less precise estimates
VAR(e_ij) = σ²
VAR(e_j) = τ²
ρ = τ²/(τ² + σ²)
The group effect multiplier (GEM) is the ratio of the standard error of the impact estimator with group randomization, SE_C(b_0), to that with individual randomization, SE_I(b_0), for a given total number of individuals:

GEM = SE_C(b_0)/SE_I(b_0) = √(1 + (n − 1)ρ)

______________________________________
Intraclass         Individuals per Group (n)
Correlation (ρ)    10       50       500
0.01               1.04     1.22     2.48
0.05               1.20     1.86     5.09
0.10               1.38     2.43     7.13
______________________________________
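As a quick check on the table, here is a minimal Python sketch, assuming the standard design-effect formula GEM = √(1 + (n − 1)ρ):

```python
import math

# Group effect multiplier: ratio of the standard error under group
# randomization to the standard error under individual randomization,
# holding the total number of individuals fixed.
def gem(n, rho):
    """n: individuals per group; rho: intraclass correlation."""
    return math.sqrt(1 + (n - 1) * rho)

# Reproduce two cells of the table (rho = 0.05):
print(f"{gem(10, 0.05):.2f}")   # 1.20
print(f"{gem(50, 0.05):.2f}")   # 1.86
```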
Number of randomized groups (J)
Number of individuals per randomized group (n)
Proportion of groups randomized to program status (P)
A minimum detectable effect (MDE) is the smallest true effect that has a “good chance” of being found to be statistically significant.
We typically define an MDE as the smallest true effect that has 80 percent power for a two-tailed test of statistical significance at the 0.05 level.
An MDE is reported in natural units, whereas a minimum detectable effect size (MDES) is reported in units of standard deviations.
Minimum Detectable Effect Sizes for a Group-Randomized Design with ρ = 0.05 and no Covariates

___________________________________
Randomized     Individuals per Group (n)
Groups (J)     10       50       500
10             0.77     0.53     0.46
20             0.50     0.35     0.30
40             0.35     0.24     0.21
120            0.20     0.14     0.12
___________________________________
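These values follow from the standard MDES formula for a balanced group-randomized design; a minimal Python sketch, assuming a degrees-of-freedom multiplier M ≈ 2.8 (two-tailed test, power 0.80, significance 0.05, 20 or more degrees of freedom — the small-J rows of the table use a somewhat larger multiplier):

```python
import math

# Minimum detectable effect size for a group-randomized design with no
# covariates: MDES = M * sqrt(rho/(P(1-P)J) + (1-rho)/(P(1-P)nJ)).
def mdes(J, n, rho, P=0.5, M=2.8):
    """J: randomized groups; n: individuals per group; rho: intraclass
    correlation; P: share of groups treated; M: df multiplier."""
    return M * math.sqrt(rho / (P * (1 - P) * J)
                         + (1 - rho) / (P * (1 - P) * n * J))

# e.g. J = 120, n = 500, rho = 0.05 gives roughly the 0.12 in the table
print(f"{mdes(120, 500, 0.05):.2f}")  # 0.12
```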
It is extremely important to randomize an adequate number of groups.
It is often far less important how many individuals per group you have.
Determining required precision

_______________________________________________
           (speculative)    (empirical)
Small      0.20σ            0.15σ
Medium     0.50σ            0.45σ
Large      0.80σ            0.90σ
_______________________________________________
Treatment: 13–17 versus 22–26 students per class
Effect sizes: 0.11σ to 0.22σ for reading and math

Findings are summarized from Nye, Barbara, Larry V. Hedges and Spyros Konstantopoulos (1999) "The Long-Term Effects of Small Classes: A Five-Year Follow-up of the Tennessee Class Size Experiment," Educational Evaluation and Policy Analysis, Vol. 21, No. 2: 127–142.
----------------------------------------------------------------
Grade          Reading Growth     Math Growth
Transition     Effect Size        Effect Size
----------------------------------------------------------------
K – 1          1.52               1.14
1 – 2          0.97               1.03
2 – 3          0.60               0.89
3 – 4          0.36               0.52
4 – 5          0.40               0.56
5 – 6          0.32               0.41
6 – 7          0.23               0.30
7 – 8          0.26               0.32
8 – 9          0.24               0.22
9 – 10         0.19               0.25
10 – 11        0.19               0.14
11 – 12        0.06               0.01
----------------------------------------------------------------
Based on work in progress using documentation on the national norming samples for the CAT5,
SAT9, Terra Nova CTBS, Gates MacGinitie (for reading only), MAT8, Terra Nova CAT, and SAT10.
95% confidence intervals range from ±0.03 to ±0.15 in reading and from ±0.03 to ±0.22 in math.
Performance gap between "average" (50th percentile) and "weak" (10th percentile) schools

_________________________________________________________________
Subject and grade   District I   District II   District III   District IV
Reading
 Grade 3            0.31         0.18          0.16           0.43
 Grade 5            0.41         0.18          0.35           0.31
 Grade 7            0.25         0.11          0.30           NA
 Grade 10           0.07         0.11          NA             NA
Math
 Grade 3            0.29         0.25          0.19           0.41
 Grade 5            0.27         0.23          0.36           0.26
 Grade 7            0.20         0.15          0.23           NA
 Grade 10           0.14         0.17          NA             NA
_________________________________________________________________
Source: District I outcomes are based on ITBS scaled scores, District II on SAT 9 scaled scores, District
III on MAT NCE scores, and District IV on SAT 8 NCE scores.
Demographic performance gap in reading and math: Main NAEP scores

_____________________________________________________________________________
Subject and   Black-    Hispanic-   Male-     Eligible-Ineligible for
grade         White     White       Female    free/reduced-price lunch
Reading
 Grade 4      -0.83     -0.77       -0.18     -0.74
 Grade 8      -0.80     -0.76       -0.28     -0.66
 Grade 12     -0.67     -0.53       -0.44     -0.45
Math
 Grade 4      -0.99     -0.85       0.08      -0.85
 Grade 8      -1.04     -0.82       0.04      -0.80
 Grade 12     -0.94     -0.68       0.09      -0.72
_____________________________________________________________________________
Source: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics,
National Assessment of Educational Progress (NAEP), 2002 Reading Assessment and 2000 Mathematics Assessment.
ES Results from Randomized Studies

_______________________________________________
Achievement Measure             n      Mean
Elementary School               389    0.33
 Standardized test (Broad)      21     0.07
 Standardized test (Narrow)     181    0.23
 Specialized Topic/Test         180    0.44
Middle Schools                  36     0.51
High Schools                    43     0.27
_______________________________________________
The ABCs of Sample Allocation
Balanced allocation
maximizes precision for a given sample size;
maximizes robustness to distributional assumptions.
Unbalanced allocation
precision erodes slowly with imbalance for a given sample size
imbalance can facilitate a larger sample
imbalance can facilitate randomization
Equal variances: when the program does not affect the outcome variance.
Unequal variances: when the program does affect the outcome variance.
MDES for equal variances without covariates
MDES = M_{J−2} · √(1/(P(1−P))) · √(ρ/J + (1−ρ)/(nJ))

The allocation factor 1/√(P(1−P)) is smallest for a balanced design:

_____________________________
P      1/√(P(1−P))
0.5    2.00
0.6    2.04
0.7    2.18
0.8    2.50
0.9    3.33
_____________________________
Minimum Detectable Effect Size for Sample Allocations Given Equal Variances

________________________________________
Allocation   Example*   Ratio to Balanced
                        Allocation
0.5/0.5      0.54σ      1.00
0.6/0.4      0.55σ      1.02
0.7/0.3      0.59σ      1.09
0.8/0.2      0.68σ      1.25
0.9/0.1      0.91σ      1.67
________________________________________
* Example is for n = 20, J = 10, ρ = 0.05, a one-tail hypothesis test, and no covariates.
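The "Ratio to Balanced Allocation" column depends on P alone, since only the 1/√(P(1−P)) term of the MDES formula changes with the allocation. A small sketch:

```python
import math

# MDES penalty of an unbalanced allocation relative to P = 0.5:
# sqrt(0.25 / (P * (1 - P))).
def allocation_penalty(P):
    return math.sqrt(0.25 / (P * (1 - P)))

for P in (0.5, 0.6, 0.7, 0.8, 0.9):
    print(f"{P:.1f}/{1 - P:.1f}: {allocation_penalty(P):.2f}")
```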
Implications of unbalanced allocations with unequal variances

True standard error of the impact estimator:

SE(b_0) = √( σ_P²/J_P + σ_C²/J_C )

Expected value of the conventional estimated standard error, which pools the program- and control-group variances:

E(se(b_0)) = √( σ_P²/J_C + σ_C²/J_P )

where σ_P² and σ_C² are the group-level outcome variances for the program and control groups, and J_P and J_C are the numbers of groups randomized to each condition. Pooling effectively swaps the denominators, which produces the biases summarized next.
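The direction of the bias can be checked numerically; a sketch assuming hypothetical group-level variances and group counts:

```python
import math

# True standard error when group-level variances differ between arms.
def true_se(var_p, var_c, j_p, j_c):
    return math.sqrt(var_p / j_p + var_c / j_c)

# Expected value of the pooled-variance estimated SE: pooling
# effectively swaps the denominators.
def expected_estimated_se(var_p, var_c, j_p, j_c):
    return math.sqrt(var_p / j_c + var_c / j_p)

# Larger arm (30 groups) with the larger variance -> estimate biased upward:
print(f"{true_se(4.0, 1.0, 30, 10):.3f}")                # true SE
print(f"{expected_estimated_se(4.0, 1.0, 30, 10):.3f}")  # larger than true
```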
Implications Continued
The estimated standard error is unbiased
When the allocation is balanced
When the variances are equal
The estimated standard error is biased upward
When the larger sample has the larger variance
The estimated standard error is biased downward
When the larger sample has the smaller variance
Don’t use the equal variance assumption for an unbalanced allocation with many degrees of freedom .
Use a balanced allocation when there are few degrees of freedom .
References

Gail, Mitchell H., Steven D. Mark, Raymond J. Carroll, Sylvan B. Green and David Pee (1996) "On Design Considerations and Randomization-Based Inferences for Community Intervention Trials," Statistics in Medicine, 15: 1069–1092.

Bryk, Anthony S. and Stephen W. Raudenbush (1988) "Heterogeneity of Variance in Experimental Studies: A Challenge to Conventional Interpretations," Psychological Bulletin, 104(3): 396–404.
Using Covariates to Reduce
Sample Size
Goal: Reduce the number of clusters randomized
Approach: Reduce the standard error of the impact estimator by controlling for baseline covariates
Alternative Covariates
Individual-level
Cluster-level
Pretests
Other characteristics
Impact Estimation with a Covariate

With a student-level covariate:

y_ij = α + B_0·T_j + B_1·x_ij + e_j + e_ij

With a school-level covariate:

y_ij = α + B_0·T_j + B_1·x_j + e_j + e_ij

where
y_ij = the outcome for student i from school j
T_j = 1 for treatment schools and 0 for control schools
x_j = a covariate for school j
x_ij = a covariate for student i from school j
e_j = a random error term for school j
e_ij = a random error term for student i from school j
Minimum Detectable Effect Size with a Covariate

MDES = M_{J−K} · √( ρ(1−R₂²)/(P(1−P)J) + (1−ρ)(1−R₁²)/(P(1−P)nJ) )

MDES = minimum detectable effect size
M_{J−K} = a degrees-of-freedom multiplier¹
J = the total number of schools randomized
n = the number of students in a grade per school
P = the proportion of schools randomized to treatment
ρ = the unconditional intraclass correlation (without a covariate)
R₁² = the proportion of variance across individuals within schools (at level 1) predicted by the covariate
R₂² = the proportion of variance across schools (at level 2) predicted by the covariate

¹ For 20 or more degrees of freedom, M_{J−K} equals 2.8 for a two-tail test and 2.5 for a one-tail test, with statistical power of 0.80 and statistical significance of 0.05.
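A direct transcription of this formula into Python (the parameter values in the example are illustrative, not taken from the district data below):

```python
import math

# MDES with a covariate. R2_2 / R2_1: shares of school- and student-level
# variance predicted by the covariate; M: df multiplier (about 2.8 for a
# two-tailed test with 20 or more degrees of freedom).
def mdes_cov(J, n, rho, R2_2=0.0, R2_1=0.0, P=0.5, M=2.8):
    return M * math.sqrt(
        rho * (1 - R2_2) / (P * (1 - P) * J)
        + (1 - rho) * (1 - R2_1) / (P * (1 - P) * n * J))

# A school-level pretest with R2_2 = 0.75 sharply cuts the detectable effect:
print(f"{mdes_cov(40, 60, 0.15):.2f} -> {mdes_cov(40, 60, 0.15, R2_2=0.75):.2f}")
# prints "0.36 -> 0.20"
```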
Questions Addressed Empirically about the
Predictive Power of Covariates
School-level vs. student-level pretests
Earlier vs. later follow-up years
Reading vs. math
Elementary vs. middle vs. high school
All schools vs. low-income schools vs. low-performing schools
Empirical Analysis
Estimate ρ, R₂², and R₁² from data on thousands of students from hundreds of schools, during multiple years, at five urban school districts
Summarize these estimates for reading and math in grades 3, 5, 8 and 10
Compute implications for minimum detectable effect sizes
Estimated Parameters for Reading with a School-Level Pretest Lagged One Year

___________________________________________________________________
                    School District
            A       B       C       D       E
___________________________________________________________________
Grade 3
 ρ          0.20    0.15    0.19    0.22    0.16
 R₂²        0.31    0.77    0.74    0.51    0.75
Grade 5
 ρ          0.25    0.15    0.20    NA      0.12
 R₂²        0.33    0.50    0.81    NA      0.70
Grade 8
 ρ          0.18    NA      0.23    NA      NA
 R₂²        0.77    NA      0.91    NA      NA
Grade 10
 ρ          0.15    NA      0.29    NA      NA
 R₂²        0.93    NA      0.95    NA      NA
___________________________________________________________________
Minimum Detectable Effect Sizes for Reading with a School-Level Pretest (Y₋₁) or a Student-Level Pretest (y₋₁) Lagged One Year

________________________________________________________
                     Grade 3   Grade 5   Grade 8   Grade 10
________________________________________________________
20 schools randomized
 No covariate        0.57      0.56      0.61      0.62
 Y₋₁                 0.37      0.38      0.24      0.16
 y₋₁                 0.38      0.40      0.28      0.15
40 schools randomized
 No covariate        0.39      0.38      0.42      0.42
 Y₋₁                 0.26      0.26      0.17      0.11
 y₋₁                 0.26      0.27      0.19      0.10
60 schools randomized
 No covariate        0.32      0.31      0.34      0.34
 Y₋₁                 0.21      0.21      0.13      0.09
 y₋₁                 0.21      0.22      0.15      0.08
________________________________________________________
Key Findings
Using a pretest improves precision dramatically.
This improvement increases appreciably from elementary school to middle school to high school because R₂² increases.
School-level pretests produce as much precision as do student-level pretests.
The effect of a pretest declines somewhat as the time between it and the post-test increases.
Adding a second pretest increases precision slightly.
Using a pretest for a different subject increases precision substantially.
Narrowing the sample to schools that are similar to each other does not improve precision beyond that achieved by a pretest.
Source

Bloom, Howard S., Lashawn Richburg-Hayes and Alison Rebeck Black (2007) "Using Covariates to Improve Precision for Studies that Randomize Schools to Evaluate Educational Interventions," Educational Evaluation and Policy Analysis, 29(1): 30–59.
A Tail of Two Tradeoffs
(“It was the best of techniques. It was the worst of techniques.”
Who the dickens said that?)
Why match pairs?
for face validity
for precision
How to match pairs?
rank order clusters by covariate
pair clusters in rank-ordered list
randomize clusters in each pair
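The three matching steps above can be sketched directly (school IDs and pretest values below are hypothetical):

```python
import random

# Pair matching for cluster randomization: rank clusters on a baseline
# covariate, pair adjacent clusters, then randomize within each pair.
def pair_and_randomize(clusters, rng=None):
    """clusters: list of (cluster_id, covariate) with an even length."""
    rng = rng or random.Random()
    ranked = sorted(clusters, key=lambda c: c[1])
    assignment = {}
    for i in range(0, len(ranked), 2):
        first, second = ranked[i][0], ranked[i + 1][0]
        if rng.random() < 0.5:          # coin flip within the pair
            first, second = second, first
        assignment[first] = "T"
        assignment[second] = "C"
    return assignment

schools = [("s1", 610), ("s2", 585), ("s3", 642),
           ("s4", 600), ("s5", 575), ("s6", 630)]
print(pair_and_randomize(schools, random.Random(0)))
```

Each pair contributes exactly one treatment and one control cluster, so treatment and control groups are balanced on the ranking covariate by construction.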
When to pair? When the gain in predictive power outweighs the loss of degrees of freedom
Degrees of freedom
J - 2 without pairing
J/2 - 1 with pairing
Deriving the Minimum Required Predictive Power of Pairing

Without pairing:

MDE(b_0)_GR = M_{J−2} · SE(b_0)_GR

With pairing:

MDE(b_0)_GR = M_{J/2−1} · √(1 − R²) · SE(b_0)_GR
Breakeven R²

R²_min = 1 − ( M_{J−2} / M_{J/2−1} )²
The Minimum Required Predictive Power of Pairing

Randomized      Required Predictive
Clusters (J)    Power (R²_min)*
6               0.52
8               0.35
10              0.26
20              0.11
30              0.07

* For a two-tail test.
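These breakeven values can be reproduced from t-distribution quantiles; a sketch using scipy, with the multiplier M_df = t_{0.975,df} + t_{0.80,df} for a two-tailed test at α = 0.05 with power 0.80:

```python
from scipy.stats import t

# Degrees-of-freedom multiplier for a two-tailed test.
def multiplier(df, alpha=0.05, power=0.80):
    return t.ppf(1 - alpha / 2, df) + t.ppf(power, df)

# Breakeven predictive power of pairing: pairing helps only if its R^2
# exceeds 1 - (M_{J-2} / M_{J/2-1})^2.
def r2_min(J):
    return 1 - (multiplier(J - 2) / multiplier(J // 2 - 1)) ** 2

for J in (6, 8, 10, 20, 30):
    print(J, f"{r2_min(J):.2f}")
```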
Blocking for face validity vs. blocking for precision
Treating blocks as fixed effects vs. random effects
Defining blocks using baseline information
Subgroup Analyses #1:
When to Emphasize Them
Confirmatory: Draw conclusions about the program’s effectiveness if results are
Consistent with theory and contextual factors
Statistically significant and large
And subgroup was pre-specified
Exploratory: Develop hypotheses for further study
Before the analysis, state that conclusions about the program will be based in part on findings for this set of subgroups
Pre-specification can be based on
Theory
Prior evidence
Policy relevance
When should we discuss subgroup findings?
Depends on
whether impacts differ significantly across subgroups
and perhaps on whether impacts for the full sample are statistically significant
Subgroup Analyses #2:
Creating Subgroups
Defining Features
Creating subgroups in terms of:
Program characteristics
Randomized group characteristics
Individual characteristics
Defining Subgroups by
Program Characteristics
Based only on program features that were randomized
Thus one cannot use implementation quality
Defining Subgroups by Characteristics
Of Randomized Groups
Types of impacts
Net impacts
Differential impacts
Internal validity
only use pre-existing characteristics
Precision
Net impact estimates are limited by reduced number of randomized groups
Differential impact estimates are triply limited (and often need four times as many randomized groups)
Defining Subgroups by
Characteristics of Individuals
Types of impacts
Net impacts
Differential impacts
Internal validity
Only use pre-existing characteristics
Only use subgroups with sample members from all randomized groups
Precision
For net impacts : can be almost as good as for full sample
For differential impacts: can be even better than for full sample
Differential Impacts by Gender

[Diagram: boys and girls within the program group and within the control group.]
The boy–girl gap in mean outcomes is computed within each experimental group: Y_PB − Y_PG for the program group and Y_CB − Y_CG for the control group. The differential impact by gender is the difference between these two gaps.
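A differential impact by gender contrasts the program-control impact for boys with that for girls; equivalently, it is (Y_PB − Y_PG) − (Y_CB − Y_CG). A tiny numeric sketch with made-up subgroup means:

```python
# Hypothetical subgroup means (Y_PB = program-group boys, etc.).
y_pb, y_pg = 72.0, 78.0   # program group: boys, girls
y_cb, y_cg = 68.0, 76.0   # control group: boys, girls

impact_boys = y_pb - y_cb                     # 4.0
impact_girls = y_pg - y_cg                    # 2.0
differential = (y_pb - y_pg) - (y_cb - y_cg)  # 2.0

# The two ways of writing the contrast agree:
print(impact_boys - impact_girls == differential)  # True
```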
Generalizing Results from
Multiple Sites and Blocks
Fixed vs. Random Effects Inference:
A Vexing Issue
Known vs. unknown populations
Broader vs. narrower inferences
Weaker vs. stronger precision
Few vs. many sites or blocks
Weighting sites or blocks:
Implicitly, through a pooled regression
Explicitly, based on the number of schools or the number of students
Explicitly, based on precision (fixed-effects or random-effects weights)
Bottom line: the question being addressed is what counts
Using Two-Level Data for Three-
Level Situations
The Issue
General Question: What happens when you design a study with randomized groups that comprise three levels, based on data that do not explicitly account for the middle level?
Specific Example: What happens when you design a study that randomizes schools (with students clustered in classrooms in schools) based on data for students clustered in schools?
3-level vs. 2-level Variance Components

_________________________________________________________________________
                                        3-Level Model
Outcome                                 School    Class    Student    Total
Expressive vocab (spring)               19.84     32.45    306.18     358.48
Stanford 9 Total Math Scaled Score      –         –        1273.15    1424.69
Stanford 9 Total Reading Scaled Score   –         –        1581.86    1849.56

                                        2-Level Model
Outcome                                 School    Student    Total
Expressive vocab (spring)               38.15     321.11     359.26
Stanford 9 Total Math Scaled Score      131.39    1293.24    1424.63
Stanford 9 Total Reading Scaled Score   181.77    1666.48    1848.25
_________________________________________________________________________
Sources: The Chicago Literacy Initiative: Making Better Early Readers study (CLIMBERs) database and the School Breakfast Pilot Project (SBPP) database.
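One implication of these variance components: when the classroom level is omitted, its variance is reabsorbed by the school and student levels, raising the apparent school-level intraclass correlation. For the expressive-vocabulary outcome:

```python
# Variance components for expressive vocabulary (spring), from the table above.
school3, class3, student3 = 19.84, 32.45, 306.18   # 3-level model
school2, student2 = 38.15, 321.11                  # 2-level model

# School-level share of total variance under each model:
rho_school_3 = school3 / (school3 + class3 + student3)
rho_school_2 = school2 / (school2 + student2)

print(f"{rho_school_3:.3f} {rho_school_2:.3f}")  # 0.055 0.106
```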
3-level vs. 2-level MDES for Original Sample

_____________________________________________________________________________
                                        3-Level Model            2-Level Model
Outcome                                 Uncond.    Cond.         Uncond.    Cond.
Expressive vocab (spring)               0.482      0.386         0.495      0.311
Stanford 9 Total Math Scaled Score      0.259      0.184         0.259      0.184
Stanford 9 Total Reading Scaled Score   0.261      0.148         0.264      0.150
_____________________________________________________________________________
Sources: The Chicago Literacy Initiative: Making Better Early Readers study (CLIMBERs) database and the School Breakfast Pilot Project (SBPP) database.
Further References
Bloom, Howard S. (2005) “Randomizing Groups to Evaluate Place-Based
Programs,” in Howard S. Bloom, editor, Learning More From Social
Experiments: Evolving Analytic Approaches (New York: Russell Sage
Foundation).
Bloom, Howard S., Lashawn Richburg-Hayes and Alison Rebeck Black (2005)
“Using Covariates to Improve Precision: Empirical Guidance for Studies that
Randomize Schools to Measure the Impacts of Educational Interventions” (New
York: MDRC).
Donner, Allan and Neil Klar (2000) Cluster Randomization Trials in Health
Research (London: Arnold).
Hedges, Larry V. and Eric C. Hedberg (2006) “Intraclass Correlation Values for
Planning Group Randomized Trials in Education” (Chicago: Northwestern
University).
Murray, David M. (1998) Design and Analysis of Group-Randomized Trials (New
York: Oxford University Press).
Raudenbush, Stephen W., Andres Martinez and Jessaca Spybrook (2005) “Strategies for Improving Precision in Group-Randomized Experiments” (University of
Chicago).
Raudenbush, Stephen W. (1997) “Statistical Analysis and Optimal Design for
Cluster Randomized Trials” Psychological Methods , 2(2): 173 – 185.
Schochet, Peter Z. (2005) “Statistical Power for Random Assignment Evaluations of
Education Programs,” (Princeton, NJ: Mathematica Policy Research).