Day 1 Session 1 slides - The Education Endowment Foundation

Building Evidence in Education:
Conference for EEF Evaluators
11th July: Theory
12th July: Practice
www.educationendowmentfoundation.org.uk
The EEF by numbers
• 33 topics in the Toolkit
• 1,800 schools participating in projects
• 16 independent evaluation teams
• £200m estimated spend over lifetime of the EEF
• 300,000 pupils involved in EEF projects
• 11 members of EEF team
• 3,000 heads presented to since launch
• 56 projects funded to date
Research Design
Stephen Gorard
s.a.c.gorard@durham.ac.uk
http://www.evaluationdesign.co.uk/
Outline of a full cycle of research
A model of causation in social science
Association - For X (a possible cause) and Y (a possible effect) to be in a
causal relationship they must be repeatedly associated. This association must
be strong and clearly observable. It must be replicable, and it must be specific
to X and Y.
Sequence – X and Y must occur in sequence. X must always precede Y
(where both appear), and the appearance of Y must be safely predictable from
the appearance of X.
Intervention - It must have been demonstrated repeatedly that an intervention
to change the strength or appearance of X strongly and clearly changes the
strength or appearance of Y.
Explanatory mechanism - There must be a coherent mechanism to explain the
causal link. This mechanism must be the simplest available without which the
evidence cannot be explained. Put another way, if the proposed mechanism
were not true then there must be no simpler or equally simple way of explaining
the evidence for it.
Red herrings and real problems. Some reflections on the evaluation of
Aimhigher
http://www.heacademy.ac.uk/assets/documents/aim_higher/AspireReflections_on_evaluation_of_Aimhigher.doc
In an influential review of Widening Participation (WP) research written for the HEFCE and
published in July 2006, Gorard et al (2006) harshly criticised the evaluation of WP initiatives.
In their view, no convincing evidence of impact had yet been produced for pre-entry
interventions for school pupils or partnership-based interventions, such as Aimhigher.
Gorard et al’s criticisms were addressed by the HEFCE in another review of WP research
published later in the same year, in November 2006, and based on a survey of the evidence
collected by the HEIs. It reasserted the value [of Aimhigher and other WP initiatives] as a
monitoring and evaluating device and emphasised that, to date, attitudes of learners and
teachers have been consistently and overwhelmingly positive. HEFCE feels satisfied that
convincing and precise evidence has been produced on attainment by the national evaluation
carried out by the National Foundation for Educational Research (NFER), and, to a lesser extent,
on HE participation by the NFER and the HEIs. For example, it has been found that participating in
Aimhigher activities was associated with ‘[a]n average improvement of 2.5 points in GCSE total
point scores’ and a ‘3.9 percentage point increase in Year 11 pupils intending to progress to HE’
(HEFCE 2006: 23). Moreover, ‘[i]f the ‘evidence bar’ is set too high’, the HEFCE (2006: 6-7) pointed
out, ‘we run the risk of discouraging any attempt to estimate the effectiveness of the interventions’.
There seems no scope for setting up a social science experiment in which the experiences
of a WP group are compared with those of a control group.
Session 1: Part 2: Trial design
(45 mins.)
Professor David Torgerson
Director, York Trials Unit,
University of York
david.torgerson@york.ac.uk
Professor Carole Torgerson
School of Education, Durham
University
carole.torgerson@durham.ac.uk
© 2008 Palgrave Macmillan
Key design issues
• Independent concealed randomisation
• Type of randomisation
• Types of trials
• Sample size
• Regression discontinuity design
Independent concealed randomisation
• One of the most important issues is the
need to undertake independent allocation.
• Many methodological studies have shown
that unless someone who is disinterested
in the trial results undertakes the
randomisation, there is a serious risk of
bias.
• In health trials, this is the source of bias
with the strongest supporting evidence.
Subversion of a health RCT
Clinician            Experimental   Control
All     (p < 0.01)        59           63
1       (p = 0.84)        62           61
2       (p = 0.60)        43           52
3       (p < 0.01)        57           72
4       (p < 0.001)       33           69
5       (p = 0.03)        47           72
Others  (p = 0.99)        64           59

[Figure: distribution of logit(p-values) by adequacy of allocation
concealment (Adequate, Unclear, Inadequate)]

Hewitt et al. BMJ 2005.
Type of randomisation
• Simple or restricted?
• Simple, similar to tossing a coin
» Advantages: difficult to go wrong; with large samples
(n > 100), and combined with ANCOVA, it is efficient
» Disadvantages: for small samples it can produce
imbalance and inefficiency in analysis (see the sketch below)
• Restricted, ensures better balance
» Advantages: achieves better balance and is more
efficient for small samples
» Disadvantages: more complicated; can go wrong
Restricted allocation
• Minimisation
» Not strictly randomisation; uses an algorithm to
ensure balance in covariates
• Stratified
» Using blocks of repeating allocations
produces balance on 1 or 2 variables (sketched below)
• Matched pairs
» Matches units (e.g., schools) and allocates
one to each group; can reduce power in some
cases and has other disadvantages
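A sketch of one restricted method, permuted blocks run separately within strata; the strata and counts below are hypothetical, not from any EEF trial:

```python
import random

def blocked_randomise(n, block_size=4, seed=None):
    """Permuted-block randomisation: every block of `block_size`
    contains equal numbers of each arm, forcing balance."""
    assert block_size % 2 == 0
    rng = random.Random(seed)
    allocation = []
    while len(allocation) < n:
        block = ["intervention", "control"] * (block_size // 2)
        rng.shuffle(block)
        allocation.extend(block)
    return allocation[:n]

# Stratified version: run the blocks separately within each stratum
# (e.g. school phase), so balance also holds within strata.
strata = {"primary": 10, "secondary": 10}   # illustrative counts
for i, (stratum, n) in enumerate(strata.items()):
    print(stratum, blocked_randomise(n, seed=i))
```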
Discussion (5 mins.)
• Discuss how randomisation was
undertaken in your EEF trial(s) and note
whether this was independent and
concealed, and whether it was restricted.
If so, what method was used?
Types of trial
• Individual randomisation
» Most powerful design for a given sample size
• Cluster design
» Randomises groups of individuals (classes;
schools; periods of time; geographical areas)
• Stepped wedge
» Type of cluster design; randomises order of
implementation so all schools eventually
receive intervention
Individual allocation
• Appropriate when it is possible to separate
intervention and control conditions
• The DISCOVER summer school evaluation is
using individual randomisation, as control
children cannot gain access to the intervention
• Many educational interventions are
delivered at class or school level – so
individual allocation can't be used
Variations on a theme
• Factorial designs
» Two trials for the price of one
• Unequal allocation
» When the sample size is fixed, equal allocation is
best; when costs are fixed, unequal allocation is best
(a back-of-envelope sketch follows below) –
DISCOVER is using unequal allocation for the
intervention to ensure efficient use of summer
school resources.
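A back-of-envelope sketch of the fixed-cost argument, using the standard square-root-of-cost-ratio rule; the cost figures are invented for illustration and are not DISCOVER's:

```python
import math

def optimal_allocation_ratio(cost_intervention, cost_control):
    """With a fixed budget, the variance of the effect estimate is
    minimised when n_intervention / n_control equals
    sqrt(cost_control / cost_intervention)."""
    return math.sqrt(cost_control / cost_intervention)

# Illustrative costs only: if an intervention place costs 9x a control
# place, allocate roughly 1 intervention participant per 3 controls.
ratio = optimal_allocation_ratio(cost_intervention=900, cost_control=100)
print(f"intervention:control ratio = {ratio:.2f}:1")   # 0.33:1
```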
Individual RCT: key points
• Trial registration
• Pre-test BEFORE randomisation
• Independent allocation
• Spill over/contamination must not exceed 30%,
or cluster allocation is more efficient
• Post-testing done blindly or in exam conditions,
marking done blindly
• Primary outcome specified before analysis
• Statistical analysis plan written and approved
before data are examined
Cluster allocation
• More complex to design than individual
RCT
• Many educational interventions need to
use cluster allocation
• Cluster allocation usually avoids
contamination and can make intervention
delivery logistically easier
Cluster allocation: additional key points
• Small number of clusters – so usually need to use
restricted randomisation
• Need to recruit participants and pre-test BEFORE cluster
allocation
• Teachers must be linked to class BEFORE
randomisation
• Analysis and sample size need to take clustering into
account
• Better to have large numbers of clusters with small
numbers per cluster than few clusters with large
numbers (a design-effect sketch follows below)
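A sketch of the last two points using the standard design effect, 1 + (m − 1) × ICC; the ICC and cluster configurations below are assumed values for illustration:

```python
def design_effect(cluster_size, icc):
    """Standard inflation factor for cluster randomisation:
    1 + (m - 1) * ICC, with m the average cluster size."""
    return 1 + (cluster_size - 1) * icc

# Same total n and ICC: many small clusters beat few large ones.
n_total, icc = 1000, 0.15          # assumed ICC, typical for attainment
for k, m in [(50, 20), (10, 100)]:  # k clusters of size m
    deff = design_effect(m, icc)
    print(f"{k} clusters of {m}: design effect {deff:.2f}, "
          f"effective n ~ {n_total / deff:.0f}")
```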
Variations on a theme
• What level of randomisation?
» Pupil > class > year > school
• Balanced design
» An efficient, balanced approach – Year 7 gets the
intervention in half the schools and Year 8 gets it in the
other half, with each school's adjacent year acting as control
» Or Year 7 in intervention schools gets a literacy intervention and
Year 7 in control schools gets maths
• Split plot
» Cluster-level allocation followed by individual randomisation; a
form of factorial design. The Exeter evaluation is using a partial split plot
Stepped wedge
• A form of cluster design, which may be
more efficient than a standard cluster design
• If we have 12 schools, all are pre-tested; 4 are
randomised to receive the intervention for the first 6
months, then all are tested; another 4 are given the
intervention, then all are tested; the final 4 are given the
intervention, then all are tested (a schedule is sketched below)
• Requires testing at every point
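A minimal sketch of the 12-school schedule just described, randomising the order in which schools cross over (the school names are placeholders):

```python
import random

schools = [f"school_{i:02d}" for i in range(1, 13)]   # placeholder names
rng = random.Random(42)
rng.shuffle(schools)              # randomise the order of implementation

waves = [schools[0:4], schools[4:8], schools[8:12]]

# All 12 schools are pre-tested, and all are tested at every step.
for step, wave in enumerate(waves, start=1):
    treated = [s for w in waves[:step] for s in w]
    print(f"step {step}: newly treated {wave}; "
          f"{len(treated)}/12 treated, all 12 tested")
```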
Discussion (5 mins.)
• Discuss the trial designs that have been
used and the challenges associated with
them.
Sample size calculation
• Most interventions will not work very well.
» Effect sizes of 0.20 to 0.30 – likely
» Effect sizes of 0.30 to 0.50 – unusual
» Effect sizes >0.50 – very unlikely
• Need large sample sizes to detect modest
differences. Example: 512 for 0.25; 800
for 0.20 (non-clustered design)
• A powerful covariate can reduce this
» A 0.70 correlation reduces the sample size by 50%
How to do it?
• Free programmes online
» PSPower; Optimal Design Software
• In your head (back of envelope) using an
approximation formula (i.e., 32/effect size
squared; sketched below)
• Fixed sample size
» Still good practice to estimate the likelihood of
detecting a difference.
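The back-of-envelope formula above (Lehr's approximation: total n ≈ 32/d² for 80% power at α = 0.05) in a short sketch, with the covariate adjustment from the previous slide; the (1 − r²) scaling is a standard result, not stated on the slides:

```python
def total_sample_size(effect_size, pretest_correlation=0.0):
    """Lehr's approximation: total n ~ 32 / d^2 gives 80% power at
    alpha = 0.05; a baseline covariate correlated r with the outcome
    scales this by (1 - r^2)."""
    return 32 / effect_size**2 * (1 - pretest_correlation**2)

print(total_sample_size(0.25))          # ~512, as on the slide
print(total_sample_size(0.20))          # ~800
print(total_sample_size(0.20, 0.70))    # ~408: roughly the 50% saving
```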
Pilot trials – sample size
• A modelling study suggests that a pilot with
10% of the main study's sample will
produce a one-sided 80% confidence interval
that will include the 'true' estimate if it
exists
Cocks K, Torgerson DJ. Sample size calculations for pilot randomised trials: a
confidence interval approach. Journal of Clinical Epidemiology 2013;66:197-201
Discussion
• Discuss how sample size calculations
were undertaken and whether sample
sizes are large enough to detect modest
differences between groups.
Regression discontinuity
• Theoretically, the most robust non-randomised
approach is the RD design
• Rediscovered several times since
Thistlethwaite and Campbell first described
it in the 1960s
What is it?
• Regression discontinuity, sometimes
known as the risk-based cut-off design,
selects people into a group on the basis of
a measurable continuous variable
• For example: age, test scores, waiting list,
income
How does it work?
• Selecting on a pre-test variable, we then
correlate post-test outcomes with the pre-test
variable and test to see if there is an
interruption, break or discontinuity in the
regression line
[Figures: regression lines for an effective treatment (a discontinuity at
the cut-off) and an ineffective treatment (no discontinuity)]
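A minimal sketch of this test: regress the post-test on the centred pre-test plus a treatment indicator, and read off the jump at the cut-off. All data are simulated for illustration; the model and effect size are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, cutoff = 500, 50.0

pretest = rng.uniform(0, 100, n)
treated = (pretest < cutoff).astype(float)   # e.g. mandated summer school
# Simulated outcome with a true 5-point jump at the cut-off.
posttest = 0.8 * pretest + 5.0 * treated + rng.normal(0.0, 5.0, n)

# OLS: posttest = a + b*(pretest - cutoff) + tau*treated
X = np.column_stack([np.ones(n), pretest - cutoff, treated])
coef, *_ = np.linalg.lstsq(X, posttest, rcond=None)
print(f"estimated discontinuity (tau): {coef[2]:.2f}")
```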
Do summer schools work?
• Some states in the USA mandate summer
schools for children who fall below a
certain score in a high stakes test
• But will sending children off to have extra
tuition during their summer break be
effective?
• Because the children are chosen on the
basis of a cut-point on a quantitative
scale, this is ideal RD territory
Jacob and Lefgren, Rev of Economics and Statistics, 2004,86:226-44.
[Figure: proportion treated by test scores]
[Figure: treatment against outcomes]
Evaluation of SHINE on secondaries
• Randomised controlled trial design not possible
• Regression discontinuity design with ‘tie-breaker randomisation’
• Advantages of this design
• Challenges of this design