Building Evidence in Education: Conference for EEF Evaluators
11th July: Theory
12th July: Practice
www.educationendowmentfoundation.org.uk

The EEF by numbers
• 33 topics in the Toolkit
• 1,800 schools participating in projects
• 16 independent evaluation teams
• £200m estimated spend over the lifetime of the EEF
• 300,000 pupils involved in EEF projects
• 11 members of the EEF team
• 3,000 heads presented to since launch
• 56 projects funded to date

Research Design
Stephen Gorard
s.a.c.gorard@durham.ac.uk
http://www.evaluationdesign.co.uk/

Outline of a full cycle of research

A model of causation in social science
Association – For X (a possible cause) and Y (a possible effect) to be in a causal relationship they must be repeatedly associated. This association must be strong and clearly observable. It must be replicable, and it must be specific to X and Y.
Sequence – X and Y must occur in sequence. X must always precede Y (where both appear), and the appearance of Y must be safely predictable from the appearance of X.
Intervention – It must have been demonstrated repeatedly that an intervention to change the strength or appearance of X strongly and clearly changes the strength or appearance of Y.
Explanatory mechanism – There must be a coherent mechanism to explain the causal link. This mechanism must be the simplest available without which the evidence cannot be explained. Put another way, if the proposed mechanism were not true then there must be no simpler or equally simple way of explaining the evidence for it.

Red herrings and real problems: some reflections on the evaluation of Aimhigher
http://www.heacademy.ac.uk/assets/documents/aim_higher/AspireReflections_on_evaluation_of_Aimhigher.doc
In an influential review of Widening Participation (WP) research written for HEFCE and published in July 2006, Gorard et al. (2006) harshly criticised the evaluation of WP initiatives. In their view, no convincing evidence of impact had yet been produced for pre-entry interventions for school pupils or for partnership-based interventions such as Aimhigher. Gorard et al.'s criticisms were addressed by HEFCE in another review of WP research, published later the same year (November 2006) and based on a survey of the evidence collected by the HEIs. It reasserted the value [of Aimhigher and other WP initiatives] as a monitoring and evaluating device and emphasised that, to date, the attitudes of learners and teachers have been consistently and overwhelmingly positive. HEFCE feels satisfied that convincing and precise evidence has been produced on attainment by the national evaluation carried out by the National Foundation for Educational Research (NFER) and, to a lesser extent, on HE participation by the NFER and the HEIs. For example, it has been found that participating in Aimhigher activities was associated with '[a]n average improvement of 2.5 points in GCSE total point scores' and a '3.9 percentage point increase in Year 11 pupils intending to progress to HE' (HEFCE 2006: 23). Moreover, '[i]f the "evidence bar" is set too high', HEFCE (2006: 6-7) pointed out, 'we run the risk of discouraging any attempt to estimate the effectiveness of the interventions'. There seems no scope for setting up a social science experiment in which the experiences of a WP group are compared with those of a control group.

Session 1, Part 2: Trial design (45 mins.)
Professor David Torgerson, Director, York Trials Unit, University of York – david.torgerson@york.ac.uk
Professor Carole Torgerson, School of Education, Durham University – carole.torgerson@durham.ac.uk
(Torgerson & Torgerson, 2008, Palgrave Macmillan)

Key design issues
• Independent concealed randomisation
• Type of randomisation
• Types of trials
• Sample size
• Regression discontinuity design

Independent concealed randomisation
• One of the most important issues is the need to undertake independent allocation.
• Many methodological studies have shown that unless someone who is disinterested in the trial's results undertakes the randomisation, there is a serious risk of bias.
• In health trials this is the source of bias with the most supporting evidence.

Subversion of a health RCT
Clinician | p-value | Experimental | Control
All       | p < 0.01  | 59 | 63
1         | p = 0.84  | 62 | 61
2         | p = 0.60  | 43 | 52
3         | p < 0.01  | 57 | 72
4         | p < 0.001 | 33 | 69
5         | p = 0.03  | 47 | 72
Others    | p = 0.99  | 64 | 59
[Figure: distributions of logit(p-values) for baseline comparisons, by adequacy of allocation concealment (adequate / unclear / inadequate). Hewitt et al., BMJ, 2005.]

Type of randomisation
• Simple or restricted?
• Simple: similar to tossing a coin
  » Advantages: difficult to get wrong; with large samples (n > 100), and combined with ANCOVA, it is efficient
  » Disadvantages: with small samples it can produce imbalance and inefficiency in analysis
• Restricted: ensures better balance
  » Advantages: achieves better balance and is more efficient for small samples
  » Disadvantages: more complicated; can go wrong

Restricted allocation
• Minimisation
  » Not strictly randomisation; uses an algorithm to ensure balance in covariates
• Stratified
  » Using blocks of repeating allocations produces balance on one or two variables (a short code sketch of blocked allocation appears at the end of this part)
• Matched pairs
  » Matches units (e.g., schools) and allocates one of each pair to each group; can reduce power in some cases and has other disadvantages

Discussion (5 mins.)
• Discuss how randomisation was undertaken in your EEF trial(s) and note whether it was independent and concealed, and whether it was restricted. If so, what method was used?

Types of trial
• Individual randomisation
  » The most powerful design for a given sample size
• Cluster design
  » Randomises groups of individuals (classes; schools; periods of time; geographical areas)
• Stepped wedge
  » A type of cluster design; randomises the order of implementation so that all schools eventually receive the intervention

Individual allocation
• Appropriate when it is possible to separate the intervention and control conditions
• The DISCOVER summer school evaluation uses individual randomisation because control children cannot gain access to the intervention
• Many educational interventions are delivered at class or school level, so individual allocation cannot be used

Variations on a theme
• Factorial designs
  » Two trials for the price of one
• Unequal allocation
  » When the sample size is fixed, equal allocation is best; when costs are fixed, unequal allocation is best
  » DISCOVER uses unequal allocation to ensure efficient use of summer school resources
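To make the contrast between simple and restricted allocation concrete, here is a minimal sketch of permuted-block (restricted) randomisation run separately within strata. It is illustrative only: the function name, block size, strata and seeds are all invented for this example, and it is not the allocation procedure of any EEF trial.

```python
# Illustrative sketch only: permuted-block randomisation within strata.
# All names, the block size and the seeds are invented for this example.
import random

def block_randomise(n_units, block_size=4, seed=2013):
    """Allocate n_units to intervention/control in permuted blocks.

    Each block contains equal numbers of each arm, so group sizes can
    never differ by more than block_size / 2 at any point in the list.
    """
    rng = random.Random(seed)  # fixed seed keeps the list reproducible and auditable
    allocations = []
    while len(allocations) < n_units:
        block = ["intervention"] * (block_size // 2) + ["control"] * (block_size // 2)
        rng.shuffle(block)     # random order within each block
        allocations.extend(block)
    return allocations[:n_units]

# Stratified version: run the blocking separately within each stratum
# (e.g., each school), which balances allocation on that variable.
strata = {"school_A": 10, "school_B": 6}
for i, (school, n_pupils) in enumerate(strata.items()):
    print(school, block_randomise(n_pupils, seed=2013 + i))
```

Whatever tool generates such a list, the point of the slides stands: the person producing it should be independent of recruitment, and the list must stay concealed from those enrolling pupils.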
Individual RCT: key points
• Trial registration
• Pre-test BEFORE randomisation
• Independent allocation
• Spill-over/contamination must not exceed 30%; above that, cluster allocation is more efficient
• Post-testing done blindly or in exam conditions, with marking done blindly
• Primary outcome specified before analysis
• Statistical analysis plan written and approved before the data are examined

Cluster allocation
• More complex to design than an individual RCT
• Many educational interventions need to use cluster allocation
• Cluster allocation usually avoids contamination and can make intervention delivery logistically easier

Cluster allocation: additional key points
• Small number of clusters, so usually need to use restricted randomisation
• Need to recruit participants and pre-test BEFORE cluster allocation
• Teachers must be linked to a class BEFORE randomisation
• Analysis and sample size need to take clustering into account
• Better to have a large number of clusters with few pupils per cluster than a few clusters with many pupils each

Variations on a theme
• What level of randomisation?
  » Pupil > class > year > school
• Balanced design
  » An efficient design is a balanced approach: Year 7 receives the intervention in half the schools and Year 8 in the other half, with each school's adjacent year acting as the control
  » Or Year 7 in intervention schools receives a literacy intervention while Year 7 in control schools receives maths
• Split plot
  » Cluster-level allocation followed by individual randomisation; a form of factorial design. The Exeter evaluation uses a partial split plot

Stepped wedge
• A form of cluster design, which may be more efficient than a standard cluster design
• With 12 schools: all are pre-tested; 4 are randomised to receive the intervention for the first 6 months, then all are tested; another 4 are given the intervention, then all are tested; the final 4 are given the intervention, then all are tested
• Requires testing at every point

Discussion (5 mins.)
• Discuss the trial designs that have been used and the challenges associated with them.

Sample size calculation
• Most interventions will not work very well
  » Effect sizes of 0.20 to 0.30: likely
  » Effect sizes of 0.30 to 0.50: unusual
  » Effect sizes above 0.50: very unlikely
• Large sample sizes are needed to detect modest differences. Example: 512 for an effect size of 0.25; 800 for 0.20 (non-clustered design)
• A powerful covariate can reduce this
  » A 0.70 correlation between pre-test and outcome reduces the required sample size by roughly 50%

How to do it?
• Free programmes online
  » PS Power; Optimal Design Software
• In your head (back of envelope) using the approximation formula: total n ≈ 32 / (effect size)² (a worked sketch follows at the end of this part)
• Fixed sample size
  » Still good practice to estimate the likelihood of detecting a difference

Pilot trials: sample size
• A modelling study suggests that a pilot with 10% of the main study's sample will produce a one-sided 80% confidence interval that will include the 'true' estimate, if it exists
• Cocks K, Torgerson DJ. Sample size calculations for pilot randomised trials: a confidence interval approach. Journal of Clinical Epidemiology 2013;66:197-201

Discussion
• Discuss how sample size calculations were undertaken and whether sample sizes are large enough to detect modest differences between groups.
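The back-of-envelope arithmetic above is easy to script. The sketch below combines the 32/ES² approximation with the deflation from a correlated pre-test (a factor of 1 − r²) and, for cluster designs, inflation by the usual design effect 1 + (m − 1) × ICC. The function name and the example ICC value are assumptions for illustration, not EEF-prescribed figures.

```python
# Back-of-envelope total sample sizes (~80% power, 5% two-sided alpha).
# A sketch of the approximations above; the example ICC is an assumption.

def total_n(effect_size, pretest_r=0.0, cluster_size=None, icc=0.0):
    """Approximate total sample size across both trial arms.

    32 / ES^2 is the rule of thumb for a simple two-arm trial; a pre-test
    correlated r with the outcome deflates it by (1 - r^2), and clustering
    inflates it by the design effect 1 + (m - 1) * ICC.
    """
    n = 32 / effect_size ** 2
    n *= 1 - pretest_r ** 2                      # ANCOVA gain from a strong covariate
    if cluster_size is not None:
        n *= 1 + (cluster_size - 1) * icc        # design effect for cluster allocation
    return round(n)

print(total_n(0.25))                             # 512, as on the slide
print(total_n(0.20))                             # 800
print(total_n(0.20, pretest_r=0.70))             # 408: roughly halved by the covariate
print(total_n(0.20, cluster_size=25, icc=0.15))  # 3680: clustering is expensive
```

The last line also illustrates the earlier point that many small clusters beat a few large ones: with 5 pupils per cluster rather than 25, the same ICC gives a total of about 1,280 rather than 3,680.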
Regression discontinuity
• Theoretically the most robust non-randomised approach is the RD design
• Rediscovered several times since Thistlethwaite and Campbell first described it in the 1960s

What is it?
• Regression discontinuity, sometimes known as the risk-based cut-off design, selects people into a group on the basis of a measurable continuous variable
• For example: age, test scores, waiting-list position, income

How does it work?
• Having selected on a pre-test variable, we correlate post-test outcomes with the pre-test variable and test whether there is an interruption, break or discontinuity in the regression line (a minimal analysis sketch follows at the end of this section)
[Figure: outcome plotted against pre-test score around the cut-off, contrasting an effective treatment (a visible discontinuity at the cut-off) with an ineffective treatment (no discontinuity)]

Do summer schools work?
• Some states in the USA mandate summer schools for children who fall below a certain score in a high-stakes test
• But will sending children off for extra tuition during their summer break be effective?
• Because the children are chosen on the basis of a cut-point on a quantitative scale, this is ideal RD territory
• Jacob and Lefgren, Review of Economics and Statistics, 2004;86:226-44
[Figures: proportion treated by test score; treatment plotted against outcomes]

Evaluation of SHINE on secondaries
• Randomised controlled trial design not possible
• Regression discontinuity design with 'tie-breaker randomisation'
• Advantages of this design
• Challenges of this design
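As promised above, here is a minimal analysis sketch of the RD idea: regress the post-test on a treatment indicator plus the centred pre-test, and estimate the jump at the cut-off. The data are simulated and every name is invented for illustration; a real analysis (e.g., for SHINE) would also examine functional form, the bandwidth around the cut-off, and covariates.

```python
# Minimal regression-discontinuity sketch on simulated data.
# All data and names are invented for illustration only.
import numpy as np

rng = np.random.default_rng(0)

# Simulated pre-test scores; pupils below the cut-off get the intervention.
n, cutoff = 400, 50.0
pretest = rng.uniform(0, 100, n)
treated = (pretest < cutoff).astype(float)
true_effect = 5.0  # the discontinuity we will try to recover
posttest = 20 + 0.6 * pretest + true_effect * treated + rng.normal(0, 8, n)

# Regress post-test on the treatment indicator and the centred pre-test;
# the coefficient on `treated` estimates the jump at the cut-off.
X = np.column_stack([np.ones(n), treated, pretest - cutoff])
coef, *_ = np.linalg.lstsq(X, posttest, rcond=None)
print(f"estimated discontinuity at the cut-off: {coef[1]:.2f}")  # ~5
```

Tie-breaker randomisation, as used in the SHINE evaluation, adds a band around the cut-off within which pupils are randomised rather than allocated deterministically, strengthening the causal comparison exactly where the regression lines would otherwise have to be extrapolated.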