Developing evidence-based products using the systematic review process
Session 4/Unit 10: Assessing study quality
Carole Torgerson
November 13th 2007
NCDDR training course for NIDRR grantees

Assessing study quality or critical appraisal
• To investigate whether the individual studies in the review are affected by bias. Systematic errors in a study can bias its results by overestimating or underestimating effects; bias in an individual study, or in individual studies in the review, can in turn bias the results of the review.
• To make a judgement about the weight of evidence that should be placed on each study (higher weight is given to studies of higher quality in a meta-analysis).
• To investigate differences in effect sizes between high and low quality studies in a meta-regression.

Coding for study quality
• Collect and record information about each study on quality of design and quality of analysis (internal validity).
• Base the quality assessment judgement about each study on this information.
• Use the quality assessment judgements of all included studies to inform the synthesis (meta-analysis):
» only use findings from studies judged to be of high quality, or qualify the findings;
» look for homogeneity/heterogeneity;
» examine differences in findings according to quality (sensitivity analysis; a sketch follows).
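One way to run such a sensitivity analysis is to pool the high- and low-quality studies separately and compare the pooled estimates. A minimal sketch, using hypothetical effect sizes and standard errors and simple fixed-effect (inverse-variance) pooling:

```python
import numpy as np

# Hypothetical coded studies: effect size (standardised mean difference),
# its standard error, and the quality judgement from the coding stage.
studies = [
    {"es": 0.45, "se": 0.20, "quality": "high"},
    {"es": 0.30, "se": 0.15, "quality": "high"},
    {"es": 0.80, "se": 0.25, "quality": "low"},
    {"es": 0.70, "se": 0.30, "quality": "low"},
]

def pooled_effect(subset):
    """Fixed-effect (inverse-variance) pooled effect size and its SE."""
    es = np.array([s["es"] for s in subset])
    w = 1.0 / np.array([s["se"] for s in subset]) ** 2  # weight = 1 / SE^2
    pooled = np.sum(w * es) / np.sum(w)
    return pooled, np.sqrt(1.0 / np.sum(w))

for q in ("high", "low"):
    pooled, se = pooled_effect([s for s in studies if s["quality"] == q])
    print(f"{q}-quality studies: pooled ES = {pooled:.2f} "
          f"(95% CI {pooled - 1.96 * se:.2f} to {pooled + 1.96 * se:.2f})")
```

If the low-quality studies pool to a noticeably larger effect than the high-quality ones, that is a warning that bias in the weaker studies may be inflating the overall result.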
“A careful look at randomized experiments will make clear that they are not the gold standard. But then, nothing is. And the alternatives are usually worse.”
Berk RA. (2005) Journal of Experimental Criminology 1, 417-433.

Code for design: experiments/quasi-experiments; fidelity to random allocation; concealment
• Code for whether the study is an RCT, a quasi-experiment (specify) or other (specify).
• Is the method of assignment unclear?
» Ascertain how assignment was undertaken and code for this; if unclear, it may be necessary to contact the authors for clarification.
» Look for confusion between non-random and random assignment – the former can lead to bias.
• If an RCT:
» Look for and code assignment discrepancies, e.g. failure to keep to the random allocation.
» Code for whether or not the allocation was concealed.

Which studies are RCTs?
• “We took two groups of schools – one group had high ICT use and the other low ICT use – we then took a random sample of pupils from each school and tested them.”
• “We put the students into two groups, we then randomly allocated one group to the intervention whilst the other formed the control.”
• “We formed the two groups so that they were approximately balanced on gender and pre-test scores.”
• “We identified 200 children with a low reading age and then randomly selected 50 to whom we gave the intervention. They were then compared to the remaining 150.”
• “Of the eight [schools] two randomly chosen schools served as a control group.”

Is it randomised?
“The groups were balanced for gender and, as far as possible, for school. Otherwise, allocation was randomised.”
Thomson et al. Br J Educ Psychology 1998;68:475-91.

Is it randomised?
“The students were assigned to one of three groups, depending on how revisions were made: exclusively with computer word processing, exclusively with paper and pencil or a combination of the two techniques.”
Greda and Hannafin, J Educ Res 1992;85:144.

Mixed allocation
“Students were randomly assigned to either Teen Outreach participation or the control condition either at the student level (i.e., sites had more students sign up than could be accommodated and participants and controls were selected by picking names out of a hat or choosing every other name on an alphabetized list) or less frequently at the classroom level.”
Allen et al, Child Development 1997;64:729-42.

Non-random assignment confused with random allocation
“Before mailing, recipients were randomized by rearranging them in alphabetical order according to the first name of each person. The first 250 received one scratch ticket for a lottery conducted by the Norwegian Society for the Blind, the second 250 received two such scratch tickets, and the third 250 were promised two scratch tickets if they replied within one week.”
Finsen V, Storeheier AH. (2006) Scratch lottery tickets are a poor incentive to respond to mailed questionnaires. BMC Medical Research Methodology 6, 19. doi:10.1186/1471-2288-6-19.

Misallocation issues
“23 offenders from the treatment group could not attend the CBT course and they were then placed in the control group.”

Concealed allocation – why is it important?
» Good evidence from multiple sources shows that effect sizes in RCTs where randomisation was not independently conducted were larger than in RCTs that used independent assignment methods.
» A wealth of evidence indicates that unless random assignment was undertaken by an independent third party, subversion of the allocation may have occurred (leading to selection bias and exaggeration of any differences between the groups).

Allocation concealment: a meta-analysis
• Schulz and colleagues took a database of 250 randomised trials in the field of pregnancy and childbirth.
• The trials were divided into three groups with respect to concealment:
» good concealment (difficult to subvert);
» unknown (not enough detail in the paper);
» poor (e.g., randomisation list on a public notice board).
• They found exaggerated effect sizes for poorly concealed compared with well concealed randomisation.

Comparison of adequate, unclear and inadequate concealment

  Allocation concealment   Effect size (OR)
  Adequate                 1.0
  Unclear                  0.67
  Inadequate               0.59
  (P < 0.01)

Schulz et al. JAMA 1995;273:408.

Small vs large trials
• Small trials tend to give greater effect sizes than large trials: this shouldn’t happen.
• Kjaergard et al. showed this phenomenon was due to poor allocation concealment in small trials; when trials were grouped by allocation method, ‘secure’ allocation reduced the effect by 51%.
Kjaergard et al. Ann Intern Med 2001;135:982.

Case study
• Subversion is rarely reported for individual studies.
• One study where it has been reported was a large, multi-centred surgical trial.
• Participants were randomised at 5+ centres using sealed envelopes (sealed envelopes can be opened in advance, so the recruiting researcher can select participants into groups rather than leaving this to the randomisation).
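Subversion of this kind tends to surface as baseline imbalance, which is what the two tables that follow show, clinician by clinician. A minimal sketch of that kind of baseline check, assuming individual baseline ages are available by arm (hypothetical data):

```python
import numpy as np
from scipy import stats

# Hypothetical baseline ages by allocated arm for one recruiting clinician.
experimental = np.array([52, 33, 47, 61, 58, 40, 35, 49])
control      = np.array([70, 66, 73, 75, 68, 71, 64, 69])

# Welch's t-test: under honest randomisation, baseline age should differ
# only by chance, so a run of very small p-values is a red flag.
t, p = stats.ttest_ind(experimental, control, equal_var=False)
print(f"mean exp = {experimental.mean():.1f}, "
      f"mean ctrl = {control.mean():.1f}, p = {p:.4f}")
```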
Mean ages of groups

  Clinician   Experimental   Control   p
  All         59             63        p < 0.01
  1           62             61        p = 0.84
  2           43             52        p = 0.60
  3           57             72        p < 0.01
  4           33             69        p < 0.001
  5           47             72        p = 0.03
  Others      64             59        p = 0.99

Using telephone allocation

  Clinician   Experimental   Control   p
  All         59             57        p = 0.37
  1           57             57        p = 0.62
  2           60             51        p = 0.24
  3           61             70        NA
  4           63             65        p = 0.99
  5           57             62        p = 0.91
  Others      59             56        p = 0.99

Recent blocked trial
“This was a block randomised study (four patients to each block) with separate randomisation at each of the three centres. Blocks of four cards were produced, each containing two cards marked with ‘nurse’ and two marked with ‘house officer.’ Each card was placed into an opaque envelope and the envelope sealed. The block was shuffled and, after shuffling, was placed in a box.”
Kinley et al., BMJ 2002;325:1323.

• Block randomisation is a method of ensuring numerical balance; in this case, blocking was by centre.
• If block randomisation of four was used, then the numbers in the two groups at each centre should not differ by more than two participants (a code sketch follows).
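A minimal sketch of block randomisation in code, assuming blocks of four and the two professional groups from the Kinley trial (purely illustrative):

```python
import random

def blocked_sequence(n_blocks,
                     block=("nurse", "nurse", "house officer", "house officer")):
    """Generate a block-randomised allocation list. Each block of four
    contains two of each group, so at any point in recruitment the group
    sizes can never differ by more than two."""
    sequence = []
    for _ in range(n_blocks):
        b = list(block)
        random.shuffle(b)  # shuffle within the block, as the cards were
        sequence.extend(b)
    return sequence

# e.g. a separate sequence per centre, as in the Kinley et al. trial
print(blocked_sequence(n_blocks=5))
```

In practice the sequence should be generated and held by an independent third party, so that recruiters cannot predict or override the next assignment.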
Problem?

              Southampton      Sheffield        Doncaster
              Doctor  Nurse    Doctor  Nurse    Doctor  Nurse
              500     511      308     319      118     118

Kinley et al., BMJ 2002;325:1323.

Examples of good allocation concealment
• “Randomisation by centre was conducted by personnel who were not otherwise involved in the research project.” [1]
• Distant assignment was used to: “protect overrides of group assignment by the staff, who might have a concern that some cases receive home visits regardless of the outcome of the assignment process.” [2]
[1] Cohen et al. (2005) J of Speech Language and Hearing Res. 48, 715-729.
[2] Davis RG, Taylor BG. (1997) Criminology 35, 307-333.

Assignment discrepancy
• “Pairs of students in each classroom were matched on a salient pretest variable, Rapid Letter Naming, and randomly assigned to treatment and comparison groups.”
• “The original sample – those students were tested at the beginning of Grade 1 – included 64 assigned to the SMART program and 63 assigned to the comparison group.”
Baker S, Gersten R, Keating T. (2000) When less may be more: A 2-year longitudinal evaluation of a volunteer tutoring program requiring minimal training. Reading Research Quarterly 35, 494-519.

Change in concealed allocation
[Bar chart: percentage of trials using concealed allocation, <1997 vs >1996; drug trials P = 0.04, non-drug trials P = 0.70.]
NB: No education trial used concealed allocation.

Example of unbalanced trial affecting results
• Trowman and colleagues undertook a systematic review to see if calcium supplements were useful for helping weight loss among overweight people.
• The meta-analysis of final weights showed a statistically significant benefit of calcium supplements. HOWEVER, a meta-analysis of baseline weights showed that most of the trials had ‘randomised’ lower weight people into the intervention group. When this was taken into account, there was no longer any difference.

Meta-analysis of baseline body weight
[Forest plot: meta-analysis of baseline body weight across the included trials.]
Trowman R et al. (2006) A systematic review of the effects of calcium supplementation on body weight. British Journal of Nutrition 95, 1033-38.

Summary of assignment and concealment
• Code for whether the study is an RCT, a quasi-experiment (specify) or other (specify).
• There is increasing evidence that subversion of random allocation is a problem in randomised trials. The ‘gold standard’ method of random allocation is a secure third-party method.
• Code whether or not the trial reports that an independent method of allocation was used. Poor quality trials use sealed envelopes, do not specify the allocation method, or use allocation methods within the control of the researcher (e.g., tossing a coin).
• Code for assignment discrepancies, e.g. failure to keep to the random allocation.

5 minute break!

Other design issues
• Attrition (drop-out) can introduce selection bias.
• Unblinded ascertainment (outcome measurement) can lead to ascertainment bias.
• Small samples can lead to Type II errors (concluding there is no difference when there is a difference).
• Multiple statistical tests can give Type I errors (concluding there is a difference when this is due to chance).
• Poor reporting of uncertainty (e.g., lack of confidence intervals).

Coding for other design characteristics
• Code for attrition in the intervention and control groups.
• Code for whether or not participants are ‘blinded’.
• Code for whether or not there is blinded assessment of outcome.
• Code for whether or not the sample size is adequate.
• Code for whether the primary and secondary outcomes are pre-specified.

Blinding of participants and investigators
• Participants can be blinded to:
» the research hypotheses;
» the nature of the control or experimental condition;
» whether or not they are taking part in a trial.
• This may help to reduce bias from resentful demoralisation.
• Investigators should be blinded (if possible) to follow-up tests, as this eliminates ‘ascertainment’ bias – where, consciously or unconsciously, investigators ascribe a better outcome than the truth, based on their knowledge of the assigned groups.

Blinding of outcome assessment
• Code for whether or not post-tests were administered by someone unaware of the group allocation. Ascertainment bias can result when the assessor is not blind to group assignment; e.g., a homeopathy study of histamine showed an effect when researchers were not blind to the assignment but no effect when they were.
• Example of outcome assessment blinding: the study “was implemented with blind assessment of outcome by qualified speech language pathologists who were not otherwise involved in the project.”
Cohen et al. (2005) J of Speech Language and Hearing Res. 48, 715-729.

Blinded outcome assessment
[Bar chart: percentage of trials with blinded outcome assessment, <1997 vs >1996, in health and education; P = 0.13 and P = 0.03.]
Torgerson CJ, Torgerson DJ, Birks YF, Porthouse J. (2005) A comparison of randomised controlled trials in health and education. British Educational Research Journal 31, 761-785.

Statistical power
• Few effective educational interventions produce large effect sizes, especially when the comparator group is an ‘active’ intervention. In a tightly controlled setting, 0.5 of a standard deviation difference at post-test is good; smaller effect sizes in field trials are to be expected (e.g., 0.25).
• To detect an effect size of 0.5 with 80% power (significance level 0.05), we need 128 participants for an individually randomised experiment (the calculation is sketched below).
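A minimal sketch of where a figure like 128 comes from, using the standard normal approximation for a two-arm comparison of means (the function name is illustrative):

```python
from math import ceil
from scipy.stats import norm

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-sample
    comparison of means at a given standardised effect size."""
    z_alpha = norm.ppf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)           # 0.84 for 80% power
    return ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)

n = n_per_group(0.5)
print(n, "per group,", 2 * n, "in total")  # 63 per group by this formula
```

The slide’s round figure of 128 matches the common rule of thumb of n = 16/d² per group (64 per group for d = 0.5); the normal approximation gives 63 per group, essentially the same answer.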
Percentage of trials underpowered (n < 128)
[Bar chart: percentage of underpowered trials, <1997 vs >1996, in health and education; P = 0.22 and P = 0.76.]
Torgerson CJ, Torgerson DJ, Birks YF, Porthouse J. (2005) A comparison of randomised controlled trials in health and education. British Educational Research Journal 31, 761-785.

Code for analysis issues
• Code for whether, once randomised, all participants are included within their allocated groups for analysis (i.e., was intention to treat analysis used?).
• Code for whether a single analysis was pre-specified before data analysis.

Attrition
• Attrition can lead to bias; a high quality trial will have maximal follow-up after allocation.
• It can be difficult to ascertain the amount of attrition and whether or not attrition rates are comparable between groups.
• A good trial reports low attrition with no between-group differences.
• Rule of thumb: 0-5% is not likely to be a problem; 6% to 20% is worrying; > 20% suggests selection bias.

Poorly reported attrition
• In an RCT of foster carers, extra training was given.
» “Some carers withdrew from the study once the dates and/or location were confirmed; others withdrew once they realized that they had been allocated to the control group.” “117 participants comprised the final sample.”
• No split between groups is given except in one table, which shows 67 in the intervention group and 50 in the control group. There are 25% more in the intervention group – unequal attrition is a hallmark of potential selection bias. But we cannot be sure.
Macdonald & Turner, Brit J Social Work (2005) 35, 1265.

What is the problem here?
» Random allocation: 160 children in 20 schools (8 per school); 80 in each group.
» 1 school (8 children) withdrew.
» N = 17 children were replaced following discussion with teachers.
» 76 children allocated to control; 76 allocated to the intervention group.

• In this example one school withdrew and pupils were lost from both groups – unlikely to be a source of bias.
• BUT 17 pupils were withdrawn by teachers because they did not want them to have their allocated intervention, and these were replaced by 17 others.
• This WILL introduce bias into the experiment. Note: such a trial should be regarded as a quasi-experiment.

What about matched pairs?
• It is sometimes stated that selection bias due to attrition can be avoided using a matched pairs design, whereby the survivor of a pair is removed from the analysis (1).
• But we can only match on observable variables; we trust to randomisation to ensure that unobserved covariates or confounders are equally distributed between groups.
• Using matched pairs won’t remove attrition bias from the unknown covariate.
(1) Farrington DP, Welsh BC. (2005) Randomized experiments in criminology: What have we learned in the last two decades? Journal of Experimental Criminology 1, 9-38.

Pairs matched on gender

  Control (unknown covariate)   Intervention (unknown covariate)
  Boy (high)                    Boy (low)
  Girl (high)                   Girl (high)
  Girl (low)                    Girl (high)
  Boy (high)                    Boy (low)
  Girl (low)                    Girl (high)

  3 girls and 3 highs           3 girls and 3 highs

Drop-out of 1 girl

  Control                       Intervention
  Boy (high)                    Boy (low)
  Girl (high)                   Girl (high)
  Girl (low)                    Girl (high)
  Boy (high)                    Boy (low)
  –                             Girl (high)

  2 girls and 3 highs           3 girls and 3 highs

Removing the matched pair does not balance the groups!

  Control                       Intervention
  Boy (high)                    Boy (low)
  Girl (high)                   Girl (high)
  Girl (low)                    Girl (high)
  Boy (high)                    Boy (low)

  2 girls and 3 highs           2 girls and 2 highs

Intention to treat (ITT)
• Randomisation ensures the abolition of selection bias at baseline; after randomisation some participants may cross over into the opposite treatment group (e.g., fail to take the allocated treatment, or obtain the experimental intervention elsewhere).
• There is often a temptation for trialists to analyse the groups as treated rather than as randomised.
• This is incorrect and can introduce selection bias (a sketch of the distinction follows).
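A minimal sketch of the distinction, using hypothetical data (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "randomised": ["int", "int", "int", "ctrl", "ctrl", "ctrl"],
    "received":   ["int", "ctrl", "int", "ctrl", "ctrl", "int"],  # cross-overs
    "score":      [12, 7, 10, 8, 6, 11],
})

# Intention to treat: analyse everyone in the group they were randomised to,
# whatever treatment they actually received.
print(df.groupby("randomised")["score"].mean())

# As treated: analyse by treatment actually received -- this re-introduces
# self-selection and can bias the comparison.
print(df.groupby("received")["score"].mean())
```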
ITT analysis: examples
• Seven participants allocated to the control condition (1.6%) received the intervention, whilst 65 allocated to the intervention failed to receive treatment (15%) (1). The authors, however, analysed by randomised group (the CORRECT approach).
• “It was found in each sample that approximately 86% of the students with access to reading supports used them. Therefore, one-way ANOVAs were computed for each school sample, comparing this subsample with subjects who did not have access to reading supports.” (2) (INCORRECT)
(1) Davis RG, Taylor BG. (1997) Criminology 35, 307-333.
(2) Feldman SC, Fish MC. (1991) Journal of Educational Computing Research 7, 25-36.

Unit of allocation
• Participants can be randomised individually (the most usual approach) or as groups.
• The latter is known as a cluster (or group, or place) randomised controlled trial.
• Often it is not possible to randomise individuals, for example:
» when evaluating training interventions delivered to clinicians or teachers while measuring outcomes on patients or students;
» when there is spill-over or contamination of the control group.

Clusters
• A cluster can take many forms:
» a GP practice, or the patients belonging to an individual practitioner;
» a hospital ward;
» a school or class;
» a period of time (week, day, month);
» a geographical area (village, town, postal district).

Code for quality of cluster trials
• Code for whether the participants were recruited before the clusters were randomised – if not, this could have led to selection bias.
• Individuals within clusters have outcomes that are related, and this needs to be accounted for both in the sample size calculation and in the analysis. Code for the following: did the trial report its intracluster correlation coefficient (ICC)? Did the analysis use some form of statistical approach to take clustering into account (e.g., cluster level means, hierarchical linear modelling, robust standard errors)?

What is wrong here?
• “the remaining 4 classes of fifth-grade students (n = 96) were randomly assigned, each as an intact class, to the [4] prewriting treatment groups.”
Brodney et al. J Exp Educ 1999;68, 5-20.

Insufficient cluster replication
• The key quality criterion of a cluster trial is not the number of individual participants in the study but the number of clusters.
• A cluster trial with only one cluster per group cannot be thought of as a trial, as it is impossible to control for cluster level confounders. At least 4 (some say 7) clusters per group are needed to have some hope of balancing out confounders.

Which is better?
• Cluster trial A: we randomised 10 schools with 500 children in each, 5 to the intervention and 5 to the control (i.e., 5,000 children in all); OR
• Cluster trial B: we randomised 100 classes with 25 children in each, 50 to the control and 50 to the intervention (i.e., 2,500 children in all).
• Trial B is better, as it has 100 units of allocation rather than 10, despite having 50% fewer children.

Selection bias in cluster randomised trials
• Given enough clusters, selection bias should not occur in cluster trials, as randomisation will have dealt with this.
• HOWEVER, the clusters will be balanced at the individual level ONLY if all eligible people, or a random sample, within each cluster were included in the trial.
• In some trials this does not apply, because randomisation occurred BEFORE recruitment. This could have introduced selection bias.

Reviews of cluster trials (clustering allowed for in sample size / in analysis):
» Donner et al. (1990): 16 non-therapeutic intervention trials, 1979–1989 – <20% / <50%
» Simpson et al. (1995): 21 trials from American Journal of Public Health and Preventive Medicine, 1990–1993 – 19% / 57%
» Isaakidis and Ioannidis (2003): 51 trials in Sub-Saharan Africa (half post 1995), 1973–2001 – 20% / 37%
» Puffer et al. (2003): 36 trials in British Medical Journal, Lancet, and New England Journal of Medicine, 1997–2002 – 56% / 92%
» Eldridge et al. (Clinical Trials 2004): 152 trials in primary health care, 1997–2000 – 20% / 59%

Analysis
• Many cluster randomised health care trials were improperly analysed. Most analyses used t-tests or chi-squared tests, which assume independence of observations – an assumption violated in a cluster trial. This leads to spurious p values and narrow confidence intervals.
• Various methods exist (e.g., multilevel models, comparing the means of clusters) which will produce correct estimates; a minimal sketch of the cluster-means approach follows. See a worked example at Martin Bland’s website: http://www-users.york.ac.uk/~mb55/
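A minimal sketch of the cluster-means approach, assuming hypothetical data with one row per pupil:

```python
import pandas as pd
from scipy import stats

pupils = pd.DataFrame({
    "school": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "arm":    ["int", "int", "int", "int", "ctrl", "ctrl", "ctrl", "ctrl"],
    "score":  [14, 16, 11, 13, 10, 12, 9, 11],
})

# Wrong: a pupil-level t-test treats correlated observations as independent.
# Simplest valid option: one summary value per cluster, then compare those.
cluster_means = pupils.groupby(["school", "arm"])["score"].mean().reset_index()
int_means = cluster_means.loc[cluster_means["arm"] == "int", "score"]
ctrl_means = cluster_means.loc[cluster_means["arm"] == "ctrl", "score"]
t, p = stats.ttest_ind(int_means, ctrl_means)
print(f"t = {t:.2f}, p = {p:.3f}  (degrees of freedom based on clusters, not pupils)")
```

Multilevel models use the individual-level information more efficiently, but the cluster-means test is valid and easy for a reviewer to check from reported data.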
Survey of trial quality

  Characteristic             Drug   Health   Education
  Cluster randomised         1%     36%      18%
  Sample size justified      59%    28%      0%
  Concealed randomisation    40%    8%       0%
  Blinded follow-up          53%    30%      14%
  Use of CIs                 68%    41%      1%
  Low statistical power      45%    41%      85%

Torgerson CJ, Torgerson DJ, Birks YF, Porthouse J. (2005) A comparison of randomised controlled trials in health and education. British Educational Research Journal 31, 761-785. (Based on n = 168 trials.)

CONSORT
• Because the majority of health care trials were badly reported, a group of health care trial methodologists developed the CONSORT statement, which indicates the key methodological items that must be reported in a trial report.
• This has now been adopted by all major medical journals and some psychology journals.

The CONSORT guidelines, adapted for trials in educational research
• Was the target sample size adequately determined?
• Was intention to teach analysis used? (i.e., were all children who were randomised included in the follow-up and analysis?)
• Were the participants allocated using random number tables, coin flip, or computer generation?
• Was the randomisation process concealed from the investigators? (i.e., were the researchers who were recruiting children to the trial blind to each child’s allocation until after that child had been included in the trial?)
• Were follow-up measures administered blind? (i.e., were the researchers who administered the outcome measures blind to treatment allocation?)
• Was the precision of the effect size estimated (confidence intervals)?
• Were summary data presented in sufficient detail to permit alternative analyses or replication?
• Was the discussion of the study findings consistent with the data?

Flow diagram
• In health care trials reported in the main medical journals, authors are required to produce a CONSORT flow diagram.
• The trial by Hatcher et al. clearly shows the fate of the participants after randomisation until analysis.

Flow diagram
» 635 children in 16 schools screened using a group spelling test.
» 118 children with poor spelling skills given individual tests of vocabulary, letter knowledge, word reading and phoneme manipulation.
» 2 schools excluded due to insufficient numbers of poor spellers; 9 children excluded due to behaviour.
» 84/118 children in the 14 remaining schools (6 per school) selected for randomisation to the interventions.
» 1 school (6 children) withdrew from the study after randomisation.
» 39/42 children in the 13 remaining schools allocated to the 20-week intervention; 39/42 children included.
» 39/42 children in the 13 remaining schools allocated to the 10-week intervention; 1 child left the study (moved school); 38/42 children included.
Hatcher et al. (2005) J Child Psych Psychiatry: online.
Year 7 pupils, N = 155, randomised:
» ICT group, N = 77: 3 left school; 70 valid pre-tests; 67 valid post-tests; 63 valid pre- and post-tests.
» No ICT group, N = 78: 1 left school; 75 valid pre-tests; 71 valid post-tests; 67 valid pre- and post-tests.

Dr Carole Torgerson
Senior Research Fellow
Institute for Effective Education
University of York
cjt3@york.ac.uk