Funded through the ESRC’s Researcher Development Initiative

Session 2.3: Assessing Quality

Prof. Herb Marsh
Ms. Alison O’Mara
Dr. Lars-Erik Malmberg
Department of Education, University of Oxford

[Flowchart: stages of a meta-analysis]
- Establish research question
- Define relevant studies
- Develop code materials
- Locate and collate studies
- Pilot coding; coding
- Data entry and effect size calculation
- Main analyses
- Supplementary analyses

Does it make a difference?
How do we assess primary study quality?

- The quality of evidence generated by a review depends entirely on the quality of the primary studies that make up the review
- Garbage-in, garbage-out!
- Quality assessment helps to set apart a meta-analysis or systematic review from a narrative review

- In meta-analysis, quality refers to methodological quality (the internal validity of primary studies)
- High quality reduces the risk of bias by eliminating or taking into account sources of bias such as:
  - Selection bias
  - Information bias
  - Confounding

- Increasingly, meta-analysts evaluate the quality of each study included in a meta-analysis
- Sometimes this is a global, holistic (subjective) rating. In this case it is important to have multiple raters to establish inter-rater agreement
- Sometimes study quality is quantified in relation to objective criteria of a good study, e.g. larger sample sizes; more representative samples; better measures; use of random assignment; appropriate control for potential bias; double blinding; and low attrition rates (particularly for longitudinal studies)

In a meta-analysis of Social Science meta-analyses, Wilson & Lipsey (1993) found an average effect size of .50. They evaluated how this was related to study quality:
- For meta-analyses providing a global (subjective) rating of the quality of each study, there was no significant difference between high and low quality studies; the average correlation between effect size and quality was almost exactly zero.
- Almost no difference between effect sizes based on random and non-random assignment (effect sizes were slightly larger for random assignment).
- The only study quality characteristic to make a difference was the one-group pre/post design with no control group at all, which produced positively biased effects.

Goldring (1990) evaluated the effects of gifted education programs on achievement. She found a positive effect, but emphasised that the findings were questionable because of weak studies:
- 21 of the 24 studies were unpublished and only one used random assignment.
- Effects varied with matching procedures: the largest effects for achievement outcomes were in studies in which differences between non-equivalent groups were controlled by only one pretest variable. Effect sizes reduced as the number of control variables increased and disappeared altogether with random assignment.
- Goldring (1990, p. 324) concluded that policy makers need to be aware of the limitations of the GAT literature.

Schulz (1995) evaluated study quality in 250 randomized clinical trials (RCTs) from 33 meta-analyses.
- Poor quality studies led to positively biased estimates: lack of concealment (30-41%), lack of double-blind (17%), participants excluded after randomization (not significant).

Moher et al. (1998) reanalysed 127 RCTs from 11 meta-analyses for study quality.
- Low quality trials resulted in significantly larger effect sizes: a 30-50% exaggeration in estimates of treatment efficacy.

Wood et al. (2008) evaluated study quality in 1,346 RCTs from 146 meta-analyses.
- Subjective outcomes: inadequate or unclear concealment and lack of blinding resulted in substantial biases.
- Objective outcomes: no significant effects.
- Conclusion: systematic reviewers should assess risk of bias.

- Meta-analyses should always include subjective and/or objective indicators of study quality.
- In Social Sciences, there is some evidence that studies with highly inadequate control for pre-existing differences lead to inflated effect sizes.
  However, it is surprising that other indicators of study quality make so little difference.
- In medical research, studies are largely limited to RCTs, where there is MUCH more control than in social science research. Here, there is evidence that inadequate concealment of assignment and lack of double-blinding inflate effect sizes, but perhaps only for subjective outcomes.
- These issues are likely to be idiosyncratic to individual discipline areas and research questions.

It is important to code the study quality characteristics

Juni, Witschi, Bloch, and Egger (1999): evaluation of scales designed to assess the quality of randomized field trials in medicine
- Used an identical set of 17 studies
- Applied the quality weightings dictated by 25 different scales
- Seven of the scales showed that high quality trials showed an effect whereas low quality trials did not.
- Six of the scales found that high quality trials showed no effect whereas low quality trials did (the reverse conclusion).
- For the remaining 12 scales, effect estimates were similar across the quality levels.
- Overall summary quality scores were not significantly associated with treatment effects.
- In summary: the scale used to evaluate study quality can determine whether a difference between quality levels is detected

- Assessing quality requires designing the code materials to include adequate questions about the study design and reporting
- It requires skill and training in identifying quality characteristics
- It may require additional analyses:
  - Quality weighting (Rosenthal, 1991)
  - Use of the kappa statistic in determining validity of quality filtering for meta-analysis (Sands & Murphy, 1996)
  - Regression with “quality” as a predictor of effect size (see Valentine & Cooper, 2008)

Uses of information about quality:
- Narrative discussion of the impact of quality on results
- Display study quality and results in a tabular format
- Weight the data by quality (not usually recommended, because scales are not always consistent; see Juni et al., 1999; Valentine & Cooper, 2008)
- Subgroup analysis by quality
- Include quality as a covariate in meta-regression

Study Design and Implementation Assessment Device (DIAD)
- Valentine & Cooper (2008) developed an instrument for assessing study quality for inclusion in meta-analysis and systematic review
- It “focuses on the operational details of studies and results in a profile of scores instead of a single score to represent study quality” (p. 130)
- Hierarchical: consists of “global”, “composite”, and “design and implementation” questions

[Figure: from Valentine & Cooper (2008, p. 139)]

- Multiple questions within each global and composite question

[Figure: example excerpt of Table 4 (p. 144)]

The Critical Appraisal Skills Programme (CASP) provides tools to help with the process of critically appraising articles:
- Systematic Reviews (!)
- Randomised Controlled Trials (RCTs)
- Qualitative Research
- Economic Evaluation Studies
- Cohort Studies
- Case Control Studies
- Diagnostic Test Studies
Good start, but not comprehensive
Download from http://www.phru.nhs.uk/Pages/PHD/resources.htm

[Figure: example from the CASP Cohort Studies form, http://www.phru.nhs.uk/Pages/PHD/resources.htm]

Centre for Reviews and Dissemination (CRD), University of York
- Report on “Study quality assessment”: www.york.ac.uk/inst/crd/pdf/crdreport4_ph5.pdf
- Development of quality assessment instruments
- Also, quality assessment of:
  - effectiveness studies
  - accuracy studies
  - qualitative research
  - economic evaluations

[Figure: some questions to consider regarding quality in case series designs (can also be used for survey data designs)]

Evidence for Policy and Practice Information and Co-ordinating Centre (EPPI-Centre), Institute of Education, University of London
- Some issues to consider when coding are listed in the Guidelines for the REPOrting of primary empirical research Studies in Education (The REPOSE Guidelines)
- Not as detailed/thorough (prescriptive?) as the medical research guidelines...

- When designing your code materials, you can look at guidelines for what should be reported, and turn those into questions to evaluate quality
- For example, Strengthening the Reporting of Observational Studies in Epidemiology (STROBE)
  - STROBE Statement: Checklist of items that should be included in reports of cross-sectional studies
  - Point 13(b): Give reasons for non-participation at each stage
  - Could be rephrased in the coding materials as: Did the study give a reason for non-participation?
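The rephrasing of a reporting item into a coding question can be sketched as a small data structure. This is only an illustration: the names `Response`, `CodingQuestion`, and `code_response`, and the three response options, are assumptions made for the example and are not taken from STROBE or from any published coding scheme.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Response(Enum):
    """Hypothetical response options for a quality coding question."""
    MET = "criteria met"
    NOT_MET = "criteria not met"
    UNSPECIFIED = "unspecified"  # the report gives no information either way


@dataclass(frozen=True)
class CodingQuestion:
    source_item: str  # reporting-guideline item the question is derived from
    question: str     # wording used in the code materials


# STROBE point 13(b), rephrased as a coding question as in the text above.
NON_PARTICIPATION = CodingQuestion(
    source_item="STROBE 13(b): Give reasons for non-participation at each stage",
    question="Did the study give a reason for non-participation?",
)


def code_response(reasons_given: Optional[bool]) -> Response:
    """Map a coder's judgement to a response category.

    True  -> reasons for non-participation were given
    False -> non-participation was described, but without reasons
    None  -> the coder could not tell from the report
    """
    if reasons_given is None:
        return Response.UNSPECIFIED
    return Response.MET if reasons_given else Response.NOT_MET


if __name__ == "__main__":
    print(NON_PARTICIPATION.question)
    for judgement in (True, False, None):
        print(judgement, "->", code_response(judgement).value)
```

Keeping the guideline item next to the question wording makes it easier to trace each coding question back to its source when the code materials are revised.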
- STROBE checklists available for cohort, case-control, and cross-sectional studies at http://www.strobe-statement.org/Checklist.html

Quality of reporting
- It is often hard to separate quality of reporting from methodological quality: “Not reported” is not always “Not done”
- Should code “Unspecified” as distinct from “Criteria not met”

- Consult as many materials as possible when developing coding materials
- There are some good references for systematic reviews that also apply to meta-analysis:
  - Torgerson’s (2003) book
  - Gough’s (2007) framework
  - Search Cochrane Collaboration (http://www.cochrane.org/) for “assessing quality”

References

Gough, D. (2007). Weight of evidence: A framework for the appraisal of the quality and relevance of evidence. In J. Furlong & A. Oancea (Eds.), Applied and Practice-based Research. Special Edition of Research Papers in Education, 22, 213-228.

Juni, P., Witschi, A., Bloch, R., & Egger, M. (1999). The hazards of scoring the quality of clinical trials for meta-analysis. JAMA, 282, 1054-1060.

Rosenthal, R. (1991). Quality-weighting of studies in meta-analytic research. Psychotherapy Research, 1, 25-28.

Sands, M. L., & Murphy, J. R. (1996). Use of kappa statistic in determining validity of quality filtering for meta-analysis: A case study of the health effects of electromagnetic radiation. Journal of Clinical Epidemiology, 49, 1045-1051.

Torgerson, C. (2003). Systematic reviews. UK: Continuum International.

Valentine, J. C., & Cooper, H. M. (2008). A systematic and transparent approach for assessing the methodological quality of intervention effectiveness research: The Study Design and Implementation Assessment Device (Study DIAD). Psychological Methods, 13, 130-149.
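Supplementary sketch: inter-rater agreement on quality ratings

When study quality is a global (subjective) rating, the session recommends multiple raters so that inter-rater agreement can be established. One common agreement statistic is Cohen's kappa, which corrects observed agreement for the agreement expected by chance. The sketch below assumes two raters and categorical ratings; the function name `cohens_kappa` and the ratings are made up for illustration and do not come from any study cited here.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters who rated the same studies."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must rate the same, non-empty set of studies.")
    n = len(rater_a)

    # Proportion of studies on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Agreement expected by chance, from each rater's marginal proportions.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        raise ValueError("Kappa is undefined when both raters use a single category.")
    return (observed - expected) / (1.0 - expected)


# Hypothetical global quality ratings for ten studies.
ratings_a = ["high", "high", "low", "low", "high", "low", "high", "low", "high", "low"]
ratings_b = ["high", "high", "low", "high", "high", "low", "low", "low", "high", "low"]

if __name__ == "__main__":
    # The raters agree on 8 of 10 studies; chance agreement is 0.5.
    print(round(cohens_kappa(ratings_a, ratings_b), 2))  # prints 0.6
```

With more than two raters or ordered rating categories, a different statistic (for example a weighted kappa) may be more appropriate; that choice is outside the scope of this sketch.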