Funded through the ESRC’s Researcher
Development Initiative
Session 2.3: Assessing Quality
Prof. Herb Marsh
Ms. Alison O’Mara
Dr. Lars-Erik Malmberg
Department of Education,
University of Oxford
[Figure: flowchart of meta-analysis stages. Legible labels: locate and collate studies; develop code; pilot coding; data entry and effect size; main analyses]
 Does it make a difference?
 How do we assess primary study quality?
The quality of evidence generated by a review
depends entirely on the quality of primary
studies which make up the review
 Garbage-in, garbage-out!
Quality assessment helps to set apart a meta-analysis or systematic review from a narrative review
 In meta-analysis, quality refers to methodological
quality (the internal validity of primary studies)
 High quality reduces the risk of bias by eliminating
or taking into account sources of bias such as:
 Selection bias
 Information bias
 Confounding
Increasingly, meta-analysts evaluate the quality
of each study included in a meta-analysis
Sometimes this is a global holistic (subjective)
rating. In this case it is important to have multiple
raters to establish inter-rater agreement
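Inter-rater agreement on a global quality rating is often summarised with Cohen's kappa, which corrects raw agreement for agreement expected by chance. The following is a minimal sketch only: the function name `cohens_kappa` and the ten ratings are hypothetical and are not taken from any study discussed in this session.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who each rated the same studies."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must rate the same non-empty set of studies.")
    n = len(rater_a)
    # Observed agreement: share of studies given the same rating by both raters.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions per category.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_chance = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_chance == 1.0:
        raise ValueError("Kappa is undefined when both raters use one identical category.")
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical global quality ratings for ten studies (illustrative data only).
rater_1 = ["high", "high", "low", "low", "high", "low", "high", "low", "high", "low"]
rater_2 = ["high", "high", "low", "high", "high", "low", "high", "low", "low", "low"]
kappa = cohens_kappa(rater_1, rater_2)
print(round(kappa, 2))
```

Here the raters agree on 8 of 10 studies, but half of that agreement would be expected by chance, so kappa is noticeably lower than the raw agreement rate.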
Sometimes study quality is quantified in relation to
objective criteria of a good study, e.g.
 larger sample sizes;
 more representative samples;
 better measures;
 use of random assignment;
 appropriate control for potential bias;
 double blinding, and
 low attrition rates (particularly for longitudinal studies)
 In a meta-analysis of Social Science meta-analyses,
Wilson & Lipsey (1993) found an effect size of .50.
They evaluated how this was related to study quality:
 For meta-analyses providing a global (subjective) rating of
the quality of each study, there was no significant
difference between high and low quality studies; the
average correlation between effect size and quality was
almost exactly zero.
 Almost no difference between effect sizes based on
random- and non-random assignment (effect sizes slightly
larger for random assignment).
 The only study quality characteristic to make a difference
was design: one-group pre/post designs with no control group
at all produced positively biased effects.
Goldring (1990) evaluated the effects of gifted
education programs on achievement. She found a
positive effect, but emphasised that findings were
questionable because of weak studies:
21 of the 24 studies were unpublished and only one used
random assignment.
Effects varied with matching procedures:
 largest effects for achievement outcomes were in studies
in which differences between non-equivalent groups were controlled
by only one pretest variable.
 Effect sizes decreased as the number of control variables
increased, and
 disappeared altogether with random assignment.
Goldring (1990, p. 324) concluded that policy makers need
to be aware of the limitations of the GAT literature.
 Schulz (1995) evaluated study quality in 250 randomized
clinical trials (RCTs) from 33 meta-analyses. Poor quality
studies led to positively biased estimates:
 lack of concealment (30-41%),
 lack of double-blind (17%),
 participants excluded after randomization (NS).
 Moher et al. (1998) reanalysed 127 RCTs
from 11 meta-analyses for study quality.
 Low quality trials resulted in significantly larger effect sizes: a 30-50%
exaggeration in estimates of treatment efficacy.
 Wood et al. (2008) evaluated study quality in 1,346 RCTs from
146 meta-analyses.
 subjective outcomes: inadequate/unclear concealment & lack of
blinding resulted in substantial biases.
 objective outcomes: no significant effects.
 conclusion: systematic reviewers should assess risk of bias.
 Meta-analyses should always include subjective and/or
objective indicators of study quality.
 In Social Sciences, there is some evidence that studies
with highly inadequate control for pre-existing
differences lead to inflated effect sizes. However, it is
surprising that other indicators of study quality make
so little difference.
 In medical research, studies are largely limited to RCTs
where there is MUCH more control than in social
science research. Here, there is evidence that
inadequate concealment of assignment and lack of
double-blinding inflate effect sizes, but perhaps only for
subjective outcomes.
 These issues are likely to be idiosyncratic to individual
discipline areas and research questions.
 It is important to code the study quality characteristics
 Juni, Witschi, Bloch, and Egger (1999):
 Evaluation of scales designed to assess the quality of
randomized field trials in medicine
 Used an identical set of 17 studies
 Applied the quality weightings dictated by 25 different scales
 Seven of the scales showed that high quality trials showed
an effect whereas low quality trials did not.
 Six of the scales found that high quality trials showed no
effect whereas low quality trials did (the reverse conclusion).
 For the remaining 12 scales, effect estimates were similar
across the quality levels
 Overall summary quality scores were not significantly
associated with treatment effects.
 In summary: the scale used to evaluate study quality can
determine whether a difference in quality levels is detected
 Requires designing the code materials to include
adequate questions about the study design and
 It requires skill and training in identifying quality
 May require additional analyses:
 Quality weighting (Rosenthal, 1991)
 Use of kappa statistic in determining validity of quality
filtering for meta-analysis (Sands & Murphy, 1996)
 Regression with “quality” as a predictor of effect size
(see Valentine & Cooper, 2008)
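Regression with quality as a predictor of effect size can be sketched as a weighted least-squares fit in which each study is weighted by the inverse of its sampling variance. This is an illustration under a fixed-effect assumption with a single numeric quality score; the function name `quality_meta_regression` and all data are hypothetical and are not the procedure of any cited author.

```python
def quality_meta_regression(effect_sizes, variances, quality):
    """Fixed-effect meta-regression of effect size on one quality score.

    Returns (intercept, slope, slope_se), using inverse-variance weights.
    """
    weights = [1.0 / v for v in variances]
    w_sum = sum(weights)
    # Weighted means of the predictor and the outcome.
    mean_q = sum(w * q for w, q in zip(weights, quality)) / w_sum
    mean_d = sum(w * d for w, d in zip(weights, effect_sizes)) / w_sum
    # Weighted sums of squares and cross-products around those means.
    s_qq = sum(w * (q - mean_q) ** 2 for w, q in zip(weights, quality))
    s_qd = sum(
        w * (q - mean_q) * (d - mean_d)
        for w, q, d in zip(weights, quality, effect_sizes)
    )
    if s_qq == 0:
        raise ValueError("Quality scores must vary across studies.")
    slope = s_qd / s_qq
    intercept = mean_d - slope * mean_q
    # With known sampling variances, the slope's variance is 1 / s_qq.
    slope_se = (1.0 / s_qq) ** 0.5
    return intercept, slope, slope_se


# Hypothetical data: five studies with quality scored 1 (low) to 5 (high).
d = [0.72, 0.55, 0.48, 0.41, 0.33]
v = [0.04, 0.02, 0.05, 0.01, 0.04]
q = [1, 2, 3, 4, 5]
intercept, slope, slope_se = quality_meta_regression(d, v, q)
print(intercept, slope, slope_se)
```

A negative slope in data like these would suggest that lower-quality studies report larger effects; dividing the slope by its standard error gives an approximate z test under the fixed-effect assumption.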
 Uses of information about quality:
 Narrative discussion of impact of quality on results
 Display study quality and results in a tabular format
 Weight the data by quality - not usually recommended
because scales are not always consistent (see Juni et al.,
1999; Valentine & Cooper, 2008)
 Subgroup analysis by quality
 Include quality as a covariate in meta-regression
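Subgroup analysis by quality can be illustrated by pooling effect sizes separately within each quality level and then testing whether the pooled means differ. The sketch below assumes a fixed-effect model with inverse-variance weights; the function names and the four studies are hypothetical.

```python
def pooled_effect(effect_sizes, variances):
    """Inverse-variance weighted mean effect size and its variance."""
    weights = [1.0 / v for v in variances]
    w_sum = sum(weights)
    mean = sum(w * d for w, d in zip(weights, effect_sizes)) / w_sum
    return mean, 1.0 / w_sum


def subgroup_analysis(studies):
    """studies: list of (quality_label, effect_size, variance) tuples.

    Returns a dict of per-subgroup (mean, variance) and the between-groups
    heterogeneity statistic Q_between.
    """
    groups = {}
    for label, d, v in studies:
        ds, vs = groups.setdefault(label, ([], []))
        ds.append(d)
        vs.append(v)
    results = {label: pooled_effect(ds, vs) for label, (ds, vs) in groups.items()}
    overall, _ = pooled_effect([s[1] for s in studies], [s[2] for s in studies])
    # Q_between: weighted squared deviations of subgroup means from the overall mean.
    q_between = sum((mean - overall) ** 2 / var for mean, var in results.values())
    return results, q_between


# Hypothetical studies coded as high or low quality (illustrative data only).
example = [
    ("high", 0.2, 0.1),
    ("high", 0.4, 0.1),
    ("low", 0.6, 0.1),
    ("low", 0.8, 0.1),
]
by_quality, q_between = subgroup_analysis(example)
print(by_quality, q_between)
```

Under these assumptions, Q_between is compared against a chi-square distribution with (number of subgroups - 1) degrees of freedom.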
 Valentine & Cooper (2008) developed an instrument for
assessing study quality for inclusion in meta-analysis and
systematic review
 “focuses on the operational details of studies and results
in a profile of scores instead of a single score to
represent study quality” (p. 130)
 Study Design and Implementation Assessment
Device (DIAD)
 Hierarchical: consists of “global”, “composite”, and
“design and implementation” questions
 From Valentine & Cooper (2008, p. 139).
 Multiple questions within each global and
composite question
 Example excerpt of Table 4 (p. 144)
 Critical Appraisal Skills Programme (CASP) to help with
the process of critically appraising articles:
 Systematic Reviews (!)
 Randomised Controlled Trials (RCTs)
 Qualitative Research
 Economic Evaluation Studies
 Cohort Studies
 Case Control Studies
 Diagnostic Test Studies
 Good start, but not comprehensive
 Download from
 Example from the CASP Cohort Studies form
 Centre for Reviews and Dissemination (CRD),
University of York
 Report on “Study quality assessment”
 Development of quality assessment instruments
 Also, quality assessment of:
 effectiveness studies
 accuracy studies
 qualitative research
 economic evaluations
 Some questions to consider regarding quality in
case series designs (can also be used for survey
data designs)
 Evidence for Policy and Practice Information and
Co-ordinating Centre (EPPI-Centre), Institute of
Education, University of London
 Some issues to consider when coding are listed in
the Guidelines for the REPOrting of primary
empirical research Studies in Education (The
REPOSE Guidelines)
Not as (prescriptive?) as in the medical
 When designing your code materials, you can look
at guidelines for what should be reported, and turn
those into questions to evaluate quality
 For example, Strengthening the Reporting of
Observational Studies in Epidemiology (STROBE)
 STROBE Statement—Checklist of items that should be
included in reports of cross-sectional studies
 Point 13(b): Give reasons for non-participation at each stage
 Could be rephrased in the coding materials as:
 Did the study give a reason for non-participation?
 STROBE checklists available for cohort, case-control, and cross-sectional studies at
 Quality of reporting
 It is often hard to separate quality of reporting from
methodological quality - “Not reported” is not always “Not done”
 Should code “Unspecified” as distinct from “Criteria not met”
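One way to keep “Unspecified” distinct from “Criteria not met” in the coding materials is to give every quality question three possible codes rather than two. The sketch below is illustrative only: `QualityCode`, `code_item`, and every question except the STROBE-derived one are hypothetical names and examples, not part of any published instrument.

```python
from collections import Counter
from enum import Enum


class QualityCode(Enum):
    MET = "criteria met"
    NOT_MET = "criteria not met"
    UNSPECIFIED = "unspecified"  # the report does not say either way


def code_item(coder_answer):
    """Map a coder's answer (True, False, or None) to a QualityCode.

    None means the report gives no information, which is kept separate
    from False (the report shows the criterion was not met).
    """
    if coder_answer is None:
        return QualityCode.UNSPECIFIED
    return QualityCode.MET if coder_answer else QualityCode.NOT_MET


# Hypothetical coding sheet for a single study.
coding_sheet = {
    "Did the study give a reason for non-participation?": None,
    "Was random assignment used?": True,
    "Were outcome assessors blinded?": False,
}
coded = {question: code_item(answer) for question, answer in coding_sheet.items()}
# A profile of codes rather than a single summary score.
profile = Counter(code.name for code in coded.values())
print(profile)
```

Keeping the three codes separate lets later analyses treat unreported items as missing information instead of counting them as failures.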
 Consult as many materials as possible when
developing coding materials
 There are some good references for systematic reviews that
also apply to meta-analysis
 Torgerson’s (2003) book
 Gough’s (2007) framework
 Search the Cochrane Collaboration for
“assessing quality”
 Gough, D. (2007). Weight of evidence: a framework for the appraisal of the
quality and relevance of evidence. In J. Furlong, A. Oancea (Eds.) Applied and
Practice-based Research. Special Edition of Research Papers in Education,
22, 213-228.
 Juni, P., Witschi, A., Bloch, R., & Egger, M. (1999). The hazards of scoring the
quality of clinical trials for meta-analysis. JAMA, 282, 1054-1060.
 Rosenthal, R. (1991). Quality-weighting of studies in meta-analytic research.
Psychotherapy Research, 1, 25-28.
 Sands, M. L., & Murphy, J. R. (1996). Use of kappa statistic in determining
validity of quality filtering for meta-analysis: A case study of the health effects
of electromagnetic radiation. Journal of Clinical Epidemiology, 49, 1045-1051.
 Torgerson, C. (2003). Systematic reviews. UK: Continuum International.
 Valentine, J. C., & Cooper, H. M. (2008). A systematic and transparent
approach for assessing the methodological quality of intervention
effectiveness research: The Study Design and Implementation Assessment
Device (Study DIAD). Psychological Methods, 13, 130-149.
