2014 Target 1.2 - Cochrane Methods

Audit of published new Cochrane Reviews of interventions: 2014 Target 1.2
Executive summary
Since September 2013, the Cochrane Editorial Unit (CEU) has been screening pre-publication drafts of new reviews against
key MECIR (Methodological Expectations of Cochrane Intervention Reviews) conduct and reporting standards. To assess changes in review quality since screening
began, we audited and compared two cohorts of new intervention reviews, published in August 2013
and August 2014, against a subset of key MECIR standards.
The audit comprised 56 reviews. Overall, a higher proportion of the quality items were met by the
reviews in 2014 than by those in 2013 (86% vs. 71%). The proportion of reviews judged to be fully or
partially compliant with all quality items was also higher in the 2014 cohort (64% vs. 18%). There
were reasonable improvements in the currency of searches, the searching of trial registries, and the
reporting of changes from protocol. Internal consistency was judged to be better in the 2014 cohort.
Inappropriate study exclusion decisions, problematic interpretation of findings, omission of primary
outcomes from abstracts, and inconsistent reporting of results remained relatively uncommon in both
years. Misinterpretation of subgroup analyses, although infrequent, suggests that this method should
be applied more carefully.
The audit provides some evidence that an increasing number of reviews are meeting key MECIR
standards. However, some areas clearly still need improvement, including the implementation of
GRADE and its use outside Summary of Findings (SoF) tables, and the use of subgroup analysis. We
encourage editorial teams and authors to continue to focus carefully on these specific aspects of
reviews. Having now assessed review quality against a broad set of representative standards, we
suggest that follow-up work be more tightly focussed on specific aspects of reviews.
1. Background
Goal 1 of the Cochrane Strategy to 2020 reaffirms Cochrane’s mission to produce high-quality
systematic reviews and, specifically, to develop comprehensive quality assurance processes. Target
1.2 for 2014 directly supports this aim by using a subset of MECIR standards as the basis for an audit
of Cochrane Reviews.
Since September 2013, the CEU has been screening pre-publication drafts of new intervention
reviews. Drawing on preparatory work carried out in April 2013, we have been using a set of key
standards to check review quality during the screening process.
In response to feedback from Cochrane Review Groups (CRGs) in the run-up to and during the 2014
Cochrane Colloquium in Hyderabad, we decided to change the focus of the 2014 audit, and therefore
of the target. Instead of auditing the last three months’ worth of reviews published in 2014, we
compared two cohorts of published reviews. This was done in order to preserve the review
screening programme, which had initially been intended as a time-limited project.
The first cohort comprised new intervention reviews published in August 2013, and the second
comprised new intervention reviews published in August 2014. This enabled us to establish, in broad
terms, the quality of reviews at a pre-screening baseline and to see how far published reviews had
changed since screening began.
2. Audit standards, rationale & method
2.1 Standards
The subset of MECIR standards used as the basis of the audit was drawn from the CEU review
screening criteria in use at the time of the audit. The standards are grouped according to three
discrete components of the review:
a) Implementation of protocol methods;
b) Interpretation;
c) Completeness of reporting in the abstract & internal consistency.
a) Implementation of protocol methods
Standard title | MECIR item | Standard
Searching trials registers | C27 | Search trials registers and repositories of results, where relevant to the topic, through ClinicalTrials.gov, the WHO International Clinical Trials Registry Platform (ICTRP) portal and other sources as appropriate.
Searching for studies | C37 | Rerun or update searches for all relevant databases within 12 months before publication of the review or review update, and screen the results for potentially eligible studies.
Selecting studies into the review | C40 | Include studies in the review irrespective of whether measured outcome data are reported in a ‘usable’ way.
Synthesizing the results of included studies | C68 | If subgroup analyses are to be compared, and there are judged to be sufficient studies to do this meaningfully, use a formal statistical test to compare them.
Differences between protocol and review | R106 | Explain and justify any changes from the protocol (including any post hoc decisions about eligibility criteria or the addition of subgroup analyses).
Although not comprehensive in terms of all the methods that we would expect to see implemented
in reviews, these standards provide a broad indication as to how well protocol methods have been
implemented.
We could not realistically incorporate every searching standard in this audit, so in consultation with
the CEU information specialist we elected to include a standard on searching trials registers and a
further standard around the date of the search.
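Standard C37 and the search-date row of Table 1 both rest on simple date arithmetic. The sketch below shows one way a screener might check search currency. It assumes that "within 12 months" means on or after the same calendar date one year before publication (with 29 February mapped to 28 February), and that week counts are whole weeks rounded down; neither convention is stated in the audit, and the function names are hypothetical.

```python
from datetime import date


def one_year_before(d: date) -> date:
    """Same calendar day one year earlier; 29 February maps to 28 February (assumption)."""
    try:
        return d.replace(year=d.year - 1)
    except ValueError:  # d is 29 February and the previous year is not a leap year
        return d.replace(year=d.year - 1, day=28)


def search_is_current(search_date: date, publication_date: date) -> bool:
    """Hypothetical C37 check: searches rerun within 12 months before publication."""
    if search_date > publication_date:
        raise ValueError("search date falls after the publication date")
    return search_date >= one_year_before(publication_date)


def weeks_between(earlier: date, later: date) -> int:
    """Whole weeks elapsed (rounded down), e.g. from last search to publication."""
    return (later - earlier).days // 7
```

For example, a search run on 1 September 2013 for a review published on 13 August 2014 passes the check, whereas a search run on 1 June 2013 does not.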
Screening considers carefully the role that outcome availability plays in determining study
eligibility. Although MECIR standards make some allowance for this, there is a concern that studies
are excluded from reviews on the basis of whether an outcome was reported rather than whether it
was measured. We included subgroup analysis in the audit to determine how closely reviews adhered
to guidance on the conduct and interpretation of this method for investigating heterogeneity.
Comparing the published protocol with the draft review enables evaluation of any changes to the
protocol that could affect the results, and of how these changes are acknowledged and justified. Not
every change is important enough to declare, but some may require justification if they alter the
review question or change the analysis of data.
b) Interpretation
Standard title | MECIR item | Standard
Summary of Findings table | R97 | Present a ‘Summary of Findings’ table according to recommendations described in Chapter 11 of the Cochrane Handbook (version 5 or later). Specifically: include results for one clearly defined population group (with few exceptions); indicate the intervention and the comparison intervention; include seven or fewer patient-important outcomes; describe the outcomes (e.g. scale, scores, follow-up); indicate the number of participants and studies for each outcome; present at least one baseline risk for each dichotomous outcome (e.g. study population or median/medium risk) and baseline scores for continuous outcomes (if appropriate); summarize the intervention effect (if appropriate); and include a measure of the quality of the body of evidence for each outcome.
Summarizing the findings | C76 | Use the five GRADE considerations (study limitations, consistency of effect, imprecision, indirectness and publication bias) to assess the quality of the body of evidence for each outcome, and to draw conclusions about the quality of evidence within the text of the review.
Reaching conclusions | C78 | Base conclusions only on findings from the synthesis (quantitative or narrative) of studies included in the review.
Authors’ conclusions | R101 | Provide a general interpretation of the evidence so that it can inform healthcare or policy decisions. Avoid making recommendations for practice.
These standards address areas of the review that underpin the interpretation of the review findings
and the appropriateness of review conclusions.
The format and content of SoF tables are checked during screening, so we included these
considerations in the audit. One of the earliest concerns to arise from screening was that reviews
occasionally made recommendations for or against the adoption of an intervention. Users are likely
to remember such recommendations as key messages and, since the decision to use an intervention
draws on a number of factors outside the evidence presented in the review, we wanted to establish
how common this problem remained.
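Several of the R97 requirements are mechanical enough to check from a structured copy of a SoF table. The following is a hypothetical sketch, not the CEU's screening tool: the `SofOutcome` record and `r97_problems` function are illustrative names, and only the countable items are covered (number of outcomes, outcome description, participants and studies, baseline risk for dichotomous outcomes, quality rating). Judgements such as whether the population is clearly defined still need a human reader.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SofOutcome:
    """One row of a SoF table (hypothetical structure)."""
    name: str
    description: str            # scale, scores, follow-up
    participants: Optional[int]
    studies: Optional[int]
    dichotomous: bool
    baseline: Optional[str]     # assumed risk for a dichotomous outcome
    quality: Optional[str]      # GRADE rating, e.g. "moderate"


def r97_problems(outcomes: List[SofOutcome]) -> List[str]:
    """List the mechanically checkable R97 items that a SoF table fails."""
    problems = []
    if len(outcomes) > 7:
        problems.append(f"{len(outcomes)} outcomes listed; R97 expects seven or fewer")
    for o in outcomes:
        if not o.description:
            problems.append(f"{o.name}: outcome not described")
        if o.participants is None or o.studies is None:
            problems.append(f"{o.name}: number of participants/studies missing")
        # Baseline scores for continuous outcomes are only required "if appropriate",
        # so this sketch enforces a baseline risk for dichotomous outcomes only.
        if o.dichotomous and not o.baseline:
            problems.append(f"{o.name}: no baseline risk")
        if o.quality is None:
            problems.append(f"{o.name}: no quality rating")
    return problems
```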
c) Completeness of reporting in the abstract & internal consistency
Standard title | MECIR item | Standard
Abstract, Main results: bias assessment | R11 | Provide a comment on the findings of the bias assessment.
Abstract, Main results: findings | R12 | Report findings for all primary outcomes, irrespective of the strength and direction of the result, and of the availability of data.
Abstract, Main results: adverse effects | R13 | Ensure that any findings related to adverse effects are reported. If adverse effects data were sought, but availability of data was limited, this should be reported.
Consistency of summary versions of the review | R18 | Ensure that reporting of objectives, important outcomes, results, caveats and conclusions is consistent across the text, the abstract, the plain language summary and the ‘Summary of findings’ table (if included).
Consistency of results | R86 | Ensure that all statistical results presented in the main review text are consistent between the text and the ‘Data and analysis’ tables.
Ensuring consistency across all sections of the review is a major objective of the review screening
process. As part of our evaluations, we assessed how closely the plain language summary (PLS) and
abstract mirror the main review findings and, where available, the Summary of Findings tables.
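Part of the R86 check (agreement between statistical results in the text and the ‘Data and analysis’ tables) can be automated. This sketch assumes results are written in the form "RR 0.75, 95% CI 0.60 to 0.94" and that the analysis tables have been exported as tuples of (measure, estimate, lower limit, upper limit); both the format and the helper names are assumptions for illustration. A tolerance of 0.005 allows for rounding to two decimal places.

```python
import re
from typing import List, Tuple

Result = Tuple[str, float, float, float]  # (measure, estimate, lower, upper)

# Assumed reporting format, e.g. "RR 0.75, 95% CI 0.60 to 0.94" or "MD -2.5, 95% CI -4.1 to -0.9"
RESULT_PATTERN = re.compile(
    r"\b(RR|OR|MD|SMD)\s+(-?\d+(?:\.\d+)?),\s*95% CI\s+(-?\d+(?:\.\d+)?)\s+to\s+(-?\d+(?:\.\d+)?)"
)


def extract_results(text: str) -> List[Result]:
    """Pull every effect estimate with its 95% CI out of a block of review text."""
    return [
        (m.group(1), float(m.group(2)), float(m.group(3)), float(m.group(4)))
        for m in RESULT_PATTERN.finditer(text)
    ]


def unmatched_results(text_results: List[Result], table_results: List[Result],
                      tol: float = 0.005) -> List[Result]:
    """Return text results with no row in the analysis tables that agrees within tol."""
    def agrees(a: Result, b: Result) -> bool:
        return a[0] == b[0] and all(abs(x - y) <= tol for x, y in zip(a[1:], b[1:]))

    return [r for r in text_results if not any(agrees(r, t) for t in table_results)]
```

Any result returned by `unmatched_results` would then be a candidate inconsistency for an editor to inspect, not an automatic error.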
2.2 Method
We created an audit tool in Excel addressing the 14 MECIR standards outlined above. Judgements were
made as ‘Yes’ (e.g. trials registries were reported to have been searched), ‘Partially met’ (e.g.
reported changes from protocol are incomplete), ‘Unclear’ (inadequate information presented to
determine whether the standard has been met) and ‘No’ (standard not met). We decided that
members of the CEU review screening team would lack objectivity in auditing reviews post-screening.
To maintain independence, the reviews were assessed by an editor who had not previously been
involved in the screening programme (Newton Opiyo).
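The audit tool itself was a spreadsheet. As a hypothetical illustration of how its rows could be validated (the structure below is an assumption, not a description of the actual Excel file), each review receives exactly one of the four judgements for each of the 14 standards, grouped by the three review components in Section 2.1.

```python
from enum import Enum
from typing import Dict


class Judgement(Enum):
    YES = "Y"       # standard fully met
    PARTIAL = "P"   # standard partially met
    UNCLEAR = "U"   # too little information to judge
    NO = "N"        # standard not met


# The 14 MECIR standards in the audit, grouped by review component
STANDARDS = {
    "protocol methods": ["C27", "C37", "C40", "C68", "R106"],
    "interpretation": ["R97", "C76", "C78", "R101"],
    "abstract & consistency": ["R11", "R12", "R13", "R18", "R86"],
}
ALL_STANDARDS = [s for group in STANDARDS.values() for s in group]


def record_review(raw: Dict[str, str]) -> Dict[str, Judgement]:
    """Validate one spreadsheet row: exactly one Y/P/U/N judgement per standard."""
    missing = set(ALL_STANDARDS) - raw.keys()
    extra = raw.keys() - set(ALL_STANDARDS)
    if missing or extra:
        raise ValueError(f"missing {sorted(missing)}, unexpected {sorted(extra)}")
    # Judgement(...) raises ValueError for anything other than Y, P, U or N
    return {s: Judgement(raw[s].strip().upper()) for s in ALL_STANDARDS}
```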
3. Audit Findings
We included a total of 56 new Cochrane intervention reviews in the audit. The characteristics of the
two cohorts are summarised below:
Characteristic | 2013 | 2014 | Total
Reviews (N) | 34 | 22 | 56
Review groups (N) | 22 | 18 | 32
Number of included studies (median, range) | 9 [0 to 77] | 7 [0 to 129] | 8 [0 to 129]
Weeks between search date & publication (median, range) | 26 [2 to 107] | 28 [3 to 195] | 27 [2 to 195]
Weeks between protocol & publication (median, range) | 130 [13 to 449] | 139 [41 to 342] | 130 [13 to 449]
Number with Summary of Findings tables (%) | 21 (62) | 16 (73) | 37 (66)
Table 1: Characteristics of two cohorts of new intervention reviews published in August 2013 and August
2014.
The two cohorts of reviews share broadly similar characteristics in terms of the number of included
studies, the currency of the search dates, and the time taken from publication of the protocol to
publication of the review. The higher proportion of reviews with Summary of Findings tables in the
2014 cohort (73% vs. 62%) possibly reflects the growing adoption of GRADE within reviews.
Overall, a higher proportion of the MECIR standards for conduct and reporting were met by the
reviews in 2014 than by those in 2013 (86.0%, 265/308 items vs. 71.2%, 339/476 items). When
expressed as the proportion of compliant reviews (i.e. reviews in which each standard was judged to
be fully or partially met), a greater proportion of reviews were compliant in the 2014 cohort
(64%, 14/22 reviews) than in the 2013 cohort (18%, 6/34 reviews).
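These summary figures follow from the audit design: 14 standards were assessed for each review, giving 22 × 14 = 308 items in 2014 and 34 × 14 = 476 items in 2013. The sketch below reproduces the item-level percentages from the reported counts and applies the report's review-level rule (a review is compliant only if every standard is fully or partially met). The report does not say whether partially met items count as "met" at item level, so the item-level function simply takes the number of items met as input; the function names are hypothetical.

```python
from typing import Iterable, Sequence

N_STANDARDS = 14  # MECIR standards assessed per review


def item_level_pct(items_met: int, n_reviews: int) -> float:
    """Percentage of all review-by-standard items judged to be met."""
    return round(100 * items_met / (n_reviews * N_STANDARDS), 1)


def is_compliant(judgements: Iterable[str]) -> bool:
    """Review-level rule: every standard fully (Y) or partially (P) met."""
    return all(j in ("Y", "P") for j in judgements)


def review_level_pct(reviews: Sequence[Sequence[str]]) -> float:
    """Percentage of reviews that are compliant under the rule above."""
    return round(100 * sum(is_compliant(r) for r in reviews) / len(reviews), 1)
```

For instance, `item_level_pct(265, 22)` gives 86.0 and `item_level_pct(339, 34)` gives 71.2, while 6 compliant reviews out of 34 gives 17.6%, reported above as 18%. Note that an "Unclear" judgement on any standard makes a review non-compliant under this rule.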
3.1 Implementation of protocol methods
Findings on the subset of MECIR standards relevant to implementation of protocol methods are
shown below.
[Figure: Implementation of protocol methods. Bar charts comparing the percentage of reviews in the 2013 and 2014 cohorts judged Y (fully met), P (partially met), N (not met) or U (unclear) for C27: Search trial registers; C40: Study inclusion irrespective of usable outcomes; C37: Updated searches; R106: Report differences between protocol and review; and C68: Appropriate subgroup analysis.]
There was reasonable improvement for three standards:
• searching trials registries and repositories for ongoing studies (e.g. WHO ICTRP and
ClinicalTrials.gov) (C27), although the proportion of reviews meeting this standard
remained below 50% in 2014;
• currency of the searches (database searches updated within 12 months before review
publication) (C37); and
• reporting and justification of changes to the protocol (R106).
The exclusion of studies on the basis of outcome reporting was relatively uncommon in both cohorts
(C40). There was no apparent change in the appropriate conduct of subgroup analyses (C68).
Comments
The judgement of inadequate searching of trials registries for the 2013 reviews could simply reflect
failure to report the search rather than substandard conduct. Although a large proportion of
reviews do not exclude studies on the basis of the availability of outcome data, careful attention is
still needed to ensure that authors explicitly address this standard and do not introduce a bias
that they have taken steps to avoid through an extensive search strategy.1
In a small number of reviews, the conduct and/or interpretation of subgroup analyses did not follow
current guidance. The Cochrane Handbook outlines three specific points that are worth
emphasizing:
1. Analyses are based on a small number of pre-specified subgroups, to prevent knowledge
of results from influencing the choice of subgroups being investigated;
2. A formal statistical test for subgroup differences (test of interaction) is used as the basis
for interpreting subgroup analyses; and
3. Findings of subgroup analyses are interpreted cautiously, to avoid potentially
misleading inferences.2,3
Authors and editors should be mindful of the limitations of subgroup analysis in general, and should
pay careful attention to its interpretation when there are few studies and small sample sizes.
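Point 2 above can be made concrete. For two subgroups that have each been pooled to an estimate and standard error on an additive scale (for example log risk ratios), the chi-squared test for subgroup differences reduces to a z-test of the difference between the two estimates, with chi-squared = z² on 1 degree of freedom. The sketch below assumes exactly two subgroups; with k subgroups the between-subgroup chi-squared has k - 1 degrees of freedom and needs a general chi-squared distribution. The function name is hypothetical.

```python
import math
from typing import Tuple


def subgroup_difference_test(est1: float, se1: float,
                             est2: float, se2: float) -> Tuple[float, float]:
    """Test of interaction for two pooled subgroup estimates.

    Estimates must be on an additive scale (e.g. log risk ratios or mean
    differences). Returns (chi-squared on 1 df, two-sided p-value).
    """
    z = (est1 - est2) / math.sqrt(se1 ** 2 + se2 ** 2)
    chi2 = z ** 2                                   # equals the between-subgroup Q for k = 2
    p = math.erfc(abs(z) / math.sqrt(2))            # two-sided normal p-value
    return chi2, p
```

With log risk ratios of -0.5 (SE 0.2) and -0.1 (SE 0.2), the first subgroup on its own would appear statistically significant (two-sided p of about 0.01) and the second would not (p of about 0.6), yet the test for the difference gives chi-squared = 2.0 and p of about 0.16. Comparing subgroup-specific p-values, rather than testing the interaction, is the kind of misinterpretation the guidance warns against.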
1 Saini P, Loke YK, Gamble C, Altman DG, Williamson PR, Kirkham JJ. Selective reporting bias of harm outcomes within
studies: findings from a cohort of systematic reviews. BMJ. 2014;349:g6501.
2 Sun X, Briel M, Walter SD, Guyatt GH. Is a subgroup effect believable? Updating criteria to evaluate the credibility of
subgroup analyses. BMJ. 2010;340:c117.
3 Higgins JPT, Green S (editors). Cochrane Handbook for Systematic Reviews of Interventions Version 5.1.0 [updated
March 2011]. The Cochrane Collaboration, 2011. Available from www.cochrane-handbook.org [accessed 27 January 2015].
3.2 Interpretation
Findings on the subset of MECIR standards relevant to interpretation are shown below.
[Figure: Interpretation. Bar charts comparing the percentage of reviews in the 2013 and 2014 cohorts judged Y (fully met), P (partially met), N (not met) or U (unclear) for R97: Appropriate presentation of summary of findings table; C76: Appropriate application of GRADE in the review; C78: Appropriately formulate conclusions; and R101: Appropriate conclusions (implications for practice).]
Substantial improvement was observed for two standards: appropriate presentation of Summary of
Findings tables (R97) and implementation of GRADE in the body of reviews (C76). Compliance with
the standards on basing conclusions only on the findings of the included studies (C78) and on
formulating implications for practice without making recommendations (R101) remained relatively
high in both cohorts.
Comments
Certain aspects of published SoF tables adhered to existing Cochrane Handbook and GRADE
Working Group guidance, namely the reporting of the number of participants and studies, the
intervention effect(s) and the quality of evidence for each outcome.
Closer examination revealed areas for improvement. The amount of information provided varied
with regard to scales for continuous outcomes, follow-up, expression of results from standardised
mean differences, the basis for assumed control group risks or scores, and explanations of
downgrading decisions. We noted poor consistency across reviews in how imprecision was
understood. Reliance on the statistical significance of the relative effect as the basis for
downgrading decisions suggests that other important factors, such as sample size, the confidence
interval for the absolute effect and the number of events, are not routinely considered.4
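To illustrate why the absolute effect matters for imprecision judgements, the sketch below converts a risk ratio and its 95% confidence interval into the per-1000 figures shown in SoF tables, given an assumed control risk expressed as a proportion. The function name is hypothetical, and this is arithmetic only: GRADE's imprecision guidance also considers sample size and the number of events, which this sketch does not address.

```python
from typing import Tuple


def absolute_effect_per_1000(assumed_risk: float, rr: float,
                             rr_low: float, rr_high: float) -> Tuple[int, int, int, int]:
    """Corresponding risk and risk difference per 1000 from a risk ratio and its CI.

    assumed_risk: assumed control-group risk as a proportion (e.g. 0.10).
    Returns (corresponding risk, difference, lower CI of difference,
    upper CI of difference), all per 1000 and rounded to whole numbers.
    """
    base = 1000 * assumed_risk          # control events per 1000
    corresponding = base * rr           # intervention events per 1000
    return (round(corresponding),
            round(corresponding - base),
            round(base * rr_low - base),
            round(base * rr_high - base))
```

With an assumed control risk of 100 per 1000 and RR 0.80 (95% CI 0.60 to 1.05), the function returns (80, -20, -40, 5): 20 fewer events per 1000, with a confidence interval running from 40 fewer to 5 more. Whether that interval is precise enough depends on which absolute differences would matter to decision makers, not only on whether the relative effect is statistically significant.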
Although there was evidence of better use of GRADE to inform the interpretation of findings in the
text of reviews published in 2014, many pre-publication screening reports continue to highlight that
GRADE is often overlooked when the discussion and conclusions are written. Concerted efforts to
integrate GRADE ratings into the text of reviews, beyond SoF tables, before submission for editorial
approval are warranted.
4 Guyatt GH, Oxman AD, Kunz R, Brozek J, Alonso-Coello P, Rind D, et al. GRADE guidelines 6. Rating the quality of
evidence – imprecision. J Clin Epidemiol. 2011;64(12):1283–1293.
3.3 Completeness of reporting in the abstract & internal consistency
Findings on the subset of MECIR standards relevant to completeness of reporting in the abstract
and internal consistency are shown below.
[Figure: Completeness of reporting in the abstract and internal consistency. Bar charts comparing the percentage of reviews in the 2013 and 2014 cohorts judged Y (fully met), P (partially met), N (not met) or U (unclear) for R12: Abstract, report primary outcomes; R11: Abstract, report risk of bias; R13: Abstract, report adverse effects; R18: Consistency of summary versions of the review; and R86: Consistency of results across the review.]
There was reasonable improvement in the reporting in abstracts of the findings of the bias
assessment (R11) and of adverse effects (R13), and in the consistency of key findings across the
summary and full-text versions of the reviews (R18).
Complete reporting of findings for primary outcomes (R12) and consistency of results across the
review (R86) remained relatively high in both cohorts.
Comments
Despite good (>80%) reporting of key items (risk of bias, primary outcomes, adverse effects) in
the abstracts of the reviews surveyed, focused attention is still needed to improve the clarity and
completeness of outcome reporting (e.g. describing pre-specified outcomes that were not
measured).5 We found that reviews that had gone to great lengths to incorporate GRADE into
summary versions of the review were better placed to present the key findings consistently and to
communicate key uncertainties clearly. Integrating GRADE into abstracts and plain language
summaries also provides an opportunity to present GRADE ratings as part of the review findings,
rather than merely as a means of preparing a SoF table.
5 Smith V, Clarke M, Williamson P, Gargon E. Survey of new 2007 and 2011 Cochrane reviews found 37% of prespecified
outcomes not reported. J Clin Epidemiol. 2014 Nov 18. pii: S0895-4356(14)00398-9.
4. What are the main implications of the audit findings?
The audit provides some evidence that, for many recent reviews, a substantial proportion of key
MECIR standards are being met. Improved adherence is likely to be attributable to a range of efforts,
not simply to pre-publication screening. Other upstream efforts, such as refinement of CRG quality
assurance processes in response to screening, training, greater uptake of Summary of Findings
tables, and increasing awareness of Cochrane standards among review authors and CRGs, may also
have contributed to the findings.
Whilst it is encouraging to see a growing proportion of reviews meeting key standards, some areas
still need improvement. Implementing GRADE and using it to communicate key results clearly would
help to improve the readability of Cochrane Reviews and help users to understand review findings.
Where subgroup analysis is considered appropriate, careful attention to its implementation and
interpretation is warranted.
The audit has three potential limitations. Firstly, we focussed on a limited set of standards to
assess review quality. In so doing, we may have neglected aspects of searching, data analysis or
implementation of the risk of bias tool that would provide more extensive insights into review
quality. Secondly, one person assessed the reviews, and some of the assessments reflect subjective
judgements (e.g. appropriateness of conclusions) rather than purely objective items (e.g. date of
search). Others replicating this exercise might arrive at different assessments. Lastly, although we
were interested in exploring the impact of adopting GRADE on review quality, we felt that this was a
secondary objective of the audit and that the sample was too small to assess it reliably.
5. How do the audit findings relate to pre-publication screening?
At the time of preparing this report, 520 reviews had been screened by a team of editors in the
CEU. Screening focuses on the three domains that featured in the audit: implementation of
protocol methods, interpretation, and completeness of reporting and internal consistency. The audit
findings provide only a snapshot, but they reflect problems similar to those emerging from
screening. We encourage editorial teams to continue to focus carefully on these three aspects of
reviews, and to draw on resources such as the table of common errors and guidance on
incorporating GRADE into the text of the review.
Notable improvements occurred largely in interpretation and consistency. Better reporting of
departures from protocol, more recent searches and searching of trial registries were the main
drivers of improvement in the standards associated with review conduct. However, we should also
reflect on what the audit findings tell us about the limitations of screening more generally.
Anecdotal evidence from screening suggests that the problems which are easiest to address at
sign-off relate to the interpretation and reporting of review findings. More serious problems arising
from methods of analysis, suboptimal conduct or the adoption of non-standard methods have proven
harder to address. Problems relating to the review objectives, study inclusion decisions or risk of
bias assessment are often detected too late for satisfactory solutions to be implemented. Earlier
checks on the implementation of the protocol, and better dialogue between methods groups and
CRGs, should reduce the need to make fundamental changes to review methods at a late stage. This
is especially the case for reviews that have implemented complex methods to address questions that
do not fit within the intervention review format.
Screening should be regarded as an effective means of monitoring quality, but a more limited means
of improving the methodological quality of every review. The benefits of review screening are likely
to be realised through guidance and training initiatives for authors and editors that take full account
of the lessons emerging from the process.
A key objective for 2015 is the development of a strategic approach to quality assurance. Although
initially intended as a time-limited project, screening will probably remain part of such an approach,
but on a more restricted basis. Based on the volume of reviews screened, the CEU has already
decided to make screening voluntary for a small number of groups. Screening will continue for
other groups, where there has been greater variation in compliance with MECIR standards.
Screening has provided valuable insights into the challenges of review production and the need to
develop better learning and support for editors. Additional considerations for a quality assurance
strategy will be how to share good practice, and how to prevent common errors from occurring or
remaining unaddressed until late in the editorial process. Some of the problems identified will be
better addressed through the development and better integration of technologies in the author
workflow, such as Covidence and GRADEpro.6,7
6. Implications for further audit and targets
Having considered review quality in broad terms, we think that discrete, targeted audits should now
be considered. Given the importance of systematic searching, a focussed evaluation of search
methods in reviews is warranted to establish whether non-reporting of the search process
necessarily equates to poor conduct or to poor disclosure of good conduct. Accuracy checks on data
used in analyses would also help to clarify the nature, frequency and impact of data errors, and to
identify how these might be prevented or addressed.
GRADE and SoF tables are essential for assessing, interpreting and presenting the findings of
systematic reviews to users. It is no coincidence that the implementation of GRADE features
heavily in the screening process. The current audit identified a number of areas for improving the
quality of reviews through better use of GRADE and more standardised preparation and formatting of
SoF tables. Developing an audit tool to evaluate this aspect of reviews would provide insights into
how GRADE is implemented in reviews and where it could be improved.
6 https://www.covidence.org/
7 http://www.guidelinedevelopment.org/