Audit of published new Cochrane Reviews of interventions: 2014
Target 1.2

Executive summary

Since September 2013, the CEU has been screening pre-publication drafts of new reviews against key MECIR conduct and reporting standards. To assess changes in review quality since screening began, we audited two cohorts of new intervention reviews, published in August 2013 and August 2014, against a subset of key MECIR standards.

The audit comprised 56 reviews. Overall, a higher proportion of the quality items were met by the reviews in 2014 than in 2013 (86% vs. 71%). The proportion of reviews judged to be fully or partially compliant with all quality items was also higher in the 2014 cohort (64% vs. 18%).

There were reasonable improvements in the currency of searches, the use of trials registries, and the declaration of changes from protocol. Internal consistency was judged to be better in the 2014 cohort. The frequency of inappropriate study exclusion decisions, problematic interpretation of findings, omission of primary outcomes in abstracts, and inconsistent reporting of results remained relatively low across both years. Although infrequent, misinterpretation of subgroup analysis suggests that this approach should be applied more carefully.

The audit provides some evidence that an increasing number of reviews are meeting key MECIR standards. However, there remain areas in clear need of improvement, including the implementation of GRADE and its use outside Summary of Findings (SoF) tables, and the use of subgroup analysis. We encourage editorial teams and authors to continue to focus carefully on these specific aspects of reviews.

Having considered review quality against a broad set of representative standards, follow-up work should address specific aspects and be more tightly focussed.

1. Background

Goal 1 of Cochrane Strategy to 2020 reaffirms Cochrane’s mission to produce high-quality systematic reviews and specifically to develop comprehensive quality assurance processes. Target 1.2 for 2014 directly supports this aim by using a subset of MECIR standards as the basis for an audit of Cochrane Reviews.

Since September 2013, the CEU has been screening pre-publication drafts of new intervention reviews. Based on preparatory work in April 2013, we have been using a set of key standards to check review quality during the screening process.

In response to feedback from Cochrane Review Groups (CRGs) in the run-up to and during the 2014 Cochrane Colloquium in Hyderabad, we decided to change the focus of the 2014 audit and therefore the target. Instead of auditing the last three months’ worth of reviews published in 2014, we compared two cohorts of published reviews. This was done in order to preserve the review screening programme, which had initially been intended as a time-limited project.

The first cohort comprised new intervention reviews published in August 2013 and the second cohort comprised new intervention reviews published in August 2014. This enabled us to establish in broad terms the quality of reviews at a pre-screening baseline and to see how far published reviews had changed since screening began.

2. Audit standards, rationale & method

2.1 Standards

The subset of MECIR standards used for the audit was based on the CEU review screening criteria at the time of the audit. The standards are subdivided according to three discrete components of the review:
a) Implementation of protocol methods;
b) Interpretation;
c) Completeness of reporting in the abstract & internal consistency.

a) Implementation of protocol methods

| Standard title | MECIR item | Standard |
|---|---|---|
| Searching trials registers | C27 | Search trials registers and repositories of results, where relevant to the topic through ClinicalTrials.gov, the WHO International Clinical Trials Registry Platform (ICTRP) portal and other sources as appropriate. |
| Searching for studies | C37 | Rerun or update searches for all relevant databases within 12 months before publication of the review or review update, and screen the results for potentially eligible studies. |
| Selecting studies into the review | C40 | Include studies in the review irrespective of whether measured outcome data are reported in a ‘usable’ way. |
| Synthesizing the results of included studies | C68 | If subgroup analyses are to be compared, and there are judged to be sufficient studies to do this meaningfully, use a formal statistical test to compare them. |
| Differences between protocol and review | R106 | Explain and justify any changes from the protocol (including any post hoc decisions about eligibility criteria or the addition of subgroup analyses). |

Although not comprehensive in terms of all the methods that we would expect to see implemented in reviews, these standards provide a broad indication as to how well protocol methods have been implemented.

We could not realistically incorporate every searching standard in this audit, so in consultation with the CEU information specialist we elected to include a standard on searching trials registers and a further standard around the date of the search.

Screening considers carefully the role that outcome availability plays in determining study eligibility. Although MECIR standards make some allowance for this, there is a concern that studies are excluded from reviews on the basis of outcome reporting rather than whether the outcome in question was measured.

We considered subgroup analysis as part of the audit to determine how closely reviews adhered to guidance on conduct and interpretation of this method for investigating heterogeneity.

Comparing the published protocol with the draft review enables the evaluation of any changes to the protocol that could impact on the results and how these are acknowledged and justified. Not all changes will be important to declare, but some may require justification if they alter the review question or change the analysis of data.

b) Interpretation

| Standard title | MECIR item | Standard |
|---|---|---|
| Summary of Findings table | R97 | Present a ‘Summary of Findings’ table according to recommendations described in Chapter 11 of the Cochrane Handbook (version 5 or later). Specifically: include results for one clearly defined population group (with few exceptions); indicate the intervention and the comparison intervention; include seven or fewer patient-important outcomes; describe the outcomes (e.g. scale, scores, follow-up); indicate the number of participants and studies for each outcome; present at least one baseline risk for each dichotomous outcome (e.g. study population or median/medium risk) and baseline scores for continuous outcomes (if appropriate); summarize the intervention effect (if appropriate); and include a measure of the quality of the body of evidence for each outcome. |
| Summarizing the findings | C76 | Use the five GRADE considerations (study limitations, consistency of effect, imprecision, indirectness and publication bias) to assess the quality of the body of evidence for each outcome, and to draw conclusions about the quality of evidence within the text of the review. |
| Reaching conclusions | C78 | Base conclusions only on findings from the synthesis (quantitative or narrative) of studies included in the review. |
| Author's conclusions | R101 | Provide a general interpretation of the evidence so that it can inform healthcare or policy decisions. Avoid making recommendations for practice. |

These standards address areas of the review that underpin the interpretation of the review findings and the appropriateness of review conclusions. The format and content of SoF tables feature in the screening process, and so we included these considerations as part of the audit.

One of the earliest points of concern to arise from screening was that reviews occasionally made recommendations for or against the adoption of an intervention. Users are likely to remember these as key messages, and since the decision to use an intervention will draw on a number of factors outside the evidence presented in the review, we wanted to establish how common this issue remained.

c) Completeness of reporting in the abstract & internal consistency

| Standard title | MECIR item | Standard |
|---|---|---|
| Abstract, Main results: bias assessment | R11 | Provide a comment on the findings of the bias assessment. |
| Abstract, Main results: findings | R12 | Report findings for all primary outcomes, irrespective of the strength and direction of the result, and of the availability of data. |
| Abstract, Main results: adverse effects | R13 | Ensure that any findings related to adverse effects are reported. If adverse effects data were sought, but availability of data was limited, this should be reported. |
| Consistency of summary versions of the review | R18 | Ensure that reporting of objectives, important outcomes, results, caveats and conclusions is consistent across the text, the abstract, the plain language summary and the ‘Summary of findings’ table (if included). |
| Consistency of results | R86 | Ensure that all statistical results presented in the main review text are consistent between the text and the ‘Data and analysis’ tables. |

Assuring consistency across all of the review forms a major objective of the review screening process. As part of our evaluations we looked at how the Plain language summary (PLS) and abstract mirror the main review findings and, if available, the Summary of Findings tables.

2.2 Method

We created an audit tool in Excel addressing the 14 MECIR standards outlined above. Judgements were made as ‘Yes’ (e.g. trials registries were reported to have been searched), ‘Partially met’ (e.g. reported changes from protocol are incomplete), ‘Unclear’ (inadequate information presented to determine whether the standard has been met) and ‘No’ (standard not met).

We decided that members of the CEU review screening team would lack objectivity in auditing reviews post-screening. To maintain independence, the reviews were assessed by an editor who had not previously been involved in the screening programme (Newton Opiyo).

3. Audit Findings

We included a total of 56 new Cochrane intervention reviews in the audit. The characteristics of the two cohorts are summarised below:

| Characteristic | 2013 | 2014 | Total |
|---|---|---|---|
| Reviews (N) | 34 | 22 | 56 |
| Review groups (N) | 22 | 18 | 32 |
| Number of included studies (median, range) | 9 [0 to 77] | 7 [0 to 129] | 8 [0 to 129] |
| Weeks between search date & publication (median, range) | 26 [2 to 107] | 28 [3 to 195] | 27 [2 to 195] |
| Weeks between protocol & publication (median, range) | 130 [13 to 449] | 139 [41 to 342] | 130 [13 to 449] |
| Number with Summary of Findings tables (%) | 21 (62) | 16 (73) | 37 (66) |

Table 1: Characteristics of two cohorts of new intervention reviews published in August 2013 and August 2014.

The two cohorts of reviews share broadly similar characteristics in terms of the number of included studies, currency of the search dates, and time taken from the publication of the protocol to the publication of the review. The slightly higher proportion of reviews with Summary of Findings tables in the 2014 cohort possibly reflects the growing adoption of GRADE within reviews.

Overall, a higher proportion of MECIR standards for conduct and reporting of reviews were met by the reviews in 2014 compared to 2013 (86.0%, 265/308 items vs. 71.2%, 339/476 items). When expressed as the proportion of compliant reviews (i.e. those reviews in which every standard was judged to be fully or partially met), a greater proportion of reviews from the 2014 cohort were compliant (65%, 13/22 reviews) than from 2013 (18%, 6/34 reviews).

3.1 Implementation of protocol methods

Findings on the subset of MECIR standards relevant to implementation of protocol methods are shown below.

[Figure: bar charts comparing the 2013 and 2014 cohorts. Panels: C27: Search trial registers; C37: Updated searches; C40: Study inclusion irrespective of usable outcomes; C68: Appropriate subgroup analysis; R106: Report differences between protocol and review. Horizontal axis: % of reviews. Judgement categories: Y-Fully met; P-Partially met; N-Not met; U-Unclear.]

There was reasonable improvement for three standards:
- searching trials registries and repositories for ongoing studies (e.g. WHO ICTRP and ClinicalTrials.gov), although the proportion of reviews meeting this criterion still remained below 50% in 2014 (C27);
- currency of the searches (updated database searches within 12 months prior to review publication) (C37);
- reporting and justification of changes to protocol (R106).

Exclusion of studies on the basis of outcome reporting was relatively infrequent in both cohorts of reviews (C40). No changes were apparent for appropriate conduct of subgroup analyses in the reviews (C68).

Comments

The judgment of inadequate searching of trials registries for the 2013 reviews could simply reflect failure to report rather than substandard conduct of searches.
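
The item-level proportions quoted at the start of Section 3 can be checked with a few lines of arithmetic. The Python sketch below is illustrative only and is not the Excel audit tool. It assumes that every review contributes one judgement for each of the 14 standards, so that the denominators are 34 × 14 and 22 × 14; the names `cohorts` and `item_compliance` are hypothetical.

```python
# Illustrative check of the item-level proportions reported in Section 3.
# Assumption: each review is judged once against each of the 14 standards.
N_STANDARDS = 14

# Review counts and numbers of items met, as reported in the text.
cohorts = {
    "2013": {"reviews": 34, "items_met": 339},
    "2014": {"reviews": 22, "items_met": 265},
}


def item_compliance(reviews, items_met, n_standards=N_STANDARDS):
    """Return (total items assessed, percentage of items met)."""
    total_items = reviews * n_standards
    return total_items, 100.0 * items_met / total_items


results = {
    year: item_compliance(c["reviews"], c["items_met"])
    for year, c in cohorts.items()
}

for year, (total_items, pct) in results.items():
    met = cohorts[year]["items_met"]
    print(f"{year}: {met}/{total_items} items met = {pct:.1f}%")
```

Under this assumption the denominators come out as 476 and 308 items, matching those reported, and the percentages round to 71.2% and 86.0%.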

Although a large proportion of reviews do not exclude studies on the basis of the availability of outcome data, careful attention is still needed to ensure that authors explicitly address this standard and that they do not introduce a bias that they have taken steps to avoid through an extensive search strategy.[1]

In a small number of reviews the conduct and/or interpretation of subgroup analyses do not follow current guidance. Cochrane Handbook guidance outlines three specific points which are worth emphasizing:
1. Analyses should be based on a small number of pre-specified subgroups, to prevent knowledge of results from influencing the choice of subgroups being investigated.
2. A formal statistical test for subgroup differences (test of interaction) should be used as the basis for interpreting subgroup analyses.
3. Findings of subgroup analyses should be interpreted cautiously to avoid potentially misleading inferences.[2][3]

Authors and editors should be mindful of the limitations of subgroup analysis in general and pay careful attention to the interpretation of subgroup analysis in the presence of few studies and small sample sizes.

[1] Saini P, Loke YK, Gamble C, Altman DG, Williamson PR, Kirkham JJ. Selective reporting bias of harm outcomes within studies: findings from a cohort of systematic reviews. BMJ. 2014;349:g6501.
[2] Sun X, Briel M, Walter SD, Guyatt GH. Is a subgroup effect believable? Updating criteria to evaluate the credibility of subgroup analyses. BMJ. 2010;340:c117.
[3] Higgins JPT, Green S (editors). Cochrane Handbook for Systematic Reviews of Interventions Version 5.1.0 [updated March 2011]. The Cochrane Collaboration, 2011. Available from www.cochrane-handbook.org [date accessed 27th January 2015].

3.2 Interpretation

Findings on the subset of MECIR standards relevant to interpretation are shown below.

[Figure: bar charts comparing the 2013 and 2014 cohorts. Panels: R97: Appropriate presentation of summary of findings table; C76: Appropriate application of GRADE in the review; C78: Appropriately formulate conclusions; R101: Appropriate conclusions-implications for practice. Horizontal axis: % of reviews. Judgement categories: Y-Fully met; P-Partially met; N-Not met; U-Unclear.]

Substantial improvement was observed in two criteria: appropriate presentation of summary of findings tables (R97), and implementation of GRADE in the body of reviews (C76). Appropriate formulation of conclusions (based on the findings of the included studies) (C78) and of implications for practice (without making recommendations for practice) (R101) remained relatively high in both cohorts of reviews.

Comments

Certain aspects of published SoF tables adhered to existing Cochrane Handbook and GRADE working group guidance, namely: reporting of the number of participants and studies, intervention effect(s) and quality of evidence for each outcome. Closer examination reveals areas for improvement. There was variation in the amount of information available relating to scales for continuous outcomes, follow-up, expression of results from standardised mean differences, the basis for assumed control group risks or scores, and the explanations of downgrading decisions. We noted poor consistency across reviews with regard to how imprecision is understood.
Reliance on statistical significance of the relative effect as the basis for downgrading decisions suggests that other important factors such as sample size, the confidence interval for the absolute effect and the number of events are not routinely considered.[4]

Although there was evidence of better use of GRADE to inform the interpretation of findings in the review text for reviews published in 2014, many pre-publication screening reports continue to highlight that GRADE is often overlooked when the discussion and conclusions are written. Concerted efforts to integrate GRADE ratings beyond SoF tables in the text of reviews prior to submission for editorial approval are warranted.

[4] Guyatt GH, Oxman AD, Kunz R, Brozek J, Alonso-Coello P, Rind D, et al. GRADE guidelines 6. Rating the quality of evidence – imprecision. J Clin Epidemiol. 2011;64(12):1283–1293.

3.3 Completeness of reporting in the abstract & internal consistency

Findings on the subset of MECIR standards relevant to completeness of reporting in the abstract and internal consistency are shown below.

[Figure: bar charts comparing the 2013 and 2014 cohorts. Panels: R11: Abstract - report risk of bias; R12: Abstract - report primary outcomes; R13: Abstract - report adverse effects; R18: Consistency of summary versions of the review; R86: Consistency of results across the review. Horizontal axis: % of reviews. Judgement categories: Y-Fully met; P-Partially met; N-Not met; U-Unclear.]

There was reasonable improvement in the reporting of findings of bias assessment (R11), adverse effects (R13), and consistency of key findings across summary and full text versions of the reviews (R18).
Complete reporting of findings for primary outcomes (R12) and consistency of results across the review (R86) remained relatively high in both cohorts of reviews.

Comments

Despite good (>80%) reporting of key items (risk of bias, primary outcomes, adverse outcomes) in the abstracts of the reviews surveyed, focused attention to improve clarity and completeness of outcome reporting is still needed (e.g. description of pre-specified outcomes not measured).[5] We found that the reviews that had gone to great lengths to incorporate GRADE into summary versions of the review were better placed to present the key findings of the review consistently, and to communicate key uncertainties clearly. Integrating GRADE in abstracts and plain language summaries also provides an opportunity to present GRADE ratings as part of the review findings, rather than merely as a means of preparing a SoF table.

[5] Smith V, Clarke M, Williamson P, Gargon E. Survey of new 2007 and 2011 Cochrane reviews found 37% of prespecified outcomes not reported. J Clin Epidemiol. 2014 Nov 18. pii: S0895-4356(14)00398-9.

4. What are the main implications of the audit findings?

The audit provides some evidence that for many recent reviews, a substantial proportion of key MECIR standards are being met. Improved adherence is likely attributable to a range of efforts and not simply a function of pre-publication screening. Other upstream efforts such as refinement of CRG quality assurance processes in response to screening, training and greater uptake of Summary of Findings tables, and increasing awareness of Cochrane standards by review authors and CRGs may also have influenced the findings.

Whilst it is encouraging to see the growing proportion of reviews meeting key standards, there remain areas where improvement is needed. Implementation and use of GRADE to communicate key results clearly would help to improve the readability of Cochrane Reviews and to help users understand review findings.
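
Standard C68 asks for a formal statistical test when subgroups are compared. As an illustration of why this matters, the sketch below compares two independent subgroup estimates using a simple z-test for interaction. It is a minimal sketch and not a method taken from the audit: the effect estimates and standard errors are hypothetical values chosen for illustration, and the function names are our own. It covers only the two-subgroup case; with more subgroups, a chi-squared test for subgroup differences (as reported by RevMan) would normally be used instead.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def interaction_z_test(est_a, se_a, est_b, se_b):
    """z-test for the difference between two independent subgroup
    estimates on an additive scale (e.g. log risk ratios)."""
    diff = est_a - est_b
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    z = diff / se_diff
    return z, two_sided_p(z)


# Hypothetical subgroup results (log risk ratio, standard error).
log_rr_a, se_a = -0.40, 0.15
log_rr_b, se_b = -0.10, 0.20

# Within-subgroup tests: tempting, but not a comparison of subgroups.
p_a = two_sided_p(log_rr_a / se_a)
p_b = two_sided_p(log_rr_b / se_b)

# Formal comparison of the two subgroups.
z_int, p_int = interaction_z_test(log_rr_a, se_a, log_rr_b, se_b)

print(f"Subgroup A alone: p = {p_a:.3f}")
print(f"Subgroup B alone: p = {p_b:.3f}")
print(f"Test for subgroup difference: z = {z_int:.2f}, p = {p_int:.2f}")
```

With these hypothetical numbers, subgroup A is statistically significant on its own and subgroup B is not, yet the test for a difference between them (z of about -1.2, p of about 0.23) gives little evidence that the effect truly differs. Concluding that the intervention "works in A but not in B" would be the kind of misinterpretation the guidance warns against.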
Where subgroup analysis is considered appropriate, careful attention to its implementation and interpretation is warranted.

There are three potential limitations of the audit. Firstly, we focussed on a limited set of standards to assess review quality. In so doing we may have neglected aspects of searching, analysis of data or implementation of the risk of bias tool by review teams that would provide more extensive insights into review quality. Secondly, one person assessed the reviews and some of the assessments reflect subjective judgments (e.g. appropriateness of conclusions), rather than purely objective items (e.g. date of search). It is possible that others replicating this exercise would arrive at different assessments. Lastly, although we were interested in exploring the impact of adopting GRADE on review quality, we felt that this was a secondary objective of the audit, and that the sample was too small to assess this reliably.

5. How do the audit findings relate to pre-publication screening?

At the time of preparing this report, 520 reviews had been screened by a team of editors in the CEU. Screening focuses on the three domains that featured in the audit: implementation of protocol methods, interpretation and consistency. The audit findings provide only a snapshot, but reinforce similar problems to those emerging from screening. We encourage editorial teams to continue to focus carefully on these three specific aspects of reviews, and to draw on resources such as the table of common errors and guidance on incorporating GRADE into the text of the review.

Notable improvements have occurred largely with respect to interpretation and consistency. Better reporting of departures from protocol, more recently incorporated search results and searching of trials registries were the main drivers of improving standards associated with review conduct. However, we should also reflect on what the audit findings tell us about the limitations of screening more generally.
Anecdotal evidence from screening suggests that the problems which are easiest to address at sign-off relate to interpretation of review findings and reporting. More serious problems arising from methods for analysis, suboptimal conduct or the adoption of non-standard methods have proven harder to address. Problems relating to the review objectives, study inclusion decisions or risk of bias assessment are often detected too late to implement satisfactory solutions. Earlier checks on the implementation of the protocol and better dialogue between methods groups and CRGs should reduce the need to make more fundamental changes to review methods at a late stage. This is especially the case for reviews that have implemented complex methods to address questions that do not fit within the intervention review format.

Screening should be regarded as an effective means of monitoring quality but a more limited way of improving the methodological quality of every review. The benefits of introducing review screening are most likely to be felt through the development of guidance and training initiatives for authors and editors which take full account of the lessons that are emerging from the process.

A key objective for 2015 is the development of a strategic approach to quality assurance. Although initially intended as a time-limited project, screening will likely remain part of such an approach but on a more restricted basis. Based on the volume of reviews screened, the CEU has already decided to make screening a voluntary process for a small number of groups. For others, where there has been greater variation in compliance with MECIR standards, screening will continue.

Screening has provided valuable insights into the challenges of review production and the need to develop better learning and support for editors.
Additional considerations for a quality assurance strategy will be how to find ways of sharing good practice, and how to prevent common errors occurring or remaining unaddressed until late in the editorial process. Some of the problems identified will be better addressed with the development and better integration of technologies in the author workflow, such as Covidence and GRADEpro.[6][7]

6. Implications for further audit and targets

Having considered review quality in broad terms, we think that discrete, targeted audits should now be considered. Given the importance of systematic searching, we think that a focussed evaluation of search methods in reviews is warranted to establish whether non-reporting of process reflects poor conduct or poor disclosure of good conduct. Accuracy checks on data used in analyses would also help to understand the nature, frequency and impact of data errors, and to identify how these might be prevented or addressed.

GRADE and SoF tables are essential for assessing, interpreting and presenting findings of systematic reviews for users. It is no coincidence that the implementation of GRADE features heavily in the screening process. The current audit identified a number of areas for improving the quality of reviews through better use of GRADE and more standard preparation and formatting of SoF tables. Developing an audit tool to evaluate this aspect of reviews would help to provide insights into how GRADE is implemented in reviews and areas for improvement.

[6] https://www.covidence.org/
[7] http://www.guidelinedevelopment.org/