Appraisal of a systematic review using a checklist: Notes and Example 1

Effect of fibre, antispasmodics, and peppermint oil in the treatment of irritable bowel syndrome: systematic review and meta-analysis. Ford AC, Talley NJ, Spiegel BMR, et al. BMJ 2008;337:a2313

• At Clinical Evidence we principally assess systematic reviews (SRs) of randomised controlled trials (RCTs) that pool data
• This presentation uses the critical appraisal checklist for SRs to examine one such review and illustrate the principles involved
• It also expands on the issues involved when assessing SRs of RCTs in general
• There are SRs that include data from RCTs and observational studies, or include data from observational studies alone
• However, the content of this presentation relates solely to SRs of RCTs with meta-analysis

Does the SR perform and report a comprehensive and reproducible literature search?
• The SR should clearly state what search it has performed
• Sometimes extensive detail is presented, such as the terms searched or the search string used. Other times, less detail is reported
• However, you should be able to confirm the search is comprehensive and be reasonably able to reproduce it should you wish to do so
• Key questions include:
  – Are the search dates reported (from start date to finish date)?
  – Has it searched the appropriate databases (has just one database been interrogated or different databases)?
  – Have other methods of identifying studies also been employed (e.g., searching bibliographies)?
  – Were the studies identified by the search systematically assessed?
  – Are there specific exclusions (e.g., language restrictions)?

In this study: does the SR perform and report a comprehensive and reproducible literature search?
• Look in the Methods section.
The review describes the databases interrogated as well as the actual terms used in the search

METHODS
We searched the medical literature using Medline (1950 to April 2008), Embase (1980 to April 2008), and the Cochrane controlled trials register (2007). We identified studies on irritable bowel syndrome using the terms “irritable bowel syndrome” and “functional diseases, colon” (both as medical subject heading and free text terms), and “IBS, spastic colon, irritable colon”, and “functional adj5 bowel” (as free text terms). These were combined using the set operator AND with studies identified with the terms: “dietary fibre”, “cereals”, “psyllium”, “sterculia”, “karaya gum”, “parasympatholytics”, “scopolamine”, “trimebutine”, “muscarinic antagonists”, “butylscopolammonium bromide” (both as medical subject headings and free text terms), and the following free text terms: “bulking agent”, “psyllium fibre”, “fibre”, “husk”, “bran”, “ispaghula”, “wheat bran”, “spasmolytics”, “spasmolytic agents”, “antispasmodics”, “mebeverine”, “alverine”, “pinaverium bromide”, “otilonium bromide”, “cimetropium bromide”, “hyoscine butyl bromide”, “butylscopolamine”, “peppermint oil”, and “colpermin”.

In this study: does the SR perform and report a comprehensive and reproducible literature search?
• The SR also reports on restrictions, additional searches, and the assessment of the studies identified

METHODS
No language restrictions were applied. The lead reviewer evaluated the abstracts of papers identified by the initial search for appropriateness to the study question. Potentially relevant papers were obtained and evaluated in detail. Foreign language papers were translated when required. We hand searched abstract books of conference proceedings between 2001 and 2007 to identify potentially eligible studies. The reference lists of all identified relevant studies were used to carry out a recursive search of the literature.
Two reviewers independently assessed articles using predesigned eligibility forms, according to eligibility criteria defined prospectively. Any disagreement between investigators was resolved by consensus.

In this study: does the SR perform and report a comprehensive and reproducible literature search?
• It also presents a flow diagram of studies identified by the literature search

Does the SR formulate a clearly focused question?
• The SR should define the question that it is trying to answer before the systematic search is performed
• This is needed in order to ensure that the search will answer the question
• If the question is not clearly defined before the search, or the search is broad and not clearly defined, this may result in “data dredging” where a mass of trial data is tested post hoc in order to find significant results and associations
• Also, if it is not focused a priori, the search may not be able to answer a specific question of interest because it does not encompass all the trial data needed to answer that question

In this study: does the SR formulate a clearly focused question?
• Yes, look in the Abstract for the concise question

ABSTRACT
Objective To determine the effect of fibre, antispasmodics, and peppermint oil in the treatment of irritable bowel syndrome.
Design Systematic review and meta-analysis of randomised controlled trials.

• The Introduction gives the rationale and background

INTRODUCTION
Results of randomised controlled trials are conflicting, and many have been underpowered to detect a difference between active treatment and control intervention. Systematic reviews have also come to different conclusions about the efficacy of the three treatments in irritable bowel syndrome. As a result confusion exists as to the roles of these agents, with current management guidelines for irritable bowel syndrome making varying recommendations.
We carried out a systematic review and meta-analysis to determine the effect of fibre, antispasmodics, and peppermint oil in the treatment of irritable bowel syndrome.

Does the methods section explicitly state the basis for the inclusion or exclusion of primary studies?
• It is vitally important to know why studies have been included or excluded
• Adhering to explicit criteria avoids bias, where a study might be included or excluded just because an author does or doesn’t like it
• It is also necessary to allow the search to be reproducible
• What studies are included or excluded affects the final results
• Explicitly reporting such criteria also allows the reader to assess whether these criteria are reasonable, and what population group or circumstances the results of the review may be applicable to

In this study: is the basis for the inclusion or exclusion of primary studies described?
• Yes, look in Methods section

METHODS
We considered randomised controlled trials of adults (>16 years) with a diagnosis of irritable bowel syndrome based on a clinician’s opinion or that met specific diagnostic criteria (Manning, Kruis score, Rome I, II, or III), combined with the results of investigations to exclude organic disease if trial investigators thought this necessary. The studies had to compare fibre, antispasmodics, and peppermint oil with placebo or no treatment. Participants were required to be followed up for at least one week, and studies had to report either a global assessment of cure or improvement of symptoms, or cure or improvement of abdominal pain, after treatment. This was preferably as reported by the patient, but could be documented by a doctor. If studies included patients with other functional gastrointestinal disorders, then we excluded these patients from our analyses if trial reporting allowed this, but if this was not possible we excluded the studies from the meta-analysis.
We also considered as eligible for inclusion the first period of cross over randomised controlled trials. To allow steady state plasma concentrations of the agents to be achieved we considered one week as the minimum duration of treatment.

In this study: is the basis for the inclusion or exclusion of primary studies described?
• A particular issue in this subject area is the difficulty in making the diagnosis of irritable bowel syndrome, and hence, what criteria the RCTs used to include people
• The diagnostic criteria employed by the review are reported on the previous slide. In addition, this is further explored in the Discussion

DISCUSSION
Most trials were done before the Rome committee published their recommendations for the design of randomised controlled trials of therapies in functional gastrointestinal disorders. Only five of the included studies used the Rome criteria to define the presence of irritable bowel syndrome, although only nine were published after the first Rome classification was proposed in 1990, and only two used a validated outcome measure to define improvement in symptoms after treatment. However, many of the included trials met some of the other suggested methodological criteria, such as presence of double blinding and a minimum duration of therapy of 8 to 12 weeks. We preferentially extracted patient reported improvement in symptoms of irritable bowel syndrome or abdominal pain whenever trial reporting allowed this, which is also in line with these recommendations.

Does the SR report data from primary RCTs (e.g., size, interventions used, results from individual RCTs)?
• Is the size of included RCTs reported? Sometimes the results of a meta-analysis are dominated by one large RCT
• Are the interventions used in the RCTs defined? Even though RCTs may be comparing, say, the same drug, dosages and regimens may vary widely, and may sometimes account for differing trial results
• Are results from individual RCTs reported?
Is there consistency of effects between RCTs or do their results differ widely?
• Watch out for:
  – Where primary studies are not described at all: issues such as what was actually done in the studies and the setting / population included are important information and limit who the results may be applicable to
  – Where absolute numbers are not reported: although size is no guarantee of quality or accuracy of a final result, you may view a result based on 100 people differently from one based on 4000 people

In this study: does the SR report data from primary RCTs?
• Yes, look at tables 1, 2, and 3
• For each primary study, it reports the country the RCT took place in, the diagnostic criteria employed, the outcome reported, the sample size of the study, the exact treatment regimen used, and the duration of treatment given
• The review also reports results for individual RCTs in the final meta-analysis
• Did you notice in tables 1, 2, and 3 how few studies were done in primary care? They were undertaken mainly in secondary and tertiary care. We may speculate these might be selected people who are slightly different from people with irritable bowel syndrome seen in primary care. This could be an important limitation in the available data
• Did you also notice the duration of treatment in tables 1, 2, and 3? Many trials were short term, some only lasting a few weeks. Irritable bowel syndrome is a chronic relapsing and remitting condition. This could also be an important limitation in the available data
• Reporting such data from primary studies allows us to identify important caveats for when we come to interpret the results and decide how generalisable they are

Does the SR assess the methodological quality of primary studies, and take these into account?
• People sometimes get the review quality and the included study quality mixed up
• Just because a review includes poor-quality studies doesn’t mean it is a poor-quality review
• You can have a good-quality review of a subject area in which only poor-quality studies are available
• Alternatively, you can have a poor-quality review of a subject area with good-quality studies
• The methodological quality of the included studies may affect what weight you may wish to put on the results. A review should discuss this issue, and may present a sensitivity analysis (for example, an analysis just including high-quality studies to see if this differs from the analysis of all available studies)

Does the SR assess the methodological quality of primary studies, and take these into account?
Things to watch out for are:
• The inclusion of unpublished data (i.e., data from abstracts or data not published at all). Some independent SRs may include unpublished data and write to authors to obtain more details. Sometimes, industry sponsored or authored reviews may include completely unpublished data with no methodological assessment. Beware of terms such as ‘data on file’ which are not subject to external scrutiny
• Unpublished data is a complex issue and there are suggestions that trials that find no effect are less likely to be published (publication bias). Hence, including some unpublished data may be very informative. Alternatively, some unpublished data may be misleading. See how the review has handled these data, and whether what it has done or included seems reasonable and free from bias
• Some poor-quality reviews may report methodological parameters very sparsely. It can be difficult, say, to be sure that all the trials included in an analysis are RCTs. Don’t take anything for granted. If it doesn’t state that they are RCTs, don’t assume that they are. Sometimes imprecise terms such as ‘trials’ or ‘clinical trials’ are used in the text.
If in doubt, look at the abstracts of studies included in the analysis on PubMed as a first step

In this study: does the SR assess the methodological quality of primary studies, and take these into account?
• Yes, look in tables 1, 2, and 3 where the Jadad score is reported for all the individual RCTs
• The Jadad score is a well recognised and often reported quality score
• See the Study Quality section in the review where details are given of how the Jadad score was assessed and calculated
• A review may use other overall categorisations of methodological quality. However, it should clearly explain what they are and how they have been calculated
• Some reviews may simply report on individual elements of methods without an overall score (e.g., randomisation, blinding, etc). Either way of reporting is acceptable

STUDY QUALITY
Two reviewers independently assessed study quality according to the Jadad scale. This records whether a study is described as randomised and double blind, the methods for generation of the allocation schedule and double blinding, and whether there is a description of dropouts during the trial.

Meta-analysis: does the SR combine primary studies appropriately?
• Pooling data increases the power of an analysis, and may demonstrate a significant effect with an intervention that was not apparent when looking at the results of smaller individual studies
• However, a key question is when is it reasonable to combine data from different studies and when is it not
• There may be clinical heterogeneity between studies.
This could be in the intervention used, the population studied, or the outcomes measured, amongst others
• For example, it may be reasonable, say, to combine the results of different studies using different beta-blockers to lower blood pressure, whereas one would not combine these data with studies using relaxation therapy to lower blood pressure, even though the aim was the same (to lower blood pressure)
• It can be a difficult decision whether it is acceptable to combine trial data, for example when combining data from trials of multidisciplinary care, where the actual components of care may vary widely between different RCTs. You need to consider this issue. It may require a judgement on your part
• It should also be clear what data from the RCTs have been used in the analysis. A systematic review may, say, combine data on an intention-to-treat basis, whereas the original study may have reported a per-protocol or completer analysis

In this study: does the SR combine primary studies appropriately?
• Yes, the study would seem to combine results appropriately
• It combines the RCTs looking at fibre and antispasmodics as groups and peppermint oil as an individual agent. It might be argued that different types of fibre or different antispasmodics may or may not have different effects. However, it reports that it also planned a sensitivity analysis a priori according to the type of fibre or antispasmodic used. Hence, it reports an overall analysis for fibre and antispasmodics as groups, as well as individual analyses by the exact agent used
• Look at the Data Extraction section.
It gives clear information on what data were used from each study, how these data were extracted, and what assumptions were made

DATA EXTRACTION
Two reviewers independently extracted data on to an Excel spreadsheet (XP professional; Microsoft, Redmond, WA) as dichotomous outcomes (persistent or unimproved global symptoms of irritable bowel syndrome, or persistent or unimproved abdominal pain). In addition we extracted the following clinical data for each trial: setting (primary, secondary, or tertiary care), number of centres, country, dose and duration of treatment, total number of adverse events reported, definition of irritable bowel syndrome used, primary outcome measure used to define improvement in symptoms or cure after treatment, method of generation of the randomisation schedule, method for allocation concealment, level of blinding, proportion of female patients, subtype of irritable bowel syndrome according to predominant stool pattern, and duration of follow-up. Data were extracted as intention to treat analyses where all dropouts are assumed to be treatment failures, whenever this was allowed by trial reporting. If this was not clear from the original article then we carried out an analysis on all patients with reported evaluable data.

Meta-analysis: does the SR state how results are combined statistically?
• Different statistical tests and assumptions can give different results. Hence, it is important to report what methods have been used
• For a non-statistician, it can be difficult to interpret the detail of what has been undertaken.
For example, whether one statistical test is more appropriate than another, or whether it is reasonable to make specific statistical assumptions
• However, the review should clearly state the statistical methods used
• In practice, the amount of detail supplied varies widely between reviews
• Watch out for reviews that don’t give any detail at all on this

In this study: does the SR state how the results are combined statistically?
• Yes, look at Data Synthesis and Statistical Analysis
• The review clearly states what has been done, and also lists the statistical package used

DATA SYNTHESIS AND STATISTICAL ANALYSIS
We pooled data using a random effects model to give a more conservative estimate of the effect of individual treatments, allowing for any heterogeneity between studies. The effects of different interventions were expressed as a relative risk (95% confidence interval) of global symptoms of irritable bowel syndrome or abdominal pain persisting with fibre, antispasmodics, or peppermint oil compared with placebo or no treatment. For rare outcomes, such as adverse events, when no patients in one or both treatment arms had the outcome of interest in a single study, we added 0.5 to all four cells for the purposes of the analysis. From the reciprocal of the risk difference from the meta-analysis we calculated the number needed to treat and 95% confidence intervals. We used the I² statistic, with a cut-off point of 25%, to assess heterogeneity between studies and the χ² test with a P value <0.10 to define a significant degree of heterogeneity. If adverse events were statistically significantly increased with active treatment we calculated the number needed to harm and a 95% confidence interval using the formula: number needed to harm = 1/(1–relative risk) × control adverse event rate.
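The per-trial calculations described in the excerpt up to this point (a relative risk with a 95% confidence interval, the 0.5 continuity correction, and the number needed to treat as the reciprocal of the risk difference) can be sketched in Python. This is a minimal illustration and not the review's own code: the function names and the trial counts are hypothetical, and the log-based confidence interval is a standard textbook method that the review does not describe explicitly.

```python
import math

def relative_risk(events_trt, n_trt, events_ctl, n_ctl):
    """Relative risk of the event (e.g. symptoms persisting) with a 95% CI.

    If either arm has zero events, 0.5 is added to all four cells,
    mirroring the correction described in the excerpt.
    """
    a, b = events_trt, n_trt - events_trt   # treatment arm: events, non-events
    c, d = events_ctl, n_ctl - events_ctl   # control arm: events, non-events
    if a == 0 or c == 0:
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    rr = (a / (a + b)) / (c / (c + d))
    # Standard error of log(RR); assumed textbook formula
    se = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    lower = math.exp(math.log(rr) - 1.96 * se)
    upper = math.exp(math.log(rr) + 1.96 * se)
    return rr, lower, upper

def number_needed_to_treat(risk_trt, risk_ctl):
    """NNT as the reciprocal of the absolute risk difference."""
    risk_difference = risk_ctl - risk_trt
    if risk_difference == 0:
        raise ValueError("No risk difference, so NNT is undefined")
    return 1 / abs(risk_difference)

# Hypothetical trial: symptoms persist in 40/100 treated and 50/100 controls
rr, lower, upper = relative_risk(40, 100, 50, 100)
nnt = number_needed_to_treat(40 / 100, 50 / 100)
print(f"RR {rr:.2f} (95% CI {lower:.2f} to {upper:.2f}), NNT {nnt:.0f}")
```

Working through the arithmetic by hand for one or two trials in a forest plot is a quick way to check that you understand which figures the review has extracted.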
We used Review Manager version 4.2.8 (Nordic Cochrane Centre, Copenhagen, Denmark) and StatsDirect version 2.4.4 (Sale, Cheshire, England) to generate forest plots of pooled relative risks and risk differences for primary and secondary outcomes with 95% confidence intervals. We used the Egger and Begg tests to assess funnel plots for evidence of publication bias.

Meta-analysis: does the SR report absolute numbers as well as appropriate summary statistics?
• Each summary statistic has its own strengths and weaknesses
• There is evidence that people may interpret results differently if the same results are presented as different summary statistics
• Hence, it is also important to know absolute numbers
• Watch out if the review doesn’t report absolute numbers
• Apart from not knowing how many people are included in any analysis, you can’t judge what figures have been extracted (for example, did the review use intention-to-treat figures or not; did it include all data from all RCTs or subgroup data)
• Absolute numbers also help to put the summary statistic in context and may help you gauge what this result actually means in clinical practice

In this study: does the SR report absolute numbers as well as appropriate summary statistics?
• Yes, it does report absolute numbers. Look at this excerpt of figure 2 and how clear the reporting is with regard to what data have been used
• You could go back to each of these RCTs and check yourself whether you agree with these extracted figures

In this study: does the SR report absolute numbers as well as appropriate summary statistics?
• In Data Synthesis and Statistical Analysis (slide 20) the review reported that it had calculated relative risks.
This is a commonly reported statistic for categorical data (e.g., yes / no; present / absent)
• In this excerpt from figure 2, it states that the numerators are people with either global symptoms or abdominal pain unimproved or persistent after treatment
• The relative risk is an appropriate summary statistic to use in this case

Fig 2 Forest plot of randomised controlled trials of fibre versus placebo or low fibre diet in irritable bowel syndrome. Events are number of patients with either global symptoms of irritable bowel syndrome or abdominal pain unimproved or persistent after treatment

In this study: does the SR report absolute numbers as well as appropriate summary statistics?
• In looking at the technicalities of the numerical data and how results have been calculated, it may be easy to forget that what is being measured is also of prime importance, that is, the outcome of interest
• On a practical point, any analysis in a systematic review is limited by the outcomes actually measured and reported in the included RCTs
• In some subject areas, outcomes may be well defined (e.g., mortality, subsequent MI)
• Sometimes composite outcomes are reported (e.g., mortality and morbidity combined). This increases the statistical power of an analysis. This may or may not be appropriate. For example, was this analysis specified a priori or were results combined post hoc in order to achieve a significant result?
• For each subject area, you need to form an opinion on whether the outcome reported (either single or composite) is appropriate
• At Clinical Evidence we have always reported primarily on clinical outcomes, that is, outcomes which matter to people. We try not to report on laboratory or proxy outcomes wherever possible

In this study: does the SR report absolute numbers as well as appropriate summary statistics?
• In irritable bowel syndrome, we have already noted that diagnosis is an issue
• The authors of the review have chosen to report the effects of interventions on a composite outcome of global symptoms or abdominal pain
• Do you think this is reasonable? How was this measured? Should they have reported on, say, pain alone, or given the nature of the disease, is this a reasonable outcome which is of clinical use? This involves a value judgement on your part
• In practice, whether an outcome is reasonable is often not a straight “yes / no” answer. It may often be a “yes, but….”
• Given the possible broad nature of this outcome, the authors have reported what criteria the RCTs actually used to define symptom improvement (see the excerpt from table 3 below as an example)

Study | Country | Setting | Diagnostic criteria for irritable bowel syndrome | Criteria to define symptom improvement after therapy | Sample size | Dose of peppermint oil | Duration of therapy | Jadad score
Lech 1988 (w29) | Denmark | Secondary care | Clinical diagnosis and investigations | Patient reported improvement in global symptoms | 47 | 200 mg three times daily | 4 weeks | 3
Liu 1997 (w30) | Taiwan | Secondary care | Clinical diagnosis and investigations | Patient reported improvement in abdominal pain | 110 | 187 mg three or four times daily | 1 month | 4
Capanni 2005 (w32) | Italy | Secondary care | Rome II | Improvement in global symptoms assessed by validated questionnaire | 178 | 2 capsules three times daily | 3 months | 5
Cappello 2007 (w31) | Italy | Secondary care | Rome II and investigations | ≥50% improvement from baseline in overall irritable bowel syndrome symptom score using questionnaire data | 57 | 225 mg twice daily | 4 weeks | 5

Does the SR discuss the reasons for any variations / heterogeneity between individual RCTs?
• Whenever you combine data from different RCTs there is going to be some heterogeneity. The question is what degree of heterogeneity is acceptable
• We have already mentioned clinical heterogeneity.
When results are numerically combined, a statistical test of heterogeneity should be reported
• Statistical heterogeneity is often reported as a chi-squared test (χ²) with a P value and/or the I² test statistic
• If there is a high degree of heterogeneity among RCTs, this suggests there may be something different among the RCTs, and that their results should not be combined
• If there is statistical heterogeneity among RCTs in an analysis, you should expect the review to comment on the reasons for this. Often the review will exclude trials that account for the heterogeneity and recalculate the analysis, and may also report other sensitivity analyses. If a review does exclude trials, it should have a good underlying reason for doing so, other than that their results are different
• If a Forest plot is presented, you can visually examine the spread of the results
• In practice, some reviews report significant heterogeneity tests in the results tables, but don’t mention this in the results text or allude to this further. Watch out for this!

In this study: does the SR discuss the reasons for any variations / heterogeneity between individual RCTs?
• Look in Data Synthesis and Statistical Analysis where the review outlines the tests of heterogeneity it is going to perform
• The review was going to analyse “antispasmodics” as a group. This is a grouping of agents with a treatment effect rather than a grouping based on drug structure
• We might, therefore, speculate that effects may vary by the individual agents used. Read this section again where the review outlines a priori that it is going to do a subgroup analysis by each individual agent. A similar procedure is specified for the “fibre” analysis
• In the Results section under the Antispasmodics subheading the review reports a sensitivity analysis of the overall result. It also reports I² results for the individual antispasmodics analysis. Read where it discusses the relative strength of the evidence on the individual agents.
It further alludes to this issue in the Discussion section
• In fact, the review found statistical heterogeneity in a number of the analyses. See how it discusses the issue of heterogeneity in the Discussion section. It also finds some evidence of publication bias
• These issues may affect what weight you may wish to put on the results. Hence, interpretation may be complex. However, it is important that such issues are identified and the limitations of any analyses are discussed
• Beware of reviews which don’t report on or discuss the limitations of their analyses

In this study: does the SR discuss the reasons for any variations / heterogeneity between individual RCTs?
• Look at the overall Forest plot for all antispasmodics. You can see visually how the results vary by individual agent and RCT
• It should be remembered that some of these individual analyses are based on small numbers, often from only one RCT
• You can also easily see how the amount of available evidence varies widely between different agents

In this study: does the SR report on the clinical relevance / importance of the results?
• You should expect a review to report on the limitations of its analysis. This is often reported in the Discussion section, as it is in this review
• You should also expect the review to discuss its results in the context of previously reported evidence or guidelines. That is, how they may differ, agree, or contradict previous studies or practice. Again, this is usually reported in the Discussion section as it is in this review
• Some test statistics are more difficult to interpret clinically than others. For example, standardised mean differences (SMDs) and effect sizes have no units specified. Hence, if these are reported, it is difficult to know what the results actually mean in clinical terms
• Similarly, if there is an improvement in pain of, say, 8 points on a 100 point visual analogue scale (VAS), at what point does the change become clinically important?
• Although this particular review didn’t report on these types of data, if a review does, it should give some guidance as to what represents an important clinical effect
• In interpreting any result, it is important to remember that statistical significance and clinical importance are not synonymous. For example, a very large study may find that an intervention significantly reduces systolic blood pressure by 0.7 mmHg. The question is then one of interpretation. In terms of the individual, is this clinically important?

Some final thoughts
• At Clinical Evidence we examine systematic reviews every day. Just because a study is a systematic review doesn’t necessarily mean it is of good methodological quality. There can be a wide variation between reviews. However, increasingly, most are of reasonable quality
• A key principle is transparency. You shouldn’t need to guess or make assumptions about what a review has done
• Pay close attention to the inclusion and exclusion criteria. Are these reasonable? These directly affect the final result as well as who the results may, or may not be, generalisable to
• In some subject areas (e.g., cardiovascular) there may be multiple large studies whereas in others (e.g., surgery) evidence may be scarce. Any review can only report on what is available, and has to work within the limitations of the available data. It should, however, explicitly discuss any such limitations, and the effect of these on the robustness of its analyses
• Increasingly, reviews are being published online, which allows for the extensive reporting of data (for example, included study details, methods assessment, etc). Reviews published in print journals may have more constraints in terms of space, although additional web tables are increasingly used. Nonetheless, there should always be a minimum level of reporting that allows the general reader to reasonably assess what has been done

BMJ Publishing Group Limited (“BMJ Group”) 2011. All rights reserved. www.clinicalevidence.bmj.com
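As a rough illustration of the pooling and heterogeneity steps discussed in this presentation, the sketch below pools log relative risks with a random effects model and reports Cochran's Q (the χ² heterogeneity statistic) and I². It assumes the DerSimonian and Laird estimator, which is one common random effects method; the review states only that a random effects model was used. The function name and the input numbers are hypothetical and are not data from the review.

```python
import math

def pool_random_effects(log_rrs, ses):
    """Pool log relative risks; DerSimonian and Laird estimator (an assumption)."""
    w = [1 / se ** 2 for se in ses]                     # inverse-variance weights
    sum_w = sum(w)
    fixed = sum(wi * y for wi, y in zip(w, log_rrs)) / sum_w
    # Cochran's Q: weighted squared deviations from the fixed effect estimate
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, log_rrs))
    df = len(log_rrs) - 1
    # I2: percentage of variation attributed to heterogeneity rather than chance
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    # Between-study variance (tau squared), truncated at zero
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    w_re = [1 / (se ** 2 + tau2) for se in ses]         # random effects weights
    pooled = sum(wi * y for wi, y in zip(w_re, log_rrs)) / sum(w_re)
    se_pooled = math.sqrt(1 / sum(w_re))
    return {
        "rr": math.exp(pooled),
        "ci": (math.exp(pooled - 1.96 * se_pooled),
               math.exp(pooled + 1.96 * se_pooled)),
        "q": q,
        "i2": i2,
        "tau2": tau2,
    }

# Two hypothetical trials with conflicting results (RR 0.5 and RR 1.0)
result = pool_random_effects([math.log(0.5), math.log(1.0)], [0.1, 0.1])
print(f"Pooled RR {result['rr']:.2f}, I2 {result['i2']:.0f}%")
```

With these made-up inputs I² is well above the 25% cut-off used in the review, which is the situation in which you would expect a review to explore and discuss the reasons for heterogeneity.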