Painless
Evidence-Based
Medicine
Edited by
Antonio L. Dans
Leonila F. Dans
Maria Asuncion A. Silvestre
Asia-Pacific Center for Evidence-Based Medicine
Manila, Philippines
Copyright © 2008
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester,
West Sussex PO19 8SQ, England
Telephone: +44 1243 779777
Email (for orders and customer service enquiries): cs-books@wiley.co.uk
Visit our Home Page on www.wiley.com
All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval system or
transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or
otherwise, except under the terms of the Copyright, Designs and Patents Act 1988 or under the terms of
a licence issued by the Copyright Licensing Agency Ltd, 90 Tottenham Court Road, London W1T 4LP,
UK, without the permission in writing of the Publisher. Requests to the Publisher should be addressed
to the Permissions Department, John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West
Sussex PO19 8SQ, England, or emailed to permreq@wiley.co.uk, or faxed to (+44) 1243 770620.
Designations used by companies to distinguish their products are often claimed as trademarks. All
brand names and product names used in this book are trade names, service marks, trademarks or
registered trademarks of their respective owners. The Publisher is not associated with any product or
vendor mentioned in this book.
This publication is designed to provide accurate and authoritative information in regard to the subject
matter covered. It is sold on the understanding that the Publisher is not engaged in rendering
professional services. If professional advice or other expert assistance is required, the services of a
competent professional should be sought.
Other Wiley Editorial Offices
John Wiley & Sons Inc., 111 River Street, Hoboken, NJ 07030, USA
Jossey-Bass, 989 Market Street, San Francisco, CA 94103-1741, USA
Wiley-VCH Verlag GmbH, Boschstr. 12, D-69469 Weinheim, Germany
John Wiley & Sons Australia Ltd, 33 Park Road, Milton, Queensland 4064, Australia
John Wiley & Sons (Asia) Pte Ltd, 2 Clementi Loop #02-01, Jin Xing Distripark, Singapore 129809
John Wiley & Sons Canada Ltd, 6045 Freemont Blvd, Mississauga, Ontario, L5R 4J3
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may
not be available in electronic books.
Library of Congress Cataloging in Publication Data
Painless evidence-based medicine / edited by Antonio L. Dans, Leonila F. Dans,
Maria Asuncion A. Silvestre ; foreword by Gordon Guyatt.
p. ; cm.
Includes bibliographical references.
ISBN 978-0-470-51939-4 (pbk. : alk. paper)
1. Evidence-based medicine. I. Dans, Antonio L. II. Dans, Leonila F.
III. Silvestre, Maria Asuncion A.
[DNLM: 1. Evidence-Based Medicine. WB 102 P147 2007]
R723.7.P35 2007
612.4—dc22
2007050370
British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library
ISBN 978-0-470-51939-4
Typeset in 10.5/13pt Minion by Integra Software Services Pvt. Ltd, Pondicherry, India
Printed and bound in Great Britain by TJ International Ltd., Padstow, Cornwall
This book is printed on acid-free paper.
Contents
Foreword
Preface
List of Contributors
1 Introduction
2 Evaluation of Articles on Therapy
3 Evaluation of Articles on Diagnosis
4 Evaluation of Articles on Harm
5 Evaluation of Articles on Prognosis
6 Evaluation of Systematic Reviews
7 Literature Searches
Index
Foreword
No clinician would consider entering clinical practice without
knowing the rudiments of history-taking and physical examination. Nor would clinicians consider independent practice without
a basic understanding of how the drugs they prescribe act on
their patients. Yet, traditionally, clinicians have started practice
without an ability to understand evidence about how they should
interpret what they find on history and physical examination, or
the magnitude of the effects they might expect when they offer
patients medication.
Evidence-based medicine (EBM) provides a remedy for this
problem. The movement to teach clinicians to become effective users of medical literature began in the 1970s and evolved
through the 1980s into a whole system for the delivery of clinical care. We needed a name for this new way of practice and
the term ‘evidence-based medicine’, which first appeared in the
medical literature in 1991[1], proved extremely popular. Over the
subsequent 16 years evidence-based medicine has evolved and
now represents not only an approach to using the medical literature effectively, but a principled guide for the process of clinical
decision-making.
Members of the general public are surprised, and often
appalled, when they learn that most physicians remain unable
to critically read an original research article or fully understand
the results reported there. For the physician, inability to critically
appraise a research study and grasp all that is implied in its findings limits their independence. The result is reliance on expert
opinion, the practices of colleagues and on information from the
pharmaceutical industry. But what is one to do if experts and
colleagues disagree, or if one is mistrustful of the enthusiastic
advice from a pharmaceutical industry representative?
This book represents the key to a world that provides the
answer to that question, a world that has traditionally been closed
to most practising physicians: the world of original medical literature. Opening the door to this world is enormously empowering. No longer must one choose what to believe on the basis
of which recommendation is backed by the most authority, or
speaks with the loudest voice. The ability to differentiate high
from low quality evidence and large treatment effects from small
allows clinicians to make independent judgements about what is
best for their patients. It also allows them to explain the impact
of alternatives to the patients themselves, and thus to ensure
that choices are consistent with patients’ underlying values and
preferences.
Ten years ago, experts and the official voices of the organizations to which they belonged consistently recommended long-term hormone replacement therapy (HRT) for post-menopausal
women. These recommendations were made largely on the basis
of observational studies suggesting that women taking HRT could
expect large reductions in their risk of major cardiovascular
events. Proponents of evidence-based medicine raised concerns
about the wisdom of this strong advocacy of therapy for huge
populations on the basis of the fundamentally weak methods
of observational studies. Their voices were largely ignored, until
randomized trials demonstrated that the results of the observational studies were incorrect. If HRT has any impact on cardiovascular disease at all, it is to increase its frequency.
Many clinical communities now endorse widespread population screening to prevent the occurrence of cancer and
cardiovascular disease. Breast cancer screening for women as
young as 40 years, colon cancer screening for entire populations
and treatment to improve lipid profiles even in very low risk
patients are widely advocated. Many clinicians are unaware that
to prolong a single life, hundreds of individuals must be screened
for breast or colon cancer or treated with lipid profile-modifying
agents for periods of up to a decade. The costs include anxiety
as a result of the many false positive results, complications of
invasive procedures such as lumpectomy or colonoscopy, side
effects of treatment (including deaths as a result of a lipid-lowering agent now withdrawn from the market) and resource
investment that, at least for some individuals, might be better
allocated elsewhere. The point is not that the experts were
uniformly wrong in suggesting that women consider HRT, nor
that screening or treatment of low-risk individuals to modify
their cancer or coronary risk is wrong. Rather, it is that clinicians
should be aware there are important trade-offs in these decisions.
If clinicians don’t know the difference between an observational
study and a randomized trial, or between a relative risk reduction
and a risk difference, they are in no position to understand these
trade-offs. If they are unable to understand the trade-offs, it is
not possible for them to convey the possible benefits and risks
to their patients, many of whom may, with a full understanding,
decline screening or treatment.
This book provides the basic tools for the clinician to evaluate
the strength of original studies, to understand their results and to
apply those results in day-to-day clinical practice. I am delighted
to inform the reader that its editors are not only brilliant teachers
who have created a wonderful introductory text, but wonderful
human beings. I met Tony and Inday Dans just about the time
that our McMaster group was realizing that what we had been
calling ‘critical appraisal’ had evolved into a systematic approach
to medical practice, a system of thinking about clinical care and
clinical decision-making.
Inday and Tony had come to McMaster to train in clinical epidemiology – the science that underlies evidence-based
medicine. I had the great pleasure of working with both these
brilliant, enthusiastic and critical young people. I was extremely
fortunate that Tony chose me as one of his supervisors, and
as a result we had the opportunity to work particularly closely
together. It was not long before I discovered that I had the privilege of interacting with an extraordinary individual, exceptional
even among the lively, intelligent, dedicated students who populated our Masters program. Tony was far more questioning than
most students, and possessed a far deeper and more intense social
conscience. To me, these qualities were very striking.
Since their days at McMaster, Inday and Tony have continued
to demonstrate their high intelligence, tremendous initiative,
extraordinary ability to question and explore issues at the
deepest level and their unusual and extremely admirable social
conscience. Having a social conscience leads you to challenge
existing power structures and vested interests. Doing so requires
more than conscience: it requires courage. I have had the good
fortune and great pleasure to interact with Inday and Tony in
a variety of settings at quite regular intervals, and have as a
result seen first-hand how their courage has led them to repeatedly challenge authority and power, acting in the interests of
the Philippine people. To use the adjective preferred by young
Canadians nowadays, their performance has been consistently
awesome.
I will add one final anecdote about what makes Tony and
Inday so special. Each year, we conduct a 'how to teach evidence-based medicine' workshop at McMaster. In the last few years,
Tony and Inday have participated in the workshop in the role of
tutor trainees. Almost all participants in the workshop feel they
learn a great deal, and take elements of what they have discovered
back to their own teaching settings. But very few, and extremely
few among the very experienced, make major innovations in
their teaching as a result. Despite having run literally dozens of
extremely successful workshops in the Philippines prior to their
participation in the McMaster workshop, Inday and Tony took
the key elements of the McMaster strategy and revamped their
approach to their own workshops. The result has been a spectacular success, with Philippine participants reporting profoundly
positive educational experiences. In the two decades in which I
have participated in our workshop, I’ve never seen anyone make
as good use of their experience with us. The message about Tony
and Inday: a tremendous openness and ability to integrate what
they’ve learned and apply in imaginative and perspicacious ways
in their own setting.
One fortunate consequence of Inday and Tony’s brilliant
teaching – which makes the presentation of this book so vividly
clear – is that it inspires others. About ten years ago Mianne
Silvestre, a neonatologist, attended one of the Dans’ workshops
and emerged as an EBM enthusiast. She took on a teaching role
and emerged as one of the most effective EBM facilitators in the
Philippines. Her insights and experience have also contributed to
the lucid presentations in this text.
We shall now take advantage of Inday, Tony and Mianne’s
enormous experience of EBM and their imagination and brilliant
teaching abilities in this wonderful book. The title ‘Painless EBM’
captures the essence of their work. They have presented challenging concepts in simple, clear and extremely appealing ways
which make learning EBM painless and enjoyable. They have
emphasized the last of the three pillars of the EBM approach:
while the book tells you about validity and understanding the
results, the focus is on applicability. What is the meaning of the
evidence? How can you apply it in your own setting? How can you
apply the evidence to patients with very different circumstances
and varying values and preferences?
Increasingly, applying the literature to clinical practice does
not mean a detailed reading of a large number of original studies.
Rather, the clinician can recognize valid pre-appraised resources
and differentiate them from poorly substantiated opinion. The
book provides guides for assessing not only original studies of
diagnosis and therapy, but also systematic reviews which summarize a number of such original studies. The ability to differentiate strong from weak literature reviews and to understand
summaries of the magnitude of treatment effects is crucial for
efficient evidence-based practice.
When a new pivotal study comes to light, evidence-based clinicians do not need to read it in detail to evaluate its significance
or to decide how to use its results. Imagine that I am telling you
about a recently conducted study reporting an apparently important treatment effect. I tell you: that the study was a randomized trial and that randomization was adequately concealed;
that patients, caregivers and those collecting and adjudicating
outcome data were blind to whether patients received treatment
or control interventions; that investigators successfully followed
all patients who were randomized; and that, in the analysis, all
patients were included in the groups to which they were randomized. Assuming that I am skilled in making these judgements, and
am telling you the truth, you have all the information you need
to judge the validity of the study. If I then provide you with a few
crucial details of who was enrolled, how the interventions were
administered and the magnitude and precision of estimates of the
impact of the intervention on all patient-relevant outcomes, you
have everything you need to apply the results in clinical practice.
Synopses of individual studies which provide the crucial
information needed to understand the appropriate strength of
inference to apply the results are increasingly available, as are
systematic reviews and, to a lesser extent, high quality evidence-based practice guidelines. Entire systems of knowledge based
on evidence-based principles and textbooks of evidence-based
medicine are beginning to arrive. The innovative electronic text
UpToDate is an example of a resource that strives to be fully
evidence-based and to provide guidance for most dilemmas that
clinicians face in practice; UpToDate is effective in meeting both
these aims.
When you, as a clinician, have read and digested the current
text, you will have the tools to read and interpret synopses and
systematic reviews and will be able to find such pearls in the
rocky landscape of the current medical literature. In this text
you will find case studies and examples directly relevant to your
area of clinical practice. More importantly, you will find true-to-life examples of how to address the daily patient dilemmas
you face more effectively. You will find clinical practice more
satisfying and, most important, you will be more confident in
providing your patients with optimal medical care. Finally, if
you are interested in a deeper understanding of EBM, this book
provides a stepping stone to a more comprehensive text that can
provide knowledge and skills required for not only the practice,
but also the teaching of EBM[2].
It has been my privilege and joy to reflect on EBM in the
context of this wonderful book, prepared by two of my dear
friends and their outstanding colleague.
Gordon Guyatt, MD
McMaster University
References
[1] Evidence-Based Medicine Working Group. 1992. Evidence-based
medicine. A new approach to teaching the practice of medicine. Journal
of the American Medical Association. 268(17): 2420–2425.
[2] Guyatt G and Rennie D (eds.) 2002. The Users’ Guides to the Medical
Literature: A Manual for Evidence-Based Clinical Practice. AMA publications: Chicago, Illinois.
Preface
Inday (Leonila) and I finished our Masters Degrees in Clinical Epidemiology at the McMaster University Medical Center
in Ontario in 1991. Since then, our lives have never been the
same. The learning culture was unusual at Mac. Education was
casual and fun, self-directed and interactive – very different from
the serious, didactic, teacher-directed learning that characterized
the region where we came from. Distinguished faculty insisted
on being called by their first names, rather than ‘professor’ or
‘doctor’. This was difficult for us at first, but soon we began to
understand: at Mac, there was only a grey line separating students
and teachers.
In the midst of what, for us, was a novel learning environment,
we had the great fortune of meeting pioneers in the field of
Clinical Epidemiology such as Gordon Guyatt, Brian Haynes,
Andy Oxman, Dave Sackett and Peter Tugwell. If there was one
thing we brought home from interacting with these professors,
it was a passion for lifelong learning. Although the term had
not yet been coined, they developed and taught the rudiments
of evidence-based medicine: the process of updating one’s self
to ensure that patients receive not just the latest, but the best
medical care.
On our return to the Philippines from the idyllic world of
McMaster, we encountered overwhelming obstacles to practising
and teaching what we had learned. The infrastructure was not
ready for EBM; there was no internet or e-mail, MEDLINE was
not accessible and our medical library at the university seemed
like a newspaper stand compared to the library at McMaster.
Yes, we worked in an academic setting, but that too was not
ready to accept a way of thinking that challenged the status quo.
Traditional ‘experts’ felt threatened by EBM, and clinicians were
bewildered by the onslaught of alien epidemiologic terms.
To make matters worse, we grappled constantly with the
problem of applicability of research conducted in other countries.
When are such results applicable to the patients we see? Should we
always insist on local studies? What criteria do we use to decide?
We had many questions, and hardly any answers. Although the
books we brought home from McMaster did not have the solutions, the curiosity we had developed and the creativity we had
learned from our mentors there helped us think through the
problem. In 1998, with much encouragement and guidance from
Gordon Guyatt, we wrote an article on the applicability of clinical trials: our own contribution to the Journal of the American
Medical Association (JAMA) series of Users' Guides to Reading
the Medical Literature[1]. Since then, we have developed guides
on the applicability of other study designs. We share many of
them for the first time in this book.
Another problem which we encountered in trying to teach
EBM was the discovery that, unlike healthcare providers in North
America, those in developing countries were unable to spare a
lot of time reading thick books, listening to long explanations
of different concepts of EBM, or attending lengthy workshops
or seminars. They didn’t want 5-day workshops; they wanted
1-hour lectures. They didn’t want a 500-page dissertation; they
wanted brief material that could be read in just a few hours. Of
course, we insisted on a compromise. Most workshops which we
conducted in the 1990s lasted an average of 2 days and the reading
material was at least 25 pages. Slowly but surely through these
abbreviated meetings, EBM crept into the consciousness of health
workers: first among physicians and then among various healthcare professionals. These meetings were a tremendous source of
gratification, with each workshop bringing new knowledge, new
friends and lots of fun.
Dr Mianne Silvestre was one of our participants in these 2-day
workshops. Hers is a story we use to inspire other students
in EBM. Mianne caught on to EBM really fast. A week after
attending her first workshop, she was able to facilitate and teach.
A week after that, we heard she had been invited to Singapore to
lecture on evidence-based neonatology! This rapid ascent to fame
has allowed us to proclaim: you don’t need a Masters degree in
clinical epidemiology to practice and teach EBM; all you need is
a passion for lifelong learning. Her valuable contribution to this
book is to ensure that reading it is ‘painless’.
With Mianne and eight other enthusiastic participants of our
self-learning workshops, we have extended the EBM learning
experience through this book. We now provide not 25 but
150 pages of prose, developed through years of trying different
approaches to teaching what many consider to be difficult
concepts. We have painstakingly put these strategies together to
ease the pain of reading and shorten the time it takes to understand EBM. As a result, our publishers have told us that we have
an unusually large number of editors and contributors for such a
short book. This is because we spent years
testing the materials on hundreds of workshop participants in
the Asia-Pacific region for clarity and re-writing them to make
learning exciting and enjoyable.
To augment the simplicity of our prose, we have developed
what we call ‘tackle boxes’. These are stand-alone tables that
help readers grapple with difficult issues without interrupting
the train of thought in the text. These tackle boxes come with
instructions for use, and often have exercises to reinforce the
lessons learned. The result is a simple, concise book which we
hope captures the very essence of EBM: lifelong learning that is
painless, spontaneous and lots of fun!
We would like to thank Mrs Jemimah Gambito-Esguerra, CEO
of the Asia-Pacific Center for EBM, who helped us put together
this manuscript; Miguel, our son who, fresh from high school,
became our guinea pig to test the manuscript for readability; and
Sandra, our daughter, who took time from her studies to draw
the cartoons for each chapter.
Special thanks are due to Dr Peter Tugwell and Dr Gord Guyatt,
revered mentors and dear friends who inspired us to write this
book, and from whom we learned the rudiments of practising
and teaching EBM.
Lastly, we thank the thousands of workshop participants in the
region whose company we have cherished and who continue to
show us how to teach EBM.
Mabuhay!
Antonio L. Dans
References
[1] Dans AL, Dans LF, Guyatt GH and Richardson S for the Evidence-Based
Medicine Working Group. 1998. How to Decide on the Applicability
of Clinical Trial Results to Your Patient. Journal of the American Medical
Association. 279(7): 545–549.
List of Contributors
Hilda Diana A. Alava
Department of Pediatrics
Manila Central University FDTMF College of Medicine
Caloocan City, Philippines
Marissa M. Alejandria
Department of Clinical Epidemiology
University of the Philippines College of Medicine
547 Pedro Gil St., Ermita, Manila, Philippines 1000
Antonio L. Dans
Section of Adult Medicine
Department of Medicine
University of the Philippines College of Medicine
547 Pedro Gil St., Ermita, Manila, Philippines 1000
Leonila F. Dans
Department of Pediatrics
Department of Clinical Epidemiology
University of the Philippines College of Medicine
547 Pedro Gil St., Ermita, Manila, Philippines 1000
Aldrin B. Loyola
University of the Philippines College of Medicine
Medical Specialist III
University of the Philippines-Philippine General Hospital
Taft Ave., Manila, Philippines
Jacinto Blas V. Mantaring III
Clinical Associate Professor of Pediatrics
University of the Philippines College of Medicine
547 Pedro Gil St., Ermita, Manila, Philippines 1000
Bernadette A. Tumanan-Mendoza
Institute of Clinical Epidemiology
NIH-University of the Philippines Manila
Benita S. Padilla
Hemodialysis Unit
National Kidney and Transplant Institute
Quezon City, Philippines
Felix Eduardo R. Punzalan
University of the Philippines College of Medicine
547 Pedro Gil St., Ermita, Manila, Philippines 1000
Maria Asuncion A. Silvestre
University of the Philippines College of Medicine
547 Pedro Gil St., Ermita, Manila, Philippines 1000
Maria Vanessa V. Sulit
Asia Pacific Center for Evidence Based Medicine
1344 Taft Ave, Manila, Philippines, 1000
1 Introduction
Antonio L. Dans, Leonila F. Dans, Maria Asuncion A. Silvestre
Half of what we learn in medical school is wrong.
We just don’t know which half.
This statement is often heard by freshmen as they are ushered
into medical school[1], but it probably rings true for students in
nursing, dentistry, midwifery, physical therapy and other allied
medical professions as well. A lot of truth dwells in these words.
Just a few years ago, we thought that enlarged tonsils had to
be removed, pregnant mothers had to be shaved before delivery
and vitamin C enhanced immunity to respiratory tract infections.
These were non-debatable bits of ‘knowledge’ then. Today, they
are nothing more than sombre testimony to the fallibility of
the human mind. Our problem is not healthcare education per
se. Our problem is progress. Science evolves so fast that what we
know now will quickly be outdated if we don’t keep up with the
literature.
If there was a problem with education in medicine and its
allied professions in the last century, it was that professionals
were not taught how to keep up with science. We were certainly
told that we had to keep up-to-date, but we didn’t know how to
do this efficiently until 1991. In that year, Dr Gordon Guyatt
of McMaster University Medical College in Ontario, Canada,
described what he believed were important improvements in the
way medicine was taught in his university:
Clinicians were formerly taught to look to authority (whether a textbook, an expert lecturer, or a local senior physician) to resolve issues
of patient management. Evidence-based medicine uses additional strategies, including quickly tracking down publications of studies that are
directly relevant to the clinical problem, critically appraising these studies,
and applying the results of the best studies to the clinical problem at
hand[2] .
This is the first use of the term evidence-based medicine (EBM)
in published literature. While the emphasis in this passage is
on decision-making, the key process has to do with keeping
up-to-date with the literature. The faculty of McMaster thought
they were on to something exciting[2] – and they were! Within
the next few years, the concept spread like wildfire, becoming one
of the most widely-used phrases in the medical literature (see
Figure 1.1). EBM was introduced into the curricula of healthcare
professionals, first in medicine and later in other fields[3]. Seminars and workshops were conducted across the globe, involving
thousands of practitioners from various health care disciplines.
[Figure 1.1 MEDLINE citations containing the phrase 'evidence-based' in the title or abstract, by year of publication: the number rises steadily from 21 citations in 1992 to 5911 in 2006.]

The popularity of EBM is easy to understand. For many, the
proposed 'rules of evidence' were simple and easy to grasp.
These rules demystified scientific research, turning it into something busy practitioners could understand, challenge and keep up
with. A few 'philosophers' have debated whether EBM deserves
the popularity it has gained[4, 5], but the debate has been confusing
rather than helpful, fuelled by misconceptions and hurt rather
than meaningful differences in opinion. An example of just how
confusing the debate has been is as follows:
The authors reframe the evidence-based medicine debate by pointing out
an underappreciated epistemological deficiency: evidence-based medicine
as currently conceptualized cannot accommodate concepts that resist
quantitative analysis and therefore cannot logically differentiate human
beings from complex machines. The authors use Michael Polanyi’s
philosophy of tacit knowing (which refers to the taken-for-granted
knowledge at the periphery of attention that allows persons to understand
the world and discern meaning in it) as a starting point for rectifying
this deficiency and for working towards an improved, person-centred
epistemology of medical practice[6] .
We are sure the intellectual ruminations would be fascinating – if
only we could understand them. The debate, however, is for
philosophers and not for busy healthcare practitioners. For now,
all we want to say is this: if you’re overwhelmed by the literature
in healthcare, then it doesn't matter if you're a physician, dentist,
nurse, midwife or therapist: EBM is for you!
1.1 The definition of EBM
Choosing an acceptable definition of EBM is difficult since
there are so many definitions available[7]. This is partly because
EBM has evolved so much since 1992, and partly because
various healthcare professions have modified its definition to
suit particular fields. Thus, there are definitions for evidence-based surgery[8], evidence-based nursing[9], evidence-based
pediatrics[10], evidence-based psychiatry[11], evidence-based
healthcare[12] and even evidence-based alternative medicine[13],
to name a few. Our search for the best definition led to the
conclusion that there are too many definitions, so what the heck,
here’s our own:
EBM is a systematic approach to the acquisition, appraisal and application
of research evidence to guide healthcare decisions.
Below is our first tackle box on understanding the definition of
EBM (Tackle Box 1.1). As with other tackle boxes in this book,
please spend as much (or as little) time on it as you need to,
before proceeding with the text. In the following sections of this
chapter, we discuss the three essential skills necessary for the
practice of EBM.

Tackle Box 1.1 Components of the definition of evidence-based medicine

Instructions: This tackle box summarizes our definition of EBM and explains
various components of the definition. Read column 1 in its entirety before
reading the details row by row.

'EBM is a systematic approach to the' – EBM allows practitioners to assess
new (as well as old) technology in an efficient and logical manner, without
being intimidated or overwhelmed. This approach requires three sets of skills:

'acquisition,' – the skill to conduct an efficient literature search and
secure a publication in response to information needs;

'appraisal' – the skill to criticize medical literature and decide if
results are credible or not; and

'and application of research evidence' – the skill to utilize the
information in the care of specific persons or populations.

'to guide decisions' – The evidence itself is only one component of the
decisions that need to be made. Other components are tacit knowledge,
professional expertise and patients' preferences.

'in healthcare.' – The skills in EBM have found application in medicine,
surgery, physical therapy, nursing, dentistry, primary care, emergency
medicine and many other fields including public health.

Note: What have we omitted from other definitions and descriptions?
1. The role of clinical expertise[14]: while this is an important aspect of
healthcare decisions, it is a set of skills distinct from EBM, developed and
taught in other areas of training of a healthcare professional.
2. The role of patient preferences[14]: again, we feel this is an important
element of medical decisions but, like clinical expertise, skills for
eliciting patient preferences are distinct from EBM, and are developed and
taught in other areas of training.
3. EBM is a philosophy, a movement, or even a scientific revolution[15].
Let's leave the debate to philosophers. This is a book for healthcare
practitioners. We say it again: if you are drowning in the medical
literature, then EBM is for you!

1.2 The three skills of EBM

Our proposed definition of EBM requires that healthcare
providers demonstrate three major skills to efficiently process the
literature. Each skill has several components which are illustrated
in Figure 1.2 and discussed in the following sections.

[Figure 1.2 The three skills of EBM: how to acquire, appraise and apply
the evidence. Acquire: ask a focused question, conduct the search, retrieve
the information. Appraise: appraise directness, appraise validity, appraise
results. Apply: assess applicability, individualize results.]

Skill number 1: Acquiring the evidence

The literature on healthcare can be acquired in two modes: by
active searching or by browsing[14]. In the active mode, acquisition
of evidence is driven by a problem we encounter during the actual
care of a specific patient. This triggers a process of search and
retrieval for a specific article. In contrast, data acquisition happens
by accident in the browse mode, e.g. while leafing through articles
in a journal, listening to a lecture series or surfing the net. Most
EBM advocates prefer the active mode because the evidence is
used almost immediately on a patient. Any lessons learned in this
process are more likely to be retained.
Active searches entail three important skills:
(a) how to ask focused clinical questions;
(b) how to search the literature for answers; and
(c) how to retrieve the necessary references.
How to ask focused clinical questions
The most common types of questions asked in clinical practice
pertain to the effectiveness of treatment, accuracy of diagnostic
tests, prognosis of certain conditions or harmful effects of certain
exposures. Whether they are on therapy (Chapter 2), diagnosis
(Chapter 3), harm (Chapter 4) or prognosis (Chapter 5), focused
clinical questions have three components: the population of interest
(P), the exposure in question (E) and the outcome expected (O).
These are explained in detail in their respective chapters. Tackle
Box 1.2 includes examples of how to phrase these questions.
Generating focused clinical questions during day-to-day practice is very important because it is during encounters with patients
that we come face to face with much of our information needs.
Because health technology evolves so rapidly, it is safe to assume
that healthcare professionals should be asking these questions all
their lives. In an actual healthcare service, this will entail constant
vigilance. It will also entail a measure of humility. Instead of
posing as professors who know everything, practitioners should
role-model curiosity and information-seeking behaviour in their
daily practice.
How to search the medical literature
One advantage of phrasing clinical questions as populations (P),
exposures (E) and outcomes (O) is that these variables are our
gateways to the medical literature. Medical literature databases
(such as MEDLINE) usually classify articles according to P, E
and O. Looking for articles in these databases becomes a simple
matter of using these keywords as filters.
In recent years, managers of these medical literature databases
have begun to classify articles according to study design. This is
an exciting development because now we can specify not just the
topic, but the study design as well.
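To make this concrete, here is a minimal sketch (our own illustration, not
part of the book's own search strategies) of how P, E and O keywords plus a
study-design filter can be combined into a MEDLINE query through NCBI's
public E-utilities interface to PubMed. The example terms echo the isoniazid
question in Tackle Box 1.2.

    # A minimal sketch: searching PubMed (MEDLINE) via NCBI E-utilities,
    # combining P, E and O keywords with a study-design filter.
    import json
    import urllib.parse
    import urllib.request

    P = "HIV infections"   # population
    E = "isoniazid"        # exposure (the treatment)
    O = "tuberculosis"     # outcome
    design = "randomized controlled trial[Publication Type]"

    term = f"({P}) AND ({E}) AND ({O}) AND {design}"
    url = ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?"
           + urllib.parse.urlencode({"db": "pubmed", "term": term,
                                     "retmode": "json", "retmax": "20"}))

    with urllib.request.urlopen(url) as response:
        result = json.load(response)["esearchresult"]

    print(result["count"], "citations found")  # how many articles match
    print(result["idlist"])                    # their PubMed IDs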
More details on systematic search strategies are described
in the final chapter of this book. It is sufficient to say at this
point that the ability to conduct efficient searches is a new skill
expected of all healthcare providers. This skill is now a specific
expectation in undergraduate and postgraduate curricula for
healthcare professionals.
Tackle Box 1.2 Asking a focused clinical question

Instructions: A well-stated question makes it clear whether one is dealing
with a query on therapy, diagnosis, harm or prognosis. If the type of
question is not clear, then something is wrong. Go through this tackle box
to learn the syntax then rehearse the skill while drawing from your own
clinical experience.

Therapy
Syntax: Among P (patients with a certain disease), how effective is E (a
certain treatment) in preventing O (an adverse outcome)?
Sample question: Among children with HIV (P), how effective is isoniazid
prophylaxis (E) in preventing tuberculosis (O)?

Diagnosis
Syntax: Among P (patients with a certain condition), how accurate is E (a
certain test) in diagnosing O (a disease)?
Sample question: Among patients with acute chest pain (P), how accurate is
an electrocardiogram (E) in diagnosing acute myocardial infarction (O)?

Harm
Syntax: Among P (a certain group of patients), how much does E (a
potentially harmful exposure) contribute to the causation of O (a certain
disease)?
Sample question: Among healthy males (P), how much does smoking (E)
contribute to the causation of lung cancer (O)?

Prognosis
Syntax: Among P (patients with a certain disease), by how much does E (a
prognostic factor) increase the risk of O (a certain complication)? or
Among patients with P, how big is the risk of O?
Sample question: Among patients with prostatic cancer (P), by how much does
lumbar metastasis (E) increase 5-year mortality (O)? or Among patients with
stage IV breast cancer (P), what is the risk of mortality in the next
5 years (O)?

Note: P = the population of interest (usually characterized by a disease or
condition); E = the exposure being evaluated (a treatment, test, harmful
exposure or a prognostic factor); O = the outcome expected (a disease,
complication or some measure of health). In a single study, several Ps, Es
and Os may be compared at the same time.

Exercise: Look back to a patient you took care of in the previous week.
Think of four focused questions that you could have asked while caring for
that patient, and state them in terms of P, E and O.
How to retrieve articles
In most developed countries, once an article has been identified
through a systematic search it can almost always be retrieved
electronically at the touch of a key. Unfortunately, there are
great inequities in access to health information. Libraries in low
to middle income countries (LMICs) are generally small and
under-resourced. For this reason, journal retrieval can become
long-drawn, tedious and frustrating. Methods of tackling this
problem include the following:
1. Access free articles online. Many journals such as the Journal
of the American Medical Association (JAMA) and Lancet
provide free access six to twelve months after publication.
Others such as BioMed Central (BMC) are entirely free for
the developing world. Freemedicaljournals.com lists journals
that currently provide open access. PubMed also highlights
such journals. The World Health Organization through the
Health InterNetwork Access to Research Initiative (HINARI)
provides free online access to major journals to many developing countries.
2. Seek help from multinational companies which have access to
extensive library collections (this will be better for your health
than seeking free meals).
3. Write to the authors of the publication and request a copy of
their article. E-mail addresses are usually listed in the article
itself, and authors are often happy to learn of interest in their
papers.
4. Keep a list of friends in developed countries, especially those
with access to university libraries. Make sure to rotate requests
for journals so that they all remain your friends.
If all else fails, you can always pay for online access to an
article. The more often you try to retrieve articles, the easier it
becomes. Hopefully, one day, someone will address this problem
of inequitable access to healthcare information.
Skill number 2: Appraising the evidence
Once an article has been obtained, three aspects of a study
need detailed appraisal: directness, validity and results. These are
described briefly below, and are discussed in greater detail in
Chapters 2–6.
Appraising directness
Directness refers to how well the PEO in the article corresponds
to the PEO that you ask. Because medical knowledge is limited,
the answers provided in the literature are often similar but
not identical. Sometimes the difference is trivial and can be
neglected. Other times, however, the differences are important
and worrisome.
Appraising validity
Validity refers to how close we think study results are to the
truth. As can be seen in later chapters, there are numerous ways
in which studies may be flawed. These flaws can lead to biases,
meaning they can lead to over- or under-estimates of the things
we want to measure, such as the effectiveness of a treatment, the accuracy
of a test, or the causation or prognosis of disease. The higher the
number of flaws, the greater the expectation of bias.
Unfortunately, it is impossible to simply classify evidence as
valid or invalid based on study design. The difference between a
perfectly valid and a totally invalid study is a huge grey area of
intermediate validity. This has led to the concept of a ‘hierarchy
of evidence’, where study designs are ranked according to validity
rather than categorized as valid or invalid. The goal of EBM is
to identify the best evidence in this hierarchy for each focused
clinical question. More will be learnt about evaluating the validity
of specific studies in Chapters 2–6.
Appraising the results
We use the term ‘results’ to refer to numeric expressions of
effectiveness, accuracy, causal relationships and prognosis. Examples include the relative risk reduction, number needed to treat,
likelihood ratios, odds ratios and hazard ratios. Understanding
these numeric expressions can be problematic, especially for the
numero-phobic. However, as will be seen, time spent understanding these concepts will eventually be gratified by improvements in the quality of care rendered. We have exerted a lot of
effort trying to simplify these concepts in this book.
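As a foretaste of those chapters, here is a small worked sketch, with
made-up numbers of our own, showing how three of these expressions fall out
of the event rates in a two-arm trial.

    # A worked sketch with hypothetical numbers: deriving the absolute risk
    # reduction (ARR), relative risk reduction (RRR) and number needed to
    # treat (NNT) from the event rates in a two-arm trial.
    deaths_control, n_control = 20, 100   # 20% died in the control group
    deaths_treated, n_treated = 15, 100   # 15% died in the treated group

    cer = deaths_control / n_control      # control event rate = 0.20
    eer = deaths_treated / n_treated      # experimental event rate = 0.15

    arr = cer - eer                       # absolute risk reduction = 0.05
    rrr = arr / cer                       # relative risk reduction = 0.25
    nnt = 1 / arr                         # number needed to treat = 20

    print(f"ARR = {arr:.2f}, RRR = {rrr:.0%}, NNT = {nnt:.0f}")
    # On average, 20 such patients must be treated to prevent one death.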
Skill number 3: Applying the evidence
After acquiring the evidence and appraising directness, validity
and results, the last step in processing the evidence is applying it to
a particular clinical situation. Two tasks are required: assessment
of applicability and individualization of results.
Assessing applicability
Applicability refers to the extent to which conclusions of a study
can be expected to hold true for a particular patient. It is similar
to directness, but not exactly the same. Directness compares the
clinical PEO to the research PEO (in a very general sense). Clinicians can then decide if an article is worth reading or not. Applicability, on the other hand, takes a closer look at specific issues
such as biologic and socioeconomic differences between the study
population and the patients we see. Clinicians reading articles in
scientific journals intuitively ask: will this technology work for
my patients? While the hesitation is sometimes well founded, it
may actually be unnecessary in some cases. In Chapters 2–6, we
share ways of thinking about applicability that have helped us
strike a balance between hasty application and excessive caution
in the interpretation of results.
Individualizing the results
Studies are concerned with average effects of health technology
on large groups of people. As healthcare providers, however, we
deal with individual patients whose responses may differ from
the average. The estimation of the effect in an individual is
therefore an important process. We refer to this process as ‘individualization of results’. Individualization is both a science and
an art. The science deals with estimation of the magnitude of
the effect on a particular individual. This involves some grappling with numbers. The art refers to sharing decisions with
patients, laying the cards on the table and using their own values
and preferences to assess the trade-offs between benefit, harm
and cost.
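One widely taught way of handling the arithmetic side, sketched below with
hypothetical numbers (this is a generic heuristic, not necessarily the
specific method developed in later chapters), is to scale a trial's number
needed to treat by a factor f expressing how the individual patient's
baseline risk compares with that of the average trial patient.

    # A sketch of a common individualization heuristic: assuming the
    # relative risk reduction is roughly constant across risk groups, a
    # patient whose baseline risk is f times the trial average has an
    # individualized NNT of trial_nnt / f. All numbers are hypothetical.
    trial_nnt = 20   # NNT reported by a trial
    f = 2.0          # this patient's baseline risk is twice the trial average

    patient_nnt = trial_nnt / f   # higher-risk patients stand to gain more
    print(f"Individualized NNT = {patient_nnt:.0f}")  # -> 10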
1.3 Summary
Evidence-based medicine is a systematic approach to the acquisition, appraisal and application of research evidence to guide
healthcare decisions. The key skills in EBM help us keep up-to-date with the literature. Acquisition of evidence involves
skills in asking focused clinical questions, searching for the
evidence and retrieving articles. Appraisal of evidence involves
skills in the critical assessment of directness, validity and results.
Finally, application of evidence involves skills in the assessment of applicability and individualization of results. Acquisition skills are detailed in Chapter 7 (Literature searches), while
appraisal and application skills are discussed more extensively in
Chapters 2–6.
References
[1] http://hms.harvard.edu/public/history/past-deans.html; George Packer
Berry, 1949–1965. Past Dean of the Harvard Medical School.
[2] Evidence-Based Medicine Working Group. 1992. Evidence-based
medicine. A new approach to teaching the practice of medicine. Journal
of the American Medical Association. 268(17): 2420–2425.
[3] Green ML. 2000. Evidence-Based Medicine training in graduate
medical education: past, present and future. Journal of Evaluation in
Clinical Practice. 6(2): 121–138.
[4] Miles A, Charlton B and Bentley P. 2000. New perspectives in evidencebased health care debate. Journal of Evaluation in Clinical Practice. 6(2):
77–84.
[5] Miles A, Bentley P, Polychronis A and Grey J. 1997. Evidence-based
Medicine: Why all the fuss? This is why. Journal of Evaluation in Clinical
Practice. 3(2): 83–86.
[6] Henry SG, Zaner RM and Dittus R. 2007. Viewpoint: Moving beyond
Evidence-based Medicine. Academic Medicine. 82(3): 292–297.
[7] Buetow S and Kenealy T. 2000. Evidence Based Medicine: need for
new definition. Journal of Evaluation in Clinical Practice. 6(2): 85–92.
[8] Toledo-Pereyra LH. 2005. Evidence-Based Surgery. Journal of Investigative Surgery. 18(5): 219–222.
[9] DiCenso A, Cullum N and Ciliska D. 1998. Implementing
evidence-based nursing: some misconceptions [editorial]. Evidence-Based Nursing. 1: 38–40.
[10] Moyer VA and Elliott EJ. 2000. Preface In Moyer VA, Elliott EJ, Davis
RL, Gilbert R, Klassen T, Logan S, Mellis C and Williams K (eds.).
Evidence based pediatrics and child health. London: BMJ Books.
[11] Goldbloom DS. 2003. Evidence-based psychiatry [editorial]. CPA
Bulletin. 35(6): 3, 5.
[12] Hicks N. 1997. Evidence based healthcare. Bandolier. 4(39): 8.
http://www.jr2.ox.ac.uk/bandolier/band39/b39-9.html
[13] Mills EJ, Hollyer T, Guyatt G, Ross CP, Saranchuk R and Wilson
K. 2002. Evidence-Based Complementary and Alternative Medicine
Working Group. Teaching evidence-based complementary and alternative medicine: 1. A learning structure for clinical decision changes.
Journal of Alternative and Complementary Medicine. 8(2): 207–214.
[14] Sackett D. 2000. Evidence-based Medicine: How to Practice and Teach
EBM. 2nd edition. Churchill Livingstone.
[15] Sehon SR and Stanley DE. 2003. A philosophical analysis of the
evidence-based medicine debate. BMC Health Services Research. 3: 14.
[Cartoon: 'Good! You acquired the article! Now you need to appraise and apply.']
2 Evaluation of Articles on Therapy
Leonila F. Dans, Hilda Diana A. Alava, Antonio L. Dans, Benita S. Padilla
Questions on effectiveness of therapy should be phrased in terms
of the variables: P, the patient population with a certain disease
or condition; E, the exposures (or treatments) to be administered
to these patients; and O, the outcomes (or conditions) that the
treatments are intended to prevent or promote. For example:
Among children 1–5 years of age (P), how effective is zinc supplementation compared to placebo (E) in preventing acute diarrhoea (O)?
As can be seen in this example, there are usually two exposures in
a therapy question: the experimental treatment being evaluated
(usually a new one) and the control therapy to which it is being
compared. The control therapy can be a neutral control with
no expected effect (e.g. a placebo), or an active control which
constitutes what is considered the most effective treatment at the
time of investigation.
Outcomes in each of these treatment groups are compared in
therapy trials. Some outcomes are dichotomous, meaning there
are only two possible results, e.g. death or survival. Dichotomous outcomes are usually summarized as proportions (e.g. the
proportion that died in each treatment group). Other outcomes
are continuous in nature, meaning there is a wide range of
possible results (e.g. changes in weight). Continuous outcomes
are commonly summarized as means (e.g. the mean change in
weight in each treatment group).
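The two kinds of summary are easy to tell apart in a toy example (made-up
data of our own):

    # A toy illustration: dichotomous outcomes are summarized as
    # proportions, continuous outcomes as means (made-up data).
    died = [0, 1, 0, 0, 1]                         # dichotomous: 1 = died
    weight_change_kg = [1.2, 0.8, -0.3, 2.1, 0.5]  # continuous outcome

    proportion_died = sum(died) / len(died)
    mean_change = sum(weight_change_kg) / len(weight_change_kg)

    print(f"Proportion who died: {proportion_died:.0%}")  # 40%
    print(f"Mean weight change: {mean_change:.2f} kg")    # 0.86 kg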
2.1 Appraising directness
Before even reading an article, the first thing we must do is
evaluate directness, i.e. how well the PEO in the study (the
research question) corresponds with our own PEO (the clinical
question). The problem of directness is very common. Directness issues arise because investigators are constrained by their
budgets to answer very narrow, focused questions. In contrast,
healthcare professionals in the real world are interested in a
very broad range of issues. These differences are summarized in
Table 2.1.
Because the evidence doesn’t always provide a direct answer
to our clinical question, we are often constrained to use available information. For example, we have used studies in men to
manage coronary disease in women for many decades[1] . Also,
few beta-blockers have been proven to reduce the risk of stroke
in hypertensive patients and yet specialty societies agree that any
beta-blocker may be used[2] .
There are instances however, when we should be very cautious
in using indirect evidence. For example, a study on the antiarrhythmic drug flecainide for acute myocardial infarction (MI)[3]
showed that, despite the beneficial effect on cardiac rhythm (the
surrogate outcome), the drug eventually caused thousands of
2.1 APPRAISING DIRECTNESS
17
Table 2.1 The problem of directness: differences in the types of questions
asked by researchers and clinicians according to the population of
interest (P), the exposures or treatments evaluated (E) and the outcomes
monitored (O)
Research Questions
Clinical Questions
P
Because sample size is
limited, researchers need to
restrict their studies to a
few, broadly defined
subgroups of patients
e.g. the effect of treatment
in the young and old.
Clinically, we are often interested in
treatment effects in many different
subgroups e.g. the effect of
treatment in young and old, males
and females, sick and healthy,
smokers and non-smokers, rich and
poor.
E
Researchers usually evaluate
a specific exposure e.g. drug
preparation, surgical
technique or educational
strategy.
Clinically, we are usually interested
in variations of the exposure e.g. a
similar drug belonging to the same
class, a similar surgery using a
modified technique, or a similar
educational strategy using a different
educational tool.
Clinically, we are only interested in
fair comparisons.
O
Researchers sometimes
make inappropriate
comparisons e.g. a high
dose of one drug versus a
low dose of another.
Researchers usually monitor
a selected few, easily
measurable outcomes e.g.
surrogate outcomes such as
serum cholesterol and
blood pressure.
Researchers at times
monitor composite
outcomes e.g. instead of
measuring the incidence of
death alone, they monitor
the combined incidence of
death, stroke or heart
attack. This increases the
number of events in a
study, and thus decreases
the required sample size.
Clinically, we are more interested in
important effects that are sometimes
difficult and expensive to measure
e.g. clinical outcomes such as pain
relief, disability, overall quality of
life or mortality.
Clinically, we are interested in the
effect of treatment on separate
outcomes because in composite
outcomes 1) the individual
components may not be of equal
importance to us or 2) the effect of
treatment may not be of the same
magnitude for each component.
18
CH02 EVALUATION OF ARTICLES ON THERAPY
deaths in the setting of an acute MI (the clinical outcome)[4] .
If we are to avoid harming patients the way this drug did, we
must think twice before assuming that a surrogate outcome,
such as reducing cardiac rhythm problems, provides a direct
answer to a clinically important question, such as prolonging
life.
Evaluating directness is as much an art as it is a science. If you
feel the article might provide a reasonably close answer to the
question you are asking, then proceed to evaluate it in greater
detail. Otherwise, it would be wise to spend your time looking
for a closer match.
2.2 Appraising validity
The validity of trials comparing two therapies depends almost
entirely on how fair the comparisons are between patients in
the treatment and control groups. We summarize eight criteria
for assessing validity. These criteria are phrased as questions.
A ‘yes’ answer to a question means that the criterion is satisfied. The more ‘yes’ answers there are, the more sure we
become that the comparisons are fair and that the odds are
not stacked in favour of one group (a phenomenon known as
bias).
Question #1: Were patients randomly assigned to treatment groups?
Random assignment of patients to treatment groups in a trial
is the best technique to ensure that treatment groups are truly
comparable. If patients are not assigned to treatment groups at
random, then allocation may become unfair. Cases with poorer
prognosis may end up being assigned to one treatment group.
Historically, a coin toss could be used to assign patients to either
treatment group. More frequently, tables of random numbers
and, more recently, computer-generated random sequences have
been used.
Studies that use this strategy to compare two treatments are
referred to as randomized controlled trials (RCTs). Such trials are
considered the highest form of evidence on therapy. However,
there are some situations where we cannot insist on randomization. For example, when one therapy is clearly superior to
another in reversing the course of a disease, it would be unethical
to conduct an RCT. In such situations (e.g. blood transfusion for
massive haemorrhage, or surgical repair of perforating abdominal injuries), we can usually assume effectiveness. Unfortunately,
such miracle cures are the exception rather than the rule. In
general, we search for RCTs before accepting a claim that a
therapy is effective.
Tip: To check if randomization was performed, look for the
words ‘randomization’, ‘randomized’ or ‘randomly allocated/
assigned’ in the title, abstract or methodology.
Question #2: Was allocation concealed?
In a trial, the random order by which we assign patients to
treatment groups is referred to as the allocation sequence. It is
not sufficient that researchers generate an allocation sequence
that is random; they must also take measures to ensure that the
sequence is not altered by clinicians who unknowingly (or knowingly) tend to disturb it. These measures are called allocation
concealment strategies. One such measure is placement of treatment assignments in sealed opaque envelopes. Other investigators
assign a third party not involved in the study (e.g. a pharmacist)
to assign the treatment. More recently, objective computers
have replaced subjective humans in the task of allocating
treatment.
Tip: To ascertain allocation concealment, look for a phrase or
paragraph stating one of the following:
1. that a clinician recruiting a patient must contact a remote
centre to obtain a patient treatment assignment;
2. that envelopes containing the allocation sequence were sealed
and numbered; or
3. that vehicles or packages containing the medications were
indistinguishable.
Question #3: Were baseline characteristics similar
at the start of the trial?
Because of randomization, baseline characteristics of treatment
groups in an RCT tend to be very similar. Sometimes, however,
inequality between treatment groups may arise due to chance
alone. This third criterion checks how successful randomization and allocation concealment were by actually comparing
the characteristics of patients within the treatment and control
groups.
The comparisons are usually presented as a table of baseline
characteristics with columns for treatment and control groups.
Sometimes there is also a column of p-values but this is not as
important. A p-value just tells us if the difference occurred by
chance. What we really need to know is if the difference is big
enough to affect the results by putting one group at a disadvantage. Thus, when studies report, ‘There was no statistically
significant difference between the two groups for baseline characteristics', try to assess if the magnitude of the differences is
clinically important. For example, a difference in mean gestational
age of 31 weeks versus 33 weeks among premature babies may
be statistically insignificant (p > 0.05). However, this difference
may still be clinically important in a trial assessing a therapy to
prevent hyaline membrane disease.
Question #4: Were patients blinded
to treatment assignment?
If patients know which treatment they are receiving (active
drug or placebo), they can subconsciously or even deliberately influence their own outcomes. This may occur, especially
when the outcomes being monitored are subjective. For example,
patients may be more inclined to report symptoms if they know they are taking a placebo. This will make the active drug seem better.
Tip: To decide if patients were blinded during conduct of the
trial, look for use of identical preparations. Obviously, blinding is
more difficult (even impossible) when the interventions involve
diet, educational manoeuvres or surgical procedures.
Do not confuse blinding with allocation concealment. Blinding
is an attempt to make the treatments being compared indistinguishable. In contrast, allocation concealment is an attempt to
preserve the allocation sequence (as described in Question #2
above).
Question #5: Were caregivers blinded
to treatment assignment?
When caregivers and physicians are aware of the treatment group
to which patients are assigned, they may treat the patients in the
two groups differently. For example, a clinician who knows his
patient is on a placebo may worry and decide to take better care
of that patient. Conversely, a clinician who knows the patient
is on treatment may decide to monitor more frequently just in
case there are side effects. These changes in care may make a
treatment appear better or worse than it really is.
Tip: To decide if caregivers were blinded in the conduct
of the trial, look for use of identical preparations. Again,
blinding is not always possible. It can be very difficult when the
interventions involve diet, educational manoeuvres, or surgical
procedures.
Question #6: Were outcome assessors blinded
to treatment assignment?
Outcome assessors can be the patients themselves, or their
caregivers. They are sometimes directly involved in assessing
outcomes or therapy response. For example, patients may be
asked if they feel better and caregivers may be asked if they
think their patients are doing well. The potential bias involved
in this task can often be reduced by blinding them to treatment
assignments as described in Questions #4 and #5.
Sometimes, however, the outcome assessors are study
personnel (for example, members of an endpoint committee),
responsible for deciding how a patient responded to therapy. The
questions these personnel answer can become very tricky. Was
this hospitalization a result of a drug side effect? Did this patient
die of a heart attack or pneumonia? Is this rise in creatinine
significant or not?
Because their decisions are often subjective, information on
treatment assignment should be withheld from outcome assessors
whenever they review a case. This is an important strategy. As
mentioned in Questions #4 and #5, blinding of patients and their
caregivers may be difficult or impossible in some trials. However,
blinding study personnel when they assess outcomes is almost
always doable (with rare exceptions). This will usually ensure a
more objective evaluation.
Tip: First decide who is making the outcome assessment: the
patient, the caregiver or the investigator. If it is the patient or
caregiver, the tips to answer Questions #4 and #5 will apply. If it
is the investigator, look for statements indicating that they were
‘blinded’, that clinical data was withheld, or that their assessments
were independent.
Question #7: Were all patients analysed in the groups
to which they were originally randomized?
As in the real world, patients in trials may not comply or adhere
to the treatment protocol that they were assigned to. This may
happen for many reasons. They may forget how to take a drug
properly, suffer intolerable side effects, or even refuse to go on for
no apparent reason. A dilemma arises: should they be analysed
in their original treatment group (intention-to-treat analysis),
or should they just be excluded from the study (censored
analysis)?
There are two reasons why non-compliant patients should
be analysed in their original groups. Firstly, in the real world,
a patient’s ability to comply is considered part of the performance (success or failure) of that therapy. If investigators remove
non-compliant patients from their analysis, the treatments may
appear better than they actually are. Secondly, non-compliant
patients should be analysed in their original assignments because
removing them could disturb the balance achieved by random
allocation.
Tip: To make sure that investigators did not censor non-compliant patients from the analysis, look for the term 'intention-to-treat' under the analysis section of the article. If not stated
explicitly, look for some indication that patients retained their
original assignment even if they were non-compliant. If this
is not stated as well, a last resort would be to check that
the number of patients randomized is equal to the number of
patients analysed at the end of the study. This suggests (but
does not guarantee) that patients were not excluded from the
analysis.
Question #8: Was the follow-up rate adequate?
Adequacy of follow-up refers to minimization of the number
of patients who drop out from a study. Drop-outs are not
the same as non-compliant patients (discussed in Question #7
above). Non-compliant patients stop taking the drug, but we can
still determine what happened to them. In fact, as previously
discussed, we can still include them in the analysis. In contrast,
drop-outs leave the study (or are removed), leaving us with no
data on their outcomes. Drop-outs usually leave the study because
of adverse events or dissatisfaction. The greater the number of
patients lost to follow-up, the more the study validity is threatened. The crucial issue is ascertaining when to worry about the
number of drop-outs. Tackle Box 2.1 provides an approach to
this problem. Essentially, readers should worry about drop-out
rates when they are large enough to affect the outcome of a trial.
The eight questions on the validity of a study have now been
discussed. While it is tempting to be strict and insist that all eight
criteria be satisfied, we must be pragmatic and remember that
we sometimes need to make medical decisions based on less than
perfect information. If you feel that errors are small or that this is
probably the best study you will find that addresses your clinical
question, read the results. If you feel the errors are too great and
that there are better studies, then don’t waste your time with the
article.
2.3 Appraising the results
Question #1: How large was the effect of treatment?
The magnitude of the treatment effect may be expressed by
comparing outcomes in the treatment and control groups. As
described earlier, outcomes can be reported either as:
1. continuous variables, which have a range of possible results
(e.g. change in weight, or change in quality of life on a scale
of zero to one); or
2. dichotomous variables, which have only one of two possible
results (e.g. dead or alive, hospitalized or not).
When the outcomes are continuous, the effect of treatment is
simply expressed as the ‘mean difference’. This is easily calculated
by obtaining the mean result in the treatment and control groups
and calculating the difference, that is
Mean difference = mean in control group − mean in treatment group
For example, if a trial on hypertension showed the mean systolic
blood pressure in the control to be 160 mmHg and the mean in
the treatment group to be 120 mmHg, then the mean difference
would be 40 mmHg.
Unfortunately, things are not as straightforward when dealing
with dichotomous outcomes. Tackle Box 2.2 attempts to simplify
this task. If unfamiliar with the concepts of relative risk reduction,
absolute risk reduction and relative risk, take time to study this
box. If you have heard of these concepts but have difficulty
remembering them, the box provides some tricks to help you
recall them in an instant.
Tackle Box 2.1
Assessing if there are too many drop-outs in a study
Instructions: In this hypothetical RCT, patients with severe pneumonia are randomized to receive either active treatment (n = 30) or placebo (n = 30). There are 13 drop-outs in the study, 6 on treatment and 7 on placebo. Go through this tackle box step by step to learn how to decide if there are too many drop-outs in a study.

Step 1: Count the number of patients with bad outcomes in each treatment group, and express this as a fraction of the number of patients analysed.
Treatment group (N = 30): Deaths = 5/24. Placebo group (N = 30): Deaths = 9/23.

Step 2: Count the number of drop-outs in each treatment group.
Treatment group: Drop-outs = 6. Placebo group: Drop-outs = 7.

Step 3: Create a worst scenario for the treatment group by assuming all the drop-outs in this group had the bad outcome, and all the drop-outs in the control group had a good outcome.
Treatment group: Deaths = (5 + 6)/(24 + 6) = 11/30. Placebo group: Deaths = (9 + 0)/(23 + 7) = 9/30.

Step 4: Create a best scenario for the treatment group by assuming the opposite, i.e. all the drop-outs in this group had a good outcome, and all the drop-outs in the control group had the bad outcome.
Treatment group: Deaths = (5 + 0)/(24 + 6) = 5/30. Placebo group: Deaths = (9 + 7)/(23 + 7) = 16/30.

Step 5: Were the conclusions of the best and worst scenarios significantly different? If yes, then there were too many drop-outs! Here, the best-case scenario showed fewer bad events for treatment (benefit), while the worst-case scenario showed more bad events for treatment (harm). Therefore, there were too many drop-outs in this hypothetical study.

Note – This process of making best and worst assumptions about what happened to drop-outs is called a sensitivity analysis. In essence, we are trying to find out if the conclusions are sensitive to the assumptions we make. If it didn't matter what happened to the drop-outs, we say the conclusions are 'robust'. If what happened to the drop-outs would significantly change our conclusions, however, we say that the conclusions are 'soft'.

Exercise – If the deaths on treatment were 1/24 and the deaths on placebo were 16/23, would the same drop-out rates still be worrisome? (Answer: No, the treatment group would have fewer deaths in both the best and worst scenarios.)
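For readers who would rather let a computer do the arithmetic, here is a minimal Python sketch of the same sensitivity analysis. The function name and the simple rule of comparing the direction of effect across scenarios are our own illustration of the tackle box, not a formal statistical test.

```python
# A minimal sketch of the drop-out sensitivity analysis in Tackle Box 2.1.
# The numbers reproduce the hypothetical pneumonia trial above.

def scenario_risks(events_rx, analysed_rx, dropouts_rx,
                   events_ctrl, analysed_ctrl, dropouts_ctrl):
    """Return (worst, best) pairs of (treatment risk, control risk)."""
    n_rx = analysed_rx + dropouts_rx
    n_ctrl = analysed_ctrl + dropouts_ctrl
    # Worst case for treatment: its drop-outs all had the bad outcome,
    # while all the control drop-outs did well.
    worst = ((events_rx + dropouts_rx) / n_rx, events_ctrl / n_ctrl)
    # Best case for treatment: the opposite assumption.
    best = (events_rx / n_rx, (events_ctrl + dropouts_ctrl) / n_ctrl)
    return worst, best

worst, best = scenario_risks(5, 24, 6, 9, 23, 7)
print(f"Worst case: treatment {worst[0]:.0%} vs placebo {worst[1]:.0%}")  # 37% vs 30%
print(f"Best case:  treatment {best[0]:.0%} vs placebo {best[1]:.0%}")   # 17% vs 53%

# If the direction of effect flips between scenarios, the conclusions
# are 'soft' (sensitive to what happened to the drop-outs).
soft = (worst[0] > worst[1]) != (best[0] > best[1])
print("Too many drop-outs!" if soft else "Conclusions are robust.")
```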
Tackle Box 2.2
Measures of effectiveness for dichotomous outcomes
Instructions: Before going through this tackle box, imagine you went on a diet and lost weight from 80 kg to 60 kg. Now spend a few minutes to think of different ways of numerically expressing your weight loss. We've listed three ways to do this below. For each one, study the weight analogy and its formula, then the analogous measure of effectiveness for a change in the risk of death from 8% in a control group (Rc) to 6% in a treatment group (Rt), and finally how to interpret the numbers.

1. 'I lost 25% of my weight' – your relative weight reduction: weight change/original weight = (Wc − Wt)/Wc.
Similarly: 'I lost 25% of my risk'. This is called the relative risk reduction or RRR, and it is usually expressed in %: RRR = risk change/original risk = (Rc − Rt)/Rc.
Interpretation (usually in percent): >0% treatment beneficial; ≈0% treatment no effect; <0% treatment harmful.

2. 'I lost 20 kg' – your absolute weight reduction: weight change = Wc − Wt.
Similarly: 'I lost 2% of my risk'. This is called the absolute risk reduction or ARR, and it is usually expressed in %: ARR = risk change = Rc − Rt.
Interpretation (usually in percent): >0% treatment beneficial; ≈0% treatment no effect; <0% treatment harmful.

3. 'I now weigh 75% of what I used to weigh' – your relative weight: new weight/original weight = Wt/Wc.
Similarly: 'My risk is now 0.75 of what it was'. This is called the relative risk or RR, and it is usually expressed as a decimal number: RR = new risk/original risk = Rt/Rc.
Interpretation (usually in decimals): <1.0 treatment beneficial; ≈1.0 treatment no effect; >1.0 treatment harmful.

Notes:
1) One statement above reads 'I lost 25% of my risk'; another reads 'I lost 2% of my risk'. Can both of these statements be correct, given the same data? The answer is yes. Note, however, that the first is a relative reduction and the second an absolute reduction. Authors like to report relative reductions because they are constant and they also seem bigger. Readers should also look for the absolute reductions.
2) These interpretations of ARR, RRR and RR assume that the outcomes are reported as unfavourable events (e.g. death), rather than favourable events (e.g. survival). If the outcome is reported as the proportion that survived instead of the proportion that died, all the interpretations are reversed. An RRR >0% could actually mean harm instead of benefit. We think authors should always report the harmful outcome to avoid confusion. Also, we feel that it sounds really pessimistic to report 'the relative risk of surviving'. Unfortunately, not everyone agrees.
3) Aside from the RRR, ARR and RR, there is a fourth measure of treatment effects not shown above. This is known as the number needed to treat, or NNT. It is derived from the ARR. In the example above, the ARR is 2%, which means 'I can prevent 2 deaths out of every 100 patients that I treat'. Thus, I need to treat 50 patients before I save one life. NNT, therefore, is simply calculated as 100/ARR (since the ARR is reported in percent). It tells us the number of patients we would need to treat to prevent 1 adverse outcome. The closer the NNT is to one, the more effective the treatment. The higher it is above one, the less effective the treatment is. When the ARR is <0 (i.e. when treatment is harmful), the NNT will also be negative. In this situation it is sometimes referred to as the number needed to harm (NNH).

Exercise:
1. If an RCT showed death rates of 4% in the treatment group and 5% in the placebo group, what would the RRR, ARR, RR and NNT be?
2. What if death rates were 5% on treatment and 4% on placebo?
Answers: 1. RRR = 20%, ARR = 1%, RR = 0.80, NNT = 100; 2. RRR = −25%, ARR = −1%, RR = 1.25, NNH = 100.
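The measures in this tackle box are easy to automate. The short Python sketch below reproduces the exercise; the function and argument names are our own, and the code assumes the ARR is not zero (otherwise the NNT is undefined).

```python
# A minimal sketch of the measures in Tackle Box 2.2. Risks are entered
# as percentages (e.g. 5 for 5%), matching the examples above.

def effect_measures(risk_control, risk_treatment):
    rrr = (risk_control - risk_treatment) / risk_control * 100  # relative risk reduction, %
    arr = risk_control - risk_treatment                         # absolute risk reduction, %
    rr = risk_treatment / risk_control                          # relative risk, as a decimal
    nnt = 100 / arr     # number needed to treat; negative values mean NNH
    return rrr, arr, rr, nnt

# Exercise 1: death rates of 4% on treatment and 5% on placebo
print(effect_measures(5, 4))   # (20.0, 1, 0.8, 100.0)
# Exercise 2: death rates of 5% on treatment and 4% on placebo
print(effect_measures(4, 5))   # (-25.0, -1, 1.25, -100.0), i.e. NNH = 100
```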
Another way of summarizing dichotomous outcomes is by
using rates instead of proportions. As the term suggests, rates tell
us how fast events accumulate in treatment and control groups
through time. Thus, if a clinical trial is conducted over a four-year period, the results can be stated as either: 40% of patients
died in the control group and 20% in treatment (proportions);
or 10% died per year in the control group, and 5% died per year
in the treatment group (rates). The difference can be graphically
illustrated in survival curves as shown in Figure 2.1.
[Figure 2.1 Survival curve of a hypothetical randomized controlled trial. Death rate (%) is plotted against years of follow-up (1 to 4); the placebo group curve rises more steeply than the treatment group curve.]
A common expression comparing two rates would be the hazard ratio, which is the rate of outcomes on treatment divided by the
rate of outcomes in control. This expression is similar to the
relative risk (RR). It is interpreted in exactly the same way, and its value usually differs from the RR by only a small amount. The various ways
of expressing effectiveness are summarized in Table 2.2 below.
Table 2.2 Ways of expressing effectiveness

Dichotomous outcome (e.g. lived or died, BP controlled or not)
  Summary of result within each group: Proportion (e.g. deaths per 100 patients)
    Comparison of results between two groups: Relative risk reduction, absolute risk reduction, relative risk (see Tackle Box 2.2)
  Summary of result within each group: Rate (e.g. deaths per 100 patients per year)
    Comparison of results between two groups: Hazard ratio = rate in treatment group/rate in control group

Continuous outcome (e.g. blood pressure in mmHg, quality of life on a scale of 0 to 1)
  Summary of result within each group: Mean (e.g. mean blood pressure)
    Comparison of results between two groups: Mean difference = mean in control group − mean in treatment group
Question #2: How precise was the estimate
of the treatment effect?
Because a study merely estimates the true effect of a drug, expressing the absolute risk reduction (ARR), relative risk reduction (RRR) and relative risk (RR) as exact values or point estimates may convey undue confidence. For example, it may be misleading to
say simply that ‘warfarin reduces the risk of stroke in patients
with atrial fibrillation by 79% (RRR)’. Such an estimate does not
accurately reflect uncertainty due to sample size limitations. Thus,
in addition to simplistic point estimates, researchers also report
an interval estimate which provides a range of possible values
of the treatment effect. By convention, interval estimates are
reported at a 95% level of confidence. Thus, when we state 95%
confidence intervals (95% CI), we mean that we are 95% sure that
the true effect of the treatment lies within this range. Therefore,
for warfarin, a better statement would be that ‘it reduces the risk
of stroke in patients with atrial fibrillation by 79% (RRR), but the actual RRR could range anywhere from 52% to 90%'[5]. By convention, this is expressed as:

RRR = 79% [95% CI: 52%, 90%]
How do we interpret the results? Tackle Box 2.3 below
summarizes different conclusions that can arise from looking at
confidence intervals. Spend some time familiarizing yourself with
reading CIs. To make things easier, remember the following four
basic tenets.
1. When both ends of the CI are on the side of benefit, the
treatment is definitely beneficial.
2. When both ends of the CI are on the side of harm, the
treatment is definitely harmful.
3. When one end reflects important benefit and the other end
reflects important harm, then the study is inconclusive.
4. When one end reflects a small unimportant benefit and the
other end reflects a small unimportant harm, then for all
intents and purposes the two treatments being compared are
equal.
Confused? Then, you really need to spend time on Tackle Box 2.3.
This will be worthwhile as it will help you analyse study results quickly
and efficiently.
Tackle Box 2.3 demonstrates why it is important to understand
confidence intervals. Without them, it would be very difficult to
interpret the results of a study. In cases where the 95% CIs are
not reported in the article, do not panic! You could either
1. use freely downloadable programs[6, 7] ;
2. call a statistician friend to compute it for you; or
3. compute it yourself[8] (not recommended if you are not
confident about working with numbers).
As a final option: go ahead and panic.
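If you do choose option 1 or 3, the computation is less frightening than it sounds. Below is a minimal Python sketch using the usual large-sample normal approximations for the ARR and, on the log scale, the RR. The trial numbers are hypothetical, and the formulas should be checked against a statistics text such as reference [8] before being relied on.

```python
# A minimal sketch of 95% CIs for the ARR (risk difference) and the RR
# (log method), both standard large-sample approximations.

from math import sqrt, log, exp

def arr_ci(events_ctrl, n_ctrl, events_rx, n_rx, z=1.96):
    """ARR as a proportion, with its 95% confidence limits."""
    rc, rt = events_ctrl / n_ctrl, events_rx / n_rx
    arr = rc - rt
    se = sqrt(rc * (1 - rc) / n_ctrl + rt * (1 - rt) / n_rx)
    return arr, arr - z * se, arr + z * se

def rr_ci(events_ctrl, n_ctrl, events_rx, n_rx, z=1.96):
    """RR with its 95% confidence limits, computed on the log scale."""
    rr = (events_rx / n_rx) / (events_ctrl / n_ctrl)
    se_log = sqrt(1 / events_rx - 1 / n_rx + 1 / events_ctrl - 1 / n_ctrl)
    return rr, exp(log(rr) - z * se_log), exp(log(rr) + z * se_log)

# Hypothetical trial: 40/200 events on placebo, 20/200 on treatment
print("ARR (95%% CI): %.3f (%.3f, %.3f)" % arr_ci(40, 200, 20, 200))  # 0.100 (0.031, 0.169)
print("RR  (95%% CI): %.2f (%.2f, %.2f)" % rr_ci(40, 200, 20, 200))   # 0.50 (0.30, 0.82)
```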
2.4 Assessing applicability
After evaluating validity and analysing the results of a trial, the
next step is to decide if the results can be applied to our own
patients. Trials provide information that can help us decide this.
For example, we can check if our patients’ characteristics satisfy
the inclusion and exclusion criteria. We can also look at treatment
effects in subgroups of patients that more closely approximate
the individual patient we are trying to help (subgroup analysis).
Unfortunately, in real life there are simply too many patient
subtypes and trials are typically too small to address more than a
few subgroup hypotheses. It would be very unlikely, for example,
that we find a trial on a drug for hypertension which analyses
effects among male smokers aged 40–45, weighing 60–70 kg, with
a total cholesterol of 6–7 mmol/L and a family history of stroke!
Because data from subgroups are limited, healthcare providers
must decide on the applicability of trial results to individual
patients, based on the general information available to them.
In doing this, we suggest consideration of biologic as well as
socioeconomic issues.
Biologic issues affecting applicability
Sex
Consider physiological, hormonal or biochemical differences
between sexes that might affect the effectiveness of an intervention. For example, women have a greater reduction in stroke incidence compared to men when treated with aspirin[9].
Tackle Box 2.3
Interpreting 95% Confidence Intervals (CIs)
Instructions: When researchers express the effect of treatment using the relative risk reduction, absolute risk reduction or relative risk, they often give us a range of possibilities rather than a single estimate. This range of possibilities is called a '95% Confidence Interval (95% CI)' to mean 'we are 95% sure that the true effect of the drug lies in this range'. Go through the three measures below to discover how helpful 95% CIs are.

Relative Risk Reduction: RRR = (Rc − Rt)/Rc
Point estimate (usually in percent): >0% treatment beneficial; ≈0% treatment no effect; <0% treatment harmful.
Interpreting the 95% CI:
1. Both ends of the 95% CI >0%: treatment surely better than control. Example: RRR = 7% [95% CI: 6%, 8%].
2. Both ends of the 95% CI <0%: treatment surely worse than control. Example: RRR = −8% [95% CI: −9%, −7%].
3. 95% CI straddles 0% and either end is far from 0%: inconclusive (we need more studies). Example: RRR = 1% [95% CI: −16%, 15%].
4. 95% CI straddles 0% and either end is very close to 0%: the two are probably equivalent. Example: RRR = 0.4% [95% CI: −0.6%, 1.0%].

Absolute Risk Reduction: ARR = Rc − Rt
Point estimate (usually in percent): >0% treatment beneficial; ≈0% treatment no effect; <0% treatment harmful.
Interpreting the 95% CI:
1. Both ends of the 95% CI >0%: treatment surely better than control. Example: ARR = 2% [95% CI: 1%, 3%].
2. Both ends of the 95% CI <0%: treatment surely worse than control. Example: ARR = −3% [95% CI: −7%, −1%].
3. 95% CI wide and straddles 0%: inconclusive. Example: ARR = 1% [95% CI: −9%, 9%].
4. 95% CI narrow and straddles 0%: the two are probably equivalent. Example: ARR = 0.2% [95% CI: −0.1%, 0.5%].

Relative Risk: RR = Rt/Rc
Point estimate (usually in decimals): <1.0 treatment beneficial; ≈1.0 treatment no effect; >1.0 treatment harmful.
Interpreting the 95% CI:
1. Both ends of the 95% CI <1.0: treatment surely better than control. Example: RR = 0.7 [95% CI: 0.6, 0.8].
2. Both ends of the 95% CI >1.0: treatment surely worse than control. Example: RR = 2.4 [95% CI: 1.8, 3.2].
3. 95% CI wide and straddles 1.0: inconclusive. Example: RR = 1.0 [95% CI: 0.2, 5.3].
4. 95% CI narrow and straddles 1.0: the two are probably equivalent. Example: RR = 1.0 [95% CI: 0.9, 1.1].

Note: These interpretations only hold if the dichotomous events are expressed as adverse rather than desirable events, e.g. death rather than survival, treatment failure rather than cure, or disease rather than disease-free. When dichotomous outcomes are expressed as desirable events, the interpretation of benefit and harm is reversed. We feel researchers should standardize reporting and use adverse events in order to avoid confusion. Unfortunately, not everyone agrees.

Exercise: An RCT compared stroke rates among patients given an experimental drug or placebo. How would you interpret the following hypothetical results?
(a) RR = 2.3 [95% CI: 1.5, 3.1];
(b) RR = 0.98 [95% CI: 0.95, 1.02];
(c) RR = 0.63 [95% CI: 0.53, 0.73]; and
(d) RR = 0.98 [95% CI: 0.50, 1.50]
Answers: (a) treatment surely worse; (b) the two are equivalent; (c) treatment surely better; and (d) inconclusive results.
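The four readings in this tackle box can even be written down as a small decision rule. The Python sketch below applies it to the exercise data for measures whose 'no effect' value is 1.0 (such as the RR); note that the 0.2 'closeness' margin is an arbitrary choice for illustration, since what counts as very close to no effect is a clinical judgement, not a published threshold.

```python
# A minimal sketch of the four interpretations of an RR confidence interval.

def interpret_rr_ci(lo, hi, margin=0.2):
    if hi < 1.0:
        return "treatment surely better than control"
    if lo > 1.0:
        return "treatment surely worse than control"
    if 1.0 - margin < lo and hi < 1.0 + margin:   # narrow CI straddling 1.0
        return "the two are probably equivalent"
    return "inconclusive (we need more studies)"  # wide CI straddling 1.0

for lo, hi in [(1.5, 3.1), (0.95, 1.02), (0.53, 0.73), (0.50, 1.50)]:
    print(f"RR CI ({lo}, {hi}): {interpret_rr_ci(lo, hi)}")
```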
Co-morbidities
Consider co-existent conditions that could affect applicability. Studies show that response to measles vaccination is reduced in malnourished children[10, 11].

Race
Racial differences may affect applicability. For example, black hypertensives are more responsive to diuretics than whites[12]. East Asians are more likely to develop the adverse effect of cough from angiotensin converting enzyme inhibitors compared to whites[13].

Age
Age differences commonly affect response to a treatment. For example, flu vaccines lead to smaller reductions in the risk of influenza in older people[14].

Pathology
Consider differences in the disease under study itself. At times, diseases we refer to by the same name are actually conditions with slightly different pathology. This can lead to significant variations in response to treatment. For example, malaria in Zimbabwe is different from malaria in the Philippines, and this is manifested as differences in treatment response[15, 16].
Socioeconomic issues affecting applicability
Most trials are carried out under ideal conditions, which are
difficult to apply in everyday life. This is as much a problem
of provider compliance as it is a problem of patient compliance. Patient compliance problems often relate to markers of
socioeconomic disadvantage such as poverty and lack of education. Provider compliance problems, on the other hand, are
often related to skill in the implementation of certain procedures
and availability of necessary facilities. Some therapies present
both types of compliance problems. Warfarin administration for
atrial fibrillation, for example, requires not only strict patient
compliance with monitoring, but also availability of resources for
prothrombin time determination and emergency management of
life-threatening bleeds.
2.5 Individualizing the results
When you are satisfied that biologic and socioeconomic factors
will not compromise effectiveness, the next step is to individualize the benefit, risks and costs to your patient. While studies
report effectiveness of a treatment in a population as a whole, the
benefits, risks and costs will vary slightly from patient to patient.
The main source of this variation is the patient’s baseline risk
for the event you are trying to prevent. Variation in risk is very
common in medicine. Patients have mild, moderate or severe
forms of the same disease, or may have variable risk factors for
an adverse outcome. Some individuals may come in early and
others may come in late in the course of an illness. Tackle Box 2.4
shows us five quick steps in using the baseline risk to estimate
the effect of therapy on an individual.
You now have information on the individualized benefits,
harms and costs of treatment which you can use to reach a
decision for your patient. Remember that we tend to make decisions for patients most of the time. This is acceptable for treatments where benefits far outweigh the harm, and cost is not a
problem. However, when we are at a deadlock between benefit
and harm, or when cost is potentially prohibitive, we may want
to provide information for our patient in a clear and concise but
Tackle Box 2.4
Steps to estimate the individualized changes in risk
Instructions: Studies estimate the average effect of treatment on groups of people, but unfortunately, not everyone is average. Some people have milder disease than average, while others have more severe disease. Consideration of these differences allows us to individualize the estimated effect of treatment. Here are five simple steps to do this, each illustrated with a treatment showing benefit (warfarin preventing ischemic stroke) and a treatment showing harm (warfarin causing extracranial haemorrhage).

Step 1: Estimate your individual patient's risk for an event without treatment (Rc). This is estimated from clinical features of the patient's condition, e.g. severity, stage of disease, age, risk factors, etc. Possible references include studies on prognosis. For a few diseases, risk calculators are available (e.g. coronary risk calculators)[17, 18].
Benefit example: A 55-year-old male with atrial fibrillation, diabetes, hypertension and a prior stroke but no valve disease has an annual risk of ischemic stroke of 8%[19].
Harm example: The same patient has a small annual risk of major extracranial haemorrhage of approximately 1%[20].

Step 2: Estimate the RR using the study results. If the relative risk (RR) is given, well and good. If not, estimate it from the relative risk reduction (RRR). Simply convert the RRR from percent to decimal notation and subtract from 1.00. For example, if RRR = 25%, then RR = 1.00 − 0.25 = 0.75.
Benefit example: If we give warfarin to the patient above, we reduce the risk of an ischemic stroke: RR = 0.36[19].
Harm example: If we give warfarin to the patient above, we increase the risk of an extracranial haemorrhage: RR = 4.3[19].

Step 3: Estimate your individual patient's risk for an event with treatment (Rt). Since RR = Rt/Rc, it follows that Rt = Rc × RR.
Benefit example: Rt = Rc × RR = 8% × 0.36 = 2.9%.
Harm example: Rt = Rc × RR = 1% × 4.3 = 4.3%.

Step 4: Estimate the individualized absolute risk reduction: ARR = Rc − Rt (from Steps 1 and 3).
Benefit example: ARR = 8.0% − 2.9% = 5.1%.
Harm example: ARR = 1% − 4.3% = −3.3%.

Step 5: Estimate the individualized number needed to treat (NNT) or number needed to harm (NNH): NNT = 100/ARR.
Benefit example: NNT = 100/5.1 = 20.
Harm example: NNT = 100/−3.3 = −30, or NNH = 30.

Note: Sometimes your only source for Step 1 is the control group in the trials themselves. If this is the case, then you don't need to go through all five steps. The individualized NNT for your patient is the one reported in the study.

Exercises:
1. What would the NNT be in the benefit example, if the baseline risk for ischemic stroke was (a) 16%; (b) 32%?
2. What would the NNH be in the harm example, if the baseline risk for extracranial haemorrhage was (a) 2%; (b) 4%?
Answers: 1(a) 10; 1(b) 5; 2(a) 15; 2(b) 7
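The five steps reduce to a few lines of arithmetic. Here is a minimal Python sketch that reproduces the warfarin examples above; the function name is our own convenience.

```python
# A minimal sketch of the five steps in Tackle Box 2.4.
# Risks are entered as percentages (e.g. 8 for 8%).

def individualize(baseline_risk_pct, rr):
    """Return (risk on treatment, individualized ARR, NNT) from Rc and RR."""
    rt = baseline_risk_pct * rr      # Step 3: Rt = Rc x RR
    arr = baseline_risk_pct - rt     # Step 4: ARR = Rc - Rt
    nnt = 100 / arr                  # Step 5: a negative NNT is an NNH
    return rt, arr, nnt

# Benefit: ischemic stroke, Rc = 8%, RR = 0.36 with warfarin
print(individualize(8, 0.36))   # (2.88, 5.12, ~20)
# Harm: extracranial haemorrhage, Rc = 1%, RR = 4.3 with warfarin
print(individualize(1, 4.3))    # (4.3, -3.3, ~-30), i.e. NNH of about 30
```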
balanced manner. Tools that have been used for this purpose
are called patient decision aids. They may take the form of audio
or video tapes, illustrations or reading materials, and have been
shown to improve knowledge as well as stimulate participation
in decision-making[21].
Patients are unlikely to understand RR, RRR or ARR easily,
so a better way to express treatment effects would be to
simply show them the changes in risk. Using the example of
warfarin for patients with atrial fibrillation but no valve disease
(Tackle Box 2.4), we might summarize the individualized effect
of treatment as follows.
Benefit: The risk of ischemic stroke falls from 8.0% to 2.9% per year.
Harm: The risk of major extracranial haemorrhage rises from 1.0% to 4.3% per year.
Cost: Average treatment cost would be US $365 per year (based on
Philippine prices as of 2007).
Unfortunately, presenting information in this way may be challenging for many patients[22]. An effective way of simplifying this would be to present a face table[23, 24]. Face tables consist of 100 faces, coloured or shaded to indicate the probability of certain outcomes. In the example in Figure 2.2, two sets of 100 faces represent treated and untreated patients. White 'smiley' faces
[Figure 2.2 Face table summarizing the effect of warfarin in a patient with atrial fibrillation but no valve disease. Two panels of 100 faces, 'No warfarin' and 'Warfarin', each with a legend distinguishing no outcome, haemorrhage and stroke.]
represent patients who do well, dark grey ‘frowny’ faces represent
patients who develop the unfavourable outcome being prevented
(e.g. stroke), and the light grey ‘frowny’ faces represent the
number with an adverse effect (e.g. extracranial haemorrhage).
Decision aids such as these face tables enable patients to visualize
the trade-offs between benefit and harm better than numbers can.
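A rough face table can even be improvised in plain text. The Python sketch below prints two 10 × 10 character grids from the individualized risks in Tackle Box 2.4; the symbols, and the rounding to whole patients per 100, are our own simplification of the shaded faces in Figure 2.2.

```python
# A minimal text-only face table: one character per patient out of 100.
# 'x' = stroke, 'o' = haemorrhage, '.' = no outcome.

def face_table(title, stroke_pct, bleed_pct):
    faces = "x" * round(stroke_pct) + "o" * round(bleed_pct)
    faces += "." * (100 - len(faces))
    print(title)
    for row in range(10):                       # print as a 10 x 10 grid
        print(" ".join(faces[row * 10:(row + 1) * 10]))

face_table("No warfarin (x = stroke, o = haemorrhage, . = no outcome)", 8.0, 1.0)
face_table("Warfarin (x = stroke, o = haemorrhage, . = no outcome)", 2.9, 4.3)
```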
2.6 Summary
Articles on therapy are probably the most common types of
articles that we come across. When confronted with a therapy
article, first appraise directness: does it directly address a question
that is important to you? If it does, appraise validity to decide if
the results are credible. Try to save yourself some time. You don’t
need to read the entire article. Just go straight ahead and seek
answers to each of the validity questions. If you are satisfied that
the study is valid, proceed to appraise the results. However, do not
simply accept the findings hook, line and sinker. Remember, you
still have to assess applicability. If you are satisfied with directness,
validity, the results and applicability, then individualize the results
to estimate specific effects on your individual patient. Involve
them in decisions when necessary, by sharing what you know
about risks, benefits and costs.
References
[1] Grady D, Chaput L and Kristof M. 2003. Results of Systematic Review
of Research on Diagnosis and Treatment of Coronary Heart Disease in
Women, Evidence Report/Technology Assessment No. 80. University
of California, San Francisco-Stanford Evidence-Based Practice Center.
AHRQ Publication No. 03-E035.
[2] Chobanian AV, Bakris GL, Black HR, Cushman WC, Green LA, Izzo JL Jr, Jones DW, Materson BJ, Oparil S, Wright JT Jr and Roccella EJ. 2003. Joint National Committee on Prevention, Detection, Evaluation, and Treatment of High Blood Pressure; National Heart, Lung, and Blood Institute; National High Blood Pressure Education Program Coordinating Committee. Seventh report of the Joint National Committee on Prevention, Detection, Evaluation, and Treatment of High Blood Pressure. Hypertension. 42(6): 1206–1252. Epub 2003 Dec 1. http://www.nhlbi.nih.gov/guidelines/hypertension/express.pdf
[3] Epstein AE, Bigger JT, Wyse DG, Romhilt DW, Reynolds-Haertle RA and Hallstrom AP. 1991. Events in the Cardiac Arrhythmia Suppression Trial (CAST): mortality in the entire population enrolled. Journal of the American College of Cardiology. 18(1): 14–19.
[4] Moore TJ. 1995. Deadly Medicine. Simon & Schuster: New York.
[5] Ezekowitz MD, Bridgers SL, James KE, Carliner NH, Colling CL and Gornick CC. 1992. Warfarin in the prevention of stroke associated with nonrheumatic atrial fibrillation. Veterans Affairs Stroke Prevention in Nonrheumatic Atrial Fibrillation Investigators. New England Journal of Medicine. 327: 1406–1412.
[6] http://www.cdc.gov/epiinfo/epiinfo.htm
[7] http://www.statpac.com/statistics-calculator/index.htm
[8] Riffenburgh RH. 2006. Statistics in Medicine. Second Edition. Academic Press: USA.
[9] Ridker PM, Cook NR, Lee I, Gordon D, Gaziano JM, Manson JE, Hennekens CH and Buring JE. 2005. A randomized trial of low-dose aspirin in the primary prevention of cardiovascular disease in women. New England Journal of Medicine. 352(13): 1293–1304.
[10] Adu FD, Akinwolere OA, Tomori O and Uche LN. 1992. Low seroconversion rates to measles vaccine among children in Nigeria. Bulletin of the World Health Organization. 70: 457–460.
[11] Bautista-Lopez N, Vaisberg A, Kanashiro R, Hernandez H and Ward BJ. 2001. Immune response to measles vaccine in Peruvian children. Bulletin of the World Health Organization. 79: 1038–1046.
[12] Falkner B and Kushner H. 1990. Effect of chronic sodium loading on cardiovascular response in young blacks and whites. Hypertension. 15: 36–43.
[13] McDowell SE, Coleman JJ and Ferner RE. 2006. Systematic review and meta-analysis of ethnic differences in risks of adverse reactions to drugs used in cardiovascular medicine. British Medical Journal. 332: 1177–1181. DOI 10.1136/bmj.38803.528113.55.
[14] Villari P, Manzoli L and Boccia A. 2004. Methodological quality of studies and patient age as major sources of variation in efficacy estimates of influenza vaccination in healthy adults: a meta-analysis. Vaccine. 22(25–26): 3475–3486.
[15] Mharakurwa S, Manyame B and Shiff CJ. 1997. Trial of the ParaSight-F test for malaria diagnosis in the primary health care system, Zimbabwe. Tropical Medicine & International Health. 2: 544–550.
[16] Bell D, Go R, Miguel C, Walker J, Cacal L and Saul A. 2001. Diagnosis of malaria in a remote area of the Philippines: comparison of techniques and their acceptance by health workers and the community. Bulletin of the World Health Organization. 79: 933–941.
[17] http://www.intmed.mcw.edu/clincalc/heartrisk.html
[18] http://www.chd-taskforce.com/calculator/calculator.htm
[19] Rothman KJ. 2002. Epidemiology. Chapter 3. Oxford University Press: Oxford; p. 46–49.
[20] Saxena R and Koudstaal PJ. 2004. Anticoagulants for preventing stroke in patients with nonrheumatic atrial fibrillation and a history of stroke or transient ischaemic attack. Cochrane Database of Systematic Reviews. 2: CD000185. DOI: 10.1002/14651858.CD000185.pub2.
[21] O'Connor AM, Stacey D, Rovner D, Holmes-Rovner M, Tetroe J, Llewellyn-Thomas H, Entwistle V, Rostom A, Fiset V, Barry M and Jones J. 2001. Decision aids for people facing health treatment or screening decisions. Cochrane Database of Systematic Reviews. 3: CD001431. Update in 2003, 2: CD001431.
[22] Edwards A, Hood K, Matthews EJ, Russell D, Russell IT, Barker J, Pill R and Stott N. 2000. The effectiveness of one-to-one risk communication interventions in health care: a systematic review. Medical Decision Making. 20: 290–297.
[23] Santesso N, Maxwell L, Tugwell PS, Wells GA, O'Connor AM, Judd M and Buchbinder R. 2006. Knowledge transfer to clinicians and consumers by the Cochrane Musculoskeletal Group. Journal of Rheumatology. 33(11): 2312–2318.
[24] Feldman-Stewart D, Kocovski N, McConnell BA, Brundage MD and Mackillop WJ. 2000. Perception of quantitative information for treatment decisions. Medical Decision Making. 20: 228.
3 Evaluation of Articles on Diagnosis
Maria Asuncion A. Silvestre, Aldrin B. Loyola,
Antonio L. Dans, Leonila F. Dans
Tests may be performed for various reasons, including to
1. plan an intervention (e.g. surgery);
2. monitor response to therapy;
3. estimate risk of future events; or
4. determine prognosis in patients with a known disease.
This chapter deals with studies that evaluate a fifth function of
tests, which is to diagnose the presence or absence of a particular
condition.
3.1 Appraising directness
Questions on diagnosis should be phrased in terms of the
following variables: P, the patient population on whom the test
might be done; E, the exposure (in this case the test to be
performed); and O, the outcome (i.e. the condition that the test
is supposed to diagnose). Examples include:
Among postmenopausal women presenting with a breast mass (P), how
accurate is fine needle aspiration biopsy (E) in establishing the presence
or absence of breast cancer (O)?
In Bacille Calmette Guerin (BCG)-vaccinated children (P), how accurate is an interferon-γ assay (E) in establishing the presence or absence
of tuberculosis (O)?
As in articles on therapy, before even reading an article, the first
thing we must do is evaluate directness, i.e. how well the PEO
in the study (the research question), corresponds with your own
PEO (the clinical question). Did the study evaluate the kind of
patients you are interested in? Is the test in the study exactly
the one you want? Is the disease which is being diagnosed the
same? If you feel the article might provide a reasonably close
answer to the question you ask, then proceed to evaluate it in
greater detail. Otherwise, it would be wiser to spend your time
on something else.
3.2 Appraising validity
Question #1: Was the reference standard an
acceptable one?
To measure the accuracy of a particular test we need to compare
its results to a reference standard, a test that can unequivocally
establish presence or absence of the disease in question. Unfortunately, perfect tests are very rare and when they do exist, they
tend to be invasive and expensive. Because of this, researchers
evaluating diagnostic tests spend a lot of time trying to strike a
balance between cost, accuracy and safety in choosing a reference
standard.
While many single tests (such as histopathologic examinations) are considered classic examples of reference standards, such
tests do not always work for complex conditions. Sometimes,
researchers resort to establishing disease presence using multiple
criteria (e.g. the Jones criteria to diagnose rheumatic fever)[1] or
even response to therapy (e.g. reversible airway obstruction to
diagnose bronchial asthma)[2]. Whatever they choose, they must
ensure that their reference standard defines disease in a way that
is acceptable to medical practitioners.
Conceptualizing reference standards as acceptable rather than
exact definitions of disease makes the lives of researchers easier.
It means they can choose a reference standard on the basis
of feasibility, even if it is not the most accurate of several
choices. While this may seem unscientific, it approximates what
actually happens in real life. In real life, a simple sputum
AFB stain is used more often than an expensive AFB culture
for the diagnosis of pulmonary tuberculosis[3]. Similarly, in
the diagnosis of minimal change disease (MCD), the response
to a therapeutic trial of steroids is an acceptable alternative
to a kidney biopsy[4–6]. The point is that if clinicians are willing
to make treatment decisions based on these definitions of
disease, then surely researchers can use these definitions as reference standards. A common approach is to use disease definitions officially adopted by specialty organizations. These can
usually be found in consensus statements that they publish (e.g.
the definition of antiphospholipid antibody syndrome in pregnancy)[7].
Question #2: Was the reference standard interpreted
independently from the test in question?
Since the accuracy of the test in question is gauged by how
often its results agree with the results of the reference standard,
measures should be taken to ensure that knowledge of the results
of one test does not bias interpretation of the other. This is not
as easy as it sounds. Many studies on diagnosis are performed
retrospectively using data already collected for clinical reasons.
Medical records, for example, are commonly used. Although the
retrospective nature of such studies makes them more feasible,
assurance of independent interpretation becomes much more
difficult for several reasons.
1. In most situations, there is no perceived need to maintain
independence. In fact, it is standard practice when requesting
some tests to supply the laboratory with information on prior
findings. For example, clinical data is routinely supplied to the
interpreter, prior to performance of a chest x-ray.
2. The result of a test often becomes the indication for performing
the reference standard. This allows one test to influence performance and interpretation of the other. For example, if the
result of a needle aspiration biopsy of a breast mass is positive,
it is more likely that the individual will be subjected to an
excision biopsy.
3. Other times, the result of the test itself is part of the reference standard (e.g. the results of a tuberculin skin test such as the PPD test are often part of a multi-criteria definition of tuberculosis)[8]. As such, the test and reference standard are
not strictly independent.
Not only are these three pitfalls difficult to avoid for authors, they
are difficult for us to spot as readers. Authors will simply say that
both the test and the reference standard were performed in all
patients. They will rarely point out that, at least in some of the
cases they analysed, the interpretation of the tests was not independent. Perhaps the biggest clue that this may have happened is when
the study is conducted in a retrospective manner, e.g. chart reviews.
3.3 Appraising results
Question #1: What were the likelihood ratios of the
various test results?
After performing the two tests (the test being evaluated against
the reference standard) on the study patients, the conventional
method of comparing results in a 2 × 2 table is shown in Tackle
Box 3.1. Consider this table for a few minutes before proceeding.
If you already understand the concepts of sensitivity, specificity
and predictive values, you may want to skip this table altogether.
The main problem with sensitivity, specificity and predictive
values is that they only work in situations where the test being
evaluated has two results. In truth, tests with just two results
(positive or negative) are unusual. They only seem common for
the following two reasons.
1. Researchers like to set cutoffs when tests have many results e.g.
a home pregnancy test is actually a measurement of urinary
HCG titers; it only seems to have two results because the colour
change of the test strip is set to occur at a certain level[9].
2. Researchers like to ignore the meaning of intermediate results
e.g. fine needle aspiration biopsy of breast nodules was
reported as definitely malignant or definitely benign in one
study, but there were actually equivocal results which were
ignored in the analysis[10].
Tackle Box 3.1 Comparing the results of a test with a
reference standard when the test has only two results,
e.g. positive or negative
Instructions: Standard symbols are used in the 2 × 2 table below: a = the
number of persons with a true positive result (test is positive and person
has disease); b = the number of persons with a false positive result (test is
positive and person has no disease); c = the number of persons with a false
negative result (test is negative and person has disease); d = the number
of persons with a true negative result (test is negative and person has no
disease); (a + c) = number of persons with the disease; (b + d) = number of
persons without the disease; (a + b) = number of persons with a positive test;
(c + d) = number of persons with a negative test. Go through the formulae
below to discover different ways of reporting concordance between the test
and the gold standard.
                  Reference standard
Test result       Disease present     Disease absent     Row total
Positive          a                   b                  (a + b)
Negative          c                   d                  (c + d)
Column total      (a + c)             (b + d)
Notes: Using the variables a, b, c and d, there are four traditional ways of expressing how
correct a test is.
1. Sensitivity (sn) refers to the proportion of persons with disease who correctly have a
positive test, i.e. a/(a + c);
2. Specificity (sp) refers to the proportion of persons with no disease who correctly have
a negative test, i.e. d/(b + d).
3. Positive predictive value (ppv) is the proportion of persons with a positive test who
correctly turn out to have disease, i.e. a/(a + b);
4. Negative predictive value (npv) is the proportion of persons with a negative test who
correctly turn out to have no disease, i.e. d/(c + d).
These numbers are not difficult to remember. The denominators for sn and sp are
the column totals, i.e. they are the proportions of correct results in columns 1 and 2
respectively. The denominators for ppv and npv are the row totals, i.e. they are the
proportions of correct results in rows 1 and 2, respectively.
Exercise: Assume a = 42, b = 7, c = 4 and d = 38. Compute (a) sn, (b) sp, (c) ppv and
(d) npv
Answers: (a) 91.3%, (b) 84.4%, (c) 85.7% and (d) 90.5%
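For readers who would rather not do the division by hand, here is a minimal Python sketch of the four measures, checked against the exercise above; the function name is our own.

```python
# A minimal sketch of the four traditional measures in Tackle Box 3.1,
# using the exercise numbers (a = 42, b = 7, c = 4, d = 38).

def two_by_two(a, b, c, d):
    return {
        "sensitivity": a / (a + c),   # correct positives among the diseased
        "specificity": d / (b + d),   # correct negatives among the disease-free
        "ppv":         a / (a + b),   # diseased among the test-positives
        "npv":         d / (c + d),   # disease-free among the test-negatives
    }

for name, value in two_by_two(42, 7, 4, 38).items():
    print(f"{name}: {value:.1%}")
# sensitivity 91.3%, specificity 84.4%, ppv 85.7%, npv 90.5%
```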
To evaluate a test with multi-level results, a 2 × 2 table will
not suffice. We need a ‘2 × n’ table where n is the number of
results that a test may have. Instead of sensitivity, specificity
and predictive values, we will need a relatively new measure
of accuracy: the likelihood ratio (LR). Study Tackle Box 3.2
for a few minutes before proceeding, to understand likelihood
ratios. It is not important to comprehend the formulae for calculating LR, as long as it is understood that LR is a measure
of how much the likelihood of disease changes given a test
result.
3.4 Assessing applicability
If you think the study is reasonably valid and that the results
show acceptable accuracy, the next step is to evaluate applicability
to your particular patient. As in evaluating articles on therapy,
biologic and socioeconomic issues may affect how well the test
performs in the real world.
Biologic issues affecting applicability
Sex
Consider physiological, hormonal or biochemical differences
between males and females that might affect the test results. For
example, creatinine clearance based on a single serum creatinine
determination must be adjusted according to sex[11].
Co-morbidities
Consider co-morbid conditions that could affect the performance
of the diagnostic examination. For example, malnutrition can
decrease the sensitivity of a tuberculin skin test[12].
Tackle Box 3.2 Comparing the results of a test with a
reference standard when the test has two or more results,
e.g. positive, intermediate or negative
Instructions: Standard symbols are used in the 2 × n table below: a = the number
of persons with disease and a positive result; b = the number of persons without
disease but a positive result; c = number of persons with disease and an intermediate result; d = number of persons without disease and an intermediate result;
e = number of persons with disease but a negative result; f = number of persons
without disease and a negative result. There may be more cells depending on
how many intermediate results there are. Go through the formulae below to
discover different ways of reporting concordance between the test and the gold
standard.
                  Reference standard
Test result       Disease present     Disease absent     Row total
Positive          a                   b                  (a + b)
Intermediate      c                   d                  (c + d)
Negative          e                   f                  (e + f)
Column total      (a + c + e)         (b + d + f)
Notes:
1) At this point, it is important to distinguish between probabilities and odds. Probabilities are portions of the whole, while odds are the ratio of portions. Say that again? Well, if we were talking of a pie cut into a slice a and a remainder b, probability would be the slice divided by the entire pie, i.e. a/(a + b). Odds, on the other hand, would be the slice divided by the rest of the pie, i.e. a/b. To convert from probability to odds, we simply reduce the denominator by subtracting the numerator from it. For example: 7/100 (probability) becomes 7/93 (odds); 92/100 (probability) becomes 92/8 (odds).
2) The odds of disease when the test is positive is the ratio of a to b. This is
written as a:b or a/b. It is read as ‘a is to b’. Similarly, the odds of disease
when the test is intermediate are c/d. When the test is negative, the odds
are e/f. There may be g/h, i/j and so forth depending on the number of
intermediate results. The overall odds of disease regardless of the test result is (a + c + e)/(b + d + f). This is most commonly referred to as the pre-test odds.
3) These odds may be used to estimate likelihood ratios (LRs) for each result.
The LR is nothing more than the odds of disease given a test result (the
post-test odds), divided by the overall odds of disease (the pre-test odds).
Thus for a positive test, the LR is (a/b) ÷ [(a + c + e)/(b + d + f)]. For
an intermediate test, the LR is (c/d) ÷ [(a + c + e)/(b + d + f)]. Finally,
for a negative test, the LR is (e/f) ÷ [(a + c + e)/(b + d + f)]. (If you
cannot perform any of the above operations because the denominator is
0, impute a value of 1 for that cell and adjust the corresponding column
or row total.)
4) If the LRs are not provided, you may need to compute your own by
reconstructing a 2 × n table from the data provided. If sensitivity and specificity are reported, reconstruct the 2 × n table by assigning a = sensitivity,
b = 100 – specificity, c = 100 – sensitivity, and d = specificity. Then compute
for LR as described in step 3.
5) Because LR is the ratio of post-test to pre-test odds, it is an expression of
change in the odds of disease. Thus an LR of 10/1 represents a 10-fold increase
in the odds of disease, while an LR of 1/10 (or 0.1) expresses a 10-fold drop
in the odds of disease. Similarly, an LR of 1/1 (or 1.0) represents a test result
that does not change the odds of disease. Thus, the further away from 1.0 the
LR is, the greater the rise or fall in odds of disease.
Exercise:
1. Assume a = 35, b = 10, c = 10, d = 20, e = 5 and f = 70. Calculate
(a) LR of a positive test result, (b) LR of intermediate test results and
(c) LR of a negative test result.
2. If sn = 90 and sp = 60, what would be the LR of a (a) positive test and a
(b) negative test? Clue: reconstruct the 2 × n table as advised in note #4
above.
Answers: 1(a) 7.00, 1(b) 1.00, 1(c) 0.14, 2(a) 2.25, 2(b) 0.17
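The same arithmetic generalizes to any number of result levels. The Python sketch below reproduces Exercise 1; it assumes no cell is zero (see note 3 above for the suggested correction when a denominator is 0).

```python
# A minimal sketch of the likelihood-ratio calculation in Tackle Box 3.2.

def likelihood_ratios(diseased, healthy):
    """diseased/healthy: counts per result level, in the same order."""
    pre_test_odds = sum(diseased) / sum(healthy)
    # LR for each result = (odds of disease given that result) / (pre-test odds)
    return [(d / h) / pre_test_odds for d, h in zip(diseased, healthy)]

# Exercise 1: positive, intermediate and negative results
lrs = likelihood_ratios([35, 10, 5], [10, 20, 70])
print([round(lr, 2) for lr in lrs])   # [7.0, 1.0, 0.14]
```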
Race
Consider racial differences that may alter the performance of
the test in question. For example, African-American ancestry
increased the likelihood of high grade prostate cancer in patients
with high levels of prostate specific antigen[13].
Age
Consider the age of the population in the study in relation
to your own patients. A sputum AFB stain performs well
in adults but gastric aspirates would maximize accuracy in
infants[14].
Pathology
Test accuracy is influenced by the severity and duration of the
patient’s illness. In general, the more advanced a disease is, the
easier it is to diagnose. For this reason, studies on hospitalized
patients (i.e. patients with severe disease) can sometimes lead to
overestimates of the accuracy of a test. Knowing where a trial
was done gives you an idea of the kinds of patients recruited,
and can help you decide whether you can still use the authors’
conclusions.
Socioeconomic issues affecting applicability
The accuracy of questionnaires is particularly prone to social,
cultural and economic differences among patients. One reason
is that a lot can be lost in the translation of questionnaires.
However, even if the language used is the same, interpretation and reaction may vary. For example, cognitive tests tend
to underestimate the abilities of elderly people from ethnic
minorities. This can lead to overdiagnosis of dementia in these
communities[15]. The CAGE questionnaire (which is commonly used to detect alcoholism) performed poorly in some ethnic groups, particularly African-American men[16]. Similarly, a questionnaire to detect autism developed in the US and UK could not be used in families in Hong Kong because of perceived cultural differences[17]. These examples (and many more) should
lead us to look for local validation studies before accepting
the accuracy of diagnostic tests, especially in the form of
questionnaires.
Even laboratory tests may sometimes have questionable applicability. When your laboratory has limited resources, it may
not match the standards of performance defined in a study
which uses the best equipment, hires the best diagnosticians and
continuously monitors good laboratory practice. We need to
ensure that these standards are (at least) approximated by the
local laboratories we use.
3.5 Individualizing the results
When satisfied that biologic and socioeconomic differences do
not compromise the applicability of a test in your setting, the
next step is to determine the impact that the test (and its results)
might have on your specific patient’s probability of having a
disease. While studies of diagnosis report the average effect of a
test on probability of disease, the effect may vary greatly from
patient to patient. The main source of this variation is the individual’s baseline probability of disease, also known as the pre-test
probability.
Variation in pre-test probability is common. Based on history
and physical examination, individuals may have little or no signs
of disease in which case disease probability is very low. Other
individuals may have florid signs of disease, in which case the
pre-test probability is very high. Take, for example, a 24-year
old female consulting for fleeting chest pain. Her history reveals
occasional pricking pain on the anterior chest wall not related to
effort. Her physical findings are unremarkable. The probability
that this particular individual is having a heart attack is quite
low, i.e. you assess her pre-test probability for a heart attack to
be around 0.1%. Contrast this with a 60-year-old hypertensive male
smoker with a chronic history of chest discomfort during physical
exertion. He presents at the emergency room with acute, severe
chest pain. On physical examination, he is hypotensive with a
BP of 80/60, tachycardic with a heart rate of 110 and has cold
clammy perspiration. The probability of this man having a heart
attack is high, i.e. the pre-test probability of a heart attack may
be more than 90%.
Figure 3.1 Disease probability and thresholds of management (the 0–100% disease probability scale, with the diagnostic threshold near the low end, the therapeutic threshold near the high end, and the testing range in between)
Figure 3.1 above illustrates the probability of
disease, and shows us some conceptual thresholds in the management of disease.
Figure 3.1 depicts the two threshold regions at the upper and
lower levels of probability: the therapeutic threshold is the probability of disease above which we are willing to stop testing and
just get on with therapy, and the diagnostic threshold is the probability of disease below which we are willing to stop testing and
just reassure the patient. Between these two thresholds, the clinician is more uncertain and so further tests are required. Tests are
useful when they can move us out of the testing range to somewhere beyond the treatment threshold (so we can commence
treatment), or below the diagnostic threshold (so we can just
reassure the patient or consider other diseases). How do the
results of a test change disease probability, i.e. how do we estimate the post-test probability? Tackle Box 3.3 illustrates a strategy
for those who are up to the challenge of manual computation.
Tackle Box 3.4 illustrates a strategy for those of us afraid of
numbers.
After arriving at a post-test probability of disease, you may
now make a clinical decision, i.e. to treat, reassure, or carry
out more tests (depending on whether you land above the
treatment threshold, below the diagnostic threshold or in between).
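To make this threshold logic concrete, here is a minimal Python sketch. The threshold values are assumptions chosen purely for illustration; in practice they depend on the disease, the treatment and the patient.

# Illustrative sketch: acting on a post-test probability using the
# diagnostic and therapeutic thresholds of Figure 3.1.
# The default thresholds below are invented for illustration only.

def decide(post_test_probability_pct, diagnostic=5, therapeutic=80):
    if post_test_probability_pct < diagnostic:
        return "stop testing and reassure"
    if post_test_probability_pct > therapeutic:
        return "stop testing and treat"
    return "still in the testing range: do further tests"

print(decide(0.1))   # the 24-year-old with atypical chest pain (0.1%)
print(decide(95))    # the 60-year-old smoker in shock
print(decide(40))    # an intermediate case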
While this discussion focuses on use of likelihood ratios to
interpret test results when they arrive, we can also use these
calculations to decide if the tests should be done at all for a
particular patient. Consider the following when you contemplate
requesting a test.
1. When the test result will not lead to important changes in
probability, we should think twice about doing the test at
all. Remember that the change in probability is not just a
function of the LRs of a test; it is also a function of the pre-test
probability. When the pre-test probability is close to 50% (as
uncertain as it can get), the changes in probability tend to be
great and the tests become much more useful. When the pre-test
probability is already close to 0% or 100%, the changes
in probability tend to be very small and testing is of less
value (see the sketch after this list).
2. When effective treatment is unavailable for the disease you
are detecting, either because it is difficult to treat or the
patient cannot afford the treatment, testing may not be
useful.
3. The cost of the test should always be considered, especially in
places where medical care is mainly an out-of-pocket expense.
When we talk of cost, it is not just the immediate cost of
the test but also the cost of the subsequent tests, as well as
subsequent medical or surgical interventions.
4. Safety is an issue for some tests, especially invasive procedures.
5. Just as we should involve the patient when deciding on a
therapeutic intervention, the patient should make informed
choices about diagnostic tests to be performed.
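As promised under point 1, here is a minimal Python sketch showing how the same likelihood ratio produces a large change in probability near 50% and a trivial one near the extremes. The LR of 3.0 is an arbitrary choice for illustration.

# Illustrative sketch: the same LR moves probability a lot near 50%
# and hardly at all near 0% or 100%. Works in percentages.

def post_test_probability(pre_test_pct, lr):
    odds = pre_test_pct / (100 - pre_test_pct)   # probability to odds
    post_odds = odds * lr                        # apply the LR
    return post_odds / (1 + post_odds) * 100     # odds back to probability

for pre in (1, 50, 99):
    post = post_test_probability(pre, lr=3.0)
    print(f"pre-test {pre}% -> post-test {post:.1f}%")
# pre-test 1%  -> post-test 2.9%   (about a 2-point change)
# pre-test 50% -> post-test 75.0%  (a 25-point change)
# pre-test 99% -> post-test 99.7%  (a negligible change)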
Tackle Box 3.3
Computing for post-test probability of disease given a test result
Instructions: Results of a test change the probability of disease. This tackle box discusses the math involved. If you’re numero-phobic, skip this
tackle box and proceed directly to Tackle Box 3.4.
Step 1: Estimate the pre-test probability in percent.
How to do this: Interview the patient and carry out a good physical examination. Based on your findings, your clinical experience will give you a good estimate of the probability of disease. If the case is difficult, an expert might be in a better position to estimate what the probability of the disease might be.
Need an equation? There isn’t any; what you need are good skills in history and physical examination.
Exercise (1): Ms X, a 25-year-old sexually active female, presents with a 3-day history of a burning sensation on urination. Physical exam was unremarkable. Estimate the pre-test probability that she has a urinary tract infection (UTI).
Step 2: Convert pre-test probability to odds.
How to do this: There are two ways of expressing the possibility of disease: as odds or as probabilities. Probabilities are a portion of the whole, while odds are the ratio of portions. To convert from probability to odds, we simply reduce the denominator by subtracting the numerator from it. For example: 25/100 (probability) becomes 25/75 (odds), and 90/100 (probability) becomes 90/10 (odds). If you are not yet comfortable with probabilities and odds, return to Tackle Box 3.2 (review the concept of the pie)!
Need an equation? Odds = Probability / (100 − Probability)
Exercise (2): If you set the pre-test probability of UTI in exercise (1) at 80%, what would the pre-test odds be?
Step 3: Multiply pre-test odds by the likelihood ratio of the test result to get the post-test odds.
How to do this: The pre-test odds were estimated in step 2. The study you read should tell you the LR of the test result you obtained. Remember, LR varies according to the result: a positive test will probably have an LR > 1.0, a negative test an LR < 1.0, and an equivocal test an LR close to 1.0.
Need an equation? Post-test odds = Pre-test odds × LR
Exercise (3): Continuing the scenario in exercise (2), estimate the post-test odds of UTI in the following scenarios: (a) her urine dipstick nitrite is positive; (b) her urine dipstick nitrite is negative. Note: A study shows that urine dipstick nitrite has an LR(+) = 3.0 and an LR(−) = 0.5[19].
Step 4: Convert post-test odds back to post-test probability in percent.
How to do this: Simple. Just increase the denominator by adding the numerator back to it. Thus, odds of 1/3 become a probability of 1/4 (or 25%); and odds of 1/1 become a probability of 1/2 (or 50%). You can also use the formula below.
Need an equation? Probability = [Odds / (1 + Odds)] × 100
Exercise (4): Convert the post-test odds back to (post-test) probability in the two scenarios in exercise (3).
Notes:
1. In these equations, probability is expressed as a percentage.
2. Usually a sequence of tests is necessary to confirm disease or rule it out. In this case, the post-test probability of one test becomes the pre-test
probability for the next test and so forth. This only works in non-emergency cases. When confronted with an emergency, forget sequential
tests; do everything at the same time!
Answers: Exercise (1): Depending on the details of the history, estimates of the probability of UTI may vary. A reasonable estimate might be around
80%. Exercise (2): If you set the pre-test probability at 80%, pre-test odds will be 80/20 or 4/1. Exercise (3): For scenario (a), with a positive urine dipstick
result, post-test odds = 4.0 × 3.0 = 12. For scenario (b), with a negative urine dipstick result, post-test odds = 4.0 × 0.5 = 2.0. Exercise (4):
For scenario (a), post-test probability will be [12/(1 + 12)] × 100 = 92%. For scenario (b), post-test probability will be [2/(1 + 2)] × 100 = 67%.
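For readers comfortable with a little code, the four steps above translate into a few lines of Python. This is only a sketch; it reproduces the UTI example (pre-test probability 80%, LR(+) = 3.0, LR(−) = 0.5).

# Illustrative sketch of Tackle Box 3.3, working in percentages.

def probability_to_odds(probability_pct):
    return probability_pct / (100 - probability_pct)        # step 2

def odds_to_probability(odds):
    return odds / (1 + odds) * 100                          # step 4

def post_test_probability(pre_test_pct, lr):
    pre_odds = probability_to_odds(pre_test_pct)            # step 2
    post_odds = pre_odds * lr                               # step 3
    return odds_to_probability(post_odds)                   # step 4

pre = 80                        # step 1: clinical estimate for Ms X
print(round(post_test_probability(pre, 3.0)))   # positive dipstick: 92
print(round(post_test_probability(pre, 0.5)))   # negative dipstick: 67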
Tackle Box 3.4 Estimating post-test probability of
disease given test results (using nomogram [18] )
Instructions: If you are uncomfortable with manual computations for
post-test probability as described in Tackle Box 3.3, go through the
nomogram shown in Figure 3.2 to learn an easier way to do it.
Step 1: Estimate the pre-test probability based on your history or
physical examination, i.e. clinical intuition. You can also derive
this estimate from the results of surveys. Plot this on the left-most
vertical axis.
Exercise: Ms X, a 25-year-old sexually active female presents with a
three day history of a burning sensation on urination. Physical exam
was unremarkable. Estimate the pre-test probability that she has a
urinary tract infection (UTI).
Step 2: Determine the likelihood ratio of the test result from the results
of the study you reviewed. Remember, the LR varies depending on
the test result. Plot this on the middle vertical axis.
Exercise: Study shows that urine dipstick nitrite has an LR(+) = 3.0
and LR(–) = 0.5[19] . Look for these points along the middle axis.
Step 3: Connect the two points in steps 1 and 2, and extend the
line to the rightmost vertical axis. The point of intersection is the
probability of disease after the test (the post-test probability).
Exercise: What would the post-test probabilities be if (a) her dipstick
nitrite is positive; (b) her dipstick nitrite is negative?
Note: Sometimes, a sequence of tests is necessary to confirm disease or rule
it out. In this case, the post-test probability of the earlier test becomes the
pre-test probability for the next test and so forth and so on. This only works
in non-emergency cases. When confronted with an emergency, forget about
sequential testing, do everything at the same time!
Answers: If you set the pre-test probability at 80%: when the urine dipstick
result is positive, the post-test probability will be around 95%; when negative,
around 65%. These numbers approximate the exact answers from
Tackle Box 3.3.
Figure 3.2 Bayes nomogram for estimating post-test probability (three vertical axes, left to right: pre-test probability, likelihood ratio and post-test probability). Reproduced with permission[18]
3.6 Summary
Articles on diagnostic tests are sometimes considered difficult
reading. By dissecting them one step at a time, the task becomes
simple and straightforward. First appraise directness: is the
question asked important to you? Is the test something you are
considering? Is it available? Is it a feasible option? If not, maybe
you should be looking for another article. If it does address
an important question for you, the next step is to appraise
validity. This will tell you if the results are credible. If the
results aren’t credible anyway, why even bother appraising the
article? These two steps can save you lots of time spent on futile
reading.
If you decide to proceed and interpret the results, look for likelihood ratios, or (heaven forbid) information that may lead you
to compute them yourself. Before actually using these numbers,
however, decide if the information is applicable to your setting.
Biologic and socioeconomic factors should be considered.
If everything is in order (directness, validity, results and applicability), you now have information to:
1. decide if the test is something you would like to request, and
2. interpret the results when they arrive.
Don’t become too engrossed with numbers. Sometimes it’s
enough just to understand that LRs are a measure of changes in
disease probability. Test results with LR > 1.0 increase the disease
probability. The higher the LR is, the closer it can bring you to
confirming the disease. Test results with LR < 1.0 decrease the
disease probability. The lower the LR is, the closer it can bring
you to ruling out disease. Test results with LR very close to 1.0
have little impact on your clinical or pre-test estimates of disease
probability.
When you have arrived at your post-test probability of disease,
you can now make a clinical decision based on your thresholds
of management and also your patient’s preferences. There are
usually three choices:
1. stop testing and get on with the treatment of the probable
disease;
2. stop testing and reassure the patient that disease probability is
low; or
3. do more tests before you decide.
References
[1] Ferrieri P for the Jones Criteria Working Group. 2002. Proceedings of
the Jones Criteria Workshop. Circulation. 106: 2521.
[2] Boulet LP, Becker A, Berube D, Beveridge D and Ernst P. 1999. Canadian asthma consensus report, 1999. Canadian Asthma Consensus
Group. Diagnosis and evaluation of asthma in adults. Canadian Medical
Association Journal. 161(11): S6–S7.
[3] Sennik D. 2006. Pulmonary tuberculosis: diagnosis and treatment. BMJ Clinical Review. 332: 1194–1197.
[4] Gulati S, Sharma AP, Sharma RK, Gupta A and Gupta RK. 2002. Do
current recommendations for kidney biopsy in nephrotic syndrome
need modifications? Pediatric Nephrology. 17: 404–408.
[5] Gandhi BV. 1994. The role of renal biopsy in nephrotic syndrome.
Journal of Postgraduate Medicine. 40: 135–136.
[6] Filler G, Young E, Geier P, Carpenter B, Drukker A and Feber J. 2003.
Is there really an increase in non-minimal change nephrotic syndrome
in children? American Journal of Kidney Diseases. 42(6): 1107–1113.
[7] Wilson WA, Gharavi AE, Koike T, Lockshin MD, Branch DW, Piette
JC, Brey R, Derksen R, Harris EN, Hughes GR, Triplett DA and
Khamashta MA. 1999. International consensus statement on preliminary classification criteria for definite APS: report of an international
workshop. Arthritis & Rheumatism. 42: 1309–1311.
[8] Taylor Z, Nolan CM and Blumberg HM. 2005. American Thoracic
Society; Centers for Disease Control and Prevention; Infectious
Diseases Society of America. Controlling tuberculosis in the United States. Recommendations from the American Thoracic Society, CDC, and the Infectious Diseases Society of America. Morbidity and Mortality Weekly Report Recommendations and Reports. 54(RR–12): 1–81.
[9] Bastian LA, Nanda K, Hasselblad V and Simel DL. 1998. Diagnostic efficiency of home pregnancy test kits. A meta-analysis. Archives of Family Medicine. 7(5): 465–469.
[10] Ariga R, Bloom K, Reddy VB, Kluskens L, Francescatti D, Dowlat K, Siziopikou K and Gattuso P. 2002. Fine-needle aspiration of clinically suspicious palpable breast masses with histopathologic correlation. American Journal of Surgery. 184: 410–413.
[11] Johnson CA, Levey AS, Coresh J, Levin A, Lau J and Eknoyan G. 2004. Clinical Practice Guidelines for Chronic Kidney Disease in Adults: Part II. Glomerular Filtration Rate, Proteinuria, and Other Markers. American Family Physician. 70: 1091–1097.
[12] Pelly TF, Santillan CF, Gilman RH, Cabrera LZ, Garcia E, Vidal C, Zimic MJ, Moore DA and Evans CA. 2005. Tuberculosis skin testing, anergy and protein malnutrition in Peru. International Journal of Tuberculosis and Lung Diseases. 9(9): 977–984.
[13] Thompson IM and Ankerst DP. 2007. Prostate-specific antigen in the early detection of prostate cancer. Canadian Medical Association Journal. 176(13): 1853–1858.
[14] How C. (ed). 2003. Tuberculosis in Infancy and Childhood. Philippine Pediatric Society Publications: Quezon City; pp 1–97.
[15] Parker C and Philp I. 2004. Screening for cognitive impairment among older people in black and minority ethnic groups. Age and Ageing. 33(5): 447–452. Epub 2004 Jun 24.
[16] Steinbauer JR, Cantor SB, Holzer CE and Volk RJ. 1998. Ethnic and sex bias in primary care screening tests for alcohol use disorders. Annals of Internal Medicine. 129(5): 353–362.
[17] Wong V, Hui LH, Lee WC, Leung LS, Ho PK, Lau WL, Fung CW and Chung B. 2004. A modified screening tool for autism (Checklist for Autism in Toddlers [CHAT-23]) for Chinese children. Pediatrics. 114(2): 166–176.
[18] Glasziou P. 2001. Which methods for bedside Bayes? Evidence Based Medicine. 6: 164–166.
[19] Otham S, Chia YC and Ng CJ. 2003. Accuracy of urinalysis in detection of urinary tract infection in a primary care setting. Asia Pacific Family Medicine. 2(4): 206–212.
4
Evaluation of Articles on
Harm
Jacinto Blas V. Mantaring III, Antonio L. Dans,
Felix Eduardo R. Punzalan
Studies on harm try to establish if a particular exposure
is responsible for causing an undesirable outcome. Harmful
exposures can be behaviours (e.g. tobacco or alcohol abuse),
treatments (e.g. aspirin or warfarin intake) or patient characteristics (e.g. hypertension or exposure to pollution). In any of
these situations, the questions on harm (or causation) should be
phrased in terms of the following variables: P, the patient population that might be at risk; E, the potentially harmful exposures;
and O, the outcomes that these exposures might cause. For
example:
Among healthy adults (P), how strongly associated is mobile phone use
(E) with the risk of developing brain cancer (O)?
4.1 Appraising directness
As in therapy and diagnosis, before reading an article on harm,
we must first evaluate how well the PEO in the study (the research
question) corresponds to our own PEO (our clinical question).
Did the study recruit the types of patients you are interested in?
Did they evaluate the exposure you are interested in? Even if the
question is not exactly the same, sometimes the study can still
provide some answers. For example, much of what we think we
know about the effect of alcohol on health is derived from studies
on red wine. The two exposures are not exactly the same, but
whatever we learn about red wine can certainly provide some
answers about the impact of alcohol intake in general. If you feel
the article might help answer your question, then go ahead and
evaluate it as in previous chapters.
4.2 Appraising validity
Question #1: Were the patient groups being
compared sufficiently similar with respect to baseline
characteristics? If not, were statistical adjustments
made?
Perhaps the most important validity criterion influencing the outcome of a harm study is the similarity of baseline
characteristics between the groups being compared. How similar the groups are at the start of a study
depends on which design was used, e.g. a randomized controlled
trial, a cohort or a case-control study (see Tackle Box 4.1). Groups
being compared are most similar in RCTs and least similar in
case-control studies. When groups being compared are dissimilar, statisticians can make adjustments. These adjustments estimate what the study outcomes might have been had the baseline
characteristics been the same. This may seem like salvaging data,
and in fact this is what they are doing. However, this is the best
we can do considering the imbalance inherent in some study
designs.
Question #2: Were unbiased criteria used to determine
exposure in all patients?
Ascertainment of the exposure is a key issue in studies on causation or harm. If ascertainment of exposure systematically favours
one group compared to the other, it may lead us to a wrong estimate of the relationship between exposure and outcome. Determining exposure is not a problem in RCTs and cohort studies
because exposure is either assigned (in RCTs) or carefully determined (in cohorts) at the start of the study.
Determining exposure is particularly problematic when we
speak of case-control designs, as we often rely on the patients’
ability to recall these exposures. This wouldn’t be a big problem if cases
and controls had the same rates of recall, as the difference would
cancel out. Unfortunately, studies show that cases are more likely
to recall an exposure than controls, probably because they already
sustained an undesirable outcome[1] . This so-called ‘recall bias’
was demonstrated in a study where mothers were interviewed on
likely causes of congenital anomalies in their offspring[2, 3] . Many
years after a pregnancy, these mothers may be more likely to
volunteer a prenatal history of cravings, medicines, exotic food,
rituals, falls, etc. compared to mothers whose babies do not have
anomalies.
Ascertainment of exposure in case-control studies may also be
subject to bias on the part of the interviewer[4] . Interviewers may
tend to be more persistent in their questioning for exposures
among cases than controls. Readers should therefore look for
strategies employed in the study to minimize this bias, such as
blinding of interviewers.
Tackle Box 4.1 Three types of study designs used to evaluate the relationship between a
potentially harmful exposure and an undesirable outcome
Instructions: Start by understanding the design diagrams, summarized in brackets below. Once familiar with the basic differences between the designs, it will be easier to go through the validity and feasibility entries.

Randomized controlled trial
[Diagram: patients are randomized to an exposed or an unexposed group, and each group is followed up for outcome or no outcome.]
Validity: Highest validity. Baseline characteristics of exposed and unexposed tend to be very similar because of randomized assignment to these groups.
Feasibility: Lowest feasibility, for three reasons: (1) if the outcome is rare, a big sample size is required; (2) if prolonged exposure is necessary, long follow-up is required; (3) when exposure may be harmful, it may be unethical to randomize.

Cohort study
[Diagram: patients arrive already exposed or unexposed, and each group is followed up for outcome or no outcome.]
Validity: Midway in validity. Baseline characteristics of exposed and unexposed are rarely similar, but statisticians can make ‘magical’ adjustments.
Feasibility: Midway in feasibility. The budget is smaller because researchers don’t need to pay for the exposure. Still, like RCTs, it may require a big sample size and prolonged follow-up.

Case-control study
[Diagram: patients are recruited with the outcome (cases) or without the outcome (controls), and each group is then classified as exposed or unexposed by recall.]
Validity: Lowest in validity, for three reasons: (1) baseline characteristics of those with and without the outcome are surely different. Researchers try to correct this by making sure each case is matched with a similar control, but matching is feasible only for a few characteristics. (2) Patients need to be asked about exposure, and they may forget. (3) We can never be sure the exposure preceded the outcome.
Feasibility: Highest feasibility. Even if the outcome is rare, we can look for persons who already have it and recruit them. We don’t need follow-up because the event has already taken place. Just ask patients to recall if they were ever exposed.
Notes:
1. Randomized controlled trials (discussed in Chapter 2) assign patients to either exposed (treated) or unexposed (untreated) groups who are
then followed up for a length of time to see how many do or don’t develop outcomes.
2. Cohort studies are similar, except that we do not assign exposure. Patients come in and are already exposed or unexposed to a possibly harmful
agent (by choice, e.g. when they smoke; or by chance, e.g. when they are born with genes for hypertension).
3. Case-control studies begin by recruiting patients with the outcome (cases) and without the outcome (controls), then proceed in the opposite
direction by asking if they recall having been exposed to the harmful agent. For example, patients with lung cancer (cases) and without cancer
(controls) may be asked whether they ever smoked or not.
Exercise: Design and implement a randomized controlled trial, a cohort study and a case-control study on the harmful effects of smoking. On which one did you spend more?
Answers: Just kidding, but of course you’ll spend more on your (unethical) RCT!
Question #3: Were unbiased criteria used to detect
the outcome in all patients?
Detecting outcomes in study patients can be fraught with bias.
This is not such a big problem in case-control studies because
the outcome is determined at the start of the study and, in fact, is
part of inclusion and exclusion criteria. In contrast, researchers
conducting RCTs and cohort studies must pay special attention
to ensure that outcome detection is unbiased. The best strategy to
avoid this pitfall is to blind people tasked to ascertain outcome;
they must not know if a particular patient is exposed or not.
Question #4: Was the follow-up rate adequate?
Cohorts and RCTs are generally conducted over a longer duration. Because of this, loss to follow-up is more often a problem
than in case-control studies. Patients lost to follow-up may have
suffered adverse events and because they have dropped out, we
fail to incorporate their experiences in the analysis. Tackle Box 2.1
in Chapter 2 presents an approach to assessing if follow-up was
adequate in a study. Essentially, readers should worry about dropouts if numbers are large enough to affect the outcome rates in
a study.
4.3 Appraising the results
Question #1: How strong is the association between
exposure and outcome?
In Tackle Box 2.2 of Chapter 2 we discussed various measures of
effectiveness, such as the relative risk (RR), relative risk reduction (RRR), absolute risk reduction (ARR) and number needed
to treat (NNT). These same measures also work for studies on
harm designed as cohorts or RCTs. Instead of benefit, however,
we will usually end up with values suggesting harm, i.e. RRR < 0,
ARR < 0, RR > 1 and NNT < 0. Some people don’t like numbers
less than zero, so they omit the negative sign and just call
them relative risk increase (RRI), absolute risk increase (ARI), or
number needed to harm (NNH). It doesn’t really matter as long
as it is clear what the numbers represent. (If these numbers do
not make sense, refer again to Tackle Box 2.2.)
For case-control studies, we use a slightly different measure of
harm referred to as the odds ratio (OR). The odds ratio is read like
a relative risk. A value more than 1 suggests harm (if OR = 2 then
exposure doubles the odds of the outcome). A value less than 1
on the other hand, suggests benefit (if OR = 0.5 then exposure
halves the odds of an outcome). Usually OR ∼ RR, differing by
only a few decimal places. To learn more of OR and RR, go
through Tackle Box 4.2. If you are content with simply understanding
the results, or if seeing numbers causes you to break out in a rash,
skip the tackle box and just move on!
Question #2: How precise is the estimate of the risk?
Just like RCTs on therapy, studies on harm merely estimate the true
effect of an exposure. Thus, it may be unduly precise to express the
OR as an exact point estimate (e.g. smoking causes cancer, OR =
14.0)[5] . As we pointed out in Chapter 2, point estimates such as
these do not accurately reflect uncertainty in the studies that we
do. Researchers therefore express the strength of an association as
an ‘interval estimate’ which provides the range of possible values.
Interval estimates are expressed at a 95% level of confidence. When
we state 95% confidence intervals (95% CI), we mean that we are
95% sure that the true strength of association (the odds ratio) lies
within this range.
Tackle Box 4.2 Why we measure odds ratios in
case-control studies instead of relative risk
Instructions: A standard 2 × 2 table indicates (a) the exposure status
in rows and (b) the outcome status in columns. Go through the 2 × 2
tables below to understand the difference between relative risk and
odds ratio.
Panel A. 2 × 2 table for RCTs and cohorts (recruitment is by exposure status: Group 1 = exposed, Group 2 = unexposed)

              Outcome   No outcome
  Exposed        a          b        How many among the exposed had the outcome? Ans: a/(a+b)
  Unexposed      c          d        How many among the unexposed had the outcome? Ans: c/(c+d)

Panel B. 2 × 2 table for case-control studies (recruitment is by outcome status: Group 1 = with outcome, Group 2 = without outcome)

              Outcome   No outcome
  Exposed        a          b
  Unexposed      c          d

How many among those with the outcome had the exposure? Ans: a/(a+c)
How many among those with no outcome had the exposure? Ans: b/(b+d)
Notes:
1. Panel A demonstrates that sampling in RCTs and cohort studies is a horizontal
event: we recruit exposed patients and follow them up to see the proportion
who developed the outcomes. In this situation, it is rational to compare risk
of outcomes in the exposed and unexposed, i.e. a/(a+b) ÷ c/(c+d). This is
equivalent to the relative risk (RR) of an outcome (Chapter 2).
2. In contrast, the 2 × 2 table in Panel B shows that the sampling in case-controls
is vertical: we recruit cases (persons with the outcome) and controls (persons
without the outcome) and determine the proportion who were exposed.
The rational comparison in this case would be risk of exposure among cases
and the risk of exposure among controls, i.e. a/(a + c) ÷ b/(b + d). This may
be referred to as the relative risk of an exposure. This would be a useless
number. It would be like telling a patient: ‘Sir, you have lung cancer, there
is a 10-fold rise in the risk that you are a smoker’!
3. To counter this problem, statisticians use the odds of exposure a/c (‘a is to c’)
and b/d (‘b is to d’) instead of the risk of the exposure which is a/(a + c)
and b/(b + d). (You might want to review Tackle Box 3.2, if you find this
confusing.) The ratio of odds would then be (a/c) ÷ (b/d). This is still a vertical
comparison. Instead of the useless relative risk of exposure, we have converted
it to a useless relative odds of exposure. Or ‘Sir, you have lung cancer, there is
a 10-fold rise in the odds that you are a smoker’! However, this vertical ‘odds
ratio’ (OR) has four advantages over the relative risk (RR): (a) it is equivalent
to the horizontal odds ratio (a/b) ÷ (c/d), which is our main interest or
‘Sir, you are a smoker, so your odds of having lung cancer is 10 times
higher’; (b) numerically, the OR is almost the same as the relative risk,
usually differing by only a few decimal places; (c) it can be reduced to a
ridiculously simple equation OR = ad/bc and (d) it works not just for
case-control studies but also for RCTs and cohorts.
Exercise: If a = 4, b = 21, c = 1, d = 24, compute (a) the relative risk; (b) the
odds ratio.
Answers: RR = 4.0; OR = 4.6
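If you would like to verify the exercise by machine, here is a minimal Python sketch of both measures, using the standard 2 × 2 cell labels from the panels above.

# Illustrative sketch: relative risk and odds ratio from a 2 x 2 table.
#              outcome   no outcome
# exposed         a          b
# unexposed       c          d

def relative_risk(a, b, c, d):
    # horizontal comparison: risk in exposed vs risk in unexposed
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    # the 'ridiculously simple' ad/bc
    return (a * d) / (b * c)

a, b, c, d = 4, 21, 1, 24
print(round(relative_risk(a, b, c, d), 1))   # 4.0
print(round(odds_ratio(a, b, c, d), 1))      # 4.6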
A better statement would therefore be that ‘smoking increases the risk of lung cancer 14-fold, but it could range somewhere between a 3-fold and 59-fold increase’[5]. By convention, this
is expressed in the following syntax:
OR = 14.0 [95% CI: 3.3, 59.3]
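The text does not show where such intervals come from. For the curious, the sketch below uses the usual large-sample (Woolf) approximation for a 95% CI around an odds ratio; the cell counts are invented purely to illustrate the syntax and are not the actual data behind the smoking example.

# Illustrative sketch (not from the text): the large-sample (Woolf)
# 95% confidence interval for an odds ratio.
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    or_ = (a * d) / (b * c)
    se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)   # SE of log(OR)
    lo = math.exp(math.log(or_) - z * se_log_or)
    hi = math.exp(math.log(or_) + z * se_log_or)
    return or_, lo, hi

# Hypothetical cell counts, chosen only to illustrate the syntax:
or_, lo, hi = odds_ratio_ci(a=28, b=2, c=10, d=10)
print(f"OR = {or_:.1f} [95% CI: {lo:.1f}, {hi:.1f}]")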
How do we interpret the 95% CI? Recall the four basic tenets for
therapy:
1. When both ends of the CI are on the side of benefit (OR < 1),
the exposure definitely reduces the odds of the unfavourable
outcome.
2. When both ends of the CI are on the side of harm (OR > 1),
the exposure definitely increases the odds of the unfavourable
outcome.
3. When one end reflects an important reduction in odds and
the other end reflects an important increase in odds, the study
is inconclusive.
4. When one end reflects a small unimportant reduction in odds
and the other end reflects a small unimportant increase in
odds, then for all intents and purposes the exposure has no
effect on the outcome.
Confused? Then, you may need to review Tackle Box 2.3 in
Chapter 2. This will be time well spent, because it will help you
analyse study results quickly and efficiently!
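The four tenets can even be written down mechanically, as in the Python sketch below. The 10% margin for an ‘unimportant’ change in odds is our own invented cut-off; in real life, what counts as important is a clinical judgment.

# Illustrative sketch: interpreting a 95% CI for an odds ratio.
# MARGIN is an invented cut-off for an 'unimportant' change in odds.
MARGIN = 0.10

def interpret_ci(lower, upper):
    if upper < 1:
        return "definitely reduces the odds (benefit)"
    if lower > 1:
        return "definitely increases the odds (harm)"
    if lower > 1 - MARGIN and upper < 1 + MARGIN:
        return "for all intents and purposes, no effect"
    return "inconclusive"

print(interpret_ci(3.3, 59.3))    # definite harm
print(interpret_ci(0.5, 0.9))     # definite benefit
print(interpret_ci(0.95, 1.05))   # no important effect
print(interpret_ci(0.6, 2.5))     # inconclusive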
4.4 Assessing applicability
Now that we have appraised the validity and the results of the
article, you can decide if the results apply to your own patient.
As in previous chapters, we suggest two things to consider when
studying applicability: biologic and socioeconomic issues.
Biologic issues affecting applicability
Biologic factors may affect the applicability of a study on harm to
your particular patient. These biologic issues include differences
in sex, co-morbidity, race, age and pathology.
Sex
Consider physiological, hormonal or biochemical differences
between sexes that may impact on the effect of an exposure on
the outcome. Heavy smoking increases the risk of oral cancer in
men, with an OR = 2.1 [95% CI: 1.4, 3.2]. In women, the OR is
4.6 [95% CI: 2.5, 8.7][6] .
Co-morbidities
Consider co-existing conditions that could affect the risk of an
outcome for a given exposure. Chronic rofecoxib use increases
the risk of cardiovascular events overall. Among patients with
indications for ASA, the OR is 4.89 [95% CI: 1.41, 16.88]. Among
patients with no indication for ASA, the OR is only 1.89 [95%
CI: 1.03, 3.45][7] .
Race
Consider racial or ethnic differences that may alter the risk for
the outcome. Angiotensin-Converting Enzyme (ACE) inhibitors
(a class of antihypertensives) increase the risk of angioedema. The
risk is three times higher in blacks than in whites[8] . The odds
of developing hepatitis with isoniazid use is lower among Asians
who are fast acetylators of the drug[9] .
Age
Consider the age of the population in the study in relation to
the population that you are interested in. Inhaled corticosteroids
increase the risk for cataracts in asthmatic patients older than 40
years of age, but not in younger patients[10] .
Pathology
Diseases referred to by the same name may sometimes represent
pathologic states that differ in subtle ways. These differences may
sometimes account for variations in predisposition to disease. For
example, the relationship between high salt intake and hypertension is stronger among blacks. This is probably caused by a
higher prevalence of salt sensitivity[11] .
Socioeconomic factors affecting applicability
Socioeconomic status is known to modify the relationship
between exposure and outcome for a wide variety of conditions.
This applies to many infectious diseases that afflict low- to middle-income countries all over the world (such as tuberculosis, malaria
and diarrhoea). However, socioeconomic status can also affect
causation of several chronic degenerative diseases. For example,
the OR for developing peptic ulcer disease from H. pylori infection in Denmark is 1.6 [95% CI: 0.8, 3.4] in higher socioeconomic
classes, compared to 4.1 [95% CI: 1.8, 9.2] in lower socioeconomic classes[12] . It also affects the relationship between smoking,
alcohol, diet and oesophageal cancer[13] , smoking and chronic
obstructive lung disease[14] , diabetes and stroke[15] , and many
other diseases.
4.5 Individualizing the results
If you feel that the biologic and socioeconomic factors do not
significantly limit applicability, you can proceed to individualize
the estimate of harm. Studies report the average effect of exposure
in the population. However, the magnitude of this effect will
vary slightly between patients. The main source of variation is
the patient’s baseline risk for the adverse outcome. Variation in
risk is very common in medicine. Patients have mild, moderate
or severe forms of the same disease, or may have different risk
factors for an adverse outcome. Some may come in early, and
others may come in late in the course of an illness. Tackle Box 4.3
shows us four quick steps in using the baseline risk to estimate
the effect of the exposure on an individual.
4.6 Summary
Studies on harm help us decide whether or not to avoid a
certain exposure. Sometimes these studies lead us to avoid certain
behaviours (e.g. smoking or alcoholism) or certain treatments.
To read articles on harm efficiently, start by evaluating how
directly they answer questions that are important to you. If
the patients, exposures or outcomes are similar to those you
are interested in, go ahead and assess validity. Bear in mind
that our hands are often tied when we study harm, because
the ideal study (an RCT) would be too long, too large or even
unethical.
After appraising validity, appraise the results. Don’t get bogged
down with numbers and computations. Most of the time, the
authors tell you what the ORs and RRs are (excuse the tongue
twister). Of course, just as in articles on therapy and diagnosis,
you must next assess applicability, because biologic and socioeconomic factors may affect relationships between risk factors
and outcomes.
Lastly, you can proceed to individualize the results and estimate
risks specific to your patient, that is, the patient’s absolute risk
increase and NNH. Again, don’t get too engrossed with numerical
sophistication if this is difficult for you. ORs and RRs are nothing
more than estimates of how exposure changes risk, expressed as
a multiple of the baseline risk.
Tackle Box 4.3
Estimating impact of a harmful exposure on individual patients
Instructions: The effect of an exposure on the risk of disease varies from individual to individual. This variation can be attributed
to differences in the baseline risk for a disease. Go through this tackle box from step 1 to step 4 to understand the process involved in individualizing risk. The examples form a running illustration of a hypertensive woman considering the possibility of harm if she takes phenylpropanolamine (PPA) for her colds.
Step 1: Estimate the baseline risk or pre-exposure risk in percent.
How to do this: If the article you are reading is a cohort study, it may already contain this information. Look for a subgroup of unexposed individuals that closely approximates the one you are seeing. If the study does not contain this information, you might need to look at community-based studies.
Need an equation? There isn’t any.
Example: Let’s say we have a patient with a BP of 220/120. Studies show that the baseline risk for a haemorrhagic stroke is high, about 6% per year[16].
Step 2: Convert baseline risk to baseline odds.
How to do this: Risk is usually expressed as a probability. Probabilities are portions of the whole, while odds are the ratio of portions. Say that again? Well, if we were talking of a pie, probability would be a piece of the pie divided by the entire pie. Odds, on the other hand, would be a piece of the pie divided by the rest of the pie. To convert from probability to odds, we simply reduce the denominator by subtracting the numerator from it. For example, 7/100 (probability) becomes 7/93 (odds); 92/100 (probability) becomes 92/8 (odds)*.
Need an equation? Odds = Probability% / (100 − Probability%)
Example: If the risk of a haemorrhagic stroke is 6%, the odds will be 6/94.
Step 3: Multiply baseline odds by the odds ratio to get the post-exposure odds.
How to do this: The baseline odds were derived in step 2. The study you read should tell you the OR of the exposure. Remember, a harmful exposure will probably have an OR > 1.0. Thus, an OR of 2 doubles the odds of an event!
Need an equation? Post-exposure odds = Baseline odds × OR
Example: Exposure to PPA increases the risk of haemorrhagic stroke with OR ≈ 2.0[17]; the post-exposure odds in our example are therefore 2 × 6/94 or 12/94.

Step 4: Convert post-exposure odds to post-exposure risk in percent.
How to do this: Simple. Just add the numerator back to the denominator. For example, 1/3 (odds) becomes 1/4 (probability) and 0.24 (odds) becomes 0.24/1.24 (probability). This is now your patient’s risk after exposure!
Need an equation? Probability = [Odds / (1 + Odds)] × 100
Example: If the odds are 12/94, then the probability is 12/106 or about 11.3%. If we use the formula above, 12/94 = 0.128, so probability is (0.128/1.128) × 100 = 11.3%.
Notes:
1. You can skip converting probabilities to odds (step 2) and odds to probabilities (step 4) in situations when: (a) the baseline
risk is very small because for all intents and purposes, odds and probability will be equal (e.g. if the probability is 1/100, odds
will be 1/99, an ignorable difference); and (b) the measure of risk is expressed as a ‘relative risk’ rather than an odds ratio.
In both these situations, all you need to do is multiply baseline risk by the OR or RR.
2. Once you have the pre-exposure and post-exposure risk, you can do a lot of patient-specific calculations! (a) Post-exposure risk
minus baseline risk = individualized absolute risk increase (ARI). In the example above, 11.3% – 6% = 5.3%. (b) 100/(absolute
risk increase) = individualized number needed to harm (NNH). In the example above, 100/5.3 = 19.
3. Individualized information can now be used to guide a decision on whether or not to avoid a potentially harmful exposure.
Exercise: If the baseline risk for stroke was 1% in the example above, what would be the ARI and NNH for stroke if the patient
received PPA?
Answer: ARI = 1%, NNH = 100
* Déjà vu – a feeling of tedious familiarity.
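For those who prefer code to hand arithmetic, here is a minimal Python rendering of the four steps, reproducing the running PPA example (baseline risk 6%, OR ≈ 2.0).

# Illustrative sketch of Tackle Box 4.3, working in percentages.

def individualize(baseline_risk_pct, odds_ratio):
    baseline_odds = baseline_risk_pct / (100 - baseline_risk_pct)  # step 2
    post_odds = baseline_odds * odds_ratio                         # step 3
    post_risk = post_odds / (1 + post_odds) * 100                  # step 4
    ari = post_risk - baseline_risk_pct   # absolute risk increase
    nnh = 100 / ari                       # number needed to harm
    return post_risk, ari, nnh

post, ari, nnh = individualize(baseline_risk_pct=6, odds_ratio=2.0)
print(f"post-exposure risk {post:.1f}%, ARI {ari:.1f}%, NNH {nnh:.0f}")
# post-exposure risk 11.3%, ARI 5.3%, NNH 19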
References
[1] Delgado-Rodríguez M and Llorca J. 2004. Bias. Journal of Epidemiology
and Community Health. 58: 635–641.
[2] Werler MM, Pober BR, Nelson K and Holmes LB. 1989. Reporting
accuracy among mothers of malformed and non-malformed infants.
American Journal of Epidemiology. 129(2): 415–421.
[3] Rockenbauer M, Olsen J, Czeizel AE, Pedersen L, Sorensen HT
and the EuroMAP Group. 2001. Recall bias in a case-control surveillance system on the use of medicine during pregnancy. Epidemiology.
12(4): 461–466.
[4] Blomgren KJ, Sundstrom A, Steineck G and Wilholm BE. 2006. Interviewer variability – quality aspects in a case-control study. European
Journal of Epidemiology. 21(4): 267–277.
[5] Doll R and Hill AB. 1950. Smoking and carcinoma of the lung. Preliminary report. British Medical Journal. 2: 739–748.
[6] Muscat JE, Richie JP Jr, Thompson S and Wynder EL. 1996. Gender
differences in smoking and risk for oral cancer. Cancer Research.
1556(22): 5192–5197.
[7] Mukherjee D, Nissen SE and Topol EJ. 2001. Risk of cardiovascular
events with selective COX-2 inhibitors. Journal of the American Medical
Association. 286(8): 954–959.
[8] Kostis JB, Kim HJ, Rusnak J, Casale T, Kaplan A, Corren J and Levy
E. 2005. Incidence and characteristics of angioedema associated with
enalapril. Archives of Internal Medicine. 165(14): 1637–1642.
[9] Huang YS, Chern HD, Su WJ, Lai SL, Yang SY, Chang FY and Lee
SD. 2002. Polymorphism of the N-acetyltransferase gene as a susceptibility risk factor for antituberculosis drug-induced hepatitis. Hepatology. 35(4): 883–889.
[10] Jick SS, Vasilakis-Scaramozza C and Maier WC. 2001. The risk of
cataract among users of inhaled steroids. Epidemiology. 12(2): 229–234.
[11] Wilson DK, Bayer L and Sica DA. 1996. Variability in salt sensitivity
classification in black male versus female adolescents. Hypertension.
28(2): 250–255.
[12] Rosenstock SJ, Jorgensen T, Bonnevie O and Andersen P. 2004. Does
Helicobacter pylori infection explain all socioeconomic differences in
peptic ulcer incidence? Genetic and psychosocial markers for incident
peptic ulcer disease in a large cohort of Danish adults. Scandinavian
Journal of Gastroenterology. 39(9): 823–829.
[13] Wu M, Zhao JK, Hu XS, Wang PH, Qin Y, Lu YC, Yang J, Liu AM,
Wu DL, Zhang ZF, Frans KJ and van’t Veer P. 2006. Association of
smoking, alcohol drinking and dietary factors with esophageal cancer
in high- and low-risk areas of Jiangsu Province, China. World Journal
of Gastroenterology. 12(11): 1686–1693.
[14] Thorn J, Björkelund C, Bengtsson C, Guo X, Lissner L and Sundh
V. 2006. Low socio-economic status, smoking, mental stress and
obesity predict obstructive symptoms in women, but only smoking also
predicts subsequent experience of poor health. International Journal of Medical Sciences. 34(1): 7–12.
[15] Avendano M, Kawachi I, Van Lenthe F, Boshuizen HC, Mackenbach
JP, Van den Bos GA, Fay ME and Berkman LF. 2006. Socioeconomic
status and stroke incidence in the US elderly: the role of risk factors in
the EPESE study. Stroke. 37(6): 1368–1373. Epub 2006 May 11.
[16] Song Y, Sung J, Lawlor DA, Smith GD, Shin Y and Ebrahim S. 2004.
Blood pressure, haemorrhagic stroke, and ischaemic stroke: the Korean
national prospective occupational cohort study. British Medical Journal.
328: 324–325.
[17] Kernan WN, Viscoli CM, Brass LM, Broderick JP, Brott T, Feldmann
E, Morgenstern LB, Wilterdink JL and Horwitz, RI. 2000. Phenylpropanolamine and the risk of haemorrhagic stroke. New England
Journal of Medicine. 343(25): 1826–1832.
5
Evaluation of Articles
on Prognosis
Felix Eduardo R. Punzalan, Antonio L. Dans, Jacinto
Blas V. Mantaring III, Leonila F. Dans
People with a disease have important questions about how their
particular condition will affect their lives. Patients and their families want to know what to expect. What is the chance of dying
from the disease? What are its complications? How often do they
occur? Studies on prognosis evaluate the likelihood of outcomes
developing over time in patients with a particular clinical condition. This information helps patients make important decisions
about their own healthcare.
5.1 Appraising directness
A clinical question on prognosis is usually phrased using the
variables P, the patient or population with a certain disease and
O, the outcome or complication of interest. For example,
Among patients with angina (P), what is the likelihood of developing
myocardial infarction (O)?
Sometimes, we are interested in how certain characteristics of
the population affect prognosis. In this case, we also refer to a
variable E, an exposure or ‘prognostic factor’ that affects risk of
an outcome. For example,
Among patients with diabetes mellitus (P), how does the presence of
proteinuria (E) affect the likelihood of developing renal failure (O)?
The usual design for answering this question is a cohort study (see
Tackle Box 4.1) where patients with the condition are followed up
for a certain duration to measure the frequency of occurrence of
the outcome. As before, begin appraising an article on prognosis
by deciding if it provides a direct enough answer to the question
you are asking. Pay particular attention to the population (P)
because even when the disease is the same, prognosis can vary
greatly depending on subtype, severity and stage of disease. If
you think the patients in the study (or subgroups of them) are
similar enough to the patient you are seeing, appraise validity.
Otherwise, look for an article that gives you a closer answer to
the question you ask.
5.2 Appraising validity
Question #1: Was the sample of patients
representative?
Ensure that the population of patients recruited by the study
is representative of the patients with the condition in question. If the authors claim to estimate stroke rates among
patients with hypertension in general, recruiting hypertensives
admitted to a hospital would surely lead to an inappropriately
morbid prediction. If they claim to measure stroke rates among
hypertensives admitted to a hospital, then recruiting confined
patients would be appropriate. To evaluate how representative
a study is, examine the research objective. If you feel the inclusion and exclusion criteria are appropriate to the population
(P) that they address, then the sample of patients is probably
representative.
Question #2: Were patients (or subgroups of
patients) sufficiently homogeneous with respect
to prognostic risk?
The prognosis of most diseases is determined by characteristics of
the patient (e.g. age, sex and state of nutrition) and characteristics
of the disease (e.g. stage, subtype and duration). Because of this,
many times, it may not make sense to simply report an overall
average prognosis. Such an average risk would overestimate risk
in patients with mild disease and underestimate risk in patients
with severe illness.
A more valid approach to summarizing results would be to
group patients according to the presence or absence of the
aforementioned prognostic factors. For example, a study on the
survival of patients with congestive heart failure might subgroup
patients according to age, functional class and even etiology of
heart failure (ischemic versus idiopathic).
Deciding whether groups in a study are sufficiently homogeneous can be tricky. One must have clinical experience
or knowledge of the disease biology to decide if there are
other prognostic factors that the authors may have failed to
consider.
Question #3: Were unbiased criteria used to detect the
outcome in all patients?
Outcomes may be classified by the degree of objectivity needed
to classify them. Hard outcomes are those that require little
judgment to ascertain, e.g. the fact of death. Soft outcomes, on the
other hand, require subjective judgment, e.g. the cause of death,
or quality of life. The softer the outcome, the more important
it is to define criteria on which they can be based. Researchers
spend a lot of energy defining outcomes, even if it seems (on
the surface) to be easily determined. Myocardial infarction, for
example, was defined in a study as follows:
A diagnosis of myocardial infarction was made if the following conditions were met: creatine kinase or creatine kinase-MB more than
twice the upper limit of normal; or troponin I or T more than
twice the upper limit of normal (if neither creatine kinase or creatine kinase-MB were available); or troponin I or T more than three
times the upper limit of normal for the same markers within 24 h
of percutaneous transluminal coronary angioplasty; or troponin I or
T more than five times the upper limit of normal for the same
markers within 24 h of coronary artery bypass grafting surgery. In
addition to these marker criteria, a patient had to have experienced electrocardiographic changes in two or more contiguous leads
showing new Q waves (or R waves in V1 or V2), left bundle
branch block, or ischaemic ST-T wave changes, or typical clinical presentation consistent with myocardial infarction defined as
one of the following: cardiac ischaemic-type pain lasting more than
20 min, pulmonary oedema, or cardiogenic shock not otherwise
explained[1] .
Whew, talk about complex definitions! The more subjective the
interpretation, the more detailed the criteria we need. Moreover,
even after we go through the complexities of a definition such
as this, we need to protect implementation of these criteria, by
blinding those who will use them. As we discussed in Chapter 2,
prior expectations on outcomes can sway decisions about whether
the outcomes actually occurred or not.
Question #4: Was follow-up rate adequate?
Prognosticating is about knowing the likelihood of an outcome.
Therefore, having patients whose outcomes are unknown because
they failed to follow-up makes the estimate of prognosis less
certain. The higher the proportion of those who were lost to
follow-up, the greater is the threat to validity. This becomes even
more worrisome when you suspect that an adverse event is more
likely among patients who drop out from the study.
Just as in articles on therapy, the crucial issue is ascertaining
when you should worry about the number of drop-outs. This
should be relatively simple for articles on prognosis. Simply
assume the worst event for the drop-outs, add them to the overall
event rate and see if this significantly changes the conclusions. If
the change in the conclusions is significant, then there were too
many drop-outs.
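That worst-case check is easy to automate. Here is a minimal Python sketch; the cohort numbers are invented purely for illustration.

# Illustrative sketch: worst-case sensitivity analysis for drop-outs.
# Assume every patient lost to follow-up suffered the event, then see
# whether the event rate changes enough to alter the conclusions.

def event_rates(events, followed_up, dropouts):
    observed = events / followed_up * 100
    worst_case = (events + dropouts) / (followed_up + dropouts) * 100
    return observed, worst_case

# Hypothetical cohort: 20 events in 200 patients followed up, 30 lost
observed, worst = event_rates(events=20, followed_up=200, dropouts=30)
print(f"observed {observed:.0f}% vs worst case {worst:.0f}%")
# observed 10% vs worst case 22% -- too many drop-outs to ignore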
5.3 Appraising the results
Question #1: How likely are the outcomes over time?
The results from studies about prognosis can be expressed in
pictures or in numbers. Common numbers used are the (1) event rate; (2) event-free rate; and (3) average time to an event. These are discussed in Table 5.1.

Table 5.1 Reporting prognosis in numbers
Event rate: the percentage of patients who have the undesirable outcome over time (e.g. mortality rate, stroke rate, hospitalization rate).
Event-free rate: the percentage of patients who do not have the outcome over time (e.g. survival rate, stroke-free survival rate, pain-free survival rate).
Average survival: the average length of time before an event occurs (e.g. mean survival, median survival, average time to event).

Reporting prognosis in pictures is even more informative. If
you are familiar with survival curves, you can skip Tackle Box 5.1
altogether. If not, please spend some time on it before proceeding.

Question #2: How precise are the estimates of
likelihood?

When presented as numbers (event-free rates, event rates or
average survival), the likelihood of an outcome can be expressed
with a range of possibilities known as 95% Confidence Intervals
(CIs). The interval gives us the best and worst scenarios in terms
of prognosis of the condition being evaluated. The narrower the
confidence interval, the more precise is the estimate.
When presented as survival curves, these 95% CIs can be shown
at specific points in the graph as shown in Figure 5.1. Usually the
survival curves are more precise in the earlier periods of a study
since more patients are still being followed. As the study goes on, patients are
lost to follow-up and the 95% CIs become wider.
5.4 Assessing applicability
Biologic issues affecting applicability
As in previous chapters, applicability of findings of studies on
prognosis can be affected by biologic factors that include sex,
co-morbid conditions, race, age and pathology.
Tackle Box 5.1
Reporting prognosis in pictures
Instructions: The results of prognosis studies are often reported in survival curves. Usually, the x-axis represents time and the y-axis is scaled from 100%
survival (top) to 0% survival (at the bottom). Go through this tackle box to learn more about interpreting survival curves.
(Figure: three survival curves, labelled A, B and C, plotted as % survival on the y-axis, from 100 down to 0, against months of follow-up on the x-axis, from 0 to 12; all three curves end at about 60% survival at 12 months.)
Notes:
1. Survival curves give us more information than simple numbers depicting overall rates. Compare the three survival curves above. Despite a similar overall
survival rate of 60% after 12 months of follow-up, there is a distinct survival advantage for line A. By the sixth month, more than 90% of patients in line
A had survived, but for line C, survival was only a little over 60%. Patients in line B lie somewhere in between. As you can see, the curves tell a more
accurate story than the overall survival rates.
2. If we’re interested in events other than death, we can rename the y-axis as a generic ‘event-free survival’. We can also be more specific and state the
exact event: for example, pain-free survival or stroke-free survival.
3. Some authors prefer using event rates, instead of event-free survival. In this case, the curves would start from 0 events at the bottom of the y-axis, and
gradually go up through time: an exact mirror image of a survival curve. Figure 2.1 is an example of such a graph.
Exercise: Which of the above three curves illustrates the best survival rate in the first six months?
Answer: Curve A
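As a small numeric companion to note 3, the Python sketch below converts a cumulative event-rate curve into its mirror-image event-free survival curve. The figures are invented for illustration.

# Illustrative sketch: event-rate curves and event-free survival
# curves are mirror images (note 3 above).

# Hypothetical cumulative event rates (%) at 0, 3, 6, 9 and 12 months:
months = [0, 3, 6, 9, 12]
cumulative_event_rate = [0, 10, 20, 30, 40]

event_free_survival = [100 - rate for rate in cumulative_event_rate]

for month, survival in zip(months, event_free_survival):
    print(f"month {month:2d}: event-free survival {survival}%")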
Figure 5.1 Survival rates over a 12-month follow-up period (% survival on the y-axis, from 0 to 100; months of follow-up on the x-axis, from 0 to 12)
Sex
Consider sex differences that alter disease prognosis. For example,
in-hospital mortality after coronary artery bypass surgery is two
times higher in women than in men[2] .
Co-morbid conditions
Consider other clinical conditions that may alter prognosis. High
in-hospital mortality has been noted among Filipinos after a
myocardial infarction[3] . This has been partially attributed to a
higher incidence of hospital-acquired pneumonia among those
admitted in government facilities. Similarly, it has been noted
that concomitant hypertension is an independent predictor of
progression to kidney failure among diabetic patients[4] .
Race
Ethnic or racial factors are another important determinant
of prognosis in many conditions. Asians, for example, have
a slightly lower risk of deep vein thrombosis after major
orthopaedic surgery[5] . Similarly, African-American women have
higher mortality after breast cancer than white women[6] .
Age
Age is an important prognostic factor in almost any condition.
For example, shorter survival times with recurrent breast cancer
are predicted in women younger than 35 years (irrespective of menopausal status)[7]. Survival is also significantly lower
for children with ependymoma who are less than 4 years of age[8] .
On the other hand, older females with cervical cancer[9] and
advanced ovarian cancer[10] have poorer survival.
Pathology
Consider factors relating to disease severity, stage or subtype,
because these are the main determinants of disease progression. The medical literature is replete with examples: the more severe the BP elevation, the higher the risk of stroke; the more extensive the infection, the more catastrophic its course; the more malignant the histology, the earlier the metastasis.
Socioeconomic issues affecting applicability
Socioeconomic conditions may affect prognosis, especially when
prognosis is highly dependent on the quality of medical care
that patients receive. In a study among Filipinos with acute MI,
in-hospital mortality was twice as high in patients admitted to charity government hospitals as in those admitted to private hospitals. This was partially attributed to the limited availability of costly thrombolytic agents[3]. In a cohort study of elderly patients
with prostate carcinoma, low socioeconomic status was significantly associated with decreasing survival even after adjustment for age, co-morbidity and treatment. Those living in communities in the lowest quartile of socioeconomic status were 31% more likely to die than those in the highest quartile[11].
Inevitably, some differences will arise and your patient will
not always be similar to the study patient. You should assess
whether the study patients are so different from yours that you
cannot confidently use the results in making predictions for your
patients. If there are no burning biologic or socioeconomic issues
that limit the applicability of the study, you can use the study
results for prognostic information.
5.5 Individualizing the results
The more variables found to affect outcome, the more difficult it becomes to estimate the risk of an individual patient. To
estimate the annual risk of a coronary event in an individual,
for example, one would need to consider age, sex, family history,
blood pressure, waist-to-hip ratio, serum cholesterol, blood sugar
and even socioeconomic status. Because these calculations involve so many variables, some researchers have developed risk equations and scoring systems in an attempt to simplify the task of risk stratification. Thus, Goldman’s score estimates the risk of life-threatening complications after non-cardiac surgery[12], and the APACHE score is used to estimate the mortality risk of critically ill cancer patients after admission to the intensive care unit[13].
Some have even gone to the extent of developing electronic
calculators that do all the mathematics for us: for example, the
Framingham risk calculator[14] . The main goal in using these
various instruments is to come up with an estimate of prognosis
specific to an individual patient, based on that individual’s
characteristics. This information is useful in making treatment
decisions.
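To illustrate how such risk equations work, here is a minimal Python sketch of a logistic scoring function. The coefficients and the resulting risk are invented for illustration; they are not those of the Framingham, Goldman or APACHE instruments.

```python
import math

# Hypothetical logistic risk equation; the coefficients are invented for
# illustration and are NOT those of any published scoring system.
def annual_coronary_risk(age, male, systolic_bp, smoker):
    score = -11.0 + 0.06 * age + 0.5 * male + 0.02 * systolic_bp + 0.6 * smoker
    return 1 / (1 + math.exp(-score))  # logistic transform to a probability

# A 60-year-old male smoker with a systolic BP of 150 mmHg
risk = annual_coronary_risk(age=60, male=1, systolic_bp=150, smoker=1)
print(f"Estimated annual risk: {100 * risk:.1f}%")  # about 3.6%
```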
Prognostic data are also used to inform patients about what to expect of their condition. This is always an important aspect of management. A valid prognostic study may not alter treatment decisions; however, it can always give the patient a glimpse of the future.
5.6 Summary
Information on disease prognosis helps us to decide whether
we want to treat an illness or just leave it alone. In addition,
however, estimates of prognosis may be used as baseline risks for
our patients when we try to individualize the results of treatment
(see Chapter 2) or harm (see Chapter 4).
As usual, begin evaluation of the article by deciding whether
it provides a direct answer to the question you ask. Only then
should you spend time going deeper. Validity can be appraised
quickly by checking on four criteria: how representative the
study is, homogeneity of study subjects, objectivity of outcomes
and completeness of follow-up. When satisfied that the study is
valid, interpret the results. These can be presented as numbers
(disease-free survival rate, event rates, or mean time to event)
or as graphs (disease-free survival curves or curves of cumulative events). Before using the numbers on your patient, ensure
that biologic and socioeconomic factors will not limit applicability of the results. Finally, try to come up with a risk
estimate that is specific for your patient. Check subgroups in
the study that your patient might fit into. If there are too
many, look for a scoring system that the authors might have
provided.
References
[1] Pfeffer MA, Swedberg K, Granger CB, Held P, McMurray JJ, Michelson EL, Olofsson B, Ostergren J, Yusuf S, Pocock S and the CHARM Investigators and Committees. 2003. Effects of candesartan on mortality and morbidity in patients with chronic heart failure: the CHARM-Overall programme. Lancet. 362: 759–766.
[2] Hannan EL, Wu C, Bennett EV, Carlson RE, Culliford AT, Gold JP,
Higgins RSD, Isom OW, Smith CR, and Jones RH. 2006. Risk stratification of in-hospital mortality for coronary artery bypass graft surgery.
Journal of the American College of Cardiology. 47: 661–668.
[3] The Isip Study Group. 1999. Acute myocardial infarction in tertiary
centers in Metro Manila: In-hospital survival and physicians practices.
Infarct survival in the Philippines: In-hospital mortality (ISIP). Asean
Heart Journal. 7(1): 1–7.
[4] Bruno G, Biggeri A, Merletti F, Bargero G, Ferrero S, Pagano G and
Perin PC. 2003. Low incidence of end-stage renal disease and chronic
renal failure in type 2 diabetes: 10 year prospective study. Diabetes
Care. 26(8): 2353–2358.
[5] Leizorovicz A, Turpie AGG, Cohen AT, Wong L, Yoo MC and Dans
A for the SMART Study Group. 2005. Epidemiology of venous thromboembolism in Asian patients undergoing major orthopedic surgery
without thromboprophylaxis. The SMART Study. Journal of Thrombosis and Haemostasis. 3: 28–34.
[6] Chlebowski RT, Chen Z, Anderson GL, Rohan T, Aragaki A, Lane D,
Dolan NC, Paskett ED, McTiernan A, Hubbell A, Adams-Campbell LL,
Prentice R. 2005. Ethnicity and breast cancer: factors influencing differences in incidence and outcome. Journal of the National Cancer Institute.
97(6): 439–448.
[7] Falkson G, Gelman RS and Pretorius FJ. 1986. Age as a prognostic
factor in recurrent breast cancer. Journal of Clinical Oncology. 4: 663–
671.
[8] Sala F, Talacchi A, Mazza C, Prisco R, Ghimenton C and Bricolo A.
1998. Prognostic factors in childhood intracranial ependymomas: The
role of age and tumor location. Pediatric Neurosurgery. 28: 135–142.
[9] Meanwell CA, Kelly KA, Wilson S, Roginski C, Woodman C, Griffiths
R and Blackledge G. 1988. Young age as a prognostic factor in cervical
cancer: analysis of population based data from 10 022 cases. British
Medical Journal. 296(6619): 386–391.
[10] Thigpen T, Brady MF, Omura GA, Creasman WT, McGuire WP,
Hoskins WJ and Williams S. 1993. Age as a prognostic factor in ovarian
carcinoma. The Gynecologic Oncology Group experience. Cancer. 71(2
Suppl): 606–614.
[11] Du XL, Fang S, Coker AL, Sanderson M, Aragaki C, Cormier JN, Xing
Y, Gor BJ and Chan W. 2006. Racial disparity and socioeconomic
status in association with survival in older men with local/regional stage
prostate carcinoma: findings from a large community-based cohort.
Cancer. 106(6): 1276–1285.
[12] Prause G, Ratzenhofer-Comenda B, Pierer G, Smolle-Juttner F, Glanzer
H and Smolle J. 1997. Can ASA grade or Goldman’s cardiac risk
index predict peri-operative mortality? A study of 16,227 patients.
Anaesthesia. 52(3): 203–206.
[13] Chang L, Horng CF, Huang YC and Hsieh YY. 2006. Prognostic accuracy of acute physiology and chronic health evaluation II scores in
critically ill cancer patients. American Journal of Critical Care. 15(1):
47–53.
[14] http://hp2010.nhlbihin.net/atpiii/calculator.asp
‘There’s something wrong Mrs. Cruz. We need to do more tests. According
to this study you should have had a stroke a year ago!’
6
Evaluation of Systematic Reviews
Marissa M. Alejandria, Bernadette A.
Tumanan-Mendoza, Ma. Vanessa Villarruz-Sulit,
Antonio L. Dans
So far, we have breezed through the appraisal of four common
types of articles encountered in medicine: articles on therapy,
diagnosis, prognosis and harm. The last type of article to be
discussed in this book is the systematic review. A systematic
review summarizes the results of several studies that seem to
answer the same clinical question. It is distinguished from traditional reviews by the use of systematic methods to
1. identify, select and critically appraise relevant studies; and
2. collect and analyse data from the included studies to minimize
bias[1] .
A systematic review may use statistical methods to summarize
the results of included studies. In this case, it is referred to as a
‘meta-analysis’.
Authors conduct systematic reviews to
1. draw a reliable conclusion based on a summary of all relevant
studies;
2. increase the precision of estimates of the effectiveness of treatment, accuracy of tests, magnitude of risk and prognosis of
disease;
3. increase the number of patients in clinically relevant subgroups;
4. resolve uncertainty when results of primary studies seem
conflicting; and
5. plan new studies when adequate evidence is lacking.
6.1 Appraising directness
Systematic reviews are usually conducted to summarize several
articles on therapy, but they may also be used to answer questions
on diagnosis, prognosis or harm. We begin to analyse a systematic
review by appraising how directly it addresses the question raised,
i.e. how well the PEO in the study corresponds to our own PEO.
The issue of directness of studies on therapy, diagnosis, harm and
prognosis has been discussed in detail in Chapters 2–5. When we
look at systematic reviews on this topic, however, a unique issue
arises: the sensibility of the research question. Some questions
are so broad that they do not really correspond to the interest
of health care providers and patients. An example of an overly
broad question is:
Among patients with cancer, how effective is chemotherapy in delaying
disease progression?
In this example, there are too many different cancers and
chemotherapeutic agents. Combining the results of these various
studies would not yield a meaningful estimate of effectiveness. To make this question sensible, we need to focus it so that it refers to a specific population, exposure and outcome. For example: among older women with operable breast cancer (a specific cancer population), how effective is tamoxifen (a specific chemotherapeutic agent) in prolonging survival (a specific outcome)?
If you feel the article provides a reasonably direct answer and
that the reviewers pose a sensible question, then proceed to read
it in more detail. Otherwise, look for a better paper.
6.2 Appraising validity
Question #1: Were the criteria for inclusion of studies
appropriate?
While primary studies describe criteria for inclusion or exclusion of individual patients, systematic reviews describe criteria for
inclusion or exclusion of individual studies. These criteria describe
the methodological quality of the studies (M) to be included, as
well as the populations (P), exposures (E) and outcomes (O)[2] .
In the hierarchy of evidence, randomized controlled trials (RCTs)
come with the least bias, followed by cohort studies, case-control
studies and surveys. The lowest in the hierarchy would be descriptive studies including case series and single case reports.
The type of study design to be included in a systematic review
depends on the type of clinical question the review is addressing.
Of course, the best study design may not always be available for
a certain question. In this case, reviewers should be less rigid in
their inclusion criteria in order to accommodate study designs of
lower validity.
Question #2: Was the search for eligible studies
thorough?
A comprehensive search of the literature is important to ensure
that relevant studies, both published and unpublished, are not
missed. A thorough search for published literature should include
use of electronic medical databases such as MEDLINE, EMBASE,
the Cochrane Library and non-English language databases. Cross-references of original publications are also a good source of
published articles.
One problem with literature searches is that studies that report
a ‘positive’ result are more likely to be published than ‘negative’
studies. Thus, if only published articles are sought, conclusions
may overestimate the effectiveness of an intervention, safety of
a therapy or the accuracy of a test. This phenomenon has been
well-documented and is often referred to as ‘publication bias’[3] .
Avoiding publication bias in systematic reviews entails a thorough search for unpublished articles. This may be done by
writing to experts, going through pharmaceutical industry files
or surveying conference proceedings and books of abstracts.
Among other sources, unpublished trials can now be found in the
WHO Network of Collaborating Clinical Trial Registers, ClinicalTrials.gov from the US National Institutes of Health, the US Food
and Drug Administration (FDA) registry, and the International
Committee of Medical Journal Editors (ICMJE) registry[4] .
Question #3: Was the validity of the included
studies assessed?
Look for an assessment of the methodological quality of the
included studies in the review. Check whether the authors used
validity criteria for appraising primary studies (similar to those
presented in preceding chapters). For example, in a systematic
review on therapy, the validity criteria should at least include randomization and adequacy of allocation concealment (see Table 6.1). In general, included studies that are of weak quality tend to overestimate the effectiveness of an intervention[5].

Table 6.1 Important validity criteria for assessing the quality of primary studies included in a systematic review

Therapy
  Were patients randomly assigned to treatment groups?
  Was allocation concealed?
  Were baseline characteristics similar at the start of the trial?
  Were patients blinded to the treatment assignment?
  Were caregivers blinded to treatment assignment?
  Were outcome assessors blinded to the treatment assignment?
  Were all patients analysed in the groups they were originally randomized?
  Was follow-up rate adequate?

Diagnosis
  Was the reference standard an acceptable one?
  Was the reference standard interpreted independently from the test in question?

Harm or causation
  Were the patient groups being compared sufficiently similar with respect to baseline characteristics? If not, were statistical adjustments made?
  Were unbiased criteria used to determine exposure in all patients?
  Were unbiased criteria used to detect the outcome in all patients?
  Was follow-up rate adequate?

Prognosis
  Was the sample of patients representative?
  Were patients sufficiently homogenous with respect to prognostic risk?
  Were unbiased criteria used to detect the outcome in all patients?
  Was follow-up rate adequate?
Tackle Box 6.1
How to interpret forest plots
Instructions: The notes below label the most important parts of the forest plot, reproduced here as a table. Go through these labels and familiarize yourself with the anatomy of the graph. Once you feel sufficiently familiar with the anatomy, go through the notes below on what the forest plot can signify.

Review: Hypothetical example
Comparison: 01 Gym-based fitness regimen (treatment) vs Home-based fitness regimen (control)
Outcome: 02 Failure to get a modelling contract

Study or sub-category   Treatment n/N   Control n/N   RR (fixed) [95% CI]
Dans & Dans 1988        4/16            8/17          0.53 [0.20, 1.43]
Silvestre 1991          3/16            4/17          0.80 [0.21, 3.02]
Padilla 1994            8/18            5/18          1.60 [0.65, 3.96]
Alava 1996              19/130          42/133        0.46 [0.28, 0.75]
Mendoza 1998            9/19            6/18          1.42 [0.63, 3.19]
Mantaring 1999          25/70           19/73         1.37 [0.83, 2.26]
Punzalan 2001           16/85           24/81         0.64 [0.36, 1.11]
Alejandria 2003         9/55            14/63         0.74 [0.35, 1.57]
Loyola 2005             30/145          20/143        1.48 [0.88, 2.48]
Sulit 2006              46/157          23/160        2.04 [1.30, 3.19]

Total (95% CI)          711             723           1.04 [0.86, 1.26]

Test for heterogeneity: Chi² = 29.50, df = 9 (P = 0.0005), I² = 69.5%
Test for overall effect: Z = 0.42 (P = 0.68)
X-axis: log scale from 0.1 to 10; left of 1.0 favours treatment, right of 1.0 favours control.

Anatomy of the plot:
- Each tree represents a study; the square is its point estimate and the horizontal line is the 95% CI. Exact numbers are in line with each tree.
- The diamond represents the summary effect of all studies. The apex is the point estimate and the ends are the 95% CIs.
- X-axis: for RR, midpoint is 1.0. Labels indicate which side is benefit or harm.
Notes: A forest plot can tell us:
1. how many studies the review included: just count the number of trees!
2. which studies are the largest: the bigger the square in the middle, the bigger the study.
3. which studies had more outcome events: these have the narrowest 95% CI.
4. which studies showed statistically significant benefit (entire line is to the left of 1.0).
5. which studies showed statistically significant harm (entire line is to the right of 1.0).
6. which studies were inconclusive (line straddles 1.0 and extends far into either side).
7. which studies were inconclusive but showed a trend towards benefit (line is on the left, and barely touches 1.0).
8. which studies were inconclusive but showed a trend towards harm (line is on the right, and barely touches 1.0).
9. which studies show that the therapies are equal (line straddles 1.0 and doesn’t go far to either side).
10. whether there are important differences (heterogeneity) between studies: if the lines hardly overlap, we should worry.
Notes 6 and 9 describe similar-looking studies because both straddle 1.0. However, they mean entirely different things. In note 6, the study is inconclusive because its confidence interval is very wide; more studies need to be done. In note 9, the study is conclusive that the two treatments are equal: the ends of the 95% CI are so close to 1 that they represent very small, unimportant harm and benefit. The situation in note 6 is sometimes referred to as ‘absence of proof’, while note 9 is referred to as ‘proof of absence’!
Exercise: In the graph above, how many studies show (a) definite benefit, (b) definite harm, (c) inconclusive results and (d)
equivalence of the two treatments?
Answers: (a) 1, (b) 1, (c) 8, (d) 0.
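Notes 4–6 are mechanical enough to express in code. The following minimal Python sketch classifies three studies from the hypothetical forest plot above by where their 95% CIs sit relative to 1.0; the verdicts match the exercise answers.

```python
# Classify three studies from the hypothetical forest plot by where
# their 95% confidence intervals sit relative to the RR midpoint of 1.0.
studies = {
    "Alava 1996": (0.46, 0.28, 0.75),
    "Punzalan 2001": (0.64, 0.36, 1.11),
    "Sulit 2006": (2.04, 1.30, 3.19),
}

for name, (rr, lower, upper) in studies.items():
    if upper < 1.0:
        verdict = "statistically significant benefit"
    elif lower > 1.0:
        verdict = "statistically significant harm"
    else:
        verdict = "inconclusive (the CI straddles 1.0)"
    print(f"{name}: RR {rr} [{lower}, {upper}] -> {verdict}")
```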
Question #4: Were the assessments of the
studies reproducible?
Authors of systematic reviews decide on
(a) which studies to include;
(b) how valid these studies are; and
(c) what data to extract.
These may seem like objective tasks, but experienced reviewers will
tell you that these are difficult judgments to make and are subject to
human error. To ensure reproducible unbiased assessments in each
of these tasks, there should be at least two independent reviewers.
Also, check what steps the authors took to resolve disagreement.
This is usually done by discussion or by calling in a third person.
6.3 Appraising the results
Question #1: What are the overall results
of the review?
In a systematic review, a quantitative summary of the results
is usually presented as a forest plot. If you are unfamiliar with
the interpretation of forest plots, review Tackle Box 6.1 before
proceeding.
Outcomes can be reported as dichotomous variables which
have only two possible results, as we have demonstrated in Tackle
Box 6.1 (i.e. given or not given a modelling contract). However,
as described in Chapter 2, outcomes can also be reported as
continuous variables that have a range of possible results (e.g.
change in weight or blood pressure). When the outcomes are
continuous, the difference between treatment and control groups
can no longer be expressed as a relative risk reduction, absolute risk reduction or relative risk. Instead, the difference in outcome is simply expressed as the difference in means between treatment and control. Figure 6.1 illustrates a forest plot of hypothetical studies evaluating a continuous outcome.

Review: Comparison of weight reduction programs
Comparison: Gym-based fitness regimen (treatment) vs. Home-based fitness regimen (control)
Outcome: Mean weight loss in lbs.

Study or sub-category   Treatment N   Mean (SD)       Control N   Mean (SD)
Dans & Dans 1988        16            15.25 (7.13)    17          13.81 (5.75)
Silvestre 1991          16            18.34 (7.89)    17          16.54 (6.78)
Padilla 1994            18            16.87 (4.77)    18          17.92 (5.23)
Alava 1996              130           25.40 (9.30)    133         21.10 (8.20)
Mendoza 1998            19            17.00 (2.08)    18          18.30 (2.34)
Mantaring 1999          70            22.00 (12.00)   73          25.00 (9.00)
Punzalan 2001           85            22.80 (16.70)   81          21.30 (14.10)
Alejandria 2003         55            24.32 (8.66)    63          22.00 (9.35)
Loyola 2005             145           23.00 (5.12)    143         26.23 (6.26)
Sulit 2006              157           24.23 (3.00)    160         27.40 (4.60)

Total (95% CI)          711                           723

Test for heterogeneity: Chi² = 59.29, df = 9 (P < 0.00001), I² = 84.8%
Test for overall effect: Z = 6.49 (P < 0.00001)
X-axis: WMD from −5 to 5; left of 0 favours treatment, right of 0 favours control. WMD = weighted mean difference.

Figure 6.1 Forest plot of a hypothetical systematic review evaluating a continuous outcome
Question #2: Were the results similar from study
to study?
With a reasonably focused clinical question and appropriate
inclusion and exclusion criteria, one would expect that the results
would be similar from one study to another. However, this may
not always be the case. Between studies, there may be slight
variations in the characteristics of the population (P), the administration of the exposure (E) or the definition of the outcome (O).
Even when the PEOs are sufficiently similar, it is possible that
the study methodologies used are slightly different (M). When
such heterogeneity is present, reviewers are hesitant to perform
a meta-analysis and estimate the average results. This is understandable as averages don’t make sense unless we are averaging the
same things. For example, a systematic review of mammographic
screening for breast cancer would probably show no difference between screened and unscreened women if we analyse
all age groups. Such a conclusion would be totally wrong and
misleading. Screening is definitely beneficial among women aged
50 or more[6] and probably harmful among younger women[7] .
To determine whether the results are similar enough to justify
combining them, check whether the authors assessed for heterogeneity. There are two ways of assessing heterogeneity. One is
by visual inspection of the forest plot to check if the trees from
different studies overlap. This was discussed in Tackle Box 6.1.
The second method is by performing statistical tests.
The chi-squared test is the most popular way of testing
for heterogeneity. When the p value of the test is significant
(p < 0.10), it is quite probable that differences exist among studies
that cannot be attributed to chance alone. If you study the
example in Tackle Box 6.1, you will see this test in the lower left
hand of the sample forest plot.
Another statistical measure of heterogeneity is the I² statistic. While results of the chi-squared test indicate presence or absence of heterogeneity, the I² statistic provides information on the magnitude of heterogeneity. A value greater than 50% suggests substantial heterogeneity[8]. This value is seen next to the chi-squared test in Figure 6.1.
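The two measures are directly related: I² is calculated from the chi-squared statistic (often called Q) and its degrees of freedom as I² = 100% × (Q − df)/Q, floored at zero. A one-line check in Python reproduces the value reported in the Tackle Box 6.1 forest plot:

```python
# I^2 from Cochran's Q, using the heterogeneity figures reported in the
# Tackle Box 6.1 forest plot: Chi^2 (Q) = 29.50 on df = 9.
q, df = 29.50, 9
i_squared = max(0.0, (q - df) / q) * 100
print(f"I^2 = {i_squared:.1f}%")  # prints 69.5%, i.e. substantial heterogeneity
```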
When studies are significantly heterogeneous, the authors
should identify the sources of heterogeneity. Since differences in
PEO or M are the common reasons for heterogeneity, dividing the
studies into homogenous subgroups may minimize the problem.
Another strategy would be to exclude apparent outlier studies and
check if heterogeneity is corrected. This is referred to as a sensitivity analysis. When these subgroup analyses are not planned
beforehand, their results should be viewed with caution. Too
much exploration raises the possibility of accidental findings. If
you torture the data enough, you might just get a confession!
Question #3: How precise were the results?
Because studies merely estimate the effectiveness of therapy or
accuracy of tests, it may be unduly precise to express these estimates as exact values. For example, it may be misleading to
conclude that ‘warfarin reduces the risk of stroke in patients
with atrial fibrillation by 79.0% (RRR)’. Such an estimate may
sound too precise. It does not convey the uncertainty of the estimates. Therefore, researchers also express the treatment effect as an interval estimate, which provides a range of possible values. Systematic reviews provide us with point estimates and 95% CIs for
individual studies (the lines in a forest plot) and, sometimes, for
an overall estimate (the diamond at the bottom of a forest plot).
As expected, the 95% CI around the summary value will always
be narrower than the 95% CI around the individual studies. This
gain in precision is the main reason we carry out systematic
reviews at all.
6.4 Assessing applicability
Assessing the applicability of the results of a systematic review
to your individual patient follows the same principles described
in Chapters 2–5. Depending on whether you are appraising
a systematic review on therapy, diagnosis, harm or prognosis,
you can refer to the respective applicability sections of these
chapters.
If the overall results of the review are not directly applicable
to your patient population, do not despair. Valid subgroups
may have been analysed that fit your patient’s characteristics.
As mentioned earlier, however, results derived from a subgroup
analysis should be interpreted with caution. Criteria for determining whether the findings from subgroup analysis are credible
include the following[9, 10] .
1. The subgroup analysis should be pre-planned.
2. There shouldn’t be too many subgroup analyses.
3. Subgroup differences, if found, should be seen consistently in
different studies.
4. Subgroup differences, if found, should be biologically plausible.
When there are too many subgroup analyses and when many of
them are unplanned, beware! Again too much exploration can
lead to accidental findings.
6.5 Individualizing the results
Application of the results to an individual patient or patient
group depends on what type of research question the systematic
review addresses. These are discussed in detail in the sections on
individualization of Chapters 2–5. Table 6.2 summarizes recommendations from these chapters.
A note of caution when dealing with systematic reviews on therapy: the reported ARR and NNT are averaged, rather than individualized. Therefore, these numbers may not apply to the specific patients that you see. You need to seek the summary RR or RRR and then go through the calculations described in Tackle Box 2.4 to estimate patient-specific results.
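As a concrete illustration, here is a minimal Python sketch of that arithmetic (the numbers are hypothetical): the individualized ARR is the patient’s baseline risk multiplied by the relative risk reduction (1 − RR), and the NNT is its reciprocal.

```python
def individualized_arr_nnt(baseline_risk: float, summary_rr: float):
    """Individualize a pooled relative risk for one patient.

    baseline_risk: the patient's expected event rate without treatment
    summary_rr: the summary relative risk from the systematic review
    """
    arr = baseline_risk * (1 - summary_rr)      # absolute risk reduction
    nnt = 1 / arr if arr > 0 else float("inf")  # number needed to treat
    return arr, nnt

# e.g. a patient with a 20% baseline risk and a pooled RR of 0.75
arr, nnt = individualized_arr_nnt(0.20, 0.75)
print(f"ARR = {arr:.3f}, NNT = {nnt:.0f}")  # ARR = 0.050, NNT = 20
```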
Table 6.2 Information needed for individualization of results by type of research question

Therapy (see Tackle Box 2.2)
  What you need from the patient: the baseline risk based on clinical characteristics
  What you need from the study: the relative risk (RR) or relative risk reduction (RRR)
  The individualized statistic you derive: the individualized absolute risk reduction (ARR) or individualized number needed to treat (NNT)

Diagnosis (see Tackle Boxes 3.3 and 3.4)
  What you need from the patient: the pre-test probability
  What you need from the study: the likelihood ratio for a test result
  The individualized statistic you derive: the post-test probability of disease

Harm (see Tackle Box 4.3)
  What you need from the patient: the baseline risk
  What you need from the study: the odds ratio or relative risk
  The individualized statistic you derive: the individualized absolute risk increase (ARI) or number needed to harm (NNH)

Prognosis (see Section 5.5)
  What you need from the patient: the baseline characteristics
  What you need from the study: the event rate or event-free survival
  The individualized statistic you derive: the patient-specific probability of developing complications
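The diagnosis row is the only one that works through odds rather than risks. A minimal Python sketch with hypothetical numbers:

```python
# The diagnosis row of Table 6.2 in code: pre-test probability and a
# likelihood ratio combine, via odds, into a post-test probability.
def post_test_probability(pre_test_p: float, likelihood_ratio: float) -> float:
    pre_odds = pre_test_p / (1 - pre_test_p)  # probability -> odds
    post_odds = pre_odds * likelihood_ratio   # apply the test's LR
    return post_odds / (1 + post_odds)        # odds -> probability

# e.g. a 30% pre-test probability and a positive test with LR+ = 6
print(f"{post_test_probability(0.30, 6):.2f}")  # prints 0.72
```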
6.6 Summary
Systematic reviews usually address questions on therapy; however,
they occasionally address questions on diagnosis, harm and
prognosis. Begin evaluation of a systematic review by ensuring
that it provides a direct enough answer to the focused and sensible
question that you ask. If so, appraise validity, ensuring the process
of study inclusion was both thorough and objective. Appraisal
of results will entail understanding forest plots. Aside from the
magnitude of the overall effect and the 95% CIs, these graphs also
tell us if the results were similar from study to study. Assessment
of applicability and individualization of results will depend on
the nature of the research question.
References
[1] Chalmers I and Altman D (eds.) 1995. Systematic Reviews. BMJ Publishing
Group: London.
[2] Egger M, Davey Smith G and O’Rourke K. 2001. Rationale, potentials
and promise of systematic reviews. In: Systematic Reviews in Health
Care: Meta-analysis in Context. Second Edition. BMJ Publishing Group:
London.
[3] Egger M and Davey Smith G. 1998. Bias in location and selection of
studies. British Medical Journal. 316: 61–66.
[4] Krleža-Jerić K. 2005. Clinical trial registration: The differing views of
industry, the WHO, and the Ottawa Group. PLoS Medicine. 2(11):
e378.
[5] Oxman A, Guyatt G, Cook D and Montori V. 2002. Summarizing
the evidence. In Guyatt G, Drummond R (eds.) Users’ Guides to the
Medical Literature. American Medical Association Press: USA.
[6] Nystrom L, Rutqvist LE, Wall S, Lindgren A, Lindqvist M, Ryden S, Andersson I, Bjurstam N, Fagerberg G, Frisell J and Tabár L. 1993.
Breast cancer screening with mammography: An overview of Swedish
randomized trials. Lancet. 341: 973–978.
[7] Gotzsche PC and Nielsen M. 2006. Screening for breast cancer
with mammography. Cochrane Database of Systematic Reviews.
4(CD001877). DOI: 10.1002/14651858.CD001877.pub2.
[8] Clarke M and Oxman AD (eds.) 2003. Cochrane Reviewers’ Handbook
4.2.0 [updated March 2003]. In The Cochrane Library, Issue 2. Update
Software: Oxford.
[9] Fletcher RH and Fletcher SW. 2005. Clinical Epidemiology: The Essentials. Fourth edition. Lippincott Williams & Wilkins: Maryland, USA.
[10] Oxman AD and Guyatt GH. 1992. A consumer’s guide to subgroup
analyses. Annals of Internal Medicine. 116: 78–84.
7
Literature Searches
Antonio L. Dans, Leonila F. Dans,
Maria Asuncion A. Silvestre
If you were born in the 1950s or earlier, then you are probably
old enough to remember what a pain it was to have to search
the medical literature. The Index Medicus was the nightmare of
every medical student. Sleep-deprived and bleary-eyed, students
in medical libraries would hunch over endless rows of thick books
that must have weighed tons when put together, to find that one
elusive article that they needed. In fact, searching was such a
pain then, that a journal search became a popular, sadistic form
of punishment in medical schools. ‘Don’t be late for class or
off you go to the dungeons of Index Medicus!’ was the threat,
accompanied by a crash of thunder in the background.
Fortunately, information access has evolved so much in the past two decades that today we are able to search the entire Index Medicus at the click of a mouse! This is possible because almost all databases of scientific publications are now available in electronic format. This development has cast manual searches into the depths of antiquity, but it has also raised new
expectations from healthcare providers all over the world. Today,
new (and not so new) generations of practitioners are expected
to have skills for conducting thorough and efficient searches of
the electronic medical literature.
7.1 What are electronic databases?
An electronic literature database can be likened to a file composed
of several index cards. Each index card represents a published
article, containing structured information such as its TITLE, the
AUTHORS, the SOURCE, the ABSTRACT, KEYWORDS and
other important information. In computer jargon, each index
card is referred to as a RECORD, each piece of structured information is referred to as a FIELD and the entire conglomerate of
index cards is referred to as the DATABASE. These relationships
are illustrated in Figure 7.1.
[Figure 7.1: an electronic database depicted as a stack of index cards (records); each card carries fields such as Title, Author(s), Source, Abstract and MeSH keywords]

Figure 7.1 Components of an electronic database of medical literature
Each record represents an article containing structured information (fields) such as the author, source and abstract. A collection
of records with structured information is called a database.
Numerous electronic databases of medical literature have
emerged in recent years. The most popular (and the largest) is
MEDLINE, which is managed by the US National Library of
Medicine. As of July 2007, it contained 15 million articles from
5000 medical journals published in over 70 countries around the
world. As many as 2000–4000 articles are indexed every working
day[1] . EMBASE is another popular database, which indexes
375 000 articles a year from 3500 journals[2] . The Cumulative Index to Nursing & Allied Health Literature (CINAHL)
Database stores articles from 2593 journals from various healthcare professions[3] . The Cochrane Central Register of Controlled
Trials (CCRCT) has over 300 000 bibliographic references to
controlled trials in health care. This database is an amalgamation
of various sources[4] . Numerous other databases are available in
various fields of the healthcare profession.
7.2 The language of electronic databases
Searching electronic databases requires a thorough understanding
of Boolean language. This is not some ancient tongue unearthed
from an advanced civilization which has been buried for
centuries. On the contrary, Boolean language is very simple. It
has just a handful of words, the usage of which you probably
learned in elementary school. The best way to review Boolean
syntax is by using the Venn diagram. Go through Tackle Box 7.1
to review the basic concepts. If you have children of school age,
now would be a good time to ask for their help.
Tackle Box 7.1
The Venn diagram and Boolean language
Instruction: The Venn diagram below depicts three sets of numbers
defined by the circles A, B and C. The numbers within each set are
referred to as elements. In an electronic database of medical literature,
the numbers represent various articles and the sets are key words
or concepts that define these articles. Go through the description of
Boolean language below, to understand how electronic databases work.
[Venn diagram: three overlapping circles A, B and C containing the elements 1–14; A = {1, 2, 3, 4, 5, 6}, B = {5, 6, 7, 8, 9}, C = {4, 6, 9, 10, 11, 12, 13, 14}]
Boolean logic can be used to identify specific elements in this diagram.
This is achieved using the connectives AND or OR. For example,
the union of sets A and B can be expressed in Boolean language as
A OR B. In the diagram, this would be represented by elements 1–9.
Similarly, the intersect of sets A and B can be expressed as A AND
B. In the diagram, this would be represented by elements 5 and 6.
Note that unions (expressed by OR) tend to increase the number of
elements included, while intersects (expressed by AND) tend to do the
opposite. To make things more exciting, brackets can help us use unions
and intersects in the same expression. Traditionally, operations with
brackets are performed first. For example, to understand the expression
(A AND C) OR B, first identify a new set defined by the intersect
of A AND C. Complete the statement by getting the union of that
new set with B. This whole statement would now refer to elements 4–9.
Exercise: Which elements do the following phrases refer to? (a) B AND C; (b) A AND B AND C; (c) B OR C; (d) A AND (B OR C).
Answers: (a) 6, 9; (b) 6 only; (c) 4–14; (d) 4, 5, 6.
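Python’s set type mirrors Boolean language exactly, with & for AND and | for OR. In the sketch below, the set memberships are those deduced from the diagram and the exercise answers above:

```python
# Tackle Box 7.1 as Python sets: AND is intersection (&), OR is union (|).
A = {1, 2, 3, 4, 5, 6}
B = {5, 6, 7, 8, 9}
C = {4, 6, 9, 10, 11, 12, 13, 14}

print(sorted(A | B))        # A OR B        -> elements 1-9
print(sorted(A & B))        # A AND B       -> [5, 6]
print(sorted((A & C) | B))  # (A AND C) OR B -> elements 4-9
print(sorted(A & (B | C)))  # A AND (B OR C) -> [4, 5, 6]
```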
If you understand Boolean language and the difference between
the connectives AND and OR, then you are ready to conduct an
electronic search. While we refer to PubMed often in this discussion (because it is free online and accessible to all), the four steps
we describe are also applicable to most other literature databases.
Step 1: Identify the concepts in your focused clinical
question (P, E, O and M)
We have extensively discussed the components of a focused clinical question in previous chapters. P refers to the population of
interest, E to the exposure being evaluated and O to the outcomes
expected. If a clinical question is focused and well stated, it can
often be clearly categorized into a quest for articles on therapy,
diagnosis, harm or prognosis. The advantage of this association
is that we can now link the focused question with specific study
designs, using so-called methodological filters (M). For example,
not only can we search for articles on treatment of osteoporosis
(P) using bisphosphonates (E) to prevent fractures (O), but we
can also narrow our search so that we only get randomized
controlled trials (M)! A powerful feature of electronic databases
is the ability to make the computer carry out some of the critical
appraisal. Table 7.1 lists method filters you may want to use for
various types of articles.
Step 2: Prioritize the concepts from most
to least important
This is the tricky part of literature searches. Each concept in your
question represents a set in the Venn diagram. If you intersect
all these sets straight away, you may end up with nothing! The
plan is therefore to intersect the concepts one at a time, until
Table 7.1 Method filters for various types of focused clinical questions

Therapy
  Generic format: Among P, how effective is E in preventing O?
  Method filter: systematic reviews or meta-analyses, RCTs, cohorts, case-control studies, case series or case reports

Diagnosis
  Generic format: Among P, how accurate is E in diagnosing O?
  Method filter: studies that report ‘sensitivity and specificity’ or ‘likelihood ratios’, systematic reviews or meta-analyses

Harm
  Generic format: Among P, how much does E contribute to the causation of O?
  Method filter: systematic reviews or meta-analyses, RCTs, cohorts, case-control studies, case series or case reports

Prognosis
  Generic format: Among P, by how much does E increase the risk of O?
  Method filter: cohort studies, systematic reviews or meta-analyses

Note: P = the population of interest (usually characterized by a disease or condition); E = the exposure being evaluated (either a treatment, a test, a harmful exposure or a prognostic factor); O = the outcome expected (a disease, complication or some measure of health).
you reach a manageable number of articles. Before you can do
this, however, you need to decide which concept to search first
(the most important) and which one to search last (the least
important).
Which one is the most important? Ask yourself this: if you
were allowed only one term to search, which concept would
you search for? Consider the earlier example: among patients
with osteoporosis (P) how effective are bisphosphonates (E) in
preventing fractures (O)? Let’s say you’re interested in finding
systematic reviews (M).
In this situation, which concept would you search first – P, E,
O or M? If you search for articles on ‘fractures’ (O), you would be
in trouble. There would be too many! The same would be true if
you search for ‘systematic reviews’ (M). If you search for articles
on ‘osteoporosis’ (P), you would pick up fewer articles but a lot
would be on the use of other drugs such as calcium and vitamin
D. However, if you search for articles on ‘bisphosphonates’ (E),
they are likely to also be on osteoporosis and fracture prevention.
Thus, the term bisphosphonates would be a reasonable start for
this search, because it would yield the most useful set of articles.
If you search for articles on bisphosphonates and still get too
many of them, then you need to try and narrow the search.
Choosing the next best concept to search follows the same
process. If you could intersect the first concept with just one other
concept, which one would it be? Intersecting bisphosphonates
with osteoporosis might yield a lot of articles you don’t need
such as cohort studies and practice guidelines, or even prevalence
studies and letters to the editor. The same would be true if you
intersect bisphosphonates with fractures. If you intersect bisphosphonates with systematic reviews, however, you are likely to get
the most relevant yield. Systematic reviews on bisphosphonates
(M and E) will most probably be about osteoporosis (P) and
fractures (O).
If intersecting the first two concepts yields too many articles,
then you need to choose a third. Again you ask: if you could
intersect the first two concepts with just one other concept, which
one would it be? In the example, adding fractures (O) as the
third concept would filter out some studies (but not all) that just
monitor effect on bone mass density (a mechanistic endpoint)
and not fractures (a clinical endpoint). ‘Fractures’ would therefore be a reasonable third option. If you still get too many, then
the fourth option is the only one left: osteoporosis (P).
Table 7.2 summarizes how this process of prioritization might be used for certain examples, including our bisphosphonate example.

Table 7.2 Examples of clinical questions, concepts and how they may be prioritized

Among patients with osteoporosis, how effective are bisphosphonates in preventing fractures?
  P = patients with osteoporosis (rank 4)
  E = bisphosphonates (rank 1)
  O = fractures (rank 3)
  M = systematic reviews (rank 2)

Among patients with diabetes, how effective are sulfonylureas in preventing usual complications?
  P = patients with diabetes (rank x)
  E = sulfonylureas (rank 1)
  O = usual complications (rank 3)
  M = systematic reviews or RCTs (rank 2)

Among patients presenting with acute chest pain, how accurate is serum troponin-I determination in diagnosing acute myocardial infarction?
  P = patients with acute chest pain (rank x)
  E = serum troponin-I determination (rank 1)
  O = acute myocardial infarction (rank x)
  M = studies that report sensitivity, specificity, or likelihood ratios (rank 2)

Note: 1 = most important; 4 = least important; x = probably an unnecessary concept
As can be seen from these examples, prioritization can sometimes lead to the conclusion that some concepts are unnecessary. When searching for articles on sulfonylureas (E), for example, it is probably unnecessary to specify diabetes (P) as sulfonylureas are almost exclusively used for diabetes. Similarly, when searching for articles on serum troponin-I determination, it is probably unnecessary to specify P and O, because the test is specifically carried out to diagnose MI in patients with chest pain.
In our experience, the process of prioritization is extremely
important but often neglected. Spend time to think through this
prioritization step regularly so that it becomes intuitive. There
are several advantages of making this a habit.
1. If searching priority concepts yields very few articles, then
we don’t need to use the lower priority concepts. This saves
time.
2. We can also save time by skipping concepts that would not
have improved our yield.
3. We avoid missing important articles because of mindless
intersections.
Step 3: (a) Expand and (b) intersect each concept
sequentially, until you obtain a manageable number
of articles.
The previous two steps can be considered without even touching
a computer key. These two steps represent the planning process
before final implementation. Steps 3a and 3b comprise the
bulk of the actual search. ‘Expanding’ a concept means trying
to ensure that we account for different synonyms as well as
different spellings. ‘Intersecting’ a concept means adding criteria
to our search so that we limit it to only the articles that answer
our focused question. To use the Venn diagram analogy, when
we expand a concept we are trying to make our sets (represented by circles) as big as possible. We cast a wide net so
that we include all the relevant articles in the concept. We
then intersect these large circles, to get to the few articles that
address all four components of our focused question. The cycle
of expansion-intersection should be completed one concept at
a time, using the sequence of priorities that you planned in
Step 2.
To explore the intricacies of the expansion-intersection cycle,
let’s use the earlier example where we asked:
• Among patients with osteoporosis (P, 4th priority);
• how effective are bisphosphonates (E, 1st priority);
• in preventing fractures? (O, 3rd priority).
• We were looking for systematic reviews (M, 2nd
priority)
While reading the subsequent sections of this chapter, it is
recommended that you be in front of a computer connected
to your favourite electronic database of medical literature. We
recommend PubMed (of the US National Library of Medicine
and the US National Institutes of Health) at www.pubmed.com as
it offers MEDLINE for free and has many neat features that make
searching easy. If you aren’t sure how to log on, you probably
need someone beside you to provide guidance. Your best chance
is anyone born in the 1970s or later. Figure 7.2 shows the home
page of PubMed and its five main features which are useful for
Step 3.
Step 3a: Expanding concepts
Expanding a concept using free text searches

Start by looking for the search box of the database you use. Type the concept ‘bisphosphonate’, press the enter key, and voilà, you have done a free text search! This is called a free text search not because it’s free, but because it searches for words used freely by the authors in the title or abstract. Because the authors are not bound by conventions in spelling or terminology, they can use terms in any way they want. This is the difficulty with the free text search strategy: the same concept can be referred to in different ways by different authors. ‘Bisphosphonate’, for example, may also
Figure 7.2 The main features of PubMed: (1) free text search; (2) search
history; (3) search filters; (4) standardized terms for article classification
and (5) search filters for articles on therapy, diagnosis, harm or prognosis
be referred to as ‘biphosphonate’. To account for this, a better
strategy would be to search for the union of the two terms, that
is ‘bisphosphonate OR biphosphonate’. Better yet, to account for
plural forms of these words, we can type ‘biphosphonate OR
biphosphonates OR bisphosphonate OR bisphosphonates’. Try
typing this in the search box to see that you do get a larger
number of articles.
This is easy enough in our example but what if you have say, 20
synonyms and variations in spelling? Some electronic databases
offer useful tricks to simplify the task.
1. History functions: Many electronic databases have a ‘history’
function which displays a numbered history of terms that you
recently searched. Figure 7.3 shows a search for four spelling
variations of bisphosphonates. Users can obtain the unions of
Figure 7.3 PubMed history feature, enabling previous searches to be
combined
previous searches by typing their numbers in the sequence,
instead of the terms themselves, e.g. ‘#1 OR #2 OR #3 OR #4’.
Note that the union yields more articles than any of the four
terms alone.
2. Truncations: Some words have the same roots, with variations
primarily occurring in the suffix. In PubMed, the asterisk
symbol (*) can be used at the end of root words so that
all variations in suffix are captured. For example, try typing
‘osteoporo*’ in the search box. This will yield articles that
use words beginning with ‘osteoporo’ such as osteoporotic
and osteoporosis.
3. Phrase recognition: Sometimes a short phrase is one of the
synonyms of a concept. For example, meta-analysis may also be
referred to by the phrase systematic review. Electronic databases
vary in how they search for phrases. Some look for an exact
match but more advanced ones will automatically seek variations in word order and spelling. If you are sure about the
phraseology of your search term however, it is best to enclose it
in quotation marks so that the database seeks an exact match.
If you are not sure, entering the phrase without quotation
marks may be your best option.
Expanding a concept using subject heading searches

A more efficient way of solving the problem of non-uniform terminology is to use a standardized dictionary of medical terms to classify articles. MEDLINE uses a dictionary called MeSH (Medical Subject Headings). A similar dictionary called EMTREE is used by EMBASE. CINAHL also uses its own ‘tree’. People entering articles into these databases continually check the dictionaries so they can classify articles in a standard way, regardless of the terms used by the authors.
Subject headings are very useful, because they obviate the need to create long lists of synonyms. Another advantage arises from the arrangement of the terms in hierarchical form, with trunks, branches, twigs and leaves. For example, Figure 7.4 (a PubMed screen displaying a portion of the MeSH tree) shows that ‘female athlete triad syndrome’ and ‘osteoporosis, postmenopausal’ are under ‘osteoporosis’, which is under ‘bone diseases, metabolic’, which in turn is under ‘bone diseases’. Clicking on any branch in this tree will automatically search the twigs and leaves attached to it, again obviating the need for multiple terms and entries. In
PubMed, this feature is called the ‘explode’ function. Don’t take
this literally; you don’t have to run to your nearest bomb shelter.
It is a default feature so all you need to do is click a term in the
hierarchy and all sub-entries are searched automatically. Explode
is especially useful when you are interested in a class of drugs,
because then you don’t need to enter the individual names.
Now try PubMed’s MeSH tree on your sample problem. (You
can access MeSH by clicking on the MeSH button in the left-hand
column of any PubMed page.) As the screen shot in Figure 7.5
shows, if you enter the word ‘bisphosphonates’ in the search box,
the MeSH synonym that will come up is ‘diphosphonates’. We
didn’t know of this synonym, and therefore missed it altogether
in our free text search! As it turns out, it is the standard term
adopted by MeSH.
Figure 7.5 How a chosen MeSH term is used to search PubMed
Once a subject heading is chosen from the MeSH dictionary,
there must be a mechanism to go back to the main database to
actually search for articles on that subject. In PubMed, this is
initiated by checking the tick box beside the term (e.g. diphosphonates), then clicking the ‘send to search box’ function. From
the search box, PubMed can now be searched for relevant articles.
Figure 7.6 compares the search yield with MeSH (13 084 articles) and the search yield with the free text strategy that we used
earlier (14 618). The difference is easily explained. Free text lists
articles with the terms searched, even if they are not the main
topic of the articles. MeSH only lists articles if the term searched
is a key concept in that article. Most of the time therefore, MeSH
will give you a better yield in terms of relevance. Researchers
are often pressed to be very thorough in their searches. A trick
we sometimes use is to search for the union of a free text
and a subject heading. In this example, it would be stated as
‘#5 OR #7’.
Figure 7.6 Use of PubMed history function to compare yield from a free
text search (#5) and a MeSH search (#7)
Special strategies for expanding method searches

Free text and subject headings may also be used to expand the method filters (M) identified in Table 7.2. ‘Meta-analysis’ can be a free
text search, for example, and can also be found as a MeSH
term. In addition however, some advanced electronic databases
provide special tools to augment these two strategies. PubMed,
for instance, has two features that we find very useful.
1. Limits: This feature provides a table with checkboxes for Publication types, such as randomized controlled trials, reviews or
clinical practice guidelines.
Tip: You need to keep track of whether you have the Limits function ‘on’ or ‘off’. Forgetting to turn it off when you no longer need it is a common source of search errors.
2. Clinical Queries: PubMed lists this function in the blue margin
on the left side of the PubMed screen. This feature contains
pre-tested combinations of MeSH and free text strategies to
search for certain study types. Users can actually select if they
want to search for articles on therapy, diagnosis, prognosis
or harm. In addition, they can select if they want a sensitive
search (nothing missed, but many non-relevant articles) or
a specific search (mostly relevant articles but some may be
missed).
Tip: If you’re in a hurry, this might be a good choice of search
strategy.
Step 3b: Intersecting concepts
By accessing many synonyms in step 3a, we are assuring the
retrieval of all relevant articles (obtaining large Boolean sets).
We face the danger, however, of retrieving thousands of articles,
most of which we might not even need. Thus, a corollary strategy
during the actual search is to narrow the yield by a sequential
intersection of the concepts that were previously expanded. The
number of concepts combined will depend on the yield at each
step. Obviously, when the first step yields just a handful of articles, it would be futile to proceed with additional combinations.
For the example on osteoporosis, whether we use the free text
search for synonyms of bisphosphonates, the MeSH search using
diphosphonates or the union of both, the yield will be around
14 000 articles. This is certainly not a manageable number; who,
in their right mind, would want to read 14 000 titles and abstracts?
We therefore need to intersect this concept with our second
choice: a method filter. To do this, we can click the Limits function, click on Publication types, then ‘meta-analysis’. In Figure 7.7
below, we see a yield of only 37! Another strategy would be to click
‘Clinical Queries’ to search for systematic reviews. To do this,
simply type the history number of our MeSH search for diphosphonates, as shown in Figure 7.8. Click ‘Go’ and lo and behold:
we have found 38 articles! Now this may not seem like a big
difference in this example, but in many instances, the difference
in yield can be quite large. In our experience, Clinical Queries
have been extremely useful in simplifying our search for articles.
Figure 7.7 Using PubMed Limits function to search for certain study types
(in this case a meta-analysis)
Figure 7.8 PubMed Clinical Queries feature, allowing articles to
be searched for by type
Depending on how much time you have, running through
the titles and abstracts of 37 or 38 articles may be considered
manageable. If this is the case, you can now stop and just scan
the titles for any relevant citations. If you don’t have a lot of
time, you will have to expand the next concept (‘fractures’) and
intersect it with the first two concepts.
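If you prefer to script your searches, the same expansion-intersection cycle can be run programmatically. The sketch below uses Biopython’s Entrez module to query PubMed. It is an illustration only: it assumes Biopython is installed, a network connection is available and you supply a contact email address, and the search term merely mirrors the strategy above (the MeSH heading unioned with truncated free text variants, intersected with a publication-type filter).

```python
# A sketch of the expansion-intersection search done programmatically
# with Biopython's Entrez module (pip install biopython; needs network).
from Bio import Entrez

Entrez.email = "you@example.org"  # NCBI asks for a contact address

# Expand: MeSH heading unioned with truncated free text variants.
# Intersect: restrict to the meta-analysis publication type and fractures.
term = ("(diphosphonates[MeSH Terms] OR bisphosphonate* OR biphosphonate*) "
        "AND meta-analysis[pt] AND fracture*")

handle = Entrez.esearch(db="pubmed", term=term, retmax=20)
record = Entrez.read(handle)
handle.close()

print(record["Count"], "articles found")
print(record["IdList"])  # PubMed IDs of the first 20 hits
```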
Step 4: Examine the yield for mishits and misses and
revise the search if necessary
‘Mishits’ are articles yielded by our search that aren’t relevant to
our needs. ‘Misses’ are articles relevant to our search that our
strategy did not find. Mishits are easy to assess. All you need
to do is read through the list of articles found. Misses on the
other hand, are difficult to assess and constitute a more serious
problem. You never really know that you may be missing articles,
unless you have some already in mind that you know should
have come up. Figure 7.9 below shows that there were 37 articles
yielded by our strategy. The first two seem to be mishits, but
that’s fine as long as we don’t miss the good stuff. The third
article seems to be relevant.
Figure 7.9 The different articles found by our search strategy
When we do find them, mishits are also easier to handle. When
there are too many mishits, what can we do? We can:
1. expand and intersect more concepts; or
2. insert more stringent method filters (e.g. expand and intersect
‘placebo’ to obtain only studies with a placebo control group).
When there are too many misses (low yield), we can:
1. reduce the number of concepts;
2. use less stringent method filters; or
3. look for more synonyms by going through the initial yield.
If you were using free text, try expanding with MeSH. If you were
using MeSH, try expanding with free text, or use the union of
MeSH and free text searches.
7.3 Summary
A proper search begins with steps that don’t even require a
computer: identification of concepts (Step 1) and prioritization
according to importance (Step 2). The search proper involves
a cycle of concept expansion (Step 3a) and intersection (Step
3b), until a manageable (readable) number of articles is located.
Examining the yield for mishits and misses can lead to an
improvement of the strategy (Step 4).

Skill in literature searching is very empowering: you can
obtain information straight away, when you really need it. This
chapter has provided a few general points to help you, but you
must practice to master the skill. Make it a habit to spend a
few minutes every day searching for topics of interest to hone
your skills. Remember, finding a relevant article doesn’t necessarily
mean that your search strategy was a good one. You can
always find an article by luck! Lucky searches reinforce bad technique,
so don’t use the yield to assess the adequacy of your strategy.
Instead, concentrate on method to ensure that your searches are
reliable.
[Cartoon] ‘You printed 5,275 abstracts? I asked you to search on global warming, I didn’t ask you to cause it!’
Index
Absolute risk reductions 28–9, 31, 34, 39–40, 72–8, 111
Active controls 15–16
Adverse events 24, 35
Allocation concealment strategies 19–20, 21, 103–4
  see also Concealed allocation
Baseline risk 37–9, 78–81, 111
Bias 10, 18, 69–71, 89, 101–4
BioMed Central (BMC) 9
Blinding issues 21–3, 25, 69–71, 94–5
Boolean language 117–34
Case-control studies 68–71, 101, 120
Censored analysis 23
Chi-squared tests 108
Clinical Queries feature, PubMed 130–2
Clinical questions 6–8, 16–18, 46, 67–68, 85–86, 100–101
Cochrane Central Register of Controlled Trials (CCRCT) 117
Cohort studies 68–71, 86, 101–2, 120
Concealed allocation
  see also Allocation concealment
Confidence intervals (CIs) 31–5, 73–6, 90–2, 109, 112
Continuous outcomes 16, 25–31, 105–7
Control groups 1
Cross-sectional surveys 101–2
Decision aids 40–1
Deja vu 81
Diagnosis
  applicability of studies 51–55, 109–10
  clinical question 6–8, 46
  directness 46, 62
  individualization of results 55–62, 111
  meta-analysis 103, 111
  searching the literature 120
  validity 46–9, 103
Diagnostic threshold 56
Dichotomous outcomes 16, 25–31, 106–7
Directness 6, 10–11, 16–18
  diagnosis 46, 62
  harm 68, 79
  prognosis 85–6
  systematic reviews 100–1
  therapy 16–18
Drop-outs 24–6, 72, 89
EBM applicability
  biological issues 33–6, 51–4, 76–8, 92–4, 95
  diagnosis 51–62, 110–12
  harm 76–81, 111
  prognosis 92–6, 111
  socioeconomic issues 37, 54, 62, 76, 78, 94–6
  systematic reviews 109–11
  therapy 30–41, 110–12
EBM, see Evidence-based medicine
Electronic database searches 7, 102–4, 115–34
  history functions 125–6
  key words 118–34
  phrase recognition 126–7
  search steps 119–34
  subject headings 127–30
  truncations 126
EMBASE 117, 127
Event rate 90
Event-free rate 90
Evidence-based medicine (EBM) ix–xx
  definition 4–6
Exposure 7–11, 15–17, 46, 67–8, 85–6, 101, 107–9, 119–34
Face tables 40–1
False negative results 50–3
False positive results 50–3
Forest plots 104–5, 112
Freemedicaljournals.com 9
Harm
  applicability of studies 76–78
  clinical questions 6–7
  directness 68, 79
  electronic database searches 121
  individualization of results 78–81, 111
  meta-analysis 103, 111
  validity 68–72, 79, 103
Hazard ratios 30–1
Health InterNetwork Access to Research Initiative (HINARI), WHO 9
Heterogeneity 108–9
Histopathologic examinations 47
History functions, electronic databases 125–6
I-square 108
Index Medicus 115
Individualization of results 6, 11–12
  diagnosis 56–62
  harm 78–81
  meta-analysis 110–111
  prognosis 94–5
  therapy 37–41
Intention-to-treat analysis 23
Interval estimates, see Confidence intervals
Likelihood ratios 49–62, 120–2
Limits feature, PubMed 130–2
Mabuhay xx
Mean differences 25–31, 72–6, 105–7
MEDLINE 3, 7, 102–4, 117, 124, 127–34
  see also Electronic databases
MeSH tree, MEDLINE 127–34
Meta-analysis, see Systematic reviews
Method filters 119–34
Mishits/misses, electronic databases 132–4
National Institutes of Health 102, 124
National Library of Medicine 117, 124
  see also MEDLINE
Nomogram, Bayes 60–1
Number needed to harm 29, 39, 73–6, 79–81, 111
Number needed to treat 11, 29, 39, 73–6, 111
Odds ratio (OR) 11, 73–6, 79–81
Outcomes 7–11, 15–18, 46, 67–8, 72, 85–6, 100–1, 107–9, 119–34
Placebos 15–16, 21–2, 26–7, 133–4
Population of interest 7–11, 15–17, 20–1, 46, 67–8, 85–6, 100–1, 107–9, 119–34
Post-test probability 52, 56–60, 63
Pre-test probability 53, 55–60, 111
Predictive values
  positive 49–51
  negative 49–51
Prognosis
  applicability of studies 90–94
  clinical questions 7–8, 119–34
  directness issues 85–6, 95–6
  individualization of results 94–5, 111
  meta-analysis 103, 111
  validity issues 86–9, 103
Proportion 30–31
Publication bias 102–3
PubMed 119, 124–34
Racial differences, application of EBM skills 36, 53, 77, 92–3
Randomization 18–19, 68–71, 103–4
Randomized controlled trials 19, 26–9, 68–71, 101–4
Rates 30–31
Recall bias 69–71
Reference standard 46–8
Relative risk 28–29, 31–38, 72–75, 107, 111
Relative risk reductions 28–9, 31, 32–3, 38–40, 72–6, 109, 111
Research questions 6–8, 12, 16–18, 46, 67, 85–86, 100–101
Searches, see Electronic database searches
Sensitivity 49–51
Sensitivity analysis 27, 108
Socioeconomic issues 37, 54, 61–2, 76, 78, 94–6
  diagnosis 54–5, 62–3
  harm 76, 78
  prognosis 94–5
  therapy 37
Specificity 49–51
Statistical adjustments 68–71
Subgroup analysis 33, 110–12
Subject headings, electronic databases 127–30
Surrogate outcomes 18
Surveys 101–2
Survival curves 30, 90–2
Systematic reviews
  applicability of studies 109–10
  clinical questions 6–8, 100–1
  individualization of results 110–11
  validity 101–6
Therapeutic threshold 56
Therapy
  applicability of studies 33–37
  clinical questions 8, 15–16
  directness 16–18
  individualization of results 37–41, 111
  meta-analysis 103, 111
  validity issues 18–24, 103
True negative results 50–3
True positive results 50–3
Truncations, electronic databases 126
Validity
  diagnosis 46–49
  harm 68–72
  prognosis 86–9
  systematic reviews 101–106
  therapy 18–24

Index compiled by Terry Halliday