Sackett et al., 2000

CAT (Critically Appraised Topic)
(adapted from Sackett, et al. 2000)
1-page summary of evidence
resulting from critical appraisal of an
article, test, etc.
 Answers a specific foreground
question

– “Compared to no treatment, does
parent-administered treatment
significantly improve the language skills
of toddlers with language delay?”
First part of CAT identical for tx and
dx studies (see handout pp. 2-3)
Clinical bottom line: (appears 1st but
completed last)
 Clinical question:
 Search terms:
 Appraised by whom, and date:
 Synopsis of key (memorable)
information, in a concise, maximally
useful format (e.g., types of subjects,
procedures, measures, results, etc.)

CAT-egories (appraisal points) for a
study of therapy (Sackett et al., 2000)
Prospective, controlled?
 Random assignment?
 Comparing ≥ 2 conditions?
 Recognizable subjects?
 Evidence of pre-tx group similarity?
 Blinding (insofar as possible) of
evaluators, relevant others?

Appraisal points (cont.)



Control over nuisance variables?
Valid, reliable measures of tx effects?
Statistically significant difference (p-value)?

Practically significant difference (d-value)?
Precision of treatment effects (narrow CI)?
Outcomes for all enrolled?

Cost-benefit and feasibility analyses?
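
The d-value in the appraisal points is commonly Cohen's d: the difference between group means divided by the pooled standard deviation. A minimal sketch under that assumption (the slides do not say which d variant is intended). The function name and the scores are hypothetical and come from no study.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(treated, control):
    """Standardized mean difference using the pooled sample SD."""
    n1, n2 = len(treated), len(control)
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = ((n1 - 1) * variance(treated) + (n2 - 1) * variance(control)) / (n1 + n2 - 2)
    return (mean(treated) - mean(control)) / sqrt(pooled_var)

# Hypothetical post-treatment scores, for illustration only.
treated_scores = [12, 15, 14, 18, 16]
control_scores = [10, 11, 13, 12, 9]
print(round(cohens_d(treated_scores, control_scores), 2))
```

A p-value only says whether a difference is unlikely to be due to chance. The d-value expresses how large the difference is in standard deviation units.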


A sample treatment CAT





CAT: Language of delayed toddlers improves in
response to parent-administered focused
stimulation
Clinical bottom line: Compared to an untreated
control group, motivated mothers of low-vocabulary
toddlers significantly decreased their speaking rate
and language complexity and increased their
vocabulary inputs in response to ~18 hr of
instruction in focused stimulation techniques, and
their children produced significantly more words
and early grammatical forms.
Clinical question: Compared to no treatment, does
parent-administered treatment significantly improve
the language skills of toddlers with language delay?
Search terms: word learning AND toddlers, PubMed
clinical query
Appraised by: Dollaghan
Key appraisal points













Prospective, controlled: Yes
Randomized: Yes
Comparing ≥ 2 conditions: Yes
Recognizable Ss: Yes
Pre-tx similarity: Yes
Blinding: Yes Cn; no parent
Control over nuisance variables: Yes
Valid, reliable measures: Yes
Statistically significant differences: Yes
Practically significant differences: Yes
Precision of treatment effects: No
Outcomes for all enrolled: Yes
Cost-benefit, feasibility analyses: Yes
Critical appraisal of evidence
on diagnostic indicators
The key variables by which
individuals are identified as members
of a class, ostensibly to improve
prediction and outcome for them
 Myriad diagnostic indicators have
been proposed in communication
sciences and disorders
 Diagnostic indicators in your area of
interest?

Most diagnostic indicators in CSD
are based on “Phase I” studies

Group mean comparison studies
– People with, and people without, the
condition of interest are compared with
respect to a proposed indicator

Correlational studies
– Association between proposed indicator
and accepted indicators

Such studies can’t address the two most
crucial features of a diagnostic
indicator: accuracy and precision
Accuracy and precision
 Accuracy
– The ability of an indicator to
identify a condition of interest, i.e.,
the amount of agreement between
the proposed indicator and a
reference standard
 Precision
– Width of confidence intervals (CI)
for estimates of accuracy
Accuracy of a diagnostic indicator
The ability of an indicator to identify
a condition of interest, i.e., the
amount of agreement between the
proposed indicator and a reference
standard
 Preferred measures of diagnostic
accuracy: positive and negative
likelihood ratios
(Battaglia et al., 2002)

Positive Likelihood Ratio (LR+)



Reflects the degree of confidence that a
person who scores in the positive
(affected or disordered) range on a dx
indicator does have the disorder
Formula: sensitivity / (1 - specificity)
The higher the LR+, the more informative
the indicator for identifying people who
have the disorder
Interpreting LR+ values
(Sackett et al., 1991)
LR+ > 20 Very high; virtually certain that a
person with this score has the disorder
LR+ = 10 High; disorder very likely in a person
with this score
LR+ = 4 Intermediate; the indicator is
suggestive of disorder but insufficient
to diagnose
LR+ = 1 Equivocal; a person who scores in the
disordered range on the measure may
or may not have the disorder; the
measure provides no new information
Negative Likelihood Ratio (LR-)
Reflects the degree of confidence
that a person scoring in the negative
(normal) range on the diagnostic
indicator truly does not have the
disorder
 Formula: (1 - sensitivity) / specificity
 The lower the LR-, the more
informative the indicator for ruling
out the presence of disorder

Interpreting LR- values
(Sackett et al., 1991)
LR- < 0.10 Very low ; virtually certain that a
person scoring in this range does not
have the disorder
LR- = 0.20 Low; disorder very unlikely
LR- = 0.40 Intermediate; the indicator is suggestive
but insufficient to rule out the disorder
LR- = 1.0 Equivocal; a person scoring in the
normal range on this measure may or
may not be normal
Calculating sensitivity and specificity
(nothing more than LR precursors)



Sensitivity: the percentage of people
with the disorder that the new indicator
correctly classifies as disordered
Specificity: the percentage of people
who don’t have the disorder that the
new indicator correctly classifies as
not disordered
The “true” status of every individual
with regard to the disorder is
established according to a gold (or
reference) standard
Disorder Status (re: Gold Standard)
                             + Disorder (LI)    - Disorder (LN)
New Test   + Disorder (LI)   a                  b
Result     - Disorder (LN)   c                  d
                             # with disorder    # without disorder
Disorder Status (re: Gold Standard)
                             + Disorder (LI)      - Disorder (LN)
New Test   + Disorder (LI)   a (True positive)    b (False positive)
Result     - Disorder (LN)   c (False negative)   d (True negative)
Sensitivity = a / (a + c)
(the proportion of people with the
disorder that the new test identifies
as having the disorder)
Disorder Status (re: Gold Standard)
                          + Disorder           - Disorder
New Test   + Disorder     a (True positive)    b (False positive)
Result     - Disorder     c (False negative)   d (True negative)
Specificity = d / (b + d)
(the proportion of people
without the disorder that
the new test identifies as
not having the disorder)
Example
100 children diagnosed with language
impairments (LI) and enrolled in
language intervention, and 100 same-age children with no history of language
impairment (LN), were administered a
new test of grammatical morphology.
 80 of the children with LI, and 30 of the
children with LN, scored in the
disordered range on the new measure.

Disorder Status (re: Gold Standard)
                             + Disorder (LI)      - Disorder (LN)
New Test   + Disorder (LI)   a = 80               b = 30
Result     - Disorder (LN)   c = (20)             d = (70)
                             100 with disorder    100 without disorder
Sens = a / (a + c) = 80/100 = .80
Spec = d / (b + d) = 70/100 = .70
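
The same arithmetic can be sketched in a few lines of Python. The cell labels a, b, c, d follow the 2x2 table above, and the counts come from the example. The function name is a hypothetical choice for illustration.

```python
def sensitivity_specificity(a, b, c, d):
    """a = true positives, b = false positives,
    c = false negatives, d = true negatives."""
    sensitivity = a / (a + c)  # proportion of people with the disorder flagged by the test
    specificity = d / (b + d)  # proportion of people without the disorder cleared by the test
    return sensitivity, specificity

# Counts from the example: 80 of 100 LI and 30 of 100 LN scored in the disordered range.
sens, spec = sensitivity_specificity(a=80, b=30, c=20, d=70)
print(f"Sens = {sens:.2f}, Spec = {spec:.2f}")  # Sens = 0.80, Spec = 0.70
```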
Why not just use sensitivity and
specificity as measures of accuracy?



It’s their interrelationship that is most
important overall
Sensitivity and specificity vary
substantially according to sample
characteristics, including N, base rate
(prevalence), severity, confusability
Likelihood Ratios are not impervious to
sample characteristics, but are much
less affected than are sensitivity and
specificity
Calculating Likelihood Ratios
Sens = .80
 Spec = .70
 LR+ = sens / (1 - spec) = .80/.30 = 2.67
 LR- = (1 - sens) / spec = .20/.70 = 0.29
 Several programs, some free on web,
are set up to allow entry in 2x2 table
format
 In addition to accuracy measures, they
also provide information on precision
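
For readers without such a program, the two ratios can be computed directly. A minimal sketch using the formulas and the values from this slide; the function name is hypothetical.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) from sensitivity and specificity given as proportions."""
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return lr_pos, lr_neg

lr_pos, lr_neg = likelihood_ratios(0.80, 0.70)
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")  # LR+ = 2.67, LR- = 0.29
```

Note that a specificity of exactly 1.0 (or a sensitivity of exactly 0) makes a denominator zero. This sketch does not handle that edge case.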

Precision of a diagnostic indicator



Width of confidence intervals (CI) for
sensitivity, specificity, and likelihood ratios,
calculated by adding and subtracting a
multiple of standard error (e.g., 1.96 SE for
a 95% CI)
Standard error depends on sample size and
reliability; larger samples and higher
reliability will result in narrower CIs, all else
being equal
Sackett et al. (2000) appendix shows how
to calculate CIs by hand, and programs
(some free) provide CIs given raw numbers
in a 2x2 table
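
The "add and subtract 1.96 SE" recipe can be sketched for a proportion such as sensitivity. This is the simple normal-approximation (Wald) interval, chosen here as an assumption because it matches the description most directly. Programs often use other methods (for example Wilson or exact intervals), so their output may differ somewhat. The function name is hypothetical.

```python
from math import sqrt

def wald_ci(successes, n, z=1.96):
    """Proportion plus and minus z standard errors, clipped to [0, 1]."""
    p = successes / n
    se = sqrt(p * (1 - p) / n)  # standard error shrinks as n grows
    return max(0.0, p - z * se), min(1.0, p + z * se)

# Sensitivity from the earlier example: 80 of 100 children with LI scored positive.
low, high = wald_ci(80, 100)
print(f"Sens = 0.80, 95% CI {low:.2f}-{high:.2f}")  # Sens = 0.80, 95% CI 0.72-0.88
```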
Sample size and precision: 95% CIs for
studies with same LRs but different Ns
Value         N = 200 (95% CI)   N = 20 (95% CI)
Sens = .80    (0.71-0.87)        (0.44-0.98)
Spec = .70    (0.60-0.79)        (0.35-0.93)
LR+ = 2.67    (1.98-3.70)        (1.12-7.66)
LR- = 0.29    (0.19-0.42)        (0.08-0.87)
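
The effect of sample size on an LR interval can be sketched with the log method, a common approximation for likelihood ratio CIs. Using it here is an assumption: the slides do not say which method produced the table, so the output need not match the tabled values exactly. The N = 20 counts (8, 3, 2, 7) are hypothetical, scaled down from the N = 200 example to keep the same sensitivity and specificity.

```python
from math import exp, log, sqrt

def lr_pos_ci(a, b, c, d, z=1.96):
    """LR+ with an approximate CI computed on the log scale.
    a, b, c, d are the 2x2 cells (TP, FP, FN, TN)."""
    lr_pos = (a / (a + c)) / (b / (b + d))  # sensitivity / (1 - specificity)
    se_log = sqrt(1 / a - 1 / (a + c) + 1 / b - 1 / (b + d))
    return lr_pos, exp(log(lr_pos) - z * se_log), exp(log(lr_pos) + z * se_log)

for label, cells in [("N = 200", (80, 30, 20, 70)), ("N = 20", (8, 3, 2, 7))]:
    lr, low, high = lr_pos_ci(*cells)
    print(f"{label}: LR+ = {lr:.2f}, 95% CI {low:.2f}-{high:.2f}")
```

Both runs give the same point estimate, but the interval for the smaller sample is several times wider.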
CAT-ing evidence on a diagnostic
indicator (Sackett et al., 2000; Battaglia et al., 2002)

Does the study report a comparison
between measures, or measure and gold
standard?
– sine qua non for evidence of diagnostic
accuracy

Was the gold (or reference) standard
valid, reliable, and/or reasonable?
– Gold standard and new indicator also must
be independent to avoid incorporation bias
that can inflate accuracy measures
Criteria for diagnostic
indicators (cont.)
Were patients enrolled prospectively
and consecutively (or by random
assignment), and
 Did the sample include a spectrum of
patient types and severities?

– These two criteria are important in
avoiding spectrum bias, in which the
sample includes only clear-cut or handpicked cases and thus does not
represent the diagnostic task
Criteria for diagnostic
indicators (cont.)



Were the new measure and the reference
standard administered independently, by
different examiners, and
Were the examiners blinded to the
subject’s performance on the other test
and to other relevant subject information?
Were the new measure and the reference
standard both administered to all subjects
and controls?
– Important to avoid differential verification bias,
when controls are assumed to be normal
without testing on gold standard
Criteria for diagnostic
indicators (cont.)

Do likelihood ratios suggest adequate
diagnostic accuracy?
– LR+ > 4.0 (> 10 cf. Bayes Library, 2002)
– LR- < 0.40 (< 0.20, cf. Bayes Library, 2002)
Precision (narrow confidence intervals)?
 Feasibility for usual clinical practice?
 Value (i.e., better than current
measure)?
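
The likelihood ratio cutoffs above can be applied mechanically. A minimal sketch, with a hypothetical function name and the cutoffs exposed as parameters so the stricter Bayes Library values can be substituted:

```python
def meets_lr_criteria(lr_pos, lr_neg, lr_pos_min=4.0, lr_neg_max=0.40):
    """Return (adequate for ruling in, adequate for ruling out)."""
    return lr_pos > lr_pos_min, lr_neg < lr_neg_max

# Worked example from the earlier slides: LR+ = 2.67, LR- = 0.29.
print(meets_lr_criteria(2.67, 0.29))  # (False, True)

# Stricter cutoffs attributed to Bayes Library (2002): LR+ > 10, LR- < 0.20.
print(meets_lr_criteria(2.67, 0.29, lr_pos_min=10.0, lr_neg_max=0.20))  # (False, False)
```

A check like this covers only the accuracy criterion. Precision, feasibility, and value still need separate appraisal.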

Evidence on norm-referenced tests
as diagnostic indicators for early LI



Many norm-referenced tests have diagnosis
of LI as their explicit purpose
A growing number of tests meet typical
psychometric criteria, e.g. N = 100 subjects
per age level; reliability > .90; means,
standard deviations, and standard errors of
measurement
But very few provide evidence of diagnostic
accuracy or precision, and none meet the
recommended critical appraisal criteria
Norm-referenced tests not
providing information on accuracy
or precision





Test of Language Development (TOLD)
Sequenced Inventory of Communication
Development (SICD)
Test of Early Language Development
(TELD)
Reynell Scales
MacArthur Communicative Development
Inventories (CDI)
A few tests provide information allowing
accuracy and precision to be calculated
Age   LI   LN   LR+ (95% CI)       LR- (95% CI)
PLS-4 Total language score < 85
3     24   24   6.7 (2.6-19.4)     0.19 (.08-.42)
4     23   23   18 (3.6-102)       0.23 (.10-.44)
5     28   28   4.4 (2.1-10.2)     0.26 (.12-.50)
3-5   75   75   6.7 (3.7-12.5)     0.23 (.14-.35)
CELF-P Total language score < 85
3-5   80   80   5.3 (2.9-10.2)     0.45 (.34-.58)
CELF-P Total language score < 77
3-5   80   80   12.7 (4.4-37.8)    0.54 (.43-.66)
But note that these studies would fail many of the other
critical appraisal criteria, their accuracy notwithstanding.
The situation is no better for other
proposed diagnostic indicators




Few compare indicator to a gold
standard, so accuracy can’t be
determined
Few used blinded examiners, so a high
potential for context and other biases
Small samples, wide CIs (rarely provided)
When sensitivity and specificity have
been reported, they have sometimes
been calculated incorrectly and/or
misinterpreted
I choose not to despair
Knowing the limitations of our
diagnostic tools is an important
prerequisite to designing better
diagnostic tools
 Several possible ways forward,
most involving clinician-researcher
partnerships

A way forward to EBP in Speech-language pathology and Audiology

Designing studies to meet the criteria for
strong evidence
– e.g., STARD (Bossuyt et al., 2003) statement

Large-scale, cooperative studies of
diagnostic indicators
– CARE-COAD model (Straus et al. 2002)

Dealing with the absence of a gold
standard
– e.g., Demissie et al., 1998; Dunson, 2001;
reliability and outcome studies

Diagnostic studies as multivariable,
prediction research (Moons & Grobbee,
2002)
Test yourself
Critical appraisal of diagnostic test
(handout p. 5)
 Critical appraisal of treatment study
(handout p. 4)

Critical appraisal and CAT enable
the remaining steps to EBP
5. Decide whether the evidence is
strong enough to influence your
clinical practice
6. Integrate the evidence with the
“intangibles”
7. Update!
EBP is itself a set of
assumptions, not a cult
Ultimately, strong evidence will be
needed to determine whether EBP
results in improved clinical service.
 And EBP can’t be applied blindly, to
all kinds of problems...

As with many interventions intended to
prevent ill health, the effectiveness of
parachutes has not been subjected to
rigorous evaluation by using randomised
controlled trials. Advocates of evidence
based medicine have criticised the adoption
of interventions evaluated by using only
observational data. We think that everyone
might benefit if the most radical
protagonists of evidence based medicine
organised and participated in a double
blind, randomised, placebo controlled,
crossover trial of the parachute.
Thanks!