Chapter 30
Measuring Agreement Beyond Chance
Andrew Jull, Alba DiCenso, and Gordon Guyatt
The following Editorial Board members also made substantive contributions to this
chapter: Rien de Vos, Teresa Icart Isern, and Mary van Soeren.
We gratefully acknowledge the work of Thomas McGinn, Richard Cook, and
Maureen Meade on the original chapter that appears in the Users’ Guides to the
Medical Literature, edited by Guyatt and Rennie.
In This Chapter
Clinicians Often Disagree
Chance Will Always Be Responsible for Some of the Apparent Agreement Between Observers
Alternatives for Dealing With the Problem of Agreement by Chance
One Solution to Agreement by Chance: Chance-Corrected Agreement or Kappa
Calculating Kappa
Kappa With Three or More Raters or Three or More Categories
A Limitation of Kappa
An Alternative to Kappa: Chance-Independent Agreement or Phi
Advantages of Phi Over Other Approaches
CLINICIANS OFTEN DISAGREE
Clinicians often disagree in their assessments of patients, be it in physical examinations
or interpretation of diagnostic tests. Disagreement between two clinicians about the
presence of a particular physical sign in a patient (e.g., elevated blood pressure) may be
the result of different approaches to the examination or different interpretations of the
findings. Similarly, disagreement among repeated examinations of the same patient by
the same clinician may result from inconsistencies in the way the examination is
conducted or different interpretations of the findings with each examination.
Researchers may also have difficulties agreeing on whether patients meet the eligibility requirements for a clinical trial (e.g., whether a patient has severe pain), whether
patients in a randomized trial have experienced an outcome of interest (e.g., whether a
patient has improved functional ability), or whether a study meets the eligibility criteria for a systematic review.
Agreement between two observers is often called interobserver (or interrater) agreement, and agreement within the same observer is referred to as intraobserver (or
intrarater) agreement.
CHANCE WILL ALWAYS BE RESPONSIBLE FOR SOME
OF THE APPARENT AGREEMENT BETWEEN OBSERVERS
Any two people judging the presence or absence of an attribute will agree some of the time
by chance. This means that even if the people making the assessment are doing so by guessing in a completely random way, their random guesses will agree some of the time. When
investigators present agreement as raw agreement—that is, by simply counting the number
of times agreement has occurred—this chance agreement gives a misleading impression.
ALTERNATIVES FOR DEALING WITH THE PROBLEM
OF AGREEMENT BY CHANCE
In this chapter, we describe approaches developed by statisticians to separate chance agreement from nonrandom agreement (or agreement over and above chance). When we are
dealing with categorical data (i.e., placing patients in discrete categories such as mild or
severe pain, a present or absent sign, or a positive or negative test), the most popular
approach is chance-corrected agreement. Chance-corrected agreement is quantified as
kappa. Kappa is a statistic used to measure nonrandom agreement among observers, investigators, or measurements.
ONE SOLUTION TO AGREEMENT BY CHANCE:
CHANCE-CORRECTED AGREEMENT OR KAPPA
Conceptually, kappa removes agreement by chance and informs clinicians about the
extent of possible agreement over and above chance. If raters agree on every judgment,
the total possible agreement is 1.0 (also expressed as 100%). Figure 30-1 depicts a situation
in which agreement by chance is 50%, leaving possible agreement above and beyond chance of 50%. As depicted in this figure, the raters have achieved an agreement of 75%. Of this 75%, 50% was achieved by chance alone. Of the remaining possible 50% agreement, the raters have achieved half, resulting in a kappa value of 0.25/0.50, or 0.50 (or 50%).

Potential agreement: 100%. Agreement expected by chance alone: 50%. Observed agreement: 75%.
Kappa = (observed agreement − agreement expected by chance alone) / (total potential agreement − agreement expected by chance alone) = (75% − 50%) / (100% − 50%) = 25% / 50% = 50% (moderate agreement)
Figure 30-1. Kappa.
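To make the arithmetic in Figure 30-1 concrete, here is a minimal Python sketch (ours, not part of the chapter; the function name is illustrative only).

def kappa_from_agreement(observed, chance):
    # Kappa = (observed agreement - chance agreement) / (total potential agreement [1.0] - chance agreement)
    return (observed - chance) / (1.0 - chance)

# Figure 30-1: observed agreement 75%, agreement expected by chance alone 50%
print(kappa_from_agreement(0.75, 0.50))  # 0.5, i.e., moderate agreement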
CALCULATING KAPPA
Kappa is calculated by following three steps: (1) calculating observed agreement, (2) calculating agreement expected by chance alone, and (3) calculating agreement beyond chance.
Assume that two observers are assessing the presence of respiratory wheeze. However, they
have no skill in auscultating the chest, and their evaluations are no better than blind
guesses. Let us say that they are both guessing in a ratio of 50:50; they guess that wheezing
is present in half of the patients and that it is absent in half of the patients. On average, if
both raters were evaluating the same 100 patients, they would achieve the results presented
in Figure 30-2. Referring to this figure, you observe that both raters detected a wheeze in 25
patients (cell A) and neither of the raters detected a wheeze in 25 patients (cell D). By adding
these two cells and dividing by the total number of patients (or observations) (T), we
calculate observed agreement to be (25 + 25)/100. This represents an agreement of 50%
(i.e., they correctly guessed the presence or absence of a wheeze in 50 of 100 patients). Thus,
simply by guessing (and thus by chance), the raters achieved 50% agreement.
                          Observer 2 (+)    Observer 2 (−)    Total
Observer 1 (+)            25 (A)            25 (B)            50 (E)
Observer 1 (−)            25 (C)            25 (D)            50 (F)
Total                     50 (G)            50 (H)            100 (T)

Observed agreement = (A + D)/T = (25 + 25)/100 = 50/100 = 50% (or 0.5)
Agreement expected by chance alone for A = (E × G)/T² = (50 × 50)/100² = 2500/10000 = 25% (or 0.25)
Agreement expected by chance alone for D = (F × H)/T² = (50 × 50)/100² = 2500/10000 = 25% (or 0.25)
Total agreement expected by chance alone = 25% + 25% = 50% (or 0.5)

Figure 30-2. Agreement by chance when both reviewers are guessing in a ratio of 50% target positive and 50% target negative. A, Patients in whom both observers detect respiratory wheeze. B, Patients in whom observer 1 detects a respiratory wheeze, and observer 2 does not detect a wheeze. C, Patients in whom observer 1 does not detect a wheeze, and observer 2 detects a wheeze. D, Patients in whom both observers do not detect a wheeze. E, Patients in whom observer 1 detects a wheeze. F, Patients in whom observer 1 does not detect a wheeze. G, Patients in whom observer 2 detects a wheeze. H, Patients in whom observer 2 does not detect a wheeze. T, Total number of patients.
Having determined observed agreement, we go on to calculate agreement expected
by chance alone, which involves calculating the number of observations we
expect by chance to fall in cells A and D. We do this by multiplying the corresponding
marginal totals (E, F, G, H) and dividing by the square of the total number of observations
(T). To calculate how many observations we expect by chance to fall in cell A, we multiply E times G and divide this number by T²: (50 × 50)/100² = 0.25. Similarly, to calculate the number of observations we expect in cell D, we multiply F times H and divide by T²: (50 × 50)/100² = 0.25. Total chance agreement is therefore 0.25 + 0.25 = 0.50, or 50%.
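As a minimal illustration of these two steps, the following Python sketch (ours, not from the chapter; function names are illustrative) computes observed agreement and agreement expected by chance from the four cells of a 2 × 2 table and reproduces the values in Figures 30-2 and 30-3.

def observed_agreement(a, b, c, d):
    # Raw agreement: (A + D)/T
    return (a + d) / (a + b + c + d)

def chance_agreement(a, b, c, d):
    # Agreement expected by chance alone: (E x G)/T^2 + (F x H)/T^2,
    # where E, F are observer 1's marginal totals and G, H are observer 2's
    t = a + b + c + d
    e, f = a + b, c + d   # observer 1: positive, negative totals
    g, h = a + c, b + d   # observer 2: positive, negative totals
    return (e * g + f * h) / t**2

# Figure 30-2: both observers guessing 50:50
print(observed_agreement(25, 25, 25, 25), chance_agreement(25, 25, 25, 25))  # 0.5 0.5
# Figure 30-3: both observers guessing 80:20
print(observed_agreement(64, 16, 16, 4), chance_agreement(64, 16, 16, 4))    # 0.68 0.68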
                          Observer 2 (+)    Observer 2 (−)    Total
Observer 1 (+)            64 (A)            16 (B)            80 (E)
Observer 1 (−)            16 (C)            4 (D)             20 (F)
Total                     80 (G)            20 (H)            100 (T)

Observed agreement = (A + D)/T = (64 + 4)/100 = 68/100 = 68% (or 0.68)
Agreement expected by chance alone for A = (E × G)/T² = (80 × 80)/100² = 6400/10000 = 64% (or 0.64)
Agreement expected by chance alone for D = (F × H)/T² = (20 × 20)/100² = 400/10000 = 4% (or 0.04)
Total agreement expected by chance alone = 64% + 4% = 68% (or 0.68)

Figure 30-3. Agreement by chance when both reviewers are guessing in a ratio of 80% target positive and 20% target negative. +, Target positive; −, target negative. In this case, + is respiratory wheeze present and − is respiratory wheeze absent.
What happens if the raters repeat the exercise of rating 100 patients, but this time each
guesses in a ratio of 80% positive and 20% negative? Figure 30-3 depicts what, on average,
will occur. The observed agreement (cells A and D) has increased to (64 + 4)/100 or 68%
and total chance agreement has increased to 68%.
As two observers classify an increasing proportion of patients in one category or the
other (e.g., positive or negative; sign present or absent), agreement by chance increases as
shown in Table 30-1. Once we have determined observed agreement and agreement
expected by chance alone, we are ready to calculate kappa (or agreement beyond chance).
Table 30-1. Relationship Between the Proportion Positive and the Expected Agreement by Chance

Proportion Positive (E/T = G/T)*        Agreement by Chance
0.5                                     0.50 (50%)
0.6                                     0.52 (52%)
0.7                                     0.58 (58%)
0.8                                     0.68 (68%)
0.9                                     0.82 (82%)

*E/T and G/T refer to letters in Figures 30-2 and 30-3; E/T, the proportion of patients observer 1 finds positive; G/T, the proportion of patients observer 2 finds positive.

Figure 30-4 illustrates the calculation of kappa with a hypothetical data set. First, we calculate the agreement observed. The two observers agreed that respiratory wheeze was present in 41 patients (cell A) and was absent in 80 patients (cell D). Thus, total observed agreement is (41 + 80)/200 = 0.605, or 60.5%. Next, we calculate agreement expected by chance in cell A by multiplying the marginal totals E and G and dividing by T² [(60 × 101)/200² = 0.15], and we calculate agreement expected by chance in cell D by multiplying the marginal totals F and H and dividing by T² [(140 × 99)/200² = 0.35]. Total chance agreement is 0.15 + 0.35 = 0.50, or 50%. We can then calculate kappa using the principle illustrated in Figure 30-1:

Kappa = (observed agreement − agreement by chance) / (agreement possible [100%] − agreement by chance)

or, in this case:

Kappa = (0.605 − 0.50) / (1.0 − 0.50) = 0.105 ÷ 0.50 = 0.21, or 21%

                          Observer 2 (+)    Observer 2 (−)    Total
Observer 1 (+)            41 (A)            19 (B)            60 (E)
Observer 1 (−)            60 (C)            80 (D)            140 (F)
Total                     101 (G)           99 (H)            200 (T)

Observed agreement = (A + D)/T = (41 + 80)/200 = 121/200 = 60.5% (or 0.605)
Agreement expected by chance alone for A = (E × G)/T² = (60 × 101)/200² = 6060/40000 = 15% (or 0.15)
Agreement expected by chance alone for D = (F × H)/T² = (140 × 99)/200² = 13860/40000 = 35% (or 0.35)
Total agreement expected by chance alone = 15% + 35% = 50% (or 0.5)
Kappa (κ) = (observed agreement − agreement expected by chance alone) / (total potential agreement − agreement expected by chance alone) = (60.5% − 50%) / (100% − 50%) = 10.5% / 50% = 21% (or 0.21)

Figure 30-4. Observed and expected agreement. +, Target positive; −, target negative. In this case, + is respiratory wheeze present and − is respiratory wheeze absent. Expected agreement by chance for cells A and D appears in the calculations beneath the table.
To summarize, observed agreement was 60.5%. Agreement of 50% would be expected to occur by chance, resulting in a kappa, or agreement over and above chance, of 21%. Although there are numerous approaches to valuing the kappa levels achieved by raters, a widely accepted interpretation is the following: less than 0 = poor agreement; 0 to 0.2 = slight agreement; 0.2 to 0.4 = fair agreement; 0.4 to 0.6 = moderate agreement; 0.6 to 0.8 = substantial agreement; and 0.8 to 1.0 = almost perfect agreement.1
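As a check on this worked example, a short Python sketch (ours; the function name is illustrative) computes kappa directly from the four cell counts in Figure 30-4.

def kappa_2x2(a, b, c, d):
    # a = both positive, b = observer 1 +/observer 2 -, c = observer 1 -/observer 2 +, d = both negative
    t = a + b + c + d
    observed = (a + d) / t
    chance = ((a + b) * (a + c) + (c + d) * (b + d)) / t**2
    return (observed - chance) / (1 - chance)

# Figure 30-4: A = 41, B = 19, C = 60, D = 80
print(round(kappa_2x2(41, 19, 60, 80), 2))  # 0.21 -- fair agreement on the scale above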
Examples of chance-corrected agreement that investigators have calculated in clinical
studies include exercise stress test cardiac T-wave changes, kappa = 0.25;2 jugular venous
distention, kappa = 0.50;3 presence or absence of goiter, kappa = 0.82 to 0.95;4,5 and
straight-leg raising for diagnosis of low back pain, kappa = 0.82.6
KAPPA WITH THREE OR MORE RATERS OR THREE
OR MORE CATEGORIES
Using similar principles, one can calculate chance-corrected agreement when there are
more than two raters.7 Furthermore, one can calculate kappa when raters place patients
into more than two categories (e.g., patients with heart failure may be rated as New York
Heart Association class I, II, III, or IV). In these situations, one may give partial credit
for intermediate levels of agreement (e.g., one observer may classify a patient as class II,
whereas another may observe the same patient as class III) by adopting a so-called
weighted kappa statistic (weighted because full agreement receives full credit, and partial agreement receives partial credit).8
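The chapter does not give the weighted kappa formula itself; the following Python sketch is ours, assuming the common linear weighting scheme and purely hypothetical counts of paired New York Heart Association class ratings.

def weighted_kappa(table):
    # table[i][j] = number of patients observer 1 places in category i and observer 2 in category j
    k = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    obs_disagree = 0.0   # weighted observed disagreement
    exp_disagree = 0.0   # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            w = abs(i - j) / (k - 1)   # linear weight: partial credit for near-misses
            obs_disagree += w * table[i][j] / n
            exp_disagree += w * (row_tot[i] / n) * (col_tot[j] / n)
    return 1 - obs_disagree / exp_disagree

# Hypothetical 4 x 4 table of NYHA class I-IV ratings (illustrative numbers only)
ratings = [
    [20,  5,  1,  0],
    [ 4, 15,  6,  1],
    [ 1,  5, 18,  4],
    [ 0,  1,  3, 16],
]
print(round(weighted_kappa(ratings), 2))  # about 0.71 for these made-up counts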
A LIMITATION OF KAPPA
Kappa provides an accurate measurement of agreement beyond chance when the marginal
frequencies are similar to one another (e.g., 50% positive observations and 50% negative
observations). When the distribution of results is extreme (e.g., 90% positive observations and 10% negative observations), kappa will be low unless agreement is perfect. If two observers believe that the prevalence of a clinical entity of interest (e.g.,
respiratory wheeze) is high or low in a given population, and both make a high or low
proportion of positive ratings, raw agreement or agreement by chance will be high, even if
the raters are just guessing. For this reason, when the proportion of positive ratings is
extreme (i.e., when the marginal frequencies are highly skewed), possible agreement beyond chance is small, and it is difficult to achieve even moderate values of kappa.9–11
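A small numeric illustration of this limitation, using hypothetical counts of our own: suppose both observers call roughly 90% of 100 patients positive.

# Hypothetical counts: a = both +, b = observer 1 +/observer 2 -, c = observer 1 -/observer 2 +, d = both -
a, b, c, d = 82, 8, 8, 2
t = a + b + c + d
observed = (a + d) / t                                    # 0.84: raw agreement looks impressive
chance = ((a + b) * (a + c) + (c + d) * (b + d)) / t**2   # 0.82: but chance agreement is nearly as high
kappa = (observed - chance) / (1 - chance)
print(round(observed, 2), round(chance, 2), round(kappa, 2))  # 0.84 0.82 0.11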
AN ALTERNATIVE TO KAPPA: CHANCE-INDEPENDENT
AGREEMENT OR PHI
One solution to this problem is chance-independent agreement using the phi statistic,
which is a relatively new approach to assessing observer agreement.12 One begins by estimating the odds ratio from a 2 × 2 table displaying the agreement between two
observers. Figure 30-5 contrasts the formulas for raw agreement, kappa, and phi.
The odds ratio (OR = ad/bc in Figure 30-5) provides the basis for calculating phi.
The odds ratio is simply the odds of a positive classification by observer 2 when observer 1
gives a positive classification, divided by the odds of a positive classification by observer 2
when observer 1 gives a negative classification (see Chapter 27, Measures of Association).
The odds ratio would not change if we were to reverse the rows and columns. Thus, it does
not matter which observer we identify as observer 1 and which we identify as observer 2.
The odds ratio provides a natural measure of agreement. Interpretation of this agreement
can be simplified by converting it to a form that takes values from −1.0 (representing
extreme disagreement) to 1.0 (representing extreme agreement).
The phi statistic makes this conversion using the following formula:

Phi = (√OR − 1) / (√OR + 1)
Raw agreement = (a + d) / (a + b + c + d)

Kappa = (observed agreement − expected agreement) / (1 − expected agreement), where observed agreement = (a + d) / (a + b + c + d) and expected agreement = [(a + b)(a + c) + (c + d)(b + d)] / (a + b + c + d)²

Odds ratio (OR) = ad / bc

Phi = (√OR − 1) / (√OR + 1) = (√(ad) − √(bc)) / (√(ad) + √(bc))

Figure 30-5. Calculations of agreement. a, b, c, and d, the four cells of a 2 × 2 table.
When both margins are 0.5 (that is, when both raters conclude that 50% of the
patients are positive and 50% are negative for the trait of interest), phi is equal to kappa.
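A minimal Python sketch of phi (ours, written in the square-root form given above), with a check of the statement that phi equals kappa when both margins are 0.5:

from math import sqrt

def phi(a, b, c, d):
    # (sqrt(OR) - 1)/(sqrt(OR) + 1) with OR = ad/bc, which simplifies to (sqrt(ad) - sqrt(bc))/(sqrt(ad) + sqrt(bc))
    return (sqrt(a * d) - sqrt(b * c)) / (sqrt(a * d) + sqrt(b * c))

def kappa(a, b, c, d):
    t = a + b + c + d
    observed = (a + d) / t
    chance = ((a + b) * (a + c) + (c + d) * (b + d)) / t**2
    return (observed - chance) / (1 - chance)

# Both observers rate 50% of patients positive (both margins 0.5)
print(round(phi(40, 10, 10, 40), 2), round(kappa(40, 10, 10, 40), 2))  # 0.6 0.6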
ADVANTAGES OF PHI OVER OTHER APPROACHES
The use of phi has four important advantages over other approaches. First, it is independent of the level of chance agreement. Thus, investigators could expect to find similar
levels of phi if the distribution of results is 50% positive and 50% negative or if it is 90%
positive and 10% negative. This is not true for measures of the kappa statistic, a chance-corrected index of agreement (a numeric comparison appears after the fourth advantage below).
Second, phi allows statistical modeling approaches that the kappa statistic does not.
For instance, such flexibility allows investigators to take advantage of all ratings when
observers assess patients on multiple occasions.12 Third, phi allows testing of whether
differences in agreement between pairings of raters are statistically significant, an option
that is not available with kappa.12 Fourth, because phi is based on the odds ratio, one
can carry out exact analyses. This feature is particularly attractive when the sample is
small or when there is a zero cell in the chart.13
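To illustrate the first of these advantages, the sketch below (ours, with hypothetical counts) compares two tables that share the same odds ratio of 16: one with balanced margins and one in which almost all ratings are positive. Phi is unchanged, whereas kappa falls sharply.

from math import sqrt

def phi(a, b, c, d):
    return (sqrt(a * d) - sqrt(b * c)) / (sqrt(a * d) + sqrt(b * c))

def kappa(a, b, c, d):
    t = a + b + c + d
    observed = (a + d) / t
    chance = ((a + b) * (a + c) + (c + d) * (b + d)) / t**2
    return (observed - chance) / (1 - chance)

# Balanced margins: OR = (40 x 40)/(10 x 10) = 16
print(round(phi(40, 10, 10, 40), 2), round(kappa(40, 10, 10, 40), 2))  # 0.6 0.6
# Extreme margins: OR = (96 x 1)/(2 x 3) = 16
print(round(phi(96, 2, 3, 1), 2), round(kappa(96, 2, 3, 1), 2))        # 0.6 0.26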
Statisticians may disagree about the relative usefulness of kappa and phi. The key
point for clinicians reading studies that report measures of agreement is that investigators should not mislead readers by presenting only raw agreement.
REFERENCES
1. Maclure M, Willett WC. Misinterpretation and misuse of the kappa statistic. Am J Epidemiol. 1987;126:
161-169.
2. Blackburn H. The exercise electrocardiogram: differences in interpretation. Am J Cardiol. 1968;21:871.
3. Cook DJ. Clinical assessment of central venous pressure in the critically ill. Am J Med Sci. 1990;299:
175-178.
4. Kilpatrick R, Milne JS, Rushbrooke M, Wilson ESB. A survey of thyroid enlargement in two general practices in Great Britain. BMJ. 1963;1:29-34.
5. Trotter WR, Cochrane AL, Benjamin IT, Mial WE, Exley D. A goitre survey in the Vale of Glamorgan.
Br J Prev Soc Med. 1962;16:16-21.
6. McCombe PF, Fairbank JC, Cockersole BC, Pynsent PB. 1989 Volvo Award in clinical sciences.
Reproducibility of physical signs in low-back pain. Spine. 1989;14:908-918.
7. Cohen J. Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial
credit. Psychol Bull. 1968;70:213-220.
8. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33:
159-174.
9. Thompson WD, Walter SD. A reappraisal of the kappa coefficient. J Clin Epidemiol. 1988;41:949-958.
10. Feinstein AR, Cicchetti DV. High agreement but low kappa, I: the problems of two paradoxes. J Clin
Epidemiol. 1990;43:543-549.
11. Cook RJ, Farewell VT. Conditional inference for subject-specific and marginal agreement: two families
of agreement measures. Can J Stat. 1995;23:333-344.
12. Meade MO, Cook RJ, Guyatt GH, et al. Interobserver variation in interpreting chest radiographs for the
diagnosis of acute respiratory distress syndrome. Am J Respir Crit Care Med. 2000;161:85-90.
13. Armitage P, Colton T, eds. Encyclopedia of Biostatistics. Chichester, UK: John Wiley & Sons; 1998.