Chapter 30
Measuring Agreement Beyond Chance
Andrew Jull, Alba DiCenso, and Gordon Guyatt
The following Editorial Board members also made substantive contributions to this
chapter: Rien de Vos, Teresa Icart Isern, and Mary van Soeren.
We gratefully acknowledge the work of Thomas McGinn, Richard Cook, and
Maureen Meade on the original chapter that appears in the Users’ Guides to the
Medical Literature, edited by Guyatt and Rennie.
In This Chapter
Clinicians Often Disagree
Chance Will Always Be Responsible for Some of the Apparent Agreement Between Observers
Alternatives for Dealing With the Problem of Agreement by Chance
One Solution to Agreement by Chance: Chance-Corrected Agreement or Kappa
Calculating Kappa
Kappa With Three or More Raters or Three or More Categories
A Limitation of Kappa
An Alternative to Kappa: Chance-Independent Agreement or Phi
Advantages of Phi Over Other Approaches
CLINICIANS OFTEN DISAGREE
Clinicians often disagree in their assessments of patients, be it in physical examinations
or interpretation of diagnostic tests. Disagreement between two clinicians about the
presence of a particular physical sign in a patient (e.g., elevated blood pressure) may be
the result of different approaches to the examination or different interpretations of the
findings. Similarly, disagreement among repeated examinations of the same patient by
the same clinician may result from inconsistencies in the way the examination is
conducted or different interpretations of the findings with each examination.
Researchers may also have difficulties agreeing on whether patients meet the eligibility requirements for a clinical trial (e.g., whether a patient has severe pain), whether
patients in a randomized trial have experienced an outcome of interest (e.g., whether a
patient has improved functional ability), or whether a study meets the eligibility criteria for a systematic review.
Agreement between two observers is often called interobserver (or interrater) agreement, and agreement within the same observer is referred to as intraobserver (or
intrarater) agreement.
CHANCE WILL ALWAYS BE RESPONSIBLE FOR SOME
OF THE APPARENT AGREEMENT BETWEEN OBSERVERS
Any two people judging the presence or absence of an attribute will agree some of the time
by chance. This means that even if the people making the assessment are doing so by guessing in a completely random way, their random guesses will agree some of the time. When
investigators present agreement as raw agreement—that is, by simply counting the number
of times agreement has occurred—this chance agreement gives a misleading impression.
ALTERNATIVES FOR DEALING WITH THE PROBLEM
OF AGREEMENT BY CHANCE
In this chapter, we describe approaches developed by statisticians to separate chance agreement from nonrandom agreement (or agreement over and above chance). When we are
dealing with categorical data (i.e., placing patients in discrete categories such as mild or
severe pain, a present or absent sign, or a positive or negative test), the most popular
approach is chance-corrected agreement. Chance-corrected agreement is quantified as
kappa. Kappa is a statistic used to measure nonrandom agreement among observers, investigators, or measurements.
ONE SOLUTION TO AGREEMENT BY CHANCE:
CHANCE-CORRECTED AGREEMENT OR KAPPA
Conceptually, kappa removes agreement by chance and informs clinicians about the
extent of possible agreement over and above chance. If raters agree on every judgment,
the total possible agreement is 1.0 (also expressed as 100%). Figure 30-1 depicts a situation
in which agreement by chance is 50%, leaving possible agreement above and beyond chance of 50%. As depicted in this figure, the raters have achieved an agreement of 75%. Of this 75%, 50% was achieved by chance alone. Of the remaining possible 50% agreement, the raters have achieved half, resulting in a kappa value of 0.25/0.50, or 0.50 (or 50%).

Potential agreement: 100%. Agreement expected by chance alone: 50%. Observed agreement: 75%.
Kappa = (observed agreement − agreement expected by chance alone) / (total potential agreement − agreement expected by chance alone) = (75% − 50%) / (100% − 50%) = 25% / 50% = 50% (moderate agreement)
Figure 30-1. Kappa.
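To make the arithmetic in Figure 30-1 concrete, here is a minimal Python sketch (ours, not part of the chapter; the function name is illustrative only).

def kappa_from_agreement(observed, chance):
    # Kappa = (observed agreement - chance agreement) / (total potential agreement [1.0] - chance agreement)
    return (observed - chance) / (1.0 - chance)

# Figure 30-1: observed agreement 75%, agreement expected by chance alone 50%
print(kappa_from_agreement(0.75, 0.50))  # 0.5, i.e., moderate agreement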
CALCULATING KAPPA
Kappa is calculated by following three steps: (1) calculating observed agreement, (2) calculating agreement expected by chance alone, and (3) calculating agreement beyond chance.
Assume that two observers are assessing the presence of respiratory wheeze. However, they
have no skill in auscultating the chest, and their evaluations are no better than blind
guesses. Let us say that they are both guessing in a ratio of 50:50; they guess that wheezing
is present in half of the patients and that it is absent in half of the patients. On average, if
both raters were evaluating the same 100 patients, they would achieve the results presented
in Figure 30-2. Referring to this figure, you observe that both raters detected a wheeze in 25
patients (cell A) and neither of the raters detected a wheeze in 25 patients (cell D). By adding
these two cells and dividing by the total number of patients (or observations) (T), we
calculate observed agreement to be (25 + 25)/100. This represents an agreement of 50%
(i.e., they correctly guessed the presence or absence of a wheeze in 50 of 100 patients). Thus,
simply by guessing (and thus by chance), the raters achieved 50% agreement.
                          Observer 2 (+)    Observer 2 (−)    Total
Observer 1 (+)            25 (A)            25 (B)            50 (E)
Observer 1 (−)            25 (C)            25 (D)            50 (F)
Total                     50 (G)            50 (H)            100 (T)

Observed agreement = (A + D)/T = (25 + 25)/100 = 50/100 = 50% (or 0.5)
Agreement expected by chance alone for A = (E × G)/T² = (50 × 50)/100² = 2500/10000 = 25% (or 0.25)
Agreement expected by chance alone for D = (F × H)/T² = (50 × 50)/100² = 2500/10000 = 25% (or 0.25)
Total agreement expected by chance alone = 25% + 25% = 50% (or 0.5)

Figure 30-2. Agreement by chance when both reviewers are guessing in a ratio of 50% target positive and 50% target negative. A, Patients in whom both observers detect respiratory wheeze. B, Patients in whom observer 1 detects a respiratory wheeze, and observer 2 does not detect a wheeze. C, Patients in whom observer 1 does not detect a wheeze, and observer 2 detects a wheeze. D, Patients in whom both observers do not detect a wheeze. E, Patients in whom observer 1 detects a wheeze. F, Patients in whom observer 1 does not detect a wheeze. G, Patients in whom observer 2 detects a wheeze. H, Patients in whom observer 2 does not detect a wheeze. T, Total number of patients.
Having determined observed agreement, we go on to calculate agreement expected
by chance alone, which involves calculating the number of observations we
expect by chance to fall in cells A and D. We do this by multiplying the corresponding
marginal totals (E, F, G, H) and dividing by the square of the total number of observations
(T). To calculate how many observations we expect by chance to fall in cell A, we multiply E times G and divide this number by T²: (50 × 50)/100² = 0.25. Similarly, to calculate the number of observations we expect in cell D, we multiply F times H and divide by T²: (50 × 50)/100² = 0.25. Total chance agreement is therefore 0.25 + 0.25 = 0.50, or 50%.
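As a minimal illustration of these two steps, the following Python sketch (ours, not from the chapter; function names are illustrative) computes observed agreement and agreement expected by chance from the four cells of a 2 × 2 table and reproduces the values in Figures 30-2 and 30-3.

def observed_agreement(a, b, c, d):
    # Raw agreement: (A + D)/T
    return (a + d) / (a + b + c + d)

def chance_agreement(a, b, c, d):
    # Agreement expected by chance alone: (E x G)/T^2 + (F x H)/T^2,
    # where E, F are observer 1's marginal totals and G, H are observer 2's
    t = a + b + c + d
    e, f = a + b, c + d   # observer 1: positive, negative totals
    g, h = a + c, b + d   # observer 2: positive, negative totals
    return (e * g + f * h) / t**2

# Figure 30-2: both observers guessing 50:50
print(observed_agreement(25, 25, 25, 25), chance_agreement(25, 25, 25, 25))  # 0.5 0.5
# Figure 30-3: both observers guessing 80:20
print(observed_agreement(64, 16, 16, 4), chance_agreement(64, 16, 16, 4))    # 0.68 0.68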
                          Observer 2 (+)    Observer 2 (−)    Total
Observer 1 (+)            64 (A)            16 (B)            80 (E)
Observer 1 (−)            16 (C)            4 (D)             20 (F)
Total                     80 (G)            20 (H)            100 (T)

Observed agreement = (A + D)/T = (64 + 4)/100 = 68/100 = 68% (or 0.68)
Agreement expected by chance alone for A = (E × G)/T² = (80 × 80)/100² = 6400/10000 = 64% (or 0.64)
Agreement expected by chance alone for D = (F × H)/T² = (20 × 20)/100² = 400/10000 = 4% (or 0.04)
Total agreement expected by chance alone = 64% + 4% = 68% (or 0.68)

Figure 30-3. Agreement by chance when both reviewers are guessing in a ratio of 80% target positive and 20% target negative. +, Target positive; −, target negative. In this case, + is respiratory wheeze present and − is respiratory wheeze absent.
What happens if the raters repeat the exercise of rating 100 patients, but this time each
guesses in a ratio of 80% positive and 20% negative? Figure 30-3 depicts what, on average,
will occur. The observed agreement (cells A and D) has increased to (64 + 4)/100 or 68%
and total chance agreement has increased to 68%.
As two observers classify an increasing proportion of patients in one category or the
other (e.g., positive or negative; sign present or absent), agreement by chance increases as
shown in Table 30-1. Once we have determined observed agreement and agreement
expected by chance alone, we are ready to calculate kappa (or agreement beyond chance).
Table 30-1. Relationship Between the Proportion Positive and the Expected Agreement by Chance

Proportion Positive (E/T = G/T)*        Agreement by Chance
0.5                                     0.50 (50%)
0.6                                     0.52 (52%)
0.7                                     0.58 (58%)
0.8                                     0.68 (68%)
0.9                                     0.82 (82%)

*E/T and G/T refer to letters in Figures 30-2 and 30-3; E/T, the proportion of patients observer 1 finds positive; G/T, the proportion of patients observer 2 finds positive.

Figure 30-4 illustrates the calculation of kappa with a hypothetical data set. First, we calculate the agreement observed. The two observers agreed that respiratory wheeze was present in 41 patients (cell A) and was absent in 80 patients (cell D). Thus, total observed agreement is (41 + 80)/200 = 0.605, or 60.5%. Next, we calculate agreement expected by chance in cell A by multiplying the marginal totals E and G and dividing by T² [(60 × 101)/200² = 0.15], and we calculate agreement expected by chance in cell D by multiplying the marginal totals F and H and dividing by T² [(140 × 99)/200² = 0.35]. Total chance agreement is 0.15 + 0.35 = 0.50, or 50%. We can then calculate kappa using the principle illustrated in Figure 30-1:

Kappa = (observed agreement − agreement by chance) / (agreement possible [100%] − agreement by chance)

or, in this case:

Kappa = (0.605 − 0.50) / (1.0 − 0.50) = 0.105 ÷ 0.50 = 0.21, or 21%

                          Observer 2 (+)    Observer 2 (−)    Total
Observer 1 (+)            41 (A)            19 (B)            60 (E)
Observer 1 (−)            60 (C)            80 (D)            140 (F)
Total                     101 (G)           99 (H)            200 (T)

Observed agreement = (A + D)/T = (41 + 80)/200 = 121/200 = 60.5% (or 0.605)
Agreement expected by chance alone for A = (E × G)/T² = (60 × 101)/200² = 6060/40000 = 15% (or 0.15)
Agreement expected by chance alone for D = (F × H)/T² = (140 × 99)/200² = 13860/40000 = 35% (or 0.35)
Total agreement expected by chance alone = 15% + 35% = 50% (or 0.5)
Kappa (κ) = (observed agreement − agreement expected by chance alone) / (total potential agreement − agreement expected by chance alone) = (60.5% − 50%) / (100% − 50%) = 10.5% / 50% = 21% (or 0.21)

Figure 30-4. Observed and expected agreement. +, Target positive; −, target negative. In this case, + is respiratory wheeze present and − is respiratory wheeze absent. Expected agreement by chance for cells A and D appears in the calculations beneath the table.
To summarize, observed agreement was 60.5%. Agreement of 50% would be expected to occur by chance, resulting in a kappa, or agreement over and above chance, of 21%. Although there are numerous approaches to valuing the kappa levels achieved by raters, a widely accepted interpretation is the following: less than 0 = poor agreement; 0 to 0.2 = slight agreement; 0.2 to 0.4 = fair agreement; 0.4 to 0.6 = moderate agreement; 0.6 to 0.8 = substantial agreement; and 0.8 to 1.0 = almost perfect agreement.1
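As a check on this worked example, a short Python sketch (ours; the function name is illustrative) computes kappa directly from the four cell counts in Figure 30-4.

def kappa_2x2(a, b, c, d):
    # a = both positive, b = observer 1 +/observer 2 -, c = observer 1 -/observer 2 +, d = both negative
    t = a + b + c + d
    observed = (a + d) / t
    chance = ((a + b) * (a + c) + (c + d) * (b + d)) / t**2
    return (observed - chance) / (1 - chance)

# Figure 30-4: A = 41, B = 19, C = 60, D = 80
print(round(kappa_2x2(41, 19, 60, 80), 2))  # 0.21 -- fair agreement on the scale above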
Examples of chance-corrected agreement that investigators have calculated in clinical
studies include exercise stress test cardiac T-wave changes, kappa = 0.25;2 jugular venous
distention, kappa = 0.50;3 presence or absence of goiter, kappa = 0.82 to 0.95;4,5 and
straight-leg raising for diagnosis of low back pain, kappa = 0.82.6
KAPPA WITH THREE OR MORE RATERS OR THREE
OR MORE CATEGORIES
Using similar principles, one can calculate chance-corrected agreement when there are
more than two raters.7 Furthermore, one can calculate kappa when raters place patients
into more than two categories (e.g., patients with heart failure may be rated as New York
Heart Association class I, II, III, or IV). In these situations, one may give partial credit
for intermediate levels of agreement (e.g., one observer may classify a patient as class II,
whereas another may observe the same patient as class III) by adopting a so-called
weighted kappa statistic (weighted because full agreement receives full credit, and partial agreement receives partial credit).8
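The chapter does not give the weighted kappa formula itself; the following Python sketch is ours, assuming the common linear weighting scheme and purely hypothetical counts of paired New York Heart Association class ratings.

def weighted_kappa(table):
    # table[i][j] = number of patients observer 1 places in category i and observer 2 in category j
    k = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    obs_disagree = 0.0   # weighted observed disagreement
    exp_disagree = 0.0   # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            w = abs(i - j) / (k - 1)   # linear weight: partial credit for near-misses
            obs_disagree += w * table[i][j] / n
            exp_disagree += w * (row_tot[i] / n) * (col_tot[j] / n)
    return 1 - obs_disagree / exp_disagree

# Hypothetical 4 x 4 table of NYHA class I-IV ratings (illustrative numbers only)
ratings = [
    [20,  5,  1,  0],
    [ 4, 15,  6,  1],
    [ 1,  5, 18,  4],
    [ 0,  1,  3, 16],
]
print(round(weighted_kappa(ratings), 2))  # about 0.71 for these made-up counts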
A LIMITATION OF KAPPA
Kappa provides an accurate measurement of agreement beyond chance when the marginal
frequencies are similar to one another (e.g., 50% positive observations and 50% negative
observations). When the distribution of results is extreme (e.g., 90% positive observations and 10% negative observations), kappa will be low unless agreement is perfect. If two observers believe that the prevalence of a clinical entity of interest (e.g.,
respiratory wheeze) is high or low in a given population, and both make a high or low
proportion of positive ratings, raw agreement or agreement by chance will be high, even if
the raters are just guessing. For this reason, when the proportion of positive ratings is
extreme (i.e., when the marginal frequencies are highly skewed), possible agreement beyond chance is small, and it is difficult to achieve even moderate values of kappa.9–11
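A small numeric illustration of this limitation, using hypothetical counts of our own: suppose both observers call roughly 90% of 100 patients positive.

# Hypothetical counts: a = both +, b = observer 1 +/observer 2 -, c = observer 1 -/observer 2 +, d = both -
a, b, c, d = 82, 8, 8, 2
t = a + b + c + d
observed = (a + d) / t                                    # 0.84: raw agreement looks impressive
chance = ((a + b) * (a + c) + (c + d) * (b + d)) / t**2   # 0.82: but chance agreement is nearly as high
kappa = (observed - chance) / (1 - chance)
print(round(observed, 2), round(chance, 2), round(kappa, 2))  # 0.84 0.82 0.11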
AN ALTERNATIVE TO KAPPA: CHANCE-INDEPENDENT
AGREEMENT OR PHI
One solution to this problem is chance-independent agreement using the phi statistic,
which is a relatively new approach to assessing observer agreement.12 One begins by estimating the odds ratio from a 2 × 2 table displaying the agreement between two
observers. Figure 30-5 contrasts the formulas for raw agreement, kappa, and phi.
The odds ratio (OR = ad/bc in Figure 30-5) provides the basis for calculating phi.
The odds ratio is simply the odds of a positive classification by observer 2 when observer 1
gives a positive classification, divided by the odds of a positive classification by observer 2
when observer 1 gives a negative classification (see Chapter 27, Measures of Association).
The odds ratio would not change if we were to reverse the rows and columns. Thus, it does
not matter which observer we identify as observer 1 and which we identify as observer 2.
The odds ratio provides a natural measure of agreement. Interpretation of this agreement
can be simplified by converting it to a form that takes values from −1.0 (representing
extreme disagreement) to 1.0 (representing extreme agreement).
The phi statistic makes this conversion using the following formula:

Phi = (√OR − 1) / (√OR + 1)
Raw agreement = (a + d) / (a + b + c + d)

Kappa = (observed agreement − expected agreement) / (1 − expected agreement), where observed agreement = (a + d) / (a + b + c + d) and expected agreement = [(a + b)(a + c) + (c + d)(b + d)] / (a + b + c + d)²

Odds ratio (OR) = ad / bc

Phi = (√OR − 1) / (√OR + 1) = (√(ad) − √(bc)) / (√(ad) + √(bc))

Figure 30-5. Calculations of agreement. a, b, c, and d, the four cells of a 2 × 2 table.
When both margins are 0.5 (that is, when both raters conclude that 50% of the
patients are positive and 50% are negative for the trait of interest), phi is equal to kappa.
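A minimal Python sketch of phi (ours, written in the square-root form given above), with a check of the statement that phi equals kappa when both margins are 0.5:

from math import sqrt

def phi(a, b, c, d):
    # (sqrt(OR) - 1)/(sqrt(OR) + 1) with OR = ad/bc, which simplifies to (sqrt(ad) - sqrt(bc))/(sqrt(ad) + sqrt(bc))
    return (sqrt(a * d) - sqrt(b * c)) / (sqrt(a * d) + sqrt(b * c))

def kappa(a, b, c, d):
    t = a + b + c + d
    observed = (a + d) / t
    chance = ((a + b) * (a + c) + (c + d) * (b + d)) / t**2
    return (observed - chance) / (1 - chance)

# Both observers rate 50% of patients positive (both margins 0.5)
print(round(phi(40, 10, 10, 40), 2), round(kappa(40, 10, 10, 40), 2))  # 0.6 0.6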
ADVANTAGES OF PHI OVER OTHER APPROACHES
The use of phi has four important advantages over other approaches. First, it is independent of the level of chance agreement. Thus, investigators could expect to find similar
levels of phi if the distribution of results is 50% positive and 50% negative or if it is 90%
positive and 10% negative. This is not true for measures of the kappa statistic, a chance-corrected index of agreement (a numeric comparison appears after the fourth advantage below).
Second, phi allows statistical modeling approaches that the kappa statistic does not.
For instance, such flexibility allows investigators to take advantage of all ratings when
observers assess patients on multiple occasions.12 Third, phi allows testing of whether
differences in agreement between pairings of raters are statistically significant, an option
that is not available with kappa.12 Fourth, because phi is based on the odds ratio, one
can carry out exact analyses. This feature is particularly attractive when the sample is
small or when there is a zero cell in the chart.13
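To illustrate the first of these advantages, the sketch below (ours, with hypothetical counts) compares two tables that share the same odds ratio of 16: one with balanced margins and one in which almost all ratings are positive. Phi is unchanged, whereas kappa falls sharply.

from math import sqrt

def phi(a, b, c, d):
    return (sqrt(a * d) - sqrt(b * c)) / (sqrt(a * d) + sqrt(b * c))

def kappa(a, b, c, d):
    t = a + b + c + d
    observed = (a + d) / t
    chance = ((a + b) * (a + c) + (c + d) * (b + d)) / t**2
    return (observed - chance) / (1 - chance)

# Balanced margins: OR = (40 x 40)/(10 x 10) = 16
print(round(phi(40, 10, 10, 40), 2), round(kappa(40, 10, 10, 40), 2))  # 0.6 0.6
# Extreme margins: OR = (96 x 1)/(2 x 3) = 16
print(round(phi(96, 2, 3, 1), 2), round(kappa(96, 2, 3, 1), 2))        # 0.6 0.26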
Statisticians may disagree about the relative usefulness of kappa and phi. The key
point for clinicians reading studies that report measures of agreement is that investigators should not mislead readers by presenting only raw agreement.
REFERENCES
1. Maclure M, Willett WC. Misinterpretation and misuse of the kappa statistic. Am J Epidemiol. 1987;126:
161-169.
2. Blackburn H. The exercise electrocardiogram: differences in interpretation. Am J Cardiol. 1968;21:871.
3. Cook DJ. Clinical assessment of central venous pressure in the critically ill. Am J Med Sci. 1990;299:
175-178.
4. Kilpatrick R, Milne JS, Rushbrooke M, Wilson ESB. A survey of thyroid enlargement in two general practices in Great Britain. BMJ. 1963;1:29-34.
5. Trotter WR, Cochrane AL, Benjamin IT, Mial WE, Exley D. A goitre survey in the Vale of Glamorgan.
Br J Prev Soc Med. 1962;16:16-21.
6. McCombe PF, Fairbank JC, Cockersole BC, Pynsent PB. 1989 Volvo Award in clinical sciences.
Reproducibility of physical signs in low-back pain. Spine. 1989;14:908-918.
7. Cohen J. Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial
credit. Psychol Bull. 1968;70:213-220.
8. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33:
159-174.
9. Thompson WD, Walter SD. A reappraisal of the kappa coefficient. J Clin Epidemiol. 1988;41:949-958.
10. Feinstein AR, Cicchetti DV. High agreement but low kappa, I: the problems of two paradoxes. J Clin
Epidemiol. 1990;43:543-549.
11. Cook RJ, Farewell VT. Conditional inference for subject-specific and marginal agreement: two families
of agreement measures. Can J Stat. 1995;23:333-344.
12. Meade MO, Cook RJ, Guyatt GH, et al. Interobserver variation in interpreting chest radiographs for the
diagnosis of acute respiratory distress syndrome. Am J Respir Crit Care Med. 2000;161:85-90.
13. Armitage P, Colton T, eds. Encyclopedia of Biostatistics. Chichester, UK: John Wiley & Sons; 1998.