Chapter 2-16. Validity and Reliability

Accuracy and Precision Analogy (Target Example)
Validity and reliability are analogous concepts to accuracy and precision. If you shoot a gun at a
target and always hit within the bull’s-eye and first surrounding ring in a random pattern, you
would say your shooting was accurate but not precise. If you always hit in a tight pattern just up
and to the right of the bull’s-eye, you would say your shooting was precise but not accurate. Only a
tight pattern inside the bull’s-eye is both accurate and precise.
A measuring instrument is said to be valid if it measures what it is intended to measure (hits the
region aimed at). It is said to be reliable if it provides measurements that are repeatable, giving
consistent measurements upon repeated applications (hits consistently inside any one region).
Validity (Nunnally, 1978, Chapter 3)
After a measuring instrument is constructed, it is necessary to inquire whether the instrument is
useful scientifically. “This is usually spoken of as determining the validity of an instrument.”
(Nunnally, p.86)
“Generally speaking, a measuring instrument is valid if it does what it is intended to do.”
Example: a yardstick does indeed measure length in the way that we define length.
(Nunnally, p.86)
“Strictly speaking, one validates not a measuring instrument but rather some use to which the
instrument is put.”
Example: a fifth grade spelling achievement test is valid for that purpose, but not
necessarily valid for predicting success in high school. (Nunnally, p.87)
There are three types of validity for measuring instruments: (1) predictive validity, (2) content
validity, and (3) construct validity. (Nunnally, p.87)
_______________________
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript] University of Utah
School of Medicine, 2010.
Predictive validity – “Predictive validity is at issue when the purpose is to use an instrument to
estimate some important form of behavior that is external of the measuring instrument itself, the
latter being referred to as the criterion.” (Nunnally, p. 87)
Example: A test to select first year college students is useful in that situation only if it
accurately estimates successful performance in college. (Nunnally, p.88)
The use of “predictive” could be referring to something future, current, or past.
(Nunnally, p. 88)
Predictive validity is demonstrated by correlating the instrument’s measurement with the
measure of the criterion variable. “The size of the correlation is a direct indication of the
amount of validity.” (Nunnally, p. 88)
Predictive validity is at issue when measuring instruments are employed in making
decisions, such as choosing among different drug regimens to treat some condition.
(Nunnally, p. 89)
“Predictive validity represents a very direct, simple issue in scientific generalization that
concerns the extent to which one can generalize from scores on one variable to scores on
another variable. The correlation between the predictor test and the criterion variable
specifies the degree of validity of that generalization.”
Content Validity – “For some instruments, validity depends primarily on the adequacy in which
a specified domain of content is sampled. A prime example would be a final examination for a
course in introductory psychology. Obviously, the test could not be validated in terms of
predictive validity, because the purpose of the test is not to predict something else but to directly
measure performance in a unit of instruction. The test must stand by itself as an adequate
measure of what it is supposed to measure. Validity cannot be determined by correlating the test
with a criterion, because the test itself is the criterion of performance.” (Nunnally, p. 91)
The two major standards for ensuring content validity are: (1) a representative collection of items
and (2) “sensible” methods of test construction. (Nunnally, p.92)
Example: achievement test in spelling for fourth-grade students. A representative
collection might be a random sampling of words from fourth-grade reading books. A
sensible method of test construction might be to put each correctly spelled word in with
three misspellings and require the student to circle the correct one. (Nunnally, p.92)
“...content validity rests mainly on appeals to reason regarding the adequacy with which
important content has been sampled and on the adequacy with which important content
has been cast in the form of test items.” (Nunnally, p.93)
“For example, at least a moderate level of internal consistency among the items within a
test would be expected; i.e., the items should tend to measure something in common.”
(Nunnally, p.93)
“Another type of evidence for content validity is obtained from correlating scores on
different tests purporting to measure much the same thing, e.g., two tests by different
commercial firms for the measurement of achievement in reading. It is comforting to find
high correlations in such instances, but this does not guarantee content validity. Both
tests may measure the same wrong things.” (Nunnally, p. 94)
“Although helpful hints are obtained from analyses of statistical findings, content validity
primarily rests upon an appeal to the propriety of content and the way that it is presented.”
(Nunnally, p. 94) [defn: propriety = appropriateness]
Construct Validity – “To the extent that a variable is abstract rather than concrete, we speak of
it as being a construct. Such a variable is literally a construct in that it is something that
scientists put together from their own imaginations, something that does not exist as an isolated,
observable dimension of behavior. A construct represents a hypothesis (usually only half-formed) that a variety of behaviors will correlate with one another in studies of individual
differences and/or will be similarly affected by experimental treatments.” (Nunnally, p.96)
Example: “Take, for example, an experiment where a particular treatment is
hypothesized to raise anxiety. Can the measure of anxiety be validated purely as a
predictor of some specific variable? No, it cannot, because the purpose is to measure the
amount of anxiety then and there, not to estimate scores on any other variable obtained in
the past, present, or future. Also, the measure cannot be validated purely in terms of
content validity. There is no obvious body of “content” (behaviors) corresponding to
anxiety reactions, and if there were, how to measure such content would be far more of a
puzzle than it is with performance in arithmetic.” (Nunnally, p.95)
There are three major aspects of construct validation: (1) specifying the domain of observables
related to the construct; (2) from empirical research and statistical analyses, determining the
extent to which the observables tend to measure the same thing, several different things, or many
different things; and (3) subsequently performing studies of individual differences and/or
controlled experiments to determine the extent to which supposed measures of the construct
produce results which are predictable from highly accepted theoretical hypotheses concerning the
construct.
Validity of Diagnostic Tests
“Two indices are used to evaluate the accuracy or validity of a diagnostic test—sensitivity and
specificity.” (Lilienfeld and Stolley, 1994, p. 118).
In computing the sensitivity and specificity of a new test, we generally compare it to a gold
standard, where the gold standard is accepted to measure what it is intended to measure (recall
Nunnally quote above, “a measuring instrument is valid if it does what it is intended to do”).
This type of validity is called convergent and divergent validity by McDowell and Newell (1996,
p. 33), which is a special case of using correlation to show what Nunnally called predictive
validity in the presentation above.
Convergent and Divergent Validity
McDowell and Newell (1996, p. 33) describe using correlation evidence for validity:
“Hypotheses are formulated which state that the measurement will correlate with other
methods that measure the same concept; the hypotheses are tested in the normal way.
This is known as a test of ‘convergent validity’ and is equivalent to assessing sensitivity.
Where no single criterion exists, the measurement is sometimes compared with several
other indices using multivariate procedures. Hypotheses may also state that the
measurement will not correlate with others which measure different themes. This is
termed divergent validity, and is equivalent to the concept of specificity. For example, a
test of ‘Type A behavior patterns’ may be expected to measure something distinct from
neurotic behavior. Accordingly, a low correlation would be hypothesized between the
Type A scale and a neuroticism index and, if obtained, would lend reassurance that the
test was not simply measuring neurotic behavior. Naturally, this provides little
information on what it does measure.”
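In practice, this correlation evidence can be assembled with ordinary correlations. Here is a minimal Stata sketch, using hypothetical variable names (typeA_scale, typeA_other, neuroticism) that stand in for the measures described in the quote:

* Convergent validity: expect a high correlation between two measures of the
* same construct. Divergent validity: expect a low correlation with a measure
* of a different construct. (Variable names are hypothetical.)
pwcorr typeA_scale typeA_other neuroticism , sig

A high typeA_scale-typeA_other correlation supports convergent validity, while a low typeA_scale-neuroticism correlation supports divergent validity, in the sense of the quote above.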
Exercise 1. Take out the Merlo paper.
1) In the section “Materials and Methods,” look at the paragraph under the subheading
“Validity analysis” (p. 789). This is a good example of sensitivity and specificity being
used as measurements of validity.
2) Find the second-to-last sentence of the Introduction section (p. 788), “By conducting a
validity analysis…”
That’s it. There is no mention of construct validity or content validity. Establishing
predictive validity is sufficient for this article. However, for articles presenting new
measuring instruments, construct and content validity should be discussed.
Sensitivity and Specificity
With the data in the required form for Stata:

                                    Gold Standard “true value”
                              disease present (+)    disease absent (-)      Total
Test “probable value”
   disease present (+)        a (true positives)     c (false positives)     a + c
   disease absent  (-)        b (false negatives)    d (true negatives)      b + d
   Total                      a + b                  c + d                   N
We define the following terminology (Lilienfeld, 1994, p. 118-124), expressed as percents:

sensitivity = (true positives)/(true positives plus false negatives)
            = (true positives)/(all those with the disease)
            = a / (a + b) × 100

specificity = (true negatives)/(true negatives plus false positives)
            = (true negatives)/(all those without the disease)
            = d / (c + d) × 100
Sensitivity and specificity provide information about the accuracy (validity) of a test. Positive
and negative predictive values provide information about the meaning of the test results.
The probability of disease being present given a positive test result is the positive predictive
value:

positive predictive value = (true positives)/(true positives plus false positives)
                          = (true positives)/(all those with a positive test result)
                          = a / (a + c) × 100

The probability of no disease being present given a negative test result is the negative predictive
value:

negative predictive value = (true negatives)/(true negatives plus false negatives)
                          = (true negatives)/(all those with a negative test result)
                          = d / (b + d) × 100
“Unlike sensitivity and specificity, the positive and negative predictive values of a test depend on
the prevalence rate of disease in the population. …For a test of given sensitivity and specificity,
the higher the prevalence of the disease, the greater the positive predictive value and the lower
the negative predictive value.” (Lilienfeld, 1994, p. 122-123)
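To see this prevalence dependence numerically, here is a minimal Stata sketch that computes the positive and negative predictive values from the formulas above for a hypothetical test with sensitivity 0.90 and specificity 0.95 at three prevalences (the numbers are illustrative, not taken from Lilienfeld):

* Hypothetical test: sensitivity = 0.90, specificity = 0.95
clear
input prev
.01
.10
.50
end
gen ppv = (.90*prev) / (.90*prev + (1-.95)*(1-prev))      // P(disease | test +)
gen npv = (.95*(1-prev)) / (.95*(1-prev) + (1-.90)*prev)  // P(no disease | test -)
list prev ppv npv , noobs

As prevalence rises from 1% to 50%, ppv climbs from roughly 15% to about 95%, while npv falls from nearly 100% to about 90%, consistent with the quote above.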
Stata Commands -- Diagnostic Test Statistics
To compute test characteristics (sensitivity, specificity, positive predictive value, and negative
predictive value, etc.) in Stata, the diagt command is used.
If the data are in Stata as variables, you use:
diagt goldvar testvar
Alternatively, you can enter the data directly as cell counts, using the “immediate” form of the
diagt command:
diagti #a #b #c #d
where a, b, c, and d correspond to the cell frequencies in the following table:

                                    Gold Standard “true value”
                              disease present (+)    disease absent (-)      Total
Test “probable value”
   disease present (+)        a (true positives)     c (false positives)     a + c
   disease absent  (-)        b (false negatives)    d (true negatives)      b + d
   Total                      a + b                  c + d                   N
Installing the diagt and diagti commands in Stata
The diagt and diagti commands must first be installed, since they are user contributed commands.
In the command window, run
findit diagt
SJ-4-4  sbe36_2 . . . . . . . . . . . . . . . . . . Software update for diagt
        (help diagt if installed) . . . . . . . . . . P. T. Seed and A. Tobias
        Q4/04   SJ 4(4):490
        new options added to diagt
then click on sbe36_2 to install.
Exercise 2. Replicate the validity analyses for estrogens found in the Merlo (2000) paper.
To practice using diagt with variables in Stata’s memory, first read in the 2 x 2 table data given in
Table 1 of the Merlo paper (these commands are in chapter12.do).
clear
input question diary count
1 0 231
0 1 164
1 1 969
0 0 14696
end
drop if count==0 // not needed here, but would be if any count was 0
expand count
drop count
Using the diary as the gold standard, and questionnaire as the test variable, compute the test
characteristics.
diagt diary question
           |       question       |
     diary |      Pos.      Neg.  |     Total
-----------+----------------------+----------
  Abnormal |       969       164  |     1,133
    Normal |       231    14,696  |    14,927
-----------+----------------------+----------
     Total |     1,200    14,860  |    16,060

True abnormal diagnosis defined as diary = 1

                                                  [95% Confidence Interval]
---------------------------------------------------------------------------
Prevalence                        Pr(A)     7.1%        6.7%        7.46%
---------------------------------------------------------------------------
Sensitivity                     Pr(+|A)    85.5%       83.3%        87.5%
Specificity                     Pr(-|N)    98.5%       98.2%        98.6%
ROC area              (Sens. + Spec.)/2      .92         .91          .93
---------------------------------------------------------------------------
Likelihood ratio (+)    Pr(+|A)/Pr(+|N)     55.3        48.5         62.9
Likelihood ratio (-)    Pr(-|A)/Pr(-|N)     .147        .128         .169
Odds ratio                  LR(+)/LR(-)      376         305          464
Positive predictive value       Pr(A|+)    80.8%       78.4%        82.9%
Negative predictive value       Pr(N|-)    98.9%       98.7%        99.1%
---------------------------------------------------------------------------
Alternatively, we could have skipped reading in the data and simply used the immediate form of
the command,
diagti 969 164 231 14696
      True |                      |
   disease |      Test result     |
    status |      Neg.      Pos.  |     Total
-----------+----------------------+----------
    Normal |    14,696       231  |    14,927
  Abnormal |       164       969  |     1,133
-----------+----------------------+----------
     Total |    14,860     1,200  |    16,060

                                                  [95% Confidence Interval]
---------------------------------------------------------------------------
Prevalence                        Pr(A)     7.1%        6.7%        7.46%
---------------------------------------------------------------------------
Sensitivity                     Pr(+|A)    85.5%       83.3%        87.5%
Specificity                     Pr(-|N)    98.5%       98.2%        98.6%
ROC area              (Sens. + Spec.)/2      .92         .91          .93
---------------------------------------------------------------------------
Likelihood ratio (+)    Pr(+|A)/Pr(+|N)     55.3        48.5         62.9
Likelihood ratio (-)    Pr(-|A)/Pr(-|N)     .147        .128         .169
Odds ratio                  LR(+)/LR(-)      376         305          464
Positive predictive value       Pr(A|+)    80.8%       78.4%        82.9%
Negative predictive value       Pr(N|-)    98.9%       98.7%        99.1%
---------------------------------------------------------------------------
Quirk: There is a quirk with the diagt and diagti commands. The table of test characteristics is
the same, but the outputted data table is in a different sort order. Notice the data table just
generated with the diagti command,
      True |                      |
   disease |      Test result     |
    status |      Neg.      Pos.  |     Total
-----------+----------------------+----------
    Normal |    14,696       231  |    14,927
  Abnormal |       164       969  |     1,133
-----------+----------------------+----------
     Total |    14,860     1,200  |    16,060     <- diagti
is not the order in which the data had to be provided to the diagti command. That makes it difficult to
verify that you, the user, provided the cell counts correctly. [Just remember that when the
sort order is switched for both the row and column variables, the cell counts simply move
diagonally to the opposite corners.] Whereas, the order entered does match the data table
displayed earlier by the diagt command (copied below for ease of comparison):
           |       question       |
     diary |      Pos.      Neg.  |     Total
-----------+----------------------+----------
  Abnormal |       969       164  |     1,133
    Normal |       231    14,696  |    14,927
-----------+----------------------+----------
     Total |     1,200    14,860  |    16,060     <- diagt
Reliability (Nunnally, 1978, Chapters 6,7)
“To the extent to which measurement error is slight, a measure is said to be reliable. Reliability
concerns the extent to which measurements are repeatable.” (Nunnally, p.191)
“...measurements are reliable to the extent that they are repeatable and that any random influence
which tends to make measurements different from occasion to occasion or circumstance to
circumstance is a source of measurement error.” (Nunnally, p.225)
Estimation of Reliability
Internal consistency –“Estimates of reliability based on the average correlation among items
within a test are said to concern the “internal consistency.” This is partly a misnomer, because
the size of the reliability coefficient is based on both the average correlation among items (the
internal consistency) and the number of items. Coefficient alpha is the basic formula for
determining the reliability based on internal consistency. It, or the special version applicable to
dichotomous items (KR-20), should be applied to all new measurement methods. Even if other
estimates of reliability should be made for particular instruments, coefficient alpha should be
obtained first.” (Nunnally, p. 229-230)
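In Stata, coefficient alpha can be obtained with the alpha command. A minimal sketch, assuming a hypothetical new scale whose items are stored in variables item1 through item10:

* Cronbach's coefficient alpha for a set of items (item1-item10 are
* hypothetical variable names)
alpha item1-item10 , item

The item option adds a line for each item showing its correlation with the rest of the scale and what alpha would be if that item were dropped.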
Alternative forms—“In addition to computing coefficient alpha, with most measures it also is
informative to correlate alternative forms.” (Nunnally, p.230)
Example: “An example would be constructing two vocabulary tests, each of which
would be an alternative form for the other.” (Nunnally, p.228)
Use of Reliability Coefficients
“The major use of reliability coefficients is in communicating the extent to which the results
obtained from a measurement method are repeatable. The reliability coefficient is one index of
the effectiveness of an instrument, reliability being a necessary but not sufficient condition for
any type of validity.” (Nunnally, p.237)
The choice of a reliability coefficient is based on the level of measurement. Like correlation
coefficients, reliability coefficients range between 0 and 1, with 0 indicating no reproducibility (no rater
agreement) and 1 indicating perfect reproducibility (perfect rater agreement).
Kappa Statistic and Weighted Kappa Statistic
The Kappa statistic is the most widely used reliability coefficient when the variables being
compared for agreement are unordered categorical (nominal level of measurement).
For ordered categorical data (ordinal level of measurement), the weighted kappa is used.
Kappa Statistic: Two Unique Raters, Two Classification Categories
Above, we used the Merlo paper data to compute test characteristics, similar to what the authors did
in their article. In that article, subjects recorded their drug use using a 7-day diary, and they also
recorded their drug use using a questionnaire. The research question was whether or not a
questionnaire is sufficiently reliable to collect drug use information. When they computed test
characteristics, they made the assumption that the diary was the “gold standard”, which assumes
the data are collected without error. The authors also reported a kappa coefficient, which is used
to evaluate if two raters provide the same measurement value, which does not require that one
rater’s measurements represent the gold standard.
They used the “two unique raters, two classification categories” form of the kappa coefficient.
Since the same subject provided both measurements, it was actually the same rater providing
both measurements, so there were not two unique raters. However, the two measurement
instruments differed, so it was “two unique instruments”. This form of kappa, then, did apply to
their analysis situation.
If the data are not already in Stata memory from doing this above, then input the data using,
clear
input question diary count
1 0 231
0 1 164
1 1 969
0 0 14696
end
drop if count==0 // not needed here, but would be if any count was 0
expand count
drop count
To display the data, use the tabulate command (abbreviated tab) with the “cell” option, to get each
cell’s percentage of the total sample size.
tab diary question , cell
+-----------------+
| Key             |
|-----------------|
|    frequency    |
| cell percentage |
+-----------------+

           |       question       |
     diary |         0          1 |     Total
-----------+----------------------+----------
         0 |    14,696        231 |    14,927
           |     91.51       1.44 |     92.95
-----------+----------------------+----------
         1 |       164        969 |     1,133
           |      1.02       6.03 |      7.05
-----------+----------------------+----------
     Total |    14,860      1,200 |    16,060
           |     92.53       7.47 |    100.00
To compute the kappa coefficient, use
kap diary question
             Expected
Agreement    Agreement     Kappa   Std. Err.         Z      Prob>Z
------------------------------------------------------------------
  97.54%      86.53%      0.8174     0.0079      103.64     0.0000
This displayed the “observed agreement”, which is the percent of the sample size on the main
diagonal (both methods provided the same score):

    91.51% + 6.03% = 97.54%

The “expected agreement” is computed in the same way that expected cell frequencies are computed
when testing the minimum expected frequency assumption required for the chi-square test to be
appropriate for a given crosstabulation table. Obtaining the expected cell counts,
tab diary question , expect
+--------------------+
| Key                |
|--------------------|
|     frequency      |
| expected frequency |
+--------------------+

           |       question       |
     diary |         0          1 |     Total
-----------+----------------------+----------
         0 |    14,696        231 |    14,927
           |  13,811.7    1,115.3 |  14,927.0
-----------+----------------------+----------
         1 |       164        969 |     1,133
           |   1,048.3       84.7 |   1,133.0
-----------+----------------------+----------
     Total |    14,860      1,200 |    16,060
           |  14,860.0    1,200.0 |  16,060.0
Converting the expected cell frequencies on the main diagonal into expected percents,
display ((13811.7+84.7)/16060)*100
86.52802
This agrees with the expected agreement in the kap output, shown again here:

             Expected
Agreement    Agreement     Kappa   Std. Err.         Z      Prob>Z
------------------------------------------------------------------
  97.54%      86.53%      0.8174     0.0079      103.64     0.0000
Kappa is defined as the amount of agreement in excess of chance agreement. It has the form
(Fleiss, Levin, Paik, 2003, p.603),
“The obtained excess beyond chance is Io – Ie, whereas the maximum possible excess is
1 – Ie. The ratio of these two differences is called kappa,

    κ̂ = (Io – Ie) / (1 – Ie).”
Plugging these values into the kappa formula,
display "kappa = " (.9754-.8653)/(1-.8653)
kappa = .81737194
This value agrees with the kappa = 0.817 reported in Merlo’s Table 1.
Notice how Merlo reports kappa in the Results Section (page 789, first sentence under Reliability
analysis subheading). To make sure the interpretation of the kappa coefficient is understood,
they describe it,
“We conducted a reliability analysis to assess to what extent the questionnaire and the
personal diary agreed (i.e., the ability to replicate results whether or not the information
was correct). We calculated the percentage of agreement and the related kappa coefficient
(4) for dichotomous (yes vs. no) self-reported current use of hormone therapy. The kappa
coefficient is a measure of the degree of nonrandom agreement between two
measurements of the same categorical variable (5). A p value of <0.05 was required for
rejection of the null hypothesis of no agreement other than by chance.”
Kappa Statistic: Two Unique Raters, Two or More Unordered Classification Categories
We will practice with the dataset boydrater.dta. This dataset (StataCorp, 2007, p.85) came
from the Stata website [use http://www.stata-press.com/data/r10/rate2]. It is a subset
of data from Boyd et al. (1982) and is discussed in the context of kappa in Altman (1991, p.403-405).
The data represent classifications by two radiologists of 85 xeromammograms as normal,
benign disease, suspicion of cancer, or cancer.
Reading the data into Stata,
File
Open
Find the directory where you copied the course CD
Change to the subdirectory datasets & do-files
Single click on boydrater.dta
Open
use "C:\Documents and Settings\u0032770.SRVR\Desktop\
Biostats & Epi With Stata\datasets & do-files\boydrater.dta",
clear
*
which must be all on one line, or use:
cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
cd " Biostats & Epi With Stata\datasets & do-files"
use boydrater, clear
Tabulating the two raters classifications,
Statistics
Summaries, tables & tests
Tables
Two-way tables with measures of association
Main tab: Row variable: rada
Column variable: radb
OK
tabulate rada radb
Radiologist |
        A's |            Radiologist B's assessment
 assessment |  1.normal   2.benign  3.suspect   4.cancer |     Total
------------+--------------------------------------------+----------
   1.normal |        21         12          0          0 |        33
   2.benign |         4         17          1          0 |        22
  3.suspect |         3          9         15          2 |        29
   4.cancer |         0          0          0          1 |         1
------------+--------------------------------------------+----------
      Total |        28         38         16          3 |        85
We see that the classifications are actually on an ordinal scale (ordered categorical data), but we
will first treat them as nominal (unordered categorical data) to illustrate the kappa statistic.
(Later, we will compute the weighted kappa statistic.)
Restating the definition, kappa is the amount of agreement in excess of chance agreement. It has
the form (Fleiss, Levin, Paik, 2003, p.603),
“The obtained excess beyond chance is Io – Ie, whereas the maximum possible excess is
1 – Ie. The ratio of these two differences is called kappa,

    κ̂ = (Io – Ie) / (1 – Ie).”
The amount of agreement is simply the proportion of observations on the main diagonal of the
crosstabulation table.
disp (21+17+15+1)/85
.63529412
The amount of agreement expected by chance is the proportion of expected cell frequencies on
the main diagonal, computed in the same way as for the chi-square test,
expected cell frequency = (column total)(row total)/(grand total)
disp (28*33/85+38*22/85+16*29/85+3*1/85)/85
.30823529
Computing kappa,
display (.63529412-.30823529)/(1-.30823529)
.47278912
Calculating kappa in Stata,
kap rada radb
             Expected
Agreement    Agreement     Kappa   Std. Err.         Z      Prob>Z
------------------------------------------------------------------
  63.53%      30.82%      0.4728     0.0694        6.81     0.0000
The Stata manual provides a nice interpretation of this result (StataCorp, 2007, p.85),
“If each radiologist had made his determination randomly (but with probabilities equal to
the overall proportions), we would expect the two radiologists to agree on 30.8% of the
patients. In fact, they agreed on 63.5% of the patients, or 47.3% of the way between
random agreement and perfect agreement. The amount of agreement indicates that we
can reject the hypothesis that they are making their determinations randomly.”
Kappa is scaled such that 0 is the amount of agreement that would be expected by chance, and 1
is perfect agreement. Thus, the test statistic, z = 6.81, is a test of the null hypothesis, H0: κ = 0 .
It is of little interest to test kappa against zero, since acceptable interrater reliability is a number
much greater than zero, so the p value is not worth reporting. Instead, report kappa along with a
95% confidence interval.
To obtain a confidence interval for kappa, you must first add the command kapci to your Stata.
To do this, first use,
findit kapci
SJ-4-4  st0076 . . . . . . . . . Confidence intervals for the kappa statistic
        (help kapci if installed) . . . . . . . . . . . . . M. E. Reichenheim
        Q4/04   SJ 4(4):421--428
        confidence intervals for the kappa statistic
and then click on the st0076 link to install.
Since this is a user-contributed command (Reichenheim, 2004), to see its help description,
use,
help kapci
If the kappa is for a dichotomous variable, the kapci command uses a formula approach (Fleiss,
1981). For all other cases, the kapci command uses a bootstrap approach (Efron and Tibshirani
1993; Lee and Fung 1993).
Obtaining the confidence interval,
kapci rada radb
Note: default number of bootstrap replications has been set to 5 for syntax
      testing only. reps() needs to be increased when analysing real data.

B=5 N=85
------------------------------------------------
Kappa (95% CI) = 0.473 (0.293 - 0.529)      (BC)
------------------------------------------------
BC = bias corrected
We observe it used a bootstrap approach, but with only 5 repetitions. A minimum of 200
repetitions should be used, but 1,000 repetitions is recommended.
Obtaining the bootstrapped confidence interval, based on 1,000 repetitions,
kapci rada radb , reps(1000)
This may take quite a long time. Please wait ...

B=1000 N=85
------------------------------------------------
Kappa (95% CI) = 0.473 (0.326 - 0.612)      (BC)
------------------------------------------------
BC = bias corrected
There are many methods of bootstrapping. The default is “bias corrected”, which is the most
popular.
Beyond the two-rater, dichotomous-variable case, confidence interval formulas have not
been developed, at least none that I know of. Reichenheim (2004, p. 76), the author of the
kapci command, states the same thing,
“Computer efficiency is the main advantage of using an analytical procedure. Alas, to the
best of the author’s knowledge, no such method has been developed to accommodate
more complex analysis beyond the simple 2 × 2 case.”
Therefore, you should definitely state that you used a bootstrapped CI for kappa when that is
what you have done, because your reader, or reviewer, will wonder how you computed the CI.
Protocol Suggestion
Here is some example wording for describing what was just done,
Inter-rater reliability will be measured using the kappa coefficient and reported with a
bootstrapped, bias-corrected 95% confidence interval (Reichenheim, 2004;
Carpenter and Bithell, 2000).
Interpreting Kappa
Landis and Koch (1977, p.165) suggested the following guideline for the evaluation of Kappa:
Interpretation of Kappa

    kappa          interpretation
    < 0            poor agreement
    0.00 - 0.20    slight agreement
    0.21 - 0.40    fair agreement
    0.41 - 0.60    moderate agreement
    0.61 - 0.80    substantial agreement
    0.81 - 1.00    almost perfect agreement
Kappa is scaled such that 0 is the amount of agreement that would be expected by chance, and 1
is perfect agreement.
PABAK (Prevalence and Bias Adjusted Kappa)
When the observed cell counts in a crosstabulation table of two raters bunch up in any corner of
the table, the kappa coefficient is known to produce paradoxical results. An alternative form of
kappa, called PABAK, has been proposed as a solution to this problem.
Consider the 2 x 2 table of dichotomous ratings (yes or no) from two raters, such as two
radiologists reading X-rays to determine the presence or absence of a tumor. Expressed using
Byrt et al.'s (1993) notation,
                               Observer B
                          Yes        No       Total
   Observer A    Yes       a          b         f1
                 No        c          d         f2
                 Total     g1         g2         N
The proportion of observed agreement is

    po = (a + d) / N
Some proportion of this agreement, however, can occur by chance. Observer A can score yes’s
completely independently of Observer B, and so whatever agreements occur would
just be chance. For example, radiologist A can just flip a coin to decide if a tumor is present, and
radiologist B can likewise flip a coin. The instances of them both getting heads or both
getting tails are just chance occurrences.
The proportion of agreement expected by chance (see box) is

    pe = (f1·g1 + f2·g2) / N²
Kappa is defined as the amount of agreement in excess of chance agreement.
The obtained excess beyond chance is
po – pe
The maximum possible excess is
1 – pe
The ratio of these two differences is called kappa,
    K = (po – pe) / (1 – pe)
Expected chance agreement
The expected agreement comes from the “multiplication rule for independent events” in
probability. If two events, A and B, are independent, then the probability they will both occur is:
P(AB) = P(A)P(B) , where P(AB) = probability both occur
P(A) = probability A will occur
P(B) = probability B will occur
                               Observer B
                          Yes        No       Total
   Observer A    Yes       a          b         f1
                 No        c          d         f2
                 Total     g1         g2         N
A probability is just the proportion of times an event occurs, so
P(observer A scores Yes) = f1/N , P(observer A scores No) = f2/N
and
P(observer B scores Yes) = g1/N , P(observer B scores No) = g2/N
If the observers score independently, with no correlation between the two (such as not using a
common criterion like educated professional judgment), then by the multiplication rule for
independent events,
P(both score Yes by chance) = (f1/N)(g1/N) = f1·g1/N²
P(both score No by chance) = (f2/N)(g2/N) = f2·g2/N²
Adding these to get the total proportion of times an agreement occurs by chance,
P(both score Yes or both score No) = f1·g1/N² + f2·g2/N² = (f1·g1 + f2·g2) / N²
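As a quick check of this box against the Merlo table used earlier (diary totals 1,133 and 14,927; questionnaire totals 1,200 and 14,860; N = 16,060), the chance-expected agreement can be computed directly in Stata:

* Chance-expected agreement from the margin totals of the Merlo 2 x 2 table
display "expected agreement = " (1133*1200 + 14927*14860)/(16060^2)

which returns approximately 0.8653, matching the 86.53% expected agreement reported by kap above.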
**** Left to Add ****
Using Byrt paper, describe anomalies due to prevalence and bias.
Expressed using Looney and Hagan’s (2008) notation,
                               Observer B
                          Yes        No       Total
   Observer A    Yes      n11        n12        n1.
                 No       n21        n22        n2.
                 Total    n.1        n.2         n
the formula for PABAK is,

    PABAK = [(n11 + n22) – (n12 + n21)] / n = 2·p0 – 1
Looney and Hagan (2008, p.115) point out,
“Note that PABAK is equivalent to the proportion of ‘agreements’ between the variables
minus the proportion of ‘disagreements’.”
Looney and Hagan (2008, pp.115-116) provide formulas for the variance and confidence interval,

“The approximate variance of PABAK is given by

    estimated Var(PABAK) = 4·p0·(1 – p0) / n

and the approximate 100(1 – α)% confidence limits for the true value of PABAK are given
by

    estimated PABAK ± z(α/2) · sqrt[estimated Var(PABAK)].”
Calculating PABAK in Stata: pabak.ado file
I programmed this for the 2 x 2 table case (2 raters, 2 possible outcomes). Either copy the file
pabak.ado to the directory:
C:\ado\personal
so that it is always available, or change directories to the directory it is in so that it is temporarily
available.
Reading in the data for the example given in Looney and Hagan (2008, p.116),
* -- data in Looney and Hagan (p.116)
clear
input biomarkera biomarkerb count
1 1 80
1 0 15
0 1 5
0 0 0
end
drop if count==0
expand count
drop count
Displaying the data and calculating kappa,
tab biomarkera biomarkerb
kap biomarkera biomarkerb
           |      biomarkerb      |
biomarkera |         0          1 |     Total
-----------+----------------------+----------
         0 |         0          5 |         5
         1 |        15         80 |        95
-----------+----------------------+----------
     Total |        15         85 |       100

             Expected
Agreement    Agreement     Kappa   Std. Err.         Z      Prob>Z
------------------------------------------------------------------
  80.00%      81.50%     -0.0811     0.0841       -0.96     0.8324
We see that kappa = -0.08, which is essentially kappa = 0, even though the observed agreement is 80%,
which is very high. The negative sign comes from the observed agreement being worse than the
expected agreement, so agreement was actually worse than what would be expected simply by
chance. The problem is that the kappa coefficient fails with some particular data patterns.
These are well described in Byrt et al (1993).
Calculating PABAK with the analytic (formula based) 95% confidence interval,
pabak biomarkera biomarkerb
Interrater Reliability

                  biomarkerb
                    1       0
               -----------------
biomarkera  1 |    80      15 |      95
            0 |     5       0 |       5
               -----------------
                   85      15        100

PABAK = 0.6000 , 95% CI (0.4432 , 0.7568)
The PABAK and confidence limits agree exactly with those provided in Looney and Hagan
(2008, p.116), so you can feel comfortable that the pabak.ado file was correctly programmed.
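As a hand check of the Looney and Hagan formulas given earlier, using the observed agreement p0 = (80 + 0)/100 = 0.80 and n = 100 from this table:

* PABAK = 2*p0 - 1, with analytic variance 4*p0*(1-p0)/n
display "PABAK = " 2*.80 - 1
display "SE    = " sqrt(4*.80*(1-.80)/100)
display "95% CI: " (2*.80-1) - invnormal(.975)*sqrt(4*.80*(1-.80)/100) ///
        " to " (2*.80-1) + invnormal(.975)*sqrt(4*.80*(1-.80)/100)

which reproduces the 0.6000 and (0.4432 , 0.7568) shown above.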
To verify that the analytic CI is reasonable, we can bootstrap the CI for PABAK, using
bootstrap r(pabak), reps(1000) size(100) seed(999) bca: ///
pabak biomarkera biomarkerb
estat bootstrap, all
Bootstrap results                               Number of obs      =       100
                                                Replications       =      1000

      command:  pabak biomarkera biomarkerb
        _bs_1:  r(pabak)

------------------------------------------------------------------------------
             |    Observed               Bootstrap
             |       Coef.       Bias    Std. Err.  [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _bs_1 |          .6     .00252    .07922074   .4447302   .7552698   (N)
             |                                            .44        .75   (P)
             |                                            .46        .76  (BC)
             |                                            .46        .76 (BCa)
------------------------------------------------------------------------------
(N)    normal confidence interval
(P)    percentile confidence interval
(BC)   bias-corrected confidence interval
(BCa)  bias-corrected and accelerated confidence interval
We see that the analytic CI was in agreement with these four bootstrapped CI approaches.
Protocol Suggestion for PABAK
Reliability will be measured using the Kappa coefficient, which is the proportion of agreement
beyond expected chance agreement. Although Kappa remains the most widely used measure of
agreement, several authors have pointed out data patterns that produce a Kappa with paradoxical
results. For example, if the agreements bunch up in one of the agreement cells (prevalence) or the
disagreements bunch up in one of the disagreement cells (bias), then the Kappa statistic is
paradoxically different from that of a crosstabulation table with more evenly distributed agreements and
disagreements, even though the percentages of agreement and disagreement do not change.
Therefore, the prevalence-adjusted bias-adjusted kappa (PABAK) will also be reported, which
gives the true proportion of agreement beyond expected chance agreement regardless of
unbalanced data patterns (Byrt, 1993). The reliability between the nurse observers and the nurse
trainer will be computed. This analysis will be stratified by study sites, reporting site-specific
percent agreement, Kappa and PABAK coefficients, as well as a summary Kappa, which is a
weighted average across the study sites, along with confidence intervals.
Intraclass Correlation Coefficient (ICC)
For a continuous rating (interval scale), interrater reliability is measured with the intraclass
correlation coefficient (ICC), also called the intracluster correlation coefficient (ICC) or the
reliability coefficient (r or rho).
To compute the ICC, we use the formula (Streiner and Norman, 1995, p.106; Shrout and Fleiss,
1979),
    reliability = subject variability / (subject variability + measurement error)

expressed symbolically as,

    ρ = σs² / (σs² + σe²)
Note, however, that these sigma’s are population parameters. Population parameters are
estimated using the expected value of sample statistics, where the expected value is the long-run
average.
The ICC cannot be computed, then, simply from the MS(between) and MS(within) from an
analysis of variance table. For any ANOVA, depending on whether it is for a fixed effect,
random effect, multiple raters for each subject, separate raters for each subject, etc, the expected
mean squares (EMS) are different equations containing the MS(between) and MS(within). That
is, all versions of the ICC use the same MS(between) and MS(within) from the ANOVA table,
but the EMS(between) and EMS(within) have slightly different equations for each situation.
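For example, for the simplest one-way random-effects design with k ratings per subject, the ANOVA-based estimate works out to (Shrout and Fleiss, 1979; their ICC(1,1), given here only for illustration):

    ICC = [MS(between) – MS(within)] / [MS(between) + (k – 1)·MS(within)]

Other designs (e.g., raters treated as fixed or random) lead to different combinations of the same mean squares.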
Using the dataset provided in Table 8.1 of Streiner and Norman (1995),
File
Open
Find the directory where you copied the course CD
Change to the subdirectory datasets & do-files
Single click on streiner and normantable81.dta
Open
use "C:\Documents and Settings\u0032770.SRVR\Desktop\
Biostats & Epi With Stata\datasets & do-files\
streiner and normantable81boydrater.dta", clear
*
which must be all on one line, or use:
cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
cd "Biostats & Epi With Stata\datasets & do-files"
use "streiner and normantable81", clear
These data are ratings of 10 patients by three observers, each rating on a 10-point scale for
some attribute, such as sadness.
Listing the data,
list, sep(0)

     +---------------------------------------+
     | patient   observ1   observ2   observ3 |
     |---------------------------------------|
  1. |       1         6         7         8 |
  2. |       2         4         5         6 |
  3. |       3         2         2         2 |
  4. |       4         3         4         5 |
  5. |       5         5         4         6 |
  6. |       6         8         9        10 |
  7. |       7         5         7         9 |
  8. |       8         6         7         8 |
  9. |       9         4         6         8 |
 10. |      10         7         9         8 |
     +---------------------------------------+
Drawing a scatter diagram, showing the ratings for each patient lined up vertically,
sort patient
twoway (scatter observ1 patient) (scatter observ2 patient) ///
       (scatter observ3 patient)

[Scatter plot: observ1, observ2, and observ3 plotted against patient (x axis: patient, 0 to 10;
y axis: rating, 2 to 10), with the three ratings for each patient lined up vertically.]
We see that the ratings for the same patient appear more alike than ratings between patients,
suggesting a high value of the ICC.
To calculate the ICC in Stata, we must first reshape the data to long format. Then, we will list
the data for the first two patients to verify the data are reshaped correctly.
reshape long observ , i(patient) j(observerID)
list if patient<=2 , sepby(patient) abbrev(15)
     +-------------------------------+
     | patient   observerID   observ |
     |-------------------------------|
  1. |       1            1        6 |
  2. |       1            2        7 |
  3. |       1            3        8 |
     |-------------------------------|
  4. |       2            1        4 |
  5. |       2            2        5 |
  6. |       2            3        6 |
     +-------------------------------+
In Stata, the ICC can be computed by treating observer as a fixed effect and patient as a random
effect, using
xi: xtreg observ i.observerID, i(patient) mle // Stata-10
* <or>
xtreg observ i.observerID, i(patient) mle // Stata-11
Random-effects ML regression                    Number of obs      =        30
Group variable: patient                         Number of groups   =        10

Random effects u_i ~ Gaussian                   Obs per group: min =         3
                                                               avg =       3.0
                                                               max =         3

Log likelihood  = -47.804751                    LR chi2(2)         =     21.97
                                                Prob > chi2        =    0.0000

------------------------------------------------------------------------------
      observ |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Iobserver~2 |          1    .316217     3.16   0.002      .380226    1.619774
_Iobserver~3 |          2    .316217     6.32   0.000     1.380226    2.619774
       _cons |          5   .6429085     7.78   0.000     3.739923    6.260077
-------------+----------------------------------------------------------------
    /sigma_u |    1.90613   .4459892                      1.205013    3.015182
    /sigma_e |   .7071068   .1117939                      .5186921     .963963
         rho |   .8790323   .0609117                      .7179378    .9611003
------------------------------------------------------------------------------
Likelihood-ratio test of sigma_u=0: chibar2(01)=    32.10 Prob>=chibar2 = 0.000
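As a check of the reliability formula given above, rho can be reproduced directly from the sigma_u (subject) and sigma_e (error) estimates in this output:

* ICC = sigma_u^2 / (sigma_u^2 + sigma_e^2), using the estimates shown above
display "ICC = " 1.90613^2 / (1.90613^2 + .7071068^2)

which returns approximately .879, matching rho.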
The interpretation is that 88% of the variance in the observations was the result of “true”
variance among patients (Streiner and Norman, 1995, p.111). That is, 88% of the variance is due
to the background variability among patients, or patient-to-patient differences, rather than due to
lack of agreement among the raters or measurement error. That interpretation is simply
consistent with the formula. It is a roundabout way of describing the agreement among the
raters, which is excellent agreement in this example. This is discussed further in the
“Interpreting ICC” section below.
This ICC = 0.88 agrees with the Streiner and Norman (1995, p.111) calculation. Here we treated
the patient as a random effect and observer as a fixed effect. Thus, we are making an inference
to the population of all patients, but only to these three specific observers.
This is called the “classical” definition of reliability (Streiner and Norman, 1995, p.111). This is
the situation where reliability is calculated to demonstrate the observers used in the particular
study had good interrater reliability. In this situation, there is no need to make an inference to
other observers.
Anomalous Values of ICC
In the section above, “PABAK (Prevalence and Bias Adjusted Kappa)”, it was pointed out that
when the observed cell counts in a crosstabulation table of two raters bunch up in any corner of
the table, the kappa coefficient is known to produce paradoxical results. That is, it gives values
of kappa that are much smaller than appear consistent with the data. This results from
the limitations inherent in the kappa formula.
A similar thing can occur with ICC. When the variability among the raters is larger than the
variability among the subjects, the ICC becomes negative. Since ICC is defined to be a number
between 0 and 1, most software packages, including Stata, will set the ICC to 0, indicating no
agreement among the raters. (The SPSS software allows negative ICC values, which makes
interpretation difficult.) What if your sample just happens to be a group of subjects who have very
similar values of the outcome variable? If the subject similarity is closer to, but larger than, the
similarity among the raters, then ICC will be a positive low value. If this similarity is tighter than
the similarity among the raters, then ICC will be 0. Now, in a different sample of patients, using
the same raters, but this time where the subjects differ from each other so that there is a wide
range of values on the outcome variable, the ICC will be a high value. These disparate results are
due solely to subject variability, having nothing to do with how well the raters actually agree with
each other.
To illustrate this anomaly, we will use the following dataset.
clear
input id scaleA0 scaleB0 scaleC0 scaleA1 scaleB1 scaleC1
1 40 45 49 42 47 51
2 60 55 50 59 54 49
3 55 50 48 59 54 52
4 58 54 52 56 52 50
5 52 51 52 53 52 53
6 45 49 51 46 50 52
7 48 48 49 50 50 51
8 58 53 51 60 55 53
9 53 52 51 54 53 52
10 55 51 50 57 53 52
11 45 49 52 42 46 49
12 51 45 51 52 46 52
13 52 51 49 54 53 51
14 57 53 51 58 54 52
15 48 48 49 49 49 50
16 58 53 51 59 54 52
17 44 50 52 40 46 48
end
In this dataset, three measurements are recorded (scales A, B, and C) at pretest (0) and then again
at posttest (1). This would occur, for example, when a subject records how they feel on a
standardized test at the first baseline visit. Then at their second visit one week later, you ask
them to score how they feel again before the intervention is started. Given the two baseline
measurements, you can assess the test-retest reliability of the standardized test using an ICC
reliability coefficient.
The way this dataset is contrived, the difference between the pretest and posttest scores provided
by the subjects is exactly the same for each of the three subscales, A, B, and C. The only thing
that varies is the range of scores. For subscale A, the range is 20 (min = 40, max = 60); for
subscale B, the range is 10 (min = 45, max = 55); and for subscale C, the range is 4 (min = 48,
max = 52).
For example, for subject 1, the difference between pretest and posttest is 2 for each of the three
subscales. For subject 2, the difference is 1 for each of the three subscales, and so on. Thus, the
agreement is consistent across the three subscales. It is just the range of scores, which is the
variability between subjects, that varies.
In the following figure, we see that the ICC gets smaller as the subjects provide more
homogeneous scores (subjects become more alike). When this homogeneity gets tight enough,
the ICC becomes 0.

[Figure: three panels of test-retest plots, one pair of ratings per subject on the x axis, with the
y axis (score) running from 30 to 70. Panel 1: “Scale A test and retest”, ICC = 0.95.
Panel 2: “Scale B test and retest”, ICC = 0.78. Panel 3: “Scale C test and retest”, ICC = 0.00.]

Figure. For each specific subject, the difference between the two ratings is exactly
the same in the three graphs. The only thing that varies is the range of values
on the Score variable.
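If you want to reproduce the ICC shown in a panel, here is a minimal sketch for scale A (it assumes the contrived dataset above is still in memory; the same steps apply to scales B and C):

preserve
keep id scaleA0 scaleA1
reshape long scaleA, i(id) j(time)    // time = 0 (pretest), 1 (posttest)
xtreg scaleA i.time, i(id) mle        // Stata-11 syntax; prefix with xi: for Stata-10
restore

The rho reported by xtreg is the test-retest ICC plotted in the corresponding panel.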
To be fair to the ICC statistic, one could say that scale B is worse than scale A, and scale C is
worse than scale B. This would be the case if the scales really did have very different abilities to
discriminate between subjects. So, this does not always represent an anomaly. Be careful, then,
about casually concluding your data represent an anomaly.
However, if you were to use the scales with a more diverse set of patients, so that patient
variability increased, the ICCs would most likely go up, even though nothing about the scale or
the “real” reliability remained the same, thus presenting the anomaly. This contrived dataset was
attempting to represent that particular scenario. Simply think of scales A, B, and C to be the
same scale, only measured on three sets of patients, where homogeneity of patients varies due to
inclusion-exclusion criteria. Then, this dataset really does illustrate the anomaly.
To study inter- and intra-rater reliability, a good study design attempts to find a heterogeneous set
of patients, which avoids the anomaly described here. Those are the study designs that correctly
measure reliability for a measurement scale.
Frequently, however, researchers use datasets that have strict inclusion/exclusion criteria in order
to reduce heterogeneity (controlling for possible confounding using the restriction approach), in order
to test a study hypothesis. So, when they also attempt to assess inter- and intra-rater reliability
within their studies, these values are lower than the scale is capable of. These are the studies in
which the anomaly will most frequently show up.
Streiner and Norman (1995, p.122) describe this in the context of how to improve the ICC,
“…An alternative approach, which is not legitimate, is to administer the test to a
more heterogeneous group of subjects for the purpose of determining reliability. For
example, if a measure of function in arthritis does not reliably discriminate among
ambulatory arthritics, administering the test to both normal subjects and to bedridden
hospitalized arthritics will almost certainly improve reliability. Of course, the resulting
reliability no longer yields any information about the ability of the instrument to
discriminate among ambulatory patients.
By contrast, it is sometimes the case that a reliability coefficient derived from a
homogeneous population is to be applied to a population which is more heterogeneous. It
is clear from the above discussion that the reliability in the application envisioned will be
larger than that determined in the homogeneous study population….”
If you are curious how the above figure with the three ICC graphs was generated, here is the Stata
code.
#delimit ;
twoway (rscatter scaleA0 scaleA1 id , symbol(circle) color(red))
(rspike scaleA0 scaleA1 id , color(red))
, ytitle("Scale A test and retest") xtitle("subject")
xlabels(0 " " 18 " ", notick)
ylabels(30(4)70, angle(horizontal))
text(68 8.5 "ICC = 0.95",placement(c) size(*2))
legend(off) plotregion(style(none)) scheme(s1color)
saving(tempa, replace)
;
#delimit cr
*
#delimit ;
twoway (rscatter scaleB0 scaleB1 id , symbol(circle) color(red))
(rspike scaleB0 scaleB1 id , color(red))
, ytitle("Scale B test and retest") xtitle("subject")
xlabels(0 " " 18 " ", notick)
ylabels(30(4)70, angle(horizontal))
text(68 5 "ICC = 0.78" ,placement(e) size(*2))
legend(off) plotregion(style(none)) scheme(s1color)
saving(tempb, replace)
;
#delimit cr
*
#delimit ;
twoway (rscatter scaleC0 scaleC1 id , symbol(circle) color(red))
(rspike scaleC0 scaleC1 id , color(red))
, ytitle("Scale C test and retest") xtitle("subject")
xlabels(0 " " 18 " ", notick)
ylabels(30(4)70, angle(horizontal))
text(68 5 "ICC = 0.00" ,placement(e) size(*2))
legend(off) plotregion(style(none)) scheme(s1color)
saving(tempc, replace)
;
#delimit cr
*
* -- combine graphs into single figure
graph combine tempa.gph tempb.gph tempc.gph , scheme(s1color)
erase tempa.gph // delete temporary graph files from hard drive
erase tempb.gph
erase tempc.gph
Equivalence of Kappa and Intraclass Correlation Coefficient (ICC)
Unweighted Kappa
For dichotomous ratings, kappa and ICC are equivalent [Fleiss, Levin, and Paik (2003, p.604);
Streiner and Norman (1995, p.118)].
To verify this equivalence for the dichotomous rating, we read in the Boyd Rater dataset,
File
Open
Find the directory where you copied the course CD
Find the subdirectory datasets & do-files
Single click on boydrater.dta
Open
use "C:\Documents and Settings\u0032770.SRVR\Desktop\
Biostats & Epi With Stata\datasets & do-files\boydrater.dta",
clear
*
which must be all on one line, or use:
cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
cd " Biostats & Epi With Stata\datasets & do-files"
use boydrater, clear
Recoding the classifications to an indicator of abnormal,
normal -> normal
benign, suspect, cancer -> abnormal
tab rada radb
recode rada 1=0 2/4=1 , gen(abnormal1)
recode radb 1=0 2/4=1 , gen(abnormal2)
tab abnormal1 abnormal2
Radiologist |
        A's |            Radiologist B's assessment
 assessment |  1.normal   2.benign  3.suspect   4.cancer |     Total
------------+--------------------------------------------+----------
   1.normal |        21         12          0          0 |        33
   2.benign |         4         17          1          0 |        22
  3.suspect |         3          9         15          2 |        29
   4.cancer |         0          0          0          1 |         1
------------+--------------------------------------------+----------
      Total |        28         38         16          3 |        85

   RECODE of |
        rada |     RECODE of radb
(Radiologist |    (Radiologist B's
         A's |      assessment)
 assessment) |         0          1 |     Total
-------------+----------------------+----------
           0 |        21         12 |        33
           1 |         7         45 |        52
-------------+----------------------+----------
       Total |        28         57 |        85
Calculating kappa,
kap abnormal1 abnormal2
             Expected
Agreement    Agreement     Kappa   Std. Err.         Z      Prob>Z
------------------------------------------------------------------
  77.65%      53.81%      0.5160     0.1076        4.80     0.0000
Reshaping the data to long format and then calculating ICC,
gen patient=_n // create a patient ID
reshape long abnormal , i(patient) j(radiologist)
xi: xtreg abnormal i.radiologist, i(patient) mle
Random-effects ML regression                    Number of obs      =       170
Group variable: patient                         Number of groups   =        85

Random effects u_i ~ Gaussian                   Obs per group: min =         2
                                                               avg =       2.0
                                                               max =         2

Log likelihood  = -102.60835                    LR chi2(1)         =      1.33
                                                Prob > chi2        =    0.2495

------------------------------------------------------------------------------
    abnormal |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Iradiolog~2 |   .0588235   .0508824     1.16   0.248    -.0409041    .1585512
       _cons |   .6117647    .051928    11.78   0.000     .5099877    .7135417
-------------+----------------------------------------------------------------
    /sigma_u |   .3452094   .0405841                      .2741651    .4346634
    /sigma_e |   .3317146    .025441                      .2854181    .3855208
         rho |   .5199275   .0791436                      .3671773    .6697714
------------------------------------------------------------------------------
Likelihood-ratio test of sigma_u=0: chibar2(01)=    26.79 Prob>=chibar2 = 0.000
We see kappa and ICC are identical to two decimal places.
Weighted Kappa
For ordered categorical (ordinal scale) ratings, the weighted kappa and ICC are equivalent,
provided the weights are taken as [Fleiss, Levin, and Paik (2003, p.609); Streiner and Norman
(1995, p.118)],

    wij = 1 – (i – j)² / (k – 1)²        (the “wgt(w2)” option in Stata)
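As a quick hand check of these weights for the k = 4 rating categories used below, the distinct values can be computed with display:

* w2 weights for k = 4: w_ij = 1 - (i-j)^2/(k-1)^2
display "weight for |i-j| = 0: " 1 - 0^2/(4-1)^2
display "weight for |i-j| = 1: " 1 - 1^2/(4-1)^2
display "weight for |i-j| = 2: " 1 - 2^2/(4-1)^2
display "weight for |i-j| = 3: " 1 - 3^2/(4-1)^2

These reproduce the 1.0000, 0.8889, 0.5556, and 0.0000 entries in the weight matrix that kap prints below.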
To verify this equivalence for ordinal scale ratings, we again use the Boyd Rater dataset, but this
time use the original ordinal scaled ratings.
Reading in the original dataset,
File
Open
Find the directory where you copied the course CD
Change to the subdirectory datasets & do-files
Single click on boydrater.dta
Open
use "C:\Documents and Settings\u0032770.SRVR\Desktop\
Biostats & Epi With Stata\datasets & do-files\boydrater.dta",
clear
*
which must be all on one line, or use:
cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
cd " Biostats & Epi With Stata\datasets & do-files"
use boydrater, clear
Calculating weighted kappa, using the wt2 weights,
kap rada radb , wgt(w2)
Ratings weighted by:
1.0000  0.8889  0.5556  0.0000
0.8889  1.0000  0.8889  0.5556
0.5556  0.8889  1.0000  0.8889
0.0000  0.5556  0.8889  1.0000

             Expected
Agreement    Agreement     Kappa   Std. Err.         Z      Prob>Z
------------------------------------------------------------------
  94.77%      84.09%      0.6714     0.1079        6.22     0.0000
Reshaping the data to long format and then calculating ICC,
rename rada rad1
rename radb rad2
gen patient=_n // create a patient ID
reshape long rad , i(patient) j(radiologist)
xi: xtreg rad i.radiologist, i(patient) mle
Random-effects ML regression                    Number of obs      =       170
Group variable: patient                         Number of groups   =        85

Random effects u_i ~ Gaussian                   Obs per group: min =         2
                                                               avg =       2.0
                                                               max =         2

Log likelihood  = -187.11654                    LR chi2(1)         =      0.40
                                                Prob > chi2        =    0.5266

------------------------------------------------------------------------------
         rad |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Iradiolog~2 |  -.0470588   .0742292    -0.63   0.526    -.1925453    .0984277
       _cons |   1.976471   .0917076    21.55   0.000     1.796727    2.156214
-------------+----------------------------------------------------------------
    /sigma_u |   .6933196   .0673845                      .5730655    .8388082
    /sigma_e |   .4839286    .037113                      .4163913    .5624201
         rho |   .6724105   .0594218                      .5493559    .7790901
------------------------------------------------------------------------------
Likelihood-ratio test of sigma_u=0: chibar2(01)=    51.15 Prob>=chibar2 = 0.000
We see weighted kappa and ICC are identical to two decimal places.
Interpreting ICC
The ICC, also called the reliability coefficient, is computed as (Streiner and Norman, 1995,
p.106; Shrout and Fleiss, 1979),

    reliability = subject variability / (subject variability + measurement error)

expressed symbolically as,

    ρ = σs² / (σs² + σe²)
In the inter-rater reliability situation, the raters assign a score, or measurement, to a subject. Each
score can be expressed as the actual true score plus or minus any measurement error the rater
makes. The variability of the measurements made on a sample of subjects can likewise be
expressed as the variability of subjects, perhaps due to biological variability for example, plus the
variability of the measurement errors made by the raters. [This comes from a statistical theory
identity, which is that the variance of the sum of two independent variables is the sum of the
variances. To apply this, we make the reasonable assumption that measurement errors are made
independently of the true values of the variable being measured.]
We would like to have a reliability coefficient that expresses the proportion of the measurement
variability that is due only to subject variability, which can be thought of as the portion due to
raters making consistent measurements without introducing measurement error (since
measurement error would cause them to score a subject differently). This can be expressed as 1
minus the proportion due to measurement error, since what is left over is the consistent ratings
portion. Expressing this with formulas,

    ρ = 1 – (proportion due to measurement error)
      = 1 – σe² / (σs² + σe²)
      = (σs² + σe²) / (σs² + σe²) – σe² / (σs² + σe²)
      = (σs² + σe² – σe²) / (σs² + σe²)
      = σs² / (σs² + σe²)
So, we see that ICC, or rho, is interpreted as the proportion of total variability in measurements
due to subject variability, which is the definition that authors put in their descriptions of ICC.
McDowell (2006, p.45) provides the Cicchetti and Sparrow (1981) guideline. Cicchetti and
Sparrow (1981) suggested the following guideline for the evaluation of ICC used to measure
inter-rater agreement:
Interpretation of ICC

    ICC            interpretation
    < 0.40         poor agreement
    0.40 - 0.59    fair to moderate agreement
    0.60 - 0.74    good agreement
    0.75 - 1.00    excellent agreement
References
Byrt T, Bishop J, Carlin JB. Bias, prevalence and kappa. J Clin Epidemiol. 1993;46(5):423-9.
Carpenter J, Bithell J. (2000). Bootstrap confidence intervals: when, which, what? A practical
guide for medical statisticians. Statist. Med. 19:1141-1164.
Cicchetti DV, Sparrow SA. (1981). Developing criteria for establishing interrater reliability of
specific items: applications to assessment of adaptive behavior. Am J Ment Defic 86:127-137.
Efron, B. and R. Tibshirani. (1993). An Introduction to the Bootstrap. London: Chapman & Hall.
Fleiss, J. L. (1981). Statistical Methods for Rates and Proportions. 2nd ed. New York: Wiley.
Fleiss JL, Levin B, Paik MC. (2003). Statistical Methods for Rates and Proportions, 3rd ed.
Hoboken, NJ, John Wiley & Sons.
Landis JR, Koch GG (1977). The measurement of observer agreement for categorical data.
Biometrics, 33:159-174.
Lee, J. and K. P. Fung. 1993. Confidence interval of the kappa coefficient by bootstrap
resampling [letter]. Psychiatry Research 49:97-98.
Lilienfeld DE, Stolley PD (1994). Foundations of Epidemiology, 3rd ed., New York, Oxford
University Press.
Looney SW, Hagan JL. (2008). Statistical methods for assessing biomarkers and analyzing
biomarker data. In: Rao CR, Miller JP, Rao DC (eds), Handbook of Statistics 27:
Epidemiology and Medical Statistics, New York, Elsevier, pp. 109-147.
McDowell I, Newell C (1996). Measuring Health: a Guide to Rating Scales and Questionnaires,
2nd ed., New York, Oxford University Press.
McDowell I (2006). Measuring Health: a Guide to Rating Scales and Questionnaires,
3rd ed., New York, Oxford University Press.
Merlo J, Berglund, G, Wirfält E, et al. (2000). Self-administered questionnaire compared with a
personal diary for assessment of current use of hormone therapy: an analysis of 16,060
women. Am J Epidemiol 152:788–92.
Nunnally JC (1978). Psychometric Theory, 2nd ed., New York, McGraw-Hill Book Company.
Reichenheim ME. (2004). Confidence intervals for the kappa statistic. The Stata Journal
4(4):421-428.
Rosner B. (1995). Fundamentals of Biostatistics, 4th ed., Belmont CA, Duxbury Press.
StataCorp (2007). Stata Base Reference Manual, Vol 2 (I-P) Release 10. College Station, TX,
Stata Press.
Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychological
Bulletin 1979;86(2):420-428.
Streiner DL, Norman GR. (1995). Health Measurement Scales: A Practical Guide to
Their Development and Use. New York, Oxford University Press.