Chapter 2-16. Validity and Reliability

Accuracy and Precision Analogy (Target Example)
Validity and reliability are analogous concepts to accuracy and precision. If you shoot a gun at a
target and always hit within the bull’s-eye and first surrounding ring in a random pattern, you
would say your shooting was accurate but not precise. If you always hit in a tight pattern just up
and to the right of the bull’s-eye, you would say your shooting was precise but not accurate. Only a
tight pattern inside the bull’s-eye is both accurate and precise.
A measuring instrument is said to be valid if it measures what it is intended to measure (hits the
region aimed at). It is said to be reliable if it provides measurements that are repeatable, giving
consistent measurements upon repeated applications (hits consistently inside any one region).
Validity (Nunnally, 1978, Chapter 3)
After a measuring instrument is constructed, it is necessary to inquire whether the instrument is
useful scientifically. “This is usually spoken of as determining the validity of an instrument.”
(Nunnally, p.86)
“Generally speaking, a measuring instrument is valid if it does what it is intended to do.”
Example: a yardstick does indeed measure length in the way that we define length.
(Nunnally, p.86)
“Strictly speaking, one validates not a measuring instrument but rather some use to which the
instrument is put.”
Example: a fifth grade spelling achievement test is valid for that purpose, but not
necessarily valid for predicting success in high school. (Nunnally, p.87)
There are three types of validity for measuring instruments: (1) predictive validity, (2) content
validity, and (3) construct validity. (Nunnally, p.87)
_______________________
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript] University of Utah
School of Medicine, 2010.
Predictive validity – “Predictive validity is at issue when the purpose is to use an instrument to
estimate some important form of behavior that is external of the measuring instrument itself, the
latter being referred to as the criterion.” (Nunnally, p. 87)
Example: A test to select first year college students is useful in that situation only if it
accurately estimates successful performance in college. (Nunnally, p.88)
The use of “predictive” could be referring to something future, current, or past.
(Nunnally, p. 88)
Predictive validity is demonstrated by correlating the instrument’s measurement with the
measure of the criterion variable. “The size of the correlation is a direct indication of the
amount of validity.” (Nunnally, p. 88)
Predictive validity is at issue when measuring instruments are employed in making
decisions, such as choosing among different drug regimens to treat some condition.
(Nunnally, p. 89)
“Predictive validity represents a very direct, simple issue in scientific generalization that
concerns the extent to which one can generalize from scores on one variable to scores on
another variable. The correlation between the predictor test and the criterion variable
specifies the degree of validity of that generalization.”
Content Validity – “For some instruments, validity depends primarily on the adequacy in which
a specified domain of content is sampled. A prime example would be a final examination for a
course in introductory psychology. Obviously, the test could not be validated in terms of
predictive validity, because the purpose of the test is not to predict something else but to directly
measure performance in a unit of instruction. The test must stand by itself as an adequate
measure of what it is supposed to measure. Validity cannot be determined by correlating the test
with a criterion, because the test itself is the criterion of performance.” (Nunnally, p. 91)
The two major standards for ensuring content validity are: (1) a representative collection of items
and (2) “sensible” methods of test construction. (Nunnally, p.92)
Example: achievement test in spelling for fourth-grade students. A representative
collection might be a random sampling of words from fourth-grade reading books. A
sensible method of test construction might be to put each correctly spelled word in with
three misspellings and require the student to circle the correct one. (Nunnally, p.92)
“...content validity rests mainly on appeals to reason regarding the adequacy with which
important content has been sampled and on the adequacy with which important content
has been cast in the form of test items.” (Nunnally, p.93)
“For example, at least a moderate level of internal consistency among the items within a
test would be expected; i.e., the items should tend to measure something in common.”
(Nunnally, p.93)
“Another type of evidence for content validity is obtained from correlating scores on
different tests purporting to measure much the same thing, e.g., two tests by different
commercial firms for the measurement of achievement in reading. It is comforting to find
high correlations in such instances, but this does not guarantee content validity. Both
tests may measure the same wrong things.” (Nunnally, p. 94)
“Although helpful hints are obtained from analyses of statistical findings, content validity
primarily rests upon an appeal to the propriety of content and the way that it is presented.”
(Nunnally, p. 94) [defn: propriety = appropriateness]
Construct Validity – “To the extent that a variable is abstract rather than concrete, we speak of
it as being a construct. Such a variable is literally a construct in that it is something that
scientists put together from their own imaginations, something that does not exist as an isolated,
observable dimension of behavior. A construct represents a hypothesis (usually only half-formed) that a variety of behaviors will correlate with one another in studies of individual
differences and/or will be similarly affected by experimental treatments.” (Nunnally, p.96)
Example: “Take, for example, an experiment where a particular treatment is
hypothesized to raise anxiety. Can the measure of anxiety be validated purely as a
predictor of some specific variable? No, it cannot, because the purpose is to measure the
amount of anxiety then and there, not to estimate scores on any other variable obtained in
the past, present, or future. Also, the measure cannot be validated purely in terms of
content validity. There is no obvious body of “content” (behaviors) corresponding to
anxiety reactions, and if there were, how to measure such content would be far more of a
puzzle than it is with performance in arithmetic.” (Nunnally, p.95)
There are three major aspects of construct validation: (1) specifying the domain of observables
related to the construct; (2) from empirical research and statistical analyses, determining the
extent to which the observables tend to measure the same thing, several different things, or many
different things; and (3) subsequently performing studies of individual differences and/or
controlled experiments to determine the extent to which supposed measures of the construct
produce results which are predictable from highly accepted theoretical hypotheses concerning the
construct.
Validity of Diagnostic Tests
“Two indices are used to evaluate the accuracy or validity of a diagnostic test—sensitivity and
specificity.” (Lilienfeld and Stolley, 1994, p. 118).
In computing the sensitivity and specificity of a new test, we generally compare it to a gold
standard, where the gold standard is accepted to measure what it is intended to measure (recall
Nunnally quote above, “a measuring instrument is valid if it does what it is intended to do”).
This type of validity is called convergent and divergent validity by McDowell and Newell (1996,
p. 33), which is a special case of using correlation to show what Nunnally called predictive
validity in the presentation above.
Convergent and Divergent Validity
McDowell and Newell (1996, p. 33) describe using correlation evidence for validity:
“Hypotheses are formulated which state that the measurement will correlate with other
methods that measure the same concept; the hypotheses are tested in the normal way.
This is known as a test of ‘convergent validity’ and is equivalent to assessing sensitivity.
Where no single criterion exists, the measurement is sometimes compared with several
other indices using multivariate procedures. Hypotheses may also state that the
measurement will not correlate with others which measure different themes. This is
termed divergent validity, and is equivalent to the concept of specificity. For example, a
test of ‘Type A behavior patterns’ may be expected to measure something distinct from
neurotic behavior. Accordingly, a low correlation would be hypothesized between the
Type A scale and a neuroticism index and, if obtained, would lend reassurance that the
test was not simply measuring neurotic behavior. Naturally, this provides little
information on what it does measure.”
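In practice, this correlation evidence can be assembled with ordinary correlations. Here is a minimal Stata sketch, using hypothetical variable names (typeA_scale, typeA_other, neuroticism) that stand in for the measures described in the quote:

* Convergent validity: expect a high correlation between two measures of the
* same construct. Divergent validity: expect a low correlation with a measure
* of a different construct. (Variable names are hypothetical.)
pwcorr typeA_scale typeA_other neuroticism , sig

A high typeA_scale-typeA_other correlation supports convergent validity, while a low typeA_scale-neuroticism correlation supports divergent validity, in the sense of the quote above.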
Exercise 1. Take out the Merlo paper.
1) In the section “Materials and Methods,” look at the paragraph under the subheading
“Validity analysis” (p. 789). This is a good example of sensitivity and specificity being
used as measurements of validity.
2) Find the second-to-last sentence of the Introduction section (p. 788), “By conducting a
validity analysis…”
That’s it. There is no mention of construct validity or content validity. Establishing
predictive validity is sufficient for this article. However, for articles presenting new
measuring instruments, construct and content validity should be discussed.
Sensitivity and Specificity
With the data in the required form for Stata:

                                    Gold Standard “true value”
                              disease present (+)    disease absent (-)      Total
Test “probable value”
   disease present (+)        a (true positives)     c (false positives)     a + c
   disease absent  (-)        b (false negatives)    d (true negatives)      b + d
   Total                      a + b                  c + d                   N
We define the following terminology (Lilienfeld, 1994, p. 118-124), expressed as percents:

sensitivity = (true positives)/(true positives plus false negatives)
            = (true positives)/(all those with the disease)
            = a / (a + b) × 100

specificity = (true negatives)/(true negatives plus false positives)
            = (true negatives)/(all those without the disease)
            = d / (c + d) × 100
Sensitivity and specificity provide information about the accuracy (validity) of a test. Positive
and negative predictive values provide information about the meaning of the test results.
The probability of disease being present given a positive test result is the positive predictive
value:

positive predictive value = (true positives)/(true positives plus false positives)
                          = (true positives)/(all those with a positive test result)
                          = a / (a + c) × 100

The probability of no disease being present given a negative test result is the negative predictive
value:

negative predictive value = (true negatives)/(true negatives plus false negatives)
                          = (true negatives)/(all those with a negative test result)
                          = d / (b + d) × 100
“Unlike sensitivity and specificity, the positive and negative predictive values of a test depend on
the prevalence rate of disease in the population. …For a test of given sensitivity and specificity,
the higher the prevalence of the disease, the greater the positive predictive value and the lower
the negative predictive value.” (Lilienfeld, 1994, p. 122-123)
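To see this prevalence dependence numerically, here is a minimal Stata sketch that computes the positive and negative predictive values from the formulas above for a hypothetical test with sensitivity 0.90 and specificity 0.95 at three prevalences (the numbers are illustrative, not taken from Lilienfeld):

* Hypothetical test: sensitivity = 0.90, specificity = 0.95
clear
input prev
.01
.10
.50
end
gen ppv = (.90*prev) / (.90*prev + (1-.95)*(1-prev))      // P(disease | test +)
gen npv = (.95*(1-prev)) / (.95*(1-prev) + (1-.90)*prev)  // P(no disease | test -)
list prev ppv npv , noobs

As prevalence rises from 1% to 50%, ppv climbs from roughly 15% to about 95%, while npv falls from nearly 100% to about 90%, consistent with the quote above.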
Stata Commands -- Diagnostic Test Statistics
To compute test characteristics (sensitivity, specificity, positive predictive value, and negative
predictive value, etc.) in Stata, the diagt command is used.
If the data are in Stata as variables, you use:
diagt goldvar testvar
Alternatively, you can enter the data directly as cell counts, using the “immediate” form of the
diagt command:
diagti #a #b #c #d
where a, b, c, and d correspond to the cell frequencies in the following table:

                                    Gold Standard “true value”
                              disease present (+)    disease absent (-)      Total
Test “probable value”
   disease present (+)        a (true positives)     c (false positives)     a + c
   disease absent  (-)        b (false negatives)    d (true negatives)      b + d
   Total                      a + b                  c + d                   N
Installing the diagt and diagti commands in Stata
The diagt and diagti commands must first be installed, since they are user contributed commands.
In the command window, run
findit diagt
SJ-4-4  sbe36_2 . . . . . . . . . . . . . . . . . . Software update for diagt
        (help diagt if installed) . . . . . . . . . . P. T. Seed and A. Tobias
        Q4/04   SJ 4(4):490
        new options added to diagt
then click on sbe36_2 to install.
Exercise 2. Replicate the validity analyses for estrogens found in the Merlo (2000) paper.
To practice using diagt with variables in Stata’s memory, first read in the 2 x 2 table data given in
Table 1 of the Merlo paper (these commands are in chapter12.do).
clear
input question diary count
1 0 231
0 1 164
1 1 969
0 0 14696
end
drop if count==0 // not needed here, but would be if any count was 0
expand count
drop count
Using the diary as the gold standard, and questionnaire as the test variable, compute the test
characteristics.
diagt diary question
           |       question       |
     diary |      Pos.      Neg.  |     Total
-----------+----------------------+----------
  Abnormal |       969       164  |     1,133
    Normal |       231    14,696  |    14,927
-----------+----------------------+----------
     Total |     1,200    14,860  |    16,060

True abnormal diagnosis defined as diary = 1

                                                  [95% Confidence Interval]
---------------------------------------------------------------------------
Prevalence                        Pr(A)     7.1%        6.7%        7.46%
---------------------------------------------------------------------------
Sensitivity                     Pr(+|A)    85.5%       83.3%        87.5%
Specificity                     Pr(-|N)    98.5%       98.2%        98.6%
ROC area              (Sens. + Spec.)/2      .92         .91          .93
---------------------------------------------------------------------------
Likelihood ratio (+)    Pr(+|A)/Pr(+|N)     55.3        48.5         62.9
Likelihood ratio (-)    Pr(-|A)/Pr(-|N)     .147        .128         .169
Odds ratio                  LR(+)/LR(-)      376         305          464
Positive predictive value       Pr(A|+)    80.8%       78.4%        82.9%
Negative predictive value       Pr(N|-)    98.9%       98.7%        99.1%
---------------------------------------------------------------------------
Alternatively, we could have skipped reading in the data and simply used the immediate form of
the command,
diagti 969 164 231 14696
      True |                      |
   disease |      Test result     |
    status |      Neg.      Pos.  |     Total
-----------+----------------------+----------
    Normal |    14,696       231  |    14,927
  Abnormal |       164       969  |     1,133
-----------+----------------------+----------
     Total |    14,860     1,200  |    16,060

                                                  [95% Confidence Interval]
---------------------------------------------------------------------------
Prevalence                        Pr(A)     7.1%        6.7%        7.46%
---------------------------------------------------------------------------
Sensitivity                     Pr(+|A)    85.5%       83.3%        87.5%
Specificity                     Pr(-|N)    98.5%       98.2%        98.6%
ROC area              (Sens. + Spec.)/2      .92         .91          .93
---------------------------------------------------------------------------
Likelihood ratio (+)    Pr(+|A)/Pr(+|N)     55.3        48.5         62.9
Likelihood ratio (-)    Pr(-|A)/Pr(-|N)     .147        .128         .169
Odds ratio                  LR(+)/LR(-)      376         305          464
Positive predictive value       Pr(A|+)    80.8%       78.4%        82.9%
Negative predictive value       Pr(N|-)    98.9%       98.7%        99.1%
---------------------------------------------------------------------------
Quirk: There is a quirk with the diagt and diagti commands. The table of test characteristics is
the same, but the outputted data table is in a different sort order. Notice the data table just
generated with the diagti command,
      True |                      |
   disease |      Test result     |
    status |      Neg.      Pos.  |     Total
-----------+----------------------+----------
    Normal |    14,696       231  |    14,927
  Abnormal |       164       969  |     1,133
-----------+----------------------+----------
     Total |    14,860     1,200  |    16,060     <- diagti
is not the order in which the data had to be provided to the diagti command. That makes it difficult to
verify that you, the user, provided the cell counts correctly. [Just remember that when the
sort order is switched for both the row and column variables, the cell counts simply move
diagonally to the opposite corners.] Whereas, the order entered does match the data table
displayed earlier by the diagt command (copied below for ease of comparison):
           |       question       |
     diary |      Pos.      Neg.  |     Total
-----------+----------------------+----------
  Abnormal |       969       164  |     1,133
    Normal |       231    14,696  |    14,927
-----------+----------------------+----------
     Total |     1,200    14,860  |    16,060     <- diagt
Reliability (Nunnally, 1978, Chapters 6,7)
“To the extent to which measurement error is slight, a measure is said to be reliable. Reliability
concerns the extent to which measurements are repeatable.” (Nunnally, p.191)
“...measurements are reliable to the extent that they are repeatable and that any random influence
which tends to make measurements different from occasion to occasion or circumstance to
circumstance is a source of measurement error.” (Nunnally, p.225)
Estimation of Reliability
Internal consistency –“Estimates of reliability based on the average correlation among items
within a test are said to concern the “internal consistency.” This is partly a misnomer, because
the size of the reliability coefficient is based on both the average correlation among items (the
internal consistency) and the number of items. Coefficient alpha is the basic formula for
determining the reliability based on internal consistency. It, or the special version applicable to
dichotomous items (KR-20), should be applied to all new measurement methods. Even if other
estimates of reliability should be made for particular instruments, coefficient alpha should be
obtained first.” (Nunnally, p. 229-230)
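In Stata, coefficient alpha can be obtained with the alpha command. A minimal sketch, assuming a hypothetical new scale whose items are stored in variables item1 through item10:

* Cronbach's coefficient alpha for a set of items (item1-item10 are
* hypothetical variable names)
alpha item1-item10 , item

The item option adds a line for each item showing its correlation with the rest of the scale and what alpha would be if that item were dropped.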
Alternative forms—“In addition to computing coefficient alpha, with most measures it also is
informative to correlate alternative forms.” (Nunnally, p.230)
Example: “An example would be constructing two vocabulary tests, each of which
would be an alternative form for the other.” (Nunnally, p.228)
Use of Reliability Coefficients
“The major use of reliability coefficients is in communicating the extent to which the results
obtained from a measurement method are repeatable. The reliability coefficient is one index of
the effectiveness of an instrument, reliability being a necessary but not sufficient condition for
any type of validity.” (Nunnally, p.237)
The choice of a reliability coefficient is based on the level of measurement. Like correlation
coefficients, reliability coefficients range between 0 and 1, with 0 indicating no reproducibility (no rater
agreement) and 1 indicating perfect reproducibility (perfect rater agreement).
Kappa Statistic and Weighted Kappa Statistic
The Kappa statistic is the most widely used reliability coefficient when the variables being
compared for agreement are unordered categorical (nominal level of measurement).
For ordered categorical data (ordinal level of measurement), the weighted kappa is used.
Kappa Statistic: Two Unique Raters, Two Classification Categories
Above, we used the Merlo paper data to compute test characteristics, similar to what the authors did
in their article. In that article, subjects recorded their drug use using a 7-day diary, and they also
recorded their drug use using a questionnaire. The research question was whether or not a
questionnaire is sufficiently reliable to collect drug use information. When they computed test
characteristics, they made the assumption that the diary was the “gold standard”, which assumes
the data are collected without error. The authors also reported a kappa coefficient, which is used
to evaluate if two raters provide the same measurement value, which does not require that one
rater’s measurements represent the gold standard.
They used the “two unique raters, two classification categories” form of the kappa coefficient.
Since the same subject provided both measurements, it was actually the same rater providing
both measurements, so there were not two unique raters. However, the two measurement
instruments differed, so it was “two unique instruments”. This form of kappa, then, did apply to
their analysis situation.
If the data are not already in Stata memory from doing this above, then input the data using,
clear
input question diary count
1 0 231
0 1 164
1 1 969
0 0 14696
end
drop if count==0 // not needed here, but would be if any count was 0
expand count
drop count
To display the data, use the tabulate command (abbreviated tab) with the “cell” option, to get each
cell’s percentage of the total sample size.
tab diary question , cell
+-----------------+
| Key             |
|-----------------|
|    frequency    |
| cell percentage |
+-----------------+

           |       question       |
     diary |         0          1 |     Total
-----------+----------------------+----------
         0 |    14,696        231 |    14,927
           |     91.51       1.44 |     92.95
-----------+----------------------+----------
         1 |       164        969 |     1,133
           |      1.02       6.03 |      7.05
-----------+----------------------+----------
     Total |    14,860      1,200 |    16,060
           |     92.53       7.47 |    100.00
To compute the kappa coefficient, use
kap diary question
             Expected
Agreement    Agreement     Kappa   Std. Err.         Z      Prob>Z
------------------------------------------------------------------
  97.54%      86.53%      0.8174     0.0079      103.64     0.0000
This displayed the “observed agreement”, which is the percent of the sample size on the main
diagonal (both methods provided the same score):

    91.51% + 6.03% = 97.54%

The “expected agreement” is computed in the same way that expected cell frequencies are computed
when testing the minimum expected frequency assumption required for the chi-square test to be
appropriate for a given crosstabulation table. Obtaining the expected cell counts,
tab diary question , expect
+--------------------+
| Key                |
|--------------------|
|     frequency      |
| expected frequency |
+--------------------+

           |       question       |
     diary |         0          1 |     Total
-----------+----------------------+----------
         0 |    14,696        231 |    14,927
           |  13,811.7    1,115.3 |  14,927.0
-----------+----------------------+----------
         1 |       164        969 |     1,133
           |   1,048.3       84.7 |   1,133.0
-----------+----------------------+----------
     Total |    14,860      1,200 |    16,060
           |  14,860.0    1,200.0 |  16,060.0
Converting the expected cell frequencies on the main diagonal into expected percents,
display ((13811.7+84.7)/16060)*100
86.52802
This agrees with the expected agreement in the kap output, shown again here:

             Expected
Agreement    Agreement     Kappa   Std. Err.         Z      Prob>Z
------------------------------------------------------------------
  97.54%      86.53%      0.8174     0.0079      103.64     0.0000
Kappa is defined as the amount of agreement in excess of chance agreement. It has the form
(Fleiss, Levin, Paik, 2003, p.603),
“The obtained excess beyond chance is Io – Ie, whereas the maximum possible excess is
1 – Ie. The ratio of these two differences is called kappa,

    κ̂ = (Io – Ie) / (1 – Ie).”
Plugging these values into the kappa formula,
display "kappa = " (.9754-.8653)/(1-.8653)
kappa = .81737194
This value agrees with the kappa = 0.817 reported in Merlo’s Table 1.
Notice how Merlo reports kappa in the Results Section (page 789, first sentence under Reliability
analysis subheading). To make sure the interpretation of the kappa coefficient is understood,
they describe it,
“We conducted a reliability analysis to assess to what extent the questionnaire and the
personal diary agreed (i.e., the ability to replicate results whether or not the information
was correct). We calculated the percentage of agreement and the related kappa coefficient
(4) for dichotomous (yes vs. no) self-reported current use of hormone therapy. The kappa
coefficient is a measure of the degree of nonrandom agreement between two
measurements of the same categorical variable (5). A p value of <0.05 was required for
rejection of the null hypothesis of no agreement other than by chance.”
Kappa Statistic: Two Unique Raters, Two or More Unordered Classification Categories
We will practice with the dataset boydrater.dta. This dataset (StataCorp, 2007, p.85) came
from the Stata website [use http://www.stata-press.com/data/r10/rate2]. It is a subset
of data from Boyd et al. (1982) and is discussed in the context of kappa in Altman (1991, p.403-405).
The data represent classifications by two radiologists of 85 xeromammograms as normal,
benign disease, suspicion of cancer, or cancer.
Reading the data into Stata,
File
Open
Find the directory where you copied the course CD
Change to the subdirectory datasets & do-files
Single click on boydrater.dta
Open
use "C:\Documents and Settings\u0032770.SRVR\Desktop\
Biostats & Epi With Stata\datasets & do-files\boydrater.dta",
clear
*
which must be all on one line, or use:
cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
cd " Biostats & Epi With Stata\datasets & do-files"
use boydrater, clear
Tabulating the two raters classifications,
Statistics
Summaries, tables & tests
Tables
Two-way tables with measures of association
Main tab: Row variable: rada
Column variable: radb
OK
tabulate rada radb
Radiologist |
        A's |            Radiologist B's assessment
 assessment |  1.normal   2.benign  3.suspect   4.cancer |     Total
------------+--------------------------------------------+----------
   1.normal |        21         12          0          0 |        33
   2.benign |         4         17          1          0 |        22
  3.suspect |         3          9         15          2 |        29
   4.cancer |         0          0          0          1 |         1
------------+--------------------------------------------+----------
      Total |        28         38         16          3 |        85
We see that the classifications are actually on an ordinal scale (ordered categorical data), but we
will first treat them as nominal (unordered categorical data) to illustrate the kappa statistic.
(Later, we will compute the weighted kappa statistic.)
Restating the definition, kappa is the amount of agreement in excess of chance agreement. It has
the form (Fleiss, Levin, Paik, 2003, p.603),
“The obtained excess beyond chance is Io – Ie, whereas the maximum possible excess is
1 – Ie. The ratio of these two differences is called kappa,

    κ̂ = (Io – Ie) / (1 – Ie).”
The amount of agreement is simply the proportion of observations on the main diagonal of the
crosstabulation table.
disp (21+17+15+1)/85
.63529412
The amount of agreement expected by chance is the proportion of expected cell frequencies on
the main diagonal, computed in the same way as for the chi-square test,
expected cell frequency = (column total)(row total)/(grand total)
disp (28*33/85+38*22/85+16*29/85+3*1/85)/85
.30823529
Computing kappa,
display (.63529412-.30823529)/(1-.30823529)
.47278912
Calculating kappa in Stata,
kap rada radb
             Expected
Agreement    Agreement     Kappa   Std. Err.         Z      Prob>Z
------------------------------------------------------------------
  63.53%      30.82%      0.4728     0.0694        6.81     0.0000
The Stata manual provides a nice interpretation of this result (StataCorp, 2007, p.85),
“If each radiologist had made his determination randomly (but with probabilities equal to
the overall proportions), we would expect the two radiologists to agree on 30.8% of the
patients. In fact, they agreed on 63.5% of the patients, or 47.3% of the way between
random agreement and perfect agreement. The amount of agreement indicates that we
can reject the hypothesis that they are making their determinations randomly.”
Kappa is scaled such that 0 is the amount of agreement that would be expected by chance, and 1
is perfect agreement. Thus, the test statistic, z = 6.81, is a test of the null hypothesis, H0: κ = 0 .
It is of little interest to test kappa against zero, since acceptable interrater reliability is a number
much greater than zero, so the p value is not worth reporting. Instead, report kappa along with a
95% confidence interval.
To obtain a confidence interval for kappa, you must first add the command kapci to your Stata.
To do this, first use,
findit kapci
SJ-4-4  st0076 . . . . . . . . . Confidence intervals for the kappa statistic
        (help kapci if installed) . . . . . . . . . . . . . M. E. Reichenheim
        Q4/04   SJ 4(4):421--428
        confidence intervals for the kappa statistic
and then click on the st0076 link to install.
Since this is a user-contributed command (Reichenheim, 2004), to see its help description,
use,
help kapci
If the kappa is for a dichotomous variable, the kapci command uses a formula approach (Fleiss,
1981). For all other cases, the kapci command uses a bootstrap approach (Efron and Tibshirani
1993; Lee and Fung 1993).
Obtaining the confidence interval,
kapci rada radb
Note: default number of bootstrap replications has been set to 5 for syntax
      testing only. reps() needs to be increased when analysing real data.

B=5 N=85
------------------------------------------------
Kappa (95% CI) = 0.473 (0.293 - 0.529)      (BC)
------------------------------------------------
BC = bias corrected
We observe it used a bootstrap approach, but with only 5 repetitions. A minimum of 200
repetitions should be used, but 1,000 repetitions is recommended.
Obtaining the bootstrapped confidence interval, based on 1,000 repetitions,
kapci rada radb , reps(1000)
This may take quite a long time. Please wait ...

B=1000 N=85
------------------------------------------------
Kappa (95% CI) = 0.473 (0.326 - 0.612)      (BC)
------------------------------------------------
BC = bias corrected
There are many methods of bootstrapping. The default is “bias corrected”, which is the most
popular.
Beyond the two-rater, dichotomous-variable case, confidence interval formulas have not
been developed, at least none that I know of. Reichenheim (2004, p. 76), the author of the
kapci command, states the same thing,
“Computer efficiency is the main advantage of using an analytical procedure. Alas, to the
best of the author’s knowledge, no such method has been developed to accommodate
more complex analysis beyond the simple 2 × 2 case.”
Therefore, you should definitely state that you used a bootstrapped CI for kappa when that is
what you have done, because your reader, or reviewer, will wonder how you computed the CI.
Protocol Suggestion
Here is some example wording for describing what was just done,
Inter-rater reliability will be measured using the kappa coefficient and reported with a
bootstrapped, bias-corrected 95% confidence interval (Reichenheim, 2004;
Carpenter and Bithell, 2000).
Interpreting Kappa
Landis and Koch (1977, p.165) suggested the following guideline for the evaluation of Kappa:
Interpretation of Kappa

    kappa          interpretation
    < 0            poor agreement
    0.00 - 0.20    slight agreement
    0.21 - 0.40    fair agreement
    0.41 - 0.60    moderate agreement
    0.61 - 0.80    substantial agreement
    0.81 - 1.00    almost perfect agreement
Kappa is scaled such that 0 is the amount of agreement that would be expected by chance, and 1
is perfect agreement.
PABAK (Prevalence and Bias Adjusted Kappa)
When the observed cell counts in a crosstabulation table of two raters bunch up in any corner of
the table, the kappa coefficient is known to produce paradoxical results. An alternative form of
kappa, called PABAK, has been proposed as a solution to this problem.
Consider the 2 x 2 table of dichotomous ratings (yes or no) from two raters, such as two
radiologists reading X-rays to determine the presence or absence of a tumor. Expressed using
Byrt et al.'s (1993) notation,
                               Observer B
                          Yes        No       Total
   Observer A    Yes       a          b         f1
                 No        c          d         f2
                 Total     g1         g2         N
The proportion of observed agreement is

    po = (a + d) / N
Some proportion of this agreement, however, can occur by chance. Observer A can score yes’s
completely independently of Observer B, and so whatever agreements occur would
just be chance. For example, radiologist A can just flip a coin to decide if a tumor is present, and
radiologist B can likewise flip a coin. The instances of them both getting heads or both
getting tails are just chance occurrences.
The proportion of agreement expected by chance (see box) is

    pe = (f1·g1 + f2·g2) / N²
Kappa is defined as the amount of agreement in excess of chance agreement.
The obtained excess beyond chance is
po – pe
The maximum possible excess is
1 – pe
The ratio of these two differences is called kappa,
    K = (po – pe) / (1 – pe)
Expected chance agreement
The expected agreement comes from the “multiplication rule for independent events” in
probability. If two events, A and B, are independent, then the probability they will both occur is:
P(AB) = P(A)P(B) , where P(AB) = probability both occur
P(A) = probability A will occur
P(B) = probability B will occur
                               Observer B
                          Yes        No       Total
   Observer A    Yes       a          b         f1
                 No        c          d         f2
                 Total     g1         g2         N
A probability is just the proportion of times an event occurs, so
P(observer A scores Yes) = f1/N , P(observer A scores No) = f2/N
and
P(observer B scores Yes) = g1/N , P(observer B scores No) = g2/N
If the observers score independently, with no correlation between the two (such as not using a
common criterion like educated professional judgment), then by the multiplication rule for
independent events,
P(both score Yes by chance) = (f1/N)(g1/N) = f1·g1/N²
P(both score No by chance) = (f2/N)(g2/N) = f2·g2/N²
Adding these to get the total proportion of times an agreement occurs by chance,
P(both score Yes or both score No) = f1·g1/N² + f2·g2/N² = (f1·g1 + f2·g2) / N²
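As a quick check of this box against the Merlo table used earlier (diary totals 1,133 and 14,927; questionnaire totals 1,200 and 14,860; N = 16,060), the chance-expected agreement can be computed directly in Stata:

* Chance-expected agreement from the margin totals of the Merlo 2 x 2 table
display "expected agreement = " (1133*1200 + 14927*14860)/(16060^2)

which returns approximately 0.8653, matching the 86.53% expected agreement reported by kap above.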
**** Left to Add ****
Using Byrt paper, describe anomalies due to prevalence and bias.
Expressed using Looney and Hagan’s (2008) notation,
                               Observer B
                          Yes        No       Total
   Observer A    Yes      n11        n12        n1.
                 No       n21        n22        n2.
                 Total    n.1        n.2         n
the formula for PABAK is,

    PABAK = [(n11 + n22) – (n12 + n21)] / n = 2·p0 – 1
Looney and Hagan (2008, p.115) point out,
“Note that PABAK is equivalent to the proportion of ‘agreements’ between the variables
minus the proportion of ‘disagreements’.”
Looney and Hagan (2008, pp.115-116) provide formulas for the variance and confidence interval,

“The approximate variance of PABAK is given by

    estimated Var(PABAK) = 4·p0·(1 – p0) / n

and the approximate 100(1 – α)% confidence limits for the true value of PABAK are given
by

    estimated PABAK ± z(α/2) · sqrt[estimated Var(PABAK)].”
Calculating PABAK in Stata: pabak.ado file
I programmed this for the 2 x 2 table case (2 raters, 2 possible outcomes). Either copy the file
pabak.ado to the directory:
C:\ado\personal
so that it is always available, or change directories to the directory it is in so that it is temporarily
available.
Reading in the data for the example given in Looney and Hagan (2008, p.116),
* -- data in Looney and Hagan (p.116)
clear
input biomarkera biomarkerb count
1 1 80
1 0 15
0 1 5
0 0 0
end
drop if count==0
expand count
drop count
Displaying the data and calculating kappa,
tab biomarkera biomarkerb
kap biomarkera biomarkerb
           |      biomarkerb      |
biomarkera |         0          1 |     Total
-----------+----------------------+----------
         0 |         0          5 |         5
         1 |        15         80 |        95
-----------+----------------------+----------
     Total |        15         85 |       100

             Expected
Agreement    Agreement     Kappa   Std. Err.         Z      Prob>Z
------------------------------------------------------------------
  80.00%      81.50%     -0.0811     0.0841       -0.96     0.8324
We see that kappa = -0.08, which is essentially kappa = 0, even though the observed agreement is 80%,
which is very high. The negative sign comes from the observed agreement being worse than the
expected agreement, so agreement was actually worse than what would be expected simply by
chance. The problem is that the kappa coefficient fails with some particular data patterns.
These are well described in Byrt et al (1993).
Calculating PABAK with the analytic (formula based) 95% confidence interval,
pabak biomarkera biomarkerb
Interrater Reliability

                  biomarkerb
                    1       0
               -----------------
biomarkera  1 |    80      15 |      95
            0 |     5       0 |       5
               -----------------
                   85      15        100

PABAK = 0.6000 , 95% CI (0.4432 , 0.7568)
The PABAK and confidence limits agree exactly with those provided in Looney and Hagan
(2008, p.116), so you can feel comfortable that the pabak.ado file was correctly programmed.
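As a hand check of the Looney and Hagan formulas given earlier, using the observed agreement p0 = (80 + 0)/100 = 0.80 and n = 100 from this table:

* PABAK = 2*p0 - 1, with analytic variance 4*p0*(1-p0)/n
display "PABAK = " 2*.80 - 1
display "SE    = " sqrt(4*.80*(1-.80)/100)
display "95% CI: " (2*.80-1) - invnormal(.975)*sqrt(4*.80*(1-.80)/100) ///
        " to " (2*.80-1) + invnormal(.975)*sqrt(4*.80*(1-.80)/100)

which reproduces the 0.6000 and (0.4432 , 0.7568) shown above.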
To verify that the analytic CI is reasonable, we can bootstrap the CI for PABAK, using
bootstrap r(pabak), reps(1000) size(100) seed(999) bca: ///
pabak biomarkera biomarkerb
estat bootstrap, all
Bootstrap results                               Number of obs      =       100
                                                Replications       =      1000

      command:  pabak biomarkera biomarkerb
        _bs_1:  r(pabak)

------------------------------------------------------------------------------
             |    Observed               Bootstrap
             |       Coef.       Bias    Std. Err.  [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _bs_1 |          .6     .00252    .07922074   .4447302   .7552698   (N)
             |                                            .44        .75   (P)
             |                                            .46        .76  (BC)
             |                                            .46        .76 (BCa)
------------------------------------------------------------------------------
(N)    normal confidence interval
(P)    percentile confidence interval
(BC)   bias-corrected confidence interval
(BCa)  bias-corrected and accelerated confidence interval
We see that the analytic CI was in agreement with these four bootstrapped CI approaches.
Protocol Suggestion for PABAK
Reliability will be measured using the Kappa coefficient, which is the proportion of agreement
beyond expected chance agreement. Although Kappa remains the most widely used measure of
agreement, several authors have pointed out data patterns that produce a Kappa with paradoxical
results. For example, if the agreements bunch up in one of the agreement cells (prevalence) or the
disagreements bunch up in one of the disagreement cells (bias), then the Kappa statistic is
paradoxically different from that of a crosstabulation table with more evenly distributed agreements and
disagreements, even though the percentages of agreement and disagreement do not change.
Therefore, the prevalence-adjusted bias-adjusted kappa (PABAK) will also be reported, which
gives the true proportion of agreement beyond expected chance agreement regardless of
unbalanced data patterns (Byrt, 1993). The reliability between the nurse observers and the nurse
trainer will be computed. This analysis will be stratified by study sites, reporting site-specific
percent agreement, Kappa and PABAK coefficients, as well as a summary Kappa, which is a
weighted average across the study sites, along with confidence intervals.
Intraclass Correlation Coefficient (ICC)
For a continuous rating (interval scale), interrater reliability is measured with the intraclass
correlation coefficient (ICC), also called the intracluster correlation coefficient (ICC) or the
reliability coefficient (r or rho).
To compute the ICC, we use the formula (Streiner and Norman, 1995, p.106; Shrout and Fleiss,
1979),
    reliability = subject variability / (subject variability + measurement error)

expressed symbolically as,

    ρ = σs² / (σs² + σe²)
Note, however, that these sigma’s are population parameters. Population parameters are
estimated using the expected value of sample statistics, where the expected value is the long-run
average.
The ICC cannot be computed, then, simply from the MS(between) and MS(within) from an
analysis of variance table. For any ANOVA, depending on whether it is for a fixed effect,
random effect, multiple raters for each subject, separate raters for each subject, etc, the expected
mean squares (EMS) are different equations containing the MS(between) and MS(within). That
is, all versions of the ICC use the same MS(between) and MS(within) from the ANOVA table,
but the EMS(between) and EMS(within) have slightly different equations for each situation.
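For example, for the simplest one-way random-effects design with k ratings per subject, the ANOVA-based estimate works out to (Shrout and Fleiss, 1979; their ICC(1,1), given here only for illustration):

    ICC = [MS(between) – MS(within)] / [MS(between) + (k – 1)·MS(within)]

Other designs (e.g., raters treated as fixed or random) lead to different combinations of the same mean squares.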
Using the dataset provided in Table 8.1 of Streiner and Norman (1995),
File
Open
Find the directory where you copied the course CD
Change to the subdirectory datasets & do-files
Single click on streiner and normantable81.dta
Open
use "C:\Documents and Settings\u0032770.SRVR\Desktop\
Biostats & Epi With Stata\datasets & do-files\
streiner and normantable81boydrater.dta", clear
*
which must be all on one line, or use:
cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
cd "Biostats & Epi With Stata\datasets & do-files"
use "streiner and normantable81", clear
These data are ratings of 10 patients by three observers, each rating on a 10-point scale for
some attribute, such as sadness.
Listing the data,
list, sep(0)

     +---------------------------------------+
     | patient   observ1   observ2   observ3 |
     |---------------------------------------|
  1. |       1         6         7         8 |
  2. |       2         4         5         6 |
  3. |       3         2         2         2 |
  4. |       4         3         4         5 |
  5. |       5         5         4         6 |
  6. |       6         8         9        10 |
  7. |       7         5         7         9 |
  8. |       8         6         7         8 |
  9. |       9         4         6         8 |
 10. |      10         7         9         8 |
     +---------------------------------------+
Drawing a scatter diagram, showing the ratings for each patient lined up vertically,
sort patient
twoway (scatter observ1 patient) (scatter observ2 patient) ///
       (scatter observ3 patient)

[Scatter plot: observ1, observ2, and observ3 plotted against patient (x axis: patient, 0 to 10;
y axis: rating, 2 to 10), with the three ratings for each patient lined up vertically.]
We see that the ratings for the same patient appear more alike than ratings between patients,
suggesting a high value of the ICC.
To calculate the ICC in Stata, we must first reshape the data to long format. Then, we will list
the data for the first two patients to verify the data are reshaped correctly.
reshape long observ , i(patient) j(observerID)
list if patient<=2 , sepby(patient) abbrev(15)
     +-------------------------------+
     | patient   observerID   observ |
     |-------------------------------|
  1. |       1            1        6 |
  2. |       1            2        7 |
  3. |       1            3        8 |
     |-------------------------------|
  4. |       2            1        4 |
  5. |       2            2        5 |
  6. |       2            3        6 |
     +-------------------------------+
In Stata, the ICC can be computed by treating observer as a fixed effect and patient as a random
effect, using
xi: xtreg observ i.observerID, i(patient) mle // Stata-10
* <or>
xtreg observ i.observerID, i(patient) mle // Stata-11
Random-effects ML regression                    Number of obs      =        30
Group variable: patient                         Number of groups   =        10

Random effects u_i ~ Gaussian                   Obs per group: min =         3
                                                               avg =       3.0
                                                               max =         3

Log likelihood  = -47.804751                    LR chi2(2)         =     21.97
                                                Prob > chi2        =    0.0000

------------------------------------------------------------------------------
      observ |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Iobserver~2 |          1    .316217     3.16   0.002      .380226    1.619774
_Iobserver~3 |          2    .316217     6.32   0.000     1.380226    2.619774
       _cons |          5   .6429085     7.78   0.000     3.739923    6.260077
-------------+----------------------------------------------------------------
    /sigma_u |    1.90613   .4459892                      1.205013    3.015182
    /sigma_e |   .7071068   .1117939                      .5186921     .963963
         rho |   .8790323   .0609117                      .7179378    .9611003
------------------------------------------------------------------------------
Likelihood-ratio test of sigma_u=0: chibar2(01)=    32.10 Prob>=chibar2 = 0.000
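As a check of the reliability formula given above, rho can be reproduced directly from the sigma_u (subject) and sigma_e (error) estimates in this output:

* ICC = sigma_u^2 / (sigma_u^2 + sigma_e^2), using the estimates shown above
display "ICC = " 1.90613^2 / (1.90613^2 + .7071068^2)

which returns approximately .879, matching rho.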
The interpretation is that 88% of the variance in the observations was the result of “true”
variance among patients (Streiner and Norman, 1995, p.111). That is, 88% of the variance is due
to the background variability among patients, or patient-to-patient differences, rather than due to
lack of agreement among the raters or measurement error. That interpretation is simply
consistent with the formula. It is a roundabout way of describing the agreement among the
raters, which is excellent agreement in this example. This is discussed further in the
“Interpreting ICC” section below.
This ICC = 0.88 agrees with the Streiner and Norman (1995, p.111) calculation. Here we treated
the patient as a random effect and observer as a fixed effect. Thus, we are making an inference
to the population of all patients, but only to these three specific observers.
This is called the “classical” definition of reliability (Streiner and Norman, 1995, p.111). This is
the situation where reliability is calculated to demonstrate the observers used in the particular
study had good interrater reliability. In this situation, there is no need to make an inference to
other observers.
Anomalous Values of ICC
In the section above, “PABAK (Prevalence and Bias Adjusted Kappa)”, it was pointed out that
when the observed cell counts in a crosstabulation table of two raters bunch up in any corner of
the table, the kappa coefficient is known to produce paradoxical results. That is, it gives values
of kappa that are much smaller than appear consistent with the data. This results from
the limitations inherent in the kappa formula.
A similar thing can occur with ICC. When the variability among the raters is larger than the
variability among the subjects, the ICC becomes negative. Since ICC is defined to be a number
between 0 and 1, most software packages, including Stata, will set the ICC to 0, indicating no
agreement among the raters. (The SPSS software allows negative ICC values, which makes
interpretation difficult.) What if your sample just happens to be a group of subjects who have very
similar values of the outcome variable? If the subject similarity is closer to, but larger than, the
similarity among the raters, then ICC will be a positive low value. If this similarity is tighter than
the similarity among the raters, then ICC will be 0. Now, in a different sample of patients, using
the same raters, but this time where the subjects differ from each other so that there is a wide
range of values on the outcome variable, the ICC will be a high value. These disparate results are
due solely to subject variability, having nothing to do with how well the raters actually agree with
each other.
To illustrate this anomaly, we will use the following dataset.
clear
input id scaleA0 scaleB0 scaleC0 scaleA1 scaleB1 scaleC1
1 40 45 49 42 47 51
2 60 55 50 59 54 49
3 55 50 48 59 54 52
4 58 54 52 56 52 50
5 52 51 52 53 52 53
6 45 49 51 46 50 52
7 48 48 49 50 50 51
8 58 53 51 60 55 53
9 53 52 51 54 53 52
10 55 51 50 57 53 52
11 45 49 52 42 46 49
12 51 45 51 52 46 52
13 52 51 49 54 53 51
14 57 53 51 58 54 52
15 48 48 49 49 49 50
16 58 53 51 59 54 52
17 44 50 52 40 46 48
end
In this dataset, three measurements are recorded (scales A, B, and C) at pretest (0) and then again
at posttest (1). This would occur, for example, when a subject records how they feel on a
standardized test at the first baseline visit. Then at their second visit one week later, you ask
them to score how they feel again before the intervention is started. Given the two baseline
measurements, you can assess the test-retest reliability of the standardized test using an ICC
reliability coefficient.
The way this dataset is contrived, the difference between the pretest and posttest scores provided
by the subjects is exactly the same for each of the three subscales, A, B, and C. The only thing
that varies is the range of scores. For subscale A, the range is 20 (min = 40, max = 60); for
subscale B, the range is 10 (min = 45, max = 55); and for subscale C, the range is 4 (min = 48,
max = 52).
For example, for subject 1, the difference between pretest and posttest is 2 for each of the three
subscales. For subject 2, the difference is 1 for each of the three subscales, and so on. Thus, the
agreement is consistent across the three subscales. It is just the range of scores, which is the
variability between subjects, that varies.
In the following figure, we see that the ICC gets smaller as the subjects provide more
homogeneous scores (subjects become more alike). When this homogeneity gets tight enough,
the ICC becomes 0.

[Figure: three panels of test-retest plots, one pair of ratings per subject on the x axis, with the
y axis (score) running from 30 to 70. Panel 1: “Scale A test and retest”, ICC = 0.95.
Panel 2: “Scale B test and retest”, ICC = 0.78. Panel 3: “Scale C test and retest”, ICC = 0.00.]

Figure. For each specific subject, the difference between the two ratings is exactly
the same in the three graphs. The only thing that varies is the range of values
on the Score variable.
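If you want to reproduce the ICC shown in a panel, here is a minimal sketch for scale A (it assumes the contrived dataset above is still in memory; the same steps apply to scales B and C):

preserve
keep id scaleA0 scaleA1
reshape long scaleA, i(id) j(time)    // time = 0 (pretest), 1 (posttest)
xtreg scaleA i.time, i(id) mle        // Stata-11 syntax; prefix with xi: for Stata-10
restore

The rho reported by xtreg is the test-retest ICC plotted in the corresponding panel.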
To be fair to the ICC statistic, one could say that scale B is worse than scale A, and scale C is
worse than scale B. This would be the case if the scales really did have very different abilities to
discriminate between subjects. So, this does not always represent an anomaly. Be careful, then,
about casually concluding your data represent an anomaly.
However, if you were to use the scales with a more diverse set of patients, so that patient
variability increased, the ICCs would most likely go up, even though nothing about the scale or
the “real” reliability remained the same, thus presenting the anomaly. This contrived dataset was
attempting to represent that particular scenario. Simply think of scales A, B, and C to be the
same scale, only measured on three sets of patients, where homogeneity of patients varies due to
inclusion-exclusion criteria. Then, this dataset really does illustrate the anomaly.
To study inter- and intra-rater reliability, a good study design attempts to find a heterogeneous set
of patients, which avoids the anomaly described here. Those are the study designs that correctly
measure reliability for a measurement scale.
Frequently, however, researchers use datasets that have strict inclusion/exclusion criteria in order
to reduce heterogeneity (controlling for possible confounding using the restriction approach), in order
to test a study hypothesis. So, when they also attempt to assess inter- and intra-rater reliability
within their studies, these values are lower than the scale is capable of. These are the studies in
which the anomaly will most frequently show up.
Streiner and Norman (1995, p.122) describe this in the context of how to improve the ICC,
“…An alternative approach, which is not legitimate, is to administer the test to a
more heterogeneous group of subjects for the purpose of determining reliability. For
example, if a measure of function in arthritis does not reliably discriminate among
ambulatory arthritics, administering the test to both normal subjects and to bedridden
hospitalized arthritics will almost certainly improve reliability. Of course, the resulting
reliability no longer yields any information about the ability of the instrument to
discriminate among ambulatory patients.
By contrast, it is sometimes the case that a reliability coefficient derived from a
homogeneous population is to be applied to a population which is more heterogeneous. It
is clear from the above discussion that the reliability in the application envisioned will be
larger than that determined in the homogeneous study population….”
If you are curious how the above figure with the three ICC graphs was generated, here is the Stata
code.
#delimit ;
twoway (rscatter scaleA0 scaleA1 id , symbol(circle) color(red))
(rspike scaleA0 scaleA1 id , color(red))
, ytitle("Scale A test and retest") xtitle("subject")
xlabels(0 " " 18 " ", notick)
ylabels(30(4)70, angle(horizontal))
text(68 8.5 "ICC = 0.95",placement(c) size(*2))
legend(off) plotregion(style(none)) scheme(s1color)
saving(tempa, replace)
;
#delimit cr
*
#delimit ;
twoway (rscatter scaleB0 scaleB1 id , symbol(circle) color(red))
(rspike scaleB0 scaleB1 id , color(red))
, ytitle("Scale B test and retest") xtitle("subject")
xlabels(0 " " 18 " ", notick)
ylabels(30(4)70, angle(horizontal))
text(68 5 "ICC = 0.78" ,placement(e) size(*2))
legend(off) plotregion(style(none)) scheme(s1color)
saving(tempb, replace)
;
#delimit cr
*
#delimit ;
twoway (rscatter scaleC0 scaleC1 id , symbol(circle) color(red))
(rspike scaleC0 scaleC1 id , color(red))
, ytitle("Scale C test and retest") xtitle("subject")
xlabels(0 " " 18 " ", notick)
ylabels(30(4)70, angle(horizontal))
text(68 5 "ICC = 0.00" ,placement(e) size(*2))
legend(off) plotregion(style(none)) scheme(s1color)
saving(tempc, replace)
;
#delimit cr
*
* -- combine graphs into single figure
graph combine tempa.gph tempb.gph tempc.gph , scheme(s1color)
erase tempa.gph // delete temporary graph files from hard drive
erase tempb.gph
erase tempc.gph
Equivalence of Kappa and Intraclass Correlation Coefficient (ICC)
Unweighted Kappa
For dichotomous ratings, kappa and ICC are equivalent [Fleiss, Levin, and Paik (2003, p.604);
Streiner and Norman (1995, p.118)].
To verify this equivalence for the dichotomous rating, we read in the Boyd Rater dataset,
File
Open
Find the directory where you copied the course CD
Find the subdirectory datasets & do-files
Single click on boydrater.dta
Open
use "C:\Documents and Settings\u0032770.SRVR\Desktop\
Biostats & Epi With Stata\datasets & do-files\boydrater.dta",
clear
*
which must be all on one line, or use:
cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
cd " Biostats & Epi With Stata\datasets & do-files"
use boydrater, clear
Recoding the classifications to an indicator of abnormal,
normal -> normal
benign, suspect, cancer -> abnormal
tab rada radb
recode rada 1=0 2/4=1 , gen(abnormal1)
recode radb 1=0 2/4=1 , gen(abnormal2)
tab abnormal1 abnormal2
Radiologist |
        A's |            Radiologist B's assessment
 assessment |  1.normal   2.benign  3.suspect   4.cancer |     Total
------------+--------------------------------------------+----------
   1.normal |        21         12          0          0 |        33
   2.benign |         4         17          1          0 |        22
  3.suspect |         3          9         15          2 |        29
   4.cancer |         0          0          0          1 |         1
------------+--------------------------------------------+----------
      Total |        28         38         16          3 |        85

   RECODE of |
        rada |     RECODE of radb
(Radiologist |    (Radiologist B's
         A's |      assessment)
 assessment) |         0          1 |     Total
-------------+----------------------+----------
           0 |        21         12 |        33
           1 |         7         45 |        52
-------------+----------------------+----------
       Total |        28         57 |        85
Calculating kappa,
kap abnormal1 abnormal2
             Expected
Agreement    Agreement     Kappa   Std. Err.         Z      Prob>Z
------------------------------------------------------------------
  77.65%      53.81%      0.5160     0.1076        4.80     0.0000
Reshaping the data to long format and then calculating ICC,
gen patient=_n // create a patient ID
reshape long abnormal , i(patient) j(radiologist)
xi: xtreg abnormal i.radiologist, i(patient) mle
Random-effects ML regression                    Number of obs      =       170
Group variable: patient                         Number of groups   =        85

Random effects u_i ~ Gaussian                   Obs per group: min =         2
                                                               avg =       2.0
                                                               max =         2

Log likelihood  = -102.60835                    LR chi2(1)         =      1.33
                                                Prob > chi2        =    0.2495

------------------------------------------------------------------------------
    abnormal |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Iradiolog~2 |   .0588235   .0508824     1.16   0.248    -.0409041    .1585512
       _cons |   .6117647    .051928    11.78   0.000     .5099877    .7135417
-------------+----------------------------------------------------------------
    /sigma_u |   .3452094   .0405841                      .2741651    .4346634
    /sigma_e |   .3317146    .025441                      .2854181    .3855208
         rho |   .5199275   .0791436                      .3671773    .6697714
------------------------------------------------------------------------------
Likelihood-ratio test of sigma_u=0: chibar2(01)=    26.79 Prob>=chibar2 = 0.000
We see kappa and ICC are identical to two decimal places.
Weighted Kappa
For ordered categorical (ordinal scale) ratings, the weighted kappa and ICC are equivalent,
provided the weights are taken as [Fleiss, Levin, and Paik (2003, p.609); Streiner and Norman
(1995, p.118)],

    wij = 1 – (i – j)² / (k – 1)²        (the “wgt(w2)” option in Stata)
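As a quick hand check of these weights for the k = 4 rating categories used below, the distinct values can be computed with display:

* w2 weights for k = 4: w_ij = 1 - (i-j)^2/(k-1)^2
display "weight for |i-j| = 0: " 1 - 0^2/(4-1)^2
display "weight for |i-j| = 1: " 1 - 1^2/(4-1)^2
display "weight for |i-j| = 2: " 1 - 2^2/(4-1)^2
display "weight for |i-j| = 3: " 1 - 3^2/(4-1)^2

These reproduce the 1.0000, 0.8889, 0.5556, and 0.0000 entries in the weight matrix that kap prints below.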
To verify this equivalence for ordinal scale ratings, we again use the Boyd Rater dataset, but this
time use the original ordinal scaled ratings.
Reading in the original dataset,
File
Open
Find the directory where you copied the course CD
Change to the subdirectory datasets & do-files
Single click on boydrater.dta
Open
use "C:\Documents and Settings\u0032770.SRVR\Desktop\
Biostats & Epi With Stata\datasets & do-files\boydrater.dta",
clear
*
which must be all on one line, or use:
cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
cd " Biostats & Epi With Stata\datasets & do-files"
use boydrater, clear
Calculating weighted kappa, using the wt2 weights,
kap rada radb , wgt(w2)
Ratings weighted by:
1.0000  0.8889  0.5556  0.0000
0.8889  1.0000  0.8889  0.5556
0.5556  0.8889  1.0000  0.8889
0.0000  0.5556  0.8889  1.0000

             Expected
Agreement    Agreement     Kappa   Std. Err.         Z      Prob>Z
------------------------------------------------------------------
  94.77%      84.09%      0.6714     0.1079        6.22     0.0000
Reshaping the data to long format and then calculating ICC,
rename rada rad1
rename radb rad2
gen patient=_n // create a patient ID
reshape long rad , i(patient) j(radiologist)
xi: xtreg rad i.radiologist, i(patient) mle
Random-effects ML regression                    Number of obs      =       170
Group variable: patient                         Number of groups   =        85

Random effects u_i ~ Gaussian                   Obs per group: min =         2
                                                               avg =       2.0
                                                               max =         2

Log likelihood  = -187.11654                    LR chi2(1)         =      0.40
                                                Prob > chi2        =    0.5266

------------------------------------------------------------------------------
         rad |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Iradiolog~2 |  -.0470588   .0742292    -0.63   0.526    -.1925453    .0984277
       _cons |   1.976471   .0917076    21.55   0.000     1.796727    2.156214
-------------+----------------------------------------------------------------
    /sigma_u |   .6933196   .0673845                      .5730655    .8388082
    /sigma_e |   .4839286    .037113                      .4163913    .5624201
         rho |   .6724105   .0594218                      .5493559    .7790901
------------------------------------------------------------------------------
Likelihood-ratio test of sigma_u=0: chibar2(01)=    51.15 Prob>=chibar2 = 0.000
We see weighted kappa and ICC are identical to two decimal places.
Interpreting ICC
The ICC, also called the reliability coefficient, is computed as (Streiner and Norman, 1995,
p.106; Shrout and Fleiss, 1979),

    reliability = subject variability / (subject variability + measurement error)

expressed symbolically as,

    ρ = σs² / (σs² + σe²)
In the inter-rater reliability situation, the raters assign a score, or measurement, to a subject. Each
score can be expressed as the actual true score plus or minus any measurement error the rater
makes. The variability of the measurements made on a sample of subjects can likewise be
expressed as the variability of subjects, perhaps due to biological variability for example, plus the
variability of the measurement errors made by the raters. [This comes from a statistical theory
identity, which is that the variance of the sum of two independent variables is the sum of the
variances. To apply this, we make the reasonable assumption that measurement errors are made
independently of the true values of the variable being measured.]
We would like to have a reliability coefficient that expresses the proportion of the measurement
variability that is due only to subject variability, which can be thought of as the portion due to
raters making consistent measurements without introducing measurement error (since
measurement error would cause them to score a subject differently). This can be expressed as 1
minus the proportion due to measurement error, since what is left over is the consistent ratings
portion. Expressing this with formulas,

    ρ = 1 – (proportion due to measurement error)
      = 1 – σe² / (σs² + σe²)
      = (σs² + σe²) / (σs² + σe²) – σe² / (σs² + σe²)
      = (σs² + σe² – σe²) / (σs² + σe²)
      = σs² / (σs² + σe²)
So, we see that ICC, or rho, is interpreted as the proportion of total variability in measurements
due to subject variability, which is the definition that authors put in their descriptions of ICC.
McDowell (2006, p.45) provides the Cicchetti and Sparrow (1981) guideline. Cicchetti and
Sparrow (1981) suggested the following guideline for the evaluation of ICC used to measure
inter-rater agreement:
Interpretation of ICC

    ICC            interpretation
    < 0.40         poor agreement
    0.40 - 0.59    fair to moderate agreement
    0.60 - 0.74    good agreement
    0.75 - 1.00    excellent agreement
References
Byrt T, Bishop J, Carlin JB. Bias, prevalence and kappa. J Clin Epidemiol. 1993;46(5):423-9.
Carpenter J, Bithell J. (2000). Bootstrap confidence intervals: when, which, what? A practical
guide for medical statisticians. Statist. Med. 19:1141-1164.
Cicchetti DV, Sparrow SA. (1981). Developing criteria for establishing interrater reliability of
specific items: applications to assessment of adaptive behavior. Am J Ment Defic 86:127-137.
Efron, B. and R. Tibshirani. (1993). An Introduction to the Bootstrap. London: Chapman & Hall.
Fleiss, J. L. (1981). Statistical Methods for Rates and Proportions. 2nd ed. New York: Wiley.
Fleiss JL, Levin B, Paik MC. (2003). Statistical Methods for Rates and Proportions, 3rd ed.
Hoboken, NJ, John Wiley & Sons.
Landis JR, Koch GG (1977). The measurement of observer agreement for categorical data.
Biometrics, 33:159-174.
Lee, J. and K. P. Fung. 1993. Confidence interval of the kappa coefficient by bootstrap
resampling [letter]. Psychiatry Research 49:97-98.
Lilienfeld DE, Stolley PD (1994). Foundations of Epidemiology, 3rd ed., New York, Oxford
University Press.
Looney SW, Hagan JL. (2008). Statistical methods for assessing biomarkers and analyzing
biomarker data. In: Rao CR, Miller JP, Rao DC (eds), Handbook of Statistics 27:
Epidemiology and Medical Statistics, New York, Elsevier, pp. 109-147.
McDowell I, Newell C (1996). Measuring Health: a Guide to Rating Scales and Questionnaires,
2nd ed., New York, Oxford University Press.
McDowell I (2006). Measuring Health: a Guide to Rating Scales and Questionnaires,
3rd ed., New York, Oxford University Press.
Merlo J, Berglund, G, Wirfält E, et al. (2000). Self-administered questionnaire compared with a
personal diary for assessment of current use of hormone therapy: an analysis of 16,060
women. Am J Epidemiol 152:788–92.
Nunnally JC (1978). Psychometric Theory, 2nd ed., New York, McGraw-Hill Book Company.
Reichenheim ME. (2004). Confidence intervals for the kappa statistic. The Stata Journal
4(4):421-428.
Rosner B. (1995). Fundamentals of Biostatistics, 4th ed., Belmont CA, Duxbury Press.
StataCorp (2007). Stata Base Reference Manual, Vol 2 (I-P) Release 10. College Station, TX,
Stata Press.
Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychological
Bulletin 1979;86(2):420-428.
Streiner DL, Norman GR. (1995). Health Measurement Scales: A Practical Guide to
Their Development and Use. New York, Oxford University Press.