
Chapter 8
Reliability:
Test–Retest, Parallel Test,
Interrater, and Intrarater
Reliability
Copyright © 2014 Wolters Kluwer • All Rights Reserved
Basics of Reliability and Related Definitions
• Reliability: the extent to which a measurement is free from
measurement error.
• True score: the mean of an infinite number of measurements
of a single person taken under identical circumstances.
• The lower the measurement error, the better the instrument
estimates the true score.
• The larger the sample, the more errors in measurement tend
to “cancel out.”
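A minimal Python sketch of the true-score model described above; the sample size and the true-score and error variances are arbitrary assumptions chosen only to illustrate how errors tend to cancel out when measurements are averaged.

```python
# True-score model: observed score = true score + random error.
# All numbers below are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n_people, k_measures = 500, 5
true_sd, error_sd = 10.0, 6.0            # spread of true scores vs. random error

true = rng.normal(50, true_sd, n_people)                  # each person's true score
errors = rng.normal(0, error_sd, (n_people, k_measures))  # error for each measurement
observed = true[:, None] + errors                         # observed = true + error

# Reliability of a single measurement = true variance / (true + error variance)
print("theoretical reliability:", true_sd**2 / (true_sd**2 + error_sd**2))

# Averaging k measurements reduces error, so the mean of 5 measurements tracks
# the true score more closely than any single measurement does.
print("r(true, single measurement):", np.corrcoef(true, observed[:, 0])[0, 1])
print("r(true, mean of 5)         :", np.corrcoef(true, observed.mean(axis=1))[0, 1])
```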
Test–Retest Reliability (Stability, Reproducibility)
• Demonstrated by administering a measure to the same
person on at least two occasions.
• Maintaining anonymity can be an issue, because responses from the two administrations must be linked to the same person.
• Differences between testing results may be caused by:
• Subject attrition between testings.
• The trait being measured may change over time; generally, the longer the time gap, the lower the test–retest reliability.
• Answers to PROs (patient-reported outcomes) may be remembered and replicated if the time gap is too short.
• Boredom or annoyance when measured a second time
may cause careless responses.
• Rehearsal/learning effect (especially with tests of
performance).
• Regression to the mean with extreme scores.
Parallel Test Reliability
• Used when development of multi-item parallel tests
(alternative-form tests) is desirable.
• Parallel tests can be created by randomly selecting two sets
of items from a tested item pool.
• Useful when multiple measures are taken in a period of time
and carryover effects need to be avoided.
• A major source of measurement error in parallel test
reliability is the sampling of items used on the alternate
form.
• Appropriate only for reflective multi-item scales.
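A minimal Python sketch of the parallel-test idea: items from a simulated reflective item pool are randomly split into two forms, and the correlation between the two form scores estimates parallel test reliability. The item pool and the latent-trait model are invented for illustration.

```python
# Creating parallel (alternative) forms by randomly splitting an item pool.
import numpy as np

rng = np.random.default_rng(1)
n_people, n_items = 300, 40

trait = rng.normal(0, 1, n_people)                            # latent construct
# Reflective items: each item score = trait + item-specific noise
items = trait[:, None] + rng.normal(0, 1, (n_people, n_items))

# Randomly assign half of the item pool to Form A and half to Form B
order = rng.permutation(n_items)
form_a = items[:, order[: n_items // 2]].sum(axis=1)
form_b = items[:, order[n_items // 2 :]].sum(axis=1)

# Parallel test reliability estimate: correlation between the two form scores
print("parallel-form correlation:", np.corrcoef(form_a, form_b)[0, 1])
```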
Interrater and Intrarater Reliability
• A key source of measurement error can result from the
person making observations or recording the measurements.
• Interrater (or interobserver) reliability assessment involves having two or more observers independently apply the instrument to the same people and then comparing their scores for consistency.
• Intrarater reliability assesses the consistency of the same rater measuring on two or more occasions, blinded to the scores he or she assigned on any previous measurement.
• Actions to increase these types of reliability:
• Developing scoring systems that require little inference.
• Meticulous instructions with precise scoring guidelines
and clear examples.
• Training of scorers.
Choosing a Type of Reliability to Estimate
• Consider all possible sources of measurement error and then
assess as many types of reliability as are meaningful.
• Type of instrument will dictate the type of reliability to
estimate; for example:
• Reliability of a formative verbal report index should
only be assessed through test–retest.
• Interrater or intrarater reliability is essential for
observational measures; other forms are situational.
• Table 8.2 in the text summarizes key features of
different types of reliability that need to be
demonstrated according to measurement type and
major source of error.
QUESTION
Place the letter of the type of reliability listed in the left-hand column next to the term that best matches it in the right-hand column:

Types of Reliability:
A. Test–Retest
B. Parallel Test
C. Interrater
D. Intrarater

Related Terms:
___ Used when multi-item tests are needed that measure the same construct.
___ Assesses responses from the same scorer at different times.
___ Stability, Reproducibility.
___ Assesses responses from different scorers.
ANSWER
Place the letter of the type of reliability listed in the left-hand column next to the term that best matches it in the right-hand column:

Types of Reliability:
A. Test–Retest
B. Parallel Test
C. Interrater
D. Intrarater

Related Terms:
_B_ Used when multi-item tests are needed that measure the same construct.
_D_ Assesses responses from the same scorer at different times.
_A_ Stability, Reproducibility.
_C_ Assesses responses from different scorers.
Intraclass Correlation Coefficient as a Reliability Parameter
• A reliability coefficient indicates how well people can be
differentiated from one another on the target construct
despite measurement error.
• True score variance is never known but can be estimated
based on variability between people.
• The reliability coefficient is calculated by means of the
intraclass correlation coefficient, which can be used when a
measure yields continuous scores.
• The basic assumptions are that:
• The scores for the different people being measured are independent, normally distributed, and randomly sampled.
• The residual variation (error) is random and independent, with a mean of zero.
ICC Models for Fully Crossed Designs
• Score variability can be conceptualized as being of two
types:
• Variation from person to person being measured.
• Variation from measurement to measurement of each
person for k measurements.
• The term fully crossed means that each person is rated by k
raters (or completes a measure k times), and each rater
rates everyone.
• The value for k is often 2: a test and then a retest, or two
observers’ ratings.
• The two-way ANOVA for repeated measures is the
fundamental model for ICC.
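A minimal Python sketch of ICC computation for a fully crossed n × k data matrix, using the mean squares from the two-way repeated measures ANOVA; the data matrix is invented, and the formulas follow the widely cited Shrout and Fleiss forms for single measurements.

```python
# ICCs for a fully crossed design: n people (rows) rated by k raters (columns).
import numpy as np

x = np.array([[7., 8., 8.],
              [5., 5., 6.],
              [9., 9., 10.],
              [6., 7., 7.],
              [8., 9., 9.],
              [4., 5., 5.]])
n, k = x.shape
grand = x.mean()

# Two-way ANOVA decomposition of the score variability
ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()    # person-to-person variation
ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()    # systematic rater variation
ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols  # residual variation

ms_r = ss_rows / (n - 1)
ms_c = ss_cols / (k - 1)
ms_e = ss_err / ((n - 1) * (k - 1))

# ICC(2,1): two-way random effects, absolute agreement, single measurement
icc_2_1 = (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)
# ICC(3,1): two-way mixed effects, consistency, single measurement
icc_3_1 = (ms_r - ms_e) / (ms_r + (k - 1) * ms_e)

print(f"ICC(2,1) absolute agreement = {icc_2_1:.3f}")
print(f"ICC(3,1) consistency        = {icc_3_1:.3f}")
```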
ICC Models for Fully Crossed Designs (cont’d)
• Three factors need to be considered in selecting an ICC
formula for reliability assessment:
• Will a single score or an averaged score for each person
be used?
• Are the k measurements viewed as a fixed or random
effect?
• Is the assessment most concerned with consistency of scores (e.g., rankings across observers are consistent) or with absolute agreement of scores (scores across measures are identical)?
• When systematic variation across observers or waves is
considered relevant, then ICC for agreement should be used.
• In clinical situations, absolute agreement may be more
important than consistency.
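A short sketch of the consistency-versus-agreement distinction, assuming the third-party pingouin package (the text itself illustrates ICC calculation in SPSS). One rater scores everyone roughly three points higher, so rankings are preserved but absolute scores differ.

```python
# Consistency vs. absolute agreement when one rater is systematically higher.
import pandas as pd
import pingouin as pg

scores_a = [4, 6, 8, 5, 9, 7]
scores_b = [7, 10, 11, 8, 12, 10]   # roughly scores_a + 3: consistent, not in agreement

long = pd.DataFrame({
    "subject": list(range(6)) * 2,
    "rater":   ["A"] * 6 + ["B"] * 6,
    "score":   scores_a + scores_b,
})

icc = pg.intraclass_corr(data=long, targets="subject", raters="rater", ratings="score")
# ICC3 (single rater, consistency) is about .98 here, while ICC2 (single rater,
# absolute agreement) falls to roughly .40, because the systematic rater
# difference is counted as error when absolute agreement is required.
print(icc[["Type", "Description", "ICC"]])
```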
ICC Models for Not Fully Crossed Designs
• Designs that are not fully crossed include:
• Nested designs: Different pairs of raters rate different
patients with no overlap.
• Unbalanced designs: Neither crossed nor nested; two raters per patient, but raters are not paired, and there is some (but not complete) overlap of patients and raters.
Intraclass Correlation Coefficient Calculation in SPSS
Intraclass Correlation Coefficient Calculation in SPSS (cont’d)
QUESTION
Select the statement that is false for Intraclass Correlation
Coefficient (ICC) Models and fully crossed designs:
A. The two-way ANOVA for repeated measures is the
fundamental model used to analyze a fully crossed design.
B. Nested or unbalanced designs are two designs that are fully
crossed designs.
C. One type of score variability in fully crossed designs is the
variation from person to person being measured for N
people.
D. A factor that needs to be considered when selecting an
appropriate ICC formula for a fully crossed design is
whether a single or an averaged score is being used.
ANSWER
Answer: B
Nested or unbalanced designs are designs that are used when
different pairs of raters rate different patients with no overlap
(nested), or raters are not paired, and there is some overlap of
patients and raters (unbalanced).
Interpretation of ICC Values
• Reliability needs to be higher for measures that will be used
to make decisions about individual people.
• Acceptable minimum reliability criteria for decision making about individuals range from .85 to .95.
• Recommended minimum reliability values for measures
used in group situations range from .70 to .75.
• It should be remembered that an ICC of .70 means that
approximately 70% of the variance is attributed to the “true
score,” while approximately 30% is attributed to error.
• Low ICC values may result from:
• Low variability in the N by k data matrix.
• Problems with the measurement design.
• People-by-rater interactions.
• The measure not being reliable.
Consequences of Low ICC Values
• A larger sample size is needed to achieve a given power, because the required sample size increases as the ICC decreases.
• There is difficulty with determining true changes after
treatment.
• There are misclassifications of clinical conditions and
potential treatment errors.
Reliability Parameters for Noncontinuous Variables:
Proportion of Agreement
• Proportion of overall agreement: total percentage of
agreement on both positive and negative cases.
• Proportion of negative agreement: percentage of agreement
on negative (trait not present) cases.
• Proportion of positive agreement: percentage of agreement
on positive (trait present) cases.
• Examination of specific agreement (positive and negative) is
particularly important if the distribution is severely skewed.
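A minimal Python sketch of the overall, positive, and negative proportions of agreement for two raters and a dichotomous trait; the 2 × 2 cell counts are invented.

```python
# Cell counts: a = both raters say positive, d = both say negative,
# b and c = the two kinds of disagreement.
a, b, c, d = 20, 5, 7, 68
n = a + b + c + d

p_overall  = (a + d) / n               # agreement on positive and negative cases
p_positive = 2 * a / (2 * a + b + c)   # specific agreement on positive cases
p_negative = 2 * d / (2 * d + b + c)   # specific agreement on negative cases

print(f"overall  agreement = {p_overall:.2f}")   # 0.88
print(f"positive agreement = {p_positive:.2f}")  # 0.77
print(f"negative agreement = {p_negative:.2f}")  # 0.92
```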
Kappa for Dichotomous Ratings by Two Raters
• The kappa statistic is used to correct for rater agreement that occurs by chance.
• It is the most widely used reliability index.
• Assumptions for use include:
• People are rated independently of one another.
• All ratings should be made by the same k raters.
• Rating categories are independent of one another.
• Marginal homogeneity, which refers to whether the raters
distribute their ratings in a comparable fashion, can be used
to further understand rater agreement.
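A minimal Python sketch of Cohen's kappa for two raters making dichotomous ratings of the same people; the ratings are invented, and the hand calculation is checked against scikit-learn's cohen_kappa_score (an assumed, commonly available dependency).

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

rater1 = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0])
rater2 = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0])

p_o = np.mean(rater1 == rater2)   # observed proportion of agreement
# Chance agreement: product of the two raters' marginal proportions, per category
p_e = sum(np.mean(rater1 == cat) * np.mean(rater2 == cat) for cat in (0, 1))
kappa = (p_o - p_e) / (1 - p_e)   # agreement corrected for chance

print(f"observed = {p_o:.2f}, chance = {p_e:.2f}, kappa = {kappa:.2f}")
print("scikit-learn check:", round(cohen_kappa_score(rater1, rater2), 2))
```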
Weighted Kappa for Ordinal Ratings
• Weighted kappa: a method in which “partial credit” is given
to raters whose ratings are not identical but are in close
proximity to each other.
• Weighting schemes include:
• Linear weights.
• Quadratic weights (the most widely used scheme).
Note: Cohen's kappa is not appropriate for use in designs that are not fully crossed.
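A short sketch of weighted kappa for invented ordinal ratings on a 1 to 4 scale, assuming scikit-learn's cohen_kappa_score, which accepts linear and quadratic weighting schemes; near-miss disagreements are penalized less than distant ones.

```python
from sklearn.metrics import cohen_kappa_score

rater1 = [1, 2, 2, 3, 4, 3, 2, 1, 4, 3]
rater2 = [1, 2, 3, 3, 4, 2, 2, 1, 3, 4]   # disagreements are all one category apart

print("unweighted       :", round(cohen_kappa_score(rater1, rater2), 2))
print("linear weights   :", round(cohen_kappa_score(rater1, rater2, weights="linear"), 2))
print("quadratic weights:", round(cohen_kappa_score(rater1, rater2, weights="quadratic"), 2))
```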
Interpretation of Kappa
• Kappa values can range from -1.0 to +1.0.
• A kappa of 1.0 means that all ratings are along the diagonal
of the contingency table.
• Although guidelines tend to be arbitrary, the following is
suggested:
<.20      Poor
.21–.40   Fair
.41–.60   Moderate
.61–.80   Substantial
>.81      Excellent
• Values under .60 may indicate need for modifications in the
instrument, the raters, the training protocol, or other
aspects of the measurement situation.
• Kappa paradox: occurs in skewed distributions when the
proportion of agreement is substantial but kappa is low.
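A minimal Python sketch of the kappa paradox, using an invented and severely skewed 2 × 2 table: overall agreement is .95, yet kappa is only about .26 because chance agreement is very high when nearly every case is rated negative.

```python
# a = both raters positive, d = both negative, b and c = disagreements
a, b, c, d = 1, 2, 3, 94
n = a + b + c + d

p_o = (a + d) / n                       # observed agreement = .95
p1 = (a + b) / n                        # rater 1 proportion positive
p2 = (a + c) / n                        # rater 2 proportion positive
p_e = p1 * p2 + (1 - p1) * (1 - p2)     # chance agreement is about .93 here
kappa = (p_o - p_e) / (1 - p_e)

p_pos = 2 * a / (2 * a + b + c)         # specific positive agreement, about .29
print(f"overall = {p_o:.2f}, kappa = {kappa:.2f}, positive agreement = {p_pos:.2f}")
```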
QUESTION
The following Cohen’s kappa (k) values strongly suggest that
the instrument, the raters, the training protocol, or other
aspects of the measurement situation need to be modified or
there is an error in the kappa calculation (select all that
apply):
A. k = .69
B. k = .20
C. k = 3.2
D. k = .80
ANSWER
Answer: B and C
In general, kappa values under .60 (here, .20) may indicate a need for modifications in the instrument, the raters, the training protocol, or other aspects of the measurement situation. A value of 3.2 falls outside the possible range of kappa (−1.0 to +1.0), indicating an error in the calculation.
Reliability and Item Response Theory (IRT)
• In IRT, the concept of information is usually used in lieu of
reliability.
• Information: a conditional expression of measurement
precision for a single person that is population independent.
• Test–retest reliability can be assessed both with static measures developed using IRT methods and with computerized adaptive tests.
• IRT scaling methods also have been used with items that
require observational ratings, thus requiring interrater
reliability assessment.
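A minimal Python sketch of the information concept for a single two-parameter logistic (2PL) IRT item; the discrimination and difficulty values are invented, and the standard error shown is what the conditional standard error of measurement would be if the test consisted of this one item.

```python
# 2PL item information: I(theta) = a^2 * P(theta) * (1 - P(theta)).
# Precision is conditional on theta rather than a single sample-dependent coefficient.
import numpy as np

a, b = 1.5, 0.0                      # discrimination and difficulty (invented values)

def item_information(theta):
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))   # probability of endorsing the item
    return a ** 2 * p * (1 - p)

for theta in (-2.0, 0.0, 2.0):
    info = item_information(theta)
    se = 1.0 / np.sqrt(info)         # SE of theta if this were the whole test
    print(f"theta = {theta:+.1f}: information = {info:.2f}, SE = {se:.2f}")
```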
Designing a Reliability Study
Study Design
• Demonstrating test–retest reliability usually involves testing the same sample at two time points.
• The most straightforward interrater analyses for ICCs and
kappas are with fully crossed designs; the nested design is
the next best choice.
Timing of Measurements
• Simultaneity is possible when demonstrating interrater reliability; however, it is not possible for test–retest, intrarater, or parallel test reliability.
• Measurement timings need to be scheduled so that the likelihood of extraneous and transient measurement error is minimized.
• Many experts advise that the interval for PRO measurements
should be 1 to 2 weeks.
• Physical measurements probably should be retested at
shorter intervals.
Other Design Issues in Reliability Studies
• The following issues are especially important to consider in
designing reliability studies:
• Blinding.
• Comparable measurement circumstances.
• Training.
• Attrition.
• Random ordering of items or subscales.
• Specification of an a priori standard.
Sampling in Reliability Studies
• People being measured should be representative of the
population for whom the measure is designed, and raters
should also be representative of the population of potential
raters using the measure.
• A heterogeneous sample from the population of interest
should be used.
• A sample size of 50 is deemed adequate in most reliability studies; however, sample size can also be estimated using the confidence interval around an estimated reliability coefficient.
• Sample sizes for kappa are difficult to estimate because information is needed on:
• The expected kappa value.
• The expected proportion of positive ratings.
Reporting a Reliability Study
As much detail about the study as possible needs to be reported,
including:
• Type of reliability assessed.
• Nature of the measure and its possible application.
• Target population.
• Sample details, including recruitment and heterogeneity.
• Sample size and attrition.
• Rater characteristics (if appropriate) and training.
• Measurement procedures.
• Data preparation.
• Statistical decisions.
• Statistical results.
• Interpretation of results.
QUESTION
Is the following statement True or False?
Choices to consider when designing reliability studies include
use of blinding, possibility of subject attrition, whether
measurement items should be randomly ordered, and
projecting the minimum ICC or kappa values.
ANSWER
Answer: True
Among other choices, all of these issues need to be considered
when designing reliability studies.