Chapter 8
Reliability: Test–Retest, Parallel Test, Interrater, and Intrarater Reliability
Copyright © 2016 Wolters Kluwer • All Rights Reserved

Basics of Reliability and Related Definitions
• Reliability: the extent to which a measurement is free from measurement error.
• True score: the mean of an infinite number of measurements of a single person taken under identical circumstances.
• The lower the measurement error, the better the instrument estimates the true score.
• The larger the sample of measurements, the more the errors tend to "cancel out" (illustrated in the sketch below).
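The "cancel out" idea can be made concrete with a short simulation. The following is a minimal sketch, not part of the original deck; the true score, error SD, and seed are illustrative assumptions. Averaging more and more measurements of the same person pulls the mean of the observed scores toward the true score.

```python
import numpy as np

rng = np.random.default_rng(42)

true_score = 50.0   # the person's (unknowable) true score; illustrative value
error_sd = 8.0      # SD of random measurement error; illustrative value

# Observed score = true score + random error. As the number of repeated
# measurements grows, positive and negative errors increasingly cancel,
# so the mean of the observed scores converges toward the true score.
for n_measurements in (1, 10, 100, 10_000):
    observed = true_score + rng.normal(0.0, error_sd, size=n_measurements)
    print(f"{n_measurements:>6} measurements: mean observed = {observed.mean():.2f}")
```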
Test–Retest Reliability (Stability, Reproducibility)
• Demonstrated by administering a measure to the same person on at least two occasions.
• Maintaining anonymity can be an issue.
• Differences between the two sets of results may be caused by:
  • Subject attrition between testings.
  • Traits changing over time; generally, the longer the time gap, the lower the test–retest reliability.
  • Answers to patient-reported outcomes (PROs) being remembered and replicated if the time gap is too short.
  • Boredom or annoyance at being measured a second time, leading to careless responses.
  • Rehearsal/learning effects (especially with tests of performance).
  • Regression to the mean with extreme scores.

Parallel Test Reliability
• Used when development of multi-item parallel tests (alternative-form tests) is desirable.
• Parallel tests can be created by randomly selecting two sets of items from a tested item pool.
• Useful when multiple measurements are taken within a period of time and carryover effects need to be avoided.
• A major source of measurement error in parallel test reliability is the sampling of items used on the alternate form.
• Appropriate only for reflective multi-item scales.

Interrater and Intrarater Reliability
• A key source of measurement error can be the person making the observations or recording the measurements.
• Interrater (or interobserver) reliability assessment involves having two or more observers independently apply the instrument to the same people, and then comparing their scores for consistency.
• Intrarater reliability assesses the consistency of the same rater measuring on two or more occasions, blinded to the scores he or she assigned on any previous measurement.
• Actions to increase these types of reliability:
  • Developing scoring systems that require little inference.
  • Writing meticulous instructions with precise scoring guidelines and clear examples.
  • Training the scorers.

Choosing a Type of Reliability to Estimate
• Consider all possible sources of measurement error, and then assess as many types of reliability as are meaningful.
• The type of instrument dictates the type of reliability to estimate; for example:
  • Reliability of a formative verbal report index should be assessed only through test–retest.
  • Interrater or intrarater reliability is essential for observational measures; other forms are situational.
• Table 8.2 in the text summarizes the key features of the different types of reliability that need to be demonstrated, according to measurement type and major source of error.

QUESTION
Place the letter of the type of reliability listed in the left-hand column next to the term that best matches it in the right-hand column:

Types of Reliability      Related Terms
A. Test–Retest            ___ Used when multi-item tests are needed that measure the same construct.
B. Parallel Test          ___ Assesses responses from the same scorer at different times.
C. Interrater             ___ Stability, reproducibility.
D. Intrarater             ___ Assesses responses from different scorers.

ANSWER
A. Test–Retest            _B_ Used when multi-item tests are needed that measure the same construct.
B. Parallel Test          _D_ Assesses responses from the same scorer at different times.
C. Interrater             _A_ Stability, reproducibility.
D. Intrarater             _C_ Assesses responses from different scorers.

Intraclass Correlation Coefficient as a Reliability Parameter
• A reliability coefficient indicates how well people can be differentiated from one another on the target construct despite measurement error.
• True score variance is never known, but it can be estimated from the variability between people.
• The reliability coefficient is calculated by means of the intraclass correlation coefficient (ICC), which can be used when a measure yields continuous scores.
• The basic assumptions are that:
  • The people being measured are randomly selected, and their scores are independent and normally distributed.
  • The residual variation (error) is random and independent, with a mean of zero.

ICC Models for Fully Crossed Designs
• Score variability can be conceptualized as being of two types:
  • Variation from person to person being measured.
  • Variation from measurement to measurement of each person, across k measurements.
• The term "fully crossed" means that each person is rated by k raters (or completes a measure k times), and each rater rates everyone.
• The value of k is often 2: a test and then a retest, or two observers' ratings.
• The two-way ANOVA for repeated measures is the fundamental model for the ICC.

ICC Models for Fully Crossed Designs (cont'd)
• Three factors need to be considered in selecting an ICC formula for reliability assessment:
  • Will a single score or an averaged score for each person be used?
  • Are the k measurements viewed as a fixed or a random effect?
  • Is the assessment most concerned with consistency of scores (e.g., rankings across observers are consistent) or with absolute agreement of scores (scores across measurements are identical)?
• When systematic variation across observers or waves is considered relevant, the ICC for agreement should be used.
• In clinical situations, absolute agreement may be more important than consistency.

ICC Models for Not Fully Crossed Designs
• Designs that are not fully crossed include:
  • Nested designs: different pairs of raters rate different patients, with no overlap.
  • Unbalanced designs: neither crossed nor nested; two raters per patient, but raters are not paired, and there is some (but not complete) overlap of patients and raters.

Intraclass Correlation Coefficient Calculation in SPSS
[Two slides of SPSS screenshots in the original deck; not reproduced here.]
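For readers working outside SPSS, the fully crossed single-measure ICCs can be computed directly from the two-way ANOVA mean squares. Below is a minimal Python sketch, assuming a complete n-people by k-measurements matrix with no missing data; the function name and example ratings are illustrative, and the formulas follow the standard Shrout and Fleiss single-measure definitions.

```python
import numpy as np

def single_measure_iccs(X):
    """Consistency and absolute-agreement single-measure ICCs for a fully
    crossed n-people x k-measurements matrix, from the two-way repeated
    measures ANOVA mean squares."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    grand = X.mean()
    ss_rows = k * np.sum((X.mean(axis=1) - grand) ** 2)    # between people
    ss_cols = n * np.sum((X.mean(axis=0) - grand) ** 2)    # between raters/waves
    ss_err = np.sum((X - grand) ** 2) - ss_rows - ss_cols  # residual
    ms_r = ss_rows / (n - 1)
    ms_c = ss_cols / (k - 1)
    ms_e = ss_err / ((n - 1) * (k - 1))

    # Consistency: systematic rater/wave differences do NOT count as error.
    icc_c = (ms_r - ms_e) / (ms_r + (k - 1) * ms_e)
    # Absolute agreement: rater/wave variance DOES count as error.
    icc_a = (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)
    return icc_c, icc_a

# Hypothetical data: 5 people, each rated by the same 2 raters.
ratings = np.array([[7, 9], [5, 6], [8, 8], [2, 4], [6, 7]])
icc_c, icc_a = single_measure_iccs(ratings)
print(f"ICC(consistency) = {icc_c:.3f}, ICC(agreement) = {icc_a:.3f}")
```

In this example rater 2 scores systematically higher than rater 1, so the agreement ICC comes out lower than the consistency ICC, which is exactly the distinction the slides above draw.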
QUESTION
Select the statement that is false for intraclass correlation coefficient (ICC) models and fully crossed designs:
A. The two-way ANOVA for repeated measures is the fundamental model used to analyze a fully crossed design.
B. Nested and unbalanced designs are two types of fully crossed designs.
C. One type of score variability in fully crossed designs is the variation from person to person being measured, for N people.
D. A factor that needs to be considered when selecting an appropriate ICC formula for a fully crossed design is whether a single score or an averaged score is being used.

ANSWER
Answer: B
Nested and unbalanced designs are not fully crossed. They are used when different pairs of raters rate different patients with no overlap (nested), or when raters are not paired and there is some (but not complete) overlap of patients and raters (unbalanced).

Interpretation of ICC Values
• Reliability needs to be higher for measures that will be used to make decisions about individual people.
• Acceptable minimum reliability criteria for decision making about individuals range from .85 to .95.
• Recommended minimum reliability values for measures used in group situations range from .70 to .75.
• Remember that an ICC of .70 means that approximately 70% of the variance is attributed to the "true score," while approximately 30% is attributed to error.
• Low ICC values may result from:
  • Low variability in the N by k data matrix.
  • Problems with the measurement design.
  • People-by-rater interactions.
  • The measure simply not being reliable.

Consequences of Low ICC Values
• A larger sample size is needed to achieve a given statistical power, because required sample size is related to the value of the ICC.
• True changes after treatment are difficult to detect.
• Clinical conditions may be misclassified, with potential treatment errors.

Reliability Parameters for Noncontinuous Variables: Proportion of Agreement
• Proportion of overall agreement: the total percentage of agreement on both positive and negative cases.
• Proportion of negative agreement: the percentage of agreement on negative (trait not present) cases.
• Proportion of positive agreement: the percentage of agreement on positive (trait present) cases.
• Examination of specific agreement (positive and negative) is particularly important if the distribution is severely skewed.

Kappa for Dichotomous Ratings by Two Raters
• The kappa statistic corrects for rater agreement that occurs by chance.
• It is the most widely used reliability index.
• Assumptions for its use include:
  • People are rated independently of one another.
  • All ratings are made by the same k raters.
  • Rating categories are independent of one another.
• Marginal homogeneity, which refers to whether the raters distribute their ratings in a comparable fashion, can be used to further understand rater agreement.
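The agreement proportions and kappa just described can be computed directly from two raters' dichotomous ratings. A minimal sketch in plain Python; the function name and the ratings are hypothetical, and the specific-agreement formulas follow the usual 2 x 2 table definitions.

```python
def dichotomous_agreement(rater1, rater2):
    """Overall, positive, and negative agreement plus Cohen's kappa for two
    raters' dichotomous (1 = trait present, 0 = absent) ratings."""
    n = len(rater1)
    pairs = list(zip(rater1, rater2))
    a = sum(1 for x, y in pairs if x == 1 and y == 1)   # both rate positive
    b = sum(1 for x, y in pairs if x == 1 and y == 0)   # rater 1 positive only
    c = sum(1 for x, y in pairs if x == 0 and y == 1)   # rater 2 positive only
    d = sum(1 for x, y in pairs if x == 0 and y == 0)   # both rate negative

    p_overall = (a + d) / n                  # proportion of overall agreement
    p_positive = 2 * a / (2 * a + b + c)     # specific positive agreement
    p_negative = 2 * d / (2 * d + b + c)     # specific negative agreement

    # Chance agreement from each rater's marginal proportion of positives
    p1, p2 = (a + b) / n, (a + c) / n
    p_chance = p1 * p2 + (1 - p1) * (1 - p2)
    kappa = (p_overall - p_chance) / (1 - p_chance)
    return p_overall, p_positive, p_negative, kappa

# Hypothetical ratings of 10 people by two raters
r1 = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]
r2 = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]
print(dichotomous_agreement(r1, r2))
```

With a severely skewed distribution, p_overall can be high while kappa is low, which is the kappa paradox described on a later slide; that is why the specific agreement proportions are worth reporting alongside kappa.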
Weighted Kappa for Ordinal Ratings
• Weighted kappa: a method in which "partial credit" is given to raters whose ratings are not identical but are in close proximity to each other.
• Weighting schemes include:
  • Linear weights.
  • Quadratic weights (the most widely used scheme).
• Note: Cohen's kappa is not appropriate for designs that are not fully crossed.

Interpretation of Kappa
• Kappa values can range from -1.0 to +1.0.
• A kappa of 1.0 means that all ratings fall along the diagonal of the contingency table.
• Although guidelines tend to be arbitrary, the following is suggested:
  < .20        Poor
  .21–.40      Fair
  .41–.60      Moderate
  .61–.80      Substantial
  .81 and up   Excellent
• Values under .60 may indicate a need for modifications in the instrument, the raters, the training protocol, or other aspects of the measurement situation.
• Kappa paradox: occurs in skewed distributions when the proportion of agreement is substantial but kappa is low.
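The effect of the weighting schemes can be seen with scikit-learn's cohen_kappa_score, which supports linear and quadratic weights. The ordinal ratings below are hypothetical; this is a quick comparison sketch, not an analysis recipe from the text.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical ordinal ratings (1-5 scale) of 8 people by two raters
rater1 = [1, 2, 3, 4, 5, 3, 2, 4]
rater2 = [1, 3, 3, 5, 4, 3, 1, 4]

# Unweighted: every disagreement counts fully, however close the ratings are
print(cohen_kappa_score(rater1, rater2))
# Linear weights: partial credit in proportion to the distance between ratings
print(cohen_kappa_score(rater1, rater2, weights="linear"))
# Quadratic weights (most widely used): near-misses penalized even less,
# relative to large disagreements
print(cohen_kappa_score(rater1, rater2, weights="quadratic"))
```

Because most of the disagreements here are one category apart, the weighted kappas come out higher than the unweighted value, which is the "partial credit" idea in action.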
QUESTION
The following Cohen's kappa (k) values strongly suggest that the instrument, the raters, the training protocol, or other aspects of the measurement situation need to be modified, or that there is an error in the kappa calculation (select all that apply):
A. k = .69
B. k = .20
C. k = 3.2
D. k = .80

ANSWER
Answer: B and C
In general, kappa values under .60 may indicate a need for modifications in the instrument, the raters, the training protocol, or other aspects of the measurement situation. A value of 3.2 is not within the possible range of an accurately calculated kappa.

Reliability and Item Response Theory (IRT)
• In IRT, the concept of information is usually used in lieu of reliability.
• Information: a conditional expression of measurement precision for a single person that is population independent.
• Test–retest reliability can be assessed both with static measures developed using IRT methods and with computerized adaptive tests.
• IRT scaling methods have also been used with items that require observational ratings, thus requiring interrater reliability assessment.

Designing a Reliability Study
Study Design
• Demonstrating test–retest reliability usually is a case of testing the same sample at two time points.
• The most straightforward interrater analyses for ICCs and kappas are with fully crossed designs; the nested design is the next best choice.
Timing of Measurements
• Simultaneity is possible in demonstrating interrater reliability; it is not possible in determining test–retest, intrarater, or parallel test reliability.
• Measurement timings should be scheduled to minimize the likelihood of extraneous and transient measurement errors.
• Many experts advise that the interval for PRO measurements should be 1 to 2 weeks.
• Physical measurements probably should be retested at shorter intervals.

Other Design Issues in Reliability Studies
• The following issues are especially important to consider in designing reliability studies:
  • Blinding.
  • Comparable measurement circumstances.
  • Training.
  • Attrition.
  • Random ordering of items or subscales.
  • Specification of an a priori standard.

Sampling in Reliability Studies
• People being measured should be representative of the population for whom the measure is designed, and raters should likewise be representative of the population of potential raters using the measure.
• A heterogeneous sample from the population of interest should be used.
• A sample size of 50 is deemed adequate in most reliability studies; however, sample size can also be estimated using the confidence interval around an expected reliability coefficient (see the sketch at the end of this chapter).
• Sample sizes for kappa are difficult to estimate because information is needed on:
  • The expected kappa value.
  • The expected proportion of positive ratings.

Reporting a Reliability Study
As much detail about the study as possible needs to be reported, including:
• Type of reliability assessed.
• Nature of the measure and its possible applications.
• Target population.
• Sample details, including recruitment and heterogeneity.
• Sample size and attrition.
• Rater characteristics (if appropriate) and training.
• Measurement procedures.
• Data preparation.
• Statistical decisions.
• Statistical results.
• Interpretation of results.

QUESTION
Is the following statement true or false?
Choices to consider when designing reliability studies include the use of blinding, the possibility of subject attrition, whether measurement items should be randomly ordered, and projecting the minimum ICC or kappa values.

ANSWER
Answer: True
Among other choices, all of these issues need to be considered when designing reliability studies.
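As an illustration of the confidence-interval approach to sample size mentioned in the Sampling slides, the sketch below uses the F-based interval for a single-measure, consistency-type ICC (in the style of Shrout and Fleiss, 1979) to show how the expected interval narrows as n grows. This is a hedged planning aid under those assumptions, not the text's prescribed method; the function name and the numbers are illustrative.

```python
from scipy.stats import f as f_dist

def expected_icc_ci(icc, n, k, alpha=0.05):
    """Approximate (1 - alpha) CI for a single-measure, consistency-type ICC,
    via the F-based interval. For planning: plug in the expected ICC and a
    candidate n (people) and k (measurements) to see how wide the CI would be."""
    df1, df2 = n - 1, (n - 1) * (k - 1)
    f0 = (1 + (k - 1) * icc) / (1 - icc)   # F ratio implied by the expected ICC
    f_low = f0 / f_dist.ppf(1 - alpha / 2, df1, df2)
    f_high = f0 * f_dist.ppf(1 - alpha / 2, df2, df1)
    return ((f_low - 1) / (f_low + k - 1), (f_high - 1) / (f_high + k - 1))

# How precise would an expected ICC of .80 with k = 2 raters be at various n?
for n in (30, 50, 100):
    low, high = expected_icc_ci(0.80, n, 2)
    print(f"n = {n:>3}: 95% CI roughly ({low:.2f}, {high:.2f})")
```

Running this shows, for example, that n = 50 leaves the lower bound well below the .85 criterion for individual decision making, which is the kind of reasoning the confidence-interval approach to sample size supports.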