Intraclass correlation for reliability assessment: the introduction of a validated program in SAS (ICC6)
V. S. Senthil Kumar 1 & Saeid Shahraz 2,3

1 Heller School for Social Policy and Management, Brandeis University, Waltham, MA 02453, USA. Email: vssenthilk@gmail.com
2 Verantos, Inc., 325 Sharon Park Dr. Suite 730, Palo Alto, CA 94025, USA; South San Francisco, CA 94080, USA. Email: saeid.shahraz@verantos.com
3 The Institute for Clinical Research and Health Policy Studies (ICRHPS), Tufts Medical Center, 800 Washington Street, Boston, MA 02111, USA
Abstract
Reliability refers to the extent to which measurements produce consistent results and is crucial for any scientific measurement. The Intraclass Correlation Coefficient (ICC) is the most widely used method for determining the reproducibility of measurements. A calculated ICC and its confidence interval, which reveals the underlying sampling distribution, can help determine an experimental method's ability to identify systematic differences between research participants in a test. This study aimed to introduce a new SAS macro, ICC6, for calculating different ICC forms and their confidence intervals. A SAS macro that employs the PROC GLM procedure in SAS was created to generate two-way random-effects (ANOVA) estimates. A simulated dataset was used as input to the macro to calculate the point estimates for the different ICCs. The upper and lower confidence limits of the ICC forms were calculated using the F distribution. Our SAS macro provides a complete set of the various ICC forms and their confidence intervals. A validation analysis using the commercial software packages STATA and SPSS delivered identical results. This article reports the development of a SAS methodology, based on publicly available statistical approaches, for estimating six distinct forms of the ICC and their confidence intervals. This work extends a general methodology supported by a few other statistical software packages to SAS.
Keywords: Intraclass Correlation, ICC, ANOVA, Reliability.
Statements and Declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Introduction
Reliability refers to the consistency of a measure, ensuring that measurement variation is due to replicable differences
between people regardless of time, target behavior, or user profile. Reliability is a fundamental metric used to
understand the quality of a measurement and provides an idea about the source of a possible inconsistency (Bruton et
al., 2000). The reliability of measurement scores can help assess the precision of a given measurement score. In clinical
trials, obtaining consistent results through repeated measurements with the same patient (test-retest reliability) plays
a key role in decision-making. The U.S. Food and Drug Administration's Guidance for Industry (FDA, 2009) recommends reliability as one of the principal psychometric properties in validating a patient-reported outcome (PRO) instrument for clinical trials.
Reliability is also widely used in clinical trials to validate the measurement equivalence among different forms of the
same test (Potashman et al. 2022). Three kinds of reliability measures are widely used in data analysis: internal
consistency (Revicki 2014), test-retest reliability or intra-rater reliability, and inter-rater reliability (Belur et al. 2018).
Internal consistency helps judge the stability of results across items of a measurement instrument. It measures whether
different items deemed to measure the same construct in a test produce similar results. Internal consistency is often
measured with Cronbach's alpha or coefficient omega (Boateng et al. 2018), generating correlations between items in
the same test. Test-retest reliability calculates the consistency or agreement between responses in the same population
at different points in time. Inter-rater reliability refers to the level of consistency or agreement on a test for which the
scoring involves more than one rater’s judgment. In other words, it reflects the scoring process reliability delivered
by different raters assessing the same responses, and the possible inconsistency or disagreement may arise from the
selection of raters (Hallgren 2012).
Test-retest reliability and inter-rater reliability are essential psychometric aspects of reliability analyses. The most
widely used strategy for assessing these reliability measures is to analyze the differences between the responses to a
test for each research participant. Researchers often apply various statistical methods for reliability assessment.
Examples include using a paired t-test to compare mean differences (Hopkins 2000), assessing the strength of the linear relationship between the results of two tests with the Pearson correlation (Brown et al. 1962), calculating the standard error of measurement (SEM) (Weir 2005), and quantifying the agreement between two quantitative measurements by studying the mean difference with a Bland-Altman plot (Bland and Altman 1986). The Pearson correlation is a correlation measure, and
the SEM represents the measurement error. The paired t-test and the Bland-Altman plot are more suitable for analyzing agreement between measurements. The ICC is a popular method for analyzing any type of reliability in psychometric
assessments from both consistency and agreement viewpoints and factors in the measurement errors in the estimation
(Liljequist et al. 2019; Zaki et al. 2013). ICC's generic definition is the ratio of the variance of interest over the sum
of the variance of interest and the error variance. Several forms of ICC can produce different results when applied to
the same data. Fisher (Fisher 1954) first defined the ICC as the ratio of the between-subject variance to the total
variance, i.e., the sum of between-subject and within-subject variances using a one-way analysis of variance
(ANOVA) model considering the ICC as an alternative to the Pearson correlation coefficient. Bartko (1966)
introduced fixed and random rater effects using a two-way ANOVA model to calculate the ICC. Shrout and Fleiss
(1979) and McGraw and Wong (1996) extended various forms of the ICC, calculating the ICC by employing mean
squares in various settings using diverse ANOVA models. Shrout and Fleiss introduced six special ICC measures.
Three statistical models guided these ICC reliability measures: the one-way random model, the two-way random
model, and the two-way mixed model. Shrout and Fleiss built a single-rater and a multiple-rater (average of k raters)
model for each of these three models. Later, McGraw and Wong added four other ICC measures to the ones Shrout
and Fleiss proposed. These additional measures yielded numerically identical results to those of the two-way random
model and the two-way mixed model explained by Shrout and Fleiss (1979). However, McGraw and Wong
interpreted the additional models differently. McGraw and Wong (1996) defined ten forms of ICC based on the three
models (the one-way random model, the two-way random model, and the two-way mixed model), the type (single
rating or mean of k ratings), and the definition of relationship (consistency or absolute agreement). Table 1 conveys a
summary of these models along with their vocabulary. ICC values range between 0 and 1 (0 indicates no reliability,
and 1 represents perfect reliability). Agreed-upon cutoffs for a uniform interpretation of ICC results do not exist
(Nunnally and Bernstein 1994; Fleiss 1986; Rosner 2006; Portney and Watkins 2009). However, several authors,
including Koo and Li (2016), have suggested the following scheme to interpret the strength of reliability after ICC
measurement. Values less than 0.5 indicate poor reliability; those between 0.5 and 0.75 represent moderate reliability;
values between 0.75 and 0.9 show good reliability, and values greater than 0.90 indicate excellent reliability.
Point estimates of the ICC and their confidence intervals are equally crucial when reporting reliability results. Because of measurement error, a point estimate only approximates the true underlying score. A confidence interval indicates the range of values the true score is likely to fall within and is used to draw inferences regarding the underlying population. One can gauge the stability of the estimated ICC by examining its confidence interval.
Indeed, confidence intervals of the ICC provide more information than point estimates alone (Stoffel et al. 2017).
Calculation of the most commonly used forms of the ICC and their confidence intervals, defined by McGraw and
Wong, is currently supported in the open-source software package R (Stoffel et al. 2017) as well as commercial
software packages such as STATA (STATA 2017) and SPSS (Richard 1993). The mathematical algorithms behind
deriving the various types of the ICC in these statistical programs are publicly available. However, these algorithms have not been formally translated into statistical model specifications that directly perform the ICC analysis and produce the desired ICCs and confidence intervals in other environments. For instance, SAS is widely used to analyze educational, social, and clinical data worldwide, and it is also the Food and Drug Administration's preferred statistical software for receiving and reviewing clinical data (Shostak 2005; Dmitrienko et al. 2005). To our knowledge, a SAS macro that can handle the calculation of the ICCs and their respective confidence intervals has not yet been developed. In our experience, specifying a model for the six different ICCs is not straightforward. Hence, specifying the models and validating a SAS procedure for generating the most commonly used ICCs and their confidence intervals will fulfill researchers' need to assess reliability in various contexts. These specifications can also help develop similar algorithms that generate
similar results in other commercial and open-source statistical software applications. This paper presents statistical
models we specified to render ICC analysis for reliability estimates in SAS that are consistent with the results
generated by SPSS and STATA.
Methods
ICC Point Estimate Calculation
Shrout and Fleiss defined six distinct forms of the ICC (Shrout and Fleiss 1979). Two numbers inside parentheses identify each of these ICC types, as shown in Table 1. The first number refers to the model (1 for the one-way random, 2 for the two-way random, and 3 for the two-way mixed model). The second number is 1 or k, referring to a single rater or the mean of k raters/measurements. McGraw and Wong (1996) defined ten forms of the ICC based on the model (1, 2, or 3), the type (single rating or mean of k ratings), and the definition of relationship (consistency or absolute agreement). However, the formulas used by Shrout and Fleiss (1979) for the six forms of the ICC are sufficient for the computation of the ten types
of ICCs defined by McGraw and Wong. More specifically, the ICC calculations from both the Two-way random- and
mixed-effects models produce identical estimations because they use the same formula to calculate the ICC. In SPSS,
STATA, and R, the ICC calculations are based on McGraw and Wong's mathematical definition of the ICCs. For the mathematical description of the ICCs, Shrout, Fleiss, McGraw, and Wong used four essential parameters: the within-target mean square, the between-targets mean square, the between-measurements mean square, and the residual mean square.
All statistical analyses to calculate the ICCs were performed in SAS version 9.4 (SAS 2013). The theory we employed for the ICC calculation is based on the analysis of variance (ANOVA). Among the existing approaches (Alexander 1947; Nakagawa and Schielzeth 2010), the ANOVA method is used the most to calculate ICCs in clinical trials. The PROC GLM procedure in SAS was employed to generate the two-way random-effects (ANOVA) estimates, that is, the sums of squares, mean squares, and residuals. The ANOVA procedure was performed on a dataset with more than one measurement per participant to calculate the estimates of the relevant mean squares and residuals. Table 1 conveys the equations for calculating the point estimates of the different types of ICC from the mean square and error parameters obtained from the ANOVA model.
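To illustrate the ANOVA step, the following is a minimal sketch of how the mean squares can be obtained with PROC GLM (a sketch rather than the ICC6 macro itself; the dataset name ratings is a placeholder, and OverallANOVA and ModelANOVA are the standard PROC GLM output table names):

/* Two-way ANOVA with participants (Id) and measurements as crossed effects. */
/* ODS OUTPUT captures the tables that hold the mean squares: MSR (Id) and   */
/* MSC (measurement) come from ModelANOVA, and MSE comes from the Error row  */
/* of OverallANOVA.                                                          */
ods output OverallANOVA=overall_anova ModelANOVA=model_anova;
proc glm data=ratings;
   class Id measurement;
   model score = Id measurement;
run;
quit;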
Confidence Intervals
The confidence interval limits (the (1 − α/2) × 100th percentiles) for the six distinct forms of the ICC were calculated using the F distribution and the four essential parameters, as discussed by McGraw and Wong (1996).
One-way Model
The lower and upper confidence limits for the one-way model are calculated as

((FL − 1)/(FL + (k − 1)), (FU − 1)/(FU + (k − 1))) for a single rater, and (1 − 1/FL, 1 − 1/FU) for k raters,

where FL = Fobs/Ftabled, with Fobs the F statistic for the row effects from the ANOVA and Ftabled the (1 − 0.5α) × 100th percentile of the F distribution with (n − 1) numerator degrees of freedom and n(k − 1) denominator degrees of freedom; and FU = Fobs × Ftabled, where Ftabled is now the (1 − 0.5α) × 100th percentile of the F distribution with n(k − 1) numerator degrees of freedom and (n − 1) denominator degrees of freedom. Here n is the number of participants, and k represents the number of measurements in the model with a single rater and the number of raters in the model with k raters.
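As a concrete illustration, the one-way limits can be computed in a DATA step with the FINV function (a sketch; the mean squares are typed in from the worked example later in this paper rather than read from a dataset):

data oneway_ci;
   msr = 14.3239; msw = 0.429;       /* between- and within-target mean squares */
   n = 500; k = 2; alpha = 0.05;
   Fobs   = msr / msw;                           /* F statistic for rows */
   Ftab_l = finv(1 - alpha/2, n-1, n*(k-1));     /* tabled F for the lower limit */
   Ftab_u = finv(1 - alpha/2, n*(k-1), n-1);     /* tabled F for the upper limit */
   FL = Fobs / Ftab_l;
   FU = Fobs * Ftab_u;
   lower_sr = (FL - 1) / (FL + (k - 1));         /* single rater */
   upper_sr = (FU - 1) / (FU + (k - 1));
   lower_mr = 1 - 1/FL;                          /* mean of k raters */
   upper_mr = 1 - 1/FU;
run;

With these inputs, the limits reproduce the values reported in the annotated output below (0.93106–0.95098 for the single rater and 0.96430–0.97487 for k raters).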
Table 1. Intraclass correlation coefficients (ICC) defined by Shrout and Fleiss (1979) and McGraw and Wong (1996).

ICC(1,1), one-way random effects, absolute agreement (single rater/measurement):
(MSR − MSW) / (MSR + (k − 1)MSW)

ICC(2,1), two-way random effects, absolute agreement (single rater/measurement):
(MSR − MSE) / (MSR + (k − 1)MSE + (k/n)(MSC − MSE))

ICC(3,1), two-way mixed effects, consistency (single rater/measurement):
(MSR − MSE) / (MSR + (k − 1)MSE)

ICC(1,k), one-way random effects, absolute agreement (multiple raters/measurements):
(MSR − MSW) / MSR

ICC(2,k), two-way random effects, absolute agreement (multiple raters/measurements):
(MSR − MSE) / (MSR + (MSC − MSE)/n)

ICC(3,k), two-way mixed effects, consistency (multiple raters/measurements):
(MSR − MSE) / MSR

MSR - between-targets mean square; MSW - within-target mean square; MSC - between-measurements mean square; MSE - residual mean square; n - number of participants; k - number of raters or measurements
Two-way Model measures correlation using a "consistency" definition
The lower and upper confidence limits for the two-way models are calculated as

((FL − 1)/(FL + (k − 1)), (FU − 1)/(FU + (k − 1))) for a single rater, and (1 − 1/FL, 1 − 1/FU) for k raters,

where FL = Fobs/Ftabled, with Fobs the F statistic for the row effects from the ANOVA and Ftabled the (1 − 0.5α) × 100th percentile of the F distribution with (n − 1) numerator degrees of freedom and (n − 1)(k − 1) denominator degrees of freedom; and FU = Fobs × Ftabled, with Ftabled the (1 − 0.5α) × 100th percentile of the F distribution with (n − 1)(k − 1) numerator degrees of freedom and (n − 1) denominator degrees of freedom. As before, n is the number of participants, and k represents the number of measurements in the model with a single rater and the number of raters in the model with k raters.
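The same DATA-step pattern as in the one-way sketch above applies here, with only the observed F statistic and the degrees of freedom changed (a sketch under the same assumptions):

data twoway_ci;
   msr = 14.3239; mse = 0.28913;     /* between-targets and residual mean squares */
   n = 500; k = 2; alpha = 0.05;
   Fobs   = msr / mse;                              /* 49.5416 in the worked example */
   Ftab_l = finv(1 - alpha/2, n-1, (n-1)*(k-1));
   Ftab_u = finv(1 - alpha/2, (n-1)*(k-1), n-1);
   FL = Fobs / Ftab_l;
   FU = Fobs * Ftab_u;
   lower_sr = (FL - 1) / (FL + (k - 1));            /* single rater */
   upper_sr = (FU - 1) / (FU + (k - 1));
   lower_mr = 1 - 1/FL;                             /* mean of k raters */
   upper_mr = 1 - 1/FU;
run;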
Two-way Model measures correlation using an Absolute Agreement (A.A.) definition
In the case of two-way models where absolute agreement defines the relationship between different measurements or different raters, the lower and upper confidence limits are calculated for a single rater as

( n(MSR − FL·MSE) / (FL[k·MSC + (kn − k − n)MSE] + n·MSR), n(FU·MSR − MSE) / (k·MSC + (kn − k − n)MSE + n·FU·MSR) )

and for k raters as

( n(MSR − FL·MSE) / (FL(MSC − MSE) + n·MSR), n(FU·MSR − MSE) / (MSC − MSE + n·FU·MSR) ).

FL denotes the (1 − 0.5α) × 100th percentile of the F distribution with (n − 1) numerator degrees of freedom and ν denominator degrees of freedom, whereas FU is the (1 − 0.5α) × 100th percentile of the F distribution with ν numerator degrees of freedom and (n − 1) denominator degrees of freedom. MSR is the mean square for rows, MSC is the mean square for columns, MSE is the mean square error, n is the number of participants, and k represents the number of measurements in the model with a single rater and the number of raters in the model with k raters. ν can be calculated using the following formula:

ν = (a·MSC + b·MSE)² / [ (a·MSC)²/(k − 1) + (b·MSE)²/((n − 1)(k − 1)) ],

where a = k·ICC/(n(1 − ICC)) and b = 1 + k·ICC(n − 1)/(n(1 − ICC)).
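Because ν depends on the estimated ICC itself, it is computed after the point estimate. A sketch of the ν calculation and the corresponding F percentiles follows (values typed in from the worked example below; icc is the single-rater absolute-agreement estimate):

data nu_calc;
   icc = 0.94239; msc = 70.225; mse = 0.28913;
   n = 500; k = 2; alpha = 0.05;
   a  = k*icc / (n*(1 - icc));
   b  = 1 + k*icc*(n - 1) / (n*(1 - icc));
   nu = (a*msc + b*mse)**2
        / ( (a*msc)**2/(k - 1) + (b*mse)**2/((n - 1)*(k - 1)) );
   FL = finv(1 - alpha/2, n - 1, nu);   /* enters the lower limit */
   FU = finv(1 - alpha/2, nu, n - 1);   /* enters the upper limit */
run;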
SAS Macro to Calculate the Intraclass Correlation Coefficients and their Confidence Intervals
We implemented the algorithm to calculate the ICCs and their confidence intervals as a SAS macro called ICC6. This macro uses the ANOVA results obtained from the GLM procedure to calculate the point estimates of the ICCs (Table 1) and the confidence limits (the (1 − α/2) × 100th percentiles). The confidence interval estimates are calculated using the F distribution discussed in the Methods section. Seven parameters are required to run the macro ICC6: input is the name of the input dataset; id is the participant id variable; measurement is the variable that denotes the time point of measurement; k is the number of raters/measurements; score is the measurement of interest; n is the number of participants in the test; and alpha equals 1 minus the confidence level. Each participant in the input data should have at least two observations, from different time points or different raters. The final output table produced by the SAS macro contains the ICC point estimates, the parameters from the F distribution, and the lower and upper confidence limits.
%ICC6 (input=, id=, measurement=, k=, score=, n=, alpha=);
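For example, with the simulated dataset described in the next section (participant identifier ID, time-point variable Visit, and summed score Total), a call might look like:

%ICC6(input=v0_v1_r1, id=ID, measurement=Visit, k=2, score=Total, n=500, alpha=0.05);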
Simulated data for the application
We employed the RandMVBinary program in SAS to build a simulated dataset. The RandMVBinary program, developed by Wicklin (2013), implements Emrich and Piedmonte's algorithm (1991). The RandMVBinary function requires a vector of parameter probabilities and a matrix of parameter correlations, and it simulates the distribution as a matrix of zeros and ones. We used this program to generate correlated binary responses for two different measurements, taken at two distinct time points for each participant, with two independent raters. The simulated dataset includes the responses of 500 individuals on ten binary items, Q1 through Q10. It was challenging to find an actual dataset that could serve as an example for calculating all six ICC forms discussed in this paper with the SAS macro. Hence, we intentionally varied the parameters in the simulation to produce all the various forms of ICC. The parameter probabilities and the matrix of parameter correlations for the different measurements were deliberately adjusted to obtain high reliability between the two measures. Table 2 displays four rows of observations from the sample data.
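A sketch of how such data can be generated in SAS/IML is shown below. It assumes the RandMVBinary module from Wicklin (2013) has already been defined (e.g., via %INCLUDE of the published code); the item probabilities and the exchangeable correlation used here are illustrative placeholders, not the exact parameters of our simulation:

proc iml;
   /* define or %INCLUDE the RandMVBinary module from Wicklin (2013) here */
   p = {0.6 0.3 0.3 0.5 0.6 0.4 0.6 0.6 0.6 0.5};  /* placeholder item probabilities */
   R = I(10)*0.5 + j(10, 10, 0.5);                 /* placeholder correlation matrix  */
   call randseed(4321);
   X = RandMVBinary(500, p, R);                    /* 500 participants x 10 binary items */
   varNames = 'Q1':'Q10';
   create items from X [colname=varNames];         /* write the simulated items out */
   append from X;
   close items;
quit;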
Table 2. A correlated dataset with binary items generated by simulation (four rows of the data are shown)

ID  Visit  Rater  Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9  Q10  Total
1   V0     R1     1   0   0   1   1   0   1   1   1   1    7
2   V0     R1     1   0   0   0   1   0   1   1   1   0    5
3   V0     R1     0   0   1   0   0   1   0   0   0   0    2
4   V0     R1     1   0   0   1   1   1   1   1   1   1    8
An annotated example of ICC6 output
The following are the results of the %ICC6 call. The v0_v1_r1 dataset was used as input in this example, and the output from the GLM procedure used to generate the two-way random-effects estimates is given below. The SAS code and the dataset are included as supplementary materials. The annotated SAS output example by the UCLA Statistical Consulting Group (UCLA 2016) has been used as a template to present the ANOVA results from the GLM procedure here.
The GLM Procedure
Dependent Variable (a): score

Source (b)        DF (c)  Sum of Squares (d)  Mean Square (e)  F Value (f)  Pr > F (g)
Model             500     7217.836000         14.435672        49.93        <.0001
Error             499     144.275000          0.289128
Corrected Total   999     7362.111000

R-Square (h)  Coeff Var (i)  Root MSE (j)  score Mean (k)
0.980403      11.52146       0.537706      4.667000

Source (l)    DF (m)  Type I SS (n)  Mean Square (o)  F Value (p)  Pr > F (q)
Id            499     7147.611000    14.323870        49.54        <.0001
measurement   1       70.225000      70.225000        242.89       <.0001

Source        DF      Type III SS (r)  Mean Square  F Value  Pr > F
Id            499     7147.611000      14.323870    49.54    <.0001
measurement   1       70.225000        70.225000    242.89   <.0001
a. Dependent Variable – This is the dependent variable (score) in the GLM model.
b. Source – There are three parts to the sources of variation of the dependent variable: Model, Error, and Corrected Total. The partitioning of this variation is shown in terms of the variation of the response variable (sums of squares). Model is the variation explained by the model (Id and measurement); Error is the variation not explained by the model. The sum of these two sources (Model and Error) adds up to the Corrected Total.
c. DF – These are the degrees of freedom associated with the respective sources of variance.
d. Sum of Squares – These are the sums of squares that correspond to the three sources of variation. The sum of
squares for Model is the squared difference of the predicted value and the grand mean summed over all observations
whereas the sum of squares for Error is the squared difference of the observed value from the predicted value summed
over all observations. The Corrected Total sum of squares is the sum of the Model sum of squares and Error sum of
squares.
e. Mean Square – These are the mean squares (MS) that correspond to the partitions of the total variance. The MS is defined as Sum of Squares/DF.
f. F Value – This is the F value, computed as the Mean Square for Model / Mean Square for Error.
g. Pr > F – This is the p-value: the probability of observing an F value as large as, or larger than, 49.93 under the null hypothesis is < 0.0001.
h. R-Square – The R-square value for the model defines the proportion of the total variance explained by the Model and is calculated as Sum of Squares for the Model / Sum of Squares for the Corrected Total = 0.980403.
i. Coeff Var – This is the coefficient of variation (CV), defined as 100 times the root mean square error (Root MSE) divided by the mean of the response variable; CV = 100 × 0.537706/4.667000 = 11.52146.
j. Root MSE – This is the root mean square error, the square root of the Mean Square for Error; it defines the standard deviation of an observation about the predicted value.
k. score Mean – This is the grand mean of the response variable (score).
l. Source – Listed underneath are the variables in the model; this model has Id and measurement.
m. DF – These are the degrees of freedom for the individual predictor variables in the model.
n. Type I SS – These are the Type I (sequential) sums of squares.
o. Mean Square – These are the mean squares for the individual predictor variables in the model, calculated as the Sum of Squares/DF.
p. F Value – The F value is computed as the MS for the source variable / MS for Error.
q. Pr > F – This is the p-value: the probability of observing an F value as large as, or larger than, the respective F value (49.54 for Id, 242.89 for measurement) under the null hypothesis is < 0.0001.
r. Type III SS – These are the type III sum of squares, which are referred to as the partial sum of squares.
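For readers reproducing these quantities programmatically, the mean squares above can be pulled from the ODS datasets captured in the sketch in the Methods section (a sketch; the column names Source, MS, and HypothesisType are assumed to follow the standard ODS layouts for PROC GLM):

data _null_;
   set overall_anova;                              /* OverallANOVA table */
   if Source = 'Error' then call symputx('MSE', MS);
run;

data _null_;
   set model_anova (where=(HypothesisType = 3));   /* Type III rows */
   if Source = 'Id'          then call symputx('MSR', MS);
   if Source = 'measurement' then call symputx('MSC', MS);
run;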
The four essential parameters (MSW, MSC, MSR, and MSE), along with the six ICCs generated by the ICC-estimation part of the code (using the definitions in Table 1), are provided in the table below.

Obs  MSW    MSC     MSR      MSE      N    k
1    0.429  70.225  14.3239  0.28913  500  2

One_Way_R_Abs_Agrm_sr  Two_way_R_or_m_con_sr  Two_way_R_or_m_Abs_sr
0.94184                0.96043                0.94239

One_Way_R_Abs_Agrm_mr  Two_way_R_or_m_con_mr  Two_way_R_or_m_abs_mr
0.97005                0.97981                0.97034

MSW - within-target mean square; MSC - between-measurements mean square; MSR - between-targets mean square; MSE - residual mean square; n - number of participants; k - number of measurements/raters
One_Way_R_Abs_Agrm_sr – ICC (One-way Random Absolute Agreement-Single rater)
Two_way_R_or_m_con_sr – ICC (Two-way Random_Mixed Consistency-Single rater)
Two_way_R_or_m_Abs_sr – ICC (Two-way Random_Mixed Absolute Agreement-Single rater)
One_Way_R_Abs_Agrm_mr- ICC (One-way Random Absolute Agreement-multiple raters)
Two_way_R_or_m_con_mr – ICC (Two-way Random_Mixed Consistency-multiple raters)
Two_way_R_or_m_abs_mr – ICC (Two-way Random_Mixed Absolute Agreement-multiple raters)
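As a quick check of the Table 1 formulas, the one-way single-rater estimate can be reproduced by hand from the mean squares above:

ICC(1,1) = (MSR − MSW) / (MSR + (k − 1)MSW) = (14.3239 − 0.429) / (14.3239 + 0.429) = 13.8949 / 14.7529 ≈ 0.94184,

which matches the One_Way_R_Abs_Agrm_sr value in the table.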
The output from the final part of the code, which calculates the confidence intervals, is given below. Fobs, F_Dist_l, F_Dist_u, FL, FU, and the confidence limits (Lower_limit and Upper_limit) were calculated by the SAS macro using the methods provided in the Methods section of this paper. Fobs, FL, and FU are not used in the calculation of confidence intervals for the two-way models where absolute agreement defines the relationship between different measurements or different raters.
Obs  ICC_type                                                  ICC      Fobs     F_Dist_l  F_Dist_u  FL       FU       Lower_limit  Upper_limit
1    One-way Random Absolute Agreement-Single rater            0.94184  33.3890  1.19194   1.19196   28.0122  39.7984  0.93106      0.95098
2    Two-way Random_Mixed Consistency-Single rater             0.96043  49.5416  1.19206   1.19206   41.5597  59.0564  0.95301      0.96670
3    Two-way Random_Mixed Absolute Agreement-Single rater      0.94239  .        3.27352   2.12088   .        .        0.82648      0.97228
4    One-way Random Absolute Agreement-multiple raters         0.97005  33.3890  1.19194   1.19196   28.0122  39.7984  0.96430      0.97487
5    Two-way Random_Mixed Consistency-multiple raters          0.97981  49.5416  1.19206   1.19206   41.5597  59.0564  0.97594      0.98307
6    Two-way Random_Mixed Absolute Agreement-multiple raters   0.97034  .        3.27022   2.12005   .        .        0.90509      0.98594

Discussion
We specified various statistical models according to the standard algorithms described by Shrout and Fleiss (1979). We implemented these models in a SAS macro (ICC6) that generates six distinct ICCs with their respective confidence intervals. We showed that the results from the ICC6 macro agree with those generated by SPSS version 26 and STATA version 16. A dataset with two measurements at two time points with two raters was generated, and the parameters used in the data simulation were adjusted to ensure that the measurements have excellent reliability. The selection of the correct ICC form for a reliability study can be guided by the model type, the number of raters, and the definition of the relationship between the measurements (Koo and Li 2016). A decision flowchart based on McGraw and Wong's published work that explains the ICC selection protocol is provided in Figure 1.
The mean square parameters (within targets, between targets, and between measurements) and the error variance obtained from the ANOVA models using the SAS code are the inputs to the estimations in the ICC6 macro. These estimations include the two estimators for the single-rater and average-of-k-raters absolute agreement ICCs in the one-way model, the two estimators for the single-rater and average-of-k-raters absolute agreement ICCs in the two-way models, and the two estimators for the single-rater and average-of-k-raters consistency ICCs in the two-way models. Only absolute agreement ICCs are defined for the one-way model. The ICC estimate for a single rater is always smaller than that for the average of k raters. Among the different model estimations, the ICC estimate based on the one-way model is generally smaller than the estimates from the two-way models. When the relationship criterion is considered, an absolute agreement ICC is smaller than a consistency ICC. Although the ICCs estimated using two-way random-effects and mixed-effects models are numerically identical, they differ in interpretation. In the two-way random-effects model, the raters are treated as a random sample from a larger population, so the estimated correlation between any two measurements made on a target generalizes to other raters; in the two-way mixed-effects model, the raters are treated as fixed, so the estimated agreement or consistency applies only to the raters involved in the study.
The ICC estimate obtained from a sample is only an expected value of the true ICC. A 95% confidence interval provides a range that contains the true value of the ICC parameter with a probability of 0.95. The use of the ICC together with its confidence interval is becoming increasingly important because of the significant role confidence intervals play in reliability estimation. A recent study (Shahraz et al. 2021) on measurement equivalence between electronic and paper-based patient-reported outcome measures also indicates the significance of the lower confidence limit in analyzing the reliability between measurements.
Table 3 shows the ICC parameters and 95% confidence limits obtained from all three software packages using the simulated data discussed above. A publicly available SAS macro exists for calculating the point estimates of the different forms of the ICC, and a published report discussed SAS code to generate a point estimate of a single ICC and its confidence interval using a one-way ANOVA model (Li and Nawar 2007). However, to our knowledge, none of these studies has provided SAS code to generate a complete set of the conventional forms of the ICC and their confidence intervals. The SAS macro provided here can generate a complete set of the different forms of the ICC, as suggested by recent studies.
Table 3. Intraclass correlation coefficients (ICC) with 95% confidence intervals calculated from the SAS macro (ICC6), SPSS, and STATA

ICC Type   SAS                  SPSS                 STATA
ICC(1,1)   0.942 (0.931-0.951)  0.942 (0.931-0.951)  0.942 (0.931-0.951)
ICC(2,1)   0.942 (0.826-0.972)  0.942 (0.831-0.972)  0.942 (0.831-0.972)
ICC(3,1)   0.96 (0.953-0.967)   0.96 (0.953-0.967)   0.96 (0.953-0.967)
ICC(1,k)   0.97 (0.964-0.975)   0.97 (0.964-0.975)   0.97 (0.964-0.975)
ICC(2,k)   0.97 (0.905-0.986)   0.97 (0.908-0.986)   0.97 (0.908-0.986)
ICC(3,k)   0.98 (0.976-0.983)   0.98 (0.976-0.983)   0.98 (0.976-0.983)
ICC estimates calculated by different statistical software packages from the same statistical procedure and the same data can be expected to vary slightly because of differences in the algorithms handling the statistical models (Qin et al. 2019). Nevertheless, the point estimates obtained using the SAS macro ICC6 are identical to the results obtained from the commercial statistical software packages, and the confidence limits agree closely. This across-program agreement supports the accuracy of the methods we adopted to calculate the ICC in SAS. Our work extends the methodology of ICC estimation available in R, STATA, and SPSS to SAS, which will help SAS users involved in reliability analysis.
The SAS macro provided here estimates six distinct forms of the ICC and their confidence intervals based on the within-subjects, between-subjects, and between-raters mean squares and the error variance, using one-way and two-way ANOVA models. The work presented here is a development of SAS methodology using publicly available statistical concepts. It can be applied to the estimation of the set of ICCs described by Shrout and Fleiss (1979) and McGraw and Wong (1996). Of note, this work has the same limitations as the original work by these authors, i.e., it is limited to calculating ICCs based on predefined parameters. When the plan is to run multivariable models, a post-estimation ICC provides a solution different from these conventional ICCs (Shahraz et al. 2021). The SAS macro calculates confidence intervals for various ICC forms that involve different assumptions and interpretations. Updating this macro to calculate the ICC for repeated measures and for data with missing values may need further investigation.
References

1. Alexander, H. W.: The estimation of reliability when several trials are available. Psychometrika 12(2), 79–99 (1947) pmid:20254752
2. Bartko, J. J.: The intraclass correlation coefficient as a measure of reliability. Psychol. Rep. 19(1), 3–11 (1966) https://doi.org/10.2466/pr0.1966.19.1.3
3. Belur, J., Tompson, L., Thornton, M., Simon, M.: Interrater reliability in systematic review methodology: exploring variation in coder decision-making. Sociological Methods & Research 50(2), 837–865 (2018) https://doi.org/10.1177/0049124118799372
4. Bland, J. M., Altman, D. G.: Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 327, 307–310 (1986) https://doi.org/10.1016/S0140-6736(86)90837-8
5. Boateng, G. O., Neilands, T. B., Frongillo, E. A., Melgar-Quiñonez, H. R., Young, S. L.: Best practices for developing and validating scales for health, social, and behavioral research: a primer. Front. Public Health 6, 149 (2018) https://doi.org/10.3389/fpubh.2018.00149
6. Brown, B. W. Jr., Lucero, R. J., Foss, A. B.: A situation where the Pearson correlation coefficient leads to erroneous assessment of reliability. J. Clin. Psychol. 18(1), 95–97 (1962) https://doi.org/10.1002/1097-4679(196201)18:1<95::AID-JCLP2270180131>3.0.CO;2-2
7. Bruton, A., Conway, J. H., Holgate, S. T.: Reliability: what is it, and how is it measured? Physiotherapy 86, 94–99 (2000) https://doi.org/10.1016/S0031-9406(05)61211-4
8. Dmitrienko, A., Molenberghs, G., Chuang-Stein, C., Offen, W.: Analysis of Clinical Trials Using SAS: A Practical Guide. Cary, NC (2005) https://doi.org/10.1080/10543400500508994
9. Emrich, L. J., Piedmonte, M. R.: A method for generating high-dimensional multivariate binary variables. The American Statistician 45, 302–304 (1991)
10. Fisher, R. A.: Statistical Methods for Research Workers. Oliver and Boyd, Edinburgh (1954) https://doi.org/10.1007/978-1-4612-4380-9_6
11. Fleiss, J. L.: The Design and Analysis of Clinical Experiments. Wiley and Sons, New York (1986)
12. Hallgren, K. A.: Computing inter-rater reliability for observational data: an overview and tutorial. Tutor. Quant. Methods Psychol. 8(1), 23–34 (2012) https://doi.org/10.20982/tqmp.08.1.p023
13. Hopkins, W. G.: Measures of reliability in sports medicine and science. Sports Med. 30(1), 1–15 (2000) https://doi.org/10.2165/00007256-200030010-00001
14. Koo, T. K., Li, M. Y.: A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J. Chiropr. Med. 15(2), 155–163 (2016) https://doi.org/10.1016/j.jcm.2016.02.012
15. Li, L., Nawar, S.: Reliability analysis: calculate and compare intra-class correlation coefficients (ICC) in SAS. Northeast SAS Users Group (2007)
16. Liljequist, D., Elfving, B., Roaldsen, K. S.: Intraclass correlation – a discussion and demonstration of basic features. PLoS ONE 14(7), e0219854 (2019) https://doi.org/10.1371/journal.pone.0219854
17. McGraw, K. O., Wong, S. P.: Forming inferences about some intraclass correlation coefficients. Psychol. Methods 1(1), 30–46 (1996) https://doi.org/10.1037/1082-989X.1.1.30
18. McGraw, K. O., Wong, S. P.: Forming inferences about some intraclass correlation coefficients: correction. Psychol. Methods 1(4), 390 (1996)
19. Nakagawa, S., Schielzeth, H.: Repeatability for Gaussian and non-Gaussian data: a practical guide for biologists. Biol. Rev. 85, 935–956 (2010) pmid:20569253
20. Nunnally, J. C., Bernstein, I. H.: Psychometric Theory, 3rd edn. McGraw-Hill, New York (1994)
21. Portney, L. G., Watkins, M. P.: Foundations of Clinical Research: Applications to Practice (Vol. 892). Pearson/Prentice Hall, Upper Saddle River, NJ (2009)
22. Potashman, M., Ping, M., Tahir, M., Shahraz, S., Dichter, S., Perneczky, R., Nolte, S.: Psychometric properties of the Alzheimer's Disease Cooperative Study – Activities of Daily Living for Mild Cognitive Impairment (ADCS-MCI-ADL) scale: a post hoc analysis of the ADCS ADC-008 trial. BMC Geriatrics, accepted for publication (2022)
23. Qin, S., Nelson, L., McLeod, L., Eremenco, S., Coons, S. J.: Assessing test–retest reliability of patient-reported outcome measures using intraclass correlation coefficients: recommendations for selecting and documenting the analytical formula. Qual. Life Res. 28(4), 1029–1033 (2019) https://doi.org/10.1007/s11136-018-2076-0
24. Revicki, D.: Internal consistency reliability. In: Michalos, A. C. (ed.) Encyclopedia of Quality of Life and Well-Being Research. Springer, Dordrecht (2014)
25. Richard, N. M.: Interrater reliability with SPSS for Windows 5.0. The American Statistician 47(4), 292–296 (1993) https://doi.org/10.1080/00031305.1993.10476000
26. Rosner, B.: Fundamentals of Biostatistics, 6th edn. Duxbury: Thomson Brooks/Cole (2006)
27. SAS/STAT Software, Version 9.4. SAS Institute Inc., Cary, NC, USA (2013) https://www.sas.com
28. Shahraz, S., Pham, T. P., Gibson, M., De La Cruz, M., Baara, M., Karnik, S., Dell, C., Pease, S., Nigam, S., Cappelleri, J. C., Lipset, C., Zornow, P., Lee, J., Byrom, B.: Does scrolling affect measurement equivalence of electronic patient-reported outcome measures? Results of a quantitative equivalence study. J. Patient Rep. Outcomes 5, 23 (2021) https://doi.org/10.1186/s41687-021-00296-z
29. Shostak, J.: SAS Programming in the Pharmaceutical Industry. (2005)
30. Shrout, P. E., Fleiss, J. L.: Intraclass correlations: uses in assessing rater reliability. Psychol. Bull. 86(2), 420–428 (1979) https://doi.org/10.1037/0033-2909.86.2.420
31. STATA: Stata User's Guide, Release 15. (2017) https://www.stata.com/manuals15/u.pdf
32. Stoffel, M. A., Nakagawa, S., Schielzeth, H.: rptR: repeatability estimation and variance decomposition by generalized linear mixed-effects models. Methods in Ecology and Evolution 8(11), 1639–1644 (2017) https://doi.org/10.1111/2041-210X.12797
33. UCLA: Statistical Consulting Group. Introduction to SAS (2016) https://stats.idre.ucla.edu/sas/modules/sas-learning-module-introduction-to-the-features-of-sas/
34. U.S. Department of Health and Human Services Food and Drug Administration (FDA): Guidance for Industry. Patient-Reported Outcome Measures: Use in Medical Product Development to Support Labeling Claims (2009) https://www.fda.gov/media/77832/download
35. Weir, J. P.: Quantifying test-retest reliability using the intraclass correlation coefficient and the SEM. J. Strength Cond. Res. 19(1), 231–240 (2005) https://doi.org/10.1519/15184.1
36. Wicklin, R.: Simulating Data with SAS. SAS Institute Inc., Cary, NC, pp. 154–157 (2013) https://support.sas.com/content/dam/SAS/support/en/books/simulating-data-with-sas/65378_excerpt.pdf
37. Zaki, R., Bulgiba, A., Nordin, N., Ismail, N. A.: A systematic review of statistical methods used to test for reliability of medical instruments measuring continuous variables. Iran. J. Basic Med. Sci. 16(6), 803–807 (2013) PMID: 23997908; PMCID: PMC3758037