Intraclass correlation for reliability assessment: the introduction of a validated program in SAS (ICC6)

V. S. Senthil Kumar1 & Saeid Shahraz2,3

1 Heller School for Social Policy and Management, Brandeis University, Waltham, MA 02453, USA. Email: vssenthilk@gmail.com
2 Verantos, Inc., 325 Sharon Park Dr. Suite 730, Palo Alto, CA 94025, USA; South San Francisco, CA 94080, USA. Email: saeid.shahraz@verantos.com
3 The Institute for Clinical Research and Health Policy Studies (ICRHPS), Tufts Medical Center, 800 Washington Street, Boston, MA 02111, USA

Abstract

Reliability refers to the ability of measurements to produce consistent results and is crucial for any scientific research measurement. The intraclass correlation coefficient (ICC) is the most widely used statistic for determining the reproducibility of measurements. A calculated ICC and its confidence interval, which reveals the underlying sampling distribution, can help detect an experimental method's ability to identify systematic differences between research participants in a test. This study aimed to introduce a new SAS macro, ICC6, for calculating different ICC forms and their confidence intervals. A SAS macro that employs the PROC GLM procedure in SAS was created to generate two-way random-effects (ANOVA) estimates. A simulated dataset was used as input to the macro to calculate the point estimates for the different ICCs. The upper and lower confidence limits of the ICC forms were calculated using the F distribution. Our SAS macro provides a complete set of ICC forms and their confidence intervals. A validation analysis using the commercial software packages STATA and SPSS delivered identical results. This article reports a development of SAS methodology, using publicly available statistical approaches, for estimating six distinct forms of ICC and their confidence intervals.
This work extends the general methodology supported by a few other statistical software packages to SAS.

Keywords: Intraclass Correlation, ICC, ANOVA, Reliability

Statements and Declarations

Conflict of interest: The authors declare that they have no conflict of interest.

Introduction

Reliability refers to the consistency of a measure, ensuring that measurement variation is due to replicable differences between people regardless of time, target behavior, or user profile. Reliability is a fundamental metric for understanding the quality of a measurement and provides insight into the source of a possible inconsistency (Bruton et al. 2000). The reliability of measurement scores can help assess the precision of a given measurement. In clinical trials, obtaining consistent results through repeated measurements with the same patient (test-retest reliability) plays a key role in decision-making. Reliability is recommended by the U.S. Food and Drug Administration Guidance for Industry (FDA 2009) as one of the principal psychometric properties for validating a patient-reported outcome (PRO) instrument for clinical trials. Reliability is also widely used in clinical trials to validate measurement equivalence among different forms of the same test (Potashman et al. 2022). Three kinds of reliability measures are widely used in data analysis: internal consistency (Revicki 2014), test-retest reliability or intra-rater reliability, and inter-rater reliability (Belur et al. 2018). Internal consistency helps judge the stability of results across the items of a measurement instrument: it measures whether different items deemed to measure the same construct in a test produce similar results. Internal consistency is often measured with Cronbach's alpha or coefficient omega (Boateng et al. 2018), which summarize the correlations between items in the same test. Test-retest reliability quantifies the consistency or agreement between responses in the same population at different points in time.
Inter-rater reliability refers to the level of consistency or agreement on a test for which the scoring involves more than one rater's judgment. In other words, it reflects the reliability of the scoring process delivered by different raters assessing the same responses, where the possible inconsistency or disagreement may arise from the selection of raters (Hallgren 2012). Test-retest reliability and inter-rater reliability are essential psychometric aspects of reliability analyses. The most widely used strategy for assessing these reliability measures is to analyze the differences between the responses to a test for each research participant. Researchers apply various statistical methods for reliability assessment: using a paired t-test to compare mean differences (Hopkins 2000), measuring the strength of the linear relationship between the results of two tests with the Pearson correlation (Brown et al. 1962), calculating the standard error of measurement (SEM) (Weir 2005), and quantifying the agreement between two quantitative measurements by studying the mean difference with a Bland-Altman plot (Bland and Altman 1986). The Pearson correlation is a correlation measure, and the SEM represents the measurement error. The paired t-test and the Bland-Altman plot are more suitable for analyzing agreement between measurements. The ICC is a popular method for analyzing any type of reliability in psychometric assessments from both the consistency and the agreement viewpoints, and it factors the measurement errors into the estimation (Liljequist et al. 2019; Zaki et al. 2013). The ICC's generic definition is the ratio of the variance of interest to the sum of the variance of interest and the error variance. Several forms of ICC can produce different results when applied to the same data.
Fisher (1954) first defined the ICC as the ratio of the between-subject variance to the total variance, i.e., the sum of the between-subject and within-subject variances, using a one-way analysis of variance (ANOVA) model and considering the ICC an alternative to the Pearson correlation coefficient. Bartko (1966) introduced fixed and random rater effects using a two-way ANOVA model to calculate the ICC. Shrout and Fleiss (1979) and McGraw and Wong (1996) extended the ICC to various forms, calculating it from mean squares in various settings using diverse ANOVA models. Shrout and Fleiss introduced six special ICC measures. Three statistical models guided these ICC reliability measures: the one-way random model, the two-way random model, and the two-way mixed model. Shrout and Fleiss built a single-rater and a multiple-rater (average of k raters) model for each of these three models. Later, McGraw and Wong added four other ICC measures to the ones Shrout and Fleiss proposed. These additional measures yield results numerically identical to those of the two-way random model and the two-way mixed model described by Shrout and Fleiss (1979); however, McGraw and Wong interpreted the additional models differently. McGraw and Wong (1996) defined ten forms of ICC based on the model (the one-way random model, the two-way random model, or the two-way mixed model), the type (single rating or mean of k ratings), and the definition of the relationship (consistency or absolute agreement). Table 1 summarizes these models along with their vocabulary. ICC values range between 0 and 1 (0 indicates no reliability, and 1 represents perfect reliability). Agreed-upon cutoffs for a uniform interpretation of ICC results do not exist (Nunnally and Bernstein 1994; Fleiss 1986; Rosner 2006; Portney and Watkins 2009).
However, several authors, including Koo and Li (2016), have suggested the following scheme for interpreting the strength of reliability after ICC measurement: values less than 0.5 indicate poor reliability; values between 0.5 and 0.75 represent moderate reliability; values between 0.75 and 0.9 show good reliability; and values greater than 0.90 indicate excellent reliability. Point estimates of the ICC and their confidence intervals are equally crucial when reporting reliability results. Because of measurement error, a point estimate only approximates the true underlying score. A confidence interval indicates a range of values the true score is likely to take and is used to draw inferences regarding the underlying population. One can judge the stability of the estimated ICC by looking at the confidence interval; confidence intervals of the ICC therefore provide more information than point estimates (Stoffel et al. 2017). Calculation of the most commonly used forms of the ICC and their confidence intervals, as defined by McGraw and Wong, is currently supported in the open-source software package R (Stoffel et al. 2017) as well as in commercial software packages such as STATA (STATA 2017) and SPSS (Richard 1993). The mathematical algorithms behind deriving the various types of ICC in these statistical programs are publicly available. However, statisticians have not formally introduced these algorithms as statistical model specifications in SAS that directly perform the ICC analysis and produce the desired ICCs and confidence intervals. SAS is widely used to analyze educational, social, and clinical data worldwide and is the Food and Drug Administration's preferred statistical software for receiving and reviewing clinical data (Shostak 2005; Dmitrienko et al. 2005). To our knowledge, a SAS macro that can handle the calculation of the ICCs and their respective confidence intervals has not yet been developed.
In our experience, specifying a model for the six different ICCs is not straightforward. Hence, specifying the models and validating a SAS procedure for generating a list of the most used ICCs and their confidence intervals will fulfill researchers' need to assess reliability measures in various contexts. The specifications also help in developing similar algorithms that generate similar results in other commercial and open-source statistical software applications. This paper presents the statistical models we specified to render ICC analysis for reliability estimates in SAS, with results consistent with those generated by SPSS and STATA.

Methods

ICC Point Estimate Calculation

Shrout and Fleiss (1979) defined six distinct forms of ICC. Two numbers inside parentheses identify each of these types of ICC, as shown in Table 1. The first number refers to the model (1 for the one-way random, 2 for the two-way random, and 3 for the two-way mixed model). The second number is 1 or k, referring to a single rater or the mean of k raters/measurements. McGraw and Wong (1996) defined ten forms of ICC based on the model (1, 2, or 3), the type (1 or mean of k raters), and the definition of the relationship (consistency or absolute agreement). However, the formulas used by Shrout and Fleiss (1979) for the six forms of ICC are sufficient for computing the ten types of ICC defined by McGraw and Wong. More specifically, the ICC calculations from the two-way random- and mixed-effects models produce identical estimates because they use the same formula to calculate the ICC. In SPSS, STATA, and R, the ICC calculations are based on McGraw and Wong's mathematical definition of the ICCs. For the mathematical description of the ICCs, Shrout, Fleiss, McGraw, and Wong used four essential parameters: the within-target mean square, the between-targets mean square, the between-measurements mean square, and the residual mean square.
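These four parameters come from the ANOVA decomposition of a fully crossed participants-by-raters table. The following is an illustrative Python sketch of that decomposition (the paper's implementation uses PROC GLM in SAS; the small ratings matrix below is hypothetical example data, not from the paper):

```python
# Illustrative sketch (not the ICC6 SAS macro): computing the four mean
# squares of a two-way, fully crossed design. x is a list of n rows
# (targets/participants), each holding k scores (raters/measurements).
def mean_squares(x):
    n, k = len(x), len(x[0])
    grand = sum(sum(row) for row in x) / (n * k)
    row_means = [sum(row) / k for row in x]
    col_means = [sum(x[i][j] for i in range(n)) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)      # between targets
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)      # between measurements
    ss_total = sum((v - grand) ** 2 for row in x for v in row)
    ss_error = ss_total - ss_rows - ss_cols                     # residual
    ss_within = sum((x[i][j] - row_means[i]) ** 2
                    for i in range(n) for j in range(k))        # within targets

    return {
        "MSR": ss_rows / (n - 1),               # between-targets mean square
        "MSC": ss_cols / (k - 1),               # between-measurements mean square
        "MSE": ss_error / ((n - 1) * (k - 1)),  # residual mean square
        "MSW": ss_within / (n * (k - 1)),       # within-target mean square
    }

# Hypothetical data: 6 subjects scored by 2 raters.
ratings = [[9, 8], [6, 7], [8, 8], [7, 6], [10, 9], [6, 5]]
ms = mean_squares(ratings)
```

For a complete balanced layout, the within-target sum of squares equals the between-measurements plus residual sums of squares, which ties the one-way and two-way decompositions together.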
All statistical analyses to calculate the ICCs were performed in SAS version 9.4 (SAS 2013). The theory we employed in the ICC calculation is based on the analysis of variance (ANOVA). Among the existing approaches (Alexander 1947; Nakagawa and Schielzeth 2010), the ANOVA method is the most widely used for calculating ICCs in clinical trials. The PROC GLM procedure in SAS was employed to generate two-way random-effects (ANOVA) estimates, which provide the sums of squares, mean squares, and residuals. The ANOVA procedure was performed on a dataset with more than one measurement to estimate the relevant mean squares and residuals. Table 1 conveys the equations for calculating the point estimates of the different types of ICC using the mean square and error parameters obtained from the ANOVA model.

Confidence Intervals

The confidence interval limits ((1 − α/2) × 100th percentile) for the six distinct forms of the ICC were calculated using the F distribution and the four essential parameters, as discussed by McGraw and Wong (1996).

One-way Model

The lower and upper confidence limits for the one-way model are

((FL − 1)/(FL + (k − 1)), (FU − 1)/(FU + (k − 1))) for a single rater, and
(1 − 1/FL, 1 − 1/FU) for k raters.

FL = Fobs/Ftabled, where Fobs is the row-effects F statistic from the ANOVA and Ftabled denotes the (1 − 0.5α) × 100th percentile of the F distribution with (n − 1) numerator degrees of freedom and n(k − 1) denominator degrees of freedom. FU = Fobs × Ftabled, where Fobs is the row-effects F statistic from the ANOVA and Ftabled is the (1 − 0.5α) × 100th percentile of the F distribution with n(k − 1) numerator degrees of freedom and (n − 1) denominator degrees of freedom. n is the number of participants, and k represents the number of measurements in the model with a single rater and the number of raters in the model with k raters.

Table 1.
Intraclass correlation coefficients (ICC) defined by Shrout and Fleiss (1979) and McGraw and Wong (1996).

ICC Form (McGraw & Wong) | ICC Form (Shrout & Fleiss) | Formula for ICC
One-way random effects, absolute agreement (single rater/measurement) | ICC (1,1) | (MSR − MSW) / (MSR + (k − 1)MSW)
Two-way random effects, absolute agreement (single rater/measurement) | ICC (2,1) | (MSR − MSE) / (MSR + (k − 1)MSE + (k/n)(MSC − MSE))
Two-way mixed effects, consistency (single rater/measurement) | ICC (3,1) | (MSR − MSE) / (MSR + (k − 1)MSE)
One-way random effects, absolute agreement (multiple raters/measurements) | ICC (1,k) | (MSR − MSW) / MSR
Two-way random effects, absolute agreement (multiple raters/measurements) | ICC (2,k) | (MSR − MSE) / (MSR + (MSC − MSE)/n)
Two-way mixed effects, consistency (multiple raters/measurements) | ICC (3,k) | (MSR − MSE) / MSR

MS – mean square; MSR – between-targets mean square; MSW – within-target mean square; MSC – between-measurements mean square; MSE – residual mean square; n – number of participants; k – number of raters or measurements

Two-way Model measures correlation using a "consistency" definition

The lower and upper confidence limits for the two-way models are ((FL − 1)/(FL + (k − 1)), (FU − 1)/(FU + (k − 1))) for a single rater and (1 − 1/FL, 1 − 1/FU) for k raters. FL = Fobs/Ftabled, where Fobs is the row-effects F statistic from the ANOVA and Ftabled denotes the (1 − 0.5α) × 100th percentile of the F distribution with (n − 1) numerator degrees of freedom and (n − 1)(k − 1) denominator degrees of freedom. FU = Fobs × Ftabled, where Fobs is the row-effects F statistic from the ANOVA and Ftabled is the (1 − 0.5α) × 100th percentile of the F distribution with (n − 1)(k − 1) numerator degrees of freedom and (n − 1) denominator degrees of freedom. n is the number of participants, and k represents the number of measurements in the model with a single rater and the number of raters in the model with k raters.

Two-way Model measures correlation using an Absolute Agreement (A.A.)
definition

In the case of two-way models where absolute agreement defines the relationship between different measurements or different raters, the lower and upper confidence limits for a single rater are calculated as

Lower = n(MSR − FL × MSE) / {FL[k × MSC + (kn − k − n)MSE] + n × MSR}
Upper = n(FU × MSR − MSE) / {k × MSC + (kn − k − n)MSE + n × FU × MSR}

and for k raters as

Lower = n(MSR − FL × MSE) / {FL(MSC − MSE) + n × MSR}
Upper = n(FU × MSR − MSE) / {MSC − MSE + n × FU × MSR}.

FL denotes the (1 − 0.5α) × 100th percentile of the F distribution with n − 1 numerator degrees of freedom and υ denominator degrees of freedom, whereas FU is the (1 − 0.5α) × 100th percentile of the F distribution with υ numerator degrees of freedom and n − 1 denominator degrees of freedom. MSR is the mean square for rows, MSC is the mean square for columns, MSE is the mean square error, n is the number of participants, and k represents the number of measurements in the model with a single rater and the number of raters in the model with k raters. υ can be calculated using the following formula:

υ = (a × MSC + b × MSE)² / [ (a × MSC)²/(k − 1) + (b × MSE)²/((n − 1)(k − 1)) ]

where a = k × ICC / (n(1 − ICC)) and b = 1 + k × ICC(n − 1) / (n(1 − ICC)).

SAS Macro to Calculate the intraclass correlation coefficients and their confidence intervals

We implemented the algorithm to calculate the ICCs and their confidence intervals in a SAS macro called ICC6. This SAS macro uses the ANOVA results obtained from the GLM procedure to calculate the point estimates of the ICCs (Table 1) and the confidence interval limits ((1 − α/2) × 100th percentile). The confidence interval estimates are calculated using the F distribution as discussed in the Methods section.
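The point-estimate formulas in Table 1 can be cross-checked outside SAS. The following is an illustrative Python sketch, not the SAS implementation; the four mean squares are those the macro reports for the simulated dataset later in this paper (MSW = 0.429, MSC = 70.225, MSR = 14.3239, MSE = 0.28913, with n = 500 and k = 2):

```python
# Illustrative check (not the ICC6 SAS macro) of the Table 1 formulas,
# using the mean squares the macro produced for the simulated dataset.
MSW, MSC, MSR, MSE = 0.429, 70.225, 14.3239, 0.28913
n, k = 500, 2

icc = {
    "ICC(1,1)": (MSR - MSW) / (MSR + (k - 1) * MSW),
    "ICC(2,1)": (MSR - MSE) / (MSR + (k - 1) * MSE + (k / n) * (MSC - MSE)),
    "ICC(3,1)": (MSR - MSE) / (MSR + (k - 1) * MSE),
    "ICC(1,k)": (MSR - MSW) / MSR,
    "ICC(2,k)": (MSR - MSE) / (MSR + (MSC - MSE) / n),
    "ICC(3,k)": (MSR - MSE) / MSR,
}
for name, value in icc.items():
    print(f"{name}: {value:.5f}")
```

These six values match the macro's printed point estimates (0.94184, 0.94239, 0.96043, 0.97005, 0.97034, and 0.97981, respectively) to five decimal places.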
Seven parameters are required to run the macro ICC6: input is the name of the input dataset; id is the participant id variable; measurement is the variable that denotes the time point of measurement; k is the number of raters/measurements; score is the measurement of interest; n is the number of participants in the test; and alpha equals 1 minus the confidence level. Each participant in the input data should have at least two observations from different time points or different raters. The final output table produced by the SAS macro contains the ICC point estimates, the parameters from the F distribution, and the lower and upper confidence limits.

%ICC6 (input=, id=, measurement=, k=, score=, n=, alpha=);

Simulated data for the application

We employed the RANDMVBINARY program in SAS to build a simulated dataset. The SAS program RandMVBinary, developed by Wicklin (2013), implements Emrich and Piedmonte's algorithm (1991). The RANDMVBINARY function requires a vector of parameter probabilities and a matrix of parameter correlations, and it simulates the distributions as a matrix of zeros and ones. We used this program to generate correlated binary responses for two different measurements from two distinct time points for each participant, with two independent raters. The simulated dataset includes the responses of 500 individuals on ten binary items, Q1 through Q10. It was challenging to find actual data to use as an example with the SAS macro to calculate the six different ICC forms discussed in this paper. Hence, we intentionally varied the parameters in the simulated dataset to produce all the various forms of ICC. The parameter probabilities and the matrix of parameter correlations for the data in the different measurements were adjusted to deliberately yield high reliability between the two measures. Table 2 displays the sample data with four rows of observations.

Table 2.
A correlated dataset with binary items generated by simulation (four rows of the data are shown)

ID  Visit  Rater  Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9  Q10  Total
1   V0     R1     1   0   0   1   1   0   1   1   1   1    7
2   V0     R1     1   0   0   0   1   0   1   1   1   0    5
3   V0     R1     0   0   1   0   0   1   0   0   0   0    2
4   V0     R1     1   0   0   1   1   1   1   1   1   1    8

An annotated example of ICC6 output

Following are the results of the %ICC6 call. The v0_v1_r1 dataset was used as input in this example, and the output from the GLM procedure used to generate the two-way random-effects estimates is given below. The SAS code and the dataset are included as supplementary materials. The annotated SAS output example by the UCLA Statistical Consulting Group (UCLA 2016) has been used as a template to present the ANOVA results from the GLM procedure here.

The GLM Procedure
Dependent Variable(a): score

Source(b)        DF(c)  Sum of Squares(d)  Mean Square(e)  F Value(f)  Pr > F(g)
Model             500     7217.836000        14.435672       49.93      <.0001
Error             499      144.275000         0.289128
Corrected Total   999     7362.111000

R-Square(h)  Coeff Var(i)  Root MSE(j)  score Mean(k)
0.980403     11.52146      0.537706     4.667000

Source(l)     DF(m)  Type I SS(n)   Mean Square(o)  F Value(p)  Pr > F(q)
Id             499   7147.611000    14.323870        49.54      <.0001
measurement      1     70.225000    70.225000       242.89      <.0001

Source        DF     Type III SS(r)  Mean Square    F Value     Pr > F
Id             499   7147.611000    14.323870        49.54      <.0001
measurement      1     70.225000    70.225000       242.89      <.0001

a. Dependent Variable – This is the dependent variable (score) in the GLM model.
b. Source – There are three sources of variation of the dependent variable: Model, Error, and Corrected Total. The partitioning of this variation is shown in terms of the variation of the response variable (sums of squares). Model is the variation explained by the model (Id and measurement); Error is the variation not explained by the model. The sum of these two sources (Model and Error) adds up to the Corrected Total.
c. DF – These are the degrees of freedom associated with the respective sources of variance.
d.
Sum of Squares – These are the sums of squares that correspond to the three sources of variation. The sum of squares for Model is the squared difference between the predicted value and the grand mean, summed over all observations, whereas the sum of squares for Error is the squared difference between the observed value and the predicted value, summed over all observations. The Corrected Total sum of squares is the sum of the Model sum of squares and the Error sum of squares.
e. Mean Square – These are the mean squares (MS) that correspond to the partitions of the total variance. The MS is defined as the sum of squares divided by the DF.
f. F Value – This is the F value, computed as the Mean Square for Model divided by the Mean Square for Error.
g. Pr > F – This is the p-value: the probability of observing an F value as large as, or larger than, 49.93 under the null hypothesis is < 0.0001.
h. R-Square – The R-Square value defines the proportion of the total variance explained by the Model and is calculated as the Sum of Squares for the Model divided by the Sum of Squares for the Corrected Total = 0.980403.
i. Coeff Var – This is the coefficient of variation (CV), defined as 100 times the root mean square error (Root MSE) divided by the mean of the response variable; CV = 100 × 0.537706 / 4.667000 = 11.52146.
j. Root MSE – This is the root mean square error, the square root of the Mean Square for Error; it defines the standard deviation of an observation about the predicted value.
k. score Mean – This is the grand mean of the response variable (score).
l. Source – These are the variables in the model; this model has Id and measurement.
m. DF – These are the degrees of freedom for the individual predictor variables in the model.
n. Type I SS – These are the type I (sequential) sums of squares.
o. Mean Square – These are the mean squares for the individual predictor variables in the model, calculated as the sum of squares divided by the DF.
p.
F Value – The F value for each predictor variable is computed as the Mean Square for that source divided by the Mean Square for Error.
q. Pr > F – This is the p-value: the probability of observing an F value as large as, or larger than, the one computed, under the null hypothesis; here both values are < 0.0001.
r. Type III SS – These are the type III sums of squares, referred to as the partial sums of squares.

The four essential parameters (MSW, MSC, MSR, and MSE), along with the six ICCs generated by the ICC estimation part of the code (using the definitions in Table 1), are provided in the table below.

Obs  MSW    MSC     MSR      MSE      n    k
1    0.429  70.225  14.3239  0.28913  500  2

One_Way_R_Abs_Agrm_sr  0.94184
Two_way_R_or_m_con_sr  0.96043
Two_way_R_or_m_Abs_sr  0.94239
One_Way_R_Abs_Agrm_mr  0.97005
Two_way_R_or_m_con_mr  0.97981
Two_way_R_or_m_abs_mr  0.97034

MSW – within-target mean square; MSC – between-measurements mean square; MSR – between-targets mean square; MSE – residual mean square; n – number of participants; k – number of measurements/raters
One_Way_R_Abs_Agrm_sr – ICC (One-way Random Absolute Agreement, single rater)
Two_way_R_or_m_con_sr – ICC (Two-way Random/Mixed Consistency, single rater)
Two_way_R_or_m_Abs_sr – ICC (Two-way Random/Mixed Absolute Agreement, single rater)
One_Way_R_Abs_Agrm_mr – ICC (One-way Random Absolute Agreement, multiple raters)
Two_way_R_or_m_con_mr – ICC (Two-way Random/Mixed Consistency, multiple raters)
Two_way_R_or_m_abs_mr – ICC (Two-way Random/Mixed Absolute Agreement, multiple raters)

The output from the final part of the code, the calculation of the confidence intervals, is given below. Fobs, F_Dist_l, F_Dist_u, FL, FU, and the confidence limits (Lower_limit and Upper_limit) were calculated by the SAS macro using the methods provided in the Methods section of this paper. Fobs, FL, and FU were not used in the calculation of confidence intervals for two-way models where absolute agreement is defined as the relationship between different measurements or different raters.
Obs  ICC_type                                                 ICC      Fobs     F_Dist_l  F_Dist_u  FL       FU       Lower_limit  Upper_limit
1    One-way Random Absolute Agreement, single rater          0.94184  33.3890  1.19194   1.19196   28.0122  39.7984  0.93106      0.95098
2    Two-way Random/Mixed Consistency, single rater           0.96043  49.5416  1.19206   1.19206   41.5597  59.0564  0.95301      0.96670
3    Two-way Random/Mixed Absolute Agreement, single rater    0.94239  .        3.27352   2.12088   .        .        0.82648      0.97228
4    One-way Random Absolute Agreement, multiple raters       0.97005  33.3890  1.19194   1.19196   28.0122  39.7984  0.96430      0.97487
5    Two-way Random/Mixed Consistency, multiple raters        0.97981  49.5416  1.19206   1.19206   41.5597  59.0564  0.97594      0.98307
6    Two-way Random/Mixed Absolute Agreement, multiple raters 0.97034  .        3.27022   2.12005   .        .        0.90509      0.98594

Discussion

We specified various statistical models according to the standard algorithms described by Shrout and Fleiss (1979). We transitioned these models into a SAS macro (ICC6) that generates six distinct ICCs with their respective confidence intervals. We showed that the results from the ICC6 macro agree with those generated by SPSS version 26 and STATA version 16. A dataset with two measurements at two time points with two raters was generated, and the parameters used in the data simulation were adjusted to ensure that the measurements have excellent reliability. The selection of the correct ICC form for a reliability study can be guided by the model type, the number of raters, and the definition of the relationship between the measurements (Koo and Li 2016). A decision flowchart, adapted from McGraw and Wong's published work, that explains the ICC selection protocol is provided in Figure 1. The mean square parameters within the targets, between the targets, between the measurements, and the error variance, obtained from the ANOVA models using the SAS code, were discussed above in the annotation of the ICC6 macro output.
These estimations include the two estimators for the single-rater and average-of-k-raters Absolute Agreement ICCs in the one-way model, the two estimators for the single-rater and average-of-k-raters Absolute Agreement ICCs in the two-way models, and the two estimators for the single-rater and average-of-k-raters Consistency ICCs in the two-way models. Only Absolute Agreement ICCs are defined for the one-way model. The ICC estimate for a single rater is always smaller than that for the average of k raters. Among the different models, the ICC estimate based on the one-way model is generally smaller than the estimates from the two-way models. When the relationship criteria are considered, an Absolute Agreement ICC is smaller than a Consistency ICC. Although the ICCs estimated using the two-way random-effects and mixed-effects models are identical, they differ in how they are interpreted: the two-way random-effects model treats the raters as a random sample from a larger population of raters, so the results generalize to other raters, whereas the two-way mixed-effects model treats the raters as fixed, so the results apply only to the raters involved in the study. The ICC estimate obtained from a sample is only an expected value of the true ICC. A 95% confidence interval provides a range that contains the true value of the ICC parameter with a probability of 0.95. The use of the ICC with its confidence intervals is becoming increasingly important because of the significant role the confidence intervals play in reliability estimation. A recent study (Shahraz et al.
2021) on measurement equivalence between electronic and paper-based patient-reported outcome measures also indicates the significance of the lower confidence bound in analyzing the reliability between measurements. Table 3 shows the ICC parameters and 95% confidence limits obtained from all three software packages using the simulated data discussed above. A publicly available SAS macro exists to calculate the point estimates of the different forms of ICC, and a published report discussed SAS code to generate a point estimate of a single ICC and its confidence interval using a one-way ANOVA model (Li and Nawar 2007). To our knowledge, however, none of these studies have provided SAS code to generate a complete set of the conventional forms of ICC and their confidence intervals. The SAS macro provided here can generate a complete set of the different forms of ICC, as suggested by recent studies.

Table 3. Intraclass correlation coefficients (ICC) with 95% confidence intervals calculated from the SAS macro (ICC6), SPSS, and STATA

ICC Type   SAS                  SPSS                 STATA
ICC (1,1)  0.942 (0.931-0.951)  0.942 (0.931-0.951)  0.942 (0.931-0.951)
ICC (2,1)  0.942 (0.826-0.972)  0.942 (0.831-0.972)  0.942 (0.831-0.972)
ICC (3,1)  0.96 (0.953-0.967)   0.96 (0.953-0.967)   0.96 (0.953-0.967)
ICC (1,k)  0.97 (0.964-0.975)   0.97 (0.964-0.975)   0.97 (0.964-0.975)
ICC (2,k)  0.97 (0.905-0.986)   0.97 (0.908-0.986)   0.97 (0.908-0.986)
ICC (3,k)  0.98 (0.976-0.983)   0.98 (0.976-0.983)   0.98 (0.976-0.983)

ICC estimates calculated by different statistical programming software packages from the same statistical procedure and the same data were expected to vary slightly because of differences in the underlying algorithms (Qin et al. 2019). Nevertheless, both the ICC point estimates and the confidence intervals obtained using the SAS macro ICC6 are essentially identical to the results obtained from the commercial statistical software packages.
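The limits reported by the macro (and in Table 3) can be reproduced outside SAS from the interval formulas in the Methods section. The following is an illustrative Python sketch, not the SAS implementation; the tabled F percentiles are copied from the macro's confidence interval output above (in practice they would come from an F-quantile function such as SAS's FINV):

```python
# Illustrative cross-check (not the ICC6 SAS macro) of the confidence
# interval formulas, using the mean squares and tabled F percentiles
# reported by the macro for the simulated data (n = 500, k = 2).
MSR, MSC, MSE = 14.3239, 70.225, 0.28913
n, k = 500, 2

def limits_from_F(FL, FU, k):
    """Limits for one-way and two-way consistency ICCs (single and k raters)."""
    single = ((FL - 1) / (FL + (k - 1)), (FU - 1) / (FU + (k - 1)))
    multi = (1 - 1 / FL, 1 - 1 / FU)
    return single, multi

# One-way model: Fobs = 33.3890; tabled percentiles 1.19194 (lower), 1.19196 (upper).
ow_single, ow_multi = limits_from_F(33.3890 / 1.19194, 33.3890 * 1.19196, k)

# Two-way consistency: Fobs = 49.5416; tabled percentile 1.19206 for both limits.
cons_single, cons_multi = limits_from_F(49.5416 / 1.19206, 49.5416 * 1.19206, k)

# Two-way absolute agreement, single rater: percentiles with (n-1, v) and (v, n-1) df.
FL, FU = 3.27352, 2.12088
aa_single = (
    n * (MSR - FL * MSE) / (FL * (k * MSC + (k * n - k - n) * MSE) + n * MSR),
    n * (FU * MSR - MSE) / (k * MSC + (k * n - k - n) * MSE + n * FU * MSR),
)

# Two-way absolute agreement, k raters.
FL, FU = 3.27022, 2.12005
aa_multi = (
    n * (MSR - FL * MSE) / (FL * (MSC - MSE) + n * MSR),
    n * (FU * MSR - MSE) / (MSC - MSE + n * FU * MSR),
)
```

These values agree with the macro's Lower_limit and Upper_limit columns to four to five decimal places; residual rounding differences come from using the percentiles as printed.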
This across-program agreement supports the accuracy of the methods we adopted to calculate the ICC in SAS. Our work extends the methodology of ICC estimation available in R, STATA, and SPSS to SAS, which will help SAS users involved in reliability analysis. The SAS macro provided here estimates six distinct forms of ICC and their confidence intervals based on the mean squares of within-subjects, between-subjects, within-raters, and error variance using one-way and two-way ANOVA models. The work presented here is a development of SAS methodology using publicly available statistical concepts. It can be applied to the estimation of the set of ICCs described by Shrout and Fleiss (1979) and McGraw and Wong (1996). Of note, this work has the same limitations as the original work by these authors, i.e., it is limited to calculating ICCs based on predefined parameters. When the plan is to run multivariable models, a post-estimation ICC provides a solution different from these conventional ICCs (Shahraz et al. 2021). The SAS macro calculates confidence intervals for various ICC forms that involve different assumptions and interpretations. Updating this macro to calculate the ICC for repeated measures and for data with missing values may need further investigation.

References

1. Alexander, H. W.: The estimation of reliability when several trials are available. Psychometrika 12(2), 79–99 (1947) pmid:20254752
2. Bartko, J. J.: The intraclass correlation coefficient as a measure of reliability. Psychol. Rep. 19(1), 3–11 (1966) https://doi.org/10.2466/pr0.1966.19.1.3
3. Belur, J., Tompson, L., Thornton, M., Simon, M.: Interrater Reliability in Systematic Review Methodology: Exploring Variation in Coder Decision-Making. Sociological Methods & Research 50(2), 837–865 (2018) https://doi.org/10.1177/0049124118799372
4. Bland, J. M., Altman, D. G.: Statistical methods for assessing agreement between two methods of clinical measurement.
Lancet 327, 307–310 (1986) doi: https://doi.org/10.1016/S0140-6736(86)90837-8 5. Boateng, G. O., Neilands, T. B., Frongillo, E. A., Melgar-Quiñonez, H. R., Young, S. L.: Best Practices for Developing and Validating Scales for Health, Social, and Behavioral Research: A Primer. Front. Public Health 6, 149 (2018) https://doi.org/10.3389/fpubh.2018.00149 6. Brown, B. W. Jr., Lucero, R. J., Foss, A. B.: A situation where the Pearson correlation coefficient leads to erroneous assessment of reliability. J. Clin. Psychol. 18(1), 95–97 (1962) https://doi.org/10.1002/10974679(196201)18:1<95::aid-jclp2270180131>3.0.co;2-2 17 7. Bruton, A., Conway, J. H., Holgate, S. T. : Reliability: what is it, and how is it measured? Physiotherapy 86, 94–99 (2000) https://doi.org/10.1016/S0031-9406(05)61211-4 8. Dmitrienko, A., Molenberghs, G., Chuang-Stein, C., Offen, W.: Analysis of clinical trials using SAS: a practical guide. Cary, NC (2005) https://doi.org/10.1080/10543400500508994 9. Emrich, L. J., Piedmonte, M. R.: A Method for Generating High-Dimensional Multivariate Binary Variables. The American Statistician, 45, 302—304 (1991) 10. Fisher, R. A.: Statistical methods for research workers. Oliver and Boyd; Edinburgh (1954) https://doi.org/10.1007/978-1-4612-4380-9_6 11. Fleiss, J. L.: The Design and Analysis of Clinical Experiments. Wiley and Sons: New York (1986) 12. Hallgren, K. A.: Computing Inter-Rater Reliability for Observational Data: An Overview and Tutorial. Tutor Quant. Methods Psychol. 8(1): 23-34 (2012) 10.20982/tqmp.08.1.p023 13. Hopkins, W. G.:"Measures of reliability in sports medicine and science. Sports Med. 30(1), 1–15 (2000) doi: 10.2165/00007256-200030010-00001 14. Koo, T. K., Li, M. Y.: A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research. J. Chiropr. Med. 15 (2), 155–63 (2016) 10.1016/j.jcm.2016.02.012. 15. 
Li, L., Nawar, S.: Reliability Analysis: Calculate and Compare Intra-class Correlation Coefficients (ICC) in SAS. Northeast SAS Users Group (2007) 16. Liljequist, D., Elfving, B., Roaldsen, K. S.: Intraclass correlation – A discussion and demonstration of basic features. PLoSONE 14(7), e0219854 (2019) https://doi.org/10.1371/journal.pone.0219854 17. McGraw, K. O., Wong, S. P.: Forming inferences about some intraclass correlation coefficients. Psychol. Methods 1(1), 30-46 (1996) https://doi.org/10.1037/1082-989X.1.1.30 18. McGraw, K. O., & Wong, S. P.: Forming inferences about some intraclass correlationscoefficients: Correction. Psychol. Methods, 1(4), 390-390 (1996) 19. Nakagawa, S., Schielzeth, H.: Repeatability for Gaussian and non-Gaussian data: a practical guide for biologists. Biol. Rev. 85:935–956 (2010) pmid:20569253 20. Nunnally, J. C., Bernstein, I. H.: Psychometric Theory. 3rd Edition. New York: McGraw-Hill Series in Psychology (1994) 18 21. Portney, L. G., Watkins, M. P.: Foundations of clinical research: applications to practice (Vol. 892). Upper Saddle River, NJ: Pearson/Prentice Hall (2009) 22. Potashman, M., Ping, M., Tahir, M., Shahraz, S., Dichter, S., Perneczky, R., Nolte, S.: Psychometric properties of the Alzheimer’s Disease Cooperative Study – Activities of Daily Living for Mild Cognitive Impairment (ADCSMCI-ADL) scale: a post hoc analysis of the ADCS ADC-008 trial. BMC Geriatrics Accepted for publication (2022) 23. Qin, S., Nelson, L., McLeod, L., Eremenco, S., Coons, S. L.: Assessing test–retest reliability of patientreported outcome measures using intraclass correlation coefficients: recommendations for selecting and documenting the analytical formula. Qual. Life Res. 28(4), 1029–1033 (2019) https://doi.org/10.1007/s11136-018-2076-0. 24. Revicki, D.: Internal Consistency Reliability. In: Michalos, A.C. (Eds.), Encyclopedia of Quality of Life and Well-Being Research. Springer, Dordrecht (2014) 25. Richard, N. 
M.: Interrater Reliability with SPSS for Windows 5.0. The American Statistician 47 (4): 292– 296 (1993) 10.1080/00031305.1993.10476000. 26. Rosner, B.: Fundementals of Biostatistics. 6th ed. Duxbury: Thomson Brooks/Cole (2006) 27. SAS/STAT Software, Version 9.4. SAS Institute Inc, Cary, NC USA (2013) URL https://www.sas.com. 28. Shahraz, S., Pham, T. P., Gibson, M., De La Cruz, M., Baara, M., Karnik, S., Dell, C., Pease, S., Nigam, S., Cappelleri, J. C., Lipset, C., Zornow, P., Lee, J., Byrom, B.: Does scrolling affect measurement equivalence of electronic patient-reported outcome measures? Results of a quantitative equivalence study. J. Patient Rep. Outcomes 5:23 (2021) doi: https://doi.org/10.1186/s41687-021-00296-z 29. Shostak, J.: SAS Programming in the Pharmaceutical Industry. (2005) 30. Shrout, P. E., Fleiss, J. L.: Intraclass correlations: Uses in assessing rater reliability. Psychol. Bull. 86(2), 420-428 (1979) https://doi.org/10.1037/0033-2909.86.2.420 31. STATA Statauser'ss guide release 15. (2017) URL https://www.stata.com/manuals15/ u.pdf. 32. Stoffel, M. A., Nakagawa, S., Schielzeth, H.: rptR: repeatability estimation and variance decomposition by generalized linear mixed-effects models. Methods in Ecology and Evolution. 8 (11), 1639–1644 (2017) https://doi.org/10.1111/2041-210X.12797. 33. UCLA: Statistical Consulting Group. Introduction to SAS (2016) https://stats.idre.ucla.edu/sas/modules/saslearning-moduleintroduction-to-the-features-of-sas/. 19 34. U.S. Department of Health and Human Services Food and Drug Administration (FDA). Guidance for Industry Patient-Reported Outcome Measures: Use in Medical Product Development to Support Labeling Claims (2009). https://www.fda.gov/media/77832/download 35. Weir, J. P.: Quantifying test-retest reliability using the intraclass correlation coefficient and the SEM. J. Strength Cond. Res. 19(1), 231–240 (2005) doi: 10.1519/15184.1 36. Wicklin, R.: Simulating Data with SAS. SAS Institute Inc., Cary NC, pp. 
154--157 (2013) URL https://support.sas.com/content/dam/SAS/support/en/books/simulating-data-with-sas/65378_excerpt.pdf 37. Zaki, R., Bulgiba, A., Nordin, N., Ismail, N. A.: A Systematic Review of Statistical Methods Used to Test for Reliability of Medical Instruments Measuring Continuous Variables. Iran J. Basic. Med. Sci, 16(6), 803807 (2013) PMID: 23997908; PMCID: PMC3758037