The Reliability of the Functional Independence Measure: A Quantitative Review

Kenneth J. Ottenbacher, PhD, Yungwen Hsu, MS, Carl V. Granger, MD, Roger C. Fiedler, PhD

From the University of Texas Medical Branch at Galveston (Dr. Ottenbacher), the State University of New York at Buffalo (Dr. Granger, Dr. Fiedler), and National Cheng Kung University, Taiwan (Ms. Hsu). Submitted for publication September 26, 1995. Accepted in revised form July 10, 1996. Supported in part by Rehabilitation Research and Training Center grant H133B30041 from the National Institute on Disability and Rehabilitation Research, US Department of Education. Reprint requests to Kenneth J. Ottenbacher, PhD, School of Allied Health Sciences, 11th and Mechanic Streets, Galveston, TX 77555.

ABSTRACT. Ottenbacher KJ, Hsu Y, Granger CV, Fiedler RC. The reliability of the Functional Independence Measure: a quantitative review. Arch Phys Med Rehabil 1996;77:1226-32.

Objective: The reliability of the Functional Independence Measure (FIM℠) for adults was examined using procedures of meta-analysis. Data Sources: Eleven published studies reporting estimates of reliability for the FIM were located using computer searches of Index Medicus, Psychological Abstracts, the Functional Assessment Information Service, and citation tracking. Study Selection: Studies were identified and coded based on type of reliability (interrater, test-retest, or equivalence), method of data analysis, size of sample, and training or experience of raters. Data Extraction: Information from the articles was coded by two independent raters. Interrater reliability for coding all elements included in the analysis ranged from .89 to 1.00. Data Synthesis: The 11 investigations included a total of 1,568 patients and produced 221 reliability coefficients. The majority of the reliability values (81%) were from interrater reliability studies, and the intraclass correlation coefficient (ICC) was the most commonly used statistical procedure to compute reliability. The reported reliability values were converted to a common correlation metric and aggregated across the 11 studies. The results revealed a median interrater reliability for the total FIM of .95 and median test-retest and equivalence reliability values of .95 and .92, respectively. The median reliability values for the six FIM subscales ranged from .95 for Self-Care to .78 for Social Cognition. For the individual FIM items, median reliability values varied from .90 for Toilet Transfer to .61 for Comprehension. Median and mean reliability coefficients for FIM motor items were generally higher than for items in the cognitive or communication subscales. Conclusions: Based on the 11 studies examined in this review, the FIM demonstrated acceptable reliability across a wide variety of settings, raters, and patients.

© 1996 by the American Congress of Rehabilitation Medicine and the American Academy of Physical Medicine and Rehabilitation

THE FUNCTIONAL Independence Measure (FIM℠) is one of the most widely used methods of assessing the quality of basic daily living activities in persons with a disability.1 The FIM for adults was developed by a national task force cosponsored by the American Academy of Physical Medicine and Rehabilitation and the American Congress of Rehabilitation Medicine. The original work of this task force was expanded by the Department of Rehabilitation Medicine at the State University of New York at Buffalo. The FIM is now part of the Uniform Data System for Medical Rehabilitation (UDSMR℠) and is widely used in the United States and internationally.2-5 The FIM includes 18 items designed to assess the amount of assistance required for a person with a disability to perform basic life activities safely and effectively. The activities include a minimum set of skills related to self-care, sphincter control, transfers, locomotion, communication, and social cognition.
In discussing the measurement of functional status in rehabilitation, Johnston and colleagues noted that "the FIM is currently the most widely used measure of disability, being used in several hundred medical rehabilitation hospitals."5 They went on to observe that in spite of its widespread use, relatively little reliability data on the FIM have been published. Several studies examining the reliability of the FIM have appeared in the rehabilitation literature since the observation by Johnston et al.6-13 These investigations explore the interrater and test-retest reliability of the FIM for different patient groups and professional disciplines. Some reliability studies have examined the impact of training and experience on agreement among raters.6,7 Other investigations have compared ratings obtained on the FIM with scores obtained using alternate methods of functional assessment. What do the cumulative results of these studies indicate regarding the reliability of the FIM and its usefulness in rehabilitation practice?

The purpose of this investigation was to examine previously published research on the reliability of the Adult FIM using procedures for quantitatively synthesizing research described by Glass and others.14-16 These techniques have been used previously to synthesize reliability findings in clinical research.17 In this investigation the emphasis was on synthesizing three types of FIM reliability: interrater reliability, test-retest reliability, and equivalence reliability. Interrater and test-retest reliability are widely understood by rehabilitation researchers and clinicians. Equivalence reliability is defined as the reliability of an assessment available in two or more versions. For example, Jaworski and coworkers8 examined the agreement between FIM ratings obtained through in-person interview and observation and ratings collected by telephone interview. Their investigation was considered a study of equivalence reliability.

METHODS

Studies were identified through computer searches of the Index Medicus and Psychological Abstracts databases from 1966 to 1994. The Functional Assessment Information Service database associated with the Center on Functional Assessment Research and the Rehabilitation Research and Training Center on Functional Assessment and Evaluation of Rehabilitation Outcomes at the State University of New York at Buffalo was also examined. The following key words were used in the computer bibliographic searches: functional assessment, reliability, measurement, functional independence, and activities of daily living. Manual searches were conducted using the reference lists of all retrieved articles.
A total of 39 potentially relevant papers was identified in the initial search. The articles were individually examined by three raters with rehabilitation experience to determine whether they met the following criteria. Only articles on which all three examiners agreed were included in the final review and analysis.

Inclusion Criteria

To be included in the analysis a reliability study had to include a quantitative estimate of interrater, test-retest, or equivalence reliability for the Functional Independence Measure (FIM). Studies that did not include an estimate of interrater, test-retest, or equivalence reliability were not included in the review. The statistical method used to determine the reliability value had to be clearly identified. The sample size on which the reliability index was based and the number of raters involved in interrater reliability investigations had to be reported. For studies including an estimate of test-retest reliability, the time interval between testing had to be clearly specified. Finally, the method used to obtain the FIM ratings, that is, observation, self-report, or proxy rating, had to be identified.

Eleven of the 39 studies were eliminated because a sample size for the reliability index could not be identified and/or the reliability index was not clearly identified or was not appropriate, eg, a measure of internal consistency. Seven studies used the term reliability in the title or abstract but did not provide any numerical data, or the authors referred to reliability values published in previous reports. Five investigations did not provide information about the raters or the time interval between test and retest. Three studies included information on concurrent validity with other ADL assessments, but no reliability values. One investigation (a book chapter) reported data on the four-level version of the FIM. Finally, one abstract included data that were subsequently published in a larger investigation. The 11 remaining studies that met all of the above criteria were selected for further analysis. The 11 investigations are listed in the appendix. A complete listing of the 39 original investigations can be obtained from the first author.

Study Coding

Each of the 11 investigations was coded according to year of appearance and source of publication. Sample size and characteristics were coded for both the raters and the patients included in the investigations. Raters were coded as having previous experience using the FIM or having received formal training in the administration of the FIM. The specific form of training was also recorded. The following categories of training were identified and coded: formal workshop or instruction, use of FIM videotape and case studies, informal use of the FIM guide, and no training. The subjects who were evaluated using the FIM were coded according to disability, age, and sex. The setting in which the FIM rating occurred was coded as one of the following: rehabilitation center, hospital, home, or other. Each individual reliability coefficient was coded according to the type of reliability, eg, interrater or test-retest. For those investigations recording test-retest information, the duration between the first and second test was recorded. The type of statistical procedure used to calculate the reliability index was coded as: Pearson product moment correlation, intraclass correlation coefficient (ICC), Kappa (K), or other. Quality assessment screening as proposed by Chalmers and others18 was not used.
These screening criteria were developed primarily for use with randomized clinical trials (RCTs). The design attributes associated with reliability investigations are different from those involved in RCTs. The coded attributes described above, for example, the type of reliability, rater training and experience, duration between ratings, and number of raters, were believed to sufficiently capture the characteristics of design quality involved in the reliability studies examined. Two coders examined and rated the manuscripts without author or title attribution. The reliability (agreement) of these ratings was examined (see below) to determine the existence of rater bias or error.

Adult FIM

The FIM instrument is a minimal data set designed to assess functional independence.1 The FIM includes 18 items, each with a maximum score of 7 and a minimum score of 1. Possible scores range from 18 to 126. Each level of scoring is defined. For example, a score of 7 equals "complete independence," a score of 3 equals "moderate assistance," and a score of 1 equals "complete dependence." The areas examined by the FIM include self-care, sphincter control, transfers, locomotion, communication, and social cognition. These areas are further divided into motor and cognitive domains. The motor domain includes thirteen items in the areas of self-care, sphincter control, transfers, and locomotion. The cognitive domain contains five items from the Communication and Social Cognition subscales. The domains, subscales, and items included in the FIM are presented in table 1. In evaluating the 11 studies, reliability coefficients were coded based on whether they were computed for individual items, for FIM subscale areas such as self-care, sphincter control, locomotion, etc, or for the motor or cognitive domain.

Table 1: The Functional Independence Measure

FIM (motor)
  Self-care
    A. Eating
    B. Grooming
    C. Bathing
    D. Dressing upper body
    E. Dressing lower body
    F. Toileting
  Sphincter control
    G. Bladder management
    H. Bowel management
  Transfer
    I. Bed, chair, wheelchair
    J. Toilet
    K. Tub, shower
  Locomotion
    L. Walk/wheelchair
    M. Stairs
FIM (cognitive)
  Communication
    N. Comprehension
    O. Expression
  Social cognition
    P. Social interaction
    Q. Problem solving
    R. Memory

Levels of scoring
  Independence
    7 Complete independence (timely, safely)
    6 Modified independence (device)
  Modified dependence
    5 Supervision
    4 Minimal assistance (subject 75%+)
    3 Moderate assistance (subject 50%+)
  Complete dependence
    2 Maximal assistance (subject 25%+)
    1 Total assistance (subject 0%+)

Reliability of Coding

To establish the interrater agreement of the coding procedures, the 11 studies were examined, without author attribution or titles, by two raters with rehabilitation experience and graduate level training in research methods and statistics. When the two raters did not agree on the coding of a particular item, a third rater with more than 15 years of research and rehabilitation experience was consulted. The majority rating was then used in the analysis. The interrater agreement (ICC) for all items included in the analysis described below ranged from .89 to 1.00.
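Before turning to the analysis methods, the item and scoring structure summarized in table 1 can be written out concretely. The following sketch is purely illustrative; the helper names are hypothetical and are not code from the original study or from the UDSMR.

MOTOR_ITEMS = [
    "Eating", "Grooming", "Bathing", "Dressing upper body",
    "Dressing lower body", "Toileting", "Bladder management",
    "Bowel management", "Bed/chair/wheelchair transfer",
    "Toilet transfer", "Tub/shower transfer", "Walk/wheelchair", "Stairs",
]
COGNITIVE_ITEMS = [
    "Comprehension", "Expression", "Social interaction",
    "Problem solving", "Memory",
]
ALL_ITEMS = MOTOR_ITEMS + COGNITIVE_ITEMS   # 18 items, each rated 1 to 7

def total_fim(ratings):
    """Sum the 18 item ratings; possible totals range from 18 to 126."""
    if len(ratings) != len(ALL_ITEMS):
        raise ValueError("expected 18 item ratings")
    if any(r < 1 or r > 7 for r in ratings):
        raise ValueError("each rating must be on the 1 to 7 scale")
    return sum(ratings)

# Complete dependence on every item scores 18; complete independence scores 126.
assert total_fim([1] * 18) == 18
assert total_fim([7] * 18) == 126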
Data Analysis

Streiner and Norman19 observe that "there has been considerable debate in the literature regarding the most appropriate choice of reliability coefficient." They discuss the advantages and limitations of the three statistical methods used most frequently in the eleven articles: the Pearson product moment correlation (r), the intraclass correlation coefficient (ICC), and Kappa (K). All three of these statistics have similar practical ranges for interpretation, that is, values from -1.00 to 1.00. What is considered an acceptable reliability value, however, varies considerably among authors. For example, Portney and Watkins20 suggest ICC values of .75 and above are "indicative of good reliability, and those below .75 poor to moderate reliability." Kelly21 has recommended .94, and Weiner and Stewart22 suggest .85, for reliability values involved in making decisions about individuals. The published guidelines for interpreting unweighted Kappa tend to be lower than for r or ICC. Landis and Koch23 have suggested that values for Kappa from .60 to .80 represent good to excellent reliability (agreement), Kappa values of .40 to .60 reflect moderate agreement, and values less than .40 indicate poor agreement. The interpretation of values for weighted Kappa is identical to that for the ICC when the variance for raters (or trials) is excluded.20

The standard method for computing Kappa uses the following formula: K = (Po - Pc)/(1 - Pc), where Po is the observed proportion of agreement and Pc is the proportion of agreement expected by chance. Rae24 has demonstrated that analysis of variance terms can be substituted for Po and Pc to produce the following formula: K = [SSBP - SSWP/(k - 1)]/[SSBP + SSWP], where SSBP and SSWP are the sums of squares between subjects and within subjects in the analysis of variance and k is the number of raters. If N is at all large (N > 20), Kappa can be estimated from the formula K = σ²s/(σ²s + σ²e), where σ²s is the between-subjects variance and σ²e is the error variance; this expression is derived from the formula above and represents the intraclass correlation coefficient.25 Rae and others24,25 provide several formulae to convert Kappa (weighted and unweighted) to various ICC models that include or exclude systematic variability among raters (or trials) as a component of total variability. The procedures described by Rae and others24,25 were used to convert Kappa values to a reliability metric comparable to the ICC and r.

Reliability (correlation) values can be aggregated as "raw" effect sizes by combining the r values weighted for sample size,26 or they can be converted with Fisher's27,28 variance stabilizing z transformation. Statisticians generally recommend using Fisher's z transformation because, unless sample sizes are very large, standard errors, confidence intervals, and homogeneity tests based on raw correlations can be quite variable.29 All reliability values in this investigation were converted using Fisher's variance stabilizing z transformation and combined using a random effects model.29 The random effects model assumes a distribution of population reliability values generated by study attributes such as patient grouping (diagnosis), raters, or environments. The homogeneity statistic proposed by Hedges and Olkin,29 called Q, was computed following the transformation to Fisher z values. The Q statistic was computed for the total FIM ratings included in the 11 studies. The Q statistic has a chi-square distribution with N - 1 degrees of freedom.
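The agreement statistics discussed in this subsection can be illustrated with a short sketch. The function names below are hypothetical and the ratings are made-up numbers; the second function computes a one-way random-effects ICC, the quantity that the multiple-rater Kappa approaches for large samples.

def kappa(p_observed, p_chance):
    """Kappa from proportions of agreement: K = (Po - Pc) / (1 - Pc)."""
    return (p_observed - p_chance) / (1.0 - p_chance)

def icc_one_way(ratings):
    """One-way random-effects ICC from an N x k table (N subjects, k raters)."""
    n, k = len(ratings), len(ratings[0])
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    subject_means = [sum(row) / k for row in ratings]
    ss_between = k * sum((m - grand_mean) ** 2 for m in subject_means)
    ss_within = sum((x - m) ** 2
                    for row, m in zip(ratings, subject_means) for x in row)
    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

print(round(kappa(0.90, 0.50), 2))                               # 0.8
print(round(icc_one_way([[7, 7], [5, 6], [3, 3], [6, 5]]), 2))   # 0.91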
To make the interpretation of the final results easier, the Fisher z values were converted back to correlation values following homogeneity testing (see tables 3 and 4). Shadish and Haddock30 state that "interpretation of results is facilitated" if Fisher z values are converted back to correlation values. This is particularly true in reliability studies, where results are traditionally reported as coefficients ranging from .00 to 1.00.

RESULTS

The 11 studies were published from 1993 to 1995 (mean = 1994, SD = .70). A total of 1,568 subjects were included in the 11 studies (mean = 130.66, SD = 284.17). Two hundred twenty-one reliability coefficients were included in the 11 studies, indicating that each investigation contained approximately 20 reliability values. A large majority of the reliability coefficients (81%) were from interrater reliability comparisons. The ICC was the most common statistical procedure used to compute reliability values (n = 116), followed by the Kappa statistic (n = 53) and the Pearson product moment correlation coefficient (n = 52). Basic descriptive information concerning each of the eleven studies included in the analysis is presented in table 2. The table includes the first author, sample size, type of reliability, and values for the total FIM score.

The analysis of z transformed reliability coefficients produced an average z of 1.55 and a variance of .21, the square root of which (.47) is the standard error. This average effect size is significantly different from zero, since Z = |mean z|/SE = 1.55/.47 = 3.30, which exceeds the critical value of 1.96 for α = .05 in the standard normal distribution. The limits of the 95% confidence interval are 1.50 and 1.60. The homogeneity statistic (Q) for the 221 effect sizes was 2,444 (p < .05, df = 220), indicating that the variability in reliability values was greater than that expected by sampling error alone.

Figure 1 presents summary box plots of reliability values for the total FIM ratings and for the motor and cognitive domain ratings. Table 3 includes median and mean reliability coefficients for the FIM items, subscales, and domains. Table 3 also includes values for the 95% confidence interval for each of the mean values. Inspection of table 3 reveals that the median reliability value for the total FIM score was .95, indicating excellent overall consistency among raters using the FIM across a large number of patients with varying levels of impairment and medical diagnoses. The median reliability values for the subscales ranged from .95 for self-care to .78 for social cognition. The median reliability value for the cognitive domain (.93) was lower than for the motor domain (.97) (see table 1 for a listing of the items contained within the FIM motor and cognitive domains). The median reliability values for the individual FIM items ranged from a low of .61 for comprehension to a high of .90 for transfer to toilet. In general, the FIM motor and ADL items were found to have higher median and mean reliability values than the communication and cognition items. Table 4 includes median and mean reliability values associated with various study attributes. Included in the table are the reliability values for different levels of experience and training, and for different diagnostic conditions.
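As an arithmetic check of the summary values reported above, the significance test and the back-transformation from the Fisher z metric to the correlation metric can be written out as follows. This is a sketch using only the published summary numbers, not the original data.

import math

mean_z, se = 1.55, 0.47        # summary values reported above

# Significance of the average transformed reliability
z_stat = mean_z / se           # about 3.30, beyond the 1.96 critical value (p < .05)

# Back-transforming from the Fisher z metric to a correlation: r = tanh(z)
pooled_r = math.tanh(mean_z)

print(round(z_stat, 2), round(pooled_r, 2))   # 3.3 0.91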
Table 2: Descriptive Information for the 11 Studies Included in the Review

First author (year)   Sample   Diagnosis   Raters                                           Type of reliability        Statistic      Total FIM
Fricke (1993)*        4        Stroke      40 therapists (rated videotapes)                 Interrater                 ICC            .89
Ottenbacher (1994)    20       Mixed       2 clinicians                                     Interrater / test-retest   ICC† / ICC†    .94 / .93
Chau (1994)           198      Mixed       Team                                             Interrater                 Kappa          .90
Segal (1993)          57       SCI         Team                                             Test-retest                r              .84
Hamilton (1994)       1,018    Mixed       Team                                             Interrater                 ICC            .92
Grey (1993)           40       SCI         Nurses and self-report                           Equivalence                r              .84
Jaworski (1994)       14       Mixed       Nurses (2 raters; observation/phone)             Interrater / equivalence   ICC / ICC      .99 / .94
Kidd (1995)           50       Mixed       Team                                             Interrater / test-retest   r‡ / r‡        .92 / .90
Smith (1995)          40       Mixed       Nurses (observation/phone)                       Equivalence                ICC            .97
Segal (1994)          8 / 38   Stroke      Therapists (2 raters) / team and caregiver proxy Interrater / equivalence   ICC / ICC      .963 / .87
Brosseau (1994)       81       MS          Therapists (2 raters)                            Interrater                 ICC            .83

* The Fricke et al investigation included 8 FIM items: eating, bathing, dressing-upper body, dressing-lower body, grooming, toileting, toilet transfer, and tub/shower transfer.
† Total FIM based on the average of two total FIM scores.
‡ Correlation estimated from data provided in the article.

Fig 1. Box plot summary statistics for Total FIM, Motor FIM (n = 127), and Cognitive FIM (n = 42) ratings. Box plots include all values for each variable. The middle line in the box indicates the median. The top and bottom of the box are the 25% and 75% scores. The bars at the ends of the lines represent the 10% and 90% values. The stars represent individual values outside the 10% to 90% range of scores. (Five values, .18, .16, .16, .13, and .02, are not included in the figure. Total N does not equal 221 because some correlations [n = 40] included combinations of motor and cognitive items.)

A criterion of two standard errors of the mean was used to examine whether differences existed in mean reliability values across the study attributes. If more than two standard errors of the mean separated any two mean reliability values, those values were considered to be "substantially" different. This conservative criterion has been used in previous meta-analyses when multiple values were generated from the same study, the assumption of independent data points could not be made, and the number of studies was relatively small.31,32 Using this criterion, no differences were found among the mean reliability values for the study attributes of experience, training, or medical diagnosis. When subscale scores were examined, there was some indication of an interaction between level of training and subscale reliability. The median reliability values for the Communication and Social Cognition subscales were lower for the studies where raters received informal FIM training than for the other training groups. These results, however, should be interpreted with caution because the sample size in some of the categories included in table 4 is small and the statistical power is low.
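The two-standard-error screening rule described above reduces to a simple comparison; a minimal sketch with hypothetical values is shown below.

def substantially_different(mean_a, mean_b, se_mean):
    """Screening rule described above: flag two mean reliability values only
    when they are separated by more than two standard errors of the mean."""
    return abs(mean_a - mean_b) > 2.0 * se_mean

# Hypothetical illustration: a .03 difference with a standard error of .02
print(substantially_different(0.95, 0.92, 0.02))   # False (.03 is not greater than .04)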
Table 3: Median and Mean Reliability Values, Descriptive Statistics, and Sample Size by Item, Subscale, Domain, and Total FIM Score

FIM item              Median   Mean   Sample   N*      SD    95% CI
Eating                .77      .75    1,412    7/12    .11   .744-.756
Grooming              .84      .80    1,412    7/12    .12   .794-.806
Bathing               .83      .76    1,412    7/12    .18   .755-.765
Dressing upper body   .89      .83    1,412    7/12    .13   .822-.838
Dressing lower body   .88      .82    1,412    7/12    .16   .812-.828
Toileting             .83      .77    1,412    7/12    .18   .762-.778
Bladder management    .67      .68    1,412    7/7     .17   .672-.688
Bowel management      .78      .69    1,412    7/7     .19   .680-.700
Bed/chair transfer    .79      .80    1,412    7/7     .14   .792-.808
Toilet transfer       .90      .85    1,412    7/9     .11   .844-.856
Tub/shower transfer   .88      .79    1,412    7/9     .21   .778-.802
Walk/wheelchair       .66      .71    1,408    6/7     .17   .696-.724
Stairs                .66      .60    1,408    5/7     .23   .598-.612
Comprehension         .61      .59    1,408    5/7     .25   .576-.604
Expression            .73      .62    1,408    6/7     .22   .608-.632
Social interaction    .72      .57    1,408    6/7     .28   .552-.586
Problem solving       .84      .74    1,408    6/7     .24   .726-.754
Memory                .85      .73    1,408    6/7     .27   .716-.744
FIM subscale
Self-care             .95      .93    1,254    5/5     .06   .926-.934
Sphincter control     .91      .89    1,254    5/5     .10   .886-.894
Transfers             .92      .90    1,254    5/5     .07   .896-.904
Locomotion            .92      .84    1,254    5/5     .18   .830-.850
Communication         .87      .76    1,254    5/5     .21   .750-.770
Social cognition      .78      .63    1,254    5/5     .40   .614-.646
FIM domain
Motor                 .97      .96    1,173    5/8     .04   .948-.972
Cognitive             .93      .91    1,173    5/8     .10   .904-.916
FIM total             .95      .93    1,568    11/15   .05   .926-.934

* Number of studies/number of statistical values.

Table 4: Median and Mean Reliability Values (Total FIM) and Descriptive Statistics Associated With Study Attributes

Study attribute             Median   Mean   Sample   N*     SD    95% CI
Diagnosis
  Stroke                    .90      .86    42       3/2    .14   .840-.880
  SCI                       .93      .91    97       2/2    .24   .900-.930
  MS                        .96      .93    81       2/1    .18   .925-.935
  Mixed                     –        .90    1,348    10/6   .19   .880-.920
Type of reliability
  Interrater                .95      .92    1,393    9/8    .17   .915-.925
  Test-retest               .95      .92    127      4/3    .12   .910-.930
  Equivalence               .92      .89    132      4/4    .25   .890-.910
FIM training†
  Previous FIM experience   –        .90    1,231    5/3    .19   .895-.905
  Formal FIM training       –        .95    202      6/4    .12   .942-.958
  Informal FIM training     –        .93    242      5/3    .21   .917-.943
  UDS credentialled         –        .91    1,173    2/2    .13   .906-.914

* Number of reliability values/number of studies.
† Formal training involved UDSMR-sponsored training; informal training involved use of videotapes and review of the manual (training not sponsored by UDSMR).

Regression models represent another approach to examining moderator (predictor) variables. A technical problem with regression is that the number of studies (n = 11) is small in comparison to the number of potential moderator variables. As a rough guide, one might include no more predictors than the square root of the number of studies, about 3 in this case. Reducing the number of potential predictor variables can be accomplished in many ways. In this investigation the focus on methodology led us to select type of reliability as a relevant variable category to include in the model. A regression equation was computed following the weighted least squares procedure outlined in Hedges and Olkin.29 The resulting multiple R was .47 (R² = .22) for the first regression equation including the predictor variables of interrater reliability, test-retest reliability, and equivalence reliability. The test for significance of the predictor set was QB(3) = 7.47, p = NS.* The test for model specification was rejected (QE = 69.14, df = 8, p < .05), suggesting that nonrandom variance in effect size remained.

* Hedges and Olkin's example is computed in SAS, which includes the intercept in this test; the present regression was computed in SPSSx, which excludes the intercept. Therefore, one must add a degree of freedom to the degrees of freedom for the QE test.
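A minimal sketch of the weighted least squares moderator analysis described above, in the spirit of the Hedges and Olkin procedure, is shown below. The Fisher z values, sample sizes, and dummy coding are hypothetical and are not the study data; each effect is weighted by the inverse of its sampling variance, 1/(n - 3).

import numpy as np

z = np.array([1.8, 1.6, 1.5, 1.7, 1.4, 1.6])          # hypothetical Fisher z values
n = np.array([40, 60, 55, 30, 45, 50])                # hypothetical sample sizes
rel_type = ["interrater", "interrater", "testretest",
            "testretest", "equivalence", "interrater"]

w = (n - 3).astype(float)                             # weight = 1 / var(z)
X = np.column_stack([
    np.ones(len(z)),                                  # intercept
    [1.0 if t == "testretest" else 0.0 for t in rel_type],
    [1.0 if t == "equivalence" else 0.0 for t in rel_type],
])

W = np.diag(w)
b = np.linalg.solve(X.T @ W @ X, X.T @ W @ z)         # WLS coefficients
resid = z - X @ b

Q_error = float(resid @ W @ resid)                    # model specification test
zbar = np.average(z, weights=w)
Q_total = float(((z - zbar) ** 2 * w).sum())
Q_between = Q_total - Q_error                         # test of the predictor set

print(np.round(b, 3), round(Q_between, 2), round(Q_error, 2))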
A more clinically practical way to examine reliability is to compute the standard deviation of the measurement error. The standard deviation of measurement error is commonly referred to as the standard error of measurement (SEM). The SEM is more accurate if estimated from a large number of ratings. One advantage of the meta-analytic approach used in this investigation is the creation of a large combined subject pool (N = 1,568) from which to derive estimates of reliability for the FIM. The SEM is computed using the following formula: SEM = SD√(1 - r), where SD is the standard deviation and r is the reliability value. The SEM for the FIM was computed using the standard deviation contained in the annual UDSMR report for 1994.33 The standard deviation value was based on 1994 FIM admission data containing FIM records for more than 150,000 patients receiving medical rehabilitation services. Using this information and the test-retest reliability obtained in this investigation, the SEM for the FIM is 21√(1 - .95) = 4.70.

DISCUSSION

In describing the importance of reliability in the Interdisciplinary Measurement Standards for Medical Rehabilitation, Johnston and colleagues34 stated that "both science and the clinical practice of medicine demand high, or at least known, levels of reliability." Any widely used measure of disability must produce consistent results across raters and over time if it is to be useful in program evaluation and research. Grey and Kennedy35 observed, "In recent years the Functional Independence Measure has emerged as a standard assessment instrument for use in rehabilitation and therapy programs for disabled persons."

The results of this quantitative review indicate the FIM provides good interrater reliability across a wide variety of raters with different professional backgrounds and levels of training. The median interrater reliability value was .95 and is based on a large cumulative sample of patients representing a wide variety of disability levels and medical conditions. The 95% confidence interval for the mean interrater reliability value is .915 to .925. In addition, the evidence for test-retest (median .95) and equivalence (median .92) reliability is also good. The results suggest that reliability is highest for items in the motor domain, specifically for the individual items of dressing upper body and transfer to the toilet. The lowest interrater reliability values were associated with the items of comprehension and social interaction. These items are among the most difficult to observe directly. There is some indication that the lower reliability of items in the Communication and Social Cognition subscales may be related to levels of training. Additional research is needed to clarify the importance of training in achieving high reliability values for individual items and subscale scores.

Research is also necessary to determine the impact of the number of raters on the reliability of ratings. In general, the larger the number of raters, the higher the interrater reliability coefficient will be, all other factors remaining equal. In the 11 studies examined, all interrater reliability comparisons were based on scores for two raters, except for the investigation by Fricke and others, who used 40 raters to assess the same patients.
The ICC values for Fricke's investigation, based on 40 raters, will be higher than if the interrater reliability study had been conducted using only two raters. When the analysis was conducted without the data from Fricke's study, the mean interrater reliability (ICC) for the total FIM was .94, compared to a mean of .93 when the data from Fricke's investigation were included in the analysis. As more reliability studies are reported using multiple raters, the impact of the number of examiners can be more carefully examined.

Another possible moderating variable of clinical importance in future research is the professional background and training of the raters. There are reports in the literature that different professional groups may obtain different ratings for FIM items or subscales. For example, Adamovich36 conducted a study to compare the functional communication ratings of registered nurses with those of speech-language pathologists on selected FIM items. The nurses assigned significantly higher FIM ratings than the speech-language pathologists when rating the communication skills of persons with left hemisphere damage. Meta-analysis procedures have the potential to examine the effect of moderator variables such as professional affiliation and background if those variables are clearly defined in the primary studies. Unfortunately, the 11 investigations included in this review did not systematically control for professional affiliation and background. Most ratings were performed by teams of mixed professionals, and the impact of professional background could not be statistically examined. Additional research is needed on the response patterns and clinical reasoning of FIM ratings provided by different professional groups before quantitative reviewing methods will be able to productively examine this issue.

Establishing reliability allows researchers and clinicians to determine the standard error of measurement (SEM) for the FIM. The SEM can be used to establish the range in which a person's "true" score will fall and is interpreted according to the properties of the normal curve. For example, if a person's obtained (raw) total FIM rating is 100 and the SEM is 4.70, then based on the characteristics of the normal curve there is a 68% chance that the individual's true score falls within ±1 SEM (between 95 and 105) and a 95% chance that it falls within ±2 SEM (between 90 and 110).

The SEM is an essential indicator of measurement sensitivity and the ability to document reliable change in performance over time. Two or more ratings are always necessary to determine whether a clinically important change has occurred. The first step in measuring change is to ensure that the instrument used to collect the two measures has adequate reliability. Without demonstrated reliability, change in values from one administration to the next could be the result of random measurement error. Jacobson and colleagues37 refer to change as "a reliable difference in values for a variable over two or more points in time." They have proposed the Reliability Change Index (RCI) to determine changes in performance that are not caused by measurement error. To compute the RCI, the clinician must have the following information: (1) the patient's preintervention score, (2) the posttest score following rehabilitation, and (3) the standard error of measurement (SEM) for the instrument. The standard error of measurement is influenced by the reliability of the test and is computed from the formula presented earlier (SEM = SD√(1 - r), where SD is the standard deviation for the test and r is the reliability coefficient). The RCI is computed as follows: RCI = (X2 - X1)/SEM, where X2 is the postintervention score, X1 is the preintervention score, and SEM is the standard error of measurement as defined above.

For example, a patient receiving intervention for a deficit in ADL may have an admission FIM score of 73. Following medical rehabilitation, the patient's FIM score at discharge might increase to 85. The SEM for the FIM is 4.70. Using the formula provided above, the reliability change index is RCI = (85 - 73)/4.70 = 2.55. Jacobson and associates37 argue that the RCI should be interpreted using a unit normal distribution, in which a value greater than ±1.96 would be unlikely to occur (p < .05) without actual change. The RCI of 2.55 in the example indicates that a reliable change in FIM scores has occurred from pre- to post-rehabilitation.
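The SEM and RCI calculations just described can be reproduced directly. The short sketch below uses the values reported in the text (SD = 21, r = .95, and the 73-to-85 worked example); the function names are illustrative only.

import math

def sem(sd, reliability):
    """Standard error of measurement: SEM = SD * sqrt(1 - r)."""
    return sd * math.sqrt(1.0 - reliability)

def rci(pre_score, post_score, sem_value):
    """Reliability Change Index: RCI = (X2 - X1) / SEM; values beyond
    +/-1.96 are unlikely (p < .05) in the absence of real change."""
    return (post_score - pre_score) / sem_value

fim_sem = sem(21, 0.95)                 # SD = 21 and r = .95, as reported above
print(round(fim_sem, 2))                # 4.7
print(round(rci(73, 85, 4.70), 2))      # 2.55, the worked example above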
The index proposed by Jacobson et al provides a statistical indication of change. The RCI, however, does not include information regarding the clinical importance of the change. In the above example, the clinician using the RCI knows that the change of 12 FIM points (from 73 to 85) was probably (p < .05) not caused by chance (random measurement error). The RCI does not tell the clinician whether the 12-point improvement resulted from the rehabilitation treatment or whether the 12-point increase is clinically important. Whether the improvement in FIM scores is clinically significant cannot be answered statistically. The answer to this question requires knowledge of what the patient and his or her family believe are important functional skills, along with information on the cost required to obtain these skills.

The results of this investigation indicate that the Adult FIM provides reliable information across a variety of patient populations, settings, and clinicians. The ability of the FIM (or any assessment of functional performance) to provide consistent information is essential to the measurement of change in daily living skills. Based on the quantitative synthesis of the 11 studies examined in this investigation, the FIM appears to be a reliable instrument for assessing basic daily living skills in persons with disabilities.

References
1. Guide for the Uniform Data Set for Medical Rehabilitation (Adult FIM), version 4.0. Buffalo: State University of New York at Buffalo, 1993.
2. Granger CV, Hamilton BB, Keith RA, Zielezny M, Sherwin FS. Advances in functional assessment for medical rehabilitation. Top Geriatr Rehabil 1986;1:59-74.
3. Hamilton BB, Granger CV, Sherwin FS, Zielezny M, Tashman JS. A uniform national data system for medical rehabilitation. In: Fuhrer MJ, editor. Rehabilitation outcomes: analysis and measurement. Baltimore: Brookes, 1987:137-47.
4. Granger CV, Braun S, Fiedler RC, Griffiths A, Johnston MV, Kelley-Hays A. Quality and outcome measures for medical rehabilitation. In: Braddom RL, editor. Physical medicine and rehabilitation. Philadelphia: Saunders, 1996:239-55.
5. Johnston MV, Findley TW, DeLuca J, Katz RT. Research in physical medicine and rehabilitation: XII. Measurement tools with application to brain injury. Am J Phys Med Rehabil 1991;70 Suppl 1:S114-30.
6. Chau N, Daler S, Andre JM, Patois A. Inter-rater agreement of two functional independence scales: the Functional Independence Measure (FIM) and a subjective uniform continuous scale. Disabil Rehabil 1994;16:63-71.
7. Fricke J, Unsworth C, Worrell D. Reliability of the Functional Independence Measure with occupational therapists. Aust Occup Ther J 1993;40:7-15.
8. Jaworski DM, Kult T, Boynton PR. The Functional Independence Measure: a pilot study comparison of observed and reported ratings. Rehabil Nurs Res 1994;Winter:141-7.
9. Hamilton BB, Laughlin JA, Fiedler RC, Granger CV. Interrater reliability of the 7-level Functional Independence Measure (FIM). Scand J Rehabil Med 1994;26:115-9.
10. Ottenbacher KJ, Mann WC, Granger CV, Tomita M, Hurren D, Charvat B. Interrater agreement and stability of functional assessment in the community-based elderly. Arch Phys Med Rehabil 1994;75:1297-301.
11. Segal ME, Ditunno JF, Staas WE. Inter-institutional agreement of individual Functional Independence Measure (FIM) items measured at two sites on one sample of SCI patients. Paraplegia 1993;31:622-31.
12. Hamilton BB, Laughlin J, Granger CV, Kayton RM. Interrater agreement of the seven level Functional Independence Measure (FIM) [abstract]. Arch Phys Med Rehabil 1991;72:790.
13. Brosseau L. The inter-rater reliability and construct validity of the Functional Independence Measure for multiple sclerosis subjects. Clin Rehabil 1994;8:107-15.
14. Glass GV, McGaw B, Smith ML. Meta-analysis in social research. Beverly Hills (CA): Sage, 1981.
15. Cooper HM. Integrating research: a guide for literature reviews. 2nd ed. Newbury Park (CA): Sage, 1989.
16. Petitti DB. Meta-analysis, decision analysis, and cost-effectiveness analysis: methods for quantitative synthesis in medicine. New York: Oxford University Press, 1994.
17. Ottenbacher KJ. Interrater agreement of visual analysis in single-subject decisions: quantitative review analysis. Am J Ment Retard 1993;98:135-42.
18. Chalmers TC, Smith H, Blackburn B, Silverman B, Schroeder B, Reitman D, et al. A method for assessing the quality of a randomized control trial. Control Clin Trials 1981;2:31-49.
19. Streiner DL, Norman GR. Health measurement scales: a practical guide to their development and use. 2nd ed. New York: Oxford University Press, 1995.
20. Portney LG, Watkins MP. Foundations of clinical research: applications to practice. Norwalk (CT): Appleton & Lange, 1993.
21. Kelly TL. Interpretation of educational measurements. Yonkers (NY): World Books, 1927.
22. Weiner EA, Stewart BJ. Assessing individuals. Boston: Little Brown, 1984.
23. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33:159-62.
24. Rae G. The equivalence of multiple rater Kappa statistics and intraclass correlation coefficients. Educ Psych Meas 1988;48:367-74.
25. Fleiss JL, Cohen J. The equivalence of weighted Kappa and the intraclass correlation coefficient as measures of reliability. Educ Psych Meas 1973;33:613-9.
26. Cooper HM. Statistically combining independent studies: a meta-analysis of sex differences in conformity research. J Pers Soc Psychol 1979;37:131-46.
27. Fisher RA. Statistical methods for research workers. Edinburgh: Oliver & Boyd, 1925.
28. Hunter JE, Schmidt FL. Correcting for sources of artificial variation across studies. In: Cooper HM, Hedges LV, editors. The handbook of research synthesis. New York: Russell Sage Foundation, 1994:324-36.
29. Hedges LV, Olkin I. Statistical methods for meta-analysis. Orlando (FL): Academic Press, 1985.
30. Shadish WR, Haddock CK. Combining estimates of effect size. In: Cooper HM, Hedges LV, editors. The handbook of research synthesis. New York: Russell Sage Foundation, 1994:261-81.
31. Ottenbacher K. The impact of random assignment on study outcome: an empirical examination. Control Clin Trials 1992;13:50-61.
32. Ottenbacher K, Janell S. The results of clinical trials in stroke rehabilitation research. Arch Neurol 1993;50:37-44.
33. Granger CV, Ottenbacher K, Fiedler R. The Uniform Data System for Medical Rehabilitation: report of first admissions for 1994. Am J Phys Med Rehabil 1996;75:125-9.
34. Johnston MV, Keith RA, Hinderer SR. Measurement standards for interdisciplinary medical rehabilitation. Arch Phys Med Rehabil 1992;73 Suppl:S1-23.
35. Grey N, Kennedy P. The Functional Independence Measure: a comparative study of clinician and self ratings. Paraplegia 1993;31:457-61.
36. Adamovich BLB. Pitfalls in functional assessment: a comparison of FIM ratings by speech-language pathologists and nurses. NeuroRehabilitation 1992;2:42-51.
37. Jacobson NS, Follette WC, Revenstorf D. Psychotherapy outcome research: methods for reporting variability and evaluating clinical significance. Behav Ther 1984;15:336-52.

APPENDIX: THE ELEVEN STUDIES ANALYZED

1. Fricke J, Unsworth C, Worrell D. Reliability of the Functional Independence Measure with occupational therapists. Aust Occup Ther J 1993;40:7-15.
2. Chau N, Daler S, Andre JM, Patois A. Interrater agreement of two functional independence scales: the Functional Independence Measure (FIM) and a subjective uniform continuous scale. Disabil Rehabil 1994;16:63-71.
3. Ottenbacher KJ, Mann WC, Granger CV, Tomita M, Hurren D, Charvat B. Interrater agreement and stability of functional assessment in the community-based elderly. Arch Phys Med Rehabil 1994;75:1297-301.
4. Segal ME, Ditunno JF, Staas WE. Inter-institutional agreement of individual Functional Independence Measure (FIM) items measured at two sites on one sample of SCI patients. Paraplegia 1993;31:622-31.
5. Hamilton BB, Laughlin JA, Fiedler RC, Granger CV. Interrater reliability of the 7-level Functional Independence Measure (FIM). Scand J Rehabil Med 1994;26:115-9.
6. Grey N, Kennedy P. The Functional Independence Measure: a comparative study of clinician and self-ratings. Paraplegia 1993;31:457-61.
7. Jaworski DM, Kult T, Boynton PR. The Functional Independence Measure: a pilot study comparison of observed and reported ratings. Rehabil Nurs Res 1994;Winter:141-7.
8. Kidd D, Stewart G, Baldry J, Johnson J, Rossiter A, Petruckevitch A, Thompson AJ. The Functional Independence Measure: a comparative validity and reliability study. Disabil Rehabil 1995;17:10-4.
9. Smith PM, Illig SB, Fiedler RC, Hamilton BB, Ottenbacher KJ. Intermodal agreement of follow-up telephone functional assessment using the Functional Independence Measure in patients with stroke. Arch Phys Med Rehabil 1996;77:431-5.
10. Segal ME, Schall RR. Determining the functional/health status and its relation to disability in stroke survivors. Stroke 1994;25:2391-7.
11. Brosseau L. The interrater reliability and construct validity of the Functional Independence Measure (FIM) for multiple sclerosis subjects. Clin Rehabil 1994;8:107-15.