The Reliability of the Functional Independence Measure: A Quantitative Review

Kenneth J. Ottenbacher, PhD, Yungwen Hsu, MS, Carl V. Granger, MD, Roger C. Fiedler, PhD
ABSTRACT. Ottenbacher KJ, Hsu Y, Granger CV, Fiedler RC. The reliability of the Functional Independence Measure: a quantitative review. Arch Phys Med Rehabil 1996;77:1226-32.
Objective: The reliability of the Functional Independence Measure (FIM℠) for adults was examined using procedures of meta-analysis.
Data Sources: Eleven published studies reporting estimates
of reliability for the FIM were located using computer searches
of Index Medicus, Psychological Abstracts, the Functional Assessment Information Service, and citation tracking.
Study Selection: Studies were identified and coded based on
type of reliability (interrater, test-retest, or equivalence), method
of data analysis, size of sample, and training or experience of
raters.
Data Extraction: Information from the articles was coded
by two independent raters. Interrater reliability for coding all
elements included in the analysis ranged from .89 to 1.00.
Data Synthesis: The 11 investigations included a total of 1,568 patients and produced 221 reliability coefficients. The majority of the reliability values (81%) were from interrater reliability studies, and the intraclass correlation coefficient (ICC) was the most commonly used statistical procedure to compute reliability. The reported reliability values were converted to a common correlation metric and aggregated across the 11 studies. The results revealed a median interrater reliability for the total FIM of .95 and median test-retest and equivalence reliability values of .95 and .92, respectively. The median reliability values for the six FIM subscales ranged from .95 for Self-Care to .78 for Social Cognition. For the individual FIM items, median reliability values varied from .90 for Toilet Transfer to .61 for Comprehension. Median and mean reliability coefficients for FIM motor items were generally higher than for items in the cognitive or communication subscales.
Conclusions: Based on the 11 studies examined in this review, the FIM demonstrated acceptable reliability across a wide variety of settings, raters, and patients.
© 1996 by the American Congress of Rehabilitation Medicine and the American Academy of Physical Medicine and Rehabilitation
The Functional Independence Measure (FIM℠) is one of the most widely used methods of assessing basic quality of daily living activities in persons with a disability.1 The FIM for adults was developed by a national task force cosponsored by the American Academy of Physical Medicine and Rehabilitation and the American Congress of Rehabilitation Medicine.1 The original work of this task force was expanded by the Department of Rehabilitation Medicine at the State University of New York at Buffalo. The FIM is now part of the Uniform Data System for Medical Rehabilitation (UDSMR℠) and is widely used in the United States and internationally.2-5 The FIM includes 18 items designed to assess the amount of assistance required for a person with a disability to perform basic life activities safely and effectively. The activities include a minimum set of skills related to self-care, sphincter control, transfers, locomotion, communication, and social cognition.
In discussing the measurement of functional status in rehabilitation, Johnston and colleagues noted that "the FIM is currently the most widely used measure of disability, being used in several hundred medical rehabilitation hospitals."5 They went on to observe that in spite of its widespread use, relatively little reliability data on the FIM have been published. Several studies examining the reliability of the FIM have appeared in the rehabilitation literature since the observation by Johnston et al.6-13 These investigations explore the interrater and test-retest reliability of the FIM for different patient groups and professional disciplines. Some reliability studies have examined the impact of training and experience on agreement among raters.6,7 Other investigations have compared ratings obtained on the FIM with scores obtained using alternate methods of functional assessment.
What do the cumulative results of these studies indicate regarding the reliability of the FIM and its usefulness in rehabilitation practice? The purpose of this investigation was to examine previously published research on the reliability of the Adult FIM using procedures for quantitatively synthesizing research described by Glass and others.14-16 These techniques have been used previously to synthesize reliability findings in clinical research.17 In this investigation the emphasis was on synthesizing three types of FIM reliability: interrater reliability, test-retest reliability, and equivalence reliability. Interrater and test-retest reliability are widely understood by rehabilitation researchers and clinicians. Equivalence reliability is defined as the reliability of an assessment in two or more versions. For example, Jaworski and coworkers8 examined the agreement between FIM ratings obtained through in-person interview and observation with ratings collected by telephone interview. Their investigation was considered a study of equivalence reliability.
From the University of Texas Medical Branch at Galveston (Dr. Ottenbacher), the State University of New York at Buffalo (Dr. Granger, Dr. Fiedler), and National Cheng Kung University, Taiwan (Ms. Hsu).
Submitted for publication September 26, 1995. Accepted in revised form July 10, 1996.
Supported in part by Rehabilitation Research and Training Center grant H133B30041 from the National Institute on Disability and Rehabilitation Research, US Department of Education.
The authors have chosen not to select a disclosure statement.
Reprint requests to Kenneth J. Ottenbacher, PhD, School of Allied Health Sciences, 11th and Mechanic Streets, Galveston, TX 77555-102X.
© 1996 by the American Congress of Rehabilitation Medicine and the American Academy of Physical Medicine and Rehabilitation
0003-9993/96/7712-3707$3.00/0

METHODS
Studies were identified through computer searches of the Index Medicus and Psychological Abstracts data bases from 1966
to 1994. The Functional Assessment Information
Service database associated with the Center on Functional Assessment Research and the Rehabilitation
Research and Training Center on
Functional Assessment and Evaluation of Rehabilitation
Outcomes at the State University of New York at Buffalo was also
examined. The following key words were used in the computer
bibliographic
searches: functional assessment, reliability, measurement, functional independence, and activities of daily living. Manual searches were conducted using the reference lists
of all retrieved articles. A total of 39 potentially relevant papers
was identified in the initial search. The articles were individually
examined by three raters with rehabilitation
experience to determine whether they met the following criteria. Only articles on
which all three examiners agreed were included in the final
review and analysis.
Inclusion Criteria
To be included in the analysis a reliability
study had to include a quantitative estimate of interrater, test-retest, or equivalence reliability
for the Functional
Independence
Measure
(FIM). Studies that did not include an estimate of interrater,
test-retest, or equivalence reliability
were not included in the
review. The statistical method used to determine the reliability
value had to be clearly identified. The sample size on which
the reliability index was based and the number of raters involved
in interrater reliability
investigations
had to be reported. For
studies including an estimate of test-retest reliability,
the time
interval between testing had to be clearly specified. Finally, the
method used to obtain the FIM ratings, that is, observation, self-report, or proxy rating, had to be identified. Eleven of the 39
studies were eliminated because a sample size for the reliability
index could not be identified and/or the reliability
index was
not clearly identified or was not appropriate, eg, a measure of
internal consistency. Seven studies used the term reliability in the title or abstract, but did not provide any numerical data or
the authors referred to reliability
values published in previous
reports. Five investigations did not provide information
about
the raters or the time interval between test and retest. Three
studies included information
on concurrent validity with other
ADL assessments, but no reliability
values. One investigation
(a book chapter) reported data on the four-level version of the
FIM. Finally, one abstract included data that were subsequently
published in a larger investigation.
The 11 remaining studies
that met all of the above criteria were selected for further analysis. The 11 investigations are listed in the appendix. A complete
listing of the 39 original investigations
can be obtained from
the first author.
Study Coding
Each of the 11 investigations
was coded according to year
of appearance and source of publication. Sample size and characteristics were coded for both the raters and the patients included in the investigations. Raters were coded as having previous experience using the FIM or having received formal training
in the administration
of the FIM. The specific form of training
was also recorded. The following
categories of training were
identified and coded: formal workshop or instruction, use of
FIM videotape and case studies, informal use of FIM guide,
and no training. The subjects who were evaluated using the FIM
were coded according to disability, age, and sex. The setting in
which the FIM rating occurred was coded as one of the following: rehabilitation
center, hospital, home, or other.
Each individual
reliability
coefficient was coded according
to the type of reliability,
eg, interrater or test-retest. For those
investigations recording test-retest information, the duration between the first and second test was recorded. The type of statistical procedure used to calculate the reliability index was coded
as: Pearson product moment correlation, intraclass correlation
coefficient (ICC), Kappa (K), or other.
Quality assessment screening as proposed by Chalmers and others18 was not used. These screening criteria were developed
primarily for use with randomized clinical trials (RCTs). The
design attributes associated with reliability
investigations
are
different from those involved in RCTs. The coded attributes
described above, for example, the type of reliability, rater training and experience, duration between ratings, and number of raters, were believed to sufficiently capture the characteristics of design quality involved in the reliability studies examined. Two coders examined and rated the manuscripts without author or title attribution. The reliability (agreement) of these ratings was examined (see below) to determine the existence of rater bias or error.

Table 1: The Functional Independence Measure

FIM (motor)
  Self-care: A. Eating; B. Grooming; C. Bathing; D. Dressing upper body; E. Dressing lower body; F. Toileting
  Sphincter control: G. Bladder management; H. Bowel management
  Transfer: I. Bed, chair, wheelchair; J. Toilet; K. Tub, shower
  Locomotion: L. Walk/wheelchair; M. Stairs
FIM (cognitive)
  Communication: N. Comprehension; O. Expression
  Social cognition: P. Social interaction; Q. Problem solving; R. Memory

Levels of scoring
  Independence: 7 Complete independence (timely, safely); 6 Modified independence (device)
  Modified dependence: 5 Supervision; 4 Minimal assistance (subject 75%+); 3 Moderate assistance (subject 50%+)
  Complete dependence: 2 Maximal assistance (subject 25%+); 1 Total assistance (subject 0%+)
Adult FIM
The FIM instrument is a minimal data set designed to assess functional independence.1 The FIM includes 18 items, each with a maximum score of 7 and a minimum score of 1. Possible scores range from 18 to 126. Each level of scoring is defined. For example, a score of 7 equals "complete independence," a score of 1 equals "complete dependence," and 3 equals "moderate assistance." The areas examined by the FIM include self-care, sphincter control, transfers, locomotion, communication, and social cognition. These areas are further defined into motor and cognitive domains. The motor domain includes thirteen items in the areas of self-care, sphincter control, transfers, and locomotion. The cognitive domain contains five items from the Communication and Social Cognition subscales. The domains, subscales, and items included in the FIM are presented in table 1. In evaluating the 11 studies, reliability coefficients were coded based on whether they were computed for individual items, for FIM subscale areas such as self-care, sphincter control, locomotion, etc, or for the motor or cognitive domain.
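For readers who organize FIM data programmatically, the item structure and scoring range described above can be sketched as follows. This is an illustrative outline only; the item keys and function name are invented for the example and are not part of the UDSMR software.

```python
# Illustrative sketch of the FIM structure summarized in table 1 (not UDSMR software).
FIM_ITEMS = {
    "motor": {
        "self_care": ["eating", "grooming", "bathing", "dressing_upper", "dressing_lower", "toileting"],
        "sphincter_control": ["bladder", "bowel"],
        "transfers": ["bed_chair_wheelchair", "toilet", "tub_shower"],
        "locomotion": ["walk_wheelchair", "stairs"],
    },
    "cognitive": {
        "communication": ["comprehension", "expression"],
        "social_cognition": ["social_interaction", "problem_solving", "memory"],
    },
}

def total_fim(ratings):
    """Sum the 18 item ratings (each scored 1-7); totals range from 18 to 126."""
    items = [i for domain in FIM_ITEMS.values() for subscale in domain.values() for i in subscale]
    assert set(ratings) == set(items), "all 18 items must be rated"
    assert all(1 <= ratings[i] <= 7 for i in items), "items are scored 1 (total assistance) to 7 (complete independence)"
    return sum(ratings[i] for i in items)
```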
Reliability of Coding
To establish the interrater agreement of the coding procedures, the 11 studies were examined, without author attribution and titles, by two raters with rehabilitation experience and graduate level training in research methods and statistics. When the two raters did not agree on the coding of a particular item, a third rater with more than 15 years of research and rehabilitation experience was consulted. The majority rating was then used in the analysis. The interrater agreement for all items included in the analysis described below ranged from an ICC of .89 to 1.00.
Data Analysis
Streiner and Norman19 observe that "there has been considerable debate in the literature regarding the most appropriate choice of reliability coefficient." They discuss the advantages and limitations of the three statistical methods used most frequently in the eleven articles: the Pearson product moment correlation (r), intraclass correlation coefficients (ICC), and Kappa (K). All three of these statistics have similar practical ranges for interpretation, that is, values from -1.00 to 1.00. What is considered an acceptable reliability value, however, varies considerably among authors. For example, Portney and Watkins20 suggest ICC values of .75 and above are "indicative of good reliability, and those below .75 poor to moderate reliability." Kelly21 has recommended .94, and Weiner and Stewart22 suggest .85 for reliability values involved in making decisions about individuals. The published guidelines for interpreting unweighted Kappa tend to be lower than for r or ICC. Landis and Koch23 have suggested that values for Kappa from .60 to .80 represent good to excellent reliability (agreement), Kappa values of .40 to .60 reflect moderate agreement, and values less than .40 indicate poor agreement. The interpretation of values for weighted Kappa is identical to those computed for ICC when the variance for raters (or trials) is excluded.20
The standard method for computing Kappa uses the following formula: K = (Po - Pe)/(1 - Pe), where Po is the observed proportion of agreement and Pe is the expected proportion of agreement by chance. Rae24 has demonstrated that analysis of variance terms can be substituted for Po and Pe to produce the following formula: K = [SSBP - SSWP/(k - 1)]/[SSBP + SSWP], where SSBP and SSWP are the sums of squares between subjects and within subjects in the analysis of variance and k is the number of raters. If N is at all large (N > 20), then Kappa is estimated from the formula K = σ²BP/(σ²BP + σ²WP), where σ²BP and σ²WP are the between-subjects and within-subjects variance components; this is derived from the formula above and represents the intraclass correlation coefficient.25 Rae and others24,25 provide several formulae to convert Kappa (weighted and unweighted) to various ICC models that include or exclude systematic variability among raters (or trials) as a component of total variability. The procedures described by Rae and others22-24 were used to convert Kappa values to a reliability metric comparable to the ICC and r.
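As an illustration of the relation described above, the one-way intraclass correlation can be computed from the between-subjects and within-subjects sums of squares of a subjects-by-raters matrix, and for large samples this value approximates Kappa. The sketch below is illustrative only; the function names are ours, and the specific ICC model used varied across the primary studies.

```python
import numpy as np

def icc_oneway(ratings):
    """One-way ICC from an n-subjects x k-raters matrix of scores."""
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand_mean = ratings.mean()
    subject_means = ratings.mean(axis=1)
    ss_bp = k * ((subject_means - grand_mean) ** 2).sum()       # between-subjects sum of squares
    ss_wp = ((ratings - subject_means[:, None]) ** 2).sum()     # within-subjects sum of squares
    ms_bp = ss_bp / (n - 1)
    ms_wp = ss_wp / (n * (k - 1))
    return (ms_bp - ms_wp) / (ms_bp + (k - 1) * ms_wp)

def kappa_from_ss(ss_bp, ss_wp, k):
    """Large-sample approximation of Kappa from the ANOVA sums of squares (k = number of raters)."""
    return (ss_bp - ss_wp / (k - 1)) / (ss_bp + ss_wp)
```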
Reliability (correlation) values can be aggregated as "raw" effect sizes by combining the r values weighted for sample size,26 or they can be converted by Fisher's27,28 variance stabilizing z transformation. Statisticians generally recommend using Fisher's z transformation unless sample sizes are very large because standard errors, confidence intervals, and homogeneity tests can be quite variable.29 All reliability values in this investigation were converted using Fisher's variance stabilizing z transformation and combined using a random effects model.29 The random effects model assumes a distribution of population reliability values generated by study attributes such as patient grouping (diagnosis), raters, or environments. The homogeneity analysis proposed by Hedges and Olkin,29 called Q, was computed following the transformation to Fisher z values. The Q statistic was computed for the total FIM ratings included in the 11 studies. The Q statistic has a chi-square distribution with N - 1 degrees of freedom. To make the interpretation of the final results easier, the Fisher z values were converted back to correlation values following homogeneity testing (see tables 3 and 4). Shadish and Haddock30 state that "interpretation of results is facilitated" if Fisher z values are converted back to correlation values. This is particularly true in reliability studies, where results are traditionally reported as coefficients ranging from .00 to 1.00.
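The aggregation steps described above can be sketched in a few lines. The example below uses fixed inverse-variance weights and the large-sample variance 1/(n - 3) for each Fisher z value; the published analysis used a random effects model, which would add a between-study variance component to each weight, so this is only an illustrative approximation, and the function name is ours.

```python
import numpy as np
from scipy import stats

def aggregate_reliability(r, n):
    """Combine reliability coefficients r (with sample sizes n) on the Fisher z metric,
    compute the Q homogeneity statistic, and back-transform to the correlation metric."""
    r, n = np.asarray(r, dtype=float), np.asarray(n, dtype=float)
    z = np.arctanh(r)                          # Fisher variance stabilizing transformation
    w = n - 3.0                                # inverse of the large-sample variance 1/(n - 3)
    z_bar = np.sum(w * z) / np.sum(w)          # weighted mean effect size
    se = np.sqrt(1.0 / np.sum(w))
    q = float(np.sum(w * (z - z_bar) ** 2))    # chi-square with len(r) - 1 degrees of freedom
    p = float(stats.chi2.sf(q, df=len(r) - 1))
    ci = np.tanh([z_bar - 1.96 * se, z_bar + 1.96 * se])
    return float(np.tanh(z_bar)), tuple(ci), q, p
```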
RESULTS
The 11 studies were published from 1993 to 1995 (mean =
1994, SD = .70). A total of 1,568 subjects were included in
the 11 studies (mean = 130.66, SD = 284.17). Two hundred
twenty-one reliability coefficients were included in the 11 studies, indicating that each investigation contained approximately
19 reliability values. A large majority of the reliability coefficients (81%) were from interrater reliability comparisons. The
ICC was the most common statistical procedure used to compute
reliability values (n = 116), followed by the Kappa statistic (n
= 53) and the Pearson product moment correlation coefficient
(n = 52). Basic descriptive information concerning each of the
eleven studies included in the analysis is presented in table 2.
The table includes the first author, sample size, type of reliability, and values for the total FIM score.
The analysis of z transformed reliability coefficients produced
an average z of 1.55, and a variance of .21, the square root of
which (.47) is the standard error. This average effect size is significantly different from zero, since Z = (average z)/SE = 1.55/.47 = 3.30, which exceeds the critical value of 1.96 for α = .05 in the standard normal distribution.
The limits of the 95%
confidence interval are 1.50 and 1.60. The homogeneity statistic
(Q) for the 221 effect sizes was 2,444 (p < .05, df = 220)
indicating that the variability in reliability
values was greater
than that expected by sampling error alone.
Figure 1 presents summary box plots of reliability values for
the total FIM ratings, for the motor and cognitive domain ratings. Table 3 includes median and mean reliability coefficients
for the FIM items, subscales, and domains. Table 3 also includes
values for the 95% confidence interval for each of the mean
values. Inspection of table 3 reveals that the median reliability
value for the total FIM score was .95, indicating excellent overall consistency among raters using the FIM across a large number of patients with varying levels of impairment and medical
diagnoses. The median reliability
values for the subscales
ranged from .95 for self-care to .78 for social cognition. The
median reliability value for the cognitive domain items (.93) was lower than for the motor domain items (.97) (see table 1
for a listing of the items contained within the FIM motor and
cognitive domains). The median reliability values for the individual FIM items ranged from a low of .61 for comprehension
to a high of .90 for transfer to toilet. In general, the FIM motor
and ADL items were found to have higher median and mean
reliability values than the communication
and cognition items.
Table 4 includes median and mean reliability values associated with various study attributes. Included in the table are the
reliability values for different levels of experience and training,
and for different diagnostic conditions. A criterion of two standard errors of the mean was used to examine whether differences existed in mean reliability values across the study attributes. If greater than two standard errors of the mean separated
any two mean reliability values, then those values were considered to be “substantially”
different. This conservative criterion
has been used in previous meta-analyses when multiple values
were generated from the same study, the assumption of independent data points could not be made, and the number of studies
was relatively small.31,32 Using the greater than two standard error of the mean criterion, there were no differences found
among the mean reliability values for the study attributes of experience, training, or medical diagnoses. When subscale scores were examined there was some indication of an interaction between level of training and subscale reliability scores. The median reliability values for the subscales of Communication and Social Cognition were lower for the studies where raters received informal FIM training than for the other level of training groups. These results, however, should be interpreted with caution because the sample size in some of the categories included in table 4 is small and the statistical power is low.

Table 2: Descriptive Information for the 11 Studies Included in the Review

First Author | Year | Sample | Diagnosis | Raters | Type of Reliability | Statistic | Total FIM
Fricke* | 1993 | 4 | Stroke | 40 therapists (rated video) | Interrater | ICC | .89
Ottenbacher | 1994 | 20 | Mixed | 2 clinicians | Interrater; Test-retest | ICC†; ICC† | .94; .93
Chau | 1994 | 198 | Mixed | Team | Interrater | Kappa | .90
Segal | 1993 | 57 | SCI | Team | Test-retest | r | .84
Hamilton | 1994 | 1,018 | Mixed | Team | Interrater | ICC | .92
Grey | 1993 | 40 | SCI | Nurses and self-report | Equivalence | r | .84
Jaworski | 1994 | 14 | Mixed | Nurses (2 raters), observe/phone | Interrater; Equivalence | ICC; ICC | .99; .94
Kidd | 1995 | 50 | Mixed | Team | Interrater; Test-retest | r‡; r‡ | .92; .90
Smith | 1995 | 40 | Mixed | Nurses (2 raters), observe/phone | Equivalence | ICC | .97
Segal | 1994 | 8; 38 | Stroke | Therapist (2 raters); Team/caregiver/proxy | Interrater; Equivalence | ICC; ICC | .96; .87
Brosseau | 1994 | 81 | MS | Therapists (2 raters) | Interrater | ICC | .83

* The Fricke et al investigation included 8 FIM items: eating, bathing, dressing-upper body, dressing-lower body, grooming, toileting, toilet transfer, tub/shower transfer.
† Total FIM based on average of two total FIM scores.
‡ Correlation estimated from data provided in article.
Fig 1. Box plot summary statistics for Total FIM, Motor FIM, and Cognitive FIM ratings. Box plots include all values for each variable. The middle line in the box indicates the median. The top and bottom of the box are the 25% and 75% scores. The bars at the end of the lines represent the 10% and 90% values. The stars represent individual values outside the 10% to 90% range of scores. (Five values, .18, .16, .16, .13, and .02, are not included in the figure. Total N does not equal 221 because some correlations [n = 40] included combinations of motor and cognitive items.)
Table 3: Median and Mean Reliability Values, Descriptive Statistics, and Sample Size by Item, Subscale, Domain, and Total FIM Score

FIM item | Median | Mean | Sample | SD | 95% CI
Eating | .77 | .75 | 1,412 | .11 | .744-.756
Grooming | .84 | .80 | 1,412 | .12 | .794-.806
Bathing | .83 | .76 | 1,412 | .18 | .755-.765
Dressing UB | .89 | .83 | 1,412 | .13 | .822-.838
Dressing LB | .88 | .82 | 1,412 | .16 | .812-.828
Toileting | .83 | .77 | 1,412 | .18 | .762-.778
Bladder | .67 | .68 | 1,412 | .17 | .672-.688
Bowel | .78 | .69 | 1,412 | .19 | .680-.700
Bed, chair transfer | .79 | .80 | 1,412 | .14 | .792-.808
Toilet transfer | .90 | .85 | 1,412 | .11 | .844-.856
Tub/shower transfer | .88 | .79 | 1,412 | .21 | .778-.802
Walk/chair | .66 | .71 | 1,408 | .17 | .696-.724
Stairs | .66 | .60 | 1,408 | .23 | .598-.612
Comprehension | .61 | .59 | 1,408 | .25 | .576-.604
Expression | .73 | .62 | 1,408 | .22 | .608-.632
Social interaction | .72 | .57 | 1,408 | .28 | .552-.586
Problem solving | .84 | .74 | 1,408 | .24 | .726-.754
Memory | .85 | .73 | 1,408 | .27 | .716-.744

FIM subscale | Median | Mean | Sample | SD | 95% CI
Self-care | .95 | .93 | 1,254 | .06 | .926-.934
Sphincter control | .91 | .89 | 1,254 | .10 | .886-.894
Transfers | .92 | .90 | 1,254 | .07 | .896-.904
Locomotion | .92 | .84 | 1,254 | .18 | .830-.850
Communication | .87 | .76 | 1,254 | .21 | .750-.770
Social cognition | .78 | .63 | 1,254 | .40 | .614-.646

FIM domain | Median | Mean | Sample | SD | 95% CI
Motor | .97 | .96 | 1,173 | .04 | .948-.972
Cognitive | .93 | .91 | 1,173 | .10 | .904-.916

FIM total | .95 | .93 | 1,568 | .05 | .926-.934
Table 4: Median and Mean Reliability Values (Total FIM) and Descriptive Statistics Associated With Study Attributes

[Table 4 reports the median, mean, sample size, number of reliability values/number of studies, SD, and 95% confidence interval for total FIM reliability, grouped by study attribute: Diagnosis (Stroke, SCI, MS, Mixed); Type of reliability (Interrater, Test-retest, Equivalence); and FIM training† (Previous FIM experience, Formal FIM training, Informal FIM training, UDS credentialed).]

† Formal training involved UDSMR-sponsored training; informal training involved use of videotapes and review of the manual (training not sponsored by UDSMR).
Regression models represent another approach to examining moderator (predictor) variables. A technical problem with regression is that the number of studies (n = 11) is small in
comparison to the number of potential moderator variables. As
a rough guide, one might include no more predictors than the
square root of the number of studies, about 3 in this case.
Reducing the number of potential predictor variables can be
accomplished in many ways. In this investigation the focus on
methodology
led us to select type of reliability
as a relevant
variable category to include in the model.
A regression equation was computed following the weighted least squares procedure outlined in Hedges and Olkin.29 The resulting multiple R was .47 (R² = .22) for the first regression equation including the predictor variables of interrater reliability, test-retest reliability, and equivalence reliability. The test for significance of the predictor set was QR(3) = 7.47, p = NS.* The test for model specification was rejected (QE = 69.14, df = 8, p < .05), suggesting that nonrandom variance in effect size remained.
* Hedges and Olkin's example is computed in SAS, which includes the intercept in this test; the present regression was computed in SPSSx, which excludes the intercept. Therefore, one must add a degree of freedom to the degrees of freedom for the QE test.
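The weighted least squares step can be illustrated with the sketch below. The design matrix and variable names are assumptions for the example (an intercept plus dummy codes for test-retest and equivalence reliability, with interrater reliability as the reference category); the analysis reported above was run in SPSSx, not with this code.

```python
import numpy as np

def wls_meta_regression(z, v, X):
    """Weighted least squares regression of Fisher z effect sizes on study-level predictors,
    in the spirit of the Hedges and Olkin procedure cited above.
    z: effect sizes, v: their variances, X: design matrix (one row per effect size)."""
    z, v, X = (np.asarray(a, dtype=float) for a in (z, v, X))
    w = 1.0 / v
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ z)   # weighted coefficient estimates
    fitted = X @ beta
    q_e = float(np.sum(w * (z - fitted) ** 2))         # model specification (error) statistic
    z_bar = np.sum(w * z) / np.sum(w)
    q_r = float(np.sum(w * (fitted - z_bar) ** 2))     # fit attributable to the predictor set
    return beta, q_r, q_e
```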
A more clinically practical way to examine reliability
is to
compute the standard deviation of the measurement error. The
standard deviation of measurement error is commonly referred
to as the standard error of measurement (SEM). The SEM is more accurate if estimated from a large number of ratings.
One advantage of the meta-analytic approach used in this investigation is the creation of a large combined subject pool (N =
1,568) to derive estimates of reliability for the FIM. The SEM
is computed using the following formula: SEM = SD√(1 - r), where SD is the standard deviation and r is the reliability value. The SEM for the FIM was computed using the standard deviation contained in the annual UDSMR report for 1994.33 The standard deviation value was based on 1994 FIM admission data containing FIM records for more than 150,000 patients receiving medical rehabilitation services. Using this information and the test-retest reliability obtained in this investigation, the SEM for the FIM is 21√(1 - .95) = 4.70.
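As a worked check of the computation above (the SD of 21 and reliability of .95 are the values given in the text; the function name is ours):

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SEM = SD * sqrt(1 - r)."""
    return sd * math.sqrt(1.0 - reliability)

# SD of 21 from the 1994 UDSMR admission data and the test-retest reliability of .95 from this review.
print(round(sem(21, 0.95), 2))   # 4.7
```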
DISCUSSION
In describing the importance of reliability in the Interdisciplinary Measurement
Standards for Medical Rehabilitation,
Johnston and colleagues34 stated that “both science and the
clinical practice of medicine demand high, or at least known,
levels of reliability.”
Any widely used measure of disability
must produce consistent results across raters and over time if
it is to be useful in program evaluation and research. Grey and
Kennedy35 observed, “In recent years the Functional Independence Measure has emerged as a standard assessment instrument for use in rehabilitation
and therapy programs for disabled
persons.” The results of this quantitative review indicate the
FIM provides good interrater reliability
across a wide variety
of raters with different professional backgrounds and levels of
training. The median interrater reliability value was .95 and is
based on a large cumulative sample of patients representing a
wide variety of disability levels and medical conditions. The
95% confidence interval for the mean interrater reliability value is .915 to .925. In addition, the evidence for test-retest (median
.95) and equivalence (median .92) reliability is also good.
The results suggest that reliability is highest for items in the
motor domain, specifically for the individual items of dressing-upper body and transfer to the toilet. The lowest interrater reliability values were associated with the items of comprehension
and social interaction. These items are among the most difficult
to observe directly. There is some indication that lower reliability of items in the Communication
and Social Cognition subscales may be related to levels of training. Additional research
is needed to clarify the importance of training in achieving
high reliability values for individual items and subscale scores.
Research is also necessary to determine the impact of the number of raters on the reliability of ratings. In general, the larger
the number of raters, the higher the interrater reliability coefficient will be, all other factors remaining equal. In the 11 studies
examined, all interrater reliability comparisons were based on
scores for two raters, except for the investigation by Fricke and
others, who used 40 raters to assess the same patients. The ICC
values for Fricke’s investigation
based on 40 raters will be
higher than if the interrater reliability study was conducted using
only two raters. When the analysis was conducted without the
data from Fricke’s study, the mean interrater reliability
(ICC)
for the total FIM was .94 compared to a mean of .93 when the
data from Fricke’s investigation were included in the analysis.
As more reliability studies are reported using multiple raters, the
impact of number of examiners can be more carefully examined.
Another possible moderating variable of clinical importance in
future research is the professional background and training of
the raters. There are reports in the literature that different professional groups may obtain different ratings for FIM items or
subscales. For example, Adamovich36 conducted a study to compare the functional communication
ratings of registered nurses
with speech-language pathologists on selected FIM items. The
nurses assigned significantly
higher FIM ratings than the
speech-language pathologists when rating communication
skills
of persons with left hemisphere damage.
Meta-analysis procedures have the potential to examine the
effect of moderator variables such as professional affiliation and
background if those variables are clearly defined in the primary
studies. Unfortunately,
the 11 investigations
included in this
review did not systematically control for professional affiliation
and background.
Most ratings were performed by teams of
mixed professionals and the impact of professional background
could not be statistically examined. Additional
research is
needed on the response patterns and clinical reasoning of FIM
ratings provided by different professional groups before quantitative reviewing methods will be able to productively
examine
this issue.
Establishing reliability
allows researchers and clinicians to
determine the standard error of measurement (SEM) for the
FIM. The SEM can be used to establish the range in which a
person’s “true”
score will fall and is interpreted according to
the properties of the normal curve. For example, if a person’s
obtained (raw) total FIM rating is 100 and the SEM is 4.70,
then based on the characteristics of the normal curve there is a
68% chance that the individual's true score falls within ±1 SEM (between 95 and 105) or a 95% chance that it falls within ±2 SEM (between 90 and 110).
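The band described above follows directly from the SEM; the observed score of 100 is the hypothetical example from the text, and the function name is ours.

```python
def true_score_band(observed, sem, k=1):
    """Return the range observed +/- k * SEM; k = 1 gives the ~68% band, k = 2 the ~95% band."""
    return observed - k * sem, observed + k * sem

print(true_score_band(100, 4.70, 1))   # about (95.3, 104.7), reported as 95 to 105 above
print(true_score_band(100, 4.70, 2))   # about (90.6, 109.4), reported as 90 to 110 above
```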
The SEM is an essential indicator of measurement sensitivity
and the ability to document reliable change in performance over
time. Two or more ratings are always necessary to determine
whether a clinically important change has occurred. The first
step in measuring change is to ensure that the instrument used
to collect the two measures has adequate reliability.
Without
demonstrated reliability, change in values from one administration to the next could be the result of random measurement
error. Jacobson and colleagues3’ refer to change as “a reliable
difference in values for a variable over two or more points in
time.” They have proposed the Reliability Change Index (RCI)
to determine changes in performance
that are not caused by
measurement error. To compute the RCI, the clinician must
have the following information: (1) the patient’s preintervention
score, (2) the posttest score following rehabilitation,
and (3) the
standard error of measurement (SEM) for the instrument. The
standard error of measurement is influenced by the reliability
of the test and is computed from the formula presented earlier
(SEM = SD√(1 - r), where SD is the standard deviation for the test and r is the reliability coefficient).
The RCI is computed as follows: RCI = (X2 - X1)/SEM, where X2 is the postintervention score, X1 is the preintervention score, and SEM is the standard error of measurement as defined above. For example, a patient receiving intervention for a deficit in ADL may have an admission FIM score of 73. Following medical rehabilitation, the patient's FIM score at discharge might increase to 85. The SEM for the FIM is 4.70. Using the formula provided above, the reliability change index is RCI = (85 - 73)/4.70 = 2.55. Jacobson and associates37 argue that the RCI index should be interpreted using a unit normal distribution, where a value beyond ±1.96 would be unlikely to occur (p < .05) without actual change. The RCI of 2.55 in the example indicates that a reliable change in FIM scores has occurred from pre- to post-rehabilitation.
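The computation in the example above can be written directly (values from the text; the function name is ours):

```python
def reliability_change_index(pre, post, sem):
    """RCI = (X2 - X1) / SEM; absolute values beyond about 1.96 are unlikely (p < .05)
    to reflect measurement error alone."""
    return (post - pre) / sem

print(round(reliability_change_index(73, 85, 4.70), 2))   # 2.55
```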
The index proposed by Jacobson et al provides a statistical
indication of change. The RCI, however, does not include information regarding the clinical importance of the change. In the
above example, the clinician using the RCI knows that the
change of 12 FIM points (from 73 to 85) was probably (p <
.05) not caused by chance (random measurement error). The
RCI does not tell the clinician whether the 12-point improvement resulted from the rehabilitation
treatment or whether the
12-point increase is clinically important. Whether the improvement in FIM scores is clinically
significant cannot be answered
statistically. The answer to this question requires knowledge of
what the patient and his or her family believe are important
functional skills along with information on the cost required to
obtain these skills.
The results of this investigation indicate that the Adult FIM
provides reliable information across a variety of patient populations, settings and clinicians. The ability of the FIM (or any
assessment of functional performance)
to provide consistent
information is essential to the measurement of change in daily
living skills. Based on the quantitative synthesis of the 11 studies examined in this investigation,
the FIM appears to be a
reliable instrument for assessing basic daily living skills in persons with disabilities.
References
1. Guide for the Uniform Data Set for Medical Rehabilitation (Adult FIM), version 4.0. Buffalo: State University of New York at Buffalo, 1993.
2. Granger CV, Hamilton BB, Keith RA, Zielezny M, Sherwin FS. Advances in functional assessment for medical rehabilitation. Top Geriatr Rehabil 1986;1:59-74.
3. Hamilton BB, Granger CV, Sherwin FS, Zielezny M, Tashman JS. A uniform national system for medical rehabilitation. In: Fuhrer MJ, editor. Rehabilitation outcomes: analysis and measurement. Baltimore: Brookes, 1987:137-47.
4. Granger CV, Braun S, Fiedler RC, Griffiths A, Johnston MV, Kelley-Hays A. Quality and outcome measures for medical rehabilitation. In: Braddom RL, editor. Physical medicine and rehabilitation. Philadelphia: Saunders, 1996:239-55.
5. Johnston MV, Findley TW, DeLuca J, Katz RT. Research in physical medicine and rehabilitation: XII. Measurement tools with application to brain injury. Am J Phys Med Rehabil 1991;70 Suppl 1:S114-30.
6. Chau N, Daler S, Andre JM, Patois A. Inter-rater agreement of two functional independence scales: the Functional Independence Measure (FIM) and a subjective uniform continuous scale. Disabil Rehabil 1994;16:63-71.
7. Fricke J, Unsworth C, Worrell D. Reliability of the Functional Independence Measure with occupational therapists. Aust Occup Ther J 1993;40:7-15.
8. Jaworski DM, Kult T, Boynton PR. The Functional Independence Measure: a pilot study comparison of observed and reported ratings. Rehabil Nurs Res 1994;Winter:141-7.
9. Hamilton BB, Laughlin JA, Fiedler RC, Granger CV. Interrater reliability of the 7-level Functional Independence Measure (FIM). Scand J Rehabil Med 1994;26:115-9.
10. Ottenbacher KJ, Mann WC, Granger CV, Tomita T, Hurren D, Charvat B. Interrater agreement and stability of functional assessment in the community based elderly. Arch Phys Med Rehabil 1994;75:1297-301.
11. Segal ME, Ditunno JF, Staas WE. Inter-institutional agreement of individual Functional Independence Measure (FIM) items at two sites on one sample of SCI patients. Paraplegia 1993;31:622-31.
12. Hamilton BB, Laughlin J, Granger CV, Kayton RM. Interrater agreement of the seven level Functional Independence Measure (FIM) [abstract]. Arch Phys Med Rehabil 1991;72:790.
13. Brosseau L. The inter-rater reliability and construct validity of the Functional Independence Measure for multiple sclerosis subjects. Clin Rehabil 1994;8:107-15.
14. Glass GV, McGaw B, Smith ML. Meta-analysis in social research. Beverly Hills (CA): Sage, 1981.
15. Cooper HM. Integrating research: a guide for literature reviews. 2nd ed. Newbury Park (CA): Sage, 1989.
16. Petitti DB. Meta-analysis, decision analysis, and cost effectiveness analysis: methods for quantitative synthesis in medicine. New York: Oxford University Press, 1994.
17. Ottenbacher KJ. Interrater agreement of visual analysis in single-subject decisions: quantitative review analysis. Am J Ment Retard 1993;98:135-42.
18. Chalmers TC, Smith H, Blackburn B, Silverman B, Schroeder B, Reitman D, et al. A method for assessing the quality of a randomized control trial. Cont Clin Trial 1981;2:31-49.
19. Streiner DL, Norman GR. Health measurement scales: a practical guide to their development and use. 2nd ed. New York: Oxford University Press, 1995.
20. Portney LG, Watkins MP. Foundations of clinical research: applications to practice. Norwalk (CT): Appleton & Lange, 1993.
21. Kelly TL. Interpretation of educational measurements. Yonkers (NY): World Books, 1927.
22. Weiner EA, Stewart BJ. Assessing individuals. Boston: Little Brown, 1984.
23. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33:159-62.
24. Rae G. The equivalence of multiple rater Kappa statistics and intraclass correlation coefficients. Educ Psych Meas 1988;48:367-74.
25. Fleiss JL, Cohen J. The equivalence of weighted Kappa and the intraclass coefficients as measures of reliability. Educ Psych Meas 1973;33:613-9.
26. Cooper HM. Statistically combining independent studies: a meta-analysis of sex differences in conformity research. J Pers Soc Psychol 1979;37:131-46.
27. Fisher RA. Statistical methods for research workers. Edinburgh: Oliver & Boyd, 1925.
28. Hunter JE, Schmidt FL. Correcting for sources of artificial variation across studies. In: Cooper HM, Hedges LV, editors. The handbook of research synthesis. New York: Russell Sage Foundation, 1994:324-36.
29. Hedges LV, Olkin I. Statistical methods for meta-analysis. Orlando (FL): Academic Press, 1985.
30. Shadish WR, Haddock CK. Combining estimates of effect size. In: Cooper HM, Hedges LV, editors. The handbook of research synthesis. New York: Russell Sage Foundation, 1994:261-81.
31. Ottenbacher K. The impact of random assignment on study outcome: an empirical examination. Cont Clin Trial 1992;13:50-61.
32. Ottenbacher K, Janell S. The results of clinical trials in stroke rehabilitation research. Arch Neurol 1993;50:37-44.
33. Granger CV, Ottenbacher K, Fiedler R. The uniform data system for medical rehabilitation: report of first admissions for 1994. Am J Phys Med Rehabil 1996;75:125-9.
34. Johnston MV, Keith RA, Hinderer SR. Measurement standards for interdisciplinary medical rehabilitation. Arch Phys Med Rehabil 1992;73 Suppl:S1-23.
35. Grey P, Kennedy N. The Functional Independence Measure: a comparative study of clinician and self ratings. Paraplegia 1993;31:457-61.
36. Adamovich BLB. Pitfalls in functional assessment: a comparison of FIM ratings by speech-language pathologists and nurses. Neurorehab 1992;2:42-51.
37. Jacobson NS, Follette WC, Revenstorf D. Psychotherapy outcome research: methods for reporting variability and evaluating clinical significance. Behav Ther 1984;15:336-52.
APPENDIX: THE ELEVEN STUDIES ANALYZED
1. Fricke J, Unsworth C, Worrell D. Reliability of the Functional
Independence Measure with occupational therapists. Aust J Occup
Ther 1993;40:7-15.
2. Chau N, Daler S, Andre JM, Patois A. Interrater agreement of two functional independence scales: the Functional Independence Measure (FIM) and a subjective uniform continuous scale. Disabil Rehabil 1994;16:63-71.
3. Ottenbacher K, Mann WC, Granger CV, Tomita M, Hurren D,
Charvat B. Interrater agreement and stability of functional assessment in the community-based elderly. Arch Phys Med Rehabil
1994;75:1297-301.
4. Segal ME, Ditunno JF, Staas WE. Inter-institutional agreement of individual Functional Independence Measure (FIM) items measured at two sites on one sample of SCI patients. Paraplegia 1993;31:622-31.
5. Hamilton BB, Laughlin JA, Fiedler RC, Granger CV. Interrater reliability of the 7-level Functional Independence Measure (FIM). Scand J Rehabil Med 1994;26:115-9.
6. Grey N, Kennedy P. The Functional Independence Measure: a comparative study of clinician and self-ratings. Paraplegia 1993;31:457-61.
7. Jaworski DM, Kult T, Boynton PR. The Functional Independence
Measure: a pilot study comparison of observed and reported ratings.
Rehabil Nurs Res 1994; Winter: 141-7.
8. Kidd D, Stewart G, Baldry J, Johnson J, Rossiter A, Petruckevitch
A, Thompson AJ. The Functional Independence Measure: a comparative validity and reliability study. Disabil Rehabil 1995; 17: 10-4.
9. Smith PM, Illig SB, Fiedler RC, Hamilton BB, Ottenbacher KJ.
Intermodal agreement of follow-up telephone functional assessment
using the Functional Independence Measure in patients with stroke.
Arch Phys Med Rehabil 1996;77:431-5.
10. Segal ME, Schall RR. Determining the functional/health status and its
relation to disability in stroke survivors. Stroke 1994;25:2391-7.
11. Brosseau L. The interrater reliability and construct validity of the
Functional Independence Measure (FIM) for multiple sclerosis subjects. Clin Rehabil 1994;8:107-15.