Clinical Reliability of Manual Muscle Testing
Middle Trapezius and Gluteus Medius Muscles

ETHEL FRESE, MARYBETH BROWN, and BARBARA J. NORTON

The purposes of this study were to develop a protocol to examine the reliability of manual muscle testing in a clinical setting and to use that protocol to assess the interrater reliability of manually testing the strength of the middle trapezius and gluteus medius muscles. One hundred ten patients with various diagnoses participated as subjects, and 11 physical therapists participated as examiners in this study. The results showed that interrater reliability for right and left middle trapezius and gluteus medius muscles was low. The percentage of subjects for whom the two therapists obtained the same grade or grades within one third of a grade of each other ranged from 50% to 60% for the four muscles. This study indicates that using manual muscle testing to make accurate clinical assessments of patient status is of questionable value.

Key Words: Manual muscle testing, Muscle hypotonia, Physical therapy.

Mrs. Frese is Instructor, Department of Physical Therapy, St. Louis University, 1504 S Grand Blvd, St. Louis, MO 63104 (USA). She was a master's degree student, Program in Physical Therapy, School of Medicine, Washington University, St. Louis, MO, when this study was completed.
Dr. Brown is Instructor, Program in Physical Therapy, PO Box 8083, School of Medicine, Washington University, 660 S Euclid Ave, St. Louis, MO 63110.
Mrs. Norton is Instructor, Program in Physical Therapy, School of Medicine, Washington University.
This study was completed in partial fulfillment of the requirements for Mrs. Frese's master's degree, Washington University.
This article was submitted April 14, 1986; was with the authors for revision 10 weeks; and was accepted August 27, 1986. Potential Conflict of Interest: 4.

Manual muscle testing is an important clinical tool used by physical therapists to determine a patient's muscle strength. Muscle testing originated in the United States in the early 1900s during the study of muscle function in patients with poliomyelitis. Despite the change in the role of manual muscle testing with the end of the last poliomyelitis epidemic in this country, it remains an important clinical tool for assessing the muscular causes of movement dysfunction. Testing of muscles is considered to be an essential prerequisite for treatment program planning and modification. The results of manual muscle testing also are used to make clinical judgments concerning the patient's progress or deterioration, as well as to assess the effectiveness of a particular treatment.

The study of the reliability of examiners performing manual muscle tests is necessary if the tests are to be used. Manual muscle testing reliability in a clinical setting has been studied minimally. Lilienfeld et al found muscle test grades from Zero to Normal assigned by 12 to 39 examiners in four different trials to be within one grade, although the testing method was controlled because the examiners were trained by the same instructor.1 Iddings et al also found manual muscle testing to be reliable among 10 examiners whose ratings were within one grade in 90.6% of the trials.2 All of the subjects in both of these studies had the diagnosis of poliomyelitis, and the examiners were highly skilled in manual muscle testing.

Reliability of manual muscle tests has been most difficult to achieve for grades greater than Fair because of the examiner's subjective judgment of the amount of resistance applied during the test. One of the problems central to manual muscle testing is the variable "frame of reference" for making an assessment. Such subjective judgments include determining what is normal muscle strength for an individual given the person's age and size, in addition to the relative strengths of the tester and patient.3-6

Many other factors influence the reproducibility of a manual muscle test.
The testing method may vary among therapists (eg, Kendall and McCreary7 vs Daniels and Worthingham8), both because the therapists' training may have differed and because physical therapists tend to develop their own techniques and standards for grading muscle strength. Other variables that influence the accuracy of a muscle test are 1) the point and line of force application, 2) the magnitude of resistive force, 3) the speed of resistive force application, 4) the duration of the contraction, 5) the degree of cooperation from the patient, 6) fatigue, 7) various distracting influences, 8) the type of instructions given, 9) the tone of the therapist's voice, and 10) the amount of interaction between the therapist and patient.4,9-15

Beasley attempted to increase objectivity in manual muscle testing by developing a standardized scale of norms for muscle strength.16 Using an electronic myodynagraph, Beasley found a discrepancy between the percentage of Normal strength assigned in a manual muscle test and the percentage of strength found by a quantitative measure.16 The Good muscle strength group, usually rated at 75% of Normal in the manual muscle testing system,7 had only 43% of the Normal value on Beasley's standardized scale. The Fair group had a rating of only 9% of Normal, rather than the 50% of Normal usually assigned. The Poor group, ordinarily rated at 25% of Normal on the manual scale, had a rating of only 2.6% of Normal on the standardized scale.
The standard deviations showed considerable overlap in the percentage of Normal scores in grades below Fair, indicating poor differentiation in grades below Fair, the range in which manual muscle testing supposedly is more accurate.16

The purposes of this study were to develop a protocol to examine the reliability of manually testing muscle strength in a physical therapy department and to use that protocol to assess the interrater reliability for manually testing the middle trapezius and gluteus medius muscles. We chose the two muscles indicated 1) because we wanted to examine muscles from both the upper and lower extremities and 2) because the selected muscles are difficult to test owing to the stabilization required by other muscle groups during testing. In addition, the two muscles selected for study frequently are found to be weak in patients. The hypothesis was that a staff of physical therapists working together in a physical therapy department would demonstrate interrater reliability in testing the middle trapezius and gluteus medius muscles.

PHYSICAL THERAPY

METHOD

Subjects

One hundred ten patients, who were referred for physical therapy at St. Louis University Hospital, participated in the study. The patients had various musculoskeletal and neurological disorders including low back pain, degenerative joint disease, cervical pain, gunshot wound, chondromalacia, rheumatoid arthritis, and connective tissue disease. The patients had to exhibit sufficient range of motion to allow the body part to be placed in the test position and either pain-free motion or pain that did not interfere with the muscle test. The test group consisted of 50 female and 60 male subjects, aged 15 to 76 years, with a mean age of 41 years (± 15 years).

Examiners

Eleven staff physical therapists at St. Louis University Hospital served as the examiners. All examiners were graduates of accredited university programs.
Seven were graduates of the same university, 2 others graduated from another university, and the remaining 2 therapists graduated from two other schools. The mean number of years of experience of the staff members was 2.3 ± 1.2 years. Eight of the therapists preferred the Kendall and McCreary muscle testing technique,7 2 preferred that of Daniels and Worthingham,8 and 1 used both.

Each therapist received a work sheet with 10 spaces for 10 different patients. Next to each space was the name of the therapist with whom the examiner had been paired randomly for that particular patient. A different therapist's name appeared in each space so every examiner was paired with 1 of 10 different therapists. Each therapist also received a second work sheet with 10 spaces to be used for recording muscle grades of another therapist's patient when her name appeared on that therapist's list for that patient. Each examiner then selected 10 patients to be included in the study. The Appendix gives the muscle testing scale and definitions that all of the therapists used.

Volume 67 / Number 7, July 1987

Testing Procedures

Manual muscle testing was performed during the patient's daily treatment session. A rest period of at least three minutes was allowed between the two examiners' tests, and the two therapists kept their results confidential. The examiners used a "break" test, and for the gluteus medius muscle test, the patient's hip was placed in as much extension as possible.

Testing Sequence

The testing sequence involved the following steps:
1. The examiner first identified a patient suitable for the study.
2. The examiner performed the middle trapezius and gluteus medius muscle tests bilaterally. The side and muscle to be tested first were assigned randomly before the beginning of the test phase. The examiner used her accustomed technique of muscle testing to determine the appropriate grade and repeated the test several times, if needed, to assign a grade.
She then recorded the grades in the appropriate space on her work sheet.
3. A second therapist, who had been paired randomly with her for that patient, then performed the same two muscle tests in the same order, but using her own testing technique. The second therapist also repeated the test several times, if necessary, to determine a grade. She then recorded the grades on her work sheet.

Data Analysis

Cohen's weighted Kappa (Kw) determination17 was used as an index of agreement for interrater reliability. This index weights disagreements according to their magnitude. A weight matrix (Tab. 1) was designed for the scores obtained in the study and gave the ratio-scaled degrees of disagreement assigned to each cell. Each cell in the matrix represents one pairing of the two examiners' scores. For example, the cell for Normal-Normal had a weight value of 1.0, the cell for Good-Normal was 0.7, and the cell for Poor minus-Normal was 0.0.

To determine whether eliminating the pluses and minuses would improve the reliability coefficient, we compressed the original scores into a five-point scale. Pluses and minuses were assigned the same score as the main grade (eg, Fair plus and Fair minus became Fair), and a weight matrix was designed for these scores.

The muscle test scores of every patient whom Therapist 1 examined were compared with the scores of each of the other 10 therapists with whom she was paired. An interrater reliability coefficient then was computed for Therapist 1. This process was repeated for each therapist so that an interrater reliability coefficient was computed for all 11 examiners. By doing so, we wanted to determine whether any particular therapist appeared to be less reliable compared with the other 10, and whether the school the therapist graduated from or her years of experience were factors affecting reliability.
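The weighted Kappa calculation described under Data Analysis can be sketched as follows. This is a minimal illustration, not the authors' analysis: it assumes the agreement weights of Table 1 (1.0 on the diagonal, falling by 0.1 for each third of a grade), the function and variable names are hypothetical, and the four rating pairs are invented for the example rather than taken from the study data.

```python
# Illustrative sketch only. Grade labels and weights follow Table 1;
# the example ratings at the bottom are hypothetical, not study data.
GRADES = ["P-", "P", "P+", "F-", "F", "F+", "G-", "G", "G+", "N-", "N"]


def agreement_weight(i, j, n_grades=len(GRADES)):
    """1.0 for identical grades, falling linearly to 0.0 at the largest difference."""
    return 1.0 - abs(i - j) / (n_grades - 1)


def weighted_kappa(pairs, grades=GRADES):
    """Weighted Kappa for a list of (examiner 1 grade, examiner 2 grade) pairs."""
    if not pairs:
        raise ValueError("at least one pair of ratings is required")
    k = len(grades)
    index = {grade: i for i, grade in enumerate(grades)}
    n = len(pairs)

    # Cross-tabulate the two examiners' grades.
    counts = [[0] * k for _ in range(k)]
    for first, second in pairs:
        counts[index[first]][index[second]] += 1

    # Marginal proportions for each examiner.
    row = [sum(counts[i]) / n for i in range(k)]
    col = [sum(counts[i][j] for i in range(k)) / n for j in range(k)]

    # Weighted observed agreement and weighted chance-expected agreement.
    observed = sum(agreement_weight(i, j, k) * counts[i][j] / n
                   for i in range(k) for j in range(k))
    expected = sum(agreement_weight(i, j, k) * row[i] * col[j]
                   for i in range(k) for j in range(k))
    if expected == 1.0:
        raise ValueError("Kappa is undefined when all ratings fall in one grade")
    return (observed - expected) / (1.0 - expected)


# Hypothetical ratings of four patients by two examiners.
hypothetical_pairs = [("N", "N"), ("G", "N"), ("F", "F"), ("N", "G")]
kappa_w = weighted_kappa(hypothetical_pairs)
```

Because the weights express agreement rather than disagreement, a pair such as Good-Normal still earns partial credit (0.7), and the coefficient equals 1.0 only when every pair of ratings is identical.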
RESULTS

Table 2 summarizes the percentages of subjects on whom the examiners agreed completely, in addition to the percentages of agreement within several ranges of disparity (ie, the fractions of a grade by which the two ratings differed). The percentage of subjects on whom the same grade was obtained by two examiners ranged from 28% to 47% for the four muscles, and for 89% to 92% of the subjects we found either complete agreement or agreement within one grade. The percentage of patients who were rated Fair plus or above by one or both examiners was 88% for the right middle trapezius muscle, 90% for the left middle trapezius muscle, 91% for the right gluteus medius muscle, and 95% for the left gluteus medius muscle. One or both examiners assigned a grade of Normal in 50% of the tests for the right middle trapezius muscle, in 44% of the tests for the left middle trapezius muscle, in 67% of the tests for the right gluteus medius muscle, and in 70% of the tests for the left gluteus medius muscle.

Table 3 gives the interrater reliability coefficients for both the original and the compressed muscle testing scores. The reliability for the original scores was low, ranging from .11 to .58. Compressing the scores did not change the interrater reliability coefficient appreciably (.26-.42). Even for grades below Fair, we found poor interrater reliability.

Table 4 summarizes the results of comparing each of the examiners with every other examiner for each test. Reliability coefficients ranged from .04 to .66 with no pattern of high reliability being established by any one therapist. Those therapists with more clinical experience did not demonstrate any greater level of reliability than those who had graduated more recently. The school from which the therapist graduated did not appear to affect reliability because those therapists who graduated from the same university did not demonstrate any greater reliability among each other than the therapists who graduated from different schools.
Therapist 3 demonstrated low reliability coefficients on all four tests (.08-.19).

DISCUSSION

Using Cohen's weighted Kappa determination, we found interrater reliability for manually testing the strength of middle trapezius and gluteus medius muscles in a clinical setting to be poor. When the results were expressed as percentages of agreement, however, they were similar to the findings of Lilienfeld et al1 and Iddings et al,2 who reported good reliability within one grade among experienced examiners (more experienced than those in our study). Our rate of complete agreement (28%-47%) did not agree with the findings of Williams,10 who found that two examiners agreed completely between 60% and 75% of the time. The examiners in our study agreed more frequently on the gluteus medius muscle tests than on the middle trapezius muscle tests for reasons we could not determine.

We also found poor interrater reliability in grades below Fair, which agrees with Beasley's16 finding of poor differentiation in grades below Fair. Compressing the scores by eliminating pluses and minuses did not appreciably change the interrater reliability coefficients. The coefficient for the right middle trapezius muscle decreased, possibly because the interval between adjacent plus and minus grades widened when the scores were compressed (eg, Fair plus-Good minus was changed to Fair-Good).
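The widening of intervals under compression can be made concrete with a short sketch. It is illustrative only: the helper names are hypothetical, the list of main grades is an assumption limited to the grades that appear in Table 1 (the article does not list the categories of the compressed scale), and the rating pairs are invented.

```python
# Illustrative sketch only. GRADES follows Table 1; MAIN_GRADES is an
# assumption covering just the main grades that appear in Table 1.
GRADES = ["P-", "P", "P+", "F-", "F", "F+", "G-", "G", "G+", "N-", "N"]
MAIN_GRADES = ["P", "F", "G", "N"]


def thirds_apart(a, b):
    """Difference between two original scores, counted in thirds of a grade."""
    return abs(GRADES.index(a) - GRADES.index(b))


def compress(grade):
    """Drop the plus or minus so that, eg, Fair plus and Fair minus become Fair."""
    return grade.rstrip("+-")


def main_grades_apart(a, b):
    """Difference between two scores after compression, in whole grades."""
    return abs(MAIN_GRADES.index(compress(a)) - MAIN_GRADES.index(compress(b)))


def share_within(pairs, max_thirds):
    """Fraction of rating pairs that differ by at most max_thirds thirds of a grade."""
    close = sum(1 for a, b in pairs if thirds_apart(a, b) <= max_thirds)
    return close / len(pairs)


# Fair plus vs Good minus: one third of a grade apart before compression,
# one full grade apart (Fair vs Good) after compression.
before = thirds_apart("F+", "G-")
after = main_grades_apart("F+", "G-")

# Fair minus vs Fair plus: two thirds apart before, identical after.
before_same_main = thirds_apart("F-", "F+")
after_same_main = main_grades_apart("F-", "F+")
```

Compression therefore cuts both ways: some near-agreements across a grade boundary become full-grade disagreements, while disagreements within a main grade disappear. The `share_within` helper corresponds to the kind of tally reported in Table 2 (for example, `share_within(pairs, 1)` gives the share of pairs with the same grade or within one third of a grade).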
TABLE 1
Weight Matrix for Original Scores (a)

Muscle Test Scores     Muscle Test Scores for Examiner 2
for Examiner 1         P-    P     P+    F-    F     F+    G-    G     G+    N-    N
P-                     1.0   0.9   0.8   0.7   0.6   0.5   0.4   0.3   0.2   0.1   0.0
P                      0.9   1.0   0.9   0.8   0.7   0.6   0.5   0.4   0.3   0.2   0.1
P+                     0.8   0.9   1.0   0.9   0.8   0.7   0.6   0.5   0.4   0.3   0.2
F-                     0.7   0.8   0.9   1.0   0.9   0.8   0.7   0.6   0.5   0.4   0.3
F                      0.6   0.7   0.8   0.9   1.0   0.9   0.8   0.7   0.6   0.5   0.4
F+                     0.5   0.6   0.7   0.8   0.9   1.0   0.9   0.8   0.7   0.6   0.5
G-                     0.4   0.5   0.6   0.7   0.8   0.9   1.0   0.9   0.8   0.7   0.6
G                      0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0   0.9   0.8   0.7
G+                     0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0   0.9   0.8
N-                     0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0   0.9
N                      0.0   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0

(a) Eleven possible scores ranging from P- to N.

TABLE 2
Percentage of Agreement Among Scores for Subjects (a)

                                  Muscles (b)
                                  RMT          LMT          RGM          LGM
Grade                             n     %      n     %      n     %      n     %
Same grade                        31    28     32    29     52    47     50    45
1/3 grade apart                   24    22     27    25     11    10     17    15
2/3 grade apart                   19    17     23    21     24    22     15    14
1 grade apart                     25    23     15    14     14    13     16    15
1 1/3 grades apart                 6     5      8     7      4     4      7     6
1 2/3 grades apart                 5     5      4     4      1     1      5     5
Within 1 grade apart              68    62     65    60     49    45     48    44
Same grade or within 1 grade      99    90     97    89    101    92     98    89

(a) Each grade was divided into thirds with the use of pluses and minuses; therefore, the difference between 2 and 2+ was considered 1/3, the difference between 2- and 2+ was 2/3, and the difference between 2 and 3 was one grade.
(b) RMT = right middle trapezius, LMT = left middle trapezius, RGM = right gluteus medius, LGM = left gluteus medius.

TABLE 3
Interrater Reliability of Original and Compressed Scores

                       Muscles (a)
                       RMT       LMT     RGM     LGM
Conditions     N       Kw (b)    Kw      Kw      Kw
Original       110     .58       .29     .25     .11
Compressed     110     .26       .26     .30     .42

(a) RMT = right middle trapezius, LMT = left middle trapezius, RGM = right gluteus medius, LGM = left gluteus medius.
(b) Kw = weighted Kappa coefficient.

The distribution of the scores might have affected the reliability or agreement coefficient.
Because the majority of the subjects' scores were Fair plus or greater for all of the muscles (88%-95%), the scores were not well distributed across all possible muscle grades. This skewed distribution might have reduced spuriously the magnitude of the Kappa coefficient. A broader range of scores should improve the chances of demonstrating an accurate measure of agreement. Because we established the criterion of pain not interfering with the muscle test, some of the weaker patients may have been excluded from the study.

One procedural problem that could have affected our results was the difficulty of positioning some of the patients for a particular test. Different therapists adjusted the procedure differently to solve the problem.

TABLE 4
Interrater Reliability Among Therapists

               Muscles (a)
               RMT       LMT     RGM     LGM
Therapist      Kw (b)    Kw      Kw      Kw
1              .22       .15     .31     .58
2              .21       .52     .34     .55
3              .19       .16     .08     .13
4              .06       .33     .52     .44
5              .42       .25     .48     .38
6              .28       .30     .26     .34
7              .42       .63     .50     .29
8              .15       .47     .66     .37
9              .37       .46     .25     .29
10             .14       .28     .56     .49
11             .62       .04     .20     .11

(a) RMT = right middle trapezius, LMT = left middle trapezius, RGM = right gluteus medius, LGM = left gluteus medius.
(b) Kw = weighted Kappa coefficient.
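The possible effect of a skewed score distribution on the Kappa coefficient, raised in the Discussion, can be illustrated with a deliberately simplified sketch. It assumes an unweighted Kappa and only two rating categories, and the counts in both tables are hypothetical, chosen so that the examiners agree on 80% of patients in each case. It is not a reanalysis of the study data.

```python
# Illustrative sketch only: unweighted Kappa on two hypothetical 2 x 2 tables
# with the same observed agreement (80%) but different score distributions.

def cohen_kappa(table):
    """Unweighted Cohen's Kappa from a square table of rating counts."""
    k = len(table)
    n = sum(sum(row) for row in table)
    observed = sum(table[i][i] for i in range(k)) / n
    row = [sum(table[i]) / n for i in range(k)]
    col = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    expected = sum(row[i] * col[i] for i in range(k))
    return (observed - expected) / (1.0 - expected)


# Rows are examiner 1, columns are examiner 2 (eg, "strong" vs "weak").
balanced = [[40, 10],
            [10, 40]]   # ratings spread evenly over both categories
skewed = [[80, 10],
          [10, 0]]      # nearly all ratings fall in the first category

kappa_balanced = cohen_kappa(balanced)
kappa_skewed = cohen_kappa(skewed)
```

With the balanced table the chance-expected agreement is 50%, so 80% observed agreement yields a moderate coefficient. With the skewed table the chance-expected agreement rises to 82%, so the same 80% observed agreement yields a coefficient slightly below zero. This is the sense in which a concentration of scores at Fair plus or above may depress Kappa even when raw agreement looks acceptable.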
APPENDIX
Muscle Testing Scale and Definitions (a)

Normal (5): able to hold the test position against gravity and maximum resistance, or to move the part into the test position and hold against gravity and maximum pressure
Normal minus (5-): same as for Normal except slightly less resistance can be given
Good plus (4+): same as for Good but slightly more resistance can be given
Good (4): same as for Normal except able to hold against moderate resistance
Good minus (4-): same as for Good but slightly less resistance can be given
Fair plus (3+): able to hold the test position against gravity, or to move the part into the test position and hold against gravity and slight resistance
Fair (3): able to hold the test position against gravity, or to move the part into the test position and hold against gravity
Fair minus (3-): able to release gradually from the test position against gravity, or to move the part toward the test position against gravity almost through full range
Poor plus (2+): able to move the part through full range with gravity eliminated, but against slight resistance
Poor (2): able to move the part through full range with gravity eliminated
Poor minus (2-): able to move the part through partial range with gravity eliminated
Trace (1): muscle contraction can be palpated
Zero (0): no contraction can be elicited

(a) Adapted from Kendall and McCreary7 and Daniels and Worthingham.8

The patients' age did not appear to be a factor in the low interrater reliability coefficients because the scores for the youngest and the oldest subjects in the study were not consistently any farther apart than those of the subjects in the middle age range.

Achieving reliability within one grade, as in this study, has questionable clinical value, especially when considering the differences between Poor and Fair, or Fair and Good, versus the difference between Good and Normal.
The interval between each of these pairs of grades is one grade, although the therapists' subjective judgments of patient function may have been quite different. The accuracy of assessments of patient progress or deterioration, therefore, would be questionable despite reliability within one grade.

Manual muscle testing is an inexpensive, relatively quick, and convenient method for assessing a patient's muscle strength. In view of the results of this study, however, physical therapists should consider supplementing their manual muscle test scores with isokinetic testing, dynamometry, or tensiometry. Griffin et al compared the results of manual muscle testing with isokinetic testing for knee extensor muscles in patients with neuromuscular disease and found that a lack of strength improvement or a decrease in strength was demonstrated by both manual muscle testing and isokinetic testing.18 They also found, however, that in patients with a manual muscle test score of 9 to 10 (Normal minus-Normal), isokinetic testing revealed either muscle strength deficits or improvement not detectable with manual muscle testing methods. They concluded that isokinetic testing adds valuable information when patients have manual muscle test scores of Normal.

Bohannon found a significant correlation between manual muscle test scores and dynamometer test scores for knee extensor muscles, which indicated that both testing methods measure muscle strength similarly.19 He found a significant difference, however, between theoretical percentage manual muscle test scores and calculated dynamometer percentage test scores, which indicated that theoretical percentage scores based on manual muscle testing are likely to overestimate a patient's muscle strength. Supplementing manual muscle test scores with isokinetic testing, dynamometry, or tensiometry would decrease the subjectivity in assessing a patient's disability.
Further study is needed in this area with each therapist being paired more than twice with another therapist. One potential study might incorporate several staff in-service training sessions before the start of testing to help standardize the muscle testing techniques among the staff members as much as possible. Reliability then could be reassessed to determine whether any improvement is noted. Garraway et al increased the proportion of stroke assessment examinations (which included motor function) on which total agreement was reached from 41% to 68% after the examiners standardized definitions, discussed and interpreted the instructions, and practiced.20

CONCLUSIONS

The results of this study do not support the research hypothesis that staff physical therapists can perform manual muscle tests reliably in a clinical setting. The results do demonstrate that the therapists are reliable within one grade; however, this degree of reproducibility may not be adequate for making clinical judgments. Supplementing muscle test scores with isokinetic testing, dynamometry, or tensiometry is suggested. The development of a standardized method of muscle testing is needed so that different examiners can obtain comparable results in a clinical setting. Standardizing the resistance given in grades of Good and Normal so that subjective judgment is minimized is an area in which further study is needed.

Acknowledgments. We thank the physical therapy staff of St. Louis University Hospital for their cooperation and Carolyn Heriza for her advice in planning the study.

REFERENCES

1. Lilienfeld AM, Jacobs M, Willis M: A study of the reproducibility of muscle testing and certain other aspects of muscle scoring. Phys Ther Rev 34:279-289, 1954
2. Iddings DM, Smith LK, Spencer WA: Muscle testing: Part 2. Reliability in clinical use. Phys Ther Rev 41:249-256, 1961
3. Molnar GE, Alexander J, Grutfield N: Reliability of quantitative strength measurements in children. Arch Phys Med Rehabil 60:218-221, 1979
4. Editorial: The accuracy of the manual muscle test. Arch Phys Med Rehabil 35:515-517, 1954
5. Bechtol CO: Grip test: The use of a dynamometer with adjustable hand spacing. J Bone Joint Surg [Am] 36:820-824, 1954
6. Nicholas JA, Sapega A, Kraus H, et al: Factors influencing manual muscle tests in physical therapy. J Bone Joint Surg [Am] 60:186-190, 1978
7. Kendall FP, McCreary EK: Muscles: Testing and Function, ed 3. Baltimore, MD, Williams & Wilkins, 1983
8. Daniels L, Worthingham C: Muscle Testing: Techniques of Manual Examination, ed 4. Philadelphia, PA, W B Saunders Co, 1980
9. Smidt GL, Rogers MW: Factors contributing to the regulation and clinical assessment of muscular strength. Phys Ther 62:1283-1290, 1982
10. Williams M: Manual muscle testing: Development and current use. Phys Ther Rev 36:797-805, 1956
11. Wintz MN: Variations in current manual muscle testing. Phys Ther Rev 39:466-475, 1959
12. Johannson CA, Kent BE, Shepard KF: Relationship between verbal command volume and magnitude of muscle contraction. Phys Ther 63:1260-1265, 1983
13. Westers BM: Factors influencing strength testing and exercise prescription. Physiotherapy 68:42-44, 1982
14. Gonnella C, Harmon G, Jacobs M: The role of the physical therapist in the gamma globulin poliomyelitis prevention study. Phys Ther Rev 33:337-345, 1953
15. Trombly CA: Occupational Therapy for Physical Dysfunction, ed 2. Baltimore, MD, Williams & Wilkins, 1982, pp 173-229
16. Beasley WC: Quantitative muscle testing: Principles and applications to research and clinical services. Arch Phys Med Rehabil 42:398-425, 1961
17. Cohen J: Weighted Kappa: Nominal scale agreement and provision for scaled disagreement or partial credit. Psychol Bull 70:213-220, 1968
18. Griffin JW, McClure MH, Bertorini TE: Sequential isokinetic and manual muscle testing in patients with neuromuscular disease. Phys Ther 66:32-35, 1986
19. Bohannon RW: Manual muscle test scores and dynamometer test scores of knee extension strength. Arch Phys Med Rehabil 67:390-392, 1986
20. Garraway WM, Akhtar AJ, Gore SM, et al: Observer variation in the clinical assessment of stroke. Age Ageing 5:233-240, 1976