Appendix G
ASSESSING CLINICAL SIGNIFICANCE STATISTICALLY
Appendix G May 2005 Evidence-Based 612

Appendix Outline:
INTRODUCTION
STATISTICALLY RELIABLE CHANGE
NORMATIVE COMPARISONS
STATISTICALLY RELIABLE CHANGE PLUS RECOVERY

INTRODUCTION

As promised in Chapter 16, this appendix will provide some additional statistical information to help you include a statistical approach for assessing clinical significance when writing proposals or interpreting findings of completed studies. You may recall that clinical significance refers not only to the meaningfulness and practical value of the overall findings of a study, but also to the meaningfulness and practical value of the benefits of an intervention for each individual recipient of the evaluated intervention. Significant differences between group means, for example, don’t tell us how many individual clients in each group experienced clinically significant improvements.

Three statistical approaches have been proposed for measuring clinical significance among individual clients: 1) statistically reliable change; 2) normative comparisons; and 3) statistically reliable change plus recovery. Let’s begin with the statistically reliable change approach.

STATISTICALLY RELIABLE CHANGE

Ogles, Lunnen, and Bonesteel (2001) provide an overview of the different statistical methods that have been developed using the statistically reliable change approach. Each method applies a formula separately to every client to assess whether their amount of change from pretest to posttest can be attributed to measurement error. Consequently, this approach can be applied only when both pretests and posttests are used. Other limitations of this approach are discussed in Chapter 16. At this point we will examine the most frequently used statistical formula for assessing the reliability of individual change scores: the Jacobson and Truax (1991) Reliable Change Index (RCI).
The formula is as follows:

\[ \mathrm{RCI} = \frac{X_{post} - X_{pre}}{S_{diff}} \]

where
\(X_{pre}\) = the pretest score
\(X_{post}\) = the posttest score
\(S_{diff}\) = the standard error of the difference between the two test scores

To calculate the standard error of the difference between the two test scores, we apply the denominator from the t-test formula, which we first encountered in Chapter 12. Here it is again:

\[ S_{diff} = \left( \frac{N_1 s_1^2 + N_2 s_2^2}{N_1 + N_2 - 2} \right) \left( \frac{N_1 + N_2}{N_1 N_2} \right) \]

where
\(s_1^2\) is the variance of the pretest scores (the square of the standard deviation of the pretest scores; see Chapter 6 to be reminded about calculating the standard deviation).
\(s_2^2\) is the variance of the posttest scores (the squared standard deviation of the posttest scores).
\(N_1\) is the size of the first sample.
\(N_2\) is the size of the second sample.

Let’s calculate an RCI for each of two experimental group clients in an imaginary pretest-posttest experiment that found statistically significant results on a scale measuring well-being. Let’s assume that the experimental group and control group each had a sample size of 50 and that the pretest and posttest standard deviations were both 10. Next, let’s imagine that one client, Thelma, improved from a pretest score of 40 to a posttest score of 50. Thelma’s RCI would be as follows:

\[
\begin{aligned}
\mathrm{RCI} &= \frac{50 - 40}{\left( \dfrac{50 \times 10^2 + 50 \times 10^2}{50 + 50 - 2} \right) \left( \dfrac{50 + 50}{50 \times 50} \right)} \\
&= \frac{10}{\left( \dfrac{50 \times 100 + 50 \times 100}{50 + 50 - 2} \right) \left( \dfrac{100}{2500} \right)} \\
&= \frac{10}{\left( \dfrac{5000 + 5000}{50 + 50 - 2} \right) \left( \dfrac{1}{25} \right)} \\
&= \frac{10}{\left( \dfrac{10000}{98} \right) \left( \dfrac{1}{25} \right)} \\
&= \frac{10}{(102)(.04)} \\
&= \frac{10}{4.08} = 2.45
\end{aligned}
\]

Thus, Thelma’s RCI is 2.45.
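The hand calculation for Thelma can be checked with a few lines of Python. This is a minimal sketch, not a general-purpose routine: the function names `s_diff` and `rci` are illustrative, and the code mirrors the arithmetic exactly as worked above (the two terms of the denominator are multiplied and no square root is taken).

```python
def s_diff(n1, sd1, n2, sd2):
    """Denominator of the RCI, following the worked arithmetic in this appendix.

    Assumption: the two terms are multiplied as shown in the hand
    calculation, with no square root applied.
    """
    pooled = (n1 * sd1**2 + n2 * sd2**2) / (n1 + n2 - 2)
    return pooled * (n1 + n2) / (n1 * n2)


def rci(pre, post, denominator):
    """Reliable Change Index: posttest minus pretest, divided by S_diff."""
    return (post - pre) / denominator


# Imaginary study: N = 50 per sample, standard deviations of 10 at both times.
denominator = s_diff(50, 10, 50, 10)

# Thelma improved from 40 to 50.
thelma_rci = rci(40, 50, denominator)

print(round(denominator, 2))  # 4.08
print(round(thelma_rci, 2))   # 2.45
```

Because the denominator depends only on the sample sizes and standard deviations, it is computed once and reused for every client in the study.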
Let’s assume that a second imaginary client, Louise, improved from a pretest score of 40 to a posttest score of 46. In calculating her RCI, the denominator will be the same, but the numerator will be 6 (because 46 minus 40 equals 6). Thus, Louise’s RCI would be 6/4.08 = 1.47.

So, what do these two RCIs mean? To answer that question, we should recall that the standard error represents one standard deviation in a normal distribution. We should also recall (from Chapter 7) that only 5 percent of the normal curve falls at least 1.96 standard deviations above or below the mean. Therefore, an RCI of at least 1.96 is needed to rule out the plausibility of measurement error as an explanation for the improvement and thus deem the amount of change to be statistically reliable. Using this approach to clinical significance, each client who achieves an RCI of at least 1.96 is considered to have made a clinically significant amount of improvement. Thus, Thelma, with her RCI of 2.45, made clinically significant improvement, but Louise, with her RCI of 1.47, fell short of clinical significance. When reporting the results for an entire sample, we would indicate the proportion of clients in each group who made a clinically significant amount of improvement based on their RCI scores.

NORMATIVE COMPARISONS

As mentioned in Chapter 16, the second statistical approach to clinical significance emphasizes recovery and the use of normative comparisons. A nonstatistical application of this approach is to conduct diagnostic interviews before and after treatment to see if clients who meet the criteria for a particular disorder before treatment no longer meet those criteria after treatment. A statistical application suggested by Jacobson, Follette, and Revenstorf (1984) requires the use of a standardized measurement instrument that has had norms established for populations with and without a particular disorder.
Clinical significance would be attained by clients who improve from pretest scores that are closer to the mean of the clinical population (those with the disorder) to posttest scores that are closer to the mean of the normal population than to the mean of the clinical population.

Suppose previous large-scale studies have established that the mean on a depression scale is 70 for people in treatment for depression, with a standard deviation of 10, while the mean and standard deviation for people not in treatment for depression are 40 and 10. Figure 1 displays the two distributions. A person who scores 50 on the scale is one standard deviation above the normal population mean, but two standard deviations below the clinical population mean. Thus, that person is more likely to be part of the normal population than the clinical one. Conversely, a person who scores 60 on the scale is one standard deviation below the clinical population mean, but two standard deviations above the normal population mean. Thus, that person is more likely to be part of the clinical population than the normal one. What about a person who scores 55? That person has an equal probability of belonging to either population, because they are at a midpoint of 1.5 standard deviations away from either mean.

One criterion for clinical significance using this approach, therefore, would be for the client’s score to move from the clinical side of that midpoint at pretest to the normal side of it at posttest. Then they would have moved from a pretest score indicating that they were more likely to belong to the clinical population to a posttest score indicating that they are more likely to belong to the normal population.

Alternative criteria can be used, though you may deem them too lenient or too stringent. One that may be too stringent is to require a posttest score that is at least two standard deviations better than the clinical mean, which, for our imaginary study, would be 50 or lower in Figure 1.
On the lenient side, we may require a posttest score that is only one standard deviation better than the clinical mean (a score of 60 or lower in Figure 1), since we know that only 16 percent of the normal curve is that much better than the mean (as discussed in Chapter 7). Although there is no firm rule about which cutoff point to use, the midpoint (a score of 55 in Figure 1) may be the best compromise: it is neither too lenient nor too stringent, and the decision to use either the more stringent or the more lenient alternative is somewhat arbitrary.

The normative approach can be used with group means as well as with individual client scores. When means are used instead of individual scores, we would apply the chosen cutoff alternative to the posttest mean of the experimental group receiving the intervention rather than to individual client scores. When individual scores are used instead of the group mean, we would (as with the RCI) report the proportion of clients who improved past the chosen cutoff point for clinical significance.

Insert Figure 1 about here.

STATISTICALLY RELIABLE CHANGE PLUS RECOVERY

This third approach is recommended as the most likely to impress favorably those proposal reviewers who want to see a statistical approach to clinical significance. To have clinically significant improvement, clients would have to meet two criteria. First, they would need an RCI (from pretest to posttest) of at least 1.96. Second, their posttest score would have to pass the selected normative comparison cutoff point, such as being on the normal side of the midpoint between the two distributions. Thus, to use this approach, you simply apply the statistical procedures discussed above both for statistically reliable change and for recovery using normative comparisons. Then you can report the proportion of clients who meet both criteria.
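The two-criterion check can be sketched in Python as follows. This is an illustration under stated assumptions, not a prescribed procedure: the client scores are invented, the names `meets_criteria` and `summarize` are hypothetical, the scale is one on which lower scores are better (as with the depression scale), and the denominator of 4.08 is simply borrowed from the earlier worked example as a placeholder. The per-criterion proportions are tallied alongside the combined one, each counted regardless of whether the other criterion is met.

```python
RCI_THRESHOLD = 1.96  # minimum RCI for statistically reliable change


def meets_criteria(pre, post, s_diff, cutoff):
    """Return (reliable, recovered) for one client on a LOWER-is-better scale."""
    # Improvement is a drop in score, so the sign is flipped relative to
    # the (post - pre) numerator used for a higher-is-better scale.
    rci = (pre - post) / s_diff
    reliable = rci >= RCI_THRESHOLD
    recovered = post < cutoff  # posttest on the normal side of the cutoff
    return reliable, recovered


def summarize(clients, s_diff, cutoff):
    """Proportion of clients meeting each criterion and both criteria."""
    flags = [meets_criteria(pre, post, s_diff, cutoff) for pre, post in clients]
    n = len(flags)
    return {
        "reliable_change": sum(rel for rel, _ in flags) / n,
        "recovery": sum(rec for _, rec in flags) / n,
        "both": sum(rel and rec for rel, rec in flags) / n,
    }


# Hypothetical (pretest, posttest) scores for four clients.
clients = [(70, 50), (72, 60), (58, 52), (68, 66)]

# Placeholder denominator of 4.08 and the midpoint cutoff of 55.
result = summarize(clients, s_diff=4.08, cutoff=55)
print(result)
```

In this invented sample only the first client shows both a reliable change and a posttest score on the normal side of the midpoint, which illustrates how much more demanding the combined criterion is than either criterion alone.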
You could also report the proportion meeting the statistically reliable change criterion only and the proportion meeting the recovery criterion only. Reporting all three of these proportions enables readers to use the criterion they might prefer (if any) in judging the clinical significance of your findings.

If you are dubious about using any of the foregoing three statistical methods to judge clinical significance, you have lots of company. Each is controversial and makes some problematic assumptions, as discussed in Chapter 16. These assumptions are particularly problematic in many practice evaluations. For some evaluation studies, none of the approaches may be applicable. Although the third approach, which combines the first two, may be the one most preferred by some reviewers, requiring both statistically reliable change and recovery is the most stringent criterion and is least likely to be applicable to evaluations in social work and other human services. Consequently, if you use this approach (and even if you use one of the less stringent approaches), you should let your readers know that you are aware of its controversies and limitations. Some reviewers may not favor your chosen approach, and some may reject the entire notion of using any of these statistical approaches to clinical significance. So you don’t want to appear unaware of those issues or unlikely to express appropriate cautions when reporting the results of the approach you choose.

Figure 1. Hypothetical Distributions of Depression Scale Scores for a Normal and a Clinical Population

[Figure 1 shows the Normal Population and Clinical Population curves over an axis marked 40, 50, 55, 60, and 70, with the labels “Clinically Significant Post-Test Scores (more likely to belong to normal population than to clinical population)” and “Clinically Significant Improvement.”]
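The three cutoff scores discussed for Figure 1 can be reproduced with a short Python sketch. The function name `normative_cutoffs` is hypothetical, and the code assumes a scale on which lower scores are better. The midpoint is computed as the score lying the same number of standard deviations from each mean; weighting by the standard deviations in this way is an assumption that goes beyond the equal-variance case shown in the figure, where it reduces to the simple midpoint.

```python
def normative_cutoffs(clinical_mean, clinical_sd, normal_mean, normal_sd):
    """Candidate cutoff scores for a LOWER-is-better scale.

    stringent: two clinical standard deviations better than the clinical mean
    midpoint:  equally many standard deviations from each population mean
    lenient:   one clinical standard deviation better than the clinical mean
    """
    midpoint = (clinical_sd * normal_mean + normal_sd * clinical_mean) / (
        clinical_sd + normal_sd
    )
    return {
        "stringent": clinical_mean - 2 * clinical_sd,
        "midpoint": midpoint,
        "lenient": clinical_mean - clinical_sd,
    }


# Values from the imaginary depression scale: clinical mean 70, normal
# mean 40, and a standard deviation of 10 in both populations.
cutoffs = normative_cutoffs(clinical_mean=70, clinical_sd=10,
                            normal_mean=40, normal_sd=10)
print(cutoffs)  # {'stringent': 50, 'midpoint': 55.0, 'lenient': 60}
```

Returning all three values at once makes it easy to report results under the chosen cutoff while showing readers how the proportions would shift under the stricter or looser alternatives.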