Cross-Battery Assessment (XBA) and Averaged Scores: Comments on the Schneider and McGrew Report

Dawn Flanagan, Samuel Ortiz, and Vincent Alfonso

General Comments

The report by Schneider and McGrew provides an analysis of some of the pitfalls and problems inherent in taking an uninformed and unsystematic approach to creating ability clusters or composites using averaging methods. As authors of the CHC Cross-Battery Approach (an approach originally developed by McGrew and Flanagan, 1998), we have ourselves examined and addressed these issues over the years, because XBA may, but does not always, necessitate the use of averaging. The report helps to clarify many of the issues that we have long recognized as important considerations in the measurement of cognitive abilities, so much so that we have made deliberate efforts to include cautions and guidance regarding the use of averaging when, and if, it becomes necessary for practitioners using XBA (Flanagan, Ortiz, & Alfonso, 2007). Thus, we welcome the report because it provides further substantiation for the XBA guidelines we already have in place and continue to promote in our presentations and trainings.

That said, in all fairness it should be noted that the "differences" reported for two-subtest broad and narrow CHC clusters between averaged and norm-based methods, both above and below the mean, are generally small and not clinically meaningful. Despite the warnings against the use of averaged scores in making critical, high-stakes decisions, the magnitude of the observed differences is small and unlikely to lead to the myriad misinterpretations Schneider and McGrew fear. In truth, the values derived from both methods (for a two-subtest cluster) are similar, especially in the "bright line" range that governs decisions about what may or may not be intact ability. Moreover, the report is presented in a manner that seems to suggest that making decisions regarding dysfunction with scores in this range is unique to scores derived from averaging methods. The fact is, making such decisions is an issue for any practitioner using any scores gathered with any test, irrespective of the manner in which they were derived. Deciding whether a standard score in the range of 85-89, for example, represents function or dysfunction is not made any easier simply because that value was obtained via norm-based rather than averaging methods. Indeed, what the report actually demonstrates is that the two methods produce marginally different results (for a two-test cluster, i.e., those used in XBA) when the values are well above or well below the mean. If the intent of the article is to note that averaging can be problematic, especially when conducted in an unreasoned and indiscriminate manner, we completely agree. However, when averaging is used in a systematic manner, such as the way it is specified in XBA (limited and constrained to situations where it is reasonable, such as in the absence of available norm-based clusters), it may provide useful information upon which defensible decisions may be based, especially when supported by ecological validity. Thus, the implications and clinical significance of the actual differences in values that emanate from the two methods become, in our opinion (and within the context of our approach), minimal and largely inconsequential.
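To make the scale of these differences concrete, the sketch below compares a simple arithmetic average with a composite built from the standardized sum of two subtest scores, for a hypothetical examinee who obtains the same standard score on both subtests. The standardized-sum formula is a standard psychometric device used here only as a rough stand-in for norm-based values (it is not a reproduction of Schneider and McGrew's procedure), and the intercorrelation of .54 is illustrative.

```python
# A minimal sketch: arithmetic average vs. a standardized-sum composite for a
# two-subtest cluster. Standard scores are assumed to have mean 100 and SD 15;
# the intercorrelation r = .54 is illustrative. This is not the Schneider and
# McGrew "cluster generator," only a standard approximation to norm-based values.

def average_cluster(ss1: float, ss2: float) -> float:
    """Arithmetic average of two standard scores (the simple XBA fallback)."""
    return (ss1 + ss2) / 2.0

def composite_cluster(ss1: float, ss2: float, r: float) -> float:
    """Composite from the standardized sum: the sum of two z-scores has variance
    2 + 2r, so dividing by its SD restores a metric with mean 100 and SD 15."""
    z_sum = ((ss1 - 100) + (ss2 - 100)) / 15.0
    return 100 + 15 * z_sum / (2 + 2 * r) ** 0.5

if __name__ == "__main__":
    r = 0.54  # illustrative intercorrelation between two subtests of the same ability
    for ss in (70, 80, 90, 100, 110, 120, 130):
        avg = average_cluster(ss, ss)
        comp = composite_cluster(ss, ss, r)
        print(f"both subtests = {ss}:  average = {avg:5.1f}  "
              f"composite = {comp:5.1f}  difference = {avg - comp:+4.1f}")
```

As the printout shows, the two values coincide at the mean and drift apart by only about four points even at two standard deviations from it, which is the order of difference we regard as clinically inconsequential for a two-subtest cluster.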
The report is also, in part, a statistical debate, and one in which the position taken by the authors is presented as if it were the irrefutable representation of reality. Thus, when use of a method to compute composite scores produces values that are not identical to those already established via a combinatorial probability method, it is automatically assumed that the other values are wrong. Use of the terms "real" when referring to norm-based clusters and "pseudo" when referring to averaged clusters seems intended to imply that "real" clusters equal actual reality and that "pseudo" clusters do not reflect reality. We are all cautioned to remember that any measure of any latent variable is just an estimate and that no method or procedure produces an error-free value, let alone the "true" value of the given attribute.

Evaluation of Major Issues

It would not be accurate to say that we disagree with the major issues that are raised. In fact, we support many of the contentions and have attempted to address the problems they identify. For example, one of the points made in the report involved the degree to which "global" composite scores created by averaging multiple ability scores may be inaccurate in comparison to those based on combinatorial probability methods (or a norm-based global composite). The relatively large differences between the scores produced by these two methods suggest that averaging multiple scores in this manner is difficult to support (the sketch at the end of this passage illustrates why). As such, it is not surprising that the XBA guidelines contain no recommendation for generating a global ability score based on averaging subtest scores. XBA provides no mechanism by which a global score can or should be created. It is our recommendation that when a global score is desired or necessary, it be obtained only from an existing standardized, norm-referenced test of intelligence or battery of cognitive abilities.

Also noteworthy is that the XBA approach does not provide a mechanism for the creation of any cluster, whether global, broad, or narrow, on the basis of more than three scores. In fact, in the overwhelming majority of XBA cases in which averages are calculated, they are based on two scores that very likely represent either the same broad or the same narrow CHC ability (and are therefore likely to be at least moderately correlated). If what we consider to be an interpretable, norm-based composite is available, then that is the score that should be reported and interpreted; this is a core guiding principle of XBA (Flanagan et al., 2007). Only when clusters or composites contain significant variability and one of the scores comprising the cluster is suggestive of a normative weakness do we recommend follow-up evaluation. Regarding follow-up procedures, the authors of XBA also recommend use of an additional test within the core battery, if available, to generate a norm-based composite (use of Dr. Schneider's automated "compositator" for the WJ III would be recommended in this situation). If such a test within the core battery is not available, practitioners must then necessarily cross batteries to obtain another measure of the narrow ability underlying the test that yielded a score suggestive of a normative weakness. Depending upon the score obtained on this additional subtest, a two-subtest cluster average (representing a narrow ability) may then be calculated. Averaging is not a central tenet of XBA.
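Although XBA never calls for it, the inaccuracy of averaging toward a global composite can be illustrated numerically. The sketch below, which uses assumed values rather than actual test data, shows that the gap between an arithmetic average and a properly standardized composite widens as more subtests are averaged, which is one reason the practice is hard to defend at the global level.

```python
# A rough illustration (assumed values, not actual norms) of why averaging many
# subtests into a "global" score drifts further from a norm-based composite than
# averaging two: the mean of k standard scores is pulled toward 100 more and more
# as k grows, while a standardized composite keeps its full spread.

def global_composite(scores, r_avg: float) -> float:
    """Composite from the standardized mean of k scores. The mean of k z-scores
    has variance (1 + (k - 1) * r_avg) / k, so dividing by that SD restores a
    metric with mean 100 and SD 15."""
    k = len(scores)
    z_mean = sum((s - 100) / 15.0 for s in scores) / k
    sd_of_mean = ((1 + (k - 1) * r_avg) / k) ** 0.5
    return 100 + 15 * z_mean / sd_of_mean

if __name__ == "__main__":
    r_avg = 0.50  # assumed average intercorrelation among the subtests
    for k in (2, 4, 7, 10):
        scores = [80] * k  # a uniformly low-average performer, for illustration
        avg = sum(scores) / k
        comp = global_composite(scores, r_avg)
        print(f"{k:2d} subtests at SS 80:  average = {avg:.0f}  composite = {comp:.1f}")
```

With these assumed values the two-test discrepancy is about three points, but by seven or more subtests it approaches seven points, which is why a global score should come from an actual norm-referenced battery rather than from averaging.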
It is only when a norm-based cluster is not available that XBA suggests the formation of a theory-based CHC cluster via the average of two subtests. The use of averaging in this sense is quite limited and is suggested only because there are no norms available to represent the data gathered during the course of examining an individual's pattern of cognitive strengths and weaknesses. That is, the field currently lacks instruments with sufficient breadth and depth of coverage of CHC narrow abilities. The alternative to averaging consistent subtest score performances from different batteries is, of course, to not average them and to simply discuss them separately. But declining to follow up on and test hypotheses about aberrant score performances in narrow ability domains simply because no norm-based clusters are available is not, in our opinion, a viable option, especially in cases where an understanding of an individual's learning difficulties is necessary.

At the lower end of the ability continuum, the Schneider and McGrew report indicates that arithmetic averages (of a two-test cluster) are slightly higher than norm-based clusters. We are comfortable with this fact. Because averages overestimate performance at the lower end of the ability continuum, they are less likely to result in an interpretation of deficiency for an individual who is not actually deficient. Also noteworthy is the fact that most averages fall within the standard error of measurement (SEM) of the norm-based cluster.

To illustrate, Figure 1 includes clusters that have been generated in different ways from two WISC-IV subtests (Block Design and Picture Completion), each of which has a reliability coefficient of .87 and an intercorrelation of .54 (for 11-year-olds). These two WISC-IV subtests combine to form a Visual Processing (Gv) cluster (actual norms for this cluster may be found in Flanagan & Kaufman, 2009). The Gv cluster in Figure 1 was generated in the following ways: (a) actual norms; (b) the formula suggested in the Schneider and McGrew report (i.e., the "cluster generator"); and (c) the arithmetic average. The figure shows that clusters derived by the formula closely align with actual norm-based clusters, as pointed out in the Schneider and McGrew report. Figure 1 shows further that arithmetic averages fall within the norm-based cluster's SEM across the entire ability range. The fact that the arithmetic average for Gv in Figure 1 consistently falls within the SEM of the norm-based Gv cluster suggests that the differences between these estimates of ability are not clinically meaningful. For example, when attempting to ascertain whether or not a student is deficient in Gv, will the practitioner's interpretation change if the score was 80 v. 76? 75 v. 70? 69 v. 62? Based on our work with students and our approach to understanding a student's strengths and weaknesses, our answer to this question is "no." These scores simply do not differ enough from one another to suggest that a score derived from one method reflects a deficiency while a score derived from the other does not. Moreover, we would not make a determination about deficiency in any cognitive area based on a single score (which includes averaged clusters and norm-based clusters). Note that it is highly likely that the only time averages and norm-based clusters may suggest a different interpretation (e.g., low average ability v. normative weakness, respectively) is in cases where the scores are in the 80s.
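For readers who wish to check the arithmetic behind Figure 1, the sketch below recomputes the comparison using the reliabilities (.87) and intercorrelation (.54) given above. Because the actual WISC-IV Gv norms are not reproduced here, the norm-based cluster is approximated by the standardized-sum composite, which the report indicates tracks the actual norms closely.

```python
# A sketch approximating the Figure 1 comparison for the WISC-IV Block Design +
# Picture Completion (Gv) cluster. The "formula" value is the standardized-sum
# composite, used as a stand-in for the actual norm-based cluster, which is not
# reproduced here. Reliabilities of .87 and an intercorrelation of .54 are the
# values reported in the text.

REL_1, REL_2, R_12 = 0.87, 0.87, 0.54

def composite_cluster(ss1: float, ss2: float, r: float) -> float:
    """Composite from the standardized sum of two standard scores (M = 100, SD = 15)."""
    z_sum = ((ss1 - 100) + (ss2 - 100)) / 15.0
    return 100 + 15 * z_sum / (2 + 2 * r) ** 0.5

def composite_sem(rel1: float, rel2: float, r12: float, sd: float = 15.0) -> float:
    """SEM of the two-test composite, via the standard composite-reliability formula
    rel_c = (rel1 + rel2 + 2 * r12) / (2 + 2 * r12)."""
    rel_c = (rel1 + rel2 + 2 * r12) / (2 + 2 * r12)
    return sd * (1 - rel_c) ** 0.5

if __name__ == "__main__":
    sem = composite_sem(REL_1, REL_2, R_12)  # roughly 4.4 points for these values
    print(f"composite SEM = {sem:.1f}")
    for ss in range(70, 131, 10):            # roughly +/- 2 SD, as in the figure
        avg = (ss + ss) / 2
        formula = composite_cluster(ss, ss, R_12)
        print(f"subtests {ss}/{ss}:  average = {avg:5.1f}  formula = {formula:5.1f}"
              f"  difference = {abs(avg - formula):3.1f}  within SEM: {abs(avg - formula) <= sem}")
```

Under these assumptions the composite SEM is about 4.4 points, and across the 70-130 range the average never departs from the formula value by more than that, which is the pattern Figure 1 displays.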
Clinical judgment, however, is most certainly necessary to interpret scores in this range (i.e., the 80s), as such scores may or may not result in functional limitations for the individual (e.g., Flanagan, Alfonso, & Mascolo, 2011).

Figure 1. Two-Subtest Average (XBA Mean) Falls within the SEM of the Actual Norm-based Score Across the Ability Continuum

Figure 2 shows a situation that used XBA data; that is, tests from different batteries were combined to yield an arithmetic average representing the narrow ability of Spatial Relations (Sotelo-Dynega, 2011). In this situation, a cluster based on WISC-IV Block Design and KABC-II Triangles was derived in two ways (as an arithmetic average and via the formula mentioned above). The intercorrelation between these two tests (.55) was derived from a cross-battery dataset. The same pattern is observed in Figure 2 as in Figure 1: the averaged cluster overestimates performance at the low extreme and underestimates it at the high extreme of the ability continuum. At about 2 SDs below the mean, there is about a 4-point difference between the averaged cluster and the cluster generated from the formula, with the averaged cluster being higher. However, at this point on the ability continuum, the interpretation is the same regardless of how the Spatial Relations cluster was generated (e.g., a 68 and a 72 both represent a normative weakness). Incidentally, we report averaged clusters with a confidence interval of +/- 5. As such, a good rule of thumb when evaluating an averaged cluster is that, had a norm-based cluster been available, it would likely have fallen closer to the lower end of the confidence interval. For example, a Spatial Relations averaged cluster of 72 has a confidence band of 67-77. The value generated by the formula (which takes into account the reliability of the subtests and their intercorrelation) was 68. Thus, the confidence interval for the averaged cluster encompasses the value derived from the formula, which is also nearly identical to an actual norm-based cluster.

The rule of thumb "heard from the field"

According to the Schneider and McGrew report, "The rule-of-thumb heard from the field is that it is appropriate to average test SS's if the SS's for the respective tests are within 15 (<1SD) points of each other." The rule of thumb is meant to suggest that the two tests being averaged are in all likelihood measuring aspects of the same construct (broad or narrow, depending on what tests are being averaged) within the context of the CHC XBA approach. And, if they are measuring the same construct, then the correlation between them is likely at least moderate, which would then lead to a two-subtest cluster that does not differ meaningfully from a norm-based cluster (if such a cluster were available; see Figure 2). Incidentally, we do interpret clusters composed of scores that differ by more than 1 SD under certain circumstances. We describe these clusters as nonconvergent but clinically meaningful (Flanagan et al., 2007). Contrary to our recommendation, the Schneider and McGrew report takes the position that two scores (presumably measuring the same construct) can always form an interpretable cluster, regardless of the degree of difference between them (e.g., 65 and 135) and regardless of whether or not at least one of the scores in the composite is suggestive of deficiency. It is argued that it is acceptable, at least from a psychometric viewpoint, to generate and interpret a composite score from such divergent values.
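Two short calculations, using the same standardized-sum stand-in for the formula, make the contrast concrete. In the first, the subtest scores of 70 and 74 are hypothetical values chosen only so that their average is 72, matching the Spatial Relations example above; in the second, the divergent 65/135 pair is combined. The intercorrelation of .55 is the cross-battery value reported above.

```python
# Two contrasting cases under the standardized-sum approximation (a stand-in for the
# cluster formula, not a reproduction of it). Subtest scores of 70 and 74 are
# hypothetical, chosen only so that their average is 72 as in the Spatial Relations
# example; r = .55 is the reported cross-battery intercorrelation.

def composite_cluster(ss1: float, ss2: float, r: float) -> float:
    """Composite from the standardized sum of two standard scores (M = 100, SD = 15)."""
    z_sum = ((ss1 - 100) + (ss2 - 100)) / 15.0
    return 100 + 15 * z_sum / (2 + 2 * r) ** 0.5

if __name__ == "__main__":
    r = 0.55

    # Case 1: convergent low scores, about 2 SDs below the mean (the Figure 2 region).
    ss1, ss2 = 70, 74                         # hypothetical subtest scores
    avg = (ss1 + ss2) / 2                     # 72
    formula = composite_cluster(ss1, ss2, r)  # about 68
    lo, hi = avg - 5, avg + 5                 # the +/- 5 band used for averaged clusters
    print(f"convergent pair {ss1}/{ss2}: average = {avg:.0f}, formula = {formula:.1f}, "
          f"band {lo:.0f}-{hi:.0f} contains formula value: {lo <= formula <= hi}")

    # Case 2: highly divergent scores (the 65/135 example). Both methods return 100,
    # a value that describes neither the deficient nor the superior performance.
    ss1, ss2 = 65, 135
    print(f"divergent pair {ss1}/{ss2}:  average = {(ss1 + ss2) / 2:.0f}, "
          f"formula = {composite_cluster(ss1, ss2, r):.1f}")
```

In the first case the +/- 5 band around the averaged cluster of 72 comfortably contains the formula value of about 68; in the second, both methods return a score of 100, which is precisely the kind of result whose clinical meaning we question below.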
That something may be mathematically feasible and perhaps psychometrically precise does not automatically mean it has clinical meaning; indeed, such a score may be more misleading to the practitioner when important educational decisions must be made. In applied evaluations, accepting such disparate values as reflective of the same construct (particularly when a score within the composite warrants attention) seems inappropriate, especially in the case of measurement at the broad and narrow ability levels. At the global level, such differences may well be reasonably accommodated by the numerous values that enter into the construction of a measure of general ability. In the measurement of a broad or narrow ability, however, such differences in all likelihood have clinical implications and should not be ignored.

Comments on the Recommendations and Suggestions in the Schneider and McGrew Report

The Schneider and McGrew report discusses three main points as recommendations and solutions regarding the use of averaged scores. The first is a caution regarding averaged cluster scores. Specifically, it is recommended that averages not be applied "in contexts where specific scores are compared to 'bright-line' specific cut-score eligibility/diagnostic criteria." We wholeheartedly agree that averaged clusters, and even norm-based clusters for that matter, should not be used in making decisions on the basis of specific cut-scores. If cut-score criteria are applied (based on any method of score derivation), incorrect decisions will result because of the failure to consider the degree of precision (or lack thereof) inherent in scores. Likewise, the use of simple equations or formulae for making strict diagnostic decisions is equally unsupportable regardless of the type of score used.

The second recommendation provided in the Schneider and McGrew report is that practitioners use norm-based clusters when available, and that "if supplementary testing is required (crossing batteries) to obtain at least a two-test composite of an ability, it is recommended that a supplementary battery be selected that provides 2 or more test normed-based composite SS." This very recommendation has long been a foundational principle of the XBA approach (see Flanagan et al., 2007, p. 29).

The third recommendation noted in the report relates to the use of a formula to develop clusters. We agree that when the reliabilities of the individual subtests (which are readily available) and the correlation between them (which often is not) are both known, the formula should be used. Nevertheless, as we stated earlier, averaging is typically done when crossing batteries, and the intercorrelations between tests from different batteries are often not available. Dr. Schneider offered a recommendation on the NASP listserv that we have been considering as we revise our programs: basing our averages on "average intercorrelations" among tests that purport to measure the same broad and narrow abilities.
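As a rough check on that suggestion, the sketch below (illustrative values only, again using the standardized-sum stand-in for the formula) varies the assumed intercorrelation across a plausible range and shows how little the resulting cluster moves for a low pair of hypothetical subtest scores.

```python
# A small sensitivity check on substituting an "average intercorrelation" when the
# actual cross-battery correlation is unknown. Values are illustrative; the cluster
# is again approximated by the standardized-sum composite.

def composite_cluster(ss1: float, ss2: float, r: float) -> float:
    """Composite from the standardized sum of two standard scores (M = 100, SD = 15)."""
    z_sum = ((ss1 - 100) + (ss2 - 100)) / 15.0
    return 100 + 15 * z_sum / (2 + 2 * r) ** 0.5

if __name__ == "__main__":
    ss1, ss2 = 70, 74  # hypothetical subtest scores near 2 SDs below the mean
    for r in (0.40, 0.50, 0.55, 0.60, 0.70):
        print(f"assumed r = {r:.2f}:  cluster = {composite_cluster(ss1, ss2, r):.1f}")
```

With these values the cluster shifts by only about three points as the assumed correlation moves from .40 to .70, which suggests that a well-chosen average intercorrelation would serve the purpose.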
Summary

There is much with which we agree in the report by Schneider and McGrew. Clusters generated via averaging produce values that differ from values based on a test's actual norms. Those differences, however, have minor (not major) clinical implications when appropriate assessment procedures are used. And, in most cases, interpretations based on averages within the context of XBA do not differ from interpretations based on actual norms. This is because the differences between the two methods occur at the extremes of the ability continuum, where 4-5 points do not alter an interpretation of deficient performance (or normative weakness), for example. Also, practitioners who conduct assessments ought to know that no single score (averaged or norm-based) should be interpreted in isolation.

In addition, use of averaged scores within XBA occurs only in situations where there is no norm-based composite available and the need to explore suspected low average or deficient functioning requires crossing batteries. Under no circumstances should practitioners engage in the arbitrary creation of clusters or composites when there is a norm-based cluster available. Moreover, because the differences in values obtained by either method (arithmetic average or actual norm for two scores in the same cognitive domain) are negligible in the average range of ability (e.g., for standard scores ranging from about 90 to 110), the implications for practice are minimal; practitioners are unlikely to infer deficiencies on the basis of solidly average scores. Likewise, although we recognize that for scores above the mean there are differences in the values obtained via the two methods, within the context of XBA we do not suggest following up on high average scores and therefore do not cross batteries to create averages at the upper end of the ability continuum. We only recommend following up on single subtest performances that suggest a normative weakness, and we only recommend crossing batteries when there are no tests within the core battery (or set of co-normed tests) that measure the same narrow ability as the test/score in question. Thus, in XBA, averaging is nearly always done to follow up on weak score performances at the lower end of the ability continuum. Incidentally, such follow-up testing (or hypothesis testing) is routine in the field of neuropsychology.

The cautions presented in the report notwithstanding, the use of averages (via XBA) can and often does bolster, confirm, and enhance existing findings about an individual's performance in a given area. Indeed, a "convergence of indicators" has always been our motto. If the practice of measuring and interpreting psychological variables could be accomplished purely by automated, statistical methods, what need would there be for practitioners? The role of a good clinician is not to measure any particular variable with absolute precision (something that, in fact, cannot be done). Rather, the role of a good clinician certainly includes drawing inferences about performance based on a solid foundation of converging data sources derived through multiple methods. Since no method exists that provides the "real" value or the true score, in those cases where questions regarding functioning remain unanswered or inadequately answered, additional information is both desirable and necessary. If the only manner in which such information can be gathered is by crossing batteries and averaging scores, then it should be done within the framework of a systematic, theory-based, and well-reasoned approach such as XBA, an approach that Alan S. Kaufman called a "systematic…sophisticated, high-quality psychometric approach to intelligence test interpretation" (Kaufman, 2000, p. xv).
References

Flanagan, D. P., Alfonso, V. C., & Mascolo, J. T. (2011). A CHC-based operational definition of SLD: Integrating multiple data sources and multiple data-gathering methods. In D. P. Flanagan & V. C. Alfonso (Eds.), Essentials of SLD identification. Hoboken, NJ: Wiley.

Flanagan, D. P., Ortiz, S. O., & Alfonso, V. C. (2007). Essentials of cross-battery assessment (2nd ed.). Hoboken, NJ: Wiley.

McGrew, K. S., & Flanagan, D. P. (1998). The intelligence test desk reference (ITDR): Gf-Gc cross-battery assessment. Boston, MA: Allyn & Bacon.