Cross-Battery Assessment (XBA) and Averaged Scores:
Comments on the Schneider and McGrew Report
Dawn Flanagan, Samuel Ortiz, and Vincent Alfonso
General Comments
The report by Schneider and McGrew provides an analysis regarding some of the pitfalls and
problems inherent in taking an uninformed and unsystematic approach to creating ability clusters
or composites using averaging methods. As authors of the CHC Cross-Battery Approach (an
approach originally developed by Drs. McGrew and Flanagan, 1998), we have ourselves
examined and addressed these issues over the years because XBA may, but does not always,
necessitate use of averaging. The report helps to clarify many of the issues that we have long
recognized as important considerations in the measurement of cognitive abilities—so much so
that we have made deliberate efforts to include cautions and guidance regarding the use of
averaging when, and if, it becomes necessary for practitioners using XBA (Flanagan, Ortiz, &
Alfonso, 2007). Thus, we welcome the report because it provides even more substantiation for
the XBA guidelines we already have in place and continue to promote in our presentations and
trainings.
That said, in all fairness it should be noted that the "differences" reported between averaged and
norm-based methods for two-subtest broad and narrow CHC clusters, both above and below the
mean, are generally small and not clinically meaningful. Despite the warnings against the use of
averaged scores in making critical, high-stakes decisions, the magnitude of the observed
differences is small and unlikely to lead to the myriad misinterpretations Schneider and McGrew
fear. In truth, the values derived from both methods (for a two-subtest cluster) are similar,
especially in the “bright line” range that governs decisions that distinguish between what may or
may not be intact ability. Moreover, the report is presented in a manner that seems to suggest that
making decisions regarding dysfunction with scores in this range is unique to scores derived only
from averaging methods. The fact is, making such decisions is an issue for any practitioner using
any scores gathered with any test, irrespective of the manner in which they were derived.
Deciding whether a standard score in the range of 85-89, for example, represents function or
dysfunction is not made any easier simply because that value was obtained via norm-based
versus averaging methods. Indeed, what the report actually demonstrates is that the two methods
produce marginally different results (for a two-test cluster; i.e., those used in XBA) when the
values are well above or well below the mean.
If the intent of the article is to note that averaging can be problematic, especially when conducted
in an unreasoned and indiscriminate manner, we completely agree. However, when averaging is
used in a systematic manner, such as the way it is specified via XBA (limited and constrained to
situations where it is reasonable, such as in the absence of available norm-based clusters),
averaging may provide useful information upon which defensible decisions may be based,
especially when supported by ecological validity. Thus, the implications and clinical significance of
the actual difference in values that emanate from the different methods become, in our opinion
(and within the context of our approach), minimal and largely inconsequential.
The report is partially a statistical debate—and one in which the position taken by the authors is
presented as if it were the irrefutable representation of reality. Thus, when use of a method to
compute composite scores produces values that are not identical to those already established via
a combinatorial probability method, it is automatically assumed the other values are wrong. Use
of the terms “real” when referring to norm-based clusters and “pseudo” when referring to
averaged clusters seems intended to imply that "real" is equal to actual reality and that "pseudo"
clusters do not reflect reality. We are all cautioned to remember that any measure of any latent
variable is just an estimate and that no method or procedure produces an error-free value, let
alone the “true” value of the given attribute.
Evaluation of Major Issues
It would not be accurate to say that we disagree with the major issues that are raised. In fact, we
actually support many of the contentions and have attempted to ameliorate them. For example,
one of the points made in the report involved the degree to which “global” composite scores
created by the averaging of multiple ability scores may be inaccurate in comparison to those
based on combinatorial probability methods (or a norm-based global composite). The relatively
large differences in the scores between these two methods suggest that averaging multiple scores
in this manner is difficult to support. As such, it is not surprising that within the guidelines for
XBA, there are no recommendations for generating a global ability score based on averaging
subtest scores. Use of XBA provides no mechanism by which a global score can or should be
created. It is our recommendation that when a global score is desired or necessary, it can only be
obtained from an existing standardized, norm-referenced test of intelligence or battery of
cognitive abilities.
Also noteworthy is that the XBA approach does not provide a mechanism for the creation of any
cluster, whether global, broad, or narrow, on the basis of more than three scores. In fact, in the
overwhelming majority of XBA cases wherein averages are calculated, they are based on two
scores that very likely represent either the same broad or same narrow CHC ability (and
therefore, are likely to be at least moderately correlated). If what we consider to be an
interpretable, norm-based composite is available, then that is the score that should be reported
and interpreted—a core guiding principle of XBA (Flanagan et al., 2007). Only when clusters or
composites contain significant variability and one of the scores comprising the cluster is
suggestive of a normative weakness, do we recommend follow-up evaluation. Regarding follow-up procedures, the authors of XBA also recommend use of an additional test within the core
battery, if available, to generate a norm-based composite (use of Dr. Schneider’s automated
“compositator” for the WJ III would be recommended in this situation). If such a test within the
core battery is not available, practitioners must then necessarily cross batteries to obtain another
measure of the narrow ability underlying the test that yielded a score that was suggestive of a
normative weakness. Depending upon the score obtained on this additional subtest, a two-subtest cluster average (representing a narrow ability) may be calculated, for example.
Averaging is not a central tenet of XBA. It is only in the case where a norm-based cluster is not
available, that XBA suggests the formation of a theory-based CHC cluster via the average of two
subtests. The use of averaging in this sense is quite limited and suggested only because there are
no norms available to represent the data gathered during the course of examining an individual’s
pattern of cognitive strengths and weaknesses. That is, the field currently lacks instruments with
sufficient breadth and depth of coverage of CHC narrow abilities. The alternative to averaging
consistent subtest score performances from different batteries is, of course, to not average them
and to simply discuss them separately. But not following up and testing hypotheses about
aberrant score performances in narrow ability domains, simply because no norm-based clusters
are available, is not a viable option, in our opinion, especially in cases where an understanding of
an individual’s learning difficulties is necessary.
At the lower end of the ability continuum, the Schneider and McGrew report indicates that
arithmetic averages (of a two-test cluster) are slightly higher than norm-based clusters. We are
comfortable with this fact. Because averages overestimate performance at the lower end of the
ability continuum, they are less likely to result in an interpretation of deficiency for an individual
who is not actually deficient. Also noteworthy about averages is the fact that most fall within the
standard error of measurement of the norm-based cluster. To illustrate, Figure 1 includes clusters
that have been generated in different ways based on two WISC-IV subtests (Block Design and
Picture Completion) that each have a reliability coefficient of .87 and an intercorrelation of .54
(for 11-year-olds). These two WISC-IV subtests are combined to form a Visual Processing (Gv)
cluster (actual norms for this cluster may be found in Flanagan & Kaufman, 2009).
The Gv cluster in Figure 1 was generated in the following ways: a) actual norms; b) formula
suggested in the Schneider and McGrew report (i.e., “cluster generator”); and c) arithmetic
average. This figure shows that clusters derived by the formula closely align with actual norm-based clusters, as pointed out in the Schneider and McGrew report. Figure 1 shows further that
arithmetic averages fall within the norm-based cluster’s SEM across the entire ability range.
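For readers who want to reproduce comparisons of this kind, the sketch below illustrates the general pattern under two assumptions of ours, neither taken from the Schneider and McGrew report itself: that the formula behaves like the standard unit-weighted composite of two standard scores (mean = 100, SD = 15), and that the composite SEM follows the usual composite-reliability expression. The reliability (.87) and intercorrelation (.54) are the values cited above.

```python
import math

# Illustrative sketch only: the composite formula and SEM expression below are
# our assumptions about how such values are conventionally computed; they are
# not reproduced from the Schneider and McGrew report or its "cluster generator."

MEAN, SD = 100.0, 15.0

def arithmetic_average(ss1, ss2):
    """Simple average of two standard scores (the XBA fallback)."""
    return (ss1 + ss2) / 2.0

def formula_composite(ss1, ss2, r12):
    """Composite SS for two equally weighted subtests with intercorrelation r12."""
    z1, z2 = (ss1 - MEAN) / SD, (ss2 - MEAN) / SD
    # The variance of z1 + z2 is 2 + 2*r12, so dividing by its square root
    # re-standardizes the sum before rescaling to the SS metric.
    return MEAN + SD * (z1 + z2) / math.sqrt(2.0 + 2.0 * r12)

def composite_sem(rel1, rel2, r12):
    """SEM of the two-subtest composite, from its composite reliability."""
    rel_composite = (rel1 + rel2 + 2.0 * r12) / (2.0 + 2.0 * r12)
    return SD * math.sqrt(1.0 - rel_composite)

# Values cited above for the Gv example: Block Design and Picture Completion,
# each with a reliability of .87 and an intercorrelation of .54 (11-year-olds).
r12, rel = 0.54, 0.87
sem = composite_sem(rel, rel, r12)            # roughly 4.4 SS points
for ss in (70, 80, 90, 100, 110, 120, 130):   # identical subtest scores
    avg = arithmetic_average(ss, ss)
    comp = formula_composite(ss, ss, r12)
    print(f"subtests = {ss}: average = {avg:.0f}, formula = {comp:.1f}, "
          f"|difference| <= SEM? {abs(avg - comp) <= sem}")
```

With these values the composite SEM works out to roughly 4.4 points, and even at subtest scores of 70 or 130 the gap between the average and the formula-based composite (about 4.2 points) stays within that band, which is consistent with the pattern described for Figure 1.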
The fact that the arithmetic average for Gv in Figure 1 consistently falls within the SEM of the
norm-based Gv cluster suggests that the differences between these estimates of ability are not
clinically meaningful. For example, when attempting to ascertain whether or not a student is
deficient in Gv, will the practitioner's interpretation change if the score is 80 vs. 76? 75 vs. 70?
69 vs. 62? Based on our work with students and our approach to understanding a student's
strengths and weaknesses, our answer to this question is “no.” These scores are simply not
sufficiently different from one another to suggest that a score derived from one method
is deficient and a score derived from another method is not. Moreover, we would not make a
determination about deficiency in any cognitive area based on a single score (which includes
averaged clusters and norm-based clusters).
Note that it is highly likely that the only time in which averages and norm-based clusters may
suggest a different interpretation (e.g., low average ability vs. normative weakness, respectively)
is in cases where the scores are in the 80s. But clinical judgment is most certainly necessary to
interpret scores in this range, as such scores may or may not result in functional limitations for
the individual (e.g., Flanagan, Alfonso, & Mascolo, 2011).
Figure 1. Two-Subtest Average (XBA Mean) Falls within the SEM of the Actual Norm-Based Score Across the Ability Continuum
Figure 2 shows a situation that used XBA data. That is, tests from different batteries were
combined to yield an arithmetic average, representing the narrow ability of Spatial Relations
(Sotelo-Dynega, 2011). In this situation, a cluster composed of WISC-IV Block Design and
KABC-II Triangles was derived in two ways (arithmetic average and the formula mentioned above). The intercorrelation
between these two tests (.55) was derived from a cross-battery dataset. The same pattern is
observed in Figure 2 as in Figure 1. That is, the averaged cluster under- and overestimates
performance at the extreme ends of the ability continuum. At about 2 SDs below the mean, there
is about a 4-point difference between the averaged cluster and the cluster generated from the
formula, with the averaged cluster being higher. However, at this point on the ability continuum,
regardless of how the Spatial Relations cluster was generated, the interpretation is the same (e.g.,
a 68 and a 72 both represent a normative weakness). Incidentally, we report averaged clusters
with a confidence interval of +/- 5. As such, a good rule of thumb when evaluating the averaged
cluster is that had a norm-based cluster been available, it would likely have been closer to the
lower end of the confidence interval. For example, a Spatial Relations averaged cluster of 72 has
a confidence band of 67-77. The value generated by the formula (that takes into account the
reliability of the subtests and their intercorrelation) was 68. Thus, the confidence interval for the
averaged cluster encompasses the value derived from the formula, which is also nearly identical
to an actual norm-based cluster.
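As a rough numerical check on the figures just cited, the snippet below applies the same assumed unit-weighted composite formula with the reported intercorrelation of .55. The subtest scores of 72 are a hypothetical pair consistent with an averaged cluster of 72; the formula value comes out near 68, reproducing the roughly 4-point gap, and falls inside the 67-77 band implied by the +/- 5 rule of thumb.

```python
import math

# Hedged check of the Spatial Relations example above.  Both subtest standard
# scores are assumed (hypothetically) to be 72; the .55 intercorrelation is the
# cross-battery value reported in the text; the formula is the same assumed
# unit-weighted composite used in the earlier Gv sketch.
ss1 = ss2 = 72
r12 = 0.55
z_sum = (ss1 - 100) / 15 + (ss2 - 100) / 15
composite = 100 + 15 * z_sum / math.sqrt(2 + 2 * r12)   # about 68.2
average = (ss1 + ss2) / 2                               # 72.0
ci_band = (average - 5, average + 5)                    # the +/- 5 rule: 67-77
print(round(composite, 1), average, ci_band)            # 68.2 lies inside 67-77
```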
The Rule of Thumb "Heard from the Field"
According to the Schneider and McGrew report, “The rule-of-thumb heard from the field is that
it is appropriate to average test SS’s if the SS’s for the respective tests are within 15 (<1SD)
points of each other." The rule of thumb is meant to suggest that the two tests that are being
averaged are in all likelihood measuring aspects of the same construct (broad or narrow,
depending on what tests are being averaged) within the context of the CHC XBA approach.
And, if they are measuring the same construct, then the correlation between them is likely at least
moderate, which would then lead to a two-subtest cluster that does not differ meaningfully from
a norm-based cluster (if such a cluster was available; see Figure 2). Incidentally, we do interpret
clusters that are composed of scores that differ by more than 1 SD under certain circumstances. We
describe these clusters as nonconvergent but clinically meaningful (Flanagan et al., 2007).
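To make the logic behind the rule of thumb concrete, the sketch below (again assuming the standard unit-weighted composite formula and a hypothetical pair of identical subtest scores of 70, i.e., 2 SD below the mean) shows that the more strongly the two tests correlate, the closer the formula-based composite sits to the simple arithmetic average.

```python
import math

# Hedged illustration: the formula and the identical subtest scores of 70 are
# illustrative assumptions, not values taken from the report.
ss = 70
z = (ss - 100) / 15
for r12 in (0.3, 0.5, 0.7):
    composite = 100 + 15 * 2 * z / math.sqrt(2 + 2 * r12)
    print(f"r = {r12}: average = {ss}, formula composite = {composite:.1f}")
# As r rises from .3 to .7, the formula-based composite moves from about 62.8
# to about 67.5, i.e., progressively closer to the arithmetic average of 70.
```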
Contrary to our recommendation, the Schneider and McGrew report takes the position that two
scores (presumably measuring the same construct) can always form an interpretable cluster
regardless of the degree of difference between them (e.g., 65 and 135) and regardless of whether
or not at least one of the scores in the composite is suggestive of deficiency. It is argued that it is
acceptable, at least from a psychometric viewpoint, to generate and interpret a composite score
from such divergent values. That something may be mathematically feasible and perhaps
psychometrically precise does not automatically mean it has clinical meaning; in fact, it may
mislead the practitioner when making important educational decisions. In applied
evaluations, accepting such disparate values as reflective of the same construct (particularly
when a score within the composite warrants attention) seems inappropriate, especially in the
case of measurement at the broad and narrow ability levels. At the global level, such differences
may well be reasonably accommodated by the numerous values that enter into the construction of a
measure of general ability. However, in measurement of a broad or narrow ability, such
differences in all likelihood have clinical implications and should not be ignored.
Comments on the Recommendations and Suggestions in the Schneider and McGrew
Report
The Schneider and McGrew report discusses three main points as recommendations and
solutions regarding the use of averaged scores. The first is a caution regarding averaged cluster
scores. Specifically, it is recommended that averages should not be applied "in contexts where
specific scores are compared to ‘bright-line’ specific cut-score eligibility/diagnostic criteria.”
We wholeheartedly agree that averaged clusters, and even norm-based clusters for that matter,
should not be used in making decisions on the basis of specific cut-scores. If cut-score criteria
are applied (based on any method of score derivation), incorrect decisions will result because of
the failure to consider the degree of precision (or lack thereof) inherent in scores. Likewise, the
use of simple equations or formulae for making strict diagnostic decisions is equally
unsupportable regardless of the type of score used.
The second recommendation provided in the Schneider and McGrew report is that practitioners
use norm-based clusters when available, and that “if supplementary testing is required (crossing
batteries) to obtain at least a two-test composite of an ability, it is recommended that a
supplementary battery be selected that provides 2 or more test normed-based composite SS.”
This very recommendation has long been a foundational principle of the XBA approach (see
Flanagan et al., 2007, p. 29).
The third recommendation noted in the report relates to the use of a formula to develop clusters.
We agree that when both the reliabilities of the individual subtests (which are readily available)
and the correlation between them (which often is not) are known, the formula should
be used. Nevertheless, as we stated earlier, averaging is often done when crossing batteries. As
such, the intercorrelations between tests (from different batteries) are often not available. Dr.
Schneider offered a recommendation on the NASP listserv, which is something that we have
been considering as we revise our programs (i.e., basing our averages on “average
intercorrelations” among tests that purport to measure the same broad and narrow abilities).
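A minimal sketch of how such an "average intercorrelation" fallback might look appears below. The ability labels, correlation values, and function names are hypothetical placeholders of ours; they are not drawn from the report, from XBA materials, or from any test manual.

```python
import math
from typing import Optional

# Hypothetical placeholder table: average correlations among tests thought to
# measure the same CHC ability, to be used when the specific cross-battery
# correlation between two subtests has not been published.
AVERAGE_R_BY_ABILITY = {
    "Gv: Spatial Relations": 0.55,   # placeholder value
    "Gf: Induction": 0.60,           # placeholder value
}

def two_test_composite(ss1: float, ss2: float, r12: float) -> float:
    """Unit-weighted composite of two standard scores (mean 100, SD 15)."""
    z_sum = (ss1 - 100) / 15 + (ss2 - 100) / 15
    return 100 + 15 * z_sum / math.sqrt(2 + 2 * r12)

def composite_with_fallback(ss1: float, ss2: float, ability: str,
                            known_r12: Optional[float] = None) -> float:
    """Use the observed intercorrelation when available; otherwise substitute
    the average intercorrelation for the ability being measured."""
    r12 = known_r12 if known_r12 is not None else AVERAGE_R_BY_ABILITY[ability]
    return two_test_composite(ss1, ss2, r12)

# Example: no published correlation for this pair, so the Gv average is used.
print(round(composite_with_fallback(72, 74, "Gv: Spatial Relations"), 1))
```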
Summary
There is much with which we agree in the report by Schneider and McGrew. Clusters generated
via the use of averaging produce values that are different from the values based on a test's actual
norms. Those differences, however, have minor (not major) clinical implications when
appropriate assessment procedures are used. And, in most cases, interpretations based on
averages within the context of XBA do not differ from interpretations based on actual norms.
This is because the differences between the two methods occur at the extremes of the ability
continuum, where 4-5 points do not alter an interpretation of deficient performance (or normative
weakness), for example. Also, practitioners who conduct assessments ought to know that no
single score (averaged or norm-based) should be interpreted in isolation.
In addition, use of averaged scores within XBA occurs only in situations where there is no norm-based composite available and the need to explore suspected low average or deficient
functioning requires crossing batteries. Under no circumstances should practitioners engage in
the arbitrary creation of clusters or composites when there is a norm-based cluster available.
Moreover, because the differences in values obtained by either method (arithmetic average or
actual norm for two scores in the same cognitive domain) are negligible in the average range of
ability (e.g., for standard scores ranging from about 90 to 110), the implications for practice are
minimal. This is because practitioners are unlikely to infer deficiencies on the basis of solidly
average scores. Likewise, although we recognize that in cases with scores above the mean there
are differences in the values obtained via the two methods, within the context of XBA, we do not
suggest following up on high average scores and, therefore, we do not cross batteries to create
averages at the upper end of the ability continuum. We only recommend following up on single
subtest performances that suggest a normative weakness. And, we only recommend crossing
batteries when there are no tests within the core battery (or set of co-normed tests) that measure
the same narrow ability as the test/score in question. Thus, in XBA, averaging is nearly always
done to follow up on weak score performances at the lower end of the ability continuum.
Incidentally, such follow up testing (or hypothesis testing) is routine in the field of
neuropsychology.
The cautions presented in the report notwithstanding, the use of averages (via XBA) can and
often does bolster, confirm, and enhance existing findings about an individual's performance in a
given area. Indeed, a “convergence of indicators” has always been our motto. If the practice of
measuring and interpreting psychological variables could be accomplished purely by automated,
statistical methods, what need would there be for practitioners? The role of a good clinician is
not to measure any particular variable with absolute precision—something that, in fact, cannot be
done. Rather, the role of a good clinician certainly includes drawing inferences about
performance based on a solid foundation of converging data sources derived through multiple
methods. Since no method exists that provides the “real” value or the true score, in those cases
where questions regarding functioning remain unanswered or inadequately answered, additional
information is both desirable and necessary. If the only manner in which such information can be
gathered is by crossing batteries and averaging scores, then it should be done within
the framework of a systematic, theory-based, and well-reasoned approach, such as XBA – an
approach that Alan S. Kaufman called a “systematic…sophisticated, high-quality psychometric
approach to intelligence test interpretation" (Kaufman, 2000, p. xv).
References
Flanagan, D. P., Alfonso, V. C., & Mascolo, J. T. (2011). A CHC-based operational definition of
SLD: Integrating multiple data sources and multiple data-gathering methods. In D. P.
Flanagan & V. C. Alfonso (Eds.), Essentials of SLD identification. Hoboken, NJ: Wiley.
Flanagan, D. P., Ortiz, S. O., & Alfonso, V. C. (2007). Essentials of cross-battery assessment
(2nd ed.). Hoboken, NJ: Wiley.
McGrew, K. S., & Flanagan, D. P. (1998). The intelligence test desk reference (ITDR): Gf-Gc
cross-battery assessment. Boston, MA: Allyn & Bacon.