
Under Editorial Consideration
Please do not cite or quote sans permission.
January 26, 2015
INSTRUCTIONALLY DIAGNOSTIC TESTS:
FALSE LABELLING?
W. James Popham
University of California, Los Angeles
“Diagnostic test” is a label more often uttered than understood. Yes, although the
phrase “diagnostic testing” seems to pop up in educators’ conversations every few
days, many of those who employ this phrase have, at best, only a limited understanding
of what constitutes a genuine diagnostic test. Indeed, widespread misunderstandings
about diagnostic tests—by teachers who themselves build such tests or by
administrators who purchase off-the-shelf diagnostic tests—render most of these tests
dysfunctional.
Generally, of course, diagnostic tests are thought to be good things. Most people regard
such tests as useful tools—often associated with such medical applications as when
physicians employ tests to identify what’s causing a patient’s illness. A diagnostic test,
according to my dictionary, is “concerned with identification of the nature of illness or
other problems” (The New Oxford American Dictionary, 2001). When diagnostic testing
occurs in the field of education, of course, we are typically dealing with “other
problems.” In education, such tests are used either for purposes of classification or for
instruction.
Classification-focused diagnostic tests are often employed by educators who are
working with atypical students, that is, with students who are particularly gifted or
students who have pronounced disabilities. Such tests allow educators to identify the
nature of a student’s exceptionality—so that the student can then be accurately
assigned to a specific classification category.
Instruction-focused diagnostic tests are used when teachers attempt to provide
particularized instruction for individual students so that the teacher’s upcoming
instructional activities will better mesh with the precise learning needs of different
students. Because most of today’s so-called instructionally diagnostic tests really aren’t
any such thing, the following analysis will be devoted exclusively to instructionally
diagnostic tests.
Getting my own bias on the table immediately, I am an all-in fan of instructionally
diagnostic tests—but only if they are well constructed and, then, used appropriately.
That is, I endorse such tests when they are built properly and, thereupon, employed by
teachers to do a better instructional job with their students. Those two requirements, of
course, are easier to assert than to satisfy.
What is an Instructionally Diagnostic Test?
Because the mission of an instructionally diagnostic test is to help teachers do an
effective instructional job with their students, we need to identify how using such a test
will enlighten a teacher’s instructional decisions, for example, decisions about when to give or withhold additional or different instruction, and for which students.
Appropriately matching instruction, both its quantity and its type, with students’ current
needs constitutes a particularly important element of effective teaching. If properly
fashioned, an instructionally diagnostic test permits teachers to identify those students
who need more or less instruction regarding specific learning targets. Moreover, if the
diagnostic test is adroitly constructed, such a test can even help teachers determine
which sorts of instruction will be likely to succeed with which students.
Here, then, is how I think educators should define an instructionally diagnostic test:
An instructionally diagnostic test is an assessment instrument whose use permits
teachers to draw accurate inferences about individual test-takers’ strengths
and/or weaknesses with respect to two or more skills or bodies of knowledge—
thereby permitting teachers to take more effective next-step instructional actions.
Several components of this definition can have an influence on one’s conception of
instructionally diagnostic tests and, thus, warrant elaboration.
Solo-student inferences. Let’s begin with the idea that, based on students’
performances on an instructionally diagnostic test, a teacher will be able to draw more
valid (or, if you prefer, more accurate) inferences about an individual student’s status.
Putting it differently, a good instructionally diagnostic test will help a teacher get a more
exact fix on what each of the teacher’s students currently knows or is able to do.
Clearly, if a teacher chooses to summarize the individual test results of multiple
students in order to make sub-group focused inferences, or even whole-class focused
inferences, this is altogether appropriate. However, the proposed definition specifies
that an honest-to-goodness instructionally diagnostic test must be capable of yielding
valid inferences about the status of an individual test-taker.
Strengths and/or weaknesses. In the above definition, the “and/or” has an
important role to play. It signifies that a student’s performance on an instructionally
diagnostic test can provide results permitting a teacher to draw inferences exclusively
about the student’s strengths, exclusively about the student’s weaknesses, or about
both the strengths and weaknesses of the student. Although, in most settings, teachers
are likely to be more interested in either students’ weaknesses or their strengths, the
definition of instructionally diagnostic tests makes clear that this “and/or” should be
taken seriously.
Two or more skills or bodies of knowledge. The strengths and/or weaknesses
addressed in an instructionally diagnostic test must number at least two. This is where the
diagnosticity of such tests scampers onstage. If a test only helps teachers establish a
student’s status with respect to one cognitive skill, or identifies the student’s mastery level regarding one body of knowledge, such information can surely be useful to a teacher. But such a test’s measurement mission is focused on classification, not instruction. Teachers can use such a test to classify students as masters or non-masters of whatever skill or body of knowledge is being assessed. A single-focus test,
however, is not instructionally diagnostic. It’s not a bad test; it’s just not instructionally
diagnostic.
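
To make the contrast concrete, here is a minimal sketch, in Python, of the kind of per-student, per-target report a genuinely instructionally diagnostic test must be able to support. The target names and the 70 percent mastery cutoff are illustrative assumptions, not features of any actual test.

    # A minimal sketch: one student's results reported separately for each
    # assessment target (illustrative targets and cutoff, not a real test).
    MASTERY_CUTOFF = 0.70  # assumed proportion-correct threshold

    student_results = {
        "fractions: finding equivalents": 0.85,  # proportion of items correct
        "fractions: comparing magnitudes": 0.40,
    }

    for target, proportion_correct in student_results.items():
        status = "strength" if proportion_correct >= MASTERY_CUTOFF else "weakness"
        print(f"{target}: {proportion_correct:.0%} -> {status}")

A single-focus test could report only one overall score (say, 62 percent on “fractions”), which would support a master/non-master classification but no per-target diagnosis.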
Next-step instructional actions. The final requisite feature of this definition stems
from the fundamental orientation of an instructionally diagnostic test, namely,
instruction. The definition indicates that, based on the results of such tests, teachers
can make next-step instructional decisions, for instance, what to teach tomorrow or how
to teach something next week. Moreover, those next-step instructional decisions are
apt to be more effective than would have been the case had a teacher’s decisions not
been abetted by test-elicited information about students’ current status.
What this definitional requirement calls for, in plain language, are test results that can
be readily translated into sound next-step pedagogical moves by a teacher. To supply
such data, an instructionally diagnostic test must provide its results at an actionable
grain-size, that is, at a level of generality addressable by routine teacher-chosen
instructional actions. Accordingly, if the reports from a supposedly diagnostic test are
provided at such a broad grain-size that the only reasonable action implication is for the
teacher to “make students smarter,” this test would be a diagnostic flop.
Optimal grain sizes for high-quality instructionally diagnostic tests will be identified below, but it is already clear that those who craft such tests must be constantly alert so that their tests supply teachers with information on which they can readily take reasonable instructional actions.
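
As one illustration of an actionable grain size, the following sketch turns per-target results into a simple next-step reteaching roster; the students, targets, and cutoff are hypothetical, and nothing here is prescribed by any particular test or reporting format.

    # A minimal sketch: grouping students by the specific target each one
    # still needs, so tomorrow's reteaching can be planned (hypothetical data).
    NEEDS_RETEACHING = 0.70  # assumed proportion-correct cutoff

    class_results = {
        "Ana":  {"main idea": 0.90, "inference": 0.50},
        "Ben":  {"main idea": 0.55, "inference": 0.80},
        "Cara": {"main idea": 0.60, "inference": 0.45},
    }

    reteach = {}
    for student, targets in class_results.items():
        for target, score in targets.items():
            if score < NEEDS_RETEACHING:
                reteach.setdefault(target, []).append(student)

    for target, students in reteach.items():
        print(f"Reteach '{target}' to: {', '.join(students)}")

A report at the “make students smarter” grain size would support no such grouping, which is precisely what makes it a diagnostic flop.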
Determining a test’s instructional diagnosticity. To determine whether a test is, in fact, instructionally diagnostic and, if it is, how good the test is, the most practical approach is to rely on the aggregated, individually rendered judgments of a panel of experienced and properly oriented educators.
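
As a sketch of what such aggregation might look like, each panelist could independently rate the test on each evaluative criterion, with the panel’s median taken as the test’s score; the 1-to-5 scale, the criteria, and the ratings below are all hypothetical.

    # A minimal sketch: aggregating panelists' independent ratings of a
    # test's instructional diagnosticity (hypothetical 1-5 scale and data).
    from statistics import median

    ratings = {  # criterion -> one rating from each of five panelists
        "curricular alignment": [4, 5, 4, 3, 4],
        "grain size":           [2, 3, 2, 2, 3],
    }

    for criterion, scores in ratings.items():
        print(f"{criterion}: panel median = {median(scores)}")

The median is used here simply to blunt the influence of an outlier judge; a real panel might well aggregate its judgments differently.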
How Good is an Instructionally Diagnostic Test?
Once tests have been identified as instructionally diagnostic, they are still apt to differ in
their quality, sometimes substantially. Let’s consider, briefly, five evaluative criteria by
which to determine such a test’s merits. In turn, then, a brief description will be
presented of each of the following attributes of an instructionally diagnostic test’s
quality: (1) curricular alignment, (2) grain size, (3) sufficiency of items, (4) item quality, and (5) ease of usage.
Curricular alignment. The overriding purpose of educational tests is to collect
students’ responses in such a way that we can use those overt responses to test questions in order to arrive at valid inferences about students’ covert status with respect
to such variables as students’ knowledge or their cognitive skills. Thus, the initial factor
by which instructionally diagnostic tests should be evaluated is the degree to which the
test’s items will, in fact, elicit responses permitting valid interpretations to be made
about students’ status regarding what purportedly is being measured.
Ideally, we would like instructionally diagnostic tests to measure lofty curricular aims
such as students’ attainment of truly high-level cognitive skills or their possession of
genuinely significant bodies of knowledge. But the grandeur of the curricular aims that a
test attempts to measure is often determined not by the test-makers themselves but,
rather, by higher-level educational authorities. It seems unfair to downgrade a test itself
because of puerile curricular dictates over which the test’s developers had no control.
What is being proposed here is that judgments about the worthiness of the curricular
aims being assessed by an instructionally diagnostic test should be rendered by others,
that is, by other evaluators who focus on the curricular defensibility of what’s supposed
to be measured by an instructionally diagnostic test. Determining the quality of a test’s curricular aims is an important undertaking, but it is separable—and it should be separated—from evaluation of an instructionally diagnostic test’s quality.
Grain size. Perhaps the most common error encountered in flawed diagnostic
tests consists of attempts to measure students’ status with respect to their mastery of
assessment targets representing the wrong grain size. Assessment targets that are too
broad provide insufficient guidance to teachers regarding what is really being sought of
students. Given too-broad assessment targets, teachers cannot discern from a
student’s test results what it really is that a student needs to master. And tests whose
grain sizes are too tiny will typically overwhelm both teacher and students with an
excessive number of instructional targets—almost always accompanied by too few
items per assessed target.
Sufficiency of items. The evaluative criterion of item sufficiency for appraising an
instructionally diagnostic test is easy to understand, but difficult to operationalize. Put
plainly, a decent instructionally diagnostic test needs to contain enough items dealing
with each of the skills and/or bodies of knowledge being measured so that, when we
see a student’s responses to those items, we can identify—with reasonable accuracy—
how well a student has achieved each of the assessment’s targets.
And here, of course, is where sensible, experienced educators can differ. Depending on
the nature of the curricular targets being measured by a test, seasoned educators might
understandably have different opinions about the numbers of items needed. To
illustrate, when measuring larger grain-size curricular targets, more items would
typically be needed than when measuring smaller grain-size curricular targets.
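
One way to reason about “enough items” is a simple binomial sketch: if we assume, purely for illustration, that a student answers each item on a given target independently with some fixed probability of success, we can compute how often a short item set will misclassify that student. The 50 percent per-item probability and the 70 percent mastery cutoff below are assumptions, not recommended values.

    # A minimal binomial sketch of item sufficiency (illustrative numbers):
    # how often does a true non-master, with a 50% chance on each item,
    # happen to answer at least 70% of a target's items correctly?
    from fractions import Fraction
    from math import ceil, comb

    def false_mastery_rate(n_items, p=0.5, cutoff=Fraction(7, 10)):
        """Probability of reaching the cutoff proportion by chance alone."""
        k_min = ceil(cutoff * n_items)  # exact because cutoff is a Fraction
        return sum(comb(n_items, k) * p**k * (1 - p)**(n_items - k)
                   for k in range(k_min, n_items + 1))

    for n in (5, 10, 20, 40):
        print(f"{n:2d} items per target: false-mastery rate = "
              f"{false_mastery_rate(n):.2f}")

Under these assumptions, the non-master slips past the cutoff nearly a fifth of the time with five items per target, but only about once in a hundred with forty; the price of that added accuracy, of course, is testing time.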
Ideally, the items being employed in an instructionally diagnostic test would also
represent a range of difficulties so that students of varying achievement levels would
have a chance to display how well they had mastered the skills and/or knowledge being
sought.
Item quality. With few exceptions, educational tests are made up of individual
items. Occasionally, of course, we encounter a one-item test, such as when we ask students to provide a writing sample by composing a solo, original essay. But most
tests, and especially those tests intended to supply instructionally diagnostic
information, contain multiple items. If a test’s items are good ones, the test is obviously
apt to be better than a test composed of shoddy items. This is true whether the test is
supposed to serve a classification function, an instructional function, or some other
function altogether. Better items make better tests.
How does one tell if a test’s items are rapturous or wretched? Fortunately, over the
years measurement specialists have identified a widely approved collection of item-writing guidelines—as well as item-improvement guidelines. These “rules” come from
two sources. First, some of the guidelines are provided by empirical investigations in
which, for example, we secure research evidence regarding how to devise constructed-response items so we can more accurately score students’ responses. A second set of
guidelines is based on the experiences of in-the-trenches item-writers who have, for
nearly a century, reached trial-and-error judgments about what sorts of items seem to
work and what sorts seem to misfire.
Ease of usage. Testing takes time. Teaching takes time. Most teachers,
therefore, have few discretionary moments to spend on anything—including
diagnostic testing. Accordingly, an effective instructionally diagnostic test, if it is going to
be employed by many teachers, must be easy to use. If an instructionally diagnostic test
is difficult to use—or time-consuming to use—it won’t be.
If, for example, scoring of students’ responses must be personally carried out by the
teacher, then such scoring should require minimal time and very little effort. These days,
of course, because of advances in technology, much scoring is done electronically. But
if hassles arise when teachers use a test, and most certainly when teachers score a
test, those hassles will diminish the test’s use. Instructionally diagnostic tests that are
easy to use will tend to be used. Those that aren’t, won’t.
Summing up, instructionally diagnostic tests, if appropriately built, can function as potent
tools to help teachers promote their students’ learning. But simply labeling a test as
“instructionally diagnostic” does not automatically signify that the test is anything other
than an improperly named assessment device.