Under Editorial Consideration
Please do not cite or quote without permission.
January 26, 2015

INSTRUCTIONALLY DIAGNOSTIC TESTS: FALSE LABELLING?

W. James Popham
University of California, Los Angeles

“Diagnostic test” is a label more often uttered than understood. Yes, although the phrase “diagnostic testing” seems to pop up in educators’ conversations every few days, many of those who employ this phrase have, at best, only a limited understanding of what constitutes a genuine diagnostic test. Indeed, widespread misunderstandings about diagnostic tests—by teachers who themselves build such tests or by administrators who purchase off-the-shelf diagnostic tests—render most of these tests dysfunctional.

Generally, of course, diagnostic tests are thought to be good things. Most people regard such tests as useful tools—often associated with medical applications, as when physicians employ tests to identify what’s causing a patient’s illness. A diagnostic test, according to my dictionary, is “concerned with identification of the nature of illness or other problems” (The New Oxford American Dictionary, 2001). When diagnostic testing occurs in the field of education, of course, we are typically dealing with “other problems.”

In education, such tests are used either for purposes of classification or for instruction. Classification-focused diagnostic tests are often employed by educators who are working with atypical students, that is, with students who are particularly gifted or students who have pronounced disabilities. Such tests allow educators to identify the nature of a student’s exceptionality—so that the student can then be accurately assigned to a specific classification category. Instruction-focused diagnostic tests are used when teachers attempt to provide particularized instruction for individual students so that the teacher’s upcoming instructional activities will better mesh with the precise learning needs of different students. Because most of today’s so-called instructionally diagnostic tests really aren’t any such thing, the following analysis will be devoted exclusively to instructionally diagnostic tests.

Getting my own bias on the table immediately, I am an all-in fan of instructionally diagnostic tests—but only if they are well constructed and, then, used appropriately. That is, I endorse such tests when they are built properly and, thereupon, employed by teachers to do a better instructional job with their students. Those two requirements, of course, are easier to assert than to satisfy.

What is an Instructionally Diagnostic Test?

Because the mission of an instructionally diagnostic test is to help teachers do an effective instructional job with their students, we need to identify how using such a test will enlighten a teacher’s instructional decisions—for example, decisions about when to give or withhold additional or different instruction, and from which students. Appropriately matching instruction, both its quantity and its type, with students’ current needs constitutes a particularly important element of effective teaching. If properly fashioned, an instructionally diagnostic test permits teachers to identify those students who need more or less instruction regarding specific learning targets. Moreover, if the diagnostic test is adroitly constructed, such a test can even help teachers determine which sorts of instruction will be likely to succeed with which students.
Here, then, is how I think educators should define an instructionally diagnostic test:

An instructionally diagnostic test is an assessment instrument whose use permits teachers to draw accurate inferences about individual test-takers’ strengths and/or weaknesses with respect to two or more skills or bodies of knowledge—thereby permitting teachers to take more effective next-step instructional actions.

Several components of this definition can have an influence on one’s conception of instructionally diagnostic tests and, thus, warrant elaboration.

Solo-student inferences. Let’s begin with the idea that, based on students’ performances on an instructionally diagnostic test, a teacher will be able to draw more valid (or, if you prefer, more accurate) inferences about an individual student’s status. Putting it differently, a good instructionally diagnostic test will help a teacher get a more exact fix on what each of the teacher’s students currently knows or is able to do. Clearly, if a teacher chooses to summarize the individual test results of multiple students in order to make subgroup-focused inferences, or even whole-class-focused inferences, this is altogether appropriate. However, the proposed definition specifies that an honest-to-goodness instructionally diagnostic test must be capable of yielding valid inferences about the status of an individual test-taker.

Strengths and/or weaknesses. In the above definition, the “and/or” has an important role to play. It signifies that a student’s performance on an instructionally diagnostic test can provide results permitting a teacher to draw inferences exclusively about the student’s strengths, exclusively about the student’s weaknesses, or about both the strengths and weaknesses of the student. Although, in most settings, teachers are likely to be more interested in either students’ weaknesses or their strengths, the definition of instructionally diagnostic tests makes clear that this “and/or” should be taken seriously.

Two or more skills or bodies of knowledge. The strengths and/or weaknesses addressed in an instructionally diagnostic test must be at least two. This is where the diagnosticity of such tests scampers onstage. If a test only helps teachers establish a student’s status with respect to one cognitive skill, or identifies the student’s mastery level regarding one body of knowledge, such information can surely be useful to a teacher. But such a test’s measurement mission is focused on classification, not instruction. Teachers can use such a test to classify students as masters or nonmasters of whatever skill or body of knowledge is being assessed. A single-focus test, however, is not instructionally diagnostic. It’s not a bad test; it’s just not instructionally diagnostic.

Next-step instructional actions. The final requisite feature of this definition stems from the fundamental orientation of an instructionally diagnostic test, namely, instruction. The definition indicates that, based on the results of such tests, teachers can make next-step instructional decisions, for instance, what to teach tomorrow or how to teach something next week. Moreover, those next-step instructional decisions are apt to be more effective than would have been the case had a teacher’s decisions not been abetted by test-elicited information about students’ current status. What this definitional requirement calls for, in plain language, are test results that can be readily translated into sound next-step pedagogical moves by a teacher.
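To make these definitional requirements concrete, here is a minimal sketch, in Python, of the kind of solo-student, multi-skill report the definition demands. The skill labels, item counts, and the 0.8 mastery cut-off are hypothetical illustrations, not recommendations drawn from the measurement literature.

```python
# A minimal sketch of a solo-student, multi-skill diagnostic report.
# Skill names, item counts, and the 0.8 mastery cut-off are hypothetical.

from dataclasses import dataclass

@dataclass
class SkillResult:
    skill: str          # one of the two or more skills the test covers
    items_correct: int  # items this student answered correctly
    items_total: int    # items administered for this skill

def diagnose(results: list[SkillResult],
             mastery_cut: float = 0.8) -> dict[str, str]:
    """Label each assessed skill a strength or a weakness for one student."""
    return {
        r.skill: "strength"
        if r.items_correct / r.items_total >= mastery_cut else "weakness"
        for r in results
    }

# One student, two skills -- the minimum for instructional diagnosticity.
print(diagnose([
    SkillResult("main-idea identification", 7, 8),  # 0.88 -> strength
    SkillResult("inference from context", 3, 8),    # 0.38 -> weakness
]))
# {'main-idea identification': 'strength', 'inference from context': 'weakness'}
```

Reports at this grain can, of course, still be rolled up into subgroup or whole-class summaries; the point is that the individual-student, multi-skill inference comes first.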
To supply such data, an instructionally diagnostic test must provide its results at an actionable grain size, that is, at a level of generality addressable by routine, teacher-chosen instructional actions. Accordingly, if the reports from a supposedly diagnostic test are provided at such a broad grain size that the only reasonable action implication is for the teacher to “make students smarter,” this test would be a diagnostic flop. Optimal grain sizes of high-quality instructionally diagnostic tests will be considered below, but we can already see that those who craft instructionally diagnostic tests must be constantly alert so that their tests supply teachers with information on which they can readily take reasonable instructional actions.

Determining a test’s instructional diagnosticity. For both the determination of whether a test is, in fact, instructionally diagnostic and, if it is, how good the test is, the most practical approach is to rely on the aggregated, individually rendered judgments of a panel of experienced and properly oriented educators.

How Good is an Instructionally Diagnostic Test?

Once tests have been identified as instructionally diagnostic, they are still apt to differ in their quality, sometimes substantially. Let’s consider, briefly, five evaluative criteria by which to determine such a test’s merits. In turn, then, a brief description will be presented of each of the following attributes of an instructionally diagnostic test’s quality: (1) curricular alignment, (2) grain size, (3) sufficiency of items, (4) item quality, and (5) ease of usage.

Curricular alignment. The overriding purpose of educational tests is to collect students’ responses in such a way that we can use those overt responses to test questions in order to arrive at valid inferences about students’ covert status with respect to such variables as students’ knowledge or their cognitive skills. Thus, the initial factor by which instructionally diagnostic tests should be evaluated is the degree to which the test’s items will, in fact, elicit responses permitting valid interpretations to be made about students’ status regarding what purportedly is being measured. Ideally, we would like instructionally diagnostic tests to measure lofty curricular aims such as students’ attainment of truly high-level cognitive skills or their possession of genuinely significant bodies of knowledge. But the grandeur of the curricular aims that a test attempts to measure is often determined not by the test-makers themselves but, rather, by higher-level educational authorities. It seems unfair to downgrade a test itself because of puerile curricular dictates over which the test’s developers had no control. What is being proposed here is that judgments about the worthiness of the curricular aims being assessed by an instructionally diagnostic test should be rendered by others, that is, by evaluators who focus on the curricular defensibility of what’s supposed to be measured by an instructionally diagnostic test. Such determinations of the quality of a test’s curricular aims are an important undertaking, but they are separable—and should be separated—from evaluation of an instructionally diagnostic test’s quality.
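Both the diagnosticity determination described above and this alignment criterion rest on aggregating panelists’ individually rendered judgments. The sketch below shows one plausible way such ratings might be pooled; the four-point rating scale, the 3.0 cut-point, and the item labels are illustrative assumptions, not a prescribed procedure.

```python
# A rough sketch of pooling a review panel's per-item judgments.
# The 1-4 scale (1 = off-target, 4 = squarely measures the intended
# skill or knowledge) and the 3.0 cut-point are illustrative assumptions.

from statistics import mean

def flag_items(panel_ratings: dict[str, list[int]],
               cut: float = 3.0) -> list[str]:
    """Return items whose mean rating across panelists falls below the cut."""
    return [item for item, ratings in panel_ratings.items()
            if mean(ratings) < cut]

# Five panelists rate three items; item_3 falls below the cut.
ratings = {
    "item_1": [4, 4, 3, 4, 4],  # mean 3.8
    "item_2": [3, 4, 3, 3, 4],  # mean 3.4
    "item_3": [2, 3, 2, 2, 3],  # mean 2.4
}
print(flag_items(ratings))  # ['item_3']
```

A real panel would, of course, need orientation and a defensible rating rubric; the point here is only that judgments are rendered individually first and aggregated afterward.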
Grain size. Perhaps the most common error encountered in flawed diagnostic tests consists of attempts to measure students’ status with respect to their mastery of assessment targets representing the wrong grain size. Assessment targets that are too broad provide insufficient guidance to teachers regarding what is really being sought of students. Given too-broad assessment targets, teachers cannot discern from a student’s test results what it really is that a student needs to master. And tests whose grain sizes are too tiny will typically overwhelm both teachers and students with an excessive number of instructional targets—almost always accompanied by too few items per assessed target.

Sufficiency of items. The evaluative criterion of item sufficiency for appraising an instructionally diagnostic test is easy to understand, but difficult to operationalize. Put plainly, a decent instructionally diagnostic test needs to contain enough items dealing with each of the skills and/or bodies of knowledge being measured so that, when we see a student’s responses to those items, we can identify—with reasonable accuracy—how well the student has achieved each of the assessment’s targets. And here, of course, is where sensible, experienced educators can differ. Depending on the nature of the curricular targets being measured by a test, seasoned educators might understandably have different opinions about the numbers of items needed. To illustrate, when measuring larger grain-size curricular targets, more items would typically be needed than when measuring smaller grain-size curricular targets. Ideally, the items being employed in an instructionally diagnostic test would also represent a range of difficulties so that students of varying achievement levels would have a chance to display how well they had mastered the skills and/or knowledge being sought.

Item quality. With few exceptions, educational tests are made up of individual items. Occasionally, of course, we encounter a one-item test, such as when we ask students to provide a writing sample by composing a solo, original essay. But most tests, and especially those tests intended to supply instructionally diagnostic information, contain multiple items. If a test’s items are good ones, the test is obviously apt to be better than a test composed of shoddy items. This is true whether the test is supposed to serve a classification function, an instructional function, or some other function altogether. Better items make better tests.

How does one tell if a test’s items are rapturous or wretched? Fortunately, over the years measurement specialists have identified a widely approved collection of item-writing guidelines—as well as item-improvement guidelines. These “rules” come from two sources. First, some of the guidelines are provided by empirical investigations in which, for example, we secure research evidence regarding how to devise constructed-response items so we can more accurately score students’ responses. A second set of guidelines is based on the experiences of in-the-trenches item-writers who have, for nearly a century, reached trial-and-error judgments about what sorts of items seem to work and what sorts seem to misfire.
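As a concrete illustration of the empirical side of item improvement, the sketch below computes two classical item-analysis statistics: item difficulty (the proportion of students answering correctly) and a simple upper-minus-lower discrimination index. The data are invented, and the rule of thumb of flagging discrimination below roughly 0.2 is offered as a common convention rather than a fixed standard.

```python
# A hedged sketch of two classical item-analysis statistics.
# The response data are invented; the ~0.2 discrimination floor is a
# common rule of thumb, used here only as an illustrative cut-point.

def item_difficulty(responses: list[int]) -> float:
    """Proportion of students answering the item correctly (1 = correct)."""
    return sum(responses) / len(responses)

def discrimination(responses: list[int], total_scores: list[int]) -> float:
    """Upper-minus-lower index: the item's difficulty among the top-scoring
    half of students minus its difficulty among the bottom-scoring half."""
    ranked = sorted(zip(total_scores, responses))
    half = len(ranked) // 2
    lower = [resp for _, resp in ranked[:half]]
    upper = [resp for _, resp in ranked[-half:]]
    return item_difficulty(upper) - item_difficulty(lower)

# Six students' responses to one item (1/0) and their total test scores.
item = [1, 0, 1, 1, 0, 1]
totals = [14, 6, 12, 15, 5, 11]
print(round(item_difficulty(item), 2))         # 0.67
print(round(discrimination(item, totals), 2))  # 0.67 -- well above ~0.2
```

Such empirical screens complement, rather than replace, the judgmental item-writing guidelines just described.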
Ease of usage. Testing takes time. Teaching takes time. Most teachers, therefore, have few discretionary moments to spend on anything—including diagnostic testing. Accordingly, an effective instructionally diagnostic test, if it is going to be employed by many teachers, must be easy to use. If an instructionally diagnostic test is difficult to use—or time-consuming to use—it won’t be used.

If, for example, scoring of students’ responses must be personally carried out by the teacher, then such scoring should require minimal time and very little effort. These days, of course, because of advances in technology, much scoring is done electronically. But if hassles arise when teachers use a test, and most certainly when teachers score a test, those hassles will diminish the test’s use. Instructionally diagnostic tests that are easy to use will tend to be used. Those that aren’t, won’t.

Summing up, instructionally diagnostic tests, if appropriately built, can function as potent tools to help teachers promote their students’ learning. But simply labeling a test as “instructionally diagnostic” does not automatically signify that the test is anything other than an improperly named assessment device.