How is Testing Supposed to Improve Schooling?
NCME Career Award Address
Edward Haertel
Sunday April 15, 2012
[SLIDE 1] Thank you. It was truly an honor to have been chosen as last year's NCME Career Award
recipient, and it is a further honor to speak with all of you this afternoon. I also want to thank Mark
Wilson and Dan Koretz for participating in this session with me. As you can see from my title, I've
chosen a big topic. Measurement professionals are obviously concerned with this question, although
we might frame it a bit more formally. Rather than asking, "How is Testing Supposed to Improve
Schooling?", we would more likely ask, with regard to a specific testing program, "What are the
intended uses or interpretations of the scores obtained from this testing procedure?" Then, we would
frame matters of test validity by asking, "What are the interpretive arguments and supporting validity
arguments for those intended uses or interpretations?" That's proven to be a very helpful framing. It
focuses attention first on the intended use or purpose of the testing, then on the chain of reasoning
from scores to decisions or implications, and then on the evidence supporting or challenging the
propositions in that chain of reasoning.
At the same time, there may be some lingering sense that helpful as this familiar validation
framework has been, it is somewhat incomplete. Cronbach seemed to be reaching for something more
in his 1988 paper offering "Five perspectives on validity argument." Messick expanded our notions of
test validation by directing attention to the consequential aspects of test interpretation and use. In his
2006 chapter on "Validation," Michael Kane touched on this incompleteness in his thoughtful analysis of
potential fallacies in validity arguments. He called out in particular the unwarranted leap from evidence
justifying a particular interpretation of test scores to the assertion that a particular use of test scores
was thereby justified. I'm not sure I have all that much to add to this long-running conversation, but I
would like to offer another way of framing these concerns, in the hope that I may at least provoke some
further reflection.
I've actually been thinking about this question for some time. In 2005, Joan Herman and I co-edited
a yearbook for the National Society for the Study of Education on Uses and Misuses of Data for
Educational Accountability and Improvement, which included a chapter the two of us co-authored,
offering "A Historical Perspective on Validity Arguments for Accountability Testing." That chapter
discussed five different interpretive arguments: instructional grouping according to IQ test scores beginning around the 1920s, criterion-referenced testing in its various forms, minimum-competency testing, the use of tests in evaluating educational programs or curricula, and, finally, accountability testing à la NCLB. Then, I gave a talk at a CRESST conference in 2005 that discussed
seven ways to make education better by testing. In 2008, I gave a talk at ETS on eight theories of action
for testing as an educational policy tool. Then in 2010, I spoke at CTB down in Monterey on "A Dozen
Theories of Action for Testing as an Educational Policy Tool." As you might guess, these talks just kept
getting longer and longer. A linear extrapolation out to 2012 suggests that this talk should cover about
13.6 testing uses. [SLIDE 2]
[SLIDE 3] I'm in fact back down to seven broad purposes again, although this particular list is
different from any I've used before. Also, for today, I've grouped these seven purposes into two
contrasting categories, with four in one category, three in the other. I am going to suggest that the story
lines explaining how testing is supposed to improve schooling can be divided into the categories of
measuring versus influencing. [SLIDE 4] The measuring category is quite familiar. Uses or
interpretations fall into the measuring category when they rely directly on the information scores
provide about measured constructs. Our profession has worked hard to understand and improve the
work of validation for this category, and I think we know how to do it pretty well. The influencing
category is less familiar, and it's murkier. I hope to write more about it in the coming months, and my
ideas will probably change. But for now, here's my working definition: "Influencing" refers to intended
uses of test scores to bring about effects that do not depend directly on the particular outcomes of
testing. I distinguish two subcategories of influencing. [CLICK] First, there are effects that flow from
people's deliberate efforts to raise test scores. Among other examples, these would include students'
efforts to raise their own score as well as teachers' efforts to raise their students' scores. Second,
influencing includes a subcategory of intended effects wherein testing is used to shape perceptions or
understandings, again in ways that do not depend directly on the actual scores. I believe that our
profession has more work to do in better understanding and addressing this "influencing" category of
well-intentioned test uses.
Before delving into details, let me illustrate my measuring and influencing categories with a couple
of examples. First, suppose a teacher gives a weekly spelling test covering that week's spelling word list.
[SLIDE 5] After scoring the test, he notes the words a lot of children missed, so that he can put those
words back on the list again a few weeks later. He also notes each child's total score in his grade book.
The children's papers are returned to them so that they can see how well they did overall, and which
words they got wrong. One reason the children study the weekly spelling word lists is because they
know there will be a test, and they want to do well. This brief example mentions several testing uses,
some from each category. Measuring purposes include grading, providing diagnostic feedback to
students, and planning future instruction. These are all direct uses of test scores or of individual item
responses. One influencing purpose is encouraging students to study their spelling words because they
want to do well on the test. Their studying is of course a deliberate effort to raise test scores. The
teacher might also intend to shape students' perceptions or understandings by using testing to convey
the idea that knowing how to spell is important. In a general way, the amount that students choose to
study might depend on their prior test performance, but I would assert that these influencing purposes
depend only weakly, if at all, on particular outcomes of testing.
My second example is from Michael Kane's "Validation" chapter (2006, p. 57). [SLIDE 6] I think it
captures these ideas nicely, although he uses it as an example of "begging the question." Here's what
he says:
The arguments for [current accountability] testing programs tend to claim that the
program will lead to improvements in school effectiveness and student achievement by
focusing the attention of school administrators, teachers, and students on demanding
content. Yet, the validity arguments developed to support these ambitious claims
typically attend only to the descriptive part of the interpretive argument (and often to
only a part of that). The validity evidence that is provided tends to focus on scoring and
generalization to the content domain for the test. The claim that the imposition of the
accountability requirements will improve the overall performance of schools and
students is taken for granted.
I think you can see my measuring and influencing categories here. Validity evidence tends to focus
on measuring purposes, but the purpose in the first sentence I quoted, "[improving] school effectiveness
and student achievement by focusing … attention … on demanding content," is in my influencing
category.
I've said that we do a better job of validation for purposes in the measuring category than for
purposes in the influencing category. There is another, related way to describe how we tend to fall
short in our validation efforts. [SLIDE 7] Kane lays out four broad stages in most interpretive arguments,
namely scoring, generalization, extrapolation, and decision or implication. In my experience, as
measurement professionals, we direct considerable attention to various technical aspects of alignment
to test specifications, DIF, scaling, norming, and equating, and also to score precision, reliability, or
generalizability. That is to say, we do a good job with the scoring and generalization stages. But, we
devote considerably less attention to stage three, extrapolation beyond the true score, universe score,
or trait estimate, and least attention to the stage four questions of how test scores are actually used or
interpreted.
Of course, we could just say that Kane's decision or implication stage, or my influencing category,
are simply not our concern. [SLIDE 8] The 1999 Standards for Educational and Psychological Testing, as
well as earlier editions, make it clear that test users share responsibility for appropriate test use and
sound test interpretation (p. 111), especially if a test is used for some purpose other than that for which
it was validated (p. 18, Standard 1.4). Despite the good efforts of conscientious test developers to
screen and educate potential users, and despite the guidance provided in test manuals, the Standards
acknowledge that, "appropriate test use and sound interpretation of test scores are likely to remain
primarily the responsibility of the test user" (p. 111). [CLICK]
Nonetheless, I believe that deeply understanding, as well as thoughtfully advocating for, appropriate
test use and sound test interpretation is very much a part of our job as testing experts, and that we
need to pay more attention to the influencing purposes as well as the measuring purposes. There is no
bright line between technical concerns and practice or policy concerns in testing. A full consideration of
validity requires that the interpretive argument be carried all the way through to the end, and that the
full range of intended purposes be considered. It may not work very well for testing experts to see to
the first few steps, scoring, generalizability, and so forth, and then hand off the test for someone else to
worry about whether score-based decisions or inferences are truly justified, whether the test solves the
problem it was intended to solve, and whether it creates unforeseen problems.
Test design and use might benefit from more feedback loops. [SLIDE 9] In a typical linear, sequential
process for a large-scale achievement testing program, one committee designs a curriculum framework,
then another committee uses that framework to craft a test specification, then item writers use the test
specs to guide their work, then another group takes the items and builds field test forms, then after
tryout the psychometricians recommend edits to improve the test's technical properties, and then the
test is delivered as a finished product. Testing experts are pretty much out of the picture at that point,
except for subsequent scoring, scaling, and equating. With a process like that, there are substantial risks
first, that the test itself will fall well short of the vision embraced by the developers of the original
curriculum framework, and second, that questions of how testing and test scores are actually used, how
they are supposed to function, and how they actually function as part of a complex system, will escape
close attention.
It's easy to say, of course, that everything would work better if parties all along the way shared a
common understanding of how actual test use was supposed to be of value. It's more difficult to say
just how that would work or just how it would help. We have a hard enough time simply hewing to a
vision of what it is a test is supposed to measure. Could we also develop and hold to a common
understanding of the decisions the test was supposed to guide, the conclusions it was supposed to
support, the messages it was supposed to convey, and the behavioral changes it was supposed to
encourage? I wish I could say I had a good answer to that question.
[SLIDE 10] I've decided today to focus on aptitude and achievement tests for students, setting aside
tests administered directly to practicing teachers or to prospective teachers. The use of tests given to
students for purposes of teacher evaluation does fall within my purview. I'm also setting aside testing to
diagnose individual students' special needs, although I do include tests of English language proficiency.
I'll have a few things to say about classroom testing, meaning formative or curriculum-embedded
assessment, but most of what I'll be considering will be standardized tests used on a large scale. I
should hasten to add that my remarks are nothing remotely close to an exhaustive review. Think of
aptitude and achievement tests for students as the domain of testing programs I'm sampling from in
order to illustrate my broader purposes.
My analysis is also circumscribed because I can only speak to these concerns as a psychometrician
who has spent his career at a university, in a school of education. I've gotten many important insights
about testing not only from testing experts at other universities, but also from colleagues at testing
companies, in school districts, and in state and federal agencies, as well as from scholars in other
disciplines. As a quick thought experiment, just try to think for a moment about some questions an
anthropologist, or a philosopher, or a political scientist might ask about standardized achievement
testing. For example: Is uniform standardization the best or only way to approach the ideals of fairness
and score comparability? How does testing support belief in a meritocratic system where individual
effort is rewarded? How has testing shaped popular views of the quality of public education? We could
benefit from more cross-disciplinary dialogue about testing. (I should mention that I'm on the editorial
board of the journal Mark Wilson, Paul De Boeck, and Pamela Moss have established, called
Measurement: Interdisciplinary Research and Perspectives. That's one place where some of this kind of
work has appeared.) In my conclusions, I'll return to the idea that if we are to do a better job of
validating the influencing purposes of testing, if we are to carry our validation efforts all the way
through to uses and interpretations, then we're going to need more help from scholars in other
disciplines.
[SLIDE 11] There's just one more preliminary. Before turning to interpretive arguments for
educational testing, I need to comment on the relation between achievement tests and test takers' prior
instruction. Test tasks must be neither completely familiar nor completely unfamiliar. The "completely
familiar" problem is easy to see. It's hard to imagine a cognitive item that could not be reduced to a test
of rote recall by coaching students on that specific question. More subtly, classroom teachers may fool
themselves and their students about their depth of understanding if the students are asked to
demonstrate their knowledge using the same words or examples as they encountered during
instruction. But let me clarify what I mean when I say test tasks must not be completely unfamiliar. An
ideal task for eliciting complex reasoning or problem solving is one that draws upon a rich base of prior
knowledge and skills, but requires using that prior learning in new ways. Reasoning has to be about
something. If I want to investigate students' understanding of irony, say, or foreshadowing, it's really
helpful to know a particular book they've read. That way, I can cite specific events from different parts
of the book and ask how they're connected or why the author placed them in a particular sequence. If I
know the students have read books but I can't assume any particular book, then my test tasks must
instead be little self-contained packages, bundling the question with whatever background material
might be required to arrive at the answer. Such questions can be useful, of course, and can measure
some important kinds of learning. But, I can ask much deeper questions knowing which Shakespeare
play students have read than I can simply knowing that they have read a Shakespeare play. Likewise in
history, knowing what specific events, historical figures, or historical documents students have studied
enables me to ask better questions to measure historical thinking. In ecology, I can ask deeper
questions if I know which particular ecosystem students have learned about. This reality poses a big,
pervasive challenge for standardized testing where there is no common curriculum. The Common Core
State Standards in English language arts, for example, call for students to have read a Shakespeare play,
but don't say which one. In many subject areas, curriculum-neutral assessments cannot drill down very
far on complex reasoning, simply because there is so little that can be assumed concerning specific,
relevant, common background knowledge students should have had a chance to acquire in the
classroom.
So, let me now finally turn to some ways testing is supposed to improve schooling. [SLIDE 12] I
propose that there are seven broad purposes whereby testing is intended to improve educational
outcomes: (1) Instructional Guidance, (2) Student Placement and Selection, (3) Informing Comparisons
Among Educational Approaches, (4) Educational Management, (5) Directing Student Effort, (6) Focusing
the System, and (7) Shaping Public Perceptions. I'll be illustrating these purposes with examples, which
should make it clear that finer distinctions would be possible. It will also be clear that testing and test
scores are often used for several different purposes at the same time. Any taxonomy like this is
somewhat arbitrary; I hope you find mine helpful. For now, let me just offer quick descriptions as an
advance organizer.
[SLIDE 13] Instructional guidance refers to narrowly focused achievement testing for the purpose of
informing instructional decisions. This is the area of formative testing. The primary users of the test
information are teachers and the students themselves. The constructs measured are usually narrowly
focused and closely tied to the curriculum. Interpretations are mostly criterion-referenced, and mostly
at the individual student level.
[SLIDE 14] Student placement and selection should be self-explanatory. This set includes testing for
purposes of guiding decisions about student ability grouping, determining entry into and exit from
classifications like "English learner," and making college admissions decisions, among other uses. I also
place certification tests like high school exit exams into this category, as well as Advanced Placement
and International Baccalaureate tests and examinations.
[SLIDE 15] My third purpose, informing comparisons among educational approaches, covers tests
used as outcome measures in evaluations of curricula or of alternative instructional methods.
[SLIDE 16] Fourth, educational management refers to uses of achievement tests to guide decisions
about teachers or schools. NCLB provisions for identifying schools in need of improvement and
prescribing remedial action would be examples, as would uses of student test scores for teacher
evaluation.
[SLIDE 17] These first four purposes, of course, make up my measuring category. Test
scores are taken as indicators of some underlying construct, and on that basis the scores are used to
guide some decision or draw some implication. How accurately scores reflect underlying constructs is of
the essence here. The four purposes are distinguished by what it is that test scores are used to describe:
individual student's short-term learning progress (category 1), more stable student aptitudes or
summative measures of achievement (category 2), educational materials or approaches (category 3), or
teachers and schools (category 4). My remaining three purposes make up the influencing category. For
these purposes, tests are used in somewhat less direct ways.
[SLIDE 18] Fifth on my list is directing student effort. This covers several possibilities. As Black and
Wiliam document in their 1998 review of formative assessment, how we test and how we communicate
test results can influence what, how, and how much students study. The question, "Is it going to be on
the test?" has become a cliché. At a more macro level, tests with stakes for students like high school
exit exams or college admissions tests will also influence student effort, although not uniformly, and not
always in the ways we might intend.
[SLIDE 19] My sixth category, what I've called focusing the system, refers to various uses of testing
to influence curriculum and instruction in ways that do not depend directly on the particular scores
students earn. Let me give some examples: Participating in test construction or in the scoring of
performance assessments might be regarded as a kind of in-service training for teachers, leading them
over time to teach in different ways. Or, performance assessments themselves might exemplify useful
but underused instructional approaches. Or, high-stakes tests covering just a few school subjects might
encourage teachers to spend more time on those subjects and less time on others. Or, inclusion of
questions about specific topics on a high-stakes test might be intended to help assure that those topics
were addressed in classrooms.
[SLIDE 20] Finally, testing and reports of test results shape popular perceptions, including
perceptions of public education and of the teaching profession. I could have gone on to say that
perceptions matter because they shape actions, like supporting school bond issues or choosing
alternatives to regular public schools, but the story gets complicated enough just going as far as
perceptions. In David Berliner and Bruce Biddle's 1995 book on The Manufactured Crisis, they argued
that test results had been used systematically to bolster perceptions that our public education system
was functioning poorly. As another possible example, I've heard more than once that the principal users
of school accountability scores in California are real estate agents, who can quantify the effect on
neighborhood housing prices of the local school's Academic Performance Index. (I'm not sure about
that example because perceptions of a particular school might arguably be based on specific scores. But
I decided to include it because a school's reputation over time seems not to be closely tied to any
specific individuals' scores on specific tests.)
[SLIDE 21] Here, again, are some short labels to help keep track. The measuring purposes are one,
measuring learning; two, measuring learners; three, measuring methods; and four, measuring actors
(including schools). The influencing purposes are five, influencing learners; six, influencing methods; and
seven, influencing the perceptions of actors outside the school system itself.
Let me say again that these are intended uses. Some, like using high-stakes tests to focus the system
on specific subjects or specific kinds of knowledge and skills, carry substantial risks of unintended
consequences, including score inflation and distortion of curriculum coverage. Daniel Koretz's 2008
book, Measuring Up, offers an excellent review and analysis of evidence concerning these kinds of
effects.
Please note that I do not mean for my influencing purposes to be mapped onto Messick's
"consequential basis of test interpretation and use." As I understand Messick's writing, he was drawing
attention to normative considerations in testing, including the value implications of test interpretations
and the social consequences of test uses. Normative issues arise in connection with my influencing
purposes, of course, as well as with measuring purposes. But normative concerns are not my main focus
right now. My goal this afternoon is to draw attention to intended, anticipated effects of testing with no
direct dependence on the information particular scores provide about underlying constructs. That is
what I mean by "influencing" purposes.
Let me next say a little more about each of these measuring and influencing purposes. [SLIDE 22] I'll
start with my instructional guidance, or formative assessment purpose, which includes the major
purposes for tests teachers create or select for use in their own classrooms. These are the quizzes, unit
tests, midterms and finals used to assign grades and to guide the pacing of instruction. They provide
feedback to students about their own learning and they help teachers plan instruction. Testing what's
just been taught in order to guide further learning may sound straightforward, but of course that
doesn't mean that it is simple or easy to do well. An interpretive argument is helpful in organizing
potential concerns. Following Kane, we might begin with scoring. There must be some way to quantify
examinee performances to capture accurately the dimensions of interest. If scoring keys or rubrics
credit only the particular answers a teacher intended, for example, and discount alternative answers
that are also defensible, then there's a problem at that very first step. Also, scoring should be free from
bias. Unless it's a test of handwriting, papers should not be marked down for poor penmanship alone.
Apart from obvious flaws, perhaps the most critical scoring concern here is whether the collection of
questions adequately samples the intended content domain and whether the questions elicit the kinds of
reasoning deemed important. It has been recognized at least since the time of the Bloom taxonomy in
1956 that teachers tend to write low-level items calling for factual recall or routine application of
learned procedures, rather than more complex questions demanding nonroutine problem solving or so-called higher-order thinking. As an aside, I might add that that's not just a challenge for classroom
teachers. The prevalence of low-level questions in the work of professional item writers was pointed
out in 1984 by Norm Frederiksen in his paper on "The Real Test Bias," and similar concerns were echoed
in a 2007 NAEP Validity Studies Panel report examining the NAEP math exercise pool (Daro, Stancavage,
Ortega, DeStefano, & Linn, 2007).
The second broad step in Kane's interpretive argument is generalization. This is the traditional
concern of test reliability, asking to what extent the scores obtained from the observed performance on
a particular occasion are representative of the many scores that might have been obtained with
alternate forms, with different scorers, on different occasions, and so forth. For classroom tests serving
the instructional guidance function, there may be no need to standardize administration or scoring
across different times or places, but it is still important to consider whether tests are long enough and
well enough designed that scores have adequate precision.
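As a purely illustrative aside, the standard Spearman-Brown prophecy formula gives a rough sense of what "long enough" means here (this is textbook classical test theory, not anything specific to these classroom tests). If a test with reliability \rho is lengthened by a factor of k using comparable items, the projected reliability is

\rho_k = \frac{k\,\rho}{1 + (k - 1)\,\rho},

so a short quiz with reliability .50 would be projected to reach about .75 if tripled in length, assuming the added items behave like the originals.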
The third step, extrapolation, asks whether test scores are accurate indicators of proficiency or
performance in some larger domain of situations, beyond the test or testing replications. We might ask,
for example, whether the test scores are predictive of performance on different kinds of tasks that call
for the same skills, including situations outside the classroom.
The fourth step, decision or implication, addresses the ways test scores are actually used: What is
going to be done differently or understood differently based on the score? Black and Wiliam emphasize
that by their definition, assessment becomes formative only when it is actually used to adapt teaching
work to meet learning needs. Decisions to provide extra review and practice or to advance a student to
a higher-level course would be examples. Other possible decisions might include reteaching some
material, skipping ahead in the syllabus, or contacting a student's parents concerning academic
progress. If there is a formal decision rule determining what is to be done with the scores, involving a
passing score, for example, then that rule itself requires scrutiny. In practice, of course, it may be quite
unclear what different actions are indicated by alternative testing outcomes. Data are of little use if
teachers simply don't know what to do with them, or if they lack the time and other affordances needed
to take action. "Implication" refers here to some inference or conclusion to be drawn. This might
perhaps be conveyed by assigning a grade. Again, there is a range of concerns, including the
appropriateness of the decision rule.
I've started with classroom testing for the purpose of instructional guidance to elaborate a little on
how an interpretive argument can be helpful in thinking through testing concerns. You may know of
instances where teachers were provided with data that were intended to be useful for guiding
instruction, but the interpretive argument wasn't quite carried through to the final step. It may actually
not be very helpful simply to provide teachers with reports of student test performance, however nicely
formatted, on the assumption that they already know or will figure out how to make use of those
reports and that they have the capacity to take constructive action.
My instructional guidance purpose also covers more formal testing systems [SLIDE 23] where the
material to be learned is broken down into many narrow learning objectives with a prescribed sequence,
with a large bank of tests covering those learning objectives. Students' progress through the learning
progression is controlled by their success on the corresponding tests. This has been tried many times.
The earliest example I know of was the Winnetka Plan, created by Carleton Washburne and his colleagues in the Winnetka, Illinois public schools in the early 1920s. This was an elementary math program
that supplemented a textbook with mimeographed worksheets and tests. Pupils could take self-administered tests of each sequential objective, and then when they were ready, they could request a
teacher-administered test. The date on which each pupil mastered each objective was recorded.
Decades later, this same pattern was found in Programmed Instruction, where content was
presented in frames, each followed by test questions with branching to one frame or another according
to whether the test score indicated a need for more work versus readiness to move on. Still later,
Benjamin Bloom's Mastery Learning featured a similar model, using what he called formative tests,
similar to the criterion-referenced tests appearing from the late 1960s into the early 1980s. The
Pittsburgh Learning Research and Development Center's Individually Prescribed Instruction Mathematics
Project offered yet another illustration. Measurement-driven instruction worked best for those areas of
the curriculum that could be readily analyzed into a sequence of small independent units that built upon
earlier units. Teaching basic arithmetic or phonics are good examples. It was less successful teaching
more advanced content, where the precise patterns of reasoning required are less predictable.
[SLIDE 24] These more formal instructional guidance models mostly finessed scoring, the first step
of the interpretive argument, by making the constructs measured more-or-less isomorphic with
answering the test questions. This fit well with the perspective of behaviorist psychology, which sought
to avoid inferences about the psychological processes underlying observable behaviors. Robert Mager's
influential book, "Preparing Instructional Objectives," first published in 1962 with a second edition in
1984, coached teachers on how to formulate instructional objectives in observable terms, using verbs
like "arrange," "define," or "label" rather than verbs like "knows" or "understands." Well-written
instructional objectives were readily translated into criterion-referenced tests.
Generalization, likewise, was not much of a problem. If a test consists of highly similar items asking
students to do pretty much the same thing over and over, then any given student's performance is likely
to be quite consistent from one test form to another, or even from one item to another on the same
test. I've seen U-shaped score distributions on criterion-referenced tests, clearly distinguishing students
who had mastered some narrow objective from those who had not. That said, reliability coefficients for
criterion-referenced tests were often low simply because there was little observed-score variance.
Reliability coefficients fit well with norm-referenced test interpretations, but are often less useful with
criterion-referenced tests. The standard error of measurement is a better option.
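To make that concrete with the standard classical test theory identities (again, textbook results, not claims about any particular test), write \sigma_T^2 for true-score variance and \sigma_E^2 for error variance. Then

\rho_{XX'} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}, \qquad \mathrm{SEM} = \sigma_X \sqrt{1 - \rho_{XX'}} = \sigma_E.

When nearly every student has mastered a narrow objective, \sigma_T^2 shrinks toward zero, so the reliability coefficient can look dismal even though the SEM stays small and the scores remain precise enough to support a mastery decision.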
Extrapolation, the third step, was more problematical for criterion-referenced testing. As Lauren
and Daniel Resnick explained in their important 1992 chapter on "Assessing the Thinking Curriculum,"
the measurement-driven instructional model and its behaviorist underpinnings relied on assumptions of
decomposability and decontextualization. There was often insufficient attention to the question of
whether students could summon up the right pieces of learning in new contexts and put them together
in different ways to arrive at solutions to unfamiliar problems.
Decision or implication, the fourth step, typically relied on some simple cut score, usually defined by
a required percent correct, to classify students as having mastered or not having mastered the tested
objective. There was some published work on the problem of formulating these decision rules, but
general guidelines like "Mastery is 80% correct" were often adopted uncritically. And, as with the
example I cited earlier from Michael Kane's chapter, the efficacy of the use of these tests for purposes of
instructional guidance was too often taken for granted, with validation efforts focused instead on score
interpretation.
Educational psychology has come a long way since the era of behaviorism, and I think it would be a
mistake to try and resurrect these testing and instructional approaches. Nonetheless, I would like to
speak for a moment to some positive features of measurement-driven instructional models. The fact
that they did not work for everything doesn't mean they were not useful for some things. The tests
used in these systems were closely aligned to a specific curriculum, and it was always clear that the
curriculum came first and the tests, second. It was clear what the tests were supposed to be used for.
Students either moved on directly to the next small unit of instruction or else they were given help
before moving on. These many separate instructional decisions for each pupil may have been overly
mechanical, but the systems were workable. The communications around test performance were all
about mastery, not ranking or comparing one student to another. We could do worse. The instructional
guidance purpose lives on in interim or benchmark assessments, of course, but as I understand it, these
are generally designed by working backwards from the summative test, and are administered at fixed
intervals. That's a very different model.
[SLIDE 25] I'd next like to turn from instructional guidance to purposes of student placement and
selection, shifting from measuring learning to measuring learners. You'll see in my examples here that
there is a big difference in the kinds of tests used. Aptitude tests make an appearance, and where
achievement tests are used, they are, for one thing, tests of much broader achievement constructs, and,
for another thing, used like aptitude tests, to predict future performance or readiness to profit from
some instructional setting much more remote than the next small lesson. When one measures learning,
one hopes to see rapid change. When one measures learners for purposes of placement or selection,
one generally treats the characteristics measured as if they were relatively stable over time.
Perhaps the extreme view of aptitude as fixed and immutable was represented in the student IQ
testing used early in the last century. Following on the perceived success of the Army Alpha during
World War I, IQ tests were widely used for tracking students.
My previous purpose, instructional guidance, was predicated on the notion that almost all children
could master the curriculum if given the supports they needed along the way. [SLIDE 26] IQ-based
tracking, which I am using to illustrate purposes of student placement and selection, was predicated on
the notion that children differed substantially in their capacities for school learning, and so unequal
outcomes were inevitable. Tracking was intended to make schooling more efficient by giving teachers
more homogeneous groups of students to work with. That way, slower children would not be pushed to
the point of frustration, and quicker children would not be held back. There's more to be said, of
course, about the reasons this instructional model emerged when it did. [SLIDE 27] IQ tests seemed to
confirm prevailing stereotypes about racial and ethnic differences in intelligence, and provided an all-too-convenient explanation for what we today refer to as between-group achievement gaps. This
history is laid out in Paul D. Chapman's 1988 book on Schools as Sorters. Chapman points out that
tracking was already widespread before IQ tests became available, but the tests supported the practice,
made it seem more objective and scientific, and fit comfortably with prevailing beliefs about
intelligence and individual differences.
Whereas tests used for instructional guidance are often carefully integrated into the school
curriculum, IQ tests were intended to have zero dependence on the school curriculum. They were
designed to tap into different aspects of general mental ability, typically using progressively more
difficult tasks of a given type to determine how well the examinee could perform. [SLIDE 28] Scoring
rules were carefully worked out, and generalization was shown by high correlations between alternate
test forms as well as high stability over time. Extrapolation was assumed on the basis of a strong theory
about the importance of intelligence in determining fitness for different occupations and other aspects
of adult life as well as success in school. At the time, however, the implications drawn from IQ test
scores and the justification for decisions based on those scores were largely unexamined. It was simply
assumed that intelligence was largely inborn and little affected by age or experience. Yet again,
validation generally stopped short of actual use.
Just as psychology has moved beyond the behaviorist notions underlying elaborate formal models
for instructional guidance, it has also moved beyond the strongly hereditarian views that justified IQ
testing. Today, environmental influences on IQ test scores are better understood, and the idea of
"culture-free" testing has long since been abandoned. It is also more generally recognized that tracking
can lead to a self-fulfilling prophecy whereby the groups offered a stronger, faster-paced curriculum
progress more quickly than their peers in lower academic tracks. I would certainly not advocate a return
to IQ-based tracking, but this model, too, may offer some useful lessons. Every teacher knows that
some students tend to catch on more quickly than others. Students have different patterns of strengths
and weaknesses, of course, but there are also general tendencies, and on average, students stronger in
one subject area are likely to be stronger in other subject areas, as well. There seems to be no
acknowledgement of this reality in the current rhetoric of school reform. I want to say clearly that of
course I believe all children can learn, and nearly all have the potential to become literate, successful
members of society. But the mantra of "all children can learn," which morphs into an insistence on a
common, high standard for all children, and the seeming denial of large individual differences in
aptitude for school work have distorted education policy.
[SLIDE 29] Before leaving the area of student placement and selection purposes, I should mention
quickly that some of the other testing applications grouped here do feature strong connections between
the test and the curriculum, with AP and IB exams as prime examples.
Instructional guidance had to do with measuring learning, and student placement and selection
have to do with measuring learners. My third broad set of purposes has to do with measuring methods.
[SLIDE 30] These are purposes of testing aimed at informing comparisons among instructional
approaches. This sort of testing application can be traced back at least to the Project Head Start evaluations mandated by the Elementary and Secondary Education Act (ESEA) of 1965 and the large-scale evaluations of curricula created with NSF funding in the post-Sputnik era, around that same time.
More recently, the National Diffusion Network, in existence from 1974 to 1995, attempted to synthesize
evaluations documenting successful educational approaches, and the What Works Clearinghouse,
established in 2002, has a similar mission. These are examples of testing applications in the sense that
curriculum or program evaluations use achievement tests as outcomes. The purpose in each case is to
compare methods and see which works best. This is one case where the interpretive argument really is
carried through to the end, or almost to the end. There is usually a final, taken-for-granted assumption
that the approach found superior in the sites involved in the evaluation will also give superior results if it
is widely disseminated, and of course there is no guarantee. These "informing comparisons" purposes
are best implemented with randomized controlled trials, but the category would also include testing
applications in studies with a wide range of research designs. The 2011 paper by Tom Dee and Brian
Jacob in the Journal of Policy Analysis and Management, on "The impact of No Child Left Behind on
student achievement," offers an elegant example in which NAEP data were used for the purpose of
evaluating the NCLB approach to school accountability.
My fourth purpose, what I've called educational management, comprises applications where
student achievement tests are used to derive scores for schools or teachers, either to identify
problematical cases requiring intervention or to identify exemplary cases from which others might learn.
[SLIDE 31] Note that this purpose is in my measuring category. Decisions and implications based
directly on school and teacher scores are in fact key to accountability models' theories of action.
However, I think it's fair to say that measuring, in this sense, is only part of the story. These
accountability models are intended to affect the educational system not only via measurement-driven
management decisions, but also by focusing the system and possibly by shaping public perceptions—
purposes I've placed in the influencing category. The NCLB accountability model is a prominent
example. Here, student test scores from annual state-level testing systems are used in a complicated
way to calculate scores for schools and then to determine whether each school is meeting a goal called
Adequate Yearly Progress. There is a rational mechanism whereby schools not meeting AYP targets are
supposed to receive remediation leading to improvement. However, a primary purpose of
accountability testing under NCLB—a major piece of the rationale whereby NCLB was supposed to
improve schooling—would be situated in my sixth category of focusing the system, over in the right-hand, Influencing column. These two purposes of educational management and focusing the system
interact here, in that the pressure to focus is most intense for schools failing to meet AYP.
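For those who have not worked with these systems, here is a deliberately simplified sketch of the basic AYP logic: percent proficient is compared with an annual target, subgroup by subgroup. The actual NCLB rules also involved participation rates, minimum subgroup sizes, "safe harbor" provisions, and confidence intervals, none of which is modeled here, and the names in the sketch are mine rather than anything drawn from a real state system.

# A deliberately simplified, hypothetical sketch of an AYP-style determination.
# Omitted: participation rates, minimum subgroup sizes, "safe harbor," and the
# confidence-interval adjustments used in actual state implementations.

def percent_proficient(scores, cut_score):
    """Share of students scoring at or above the proficiency cut score."""
    return sum(s >= cut_score for s in scores) / len(scores)

def makes_ayp(subgroup_scores, cut_score, annual_target):
    """A school 'makes AYP' only if every reported subgroup meets the target."""
    return all(
        percent_proficient(scores, cut_score) >= annual_target
        for scores in subgroup_scores.values()
    )

# Made-up scores for two reporting groups, purely for illustration.
school = {
    "all_students": [310, 345, 290, 360, 330, 300],
    "english_learners": [280, 300, 295, 310],
}
print(makes_ayp(school, cut_score=300, annual_target=0.60))  # prints False

One consequence of the all-subgroups rule, visible even in this toy version, is that larger, more diverse schools face more ways to miss the target.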
Staying with educational management for the moment, perhaps the most prominent applications
these days are "value-added" models for teacher evaluation. [SLIDE 32] Value-added models partake of
educational management purposes when scores are used to identify specific teachers for follow-up
actions, which might include among others, merit pay, mentoring, or dismissal. The scoring step of the
interpretive argument is familiar, although with some models, vertical scaling assumptions place more
stringent demands on scores for VAM than for other testing uses. The generalization step has an extra
wrinkle, because when student scores over time are used in the aggregate to construct scores for
teachers, reliability of individual student-level scores becomes less important, while other sources of
error, related to student assignment and sampling, must also be considered. As with most of the testing
uses I've mentioned this afternoon, validation efforts seem to have focused primarily on these first two
steps of the interpretive argument, with limited attention to extrapolation—the question of how
strongly value-added scores relate to broader notions of teaching quality or effectiveness; and decision
or implication—the question of whether actions taken on the basis of value-added scores have the
intended effects.
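To fix ideas, a generic covariate-adjustment version of such a model might be sketched, in my own notation rather than that of any particular operational system, as

y_{it} = \beta\, y_{i,t-1} + \mathbf{x}_{it}'\boldsymbol{\gamma} + \theta_{j(i,t)} + \varepsilon_{it},

where y_{it} is student i's score in year t, \mathbf{x}_{it} collects student covariates, \theta_{j(i,t)} is the effect attributed to the student's current teacher j (the "value-added" score), and \varepsilon_{it} is a residual. Because \theta_j is estimated by pooling information across all of a teacher's students, noise in any one student's score matters less than it would for individual reporting, while the error arising from which and how many students happen to be assigned to that teacher becomes a dominant concern, which is the generalization point just noted.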
I said that NCLB used testing for both category four purposes of educational management and
category six purposes of focusing the system. I'd say the same about value-added models for teacher
evaluation. Certainly the use of student achievement test scores for high-stakes evaluation of teachers
is likely to focus teacher effort on raising test scores. But the purposes for teacher value-added models
may also cross over into my seventh category, shaping public perceptions. I think it is very likely that
highly publicized reports of individual teachers' value-added scores, and the highly publicized reactions
to these reports coming from teachers' unions and from various political actors are influencing public
perceptions of teaching. I suspect that in some cases this may be an intended effect, but that could be
hard to document.
At this point, I've given some examples of testing applications under each of the four broad
purposes in my measuring category. In some cases, I've also referred to purposes in my influencing
category. [SLIDE 33] My colleague Elliot Eisner quipped long ago that the quickest way to precipitate an
educational crisis was to give a test and announce that half the children were below average. Elliot's
comment notwithstanding, I could not think of any testing applications that were solely in the
"influencing" category, except for the bogus surveys one gets in the mail sometimes, purporting to ask
for opinions with the true purpose of shaping opinions or raising money. Nonetheless, I believe these
purposes may be more significant than measuring, narrowly conceived, in justifying and sustaining many
large-scale testing programs.
I did not say much about my seventh category, shaping public perceptions, except to speculate that
that might possibly be a hidden purpose for teacher value-added models. I want to mention just one
more example, also plausible but speculative. I believe that one political purpose for accountability
testing is to shape a public perception that an elected leader cares about education and is doing
something about it. [SLIDE 34] Many presidents and governors, from both parties, have embraced the
cause of educational improvement. As Robert Linn (2000, p. 4) explained it, "Test results can be
reported to the press. Poor results in the beginning are desirable for policymakers who want to show
they have had an effect. Based on past experience, policymakers can reasonably expect increases in
scores in the first few years of a program … with or without real improvement in the broader
achievement constructs that tests and assessments are intended to measure. The resulting overly rosy
picture that is painted by short-term gains observed in most new testing programs gives the impression
of improvement right on schedule for the next election." Testing can be a form of symbolic action. How
many more times will we hear about a new testing program to identify low-performing groups so that
remediation can be targeted? Are the results ever really a surprise?
In my last few minutes, I want to speak to both the challenge and the importance of attending to
these influencing purposes of testing programs. [SLIDE 35] Importance is clear. Most obviously, the
influencing purposes are often key to the interpretive argument. They come at the end of the chain of
reasoning, often serving as the ultimate rationale for the testing program. Doing a really good job on
validity studies supporting the scoring and generalization stages may be gratifying to the writers of
technical reports (if there is anything gratifying about writing technical reports), but the chain of
reasoning is as strong as its weakest link. The argument simply has to be carried through to its
conclusion.
Examining these influencing purposes is also important because it is in connection with these
purposes that many unintended consequences of testing may arise. Using tests to focus the system is a
prime example. When curriculum-neutral tests are used to focus the system on key learning objectives,
the result may be superficial coverage of just that subset of learning objectives included in the test
specification. If we had well worked out, widely understood and widely used methods for studying
focusing effects, these unintended consequences would receive more attention and might in time come
to be better understood by policy makers and the public at large.
The difficulties of examining these "influencing" purposes of testing programs are formidable. First,
these purposes may not be clearly articulated. Second, necessary data may not become available until
after a testing program has been implemented. Third, the research methods needed to carry out the
required studies may be unfamiliar to testing professionals. Fourth, the agencies and actors best
positioned to carry out this work have disincentives for doing so. Fifth, this work is expensive, and it
may not be clear that the answers will really matter. Nonetheless, acknowledging the challenges, I think
we can and should do a better job than we have historically.
The lack of clarity around influencing purposes is not insurmountable. Some implicit purposes can
certainly be ignored. If a politician believes that sponsoring a testing program will garner votes in the
next election, that need not be a measurement concern. Nonetheless, major purposes can generally be
discerned, and stating these purposes more clearly might actually be helpful. We can at least push for
these purposes to be better thought through. [SLIDE 36] The procurement for state consortia's
comprehensive assessment systems under Race to the Top required that applications include "a theory
of action that describes in detail the causal relationships between specific actions or strategies in the
eligible applicant’s proposed project and its desired outcomes for the proposed project, including
improvement in student achievement and college- and career-readiness."
Data availability could be greatly improved if evaluation studies were planned before new testing
initiatives were launched and baseline data were collected in advance. [SLIDE 37] That data collection
might employ audit tests aligned to outcomes, but could also include surveys on content coverage or
allocation of instructional time to different subject areas, for example. If evaluation is an afterthought,
it may be too late to do it properly. Another helpful approach would be phased implementation. If
changes everywhere in a state are made at the same time, it is much harder to isolate the effects of
those changes on schooling outcomes.
As to required research methods, [SLIDE 38] this is a place where we could use some help from
colleagues in other disciplines. Our field draws most heavily from the disciplines of statistics and
psychology. We could use some help here from sociologists, anthropologists, economists, linguists,
curriculum specialists, and teacher educators, among many others. Some progress might be made
simply by thinking about school principals, teachers, and students as rational actors and reasoning
through their likely responses to the incentive structures testing shapes. But, we're going to need more
than common sense. As Messick (1989) said long ago, validity must be supported by both empirical
evidence and theoretical rationales. We have bodies of theory to get us through the descriptive stages
of our interpretive arguments—we can figure out what test scores mean. But when it comes to the
ways testing is actually supposed to function out in the world, our theories are impoverished. We may
borrow a bit from economics, as when test scores provide the information essential for rational
consumer choice, but we need more help from organizational theory, social psychology, and elsewhere.
My last two challenges had to do with costs and incentives for doing a better job. [SLIDE 39] This is
not a new problem. It's all well and good to lament the confirmationist bias in test validation, as in
much social science research. It is understandable that testing companies and state education agencies
might be reluctant to subject their tests and their policies to serious, critical scrutiny. But evaluation
requirements built into testing procurements might help. As one example, I mentioned the
procurement for SMARTER Balanced and PARCC. The evaluative criteria for those proposals allocated
30 points to bidders' Research and Evaluation Plans, which required not only research on the properties
of the tests themselves, but also a plan "for determining whether the assessments are being
implemented as designed and the theory of action is being realized, including whether the intended
effects on individuals and institutions are being achieved" (U.S. Department of Education, 2010).
Obviously, the two consortia each submitted plans responsive to the RFP's requirements. Similar
requirements in requests for proposals from other agencies would be a step in the right direction. With
experience, such evaluation requirements might be elaborated and refined to the point where serious
attention to influencing as well as measuring purposes of testing, and serious attention to actual uses
and their consequences as well as to score interpretations, became an expected part of test validation.
That will not happen overnight, but it is a goal worth striving for.
[SLIDE 40] Thank you.