How is Testing Supposed to Improve Schooling?
NCME Career Award Address
Edward Haertel
Sunday, April 15, 2012

[SLIDE 1] Thank you. It was truly an honor to have been chosen as last year's NCME Career Award recipient, and it is a further honor to speak with all of you this afternoon. I also want to thank Mark Wilson and Dan Koretz for participating in this session with me. As you can see from my title, I've chosen a big topic. Measurement professionals are obviously concerned with this question, although we might frame it a bit more formally. Rather than asking, "How is Testing Supposed to Improve Schooling?", we would more likely ask, with regard to a specific testing program, "What are the intended uses or interpretations of the scores obtained from this testing procedure?" Then, we would frame matters of test validity by asking, "What are the interpretive arguments and supporting validity arguments for those intended uses or interpretations?" That's proven to be a very helpful framing. It focuses attention first on the intended use or purpose of the testing, then on the chain of reasoning from scores to decisions or implications, and then on the evidence supporting or challenging the propositions in that chain of reasoning. At the same time, there may be some lingering sense that, helpful as this familiar validation framework has been, it is somewhat incomplete. Cronbach seemed to be reaching for something more in his 1988 paper offering "Five perspectives on validity argument." Messick expanded our notions of test validation by directing attention to the consequential aspects of test interpretation and use. In his 2006 chapter on "Validation," Michael Kane touched on this incompleteness in his thoughtful analysis of potential fallacies in validity arguments. He called out in particular the unwarranted leap from evidence justifying a particular interpretation of test scores to the assertion that a particular use of test scores was thereby justified. I'm not sure I have all that much to add to this long-running conversation, but I would like to offer another way of framing these concerns, in the hope that I may at least provoke some further reflection. I've actually been thinking about this question for some time. In 2005, Joan Herman and I co-edited a yearbook for the National Society for the Study of Education on Uses and Misuses of Data for Educational Accountability and Improvement, which included a chapter the two of us co-authored, offering "A Historical Perspective on Validity Arguments for Accountability Testing." That chapter discussed about five different interpretive arguments—instructional grouping according to IQ test scores beginning around the 1920s, for example, or criterion-referenced testing in its various forms, or minimum-competency testing, or using tests in evaluating educational programs or curricula, ending up with accountability testing à la NCLB. Then, I gave a talk at a CRESST conference in 2005 that discussed seven ways to make education better by testing. In 2008, I gave a talk at ETS on eight theories of action for testing as an educational policy tool. Then in 2010, I spoke at CTB down in Monterey on "A Dozen Theories of Action for Testing as an Educational Policy Tool." As you might guess, these talks just kept getting longer and longer. A linear extrapolation out to 2012 suggests that this talk should cover about 13.6 testing uses.
[SLIDE 2] [SLIDE 3] I'm in fact back down to seven broad purposes again, although this particular list is different from any I've used before. Also, for today, I've grouped these seven purposes into two contrasting categories, with four in one category, three in the other. I am going to suggest that the story lines explaining how testing is supposed to improve schooling can be divided into the categories of measuring versus influencing. [SLIDE 4] The measuring category is quite familiar. Uses or interpretations fall into the measuring category when they rely directly on the information scores provide about measured constructs. Our profession has worked hard to understand and improve the work of validation for this category, and I think we know how to do it pretty well. The influencing category is less familiar, and it's murkier. I hope to write more about it in the coming months, and my ideas will probably change. But for now, here's my working definition: "Influencing" refers to intended uses of test scores to bring about effects that do not depend directly on the particular outcomes of testing. I distinguish two subcategories of influencing. [CLICK] First, there are effects that flow from people's deliberate efforts to raise test scores. Among other examples, these would include students' efforts to raise their own score as well as teachers' efforts to raise their students' scores. Second, influencing includes a subcategory of intended effects wherein testing is used to shape perceptions or understandings, again in ways that do not depend directly on the actual scores. I believe that our profession has more work to do in better understanding and addressing this "influencing" category of well-intentioned test uses. Before delving into details, let me illustrate my measuring and influencing categories with a couple of examples. First, suppose a teacher gives a weekly spelling test covering that week's spelling word list. [SLIDE 5] After scoring the test, he notes the words a lot of children missed, so that he can put those words back on the list again a few weeks later. He also notes each child's total score in his grade book. The children's papers are returned to them so that they can see how well they did overall, and which words they got wrong. One reason the children study the weekly spelling word lists is because they know there will be a test, and they want to do well. This brief example mentions several testing uses, some from each category. Measuring purposes include grading, providing diagnostic feedback to students, and planning future instruction. These are all direct uses of test scores or of individual item responses. One influencing purpose is encouraging students to study their spelling words because they want to do well on the test. Their studying is of course a deliberate effort to raise test scores. The teacher might also intend to shape students' perceptions or understandings by using testing to convey the idea that knowing how to spell is important. In a general way, the amount that students choose to study might depend on their prior test performance, but I would assert that these influencing purposes depend only weakly, if at all, on particular outcomes of testing. My second example is from Michael Kane's "Validation" chapter (2006, p. 57). [SLIDE 6] I think it captures these ideas nicely, although he uses it as an example of "begging the question."
Here's what he says: The arguments for [current accountability] testing programs tend to claim that the program will lead to improvements in school effectiveness and student achievement by focusing the attention of school administrators, teachers, and students on demanding content. Yet, the validity arguments developed to support these ambitious claims typically attend only to the descriptive part of the interpretive argument (and often to only a part of that). The validity evidence that is provided tends to focus on scoring and generalization to the content domain for the test. The claim that the imposition of the accountability requirements will improve the overall performance of schools and students is taken for granted. I think you can see my measuring and influencing categories here. Validity evidence tends to focus on measuring purposes, but the purpose in the first sentence I quoted, "[improving] school effectiveness and student achievement by focusing … attention … on demanding content," is in my influencing category. I've said that we do a better job of validation for purposes in the measuring category than for purposes in the influencing category. There is another, related way to describe how we tend to fall short in our validation efforts. [SLIDE 7] Kane lays out four broad stages in most interpretive arguments, namely scoring, generalization, extrapolation, and decision or implication. In my experience, as measurement professionals, we direct considerable attention to various technical aspects of alignment to test specifications, DIF, scaling, norming, and equating, and also to score precision, reliability, or generalizability. That is to say, we do a good job with the scoring and generalization stages. But, we devote considerably less attention to stage three, extrapolation beyond the true score, universe score, or trait estimate, and least attention to the stage four questions of how test scores are actually used or interpreted. Of course, we could just say that Kane's decision or implication stage, or my influencing category, are simply not our concern. [SLIDE 8] The 1999 Standards for Educational and Psychological Testing, as well as earlier editions, make it clear that test users share responsibility for appropriate test use and sound test interpretation (p. 111), especially if a test is used for some purpose other than that for which it was validated (p. 18, Standard 1.4). Despite the good efforts of conscientious test developers to screen and educate potential users, and despite the guidance provided in test manuals, the Standards acknowledge that, "appropriate test use and sound interpretation of test scores are likely to remain primarily the responsibility of the test user" (p. 111). [CLICK] Nonetheless, I believe that deeply understanding, as well as thoughtfully advocating for, appropriate test use and sound test interpretation is very much a part of our job as testing experts, and that we need to pay more attention to the influencing purposes as well as the measuring purposes. There is no bright line between technical concerns and practice or policy concerns in testing. A full consideration of validity requires that the interpretive argument be carried all the way through to the end, and that the full range of intended purposes be considered.
It may not work very well for testing experts to see to the first few steps, scoring, generalizability, and so forth, and then hand off the test for someone else to worry about whether score-based decisions or inferences are truly justified, whether the test solves the problem it was intended to solve, and whether it creates unforeseen problems. Test design and use might benefit from more feedback loops. [SLIDE 9] In a typical linear, sequential process for a large-scale achievement testing program, one committee designs a curriculum framework, then another committee uses that framework to craft a test specification, then item writers use the test specs to guide their work, then another group takes the items and builds field test forms, then after tryout the psychometricians recommend edits to improve the test's technical properties, and then the test is delivered as a finished product. Testing experts are pretty much out of the picture at that point, except for subsequent scoring, scaling, and equating. With a process like that, there are substantial risks: first, that the test itself will fall well short of the vision embraced by the developers of the original curriculum framework, and second, that questions of how testing and test scores are actually used, how they are supposed to function, and how they actually function as part of a complex system, will escape close attention. It's easy to say, of course, that everything would work better if parties all along the way shared a common understanding of how actual test use was supposed to be of value. It's more difficult to say just how that would work or just how it would help. We have a hard enough time simply hewing to a vision of what a test is supposed to measure. Could we also develop and hold to a common understanding of the decisions the test was supposed to guide, the conclusions it was supposed to support, the messages it was supposed to convey, and the behavioral changes it was supposed to encourage? I wish I could say I had a good answer to that question. [SLIDE 10] I've decided today to focus on aptitude and achievement tests for students, setting aside tests administered directly to practicing teachers or to prospective teachers. The use of tests given to students for purposes of teacher evaluation does fall within my purview. I'm also setting aside testing to diagnose individual students' special needs, although I do include tests of English language proficiency. I'll have a few things to say about classroom testing, meaning formative or curriculum-embedded assessment, but most of what I'll be considering will be standardized tests used on a large scale. I should hasten to add that my remarks are nothing remotely close to an exhaustive review. Think of aptitude and achievement tests for students as the domain of testing programs I'm sampling from in order to illustrate my broader purposes. My analysis is also circumscribed because I can only speak to these concerns as a psychometrician who has spent his career at a university, in a school of education. I've gotten many important insights about testing not only from testing experts at other universities, but also from colleagues at testing companies, in school districts, and in state and federal agencies, as well as from scholars in other disciplines. As a quick thought experiment, just try to think for a moment about some questions an anthropologist, or a philosopher, or a political scientist might ask about standardized achievement testing.
For example: Is uniform standardization the best or only way to approach the ideals of fairness and score comparability? How does testing support belief in a meritocratic system where individual effort is rewarded? How has testing shaped popular views of the quality of public education? We could benefit from more cross-disciplinary dialogue about testing. (I should mention that I'm on the editorial board of the journal Mark Wilson, Paul De Boeck, and Pamela Moss have established, called Measurement: Interdisciplinary Research and Perspectives. That's one place where some of this kind of work has appeared.) In my conclusions, I'll return to the idea that if we are to do a better job of validating the influencing purposes of testing, if we are to carry our validation efforts all the way through to uses and interpretations, then we're going to need more help from scholars in other disciplines. [SLIDE 11] There's just one more preliminary. Before turning to interpretive arguments for educational testing, I need to comment on the relation between achievement tests and test takers' prior instruction. Test tasks must be neither completely familiar nor completely unfamiliar. The "completely familiar" problem is easy to see. It's hard to imagine a cognitive item that could not be reduced to a test of rote recall by coaching students on that specific question. More subtly, classroom teachers may fool themselves and their students about their depth of understanding if the students are asked to demonstrate their knowledge using the same words or examples as they encountered during instruction. But let me clarify what I mean when I say test tasks must not be completely unfamiliar. An ideal task for eliciting complex reasoning or problem solving is one that draws upon a rich base of prior knowledge and skills, but requires using that prior learning in new ways. Reasoning has to be about something. If I want to investigate students' understanding of irony, say, or foreshadowing, it's really helpful to know a particular book they've read. That way, I can cite specific events from different parts of the book and ask how they're connected or why the author placed them in a particular sequence. If I know the students have read books but I can't assume any particular book, then my test tasks must instead be little self-contained packages, bundling the question with whatever background material might be required to arrive at the answer. Such questions can be useful, of course, and can measure some important kinds of learning. But, I can ask much deeper questions knowing which Shakespeare play students have read than I can simply knowing that they have read a Shakespeare play. Likewise in history, knowing what specific events, historical figures, or historical documents students have studied enables me to ask better questions to measure historical thinking. In ecology, I can ask deeper questions if I know which particular ecosystem students have learned about. This reality poses a big, pervasive challenge for standardized testing where there is no common curriculum. The Common Core State Standards in English language arts, for example, call for students to have read a Shakespeare play, but don't say which one. In many subject areas, curriculum-neutral assessments cannot drill down very far on complex reasoning, simply because there is so little that can be assumed concerning specific, relevant, common background knowledge students should have had a chance to acquire in the classroom.
So, let me now finally turn to some ways testing is supposed to improve schooling. [SLIDE 12] I propose that there are seven broad purposes whereby testing is intended to improve educational outcomes: (1) Instructional Guidance, (2) Student Placement and Selection, (3) Informing Comparisons Among Educational Approaches, (4) Educational Management, (5) Directing Student Effort, (6) Focusing the System, and (7) Shaping Public Perceptions. I'll be illustrating these purposes with examples, which should make it clear that finer distinctions would be possible. It will also be clear that testing and test scores are often used for several different purposes at the same time. Any taxonomy like this is somewhat arbitrary; I hope you find mine helpful. For now, let me just offer quick descriptions as an advance organizer. [SLIDE 13] Instructional guidance refers to narrowly focused achievement testing for the purpose of informing instructional decisions. This is the area of formative testing. The primary users of the test information are teachers and the students themselves. The constructs measured are usually narrowly focused and closely tied to the curriculum. Interpretations are mostly criterion-referenced, and mostly at the individual student level. [SLIDE 14] Student placement and selection should be self-explanatory. This set includes testing for purposes of guiding decisions about student ability grouping, determining entry into and exit from classifications like "English learner," and making college admissions decisions, among other uses. I also place certification tests like high school exit exams into this category, as well as Advanced Placement and International Baccalaureate tests and examinations. [SLIDE 15] My third purpose, informing comparisons among educational approaches, covers tests used as outcome measures in evaluations of curricula or of alternative instructional methods. [SLIDE 16] Fourth, educational management refers to uses of achievement tests to guide decisions about teachers or schools. NCLB provisions for identifying schools in need of improvement and prescribing remedial action would be examples, as would uses of student test scores for teacher evaluation. [SLIDE 17] These first four purposes, of course, make up my measuring category. Test scores are taken as indicators of some underlying construct, and on that basis the scores are used to guide some decision or draw some implication. How accurately scores reflect underlying constructs is of the essence here. The four purposes are distinguished by what it is that test scores are used to describe: individual students' short-term learning progress (category 1), more stable student aptitudes or summative measures of achievement (category 2), educational materials or approaches (category 3), or teachers and schools (category 4). My remaining three purposes make up the influencing category. For these purposes, tests are used in somewhat less direct ways. [SLIDE 18] Fifth on my list is directing student effort. This covers several possibilities. As Black and Wiliam document in their 1998 review of formative assessment, how we test and how we communicate test results can influence what, how, and how much students study. The question, "Is it going to be on the test?" has become a cliché. At a more macro level, tests with stakes for students like high school exit exams or college admissions tests will also influence student effort, although not uniformly, and not always in the ways we might intend.
[SLIDE 19] My sixth category, what I've called focusing the system, refers to various uses of testing to influence curriculum and instruction in ways that do not depend directly on the particular scores students earn. Let me give some examples: Participating in test construction or in the scoring of performance assessments might be regarded as a kind of in-service training for teachers, leading them over time to teach in different ways. Or, performance assessments themselves might exemplify useful but underused instructional approaches. Or, high-stakes tests covering just a few school subjects might encourage teachers to spend more time on those subjects and less time on others. Or, inclusion of questions about specific topics on a high-stakes test might be intended to help assure that those topics were addressed in classrooms. [SLIDE 20] Finally, testing and reports of test results shape popular perceptions, including perceptions of public education and of the teaching profession. I could have gone on to say that perceptions matter because they shape actions, like supporting school bond issues or choosing alternatives to regular public schools, but the story gets complicated enough just going as far as perceptions. In their 1995 book The Manufactured Crisis, David Berliner and Bruce Biddle argued that test results had been used systematically to bolster perceptions that our public education system was functioning poorly. As another possible example, I've heard more than once that the principal users of school accountability scores in California are real estate agents, who can quantify the effect on neighborhood housing prices of the local school's Academic Performance Index. (I'm not sure about that example because perceptions of a particular school might arguably be based on specific scores. But I decided to include it because a school's reputation over time seems not to be closely tied to any specific individuals' scores on specific tests.) [SLIDE 21] Here, again, are some short labels to help keep track. The measuring purposes are one, measuring learning; two, measuring learners; three, measuring methods; and four, measuring actors (including schools). The influencing purposes are five, influencing learners; six, influencing methods; and seven, influencing the perceptions of actors outside the school system itself. Let me say again that these are intended uses. Some, like using high-stakes tests to focus the system on specific subjects or specific kinds of knowledge and skills, carry substantial risks of unintended consequences, including score inflation and distortion of curriculum coverage. Daniel Koretz's 2008 book, Measuring Up, offers an excellent review and analysis of evidence concerning these kinds of effects. Please note that I do not mean for my influencing purposes to be mapped onto Messick's "consequential basis of test interpretation and use." As I understand Messick's writing, he was drawing attention to normative considerations in testing, including the value implications of test interpretations and the social consequences of test uses. Normative issues arise in connection with my influencing purposes, of course, as well as with measuring purposes. But normative concerns are not my main focus right now. My goal this afternoon is to draw attention to intended, anticipated effects of testing with no direct dependence on the information particular scores provide about underlying constructs. That is what I mean by "influencing" purposes.
Let me next say a little more about each of these measuring and influencing purposes. [SLIDE 22] I'll start with my instructional guidance, or formative assessment purpose, which includes the major purposes for tests teachers create or select for use in their own classrooms. These are the quizzes, unit tests, midterms and finals used to assign grades and to guide the pacing of instruction. They provide feedback to students about their own learning and they help teachers plan instruction. Testing what's just been taught in order to guide further learning may sound straightforward, but of course that doesn't mean that it is simple or easy to do well. An interpretive argument is helpful in organizing potential concerns. Following Kane, we might begin with scoring. There must be some way to quantify examinee performances to capture accurately the dimensions of interest. If scoring keys or rubrics credit only the particular answers a teacher intended, for example, and discount alternative answers that are also defensible, then there's a problem at that very first step. Also, scoring should be free from bias. Unless it's a test of handwriting, papers should not be marked down for poor penmanship alone. Apart from obvious flaws, perhaps the most critical scoring concern here is whether the collection of questions adequately samples the intended content domain and whether the questions elicit the kinds of reasoning deemed important. It has been recognized at least since the time of the Bloom taxonomy in 1956 that teachers tend to write low-level items calling for factual recall or routine application of learned procedures, rather than more complex questions demanding nonroutine problem solving or so-called higher-order thinking. As an aside, I might add that that's not just a challenge for classroom teachers. The prevalence of low-level questions in the work of professional item writers was pointed out in 1984 by Norm Frederiksen in his paper on "The Real Test Bias," and similar concerns were echoed in a 2007 NAEP Validity Studies Panel report examining the NAEP math exercise pool (Daro, Stancavage, Ortega, DeStefano, & Linn, 2007). The second broad step in Kane's interpretive argument is generalization. This is the traditional concern of test reliability, asking to what extent the scores obtained from the observed performance on a particular occasion are representative of the many scores that might have been obtained with alternate forms, with different scorers, on different occasions, and so forth. For classroom tests serving the instructional guidance function, there may be no need to standardize administration or scoring across different times or places, but it is still important to consider whether tests are long enough and well enough designed that scores have adequate precision. The third step, extrapolation, asks whether test scores are accurate indicators of proficiency or performance in some larger domain of situations, beyond the test or testing replications. We might ask, for example, whether the test scores are predictive of performance on different kinds of tasks that call for the same skills, including situations outside the classroom. The fourth step, decision or implication, addresses the ways test scores are actually used: What is going to be done differently or understood differently based on the score? Black and Wiliam emphasize that by their definition, assessment becomes formative only when it is actually used to adapt teaching work to meet learning needs.
Decisions to provide extra review and practice or to advance a student to a higher-level course would be examples. Other possible decisions might include reteaching some material, skipping ahead in the syllabus, or contacting a student's parents concerning academic progress. If there is a formal decision rule determining what is to be done with the scores, involving a passing score, for example, then that rule itself requires scrutiny. In practice, of course, it may be quite unclear what different actions are indicated by alternative testing outcomes. Data are of little use if teachers simply don't know what to do with them, or if they lack the time and other affordances needed to take action. "Implication" refers here to some inference or conclusion to be drawn. This might perhaps be conveyed by assigning a grade. Again, there is a range of concerns, including the appropriateness of the decision rule. I've started with classroom testing for the purpose of instructional guidance to elaborate a little on how an interpretive argument can be helpful in thinking through testing concerns. You may know of instances where teachers were provided with data that were intended to be useful for guiding instruction, but the interpretive argument wasn't quite carried through to the final step. It may actually not be very helpful simply to provide teachers with reports of student test performance, however nicely formatted, on the assumption that they already know or will figure out how to make use of those reports and that they have the capacity to take constructive action. My instructional guidance purpose also covers more formal testing systems [SLIDE 23] where the material to be learned is broken down into many narrow learning objectives with a prescribed sequence, with a large bank of tests covering those learning objectives. Students' progress through the learning progression is controlled by their success on the corresponding tests. This has been tried many times. The earliest example I know of was The Winnetka Plan, created by Professor Carleton Washburne and his colleagues at the University of Chicago in the early 1920s. This was an elementary math program that supplemented a textbook with mimeographed worksheets and tests. Pupils could take self-administered tests of each sequential objective, and then when they were ready, they could request a teacher-administered test. The date on which each pupil mastered each objective was recorded. Decades later, this same pattern was found in Programmed Instruction, where content was presented in frames, each followed by test questions with branching to one frame or another according to whether the test score indicated a need for more work versus readiness to move on. Still later, Benjamin Bloom's Mastery Learning featured a similar model, using what he called formative tests, similar to the criterion-referenced tests appearing from the late 1960s into the early 1980s. The Pittsburgh Learning Research and Development Center's Individually Prescribed Instruction Mathematics Project offered yet another illustration. Measurement-driven instruction worked best for those areas of the curriculum that could be readily analyzed into a sequence of small independent units that built upon earlier units. Basic arithmetic and phonics are good examples. It was less successful in teaching more advanced content, where the precise patterns of reasoning required are less predictable.
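To make that branching pattern concrete, here is a minimal sketch of the kind of decision rule these systems embodied. The objective names, the ten-item test length, and the cut score are hypothetical placeholders, not details of any particular program.

```python
# A minimal sketch of the measurement-driven branching pattern described above.
# The objective names, test lengths, and cut score are hypothetical placeholders.
from dataclasses import dataclass

MASTERY_CUT = 0.80  # a commonly cited "percent correct" mastery rule


@dataclass
class Objective:
    name: str
    n_items: int  # length of the criterion-referenced test for this objective


SEQUENCE = [
    Objective("add two-digit numbers", 10),
    Objective("subtract with regrouping", 10),
    Objective("multiply by a one-digit number", 10),
]


def next_step(objective: Objective, items_correct: int) -> str:
    """Apply the simple cut-score decision rule for a single objective."""
    if items_correct / objective.n_items >= MASTERY_CUT:
        return f"record mastery of '{objective.name}' and move to the next objective"
    return f"assign review and practice on '{objective.name}', then retest"


# Example: a pupil answers 7 of 10 items correctly on the second objective.
print(next_step(SEQUENCE[1], 7))
```

In a real system, of course, the record of which pupils had mastered which objectives, and when, was itself an important product of the testing.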
[SLIDE 24] These more formal instructional guidance models mostly finessed scoring, the first step of the interpretive argument, by making the constructs measured more-or-less isomorphic with answering the test questions. This fit well with the perspective of behaviorist psychology, which sought to avoid inferences about the psychological processes underlying observable behaviors. Robert Mager's influential book, "Preparing Instructional Objectives," first published in 1962 with a second edition in 1984, coached teachers on how to formulate instructional objectives in observable terms, using verbs like "arrange," "define," or "label" rather than verbs like "knows" or "understands." Well-written instructional objectives were readily translated into criterion-referenced tests. Generalization, likewise, was not much of a problem. If a test consists of highly similar items asking students to do pretty much the same thing over and over, then any given student's performance is likely to be quite consistent from one test form to another, or even from one item to another on the same test. I've seen U-shaped score distributions on criterion-referenced tests, clearly distinguishing students who had mastered some narrow objective from those who had not. That said, reliability coefficients for criterion-referenced tests were often low simply because there was little observed-score variance. Reliability coefficients fit well with norm-referenced test interpretations, but are often less useful with criterion-referenced tests. The standard error of measurement is a better option. Extrapolation, the third step, was more problematical for criterion-referenced testing. As Lauren and Daniel Resnick explained in their important 1992 chapter on "Assessing the Thinking Curriculum," the measurement-driven instructional model and its behaviorist underpinnings relied on assumptions of decomposability and decontextualization. There was often insufficient attention to the question of whether students could summon up the right pieces of learning in new contexts and put them together in different ways to arrive at solutions to unfamiliar problems. Decision or implication, the fourth step, typically relied on some simple cut score, usually defined by a required percent correct, to classify students as having mastered or not having mastered the tested objective. There was some published work on the problem of formulating these decision rules, but general guidelines like "Mastery is 80% correct" were often adopted uncritically. And, as with the example I cited earlier from Michael Kane's chapter, the efficacy of the use of these tests for purposes of instructional guidance was too often taken for granted, with validation efforts focused instead on score interpretation. Educational psychology has come a long way since the era of behaviorism, and I think it would be a mistake to try to resurrect these testing and instructional approaches. Nonetheless, I would like to speak for a moment to some positive features of measurement-driven instructional models. The fact that they did not work for everything doesn't mean they were not useful for some things. The tests used in these systems were closely aligned to a specific curriculum, and it was always clear that the curriculum came first and the tests, second. It was clear what the tests were supposed to be used for. Students either moved on directly to the next small unit of instruction or else they were given help before moving on.
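As a brief numerical aside on that reliability-versus-SEM point, the following small simulation, using entirely hypothetical data and assumed per-item success rates, shows how a coefficient such as Cronbach's alpha can collapse when nearly everyone has mastered a narrow objective, even though the standard error of measurement remains small, while the same ten-item test given to a more heterogeneous group yields a respectable coefficient.

```python
# A small simulation (hypothetical data) contrasting a reliability coefficient
# with the standard error of measurement when observed-score variance is restricted.
import numpy as np

rng = np.random.default_rng(0)
N_STUDENTS, N_ITEMS = 200, 10


def alpha_and_sem(responses):
    """Cronbach's alpha and SEM for a persons-by-items matrix of 0/1 scores."""
    k = responses.shape[1]
    sum_item_var = responses.var(axis=0, ddof=1).sum()
    total_scores = responses.sum(axis=1)
    total_var = total_scores.var(ddof=1)
    alpha = (k / (k - 1)) * (1 - sum_item_var / total_var)
    sem = np.sqrt(total_var * (1 - max(alpha, 0.0)))  # treat negative alpha as zero
    return alpha, sem


# Case 1: a narrow objective nearly everyone has mastered (success rate ~0.92 per item).
mastered = (rng.random((N_STUDENTS, N_ITEMS)) < 0.92).astype(int)

# Case 2: a test of the same length given to a group with widely varying proficiency.
proficiency = rng.uniform(0.2, 0.95, size=(N_STUDENTS, 1))
heterogeneous = (rng.random((N_STUDENTS, N_ITEMS)) < proficiency).astype(int)

for label, data in [("homogeneous mastery group", mastered),
                    ("heterogeneous group", heterogeneous)]:
    a, s = alpha_and_sem(data)
    print(f"{label}: alpha = {a:.2f}, SEM = {s:.2f} raw-score points")
```

The point is not the particular numbers, which depend entirely on the simulated data, but that a small standard error of measurement and a low reliability coefficient can coexist when the group tested is homogeneous.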
The many separate instructional decisions these systems prescribed for each pupil may have been overly mechanical, but the systems were workable. The communications around test performance were all about mastery, not ranking or comparing one student to another. We could do worse. The instructional guidance purpose lives on in interim or benchmark assessments, of course, but as I understand it, these are generally designed by working backwards from the summative test, and are administered at fixed intervals. That's a very different model. [SLIDE 25] I'd next like to turn from instructional guidance to purposes of student placement and selection, shifting from measuring learning to measuring learners. You'll see in my examples here that there is a big difference in the kinds of tests used. Aptitude tests make an appearance, and where achievement tests are used, they are, for one thing, tests of much broader achievement constructs and, for another, used like aptitude tests, to predict future performance or readiness to profit from some instructional setting much more remote than the next small lesson. When one measures learning, one hopes to see rapid change. When one measures learners for purposes of placement or selection, one generally treats the characteristics measured as if they were relatively stable over time. Perhaps the extreme view of aptitude as fixed and immutable was represented in the student IQ testing used early in the last century. Following on the perceived success of the Army Alpha during World War I, IQ tests were widely used for tracking students. My previous purpose, instructional guidance, was predicated on the notion that almost all children could master the curriculum if given the supports they needed along the way. [SLIDE 26] IQ-based tracking, which I am using to illustrate purposes of student placement and selection, was predicated on the notion that children differed substantially in their capacities for school learning, and so unequal outcomes were inevitable. Tracking was intended to make schooling more efficient by giving teachers more homogeneous groups of students to work with. That way, slower children would not be pushed to the point of frustration, and quicker children would not be held back. There's more to be said, of course, about the reasons this instructional model emerged when it did. [SLIDE 27] IQ tests seemed to confirm prevailing stereotypes about racial and ethnic differences in intelligence, and provided an all-too-convenient explanation for what we today refer to as between-group achievement gaps. This history is laid out in Paul D. Chapman's 1988 book Schools as Sorters. Chapman points out that tracking was already widespread before IQ tests became available, but the tests supported the practice, made it seem more objective and scientific, and fit comfortably with prevailing beliefs about intelligence and individual differences. Whereas tests used for instructional guidance are often carefully integrated into the school curriculum, IQ tests were intended to have zero dependence on the school curriculum. They were designed to tap into different aspects of general mental ability, typically using progressively more difficult tasks of a given type to determine how well the examinee could perform. [SLIDE 28] Scoring rules were carefully worked out, and generalization was shown by high correlations between alternate test forms as well as high stability over time.
Extrapolation was assumed on the basis of a strong theory about the importance of intelligence in determining fitness for different occupations and other aspects of adult life as well as success in school. At the time, however, the implications drawn from IQ test scores and the justification for decisions based on those scores were largely unexamined. It was simply assumed that intelligence was largely inborn and little affected by age or experience. Yet again, validation generally stopped short of actual use. Just as psychology has moved beyond the behaviorist notions underlying elaborate formal models for instructional guidance, it has also moved beyond the strongly hereditarian views that justified IQ testing. Today, environmental influences on IQ test scores are better understood, and the idea of "culture-free" testing has long since been abandoned. It is also more generally recognized that tracking can lead to a self-fulfilling prophecy whereby the groups offered a stronger, faster-paced curriculum progress more quickly than their peers in lower academic tracks. I would certainly not advocate a return to IQ-based tracking, but this model, too, may offer some useful lessons. Every teacher knows that some students tend to catch on more quickly than others. Students have different patterns of strengths and weaknesses, of course, but there are also general tendencies, and on average, students stronger in one subject area are likely to be stronger in other subject areas, as well. There seems to be no acknowledgement of this reality in the current rhetoric of school reform. I want to say clearly that of course I believe all children can learn, and nearly all have the potential to become literate, successful members of society. But the mantra of "all children can learn," which morphs into an insistence on a common, high standard for all children, and the seeming denial of large individual differences in aptitude for school work have distorted education policy. [SLIDE 29] Before leaving the area of student placement and selection purposes, I should mention quickly that some of the other testing applications grouped here do feature strong connections between the test and the curriculum, with AP and IB exams as prime examples. Instructional guidance had to do with measuring learning, and student placement and selection have to do with measuring learners. My third broad set of purposes has to do with measuring methods. [SLIDE 30] These are purposes of testing aimed at informing comparisons among instructional approaches. This sort of testing application can be traced back at least to the Project Head Start evaluations mandated by the Elementary and Secondary Education Act (ESEA) of 1965 and the large-scale evaluations of curricula created with NSF funding in the post-Sputnik era, around that same time. More recently, the National Diffusion Network, in existence from 1974 to 1995, attempted to synthesize evaluations documenting successful educational approaches, and the What Works Clearinghouse, established in 2002, has a similar mission. These are examples of testing applications in the sense that curriculum or program evaluations use achievement tests as outcomes. The purpose in each case is to compare methods and see which works best. This is one case where the interpretive argument really is carried through to the end, or almost to the end.
There is usually a final, taken-for-granted assumption that the approach found superior in the sites involved in the evaluation will also give superior results if it is widely disseminated, and of course there is no guarantee. These "informing comparisons" purposes are best implemented with randomized controlled trials, but the category would also include testing applications in studies with a wide range of research designs. The 2011 paper by Tom Dee and Brian Jacob in the Journal of Policy Analysis and Management, on "The impact of No Child Left Behind on student achievement," offers an elegant example in which NAEP data were used for the purpose of evaluating the NCLB approach to school accountability. My fourth purpose, what I've called educational management, comprises applications where student achievement tests are used to derive scores for schools or teachers, either to identify problematical cases requiring intervention or to identify exemplary cases from which others might learn. [SLIDE 31] Note that this purpose is in my measuring category. Decisions and implications based directly on school and teacher scores are in fact key to accountability models' theories of action. However, I think it's fair to say that measuring, in this sense, is only part of the story. These accountability models are intended to affect the educational system not only via measurement-driven management decisions, but also by focusing the system and possibly by shaping public perceptions—purposes I've placed in the influencing category. The NCLB accountability model is a prominent example. Here, student test scores from annual state-level testing systems are used in a complicated way to calculate scores for schools and then to determine whether each school is meeting a goal called Adequate Yearly Progress. There is a rational mechanism whereby schools not meeting AYP targets are supposed to receive remediation leading to improvement. However, a primary purpose of accountability testing under NCLB—a major piece of the rationale whereby NCLB was supposed to improve schooling—would be situated in my sixth category of focusing the system, over in the right-hand, Influencing column. These two purposes of educational management and focusing the system interact here, in that the pressure to focus is most intense for schools failing to meet AYP. Staying with educational management for the moment, perhaps the most prominent applications these days are "value-added" models for teacher evaluation. [SLIDE 32] Value-added models partake of educational management purposes when scores are used to identify specific teachers for follow-up actions, which might include, among others, merit pay, mentoring, or dismissal. The scoring step of the interpretive argument is familiar, although with some models, vertical scaling assumptions place more stringent demands on scores for VAM than for other testing uses. The generalization step has an extra wrinkle, because when student scores over time are used in the aggregate to construct scores for teachers, reliability of individual student-level scores becomes less important, while other sources of error, related to student assignment and sampling, must also be considered.
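To make that aggregation point concrete, here is a deliberately simplified sketch of the covariate-adjustment logic that value-added estimates build on, using simulated, entirely hypothetical data. Operational models add many refinements (multiple prior years, student and classroom covariates, shrinkage, and scaling decisions), so this illustrates only the basic logic, not any particular system.

```python
# A deliberately simplified covariate-adjustment sketch of the value-added logic,
# using simulated (hypothetical) data; operational models are far more elaborate.
import numpy as np

rng = np.random.default_rng(1)
N_TEACHERS, STUDENTS_PER_TEACHER = 50, 25
teacher_of_student = np.repeat(np.arange(N_TEACHERS), STUDENTS_PER_TEACHER)

# Simulate prior-year scores, "true" teacher effects, and current-year scores.
prior = rng.normal(0.0, 1.0, size=teacher_of_student.size)
true_effect = rng.normal(0.0, 0.15, size=N_TEACHERS)
current = 0.7 * prior + true_effect[teacher_of_student] + rng.normal(0.0, 0.6, size=prior.size)

# Step 1: regress current scores on prior scores (the covariate adjustment).
slope, intercept = np.polyfit(prior, current, 1)
residuals = current - (intercept + slope * prior)

# Step 2: a teacher's raw value-added estimate is the mean residual of that teacher's students.
value_added = np.array([residuals[teacher_of_student == t].mean() for t in range(N_TEACHERS)])

# The precision of these aggregate estimates depends less on student-level score
# reliability than on class size and on how students were assigned to teachers.
print("correlation of estimates with simulated true effects:",
      round(float(np.corrcoef(value_added, true_effect)[0, 1]), 2))
```

In this toy setting, the precision of the teacher-level estimates depends on class size and on the student-level noise relative to the spread of the simulated teacher effects, which is the generalization wrinkle just described.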
As with most of the testing uses I've mentioned this afternoon, validation efforts seem to have focused primarily on these first two steps of the interpretive argument, with limited attention to extrapolation—the question of how strongly value-added scores relate to broader notions of teaching quality or effectiveness; and decision or implication—the question of whether actions taken on the basis of value-added scores have the intended effects. I said that NCLB used testing for both category four purposes of educational management and category six purposes of focusing the system. I'd say the same about value-added models for teacher evaluation. Certainly the use of student achievement test scores for high-stakes evaluation of teachers is likely to focus teacher effort on raising test scores. But the purposes for teacher value-added models may also cross over into my seventh category, shaping public perceptions. I think it is very likely that highly publicized reports of individual teachers' value-added scores, and the highly publicized reactions to these reports coming from teachers' unions and from various political actors, are influencing public perceptions of teaching. I suspect that in some cases this may be an intended effect, but that could be hard to document. At this point, I've given some examples of testing applications under each of the four broad purposes in my measuring category. In some cases, I've also referred to purposes in my influencing category. [SLIDE 33] My colleague Elliot Eisner quipped long ago that the quickest way to precipitate an educational crisis was to give a test and announce that half the children were below average. Elliot's comment notwithstanding, I could not think of any testing applications that were solely in the "influencing" category, except for the bogus surveys one gets in the mail sometimes, purporting to ask for opinions with the true purpose of shaping opinions or raising money. Nonetheless, I believe these purposes may be more significant than measuring, narrowly conceived, in justifying and sustaining many large-scale testing programs. I did not say much about my seventh category, shaping public perceptions, except to speculate that that might possibly be a hidden purpose for teacher value-added models. I want to mention just one more example, also plausible but speculative. I believe that one political purpose for accountability testing is to shape a public perception that an elected leader cares about education and is doing something about it. [SLIDE 34] Many presidents and governors, from both parties, have embraced the cause of educational improvement. As Robert Linn (2000, p. 4) explained it, "Test results can be reported to the press. Poor results in the beginning are desirable for policymakers who want to show they have had an effect. Based on past experience, policymakers can reasonably expect increases in scores in the first few years of a program … with or without real improvement in the broader achievement constructs that tests and assessments are intended to measure. The resulting overly rosy picture that is painted by short-term gains observed in most new testing programs gives the impression of improvement right on schedule for the next election." Testing can be a form of symbolic action. How many more times will we hear about a new testing program to identify low-performing groups so that remediation can be targeted? Are the results ever really a surprise?
In my last few minutes, I want to speak to both the challenge and the importance of attending to these influencing purposes of testing programs. [SLIDE 35] Importance is clear. Most obviously, the influencing purposes are often key to the interpretive argument. They come at the end of the chain of reasoning, often serving as the ultimate rationale for the testing program. Doing a really good job on validity studies supporting the scoring and generalization stages may be gratifying to the writers of technical reports (if there is anything gratifying about writing technical reports), but the chain of reasoning is only as strong as its weakest link. The argument simply has to be carried through to its conclusion. Examining these influencing purposes is also important because it is in connection with these purposes that many unintended consequences of testing may arise. Using tests to focus the system is a prime example. When curriculum-neutral tests are used to focus the system on key learning objectives, the result may be superficial coverage of just that subset of learning objectives included in the test specification. If we had well worked out, widely understood, and widely used methods for studying focusing effects, these unintended consequences would receive more attention and might in time come to be better understood by policy makers and the public at large. The difficulties of examining these "influencing" purposes of testing programs are formidable. First, these purposes may not be clearly articulated. Second, necessary data may not become available until after a testing program has been implemented. Third, the research methods needed to carry out the required studies may be unfamiliar to testing professionals. Fourth, the agencies and actors best positioned to carry out this work have disincentives for doing so. Fifth, this work is expensive, and it may not be clear that the answers will really matter. Nonetheless, acknowledging the challenges, I think we can and should do a better job than we have historically. The lack of clarity around influencing purposes is not insurmountable. Some implicit purposes can certainly be ignored. If a politician believes that sponsoring a testing program will garner votes in the next election, that need not be a measurement concern. Nonetheless, major purposes can generally be discerned, and stating these purposes more clearly might actually be helpful. We can at least push for these purposes to be better thought through. [SLIDE 36] The procurement for state consortia's comprehensive assessment systems under Race to the Top required that applications include "a theory of action that describes in detail the causal relationships between specific actions or strategies in the eligible applicant’s proposed project and its desired outcomes for the proposed project, including improvement in student achievement and college- and career-readiness." Data availability could be greatly improved if evaluation studies were planned before new testing initiatives were launched and baseline data were collected in advance. [SLIDE 37] That data collection might employ audit tests aligned to outcomes, but could also include surveys on content coverage or allocation of instructional time to different subject areas, for example. If evaluation is an afterthought, it may be too late to do it properly. Another helpful approach would be phased implementation.
If changes everywhere in a state are made at the same time, it is much harder to isolate the effects of those changes on schooling outcomes. As to required research methods, [SLIDE 38] this is a place where we could use some help from colleagues in other disciplines. Our field draws most heavily from the disciplines of statistics and psychology. We could use some help here from sociologists, anthropologists, economists, linguists, curriculum specialists, and teacher educators, among many others. Some progress might be made simply by thinking about school principals, teachers, and students as rational actors and reasoning through their likely responses to the incentive structures testing shapes. But, we're going to need more than common sense. As Messick (1989) said long ago, validity must be supported by both empirical evidence and theoretical rationales. We have bodies of theory to get us through the descriptive stages of our interpretive arguments—we can figure out what test scores mean. But when it comes to the ways testing is actually supposed to function out in the world, our theories are impoverished. We may borrow a bit from economics, as when test scores provide the information essential for rational consumer choice, but we need more help from organizational theory, social psychology, and elsewhere. My last two challenges had to do with costs and incentives for doing a better job. [SLIDE 39] This is not a new problem. It's all well and good to lament the confirmationist bias in test validation, as in much social science research. It is understandable that testing companies and state education agencies might be reluctant to subject their tests and their policies to serious, critical scrutiny. But evaluation requirements built into testing procurements might help. As one example, I mentioned the procurement for SMARTER Balanced and PARCC. The evaluative criteria for those proposals allocated 30 points to bidders' Research and Evaluation Plans, which required not only research on the properties of the tests themselves, but also a plan "for determining whether the assessments are being implemented as designed and the theory of action is being realized, including whether the intended effects on individuals and institutions are being achieved" (U.S. Department of Education, 2010). Obviously, the two consortia each submitted plans responsive to the RFP's requirements. Similar requirements in requests for proposals from other agencies would be a step in the right direction. With experience, such evaluation requirements might be elaborated and refined to the point where serious attention to influencing as well as measuring purposes of testing, and serious attention to actual uses and their consequences as well as to score interpretations, became an expected part of test validation. That will not happen overnight, but it is a goal worth striving for. [SLIDE 40] Thank you.