
ON LINKING FORMATIVE AND SUMMATIVE FUNCTIONS
IN THE DESIGN OF LARGE-SCALE ASSESSMENT SYSTEMS
Richard J. Shavelson1
Stanford University
Paul J. Black, Dylan Wiliam
King's College London
Janet Coffey
Stanford University
Submitted to Educational Evaluation and Policy Analysis
Abstract
The potential for mischief caused by well-intended accountability systems is legion when
expedient output measures are used as surrogates for valued outcomes. Such misalignment
occurs in education when external achievement tests are not well aligned with either
curriculum/standards or high-quality teaching methods. Tests then become outcomes in themselves,
driving curriculum, teaching and learning. We present three case studies that show how polities
have crafted assessment systems that attempted to align testing for accountability (“summative
assessment”) with testing for learning improvement (“formative assessment”). We also describe
how two failed to achieve their intended implementation. From these case studies, we set forth
decisions for designing accountability systems that link information useful for teaching and
learning with information useful to those who hold education accountable.
1 School of Education, 485 Lasuen Mall, Stanford University, Stanford, CA 94305-3096; 650-723-4040 (telephone); 650-725-7412 (fax); richs@stanford.edu.
ON LINKING FORMATIVE AND SUMMATIVE FUNCTIONS
IN THE DESIGN OF LARGE-SCALE ASSESSMENT SYSTEMS
The proposition that democracy requires accountability of citizens and officials is a universal
tenet of democratic theory. There is less agreement as to how this objective is to be
accomplished—James March and Johan Olsen (1995, p. 162)
Democracy requires that public officials, public and private institutions, and individuals
be held accountable for their actions, typically by providing information on their actions and
imposing sanctions. This is no less true for education, public and private, than other institutions;
the demand for holding schools accountable is part of the democratic fabric.
The democratic concept of accountability is noble. However, in practice it can fall short
of the ideal. For, as March and Olsen (1995, p. 141) put it, “The events of history are frequently
ambiguous. Human accounts of those events characteristically are not. Accounts provide
interpretations and explanations of experience" that make sense within a cultural-political
framework. They (March & Olsen, p. 141) go on to point out that "Formal systems of
accounting used in economic, social and political institutions are accounts of political reality"
(e.g., students' test performances represent outcomes in a political reality). As such, they have
a profound impact on the actors involved. On the one hand, accounts focus social control and make
actors responsive to social pressure and standards of appropriate behavior; on the other hand, they
tend to reduce risk-taking that might become public, make decision making cautious about
change, and reinforce current courses of action that appear to have failed (March, 200?).
In this paper our focus is on human educational events—teaching, learning, outcomes—
that are by their very nature ambiguous but get accounted for unambiguously in the form of test
scores, league tables, and the like with significant impact on education. For accountability
information, embedded in an interpretative, political framework, is, indeed, a very powerful
policy instrument. Yet the potential for mischief is legion. Our intent is to see whether or not we
can improve on one aspect of accountability—that part that deals with information—to improve
both its validity and positive impact. Specifically, our intent is to set forth design choices for
accountability systems that link, to a greater or lesser extent, information useful to teaching and
learning with information useful to those holding education responsible—policy makers, parents,
students, and citizens. For the current mismatch between information to improve teaching and
learning, and information to inform the public of education quality has substantial negative
consequences on valued educational outcomes.
Before turning to the linkage issue, a caveat is in order. To paraphrase March and Olsen,
educational events are frequently ambiguous… and messy, complicated and political. To focus
this paper, we too are guilty of simplification in our portrayal of educational events: our "human
accounts of those events are characteristically not [ambiguous]," though perhaps they should be.
Teachers, for example, face conflict every day in gathering information on student performance: to
help students close the gap between what they know and can do and what they need to know and be
able to do on the one hand, and to evaluate students' performance for the purpose of grading on the
other. This creates considerable conflict and complexity that, for reasons of space, we have
reluctantly omitted (see, for example, Atkin, Black & Coffey, 2002; Atkin & Coffey, 2001).
The Linkage Issue
In a simple-minded way, we might think of education as producing an outcome such as
highly productive workers, enlightened citizens, life-long learners or some combination of these.
Inputs (resources) are transformed through a process (use of resources) into outputs (products)
that are expected to contribute to the realization of valued outcomes. "With respect to schools, for
example, inputs include such things as teachers and students, processes include how teachers and
students spend their time during the school day, outputs include increased cognitive abilities as
measured by tests, and outcomes include the increased capacity of the students to participate
effectively in economic, political and social life” (e.g., Gormley & Weimer, 1999, pp. 7-8).
When outputs are closely and accurately linked to inputs and processes on the one hand,
and to outcomes on the other, the accounts of actors and their actions may be valid—they may
provide valuable information for improving the system (processes given inputs) on the one hand,
and for accounting to the public for outcomes on the other. However, great mischief can be done
by an accountability system that does not, at least, closely link outputs to outcomes. For outputs
(e.g., broad-scope multiple-choice test scores) quickly become valued outcomes from the
education system's perspective, and outputs may not provide information either closely related to
outcomes or needed to improve processes.
Current educational accountability systems use large-scale assessments for monitoring
achievement over time. They can be characterized as outputs (e.g., broad-spectrum multiple-choice or short-answer questions of factual or procedural recall) that are either distal from
outcomes and processes or that have become the desired outcomes themselves. Consider, for
example, the algebra test item "Simplify, if possible, 5a + 2b" (Hart, Brown, Kerslake,
Kuchemann, & Ruddock, 1985). Many teachers regard this item as unfair for large-scale testing
since students are "tricked" into simplifying the expression because of the prevailing "didactic
contract" (Brousseau, 1984) under which students assume that there is "academic work" (Doyle,
1983) to be done; doing nothing, they assume, cannot be academic work and would bring a low mark.
The fact that they are tempted to simplify the expression in the context of a test question when
they would not do so in other contexts means that this item may not be a very good question to
use in a test for external accountability purposes.2
Indeed, testing for external accounting purposes may not align with testing for improving
teaching and learning. Accountability requires test standardization while improvement may
involve dynamic assessment of learning; testing for accountability typically employs a uniform
testing date while assessment for improvement is ongoing; testing for accountability leaves the
student unassisted while assessment for improvement might involve assistance; results of
accountability tests are delayed while assessment for improvement must be immediate; and
2 However, such considerations do not disqualify the item for purposes of improving the teaching-learning process, because a student being tricked may be important information for the teacher to have, indicating the student's insecurity with basic algebraic principles.
testing for accountability must stress reliability while clinical judgment plays an important role
over time in improvement (Shepard, 2003). Note that this version of accountability reflects
current education practice. It is not inevitable, for, as Jim March reminds us, there is little
agreement on how to accomplish democracy's accountability function, and this is especially true
in education.
The potential for mischief and negative consequences is heightened when public
accountability reports are accompanied by strong sanctions and rewards, such as student
graduation or teacher salary enhancements. Test items may be stolen, teachers may teach the
answers to the test, and administrators may change students’ answers on the test (e.g., Assessment
Reform Group, 2002; Black, 1993; Shepard, 2003).
In this paper, we set forth design choices for tightening the links among educational
outputs, the processes for improving performance on them, and desired outcomes. We begin with
a framework for viewing assessment of learning for improving teaching-learning processes and
for external accounting purposes. We then turn attention to large-scale assessment systems that
have attempted alignment as existence proofs and as sources for identifying large-scale
assessment design choices. To be sure, alignment has been attempted in the past with varying
degrees of success.
From these “cases” we extract design choices, choices that are made,
intentionally or not, in the construction of large-scale assessment systems. We then show the
pattern of design choices that characterizes each of our cases.
Functions of Large-Scale Assessments: Evaluative, Summative and Formative
Large-scale testing systems may serve three functions: Evaluative, summative and
formative. The first function is evaluative. The system provides (often longitudinal) information
with which to evaluate institutions and curricula. To this end, the focus is on the system; samples
of students, teachers, and schools provide information for drawing inferences about institutional
or curricular performance. The National Assessment of Educational Progress in the United States
and the Third International Mathematics and Science Study are examples of the evaluative
function. We are not concerned here with this function3; rather we focus on summative and
formative functions of assessment.
The second function is summative. The system provides direct information on individual
students (e.g., teacher grades—not our focus here—and external test scores) and, by aggregation,
indirectly on teachers (e.g., for salaries) and schools (for funding) for the purpose of grading,
selecting, and promoting students, certifying their achievement, and for accountability.
General Certificate of Secondary Education examination in the United Kingdom (e.g.,
http://www.gcse.com/) is one example of this function as are statewide assessments of
achievement in the United States such as California’s Standardized Testing and Reporting
program (http://star.cde.ca.gov/) or the Texas Assessment of Academic Skills
(http://www.scotthochberg.com/taas.html/). The third function is formative. The system
provides information directly and immediately on individual students’ achievement or potential to
both students and teachers. Formative assessment of student learning can be informal when
incidental evidence of achievement is generated in the course of a teacher's day-to-day activities,
when the teacher notices that a student has some knowledge or capacity of which she was not
previously aware. Or it can be formal as a result of a deliberate teaching act designed to provide
evidence about a student's knowledge or capabilities in a particular area. This most commonly
takes the form of direct questioning (whether oral or written),4 but it can also take the form of
curriculum-embedded assessments (with known reliability and validity) that focus on some aspect
of learning (e.g., a mental model of the sun-earth relationship that accounts for day and night or
the change of seasons) and that permit the teacher to stand back and observe performance as it
3 We believe that this function is often confounded with the summative function (see below) where the focus is on institutional evaluation and a census of students (etc.) is taken to provide feedback on individual achievement as well as on institutional performance.
4 Of course, questioning, whether formal or informal, will not guarantee that, if the student has any knowledge or understanding in the area in question, evidence of that attainment will be elicited. One way of asking a question might produce no answer while a slightly different approach may elicit evidence of achievement.
evolves. The goal here is to identify the gap between desired performance and a student’s
observed performance so as to improve student performance through immediate feedback on how
to do so. Formative assessment also provides “feed-forward” to teachers as to where, on what,
and possibly how to focus their teaching immediately and in the future.
While most people are aware of summative assessment, few are aware of formative
assessment and the evidence of its positive, large-in-magnitude impact on student learning (e.g.,
Black & Wiliam, 1998). Perhaps a couple of examples of formative feedback, then, would be
helpful.
Consider, for instance, teacher questioning, a ubiquitous classroom event.
Many
teachers do not plan and conduct classroom questioning in ways that might help students learn.
Rowe’s (1974) research showed that when teachers paused to give students an opportunity to
answer a question, the level of intellectual exchange increased. Yet teachers typically ask a
question and give students about one second to answer. As one teacher came to realize (Black,
Harrison, Lee, Marshall, & Wiliam, 2002, p. 5):
Increasing waiting time after asking questions proved difficult to start with—due to my
habitual desire to ‘add’ something almost immediately after asking the original question.
The pause after asking the question was sometimes ‘painful’. It felt unnatural to have
such a seemingly ‘dead’ period, but I persevered. Given more thinking time students
seemed to realize that a more thoughtful answer was required. Now, after many months
of changing my style of questioning I have noticed that most students will give an answer
and an explanation (where necessary) without additional prompting.
As a second example, consider the use of curriculum-embedded assessments. These
assessments are embedded in the on-going curriculum. They serve to guide teaching, and create
an opportunity for immediate feedback to students on their developing understandings. In a joint
project between the Stanford Education Assessment Laboratory and the Curriculum Research and
Development Group at the University of Hawaii, modifications have been made to the Foundational
Approaches in Science Teaching (FAST) middle-school curriculum.
A set of assessments
designed to tap declarative knowledge (“knowing that”), procedural knowledge (“knowing how”)
and schematic knowledge ("knowing why") has been embedded at four natural transitions or
“joints” in an 8-week unit on buoyancy—some assessments are repeated to create a time series
(e.g., “Why do things sink or float?”) and some (multiple-choice, short-answer, concept-map,
performance assessment) focus on the particular concepts, procedures and models that led up to
the joints. The assessments serve to focus teaching on different aspects of learning about mass,
volume, density and buoyancy. Feedback on performance is immediate and focuses on constructing
conceptual understanding based on empirical evidence. For example, assessment items (graphs,
“predict-observe-explain,” and short answer) that tap declarative, procedural and schematic
knowledge are given to students at a particular joint. Students debate different explanations of
sinking and floating based on evidence in hand.
While the dichotomy of formative and summative assessment seems perfectly
unexceptional, it appears to have had one serious consequence. Significant tensions are created
when the same assessments are required to serve multiple functions, and few believe that a single
system can function adequately to serve both functions. At least two coordinated or aligned
systems are required: formative and summative.
Both functions require that evidence of
performance or attainment is elicited, is then interpreted, and as a result of that interpretation,
some action is taken. Such action will then, directly or indirectly, generate further evidence
leading to subsequent interpretation and action, and so on.
Tensions arise between the formative and summative functions in each of three areas:
evidence elicited, interpretation of evidence, and actions taken. First, consider evidence. As
Shepard (2003) pointed out, issues of reliability and validity are paramount in the summative
function because, typically, a “snapshot” of the breadth of students’ achievement is sought at one
point in time. The forms of assessment used to elicit evidence are likely to differ from summative
to formative. In summative assessment, typical “objective” or “essay” tests are given on a
particular occasion. In contrast, with formative assessment, students’ real-time responses are
given to one another in group work, to a teacher’s question, to the activity they are engaged in or
to a curriculum-embedded test. Moreover, the summative and formative functions differ in the
reliability and validity of the scores produced. In summative assessment, each form of a test
needs to be internally consistent and scores from these forms need to be consistent from one rater
to the next or from one form to the next. The items on the tests need to be a representative
sample of items from the broad knowledge domain defined by the curriculum syllabus/standards.
In contrast, as formative assessment is iterative, issues of reliability and validity are resolved over
time with corrections made as information is collected naturally in everyday student performance.
Finally, the same test question might be used for both summative and formative assessment but,
as shown with the simplify item (simplify 5a + 2b), interpretation and practical uses will probably
differ (e.g., Wiliam & Black, 1996).
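As an aside, one standard index of the internal consistency required of summative test forms, offered here only as an illustration and not drawn from the authors' text, is Cronbach's coefficient alpha:

$$ \alpha \;=\; \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma^{2}_{Y_i}}{\sigma^{2}_{X}}\right), $$

where k is the number of items on a form, \sigma^{2}_{Y_i} is the variance of scores on item i, and \sigma^{2}_{X} is the variance of total scores on the form. Analogous coefficients (e.g., inter-rater correlations) index the rater-to-rater and form-to-form consistency mentioned above.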
The potential conflict between summative and formative assessment can also be seen in
the interpretation of evidence. Typically, the summative function calls for a norm-referenced or
cohort-referenced interpretation where students’ scores come to have meaning relative to their
standing among peers. Such comparisons typically combine complex performances into a single
number and put the performance of individuals into some kind of rank order. A norm- or cohort-referenced interpretation would indicate how much better an individual needs to do, pointing to
the existence of a gap, rather than giving an indication of how that improvement is to be brought
about. It tells individuals that they need to do better rather than telling them how to improve.5
5 Nevertheless, summative information can be used for formative purposes, as for example when statistics and interpretations are published, as they are in Delaware, for science items on a summative assessment and thereby serve as benchmarks for what teachers and parents can reasonably expect students to achieve (see Wood & Schmidt, 2002; see also Shepard, 2003).
The alternative to norm-referenced interpretation in large-scale assessment is criterion- or
domain-referenced interpretation, with a focus on the amount of knowledge rather than on rank
ordering. Summative assessment, in this case, would report on the level of performance of
individuals or schools (e.g., percent of domain mastered), perhaps with respect to some desired
standard of performance (e.g., proportion of students above standard).
In this case, the
summative assessment would, for example, certify competence.6
Formative assessment, in contrast, provides students and teachers with information on
how well someone has done and how to improve, rather than on what they have done and how
they rank. For this purpose, a criterion- or domain-referenced interpretation is needed. Such an
interpretation focuses on the gap between what a student knows and is able to do and what is
expected of a student in that knowledge domain. However, formative assessment goes beyond
domain referencing in that it also needs to be interpreted in terms of learning needs—it should be
diagnostic (domain-referenced) and remedial (how to improve learning). The essential condition
for an assessment to function diagnostically is that it must provide evidence that can be
interpreted in a way that suggests what needs to be done next to close the gap.
The two assessment functions also differ in the actions that are typically taken based on
their interpretations. Summative assessment reaches an end when outcomes are interpreted.
While there may be some actions contingent on the outcomes, they tend to follow directly and
automatically as a result of previous validation studies. Students who achieve a given score on
the SAT are admitted to college because the score is taken to mean that they have the necessary
aptitude for further study.
One essential requirement here is that the meanings and significance
of the assessment outcomes must be widely shared. The same score must be interpreted in the
same way for different individuals. The value, implications and social consequences of the
assessment, while generally considered important, are often not considered as aspects of validity
narrowly defined (Madaus, 1988), but are central to a broader notion of validity (Messick, 1980).
6 For an aircraft pilot, the issue is pass or fail; a good take off but a dodgy landing is of little value!
For formative assessment, in contrast, the learning caused by the assessment is
paramount. If different teachers elicit different evidence from the same individual, or interpret
the same evidence differently, or even if they make similar interpretations but take different
actions, this is relatively unimportant. What really matters is whether the result of the assessment
is successful learning. In this sense, formative assessment is validated primarily with respect to
its consequences.
The potential for summative and formative assessment to work at cross-purposes, then, is
enormous. However, too much is at stake to leave them in natural conflict. If left in conflict, the
summative function will overpower the formative function and the goals of education may be
reduced to the outputs measured by standardized tests that rank order performance and relate it to
a peer group. The goal of teaching and education becomes improving scores on these tests, and
all kinds of unintended consequences arise, including cheating.
What is at stake is the evidence of a positive, large-in-magnitude impact that the
formative function makes on student learning (Assessment Reform Group, 2002; Black &
Wiliam, 1998). It becomes imperative, then, to align formative and summative assessment. We
believe this might be done in two general ways: (1) broaden the evidence base for summative
assessment to include a greater range of outcomes and formative information, or (2)
aggregate formative assessment information as the basis of the summative function. We turn to
case studies of both approaches as “existence proofs.”
Attempts to Align Large-Scale Formative and Summative Assessment
Attempts that have been made to align large-scale summative assessment with formative
assessment have met with greater and lesser degrees of success. We focus on three cases not
because they are exhaustive but because they are illustrative of the larger domain.
The
conceptual framework and procedures employed in each of these different cases provide an image
of how the summative and formative functions of assessment might be linked. They also provide
lessons in the difficulty of achieving alignment in hotly contested, political environments where
differing conceptions of teaching and learning as well as questions of cost effectiveness come into
play. We begin with two attempts to include formative information in summative, large-scale
assessments of achievement: The California Learning Assessment System in the U.S. and the
Task Group on Assessment and Testing’s framework for reporting national curriculum
assessment results in the U.K. We then turn attention to a different case, one that built a
summative assessment system on formative assessment.
Our intent is to provide concrete
examples from which we might draw implications for large-scale assessment design choices.
California Learning Assessment System
The California Learning Assessment System (CLAS) sought to link formative with
summative assessment of student achievement in grades 4, 8 and 10. The legislation (Senate Bill
SB-1273) was written in 1991 in response to Governor Pete Wilson’s promise to citizens of the
State to create an achievement assessment system that provided timely and instructionally
relevant information to teachers, to individual students and their parents, and to the State. This
promise reflected what were seen as limitations of the then-current California Assessment Program
(CAP), which indexed the achievement of California's 4th, 8th, and 10th grade students in the
aggregate (i.e., by matrix sampling) rather than at the individual student level (i.e., evaluative
function of assessment). CAP, then, was unable to provide reliable scores to students who took
only a subset of test items. Moreover, CLAS was to be aligned with the new curriculum
frameworks that embraced constructivist, inquiry principles of learning and teaching so that, in
addition to traditional tests (i.e., multiple-choice), alternative assessments (essays in English,
performance assessments in science, mathematics and history) would be incorporated into the
assessment system.
In order to meet the Governor’s vision of providing timely achievement information to
teachers, students and parents, CLAS was designed to combine information from on-demand
testing, classroom embedded testing and classroom portfolios (see Figure 1). Like CAP, at least
initially, CLAS would include on-demand statewide testing where students would sit for a
multiple-choice test (Figure 1-A). Unlike CAP, however, the test would not contain just a sample
of items for each student. Rather, all students would take the same test so that individual scores
could be given to teachers, students and parents. Over time (see below), individual student
information would come from classroom data sources and the on-demand aspect of CLAS would
simply be an audit of classroom data through sampling.
Insert Figure 1. Sources of CLAS data.
In order to provide timely, curriculum-relevant feedback to teachers and students, CLAS
envisioned embedding state-designed tests into the ongoing curriculum (Figure 1-B).
For
example, the State might provide writing prompts for all 8th grade English teachers to administer
and score (using a State-developed rubric) five times in the academic year. Or the State might
provide 4th grade teachers with a 5-day mini-science curriculum on (say) biodiversity that
incorporated performance assessments (practical/laboratory investigation) that a teacher would
score immediately using a state-provided rubric. Such an exercise might be carried out three
times in the year. In this way, feedback on student performance would be immediate in the
classroom and the data could then be passed on to the state for further processing with the intent
of producing school, aggregate school (by demographics), district and state level scores (Figure
2).
Insert Figure 2. CLAS’ conceptual framework.
Finally, recognizing the diversity of curricula in the state and its implementation in local
contexts, CLAS provided for incorporating the uniqueness of classroom instruction into the
assessment system (Figure 1-C). To this end, student artifacts from classroom activities would be
incorporated into a portfolio that sampled these artifacts following a conceptual framework, and
the classroom teacher would evaluate the portfolio. In this way, immediate feedback was
provided to teachers and students, and the student portfolios could be sampled by the state for use
at multiple levels of the education system.
The classroom-generated test information, scored by the teacher, would then be used by
the state to create a “moderated” score for individual schools and school districts (Figure 2).
Moderation would serve to equate a particular teacher’s scores with the stringency of other
teachers’ scores. In this way, the state would be able to compare scores from one teacher to
another on the “same metric,” and teachers would receive feedback so they could calibrate their
evaluation of student performance with their peers’ evaluations across the state.
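The cases do not specify the statistics behind moderation, but a minimal sketch, assuming the kind of "automatic scaling using a common reference test" mentioned later in the TGAT footnote on moderation, conveys the idea:

$$ y'_{ij} \;=\; \bar{R}_{j} + \frac{s_{R_j}}{s_{T_j}}\,\bigl(y_{ij} - \bar{T}_{j}\bigr), $$

where y_{ij} is teacher j's raw score for student i, \bar{T}_{j} and s_{T_j} are the mean and standard deviation of teacher j's scores, \bar{R}_{j} and s_{R_j} are the mean and standard deviation of the same students' scores on the common reference (audit) test, and y'_{ij} is the moderated score placed on the common metric. Group moderation by peer review, as in the TGAT and Queensland cases described below, would adjust scores through professional judgment rather than by formula, but to the same end.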
The architects of CLAS envisioned a 10-year implementation period (Figure 3). This
was necessary because several new technologies had to be developed. One such technology was
that of alternative assessment including performance assessment. Another new technology, at
least for California, was moderation, which the State had not tried before. A third new technology
was combining data from Sources B and C into a state score for schools, districts and the state as
a whole over time. And, finally, the fourth new technology was that of comparing the classroom-generated information with the on-demand state "audit" in order to see if inconsistencies arose in
classrooms and to reconcile those inconsistencies.
Insert Figure 3. CLAS implementation time line.
CLAS never realized its naïve ten-year implementation time line. Ten years is a political
eternity. And, as it turns out, CLAS ignored Governor Wilson’s political needs and priorities—
an achievement score produced in a timely fashion for each and every student taking the state’s
mandated achievement tests. Moreover, CLAS ran afoul of conservative interest groups in the
State especially in the content used in the literature and writing tests (Kirst & Mazzeo, 1996). As
Governor Wilson wrote to the California Legislature when he red-lined the CLAS budget:
SB 1273 [CLAS] takes a different approach . . . . Instead of mandating individual student
scores first, with performance-based assessment incorporated into such scores as this
method is proven valid and reliable, it mandates performance-based assessment now and
treats the production of individual student scores as if it were the experimental
technology--which it clearly is not. In short, SB 1273 stands the priority for individual
scores on its head.
In spite of its naiveté, CLAS provides a vision of what an accountability system might
look like that links formative and summative assessment. Embedding statewide assessment in
classes to provide a common ground for assessing and comparing achievement has several
virtues. First, it aligns the State’s on-demand, external audit with classroom relevant activities.
Second, it provides immediate feedback to students and teachers, feedback linked not just to the
same class but to peer classes.
And third, classroom assessment is taken seriously and
incorporated into the state-level reporting system. The use of portfolios or other classroom
artifacts, as well, links classroom activities to the state assessment. Moreover, it provides an
opportunity to have local curriculum implementation considered in the assessment of students’
achievement (not unlike the school-based component of the British GCSEs and A-levels).
Another benefit is that CLAS proposed putting part of the state’s assessment responsibility in the
hands of teachers and providing training and moderation for their professional development.
Finally, the on-demand portion of the system provides a means for immediately auditing
classroom/school scores that are obviously out of line, thereby providing the external
accountability needed for public trust in the testing system.
Task Group on Assessment and Testing: National Curriculum Assessment Results
The Task Group on Assessment and Testing (TGAT) was set up by the UK government
to advise on a new system for national testing at ages 7, 11, 14 and 16 to accompany the first-ever
national curriculum. TGAT proposed the system considered here, developed under severe time
pressures set by the government (from mid-September to Christmas 1987). TGAT issued its
report at the end of December 1987 (“TGAT Report”; DES 1987), with a supplementary report
(DES, 1988) a few months later, including plans for teacher training and for setting up and
operating a network of local moderation groups.
School assessment practice at the time was generally uncoordinated and ineffective in
providing a clear picture of pupils’ progress, attainment or potential. The brief to the group was to
produce a new scheme, but it was otherwise vague – so TGAT was left free to invent a radically
new system. However TGAT members were aware that there were powerful political groups that
believed that external tests for all, with publication of school performance, were all that was
required. Their report put a different view:
A system which was not closely linked to normal classroom assessments and which did not
demand the professional skills and commitment of teachers might be less expensive and
simpler to implement, but would be indefensible in that it could set in opposition the
processes of learning, teaching and assessment. (para. 220, DES 1987)
One basic principle of their scheme was that formative assessment would be the key to
raising standards; they argued that external assessments could only support learning if linked to
formative practice. This meant that both formative and summative assessments had to work to
the same criteria.
Given evidence that pupils' attainments at any one age covered a range corresponding to
several years of normal progression, a second principle proposed by TGAT was that there should
be a single scheme of criteria, spanning across the age-ranges and setting out guidance for
progression in learning. This was accepted and implemented in the specification of the national
curriculum for all subjects: each subject had to specify a sequence of ten criterion-referenced
levels to cover the age range from 5 to 16.
The group also stressed a third principle, that good external tests could be powerful
instruments for raising the quality of teachers' assessments, and recommended that teachers’
assessments should be combined with the results of external tests, with uniformity of standards
secured through peer review in group moderation.7 Here, the group was able to draw on the
experience of moderation procedures built into external public examination systems in the UK
over many years (see Wood 1991).
However, the group was well aware of the limitations, in validity and reliability, of short
external tests. So they recommended, for example, that the external tests at age 7 be extended
“tasks,” rather like well-designed pieces of teaching which engaged children, and so designed that
they were given the opportunity to show performance in the appropriate targets. The general term
Standard Assessment Tasks8 was coined to emphasize that the tests, at all ages, should be
stronger on validity than existing instruments.
TGAT was deeply concerned about the prospect that the government would want too rapid an
introduction of a new scheme. So the report stated:
We recommend that the new assessment should be phased in over a period adequate for
the preparation and trial of new assessment methods, for teacher preparation, and for
pupils to benefit from extensive experience of the new curriculum. We estimate that this
period needs to be at least five years from the promulgation of the relevant attainment
targets.... The phasing should pay regard to the undue stress on teachers, on their schools
and the consequent harm to pupils which will arise if too many novel requirements are
initiated over a short period of time (Section 199, DES 1987).
The overall scheme is represented in Figure 4. It was founded on four operational
recommendations, namely that: (1) the system should be based on a combination of moderated
teachers’ ratings and standardized assessment tasks, (2) group moderation should be an integral
part of the system to produce the agreed combination of moderated teachers’ ratings and the
results of national tests, (3) the final reports on individual pupils to their parents should be the
responsibility of the teachers, supported by standardized assessment tasks and group moderation
(recommendations 14, 15 and 17, DES 1987), and (4) each subject's curriculum should be
specified as a profile of about four components, each with its ten levels of criteria, and that
assessment and test results be reported as a profile of the separate scores on each component.
7 Moderation is UK terminology for a process whereby the assessments of different teachers are put onto a common scale, whether by mutual agreement between them, or by external audit, or by automatic scaling using a common reference test.
8 Known as SATs – not to be confused with the meaning in the USA.
Insert Figure 4: Overview of the TGAT system.
Teachers generally welcomed the report. Prime Minister Thatcher, however, disliked it,
and later wrote:
Ken Baker [Education Minister] warmly welcomed the report. Whether he had read it
properly I do not know: if he had it says much for his stamina. Certainly I had no
opportunity to do so before agreeing to its publication . . .that it was then welcomed by
the Labour party, the National Union of Teachers and the Times Educational Supplement
was enough to confirm for me that its approach was suspect (pp. 594-595 in Thatcher,
1993).
Thatcher’s concern was reflected in the suspicions of some commentators—was this a
Trojan Horse of the right, or a subversion from the left by the “educational establishment?” The
Task Group had not consciously engaged in either maneuver; they were focused on constructing
the optimum system. The suspicions reflect both the simplistic level of public thinking about
testing and the fraught ideological context.
In June 1988, Education Minister Baker announced that he endorsed almost all of the
proposals, including the principle of balance between external tests and teachers' assessments.
However, he found the proposals about the system for managing the implementation of the new
scheme "complicated and costly" – a reservation which left no plan for implementing the
moderation. New proposals were to be worked out by a newly established Schools Examination
and Assessment Council (SEAC).
The subsequent history was of a step-by-step dismemberment of the recommendations
(see Black 1997, Daugherty 1995).
The first casualty was the TGAT vision of Standard
Assessment Tasks. The trials in 1990 of the first attempts at such tasks attracted considerable and
mainly adverse publicity. The design was modified, but then Baker was replaced. His successor
declared the new tasks to be "elaborate nonsense." A new design was imposed requiring short
‘manageable’ written tests.
The second casualty was teacher assessment. The SEAC paid almost no attention to this
aspect of their brief, perhaps because the priority of setting up the SATs pushed all else off the
agenda, or perhaps because much of the “acceptance” of TGAT was a pretence, to be slowly
abandoned as a different political agenda was implemented. Decisions about the function of
teachers' assessments in relation to the SAT results were changed year by year. Following the
abandonment of moderation, the published "league tables" for schools were based on the SATs
results alone. Teachers' assessments, un-calibrated, were to be reported to parents alongside SAT
results, and were later reduced to a formality when it was ruled that teachers could decide their
assessment results after the SATs results were known.
Underlying this history is a struggle for power between competing ideologies.
Conservative policy had long been influenced by fear that left-wing conspiracies seek to
undermine education (Lawton 1994). An example is the following account of the view of
Baker’s predecessor as minister:
Here Joseph shared a view common to all conservative educationists: that education had
seen an unholy alliance of socialists, bureaucrats, planners and Directors of Education
acting against the true interests and wishes of the nation's children and parents by their
imposition on the schools of an ideology (equality of condition) based on utopian dreams
of universal co-operation and brotherhood (Knight, 1990).
Indeed, after Thatcher replaced Baker, the TGAT policy was doomed – it was the creature of a
prince who had fallen from favor at court.
However, there remains the question of whether the TGAT plan contained such serious
defects that it was too fragile to survive with or without opposition. Two difficult aspects of the
scheme were the stress on criterion referencing and the ten-level scheme for progression.
Attempts to implement criterion-referencing in UK public examinations had already run into
difficulty, and the curriculum drafters had little evidence on which to base a scheme of
progression which would match the sequences in which most children achieve proficiency. Both
of these aspects continue to be controversial to the present day (Wiliam 2001) but the national
curricula are still expressed in eight levels of criteria, and everyday school talk involves such
phrases as, “he is working towards level five.” However, the test-level results are only weakly
related to the level criteria.
Another difficulty with the TGAT proposal was the practice of formative and summative
assessment by teachers (Brown, 1989; Swain, 1988). It was known that teachers’ summative
work was of weak quality. Good practice had been established in limited areas, notably in
teacher assessments in English, where teacher-assessed alternatives to the external examinations
for the school-leaving certificates had operated successfully. Various graded assessment schemes
had also set good precedents, although the basis of these was mainly frequent summative testing
conducted by teachers rather than by an integration of formative with summative practices. For
the proposed systems of group moderation, however, good practices had been developed and the
TGAT report was able to set out a detailed account of how meetings for peer moderation might
be conducted.
For formative assessment, little was known at the time about either its precise meaning or
about how to develop its potential. The TGAT report said very little about the link between the
formative and summative aspects of teachers’ work and, whilst it listed the various ways that
teachers might collect data, it had little to say about how teachers should make judgments on the
evidence they might collect.
Those responsible for the development of policy understood little about these problems,
their vision being focused on ‘objective’ external tests. Instead of being sensitive to the need for
careful development research to build up the new system, they imposed such speedy
implementation that the TGAT plan could not survive. Even if the designers had known exactly
how to implement a new scheme, it should have been allowed many more years for development.
Teachers could not possibly, in two or three years, grasp the radically new TGAT plans and
incorporate them into the complexities of classroom practice whilst also dealing with the quite new
curriculum.
But in fact TGAT had faced a dilemma. They knew of – had indeed been involved in –
limited work on all of the components of the new practices they were recommending, but they did
not have time to think through all of the evidence that pertained to their case. The temptation to
produce an acceptable plan meant that they failed to face realistically the full implications of
imposing these practices, in a newly articulated whole, on all teachers. If they had done so, then
perhaps they might have had the courage to declare that at least ten years of careful development
work was essential to translate the proposals into a workable system. If they had stated such a
conclusion, it is almost certain that they would have been dismissed and their report might never
have been published. Ironically, whilst this would have impoverished subsequent debate, it
would probably have made no difference to the eventual outcome.
In spite of its political naiveté, TGAT provides a vision somewhat similar to CLAS of
what a formative-summative assessment system might look like. Such a system would be built to
self-consciously link formative and summative, criterion-referenced assessments to the same set
of standards. As a consequence, teacher assessments would be combined with external tests and
through moderation a combined score and a profile of scores would be reported. Moderation
would provide quality assurance for the classroom data. Formative assessment would take
priority and be a key to raising standards; teachers would be responsible for final reports to
individual student’s parents supported by the standardized test results. This system would take a
developmental perspective where a single set of criteria, spanning across the age ranges, would
set out guidance for progression in learning. The external tests would be extended tasks, not short
paper-and-pencil tests, similar to well-designed pieces of teaching which would engage students
and would be tied to performance targets.
The Queensland Senior Certificate Examination System
The Queensland Senior Certificate Examination System was designed specifically to
provide formative feedback to teachers and 16-17 year-old college-bound students with the goal
of improving teaching and learning. The examination system was conceived from the very start
to focus on the formative function of assessment and build a summative examination from the
elements of a formative system.
The system was developed in response to an externally
mandated summative system similar to the British A-Level examinations set by the University of
Queensland and later, in transition, by the Board of Senior Secondary Studies.
In 1971, responding to the Radford Committee report (see below), the State of
Queensland, Australia, abolished external examinations and replaced them with a teacher-judgment, criteria-based assessment that students had to pass to earn a Senior Certificate at the
end of secondary schooling. The abolition of the A-level-like external examination was in
response to: (a) “… the recurring tendency for the examinations to be beyond the expectations of
teachers and the capability of students” (Butler, 1995, p. 137), (b) teachers learning all they could
about the senior examiner to anticipate test questions, (c) the absence of formative feedback to
teachers and students, and (d) public dissatisfaction with the process and narrowness of the
curriculum syllabi (Butler, 1995).
In 1970 the Radford Committee recommended abolishing the external examination and
replacing it with a school-based assessment that departed radically from any known assessment of
accomplishment for the Senior Certificate. The Board of Senior Secondary Studies (“Board”)
was to have responsibility for setting the content of the 2-year syllabus in each content area and
the methods of assessing accomplishment.
The system included moderation to establish
comparability of achievement ratings through a Moderation Committee and system of Chief
Moderators.
The “Queensland Experiment” has evolved over the past 30 years and will continue to do
so.
At present it contains two major components. The first component is a school-based
assessment system in each subject area (e.g., science). The school-based assessment system
includes, in each subject area, a locally adapted “course of work” that specifies content and
teaching consistent with the subject syllabus set by the Board and accredited by an external,
government-sponsored subject review panel (see Figure 5).
Teachers evaluate (rate) their
students’ performance on the assessment and teachers’ ratings are reviewed through a process of
moderation; the Moderation Committee and Chief Moderators provide external oversight.
The second component of the Queensland system is a general Queensland Core Skills
(QCS) test. The test is used as a statistical bridge for equating ratings across subjects for the
purposes of comparison. If there is a large discrepancy between teacher ratings based on two years of work in a subject and the QCS, the former carries the weight, not the latter.
Insert Figure 5. Governance of Queensland’s assessment system.
The assessment system works as follows. A subject syllabus (e.g., in physics) is set by
the Board and covers objectives in four areas: (1) affective (attitudes and values), (2) content
(factual knowledge), (3) process (cognitive abilities), and (4) skills (practical skills). Students are
assessed in the last three areas only. The assessment system is criteria- and standards-referenced, not
norm-referenced; student performance is assessed against standards, not against their peers'
performance. The syllabus and assessments are instantiated locally, taking account of local
contexts to link learning to everyday experience, in the form of “school work programs.”9
Teachers rate students according to their level of performance, not to fit a normal curve,
on a 5-point scale: Very High Achievement, High Achievement, Sound Achievement, Limited
Achievement, Very Limited Achievement.
Comparability of curriculum and standards is assured through a school-work-program
accreditation overseen by State and District Review Panels comprised largely of teachers (about
20% of teachers are engaged in this manner in the State in each subject). These same panels assure
comparability of ratings across the state by reviewing and certifying each school's results in each
subject area.
9 "The science teachers in each school are encouraged to write unique work programs guided by the broad framework in the syllabus document but using all the resources available within the community and environment surrounding the individual school and taking account of the unique characteristics of students in the school" (Butler, 1995, p. 141).
The assessment of achievement is formative as well as summative and continuous
throughout the two years of study; students are told which assessments are formative and which
summative (for the purpose of rating their performance on the Levels of Achievement). Each
assessment must clearly specify which criteria and standard it measures and against which a
student’s performance will be judged; students are to receive explicit feedback on each and every
assessment exercise, formative or summative, along with recommended steps for improving their
performance.
The Board is responsible for oversight of the assessment system (see Figure 5), devolving
specific authority to the State and District (subject-matter) Review Panels. The Review Panels
have responsibility for: (a) accreditation—examining a school work program and verifying that it
corresponds to the Board’s syllabus guidelines; (b) monitoring—examining samples of Year 11
student work with tentative Levels of Achievement assigned to check compatibility with criteria
and standards; and (c) certification of Year 12 Exit Levels of Achievement by ensuring
compatibility.
The Queensland system provides a number of lessons in assessment-system design. Built
in response to an externally mandated system that was widely recognized as too difficult for
teachers and students, too narrow, and as having undesirable consequences, the present system
focuses on formative assessment with immediate feedback to teachers and students that is linked
to a summative function—the senior certificate. The assessment is continuous over two years,
providing feedback to students on how to improve their performance from both the formative and
summative examinations. The curriculum (“course of work”) and assessments are developed
locally to external specifications for the content domain (a syllabus) and assessment techniques.
An external, governmental governance structure is in place to assure the public of the credibility
of the course of study, the assessment, and the scores derived from the assessment. And the
assessment focuses on knowledge, cognitive, and practical abilities acquired during the course of
study.
Design Choices
The case studies provide a basis for identifying choices that are made, explicitly or
implicitly, in the design of a large-scale assessment program linking formative and summative
functions.
Here we highlight what appear to be important choices that, once made, give
significant direction to the degree of alignment of the summative and formative functions in a
large-scale assessment. To be sure, far more choices are made than are represented here, and those
choices will ultimately determine whether the alignment produces salutary results; the devil
certainly is in the details.
We have sorted design decisions into a set of related categories; within each category, a
series of choices is made (Table 1). Decisions made in early categories restrict choices later on.
A set of choices across the categories would provide a rough blueprint for an assessment system.
Insert Table 1. Large-Scale Assessment Design Decisions and Choice Alternatives
Purpose
Large-scale assessment systems are meant to serve one or another purpose. Some are
designed to evaluate institutions, others are designed to measure students’ achievement either for
learning improvement (typically formative but see Wood & Schmidt, 2002) or accountability
(typically summative but see especially the Queensland case) while others align assessment for
learning and accountability (as have each of our case studies). At present, the choice alternative of
solely learning-improvement assessment would probably not credibly serve the democratic
purpose of publicly accounting for actions in a shared framework. Consequently, we do not view
it as a stand-alone option; the choice set is: PURPOSE = {Accountability, Aligned}.
Accountability Mechanism
We can identify two accountability mechanisms from our case studies: assessment and
audit. The assessment mechanism provides a direct measure of some desired outcome—e.g., an
achievement test. An audit focuses on the processes in place that are necessary to produce
trustworthy information for accountability purposes. It is an indirect measure of some desired
outcome. For formative assessment, the audit would be closely tied to evidence on the quality of
teaching and learning. For summative assessment, the validity and reliability of classroom-generated information would be of central concern in an audit (and addressed in part by
moderation). While not in place in K-12 education (including CLAS and TGAT), the audit is
used for accountability purposes in higher education in the UK, Sweden, Australia, New Zealand,
and Hong Kong. Queensland uses a combination of local assessment and an audit to ensure its
integrity. ACCOUNTABILITY MECHANISM = {Assessment, Audit, Combination}.
Developmental Model
Large-scale assessment might focus on students’ achievement, learning potential, actual learning, or some combination. Achievement is the declarative, procedural, and schematic knowledge that a student demonstrates at a particular point in time. For example, TIMSS measured 13- and 14-year-old students’ science achievement (“Population 2”). Learning potential, beyond achievement, is the amount of support or scaffolding a student needs to achieve at a level higher than her unaided level. For example, with assistance, a 14-year-old student might be able to solve force and motion problems that 16-year-olds would be expected to solve. Dynamic assessment of students’ knowledge, with teachers interacting with students, fits this focus. Finally, learning is the change or growth from one point in time to another after intentional or informal instruction and practice. If learning is successively measured by summative assessments that are conceptually and/or statistically linked to reflect development, such as TGAT’s Standard Assessment Tasks, the time series can produce ipsative information and measure individual learning as it develops over time. What is measured, when it is measured,
and how it is measured depends on whether achievement, learning potential or actual learning is
the focus of the assessment system: DEVELOPMENTAL MODEL = {Achievement, Potential,
Learning}.
Knowledge Tapped
Typically, large-scale assessments measure declarative knowledge, mostly facts and concepts in a domain, and low-level procedural knowledge in the form of algorithms in mathematics or step-by-step procedures in science. For example, Pine and colleagues found that two-thirds of the TIMSS Population 2 science test items consisted of factual and simple conceptual questions (AIR, 2000), with the balance mostly reflecting routine procedures. In measuring achievement, potential, or learning, we identify four types of knowledge, the first two being declarative and procedural knowledge. The third type, schematic knowledge, is “knowing why”; for example, “Why does New England have a change of seasons?” Such knowledge calls for a mental model of the earth-sun relationship that is used to provide an explanation. The fourth type is “strategic knowledge,” knowing when, where, and how knowledge applies in a particular situation. This type of knowledge is required in all even slightly novel situations, regardless of whether the knowledge called for is declarative, procedural
or schematic. So, in the design of an assessment system, a choice needs to be made of the type of
knowledge to be tested: KNOWLEDGE = {Declarative, Procedural, Schematic, Strategic,
Combination}.
Abilities Tapped
As a test relies increasingly on strategic knowledge in bringing forth intellectual
capabilities to solve problems or perform tasks of a relatively novel nature in a subject-matter
domain, it moves away from a strict test of knowledge and increasingly focuses on cognitive
ability (Shavelson & Huang, 2003). These abilities might be crystallized in nature, drawing on
generalization from specific learning to general reasoning in a domain (e.g., reading
comprehension).
The Scholastic Assessment Test taps crystallized verbal and quantitative
abilities. Or the abilities might be fluid in nature, drawing on abstract representations and self-regulation in completely novel situations; the Raven’s Matrices test is the prototype for fluid ability. Or the abilities might require spatial-visual representations that have been generalized from specific learning situations. Consequently, choices need to be made about the nature of the cognitive abilities to be tapped: ABILITY = {Crystallized, Fluid, Spatial-Visual, Combination}.
Balance of Knowledge and Abilities
In designing an assessment system, a choice might need to be made as to the relative
balance between the kinds of knowledge tapped on the test and the kinds of abilities tapped:
BALANCE = {Knowledge, Ability, Combination}.
Curricular Link
An assessment can be linked to curriculum on a rough continuum ranging from very
immediately linked to remotely linked (Ruiz-Primo, Shavelson, Hamilton & Klein, 2002). At the
immediate level, classroom artifacts or embedded assessments (e.g., CLAS) provide achievement
information. At the close level, tests should be sensitive to the specific curriculum content and activities students engaged in (e.g., an “embedded” or “end-of-unit” test). At the proximal level, tests should reflect the knowledge and skills relevant to the curriculum, but the topics differ from those directly studied (e.g., Standardized Assessment Tasks). A distal test is based on state or national
standards in a content domain (e.g., National Assessment of Educational Progress Mathematics
Test), and a remote test provides a very general achievement measure (e.g., Third International
Mathematics and Science Study’s Population 2 Science Test). Clearly formative assessments are
typically found at the immediate, close and proximal levels while summative assessments are
found at the proximal, distal and remote levels (but see the Queensland case).
CLAS, TGAT,
and Queensland all envisioned a combination of assessments at various levels. CURRICULAR
LINK = {Immediate, Close, Proximal, Distal, Remote, Combination}.
Information (“Data”) Sources
Achievement information (“data”) can, simply speaking, be collected externally by an
independent data collection agent, internally as a part of classroom activities, or both. Most
large-scale assessments in the United States, such as the National Assessment of Educational
Progress or California’s STAR assessment are conducted externally by an independent agency
(often by a governmental contractor). The British, however, have a tradition of combining
external and internal large-scale assessment including their GCSE, A-level and TGAT
examinations. Indeed, TGAT was envisioned as primarily an internal examination system with
external audits. CLAS, as well, combined external and internal assessment, initially giving priority to external examinations with a planned shift toward internal examinations as the latter proved feasible. Finally, the Queensland system is primarily an internal examination system with extensive governmental oversight and an external examination used primarily for cross-subject equating purposes. INFORMATION SOURCES = {Internal, External, Combination}.
Assessment Method
At the most general level, we speak of selected- and constructed-response tests. With
selected-response tests, respondents are asked to select the correct or best answer from a set of
alternatives provided by the tester. Multiple-choice tests are the most popular version, but true-false and matching tests, for example, are also selected-response tests. Such tests are efficient to administer and score.
With a constructed-response test, respondents produce a response. Constructed-response
tests range from a simple response (such as fill-in-the-blank), to short-answer, to essay, to
performance assessment (e.g., provide a mini-laboratory and ask a student to perform a scientific
investigation), to concept-maps (students link pairs of key concepts in a domain), to portfolios, to
extended projects. Constructed response, then, is probably too coarse a category, but for simplicity we lump these formats together, recognizing that this alternative needs unpacking in practice.
What is important for assessment systems is that the method of testing influences what
can be measured (e.g., Shepard, 2003). For example, performance assessments (envisioned by
both TGAT and CLAS) provide reasonably good measures of procedural and schematic
knowledge but overly costly measures of declarative knowledge. In the design of an assessment
system, then, choices need to be made among possible testing modes and what is to be tested:
TESTING METHOD = {Selected, Constructed, Combination}.
Note that selected-response methods permit machine scoring, whereas constructed-response methods require human (or, in a very few cases such as the Graduate Management Admission Test, computer) judgment, with or without moderation across judges and sometimes, as in the Queensland case, across the subject areas tested.
Test Interpretation
Test scores are typically some linear or non-linear combination of scores on the items that
comprise the test. In and of themselves, they do not have meaning. If we simply tell you that you
received a score of 30 correct, you know little, other than your score was not so low as to be zero
correct! However, you might ask, “How many items were on the test?” Or, you might want to
know, “How did my classmates do on the test?” Or you might want to know, “How much of the
knowledge domain or how many of the performance targets have I learned?” Or, finally, being
very persistent, you might ask, “Did my score improve from the last time I took the test?” Simply
put, a test score needs to be referenced to something to give it meaning.
We distinguish three types of referencing for giving meaning to test scores. The first is
norm or cohort referencing in which individuals are rank ordered and a score comes to have
meaning by knowing the percent of peers who scored below a particular score. For example, the
score of 30 might have been higher than that attained by 90 percent of your peers. The second is
criterion or domain referencing in which a score estimates the amount of knowledge acquired in a
domain. In this case, a score of 30 might mean that from the sample of items, we estimate that
you know about 85 percent of the knowledge domain. And the third type of score reflects
progression or change over a period of time. We call this an ipsative score; it reflects the change in the level of your performance in a particular domain. Test construction and interpretation depend on the intended meaning or interpretation placed on scores: SCORE INTERPRETATION = {Norm-referenced, Domain-referenced, Ipsative, Combination}.
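To make the three referencing schemes concrete, the following is a minimal sketch in Python; the peer scores, the 35-item test length, and the earlier score of 24 are invented for illustration and only loosely echo the numerical example above.

```python
# Illustrative only: all scores and the test length below are invented.

def percentile_rank(score, peer_scores):
    """Norm-referenced meaning: the percent of peers who scored below this score."""
    below = sum(1 for s in peer_scores if s < score)
    return 100.0 * below / len(peer_scores)

def domain_estimate(score, n_items):
    """Domain-referenced meaning: the estimated percent of the domain mastered,
    treating the items as a representative sample of the domain."""
    return 100.0 * score / n_items

def ipsative_change(current, previous):
    """Ipsative meaning: the change in the same student's performance over time."""
    return current - previous

raw = 30                                          # hypothetical raw score
peers = [12, 15, 18, 20, 22, 24, 26, 27, 29, 33]  # invented peer scores
print(f"Norm-referenced: above {percentile_rank(raw, peers):.0f} percent of peers")
print(f"Domain-referenced: about {domain_estimate(raw, 35):.0f} percent of the domain (35-item test)")
print(f"Ipsative: {ipsative_change(raw, 24):+d} items since the previous administration")
```

The same raw score of 30 thus takes on three different meanings, depending entirely on the reference chosen.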
Standardization of Administration
Standardization refers to the extent to which the test and testing procedures are the same
for all students. At one extreme, a test is completely standardized when all students receive the
same test, under the same testing conditions, at the same time, and so on. This is typically what is
meant by a standardized test. When different test forms are used, we construct those forms to
produce equivalent scores or statistically calibrate them to do so.
Even with different but
equivalent test forms, we speak of the test as standardized. However, at the other extreme, when
tests are adapted to fit each individual, we do not have a standardized test; but the test just might
“fit” the person better than a standardized test. There are a multitude of intermediate conditions.
In building a large-scale assessment, choices about standardization need to be made:
STANDARDIZATION = {High, Intermediate, Low}.
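Where different forms must produce equivalent scores or be statistically calibrated to do so, one common technique, offered here only as a hedged sketch (the case studies do not specify their equating procedures, and the score distributions below are invented), is linear mean-sigma equating under a randomly-equivalent-groups design: a form-X score x is placed on form Y's scale as y = mean_Y + (SD_Y / SD_X)(x - mean_X).

```python
from statistics import mean, pstdev

def mean_sigma_equate(x_score, form_x_scores, form_y_scores):
    """Linear (mean-sigma) equating: express a form-X raw score on form Y's scale
    by matching the two forms' observed means and standard deviations."""
    mu_x, sd_x = mean(form_x_scores), pstdev(form_x_scores)
    mu_y, sd_y = mean(form_y_scores), pstdev(form_y_scores)
    return mu_y + (sd_y / sd_x) * (x_score - mu_x)

# Invented score distributions from two hypothetical, randomly equivalent groups.
form_x = [18, 22, 25, 27, 30, 33, 36]
form_y = [20, 24, 28, 30, 33, 36, 40]
print(round(mean_sigma_equate(27, form_x, form_y), 1))  # a form-X score of 27, expressed on form Y's scale
```

More elaborate designs (e.g., anchor items or item response theory models) serve the same purpose of making scores from different forms comparable.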
Feedback to Student and Teacher
Feedback from large-scale assessments can be immediate or delayed.
Immediate
feedback occurs, as in TGAT, Queensland and CLAS, when achievement information is collected
in the course of classroom activities (see Information Source above) and is fed back to students
and teachers almost immediately (e.g., a teacher evaluates students’ performance using his or her own or an
externally provided rubric). However, most large-scale assessments are conducted externally and
independently where feedback is typically delayed for months. FEEDBACK = {Immediate,
Delayed, Combination}.
If an assessment system is to serve a formative purpose, at least some portion of it must
provide immediate feedback to students and teachers. Moreover, the feedback should advise
students how to improve their achievement rather than merely report scores or other rankings of students (Black & Wiliam, 1998a, 1998b).
Score Reporting Level
Scores are reported for external accountability purposes, not formative purposes (e.g.,
Black & Wiliam, 1998). To internal audiences such as teachers and parents, information about
individual students is appropriate. For external audiences, scores are reported as aggregates, at
least to the classroom level. When accountability purposes are linked to important life outcomes, such as reporting scores to the public, attaching teachers’ salaries to them, or certifying that a student can graduate, high stakes accompany them. There is some evidence that, in such cases, summative accountability, although well intended, might very well work against formative learning purposes (ARG, 2002). SCORE LEVEL = {Individual, Aggregate, Combination}.
Score Comprehensiveness
Single scores are typically reported for accountability purposes. While this satisfies criteria such as clarity and ease of understanding, single scores that characterize complicated student achievements can be misleading. For this reason, a score profile—a set of scores linked to content and knowledge—offers a more informative, and possibly diagnostic, alternative (e.g., Wood & Schmidt, 2002). SCORE COMPREHENSIVENESS = {Single Score, Score Profile, Both}.
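Before turning to the case studies, the following sketch in Python (ours alone; the shorthand keys and the partial blueprint are invented for illustration and come from none of the three systems) encodes the decision categories and choice alternatives above as a simple data structure and checks a candidate blueprint against them, making concrete the sense in which a full set of choices constitutes a rough blueprint for an assessment system.

```python
# A rough encoding of the design decisions and choice alternatives discussed above.
# The dictionary keys are our own shorthand; the option labels follow the text.

DESIGN_SPACE = {
    "purpose": {"Accountability", "Aligned"},
    "accountability_mechanism": {"Assessment", "Audit", "Combination"},
    "developmental_model": {"Achievement", "Potential", "Learning"},
    "knowledge": {"Declarative", "Procedural", "Schematic", "Strategic", "Combination"},
    "ability": {"Crystallized", "Fluid", "Spatial-Visual", "Combination"},
    "balance": {"Knowledge", "Ability", "Combination"},
    "curricular_link": {"Immediate", "Close", "Proximal", "Distal", "Remote", "Combination"},
    "information_sources": {"Internal", "External", "Combination"},
    "testing_method": {"Selected", "Constructed", "Combination"},
    "score_interpretation": {"Norm-referenced", "Domain-referenced", "Ipsative", "Combination"},
    "standardization": {"High", "Intermediate", "Low"},
    "feedback": {"Immediate", "Delayed", "Combination"},
    "score_level": {"Individual", "Aggregate", "Combination"},
    "score_comprehensiveness": {"Single Score", "Score Profile", "Both"},
}

def check_blueprint(blueprint):
    """Report decisions left unmade and choices outside the allowed alternatives."""
    problems = []
    for decision, options in DESIGN_SPACE.items():
        choice = blueprint.get(decision)
        if choice is None:
            problems.append(f"no choice made for '{decision}'")
        elif choice not in options:
            problems.append(f"'{choice}' is not an alternative for '{decision}'")
    return problems

# An arbitrary, partially specified blueprint (not any of the case studies).
draft = {"purpose": "Aligned", "feedback": "Immediate", "score_interpretation": "Criterion"}
for problem in check_blueprint(draft):
    print(problem)
```

A complete blueprint supplies one allowed alternative for every decision; the case-study profiles that follow can be read as three such blueprints, with the caveat that earlier decisions constrain later ones in ways a flat checklist cannot capture.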
Application of Design Choices to Case Studies
We now illustrate how design choices (Table 1 above) influence the nature of an
assessment system.
To do so, we characterize the three case studies—CLAS, TGAT, and
Queensland.
California Learning Assessment System
CLAS was designed to align the summative and formative functions of assessment,
providing immediate feedback to teachers and students on classroom embedded tests and
activities, and summative information to the state in the form of an on-demand test to be replaced
over time by formative information (Table 2). The developmental model underlying the system
was primarily that of achievement—to provide a snapshot of students’ performance at the end of
the school year—with longitudinal aspects associated with embedded tests. The system focused
on knowledge, rather than ability or a balance of the two, especially on declarative and procedural
knowledge with novel parts of the assessment demanding strategic knowledge. To this end, the
testing methods combined selected and constructed response formats, used highly standardized
test administration, and the major method of interpretation was norm-referenced. CLAS was
designed to draw on both internal and external information sources, the former providing
immediate feedback in the form of single scores to students and teachers individually, and the latter delayed feedback to educators, parents, and students individually, and to policy makers and the public in the aggregate.
Task Group on Assessment and Testing: National Curriculum Assessment Results
The TGAT design was for a comprehensive assessment system; it was to serve both purposes (Table 2). In particular, the TGAT framework of ten criterion-referenced levels, adopted by those designing the curriculum documents, could serve ipsative purposes by providing a longitudinal picture of development. The report’s emphasis on validity, reinforced by the use of both external and teachers’ own assessments, envisaged ways of mapping all four forms of knowledge, up to the strategic, and combinations of the several dimensions of ability. Similarly, a balance between knowledge and ability was to be attained by combining external and internal sources of assessment information, thereby combining variation in levels of curricular linkage.
For methods, a wide range was envisaged: a 35-page appendix to the report set out 21
examples of assessment items. Only one of these was a selected-response item; the others included tests of reading, of oracy involving spoken responses, and practical tasks in science and mathematics. There was, in addition, a further six-page appendix describing two extended activity tasks with seven-year-olds, tested in practice, which could have elicited evidence of aspects of achievement across a range of subject areas. This diversity reflects the fact that in public tests and national surveys in the UK the use of selected-response items had always been very limited.
The interpretation was to be both ipsative and domain-referenced, enhanced by profile reporting. There was to be a combination of standards to cover a wide range of the levels at each age. Since the internal (classroom) assessments would be an aggregate of teachers’ ongoing assessments, and the external assessments produced both individual and aggregated scores, multiple reporting levels were involved. Finally, scores were to be assembled in a profile over about three or four domains in each curriculum.
Queensland Senior Certificate Examination System
Queensland aligned the summative and formative functions of assessment, giving priority to the formative function (Table 2). The system provides immediate feedback to teachers and students on all tests, formative and summative, closely linked to the local curriculum, with advice to individual students on how to improve performance. The developmental model underlying the system balanced the longitudinal and achievement alternatives: the system provides continuous feedback on performance over two years and a snapshot of students’ performance at the end of that time for certification purposes. The system attempts to measure different aspects of knowledge as well as reasoning ability in a subject domain, balancing knowledge and ability. To this end, the testing methods combined selected- and constructed-response formats and used a combination of standardization levels in test administration, with domain-referenced interpretation. The Queensland
system draws primarily on internal information sources but also uses an external source
for cross-subject equating, the Queensland Core Skills test. Qualitative (advice) as well
as quantitative (score for certification, profile for formative) feedback is provided to
individual students while an aggregate score summarizes school performance for policy
makers and the public.
Importantly, quality assurance is provided by an external,
government-sponsored audit system.
Concluding Comments
Democratic governance requires that our schools, and the educators in them, be held accountable, typically by providing information on their actions and imposing sanctions.
Democracy, however, is not a one-way street; it also requires that students, parents, policy
makers, and citizens be held accountable for their actions as well. While polls show widespread
support for the noble democratic concept of accountability, accountability can and does fall short
in practice. And when the stakes are high, as they are now in education accountability systems,
and when interpretations of large-scale assessment scores with ambiguous or narrow meaning are
treated in league tables and funding decisions as unambiguous, and when single scores are
generalized beyond justification as true characterizations of individuals and systems, the potential
for mischief is enormous.
We might ask whether the present system is working well. Carnoy and Loeb (2003)
provide evidence that in states with high-stakes large-scale testing programs, scores on the
National Assessment of Educational Progress are higher than in states without such testing
systems. However, there is also evidence that large-scale testing has unintended consequences.
Curricular shifts follow high-stakes testing and lead to a narrowing of the curriculum, a focus on
superficial factual knowledge and basic skills, practice not on subject matter but on test-taking skills, and cheating on the test (e.g., Shepard, 2003; ARG, 2002; ASE, 1992). When this
happens, externally mandated tests—output measures—rapidly become the highly valued
outcomes of education themselves, a case of the tail wagging the dog.
Democracy depends on the informed judgment of its citizens, and this dependency is unlikely to abate given the rapid social and technological change experienced over the past century. We have seen specific concepts and skills go out of date rapidly, and personal flexibility and
capacity to learn count increasingly in the workforce and in everyday citizenship, but even what
is meant by or important about these concepts changes quickly.
In addition to improving
students’ declarative and procedural knowledge, we need to build their conceptual understanding
(schematic knowledge) and ability to adapt to new situations (strategic knowledge).
Externally mandated summative assessments of students’ achievement are short-run
tactical fixes, not long-run strategies for meeting democratic goals.
To meet these goals,
transmission teaching prompted by many current accountability systems will have to be replaced
by interactive, student-centered learning. We now see how formative assessment, itself interactive, is a key component of accountability, supported by “scientific evidence” (e.g., Black & Wiliam, 1998a). Formative assessment, then, must be strengthened, teachers’ summative assessment must
be developed in harmony with their formative practices, and external summative assessment must
be aligned with formative assessment and not be allowed to go on limiting learning.
The case studies presented herein and the design principles gleaned from them and the
literature on formative and summative assessment suggest a wide array of alternative ways for
aligning formative and summative assessment. We fully recognize that one size will not fit all
situations. Rather, some combination of choices will produce an aligned, large-scale assessment
system that fits political and educational realities.
To this end, we are compelled to introduce uncertainty about political will and capacity.
Will the public understand the mandate? Evidence from CLAS and TGAT makes us skeptical;
the stakes, however, are too high not to try. Unless teachers can be trained to respond in their
teaching and learning work, there may be little point in assessing novel, formative outcomes
anyway. To be sure, we have an existence proof in Queensland, but how well will it travel across
the seas? This is an open question that begs for research on teacher education and professional
development. It also raises questions about the quality of the college education in academic majors that students receive, both those entering teacher education and those who do not.
Political realities have to be addressed. CLAS and TGAT did not die because of their
inherent faults. They died because politicians and the public could not accept the novelties, a
rejection due in part to lack of understanding of the issues, and in part to the histories of
competing ideologies about education and the social good. So what should be done about this?
Cynically, one could say that we are entertaining our academic selves but will only be
building yet another scheme to be rubbished in the arena of policy practice. To be optimistic yet
realistic, there must be at least as much effort in building public understanding of, and trust in,
new practices as in actually building them. So perhaps we have to re-balance our distribution of
effort.
References
AIR (2000). Review of science items from TIMSS-R: A report from the science committee to the technical
review panel. American Institutes for Research, Washington, DC.
ASE (1992).
Assessment Reform Group (ARG). (2002). Testing, motivation and learning. Cambridge, England: University of Cambridge Faculty of Education.
Atkin, J.M, Black, P., & Coffey, J.E. (Eds) (2001). Classroom assessment and the National Science
Education Standards. Washington, DC: National Academy Press.
Atkin, J.M, & Coffey, J.E. (Eds.) (2003). Everyday assessment in the science classroom. Arlington, VA:
NSTA Press.
Black, P.J. (1993). Formative and summative assessment by teachers. Studies in Science Education, 21,
49-97.
Black, P.J. (1997). Whatever happened to TGAT? In C. Cullingford (Ed.), Assessment vs. evaluation (pp. 24-50). London: Cassell.
Black, P.J., & Wiliam, D. (1998a). Assessment and classroom learning. Assessment in Education, 5(1), 7-73.
Black, P.J., & Wiliam, D. (1998b). Inside the black box: Raising standards through classroom assessment.
London, UK: King’s College London School of Education.
Black, P., Harrison, C., Lee, C., Marshall, B., & Wiliam, D. (2002). Working inside the black box. London, UK:
King’s College London School of Education.
Brousseau, G. (1984). The crucial role of the didactical contract in the analysis and construction of situations in
teaching and learning mathematics. In H.-G. Steiner (Ed.), Theory of mathematics education: ICME 5
topic area and miniconference (pp. 110-119).
Brown, M. (1989). Graded assessment and learning hierarchies in mathematics: an alternative view. British
Educational Research Journal, 15(2), 121-128.
Butler, J. (1995). Teachers judging standards in senior science subjects: Fifteen years of the Queensland
experiment. Studies in Science Education, 26, 135-157.
Carnoy, M., & Loeb, S. (2003). Does external accountability affect student outcomes? A cross-state analysis.
Educational Evaluation and Policy Analysis, 24(4), 305-331.
Daugherty, R. (1995). National curriculum assessment: A review of policy 1987-1994. London: Falmer Press.
DES (1987). Task Group on Assessment and Testing: A report. London: Department of Education and Science and the Welsh Office.
DES (1988). Task Group on Assessment and Testing: Three supplementary reports. London: Department of Education and Science and the Welsh Office.
Doyle, W. (1983). Academic work. Review of Educational Research, 53(2), 159-199.
Gormley, W.T. Jr., & Weimer, D.L. (1999). Organizational report cards. Cambridge, MA: Harvard
University Press.
Kirst, M.W., & Mazzeo, C. (1996). Rise, fall, and rise of state assessment in California 1993-96. Phi Delta
Kappan, 78(4), 319-323.
Knight, C. (1990). The making of Tory education policy in post-war Britain 1950-1986. London: Falmer.
Lawton, D. (1994). The Tory mind on education 1979-94. London: Falmer.
Madaus, G.F. (1988). The influence of testing on the curriculum. In L.N. Tanner (Ed.), Critical issues in
curriculum: the 87th yearbook of the National Society for the Study of Education (part 1) (pp. 83-121).
Chicago, IL: University of Chicago Press.
March, J.G., & Olsen, J.P. (1995). Democratic governance. New York: Free Press.
Messick, S. (1989). Validity. In R.L. Linn (Ed.), Educational measurement (pp. 13-103). Washington, DC: American Council on Education/Macmillan.
Rowe, M.B. (1974). Wait time and rewards as instructional variables, their influence on language, logic
and fate control. Journal of Research in Science Teaching, 11, 81-94.
Shavelson, R.J., & Huang, L. (2003). Responding responsibly to the frenzy to assess learning in higher
education, Change, 35(1), 10-19.
Shepard, L.A. (2003). Reconsidering large-scale assessment to heighten its relevance to learning. In J.M.
Atkin & J.E. Coffey (Eds.), Everyday assessment in the science classroom. Arlington, VA: NSTA
Press.
Swain, J.R.L. (1988). GASP: The graded assessments in science project. School Science Review, 70(251),
152-158.
Thatcher, M. (1993). The Downing Street years. London: HarperCollins.
Wiliam, D. (2001). Level best? Levels of attainment in the national curriculum assessments. London:
Association of Teachers and Lecturers.
Wiliam, D., & Black, P. (1996). Meanings and consequences: a basis for distinguishing formative and
summative functions of assessment? British Educational Research Journal, 22(5), 537-548.
Wood, R. (1991). Assessment and Testing: a Survey of Research. Cambridge: Cambridge University Press.
Wood, R., & Schmidt, J. (2002). History of the development of Delaware Comprehensive Assessment Program
in Science. Paper presented at the National Research Council Workshop on Bridging the Gap between
Large-scale and Classroom Assessment. National Academies Building, Washington, DC.
Figures
Figure 1. Sources of CLAS data.
Figure 2. CLAS’ conceptual framework.
Figure 3. CLAS implementation time line.
Figure 4: Overview of the TGAT system.
Figure 5. Governance of Queensland’s assessment system.
[Figure 1 diagram: three sources of CLAS data feed a student’s score: (A) “on demand” matrix sampling of tasks and tests; (B) standardized curriculum-embedded assessments; and (C) portfolios.]
[Figure 2 diagram: CLAS’ conceptual framework, linking individual and aggregate levels of performance: (A) a matrix-sample benchmark of multiple-choice and performance-based assessment yielding moderated individual, school, and district scores; (B) standardized curriculum-embedded assessments; and (C) portfolios, with teacher moderation, sampling from the class for aggregation, and teacher calibration and professional development.]
[Figure 3 diagram: CLAS implementation timeline, from pilot (1990) through implementation to operation (2000), for (A) standardized benchmarks, (B) curriculum-embedded assessments, and (C) portfolios, with increasing teacher responsibility for assessment over time.]
[Figure 4 diagram: overview of the TGAT system. Teachers’ formative and summative assessments and external tests (Standardized Assessment Tasks) feed inter-school moderation groups that (a) align overall teacher assessment results with the SATs and (b) align teachers’ assessments across schools; teachers decide individual students’ results, and schools’ overall results are published.]
[Figure 5 diagram: governance of Queensland’s assessment system. The Board of Senior Secondary Studies sets the syllabus and assessment methods and awards the Senior Certificate; state and district subject review panels (e.g., review panels for science: biology, chemistry, earth science, multi-strand) hold accreditation and monitoring-moderation responsibilities across districts; a subject department in a local school (e.g., the biology department) sets its “course of work” based on the syllabus, sets formative and summative assessments based on the assessment methods specification, and scores student assessments, which are then moderated.]
Tables
Table 1. Large-Scale Assessment Design Decisions and Choice Alternatives
Table 2. Profile of Design Decisions for the California Learning
Assessment System
Table 1. Large-Scale Assessment Design Decisions and Choice Alternatives

Decision                                Alternative Choices
Purpose                                 Learning Improvement; Public Accountability; Aligned
Accountability Mechanism                Assessment; Audit; Combination
Developmental Model                     Achievement; Learning Potential; Actual Learning
Knowledge Tapped                        Declarative; Procedural; Schematic; Strategic; Combination
Abilities Tapped                        Crystallized; Fluid; Spatial-Visual; Combination
Balance in Knowledge and Abilities      Knowledge; Ability; Combination
Curricular Link                         Immediate; Close; Proximal; Distal; Remote; Combination
Information Sources                     Internal; External; Combination
Assessment Method                       Selected Response; Constructed Response; Combination
Test Interpretation                     Norm-Referenced; Domain-Referenced; Ipsative; Combination
Standardization of Administration       Low; Intermediate; High
Feedback to Student and Teacher         Immediate; Delayed; Combination
Score Reporting Level                   Individual; Aggregate; Combination
Score Comprehensiveness                 Single Score; Score Profile; Both
[Table 2: rows repeat the design decisions of Table 1 (purpose, accountability mechanism, developmental model, knowledge tapped, abilities tapped, balance, curricular link, information sources, assessment method, test interpretation, standardization of administration, feedback, score reporting level, and score comprehensiveness); columns for CLAS, TGAT, and Queensland mark the choice each case made, as characterized in the text above.]
*Some choice alternatives in Table 1 omitted here because they do not vary across case study.