RUNNING HEAD: Measurement and Impact of the Treatment
On the Measurement and Impact of Formative Assessment on
Students’ Motivation, Achievement, and Conceptual Change
Yue Yin¹
Carlos C. Ayala²
Richard J. Shavelson³
Maria Araceli Ruiz-Primo⁴
Miki K. Tomita³
Erin M. Furtak⁵
Paul R. Brandon¹
Donald B. Young¹
1. College of Education, University of Hawaii
2. School of Education, Sonoma State University
3. School of Education, Stanford University
4. University of Colorado at Denver and Health Sciences Center
5. Max Planck Institute for Human Development
Abstract
Formative assessment holds the promise of promoting achievement and
motivation. The effects of formative assessments, however, have rarely been examined
experimentally in regular education settings. Motivational beliefs are hypothesized to be
important factors that influence conceptual change. Because the motivational beliefs
expected to be modified by formative assessment happen to be the motivational beliefs
hypothesized to influence conceptual change, the study connected the formative
assessment and conceptual change frameworks and examined the effect of formative
assessments on student outcomes. In particular, the study explored whether formative
assessments would improve student motivation, achievement, and consequently
conceptual change. A randomized experimental design in a field setting was used and 12
middle school science teachers and their students participated in the study. It was found
that student motivation, achievement, and conceptual change varied significantly among
the participating teachers, but the formative assessment treatment did not have a
significant impact on these outcome variables.
On the Measurement and Impact of Formative Assessment on
Students’ Motivation, Achievement, and Conceptual Change
The logic of formative assessment—identify learning goals, assess where students
are with respect to those goals, and use effective teaching strategies to close the gap—is
compelling and has led to the expectation that formative assessment would improve
students’ learning and achievement (Ramaprasad, 1983; Sadler, 1989). Moreover,
substantial empirical evidence has supported the effectiveness of formative assessment
(Black & Wiliam, 1998a; Black & Wiliam, 1998b). However, the evidence comes mainly
from either laboratory studies or anecdotal records (e.g., Black & Wiliam, 1998a). As
Black and Wiliam (1998a) pointed out, studies conducted in laboratory contexts may
suffer “ecological validity” problems and encounter reality obstacles when applied in
classrooms. The effects of formative assessments have rarely been examined
experimentally in regular education settings.
The study reported in this paper examined the effect of formative assessments on
student outcomes by using a randomized experimental design in a field setting. The study
explored whether formative assessments would improve student motivation, achievement,
and conceptual change. This paper presents (a) the conceptual framework of the study, (b)
research design, (c) the instruments used to measure the outcome variables—motivation,
achievement, and conceptual change—and the technical qualities of each instrument; and
(d) results of the experiment.
Conceptual Framework
This section reviews literature on conceptual change and formative assessment.
Based on the literature review, the theoretical framework for this study is laid out and
research questions are raised.
Achievement is viewed as acquiring knowledge and the capacity to reason with
that knowledge. As described in Shavelson et al. (this issue), knowledge can be
distinguished as four types: declarative, procedural, schematic, and strategic. The focus
of this study is on schematic knowledge and more specifically how to move students
from a naïve conception of the natural world to a scientifically justifiable one as to “why
things sink and float”. Hence we focus on conceptual change and factors that affect it.
Conceptual Change
When students come into science classrooms, they often hold deep-rooted prior
knowledge or conceptions about the natural world. These conceptions influence how they
come to understand what they are taught. Some of their existing prior knowledge
provides good foundations for formal schooling, for example, their prior knowledge of
number and language. However, other prior conceptions are incompatible with currently
accepted scientific knowledge; these conceptions are commonly referred to as
misconceptions (NRC, 2001) or alternative conceptions (Abimbola, 1988).
Usually students derive misconceptions through limited observation and
experience. Consequently, learning is not only the acquisition of new knowledge but also
the interaction between new knowledge and students’ prior knowledge. For example,
everyday life experience leads young children to believe that the earth is flat. In fact,
even when some children learn that the earth is round, they believe that it is like a
pancake—round but still flat (Vosniadou & Brewer, 1992).
The topic of this study, “Why things sink and float,” is addressed in many middle
school physical science curricula and students’ misconceptions about the topic have been
well documented (e.g., M. G. Hewson, 1986; Kohn, 1993; Yin, Tomita, & Shavelson, in
press). Although sinking and floating is a common experience in everyday life, it is in
fact a sophisticated scientific phenomenon. To fully understand the fundamental reason
underlying sinking and floating, students need to have complicated knowledge that
includes an analysis of forces (buoyancy and gravity) and water pressure. That
knowledge, however, is either not introduced or not sufficiently addressed in middle
school curricula, most likely because of the difficulty of learning it. Rather, some curriculum
developers take a shortcut and use relative density to simplify the explanation of sinking
and floating (e.g., Pottenger & Young, 1992). Even so, relative density itself is
challenging for many students because density is a concept involving the ratio of mass to
volume (e.g., Smith, Snir, & Grosslight, 1992) and relative density involves comparing
two ratio variables, the density of an object and the density of a medium.
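As a concrete illustration (with hypothetical numbers): a 270-g block measuring 10 cm × 6 cm × 5 cm has a density of ρ = m/V = 270 g ÷ 300 cm³ = 0.9 g/cm³. Because 0.9 g/cm³ is less than the density of water (1.0 g/cm³), the block floats in water; the same block would sink in rubbing alcohol (≈ 0.79 g/cm³). The judgment thus always requires comparing two ratios rather than inspecting the object's mass or volume alone.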
Despite its complexity in science, sinking and floating is a common phenomenon.
Most students have rich experiences and personal theories or mental models for
explaining sinking and floating. Unfortunately, many of their theories are either
misconceptions or conceptions that are only valid under certain circumstances--for
example, big/heavy things sink, things with holes in them sink, hollow things float, things
with air in them float, or flat things float. Those conceptions are so deeply rooted in
students’ minds that it is very difficult for students to change them, even after students
have been intensively exposed to scientific conceptions such as, “If an object’s density is
less than a liquid’s density, the object will float in the liquid regardless of the size or mass
of the object” (Yin et al., in press).
Similar to children’s misunderstanding of the shape of the planet earth and “why
things sink and float,” many other misconceptions are deeply rooted in everyday
experiences, which then inhibit students from acquiring scientific conceptions. These
alternative conceptions exist widely across different subject domains, among people of
different ages, across different cultures, and through the history of the development of
scientifically justifiable ideas (Yin, 2005).
Given the presence and persistence of alternative conceptions, the restructuring or
reorganization of existing knowledge, or conceptual change, has become a very
important component of teaching and learning (Carey, 1984; Schnotz, Vosniadou, &
Carretero, 1999). To fully establish scientifically justifiable conceptions of the natural
world, sometimes students have to experience conceptual change (Carey, 1984) and
transform misconceptions into complete and accurate conceptions (NRC, 2001).
From a cognitive perspective, researchers have described the mechanics of
conceptual change (Demastes, Good, & Peebles, 1996; Posner, Strike, Hewson, &
Gertzog, 1982); developed theories about the nature of conceptual change (Chi, 1992;
Chi, Slotta, & deLeeuw, 1994); and explored practical ways to foster conceptual change
(Chinn & Malhotra, 2002; P. W. Hewson & Hewson, 1984). Our knowledge of the
cognitive aspects of alternative conceptions and conceptual change has been greatly
enriched over the past 20 years. However, the traditional approaches have been criticized
as “cold cognition” because they could not adequately explain why some students
changed their conceptions while others did not (Dreyfus, Jungwirth, & Eliovitch, 1990;
Pintrich, 1999; Pintrich, Marx, & Boyle, 1993).
Motivation researchers proposed that the conceptual change process is influenced
by students’ motivational beliefs, such as goal orientations, epistemological beliefs,
interest or value, and efficacy (e.g., Pintrich, Marx, & Boyle, 1993; Pintrich, 1999). They
suggested that conceptual change could be facilitated by students' acquisition of a task goal
orientation, incremental intelligence belief, high self-efficacy, and strong interest; in
contrast, conceptual change could also be prevented by ego goal orientation and fixed
intelligence beliefs. Each of these possible influences on conceptual change will be
discussed in more detail below.
Goal orientations are self-constructed theories about what it means to succeed.
They help students make meaning of their own learning in context. Goal orientations are
generally classified into two main categories: intrinsic (also called mastery or task-involved
orientation) and extrinsic (also called performance or ego-involved orientation)
(Dweck & Leggett, 1988). Task goal orientation means that the goal of studying is to
learn; ego-involved orientation can be further classified into ego approach orientation
(striving to demonstrate superior ability) and ego avoidance orientation (striving to avoid
appearing incompetent). Compared with those with an ego goal orientation, students with
a task goal orientation tended to focus on learning, understanding, and mastering the tasks,
and engaged in deeper processing strategies, which are helpful for conceptual change.
Intelligence belief concerns whether intelligence or ability is fixed or malleable.
Schommer (1992) found individuals who believe in intelligence as a malleable quality
that can be developed (i.e., incremental intelligence) are more likely to be cognitively
engaged in tasks. Dweck and Leggett (1988) suggested that intelligence theory can
influence students’ goal orientation, which, in turn, influences students’ learning. In their
model, a student with fixed intelligence beliefs is more likely to adopt a performance goal
orientation. As discussed in the task goal orientation section, performance goal
orientation is considered to constrain conceptual change by limiting the level of deeper
cognitive engagement.
Interest is students’ general attitude or preference for the content or task (Eccles,
1983). Pintrich (1999) argued that interests help facilitate the process of conceptual
change because conceptual change demands that students maintain their cognitive
engagement in order to accommodate new conflicting information.
Self-efficacy beliefs are individuals’ beliefs about their performance capabilities in
a particular domain (Bandura, 1986; Pajares, 1996). Self-efficacy is related to task choice
(Bandura, 1997), effort taken, task persistence (Bandura, 1986; Bandura & Cervone,
1983), and cognitive engagement (Pintrich & Schrauben, 1992). Pintrich suggested that
the students who are confident in their ability to change their ideas and to learn to use
proper cognitive strategies to integrate and synthesize divergent ideas are more likely
than others to realize conceptual change.
Pintrich and his colleagues’ greatest contribution is that they pointed out an
important missing component in the previous cognitive conceptual change model,
expanding researchers' horizons from a narrow cognitive view to a broader one that
involves individuals’ motivations. They suggested that motivational beliefs influence
people’s cognitive engagement and depth of information processing, which in turn
determines whether conceptual change can occur. Moreover, Pintrich insisted that all
motivational beliefs are amenable to change, even though some of them are robust or
relatively stable (Pintrich, 1999; Pintrich et al., 1993). For instance, an educational
context that encourages a task goal orientation will encourage students to develop a task goal
orientation. Similarly, an ego goal-oriented context encourages ego goal orientation
among students.
Few studies have empirically examined motivation theorists'
claims about the influence of motivation on conceptual change. One possible approach to
bringing about conceptual change that integrates cognition and motivation is formative
assessment.
Formative Assessment
Building on earlier reviews, Black and Wiliam reviewed 580 articles (or chapters)
from over 160 journals published in a nine-year period and concluded that formative assessments
have a positive impact on students’ learning, as well as on students’ motivation (Black,
Harrison, Lee, Marshall, & Wiliam, 2002; Black & Wiliam, 1998a; Black & Wiliam,
1998b; Hattie & Timperley, 2007).
In summative assessment, scores are often related to students’ rank compared to
their peers, and performance is the most important concern. This environment produces
several negative effects on motivation and achievement: (a) Students develop an ego-involving
goal orientation (Schunk & Swartz, 1993); (b) students, especially the low
achievers, tend to attribute achievement to capability (Siero & Van Oudenhoven, 1995)
and to regard ability as inborn and intelligence as fixed (Vispoel & Austin, 1995), which
might discourage them from putting effort into future learning; (c) students are reluctant
to seek help because they fear their questions will be regarded as evidence demonstrating
low ability (Blumenfeld, 1992); and (d) students tend to be engaged in superficial
information processes, for example, rote learning and concentrating on the recall of
isolated details (Butler, 1987; Butler & Neuman, 1995).
In contrast, formative assessment possesses substantial advantages on motivation
and achievement: (a) Because formative assessment emphasizes the learning process and
closing the gap between students’ current situation and the desired goal, students are
more likely to form a task-involving goal orientation, which encourages students to
process information more deeply than does the performance orientation (Schunk &
Swartz, 1993); (b) Because formative assessment concentrates on improving students’
learning, it is less likely to cause students to lose confidence. Instead, students tend to
believe in incremental intelligence; that is, ability is just a collection of skills that people
can master over time (Vispoel & Austin, 1995). Formative assessment shapes self-efficacy
in a way that benefits students' learning and, consequently, students'
achievement, especially that of lower achievers (Geisler-Brenstein & Schmeck, 1996). (c)
The activities involved in formative assessment can improve students’ interest in learning.
(d) Students engaged in formative assessments increase their self-regulation, reasoning,
and planning, which are all important for effective learning and conceptual change (Black
and Wiliam, 1998a).
In sum, formative assessment is expected to encourage the motivational beliefs
hypothesized to promote conceptual change, such as, task goal orientation, incremental
intelligence beliefs, self-efficacy, and interest. Meanwhile, formative assessment
discourages the motivational beliefs hypothesized to prevent conceptual change, such as
ego goal orientation and fixed intelligence beliefs (Figure 1). For convenience in the
discussion that follows, the former motivational beliefs are called positive motivation and
the latter are called negative motivation.
--------------------------------
Insert Figure 1 about Here
--------------------------------

Because the two models share the motivational beliefs, it seems plausible to
connect the motivational model of conceptual change with formative assessment. We
conjectured that students who participate in formative assessments would be more likely
to change motivational beliefs in a way that promotes conceptual change, demonstrate
improved achievement, and realize conceptual change than those who did not participate
in formative assessments.
As promising as it sounds, formative assessment places greater demands on
teachers than regular teaching due to its uncertainty and flexibility (Bell & Cowie, 2001).
Black and Wiliam suggested that teachers “need to see examples of what doing better
means in practice” (1998b, p.146). To date, there is a paucity of systematic, detailed, and
practical information to address adequately the challenge of how to design and implement
formative assessments in teaching to help improve student learning and motivation.
Shavelson et al. (this issue) characterized formative assessment techniques as
falling on a continuum based on the amount of planning involved and the formality of
technique used, benchmarking points along the continuum as—(a) on-the-fly formative
assessment, which occurs when teachable moments unexpectedly arise in the classroom;
(b) planned-for-interaction formative assessment, which is used during instruction but
prepared deliberately before class to align closely with instructional goals; and (c) formal
and embedded in curriculum formative assessment, which is designed to be embedded
after a curriculum unit to make sure that students have reached important goals before
moving to the next lesson. The first two kinds of formative assessments require teachers
to have rich teaching experience and/or strong spontaneous teaching skills. To assist
teachers in implementing formative assessment, the most formal type, Embedded
Formative Assessment, was adopted in this study's examination of the impact of assessment
on student outcomes.
The study attempts to answer the following three questions: (a) Can Embedded
Formative Assessment improve students’ motivational beliefs, which were conjectured to
influence conceptual change? (b) Can Embedded Formative Assessment improve
students’ achievement? (c) Can Embedded Formative Assessment improve conceptual
change? Based on prior theory we expected that formative assessment could improve all
three outcomes.
Method
Design
To answer the three research questions, a small randomized experimental study
was conducted. The study involved two groups of teachers, one that employed Embedded
Formative Assessment while teaching a science unit and another that taught the same unit
without formative assessments. Four steps were taken in the study: (a) Twelve teacher
participants along with their students were randomly assigned to either the experimental
group or the control group; (b) All of the students were given a pretest motivation survey
and science achievement test (including conceptual change items); (c) Both groups of
teachers taught the same curriculum unit, but the teachers in the experimental group used
formative assessments and were expected to use the information collected by formative
assessment to help improve teaching and learning; (d) All the students were given a
posttest motivation survey and achievement test. By comparing the experimental and
control students’ motivation and achievement scores on the pretest and posttest, we
examined whether the formative assessment treatment affected students’ motivation,
achievement, and conceptual change.
Details about the participants and procedures are addressed by Shavelson, et al.
(this issue). Details about the curriculum and treatment are described by Ayala, et al. (this
issue). This paper focuses on the instruments and results of the study.¹
Instruments
A motivation questionnaire and achievement assessments were developed to
examine the impact of embedded formative assessments.
¹ Due to space limitations, the discussion of instruments and results in this paper is rather brief. For
more details about the instruments and results, please refer to Yin (2005).
Motivation Questionnaire
Motivation questionnaire development. The motivation questionnaire measures
the constructs that were addressed in both conceptual change literature and formative
assessment literature (Black et al., 2002; Pintrich, 1999; Pintrich et al., 1993). Among
them, four motivational beliefs promote conceptual change: (a) task goal orientation
(e.g., “An important reason I do my science work is because I like to learn new things.”),
(b) perceived task-goal orientation context (e.g., “Our teacher really thinks it is very
important to try hard”), (c) self-efficacy in science (e.g., “I am certain I can figure out
how to do difficult work in science”), and (d) interest in science (e.g., “I find science
interesting”). The other four motivational beliefs prevent conceptual change: (a) ego
approach orientation (e.g., “I want to do better than other students in my science class.”),
(b) ego avoidance orientation (e.g., “It is very important to me that I do not look stupid in
my science class”), (c) perceived performance-goal orientation context (e.g., “Our teacher
lets us know which students get the highest scores on tests”), and (d) inborn intelligence
epistemic beliefs (e.g., “I can’t change how smart I am in science”).
Most motivation items were drawn from Lau and Roeser's study (2002). A five-point
Likert-type scale from 1 (strongly disagree) to 5 (strongly agree) was used in the
questionnaire to measure the degree to which students agreed or disagreed with each
statement. The same motivation questionnaire was given to the control and experimental
groups at pretest and posttest.
Technical quality of the motivation questionnaire. Internal consistency analysis,
exploratory factor analysis, and confirmatory factor analysis were conducted to validate
the motivation constructs at pretest and posttest and diagnose ill-matched items.
Coefficient alpha was calculated for each final scale. Exploratory factor analysis was used
to examine whether the motivation items measured the constructs that they were designed
to measure and to identify ill-matched items. Principal Axis Factoring extraction and
Promax with Kaiser Normalization rotation were applied assuming the motivation
constructs to be correlated with each other. Factors with eigenvalues greater than 1 were
extracted. The factor loadings of each item in the pattern matrix were examined to
indicate the fitness of items with their corresponding construct. Finally confirmatory
factor analysis was conducted to further validate the motivation constructs and
modification indexes (MIs) were examined to capture misfit items in the model. Details
of these analyses are available from the first author.
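As an illustration of the internal consistency computation, a minimal sketch follows; it is not the study's actual code, and the data, names, and values are hypothetical.

import numpy as np

def cronbach_alpha(item_scores):
    """Coefficient alpha; item_scores is 2-D: rows = students, columns = items."""
    k = item_scores.shape[1]                         # number of items in the scale
    item_vars = item_scores.var(axis=0, ddof=1)      # variance of each item
    total_var = item_scores.sum(axis=1).var(ddof=1)  # variance of the scale total
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 1-5 Likert responses of five students to a three-item subscale;
# scales with few items (like the three-item fixed-intelligence scale) tend to
# yield lower alphas.
responses = np.array([[2, 3, 2],
                      [4, 4, 5],
                      [1, 2, 2],
                      [3, 3, 4],
                      [2, 1, 3]])
print(round(cronbach_alpha(responses), 2))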
Table 1 presents the internal consistency of each motivation construct and the
correlations among them, based on the posttest data. The results based on pretest data,
similar to those based on the posttest, were omitted to conserve space.
--------------------------------
Insert Table 1 about here
--------------------------------

Overall, the motivation sub-scales have acceptable internal consistency, except
for fixed intelligence (.44), possibly due to the small number of items (3) (Table 1). With
this caveat in mind, the average score across items in each subscale was calculated to form a
composite score. Ten composite scores corresponding to the eight constructs were used
for further data analyses. Most motivation constructs were significantly correlated with
each other (Table 1). All the positive motivation constructs were positively correlated
with each other. So were all the negative motivation constructs. Except for the correlation
between ego-approach and self-efficacy, most positive motivation constructs were
negatively correlated with negative motivation constructs, whenever the correlation was
significant. The correlation pattern among the motivation constructs provided evidence
for discriminant validity of the motivation measurement.
Achievement Assessments
In this section, the multiple-choice test and open-ended assessments are described
and their technical qualities examined.
Achievement assessment development. Assessment development was based on
two concerns: (a) to adequately cover the curriculum’s instructional objectives and (b) to
comprehensively represent different types of knowledge--in particular, declarative
knowledge (knowing that), procedural knowledge (knowing how), and schematic
knowledge (knowing why)--the knowledge-reasoning-type framework set forth in
Shavelson et al. (this issue). Four assessments were developed: a multiple-choice test, a
performance assessment, a short-answer assessment, and a predict-observe-explain
assessment. A description of predict-observe-explain assessment can be found in Ayala et
al. (this issue).
Thirty-eight items were included in the multiple-choice pretest, covering the
important instructional objectives across three types of knowledge. Multiple-choice test
item examples can be found in Shavelson et al. (this issue). In addition to the 38 items on
the pretest, five more items mainly focusing on common misconceptions about sinking
and floating were included on the posttest. Although we tried not to include any identical
achievement assessment items in the formative assessments, some of the content or style
of the formative assessments for the experimental group might be unavoidably similar to
that of the items or tests in the achievement assessments. To reduce the possible bias, we
selected 14 items from external sources, such as the National Assessment of Educational
Progress and the Trends in International Mathematics and Science Study, to tap the spread of
the treatment effect. Because the correlation between the external items and internal items
was high, r = .70, p < .01, the external items were analyzed with the internal items
together in this paper.
The performance assessment was designed to assess students’ procedural and
schematic knowledge. In the first part of the performance assessment, given the mass
information of a rectangular block and some equipment, students were asked to find the
density of the block. To solve the problem, students needed to (a) use either a ruler or an
overflow container to measure the volume of a solid block and (b) apply the density
formula to calculate the density of the block. In the second part of the performance
assessment, given three blocks with density information and a bottle of mystery liquid
with unknown density, students were asked to find the density of the mystery liquid. To
solve this problem, students needed to conduct an experiment, observe the sinking and
floating behavior of three blocks in mystery liquid, and infer density information of the
mystery liquid.
The short-answer assessment was designed to measure students’ declarative and
schematic knowledge. It asked students to explain “why do things sink or float” and to
provide evidence and an example to support their argument. The same short answer
question was repeatedly posed in the Embedded Formative Assessment for the
experimental group, bringing advantages as well as disadvantages to the study. The
advantage is that because the same question was asked at different times in the
experimental group, the experimental students’ conception developmental growth could
be tracked until the end of the instruction. Two disadvantages were possible: (a) the
experimental group had been unavoidably sensitized to this question and may have
benefited from their teachers’ feedback and (b) the experimental group might have
become bored with answering the same question three times, losing their motivation to
provide their best answers on the posttest. Therefore, caution should be taken when
interpreting the difference between experimental and control groups in their responses to
this question.
The predict-observe-explain assessment was also designed to assess students’
schematic knowledge. In the predict-observe-explain assessment, a test administrator
demonstrated that a bar of soap sank in water. Then the test administrator cut the soap
into two unequal pieces—1/4 and 3/4—and asked students to predict what would happen
when the two pieces were put in water and why. After students turned in their predictions
and reasons, the test administrator put the two pieces of soap in water, and asked students
to record their observation and to reconcile their predictions. The soap predict-observe-explain
assessment was intended to examine whether students understood two main
points: (a) density is a property of a material and will not change with the size of the
object, and (b) an object sinks or floats depending on its density (relative to the medium’s
density) instead of its volume and mass.
The multiple-choice test was administered at both the pretest and posttest.
Performance, short-answer, and predict-observe-explain assessments were administered
at posttest only for the following reasons: (a) The performance assessment was heavily
content loaded. The pilot study showed that students tended to use trial and error to solve
the performance assessment problem before they had learned the relevant curriculum
content. Therefore the performance assessment could not provide valuable information at
pretest. (b) The short-answer assessment asked a succinct and straightforward question,
making it easy to memorize. We worried that it might sensitize control-group teachers
and lead them to prepare students to answer it, which might unnecessarily boost control-group
students' performance at posttest. And (c) the predict-observe-explain assessment
was so memorable that once students knew what happened to the two pieces of soap, they
might not forget the result.
Research team members administered the instruments in the classrooms of each
teacher immediately before and after the target unit. The motivation questionnaire was
given before the achievement tests, so that students’ report of their motivation would not
be influenced by their performance on the achievement tests. On the posttest, the
achievement tests were given in the following order: multiple-choice test, performance
assessment, short-answer assessment, and predict-observe-explain assessment (Yin,
2005). The predict-observe-explain assessment was given last because of its instructional
effect.
Technical quality of the achievement assessments. The quality of the multiple-choice
test items was examined with respect to: (a) item difficulties, (b) instructional sensitivity, (c)
factorial structure, and (d) internal consistency. With respect to difficulty, most items
appropriately fell in the moderate range (p-values of .40 to .80 at posttest). Some items were
found to be too difficult for the students, and these items were the ones that assessed
whether students still held misconceptions. The proportions of students who correctly
answered these questions were close to or even lower than what would be expected by
guessing at random (.25). The results suggested that most students still held alternative
conceptions after instruction.
Instructional sensitivity (D) was used to examine whether multiple-choice test
items discriminated between examinees who had and those who had not been taught. It is
defined as the difference between an item’s proportion of students responding correctly
(p-value) after instruction (posttest) and the p-value before instruction (pretest) (Crocker
& Algina, 1986). A high positive D value indicates that the item can well discriminate
between examinees who have received instruction and those who have not. All the items
had positive instructional sensitivity (D) values. That is, for all the items given at pre- and posttest,
more students chose correct answers at posttest than at pretest. This result provided
evidence for the content validity of the multiple-choice test items. That is, the items were
linked to the curriculum content and would be expected to reflect the effect of instruction.
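To make the index concrete (with hypothetical numbers): an item answered correctly by 32% of students at pretest and by 71% at posttest has D = .71 − .32 = .39, whereas an item with equal p-values on both occasions has D = 0 and shows no evidence of instructional sensitivity.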
Exploratory factor analysis was used to examine whether the multiple-choice test
items measured the constructs they were designed to measure, such as content knowledge
and knowledge types. When items in a test can be classified into constructs along
only one dimension, items measuring the same construct would share similar loading
patterns in exploratory factor analysis. However, multiple-choice test items were
designed to tap two dimensions: the curriculum instructional objectives and knowledge
types. If the former was dominant, items measuring the same instructional objective
should have similar loading patterns in the exploratory factor analysis. If the latter was
dominant, items measuring the same knowledge type should have similar loading patterns.
If neither classification was dominant, connections might not be clear between items'
loading patterns and the original classifications, because both classifications, or neither
of them, might influence items' loading patterns.
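The following sketch illustrates how such a constrained extraction might be run; it assumes the third-party factor_analyzer package and randomly generated item data, and is not the study's actual analysis code.

import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer

rng = np.random.default_rng(0)
items = pd.DataFrame(rng.integers(0, 2, size=(200, 43)))  # hypothetical scored items

# Principal-axis-style extraction constrained to three factors (the three
# knowledge types), with an oblique promax rotation.
fa = FactorAnalyzer(n_factors=3, rotation="promax", method="principal")
fa.fit(items)
print(pd.DataFrame(fa.loadings_, index=items.columns).round(2))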
The results indicated that items measuring the same content knowledge clustered
together, while items measuring the same type of knowledge were scattered across
different factors, regardless of whether all 13 factors with eigenvalues greater than 1 were
extracted or the analysis was constrained to three factors reflecting knowledge types. Internal
consistency reliability was then calculated for the total score to examine whether items could
reliably assess students’ achievement. The internal consistency of the 43 posttest
multiple-choice test items was .86, which indicated that the items in general reliably
measured students’ achievement. For conciseness, the total score of the multiple-choice
test items, rather than any subscales, will be used in the following data analysis.
Because the three open-ended assessments (performance, predict-observe-explain,
and short-answer) shared many features, the three measures are examined together in this
section.
In order to sufficiently capture information about students’ achievement reflected
in students’ responses to the assessments, analytical scoring systems were carefully
developed for each assessment. The scoring systems were aligned with the conceptual
developmental trajectory (Ayala et al., this issue) and important instructional objectives.
Using scoring system drafts, three researchers scored students’ response samples to
examine whether the preliminary scoring systems could adequately capture the variations
among students’ responses, differentiate high understanding levels from low
understanding levels, and assist scorers in reaching satisfactory inter-rater reliability.
Doctoral students at Stanford Educational Assessment Laboratory scored the three
open-ended assessments: performance assessment (six raters), short-answer assessment
(six raters), and predict-observe-explain assessment (four raters). Before independent
scoring, all raters received scoring training until they reached satisfactory inter-rater
reliability (above .80) or agreement (above 80%). In the scoring training process, when
raters encountered disagreements, they reached consensus through discussions. Based on
the consensus, scoring systems were further improved and specific rules were clarified.
This process was implemented continuously until the scoring system could effectively
collect information about students’ performance and raters could reliably score students’
responses.
The scoring systems were finalized prior to independent scoring and raters had
reached satisfactory inter-rater reliability or agreement before independently scoring
student work. Performance assessments were scored with numerical values assigned by
six raters, therefore generalizability theory and G-coefficient were used to estimate
reliability. Based on 48 randomly selected student responses on the performance
assessment, the average G-coefficient for the six raters (three in each group) was 0.83.
Short-answer and predict-observe-explain assessments were scored with categorical
values, therefore agreements were used to evaluate the inter-rater agreement. Based on 48
randomly selected student responses on short-answer, the average agreement of nine
raters (two, three, and four raters in each group) was 87%. Based on 56 randomly
selected student responses on predict-observe-explain, the four raters’ agreement was
92%.
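For reference, a standard form of the relative G coefficient for the crossed person × rater (p × r) design described above, for a score averaged over nr raters, is

    Eρ² = σ²p / (σ²p + σ²pr,e / nr)

where σ²p is the variance component for persons and σ²pr,e is the person-by-rater interaction confounded with residual error. Whether this exact specification matches the study's G study is our assumption.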
Internal consistency and discriminant validity. After all the assessments were
scored, the internal consistency of each assessment and the correlations among the
assessments were calculated. Because the four assessments measure students’ knowledge
about the same topic but with different emphases, high internal consistency within each
assessment and moderately high correlations among assessments were expected. Table 2
shows high internal consistency, empirically confirming this expectation.
--------------------------------
Insert Table 2 about here
--------------------------------

Conceptual change score and achievement total score. Five multiple-choice test
items, seven predict-observe-explain assessment items, one performance assessment item,
and one short-answer assessment item focused on conceptual change. Internal
consistency of the 14 items was .79. The composite score was interpreted as a conceptual
change score, with a maximum score of 14.
In addition to scores on the individual tests, an overall achievement total score
was calculated. It combined scores from all four achievement assessments and had 72
items, an internal consistency reliability of 0.91, and a maximum score of 86. This achievement
total score was also used in the overall analyses reported below.
Data Analysis Plan
Because teachers were located in different states or in different school districts within a state, the
teaching contexts varied dramatically across teachers, such as the number and size of
classes a teacher taught. Moreover, students’ achievement and motivation subscale scores
varied greatly across classes and teachers at pretest, even though pairs of teachers were
matched on school-level variables and randomly assigned to the treatment or control
group (Shavelson et al., this issue). To reduce these differences, we selected one class
from each teacher based on students’ pretest motivation and achievement scores for
intensive study such that the pairs of classes selected were as close as possible on mean
pretest scores. The twelve classes selected were called focus classes.
In this report, only the focus classes were included to examine the treatment effect
for three reasons: (a) The focus classes were the most closely matched classes of teachers
in the control and experimental groups; (b) the focus classes were the classes from which
we collected the most elaborate implementation information; and (c) the answers to the
research questions based on focus classes were similar to those based on all the students
(Yin, 2005).
To check on the mean equivalence of pretest scores, mean score differences
between each experimental-control teacher pair's focus classes were compared statistically
(Appendix A). Significant differences existed for three teacher pairs on motivation
scores. Experimental group teacher Aden’s students had a significantly higher mean score
than control group teacher Rachel’s students on two positive motivation scales: perceived
task goal orientation, and self-efficacy. Aden’s students also had a significantly lower
mean score than Rachel’s students on a negative motivation scale: perceived performance
goal orientation. Experimental group teacher Carol’s students scored significantly lower,
on average, than control group teacher Sofia’s students on three negative motivation
scales: ego approach orientation, ego avoidance orientation, and perceived performance
goal. Experimental group teacher Diana's students had a significantly higher mean score
than control group teacher Ben’s students on all the positive motivation scales.
As for achievement, the students of two control group teachers, Lenny and Ben,
scored significantly higher on the multiple-choice test, on average, than did their
experimental group counterparts, the students of teachers Andy and Diana, respectively.
Overall, experimental and control groups were not completely equivalent at the
beginning of the study. The experimental group started with better average motivation
scores but worse achievement scores than the control group at pretest. While not
completely adequate, covariate adjustments were made in subsequent analyses.
The data were first analyzed descriptively and statistically for the teacher pairs—
matched roughly for their school-level characteristics. The overall differences between
the control and experimental group on the outcome measures were then compared and
discussed. Finally, because students were nested in teachers, and teachers were nested in
treatment groups, a two-level hierarchical structure existed in the data. To deal with this
data structure, two-level hierarchical linear modeling (HLM) was then applied to
examine the formative assessment treatment’s influence statistically. The limitation of
using HLM is that the analysis lacks statistical power due to the small sample size of only
six teachers in each group.²

² The intent of the study was exploratory, and we and the National Science Foundation agreed on the small
sample size for cost reasons and as a first step in providing an “existence proof” that might lead to future funding.
Results
The analyses presented in this section address three research questions: (1) Did
formative assessment improve students' motivational beliefs? (2) Did formative
assessment improve students' science achievement? And (3) did formative assessment
bring about conceptual change? Descriptive analyses of students’ motivation,
achievement, and conceptual change scores at pretest and posttest are first presented for
individual teachers in two groups. Following that, the effects of formative assessment on
students’ motivation, achievement and conceptual change are reported.
Initial Analysis of Pretest and Posttest Scores
In this section, the experimental-control group focus class pairs were compared
and then the overall experimental-control group differences were analyzed.
Analysis of Experimental-Control Group Pairs
In contrast to the pretest, overall the experimental group did not have a
significantly better average motivation score than the control group at posttest. Instead, two
experimental classes had better motivation scores than their control pairs, as did two
control classes. Control group teacher Lenny's students had higher scores than
experimental group teacher Andy’s students on three positive motivation scales: task goal
orientation, perceived task goal orientation, and interest. Control group teacher Sofia’s
students also scored significantly higher than experimental group teacher Carol’s students
on three positive motivation scales: task goal orientation, perceived task goal orientation,
and interest. Similar to the pretest, experimental group teacher Diana’s students still had
significantly higher scores than control group teacher Ben’s students on all the positive
motivation scales. Moreover, Diana’s students had lower scores than Ben’s students on a
negative motivation scale: perceived performance goal orientation. Finally, experimental
group teacher Robert’s students scored significantly higher than control group teacher
Serena's students on two positive motivation scales: perceived goal orientation and interest.
As for achievement and conceptual change, two control group teachers’ students
outperformed their counterparts in the experimental group (Appendix B). Similar to the
pretest, control group teacher Lenny’s students, on average, still outperformed the
experimental group teacher Andy’s students on achievement scores and conceptual
change scores at posttest, with the exception of the short-answer question (Appendix B).
Moreover, control group teacher Rachel’s students outperformed experimental group
teacher Aden's students on the achievement tests, except for the performance assessment
and conceptual change items (Appendix B), although they did not differ significantly on
the pre multiple-choice test (Appendix A). The only “winning” experimental-group class
among all the pairs was Diana’s. Control group teacher Ben’s students did not
significantly outperform experimental-group teacher Diana’s students as they did at
pretest.
Experimental-Control Group Comparisons
Due to the nested structure, it is not statistically legitimate to combine all the
students in each treatment group for data analysis. However, the descriptive analyses
based on the combined groups can provide a preliminary overall picture of the two-group
comparison and treatment effect at posttest. Table 3 indicates that overall the
experimental group scored higher than the control group on most positive motivation
scores but scored lower on the achievement tests. This result is consistent with the previous
individual teacher pair comparison.
--------------------------------
Insert Table 3 about here
--------------------------------

Table 4 shows that the experimental group still scored higher than the control group on
positive motivation scales at posttest, but the only significant difference was in
perceived task goal orientation. As for achievement, overall the control group
still outperformed the experimental group, scoring significantly higher on the multiple-choice
test, performance assessment, and total achievement score. Based on the
comparison of Table 3 and Table 4, the treatment seemed to have no influence on student
motivation, achievement or conceptual change at posttest.
--------------------------------
Insert Table 4 about here
--------------------------------

However, in general, experimental group students had significantly lower
variance than the control group on the predict-observe-explain assessment (F = 4.09, p
< .05) and conceptual change score (F = 3.88, p = .05), that is, the achievement gap
between higher achievers and lower achievers in the experimental group was not as big as
that in the control group. This is consistent with the literature suggesting that formative
assessment is particularly useful for low achievers (Black & Wiliam, 1998a).
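A minimal sketch of this kind of variance-ratio comparison follows, with hypothetical score arrays; the study's exact procedure is not documented here and may have differed.

import numpy as np
from scipy.stats import f

def variance_ratio_test(x, y):
    """Two-tailed F test for the equality of two group variances."""
    v_x, v_y = np.var(x, ddof=1), np.var(y, ddof=1)
    F = max(v_x, v_y) / min(v_x, v_y)              # larger variance on top
    df_num = (len(x) if v_x >= v_y else len(y)) - 1
    df_den = (len(y) if v_x >= v_y else len(x)) - 1
    p = min(2 * f.sf(F, df_num, df_den), 1.0)      # two-tailed p-value
    return F, p

rng = np.random.default_rng(1)
control = rng.normal(7.0, 2.0, 120)        # wider spread of scores
experimental = rng.normal(7.0, 1.2, 120)   # narrower spread of scores
print(variance_ratio_test(control, experimental))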
Hierarchical Linear Modeling
Variables Used in the HLM
We were interested in the influence of formative assessment on students’
motivation, achievement, and conceptual change; therefore posttest motivation,
achievement, and conceptual change scores served as dependent variables in the model.
Specifically, these dependent variables include the ten posttest motivation subscale scores,
all posttest achievement scores, the composite achievement score, and the composite
conceptual change score.
Independent variables, or predictors, include pretest scores that corresponded to
the posttest score and treatment group. The pretest scores of individual students,
including pretest motivation scores and pretest multiple-choice test score, were the
predictors at the first level. Treatment group was the predictor at the second level. Due to
heterogeneous school contexts and teacher backgrounds, there could be many possible
teacher- and school-level predictors at level-2, such as school size, ethnic composition,
percent of students who receive free or reduced-price lunch, students' grade level,
teachers' teaching experience, highest educational degree, science background, teaching
load, and the length of instruction. However, due to the constraint of small sample size (there
were only 12 teachers at level-2), only the treatment group, the focus in this study, was
introduced into the model as the predictor at level-2.
Model Construction
Three steps were taken to analyze all the outcome variables.
Step 1: Unconditional model. An unconditional model was first applied to
examine the variation in all of the outcome variables at the two different levels--student
and teacher. A one-way ANOVA model was applied for each outcome variable at level-1.
The unconditional model was specified by equations 1 and 2.
Level-1:

    Yij = β0j + rij,        rij ~ N(0, σ²)        [1]

Level-2:

    β0j = γ00 + μ0j,        μ0j ~ N(0, τ00)       [2]
Yij was the outcome variable, such as a motivation subscale score or an achievement
score. β0j was the intercept, in this case, the mean score for each teacher's students. rij
was the random error at the student level. The variance of rij was σ². Variance due to rij
represented the score variation at level-1, in this case, within teachers. β0j was a
function of grand mean, γ00, plus a random error, μ0j. The variance of μ0j was τ00.
Significant variance, τ00, due to μ0j indicated that teacher mean score varied at level-2.
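As an illustration, the unconditional model in equations [1] and [2] can be fit as a random-intercept mixed model; the sketch below uses statsmodels with a hypothetical data file and column names, and is not the study's actual code.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per student, with columns
# 'teacher', 'pre', 'post', and 'treatment' (0 = control, 1 = experimental).
df = pd.read_csv("focus_class_scores.csv")

# Intercept-only fixed part with a random intercept per teacher: the fixed
# intercept estimates γ00 and the group variance estimates τ00.
m0 = smf.mixedlm("post ~ 1", data=df, groups=df["teacher"]).fit()
print(m0.summary())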
Step 2: Conditional model with covariate(s) at level-1. According to Raudenbush
and Bryk (2002), statistical adjustments for individuals’ background are important,
because (a) “persons are not usually assigned at random to organizations, failure to
control for background may bias the estimates of organization effect” and (b) “if level-1
predictors (or covariates) are strongly related to the outcome of interest, controlling for
them will increase the precision of any estimates of organization effects and the power of
hypothesis tests by reducing unexplained level-1 error variance” (p. 111).
In this study, students had rather different starting motivation and achievement
scores; therefore, the pretest scores were added to the model as covariate(s) at level-1 to
account for the outcome variable variance within-teacher. Only the corresponding pretest
score was introduced to the model at level-1. For example, when the posttest motivation
subscale, interest, was the outcome variable, only pretest interest scores were introduced
as the predictor at level-1. When a posttest achievement score or conceptual change score
was the outcome variable, only the pretest multiple-choice test score was introduced as
the predictor at level-1. There are two reasons for including only one covariate at level-1:
First, the corresponding pretest variable was the most relevant covariate of the
corresponding posttest score. Second, preliminary analyses indicated that, after a pretest
motivation score was included in a model to explain the corresponding posttest score, an
additional pretest motivation score was usually not a significant predictor for the posttest
score because motivation constructs were correlated with each other. Since pretest scores
were continuous variables, they were grand-mean centered (Raudenbush & Bryk, 2002).
The model was specified as follows.
Level-1:

    Yij = β0j + β1j(Xij − X̄..) + rij,        rij ~ N(0, σ²)        [3]

Level-2:

    β0j = γ00 + μ0j,        μ0j ~ N(0, τ00)        [4]
    β1j = γ10 + μ1j,        μ1j ~ N(0, τ11)        [5]
In this model, Xij was a pretest score as the predictor for the corresponding
posttest score. X̄.. was the grand mean. The intercept, β0j, was the expected outcome for a
subject whose value on Xij was equal to the grand mean, X̄.. . The slope, β1j, represented
the regression parameter between students’ pretest and posttest scores. rij was the random
error after controlling for students’ pretest score. β0j and β1j might vary across teachers in
the level-2 model as a function of a grand mean and a random error. Since X was grand-mean
centered, γ00 was the adjusted mean of the teachers on students' posttest scores. γ10
was the adjusted mean of pretest and posttest regression slope across those teachers. μ0j
was the unique increase to the intercept associated with teacher j. Significant variance
associated with μ0j indicated that the mean posttest score varied across teachers. μ1j was
the unique increase to the slope associated with teacher j. Significant variance associated
with μ1j indicated that pretest-posttest regression slope varied across teachers.
Analysis showed that the variance of μ1j was not significantly different from 0. That is,
the regression coefficients between outcome variable and predictor variable within
teachers at level-1 did not vary across teachers at level-2 (“homogeneity of regression
slopes”). Therefore, μ1j was eliminated from the final model. Accordingly, equation [5]
was simplified into [6].
    β1j = γ10        [6]
Step 3: Model with intercepts-as-outcomes. One of the important purposes for
using HLM is to examine the cross level effects. Specifically, in this study, we examined
whether the regression coefficients (regression slopes of posttest scores on covariates) at
the student level varied across teachers. Analyses indicated that regression coefficients at
level-1 did not significantly vary across teachers, i.e., homogeneity of regression slopes.
Therefore, the regression coefficients at level-1 were set as fixed at level-2 and only the
intercept at level-1 was an outcome variable at level-2. As mentioned earlier, treatment
group was the only predictor at level-2. Since treatment group was a dummy variable
(experimental group = 1, control group = 0), it was added to the model without being
centered (Raudenbush & Bryk, 2002).
Analyses showed that τ00 in equation [4] significantly differed from 0. That is, the
intercept at level-1 varied significantly at level-2 across teachers. A predictor at level-2,
treatment group, was then included in the model to account for the intercept variance at
level-2. The model was specified in equation [7].
    β0j = γ00 + γ01W1j + μ0j        [7]
W1j was the treatment variable at level-2. Because treatment was a dummy
variable, coded as "0" for the control group and "1" for the experimental group, it was
added to the model without being centered.
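Continuing the hypothetical sketch above, the final intercepts-as-outcomes model (equations [3], [4], [6], and [7]) adds the grand-mean-centered pretest at level-1 and the treatment dummy at level-2, with the pretest slope fixed across teachers.

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("focus_class_scores.csv")     # hypothetical file and columns
df["pre_c"] = df["pre"] - df["pre"].mean()     # grand-mean centering of the pretest

# Fixed effects for the centered pretest (γ10) and treatment (γ01), with a
# random intercept per teacher (μ0j); the pretest slope does not vary across
# teachers.
m1 = smf.mixedlm("post ~ pre_c + treatment", data=df, groups=df["teacher"]).fit()
print(m1.summary())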
HLM Results
Tables 4 and 5 present changes in the variance components for motivation scores
and achievement scores, respectively, when different models were implemented. Most
motivational and achievement outcomes varied significantly across teachers (at level-2),
specifically motivation scores except for ego avoidance, self-efficacy, and fixed ability
(Table 4), and achievement scores except for conceptual change (Table 5).
--------------------------------
Insert Table 4 about here
--------------------------------

--------------------------------
Insert Table 5 about here
--------------------------------
Tables 6 and 7 present the results when pretest scores were included as covariates
at level-1 and treatment was included as a predictor at level-2. Results reported in these
tables indicate that pretest scores were significant predictors for the corresponding
posttest scores at level-1; however, treatment was not a significant predictor of the
variance across teachers at level-2. For the achievement indicators, the negative level-2
coefficient for treatment indicates a very slight, non-significant adjusted mean
advantage for the control group. An important reason for the non-significant group effect
might be the “small” sample size.
--------------------------------
Insert Table 6 about here
--------------------------------

--------------------------------
Insert Table 7 about here
--------------------------------

In sum, based on the HLM analyses, students' motivation, achievement, and
conceptual change scores varied greatly among teachers. However, the formative-assessment
treatment did not explain the variation among teachers. Experimental group
students did not seem to benefit, on average, from the formative assessments they
received.
Conclusion
In this study, one questionnaire and four types of achievement tests were used to
measure students' motivational beliefs, achievement, and conceptual change.
Empirical evidence was provided in support of the reliability and validity of scores from
these instruments.
Both the descriptive statistics and HLM analyses show that the assessments
embedded in the FAST curriculum used by the experimental group did not have a
significant influence, on average, on students’ motivation, achievement, and conceptual
change compared to students in the control group that used FAST without embedded
assessments. This said, regardless of treatment group, teachers varied greatly on the
motivation and achievement outcomes they produced in their students. This finding did
not support our conjectures about the salutary effect of formative assessment on student
outcomes or the findings reported in literature reviews.
More elaborate analyses showed that teachers’ educational background, teaching experience, teaching load, and age, as well as school characteristics (such as school size and percentage of students receiving free or reduced-price lunch), were not significant predictors of the differences in students’ motivation and achievement either (Yin, 2005). Why, then, did the Embedded
Formative Assessments not work as expected? What might account for the differences in
outcomes among students taught by different teachers? Furtak et al. (this issue) suggest
answers to these questions.
References
Abimbola, I. O. (1988). The problem of terminology in the study of student conceptions
in science. Science Education, 72(2), 175-184.
Bandura, A. (1986). Social foundations of thought and action: A social cognitive theory.
Englewood Cliffs, NJ: Prentice Hall.
Bandura, A. (1997). Self-efficacy: The exercise of control. New York: Freeman.
Bandura, A., & Cervone, D. (1983). Self-evaluative and self-efficacy mechanisms
governing the motivational effects of goal systems. Journal of Personality and
Social Psychology, 45(5), 1017-1028.
Bell, B., & Cowie, B. (2001). Formative assessment and science education. Dordrecht, The Netherlands: Kluwer Academic Publishers.
Black, P., Harrison, C., Lee, C., Marshall, B., & Wiliam, D. (2002). Testing, motivation and learning. Cambridge, England: University of Cambridge Faculty of Education, Assessment Reform Group.
Black, P., & Wiliam, D. (1998a). Assessment and classroom learning. Assessment in
Education, 5, 7-68.
Black, P., & Wiliam, D. (1998b). Inside the black box: Raising standards through classroom assessment. Phi Delta Kappan, 80(2), 139-148.
Blumenfeld, P. C. (1992). Classroom learning and motivation: Clarifying and expanding
goal theory. Journal of Educational Psychology, 84(3), 272-281.
Butler, R. (1987). Task-involving and ego-involving properties of evaluation. Journal of Educational Psychology, 79, 474-482.
Butler, R., & Neuman, O. (1995). Effects of task and ego achievement goals on help-seeking behaviors and attitudes. Journal of Educational Psychology, 87, 261-271.
Carey, S. (1984). Conceptual change in childhood. Cambridge, MA: MIT Press.
Chi, M. T. H. (1992). Conceptual change within and across ontological categories:
Examples from learning and discovery in science. In Cognitive models of science:
Minnesota studies in the philosophy of science (pp. 129-160). Minneapolis, MN:
University of Minnesota Press.
Chi, M. T. H., Slotta, J. D., & deLeeuw, N. (1994). From things to processes: A theory of conceptual change for learning science concepts. Learning and Instruction, 4, 27-43.
Chinn, C. A., & Malhotra, B. A. (2002). Children's responses to anomalous scientific data: How is conceptual change impeded? Journal of Educational Psychology, 94(2), 327-343.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Orlando, FL: Harcourt Brace Jovanovich.
Demastes, S. S., Good, R. G., & Peebles, P. (1996). Patterns of conceptual change in
evolution. Journal of Research in Science Teaching, 33, 407-431.
Dreyfus, A., Jungwirth, E., & Eliovitch, R. (1990). Applying the 'Cognitive Conflict'
strategy for conceptual change - some implications, difficulties and problems.
Science Education, 74(5), 555-569.
Dweck, C. S., & Leggett, E. L. (1988). A social-cognitive approach to motivation and personality. Psychological Review, 95(2), 256-273.
Eccles, J. (1983). Expectancies, values and academic behaviors. San Francisco: Freeman.
Geisler-Brenstein, E., & Schmeck, R. R. (1996). The Revised Inventory of Learning
Processes: A multifaceted perspective on individual differences in learning. In M.
Birenbaum & F. J. R. C. Dochy (Eds.), Alternatives in Assessment of
Achievements, Learning Processes and Prior Knowledge. Boston, MA: Kluwer
Academic Publishers.
Hattie, J., & Timperley, H. (2007). The power of feedback. Review of Educational Research, 77(1), 81-112.
Hewson, M. G. (1986). The acquisition of scientific knowledge: Analysis and
representation of student conceptions concerning density. Science Education,
70(2), 159-170.
Hewson, P. W., & Hewson, M. G. (1984). The role of conceptual conflict in conceptual
change and the design of science instruction. Instructional Science, 13(1), 1-13.
Kohn, A. S. (1993). Preschoolers' reasoning about density: Will it float? Child
Development, 64, 1637-1650.
Lau, S., & Roeser, R. W. (2002). Cognitive abilities and motivational processes in high
school students' situational engagement and achievement in science. Educational
Assessment, 8(2), 139-162.
NRC (National Research Council). (2001). Knowing what students know. Washington, DC: National Academy Press.
Pajares, F. (1996). Self-efficacy beliefs in academic settings. Review of Educational Research, 66(4), 543-578.
Pintrich, P. R. (1999). Motivational beliefs as resources for and constraints on conceptual
change. In W. Schnotz, S. Vosniadou & M. Carretero (Eds.), New Perspectives in
Conceptual Change Research. Oxford, England: Pergamon Press.
Pintrich, P. R., Marx, R., & Boyle, R. (1993). Beyond cold conceptual change: The role
of motivational beliefs and classroom contextual factors in the process of
conceptual change. Review of Educational Research, 63, 167-199.
Pintrich, P. R., & Schrauben, B. (1992). Students’ motivational beliefs and their cognitive engagement in classroom academic tasks. In D. Schunk & J. Meece (Eds.), Student perceptions in the classroom: Causes and consequences (pp. 149-183). Hillsdale, NJ: Lawrence Erlbaum.
Posner, G. J., Strike, K. A., Hewson, P. W., & Gertzog, W. F. (1982). Accommodation of
a scientific conception: toward a theory of conceptual change. Science Education,
66, 211-227.
Pottenger, F. M. I., & Young, D. B. (1992). The local environment: FAST 1, Foundational Approaches in Science Teaching. Honolulu: University of Hawaii Curriculum Research & Development Group.
Ramaprasad, A. (1983). On the definition of feedback. Behavioral Science, 28(1), 4-13.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (2nd ed.). Newbury Park, CA: Sage.
Sadler, D. R. (1989). Formative assessment and the design of instructional systems.
Instructional Science, 18, 119-144.
Schnotz, W., Vosniadou, S., & Carretero, M. (1999). Preface. In W. Schnotz, S.
Vosniadou & M. Carretero (Eds.), New Perspectives in Conceptual Change
Research. Oxford, England: Pergamon Press.
Schunk, D. H., & Swartz, C. W. (1993). Goals and progress feedback: Effects on self-efficacy and writing achievement. Contemporary Educational Psychology, 18, 337-354.
Siero, F., & Van Oudenhoven, J. P. (1995). The effects of contingent feedback on
perceived control and performance. European Journal of Psychology of
Education, 10, 13-24.
Smith, C., Snir, J., & Grosslight, L. (1992). Using conceptual models to facilitate
conceptual change: the case of weight-density differentiation. Cognition and
Instruction, 9(3), 221-283.
Vispoel, W. P., & Austin, J. R. (1995). Success and failure in junior high school: A
critical incident approach to understanding students' attributional beliefs.
American Educational Research Journal, 32(2), 377-412.
Vosniadou, S., & Brewer, W. F. (1992). Mental models of the earth: A study of
conceptual change in childhood. Cognitive Psychology, 24, 535-585.
Yin, Y. (2005). The influence of formative assessment on student motivation, achievement, and conceptual change. Unpublished doctoral dissertation, Stanford University, Stanford, CA.
Yin, Y., Tomita, M. K., & Shavelson, R. J. (in press). Diagnosing and dealing with student misconceptions about “sinking and floating.” Science Scope.
Appendix A

Average Class Pretest Motivation Scores, Multiple-choice Score, and Class Size for Focus Class

                                  Experimental (E)            Control (C)             Difference
Variable                          N    Mean    SD             N    Mean    SD         E-C

Pair 1 (E: Aden; C: Rachel)
Task Goal Orientation             29   3.79    0.69           29   3.59    0.73        0.21
Perceived Task Goal               29   4.26    0.45           29   3.86    0.44        0.40**
Self Efficacy                     29   4.13    0.61           29   3.77    0.59        0.37*
Interest                          29   4.15    0.80           29   3.82    0.60        0.33
Ego Approach Orientation          29   2.82    0.81           29   2.89    0.71       -0.07
Ego Avoidance Orientation         29   2.90    0.85           29   2.94    0.79       -0.04
Perceived Performance Goal        29   2.10    0.52           29   2.52    0.57       -0.42**
Fixed Ability                     29   2.23    0.81           29   2.19    0.67        0.04
Pretest Multiple-choice           29   14.55   5.08           29   14.76   3.90       -0.21

Pair 2 (E: Andy; C: Lenny)
Task Goal Orientation             25   3.63    0.65           19   3.60    0.64        0.03
Perceived Task Goal               25   3.92    0.50           19   4.23    0.45       -0.31*
Self Efficacy                     25   3.84    0.63           19   4.02    0.58       -0.19
Ego Approach Orientation          25   2.90    0.86           19   2.96    1.02       -0.06
Ego Avoidance Orientation         25   3.06    0.77           19   3.23    0.93       -0.17
Perceived Performance Goal        25   2.25    0.60           19   2.41    0.59       -0.17
Fixed Ability                     25   2.55    0.87           19   2.11    0.66        0.44
Interest                          25   3.93    0.96           19   3.95    0.88       -0.02
Pretest Multiple-choice           26   13.12   3.68           19   15.95   4.13       -2.83*

Pair 3 (E: Becca; C: Ellen)
Task Goal Orientation             20   4.22    0.51           23   3.92    0.52        0.30
Perceived Task Goal               20   4.37    0.45           23   4.24    0.40        0.13
Self Efficacy                     20   4.03    0.42           23   3.80    0.50        0.22
Interest                          20   4.20    0.61           23   4.28    0.50       -0.08
Ego Approach Orientation          20   3.23    0.99           23   3.20    0.61        0.03
Ego Avoidance Orientation         20   2.97    0.91           23   2.99    0.82       -0.02
Perceived Performance Goal        20   2.59    0.44           23   2.82    0.47       -0.23
Fixed Ability                     20   2.87    0.96           23   2.54    0.85        0.33
Pretest Multiple-choice           20   12.25   2.75           22   13.23   4.19       -0.98

Pair 4 (E: Carol; C: Sofia)
Task Goal Orientation             18   3.88    0.59           21   3.99    0.53       -0.11
Perceived Task Goal               18   4.21    0.42           21   4.23    0.40       -0.02
Self Efficacy                     18   4.10    0.57           21   4.03    0.57        0.07
Interest                          18   4.09    0.77           21   4.31    0.67       -0.22
Ego Approach Orientation          18   2.40    0.75           21   3.26    0.85       -0.86**
Ego Avoidance Orientation         18   2.56    0.91           21   3.15    0.58       -0.60*
Fixed Ability                     18   2.15    0.69           21   2.31    0.91       -0.16
Perceived Performance Goal        18   1.89    0.53           21   2.49    0.62       -0.60**
Pretest Multiple-choice           19   13.16   4.49           24   15.21   3.96       -2.05

Pair 5 (E: Diana; C: Ben)
Task Goal Orientation             25   4.00    0.58           21   3.36    0.65        0.64**
Perceived Task Goal               25   4.41    0.29           21   4.08    0.59        0.33*
Self Efficacy                     25   4.11    0.58           21   3.74    0.53        0.37*
Interest                          25   4.28    0.58           21   3.38    0.80        0.90**
Ego Approach Orientation          25   3.16    0.88           21   3.18    0.86       -0.02
Ego Avoidance Orientation         25   3.19    0.74           21   3.18    0.72        0.01
Perceived Performance Goal        25   2.39    0.78           21   2.49    0.62       -0.10
Fixed Ability                     25   2.29    0.85           21   2.33    0.58       -0.03
Pretest Multiple-choice           25   15.52   3.61           21   18.38   5.11       -2.86*

Pair 6 (E: Robert; C: Serena)
Task Goal Orientation             27   3.84    0.52           20   3.87    0.67       -0.03
Perceived Task Goal               27   4.09    0.52           20   4.21    0.48       -0.12
Self Efficacy                     27   4.04    0.43           20   4.16    0.61       -0.12
Interest                          27   4.17    0.41           20   4.03    0.81        0.14
Ego Approach Orientation          27   3.13    0.74           20   2.98    0.90        0.15
Ego Avoidance Orientation         27   3.06    0.91           20   3.05    0.96        0.01
Perceived Performance Goal        27   2.21    0.51           20   2.30    0.67       -0.09
Fixed Ability                     27   2.31    0.72           20   2.17    0.84        0.14
Pretest Multiple-choice           26   16.00   4.58           20   15.20   4.64        0.80

Note: * p < 0.05; ** p < 0.01.
Appendix B

Average Class Posttest Motivation Scores, Multiple-choice Score, and Class Size for Focus Class

                                  Experimental (E)            Control (C)             Difference
Variable                          N    Mean    SD             N    Mean    SD         E-C

Pair 1 (E: Aden; C: Rachel)
Task Goal Orientation             29   3.56    0.84           21   3.10    0.77        0.46
Perceived Task Goal               25   3.96    0.72           21   3.63    0.67        0.33
Self Efficacy                     25   3.95    0.80           21   3.79    0.39        0.16
Interest                          25   3.71    0.94           21   3.14    0.92        0.56*
Ego Approach Orientation          29   2.68    0.92           21   2.64    0.72        0.04
Ego Avoidance Orientation         25   2.77    1.08           21   2.52    0.79        0.24
Perceived Performance Goal        25   2.01    0.59           21   2.33    0.67       -0.32
Fixed Ability                     25   2.08    0.64           21   2.29    0.53       -0.21
Multiple-choice                   25   21.79   7.19           21   28.10   6.18       -6.31**
Performance Assessment            22   13.31   8.91           22   17.91   5.94       -2.01
Predict-Observe-Explain           26   2.15    1.74           21   3.33    2.29       -2.01*
Short-Answer                      26   1.54    1.68           21   3.10    1.18       -3.59**
Achievement Total                 20   40.45   17.00          19   54.53   11.76      -2.99**
Conceptual Change                 20   5.10    3.26           19   6.54    3.20       -1.39

Pair 2 (E: Andy; C: Lenny)
Task Goal Orientation             25   2.98    0.73           19   3.72    0.66       -0.74**
Perceived Task Goal               22   3.83    0.61           18   4.33    0.44       -0.50**
Self Efficacy                     22   3.85    0.71           18   4.08    0.58       -0.23
Interest                          22   2.85    0.96           18   3.89    0.68       -1.04**
Ego Approach Orientation          25   2.72    0.79           19   2.99    0.78       -0.27
Ego Avoidance Orientation         22   3.00    0.85           18   2.80    0.60        0.19
Perceived Performance Goal        22   2.60    0.60           18   2.23    0.55        0.37
Fixed Ability                     22   2.17    0.79           18   2.02    0.66        0.15
Multiple-choice                   22   22.91   5.92           18   28.79   6.84       -5.88**
Performance Assessment            22   9.23    4.41           17   20.53   7.86       -5.32**
Predict-Observe-Explain           20   3.70    2.03           16   5.31    2.41       -2.18*
Short-Answer                      22   2.14    1.21           19   2.47    1.26       -0.87
Achievement Total                 18   39.56   11.09          13   62.08   10.57      -5.68**
Conceptual Change                 18   6.90    3.41           13   10.55   3.22       -3.01**

Pair 3 (E: Becca; C: Ellen)
Task Goal Orientation             20   3.85    0.71           23   4.04    0.45       -0.19
Perceived Task Goal               16   4.25    0.37           21   4.25    0.52       -0.01
Self Efficacy                     16   3.88    0.53           21   3.89    0.58       -0.01
Interest                          16   4.19    0.69           21   4.30    0.83       -0.11
Ego Approach Orientation          20   3.03    1.03           23   2.65    0.85        0.38
Ego Avoidance Orientation         16   2.83    0.84           21   2.56    0.98        0.27
Perceived Performance Goal        16   2.64    0.78           21   2.29    0.61        0.34
Fixed Ability                     16   2.35    0.75           21   2.29    0.72        0.07
Multiple-choice                   16   20.25   5.73           21   22.18   5.57       -1.93
Performance Assessment            16   11.25   2.98           24   13.17   5.74       -1.23
Predict-Observe-Explain           16   2.36    1.91           24   1.54    1.14        1.45
Short-Answer                      14   2.07    1.44           24   1.42    1.69        1.30
Achievement Total                 14   36.36   8.66           22   38.55   11.75      -0.60
Conceptual Change                 14   4.58    2.48           22   3.51    2.20        1.35

Pair 4 (E: Carol; C: Sofia)
Task Goal Orientation             18   3.28    0.64           21   3.84    0.68       -0.55*
Perceived Task Goal               17   3.81    0.66           22   4.27    0.57       -0.45*
Self Efficacy                     17   3.76    0.55           22   3.95    0.76       -0.19
Interest                          17   3.27    1.08           22   4.21    0.93       -0.94**
Ego Approach Orientation          18   2.50    0.87           21   3.03    0.83       -0.53
Ego Avoidance Orientation         17   2.47    0.85           22   2.53    1.02       -0.05
Perceived Performance Goal        17   2.37    0.77           22   2.03    0.80        0.34
Fixed Ability                     17   2.10    0.44           22   2.24    1.08       -0.14
Multiple-choice                   17   28.12   6.27           22   27.32   5.80        0.80
Performance Assessment            15   17.93   8.57           21   16.38   6.09        0.60
Predict-Observe-Explain           17   3.00    2.50           21   2.71    1.31        0.41
Short-Answer                      18   2.94    1.00           22   2.91    1.31        0.09
Achievement Total                 13   51.92   16.55          20   49.90   11.67       0.38
Conceptual Change                 13   6.55    4.03           20   5.32    2.76        0.97

Pair 5 (E: Diana; C: Ben)
Task Goal Orientation             25   3.90    0.71           21   2.76    0.77        1.14**
Perceived Task Goal               23   4.46    0.46           21   3.20    0.72        1.26**
Self Efficacy                     23   4.11    0.58           21   3.63    0.82        0.48*
Interest                          23   4.19    0.72           21   2.40    1.11        1.79**
Ego Approach Orientation          25   3.18    0.89           21   3.25    0.93       -0.07
Ego Avoidance Orientation         23   2.98    0.74           21   2.84    0.90        0.15
Perceived Performance Goal        23   2.18    0.65           21   2.59    0.53       -0.41*
Fixed Ability                     23   2.14    0.94           21   2.10    0.63        0.04
Multiple-choice                   23   24.78   4.49           21   26.14   7.88       -1.35
Performance Assessment            21   16.71   7.13           20   14.95   5.99        0.86
Predict-Observe-Explain           21   2.29    1.42           20   2.35    2.13       -0.11
Short-Answer                      21   3.14    1.42           22   2.77    1.51        0.83
Achievement Total                 21   47.05   11.08          17   45.29   15.37       0.40
Conceptual Change                 21   4.86    2.08           17   4.60    3.00        0.30

Pair 6 (E: Robert; C: Serena)
Task Goal Orientation             27   3.70    0.70           20   3.76    0.91       -0.06
Perceived Task Goal               22   4.52    0.39           20   4.09    0.54        0.44**
Interest                          22   4.37    0.65           20   3.62    0.92        0.75**
Self Efficacy                     22   4.20    0.36           20   3.92    0.63        0.28
Ego Approach Orientation          27   2.95    0.94           20   3.38    0.86       -0.42
Ego Avoidance Orientation         22   2.60    0.98           20   3.11    0.82       -0.51
Perceived Performance Goal        22   2.12    0.78           20   2.42    0.74       -0.30
Fixed Ability                     22   2.06    0.70           20   2.55    0.89       -0.49
Multiple-choice                   22   20.48   6.88           20   18.05   4.62        2.43
Performance Assessment            21   11.00   5.06           19   9.53    6.10        0.84
Predict-Observe-Explain           22   1.86    1.54           17   1.76    2.05        0.17
Short-Answer                      22   1.86    1.42           20   1.60    1.39        0.61
Achievement Total                 21   35.86   12.47          16   31.90   12.08       1.02
Conceptual Change                 21   4.34    2.13           16   3.69    2.70        0.83

Note: * p < 0.05; ** p < 0.01.
Table 1

Motivational Belief Constructs, Number of Items, Coefficient Alpha, and Correlations among Constructs at Posttest

                                        Positive Motivation                  Negative Motivation
Construct                     Items    1       2       3       4        5       6       7       8
1 Task Goal                     5    (.83)
2 Task Perceived                6     .73**  (.86)
3 Self-Efficacy                 5     .77**   .72**  (.89)
4 Interest                      3     .85**   .85**   .64**  (.81)
5 Ego Approach                  4     .04     .04     .10*   -.00    (.77)
6 Ego Avoidance                 5     .02     .04     .03    -.02     .84**  (.78)
7 Performance Perceived         6    -.36**  -.53**  -.42**  -.42**   .41**   .37**  (.70)
8 Fixed Ability                 3    -.26**  -.31**  -.39**  -.33**   .50**   .48**   .74**  (.44)

Note: a. * p < 0.05, ** p < 0.01, N = 788 ~ 826. b. Results in this table are the output from AMOS, but they are arranged in a conventional way for display convenience. c. Correlations computed in AMOS were attenuated by the reliability of each construct. d. Numbers in the diagonal of the matrix are the internal consistency of the corresponding construct.
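Note c refers to the classical attenuation relationship of test theory (see, e.g., Crocker & Algina, 1986); stated generally (not quoted from the original), the observed correlation between measures X and Y is the true-score correlation scaled down by the square roots of their reliabilities:

ρXY = ρTxTy × sqrt(ρXX′ × ρYY′)

Hence the relatively low reliability of the Fixed Ability scale (.44) pulls its observed correlations toward zero.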
Table 2

Internal Consistency and Correlations among All Scores

Scale                          Item Number   Total Score    (1)     (2)     (3)     (4)    Knowledge/Reasoning Measured
(1) Post Multiple-choice            43           43        (.86)                           Declarative/Schematic/Procedural
(2) Performance Assessment          18           32         .69    (.81)                   Procedural/Schematic
(3) Short-Answer                     4            4         .50     .49    (.83)           Declarative/Schematic
(4) Predict-Observe-Explain          7            7         .48     .42     .39    (.74)   Schematic

Note: a. N = 732 ~ 830; all the correlations are significant, p < 0.01. b. The numbers in the diagonal are the internal consistency of the corresponding scale.
Table 3

Comparison of Experimental and Control Group Student Motivation, Achievement, and Conceptual Change Scores at Pretest

                               Experimental               Control                  Difference
Variable                       N     Mean    SD           N     Mean    SD         t
Pre Task Goal                  144   3.88    .61          133   3.72    .66         2.12*
Pre Task Perceived             144   4.20    .47          133   4.13    .48         1.38
Pre Self-Efficacy              144   4.04    .55          133   3.91    .58         1.99*
Pre Interest                   144   4.14    .71          133   3.96    .76         2.06*
Pre Ego Approach               144   2.95    .86          133   3.07    .82        -1.14
Pre Ego Avoidance              144   2.97    .85          133   3.08    .80        -1.04
Pre Performance Perceived      144   2.24    .60          133   2.51    .60        -3.82
Pre Fixed Ability              144   2.39    .84          133   2.27    .76         1.19
Pre Multiple-choice            145   14.22   4.30         135   15.39   4.49       -1.16*
Table 4

Comparison of Experimental and Control Group Student Motivation, Achievement, and Conceptual Change Scores at Posttest

                                  Experimental (E)           Control (C)              Difference
Variable                          N     Mean    SD           N     Mean    SD         t
Post Task Goal                    125   3.55    .79          123   3.53    .84         0.14
Post Task Perceived               125   4.14    .62          123   3.95    .71         2.25*
Post Self-Efficacy                125   3.97    .62          123   3.87    .65         1.25
Post Interest                     125   3.76    1.00         123   3.59    1.12        1.29
Post Ego Approach                 125   2.85    .92          123   2.99    .86        -1.24
Post Ego Avoidance                125   2.79    .91          123   2.72    .88         0.57
Post Performance Perceived        125   2.29    .72          123   2.31    .67        -0.25
Post Fixed Ability                125   2.14    .72          123   2.25    .78        -1.14
Post Multiple-choice              129   22.92   6.59         125   25.15   7.16       -2.58**
Post Performance Assessment       117   13.05   7.11         123   15.31   6.99       -2.48*
Post Predict-Observe-Explain      120   2.53    1.93         119   2.74    2.26       -0.79
Post Short-Answer                 125   2.24    1.49         128   2.37    1.53       -0.67
Post Achievement                  107   41.55   13.94        107   46.41   15.31      -2.43*
Conceptual Change                 107   5.32    2.99         107   5.44    3.52       -0.29
Multiple-Choice Gain Score        127   8.82    5.83         120   9.58    6.49       -0.96
Table 5

Comparison of Unconditional, Covariate, and Final Model Variance Components of Posttest Achievement and Conceptual Change

                                                      Model 1           Model 2 (with pretest     Model 3 (with
                                                      (Unconditional)   multiple-choice test      predictor(s) at both
                                                                        at Level-1 only)          Level-1 & 2)
Post Multiple-choice     Within Teachers (σ2)          38.832            27.680                    27.681
Test Score               Between Teachers (τ00)        10.987**          10.764**                  11.794**
Multiple-choice          Within Teachers (σ2)          28.127            27.680                    27.681
Gain Score               Between Teachers (τ00)        11.196**          10.764**                  11.794**
Performance              Within Teachers (σ2)          40.992            36.947                    36.949
Assessment               Between Teachers (τ00)        11.050**          10.255**                  10.620**
Short-Answer             Within Teachers (σ2)          1.970             1.827                     1.827
                         Between Teachers (τ00)        0.322**           0.270**                   0.304
Predict-Observe-         Within Teachers (σ2)          3.684             3.614                     3.613
Explain                  Between Teachers (τ00)        0.861**           0.783**                   0.881**
Achievement Total        Within Teachers (σ2)          162.042           122.692                   122.688
                         Between Teachers (τ00)        68.552**          68.541**                  73.778**
Conceptual Change        Within Teachers (σ2)          9.361             8.634                     8.632
Score                    Between Teachers (τ00)        3.399**           3.414**                   3.827**

Note: * p < 0.05; ** p < 0.01.
Table 6

Multilevel Regression Estimates for Posttest Motivational Beliefs

Outcome Variable (Posttest)     Predictors                        Coefficient    SE         p
Task Goal Orientation           Level-1: TASKGOAL, γ10            0.548162**     0.069905   0.000
                                Level-2: GROUP, γ01              -0.048264       0.197369   0.812
Perceived Task Goal             Level-1: PERCTASK, γ10            0.470225**     0.079936   0.000
                                Level-2: GROUP, γ01               0.164382       0.202496   0.436
Self Efficacy                   Level-1: EFFICACY, γ10            0.406299**     0.068589   0.000
                                Level-2: GROUP, γ01               0.078124       0.079747   0.351
Interest                        Level-1: INTEREST, γ10            0.598573**     0.075573   0.000
                                Level-2: GROUP, γ01               0.089683       0.310063   0.778
Ego Approach                    Level-1: EGOPERF, γ10             0.549829**     0.059008   0.000
                                Level-2: GROUP, γ01              -0.058126       0.135709   0.677
Ego Avoidance                   Level-1: EGOAVO, γ10              0.531549**     0.059259   0.000
                                Level-2: GROUP, γ01               0.187889       0.125347   0.165
Perceived Performance Goal      Level-1: PERCPERF, γ10            0.439538**     0.068559   0.000
                                Level-2: GROUP, γ01               0.138793       0.137974   0.339
Fixed Ability                   Level-1: FIXEDABILITY, γ10        0.369011**     0.057722   0.000
                                Level-2: GROUP, γ01              -0.128559       0.090237   0.185

Note: * p < 0.05; ** p < 0.01.
Table 7

Multilevel Regression Estimates for Posttest Achievement and Conceptual Change

Outcome Variable (Posttest)        Predictors                    Coefficient   SE      p
Post Multiple-choice Test Score    Level-1: PRETOTAL, γ10         0.81**       0.08    0.00
                                   Level-2: GROUP, γ01           -0.76         2.10    0.73
Multiple-choice Gain Score         Level-1: PRETOTAL, γ10        -0.19*        0.08    0.023
                                   Level-2: GROUP, γ01           -0.76         2.10    0.73
Performance Assessment             Level-1: PRETOTAL, γ10         0.50**       0.10    0.00
                                   Level-2: GROUP, γ01           -1.68         2.05    0.43
Short-Answer                       Level-1: PRETOTAL, γ10         0.09**       0.02    0.00
                                   Level-2: GROUP, γ01           -0.03         0.36    0.93
Predict-Observe-Explain            Level-1: PRETOTAL, γ10         0.09**       0.03    0.00
                                   Level-2: GROUP, γ01           -0.11         0.60    0.86
Achievement Total                  Level-1: PRETOTAL, γ10         1.53**       0.19    0.00
                                   Level-2: GROUP, γ01           -2.84         5.21    0.60
Conceptual Change Items            Level-1: PRETOTAL, γ10         0.22**       0.05    0.00
                                   Level-2: GROUP, γ01            0.03         1.21    1.00

Note: * p < 0.05; ** p < 0.01.
[Figure: two panels. Motivation encouraged — motivation promoting conceptual change: task goal orientation, incremental intelligence beliefs, interest, self-efficacy. Motivation discouraged — motivation constraining conceptual change: ego goal orientation (including ego-avoidance and ego-approach), fixed intelligence belief.]

Figure 1. Motivational beliefs involved in formative assessment and conceptual change.