19
Towards a Science
Achievement Framework:
The Case of TIMSS 1999
Min Li, Maria Araceli Ruiz-Primo and Richard J. Shavelson
Introduction
Increasingly, testing has become an essential element in educational reforms. Testing is used to gauge both students' and teachers' performance and to improve students' learning (for example, Baker et al., 2002; Linn, 2000; Pellegrino et al., 2001). While
the debates about testing and related issues have been heated and political, it is recognized that testing can be far too powerful an instrument to be used without very careful scrutiny of what is measured by the tests. Motivated by this belief, we present a
secondary analysis that addresses the validity of TIMSS 1999 test score interpretation
as a measure of students’ science achievement.
In this chapter, we describe and implement a framework for assessing science
achievement. The framework conceptualizes science achievement as four types of
knowledge that are characteristic of competency or achievement in science. We apply
this framework to the TIMSS 1999 science items and scores by logically analyzing
test items and statistically modeling the underlying patterns of item scores. We use a
coding system to analyze and map test items into the type(s) of knowledge that they
most likely measure. We also report results from factor analyses to determine whether
the knowledge-factor model accounts for the underlying pattern of item scores better
than alternative models. Lastly, we examine the mosaic of instructional factors that
influence students’ performance in terms of knowledge types.
Theoretical Framework: Science Achievement
Defining Science Achievement
Parallel to the old definition of intelligence as what intelligence tests measure, today we can say that science achievement is conventionally considered to be what multiple-choice tests measure, owing to the pervasive and accepted use of the multiple-choice
technology. Such a narrow perspective for measuring science achievement not only
provides limited and even inaccurate information about students’ performance but
undermines desirable outcomes of students’ learning (for example, Kohn, 2000;
Shavelson et al., 1990). In contrast, our knowledge-based conception of science
achievement is intended to broaden the notion of science achievement.
In a series of papers (for example, Shavelson and Ruiz-Primo, 1999; Ruiz-Primo,
2002; Ruiz-Primo et al., 2002; Li et al., in press), we have developed and described a
framework for science achievement that draws upon a body of related research (for
example, Alexander and Judy, 1988; Bybee, 1996, 1997; Chi, Feltovich and Glaser,
1981; Chi, Glaser and Farr, 1988; Bennett and Ward, 1993; de Jong and Ferguson-Hessler, 1996; OECD/PISA, 2001; Pellegrino et al., 2001; Sadler, 1998; White,
1999). We believe that such a framework should fully address the connections
between standards, science curriculum, student learning, instruction, and educational measurement.
Specifically, we conceptualize science achievement as four interdependent types of
knowledge1:
1. Declarative knowledge: knowing that. Such knowledge includes scientific terms,
definitions, facts, or statements, such as ‘CO2 is carbon dioxide’ or ‘green plants
manufacture glucose from carbon dioxide and water’. Declarative knowledge can
be learned and held in memory as words, pictures, or sensory representations.
Indeed, representing concepts in a variety of forms is considered essential to the
learning of science (for example, National Research Council [NRC], 1996).
2. Procedural knowledge: knowing how. Procedural knowledge takes the form of if-then production rules or a sequence of steps, such as applying algorithms to
balance chemical equations, reading data tables, or designing experiments to
identify components of a mixture. With practice, individuals can convert their
procedural knowledge into automatic responses. Such automaticity is one of the
characteristics of expertise (for example, Anderson, 1983; Dreyfus and Dreyfus,
1986; Dutton and Jackson, 1987; Randel and Pugh, 1996; Rasmussen, 1986).
3. Schematic knowledge: knowing why. Schematic knowledge typically entails the
application of scientific principles or explanatory models, such as explaining why
the moon generally rises about 50 minutes later each day or how a particular virus
functions. Schematic knowledge can also be used to guide actions, to troubleshoot
systems, and to predict the effect that changes in some phenomenon will have on
others (De Kleer and Brown, 1983; Gentner and Stevens, 1983).
4. Strategic knowledge: knowing when, where, and how to apply knowledge. This
knowledge includes domain-specific strategies such as ways to represent a
problem or strategies to deal with certain types of tasks. It also entails general
monitoring or planning strategies, such as dividing a task into subtasks or integrating the three other types of knowledge in an efficient manner. Experts, for example, are reported to interpret problems based on relatively abstract principles, compared with novices, who group problems using superficial features such as springs or inclined planes (for example, Chi et al., 1981;
Zeitz, 1997).
Linking Assessments to the Knowledge Framework
This achievement framework can be used to specify important aspects of an achievement test. By synthesizing the research of cognitive psychology and assessment (for
example, Glaser et al., 1992; Hamilton, Nussbaum et al., 1997; Li et al., in press;
Martinez, 1999; Messick, 1993; Shavelson et al., 2002; Shavelson and Ruiz-Primo,
1999; Snow, 1989; Sugrue, 1993, 1995; White, 1999), we argue that some types of
science knowledge may be well tapped by certain assessment items because of particular affordances and constraints provided by those items. Accordingly, in order to
assess science achievement accurately and adequately, assessors should employ tests
varying in their methods for eliciting students’ performance regarding knowledge
types.
To this end, we take a further step and map assessment items to these knowledge types. Note that this is not a straightforward task. Martinez (1999), for example, has
demonstrated that testing method alone does not determine what type of knowledge
an item measures. Instead, test items vary in many other characteristics, including the
nature of the task demands placed on test takers or degree of task openness. Therefore,
all these item characteristics should, if possible, be taken into account during the
process of mapping assessment items to science knowledge.
Research Questions
Having sketched our framework for defining and measuring science achievement, in
what follows we report a study using the TIMSS 1999 items and scores of the USA
sample to examine the viability of this framework. For the sake of simplicity, we report only the results for the TIMSS 1999 Booklet 8 items, since similar findings were observed in the analyses of the other seven booklets.2 This study aims to validate the framework, which allows us to measure students' science achievement and interpret scores in a meaningful way rather than relying on total scores. We ask two
research questions.
First, can the science achievement measured by the TIMSS 1999 items be decomposed into four types of knowledge? Research has shown that science achievement measured by large-scale tests can be decomposed into different aspects (for example,
Hamilton, Nussbaum et al., 1997). In the logical analysis, two coders independently
coded and classified the items into types of knowledge. Then, a set of confirmatory
factor analyses (CFA) using the Booklet 8 item scores was performed to ascertain
whether the items measured the types of knowledge as hypothesized and to compare this model against competing models of science achievement.
Second, do students’ instructional experiences affect their science achievement?
Instruction is conceived of as a vital factor that contributes to variations in students' learning outcomes, such as test scores (NRC, 1996). Structural equation modeling was conducted, after controlling for other related variables, to examine how the instructional variables are associated with students' use of science knowledge.
Students’ Science Achievement Measured by TIMSS 1999 Items
In this section, we apply the achievement framework to examine the TIMSS 1999
science items and scores. We first present the logical analysis in which we used a
coding system to identify item characteristics and classify the items into knowledge types. Following this, we report a set of CFAs that were executed to verify the item
classifications.
Coding System and Example Items
We relied on the research work done by us and other researchers to develop a coding
system to analyze item characteristics (Ayala et al., 2000; Baxter and Glaser, 1998; Li,
2001; Quellmalz, 2002). We identified four major categories, listed from most important to least important: task demands; cognitive demands; item openness; and complexity
(examples of those coding decisions are presented in Table 19.1). Task and cognitive
demands are both critical cues for establishing the item-knowledge link whereas the
latter two categories provide additional information to modify or revise the link.
Through generating and weighting the item characteristics, we were able to map
the TIMSS 1999 test items into types of knowledge. Below we present several TIMSS
1999 items to illustrate the utility of the coding system to classify test items.
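As a concrete illustration of how the coding categories in Table 19.1 can be combined, the following minimal Python sketch shows one way an item's codes might be recorded and weighed. The field names, category values and the simple agreement rule are our own illustrative assumptions, not the actual coding procedure or software used in the study.

    from dataclasses import dataclass

    # Illustrative only: a coding record for one item, following the four
    # categories of Table 19.1 (task demands, cognitive demands, item
    # openness, complexity). Values and the weighting rule are assumptions.
    @dataclass
    class ItemCoding:
        item_label: str
        task_demand: str       # e.g. 'defining/using terms', 'designing an experiment'
        cognitive_demand: str  # e.g. 'recalling facts', 'reasoning with models'
        openness: str          # e.g. 'selecting responses', 'generating responses'
        complexity: str        # e.g. 'school-type/textbook task', 'novel task'

    # Hypothetical links from sub-categories to knowledge types (D, P, Sc, St),
    # loosely following the examples listed in Table 19.1.
    TASK_LINKS = {'defining/using terms': 'D', 'mathematical calculation': 'P',
                  'designing an experiment': 'P', 'providing explanations': 'Sc'}
    COGNITIVE_LINKS = {'recalling facts': 'D', 'reading figures': 'P',
                       'reasoning with models': 'Sc',
                       'comparing and selecting strategies': 'St'}

    def classify(coding: ItemCoding) -> str:
        """Suggest a knowledge type from task and cognitive demands; openness
        and complexity would then be used to confirm or revise the link."""
        task_vote = TASK_LINKS.get(coding.task_demand)
        cognitive_vote = COGNITIVE_LINKS.get(coding.cognitive_demand)
        # Task and cognitive demands are the critical cues; when they agree,
        # accept the link, otherwise flag the item for review/double coding.
        if task_vote is not None and task_vote == cognitive_vote:
            return task_vote
        return 'review/double-code'

    d05 = ItemCoding('D05', 'defining/using terms', 'recalling facts',
                     'selecting responses', 'school-type/textbook task')
    print(classify(d05))  # -> 'D'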
Item D05 was classified as an item tapping declarative knowledge (Figure 19.1). It
asks students to identify the organ that carries the sensorial information. The coding
led us to conclude that it mainly taps declarative knowledge. First, the response is
expected to be in the form of terms (for example, names of the human organs). This
item asks for a specific content question (that is, a specific fact), leaving students
little opportunity to provide relations between concepts or to apply principles. Second,
the cognition evoked is likely to be recalling facts. The cognitive process to answer
the item is to directly retrieve information or do a minimum of scientific reasoning
to sort out the relevant information. Third, speaking of item openness, Item D05 is
restricted because of the multiple-choice format and the unfinished stem that forces
students to select options instead of providing responses prior to reading options.
Being restricted in turn reinforces the task and cognitive demands placed on students.
Finally, the coding for complexity supports the declarative link as well. For example,
the item is similar to school-type problems. Consequently, when answering the question, students may recall exactly what they had been taught. Therefore, weighing the
four categories, we classified Item D05 as a declarative-knowledge item.
Table 19.1. Coding System.

Task demands: Task(s) that students are asked to perform. Selected examples of coding sub-categories, with possible links to knowledge type(a): defining/using terms (D); providing explanations (D/Sc); designing an experiment (P); mathematical calculation (P).
Cognitive demands: Inferred cognitive processes students likely act upon to provide responses. Examples: recalling facts (D); reading figures (P); reasoning with models (Sc); comparing and selecting strategies (St).
Item openness: Degree of freedom students are allowed to have in shaping responses to an item. Examples: selecting vs. generating responses (selecting good for D); stem presented as a complete question vs. an incomplete stem (incomplete stem good for D, but not good for Sc/P/St); one vs. multiple correct answers (one good for D vs. multiple good for Sc/St); recipe directions (not good for St).
Complexity: Domain-general factors that influence the item difficulty. Examples: textbook vs. novel tasks (textbook good for D vs. novel good for St); inclusion of irrelevant information (good for St); linguistic challenges (not good for any type); use of everyday knowledge (not good for any type).

Notes. (a) D = declarative knowledge, P = procedural knowledge, Sc = schematic knowledge, and St = strategic knowledge. These abbreviations are applied hereafter.

Figure 19.1. Example of a declarative-knowledge item.

Figure 19.2. Example of a procedural-knowledge item.

Item P01 was coded as testing procedural knowledge (Figure 19.2). It provides students with a graph showing a distance that a car has traveled for 75 minutes and asks
them to select the speed of that car. The item requires students to respond with an
algorithm to calculate speed and/or to interpret the diagram directly, both of which constitute procedural knowledge as we defined it. Second, the cognitive process students probably undertake is either applying a calculation algorithm for speed by dividing distance by time or locating the point corresponding to 60 minutes on the line in the graph (that
is, an hour) and simply reading the distance on the vertical axis. However, the item’s
lack of openness to some degree limits students' application of procedural knowledge.
Although this item allows students to generate their own responses before reading the
options, students can arrive at the correct answer by simply working backwards from
options or even merely guessing. Finally, the analysis of complexity does not provide
additional information for modifying the coding decision. Considering all the codes,
Item P01 was classified as a procedural item, though it is not a perfect candidate.
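As a worked illustration of the two solution routes just described, the short sketch below computes the speed both ways; the graph reading used (the distance at the 60-minute mark) is hypothetical, since the released item's actual values are not reproduced in this chapter.

    # Route 1: apply the speed algorithm (distance divided by time).
    # Route 2: read the distance at the 60-minute point and treat it as km/h.
    # The value 45 km at 60 minutes is a made-up illustration.
    distance_km_at_60_min = 45.0
    time_h = 60 / 60  # one hour

    speed_kmh = distance_km_at_60_min / time_h
    print(speed_kmh)  # 45.0 km/h either way: dividing distance by time, or
                      # reading the distance covered in exactly one hour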
In contrast to Items D05 and P01, our logical analysis revealed that Item P02 tended to test students' schematic knowledge. This short-answer item asks students to
provide an explanation for why the same, or less, amount of light shines on the wall
from two meters away (Figure 19.3). The four item characteristics support students' use of schematic knowledge. First, the item is intended to assess students' knowledge
about factors that determine the amount of light. It invites students to make a prediction and provide conceptual explanations by using the phrase ‘explain your answer’
in the stem. The knowledge that students use can be models or theories about light in
words or pictures. Second, the dominant cognitive process is reasoning with theories.
Very likely students have to develop responses by explaining the key features about
light and its movement. Third, Item P02 is moderately open, requiring students to
reason and organize their responses. Finally, the low complexity (for example, no
heavy reading or irrelevant information) strengthens the posited link to schematic
knowledge by reducing construct-irrelevant variance.
Lastly, and unfortunately, no example items for strategic knowledge could be selected from TIMSS 1999. Strategic items are expected to have certain item
characteristics, such as a complex, novel task that entails students’ interpreting the
problem and selecting domain-specific strategies, an open item structure that supports
the use of strategic knowledge, and a scoring that directly captures the differences in
students’ use of strategic knowledge (for example, types of strategies, efficiency,
patterns of monitoring). However, no TIMSS 1999 items satisfied the criteria for
being strategic knowledge items.
Logical Analysis
Two coders independently reviewed and classified the TIMSS-R science items into
knowledge types following the same procedures illustrated in the previous section.
The inter-coder agreement was 89 percent.3 A final item classification was achieved
as follows. For those items on which the two coders agreed, the agreed-upon classification is reported. For the items on which the coders disagreed, we consulted the findings of Pine and colleagues (AIR, 2000) and resolved the difference based on their rating.

Figure 19.3. Example of a schematic-knowledge item.
In cases where an item might tap a second knowledge type, coders were instructed to provide a secondary code for the item. This was expected, as the TIMSS-R items were designed to tap more than one performance expectation domain. Due to ambiguity, or because items drew on everyday experience, we found that the knowledge codes were not applicable for a few items (those items were then labeled
as NA; see Table 19.2). This difficulty was not unique. In a recent study analyzing the
same items, Pine and his colleagues (AIR, 2000) experienced similar difficulty.
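For readers who wish to reproduce this kind of reliability check, the sketch below computes simple percent agreement (the 89 percent figure reported above is of this type) together with Cohen's kappa, a chance-corrected alternative; the rater codes in the example are made up for illustration only.

    from collections import Counter

    # Hypothetical codes from two raters (D, P, Sc, St, NA); illustrative only.
    rater1 = ['D', 'D', 'P', 'Sc', 'D', 'NA', 'P', 'Sc', 'D', 'D']
    rater2 = ['D', 'D', 'P', 'Sc', 'P', 'NA', 'P', 'D',  'D', 'D']

    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n

    # Expected chance agreement from each rater's marginal distribution.
    c1, c2 = Counter(rater1), Counter(rater2)
    expected = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / n**2

    kappa = (observed - expected) / (1 - expected)
    print(f'percent agreement = {observed:.0%}, kappa = {kappa:.2f}')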
Item classifications are presented in Table 19.2. Given the fact that the Booklet 8
science items varied a great deal in their item characteristics, we anticipated that those
items would require the use of different knowledge types. Consistent with this prediction, we observed that the booklet measured three of the four proposed types of
knowledge, while – as stated before – none of the items was found to assess strategic knowledge. The Booklet 8 science test is heavily loaded on declarative knowledge (approximately 50 percent), with the balance of procedural and schematic knowledge items about equally distributed. The finding that declarative knowledge is overwhelmingly tested is also reported in other studies that examined large-scale tests
at various grade levels (Britton et al., 1996; Lomax et al., 1995).
As expected, three items (A08, P03 and R05) were double coded; these items were ambiguous to the extent that students could use either of two types of knowledge. We also classified Item Q16 as NA because students could use many different approaches, even general reasoning, to figure out the correct answer. The item itself did not have sufficient affordances for triggering a certain type of knowledge. Rather, the knowledge to be used is determined more by an individual student's preferences and other factors.
We also examined item format and item difficulty indexed by p-values in the table.
There was little association between item format and the type of knowledge items are intended to test (adjusted χ2 = 0.68, p = 0.71). This is not surprising. As we explained, the knowledge that a student brings to an item is elicited by a complicated interaction between that student's cognition and a set of item characteristics, including but not limited to format. Further, a comparison of item difficulties revealed that relatively easier items were used for measuring declarative knowledge than for the other types of knowledge. On average, about 15 percent more students were able to provide correct answers to declarative items than to other items. Putting those pieces together, one
disturbing conclusion is that the Booklet 8 test largely measured students’ declarative
knowledge with less challenging items.
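The two descriptive checks used here, item p-values and mean difficulty by knowledge type, are straightforward to compute. The sketch below shows the calculation on toy data; the scores and the classification mapping are invented for illustration, not the Booklet 8 data or the Table 19.2 classifications.

    import pandas as pd

    # Toy data: rows are students, columns are items scored 1 (completely
    # correct) or 0; the classification dict is illustrative, not Table 19.2.
    scores = pd.DataFrame({'A07': [1, 1, 0, 1], 'P01': [0, 1, 0, 1],
                           'P02': [0, 0, 1, 0]})
    knowledge_type = {'A07': 'D', 'P01': 'P', 'P02': 'Sc'}

    p_values = scores.mean()                      # item difficulty (p-value)
    by_type = p_values.groupby(knowledge_type).mean()
    print(p_values)
    print(by_type)   # a gap of ~0.15 would mean declarative items were
                     # answered correctly by about 15% more students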
Collectively, we found that raters could apply the coding system to consistently
classify the TIMSS 1999 items into types of knowledge. The coding showed that the
Booklet 8 test was heavily loaded on declarative knowledge while procedural or
schematic knowledge was measured by relatively fewer items.
Table 19.2. Classifications of TIMSS-R Booklet 8 Science Items with the USA Booklet 8 Sample.

Item label | Description | Format(a) | p-value(b) | Classification(c) (D, P, Sc, St, NA)
A07 | Organ not in abdomen | MC | 0.73 | T
A08 | Stored energy in two springs | MC | 0.80 | T, T
A09 | Fanning a wood fire | MC | 0.78 | T
A10 | Seeing person in a dark room | MC | 0.76 | T
A11 | Overgrazing by livestock | MC | 0.62 | T
A12 | Changes in river shape/speed | MC | 0.52 | T
B01 | Layers in earth | MC | 0.91 | T
B02 | Energy released from car engine | MC | 0.64 | T
B03 | Greatest density from mass/volume table | MC | 0.18 | T
B04 | Pulse/breathing rate after exercise | MC | 0.92 | T
B05 | Elevation diagram of wind/temperature | MC | 0.49 | T
B06 | Color reflecting most light | MC | 0.82 | T
P01 | Determination of speed from graph | MC | 0.47 | T
P02 | Amount of light on wall and ceiling | SA | 0.29 | T
P03 | Tree growth | SA | 0.41 | T, T
P04 | Hibernating animal | MC | 0.44 | T
P05 | Two reasons for famine | SA* | 0.32 | T
P06 | Digestion in stomach | SA | 0.42 | T
P07 | Replication of measurement | MC | 0.37 | T
Q11 | Earth year | MC | 0.63 | T
Q12 | Water level in U-tube | SA | 0.26 | T
Q13 | Alcohol/glass expansion in thermometer | MC | 0.44 | T
Q14 | Sugar dissolving in water | MC | 0.35 | T
Q15 | NOT a chemical change | MC | 0.39 | T
Q16 | Light from sun and moon | SA | 0.59 | T
Q17 | Function of chlorophyll | MC | 0.53 | T
Q18 | Mass of freezing water | SA | 0.30 | T
R01 | Bacteria/mold experience | MC | 0.46 | T
R02 | Appearance of red dress in green light | MC | 0.45 | T
R03 | Two outcomes of introducing new species | SA* | 0.39 | T
R04 | Atmospheric conditions in jets | SA | 0.15 | T
R05 | Small pieces of wood burn faster | SA | 0.16 | T, T
R06 | Results of global warming | MC | 0.28 | T

Notes. (a) MC = multiple-choice, SA = short-answer worth 1 point, and SA* = short-answer worth 2 points. (b) The p-value reported is defined as the proportion of students who got completely correct answers out of those who took the Booklet 8 test. (c) NA indicates that the item cannot be classified into a certain knowledge type(s) since it is too ambiguous or does not deal with science knowledge.

Factor Analysis

The link between construct and assessment does not stop with the logical analysis. A variety of analyses can be performed to verify the posited links, such as examining
students’ responses to the test items, collecting and analyzing students’ think-aloud
verbalizations while responding to the test items, or statistically modeling the underlying patterns. In this chapter we empirically evaluate the posited link by examining the patterns of how item scores covary with each other. We performed the CFA using the Mplus software (Muthén and Muthén, 1999) to ascertain whether the knowledge model could account for the underlying covariance pattern. We built a statistical
model to test the links between the items and the coded knowledge types, which
correspond exactly to the item classifications resulting from the logical analysis in Table 19.2 after removing Item Q16.
Mplus outputs a χ2 statistic to indicate the goodness of fit of the data to the hypothesized model. The statistics for our model (χ2 = 325.49, df = 294, p = 0.10) and the RMSEA (0.01) indicate that the fit of the item scores to our model was adequately good and within the bounds of statistical acceptability. That is, the CFA confirmed the item classifications from the logical analysis: 17 items as declarative knowledge, eight items as schematic knowledge, and seven as procedural knowledge. The results completely supported these classifications. Most factor loadings were statistically significant and generally high, which was also true for at least one loading of each double-coded item. The fit and the factor loadings support the feasibility of using our framework of science achievement to analyze the TIMSS 1999 science items.
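The reported fit can be loosely checked from the published χ2 and degrees of freedom alone. The sketch below recomputes the χ2 p-value and the RMSEA under the standard single-group formula, assuming a Booklet 8 sample size of roughly 990 students (the item-level n in Table 19.4); the sample size is our assumption rather than a figure stated with the CFA results.

    from math import sqrt
    from scipy.stats import chi2

    chi_sq, df, n = 325.49, 294, 990  # chi2 and df from the text; n is assumed

    # Standard RMSEA formula for a single-group model.
    rmsea = sqrt(max(chi_sq - df, 0) / (df * (n - 1)))
    p_value = chi2.sf(chi_sq, df)

    print(f'RMSEA = {rmsea:.3f}, p = {p_value:.2f}')  # about 0.010 and 0.10,
                                                      # matching the reported fit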
Further, the relation between each pair of the three hypothetical knowledge factors
was significant too, indicated by the estimate to standard error ratios ranging from 3.73
to 11.04. We speculate that the strong associations among the knowledge types may be due to several reasons: (1) different types of knowledge are, by nature, interdependent aspects of the science understanding that one can achieve, and therefore the knowledge types should be moderately correlated; (2) an individual's general ability inevitably affects, to a greater or lesser extent, her performance on each item, and subsequently the knowledge factors; and (3) items are unevenly distributed across format and content, with, for example, 70 percent of the items in multiple-choice format and more than half in life science and physics. If the analysis were replicated with appropriately sampled items, we should expect to observe weaker correlations once the influences of format and content are reduced.
Another part of our CFA work was to compare the knowledge model with competing models. We built alternative models based on different theoretical claims, such as one factor representing general ability, two factors representing format (multiple-choice and short-answer), or three factors using the TIMSS 1999 performance expectation framework4 (Table 19.3). Using the fit measures and other information, we found that the
knowledge model had a much better fit than those alternative models. All the models fit the item scores reasonably well. Yet only the knowledge model was significantly superior to the one-factor model, whereas the other models only slightly improved the fit. The results suggest that a knowledge model reflects the underlying pattern among item scores better than using the total score as a measure of general science achievement or using alternative structures. Overall, regardless of the domain, format, or thinking processes the TIMSS 1999 items involve, they clustered into three knowledge factors.

Table 19.3. Comparison Between the Knowledge-type Model and Alternative Models.

Tested model | Description of factors | χ2 | p | df
One factor model | General reasoning ability | 343.16 | 0.03 | 297
Knowledge model | Declarative, procedural, and schematic | 325.49 | 0.10 | 294
Format model | Multiple-choice and short-answer | 339.20 | 0.04 | 296
Content model | Life science, physics, environmental issues, chemistry, earth science, and nature of science | 333.89 | 0.04 | 290
TIMSS-R performance expectation model | Simple information, complex information, applying, theorizing, and solving problems | 341.79 | 0.04 | 296
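Because the one-factor model is nested within the knowledge model, the claim that only the knowledge model is significantly superior can be illustrated with a standard χ2 difference test on the values in Table 19.3. The sketch below is our illustration of that test, not an analysis reported in this form in the chapter.

    from scipy.stats import chi2

    # Values from Table 19.3.
    one_factor = {'chi2': 343.16, 'df': 297}
    knowledge = {'chi2': 325.49, 'df': 294}

    delta_chi2 = one_factor['chi2'] - knowledge['chi2']   # 17.67
    delta_df = one_factor['df'] - knowledge['df']         # 3
    p = chi2.sf(delta_chi2, delta_df)

    print(f'delta chi2 = {delta_chi2:.2f}, df = {delta_df}, p = {p:.4f}')
    # p < 0.001: the three-knowledge-factor model fits significantly better
    # than a single general-achievement factor.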
Linking Instructional Experiences to Students’ Achievement
In order to provide additional evidence bearing on the construct validity, we attend to
the relationship between students’ achievement and their instructional experiences.
Ample research has addressed the relation between learning outcomes and instructional practice based on the very premise that students’ learning is influenced by classroom experiences (for example, Hamilton et al., 2003; Wenglinsky, 2002). One
challenge that such studies must face is what learning outcomes are measured (Glass,
1990; Haertel, 1986; Linn, 2000). This relates directly to the validity concerns that we pointed out at the beginning of this chapter. Use of invalid measures of science achievement not only results in misleading interpretations and decisions but also eventually undermines teaching and learning practice.
With this concern in mind, we continue validating our achievement framework by relating students' classroom experiences to their learning, defined as distinct types of knowledge. We expect that the associations will vary in strength depending on the type of knowledge, which supports appropriate interpretations of students' achievement. In contrast to using total scores, this partition approach provides more accurate, diagnostic information in tracing the effects of instructional variables. A similar approach to conceptualizing and decomposing science achievement, but on the basis of reasoning dimensions, has been reported to be more valuable than using total scores in a series of studies of the National Education Longitudinal Study of 1988 achievement test (for example, Hamilton, Nussbaum et al., 1997; Kupermintz and
Snow, 1997).
In summary, this section continues to apply the proposed knowledge framework to
examine how instructional variables were associated with different types of knowledge demonstrated by the students. We anticipate that the findings are meaningfully
different enough to justify the use of the knowledge framework rather than total
scores.
Measures of instructional experiences
TIMSS 1999 administered questionnaires to both students and their teachers to collect
information about the classroom instructional experiences that students have been
exposed to. We limit the scope of our analysis by considering: (1) the instructional variables available in the student questionnaires (Martin et al., 2000); (2) the literature on the impact of instruction (for example, de Jong and van Joolingen, 1998; Duit and Treagust, 1998; Hamilton et al., 2003; Mayer, Mullens and Moore, 2000; Monfils et al., 2000; National Center for Education Statistics, 2001; Papanastasiou et al., 2004; Ruiz-Primo et al., 2002; Smerdon et al., 1999; Tytler, 2002; Von Secker, 2002; Wenglinsky, 2002); and (3) the instructional methods recommended to teachers in science standards documents (for example, NRC, 1996; New Standards, 1997).
Specifically, we chose to focus our analysis on eight instructional methods, grouped as textbook-based and hands-on. Textbook-based methods require students to copy notes from the board, to take quizzes and tests, or to work from worksheets or textbook questions. In contrast, hands-on methods are those recommended by national and state standards to teach science through authentic activities and inquiry experiences. While the textbook-based methods tend to concentrate on rote learning of the material from the texts, the hands-on methods attempt to help students
make sense of what is learned.
Table 19.4 summarizes the means of the instructional variables self-reported by the
TIMSS 1999 USA Booklet 8 sample. Generally, students were often exposed to all
those instructional methods, as indicated by the range of these variables. The textbook-based methods seemed to be a more frequent experience than the hands-on methods, among which the use of everyday life was the least frequently reported; nevertheless, such a difference was negligible in terms of effect size. Compared with the finding reported by some studies in the last decade that textbook-based instruction was the dominant teaching mode relative to hands-on instruction in many American classrooms (Kober, 1993), this suggests a gradual shift in teachers' instructional practice, but a fundamental change in response to the national and state science standards has not yet been fully achieved.
Table 19.4. Means of Instructional Methods Reported by the TIMSS-R USA Booklet 8 Sample.

Instructional experience | Description of questionnaire item | n | Mean (max = 4)
Textbook-based methods:
Notes from board | We copy notes from the board. | 990 | 3.08
Quiz | We have a quiz or test. | 986 | 3.06
Worksheet | We work from worksheets or textbooks on our own. | 984 | 3.12
Hands-on methods:
Problem | The teacher shows us how to do science problems. | 993 | 2.98
Project | We work on science projects. | 983 | 2.82
Everyday life | We use things from everyday life in solving science problems. | 985 | 2.50
Demonstrated experiment | The teacher gives a demonstration of an experiment. | 982 | 2.99
Student experiment | We ourselves do an experiment or practical investigation in class. | 978 | 2.82

Note. Students were instructed to report how often each of the instructional methods happened in their science classrooms on a 4-point scale. After recoding, scores of 1, 2, 3 and 4 refer to 'Never', 'Once in a while', 'Pretty often' and 'Almost always', respectively.
Relation between types of knowledge and instructional experiences
Informed by the relevant research and science standards, we speculate that the textbook-based instructional methods are associated with students’ learning of declarative
knowledge whereas the hands-on methods can be effective in improving students’
procedural and/or schematic knowledge (for example, Haury and Rillero, 1994;
Glasson, 1989; Sutton and Krueger, 2001). For instance, a textbook-based lecture, an
effective way of teaching declarative knowledge, is unlikely to be successful in teaching procedural knowledge. In contrast, using everyday life examples, instead of
unconnected factual information, to solve science problems is recommended as a
strategy to improve students’ understanding across more than one content area (NRC,
1996). Therefore, this instructional method supports students’ construction of
schematic knowledge rather than rote-learning of declarative knowledge.
In order to test this predicted pattern, a set of structural equation models was estimated while controlling for some important variables at the student level (that is, highest education level of parents, index of home educational resources, out-of-school science study time, index of confidence in science ability, self-reported indication of usually doing well in science and positive attitude toward science). These control variables were selected because they are commonly identified in the literature and available in the TIMSS 1999 dataset. We acknowledge that the complex sampling, including the clustering of students within classrooms and the sampling weights, should be taken into account in order to model instructional data and achievement scores (for example, Keeves et al., 2000). However, it was inappropriate to run hierarchical linear modeling with the data sample for each booklet; for example, about half of the classrooms had only one or two students in Booklet 8 (for example, Kreft, 1996; Pollack, 1998).5 To address this methodological dilemma, we chose to report the results from the structural equation modeling and to focus our interpretations on the differences between the use of the knowledge framework and total scores instead of on statistical significance (see Table 19.5).
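As the note to Table 19.5 indicates, the strength of an individual loading can be gauged from the ratio of the estimate to its standard error. The sketch below applies the conventional |z| > 1.96 two-tailed criterion, which is a standard cut-off rather than a threshold stated in the chapter, to two loadings from the one-factor column of Table 19.5.

    # Estimate and standard error pairs taken from Table 19.5 (Student
    # experiment and Notes from board, one-factor model).
    loadings = {'Student experiment': (0.093, 0.035),
                'Notes from board': (0.022, 0.030)}

    for method, (estimate, se) in loadings.items():
        z = estimate / se
        verdict = 'significant' if abs(z) > 1.96 else 'not significant'
        print(f'{method}: z = {z:.2f} ({verdict})')
    # Student experiment: z = 2.66 (significant)
    # Notes from board: z = 0.73 (not significant)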
Overall, most of the factor loadings of instructional experiences on students' performance on the test items were not significant, indicating that, after controlling for all the relevant background variables, students' instructional experiences did not contribute much to their learning outcomes. Across both models, teaching methods such as copying notes from the board, using everyday life problems and examples, and having students engage in experiments seem to be associated with students' performance, whereas the other instructional methods are negatively related to students' test scores. The magnitude of the loadings and the patterns of the instruction–learning relationship are not consistent with the findings from a recent study by Papanastasiou and her colleagues (2004) that examined the relation between TIMSS 1999 instruction and students' achievement as indicated by Rasch scores. We suspect this may be due to the use of data from different countries and of different indicators of student performance.
In terms of differences between the two models, the relationship exhibits slightly different patterns across the instructional methods, such as the direction of the relationship for methods like Quiz and Everyday life, or the strength of the relationship for methods like Student experiment. Within the knowledge-factor model, the factor loadings vary across the three types of knowledge, but some of the instruction–learning
relationships run contrary to our prediction that the textbook-based methods are positively related to students' learning of declarative knowledge whereas the hands-on methods are effective for learning procedural and schematic knowledge.

Table 19.5. Comparing Factor Loadings of the One-Factor Model and the Knowledge-Factor Model.

Instructional variable | One-factor model, Understanding: estimate (S.E.) | Knowledge-factor model, Declarative: estimate (S.E.) | Procedural: estimate (S.E.) | Schematic: estimate (S.E.)
Notes from board | 0.022 (0.030) | 0.036 (0.047) | 0.054 (0.040) | -0.009 (0.049)
Quiz | -0.029 (0.040) | -0.080 (0.062) | 0.015 (0.053) | -0.021 (0.069)
Worksheet | -0.055 (0.031) | -0.042 (0.050) | -0.075 (0.044) | -0.111 (0.053)
Everyday life | 0.005 (0.034) | 0.019 (0.055) | -0.100 (0.045) | 0.004 (0.057)
Problem | -0.017 (0.033) | -0.001 (0.052) | -0.023 (0.047) | -0.055 (0.057)
Project | -0.049 (0.031) | -0.082 (0.050) | -0.006 (0.043) | -0.093 (0.053)
Demonstrated experiment | -0.038 (0.033) | -0.043 (0.050) | -0.048 (0.044) | -0.063 (0.057)
Student experiment | 0.093 (0.035) | 0.091 (0.050) | 0.131 (0.048) | 0.164 (0.054)

Note. The ratio of estimate to S.E. typically presents the statistical strength of the relation.

To
make sense of this contradiction, we reason that the instruction–learning relation would be more accurately represented by considering the quality of teaching, in terms of how teachers implement the instructional methods, rather than the implementation frequency or the amount of time teachers spend using those methods. For example, some researchers claim that frequent use of hands-on instruction does not automatically lead to increases in students' learning if the instructional activities are not focused on, or well connected to, scientific concepts (for example, Kober, 1993). Therefore, one
real challenge for studies such as TIMSS 1999 would be to include items within the
limited scope of student and teacher questionnaires to capture some important aspects
of teaching quality. Equally challenging, secondary studies must strategically select and work with the data from those studies, for example by being aware of the constraints of the data while interpreting the results, or by combining several items to create appropriate indicators.
To conclude, the relations between instruction and student performance differ
between the one factor and knowledge-factor models and across the three types of
knowledge. Generally speaking, the knowledge-factor model shows a more detailed picture of the instruction–learning relation compared with the one-factor model. Of course, given the problems associated with the self-report technique (for example, inaccuracy of reported information, or discrepancies between students' and teachers' perceptions of instruction) and the limitations of our methodology discussed earlier, these conclusions
need to be cautiously examined before generalization.
Conclusion
The main purpose of this chapter was to outline and apply a knowledge framework to
examine the science achievement tested by the TIMSS 1999 Booklet 8 science items
and its relation with students’ instructional experiences. This framework conceptualizes science achievement as four types of knowledge: declarative knowledge
or ‘knowing that’, procedural knowledge or ‘knowing how’, schematic knowledge or
'knowing why' and strategic knowledge or 'knowing when, where, and how knowledge applies'. We argue that test items with different combinations of item characteristics, acting as affordances and constraints, can elicit or restrict students' use of their science
knowledge.
For this purpose, we presented three sets of analyses. First, we applied the framework in a logical analysis to code and classify the TIMSS 1999 items. We observed
that declarative knowledge was heavily loaded in the test items whereas none of the
items tested strategic knowledge. Second, the results from factor analyses indicated
that the underlying pattern of the item scores converged with the proposed item–knowledge links from the logical analysis. Items that were pre-classified as a certain knowledge type clustered together. The knowledge-factor model was also found to fit the item scores better than any of the alternative models. And third, we examined the associations between students' science knowledge and the self-reported
instructional experiences. The factor loadings revealed that modeling the relationship with the knowledge factors provided more specific, meaningful interpretations of students' learning than modeling it with the one-factor model. In addition, factor loadings varied
across the three types of knowledge, which was another source of evidence bearing
on the differences between the knowledge types. Even though the pattern failed to
fully confirm our predictions, most loadings supported the distinctions between the
knowledge types that we proposed.
Overall, we conclude that the science achievement measured by the TIMSS 1999 test can be distinguished into different types of knowledge. The study reported here suggests that it is necessary to implement a framework for science achievement, whether the one we proposed here or another theoretical framework developed on the basis of cognitive research and related fields, to carefully examine what is measured in science tests. Such a framework provides not only accurate interpretations of students' achievement scores but also meaningful insights and interesting questions on how instructional experiences are connected to students' learning. Furthermore, the framework offers valuable information for test development and practice as well. This chapter, we believe, is a beginning in developing and evaluating such frameworks. It should be followed by more studies that provide systematic, rigorous evidence using different methods and additional sets of science items.
Notes
1. We recognize the simplification of our framework, for example, treating types of
knowledge as if they were truly separable or presenting a static typology of knowledge.
2. In order to make it manageable to test any single student, TIMSS 1999 used
an incomplete block design to sample test items and students. The TIMSS
1999 science items were organized into 21 clusters, which were then placed
into eight test booklets so that a student only responded in one randomly
assigned test booklet. Released items and data files can be found at
http://isc.bc.edu/timss1999i/study.html.
3. The inter-coder agreement was calculated based on the codes for all the TIMSS
1999 science items.
4. Performance expectations implemented in TIMSS 1999 included five categories:
(1) understanding simple information; (2) understanding complex information; (3)
theorizing, analyzing and solving problems; (4) using tools and routine procedures; and (5) investigating the natural world. The framework was revised from a very similar one originally developed for TIMSS, whereas a substantially modified framework based on the cognitive demands of (1) factual knowledge, (2) conceptual understanding and (3) reasoning and analysis was proposed for TIMSS 2003. This
chapter applied the TIMSS 1999 framework to classify the test items.
5. Analyzing the data within each booklet allows us to analyze the test items and validate the posited knowledge codes in the CFA.
References
AIR (2000) Review of science items from TIMSS 1999: A report from the science
committee to the technical review panel. Washington, DC: American Institutes for
Research.
Alexander, P.A. and Judy, J.E. (1988) The interaction of domain-specific and
strategic knowledge in academic performance. Review of Educational Research,
58, 375–404.
Anderson, J.R. (1983) The architecture of cognition. Cambridge, MA: Harvard
University.
Ayala, C.C., Ayala, M.A. and Shavelson, R.J. (2000, April) On the cognitive interpretation of performance assessment scores. Paper presented at the AERA annual
meeting, New Orleans, LA.
Baker, E.L., Linn, R.L., Herman, J.L. and Koretz, D. (2002) Standards for educational accountability systems. CRESST Policy Brief 5. Los Angeles: University of
California, CRESST.
Baxter, G.P. and Glaser, R. (1998) Investigating the cognitive complexity of
science assessments. Educational Measurement: Issues and Practice, 17(3),
37–45.
Bennett, R. and Ward, W. (eds.) (1993) Construction versus choice in cognitive measurement. Hillsdale, NJ: Erlbaum.
Britton, E.D., Dossey, J., Eubanks, L., Gisselberg, K., Hawkins, S., Raizen, S.A. and
Tamir, P. (1996) Comparing examinations across subjects and countries. In E.D.
Britton and S.A. Raizen (eds.) Examining the examinations: An international
comparison of science and mathematics examination for college bound students
(pp. 23–54). Boston, MA: Kluwer.
Bybee, R.W. (1996) The contemporary reform of science education. In J. Rhoton and
P. Bowers (eds.), Issues in science education (pp. 1–14). Arlington, VA: National
Science Teachers Association.
Bybee, R.W. (1997) Achieving scientific literacy: from purpose to practice.
Portsmouth, NH: Heinemann.
Chi, M.T.H., Feltovich, P. and Glaser, R. (1981) Categorization and representation of
physics problems by experts and novices. Cognitive Science, 5, 121–52.
Chi, M.T.H., Glaser, R. and Farr, M. (eds.). (1988) The nature of expertise. Hillsdale,
NJ: Erlbaum.
de Jong, T. and Ferguson-Hessler, M.G.M. (1996) Types and qualities of knowledge.
Educational Psychologist, 31, 105–13.
de Jong, T. and van Joolingen, W.R. (1998) Scientific discovery learning with computer simulations of conceptual domains. Review of Educational Research, 68(2),
179–201.
De Kleer, J. and Brown, J.S. (1983) Assumptions and ambiguities in mechanistic
mental models. In D. Gentner and A.L. Stevens (eds.), Mental models (pp.
155–90). Hillsdale, NJ: Erlbaum.
Dreyfus, H.L. and Dreyfus, S.E. (1986) Mind over machine: The power of human
intuition and expertise in the era of the computer. New York: The Free Press.
Duit, R. and Treagust, D. (1998) Learning in science – from behaviorism towards
social constructivism and beyond. In B. Fraser and K. Tobin (eds.), International
handbook of science education (pp. 3–25). Dordrecht, Netherlands: Kluwer.
Dutton, J.E. and Jackson, S.E. (1987) The categorization of strategic issues by
decision makers and its links to organizational action. Academy of Management
Review, 12, 76–90.
Gentner, D. and Stevens, A.L. (eds.) (1983) Mental models. Hillsdale, NJ: Erlbaum.
Glaser, R., Raghavan, K. and Baxter, G.P. (1992) Cognitive theory as the basis for
design of innovative assessment: Design characteristics of science assessment
(CSE Technical report No. 349). Los Angeles: University of California, CRESST.
Glass, G.V. (1990) Using student test scores to evaluate teachers. In J. Millman and
L. Darling-Hammond (eds.), The new handbook of teacher evaluation (pp.
229–40). Newbury Park, CA: SAGE.
Glasson, G.E. (1989) The effects of hands-on and teacher demonstration laboratory
methods on science achievement in relation to reasoning ability and prior knowledge. Journal of Research in Science Teaching, 26, 121–31.
Haertel, E. (1986) The valid use of student achievement measures for teacher evaluation. Educational Evaluation and Policy Analysis, 8(1), 45–60.
Hamilton, L.S., McCaffrey, D.F., Stecher, B.M., Klein, S.P., Robyn, A. and Bugliari,
D. (2003) Studying large-scale reforms of instructional practice: An example from
mathematics and science. Educational Evaluation and Policy Analysis, 25(1),
1–29.
Hamilton, L.S., Nussbaum, E.M., Kupermintz, H., Kerkhoven, J.I.M. and Snow, R.E.
(1997) Enhancing the validity and usefulness of large-scale educational assessments: II. NELS: 88 science achievement. American Educational Research
Journal, 32, 555–81.
Hamilton, L.S., Nussbaum, E.M. and Snow, R.E. (1997) Interview procedures for validating science assessments. Applied Measurement in Education, 10(2), 181–200.
Haury, D.L. and Rillero, P. (1994) Perspectives of hands-on science teaching.
Columbus, OH: ERIC Clearinghouse for Science, Mathematics and Environmental
Education.
Keeves, J.P., Johnson, T.G. and Afrassa, T.M. (2000) Errors: What are they and how
significant are they? International Education Journal, 11(3), 164–80.
Kober, N. (1993) What we know about science teaching and learning. EdTalk.
Washington, DC: Council for Educational Development and Research.
Kohn, A. (2000) The case against standardized testing. Greenwich, CT: Heinemann.
Kreft, Ita G.G. (1996) Are multilevel techniques necessary? An overview, including
simulation studies. Unpublished report, California State University, Los Angeles.
Retrieved 12 August 2003, from http://www.calstatela.edu/faculty/ikreft/quarterly/
quarterly.html
Kupermintz, H. and Snow, R.E. (1997) Enhancing the validity and usefulness of
large-scale educational assessments: II. NELS: 88 mathematics achievement to
12th grade. American Educational Research Journal, 34, 124–50.
Li, M. (2001) A framework for science achievement and its link to test items.
Unpublished dissertation at Stanford University.
Li, M., Shavelson, R.J. and White, R. (In progress). Towards a framework for achievement assessment design: The case of science education. Manuscript submitted.
Linn, R.L. (2000) Assessment and accountability. Educational Researcher, 29, 4–16.
Lomax, R.G., West, M.M., Harmon, M.C. and Viator, K.A. (1995) The impact of mandated standardized testing on minority students. Journal of Negro Education, 64,
171–85.
Martin, M.O., Mullis, I.V.S., Gonzalez, E.J., Gregory, K.D., Smith, T.A.,
Chrostowski, S.J., Garden, R.A. and O’Connor, K.M. (2000) TIMSS 1999 international science report findings from IEA’s Repeat of the Third International
Mathematics and Science Study at the eighth grade. Chestnut Hill, MA: Boston
College.
Martinez, M.E. (1999) Cognition and the question of test item format. Educational Psychologist, 34, 207–18.
Mayer, D.P., Mullens, J.E. and Moore, M.T. (2000) Monitoring school quality: An
indicators report (NCES 2001–030). Washington, DC: National Center for
Education statistics.
Messick, S. (1993) Trait equivalence as construct validity of score interpretation
across multiple methods of measurement. In R. Bennett and W. Ward (eds.),
Construction versus choice in cognitive measurement (pp. 61–74). Hillsdale, NJ:
Erlbaum.
Monfils, L., Camilli, G., Firestone, W. and Mayrowetz, D. (2000, April)
Multidimensional analysis of scales developed to measure standards based instruction in response to systemic reform. Paper presented at the annual meeting of
AERA.
Muthén, L.K. and Muthén, B.O. (1999) Mplus: The comprehensive modeling program
for applied researchers: User’s guide. Los Angeles, CA: Muthén and Muthén.
National Center for Education Statistics (2001) The nation’s report card: Science
highlights 2000. Washington, DC: National Center for Education Statistics.
National Research Council (NRC) (1996) National Science Education Standards.
Washington, DC: National Academy.
New Standards (1997) Performance standards: English language arts, mathematics,
science, applied learning. Washington, DC: Author.
OECD/PISA (2001) Measuring student knowledge and skills. The PISA 2000 assessment of reading, mathematical, and scientific literacy. Paris: Organisation for
Economic Cooperation and Development/Programme for International Student
Assessment.
Papanastasiou, E.C., Zacharia, Z. and Zembylas, M. (2004, May) Examining when
instructional activities work well in relation to science achievement. Paper presented at the 1st IEA International Research Conference, Lefkosia, Cyprus.
Pellegrino, J., Chudowsky, N. and Glaser, R. (2001) Knowing what students know:
The science and design of educational assessment. Washington, DC: National
Academy Press.
Pollack, B.N. (1998) Hierarchical linear modeling and the ‘unit of analysis’ problem:
A solution for analyzing responses of intact group members. Group Dynamics:
Theory, Research, and Practice, 2, 299–312.
Quellmalz, E. (2002) Using cognitive analysis to study the validities of science
inquiry assessments. Paper presented at the Annual AERA Meeting, New Orleans,
LA.
Randel, J.M. and Pugh, H.L. (1996) Differences in expert and novice situation awareness in naturalistic decision making. International Journal of Human Computer
Studies, 45, 579–97.
Rasmussen, J. (1986) Information processing and human-machine interaction: An
approach to cognitive engineering. New York: North-Holland.
Ruiz-Primo, M.A., Shavelson, R.J., Hamilton, L. and Klein, S. (2002) On the evaluation of systemic science education reform: Searching for instructional sensitivity.
Journal of Research in Science Teaching, 39, 369–93.
Ruiz-Primo, M.A. (2002, February) On a seamless assessment system. Paper
presented to the AAAS annual meeting, Boston, MA.
Sadler, P.M. (1998) Psychometric models of student conceptions in science:
Reconciling qualitative studies and distractor-driven assessment instruments.
Journal of Research in Science Teaching, 35(3), 265–96.
Shavelson, R.J., Carey, N.B. and Webb, N.M. (1990, May) Indicators of science
achievement: Options for a powerful policy instrument. Phi Delta Kappan,
692–7.
Shavelson, R.J., Li, M., Ruiz-Primo, M.A. and Ayala, C.C. (2002, August) Evaluating
new approaches to assessing learning. Presented at Joint Northumbria/EARLI
Assessment Conference, University of Northumbria at New Castle, Longhirst
Campus.
Shavelson, R.J. and Ruiz-Primo, M.A. (1999) On the assessment of science achievement. (English version) Unterrichtswissenschaft, 27(2), 102–27.
Smerdon, B.A., Burkam D.T. and Lee, V.E. (1999) Access to constructivist and
didactic teaching: Who gets it? Where is it practiced? Teachers College Record,
101(1), 5–34.
Snow, R.E. (1989) Toward assessment of cognitive and conative structures in learning. Educational Researcher, 18(9), 8-14.
Sugrue, B. (1993) Specifications for the design of problem-solving assessments in
science. Project 2.1. Designs for assessing individual and group problem-solving.
ERIC ED: 372 081.
Sugrue, B. (1995) A theory-based framework for assessing domain-specific problem-solving ability. Educational Measurement: Issues and Practice, 14(3), 29–36.
Sutton, J. and Krueger, A. (eds.). (2001) EDThoughts: What we know about science
teaching and learning. Aurora, CO: Mid-continent Research for Education and
Learning.
Tytler, R. (2002) Teaching for understanding in science: Constructivist/conceptual
change teaching approaches. Australian Science Teachers Journal, 48(4), 30–5.
Von Secker, C. (2002) Effects of inquiry-based teacher practices on science excellence
and equity. Journal of Educational Research, 95, 151–60.
Wenglinsky, H. (2002) How schools matter: The link between teacher classroom
practices and student academic performance. Education Policy Analysis Archives,
10(12). Retrieved 12 August 2003, from http://epaa.asu.edu/epaa/v10n12/
White, R. (1999) The nature and structure of knowledge: Implications for assessment.
Manuscript.
Zeitz, C. (1997) Some concrete advantages of abstraction: How experts’ representations facilitate reasoning. In P.J. Feltovich, K.M. Ford and R.R. Hoffman (eds.),
Experts in context (pp. 43–65). Menlo Park, CA: AAAI Press.