RUNNING HEAD: Measurement and Impact of the Treatment

On the Measurement and Impact of Formative Assessment on Students' Motivation, Achievement, and Conceptual Change

Yue Yin(1), Carlos C. Ayala(2), Richard J. Shavelson(3), Maria Araceli Ruiz-Primo(4), Miki K. Tomita(3), Erin M. Furtak(5), Paul R. Brandon(1), and Donald B. Young(1)

1. College of Education, University of Hawaii
2. School of Education, Sonoma State University
3. School of Education, Stanford University
4. University of Colorado at Denver and Health Sciences Center
5. Max Planck Institute for Human Development

Abstract

Formative assessment holds the promise of promoting achievement and motivation. The effects of formative assessment, however, have rarely been examined experimentally in regular education settings. Motivational beliefs are hypothesized to be important factors that influence conceptual change. Because the motivational beliefs expected to be modified by formative assessment are the same motivational beliefs hypothesized to influence conceptual change, this study connected the formative assessment and conceptual change frameworks and examined the effect of formative assessment on student outcomes. In particular, the study explored whether formative assessment would improve student motivation, achievement, and consequently conceptual change. A randomized experimental design in a field setting was used, and 12 middle school science teachers and their students participated in the study. Student motivation, achievement, and conceptual change varied significantly among the participating teachers, but the formative assessment treatment did not have a significant impact on these outcome variables.

On the Measurement and Impact of Formative Assessment on Students' Motivation, Achievement, and Conceptual Change

The logic of formative assessment—identify learning goals, assess where students are with respect to those goals, and use effective teaching strategies to close the gap—is compelling and has led to the expectation that formative assessment would improve students' learning and achievement (Ramaprasad, 1983; Sadler, 1989). Moreover, substantial empirical evidence has supported the effectiveness of formative assessment (Black & Wiliam, 1998a, 1998b). However, the evidence comes mainly from either laboratory studies or anecdotal records (e.g., Black & Wiliam, 1998a). As Black and Wiliam (1998a) pointed out, studies conducted in laboratory contexts may suffer "ecological validity" problems and encounter practical obstacles when applied in classrooms. The effects of formative assessment have rarely been examined experimentally in regular education settings. The study reported in this paper examined the effect of formative assessment on student outcomes by using a randomized experimental design in a field setting. The study explored whether formative assessment would improve student motivation, achievement, and conceptual change. This paper presents (a) the conceptual framework of the study; (b) the research design; (c) the instruments used to measure the outcome variables—motivation, achievement, and conceptual change—and the technical qualities of each instrument; and (d) the results of the experiment.

Conceptual Framework

This section reviews literature on conceptual change and formative assessment.
Based on the literature review, the theoretical framework for this study is laid out and research questions are raised. Achievement is viewed as acquiring knowledge and the capacity to reason with that knowledge. As described in Shavelson et al. (this issue), knowledge can be distinguished as four types: declarative, procedural, schematic, and strategic. The focus of this study is on schematic knowledge and, more specifically, on how to move students from a naïve conception of the natural world to a scientifically justifiable one as to "why things sink and float." Hence we focus on conceptual change and the factors that affect it.

Conceptual Change

When students come into science classrooms, they often hold deep-rooted prior knowledge or conceptions about the natural world. These conceptions influence how they come to understand what they are taught. Some of their existing prior knowledge provides good foundations for formal schooling, for example, their prior knowledge of number and language. However, other prior conceptions are incompatible with currently accepted scientific knowledge; these conceptions are commonly referred to as misconceptions (NRC, 2001) or alternative conceptions (Abimbola, 1988). Usually students derive misconceptions from limited observation and experience. Consequently, learning is not only the acquisition of new knowledge but also the interaction between new knowledge and students' prior knowledge. For example, everyday life experience leads young children to believe that the earth is flat. In fact, even when some children learn that the earth is round, they believe that it is like a pancake--round but still flat (Vosniadou & Brewer, 1992).

The topic of this study, "Why things sink and float," is addressed in many middle school physical science curricula, and students' misconceptions about the topic have been well documented (e.g., M. G. Hewson, 1986; Kohn, 1993; Yin, Tomita, & Shavelson, in press). Although sinking and floating is a common experience in everyday life, it is in fact a sophisticated scientific phenomenon. To fully understand the fundamental reason underlying sinking and floating, students need complicated knowledge that includes an analysis of forces (buoyancy and gravity) and water pressure. That knowledge, however, is either not introduced or not sufficiently addressed in middle school curricula, most likely because of the difficulty of learning it. Rather, some curriculum developers take a shortcut and use relative density to simplify the explanation of sinking and floating (e.g., Pottenger & Young, 1992). Even so, relative density itself is challenging for many students because density is a concept involving the ratio of mass to volume (e.g., Smith, Snir, & Grosslight, 1992) and relative density involves comparing two ratio variables, the density of an object and the density of a medium.

Despite its complexity in science, sinking and floating is a common phenomenon. Most students have rich experiences and personal theories or mental models for explaining sinking and floating. Unfortunately, many of their theories are either misconceptions or conceptions that are only valid under certain circumstances--for example, big/heavy things sink, things with holes in them sink, hollow things float, things with air in them float, or flat things float.
Those conceptions are so deeply rooted in students' minds that it is very difficult for students to change them, even after students have been intensively exposed to scientific conceptions such as, "If an object's density is less than a liquid's density, the object will float in the liquid regardless of the size or mass of the object" (Yin et al., in press).

Similar to children's misunderstanding of the shape of the planet earth and of "why things sink and float," many other misconceptions are deeply rooted in everyday experiences, which then inhibit students from acquiring scientific conceptions. These alternative conceptions exist widely across different subject domains, among people of different ages, across different cultures, and through the history of the development of scientifically justifiable ideas (Yin, 2005). Given the presence and persistence of alternative conceptions, the restructuring or reorganization of existing knowledge, or conceptual change, has become a very important component of teaching and learning (Carey, 1984; Schnotz, Vosniadou, & Carretero, 1999). To fully establish scientifically justifiable conceptions of the natural world, sometimes students have to experience conceptual change (Carey, 1984) and transform misconceptions into complete and accurate conceptions (NRC, 2001).

From a cognitive perspective, researchers have described the mechanics of conceptual change (Demastes, Good, & Peebles, 1996; Posner, Strike, Hewson, & Gertzog, 1982); developed theories about the nature of conceptual change (Chi, 1992; Chi, Slotta, & deLeeuw, 1994); and explored practical ways to foster conceptual change (Chinn & Malhotra, 2002; P. W. Hewson & Hewson, 1984). Our knowledge of the cognitive aspects of alternative conceptions and conceptual change has been greatly enriched over the past 20 years. However, the traditional approaches have been criticized as "cold cognition" because they could not adequately explain why some students changed their conceptions while others did not (Dreyfus, Jungwirth, & Eliovitch, 1990; Pintrich, 1999; Pintrich, Marx, & Boyle, 1993). Motivation researchers proposed that the conceptual change process is influenced by students' motivational beliefs, such as goal orientations, epistemological beliefs, interest or value, and efficacy (e.g., Pintrich, 1999; Pintrich, Marx, & Boyle, 1993). They suggested that conceptual change could be facilitated by students' acquisition of a task goal orientation, an incremental intelligence belief, high self-efficacy, and strong interest; in contrast, conceptual change could be prevented by an ego goal orientation and fixed intelligence beliefs. Each of these possible influences on conceptual change is discussed in more detail below.

Goal orientations are self-constructed theories about what it means to succeed. They help students make meaning of their own learning in context. Goal orientations are generally classified into two main categories: intrinsic (also called mastery or task-involved orientation) and extrinsic (also called performance or ego-involved orientation) (Dweck & Leggett, 1988).
Task goal orientation means that the goal of studying is to learn; ego-involved orientation can be further classified into ego approach orientation (the goal is to demonstrate ability relative to others) and ego avoidance orientation (the goal is to avoid appearing incompetent). Compared with those with an ego goal orientation, students with a task goal orientation tended to focus on learning, understanding, and mastering the tasks, and engaged in deeper processing strategies, which are helpful for conceptual change.

The intelligence belief concerns whether intelligence or ability is fixed or malleable. Schommer (1992) found that individuals who believe in intelligence as a malleable quality that can be developed (i.e., incremental intelligence) are more likely to be cognitively engaged in tasks. Dweck and Leggett (1988) suggested that intelligence theory can influence students' goal orientation, which, in turn, influences students' learning. In their model, a student with fixed intelligence beliefs is more likely to adopt a performance goal orientation. As discussed in the task goal orientation section, a performance goal orientation is considered to constrain conceptual change by limiting the level of deeper cognitive engagement.

Interest is students' general attitude toward or preference for the content or task (Eccles, 1983). Pintrich (1999) argued that interest helps facilitate the process of conceptual change because conceptual change demands that students maintain their cognitive engagement in order to accommodate new conflicting information.

Self-efficacy beliefs are individuals' beliefs about their performance capabilities in a particular domain (Bandura, 1986; Pajares, 1996). Self-efficacy is related to task choice (Bandura, 1997), effort taken, task persistence (Bandura, 1986; Bandura & Cervone, 1983), and cognitive engagement (Pintrich & Schrauben, 1992). Pintrich suggested that students who are confident in their ability to change their ideas and to learn to use proper cognitive strategies to integrate and synthesize divergent ideas are more likely than others to realize conceptual change.

Pintrich and his colleagues' greatest contribution is that they pointed out an important missing component in the previous cognitive conceptual change model, expanding researchers' horizons from a narrow cognitive view to a broader one that involves individuals' motivations. They suggested that motivational beliefs influence people's cognitive engagement and depth of information processing, which in turn determine whether conceptual change can occur. Moreover, Pintrich insisted that all motivational beliefs are amenable to change, even though some of them are robust or relatively stable (Pintrich, 1999; Pintrich et al., 1993). For instance, an educational context that encourages a task goal orientation will encourage students to develop a task goal orientation. Similarly, an ego goal-oriented context encourages an ego goal orientation among students.

Few studies have been conducted to examine empirically motivation theorists' claims about the influence of motivation on conceptual change. One possible approach to bringing about conceptual change that integrates cognition and motivation is formative assessment.

Formative Assessment

Building on earlier reviews, Black and Wiliam reviewed 580 articles (or chapters) from over 160 journals published in a nine-year period and concluded that formative assessment has a positive impact on students' learning, as well as on students' motivation (Black, Harrison, Lee, Marshall, & Wiliam, 2002; Black & Wiliam, 1998a, 1998b; Hattie & Timperley, 2007).
In summative assessment, scores are often related to students' rank compared to their peers, and performance is the most important concern. This environment produces several negative effects on motivation and achievement: (a) Students develop an ego-involving goal orientation (Schunk & Swartz, 1993); (b) students, especially low achievers, tend to attribute achievement to capability (Siero & Van Oudenhoven, 1995) and to regard ability as inborn and intelligence as fixed (Vispoel & Austin, 1995), which might discourage them from putting effort into future learning; (c) students are reluctant to seek help because they fear their questions will be regarded as evidence demonstrating low ability (Blumenfeld, 1992); and (d) students tend to engage in superficial information processing, for example, rote learning and concentrating on the recall of isolated details (Butler, 1987; Butler & Neuman, 1995).

In contrast, formative assessment possesses substantial advantages for motivation and achievement: (a) Because formative assessment emphasizes the learning process and closing the gap between students' current situation and the desired goal, students are more likely to form a task-involving goal orientation, which encourages students to process information more deeply than does the performance orientation (Schunk & Swartz, 1993). (b) Because formative assessment concentrates on improving students' learning, it is less likely to cause students to lose confidence; instead, students tend to believe in incremental intelligence, that is, that ability is just a collection of skills that people can master over time (Vispoel & Austin, 1995). Formative assessment shapes self-efficacy in a way that benefits students' learning and, consequently, students' achievement, especially that of lower achievers (Geisler-Brenstein & Schmeck, 1996). (c) The activities involved in formative assessment can improve students' interest in learning. (d) Students engaged in formative assessments increase their self-regulation, reasoning, and planning, which are all important for effective learning and conceptual change (Black & Wiliam, 1998a).

In sum, formative assessment is expected to encourage the motivational beliefs hypothesized to promote conceptual change, such as task goal orientation, incremental intelligence beliefs, self-efficacy, and interest. Meanwhile, formative assessment discourages the motivational beliefs hypothesized to prevent conceptual change, such as ego goal orientation and fixed intelligence beliefs (Figure 1). For convenience in the discussion that follows, the former motivational beliefs are called positive motivation and the latter are called negative motivation.

--------------------------------
Insert Figure 1 about Here
--------------------------------

Because the two models share the motivational beliefs, it seems plausible to connect the motivational model of conceptual change with formative assessment. We conjectured that students who participated in formative assessments would be more likely to change their motivational beliefs in a way that promotes conceptual change, to demonstrate improved achievement, and to realize conceptual change than those who did not participate in formative assessments.

As promising as it sounds, formative assessment places greater demands on teachers than regular teaching due to its uncertainty and flexibility (Bell & Cowie, 2001).
Black and Wiliam suggested that teachers "need to see examples of what doing better means in practice" (1998b, p. 146). To date, there is a paucity of systematic, detailed, and practical information that adequately addresses the challenge of how to design and implement formative assessments in teaching to help improve student learning and motivation.

Shavelson et al. (this issue) characterized formative assessment techniques as falling on a continuum based on the amount of planning involved and the formality of the technique used, benchmarking points along the continuum as (a) on-the-fly formative assessment, which occurs when teachable moments unexpectedly arise in the classroom; (b) planned-for-interaction formative assessment, which is used during instruction but prepared deliberately before class to align closely with instructional goals; and (c) formal, embedded-in-curriculum formative assessment, which is designed to be embedded in the curriculum after a unit to make sure that students have reached important goals before moving to the next lesson. The first two kinds of formative assessment require teachers to have rich teaching experience and/or strong spontaneous teaching skills. To assist teachers in implementing formative assessment, the formal kind, Embedded Formative Assessment, was used in this study's examination of the impact of assessment on student outcomes.

The study attempts to answer the following three questions: (a) Can Embedded Formative Assessment improve students' motivational beliefs, which were conjectured to influence conceptual change? (b) Can Embedded Formative Assessment improve students' achievement? (c) Can Embedded Formative Assessment improve conceptual change? Based on prior theory, we expected that formative assessment could improve all three outcomes.

Method

Design

To answer the three research questions, a small randomized experimental study was conducted. The study involved two groups of teachers, one that employed Embedded Formative Assessment while teaching a science unit and another that taught the same unit without formative assessments. Four steps were taken in the study: (a) Twelve teacher participants along with their students were randomly assigned to either the experimental group or the control group; (b) all of the students were given a pretest motivation survey and science achievement test (including conceptual change items); (c) both groups of teachers taught the same curriculum unit, but the teachers in the experimental group used formative assessments and were expected to use the information collected by formative assessment to help improve teaching and learning; and (d) all the students were given a posttest motivation survey and achievement test. By comparing the experimental and control students' motivation and achievement scores on the pretest and posttest, we examined whether the formative assessment treatment affected students' motivation, achievement, and conceptual change. Details about the participants and procedures are addressed by Shavelson et al. (this issue). Details about the curriculum and treatment are described by Ayala et al. (this issue). This paper focuses on the instruments and results of the study.¹

Instruments

A motivation questionnaire and achievement assessments were developed to examine the impact of Embedded Formative Assessment.

¹ Due to space limitations, the discussion of instruments and results in this paper is rather brief; for more details about the instruments and results, please refer to Yin (2005).
Motivation Questionnaire

Motivation questionnaire development. The motivation questionnaire measures the constructs that are addressed in both the conceptual change literature and the formative assessment literature (Black et al., 2002; Pintrich, 1999; Pintrich et al., 1993). Among them, four motivational beliefs promote conceptual change: (a) task goal orientation (e.g., "An important reason I do my science work is because I like to learn new things"), (b) perceived task-goal orientation context (e.g., "Our teacher really thinks it is very important to try hard"), (c) self-efficacy in science (e.g., "I am certain I can figure out how to do difficult work in science"), and (d) interest in science (e.g., "I find science interesting"). The other four motivational beliefs prevent conceptual change: (a) ego approach orientation (e.g., "I want to do better than other students in my science class"), (b) ego avoidance orientation (e.g., "It is very important to me that I do not look stupid in my science class"), (c) perceived performance-goal orientation context (e.g., "Our teacher lets us know which students get the highest scores on tests"), and (d) inborn intelligence epistemic beliefs (e.g., "I can't change how smart I am in science").

Most motivation items were drawn from Lau and Roeser's (2002) study. A five-point Likert-type scale from 1 (strongly disagree) to 5 (strongly agree) was used in the questionnaire to measure the degree to which students agreed or disagreed with each statement. The same motivation questionnaire was given to the control and experimental groups at pretest and posttest.

Technical quality of the motivation questionnaire. Internal consistency analysis, exploratory factor analysis, and confirmatory factor analysis were conducted to validate the motivation constructs at pretest and posttest and to diagnose ill-matched items. Coefficient alpha was calculated for each final scale. Exploratory factor analysis examined whether the motivation items measured the constructs that they were designed to measure and identified ill-matched items. Principal axis factoring extraction and promax rotation with Kaiser normalization were applied, assuming the motivation constructs to be correlated with each other. Factors with eigenvalues greater than 1 were extracted. The factor loadings of each item in the pattern matrix were examined to indicate the fit of items with their corresponding construct. Finally, confirmatory factor analysis was conducted to further validate the motivation constructs, and modification indexes (MIs) were examined to capture misfit items in the model. Details of these analyses are available from the first author.
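For readers who wish to reproduce this kind of scale analysis, the sketch below shows one way to compute coefficient alpha and run a principal-axis-style, promax-rotated exploratory factor analysis in Python. It is a minimal illustration under stated assumptions, not the study's original analysis (which was presumably run in a standard statistics package); the simulated Likert data and all variable names are hypothetical, and the factor_analyzer package is assumed to be installed.

```python
import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer  # pip install factor-analyzer

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Coefficient alpha for a scale whose items are the data frame's columns."""
    k = items.shape[1]
    item_var_sum = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

# Hypothetical (students x items) 1-5 Likert responses: four correlated items
# per construct, built from a shared latent trait plus noise.
rng = np.random.default_rng(0)
n_students, n_constructs, items_per = 300, 8, 4
latent = rng.normal(size=(n_students, n_constructs))
cols, data = [], []
for c in range(n_constructs):
    for i in range(items_per):
        raw = latent[:, c] + rng.normal(scale=0.8, size=n_students)
        data.append(np.clip(np.round(raw + 3), 1, 5))
        cols.append(f"c{c}_item{i}")
items = pd.DataFrame(np.column_stack(data), columns=cols)

# Alpha for one subscale (here, the first construct's four items).
print("alpha:", round(cronbach_alpha(items[[f"c0_item{i}" for i in range(4)]]), 2))

# EFA with oblique promax rotation, allowing the factors to correlate,
# as in the study; inspect each item's primary loading in the pattern matrix.
efa = FactorAnalyzer(n_factors=n_constructs, method="principal", rotation="promax")
efa.fit(items)
print(pd.DataFrame(efa.loadings_, index=items.columns).round(2))
```

In an oblique solution like this, an "ill-matched" item shows up as a low loading on its intended factor or a sizable cross-loading elsewhere, which mirrors how the pattern matrix was used here.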
Ten composite scores corresponding to the eight constructs were used for further data analyses. Most motivation constructs were significantly correlated with each other (Table 1). All the positive motivation constructs were positively correlated with each other. So were all the negative motivation constructs. Except for the correlation between ego-approach and self-efficacy, most positive motivation constructs were negatively correlated with negative motivation constructs, whenever the correlation was significant. The correlation pattern among the motivation constructs provided evidence for discriminant validity of the motivation measurement. Achievement Assessments In this section, the multiple-choice test and open-ended assessments are described and their technical qualities examined. Achievement assessment development. Assessment development was based on two concerns: (a) to adequately cover the curriculum’s instructional objectives and (b) to comprehensively represent different types of knowledge--in particular, declarative knowledge (knowing that), procedural knowledge (knowing how), and schematic knowledge (knowing why)--the knowledge-reasoning-type framework set forth in Shavelson et al. (this issue). Four assessments were developed: a multiple-choice test, a performance assessment, a short-answer assessment, and a predict-observe-explain assessment. A description of predict-observe-explain assessment can be found in Ayala et al. (this issue). 17 On the Measurement and Impact Thirty-eight items were included in the multiple-choice pretest, covering the important instructional objectives across three types of knowledge. Multiple-choice test item examples can be found in Shavelson et al. (this issue). In addition to the 38 items on the pretest, five more items mainly focusing on common misconceptions about sinking and floating were included on the posttest. Although we tried not to include any identical achievement assessment items in the formative assessments, some of the content or style of the formative assessments for the experimental group might be unavoidably similar to that of the items or tests in the achievement assessments. To reduce the possible bias, we selected 14 items from external sources, such as the National Assessment of Educational Progress and Trends of International Mathematics and Science Study, to tap the spread of the treatment effect. Because the correlation between the external items and internal items was high, r = .70, p < .01, the external items were analyzed with the internal items together in this paper. The performance assessment was designed to assess students’ procedural and schematic knowledge. In the first part of the performance assessment, given the mass information of a rectangular block and some equipment, students were asked to find the density of the block. To solve the problem, students needed to (a) use either a ruler or an overflow container to measure the volume of a solid block and (b) apply the density formula to calculate the density of the block. In the second part of the performance assessment, given three blocks with density information and a bottle of mystery liquid with unknown density, students were asked to find the density of the mystery liquid. To 18 On the Measurement and Impact solve this problem, students needed to conduct an experiment, observe the sinking and floating behavior of three blocks in mystery liquid, and infer density information of the mystery liquid. 
The short-answer assessment was designed to measure students’ declarative and schematic knowledge. It asked students to explain “why do things sink or float” and to provide evidence and an example to support their argument. The same short answer question was repeatedly posed in the Embedded Formative Assessment for the experimental group, bringing advantages as well as disadvantages to the study. The advantage is that because the same question was asked at different times in the experimental group, the experimental students’ conception developmental growth could be tracked until the end of the instruction. Two disadvantages were possible: (a) the experimental group had been unavoidably sensitized to this question and may have benefited from their teachers’ feedback and (b) the experimental group might have become bored with answering the same question three times, losing their motivation to provide their best answers on the posttest. Therefore, caution should be taken when interpreting the difference between experimental and control groups in their responses to this question. The predict-observe-explain assessment was also designed to assess students’ schematic knowledge. In the predict-observe-explain assessment, a test administrator demonstrated that a bar of soap sunk in water. Then the test administrator cut the soap into two unequal pieces—1/4 and 3/4—and asked students to predict what would happen 19 On the Measurement and Impact when the two pieces were put in water and why. After students turned in their predictions and reasons, the test administrator put the two pieces of soap in water, and asked students to record their observation and to reconcile their predictions. The soap predict-observeexplain assessment was intended to examine whether students understood two main points: (a) density is a property of a material and will not change with the size of the object, and (b) an object sinks or floats depending on its density (relative to the medium’s density) instead of its volume and mass. Multiple-choice test was administered at both the pretest and posttest. Performance, short-answer, and predict-observe-explain assessments were administered at posttest only for the following reasons: (a) The performance assessment was heavily content loaded. The pilot study showed that students tended to use trial and error to solve the performance assessment problem before they had learned the relevant curriculum content. Therefore the performance assessment could not provide valuable information at pretest. (b) The short-answer assessment asked a succinct and straightforward question, making it easy to memorize. We worried that it might sensitize control-group teachers and lead them to prepare students to answer it, which might unnecessarily boost controlgroup students’ performance at posttest. And (c) the predict-observe-explain assessment was so memorable that once students knew what happened to the two pieces of soap, they might not forget the result. Research team members administered the instruments in the classrooms of each teacher immediately before and after the target unit. The motivation questionnaire was 20 On the Measurement and Impact given before the achievement tests, so that students’ report of their motivation would not be influenced by their performance on the achievement tests. On the posttest, the achievement tests were given in the following order: multiple-choice test, performance assessment, short-answer assessment, and predict-observe-explain assessment (Yin, 2005). 
The predict-observe-explain assessment was given last because of its instructional effect.

Technical quality of the achievement assessments. The quality of the multiple-choice test items was examined as to: (a) item difficulty, (b) instructional sensitivity, (c) factorial structure, and (d) internal consistency.

With respect to difficulty, most items fell appropriately in the moderate range (p-values of .40 to .80 at posttest). Some items were found to be too difficult for the students, and these were the items that assessed whether students still held misconceptions. The proportions of students who correctly answered these questions were close to or even lower than what would be expected by guessing at random (.25). The results suggested that most students still held alternative conceptions after instruction.

Instructional sensitivity (D) was used to examine whether the multiple-choice test items discriminated between examinees who had and those who had not been taught. It is defined as the difference between an item's proportion of students responding correctly (p-value) after instruction (posttest) and its p-value before instruction (pretest) (Crocker & Algina, 1986). A high positive D value indicates that the item discriminates well between examinees who have received instruction and those who have not. All the items had positive instructional sensitivity Ds. That is, for all the items given at pretest and posttest, more students chose correct answers at posttest than at pretest. This result provided evidence for the content validity of the multiple-choice test items; that is, the items were linked to the curriculum content and would be expected to reflect the effect of instruction.
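To make these item statistics concrete, the short sketch below computes p-values and the instructional sensitivity index D = p(posttest) - p(pretest) from scored item matrices. The arrays are hypothetical stand-ins for the study's data: rows are students, columns are items, and entries are 1 for correct and 0 for incorrect.

```python
import numpy as np

# Hypothetical scored responses for the same items at pretest and posttest.
rng = np.random.default_rng(0)
pretest = rng.binomial(1, 0.35, size=(120, 38))
posttest = rng.binomial(1, 0.60, size=(120, 38))

p_pre = pretest.mean(axis=0)    # item difficulty (p-value) before instruction
p_post = posttest.mean(axis=0)  # item difficulty after instruction

# Instructional sensitivity: positive D means more students answered the item
# correctly after instruction than before (Crocker & Algina, 1986).
D = p_post - p_pre
print("items with positive D:", int((D > 0).sum()), "of", D.size)
print("items near or below chance (.25) at posttest:", int((p_post <= 0.27).sum()))
```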
Exploratory factor analysis was used to examine whether the multiple-choice test items measured the constructs they were designed to measure, namely content knowledge and knowledge types. When items in a test can be classified into different constructs along only one dimension, items measuring the same construct should share similar loading patterns in exploratory factor analysis. However, the multiple-choice test items were designed to tap two dimensions: the curriculum instructional objectives and the knowledge types. If the former was dominant, items measuring the same instructional objective should have similar loading patterns in the exploratory factor analysis. If the latter was dominant, items measuring the same knowledge type should have similar loading patterns. If neither classification was dominant, connections might not be clear between items' loading patterns and the original classifications, because both classifications, or neither of them, might influence items' loading patterns. The results indicated that items measuring the same content knowledge clustered together, while items measuring the same type of knowledge were scattered across different factors, regardless of whether all 13 factors with eigenvalues above 1 were extracted or the analysis was constrained to three factors reflecting knowledge types.

Internal consistency reliability was then calculated for the total score to examine whether the items could reliably assess students' achievement. The internal consistency of the 43 posttest multiple-choice test items was .86, which indicated that the items in general reliably measured students' achievement. For conciseness, the total score of the multiple-choice test items, rather than any subscale, is used in the following data analysis.

Because the three open-ended assessments (performance, predict-observe-explain, and short-answer) shared many features, the three measures are examined together in this section. To sufficiently capture the information about students' achievement reflected in their responses, analytical scoring systems were carefully developed for each assessment. The scoring systems were aligned with the conceptual developmental trajectory (Ayala et al., this issue) and the important instructional objectives. Using scoring system drafts, three researchers scored samples of students' responses to examine whether the preliminary scoring systems could adequately capture the variation among students' responses, differentiate high understanding levels from low understanding levels, and assist scorers in reaching satisfactory inter-rater reliability.

Doctoral students at the Stanford Educational Assessment Laboratory scored the three open-ended assessments: the performance assessment (six raters), the short-answer assessment (six raters), and the predict-observe-explain assessment (four raters). Before independent scoring, all raters received scoring training until they reached satisfactory inter-rater reliability (above .80) or agreement (above 80%). In the scoring training process, when raters encountered disagreements, they reached consensus through discussion. Based on the consensus, the scoring systems were further improved and specific rules were clarified. This process continued until the scoring system could effectively collect information about students' performance and raters could reliably score students' responses. The scoring systems were finalized prior to independent scoring, and raters had reached satisfactory inter-rater reliability or agreement before independently scoring student work.

Performance assessments were scored with numerical values assigned by six raters; therefore, generalizability theory and the G coefficient were used to estimate reliability. Based on 48 randomly selected student responses on the performance assessment, the average G coefficient for the six raters (three in each group) was .83. Short-answer and predict-observe-explain assessments were scored with categorical values; therefore, percent agreement was used to evaluate inter-rater agreement. Based on 48 randomly selected student responses on the short-answer assessment, the average agreement of nine raters (two, three, and four raters in each group) was 87%. Based on 56 randomly selected student responses on the predict-observe-explain assessment, the four raters' agreement was 92%.
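As an illustration of these rater analyses, the sketch below estimates a relative G coefficient for a fully crossed persons-by-raters design from a matrix of ratings, along with simple percent agreement for categorically scored responses. This is a generic implementation of the standard p x r variance-component formulas, offered under the assumption that the study used a comparable crossed design; the ratings data are hypothetical.

```python
import numpy as np

def g_coefficient(scores: np.ndarray) -> float:
    """Relative G coefficient for a crossed persons x raters (p x r) design,
    for the mean score over the n_r raters actually used."""
    n_p, n_r = scores.shape
    grand = scores.mean()
    ss_p = n_r * ((scores.mean(axis=1) - grand) ** 2).sum()   # persons
    ss_r = n_p * ((scores.mean(axis=0) - grand) ** 2).sum()   # raters
    ss_pr = ((scores - grand) ** 2).sum() - ss_p - ss_r       # interaction/error
    ms_p = ss_p / (n_p - 1)
    ms_pr = ss_pr / ((n_p - 1) * (n_r - 1))
    var_p = max((ms_p - ms_pr) / n_r, 0.0)  # person (universe-score) variance
    return var_p / (var_p + ms_pr / n_r)

def percent_agreement(a: np.ndarray, b: np.ndarray) -> float:
    """Exact agreement between two raters' categorical scores."""
    return float((a == b).mean())

# Hypothetical ratings: 48 student responses scored by 3 raters on a 0-4 scale.
rng = np.random.default_rng(1)
true_scores = rng.integers(0, 5, size=(48, 1))
ratings = np.clip(true_scores + rng.integers(-1, 2, size=(48, 3)), 0, 4)
print("G coefficient:", round(g_coefficient(ratings.astype(float)), 2))
print("agreement (raters 1 vs 2):",
      round(percent_agreement(ratings[:, 0], ratings[:, 1]), 2))
```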
Internal consistency and discriminant validity. After all the assessments were scored, the internal consistency of each assessment and the correlations among the assessments were calculated. Because the four assessments measure students' knowledge about the same topic but with different emphases, high internal consistency within each assessment and moderately high correlations among assessments were expected. Table 2 shows the high internal consistency that empirically confirmed this expectation.

-------------------------------
Insert Table 2 about here
-------------------------------

Conceptual change score and achievement total score. Five multiple-choice test items, seven predict-observe-explain assessment items, one performance assessment item, and one short-answer assessment item focused on conceptual change. The internal consistency of the 14 items was .79. The composite score across these items was interpreted as a conceptual change score, with a maximum score of 14. In addition to scores on the individual tests, an overall achievement total score was calculated. It combined scores from all four achievement assessments; it comprised 72 items, had an internal consistency reliability of .91, and had a maximum total score of 86. This achievement total score was also used in the overall analyses reported below.

Data Analysis Plan

Because the teachers were in different states or in different school districts within a state, teaching contexts, such as the number and size of classes a teacher taught, varied dramatically across teachers. Moreover, students' achievement and motivation subscale scores varied greatly across classes and teachers at pretest, even though pairs of teachers were matched on school-level variables and randomly assigned to the treatment or control group (Shavelson et al., this issue). To reduce these differences, we selected one class from each teacher, based on students' pretest motivation and achievement scores, for intensive study, such that the pairs of classes selected were as close as possible on mean pretest scores. The twelve classes selected were called focus classes. In this report, only the focus classes were included in examining the treatment effect, for three reasons: (a) The focus classes were the most closely matched classes of teachers in the control and experimental groups; (b) the focus classes were the classes from which we collected the most elaborate implementation information; and (c) the answers to the research questions based on the focus classes were similar to those based on all the students (Yin, 2005).

To check the mean equivalence of pretest scores, mean score differences between each experimental-control teacher pair's focus classes were compared statistically (Appendix A). Significant differences existed for three teacher pairs on motivation scores. Experimental group teacher Aden's students had significantly higher mean scores than control group teacher Rachel's students on two positive motivation scales, perceived task goal orientation and self-efficacy; Aden's students also had a significantly lower mean score than Rachel's students on a negative motivation scale, perceived performance goal orientation. Experimental group teacher Carol's students scored significantly lower, on average, than control group teacher Sofia's students on three negative motivation scales: ego approach orientation, ego avoidance orientation, and perceived performance goal orientation. Experimental group teacher Diana's students had significantly higher mean scores than control group teacher Ben's students on all the positive motivation scales.

As for achievement, students of two control group teachers, Lenny and Ben, scored significantly higher on the multiple-choice test, on average, than did the students of their respective experimental group counterparts, Andy and Diana. Overall, the experimental and control groups were not completely equivalent at the beginning of the study: the experimental group started with better average motivation scores but worse achievement scores than the control group at pretest. While not a complete remedy, covariate adjustments were made in the subsequent analyses.

The data were first analyzed descriptively and statistically for the teacher pairs, which were matched roughly on their school-level characteristics.
The overall differences between the control and experimental groups on the outcome measures were then compared and discussed. Finally, because students were nested within teachers, and teachers were nested within treatment groups, a two-level hierarchical structure existed in the data. To deal with this data structure, two-level hierarchical linear modeling (HLM) was applied to examine the formative assessment treatment's influence statistically. A limitation of using HLM here is that the analysis lacks statistical power due to the small sample size of only six teachers in each group.²

² The intent of the study was exploratory, and we and the National Science Foundation agreed on the small sample size for cost reasons and as a first step in providing an "existence proof" that might lead to future funding.

Results

The analyses presented in this section address three research questions: (1) Did formative assessment improve students' motivational beliefs? (2) Did formative assessment improve students' science achievement? And (3) did formative assessment bring about conceptual change? Descriptive analyses of students' motivation, achievement, and conceptual change scores at pretest and posttest are first presented for individual teachers in the two groups. Following that, the effects of formative assessment on students' motivation, achievement, and conceptual change are reported.

Initial Analysis of Pretest and Posttest Scores

In this section, the experimental-control group focus class pairs are compared, and then the overall experimental-control group differences are analyzed.

Analysis of Experimental-Control Group Pairs

In contrast to the pretest, overall the experimental group did not have significantly better average motivation scores than the control group at posttest. Instead, two experimental classes had better motivation scores than their control pairs, and so did two control classes. Control group teacher Lenny's students had higher scores than experimental group teacher Andy's students on three positive motivation scales: task goal orientation, perceived task goal orientation, and interest. Control group teacher Sofia's students also scored significantly higher than experimental group teacher Carol's students on three positive motivation scales: task goal orientation, perceived task goal orientation, and interest. Similar to the pretest, experimental group teacher Diana's students still had significantly higher scores than control group teacher Ben's students on all the positive motivation scales. Moreover, Diana's students had lower scores than Ben's students on a negative motivation scale, perceived performance goal orientation. Finally, experimental group teacher Robert's students scored significantly higher than control group teacher Serena's students on two positive motivation scales, perceived task goal orientation and interest.

As for achievement and conceptual change, two control group teachers' students outperformed their counterparts in the experimental group (Appendix B). Similar to the pretest, control group teacher Lenny's students, on average, still outperformed experimental group teacher Andy's students on achievement and conceptual change scores at posttest, with the exception of the short-answer question (Appendix B).
Moreover, control group teacher Rachel's students outperformed experimental group teacher Aden's students on the achievement tests, except for the performance assessment and the conceptual change items (Appendix B), although the two classes had not differed significantly on the pretest multiple-choice test (Appendix A). The only "winning" experimental-group class among all the pairs was Diana's: control group teacher Ben's students did not significantly outperform experimental group teacher Diana's students as they had at pretest.

Experimental-Control Group Comparisons

Due to the nested data structure, it is not statistically legitimate to combine all the students in each treatment group for inferential analysis. However, descriptive analyses based on the combined groups can provide a preliminary overall picture of the two-group comparison and the treatment effect at posttest. Table 3 indicates that, overall, the experimental group scored higher than the control group on most positive motivation scales but scored lower on the achievement tests. This result is consistent with the preceding individual teacher pair comparisons.

-------------------------------
Insert Table 3 about here
-------------------------------

Table 4 shows that the experimental group still scored higher than the control group on the positive motivation scales at posttest, but the only significant difference was the mean difference in perceived task goal orientation. As for achievement, overall the control group still outperformed the experimental group, scoring significantly higher on the multiple-choice test, the performance assessment, and the total achievement score. Based on the comparison of Table 3 and Table 4, the treatment seemed to have no influence on student motivation, achievement, or conceptual change at posttest.

-------------------------------
Insert Table 4 about here
-------------------------------

However, the experimental group students had significantly lower variance than the control group students on the predict-observe-explain assessment (F = 4.09, p < .05) and the conceptual change score (F = 3.88, p = .05); that is, the achievement gap between higher achievers and lower achievers in the experimental group was not as big as that in the control group. This is consistent with the literature suggesting that formative assessments are particularly useful for low achievers (Black & Wiliam, 1998a).
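A variance comparison of this kind can be reproduced with a two-sample variance-ratio F test, as in the sketch below. This is a generic implementation under the usual normality assumption; the text does not specify the exact test the authors used, and the score vectors here are hypothetical.

```python
import numpy as np
from scipy import stats

def variance_ratio_test(a, b):
    """Two-sided F test comparing the variances of two independent samples."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    va, vb = a.var(ddof=1), b.var(ddof=1)
    if va < vb:  # put the larger variance in the numerator
        (a, va), (b, vb) = (b, vb), (a, va)
    F = va / vb
    p = 2 * stats.f.sf(F, a.size - 1, b.size - 1)
    return F, min(p, 1.0)

# Hypothetical group score vectors on one assessment: same mean, but the
# experimental group's scores are less spread out than the control group's.
rng = np.random.default_rng(2)
control = rng.normal(10, 4, size=150)
experimental = rng.normal(10, 2, size=150)
F, p = variance_ratio_test(control, experimental)
print(f"F = {F:.2f}, p = {p:.3f}")  # a large F flags unequal variances
```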
Hierarchical Linear Modeling

Variables Used in the HLM

We were interested in the influence of formative assessment on students' motivation, achievement, and conceptual change; therefore, posttest motivation, achievement, and conceptual change scores served as the dependent variables in the models. Specifically, these dependent variables included the posttest motivation subscale scores, all posttest achievement scores, the composite achievement score, and the composite conceptual change score. Independent variables, or predictors, included the pretest score corresponding to each posttest score and the treatment group. The pretest scores of individual students, including pretest motivation scores and the pretest multiple-choice test score, were the predictors at the first level. Treatment group was the predictor at the second level. Due to the heterogeneous school contexts and teacher backgrounds, there could be many possible teacher- and school-level predictors at level-2, such as school size, ethnic composition, percent of students receiving free or reduced-price lunch, students' grade level, teachers' teaching experience, highest educational degree, science background, teaching load, and the length of instruction. However, due to the constraint of the small sample size (there were only 12 teachers at level-2), only the treatment group, the focus of this study, was introduced into the model as the predictor at level-2.

Model Construction

Three steps were taken to analyze each of the outcome variables.

Step 1: Unconditional model. An unconditional model was first applied to examine the variation in each outcome variable at the two levels--student and teacher. A one-way ANOVA model was applied for each outcome variable at level-1. The unconditional model was specified by equations 1 and 2:

Level-1: $Y_{ij} = \beta_{0j} + r_{ij}$, with $r_{ij} \sim N(0, \sigma^2)$  [1]

Level-2: $\beta_{0j} = \gamma_{00} + \mu_{0j}$, with $\mu_{0j} \sim N(0, \tau_{00})$  [2]

Yij was the outcome variable, such as a motivation subscale score or an achievement score. β0j was the intercept, in this case the mean score for teacher j's students. rij was the random error at the student level, with variance σ²; the variance due to rij represented the score variation at level-1, in this case within teachers. β0j was a function of the grand mean, γ00, plus a random error, μ0j, with variance τ00. Significant variance τ00 due to μ0j indicated that teacher mean scores varied at level-2.
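A minimal sketch of this unconditional model in Python, using statsmodels' MixedLM with a random intercept for teacher: it fits equations [1] and [2] and reports the two variance components along with the share of variance lying between teachers. The study itself used HLM software, and the simulated data frame here (columns post and teacher) is hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: 12 teachers, 25 students each, with teacher-level
# differences (tau00) and student-level noise (sigma^2).
rng = np.random.default_rng(1)
teacher = np.repeat(np.arange(12), 25)
teacher_effect = rng.normal(scale=2.0, size=12)[teacher]
df = pd.DataFrame({"teacher": teacher,
                   "post": 25 + teacher_effect + rng.normal(scale=5.0, size=300)})

# Unconditional (random-intercept) model: post_ij = gamma00 + mu0j + r_ij
m0 = smf.mixedlm("post ~ 1", data=df, groups=df["teacher"]).fit()

tau00 = float(m0.cov_re.iloc[0, 0])  # between-teacher variance (tau00)
sigma2 = m0.scale                    # within-teacher residual variance (sigma^2)
print(f"tau00 = {tau00:.2f}, sigma^2 = {sigma2:.2f}, "
      f"ICC = {tau00 / (tau00 + sigma2):.2f}")
```

The intraclass correlation (ICC) printed at the end summarizes what the unconditional model is for: how much of the total outcome variance lies between teachers and is therefore potentially explainable by a level-2 predictor such as treatment.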
Step 2: Conditional model with covariate(s) at level-1. According to Raudenbush and Bryk (2002), statistical adjustments for individuals' backgrounds are important because (a) "persons are not usually assigned at random to organizations, failure to control for background may bias the estimates of organization effect" and (b) "if level-1 predictors (or covariates) are strongly related to the outcome of interest, controlling for them will increase the precision of any estimates of organization effects and the power of hypothesis tests by reducing unexplained level-1 error variance" (p. 111). In this study, students had rather different starting motivation and achievement scores; therefore, the pretest scores were added to the model as covariates at level-1 to account for the outcome variable variance within teachers. Only the corresponding pretest score was introduced into the model at level-1. For example, when the posttest motivation subscale interest was the outcome variable, only pretest interest scores were introduced as the predictor at level-1. When a posttest achievement score or conceptual change score was the outcome variable, only the pretest multiple-choice test score was introduced as the predictor at level-1. There were two reasons for including only one covariate at level-1: First, the corresponding pretest variable was the most relevant covariate of the corresponding posttest score. Second, preliminary analyses indicated that, after a pretest motivation score was included in a model to explain the corresponding posttest score, an additional pretest motivation score was usually not a significant predictor of the posttest score, because the motivation constructs were correlated with each other. Since the pretest scores were continuous variables, they were grand-mean centered (Raudenbush & Bryk, 2002). The model was specified as follows:

Level-1: $Y_{ij} = \beta_{0j} + \beta_{1j}(X_{ij} - \bar{X}_{..}) + r_{ij}$, with $r_{ij} \sim N(0, \sigma^2)$  [3]

Level-2: $\beta_{0j} = \gamma_{00} + \mu_{0j}$, with $\mu_{0j} \sim N(0, \tau_{00})$  [4]

$\beta_{1j} = \gamma_{10} + \mu_{1j}$, with $\mu_{1j} \sim N(0, \tau_{11})$  [5]

In this model, Xij was a pretest score serving as the predictor for the corresponding posttest score, centered at its grand mean. The intercept, β0j, was the expected outcome for a student whose value on Xij was equal to the grand mean. The slope, β1j, represented the regression parameter relating students' pretest and posttest scores. rij was the random error after controlling for students' pretest scores. β0j and β1j might vary across teachers in the level-2 model as a function of a grand mean and a random error. Since X was grand-mean centered, γ00 was the adjusted mean of the teachers on students' posttest scores, and γ10 was the adjusted mean pretest-posttest regression slope across those teachers. μ0j was the unique increment to the intercept associated with teacher j; significant variance associated with μ0j indicated that the mean posttest score varied across teachers. μ1j was the unique increment to the slope associated with teacher j; significant variance associated with μ1j would have indicated that the pretest-posttest regression slope varied across teachers. Analysis showed that the variance of μ1j was not significantly different from 0. That is, the regression coefficients between the outcome variable and the predictor variable within teachers at level-1 did not vary across teachers at level-2 ("homogeneity of regression slopes"). Therefore, μ1j was eliminated from the final model, and equation [5] was simplified into equation [6]:

$\beta_{1j} = \gamma_{10}$  [6]

Step 3: Model with intercepts as outcomes. One of the important purposes of using HLM is to examine cross-level effects. Specifically, in this study, we examined whether the regression coefficients (the regression slopes of posttest scores on covariates) at the student level varied across teachers. As noted above, analyses indicated that the level-1 regression coefficients did not vary significantly across teachers (homogeneity of regression slopes); therefore, the regression coefficients at level-1 were fixed at level-2, and only the intercept at level-1 was an outcome variable at level-2. As mentioned earlier, treatment group was the only predictor at level-2. Analyses showed that τ00 in equation [4] differed significantly from 0; that is, the level-1 intercept varied significantly across teachers at level-2. The level-2 predictor, treatment group, was then included in the model to account for the intercept variance at level-2. The model was specified in equation [7]:

$\beta_{0j} = \gamma_{00} + \gamma_{01}W_{1j} + \mu_{0j}$  [7]

W1j was the treatment variable at level-2, a dummy variable coded 0 for the control group and 1 for the experimental group; as a dummy variable, it was added to the model without being centered (Raudenbush & Bryk, 2002).
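Substituting equations [6] and [7] into [3] gives a single combined model, $Y_{ij} = \gamma_{00} + \gamma_{10}(X_{ij} - \bar{X}_{..}) + \gamma_{01}W_{1j} + \mu_{0j} + r_{ij}$. A hedged sketch of how this final model could be fit in Python with statsmodels is shown below; the study itself used HLM software, and the simulated data (with a deliberately null treatment effect, mirroring the study's finding) are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data for the final model: 6 experimental and 6 control teachers,
# 25 students each, a pretest covariate, and a null treatment effect.
rng = np.random.default_rng(2)
teacher = np.repeat(np.arange(12), 25)
treatment = (teacher < 6).astype(int)        # level-2 dummy: 1 = experimental
pre = rng.normal(20, 5, size=300)
post = (5 + 0.8 * pre + rng.normal(scale=2.0, size=12)[teacher]
        + 0.0 * treatment + rng.normal(scale=4.0, size=300))
df = pd.DataFrame({"teacher": teacher, "treatment": treatment,
                   "pre": pre, "post": post})
df["pre_c"] = df["pre"] - df["pre"].mean()   # grand-mean centering

# Combined model: post_ij = g00 + g10*pre_c_ij + g01*W_j + mu0j + r_ij.
# The pretest slope is fixed across teachers (equation [6]); only the
# intercept carries a teacher-level random effect.
m1 = smf.mixedlm("post ~ pre_c + treatment", data=df, groups=df["teacher"]).fit()
print(m1.summary())

# gamma01 (the 'treatment' coefficient) estimates the adjusted experimental-
# control difference in teacher means after controlling for the pretest.
print("gamma01:", round(m1.params["treatment"], 3),
      " p =", round(m1.pvalues["treatment"], 3))
```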
HLM Results

Tables 4 and 5 present the changes in the variance components for motivation scores and achievement scores, respectively, when the different models were implemented. Most motivational and achievement outcomes varied significantly across teachers (at level-2): all motivation scores except ego avoidance, self-efficacy, and fixed ability (Table 4), and all achievement scores except conceptual change (Table 5).

-------------------------------
Insert Table 4 about here
-------------------------------

-------------------------------
Insert Table 5 about here
-------------------------------

Tables 6 and 7 present the results when pretest scores were included as covariates at level-1 and treatment was included as the predictor at level-2. The results reported in these tables indicate that pretest scores were significant predictors of the corresponding posttest scores at level-1; however, treatment was not a significant predictor of the variance across teachers at level-2. For the achievement indicators, the negative level-2 coefficient for treatment indicates a very slight, non-significant adjusted mean advantage for the control group. An important reason for the non-significant group effect might be the small sample size.

-------------------------------
Insert Table 6 about here
-------------------------------

-------------------------------
Insert Table 7 about here
-------------------------------

In sum, based on the HLM analyses, students' motivation, achievement, and conceptual change scores varied greatly among teachers. However, the formative-assessment treatment did not explain the variation among teachers. Experimental group students did not seem to benefit, on average, from the formative assessments they received.

Conclusion

In this study, one questionnaire and four types of achievement assessments were used to measure students' motivational beliefs, achievement, and conceptual change. Empirical evidence was provided in support of the reliability and validity of the scores from these instruments. Both the descriptive statistics and the HLM analyses show that the assessments embedded in the FAST curriculum used by the experimental group did not have a significant influence, on average, on students' motivation, achievement, or conceptual change compared with students in the control group that used FAST without embedded assessments. This said, regardless of treatment group, teachers varied greatly on the motivation and achievement outcomes they produced in their students. This finding did not support our conjectures about the salutary effect of formative assessment on student outcomes or the findings reported in literature reviews. More elaborate analyses showed that teachers' educational background, teaching experience, teaching load, age, and some school characteristics (such as school size and percent of students receiving free or reduced-price lunch) were not significant predictors of the differences in students' motivation and achievement either (Yin, 2005). Why, then, did the Embedded Formative Assessments not work as expected? What might account for the differences in outcomes among students taught by different teachers? Furtak et al. (this issue) suggest answers to these questions.

References

Abimbola, I. O. (1988). The problem of terminology in the study of student conceptions in science. Science Education, 72(2), 175-184.

Bandura, A. (1986). Social foundations of thought and action: A social cognitive theory. Englewood Cliffs, NJ: Prentice Hall.

Bandura, A. (1997). Self-efficacy: The exercise of control. New York: Freeman.

Bandura, A., & Cervone, D. (1983). Self-evaluative and self-efficacy mechanisms governing the motivational effects of goal systems. Journal of Personality and Social Psychology, 45(5), 1017-1028.
Black, P., Harrison, C., Lee, C., Marshall, B., & Wiliam, D. (2002). Testing, motivation and learning. Cambridge, England: University of Cambridge Faculty of Education, Assessment Reform Group.

Black, P., & Wiliam, D. (1998a). Assessment and classroom learning. Assessment in Education, 5, 7-68.

Black, P., & Wiliam, D. (1998b). Inside the black box: Raising standards through classroom assessment. Phi Delta Kappan, 80(2), 139-148.

Blumenfeld, P. C. (1992). Classroom learning and motivation: Clarifying and expanding goal theory. Journal of Educational Psychology, 84(3), 272-281.

Butler, R. (1987). Task-involving and ego-involving properties of evaluation. Journal of Educational Psychology, 79, 474-482.

Butler, R., & Neuman, O. (1995). Effects of task and ego achievement goals on help-seeking behaviors and attitudes. Journal of Educational Psychology, 87, 261-271.

Carey, S. (1984). Conceptual change in childhood. Cambridge, MA: MIT Press.

Chi, M. T. H. (1992). Conceptual change within and across ontological categories: Examples from learning and discovery in science. In R. N. Giere (Ed.), Cognitive models of science: Minnesota studies in the philosophy of science (pp. 129-160). Minneapolis, MN: University of Minnesota Press.

Chi, M. T. H., Slotta, J. D., & deLeeuw, N. (1994). From things to processes: A theory of conceptual change for learning science concepts. Learning and Instruction, 4, 27-43.

Chinn, C. A., & Malhotra, B. A. (2002). Children's responses to anomalous scientific data: How is conceptual change impeded? Journal of Educational Psychology, 94(2), 327-343.

Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Philadelphia, PA: Harcourt Brace Jovanovich College Publishers.

Demastes, S. S., Good, R. G., & Peebles, P. (1996). Patterns of conceptual change in evolution. Journal of Research in Science Teaching, 33, 407-431.

Dreyfus, A., Jungwirth, E., & Eliovitch, R. (1990). Applying the "cognitive conflict" strategy for conceptual change: Some implications, difficulties and problems. Science Education, 74(5), 555-569.

Dweck, C. S., & Leggett, E. L. (1988). A social-cognitive approach to motivation and personality. Psychological Review, 95(2), 256-273.

Eccles, J. (1983). Expectancies, values and academic behaviors. San Francisco: Freeman.

Geisler-Brenstein, E., & Schmeck, R. R. (1996). The Revised Inventory of Learning Processes: A multifaceted perspective on individual differences in learning. In M. Birenbaum & F. J. R. C. Dochy (Eds.), Alternatives in assessment of achievements, learning processes and prior knowledge. Boston, MA: Kluwer Academic Publishers.

Hattie, J., & Timperley, H. (2007). The power of feedback. Review of Educational Research, 77(1), 81-112.

Hewson, M. G. (1986). The acquisition of scientific knowledge: Analysis and representation of student conceptions concerning density. Science Education, 70(2), 159-170.

Hewson, P. W., & Hewson, M. G. (1984). The role of conceptual conflict in conceptual change and the design of science instruction. Instructional Science, 13(1), 1-13.

Kohn, A. S. (1993). Preschoolers' reasoning about density: Will it float? Child Development, 64, 1637-1650.

Lau, S., & Roeser, R. W. (2002). Cognitive abilities and motivational processes in high school students' situational engagement and achievement in science. Educational Assessment, 8(2), 139-162.

NRC. (2001). Knowing what students know: The science and design of educational assessment. Washington, DC: National Academies Press.
Pajares, F. (1996). Self-efficacy beliefs in academic settings. Review of Educational Research, 66(4), 543-578.

Pintrich, P. R. (1999). Motivational beliefs as resources for and constraints on conceptual change. In W. Schnotz, S. Vosniadou, & M. Carretero (Eds.), New perspectives in conceptual change research. Oxford, England: Pergamon Press.

Pintrich, P. R., Marx, R., & Boyle, R. (1993). Beyond cold conceptual change: The role of motivational beliefs and classroom contextual factors in the process of conceptual change. Review of Educational Research, 63, 167-199.

Pintrich, P. R., & Schrauben, B. (1992). Students' motivational beliefs and their cognitive engagement in classroom academic tasks. In D. Schunk & J. Meece (Eds.), Student perceptions in the classroom: Causes and consequences (pp. 149-183). Hillsdale, NJ: Lawrence Erlbaum.

Posner, G. J., Strike, K. A., Hewson, P. W., & Gertzog, W. F. (1982). Accommodation of a scientific conception: Toward a theory of conceptual change. Science Education, 66, 211-227.

Pottenger, F. M., III, & Young, D. B. (1992). The local environment: FAST 1, Foundational Approaches in Science Teaching. Honolulu: University of Hawaii Curriculum Research & Development Group.

Ramaprasad, A. (1983). On the definition of feedback. Behavioral Science, 28(1), 4-13.

Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (2nd ed.). Newbury Park, CA: Sage.

Sadler, R. (1989). Formative assessment and the design of instructional systems. Instructional Science, 18, 119-144.

Schnotz, W., Vosniadou, S., & Carretero, M. (1999). Preface. In W. Schnotz, S. Vosniadou, & M. Carretero (Eds.), New perspectives in conceptual change research. Oxford, England: Pergamon Press.

Schunk, D. H., & Swartz, C. W. (1993). Goals and progress feedback: Effects on self-efficacy and writing achievement. Contemporary Educational Psychology, 18, 337-354.

Siero, F., & Van Oudenhoven, J. P. (1995). The effects of contingent feedback on perceived control and performance. European Journal of Psychology of Education, 10, 13-24.

Smith, C., Snir, J., & Grosslight, L. (1992). Using conceptual models to facilitate conceptual change: The case of weight-density differentiation. Cognition and Instruction, 9(3), 221-283.

Vispoel, W. P., & Austin, J. R. (1995). Success and failure in junior high school: A critical incident approach to understanding students' attributional beliefs. American Educational Research Journal, 32(2), 377-412.

Vosniadou, S., & Brewer, W. F. (1992). Mental models of the earth: A study of conceptual change in childhood. Cognitive Psychology, 24, 535-585.

Yin, Y. (2005). The influence of formative assessment on student motivation, achievement, and conceptual change (Doctoral dissertation). Stanford University, Stanford, CA.

Yin, Y., Tomita, M. K., & Shavelson, R. J. (in press). Diagnosing and dealing with student misconceptions about "sinking and floating." Science Scope.
Appendix A

Average Class Pretest Motivation Scores, Multiple-choice Score, and Class Size for Focus Classes (a)

Pair 1: Aden (Experimental) vs. Rachel (Control)
                                E: N   Mean    SD      C: N   Mean    SD      Diff. E-C
Task Goal Orientation           29     3.79    0.69    29     3.59    0.73     0.21
Perceived Task Goal             29     4.26    0.45    29     3.86    0.44     0.40**
Self Efficacy                   29     4.13    0.61    29     3.77    0.59     0.37*
Interest                        29     4.15    0.80    29     3.82    0.60     0.33
Ego Approach Orientation        29     2.82    0.81    29     2.89    0.71    -0.07
Ego Avoidance Orientation       29     2.90    0.85    29     2.94    0.79    -0.04
Perceived Performance Goal      29     2.10    0.52    29     2.52    0.57    -0.42**
Fixed Ability                   29     2.23    0.81    29     2.19    0.67     0.04
Pretest Multiple-choice         29     14.55   5.08    29     14.76   3.90    -0.21

Pair 2: Andy (Experimental) vs. Lenny (Control)
                                E: N   Mean    SD      C: N   Mean    SD      Diff. E-C
Task Goal Orientation           25     3.63    0.65    19     3.60    0.64     0.03
Perceived Task Goal             25     3.92    0.50    19     4.23    0.45    -0.31*
Self Efficacy                   25     3.84    0.63    19     4.02    0.58    -0.19
Ego Approach Orientation        25     2.90    0.86    19     2.96    1.02    -0.06
Ego Avoidance Orientation       25     3.06    0.77    19     3.23    0.93    -0.17
Perceived Performance Goal      25     2.25    0.60    19     2.41    0.59    -0.17
Fixed Ability                   25     2.55    0.87    19     2.11    0.66     0.44
Interest                        25     3.93    0.96    19     3.95    0.88    -0.02
Pretest Multiple-choice         26     13.12   3.68    19     15.95   4.13    -2.83*

Pair 3: Becca (Experimental) vs. Ellen (Control)
                                E: N   Mean    SD      C: N   Mean    SD      Diff. E-C
Task Goal Orientation           20     4.22    0.51    23     3.92    0.52     0.30
Perceived Task Goal             20     4.37    0.45    23     4.24    0.40     0.13
Self Efficacy                   20     4.03    0.42    23     3.80    0.50     0.22
Interest                        20     4.20    0.61    23     4.28    0.50    -0.08
Ego Approach Orientation        20     3.23    0.99    23     3.20    0.61     0.03
Ego Avoidance Orientation       20     2.97    0.91    23     2.99    0.82    -0.02
Perceived Performance Goal      20     2.59    0.44    23     2.82    0.47    -0.23
Fixed Ability                   20     2.87    0.96    23     2.54    0.85     0.33
Pretest Multiple-choice         20     12.25   2.75    22     13.23   4.19    -0.98

Pair 4: Carol (Experimental) vs. Sofia (Control)
                                E: N   Mean    SD      C: N   Mean    SD      Diff. E-C
Task Goal Orientation           18     3.88    0.59    21     3.99    0.53    -0.11
Perceived Task Goal             18     4.21    0.42    21     4.23    0.40    -0.02
Self Efficacy                   18     4.10    0.57    21     4.03    0.57     0.07
Interest                        18     4.09    0.77    21     4.31    0.67    -0.22
Ego Approach Orientation        18     2.40    0.75    21     3.26    0.85    -0.86**
Ego Avoidance Orientation       18     2.56    0.91    21     3.15    0.58    -0.60*
Fixed Ability                   18     2.15    0.69    21     2.31    0.91    -0.16
Perceived Performance Goal      18     1.89    0.53    21     2.49    0.62    -0.60**
Pretest Multiple-choice         19     13.16   4.49    24     15.21   3.96    -2.05

Pair 5: Diana (Experimental) vs. Ben (Control)
                                E: N   Mean    SD      C: N   Mean    SD      Diff. E-C
Task Goal Orientation           25     4.00    0.58    21     3.36    0.65     0.64**
Perceived Task Goal             25     4.41    0.29    21     4.08    0.59     0.33*
Self Efficacy                   25     4.11    0.58    21     3.74    0.53     0.37*
Interest                        25     4.28    0.58    21     3.38    0.80     0.90**
Ego Approach Orientation        25     3.16    0.88    21     3.18    0.86    -0.02
Ego Avoidance Orientation       25     3.19    0.74    21     3.18    0.72     0.01
Perceived Performance Goal      25     2.39    0.78    21     2.49    0.62    -0.10
Fixed Ability                   25     2.29    0.85    21     2.33    0.58    -0.03
Pretest Multiple-choice         25     15.52   3.61    21     18.38   5.11    -2.86*

Pair 6: Robert (Experimental) vs. Serena (Control)
                                E: N   Mean    SD      C: N   Mean    SD      Diff. E-C
Task Goal Orientation           27     3.84    0.52    20     3.87    0.67    -0.03
Perceived Task Goal             27     4.09    0.52    20     4.21    0.48    -0.12
Self Efficacy                   27     4.04    0.43    20     4.16    0.61    -0.12
Interest                        27     4.17    0.41    20     4.03    0.81     0.14
Ego Approach Orientation        27     3.13    0.74    20     2.98    0.90     0.15
Ego Avoidance Orientation       27     3.06    0.91    20     3.05    0.96     0.01
Perceived Performance Goal      27     2.21    0.51    20     2.30    0.67    -0.09
Fixed Ability                   27     2.31    0.72    20     2.17    0.84     0.14
Pretest Multiple-choice         26     16.00   4.58    20     15.20   4.64     0.80

a: * p < 0.05; ** p < 0.01.

Appendix B

Average Class Posttest Motivation Scores, Achievement Scores, and Class Size for Focus Classes (a)

Pair 1: Aden (Experimental) vs. Rachel (Control)
                                E: N   Mean    SD      C: N   Mean    SD      Diff. E-C
Task Goal Orientation           29     3.56    0.84    21     3.10    0.77     0.46
Perceived Task Goal             25     3.96    0.72    21     3.63    0.67     0.33
Self Efficacy                   25     3.95    0.80    21     3.79    0.39     0.16
Interest                        25     3.71    0.94    21     3.14    0.92     0.56*
Ego Approach Orientation        29     2.68    0.92    21     2.64    0.72     0.04
Ego Avoidance Orientation       25     2.77    1.08    21     2.52    0.79     0.24
Perceived Performance Goal      25     2.01    0.59    21     2.33    0.67    -0.32
Fixed Ability                   25     2.08    0.64    21     2.29    0.53    -0.21
Multiple-choice                 25     21.79   7.19    21     28.10   6.18    -6.31**
Performance Assessment          22     13.31   8.91    22     17.91   5.94    -2.01
Predict-Observe-Explain         26     2.15    1.74    21     3.33    2.29    -2.01*
Short-Answer                    26     1.54    1.68    21     3.10    1.18    -3.59**
Achievement Total               20     40.45   17.00   19     54.53   11.76   -2.99**
Conceptual Change               20     5.10    3.26    19     6.54    3.20    -1.39

Pair 2: Andy (Experimental) vs. Lenny (Control)
                                E: N   Mean    SD      C: N   Mean    SD      Diff. E-C
Task Goal Orientation           25     2.98    0.73    19     3.72    0.66    -0.74**
Perceived Task Goal             22     3.83    0.61    18     4.33    0.44    -0.50**
Self Efficacy                   22     3.85    0.71    18     4.08    0.58    -0.23
Interest                        22     2.85    0.96    18     3.89    0.68    -1.04**
Ego Approach Orientation        25     2.72    0.79    19     2.99    0.78    -0.27
Ego Avoidance Orientation       22     3.00    0.85    18     2.80    0.60     0.19
Perceived Performance Goal      22     2.60    0.60    18     2.23    0.55     0.37
Fixed Ability                   22     2.17    0.79    18     2.02    0.66     0.15
Multiple-choice                 22     22.91   5.92    18     28.79   6.84    -5.88**
Performance Assessment          22     9.23    4.41    17     20.53   7.86    -5.32**
Predict-Observe-Explain         20     3.70    2.03    16     5.31    2.41    -2.18*
Short-Answer                    22     2.14    1.21    19     2.47    1.26    -0.87
Achievement Total               18     39.56   11.09   13     62.08   10.57   -5.68**
Conceptual Change               18     6.90    3.41    13     10.55   3.22    -3.01**

Pair 3: Becca (Experimental) vs. Ellen (Control)
                                E: N   Mean    SD      C: N   Mean    SD      Diff. E-C
Task Goal Orientation           20     3.85    0.71    23     4.04    0.45    -0.19
Perceived Task Goal             16     4.25    0.37    21     4.25    0.52    -0.01
Self Efficacy                   16     3.88    0.53    21     3.89    0.58    -0.01
Interest                        16     4.19    0.69    21     4.30    0.83    -0.11
Ego Approach Orientation        20     3.03    1.03    23     2.65    0.85     0.38
Ego Avoidance Orientation       16     2.83    0.84    21     2.56    0.98     0.27
Perceived Performance Goal      16     2.64    0.78    21     2.29    0.61     0.34
Fixed Ability                   16     2.35    0.75    21     2.29    0.72     0.07
Multiple-choice                 16     20.25   5.73    21     22.18   5.57    -1.93
Performance Assessment          16     11.25   2.98    24     13.17   5.74    -1.23
Predict-Observe-Explain         16     2.36    1.91    24     1.54    1.14     1.45
Short-Answer                    14     2.07    1.44    24     1.42    1.69     1.30
Achievement Total               14     36.36   8.66    22     38.55   11.75   -0.60
Conceptual Change               14     4.58    2.48    22     3.51    2.20     1.35

Pair 4: Carol (Experimental) vs. Sofia (Control)
                                E: N   Mean    SD      C: N   Mean    SD      Diff. E-C
Task Goal Orientation           18     3.28    0.64    21     3.84    0.68    -0.55*
Perceived Task Goal             17     3.81    0.66    22     4.27    0.57    -0.45*
Self Efficacy                   17     3.76    0.55    22     3.95    0.76    -0.19
Interest                        17     3.27    1.08    22     4.21    0.93    -0.94**
Ego Approach Orientation        18     2.50    0.87    21     3.03    0.83    -0.53
Ego Avoidance Orientation       17     2.47    0.85    22     2.53    1.02    -0.05
Perceived Performance Goal      17     2.37    0.77    22     2.03    0.80     0.34
Fixed Ability                   17     2.10    0.44    22     2.24    1.08    -0.14
Multiple-choice                 17     28.12   6.27    22     27.32   5.80     0.80
Performance Assessment          15     17.93   8.57    21     16.38   6.09     0.60
Predict-Observe-Explain         17     3.00    2.50    21     2.71    1.31     0.41
Short-Answer                    18     2.94    1.00    22     2.91    1.31     0.09
Achievement Total               13     51.92   16.55   20     49.90   11.67    0.38
Conceptual Change               13     6.55    4.03    20     5.32    2.76     0.97

Pair 5: Diana (Experimental) vs. Ben (Control)
                                E: N   Mean    SD      C: N   Mean    SD      Diff. E-C
Task Goal Orientation           25     3.90    0.71    21     2.76    0.77     1.14**
Perceived Task Goal             23     4.46    0.46    21     3.20    0.72     1.26**
Self Efficacy                   23     4.11    0.58    21     3.63    0.82     0.48*
Interest                        23     4.19    0.72    21     2.40    1.11     1.79**
Ego Approach Orientation        25     3.18    0.89    21     3.25    0.93    -0.07
Ego Avoidance Orientation       23     2.98    0.74    21     2.84    0.90     0.15
Perceived Performance Goal      23     2.18    0.65    21     2.59    0.53    -0.41*
Fixed Ability                   23     2.14    0.94    21     2.10    0.63     0.04
Multiple-choice                 23     24.78   4.49    21     26.14   7.88    -1.35
Performance Assessment          21     16.71   7.13    20     14.95   5.99     0.86
Predict-Observe-Explain         21     2.29    1.42    20     2.35    2.13    -0.11
Short-Answer                    21     3.14    1.42    22     2.77    1.51     0.83
Achievement Total               21     47.05   11.08   17     45.29   15.37    0.40
Conceptual Change               21     4.86    2.08    17     4.60    3.00     0.30

Pair 6: Robert (Experimental) vs. Serena (Control)
                                E: N   Mean    SD      C: N   Mean    SD      Diff. E-C
Task Goal Orientation           27     3.70    0.70    20     3.76    0.91    -0.06
Perceived Task Goal             22     4.52    0.39    20     4.09    0.54     0.44**
Interest                        22     4.37    0.65    20     3.62    0.92     0.75**
Self Efficacy                   22     4.20    0.36    20     3.92    0.63     0.28
Ego Approach Orientation        27     2.95    0.94    20     3.38    0.86    -0.42
Ego Avoidance Orientation       22     2.60    0.98    20     3.11    0.82    -0.51
Perceived Performance Goal      22     2.12    0.78    20     2.42    0.74    -0.30
Fixed Ability                   22     2.06    0.70    20     2.55    0.89    -0.49
Multiple-choice                 22     20.48   6.88    20     18.05   4.62     2.43
Performance Assessment          21     11.00   5.06    19     9.53    6.10     0.84
Predict-Observe-Explain         22     1.86    1.54    17     1.76    2.05     0.17
Short-Answer                    22     1.86    1.42    20     1.60    1.39     0.61
Achievement Total               21     35.86   12.47   16     31.90   12.08    1.02
Conceptual Change               21     4.34    2.13    16     3.69    2.70     0.83

a: * p < 0.05; ** p < 0.01.

Table 1

Motivational Belief Constructs, Number of Items, Coefficient Alpha, and Correlations among Constructs at Posttest

Construct (no. of items)            1       2       3       4       5       6       7       8
1 Task Goal (5)                     (.83)
2 Task Perceived (6)                .73**   (.86)
3 Self-Efficacy (5)                 .77**   .72**   (.89)
4 Interest (3)                      .85**   .85**   .64**   (.81)
5 Ego Approach (4)                  .04     .04     .10*    -.00    (.77)
6 Ego Avoidance (5)                 .02     .04     .03     -.02    .84**   (.78)
7 Performance Perceived (6)         -.36**  -.53**  -.42**  -.42**  .41**   .37**   (.70)
8 Fixed Ability (3)                 -.26**  -.31**  -.39**  -.33**  .50**   .48**   .74**   (.44)

Note: a. * p < 0.05, ** p < 0.01; N = 788 ~ 826. b. Constructs 1-4 measure positive motivation; constructs 5-8 measure negative motivation. c. Results in this table are the output from AMOS, arranged in a conventional format for display convenience. d. Correlations computed in AMOS were attenuated by the reliability of each construct. e. Numbers in the diagonal of the matrix are the internal consistency of the corresponding construct.
Table 2

Internal Consistency and Correlations among All Scores (a)

Scale                           Item No.   Total Score   (1)     (2)     (3)     (4)     Knowledge/Reasoning Measured
(1) Post Multiple-choice        43         43            (.86)b                          Declarative/Schematic/Procedural
(2) Performance Assessment      18         32            .69     (.81)                   Procedural/Schematic
(3) Short-Answer                4          4             .50     .49     (.83)           Declarative/Schematic
(4) Predict-Observe-Explain     7          7             .48     .42     .39     (.74)   Schematic

Note: a. N = 732 ~ 830; all correlations are significant, p < 0.01. b. The numbers in the diagonal are the internal consistency of the corresponding scale.

Table 3

Comparison of Experimental and Control Group Student Motivation, Achievement, and Conceptual Change Scores at Pretest

                               Experimental               Control                  Difference
Variable                       N      Mean    SD          N      Mean    SD        t
Pre Task Goal                  144    3.88    .61         133    3.72    .66        2.12*
Pre Task Perceived             144    4.20    .47         133    4.13    .48        1.38
Pre Self-Efficacy              144    4.04    .55         133    3.91    .58        1.99*
Pre Interest                   144    4.14    .71         133    3.96    .76        2.06*
Pre Ego Approach               144    2.95    .86         133    3.07    .82       -1.14
Pre Ego Avoidance              144    2.97    .85         133    3.08    .80       -1.04
Pre Performance Perceived      144    2.24    .60         133    2.51    .60       -3.82
Pre Fixed Ability              144    2.39    .84         133    2.27    .76        1.19
Pre Multiple-choice            145    14.22   4.30        135    15.39   4.49      -1.16*

Table 4

Comparison of Experimental and Control Group Student Motivation, Achievement, and Conceptual Change Scores at Posttest

                               Experimental (E)           Control (C)              Difference
Variable                       N      Mean    SD          N      Mean    SD        t
Post Task Goal                 125    3.55    .79         123    3.53    .84        0.14
Post Task Perceived            125    4.14    .62         123    3.95    .71        2.25*
Post Self-Efficacy             125    3.97    .62         123    3.87    .65        1.25
Post Interest                  125    3.76    1.00        123    3.59    1.12       1.29
Post Ego Approach              125    2.85    .92         123    2.99    .86       -1.24
Post Ego Avoidance             125    2.79    .91         123    2.72    .88        0.57
Post Performance Perceived     125    2.29    .72         123    2.31    .67       -0.25
Post Fixed Ability             125    2.14    .72         123    2.25    .78       -1.14
Post Multiple-choice           129    22.92   6.59        125    25.15   7.16      -2.58**
Post Performance Assessment    117    13.05   7.11        123    15.31   6.99      -2.48*
Post Predict-Observe-Explain   120    2.53    1.93        119    2.74    2.26      -0.79
Post Short-Answer              125    2.24    1.49        128    2.37    1.53      -0.67
Post Achievement               107    41.55   13.94       107    46.41   15.31     -2.43*
Conceptual Change              107    5.32    2.99        107    5.44    3.52      -0.29
Multiple-Choice Gain Score     127    8.82    5.83        120    9.58    6.49      -0.96

Table 5

Comparison of Unconditional, Covariate, and Final Model Variance Components of Posttest Achievement and Conceptual Change

Outcome / Variance Component              Model 1            Model 2 (pretest        Model 3 (predictors
                                          (unconditional)    multiple-choice test    at both levels
                                                             at level-1 only)        1 and 2)
Post Multiple-choice Test Score
  Within teachers (σ²)                    38.832             27.680                  27.681
  Between teachers (τ00)                  10.987**           10.764**                11.794**
Multiple-choice Gain Score
  Within teachers (σ²)                    28.127             27.680                  27.681
  Between teachers (τ00)                  11.196**           10.764**                11.794**
Performance Assessment
  Within teachers (σ²)                    40.992             36.947                  36.949
  Between teachers (τ00)                  11.050**           10.255**                10.620**
Short-Answer
  Within teachers (σ²)                    1.970              1.827                   1.827
  Between teachers (τ00)                  0.322**            0.270**                 0.304
Predict-Observe-Explain
  Within teachers (σ²)                    3.684              3.614                   3.613
  Between teachers (τ00)                  0.861**            0.783**                 0.881**
Achievement Total
  Within teachers (σ²)                    162.042            122.692                 122.688
  Between teachers (τ00)                  68.552**           68.541**                73.778**
Conceptual Change Score
  Within teachers (σ²)                    9.361              8.634                   8.632
  Between teachers (τ00)                  3.399**            3.414**                 3.827**

Note: * p < 0.05, ** p < 0.01.

Table 6

Multilevel Regression Estimates for Posttest Motivational Beliefs

Outcome Variable (Posttest)    Predictor                          Coefficient    SE         p
Task Goal Orientation          Level-1: TASKGOAL, γ10             0.548162**     0.069905   0.000
                               Level-2: GROUP, γ01                -0.048264      0.197369   0.812
Perceived Task Goal            Level-1: PERCTASK, γ10             0.470225**     0.079936   0.000
                               Level-2: GROUP, γ01                0.164382       0.202496   0.436
Self Efficacy                  Level-1: EFFICACY, γ10             0.406299**     0.068589   0.000
                               Level-2: GROUP, γ01                0.078124       0.079747   0.351
Interest                       Level-1: INTEREST, γ10             0.598573**     0.075573   0.000
                               Level-2: GROUP, γ01                0.089683       0.310063   0.778
Ego Approach                   Level-1: EGOPERF, γ10              0.549829**     0.059008   0.000
                               Level-2: GROUP, γ01                -0.058126      0.135709   0.677
Ego Avoidance                  Level-1: EGOAVO, γ10               0.531549**     0.059259   0.000
                               Level-2: GROUP, γ01                0.187889       0.125347   0.165
Perceived Performance Goal     Level-1: PERCPERF, γ10             0.439538**     0.068559   0.000
                               Level-2: GROUP, γ01                0.138793       0.137974   0.339
Fixed Ability                  Level-1: FIXEDABILITY, γ10         0.369011**     0.057722   0.000
                               Level-2: GROUP, γ01                -0.128559      0.090237   0.185

Note: * p < 0.05, ** p < 0.01.

Table 7

Multilevel Regression Estimates for Posttest Achievement and Conceptual Change

Outcome Variable (Posttest)       Predictor                   Coefficient   SE      p
Post Multiple-choice Test Score   Level-1: PRETOTAL, γ10      0.81**        0.08    0.00
                                  Level-2: GROUP, γ01         -0.76         2.10    0.73
Multiple-choice Gain Score        Level-1: PRETOTAL, γ10      -0.19*        0.08    0.023
                                  Level-2: GROUP, γ01         -0.76         2.10    0.73
Performance Assessment            Level-1: PRETOTAL, γ10      0.50**        0.10    0.00
                                  Level-2: GROUP, γ01         -1.68         2.05    0.43
Short-Answer                      Level-1: PRETOTAL, γ10      0.09**        0.02    0.00
                                  Level-2: GROUP, γ01         -0.03         0.36    0.93
Predict-Observe-Explain           Level-1: PRETOTAL, γ10      0.09**        0.03    0.00
                                  Level-2: GROUP, γ01         -0.11         0.60    0.86
Achievement Total                 Level-1: PRETOTAL, γ10      1.53**        0.19    0.00
                                  Level-2: GROUP, γ01         -2.84         5.21    0.60
Conceptual Change Items           Level-1: PRETOTAL, γ10      0.22**        0.05    0.00
                                  Level-2: GROUP, γ01         0.03          1.21    1.00

Note: * p < 0.05, ** p < 0.01.

[Figure 1 about here]

Figure 1. Motivational beliefs involved in formative assessment and conceptual change. The figure maps motivational beliefs encouraged by formative assessment (task goal orientation, incremental intelligence beliefs, interest, and self-efficacy) to motivation promoting conceptual change, and beliefs discouraged by formative assessment (ego goal orientation, including ego-avoidance and ego-approach, and fixed intelligence belief) to motivation constraining conceptual change.