Using Automated Item Generation to Promote Principled Test Design and Development

Cecilia B. Alves
Mark J. Gierl
Hollis Lai

Centre for Research in Applied Measurement and Evaluation
University of Alberta

Paper presented at the annual meeting of the American Educational Research Association
Denver, CO, USA
April 30 – May 4, 2010

Principled Test Design 2

Introduction

Educational assessment plays an important role in modern society. Teachers use tests to measure students’ strengths and weaknesses and to determine whether students are meeting educational objectives; school administrators use tests to monitor students’ progress and to place students in the appropriate grade; students are selected by colleges and universities based on their performance on standardized tests; parents are informed about their children’s performance in each subject by means of report cards. The diversity of assessment situations is truly impressive. However, such high demand for educational assessment comes at a price. Hundreds, if not thousands, of test items must be developed to measure student performance, so item development involves significant cost, time, and effort.

In the traditional approach to test construction, each item is individually developed by content specialists. In this process, the item is first written, then reviewed, revised, edited, and, finally, administered. As a result of this lengthy process, it becomes difficult to meet the ever-increasing demand for more test items (Drasgow, Luecht, & Bennett, 2006). Automated Item Generation (AIG) is an alternative approach to item development that can supplement the traditional approach by using specifically programmed algorithms. The goal of AIG is to produce large numbers of high-quality items that require little human review prior to administration (Williamson, Johnson, Sinharay, & Bejar, 2002).
The purpose of this study is to describe and illustrate an approach for developing task models that can be used for AIG with the College Board’s Advanced Placement Program (AP). The College Board supports major programs and services that promote college admissions, guidance, assessment, financial aid, enrollment, teaching, and learning (College Board, 2008). Committed to the principles of excellence and equity, the College Board has paid renewed attention to maintaining and improving the quality of its exams. In accordance with this goal, the use of Evidence-Centered Design (ECD) is one important action in the process of improving the AP Program. ECD, as initially described by Robert J. Mislevy, Linda S. Steinberg, and Russell G. Almond in 2003, provides a conceptual framework for designing, producing, and delivering educational assessments using evidentiary arguments. Huff, Steinberg, and Matts (2009) defined ECD as a:

set of activities and artifacts that facilitate explicit thinking about a) given the purpose of the assessment, what content and skills are both useful and interesting to claim about examinees; b) what is the reasonable and observable evidence in student work or performance required to support the claims; and c) how tasks (items) can be developed within (p. 8).

This framework helps to ensure that evidence supports the underlying knowledge the assessment is intended to measure (Mislevy, Steinberg, & Almond, 2003). This standpoint is valuable because it clarifies the conceptualization of assessments in a structured, coherent, and purposeful way that should lead to more valid inferences about student performance on exams.
Three advantages that result from the application of ECD to the AP Program are:

a) a foundation for alignment between what is taught in the course and measured on the exam; b) an explication of what is meant by deep conceptual understanding and complex reasoning skills; and c) detailed item design and item development guidelines and form assembly specifications that flow directly from this explication and that serve as the basis for comparable scores. (Huff, Steinberg, & Matts, 2009, p. 9)

In its ECD-based assessment framework, the AP Program will incorporate task models as input for item generation. A task model identifies the features of the assessment situation that make it possible for the student to produce the evidence that supports a claim. The use of a task model will ensure that the generated items are derived from claims and evidence, thereby connecting items and inferences about student performance. The continuum that depicts the progress from claims to items is presented in Figure 1.

[Figure 1: boxes arranged along a continuum from General (left) to Specific (right).]
Figure 1. Continuum from Claim to Items (adapted from Hendrickson, Huff, & Luecht, 2009)

The boxes in Figure 1 increase in detail from left to right. The figure depicts a continuum that goes from the most general inference, the claim on the left-hand side, to a very specific instance of student performance, the item on the right-hand side. The continuum represents a progression of specificity that occurs at five levels: Claim, Observable Evidence, Task Model, Template, and Item. These five levels are intended to indicate the range of processes and procedures that must be executed to link ECD and item development. Each of these levels will be described and illustrated in our paper.

A claim is used to articulate the purposes of the assessment. It represents the content and skills deemed useful for making statements about examinees.
Observable evidence is a way to support particular claims about examinee proficiency (Huff, Steinberg, & Matts, 2009). In other words, observable evidence is a behavior that allows inferences about aspects of an examinee’s proficiency.

Task models, as outlined by Hendrickson, Huff, and Luecht (2009), are components of an evidentiary argument that supports the validity of the inferences made from student assessments. Mislevy, Almond, and Steinberg (2002) describe a task model as “a language for characterizing features of tasks and specifying how the interaction between the examinee and the task is managed” (p. 115). They also state that different sets of variables (i.e., item types or kinds of stimulus materials) may require different task models because these characteristics may be important in modeling item parameters or controlling item selection. Task models clarify the features of the test performance situations that will elicit relevant evidence (PytlikZillig, Bodvarsson, & Bruning, 2005). These features can be related to what the student is asked to say, do, or create (e.g., draw a map, write an essay, select the correct response from a set of options) or to the stimulus provided to a student (e.g., a passage, a table of data, a question) (Hendrickson, Huff, & Luecht, 2009). Figure 2 illustrates a task model from Haladyna (2004) for applied statistics and educational measurement statistics.

Construct Identifier: Applied statistics and educational measurement statistics
Level of the construct: Basic
Primary Context: Effect size, d
Competency Claim: Computes and interprets an effect size as a standardized difference between groups or levels of an independent variable.

Evidence Documentation
1. Successfully computes d, given two means and standard deviations from a common population.
2. Successfully computes d, given two means and standard deviations from independent populations (i.e., using the pooled variances).
3. Correctly interprets d, given two means and standard deviations from a common population.
4. Correctly interprets d, given two means and standard deviations from independent populations (i.e., using the pooled variances).

Conceptual Task Model
Specific Tasks (Expected Mastery Criteria)
1. interprets(d|single pop. means) (Plausible choice from options)
2. interprets(d|separate pop. means) (Plausible choice from options)
3. interprets(d|levels of independent variable) (Plausible choice from options)
4. computes(d|µ1, µ2, σ) (Correct value)
5. computes(d|µ1, µ2, σ1, σ2) (Correct value)
6. interprets(computes(d|µ1, µ2, σ)) (Plausible choice from options)
7. interprets(computes(d|µ1, µ2, σ1, σ2)) (Plausible choice from options)
8. interprets(generates(scatterplot|x)) (Plausible choice from options)

Manipulable features of complexity/difficulty
1. Magnitude of d (low, moderate, high)
2. Standardization of variables
3. Number of groups (two or more)
4. Sign of the effect size
5. Formulas provided
6. Software/calculator access/training
7. Graphic facilitators (depictions of the distributions)

Manipulable features of complexity/difficulty
1. Variable labels
2. Magnitude of scale
3. Compute vs. interpret vs. interpret(compute())

Figure 2. Example of Task model specification (from Haladyna, 2004. In Luecht, 2008)

Figure 2 documents the key features of a task model. Since task models serve as the foundation for item development, the key features ensure that the items will be consistent with the claims and evidence specified in the domain model. The first four lines of the table represent the domain area of the task: they give the intended construct, complexity level, context, and claim. Next, evidence documentation specifies what type of behaviors to expect from a student who masters the claim.
In other words, it specifies the evidence that supports a particular claim about examinee proficiency. In this task, for example, a student should be able to successfully compute the effect size, given two means and the standard deviations from a common population. The conceptual task model then presents more specific tasks, such as interprets(d|single pop. means). The expected mastery criterion, which represents the way the evidence is observed, is also given. The evidence in this task model is exhibited by choosing the plausible choice from the options of a multiple-choice item. The features that affect the difficulty or complexity of the item, as well as the features that are irrelevant to the item difficulty or complexity, are also documented.

If the evidence is presented in different formats or is focused on different aspects of proficiency, then multiple task models may be needed. Each specific format of the task model is called an item template. Templates are useful for producing new items with the same task model specifications and serve as a guide for automated item generation. There is no agreed-upon terminology in the automatic item generation literature (Johnson & Sinharay, 2005). Item templates are also called item models, schemas, item forms, and item shells. Similarly, the generated items are called siblings, variants, instances, isomorphs, and clones. Items are directly generated from templates that are linked to a task model and, subsequently, to a claim.

Item templates provide the foundation necessary for automatic item generation. Automatic item generation is a procedure for using item templates to create isomorphic instances with known item characteristics. Employing ECD yields a better understanding of the cognitive mechanisms required to solve items and of the features that affect difficulty. A careful use of design principles when manipulating the variables is vital to the creation of items at desired difficulty levels.
As Gitomer and Bennett (2002) state, “variation in difficulty may be obtained by creating different templates, each intended to produce items in a particular target range, or by creating a single template to generate items spanning the desired range” (p. 6).

Templates are written at a level of specificity to produce items (Hendrickson, Huff, & Luecht, 2009). Templates present attributes in detail, whereas task models present attributes at a more general level (Mislevy & Riconscente, 2006). Item templates are developed explicitly to represent the clauses listed in a corresponding task model. Furthermore, task models provide the theoretical foundation for item development, whereas item templates provide an operational foundation (Zhou, 2009). The theoretical foundation present in a task model consists of information about the important features required for item development, such as the construct being measured, complexity, cognitive level of the construct, documentation of the evidence, features that affect complexity/difficulty, and features that are irrelevant to the complexity/difficulty of the task. A task model is a detailed structure that reveals how the information is related to other components of the assessment, and it serves as the blueprint for constructing the actual tasks presented to students.

Task models are also created at different locations along an ability scale and, in turn, each model provides measurement information at a particular ability location (Luecht, 2008). Lai, Gierl, and Alves (2010) recognized this issue when they stated:

Whereas the goal of generating task models from a cognitive model is to have a set of task models that are representative across the scale of ability, item generation from item templates is an attempt to achieve the opposite.
The goal of AIG is to vary these elements within the item template, in an iterative/systematic manner, to generate unique items that are comparable to each other psychometrically (p. 8).

Since features required for item development, such as the construct being measured and the cognitive level of the construct, are considered, each task model should be capable of generating multiple item templates. Items produced by one item template should have comparable psychometric properties if only incidental elements are included in the item template (Zhou, 2009). In other words, templates are written to create items at a specific ability level, which depends on the variables and constraints placed on the template. If one wants to assess different ability levels, then different item templates should be used. Camara and Kimmel (2005) state that a task model is an item template augmented with metadata and instantiation logic and the means of exchanging information with other components of the assessment.

Item templates are constructed by the content specialists’ manipulation of specific, well-defined elements. This step requires differentiating the fixed elements from the variable elements. The variable elements can be numeric or string, and replacing these variables with values results in a new item. As indicated by Gierl, Zhou, and Alves (2008), in order to develop item templates, at least three variables may be required: the stem, options, and auxiliary information. Each variable functions differently.

The stem is the section of the model used to formulate context, content, and/or questions. The stem can be classified into four categories. Independent indicates that the elements in the stem are independent of, or unrelated to, one another. That is, a change in one element will have no effect on the other stem elements in the item template. Dependent indicates that the elements in the stem are dependent on, or directly related to, one another.
Mixed Independent/Dependent includes both independent and dependent elements in the stem. Fixed represents a constant stem format with no variation or change.

The options contain the alternatives for the item template when the multiple-choice format is used. The options can be categorized as randomly-selected when the distractors are selected randomly; constrained when the keyed option and the distractors are generated according to specific constraints, such as formulas, calculation, and/or context; and fixed when both the keyed option and the distractors are invariant or unchanged in the item template.

The auxiliary information includes any additional material, in either the stem or the options, required to generate an item, including texts, images, tables, and/or diagrams. To illustrate these concepts, a Biology template is presented in Figure 3 (see Gierl et al., 2008, for more examples).

Stem: Mixed; Options: Fixed; Auxiliary Information: Graph

On a newly formed island, successful populations of grasses and a species of mouse appeared. Later, a species of hawks flew in. The hawks feed on mice. The population levels of mice and hawks are represented in the graph. In 1991, the data for the mice indicates that
A. r is negative because b<d
B. r is negative because b>d
C. r is positive because b<d
D. r is positive because b>d

STEM: On S1, successful populations of grasses and a species of S2 appeared. Later, a species of S3 flew in. The S3 feed on S2. The population levels of S2 and S3 are represented in the graph. In I1, the data for the S2 indicates that

ELEMENTS:
I1 Range: “1990”, “1991”, “1994”
S1 Range: “a newly formed island”, “distant forest”, “isolated jungle”
S2 Range: “rabbits”, “beetles”, “mice”, “snakes”, “fish”, “lizards”, “insects”, “bugs”, “frogs”
S3 Range: “hawks”, “eagles”, “ravens”
As S3=“hawks”, then S2=“rabbits”, “beetles”, “mice”, “snakes”, or “frogs”
As S3=“eagles”, then S2=“fish”, “snakes”, or “lizards”
As S3=“ravens”, then S2=“lizards”, “insects”, “bugs”, or “frogs”

AUXILIARY INFORMATION: Graph with Yearly Populations

KEY: A

Figure 3. Example of Item Template (retrieved from Gierl et al., 2008).

The stem contains one integer (I1) and three strings (S1, S2, and S3). The I1 element takes one of three possible years: 1990, 1991, or 1994. S1 identifies the context or location of the event, such as “a newly formed island”, “distant forest”, or “isolated jungle”. S2 represents the prey, the animal that is attacked by the predator (S3). Since different predators feed on different prey, S2 is dependent on S3. The location (S1), however, does not depend on the type of prey (S2) or predator (S3). Hence, the stem is considered Mixed. The constraints on predators and prey are stated in the following form: As S3=“hawks”, then S2=“rabbits”, “beetles”, “mice”, “snakes”, or “frogs”. The four alternatives, labeled A to D, do not vary with the elements used; thus, the options are Fixed. A graph is required as auxiliary information for this item model.

In this paper we describe the methodology that was used with the College Board subject-matter experts (SMEs) to articulate the content and skills deemed important in the Biology domain, followed by the iterative process of constructing task models. Part of this objective is accomplished by documenting all the relevant background information and considerations that are required for using content expertise, as well as our generative procedures, so that future researchers can benefit from our experience.
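Returning to the Figure 3 template, the way its elements and constraints combine can be sketched in a few lines of Python. This is an illustration only, not the IGOR software: the names `STEM`, `I1_RANGE`, `S1_RANGE`, `PREY_BY_PREDATOR`, and `generate_stems` are hypothetical, and the sketch assumes that every combination permitted by the stated predator/prey constraints is a valid stem.

```python
from itertools import product

# Stem from Figure 3, with placeholders for the variable elements.
STEM = (
    "On {S1}, successful populations of grasses and a species of {S2} appeared. "
    "Later, a species of {S3} flew in. The {S3} feed on {S2}. "
    "The population levels of {S2} and {S3} are represented in the graph. "
    "In {I1}, the data for the {S2} indicates that"
)

# Independent elements: year (I1) and location (S1).
I1_RANGE = [1990, 1991, 1994]
S1_RANGE = ["a newly formed island", "distant forest", "isolated jungle"]

# Dependent elements: the prey (S2) allowed for each predator (S3),
# transcribed from the "As S3=..., then S2=..." constraints in Figure 3.
PREY_BY_PREDATOR = {
    "hawks": ["rabbits", "beetles", "mice", "snakes", "frogs"],
    "eagles": ["fish", "snakes", "lizards"],
    "ravens": ["lizards", "insects", "bugs", "frogs"],
}


def generate_stems():
    """Yield every stem that satisfies the predator/prey constraints."""
    for i1, s1, (s3, prey) in product(I1_RANGE, S1_RANGE, PREY_BY_PREDATOR.items()):
        for s2 in prey:
            yield STEM.format(I1=i1, S1=s1, S2=s2, S3=s3)


stems = list(generate_stems())
print(len(stems))  # 3 years x 3 locations x (5 + 3 + 4) predator/prey pairs = 108
print(stems[0])
```

Because the options are Fixed and the key is always A, each generated stem in this sketch corresponds to one item; the Mixed classification shows up in the code as two independent ranges combined with one dependent lookup.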
Three important products of this process are the creation of (1) robust task models, (2) item templates, and (3) a sample of automatically generated items. The use of AIG together with the ECD perspective may promote efficient and high-quality item and test development. By outlining test development principles for creating item templates for AIG, we will describe procedures for automatically generating items for AP assessments. It is expected that, at the end of this project, hundreds of new items will be generated to demonstrate our AIG approach.

Method

In the present study, task models were developed based on the Biology claims and evidence provided by the College Board. The claims and evidence are viewed as the goals for both instruction (in the AP courses) and assessment (on the AP exams). The AP program allows students to earn credit or advanced placement in the college admissions process. It covers 37 courses and exams across 22 subject areas. Due to space limitations, our study will focus on Biology. Because the Biology exam uses multiple-choice items, it may benefit from automated item generation. SMEs from the College Board articulated the claims and the evidence in the domain and then participated in the iterative processes of constructing task models and item templates, which included reviewing sets of keys, distractors, and constraints. After these procedures were conducted, AIG was performed.

Procedures

Step 1: Revision of Claims and Evidence

Recently, the College Board adopted the ECD perspective for some of the AP programs, including Biology. Based on the content and skills that were deemed important to the Biology domain, the SMEs produced a document containing claims and evidence. This document describes the connection among what is (a) contained in the curriculum, (b) taught in AP courses, and (c) measured on the exams.
A detailed description of the processes used to write claims and evidence is provided in Ewing, Packman, Hamen, and Clark (2009). This claims and evidence document constitutes the starting point for task model development.

Step 2: Creating Task Models

Once high-quality claims and evidence were in place, task models were created. Task models serve as a guide for creating item templates, which are the foundation for automated item generation. The generated task models are expected to be explicitly linked to the claims and evidence about student proficiency.

Step 3: Creating Item Templates

An item template serves as a means of generating a large number of items with similar conceptual and statistical properties. Item templates are constructed by the SMEs’ manipulation of specific and well-defined elements. Template development requires manipulation of three elements: stem, options, and auxiliary information. For each template, the fixed and variable elements, as well as their numeric or string ranges, must be specified.

Step 4: Item Generation

The software used in this project is called IGOR (Item GeneratOR). It was developed and used by Gierl et al. (2008) for developing achievement test items in Language Arts, Social Studies, Mathematics, and Science. It was reported to be a robust and reliable tool for generating items automatically. After the item template is created, the software is used to generate items. The number of generated items depends on several factors, including the model, the number of elements in the stem of the model, and the range specified for the elements (Gierl et al., 2008).

Results

The results are presented according to each step described in the Method section.

Step 1: Revision of Claims and Evidence

This step was carried out in rounds of discussion.
To begin, a review of a document provided by the College Board containing pairings of mechanisms/processes and structure/processes was undertaken. This document provided the starting points for creating statements intended to characterize evidence that the student has the knowledge and skills necessary to master Claim 2A2 & 6.2 (The student is able to construct explanations of the mechanisms that allow organisms to capture free energy with [the production of ATP from ADP, photosynthesis, cellular respiration]). Key parts of the discussion, as well as some of the resulting changes, are presented below to highlight the sequential steps executed to enhance the claim and evidence. This enhancement was necessary for the subsequent development of the task models as well as the construction of keys and distractors.

(1) The evidence statements were subcategorized into structures or mechanisms in the original document. However, the SME stated that there was not a clear division between structure and mechanism. Hence, the values for structures and mechanisms were collapsed.

(2) The SME argued that the values for the collapsed processes needed to be aligned with a statement that clarified “explanation of the mechanism.” For example, the claim “A student can explain X.” will contain, as features of evidence, “correct use of language” and “reasoning connects cause and effect”. The SME argued that while the second feature may provide sufficient evidence that the student’s work has satisfied the claim, the first is not sufficient. Hence, all statements were re-checked in order to satisfy the “explanation” requirement. For example, ‘Pyruvate oxidation connects glycolysis to the Krebs cycle’ was changed to ‘Pyruvate is produced by glycolysis and reacts to form carbon dioxide and a two carbon compound that is added to a molecule of the Krebs cycle’ in order to express the “explanation” feature.
(3) The SME claimed that sub-values for the correct statements would be very desirable, since the number of assertions would increase and, when answering an item, identifying the key would be more difficult. As an example, the assertion ‘In algae and higher plants the free energy captured by the light reactions in the form of ATP and NADPH2 are used to produce carbohydrates from carbon dioxide in the Calvin cycle that occurs in the stroma of the chloroplast’ was broken into (a) ‘In algae and higher plants, ATP and NADPH are used to produce carbohydrates from carbon dioxide in the Calvin cycle, which occurs in the stroma of the chloroplast’ and (b) ‘In algae and higher plants the free energy is captured by the light reactions in the form of ATP and NADPH2’.

(4) False explanations for the claim were then created. The false statements were constructed by negating aspects of the correct explanations, expressing a misconception, or simply using the wrong explanation for the claim. These false statements were also scrutinized by the SME.

(5) The SMEs re-checked the correct and false explanations for the claim. This verification had a threefold purpose. First, all the correct and incorrect statements were verified and rewritten, when necessary, to ensure that the assertions could be used as keys and distractors in the item generation process. Second, the manipulable features of complexity/difficulty were collected for this specific claim. Third, the features irrelevant to complexity were also investigated.

Step 2: Creating Task Models

Next, the task model for the Biology Claim 2A2 & 6.2 (The student is able to construct explanations of the mechanisms that allow organisms to capture free energy with [the production of ATP from ADP, with photosynthesis, with cellular respiration]) is illustrated.
This task model, which outlines the construct being measured, the complexity and cognitive level of the construct, the documentation of the evidence, the features that affect complexity/difficulty, and the features that are irrelevant to the complexity/difficulty of the task, is based on Haladyna’s task model example (Haladyna, 2004). Figure 4 documents the key features of a task model for the AP Biology Claim 2A2 & 6.2.

Construct Identifier: AP Biology
Level of the construct: 5 (complex)
Primary Context: ATP production
Competency Claim: 2A2 & 6.2 The student is able to construct explanations of the mechanisms that allow organisms to capture free energy with [the production of ATP from ADP, photosynthesis, cellular respiration]

Evidence Documentation
1. Successfully identifies a statement that describes one aspect of an organism’s ability to capture free energy with the production of ATP from ADP.
2. Successfully identifies a statement that describes one aspect of an organism’s ability to capture free energy with photosynthesis.
3. Successfully identifies a statement that describes one aspect of an organism’s ability to capture free energy with cellular respiration.

Conceptual Task Model
Specific Tasks (Expected Mastery Criteria for each task: Plausible choice from options)
1. Identifies that during cellular respiration and fermentation, ATP is produced from the phosphorylation of ADP as organic molecules are broken down
2. Identifies that during photosynthesis, the production of ATP from ADP is coupled to the release of energy from a proton gradient established by electron transport chains embedded in a membrane
3. Identifies that the production of ATP from ADP is coupled to the release of energy from a proton gradient established by electron transport chains embedded in the inner membrane of the mitochondria
4. Identifies that photosynthesis captures free energy in visible light through the excitation of electrons in chlorophyll molecules in chloroplasts
5. Identifies that during photosynthesis, free energy is captured during the light reactions in the form of ATP and NADPH
6. Identifies that glycolysis captures free energy with a series of enzyme-catalyzed reactions, some of which are not spontaneous and involve the conversion of ATP to ADP
7. Identifies that the difference in free energy between the mitochondrial matrix and the inner membrane space is the energy used to produce ATP from ADP

Manipulable features of complexity/difficulty
Conceptual difficulty increases when the concept:
1. involves simultaneous conceptualization of multiple variables
2. involves multiple, possibly confounding causes of an effect
3. requires weighing and assessing of relative causal significance
4. involves multiple competing criteria
5. involves multiple variables whose significance must be decided
6. novel concepts that combine existing concepts
7. novel situations that require extension of existing knowledge

Figure 4. Task Model for the AP Biology Claim 2A2 & 6.2

The key features help ensure that the items are consistent with the claims and evidence specified in the domain model. The first four lines of the table represent the domain area of the task. AP Biology is the construct identifier, the level of the construct is complex, the primary context of the construct is ATP production, and the claim is 2A2 & 6.2: The student is able to construct explanations of the mechanisms that allow organisms to capture free energy with [the production of ATP from ADP, photosynthesis, cellular respiration]. Evidence documentation specifies the evidence that supports a particular claim about examinee proficiency. For example, the student successfully identifies components of the mechanisms that allow organisms to capture free energy with the production of ATP from ADP. The conceptual task model provides more specific tasks, such as identifying that during cellular respiration and fermentation, ATP is produced from the phosphorylation of ADP as organic molecules are broken down. The expected mastery criterion, which represents the way the evidence is observed, is also given. In this task model, the evidence is exhibited by choosing the plausible choice from the options of a multiple-choice item. The features that affect the difficulty or complexity of the item, as well as the features that are irrelevant to the item difficulty or complexity, are also documented.

Step 3: Creating Item Templates

To illustrate the task model for Claim 2A2 & 6.2, the following item template is presented.
Subject: AP Biology
Targeted Claim: 2A2 & 6.2 The student is able to construct explanations of the mechanisms that allow organisms to capture free energy with [the production of ATP from ADP, photosynthesis, cellular respiration]
Key: A
Level: 5
Stem: Independent; Options: Constrained; Auxiliary Information: None

Identify a statement that describes one aspect of an organism’s ability to capture free energy with the production of ATP from ADP.
a. During cellular respiration and fermentation, ATP is produced from the phosphorylation of ADP as organic molecules are broken down.
b. The production of ATP by the phosphorylation of ADP occurs on the surface of the thylakoid membrane and therefore occurs only in chloroplasts
c. ADP is phosphorylated to produce ATP using a proton gradient embedded in a membrane, which is established by an electron transport chain that includes oxygen as an electron donor
d. The change in free energy during the production of ATP from ADP is used to create a proton gradient across a membrane in both the mitochondrion and the chloroplast

Item template variables

Stem
Identify correct components of the mechanisms that allow organisms to capture free energy with [PROCESS]

Elements
[PROCESS] range: “the production of ATP from ADP”, “photosynthesis”, “cellular respiration”

Options
a. [KEY]
b. [DISTRACTOR1]
c. [DISTRACTOR2]
d. [DISTRACTOR3]

As [PROCESS]= “the production of ATP from ADP”, then [KEY]:
(1) During cellular respiration and fermentation, ATP is produced from the phosphorylation of ADP as organic molecules are broken down
. . .
(10) …
As [PROCESS]= “photosynthesis”, then [KEY]:
(11) Photosynthesis captures free energy in visible light through the excitation of electrons in chlorophyll molecules in chloroplasts
. . .
(23) …
As [PROCESS]= “cellular respiration”, then [KEY]:
(24) Glycolysis captures free energy with a series of enzyme-catalyzed reactions, some of which are not spontaneous and involve the conversion of ATP to ADP
. . .
(35) …

As [PROCESS]= “the production of ATP from ADP”, then [DISTRACTORS]:
(1) The production of ATP by the phosphorylation of ADP occurs on the surface of the thylakoid membrane and therefore occurs only in chloroplasts
. . .
(5) …
As [PROCESS]= “photosynthesis”, then [DISTRACTORS]:
(6) The Calvin cycle is a series of biochemical reactions that takes place in the cytoplasm of cells in photosynthetic algae and higher plants
. . .
(17) …
As [PROCESS]= “cellular respiration”, then [DISTRACTORS]:
(18) During Krebs cycle, ATP is converted to ADP when bonds in carbohydrates and other organic molecules are broken
. . .
(24) …

Constraints
As the [KEY]=11, 12, 13, 14, 18, or 19, then [DISTRACTORS] ≠10
As the [KEY]=15, 16, or 19, then [DISTRACTORS] ≠6
As the [KEY]=16, then [DISTRACTORS] ≠9
As the [KEY]=17, then [DISTRACTORS] ≠14
As the [KEY]=28, 30, or 35, then [DISTRACTORS] ≠19
As one [DISTRACTOR]=7, then remaining [DISTRACTORS] ≠8
As one [DISTRACTOR]=10, then remaining [DISTRACTORS] = 8 or 12

Auxiliary Information
None

Figure 5. Item template for the AP Biology Claim 2A2 & 6.2

In this example, the stem contains one string variable (PROCESS). This variable can assume three values: “the production of ATP from ADP”, “photosynthesis”, and “cellular respiration”. The four alternatives, labelled A to D, are generated by algorithms that combine the elements and vary them in a systematic manner. Therefore, depending on which value PROCESS assumes, different keys and distractors will be displayed in the generated item. For example, if PROCESS equals “the production of ATP from ADP”, then there are 10 possible keys.
For security reasons, only one key has been disclosed here: During cellular respiration and fermentation, ATP is produced from the phosphorylation of ADP as organic molecules are broken down. A similar process is used with the distractors. If PROCESS equals “the production of ATP from ADP”, then there are five potential distractors, among them: The production of ATP by the phosphorylation of ADP occurs on the surface of the thylakoid membrane and therefore occurs only in chloroplasts. There is no auxiliary information for this item template. Since there is only one variable in the stem, the options are constrained, and no additional material (such as texts, images, tables, and/or diagrams), in either the stem or the options, is required to generate an item from this template.

Step 4: Item Generation

Using the generator software IGOR, a total of 1,966 items were created. IGOR creates items by iterating over all combinations of elements in the item template, taking into account the specified constraints. The number of items generated by type of PROCESS is presented in Table 1.

Table 1 – The number of generated items from the AP Biology template.

Process                       Number of Keys   Number of Distractors   Number of Items
Production of ATP from ADP    10               5                       100
Photosynthesis                13               12                      1,461
Cellular respiration          12               7                       405
Total                         35               24                      1,966

Conclusions and Implications

The purpose of our study is to link automatic item generation and ECD to improve test development practices at the College Board. Clearly, our approach is complex but, we believe, worthwhile. To use Mislevy and Riconscente’s (2006) words:

initial applications of the ideas encompassed in the ECD framework may be labor intensive and time consuming.
Nevertheless, the import of ideas for improving assessment will become clear from (a) the explication of the reasoning behind assessment design decisions and (b) the identification of reusable elements and pieces of structure – conceptual as well as technical – that can be adapted for new projects (p. 86).

The inclusion of the ECD approach in this project fills the gap between traditional automated item generation and cognitive perspectives on assessment. In this way, task models and item templates are constructed taking into account the knowledge and skills students use to succeed on items based on specific claims and to demonstrate the evidence supporting these claims.

Other strengths of the proposed approach to item development include: (a) the process of item development can occur more quickly because items are generated automatically; (b) difficulties present in creating parallel test forms (i.e., items with the same content and difficulty level) can be addressed and minimized since item templates can be used to generate large numbers of parallel items; (c) issues of test security become less of a concern because many test items are now available for creating tests; and (d) the costs involved in the item development process can be significantly reduced. The benefits of AIG are also evident for computer-based, online, and continuous assessments, where a large number of items are required for maintaining a bank.

Although we may be able to generate an enormous number of items from a few task models and item templates, only a small number of these items might prove useful for supporting an automated generation process in AP. Nevertheless, this project has great potential to contribute to a better understanding of how to build robust task models and how to construct successful items from those task models.
In this way, the knowledge and experience gained about how to work with item modeling and item generation, together with the potential to fill the gap between ECD and AIG, might by itself be very valuable.

Author Note

This project was completed with funds provided to the first author as part of the College Board Research Grant Program. The purpose of the program is to encourage and support developing young research scientists who wish to gain experience in conducting a program of research. We would like to thank the College Board for their support. However, the authors are solely responsible for the methods, procedures, and interpretations expressed in this study. Our views do not necessarily reflect those of the College Board.

References

Camara, W. J., & Kimmel, E. W. (Eds.). (2005). Choosing students: Higher education admissions tools for the 21st century. Mahwah, NJ: Lawrence Erlbaum Associates.

Drasgow, F., Luecht, R. M., & Bennett, R. (2006). Technology and testing. In R. L. Brennan (Ed.), Educational measurement (pp. 471–516). Washington, DC: American Council on Education.

Gierl, M. J., Zhou, J., & Alves, C. B. (2008). Developing a taxonomy of item model types to promote assessment engineering. The Journal of Technology, Learning, and Assessment, 7(2), 1-51. Available from http://www.jtla.org.

Gitomer, D. H., & Bennett, R. E. (2002). Unmasking constructs through new technology, measurement theory, and cognitive sciences. In National Academy of Sciences (Ed.), Technology and assessment: Thinking ahead: Proceedings from a workshop (pp. 1-11). Washington, DC: National Academy Press.

Haladyna, T. (2004). Developing and validating multiple-choice test items. Mahwah, NJ: Lawrence Erlbaum Associates.

Luecht, R. M. (2008). Assessment engineering in test design, development, assembly, and scoring. Presentation at the East Coast Organization of Language Testers (ECOLT) Conference. Retrieved December 13, 2009 from: http://www.govtilr.org/Publications/ECOLT08-AEKeynote-RMLuecht07Nov08%5B1%5D.pdf

Johnson, M. S., & Sinharay, S. (2005). Calibration of polytomous item families using Bayesian hierarchical modeling. Applied Psychological Measurement, 29, 369–400.

Mislevy, R. J., Almond, R. G., & Lukas, J. F. (2003). A brief introduction to evidence-centered design. ETS Research Report RR03-16. Princeton, NJ: ETS.

Mislevy, R. J., & Riconscente, M. M. (2006). Evidence-centered assessment design. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development (pp. 61-90). Mahwah, NJ: Erlbaum.

Mislevy, R. J., Almond, R. G., & Steinberg, L. S. (2002). On the roles of task model variables in assessment design. In S. Irvine & P. Kyllonen (Eds.), Generating items for cognitive tests: Theory and practice (pp. 97-128). Hillsdale, NJ: Lawrence Erlbaum.

PytlikZillig, L. M., Bodvarsson, M., & Bruning, R. (2005). Technology-based education: Bringing researchers and practitioners together. Greenwich, CT: Information Age Publishing.

Williamson, D. M., Mislevy, R. J., & Bejar, I. I. (Eds.). (2006). Automated scoring of complex tasks in computer-based testing. Mahwah, NJ: Lawrence Erlbaum Associates.

Williamson, D. M., Johnson, M. S., Sinharay, S., & Bejar, I. (2002). Hierarchical IRT examination of isomorphic equivalence of complex constructed response tasks. Paper presented at the American Educational Research Association, New Orleans, LA.

Zhou, J. (2009). A review of assessment engineering principles with select applications to the Certified Public Accountant Examination. Technical report prepared for the American Institute of Certified Public Accountants. Retrieved March 30, 2009 from: http://www.cpa-exam.org/download/Zhou-A-Review-of-Assessment-Engineering.pdf

Appendix

The appendix contains a sample of items generated automatically.

Identify a statement that describes one aspect of an organism’s ability to capture free energy with the production of ATP from ADP.
a) The production of ATP from ADP is coupled to the release of energy from a proton gradient established by electron transport chains embedded in the inner membrane of the mitochondria*
b) The change in free energy during the production of ATP from ADP is used to create a proton gradient across a membrane in both the mitochondrion and the chloroplast
c) Facilitated diffusion of protons across the cell membrane and out of the cytoplasm of bacteria is coupled to the capture of free energy through production of ATP from ADP on the inner surface of the membrane
d) Active transport of protons across the cell membrane and out of the cytoplasm of bacteria is coupled to the capture of free energy through the production of ATP from ADP on the inner surface of the membrane

Identify a statement that describes one aspect of an organism’s ability to capture free energy with photosynthesis.
a) In the light reactions, electrons are transferred from the oxygen atom in a water molecule to NADP+ to establish an electrochemical gradient*
b) In chloroplasts, the conversion of ATP to ADP provides the free energy required to move protons across the thylakoid membrane and into the stroma
c) Electrons from O2 are used to replace the electrons from chlorophyll molecules that are energized by light photons and phosphorylate ADP in the electron transport chain
d) In photosystems I and II, solar energy elevates the free energy of the hydrogen atoms in water, which bond with the carbon atom in carbon dioxide, leaving the oxygen atoms in carbon dioxide to form molecular oxygen

Identify a statement that describes one aspect of an organism’s ability to capture free energy with cellular respiration.
a) Two pyruvate molecules produced during glycolysis are transported to the Krebs cycle, thereby connecting the process of glycolysis with the Krebs cycle*
b) The proton gradient across the mitochondrial membrane that surrounds the matrix is established by the free energy extracted from carbohydrates
c) In cellular respiration, electrons extracted from molecular oxygen are transferred to NADH and then to carbon atoms during a series of reactions in the electron transport chain
d) During cellular respiration, the pH of the mitochondrial matrix becomes lower than the pH of the inner membrane space, and the difference in pH is used to generate ATP from ADP
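
The generation step described in Step 4 (iterating over all combinations of template elements subject to constraints) can be illustrated with a short sketch. This is not IGOR's implementation, which the paper does not describe. The function names, the `is_allowed` callback, and the reading that each item pairs one key with an unordered set of three distractors are assumptions made for illustration. That reading is at least consistent with Table 1: with 10 keys and 5 distractors it yields the 100 items reported for the production of ATP from ADP. The key and distractor number ranges are taken from Figure 5. The sketch does not attempt to reproduce the constrained totals of 1,461 and 405.

```python
from itertools import combinations
from math import comb

# Key and distractor numbering per PROCESS value, as listed in Figure 5.
POOLS = {
    "the production of ATP from ADP": {"keys": range(1, 11), "distractors": range(1, 6)},
    "photosynthesis": {"keys": range(11, 24), "distractors": range(6, 18)},
    "cellular respiration": {"keys": range(24, 36), "distractors": range(18, 25)},
}


def generate(process, is_allowed=lambda key, distractors: True):
    """Yield (key, distractor triple) pairs for one PROCESS value.

    Assumption: an item is one key plus an unordered set of three
    distractors; option ordering is not counted as a separate item.
    `is_allowed` is a hypothetical hook for template constraints.
    """
    pool = POOLS[process]
    for key in pool["keys"]:
        for triple in combinations(pool["distractors"], 3):
            if is_allowed(key, triple):
                yield key, triple


def unconstrained_count(process):
    """Closed-form item count before any constraint is applied."""
    pool = POOLS[process]
    return len(pool["keys"]) * comb(len(pool["distractors"]), 3)


def key17_excludes_distractor14(key, distractors):
    """One Figure 5 constraint, in a hypothetical encoding:
    'As the [KEY] = 17, then [DISTRACTORS] != 14'."""
    return not (key == 17 and 14 in distractors)


if __name__ == "__main__":
    for process in POOLS:
        print(process, unconstrained_count(process))
    constrained = sum(1 for _ in generate("photosynthesis", key17_excludes_distractor14))
    print("photosynthesis with one constraint applied:", constrained)
```

Under these assumptions the unconstrained counts are 100, 2,860, and 420 for the three PROCESS values, so the reported totals of 1,461 and 405 reflect items removed by the constraints. Each constraint is expressed as a predicate so that additional rules can be combined without changing the enumeration loop.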