The Conceptual and Scientific Basis for Automated Scoring of Constructed Response Items

David M. Williamson
Senior Research Director, Applied Research & Development
Educational Testing Service, Princeton, NJ 08541
Phone: 609-734-1303  Email: dMwilliamson@ets.org

Aspirations
• Provide
  – A conceptual basis for scoring innovative tasks
  – An overview of current automated scoring methods
  – An outline of the empirical basis for the science of scoring
• Positioned
  – As a practical reference for those responsible for the design or selection of scoring methods
  – In anticipation of growth in computer delivery and automated scoring of innovative tasks
• Contexts & definitions
  – Automated scoring
  – Constructed response items (traditional and innovative)
  – Interests of state assessment

Overview
• Conceptual aspects of scoring
• Methods for automated scoring
• The science of scoring
• The future of automated scoring

Overview of the Conceptual Basis for Scoring
• Interplay between design and scoring
• Design methodologies
• Evidence Centered Design
• Illustrative examples

The Interplay Between Design and Scoring
• How NOT to score: come up with good tasks and then figure out how to score them. Instead, design tasks around the parts of the construct you want the scoring to represent.
• It's not just the scoring! Effective scoring is embedded in the context of an assessment design (Bennett & Bejar, 1998).
• The needs of the design drive the selection and application of scoring methodologies.
• Assessments targeting innovation have a greater need for such design rigor.
• Use of automated scoring places more demands on assessment design than human-scored tasks.
• The first step in successful scoring is good design!

Designing for Innovation
• Why not traditional design?
  – Emphasis on items encourages a tendency to jump too quickly from construct definition to item production, or to skip construct definition altogether
  – Formalized design helps item innovation develop as an explicit science rather than an implicit art
• Methodologies for innovative design
  – Assessment Engineering (Luecht, 2007)
  – BEAR (Wilson & Sloane, 2000)
  – Evidence Centered Design (Mislevy, Steinberg, & Almond, 2003)
• Principles of good design methodology (innovation)
  – Focus on use and construct
  – Defer item development until the evidential need is defined
  – Explicit and formal linkage between items and construct

The Centrality of Evidence as the Foundation of Scoring
• Evidence of what?
• What evidence is optimal, and of this, what is achievable?
• How do we transform data into evidence into action?
[Diagram: the assessment triangle of Cognition, Observation, and Interpretation, with evidence at its center (Pellegrino, Chudowsky, & Glaser, 2001)]

Evidence Centered Design Process
• Proficiency Model – what you want to measure
• Evidence Models – how to recognize and interpret observable evidence of unobservable proficiencies (evidence rules identify features; a statistical model accumulates them)
• Task Models – how to elicit valid and reliable evidence
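
To make the three ECD models concrete, the sketch below (not part of the original deck) represents them as simple Python data structures. The class and field names are illustrative assumptions, not an ETS implementation of ECD.

```python
# Illustrative data structures for the three ECD models (names and fields are assumptions).
from dataclasses import dataclass

@dataclass
class ProficiencyModel:
    """What we want to measure: one or more latent proficiencies."""
    proficiencies: list[str]                 # e.g., ["algebraic reasoning"]

@dataclass
class EvidenceModel:
    """How to recognize and interpret evidence of those proficiencies."""
    evidence_rules: dict[str, str]           # observable -> rule for extracting it from a response
    statistical_model: str                   # how observables update proficiency estimates (e.g., "IRT")
    linkage: dict[str, list[str]]            # observable -> proficiencies it informs

@dataclass
class TaskModel:
    """How to elicit the evidence: a template for tasks."""
    stimulus_features: list[str]             # variable characteristics of the task
    observables_produced: list[str]          # which observables responses to this task yield

# A design links the three: tasks elicit observables, evidence rules score them,
# and the statistical model accumulates them into proficiency estimates.
```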

ECD and the Assessment Triangle
• Cognition ↔ Proficiency Model – Messick (1994): "…what complex of knowledge, skills, or other attribute should be assessed, presumably because they are tied to explicit or implicit objectives of instruction or are otherwise valued by society."
• Interpretation ↔ Evidence Models (evidence rules, features, statistical model) – Messick (1994): "Next, what behaviors or performances should reveal those constructs…"
• Observation ↔ Task Models – Messick (1994): "…and what tasks or situations should elicit those behaviors?"

Chain of Reasoning: Validity
[Diagram: a validity chain running from the proficiency through multiple strands of evidence, each supported by task models and the tasks built from them]

Scoring Process
• Stage 1: Evidence Identification
  – Task-level scoring to summarize responses as "observables"
• Stage 2: Evidence Accumulation
  – Using these elements to estimate ability
[Diagram: tasks feed evidence identification; identified evidence feeds evidence accumulation into proficiency estimates]

Scoring in the Context of Design
• Design specifies hypotheses about examinees, what would constitute evidence, and relevant observations
• Scoring is the logical and empirical mechanism by which data becomes evidence about hypotheses
• The better the design, the easier the scoring for innovative items
• The following are some examples of design structures, with implications for the demands on scoring

Univariate with Single Linkage
[Diagram: item-level observables x1…xn each linked to a single test-level proficiency]

Univariate with Conditional Dependence
[Diagram: a single test-level proficiency with observables grouped by task (T1, T2, …, Ti), so that observables from the same task are conditionally dependent]

Multivariate with Single Linkage
[Diagram: multiple test-level proficiencies, with each item-level observable linked to exactly one proficiency]

Multivariate with Tree Structure
[Diagram: multiple proficiencies arranged hierarchically, with observables linked to proficiencies at different levels of the tree]

Multivariate with Conditional Dependence
[Diagram: multiple proficiencies with observables grouped by task, combining multivariate linkage with within-task conditional dependence]

Multivariate with Multiple Linkage
[Diagram: individual observables linked to more than one proficiency]
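
As a deliberately simplified illustration of the two-stage scoring process applied to the simplest structure above (univariate, single linkage), the sketch below, which is not from the original deck, treats evidence identification as turning each raw response into an observable and evidence accumulation as a number-right summary. All names and data are illustrative assumptions.

```python
# Minimal sketch of the two-stage scoring process (illustrative names, not an ETS system).

def identify_evidence(response: str, key: str) -> int:
    """Evidence identification: task-level scoring of one response into an observable (0/1)."""
    return int(response.strip().lower() == key.strip().lower())

def accumulate_evidence(observables: list[int]) -> float:
    """Evidence accumulation: combine observables into a test-level summary.
    With a univariate, single-linkage design a simple proportion-correct suffices;
    an operational program might use an IRT model instead."""
    return sum(observables) / len(observables)

responses = ["Paris", "4", "mitochondria"]
keys      = ["paris", "4", "Mitochondria"]

observables = [identify_evidence(r, k) for r, k in zip(responses, keys)]
print(observables)                       # [1, 1, 1]
print(accumulate_evidence(observables))  # 1.0
```

The more complex structures (conditional dependence, multivariate linkage) change only the accumulation step: observables from the same task cannot be treated as independent, and a single observable may update more than one proficiency estimate.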

Parallelism of Test-Level and Item-Level Considerations
• The examples present test-level complexities, but the same issues hold true at the item level
• The outcome of scoring a task might be one observable or many
• Observables may have straightforward information feeding into them, or complex information

The Conceptual Basis of Scoring: Good Scoring Begins with Good Design
• Good scoring comes from a design that:
  – Defines appropriate proficiencies for the intended score use
  – Specifies relevant evidence making distinctions among ability levels
  – Presents situations that elicit targeted distinctions in behavior
• Formal design methodology encourages good scoring
• Design decisions can make scoring more or less complex
  – Number and structure of proficiencies in the model
  – Conditional dependence of multiple observables from a single task
  – Conditionally dependent observables from a single task contributing to multiple proficiencies
  – Extent to which a single observable relates to multiple proficiencies

Overview of Automated Scoring Methods
• Value and challenges of automated scoring
• Commercial systems
• Building your own

The Promise of Automated Scoring
• Quality
  – Can attend to things that humans can't
  – Greater consistency and objectivity of scores
  – Fully transparent scoring justification
• Efficiency
  – Faster score turnaround
  – Lower cost
  – Easier scheduling
• Construct representation in scores
  – Enables use of CR items (expanded construct representation) where previously infeasible
  – Provision of performance feedback

Challenges of Automated Scoring
• Quality
  – Humans do some things better, sometimes dramatically so
  – Consistency is a liability when some element of scoring is wrong, leading to potential bias
  – May not handle unusual responses well
• Efficiency
  – Cost of development can be high
  – Development timeframes can be lengthy
• Construct representation in scores
  – Likely a somewhat different construct than results from human scoring
  – Scores cannot be defended on the basis of process and "résumé"

Classes of Automated Scoring
• Response-type systems (commercial systems) – response types common across testing purposes, programs, and populations, yielding high generalizability:
  – Essays
  – Correct answers in textual responses
  – Mathematical equations, plots, and figures
  – Speech
• Simulation-based systems (custom systems) – realistic scenarios; limited generalizability of scoring

Automated Scoring of Essays
• The first and most common application, with more than 12 commercial systems
• Most widely known:
  – e-rater® (Burstein, 2003; Attali & Burstein, 2006)
  – Intelligent Essay Assessor™ (Landauer, Laham, & Foltz, 2003)
  – IntelliMetric™ (Elliot, 2003)
  – Project Essay Grade (Page, 1966, 1968, 2003)
• Commonalities
  – Computer-identifiable features intended to be construct relevant
  – Statistical tools for accumulating these features into a summary score
• Differences
  – Nature, computation, and relative weighting of features
  – Statistical methods used to derive summary scores
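
The commonalities above can be illustrated with a toy example. The sketch below is not from the deck and is not how any commercial engine works: it computes a few crude, computer-identifiable proxy features for an essay and combines them with a least-squares fit to human scores. Real systems use far richer NLP features and more careful statistical modeling; the features, training data, and scores here are invented.

```python
# Toy illustration of feature extraction + statistical accumulation for essay scoring.
# Features, data, and weights are illustrative assumptions, not those of any real engine.
import numpy as np

def extract_features(essay: str) -> np.ndarray:
    words = essay.split()
    n_words = len(words)
    avg_word_len = sum(len(w) for w in words) / max(n_words, 1)
    type_token_ratio = len(set(w.lower() for w in words)) / max(n_words, 1)
    return np.array([n_words, avg_word_len, type_token_ratio])

# Pretend training data: feature vectors for essays plus hypothetical human scores (1-6 scale).
X = np.array([extract_features(e) for e in [
    "Short essay with few words.",
    "A somewhat longer essay that develops an idea with a bit more vocabulary and detail.",
    "An extended essay that elaborates several ideas, uses varied vocabulary, and sustains an argument across sentences.",
]])
y = np.array([2.0, 4.0, 5.0])

# Fit least-squares weights (with an intercept) mapping features to human scores.
X1 = np.hstack([X, np.ones((len(X), 1))])
weights, *_ = np.linalg.lstsq(X1, y, rcond=None)

new_essay = "A medium length response with moderate vocabulary and some development of ideas."
features = np.append(extract_features(new_essay), 1.0)
print(round(float(features @ weights), 2))  # predicted score on the human scale
```

The differences among commercial engines listed above live exactly in these two steps: which features are computed and how they are weighted into a summary score.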

Applications of Automated Essay Scoring
• Target traditional academic essays, emphasizing writing quality over content, though measures of content exist in each system
  – Example: "What I did last summer"
• Low-stakes learning/practice
  – WriteToLearn™ (IEA)
  – MyAccess!™ (IntelliMetric)
  – Criterion™ (e-rater)
• High-stakes assessment
  – GMAT® with e-rater (1999), then IntelliMetric (2006)
  – GRE® with e-rater (2008)
  – TOEFL® independent writing with e-rater (2009)
  – Pearson Test of English with IEA (2009)

Strengths and Limitations of Automated Scoring of Essays
• Strengths
  – Evaluating traditional academic essays emphasizing fluency
  – Empirical performance typically on par with human raters
  – Performance feedback related to fluency
• Limitations
  – Evaluation of content accuracy, audience, rhetorical style, and creative or literary writing
  – Content understanding is relatively primitive
  – Potential vulnerability to score manipulation
  – Does not detect all types of errors, nor classify all errors correctly
  – Unexplained differences in agreement by demographic variables (Bridgeman, Trapani, & Attali, in press)

Automated Scoring for Correct Answers
• Designed to score short textual responses for the correctness of the information in the response
• Systems for scoring include:
  – Automark (Mitchell, Russell, Broomhead, & Aldridge, 2002)
  – c-rater (Leacock & Chodorow, 2003)
  – Oxford-UCLES (Sukkarieh, Pulman, & Raikes, 2003)
• Only c-rater is known to have been deployed commercially

Example Question & Rubric
• Question: "Identify TWO common ways the body maintains homeostasis during exercise."
  – Student responds with computer-entered free text
• Scoring rubric looks for any of these concepts:
  – Sweating (perspiration)
  – Increased breathing rate (respiration)
  – Decreased digestion
  – Increased circulation rate (heart speeds up)
  – Dilation of blood vessels in skin (increased blood flow)
• Rubric:
  – 2 points – two key elements
  – 1 point – one key element
  – 0 points – other
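
A minimal sketch of how a rubric like the one above might be automated with naive keyword matching follows; it is illustrative only and far cruder than c-rater, which uses deeper linguistic analysis (paraphrase, synonyms, syntactic variation). The concept keyword lists are assumptions made for the example.

```python
# Naive concept-matching sketch for the homeostasis item (illustrative only;
# real short-answer engines such as c-rater use much richer linguistic analysis).

CONCEPT_KEYWORDS = {
    "sweating":              ["sweat", "perspir"],
    "increased_breathing":   ["breath", "respirat"],
    "decreased_digestion":   ["digest"],
    "increased_circulation": ["circulat", "heart rate", "heart speeds"],
    "vasodilation":          ["blood vessel", "blood flow", "dilat"],
}

def score_response(response: str) -> int:
    """Return 0, 1, or 2 points based on how many distinct key concepts appear."""
    text = response.lower()
    matched = {
        concept
        for concept, keywords in CONCEPT_KEYWORDS.items()
        if any(kw in text for kw in keywords)
    }
    return min(len(matched), 2)   # 2 points for two or more concepts, 1 for one, 0 otherwise

print(score_response("You sweat and your heart rate goes up."))   # 2
print(score_response("The body sweats a lot."))                   # 1
print(score_response("You get tired."))                           # 0
```

Simple matching of this kind is easily fooled (e.g., by negation such as "you do not sweat"), which is one reason the systematic errors and model-building effort noted in the next slide arise in practice.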

Strengths and Limitations of Automated Scoring of Correct Answers
• Strengths
  – Empirical performance can be on par with human graders
  – Targets correct content in a way that automated essay scoring systems do not
  – Emphasizes principles of good item design often overlooked in human scoring
• Limitations
  – Success on items is not always predictable
  – Unlike essays, the expectation is for near-perfect agreement
  – Errors tend to be systematic
  – Additional controls on item production are an initial challenge
  – Model building can be labor intensive

Automated Scoring of Mathematical Responses
• A variety of systems are available that are designed to score multiple kinds of mathematical responses
  – Equations
  – Graphs
  – Geometric figures
  – Numeric responses
• Numerous systems available (see Steinhaus, 2008)
  – Maple TA
  – m-rater (Singley & Bennett, 1998)
• Operational deployment (sample)
  – State assessment (m-rater)
  – Classroom learning (Maple TA)

Strengths and Limitations of Automated Scoring of Mathematical Responses
• Strengths
  – Empirical performance typically better than human raters
  – Can compute the mathematical equivalence of unanticipated representations (e.g., unusual forms of an equation)
  – Partial credit scoring and performance feedback
• Limitations
  – Some topics are more challenging to complete in the computer interface (e.g., geometry)
  – "Show your work" can be more challenging to represent in the administration interface
  – Mixed responses (text and equations) are still highly limited

Automated Scoring of Spoken Responses
• Systems are available to score
  – Predictable responses: read-aloud or describe a picture
  – Unpredictable responses: such as "If you could go anywhere in the world, where would it be and why?"
• Scoring systems include
  – Versant (Bernstein et al., 2000)
  – SpeechRater (Zechner et al., 2009)
  – EduSpeak (Franco et al., 2000)
• Operational deployment
  – Pearson Test of English (Versant, 2009)
  – PhonePass SET-10 (Versant, 2002)
  – TOEFL Practice Online (SpeechRater, 2007)

Strengths and Limitations of Automated Scoring of Spoken Responses
• Strengths
  – High accuracy for predictable speech
  – Performance improving rapidly year over year
• Limitations
  – Not as good as human scoring for unpredictable speech
  – Data intensive, requiring large data sets and presenting challenges in calibrating for a sufficient range of accented speech
  – The state of the art of speech recognition for non-native speakers is substantially behind that for native speakers of English
  – Limited content representation

An (Oversimplified) Approach to Building an Automated Scoring System for Innovative Tasks
• Innovation is, by definition, unique, so this is a broad perspective
• Design (see previous discussion)
  – A good design will specify the evidence needed, so that innovation feeds the design rather than the reverse
• Feature extraction
  – Represent relevant aspects of performance as variables
• Synthesizing features into one or more task "scores"
  – A multitude of methods, from the mundane to the exotic

Turning Features into Scores
• There are many ways to turn features into scores, from the traditional to the innovative: number-right counts, weighted counts, item response theory, multivariate IRT, regression, factor analytic methods, rule-based classification, classification and regression trees, cluster analysis, neural networks, Kohonen networks, support vector machines, Bayesian networks, rule-space, Arpeggio
• The key question is how well the method matches the intent of the design

Examples of Successful Systems
• Graphical designs in architectural licensure (Braun, Bejar, & Williamson, 2006)
  – Architect Registration Examination by NCARB
  – Examinees construct solutions on computer with CAD
  – Rule-based scoring from algorithmically-based features
• Simulations of patients for physician licensure (Margolis & Clauser, 2006)
  – United States Medical Licensing Examination™ (NBME)
  – Order diagnostic and treatment procedures with simulated patients
  – Regression-based scoring from algorithmically-based features

More Examples of Successful Systems
• Simulations of accounting problems for CPA licensure (DeVore, 2002)
  – Uniform CPA Examination (AICPA)
  – Conduct direct work applying accounting principles
  – Rule-based scoring from algorithmic features
• Assessment of information and communications technology literacy for collegiate placement (Katz & Smith-Macklin, 2007)
  – iSkills™
  – Interactive simulations around information problems
  – Number-right scoring on the basis of algorithmic features
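
To make the contrast among feature-accumulation methods concrete, here is a small sketch, not drawn from any of the systems named above, showing the same feature vector turned into a score two ways: a rule-based classification and a regression-style weighted sum, mirroring the rule-based and regression-based scoring mentioned in the examples. The feature names, rules, and weights are invented for illustration.

```python
# Two simple ways to turn extracted features into a task score (illustrative only;
# the features, rules, and weights are invented, not those of any operational system).

features = {
    "requirements_met": 7,    # e.g., count of design requirements satisfied
    "critical_errors": 1,     # e.g., count of fatal flaws in the solution
    "efficiency": 0.8,        # e.g., a 0-1 index of solution efficiency
}

def rule_based_score(f: dict) -> str:
    """Rule-based classification in the spirit of mandatory/critical-feature rules."""
    if f["critical_errors"] > 1 or f["requirements_met"] < 5:
        return "Unacceptable"
    if f["critical_errors"] == 0 and f["requirements_met"] >= 8:
        return "Acceptable"
    return "Indeterminate"    # borderline cases might be routed to human review

def regression_score(f: dict) -> float:
    """Regression-style weighted sum; weights would normally be estimated from data."""
    weights = {"requirements_met": 0.4, "critical_errors": -1.0, "efficiency": 2.0}
    intercept = 1.0
    return intercept + sum(weights[k] * f[k] for k in weights)

print(rule_based_score(features))             # Indeterminate
print(round(regression_score(features), 2))   # 4.4
```

Which synthesis method is appropriate is a design question, not a technical one: the method should reproduce the evidentiary distinctions the design calls for, not simply the most sophisticated model available.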

Automated Scoring Methods: A Variety of Tools for Innovation
• Tools should be chosen carefully based on need
• A variety of automated scoring systems are commercially available
• Innovative items might incorporate aspects of existing systems, or may require customized scoring
• Innovation should be driven by good design, and innovative items do not necessarily require innovative scoring
• Even for targeted innovation, parsimony is a virtue
• There are a number of successful models to follow in designing innovative items with automated scoring

Overview of the Science of Scoring
• Some characteristics required to qualify as science
• Scoring as science
• Some pitfalls to avoid in scientific inquiry for scoring

To Be Science
• We must have a theory about the natural world that is capable of explaining and predicting phenomena
• The theory must be subject to support or refutation through testing of specific hypotheses through empirical research
• Theories must be modified, or abandoned in favor of competing theories, based on the outcomes of empirical testing of hypotheses

Scoring as Science
• Theory
  – The design of an innovative task and its corresponding scoring constitutes a theory of performance in the domain
• Falsification
  – Experimentation (pilot testing, etc.) allows for confirmation or falsification of hypotheses about how examinees of a certain ability or understanding would behave, distinguishing them from examinees of alternate ability or understanding
• Modification
  – Item designs and scoring can be modified and/or abandoned as a result, leading to better items and scoring

Explanation: Scoring as Theory
• Scoring as part of a coherent theory of assessment for a construct of interest (design)
• Construct representation
  – What does an automated score mean?
    • How complete is the representation of the construct?
    • Are they direct measures or proxies (e.g., essay length)?
  – What is the construct under human scoring?
    • Judgment scoring vs. confirmation: can reasonable experts disagree on the score?
    • How much do we really know about operational human scoring?
    • What are the implications of using human scores to produce an automated scoring system?
• What hypotheses are presented to drive falsification?

Prediction: Empirical Support or Falsification of Scoring
• Does scoring distinguish among examinees of differing ability?
  – Score distributions; item difficulty; unused "distracters"; biserial correlations; monotonically increasing ability/performance relationships
• Do scores relate to human scores as predicted?
  – Agreement rates: correlation, weighted kappa, etc.
  – Distributions (means, deviations, etc.)
  – Differences by subgroup in the above
• Do scores follow predicted patterns of relationship with external validity criteria (human and other measures)?
• What is the impact of using automated scores on reported scores, compared to all-human scoring?

Example: Evaluation Criteria for e-rater (Williamson, 2009)
• Construct relevance
• Empirical evidence of validity
  – Relationship to human scores
    • Exact/adjacent agreement [no standard]
    • Pearson correlation ≥ 0.70
    • Weighted kappa ≥ 0.70
    • Reduction compared to human-human agreement < 0.10
    • Difference in standardized mean score < 0.15
    • Subgroups (fairness): difference in standardized mean score < 0.10
  – Relationship to external criteria
    • Impact analysis
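
The criteria above are straightforward to compute once automated and human scores are paired. Below is a sketch of the main statistics (exact and adjacent agreement, Pearson correlation, weighted kappa, standardized mean difference) using common scientific Python libraries; the thresholds quoted are those from the slide above, but the data are invented, quadratic kappa weights and the human-score standard deviation in the denominator are just one common set of choices, and this is not the operational ETS evaluation code.

```python
# Sketch of evaluation statistics for comparing automated and human scores.
# Data are invented; thresholds follow the criteria quoted above.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

human   = np.array([3, 4, 2, 5, 4, 3, 4, 2, 5, 3])
machine = np.array([3, 4, 3, 5, 4, 4, 4, 2, 4, 3])

exact    = np.mean(human == machine)
adjacent = np.mean(np.abs(human - machine) <= 1)
r, _     = pearsonr(human, machine)
# Weighted kappa (quadratic weights are one common choice for essay score scales).
qwk      = cohen_kappa_score(human, machine, weights="quadratic")
# Standardized mean difference; here standardized by the human-score standard deviation.
smd      = (machine.mean() - human.mean()) / human.std(ddof=1)

print(f"exact agreement    {exact:.2f}")
print(f"adjacent agreement {adjacent:.2f}")
print(f"Pearson r          {r:.2f}   (criterion: >= 0.70)")
print(f"weighted kappa     {qwk:.2f}   (criterion: >= 0.70)")
print(f"std. mean diff.    {smd:.2f}   (criterion: < 0.15)")

# Large discrepancies might be routed to a second human rater (an adjudication threshold),
# as discussed on the next slide.
flagged = int(np.sum(np.abs(human - machine) >= 2))
print(f"responses flagged for adjudication: {flagged}")
```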

Modification: Changing Scoring
• Item level
  – Collapsing categories in scoring features that didn't discriminate (e.g., from 5 categories to 3)
  – Modifying the scoring so that a response thought to indicate lower ability is designated as higher, and vice versa (if consistent with a revised understanding of the theory)
  – Changes to task conditions to facilitate understanding and task completion
• Aggregate level
  – Implementation model: confirmatory, contributory, or automated only
  – Adjudication threshold

Importance of Scientific Rigor: Avoiding Scoring Pitfalls
• Shallow empiricism
  – Percent agreements
  – Aggregated data
  – Overgeneralization
• Confounding outcomes and process
  – Construct conclusions from measures of association
• Validity by design alone
  – Assuming success of the construct by design
• Predisposition for human scores
  – Excessive criticism/confidence (Williamson, Bejar, & Hone, 1999)

The Science of Scoring: Design Innovation Paired with Traditional Empiricism
• Item designs, with their associated scoring, as elements of a theory of construct proficiency
• Rigor and thoroughness in efforts to empirically falsify, beyond routine evaluations
• Willingness to modify or abandon unsupported theories in favor of better models

What Will the Future of Automated Scoring Bring?
• We will be struck by what will become possible
  – Scoring unpredictable speech by non-native speakers
  – Content scoring from both spoken and written works
  – Unprecedented delivery technologies and formats
• We will be frustrated by what isn't possible or practical
  – Automated systems still won't do all that humans can do
  – Statistical methods will advance, but will not be a panacea
  – We won't escape the challenges of time and money
• Unpredicted innovations will be the most exciting
  – Assessments distributed over time
  – Intersection of learning progressions and intelligent tutoring
  – Incorporation of public content into assessment (Wiki-test)
  – Assessment through data mining of academic activities

Automated Scoring: 1941
"The International Business Machine (I.B.M.) Scorer (1938) uses a carefully printed sheet … upon which the person marks all his answers with a special pencil. The sheet is printed with small parallel lines showing where the pencil marks should be placed to indicate true items, false items, or multiple-choices. To score this sheet, it is inserted in the machine, a lever is moved, and the total score is read from a dial. The scoring is accomplished by electrical contacts with the pencil marks. … Corrections for guessing can be obtained by setting a dial on the machine. By this method, 300 true–false items can be scored simultaneously. The sheets can be run through the machine as quickly as the operator can insert them and write down the scores. The operator needs little special training beyond that for clerical work." (Greene, 1941, p. 134)

Conclusion
• Good scoring begins with good design
• There are a variety of methods, both commercial and custom-built, for scoring innovative and/or traditional tasks
• Theory-driven empiricism is the core of scoring as science: construct is good, but evidence is better
• The future will surprise, and disappoint: technology will change quickly but psychometric infrastructure will change slowly, leading to an expanding gap between what is possible and what is practical

References
Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater v.2. Journal of Technology, Learning, and Assessment, 4(3).
Bennett, R. E., & Bejar, I. I. (1998). Validity and automated scoring: It's not only the scoring. Educational Measurement: Issues and Practice, 17(4), 9-17.
Bernstein, J., De Jong, J., Pisoni, D., & Townshend, B. (2000). Two experiments on automatic scoring of spoken language proficiency. In Proceedings of InSTIL2000 (Integrating Speech Technology in Learning) (pp. 57-61). Dundee, Scotland: University of Abertay.
Braun, H., Bejar, I. I., & Williamson, D. M. (2006). Rule-based methods for automatic scoring: Application in a licensing context. In D. M. Williamson, R. J. Mislevy, & I. I. Bejar (Eds.), Automated scoring of complex tasks in computer-based testing. Mahwah, NJ: Lawrence Erlbaum Associates.
Bridgeman, B., Trapani, C., & Attali, Y. (in press). Comparison of human and machine scoring of essays: Differences by gender, ethnicity, and country. Applied Measurement in Education.
Burstein, J. (2003). The e-rater® scoring engine: Automated essay scoring with natural language processing. In M. D. Shermis & J. C. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 113-121). Mahwah, NJ: Lawrence Erlbaum Associates.
DeVore, R. (2002, April). Considerations in the development of accounting simulations. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, LA.
Elliot, S. (2003). IntelliMetric: From here to validity. In M. D. Shermis & J. C. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective. Mahwah, NJ: Lawrence Erlbaum Associates.
Franco, H., Abrash, V., Precoda, K., Bratt, H., Rao, R., Butzberger, J., Rossier, R., & Cesari, F. (2000). The SRI EduSpeak™ system: Recognition and pronunciation scoring for language learning. In Proceedings of InSTILL (Integrating Speech Technology in Language Learning) (pp. 123-128). Scotland.
Greene, E. B. (1941). Measurements of human behavior. New York: The Odyssey Press.
Katz, I. R., & Smith-Macklin, A. (2007). Information and communication technology (ICT) literacy: Integration and assessment in higher education. Journal of Systemics, Cybernetics, and Informatics, 5(4), 50-55.
Landauer, T. K., Laham, D., & Foltz, P. W. (2003). Automated scoring and annotation of essays with the Intelligent Essay Assessor. In M. D. Shermis & J. C. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 87-112). Mahwah, NJ: Lawrence Erlbaum Associates.

References (cont.)
Luecht, R. M. (2007, April). Assessment engineering in language testing: From data models and templates to psychometrics. Invited paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, IL.
Margolis, M. J., & Clauser, B. E. (2006). A regression-based procedure for automated scoring of a complex medical performance assessment. In D. M. Williamson, R. J. Mislevy, & I. I. Bejar (Eds.), Automated scoring of complex tasks in computer-based testing (pp. 123-167). Mahwah, NJ: Lawrence Erlbaum Associates.
Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), 13-23.
Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2003). On the structure of educational assessments. Measurement: Interdisciplinary Research and Perspectives, 1(1), 3-62.
Page, E. B. (1966). The imminence of grading essays by computer. Phi Delta Kappan, 48, 238-243.
Page, E. B. (1968). The use of the computer in analyzing student essays. International Review of Education, 14(2), 210-225.
Page, E. B. (2003). Project Essay Grade: PEG. In M. D. Shermis & J. C. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 43-54). Mahwah, NJ: Lawrence Erlbaum Associates.
Pellegrino, J. W., Chudowsky, N., & Glaser, R. (2001). Knowing what students know. Washington, DC: National Academy Press.
Steinhaus, S. (2008, July). Comparison of mathematical programs for data analysis. Retrieved August 13, 2010, from http://www.scientificweb.com/ncrunch/
Sukkarieh, J. Z., Pulman, S. G., & Raikes, N. (2003). Auto-marking: Using computational linguistics to score short, free text responses. Paper presented at the 29th annual conference of the International Association for Educational Assessment (IAEA), Manchester, UK.
Williamson, D. M. (2009, April). A framework for evaluating and implementing automated scoring. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA.
Williamson, D. M., Bejar, I. I., & Hone, A. S. (1999). 'Mental model' comparison of automated and human scoring. Journal of Educational Measurement, 36(2), 158-184.
Wilson, M., & Sloane, K. (2000). From principles to practice: An embedded assessment system. Applied Measurement in Education, 13(2), 181-208.
Zechner, K., Higgins, D., Xi, X., & Williamson, D. M. (2009). Automatic scoring of non-native spontaneous speech in tests of spoken English. Speech Communication, 51(10), 883-895.