The Conceptual and Scientific
Basis for Automated Scoring of
Constructed Response Items
David M. Williamson
Senior Research Director
Applied Research & Development
Educational Testing Service
Princeton, NJ 08541
Phone: 609-734-1303
Email: dMwilliamson@ets.org
Aspirations
• Provide
– Conceptual basis for scoring innovative tasks
– Overview of current automated scoring methods
– Outline of the empirical basis for the science of scoring
• Positioned
– As a practical reference for those responsible for design or
selection of scoring methods
– In anticipation of growth in computer delivery and automated
scoring of innovative tasks
• Contexts & definitions
– Automated scoring
– Constructed response items (traditional and innovative)
– Interests of state assessment
Overview
• Conceptual aspects of scoring
• Methods for automated scoring
• The science of scoring
• The future of automated scoring
Overview of the Conceptual Basis
for Scoring
• Interplay between design and scoring
• Design methodologies
• Evidence Centered Design
• Illustrative Examples
The Interplay Between Design
and Scoring
• How NOT to score: Come up with good tasks and then figure
out how to score them. Instead, design tasks around the
parts of the construct you want the scoring to represent
• It’s not just the scoring! Effective scoring is embedded in the
context of an assessment design (Bennett & Bejar, 1998)
• The needs of the design drive the selection and application of
scoring methodologies
• Assessments targeting innovation have greater need for such
design rigor
• Use of automated scoring places more demands on
assessment design than human scored tasks
The first step in successful scoring is good design!
Designing for Innovation
• Why not traditional design?
– Emphasis on items encourages a tendency to jump too quickly from construct definition to item production, or to skip construct definition altogether
– Formalized design helps item innovation develop as explicit
science rather than implicit art
• Methodologies for innovative design
– Assessment Engineering (Luecht, 2007)
– BEAR (Wilson & Sloane, 2000)
– Evidence Centered Design (Mislevy, Steinberg, & Almond,
2003)
• Principles of good design methodology (innovation)
– Focus on use and construct
– Defer item development until evidential need is defined
– Explicit and formal linkage between items and construct
The Centrality of Evidence as the
Foundation of Scoring
• Evidence of what?
• What evidence is optimal, and of this, achievable?
• How do we transform data into evidence into action?
[Figure: the Assessment Triangle, with vertices Cognition, Observation, and Interpretation (Pellegrino, Chudowsky, & Glaser, 2001, Knowing What Students Know).]
Evidence Centered Design Process
[Figure: ECD components: a Proficiency Model, Evidence Models (evidence rules applied to response features, plus a statistical model), and Task Models.]
• Proficiency Model – What you want to measure
• Evidence Model – How to recognize & interpret
observable evidence of unobservable proficiencies
• Task Models – How to elicit valid and reliable evidence
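To make the three models concrete, the sketch below (an illustration added here, not part of the original deck; all names and values are hypothetical) renders the ECD linkage as simple Python data structures.

```python
from dataclasses import dataclass

# Hypothetical, minimal rendering of the three ECD models and how they link.
@dataclass
class ProficiencyModel:            # what you want to measure
    variables: list

@dataclass
class EvidenceModel:               # how to recognize & interpret evidence
    features: list                 # aspects of the work product that can be extracted
    evidence_rules: dict           # feature -> observable (task-level scoring)
    stat_model: str                # accumulation method, e.g. "IRT" or "regression"
    informs: list                  # proficiency variables this evidence updates

@dataclass
class TaskModel:                   # how to elicit valid and reliable evidence
    stimulus_spec: str
    work_product: str
    evidence_model: EvidenceModel

writing = ProficiencyModel(variables=["writing_quality"])
essay_evidence = EvidenceModel(
    features=["grammar_errors", "organization", "word_variety"],
    evidence_rules={"grammar_errors": "errors per 100 words"},
    stat_model="regression",
    informs=["writing_quality"],
)
essay_task = TaskModel(
    stimulus_spec="persuasive prompt, 30 minutes",
    work_product="essay text",
    evidence_model=essay_evidence,
)
print(essay_task.evidence_model.informs)   # ['writing_quality']
```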
ECD and the Assessment Triangle
[Figure: the ECD models mapped onto the vertices of the Assessment Triangle.]
• Cognition / Proficiency Model
– Messick (1994): “…what complex of knowledge, skills, or other attributes should be assessed, presumably because they are tied to explicit or implicit objectives of instruction or are otherwise valued by society.”
• Evidence / Evidence Models
– Messick (1994): “Next, what behaviors or performances should reveal those constructs….”
• Observation / Task Models
– Messick (1994): “…and what tasks or situations should elicit those behaviors?”
Chain of Reasoning: Validity
[Figure: chain of reasoning linking the proficiency, through evidence, to task models and the tasks generated from them.]
Scoring Process
[Figure: scoring flow from task response, through evidence identification and evidence accumulation, to proficiency estimates.]
• Stage 1: Evidence Identification
– Task level scoring to summarize responses as
“observables”
• Stage 2: Evidence Accumulation
– Using these elements to estimate ability
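As a concrete (and deliberately oversimplified) sketch of the two stages, the toy code below uses a keyword rule for evidence identification and a weighted sum as a stand-in for a statistical accumulation model; none of it describes a specific operational system.

```python
# Stage 1: Evidence identification -- task-level rules summarize a raw
# response as one or more "observables". (Toy keyword rule, not a real system.)
def identify_evidence(response):
    text = response.lower()
    return {
        "mentions_sweating": int("sweat" in text),
        "adequate_length": int(len(text.split()) >= 5),
    }

# Stage 2: Evidence accumulation -- observables from many tasks are combined
# into an ability estimate (a weighted sum here, standing in for an IRT model).
def accumulate(observable_sets, weights):
    return sum(weights[name] * value
               for observables in observable_sets
               for name, value in observables.items())

responses = ["The body sweats to cool itself down during exercise.",
             "It just works."]
observable_sets = [identify_evidence(r) for r in responses]
ability_proxy = accumulate(observable_sets,
                           weights={"mentions_sweating": 1.0, "adequate_length": 0.5})
print(observable_sets, ability_proxy)
```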
Scoring in the Context of Design
• Design specifies hypotheses about examinees, what would
constitute evidence, and relevant observations
• Scoring is the logical and empirical mechanism by which
data becomes evidence about hypotheses
• The better the design, the easier the scoring for innovative
items
• The following are some examples of design structures,
with implications for the demands on scoring
Univariate with Single Linkage
[Figure: item-level and test-level view of a univariate structure: each item contributes a single observable linked to one test-level proficiency.]
Univariate with Conditional
Dependence
[Figure: univariate structure with conditional dependence: a single proficiency linked to multiple observables drawn from the same tasks (T1, T2, …, Ti).]
Multivariate with Single Linkage
[Figure: multivariate structure with single linkage: multiple test-level proficiencies, with each observable linked to exactly one of them.]
Multivariate with Tree Structure
[Figure: multivariate structure in which the test-level proficiencies are arranged in a tree, with item-level observables linked to them.]
Multivariate with Conditional
Dependence
[Figure: multivariate structure with conditional dependence: multiple proficiencies informed by conditionally dependent observables from the same tasks (T1, …, Ti).]
Multivariate with Multiple Linkage
[Figure: multivariate structure with multiple linkage: proficiencies o1–o5 and observables x1–x20, with observables linked to more than one proficiency.]
Parallelism of test level and item
level considerations
• The examples present test-level complexities, but the same
issues hold true at the item level
• The outcome of scoring a task might be one observable or
many
• Observables may have straightforward or complex information feeding into them
The Conceptual Basis of Scoring: Good
Scoring Begins with Good Design
• Good scoring comes from a design that:
– Defines appropriate proficiencies for intended score use
– Specifies relevant evidence that makes distinctions among ability levels
– Presents situations that elicit targeted distinctions in behavior
• Formal design methodology encourages good scoring
• Design decisions can make scoring more or less
complex
– Number and structure of proficiencies in the model
– Conditional dependence of multiple observables from a single task
– Conditionally-dependent observables from a single task
contributing to multiple proficiencies
– Extent to which a single observable relates to multiple proficiencies
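One way to see how these design decisions drive scoring complexity is to encode the evidence structure explicitly. The sketch below is a hypothetical illustration (the observable, task, and proficiency names are invented) of checking for the complexity drivers listed above.

```python
from collections import Counter

# Hypothetical encoding of a design's evidence structure: each observable is
# mapped to the task it comes from and the proficiency variable(s) it informs.
design = {
    "obs1": {"task": "T1", "informs": ["algebra"]},
    "obs2": {"task": "T1", "informs": ["algebra"]},              # two observables from one task
    "obs3": {"task": "T2", "informs": ["algebra", "modeling"]},  # one observable, two proficiencies
}

proficiencies = {p for spec in design.values() for p in spec["informs"]}
multivariate = len(proficiencies) > 1

observables_per_task = Counter(spec["task"] for spec in design.values())
dependence_risk = [t for t, n in observables_per_task.items() if n > 1]

multiply_linked = [o for o, spec in design.items() if len(spec["informs"]) > 1]

print("multivariate:", multivariate)                                     # True
print("tasks with possible conditional dependence:", dependence_risk)    # ['T1']
print("observables linked to multiple proficiencies:", multiply_linked)  # ['obs3']
```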
Overview of Automated Scoring
Methods
• Value and challenges of automated scoring
• Commercial systems
• Building your own
The Promise of Automated
Scoring
• Quality
– Can attend to things that humans can’t
– Greater consistency and objectivity of scores
– Fully transparent scoring justification
• Efficiency
– Faster score turnaround
– Lower cost
– Easier scheduling
• Construct representation in scores
– Enable use of CR items (expanded construct representation)
where previously infeasible
– Provision of performance feedback
Challenges of Automated Scoring
• Quality
– Humans do some things better, sometimes dramatically
– Consistency is a liability when some element of scoring is
wrong, leading to potential bias
– May not handle unusual responses well
• Efficiency
– Cost of development can be high
– Development timeframes can be lengthy
• Construct representation in scores
– Likely a somewhat different construct than from human
scoring
– Scores cannot be defended on the basis of process and
“résumé”
Classes of Automated Scoring
• Response-type systems (commercial systems)
– Response type common across testing purposes, programs, and populations, yielding high generalizability:
• Essays
• Correct answers in textual responses
• Mathematical equations, plots, and figures
• Speech
• Simulation-based (custom systems)
– Realistic scenarios; limited generalizability of scoring
Automated Scoring of Essays
• First and most common application of automated scoring, with more than 12 commercial systems
• Most widely known:
– e-rater® (Burstein, 2003; Attali & Burstein, 2006)
– Intelligent Essay Assessor™ (Landauer, Laham, & Foltz, 2003)
– IntelliMetric™ (Elliot, 2003)
– Project Essay Grade (Page, 1966; 1968; 2003)
• Commonalities
– Computer-identifiable features intended to be construct
relevant;
– Statistical tools for accumulating these into a summary score (see the sketch below)
• Differences
– Nature, computation and relative weighting of features
– Statistical methods used to derive summary scores
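The sketch below makes the shared pipeline concrete: extract a few construct-relevant features, then fit a statistical model that maps them to human scores. The features and the toy training data are invented for illustration and do not represent any vendor's engine.

```python
import numpy as np

# Toy feature extraction: a few crude, hypothetical proxies for writing quality.
def extract_features(essay):
    words = essay.split()
    n_words = len(words)
    avg_word_len = float(np.mean([len(w) for w in words])) if words else 0.0
    n_sentences = max(essay.count(".") + essay.count("!") + essay.count("?"), 1)
    return np.array([n_words, avg_word_len, n_words / n_sentences])

# Toy "training set": essays paired with human scores (fabricated illustrative data).
essays = ["Short essay. Not much here.",
          "A longer essay with several sentences. It develops an argument. It concludes clearly.",
          "Words words words."]
human_scores = np.array([2.0, 5.0, 1.0])

X = np.column_stack([np.ones(len(essays)),
                     np.vstack([extract_features(e) for e in essays])])
# Least-squares regression of human scores on features (the "accumulation" step).
weights, *_ = np.linalg.lstsq(X, human_scores, rcond=None)

new_essay = "This response offers a claim and some support. It could be developed further."
predicted = np.array([1.0, *extract_features(new_essay)]) @ weights
print(round(float(predicted), 2))
```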
Applications of Automated Essay
Scoring
• Target traditional academic essays, emphasizing
writing quality over content, though measures of
content exist in each system
– Example: “What I did last summer”
• Low-stakes learning/practice
– WriteToLearn™ (IEA)
– MyAccess!™ (IntelliMetric)
– Criterion™ (e-rater)
• High-stakes assessment
– GMAT® with e-rater (1999), then IntelliMetric (2006)
– GRE® with e-rater (2008)
– TOEFL® independent writing with e-rater (2009)
– Pearson Test of English with IEA (2009)
Strengths and Limitations of
Automated Scoring of Essays
• Strengths
– Evaluating traditional academic essays emphasizing fluency
– Empirical performance typically on par with human raters
– Performance feedback related to fluency
• Limitations
– Evaluation of content accuracy, audience, rhetorical style,
creative or literary writing
– Content understanding relatively primitive
– Potential vulnerability to score manipulation
– Does not detect all types of errors or classify all errors correctly
– Unexplained differences in agreement by demographic
variables (Bridgeman, Trapani & Attali, in press)
Automated Scoring for Correct
Answers
• Designed to score short textual responses for the correctness of the information in the response
• Systems for scoring include:
– Automark (Mitchell, Russell, Broomhead, & Aldridge, 2002)
– c-rater (Leacock & Chodorow, 2003)
– Oxford-UCLES (Sukkarieh, Pulman, & Raikes, 2003)
• Only c-rater is known to have been deployed commercially
Example Question & Rubric
• Question: “Identify TWO common ways the body
maintains homeostasis during exercise.”
– Student responds with computer-entered free text
• Scoring rubric looks for any of these concepts:
– Sweating (perspiration)
– Increased breathing rate (respiration)
– Decreased digestion
– Increased circulation rate (heart speeds up)
– Dilation of blood vessels in skin (increased blood flow)
• Rubric:
– 2 points - two key elements
– 1 point - one key element
– 0 points - other
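A minimal sketch of scoring against this rubric by keyword matching is shown below. The patterns are assumptions for illustration only; deployed systems such as c-rater rely on much richer linguistic analysis than surface string matching.

```python
import re

# Each rubric concept with a few surface patterns that would count as evidence
# for it (hypothetical patterns; real systems model paraphrase, not keywords).
CONCEPTS = {
    "sweating":       [r"\bsweat", r"perspir"],
    "breathing_rate": [r"breath", r"respirat"],
    "digestion":      [r"digest"],
    "circulation":    [r"heart", r"circulat", r"pulse"],
    "vasodilation":   [r"blood vessel", r"blood flow", r"dilat"],
}

def score_response(text):
    text = text.lower()
    hits = {name for name, patterns in CONCEPTS.items()
            if any(re.search(p, text) for p in patterns)}
    # Rubric: 2 points for two key elements, 1 point for one, 0 otherwise.
    return min(len(hits), 2)

print(score_response("You sweat more and your heart beats faster."))  # 2
print(score_response("Your body temperature goes up."))               # 0
```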
Strengths and Limitations of
Automated Scoring of Correct
Answers
• Strengths
– Empirical performance can be on par with human graders
– Targets correct content in a way automated essay scoring
systems do not
– Emphasizes principles of good item design often overlooked
in human scoring
• Limitations
– Success on items not always predictable
– Unlike essays, the expectation is near-perfect agreement
– Errors tend to be systematic
– Additional controls on item production are an initial challenge
– Model building can be labor intensive
Automated Scoring of
Mathematical Responses
• A variety of systems are available that are designed to
score multiple kinds of mathematical responses
– Equations
– Graphs
– Geometric figures
– Numeric responses
• Numerous systems available (see Steinhaus, 2008)
– Maple TA
– M-rater (Singley & Bennett, 1998)
• Operational deployment (sample)
– State assessment (m-rater)
– Classroom learning (Maple TA)
Strengths and Limitations of
Automated Scoring of
Mathematical Responses
• Strengths
– Empirical performance typically better than human raters
– Can compute the mathematical equivalence of unanticipated representations, e.g., unusual forms of an equation (see the sketch below)
– Partial credit scoring and performance feedback
• Limitations
– Some topics more challenging to complete in the computer
interface (e.g. geometry)
– “Show your work” can be more challenging to represent in
administration interface
– Mixed response (text and equations) still highly limited
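The equivalence checking noted under strengths can be illustrated with symbolic algebra. The sketch below uses the open-source SymPy library as a stand-in; it is not the method used by any particular operational engine such as m-rater.

```python
import sympy as sp

def equivalent(expr_a, expr_b):
    """Return True if two student-entered expressions are algebraically equal."""
    # sympify parses the strings into symbolic expressions (symbols auto-created);
    # if their difference simplifies to zero, the forms are equivalent.
    difference = sp.simplify(sp.sympify(expr_a) - sp.sympify(expr_b))
    return difference == 0

# A keyed response and two unanticipated but equivalent forms, plus a wrong one.
print(equivalent("2*(x + 1)", "2*x + 2"))         # True
print(equivalent("(x**2 - 1)/(x - 1)", "x + 1"))  # True (for x != 1)
print(equivalent("2*(x + 1)", "2*x + 1"))         # False
```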
Automated Scoring of Spoken
Responses
• Systems are available to score
– Predictable responses: read-aloud or describe a picture
– Unpredictable responses: such as “If you could go anywhere
in the world, where would it be and why?”
• Scoring systems include
– Versant (Bernstein et al., 2000)
– SpeechRater (Zechner et al., 2009)
– EduSpeak (Franco et al., 2000)
• Operational Deployment
– Pearson Test of English (Versant, 2009)
– PhonePass SET-10 (Versant, 2002)
– TOEFL Practice Online (SpeechRater, 2007)
Strengths and Limitations of
Automated Scoring of Spoken
Responses
• Strengths
– High accuracy for predictable speech
– Performance improving rapidly year over year
• Limitations
– Is not as good as human scoring for unpredictable speech
– Data intensive, requiring large data sets
– Calibration for a sufficient range of accented speech remains challenging
– State-of-the-art speech recognition for non-native speakers of English lags substantially behind that for native speakers
– Limited content representation
An (Oversimplified) Approach to
Building an Automated Scoring
System for Innovative Tasks
• Innovation is, by definition, unique, so this is only a broad perspective
• Design (see previous discussion)
– A good design will specify the evidence needed so that
innovation feeds the design, rather than the reverse
• Feature extraction
– Represent relevant aspects of performance as a variable
• Synthesizing features into one or more task “scores”
– A multitude of methods, from the mundane to the exotic
Turning Features into Scores
number-right counts, weighted counts, item response theory, multivariate IRT, rule space, regression, factor analytic methods, rule-based classification, cluster analysis, Kohonen networks, neural networks, support vector machines, classification and regression trees, Bayesian networks, Arpeggio
Many ways to turn features into scores, from the
traditional to the innovative, but the key question is how
well the method matches the intent of design
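To illustrate how the choice of accumulation method changes the resulting score, the toy sketch below applies three of the listed approaches (number-right counts, weighted counts, and a rule-based classification) to the same set of hypothetical observables.

```python
# Three of the listed accumulation methods applied to the same observables,
# purely to show that the choice of method changes the resulting score.
observables = {"o1": 1, "o2": 0, "o3": 1, "o4": 1}   # hypothetical 0/1 observables

# 1. Number-right count
number_right = sum(observables.values())

# 2. Weighted count (weights are illustrative, e.g. from regression or design judgment)
weights = {"o1": 1.0, "o2": 2.0, "o3": 1.5, "o4": 0.5}
weighted = sum(weights[k] * v for k, v in observables.items())

# 3. Rule-based classification into performance levels
if observables["o2"] == 1 and number_right >= 3:
    level = "proficient"
elif number_right >= 2:
    level = "developing"
else:
    level = "beginning"

print(number_right, weighted, level)   # 3 3.0 developing
```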
Examples of Successful Systems
• Graphical designs in architectural licensure (Braun,
Bejar & Williamson, 2006)
– Architect Registration Examination by NCARB
– Examinees construct solutions on computer with CAD
– Rule-based scoring from algorithmically-based features
• Simulations of patients for physician licensure
(Margolis & Clauser, 2006)
– United States Medical Licensing Examination™ (NBME)
– Order diagnostic and treatment procedures with simulated
patients
– Regression-based scoring from algorithmically-based
features
More Examples of Successful
Systems
• Simulations of accounting problems for CPA licensure
(DeVore, 2002)
– Uniform CPA Examination (AICPA)
– Conduct direct work applying accounting principles
– Rule-based scoring from algorithmic features
• Assessment of information and communications
technology literacy for collegiate placement (Katz & Smith-Macklin, 2007)
– iSkills™
– Interactive simulations around information problems
– Number-right scoring on the basis of algorithmic features
Automated Scoring Methods: A
Variety of Tools for Innovation
• Tools should be chosen carefully based on need
• Variety of automated scoring systems commercially
available
• Innovative items might incorporate aspects of existing
systems, or may require customized scoring
• Innovation should be driven by good design, and
innovative items do not necessarily require innovative
scoring
• Even for targeted innovation, parsimony is a virtue
• There are a number of successful models to follow in
designing innovative items with automated scoring
Overview of The Science of
Scoring
• Some characteristics to qualify as science
• Scoring as science
• Some pitfalls to avoid in scientific inquiry for
scoring
To Be Science
• We must have a theory about the natural world that is
capable of explaining and predicting phenomena
• The theory must be open to support or refutation by testing specific hypotheses through empirical research
• Theories must be modified or abandoned in favor of
competing theories based on the outcomes of
empirical testing of hypotheses
Scoring as Science
• Theory
– The design of an innovative task and corresponding scoring
constitutes a theory of performance in the domain
• Falsification
– Experimentation (pilot testing, etc.) allows for confirmation or
falsification of the hypotheses regarding how examinees of
certain ability or understanding would behave, distinguishing
them from examinees of alternate ability or understanding
• Modification
– Item designs and scoring can be modified and/or abandoned
as a result, leading to better items/scoring
Explanation: Scoring as Theory
• Scoring as part of a coherent theory of assessment for
a construct of interest (Design)
• Construct representation
– What does an automated score mean?
• How complete is the representation of construct?
• Are they direct measures or proxies (e.g. essay length)?
– What is the construct under human scoring?
• Judgment scoring vs. confirmation: Can reasonable experts
disagree on the score?
• How much do we really know about operational human scoring?
• What are the implications of using human scores to produce an automated scoring system?
• What hypotheses are presented to drive falsification?
Prediction: Empirical Support or
Falsification of Scoring
• Does scoring distinguish among examinees of
differing ability?
– Score distributions; item difficulty; unused “distracters”; r-biserials; monotonically increasing performance with ability
• Do scores relate to human scores as predicted?
– Agreement rates: correlation; weighted kappa; etc.
– Distributions (means, deviations, etc.)
– Differences by subgroups in the above
• Do scores follow predicted patterns of relationship with
external validity criteria (human and other measures)?
• What is the impact of using automated scores on
reported scores compared to all human scoring?
Example: Evaluation Criteria for
e-rater (Williamson, 2009)
• Construct relevance
• Empirical evidence of validity
– Relationship to human scores
• Exact / adjacent agreement [no standard]
• Pearson correlation ≥ 0.70
• Weighted kappa ≥ 0.70
• Reduction compared to human-human agreement < 0.10
• Difference in standardized mean score < 0.15
• Subgroups (fairness): difference in standardized mean score < 0.10
– Relationship to external criteria
• Impact analysis
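These criteria can all be computed with standard statistics. The sketch below (illustrative score data and standalone NumPy code, not ETS's evaluation tooling) checks a set of automated scores against human scores using the agreement thresholds above.

```python
import numpy as np

def quadratic_weighted_kappa(a, b, min_s=1, max_s=6):
    """Quadratic-weighted kappa between two integer score vectors."""
    a, b = np.asarray(a), np.asarray(b)
    n_cat = max_s - min_s + 1
    observed = np.zeros((n_cat, n_cat))
    for x, y in zip(a, b):
        observed[x - min_s, y - min_s] += 1
    observed /= observed.sum()
    expected = np.outer(np.bincount(a - min_s, minlength=n_cat),
                        np.bincount(b - min_s, minlength=n_cat)) / (len(a) ** 2)
    i, j = np.indices((n_cat, n_cat))
    w = ((i - j) ** 2) / ((n_cat - 1) ** 2)            # quadratic disagreement weights
    return 1 - (w * observed).sum() / (w * expected).sum()

# Fabricated illustrative scores on a 1-6 scale.
human = np.array([3, 4, 4, 2, 5, 3, 4, 6, 2, 3])
auto  = np.array([3, 4, 5, 2, 5, 3, 3, 6, 2, 4])

exact = float(np.mean(human == auto))
adjacent = float(np.mean(np.abs(human - auto) <= 1))
pearson = float(np.corrcoef(human, auto)[0, 1])
qwk = float(quadratic_weighted_kappa(human, auto))
std_mean_diff = float((auto.mean() - human.mean()) /
                      np.sqrt((auto.var(ddof=1) + human.var(ddof=1)) / 2))

print(f"exact {exact:.2f}, adjacent {adjacent:.2f}, r {pearson:.2f}, "
      f"QWK {qwk:.2f}, std mean diff {std_mean_diff:.2f}")
print("meets r >= 0.70:", pearson >= 0.70,
      "| meets QWK >= 0.70:", qwk >= 0.70,
      "| |std mean diff| < 0.15:", abs(std_mean_diff) < 0.15)
```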
Modification: Changing Scoring
• Item level
– Collapsing categories in scoring features that didn’t
discriminate (e.g. from 5 categories to 3)
– Modifying the scoring so that a response thought to indicate lower ability is designated as higher, and vice versa (if consistent with a revised understanding of the theory)
– Changes to task conditions to facilitate understanding and
task completion
• Aggregate level
– Implementation model: confirmatory, contributory, or automated-only
– Adjudication threshold
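As a hypothetical reading of these aggregate-level choices, the sketch below shows how the implementation model and an adjudication threshold jointly determine the reported score and whether a response is routed for additional human scoring; the threshold value and routing rules are assumptions, not a description of any program.

```python
def final_score(human, auto, threshold=1.0, model="contributory"):
    """Return (reported score, needs_adjudication) under three implementation models."""
    discrepant = abs(human - auto) > threshold       # adjudication threshold
    if model == "automated_only":
        return auto, False
    if model == "confirmatory":                      # automated score checks the human score
        return human, discrepant
    if model == "contributory":                      # both scores contribute to the report
        return (human + auto) / 2, discrepant
    raise ValueError(model)

print(final_score(4, 4.5, model="contributory"))   # (4.25, False)
print(final_score(4, 6.0, model="confirmatory"))   # (4, True) -> route to another human
```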
Importance of Scientific Rigor:
Avoiding Scoring Pitfalls
• Shallow empiricism
– Percent agreements
– Aggregated data
– Overgeneralization
• Confounding outcomes and process
– Drawing construct conclusions from measures of association
• Validity by design alone
– Assuming success of construct by design
• Predisposition for human scores
– Excessive criticism/confidence (Williamson, Bejar & Hone,
1999)
The Science of Scoring: Design
Innovation Paired With
Traditional Empiricism
• Item designs, with their associated scoring, as
elements of a theory of construct proficiency
• Rigor and thoroughness in efforts to empirically falsify,
beyond routine evaluations
• Willingness to modify or abandon unsupported
theories for better models
What will the future of automated
scoring bring?
• We will be struck by what will become possible
– Scoring unpredictable speech by non-native speakers
– Content scoring from both spoken and written works
– Unprecedented delivery technologies and formats
• We will be frustrated by what isn’t possible/practical
– Automated systems still won’t do all that humans can do
– Statistical methods will advance, but not be a panacea
– We won’t escape the challenges of time and money
• Unpredicted innovations will be the most exciting
– Assessments distributed over time
– Intersection of learning progressions & intelligent tutoring
– Incorporation of public content into assessment (Wiki-test)
– Assessment through data mining of academic activities
Automated Scoring: 1941
“The International Business Machine (I.B.M.) Scorer (1938) uses a
carefully printed sheet … upon which the person marks all his answers
with a special pencil. The sheet is printed with small parallel lines showing
where the pencil marks should be placed to indicate true items, false
items, or multiple-choices. To score this sheet, it is inserted in the
machine, a lever is moved, and the total score is read from a dial. The
scoring is accomplished by electrical contacts with the pencil marks. …
Corrections for guessing can be obtained by setting a dial on the machine.
By this method, 300 true–false items can be scored simultaneously. The
sheets can be run through the machine as quickly as the operator can
insert them and write down the scores. The operator needs little special
training beyond that for clerical work.” (Greene, p. 134)
Conclusion
• Good scoring begins with good design
• There are a variety of methods, both commercial and
custom-built, for scoring innovative and/or traditional tasks
• Theory-driven empiricism is the core of scoring as science: construct is good, but evidence is better
• The future will surprise and disappoint: technology will change quickly but psychometric infrastructure will change slowly, leading to an expanding gap between what is possible and what is practical
References
Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater v.2. Journal of Technology, Learning, and
Assessment 4 (3).
Bennett, R. E., & Bejar, I. I. (1998). Validity and automated scoring: It’s not only the scoring. Educational Measurement:
Issues and Practice, 17(4), 9-17.
Braun, H., Bejar, I. I., & Williamson, D. M. (2006). Rule-based methods for automatic scoring: Application in a
licensing context. In D. M. Williamson, R. J. Mislevy & I. I. Bejar (Eds.), Automated scoring for complex
constructed response tasks in computer based testing. Mahwah, NJ: Lawrence Erlbaum Associates.
Bridgeman, B., Trapani, C., & Attali, Y. (in press). Comparison of human and machine scoring essays: Differences by
gender, ethnicity, and country. Applied Measurement in Education.
Bernstein, J., De Jong, J., Pisoni, D., and Townshend, B. (2000). Two experiments on automatic scoring of spoken
language proficiency. Proceedings of InSTIL2000 (Integrating Speech Tech. in Learning) (pp. 57-61).
Dundee, Scotland: University of Abertay.
Burstein, J. (2003). The e-rater® scoring engine: Automated essay scoring with natural language processing. In M. D.
Shermis & J. C. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 113-121).
Hillsdale, NJ: Lawrence Erlbaum Associates.
DeVore, R. (2002, April). Considerations in the development of accounting simulations. Paper presented at the Annual
Meeting of the National Council on Measurement in Education, New Orleans.
Elliot, S. (2003). IntelliMetric: From here to validity. In M. D. Shermis & J. C. Burstein (Eds.), Automated essay
scoring: A cross-disciplinary perspective. Mahwah, NJ: Lawrence Erlbaum Associates.
Franco, H., Abrash, V., Precoda, K., Bratt, H., Rao, R., Butzberger, J., Rossier, R., & Cesari, F. (2000). The SRI
EduSpeak™ system: Recognition and pronunciation scoring for language learning. Proceedings of InSTILL
(Integrating Speech Technology in Language Learning) (pp. 123–128). Scotland.
Greene, E. B. (1941). Measurements of Human Behavior. New York: The Odyssey Press.
Katz, I. R., & Smith-Macklin, A. (2007). Information and communication technology (ICT) literacy: Integration and
assessment in higher education. Journal of Systemics, Cybernetics, and Informatics, 5(4), 50-55.
Landauer, T. K., Laham, D., & Foltz, P. W. (2003). Automated scoring and annotation of essays with the Intelligent
Essay Assessor. In M. D. Shermis & J. C. Burstein (Eds.), Automated essay scoring: A cross-disciplinary
perspective (pp. 87-112). Hillsdale, NJ: Lawrence Erlbaum Associates.
References (cont.)
Luecht, R. M. (April, 2007). Assessment Engineering in Language Testing: from Data Models and Templates to
Psychometrics. Invited paper presented at the Annual Meeting of the National Council on Measurement in
Education, Chicago.
Margolis, M. J., & Clauser, B. E. (2006). A regression-based procedure for automated scoring of a complex medical
performance assessment. In D. Williamson, R. Mislevy & I. Bejar (Eds.) Automated scoring of complex
tasks in computer based testing (pp. 123-167). Hillsdale, NJ: Lawrence Erlbaum Associates.
Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments.
Educational Researcher, 23(2), 13-23.
Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2003). On the structure of educational assessments. Measurement:
Interdisciplinary Research and Perspectives, 1.
Page, E.B. (1966). The imminence of grading essays by computer. Phi Delta Kappan, 48, 238–243.
Page, E.B. (1968). The use of the computer in analyzing student essays. International Review of Education 14(2),
210–225.
Page, E.B. (2003). Project essay grade: PEG. In M. D. Shermis & J. C. Burstein (Eds.), Automated essay scoring: A
cross-disciplinary perspective (pp. 43-54). Hillsdale, NJ: Lawrence Erlbaum Associates.
Pellegrino, J. W., Chudowsky, N., & Glaser, R. (2001). Knowing what students know. Washington, DC: National
Academy Press.
Steinhaus, S. (July, 2008). Comparison of mathematical programs for data analysis. Retrieved August 13, 2010, from
http://www.scientificweb.com/ncrunch/
Sukkarieh, J. Z., Pulman, S. G., & Raikes, N. (2003). Auto-marking: Using computational linguistics to score short,
free-text responses. Paper presented at the 29th annual conference of the International Association for
Educational Assessment (IAEA), Manchester, UK.
Wilson, M. & Sloane, K. (2000). From principles to practice: An embedded assessment system. Applied Measurement
in Education, 13(2), 181-208.
Williamson, D. M. (2009, April). A framework for evaluating and implementing automated scoring. Paper presented at
the annual meeting of the National Council on Measurement in Education, San Diego, CA.
Williamson, D. M., Bejar, I. I., & Hone, A. S. (1999). ‘Mental model’ comparison of automated and human scoring.
Journal of Educational Measurement, 36(2), 158-184.
Zechner, K., Higgins, D., Xi, X., & Williamson, D. M. (2009). Automatic scoring of non-native spontaneous speech in
tests of spoken English. Speech Communication, 51(10), 883-895.