Technical Report Number 790 UCAM-CL-TR-790 ISSN 1476

Technical Report
UCAM-CL-TR-790
ISSN 1476-2986
Computer Laboratory
Number 790
Automated assessment of ESOL free
text examinations
Ted Briscoe, Ben Medlock, Øistein Andersen
November 2010
15 JJ Thomson Avenue Cambridge CB3 0FD
United Kingdom phone +44 1223 763500
http://www.cl.cam.ac.uk/
© 2010 Ted Briscoe, Ben Medlock, Øistein Andersen
Technical reports published by the University of Cambridge
Computer Laboratory are freely available via the Internet:
http://www.cl.cam.ac.uk/techreports/
Automated Assessment of ESOL Free
Text Examinations
Ted Briscoe
Computer Laboratory
University of Cambridge
and
Ben Medlock and Øistein Andersen
iLexIR Ltd
ejb@cl.cam.ac.uk ben/oistein@ilexir.co.uk
Abstract
In this report, we consider the task of automated assessment of English as a Second
Language (ESOL) examination scripts written in response to prompts eliciting free text
answers. We review and critically evaluate previous work on automated assessment for
essays, especially when applied to ESOL text. We formally define the task as
discriminative preference ranking and develop a new system trained and tested on a corpus
of manually-graded scripts. We show experimentally that our best performing system is
very close to the upper bound for the task, as defined by the agreement between human
examiners on the same corpus. Finally we argue that our approach, unlike extant
solutions, is relatively prompt-insensitive and resistant to subversion, even when its
operating principles are in the public domain. These properties make our approach
significantly more viable for high-stakes assessment.
1 Introduction
The task of automated assessment of free text passages or essays is distinct from that
of scoring short text or multiple choice answers to a series of very specific prompts.
Nevertheless, since Page (1966) described the Project Essay Grade (PEG) program,
this has been an active and fruitful area of research. Today there are at least 12
programs and associated products (Williamson, 2009), such as the Educational
Testing Service’s (ETS) e-Rater (Attali and Burstein, 2006), PearsonKT’s KAT Engine /
Intelligent Essay Assessor (IEA) (Landauer et al, 2003) or Vantage Learning’s
IntelliMetric (Elliot, 2003), which are deployed to assess essays as part of self-tutoring
systems or as a component of examination marking (e.g. Kukich, 2000). Because of the
broad potential application of automated assessment to essays, these systems focus
as much on assessing the semantic relevance or ‘topicality’ of essays to a given prompt
as on assessing the quality of the essay itself.
Many English as a Second Language (ESOL) examinations include free text essay-style
answer components designed to evaluate candidates’ ability to write, with a focus on
specific communicative goals. For example, a prompt might specify writing a letter to a
friend describing a recent activity or writing an email to a prospective employer
justifying a job application. The design, delivery, and marking of such examinations are
the focus of considerable research into task validity for the specific skills and levels
of attainment expected for a given qualification (e.g. Hawkey, 2009). The marking schemes
for such writing tasks typically emphasise use of varied and effective language
appropriate for the genre, exhibiting a range and complexity consonant with the level of
attainment required by the examination (e.g. Shaw and Weir, 2007). Thus, the marking
criteria are not primarily prompt or topic specific but linguistic. This makes automated
assessment for ESOL text (hereafter AAET) a distinct subcase of the general problem of
marking essays, which, we argue, in turn requires a distinct technical approach, if
optimal performance and effectiveness are to be achieved.
Nevertheless, extant general purpose systems, such as e-Rater and IEA, have been
deployed in self-assessment or second marking roles for AAET. Furthermore, Edexcel,
a division of Pearson, has recently announced that from autumn 2009 a revised
version of its Pearson Test of English Academic (PTE Academic), a test aimed at
ESOL speakers seeking entry to English speaking universities, will be entirely
assessed using “Pearson’s proven automated scoring technologies”¹. This announcement
from one of the major providers of such high stakes tests makes investigation of the
viability and accuracy of automated assessment systems a research priority². In this
report, we describe research undertaken in collaboration with Cambridge ESOL, a
division of Cambridge Assessment, which is, in turn, a division of the University of
Cambridge, to develop an accurate and viable approach to AAET and to assess the
appropriateness of more general automated assessment techniques for this task.
Section 2 provides some technical details of extant systems and considers their likely
efficacy for AAET. Section 3 describes and motivates the new model that we have
developed for AAET based on the paradigm of discriminative preference ranking using
machine learning over linguistically-motivated text features automatically extracted from
scripts. Section 4 describes an experiment training and testing this classifier on
samples of manually marked scripts from candidates for Cambridge ESOL’s First
Certificate in English (FCE) examination and then comparing performance to human
examiners and to our reimplementation of the key component of PearsonKT’s IEA.
Section 5 discusses the implications of these experiments within the wider context of
operational deployment of AAET. Finally, section 6 summarises our main conclusions
and outlines areas of future research.

¹ www.pearsonpte.com/news/Pages/PTEAcademiclaunch.aspx
² Williamson (2009) also states that e-Rater will also be used operationally from mid-2009
for assessing components of ETS’s TOEFL exam, but in conjunction with human marking.

2 Technical Background
A full history of automated assessment is beyond the scope of this report. For recent
reviews of work on automated essay or free-text assessment see Dikli (2006) and
Williamson (2009). In this section, we focus on the ETS’s e-Rater and PearsonKT’s IEA
systems as these are two of the three main systems which are operationally deployed.
We do not consider IntelliMetric further as there is no precise and detailed technical
description of this system in the public domain (Williamson, 2009). However, we do discuss
a number of academic studies which assess and compare the performance of different
techniques, as well as that of the public domain prototype system, BETSY
(Rudner and Liang, 2002), which treats automated assessment as a Bayesian text
classification problem, as this work sheds useful light on the potential of approaches
other than those deployed by e-Rater and IEA.
2.1 e-Rater
e-Rater is extensively described in a number of publications and patents (e.g. Burstein,
2003; Attali and Burstein, 2006; Burstein et al, 2002, 2005). The most recently described
version of e-Rater uses 10 broad feature types extracted from the text using NLP
techniques, 8 of which represent writing quality and 2 content. These features correspond
to high-level properties of a text, such as grammar, usage (errors), organisation or
prompt/topic specific content. Each of these high-level features is broken down into a
set of ground features; for instance, grammar is subdivided into features which count
the number of auxiliary verbs, complement clauses, and so forth, in a text. These
features are extracted from the essay using NLP tools which automatically assign
part-of-speech tags to words and phrases, search for specific lexical items, and so
forth. Many of the feature extractors are manually written and based on essay marking
rubrics used as guides for human marking of essays for specific examinations. The
resulting counts for each feature are associated with cells of a vector which encodes all
the grammar features of a text. Similar vectors are constructed for the other high-level
features.
The feature extraction system outlined above, and described in more detail in the
references provided, allows any text to be represented as a set of vectors each
representing a set of features of a given high-level type. Each feature in each vector
is weighted using a variety of techniques drawn from the fields of information retrieval
(IR) and machine learning (ML). For instance, content-based analysis of an essay is
based on vectors of individual word frequency counts drawn from text. Attali and
Burstein (2006) transform frequency counts to weights by normalising the word counts
to that of the most frequent word in a training set of manually-marked essays written in
response to the same prompt, scored on a 6 point scale. Specifically, they remove
stop words which are expected to occur with about equal frequency in all texts (such
as the), then for each of the score points, the weight for word i at point p is:
\[ W_{ip} = \frac{F_{ip}}{MaxF_p} \cdot \log\left(\frac{N}{N_i}\right) \tag{1} \]
where $F_{ip}$ is the frequency of word i at score point p, $MaxF_p$ is the maximum
frequency of any word at score point p, N is the total number of essays in the training
set, and $N_i$ is the total number of essays having word i in all score points in the
training set³. For automated assessment of the content of an unmarked essay, this
weighted vector is computed by dropping the conditioning on p and the result is compared
to aggregated vectors for the marked training essays in each class using cosine
similarity. The unmarked essay is assigned a content score corresponding to the most
similar class. This approach transforms an unsupervised weighting technique, which only
requires an unannotated collection of essays or documents, into a supervised one which
requires a set of manually-marked prompt-specific essays.
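
The weighting in equation (1) and the cosine comparison against per-class vectors can be illustrated with a short Python sketch. This is only one reading of the description above, not e-Rater's implementation: the function names, the tiny stop list and the pooling of all essays at a score point into one class vector are assumptions introduced here.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "to"}  # assumed toy stop list


def weigh(tokens, doc_freq, n_total):
    """Equation (1) without the conditioning on p: (F_i / MaxF) * log(N / N_i).

    Stop words and words unseen in training are ignored.
    """
    freq = Counter(w for w in tokens if w in doc_freq)
    if not freq:
        return {}
    max_f = max(freq.values())
    return {w: (f / max_f) * math.log(n_total / doc_freq[w]) for w, f in freq.items()}


def train(essays_by_score):
    """essays_by_score maps a score point p to a list of tokenised essays."""
    all_essays = [e for essays in essays_by_score.values() for e in essays]
    n_total = len(all_essays)                      # N
    doc_freq = Counter()                           # N_i: essays containing word i
    for essay in all_essays:
        doc_freq.update(set(essay) - STOP_WORDS)
    class_vectors = {}
    for p, essays in essays_by_score.items():
        pooled = [w for e in essays for w in e]    # counts F_ip come from this pool
        class_vectors[p] = weigh(pooled, doc_freq, n_total)
    return class_vectors, doc_freq, n_total


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def content_score(tokens, model):
    """Return the score point whose class vector is most similar to the essay."""
    class_vectors, doc_freq, n_total = model
    vec = weigh(tokens, doc_freq, n_total)
    return max(class_vectors, key=lambda p: cosine(vec, class_vectors[p]))
```

Note that a word occurring in every training essay receives weight log(1) = 0, so it contributes nothing to the comparison; this is the sense in which the scheme down-weights uninformative vocabulary.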
Other vectors are weighted in different ways depending on the type of features
extracted. Counts of grammatical, usage and style features are smoothed by adding 1
to all counts (avoiding zero counts for any feature), then divided by essay length (word
count) to normalise for different essay lengths, then transformed to logs of counts to
avoid skewing results on the basis of abnormally high counts for a given feature.
Rhetorical organisation is computed by random indexing (Kanerva et al, 2000), a
modification of latent semantic indexing (see section 2.2), which constructs word vectors
based on cooccurrence in texts. Words can be weighted using a wide variety of weight
functions (Gorman and Curran, 2006). Burstein et al (2005) describe an approach which
calculates mean vectors for words from training essays which have been manually
marked and segmented into passages performing different rhetorical functions. Mean
vectors for each score point and passage type are normalised to unit length and
transformed so they lie on the origin of a graph of the transformed geometric space. This
controls for differing passage lengths and incorporates inverse document frequency
into the word weights. The resulting passage vectors can now be used to compare the
similarity of passages within and across essays, and, as above, to score essays for
organisation via similarity to mean vectors for manually-marked training passages.
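
The treatment of grammar, usage and style counts described at the start of this paragraph (add 1, divide by essay length, take logs) is simple enough to state as a small Python sketch. The function and feature names are hypothetical and chosen only for illustration.

```python
import math


def normalise_counts(counts, essay_length):
    """Add-one smoothing, length normalisation, then a log transform.

    counts maps a feature name to its raw count in one essay;
    essay_length is the essay's word count.
    """
    return {name: math.log((c + 1) / essay_length) for name, c in counts.items()}


# Two essays with the same smoothed rate of a feature get the same value,
# regardless of their length.
short_essay = normalise_counts({"passive": 9}, essay_length=100)
long_essay = normalise_counts({"passive": 19}, essay_length=200)
```

The add-one step guarantees that the logarithm is always defined, even for features that never occur in an essay.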
The set of high-level feature scores obtained for a given essay is combined to give
an overall score. In earlier versions of e-Rater this was done by stepwise linear
regression to assign optimal weights to the component scores so that the correlation with
manually-assigned overall scores on the training set was maximised. However, Attali and
Burstein (2006) advocate a simpler and more perspicuous approach using the weighted
average of standardised feature scores, where the weights can be set by expert
examiners based on marking rubrics. Williamson (2009) in his description of e-Rater
implies a return to regression-based weighting.
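
A weighted average of standardised feature scores can be sketched as follows. The sketch assumes that "standardised" means z-scores computed against the training essays, and the feature names and weights are invented for the example; neither detail is taken from the e-Rater publications.

```python
import statistics


def standardise(value, training_values):
    """z-score of value relative to the training distribution of one feature."""
    mean = statistics.mean(training_values)
    sd = statistics.pstdev(training_values)
    return 0.0 if sd == 0 else (value - mean) / sd


def overall_score(feature_scores, training_scores, weights):
    """Weighted average of standardised feature scores for one essay."""
    total_weight = sum(weights.values())
    weighted = sum(
        weights[name] * standardise(feature_scores[name], training_scores[name])
        for name in weights
    )
    return weighted / total_weight


training = {"grammar": [1, 2, 3], "content": [2, 4, 6]}
rubric_weights = {"grammar": 3, "content": 1}   # hypothetical expert-set weights
score = overall_score({"grammar": 3, "content": 2}, training, rubric_weights)
```

Because the weights are explicit, an examiner can see directly how much each high-level feature contributes, which is the transparency argument made for this approach over stepwise regression.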

³ Burstein et al (2002) patent a different but related weighting of these counts using
inverse document frequency (i.e. the well-known tf/idf weighting scheme introduced into
IR by Sparck-Jones, 1972). Inverse document frequency is calculated from a set of
prompt-specific essays which have been manually marked and assigned to classes.
Presumably this was abandoned in favour of the current approach based on experimental
comparison.

2.2 Intelligent Essay Assessor (IEA)
The IEA, like e-Rater, assesses essays in terms of a small number of high-level features
such as content, organisation, fluency, and grammar. The published papers and
patent describing the techniques behind IEA (e.g. Landauer et al, 2000, 2003; Foltz et
al, 2002) focus on the use of latent semantic analysis (LSA), a technique originally
developed in IR to compute the similarity between documents or between documents and
keyword queries by clustering words and documents so that the measurement of similarity
does not require exact matches at the word level. For a recent tutorial introduction to
LSA see Manning et al (2008:ch18). Landauer et al (2000) argue that LSA measures
similarity of semantic content and that semantic content is dominant in the assessment of
essays. As for the content analysis component of e-Rater, LSA represents a document as a
vector of words and requires a training set of prompt-specific manually-marked essays.
However, instead of computing the cosine similarity directly between aggregated or
mean vectors from the training set and the essay to be assessed, LSA deploys
singular value decomposition (SVD) to reduce the dimensions of a matrix of words by
essays to obtain a new matrix with reduced dimensions which effectively clusters words
with similar contexts (i.e. their distribution across essays) and clusters essays with
similar words. Words can be weighted to take account of their frequency in an essay
and across a collection of essays before SVD is applied. LSA can be used to measure
essay coherence as well by comparing passages within an essay and passages of
similar rhetorical type from other essays. In this respect, there is little difference
between e-Rater and IEA, because random indexing is simply a computationally efficient
approximation of SVD which avoids construction of the full word-by-essay
cooccurrence matrix.
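
The effect of the SVD step can be seen on a toy word-by-essay matrix. The following NumPy sketch is a generic illustration of LSA, not IEA's implementation; the vocabulary, the matrix and the choice of k = 2 are assumptions made for the example, and no term weighting is applied.

```python
import numpy as np


def lsa_fit(word_by_essay, k):
    """Truncated SVD of a words x essays matrix; returns the top-k left singular vectors."""
    u, _s, _vt = np.linalg.svd(word_by_essay, full_matrices=False)
    return u[:, :k]


def project(word_counts, u_k):
    """Map a word-count vector (or a words x essays matrix) into the k-dimensional space."""
    return u_k.T @ word_counts


def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


# Rows: car, automobile, engine, flower, petal. Columns: four training essays.
A = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

u_k = lsa_fit(A, k=2)
essays = project(A, u_k)                                   # shape (2, 4)
new_essay = np.array([1, 0, 0, 0, 0], dtype=float)         # mentions only "car"

raw_sim = cosine(new_essay, A[:, 1])                       # no shared words
lsa_sim = cosine(project(new_essay, u_k), essays[:, 1])    # similar in the reduced space
```

In the raw word space the new essay shares no words with the second training essay, so their cosine is zero; in the reduced space the two are close, because "car" and "automobile" both cooccur with "engine". This is the sense in which similarity no longer requires exact matches at the word level.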
Though it seems clear that IEA uses LSA to assess content and organisation (Foltz et
al, 2002), it is unclear which other high-level features are computed this way. It is very
unlikely that LSA is used to assess grammar or spelling, though there is no published
description of how these features are assessed. On the other hand, it is likely that
features like fluency are assessed via LSA, probably by using annotated training sets
of text passages which illustrate this feature to different degrees so that a score can be
assigned in the same manner as the content score. IEA, by default, combines the score
obtained from each high-level feature into an overall score using multiple regression
against human scores in a training set. However, this can be changed for specific
examinations based, for example, on the marking rubric (Landauer et al, 2000).
2.3 Other Research on Automated Assessment
2.3.1 Text Classification
Both e-Rater and IEA implicitly treat automated assessment, at least partly, as a text
classification problem. Whilst the roots of vector-based representations of text as a
basis for measuring similarity between texts lie in IR (Salton, 1971), and in their
original form can be deployed in an unsupervised fashion, their use in both systems
is supervised in the sense that similarity is now measured relative to training sets of
premarked essays (along several dimensions), and thus test essays can be classified
on a grade point scale. Manning et al (2008:ch13) provides a tutorial introduction to text
classification, an area of ongoing research which lies at the intersection of IR and ML
and which has been given considerable impetus recently by new techniques emerging
from ML.
Larkey (1998) explicitly modelled automated assessment as a text classification
problem, comparing the performance of two standard classifiers, binomial Naive Bayes (NB)
and kNN, over four different examination datasets. Larkey found that binomial NB
outperformed kNN and that the best document representation was a vector of lemmas or
stemmed words rather than word forms. Rudner and Liang (2002) describe BETSY (the
Bayesian Essay Test Scoring sYstem), which uses either a binomial or multinomial NB
classifier and represents essays in terms of unigrams, bigrams and non-adjacent
bigrams of word forms. A full tutorial on NB classifiers can be found in Manning et al
(2008:ch13). Briefly, however, a multinomial model will estimate values for instances of
the defined feature types from training data for each class type, which in the simplest
case could be just ‘pass’ and ‘fail’, by smoothing and normalizing frequency counts for
each feature, F, for example, for bigrams:
\[ P(F_{bigram\text{-}i}) = \frac{Freq(W_j, W_k) + 1}{Freq(W_j) + N} \tag{2} \]
where N is the total bigram frequency count for this portion of the training data. To
predict the most likely class for new unlabelled text, the log class-conditional
probabilities of the features found in the new text are summed and added to the log prior
probability of the class, for each class C (e.g. pass/fail):
\[ \log(P(C)) + \sum_i \log(P(F_i \mid C)) \tag{3} \]
The prior is typically estimated from the proportion of texts in each class in the
training data. Taking the sum of the logs assumes (‘naively’) that each feature instance,
whether unigram, bigram or whatever, is independent of the others. This is clearly
incorrect, though it suffices to construct an accurate and efficient classifier in many
situations. In practice, within the NB framework, more sophisticated feature selection or
weighting to handle the obvious dependencies between unigrams and bigrams would probably
improve performance, as would adoption of a classification model which does not rely on
such strong independence assumptions.
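
The scoring rule in equation (3) can be made concrete with a minimal multinomial NB sketch over bigram features. It uses the textbook add-one (Laplace) estimate, with the number of distinct feature types in the denominator, which may differ in detail from the smoothing used by BETSY; all function names and the toy training texts are assumptions made for illustration.

```python
import math
from collections import Counter


def bigrams(tokens):
    """Adjacent word pairs of a tokenised text."""
    return list(zip(tokens, tokens[1:]))


def train_nb(labelled_texts):
    """labelled_texts is a list of (features, class_label) pairs."""
    class_counts = Counter()
    feature_counts = {}
    vocabulary = set()
    for features, label in labelled_texts:
        class_counts[label] += 1
        feature_counts.setdefault(label, Counter()).update(features)
        vocabulary.update(features)
    return class_counts, feature_counts, vocabulary


def log_score(features, label, model):
    """log P(C) + sum_i log P(F_i | C), with add-one smoothed feature estimates."""
    class_counts, feature_counts, vocabulary = model
    total = sum(feature_counts[label].values())
    score = math.log(class_counts[label] / sum(class_counts.values()))
    for f in features:
        if f in vocabulary:  # features never seen in training are skipped
            score += math.log((feature_counts[label][f] + 1) / (total + len(vocabulary)))
    return score


def classify(features, model):
    """Return the class with the highest log score."""
    return max(model[0], key=lambda label: log_score(features, label, model))


training = [
    (bigrams("the essay is well written".split()), "pass"),
    (bigrams("a clear and well written answer".split()), "pass"),
    (bigrams("essay is bad writed".split()), "fail"),
    (bigrams("answer are bad".split()), "fail"),
]
model = train_nb(training)
label = classify(bigrams("this is well written".split()), model)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which is why equation (3) is stated as a sum of logs.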
BETSY is freely available for research purposes. Coniam (2009) trained BETSY for
AAET on a corpus of manually-marked year 11 Hong Kong ESOL examination scripts.
He found that non-adjacent bigrams or word pairs provided the most useful feature types
for accurate assessment. Both approaches use regression to optimise the fit of
the output of the classifiers, which in the case of the Bayesian classifiers can be
interpreted as the degree of statistical certainty or confidence in a given classification,
to the grade point scales used in the different examinations.
Text classification is a useful model for automated assessment as it allows the problem
to be framed in terms of supervised classification using machine learning techniques,
and provides a framework to support systematic exploration of different classifiers with
different representations of the text. From this perspective, the extant work has only
scratched the surface of the space of possible systems. For instance, all the
approaches discussed so far rely heavily on so-called ‘bag-of-words’ representations of
the text in which positional and structural information is ignored, and all have utilised
non-discriminative classifiers. However, there are strong reasons to think that, at least
for AAET, grammatical competence and performance errors are central to assessment,
but these are not captured well by a bag-of-words representation. In general,
discriminative classification techniques have performed better on text classification
problems than non-discriminative techniques, such as NB classifiers, using similar feature
sets (e.g. Joachims, 1998), so it is surprising that discriminative models have not been
applied to automated essay assessment.
2.3.2 Structural Information
Page (1994) was the first to describe a system which used partial parsing to extract
syntactic features for automated assessment. e-Rater extended this work using
hand-coded extractors to look for specific syntactic constructions and specific types of
grammatical error (see section 2.1 and references therein). Kanejiya et al (2003)
describe an extension to LSA which constructs a matrix of words by essays in which words
are paired with the part-of-speech tag of the previous word. This massively increases
the size of the resultant matrix but does take account of limited structural information.
However, their comparative experiments with pure LSA showed little improvement in
assessment performance.
Lonsdale and Krause (2003) is the first application of a minimally-modified standard
syntactic parser to the problem of automated assessment. They use the Link Parser
(Sleator and Temperley, 1993) with some added vocabulary to analyse sentences in
ESOL essays. The parser outputs the set of grammatical relations which hold between
word pairs in the sentence, but also is able to skip words and output a cost vector
(including the number of words skipped and the length of the sentence), when faced
with ungrammatical input. The system scored essays by scoring each sentence on a
five point scale, based on its cost vector, and then averaging these scores.
Rosé et al (2003) directly compare four different approaches to automated assessment
on a corpus of physics essays. These are a) LSA over words, b) a NB text classifier
over words, c) bilexical grammatical relations and syntactic features, such as passive
voice, from sentence-by-sentence parses of the essays, and d) a model integrating
b) and c). They found that the NB classifier outperformed LSA, whilst the model
based on parsing outperformed the NB classifier, and the model integrating parse
information and the NB classifier performed best.
This body of work broadly supports the intuition that structural information is relevant
to assessment but the only direct comparison of LSA, word-based classification and
classification via structural information is on physics essays and may not, therefore, be
comparable for ESOL. Lonsdale and Krause show reasonable correlation with human
scoring using parse features alone on ESOL essays, but they conduct no comparative
evaluation. The experimental design of Rosé et al is much better but a similar
experiment needs to be conducted for ESOL essays.
2.3.3 Content Analysis
IEA exploits LSA for content analysis and e-Rater uses random indexing (RI). Both
techniques are a form of word-by-essay clustering which allow the systems to generalise
from specific to distributionally related words as measured by their occurrence in
similar essays. However, there are many other published techniques for constructing
distributional semantic ‘spaces’ of this general type (see e.g. Turney and Pantel (2010)
for a survey). Both probabilistic LSA (PLSA) and Latent Dirichlet Allocation (LDA) have
been shown to work better than LSA for some IR applications. Kakkonen et al (2006)
compare the performance of both to LSA on a corpus of graded Finnish essays on various
topics. They found that LDA performed worse than LSA and that PLSA performed similarly.
There are many further possibilities in the area of content analysis that remain to be
tried. Firstly, there are more recent approaches to constructing such distributional
semantic spaces which have been shown to outperform RI and SVD-based techniques
like LSA on the task of clustering words by semantic similarity, which is arguably
central to the content analysis component of automated assessment. For example,
Baroni et al (2007) show that Incremental Semantic Analysis (ISA) leads to better
performance on semantic categorisation of nouns and verbs. ISA is an improvement
of RI. Initially each word w is assigned a signature, a sparse vector, $s_w$, of fixed
dimensionality d made up of a small number of randomly distributed +1 and -1 cells
with all other cells assigned 0. d is typically much smaller than the dimensionality of
the possible contexts (cooccurrences) of words given the text contexts used to
define cooccurrence. At each occurrence of a target word t with a context word c,
the history vector of t is updated as follows:
\[ h_t \mathrel{+}= i \, (m_c h_c + (1 - m_c) s_c) \tag{4} \]
where i is a constant impact rate and $m_c$ determines how much the history of one
word influences the history of another word: the more frequent a context word, the
less it will influence the history of the target word. The m weight of c decreases as
follows:
\[ m_c = \frac{1}{\exp\left(\frac{Freq(c)}{K_m}\right)} \tag{5} \]
where $K_m$ is a parameter determining the rate of decay. ISA has the advantage that it is
fully incremental, does not rely on weighting schemes that require global computations
over contexts, and is therefore efficient to compute. It extends RI by updating the
vector for t with the signature and history of c so that second order effects of the
context word’s distribution are factored into the representation of the target word.
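
The update in equations (4) and (5) can be written out as a short NumPy sketch. It follows the description above but is not the authors' or Baroni et al's code: the function names, the default parameter values, the choice to count Freq(c) as occurrences of c as a context word so far, and the toy cooccurrence pairs are all assumptions made for illustration.

```python
import numpy as np


def make_signature(d, nonzero, rng):
    """Sparse random signature: a few +1/-1 cells, all other cells 0."""
    s = np.zeros(d)
    idx = rng.choice(d, size=nonzero, replace=False)
    s[idx] = rng.choice([-1.0, 1.0], size=nonzero)
    return s


def isa_train(pairs, d=100, nonzero=10, impact=0.003, k_m=100.0, seed=0):
    """pairs is a sequence of (target, context_word) cooccurrences in corpus order."""
    rng = np.random.default_rng(seed)
    signature, history, freq = {}, {}, {}
    for t, c in pairs:
        for w in (t, c):
            if w not in signature:
                signature[w] = make_signature(d, nonzero, rng)
                history[w] = np.zeros(d)
                freq[w] = 0
        freq[c] += 1
        m_c = 1.0 / np.exp(freq[c] / k_m)                              # equation (5)
        history[t] = history[t] + impact * (m_c * history[c]
                                            + (1.0 - m_c) * signature[c])  # equation (4)
    return history


def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


pairs = [("cat", "pet"), ("cat", "fur"), ("dog", "pet"), ("dog", "fur"),
         ("car", "road"), ("car", "wheel")]
h = isa_train(pairs)
```

In this toy run "cat" and "dog" share both context words, so their history vectors point in the same direction, whereas "car" accumulates unrelated signatures. Each update touches only the vectors of the two words involved, which is what makes the method incremental.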
As well as exploring improved clustering techniques over LSA or RI such as ISA, both
the weighting functions used for modelling cooccurrence (e.g. Gorman and Curran,
2006), and the contexts used to assess cooccurrence (e.g. Baroni and Lenci, 2009),
which have been exclusively based on an entire essay in automated assessment
work, should be varied. For instance, the best models of semantic similarity often
measure cooccurrence of words in local syntactic contexts, such as those provided
by the grammatical relations output by a parser. Finally, though prompt-specific
content analysis is clearly important for assessment of many essay types, it is not
so clear that it is a central aspect of ESOL assessment, where demonstration of
communicative competence and linguistic variety without excessive errors is arguably
more important than the specific topic addressed.
2.4 Evaluation
The evaluation of automated assessment systems has largely been based on analyses
of correlation with human markers. Typically, systems are trained on premarked essays for
a specific exam and prompt and their output scaled and fitted to a particular grade
point scheme using regression or expert rubric-based weighting. Then the Pearson
correlation coefficient is calculated for a set of test essays for which one or more
human gradings are available. Using this measure, e-Rater, IEA and other
approaches discussed above have been shown to correlate well with human grades.
Often they correlate with human grades as well as the grades assigned by two or more
human markers on the same essays correlate with each other. Additionally, the rates of
exact replication of human scores, of deviations by one point, and so forth can be
calculated. These may be more informative about causes of larger divergences given
specific phenomena in essays (e.g. Williamson, 2009; Coniam, 2009).
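
The measures mentioned in this paragraph are straightforward to compute. The following Python sketch shows the Pearson correlation coefficient together with exact and one-point ("adjacent") agreement rates; the function names and the example scores are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def agreement_rates(system, human):
    """Proportion of essays scored identically, and proportion differing by exactly one point."""
    n = len(system)
    exact = sum(s == h for s, h in zip(system, human)) / n
    adjacent = sum(abs(s - h) == 1 for s, h in zip(system, human)) / n
    return exact, adjacent


human_scores = [1, 2, 3, 4, 5, 6]
system_scores = [1, 2, 4, 4, 5, 5]
r = pearson(system_scores, human_scores)
exact, adjacent = agreement_rates(system_scores, human_scores)
```

Correlation alone can hide systematic offsets (a system that always scores one point too high still correlates perfectly), which is one reason the agreement rates are a useful complement.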
A weakness of the above approach is that it is relatively easy to build a
system that will correlate well with human markers under ideal conditions. Even the
original PEG (Page, 1966) obtained high correlations using very superficial textual
features such as essay, word and sentence length. However, such features are easily
‘gamed’ by students and by instructors ‘teaching to the exam’ (assessment regime)
once it is public knowledge what features are extracted for automated assessment.
As automated assessment is not based on a full understanding of an essay, the
features extracted are to some extent proxies for such understanding. The degree to
which such proxies can be manipulated independently of the features that they are
intended to measure is clearly an important factor in the analysis of systems, especially
if they are intended for use in high-stakes assessment. Powers et al (2002) conducted
an experiment in which a variety of experts were invited to design and submit essays
that they believed would either be under- or over-scored by e-Rater. The results
showed that e-Rater was relatively robust to such ‘gaming’, though those with intimate
knowledge of e-Rater were able to trick it into assigning scores deviating from human
markers, even by 3 or more points on a 6-point scale.
A further weakness of comparison with human markers, and indeed with training such
systems on raw human marks, is that human markers are relatively inconsistent and
show comparatively poor correlation with each other. Alternatives have been proposed,
such as training and/or testing on averaged or Rasch-corrected scores (e.g.
Coniam, 2009), or evaluating by correlating system grades on one task, such as essay
writing, with human scores on an independent task, such as spoken comprehension
(Attali and Burstein, 2006). Finally, many non-technical professionals involved in
assessment object to automated assessment, arguing, for example, that a computer can
never recognise creativity. In the end, this type of philosophical objection tends to
dissipate as algorithms become more effective at any given task. For example, few
argue that computers will never be able to play chess properly now that chess
programs regularly defeat grand masters, though some will argue that prowess at chess
is not in fact a sign of ‘genuine intelligence’.
Nevertheless, it is clear that very thorough evaluation of assessment systems will be
required before operational, especially high stakes, deployment and that this should
include evaluation in adversarial scenarios and on unusual ‘outlier’ data, whether this be
highly creative or deviant. From this perspective it is surprising that Powers et al
(2002) is the sole study of this kind, though both e-Rater and IEA are claimed to
incorporate mechanisms to flag such outliers for human marking.
3 AAET using Discriminative Preference Ranking
One of the key weaknesses of the text classification methods deployed so far for
automated assessment is that they are based on non-discriminative machine learning
models. Non-discriminative models often embody incorrect assumptions about the underlying
properties of the texts to be classified; for example, that the probability of each
feature (e.g. word or ngram) in a text is independent of the others, in the case of the NB
classifier (see section 2.3). Such models also weight features of the text in ways only
loosely connected to the classification task; for example, possibly smoothed class
conditional maximum likelihood estimates of features in the case of the NB classifier
(see again section 2.3).
In this work, we apply discriminative machine learning methods, such as modern
variants of the Large Margin Perceptron (Freund and Schapire, 1998) and the Support
Vector Machine (SVM, Vapnik, 1995) to AAET. To our knowledge, this is the first such
application to automated essay assessment. Discriminative classifiers make weaker
assumptions concerning the properties of texts, directly optimize classification
performance on training data, and yield optimal predictions if training and test material
is drawn from the same distribution (see Collins (2002) for extended theoretical
discussion and proofs).
In our description of the classifiers, we will use the following notation:

  $N$                      number of training samples
  $\nu$                    avg. number of unique features / training sample
  $X \in \mathbb{R}^D$     real D-dimensional sample space
  $Y = \{+1, -1\}$         binary target label space
  $x_i \in X$              vector representing the i-th training sample
  $y_i \in \{+1, -1\}$     binary category indicator for i-th training sample
  $f : X \to Y$            classification function
3.1 Support Vector Machine
Linear SVMs (Vapnik, 1995) learn wide margin classifiers based on Structural Risk Minimization and continue to yield state-of-the-art results in text classification experiments (e.g. Lewis et al, 2004). In its dual form, linear SVM optimization equates to minimizing the following expression:

$$-\sum_i \alpha_i + \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j \qquad (6)$$

subject to the constraint $\sum_i \alpha_i y_i = 0$, where the $\alpha$'s are the weight coefficients. The prediction is given by:

$$f(x) = \mathrm{sign}\Big(\sum_i \alpha_i y_i \, x_i \cdot x + b\Big) \qquad (7)$$

where $b$ is the bias and $\mathrm{sign}(r) \in \{-1, +1\}$ depending on the sign of the input. The practical use of the SVM model relies on efficient methods of finding approximate solutions to the quadratic programming (QP) problem posed by (6). A popular solution is implemented in Joachims' SVMlight package (Joachims, 1999), in which the QP problem is decomposed into small constituent subproblems (the 'working set') and solved sequentially. This yields a training complexity at each iteration of $O(q^{1.5} \cdot \nu)$ where $q$ is the size of the working set. The efficiency of the procedure lies in the fact that $q \ll N$. The number of iterations is governed by the choice of $q$ which makes it difficult to place a theoretical complexity bound on the overall optimization procedure, but experimental analysis by Yang et al (2003) suggests a super-linear bound of approximately $O(N)$ with respect to the number of training samples, though in our experience this is quite heavily dependent on the separability of the data and the value of the regularization hyperparameter.
The per sample time complexity for prediction in the SVM model is $O(M \cdot \nu)$ where $M$ is the number of categories, as a separate classifier must be trained for each category.
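As an illustration of the prediction rule (7), the following is a minimal sketch in plain Python. The function names and the toy support vectors, labels and coefficients are invented for illustration and are not taken from SVMlight or from our experiments; mapping a score of exactly zero to +1 is likewise an assumption.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def svm_predict(alphas, ys, xs, b, x):
    """Evaluate f(x) = sign(sum_i alpha_i * y_i * (x_i . x) + b), as in (7)."""
    score = sum(a * y * dot(xi, x) for a, y, xi in zip(alphas, ys, xs)) + b
    # A score of exactly zero is mapped to +1 here; that tie-break is an assumption.
    return 1 if score >= 0 else -1

# Toy values, invented for illustration only.
xs = [[1.0, 0.0], [0.0, 1.0]]   # two support vectors
ys = [+1.0, -1.0]               # their labels
alphas = [0.5, 0.5]             # chosen so that sum_i alpha_i * y_i = 0
b = 0.0

label_a = svm_predict(alphas, ys, xs, b, [2.0, 1.0])
label_b = svm_predict(alphas, ys, xs, b, [0.0, 3.0])
```

Only samples with non-zero coefficients contribute to the sum, which is why prediction cost depends on the number of active features rather than on the size of the training set.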
3.2 Timed Aggregate Perceptron
We now present a description of a novel variant of the batch perceptron algorithm, the Timed Aggregate Perceptron (TAP, Medlock, 2010). We will first introduce the ideas behind our model and then provide a formal description.
The online perceptron learning model has been a mainstay of artificial intelligence and machine learning research since its introduction by Rosenblatt (1958). The basic principle is to iteratively update a vector of weights in the sample space by adding some quantity in the direction of misclassified samples as they are identified. The Perceptron with Margins (PAM) was introduced by Krauth and Mezard (1987) and shown to yield better generalisation performance than the basic perceptron. More recent developments include the Voted Perceptron (Freund and Schapire, 1998) and the Perceptron with Uneven Margins (PAUM), applied with some success to text categorization and information extraction (Li et al, 2005).
The model we present is based on the batch training method (e.g. Bos and Opper, 1998) where the weight vector is updated in the direction of all misclassified instances simultaneously. In our model an aggregate vector is created at each iteration by summing all misclassified samples and normalising according to a timing variable which controls both the magnitude of the aggregate vector and the stopping point of the training process. The weight vector is then augmented in the direction of the aggregate vector and the procedure iterates. The timing variable is responsible for protection against overfitting; its value is initialised to 1, and gradually diminishes as training progresses until reaching zero, at which point the procedure terminates.
Given a set of $N$ data samples paired with target labels $(x_i, y_i)$ the TAP learning procedure returns an optimized weight vector $\hat{w} \in \mathbb{R}^D$. The prediction for a new sample $x \in \mathbb{R}^D$ is given by:

$$f(x) = \mathrm{sign}(\hat{w} \cdot x) \qquad (8)$$

where the sign function converts an arbitrary real number to +/-1 based on its sign. The default decision boundary lies along the unbiased hyperplane $\hat{w} \cdot x = 0$, though a threshold can easily be introduced to adjust the bias.
At each iteration, an aggregate vector $\tilde{a}_t$ is constructed by summing all misclassified samples and normalising:

$$\tilde{a}_t = \mathrm{norm}\Big(\sum_{x_i \in Q_t} x_i y_i, \; \tau_t\Big) \qquad (9)$$

$\mathrm{norm}(a, \tau)$ normalises $a$ to magnitude $\tau$ and $Q_t$ is the set of misclassified samples at iteration $t$, with the misclassification condition given by:

$$w_t \cdot x_i y_i < 1 \qquad (10)$$

A margin of +/-1 perpendicular to the decision boundary is required for correct classification of training samples. The timing variable $\tau$ is set to 1 at the start of the procedure and gradually diminishes, governed by:

$$\tau_t = \tau_{t-1} - \begin{cases} 0 & \text{if } L_{t-1} > L_t \\ t\,(L_t - L_{t-1})\,\beta & \text{otherwise} \end{cases} \qquad (11)$$

The class-normalised empirical loss, $L_t$, falls within the range (0, 1) and is defined as:

$$L_t = \frac{1}{2}\left[\frac{|Q_t^{+}|}{N^{+}} + \frac{|Q_t^{-}|}{N^{-}}\right] \qquad (12)$$

with $N^{+/-}$ denoting the number of class +/-1 training samples respectively. $\beta$ is a measure of the balance of the training distribution sizes:

$$\beta = \frac{\min(N^{+}, N^{-})}{N} \qquad (13)$$

with an upper bound of 0.5 representing perfect balance. Termination occurs when either $\tau$ or the empirical loss reaches zero. How well the TAP solution fits the training data is governed by the rapidity of the timing schedule; earlier stopping leads to a more approximate fit.
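The update rules (9) to (13) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions rather than the implementation used in our experiments: the weight vector is assumed to start at zero, the `max_iter` guard and the test `tau <= 0` (rather than exact equality with zero) are additions for safety, both classes are assumed to be present in the training data, and all function names are hypothetical.

```python
import math

def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def class_normalised_loss(w, xs, ys):
    """Return L_t from (12) and the indices of the misclassified set Q_t from (10)."""
    misclassified = [i for i, (x, y) in enumerate(zip(xs, ys)) if dot(w, x) * y < 1]
    n_pos = sum(1 for y in ys if y > 0)
    n_neg = len(ys) - n_pos
    q_pos = sum(1 for i in misclassified if ys[i] > 0)
    q_neg = len(misclassified) - q_pos
    return 0.5 * (q_pos / n_pos + q_neg / n_neg), misclassified

def tap_train(xs, ys, max_iter=1000):
    """Batch training with a timing variable tau, following (9)-(13)."""
    n_pos = sum(1 for y in ys if y > 0)
    beta = min(n_pos, len(ys) - n_pos) / len(ys)       # balance measure (13)
    w = [0.0] * len(xs[0])                             # zero start: an assumption
    tau, prev_loss = 1.0, None
    for t in range(1, max_iter + 1):                   # max_iter guard: an addition
        loss, q = class_normalised_loss(w, xs, ys)
        if prev_loss is not None and loss >= prev_loss:
            tau -= t * (loss - prev_loss) * beta       # 'otherwise' branch of (11)
        if tau <= 0 or loss == 0:
            return w
        # (9): sum the misclassified x_i * y_i, then rescale to magnitude tau.
        agg = [sum(xs[i][d] * ys[i] for i in q) for d in range(len(w))]
        length = math.sqrt(dot(agg, agg))
        if length == 0:
            return w
        w = [wd + tau * ad / length for wd, ad in zip(w, agg)]
        prev_loss = loss
    return w

# Toy, linearly separable data invented for illustration.
xs = [[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]]
ys = [1, 1, -1, -1]
w = tap_train(xs, ys)
```

On this toy data a single aggregate update already places every sample outside the unit margin, so the procedure stops because the empirical loss reaches zero rather than because the timing variable is exhausted.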
In some cases, it may be beneficial to tune the rapidity of the timing schedule to achieve optimal performance on a specific problem, particularly when cross validation is feasible. In this instance we propose a modified version of expression (11) that includes a timing rapidity hyperparameter, $r$:

$$\tau_t = \tau_{t-1} - \begin{cases} r - 1 & \text{if } L_{t-1} > L_t \\ r\,t\,(L_t - L_{t-1})\,\beta & \text{otherwise} \end{cases} \qquad (14)$$

Note that this expression is equivalent to (11) in the case that $r = 1$. An overview of the TAP learning procedure is given in Algorithm 1.
The timing mechanism used in our algorithm is motivated by the principle of early stopping in perceptron training (Bos and Opper, 1998), wherein the procedure is halted before reaching the point of minimum empirical loss. In our formulation, $\tau$ also governs the length of the aggregate vector, which is analogous to the learning rate in the standard perceptron. $\tau$ is decreased only when the class-normalised empirical loss increases. An increase in empirical loss is an indication either that the model is beginning to overfit or that the learning rate is too high, and a consequent decrease in $\tau$ works to counter both possibilities. The scale of the decrease is governed by three heuristic factors:
1. how far the algorithm has progressed ($t$)
2. the increase in empirical loss ($L_t - L_{t-1}$)
3. the balance of the training distributions ($\beta$)
The motivation behind the third heuristic is that in the early stages of the algorithm, unbalanced training distributions lead to aggregate vectors that are skewed toward the dominant class. If the procedure is stopped too early, the empirical loss will be disproportionately high for the subdominant class, leading to a skewed weight vector. The effect of $\beta$ is to relax the timing schedule for imbalanced data which results in higher quality solutions.
The TAP optimisation procedure requires storage of the input vectors along with the feature weight and update vectors, yielding space complexity of $O(N)$ in the number of training samples. At each iteration, computation of the empirical loss and aggregate vector is $O(N \cdot \nu)$ (recall that $\nu$ is the average number of unique features per sample). Given the current and previous loss values, computing $\tau$ is $O(1)$ and thus each iteration scales with time complexity $O(N)$ in the number of training samples. The number of training iterations is governed by the rapidity of the timing schedule which has no direct dependence on the number of training samples, yielding an approximate overall complexity of $O(N)$ (linear) in the number of training samples.

Algorithm 1 – TAP training procedure
  Require: training data $\{(x_1, y_1), \ldots, (x_N, y_N)\}$
  $\tau = 1$
  for $t = 1, 2, 3 \ldots$ do
    if $\tau_t = 0 \lor L_t = 0$ then
      terminate and return $w_t$
    else
      $w_{t+1} = w_t + \tilde{a}_t$
    end if
    compute $\tau_{t+1}$
  end for

3.3 Discriminative Preference Ranking
The TAP and SVM models described above perform binary discriminative classification, in which training exam scripts must be divided into 'pass' and 'fail' categories. The confidence margin generated by the classifier on a given test script can be interpreted as an estimate of the degree to which that script has passed or failed, e.g. a 'good' pass or a 'bad' fail. However, this gradation of script quality is not modelled explicitly by the classifier, rather it relies on emergent correlation of key features with script quality.
In this section, we introduce an alternative ML technique called preference ranking which is better suited to the AAET task. It explicitly models the relationships between scripts by learning an optimal ranking over a given sample domain, inferred through an optimisation procedure that utilises a specified ordering on training samples. This allows us to model the fact that some scripts are 'better' than others, across an arbitrary grade range, without necessarily having to specify a numerical score for each, or introduce an arbitrary pass/fail boundary.
We now present a version of the TAP algorithm that efficiently learns preference ranking models. A derivation of similar equations for learning SVM-based models, and proof of their optimality is given by Joachims (2002).
The TAP preference ranking optimisation procedure requires a set of training samples, $x_1, x_2, \ldots, x_n$, and a ranking $<_r$ such that the relation $x_i <_r x_j$ holds if and only if a sample $x_j$ should be ranked higher than $x_i$ for a finite, discrete partial or complete ranking or ordering, $1 \le i, j \le n$, $i \ne j$. Given some ranking $x_i <_r x_j$, the method only considers the difference between the feature vectors $x_i$ and $x_j$ as evidence, known as pairwise difference vectors. The target of the optimisation procedure is to compute a weight vector $\hat{w}$ that minimises the number of margin-separated misranked pairs of training samples, as formalised by the following constraints on pairwise difference vectors:

$$\forall (x_i <_r x_j) : \hat{w} \cdot (x_i - x_j) \ge \mu \qquad (15)$$

where $\mu$ is the margin, given a specific value below.
The derived set of pairwise difference vectors grows quickly as a function of the number of training samples. An upper bound on the number of difference vectors for a set of training vectors is given by:

$$u = a^2 \cdot r(r - 1)/2 \qquad (16)$$

where $r$ is the number of ranks and $a$ is the average rank frequency. This yields intractable numbers of difference vectors for even modest numbers of training vectors, e.g. $r = 4$, $a = 2000$ yields 24,000,000 difference vectors.
To overcome this, the TAP optimisation procedure employs a sampling strategy to reduce the number of difference vectors to a manageable quantity. An upper bound is specified on the number of training vectors, and then the probability of sampling an arbitrary difference vector is given by $u'/u$ where $u'$ is the specified upper bound and $u$ is given above.
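The arithmetic behind (16) and the sampling probability can be checked with a short sketch. The function names are hypothetical, the cap of 100,000 difference vectors is an invented value, and clamping the probability at 1 (for the case where the cap exceeds the bound) is an assumption.

```python
def difference_vector_bound(avg_rank_frequency, num_ranks):
    """Upper bound u = a^2 * r(r - 1) / 2 on the number of difference vectors, as in (16)."""
    a, r = avg_rank_frequency, num_ranks
    return a * a * r * (r - 1) / 2

def sampling_probability(cap, bound):
    """Probability u'/u of keeping any one difference vector, clamped at 1 (an assumption)."""
    return min(1.0, cap / bound)

u = difference_vector_bound(2000, 4)        # the worked example from the text
p = sampling_probability(100_000, u)        # 100,000 is a hypothetical cap u'
```

With four ranks and 2000 scripts per rank the bound is 24,000,000, so a cap of 100,000 would keep roughly one difference vector in every 240.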
The optimisation algorithm then proceeds as for the classification model (Algorithm 1), except we have a one-sided margin. The modified procedure is shown in Algorithm 2.
The misclassification condition is:

$$w_t \cdot (x_j - x_i) > 2 \qquad (17)$$

and the aggregate vector $\tilde{a}_t$ is constructed by:

$$\tilde{a}_t = \mathrm{norm}\Big(\sum_{x_i <_r x_j \in Q_t} x_j - x_i, \; \tau_t\Big) \qquad (18)$$

Algorithm 2 – TAP rank preference training procedure
  Require: training data $\{(x_1 <_r x_2), \ldots, (x_N <_r x_{N+1})\}$
  $\tau = 1$
  for $t = 1, 2, 3 \ldots$ do
    if $\tau_t = 0 \lor L_t = 0$ then
      terminate and return $w_t$
    else
      $w_{t+1} = w_t + \tilde{a}_t$
    end if
    compute $\tau_{t+1}$
  end for

Note that the one-sided preference ranking margin takes the value 2, mirroring the two-sided unit-width margin in the classification model.
The termination of the optimisation procedure is governed by the timing rapidity hyperparameter, as in the classification case, and training time is approximately linear in the number of pairwise difference vectors, upper bounded by $u'$ (see above). The output from the training procedure is an optimised weight vector $w_t$ where $t$ is the iteration at which the procedure terminated. Given a test sample, $x$, predictions are made, analogously to the classification model, by computing the dot-product $w_t \cdot x$. The resulting real scalar can then be mapped onto a grade/score range via simple linear regression (or some other procedure), or used in rank comparison with other test samples. Joachims (2002) describes an analogous procedure for the SVM model which we do not repeat here.
As stated earlier, in application to AAET, the principal advantage of this approach is that we explicitly model the grade relationships between scripts. Preference ranking allows us to model ordering in any way we choose; for instance we might only have access to pass/fail information, or a broad banding of grade levels, or we may have access to detailed scores. Preference ranking can account for each of these scenarios, whereas classification models only the first, and numerical regression only the last.
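The mapping from the real-valued ranker output onto a score range via simple linear regression can be sketched as follows. This is an illustrative least-squares fit in plain Python; the function name and the toy scores and marks are invented, and the report does not specify further details of the fitting procedure.

```python
def fit_line(scores, marks):
    """Least-squares slope and intercept for marks ~ slope * score + intercept."""
    n = len(scores)
    mean_s = sum(scores) / n
    mean_m = sum(marks) / n
    cov = sum((s - mean_s) * (m - mean_m) for s, m in zip(scores, marks))
    var = sum((s - mean_s) ** 2 for s in scores)
    slope = cov / var
    return slope, mean_m - slope * mean_s

# Hypothetical ranker outputs (w . x) and the marks awarded to the same training scripts.
train_scores = [-1.0, 0.0, 1.0, 2.0]
train_marks = [10.0, 20.0, 30.0, 40.0]
slope, intercept = fit_line(train_scores, train_marks)

# Map the ranker output for an unseen script onto the mark scale.
predicted_mark = slope * 0.5 + intercept
```

Because the fit is monotonic for a positive slope, it changes only the scale of the output and leaves the ranking of scripts untouched.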
3.4 Feature Space
Intuitively AAET involves comparing and quantifying the linguistic variety and complexity, the degree of linguistic competence, displayed by a text against errors or infelicities in the performance of this competence. It is unlikely that this comparison can be captured optimally in terms of feature types like, for example, ngrams over word forms. Variety and complexity will not only be manifested lexically but also by the use of different types of grammatical construction, whilst grammatical errors of commission may involve nonlocal dependencies between words that are not captured by any given length of ngram. Nevertheless, the feature types used for AAET must be automatically extracted from text with good levels of reliability to be effectively exploitable.
We used the RASP system (Briscoe et al 2006; Briscoe, 2006) to automatically annotate both training and test data in order to provide a range of possible feature types and their instances so that we could explore their impact on the accuracy of the resulting AAET system. The RASP system is a pipeline of modules that perform sentence boundary detection, tokenisation, lemmatisation, part-of-speech (PoS) tagging, and syntactic analysis (parsing) of text. The PoS tagging and parsing modules are probabilistic and trained on native English text drawn from a variety of sources. For the AAET system and experiments described here we use RASP unmodified with default processing settings and select the most likely PoS sequence and syntactic analysis as the basis for feature extraction. The system makes available a wide variety of output representations of text (see Briscoe, 2006 for details). In developing the AAET system we experimented with most of them, but for the subset of experiments reported here we make use of the set of feature types given along with illustrative examples in Table 1.

  Type                         Example
  TFC:
    lexical terms              and / mark
    lexical bigrams            dear mary / of the
    part-of-speech tags        NNL1 / JJ
    part-of-speech bigrams     VBR DA1 / DB2 NN1
    part-of-speech trigrams    JJ NNSB1 NP1 / VV0 PPY RG
  TFS:
    parse rule names           V1/modal bse/+- / A1/a inf
    script length              numerical
    corpus-derived error rate  numerical

Table 1: Eight AAET Feature Types

Lower-cased but not lemmatised lexical terms (i.e. unigrams) are extracted along with their frequency counts, as in a standard 'bag-of-words' model. These are supplemented by bigrams of adjacent lexical terms. Unigrams, bigrams and trigrams of adjacent sequences of PoS tags drawn from the RASP tagset and most likely output sequence are extracted along with their frequency counts. All instances of these feature types are included with their counts in the vectors representing the training data and also in the vectors extracted for unlabelled test instances.
Lexical term and ngram features are weighted by frequency counts from the training data and then scaled using tf · idf weighting (Sparck-Jones, 1972) and normalised to unit length. Rule name counts, script length and error rate are linearly scaled so that their weights are of the same order of magnitude as the scaled term/ngram counts.
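The weighting of term features can be illustrated with a small sketch. The report does not give the exact tf · idf variant, so the formula below (raw count multiplied by the log of the inverse document frequency) is an assumption, as are the function name and the toy documents.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one unit-length {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: the number of documents in which each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        raw = {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        length = math.sqrt(sum(v * v for v in raw.values()))
        # If every term occurs in every document the vector is all zeros; leave it unscaled.
        vectors.append({t: v / length for t, v in raw.items()} if length else raw)
    return vectors

# Toy lower-cased token lists, invented for illustration.
docs = [["dear", "mary", "thank", "you"], ["dear", "sir", "thank", "you", "you"]]
vecs = tfidf_vectors(docs)
```

Terms shared by both toy documents receive zero weight, so after normalisation each vector is carried entirely by its one distinguishing term.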
Parse rule names are extracted from the phrase structure tree for the most likely analysis found by the RASP parser. For example, the following sentence from the training data, Then some though occured to me., receives the analysis given in Figure 1, whilst the corrected version, Then a thought occurred to me. receives the analysis given in Figure 2. In this representation, the nodes of the parse trees are decorated with one of about 1000 rule names, which are semi-automatically generated by the parser and which encode quite detailed information about the grammatical constructions found. However, in common with ngram features, these rule names are extracted as an unordered list from the analyses for all sentences in a given script along with their frequency counts. Each rule name together with its frequency count is represented as a cell in the vector derived from a script.

[Parse tree]
Figure 1: Then some though occured to me

[Parse tree]
Figure 2: Then a thought occurred to me

The script length in words is used as a feature less for its intrinsic informativeness than for the need to balance the effect of script length on other features. For example, error rates, ngram frequencies, etc will tend to rise with the amount of text, but the overall quality of a script must be assessed as a ratio of the opportunities afforded for the occurrence of some feature to its actual occurrence.
The automatic identification of grammatical and lexical errors in text is far from trivial (Andersen, 2010). In the existing systems reviewed in section 2, a few specific types of well-known and relatively frequent errors, such as subject-verb agreement, are captured explicitly via manually-constructed error-specific feature extractors. Otherwise, errors are captured implicitly and indirectly, if at all, via unigram or other feature types. Our AAET system already improves on this approach because the RASP parser rule names explicitly represent marked, peripheral or rare constructions using the '-r' suffix, as well
as combinations of extragrammatical subsequences suffixed 'frag', as can be seen by comparing Figure 2 and Figure 1. These cues are automatically extracted without any need for error-specific rules or extractors and can capture many types of long-distance grammatical error. However, we also include a single numerical feature representing the overall error rate of the script. This is estimated by counting the number of unigrams, bigrams and trigrams of lexical terms in a script that do not occur in a very large 'background' ngram model for English which we have constructed from approximately 500 billion words of English sampled from the world wide web. We do this efficiently using a Bloom Filter (Bloom, 1970). We have also experimented with using frequency counts for smaller models and measures such as mutual information (e.g. Turney and Pantel, 2010). However, the most effective method we have found is to use simple presence/absence over a very large dataset of ngrams which unlike, say, the Google ngram corpus (Franz and Brants, 2006) retains low frequency ngrams.
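A presence/absence error estimate of this kind can be sketched with a small Bloom filter. The sketch below is illustrative only: the filter size, the number of hash functions, the use of SHA-256, the class and function names, and the two-sentence 'background' model are all assumptions, and expressing the estimate as the fraction of unseen ngrams is one possible normalisation rather than the one used in our system.

```python
import hashlib

class BloomFilter:
    """Fixed-size bit array with several hash positions per item (no false negatives)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

def ngrams(tokens, max_n=3):
    """Yield all unigrams, bigrams and trigrams of a token list as strings."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def error_rate(tokens, background):
    """Fraction of the script's ngrams that are absent from the background model."""
    grams = list(ngrams(tokens))
    unseen = sum(1 for g in grams if g not in background)
    return unseen / len(grams) if grams else 0.0

# Tiny stand-in for the background ngram model, invented for illustration.
background = BloomFilter()
for sentence in ["then a thought occurred to me", "a thought occurred to me then"]:
    for gram in ngrams(sentence.split()):
        background.add(gram)

good = error_rate("then a thought occurred to me".split(), background)
bad = error_rate("then some though occured to me".split(), background)
```

A Bloom filter can report an unseen ngram as present but never the reverse, so the estimate can only understate the error rate; the filter size controls how often that happens.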
Although we have only described the feature types that we used in the experiments reported below, because they proved useful with respect to the competence level and text types investigated, it is likely that others made available by the RASP system, such as the connected, directed graph of grammatical relations over sentences, the degree of ambiguity within a sentence, the lemmas and/or morphological complexity of words, and so forth (see Briscoe 2006 for a fuller description of the range of feature types, in principle, made available by RASP), will be discriminative in other AAET scenarios. The system we have developed includes automated feature extractors for most types of feature made available through the various representations provided by RASP. This allows the rapid and largely automated discovery of an appropriate feature set for any given assessment task, using the experimental methodology exemplified in the next section.
4 The FCE Experiments
4.1 Data
For our experiments we made use of a set of transcribed handwritten scripts produced by candidates taking the First Certificate in English (FCE) examination written component. These were extracted from the Cambridge Learner Corpus (CLC) developed by Cambridge University Press. These scripts are linked to metadata giving details of the candidate, date of the exam, and so forth, as well as the final scores given for the two written questions attempted by candidates (see Hawkey, 2009 for details of the FCE). The marks assigned by the examiners are postprocessed to identify outliers, sometimes second marked, and the final scores are adjusted using RASCH analysis to improve consistency. In addition, the scripts in the CLC have been manually error-coded using a taxonomy of around 80 error types providing corrections for each error. The errors in the example from the previous section are coded in the following way:
<RD>some|a</RD> <SX>though|thought</SX> <IV>occured|occurred</IV>
where RD denotes a determiner replacement error, SX a spelling error, and IV a verb inflection error (see Nicholls 2003 for full details of the scheme). In our experiments, we
used around three thousand scripts from examinations set between 1997 and 2004, each about 500 words in length. A sample script is provided in the appendix.
In order to obtain an upper bound on examiner agreement and also to provide a better benchmark to assess the performance of our AAET system compared to that of human examiners (as recommended by, for example, Attali and Bernstein, 2006), Cambridge ESOL arranged for four senior examiners to remark 100 FCE scripts drawn from the 2001 examinations in the CLC using the marking rubric from that year. We know, for example, from analysis of these marks and comparison to those in the CLC that the correlation between the human markers and the CLC scores is about .8 (Pearson) or .78 (Spearman's Rank), thus establishing an upper bound for performance of any classifier trained on this data (see section 4.3 below).
4.2 Binary Classification
In our first experiment we trained five classifier models on 2973 FCE scripts drawn from the years 1999–2003. The aim was to apply well-known classification and evaluation techniques to explore the AAET task from a discriminative machine learning perspective and also to investigate the efficacy of individual feature types. We used the feature types described in section 3.4 with all the models and divided the training data into pass (mark above 23) and fail classes. Because there was a large skew in the training classes, with about 80% of the scripts falling into the pass class, we used the Break Even Precision (BEP) measure, defined as the point at which average precision=recall (e.g. Manning et al, 2008), to evaluate the performance of the models on this binary classification task. This measure favours a classifier which locates the decision boundary between the two classes in such a way that false positives/negatives are evenly distributed between the two classes.
The models trained were naive Bayes, Bayesian logistic regression, maximum entropy, SVM, and TAP. Consistent with much previous work on text classification tasks, we found that the TAP and SVM models performed best and did not yield significantly different results. For brevity, and because TAP is faster to train, we report results only for this model in what follows.
Figure 3 shows the contribution of feature types to the overall accuracy of the classifier. With unigram terms alone it is possible to achieve a BEP of 66.4%. The addition of bigrams of terms improves performance by 2.6% (representing about 19% relative error reduction (RER) on the upper bound of 80%). The addition of an error estimate feature based on the Google ngram corpus further improves performance by 2.9% (further RER about 21%). Addition of parse rule name features further improves performance by 1.5% (further RER about 11%). The remaining feature types in Table 1 contribute another 0.4% improvement (further RER about 3%).
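The RER figures quoted above can be reproduced with a few lines of arithmetic. The sketch assumes that each reduction is measured against the gap between the unigram baseline (66.4%) and the 80% upper bound; that reading is an inference from the reported numbers, and the function and variable names are hypothetical.

```python
def relative_error_reduction(gain, baseline, upper_bound):
    """Share of the baseline-to-ceiling gap closed by `gain` (all in percentage points)."""
    return gain / (upper_bound - baseline)

baseline, upper_bound = 66.4, 80.0
gains = {
    "bigrams": 2.6,
    "error estimate": 2.9,
    "parse rule names": 1.5,
    "remaining types": 0.4,
}
# Express each reduction as a whole-number percentage, as in the text.
rer = {name: round(100 * relative_error_reduction(g, baseline, upper_bound))
       for name, g in gains.items()}
final_bep = baseline + sum(gains.values())
```

Under this reading the four increments give 19%, 21%, 11% and 3%, and the full feature set reaches a BEP of about 73.8%.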
These results provide some support for the choice of feature types described in section 3.4. However, the final datapoint in the graph in Figure 3 shows that if we substitute the error rate predicted from the CLC manual error coding for our corpus derived estimate, then performance improves a further 2.9%, only 3.3% below the upper bound defined by the degree of agreement between human markers. This strongly suggests that the error
             CLC    Rater 1  Rater 2  Rater 3  Rater 4  Auto-mark
  CLC         -     0.80     0.79     0.75     0.76     0.80
  Rater 1    0.80    -       0.81     0.81     0.85     0.74
  Rater 2    0.79   0.81      -       0.75     0.79     0.75
  Rater 3    0.75   0.81     0.75      -       0.79     0.75
  Rater 4    0.76   0.85     0.79     0.79      -       0.73
  Auto-mark  0.80   0.74     0.75     0.75     0.73      -
  Average:   0.78   0.80     0.78     0.77     0.78     0.75

Table 3: Correlation (Spearman's Rank)
These results suggest that the AAET system we have developed is able to achieve levels of correlation similar to those achieved by the human markers both with each other and with the RASCH-adjusted marks in the CLC. To give a more concrete idea of the actual marks assigned and their variation, we give marks assigned to a random sample of 10 scripts from the test data in Table 4 (fitted to the appropriate score range by simple linear regression).

  Auto-mark  Rater 1  Rater 2  Rater 3  Rater 4
  26         26       23       25       23
  33         36       31       38       36
  29         25       22       25       27
  24         23       20       24       24
  25         25       22       24       22
  27         26       23       30       24
  5          12       5        12       17
  29         30       25       27       27
  21         24       21       25       19
  23         25       22       25       25

Table 4: Sample predictions (random ten)
4.4 Temporal Sensitivity
The training data we have used so far in our experiments is drawn from examinations both before and after the test data. In order to investigate both the effect of different amounts of training data and also the effect of training on scripts drawn from examinations at increasing temporal distance from the test data, we divided the data by year and trained and tested the correlation (Pearson) with the CLC marks. Figure 4 shows the results – clearly there is an effect of training data size, as no result is as good as those reported using the full dataset for training. However, there is also a strong effect for temporal distance between training and test data, reflecting the fact that both the type of prompts used to elicit text and the marking rubrics evolve over time (e.g. Hawkey, 2009; Cecil and Weir, 2007).

[Plot by year, 1997 to 2004, of number of samples and correlation]
Figure 4: Training Data Effects
4.5 Error Estimation
In order to explore the effect of different datasets on the error prediction estimate, we have gathered a large corpus of English text from the web. Estimating error rate using a 2 billion word sample of text sampled from the UK domain retaining low frequency unigrams, bigrams, and trigrams we were able to improve performance over estimation using the Google ngram corpus by 0.09% (Pearson) in experiments which were otherwise identical to those reported in section 4.3.
To date we have gathered about a trillion words of sequenced text from the web. We expect future experiments with error estimates based on larger samples of this corpus to improve on these results further. However the results reported here demonstrate the viability of this approach, in combination with parser-based features which implicitly capture many types of longer distance grammatical error, compared to the more labour intensive one of manually coding feature extractors for known types of stereotypical learner error.
4.6 Incremental Semantic Analysis
Although the focus of our experiments has not been on content analysis (see section 2.3.3), we have undertaken some limited experiments to compare the performance of an AAET system based primarily on such techniques (such as PearsonKT's IEA, see section 2) to that of the system presented here.
We used ISA (see section 2.3.3) to construct a system which, like IEA, uses similarity to an average vector constructed using ISA from high scoring FCE training scripts as the basis for assigning a mark. The cosine similarity scores were then fitted to the FCE scoring scheme. We trained on about a thousand scripts drawn from 1999 to 2004 and tested on the standard test set from 2001. Using this approach we were only able to obtain a correlation of 0.45 (Pearson) with the CLC scores and an average of 0.43 (Pearson) with the human examiners. This contrasts with scores of 0.47 (Pearson) and 0.45 (Pearson) training the TAP ranked preference classifier on a similar number of scripts and using only unigram term features.
These results, taken with those reported above, suggest that there isn't a clear advantage to using techniques that cluster terms according to their context of occurrence, and compute text similarity on the basis of these clusters, over the text classification approach deployed here. Of course, this experiment does not demonstrate that clustering techniques cannot play a useful role in AAET; however, it does suggest that a straightforward application of latent or distributional semantic methods to AAET is not guaranteed to yield optimal results.
4. 7 Off -Pr om pt E ssay D et ecti on
As disc ussed in se ction 2.4, one is sue with with the deployment of AAET for high s
takes examinations or other ‘adve rs arial’ contexts is that a non-prompt sp ecific
approach to AAET is vulne rable to ‘gaming’ via submiss ion of linguistically e xc ellent
rote-learned text regardless of the prompt. To detect such off-prompt te xt automatically
do e s require content analys is of the typ e discusse d in s ection 2.3.3 and explored in
the previous sec tion as an approach to grading.
Given that our approach to AAET is not prompt-specific in terms of training data, ideally
we would like to be able to detect off-prompt scripts with a system that doesn’t
require retraining for different prompts. We would like to train a system which is able to
compare the question and answer script within a generic distributional semantic space.
Because the prompts are typically quite short we cannot expect that in general there
will be much direct overlap between contentful terms or lemmas in the prompt and
those in the answer text.
We trained an ISA model using 10M words of diverse English text using a 250-word
stop list and ISA parameters of 2000 dimensions, impact factor 0.0003, and decay
constant 50 with a context window of 3 words. Each question and answer is represented
by the sum of the history vectors corresponding to the terms they contain. We also
included additional dimensions representing actual terms in the overall model of
distributional semantic space to capture cases of literal overlap between terms in questions
and in answers. The resulting vectors are then compared by calculating their cosine
similarity. For comparison, we built a standard vector space model that measures
semantic similarity using cosine distance between vectors of terms for question and
answer via literal term overlap.
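The two representations compared above can be sketched as follows. This is an illustrative approximation only: the `history` mapping from terms to dense vectors is assumed to have been produced elsewhere (by ISA in the experiment), and the function names, the `("sem", i)` / `("term", t)` dimension keys, and the `term_weight` parameter are hypothetical choices, not details taken from the system described.

```python
import math
from collections import Counter, defaultdict

def sparse_cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def term_vector(tokens, stoplist=frozenset()):
    """Baseline vector space model: one dimension per literal term."""
    counts = Counter(t for t in tokens if t not in stoplist)
    return {("term", t): float(c) for t, c in counts.items()}

def augmented_vector(tokens, history, stoplist=frozenset(), term_weight=1.0):
    """Sum of per-term history vectors plus literal-term dimensions.

    `history` maps a term to its dense distributional vector; terms that
    are missing from it contribute only their literal dimension.
    """
    vec = defaultdict(float)
    for t in tokens:
        if t in stoplist:
            continue
        for dim, w in enumerate(history.get(t, ())):
            vec[("sem", dim)] += w          # distributional part
        vec[("term", t)] += term_weight     # literal-overlap part
    return dict(vec)
```

With a toy `history` in which "car" and "traffic" have similar vectors, a prompt containing only "car" and an answer containing only "traffic" have zero similarity under `term_vector` but a positive similarity under `augmented_vector`, which is the situation the short prompts give rise to.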
To test the performance of these two approaches to off-prompt essay detection, we
extracted 109 passing FCE scripts from the CLC answering four different prompts:
1. During your holiday you made some new friends. Write a letter to them saying
how you enjoyed the time spent with them and inviting them to visit you.
2. You have been asked to make a speech welcoming a well-known writer who has
come to talk to your class about his/her work. Write what you say.
3. “Put that light out!” I shouted. Write a story which begins or ends with these words.
4. Many people think that the car is the greatest danger to human life today. What do
you think?
Each system was used to assign each answer text to the most similar prompt. The
accuracy (ratio of correct to all assignments) of the standard vector space model was
85%, whilst the augmented ISA model achieved 93%. This preliminary experiment
suggests that a generic model for flagging putative off-prompt essays for manual
checking could be constructed by manual selection of a set of prompts from past
papers and the current paper and then flagging any answers that matched a past prompt
better than the current prompt. There will be some false positives, but these initial
results suggest that an augmented ISA model could perform well enough to be
useful. Further experimentation on larger sets of generic training text and on optimal tuning
of ISA parameters may also improve accuracy.
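The assignment and flagging procedure described above can be sketched as follows. This is a hedged illustration: `bow_cosine` is a plain bag-of-words similarity used as a stand-in so that the example is self-contained, and all function names are hypothetical; in the experiment the similarity would come from the standard vector space model or the augmented ISA model.

```python
import math
from collections import Counter

def bow_cosine(a, b):
    """Stand-in similarity: cosine over whitespace-tokenised word counts."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(c * c for c in ca.values()))
    nb = math.sqrt(sum(c * c for c in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_prompt(answer, prompts, sim=bow_cosine):
    """Index of the prompt most similar to the answer text."""
    return max(range(len(prompts)), key=lambda i: sim(answer, prompts[i]))

def accuracy(answers, gold, prompts, sim=bow_cosine):
    """Ratio of correct prompt assignments to all assignments."""
    correct = sum(best_prompt(a, prompts, sim) == g for a, g in zip(answers, gold))
    return correct / len(answers)

def is_off_prompt(answer, current_prompt, past_prompts, sim=bow_cosine):
    """Flag an answer that matches some past prompt better than the current one."""
    current = sim(answer, current_prompt)
    return any(sim(answer, p) > current for p in past_prompts)
```

A flagged answer would be passed on for manual checking rather than rejected outright, since some false positives are to be expected.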
5 Conclusions
In this report, we have introduced the discriminative TAP preference ranking model
for AAET. We have demonstrated that this model can be coupled with the RASP text
processing toolkit allowing fully automated extraction of a wide range of feature types
many of which we have shown experimentally are discriminative for AAET. We have
also introduced a generic and fully automated approach to error estimation based on
efficient matching of text sequences with a very large background ngram corpus
derived from the web using a Bloom filter, and have shown experimentally that this is
the single most discriminative feature in our AAET model. We have also shown
experimentally that this model performs significantly better than an otherwise equivalent
one based on classification as opposed to preference ranking. We have also shown
experimentally that text classification is at least as effective for AAET as a model
based on ISA, a recent and improved latent or distributional semantic content-based text
similarity method akin to that used in IEA. However, ISA is useful for detecting
off-prompt essays using a generic model of distributional semantic space that does not
require retraining for new prompts.
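As a reminder of how ngram matching against a Bloom filter works in principle, the following is a minimal sketch. It is not the implementation used in our experiments: the class, the SHA-256-based hashing, the trigram order, and the use of the proportion of unseen ngrams as the statistic are all assumptions made for illustration.

```python
import hashlib

class BloomFilter:
    """Fixed-size bit array with several hash functions per item.

    Membership tests never give false negatives but may give false positives.
    """

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes  # assumed small (fewer than 256)
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        data = item.encode("utf-8")
        for seed in range(self.num_hashes):
            # Derive independent hash values by prefixing a one-byte seed.
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def ngrams(tokens, n):
    """All contiguous n-token sequences, joined with single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def unseen_rate(tokens, background, n=3):
    """Proportion of the text's ngrams that are absent from the background set."""
    grams = ngrams(tokens, n)
    if not grams:
        return 0.0
    return sum(g not in background for g in grams) / len(grams)
```

Because the filter stores only bits, a very large ngram list can be held in memory at the cost of a small, tunable false-positive rate, which here would cause an occasional erroneous ngram to be counted as attested.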
Much further work remains to be done. We believe that the features assessed by our
AAET model make subversion by students difficult as they more directly assess
linguistic competence than previous approaches. However, it remains to test this
experimentally. We have shown that error estimation against a background ngram corpus
is highly informative, but our fully automated technique still lags error estimates based
on the manual error coding of the CLC. Further experimentation with larger background
corpora and weighting of ngrams on the basis of their frequency, pointwise mutual
information, or similar measures may help close this gap. Our AAET model is not trained
on prompt-specific data, which is operationally advantageous, but it does not
include any mechanism for detecting text lacking overall inter-sentential coherence. We
believe that ISA or other recent distributional semantic techniques provide a good basis
for adding such features to the model and plan to test this experimentally. Finally, our
current AAET system simply returns a score, though implicit in its computation is the
identification of both negative and positive features that contribute to its calculation.
We plan to explore methods for automatically providing feedback to students based on
these features in order to facilitate deployment of the system for self-assessment and
self-tutoring. In the near future, we intend to release a public-domain training set of
anonymised FCE scripts from the CLC together with an anonymised version of the test data
described in section 4. We also intend to report the performance of preference ranking with
the SVMlight package (Joachims, 1999) based on RASP-derived features, and error estimation
using a public domain corpus trained and tested on this data and compared to the
performance of our best TAP-based model. This will allow better replication of our
results and facilitate further work on AAET.
Acknowledgements
The research and experiments reported here were partly funded through a contract to
iLexIR Ltd from Cambridge ESOL, a division of Cambridge Assessment, which in turn
is a subsidiary of the University of Cambridge. We are grateful to Cambridge University
Press for permission to use the subset of the Cambridge Learners’ Corpus for these
experiments. We are also grateful to Cambridge Assessment for arranging for the test
scripts to be re-marked by four of their senior examiners to facilitate their evaluation.
References
Andersen, O. (2010) Grammatical error prediction, Cambridge University, Computer
Laboratory, PhD Dissertation.
Attali, Y. and Burstein, J. (2006) ‘Automated essay scoring with e-rater v2’, Journal of
Technology, Learning and Assessment, vol.4(3).
Baroni, M., and Lenci, I. (2009) ‘One distributional memory, many semantic spaces’,
Proceedings of the Wkshp on Geometrical Models of Natural Language Semantics, Eur. Assoc. for
Comp. Linguistics, pp. 1–8.
Bos, S. and Opper, M. (1998) ‘Dynamics of batch training in a perceptron’, J. Physics A:
Math. & Gen., vol.31(21), 4835–4850.
Burstein, J. (2003) ‘The e-rater scoring engine: automated essay scoring with natural
language processing’ in Shermis, M.D. and J. Burstein (eds.), Automated Essay
Scoring: A cross-Disciplinary Perspective, Lawrence Erlbaum Associates Inc., pp. 113–122.
Burstein, J., Braden-Harder, L., Chodorow, M.S., Kaplan, B.A., Kukich, K., Lu, C., Rock, D.A.,
and Wolff, S. (2002) System and method for computer-based automatic essay scoring, US
Patent 6,366,759, April 2.
Burstein, J., Higgins, D., Gentile, C., and Marcu, D. (2005) Method and system for determining
text coherence, US Patent 2005/0143971 A1, June 30.
Collins, M. (2002) ‘Discriminative training methods for hidden Markov models: theory and
experiments with Perceptron algorithms’, Proceedings of the Empirical Methods in Nat. Lg.
Processing (EMNLP), Assoc. for Comp. Linguistics, pp. 1–8.
Coniam, D. (2009) ‘Experimenting with a computer essay-scoring program based on ESL
student writing scripts’, ReCALL, vol.21(2), 259–279.
Dikli, S. (2006) ‘An overview of automated scoring of essays’, Journal of Technology,
Learning and Assessment, vol.5(1).
Elliot, S. (2003) ‘Intellimetric™: From Here to Validity’ in Shermis, M.D. and J. Burstein
(eds.), Automated Essay Scoring: A cross-Disciplinary Perspective, Lawrence Erlbaum
Associates Inc., pp. 71–86.
Foltz, P.W., Landauer, T.K., Laham, R.D., Kintsch, W., and Rehder, R.E. (2002) Methods for
analysis and evaluation of the semantic content of a writing based on vector length, US
Patent 6,356,864 B1, March 12.
Franz, A. and Brants, T. (2006) All our N-gram are Belong to You,
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html.
Freund, Y. and Schapire, R. (1998) ‘Large margin classification using the perceptron
algorithm’, Computational Learning Theory, pp. 209–217.
Gorman, J. and Curran, J.R. (2006) ‘Random indexing using statistical weight functions’,
Proceedings of the Conf. on Empirical Methods in Nat. Lg. Proc., Assoc. for Comp. Linguistics,
pp. 457–464.
Hawkey, R. (2009) Examining FCE and CAE: Studies in Language Testing, 28, Cambridge
University Press.
Joachims, T. (1998) ‘Text categorization with support vector machines: learning with many
relevant features’, Proceedings of the Eur. Conf. on Mach. Learning, Springer-Verlag,
pp. 137–142.
Joachims, T. (1999) ‘Making large-scale support vector machine learning practical’ in
Scholkopf, S.B. and C. Burges (eds.), Advances in kernel methods, MIT Press.
Joachims, T. (2002) ‘Optimizing search engines using clickthrough data’, Proceedings of the
SIGKDD, Assoc. Computing Machinery.
Kakkonen, T., Myller, N., Sutinen, E. (2006) ‘Applying Latent Dirichlet Allocation to automatic
essay grading’, Proceedings of the FinTAL, Springer-Verlag, pp. 110–120.
Kanejiya, D., Kamar, A. and Prasad, S. (2003) ‘Automatic Evaluation of Students’ Answers
using Syntactically Enhanced LSA’, Proceedings of the HLT-NAACL 03 Workshop on Building
Educational Applications Using Natural Language Processing, Assoc. for Comp. Linguistics.
Kanerva, P., Kristofersson, J., and Holst, A. (2000) ‘Random indexing of text samples for
latent semantic analysis’, Proceedings of the 22nd Annual Conf. of the Cognitive Science
Society, Cognitive Science Soc.
Krauth, W. and Mezard, M. (1987) ‘Learning algorithms with optimal stability in neural
networks’, J. of Physics A; Math. Gen., vol.20.
Kukich, K. (2000) ‘Beyond automated essay scoring’ in Hearst, M. (ed.), The debate
on automated essay grading, IEEE Intelligent Systems, pp. 27–31.
Landauer, T.K., Laham, D., and Foltz, P.W. (2000) ‘The Intelligent Essay Assessor’, IEEE
Intelligent Systems, vol.15(5).
Landauer, T.K., Laham, D. and Foltz, P.W. (2003) ‘Automated scoring and annotation of essays
with the Intelligent Essay Assessor’ in Shermis, M.D. and J. Burstein (eds.), Automated
Essay Scoring: A cross-Disciplinary Perspective, Lawrence Erlbaum Associates Inc., pp.
87–112.
Leakey, L.S. (1998) ‘Automatic essay grading using text categorization techniques’,
Proceedings of the 21st ACM-SIGIR, Assoc. for Computing Machinery.
Lewis, D.D., Yang, Y., Rose, T. and Li, T. (2004) ‘RCV1: A new benchmark collection for text
categorization research’, J. Mach. Learning Res., vol.5, 361–397.
Li, Y., Bontcheva, K. and Cunningham, H. (2005) ‘Using uneven margins svm and perceptron
for information extraction’, Proceedings of the 9th Conf. on Nat. Lg. Learning, Assoc. for
Comp. Ling.
Lonsdale, D. and Strong-Krause, D. (2003) ‘Automated Rating of ESL Essays’, Proceedings
of the HLT-NAACL 03 Workshop on Building Educational Applications Using Natural Language
Processing, Assoc. for Comp. Linguistics.
Manning, C., Raghavan, P., and Schutze, H. (2008) Introduction to Information Retrieval,
Cambridge University Press.
Nicholls, D. (2003) ‘The Cambridge Learner Corpus: Error coding and analysis for lexicography
and ELT’ in Corpus Linguistics II, Archer, D, Rayson, P., Wilson, A. and McCenery T.
(eds.), UCREL Technical Report 16, Lancaster University.
Page, E.B. (1966) ‘The imminence of grading essays by computer’, Phi Delta Kappan, vol.48,
238–243.
Page, E.B. (1994) ‘Computer grading of student prose, using modern concepts and
software’, Journal of Experimental Education, vol.62(2), 127–142.
Powers, D.E., Burstein, J., Chodorow, M., Fowles, M.E., Kukich, K. (2002) ‘Stumping e-rater:
challenging the validity of automated essay scoring’, Computers in Human Behavior, vol.18,
103–134.
Rosenblatt, F. (1958) ‘The perceptron: A probabilistic model for information storage and
organization in the brain’, Psychological Review, vol.65.
Rosé, C.P., Roque, A., Bhembe, D. and VanLehn, K. (2003) ‘A Hybrid Text Classification
Approach for Analysis of Student Essays’, Proceedings of the HLT-NAACL 03 Workshop on
Building Educational Applications Using Natural Language Processing, Assoc. for Comp.
Linguistics.
Rudner, L.M. and Lang, T. (2002) ‘Automated essay scoring using Bayes’ theorem’, Journal
of Technology, Learning and Assessment, vol.1(2).
Shaw, S and Weir, C. (2007) Examining Writing in a Second Language, Studies in Language
Testing 26, Cambridge University Press.
Sparck Jones, K. (1972) ‘A statistical interpretation of term specificity and its application in
retrieval’, Journal of Documentation, vol.28(1), 11–21.
Sleator, D. and Temperley, D. (1993) ‘Parsing English with a Link Grammar’, Proceedings of
the 3rd Int. Wkshp on Parsing Technologies, Assoc. for Comp. Ling.
Turney, P. and Pantel, P. (2010) ‘From frequency to meaning’, Jnl. of Artificial Intelligence
Research, vol.37, 141–188.
Vapnik, V.N. (1995) The nature of statistical learning theory, Springer-Verlag.
Williamson, D.M. (2009) A framework for implementing automated scoring, Educational Testing
Service, Technical Report.
Yang, Y., Zhang, J. and Kisiel, B. (2003) ‘A scalability analysis of classifiers in text
categorization’, Proceedings of the 26th ACM-SIGIR, Assoc. for Computing Machinery, pp. 96–103.
Appendix: Sample Script
The following is a sample of an FCE script with error annotation drawn from the CLC and
converted to XML. The full error annotation scheme is described in Nicholls (2003).
<head title="lnr:1.01" entry="0" status="Active" url="571574"
sortkey="AT*040*0157*0100*2000*01"> <candidate> <exam>
<exam_code>0100</exam_code> <exam_desc>First Certificate in
English</exam_desc> <exam_level>FCE</exam_level></exam> <personnel>
<ncode>011</ncode> <language>German</language> <age>18</age>
<sex>M</sex></personnel> <text> <answer1>
<question_number>1</question_number> <exam_score>34.2</exam_score>
<coded_answer> <p idx="15576">Dear Mrs Ryan<NS type="MP">|,</NS></p><p
idx="15577">Many thanks for your letter.</p><p idx="15578">I would like to travel in July
because I have got <NS type="MD">|my</NS> summer holidays from July to August and
I work as a bank clerk in August. I think a tent would suit my personal <NS
type="RP">life-style|lifestyle</NS> better than a log cabin because I love <NS
type="UD">the</NS> nature.</p><p idx="15579">I would like to play basketball during
my holidays at Camp California because I love this game. I have been playing basketball
for 8 years and today I am a member of an Austrian <NS type="RP">basketball-team|
basketball team</NS>. But I have never played golf in my life <NS
type="RC">but|though</NS> with your help I would be able to learn how to play golf and
I think this could be very interesting.</p><p idx="15580">I <NS type="W">also
would|would also</NS> like to know how much money I will get from you for <NS
type="RA">those|these</NS> two weeks because I would like to spend some money
<NS type="RT">for|on</NS> clothes.</p><p idx="15581">I am looking forward to
hearing from you soon.</p><p idx="15582">Yours sincerely</p>
</coded_answer></answer1> <answer2> <question_number>4</question_number>
<exam_score>30.0</exam_score> <coded_answer> <p idx="15583">Dear Kim</p><p
idx="15584">Last month I enjoyed helping at a pop concert and I think you want to hear
some funny stories about the <NS type="FN">experience|experiences</NS> I <NS
type="RV">made| had</NS>.</p><p idx="15585">At first I had to clean the three private
rooms of the stars. This was very boring but after I left the third room I met Brunner and
Brunner. These two people are stars in our country... O.K. I am just <NS
type="IV">kiding|kidding</NS>. I don’t like <NS type="W">the songs of Brunner and
Brunner|Brunner and Brunner’s songs</NS> because this kind of music is very
boring.</p><p idx="15586">I also had to clean the <NS type="RN">washing rooms|
washrooms</NS>. I will never ever help anybody to <NS type="S">organice|
organise</NS> a pop concert <NS type="MY">|again</NS>.</p><p idx="15587">But
after this <NS type="S">serville|servile</NS> work I met Eminem. I think you know his
popular songs like "My Name Is". It was one of the greatest moments in my life. I had to
<NS type="RV">bring| take</NS> him something to eat.</p><p idx="15588">It was <NS
type="UD">a</NS> hard but also <NS type="UD">a</NS> <NS type="RJ">funny|
fun</NS> work. You should try to <NS type="RV"><NS type="FV">called|
call</NS>|get</NS> some experience <NS type="RT">during|at</NS> such a
concert<NS type="RP"> you|. You</NS> would not regret it.</p><p idx="15589">I am
looking forward to hearing from you soon.</p>
</coded_answer></answer2></text></head>
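To make the annotation format concrete, the following sketch recovers the learner's original text and the corrected text from a single annotated paragraph. It rests on assumptions inferred from the sample above rather than from the annotation scheme itself: each NS element is read as incorrect|correct, and an NS element without a bar is read as material to be deleted in the corrected version. The function names are hypothetical, and the sketch operates on one well-formed `<p>` fragment at a time.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def render(elem, corrected):
    """Return the text of `elem`, choosing one side of every NS annotation.

    corrected=False gives the learner's original wording;
    corrected=True gives the annotator's correction.
    """
    before, after = [], []
    seen_bar = False

    def feed(text, literal):
        nonlocal seen_bar
        # Only a bar in the NS element's own text (not inside a nested NS)
        # separates the incorrect form from its correction.
        if literal and elem.tag == "NS" and not seen_bar and "|" in text:
            left, right = text.split("|", 1)
            before.append(left)
            after.append(right)
            seen_bar = True
        else:
            (after if seen_bar else before).append(text)

    feed(elem.text or "", True)
    for child in elem:
        feed(render(child, corrected), False)
        feed(child.tail or "", True)

    if elem.tag != "NS":
        return "".join(before)
    if not seen_bar:  # assumed: no correction given means "delete this"
        return "" if corrected else "".join(before)
    return "".join(after if corrected else before)

def error_type_counts(elem):
    """Count NS annotations by their type attribute, including nested ones."""
    return Counter(ns.get("type") for ns in elem.iter("NS"))
```

For example, applying `render(p, True)` to the paragraph containing the nested annotation `<NS type="RV"><NS type="FV">called|call</NS>|get</NS>` selects "get", while `render(p, False)` selects the learner's "called". Deleted or inserted words can leave doubled spaces, which a caller may wish to normalise.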