Technical Report UCAM-CL-TR-790
ISSN 1476-2986
University of Cambridge Computer Laboratory, Number 790

Automated assessment of ESOL free text examinations

Ted Briscoe, Ben Medlock, Øistein Andersen

November 2010

15 JJ Thomson Avenue, Cambridge CB3 0FD, United Kingdom
phone +44 1223 763500, http://www.cl.cam.ac.uk/

© 2010 Ted Briscoe, Ben Medlock, Øistein Andersen

Technical reports published by the University of Cambridge Computer Laboratory are freely available via the Internet: http://www.cl.cam.ac.uk/techreports/

Automated Assessment of ESOL Free Text Examinations

Ted Briscoe, Computer Laboratory, University of Cambridge
Ben Medlock and Øistein Andersen, iLexIR Ltd
ejb@cl.cam.ac.uk, ben/oistein@ilexir.co.uk

Abstract

In this report, we consider the task of automated assessment of English as a Second Language (ESOL) examination scripts written in response to prompts eliciting free text answers. We review and critically evaluate previous work on automated assessment for essays, especially when applied to ESOL text. We formally define the task as discriminative preference ranking and develop a new system trained and tested on a corpus of manually-graded scripts. We show experimentally that our best performing system is very close to the upper bound for the task, as defined by the agreement between human examiners on the same corpus. Finally we argue that our approach, unlike extant solutions, is relatively prompt-insensitive and resistant to subversion, even when its operating principles are in the public domain. These properties make our approach significantly more viable for high-stakes assessment.

1 Introduction

The task of automated assessment of free text passages or essays is distinct from that of scoring short text or multiple choice answers to a series of very specific prompts. Nevertheless, since Page (1966) described the Project Essay Grade (PEG) program, this has been an active and fruitful area of research. Today there are at least 12 programs and associated products (Williamson, 2009), such as the Educational Testing Service's (ETS) e-Rater (Attali and Burstein, 2006), PearsonKT's KAT Engine / Intelligent Essay Assessor (IEA) (Landauer et al., 2003) or Vantage Learning's IntelliMetric (Elliot, 2003), which are deployed to assess essays as part of self-tutoring systems or as a component of examination marking (e.g. Kukich, 2000). Because of the broad potential application of automated assessment to essays, these systems focus as much on assessing the semantic relevance or 'topicality' of essays to a given prompt as on assessing the quality of the essay itself.

Many English as a Second Language (ESOL) examinations include free text essay-style answer components designed to evaluate candidates' ability to write, with a focus on specific communicative goals. For example, a prompt might specify writing a letter to a friend describing a recent activity or writing an email to a prospective employer justifying a job application. The design, delivery, and marking of such examinations is the focus of considerable research into task validity for the specific skills and levels of attainment expected for a given qualification (e.g. Hawkey, 2009).
The marking schemes for such writing tasks typically emphasise use of varied and effective language appropriate for the genre, exhibiting a range and complexity consonant with the level of attainment required by the examination (e.g. Shaw and Weir, 2007). Thus, the marking criteria are not primarily prompt or topic specific but linguistic. This makes automated assessment for ESOL text (hereafter AAET) a distinct subcase of the general problem of marking essays, which, we argue, in turn requires a distinct technical approach, if optimal performance and effectiveness are to be achieved. Nevertheless, extant general purpose systems, such as e-Rater and IEA, have been deployed in self-assessment or second marking roles for AAET. Furthermore, Edexcel, a division of Pearson, has recently announced that from autumn 2009 a revised version of its Pearson Test of English Academic (PTE Academic), a test aimed at ESOL speakers seeking entry to English speaking universities, will be entirely assessed using "Pearson's proven automated scoring technologies".[1] This announcement from one of the major providers of such high stakes tests makes investigation of the viability and accuracy of automated assessment systems a research priority.[2]

In this report, we describe research undertaken in collaboration with Cambridge ESOL, a division of Cambridge Assessment, which is, in turn, a division of the University of Cambridge, to develop an accurate and viable approach to AAET and to assess the appropriateness of more general automated assessment techniques for this task. Section 2 provides some technical details of extant systems and considers their likely efficacy for AAET. Section 3 describes and motivates the new model that we have developed for AAET, based on the paradigm of discriminative preference ranking using machine learning over linguistically-motivated text features automatically extracted from scripts. Section 4 describes an experiment training and testing this classifier on samples of manually marked scripts from candidates for Cambridge ESOL's First Certificate in English (FCE) examination, and then comparing performance to human examiners and to our reimplementation of the key component of PearsonKT's IEA. Section 5 discusses the implications of these experiments within the wider context of operational deployment of AAET. Finally, section 6 summarises our main conclusions and outlines areas of future research.

[1] www.pearsonpte.com/news/Pages/PTEAcademiclaunch.aspx
[2] Williamson (2009) also states that e-Rater will be used operationally from mid-2009 for assessing components of ETS's TOEFL exam, but in conjunction with human marking.

2 Technical Background

A full history of automated assessment is beyond the scope of this report. For recent reviews of work on automated essay or free-text assessment see Dikli (2006) and Williamson (2009). In this section, we focus on the ETS's e-Rater and PearsonKT's IEA systems as these are two of the three main systems which are operationally deployed. We do not consider IntelliMetric further as there is no precise and detailed technical description of this system in the public domain (Williamson, 2009).
However, we do discuss a number of academic studies which assess and compare the performance of different techniques, as well as that of the public domain prototype system, BETSY (Rudner and Liang, 2002), which treats automated assessment as a Bayesian text classification problem, as this work sheds useful light on the potential of approaches other than those deployed by e-Rater and IEA.

2.1 e-Rater

e-Rater is extensively described in a number of publications and patents (e.g. Burstein, 2003; Attali and Burstein, 2006; Burstein et al., 2002, 2005). The most recently described version of e-Rater uses 10 broad feature types extracted from the text using NLP techniques: 8 represent writing quality and 2 content. These features correspond to high-level properties of a text, such as grammar, usage (errors), organisation or prompt/topic specific content. Each of these high-level features is broken down into a set of ground features; for instance, grammar is subdivided into features which count the number of auxiliary verbs, complement clauses, and so forth, in a text. These features are extracted from the essay using NLP tools which automatically assign part-of-speech tags to words and phrases, search for specific lexical items, and so forth. Many of the feature extractors are manually written and based on essay marking rubrics used as guides for human marking of essays for specific examinations. The resulting counts for each feature are associated with cells of a vector which encodes all the grammar features of a text. Similar vectors are constructed for the other high-level features.

The feature extraction system outlined above, and described in more detail in the references provided, allows any text to be represented as a set of vectors, each representing a set of features of a given high-level type. Each feature in each vector is weighted using a variety of techniques drawn from the fields of information retrieval (IR) and machine learning (ML). For instance, content-based analysis of an essay is based on vectors of individual word frequency counts drawn from text. Attali and Burstein (2006) transform frequency counts to weights by normalising the word counts to that of the most frequent word in a training set of manually-marked essays written in response to the same prompt, scored on a 6 point scale. Specifically, they remove stop words which are expected to occur with about equal frequency in all texts (such as "the"); then, for each of the score points, the weight for word i at score point p is:

    W_ip = (F_ip / MaxF_p) · log(N / N_i)    (1)

where F_ip is the frequency of word i at score point p, MaxF_p is the maximum frequency of any word at score point p, N is the total number of essays in the training set, and N_i is the total number of essays having word i in all score points in the training set.[3]

[3] Burstein et al. (2002) patent a different but related weighting of these counts using inverse document frequency (i.e. the well-known tf/idf weighting scheme introduced into IR by Sparck-Jones, 1972). Inverse document frequency is calculated from a set of prompt-specific essays which have been manually marked and assigned to classes. Presumably this was abandoned in favour of the current approach based on experimental comparison.

For automated assessment of the content of an unmarked essay, this weighted vector is computed by dropping the conditioning on p, and the result is compared to aggregated vectors for the marked training essays in each class using cosine similarity. The unmarked essay is assigned a content score corresponding to the most similar class. This approach transforms an unsupervised weighting technique, which only requires an unannotated collection of essays or documents, into a supervised one which requires a set of manually-marked prompt-specific essays.
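For concreteness, the following Python sketch shows content scoring along these lines: per-score-point word weights are computed with equation (1), and an unmarked essay is assigned the score point whose vector is most cosine-similar. This is our illustrative reconstruction rather than ETS's implementation; the function names, the tokenised input format, and the treatment of the test essay's vector (which the published description does not fully specify) are assumptions.

    import math
    from collections import Counter

    def train_content_vectors(essays_by_score, stop_words):
        """essays_by_score: dict mapping score point -> list of token lists.
        Returns one weight vector (dict word -> W_ip) per score point."""
        all_essays = [e for essays in essays_by_score.values() for e in essays]
        n = len(all_essays)                          # N: total training essays
        n_i = Counter()                              # N_i: essays containing word i
        for essay in all_essays:
            n_i.update(set(essay))
        vectors = {}
        for p, essays in essays_by_score.items():
            freq = Counter(w for e in essays for w in e if w not in stop_words)
            max_f = max(freq.values())               # MaxF_p
            # W_ip = (F_ip / MaxF_p) * log(N / N_i), equation (1)
            vectors[p] = {w: (f / max_f) * math.log(n / n_i[w])
                          for w, f in freq.items()}
        return vectors

    def cosine(u, v):
        dot = sum(u[w] * v.get(w, 0.0) for w in u)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def content_score(essay_tokens, vectors, stop_words):
        """Score an unmarked essay: we use its max-normalised counts here,
        dropping the conditioning on p, and return the most similar class."""
        freq = Counter(w for w in essay_tokens if w not in stop_words)
        max_f = max(freq.values())
        test_vec = {w: f / max_f for w, f in freq.items()}
        return max(vectors, key=lambda p: cosine(test_vec, vectors[p]))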
Other vectors are weighted in different ways depending on the type of features extracted. Counts of grammatical, usage and style features are smoothed by adding 1 to all counts (avoiding zero counts for any feature), then divided by the essay word count to normalise for different essay lengths, then transformed to logs of counts to avoid skewing results on the basis of abnormally high counts for a given feature. Rhetorical organisation is computed by random indexing (Kanerva et al., 2000), a modification of latent semantic indexing (see section 2.2), which constructs word vectors based on cooccurrence in texts. Words can be weighted using a wide variety of weight functions (Gorman and Curran, 2006). Burstein et al. (2005) describe an approach which calculates mean vectors for words from training essays which have been manually marked and segmented into passages performing different rhetorical functions. Mean vectors for each score point and passage type are normalised to unit length and transformed so they lie on the origin of a graph of the transformed geometric space. This controls for differing passage lengths and incorporates inverse document frequency into the word weights. The resulting passage vectors can now be used to compare the similarity of passages within and across essays, and, as above, to score essays for organisation via similarity to mean vectors for manually-marked training passages.

The set of high-level feature scores obtained for a given essay are combined to give an overall score. In earlier versions of e-Rater this was done by stepwise linear regression to assign optimal weights to the component scores, so that the correlation with manually-assigned overall scores on the training set was maximised. However, Attali and Burstein (2006) advocate a simpler and more perspicuous approach using the weighted average of standardised feature scores, where the weights can be set by expert examiners based on marking rubrics. Williamson (2009) in his description of e-Rater implies a return to regression-based weighting.

2.2 Intelligent Essay Assessor (IEA)

The IEA, like e-Rater, assesses essays in terms of a small number of high-level features such as content, organisation, fluency, and grammar. The published papers and patent describing the techniques behind IEA (e.g. Landauer et al., 2000, 2003; Foltz et al., 2002) focus on the use of latent semantic analysis (LSA), a technique originally developed in IR to compute the similarity between documents or between documents and keyword queries by clustering words and documents so that the measurement of similarity does not require exact matches at the word level. For a recent tutorial introduction to LSA see Manning et al. (2008:ch18). Landauer et al. (2000) argue that LSA measures similarity of semantic content and that semantic content is dominant in the assessment of essays.
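Since LSA is central to IEA, the following minimal numpy sketch may help fix ideas: a words-by-essays count matrix is reduced with a truncated SVD and a new essay is scored by cosine similarity to training essays in the reduced space. The dimensionality k and the nearest-neighbour scoring rule are illustrative assumptions on our part; IEA's actual configuration is not published.

    import numpy as np

    def lsa_fit(term_doc, k):
        """term_doc: words-by-essays count matrix (rows = words, cols = essays).
        Truncated SVD clusters words with similar distributions across essays."""
        u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
        u_k, s_k = u[:, :k], s[:k]
        docs_k = (np.diag(s_k) @ vt[:k, :]).T        # essays in the reduced space
        return u_k, s_k, docs_k

    def lsa_fold_in(u_k, s_k, term_vec):
        """Project a new essay's term-count vector into the reduced space."""
        return term_vec @ u_k @ np.linalg.inv(np.diag(s_k))

    def lsa_score(u_k, s_k, docs_k, train_scores, term_vec):
        """Assign the score of the most cosine-similar training essay."""
        q = lsa_fold_in(u_k, s_k, term_vec)
        sims = docs_k @ q / (np.linalg.norm(docs_k, axis=1)
                             * np.linalg.norm(q) + 1e-12)
        return train_scores[int(np.argmax(sims))]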
As for the content analysis component of e-Rater, LSA represents a document as a vector of words and requires a training set of prompt-specific manually-marked essays. However, instead of computing the cosine similarity directly between aggregated or mean vectors from the training set and the essay to be assessed, LSA deploys singular value decomposition (SVD) to reduce the dimensions of a matrix of words by essays, obtaining a new matrix with reduced dimensions which effectively clusters words with similar contexts (i.e. their distribution across essays) and clusters essays with similar words. Words can be weighted to take account of their frequency in an essay and across a collection of essays before SVD is applied. LSA can be used to measure essay coherence as well, by comparing passages within an essay and passages of similar rhetorical type from other essays. In this respect, there is little difference between e-Rater and IEA, because random indexing is simply a computationally efficient approximation of SVD which avoids construction of the full word-by-essay cooccurrence matrix.

Though it seems clear that IEA uses LSA to assess content and organisation (Foltz et al., 2002), it is unclear which other high-level features are computed this way. It is very unlikely that LSA is used to assess grammar or spelling, though there is no published description of how these features are assessed. On the other hand, it is likely that features like fluency are assessed via LSA, probably by using annotated training sets of text passages which illustrate this feature to different degrees, so that a score can be assigned in the same manner as the content score. IEA, by default, combines the score obtained from each high-level feature into an overall score using multiple regression against human scores in a training set. However, this can be changed for specific examinations based, for example, on the marking rubric (Landauer et al., 2000).

2.3 Other Research on Automated Assessment

2.3.1 Text Classification

Both e-Rater and IEA implicitly treat automated assessment, at least partly, as a text classification problem. Whilst the roots of vector-based representations of text as a basis for measuring similarity between texts lie in IR (Salton, 1971), and in their original form can be deployed in an unsupervised fashion, their use in both systems is supervised in the sense that similarity is now measured relative to training sets of premarked essays (along several dimensions), and thus test essays can be classified on a grade point scale. Manning et al. (2008:ch13) provides a tutorial introduction to text classification, an area of ongoing research which lies at the intersection of IR and ML and which has been given considerable impetus recently by new techniques emerging from ML.

Larkey (1998) explicitly modelled automated assessment as a text classification problem, comparing the performance of two standard classifiers, binomial Naive Bayes (NB) and kNN, over four different examination datasets. He found that binomial NB outperformed kNN and that the best document representation was a vector of lemmas or stemmed words rather than word forms.
Rudner and Liang (2002) describe BETSY (the Bayesian Essay Test Scoring sYstem), which uses either a binomial or multinomial NB classifier and represents essays in terms of unigrams, bigrams and non-adjacent bigrams of word forms. A full tutorial on NB classifiers can be found in Manning et al. (2008:ch13). Briefly, however, a multinomial model will estimate values for instances of the defined feature types from training data for each class type, which in the simplest case could be just 'pass' and 'fail', by smoothing and normalising frequency counts for each feature F; for example, for bigrams:

    P(F_bigram-i) = (Freq(W_j, W_k) + 1) / (Freq(W_j) + N)    (2)

where N is the total bigram frequency count for this portion of the training data. To predict the most likely class for new unlabelled text, the log class-conditional probabilities of the features found in the new text are summed for each class (C, e.g. pass/fail):

    log(P(C)) + Σ_i log(P(F_i | C))    (3)

and added to the prior probability of the class, typically estimated from the proportion of texts in each class in the training data. Taking the sum of the logs assumes ('naively') that each feature instance, whether unigram, bigram or whatever, is independent of the others. This is clearly incorrect, though it suffices to construct an accurate and efficient classifier in many situations. In practice, within the NB framework, more sophisticated feature selection or weighting to handle the obvious dependencies between unigrams and bigrams would probably improve performance, as would adoption of a classification model which does not rely on such strong independence assumptions. BETSY is freely available for research purposes. Coniam (2009) trained BETSY for AAET on a corpus of manually-marked year 11 Hong Kong ESOL examination scripts. He found that non-adjacent bigrams or word pairs provided the most useful feature types for accurate assessment.

Both approaches use regression to optimise the fit between the output of the classifiers, which in the case of the Bayesian classifiers can be interpreted as the degree of statistical certainty or confidence in a given classification, and the grade point scales used in the different examinations.

Text classification is a useful model for automated assessment as it allows the problem to be framed in terms of supervised classification using machine learning techniques, and provides a framework to support systematic exploration of different classifiers with different representations of the text. From this perspective, the extant work has only scratched the surface of the space of possible systems. For instance, all the approaches discussed so far rely heavily on so-called 'bag-of-words' representations of the text in which positional and structural information is ignored, and all have utilised non-discriminative classifiers. However, there are strong reasons to think that, at least for AAET, grammatical competence and performance errors are central to assessment, but these are not captured well by a bag-of-words representation. In general, discriminative classification techniques have performed better on text classification problems than non-discriminative techniques, such as NB classifiers, using similar feature sets (e.g. Joachims, 1998), so it is surprising that discriminative models have not been applied to automated essay assessment.
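The following sketch shows a BETSY-style multinomial NB classifier over unigram and adjacent-bigram features with add-one smoothing, in the spirit of equations (2) and (3); BETSY's non-adjacent bigram features and exact smoothing scheme are omitted for brevity, and all names are our own.

    import math
    from collections import Counter

    def ngrams(tokens):
        """Unigram and adjacent-bigram features (non-adjacent bigrams omitted)."""
        return tokens + [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

    class NaiveBayes:
        def fit(self, texts, labels):              # labels e.g. 'pass'/'fail'
            self.priors, self.feats, self.totals = {}, {}, {}
            self.vocab = set()
            for c in set(labels):
                docs = [t for t, l in zip(texts, labels) if l == c]
                self.priors[c] = math.log(len(docs) / len(texts))
                counts = Counter(f for d in docs for f in ngrams(d))
                self.feats[c], self.totals[c] = counts, sum(counts.values())
                self.vocab |= set(counts)
            return self

        def predict(self, tokens):
            v, scores = len(self.vocab), {}
            for c in self.priors:
                # log P(C) + sum_i log P(F_i | C), with add-one smoothing
                scores[c] = self.priors[c] + sum(
                    math.log((self.feats[c][f] + 1) / (self.totals[c] + v))
                    for f in ngrams(tokens))
            return max(scores, key=scores.get)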
2.3.2 Structural Information

Page (1994) was the first to describe a system which used partial parsing to extract syntactic features for automated assessment. e-Rater extended this work using hand-coded extractors to look for specific syntactic constructions and specific types of grammatical error (see section 2.1 and references therein). Kanejiya et al. (2003) describe an extension to LSA which constructs a matrix of words by essays in which words are paired with the part-of-speech tag of the previous word. This massively increases the size of the resultant matrix but does take account of limited structural information. However, their comparative experiments with pure LSA showed little improvement in assessment performance.

Lonsdale and Krause (2003) is the first application of a minimally-modified standard syntactic parser to the problem of automated assessment. They use the Link Parser (Sleator and Temperley, 1993) with some added vocabulary to analyse sentences in ESOL essays. The parser outputs the set of grammatical relations which hold between word pairs in the sentence, but is also able to skip words and output a cost vector (including the number of words skipped and the length of the sentence) when faced with ungrammatical input. The system scored essays by scoring each sentence on a five point scale, based on its cost vector, and then averaging these scores.

Rosé et al. (2003) directly compare four different approaches to automated assessment on a corpus of physics essays. These are a) LSA over words, b) a NB text classifier over words, c) bilexical grammatical relations and syntactic features, such as passive voice, from sentence-by-sentence parses of the essays, and d) a model integrating b) and c). They found that the NB classifier outperformed LSA, whilst the model based on parsing outperformed the NB classifier, and the model integrating parse information and the NB classifier performed best.

This body of work broadly supports the intuition that structural information is relevant to assessment, but the only direct comparison of LSA, word-based classification and classification via structural information is on physics essays and may not, therefore, be comparable for ESOL. Lonsdale and Krause show reasonable correlation with human scoring using parse features alone on ESOL essays, but they conduct no comparative evaluation. The experimental design of Rosé et al. is much better, but a similar experiment needs to be conducted for ESOL essays.

2.3.3 Content Analysis

IEA exploits LSA for content analysis and e-Rater uses random indexing (RI). Both techniques are a form of word-by-essay clustering which allows the systems to generalise from specific to distributionally related words, as measured by their occurrence in similar essays. However, there are many other published techniques for constructing distributional semantic 'spaces' of this general type (see e.g. Turney and Pantel, 2010, for a survey).
Firstly, the re are more recent approaches to constructing such distributional se mantic spaces w hich have b een s hown to outp erform RI and SVD-based techniques like LSA on the task of clus tering words by semantic similarity, w hich is arguably central to the content analysis comp one nt of automated as sess me nt. For e xample, B aroni et al (2007) show that Inc reme ntal Semantic Analysis (ISA) leads to b etter p erformance on se mantic categorisation of nouns and ve rbs . ISA is an improveme nt of RI . Initially each word w is assigned a signature, a sparse vector, sw, of fixed dimensionality d made up of a small numb er of randomly distribute d +1 and -1 cells with all other c ells ass igned 0. d is typic ally much smalle r than the dimens ionality of the p ossible conte xts (co o ccurre nces) of words given the text contexts us ed to define c o o cc urrence . At each o c curre nce of a target word t with a context word c , the history vector of t is up dated as follows: + (1 - mc)sc ht+ = i (mchc F r eq ( c ) K c mc ) (4) where i is a cons tant impact rate and mdetermines how much the history of one word influences the his tory of another word – the more frequent a context word the less it will influence the history of the target word. The m weight of c decreases as follows : m ) (5) =1 exp ( is a parameter determining rate of de cay. ISA has the advantage that it is fully w incremental, do es not rely on weighting schemes that require global c omputations here Km over contexts, and is therefore efficient to compute . It extends RI by up dating the ve ctor for t with the signature and history of c so that se cond order effe cts of the context word’s distribution are f actore d into the repre sentation of the target word. As well as exploring improved clustering techniques over LSA or RI such as ISA, b oth the weighting functions us ed for mo delling c o o ccurrence (e.g. Gorman and Curran, 2006), and the conte xts used to as sess co o ccurre nce (e.g. Baroni and Lenci, 2009), which has b ee n exclusively base d on an entire ess ay in automated ass essment work, should b e varied. For ins tance, the b est mo de ls of semantic s imilarity often me asure co o ccurrence of words in lo cal syntactic conte xts, such as those provided by the grammatic al relations output by a parser. Finally, though prompt-sp ec ific content analysis is clearly imp ortant f or assess ment of many es says typ es, it is not s o c lear that it is a c entral as p ect of E SOL assess ment, where demonstration of communicative comp etence and linguis tic varie ty without excessive errors is arguably more imp ortant than the sp ecific topic addressed. 2 .4 E val uat io n The e valuation of automated asses sment systems has largely b ee n base d on analys es of corre lation with human marke rs . Typically, systems are traine d on premarked ess ays f or 10 a s p ecific exam and prompt and their output scaled and fitted to a partic ular grade p oint scheme using regression or e xp ert rubric- based weighting. Then the Pearson correlation co efficient is calculate d for a set of test e ssays for which one or more human gradings are available. Using this me asure, b oth e-Rater, I EA and other approaches discusse d ab ove have b een show n to corre late well w ith human grades. Of te n they c orrelate as we ll as the grade s ass igned by two or more human markers on the same essays. Additionally, the rates of exact re plication of human s cores , of deviations by one p oint, and so forth can b e calculated. 
2.4 Evaluation

The evaluation of automated assessment systems has largely been based on analyses of correlation with human markers. Typically, systems are trained on premarked essays for a specific exam and prompt, and their output scaled and fitted to a particular grade point scheme using regression or expert rubric-based weighting. Then the Pearson correlation coefficient is calculated for a set of test essays for which one or more human gradings are available. Using this measure, e-Rater, IEA and other approaches discussed above have been shown to correlate well with human grades. Often they correlate as well as the grades assigned by two or more human markers on the same essays. Additionally, the rates of exact replication of human scores, of deviations by one point, and so forth can be calculated. These may be more informative about causes of larger divergences given specific phenomena in essays (e.g. Williamson, 2009; Coniam, 2009).

A weakness of the above approach is that it is relatively easy to build a system that will correlate well with human markers under ideal conditions. Even the original PEG (Page, 1966) obtained high correlations using very superficial textual features such as essay, word and sentence length. However, such features are easily 'gamed' by students and by instructors 'teaching to the exam' (assessment regime) once it is public knowledge what features are extracted for automated assessment. As automated assessment is not based on a full understanding of an essay, the features extracted are to some extent proxies for such understanding. The degree to which such proxies can be manipulated independently of the features that they are intended to measure is clearly an important factor in the analysis of systems, especially if they are intended for use in high-stakes assessment. Powers et al. (2002) conducted an experiment in which a variety of experts were invited to design and submit essays that they believed would either be under- or over-scored by e-Rater. The results showed that e-Rater was relatively robust to such 'gaming', though those with intimate knowledge of e-Rater were able to trick it into assigning scores deviating from human markers, even by 3 or more points on a 6-point scale.

A further weakness of comparison with human markers, and indeed of training such systems on raw human marks, is that human markers are relatively inconsistent and show comparatively poor correlation with each other. Alternatives have been proposed, such as training and/or testing on averaged or RASCH-corrected scores (e.g. Coniam, 2009), or evaluating by correlating system grades on one task, such as essay writing, with human scores on an independent task, such as spoken comprehension (Attali and Burstein, 2006).

Finally, many non-technical professionals involved in assessment object to automated assessment, arguing, for example, that a computer can never recognise creativity. In the end, this type of philosophical objection tends to dissipate as algorithms become more effective at any given task. For example, few argue that computers will never be able to play chess properly now that chess programs regularly defeat grandmasters, though some will argue that prowess at chess is not in fact a sign of 'genuine intelligence'. Nevertheless, it is clear that very thorough evaluation of assessment systems will be required before operational, especially high stakes, deployment, and that this should include evaluation in adversarial scenarios and on unusual 'outlier' data, whether this be highly creative or deviant. From this perspective it is surprising that Powers et al. (2002) is the sole study of this kind, though both e-Rater and IEA are claimed to incorporate mechanisms to flag such outliers for human marking.

3 AAET using Discriminative Preference Ranking

One of the key weaknesses of the text classification methods deployed so far for automated assessment is that they are based on non-discriminative machine learning models. Non-discriminative models often embody incorrect assumptions about the underlying properties of the texts to be classified: for example, that the probability of each feature (e.g. word or ngram) in a text is independent of the others, in the case of the NB classifier (see section 2.3).
Such models also weight features of the text in ways only loosely connected to the classification task: for example, possibly smoothed class-conditional maximum likelihood estimates of features in the case of the NB classifier (see again section 2.3). In this work, we apply discriminative machine learning methods, such as modern variants of the Large Margin Perceptron (Freund and Schapire, 1998) and the Support Vector Machine (SVM; Vapnik, 1995), to AAET. To our knowledge, this is the first such application to automated essay assessment. Discriminative classifiers make weaker assumptions concerning the properties of texts, directly optimize classification performance on training data, and yield optimal predictions if training and test material is drawn from the same distribution (see Collins (2002) for extended theoretical discussion and proofs).

In our description of the classifiers, we will use the following notation:

    N                  number of training samples
    ν                  avg. number of unique features per training sample
    X = R^D            real D-dimensional sample space
    Y = {+1, -1}       binary target label space
    x_i ∈ X            vector representing the ith training sample
    y_i ∈ {+1, -1}     binary category indicator for the ith training sample
    f : X → Y          classification function

3.1 Support Vector Machine

Linear SVMs (Vapnik, 1995) learn wide margin classifiers based on Structural Risk Minimization and continue to yield state-of-the-art results in text classification experiments (e.g. Lewis et al., 2004). In its dual form, linear SVM optimization equates to minimizing the following expression:

    (1/2) Σ_{i,j} a_i a_j y_i y_j (x_i · x_j) − Σ_i a_i    (6)

subject to the constraints Σ_i y_i a_i = 0 and a_i ≥ 0, where the a's are the weight coefficients. The prediction is given by:

    f(x) = sign(Σ_i a_i y_i (x_i · x) + b)    (7)

where b is the bias and sign(r) ∈ {-1, +1} depending on the sign of the input.

The practical use of the SVM model relies on efficient methods of finding approximate solutions to the quadratic programming (QP) problem posed by (6). A popular solution is implemented in Joachims' SVMlight package (Joachims, 1999), in which the QP problem is decomposed into small constituent subproblems (the 'working set') and solved sequentially. This yields a training complexity at each iteration of O(q^1.5 · ν) where q is the size of the working set. The efficiency of the procedure lies in the fact that q ≪ N. The number of iterations is governed by the choice of q, which makes it difficult to place a theoretical complexity bound on the overall optimization procedure, but experimental analysis by Yang et al. (2003) suggests a super-linear bound with respect to the number of training samples, though in our experience this is quite heavily dependent on the separability of the data and the value of the regularization hyperparameter. The per-sample time complexity for prediction in the SVM model is O(M · ν) where M is the number of categories, as a separate classifier must be trained for each category.
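In practice one need not implement the QP optimisation directly. The following sketch trains a pass/fail script classifier with scikit-learn's LinearSVC standing in for SVMlight; the simple unigram/bigram featuriser and the hyperparameters are illustrative rather than our full configuration (our feature set is described in section 3.4, and the pass threshold follows the split used later in section 4.2).

    from collections import Counter
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.svm import LinearSVC

    def featurise(tokens):
        """Sparse feature-count dictionary: unigrams plus adjacent bigrams."""
        feats = Counter(tokens)
        feats.update(f"{a}_{b}" for a, b in zip(tokens, tokens[1:]))
        return feats

    def train_pass_fail(scripts, marks, threshold=23):
        """scripts: list of token lists; marks: numeric scores (pass if > 23)."""
        vec = DictVectorizer()
        x = vec.fit_transform(featurise(s) for s in scripts)
        y = [1 if m > threshold else -1 for m in marks]
        clf = LinearSVC(C=1.0)      # regularization hyperparameter, cf. section 3.1
        clf.fit(x, y)
        return vec, clf

    def predict(vec, clf, tokens):
        return clf.predict(vec.transform([featurise(tokens)]))[0]  # sign(w.x + b)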
3.2 Timed Aggregate Perceptron

We now present a description of a novel variant of the batch perceptron algorithm, the Timed Aggregate Perceptron (TAP; Medlock, 2010). We will first introduce the ideas behind our model and then provide a formal description.

The online perceptron learning model has been a mainstay of artificial intelligence and machine learning research since its introduction by Rosenblatt (1958). The basic principle is to iteratively update a vector of weights in the sample space by adding some quantity in the direction of misclassified samples as they are identified. The Perceptron with Margins (PAM) was introduced by Krauth and Mezard (1987) and shown to yield better generalisation performance than the basic perceptron. More recent developments include the Voted Perceptron (Freund and Schapire, 1998) and the Perceptron with Uneven Margins (PAUM), applied with some success to text categorization and information extraction (Li et al., 2005).

The model we present is based on the batch training method (e.g. Bos and Opper, 1998), where the weight vector is updated in the direction of all misclassified instances simultaneously. In our model an aggregate vector is created at each iteration by summing all misclassified samples and normalising according to a timing variable which controls both the magnitude of the aggregate vector and the stopping point of the training process. The weight vector is then augmented in the direction of the aggregate vector and the procedure iterates. The timing variable is responsible for protection against overfitting; its value is initialised to 1 and gradually diminishes as training progresses until reaching zero, at which point the procedure terminates.

Given a set of N data samples paired with target labels (x_i, y_i), the TAP learning procedure returns an optimized weight vector ŵ ∈ R^D. The prediction for a new sample x ∈ R^D is given by:

    f(x) = sign(ŵ · x)    (8)

where the sign function converts an arbitrary real number to +/-1 based on its sign. The default decision boundary lies along the unbiased hyperplane ŵ · x = 0, though a threshold can easily be introduced to adjust the bias.

At each iteration, an aggregate vector ã_t is constructed by summing all misclassified samples and normalising:

    ã_t = norm(Σ_{x_i ∈ Q_t} x_i y_i, t_t)    (9)

where norm(a, τ) normalises a to magnitude τ and Q_t is the set of misclassified samples at iteration t, with the misclassification condition given by:

    w_t · x_i y_i < 1    (10)

A margin of +/-1 perpendicular to the decision boundary is thus required for correct classification of training samples. The timing variable t_t is set to 1 at the start of the procedure and gradually diminishes, governed by:

    t_t = t_{t-1} − t_{t-1}(L_t − L_{t-1})β    if L_t > L_{t-1}
          t_{t-1}                              otherwise         (11)

The class-normalised empirical loss, L_t, falls within the range (0, 1) and is defined as:

    L_t = (1/2)(|Q_t+| / N+ + |Q_t-| / N-)    (12)

with N+/- denoting the number of class +/-1 training samples respectively. β is a measure of the balance of the training distribution sizes:

    β = min(N+, N-) / N    (13)

with an upper bound of 0.5 representing perfect balance. Termination occurs when either the timing variable or the empirical loss reaches zero. How well the TAP solution fits the training data is governed by the rapidity of the timing schedule; earlier stopping leads to a more approximate fit. In some cases, it may be beneficial to tune the rapidity of the timing schedule to achieve optimal performance on a specific problem, particularly when cross validation is feasible.
In this instance we propose a modified version of expression (11) that includes a timing rapidity hyperparameter, r:

    t_t = t_{t-1} − r · t_{t-1}(L_t − L_{t-1})β    if L_t > L_{t-1}
          t_{t-1}                                  otherwise        (14)

which is equivalent to (11) in the case where r = 1. The full TAP learning procedure is given in Algorithm 1.

    Algorithm 1 – TAP training procedure
    Require: training data {(x_1, y_1), ..., (x_N, y_N)}
      t_1 = 1
      for t = 1, 2, 3 ... do
        if t_t = 0 ∨ L_t = 0 then
          terminate and return w_t
        else
          w_{t+1} = w_t + ã_t
        end if
        compute t_{t+1}
      end for

The timing mechanism used in our algorithm is motivated by the principle of early stopping in perceptron training (Bos and Opper, 1998), where the procedure is halted before reaching the point of minimum empirical loss. In our formulation, the timing variable also governs the length of the aggregate vector, which is analogous to the learning rate in the standard perceptron, and is decreased only when the class-normalised empirical loss increases. An increase in empirical loss is an indication either that the model is beginning to overfit or that the learning rate is too high, and a consequent decrease in t works to counter both possibilities. The scale of the decrease is governed by three heuristic factors:

    1. how far the algorithm has progressed (t_{t-1})
    2. the increase in empirical loss (L_t − L_{t-1})
    3. the balance of the training distributions (β)

The motivation behind the third heuristic is that in the early stages of the algorithm, unbalanced training distributions lead to aggregate vectors that are skewed toward the dominant class. If the procedure is stopped too early, the empirical loss will be disproportionately high for the subdominant class, leading to a skewed weight vector. The effect of β is to relax the timing schedule for imbalanced data, which results in higher quality solutions.

The TAP optimisation procedure requires storage of the input vectors along with the feature weight and update vectors, yielding space complexity of O(N) in the number of training samples. At each iteration, computation of the empirical loss and aggregate vector is O(N · ν) (recall that ν is the average number of unique features per sample). Given the current and previous loss values, computing t is O(1), and thus each iteration scales with time complexity O(N) in the number of training samples. The number of training iterations is governed by the rapidity of the timing schedule, which has no direct dependence on the number of training samples, yielding an approximate overall complexity of O(N) (linear) in the number of training samples.
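The following numpy sketch is our illustrative rendering of Algorithm 1 together with equations (9)-(14); it uses dense arrays for brevity where an efficient implementation would use sparse vectors.

    import numpy as np

    def norm_to(v, t):
        """Normalise vector v to magnitude t, as in equation (9)."""
        n = np.linalg.norm(v)
        return v * (t / n) if n > 0 else v

    def tap_train(x, y, r=1.0, max_iter=1000):
        """Timed Aggregate Perceptron. x: (N, D) array; y: (N,) in {+1, -1}."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        n_pos, n_neg = np.sum(y == 1), np.sum(y == -1)
        beta = min(n_pos, n_neg) / len(y)        # balance measure, equation (13)
        w = np.zeros(x.shape[1])
        t, prev_loss = 1.0, None
        for _ in range(max_iter):
            miscls = (x @ w) * y < 1             # misclassification condition (10)
            loss = 0.5 * (np.sum(miscls & (y == 1)) / n_pos
                          + np.sum(miscls & (y == -1)) / n_neg)   # equation (12)
            if t <= 0 or loss == 0:              # Algorithm 1 termination test
                break
            w = w + norm_to((x[miscls] * y[miscls][:, None]).sum(axis=0), t)  # (9)
            if prev_loss is not None and loss > prev_loss:
                t -= r * t * (loss - prev_loss) * beta   # timing update (11)/(14)
            prev_loss = loss
        return w

    # Prediction, equation (8): np.sign(w @ x_new)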
3.3 Discriminative Preference Ranking

The TAP and SVM models described above perform binary discriminative classification, in which training exam scripts must be divided into 'pass' and 'fail' categories. The confidence margin generated by the classifier on a given test script can be interpreted as an estimate of the degree to which that script has passed or failed, e.g. a 'good' pass or a 'bad' fail. However, this gradation of script quality is not modelled explicitly by the classifier; rather, it relies on emergent correlation of key features with script quality. In this section, we introduce an alternative ML technique called preference ranking which is better suited to the AAET task. It explicitly models the relationships between scripts by learning an optimal ranking over a given sample domain, inferred through an optimisation procedure that utilises a specified ordering on training samples. This allows us to model the fact that some scripts are 'better' than others, across an arbitrary grade range, without necessarily having to specify a numerical score for each, or introduce an arbitrary pass/fail boundary.

We now present a version of the TAP algorithm that efficiently learns preference ranking models. A derivation of similar equations for learning SVM-based models, and proof of their optimality, is given by Joachims (2002). The TAP preference ranking optimisation procedure requires a set of training samples, x_1, x_2, ..., x_n, and a ranking <_r such that the relation x_i <_r x_j holds if and only if sample x_j should be ranked higher than x_i, for a finite, discrete partial or complete ranking or ordering, 1 ≤ i, j ≤ n, i ≠ j. Given some ranking x_i <_r x_j, the method only considers the difference between the feature vectors x_i and x_j as evidence, known as a pairwise difference vector. The target of the optimisation procedure is to compute a weight vector ŵ that minimises the number of margin-separated misranked pairs of training samples, as formalised by the following constraints on pairwise difference vectors:

    ∀(x_i <_r x_j) : ŵ · (x_j − x_i) ≥ µ    (15)

where µ is the margin, given a specific value below.

The derived set of pairwise difference vectors grows quickly as a function of the number of training samples. An upper bound on the number of difference vectors for a set of training vectors is given by:

    u = a² · r(r − 1)/2    (16)

where r is the number of ranks and a is the average rank frequency. This yields intractable numbers of difference vectors for even modest numbers of training vectors, e.g. r = 4, a = 2000 yields 24,000,000 difference vectors. To overcome this, the TAP optimisation procedure employs a sampling strategy to reduce the number of difference vectors to a manageable quantity. An upper bound û is specified on the number of difference vectors, and then the probability of sampling an arbitrary difference vector is given by û/u, where u is given above.

The optimisation algorithm then proceeds as for the classification model (Algorithm 1), except that we have a one-sided margin. The misclassification condition is:

    w_t · (x_j − x_i) < 2    for (x_i <_r x_j) ∈ Q_t    (17)

and the aggregate vector ã_t is constructed by:

    ã_t = norm(Σ_{(x_i <_r x_j) ∈ Q_t} (x_j − x_i), t_t)    (18)

The modified procedure is shown in Algorithm 2.

    Algorithm 2 – TAP rank preference training procedure
    Require: training data {(x_1 <_r x_2), ..., (x_N <_r x_{N+1})}
      t_1 = 1
      for t = 1, 2, 3 ... do
        if t_t = 0 ∨ L_t = 0 then
          terminate and return w_t
        else
          w_{t+1} = w_t + ã_t
        end if
        compute t_{t+1}
      end for

Note that the one-sided preference ranking margin takes the value 2, mirroring the two-sided unit-width margin in the classification model. The termination of the optimisation procedure is governed by the timing rapidity hyperparameter, as in the classification case, and training time is approximately linear in the number of pairwise difference vectors, upper bounded by û (see above). The output from the training procedure is an optimised weight vector w_t, where t is the iteration at which the procedure terminated. Given a test sample x, predictions are made, analogously to the classification model, by computing the dot product w_t · x. The resulting real scalar can then be mapped onto a grade/score range via simple linear regression (or some other procedure), or used in rank comparison with other test samples. Joachims (2002) describes an analogous procedure for the SVM model, which we do not repeat here.

As stated earlier, in application to AAET, the principal advantage of this approach is that we explicitly model the grade relationships between scripts. Preference ranking allows us to model ordering in any way we choose; for instance we might only have access to pass/fail information, or a broad banding of grade levels, or we may have access to detailed scores. Preference ranking can account for each of these scenarios, whereas classification models only the first, and numerical regression only the last.
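The reduction to sampled pairwise difference vectors can be sketched as follows; this is again illustrative code, and since the difference vectors form a single 'class', the balance factor β of the classification model plays no role here.

    import itertools
    import numpy as np

    def sample_difference_vectors(x, ranks, u_hat, seed=0):
        """Difference vectors x_j - x_i for each pair with rank_i < rank_j,
        each kept with probability u_hat/u (cf. equation (16))."""
        rng = np.random.default_rng(seed)
        pairs = [(i, j) for i, j in itertools.combinations(range(len(ranks)), 2)
                 if ranks[i] != ranks[j]]
        if not pairs:
            raise ValueError("need at least two samples with different ranks")
        keep = rng.random(len(pairs)) < min(1.0, u_hat / len(pairs))
        diffs = [x[j] - x[i] if ranks[i] < ranks[j] else x[i] - x[j]
                 for (i, j), k in zip(pairs, keep) if k]
        return np.array(diffs)

    def tap_rank_train(x, ranks, u_hat=100000, r=1.0, max_iter=1000):
        """TAP preference ranking (Algorithm 2) over sampled difference vectors."""
        d = sample_difference_vectors(np.asarray(x, float), ranks, u_hat)
        w, t, prev_loss = np.zeros(d.shape[1]), 1.0, None
        for _ in range(max_iter):
            misranked = d @ w < 2                # one-sided margin, equation (17)
            loss = misranked.mean()
            if t <= 0 or loss == 0:
                break
            agg = d[misranked].sum(axis=0)       # aggregate vector, equation (18)
            w = w + agg * (t / (np.linalg.norm(agg) + 1e-12))
            if prev_loss is not None and loss > prev_loss:
                t -= r * t * (loss - prev_loss)  # timing update, beta omitted
            prev_loss = loss
        return w

    # A test script's rank score is w @ x; it can be mapped to a grade range
    # by simple linear regression against the training marks.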
3.4 Feature Space

Intuitively, AAET involves comparing and quantifying the linguistic variety and complexity, the degree of linguistic competence, displayed by a text against errors or infelicities in the performance of this competence. It is unlikely that this comparison can be captured optimally in terms of feature types like, for example, ngrams over word forms. Variety and complexity will not only be manifested lexically but also by the use of different types of grammatical construction, whilst grammatical errors of commission may involve non-local dependencies between words that are not captured by any given length of ngram. Nevertheless, the feature types used for AAET must be automatically extracted from text with good levels of reliability to be effectively exploitable.

We used the RASP system (Briscoe et al., 2006; Briscoe, 2006) to automatically annotate both training and test data in order to provide a range of possible feature types and their instances, so that we could explore their impact on the accuracy of the resulting AAET system. The RASP system is a pipeline of modules that perform sentence boundary detection, tokenisation, lemmatisation, part-of-speech (PoS) tagging, and syntactic analysis (parsing) of text. The PoS tagging and parsing modules are probabilistic and trained on native English text drawn from a variety of sources. For the AAET system and experiments described here we use RASP unmodified with default processing settings, and select the most likely PoS sequence and syntactic analysis as the basis for feature extraction. The system makes available a wide variety of output representations of text (see Briscoe, 2006 for details). In developing the AAET system we experimented with most of them, but for the subset of experiments reported here we make use of the set of feature types given along with illustrative examples in Table 1.

    Type                         Example
    lexical terms                and / mark
    lexical bigrams              dear mary / of the
    part-of-speech tags          NNL1 / JJ
    part-of-speech bigrams       VBR DA1 / DB2 NN1
    part-of-speech trigrams      JJ NNSB1 NP1 / VV0 PPY RG
    parse rule names             V1/modal_bse/+- / A1/a_inf
    script length                numerical
    corpus-derived error rate    numerical

    Table 1: Eight AAET Feature Types

Lower-cased but not lemmatised lexical terms (i.e. unigrams) are extracted along with their frequency counts, as in a standard 'bag-of-words' model. These are supplemented by bigrams of adjacent lexical terms.
Unigrams, bigrams and trigrams of adjacent sequences of PoS tags drawn from the RASP tagset and most likely output sequence are extracted along with their frequency counts. All instances of these feature types are included with their counts in the vectors representing the training data, and also in the vectors extracted for unlabelled test instances. Lexical term and ngram features are weighted by frequency counts from the training data and then scaled using tf · idf weighting (Sparck-Jones, 1972) and normalised to unit length. Rule name counts, script length and error rate are linearly scaled so that their weights are of the same order of magnitude as the scaled term/ngram counts.

Parse rule names are extracted from the phrase structure tree for the most likely analysis found by the RASP parser. For example, the following sentence from the training data, "Then some though occured to me.", receives the analysis given in Figure 1, whilst the corrected version, "Then a thought occurred to me.", receives the analysis given in Figure 2. In this representation, the nodes of the parse trees are decorated with one of about 1000 rule names, which are semi-automatically generated by the parser and which encode quite detailed information about the grammatical constructions found. However, in common with ngram features, these rule names are extracted as an unordered list from the analyses for all sentences in a given script, along with their frequency counts.

    [Figure 1: RASP parse tree for "Then some though occured to me", rooted in T/frag with fragmentary subanalyses such as Tph/np and NP/a1-c]

    [Figure 2: RASP parse tree for "Then a thought occurred to me", rooted in T/txt-sc1 with a complete clausal analysis (S/adv_s, S/np_vp)]

Each rule name, together with its frequency count, is represented as a cell in the vector derived from a script. The script length in words is used as a feature less for its intrinsic informativeness than for the need to balance the effect of script length on other features: error rates, ngram frequencies, and so on will tend to rise with the amount of text, but the overall quality of a script must be assessed as a ratio of the opportunities afforded for the occurrence of some feature to its actual occurrence.

The automatic identification of grammatical and lexical errors in text is far from trivial (Andersen, 2010). In the existing systems reviewed in section 2, a few specific types of well-known and relatively frequent errors, such as subject-verb agreement, are captured explicitly via manually-constructed error-specific feature extractors. Otherwise, errors are captured implicitly and indirectly, if at all, via unigram or other feature types. Our AAET system already improves on this approach because the RASP parser rule names explicitly represent marked, peripheral or rare constructions using the '-r' suffix, as well as combinations of extragrammatical subsequences suffixed 'frag', as can be seen by comparing Figure 2 and Figure 1.
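The following sketch assembles a script's vector from the Table 1 feature types. The input format (token, PoS tag and rule-name lists, as produced by RASP) and the exact tf·idf variant are illustrative assumptions; for brevity the numerical features, which the full system scales linearly and separately, are simply folded into the same dictionary.

    import math
    from collections import Counter

    def script_features(tokens, pos_tags, rule_names, error_rate):
        """Counts for the Table 1 feature types, from RASP-style annotations."""
        f = Counter(w.lower() for w in tokens)                        # lexical terms
        f.update(f"{a}_{b}".lower() for a, b in zip(tokens, tokens[1:]))  # bigrams
        f.update(pos_tags)                                            # PoS unigrams
        f.update(f"{a}_{b}" for a, b in zip(pos_tags, pos_tags[1:]))
        f.update(f"{a}_{b}_{c}" for a, b, c
                 in zip(pos_tags, pos_tags[1:], pos_tags[2:]))
        f.update(f"RULE:{r}" for r in rule_names)                     # parse rules
        f["LENGTH"] = len(tokens)      # numerical features; linearly scaled in the
        f["ERROR_RATE"] = error_rate   # full system, folded in here for brevity
        return f

    def tfidf_normalise(vectors):
        """tf.idf-weight the counts, then scale each vector to unit length."""
        n = len(vectors)
        df = Counter(k for v in vectors for k in v)
        out = []
        for v in vectors:
            w = {k: c * math.log(n / df[k]) for k, c in v.items()}
            z = math.sqrt(sum(x * x for x in w.values())) or 1.0
            out.append({k: x / z for k, x in w.items()})
        return out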
Such parse-based cues are automatically extracted without any need for error-specific rules or extractors, and can capture many types of long-distance grammatical error. However, we also include a single numerical feature representing the overall error rate of the script. This is estimated by counting the number of unigrams, bigrams and trigrams of lexical terms in a script that do not occur in a very large 'background' ngram model for English which we have constructed from approximately 500 billion words of English sampled from the world wide web. We do this efficiently using a Bloom filter (Bloom, 1970). We have also experimented with using frequency counts from smaller models and measures such as mutual information (e.g. Turney and Pantel, 2010). However, the most effective method we have found is to use simple presence/absence over a very large dataset of ngrams which, unlike, say, the Google ngram corpus (Franz and Brants, 2006), retains low frequency ngrams.
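The error-rate feature thus reduces to set-membership queries against the background ngram model, for which a Bloom filter gives constant-time lookup with no false negatives and a small, tunable false-positive rate. The following sketch uses illustrative parameters; the production filter is sized for a vastly larger ngram collection.

    import hashlib

    class BloomFilter:
        """Minimal Bloom filter (Bloom, 1970) using double hashing."""
        def __init__(self, n_bits=1 << 24, n_hashes=5):
            self.bits = bytearray(n_bits // 8)
            self.n_bits, self.n_hashes = n_bits, n_hashes

        def _positions(self, item):
            h = hashlib.sha256(item.encode()).digest()
            h1 = int.from_bytes(h[:8], "big")
            h2 = int.from_bytes(h[8:16], "big")
            return [(h1 + k * h2) % self.n_bits for k in range(self.n_hashes)]

        def add(self, item):
            for p in self._positions(item):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, item):
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(item))

    def error_rate(tokens, background):
        """Fraction of uni/bi/trigrams unattested in the background model."""
        grams = [" ".join(tokens[i:i + n])
                 for n in (1, 2, 3) for i in range(len(tokens) - n + 1)]
        misses = sum(1 for g in grams if g not in background)
        return misses / len(grams) if grams else 0.0

Here `background` can be the Bloom filter itself, populated once from the web-derived ngram collection.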
Although we have only described the feature types that we used in the experiments reported below, because they proved useful with respect to the competence level and text types investigated, it is likely that others made available by the RASP system, such as the connected, directed graph of grammatical relations over sentences, the degree of ambiguity within a sentence, the lemmas and/or morphological complexity of words, and so forth (see Briscoe, 2006 for a fuller description of the range of feature types, in principle, made available by RASP), will be discriminative in other AAET scenarios. The system we have developed includes automated feature extractors for most types of feature made available through the various representations provided by RASP. This allows the rapid and largely automated discovery of an appropriate feature set for any given assessment task, using the experimental methodology exemplified in the next section.

4 The FCE Experiments

4.1 Data

For our experiments we made use of a set of transcribed handwritten scripts produced by candidates taking the First Certificate in English (FCE) examination written component. These were extracted from the Cambridge Learner Corpus (CLC) developed by Cambridge University Press. These scripts are linked to metadata giving details of the candidate, date of the exam, and so forth, as well as the final scores given for the two written questions attempted by candidates (see Hawkey, 2009 for details of the FCE). The marks assigned by the examiners are postprocessed to identify outliers, which are sometimes second marked, and the final scores are adjusted using RASCH analysis to improve consistency. In addition, the scripts in the CLC have been manually error-coded using a taxonomy of around 80 error types providing corrections for each error. The errors in the example from the previous section are coded in the following way:

    <RD>some|a</RD> <SX>though|thought</SX> <IV>occured|occurred</IV>

where RD denotes a determiner replacement error, SX a spelling error, and IV a verb inflection error (see Nicholls 2003 for full details of the scheme). In our experiments, we used around three thousand scripts from examinations set between 1997 and 2004, each about 500 words in length. A sample script is provided in the appendix.

In order to obtain an upper bound on examiner agreement, and also to provide a better benchmark against which to assess the performance of our AAET system compared to that of human examiners (as recommended by, for example, Attali and Burstein, 2006), Cambridge ESOL arranged for four senior examiners to remark 100 FCE scripts drawn from the 2001 examinations in the CLC, using the marking rubric from that year. We know, for example, from analysis of these marks and comparison to those in the CLC, that the correlation between the human markers and the CLC scores is about .8 (Pearson) or .78 (Spearman's Rank), thus establishing an upper bound for performance of any classifier trained on this data (see section 4.3 below).

4.2 Binary Classification

In our first experiment we trained five classifier models on 2973 FCE scripts drawn from the years 1999–2003. The aim was to apply well-known classification and evaluation techniques to explore the AAET task from a discriminative machine learning perspective, and also to investigate the efficacy of individual feature types. We used the feature types described in section 3.4 with all the models and divided the training data into pass (mark above 23) and fail classes. Because there was a large skew in the training classes, with about 80% of the scripts falling into the pass class, we used the Break Even Precision (BEP) measure, defined as the point at which average precision = recall (e.g. Manning et al., 2008), to evaluate the performance of the models on this binary classification task. This measure favours a classifier which locates the decision boundary between the two classes in such a way that false positives/negatives are evenly distributed between the two classes. The models trained were naive Bayes, Bayesian logistic regression, maximum entropy, SVM, and TAP. Consistent with much previous work on text classification tasks, we found that the TAP and SVM models performed best and did not yield significantly different results. For brevity, and because TAP is faster to train, we report results only for this model in what follows.

Figure 3 shows the contribution of feature types to the overall accuracy of the classifier. With unigram terms alone it is possible to achieve a BEP of 66.4%. The addition of bigrams of terms improves performance by 2.6% (representing about 19% relative error reduction (RER) on the upper bound of 80%). The addition of an error estimate feature based on the Google ngram corpus further improves performance by 2.9% (further RER about 21%). Addition of parse rule name features further improves performance by 1.5% (further RER about 11%). The remaining feature types in Table 1 contribute another 0.4% improvement (further RER about 3%). These results provide some support for the choice of feature types described in section 3.4. However, the final datapoint in the graph in Figure 3 shows that if we substitute the error rate predicted from the CLC manual error coding for our corpus-derived estimate, then performance improves a further 2.9%, only 3.3% below the upper bound defined by the degree of agreement between human markers.
The latter result strongly suggests that the error rate is the single most discriminative feature type in our model. Table 3 shows the pairwise correlations (Spearman's Rank) between the CLC scores, the marks of the four senior examiners, and the marks assigned by our system (Auto-mark) on the remarked test set.

            CLC    Rater 1  Rater 2  Rater 3  Rater 4  Auto-mark
CLC          -      0.80     0.79     0.75     0.76     0.80
Rater 1     0.80     -       0.81     0.81     0.85     0.74
Rater 2     0.79    0.81      -       0.75     0.79     0.75
Rater 3     0.75    0.81     0.75      -       0.79     0.75
Rater 4     0.76    0.85     0.79     0.79      -       0.73
Auto-mark   0.80    0.74     0.75     0.75     0.73      -
Average:    0.78    0.80     0.78     0.77     0.78     0.75

Table 3: Correlation (Spearman's Rank)

These results suggest that the AAET system we have developed is able to achieve levels of correlation similar to those achieved by the human markers, both with each other and with the RASCH-adjusted marks in the CLC. To give a more concrete idea of the actual marks assigned and their variation, Table 4 gives the marks assigned to a random sample of 10 scripts from the test data (fitted to the appropriate score range by simple linear regression).

Auto-mark  Rater 1  Rater 2  Rater 3  Rater 4
    26        26       23       25       23
    33        36       31       38       36
    29        25       22       25       27
    24        23       20       24       24
    25        25       22       24       22
    27        26       23       30       24
     5        12        5       12       17
    29        30       25       27       27
    21        24       21       25       19
    23        25       22       25       25

Table 4: Sample predictions (random ten)

4.4 Temporal Sensitivity

The training data used so far in our experiments is drawn from examinations both before and after the test data. In order to investigate both the effect of different amounts of training data and the effect of training on scripts drawn from examinations at increasing temporal distance from the test data, we divided the data by year, trained on each year separately, and measured the correlation (Pearson) with the CLC marks. Figure 4 shows the results: clearly there is an effect of training data size, as no result is as good as those reported using the full dataset for training. However, there is also a strong effect of temporal distance between training and test data, reflecting the fact that both the type of prompts used to elicit text and the marking rubrics evolve over time (e.g. Hawkey, 2009; Shaw and Weir, 2007).

[Figure 4: Training Data Effects. Correlation with the CLC marks (ranging from 0.55 to 0.72) plotted by examination year, 1997–2004, together with the number of training samples available for each year.]

4.5 Error Estimation

In order to explore the effect of different datasets on the error prediction estimate, we have gathered a large corpus of English text from the web. Estimating error rate using a 2 billion word sample of text drawn from the UK domain, retaining low-frequency unigrams, bigrams, and trigrams, we were able to improve performance over estimation using the Google ngram corpus by 0.09% (Pearson) in experiments which were otherwise identical to those reported in section 4.3. To date we have gathered about a trillion words of sequenced text from the web, and we expect future experiments with error estimates based on larger samples of this corpus to improve on these results further. However, the results reported here already demonstrate the viability of this approach, in combination with parser-based features which implicitly capture many types of longer-distance grammatical error, compared to the more labour-intensive one of manually coding feature extractors for known types of stereotypical learner error.
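To make the presence/absence method concrete, the sketch below stores a background set of ngrams in a Bloom filter and estimates a script's error rate as the fraction of its uni-, bi- and trigrams absent from the background. This is a minimal illustration under our own simplifying assumptions (the class design, hash scheme, and the exact error-rate statistic are ours); it is not the implementation used in the experiments.

import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash functions over a bit array.
    False positives are possible; false negatives are not."""
    def __init__(self, size_bits, num_hashes):
        self.size, self.k = size_bits, num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive k bit positions by salting a cryptographic hash.
        for i in range(self.k):
            h = hashlib.sha1(f'{i}:{item}'.encode()).digest()
            yield int.from_bytes(h[:8], 'big') % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

def ngrams(tokens, n):
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def error_rate(tokens, background):
    """Fraction of the script's uni/bi/trigrams absent from the
    background model (our simplified definition of the statistic)."""
    grams = [g for n in (1, 2, 3) for g in ngrams(tokens, n)]
    return sum(g not in background for g in grams) / len(grams) if grams else 0.0

In production the filter would be sized for billions of ngrams and populated offline; for testing, any object supporting Python's `in` operator (even a plain set) can stand in for the background model.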
4.6 Incremental Semantic Analysis

Although the focus of our experiments has not been on content analysis (see section 2.3.3), we have undertaken some limited experiments to compare the performance of an AAET system based primarily on such techniques (such as PearsonKT's IEA, see section 2) to that of the system presented here. We used ISA (see section 2.3.3) to construct a system which, like IEA, uses similarity to an average vector, constructed using ISA from high-scoring FCE training scripts, as the basis for assigning a mark. The cosine similarity scores were then fitted to the FCE scoring scheme. We trained on about a thousand scripts drawn from 1999 to 2004 and tested on the standard test set from 2001. Using this approach we were only able to obtain a correlation of 0.45 (Pearson) with the CLC scores and an average of 0.43 (Pearson) with the human examiners. This contrasts with scores of 0.47 (Pearson) and 0.45 (Pearson) respectively obtained by training the TAP ranked preference classifier on a similar number of scripts using only unigram term features. These results, taken with those reported above, suggest that there is no clear advantage to using techniques that cluster terms according to their context of occurrence, and compute text similarity on the basis of these clusters, over the text classification approach deployed here. Of course, this experiment does not demonstrate that clustering techniques cannot play a useful role in AAET; however, it does suggest that a straightforward application of latent or distributional semantic methods to AAET is not guaranteed to yield optimal results.

4.7 Off-Prompt Essay Detection

As discussed in section 2.4, one issue with the deployment of AAET for high-stakes examinations or other 'adversarial' contexts is that a non-prompt-specific approach to AAET is vulnerable to 'gaming' via submission of linguistically excellent rote-learned text regardless of the prompt. Detecting such off-prompt text automatically does require content analysis of the type discussed in section 2.3.3 and explored in the previous section as an approach to grading. Given that our approach to AAET is not prompt-specific in terms of training data, ideally we would like to be able to detect off-prompt scripts with a system that does not require retraining for different prompts. That is, we would like a system able to compare the question and answer script within a generic distributional semantic space. Because the prompts are typically quite short, we cannot expect that in general there will be much direct overlap between contentful terms or lemmas in the prompt and those in the answer text. We therefore trained an ISA model using 10M words of diverse English text, a 250-word stop list, and ISA parameters of 2000 dimensions, impact factor 0.0003, and decay constant 50, with a context window of 3 words. Each question and answer is represented by the sum of the history vectors corresponding to the terms they contain. We also included additional dimensions representing actual terms in the overall model of distributional semantic space, to capture cases of literal overlap between terms in questions and in answers. The resulting vectors are then compared by calculating their cosine similarity. For comparison, we built a standard vector space model that measures semantic similarity using cosine distance between vectors of terms for question and answer via literal term overlap.
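At assignment time both models reduce to picking the prompt whose vector is most similar to the answer vector under cosine similarity. The sketch below shows this step for the literal-overlap baseline, using plain term-count vectors; the function names and whitespace tokenisation are our assumptions, not the system itself.

import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(c * v[t] for t, c in u.items() if t in v)
    norm = (math.sqrt(sum(c * c for c in u.values())) *
            math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def most_similar_prompt(answer, prompts):
    """Index of the prompt most similar to the answer text."""
    a = Counter(answer.lower().split())
    return max(range(len(prompts)),
               key=lambda i: cosine(a, Counter(prompts[i].lower().split())))

Replacing the term-count vectors with summed ISA history vectors (plus the literal-term dimensions described above) yields the corresponding assignment step for the augmented ISA model.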
To test the performance of these two approaches to off-prompt essay detection, we extracted 109 passing FCE scripts from the CLC answering four different prompts:

1. During your holiday you made some new friends. Write a letter to them saying how you enjoyed the time spent with them and inviting them to visit you.

2. You have been asked to make a speech welcoming a well-known writer who has come to talk to your class about his/her work. Write what you say.

3. "Put that light out!" I shouted. Write a story which begins or ends with these words.

4. Many people think that the car is the greatest danger to human life today. What do you think?

Each system was used to assign each answer text to the most similar prompt. The accuracy (ratio of correct to all assignments) of the standard vector space model was 85%, whilst the augmented ISA model achieved 93%. This preliminary experiment suggests that a generic model for flagging putative off-prompt essays for manual checking could be constructed by manually selecting a set of prompts from past papers and the current paper, and then flagging any answer that matches a past prompt better than the current prompt. There will be some false positives, but these initial results suggest that an augmented ISA model could perform well enough to be useful. Further experimentation on larger sets of generic training text and on optimal tuning of ISA parameters may also improve accuracy.

5 Conclusions

In this report, we have introduced the discriminative TAP preference ranking model for AAET. We have demonstrated that this model can be coupled with the RASP text processing toolkit, allowing fully automated extraction of a wide range of feature types, many of which we have shown experimentally to be discriminative for AAET. We have also introduced a generic and fully automated approach to error estimation, based on efficient matching of text sequences against a very large background ngram corpus derived from the web using a Bloom filter, and have shown experimentally that this is the single most discriminative feature in our AAET model. We have also shown experimentally that this model performs significantly better than an otherwise equivalent one based on classification as opposed to preference ranking, and that text classification is at least as effective for AAET as a model based on ISA, a recent and improved latent or distributional semantic content-based text similarity method akin to that used in IEA. However, ISA is useful for detecting off-prompt essays using a generic model of distributional semantic space that does not require retraining for new prompts.

Much further work remains to be done. We believe that the features assessed by our AAET model make subversion by students difficult, as they more directly assess linguistic competence than previous approaches; however, it remains to test this experimentally. We have shown that error estimation against a background ngram corpus is highly informative, but our fully automated technique still lags error estimates based on the manual error coding of the CLC. Further experimentation with larger background corpora, and with weighting of ngrams on the basis of their frequency, pointwise mutual information, or similar measures, may help close this gap. Our AAET model is not trained on prompt-specific data, which is operationally advantageous, but it does not include any mechanism for detecting text lacking overall inter-sentential coherence.
We believe that ISA or other recent distributional semantic techniques provide a good basis for adding such features to the model and plan to test this experimentally. Finally, our current AAET system simply returns a score, though implicit in its computation is the identification of both negative and positive features that contribute to its calculation. We plan to explore methods for automatically providing feedback to students based on these features, in order to facilitate deployment of the system for self-assessment and self-tutoring.

In the near future, we intend to release a public-domain training set of anonymised FCE scripts from the CLC, together with an anonymised version of the test data described in section 4. We also intend to report the performance of preference ranking with the SVMlight package (Joachims, 1999) based on RASP-derived features, with error estimation using a public-domain corpus, trained and tested on this data and compared to the performance of our best TAP-based model. This will allow better replication of our results and facilitate further work on AAET.

Acknowledgements

The research and experiments reported here were partly funded through a contract to iLexIR Ltd from Cambridge ESOL, a division of Cambridge Assessment, which in turn is a subsidiary of the University of Cambridge. We are grateful to Cambridge University Press for permission to use the subset of the Cambridge Learner Corpus for these experiments. We are also grateful to Cambridge Assessment for arranging for the test scripts to be remarked by four of their senior examiners to facilitate this evaluation.

References

Andersen, Ø. (2010) Grammatical error prediction, PhD dissertation, Computer Laboratory, University of Cambridge.

Attali, Y. and Burstein, J. (2006) 'Automated essay scoring with e-rater v2', Journal of Technology, Learning and Assessment, vol.4(3).

Baroni, M. and Lenci, A. (2009) 'One distributional memory, many semantic spaces', Proceedings of the Wkshp on Geometrical Models of Natural Language Semantics, Eur. Assoc. for Comp. Linguistics, pp. 1–8.

Bos, S. and Opper, M. (1998) 'Dynamics of batch training in a perceptron', J. Physics A: Math. & Gen., vol.31(21), 4835–4850.

Burstein, J. (2003) 'The e-rater scoring engine: automated essay scoring with natural language processing' in Shermis, M.D. and Burstein, J. (eds.), Automated Essay Scoring: A Cross-Disciplinary Perspective, Lawrence Erlbaum Associates Inc., pp. 113–122.

Burstein, J., Braden-Harder, L., Chodorow, M.S., Kaplan, B.A., Kukich, K., Lu, C., Rock, D.A. and Wolff, S. (2002) System and method for computer-based automatic essay scoring, US Patent 6,366,759, April 2.

Burstein, J., Higgins, D., Gentile, C. and Marcu, D. (2005) Method and system for determining text coherence, US Patent 2005/0143971 A1, June 30.

Collins, M. (2002) 'Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms', Proceedings of Empirical Methods in Nat. Lg. Processing (EMNLP), Assoc. for Comp. Linguistics, pp. 1–8.

Coniam, D. (2009) 'Experimenting with a computer essay-scoring program based on ESL student writing scripts', ReCALL, vol.21(2), 259–279.

Dikli, S. (2006) 'An overview of automated scoring of essays', Journal of Technology, Learning and Assessment, vol.5(1).
Elliot, S. (2003) 'IntelliMetric: From Here to Validity' in Shermis, M.D. and Burstein, J. (eds.), Automated Essay Scoring: A Cross-Disciplinary Perspective, Lawrence Erlbaum Associates Inc., pp. 71–86.

Foltz, P.W., Landauer, T.K., Laham, R.D., Kintsch, W. and Rehder, R.E. (2002) Methods for analysis and evaluation of the semantic content of a writing based on vector length, US Patent 6,356,864 B1, March 12.

Franz, A. and Brants, T. (2006) All our N-gram are Belong to You, http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html.

Freund, Y. and Schapire, R. (1998) 'Large margin classification using the perceptron algorithm', Computational Learning Theory, pp. 209–217.

Gorman, J. and Curran, J.R. (2006) 'Random indexing using statistical weight functions', Proceedings of the Conf. on Empirical Methods in Nat. Lg. Proc., Assoc. for Comp. Linguistics, pp. 457–464.

Hawkey, R. (2009) Examining FCE and CAE, Studies in Language Testing 28, Cambridge University Press.

Joachims, T. (1998) 'Text categorization with support vector machines: learning with many relevant features', Proceedings of the Eur. Conf. on Mach. Learning, Springer-Verlag, pp. 137–142.

Joachims, T. (1999) 'Making large-scale support vector machine learning practical' in Schölkopf, B. and Burges, C. (eds.), Advances in Kernel Methods, MIT Press.

Joachims, T. (2002) 'Optimizing search engines using clickthrough data', Proceedings of the SIGKDD, Assoc. for Computing Machinery.

Kakkonen, T., Myller, N. and Sutinen, E. (2006) 'Applying Latent Dirichlet Allocation to automatic essay grading', Proceedings of FinTAL, Springer-Verlag, pp. 110–120.

Kanejiya, D., Kumar, A. and Prasad, S. (2003) 'Automatic evaluation of students' answers using syntactically enhanced LSA', Proceedings of the HLT-NAACL 03 Workshop on Building Educational Applications Using Natural Language Processing, Assoc. for Comp. Linguistics.

Kanerva, P., Kristoferson, J. and Holst, A. (2000) 'Random indexing of text samples for latent semantic analysis', Proceedings of the 22nd Annual Conf. of the Cognitive Science Society, Cognitive Science Soc.

Krauth, W. and Mezard, M. (1987) 'Learning algorithms with optimal stability in neural networks', J. of Physics A: Math. Gen., vol.20.

Kukich, K. (2000) 'Beyond automated essay scoring' in Hearst, M. (ed.), The debate on automated essay grading, IEEE Intelligent Systems, pp. 27–31.

Landauer, T.K., Laham, D. and Foltz, P.W. (2000) 'The Intelligent Essay Assessor', IEEE Intelligent Systems, vol.15(5).

Landauer, T.K., Laham, D. and Foltz, P.W. (2003) 'Automated scoring and annotation of essays with the Intelligent Essay Assessor' in Shermis, M.D. and Burstein, J. (eds.), Automated Essay Scoring: A Cross-Disciplinary Perspective, Lawrence Erlbaum Associates Inc., pp. 87–112.

Larkey, L.S. (1998) 'Automatic essay grading using text categorization techniques', Proceedings of the 21st ACM-SIGIR, Assoc. for Computing Machinery.

Lewis, D.D., Yang, Y., Rose, T. and Li, T. (2004) 'RCV1: a new benchmark collection for text categorization research', J. Mach. Learning Res., vol.5, 361–397.

Li, Y., Bontcheva, K. and Cunningham, H. (2005) 'Using uneven margins SVM and perceptron for information extraction', Proceedings of the 9th Conf. on Nat. Lg. Learning, Assoc. for Comp. Ling.
Lonsdale, D. and Strong-Krause, D. (2003) 'Automated rating of ESL essays', Proceedings of the HLT-NAACL 03 Workshop on Building Educational Applications Using Natural Language Processing, Assoc. for Comp. Linguistics.

Manning, C., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval, Cambridge University Press.

Nicholls, D. (2003) 'The Cambridge Learner Corpus: error coding and analysis for lexicography and ELT' in Archer, D., Rayson, P., Wilson, A. and McEnery, T. (eds.), Corpus Linguistics II, UCREL Technical Report 16, Lancaster University.

Page, E.B. (1966) 'The imminence of grading essays by computer', Phi Delta Kappan, vol.48, 238–243.

Page, E.B. (1994) 'Computer grading of student prose, using modern concepts and software', Journal of Experimental Education, vol.62(2), 127–142.

Powers, D.E., Burstein, J., Chodorow, M., Fowles, M.E. and Kukich, K. (2002) 'Stumping e-rater: challenging the validity of automated essay scoring', Computers in Human Behavior, vol.18, 103–134.

Rosenblatt, F. (1958) 'The perceptron: a probabilistic model for information storage and organization in the brain', Psychological Review, vol.65.

Rosé, C.P., Roque, A., Bhembe, D. and VanLehn, K. (2003) 'A hybrid text classification approach for analysis of student essays', Proceedings of the HLT-NAACL 03 Workshop on Building Educational Applications Using Natural Language Processing, Assoc. for Comp. Linguistics.

Rudner, L.M. and Liang, T. (2002) 'Automated essay scoring using Bayes' theorem', Journal of Technology, Learning and Assessment, vol.1(2).

Shaw, S. and Weir, C. (2007) Examining Writing in a Second Language, Studies in Language Testing 26, Cambridge University Press.

Sparck Jones, K. (1972) 'A statistical interpretation of term specificity and its application in retrieval', Journal of Documentation, vol.28(1), 11–21.

Sleator, D. and Temperley, D. (1993) 'Parsing English with a Link Grammar', Proceedings of the 3rd Int. Wkshp on Parsing Technologies, Assoc. for Comp. Ling.

Turney, P. and Pantel, P. (2010) 'From frequency to meaning', Jnl. of Artificial Intelligence Research, vol.37, 141–188.

Vapnik, V.N. (1995) The Nature of Statistical Learning Theory, Springer-Verlag.

Williamson, D.M. (2009) A Framework for Implementing Automated Scoring, Educational Testing Service, Technical Report.

Yang, Y., Zhang, J. and Kisiel, B. (2003) 'A scalability analysis of classifiers in text categorization', Proceedings of the 26th ACM-SIGIR, Assoc. for Computing Machinery, pp. 96–103.

Appendix: Sample Script

The following is a sample FCE script with error annotation, drawn from the CLC and converted to XML. The full error annotation scheme is described in Nicholls (2003).
<head title="lnr:1.01" entry="0" status="Active" url="571574" sortkey="AT*040*0157*0100*2000*01">
<candidate>
<exam>
<exam_code>0100</exam_code>
<exam_desc>First Certificate in English</exam_desc>
<exam_level>FCE</exam_level></exam>
<personnel>
<ncode>011</ncode>
<language>German</language>
<age>18</age>
<sex>M</sex></personnel>
<text>
<answer1>
<question_number>1</question_number>
<exam_score>34.2</exam_score>
<coded_answer>
<p idx="15576">Dear Mrs Ryan<NS type="MP">|,</NS></p>
<p idx="15577">Many thanks for your letter.</p>
<p idx="15578">I would like to travel in July because I have got <NS type="MD">|my</NS> summer holidays from July to August and I work as a bank clerk in August. I think a tent would suit my personal <NS type="RP">life-style|lifestyle</NS> better than a log cabin because I love <NS type="UD">the</NS> nature.</p>
<p idx="15579">I would like to play basketball during my holidays at Camp California because I love this game. I have been playing basketball for 8 years and today I am a member of an Austrian <NS type="RP">basketball-team|basketball team</NS>. But I have never played golf in my life <NS type="RC">but|though</NS> with your help I would be able to learn how to play golf and I think this could be very interesting.</p>
<p idx="15580">I <NS type="W">also would|would also</NS> like to know how much money I will get from you for <NS type="RA">those|these</NS> two weeks because I would like to spend some money <NS type="RT">for|on</NS> clothes.</p>
<p idx="15581">I am looking forward to hearing from you soon.</p>
<p idx="15582">Yours sincerely</p>
</coded_answer></answer1>
<answer2>
<question_number>4</question_number>
<exam_score>30.0</exam_score>
<coded_answer>
<p idx="15583">Dear Kim</p>
<p idx="15584">Last month I enjoyed helping at a pop concert and I think you want to hear some funny stories about the <NS type="FN">experience|experiences</NS> I <NS type="RV">made|had</NS>.</p>
<p idx="15585">At first I had to clean the three private rooms of the stars. This was very boring but after I left the third room I met Brunner and Brunner. These two people are stars in our country... O.K. I am just <NS type="IV">kiding|kidding</NS>. I don't like <NS type="W">the songs of Brunner and Brunner|Brunner and Brunner's songs</NS> because this kind of music is very boring.</p>
<p idx="15586">I also had to clean the <NS type="RN">washing rooms|washrooms</NS>. I will never ever help anybody to <NS type="S">organice|organise</NS> a pop concert <NS type="MY">|again</NS>.</p>
<p idx="15587">But after this <NS type="S">serville|servile</NS> work I met Eminem. I think you know his popular songs like "My Name Is". It was one of the greatest moments in my life. I had to <NS type="RV">bring|take</NS> him something to eat.</p>
<p idx="15588">It was <NS type="UD">a</NS> hard but also <NS type="UD">a</NS> <NS type="RJ">funny|fun</NS> work. You should try to <NS type="RV"><NS type="FV">called|call</NS>|get</NS> some experience <NS type="RT">during|at</NS> such a concert<NS type="RP"> you|. You</NS> would not regret it.</p>
<p idx="15589">I am looking forward to hearing from you soon.</p>
</coded_answer></answer2></text></head>