Computer-aided generation of multiple-choice tests
Ruslan Mitkov
School of Humanities, Languages and Social Sciences
University of Wolverhampton, WV1 1SB
Email: R.Mitkov@wlv.ac.uk

Structure of the presentation
• Introduction
• Premises
• NLP-based methodology for the construction of multiple-choice test items: term extraction, distractor selection, question generation
• In-class experiments
• Evaluation: efficiency, item analysis
• Discussion
• Forthcoming work

Introduction
• Multiple-choice tests: an effective way to measure student achievement.
• Computer-aided multiple-choice test generation: an alternative to this labour-intensive task.
• A novel NLP methodology employing a shallow parser, automatic term extraction, word sense disambiguation, corpora and WordNet.

Premises
• Questions should focus on key concepts.
• Distractors should be as semantically close to the correct answer as possible.

Example
Syntax is the branch of linguistics which studies the way words are put together into sentences.

Which branch of linguistics studies the way words are put together into sentences?
o Pragmatics
o Syntax
o Morphology
o Semantics

NLP-based methodology (1): overview
narrative texts => term extraction => terms (key concepts)
terms => distractor selection (WordNet) => distractors
terms => question generation (transformational rules) => test items

NLP-based methodology (2): term extraction
• Nouns and noun phrases are first identified using the FDG shallow parser.
• Nouns with frequency over a threshold are defined as 'key terms'.
• NPs featuring key terms as heads and satisfying specific regular expressions (over phrase types such as adjectival phrases/verb phrases ...) are considered terms.
• Terms serve as 'anchors' for generating test questions.

NLP-based methodology (3): selection of distractors
• Semantic closeness: WordNet is consulted for close terms.
• If too many are returned, those also appearing in the corpus are given preference.
• Example: the electronic textbook contains the following noun phrases with 'modifier' as head: modifier that accompanies a noun, associated modifier, misplaced modifier.
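The WordNet-plus-corpus strategy above can be sketched in a few lines. This is a minimal illustration, not the system itself: the `COORDINATES` table is a hand-coded stand-in for a WordNet coordinate-term lookup, the corpus is a toy token list, and the function name `select_distractors` is hypothetical.

```python
from collections import Counter

# Hand-coded stand-in for WordNet coordinate terms (sisters under a
# shared hypernym); the real system queries WordNet.
COORDINATES = {
    "syntax": ["morphology", "semantics", "pragmatics", "phonology", "etymology"],
}

def select_distractors(answer, corpus_tokens, n=3):
    """Pick n distractors semantically close to the answer,
    preferring candidates that also occur in the corpus."""
    candidates = COORDINATES.get(answer, [])
    freq = Counter(corpus_tokens)
    # Sort by corpus frequency, descending, so in-corpus terms win;
    # the sort is stable, so equally frequent terms keep their order.
    ranked = sorted(candidates, key=lambda t: -freq[t])
    return ranked[:n]

corpus = ["syntax", "morphology", "semantics", "pragmatics",
          "morphology", "sentence", "word"]
print(select_distractors("syntax", corpus))
# -> ['morphology', 'semantics', 'pragmatics']
```

The corpus-frequency tie-breaking mirrors the preference described above: 'morphology' occurs twice in the toy corpus, so it outranks the candidates that occur once or not at all.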
• Alternative: corpus search (NPs with the same head but different modifiers are selected).

NLP-based methodology (4): generation of test questions
• Eligible sentences: contain domain-specific terms and have an SVO or SV structure.
• Examples of generation rules:
  S(term) V O => "Which H V O?", where H is a hypernym of the term
  S V O(term) => "What do/does/did S V?"
• Agreement rules
• Genre-specific heuristics

In-class experiments
• A controlled set of 36 test items was introduced (24 generated with the help of the program, 12 manually produced).
• 45 undergraduate students took the test.
• The system operates via the Questionmark Perception web-based testing software.

Example of a generated test item
Which kind of pronoun will agree with the subject in number, person, and gender?
o second person pronoun
o indefinite pronoun
o relative pronoun
o reflexive pronoun

Post-editing
• Automatically generated test items were classed as "worthy" (57%) or "unworthy" (43%).
• About 9% of the automatically generated items did not need any revision.
• Of the revisions needed: minor (17%), fair (36%) and major (47%).

Evaluation
• Efficiency of the procedure
• Quality of the test items

Evaluation (2): efficiency of the procedure

                  items produced   time   average time per item
  computer-aided  300              540'   1' 48''
  manual          65               450'   6' 55''

Evaluation (3): quality of the test items
Item analysis:
• Item difficulty (= C/T, where C is the number of students answering the item correctly and T the number of students taking the test)
• Discriminating power (= (CU - CL) / (T/2), where CU and CL are the numbers of correct responses in the upper and lower halves of the scoring range)
• Usefulness of the distractors (comparing the number of students in the upper and lower groups who selected each incorrect alternative)

Evaluation (4): results

                               computer-aided   manual
  Item difficulty
    average                    0.75             0.59
    too easy                   3                1
    too difficult              0                0
  Discriminating power
    average                    0.4              0.25
    negative                   1                2
    poor                       6                10
  Usefulness of distractors
    not useful                 3                2
    total                      65               33
    avg difference             1.92             1.18

Discussion
• Computer-aided construction of multiple-choice test items is much more effective than purely manual construction.
• The quality of the test items produced with the help of the program is not compromised in exchange for the time and labour savings.

Forthcoming work: extensions to other genres
• The current project delivered a prototype in the area of Linguistics, but the system will be tuned to cover Chemistry, Biology, Mathematics and Computer Science.

Forthcoming work: other types of questions
• Questions about properties or information associated with the term (e.g. colour, location, time, discoverer/author) will also be generated.
• 'Uranium was discovered in 1798 by Martin Klaproth.' => 'When was uranium discovered?' or 'Who discovered uranium?'
• 'Carbon dioxide is a colourless gas.' => 'What is the colour of the gas carbon dioxide?'

Forthcoming work: other suitable types of distractors
• Distractors of the same semantic category would be features placed close to the correct answer on a specific property scale (e.g. time, colour).
• General (e.g. WordNet, Roget's Thesaurus) and/or domain-specific resources can be used to provide such scales.
• Additional heuristics: preference for selecting time expressions, colours, human proper names etc. that also appear in the same corpus/document.
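The property-scale selection just described can be sketched as follows. This is a minimal illustration under stated assumptions: the property values (here, years) are assumed to be already extracted, the candidate list is invented, the year 1798 simply follows the slide's example sentence, and the function name `scale_distractors` is hypothetical.

```python
def scale_distractors(correct, candidates, n=3):
    """Choose the n candidate values that lie closest to the correct
    answer's value on the property scale (here, a year)."""
    return sorted(candidates, key=lambda v: abs(v - correct))[:n]

# Illustrative candidate years for 'When was uranium discovered?'
candidate_years = [1766, 1772, 1794, 1803, 1807, 1898]
print(scale_distractors(1798, candidate_years))
# -> [1794, 1803, 1807]
```

Distractors far from the correct value (1766 or 1898 here) are screened out, since implausibly distant alternatives make an item too easy.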
Forthcoming work: extraction of domain-specific feature patterns
• Extraction of domain-specific patterns of the form <term, feature1, ... featureN>.
• Example: <chemical element (proper-name element), weight (number), colour (value from a set of colours), location/found in (proper-name place), discoverer (proper-name human), time of discovery (temporal expression)>.
• Represented as typed feature structures; the basis for restricted-domain ontologies.

Forthcoming work: more sophisticated term extraction
• Based on the statistical and linguistic properties of the terms
• Statistical scores: (relative) frequency, tf.idf, mutual information
• Different types of term variation (Jacquemin 1999; Ha 2003b)
• Part-of-speech patterns (Justeson and Katz 1996)
• "Knowledge patterns" (Meyer 2001; Ha 2003a)
• Machine-learning methods will be employed

Forthcoming work: wider-coverage question generation grammar
• ML methods will be experimented with to improve the variety of the transformational rules.
• End-of-chapter questions and the sentences containing their answers will be automatically aligned.
• (Semi-)automatic alignment at word and phrase level will also be performed.

Forthcoming work: experiments with other similarity measures
• Statistical, corpus-based methods to mine for close concepts/words (Pekar 2002, 2003)
• Recent thesaurus-based similarity approaches (Budanitsky and Hirst 2001; Jarmasz and Szpakowicz 2003)

By-products
• Bank of test items
• Restricted domain-specific ontologies

Other future plans
• Offer the option to generate a long list of distractors, with the user choosing among them
• Study the impact of the program on professional test developers
• Measure agreement among post-editors

Computer-Aided Generation of Multiple-Choice Tests
Thank you