Automatic Scoring of Vocabulary Tests: Does it work?
Ana Pellicer-Sánchez
University of Nottingham
aexamp1@nottingham.ac.uk

INTRODUCTION

Vocabulary tests are a useful component of language proficiency batteries. Previous research has shown that measures of vocabulary size predict general language proficiency. One type of test used to measure vocabulary size is the checklist format. It consists of a list of words, and testees simply respond Yes or No to indicate whether they know the meaning of each word.

Example of the paper-and-pencil version of the checklist test:

Look through the English words listed below. Cross out words that you do not know well enough to say what they mean.

ADVISER WEEKEND MORLORN DISDAIN GHASTLY MOISTER
DISCARD CONTORD IMPLORE STOURGE REFUSAL SARSAGE
(items taken from Meara, 1990)

Advantages
- Easy to construct.
- Simple and fast to complete.
- Large number of items assessed.
- Possible computerization.
- Straightforward and automatic scoring.

Disadvantages
- Limited indication of depth of knowledge.
- Only measures passive recognition.
- Difficulty in assessing words with several meanings.
- Subjects' OVERESTIMATION.

The main solution to the problem of overestimation has been the use of NONWORDS (e.g. Meara, 1990; Meara & Buxton, 1987). However, despite the benefits of this approach, one main disadvantage remains: how do we adjust scores? Although different scoring methods have been proposed (e.g. Meara's hits and false alarms), there is currently no consensus on the best adjustment formula.

In this study we examine an alternative solution that does not involve nonwords but instead uses reaction times (RTs) and in-depth interviews to validate checklist tests. We explore whether the time it takes learners to make their judgements can be used to determine the accuracy of those judgements. The approach assumes that quick judgements should be relatively accurate, while hesitant judgements may be less certain.

RESEARCH QUESTIONS

1) Are results of the checklist test an accurate indication of true vocabulary knowledge?
2) Are RTs a good indication of true vocabulary knowledge?
3) If so, is there a general threshold in RTs to establish the accuracy of responses?

METHODOLOGY

Participants
- 35 native English speakers (UG students, University of Nottingham).

Materials

Computerized Checklist Test:
- E-Prime software.
- 40 items. Random order presentation.
- Items presented one by one.
- Items taken from a range of frequencies.

Interviews:
- Meaning recall part.
- Meaning recognition part: one multiple-choice item per word. Example:

4. imperil: Those actions imperil our families.
imperil
a. to put someone or something in a dangerous situation.
b. to make it easier for someone to do something.
c. to make someone feel sad and worried.
d. I don't know.

Procedure

1) Completion of computerized checklist test:
- Participants saw the target words one by one.
- They had to press YES or NO to indicate whether or not they knew the meaning of the word.
- Their RTs were recorded.

2) Interviews:
- Meaning recall part: Participants were first shown the words in isolation and were asked to define each word.
- If a participant's response in the meaning recall part had not been informative enough, they had to complete the multiple-choice question for that item.

3) Interview scoring:
Meaning recall = 2
Meaning recognition = 1
No knowledge = 0

RESULTS AND DISCUSSION

Research Question 1:
- Checklist test responses (yes/no) were compared with interview scores.
- Results showed that the checklist test format is a good indication of participants' true knowledge (at least for our group of participants).
- Not much overestimation.
- Possible underestimation. The checklist test format is a conservative measure.

[Pie charts: interview results (No knowledge, Meaning Recognition, Meaning Recall) for NO responses and for YES responses. Percentages shown: 9%, 10%, 16%, 24%, 67%, 74%.]

Research Question 2:
- RT data was compared with interview scores: first by the level of knowledge shown in interviews (Figure 1) and then by collapsing all the levels into correct/matching or incorrect/non-matching responses (Figure 2).

Figure 1. Mean RT of Yes and No responses per level of knowledge shown in interviews.
[Bar chart, RT in ms, by level: No Knowledge, Meaning Recognition, Meaning Recall. Bar values shown: 1404, 1329, 1240, 1176, 1128, 796. Significance: NO responses * .005; YES responses * .000, ** .000.]

Figure 2. Mean RT of incorrect and correct responses.
[Bar chart, RT in ms, for NO and YES responses. Labelled values: correct responses 1176, incorrect responses 1384; correct responses 872, incorrect responses 1128. Significance: * NO responses, Sig. .008; ** YES responses, Sig. .000.]

- RT data shows a significant difference between correct and incorrect responses. This may make it possible to discriminate between accurate (faster) and inaccurate (slower) responses.

Research Question 3:
- Data for each participant was individually analyzed.
- Results showed considerable variation, ranging from very clear cases to very unclear ones. No clear threshold.
- More data is currently being collected in search of a clearer threshold.

Key to accuracy marks:
YES RESPONSES
√√ = Accurate Response (Meaning Recall)
√ = Accurate Response (Meaning Recognition)
X = Inaccurate Response (Data of test and interview not matching)
NO RESPONSES
√ = Accurate Response (No knowledge)
X = Inaccurate Response (Meaning recognition or meaning recall)

Example 1 (RT of YES responses for one participant)

Msec.   Accuracy
482     √√
494     √√
503     √√
546     √√
564     √√
573     √√
592     √√
606     √√
741     √√
763     √√
781     √√
831     √√
850     √
1078    X
1185    √
1240    √√
2067    X

- M = 817 msec.
- SD = 400
- Threshold = M + 0.4 SD

Example 2 (RT of YES responses for one participant)
- M = 1262 msec.
- SD = 532
- Threshold = ?

Example 3 (RT of NO responses for one participant)
- M = 949 msec.
- SD = 309
- Threshold = ?

[Item-level RT and accuracy columns for Examples 2 and 3 are not recoverable from the poster layout.]

CONCLUSIONS

- Research Question 1: For the group of participants tested, the checklist format seems to be a good measure of true vocabulary knowledge.
- Research Question 2: The data shows a significant difference between RTs for accurate and inaccurate responses, suggesting that RTs might be an effective way of discriminating between incorrect and correct responses in checklist tests and, potentially, in other test formats.
- Research Question 3: The data collected so far has not shown a clear threshold separating accurate from inaccurate responses.
- Further research is being carried out in an attempt to identify this threshold.

REFERENCES

- Meara, P. (1990). Some notes on the Eurocentres vocabulary tests. In J. Tommola (Ed.), Foreign language comprehension and production (pp. 103-113). Turku: AFinLa Yearbook.
- Meara, P. & Buxton, B. (1987). An alternative to multiple choice vocabulary tests. Language Testing, 4, 142-154.
- Meara, P. & Jones, G. (1988). Vocabulary size as a placement indicator. In P. Grunwell (Ed.), Applied Linguistics in Society (pp. 80-87). London: CILT.
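
Illustration of the Example 1 threshold rule. The rule reported for Example 1 (Threshold = M + 0.4 SD) can be sketched in Python as follows. This is only a minimal sketch under stated assumptions, not the analysis procedure used in the study: the function names `rt_threshold` and `flag_slow_responses` are hypothetical, SD is assumed to be the sample standard deviation, and treating RTs above the threshold as "less certain" is an illustrative reading of the approach. The poster itself reports that no general threshold has been found.

```python
from statistics import mean, stdev

# RTs (msec) of the YES responses listed for Example 1 on the poster.
EXAMPLE_1_RTS = [482, 494, 503, 546, 564, 573, 592, 606, 741,
                 763, 781, 831, 850, 1078, 1185, 1240, 2067]


def rt_threshold(rts, k=0.4):
    """Return M + k * SD for a list of reaction times.

    Assumption: SD is the sample standard deviation (n - 1 denominator).
    The default k = 0.4 is the value given for Example 1 only.
    """
    return mean(rts) + k * stdev(rts)


def flag_slow_responses(rts, k=0.4):
    """Return the RTs above the threshold (hypothetical helper).

    Slower responses are treated here as less certain, which is an
    illustrative assumption rather than a validated scoring rule.
    """
    cutoff = rt_threshold(rts, k)
    return [rt for rt in rts if rt > cutoff]


if __name__ == "__main__":
    print("M =", round(mean(EXAMPLE_1_RTS)))
    print("SD =", round(stdev(EXAMPLE_1_RTS)))
    print("Threshold =", round(rt_threshold(EXAMPLE_1_RTS)))
    print("Above threshold:", flag_slow_responses(EXAMPLE_1_RTS))
```

For the Example 1 data, the rounded mean and sample SD match the reported M = 817 and SD = 400, and the four RTs above the resulting threshold are 1078, 1185, 1240 and 2067 msec. Two of these were scored as inaccurate in the interviews and two as accurate, which is in line with the observation that individual cases do not yield a clean cut-off.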