Automatic Scoring of Vocabulary Tests: Does it work?

Ana Pellicer-Sánchez
University of Nottingham
aexamp1@nottingham.ac.uk
INTRODUCTION
RESULTS AND DISCUSSION
Vocabulary tests are a useful component of language proficiency batteries. Previous research has
shown that measures of vocabulary size predict general language proficiency. One type of test
used to measure vocabulary size is the checklist format. It consists of a list of words, and
testees simply respond Yes or No to indicate whether they know the meaning of each word. Example of a
paper-and-pencil version of the checklist test:
Look through the English words listed below. Cross out words that you do not know well
enough to say what they mean.
ADVISER   WEEKEND   MORLORN   DISDAIN
GHASTLY   MOISTER   DISCARD   CONTORD
IMPLORE   STOURGE   REFUSAL   SARSAGE
(items taken from Meara, 1990)
Research Question 1:
- Checklist test responses (yes/no) were compared with interview scores.
- Results showed that the checklist test format is a good indicator of participants’ true knowledge (at
least for our group of participants).
- Not much overestimation.
- Possible underestimation. The checklist test format is a conservative measure.
[Charts: level of knowledge shown in interviews (No knowledge, Meaning Recognition, Meaning Recall) for NO responses and for YES responses. Percentages as extracted, not attributable to individual segments: 9%, 10%, 16%, 24%, 67%, 74%.]
Advantages
- Easy to construct.
- Simple and fast to complete.
- Large number of items assessed.
- Possible computerization.
- Straightforward and automatic scoring.
Disadvantages
- Limited indication of depth of knowledge.
- Only measures passive recognition.
- Difficulty in assessing words with several meanings.
- Subjects’ OVERESTIMATION.
Research Question 2:
- RT data were compared with interview scores: first by the level of knowledge shown in interviews (Figure
1) and then by collapsing all levels into correct/matching or incorrect/non-matching responses
(Figure 2).
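The two-step comparison can be sketched as below: RTs are first grouped by response and interview level, then collapsed into matching and non-matching responses. This is only an illustration of the grouping step. The record layout, the function names and the sample values are hypothetical, and the statistical test behind the reported significance levels is not stated in the source, so none is shown. A Yes response is treated as matching when the interview showed meaning recognition or meaning recall, and a No response as matching when the interview showed no knowledge.

```python
from collections import defaultdict
from statistics import mean


def mean_rt_by_level(records):
    """Mean RT per (said_yes, interview_score) group.

    records: iterable of (rt_ms, said_yes, interview_score) tuples,
    with interview_score 2 = recall, 1 = recognition, 0 = no knowledge.
    """
    groups = defaultdict(list)
    for rt_ms, said_yes, score in records:
        groups[(said_yes, score)].append(rt_ms)
    return {key: mean(rts) for key, rts in groups.items()}


def mean_rt_collapsed(records):
    """Mean RT per (said_yes, matching) group after collapsing the levels."""
    groups = defaultdict(list)
    for rt_ms, said_yes, score in records:
        knows_word = score >= 1          # recognition or recall
        matching = said_yes == knows_word
        groups[(said_yes, matching)].append(rt_ms)
    return {key: mean(rts) for key, rts in groups.items()}


# Hypothetical records, not data from the study.
sample = [
    (700, True, 2), (900, True, 2), (1100, True, 1), (1400, True, 0),
    (800, False, 0), (1000, False, 0), (1300, False, 1), (1500, False, 2),
]
by_level = mean_rt_by_level(sample)
collapsed = mean_rt_collapsed(sample)
print(by_level)
print(collapsed)
```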
Figure 1. Mean RT of Yes and No responses per level of knowledge shown in interviews (No Knowledge, Meaning Recognition, Meaning Recall); RT in ms.
Figure 2. Mean RT of incorrect and correct responses; RT in ms.
[Figure residue: bar labels as extracted, not attributable to individual bars: 1404, 1329, 1176, 1240, 1176, 1384, 872, 796, 1128, 1128 ms. Significance markers: ** Sig. .000; * Sig. .008.]
The main solution used for solving the problem of overestimation has been the use of NONWORDS
(e.g. Meara, 1990; Meara & Buxton, 1987). However, despite the benefits of this alternative, one
main disadvantage remains: How do we adjust scores? Although different scoring methods have
been proposed (e.g. Meara’s hits and false alarms), there is currently no consensus on the best
adjustment formula.
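To show why the choice of adjustment matters, the sketch below applies two corrections often discussed for yes/no tests with nonwords to the same responses: the simple difference h - f, and the correction-for-guessing form (h - f) / (1 - f). Here h is the hit rate (Yes to real words) and f the false-alarm rate (Yes to nonwords). These two formulas are chosen for illustration and are not necessarily the ones the cited authors propose. The function names and counts are hypothetical.

```python
def rates(yes_real, n_real, yes_nonword, n_nonword):
    """Return (hit rate, false-alarm rate) from raw Yes counts."""
    if n_real <= 0 or n_nonword <= 0:
        raise ValueError("item counts must be positive")
    return yes_real / n_real, yes_nonword / n_nonword


def h_minus_f(h, f):
    """Simple difference between hit rate and false-alarm rate."""
    return h - f


def correction_for_guessing(h, f):
    """(h - f) / (1 - f); undefined when every nonword is accepted."""
    if f >= 1.0:
        raise ValueError("false-alarm rate must be below 1")
    return (h - f) / (1.0 - f)


# Hypothetical testee: Yes to 32 of 40 real words and to 4 of 20 nonwords.
h, f = rates(yes_real=32, n_real=40, yes_nonword=4, n_nonword=20)
score_simple = round(h_minus_f(h, f), 3)
score_cfg = round(correction_for_guessing(h, f), 3)
print(score_simple, score_cfg)  # 0.6 0.75
```

The same response pattern yields 0.6 under one formula and 0.75 under the other, which is the kind of disagreement that makes the choice of adjustment consequential.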
In this study we examine an alternative solution which does not involve the use of nonwords but the
use of reaction times (RTs) and in-depth interviews to validate checklist tests. We explore whether
the time it takes learners to make their judgements can be used in determining the accuracy of those
judgements. The approach assumes that quick judgements should be relatively accurate, while
hesitant judgements may be less certain.
RESEARCH QUESTIONS
1) Are results of the checklist test an accurate indication of true vocabulary knowledge?
2) Are RTs a good indication of true vocabulary knowledge?
3) If so, is there a general threshold in RTs to establish the accuracy of responses?
METHODOLOGY
Participants
- 35 native English speakers (UG students, University of Nottingham).
Materials
Computerized Checklist Test:
- E-Prime software.
- 40 items. Random order presentation.
- Items presented one by one.
- Items taken from a range of frequencies.
Interviews:
- Meaning recall part.
- Meaning recognition part: one multiple-choice item per word. For example:
4. imperil: Those actions imperil our families.
imperil
a. to put someone or something in a dangerous situation.
b. to make it easier for someone to do something.
c. to make someone feel sad and worried.
d. I don’t know.
Research Question 2:
- RT data show a significant difference between correct and incorrect responses. This may make it
possible to discriminate between accurate (faster) and inaccurate (slower) responses.
- Significance levels marked in the figures: No responses: * .005. Yes responses: * .000, ** .000.
Research Question 3:
- Data for each participant were analyzed individually.
- Results showed considerable variation, ranging from very clear cases to very unclear ones. No
clear threshold.
- More data are currently being collected in search of a clearer threshold.
Example 1 (RT of YES responses for one participant)
Msec.   Accuracy
482     √√
494     √√
503     √√
546     √√
564     √√
573     √√
592     √√
606     √√
741     √√
763     √√
781     √√
831     √√
850     √
1078    X
1185    √
1240    √√
2067    X
Example 2 (RT of YES responses for one participant)
Example 3 (RT of NO responses for one participant)
Msec.
909
910
√√
870
918
965
1245
√√
√√
X
√√
√√
√√
√
1329
√√
1641
√
X
√
782
827
837
855
873
927
979
979
1071
1101
2578
- M= 817 msec.
- SD= 400
-Threshold= M + 0.4 SD
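As a worked instance of this rule, the sketch below recomputes M and SD from the 17 RT values in the source that reproduce the reported statistics for this participant, then applies the cut-off M + 0.4 SD. Grouping these 17 values as one participant's data is an assumption based on that match. The function name, the parameter k and the fast/slow split are illustrative. The source gives the 0.4 multiplier for this participant only and reports no general threshold.

```python
from statistics import mean, stdev

# RTs (ms) assumed to belong to the participant with M = 817, SD = 400.
example_1_rts = [482, 494, 503, 546, 564, 573, 592, 606, 741, 763,
                 781, 831, 850, 1078, 1185, 1240, 2067]


def rt_threshold(mean_ms, sd_ms, k=0.4):
    """Cut-off as mean + k * SD; k = 0.4 is the multiplier given for this case."""
    return mean_ms + k * sd_ms


m = round(mean(example_1_rts))    # sample mean, rounded to whole ms
sd = round(stdev(example_1_rts))  # sample SD (n - 1 denominator)
threshold = rt_threshold(m, sd)

# Split responses into those at or below the cut-off and those above it.
fast = [rt for rt in example_1_rts if rt <= threshold]
slow = [rt for rt in example_1_rts if rt > threshold]
print(m, sd, round(threshold), len(fast), len(slow))  # 817 400 977 13 4
```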
1) Completion of computerized checklist test:
Msec.
605
721
1839
Procedure
Accuracy
X
√√
√√
√√
√√
√√
√√
√
677
- M= 1262 msec.
- SD= 532
- Threshold= ?
619
634
685
751
752
792
821
1069
1132
1212
1232
1385
1673
Accuracy
X
√
√
√
X
√
√
√
X
X
X
√
√
√
X
X
YES RESPONSES
√√ = Accurate Response (Meaning Recall)
√ = Accurate Response (Meaning Recognition)
X = Inaccurate Response (Data of test and interview not matching)
NO RESPONSES
√ = Accurate Response (No knowledge)
X = Inaccurate Response (Meaning recognition or meaning recall)
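The legend can be read as a small decision rule that combines a checklist response with the interview score (meaning recall = 2, meaning recognition = 1, no knowledge = 0). The sketch below is one possible encoding of that rule. The function and constant names are hypothetical, and it is not the scoring software used in the study.

```python
MEANING_RECALL = 2
MEANING_RECOGNITION = 1
NO_KNOWLEDGE = 0


def legend_symbol(said_yes: bool, interview_score: int) -> str:
    """Return the legend mark for one checklist response."""
    if interview_score not in (NO_KNOWLEDGE, MEANING_RECOGNITION, MEANING_RECALL):
        raise ValueError("interview_score must be 0, 1 or 2")
    if said_yes:
        if interview_score == MEANING_RECALL:
            return "√√"   # accurate Yes, meaning recalled
        if interview_score == MEANING_RECOGNITION:
            return "√"    # accurate Yes, meaning recognized
        return "X"        # Yes, but no knowledge shown in the interview
    # No responses are accurate only when the interview shows no knowledge.
    return "√" if interview_score == NO_KNOWLEDGE else "X"


marks = [legend_symbol(True, 2), legend_symbol(True, 1), legend_symbol(True, 0),
         legend_symbol(False, 0), legend_symbol(False, 1), legend_symbol(False, 2)]
print(marks)
```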
- M= 949
- SD= 309
- Threshold= ????
- Participants saw the target words one by one.
- They had to press YES or NO to indicate whether or not they knew the meaning of the word.
- Their RTs were recorded.
2) Interviews:
- Meaning recall part: Participants were first shown the words in isolation and were asked to define
each word.
- Meaning recognition part: If a participant’s response in the meaning recall part was not informative
enough, they had to complete the multiple-choice question for that item.
3) Interview Scoring:
Meaning recall = 2
Meaning recognition = 1
No knowledge = 0
CONCLUSIONS
- Research Question 1: For the group of participants tested, the checklist format seems to be a good
measure of true vocabulary knowledge.
- Research Question 2: The data show a significant difference between RTs for accurate and inaccurate
responses, suggesting that RTs might be an effective way of discriminating between incorrect and
correct responses in checklist tests and, potentially, in other test formats.
- Research Question 3: The data collected so far have not shown a clear threshold separating
accurate from inaccurate responses.
- Further research is being carried out to identify this threshold.
REFERENCES
Meara, P. (1990). Some notes on the Eurocentres vocabulary tests. In J. Tommola (Ed.), Foreign language comprehension and
production (pp. 103-113). Turku: AFinLa Yearbook.
Meara, P. & Buxton, B. (1987). An alternative to multiple choice vocabulary tests. Language Testing, 4, 142-154.
Meara, P. & Jones, G. (1988). Vocabulary size as a placement indicator. In P. Grunwell (Ed.), Applied Linguistics in Society (pp. 80-87).
London: CILT.