Comparing Content versus Grammar Oriented Techniques in Automated Scoring of English as a
Foreign Language (EFL) Essays
Benjamin Schloss
PhD Candidate, Pennsylvania State University
Department of Psychology
December 16, 2014
Introduction
Automatic essay scoring is a field of artificial intelligence research with straightforward implications for the education system: reducing the need for professionals to grade essays saves time and money and permits the reallocation of those resources to better use. Furthermore, most computer programs can be adapted with relative ease to a large number of different languages (depending on the type of alphabet and whether the language has a writing system), while human graders can only specialize in a handful of languages. Although the complete replacement of human graders is unlikely in the imminent future, the need for multiple graders to ensure reliability and objectivity may be rapidly disappearing with the advent of more advanced automatic essay scoring systems that perform as reliably as humans and are necessarily more objective.
The scope of this paper is to consider the English as a Foreign Language (EFL) context specifically. This context is likely to differ from the monolingual context in which many essay scoring systems are created, because EFL grading tends to place more emphasis on grammatical correctness and less on the quality of the content. As possible evidence for this belief, colleagues of the researcher in the current study developed an automated essay scoring system called the Learner English Essay Scorer (LEES; Li, 2012), which uses Latent Semantic Analysis (LSA; Landauer & Dumais, 1997) to score essays from the Chinese Learner English Corpus (Gui & Yang, 2003). This LSA-based essay scorer performed significantly worse than the LSA-based essay scoring systems reported in previous studies. This is possibly due to the fact that LSA primarily analyzes essays for their content using a "bag of words" methodology. This syntax-independent measure does not capture information that many human graders likely look for when grading essays from students learning a second language. Although others have acknowledged this issue (Yao, 2012), few studies have directly compared the performance of content-based measures like LSA to more grammatically oriented measures on the same sample of essays across different essay contexts.
With this in mind, this paper will primarily focus on part-of-speech (POS) N-grams as features for capturing syntactic information in texts. POS N-grams should capture syntactic information because POS analysis describes words in a syntactically general way, such that words with the same part of speech generally behave similarly in terms of the words that appear immediately before and after them. Take a general noun like table, for example. Table is often preceded by a determiner (the or a), an adjective (big), or a possessor (his or John's), and is often followed by a verb (is or broke), a preposition (in or from), etc. Now take another, unrelated noun like ball. Although it may be preceded by different adjectives (round or flat) and followed by different verbs (fell or rolled), from the point of view of the surrounding parts of speech, ball and table are very similar. Of course, not all nouns behave the same, and the grain of the analysis will depend on the quality of the POS tagging, but even simple POS analysis can capture a lot of syntactic information that might distinguish learners, such as the use of an adjective after a noun instead of before it (a pattern common among speakers of Romance languages).
Background
Latent Semantic Analysis (Landauer & Dumais, 1997). LSA is a computational word
learning model that extracts the meaning of words from large corpora (sets) of text documents by
analyzing how words co-occur in these documents (Landauer & Dumais, 1997). In this model,
the meaning of any word is represented as a unique vector where each value corresponds to a
basic component/feature of the word’s meaning (similar to how combinations of a set of
phonemes make up all the sounds in a given language). For example, a vector might contain a
value for size between 0 and 1, with 0 meaning the size is not measurable and increasing positive
values corresponding to increasing size. Although this example illustrates how the individual
features might work, the actual features are derived in the process of singular value
decomposition (a dimensionality reduction method), and cannot be precisely specified.
Because LSA represents words as vectors, word similarity can be measured by the cosine similarity, a measure of vector similarity between -1 and 1: -1 for polar opposites, 0 for unrelated, and 1 for identical. LSA can also represent the meaning of "bags of words" by averaging the vectors of the individual words. However simple this idea may seem, it is quite powerful in mimicking human performance on tests of word similarity and text similarity (Landauer & Dumais, 1997). Furthermore, the cosine similarities derived from other vector space models similar to LSA have been used not only to predict similarity judgments in human behavior, but also to predict similarity in functional magnetic resonance imaging (fMRI) data during the processing of concrete nouns (Mitchell et al., 2008). Thus, the way that LSA learns and represents words seems to capture something fundamentally correct about the way humans learn and represent words at a behavioral and neurological level, which is important for any application that aims to automate human behavior. Moreover, LSA has been widely used in automated essay scoring applications, notably for English essays from native English speakers (Foltz, Laham, & Landauer, 1999), but also for English essays from non-native speakers (Yao, 2012).
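To make these ideas concrete, below is a minimal sketch in Python with NumPy. It is not the LEES implementation; the tiny term-by-document matrix is invented purely for illustration. It derives reduced word vectors through singular value decomposition and compares them with cosine similarity.

import numpy as np

# Toy term-by-document count matrix (rows: words, columns: documents),
# invented purely for illustration.
words = ["table", "ball", "run"]
X = np.array([[2., 0., 1., 0.],
              [1., 1., 1., 0.],
              [0., 2., 0., 3.]])

# Singular value decomposition; keep only the k largest dimensions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
vectors = {w: U[i, :k] * s[:k] for i, w in enumerate(words)}

def cosine(u, v):
    """Cosine similarity between two vectors (-1 to 1)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def bag_of_words(ws):
    """Represent a 'bag of words' as the average of its word vectors."""
    return np.mean([vectors[w] for w in ws], axis=0)

print(cosine(vectors["table"], vectors["ball"]))                 # word-to-word similarity
print(cosine(bag_of_words(["table", "ball"]), vectors["run"]))   # text-to-word similarity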
N-grams. N-grams are n consecutive units of text, usually defined at the word level, but
may also be considered at the phrase, sentence, phoneme, and letter level. For example, the
sentence “I love cat memes” has 4 monograms, 3 bigrams, 2 trigrams and one 4-gram, when the
grams are words. In this study, we will look at POS N-grams: each word in the previous sentence has a part of speech, so there are as many POS N-grams as there are word N-grams. N-grams are frequently used tools in Natural Language Processing (NLP) because of the regularity of the order of human speech at many different levels. Speakers of the same language are more likely to use the same word order than speakers of different languages, and speakers of languages from the same language family are, in many cases, more likely to use similar word orders. Groups of friends may mimic each other's word order, and people from similar regions of the same country may also demonstrate more similar word order choices than individuals from different regions. However, these similarities and differences can be very subtle, and capturing them can be made difficult by insufficient sample sizes, poorly chosen attribute sets, etc. Nevertheless, N-grams have been used to improve algorithms for authorship discrimination in short texts (Hirst & Feiguina, 2007) and for large-scale document classification (Ko et al., 2012), and have been combined in various ways with LSA (Islam & Hoque, 2010;
Kakkonen, Myller, & Sutinen, 2006) and similar methods frequently used in NLP (Hatami,
Akbari, & Nasersharif, 2013).
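As a minimal illustration of the counts described above (a sketch, not code from the study), the following Python function enumerates the N-grams of the example sentence:

def ngrams(tokens, n):
    """Return the consecutive n-token tuples in a sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "I love cat memes".split()
for n in range(1, 5):
    print(n, ngrams(words, n))
# n=1: 4 monograms, n=2: 3 bigrams, n=3: 2 trigrams, n=4: one 4-gram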
Current Study
This paper describes a pilot study in which we analyzed the performance of a Naïve-Bayes Classifier on the automated grading of a sample of essays from the Chinese Learner English Corpus (CLEC; Gui & Yang, 2003). The grading consisted of a binary classification in which we divided a subset of the essays into good and bad essays based on their scores.
Methods. We used a Naïve-Bayes Classifier to classify essays by their grades based on
monogram and bigram frequencies. The key assumption in a Naïve-Bayes Classifier is that the
attributes are conditionally independent of one another given their class, which is expressed in
the last equality in the equation below.
$$P(grade_j \mid a_1, \ldots, a_n) \propto P(a_1, \ldots, a_n \mid grade_j)\,P(grade_j) = P(grade_j)\prod_{i=1}^{n} P(a_i \mid grade_j)$$
In order to approximate the probabilities $P(grade_j)$ and $P(a_i \mid grade_j)$, we took two different approaches. $P(grade_j)$ was approximated non-parametrically as the number of essays of each grade divided by the total number of essays,

$$P(grade_j) = \frac{\#\{grade_j\}}{\sum_{j} \#\{grade_j\}}.$$

On the other hand, we used parametric statistics to approximate $P(a_i \mid grade_j)$, using a Poisson distribution,

$$P(a_i = x \mid grade_j) = \frac{\lambda^{x} e^{-\lambda}}{x!}.$$

So, we approximated the rate $\lambda_{j}^{att_i}$ for each attribute $att_i$ given $grade_j$ using the maximum likelihood estimator

$$\lambda_{j}^{att_i} = \frac{\#\{att_i \mid grade_j\}}{\#\{grade_j\}},$$

that is, the total number of times attribute $att_i$ appears in essays of $grade_j$ divided by the number of essays with that grade. Then, an unseen essay was classified as whichever grade was most likely given the number of times each attribute appears in that particular essay, and, if there was a tie, the classifier preferred the higher score.
$$\mathrm{Classify}(essay) = \arg\max_{j}\left\{P(grade_j)\prod_{i=1}^{n} P(a_i = x_i \mid grade_j)\right\} = \arg\max_{j}\left\{\frac{\#\{grade_j\}}{\sum_{j}\#\{grade_j\}}\prod_{i=1}^{n}\frac{(\lambda_{j}^{att_i})^{x_i} e^{-\lambda_{j}^{att_i}}}{x_i!}\right\}, \quad \lambda_{j}^{att_i} = \frac{\#\{att_i \mid grade_j\}}{\#\{grade_j\}}$$
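A minimal sketch of this classification rule in Python is given below. This is not the code used in the pilot study; the attribute extraction, zero-count handling, and tie-breaking details are simplified assumptions, and log probabilities are used to avoid numerical underflow.

import math
from collections import Counter

def train(essays_by_grade):
    """essays_by_grade: {grade: [Counter of attribute counts, one per essay]}.
    Returns the non-parametric priors and the per-grade Poisson rates."""
    total_essays = sum(len(essays) for essays in essays_by_grade.values())
    attributes = {a for essays in essays_by_grade.values()
                    for essay in essays for a in essay}
    priors, lambdas = {}, {}
    for grade, essays in essays_by_grade.items():
        priors[grade] = len(essays) / total_essays
        # MLE: total count of the attribute in this grade / number of essays.
        lambdas[grade] = {a: sum(essay[a] for essay in essays) / len(essays)
                          for a in attributes}
    return priors, lambdas, attributes

def classify(essay_counts, priors, lambdas, attributes):
    """Return the grade with the highest log posterior; ties prefer the higher grade."""
    def log_posterior(grade):
        score = math.log(priors[grade])
        for a in attributes:
            lam = lambdas[grade][a] or 1e-9   # guard against log(0) for unseen attributes
            x = essay_counts.get(a, 0)
            score += x * math.log(lam) - lam - math.lgamma(x + 1)
        return score
    return max(priors, key=lambda g: (log_posterior(g), g))

Here a grade is assumed to be a comparable label (e.g., 0 for low and 1 for high), and an attribute is a POS monogram or bigram whose per-essay counts have already been extracted.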
Materials. For the pilot study, we had a sample of 344 essays with grades ranging from 6 to 14. We decided to split the essays into two groups, one with essays of grades 6, 7, and 8, and another with essays of grades 11, 12, and 13. The two middle grades were excluded to ensure that the essay groups would differ substantially in quality, and the single essay with a grade of 14 was also excluded. However, even after this initial split, there was a large difference in the number of essays in the low and high groups, with 190 essays in the low group and 38 essays in the high group. Thus, when creating the classifier, we randomly sampled 38 essays from the low group so that the prior distribution of the low-graded essays
would not cause the classifier to classify all of the essays as “low.” A sample of an essay from
each group is given below:
<SCORE 6>
As a proverb say: Haste Makes Waste. It's quite clear that a haste people can't make
achievement because he hasn't prepared enough. It is known to all of us.
No one can deny the proverb. Haste makes waste. For example: a very young baby, as we all
know, can't walk very well. He walks slowly. He throws himself to the ground now and then.
However, his mother let him run to her. He can't reach to her without any help. Every one learns
to walk in childhood. No one can deny it cost him many time to walk well, much more time to
run.
From the above we can conclude that without preparing can't make a success.
I have the opinion that haste makes waste. So we should think it over before we begin it. Don't
you think so?
<SCORE 12>
It is well known to us that "more haste, less speed". Because if we want to finish
a task in less time, we will feel tense , and our brain can't keep calm . So the
way of our thought will be massed and our wit can not be excited . Thus, as a
result , we maybe spend more time but the qulity of the task we have done became
poor.
For example, when we have a exam in a class, we afraid that we have not enough
time to do it. We will skim or scan the paper in order that we can save some time.
However, if we do so, we will not understand the passage very well and the effect
certainly will not be good.
When you are about to do something, don't forget the word "more haste less speed".
To POS-tag the essays, we used the Natural Language Toolkit (NLTK) in Python 3.4.2. We chose this program because it is easy to use and, as part of the existing Python framework, easy to integrate into larger programs. If the POS-tagger in Python works well enough, then why use a more complicated tool? If it does not, then we will look at more detailed POS-taggers in future research. The POS-tagger in Python's Natural Language Toolkit tags the first two lines of the score-6 essay shown above in the following way:
[('As', 'IN'), ('a', 'DT'), ('proverb', 'NN'), ('say', 'NN'), (':', ':'), ('Haste', 'NNP'), ('Makes',
'NNPS'), ('Waste', 'NNP'), ('.', '.'), ('It', 'PRP'), ("'s", 'VBZ'), ('quite', 'RB'), ('clear', 'JJ'), ('that',
'IN'), ('a', 'DT'), ('haste', 'NN'), ('people', 'NNS'), ('ca', 'MD'), ("n't", 'RB'), ('make', 'VB'),
('achievement', 'JJ'), ('because', 'IN'), ('he', 'PRP'), ('has', 'VBZ'), ("n't", 'RB'), ('prepared', 'VBN'),
('enough', 'JJ'), ('.', '.')…]
As can be seen, the POS-tagger is not without flaw. However, the point of the pilot study
was to examine the usefulness of the POS-tagger as is. As such, all results reported in this study
are for the POS-tagger in the Python Natural Language Toolkit, and the essays did not undergo any
“data cleaning” prior to tagging.
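For reference, the tagged output above can be reproduced with a call along the following lines (a sketch; it assumes the NLTK tokenizer and tagger models have been downloaded, and the default tagger in current NLTK releases may differ slightly from the one available in 2014):

import nltk

# One-time setup (model names in current NLTK releases):
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

def tag_essay(text):
    """Tokenize an essay and return (word, POS-tag) pairs."""
    return nltk.pos_tag(nltk.word_tokenize(text))

print(tag_essay("As a proverb say: Haste Makes Waste. It's quite clear that a "
                "haste people can't make achievement because he hasn't prepared enough."))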
Results. When the LSA program was used to grade the low and high essays with a score of either 0 (low) or 1 (high), it achieved 40.35% accuracy when it used 157 dimensions to represent the meaning of each word (reduced from 228, one for each essay). The LSA program uses a 228-fold (leave-one-out) grading mechanism in which it trains on all essays except the one being tested each time. On the other hand, the Naïve-Bayes Classifier developed in the current study used a 10-fold training/testing algorithm on 30 random subsets of the 190 low-group essays, each paired with the 38 high-group essays, and achieved an accuracy of 47.76%; the confusion matrix is shown below. The algorithm performed best when we considered as attributes only the bigrams and monograms that appeared at least 10 times in the training set.
Confusion Matrix (rows: actual class, columns: predicted class)

              Predicted 0    Predicted 1
Actual 0            33           1107
Actual 1            84           1156
When using a 76-fold (leave-one-out) algorithm on one random subset of 38 of the 190 low-group essays, paired with the 38 high-group essays, we see a similar pattern of results, with 48.68% accuracy.
Confusion Matrix (rows: actual class, columns: predicted class)

              Predicted 0    Predicted 1
Actual 0             0             38
Actual 1             1             37
Both of the Naïve-Bayes Classifier analyses revealed performance at chance (p = .49 for the 30 random sets of 10-fold testing and p = .45 for the 76-fold analysis). However, the LSA algorithm performed significantly worse than chance (p < .05). Additionally, the LSA algorithm performed significantly worse than the classifier that used 30 random sets and 10-fold testing, but did not differ significantly from the single run of the 76-fold algorithm.
General Discussion
Although the literature seems to suggest that a more grammatically oriented method like an N-gram technique would significantly outperform a content-based method like LSA when grading English learner essays, the current study did not support this hypothesis. Both algorithms performed close to chance, and the leave-one-out algorithm (the one most similar to the algorithm used by the LSA system) did not differ significantly in performance from the LSA algorithm. For reasons that are not totally clear, the Naïve-Bayes Classifier classified almost all of the essays as being from the high-score group, despite the fact that we balanced the groups so that the prior probabilities would be equal. This suggests that the Poisson distribution may not be a good underlying distribution for approximating how frequently the attributes occur across essays. It may also be due to the fact that the higher-scored essays tend to be longer, causing the lower-scored essays to have zero frequencies for many of the potential attributes; indeed, many of the attributes extracted may only be present in the higher-scored essays. The sample size is also a limitation, and it may simply be the case that a Naïve-Bayes Classifier needs a larger training set to extract sensitive enough frequency measures for the maximum likelihood estimates of the $\lambda_{j}^{att_i}$ values to be accurate enough.
Future directions for this project include developing a Markov model that estimates the transition probabilities at the bigram level, hand-selecting the attributes, and using more detailed sentence parsers. A Markov chain may be more appropriate for quantifying the differences in syntactic structure between high-score and low-score essays because it relies on the transition probability from one monogram to the next, instead of simply counting how many bigrams of each type appear in each essay; a rough sketch of this idea is given below. Additionally, more detailed or hand-chosen attributes may allow us to further improve the accuracy and efficiency of the algorithm. Another direction for future research is to expand the consideration of grammatical and content-based features to morphological analyses, measures of discourse coherence (Foltz, Kintsch, & Landauer, 1998), and other features, in order to better understand the relationship between content-based scoring and grammatically oriented scoring in learner English essays. It is also important to research whether combining these measures yields better automated essay scoring algorithms, and how to optimally combine content analysis with grammar analysis for different essay scoring contexts.
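A minimal sketch of how such transition probabilities might be estimated is shown below (illustrative only; the tag sequences are invented, and smoothing and model comparison are left out):

from collections import Counter

def transition_probabilities(tag_sequences):
    """Estimate P(next_tag | current_tag) from a list of POS-tag sequences."""
    bigram_counts = Counter()
    left_counts = Counter()
    for tags in tag_sequences:
        left_counts.update(tags[:-1])
        bigram_counts.update(zip(tags, tags[1:]))
    return {(a, b): count / left_counts[a]
            for (a, b), count in bigram_counts.items()}

# Hypothetical tag sequences from one grade group, invented for illustration.
low_group = [["DT", "NN", "VBZ", "JJ"], ["DT", "JJ", "NN", "VBZ"]]
print(transition_probabilities(low_group))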
References
Foltz, P., Kintsch, W., & Landauer, T. (1998). The Measurement of Textual Coherence with
Latent Semantic Analysis. Discourse Processes, 25, 285-307.
Foltz, P., Laham, D., & Landauer, T. (1999). Automated Essay Scoring: Applications to
Educational Technology. Proceedings from The World Conference on Educational
Multimedia Hypermedia and Telecommunications. Montreal, Canada: Association for the
Advancement of Computing in Education.
Gui, S. & Yang, H. (2003). The Chinese Learner English Corpus. Shanghai: Shanghai Foreign
Language Education Press.
Hatami, A., Akbari, A., & Nasersharif, B. (2013). N-gram adaptation using Dirichlet class
language model based on part-of-speech for speech recognition. Proceedings from
ICEE: The 21st Iranian Conference on Electrical Engineering. Mashhad, Iran: IEEE.
Hirst, G. & Feiguina, O. (2007). Bigrams of Syntactic Labels for Authorship
Discrimination of Short Texts. Literary and Linguistic Computing, 22(4), 405-417.
Islam, M. & Hoque, A. (2010). Automated Essay Scoring Using Generalized Latent Semantic
Analysis. Proceedings from ICCIT 2010: The 13th International Conference on
Computer and Information Technology. Dhaka, Bangladesh: IEEE.
Kakkonen T., Myller, N., & Sutinen, E. (2006). Applying Part-of-Speech Enhanced LSA to
Automatic Essay Grading. Proceedings from ICIT 2006: The 4th International
Conference on Information Technology. Tel Aviv, Israel: IEEE.
Ko, B., Choi, D., Choi, C., Choi, J., & Kim, P. (2012). Document Classification through Building
Specified N-gram. Proceedings from the Sixth International Conference on Innovative
Mobile and Internet Services in Ubiquitous Computing.
Landauer, T. & Dumais, S. (1997). A Solution to Plato’s Problem: The Latent Semantic Analysis
Theory of Acquisition, Induction, and Representation of Knowledge. Psychological
Review, 104(2), 211-240.
Li, J. (2012). Using Latent Semantic Analysis for Automated Essay Scoring in the Chinese EFL
Context (unpublished doctoral dissertation). Zhejiang University, Zhejiang, China.
Mitchell, T., Shinkareva, S., Carlson, A., Chang, K., Malave, V., Mason, R., & Just, M. (2008).
Predicting Human Brain Activity Associated with the Meanings of Nouns. Science, 320,
1191-1195.
Yao, X. (2012). LSA-based Automated Essay Scoring in Chinese Context. Applied Mechanics
and Materials, 274, 654-657.