Comparing Content versus Grammar Oriented Techniques in Automated Scoring of English as a Foreign Language (EFL) Essays

Benjamin Schloss
PhD Candidate, Pennsylvania State University
Department of Psychology
December 16, 2014

Introduction

Automatic essay scoring is a field of artificial intelligence research with straightforward implications for the education system: reducing the need for professionals to grade essays saves time and money and permits the reallocation of those resources to better use. Furthermore, most computer programs can be adapted to a large number of languages with relative ease, depending on the type of writing system a language uses, whereas human graders can specialize in only a handful of languages. Although the complete replacement of human graders is unlikely in the imminent future, the need for multiple graders to ensure reliability and objectivity may be rapidly disappearing with the advent of more advanced automatic essay scoring systems that perform as reliably as humans and that are necessarily more objective.

The scope of this paper is specifically the English as a Foreign Language (EFL) context. This context is likely to differ from the native-speaker context in which many essay scoring systems are created, because EFL grading tends to overemphasize grammatical correctness and underemphasize the quality of the content. As possible evidence for this belief, colleagues of the researcher in the current study developed an automated essay scoring system called the Learner English Essay Scorer (LEES; Li, 2012), which uses Latent Semantic Analysis (LSA; Landauer & Dumais, 1997) to score essays from the Chinese Learner English Corpus (Gui & Yang, 2003). This LSA-based essay scorer performed significantly worse than the LSA-based scoring systems reported in previous studies. This is possibly because LSA analyzes essays primarily for their content, using a "bag of words" methodology. Such a syntax-independent measure does not capture information that many human graders likely look for when grading essays from students learning a second language. Although others have acknowledged this issue (Yao, 2012), few studies have directly compared the performance of a content-based measure like LSA against more grammatically oriented measures on the same sample of essays across different essay contexts.

With this in mind, this paper focuses on N-grams as a method for capturing syntactic information, using part of speech (POS) N-grams as features for analyzing texts. POS N-grams should capture syntactic information because POS analysis describes words in a syntactically general way: words with the same part of speech generally behave similarly in terms of the words that appear immediately before and after them. Take a general noun like table. Table is often preceded by a determiner (the or a), an adjective (big), or a possessor (his or John's), and is often followed by a verb (is or broke), a preposition (in or from), and so on. Now take another, unrelated noun like ball. Although it may be preceded by different adjectives (round or flat) and followed by different verbs (fell or rolled), from the point of view of the surrounding POS's, ball and table are very similar, as the sketch below illustrates.
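To make this concrete, the following minimal Python sketch uses NLTK (the toolkit employed later in this study) to extract POS bigrams from two sentences. The example sentences are our own, the expected tags are an assumption about the tagger's output, and the one-time model downloads noted in the comments may be required.

    import nltk
    # One-time setup (assumed): nltk.download('punkt') and a tagger model,
    # e.g. nltk.download('averaged_perceptron_tagger')

    def pos_bigrams(sentence):
        """Tag a sentence and return its part-of-speech bigrams."""
        tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(sentence))]
        return list(zip(tags, tags[1:]))

    print(pos_bigrams("The big table broke."))
    print(pos_bigrams("The flat ball rolled."))
    # Both calls are expected to print the same POS contexts, e.g.
    # [('DT', 'JJ'), ('JJ', 'NN'), ('NN', 'VBD'), ('VBD', '.')]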
Of course, not all nouns behave the same, and the grain of the analysis will depend on the quality of the POS tagging, but even simple POS analysis can capture a great deal of syntactic information that might distinguish learners, such as the use of an adjective after a noun instead of before it (a pattern common among speakers of Romance languages).

Background

Latent Semantic Analysis (Landauer & Dumais, 1997). LSA is a computational word learning model that extracts the meaning of words from large corpora (sets) of text documents by analyzing how words co-occur in those documents (Landauer & Dumais, 1997). In this model, the meaning of any word is represented as a unique vector in which each value corresponds to a basic component or feature of the word's meaning (similar to how combinations of a set of phonemes make up all the sounds in a given language). For example, a vector might contain a value for size between 0 and 1, with 0 meaning the size is not measurable and increasing positive values corresponding to increasing size. Although this example illustrates how individual features might work, the actual features are derived through singular value decomposition (a dimensionality reduction method) and cannot be precisely specified. Because LSA represents words as vectors, word similarity can be measured by cosine similarity, a measure of vector similarity ranging from -1 to 1: -1 for polar opposites, 0 for unrelated, and 1 for identical. LSA can also represent the meaning of a "bag of words" by averaging the vectors of its words. However simple this idea may seem, it is quite powerful in mimicking human performance on tests of word similarity and text similarity (Landauer & Dumais, 1997). Furthermore, cosine similarities derived from vector space models similar to LSA have been used not only to predict human similarity judgments but also to predict similarity in functional magnetic resonance imaging (fMRI) data during the processing of concrete nouns (Mitchell et al., 2008). Thus, the way that LSA learns and represents words seems to capture something fundamentally correct about the way humans learn and represent words at the behavioral and neurological levels, which is important for any application that aims to automate human behavior. LSA has also been widely used in automated essay scoring applications, notably for essays by native English speakers (Foltz, Laham, & Landauer, 1999) but also for essays by non-native speakers (Yao, 2012). The vector operations underlying these similarity measures are sketched below.
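As a minimal numerical sketch of these vector operations, the following Python code computes cosine similarity and a bag-of-words average. The three-dimensional "meaning" vectors are invented purely for illustration; real LSA vectors have hundreds of dimensions derived by singular value decomposition.

    import numpy as np

    def cosine(u, v):
        """Cosine similarity between two word vectors, ranging from -1 to 1."""
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Hypothetical word vectors for illustration only.
    table = np.array([0.8, 0.1, 0.3])
    chair = np.array([0.7, 0.2, 0.4])
    justice = np.array([0.0, 0.9, 0.1])

    print(cosine(table, chair))    # high: related words (about 0.98 here)
    print(cosine(table, justice))  # low: unrelated words (about 0.15 here)

    # A "bag of words" is represented by averaging its word vectors.
    doc = np.mean([table, chair], axis=0)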
N-grams. N-grams are n consecutive units of text, usually defined at the word level, although they may also be defined at the phrase, sentence, phoneme, or letter level. For example, when the grams are words, the sentence "I love cat memes" has 4 monograms, 3 bigrams, 2 trigrams, and one 4-gram. This study examines POS N-grams: each word in the sentence above has a part of speech, so there are exactly as many POS N-grams as word N-grams. N-grams are frequently used tools in Natural Language Processing (NLP) because human language is highly regular in its ordering at many different levels. Speakers of the same language are more likely to use the same word order than speakers of different languages, and speakers of languages from the same language family are, in many cases, more likely to use similar word orders. Groups of friends may mimic each other's word order, and people from similar regions of the same country may demonstrate more similar word order choices than individuals from different regions. However, these similarities and differences can be very subtle, and capturing them can be made difficult by insufficient sample sizes, poorly chosen attribute sets, and so on. Nevertheless, N-grams have been used to improve algorithms for authorship discrimination in short texts (Hirst & Feiguina, 2007) and for large-scale document classification (Ko et al., 2012), and they have been combined in various ways with LSA (Islam & Hoque, 2010; Kakkonen, Myller, & Sutinen, 2006) and with similar methods frequently used in NLP (Hatami, Akbari, & Nasersharif, 2013).

Current Study

The current study describes a pilot study in which we analyzed the performance of a Naïve Bayes classifier on the automated grading of a sample of essays from the Chinese Learner English Corpus (CLEC; Gui & Yang, 2003). The grading consisted of a binary classification in which we divided a subset of the essays into good and bad essays based on their scores.

Methods. We used a Naïve Bayes classifier to classify essays by their grades based on monogram and bigram frequencies. The key assumption in a Naïve Bayes classifier is that the attributes are conditionally independent of one another given their class, which is expressed in the last equality below:

P(grade | a_1, ..., a_n) ∝ P(a_1, ..., a_n | grade) P(grade) = P(grade) ∏_{i=1}^{n} P(a_i | grade)

To approximate the probabilities P(grade) and P(a_i | grade), we took two different approaches. P(grade) was approximated non-parametrically as the number of essays of each grade divided by the total number of essays:

P(grade_g) = #{grade_g} / Σ_{k=1}^{m} #{grade_k}

For P(a_i | grade), on the other hand, we used a parametric approach, modeling each attribute count with a Poisson distribution:

P(a_i = x | grade_g) = λ_i^x e^{-λ_i} / x!

We approximated λ_i for each attribute att_i, given grade g, using the maximum likelihood estimator

λ_i = #{att_i | grade_g} / #{grade_g},

that is, the total number of times attribute att_i appears in essays of grade g divided by the number of essays of grade g. An unseen essay is then classified with whichever grade is most likely given the number of times each attribute appears in that essay; in the case of a tie, the classifier prefers the higher grade:

Classify(essay) = argmax_g P(grade_g) ∏_{i=1}^{n} P(a_i = x_i | grade_g)
                = argmax_g (#{grade_g} / Σ_{k=1}^{m} #{grade_k}) ∏_{i=1}^{n} λ_i^{x_i} e^{-λ_i} / x_i!

A minimal sketch of this classifier is given below.
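The following Python sketch implements this classifier, assuming each essay has already been reduced to a Counter mapping attributes (POS monograms and bigrams) to their frequencies. The function names and the small constant used to guard zero rates are our own additions, not part of the formal model.

    import math
    from collections import Counter

    def train(essays, grades):
        """Estimate the priors P(grade) and the Poisson rates
        lambda_i = #{att_i | grade} / #{grade} from training data."""
        prior_counts = Counter(grades)
        totals = {g: Counter() for g in prior_counts}
        for attrs, g in zip(essays, grades):   # attrs: Counter of attribute counts
            totals[g].update(attrs)
        priors = {g: c / len(grades) for g, c in prior_counts.items()}
        rates = {g: {a: c / prior_counts[g] for a, c in totals[g].items()}
                 for g in prior_counts}
        return priors, rates

    def classify(attrs, priors, rates, vocabulary):
        """Assign the grade maximizing log P(grade) plus the summed log Poisson
        terms; ties are broken in favor of the higher grade."""
        def log_poisson(x, lam):
            lam = max(lam, 1e-9)   # guard against log(0) for unseen attributes
            return x * math.log(lam) - lam - math.lgamma(x + 1)
        def score(g):
            return math.log(priors[g]) + sum(
                log_poisson(attrs.get(a, 0), rates[g].get(a, 0.0)) for a in vocabulary)
        # Iterating grades in descending order makes max() prefer the higher
        # grade when scores tie.
        return max(sorted(priors, reverse=True), key=score)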
Materials. For the pilot study, we had a sample of 344 essays with grades ranging from 6 to 14. We split the essays into two groups, one with essays of grades 6, 7, and 8, and another with essays of grades 11, 12, and 13. The two middle grades (9 and 10) were excluded to ensure that the essay groups would differ substantially in quality, and the single essay with a grade of 14 was also excluded. Even after this initial split, however, there was a large difference in the number of essays in the two groups: 190 essays in the low group and 38 essays in the high group. Thus, when creating the classifier, we randomly sampled 38 essays from the low group so that the prior distribution of the low-graded essays would not cause the classifier to classify all of the essays as "low." A sample essay from each group is given below:

<SCORE 6> As a proverb say: Haste Makes Waste. It's quite clear that a haste people can't make achievement because he hasn't prepared enough. It is known to all of us. No one can deny the proverb. Haste makes waste. For example: a very young baby, as we all know, can't walk very well. He walks slowly. He throws himself to the ground now and then. However, his mother let him run to her. He can't reach to her without any help. Every one learns to walk in childhood. No one can deny it cost him many time to walk well, much more time to run. From the above we can conclude that without preparing can't make a success. I have the opinion that haste makes waste. So we should think it over before we begin it. Don't you think so?

<SCORE 12> It is well known to us that "more haste, less speed". Because if we want to finish a task in less time, we will feel tense , and our brain can't keep calm . So the way of our thought will be massed and our wit can not be excited . Thus, as a result , we maybe spend more time but the qulity of the task we have done became poor. For example, when we have a exam in a class, we afraid that we have not enough time to do it. We will skim or scan the paper in order that we can save some time. However, if we do so, we will not understand the passage very well and the effect certainly will not be good. When you are about to do something, don't forget the word "more haste less speed".

To POS-tag the essays, we used the Natural Language Toolkit (NLTK) in Python 3.4.2. We chose this tool because it is easy to use and, as part of the existing Python ecosystem, easy to integrate into larger programs. If the POS-tagger in NLTK works well enough, then why use a more complicated tool? If it does not, we will examine more detailed POS-taggers in future research. NLTK POS-tags the first two lines of the score-6 essay above in the following way:

[('As', 'IN'), ('a', 'DT'), ('proverb', 'NN'), ('say', 'NN'), (':', ':'), ('Haste', 'NNP'), ('Makes', 'NNPS'), ('Waste', 'NNP'), ('.', '.'), ('It', 'PRP'), ("'s", 'VBZ'), ('quite', 'RB'), ('clear', 'JJ'), ('that', 'IN'), ('a', 'DT'), ('haste', 'NN'), ('people', 'NNS'), ('ca', 'MD'), ("n't", 'RB'), ('make', 'VB'), ('achievement', 'JJ'), ('because', 'IN'), ('he', 'PRP'), ('has', 'VBZ'), ("n't", 'RB'), ('prepared', 'VBN'), ('enough', 'JJ'), ('.', '.')…]

As can be seen, the POS-tagger is not without flaws. However, the point of the pilot study was to examine the usefulness of the POS-tagger as is. As such, all results reported in this study are based on the POS-tagger in Python's Natural Language Toolkit, and the essays did not undergo any "data cleaning" prior to tagging.

Results. When the LSA program was used to grade the low and high essays with a score of either 0 (low) or 1 (high), it achieved 40.35% accuracy using 157 dimensions to represent the meaning of each word (reduced from 228, one for each essay). The LSA program uses a 228-fold (leave-one-out) grading mechanism, training on all essays except the one being tested each time; a sketch of this kind of evaluation loop is given below.
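For concreteness, a leave-one-out evaluation loop of the kind used by both systems might look like the following sketch, which reuses the hypothetical train and classify functions from the Methods section.

    def leave_one_out_accuracy(essays, grades, vocabulary):
        """Hold out each essay in turn, train on the rest, and test on the held-out essay."""
        correct = 0
        for i in range(len(essays)):
            priors, rates = train(essays[:i] + essays[i + 1:], grades[:i] + grades[i + 1:])
            correct += classify(essays[i], priors, rates, vocabulary) == grades[i]
        return correct / len(essays)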
The Naïve Bayes classifier developed in the current study, on the other hand, used a 10-fold training/testing algorithm on 30 random subsets of 38 essays drawn from the 190 low-group essays, each paired with the 38 essays from the high group, and achieved an accuracy of 47.76%. The algorithm performed best when the attributes were restricted to bigrams and monograms that appeared at least 10 times in the training set. The confusion matrix is shown below.

Confusion Matrix    Predicted 0    Predicted 1
Actual 0 (low)           33           1107
Actual 1 (high)          84           1156

When using a 76-fold (leave-one-out) algorithm on one random subset of 38 of the 190 low-group essays paired with the 38 high-group essays, we see a similar pattern of results, with 48.68% accuracy.

Confusion Matrix    Predicted 0    Predicted 1
Actual 0 (low)            0             38
Actual 1 (high)           1             37

Both Naïve Bayes analyses revealed at-chance performance (p = .49 for the 30 random sets with 10-fold testing and p = .45 for the 76-fold analysis). The LSA algorithm, however, performed significantly worse than chance (p < .05). Additionally, the LSA algorithm performed significantly worse than the algorithm that used 30 random sets with 10-fold testing, but did not differ significantly from the single run of the 76-fold algorithm.

General Discussion

Although the literature seems to suggest that a more grammatically oriented method like an N-gram technique should significantly outperform a content-based method like LSA when grading English learner essays, the current study did not support this hypothesis. Both algorithms performed close to chance, and the leave-one-out algorithm (the one most similar to the algorithm used by the LSA system) did not differ significantly in performance from the LSA algorithm. For reasons that are not entirely clear, the Naïve Bayes classifier classified almost all of the essays as belonging to the high score group, despite the fact that we balanced the essay groups so that their prior probabilities would be equal. This suggests that the Poisson distribution may not be a good underlying distribution for approximating how frequently the attributes occur across essays. It may also reflect the fact that the higher scored essays tend to be longer, causing the lower scored essays to have frequencies of 0 for many of the potential attributes; indeed, many of the extracted attributes may be present only in the higher scored essays. The sample size is also a limitation: it may simply be that a Naïve Bayes classifier needs a larger training set for the maximum likelihood estimates of the λ_i's to be accurate enough.

Future directions for this project include developing a Markovian model that estimates transition probabilities at the bigram level, hand-selecting the attributes, and using more detailed sentence parsers. A Markov chain may be more appropriate for quantifying the differences in the syntactic structure of the high scored versus low scored essays because it relies on the transition probability from one monogram to the next, instead of simply counting how many bigrams of each type appear in each essay. Additionally, more detailed or hand-chosen attributes may allow us to further improve the accuracy and efficiency of the algorithm. A sketch of the transition-probability estimation behind such a Markov chain is given below.
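A minimal sketch of estimating such transition probabilities, assuming each essay is represented as a list of POS tags, might look like this:

    from collections import Counter, defaultdict

    def transition_probabilities(tagged_essays):
        """Estimate first-order Markov transition probabilities
        P(next tag | previous tag) from essays given as lists of POS tags."""
        counts = defaultdict(Counter)
        for tags in tagged_essays:
            for prev, nxt in zip(tags, tags[1:]):
                counts[prev][nxt] += 1
        return {prev: {nxt: c / sum(following.values())
                       for nxt, c in following.items()}
                for prev, following in counts.items()}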
Another direction for future research is to expand the set of grammatical and content-based features under consideration to include morphological analyses, measures of discourse coherence (Foltz, Kintsch, & Landauer, 1998), and other features, in order to better understand the relationship between content-based scoring and grammatically oriented scoring of learner English essays. It is also important to investigate whether combining these measures yields better automated essay scoring algorithms, and how to optimally combine content analysis with grammar analysis in different essay scoring contexts.

References

Foltz, P., Kintsch, W., & Landauer, T. (1998). The measurement of textual coherence with Latent Semantic Analysis. Discourse Processes, 25, 285-307.

Foltz, P., Laham, D., & Landauer, T. (1999). Automated essay scoring: Applications to educational technology. Proceedings from the World Conference on Educational Multimedia, Hypermedia and Telecommunications. Montreal, Canada: Association for the Advancement of Computing in Education.

Gui, S., & Yang, H. (2003). The Chinese Learner English Corpus. Shanghai: Shanghai Foreign Language Education Press.

Hatami, A., Akbari, A., & Nasersharif, B. (2013). N-gram adaptation using Dirichlet class language model based on part-of-speech for speech recognition. Proceedings from ICEE 2013: The 21st Iranian Conference on Electrical Engineering. Mashhad, Iran: IEEE.

Hirst, G., & Feiguina, O. (2007). Bigrams of syntactic labels for authorship discrimination of short texts. Literary and Linguistic Computing, 22(4), 405-417.

Islam, M., & Hoque, A. (2010). Automated essay scoring using generalized latent semantic analysis. Proceedings from ICCIT 2010: The 13th International Conference on Computer and Information Technology. Dhaka, Bangladesh: IEEE.

Kakkonen, T., Myller, N., & Sutinen, E. (2006). Applying part-of-speech enhanced LSA to automatic essay grading. Proceedings from ICIT 2006: The 4th International Conference on Information Technology. Tel Aviv, Israel: IEEE.

Ko, B., Choi, D., Choi, C., Choi, J., & Kim, P. (2012). Document classification through building specified N-gram. Proceedings from the Sixth International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing.

Landauer, T., & Dumais, S. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2), 211-240.

Li, J. (2012). Using Latent Semantic Analysis for automated essay scoring in the Chinese EFL context (Unpublished doctoral dissertation). Zhejiang University, Zhejiang, China.

Mitchell, T., Shinkareva, S., Carlson, A., Chang, K., Malave, V., Mason, R., & Just, M. (2008). Predicting human brain activity associated with the meanings of nouns. Science, 320, 1191-1195.

Yao, X. (2012). LSA-based automated essay scoring in Chinese context. Applied Mechanics and Materials, 274, 654-657.