Speaker Agreement Analysis in Talmudic Aramaic
By Igor ‘Eli’ Finkelshteyn

1. Introduction

Talmudic Aramaic, the language of the Talmud (compiled between approximately 0-500 C.E.), is written in a Hebraic script without any punctuation. Its corpus is very large, consisting of over 100,000 words. Two interesting tasks that immediately come to mind when examining this corpus are segmenting it and annotating the sentiments of its segments. For this project, I segmented and annotated a corpus of 12 pages of Aramaic into 425 distinct, non-overlapping parts. This corpus can be used as data by anyone interested in either of the two tasks mentioned above, among others.

The task I was interested in for this project is the annotation of given segments of Aramaic. I hypothesized that the three key indicators of sentiment in the Talmud are cue words, cue structures, and context. To test this hypothesis, I annotated for action type (question, answer, statement, etc.) and agreement type (agreement, disagreement, or neutral-agreement), which resulted in 8 unique annotations. These are:

- neutral-agreement statement
- neutral-agreement question
- neutral-agreement answer
- neutral-agreement alternate-version
- agreement support
- disagreement attack
- disagreement defense
- disagreement resolution

These, along with the intricacies of the annotation process, are discussed in the attached AnnotationReadMe.txt. After completing annotation, I built a Python script that uses the NLTK Naïve Bayes and Maximum Entropy classifiers, along with a feature set consisting of cue words, cue structures, and context (in the weak form of supplying the previous segment’s tag as a feature), to attempt to predict each segment’s sentiment.

2. Setup

Since the annotated corpus is very small, as far as corpora go, I used 10-fold cross validation on the corpus to attain more accurate scores. As benchmarks, I used accuracy and set-matching f-score. For accuracy, there is a pre-built class in NLTK’s Classifier module, so I made use of it. For set-matching f-score, I could find no pre-built module, so I built my own. As classifiers, I used a Naïve Bayes Classifier, a Maximum Entropy Classifier, and a Hidden Markov Model Classifier. It is very probable that a Conditional Random Field would have done a better job of classifying the data, since I hypothesized that both context and features are important, but I had trouble getting one to work properly with my data set.

Many of the cue words used as features overlap, so Naïve Bayes is not the ideal classifier here because of its feature independence assumptions; it can be expected to give skewed results when many correlated cue words are thrown in as features. Nevertheless, it trains quickly, which the Maximum Entropy Classifier does not, and it is also supposed to work well on small data sets. Because of this, my primary use of the Naïve Bayes Classifier was to get quick results showing which features were helpful and which were not. Its results are provided for comparison, with the acknowledgment that I expected the Maximum Entropy Classifier to work better. Similarly, the HMM classifier does not take any features into account, only context, and it has a very small data set to work with whose segments are almost all unique, so I did not expect it to work well either. All of the classifiers are provided through NLTK; I describe them in more depth below.
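Before describing the classifiers individually, here is a minimal sketch of the 10-fold cross validation and accuracy scoring described above, using NLTK’s Naïve Bayes classifier. The name labeled_segments is hypothetical: it stands for a list of (feature dictionary, tag) pairs, one per annotated segment, produced by the feature extraction described below.

import nltk

def cross_validate(labeled_segments, num_folds=10):
    # Split the annotated corpus into num_folds contiguous folds; each fold
    # takes a turn as the test set while the remaining folds are used for training.
    fold_size = len(labeled_segments) // num_folds
    scores = []
    for i in range(num_folds):
        test_set = labeled_segments[i * fold_size:(i + 1) * fold_size]
        train_set = (labeled_segments[:i * fold_size] +
                     labeled_segments[(i + 1) * fold_size:])
        classifier = nltk.NaiveBayesClassifier.train(train_set)
        # nltk.classify.accuracy compares predicted labels against the gold tags.
        scores.append(nltk.classify.accuracy(classifier, test_set))
    return sum(scores) / len(scores)

The Maximum Entropy runs only swap the training call, e.g. nltk.MaxentClassifier.train(train_set, algorithm='iis', max_iter=100) for the 100-iteration version.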
The Naïve Bayes classifier uses supervised learning and works the same way as the one we built for class. We use three different versions of Naïve Bayes in our results. The first is Naïve Bayes using no external features. This makes the classifier simply label every segment with the most commonly seen label in the training data set. The second is Naïve Bayes using our full feature set. I cannot give an in-depth explanation here of the specific grammatical function of each cue word in the feature set, as this would take up pages on its own and is largely arguable. Instead, I give a quick overview of the features I use, along with basic, literal translations of the cue words and enclitics. A sketch of how these are assembled into a feature dictionary follows this list.

Boolean features that indicate whether the first letter of a word in a segment is an enclitic:
- First letter ‘daled’ (meaning ‘that’)
- First letter ‘kaf’ (meaning ‘like’)
- First letter ‘vav’ (meaning ‘and’)
- First two letters ‘vav’ and ‘hey’ (meaning ‘and isn’t it…?’)

Boolean features that indicate whether a certain word appears in a segment:
- ‘tanan’ (meaning ‘we learn’)
- ‘hatam’ (meaning ‘there’)
- ‘hacha’ (meaning ‘here’)
- ‘may’ (meaning ‘what’)
- ‘may shnah’ (meaning ‘what’s the difference?’)
- ‘omer’ (meaning ‘says’)
- ‘amar/amru’ (meaning ‘said’)
- ‘amar leih/amru leih’ (meaning ‘said to him’)
- ‘may taymah’ (meaning ‘what’s the reason?’)
- ‘kman azlah’ (meaning ‘whom does it/he agree with?’)
- ‘kman’ (meaning ‘like whom?’)
- ‘ibayit eimah’ (meaning ‘if you want, I can say’)
- ‘mishum’ (meaning ‘because’)
- ‘deikah’ (meaning ‘look fastidiously’)
- ‘pashu’ (meaning ‘there are ___ remaining’)
- ‘mi’ (meaning ‘who’)
- ‘ki’ (meaning ‘because’)
- ‘ha’ (meaning ‘the’ or introducing a question)
- ‘atai’ (meaning ‘comes’)
- ‘hanicha’ (meaning ‘it makes sense for…’)
- ‘bishlama’ (meaning ‘it’s good for…’)
- ‘kama havi’ (meaning ‘how many is that?’)
- ‘ela’ (meaning ‘rather’)
- ‘ela’ as first word in segment

‘Context’-based features:
- first word in clause
- last word in clause
- full previous tag
- category of previous tag

The third version uses a minimal feature set consisting of only the ‘context’-based features (I put context in quotes here to acknowledge that it is arguable whether the first-word and last-word features are really context based).

The Maximum Entropy Classifier was run with 100 iterations and with 10 iterations. With 100 iterations, it achieved approximately 97% accuracy on the training data (this varied slightly each time I ran it within 10-fold cross validation). With 10 iterations it achieved approximately 76% accuracy on the training data. Both versions used Improved Iterative Scaling. Finally, the Hidden Markov Model was run using transitions, outputs, and priors that I created a module to compute directly (because NLTK’s built-in HMM classifier was unable to compute these on its own). The module is provided with my code in SetupHMM.py.
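As a concrete illustration of the feature set above, the following is a simplified sketch of the kind of feature-extraction function fed to the classifiers. The cue word and enclitic lists are abbreviated and transliterated here as placeholders only; the real script works with the Hebrew-script forms listed above, and the helper name segment_features is hypothetical.

# Abbreviated, transliterated stand-ins for the full cue word and enclitic
# lists given above; the real features use the Hebrew-script forms.
CUE_WORDS = ['tanan', 'hatam', 'hacha', 'may', 'amar', 'mishum', 'ela']
ENCLITICS = ['d', 'k', 'v', 'vh']  # daled, kaf, vav, vav+hey prefixes

def segment_features(segment_words, prev_tag):
    features = {}
    # Boolean cue word features: does the word occur anywhere in the segment?
    for cue in CUE_WORDS:
        features['contains(%s)' % cue] = cue in segment_words
    # Boolean enclitic features: does any word begin with the enclitic letter(s)?
    for enc in ENCLITICS:
        features['prefix(%s)' % enc] = any(w.startswith(enc) for w in segment_words)
    # 'ela' as the first word of the segment is its own feature.
    features['first_is_ela'] = segment_words[0] == 'ela'
    # 'Context'-based features: first word, last word, the previous segment's
    # full tag, and that tag's agreement category (e.g. 'disagreement' from
    # 'disagreement attack').
    features['first_word'] = segment_words[0]
    features['last_word'] = segment_words[-1]
    features['prev_tag'] = prev_tag
    features['prev_category'] = prev_tag.split()[0]
    return features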
3. Results

Accuracy and set-matching f-score for each classifier, averaged across 10-fold cross validation:

Classifier                                         Accuracy   F-score
Classifying at random                              .125       .125
Naïve Bayes with label probability feature only    .238       .3506
Hidden Markov Classifier                           .237       .353
Maximum Entropy (10 iterations)                    .488       .514
Maximum Entropy (100 iterations)                   .55        .5715
Naïve Bayes with cue words and context             .5714      .5946
Naïve Bayes without cue words                      .5857      .6143

As was predicted, cue words, cue structures, and context improved accuracy and f-score significantly over the base case of using just label probabilities with a Naïve Bayes Classifier. The two sets of surprising results were that the Naïve Bayes Classifier did better than the Maximum Entropy Classifier using a full feature set, and that the Naïve Bayes Classifier did better without using cue words as features.

The first anomaly may have been due to the Maximum Entropy classifier being run with only 100 iterations to generate its model, which was not enough to achieve optimal accuracy levels on the training set. This could be the case because running the classifier with 100 iterations gave better results than running it with just 10. Alternatively, some number of iterations between these two could also provide better results. This is left unclear, and I avoided experimenting further with the Maximum Entropy Classifier because of its extremely long training times. Either way, this would account for some loss in the Classifier’s accuracy and f-score, but probably not enough; I am at a loss to explain the rest. The latter phenomenon seems more easily explicable: because of Naïve Bayes’ independence assumptions, clearly dependent cue word features (i.e., where a word appears both by itself and as part of a phrase) wound up overlapping, causing a performance hit.

I should additionally mention that the extremely poor performance of the HMM classifier is entirely due to the fact that almost all segments in my data set are unique. Because of this, the output probabilities (i.e., of seeing a certain symbol given a certain state) become highly skewed and offer very poor results. The HMM would likely work much better on a large data set that follows a Zipf distribution more closely.

4. Conclusion

I still surmise that, given the right classifier, cue words should be helpful. The best classifier for our case is probably a Conditional Random Field, since it allows us to better model both context and highly correlated features. Problematically, I could not find a version of this classifier implementable in NLTK that I could get to work, and building one myself was beyond the scope of my project. Nevertheless, even though the cue-word-using version was slightly worse than its counterpart, both still managed to attain results that doubled what was achievable with the Naïve Bayes Classifier that used only label probability as a feature. Additionally, because the version of the classifier that tried to use context without cue words still used first_word and last_word as features (these being a cross between cue words and context), it still seems that cue words are useful, even without being able to see them used in a Conditional Random Field.

5. References

This task would have been much more difficult, if not impossible, without the Natural Language Toolkit (NLTK) for Python. Its webpage is located at http://www.nltk.org/. I used NumPy for an implementation of a Maximum Entropy Classifier. Its homepage is http://numpy.scipy.org/. Finally, everything I made was built on Python 2.6. Python’s home page is http://www.python.org/.