Speaker Agreement Analysis in Talmudic Aramaic

By Igor ‘Eli’ Finkelshteyn
1. Introduction
Talmudic Aramaic, the language of the Talmud
(compiled between approximately 0 and 500 C.E.),
is written in a Hebraic script without any
punctuation. It constitutes a very large corpus
of over 100,000 words. Two interesting tasks
that immediately come to mind when examining
this corpus are segmenting it and annotating
the sentiments of its segments. For this project, I
have segmented and annotated a corpus of 12
pages of Aramaic into 425 distinct, non-overlapping
parts. This corpus can be used as data by anyone
interested in either of the two tasks mentioned
above, among others.
The task I was interested in for this project is
the annotation of given segments of Aramaic. I
hypothesized that the three key indicators of
sentiment in the Talmud are cue words, cue
structures, and context. To test this hypothesis,
I annotated for action type (question, answer,
statement, etc.) and agreement type
(agreement, disagreement, or neutral-agreement),
which resulted in 8 unique annotations. These are:
-neutral-agreement statement
-neutral-agreement question
-neutral-agreement answer
-neutral-agreement alternate-version
-agreement support
-disagreement attack
-disagreement defense
-disagreement resolution
These, along with the intricacies of the
annotation process, are discussed in the
attached AnnotationReadMe.txt.
After completing annotation, I built a Python
script that uses the NLTK Naïve Bayes and
Maximum Entropy classifiers, along with a
feature set consisting of cue words, cue
structures, and context (in the weak form of
supplying the previous segment's tag as a
feature), to attempt to predict each segment's
sentiment.
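To make the pipeline concrete, here is a minimal sketch of how segments, features, and the NLTK Naïve Bayes classifier fit together. The annotated data and the tiny extract_features function are hypothetical stand-ins, not the project's actual code:

    import nltk

    def extract_features(segment, prev_tag):
        # Toy stand-in for the real feature set described in Section 2.
        words = segment.split()
        return {
            'has_amar': 'amar' in words,   # cue word: 'said'
            'first_word': words[0],
            'prev_tag': prev_tag,          # weak context feature
        }

    # Hypothetical annotated data: (transliterated segment, tag) pairs.
    annotated = [
        ('amar leih rava', 'neutral-agreement statement'),
        ('may taymah', 'neutral-agreement question'),
    ]

    # Build (feature dict, label) pairs, threading the previous segment's
    # tag through as the weak context feature.
    train_set, prev = [], '<START>'
    for text, tag in annotated:
        train_set.append((extract_features(text, prev), tag))
        prev = tag

    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print(classifier.classify(extract_features('amar leih abaye', '<START>')))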
2. Setup
Since the annotated corpus is very small, as far
as corpora go, I used 10-fold cross validation on
the corpus to attain more accurate scores. As
benchmarks, I used accuracy and set-matching
f-score. For accuracy, there is a pre-built function
in NLTK's classify module, so I made use of it.
For set-matching f-score, I could find no
pre-built module, so I built my own.
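A sketch of the cross-validation loop, using NLTK's built-in accuracy function. The macro_f_score function is only my guess at what a "set-matching" f-score over tags might compute; the author's actual module is not shown:

    import nltk

    def ten_fold_accuracy(labeled_featuresets, k=10):
        # k-fold cross validation, averaging NLTK's built-in accuracy.
        n, scores = len(labeled_featuresets), []
        for i in range(k):
            lo, hi = i * n // k, (i + 1) * n // k
            test = labeled_featuresets[lo:hi]
            train = labeled_featuresets[:lo] + labeled_featuresets[hi:]
            clf = nltk.NaiveBayesClassifier.train(train)
            scores.append(nltk.classify.accuracy(clf, test))
        return sum(scores) / float(k)

    def macro_f_score(reference, predicted):
        # Per-label F1, averaged over all labels (an assumed reading of
        # 'set-matching f-score').
        f_scores = []
        for lab in set(reference) | set(predicted):
            tp = sum(1 for r, p in zip(reference, predicted) if r == p == lab)
            fp = sum(1 for r, p in zip(reference, predicted) if p == lab != r)
            fn = sum(1 for r, p in zip(reference, predicted) if r == lab != p)
            prec = tp / float(tp + fp) if tp + fp else 0.0
            rec = tp / float(tp + fn) if tp + fn else 0.0
            f_scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
        return sum(f_scores) / len(f_scores)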
As test classifiers, I used a Naïve Bayes
Classifier, a Maximum Entropy Classifier, and a
Hidden Markov Model Classifier. A Conditional
Random Field would very probably have done a
better job of classifying the data, since I
hypothesized that both context and features are
important, but I had trouble getting one to work
properly with my data.
Many of the cue words used as features overlap,
so Naïve Bayes is not the ideal classifier to use
here, due to its feature independence
assumptions: it can be expected to produce
skewed results when many correlated cue words
are thrown in as features. Nevertheless, it
provides quick calculations, something not
available with Maximum Entropy Classifiers, and
it is supposed to work well on small data sets.
Because of this, my primary use of the Naïve
Bayes Classifier was to get quick results showing
which features were helpful and which were not.
Its results are provided for comparison, with the
acknowledgment that I expect the Maximum
Entropy Classifier to work better.
Similarly, the HMM classifier does not take any
features into account, only context. It also has
a very small data set to work with, whose
segments are almost all unique. Because of this,
I don't expect it to work well either.
To describe the classifiers in more depth: all
are provided through NLTK. The Naïve Bayes
classifier uses supervised learning and works
the same way as the one we built for class. We
use three different versions of Naïve Bayes in
our results.
The first is Naïve Bayes using no external
features. This makes the classifier simply label
every segment with the most commonly seen
label in the training data set.
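Concretely, stripping every feature leaves Naïve Bayes with nothing but the label prior P(label), so it degenerates into a majority-class baseline (toy data for illustration):

    import nltk

    # With no features, Naive Bayes has only the label prior P(label) to
    # work with, so it always predicts the most frequent training tag.
    train_set = [({}, 'neutral-agreement statement'),
                 ({}, 'neutral-agreement statement'),
                 ({}, 'disagreement attack')]
    baseline = nltk.NaiveBayesClassifier.train(train_set)
    print(baseline.classify({}))  # -> 'neutral-agreement statement'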
The second is Naïve Bayes using our full feature
set. I cannot give an in-depth explanation here
of the specific grammatical function of each cue
word in the feature set, as this would take up
pages on its own and is largely arguable.
Instead, I give a quick overview of the features
I use, along with basic, literal translations of
cue words and enclitics. These are:
Boolean features that indicate whether the first
letter of a word in a segment is an enclitic:
-First letter ‘daled’ (meaning ‘that’)
-First letter ‘kaf’ (meaning ‘like’)
-First letter ‘vav’ (meaning ‘and’)
-First two letters ‘vav’ and ‘hey’ (meaning ‘and isn’t it…?’)
Boolean features that indicate whether a
certain word appears in a segment:
-‘tanan’ (meaning ‘we learn’)
-‘hatam’ (meaning ‘there’)
-‘hacha’ (meaning ‘here’)
-‘may’ (meaning ‘what’)
-‘may shnah’ (meaning ‘what’s the difference?’)
-‘omer’ (meaning ‘says’)
-‘amar/amru’ (meaning ‘said’)
-‘amar leih/amru leih’ (meaning ‘said to him’)
-‘may taymah’ (meaning ‘what’s the reason?’)
-‘kman azlah’ (meaning ‘whom does it/he agree with?’)
-‘kman’ (meaning ‘like whom?’)
-‘ibayit eimah’ (meaning ‘if you want, I can say’)
-‘mishum’ (meaning ‘because’)
-‘deikah’ (meaning ‘look fastidiously’)
-‘pashu’ (meaning ‘there are ___ remaining’)
-‘mi’ (meaning ‘who’)
-‘ki’ (meaning ‘because’)
-‘ha’ (meaning ‘the’ or introduces a question)
-‘atai’ (meaning ‘comes’)
-‘hanicha’ (meaning ‘it makes sense for…’)
-‘bishlama’ (meaning ‘it’s good for…’)
-‘kama havi’ (meaning ‘how many is that?’)
-‘ela’ (meaning ‘rather’)
-‘ela’ as first word in segment
‘Context’ based features:
-first word in clause
-last word in clause
-full previous tag
-category of previous tag
The third uses a minimal feature set consisting
of only the ‘context’ based features (I put
context in quotes to acknowledge that it is
arguable whether the first- and last-word
features are really context based).
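Putting the lists above together, here is a sketch of what the full extractor might look like. The transliterated cue words stand in for the actual Hebrew-script strings, and the lists are illustrative rather than the project's actual code:

    # Transliterated cue words stand in for the Hebrew-script originals.
    CUE_WORDS = ['tanan', 'hatam', 'hacha', 'may', 'omer', 'mishum',
                 'kman', 'mi', 'ki', 'ha', 'atai', 'hanicha',
                 'bishlama', 'ela']
    CUE_PHRASES = ['may shnah', 'may taymah', 'kman azlah',
                   'ibayit eimah', 'kama havi', 'amar leih']
    ENCLITICS = ['d', 'k', 'v', 'vh']  # daled, kaf, vav, vav+hey

    def segment_features(segment, prev_tag, prev_category):
        words = segment.split()
        feats = {}
        # Boolean enclitic features: does any word start with the prefix?
        for enc in ENCLITICS:
            feats['enclitic_' + enc] = any(w.startswith(enc) for w in words)
        # Boolean cue word and cue phrase features.
        for cue in CUE_WORDS:
            feats['has_' + cue] = cue in words
        for phrase in CUE_PHRASES:
            feats['has_' + phrase.replace(' ', '_')] = phrase in segment
        feats['ela_first'] = words[0] == 'ela'
        # 'Context' based features.
        feats['first_word'] = words[0]
        feats['last_word'] = words[-1]
        feats['prev_tag'] = prev_tag
        feats['prev_category'] = prev_category
        return feats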
The Maximum Entropy Classifier was run with
both 100 iterations and 10 iterations. With 100
iterations, it achieved approximately 97%
accuracy on the training data (this varied
slightly each time I ran it within 10-fold cross
validation); with 10 iterations, it achieved
approximately 76% accuracy on the training
data. Both versions used Improved Iterative
Scaling.
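In NLTK this corresponds to something like the following, where train_set is a toy stand-in for the real (feature dict, label) pairs and 'iis' selects Improved Iterative Scaling:

    import nltk

    # Toy training pairs; the real ones come from the feature extractor.
    train_set = [({'has_ela': True}, 'disagreement attack'),
                 ({'has_ela': False}, 'neutral-agreement statement')]

    # algorithm='iis' selects Improved Iterative Scaling; the max_iter
    # cutoff caps the number of training iterations.
    maxent_100 = nltk.MaxentClassifier.train(train_set, algorithm='iis',
                                             max_iter=100)
    maxent_10 = nltk.MaxentClassifier.train(train_set, algorithm='iis',
                                            max_iter=10)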
Finally, the Hidden Markov Model was run using
transitions, outputs, and priors computed
directly by a module I created (NLTK's built-in
HMM classifier was unable to compute these on
its own). The module I used is provided with my
code in SetupHMM.py.
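A minimal sketch of how such a module might estimate the three distributions by maximum likelihood using NLTK's probability classes (the tagged sequence is a toy stand-in, and the actual SetupHMM.py may differ):

    from nltk.probability import (ConditionalFreqDist, ConditionalProbDist,
                                  FreqDist, MLEProbDist)
    from nltk.tag.hmm import HiddenMarkovModelTagger

    # Toy tagged sequence: (segment, tag) pairs in document order.
    sequence = [('amar leih rava', 'neutral-agreement statement'),
                ('may taymah', 'neutral-agreement question'),
                ('mishum dikhtiv', 'neutral-agreement answer')]

    symbols = [seg for seg, tag in sequence]
    states = sorted(set(tag for seg, tag in sequence))

    # Maximum-likelihood estimates of transitions P(tag_i | tag_{i-1}),
    # outputs P(segment | tag), and priors P(tag_0).
    trans_fd, out_fd = ConditionalFreqDist(), ConditionalFreqDist()
    prior_fd = FreqDist()
    prior_fd[sequence[0][1]] += 1
    for (seg, tag), (_, prev) in zip(sequence[1:], sequence):
        trans_fd[prev][tag] += 1
    for seg, tag in sequence:
        out_fd[tag][seg] += 1

    transitions = ConditionalProbDist(trans_fd, MLEProbDist)
    outputs = ConditionalProbDist(out_fd, MLEProbDist)
    priors = MLEProbDist(prior_fd)

    hmm = HiddenMarkovModelTagger(symbols, states, transitions,
                                  outputs, priors)
    print(hmm.tag(symbols))  # decodes the most likely tag sequence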
3. Results
Classifier                                        Accuracy   F-score
Classifying at random                             .125       .125
Naïve Bayes with label probability feature only   .238       .3506
Hidden Markov Classifier                          .237       .353
Maximum Entropy (10 iterations)                   .488       .514
Maximum Entropy (100 iterations)                  .55        .5715
Naïve Bayes with cue words and context            .5714      .5946
Naïve Bayes without cue words                     .5857      .6143

As was predicted, cue words, cue structures,
and context improved accuracy and the f-score
significantly over a base case of using just label
probabilities with a Naïve Bayes Classifier. The
two sets of surprising results were that the
Naïve Bayes Classifier did better than the
Maximum Entropy Classifier using a full feature
set, and that the Naïve Bayes Classifier did
better without using cue words as features.
The first anomaly may have been due to the
Maximum Entropy classifier being run with only
100 iterations to generate its model, which was
not enough to achieve optimal accuracy levels
on the training set. This could be the case
because running the classifier with 100
iterations gave us better results than running it
with just 10. Alternatively, some number of
iterations between these two could also provide
better results. This is left unclear, and I avoided
experimenting further with the Maximum
Entropy Classifier because of its extremely long
training times. Either way, this would account
for some loss in the classifier's accuracy and
F-score, but probably not enough. I am at a loss
to explain the rest.
The latter phenomenon seems more easily
explicable: because of Naïve Bayes'
independence assumptions, clearly dependent
cue word features (i.e., where a word appears
both by itself and as part of a phrase) wound up
overlapping, causing a performance hit.
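A concrete (hypothetical) illustration of the overlap:

    # In a segment containing 'may taymah', both the single-word cue
    # 'may' and the phrase cue 'may taymah' fire at once. Naive Bayes
    # multiplies their likelihoods as though they were independent, so
    # the same underlying cue is effectively counted twice.
    segment = 'may taymah'
    words = segment.split()
    features = {
        'has_may': 'may' in words,                  # True
        'has_may_taymah': 'may taymah' in segment,  # also True: dependent
    }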
I should additionally mention here that the
extremely poor performance of the HMM
classifier is entirely due to the fact that almost
all segments in my data set are unique. Because
of this, the output probabilities (i.e., of seeing a
certain symbol given a certain state) become
highly skewed and yield very poor results. The
HMM would likely work much better on a large
data set that follows a Zipf distribution more
closely.
4. Conclusion
I still surmise that, given the right classifier, cue
words should be helpful. The best classifier for
our case is probably a Conditional Random
Field, since it allows us to better model both
context and highly correlated features.
Problematically, I could not find a version of this
classifier implementable in NLTK that I could get
to work, and building one myself was beyond
the scope of my project.
Nevertheless, even though the cue-word-using
version was slightly worse than its counterpart,
both versions still managed to attain results
that doubled what was achievable with the
Naïve Bayes Classifier that used only label
probability as a feature. Additionally, because
the version of the classifier that tried to use
context without cue words still used first_word
and last_word as features (these being a cross
between cue words and context), it still seems
that cue words are useful, even without being
able to see them used in a Conditional Random
Field.
5. References
This task would have been much more difficult,
if not impossible, without the Natural Language
Toolkit (NLTK) for Python. Its webpage is
located at http://www.nltk.org/.
I used NumPy for an implementation of a
Maximum Entropy Classifier. Its homepage is
http://numpy.scipy.org/.
Finally, everything I made was built on Python
2.6. Python’s home page is
http://www.python.org/.