AUTOMATED WRITER IDENTIFICATION FOR SYRIAC SCRIBES

advertisement
AUTOMATED WRITER IDENTIFICATION FOR SYRIAC SCRIBES
Emma Dalton and Nicholas Howe
1
2
Departments of Engineering and Computer Science, Smith College, Northampton, MA
1
Background and Motivation
Methods
The Syriac Language
Text-Dependent Methods Alaph Congealing
Syriac (Fig. 1) is an ancient dialect of Aramaic in which over 10,000 manuscripts were
written. These manuscripts include some of the ealiest translations of the Bible, and
provide a critical link in the preservation of the writings of Aristotle. Syriac Christianity was also one of the most far-reaching branches of the religion, making contact
with early Islam, India and even China [1].
Figure 1: A sample of handwriting from the Syriac script Estrangelo. Consists of 23 distinct letters used for text-dependent identification.
Despite all of these compelling aspects, the study of the corpus of Syriac manuscripts is stunted, while the study of the Christian religion has focused mainly on
Latin texts. In recent years there has been a shift in attitude, and now there are several universities offering instruction in the Syriac language, and the Vatican has
opened its manuscript collection for general academic use.
The Problem of Authorship
Derive statistics from large blocks of text. This requires
no prior knowledge of the script and is very fast, but
also very general. All of the methods used here are
derived from Schomaker, et al [3]
Letter Parts and Letter Concavities
• isolate letter parts using the letter skeleton
• isolate concavities from the convex hull
by subtracting the entire letter (Fig. 3)
• independently congeal each across the
whole letter set
=
+ ... +
=
+ ... +
+ ... +
These methods specifically target author
specific traits, such as long tails or extra
curves in particular letters. This specificity
takes extra time to compute.
Figure 4: Schematic showing examples of runlengths on background both vertical and horizontal.
Table 1: Percent In-Document Recognition for All Methods Used
Letter Concavities
+ ... +
Figure 3: Examples
of letterParts
parts and letter conLetter
cavities isolated from one letter alaph.
The Contour Direction feature
•measures the general tilt of the text
•uses letter contour
•calculates angle φ (Fig. 6)
•histogram compares text pages (Fig. 7).
16
Contour Hinge PDF
TextIndependent Horizontal Run-Length PDF
TextDependent
Combined
=
Percent In-Document
Recognition
Method Name
Contour Direction PDF
=
70
16
Vertical Run-Length PDF
68
Contour Hinge + Vertical Run-Length
All Text-Independent Methods
76
Whole Letters
80
Letter Parts
79
Convex Hulls
88
66
All Text-Dependent Methods
98.7
Text-Dependent and -Independent
100
Overall, text-dependent methods performed better than text-independent
ones, and the convex hull feature performs best individually with 88% indocument recognition. Of text-independent methods, the contour hinge feature
perfomed best, with a 70% recognition rate.
For the combined methods, only the contour hinge and vertical run-length
were used from the text-independent methods. Using all of the methods available decreased the accuracy of the system considerably.
Conclusion and Future Work
In an experimental setting, the system developed is capable of working at
100% accuacy for in-document recognition. This is considered highly successful
in this context. However, to be a successful system for use by Syriac scholars, the
system must be expanded and adapted to compare new page information with
a more complete dataset.
Figure 6: Schematic showing how Contour Hinge feature is extracted. Angle φ is
calculated from the horizontal.
Manuscripts Used
Table 1 shows the results for in-document recognition for each method used. In
absence of multiple documents by a sufficient number of authors, in-document
recognition is used as a measurement of writer identification. Using a combination of text-dependent and text-independent methods with the dataset used
for this project is able to achieve 100% in-document recognition.
Category
The Run-Length Feature
•measures background lengths
•both vertical and horizontal (Fig. 4)
•2 histograms used to compare text pages (Fig. 5)
While there are many Syriac manuscripts to study, and some are easily attributed to
a particular scribe because they include a colophon recording the copyist and date,
the majority of preserved documents (95 percent) do not include this important
information [1].
The goal of this project is to create a system which can take one document and
digitally compare it to other preprocessed manuscripts, to determine possible
handwriting matches, with a score to describe how close the match is. This will
assist scholars of Syraic by providing them with connections between documents
through authorship, which may speed up the process of study considerably.
While many Syriac manuscripts exist, relatively few are available in the digital
format needed for computerized identification. The documents used for this project were provided by Brigham Young University in collaboration with the Vatican.
From the commercially available CD-ROM, 4 pages from each of 19 documents in
the Syriac script Estrangelo were used to create and test a possible scribe identification system.
Results and Discussion
These methods require the user to have
prior knowledge of the language used.
Before
+
Many letter samples are combined using
Congealing [2], to create more “average”
characters. The transformations used to
get to these more general characters are
then used to compare letter samples, and
therefore pages of text to one another.
This can be done with both whole letters
and letter pieces.
+
After
Whole Letters
• select character samples from pages of
text
• process all letters using congealing
Shin Congealing
• use congealing transforms to compare
individual letters (Fig. 2)
Before
+
The whole letter method compares individual letters on relative scaling, along
with their relative tilt. By comparing letters to the computed “average” letters, the
method can easily be adapted to any
letter set.
Figure 2: Comparison of letter samples of
+
After
alaph and shin before and after congealing.
Text-Independent Methods
2
Figure 7: Comparison of two segments of text
using the contour hinge feature. Radial histograms
show two pages of text to be dissimilar.
Other future work might include:
• Expansion of the current set of documents
• Exploration of more writeridentification methods
• Verification using another language,
possibly the Hebrew dataset used by
Bar-Yosef et al [4]
• Expansion into the other Syriac
scripts, Serto and East Syriac (Fig. 10).
The Contour Hinge feature
•measures the general “roundness” of the text
•uses letter contour
•calculates two angles, φ1 and φ2 (Fig. 8)
•2-dimensional histogram compares text
pages (Fig. 9).
Figure 10: A sample of writing in the
East Syriac script. This is generally
blockier than Estrangelo, while Serto is
thinner and more cursive.
References
[1] M. Penn. “Draft NEH Proposal”, Unpublished, 2007.
[2] E. Learned-Miller, "Data driven image models through continuous joint alignment"
IEEE Transactions on Pattern Analysis and Machine Intelligence vol:28, no:2, pp. 236-250,
Feb. 2006.
[3] N. Bulacu, and L. Schomaker, "Text-Independent Writer Identifcation and Verifcation
Using Textural and Allographic Features", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.29 no.4, pp.701-717, April 2007
[4] I. Bar-Yosef, I. Beckman, K. Kedem, and I. Dinstein, "Binarization, character extraction,
and writer identifcation of historical Hebrew calligraphy documents" Int. J. Doc. Anal.
Recognit. vol:9, no:2, pp. 89-99, Apr. 2007
Figure 5: Graphs showing histograms for vertical
and horizontal run length features for two pages.
Figure 8: Schematic showing how Contour Hinge feature is extracted. Two
angles, φ1 and φ2 are calculated from
the horizontal from each point on the
letter contour.
Figure 9: Comparison of two segments of text using the
contour hinge feature. Surface plots show the 2dimensional histogram of φ1 and φ2. The two plots are
significantly dissimilar to provide author identification.
Acknowledgements
Thank you to Professor Nicholas Howe for his continued support, and Professor
Michael Penn of the Mount Holyoke Religion Department for providing guidance
and information about the Syriac language and Syriac studies issues.
Download