A New Comprehensive Database of Hand-written Arabic Words,
Numbers, and Signatures used for OCR Testing
Nawwaf Kharma, Maher Ahmed, & Rabab Ward
E.C.E. Department,
(289-) 2366 Main Mall, University of British Columbia, Vancouver BC, Canada. V6T 1Z4.
Phone: 604- 822 1742. Fax: 604- 822 9013
Nawwaf@ieee.org, MaherA@ece.ubc.cai & RababW@cs.ubc.ca
Abstract
This paper describes the formation of a comprehensive
database of handwritten Arabic words, numbers, and
signatures, for use in optical character recognition research
related to the Arabic language. So far, no such database
(freely or commercially available) exists.
1 Introduction
The construction of software programs for the
recognition of characters and words is an involved
engineering pursuit. The quality of the resulting
software depends, to a great degree, on the
thoroughness of testing, both during and after the
development stage.
Many handwriting-related applications
are concerned with off-line uses, such as form-filling
and signature confirmation. In such
situations, there is no (direct) control over the style
of writing used by the writer. Overlapping and
badly formed characters, as well as missing dots,
are common occurrences.
Practical problems of this type
necessitate the use of a database
of naturally written characters, numbers and other
gestures, in order to evaluate the real-world
effectiveness of the developed software. In the case
of the Arabic language in particular, there exists no
freely or commercially available (or Internet-accessible)
database of Arabic characters,
numbers, and signatures. Filling this gap is the
purpose of the research work described here.
2 Background
Much work has been carried out in Arabic
character & word recognition. Doubtless, some of
this work relied on privately collected and kept
databases of Arabic printed and hand-written text.
However, despite a determined and lengthy search of
both the literature and the Internet, no publicly accessible
Arabic database was found. This must be a
genuine hindrance to many researchers in
Arabic character & word recognition.
Hence, although our own work [2] concerned
on-line (as opposed to off-line) Arabic character
recognition, it was decided that building such a database
is, on its own merit, a worthwhile academic effort, one
which would, if properly conducted, provide
researchers with a useful tool. In addition, it is
envisioned that, with the right thinning and tracing
techniques, a determined researcher of on-line
recognition would be able to recover the temporal
order of writing and, as a result, use the same
database for his/her testing needs.
3 Collection of Data
Hand-written Arabic words, numbers,
signatures, and complete sentences were the object
of the data collection effort, which took place at
Al-Isra’ University in Amman, Jordan. Five hundred
randomly selected students took part in the exercise.
Each student was asked to copy a pre-selected list of
words and digits, as well as a sentence. He/she also
wrote his/her signature five times at the bottom of the
distributed (standard) form.
The words were chosen carefully to ensure
that they contain (at least once) each of the letters
of the Arabic alphabet, in their various letterforms
(e.g. middle and end). As for numbers, both the
‘Magharibi’ (North African) and ‘Mashriqi’
(Middle Eastern) digit sets were requested. The
single sentence (a line of poetry) was included in
the form to provide the database user with a
snapshot of the way in which a native writer forms
a whole sentence, while using short vowels (or
‘tashkeel’) and punctuation marks. Finally, the
signatures came in all forms, but all were written
either in Arabic or (not unusually) in English.
It is worth noting that a couple of
restrictions were placed on the writer. The
standard form contained (rather large) boxes in
which the writer wrote each response (e.g. a
word). Also, with the exception of the
sentence, no base-line was provided. This made
the subsequent process of cleaning the black &
white images much easier and more accurate.
4 Scanning, Digital Manipulation &
Storage of Data
The paper forms collected from the
students were all scanned. The scanned images
were segmented, in some cases cleaned up, and
saved in logically named computer files, first on a
hard disk, and then on compact disk (CD).
Scanning was done using an HP true-colour
capable scanner. However, it was decided
that allowing colour would hugely increase the size
of the files, without necessarily adding a popularly
demanded feature. The forms were instead
scanned as (sharp) grayscale and (sharp) black &
white (or b&w) images. The images were saved in
bitmap (or .bmp) file format, because this format
is standard and is very easy for computer
programs to use.
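The effect of bit depth on file size can be estimated with a short calculation. The sketch below is only illustrative: the page size (A4) and scanning resolution (300 dpi) are assumptions, since the paper does not state them, and the function name raw_size_bytes is hypothetical. It counts uncompressed pixel data only, ignoring the small .bmp header and palette.

```python
# Rough uncompressed pixel-data sizes for one scanned form at three bit depths.
# ASSUMPTIONS (not stated in the paper): A4 page, scanned at 300 dpi.

def raw_size_bytes(width_px, height_px, bits_per_pixel):
    """Pixel-data size of an uncompressed bitmap.

    In the .bmp format each row of pixels is padded to a multiple of 4 bytes.
    """
    row_bytes = ((width_px * bits_per_pixel + 31) // 32) * 4
    return row_bytes * height_px

DPI = 300                        # assumed scanning resolution
WIDTH_PX = round(8.27 * DPI)     # A4 width in inches (assumed page size)
HEIGHT_PX = round(11.69 * DPI)   # A4 height in inches (assumed page size)

sizes = {
    "24-bit colour": raw_size_bytes(WIDTH_PX, HEIGHT_PX, 24),
    "8-bit grayscale": raw_size_bytes(WIDTH_PX, HEIGHT_PX, 8),
    "1-bit b&w": raw_size_bytes(WIDTH_PX, HEIGHT_PX, 1),
}

for name, size in sizes.items():
    print(f"{name}: {size / 1e6:.1f} MB")
```

Under these assumptions a colour scan needs roughly three times the storage of a grayscale scan and over twenty times that of a b&w scan, which is consistent with the decision to leave colour out.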
Once scanning was completed, the image
of the page was digitally manipulated, according to
its contents, in the following manner.
Words in grayscale were cut out of the
image and saved in bitmap files. Words in b&w,
however, were cut out of the image, borders as well
as noise were manually removed, and then each image
was saved in a bitmap file.
In the case of numbers, each grayscale
digit was cut out of the image and saved in its own
.bmp file. In contrast, digits in b&w were cut out of
the image, all borders and noise were manually
removed, and then each digit was saved in a separate
file.
Signatures were also written in boxes. As
with words and numbers, grayscale signatures were
simply cut out of the original image and placed in
separate .bmp files. On the other hand, b&w
signatures were cleaned up before being saved in
bitmap files.
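The cut-and-clean procedure for boxed items (words, digits and signatures) can be sketched in code. In the work described here, borders and noise were removed manually; the automated version below is only an illustration of the same steps, not the procedure actually used. The function names, the toy image, the box coordinates and the one-pixel border thickness are all hypothetical.

```python
# Illustrative sketch only: the paper's clean-up was done by hand.
# A b&w image is a list of rows; 1 = ink, 0 = background.

def crop(image, top, left, height, width):
    """Cut a rectangular box out of the page image."""
    return [row[left:left + width] for row in image[top:top + height]]

def strip_border(box, thickness=1):
    """Blank the outer `thickness` pixels, where the printed box border lies."""
    h, w = len(box), len(box[0])
    return [
        [
            0 if (r < thickness or r >= h - thickness
                  or c < thickness or c >= w - thickness) else box[r][c]
            for c in range(w)
        ]
        for r in range(h)
    ]

def remove_isolated_pixels(box):
    """Treat an ink pixel with no ink among its 8 neighbours as noise."""
    h, w = len(box), len(box[0])

    def has_ink_neighbour(r, c):
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                rr, cc = r + dr, c + dc
                if (dr or dc) and 0 <= rr < h and 0 <= cc < w and box[rr][cc]:
                    return True
        return False

    return [
        [1 if box[r][c] and has_ink_neighbour(r, c) else 0 for c in range(w)]
        for r in range(h)
    ]

# Toy page: a printed box containing a three-pixel stroke and one noise speck.
PAGE = [
    "..........",
    ".########.",
    ".#......#.",
    ".#.##...#.",
    ".#..#...#.",
    ".#......#.",
    ".#....#.#.",
    ".########.",
    "..........",
]
image = [[1 if ch == "#" else 0 for ch in row] for row in PAGE]

box = crop(image, top=1, left=1, height=7, width=8)
cleaned = remove_isolated_pixels(strip_border(box))

for row in cleaned:
    print("".join("#" if px else "." for px in row))
```

Real scans would need a more tolerant approach (skewed borders, larger specks), which may be one reason manual cleaning was preferred.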
Sentences were treated differently. No cleaning
up was done in either the grayscale or the
b&w case. In both cases, the sentence was simply
cut out of the image and saved in a .bmp file. The
reason for not cleaning up the b&w sentences is
that doing so would mean removing the base-line,
something that was never intended.
5 Thinning of Images
In addition to saving images in b&w and
grayscale .bmp formats, a thinning algorithm
developed by Ahmed [1] was applied to all the b&w
image files. The output of this application was also
saved in bitmap files, sorted according to their
contents, into word, number, signature, and
sentence files. The purpose of this application is to
provide researchers with a readily available thinned version
of the original image files.
The thinning algorithm in [1] has just 16
rules. These rules aim at deleting the pixels which
are north-west corner pixels or pixels that lie in the
east boundaries or south boundaries of the
designated pixel (being processed). The conditions
for 2-pixel thickness lines are given special
attention. The original 16 rules are transformed by
rotating them by 90°, 180°, and 270° to produce
three more sets of rules. The 64 rules (composed of
the 16 rules and their rotated versions) are applied
to every pixel, in one pass.
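The expansion of a base rule set by rotation can be sketched as follows. The actual 16 rules of [1] are not reproduced in this paper, so EXAMPLE_RULE is a made-up 3x3 template (1 = ink, 0 = background, None = don't care) used purely for illustration, and the function names are hypothetical. The 3x3 window size is also an assumption.

```python
# Sketch of generating rotated rule sets. EXAMPLE_RULE is hypothetical and
# is NOT one of the 16 rules of the algorithm in [1].

def rotate90(template):
    """Rotate a square template by 90 degrees clockwise."""
    return [list(row) for row in zip(*template[::-1])]

def expand_rules(base_rules):
    """Return every base rule together with its 90, 180 and 270 degree rotations."""
    expanded = []
    for rule in base_rules:
        current = rule
        for _ in range(4):
            expanded.append(current)
            current = rotate90(current)
    return expanded

def matches(template, window):
    """True if the window around a pixel agrees with the template (None = any)."""
    return all(
        t is None or t == w
        for t_row, w_row in zip(template, window)
        for t, w in zip(t_row, w_row)
    )

# Hypothetical template resembling a north-west corner pixel.
EXAMPLE_RULE = [
    [0,    0, None],
    [0,    1, 1],
    [None, 1, 1],
]

all_orientations = expand_rules([EXAMPLE_RULE])
print(len(all_orientations))  # one base rule yields 4 orientations
```

Applied to a set of 16 base templates, expand_rules returns 16 x 4 = 64 templates, matching the count given above. A one-pass thinning step would then test each pixel's neighbourhood against these templates with matches.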
6 Conclusion & Future Work
The final result of all the collection and
processing work described is:
- 37,000 Arabic words,
- 10,000 digits (in two types),
- 2,500 signatures, and
- 500 free-form Arabic sentences,
all saved in grayscale and black and white .bmp file
formats and ready for use.
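These totals can be cross-checked against the collection procedure of Section 3. The short calculation below uses the figures given there (500 writers, 5 signatures and 1 sentence per form, 2 digit sets). Two values are assumptions rather than facts from the paper: that each digit set consists of the ten digits written once each, and that every form was completed in full. The number of words per form is not stated, so it is only inferred from the total.

```python
# Consistency check of the reported totals against the per-form quantities.

WRITERS = 500              # Section 3
SIGNATURES_PER_FORM = 5    # Section 3
SENTENCES_PER_FORM = 1     # Section 3
DIGIT_SETS = 2             # 'Magharibi' and 'Mashriqi'
DIGITS_PER_SET = 10        # ASSUMPTION: digits 0-9, each written once per set

REPORTED_WORDS = 37_000    # reported total, Section 6

signatures = WRITERS * SIGNATURES_PER_FORM
sentences = WRITERS * SENTENCES_PER_FORM
digits = WRITERS * DIGIT_SETS * DIGITS_PER_SET

# INFERRED, not stated in the paper: words copied per form if all forms
# were complete.
words_per_form = REPORTED_WORDS // WRITERS

print(signatures, sentences, digits, words_per_form)
```

Under these assumptions the signature, sentence and digit totals agree with the reported figures, and the word total corresponds to 74 words per form.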
Copies of the database CD will be offered
to researchers in Canada, as of May 1999, and (only
partially) to the rest of the world, once it is placed
on the Internet.
Acknowledgements
Thanks are due to everyone at Al-Isra’ University
who participated in the data collection stage. I
would also like to express my gratitude to my wife,
Aida, for her patient and persistent help in
scanning the forms.
References
[1] Ahmed, M. and Ward, R., “A rule-based system
for thinning symbols to their central lines,”
submitted to the IEEE journal of PAMI in June
1998.
[2] Kharma, N. and Ward, R., “A novel invariant
mapping applied to hand-written Arabic character
recognition,” submitted to Character Recognition
Letters in January 1999.