FACTORED CONVOLUTIONAL NEURAL NETWORK FOR AMHARIC CHARACTER
IMAGE RECOGNITION
Birhanu Belay (University of Kaiserslautern, DE; Bahir Dar Institute of Technology, Ethiopia)
Tewodros Habtegebrial (University of Kaiserslautern, DE)
Marcus Liwicki (Lulea University of Technology, Department of Computer Science, Lulea, Sweden)
Gebeyehu Belay (Bahir Dar Institute of Technology, Ethiopia)
Didier Stricker (DFKI-German Research Center for Artificial Intelligence, Kaiserslautern, DE)
ABSTRACT
In this paper, we propose a novel CNN-based approach for Amharic character image recognition. The proposed method is designed by leveraging the structure of Amharic graphemes: each Amharic character can be decomposed into a consonant and a vowel. As a result of this consonant-vowel combination, Amharic characters lie within a matrix structure called 'Fidel Gebeta', whose rows and columns correspond to a character's consonant and vowel components, respectively. The proposed method has a CNN architecture with two classifiers that detect the row/consonant and column/vowel components of a character. The two classifiers share a common feature space before they fork out at their last layers. The method achieves state-of-the-art results on a synthetically generated dataset, with 94.97% overall character recognition accuracy.

Index Terms— Amharic character image, Factored CNN, Fidel Gebeta, Row-column order, OCR

1. INTRODUCTION

Amharic, the official language of Ethiopia, is the second most widely spoken Semitic language and has its own alphabet. There are multiple documents written in Amharic script dating back to the 12th century [1]. The Amharic writing system uses about 310 different symbols, of which 231 are basic characters, 50 are labialized, 20 are numerals, and 9 are punctuation marks [2]. Amharic script is written and read from left to right, like English. The most widely used characters in Amharic script are the basic characters, which we consider in this paper; they are arranged in a matrix structure called 'Fidel Gebeta' with a 33 by 7 row-column order, as shown in Figure 1.

Optical Character Recognition (OCR) for multiple Latin and non-Latin scripts has reached maturity [3, 4, 5], whereas OCR for African languages with their own indigenous scripts is still an open research area [2]. Currently, some research works on Amharic OCR have been published [6, 7, 8]. In the literature, researchers have noted that the similarity of characters across rows and the large number of classes affect the performance of Amharic OCR and make it complex. However, both the class size complexity and the character similarity across rows of 'Fidel Gebeta' can be overcome: based on the structure of 'Fidel Gebeta', the large set of classes can be divided into two smaller sets of sub-classes (row-wise and column-wise classes).

978-1-5386-6249-6/19/$31.00 ©2019 IEEE

Fig. 1. Amharic characters ('Fidel Gebeta') in row-column-wise (33 × 7) order, read from left to right: the first column gives the consonant characters, while the next six columns are derived by combining the consonant with vowel sounds.

ICIP 2019
To the best of our knowledge, no attempt has been made to recognize a character based on its row-column position in the Amharic 'Fidel Gebeta', which has motivated us to work on it. Therefore, in this paper, we propose a method called Factored Convolutional Neural Network (FCNN) for Amharic OCR. FCNN has two classifiers that share a common feature space before they fork out at their task-specific layers.
The rest of the paper is organized as follows. Section 2 discusses related work. In Section 3, the proposed system architecture and details of the datasets are presented. Section 4 presents experimental results, and conclusions are presented in Section 5.
2. RELATED WORK
Recognition of character images has been studied and solved for multiple scripts. Researchers have done a great deal of work to address the problems in most Latin and non-Latin scripts, and most of these scripts now have commercial off-the-shelf OCR applications. However, there are still untouched scripts in the area of OCR [7]. Until recently, OCR for Amharic script remained relatively unexplored.
The first work on Amharic OCR was done by Worku in 1997, who adopted a tree classification scheme built using topological features of a character [9]. It was only able to recognize Amharic characters written in Washera font at 12-point type. Other attempts have since been made, ranging from typewritten [10], machine-printed [2], Amharic braille document image recognition [8], Amharic document image recognition and retrieval [11], and numeral recognition [6] to handwritten text [12]. However, they applied statistical machine learning techniques to a limited number of private datasets.
Several CNN-based works have been carried out in many domains, such as character recognition [13], digit reading in natural images [14], visual pattern mining [15], and car plate recognition [16]. Maitra et al. [17] proposed a CNN as a trainable feature extractor with an SVM classifier for numeric character recognition in multiple scripts. In a similar study, Elleuch et al. [18] employed a CNN as a feature extractor and an SVM as a recognizer to classify Arabic handwritten script using the HACDB database.
Zhang et al. [19] propose a convolutional neural network to find facial landmarks and then recognize face attributes. Kisilev et al. [20] present a multi-task CNN approach for detection and semantic description of lesions in diagnostic images. Liu et al. [21] propose a neural network for query classification and information retrieval. Yang et al. [22] introduce a multi-task learning algorithm for cross-domain image captioning. The common element of all these works is the use of the same parameters for the bottom layers of the deep neural network with task-specific parameters at the top layers. As explained by Argyriou et al. [23], learning multiple related tasks simultaneously has been shown to often significantly improve performance relative to learning each task independently.
Recently published work [7] employs a CNN architecture for Amharic character image recognition. In that work, about 80,000 synthetically generated Amharic character images were used; the authors considered 231 classes and achieved a 92.71% character recognition rate. All of the methods mentioned above consider each character as a single class [2, 10, 7] and do not consider the row and column location of the character in 'Fidel Gebeta'. The problems with these approaches are that they treat structurally similar characters as different classes and use a large number of classes, which makes the network complex. It is also difficult to obtain good performance when each structurally similar character is considered a separate class and the information shared between characters in 'Fidel Gebeta' is not explicitly used.
Therefore, in this paper, we propose a CNN-based approach, called FCNN, with two classifiers that share layers at the lower stage and have task-specific layers at their last stage. Both classifiers are trained jointly and detect the row and column components of a character. In addition, the proposed method reduces the class size complexity from 231 to 40 classes. Sample Amharic characters from 'Fidel Gebeta' and the shape differences across their corresponding vowels in row-wise order are depicted in Figure 2.
Fig. 2. Shape change in the row-column structure of 'Fidel Gebeta'. We call the first entry of each row the base character; each row has its own base character with a unique shape. Each column applies a unique "pattern" of modification to the base characters. For instance, the third column, marked by the green box, adds a dash at the bottom right; the fourth column, marked by the blue box, makes the left leg of the character shorter; the fifth column, marked by the violet box, adds a circle at the right leg; and so on.
3. MATERIAL AND METHODS
The shortage of dataset is the main challenge of pattern recognition in general and specifically for Amharic OCR. There
are few database used in various works on Amharic OCR reported in literature. As reported in [10], the authors considered the most frequently used Amharic character with small
2907
number of samples, which are about 5172 core Amharic characters with 16 sample image per class. Later work done by
million et al[11] uses 76800 character images with different
fonts type, size with 231 classes. Other researchers work on
Amharic OCR [9, 12, 6] reported as they used their own private database, but none of them make publicly available for
research purpose.
In this paper, we use the Amharic character database [7], which contains 80,000 sample Amharic character images; a few sample character images taken from the database are shown in Figure 3. The database contains many deformed and meaningless character images; therefore, for our experiments, we removed about 2006 distorted character images. We also labelled each character image in row-column order following the arrangement of 'Fidel Gebeta', and resized the images to 32 × 32 pixels.
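The row-column relabeling described above can be sketched as follows: since 'Fidel Gebeta' has 7 columns, a flat class index splits into a row and a column label. The function name and the flat row-major index convention are our own illustration, not from the paper:

```python
def split_label(flat_index):
    """Split a flat class index (0..230) into (row, column) labels,
    assuming characters are enumerated row by row across the 7 columns."""
    row = flat_index // 7  # consonant / row label, 0..32
    col = flat_index % 7   # vowel / column label, 0..6
    return row, col
```

For example, the last of the 231 basic characters (flat index 230) maps to row 32, column 6 of the 33 × 7 grid.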
Fig. 4. The proposed Factored CNN architecture: the method has two classifiers which share a common feature space at the lower layers and then fork out at their last layers. FC1 and FC2, at the last layer, represent the fully connected layers of the row and column detectors, having 33 and 7 neurons corresponding to the number of classes for each task, respectively.
The joint loss is formulated as

l_joint = lr + lc    (2)

where lr and lc denote the losses of the row and column detectors, respectively, each computed by Equation (3).
loss = − Σ_{i=1}^{C} ti log(si)    (3)
Fig. 3. Sample character images taken from the database.
3.1. Proposed algorithm
The overall framework of the proposed approach is shown in Figure 4. In this architecture, we employ six 3 × 3 convolutional layers with rectified linear unit (ReLU) activations and two stacked fully connected layers common to both classifiers, which then fork out into one fully connected layer with soft-max activation per classification branch: 33 neurons for the row detector branch and 7 neurons for the column detector branch. The final output of each task is determined by a soft-max function, which gives the probability of each class being true, computed using Equation (1).
f(zj) = e^{zj} / Σ_{k=1}^{K} e^{zk},  for j = 1, ..., K    (1)

where z is the vector of inputs to the output layer and K is the number of outputs.
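As a quick numeric check of Equation (1), the soft-max of a small logit vector can be evaluated directly (the logit values below are arbitrary examples):

```python
import math

z = [2.0, 1.0, 0.1]  # example logits, K = 3
denom = sum(math.exp(zk) for zk in z)
probs = [math.exp(zj) / denom for zj in z]

# The K outputs form a probability distribution over the classes.
assert abs(sum(probs) - 1.0) < 1e-12
```

The largest logit receives the largest probability, which is what lets each branch pick its most likely row or column class.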
In the proposed method, we jointly train the two tasks (row detection and column detection), taking advantage of multi-task learning. In this way, when we train the network, the parameters of the row detector layer do not change no matter how wrong the column detector is, and vice versa, while the parameters of the shared layers change with both tasks, since we optimize a single joint loss, the sum of the two detector losses given in Equation (2).
In Equation (3), C denotes the number of classes for the given task, ti is the ground truth for class i, and si is the score of class i calculated by Equation (1).
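Equations (2) and (3) can be checked numerically. The sketch below computes the cross-entropy of each head against a one-hot target and sums the two losses; the scores and class counts are arbitrary toy values of our own:

```python
import math

def cross_entropy(scores, target_index):
    """Equation (3): -sum_i t_i log(s_i), with a one-hot ground truth
    so only the target class contributes to the sum."""
    return -math.log(scores[target_index])

# Toy soft-max outputs of the two heads (already normalized).
row_scores = [0.7, 0.2, 0.1]  # pretend 3-class row head
col_scores = [0.6, 0.4]       # pretend 2-class column head

lr = cross_entropy(row_scores, 0)  # row ground truth is class 0
lc = cross_entropy(col_scores, 1)  # column ground truth is class 1
l_joint = lr + lc                  # Equation (2)
```

Because the joint loss is a plain sum, gradients from both heads flow back into the shared layers during training, while each head's own weights receive gradients only from its own term.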
The overall character recognition accuracy of the model is determined by the recognition accuracy of both tasks. In this case, a single character is correctly recognized only when the row and column components of the character are both correctly detected by the two classifiers simultaneously. The index of a character in 'Fidel Gebeta' can be computed from the row and column component values of the character as:
Ic = a × 7 + b    (4)
where Ic is the character index, and a and b are the row and column order values of the character in 'Fidel Gebeta'.
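Equation (4) in code (the function name is our own; a and b are zero-based):

```python
def character_index(a, b):
    """Equation (4): flat index of a character in the 33 x 7 'Fidel Gebeta',
    given its row value a (0..32) and column value b (0..6)."""
    return a * 7 + b
```

For example, the character at row 0, column 3 has index 3, and the character at row 32, column 6 has index 230, the last of the 231 basic characters.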
The total number of characters correctly recognized by the proposed method depends on the number of times both the row and column detector networks are simultaneously correct for a character. The recognition performance of the proposed model is then measured using accuracy, formulated as Equation (5).
A = n(Sr ∩ Sc) / t × 100    (5)
where A is the overall accuracy, ∩ is set intersection, Sr and Sc are the sets of samples correctly recognized by the row and column detectors respectively, t is the number of test samples, and n(Sr ∩ Sc) is the total number of samples detected as correct by both the row and column detectors.
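Equation (5) can be expressed as a set computation; the sample IDs below are an arbitrary toy illustration:

```python
def overall_accuracy(correct_row, correct_col, num_test):
    """Equation (5): percentage of test samples that BOTH detectors got right."""
    return len(set(correct_row) & set(correct_col)) / num_test * 100

# Toy example: 10 test samples, identified by index.
S_r = {0, 1, 2, 3, 4, 5, 6, 7}  # samples the row detector got right
S_c = {0, 1, 2, 3, 4, 5, 8, 9}  # samples the column detector got right
A = overall_accuracy(S_r, S_c, 10)
```

Here each detector alone is 80% correct, but only 6 of the 10 samples lie in the intersection, so the overall character accuracy A is 60%. This is why the overall accuracy (94.97%) is below each detector's individual accuracy.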
4. EXPERIMENTAL RESULTS

Experiments were run following the network architecture introduced in Section 3.1. An input sample character is recognized in terms of its row and column components, as shown in Figure 5. Our method is implemented with the Keras Application Programming Interface (API) on a TensorFlow backend, and the model was trained on a GeForce GTX 1070 GPU.

We carried out our experiments on the Amharic database taken from [7] after removing distorted and meaningless images. Then 77,994 character images, containing on average 2364 samples from each row of 'Fidel Gebeta', were used for training and testing. The proposed FCNN architecture contains two multi-class classification problems (the row and column detectors), and the convolutional layers have 32 filters of size 3 × 3. Each block of two convolutional layers is followed by 2 × 2 max-pooling. The two fully connected layers have 512 neurons each, while the last forked layers have sizes corresponding to the number of unique classes in each specific task, in our case 33 for the row detector and 7 for the column/vowel component detector. Considering the size of the dataset, we trained our network using different batch sizes; the best test result was recorded with a batch size of 256 and the Adam optimizer running for 15 epochs.

Based on the results recorded during experimentation, 96.61% of the row components and 95.96% of the column components of the characters were correctly detected by the row and column detectors, respectively. An overall character recognition accuracy of 94.97% was recorded, and training converges faster since we reduced the number of classes from 231 to 40 (33 row-wise and 7 column-wise classes). As can be observed in Table 1, we achieved better performance compared with previous work on Amharic OCR using 231 classes. The other works presented in Table 1 are not intended for direct performance comparison, since they used different datasets; they are included to show the progress of OCR for Amharic script.

Table 1. Performance of prior works and proposed method

Method       | Dataset size | Recognition | Accuracy
------------ | ------------ | ----------- | --------
Dereje [10]* | 5172         | character   | 61%
Million [2]* | 76800        | character   | 90.37%
Belay [7]    | 80,000       | character   | 92.71%
Ours         | 77,994       | character   | 93.45%
Ours         | 77,994       | row-column  | 94.97%

* Denotes methods tested on different datasets. The value 'character' in the third column refers to recognition with 231 classes, while 'row-column' refers to recognition with two smaller sets of sub-classes, obtained by dividing the 231 classes into 33 row-wise and 7 column-wise classes.

Fig. 5. A typical diagram showing the predicted output of the model: for an input character image, the row component detector recognizes zero and the column component detector recognizes three, so the model output is the character located at row zero and column three of 'Fidel Gebeta'.

5. CONCLUSION

In this paper, we have introduced a novel CNN-based method for Amharic character image recognition. The architecture consists of two classifiers: the first is responsible for detecting the row component and the second for detecting the column component of a character in 'Fidel Gebeta'. A character is correctly recognized when both detectors correctly identify its row and column components. We evaluated our Factored Convolutional Neural Network based recognition model on a synthetically generated Amharic character database with 77,994 sample character images. The results show that our proposed method achieves state-of-the-art performance.

Our proposed method reduces the problems of class size complexity and row-wise character similarity by considering the structural arrangement of Amharic characters in 'Fidel Gebeta' and splitting the classes into two smaller sets of sub-classes. As part of future work, the performance of the proposed framework may be enhanced using advanced neural network architectures, and extended to real-life data that includes all characters in the Amharic script by integrating our own segmentation algorithms.
6. REFERENCES
[1] Ronny Meyer, “Amharic as lingua franca in ethiopia,”
Lissan: Journal of African Languages and Linguistics,
vol. 20, no. 1/2, pp. 117–132, 2006.
[2] Million Meshesha and CV Jawahar, “Optical character
recognition of amharic documents,” African Journal of
Information & Communication Technology, vol. 3, no.
2, 2007.
[3] Jinfeng Bai, Zhineng Chen, Bailan Feng, and Bo Xu,
“Image character recognition using deep convolutional
neural network learned from different languages,” in Image Processing (ICIP), 2014 IEEE International Conference on. IEEE, 2014, pp. 2560–2564.
[4] Aiquan Yuan, Gang Bai, Lijing Jiao, and Yajie Liu, “Offline handwritten english character recognition based on
convolutional neural network,” in Document Analysis
Systems (DAS), 2012 10th IAPR International Workshop
on. IEEE, 2012, pp. 125–129.
[5] Thomas M Breuel, Adnan Ul-Hasan, Mayce Ali AlAzawi, and Faisal Shafait, “High-performance ocr
for printed english and fraktur using lstm networks,”
in Document Analysis and Recognition (ICDAR), 2013
12th International Conference on. IEEE, 2013, pp. 683–
687.
[6] Betselot Yewulu Reta, Dhara Rana, and Gayatri Viral
Bhalerao, “Amharic handwritten character recognition
using combined features and support vector machine,”
in 2018 2nd International Conference on Trends in Electronics and Informatics (ICOEI). IEEE, 2018, pp. 265–
270.
[7] Birhanu Belay, Tewodros Habtegebrial, and Didier
Stricker, “Amharic character image recognition,” in
2018 IEEE 18th International Conference on Communication Technology (ICCT). IEEE, 2018, pp. 1179–1182.
[8] Seid Ali Hassen and Yaregal Assabie, “Recognition of
double sided amharic braille documents,” International
Journal of Image, Graphics and Signal Processing, vol.
9, no. 4, pp. 1, 2017.
[9] Worku Alemu, “The application of ocr techniques to the
amharic script,” An MSc thesis at Addis Ababa University Faculty of Informatics, 1997.
[10] Dereje Teferi, “Optical character recognition of typewritten amharic text,” M.S. thesis, School of Information studies for Africa, Addis Ababa, 1999.
[11] Million Meshesha, Recognition and retrieval from document image collections, Ph.D. thesis, IIIT Hyderabad,
India, 2008.
[12] Yaregal Assabie and Josef Bigun, “Hmm-based handwritten amharic word recognition with feature concatenation,” in 2009 10th International Conference on Document Analysis and Recognition. IEEE, 2009, pp. 961–
965.
[13] Adam Coates, Blake Carpenter, Carl Case, Sanjeev
Satheesh, Bipin Suresh, Tao Wang, David J Wu, and Andrew Y Ng, “Text detection and character recognition
in scene images with unsupervised feature learning,” in
Document Analysis and Recognition (ICDAR), 2011 International Conference on. IEEE, 2011, pp. 440–445.
[14] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng, “Reading digits in
natural images with unsupervised feature learning,” in
NIPS workshop on deep learning and unsupervised feature learning, 2011, p. 5.
[15] Hongzhi Li, Joseph G Ellis, Lei Zhang, and Shih-Fu
Chang, “Patternnet: Visual pattern mining with deep
neural network,” in Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval. ACM,
2018, pp. 291–299.
[16] Madhusree Mondal, Parmita Mondal, Nilendu Saha,
and Paramita Chattopadhyay, “Automatic number plate
recognition using cnn based self synthesized feature
learning,” in Calcutta Conference (CALCON), 2017
IEEE. IEEE, 2017, pp. 378–381.
[17] Durjoy Sen Maitra, Ujjwal Bhattacharya, and Swapan K
Parui, “Cnn based common approach to handwritten
character recognition of multiple scripts,” in Document
Analysis and Recognition (ICDAR), 2015 13th International Conference on. IEEE, 2015, pp. 1021–1025.
[18] Mohamed Elleuch, Najiba Tagougui, and Monji Kherallah, “A novel architecture of cnn based on svm classifier
for recognising arabic handwritten script,” International
Journal of Intelligent Systems Technologies and Applications, vol. 15, no. 4, pp. 323–340, 2016.
[19] Zhanpeng Zhang, Ping Luo, Chen Change Loy, and Xiaoou Tang, “Facial landmark detection by deep multitask learning,” in European Conference on Computer
Vision. Springer, 2014, pp. 94–108.
[20] Pavel Kisilev, Eli Sason, Ella Barkan, and Sharbell Hashoul, “Medical image captioning: learning to describe medical image findings using multi-task-loss cnn,” in Deep Learning and Data Labeling for Medical Applications, proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, 2016.
[21] Xiaodong Liu, Jianfeng Gao, Xiaodong He, Li Deng, Kevin Duh, and Ye-Yi Wang, “Representation learning using multi-task deep neural networks for semantic classification and information retrieval,” in Human Language Technologies: The 2015 Annual Conference of the North American Chapter of the ACL, 2015, pp. 912–921.
[22] Min Yang, Wei Zhao, Wei Xu, Yabing Feng, Zhou Zhao,
Xiaojun Chen, and Kai Lei, “Multitask learning for
cross-domain image captioning,” IEEE Transactions on
Multimedia, 2018.
[23] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil, “Multi-task feature learning,” in Advances
in neural information processing systems, 2007, pp. 41–
48.