FACTORED CONVOLUTIONAL NEURAL NETWORK FOR AMHARIC CHARACTER IMAGE RECOGNITION

Birhanu Belay, Tewodros Habtegebrial, Marcus Liwicki, Gebeyehu Belay, Didier Stricker
University of Kaiserslautern, DE; Bahir Dar Institute of Technology, Ethiopia; DFKI-German Research Center for Artificial Intelligence, Kaiserslautern, DE; Lulea University of Technology, Department of Computer Science, Lulea, Sweden

ABSTRACT

In this paper we propose a novel CNN-based approach for Amharic character image recognition. The proposed method leverages the structure of Amharic graphemes: each Amharic character can be decomposed into a consonant and a vowel. As a result of this consonant-vowel combination, Amharic characters lie within a matrix structure called 'Fidel Gebeta', whose rows and columns correspond to a character's consonant and vowel components, respectively. The proposed method is a CNN architecture with two classifiers that detect the row/consonant and column/vowel components of a character. The two classifiers share a common feature space before they fork out at their last layers. The method achieves a state-of-the-art result on a synthetically generated dataset, with an overall character recognition accuracy of 94.97%.

Index Terms— Amharic character image, Factored CNN, Fidel Gebeta, Row-column order, OCR

1. INTRODUCTION

Amharic is the second most widely spoken Semitic language, has its own alphabet, and is an official language of Ethiopia. There are multiple documents written in Amharic script dating back to the 12th century [1]. The Amharic writing system uses about 310 different symbols, of which 231 are basic characters, 50 labialized characters, 20 numerals, and 9 punctuation marks [2]. Amharic script is written and read, like English, from left to right. The most widely used characters in Amharic script are the basic characters, which we consider in this paper; they are arranged in a matrix structure called 'Fidel Gebeta' with a 33 by 7 row-column order, as shown in Figure 1.

Optical Character Recognition (OCR) for multiple Latin and non-Latin scripts has reached maturity [3, 4, 5], whereas African languages with their own indigenous scripts are still an open research area [2]. Currently, some research works on Amharic OCR have been published and presented in [6, 7, 8]. In the literature, researchers noted that the similarity of characters across rows and the large number of classes affect the performance of Amharic OCR and make the recognizer complex. However, there is a way to overcome the class-size complexity and the character similarity across rows of 'Fidel Gebeta': based on the structure of 'Fidel Gebeta', the large number of classes can be divided into two smaller sets of sub-classes (row-wise and column-wise classes).

978-1-5386-6249-6/19/$31.00 ©2019 IEEE — ICIP 2019

Fig. 1. Amharic characters ('Fidel Gebeta') in row-column-wise (33 × 7) order, read from left to right: the first column gives the consonant characters, while the next six columns are derived by combining the consonant with vowel sounds.

To the best of our knowledge, no attempt has been made to recognize a character based on its row-column order in 'Fidel Gebeta', which motivated this work. Therefore, in this paper, we propose a method called Factored Convolutional Neural Network (FCNN) for Amharic OCR. FCNN has two classifiers that share a common feature space before they fork out at their task-specific layers.

The rest of the paper is organized as follows. Section 2 discusses related work. In Section 3 the proposed system architecture and details of the datasets are presented. Section 4 presents experimental results, and conclusions are given in Section 5.

2. RELATED WORK

Recognition of character images has been studied and solved for multiple scripts.
Researchers have done a great deal of work to address the recognition problems of most Latin and non-Latin scripts, and most of those scripts now have commercial off-the-shelf OCR applications. However, there are still untouched scripts in the area of OCR [7]. Until recently, OCR for Amharic script remained relatively unexplored. The first work on Amharic OCR was done by Worku in 1997, who adopted a tree classification scheme built using topological features of a character [9]. It was only able to recognize Amharic characters written in Washera font, 12 point type. Other attempts followed, ranging from typewritten [10] and machine-printed [2] text, Amharic braille document image recognition [8], Amharic document image recognition and retrieval [11], and numeral recognition [6] to handwritten text [12]. However, they applied statistical machine learning techniques on a limited number of private datasets.

Several works based on CNNs have been carried out in many domains, such as character recognition [13], digit reading in natural images [14], visual pattern mining [15], and car plate recognition [16]. Maitra et al. [17] proposed a CNN as a trainable feature extractor with an SVM classifier for numeric character recognition in multiple scripts. In a similar study, Elleuch et al. [18] employed a CNN as a feature extractor and an SVM as a recognizer to classify Arabic handwritten script using the HACDB database. Zhang et al. [19] proposed a convolutional neural network to find facial landmarks and then recognize face attributes. Kisilev et al. [20] presented a multi-task CNN approach for detection and semantic description of lesions in diagnostic images. Liu et al. [21] proposed a neural network for query classification and information retrieval. Yang et al. [22] introduced a multi-task learning algorithm for cross-domain image captioning.
The common thread in all these works is to share parameters in the bottom layers of the deep neural network while keeping task-specific parameters in the top layers. As explained by Argyriou et al. [23], learning multiple related tasks simultaneously has been shown to often significantly improve performance relative to learning each task independently. A recently published work [7] employs a CNN architecture for Amharic character image recognition; it used about 80,000 synthetically generated Amharic character images, considered 231 classes, and achieved a 92.71% character recognition rate.

All of the methods mentioned above consider each character as a single class [2, 10, 7] and do not consider the row and column location of the character in 'Fidel Gebeta'. The problems of these approaches are treating structurally similar characters as different classes and using a large number of classes, which makes the network complex. It is also difficult to obtain good performance when each structurally similar character is treated as its own class without explicitly using the information, shared across 'Fidel Gebeta', between characters. Therefore, in this paper, we propose a CNN-based approach, called FCNN, with two classifiers that have shared layers at the lower stage and task-specific layers at the last stage. Both classifiers are trained jointly and detect the row and column components of a character. In addition, the proposed method reduces the class-size complexity from 231 to 40 classes. Sample Amharic characters from 'Fidel Gebeta' and the shape differences across their corresponding vowels in row-wise order are depicted in Figure 2.

Fig. 2. Shape change in the row-column structure of 'Fidel Gebeta'. We call the first entry of each row the base character. Each row has its own base character and, as a result, a unique shape. Each column has a unique "pattern" of modification it applies to the base characters.
For instance, the third column, marked by a green box, adds a dash at the right bottom; the fourth column, marked by a blue box, makes the left leg of the character shorter; the fifth column, marked by a violet box, adds a circle to the right leg; and so on.

3. MATERIAL AND METHODS

The shortage of datasets is the main challenge of pattern recognition in general and of Amharic OCR in particular. There are few databases used in the various works on Amharic OCR reported in the literature. As reported in [10], the authors considered the most frequently used Amharic characters with a small number of samples: about 5172 core Amharic characters with 16 sample images per class. Later work by Million et al. [11] uses 76,800 character images of different font types and sizes with 231 classes. Other works on Amharic OCR [9, 12, 6] report using their own private databases, but none of them are publicly available for research purposes.

In this paper, we use the Amharic character database of [7], which contains 80,000 sample Amharic character images; a few sample character images taken from the database are shown in Figure 3. This database contains many deformed and meaningless character images; therefore, for our experiments, we removed about 2006 distorted character images from it. We also labelled each character image in row-column order following the arrangement of 'Fidel Gebeta'. In addition, we resized the images to 32 × 32 pixels.

Fig. 3. Sample character images taken from the database.

3.1. Proposed algorithms

The overall framework of the proposed approach is shown in Figure 4. In this architecture, we employ six 3 × 3 convolutional layers with rectified linear unit (ReLU) activations and two stacked fully connected layers shared by both classifiers, which then fork out into one fully connected layer with a soft-max activation function per classification branch: 33 neurons for the row-detector branch and 7 neurons for the column-detector branch. The final output of each task is determined by a soft-max function, which gives the probability of each class being true, computed using Equation (1):

f(z_j) = e^{z_j} / Σ_{k=1}^{K} e^{z_k},   for j = 1, ..., K    (1)

where z is the vector of inputs to the output layer and K is the number of outputs.

Fig. 4. The proposed Factored CNN architecture: the two classifiers share a common feature space in the lower layers and then fork out at their last layers. FC1 and FC2, at the last layer, represent the fully connected layers of the row and column detectors, having 33 and 7 neurons corresponding to the number of classes of each task, respectively.

In the proposed method, we jointly train the two tasks (row detection and column detection), taking advantage of multi-task learning. When we train the network, the parameters of the row-detector branch do not change with the errors of the column detector, and vice versa, but the parameters of the shared layers change with both tasks, since we optimize a single joint loss, formulated as

l_joint = l_r + l_c    (2)

where l_r and l_c denote the losses of the row and column detectors, each computed as the cross-entropy loss of Equation (3):

loss = − Σ_{i=1}^{C} t_i log(s_i)    (3)

where C is the number of classes, t_i is the ground truth for each class, and s_i is the score of each class calculated by Equation (1).

The overall character recognition accuracy of the model is determined by the recognition accuracy of both tasks. A single character is correctly recognized only when the row and column components of the character are correctly detected by both classifiers simultaneously. The index of a character in 'Fidel Gebeta' can be computed from the row and column component values of the character as:

I_c = a × 7 + b    (4)

where I_c is the character index, and a and b are the row and column order values of the character in 'Fidel Gebeta'.
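The factored architecture of Section 3.1 can be sketched in Keras, the framework used in this work. This is a minimal sketch assuming the details reported in the paper (six 3 × 3 convolutional layers with 32 filters, a 2 × 2 max-pooling after every two convolutional layers, two shared 512-neuron fully connected layers, and soft-max heads of 33 and 7 neurons); the layer and head names ("row", "col") are our own illustrative choices, not from the paper.

```python
# Minimal sketch of the two-headed FCNN; hyperparameters follow the
# paper, names are illustrative.
from tensorflow.keras import layers, Model

inp = layers.Input(shape=(32, 32, 1))          # 32 x 32 grayscale character
x = inp
for _ in range(3):                             # three blocks of two conv layers
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(2)(x)              # 2 x 2 max-pooling per block
x = layers.Flatten()(x)
x = layers.Dense(512, activation="relu")(x)    # shared fully connected layers
x = layers.Dense(512, activation="relu")(x)

row_out = layers.Dense(33, activation="softmax", name="row")(x)  # row detector
col_out = layers.Dense(7, activation="softmax", name="col")(x)   # column detector

model = Model(inp, [row_out, col_out])
# Keras sums the per-head losses, realizing the joint loss l_joint = l_r + l_c.
model.compile(optimizer="adam",
              loss={"row": "categorical_crossentropy",
                    "col": "categorical_crossentropy"})
```

Training with one-hot row and column labels (batch size 256, 15 epochs, as reported later in Section 4) would then be a plain `model.fit` call on the two label arrays.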
The total number of characters correctly recognized by the proposed method depends on the number of samples for which both the row and column detector networks are simultaneously correct. The recognition performance of the proposed model is then measured using the accuracy formulated in Equation (5):

A = (n(S_r ∩ S_c) / t) × 100    (5)

where A is the overall accuracy, ∩ denotes set intersection, S_r and S_c are the sets of samples correctly recognized by the row and column detectors respectively, t is the number of test samples, and n(S_r ∩ S_c) is the total number of samples detected correctly by both the row and column detectors.

4. EXPERIMENTAL RESULTS

Experiments were run following the network architecture introduced in Section 3.1. An input sample character is recognized in terms of its row and column components, as shown in Figure 5. Our method is implemented with the Keras Application Program Interface (API) on a TensorFlow backend, and the model was trained on a GeForce GTX 1070 GPU.

Fig. 5. A typical diagram showing the predicted output of the model (i.e., for an input character image, if the row component detector recognizes zero and the column component detector recognizes three, then the model output becomes the character located at row zero and column three of 'Fidel Gebeta').

We carried out our experiments on the Amharic database taken from [7], after removing distorted and meaningless images. Then 77,994 character images, containing on average 2364 samples from each row of 'Fidel Gebeta', were used for training and testing. The proposed FCNN architecture contains two multi-class classification problems (the row and column detectors), and the convolutional layers have 32 filters of size 3 × 3. Each block of two convolutional layers is followed by a 2 × 2 max-pooling layer. The two fully connected layers have 512 neurons each, while the last forked layers have sizes corresponding to the number of unique classes in each specific task: 33 for the row detector and 7 for the column/vowel component detector. Considering the size of the dataset, we trained our network with different batch sizes; the best test result was recorded with a batch size of 256 and the Adam optimizer running for 15 epochs.

Table 1. Performance of prior works and the proposed method

Method        | Dataset size | Recognition | Accuracy
Dereje [10]*  | 5172         | character   | 61%
Million [2]*  | 76,800       | character   | 90.37%
Belay [7]     | 80,000       | character   | 92.71%
Ours          | 77,994       | character   | 93.45%
Ours          | 77,994       | row-column  | 94.97%

* Denotes methods tested on different datasets. 'Character' in the Recognition column refers to recognition with 231 classes, while 'row-column' refers to recognition with two smaller sets of sub-classes, obtained by dividing the 231 classes into 33 row-wise and 7 column-wise classes.

5. CONCLUSION

In this paper, we have introduced a novel CNN-based method for Amharic character image recognition. The architecture consists of two classifiers: the first is responsible for detecting the row component and the second for detecting the column component of a character in 'Fidel Gebeta'. A character is correctly recognized only when both detectors correctly identify its row and column components. We evaluated our Factored Convolutional Neural Network recognition model on a synthetically generated Amharic character database of 77,994 sample character images. The results show that the proposed method achieves state-of-the-art performance. Our method reduces the problems of class-size complexity and row-wise character similarity by exploiting the structural arrangement of Amharic characters in 'Fidel Gebeta' and splitting the classes into two smaller sets of sub-classes. As part of future work, the performance of the proposed framework may be enhanced using advanced neural network architectures, and the method may be extended to real-life data covering all characters of the Amharic script by integrating our own segmentation algorithms.
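As a concrete toy illustration of the row-column evaluation, the character-index formula of Equation (4) and the overall accuracy of Equation (5) can be sketched as follows; the predictions below are made up for illustration and are not data from the experiments.

```python
# Equation (4) maps a (row, column) pair to the character index in
# 'Fidel Gebeta'; Equation (5) counts a sample as correct only when
# BOTH detectors are right. Toy ground truth and predictions below.
import numpy as np

true_rows = np.array([0, 4, 12, 32])
true_cols = np.array([3, 0, 6, 1])
pred_rows = np.array([0, 4, 11, 32])   # third sample: row wrong
pred_cols = np.array([3, 0, 6, 2])     # fourth sample: column wrong

row_ok = pred_rows == true_rows        # membership in S_r
col_ok = pred_cols == true_cols        # membership in S_c
accuracy = np.mean(row_ok & col_ok) * 100   # A = n(S_r ∩ S_c) / t * 100

char_index = true_rows * 7 + true_cols      # Equation (4): I_c = a * 7 + b
print(accuracy)     # 50.0 (2 of 4 samples correct on both heads)
print(char_index)   # [  3  28  90 225]
```

Note that Equation (4) is invertible: given a character index I_c, the labels are recovered as a = I_c // 7 and b = I_c % 7, which is how a 231-class labeling can be converted into the 33 + 7 row-column labeling.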
Based on the results recorded during experimentation, 96.61% of the row components and 95.96% of the column components of the characters are correctly detected by the row and column detectors, respectively. An overall character recognition accuracy of 94.97% was recorded, and training converges faster since we reduced the number of classes from 231 to 40 (33 row-wise and 7 column-wise classes). As can be observed in Table 1, we achieved better performance compared with previous works on Amharic OCR using 231 classes. The other works presented in Table 1 are not directly comparable, since they used different datasets; they are included to show the progress of OCR for Amharic script.

6. REFERENCES

[1] Ronny Meyer, "Amharic as lingua franca in Ethiopia," Lissan: Journal of African Languages and Linguistics, vol. 20, no. 1/2, pp. 117–132, 2006.

[2] Million Meshesha and C.V. Jawahar, "Optical character recognition of Amharic documents," African Journal of Information & Communication Technology, vol. 3, no. 2, 2007.

[3] Jinfeng Bai, Zhineng Chen, Bailan Feng, and Bo Xu, "Image character recognition using deep convolutional neural network learned from different languages," in Image Processing (ICIP), 2014 IEEE International Conference on. IEEE, 2014, pp. 2560–2564.

[4] Aiquan Yuan, Gang Bai, Lijing Jiao, and Yajie Liu, "Offline handwritten English character recognition based on convolutional neural network," in Document Analysis Systems (DAS), 2012 10th IAPR International Workshop on. IEEE, 2012, pp. 125–129.

[5] Thomas M. Breuel, Adnan Ul-Hasan, Mayce Ali Al-Azawi, and Faisal Shafait, "High-performance OCR for printed English and Fraktur using LSTM networks," in Document Analysis and Recognition (ICDAR), 2013 12th International Conference on. IEEE, 2013, pp. 683–687.
[6] Betselot Yewulu Reta, Dhara Rana, and Gayatri Viral Bhalerao, "Amharic handwritten character recognition using combined features and support vector machine," in 2018 2nd International Conference on Trends in Electronics and Informatics (ICOEI). IEEE, 2018, pp. 265–270.

[7] Birhanu Belay, Tewodros Habtegebrial, and Didier Stricker, "Amharic character image recognition," in 2018 IEEE 18th International Conference on Communication Technology (ICCT). IEEE, 2018, pp. 1179–1182.

[8] Seid Ali Hassen and Yaregal Assabie, "Recognition of double sided Amharic braille documents," International Journal of Image, Graphics and Signal Processing, vol. 9, no. 4, p. 1, 2017.

[9] Worku Alemu, "The application of OCR techniques to the Amharic script," M.Sc. thesis, Addis Ababa University, Faculty of Informatics, 1997.

[10] Dereje Teferi, "Optical character recognition of typewritten Amharic text," M.S. thesis, School of Information Studies for Africa, Addis Ababa, 1999.

[11] Million Meshesha, Recognition and Retrieval from Document Image Collections, Ph.D. thesis, IIIT Hyderabad, India, 2008.

[12] Yaregal Assabie and Josef Bigun, "HMM-based handwritten Amharic word recognition with feature concatenation," in 2009 10th International Conference on Document Analysis and Recognition. IEEE, 2009, pp. 961–965.

[13] Adam Coates, Blake Carpenter, Carl Case, Sanjeev Satheesh, Bipin Suresh, Tao Wang, David J. Wu, and Andrew Y. Ng, "Text detection and character recognition in scene images with unsupervised feature learning," in Document Analysis and Recognition (ICDAR), 2011 International Conference on. IEEE, 2011, pp. 440–445.

[14] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng, "Reading digits in natural images with unsupervised feature learning," in NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011, p. 5.
[15] Hongzhi Li, Joseph G. Ellis, Lei Zhang, and Shih-Fu Chang, "PatternNet: Visual pattern mining with deep neural network," in Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval. ACM, 2018, pp. 291–299.

[16] Madhusree Mondal, Parmita Mondal, Nilendu Saha, and Paramita Chattopadhyay, "Automatic number plate recognition using CNN based self synthesized feature learning," in Calcutta Conference (CALCON), 2017 IEEE. IEEE, 2017, pp. 378–381.

[17] Durjoy Sen Maitra, Ujjwal Bhattacharya, and Swapan K. Parui, "CNN based common approach to handwritten character recognition of multiple scripts," in Document Analysis and Recognition (ICDAR), 2015 13th International Conference on. IEEE, 2015, pp. 1021–1025.

[18] Mohamed Elleuch, Najiba Tagougui, and Monji Kherallah, "A novel architecture of CNN based on SVM classifier for recognising Arabic handwritten script," International Journal of Intelligent Systems Technologies and Applications, vol. 15, no. 4, pp. 323–340, 2016.

[19] Zhanpeng Zhang, Ping Luo, Chen Change Loy, and Xiaoou Tang, "Facial landmark detection by deep multi-task learning," in European Conference on Computer Vision. Springer, 2014, pp. 94–108.

[20] Pavel Kisilev, Eli Sason, Ella Barkan, and Sharbell Hashoul, "Medical image captioning: learning to describe medical image findings using multi-task-loss CNN," in Deep Learning and Data Labeling for Medical Applications, proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, 2016.

[21] Xiaodong Liu, Jianfeng Gao, Xiaodong He, Li Deng, Kevin Duh, and Ye-Yi Wang, "Representation learning using multi-task deep neural networks for semantic classification and information retrieval," in Human Language Technologies: The 2015 Annual Conference of the North American Chapter of the ACL, 2015, pp. 912–921.
[22] Min Yang, Wei Zhao, Wei Xu, Yabing Feng, Zhou Zhao, Xiaojun Chen, and Kai Lei, "Multitask learning for cross-domain image captioning," IEEE Transactions on Multimedia, 2018.

[23] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil, "Multi-task feature learning," in Advances in Neural Information Processing Systems, 2007, pp. 41–48.