International Journal of Applied Engineering Research ISSN 0973-4562 Volume 11, Number 7 (2016) pp 4849-4856 © Research India Publications. http://www.ripublication.com

Detection and Removal of Graphical Components in Pre-Printed Documents

N. Shobha Rani, Department of Computer Science, Amrita Vishwa Vidyapeetham, Amrita University, Mysuru campus, Karnataka, India.
Vineeth P., Department of Computer Science, Amrita Vishwa Vidyapeetham, Amrita University, Mysuru campus, Karnataka, India.
Deeptha Ajith, Department of Computer Science, Amrita Vishwa Vidyapeetham, Amrita University, Mysuru campus, Karnataka, India.

Pre-printed documents are produced by government or private organizations according to the variety of job requirements relevant to their tasks. These documents have a pre-structured layout indicating the various fields for data entry. They also contain information such as the company name, purpose, captions, logos and symbolic entities indicating the details of the organization, department, etc. In most documents these graphical elements end up overlaid with text during data entry. Detecting and removing all these graphical elements allows error-free subsequent processing, that is, segmentation, feature extraction and classification, and finally leads to accurate character recognition by OCR. Text consists of very fine gradient information that is sensitive to noisy content in the image; when the textual portions are bounded by or overlaid with graphical entities such as horizontal or vertical lines, logos, symbols or photos, accurate resolution of the textual components becomes harder. Accurate resolution of textual components is mostly achieved on purely textual images. It is therefore important to have the image free from all the graphical entities mentioned above. The present work focuses on pre-processing of pre-printed document images.
The pre-printed documents used in the proposed work come from the Anantapur district of Andhra Pradesh state. Figure 1 depicts one such pre-printed document. It consists of printed components, handwritten components and other graphical components such as horizontal or vertical edges, symbols and logos. The graphical entities that we propose to work on are depicted in Figure 2. These entities may obstruct the process of text recognition, so they must be separated from the textual portions; detecting and removing them is therefore crucial. Numerous studies in the literature address pre-processing of document images. Moreover, pre-printed documents differ from one type of organization to another. Some of the studies reviewed in the literature are discussed below.

Abstract
Pre-processing of document images is one of the most intensive operations for pre-printed document images. The recognition of text in pre-printed documents is highly sensitive to the graphical components coexisting with it. In this paper we address the problem of detecting and removing graphical components such as logos, emblems and other symbolic entities, which leads to error-free document processing in the subsequent stages of Optical Character Recognition. The detection of graphical entities is performed using Zernike moments and histogram of gradient features. Line detection and removal is then accomplished by masking the image with a vertical line structuring element and computing the region covered by the convex hull within the area of the structuring element in the image. The detection of the line structuring element also addresses the problem of characters overlapping with lines, so that the character is retained when lines are eroded from the image.
The experimental outcomes are appreciable, with an accuracy of around 97% for emblem detection and 92% for line detection and removal.

Keywords: emblem detection, graphical components, pre-printed documents, line detection, moments and HOG features.

Introduction
Enhancement of the document image prior to Region of Interest (ROI) processing is characteristic of efficient optical recognition systems. Document images are of varied categories, ranging from simple text to documents with fully complex gradient details. Simple text documents are composed of either printed or handwritten text, whereas a few documents are composed of handwritten as well as printed text. There are also hybrid documents which consist of both graphical and textual components; documents of this type are termed pre-printed documents [1]. The pre-printed documents are printed by government or private organizations.

In the binarization study of Bilal et al. [6], the visual experiment provides a clear idea of the image, and the analytical test provides a statistical measurement based on a benchmark dataset and an evaluation measure; the method gave the best performance. Subhadip et al. [7] developed a novel framework using the Hough transform for recognition of postal codes in Latin, Devanagari, Bangla and Urdu script from multi-script postal address blocks. This work achieved around 98% postal-code localization accuracy. Manjunath et al. [8] described a study of robust text detection in color and regular images. The first stage used a combination of the wavelet transform and a Gabor filter to extract sharpened edges and textural features of the image. In the second stage wavelet entropy is imposed to get the further experimental values. They achieved 97.9% accuracy. Battista et al.
[9] presented a comprehensive survey and categorization of the computer vision and pattern recognition techniques proposed so far against image spam, together with an experimental analysis and comparison of some of them on real, publicly available data sets. Alvaro et al. [10] contributed a robust method to localize and recognize text in natural images using a CC-based approach that extracts and discards basic letter candidates using a series of easy and fast-to-compute features. Rohan et al. [11] presented a completely automated way to detect brain tumors; a bounding box method using symmetry is used to detect the location of the tumor, and they achieved good accuracy. Aswini et al. [12] proposed a system that implements SURF to extract local features from logos and to match the features, using a simple and compact SURF algorithm. Mrunalinee et al. [13] developed an improved approach for logo detection and recognition, using SIFT and CDS to extract features and match the logo image. Amrapali et al. [14] extended the CDS method to implement a scalable and highly effective method for logo detection. Firoj et al. [15] used bounding boxes obtained by morphological dilation for the segmentation of Arabic words. They tested the methods on documents in Arabic script and obtained encouraging results. Victor et al. [16] discussed how the bounding box can be used to impose a powerful topological prior, which prevents the solution from excessive shrinking and ensures that the user-provided box bounds the segmentation in a sufficiently tight way. Thawar et al. [17] evaluated three kinds of moments (geometrical, Zernike and Legendre) for classifying 3D object images using a nearest neighbor classifier. Subhajit et al.
[18] proposed an efficient algorithm for recognizing palm prints for biometric identification of individuals, using complex Zernike moments constructed from a set of complex polynomials. Jyotsnarani et al. [19] presented a reconstruction of the basic characters in Oriya text, which can handle different font sizes and font types, using Hu’s seven moments and Zernike moments. Dipti et al. [20] discussed the form image registration, image masking and image improvement techniques implemented in their system as part of the character image extraction process. To the best of our knowledge, the works reported in the literature focus on graphical component detection specific to the type of documents considered in each study, and none of them addresses the problem of emblem detection for the e-seva documents belonging to the regions of Andhra Pradesh state.

Figure 1: The pre-printed document

Figure 2: Types of graphical entities

Gatos et al. [2] proposed an algorithm for automatic table detection in documents using line length and line width estimation with edge detection operators. Yefeng et al. [3] contributed an algorithm to detect severely broken parallel lines in handwritten document images based on the directional single connected chain method, using three parameters: skew angle, vertical line gap and vertical translation. The experiments produced results of around 94% for Arabic documents. Shobha et al. [4] proposed a generic line elimination methodology for removal of horizontal grid-like structures using a circular structuring element for application form images and achieved an accuracy of more than 90%. Ping et al. [5] proposed a novel face detection system using hybrid feature extraction, in which three sets of face features are extracted. This system achieved an accuracy of 95%. Bilal et al.
[6] proposed an adaptive local binarization method for document images, evaluated with two types of experiments: a visual experiment and an analytical test.

Thus, we propose to work on the detection of emblems and lines inherent in pre-printed e-seva documents of Andhra Pradesh state.

Let $D_i$ represent a pre-processed image from which the various objects are to be captured. The objects in the document $D_i$ are captured using the bounding box construct, which encloses a set of fully connected pixels within a rectangle. A set of fully connected pixels represents an object that can be either a graphical component such as an emblem or logo, or a character image.

Proposed Methodology
The proposed methodology for detection and removal of graphical components in pre-printed documents comprises two stages. Stage one processes graphical components such as emblems and logos. The removal of horizontal and vertical lines overlaid with the text is accomplished in stage two. The block diagram of the proposed methodology is depicted in Figure 3.

Figure 3: Block diagram of graphical entity detection system

Subsections A and B describe the methodologies for logo and emblem detection and for detection of horizontal and vertical lines.

Figure 4: Flow chart of proposed algorithm

Detection of graphical entities - logos and emblems
Stage 1 addresses the detection and removal of graphical entities from application form images. Here we mainly focus on graphical entities such as emblems and logos in the pre-printed documents. The detection and removal of emblems from application form images reduces the computational conflicts during segmentation and classification of characters and leads to error-free recognition by OCR.
The algorithm for emblem detection and removal first prompts the user for a pre-printed document image as input. The acquired input is pre-processed to obtain a transformed and enhanced binary image. The binary image is then processed to connect all the broken gradient details using the morphological bridge operation [21]. Emblems are detected by tracking all the objects in the image with bounding boxes and filtering them further to identify only the required graphical entities. Finally, the histogram of gradient and Zernike moment features are computed to detect the bounding boxes containing emblems or logos. Figure 4 depicts the flow chart of the algorithm for emblem or logo detection.

Let $Obj_1, Obj_2, Obj_3, \ldots, Obj_n$ be the objects captured by applying bounding boxes to the pre-processed image $D_i$. Each $Obj_i$ is interpreted to identify whether $Obj_i \in Class(C_h)$ or $Obj_i \in Class(G_c)$, where $i = 1, 2, 3, \ldots, n$, and $Class(C_h)$ and $Class(G_c)$ represent the sets of objects with textual components and graphical components respectively. Figures 5 and 6 present the pre-processed image and the objects captured within the image using bounding boxes. Once the objects are captured, each object $Obj_i$ is inspected to check whether it is a maximum area bounding box. Maximum area bounding boxes exist for objects with graphical entities such as logos, photos and emblems; these are termed max area objects $M_{obj}$, as shown in figure 7.

Figure 7: Maximum area objects

The max area objects $M_{obj}$ are filtered from the other objects, i.e., objects with textual components.
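The object-capture step can be illustrated with a minimal sketch. It assumes a binary image stored as a list of rows with 1 for foreground pixels, and it assumes 8-connectivity for "fully connected" pixels; the paper does not state the connectivity, and the function names `bounding_boxes` and `box_area` are hypothetical. The morphological bridge operation is not reproduced here.

```python
from collections import deque

def bounding_boxes(image):
    """Return one (top, left, bottom, right) box per connected foreground object.

    `image` is a list of rows of 0/1 values; 1 marks a foreground pixel.
    8-connectivity is an assumption, not something the paper specifies.
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search over one connected object,
            # growing its bounding box as pixels are visited.
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes

def box_area(box):
    """Area in pixels of an inclusive (top, left, bottom, right) box."""
    top, left, bottom, right = box
    return (bottom - top + 1) * (right - left + 1)
```

Ranking the returned boxes with `max(boxes, key=box_area)` gives a candidate max area object in the sense used above.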
If $H$ is the filter applied on each object to filter the max area objects, then the filtering of max area objects is given by equation (1).

$M_{obj}(i) = H(Obj_i)$ (1)

The filter $H$ is a transformation that detects whether an object is a max area object or not. It is associated with a criterion given by equation (2).

(2)

The filter returns the top two maximum length bounding boxes, which are usually called nested objects. The outcome of the filtering transformation is shown in figure 8. Once the nested objects are detected, each nested object $N_{obj}$ undergoes the concatenation transformation, which converts a nested object into a simple object. The concatenation transformation $CT$ combines all the smaller bounding boxes into a bounding box of maximum length and width. It is given by equation (3).

$Obj_i = CT(N_{obj})$ (3)

Figure 5: Pre-processed image

Figure 6: Image with objects detected using bounding boxes

Figure 8: The result of filtering transformation

Figure 9 represents the simple object $Obj_i$ obtained after applying the concatenation transformation $CT$.

Figure 9: Result of concatenation transformation CT

The graphical representation of the “phi” values computed for the emblem and the other graphical components is shown in figure 10.

The transformed nested objects are forwarded for classification into graphical entities that include logos, photos and emblems. The classification in the proposed methodology is accomplished through the histogram of gradient (HOG) [22] and Zernike moment features, by thresholding the extracted feature values. Once the biggest bounding box is obtained after the concatenation transformation, the algorithm makes sure that it has detected the logo only.
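The filtering and concatenation steps of equations (1) and (3) can be sketched as follows. This is only an illustration under stated assumptions: boxes are inclusive `(top, left, bottom, right)` tuples, the filter ranks boxes by area (the exact criterion of equation (2) is not available here), and the concatenation is taken to be the smallest box enclosing all of its inputs. The names `filter_max_area` and `concatenate` are hypothetical stand-ins for $H$ and $CT$.

```python
def box_area(box):
    """Area in pixels of an inclusive (top, left, bottom, right) box."""
    top, left, bottom, right = box
    return (bottom - top + 1) * (right - left + 1)

def filter_max_area(boxes, keep=2):
    """Stand-in for the filter H: keep the `keep` largest boxes.

    Ranking by area is an assumption; the paper's criterion (2) may differ.
    """
    return sorted(boxes, key=box_area, reverse=True)[:keep]

def concatenate(boxes):
    """Stand-in for CT: merge boxes into the single box that encloses them all."""
    tops, lefts, bottoms, rights = zip(*boxes)
    return (min(tops), min(lefts), max(bottoms), max(rights))

# Hypothetical boxes: one large emblem-like box, a box nested inside it,
# a small text-like box, and a box that overlaps the large one.
boxes = [(10, 10, 50, 60), (20, 15, 40, 30), (0, 0, 5, 5), (12, 55, 30, 70)]
nested = filter_max_area(boxes)
simple = concatenate(nested)
```

Taking the enclosing box keeps the result a plain rectangle, which matches the description of a nested object being converted into a simple object.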
This detection stage is carried out with the help of moment and HOG values. From the experimental results (refer to Table 1) for the fourth order Zernike moment, the degree of rotation is negative (i.e., anti-clockwise) for emblems. Similarly, the HOG descriptor is used for identifying the “e-seva” emblem. The proposed algorithm finds the range of HOG values for the specific type of emblem from a set of 30 emblems. This range is then used for further classification.

Figure 10: Features of Zernike moments in degrees

Here a negative value indicates that the graphical component is an emblem, whereas a positive value indicates other graphical components.

HOG Descriptors
The main objective of histograms of oriented gradients (HOG) is object detection. The basic idea is that local shape information is often well described by the distribution of intensity gradients or edge directions, even without precise information about the location of the edges themselves. HOG features differ greatly between a bounding box with an emblem and a bounding box with simple text, so we employ HOG features in our work. The HOG features computed for various types of emblems show a great dissimilarity between graphical and non-graphical components. Table 2 shows an overview of the HOG descriptor features of the various graphical components detected.

Zernike Moments
In general, moments are numeric quantities measured at some distance from a reference point or axis. The main advantages of Zernike moments are better accuracy and simple rotation invariance. Zernike moments are used here to find the “phi” value (degree of rotation) of the emblem detected in the application form images. Table 1 shows a few observed phi values.

Table 2: HOG descriptor values

Table 1: Phase angle of the moment in degrees
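The thresholding rule described in this section can be written down as a small sketch. It assumes that the Zernike phase angle and a scalar HOG summary have already been computed for a candidate box by some feature extractor, and that both conditions must hold for a box to count as the target emblem; the paper does not say how the two tests are combined. The function names and all sample values below are hypothetical.

```python
def hog_range_from_samples(sample_values):
    """Range (min, max) of a scalar HOG summary over known emblem samples.

    The paper derives such a range from a set of 30 emblems; the values
    passed in here are placeholders, not data from the paper.
    """
    return (min(sample_values), max(sample_values))

def is_emblem(phi_degrees, hog_value, hog_range):
    """Classify a candidate box by thresholding its two features.

    Assumption: an emblem needs a negative Zernike phase angle AND a HOG
    value inside the learned range.
    """
    low, high = hog_range
    return phi_degrees < 0 and low <= hog_value <= high

# Hypothetical usage with made-up feature values.
emblem_range = hog_range_from_samples([0.21, 0.25, 0.23])
candidate_is_emblem = is_emblem(-35.0, 0.22, emblem_range)
```

Keeping the learned range as an explicit argument makes it easy to swap in a different range per emblem type.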
Figure 13: Detection and removal of horizontal and vertical lines

To remove a vertical line present in the application form images, the mask moves through the identified row with the origin of the mask as the target pixel. The 2 x 11 mask determines the presence of black pixels, and if a continuous run of black pixels longer than 20 percent of the row length is encountered to its right, the target pixel is replaced with background. The same method is repeated with the 11 x 2 mask for the removal of horizontal lines.

Figure 11: HOG descriptor features

Figure 11 shows the graphical representation of the HOG descriptor. After emblems are detected in the application form images, their pixels are converted into background pixels, which removes them from the image. The result of the proposed algorithm on application form images is shown in figure 12.

Figure 12: Application form image before and after applying the algorithm

Experimental Analysis
The experimental analysis of the proposed system is conducted on a dataset of around 80 pre-printed documents, collected from the e-seva centers of the Andhra Pradesh region. The accuracy of the proposed system is defined individually for stage 1 and stage 2. The accuracy of emblem detection is the ratio of the number of emblems correctly detected, $D_c$, to the total number of graphical components originally detected, $D$, as given by equation (4).

$\text{Accuracy} = \frac{D_c}{D}$ (4)

The accuracy in stage 2 is the ratio of the number of lines detected correctly, $L_c$, to the total number of lines present in the image, $L$, as given in equation (5).

$\text{Accuracy} = \frac{L_c}{L}$ (5)

Detection of horizontal and vertical lines
In this second stage, the proposed algorithm focuses on the detection and removal of horizontal and vertical lines from the pre-printed application form images. The application form images first undergo pre-processing operations such as binarization and noise reduction.
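The removal rule stated for Figure 13, clearing black pixels that belong to a continuous run longer than 20 percent of the row length, can be sketched on a binarized image as follows. This is a simplified run-length version rather than the mask traversal itself: it assumes a list-of-rows image with 1 for black pixels, handles vertical lines by transposing the image, and does not try to preserve characters that overlap a line. The function names are hypothetical.

```python
def remove_long_runs(image, min_fraction=0.2):
    """Clear horizontal runs of black (1) pixels longer than
    `min_fraction` of the row length. Returns a new image."""
    cleaned = [row[:] for row in image]
    for row in cleaned:
        limit = min_fraction * len(row)
        start = None  # index where the current black run began
        # Iterate one step past the end so a run touching the border is closed.
        for x in range(len(row) + 1):
            black = x < len(row) and row[x] == 1
            if black and start is None:
                start = x
            elif not black and start is not None:
                if x - start > limit:
                    for k in range(start, x):
                        row[k] = 0  # replace the run with background
                start = None
    return cleaned

def transpose(image):
    """Swap rows and columns of a list-of-rows image."""
    return [list(col) for col in zip(*image)]

def remove_vertical_runs(image, min_fraction=0.2):
    """Apply the same rule along columns by transposing twice."""
    return transpose(remove_long_runs(transpose(image), min_fraction))
```

Because the threshold is relative to the row or column length, very small images can lose ordinary character strokes; on full-page scans a 20 percent run is far longer than any single stroke.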
From the binarized image, the count of continuous black pixels locates the position of a horizontal or vertical line.

Rectangular element mask
After the horizontal and vertical lines are identified, the masking operation is applied using a rectangular structuring element. An 11 x 2 rectangular element mask with its middle row as the target is used to detect horizontal lines, and in the same way a 2 x 11 rectangular element mask with its middle column as the target is used to detect vertical lines. Figure 13 gives an overview of the algorithm based on the rectangular structuring element.

The experimental outcomes of the proposed system are depicted in figure 14.

Figure 14: Accuracy of proposed methodology

Conclusion
The proposed algorithm for emblem detection employs bounding boxes for detection of objects, and Zernike moment and HOG descriptor features for the detection and removal of emblems in application form images. The bounding box is very efficient for detecting objects in the application form images, and the features employed are consistent and adequate for detecting objects with emblems. For the specific “e-seva” emblem the proposed algorithm works with high efficiency, and for the other graphical components it provides satisfactory results. The algorithm proposed for line detection and removal works efficiently in detecting horizontal and vertical lines. In future, the work can be extended to remove lines where text is overlapping. The proposed work is applicable to pre-printed documents in various languages. Dynamic detection threshold values for emblem and line detection can be considered as future work.

References
[8] Aradhya, V.M., Pavithra, M.S. and Naveena, C., 2012.
“A robust multilingual text detection approach based on transforms and wavelet entropy”. Procedia Technology, 4, pp.232-237.
[9] Biggio, B., Fumera, G., Pillai, I. and Roli, F., 2011. “A survey and experimental evaluation of image spam filtering techniques”. Pattern Recognition Letters, 32(10), pp.1436-1446.
[10] González, Á. and Bergasa, L.M., 2013. “A text reading algorithm for natural images”. Image and Vision Computing, 31(3), pp.255-274.
[11] Kaus, M.R., Warfield, S.K., Nabavi, A., Black, P.M., Jolesz, F.A. and Kikinis, R., 2001. “Automated segmentation of MR images of brain tumors”. Radiology, 218(2), pp.586-591.
[12] C. Aswini, D. Chitra, 2014. “Enhanced Logo Matching and Recognition using SURF Descriptor”. International Journal of Engineering Research & Technology (IJERT). Vol. 3.4, ISSN: 2278-0181.
[13] Mrunalinee Patole, Meera Sambhaji Sawalkar, 2014. “Improved approach for logo detection and recognition”. International Journal of Emerging Trends & Technology in Computer Science (IJETTCS). Vol. 3.6, ISSN: 2278-6856.
[14] Amrapali A. Dudhgaonkar, N.N. Thune, 2014. “Novel and Scalable Solution for Logo Detection and Recognition using CDS method”. International Journal of Engineering Research & Technology (IJERT). Vol. 3.6, ISSN: 2278-0181.
[1] Akram, S., Dar, M.D. and Quyoum, A., 2010. “Document Image Processing-A Review”. International Journal of Computer Applications, 10(5), pp.35-40.
[2] Gatos, B., Danatsas, D., Pratikakis, I. and Perantonis, S.J., 2005. “Automatic table detection in document images”. In Pattern Recognition and Data Mining (pp. 609-618). Springer Berlin Heidelberg.
[3] Zheng, Y., Li, H. and Doermann, D., 2003, August. “A model-based line detection algorithm in documents”. In Document Analysis and Recognition, 2003. Proceedings. Seventh International Conference on (pp. 44-48). IEEE.
[15] Parwej, F., 2013. “A Perceptive Method for Arabic Word Segmentation using Bounding Boxes by Morphological Dilation”.
International Journal of Computer Applications, 71(1).
[4] Shobha Rani N, Vasudev T., 2014. “A Generic Line Elimination Methodology using Circular Masks for Printed and Handwritten Document Images”. Proceedings of Second International Conference on Emerging Research in Computing, Information, Communication and Applications, ERCICA, ISBN: 9789351072638.
[16] Lempitsky, V., Kohli, P., Rother, C. and Sharp, T., 2009, September. “Image segmentation with a bounding box prior”. In Computer Vision, 2009 IEEE 12th International Conference on (pp. 277-284). IEEE.
[17] Arif, T., Shaaban, Z., Krekor, L. and Baba, S., 2009. “Object classification via geometrical, Zernike and Legendre moments”. Journal of Theoretical and Applied Information Technology, 7(1), pp.31-37.
[18] Karar, S. and Parekh, R., 2012. “Palm Print Recognition using Zernike Moments”. International Journal of Computer Applications, 55(16).
[19] Tripathy, J., 2010. “Reconstruction of Oriya alphabets using Zernike moments”. International Journal of Computer Applications (0975-8887), 8(8).
[20] Deodhare, D., Suri, N.R. and Amit, R., 2005. “Preprocessing and Image Enhancement Algorithms for a Form-based Intelligent Character Recognition System”. IJCSA, 2(2), pp.131-144.
[5] Zhang, P. and Guo, X., 2012. “A cascade face recognition system using hybrid feature extraction”. Digital Signal Processing, 22(6), pp.987-993.
[6] Bataineh, B., Abdullah, S.N.H.S. and Omar, K., 2011. “An adaptive local binarization method for document images based on a novel thresholding method and dynamic windows”. Pattern Recognition Letters, 32(14), pp.1805-1813.
[7] Basu, S., Das, N., Sarkar, R., Kundu, M., Nasipuri, M. and Basu, D.K., 2010. “A novel framework for automatic sorting of postal documents with multiscript address blocks”. Pattern Recognition, 43(10), pp.3507-3521.
[21] Dougherty, E.R. and Lotufo, R.A., 2003. “Hands-on morphological image processing”. Vol. 71. Washington: SPIE Optical Engineering Press.
[22] Dalal, N. and Triggs, B., 2005, June. “Histograms of oriented gradients for human detection”. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on (Vol. 1, pp. 886-893). IEEE.