Handwritten Hindi Character Recognition using K-means Clustering and SVM

Akanksha Gaur, M.Tech Scholar (CSE), AKGEC, Ghaziabad, India, gaur.akanksha27@gmail.com
Dr. Sunita Yadav, Professor (CSE Dept), AKGEC, Ghaziabad, India, yadav.sunita104@gmail.com

ABSTRACT
Devanagari script is used by many languages in India, including Hindi. In this paper, Hindi characters are recognized using a three-step procedure. The first step is preprocessing, in which the image is binarized and the characters are separated. Each Hindi word has a horizontal bar along its top; that bar is also removed in the preprocessing phase. The second step is feature extraction, in which region-based K-means clustering is used to create a feature vector that serves as input to the classification phase. The third step is classification, for which a support vector machine (SVM) is used. An SVM classifies with a hyper-plane: the decision surface is the hyper-plane with the maximum margin of separation between the hyper-plane and the closest data points. An SVM can use different kernel functions, which define how classification is performed. The kernel function used here is the linear kernel.

A raw image can contain different types of noise, distortion, etc. Removing noise from the scanned image makes character recognition easier. After the image is preprocessed, the distinguishing qualities of the character are extracted. This process is called feature extraction, and each such quality is called a feature. Researchers have used many feature extraction techniques, such as structural features, contour features, ring features, Zernike features, ink-based features, gradient features, and global features. E. Kolman et al. (2008) used structural details such as endpoints, intersections of line segments, loops, curvatures, segment lengths, etc.
describing the geometry of the pattern structure as features. S. Arora et al. (2008) divided the image into segments and extracted structural features from each segment individually. M. Hanmandlu et al. (2009) also used this structural approach to segment characters from words.

Keywords
Hindi characters, Feature Extraction, Classification, K-means clustering, Support vector machine (SVM).

INTRODUCTION
Optical character recognition (OCR) converts a scanned image into a usable format. The scanned images can contain printed or handwritten characters. OCR is mainly used for data entry from sources written on paper. OCR is mainly divided into two parts: a) online character recognition and b) offline character recognition. In online character recognition, characters are recognized as they are written, using timestamp information. Offline character recognition takes an image of the characters and converts it into a computer-readable format. Offline character recognition can be done on two types of data: a) printed text and b) handwritten text. Handwritten character recognition is more difficult than printed character recognition because handwriting varies from person to person.

This paper presents handwritten Hindi character recognition. Hindi is written in the Devanagari script. Hindi is one of India's official languages and is widely used. The Hindi language has 14 modifiers (matras), shown in Fig 1(c), 13 vowels, shown in Fig 1(a), and 34 consonants, shown in Fig 1(b) (R. Jayadevan et al., 2011). Hindi vowels are called 'Swar' and Hindi consonants are called 'Vyanjan'. Researchers have used many techniques for character recognition. The scanned image must first be preprocessed to remove noise.

Fig 1.
(a) Hindi vowels, (b) Hindi consonants, (c) Hindi modifiers

Classification is executed on the basis of the features. Classification is the process in which objects are differentiated and categorized into classes. Authors have used different techniques for classification, such as neural networks, fuzzy logic, HMM, support vector machines, and hybrid techniques. C. Zanchettin et al. (2012) used a hybrid KNN-SVM approach, which specializes the SVMs in the local areas around the surface of separation; the hybrid KNN-SVM recognizer significantly improves recognition and error rates compared with a single kNN model on the character classification task. M. Hanmandlu et al. (2007) used fuzzy logic for classification. U. Pal et al. (2009) used another hybrid approach as a classifier, which combines SVM and MQDF.

K. Sheshadri et al. (2010) used K-means clustering to reduce the size of the training database of printed Kannada characters. S. Biswas et al. (2010) used the height-width ratio, occupancy ratio, and distance ratio to find features in human signature images. N. Huan et al. (2010) used K-means clustering for detection of iris red and green sections for eye detection. I. Bonet et al. (2006) used a clustering algorithm as a feature extraction technique for protein sequences.

Classification Phase:

RELATED WORK
Character recognition has been performed by researchers on characters of different languages, such as English, Tamil, Chinese, Bangla, Arabic, Farsi, Kannada, and Devanagari. Generally, the whole process is executed in three steps: a) pre-processing, b) feature extraction, and c) classification. In the pre-processing phase, unimportant data is removed; the main tasks are noise removal, data normalization, and segmentation. In the feature extraction phase, features are extracted from the preprocessed input image.
Features are the attributes of an image on the basis of which a character is represented and recognized. For classification, the features are given as input to the classifier. The classifier assigns each feature vector of an input image to the class to which it belongs. Classification deals with the numerical properties of the image features and organizes the output data into categories.

Pre-processing Phase:
Researchers have used different techniques for character recognition. First, the characters are segmented from the raw data. For segmentation, M. Hanmandlu et al. (2009) proposed a structural approach based on the structure of Hindi characters and modifiers. After the characters are segmented, the feature extraction phase starts.

For classification, many classifiers have been used. Support Vector Machine (SVM) is used as the classifier by many authors, including A. Alaei et al. (2009), R. Ramanathan et al. (2009), J. Hou et al. (2010), S.W. Lee et al. (2012), D.C. Shubhangi et al. (2007), D. Nasien et al. (2010), and S. Kumar et al. (2009). I. Bonet et al. (2006) used K-means clustering as the classifier. S. Biswas et al. (2010) used the K-nearest neighbor technique for classification. K. Sheshadri et al. (2010) recognised characters by determining the nearest match. S. Pourmohammad et al. (2013) used LDA for recognition of characters. U. Pal et al. (2008) used MIL as a classifier, and U. Pal et al. (2009) used a hybrid of SVM and MQDF. C. Zanchettin et al. (2012) used a hybrid KNN-SVM approach. M. Hanmandlu et al. (2007) used fuzzy logic for classification. Sharma et al. (2006) used a quadratic classifier.

PROPOSED METHOD
Fig 2 shows the flow diagram of the proposed method, which is divided into 3 parts: Pre-processing Phase, Feature Extraction Phase and Classification Phase.
Feature Extraction Phase:
Researchers have used many techniques for feature extraction. D. Nasien et al. (2010) used the Freeman chain code, which is based on 8-neighbourhood connectivity, for English characters. D.C. Shubhangi (2007) used structural and statistical features for English characters. S.W. Lee et al. (2012) extracted multiple features, consisting of chain code, pixel density, and number of lines, for handwritten numeral recognition. J. Hou et al. (2010) used multi-layered features for Chinese accent identification. P.S. Deshpande et al. (2008) used directional features together with information on the type of connectivity for Devanagari characters. R. Ramanathan et al. (2009) used Gabor filters for English and Tamil characters. A. Alaei et al. (2009) used modified contour chain codes for Arabic handwritten characters. S. Arora et al. (2008) used chain code, intersection, and shadow features for Devanagari characters. U. Pal et al. (2008) and U. Pal et al. (2009) used gradient- and curvature-based features for Devanagari characters. Sharma et al. (2006) used histograms of chain code contours as features for Devanagari characters. U. Pal et al. (2007) used gradient and Gaussian filters for Devanagari characters. M. Hanmandlu et al. (2007) used vector distances for Devanagari characters. S. El Ferchichi et al. (2011) used a clustering technique and a similarity-measure-based technique to extract features for face recognition. S. Pourmohammad et al. (2013) used PCA with K-means clustering as preprocessing for character recognition. K. Sheshadri et al. (2010) used K-means clustering.

Pre-processing Phase:
In this phase the scanned image, shown in Fig 3(a), is taken as input and processed in 2 parts:

Binarization: The scanned input is first converted into a grayscale image and then into binary images using vertical and horizontal grayscale threshold levels. This step produces two outputs: a vertical and a horizontal binary image.

Removal of horizontal bar and separation of character: The outputs of the binarization step are combined using an AND operation, and the horizontal bar is removed. The character to be recognized is separated by cropping the image.

Fig 3 shows the preprocessing phase. The scanned input is first binarized in two ways, vertically and horizontally, and after morphological operations the horizontal bar is removed. Fig 3(a) shows the original image taken as input. Fig 3(b) shows the grayscale image, the binary outputs, and the dilated outputs.

Fig 2. Flow Diagram of Proposed Method
Fig 3 (a): Original image

Feature Extraction Phase:
Feature extraction is performed on the binarized, cropped character using K-means clustering. K-means clustering groups data items. It gives robust performance under low illumination. It reduces the dimension of the data, so that computational overhead is reduced. K-means clustering is a simple and flexible technique. The number of clusters equals the number of centroids selected. The algorithm is described below.

Algorithm of K-means Clustering:
• Select K points as centroids, one for each group.
• Take each point from the data set and associate it with the nearest centroid. To determine which centroid is nearest, the Euclidean distance between the centroids and the data points is calculated, which is
  $$d(P, Q) = \sqrt{\sum_{i} (P_i - Q_i)^2}$$
  where P and Q are data points.
• When no point is pending, recalculate the positions of the K centroids.
• Repeat the above 2 steps until the centre points no longer move.

This K-means clustering is applied to the cropped binary image in a region-based manner. Region-based K-means clustering is applied to the locations of the pixels.
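The K-means steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' MATLAB implementation: the function names `euclidean` and `kmeans`, the `max_iter` cap, and the toy points are assumptions made for the example, and the initial centroids are passed in explicitly (rather than chosen at random) so that the run is repeatable.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, centroids, max_iter=100):
    """Cluster `points` around the given initial `centroids`.

    Returns (centroids, labels), where labels[i] is the index of the
    centroid nearest to points[i].
    """
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: euclidean(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:  # converged: no centroid moved
            break
        centroids = new_centroids
    return centroids, labels


# Toy (x, y) pixel locations forming two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, labels = kmeans(points, centroids=[(1, 1), (8, 8)])
print(labels)     # [0, 0, 0, 1, 1, 1]
print(centroids)
```

With random initial centroids, as the paper describes, different runs can converge to different clusterings, so a fixed seed or repeated restarts may be needed for reproducible feature vectors.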
K-means clustering divides the image into K clusters. Each cluster holds its data as x and y pixel coordinates. The values of the pixels in a cluster are combined to obtain the pixel density of that cluster, so that each cluster is represented by a single value. The cluster values are arranged row-wise to form a vector, called the feature vector. In this method, the cropped image is resized to 70x50 pixels and a total of 35 clusters are obtained from the image, so the feature vector has 35 values.

Feature vector = [0.77; 0.59; 0.48; 1; 0.86; 0.28; 0.97; 0.34; 1; 1; 0.95; 0.70; 0.49; 1; 0.40; 1; 0.20; 0.76; 0.44; 0.34; 1; 0.95; 0.42; 0.57; 0.40; 0.75; 0.97; 0.81; 0.48; 0.37; 0.77; 0.80; 0.22; 0.86; 0.46];

Fig 4(a) shows the separated and resized character, and Fig 4(b) shows the plot of the feature vector values.

Classification phase:
For classification of characters, Support Vector Machine (SVM) and the Euclidean distance approach are used separately. SVM is a supervised learning method used for analyzing data. SVM uses a hyper-plane for classification. The hyper-plane with the maximum margin of separation between the hyper-plane and the closest data points is used as the decision surface. This optimal hyper-plane gives the output. Different types of kernels are used in SVM: linear, RBF, quadratic, polynomial, and MLP. Here the linear kernel is used for classification with SVM.
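For reference, the maximum-margin hyper-plane described above is usually written as follows for two linearly separable classes. The notation (weight vector w, bias b, training pairs (x_i, y_i) with labels y_i in {-1, +1}, and N training samples) is standard textbook notation and an assumption here; the paper does not state its own symbols, nor how the two-class formulation is extended to many character classes.

```latex
\begin{align}
  f(x) &= \operatorname{sign}\left(w^{\top} x + b\right)
    && \text{decision rule} \\
  \min_{w,\,b}\ & \tfrac{1}{2}\,\lVert w \rVert^{2}
    \quad \text{subject to} \quad
    y_i \left(w^{\top} x_i + b\right) \ge 1,\quad i = 1, \dots, N
    && \text{maximum-margin problem} \\
  \text{margin} &= \frac{2}{\lVert w \rVert},
  \qquad
  K(x_i, x_j) = x_i^{\top} x_j
    && \text{linear kernel}
\end{align}
```

Minimizing the norm of w is equivalent to maximizing the margin between the hyper-plane and the closest data points, which is the property the text refers to.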
Fig 3 (b): Preprocessing Phase

K-means clustering works from randomly selected centre points; the other data values are attached to the corresponding centre points according to the difference between the data values and the centre points.

In the Euclidean distance approach, the distance is calculated between the training vectors and the test vectors, and the lowest distance is found. Classification is performed according to the lowest distance.

Fig 4 (a) Cropped and resized character, (b) Plotted feature vector

EXPERIMENTAL RESULT
The proposed model is implemented in a number of steps, using MATLAB as the tool. A scanned image of a Hindi word is taken as input. A set of 430 character images is used. For classification using SVM, 140 characters are used for training and the remaining 290 character images are used as test data. For classification using Euclidean distance, the training data varies across the sample sets described in Table 1.

First, the image is processed so that the useful section of the image, to which the recognition process will be applied, can be extracted. This is done in the preprocessing phase using morphological operations in MATLAB. The scanned image is first converted into a grayscale image, and a threshold value is extracted by filtering that grayscale image horizontally and vertically. Using that threshold value, the grayscale image is binarized vertically and horizontally. Morphological operations are performed on these binary images, and the pixel values of the resulting images are combined using an AND operation so that the horizontal line can be removed.

After the character is extracted from the word, it is resized to 70x50 pixels. K-means clustering is applied to this resized binary image by dividing it into 7 parts horizontally. On each part, K-means clustering is applied with 5 centroids. This generates a vector of 35 values from the image.
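The region-based feature extraction just described (7 horizontal parts, 5 centroids per part, 35 values in total) might look roughly like the Python sketch below. Several details are assumptions, because the text does not fix them: the image is taken as 70 rows by 50 columns, each part is a band of 10 rows, the centroids are seeded at evenly spaced foreground pixels instead of randomly, and the "pixel density" of a cluster is taken as the number of its pixels divided by the area of its bounding box. The names `kmeans_labels`, `strip_features`, and `feature_vector` are hypothetical.

```python
import math


def kmeans_labels(points, k, max_iter=50):
    """Plain K-means on (row, col) points; returns one cluster index per point."""
    # Assumption: seed with k points spread evenly through the list.
    step = len(points) / k
    centroids = [points[int(i * step)] for i in range(k)]
    labels = []
    for _ in range(max_iter):
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(a) / len(members) for a in zip(*members)))
            else:
                updated.append(centroids[j])
        if updated == centroids:
            break
        centroids = updated
    return labels


def strip_features(strip, k):
    """Return k density values for one band of the binary image."""
    points = [(r, c) for r, row in enumerate(strip) for c, v in enumerate(row) if v]
    if not points:
        return [0.0] * k
    labels = kmeans_labels(points, k)
    feats = []
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            feats.append(0.0)
            continue
        rows = [p[0] for p in members]
        cols = [p[1] for p in members]
        # Assumed density: cluster pixels divided by bounding-box area.
        box_area = (max(rows) - min(rows) + 1) * (max(cols) - min(cols) + 1)
        feats.append(round(len(members) / box_area, 2))
    return feats


def feature_vector(image, n_strips=7, k=5):
    """Split the image into n_strips bands and concatenate k values per band."""
    rows_per_strip = len(image) // n_strips
    vec = []
    for s in range(n_strips):
        strip = image[s * rows_per_strip:(s + 1) * rows_per_strip]
        vec.extend(strip_features(strip, k))
    return vec


# Synthetic 70x50 binary image: a vertical stroke ten pixels wide.
image = [[1 if 20 <= c < 30 else 0 for c in range(50)] for _ in range(70)]
vec = feature_vector(image)
print(len(vec))  # 35
```

A different density definition, or a different orientation of the 7 parts, would change the values but not the overall structure of 7 x 5 = 35 features.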
A feature vector is produced for every character. For classification, the Euclidean distance method and the Support Vector Machine are used, and the results are compared.

The Euclidean distance method classifies by calculating the distance between the test data feature and the training data features. The training data features are stored in a matrix that has 35 rows and as many columns as there are training samples. The Euclidean distance is calculated between the test data feature and each training data feature individually. The nearest training feature is the one with the lowest distance from the test feature, and the character corresponding to that training feature is output. The result is calculated with different numbers of samples of each character in the training data set.

Fig 5. SVM with optimal separating hyper-plane

Here, the green line is the optimal hyper-plane, which separates the two sets with the maximum margin between the hyper-plane and the nearest data points.

The Euclidean distance approach calculates the standard distance between two given points or vectors. If p and q are two vectors, the Euclidean distance between them is calculated according to this formula:

$$d(p, q) = \sqrt{\sum_{i} (p_i - q_i)^2}$$

Training database    No. of samples    Test data    Result (%)
Sample set 1         36                290          58
Sample set 2         108               290          73.7
Sample set 3         180               290          80.7
Sample set 4         252               290          81.7

Table 1: Results of Euclidean distance method with different sample sets.

As the sample set grows, the recognition percentage improves. Some characters consistently give good performance, that is, better than 75%. Table 2 shows the percentage of performance of characters. The percentage is calculated by

$$\%\ \text{of performance} = \frac{\text{Recognized characters}}{\text{Total characters for testing}} \times 100$$

The characters with consistently good performance for two or more sample sets are 25 in number; the remaining are badly performing characters.
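The Euclidean distance classification and the percentage calculation described above can be sketched as follows. This is an illustrative Python version, not the authors' MATLAB code: the function names `nearest_label` and `recognition_rate`, the 3-value toy vectors (the paper uses 35 values), and the labels "ka" and "kha" are assumptions made for the example.

```python
import math


def nearest_label(test_vec, train_vecs, train_labels):
    """Return the label of the training vector closest to test_vec."""
    distances = [math.dist(test_vec, t) for t in train_vecs]
    return train_labels[distances.index(min(distances))]


def recognition_rate(test_vecs, test_labels, train_vecs, train_labels):
    """Percentage of test vectors whose nearest training vector has the true label."""
    hits = sum(
        nearest_label(v, train_vecs, train_labels) == y
        for v, y in zip(test_vecs, test_labels)
    )
    return 100.0 * hits / len(test_vecs)


# Toy data: one training sample per (hypothetical) character.
train_vecs = [(0.1, 0.2, 0.9), (0.8, 0.7, 0.1)]
train_labels = ["ka", "kha"]

test_vecs = [(0.2, 0.2, 0.8), (0.9, 0.6, 0.2), (0.7, 0.9, 0.0)]
test_labels = ["ka", "kha", "ka"]  # the third sample is closer to the "kha" sample

rate = recognition_rate(test_vecs, test_labels, train_vecs, train_labels)
print(f"{rate:.1f}%")  # 66.7%
```

Because every test vector is compared with every training vector, the cost of this approach grows with the size of the training set, which is relevant when comparing sample sets 1 to 4.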
Fig 5(a) shows the characters that give good performance using the Euclidean distance approach. Fig 5(b) shows the characters that do not give good performance using the Euclidean distance approach.

Characters with good performance % % Character Character Character d u x ?k Fk N t > V B m 100 94 j < 83 100 75 100 100 100 75 100 83 98 100 80 e v {k y b Q c l o 'k "k %
Bad performance Character % .k r /k ; , g [k p M i n 100 90 100 75 95 100 80 80 75 80 97

SVM uses a hyper-plane for classification. The hyper-plane with the maximum margin of separation between the hyper-plane and the closest data points is used as the decision surface. For classification with SVM, 140 characters are used for training and the remaining 290 character images are used as test data. The training data consists of handwritten characters stored in the form of feature vectors, and the test data is collected by storing each test character in feature vector form. Using this technique, a recognition result of 95.86% is achieved.

$$\%\ \text{of performance} = \frac{\text{Total no. of recognized characters}}{\text{Total characters for testing}} \times 100$$

Some characters consistently give good recognition performance, that is, better than 75%. These are shown in Fig 6.

66 Character % Character % 60 d [k x ?k p N t > V B 100 M < r Fk n /k u i Q c 100 66 60 50 66 33 50 25 80 80

Table 2: Performance of Characters using Euclidean Distance Approach for sample set 4

100 83 100 87.5 100 100 100 100 100 100 100 75 90 66 100 100 100 100 Character e ; j y o 'k "k l g % Character % 87.5 .k {k v b , m 100 94 80 91 100 100 100 90 100 100 83 80 100 83 100

Table 3: Performance of Characters using Support Vector Machine

Fig 6: Good performance characters by SVM method

Fig 5: (a) Good performance characters by Euclidean method, (b) Bad performance characters by Euclidean method

The following table summarizes the results of Devanagari character recognition.
Author                          Result (%)
Sharma et al., 2006             80.36
Deshpande et al., 2008          82
Hanmandlu et al., 2007          90.65
Arora et al., 2010              90.74
U. Pal et al., 2007             94.24
U. Pal et al., 2008             95.13
U. Pal et al., 2009             95.19
Proposed technique using SVM    95.86

Table 4: Comparison table

Results using the Support Vector Machine are better than results using Euclidean distance. With the Euclidean distance method, 25 characters have good performance, i.e. a performance percentage of more than 75%. With SVM, the performance of all characters is better than 75% except for one character, /k. The computation overhead of the Support Vector Machine is less than that of the Euclidean distance approach. So classification using SVM is better than classification using the Euclidean distance approach.

CONCLUSION
This paper presented handwritten Hindi character recognition based on K-means clustering and SVM. K-means clustering reduces the size of the feature vector, so that computation becomes easier. Results are calculated using two approaches for classification: Euclidean distance and Support Vector Machine. Results using SVM are better than results using Euclidean distance. The maximum result achieved using Euclidean distance is 81.7%. SVM with a linear kernel gives a result of 95.86%.

REFERENCES
[1] M. Hanmandlu, O.V. Ramana Murthy, “Fuzzy model based recognition of handwritten numerals”, Science Direct, Pattern Recognition, Vol. 40, Issue 6, pp. 1840-1854, June 2007.
[2] S. Arora, D. Bhattacharjee, M. Nasipuri, D. K. Basu, “Combining Multiple Feature Extraction Techniques for Handwritten Devnagari Character Recognition”, IEEE Region 10 Colloquium and the Third ICIIS, Kharagpur, India, December 8-10, 2008.
[3] J. Hou, Y. Liu, T. F. Zheng, J. Olsen, J. Tian, “Multi-layered Features with SVM for Chinese Accent Identification”, IEEE International Conference on Audio Language and Image Processing, pp. 25-30, 23-25 Nov. 2010.
[4] Shen-Wei Lee, Hsien-Chu Wu, “Effective Multiple-features Extraction for Off-line SVM-Based Handwritten Numeral Recognition”, IEEE International Conference on Information Security and Intelligence Control, pp. 194-197, 2012.
[5] D.C. Shubhangi, “Noisy English Character Recognition by Combining SVM Classifier”, IEEE International Conference on Information and Communication Technology in Electrical Sciences, pp. 663-666, 2007.
[6] R. Ramanathan, S. Ponmathavan, N. Valliappan, “Optical Character Recognition for English and Tamil Using Support Vector Machines”, IEEE International Conference on Advances in Computing, Control, and Telecommunication Technologies, pp. 610-612, 2009.
[7] D. Nasien, H. Haron, S. Sophiayati Yuhaniz, “Support Vector Machine (SVM) for English Handwritten Character Recognition”, IEEE International Conference on Computer Engineering and Application, Vol. 1, pp. 249-252, 2010.
[8] A. Alaei, U. Pal and P. Nagabhushan, “Using Modified Contour Features and SVM Based Classifier for the Recognition of Persian/Arabic Handwritten numerals”, IEEE Seventh International Conference on Advances in Pattern Recognition, pp. 391-394, 2009.
[9] C. Zanchettin, B. L. D. Bezerra and W. W. Azevedo, “A KNN-SVM Hybrid Model for Cursive Handwriting Recognition”, IEEE World Congress on Computational Intelligence, International Joint Conference on Neural Network, June 10-15, 2012.
[10] N. Sharma, U. Pal, F. Kimura, and S. Pal, “Recognition of Off-Line Handwritten Devnagari Characters Using Quadratic Classifier”, ICVGIP, Springer, pp. 805-816, 2006.
[11] U. Pal, N. Sharma, T. Wakabayashi and F. Kimura, “Off-Line Handwritten Character Recognition of Devnagari Script”, ICDAR, IEEE, 2007.
[12] U. Pal, T. Wakabayashi, F. Kimura, “Comparative Study of Devnagari Handwritten Character Recognition using Different Feature and Classifiers”, ICDAR, IEEE, 2009.
[13] M. Hanmandlu, O.V. Ramana Murthy, Vamsi Krishna Madasu, “Fuzzy Model based recognition of handwritten Hindi characters”, DICTA, IEEE, 2007.
[14] Sabra El Ferchichi, Salah Zidi, Kaouther Laabidi, Moufida Ksouri, and Salah Maouche, “A New Feature Extraction Method Based on Clustering for Face Recognition”, IFIP AICT, pp. 247-253, 2011.
[15] Sajjad Pourmohammad, Reza Soosahabi, Anthony S. Maida, “An Efficient Character Recognition Scheme Based on K-Means Clustering”, IEEE, 2013.
[16] Karthik Sheshadri, Pavan Kumar T Ambekar, Deeksha Padma Prasad and Dr. Ramakanth P Kumar, “An OCR system for Printed Kannada using k-means clustering”, IEEE, 2010.
[17] Samit Biswas, Debnath Bhattacharyya, Tai-hoon Kim, and Samir Kumar Bandyopadhyay, “Extraction of Features from Signature Image and Signature Verification Using Clustering Techniques”, SUComS, Springer, CCIS 78, pp. 493-503, 2010.
[18] Nguyen van Huan, Nguyen Thi Hai Binh and Hakil Kim, “Eye Feature Extraction Using K-means Clustering for Low Illumination and Iris Color Variety”, ICCARV, IEEE, 2010.
[19] Isis Bonet, Yvan Saeys, Ricardo Grau Ábalo, María M. García, Robersy Sanchez, and Yves Van de Peer, “Feature Extraction Using Clustering of Protein”, CIARP, Springer, pp. 614-623, 2006.
[20] Dr. P. S. Deshpande, Latesh Malik, Sandhya Arora, “Fine Classification & Recognition of HandWritten Devnagari Characters with Regular Expressions & Minimum Edit Distance Method”, Journal of Computers, Vol. 3, No. 5, May 2008.
[21] Eyal Kolman and Michael Margaliot, “A New Approach to Knowledge-Based Design of Recurrent Neural Networks”, IEEE Transaction on Neural Networks, Vol. 19, Issue 8, pp. 1389-1401, August 2008.
[22] R. Jayadevan, Satish R. Kolhe, Pradeep M. Patil, and Umapada Pal, “Offline Recognition of Devanagari Script: A Survey”, IEEE Transaction, Vol. 41, Issue 6, pp. 782-796, 2011.
[23] S. Kumar, “Performance comparisons of features on devanagari hand-printed dataset”, IJRT, Vol. 1, pp. 33-37, 2009.