Handwritten Hindi Character Recognition using K-means Clustering
and SVM
Akanksha Gaur
M.Tech Scholar (CSE), AKGEC, Ghaziabad, India
gaur.akanksha27@gmail.com
Dr. Sunita Yadav
Professor (CSE Dept), AKGEC, Ghaziabad, India
yadav.sunita104@gmail.com
ABSTRACT
The Devanagari script is used by many languages in India, including
Hindi. In this paper, handwritten Hindi characters are recognized using
a three-step procedure. The first step is preprocessing, in which the
image is binarized and the characters are separated. Each Hindi word
has a horizontal bar along its top; this bar is also removed in the
preprocessing phase. The second step is feature extraction, in which
region-based K-means clustering is used to create the feature vector
that serves as input to the classification phase. The third step is
classification, for which a support vector machine (SVM) is used. An
SVM classifies with a hyperplane, chosen as the decision surface that
maximizes the margin between the hyperplane and the closest data
points. An SVM can use different kernel functions, which define how
the classification is performed; the kernel used here is the linear
kernel.
Keywords
Hindi characters, Feature Extraction, Classification, K-means
clustering, Support vector machine (SVM).
INTRODUCTION
Optical character recognition (OCR) converts a scanned image into a
usable format. The scanned images can be of printed or handwritten
characters. OCR is mainly used for data entry from sources written on
paper. It is divided into two parts: a) online character recognition
and b) offline character recognition. In online character recognition,
characters are recognized at the time of writing, using time-stamp
information. Offline character recognition takes an image of the
characters and converts it into a computer-understandable format. It
can be performed on two types of data: a) printed text and b)
handwritten text. Handwritten character recognition is more difficult
than printed character recognition because handwriting varies from
person to person.
In this paper, handwritten Hindi character recognition is presented.
Hindi is written in the Devanagari script. It is an official language
of India and is widely spoken. The Hindi language has 14 modifiers
(matras), shown in Fig 1(c), 13 vowels, shown in Fig 1(a), and 34
consonants, shown in Fig 1(b) (R. Jayadevan et al., 2011). Hindi
vowels are called ‘Swar’ and Hindi consonants are called ‘Vyanjan’.
Fig 1. (a) Hindi vowels, (b) Hindi consonants, (c) Hindi modifiers
Researchers have used many techniques for character recognition. The
scanned image must first be preprocessed to remove noise. A raw image
can contain different types of noise and distortion, and removing them
makes the characters easier to recognize. After preprocessing, the
distinguishing characteristics of each character are extracted. This
process is called feature extraction, and each such characteristic is
called a feature. Researchers have used many feature extraction
techniques, based on structural features, contour features, ring
features, Zernike features, ink-based features, gradient features,
global features, etc. E. Kolman et al. (2008) used structural details
such as endpoints, intersections of line segments, loops, curvatures
and segment lengths, which describe the geometry of the pattern
structure, as features. S. Arora et al. (2008) divided the image into
segments and extracted structural features from each segment
individually. M. Hanmandlu et al. (2009) also used this structural
approach to segment characters from words.
Classification is performed on the basis of these features.
Classification is the process in which objects are differentiated and
categorized into classes. Various authors have used different
techniques for it, such as neural networks, fuzzy logic, HMMs, support
vector machines and hybrid techniques. According to C. Zanchettin et
al. (2012), the hybrid KNN-SVM approach specializes SVMs in the local
areas around the surface of separation; their hybrid KNN-SVM
recognizer significantly improves the recognition and error rates
compared with a single kNN model on the character classification task.
M. Hanmandlu et al. (2007) used fuzzy logic for classification. U. Pal
et al. (2009) used another hybrid classifier, which combines SVM and
MQDF.
RELATED WORK
Character recognition has been performed by researchers on characters
of different languages, such as English, Tamil, Chinese, Bangla,
Arabic, Farsi, Kannada and Devanagari. Generally, the whole process is
executed in three steps: a) pre-processing, b) feature extraction and
c) classification. In the pre-processing phase, unimportant data is
removed; the main tasks are noise removal, normalization of the data
and segmentation. In the feature extraction phase, features are
extracted from the preprocessed input image. Features are the
attributes of the image on the basis of which the character is
represented and recognized. For classification, the features are given
as input to the classifier, which assigns the feature vector of each
input image to the class it belongs to. Classification deals with the
numerical properties of the image features and organizes the output
data into categories.
Pre-processing Phase:
Researchers have used different techniques for character recognition.
First, the characters are segmented from the raw data. For
segmentation, M. Hanmandlu et al. (2009) proposed a structural approach
based on the structure of Hindi characters and modifiers. After the
characters are segmented, the feature extraction phase starts.
Feature Extraction Phase:
Researchers have used many techniques for feature extraction. D.
Nasien et al. (2010) used the Freeman chain code, which is based on
8-neighbourhood connectivity, for English characters. D.C. Shubhangi
(2007) used structural and statistical features for English
characters. S.W. Lee et al. (2012) extracted multiple features,
consisting of chain code, pixel density and number of lines, for
handwritten numeral recognition. J. Hou et al. (2010) used
multi-layered features for Chinese accent identification. P.S.
Deshpande et al. (2008) used directional features together with
connectivity-type information for Devanagari characters. R. Ramanathan
et al. (2009) used Gabor filters for English and Tamil characters. A.
Alaei et al. (2009) used modified contour chain codes for Arabic
handwritten characters. S. Arora et al. (2008) used chain code,
intersection and shadow features for Devanagari characters. U. Pal et
al. (2008) and U. Pal et al. (2009) used gradient- and curvature-based
features for Devanagari characters. Sharma et al. (2006) used
histograms of chain code contours as features for Devanagari
characters. U. Pal et al. (2007) used gradient features and a Gaussian
filter for Devanagari characters. M. Hanmandlu et al. (2007) used
vector distances as features for Devanagari characters. S. El
Ferchichi et al. (2011) used clustering and similarity-measure based
techniques to extract features for face recognition. S. Pourmohammad
et al. (2013) used PCA with K-means clustering as preprocessing for
character recognition. K. Sheshadri et al. (2010) used K-means
clustering to reduce the size of the training database of printed
Kannada characters. S. Biswas et al. (2010) used the height-width
ratio, occupancy ratio and distance ratio as features of human
signature images. N. Huan et al. (2010) used K-means clustering to
detect the red and green sections of the iris for eye detection. I.
Bonet et al. (2006) used a clustering algorithm as a feature
extraction technique for protein sequences.
Classification Phase:
Many classifiers have been used for classification. The Support Vector
Machine (SVM) has been used as a classifier by many authors, including
A. Alaei et al. (2009), R. Ramanathan et al. (2009), J. Hou et al.
(2010), S.W. Lee et al. (2012), D.C. Shubhangi et al. (2007), D.
Nasien et al. (2010) and S. Kumar et al. (2009). I. Bonet et al.
(2006) used K-means clustering as a classifier. S. Biswas et al.
(2010) used the K-nearest neighbor technique. K. Sheshadri et al.
(2010) recognized characters by determining the nearest match. S.
Pourmohammad et al. (2013) used LDA for character recognition. U. Pal
et al. (2008) used MIL as a classifier, and U. Pal et al. (2009) used
a hybrid of SVM and MQDF. C. Zanchettin et al. (2012) used a hybrid
KNN-SVM approach. M. Hanmandlu et al. (2007) used fuzzy logic. Sharma
et al. (2006) used a quadratic classifier.
PROPOSED METHOD
Fig 2 shows the flow diagram of the proposed method, which is divided
into three parts: Pre-processing Phase, Feature Extraction Phase and
Classification Phase.
Pre-processing Phase:
In this phase the scanned image, shown in Fig 3(a), is taken as input
and processed in two parts:
Binarization:
The scanned input is first converted into a grayscale image and then
into binary images using grayscale thresholds taken along the vertical
and horizontal directions. This method produces two outputs: a
vertical and a horizontal binary image.
Removal of horizontal bar and Separation of character:
The outputs of the binarization step are combined using a logical AND
operation, and the horizontal bar is then removed. The character to be
recognized is separated by cropping the image.
Fig 3 shows the preprocessing phase. The scanned input is first
binarized in two ways, vertically and horizontally, and the horizontal
bar is removed after morphological operations. Fig 3(a) shows the
original image taken as input. Fig 3(b) shows the grayscale image
together with the binary and dilated outputs.
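The preprocessing steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' MATLAB code: the two thresholds stand in for the vertical and horizontal binarizations, the bar is detected as any row that is mostly ink (the `fill_ratio` test is hypothetical), and the morphological operations are omitted.

```python
def binarize(gray, threshold):
    """Return a 0/1 image in which 1 marks ink (pixels darker than threshold)."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray]


def combine_and(first, second):
    """Pixel-wise logical AND of two binary images of equal size."""
    return [[a & b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(first, second)]


def remove_header_bar(binary, fill_ratio=0.8):
    """Blank every row whose ink fraction reaches fill_ratio (assumed bar test)."""
    width = len(binary[0])
    return [[0] * width if sum(row) >= fill_ratio * width else list(row)
            for row in binary]


def crop_to_ink(binary):
    """Crop the image to the bounding box of the remaining ink pixels."""
    rows = [r for r, row in enumerate(binary) if any(row)]
    cols = [c for c in range(len(binary[0])) if any(row[c] for row in binary)]
    if not rows:
        return []
    return [row[cols[0]:cols[-1] + 1] for row in binary[rows[0]:rows[-1] + 1]]


def preprocess(gray, threshold_a=128, threshold_b=100):
    """Binarize twice, AND the results, drop the bar, and crop the character."""
    combined = combine_and(binarize(gray, threshold_a), binarize(gray, threshold_b))
    return crop_to_ink(remove_header_bar(combined))
```

Calling `preprocess` on a grayscale image given as a list of rows returns the cropped binary character, which is the input expected by the feature extraction phase.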
Feature Extraction Phase:
Feature extraction is performed on the binarized, cropped character
using K-means clustering, which groups the data items. K-means
clustering is robust to low illumination. It also reduces the
dimension of the data, which lowers the computational overhead, and it
is a simple and flexible technique.
The number of clusters equals the number of centroids selected. The
algorithm is described below.
Algorithm of K-means Clustering:
• Select K points as the initial centroids, one for each group.
• Take each point from the given data set and associate it with the
nearest centroid. To determine which centroid is nearest, the
Euclidean distance between the data point and each centroid is
calculated:
$$d(P, Q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$
where P = (p_1, ..., p_n) and Q = (q_1, ..., q_n) are data points.
• When no point is pending, recalculate the positions of the K
centroids.
• Repeat the above two steps until the centroids no longer move.
Fig 2. Flow diagram of proposed method
Fig 3 (a): Original image
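The K-means steps listed above can be written as a short Python sketch. It is illustrative only: the paper does not say how the K initial points are chosen, so taking the first K data points is an assumption made here to keep the result deterministic.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, k, max_iter=100):
    """Cluster points into k groups; returns (centroids, labels)."""
    # Assumption: the first k points serve as the initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        labels = [min(range(k), key=lambda j: euclidean(p, centroids[j]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(centroids[j])  # keep an empty cluster in place
        if updated == centroids:  # centroids no longer move
            break
        centroids = updated
    return centroids, labels
```

For example, `kmeans([(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)], 2)` separates the three points near the origin from the three points near (10, 10).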
This K-means clustering is applied to the cropped binary image in a
region-based manner, that is, to the locations of the pixels. K-means
clustering divides the image into K clusters, each holding its data as
x and y pixel coordinates. The values of the pixels belonging to a
cluster are combined to obtain the pixel density of that cluster, so
that each cluster is represented by a single value. The cluster values
are arranged row-wise to form a vector, which is called the feature
vector.
In this method, the cropped image is resized to 70x50 pixels and a
total of 35 clusters are obtained from the image, so the feature
vector has 35 values.
Feature vector =
[0.77; 0.59; 0.48; 1; 0.86; 0.28; 0.97; 0.34; 1; 1; 0.95; 0.70;
0.49; 1; 0.40; 1; 0.20; 0.76; 0.44; 0.34; 1; 0.95; 0.42; 0.57;
0.40; 0.75; 0.97; 0.81; 0.48; 0.37; 0.77; 0.80; 0.22; 0.86; 0.46];
Fig 4(a) shows the separated and resized character, and Fig 4(b)
shows a plot of the corresponding feature vector values.
Fig 3 (b): Preprocessing Phase
Classification phase:
For classification of the characters, a Support Vector Machine (SVM)
and a Euclidean distance approach are used separately. SVM is a
supervised learning method used for analyzing data. It classifies with
a hyperplane: the hyperplane with the maximum margin between itself
and the closest data points is used as the decision surface, and this
optimal hyperplane determines the output. Different types of kernels
can be used in an SVM: linear, RBF, quadratic, polynomial and MLP.
Here the linear kernel is used.
It works from randomly selected centre points: the other data values
are attached to the corresponding centre points according to the
difference between the data values and the centre points.
The Euclidean distance is calculated between the training vectors and
the test vectors, and the lowest distance is found. Classification is
performed according to this lowest distance.
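To make the linear-kernel SVM concrete, the sketch below trains a two-class linear SVM by sub-gradient descent on the regularized hinge loss. This is a swapped-in, simplified training procedure rather than the MATLAB routine used for the experiments, and the names `train_linear_svm` and `predict` and the values of `lam`, `lr` and `epochs` are illustrative assumptions. Recognizing many characters would additionally need a multi-class scheme such as one-vs-rest, which is not shown.

```python
def train_linear_svm(samples, labels, lam=0.01, lr=0.01, epochs=300):
    """Fit w and b of a linear SVM; labels must be +1 or -1."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            # The regularizer shrinks w, which widens the margin.
            w = [wj * (1.0 - lr * lam) for wj in w]
            if margin < 1:
                # The sample lies inside the margin: push the hyperplane away.
                w = [wj + lr * y * xj for wj, xj in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```

After training on linearly separable feature vectors, `predict` returns the side of the learned hyperplane on which a new vector falls.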
EXPERIMENTAL RESULT
Fig 4. (a) Cropped and resized character, (b) Plotted feature vector
The proposed model is implemented in several steps, using MATLAB as
the tool. A scanned image of a Hindi word is taken as input. A set of
430 character images is used: 140 characters for training and the
remaining 290 character images as test data for classification using
SVM. For classification using Euclidean distance, the training data
varies across the sample sets described in Table 1.
First, the image must be processed so that the useful part of the
image, on which recognition will be performed, can be extracted. This
is done in the preprocessing phase using morphological operations in
MATLAB. The scanned image is first converted into a grayscale image,
and threshold values are obtained by filtering the grayscale image
horizontally and vertically. Using these thresholds, the grayscale
image is binarized vertically and horizontally. Morphological
operations are performed on these binary images, and the pixel values
of the resulting images are combined using an AND operation so that
the horizontal line can be removed.
After the character is extracted from the word, it is resized to 70x50
pixels. The resized binary image is divided into 7 parts horizontally,
and K-means clustering with 5 centroids is applied to each part. This
yields a vector of 35 values for the image. A feature vector is
produced in this way for every character.
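The construction of the 35-value vector (7 strips with 5 clusters each) can be sketched as follows. Several details are assumptions, because the paper does not spell them out: the image is taken to have 70 rows, the ink pixel coordinates of each strip are clustered, the initial centroids are spread evenly over the pixel list, and the "density" of a cluster is taken to be its share of the strip's ink pixels. The values therefore need not match the example feature vector given earlier.

```python
def kmeans_labels(points, k, max_iter=50):
    """Return a cluster index for every 2-D point (plain Lloyd iterations)."""
    # Assumption: initial centroids are spread evenly through the point list.
    centroids = [points[i * len(points) // k] for i in range(k)]
    labels = []
    for _ in range(max_iter):
        labels = [
            min(range(k), key=lambda j: (p[0] - centroids[j][0]) ** 2
                                        + (p[1] - centroids[j][1]) ** 2)
            for p in points
        ]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append((sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members)))
            else:
                updated.append(centroids[j])
        if updated == centroids:
            break
        centroids = updated
    return labels


def feature_vector(binary, n_strips=7, k=5):
    """Build n_strips * k density values from a binary character image."""
    strip_height = len(binary) // n_strips
    features = []
    for s in range(n_strips):
        rows = range(s * strip_height, (s + 1) * strip_height)
        # Coordinates (x, y) of the ink pixels inside this horizontal strip.
        points = [(x, y) for y in rows
                  for x, value in enumerate(binary[y]) if value]
        if len(points) < k:
            features.extend([0.0] * k)  # too little ink to form k clusters
            continue
        labels = kmeans_labels(points, k)
        # Assumed density: fraction of the strip's ink pixels in each cluster.
        features.extend(labels.count(j) / len(points) for j in range(k))
    return features
```

With a 70-row, 50-column binary image, `feature_vector` returns a list of 35 numbers between 0 and 1, five per strip.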
For classification, the Euclidean distance method and the Support
Vector Machine are used and the results are compared.
The Euclidean distance method classifies by calculating the distance
between the test feature vector and the training feature vectors. The
training feature vectors are stored in a matrix that has 35 rows and
as many columns as there are training samples. The Euclidean distance
is calculated between the test feature vector and each training
feature vector individually. The nearest training feature vector is
the one with the lowest distance from the test feature vector, and the
character corresponding to it is reported. Results are calculated for
different numbers of samples of each character in the training data
set.
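The nearest-match rule described above amounts to a one-nearest-neighbour classifier. A minimal sketch follows; the function names and the labels used in the example are hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def classify_nearest(test_vector, train_vectors, train_labels):
    """Return the label of the training vector closest to test_vector."""
    distances = [euclidean(test_vector, t) for t in train_vectors]
    best = min(range(len(distances)), key=distances.__getitem__)
    return train_labels[best]
```

For instance, with training vectors `[[0.0, 0.0], [1.0, 1.0]]` labelled `"ka"` and `"kha"`, the test vector `[0.9, 0.8]` is assigned `"kha"` because it lies closer to the second training vector.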
Fig 5. SVM with optimal separating hyperplane
Here, the green line is the optimal hyperplane, which separates the
two sets with the maximum margin between the hyperplane and the
nearest data points.
The Euclidean distance approach calculates the standard distance
between two given points or vectors. If p and q are two vectors, the
Euclidean distance between them is calculated as
$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$
where n is the number of components of each vector (35 for the feature
vectors used here).
Training database   No. of samples   Test data   Result (%)
Sample set 1        36               290         58
Sample set 2        108              290         73.7
Sample set 3        180              290         80.7
Sample set 4        252              290         81.7
Table 1: Results of Euclidean distance method with different sample
sets.
As the sample set grows, the recognition percentage improves. Some
characters consistently perform well, that is, better than 75%. Table
2 shows the performance percentage of the characters, which is
calculated as
$$\text{Performance (\%)} = \frac{\text{Recognized characters}}{\text{Total characters for testing}} \times 100$$
Twenty-five characters perform well consistently for two or more
sample sets; the remaining characters perform badly. Fig 5(a) shows
the characters that perform well using the Euclidean distance
approach. Fig 5(b) shows the characters that do not perform well using
the Euclidean distance approach.
Characters with good performance (columns: Character, %)
Character: d  u  x  ?k  Fk  N  t  >  V  B  m
%: 100  94
Character: j  <
%: 83  100  75  100  100  100  75  100  83  98  100  80
Character: e  v  {k  y  b  Q  c  l  o  'k  "k
Bad performance (columns: Character, %)
Character: .k  r  /k  ;  ,  g  [k  p  M  i  n
%: 100  90  100  75  95  100  80  80  75  80  97
SVM classifies with a hyperplane, using as the decision surface the
hyperplane with the maximum margin between itself and the closest data
points. For classification with SVM, 140 characters are used for
training and the remaining 290 character images are used as test data.
The training data consists of handwritten characters stored as feature
vectors, and the test data is collected by storing each test character
in feature vector form. Using this technique, a recognition result of
95.86% is achieved.
$$\text{Performance (\%)} = \frac{\text{Total no. of recognized characters}}{\text{Total characters for testing}} \times 100$$
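As a quick check of this formula against the reported SVM result: with 290 test characters, a rate of 95.86% corresponds to 278 correctly recognized characters. The count 278 is inferred here from the percentage and is not stated in the paper.

```latex
\[
\text{Performance (\%)} = \frac{278}{290} \times 100 \approx 95.86
\]
```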
Some characters consistently perform well in recognition, that is,
better than 75%. These are shown in Fig 6.
%: 66
Columns: Character, %, Character, %
%: 60
Character: d  [k  x  ?k  p  N  t  >  V  B
%: 100
Character: M  <  r  Fk  n  /k  u  i  Q  c
%: 100  66  60  50  66  33  50  25  80  80
Table 2: Performance of Characters using Euclidean Distance Approach
for sample set 4
%: 100  83  100  87.5  100  100  100  100  100  100  100  75  90  66  100  100  100  100
Character: e  ;  j  y  o  'k  "k  l  g
%: 87.5
Character: .k  {k  v  b  ,  m
%: 100  94  80  91  100  100  100  90  100  100  83  80  100  83  100
Table 3: Performance of Characters using Support Vector Machine
Fig 6: Good performance characters by SVM method
Fig 5: (a) Good performance characters by Euclidean method, (b) Bad
performance characters by Euclidean method
The following table compares reported results of Devanagari character
recognition.
Author                          Result (%)
Sharma et al., 2006             80.36
Deshpande et al., 2008          82
Hanmandlu et al., 2007          90.65
Arora et al., 2010              90.74
U. Pal et al., 2007             94.24
U. Pal et al., 2008             95.13
U. Pal et al., 2009             95.19
Proposed technique using SVM    95.86
Table 4: Comparison table
Results using the Support Vector Machine are better than results using
Euclidean distance. With the Euclidean distance method, 25 characters
perform well, i.e. have a performance percentage above 75%. With SVM,
the performance of every character is better than 75% except one,
which is /k. The computation overhead of the Support Vector Machine is
also lower than that of the Euclidean distance approach. Classification
using SVM is therefore better than classification using the Euclidean
distance approach.
CONCLUSION
This paper presented handwritten Hindi character recognition based on
K-means clustering and SVM. K-means clustering reduces the size of the
feature vector, which makes the computation easier. Results were
calculated using two classification approaches, Euclidean distance and
the Support Vector Machine. Results using SVM are better than results
using Euclidean distance: the best result achieved using Euclidean
distance is 81.7%, whereas SVM with a linear kernel gives 95.86%.
REFERENCES
[1] M. Hanmandlu, O.V. Ramana Murthy, “Fuzzy model based
recognition of handwritten numerals” Science Direct, Pattern
Recognition, Vol. 40, Issue 6, pp. 1840-1854, June 2007.
[2] S. Arora, D. Bhattacharjee, M. Nasipuri, D. K. Basu, “Combining
Multiple Feature Extraction Techniques for Handwritten
Devnagari Character Recognition” IEEE Region 10 Colloquium
and the Third ICIIS, Kharagpur, India, December 8-10, 2008.
[3] J. Hou, Y. Liu, T. F. Zheng, J. Olsen, J. Tian, “Multi-layered
Features with SVM for Chinese Accent Identification”, IEEE
International Conference on Audio Language and Image
Processing, pp. 25-30, 23-25 Nov. 2010.
[4] Shen-Wei Lee, Hsien-Chu Wu, “Effective Multiple-features
Extraction for Off-line SVM-Based Handwritten Numeral
Recognition”, IEEE International Conference on Information
Security and Intelligence Control, pp. 194-197, 2012.
[5] D.C. Shubhangi, “Noisy English Character Recognition by
Combining SVM Classifier”, IEEE international Conference on
Information and Communication Technology in Electrical
Sciences, pp. 663-666, 2007.
[6] R. Ramanathan, S. Ponmathavan, N. Valliappan, “Optical
Character Recognition for English and Tamil Using Support
Vector Machines” IEEE International Conference on Advances
in Computing, Control, and Telecommunication Technologies,
pp. 610-612, 2009.
[7] D. Nasien, H. Haron, S. Sophiayati Yuhaniz, “Support Vector
Machine (SVM) for English Handwritten Character
Recognition”, IEEE International Conference on Computer
Engineering and Application, Vol. 1, pp. 249 – 252, 2010.
[8] A. Alaei, U. Pal and P. Nagabhushan, “Using Modified Contour
Features and SVM Based Classifier for the Recognition of
Persian/Arabic Handwritten numerals”, IEEE Seventh
International Conference on Advances in Pattern Recognition,
pp. 391 – 394, 2009.
[9] C. Zanchettin, B. L. D. Bezerra and W. W. Azevedo, “A KNN-SVM Hybrid Model for Cursive Handwriting Recognition”, IEEE World Congress on Computational Intelligence, International Joint Conference on Neural Network, June 10-15, 2012.
[10] N. Sharma, U. Pal, F. Kimura, and S. Pal, “Recognition of Off-Line Handwritten Devnagari Characters Using Quadratic Classifier”, ICVGIP, Springer, pp. 805-816, 2006.
[11] U. Pal, N. Sharma, T. Wakabayashi and F. Kimura, “Off-Line Handwritten Character Recognition of Devnagari Script”, ICDAR, IEEE, 2007.
[12] U. Pal, T. Wakabayashi, F. Kimura, “Comparative Study of Devnagari Handwritten Character Recognition using Different Feature and Classifiers”, ICDAR, IEEE, 2009.
[13] M. Hanmandlu, O.V. Ramana Murthy, Vamsi Krishna Madasu, “Fuzzy Model based recognition of handwritten Hindi characters”, DICTA, IEEE, 2007.
[14] Sabra El Ferchichi, Salah Zidi, Kaouther Laabidi, Moufida Ksouri, and Salah Maouche, “A New Feature Extraction Method Based on Clustering for Face Recognition”, IFIP AICT, pp. 247-253, 2011.
[15] Sajjad Pourmohammad, Reza Soosahabi, Anthony S. Maida, “An Efficient Character Recognition Scheme Based on K-Means Clustering”, IEEE, 2013.
[16] Karthik Sheshadri, Pavan Kumar T Ambekar, Deeksha Padma Prasad and Dr. Ramakanth P Kumar, “An OCR system for Printed Kannada using k-means clustering”, IEEE, 2010.
[17] Samit Biswas, Debnath Bhattacharyya, Tai-hoon Kim, and Samir Kumar Bandyopadhyay, “Extraction of Features from Signature Image and Signature Verification Using Clustering Techniques”, SUComS, Springer, CCIS 78, pp. 493-503, 2010.
[18] Nguyen van Huan, Nguyen Thi Hai Binh and Hakil Kim, “Eye Feature Extraction Using K-means Clustering for Low Illumination and Iris Color Variety”, ICCARV, IEEE, 2010.
[19] Isis Bonet, Yvan Saeys, Ricardo Grau Ábalo, María M. García, Robersy Sanchez, and Yves Van de Peer, “Feature Extraction Using Clustering of Protein”, CIARP, Springer, pp. 614-623, 2006.
[20] Dr. P. S. Deshpande, Latesh Malik, Sandhya Arora, “Fine Classification & Recognition of HandWritten Devnagari Characters with Regular Expressions & Minimum Edit Distance Method”, Journal of Computers, Vol. 3, No. 5, May 2008.
[21] Eyal Kolman and Michael Margaliot, “A New Approach to Knowledge-Based Design of Recurrent Neural Networks”, IEEE Transaction on Neural Networks, Vol. 19, Issue 8, pp. 1389-1401, August 2008.
[22] R. Jayadevan, Satish R. Kolhe, Pradeep M. Patil, and Umapada Pal, “Offline Recognition of Devanagari Script: A Survey”, IEEE Transaction Vol. 41, Issue 6, pp. 782-796, 2011.
[23] S. Kumar, “Performance comparisons of features on devanagari hand-printed dataset”, IJRT, Vol. 1, pp. 33-37, 2009.