A benchmark image database of isolated Bangla handwritten

IJDAR (2014) 17:413–431
DOI 10.1007/s10032-014-0222-y
ORIGINAL PAPER
A benchmark image database of isolated Bangla handwritten
compound characters
Nibaran Das · Kallol Acharya · Ram Sarkar ·
Subhadip Basu · Mahantapas Kundu · Mita Nasipuri
Received: 23 July 2012 / Revised: 9 April 2014 / Accepted: 25 April 2014 / Published online: 23 May 2014
© Springer-Verlag Berlin Heidelberg 2014
Abstract In the present work, we present a benchmark
image database of isolated handwritten Bangla compound
characters, used in the standard Bangla literature. A thorough survey over more than 2 million Bangla words has
revealed that there exist around 334 compound characters in
Bangla script. Of these, only around 171 character classes
form unique pattern shapes, and some of these classes are
often written in multiple styles. Altogether, 55,278 isolated
character images, belonging to 199 different pattern shapes,
are collected using three different data collection modalities.
The database is divided into training and test sets in 4:1 ratio
for each pattern class, by considering a balanced distribution of shapes from different modalities. A convex hull and
quadtree-based feature set has been designed, and the test set
recognition performance is reported with the support vector
machine classifier. We have achieved a recognition accuracy
of 79.35 % on the test database consisting of 171 character classes. The complete compound character image database is freely available as CMATERdb 3.1.3.3 from the website http://code.google.com/p/cmaterdb/, which may facilitate research on handwritten character recognition, especially
related to Bangla form document processing systems.
Keywords OCR · Handwritten character recognition ·
Bangla Compound character · Benchmark database · SVM
Electronic supplementary material The online version of this
article (doi:10.1007/s10032-014-0222-y) contains supplementary
material, which is available to authorized users.
N. Das · K. Acharya · R. Sarkar · S. Basu · M. Kundu ·
M. Nasipuri (B)
Computer Science and Engineering Department,
Jadavpur University, Kolkata 700032, India
e-mail: mnasipuri@cse.jdvu.ac.in
1 Introduction
Since the early days of machine learning research, optical character recognition (OCR) has remained a popular and well-studied problem domain, often guided by industrial demands
[1]. Apart from commercial applications, character recognition is also considered a benchmark problem in pattern recognition research, especially in relation to the recognition of handwritten text. Cheriet et al. [2] gave an elaborate
overview of handwritten OCR research during the last couple of decades. It may be observed that handwritten character recognition for Latin script has attained considerable
maturity, and a number of research groups have contributed significantly to this developmental process [2]. Ample research
has also been done on the recognition of handwritten characters of Chinese [3,4], Japanese [5], Korean [6,7] and Arabic
[8] scripts.
Research efforts on Indian script OCR [9–11], covering Devanagari [12], Tamil [13], Oriya [14] and Bangla [15–23], have started to receive attention over the last decade. In India, Bangla is the second most popular script and language after Devanagari [24]. As a script, it is used for Bangla, Ahamia, Manipuri, Daphla, Garo, Allam, Khasi, Mizo, Munda, Naga, Rian and Santali languages. Moreover, Bangla, which is also the national language of Bangladesh, is the sixth most popular language in the world [25]. Around 230 million people speak Bangla all over the world. Despite its popularity and importance, reports of research on handwritten Bangla OCR involving the complete set of characters are few in number [26]. One possible reason is the limited availability of benchmark databases, especially for the complete Bangla character set. Another possible reason may be linked to the complex nature of the character set, which consists of 10 digits, 50 basic characters (11 vowels and 39 consonants), and around 334 compound characters. Figure 1a, b shows sample images of handwritten Bangla basic characters and digit symbols. To add to the complexity, when a vowel appears alongside a consonant, the vowel itself takes a modified shape called a vowel allograph. There are 10 such vowel allographs in Bangla script, each of which appears on top of, to the left of, to the right of, simultaneously to the left and right of, or at the bottom of the associated consonant(s), as shown in Fig. 2. A compound character, an important and integral part of the Bangla language system, is a complex-shaped character which consists of two or more consonants, occasionally followed by a vowel, and pronounced simultaneously in words of the Bangla language. Sometimes the shape of the compound character is so complex that it becomes difficult to identify the constituent consonants individually. Figure 3 shows a list of handwritten samples of Bangla compound characters with the corresponding Unicode compatible shapes.
Fig. 1 a Handwritten Bangla basic characters. b Handwritten numerals of Bangla script
From Fig. 3, it is evident that the shapes are very complex, and some pairs of compound characters resemble each other so closely that the only differences left between them are some additional strokes such as short straight lines, circular curves, etc. It is often difficult to identify those characters without analyzing the context, especially in handwritten documents. A set of such closely resembling character pairs is shown in Table 1. The ten most frequently occurring Bangla compound characters, their respective printed forms, composition, handwritten samples, usage patterns in words and frequency of occurrence (percent) in standard literature are shown in Fig. 4. The complete table is also included in the supplementary document (S-Table 1).
Fig. 2 Different vowel allographs with a consonant
In two special cases of compound characters, one of the constituent consonants takes a special modified shape, namely Ya-phala [ ] and Reph. In Ya-phala, the second of the two consonants joined together is Ya [ ], whereas in Reph, the first consonant is Ra [ ]. Like the vowel allographs, the Ya-phala appears to the right of, and the Reph appears on top of, the associated consonant without changing its shape. Since the associated consonant characters maintain their original shapes in the formation of the compound characters with Ya-phala and Reph, these two modifiers are also referred to as consonant allographs. Figure 5 shows typical usage patterns of the aforementioned two consonant allographs. Due to this special distinction, although these two categories of compound characters are considered within the overall compound character set of Bangla script, they are not considered in the preparation of the current database.
Despite the existence of a wide variety of compound characters, their frequency of appearance in any text page is much lower than that of basic characters. Interestingly, out of this large number of compound characters, some are rarely used and some have become obsolete. Moreover, in 1997, the West Bengal Bangla Academy [27] introduced new shapes/glyphs to represent Bangla compound characters through their Bangla dictionary “Akaademi Banan Abhidhan” [28]. The objective of their effort was to simplify the complex shapes for easier understanding of the compound characters. Nowadays, in India, Bangla textbooks mainly follow these glyphs for compound characters. However, many newspapers and publishing houses do not always follow these new standards, as common people are still more comfortable with the old style of writing Bangla compound characters. All these modifications add newer variations in handwritten character patterns, leading to further complexities in OCR of handwritten Bangla compound characters. To address this issue, we have considered multiple shapes for many compound character classes.
It is evident from the discussion that recognition of handwritten compound characters is a complex pattern recognition problem. Moreover, a major obstacle in pursuing research in this domain is the non-availability of any comprehensive public domain database. Unlike handwritten numerals and basic characters, collecting adequate isolated data samples of compound characters is an uphill task. The need for a benchmark database for handwritten Bangla compound characters has long been felt by the Bangla OCR research community. Without one, the prospect of a complete handwritten Bangla OCR system would remain elusive. The current work reported in this paper is an attempt in that direction.
Fig. 3 List of 171 handwritten samples of Bangla compound characters with corresponding printed glyphs
1.1 A brief review on handwritten character databases
Table 1 A set of randomly selected, closely resembling character pairs
Fig. 4 Formation of some frequently occurring Bangla compound characters and their usage in standard Bangla literature (frequency axis in percent)
Benchmark databases are essential to compare the performances of different handwritten character recognition techniques over an even test bed. The most popularly used databases for Latin numerals, characters and words are NIST
[29], IAM-DB [30], CEDAR [31], MNIST [32], OCR [33],
etc. Among these, handwritten numerals databases like NIST
[29] and MNIST [32] were designed to advocate specialized
applications, such as recognition of postal code. IAM-DB
[30] database is constructed with labeled handwritten Latin
script in the form of sentences, lines and words. CEDAR
[31] database consists of city names, state names, ZIP codes
and alphanumeric characters, and the OCR [33] database is
based on handwritten words of Latin script. Apart from the
Latin script databases, there are databases of other scripts as well [34–48]. Benchmark databases are available for Korean [41], JIS Chinese script [45], Japanese script [42], Arabic script [37,39], etc. The ICDAR 2009 Handwriting Segmentation Contest [49] also provided a set of handwritten document pages written in Latin script for English, Greek, French and German languages. A limited public domain database (accessible on request) of isolated digits and basic characters is also available [50] for some of the handwritten Indic scripts, viz. Bangla, Devanagari and Oriya. Bhattacharya et al. [51] also provided two updated databases of handwritten numerals for Devanagari and Bangla scripts. The Bangla numeral database, available from [50,51], consists of 19,392 training and 4,000 test samples written by 1,106 persons. A Bangla basic character database [52] is also provided by Bhattacharya et al. The database, available in [50], consists of 37,858 isolated character images, of which 25,000 are considered for training and 12,858 are marked as test samples. Recently, Das et al. [53] published four different handwritten numeral databases of Indic script, viz. Bangla, Devanagari, Arabic and Telugu. A database of unconstrained handwritten Bangla and Bangla–English mixed script document images has been prepared recently by Sarkar et al. [54]. This database contains 150 handwritten document pages, of which 100 pages are written only in Bangla script and the remaining 50 pages are written in Bangla script mixed with English words. They also provided ground truth data for text line segmentation along with a semi-automatic ground truth generating tool.
However, no standard Bangla compound character database is yet available in the public domain for research purposes. It is worth mentioning here that the complete compound character database available at [50] is not downloadable. Only sample compound characters are available for download from the website. To bridge this resource–requirement gap, a complete compound character database, developed by us, is uploaded to http://code.google.com/p/cmaterdb/ for free download.
As mentioned before, the primary objective of the present work is to create a standard database for handwritten Bangla compound characters along with a detailed analysis of their occurrence patterns in standard printed Bangla literature. We also report a benchmark recognition performance on this newly prepared database. We hope that our effort may help the research community to develop a much-awaited, complete handwritten Bangla character recognition system in the near future.
Fig. 5 Typical character formations with the two consonant allographs Ya-phala and Reph
2 Database preparation under the present work
It has already been mentioned that apart from 50 basic characters and vowel allographs, Bangla alphabet is enriched
with more than 300 compound characters. Bangla compound
characters have evolved into their present shapes and structures over hundreds of years. Some compound characters
have become obsolete at present, and some have undergone
modification in shape over time. For some compound
characters, more than one shape exists simultaneously. So
identifying the proper compound character shapes is itself
a challenging task for the researchers. Therefore, the present
work on database preparation is preceded by a survey of the
frequency of occurrence of Bangla compound characters in
standard Bangla literature.
2.1 Survey on the usage of Bangla compound characters
The concept of compound characters in Bangla script originated from the Brāhmī script, the mother of the Indo-Aryan scripts [55]. Devanagari and the other major Indic scripts such as Gujarati, Telugu, Bangla, Oriya, Kannada, Gurmukhi, Tamil, Malayalam, etc. are abugidas, i.e., most symbols stand for a consonant plus an inherent vowel (usually the sound /a/) [56]. Here, the orthographic syllable starts with a consonant (C) or a sequence of consonants and ends with an inherent vowel (V), and generally follows a canonical structure of (((C)C)C)V. The formation of compound characters in the Bangla language also maintains the above canonical structure, which is almost similar to that of the other languages whose scripts originated from Brāhmī script [57,58]. However, the number of compound characters in Bangla is much higher than that in Devanagari. This is due to the fact that many modern words, coming from different languages, require compound characters which do not exist in the original Bangla script. Bangla is continuously getting enriched by including these words from other languages. The new words require new sets of compound characters for writing them in Bangla. As Bangla is a phonetic language, the phonetic structure of Bangla is mainly responsible for the formation of compound characters. Bangla has a canonical SOV (Subject–Object–Verb) structure compared to the SVO (Subject–Verb–Object) structure of English. In [59], Dash presents the relationship among the graphemes of different languages whose scripts originated from Brāhmī script, including Bangla. According to Dash, there are more than four hundred (400+) compound characters available in Bangla script, of which about three hundred and twenty (∼320) are made of two consonant graphemes, around seventy (∼70) are made of three consonant graphemes and about ten (∼10) are made of four consonant graphemes. However, many of these compound characters are rarely used in modern Bangla literature.
A thorough survey has been conducted under the present work with the alphabetic characters collected from three popular Bangla newspapers, viz. Anandabazar Patrika [60], Aajkal [61] and Bartaman [62]; two popular Bangla magazines, Anada-Mela and Desh [63]; and some Bangla websites [64–66]. We have considered around 2.4 million words, consisting of around 8 million characters, to generate the corpus. It is worth mentioning that most of these samples were collected from Unicode-encoded digital archives and analyzed using a self-developed automated tool. The digital corpus is made available at our website http://code.google.com/p/cmaterdb/. From the survey, we have found that around 6.61 % of the characters (of the total corpus) are compound characters. It has also been observed that 26.25 % of the words are constructed with at least one compound character. The percentage of words having at least one compound character for the different groups of word lengths is shown in Fig. 6. Among those compound characters, around 28.38 % are formed with the two consonant allographs Ya-phala [ ] and Reph [ ], or are obsolete compound characters. The remaining 71.62 % of the compound characters are actually considered for preparing the current database.
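The survey tool itself is not described in the paper. As a rough sketch of how such a count could be made on Unicode text, the following Python snippet assumes that a compound character is encoded as two or more Bengali consonant letters joined by the hasanta/virama sign (U+09CD). The consonant range, the function names and the sample words are illustrative assumptions, and the sketch does not apply the exclusions for Ya-phala, Reph or obsolete characters discussed above.

```python
import re
from collections import Counter

# Assumed code points: Bengali consonant letters and the hasanta (virama) sign.
CONS = "\u0995-\u09B9\u09DC\u09DD\u09DF"
VIRAMA = "\u09CD"
# A conjunct is a consonant followed by one or more (virama + consonant) pairs.
CONJUNCT = re.compile(f"[{CONS}](?:{VIRAMA}[{CONS}])+")


def find_conjuncts(word):
    """Return every consonant conjunct found in one Unicode word."""
    return CONJUNCT.findall(word)


def consonant_count(conjunct):
    """Number of consonant graphemes in a conjunct (virama signs + 1)."""
    return conjunct.count(VIRAMA) + 1


def survey(words):
    """Count conjuncts by size and the share of words containing at least one."""
    by_size = Counter()
    words_with_conjunct = 0
    for word in words:
        found = find_conjuncts(word)
        if found:
            words_with_conjunct += 1
        for conjunct in found:
            by_size[consonant_count(conjunct)] += 1
    share = words_with_conjunct / len(words) if words else 0.0
    return by_size, share
```

Applied to a word list, `survey` returns a histogram keyed by the number of constituent consonants together with the fraction of words that contain at least one conjunct, which are the kinds of quantities reported in this subsection.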
During the survey, we have imposed some constraints on
the collected characters to overcome fallacious observations
and improper deductions about the shapes and percentage of
usage of characters. The constraints are discussed below.
(a) None of the Bangla punctuation marks, symbols and
numerals is considered as a character.
(b) Vowel graphemes, consonant graphemes, vowel allographs and diacritic symbols are considered as noncompound characters for the current survey.
Fig. 6 Distribution of compound character words of different lengths in the total word corpus, and the possibility of having at least one compound character in a word
(c) A compound character is considered as a combination of
two or more consonant graphemes.
(d) A compound character with a vowel allograph is also treated as a new compound character if the combination of characters possesses a different shape of the vowel allograph (Example: ).
(e) A consonant with a consonant allograph is also considered a new compound character during the survey (Example: ).
We have found around 334 compound characters from the survey conducted on the newly developed digital text corpus database. According to our survey, 218 compound characters are formed with two consonant graphemes, 112 are constructed using three consonant graphemes and 4 are formed using four consonant graphemes.
It is also observed that both the consonant allographs, Ya-phala [ ] and Reph [ ], are prevalent in Bangla literature. As already mentioned in Sect. 1, neither of these modifiers changes the original shape of the associated consonant (see Fig. 5), and both are therefore excluded from the current database. More specifically, the Ya-phala appears to the right of the main character, touching the common matra or Sirorekha of the word, whereas the Reph appears above the matra of the character shapes, rarely touching the main character.
Since we have generated a large corpus automatically from digital contents, some misspelled compound characters were initially considered in the survey. Some obsolete characters also came to our notice during the survey. Those misspelled and obsolete characters are not considered for this work. As a result of these considerations, the total number of compound characters used under this work came down to 171. A list of the top 20 Bangla compound characters (according to their frequency of usage) is shown in Table 2 along with their cumulative frequencies in standard Bangla literature. The survey also revealed that for some compound characters like nya-ja ( ), nga-ka ( ), etc., people are accustomed to using more than one shape. Therefore, after detailed analysis, we have considered another 28 shapes, totaling 199 shapes of compound characters for our current work. Table 3 shows a list of handwritten Bangla compound characters having multiple shapes.
2.2 Data collection and preprocessing
Data collection is one of the difficult tasks in any pattern recognition research. It becomes even more challenging when the number of classes is too high (199 character shapes for our Bangla handwritten compound character database). Three different data collection modalities are
adopted in the present work, to cover a wide spectrum of
writing styles. Please note that handwritten text is usually
written either in structured documents (like pre-formatted forms) or in an unconstrained fashion. In the case of Bangla
script, because of the presence of ascendants (characters
or their parts appearing above the matra) and descendants
(characters or their parts appearing below the main character), standardized form documents are not yet popular. In
many cases, semi-structured layouts are used for data-entry
where words or text lines are written in rectangular boxes.
Therefore, we have collected handwritten compound character samples using both categories of data collection sheets. In
one type of sheets, the participating volunteers were asked to
write the isolated compound characters in individual boxes,
whereas in the other case, the volunteers were asked to write
a complete word (containing compound characters). Sample images of these two categories of data collection sheets
are shown in Fig. 7a, b. As the third type of data collection modality, unconstrained Bangla handwritten pages from our public domain repository [67] are used (see Fig. 7c). Compound characters appearing in the text are manually cropped to populate the database.
Table 2 Formation of the top 20 most frequently occurring compound characters and their corresponding frequencies of appearance in recent Bangla literature and news items
For the present work, formatted data sheets were designed at the Center for Microprocessor Application for Training Education and Research (CMATER), a research laboratory at the Computer Science and Engineering department of Jadavpur University, India, and were used to collect data from native Bangla writers of different age, sex and educational background groups. Black- or blue-ink pens with 0.5 or 0.7 mm tips were used for writing on those data sheets, which were printed on 70 GSM paper. A pie chart describing the distribution of the age groups of the writers who participated in the data collection drive is shown in Fig. 8. It was also observed that the writers' education level varies with age. In our survey, 22 % of the writers were school-going students, and the remaining 78 % were either under-graduate/graduate students or graduate degree holders. About 50 % of the entire data were collected from the students, teachers and staff of different universities, colleges and schools, another 30 % from family members, friends and relatives, and the remaining 20 % were collected randomly from public places, on request.
Table 3 The list of handwritten Bangla compound characters having multiple shapes
Collected data sheets (from both the modalities) were optically scanned in gray scale at a resolution of 300 dpi using an HP F380 flatbed scanner. Extraction of data from the digitized pages containing handwritten words was done manually. However, manual extraction of the large number of isolated handwritten characters from the other category of data collection sheets would be an uphill task. For this purpose, we have binarized each such digitized data sheet using a simple image-specific global threshold value, which is computed as (min_intensity + max_intensity)/2. From the binarized data sheet images, the coordinates of the gridlines separating individual characters are automatically identified using the page layout analysis algorithm described below. Since the data sheets were collected under manual supervision, this pixel-histogram-based page layout analysis algorithm works well for most of the data sheets. In some extreme cases, however, we had to use manual intervention to extract the data part from the scanned data sheets. Please also note that the grayscale images of the characters are extracted from the original grayscale data sheet images using the coordinates of the grid lines identified by the algorithm.
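As an informal illustration of the binarization rule above and of the profile-based gridline search that the algorithm steps of this subsection formalize, a minimal pure-Python sketch is given here. It is not the authors' implementation: the function names, the default values of `delta` and `gap`, and the representation of an image as a list of rows are assumptions made for the example.

```python
def global_threshold(gray):
    """Image-specific threshold: midpoint of the darkest and brightest pixel."""
    flat = [p for row in gray for p in row]
    return (min(flat) + max(flat)) / 2


def binarize(gray):
    """Return I[i][j] = 1 for dark (ink) pixels and 0 otherwise."""
    t = global_threshold(gray)
    return [[1 if p < t else 0 for p in row] for row in gray]


def line_centres(profile, limit, gap):
    """Cluster the indices whose profile value exceeds `limit`.

    Neighbouring hits closer than `gap` join the same cluster; the mean
    index of each cluster is returned as the gridline centre.
    """
    hits = [k for k, value in enumerate(profile) if value > limit]
    clusters = []
    for k in hits:
        if clusters and k - clusters[-1][-1] < gap:
            clusters[-1].append(k)
        else:
            clusters.append([k])
    return [sum(c) / len(c) for c in clusters]


def grid_cells(binary, delta=0.5, gap=3):
    """Return (top, bottom, left, right) gridline centres for every cell."""
    n_rows, n_cols = len(binary), len(binary[0])
    # Vertical profile: black pixels per column; horizontal: per row.
    p_v = [sum(binary[i][j] for i in range(n_rows)) for j in range(n_cols)]
    p_h = [sum(row) for row in binary]
    xs = line_centres(p_v, delta * n_rows, gap)
    ys = line_centres(p_h, delta * n_cols, gap)
    return [(ys[a], ys[a + 1], xs[b], xs[b + 1])
            for a in range(len(ys) - 1)
            for b in range(len(xs) - 1)]
```

The returned coordinates are cluster means and may be fractional, so they would need rounding before being used to crop the original grayscale sheet. In practice a small inward margin would probably also be needed so that the gridlines themselves are not included in the cropped character images.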
Algorithm:
Step 1 Let $I_{N \times M}$ be the binary image with $N$ rows and $M$ columns, where $I[i,j] \in \{0,1\}$ and 1 denotes a black pixel. Calculate the vertical pixel density profile $P_V$ as the sum of black pixels along each column of the image: $P_V[j] = \sum_{i=0}^{N-1} I[i,j]$, where $j = 0, \ldots, M-1$.
Step 2 Calculate the horizontal pixel density profile $P_H$ as the sum of black pixels along each row of the image: $P_H[i] = \sum_{j=0}^{M-1} I[i,j]$, where $i = 0, \ldots, N-1$.
Step 3 Consider the vertical profile vector $P_V$. If, for a particular column $j$, $P_V[j] > \delta \cdot N$, where $\delta$ is a heuristically chosen constant ($0 < \delta < 1$), then that column of pixels is treated as part of a vertical line on the data sheet separating the characters. Consecutive columns that satisfy this criterion form one column cluster, comprising multiple vertical columns of pixels. Each column cluster represents a vertical gridline. Let $C_{th}$ be the gap threshold for inclusion in the same column cluster: two candidate columns of pixels ($j_1$ and $j_2$) are said to be within the same column cluster if $|j_1 - j_2| < C_{th}$, where $|\cdot|$ signifies the difference in column numbers. Now, compute the center line of each column cluster by simply averaging the column numbers in the cluster.
Step 4 Consider the horizontal profile vector $P_H$. If, for a particular row $i$, $P_H[i] > \delta \cdot M$, then that row of pixels is treated as part of a horizontal separating line on the data sheet. Consecutive identified rows of pixels form a single row cluster representing a horizontal gridline. Let $R_{th}$ be the gap threshold for inclusion in the same row cluster: two candidate rows ($r_1$ and $r_2$) are said to be within the same row cluster if $|r_1 - r_2| < R_{th}$, where $|\cdot|$ signifies the difference in row numbers. The center line of each row cluster is computed in the same way as for the vertical gridlines.
Step 5 Identify the coordinates of each cell containing handwritten characters from adjacent pairs of vertical and horizontal gridlines.
Step 6 The coordinates of each cell are used to crop the original grayscale image.
Fig. 7 Sample images of three different categories of data collection sheets. a Isolated compound characters. b Complete words containing compound characters. c Unconstrained Bangla handwritten page containing compound characters
Fig. 8 Participation statistics of the persons in the data collection drive according to their age
2.3 Description of the present database
In the current work, a total of 335 persons voluntarily participated in the data collection drive. A total of 354 sheets were processed using the aforementioned methodology to generate 42,959 isolated handwritten compound character images. Please note that many handwritten samples had to be discarded because of writing errors such as the use of inconsistent glyphs, overwriting, etc. For handwritten words, 122 data sheets were processed to generate 10,270 compound characters from 9,829 isolated word images. As the third category of the data collection modality, 150 handwritten pages were considered from the CMATERdb 1.1.1 and CMATERdb 1.2.1 databases to generate 2,049 isolated compound characters.
All the extracted characters from the three modalities (55,278 in total) are then divided randomly into training and test sets, approximately in the ratio 4:1 for each shape pattern. The final training set consists of 44,152 compound characters, whereas the test set consists of 11,126 characters. As discussed before, the number of Bangla compound characters considered here is 171. However, due to the presence of more than one grapheme for a single class, the total number of pattern classes is 199. The number of characters per pattern class is not equal and lies between 125 and 474 samples. The average number of samples per pattern class is around 277, with a standard deviation of 73.77.
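The exact splitting procedure is not given in the paper. One simple way to obtain an approximately 4:1 random split within each group of samples is sketched below in Python; the function name, the `train_share` and `seed` parameters, and the idea of using (shape, modality) tuples as group keys are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def split_per_group(keys, train_share=0.8, seed=0):
    """Randomly split sample indices into train/test within each group.

    `keys[k]` is the group of sample k, for example a pattern-shape label
    or a (shape, modality) tuple if modalities should also be balanced.
    """
    by_group = defaultdict(list)
    for index, key in enumerate(keys):
        by_group[key].append(index)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for key in sorted(by_group):
        indices = by_group[key]
        rng.shuffle(indices)
        cut = round(len(indices) * train_share)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return train, test
```

Grouping by pattern shape alone gives the per-class 4:1 ratio; grouping by (shape, modality) pairs would additionally keep the three collection modalities proportionally represented in both sets.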
Table 4 Statistical analysis of the data samples of the first 10 pattern classes of Bangla compound characters collected in isolated fashion
(a) For training set
Class  Count  Hmax  Hmin  Wmax  Wmin  Amax    Amin   ARmax  ARmin  Havg   Wavg    Aavg      ARavg  SDH    SDW    SDA       SDAR
1      209    124   45    195   40    21,080  1,880  3.12   0.55   70.76  95.47   6,974.63  1.37   14.98  33.98  3,606.95  0.45
2      172    132   32    218   46    28,776  1,696  2.47   0.63   74.65  92.45   7,303.78  1.26   20.09  34.42  4,511.61  0.39
3      199    123   38    190   45    20,068  2,090  3.42   0.64   67.4   97.84   6,889.84  1.48   17.69  30.34  3,673.98  0.4
4      215    132   32    204   43    22,939  1,760  3.02   0.55   66.2   91.53   6,442.29  1.4    18.45  33.99  4,099.19  0.4
5      206    137   43    214   43    23,736  2,484  2.93   0.45   75.94  93.26   7,528.87  1.24   19.74  36.62  4,833.06  0.38
6      199    137   44    215   44    28,165  2,816  2.53   0.59   78.79  100.58  8,308.41  1.29   18.11  35.35  4,693.85  0.36
7      208    132   45    212   40    21,106  2,160  2.24   0.59   73.48  82.3    6,461.62  1.1    16.34  35.12  4,288.62  0.32
8      200    136   44    216   48    29,376  2,112  2.25   0.6    80.19  96.91   8,167.15  1.21   17.56  34.65  4,689.18  0.31
9      104    141   47    220   48    27,477  3,216  2.24   0.72   77.89  92.49   7,498.65  1.19   15.29  32.26  4,037.33  0.3
10     132    136   40    213   46    28,968  2,352  2.4    0.73   73.92  102.96  8,207.75  1.39   20.66  40.76  5,527.38  0.38
(b) For test set
Class  Count  Hmax  Hmin  Wmax  Wmin  Amax    Amin   ARmax  ARmin  Havg   Wavg    Aavg      ARavg  SDH    SDW    SDA       SDAR
1      52     115   45    207   48    18,837  2,352  3.25   0.71   69.29  95.27   6,867.37  1.4    16.51  37.32  3,851.31  0.51
2      43     144   38    210   34    30,240  1,292  2.3    0.64   73.26  88.93   6,932.26  1.22   19.23  33.87  4,836.05  0.35
3      49     112   37    187   52    20,570  1,924  2.69   0.82   67.61  93.82   6,543.14  1.44   17.69  29.26  3,326.27  0.45
4      53     131   38    211   43    27,641  2,150  2.45   0.69   73.04  104.6   8,502.77  1.43   24.31  46.4   6,339.71  0.39
5      51     136   54    166   51    21,216  3,432  1.96   0.68   77.88  89.65   7,334.92  1.16   17.69  29.21  4,129.54  0.28
6      49     122   51    177   53    20,886  3,162  2.09   0.82   79.04  101.37  8,419.49  1.28   17.59  30.63  4,363.11  0.26
7      51     118   41    176   46    20,768  2,444  1.84   0.6    69.61  71.27   5,237.86  1.02   14.07  29.17  3,532.25  0.28
8      49     122   53    219   47    22,204  2,867  2.26   0.69   82.1   97.24   8,270.96  1.19   16.36  34.34  4,230.72  0.36
9      25     138   59    163   52    18,354  3,744  2.03   0.72   77.6   91.28   7,363.44  1.18   16.7   27.36  3,781.42  0.29
10     33     144   43    221   59    31,104  2,891  2.22   0.71   76.3   109.88  9,211.76  1.44   24.15  45.8   6,759.26  0.38
Some standard statistical measures are employed to analyze different structural characteristics of the characters for each of the pattern classes in both the training and test databases. More specifically, the maximum, minimum and average of the heights and widths of the bounding rectangles of character images for each pattern class have been estimated. These estimations give a notion of the average spatial dimension of the patterns of a class. The maximum, minimum and average of the aspect ratios of the character images for each pattern class have also been estimated to add more information regarding the shapes of the characters in the collected data samples.
Let us assume that for a pattern class $C_i$ ($1 \le i \le 199$), the total number of samples is $N_i$. Let the height and width of the bounding box of the $j$th sample in $C_i$ be $h_{ij}$ and $w_{ij}$, respectively. The statistical measures specifying the structural characteristics of the character shapes in $C_i$ are: maximum height ($H_i^{max}$), minimum height ($H_i^{min}$), maximum width ($W_i^{max}$), minimum width ($W_i^{min}$), maximum area ($A_i^{max}$), minimum area ($A_i^{min}$), maximum aspect ratio ($AR_i^{max}$), minimum aspect ratio ($AR_i^{min}$), average height ($H_i^{Avg}$), average width ($W_i^{Avg}$), average area ($A_i^{Avg}$) and average aspect ratio ($AR_i^{Avg}$). The standard deviations of height, width, area and aspect ratio are denoted by $SD_H$, $SD_W$, $SD_A$ and $SD_{AR}$, respectively, for the present work.
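For readers who wish to reproduce such per-class measures on their own samples, a small Python sketch follows. It rests on assumptions that the text does not state explicitly: area is taken as height times width of the bounding box, aspect ratio as width divided by height (which appears consistent with the averages in Table 4), and the standard deviation as the population form. The function name and the dictionary layout are illustrative.

```python
from statistics import mean, pstdev


def class_statistics(boxes):
    """Summarize one pattern class from its bounding boxes.

    `boxes` is a list of (height, width) pairs, one per sample.
    Returns max, min, average and standard deviation for height (H),
    width (W), area (A) and aspect ratio (AR).
    """
    heights = [h for h, w in boxes]
    widths = [w for h, w in boxes]
    areas = [h * w for h, w in boxes]      # assumed: area = height * width
    ratios = [w / h for h, w in boxes]     # assumed: aspect ratio = width / height

    summary = {}
    for name, values in (("H", heights), ("W", widths),
                         ("A", areas), ("AR", ratios)):
        summary[name] = {
            "max": max(values),
            "min": min(values),
            "avg": mean(values),
            "sd": pstdev(values),          # assumed: population standard deviation
        }
    return summary
```

Calling `class_statistics` once per pattern class, separately for the training and test samples, would yield one row of a table laid out like Table 4.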
Table 4 lists the values of the abovementioned measures for some representative character classes taken from the original database, for the samples collected in isolated fashion. Along with these, the number of patterns per class together with their frequency of appearance in standard Bangla literature is also specified. The complete table, with the values for all 199 pattern classes, is given in the supplementary document (see S-Table 2). The supplementary document is also uploaded as a supplementary file at http://code.google.com/p/cmaterdb/ along with the database.
Database nomenclature is often important for future reference by the research community. Following our predefined naming convention, we have named the current database CMATERdb 3.1.3.3. Here, “CMATER” stands for Center for Microprocessor Applications for Training Education and Research, a research laboratory of Jadavpur University, India, and “db” stands for database. The leftmost “3” indicates a handwritten character database, the next “1” indicates Bangla script, the next “3” indicates a compound character database, and the rightmost “3” indicates the third release of the gray-level images of Bangla compound characters.
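As a small illustration of the naming convention, the four numeric fields of a name such as CMATERdb 3.1.3.3 can be split apart programmatically. The function name and the dictionary keys below are hypothetical labels chosen for this sketch. Only the meanings of the specific values 3, 1, 3 and 3 are stated in the text, so no other codes are interpreted, and names with a different number of fields (for example CMATERdb 1.1.1) are rejected.

```python
def decode_cmaterdb_name(name):
    """Split a four-field CMATERdb name into its numeric fields.

    The key names are illustrative labels, not official terminology.
    """
    prefix = "CMATERdb"
    if not name.startswith(prefix):
        raise ValueError("not a CMATERdb name: %r" % name)
    fields = name[len(prefix):].strip().split(".")
    if len(fields) != 4:
        raise ValueError("expected four dot-separated fields: %r" % name)
    keys = ("database_type", "script", "character_set", "release")
    return dict(zip(keys, (int(f) for f in fields)))
```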
3 Performance analysis on the developed database
Some research has already been done on recognition of handwritten basic characters, vowel allographs and numerals of
Bangla scripts. In one of the recent works, Basu et al. [15]
developed a hierarchical technique for recognition of Bangla
basic characters and modified shapes. Some more research
contributions related to this topic are also available in the literature [19,21,68,69]. Handwritten Bangla numeral recognition is also a well-studied problem, especially in the context of postal automation in India and Bangladesh [70,71].
Recently, Bhattacharya et al. [51] developed a comprehensive database of 23,392 handwritten isolated Bangla digit
samples and reported their benchmark accuracy on that database. Despite such efforts, the OCR of Bangla script remains
incomplete without detailed research on compound characters. There are three earlier instances of work on handwritten Bangla compound characters [16,26,72] in the literature.
In [16], Pal et al. used the Modified Quadratic Discriminant Function (MQDF) as the classifier. The directional information obtained from the arc tangent of the gradient is used there to form the feature set for the characters. Using fivefold cross-validation, they obtained around 85.90 % accuracy on a database of Bangla compound characters containing 20,543 samples. The number of compound character classes considered there is 138, which had been selected on the basis of older statistics published in the 1990s [73,74].
In one of our earlier works [72], a technique for recognition of handwritten compound characters was proposed, with a discussion on the potential problems of Bangla compound character recognition. The technique advocated incrementally expanding the number of learned character classes from the more frequently occurring to the less frequently occurring ones. As an initial step, 55 character classes covering 90 % of the total compound character usage were used there for recognition. An average recognition rate of 84.67 % was observed on the database after threefold cross-validation, using MLP-based classifiers and 84 quadtree-based longest run features. In another work [26], we considered a combination of handwritten Bangla basic and compound characters simultaneously for recognition. For that work, a total of 93 character classes, including 50 basic and 43 compound characters, were considered. Recognition accuracies of 79.25 % using MLP and 80.51 % using Support Vector Machine (SVM) were achieved with shadow and quadtree-based longest run features after threefold cross-validation. The rationale for that work was to create a framework for recognition of frequently occurring compound characters along with the basic characters.
3.1 Recognition strategy for the current database
To report a benchmark recognition performance on the developed database, we have used a combination of two different feature sets, namely convex hull-based features [23] and quadtree-based longest run features [26,71,75]. More specifically, these two feature sets are used here to represent binarized character images in feature space (each image is first binarized using the image-specific global threshold-based binarization algorithm discussed in Sect. 2.2 above and then size-normalized to 96 × 96 pixels), followed by recognition with an SVM classifier [76]. In the following, we briefly describe the two feature descriptors. We also include (in Table 5) a comparative performance analysis of different popularly used feature sets on our database. The chosen feature set gives the highest recognition rates among the feature sets considered in this study.
3.1.1 Quad tree-based longest run features
Before extraction of longest run features [77] from character images, a character image is first organized into a quadtree of depth 2. To do this, the root node is first created by assigning the entire character image to it. Then, all the successors of the root node are created. For that, the character image is divided into four quadrants along the row and column of the center of gravity of the black pixels therein. The sub-images so created are assigned to the successor nodes in left-to-right order. The process is repeated recursively for creating all the next-level successors and assigning image contents to them in the same way. The character image or sub-images contained in the nodes of the tree are considered here for extraction of features. Four longest run features are extracted from each of the 21 nodes (1 + 4 + 16) of the quadtree of depth 2. Thus, the number of longest run features is 84 for the present work.
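The quadtree construction and the resulting feature count can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the image is a list of rows with 1 for black pixels, the four longest run features of a node are assumed to be the sums of the longest runs along rows, columns and the two diagonal directions (the text above does not list the four directions), and no normalization is applied. All function names are hypothetical.

```python
def longest_run(line):
    """Length of the longest run of consecutive non-zero values in a sequence."""
    best = cur = 0
    for v in line:
        cur = cur + 1 if v else 0
        best = max(best, cur)
    return best


def longest_run_features(img):
    """Four features: summed longest runs along rows, columns and both diagonals."""
    h = len(img)
    w = len(img[0]) if h else 0
    if h == 0 or w == 0:
        return [0, 0, 0, 0]
    rows = sum(longest_run(r) for r in img)
    cols = sum(longest_run([img[y][x] for y in range(h)]) for x in range(w))
    diag = sum(longest_run([img[y][y + k] for y in range(h) if 0 <= y + k < w])
               for k in range(-(h - 1), w))
    anti = sum(longest_run([img[y][s - y] for y in range(h) if 0 <= s - y < w])
               for s in range(h + w - 1))
    return [rows, cols, diag, anti]


def center_of_gravity(img):
    """(row, column) of the centroid of black pixels; image center if there are none."""
    pts = [(y, x) for y, row in enumerate(img) for x, v in enumerate(row) if v]
    if not pts:
        return len(img) // 2, (len(img[0]) // 2 if img else 0)
    return (round(sum(y for y, _ in pts) / len(pts)),
            round(sum(x for _, x in pts) / len(pts)))


def quadtree_nodes(img, depth):
    """Return the image followed by all CG-based sub-images down to `depth`."""
    nodes = [img]
    if depth > 0:
        r, c = center_of_gravity(img)
        quadrants = ([row[:c] for row in img[:r]], [row[c:] for row in img[:r]],
                     [row[:c] for row in img[r:]], [row[c:] for row in img[r:]])
        for sub in quadrants:
            nodes.extend(quadtree_nodes(sub, depth - 1))
    return nodes


def qtlr_features(img, depth=2):
    """Concatenate the four longest run features of every quadtree node."""
    feats = []
    for node in quadtree_nodes(img, depth):
        feats.extend(longest_run_features(node))
    return feats
```

With `depth=2` the tree has 1 + 4 + 16 = 21 nodes, so `qtlr_features` returns 21 × 4 = 84 values for any non-empty image, matching the feature count stated above.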
3.1.2 Convex hull based feature set
Any object with a non-regular shape may be represented by
a collection of its topological components or features. In the
current work, we have extracted several such topological features based on different bay and lake attributes of a convex
hull of handwritten Bangla characters.
A total of 28 features are designed on the basis of different bay attributes of the convex hull of handwritten Bangla compound characters. In any binary image, the convex hull is constructed around the character pixels (represented by the binary label “1” and often referred to as foreground). Pixels that constitute the convex hull boundary are referred to as convex hull boundary pixels.
We define a distance measure $d_{cp}$ as the number of pixels from a convex hull boundary pixel to the nearest character pixel in either the horizontal or the vertical direction. The
76.89
73.11
73.40
74.37
84
120 + 84 = 204
25 × 8 = 200
31 × 5 = 155
155 + 84 = 239
QTLR [70,72]
QTSh+QTLR [25]
Gradient features [15]
CH [26]
CH +QTLR
(Present
work)
80.48
74.09
67.48
65.84
65.61
68.42
66.73
62.50
58.69
58.05
61.23
78.67
72.55
71.32
71.03
74.60
Overall
performance
81.06
75.04
74.06
73.82
77.51
75.07
68.57
66.92
66.78
69.58
Extracted
from word
images
67.79
63.14
59.75
59.11
62.50
Extracted from
CMATERdb
1.1.1 and CMATERdb 1.2.1
Isolated
samples
Extracted from
CMATERdb
1.1.1 and CMATERdb 1.2.1
Isolated
samples
Extracted
from word
images
199 patterns regrouped into 171 character classes
Considering 199 patterns
Success rate (in percentage) on the test set of CMATERdb 3.1.3.3
Feature dimension
Feature description
Table 5 Comparative overview of recognition performances of different feature sets on the test set of CMATERdb 3.1.3.3
79.35
73.29
72.09
71.84
75.35
Overall
performance
Fig. 9 Illustration of convex hull feature extraction. a The features are extracted based on $d_{cp}$ in the left-to-right direction, along with lake features. b The coordinates of the center of gravity $(r_x, r_y)$ of a hypothetical region created by joining the bay pixels up to the nearest character pixels along the left-to-right direction
$d_{cp}$ is estimated from the top, bottom, right and left sides of the image (see Fig. 9a, which illustrates the distance measure from the left side of an image). Based on this $d_{cp}$ measure in a particular direction, five different topological features are calculated as follows:
• Maximum $d_{cp}$,
• Average $d_{cp}$,
• Total number of rows having $d_{cp} > 0$,
• Total number of rows having $d_{cp} = 0$,
• Number of visible bays [78].
Here, bays are measured in terms of the length of the consecutive convex hull boundary pixels for which $d_{cp} > 0$ (see Fig. 9 for illustration). Two more features are computed as the coordinates of the center of gravity $(r_x, r_y)$ of a hypothetical region created by joining the bay pixels up to the nearest character pixels along a specific direction. Figure 9b illustrates the hypothetical region created by joining such pixels along the left-to-right direction. Therefore, 28 (= 4 × 7) features are extracted from the top, bottom, right and left boundaries of an image. It is worth mentioning that the convex hull shape of a broken character does not differ much from that of a complete character (see Fig. 10). However, the bay and lake counts may differ between a broken and a complete character sample. To limit this effect, smaller bays (bay length less than 4 % of the character height in the case of the vertical direction, or of the character width in the case of the horizontal direction) are rejected from consideration. Such bays are mostly generated in noisy or broken characters.
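The seven bay features for one direction can be sketched as below for the left side of an image. This is an illustrative sketch, not the authors' code. The convex hull is computed with Andrew's monotone chain algorithm (a swapped-in standard technique). Rows containing no character pixel are skipped. The maximum, average and row counts are taken over all rows, while the 4 % rejection rule is applied only to the bay count and to the hypothetical region used for $(r_x, r_y)$. All names are hypothetical.

```python
import math


def convex_hull(points):
    """Andrew's monotone chain; points are (x, y) tuples, result is counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def hull_left_x(hull, y):
    """Leftmost pixel column covered by the hull polygon in row y."""
    xs = []
    for i in range(len(hull)):
        (x1, y1), (x2, y2) = hull[i], hull[(i + 1) % len(hull)]
        if min(y1, y2) <= y <= max(y1, y2):
            if y1 == y2:
                xs.extend([x1, x2])
            else:
                xs.append(x1 + (x2 - x1) * (y - y1) / (y2 - y1))
    return math.ceil(min(xs) - 1e-9)


def left_bay_features(img, min_bay_fraction=0.04):
    """Seven features from the left side: five d_cp statistics plus (r_x, r_y)."""
    first_fg = {}                      # row -> column of first character pixel
    for y, row in enumerate(img):
        for x, v in enumerate(row):
            if v:
                first_fg.setdefault(y, x)
    if not first_fg:
        raise ValueError("image contains no character pixels")
    fg = [(x, y) for y, row in enumerate(img) for x, v in enumerate(row) if v]
    hull = convex_hull(fg)
    rows = sorted(first_fg)
    height = rows[-1] - rows[0] + 1
    left = {y: hull_left_x(hull, y) for y in rows}
    d = {y: first_fg[y] - left[y] for y in rows}   # d_cp per row
    bays, run = [], []                 # bays = runs of consecutive rows with d_cp > 0
    for y in rows:
        if d[y] > 0 and (not run or y == run[-1] + 1):
            run.append(y)
        else:
            if run:
                bays.append(run)
            run = [y] if d[y] > 0 else []
    if run:
        bays.append(run)
    bays = [b for b in bays if len(b) >= min_bay_fraction * height]
    region = [(x, y) for b in bays for y in b for x in range(left[y], first_fg[y])]
    n = len(region)
    return {
        "max_dcp": max(d.values()),
        "avg_dcp": sum(d.values()) / len(d),
        "rows_positive": sum(1 for v in d.values() if v > 0),
        "rows_zero": sum(1 for v in d.values() if v == 0),
        "bays": len(bays),
        "r_x": sum(x for x, _ in region) / n if n else 0.0,
        "r_y": sum(y for _, y in region) / n if n else 0.0,
    }
```

Applying the same procedure to the image rotated or mirrored so that the top, bottom and right sides take the place of the left side would give the remaining 21 of the 28 directional features.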
Apart from the above 28 features, three more features are derived from the entire convex hull. First, the total number of convex hull boundary pixels having $d_{cp} = 0$ is considered as one feature. Second, the number of lakes within a character is considered as another feature. A lake denotes a region entirely surrounded by character pixels within a character image. Only such regions having an area greater than $L_{th}$ are considered as lakes. The lake threshold $L_{th}$ is chosen heuristically (for our current experiment, $L_{th} = 20$) and is directly proportional to the area of the character images. Finally, the total number of convex hull boundary pixels having $d_{cp} > 0$ is considered as the third feature. Thus, 31 features are extracted from the entire character image based on different bay attributes of the convex hull. Figure 9 shows this feature extraction technique for a character image in the horizontal direction from the left side of a Bangla compound character.
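The lake count described above can be sketched with a flood fill over background pixels. The sketch assumes that a lake is a 4-connected background region that does not touch the image border and whose pixel count exceeds the threshold. The connectivity and the border test are assumptions of this sketch, and the function name is hypothetical.

```python
from collections import deque


def count_lakes(img, lake_threshold=20):
    """Count enclosed background regions whose area exceeds lake_threshold.

    img: list of rows, 1 for character pixels and 0 for background.
    """
    h = len(img)
    w = len(img[0]) if h else 0
    seen = [[False] * w for _ in range(h)]
    lakes = 0
    for sy in range(h):
        for sx in range(w):
            if img[sy][sx] or seen[sy][sx]:
                continue
            # Breadth-first flood fill of one background region.
            queue = deque([(sy, sx)])
            seen[sy][sx] = True
            area = 0
            touches_border = False
            while queue:
                y, x = queue.popleft()
                area += 1
                if y in (0, h - 1) or x in (0, w - 1):
                    touches_border = True
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and not img[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if not touches_border and area > lake_threshold:
                lakes += 1
    return lakes
```

The default threshold of 20 follows the value of $L_{th}$ quoted in the text. Small enclosed regions, which often arise from noise, are ignored.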
To extract local information from the character images, each character pattern is further divided into four sub-images based on the centroid of the data pixels, which is referred to as the CG or center of gravity. The coordinates of the CG of an image frame, $(C_x, C_y)$, are calculated as follows:

$$C_x = \frac{1}{k} \sum_{x} \sum_{y} x \cdot f(x, y), \qquad C_y = \frac{1}{k} \sum_{x} \sum_{y} y \cdot f(x, y)$$

$$f(x, y) = \begin{cases} 1, & \text{for all data pixels} \\ 0, & \text{otherwise} \end{cases}$$

where $x$ and $y$ are the coordinates of each pixel in the image of size $m \times n$ pixels, the sums run over all pixels of the image, and $k$ is the count of pixels having $f(x, y) = 1$. The convex hulls are then constructed for the character pixels within each such sub-image for computation of different topological features, as described earlier. A total of 124 such features (31 × 4) are computed from the 4 sub-images of each character pattern. This makes the total feature count 155, i.e., 31 features for the overall
Fig. 10 Convex hulls of an integral and a broken character. Although the character is broken, the convex hull remains the same
Fig. 11 Illustration of formation of convex hulls from the CG-based quad partitions. a Sample image of a Bangla compound character with its CG. The CG is denoted by a black pixel. b Convex hulls over the four subparts created based on the CG
image and 124 features for the four sub-images. It is worth noting that the convex hulls of two or more different characters may be nearly identical in shape even though the hulls of their subdivisions are not. Thus, more discriminative features can be found using the quadtree-based subdivision. The convex hulls constructed on the four pattern subdivisions created by the CG-based partitioning scheme are shown in Fig. 11.
Fig. 12 Schematic diagram of our approach
Combining the two sets of features, i.e., the quadtree-based longest run and convex hull-based features, a 239-element feature set (155 + 84) is formed. The pattern classifier employed here, using the above-mentioned features, is implemented with SVMs. An SVM employed as a pattern classifier for a two-class problem finds the optimal separating hyperplane that maximizes the margin between the two classes. The margin is the distance of the closest sample point, in each class, from the separating hyperplane. To employ SVMs for the multiclass problem, the One-Versus-One (OVO) approach is followed here. To implement the SVM, we have used LibSVM [79], a well-known open-source software for SVM. Among the kernels available in LibSVM, we have used the Radial Basis Function (RBF) kernel with a gamma (γ) value of 0.5. The RBF kernel is chosen because of its ability to perform better in handwritten digit recognition applications [80]. The trained SVMs are used later to classify the test set of the same database after extracting the same 239 features for each of the test pattern samples. We have generated three different SVMs, trained separately on the samples collected from isolated forms, from word images, and from the databases CMATERdb 1.1.1 and CMATERdb 1.2.1, and each is evaluated on its respective test set. With the recognition rate computed as (number of successfully classified data/(number of successfully classified data + number of misclassified data)) × 100, a rate of 78.67 % has been obtained on the 199 pattern classes. The recognition rate improves marginally to 79.35 % on 171 character classes after combining the different graphemes of a single character class. A detailed analysis of the recognition performance on the test set of CMATERdb 3.1.3.3 is shown in Table 5. Please note that the performance varies widely across the data collection modalities. Although we have obtained around 81 % accuracy on isolated character samples, the accuracy on unconstrained text (collected from CMATERdb 1.1.1 and CMATERdb 1.2.1) drops to around 67 %.
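The recognition rate formula and the regrouping of pattern shapes into character classes can be sketched as follows. The function name and the label strings in the usage example are hypothetical. The mapping from pattern shapes to character classes is supplied by the caller and is not taken from the database.

```python
def recognition_rate(true_labels, predicted_labels, pattern_to_class=None):
    """Percentage of correctly classified samples.

    If pattern_to_class is given, every label is first mapped to its
    character class, so confusing two shapes of the same character
    counts as a correct classification.
    """
    if pattern_to_class is not None:
        true_labels = [pattern_to_class.get(t, t) for t in true_labels]
        predicted_labels = [pattern_to_class.get(p, p) for p in predicted_labels]
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    wrong = len(true_labels) - correct
    return 100.0 * correct / (correct + wrong)


# Hypothetical example: two written shapes of one character, plus one other class.
truth = ["kka_shape1", "kka_shape2", "kta", "kta"]
pred = ["kka_shape2", "kka_shape2", "kta", "kka_shape1"]
merge = {"kka_shape1": "kka", "kka_shape2": "kka"}
rate_patterns = recognition_rate(truth, pred)          # scored per pattern shape
rate_classes = recognition_rate(truth, pred, merge)    # scored per character class
```

Because merging can only turn a shape-level confusion into a correct decision, the class-level rate is never lower than the pattern-level rate, which is consistent with the marginal improvement from 78.67 % to 79.35 % reported above.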
A schematic diagram of our developed methodology is shown in Fig. 12. Table 6 shows a detailed analysis of the ten most frequently misclassified pattern classes. It may be observed from the table that the top two misclassified compound characters resemble each other so closely that their shapes differ only minimally, even in the case of printed characters. A few samples of misclassified and correctly classified character images are also shown in Figs. 13 and 14, respectively. A possible reason behind the misclassification is the insufficiency of the feature set in distinguishing finer details of character images of different classes with close resemblance. The image samples of Fig. 14 reflect this. A two-stage pattern classification approach [81] may be explored in future to address the problem. Lexicon matching may also be useful in domain-specific applications, like city name recognition for automatic mail sorting, extraction of information from filled-in forms, etc.
Three other feature extraction techniques [16,26,72], which were used previously for Bangla compound character recognition, have been tested on the present database with an SVM-based classifier. Before extraction of those features, each character sample is binarized and then normalized to a size of 96 × 96 pixels. The recognition results obtained for the different feature extraction techniques are reported in Table 5. It can be observed from the table that the present feature set (a combination of convex hull-based and quadtree-based longest run features) performs better than the other feature sets [16,26,72] in the current experimental setup.
4 Conclusion
In our present work, we have developed a new benchmark database for Bangla handwritten compound characters after an elaborate survey of their usage patterns in contemporary Bangla literature. Although the survey revealed that the count of compound characters is around 334, for the present work we have considered only 171 character classes after discarding the characters with Ya-phala and Reph. Moreover, we have identified that some characters have more than one shape, which may hinder their proper recognition. To overcome the problem, we have treated such shapes as different classes during training. Thus, the number of
Table 6 Top 10 poorly classified pattern classes along with their misclassification statistics
Fig. 13 Examples of some misclassified samples along with their misclassified and original class labels
Fig. 14 Some successfully classified difficult data samples with their corresponding class are shown
pattern classes considered here is 199 instead of 171 character classes. These additional pattern classes are merged after testing, and the final recognition accuracy is evaluated over 171 character classes. The database is prepared from handwritten compound character samples collected from three different types of datasheets consisting of isolated characters, isolated words and unconstrained handwriting. The objective is to capture a wide variation of writing styles by individuals and to address a potentially wide range of Bangla OCR applications with varying constraints on the writing space. In some applications, characters are written in box-formatted data sheets, and in some cases, words or sentences are written in partially formatted data sheets. Some unconstrained pages are also considered in our work to include extreme variability in the database for the compound characters. Please note that, in the popular literature, only around 6 % of characters are compound in nature, and from 150 unconstrained handwritten Bangla pages, only 2,049 compound characters could be found. Therefore, to prepare a complete train–test database, we have collected handwritten isolated compound characters and words (containing specific compound characters) from participating volunteers. We have also made the complete database of 55,278 characters available in the public domain as CMATERdb 3.1.3.3 from the website http://code.google.com/p/cmaterdb/. To the best of our knowledge, this is the largest (if not the first) database of its kind available to the pattern recognition research community. A subset of this database has been used previously during the Bangla Handwriting Recognition Competition in ICFHR 2010 [82]. In the current work, we have used the convex hull and quadtree-based longest run feature set to report the benchmark recognition performance on the test set of the developed database. Further scope exists to improve the feature set, especially by introducing global and local features or a multistage recognition strategy. In a nutshell, the current work is an effort to help researchers in the domain of handwritten OCR work on a large and complex character database of one of the widely used scripts of the world.
Acknowledgments Authors are thankful to the “Center for Microprocessor Application for Training Education and Research”, “Project
on Storage Retrieval and Understanding of Video for Multimedia” of
Computer Science & Engineering Department, Jadavpur University, for
providing infrastructure facilities during progress of the work. The work
reported here has been partially funded by DST, Govt. of India, PURSE
(Promotion of University Research and Scientific Excellence) Program.
References
1. Fujisawa, H.: Forty years of research in character and document
recognition—an industrial perspective. Pattern Recognit. 41(8),
2435–2446 (2008)
2. Cheriet, M., El Yacoubi, M., Fujisawa, H., Lopresti, D., Lorette,
G.: Handwriting recognition research: twenty years of achievement
· · · and beyond. Pattern Recognit. 42(12), 3131–3135 (2009)
3. Su, T.-H., Zhang, T.-W., Guan, D.-J., Huang, H.-J.: Off-line recognition of realistic chinese handwriting using segmentation-free
strategy. Pattern Recognit. 42(1), 167–182 (2009)
4. Srihari, S., Yang, X., Ball, G.: Offline chinese handwriting recognition: an assessment of current technology. Front. Comput. Sci.
China 1(2), 137–155 (2007)
5. Kimura, F.: OCR Technologies for machine printed and hand
printed Japanese text. In: Chaudhuri, B.B. (ed.) Digital document
processing. Advances in pattern recognition, pp. 49–71. Springer,
London (2007)
6. Kwon, J.-O., Sin, B., Kim, J.H.: Recognition of on-line cursive
korean characters combining statistical and structural methods. Pattern Recognit. 30(8), 1255–1263 (1997)
7. Kim, H.J., Kim, P.K.: Recognition of off-line handwritten korean
characters. Pattern Recognit. 29(2), 245–254 (1996)
8. Amin, A.: Off line Arabic character recognition: a survey. In: The
fourth international conference on document analysis and recognition, pp. 596–599 (1997)
9. Pal, U., Chaudhuri, B.B.: Indian script character recognition: a
survey. Pattern Recognit. 37(9), 1887–1899 (2004)
10. Pal, U., Jayadevan, R., Sharma, N.: Handwriting recognition in
indian regional scripts: a survey of offline techniques. ACM Trans.
Asian Lang. Inf. Process. 11(1), 1–35 (2012)
11. Arya, D., Jawahar, C., Bhagvati, C., Patnaik, T., Chaudhuri, B.,
Lehal, G., Chaudhury, S., Ramakrishna, A.: Experiences of integration and performance testing of multilingual OCR for printed
Indian scripts. In: Proceedings of the 2011 joint workshop on multilingual OCR and analytics for noisy unstructured text data, p. 9.
ACM (2011)
12. Pal, U., Wakabayashi, T., Kimura, F.: Comparative study of Devnagari handwritten character recognition using different feature and
classifiers. In: 10th international conference on document analysis
and recognition (ICDAR ’09.), pp. 1111–1115 (2009)
13. Jagadeesh Kannan, R., Prabhakar, R.: A comparative study of optical character recognition for tamil script. Eur. J. Sci. Res. 35(4),
570–582 (2009)
14. Pal, U., Wakabayashi, T., Kimura, F.: A system for off-line Oriya
handwritten character recognition using curvature feature. In: 10th
international conference on information technology (ICIT 2007),
pp. 227–229 (2007)
15. Basu, S., Das, N., Sarkar, R., Kundu, M., Nasipuri, M., Basu, D.K.:
A hierarchical approach to recognition of handwritten bangla characters. Pattern Recognit. 42(7), 1467–1484 (2009)
16. Pal, U., Wakabayashi, T., Kimura, F.: Handwritten Bangla compound character recognition using gradient feature. In: 10th international conference on information technology-07, pp. 208–213
(2007)
17. Roy, K., Pal, U., Kimura, F.: Bangla handwritten character recognition. In: Prasad, B. (ed.) 2nd Indian international conference on
artificial intelligence, pp. 431–443. Pune, India (2005)
18. Bhattacharya, U., Parui, S.K., Shridhar, M., Kimura, F.: Two-stage
recognition of handwritten Bangla alphanumeric characters using
neural classifiers. In: Prasad, B. (ed.) 2nd Indian international conference on artificial intelligence, pp. 1357–1376. Pune, India (2005)
19. Bhowmik, T., Bhattacharya, U., Parui, S.: Recognition of bangla
handwritten characters using an mlp classifier based on stroke features. In: Pal, N., Kasabov, N., Mudi, R., Pal, S., Parui, S. (eds.)
Neural Inf. Process. Lecture notes in computer science, vol. 3316,
pp. 814–819. Springer, Berlin (2004)
20. Chaudhuri, B.B., Pal, U.: A complete printed bangla ocr system.
Pattern Recognit. 31(5), 531–549 (1998)
21. Bhowmik, T., Ghanty, P., Roy, A., Parui, S.: Svm-based hierarchical
architectures for handwritten bangla character recognition. Int. J.
Doc. Anal. Recognit. 12(2), 97–108 (2009)
22. Das, N., Sarkar, R., Basu, S., Kundu, M., Nasipuri, M., Basu, D.K.:
A genetic algorithm based region sampling for selection of local
features in handwritten digit recognition application. Appl. Soft
Comput. 12(5), 1592–1606 (2012)
23. Das, N., Pramanik, S., Basu, S., Saha, P.K., Sarkar, R., Kundu,
M., Nasipuri, M.: Recognition of handwritten Bangla basic characters and digits using convex hull based feature set. In: Dimitrios
A. Karras, Z.M., Etienne E. Kerre, Chunping Li (eds.) International conference on artificial intelligence and pattern recognition,
Orlando, Florida, USA, pp. 380–386. ISRST (2009)
24. http://censusindia.gov.in/Census_Data_2001/Census_Data_Online/Language/Statement1.htm. Accessed 22nd July 2011
25. http://en.wikipedia.org/wiki/Bengali_language. Accessed 22nd
July 2011
26. Das, N., Das, B., Sarkar, R., Basu, S., Kundu, M., Nasipuri, M.:
Handwritten bangla basic and compound character recognition
using mlp and svm classifier. J. Comput. 2(2), 109–115 (2010)
27. http://en.wikipedia.org/wiki/Paschimbanga_Bangla_Akademi.
Accessed 22nd July 2011
28. Sarkar, P., Mukhopadhay, A., DasGupta, P.: Akaademi Bannan
Abhidhan. In: Chakrabarty, N., Ghosh, S., Sarkar, P., Chaki, J., Das,
N., Mukhopadhay, A., Bhattachajee, S., Amitava, C., Mukhopadhay, A., Bhattacharjee, S., Das, P., Chattopadhay, S., Basu, A.,
Mandal, S. (eds.). Akademi Bannan Abhidhan, p. 582. Pachimbanga Bangla Akaademi, Kolkata (2008)
29. Wilkinson, R.A., Geist, J., Janet, S., Grother, P.J., Burges, C.J.C.,
Creecy, R., Hammond, B., Hull, J.J., Larsen, N.J., Vogl, T.P., Wilson, C.L.: In: The first census optical character recognition system
conference. p. 372 (1992)
30. Marti, U.V., Bunke, H.: The iam-database: an english sentence database for offline handwriting recognition. Int. J. Doc. Anal. Recognit. 5(1), 39–46 (2002)
31. Hull, J.J.: A database for handwritten text recognition research.
IEEE Trans. Pattern Anal. Mach. Intell. 16(5), 550–554 (1994)
32. MNIST Dataset. http://yann.lecun.com/exdb/mnist. Accessed
29th July 2011
33. OCR Database. http://ai.stanford.edu/~btaskar/ocr/ (2011).
Accessed 22nd July 2011
34. Honggang, Z., Jun, G., Guang, C., Chunguang, L.: HCL2000 - A
large-scale handwritten Chinese character database for handwritten
character recognition. In: ICDAR ’09., pp. 286–290 (2009)
35. Abdleazeem, S., El-Sherif, E.: Arabic handwritten digit recognition. Int. J. Doc. Anal. Recognit. 11(3), 127–141 (2008)
36. Khosravi, H., Kabir, E.: Introducing a very large dataset of handwritten farsi digits and a study on their varieties. Pattern Recognit.
Lett. 28(10), 1133–1141 (2007)
37. Mozaffari, S., Faez, K., Faradji, F., Ziaratban, M. A., Golzan,
S.M.: A comprehensive isolated Farsi/Arabic character database
for handwritten OCR research. In: Tenth international workshop
on frontiers in handwriting recognition, La Baule (France), pp.
385–389 (2006)
38. Al-Ohali, Y., Cheriet, M., Suen, C.: Databases for recognition
of handwritten Arabic cheques. Pattern Recognit. 36(1), 111–121
(2003)
39. Kavallieratou, E., Liolios, N., Koutsogeorgos, E., Fakotakis, N.,
Kokkinakis, G.: The GRUHD database of Greek unconstrained
handwriting. In: Sixth international conference on document analysis and recognition, pp. 561–565 (2001)
40. Viard-Gaudin, C., Lallican, P.M., Knerr, S., Binter, P.: The IRESTE
On/Off (IRONOFF) dual handwriting database. In: Fifth international conference on document analysis and recognition (ICDAR
’99.), pp. 455–458 (1999)
41. Kim, D.-H., Hwang, Y.-S., Park, S.-T., Kim, E.-J., Paek, S.-H.,
Bang, S.-y.: Handwritten Korean Character Image Database PE92.
In. IEICE transactions on information and systems, pp. 943–950
(1996)
42. Noumi, T., Matsui, T., Yamashita, I., Wakahara, T., Tsutsumida,
T.: Tegaki Suji database ’IPTP CD-ROM1’ no ichi bunseki (in
Japanese). Autumn Meeting of IEICE D-309 (1994)
43. Yamada, H., Yamamoto, K., Saito, T.: A nonlinear normalization
method for handprinted kanji character recognition-line density
equalization. Pattern Recognit. 23(9), 1023–1029 (1990)
44. Liu, Y., Tai, J., Liu, J.: An introduction to the 4 million handwriting Chinese character samples library. In: International conference
on Chinese computing and orient language processing, Changsa,
China, pp. 94–97 (1989)
45. Saito, T., Yamada, H., Yamamoto, K.: On the Database ELT9 of
Handprinted Characters in JIS Chinese Characters and Its Analysis
(in Japanese). Trans. IECEJ J.68-D(4), 757–764 (1985)
46. Mori, S., Yamamoto, K., Yamada, H., Saito, T.: On a handprinted
kyoiku-kanji character data base. Bull. Electrotech. Lab. 43(11–
12), 752–773 (1979)
47. http://www.hpl.hp.com/india/research/penhw-interfaces-1linguistics.html. (2011). Accessed 22nd July 2011
48. http://code.google.com/p/hit-mw-database/wiki/HomePage.
(2011). Accessed 22nd July 2011
49. http://users.iit.demokritos.gr/~bgat/HandSegmCont2009/. (2011).
Accessed 22nd July 2011
50. Bhattacharya, U.: Handwritten character databases of indic scripts.
http://www.isical.ac.in/~ujjwal/download/database.html (2011).
Accessed 22nd July 2011
51. Bhattacharya, U., Chaudhuri, B.B.: Handwritten numeral databases
of indian scripts and multistage recognition of mixed numerals.
IEEE Trans. Pattern Anal. Mach. Intell. 31(3), 444–457 (2009)
52. Bhattacharya, U., Shridhar, M., Parui, S.K., Sen, P.K., Chaudhuri, B.B.: Offline recognition of handwritten bangla characters:
an efficient two-stage approach. Pattern Anal. Appl. 15(4), 445–
458 (2012)
53. Das, N., Reddy, J.M., Sarkar, R., Basu, S., Kundu, M., Nasipuri,
M., Basu, D.K.: A statistical-topological feature combination for
recognition of handwritten numerals. Appl. Soft Comput. 12(8),
2486–2495 (2012)
54. Sarkar, R., Das, N., Basu, S., Kundu, M., Nasipuri, M., Basu, D.:
Cmaterdb1: a database of unconstrained handwritten Bangla and
Bangla-english mixed script document image. Int. J. Doc. Anal.
Recognit. 15(1), 71–83 (2012)
55. Chattopadhyay, S.K.: Bangla Bhasatattver Bhumika. Calcutta University Press, Kolkata (1974)
56. Sproat, R.: A formal computational analysis of indic scripts. In:
International symposium on indic scripts: past and future, Tokyo
(2003)
57. Consortium, U.: The unicode standard, Version 6.1—core specification. The Unicode Consortium, Mountain View, CA, 2012. In.
ISBN 978-1-936213-02-3. URL http://www.unicode.org/versions/
Unicode6.1.0
58. MSVS, B.R., Vardhan, V., GA, N., Reddy, P.: A noval security
model for indic scripts-a case study on Telugu. Int. J. Comput. Sci.
Secur. (IJCSS) 3(4), 303
59. Das, N.S.: Modern Bengali script: an introduction. Dakhabharati,
Kolkata (2010)
60. AnandaBazar Patrika. http://www.anandabazar.in/ (2012).
Accessed 10th March 2012
61. AajKaal. http://www.aajkaal.net (2011). Accessed 29th July 2011
62. Bartaman. http://bartamanpatrika.com (2011). Accessed 29th July
2011
63. Anandamela, Desh. http://my.anandabazar.com/content/magazines (2011). Accessed 29th July 2011
64. Sarat Rachanabali. http://www.sarat-rachanabali.nltr.org/ (2012).
Accessed 3rd March 2012
65. Bangla Documets. http://banglalibrary.evergreenbangla.com/
(2012). Accessed 4th March 2012
66. Newspapers from Bangladesh. http://new.ittefaq.com.bd/, http://
www.prothom-alo.com/ (2012). Accessed 4th March 2012
67. CMATER Handwritten Character Database. http://code.google.
com/p/cmaterdb/ (2011). Accessed 1st Aug 2011
68. Bhattacharya, U., Shridhar, M., Parui, S.: On recognition of handwritten Bangla characters. In: Kalra, P., Peleg, S. (eds.) Computer
vision, graphics and image processing. Lecture notes in computer
science, pp. 817–828. Springer, Berlin (2006)
69. Rahman, A.F.R., Rahman, R., Fairhurst, M.C.: Recognition of
handwritten bengali characters: a novel multistage approach. Pattern Recognit. 35(5), 997–1006 (2002)
70. Wen, Y., Lu, Y., Shi, P.: Handwritten bangla numeral recognition
system and its application to postal automation. Pattern Recognit.
40(1), 99–107 (2007)
71. Basu, S., Das, N., Sarkar, R., Kundu, M., Nasipuri, M., Basu,
D.K.: A novel framework for automatic sorting of postal documents with multi-script address blocks. Pattern Recognit. 43(10),
3507–3521 (2010)
72. Das, N., Basu, S., Sarkar, R., Kundu, M., Nasipuri, M., Basu, D.K.:
Handwritten Bangla compound character recognition: potential
challenges and probable solution. In: Prasad, B., Lingras, P., Ram,
A. (eds.) 4th Indian international conference on artificial intelligence, Bangalore, pp. 1901–1913 (2009)
73. Chaudhuri, B.B., Pal, U.: Relational studies between phoneme and
grapheme statistics in current Bangla. J. Acoust. Soc. India 23,
67–77 (1995)
74. Pal, U., Chaudhuri, B.B.: Character occurrence statistics in Bangla
language and recognition of Bangla printed script. In: ICAPRDT,
Kolkata, pp. 52–59 (1993)
75. Das, N., Basu, S., Sarkar, R., Kundu, M., Nasipuri, M., Basu, D.K.:
An improved feature descriptor for recognition of handwritten
Bangla alphabet. In: Guru, D.S., Vasudev, T. (eds.) International
conference on signal and image processing, Mysore, India, pp.
451–454. Excel India Publishers (2009)
76. Burges, C.J.C.: A tutorial on support vector machines for pattern
recognition. Data Min. Knowl. Discov. 2(2), 121–167 (1998)
77. Basu, S., Das, N., Sarkar, R., Kundu, M., Nasipuri, M., Basu,
D.: Recognition of numeric postal codes from multi-script postal
address blocks. In: Chaudhury, S., Mitra, S., Murthy, C.A., Sastry,
P.S., Pal, S. (eds.) Pattern Recognit. Mach. Intell. Lecture notes in
computer science, vol. 5909, pp. 381–386. Springer, Berlin (2009)
78. Marlow, B.K., Batchelor, B.G.: Improving the speed of convex hull
calculations. Electron. Lett. 16(9), 319–321 (1980)
79. Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector
machines. ACM Trans. Intell. Syst. Technol. 2(3), 1–27 (2011)
80. Das, N., Mandal, B., Basu, S., Sarkar, R., Kundu, M., Nasipuri, M.:
An SVM-MLP classifier combination scheme for recognition of
handwritten Bangla digits. In: Kale, K.V., Malhrota, S.C., Manza,
R.R. (eds.) 2nd International conference on advances in computer
vision and information technology, Aurangabad, India, pp. 615–
623. I. K. International Publishing House Pvt. Ltd. (2009)
81. Basu, S., Chaudhuri, C., Kundu, M., Nasipuri, M., Basu, D.: A two-pass approach to pattern classification. In: Pal, N., Kasabov, N., Mudi, R., Pal, S., Parui, S. (eds.) Neural information processing.
Lecture notes in computer science, vol. 3316, pp. 781–786. Springer,
Berlin (2004)
82. El Abed, H., Märgner, V., Blumenstein, M.: International conference on frontiers in handwriting recognition (ICFHR 2010): competitions overview.
In: 12th international conference on frontiers in handwriting recognition, pp. 703–708 (2010)