IJDAR (2014) 17:413–431
DOI 10.1007/s10032-014-0222-y

ORIGINAL PAPER

A benchmark image database of isolated Bangla handwritten compound characters

Nibaran Das · Kallol Acharya · Ram Sarkar · Subhadip Basu · Mahantapas Kundu · Mita Nasipuri

Received: 23 July 2012 / Revised: 9 April 2014 / Accepted: 25 April 2014 / Published online: 23 May 2014
© Springer-Verlag Berlin Heidelberg 2014

Abstract In the present work, we present a benchmark image database of isolated handwritten Bangla compound characters used in standard Bangla literature. A thorough survey of more than 2 million Bangla words has revealed that there exist around 334 compound characters in Bangla script. Of these, only around 171 character classes form unique pattern shapes, and some of these classes are often written in multiple styles. Altogether, 55,278 isolated character images, belonging to 199 different pattern shapes, are collected using three different data collection modalities. The database is divided into training and test sets in a 4:1 ratio for each pattern class, with a balanced distribution of shapes from the different modalities. A convex hull and quadtree-based feature set has been designed, and the test set recognition performance is reported with the support vector machine classifier. We have achieved a recognition accuracy of 79.35 % on the test database consisting of 171 character classes. The complete compound character image database is freely available as CMATERdb 3.1.3.3 from the website http://code.google.com/p/cmaterdb/, which may facilitate research on handwritten character recognition, especially related to Bangla form document processing systems.

Keywords OCR · Handwritten character recognition · Bangla Compound character · Benchmark database · SVM

Electronic supplementary material The online version of this article (doi:10.1007/s10032-014-0222-y) contains supplementary material, which is available to authorized users.

N. Das · K. Acharya · R. Sarkar · S. Basu · M. Kundu · M. Nasipuri (corresponding author)
Computer Science and Engineering Department, Jadavpur University, Kolkata 700032, India
e-mail: mnasipuri@cse.jdvu.ac.in

1 Introduction

Since the early days of machine learning research, optical character recognition (OCR) has remained a popular and well-studied problem domain, often guided by industrial demands [1]. Apart from its commercial applications, character recognition is also considered a benchmark problem in pattern recognition research, especially in relation to the recognition of handwritten text. Cheriet et al. [2] gave an elaborate overview of handwritten OCR research during the last couple of decades. Handwritten character recognition for Latin script has attained considerable maturity, and a number of research groups have contributed significantly to this development [2]. Considerable research has also been done on the recognition of handwritten characters of Chinese [3,4], Japanese [5], Korean [6,7] and Arabic [8] scripts. Research efforts on Indian script OCR [9–11], covering Devanagari [12], Tamil [13], Oriya [14] and Bangla [15–23], have started to receive attention over the last decade.

In India, Bangla is the second most popular script and language after Devanagari [24]. As a script, it is used for the Bangla, Ahamia, Manipuri, Daphla, Garo, Allam, Khasi, Mizo, Munda, Naga, Rian and Santali languages. Moreover, Bangla, which is also the national language of Bangladesh, is the sixth most popular language in the world [25]. Around 230 million people speak Bangla all over the world. Despite its popularity and importance, evidence of research on handwritten Bangla OCR involving the complete set of characters is scarce [26]. One possible reason is the limited availability of benchmark databases, especially for the complete Bangla character set.
Another possible reason may be linked to the complex nature of the character set, which consists of 10 digits, 50 basic characters (11 vowels and 39 consonants), and around 334 compound characters. Figure 1a, b shows sample images of handwritten Bangla basic characters and digit symbols.

Fig. 1 a Handwritten Bangla basic characters. b Handwritten numerals of Bangla script

To add to the complexity, when a vowel appears alongside a consonant, the vowel itself takes a modified shape called a vowel allograph. There are 10 such vowel allographs in Bangla script, each of which appears on top, to the left, to the right, simultaneously at the left and right sides, or at the bottom of the associated consonant(s), as shown in Fig. 2.

Fig. 2 Different vowel allographs with consonant

A compound character, an important and integral part of the Bangla language system, is a complex shaped character which consists of two or more consonants, occasionally followed by a vowel, pronounced simultaneously in words of the Bangla language. Sometimes the shape of the compound character is so complex that it becomes difficult to identify the constituent consonants individually. Figure 3 shows a list of handwritten samples of Bangla compound characters with the corresponding Unicode compatible shapes. From Fig. 3, it is evident that the shapes are very complex, and some compound characters resemble each other pair-wise so closely that the only differences left between them are small additional shapes such as short straight lines, circular curves, etc. It is often difficult to identify those characters without analyzing the context, especially in handwritten documents. A set of such closely resembling character pairs is shown in Table 1. The ten most frequently occurring Bangla compound characters, with their respective printed forms, composition, handwritten samples, usage pattern in words and frequency of occurrence (percent) in standard literature, are shown in Fig. 4. The complete table is also included in the supplementary document (S-Table 1).

In two special cases of compound characters, one of the constituent consonants takes a special modified shape, namely Ya-phala [ ] and Reph. In Ya-phala, the second of the two consonants joined together is Ya [ ], whereas in Reph, the first consonant is Ra [ ]. Like the vowel allographs, the Ya-phala appears to the right and the Reph appears on top of the associated consonant without changing its shape. Since the associated consonant characters maintain their original shapes in the formation of the compound characters with Ya-phala and Reph, these two modifiers are also referred to as consonant allographs. Figure 5 shows typical usage patterns of these two consonant allographs. Due to this special distinction, although these two categories of compound characters are considered part of the overall compound character set of Bangla script, they are not considered in the preparation of the current database.

Despite the existence of many varieties of compound characters, their frequency of appearance in any text page is much lower than that of basic characters. Interestingly, out of this large number of compound characters, some are rarely used and some have become obsolete. Moreover, in 1997, the West Bengal Bangla Academy [27] introduced new types of shapes/glyphs to represent Bangla compound characters through their Bangla dictionary “Akaademi Banan Abhidhan” [28]. The objective of their effort was to simplify the complex shapes for easier understanding of the compound characters. Nowadays, in India, Bangla text books mainly follow this type of glyph for compound characters. However, many newspapers and publishing houses do not always follow these new standards, as common people are still more comfortable with the old style of writing Bangla compound characters.
All these modifications add newer variations in handwritten character patterns, leading to further complexities in OCR of handwritten Bangla compound characters. To address this issue, we have considered multiple shapes for many compound character classes.

It is evident from the discussion that recognition of handwritten compound characters is a complex pattern recognition problem. Moreover, a major obstacle in pursuing research in this domain is the non-availability of any comprehensive public domain database. Unlike handwritten numerals and basic characters, collection of adequate isolated data samples of compound characters is an uphill task. The need for a benchmark database for handwritten Bangla compound characters has long been felt by the Bangla OCR research community. The prospect of a complete handwritten Bangla OCR system would otherwise remain elusive. The work reported in this paper is an attempt in that direction.

Fig. 3 List of 171 handwritten samples of Bangla compound characters with corresponding printed glyphs

1.1 A brief review on handwritten character databases

Benchmark databases are essential to compare the performances of different handwritten character recognition techniques over an even test bed. The most popularly used databases for Latin numerals, characters and words are NIST [29], IAM-DB [30], CEDAR [31], MNIST [32], OCR [33], etc. Among these, handwritten numeral databases like NIST [29] and MNIST [32] were designed to support specialized applications, such as recognition of postal codes. The IAM-DB [30] database is constructed with labeled handwritten Latin script in the form of sentences, lines and words. The CEDAR [31] database consists of city names, state names, ZIP codes and alphanumeric characters, and the OCR [33] database is based on handwritten words of Latin script. Apart from the Latin script databases, there are databases of other scripts as well [34–48]. Benchmark databases are available for Korean [41], JIS Chinese script [45], Japanese script [42], Arabic script [37,39], etc. The ICDAR 2009 Handwriting Segmentation Contest [49] also provided a set of handwritten document pages written in Latin script for the English, Greek, French and German languages.

Table 1 A set of randomly selected closely resembling character pairs

Fig. 4 Formation of some frequently occurring Bangla compound characters and their usage (percent frequency) in standard Bangla literature

Fig. 5 Typical character formations with the two consonant allographs Ya-phala and Reph

A limited public domain database (accessible on request) of isolated digits and basic characters is also available [50] for some of the handwritten Indic scripts, viz. Bangla, Devanagari and Oriya. Bhattacharya et al. [51] also provided two updated databases of handwritten numerals for Devanagari and Bangla scripts. The Bangla numeral database, available from [50,51], consists of 19,392 training and 4,000 test samples written by 1,106 persons. A Bangla basic character database [52] is also provided by Bhattacharya et al. This database, available in [50], consists of 37,858 isolated character images, of which 25,000 are used for training and 12,858 are marked as test samples. Recently, Das et al. [53] published four different handwritten numeral databases of Indic scripts, viz. Bangla, Devanagari, Arabic and Telugu. A database of unconstrained handwritten Bangla and Bangla–English mixed script document images has been prepared recently by Sarkar et al. [54]. This database contains 150 handwritten document pages, of which 100 pages are written only in Bangla script and the remaining 50 pages are written in Bangla script mixed with English words. They also provided ground truth data for text line segmentation along with a semi-automatic ground truth generating tool.

However, no standard Bangla compound character database is yet available in the public domain for research purposes. It is worth mentioning here that the complete compound character database available at [50] is not downloadable. Only sample compound characters are available for download from the website. To bridge this resource–requirement gap, a complete compound character database, developed by us, is uploaded to http://code.google.com/p/cmaterdb/ for free download.

As mentioned before, the primary objective of the present work is to create a standard database for handwritten Bangla compound characters along with a detailed analysis of their occurrence patterns in standard printed Bangla literature. We also report a benchmark recognition performance on this newly prepared database. We hope that our effort may help the research community to develop a much awaited, complete handwritten Bangla character recognition system in the near future.

2 Database preparation under the present work

It has already been mentioned that apart from 50 basic characters and vowel allographs, the Bangla alphabet is enriched with more than 300 compound characters. Bangla compound characters have evolved into their present shapes and structures over hundreds of years. Some compound characters have become obsolete at present, and some have undergone modification in shape over time. For some compound characters, more than one shape even exists simultaneously. So identification of proper compound character shapes is itself a challenging task for researchers. Therefore, the present work on database preparation is preceded by a survey of the frequency of occurrence of Bangla compound characters in standard Bangla literature.
2.1 Survey on the usage of Bangla compound characters

The concept of compound characters in Bangla script originated from the Brāhmī script, the mother of the Indo-Aryan scripts [55]. Devanagari and the other major Indic scripts such as Gujarati, Telugu, Bangla, Oriya, Kannada, Gurmukhi, Tamil, Malayalam, etc. are abugidas, i.e., most symbols stand for a consonant plus an inherent vowel (usually the sound /a/) [56]. Here, the orthographic syllable starts with a consonant (C) or a sequence of consonants, ends with an inherent vowel (V), and generally follows a canonical structure of (((C)C)C)V. The formation of compound characters in the Bangla language also maintains this canonical structure, which is almost similar to that of the other languages whose scripts originated from the Brāhmī script [57,58]. However, the number of compound characters in Bangla is much higher than in Devanagari. This is because many modern words, coming from different languages, require compound characters which do not exist in the original Bangla script. Bangla is continuously being enriched by the inclusion of these words from other languages, and the new words require new sets of compound characters for writing them in Bangla. As Bangla is a phonetic language, its phonetic structure is mainly responsible for the formation of compound characters. Bangla has a canonical SOV (Subject–Object–Verb) structure, compared to the SVO (Subject–Verb–Object) structure of English. In [59], Dash represents the relationship among the graphemes of different languages whose scripts originated from the Brāhmī script, including Bangla.
According to Dash, there are more than four hundred (400+) compound characters available in Bangla script, out of which about three hundred and twenty (∼ 320) compound characters are made of two consonant graphemes, around seventy (∼ 70) are made of three consonant graphemes and about ten (∼ 10) are made of four consonant graphemes. But many of these compound characters are rarely used in modern Bangla literature.

A thorough survey has been conducted under the present work with the alphabetic characters collected from three popular Bangla newspapers, viz. Anandabazar Patrika [60], Aajkal [61] and Bartaman [62]; two popular Bangla magazines, Anada-Mela and Desh [63]; and some Bangla websites [64–66]. We have considered around 2.4 million words, consisting of around 8 million characters, to generate the corpus. It is worth mentioning that most of these samples are collected from Unicode encoded digital archives and analyzed using a self-developed automated tool. The digital corpus is made available at our website http://code.google.com/p/cmaterdb/.

From the survey, we have found that around 6.61 % of the characters in the total corpus are compound characters. It has also been observed that 26.25 % of the words are constructed with at least one compound character. The percentage of words having at least one compound character, for different groups of word lengths, is shown in Fig. 6. Among those compound characters, around 28.38 % are formed with the two consonant allographs Ya-phala [ ] and Reph [ ], and some obsolete compound characters ( ). The remaining 71.62 % of the compound characters are actually considered for preparing the current database.

During the survey, we have imposed some constraints on the collected characters to avoid fallacious observations and improper deductions about the shapes and percentage of usage of characters. The constraints are discussed below.
(a) None of the Bangla punctuation marks, symbols and numerals is considered as a character.
(b) Vowel graphemes, consonant graphemes, vowel allographs and diacritic symbols are considered as non-compound characters for the current survey.
(c) A compound character is considered as a combination of two or more consonant graphemes.
(d) A compound character with a vowel allograph is also treated as a new compound character if the combination of characters possesses a different shape of the vowel allograph (Example: ).
(e) A consonant with a consonant allograph is also considered as a new compound character during the survey (Example: ).

Fig. 6 Distribution of compound character words of different lengths in the total word corpus, and the possibility of having at least one compound character in a word

We have found around 334 compound characters from the survey conducted on the newly developed digital text corpus database. According to our survey, two hundred eighteen (218) compound characters are formed with two consonant graphemes, one hundred twelve (112) are constructed using three consonant graphemes and four (4) are formed using four consonant graphemes. It is also observed that both consonant allographs, Ya-phala [ ] and Reph [ ], are prevalent in Bangla literature. As already mentioned in Sect. 1, neither of these modifiers changes the original shape of the associated consonant (see Fig. 5), and both are excluded from the current database. More specifically, the Ya-phala appears to the right side of the main character, touching the common matra or Sirorekha of the word, whereas the Reph appears above the matra of the character shapes, rarely touching the main character.

Since we have generated a large corpus automatically from digital contents, some misspelled compound characters were initially considered in the survey.
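The counting behind such a survey can be illustrated on Unicode text. The following is a minimal sketch, not the authors' self-developed tool: it assumes that a compound character is encoded as consonants joined by the Bangla hasanta (U+09CD), that the consonant set is the Unicode range KA to HA plus RRA, RHA and YYA, and that a cluster beginning with Ra is a Reph form while one ending in Ya is a Ya-phala form. The names `CONJUNCT_RE`, `classify` and `survey` are hypothetical, and vowel allographs (rule d) are ignored.

```python
import re
from collections import Counter

HASANTA = "\u09CD"           # Bangla virama; joins consonants in Unicode text
RA, YA = "\u09B0", "\u09AF"  # consonants behind Reph and Ya-phala
# Assumed consonant set: KA..HA plus RRA, RHA and YYA.
CONSONANT = "[\u0995-\u09B9\u09DC\u09DD\u09DF]"
# A cluster is a consonant followed by one or more (hasanta + consonant) pairs.
CONJUNCT_RE = re.compile(f"{CONSONANT}(?:{HASANTA}{CONSONANT})+")


def classify(cluster):
    """Label a cluster as 'reph', 'ya-phala' or an ordinary 'compound'."""
    parts = cluster.split(HASANTA)
    if parts[0] == RA:
        return "reph"
    if parts[-1] == YA:
        return "ya-phala"
    return "compound"


def survey(words):
    """Return cluster counts per kind and the fraction of words with a cluster."""
    kinds = Counter()
    words_with_cluster = 0
    for word in words:
        clusters = CONJUNCT_RE.findall(word)
        if clusters:
            words_with_cluster += 1
        kinds.update(classify(c) for c in clusters)
    fraction = words_with_cluster / len(words) if words else 0.0
    return kinds, fraction
```

On a real corpus, the `fraction` value plays the role of the reported share of words containing at least one compound character, while the `reph` and `ya-phala` counts correspond to the clusters that the survey sets aside.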
Some obsolete characters, such as ( ), also came to our notice during the survey. Those misspelled and obsolete characters are not considered for this work. As a result of these considerations, the total number of compound characters used in this work came down to 171. A list of the top 20 Bangla compound characters (according to their frequency of usage) is shown in Table 2 along with their cumulative frequencies in standard Bangla literature.

The survey also revealed that for some compound characters like nya-ja ( ), nga-ka ( ), etc., people are accustomed to using more than one shape. Therefore, after detailed analysis, we have considered another 28 shapes, totaling 199 shapes of compound characters for our current work. Table 3 shows a list of handwritten Bangla compound characters having multiple shapes.

2.2 Data collection and preprocessing

Data collection is one of the difficult tasks in any pattern recognition research. It becomes even more challenging when the number of classes is high (199 character shapes for our Bangla handwritten compound character database). Three different data collection modalities are adopted in the present work to cover a wide spectrum of writing styles. Handwritten text is usually written either in structured documents (like pre-formatted forms) or in an unconstrained fashion. In the case of Bangla script, because of the presence of ascendants (characters or their parts appearing above the matra) and descendants (characters or their parts appearing below the main character), standardized form documents are not yet popular. In many cases, semi-structured layouts are used for data entry, where words or text lines are written in rectangular boxes. Therefore, we have collected handwritten compound character samples using both categories of data collection sheets.
In one type of sheet, the participating volunteers were asked to write the isolated compound characters in individual boxes, whereas in the other type, the volunteers were asked to write a complete word (containing compound characters). Sample images of these two categories of data collection sheets are shown in Fig. 7a, b. As the third type of data collection modality, unconstrained Bangla handwritten pages from our public domain repository [67] are used (see Fig. 7c). Compound characters appearing in the text are manually cropped to populate the database.

Table 2 Formation of the top 20 most frequently occurring compound characters and their corresponding frequency of appearance in recent Bangla literature and news items

Table 3 The list of handwritten Bangla compound characters having multiple shapes

For the present work, formatted data sheets were designed at the Center for Microprocessor Application for Training Education and Research (CMATER), a research laboratory at the Computer Science and Engineering department of Jadavpur University, India, and were used to collect data from native Bangla writers of different age, sex and educational background groups. Black/blue-ink pens with 0.5/0.7 mm pen tips were used for writing on those data sheets, which were printed on 70 GSM paper. A pie chart describing the distribution of the age groups of the writers who participated in the data collection drive is shown in Fig. 8. It was also observed that the writers' education level varies with age. In our survey, 22 % of the writers were school students, and the remaining 78 % were either under-graduate/graduate students or graduate degree holders. About 50 % of the entire data were collected from the students, teachers and staff of different universities, colleges and schools, another 30 % from family members, friends and relatives, and the remaining 20 % were collected randomly in public places, on request.
Collected data sheets (from both modalities) were optically scanned in gray scale at a resolution of 300 dpi using an HP F380 flatbed scanner. Data collection from the digitized pages containing handwritten words is done manually. But extraction of the large number of isolated handwritten characters from the other category of data collection sheets is an uphill task. For this purpose, we have binarized each such digitized data sheet using a simple image-specific global threshold value, which is computed as (min_intensity + max_intensity)/2. From the binarized data sheet images, the coordinates of the gridlines separating individual characters are automatically identified using the page layout analysis algorithm described below. Since the data sheets were collected under manual supervision, this pixel-histogram-based page layout analysis algorithm works well for most of the data sheets. In some extreme cases, however, we had to use manual intervention to extract the data part from the scanned data sheets. Please also note that the gray scale images of the characters are extracted from the original grayscale data sheet images using the coordinates of the grid lines identified by the algorithm.

Algorithm:

Step 1 Let $I_{N \times M}$ be the binary image with $N$ rows and $M$ columns, where $I[i, j] \in \{0, 1\}$ and 1 denotes a black pixel. Calculate the vertical pixel density profile $P_V$ as the sum of black pixels along each column of the image: $P_V[j] = \sum_{i=0}^{N-1} I[i, j]$, where $j = 0, \ldots, M-1$.

Step 2 Calculate the horizontal pixel density profile $P_H$ as the sum of black pixels along each row of the image: $P_H[i] = \sum_{j=0}^{M-1} I[i, j]$, where $i = 0, \ldots, N-1$.

Step 3 Consider the vertical profile vector $P_V$. If, for a particular column $j$, $P_V[j] > \delta \cdot N$, where $\delta$ is a heuristically chosen constant ($0 < \delta < 1$), then that column of pixels is treated as part of a vertical line on the data sheet separating the characters. Consecutive columns that satisfy this criterion form one column cluster, comprising multiple vertical columns of pixels. Each column cluster represents a vertical gridline. Let $C_{th}$ be the gap threshold for inclusion in the same column cluster: two candidate columns of pixels ($j_1$ and $j_2$) are said to be within the same column cluster if $|j_1 - j_2| < C_{th}$, where $|\cdot|$ signifies the difference in column numbers. Now, compute the center line of each column cluster by simply averaging the column numbers in the cluster.

Step 4 Consider the horizontal profile vector $P_H$. If, for a particular row $i$, $P_H[i] > \delta \cdot M$, then that row of pixels is treated as part of a horizontal separating line on the data sheet. Consecutive identified rows of pixels form a single row cluster representing a horizontal gridline. Let $R_{th}$ be the gap threshold for inclusion in the same row cluster: two candidate rows ($r_1$ and $r_2$) are said to be within the same row cluster if $|r_1 - r_2| < R_{th}$, where $|\cdot|$ signifies the difference in row numbers. The central line of each row cluster is computed in the same way as for the vertical grid lines.

Step 5 Identify the coordinates of each cell containing handwritten characters from adjacent pairs of vertical and horizontal grid lines.

Step 6 The coordinates of each cell are used to crop the original grayscale image.

Fig. 7 Sample images of three different categories of data collection sheets. a Isolated compound characters. b Complete words containing compound characters. c Unconstrained Bangla handwritten page containing compound characters

Fig. 8 Participation statistics of the persons in the data collection drive according to their age

2.3 Description of the present database

In the current work, a total of 335 persons voluntarily participated in the data collection drive. A total of 354 sheets were processed using the aforementioned methodology to generate 42,959 isolated handwritten compound character images. Please note that many handwritten samples had to be discarded because of writing errors like use of inconsistent glyphs, overwriting, etc. For handwritten words, 122 data sheets were processed to generate 10,270 compound characters from 9,829 isolated word images. As the third category of data collection modality, 150 handwritten pages were taken from the CMATERdb 1.1.1 and CMATERdb 1.2.1 databases to generate 2,049 isolated compound characters. All the characters extracted from the three modalities are then divided randomly into training and test sets, approximately in the ratio 4:1 for each shape pattern. The final training set consists of 44,152 compound characters, whereas the test set consists of 11,126 characters.

As discussed before, the number of Bangla compound characters considered here is 171. But, due to the presence of more than one grapheme for a single class, the total number of pattern classes is 199. The number of characters per class is not equal and lies between 125 and 474 samples per pattern class. The average number of samples per pattern class is around 277, with a standard deviation of 73.77.
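The gridline detection procedure of Sect. 2.2 (thresholding followed by Steps 1 to 6) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: images are plain nested lists of gray values, pixels darker than the threshold count as black, and the function names and the default values of `delta`, `c_th` and `r_th` are hypothetical, since the paper only says that these constants are chosen heuristically.

```python
def binarize(gray):
    """Global threshold at (min_intensity + max_intensity) / 2; 1 = black pixel."""
    lo = min(min(row) for row in gray)
    hi = max(max(row) for row in gray)
    threshold = (lo + hi) / 2
    return [[1 if p < threshold else 0 for p in row] for row in gray]


def cluster_centres(indices, max_gap):
    """Group sorted indices whose gap is below max_gap; return cluster means."""
    clusters = []
    for idx in indices:
        if clusters and idx - clusters[-1][-1] < max_gap:
            clusters[-1].append(idx)
        else:
            clusters.append([idx])
    return [round(sum(c) / len(c)) for c in clusters]


def find_gridlines(binary, delta=0.8, c_th=3, r_th=3):
    """Steps 1-4: density profiles, thresholding, clustering, centre lines."""
    n_rows, n_cols = len(binary), len(binary[0])
    # Step 1: vertical profile, black pixels summed along each column.
    p_v = [sum(binary[i][j] for i in range(n_rows)) for j in range(n_cols)]
    # Step 2: horizontal profile, black pixels summed along each row.
    p_h = [sum(row) for row in binary]
    # Steps 3 and 4: keep dense columns/rows and merge neighbours into lines.
    cols = [j for j, v in enumerate(p_v) if v > delta * n_rows]
    rows = [i for i, v in enumerate(p_h) if v > delta * n_cols]
    return cluster_centres(rows, r_th), cluster_centres(cols, c_th)


def grid_cells(row_lines, col_lines):
    """Step 5: (top, bottom, left, right) for each pair of adjacent gridlines."""
    return [(t, b, l, r)
            for t, b in zip(row_lines, row_lines[1:])
            for l, r in zip(col_lines, col_lines[1:])]


def crop(gray, cell):
    """Step 6: cut the cell interior out of the original grayscale image."""
    top, bottom, left, right = cell
    return [row[left + 1:right] for row in gray[top + 1:bottom]]
```

Because the crop uses only the centre line of each cluster, thick gridlines may leave a border of line pixels inside a cell; trimming by half the cluster width would be one possible refinement.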
Table 4 Statistical analysis of the data samples of the first 10 pattern classes of Bangla compound characters collected in isolated fashion

(a) For training set
Class  Count  Hmax  Hmin  Wmax  Wmin  Amax    Amin   ARmax  ARmin  Havg   Wavg    Aavg      ARavg  SD_H   SD_W   SD_A      SD_AR
1      209    124   45    195   40    21,080  1,880  3.12   0.55   70.76  95.47   6,974.63  1.37   14.98  33.98  3,606.95  0.45
2      172    132   32    218   46    28,776  1,696  2.47   0.63   74.65  92.45   7,303.78  1.26   20.09  34.42  4,511.61  0.39
3      199    123   38    190   45    20,068  2,090  3.42   0.64   67.4   97.84   6,889.84  1.48   17.69  30.34  3,673.98  0.4
4      215    132   32    204   43    22,939  1,760  3.02   0.55   66.2   91.53   6,442.29  1.4    18.45  33.99  4,099.19  0.4
5      206    137   43    214   43    23,736  2,484  2.93   0.45   75.94  93.26   7,528.87  1.24   19.74  36.62  4,833.06  0.38
6      199    137   44    215   44    28,165  2,816  2.53   0.59   78.79  100.58  8,308.41  1.29   18.11  35.35  4,693.85  0.36
7      208    132   45    212   40    21,106  2,160  2.24   0.59   73.48  82.3    6,461.62  1.1    16.34  35.12  4,288.62  0.32
8      200    136   44    216   48    29,376  2,112  2.25   0.6    80.19  96.91   8,167.15  1.21   17.56  34.65  4,689.18  0.31
9      104    141   47    220   48    27,477  3,216  2.24   0.72   77.89  92.49   7,498.65  1.19   15.29  32.26  4,037.33  0.3
10     132    136   40    213   46    28,968  2,352  2.4    0.73   73.92  102.96  8,207.75  1.39   20.66  40.76  5,527.38  0.38

(b) For test set
Class  Count  Hmax  Hmin  Wmax  Wmin  Amax    Amin   ARmax  ARmin  Havg   Wavg    Aavg      ARavg  SD_H   SD_W   SD_A      SD_AR
1      52     115   45    207   48    18,837  2,352  3.25   0.71   69.29  95.27   6,867.37  1.4    16.51  37.32  3,851.31  0.51
2      43     144   38    210   34    30,240  1,292  2.3    0.64   73.26  88.93   6,932.26  1.22   19.23  33.87  4,836.05  0.35
3      49     112   37    187   52    20,570  1,924  2.69   0.82   67.61  93.82   6,543.14  1.44   17.69  29.26  3,326.27  0.45
4      53     131   38    211   43    27,641  2,150  2.45   0.69   73.04  104.6   8,502.77  1.43   24.31  46.4   6,339.71  0.39
5      51     136   54    166   51    21,216  3,432  1.96   0.68   77.88  89.65   7,334.92  1.16   17.69  29.21  4,129.54  0.28
6      49     122   51    177   53    20,886  3,162  2.09   0.82   79.04  101.37  8,419.49  1.28   17.59  30.63  4,363.11  0.26
7      51     118   41    176   46    20,768  2,444  1.84   0.6    69.61  71.27   5,237.86  1.02   14.07  29.17  3,532.25  0.28
8      49     122   53    219   47    22,204  2,867  2.26   0.69   82.1   97.24   8,270.96  1.19   16.36  34.34  4,230.72  0.36
9      25     138   59    163   52    18,354  3,744  2.03   0.72   77.6   91.28   7,363.44  1.18   16.7   27.36  3,781.42  0.29
10     33     144   43    221   59    31,104  2,891  2.22   0.71   76.3   109.88  9,211.76  1.44   24.15  45.8   6,759.26  0.38

Some standard statistical measures are employed to analyze different structural characteristics of the characters for each of the pattern classes in both the training and test databases. More specifically, the maximum, minimum and average of the heights and widths of the bounding rectangles of character images for each pattern class have been estimated. These estimates give a notion of the average spatial dimension of the patterns of a class. The maximum, minimum and average of the aspect ratios of the character images for each pattern class have also been estimated to add more information regarding the shapes of the characters in the collected data samples.

Let us assume that for a pattern class $C_i$ ($1 \le i \le 199$), the total number of samples is $N_i$. Let the height and width of the bounding box of the $j$th sample in $C_i$ be defined as $h_{ij}$ and $w_{ij}$, respectively. The statistical measures specifying the structural characteristics of the character shapes in $C_i$ are: maximum height ($H_i^{max}$), minimum height ($H_i^{min}$), maximum width ($W_i^{max}$), minimum width ($W_i^{min}$), maximum area ($A_i^{max}$), minimum area ($A_i^{min}$), maximum aspect ratio ($AR_i^{max}$), minimum aspect ratio ($AR_i^{min}$), average height ($H_i^{Avg}$), average width ($W_i^{Avg}$), average area ($A_i^{Avg}$) and average aspect ratio ($AR_i^{Avg}$). The standard deviations of height, width, area and aspect ratio are denoted by $SD_H$, $SD_W$, $SD_A$ and $SD_{AR}$, respectively.

Table 4 lists the values of the abovementioned measures for some representative character classes taken from the original database, for the samples collected in isolated fashion. Along with these, the number of patterns per class together with their frequency of appearance in standard Bangla literature is also specified.
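The per-class measures defined above can be computed from the bounding boxes of one pattern class as in the following sketch. It rests on assumptions that the text does not state explicitly: the aspect ratio is taken as width divided by height (which agrees with the averages in Table 4, where the average width exceeds the average height and the average aspect ratio is above 1), the area is the product of height and width of each sample, and the standard deviation is the population form. The function name `class_statistics` is hypothetical.

```python
from statistics import mean, pstdev


def class_statistics(boxes):
    """Summarise (height, width) bounding boxes of one pattern class.

    Returns max, min, average and standard deviation for height (H),
    width (W), area (A) and aspect ratio (AR).
    """
    heights = [h for h, _ in boxes]
    widths = [w for _, w in boxes]
    areas = [h * w for h, w in boxes]    # assumed: area of each bounding box
    ratios = [w / h for h, w in boxes]   # assumed: aspect ratio = width / height

    def summarise(values):
        return {
            "max": max(values),
            "min": min(values),
            "avg": mean(values),
            "sd": pstdev(values),        # assumed: population standard deviation
        }

    return {
        "H": summarise(heights),
        "W": summarise(widths),
        "A": summarise(areas),
        "AR": summarise(ratios),
    }
```

Calling `class_statistics` once per pattern class, separately for the training and the test samples, would yield one row of each part of a table laid out like Table 4.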
The complete table, covering all 199 pattern shapes, is given in the supplementary document (see STable 2). The supplementary document is also available as a supplementary file at http://code.google.com/p/cmaterdb/ along with the database.

Database nomenclature is important for future reference by the research community. Following our predefined naming convention, we have named the current database CMATERdb 3.1.3.3. Here, “CMATER” stands for Center for Microprocessor Applications for Training Education and Research, a research laboratory of Jadavpur University, India, and “db” stands for database. The leftmost “3” indicates a handwritten character database, the next “1” indicates Bangla script, the next “3” indicates a compound character database, and the rightmost “3” indicates the third release of the gray-level images of Bangla compound characters.

3 Performance analysis on the developed database

Some research has already been done on the recognition of handwritten basic characters, vowel allographs and numerals of Bangla script. In one of the recent works, Basu et al. [15] developed a hierarchical technique for recognition of Bangla basic characters and modified shapes. Some more research contributions related to this topic are also available in the literature [19,21,68,69]. Handwritten Bangla numeral recognition is also a well-studied problem, especially in the context of postal automation in India and Bangladesh [70,71]. Recently, Bhattacharya et al. [51] developed a comprehensive database of 23,392 handwritten isolated Bangla digit samples and reported their benchmark accuracy on that database. Despite such efforts, the OCR of Bangla script remains incomplete without detailed research on compound characters. There are three earlier instances of work on handwritten Bangla compound characters [16,26,72] in the literature. In [16], Pal et al.
used the Modified Quadratic Discriminant Function (MQDF) as the classifier. The directional information obtained from the arc tangent of the gradient is used there to form the feature set for the characters. Using fivefold cross-validation, they obtained around 85.90 % accuracy on a database of Bangla compound characters containing 20,543 samples. The number of compound character classes considered there is 138, selected on the basis of older statistics published in the 1990s [73,74]. In one of our earlier works [72], a technique for recognition of handwritten compound characters was proposed, along with a discussion of the potential problems of Bangla compound character recognition. The technique advocated incrementally expanding the number of learned character classes, from the more frequently occurring to the less frequently occurring ones. As an initial step, 55 character classes covering 90 % of the total compound character usage were used there for recognition. An average recognition rate of 84.67 % was observed on that database after threefold cross-validation, using MLP-based classifiers and 84 quadtree-based longest run features. In another work [26], we considered a combination of handwritten Bangla basic and compound characters simultaneously for recognition. For that work, a total of 93 character classes, comprising 50 basic and 43 compound characters, were considered. Recognition accuracies of 79.25 % using MLP and 80.51 % using a Support Vector Machine (SVM) were achieved with shadow and quadtree-based longest run features after threefold cross-validation. The rationale for that work was to create a framework for recognition of frequently occurring compound characters along with the basic characters.
3.1 Recognition strategy for the current database

To report a benchmark recognition performance on the developed database, we have used a combination of two different feature sets, namely convex hull-based features [23] and quadtree-based longest run features [26,71,75]. More specifically, these two feature sets are used to represent binarized character images in feature space (each image is first binarized using the image-specific global threshold-based binarization algorithm discussed in Sect. 2.2 above and then size normalized to 96 × 96 pixels), and the subsequent recognition is performed by an SVM classifier [76]. In the following, we briefly describe the two feature descriptors. We also include (in Table 5) a comparative performance analysis of different popularly used feature sets on our database. The chosen feature set gives the best results among those considered in this study.

3.1.1 Quad tree-based longest run features

Before extraction of longest run features [77], a character image is first organized into a quadtree of depth 2. To do this, the root node is first created by assigning the entire character image to it. Then, all the successors of the root node are created: the character image is divided into four quadrants along the row and column passing through the center of gravity of its black pixels. The sub-images so created are assigned to the successor nodes in left-to-right order. The process is repeated recursively to create the next-level successors and assign image contents to them in the same way. The character image or sub-images contained in the nodes of the tree are used for feature extraction. Four longest run features are extracted from each of the 21 nodes (1 + 4 + 16) of the quadtree of depth 2. Thus, the number of longest run features is 84 for the present work.
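The quadtree construction and the count of 84 features can be sketched as follows. This is an illustrative reading, not the authors' implementation. The text does not spell out the four run directions or any normalisation, so this sketch assumes row-wise, column-wise and two diagonal longest runs, each summed over the node and divided by the node area. All function names are hypothetical, and images are plain lists of 0/1 rows with 1 as a character pixel.

```python
def center_of_gravity(img):
    """(row, col) centroid of the 1-pixels; geometric centre if none."""
    pts = [(r, c) for r, row in enumerate(img)
           for c, v in enumerate(row) if v]
    if not pts:
        h = len(img)
        w = len(img[0]) if h else 0
        return h // 2, w // 2
    return (round(sum(r for r, _ in pts) / len(pts)),
            round(sum(c for _, c in pts) / len(pts)))


def quadtree_nodes(img, depth=2):
    """Return the image followed by all its CG-split descendants."""
    nodes = [img]
    if depth > 0:
        r, c = center_of_gravity(img)
        top, bottom = img[:r], img[r:]
        quadrants = ([row[:c] for row in top], [row[c:] for row in top],
                     [row[:c] for row in bottom], [row[c:] for row in bottom])
        for sub in quadrants:
            nodes.extend(quadtree_nodes(sub, depth - 1))
    return nodes


def longest_run(seq):
    """Length of the longest run of consecutive 1-pixels in seq."""
    best = cur = 0
    for v in seq:
        cur = cur + 1 if v else 0
        best = max(best, cur)
    return best


def diagonals(img):
    """Pixel sequences along every top-left to bottom-right diagonal."""
    h = len(img)
    w = len(img[0]) if h else 0
    for k in range(-(h - 1), w):
        yield [img[r][r + k] for r in range(h) if 0 <= r + k < w]


def longest_run_features(img):
    """Four features per node (assumed: row, column, two diagonals)."""
    h = len(img)
    w = len(img[0]) if h else 0
    if h == 0 or w == 0:
        return [0.0, 0.0, 0.0, 0.0]
    flipped = [row[::-1] for row in img]  # turns anti-diagonals into diagonals
    sums = [sum(longest_run(row) for row in img),
            sum(longest_run(col) for col in zip(*img)),
            sum(longest_run(d) for d in diagonals(img)),
            sum(longest_run(d) for d in diagonals(flipped))]
    return [s / (h * w) for s in sums]  # assumed normalisation by node area


def qtlr_features(img, depth=2):
    """Concatenate the four features of every quadtree node."""
    feats = []
    for node in quadtree_nodes(img, depth):
        feats.extend(longest_run_features(node))
    return feats


# Toy 8 x 8 image with a filled 4 x 4 square in the middle.
toy = [[1 if 2 <= r < 6 and 2 <= c < 6 else 0 for c in range(8)]
       for r in range(8)]
toy_features = qtlr_features(toy)
```

With depth 2 the tree has 1 + 4 + 16 = 21 nodes, so `qtlr_features` returns 21 × 4 = 84 values, matching the dimension stated above.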
3.1.2 Convex hull based feature set

Any object with a non-regular shape may be represented by a collection of its topological components or features. In the current work, we have extracted several such topological features based on different bay and lake attributes of the convex hull of handwritten Bangla characters. A total of 28 features are designed on the basis of different bay attributes of the convex hull of handwritten Bangla compound characters. In any binary image, the convex hull is constructed around the character pixels (represented by the binary label “1” and often referred to as foreground). Pixels that constitute the convex hull boundary are referred to as convex hull boundary pixels. We define a distance measure $d_{cp}$ as the number of pixels from a convex hull boundary pixel to the nearest character pixel in either the horizontal or the vertical direction. The $d_{cp}$ is estimated from the top, bottom, right and left sides of the image (see Fig. 9a, which illustrates the distance measure from the left side of an image). Based on this $d_{cp}$ measure in a particular direction, five different topological features are calculated as follows:

• Maximum $d_{cp}$,
• Average $d_{cp}$,
• Total number of rows having $d_{cp}$ > 0,
• Total number of rows having $d_{cp}$ = 0,
• Number of visible bays [78].

Here, bays are measured in terms of the length of consecutive convex hull boundary pixels for which $d_{cp}$ > 0 (see Fig. 9 for illustration). Two more features are computed as the coordinates of the center of gravity $(r_x, r_y)$ of a hypothetical region created by joining the bay pixels up to the nearest character pixels along a specific direction. Figure 9b illustrates the hypothetical region created by joining such pixels along the left-to-right direction. Therefore, 28 (= 4 × 7) features are extracted from the top, bottom, right and left boundaries of an image.

Table 5 Comparative overview of recognition performances of different feature sets on the test set of CMATERdb 3.1.3.3. Success rates are in percentage on the test set of CMATERdb 3.1.3.3. Columns marked “199” consider the 199 patterns. Columns marked “171” consider the 199 patterns regrouped into 171 character classes.

| Feature description | Feature dimension | 199: Isolated samples | 199: Extracted from word images | 199: Extracted from CMATERdb 1.1.1 and CMATERdb 1.2.1 | 199: Overall performance | 171: Isolated samples | 171: Extracted from word images | 171: Extracted from CMATERdb 1.1.1 and CMATERdb 1.2.1 | 171: Overall performance |
|---|---|---|---|---|---|---|---|---|---|
| QTLR [70,72] | 84 | 76.89 | 68.42 | 61.23 | 74.60 | 77.51 | 69.58 | 62.50 | 75.35 |
| QTSh+QTLR [25] | 120 + 84 = 204 | 73.11 | 65.61 | 58.05 | 71.03 | 73.82 | 66.78 | 59.11 | 71.84 |
| Gradient features [15] | 25 × 8 = 200 | 73.40 | 65.84 | 58.69 | 71.32 | 74.06 | 66.92 | 59.75 | 72.09 |
| CH [26] | 31 × 5 = 155 | 74.37 | 67.48 | 62.50 | 72.55 | 75.04 | 68.57 | 63.14 | 73.29 |
| CH+QTLR (Present work) | 155 + 84 = 239 | 80.48 | 74.09 | 66.73 | 78.67 | 81.06 | 75.07 | 67.79 | 79.35 |

Fig. 9 Illustration of convex hull feature extraction. a The features are extracted based on $d_{cp}$ from left to right direction along with lake features. b The coordinates of center of gravity $(r_x, r_y)$ of a hypothetical region created by joining the bay pixels up to the nearest character pixels along left to right direction

It is worth mentioning that the convex hull of a broken character does not differ much from that of a complete character (see Fig. 10). However, the bay and lake counts may differ between a broken and a complete character sample. To mitigate this, smaller bays (bay length less than 4 % of the character height in the case of the vertical direction, or of the character width in the case of the horizontal direction) are rejected from consideration. Such bays are mostly generated by noisy or broken characters. Apart from the above 28 features, three more features are derived from the entire convex hull. First, the total number of convex hull boundary pixels having $d_{cp}$ = 0 is considered as one feature.
Second, the number of lakes within a character is considered as another feature. A lake in a character image denotes a region entirely surrounded by character pixels. Only such regions with an area greater than $L_{th}$ are counted as lakes. The lake threshold $L_{th}$ is chosen heuristically (for our current experiment, we have chosen $L_{th}$ = 20) and is directly proportional to the area of the character images. Finally, the total number of convex hull boundary pixels having $d_{cp}$ > 0 is considered as the third feature. Thus, 31 features are extracted from the entire character image based on different bay and lake attributes of the convex hull. Figure 9 shows this feature extraction technique for a Bangla compound character image in the horizontal direction from the left side.

To extract local information from the character images, each character pattern is further divided into four sub-images based on the centroid of the data pixels, which is referred to as the CG or center of gravity. The coordinates of the CG of an image frame, $(C_x, C_y)$, are calculated as follows:

$$C_x = \frac{1}{k}\sum_{x}\sum_{y} x \cdot f(x, y), \qquad C_y = \frac{1}{k}\sum_{x}\sum_{y} y \cdot f(x, y), \qquad f(x, y) = \begin{cases} 1, & \text{for all data pixels} \\ 0, & \text{otherwise} \end{cases}$$

where x and y are the coordinates of each pixel in the image of size m × n pixels, the sums run over all pixels of the image, and k is the count of pixels having f(x, y) = 1. The convex hulls are then constructed for the character pixels within each such sub-image for computation of the different topological features, as described earlier. A total of 124 such features (31 × 4) are computed from the 4 sub-images of each character pattern. This makes the total feature count 155, i.e., 31 features for the overall image and 124 features for the four sub-images. It is worth noting that the convex hulls of two or more different characters may be nearly identical in shape while those of their subdivisions are not. Thus, more discriminant features can be found using quadtree-based subdivision. The convex hulls constructed on the four pattern subdivisions created by the CG-based partitioning scheme are shown in Fig. 11.

Fig. 10 Convex hulls of an integral and a broken character. Here, though the character is broken, the convex hull remains the same

Fig. 11 Illustration of formation of convex hulls from the CG-based quad partitions. a Sample image of a Bangla compound character with its CG. The CG is denoted by a black pixel. b Pictures of convex hulls over four different subparts created based on the CG

Combining the two sets of features, i.e., the quadtree-based longest run and convex hull-based features, a 239-element feature set (155 + 84) is formed. The pattern classifier employed here, using the abovementioned features, is implemented with SVMs. An SVM employed as a pattern classifier for a two-class problem finds the optimal separating hyperplane that maximizes the margin between the two classes. The margin is the distance of the closest sample point, in each class, from the separating hyperplane. To employ SVMs for the multiclass problem, the One Versus One (OVO) approach is followed here. To implement the SVM, we have used LibSVM [79], a well-known open-source software for SVM. Among the different kernels available in LibSVM, we have used the Radial Basis Function (RBF) kernel with gamma (γ) value 0.5. The RBF kernel was chosen because it has been reported to perform better in handwritten digit recognition applications [80]. The trained SVMs are used later to classify the test set of the same database after extracting the same 239 features for each of the test pattern samples.

Fig. 12 Schematic diagram of our approach
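As a rough illustration of the classifier configuration described above, the sketch below evaluates the standard RBF kernel $K(u, v) = \exp(-\gamma \lVert u - v \rVert^2)$ with γ = 0.5 and shows how one-versus-one voting combines pairwise decisions. It does not call LibSVM and is not the authors' code. The function names are hypothetical, and breaking ties in favour of the smallest label is an assumption made only for this example.

```python
import math
from collections import Counter
from itertools import combinations

GAMMA = 0.5  # RBF kernel parameter reported in the text


def rbf_kernel(u, v, gamma=GAMMA):
    """Standard RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def ovo_pairs(n_classes):
    """All unordered class pairs; one binary SVM is trained per pair."""
    return list(combinations(range(n_classes), 2))


def ovo_vote(pairwise_winners):
    """Return the class that wins the most pairwise contests.

    pairwise_winners: one winning label per binary classifier.
    Assumption: ties go to the smallest label.
    """
    votes = Counter(pairwise_winners)
    top = max(votes.values())
    return min(label for label, n in votes.items() if n == top)


# Number of binary classifiers needed for the 199 pattern classes.
n_binary_classifiers = len(ovo_pairs(199))
```

For 199 pattern classes the one-versus-one scheme needs 199 × 198 / 2 = 19,701 binary classifiers, each of which casts one vote for a test sample.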
Thus, we have generated three different SVMs, trained separately on the samples collected in isolated form, from word images, and from the databases CMATERdb 1.1.1 and CMATERdb 1.2.1. Each SVM is evaluated on its respective test set. The recognition rate, computed as (number of successfully classified samples / (number of successfully classified samples + number of misclassified samples)) × 100, is 78.67 % on the 199 pattern classes. The recognition rate improves marginally to 79.35 % on 171 character classes after combining the different graphemes of a single character class. A detailed analysis of the recognition performance on the test set of CMATERdb 3.1.3.3 is shown in Table 5. Please note that the performance varies widely across the data collection modalities. Although we have obtained around 81 % accuracy on isolated character samples, the accuracy on unconstrained text (collected from CMATERdb 1.1.1 and CMATERdb 1.2.1) drops to around 67 %. A schematic diagram of our methodology is shown in Fig. 12. Table 6 shows a detailed analysis of the ten most frequently misclassified pattern classes. It may be observed from the table that the top two misclassified compound characters resemble each other so closely that their shapes differ only minimally, even in print. A few samples of misclassified and correctly classified character images are shown in Figs. 13 and 14, respectively. A possible reason for the misclassification is the insufficiency of the feature set in distinguishing the finer details of character images of different classes with close resemblance. The image samples of Fig. 14 reflect this. A two-stage pattern classification approach [81] may be explored in future to address the problem. Lexicon matching may also be useful in domain-specific applications, such as city name recognition for automatic mail sorting and extraction of information from filled-in forms.
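The recognition rate above, and the reason it rises when graphemes are merged, can be illustrated with a small sketch. The labels and the pattern-to-class mapping below are toy values invented for this example, not data from the paper, and the function names are hypothetical.

```python
def recognition_rate(true_labels, predicted_labels):
    """Percentage of samples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)


def merge_patterns(labels, pattern_to_class):
    """Map pattern-shape labels to character-class labels."""
    return [pattern_to_class[label] for label in labels]


# Toy mapping: patterns 0 and 1 are two written shapes of class "A".
pattern_to_class = {0: "A", 1: "A", 2: "B"}
true_patterns = [0, 1, 2, 2]
predicted_patterns = [1, 1, 2, 0]

rate_patterns = recognition_rate(true_patterns, predicted_patterns)
rate_classes = recognition_rate(
    merge_patterns(true_patterns, pattern_to_class),
    merge_patterns(predicted_patterns, pattern_to_class),
)
```

In this toy case the first sample is assigned the wrong shape but the right character, so it counts as an error at the pattern level (50 %) and as a success at the class level (75 %). The same effect would explain why the reported rate cannot decrease when the 199 patterns are regrouped into 171 classes.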
Three other feature extraction techniques [16,26,72], which were used previously for Bangla compound character recognition, have been tested on the present database with an SVM-based classifier. Before extraction of those features, each character sample is binarized and then normalized to a size of 96 × 96 pixels. The recognition results obtained for the different feature extraction techniques are reported in Table 5. It can be observed from the table that the present feature set (a combination of convex hull-based and quadtree-based longest run features) performs better than the other feature sets [16,26,72] in the current experimental setup.

Table 6 Top 10 poorly classified pattern classes along with their misclassification statistics

Fig. 13 Examples of some misclassified samples along with their misclassified and original class labels

Fig. 14 Some successfully classified difficult data samples with their corresponding class are shown

4 Conclusion

In the present work, we have developed a new benchmark database for Bangla handwritten compound characters after an elaborate survey of their usage patterns in contemporary Bangla literature. Although the survey revealed that there are around 334 compound characters, we have considered only 171 character classes for the present work, after discarding the characters with Ya-phala and Reph. Moreover, we have identified that some characters have more than one shape, which may hinder their correct recognition. To overcome this problem, we have treated such shapes as different classes during training. Thus, the number of pattern classes considered here is 199 instead of 171 character classes. These additional pattern classes are merged after testing, and the final recognition accuracy is evaluated over 171 character classes.
The database is prepared from handwritten compound character samples collected from three different types of datasheets consisting of isolated characters, isolated words and unconstrained handwriting. The objective is to capture a wide variation of individual writing styles and to address a potentially wide range of Bangla OCR applications with varying constraints on the writing space. In some applications, characters are written in box-formatted data sheets, and in some cases, words or sentences are written in partially formatted data sheets. Some unconstrained pages are also considered in our work to include extreme variability in the database for the compound characters. Please note that, in the popular literature, only around 6 % of characters are compound in nature, and from 150 unconstrained handwritten Bangla pages, only 2,049 compound characters could be found. Therefore, to prepare a complete train–test database, we have collected handwritten isolated compound characters and words (containing specific compound characters) from participating volunteers. We have also made the complete database of 55,278 characters available in the public domain as CMATERdb 3.1.3.3 from the website http://code.google.com/p/cmaterdb/. To the best of our knowledge, this is the largest (if not the first) database of its kind available to the pattern recognition research community. A subset of this database was used previously during the Bangla Handwriting Recognition Competition in ICFHR 2010 [82]. In the current work, we have used the convex hull and quadtree-based longest run feature set to report the benchmark recognition performance on the test set of the developed database. Scope remains to improve the feature set, especially by introducing global and local features or a multistage recognition strategy.
In a nutshell, the current work is an effort to help researchers in the domain of handwritten OCR to work on a large and complex character database of one of the most widely used scripts of the world.

Acknowledgments

Authors are thankful to the “Center for Microprocessor Application for Training Education and Research”, “Project on Storage Retrieval and Understanding of Video for Multimedia” of Computer Science & Engineering Department, Jadavpur University, for providing infrastructure facilities during progress of the work. The work reported here has been partially funded by DST, Govt. of India, PURSE (Promotion of University Research and Scientific Excellence) Program.

References

1. Fujisawa, H.: Forty years of research in character and document recognition - an industrial perspective. Pattern Recognit. 41(8), 2435–2446 (2008)
2. Cheriet, M., El Yacoubi, M., Fujisawa, H., Lopresti, D., Lorette, G.: Handwriting recognition research: twenty years of achievement ... and beyond. Pattern Recognit. 42(12), 3131–3135 (2009)
3. Su, T.-H., Zhang, T.-W., Guan, D.-J., Huang, H.-J.: Off-line recognition of realistic chinese handwriting using segmentation-free strategy. Pattern Recognit. 42(1), 167–182 (2009)
4. Srihari, S., Yang, X., Ball, G.: Offline chinese handwriting recognition: an assessment of current technology. Front. Comput. Sci. China 1(2), 137–155 (2007)
5. Kimura, F.: OCR Technologies for machine printed and hand printed Japanese text. In: Chaudhuri, B.B. (ed.) Digital document processing. Advances in pattern recognition, pp. 49–71. Springer, London (2007)
6. Kwon, J.-O., Sin, B., Kim, J.H.: Recognition of on-line cursive korean characters combining statistical and structural methods. Pattern Recognit. 30(8), 1255–1263 (1997)
7. Kim, H.J., Kim, P.K.: Recognition of off-line handwritten korean characters. Pattern Recognit. 29(2), 245–254 (1996)
8. Amin, A.: Off line Arabic character recognition: a survey. In: The fourth international conference on document analysis and recognition, pp. 596–599 (1997)
9. Pal, U., Chaudhuri, B.B.: Indian script character recognition: a survey. Pattern Recognit. 37(9), 1887–1899 (2004)
10. Pal, U., Jayadevan, R., Sharma, N.: Handwriting recognition in indian regional scripts: a survey of offline techniques. ACM Trans. Asian Lang. Inf. Process. 11(1), 1–35 (2012)
11. Arya, D., Jawahar, C., Bhagvati, C., Patnaik, T., Chaudhuri, B., Lehal, G., Chaudhury, S., Ramakrishna, A.: Experiences of integration and performance testing of multilingual OCR for printed Indian scripts. In: Proceedings of the 2011 joint workshop on multilingual OCR and analytics for noisy unstructured text data, p. 9. ACM (2011)
12. Pal, U., Wakabayashi, T., Kimura, F.: Comparative study of Devnagari handwritten character recognition using different feature and classifiers. In: 10th international conference on document analysis and recognition (ICDAR ’09.), pp. 1111–1115 (2009)
13. Jagadeesh Kannan, R., Prabhakar, R.: A comparative study of optical character recognition for tamil script. Eur. J. Sci. Res. 35(4), 570–582 (2009)
14. Pal, U., Wakabayashi, T., Kimura, F.: A system for off-line Oriya handwritten character recognition using curvature feature. In: 10th international conference on information technology (ICIT 2007), pp. 227–229 (2007)
15. Basu, S., Das, N., Sarkar, R., Kundu, M., Nasipuri, M., Basu, D.K.: A hierarchical approach to recognition of handwritten bangla characters. Pattern Recognit. 42(7), 1467–1484 (2009)
16. Pal, U., Wakabayashi, T., Kimura, F.: Handwritten Bangla compound character recognition using gradient feature. In: 10th international conference on information technology-07, pp. 208–213 (2007)
17. Roy, K., Pal, U., Kimura, F.: Bangla handwritten character recognition. In: Prasad, B. (ed.) 2nd Indian international conference on artificial intelligence, pp. 431–443. Pune, India (2005)
18. Bhattacharya, U., Parui, S.K., Shridhar, M., Kimura, F.: Two-stage recognition of handwritten Bangla alphanumeric characters using neural classifiers. In: Prasad, B. (ed.) 2nd Indian international conference on artificial intelligence, pp. 1357–1376. Pune, India (2005)
19. Bhowmik, T., Bhattacharya, U., Parui, S.: Recognition of bangla handwritten characters using an mlp classifier based on stroke features. In: Pal, N., Kasabov, N., Mudi, R., Pal, S., Parui, S. (eds.) Neural Inf. Process. Lecture notes in computer science, vol. 3316, pp. 814–819. Springer, Berlin (2004)
20. Chaudhuri, B.B., Pal, U.: A complete printed bangla ocr system. Pattern Recognit. 31(5), 531–549 (1998)
21. Bhowmik, T., Ghanty, P., Roy, A., Parui, S.: Svm-based hierarchical architectures for handwritten bangla character recognition. Int. J. Doc. Anal. Recognit. 12(2), 97–108 (2009)
22. Das, N., Sarkar, R., Basu, S., Kundu, M., Nasipuri, M., Basu, D.K.: A genetic algorithm based region sampling for selection of local features in handwritten digit recognition application. Appl. Soft Comput. 12(5), 1592–1606 (2012)
23. Das, N., Pramanik, S., Basu, S., Saha, P.K., Sarkar, R., Kundu, M., Nasipuri, M.: Recognition of handwritten Bangla basic characters and digits using convex hull based feature set. In: Dimitrios A. Karras, Z.M., Etienne E. Kerre, Chunping Li (eds.) International conference on artificial intelligence and pattern recognition, Orlando, Florida, USA, pp. 380–386. ISRST (2009)
24. http://censusindia.gov.in/Census_Data_2001/Census_Data_Online/Language/Statement1.htm. Accessed 22nd July 2011
25. http://en.wikipedia.org/wiki/Bengali_language. Accessed 22nd July 2011
26. Das, N., Das, B., Sarkar, R., Basu, S., Kundu, M., Nasipuri, M.: Handwritten bangla basic and compound character recognition using mlp and svm classifier. J. Comput. 2(2), 109–115 (2010)
27. http://en.wikipedia.org/wiki/Paschimbanga_Bangla_Akademi. Accessed 22nd July 2011
28. Sarkar, P., Mukhopadhay, A., DasGupta, P.: Akaademi Bannan Abhidhan. In: Chakrabarty, N., Ghosh, S., Sarkar, P., Chaki, J., Das, N., Mukhopadhay, A., Bhattachajee, S., Amitava, C., Mukhopadhay, A., Bhattacharjee, S., Das, P., Chattopadhay, S., Basu, A., Mandal, S. (eds.). Akademi Bannan Abhidhan, p. 582. Pachimbanga Bangla Akaademi, Kolkata (2008)
29. Wilkinson, R.A., Geist, J., Janet, S., Grother, P.J., Burges, C.J.C., Creecy, R., Hammond, B., Hull, J.J., Larsen, N.J., Vogl, T.P., Wilson, C.L.: In: The first census optical character recognition system conference. p. 372 (1992)
30. Marti, U.V., Bunke, H.: The iam-database: an english sentence database for offline handwriting recognition. Int. J. Doc. Anal. Recognit. 5(1), 39–46 (2002)
31. Hull, J.J.: A database for handwritten text recognition research. IEEE Trans. Pattern Anal. Mach. Intell. 16(5), 550–554 (1994)
32. MNIST Dataset. http://yann.lecun.com/exdb/mnist. Accessed 29th July 2011
33. OCR Database. http://ai.stanford.edu/~btaskar/ocr/ (2011). Accessed 22nd July 2011
34. Honggang, Z., Jun, G., Guang, C., Chunguang, L.: HCL2000 - A large-scale handwritten Chinese character database for handwritten character recognition. In: ICDAR ’09., pp. 286–290 (2009)
35. Abdleazeem, S., El-Sherif, E.: Arabic handwritten digit recognition. Int. J. Doc. Anal. Recognit. 11(3), 127–141 (2008)
36. Khosravi, H., Kabir, E.: Introducing a very large dataset of handwritten farsi digits and a study on their varieties. Pattern Recognit. Lett. 28(10), 1133–1141 (2007)
37. Mozaffari, S., Faez, K., Faradji, F., Ziaratban, M. A., Golzan, S.M.: A comprehensive isolated Farsi/Arabic character database for handwritten OCR research. In: Tenth international workshop on frontiers in handwriting recognition, La Baule (France), pp. 385–389 (2006)
38. Al-Ohali, Y., Cheriet, M., Suen, C.: Databases for recognition of handwritten Arabic cheques. Pattern Recognit. 36(1), 111–121 (2003)
39. Kavallieratou, E., Liolios, N., Koutsogeorgos, E., Fakotakis, N., Kokkinakis, G.: The GRUHD database of Greek unconstrained handwriting. In: Sixth international conference on document analysis and recognition, pp. 561–565 (2001)
40. Viard-Gaudin, C., Lallican, P.M., Knerr, S., Binter, P.: The IRESTE On/Off (IRONOFF) dual handwriting database. In: Fifth international conference on document analysis and recognition (ICDAR ’99.), pp. 455–458 (1999)
41. Kim, D.-H., Hwang, Y.-S., Park, S.-T., Kim, E.-J., Paek, S.-H., Bang, S.-y.: Handwritten Korean Character Image Database PE92. In: IEICE transactions on information and systems, pp. 943–950 (1996)
42. Noumi, T., Matsui, T., Yamashita, I., Wakahara, T., Tsutsumida, T.: Tegaki Suji database ’IPTP CD-ROM1’ no ichi bunseki (in Japanese). Autumn Meeting of IEICE D-309 (1994)
43. Yamada, H., Yamamoto, K., Saito, T.: A nonlinear normalization method for handprinted kanji character recognition-line density equalization. Pattern Recognit. 23(9), 1023–1029 (1990)
44. Liu, Y., Tai, J., Liu, J.: An introduction to the 4 million handwriting Chinese character samples library. In: International conference on Chinese computing and orient language processing, Changsa, China, pp. 94–97 (1989)
45. Saito, T., Yamada, H., Yamamoto, K.: On the Database ELT9 of Handprinted Characters in JIS Chinese Characters and Its Analysis (in Japanese). Trans. IECEJ J.68-D(4), 757–764 (1985)
46. Mori, S., Yamamoto, K., Yamada, H., Saito, T.: On a handprinted kyoiku-kanji character data base. Bull. Electrotech. Lab. 43(11–12), 752–773 (1979)
47. http://www.hpl.hp.com/india/research/penhw-interfaces-1linguistics.html. (2011). Accessed 22nd July 2011
48. http://code.google.com/p/hit-mw-database/wiki/HomePage. (2011). Accessed 22nd July 2011
49. http://users.iit.demokritos.gr/~bgat/HandSegmCont2009/. (2011). Accessed 22nd July 2011
50. Bhattacharya, U.: Handwritten character databases of indic scripts. http://www.isical.ac.in/~ujjwal/download/database.html (2011). Accessed 22nd July 2011
51. Bhattacharya, U., Chaudhuri, B.B.: Handwritten numeral databases of indian scripts and multistage recognition of mixed numerals. IEEE Trans. Pattern Anal. Mach. Intell. 31(3), 444–457 (2009)
52. Bhattacharya, U., Shridhar, M., Parui, S.K., Sen, P.K., Chaudhuri, B.B.: Offline recognition of handwritten bangla characters: an efficient two-stage approach. Pattern Anal. Appl. 15(4), 445–458 (2012)
53. Das, N., Reddy, J.M., Sarkar, R., Basu, S., Kundu, M., Nasipuri, M., Basu, D.K.: A statistical-topological feature combination for recognition of handwritten numerals. Appl. Soft Comput. 12(8), 2486–2495 (2012)
54. Sarkar, R., Das, N., Basu, S., Kundu, M., Nasipuri, M., Basu, D.: Cmaterdb1: a database of unconstrained handwritten Bangla and Bangla-english mixed script document image. Int. J. Doc. Anal. Recognit. 15(1), 71–83 (2012)
55. Chattopadhyay, S.K.: Bangla Bhasatattver Bhumika. Calcutta University Press, Kolkata (1974)
56. Sproat, R.: A formal computational analysis of indic scripts. In: International symposium on indic scripts: past and future, Tokyo (2003)
57. Consortium, U.: The unicode standard, Version 6.1 - core specification. The Unicode Consortium, Mountain View, CA, 2012. ISBN 978-1-936213-02-3. URL http://www.unicode.org/versions/Unicode6.1.0
58. MSVS, B.R., Vardhan, V., GA, N., Reddy, P.: A noval security model for indic scripts-a case study on Telugu. Int. J. Comput. Sci. Secur. (IJCSS) 3(4), 303
59. Das, N.S.: Modern Bengali script: an introduction. Dakhabharati, Kolkata (2010)
60. AnandaBazar Patrika. http://www.anandabazar.in/ (2012). Accessed 10th March 2012
61. AajKaal. http://www.aajkaal.net (2011). Accessed 29th July 2011
62. Bartaman. http://bartamanpatrika.com (2011). Accessed 29th July 2011
63. Anandamela, Desh. http://my.anandabazar.com/content/magazines (2011). Accessed 29th July 2011
64. Sarat Rachanabali. http://www.sarat-rachanabali.nltr.org/ (2012). Accessed 3rd March 2012
65. Bangla Documets. http://banglalibrary.evergreenbangla.com/ (2012). Accessed 4th March 2012
66. Newspapers from Bangladesh. http://new.ittefaq.com.bd/, http://www.prothom-alo.com/ (2012). Accessed 4th March 2012
67. CMATER Handwritten Character Database. http://code.google.com/p/cmaterdb/ (2011). Accessed 1st Aug 2011
68. Bhattacharya, U., Shridhar, M., Parui, S.: On recognition of handwritten Bangla characters. In: Kalra, P., Peleg, S. (eds.) Computer vision, graphics and image processing. Lecture notes in computer science, pp. 817–828. Springer, Berlin (2006)
69. Rahman, A.F.R., Rahman, R., Fairhurst, M.C.: Recognition of handwritten bengali characters: a novel multistage approach. Pattern Recognit. 35(5), 997–1006 (2002)
70. Wen, Y., Lu, Y., Shi, P.: Handwritten bangla numeral recognition system and its application to postal automation. Pattern Recognit. 40(1), 99–107 (2007)
71. Basu, S., Das, N., Sarkar, R., Kundu, M., Nasipuri, M., Kumar Basu, D.: A novel framework for automatic sorting of postal documents with multi-script address blocks. Pattern Recognit. 43(10), 3507–3521 (2010)
72. Das, N., Basu, S., Sarkar, R., Kundu, M., Nasipuri, M., Basu, D.K.: Handwritten Bangla compound character recognition: potential challenges and probable solution. In: Prasad, B., Lingras, P., Ram, A. (eds.) 4th Indian international conference on artificial intelligence, Bangalore, pp. 1901–1913 (2009)
73. Chaudhuri, B.B., Pal, U.: Relational studies between phoneme and grapheme statistics in current bangla. J. Acoust. Soc. India 23, 67–77 (1995)
74. Pal, U., Chaudhury, B.B.: Character occurrence statistics in Bangla language and recognition of Bangla printed script. In: ICAPRDT, Kolkata, pp. 52–59 (1993)
75. Das, N., Basu, S., Sarkar, R., Kundu, M., Nasipuri, M., Basu, D.K.: An Improved feature descriptor for recognition of handwritten Bangla alphabet. In: Guru, D.S., Vasudev, T. (eds.) International conference on signal and image processing, Mysore, India, pp. 451–454. Excel India Publishers (2009)
76. Burges, C.J.C.: A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discov. 2(2), 121–167 (1998)
77. Basu, S., Das, N., Sarkar, R., Kundu, M., Nasipuri, M., Basu, D.: Recognition of numeric postal codes from multi-script postal address blocks. In: Chaudhury, S., Mitra, S., Murthy, C.A., Sastry, P.S., Pal, S. (eds.) Pattern Recognit. Mach. Intell. Lecture notes in computer science, vol. 5909, pp. 381–386. Springer, Berlin (2009)
78. Marlow, B.K., Batchelor, B.G.: Improving the speed of convex hull calculations. Electron. Lett. 16(9), 319–321 (1980)
79. Chang, C.-C., Lin, C.-J.: Libsvm : a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2(3), 1–27 (2011)
80. Das, N., Mandal, B., Basu, S., Sarkar, R., Kundu, M., Nasipuri, M.: An SVM-MLP classifier combination scheme for recognition of handwritten Bangla digits. In: Kale, K.V., Malhrota, S.C., Manza, R.R. (eds.) 2nd International conference on advances in computer vision and information technology, Aurangabad, India, pp. 615–623. I. K. International Publishing House Pvt. Ltd. (2009)
81. Basu, S., Chaudhuri, C., Kundu, M., Nasipuri, M., Basu, D.: A two-pass approach to pattern classification. In: Pal, N., Kasabov, N., Mudi, R., Pal, S., Parui, S. (eds.) Neural information processing. Lecture notes in computer science, vol. 3316, pp. 781–786. Springer, Berlin (2004)
82. El Abed, H., Märgner, V., Blumenstein, M.: International conference on frontiers in handwriting recognition (ICFHR 2010) - competitions overview. In: 12th international conference on frontiers in handwriting recognition, pp. 703–708 (2010)