Segmentation of Touching Characters in Printed Devanagari and Bangla Scripts Using Fuzzy Multifactorial Analysis

Prepared By: Sanjeev Maharjan
St. Xavier’s College

Background

Optical Character Recognition (OCR) is a major component of a document analysis system. It recognizes the text of a paper document after the document has been scanned. Before the text can be recognized, it must be segmented. Segmentation takes place in three steps: the text is first segmented into lines, the lines are segmented into words, and finally the words are segmented into characters. Once the text has been segmented into characters, recognizing each character is a two-step process:

1. Extract the distinguishing features of the character image.
2. Find the member of a predefined set of characters that best matches the character image.

Introduction to Devanagari and Bangla Characters

Both scripts are written horizontally from left to right and make no distinction between uppercase and lowercase. Each script has 50 basic characters. Among these characters, the vowels often take modified shapes in a word, known as allographs or modifiers. Several consonant characters combine to form compound characters that partly retain the shape of the constituent characters. The number of compound characters in Bangla and Devanagari is more than 250. The characters of a word are joined by a horizontal line, called Shirorekha in Devanagari and Matra in Bangla, which is referred to here as the headline. In both scripts, a text line may have three parts:

- Upper zone: the portion above the headline
- Middle zone: the portion containing the basic and compound characters
- Lower zone: the portion containing the modifiers

Segmentation in Devanagari and Bangla Scripts

Documents in Devanagari and Bangla scripts are scanned and segmented into lines, words and characters.
Headlines are detected from the large values of the row-wise sum of black pixels. The row between two consecutive headlines where the projection profile is lowest marks the boundary between two text lines. Similarly, words are segmented using the vertical pixel projection profile. Once the headline is removed from a word, the characters in it are no longer connected to each other, so they can be segmented individually.

Touching Characters in OCR

The efficiency of an OCR system relies heavily on the segmentation error rate: the lower the segmentation error rate, the higher the efficiency. A major challenge for segmentation is adjacent characters that touch each other in the scanned image. Segmentation is carried out on the basis of connectivity analysis, so touching characters are treated as a single unit. Touching characters are less common in Roman text, but they appear frequently in printed Devanagari and Bangla documents, and even more often in old books, copied materials and low-quality newspapers. Because the touching characters are passed on as a single character, the resulting segment fails in the character recognition phase.
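
The projection-profile steps described under "Segmentation in Devanagari and Bangla Scripts" can be sketched as follows. This is a minimal illustration, not the implementation used in the study: the image is assumed to be a binary list of rows (1 = black pixel), the headline is assumed to be exactly one pixel thick, and all function names are hypothetical.

```python
def row_profile(image):
    """Row-wise count of black pixels (image: list of rows of 0/1)."""
    return [sum(row) for row in image]


def column_profile(image):
    """Column-wise count of black pixels."""
    return [sum(col) for col in zip(*image)]


def find_headline_row(image):
    """Index of the row with the largest black-pixel count."""
    profile = row_profile(image)
    return max(range(len(profile)), key=profile.__getitem__)


def split_on_gaps(profile):
    """(start, end) pairs, end exclusive, of runs where the profile is non-zero."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i
        elif value == 0 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def segment_characters(word):
    """Remove the (assumed one-pixel) headline, then split on empty columns."""
    headline = find_headline_row(word)
    body = [row for r, row in enumerate(word) if r != headline]
    return split_on_gaps(column_profile(body))


# Toy word image: row 1 is the headline joining two characters.
word = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 1, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 0, 1, 0],
]
print(segment_characters(word))  # [(1, 3), (4, 6)]
```

If the two characters in the toy image touched below the headline, column 3 would no longer be empty and the split would return a single segment, which is exactly the failure that touching characters cause.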
Research Observation on Touching Characters for Devanagari and Bangla Scripts

A study of touching characters in sample documents taken from old books, newspapers and copied documents gave the following observations:

- An image of touching characters mostly consists of two characters.
- The combined shape of the touching characters is not a valid character.
- Touching characters generally have a larger aspect ratio than single isolated characters.
- At most touching positions, a single black run is encountered.
- The vertical thickness of the black blob at the touching position is usually small.
- The constituent characters are most likely to touch each other at the middle of the middle zone.
- At touching points, the character parts generate uncommon stroke patterns.

Fuzzy Multifactorial Analysis

In 1982, Wang first defined the concept of factor spaces and applied it to the study of artificial intelligence. According to him, a ‘factor’ is a primary term with properties such as state and characteristics. H.X. Li and V.C. Yen have discussed four types of factors:

1. Measurable factors (such as time or height)
2. Nominal factors (such as religion)
3. Degree factors or fuzzy factors (such as degree of similarity or feasibility)
4. Switch or Boolean factors (only two possible values)

The factors relevant to touching characters are fuzzy factors. To segment touching characters, multiple fuzzy factors are considered together to identify the optimal cut columns.

Identifying the Touching Characters

Checking every character for touching while segmenting would be very costly. To minimize this overhead, only the characters that are rejected by the character recognizer are checked for touching. Touching characters are identified by a technique based on multifactorial analysis, for which two factors are defined first: one measures dissimilarity (f_md) and the other measures aspect ratio (f_ar).
The dissimilarity factor f_md is calculated as:

    f_md = 1 - d_off / d

where d is the minimum dissimilarity distance of the target character against a set of stored prototypes, and d_off is the offset distance used by the character classifier: a character with d less than d_off is accepted by the classifier and is rejected otherwise. Since only rejected characters are examined, d is at least d_off, so f_md lies in the interval [0,1).

The aspect ratio factor f_ar is defined as:

    f_ar = e^a / (1 + e^a),  where a = w / h

Here w is the width and h is the height of the minimum upright bounding box of the character. Recognition tends to fail for components whose width-to-height ratio is larger than that of isolated characters.

The multifactorial function used to combine these two factors is:

    M_id(f_md, f_ar) = (f_md + f_ar) / 2

If the value of M_id is above a threshold value, the character is classified as a touching character.

Finding Cut Positions for Segmentation

Once the touching characters are identified, multifactorial analysis is carried out again to separate them. This analysis is based on five fuzzy factors:

- f_ic (inverse crossing count) = 1/c, where c is the vertical crossing count for a pixel column. If the crossing count of a column is more than one, the column is less preferable as a cut column.
- f_mt (measure of blob thickness) = 1 - t/T, where t is the number of black pixels encountered in one column scan and T is the height of the middle zone of the characters.
- f_dm (degree of middleness): this factor reflects where the pixel lies in the middle zone, i.e. whether it lies at the middle of the middle zone or away from its center.
- f_up (upper stroke pattern)
- f_low (lower stroke pattern)

If an image of touching characters has m pixel columns, the five factors are evaluated for all m columns and a 5 x m one-factor evaluation matrix V_s is formed as follows:

          | f_ic(1)   f_ic(2)   ...  f_ic(m)  |
          | f_mt(1)   f_mt(2)   ...  f_mt(m)  |
    V_s = | f_dm(1)   f_dm(2)   ...  f_dm(m)  |
          | f_up(1)   f_up(2)   ...  f_up(m)  |
          | f_low(1)  f_low(2)  ...  f_low(m) |

Each column of the matrix corresponds to one pixel column of the image and is a 5-D vector that reflects five different aspects of that column's suitability as a cut column.
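
The two identification factors and the two cut-column factors that have closed-form definitions can be sketched as below. This is an illustration under stated assumptions: all function names are hypothetical, the threshold of 0.6 is an arbitrary illustrative value (no threshold is given here), the crossing count is assumed to be the number of black runs met in a top-to-bottom column scan, and f_dm, f_up and f_low are left out because their formulas are not stated.

```python
import math


def dissimilarity_factor(d, d_off):
    """f_md = 1 - d_off / d, for a rejected character (d >= d_off > 0)."""
    return 1.0 - d_off / d


def aspect_ratio_factor(width, height):
    """f_ar = e^a / (1 + e^a) with a = width / height."""
    a = width / height
    return math.exp(a) / (1.0 + math.exp(a))


def is_touching(d, d_off, width, height, threshold):
    """Average the two factors (M_id) and compare with the threshold."""
    m_id = 0.5 * (dissimilarity_factor(d, d_off)
                  + aspect_ratio_factor(width, height))
    return m_id > threshold


def crossing_count(column):
    """Number of black runs in a column scan (assumed meaning of c)."""
    count, previous = 0, 0
    for pixel in column:
        if pixel and not previous:
            count += 1
        previous = pixel
    return count


def column_factors(middle_zone):
    """Return the f_ic and f_mt rows of the evaluation matrix.

    middle_zone: list of rows of 0/1 covering only the middle zone,
    so its height is T.
    """
    T = len(middle_zone)
    f_ic, f_mt = [], []
    for column in zip(*middle_zone):
        c = crossing_count(column)
        # Assumption: an empty column (c == 0) is scored like c == 1.
        f_ic.append(1.0 / max(c, 1))
        f_mt.append(1.0 - sum(column) / T)
    return f_ic, f_mt


# Illustrative values only.
print(is_touching(d=2.0, d_off=1.0, width=20, height=10, threshold=0.6))  # True
zone = [
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 1],
]
f_ic_row, f_mt_row = column_factors(zone)
print(f_ic_row)  # [1.0, 1.0, 0.5]
```

In the toy zone, the middle column has a single thin black run, so it scores highest on both f_ic and f_mt, matching the observation that touching positions usually show one thin black run.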
Next, these five aspects are combined and mapped into a 1-D scalar by an ASM function M_s. The function M_s transforms the 5 x m matrix V_s into a 1 x m matrix V'_s as follows:

    V'_s = [ f_s(1)  f_s(2)  ...  f_s(m) ],  where f_s(i) = M_s(f_ic(i), f_mt(i), f_dm(i), f_up(i), f_low(i))

Since the state spaces of all five factors are theoretically bounded by the interval [0,1] and M_s retains the properties of an ASM function, the state space of f_s is also bounded by the interval [0,1]. In effect, M_s gives each pixel column i a degree of membership f_s(i) that reflects the possibility of the i-th column being a cut column for separating the characters. A higher f_s value indicates a larger possibility that the column is a cut column, i.e. the column at which the touching characters should be separated.

Confirmation of the Cut Column

The possible cut columns are ranked by their f_s values, largest first. The segments generated by a cut column are passed to the character classifier. If either or both of the two segments are recognized, the cut point is taken as an appropriate separation point. If both segments are rejected, the cut point is discarded and the cut point with the next largest f_s value is selected.

Simplified algorithm for confirming the cut position:

1. List the optimal cut positions identified.
2. Take the cut position with the highest multifactorial evaluation value.
3. Segment the touching character at that position into two parts, p1 and p2.
4. Send p1 and p2 to the character classifier.
5. If both p1 and p2 are recognized, the cut position is confirmed.
   Else if only p1 is recognized, take the cut position with the next highest multifactorial evaluation value and repeat from step 3 to segment p2.
   Else if only p2 is recognized, take the cut position with the next highest multifactorial evaluation value and repeat from step 3 to segment p1.
   Else take the cut position with the next highest multifactorial evaluation value and repeat from step 3.

Results Achieved by Implementing Fuzzy Multifactorial Analysis

Problems still exist in identifying the touching characters in a document.
Sometimes valid characters are interpreted as touching characters, and sometimes touching characters are not identified at all.

Segmentation accuracy is 98.92% for Devanagari and 98.47% for Bangla after using this separation module for the touching characters.

System throughput (T) is calculated as:

    T = C / t

where C is the total number of characters properly recognized by the OCR and t is the total time elapsed for the operation.

System efficiency (E) is calculated as:

    E = (Nv * 100) / Nt

where Nv is the number of valid cut columns and Nt is the total number of cut columns checked to find the valid cuts.

Conclusion

Fuzzy multifactorial analysis gives very promising results for the Devanagari script, so the method can also be applied to reduce the errors caused by touching characters in Nepali OCR.
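
As a closing illustration, the cut confirmation loop and the efficiency measure E can be sketched together. This is a simplified sketch, not the implementation evaluated above: it accepts the first cut for which at least one side is recognized, and it omits the follow-up step of re-segmenting an unrecognized remainder. The names confirm_cut, efficiency and is_recognized are hypothetical, columns stand in for pixel columns, and a cut at index i is assumed to split the image into columns[:i] and columns[i:].

```python
def confirm_cut(columns, scores, is_recognized):
    """Try cut positions in decreasing order of their evaluation value.

    columns: sequence of pixel columns of the touching-character image.
    scores: one evaluation value f_s per column.
    is_recognized: callable standing in for the character classifier.
    Returns (cut index or None, number of cut positions checked).
    """
    # Index 0 is skipped because it would leave an empty left segment.
    order = sorted(range(1, len(columns)),
                   key=lambda i: scores[i], reverse=True)
    checked = 0
    for cut in order:
        left, right = columns[:cut], columns[cut:]
        checked += 1
        if is_recognized(left) or is_recognized(right):
            return cut, checked
    return None, checked


def efficiency(valid_cuts, checked_cuts):
    """E = (Nv * 100) / Nt."""
    return valid_cuts * 100 / checked_cuts


# Toy example: letters stand in for pixel columns, and the stand-in
# classifier only knows the segments ("A", "B") and ("C", "D").
columns = ["A", "B", "C", "D"]
scores = [0.1, 0.9, 0.8, 0.3]
known = {("A", "B"), ("C", "D")}
cut, checked = confirm_cut(columns, scores, lambda seg: tuple(seg) in known)
print(cut, checked)            # 2 2
print(efficiency(1, checked))  # 50.0
```

Here the highest-scoring cut is rejected because neither side is recognized, the second one is confirmed, and one valid cut out of two checked gives E = 50.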