March 2008 SegmentationReport_By_SanjeevMaharjan

Segmentation of Touching Characters in Printed
Devanagari and Bangla Scripts
Using Fuzzy Multifactorial Analysis
Prepared By:
Sanjeev Maharjan
St. Xavier’s College
Background
Optical Character Recognition (OCR) is a major component of a document analysis system.
It recognizes the text of a paper document after the document has been scanned. Before
the text can be recognized, it must first be segmented.
Segmentation takes place in three steps: the text is segmented into lines, the lines
are segmented into words, and the words are segmented into characters.
Once the text has been segmented into characters, recognizing each character is a
two-step process:
1. Extract the distinguishing features of the character image
2. Find the member of a predefined character set that best matches the character
image
Introduction to Devnagari and Bangla Characters
Both scripts are written horizontally from left to right and have no distinction
between uppercase and lowercase.
There are 50 basic characters in both scripts. Among these, the vowels often take
modified shapes in a word, known as allographs or modifiers. Several consonant
characters combine to form compound characters that partly retain the shapes of the
constituent characters. The number of compound characters in Bangla and Devnagari
is more than 250. The characters of a word are joined by a horizontal line, called
Shirorekha in Devnagari and Matra in Bangla, which is referred to here as the headline.
In both scripts, a text line may have three parts:
• Upper zone: the portion above the headline
• Middle zone: contains the basic and compound characters
• Lower zone: contains the modifiers
Segmentation in Devnagari and Bangla Scripts
Documents in Devnagari and Bangla scripts are scanned and segmented into lines, words
and characters. Headlines are detected as the rows with large values in the row-wise
sum of black pixels (the horizontal projection profile). Between two consecutive
headlines, the row where the projection profile is lowest marks the boundary between
two text lines. Similarly, words are segmented using the vertical pixel projection
profile. Once the headline is removed from a word, the characters in it are no longer
connected to one another, so they can be segmented individually.
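The character-level step can be sketched in a few lines of Python. This is only an illustration, not the report's implementation: it assumes a binary NumPy image in which 1 means a black pixel, and the names `runs_of_ink`, `segment_characters` and the `headline_fraction` cutoff are hypothetical choices made for the example.

```python
import numpy as np

def runs_of_ink(profile):
    """Return (start, end) pairs, end exclusive, for maximal runs where profile > 0."""
    ink = np.concatenate(([0], (np.asarray(profile) > 0).astype(int), [0]))
    edges = np.flatnonzero(np.diff(ink))  # alternating run starts and run ends
    return list(zip(edges[::2].tolist(), edges[1::2].tolist()))

def segment_characters(word, headline_fraction=0.8):
    """Split one word image into character column ranges.

    Rows whose black-pixel sum is close to the maximum are treated as the
    headline and erased (headline_fraction is an assumed cutoff); the
    remaining ink is then split wherever the vertical projection is zero.
    """
    word = np.asarray(word)
    row_sum = word.sum(axis=1)                 # row-wise sum of black pixels
    headline_rows = row_sum >= headline_fraction * row_sum.max()
    body = word.copy()
    body[headline_rows, :] = 0                 # remove the headline
    return runs_of_ink(body.sum(axis=0))       # vertical projection profile

# Toy word: a headline in row 0 joining two vertical strokes.
word = np.zeros((5, 7), dtype=int)
word[0, :] = 1
word[1:, 1] = 1
word[1:, 5] = 1
print(segment_characters(word))
```

The same `runs_of_ink` helper could be applied to a vertical projection of a whole line to separate words. Touching characters are exactly the case this sketch cannot split, since no zero-valued column separates them.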
Touching Characters in OCR
The efficiency of OCR relies heavily on the segmentation error rate: the lower the
segmentation error rate, the higher the efficiency of the OCR. Segmentation, in turn,
has its own challenge to overcome: adjacent characters that touch in the scanned
image. This is a major problem because segmentation is based on connectivity
analysis, which treats touching characters as a single unit.
Touching characters may be less common in Roman-script segmentation, but they appear
frequently in documents printed in Devnagari and Bangla. They are more frequent in
old books, copied materials and low-quality newspapers.
Since touching characters are treated as a single character, the resulting segments
fail in the character recognition phase. This is the setback that touching characters
cause for segmentation.
Research Observations on Touching Characters for Devnagari and Bangla Scripts
In a study of touching characters in many sample documents from old books, newspapers
and copied documents, the following were observed:
• An image of touching characters mostly consists of two characters
• The combined shape of the characters is not a valid character
• Touching characters generally have a larger aspect ratio than single isolated
characters
• At most touching positions a single black run is encountered
• The vertical thickness of the black blob at the touching position is usually small
• The constituent characters are most likely to touch each other at the middle of the
middle zone
• At touching points the character parts generate uncommon stroke patterns
Fuzzy Multifactorial Analysis
In 1982 Wang first defined the concept of factor spaces and applied it to the study of
Artificial Intelligence. According to him, a 'factor' is a primary term with properties
such as state and characteristics.
H.X. Li and V.C. Yen have discussed four types of factors:
1. Measurable factors (such as time and height)
2. Nominal factors (such as religion)
3. Degree factors or fuzzy factors (such as degree of similarity or feasibility)
4. Switch or Boolean factors, which have only two possible values
The relevant factors for touching characters are fuzzy factors. To segment touching
characters, multiple fuzzy factors are considered together to identify the optimal cut
columns.
Identifying the Touching Characters
Checking every character for touching while segmenting would be very costly. To avoid
that overhead, only the characters rejected by the character recognizer are checked
for touching.
Touching characters are identified by a technique based on multifactorial analysis,
for which two factors are defined first: a measure of dissimilarity (f_md) and an
aspect-ratio factor (f_ar).
f_md is calculated as f_md = 1 - d_off / d, where d is the minimum distance of the
target character from a set of stored prototypes, and d_off is the offset distance
used by the character classifier: a character is accepted if d is less than d_off and
rejected otherwise. For a rejected character d >= d_off, so f_md lies in [0, 1).
The other factor, f_ar, is defined as f_ar = e^a / (1 + e^a), where a = w / h, w is the
width and h is the height of the minimum upright bounding box of the character.
Touching characters generally have a larger width-to-height ratio than isolated
characters, so a larger f_ar makes touching more likely.
The multifactorial function used to combine these two factors is:
M_id(f_md, f_ar) = (f_md + f_ar) / 2
If the value of M_id is above a threshold value, the character is classified as a
touching character.
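As a worked illustration, the two factors and their average can be computed as below. This is a minimal sketch under stated assumptions: d is read as the measured minimum distance to the stored prototypes and d_off as the classifier's acceptance offset, the function names are hypothetical, and the threshold is left as a required argument because the report does not give its value.

```python
import math

def f_md(d, d_off):
    """Measure of dissimilarity: 1 - d_off / d (assumes d >= d_off > 0)."""
    return 1.0 - d_off / d

def f_ar(width, height):
    """Aspect-ratio factor: e^a / (1 + e^a) with a = width / height."""
    a = width / height
    return math.exp(a) / (1.0 + math.exp(a))

def m_id(d, d_off, width, height):
    """Multifactorial evaluation: the mean of the two factors."""
    return 0.5 * (f_md(d, d_off) + f_ar(width, height))

def is_touching(d, d_off, width, height, threshold):
    """Flag a rejected character as touching when M_id exceeds the threshold."""
    return m_id(d, d_off, width, height) > threshold

# Hypothetical rejected blob: distance twice the offset, twice as wide as tall.
print(round(m_id(d=2.0, d_off=1.0, width=20, height=10), 4))
```

For these illustrative inputs f_md = 0.5 and f_ar is about 0.881, so M_id is about 0.69. A wider blob or a worse classifier match pushes the value toward 1.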
Finding Cut Positions for Segmentation
Once the touching characters are identified, multifactorial analysis is applied again
to separate them. This analysis is based on five fuzzy factors:
• f_ic (inverse crossing count) = 1 / c, where c is the vertical crossing count for a
pixel column. A column whose crossing count is more than one is less preferable
as a cut column.
• f_mt (measure of blob thickness) = 1 - t / T, where t is the number of black pixels
encountered in one column scan and T is the height of the middle zone of the
characters
• f_dm (degree of middleness): indicates where the pixel lies within the middle zone,
i.e. whether it is at the middle of the middle zone or away from its center
• f_up (upper stroke pattern)
• f_low (lower stroke pattern)
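The two factors that are given with formulas above, the inverse crossing count and the blob-thickness measure, can be computed per pixel column as in the following sketch. It assumes a binary NumPy image with 1 for black, counts a crossing as one vertical run of black pixels, and restricts the thickness scan to the middle-zone rows `mid_top` to `mid_bottom` (end exclusive). These choices, the function names, and the value 1.0 used for an empty column are assumptions made for the example. The other three factors are not defined in enough detail here to sketch.

```python
import numpy as np

def crossing_count(column):
    """Number of black runs met when scanning one pixel column top to bottom."""
    col = np.asarray(column) > 0
    padded = np.concatenate(([False], col))
    # A run starts wherever a white pixel (or the top edge) is followed by black.
    return int(np.count_nonzero(~padded[:-1] & padded[1:]))

def column_factors(image, mid_top, mid_bottom):
    """Return arrays (f_ic, f_mt), one value per pixel column of the image."""
    img = np.asarray(image) > 0
    m = img.shape[1]
    zone_height = mid_bottom - mid_top          # T in the text
    f_ic = np.empty(m)
    f_mt = np.empty(m)
    for i in range(m):
        c = crossing_count(img[:, i])
        f_ic[i] = 1.0 / c if c > 0 else 1.0     # empty column: assumed value
        t = np.count_nonzero(img[mid_top:mid_bottom, i])
        f_mt[i] = 1.0 - t / zone_height
    return f_ic, f_mt

# Toy 6 x 3 image: a solid column, a thin single run, and a column with two runs.
img = np.array([[1, 0, 1],
                [1, 0, 0],
                [1, 1, 0],
                [1, 0, 1],
                [1, 0, 0],
                [1, 0, 0]])
f_ic, f_mt = column_factors(img, mid_top=1, mid_bottom=5)
print(f_ic, f_mt)
```

In the toy image the middle column scores highest on both factors (a single thin run), which matches the observation that touching positions usually show one black run of small vertical thickness.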
If an image of touching characters has m pixel columns, the five factors above are
evaluated for each of the m columns, forming a 5 x m one-factor evaluation matrix V_s:
V_s =
| f_ic(1)   f_ic(2)   ...  f_ic(m)  |
| f_mt(1)   f_mt(2)   ...  f_mt(m)  |
| f_dm(1)   f_dm(2)   ...  f_dm(m)  |
| f_up(1)   f_up(2)   ...  f_up(m)  |
| f_low(1)  f_low(2)  ...  f_low(m) |
Each column of the matrix corresponds to one pixel column of the image and is a 5-D
vector reflecting five different aspects of that column's suitability as a cut column.
Next, these five aspects are combined and mapped to a 1-D scalar by an additive
standard multifactorial (ASM) function M_s.
The function M_s transforms the 5 x m matrix V_s into a 1 x m matrix V'_s:
V'_s = [ f_s(1)  f_s(2)  ...  f_s(m) ],
where f_s(i) = M_s(f_ic(i), f_mt(i), f_dm(i), f_up(i), f_low(i)).
Since the state spaces of all five factors are bounded by the interval [0, 1] and M_s
is an ASM function, the state space of f_s is also bounded by [0, 1]. In effect, M_s
gives each pixel column i a degree of membership f_s(i) that reflects the possibility
of the i-th column being a cut column. A higher f_s value indicates a greater
possibility that the column is a cut column separating the touching characters.
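The mapping from the 5 x m matrix to one score per column can be illustrated as follows. The report does not state which function M_s is used, so this sketch substitutes a weighted average with non-negative weights that sum to 1, which keeps the result inside [0, 1]. The function name `rank_cut_columns` and the equal default weights are assumptions, not the method described in the source.

```python
import numpy as np

def rank_cut_columns(v_s, weights=None):
    """Combine a (factors x m) evaluation matrix into per-column scores.

    Returns (f_s, order), where f_s[i] is the combined score of column i and
    order lists column indices from most to least likely cut column.
    """
    v_s = np.asarray(v_s, dtype=float)
    n_factors = v_s.shape[0]
    if weights is None:
        weights = np.full(n_factors, 1.0 / n_factors)   # plain average
    weights = np.asarray(weights, dtype=float)
    if (weights < 0).any() or not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    f_s = weights @ v_s                                  # 1 x m result
    order = np.argsort(-f_s, kind="stable").tolist()     # descending, ties keep order
    return f_s, order

# Toy matrix: 5 factors evaluated on 3 pixel columns.
v_s = [[1.0, 0.5, 1.0],
       [0.0, 0.5, 1.0],
       [0.5, 0.5, 1.0],
       [0.5, 0.5, 1.0],
       [0.5, 0.5, 1.0]]
f_s, order = rank_cut_columns(v_s)
print(f_s, order)
```

Here the third column scores 1.0 on every factor, so it is ranked first as the candidate cut column.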
Confirmation of the Cut Column
After the candidate cut columns are ranked by their f_s values, the segments produced
by the highest-ranked cut column are passed to the character classifier. If either or
both of the two segments are recognized, the cut column is accepted as a separation
point. If both segments are rejected, the cut column is discarded and the column with
the next-largest f_s value is tried.
Simplified algorithm for confirming the cut position:
1. List the candidate cut positions identified
2. Take the cut position with the highest multifactorial evaluation value
3. Segment the touching character at that position into two parts, p1 and p2
4. Send p1 and p2 to the character classifier
5. If both p1 and p2 are recognized then
       cut position confirmed
   else if only p1 is recognized then
       take the cut position with the next-highest evaluation value
       repeat from step 3 to segment p2
   else if only p2 is recognized then
       take the cut position with the next-highest evaluation value
       repeat from step 3 to segment p1
   else
       take the cut position with the next-highest evaluation value
       repeat from step 3
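The confirmation loop above can be sketched in Python as follows. To keep the example self-contained, the classifier is abstracted as a hypothetical callback `recognize(start, end)` that reports whether the column range [start, end) is recognized as a character. The function name `confirm_cuts` and the recursive handling of the unrecognized part are one possible reading of the steps, not the original implementation.

```python
def confirm_cuts(start, end, ranked_cuts, recognize):
    """Confirm cut columns for the touching blob spanning columns [start, end).

    ranked_cuts lists candidate columns from highest to lowest evaluation
    value. Returns a sorted list of confirmed cuts, or None if every
    candidate leaves both parts unrecognized.
    """
    for cut in ranked_cuts:
        if not (start < cut < end):
            continue                      # candidate lies outside this blob
        left_ok = recognize(start, cut)
        right_ok = recognize(cut, end)
        if left_ok and right_ok:
            return [cut]                  # both parts recognized
        if left_ok or right_ok:
            # The cut is accepted; try the remaining candidates on the part
            # that was not recognized.
            rest = [c for c in ranked_cuts if c != cut]
            if left_ok:
                more = confirm_cuts(cut, end, rest, recognize)
            else:
                more = confirm_cuts(start, cut, rest, recognize)
            return sorted([cut] + (more or []))
        # Both parts rejected: discard this cut and try the next candidate.
    return None

# Toy classifier: three characters occupy columns 0-3, 3-7 and 7-10.
known = {(0, 3), (3, 7), (7, 10)}
print(confirm_cuts(0, 10, [5, 3, 7], lambda s, e: (s, e) in known))
```

In the toy run, column 5 is tried first and discarded because neither part is recognized. Column 3 is then accepted because the left part is recognized, and column 7 splits the remaining part.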
Results Achieved by Implementing Fuzzy Multifactorial Analysis
• Problems remain in identifying touching characters in a document. Sometimes
valid characters are interpreted as touching characters, and sometimes
touching characters are not identified.
• Segmentation accuracy is 98.92% and 98.47% for Devnagari and Bangla scripts,
respectively, after using this separation module for the touching characters
• System throughput (T) is calculated as T = C / t, where C is the total number of
characters properly recognized by the OCR and t is the total time elapsed for the
operation
• System efficiency (E) is calculated as E = (N_v * 100) / N_t, where N_v is the
number of valid cut columns and N_t is the total number of cut columns checked to
find the valid cuts.
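The two measures are simple ratios, shown here as a short sketch. The function names and the input numbers are purely illustrative and are not results from the study.

```python
def throughput(chars_recognized, elapsed_time):
    """T = C / t: properly recognized characters per unit of time."""
    return chars_recognized / elapsed_time

def efficiency(valid_cuts, cuts_checked):
    """E = (N_v * 100) / N_t: percentage of checked cut columns that were valid."""
    return valid_cuts * 100 / cuts_checked

# Hypothetical run: 900 characters in 30 seconds, 45 valid cuts out of 60 checked.
print(throughput(900, 30), efficiency(45, 60))
```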
Conclusion
Fuzzy multifactorial analysis gives very promising results for the Devnagari script,
so the method can also be applied in Nepali OCR to minimize the errors caused by
touching characters.