
Wavelet-Gradient-Fusion for Video Text Binarization
Sangheeta Roy (a), Palaiahnakote Shivakumara (b), Partha Pratim Roy (c) and Chew Lim Tan (b)
(a) Tata Consultancy Services, Kolkata, India
(b) School of Computing, National University of Singapore, Singapore
(c) Laboratoire d’Informatique, Université François Rabelais, Tours, France
(a) roy.sangheeta@tcs.com, (b) {shiva, tancl}@comp.nus.edu.sg, (c) partha.roy@univ-tours.fr
Abstract—Achieving a good character recognition rate for video images is not as easy as for scanned documents because of the low resolution and complex background of video images. In this paper, we propose a new method that fuses the horizontal, vertical and diagonal information obtained by the wavelet and the gradient on text line images to enhance the text information. We apply k-means with k=2 on row-wise and column-wise pixels separately to extract possible text information. The union operation on the row-wise and column-wise clusters provides the text candidate information. With the help of the Canny edge map of the input image, the method identifies disconnections based on a mutual nearest neighbor criterion on end points, and it compares the disconnected area with the text candidates to restore the missing information. Next, the method uses connected component analysis to merge sub-components based on a nearest neighbor criterion. The foreground (text) and background (non-text) are separated based on the new observation that the color values at the edge pixels of a component are larger than the color values of the pixels inside the component. Finally, we use the Google Tesseract OCR to validate our results, and the results are compared with baseline thresholding techniques to show that the proposed method is superior to existing methods in terms of recognition rate on 236 video and 258 ICDAR 2003 text lines.
Keywords- Wavelet-Gradient-Fusion, Video text lines, Video text restoration, Video character recognition
I. INTRODUCTION
Character recognition in document analysis is the most
successful application in the field of pattern recognition.
However, if we test the same OCR engine on video scene
characters, the OCR engine reports poor accuracy because the
OCR was developed mainly for scanned document images
containing simple background and high contrast but not for
the video images having complex background and low
contrast. It is evident from the natural scene character recognition methods [1-6] that the document OCR engine does not work for camera-based natural scene images due to the failure of binarization in handling non-uniform background and illumination. Therefore, a poor character recognition rate (67%) is reported for the ICDAR-2003 competition data [7].
This shows that, despite the high contrast of camera images, the best accuracy reported so far is 67%. Thus, achieving a better character recognition rate for video images is still an elusive goal for researchers because of the lack of a good
binarization method which can tackle both low contrast and
complex background to separate foreground and background
accurately [8]. It is noted that character recognition rate varies
from 0% to 45% [8] if we apply OCR directly on video text,
which is much lower than scene character recognition
accuracy. Our experimental results with existing baseline methods such as Niblack [9] and Sauvola et al. [10] show that these thresholding techniques give poor accuracy for video and
scene images. It is reported in [11] that the performance of
these thresholding techniques is not consistent because the
character recognition rate changes as the application and
dataset change. In this paper, we make an attempt by
proposing a new method for separation of foreground (text)
and background (non-text) such that the OCR engine provides
better accuracy than the accuracies reported in the literature.
There are several papers that addressed video text
binarization and localization problem based on edge, stroke,
color and corner information to improve character recognition
rate. Ntirogiannis et al. [12] have proposed a binarization
method based on baseline and stroke width extraction to
obtain body of the text information and convex hull analysis
with adaptive thresholding is done for obtaining final text
information. However, this method focuses on artificial text
where pixels have uniform color but not on both artificial and
scene text where pixels do not have uniform color values. An
automatic binarization method for color text areas in images
and video based on convolutional neural network is proposed
by Saidane and Garcia [13]. The performance of the method
depends on the number of training samples. Recently, edge based binarization for video text images was proposed by Zhou et al. [14] to improve the video character recognition rate. This method takes the Canny edge map of the input image as input and it
proposes a modified flood fill algorithm to fill the gap if there
is a small gap on the contour. This method works well for
small gaps but not for big gaps on the contours. In addition to
this, the method’s primary focus is graphics text and big font
but not both graphics and scene text.
Therefore, from the above discussion, it can be concluded
that there are methods to improve video character recognition
rate through binarization but these methods concentrate on big
font, graphics text in video but not on both graphics and scene
text where we can expect much more variation in contrast and
background compared to graphics text. Therefore, improving
video character recognition through binarization irrespective
of text type, contrast and background complexity is
challenging. Hence, in this work, we propose a new Wavelet-Gradient-Fusion (WGF) method based on the fusion concept with
wavelet and gradient information and a new way of obtaining
text candidates to overcome the above problems.
II. PROPOSED METHOD
While we note that there are several sophisticated methods
for text line detection in video irrespective of contrast, text type, orientation and background variation, we use our method [15] based on the Laplacian approach and skeleton analysis
to segment the text lines from the video frames. Therefore, the
output of our text detection method is the input to the proposed
method in this work. The multi-oriented text lines segmented
from the video frames are converted to horizontal text lines
based on the direction of the text lines. Hence, non-horizontal
text lines are treated as horizontal text lines to make
implementation easier. The proposed method is structured into
four sub-sections. In Section A, we propose a novel method to
fuse wavelet and gradient information to enhance the text
information in video text lines. Text candidates are obtained by
a new way of clustering on enhanced image in Section B. The
possible text information is restored with the help of Canny of
the input image and the text candidate image in Section C.
Finally, in Section D, the method to separate foreground and
background is presented based on color features of the edge
pixel and inside component pixels.
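As an illustration of the conversion of non-horizontal text lines to horizontal ones mentioned above, the following minimal Python sketch rotates a segmented text line back to the horizontal before binarization; the angle convention and the use of scipy's rotate function are illustrative assumptions, since the implementation is not specified in the text.

```python
# Hedged sketch: deskew a segmented text line whose orientation angle is
# assumed to be known from the text detection stage.
import numpy as np
from scipy import ndimage

def deskew_text_line(line_image, angle_degrees):
    # rotate by the negative detected angle so the text baseline becomes horizontal
    return ndimage.rotate(line_image, -angle_degrees,
                          reshape=True, order=1, mode='nearest')
```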
[Figure 1 (flow diagram): the input text line image is decomposed by the wavelet and the gradient into horizontal (H), vertical (V) and diagonal (D) sub-bands; the corresponding sub-bands are combined into Fusion-1, Fusion-2 and Fusion-3, which are then merged into the final fused image.]
Figure 1. Flow diagram for the wavelet-gradient-fusion
A. Wavelet-Gradient-Fusion Method
It is noted that wavelet decomposition is good for enhancing low contrast pixels in the video frame because its multi-resolution analysis gives horizontal (H), vertical (V) and diagonal (D) information, while the gradient operation in the same direction on the video image gives fine detail of the edge pixels in the video text line image. To overcome the problems of
unpredictable video characteristics, the work presented in [16]
suggested the use of fusion of the values given by the low
bands of the input images to increase the resolution of the
image. Inspired by this work, we propose an operation that
chooses the highest pixel value among low pixel values of
different sub-bands corresponding to wavelet and gradient at
different levels as a fusion criterion. It is shown in Figure 1
where one can see how the sub-bands of wavelet fuse with the
gradient images and the final fusion image is obtained after
fusing three Fusion-1, Fusion-2 and Fusion-3 images. For
example, for the input image shown in Figure 2(a), the method
compares the pixel values in the horizontal wavelet (Figure
2(b)) with the corresponding pixel values in the horizontal
gradient (Figure 2(c)) and it chooses the highest pixel value to
obtain the fusion image as shown in Figure 2(d). In the same
way, the method obtains the fusion image for the vertical
wavelet and the vertical gradient as shown in Figure 2(e)-(g),
and the diagonal wavelet and the diagonal gradient images as
shown in Figure 2(h)-(j). The same operation is performed on the three fused images above to get the final fused image as shown in Figure 2(k), where we can see that the text information is sharpened compared to the results shown in Figure 2(d), (g) and (j).
[Figure 2 panels: (a) input text line image; (b) horizontal wavelet; (c) horizontal gradient; (d) Fusion-1 of (b) and (c); (e) vertical wavelet; (f) vertical gradient; (g) Fusion-2 of (e) and (f); (h) diagonal wavelet; (i) diagonal gradient; (j) Fusion-3 of (h) and (i); (k) fusion of Fusion-1, Fusion-2 and Fusion-3.]
Figure 2. Intermediate results for WGF method
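To make the fusion rule concrete, the sketch below takes the pixel-wise maximum of each wavelet sub-band and the corresponding directional gradient image, and then the maximum of the three resulting fusion images. The one-level Haar decomposition, the simple finite-difference gradients and the downsampling of the gradient images to the sub-band resolution are illustrative assumptions; the text does not fix a particular wavelet or gradient operator.

```python
# Hedged sketch of the Wavelet-Gradient-Fusion (WGF) rule on a grayscale
# text line image given as a 2-D numpy array.
import numpy as np
import pywt

def wavelet_gradient_fusion(gray):
    img = gray.astype(float)

    # one-level Haar decomposition: cH, cV, cD are the H, V, D sub-bands
    _, (cH, cV, cD) = pywt.dwt2(img, 'haar')

    # directional gradient magnitudes at full resolution (simple differences)
    gH = np.abs(np.diff(img, axis=1, append=0.0))                   # horizontal
    gV = np.abs(np.diff(img, axis=0, append=0.0))                   # vertical
    gD = np.abs(img - np.roll(np.roll(img, 1, axis=0), 1, axis=1))  # diagonal

    # bring the gradient images down to the sub-band resolution
    down = lambda g: g[:cH.shape[0] * 2:2, :cH.shape[1] * 2:2]

    # Fusion-1/2/3: keep the larger of the wavelet and gradient responses per pixel
    f1 = np.maximum(np.abs(cH), down(gH))
    f2 = np.maximum(np.abs(cV), down(gV))
    f3 = np.maximum(np.abs(cD), down(gD))

    # final fused image: pixel-wise maximum of the three fusion images
    return np.maximum(np.maximum(f1, f2), f3)
```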
B. Text Candidates
It is observed from the result of the previous section that the WGF method widens the gap between text and non-text pixels. Therefore, to classify text and non-text pixels, we use k-means clustering with k=2 in a novel way by applying it on each row and column separately as shown in Figure 3(a) and (b), where the result of row-wise clustering loses some text information while the result of column-wise clustering does not lose text information. Here the cluster with the higher mean of the two is considered the text cluster. This is the advantage of the new way of row-wise and column-wise clustering, as it helps in restoring the possible text information. The union of the row-wise and column-wise clustering results is considered as the text candidates to separate text and non-text information as
shown in Figure 3(c) where it is seen that the union operation
includes other background information in addition to text.
(a). k-means clustering row-wise (b) k-means clustering column-wise
(c) Union of (a) and (b)
Figure 3. Text candidates for text binarization
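The text candidate step can be sketched as follows: a small one-dimensional two-means routine is applied to every row and every column of the fused image, the higher-mean cluster is kept as text in each case, and the two binary results are combined by a union. The initialisation and the number of iterations are illustrative choices, not taken from the description above.

```python
# Hedged sketch of row-wise / column-wise 2-means clustering of the fused image.
import numpy as np

def two_means_mask(values):
    """Cluster a 1-D array into two groups; True marks the higher-mean (text) cluster."""
    lo, hi = float(values.min()), float(values.max())
    for _ in range(20):                       # a few Lloyd iterations suffice here
        assign = np.abs(values - hi) < np.abs(values - lo)
        lo_new = values[~assign].mean() if (~assign).any() else lo
        hi_new = values[assign].mean() if assign.any() else hi
        if lo_new == lo and hi_new == hi:
            break
        lo, hi = lo_new, hi_new
    return assign

def text_candidates(fused):
    rows = np.vstack([two_means_mask(r) for r in fused])        # row-wise clustering
    cols = np.vstack([two_means_mask(c) for c in fused.T]).T    # column-wise clustering
    return rows | cols                                           # union of both results
```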
C. Smoothing
It is observed from the text candidates that the shape of the characters is almost preserved, although the image may still contain other background information. Therefore, the method considers the
text candidates image as the reference image to clean up the
background. The method identifies disconnections in the
Canny of the input image by testing mutual nearest neighbor
criteria on end points as shown in Figure 4(a) where
disconnections are marked by red color rectangles. The mutual
nearest neighbor criterion is defined as follows: if P1 is near to
P2 then P2 should be near to P1, where Point P1 and Point P2
are the two end points. This is because Canny gives good edge
information for video text line images but at the same time it
gives lots of disconnections due to low contrast and complex
background. The identified disconnection area is matched with
the same position in the text candidates image locally to
restore the missing text information since the text candidates
image does not lose much text information compared to the
Canny edge image as shown in Figure 4(b) where almost all
components are filled by flood fill operation. However, we can
see noisy pixels in the background. To eliminate them, we
perform projection profile analysis, which results in clear text with a clean background as shown in Figure 4(c).
(a). Gap identification based on mutual nearest neighbor criteria
(b) Disconnections are filled and identified noisy pixels
(c) Clear connected components
Figure 4. Process of smoothing
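The mutual nearest neighbor test itself can be sketched as below, assuming the end points of the Canny edge map have already been extracted as (row, column) coordinates; only pairs in which each point is the nearest end point of the other are reported as disconnections to repair.

```python
# Hedged sketch of the mutual nearest neighbor criterion on contour end points.
import numpy as np

def mutual_nearest_pairs(end_points):
    if len(end_points) < 2:
        return []
    pts = np.asarray(end_points, dtype=float)
    # pairwise Euclidean distances, with self-distances masked out
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nearest = d.argmin(axis=1)
    pairs = []
    for i, j in enumerate(nearest):
        # keep (P1, P2) only if P1 is nearest to P2 and P2 is nearest to P1
        if nearest[j] == i and i < j:
            pairs.append((tuple(end_points[i]), tuple(end_points[j])))
    return pairs
```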
(a) Color values of edge pixels and inside the character
(b) Foreground and background are separated
(c) Recognition result: “successive year”
Figure 5. Foreground and background separation by analyzing the color values at edge pixels and inside the components
D. Foreground and Background Separation
The method considers the text in the smoothed image
obtained from the above step C as connected components and analyses them by fixing a bounding box to merge the sub-components, if any, based on the nearest neighbor criterion as
shown in Figure 5(a). For each component in the merged
image, the method extracts the maximum color information
from the input image corresponding to pixels in the
components of the merged image. It is found from the results
of maximum color extraction that the extracted color values
refer to the border/edge of the components. This is valid because color values at or near edges are usually higher than those at the pixels inside the components if there exist
holes inside the component. This observation helps us to find a
hole for each component by making low values black and high values white, as shown in Figure 5(b). After separating
text and non-text, the result is fed to OCR [17] to test
recognition results. For example, for the result shown in Figure
5(b), the OCR engine recognizes the whole text correctly as
shown in Figure 5(c), where the recognition result is shown in quotes.
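A hedged sketch of this separation step is given below: for each connected component, the grey values near the component border tend to be higher than the values inside it, so splitting the component's pixels at the midpoint between the border and interior maxima exposes the holes (background) inside the characters. The component labelling and the exact threshold are illustrative assumptions rather than the literal procedure above.

```python
# Hedged sketch of foreground/background separation inside each component.
import numpy as np
from scipy import ndimage

def separate_foreground(gray, component_mask):
    labels, n = ndimage.label(component_mask)
    out = np.zeros(component_mask.shape, dtype=np.uint8)
    for k in range(1, n + 1):
        comp = labels == k
        border = comp & ~ndimage.binary_erosion(comp)   # edge pixels of the component
        inside = comp & ~border                          # interior pixels
        if not inside.any():
            out[comp] = 255                              # too thin to contain a hole
            continue
        cut = 0.5 * (gray[border].max() + gray[inside].max())
        out[comp] = np.where(gray[comp] >= cut, 255, 0)  # low values -> black (holes), high -> white
    return out
```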
III. EXPERIMENTAL RESULTS
As there is no standard database to evaluate the proposed method's performance, we create our own dataset, which includes 236 text lines selected from different news video sources and 258 text lines selected randomly from ICDAR 2003 competition scene images. In total, 494 text line images
are considered. To measure the performance of the proposed
method, we use character recognition rate. For comparative
study, we implement two baseline methods of binarization [9,
10] and the methods are evaluated in terms of recognition
rate. The sample results for the proposed and existing
methods on both video and ICDAR data are shown in Table 1
where we consider input images with low contrast, complex
background, distorted text and different fonts. It is noticed
from the recognition results shown in quotes in Table 1 that the OCR
engine recognizes almost all the results given by the proposed
method, while the Niblack method gives better results than the Sauvola method but worse than the proposed method. For
Sauvola’s method, the OCR returns none (“ “) for almost all
the input images. The reason for the poor result lies in the use
of thresholds to binarize because it is hard to fix optimal
thresholds for video text lines due to unpredictable
characteristics. On the other hand, the proposed method does
not fix any threshold and it takes advantage of the Wavelet-Gradient-Fusion and color features for foreground and
background separation. However, the proposed method
sometimes fails for very low contrast images, as shown in the last row of the video and ICDAR data in Table 1. Therefore, there is
scope for further improvements.
The OCR engine is used to calculate the recognition rate
for the input images without binarization (“Before” column in
Table 2 and Table 3) and the results are reported in Table 2
and Table 3 for video data and ICDAR data. The OCR engine
gives slightly better results for ICDAR data than for video data. This is expected because ICDAR data contains high contrast images despite complex backgrounds, whereas video data has both low contrast and complex backgrounds. The results
reported in Table 2 and Table 3 show that the proposed
method provides improvements of 16.08% for video data and 15.79% for ICDAR data compared to the recognition results before binarization (the improvement is the recognition rate after binarization minus the rate before).
Table 1. Sample results of the proposed and existing methods
[Table 1: for sample text lines grouped into Video data and ICDAR 2003 Competition data, the table shows the input image and the binarization results of the proposed (WGF), Niblack [9] and Sauvola [10] methods, with the OCR recognition string for each result given in quotes.]
Table 2. Recognition rate of the proposed and existing methods on video data (in %)
              Before    After    Improvements
Proposed       48.49    64.57          +16.08
Niblack [9]    48.49    47.03           -1.46
Sauvola [10]   48.49    17.26          -31.23
Table 3. Recognition rate of the proposed and existing methods on ICDAR data (in %)
              Before    After    Improvements
Proposed       51.62    67.41          +15.79
Niblack [9]    51.62    42.30           -9.32
Sauvola [10]   51.62    19.98          -31.64
IV. CONCLUSION
In this work, we have proposed a new fusion method based on wavelet sub-bands and gradients of different directions. We have shown that this fusion helps in enhancing text information. We used the k-means clustering algorithm in a new row-wise and column-wise way to obtain text candidates. The mutual nearest neighbor concept is proposed to identify the true pair of end pixels to restore the missing text information. To separate foreground and background, we explore the color values at edges and inside the components. The experimental results of the proposed and existing methods show that the proposed method outperforms the existing methods in terms of recognition rate. However, the reported recognition rate is not as high as in document analysis because the Tesseract OCR is not font independent and robust; we are planning to explore learning based methods to improve the recognition rate on a large dataset.

ACKNOWLEDGMENT
This work is done jointly by the National University of Singapore (NUS), Singapore and Département Informatique, Polytech'Tours, France. This research is also supported in part by A*STAR grant 092 101 0051 (WBS no. R252-000-402-305).

REFERENCES
[1] D. Doermann, J. Liang and H. Li, “Progress in Camera-Based Document Image Analysis”, In Proc. ICDAR, 2003, pp 606-616.
[2] J. Zang and R. Kasturi, “Extraction of Text Objects in Video Documents: Recent Progress”, In Proc. DAS, 2008, pp 5-17.
[3] K. Wang and S. Belongie, “Word Spotting in the Wild”, In Proc. ECCV, 2010, pp 591-604.
[4] X. Tang, X. Gao, J. Liu and H. Zhang, “A Spatial-Temporal Approach for Video Caption Detection and Recognition”, IEEE Trans. Neural Networks, 2002, pp 961-971.
[5] M. R. Lyu, J. Song and M. Cai, “A Comprehensive Method for Multilingual Video Text Detection, Localization, and Extraction”, IEEE Trans. CSVT, 2005, pp 243-255.
[6] A. Mishra, K. Alahari and C. V. Jawahar, “An MRF Model for Binarization of Natural Scene Text”, In Proc. ICDAR, 2011, pp 11-16.
[7] L. Neumann and J. Matas, “A Method for Text Localization and Recognition in Real-World Images”, In Proc. ACCV, 2011, pp 770-783.
[8] D. Chen and J. M. Odobez, “Video text recognition using sequential Monte Carlo and error voting methods”, Pattern Recognition Letters, 2005, pp 1386-1403.
[9] W. Niblack, “An Introduction to Digital Image Processing”, Prentice Hall, Englewood Cliffs, 1986.
[10] J. Sauvola, T. Seppanen, S. Haapakoski and M. Pietikainen, “Adaptive Document Binarization”, In Proc. ICDAR, 1997, pp 147-152.
[11] J. He, Q. D. M. Do, A. C. Downton and J. H. Kim, “A Comparison of Binarization Methods for Historical Archive Documents”, In Proc. ICDAR, 2005, pp 538-542.
[12] K. Ntirogiannis, B. Gatos and I. Pratikakis, “Binarization of Textual Content in Video Frames”, In Proc. ICDAR, 2011, pp 673-677.
[13] Z. Saidane and C. Garcia, “Robust Binarization for Video Text Recognition”, In Proc. ICDAR, 2007, pp 874-879.
[14] Z. Zhou, L. Li and C. L. Tan, “Edge based Binarization of Video Text Images”, In Proc. ICPR, 2010, pp 133-136.
[15] P. Shivakumara, T. Q. Phan and C. L. Tan, “A Laplacian Approach to Multi-Oriented Text Detection in Video”, IEEE Trans. PAMI, 2011, pp 412-419.
[16] G. Pajares and J. M. Cruz, “A wavelet-based image fusion tutorial”, Pattern Recognition, 2004, pp 1855-1872.
[17] Tesseract OCR. http://code.google.com/p/tesseract-ocr/.