Gradient Vector Flow and Grouping based Method for Arbitrarily-Oriented Scene Text Detection in Video Images
Palaiahnakote Shivakumara, Trung Quy Phan, Shijian Lu and Chew Lim Tan, Senior Member, IEEE
Abstract—Text detection in video is challenging due to the low resolution and complex background of video frames. In addition, the arbitrary orientation of scene text lines in video makes the problem more complex and challenging. This paper presents a new method that extracts text lines of any orientation based on Gradient Vector Flow (GVF) and neighbor component grouping. The GVF of edge pixels in the Sobel edge map of the input frame is explored to identify dominant edge pixels that represent text components. The method extracts the edge components corresponding to dominant pixels in the Sobel edge map, which we call Text Candidates (TC) of the text lines. We propose two grouping schemes. The first finds nearest neighbors based on geometrical properties of the TC to group broken segments and neighboring characters, which results in word patches. The end and junction points of the skeleton of each word patch are used to eliminate false positives, giving the Candidate Text Components (CTC). The second grouping scheme is based on the direction and the size of the CTC to extract neighboring CTC and to restore missing CTC, which enables arbitrarily-oriented text line detection in video frames. Experimental results on different datasets, including arbitrarily oriented text data, non-horizontal and horizontal text data, Hua's data and the ICDAR-03 data (camera images), show that the proposed method outperforms existing methods in terms of recall, precision and F-measure.
Index Terms—Gradient vector flow, Dominant text pixel, Text
candidates, Text components, Candidate text components,
Arbitrarily-oriented text detection.
I. INTRODUCTION
Text detection and recognition is a hot topic for researchers in the fields of image processing, pattern recognition and multimedia. It draws the attention of the Content Based Image Retrieval (CBIR) community because, when text is available in a video, it helps to bridge the semantic gap between low-level and high-level features to some extent [1-4]. In addition, text detection and recognition can be used to retrieve exciting and semantic events from sports video [5-7]. Therefore, text detection and extraction is essential to improve the performance of retrieval systems in real-world applications.
This research is supported in part by the A*STAR grant 092 101 0051 (WBS no. R252-000-402-305).
P. Shivakumara is with the Multimedia Unit, Department of Computer Systems and Information Technology, University of Malaya, Kuala Lumpur 50603, Malaysia. Telephone: +60 03 7967 2505 (E-mail: hudempsk@yahoo.com).
T. Q. Phan is with the Department of Computer Science, School of Computing, National University of Singapore, Computing 1, 13 Computing Drive, Singapore 117417 (E-mail: phanquyt@comp.nus.edu.sg).
S. Lu is with the Department of Computer Vision and Image Understanding, Institute for Infocomm Research (I2R), Singapore (E-mail: slu@i2r.a-star.edu.sg).
C. L. Tan is with the Department of Computer Science, School of Computing, National University of Singapore, Computing 1, 13 Computing Drive, Singapore 117417 (E-mail: tancl@comp.nus.edu.sg).
Copyright (c) 2013 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to pubs-permissions@ieee.org.
Video contains two types of text: scene text and graphics text. Scene text is part of the image captured by the camera. Examples of scene text include street signs, billboards, text on trucks and writing on shirts. The nature of scene text is therefore unpredictable compared to graphics text, which is more structured and closely related to the subject. Nevertheless, scene text can be used to uniquely identify objects in sports events, to navigate with Google Maps and to assist visually impaired people. Since the nature of scene text is unpredictable, it poses many challenges. Among these, arbitrary orientation is the most challenging, as it is not as easy to handle as straight text lines.
Several methods have been developed for text detection and extraction that achieve reasonable accuracy for natural scene text (camera images) [8-13] as well as multi-oriented text [11]. However, most of these methods use a classifier and a large number of training samples to improve text detection accuracy. To tackle the multi-orientation problem, the methods use connected component analysis. For instance, the stroke width transform based method for text detection in scene images by Epshtein et al. [8] works well for connected components that preserve their shapes. Pan et al. [9] proposed a hybrid approach for text detection in natural scene images based on a conditional random field, which involves connected component analysis to label the text candidates. Since the images are of high contrast, connected component based features with classifier training work well and achieve good accuracy. However, the same methods cannot be used directly for text detection in video because of the low contrast and complex background, which cause disconnections, loss of shapes, etc. In this case, choosing a classifier and geometrical features of the components is not easy. Thus, these methods are not suitable for video text detection.
Plenty of methods have been proposed over the last decade for text detection in video based on connected components [14-15], texture [16-19] and edge and gradient information [20-25]. Connected component based methods are good for caption text and uniform-color text but not for text lines with multi-colored characters or text on cluttered backgrounds. Texture based methods consider the appearance of text as a special texture. These methods handle complex backgrounds to some extent, but at a high computational cost due to the large number of features and the large number of training samples required to classify text and non-text pixels. Therefore, the performance of these methods depends on the classifier in use and on the number of training samples chosen for text and non-text. Edge and texture features without a classifier are used by Liu et al. [26] for text detection, but the method requires a large number of features to discriminate text and non-text pixels. A set of texture features without a classifier is also proposed by Shivakumara et al. [27, 28] for accurate text detection in video frames. Though these methods work well for different varieties of frames, they require more processing time due to the large number of features. In addition, the scope of these methods is limited to horizontal text.
Similarly, the combination of edge and gradient features gives better text detection accuracy and efficiency than texture based methods. For example, text detection using gradient and statistical analysis of intensity values is proposed by Wong and Chen [21]. This method suffers from the grouping of text and non-text components. Colour information is also used along with edge information for text detection by Cai et al. [22]. This method works well for caption text, but its performance degrades when the font size varies. In general, edge and gradient based methods produce more false positives due to the heuristics used for text and non-text pixel classification.
To the best of our knowledge, none of the methods discussed above addresses arbitrarily-oriented text detection in video properly. The reason is that arbitrarily-oriented text generally comes from scene text, which poses many problems compared to graphics text. Zhou et al. [29] have proposed a method for detecting both horizontal and vertical text lines in video using multi-stage verification and effective connected component analysis. This method is good for caption text but not for other text, and the orientation is limited to horizontal and vertical only. Shivakumara et al. [30] have addressed the multi-oriented issue based on the Laplacian and skeletonization methods. This method gives low accuracy because the skeleton based procedure is not good enough to classify simple and complex components when a cluttered background is present. In addition, the method is computationally expensive. Recently, a method [31] based on a Bayesian classifier and boundary growing was proposed to improve the accuracy of multi-oriented text detection in video. However, the boundary growing used in this work performs well only when sufficient space is present between the text lines; otherwise it takes non-text as text components. Therefore, the method handles only non-horizontal straight text lines rather than arbitrarily oriented ones, where the space between the text lines is often limited. Arbitrary text detection is proposed in [32] using gradient directional features and region growing. This method requires a classification of horizontal and non-horizontal text images, and when an image contains multi-oriented text it fails to classify them. Therefore, it is not effective for arbitrary text detection. Thus, arbitrarily-oriented text detection in video is still a challenging and interesting problem.
Hence, in this paper, we propose the use of gradient vector flow for identifying text components in a novel way. The work presented in [33], which identifies object boundaries using Gradient Vector Flow (GVF) and has the ability to move into concave boundaries without sacrificing boundary pixels, motivated us to propose a GVF based method for arbitrary text detection. This property helps in detecting both high and low contrast text pixels, unlike the gradient used in [32], which detects only high contrast text pixels; this is essential for improving the accuracy of video text detection at any orientation.
II. PROPOSED METHODOLOGY
In this work, we explore GVF for identifying dominant text pixels using the Sobel edge map of the input image for arbitrary text detection in video. We prefer Sobel over other edge operators such as Canny because Sobel gives fine details for text and fewer details for non-text, while Canny gives many erratic edges for the background along with fine details of text. Next, the edge components in the Sobel edge map corresponding to the dominant pixels are extracted, and we call them Text Candidates (TC). This operation gives representatives for each text line. To tackle arbitrary orientation, we propose a new two-stage grouping criterion for the TC. The first stage grows the perimeter of each TC to identify its nearest neighbor based on the size and angle of the TC and to group them, which gives text components. Before proceeding to the second stage of grouping, we apply a skeleton analysis to the text components given by the first stage to eliminate false text components based on junction points. We name this output the Candidate Text Components (CTC). In the second stage, we use the tails of the CTC to identify the direction of the text information, and the method grows along the identified direction to find the nearest neighbor CTC, which outputs the final result of arbitrarily-oriented text detection in video. To the best of our knowledge, this is the first work addressing arbitrarily-oriented text detection in video with promising accuracy using GVF information.
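As a concrete starting point, the Sobel edge map used throughout the pipeline can be obtained as in the following Python/OpenCV sketch. This is our own illustration rather than the authors' code; the binarisation threshold is an assumption, since the paper does not specify one.

```python
import cv2
import numpy as np

def sobel_edge_map(frame_bgr, thresh=50):
    """Binary Sobel edge map of a video frame (sketch; `thresh` is assumed)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)   # horizontal gradient
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)   # vertical gradient
    mag = cv2.magnitude(gx, gy)                       # gradient magnitude
    return (mag > thresh).astype(np.uint8) * 255      # binarised edge map
```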
A. GVF for Dominant Text Pixel Selection
The Gradient Vector Flow (GVF) is the vector field that minimizes the energy functional defined in equation (1) [33]:

$$\mathcal{E} = \iint \mu \left( u_x^2 + u_y^2 + v_x^2 + v_y^2 \right) + |\nabla f|^2\, |g - \nabla f|^2 \, dx\, dy \qquad (1)$$

where $g(x, y) = (u(x, y), v(x, y))$ is the GVF field and $f(x, y)$ is the edge map of the input image.
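For readers who wish to experiment, the minimisation of (1) can be approximated numerically with the explicit diffusion scheme of [33]. The following Python/NumPy sketch is our illustration, not the authors' implementation; the values of mu, the number of iterations and the time step are assumptions.

```python
import numpy as np

def gvf(f, mu=0.2, iters=80, dt=0.5):
    """Minimise the GVF energy of Eq. (1) by explicit gradient descent [33].
    `f` is the edge map, assumed normalised to [0, 1]; mu, iters and dt are
    illustrative values, not taken from the paper."""
    fx, fy = np.gradient(f)              # gradient of the edge map
    u, v = fx.copy(), fy.copy()          # initialise the field g = (u, v)
    mag2 = fx ** 2 + fy ** 2             # |grad f|^2, weights the data term
    for _ in range(iters):
        # 5-point Laplacians (borders wrap; adequate for illustration)
        lap_u = (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
                 np.roll(u, 1, 1) + np.roll(u, -1, 1) - 4 * u)
        lap_v = (np.roll(v, 1, 0) + np.roll(v, -1, 0) +
                 np.roll(v, 1, 1) + np.roll(v, -1, 1) - 4 * v)
        # smoothness term spreads the field; data term keeps it close to
        # grad f wherever |grad f| is large
        u += dt * (mu * lap_u - mag2 * (u - fx))
        v += dt * (mu * lap_v - mag2 * (v - fy))
    return u, v
```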
GVF has been used in [33] for object boundary detection, where it is shown to be better than the traditional gradient and the classical snake. It is also noted in [33] that the traditional gradient has two problems: (1) the gradient vectors generally have large magnitudes only in the immediate vicinity of the edges, and (2) in homogeneous regions, where pixel values are nearly constant, $\nabla f$ is nearly zero. GVF is an extension of the gradient that spreads the gradient map farther away from the edges and into homogeneous regions using a computational diffusion process. The inherent competition of the diffusion process creates vectors that point into boundary concavities, which is a special property of GVF. In summary, GVF propagates gradient information, i.e. the magnitude and the direction, into homogeneous regions. In other words, GVF produces multiple forces at corner points of object contours. This cue allows us to use the multiple forces at corner points of edge components in the Sobel edge map of the input video frame to identify them as dominant pixels. This dominant pixel selection removes most of the background information, which simplifies the problem of classifying text and non-text pixels, and it retains text information irrespective of the orientation of the text in the video. This is the great advantage of dominant pixel selection by GVF information. It is illustrated in Figure 1, where (a) is the input and (b) is the GVF for all pixels in the image in Figure 1(a). It is observed from Figure 1(b) that forces are dense at corners of contours and at curved boundaries of text components, since text components are generally more cursive than non-text components.
Therefore, for each pixel, we count how many GVF forces (arrows) point to it. A pixel is classified as a "dominant text pixel" if it attracts at least four GVF forces. The threshold of four is determined by an experiment counting between two and five GVF forces over 100 test samples randomly selected from our database; the quantitative results are reported in Table 1. Table 1 shows that for 2 GVF the F-measure is low and the misdetection rate is high compared to 3 GVF, because more non-text (background) pixels are represented by 2 GVF, while for 3 GVF the F-measure is low and the misdetection rate is high compared to 4 GVF for the same reason. On the other hand, for 4 GVF the F-measure is high and the misdetection rate is low compared to 5 GVF, which shows that 5 GVF loses text pixels and thus increases the misdetection rate. It is also observed from Table 1 that 5 GVF gives higher precision and lower recall than 4 GVF, indicating that 5 GVF loses dominant pixels that represent both true text and non-text pixels. Therefore, we infer that 4 GVF is better than the other settings for identifying dominant text pixels, which represent true text pixels and few non-text pixels.
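The counting rule can be sketched as follows: an edge pixel is marked dominant when at least four of the GVF vectors in its 8-neighbourhood point back towards it. This is our own reading of the rule; the angular tolerance of pi/8 and the wrap-around border handling are assumptions. Here gy and gx are the row and column components of the GVF field (the two arrays returned by the gvf sketch above).

```python
import numpy as np

def dominant_pixels(edge_map, gy, gx, min_forces=4):
    """Mark edge pixels attracting at least `min_forces` GVF arrows (sketch)."""
    count = np.zeros(edge_map.shape, dtype=int)
    ang = np.arctan2(gy, gx)                        # direction of each GVF vector
    for dy, dx in [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                   (0, 1), (1, -1), (1, 0), (1, 1)]:
        back = np.arctan2(-dy, -dx)                 # neighbour -> centre direction
        # neighbour's GVF direction, aligned onto the centre pixel's grid
        n_ang = np.roll(np.roll(ang, -dy, axis=0), -dx, axis=1)
        diff = np.abs(np.angle(np.exp(1j * (n_ang - back))))
        count += (diff < np.pi / 8).astype(int)     # neighbour points at the centre
    return (edge_map > 0) & (count >= min_forces)
```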
In addition, our objective in proposing 4 GVF for dominant pixel selection at this stage is to remove as many non-text pixels as possible, even though it eliminates a few dominant pixels that represent text pixels, because the proposed grouping schemes (Sections II.C and II.D) are able to restore missing text information. Therefore, losing a few dominant text pixels for characters in a text line does not much affect the overall performance of the method. The dominant text pixel selection is illustrated in Figure 1(c) for the frame shown in Figure 1(a). Figure 1(c) shows that dominant text pixel selection removes almost all non-text components. Figure 1(d) shows the dominant text pixels overlaid on the input frame. One can notice from Figure 1(d) that each text component has a dominant pixel. In this way, dominant text pixel selection facilitates arbitrarily-oriented text detection.
Figure 1. Dominant text pixel selection based on GVF: (a) input, (b) GVF, (c) dominant text pixels, (d) dominant pixels overlaid on the input frame.
As an example, we choose the character image "a" from the input frame shown in Figure 1(a). It is reproduced in Figure 2(a) to illustrate how GVF information helps in selecting dominant text pixels. To show the GVF arrows for the character image in Figure 2(a), we compute the Sobel edge map shown in Figure 2(b) and the GVF arrows on the Sobel edge map shown in Figure 2(c). From Figure 2(c), it is clear that all the GVF arrows point towards the inner contour of the character "a". This is because of the low contrast in the background and the high contrast at the inner boundary of the character. From Figure 2(d), we observe that corner points and cursive text pixels on the contour attract more GVF arrows than non-corner points and non-text pixels. For instance, for a text pixel on the inner contour of the character "a" shown in Figure 2(a), the corresponding GVF is marked by the oval in the middle of Figure 2(d); the oval area shows that a large number of GVF forces point towards that text pixel. Similarly, for a non-text pixel at the top left corner of the character "a" in Figure 2(a), the corresponding GVF, marked by the top left oval in Figure 2(d), shows that fewer GVF forces point towards that pixel. For the same text and non-text pixels, we show the GVF arrows in their 3 Γ— 3 neighborhoods. The darker arrows in Figure 3(a) and (b) are those that point to the middle pixel (the pixel of interest); the lighter arrows are attracted elsewhere. In Figure 3(a), the middle pixel attracts four arrows and is hence classified as a corner point (dominant text pixel), while the pixel in Figure 3(b) attracts only one arrow and is classified as a non-text pixel.
We also test pixels that attract two and three GVF arrows, as shown in Figure 3(c)-(d) and Figure 3(e)-(f), respectively. One can see that the Dominant Pixels (DP) shown in Figure 3(d) and (f), corresponding to the GVF (in red) in Figure 3(c) and (e), represent not only text pixels but also non-text (background) pixels. On the other hand, in Figure 3(g)-(h) we see that the pixels selected by four GVF arrows are true candidate text pixels, because they indeed represent only text pixels, as shown in Figure 3(h) for the GVF (in red) shown in Figure 3(g). In addition, Figure 4 shows that the 4 GVF criterion identifies dominant pixels correctly (Figure 4(b) and (d)) for characters like "O" and "I" (Figure 4(a) and (c)), which have no corners but do have extreme points. This confirms that 4 GVF works well for such characters too.
Figure 2. Magnified GVF for corner and non-corner pixels marked by ovals: (a) character chosen from Figure 1(a), (b) Sobel edge map, (c) GVF overlaid on the Sobel edge map, (d) GVF for the character image in (a).

Figure 3. Illustration of dominant text pixel (DP) selection with GVF arrows: (a) GVF arrows at a text pixel, (b) GVF arrows at a non-text pixel, (c)-(d) 2 GVF and the resulting DP, (e)-(f) 3 GVF and the resulting DP, (g)-(h) 4 GVF and the resulting DP.

Figure 4. 4 GVF for characters like "O" and "I" to identify DP: (a), (c) 4 GVF; (b), (d) DP.
Table 1. Experiments on 100 samples randomly chosen from different databases for choosing the number of GVF arrows

GVF Arrows   R      P      F      MDR
2            0.51   0.36   0.42   0.33
3            0.56   0.47   0.51   0.27
4            0.78   0.67   0.72   0.16
5            0.63   0.68   0.65   0.53
B. Text Candidates Selection
We use the result of dominant pixel selection shown in Figure 1(c) for text candidate selection. For each dominant pixel in Figure 1(c), the method extracts the corresponding edge component from the Sobel edge map shown in Figure 5(a). We call these extracted edge components Text Candidates, as shown in Figure 5(b). Figure 5(b) shows that this operation extracts almost all text components with few false positives. The extracted text candidates are then used in the next section, together with the Sobel edge map, to restore the complete text information.
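A minimal sketch of this step uses connected-component labelling over the Sobel edge map; the 8-connectivity structure and the SciPy labelling routine are our choices for illustration, not requirements of the paper.

```python
import numpy as np
from scipy import ndimage as ndi

def text_candidates(sobel_edges, dominant):
    """Keep Sobel edge components that contain at least one dominant pixel (sketch)."""
    labels, _ = ndi.label(sobel_edges > 0, structure=np.ones((3, 3)))
    hit = np.unique(labels[dominant & (labels > 0)])   # component ids touched by a DP
    return np.isin(labels, hit) & (labels > 0)         # mask of the kept components
```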
Figure 5. Text candidate selection based on dominant pixels: (a) Sobel edge map, (b) text candidates.
C. First Grouping for Candidate Text Components
For each text candidate shown in Figure 5(b), the method finds its perimeter and allows the perimeter to grow, pixel by pixel, for five iterations in the direction of the text line in the Sobel edge map of the input frame in order to group neighboring text candidates. The perimeter is defined as the contour of the text candidate. The method computes the minor axis of the perimeter of the text candidate and uses the length of the minor axis as the radius to expand the perimeter. At every iteration, the method traverses the expanded perimeter to find a text pixel (white pixel) of a neighboring text candidate in the text line. The objective of this step is to merge segments of character components and neighboring characters to form a word. This process merges text candidates that are in close proximity within five iterations of perimeter growing. The value of five is determined empirically by studying the space between text candidates; a five pixel tolerance is acceptable because it is lower than the space between characters. As a result, we get two groups of text candidates, namely the current group and the neighbor group. The method then verifies the following properties based on the size and angle of the text candidate groups before merging them. Generally, the major axes of the character components have almost the same length, and the angle differences between the character components are almost the same. However, we fix $\theta_{min1}$ at 5Β° because, in the case of arbitrarily oriented text, each character has a slightly different orientation according to the orientation of the text line; the 5Β° tolerance accounts for this small variation.
- Size:

$$\frac{medianLength(g)}{3} < length(c) < medianLength(g) \times 3$$

where $length(\cdot)$ is the length of the major axis of a text candidate group and $medianLength(\cdot)$ is the median length of the major axes of all the text candidates in the group so far.
- Angle:

$$g = g_{prev} \cup \{c_{last}\}, \quad g_{next} = g \cup \{c\}$$
$$\Delta\theta_1 = |angle(g) - angle(g_{prev})|, \quad \Delta\theta_2 = |angle(g) - angle(g_{next})|$$

where $g$ is the current group, $c_{last}$ is the text candidate that was last added to $g$, and $c$ is the new text candidate that we are considering adding to $g$. It follows that $g_{prev}$ and $g_{next}$ are the group immediately before the current group and the candidate (next) group, respectively. $angle(\cdot)$ returns the orientation of the major axis of each group based on PCA. The angle condition is:

$$|\Delta\theta_1 - \Delta\theta_2| \le \theta_{min1}$$
This condition is only checked when $g$ has at least four components. If a text candidate group passes these two conditions, we merge the neighbor group with the current group to obtain candidate text components (word patches). The two conditions fail when there is a large angle difference between two words due to a cluttered background during grouping. This is illustrated in Figure 6, where (a)-(e) show $g$, $c$, $g_{prev}$, $c_{last}$ and $g_{next}$, respectively, chosen from Figure 5(b). The angles computed for the groups in this case are $\Delta\theta_1 = 5.33°$ and $\Delta\theta_2 = 4.02°$, with $length(c) = 11.95$ and $medianLength(g) = 12.64$. The conditions are therefore satisfied, and $c$ is merged into $g$ as shown in Figure 6(e). In this way, the method groups broken segments and neighboring characters to obtain candidate text components. The final grouping results for the text candidates in Figure 5(b) are shown in Figure 7(a), where different colors represent the different groups formed. The staircase effect in Figure 7(b) shows the grouping mechanism for obtaining candidate text groups. This process repeats until there are no remaining unvisited text candidates. This grouping essentially gives word patches by grouping character components.
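The size and angle tests can be sketched as below. The helpers pca_angle and ang_diff are hypothetical names we introduce for illustration; angles are in degrees, and the axial wrap-around handling in ang_diff is our own addition.

```python
import numpy as np

def pca_angle(points):
    """Orientation (degrees) of the major axis of an Nx2 point set via PCA."""
    pts = points - points.mean(axis=0)
    evals, evecs = np.linalg.eigh(np.cov(pts.T))
    major = evecs[:, np.argmax(evals)]
    return np.degrees(np.arctan2(major[1], major[0]))

def ang_diff(a, b):
    """Smallest difference between two axis orientations, in [0, 90] degrees."""
    d = abs(a - b) % 180.0
    return min(d, 180.0 - d)

def can_merge(group_lengths, g_ang, gprev_ang, gnext_ang, c_length, theta_min1=5.0):
    """Size and angle conditions of the first grouping (sketch)."""
    med = np.median(group_lengths)                   # medianLength(g)
    size_ok = med / 3.0 < c_length < med * 3.0       # size condition
    angle_ok = abs(ang_diff(g_ang, gprev_ang) -
                   ang_diff(g_ang, gnext_ang)) <= theta_min1
    # the angle test is applied only once g has at least four components
    return size_ok and (len(group_lengths) < 4 or angle_ok)
```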
Figure 6. Illustration of candidate text component selection: (a) $g$, (b) $c$, (c) $g_{prev}$, (d) $c_{last}$, (e) $g_{next}$.

Figure 7. Word patch extraction: (a) first grouping, (b) staircase effect, (c) skeleton, (d) end and junction points, (e) candidate text components after false positive elimination.
It is observed from Figure 7(b) that there are false text candidate groups. To eliminate them, we check the skeleton of each group, as shown in Figure 7(c), and count the number of junction points, shown in Figure 7(d). If $|intersection(skeleton(g))| > 0$, the group is a false text candidate group and is not retained, where $skeleton(\cdot)$ returns the skeleton of a group and $intersection(\cdot)$ returns the set of intersection (junction) points. The final result after removing false text candidate groups is shown in Figure 7(e). However, some false text candidate groups still remain.
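A sketch of the junction test is given below, using scikit-image's skeletonisation (a library choice of ours): a skeleton pixel with more than two skeleton neighbours is treated as a junction point.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.morphology import skeletonize

def has_junction(group_mask):
    """True if the skeleton of the group contains a junction point (sketch)."""
    skel = skeletonize(group_mask > 0)
    # number of skeleton pixels in each 3x3 neighbourhood, excluding the centre
    neigh = ndi.convolve(skel.astype(int), np.ones((3, 3), int),
                         mode='constant') - skel.astype(int)
    return bool(np.any(skel & (neigh > 2)))          # junction found => reject group
```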
D. Second Grouping for Text Line Detection
The first grouping described above produces word patches by grouping character components. For each word patch, the second grouping finds the two tail ends using the major axis of the word patch. The method takes the text candidates at both tail ends of the word and grows their perimeters along the direction of the major axis for a few iterations to find neighboring word patches. The number of iterations is determined from experiments on the space between words and characters. While growing the perimeter pixel by pixel, the method looks for white pixels of neighboring word patches; the Sobel edge map of the input frame is used for growing and for finding neighboring word patches. Two word patches are grouped based on their angle properties. Let $t_1$ be the right tail end of the first word patch and $t_2$ the left tail end of the second word patch.
$$t_1 = tail(w_1, c_1), \quad t_2 = tail(w_2, c_2), \quad t_{12} = t_1 \cup t_2$$
$$\Delta\theta_1 = |angle(t_1) - angle(t_{12})|, \quad \Delta\theta_2 = |angle(t_2) - angle(t_{12})|$$
where $w_1$ is the current word patch, $c_1$ is the text candidate being used for growing, and $c_2$ is the text candidate of the neighboring word patch $w_2$ that is reached by the growing. The idea is to check that the "tail angles" of the two words are compatible with each other. $tail(w, c)$ returns up to three text candidates immediately connected to $c$ in $w$, and $t_{12}$ is the union of the two tails $t_1$ and $t_2$. The angle condition is:

$$\Delta\theta_1 \le \theta_{min2} \;\wedge\; \Delta\theta_2 \le \theta_{min2}$$
This condition is only checked if both $t_1$ and $t_2$ contain three components. If a word patch passes this condition, it is merged into the current word. We set $\theta_{min2}$ to 25Β° to accommodate the orientation difference between words in a text line; a small orientation difference between words is expected because the input is arbitrarily oriented text. A tolerance of 25Β° does not affect the grouping process much because there is usually enough space between text lines.
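The tail-angle test of the second grouping can be sketched as follows, reusing the hypothetical pca_angle and ang_diff helpers from the first-grouping sketch; again this is our illustration rather than the authors' code.

```python
import numpy as np

def tails_compatible(t1_pts, t2_pts, theta_min2=25.0):
    """Second-grouping angle condition on the two tails (sketch, degrees).
    t1_pts, t2_pts: Nx2 pixel coordinates of the tails t1 and t2."""
    t12_pts = np.vstack([t1_pts, t2_pts])        # joint tail t12 = t1 union t2
    a1, a2, a12 = pca_angle(t1_pts), pca_angle(t2_pts), pca_angle(t12_pts)
    return ang_diff(a1, a12) <= theta_min2 and ang_diff(a2, a12) <= theta_min2
```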
An illustration of grouping word patches chosen from Figure 7(e) is given in Figure 8, where (a)-(e) represent $w_1$, $w_2$, $t_1$, $t_2$ and $t_{12}$, respectively. Suppose we are considering whether to merge $w_1$ and $w_2$.

Figure 8. Illustration of word grouping: (a) $w_1$, (b) $w_2$, (c) $t_1$, (d) $t_2$, (e) $t_{12}$.

In this case, $\Delta\theta_1 = 20.87°$ and $\Delta\theta_2 = 20.68°$, so the condition is satisfied and $w_1$ and $w_2$ are merged, as shown in red in Figure 9(a). This process repeats until there are no remaining unvisited words. The output of the second grouping is shown in Figure 9(a), where the staircase effect with different colors shows how the words are grouped, and the final result is shown in Figure 9(b), where the curved text line is extracted along with one false positive.
Figure 9. Arbitrary text extraction: (a) second grouping, (b) text line detection.
E. False Positive Removal
Sometimes false positives are merged with the text lines (as in the above case), which makes them difficult to remove. In other cases, however, the false positives stand alone, and we propose the following rules to remove them. Rules for eliminating such false positives based on geometrical properties of the text block are common practice in text detection [14-32] to improve accuracy, so we adopt similar rules in this work.

False positive check: if $area(w) < 200$ or $edge\_density(w) < 0.05$, the block is declared a false positive and removed, where
$$edge\_density(w) = \frac{edge\_length(sobel(w))}{area(w)}$$
Here $sobel(\cdot)$ returns the Sobel edge map and $edge\_length(\cdot)$ returns the total length of all edges in the edge map. Figure 10(a) shows the input, (b) shows the results before false positive elimination, (c) shows the result of false positive elimination using the area of the text block, and (d) shows the result of false positive elimination using the edge density of the text block.
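A sketch of these two rules is shown below. We approximate edge_length by the number of Sobel edge pixels inside the block and area by the number of block pixels; both approximations are our assumptions about quantities the paper does not spell out.

```python
import numpy as np

def is_false_positive(block_mask, sobel_edges, min_area=200, min_density=0.05):
    """Area and edge-density tests for stand-alone false positives (sketch)."""
    area = int(np.count_nonzero(block_mask))                    # ~ area(w)
    edge_len = int(np.count_nonzero((sobel_edges > 0) & (block_mask > 0)))
    return area < min_area or edge_len / max(area, 1) < min_density
```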
Figure 10. Illustration of false positive elimination: (a) input, (b) before false positive removal, (c) area based false positive removal, (d) edge density based false positive removal.
III. EXPERIMENTAL RESULTS
We created our own dataset for evaluating the proposed method, along with the standard dataset of Hua's data of 45 video frames [34]. Our dataset includes 142 arbitrarily-oriented text frames (almost all scene text frames), 220 non-horizontal text frames (176 scene text frames and 44 graphics text frames), 800 horizontal text frames (160 Chinese text frames, 155 scene text frames and 485 English graphics text frames), and the publicly available Hua's data of 45 frames (12 scene text frames and 33 graphics text frames). We also tested our method on the ICDAR-03 competition dataset [35] of 251 camera images (all scene text images) to check its effectiveness on camera based images. In total, 1207 (142 + 220 + 800 + 45) video frames and 251 camera images are used for experimentation.
To compare the results of the proposed method with existing methods, we consider seven popular existing methods: the Bayesian and boundary growing based method [31], the Laplacian and skeleton based method [30], Zhou et al. [29], the Fourier-RGB based method [28], Liu et al. [26], Wong and Chen [21] and Cai et al. [22]. The main reason for considering these methods is that, like our proposed method, they work with fewer constraints and handle complex backgrounds without a classifier or training.
We evaluate the performance of the proposed method at the text line level, which is a common granularity level in the literature [17-25], rather than at the word or character level, because we have not considered text recognition in this work. The following categories are defined for each block detected by a text detection method. Truly Detected Block (TDB): a detected block that contains at least one true character; thus, a TDB may or may not fully enclose a text line. Falsely Detected Block (FDB): a detected block that does not contain text. Text Block with Missing Data (MDB): a detected block that misses more than 20% of the characters of a text line (MDB is a subset of TDB). The percentage is chosen according to [30-31], in which a text block is considered correctly detected if it overlaps at least 80% of the pixels of the ground-truth block. We manually count the Actual Number of Text Blocks (ATB) in the images, which is taken as the ground truth for evaluation.
The performance measures are defined as follows: Recall (R) = TDB / ATB, Precision (P) = TDB / (TDB + FDB), F-measure (F) = (2 Γ— P Γ— R) / (P + R), and Misdetection Rate (MDR) = MDB / TDB.
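For clarity, the four measures map directly onto the block counts, as in this small sketch (the example counts in the comment are illustrative only).

```python
def detection_metrics(TDB, FDB, MDB, ATB):
    """Recall, precision, F-measure and misdetection rate as defined above."""
    R = TDB / ATB
    P = TDB / (TDB + FDB)
    F = 2 * P * R / (P + R)
    MDR = MDB / TDB
    return R, P, F, MDR

# e.g. detection_metrics(TDB=90, FDB=25, MDB=12, ATB=115)
# gives approximately (0.78, 0.78, 0.78, 0.13)
```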
In addition, we also measure the Average Processing Time (APT, in seconds) for each method in our experiments.
A. Experiment on Video Text Data
In order to show the effectiveness of the proposed method over the existing methods, we assemble the 142 arbitrary images with the 800 horizontal and 220 non-horizontal images to form a representative set of general video data, and we calculate the performance measures, namely recall, precision, F-measure and misdetection rate. The quantitative results of the proposed and the existing methods for the 1162 images (142 + 800 + 220) are reported in Table 2. We highlight sample arbitrary, non-horizontal and horizontal images for discussion in Figure 11, Figure 12 and Figure 13, respectively.
For a curved text line, such as the circular one shown in Figure 11(a), the proposed method extracts the text lines with one false positive, while the existing methods fail to detect the curved text line properly. The main reason is that the existing methods are developed for horizontal and non-horizontal text line detection but not for arbitrary text detection.
It is observed from Figure 12 that, for the input frame with different orientations and a complex background shown in Figure 12(a), the proposed method detects almost all text with a few misdetections, as shown in Figure 12(b), while the Bayesian method does not fix the bounding boxes properly, as shown in Figure 12(c), and the Laplacian method detects two text lines but loses one text line, as shown in Figure 12(d), due to the complex background in the frame. On the other hand, Zhou et al.'s method fails to detect text, as shown in Figure 12(e), as it is limited to horizontal and vertical caption text and cannot handle scene text or multi-oriented text. It is also observed from Figure 12 that the Fourier-RGB, Liu et al.'s, Wong and Chen's and Cai et al.'s methods fail to detect the text lines, because these methods are developed for horizontal text detection rather than non-horizontal text detection.
Sample experimental results of both the proposed and existing methods on horizontal text detection are shown in Figure 13, where the input image in Figure 13(a) has a complex background with horizontal text. It is noticed from Figure 13 that the proposed method and the Bayesian, Laplacian, Fourier-RGB and Cai et al.'s methods detect almost all text lines, while the other methods miss text lines. The Bayesian method does not fix the bounding boxes properly and gives more false positives due to the problem of boundary growing. The Fourier-RGB method detects text properly. The other existing methods do not: Zhou et al.'s method misses a few text lines, Liu et al.'s method misses a few words in addition to producing false positives, and Wong and Chen's and Cai et al.'s methods do not fix the bounding boxes properly for the text lines.
Figure 11. Sample results for arbitrarily-oriented text detection: (a) input, (b) proposed, (c) Bayesian, (d) Laplacian, (e) Zhou et al., (f) Fourier-RGB, (g) Liu et al., (h) Wong and Chen, (i) Cai et al.
Figure 12. Sample results for non-horizontal text detection: (a) input, (b) proposed, (c) Bayesian, (d) Laplacian, (e) Zhou et al., (f) Fourier-RGB, (g) Liu et al., (h) Wong and Chen, (i) Cai et al.
Figure 13. Sample results for horizontal text detection: (a) input, (b) proposed, (c) Bayesian, (d) Laplacian, (e) Zhou et al., (f) Fourier-RGB, (g) Liu et al., (h) Wong and Chen, (i) Cai et al.
Observations of the above sample images show that the proposed method performs well for arbitrary, non-horizontal and horizontal text compared to the existing methods. The quantitative results reported in Table 2 also show that the proposed method outperforms the existing methods in terms of recall, precision, F-measure and misdetection rate. However, the Average Processing Time (APT) of the proposed method is longer than that of most of the existing methods, except for the Fourier-RGB and Liu et al.'s methods, as shown in Table 2 and in the subsequent experiments (Tables 3 and 4). The higher APT is attributed to the GVF computation and the grouping process, which incur a higher computational cost; it is this GVF process that enables the proposed method to deal with arbitrarily-oriented text lines.
Our previous methods, namely the Bayesian and the Laplacian methods, give lower accuracy than the proposed method according to Table 2. This is because these methods were developed for non-horizontal and horizontal text detection but not for arbitrary orientation text detection. As a result, the boundary growing and the skeleton based procedures proposed in the Bayesian and the Laplacian methods, respectively, for handling the multi-orientation problem fail on arbitrary text. Zhou et al.'s method works well only for vertical and horizontal caption text but not for arbitrary orientations and scene text, and hence gives poor accuracy. Since Liu et al.'s, Wong and Chen's and Cai et al.'s methods were developed for horizontal text detection but not for non-horizontal and arbitrary orientation text detection, these methods give poor accuracy compared to the proposed method.
Table 2. Performance on arbitrary + non-horizontal + horizontal data (142 + 220 + 800 = 1162)

Methods               R      P      F      MDR    APT (sec)
Proposed Method       0.78   0.79   0.78   0.10   14.6
Bayesian [31]         0.75   0.69   0.71   0.15   10.3
Laplacian [30]        0.74   0.77   0.75   0.19   9.6
Zhou et al. [29]      0.54   0.72   0.61   0.28   1.5
Fourier-RGB [28]      0.63   0.77   0.69   0.13   16.9
Liu et al. [26]       0.57   0.64   0.60   0.12   23.3
Wong and Chen [21]    0.54   0.76   0.63   0.12   1.8
Cai et al. [22]       0.54   0.41   0.46   0.17   7.4
B. Experiment on Independent Data (Hua's Data)

We use a small publicly available dataset of 45 video frames [34], namely Hua's dataset, to evaluate the performance of the proposed method in comparison with the existing methods. We include this set in our experiment because it serves as an independent test set in addition to our own dataset used in the preceding section. We caution, however, that this set contains only horizontal text and hence does not cover the entire spectrum of text detection capability from horizontal and non-horizontal to arbitrary orientation. Figure 14 shows sample results of the proposed and existing methods, where (a) is an input frame containing both large and small font text, and (b)-(i) are the results of the proposed and existing methods, respectively.

Figure 14. Sample results for Hua's data: (a) input, (b) proposed, (c) Bayesian, (d) Laplacian, (e) Zhou et al., (f) Fourier-RGB, (g) Liu et al., (h) Wong and Chen, (i) Cai et al.
It is observed from Figure 14 that the proposed method detects both text lines in the input frame, while the Bayesian method does not detect all the text and the Laplacian method fails to detect complete text lines, rendering them as either misdetections or false positives. Therefore, their misdetection rates are high compared to the proposed method, as shown in Table 3. The Fourier-RGB method detects text properly and hence gives good recall. The other existing methods fail to detect the text lines in the input frame due to font variation. From Table 3, it can be concluded that the proposed method and our earlier methods [30, 31] outperform the other existing methods in terms of recall, precision, F-measure and misdetection rate. We note that the Bayesian method [31] and the Laplacian method [30] achieve better F-measures than the proposed method. However, as cautioned earlier, Hua's dataset does not contain arbitrarily oriented text, so both the Bayesian and the Laplacian methods have the advantage of not being tested on arbitrary text lines. If Hua's dataset had contained arbitrarily oriented text lines, the Bayesian and the Laplacian methods would have shown poorer F-measures, as in Table 2.
Table 3. Performance with Hua's data

Methods               R      P      F      MDR    APT (sec)
Proposed Method       0.88   0.74   0.80   0.05   10.5
Bayesian [31]         0.87   0.85   0.85   0.18   5.6
Laplacian [30]        0.93   0.81   0.87   0.07   11.7
Zhou et al. [29]      0.72   0.82   0.77   0.44   1.13
Fourier-RGB [28]      0.81   0.73   0.76   0.06   14.6
Liu et al. [26]       0.75   0.54   0.63   0.16   24.9
Wong and Chen [21]    0.51   0.75   0.61   0.13   1.6
Cai et al. [22]       0.69   0.43   0.53   0.13   9.2
C. Experiment on ICDAR-03 Data (Camera Images)
We add another independent test set in this experiment, as in the preceding section. The objective of this experiment is to show that the proposed method, which works well for low resolution video frames, also works well for high resolution camera images. This dataset is publicly available [35] as the ICDAR-03 competition data for text detection in natural scene images.
Figure 15. Sample results for scene text detection (ICDAR-03 data): (a) input, (b) proposed, (c) Bayesian, (d) Laplacian, (e) Zhou et al., (f) Fourier-RGB, (g) Liu et al., (h) Wong and Chen, (i) Cai et al.

We show sample results of the proposed and existing methods in Figure 15, where (a) is a sample input frame and (b)-(i) show the results of the proposed and the existing methods, respectively. It is observed from Figure 15 that the proposed method, the Fourier-RGB method and Cai et al.'s method work well for the input frame, while the other methods, including our earlier Bayesian and Laplacian methods, fail to detect the text lines properly. The results reported in Table 4 show that the proposed method is better in terms of recall, F-measure and misdetection rate than the Bayesian, the Laplacian and the Fourier-RGB methods. This is because, for high contrast and high resolution images, the classification procedures proposed in the Bayesian and the Laplacian methods and the dynamic threshold used in Fourier-RGB fail to classify text and non-text pixels properly. The proposed method and our earlier methods are better than the other existing methods in terms of recall, precision and F-measure, but in terms of misdetection rate Wong and Chen's method is better, according to the results reported in Table 4. Wong and Chen's method is, however, the worst in recall, precision and F-measure compared to the proposed method. This experiment shows that the proposed method also performs well on high resolution, high contrast images.

Table 4. Line level performance on ICDAR-03 data

Methods               R      P      F      MDR    APT (sec)
Proposed Method       0.92   0.76   0.83   0.13   12.7
Bayesian [31]         0.87   0.72   0.78   0.14   7.9
Laplacian [30]        0.86   0.76   0.81   0.13   6.8
Zhou et al. [29]      0.66   0.83   0.73   0.26   1.2
Fourier-RGB [28]      0.80   0.66   0.72   0.04   15.5
Liu et al. [26]       0.53   0.61   0.57   0.24   16.1
Wong and Chen [21]    0.52   0.83   0.64   0.08   1.0
Cai et al. [22]       0.67   0.33   0.44   0.43   6.1
We also conduct experiments on the ICDAR data using the ICDAR 2003 measures for our proposed method, and the results are reported in Table 5. Since our primary goal is to detect text in video, we develop and evaluate the method at the line level, as is common practice in video text detection [14-32]. In order to calculate recall, precision and F-measure according to ICDAR 2003, we modify the method to fix a bounding box for each word in the image based on the space between words and characters. Table 5 shows that the proposed method does not achieve better accuracy than the best method (Hinnerk Becker), but it stands in third position among the listed methods. The lower accuracy is due to the problems of word segmentation, fitting tight bounding boxes and the strict measures. In addition, the method does not exploit the advantages of high resolution images in the way the participating methods do, which use connected component analysis for text detection and grouping; hence, the proposed method misses true text blocks. The results of the participating methods reported in Table 5 are taken from ICDAR 2005 [35] for comparison with the proposed method.
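One possible way to realise this word-level adaptation is to split each detected line region at wide gaps in its column projection, as in the sketch below. The gap width is an assumption on our part, since the paper derives it from the observed spacing between words and characters.

```python
import numpy as np

def split_line_into_words(line_mask, min_gap=8):
    """Split a detected text-line mask into word boxes at wide column gaps (sketch)."""
    cols = line_mask.sum(axis=0) > 0          # columns that contain text pixels
    boxes, start, gap = [], None, 0
    for x, has_text in enumerate(cols):
        if has_text:
            if start is None:
                start = x                     # a new word begins
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:                # a wide gap closes the word
                boxes.append((start, x - gap + 1))
                start, gap = None, 0
    if start is not None:                     # close the last word
        boxes.append((start, len(cols)))
    return boxes                              # list of (x_start, x_end) pairs
```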
Table 5. Word level performance on ICDAR 2003 data

Methods                R      P      F
Proposed Method        0.42   0.36   0.35
Hinnerk Becker [35]    0.67   0.62   0.62
Alex Chen [35]         0.60   0.60   0.58
Qiang Zhu [35]         0.40   0.33   0.33
Jisoo Kim [35]         0.28   0.22   0.22
Nobuo Ezaki [35]       0.36   0.18   0.22
IV. CONCLUSION AND FUTURE WORK
In this paper, we have explored GVF information for the first time for text detection in video by selecting dominant text pixels and text candidates with the help of the Sobel edge map. The dominant text pixel selection helps in removing non-text information in the complex background of video frames. Text candidate selection and the first grouping scheme ensure that text pixels are not missed, and the second grouping tackles the problems created by arbitrarily-oriented text to achieve better text detection accuracy in video. Experimental results on a variety of datasets, namely arbitrarily-oriented data, non-horizontal data, horizontal data, Hua's data and the ICDAR-03 data, show that the proposed method works well for text detection irrespective of contrast, orientation, background, script, font and font size. However, the proposed method may not give good accuracy for horizontal text lines with little spacing between them. To overcome this problem, we plan to develop another method that can detect text lines regardless of their spacing, using an alternative grouping criterion.
ACKNOWLEDGEMENT
This research is supported in part by A*STAR grant 092 101 0051 (WBS no. R252-000-402-305). The authors are grateful to the Editor and the reviewers for their constructive comments and suggestions, which greatly improved the quality of the paper.
REFERENCES
[1] N. Sharma, U. Pal and M. Blumenstein, "Recent Advances in Video Based Document Processing: A Review", In Proc. DAS, 2012, 63-68.
[2] J. Zhang and R. Kasturi, "Extraction of Text Objects in Video Documents: Recent Progress", DAS, 2008, 5-17.
[3] K. Jung, K. I. Kim and A. K. Jain, "Text Information Extraction in Images and Video: a Survey", Pattern Recognition, 2004, 977-997.
[4] D. Crandall and R. Kasturi, "Robust Detection of Stylized Text Events in Digital Video", ICDAR, 2001, 865-869.
[5] D. Zhang and S. F. Chang, "Event Detection in Baseball Video using Superimposed Caption Recognition", In Proc. ACM MM, 2002, 315-318.
[6] C. Xu, J. Wang, K. Wan, Y. Li and L. Duan, "Live Sports Event Detection based on Broadcast Video and Web-casting Text", In Proc. ACM MM, 2006, 221-230.
[7] W. Wu, X. Chen and J. Yang, "Incremental Detection of Text on Road Signs from Video with Applications to a Driving Assistant System", In Proc. ACM MM, 2004, 852-859.
[8] B. Epshtein, E. Ofek and Y. Wexler, "Detecting Text in Natural Scenes with Stroke Width Transform", CVPR, 2010, 2963-2970.
[9] Y. F. Pan, X. Hou and C. L. Liu, "A Hybrid Approach to Detect and Localize Texts in Natural Scene Images", IEEE Trans. on IP, 2011, 800-813.
[10] X. Chen, J. Yang, J. Zhang and A. Waibel, "Automatic Detection and Recognition of Signs from Natural Scenes", IEEE Trans. on IP, 2004, 87-99.
[11] C. Yao, X. Bai, W. Liu, Y. Ma and Z. Tu, "Detecting Texts of Arbitrary Orientations in Natural Images", In Proc. CVPR, 2012, 1083-1090.
[12] L. Neumann and J. Matas, "Real-Time Scene Text Localization and Recognition", In Proc. CVPR, 2012, 3538-3545.
[13] T. Q. Phan, P. Shivakumara and C. L. Tan, "Detecting Text in the Real World", In Proc. ACM MM, 2012, 765-768.
[14] A. K. Jain and B. Yu, "Automatic Text Location in Images and Video Frames", Pattern Recognition, 1998, 2055-2076.
[15] V. Y. Mariano and R. Kasturi, "Locating Uniform-Colored Text in Video Frames", ICPR, 2000, 539-542.
[16] H. Li, D. Doermann and O. Kia, "Automatic Text Detection and Tracking in Digital Video", IEEE Trans. on IP, 2000, 147-156.
[17] Y. Zhong, H. Zhang and A. K. Jain, "Automatic Caption Localization in Compressed Video", IEEE Trans. on PAMI, 2000, 385-392.
[18] K. I. Kim, K. Jung and J. H. Kim, "Texture-Based Approach for Text Detection in Images using Support Vector Machines and Continuously Adaptive Mean Shift Algorithm", IEEE Trans. on PAMI, 2003, 1631-1639.
[19] V. Wu, R. Manmatha and E. M. Riseman, "TextFinder: an Automatic System to Detect and Recognize Text in Images", IEEE Trans. on PAMI, 1999, 1224-1229.
[20] R. Lienhart and A. Wernicke, "Localizing and Segmenting Text in Images and Videos", IEEE Trans. on CSVT, 2002, 256-268.
[21] E. K. Wong and M. Chen, "A New Robust Algorithm for Video Text Extraction", Pattern Recognition, 2003, 1397-1406.
[22] M. Cai, J. Song and M. R. Lyu, "A New Approach for Video Text Detection", ICIP, 2002, 117-120.
[23] A. Jamil, I. Siddiqi, F. Arif and A. Raza, "Edge-based Features for Localization of Artificial Urdu Text in Video Images", ICDAR, 2011, 1120-1124.
[24] M. Anthimopoulos, B. Gatos and I. Pratikakis, "A Two-Stage Scheme for Text Detection in Video Images", Image and Vision Computing, 2010, 1413-1426.
[25] X. Peng, H. Cao, R. Prasad and P. Natarajan, "Text Extraction from Video using Conditional Random Fields", ICDAR, 2011, 1029-1033.
[26] C. Liu, C. Wang and R. Dai, "Text Detection in Images Based on Unsupervised Classification of Edge-based Features", ICDAR, 2005, 610-614.
[27] P. Shivakumara, W. Huang, C. L. Tan and P. Q. Trung, "Accurate Video Text Detection Through Classification of Low and High Contrast Images", Pattern Recognition, 2010, 2165-2185.
[28] P. Shivakumara, T. Q. Phan and C. L. Tan, "New Fourier-Statistical Features in RGB Space for Video Text Detection", IEEE Trans. on CSVT, 2010, 1520-1532.
[29] J. Zhou, L. Xu, B. Xiao and R. Dai, "A Robust System for Text Extraction in Video", ICMV, 2007, 119-124.
[30] P. Shivakumara, T. Q. Phan and C. L. Tan, "A Laplacian Approach to Multi-Oriented Text Detection in Video", IEEE Trans. on PAMI, 2011, 412-419.
[31] P. Shivakumara, R. P. Sreedhar, T. Q. Phan, S. Lu and C. L. Tan, "Multi-Oriented Video Scene Text Detection Through Bayesian Classification and Boundary Growing", IEEE Trans. on CSVT, 2012, 1227-1235.
[32] N. Sharma, P. Shivakumara, U. Pal, M. Blumenstein and C. L. Tan, "A New Method for Arbitrarily-Oriented Text Detection in Video", In Proc. DAS, 2012, 74-78.
[33] C. Xu and J. L. Prince, "Snakes, Shapes, and Gradient Vector Flow", IEEE Transactions on Image Processing, 1998, 359-369.
[34] X. S. Hua, L. Wenyin and H. J. Zhang, "An Automatic Performance Evaluation Protocol for Video Text Detection Algorithms", IEEE Trans. on CSVT, 2004, 498-507. (http://www.cs.cityu.edu.hk/~liuwy/PE_VTDetect/)
[35] S. M. Lucas, "ICDAR 2005 Text Locating Competition Results", ICDAR, 2005, 80-84.