
2D and 3D Video Scene Text Classification
Jiamin Xu1, Palaiahnakote Shivakumara2, Tong Lu1, and Chew Lim Tan3
1 National Key Lab for Novel Software Technology, Nanjing University, Nanjing, China
2 Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, Malaysia
3 School of Computing, National University of Singapore
superxjm@yeah.net, hudempsk@yahoo.com, lutong@nju.edu.cn, tancl@comp.nus.edu.sg
Abstract—Text detection and recognition is a challenging problem in image processing and document analysis due to the unpredictable nature of video texts, such as variations in orientation, font and size, illumination effects, and even different 2D/3D text contours. In this paper, we propose a novel horizontal and vertical symmetry feature by calculating the gradient direction and the gradient magnitude of each text candidate, which results in Potential Text Candidates (PTCs) after applying the k-means clustering algorithm on the gradient image of each input frame. To verify PTCs, we explore the temporal information of video by proposing an iterative process that continuously verifies the PTCs of the first frame and the successive frames until the process meets a converging criterion. This outputs Stable Potential Text Candidates (SPTCs). For each SPTC, the method obtains text representatives with the help of the edge image of the input frame. Each text representative is then divided into four quadrants, and a new Mutual Nearest Neighbor Symmetry (MNNS) is checked based on the dominant stroke width distances of the four quadrants. A voting method is finally proposed to classify each text block as either 2D or 3D by counting the text representatives that satisfy MNNS. Experimental results on classifying 2D and 3D text images are promising, and the results are further validated by text detection and recognition before and after classification with existing methods.
Keywords—Video text frames, Horizontal and vertical symmetry,
Video potential text candidates, Dominant potential text candidates, 2D
and 3D text video classification
I. Introduction
Convergence of technologies from computer graphics, computer vision, multimedia and other related fields has enabled the development of advanced types of visual media and devices, such as 3D video (3DV) and free viewpoint video (FVV), which expand the user's sensation beyond what is offered by traditional 2D video [1]. As a result, video simultaneously containing 2D and 3D texts will become quite common in the future, and 3D TV will reach everyone's home. For instance, Google Street View and iTowns generate a huge amount of images and videos that contain both 2D and 3D scene texts [2]. Many potential applications, such as traffic monitoring, geographic information systems, road navigation and scene understanding, use the videos captured by an iTown imaging vehicle with a camera fixed on its top. To locate the address of a store, the user is offered a 3D view of the location, created by suitable projection of pre-stitched image mosaics. A project like iTowns could easily generate hundreds of thousands of such mosaics in a single city. Manually annotating all these images with the visible textual information would be very time consuming and probably impractical [2]. Therefore, there is a great demand for automatic algorithms for both 2D and 3D text detection and recognition with good accuracy.
Fig. 1. Illustrating the 3D effect on recognition: (a) 2D characters from 2D video frames, (b) 3D text in frontal view from 3D video frames, (c) 3D text with its shadow caused by the 3D effect. For the characters "M", "生" and "N", the Tesseract OCR outputs "M", "生" and "N", respectively, while it fails to recognize "S" due to the extra edges caused by the 3D effect.
There are methods in the literature which work well for video or images containing 2D texts [3]. However, when a video containing both 2D and 3D texts is given as input, the performance of these methods degrades drastically [4, 5] because of the variations in edge pattern and strength. For instance, Figure 1(a) shows 2D characters chosen from 2D video, Figure 1(b) shows a 3D character from 3D video seen in frontal view, and Figure 1(c) shows a 3D character from 3D video seen in side view, where the 3D effect appears in the form of extra edges. It is observed from Figure 1 that the OCR fails to recognize the 3D character in side view. Therefore, in this work, we treat 2D text from 2D video and frontal-view 3D text from 3D video as 2D text, while side-view 3D text from 3D video is considered 3D text. There are two ways to achieve good accuracy: (1) developing a unified algorithm which works well for both 2D and 3D texts in video, or (2) classifying 2D and 3D texts in video so that separate algorithms can be developed for each. In this work, we focus on the second way, because it can make use of existing 2D text detection methods rather than requiring a unified method, which would be relatively hard to develop. However, all the 3D texts that appear in iTowns and urban videos are generally scene texts. This makes the classification problem challenging and complex, because scene text is part of the image captured by the camera and poses a virtually unlimited range of sizes, shapes and colors [3]. In contrast, graphics texts are artificially added to video frames to supplement visual or audio content. Therefore, the presence of both graphics and scene texts in a video frame brings another difficulty in classifying 2D and 3D video texts.
In the literature, there are plenty of text detection methods [6-8] based on connected component analysis, texture analysis, and edge and gradient analysis. However, these methods generally consider videos containing only 2D texts, not 3D ones. There are also several methods which use temporal information in video for text detection [9-13]. For example, Bouaziz et al. [12] proposed a similarity criterion to find text appearance based on frame differences. However, the similarity criterion requires a threshold value to identify sudden differences, and the method focuses only on 2D graphics text detection, not 3D text detection. Huang et al. [13] proposed automatic detection and localization of natural scene texts in video based on edge and stroke details. Again, such features may work well for 2D scene texts but not for 3D ones, because the latter may not provide a constant stroke width or edge density as expected. Therefore, the method is not suitable for 3D text detection in video.
Hence, in this work, we propose a novel method for 2D and 3D text classification to improve text detection and recognition accuracy. Inspired by the work in [16], which defines Mutual Nearest Neighbor Symmetry (MNNS) at the block level to classify text frames using wavelet and moment features, we further propose MNNS for classifying text representatives as 2D or 3D. Namely, if the dominant stroke width distances of the four quadrants of a text representative form clusters which satisfy MNNS, it is considered a 2D text representative; otherwise it is a 3D text representative. This is valid because we can expect such symmetry for 2D characters due to their double and parallel edges, whereas 3D characters lose it due to the extra edges and missing edges caused by perspective distortion and complex background.
II. Proposed Method
It is noted from the work on video text detection in [14] that the gradient operation on video frames is useful for increasing the contrast of text pixels. Motivated by this, we initially perform the gradient operation to enhance text pixels in this work. After text pixels are enhanced, we apply the k-means clustering algorithm with k = 3 to remove noisy pixels that have low gradient values, as they are unlikely to contribute to text. The remaining two clusters, which have higher mean values, are considered text clusters. This results in text candidates for each video frame. Due to the complex background of video, false text candidates may be misclassified as text candidates. Inspired by the observation that characters usually have double edges with a constant stroke width distance [15], we propose a novel horizontal and vertical symmetry feature based on the gradient directions and the gradient magnitudes of each text candidate. The symmetry captures two facts: double edges have parallel directions, and text candidates have a high gradient magnitude on and near edges but a low magnitude between parallel edges [14]. In other words, our horizontal symmetry uses the gradient direction and the magnitude values between edges, while the vertical symmetry uses the direction of parallel edges, which is perpendicular to the gradient direction of a text candidate. This outputs Potential Text Candidates (PTCs).
To validate the PTCs, we explore temporal redundancy in video. It is observed that texts in video usually move consistently along a particular direction while the background does not, as stated in several methods [9-13]. Inspired by this observation, we propose an iterative method that studies the neighbor information of each PTC in consecutive frames to identify stable PTCs. The reason for considering the neighbor information of PTCs is to tolerate arbitrary text movements, since video may contain arbitrary rather than uniform text movements. As a result, the iterative process yields stable PTCs by finding the PTCs which exist in consecutive frames, until the iterative process stops. Thus the iterative study of stable PTCs over consecutive frames serves two purposes: (1) it helps automatically decide the number of frames of interest out of the 30 frames per second, which is a research issue for the existing methods [9-13] that assume a fixed number such as five or ten; (2) it helps identify stable PTCs in spite of arbitrary text movements by discarding non-stable ones, which are likely non-text components. We call the output of the iterative process Stable Potential Text Candidates (SPTCs).
For each SPTC, we extract its edge components, which we call text representatives, from the Canny edge image of the input frame. Each text representative is divided into four quadrants, and the dominant stroke width distances are extracted for each quadrant. The stroke width distances are calculated according to the method in [15]. Characters generally exhibit symmetry, much like human faces, when divided into equal halves at the center point. Based on this observation, and as explored in [16], we check the Mutual Nearest Neighbor Symmetry of the dominant stroke width distances of the four quadrants to classify each text representative, and finally each text block, as 2D or 3D (Section II-D).
Fig. 2. Text candidates for a 2D text frame: (a) gray frame, (b) gradient frame, (c) output of the k-means clustering algorithm (binary).

Fig. 3. Text candidates for a 3D text frame: (a) gray frame, (b) gradient frame, (c) output of the k-means clustering algorithm (binary).
A. Text Candidates Selection
For the first video frame shown in Figure 2(a), where 2D scene texts are embedded with different orientations and graphics texts appear at the bottom of the frame, the method obtains the gradient image shown in Figure 2(b), from which we can notice that text pixels are brightened compared to the pixels in Figure 2(a). The method therefore applies the k-means clustering algorithm with k = 3 on the gradient image in Figure 2(b) to obtain the text candidates shown in Figure 2(c), where one can see that all high contrast pixels, including text pixels, are classified as text candidates. In the same way, for the video frame shown in Figure 3(a), where 3D scene texts appear on a building background with different orientations, the method obtains the gradient image shown in Figure 3(b), and the text candidates produced by the k-means clustering algorithm on that gradient image are shown in Figure 3(c). It is observed from Figure 2(c) and Figure 3(c) that the 3D texts appearing in Figure 3(c) are brighter than the 2D texts in Figure 2(c). This is due to the extra edges and the thicker strokes introduced by the 3D effect, as illustrated in Figure 1. As a result, there is no guarantee that a 3D character exhibits symmetry like a human face and provides parallel edges as 2D text does. This observation leads us to explore new features such as symmetry and the direction of parallel edges to classify 2D and 3D texts in this work.
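As a concrete illustration of this step, the sketch below computes a gradient magnitude image and clusters it with k-means (k = 3), keeping the two clusters with the higher mean magnitudes as text candidates. It is a minimal approximation of Section II-A, not the authors' implementation: the function name `text_candidates`, the NumPy gradient operator and the use of scikit-learn's KMeans are our assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def text_candidates(gray):
    """Rough sketch of text candidate selection: gradient magnitude + k-means (k = 3).

    `gray` is a 2D grayscale frame. The cluster with the lowest mean gradient
    magnitude is treated as noise/background; the other two clusters are kept
    as text candidates, as described in Section II-A.
    """
    gy, gx = np.gradient(gray.astype(np.float64))
    gm = np.hypot(gx, gy)                                  # gradient magnitude image
    labels = KMeans(n_clusters=3, n_init=10).fit_predict(gm.reshape(-1, 1))
    labels = labels.reshape(gm.shape)
    cluster_means = [gm[labels == k].mean() for k in range(3)]
    noise_cluster = int(np.argmin(cluster_means))          # low-gradient cluster
    return labels != noise_cluster                         # binary text-candidate map
```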
B. Horizontal and Vertical Symmetry for Potential Text Candidates
It is observed that the method presented in Section II-A misclassifies some false text candidates as text candidates, as shown in Figure 2(c) and Figure 3(c). Therefore, we propose a novel horizontal and vertical symmetry feature for identifying PTCs. For each text candidate, as shown in Figure 4(a), the method considers a 3×3 window and computes the mean gradient value over the window, which tolerates small distortions and text movements. It moves in both the positive and the negative gradient directions of a text candidate (P0 in Figure 4(a)) and, while moving, checks the mean gradient values as defined in equations (1) and (2) until the condition is met; we call this the horizontal symmetry. This is illustrated in Figure 4(b), where we can see the resulting points P1 and P2 for P0. The method then moves perpendicular to the gradient direction, both downwards and upwards, for P0, P1 and P2, and continues as long as the distance between P1 and P2 remains the same, as shown in Figure 4(c). Let these pixels be P0U, P0D, P1U, P1D, P2U and P2D, respectively, as shown in Figure 4(d). The method computes the standard deviation of the gradient angles of these points as defined in equation (3). If P0 satisfies equation (3), it is said to satisfy both the horizontal and the vertical symmetries, and is called a PTC. All the PTCs from the 2D text frame in Figure 2(c) and the 3D text frame in Figure 3(c) are shown in Figure 5(a) and (b), respectively, where most of the false text candidates have been removed. However, a few false PTCs still remain due to background complexity.
Fig. 4. The procedure of horizontal and vertical symmetry: (a) P0 and its 3×3 window, (b) finding P1 and P2, (c) moving perpendicular to the gradient direction, (d) finding P0U, P0D, P1U, P1D, P2U and P2D.

Fig. 5. Potential Text Candidate (PTC) selection: (a) PTCs for the 2D text in the first frame, (b) PTCs for the 3D text in the first frame.

Fig. 6. The iterative process for Stable Potential Text Candidates (SPTCs): (a) video sequence t, t+1, …, t+n, (b) frames t and t+1 as input for the first iteration, (c) dominant pixels for frames t and t+1, (d) mask M_{t+1}, (e) DP_1, (f) stable potential text candidates after meeting the converging criterion.

More formally, let $G_P$ denote the gradient of pixel $P$, $GM(P)$ its gradient magnitude, and $\theta_P \in (-\pi/2, \pi/2]$ its gradient angle. The mean gradient magnitude of the 3×3 window centered at $P_{i,j}$ is

$$V_P := \mathrm{mean}\!\left(\begin{bmatrix} GM(P_{i-1,j-1}) & GM(P_{i-1,j}) & GM(P_{i-1,j+1}) \\ GM(P_{i,j-1}) & GM(P_{i,j}) & GM(P_{i,j+1}) \\ GM(P_{i+1,j-1}) & GM(P_{i+1,j}) & GM(P_{i+1,j+1}) \end{bmatrix}\right)$$

With $Pri_{P_1} := P_1 - G_{P_1}$ and $Next_{P_1} := P_1 + G_{P_1}$, the horizontal symmetry condition along the positive gradient direction is

$$V_{P_1} > V_{Pri_{P_1}} \ \&\&\ V_{P_1} > V_{Next_{P_1}} \qquad (1)$$

and with $Pri_{P_2} := P_2 - (-G_{P_2})$ and $Next_{P_2} := P_2 + (-G_{P_2})$, the condition along the negative gradient direction is

$$V_{P_2} > V_{Pri_{P_2}} \ \&\&\ V_{P_2} > V_{Next_{P_2}} \qquad (2)$$

The combined horizontal and vertical symmetry condition for P0 is

$$\begin{cases} P_1, P_2, P_{0U}, P_{0D}, P_{1U}, P_{1D}, P_{2U}, P_{2D} \ \text{all exist} \\ Std(\theta_{P_0}, \theta_{P_{0U}}, \theta_{P_{0D}}, \theta_{P_1}, \theta_{P_{1U}}, \theta_{P_{1D}}, \theta_{P_2}, \theta_{P_{2U}}, \theta_{P_{2D}}) \in (1, 10) \end{cases} \qquad (3)$$
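To make the symmetry test concrete, the sketch below checks conditions (1)/(2) and (3) for a single candidate pixel, given a gradient magnitude image and a gradient angle image in degrees. It is a simplified illustration under our own naming (`mean_gm`, `horizontal_symmetry_ok`, `vertical_symmetry_ok`); the search for P1, P2 and the up/down neighbours is assumed to have been done already.

```python
import numpy as np

def mean_gm(gm, r, c):
    """V_P: mean gradient magnitude over the 3x3 window centred at (r, c)."""
    h, w = gm.shape
    return gm[max(r - 1, 0):min(r + 2, h), max(c - 1, 0):min(c + 2, w)].mean()

def horizontal_symmetry_ok(gm, p, step):
    """Conditions (1)/(2): V at p must exceed V one step behind and one step
    ahead along the (integer) gradient step `step` = (dr, dc)."""
    (r, c), (dr, dc) = p, step
    return (mean_gm(gm, r, c) > mean_gm(gm, r - dr, c - dc)
            and mean_gm(gm, r, c) > mean_gm(gm, r + dr, c + dc))

def vertical_symmetry_ok(theta_deg, points, lo=1.0, hi=10.0):
    """Condition (3): all nine points P0, P0U, P0D, P1, P1U, P1D, P2, P2U, P2D
    must lie inside the frame, and the standard deviation of their gradient
    angles (in degrees) must fall in the open interval (lo, hi)."""
    h, w = theta_deg.shape
    if any(not (0 <= r < h and 0 <= c < w) for r, c in points):
        return False
    angles = np.array([theta_deg[r, c] for r, c in points], dtype=float)
    return lo < angles.std() < hi
```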
C. Temporal Redundancy for Stable Potential Text Candidate Selection
It is noted from the results of the previous section that false PTCs still exist, as shown in Figure 5(a) and (b). To validate the PTCs, the method proposes an iterative process which explores temporal redundancy in video to identify Stable Potential Text Candidates (SPTCs). Let t, t+1, …, t+n be the video sequence, as shown in Figure 6(a); here n corresponds to the 30 frames per second of the video. Initially, the method considers the first two consecutive frames, say t and t+1, as shown in Figure 6(b), and finds the corresponding PTCs as shown in Figure 6(c). For frame t+1, the method merges all the PTCs within an 11×11 pixel window centered at each PTC through a mask operation M_{t+1}. The result can be seen in Figure 6(d). The method then removes all the PTCs of frame t which are not covered by the mask, as shown in Figure 6(e), where the remaining PTCs after elimination are drawn. It is observed that the number of PTCs in Figure 6(e) is smaller than the number of PTCs in Figure 6(d). Let the result of this filtering be DP_1. Similarly, in the second iteration, the method obtains the PTCs of frame t+2 and applies the same mask operation M_{t+2} to filter the PTCs in DP_1, which leads to DP_2.
This process essentially finds the stable PTCs that are present in all consecutive frames until the iterative process terminates. In this way, the method continues the iterative process to filter out unstable PTCs until the converging criterion is met. The final stable PTCs can be seen in Figure 6(f); this is nothing but obtaining DP_final through the iterative process. To define the terminating condition, we estimate the proximity matrix, which contains the distances between the PTCs in DP_1, DP_2, …, DP_final, as defined in equation (4). This is valid because PTCs in text areas are more stable across consecutive frames than non-text PTCs. It is observed that as the iterations proceed, the standard deviation of the proximity matrix of DP_i decreases, and after a certain number of iterations the standard deviations of the previous and the current iteration become almost equal, as defined in equation (5). Figure 7 shows that after the 9th iteration the curve becomes flat from iteration 10 to iteration 12; this is the terminating point defined in equation (5), because the unstable PTCs have been eliminated over the iterations. The outputs are called Stable Potential Text Candidates (SPTCs). The same procedure is used for 3D text video to obtain its SPTCs.
Let $DP_i = \{P_1, P_2, \ldots, P_K\}$ with $K = \|DP_i\|$. The proximity matrix of the i-th iteration is

$$PM_i = \begin{bmatrix} Dist(P_1, P_1) & \cdots & Dist(P_1, P_K) \\ \vdots & \ddots & \vdots \\ Dist(P_K, P_1) & \cdots & Dist(P_K, P_K) \end{bmatrix} \qquad (4)$$

and the converging criterion is

$$\| Std(PM_i) - Std(PM_{i-1}) \| < 0.2 \qquad (5)$$
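The iterative filtering of this section can be sketched as follows, assuming the PTCs of each frame are available as lists of integer (row, column) coordinates. The helper names (`mask_from_points`, `proximity_std`, `stable_ptcs`), the 11×11 window (half-width 5) and the 0.2 threshold follow the description and equations (4)-(5) above; everything else is our own simplification, not the authors' code.

```python
import numpy as np

def mask_from_points(points, shape, half=5):
    """Binary mask M_{t+1}: an 11x11 window (half = 5) centred at every PTC."""
    mask = np.zeros(shape, dtype=bool)
    h, w = shape
    for r, c in points:
        mask[max(r - half, 0):min(r + half + 1, h),
             max(c - half, 0):min(c + half + 1, w)] = True
    return mask

def proximity_std(points):
    """Standard deviation of the pairwise-distance (proximity) matrix, equation (4)."""
    pts = np.asarray(points, dtype=float)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    return d.std()

def stable_ptcs(ptcs_per_frame, frame_shape, eps=0.2):
    """Intersect the running PTC set DP_i with the mask built from the next
    frame's PTCs; stop when the std of the proximity matrix changes by less
    than eps between iterations (equation (5))."""
    dp = list(ptcs_per_frame[0])
    prev_std = proximity_std(dp) if len(dp) > 1 else 0.0
    for frame_pts in ptcs_per_frame[1:]:
        m = mask_from_points(frame_pts, frame_shape)
        dp = [(r, c) for r, c in dp if m[r, c]]
        if len(dp) < 2:
            break
        cur_std = proximity_std(dp)
        if abs(cur_std - prev_std) < eps:
            break                      # converging criterion met
        prev_std = cur_std
    return dp
```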
Fig. 7. Terminating condition for the iterative process (number of iterations vs. standard deviation of PM_i).
D. 2D and 3D Text Block Classification
For each SPTC in Figure 6(f), the method extracts edge components from the Canny edge image of the input frame to study the behavior of the SPTC, which results in Text Representatives (TPs). Figure 8(a) shows the TPs obtained from the Canny edge image in Figure 8(b). Similarly, for the 3D text frame, the method obtains the TPs shown in Figure 8(c) from the Canny edge image in Figure 8(d). We use a grouping criterion based on the nearest neighbor technique to merge all the TPs by referring to the Canny edge image of the input frame. Components that have fewer than three pixels are eliminated because they do not contribute to text. More details on boundary growing and merging can be found in [14]. The output of this step is the text block segmentation. In this way, the method segments text lines of both graphics and scene texts, irrespective of whether they are 2D or 3D.
For each segmented text block, the method studies the behavior of each TP in both 2D and 3D text blocks to classify the block as either 2D or 3D using stroke width distances. The method divides the whole TP into four quadrants (top left, bottom left, top right, bottom right) at the center of the component, as shown in Figure 9 (see the center marked with red lines). Then, for each quadrant of the TP, the method finds stroke width distances by traversing perpendicular to the stroke direction, as suggested in [15], in the Canny edge image of the input frame. The method performs a histogram operation on the stroke width distances of each quadrant, as shown in Figure 9, and chooses the dominant stroke width distance of each histogram (the values shown in the figure). As discussed in the proposed methodology section, a character from a 2D text frame should satisfy symmetry like a human face, whereas a 3D character does not always. To extract this property, we propose Mutual Nearest Neighbor Symmetry (MNNS) to classify the TPs as representing 2D or 3D text. The MNNS procedure first calculates the Maximum and the Minimum dominant stroke width distances, as defined in equations (6) and (7), respectively. Then it compares the remaining two distances with the Maximum and the Minimum, assigning each to the Maximum cluster if it is closer to the Maximum distance and to the Minimum cluster otherwise, as defined in equation (8). If this results in two clusters containing an equal number of distance values, the TP satisfies MNNS and is considered a 2D TP; otherwise it is a 3D TP. To classify the whole text block, we use a voting criterion that counts the number of 2D TPs and 3D TPs in the text block. If the count of 2D TPs is larger than the count of 3D TPs, the method considers the block a 2D text block; otherwise it is a 3D text block. Figure 10 shows the TPs that satisfy MNNS (red components) for a 2D text image and for a 3D text image, respectively. It is noticed from Figure 10 that the number of TPs satisfying MNNS is larger for 2D text images and smaller for 3D text images. Therefore, the two text blocks in the first image are classified as 2D and the three text blocks in the second image are classified as 3D. The main advantage of this method is that it can classify 2D and 3D texts even when a single frame contains both 2D and 3D text blocks.

Fig. 8. Text representatives and Canny edge images for the 2D and 3D frames: (a) text representatives of the 2D frame, (b) its Canny edge image, (c) text representatives of the 3D frame, (d) its Canny edge image.

Fig. 9. Histograms of stroke width distances for the four quadrants, with the dominant distance of each histogram marked (4, 4, 7 and 7 in this example).

$$DSW_{max} = \max\{\,dsw(P_i) \mid i = 1,2,3,4\,\} \qquad (6)$$

$$DSW_{min} = \min\{\,dsw(P_i) \mid i = 1,2,3,4\,\} \qquad (7)$$

For each $P_i$ ($i = 1,2,3,4$):

$$\begin{cases} \text{if } \|dsw(P_i) - DSW_{min}\| \ge \|dsw(P_i) - DSW_{max}\|, & \text{add } dsw(P_i) \text{ to } Cluster_{max} \\ \text{if } \|dsw(P_i) - DSW_{min}\| < \|dsw(P_i) - DSW_{max}\|, & \text{add } dsw(P_i) \text{ to } Cluster_{min} \end{cases} \qquad (8)$$

Fig. 10. 2D and 3D text block classification.
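Equations (6)-(8) and the voting rule can be summarized in a few lines. This is an illustrative sketch assuming the four dominant stroke width distances of a text representative are already available (the names `is_2d_representative` and `classify_block` are ours); ties in the vote are resolved here in favour of 3D, which the paper does not specify.

```python
def is_2d_representative(dsw):
    """MNNS check on the four dominant stroke width distances `dsw` (one per
    quadrant), following equations (6)-(8): each distance joins the cluster of
    the closer extreme; the representative is taken as 2D only when the max-
    and min-clusters end up with an equal number of values."""
    dsw_max, dsw_min = max(dsw), min(dsw)
    cluster_max = [d for d in dsw if abs(d - dsw_min) >= abs(d - dsw_max)]
    cluster_min = [d for d in dsw if abs(d - dsw_min) < abs(d - dsw_max)]
    return len(cluster_max) == len(cluster_min)

def classify_block(dsw_per_tp):
    """Voting over all text representatives (TPs) in a text block:
    more 2D votes than 3D votes => the block is classified as 2D."""
    votes_2d = sum(is_2d_representative(d) for d in dsw_per_tp)
    return "2D" if votes_2d > len(dsw_per_tp) - votes_2d else "3D"

# Example: quadrant distances {4, 4, 7, 7} split into two equal clusters -> 2D.
print(is_2d_representative([4, 4, 7, 7]))   # True
print(is_2d_representative([4, 9, 9, 9]))   # False (clusters of size 3 and 1)
```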
III. Experimental Results
As this is the first work on the classification of 2D and 3D texts, there is no benchmark or standard dataset for evaluating the proposed method. Therefore, we create a video dataset comprising 500 video clips, which includes 200 3D text video clips and 300 2D text video clips captured with our own video camera at different places such as urban scenes, shops, markets and buildings. The dataset contains texts of different orientations, scripts, fonts, font sizes, etc. Each video clip lasts less than ten seconds. To evaluate the method, we consider the standard measures, namely recall, precision and F-measure for text line segmentation, classification rate for the classification of 2D and 3D text frames, and character recognition rate for the recognition results. To show the effectiveness of the method, we implement two existing methods [12, 13] which use temporal redundancy, edges and stroke information. Similarly, to validate the classification in terms of recognition rate, we implement three baseline binarization methods: the Niblack [17] and Sauvola [18] methods, which use thresholds for binarizing images, and a more recently developed method [19] for video text binarization based on the Wavelet-Gradient Fusion (WGF) criterion.
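For completeness, the detection measures used in Tables I and III can be computed as below; this is the standard definition, with `tp`, `fp`, `fn` denoting true positives, false positives and false negatives at the text block level (our notation, not the authors').

```python
def recall_precision_f(tp, fp, fn):
    """Standard recall, precision and F-measure for text block detection."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_measure = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, precision, f_measure
```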
Fig. 11. Sample results of the proposed method, Huang et al. [13] and Bouaziz et al. [12].
A. Experiments for Text Block Segmentation
Sample qualitative results of the proposed and the existing methods [12, 13] for text block segmentation are shown in Figure 11, where the first row shows the input frames containing 3D text, 2D text and 2D text in Chinese script. The second row shows the results of the proposed method, which successfully detects almost all the texts in the input frames. The third and the fourth rows show the results of the Huang et al. [13] and Bouaziz et al. [12] methods, respectively. The existing methods detect 2D texts well but fail to detect 3D texts. The main reason for their poor accuracy is that these methods were developed for 2D text detection rather than 3D text detection, and the features they use are sensitive to 3D texts. The quantitative results of the proposed and existing methods are reported in Table I, where both existing methods give poor accuracies compared to the proposed method in terms of recall, precision and F-measure. Therefore, it can be concluded that the proposed method outperforms the existing methods for text line segmentation.
TABLE I. TEXT BLOCK DETECTION RESULTS OF THE PROPOSED AND EXISTING METHODS (IN %) ON BOTH 2D + 3D VIDEOS

Method               R     P     F
Proposed method      86.0  83.0  84.5
Huang et al. [13]    57.0  51.5  54.0
Bouaziz et al. [12]  50.0  35.0  41.0
Fig. 12. Sample 2D and 3D text blocks from our database: (a) sample 2D text lines classified by the proposed method, (b) sample 3D text lines classified by the proposed method.
B. Experiments for Classification of 2D and 3D Text Blocks
Sample 2D and 3D text blocks that are successfully classified by the proposed method are shown in Figure 12(a) and (b), respectively. It is observed from Figure 12 that the proposed method works well for different types of texts and different scripts. The quantitative results of the proposed method are reported in Table II, where the confusion matrix shows promising results for 2D and 3D text classification. We can also see from Figure 12 that the proposed method correctly classifies both graphics texts (most likely 2D) and scene texts (either 2D or 3D), even though the text lines suffer from varying illumination, orientation, fonts and contrast.
TABLE II. CONFUSION MATRIX OF THE CLASSIFICATION METHOD FOR 2D AND 3D TEXT BLOCKS (IN %)

Type      2D text  3D text
2D text   85.5     14.5
3D text   21.0     79.0
C. Validation of Classification by Text Blocks
To validate the effectiveness of the proposed classification method, we compute recall, precision and F-measure after classification, that is, 2D text frames and 3D text frames are given separately as input to the existing and the proposed methods. Table I gives the accuracies before classifying 2D and 3D texts, which reflect the overall performance of the existing and the proposed methods. Table III shows that the existing methods give better accuracies for 2D texts but low accuracies for 3D texts. On the other hand, the proposed method gives better accuracies for both 2D and 3D texts compared to the existing methods. When we compare the accuracies of 2D and 3D texts for the proposed method, we find that the accuracy for 3D texts is lower than that for 2D texts. This is because of the loss of information during classification. Therefore, we can assert that the proposed classification method makes a difference in improving the accuracy of text detection when the data is a mix of 2D and 3D texts.
TABLE III. TEXT BLOCK DETECTION RESULTS OF THE PROPOSED AND EXISTING METHODS (IN %) AFTER CLASSIFICATION

                     2D Text video         3D Text video
Method               R     P     F         R     P     F
Proposed method      89.0  84.0  86.5      81.5  82.0  82.0
Huang et al. [13]    65.0  58.5  61.5      42.0  39.0  41.0
Bouaziz et al. [12]  45.0  56.0  50.0      38.0  31.0  34.0
D. Validation of Classification by Recognition
To assess the effectiveness of the proposed classification method in terms of recognition rates before and after classification, we compare two baseline thresholding binarization methods [17, 18] and the recent video text binarization method [19]. Each method binarizes the images and passes them to Tesseract (the publicly available Google OCR) [20] to calculate the character recognition rate. The results are reported in Table IV, from which one can notice that all three binarization methods give poor accuracies for 3D text after classification compared to 2D text; the results before classification are also higher than those obtained for 3D text after classification. The reason for the poor accuracies is that the methods were developed for 2D text binarization, not for 3D text. In addition, the methods expect high contrast text images rather than video frames. Another reason may be the inherent limitations of the OCR, which accepts only particular fonts, sizes and clearly shaped characters. Hence, classification is necessary to improve the accuracy.
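As an example of the kind of baseline used here, the sketch below applies a Niblack-style local threshold [17] and could then pass the result to Tesseract [20]. It is a hedged illustration, not the exact setup of the paper: the window size, the value of k, the assumed bright-text polarity and the optional pytesseract call are our choices.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def niblack_binarize(gray, window=25, k=-0.2):
    """Niblack-style local thresholding [17]: T = local mean + k * local std.

    Assumes bright text on a darker background; flip the comparison for the
    opposite polarity. Window size and k are illustrative values only."""
    g = gray.astype(np.float64)
    mean = uniform_filter(g, window)
    sq_mean = uniform_filter(g * g, window)
    std = np.sqrt(np.maximum(sq_mean - mean * mean, 0.0))
    return g > mean + k * std

# The binary map could then be fed to Tesseract [20], e.g. via pytesseract:
# text = pytesseract.image_to_string((niblack_binarize(frame) * 255).astype("uint8"))
```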
TABLE IV. CHARACTER RECOGNITION RATES OF THE BINARIZATION METHODS BEFORE AND AFTER CLASSIFICATION (IN %)

                Before classification   After classification
Method          2D + 3D                 2D      3D
WGF [19]        56.5                    75.5    34.0
Niblack [17]    37.0                    50.5    22.0
Sauvola [18]    12.5                    19.0    5.0
IV. Conclusion and Future Work

In this paper, we propose a novel method for the classification of 2D and 3D text blocks. The method identifies text candidates with the help of the k-means clustering algorithm on gradient images. Then a horizontal and vertical symmetry feature based on the gradient directions and gradient magnitudes of the text candidates is used to identify potential text candidates. The potential text candidates are validated by an iterative method which uses temporal redundancy and the spatial proximity of the potential text candidates to identify stable potential text candidates. For the stable potential text candidates, the method proposes a new mutual nearest neighbor symmetry to identify 2D and 3D text components, and a voting method is used to classify 2D and 3D texts in frames. We evaluate text line segmentation, text detection and recognition before and after classification against the results of existing methods. Since the proposed method detects text regions regardless of script type, we plan to develop a method for identifying the scripts of 2D and 3D texts in video in the future.

References
[1] A. Smolic, "3D video and free viewpoint video – from capture to display", Pattern Recognition, 2011, pp. 1958-1968.
[2] R. Minetto, N. Thome, M. Cord, N. J. Leite and J. Stolfi, "SnooperText: A text detection system for automatic indexing of urban scenes", Computer Vision and Image Understanding, 2013 (article in press).
[3] R. Minetto, N. Thome, M. Cord, N. J. Leite and J. Stolfi, "Snoopertrack: Text Detection and Tracking for Outdoor Videos", In Proc. ICIP, 2011, pp. 505-508.
[4] R. Minetto, N. Thome, M. Cord, N. J. Leite and J. Stolfi, "T-HOG: An effective gradient-based descriptor for single line text regions", Pattern Recognition, 2013, pp. 1078-1090.
[5] C. M. Gracia, M. Mirmehdi, J. Sigut and J. L. G. Mora, "Fast perspective recovery of text in natural scenes", Image and Vision Computing, 2013, pp. 714-724.
[6] D. Doermann, J. Liang and H. Li, "Progress in Camera-Based Document Image Analysis", In Proc. ICDAR, 2003, pp. 606-616.
[7] J. Zang and R. Kasturi, "Extraction of Text Objects in Video Documents: Recent Progress", In Proc. DAS, 2008, pp. 5-17.
[8] M. R. Lyu, J. Song and M. Cai, "A Comprehensive Method for Multilingual Video Text Detection, Localization, and Extraction", IEEE Trans. CSVT, 2005, pp. 243-255.
[9] R. Wang, W. Jin and L. Wu, "A Novel Video Caption Detection Approach using Multi-Frame Integration", In Proc. ICPR, 2004.
[10] H. Li, D. Doermann and O. Kia, "Automatic Text Detection and Tracking in Digital Video", IEEE Trans. IP, 2000, pp. 147-156.
[11] T. Q. Phan, P. Shivakumara, T. Lu and C. L. Tan, "Recognition of Video Text Through Temporal Integration", In Proc. ICDAR, 2013, pp. 589-593.
[12] B. Bouaziz, T. Zlitni and W. Mahdi, "AViTExt: Automatic Video Text Extraction", CoRR abs/1301.2173, 2013.
[13] X. Huang and H. Ma, "Automatic Detection and Localization of Natural Scene Text in Video", In Proc. ICPR, 2010, pp. 3216-3219.
[14] P. Shivakumara, T. Q. Phan, S. Lu and C. L. Tan, "Gradient Vector Flow and Grouping based Method for Arbitrarily-Oriented Scene Text Detection in Video Images", IEEE Trans. CSVT, 2013, pp. 1729-1739.
[15] B. Epshtein, E. Ofek and Y. Wexler, "Detecting text in natural scenes with stroke width transform", In Proc. CVPR, 2010, pp. 2963-2970.
[16] P. Shivakumara, A. Dutta, T. Q. Phan, C. L. Tan and U. Pal, "A Novel Mutual Nearest Neighbor based Symmetry for Text Frame Classification in Video", Pattern Recognition, 2011, pp. 1671-1683.
[17] W. Niblack, "An Introduction to Digital Image Processing", Prentice Hall, Englewood Cliffs, 1986.
[18] J. Sauvola, T. Seppanen, S. Haapakoski and M. Pietikainen, "Adaptive Document Binarization", In Proc. ICDAR, 1997, pp. 147-152.
[19] S. Roy, P. Shivakumara, P. Roy and C. L. Tan, "Wavelet-Gradient-Fusion for Video Text Binarization", In Proc. ICPR, 2012, pp. 3300-3303.
[20] Tesseract. http://code.google.com/p/tesseract-ocr/