Gradient Vector Flow and Grouping Based Method for Arbitrarily-Oriented Scene Text Detection in Video Images

Palaiahnakote Shivakumara, Trung Quy Phan, Shijian Lu and Chew Lim Tan, Senior Member, IEEE

Abstract—Text detection in videos is challenging due to the low resolution and complex background of videos. Besides, the arbitrary orientation of scene text lines in video makes the problem more complex and challenging. This paper presents a new method that extracts text lines of any orientation based on Gradient Vector Flow (GVF) and neighbor component grouping. The GVF of edge pixels in the Sobel edge map of the input frame is explored to identify the dominant edge pixels which represent text components. The method extracts the edge components corresponding to dominant pixels in the Sobel edge map, which we call Text Candidates (TC) of the text lines. We propose two grouping schemes. The first finds nearest neighbors based on geometrical properties of the TC to group broken segments and neighboring characters, which results in word patches. The end and junction points of the skeleton of the word patches are considered to eliminate false positives, which outputs the Candidate Text Components (CTC). The second is based on the direction and the size of the CTC to extract neighboring CTC and to restore missing CTC, which enables arbitrarily-oriented text line detection in video frames. Experimental results on different datasets, including arbitrarily-oriented text data, non-horizontal and horizontal text data, Hua's data and ICDAR-03 data (camera images), show that the proposed method outperforms existing methods in terms of recall, precision and F-measure.

Index Terms—Gradient vector flow, Dominant text pixel, Text candidates, Text components, Candidate text components, Arbitrarily-oriented text detection.

I. INTRODUCTION

Text detection and recognition is a hot topic for researchers in the fields of image processing, pattern recognition and multimedia. It draws the attention of the Content Based Image Retrieval (CBIR) community in order to fill the semantic gap between low level and high level features to some extent if text is available in the video [1-4]. In addition, text detection and recognition can be used to retrieve exciting and semantic events from sports video [5-7]. Therefore, text detection and extraction is essential to improve the performance of retrieval systems in real world applications.

This research is supported in part by the A*STAR grant 092 101 0051 (WBS no. R252-000-402-305). P. Shivakumara is with the Multimedia Unit, Department of Computer Systems and Information Technology, University of Malaya, Kuala Lumpur, 50603, Malaysia, Telephone: +60 03 7967 2505 (E-mail: hudempsk@yahoo.com). T. Q. Phan is with the Department of Computer Science, School of Computing, National University of Singapore, Computing 1, 13 Computing Drive, Singapore 117417 (E-mail: phanquyt@comp.nus.edu.sg). S. Lu is with the Department of Computer Vision and Image Understanding, Institute for Infocomm Research (I2R), Singapore (E-mail: slu@i2r.astar.edu.sg). C. L. Tan is with the Department of Computer Science, School of Computing, National University of Singapore, Computing 1, 13 Computing Drive, Singapore 117417 (E-mail: tancl@comp.nus.edu.sg). Copyright (c) 2013 IEEE. Personal use of this material is permitted.
However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to pubs-permissions@ieee.org.

Video contains two types of text: scene text and graphics text. Scene text is part of the image captured by the camera. Examples of scene text include street signs, billboards, text on trucks and writing on shirts. Therefore, the nature of scene text is unpredictable compared to graphics text, which is more structured and closely related to the subject. However, scene text can be used to uniquely identify objects in sports events, to navigate Google Maps and to assist visually impaired people. Since the nature of scene text is unpredictable, it poses many challenges. Among these, arbitrary orientation is particularly challenging, as it is not as easy as processing straight text lines. Several methods have been developed for text detection and extraction that achieve reasonable accuracy for natural scene text (camera images) [8-13] as well as multi-oriented text [11]. However, it is noted that most of the methods use a classifier and a large number of training samples to improve the text detection accuracy. To tackle the multi-orientation problem, the methods use connected component analysis. For instance, the stroke width transform based method for text detection in scene images by Epshtein et al. [8] works well for connected components which preserve shapes. Pan et al. [9] also proposed a hybrid approach for text detection in natural scene images based on a conditional random field. The conditional random field involves connected component analysis to label the text candidates. Since the images are high contrast images, the connected component based features with classifier training work well for achieving better accuracy. However, the same methods cannot be used directly for text detection in video because of low contrast and complex background, which cause disconnections, loss of shapes, etc. In this case, deciding on a classifier and the geometrical features of the components is not easy. Thus, these methods are not suitable for video text detection.

Plenty of methods have been proposed over the last decade for text detection in video based on connected components [14-15], texture [16-19], and edge and gradient [20-25]. Connected component based methods are good for caption text and uniform color text, but not for text lines with multiple color characters or text on cluttered background. Texture based methods consider the appearance of text as a special texture. These methods handle complex background to some extent, but at the cost of computation due to the large number of features and the large number of training samples required for classification of text and non-text pixels. Therefore, the performance of these methods depends on the classifier in use and the number of training samples chosen for text and non-text. Edge and texture features without a classifier are proposed by Liu et al. [26] for text detection, but the method uses a large number of features to discriminate text and non-text pixels. A set of texture features without a classifier is also proposed by Shivakumara et al. [27, 28] for accurate text detection in video frames. Though these methods work well for different varieties of frames, they require more processing time due to the large number of features. In addition, the scope of the methods is limited to horizontal text.
Similarly, the combination of edge and gradient features gives better text detection accuracy and efficiency than texture based methods. For example, text detection using gradient and statistical analysis of intensity values is proposed by Wong and Chen [21]. This method suffers from grouping of text and non-text components. Colour information is also used along with edge information for text detection by Cai et al. [22]. This method works well for caption text, but its performance degrades when the font size varies. In general, edge and gradient based methods produce more false positives due to the heuristics that are used for text and non-text pixel classification.

To the best of our knowledge, none of the methods discussed above addresses arbitrarily-oriented text detection in video properly. The reason is that arbitrarily-oriented text generally comes from scene text, which poses many problems compared to graphics text. Zhou et al. [29] have proposed a method for detecting both horizontal and vertical text lines in video using multiple stage verification and effective connected component analysis. This method is good for caption text but not for other text, and the orientation is limited to horizontal and vertical only. Shivakumara et al. [30] have addressed this multi-oriented issue based on the Laplacian and skeletonization methods. This method gives low accuracy because the skeleton based method is not good enough to classify simple and complex components when a cluttered background is present. In addition, the method is computationally expensive. Recently, a method [31] based on a Bayesian classifier and boundary growing was proposed to improve accuracy for multi-oriented text detection in video. However, the boundary growing method used in this work performs well only when sufficient space is present between the text lines; otherwise it considers non-text as text components. Therefore, the method considers only non-horizontal straight text lines instead of arbitrarily oriented ones, where the space between the text lines is often limited. Arbitrary text detection is proposed in [32] using gradient directional features and region growing. This method requires classification of horizontal and non-horizontal text images, and when the image contains multi-oriented text it fails to classify them. Therefore, it is not effective for arbitrary text detection. Thus, arbitrarily-oriented text detection in video is still considered a challenging and interesting problem.

Hence, in this paper, we propose the use of gradient vector flow for identifying text components in a novel way. The work presented in [33] for identifying object boundaries using Gradient Vector Flow (GVF), which has the ability to move into concave boundaries without sacrificing boundary pixels, motivated us to propose a GVF based method for arbitrary text detection in this work. This property helps in detecting both high and low contrast text pixels, unlike the gradient used in [32], which detects only high contrast text pixels; this is essential for improving the accuracy of video text detection at any orientation.

II. PROPOSED METHODOLOGY

In this work, we explore GVF for identifying dominant text pixels using the Sobel edge map of the input image for arbitrary text detection in video.
We prefer Sobel to other edge operators such as Canny because Sobel gives fine details for text and less detail for non-text, while Canny gives many erratic edges for the background along with the fine details of text. Next, the edge components in the Sobel edge map corresponding to dominant pixels are extracted, and we call them Text Candidates (TC). This operation gives representatives for each text line. To tackle arbitrary orientation, we propose a new two-stage grouping criterion for the TC. The first stage grows the perimeter of each TC to identify the nearest neighbor, based on the size and angle of the TC, to group them, which gives text components. Before proceeding to the second stage of grouping, we introduce a skeleton concept on the text components given by the first stage to eliminate false text components based on junction points. We name this output the Candidate Text Components (CTC). In the second stage, we use the tails of the CTC to identify the direction of the text information, and the method grows along the identified direction to find the nearest neighbor CTC, which outputs the final result of arbitrarily-oriented text detection in video. To the best of our knowledge, this is the first work addressing the issue of arbitrarily-oriented text detection in video with promising accuracy using GVF information.

A. GVF for Dominant Text Pixel Selection

The Gradient Vector Flow (GVF) is the vector field that minimizes the energy functional defined in equation (1) [33]:

E = ∬ μ(u_x² + u_y² + v_x² + v_y²) + |∇f|² |V − ∇f|² dx dy    (1)

where V(x, y) = (u(x, y), v(x, y)) is the GVF field and f(x, y) is the edge map of the input image. GVF has been used in [33] for object boundary detection, where it is shown that GVF is better than the traditional gradient and the snake. It is also noted from [33] that there are two problems with the traditional gradient operation: (1) the gradient vectors generally have large magnitudes only in the immediate vicinity of the edges, and (2) in homogeneous regions, where pixel values are nearly constant, ∇f is nearly zero. GVF is an extension of the gradient which extends the gradient map farther away from the edges and into homogeneous regions using a computational diffusion process. The inherent competition of the diffusion process creates vectors that point into boundary concavities. This is a special property of GVF. In summary, GVF helps to propagate gradient information, i.e. the magnitude and the direction, into homogeneous regions. In other words, GVF helps in detecting multiple forces at corner points of object contours. This cue allows us to use multiple forces at the corner points of edge components in the Sobel edge map of the input video text frame to identify them as dominant pixels. This dominant pixel selection removes most of the background information, which simplifies the problem of classifying text and non-text pixels, and retains text information irrespective of the orientation of the text in the video. This is the great advantage of dominant pixel selection by GVF information. It is illustrated in Figure 1, where (a) is the input and (b) is the GVF for all pixels in the image in Figure 1(a). It is observed from Figure 1(b) that forces are dense at the corners of contours and at the curved boundaries of text components, as text components are in general more cursive than non-text components. Therefore, for each pixel, we count how many forces point towards it (based on the GVF arrows).
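Before turning to the classification rule, the following is a minimal sketch of the GVF diffusion of [33] applied to a Sobel edge map, written in Python with NumPy/SciPy. The helper name gvf and the parameter values (mu, the number of iterations and the time step) are illustrative assumptions, not the exact settings used in our experiments.

```python
import numpy as np
from scipy import ndimage

def gvf(edge_map, mu=0.2, iters=80, dt=0.5):
    """Diffuse the gradient of an edge map into a GVF field (u, v),
    following the energy functional of equation (1)."""
    f = edge_map.astype(np.float64)
    f = (f - f.min()) / (f.max() - f.min() + 1e-12)   # normalise edge map to [0, 1]
    fx = ndimage.sobel(f, axis=1) / 8.0               # gradient of f along x (columns)
    fy = ndimage.sobel(f, axis=0) / 8.0               # gradient of f along y (rows)
    b = fx ** 2 + fy ** 2                             # squared gradient magnitude |grad f|^2
    u, v = fx.copy(), fy.copy()                       # initialise the field with grad f
    for _ in range(iters):
        # gradient-descent update: u_t = mu * laplacian(u) - |grad f|^2 (u - f_x); same for v
        u += dt * (mu * ndimage.laplace(u) - b * (u - fx))
        v += dt * (mu * ndimage.laplace(v) - b * (v - fy))
    return u, v

if __name__ == "__main__":
    # toy example: a bright square on a dark background
    img = np.zeros((64, 64))
    img[20:44, 20:44] = 1.0
    edge = np.hypot(ndimage.sobel(img, axis=1), ndimage.sobel(img, axis=0))
    u, v = gvf(edge)
    print(u.shape, v.shape)   # (64, 64) (64, 64)
```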
A pixel is classified as a "dominant text pixel" if it attracts at least four GVF forces. The threshold of four was determined by an experiment that counted between one and five GVF forces over 100 test samples randomly selected from our database; the quantitative results are reported in Table 1. Table 1 shows that for 2 GVF the F-measure is low and the misdetection rate is high compared to 3 GVF, because more non-text (background) pixels are represented by 2 GVF, and for 3 GVF the F-measure is low and the misdetection rate is high compared to 4 GVF for the same reason. On the other hand, for 4 GVF the F-measure is high and the misdetection rate is low compared to 5 GVF. This shows that 5 GVF loses text pixels, which increases the misdetection rate. It is also observed from Table 1 that 5 GVF gives high precision and low recall compared to 4 GVF. This indicates that 5 GVF loses dominant pixels which represent true text pixels as well as non-text pixels. Therefore, it is inferred that 4 GVF is better than the other settings for identifying dominant text pixels, which represent true text pixels and few non-text pixels. In addition, at this stage, our objective in proposing 4 GVF for dominant pixel selection is to remove as many non-text pixels as possible, even though it eliminates a few dominant pixels which represent text pixels, because the proposed grouping (presented in Sections II.C and II.D) has the ability to restore missing text information. Therefore, losing a few dominant text pixels for characters in a text line does not affect the overall performance of the method much. The dominant text pixel selection is illustrated in Figure 1(c) for the frame shown in Figure 1(a). Figure 1(c) shows that dominant text pixel selection removes almost all non-text components. Figure 1(d) shows the dominant text pixels overlaid on the input frame. One can notice from Figure 1(d) that each text component has a dominant pixel. In this way, dominant text pixel selection facilitates arbitrarily-oriented text detection.

Figure 1. Dominant point selection based on GVF: (a) input, (b) GVF, (c) dominant text pixels, (d) dominant pixels on the input frame.

As an example, we choose the character image "a" from the input frame shown in Figure 1(a). This is reproduced in Figure 2(a) to illustrate how GVF information helps in selecting dominant text pixels. To show the GVF arrows for the character image in Figure 2(a), we compute the Sobel edge map as shown in Figure 2(b) and the GVF arrows on the Sobel edge map as shown in Figure 2(c). From Figure 2(c), it is clear that all the GVF arrows point towards the inner contour of the character image "a". This is because of the low contrast in the background and the high contrast at the inner boundary of the character image "a". Thus, from Figure 2(d), we observe that corner points and cursive text pixels on the contour attract more GVF arrows than non-corner points and non-text pixels.
For instance, for a text pixel on the inner contour of the character "a" shown in Figure 2(a), the GVF corresponding to this pixel is marked by the oval in the middle of Figure 2(d). The oval area shows that a greater number of GVF forces point towards that text pixel. Similarly, for a non-text pixel at the top left corner of the character "a" in Figure 2(a), the corresponding GVF, marked by the top left oval in Figure 2(d), shows that a smaller number of GVF forces point towards that pixel.

Figure 2. Magnified GVF for corner and non-corner pixels marked by oval shapes: (a) character chosen from Figure 1(a), (b) Sobel edge map, (c) GVF overlaid on the Sobel edge map, (d) GVF for the character image shown in (a).

For the same two text and non-text pixels, we show the GVF arrows in their 3 × 3 neighborhoods. Darker arrows shown in Figure 3(a) and (b) are those that point to the middle pixel (the pixel of interest); lighter arrows are those that are attracted elsewhere. In Figure 3(a), the middle pixel attracts four arrows, hence it is classified as a corner point (dominant text pixel), while the pixel shown in Figure 3(b) attracts only one arrow and is classified as a non-text pixel. We also test pixels that attract two and three GVF arrows, as shown in Figure 3(c)-(d) and Figure 3(e)-(f), respectively. One can see that the Dominant Pixels (DP) shown in Figure 3(d) and (f), corresponding to the GVF (red color) in Figure 3(c) and (e), represent not only text pixels but also non-text (background) pixels. On the other hand, in Figure 3(g)-(h) we see that the pixels selected by four GVF are real candidate text pixels, because these pixels indeed represent only text pixels, as shown in Figure 3(h) for the GVF in red color shown in Figure 3(g). In addition, Figure 4 shows that the 4 GVF selection identifies dominant pixels (Figure 4(b) and (d)) well for characters like "O" and "I" (Figure 4(a) and (c)), which have no corners but do have extreme points. Thus, it confirms that 4 GVF works well for other characters as well.

Figure 3. Illustration of the selection of dominant text pixels (DP) with GVF arrows: (a) GVF arrows at a text pixel, (b) GVF arrows at a non-text pixel, (c) 2 GVF, (d) DP, (e) 3 GVF, (f) DP, (g) 4 GVF, (h) DP.

Figure 4. 4 GVF for characters like "O" and "I" to identify DP: (a) 4 GVF, (b) DP, (c) 4 GVF, (d) DP.

Table 1. Experiments on 100 random samples chosen from different databases for choosing the number of GVF arrows
GVF Arrows    2      3      4      5
R             0.51   0.56   0.78   0.63
P             0.36   0.47   0.67   0.68
F             0.42   0.51   0.72   0.65
MDR           0.33   0.27   0.16   0.53

B. Text Candidates Selection

We use the result of dominant pixel selection shown in Figure 1(c) for text candidate selection. For each dominant pixel in Figure 1(c), the method extracts the edge component from the Sobel edge map shown in Figure 5(a) that corresponds to the dominant pixel. We call these extracted edge components text candidates, as shown in Figure 5(b). Figure 5(b) shows that this operation extracts almost all text components with few false positives. The extracted text candidates are then used in the next section to restore the complete text information with the Sobel edge map.

Figure 5. Text candidate selection based on dominant pixels: (a) Sobel edge map, (b) text candidates.
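As a concrete illustration of Sections II.A and II.B, the sketch below marks a Sobel edge pixel as dominant when at least four GVF vectors in its 3 × 3 neighborhood point towards it, and then keeps the Sobel edge components that contain at least one dominant pixel as text candidates. It reuses the gvf() helper sketched earlier; the quantization of GVF directions to the eight neighbor offsets and the helper names dominant_pixels and text_candidates are our own simplifying assumptions, not part of the original implementation.

```python
import numpy as np
from scipy import ndimage

def dominant_pixels(edge_mask, u, v, min_forces=4):
    """Mark edge pixels towards which at least `min_forces` neighboring
    GVF vectors point (the 4-GVF rule of Section II.A)."""
    h, w = edge_mask.shape
    counts = np.zeros((h, w), dtype=int)
    # offsets (dy, dx) for quantized directions 0..7 (step of 45 degrees)
    offs = [(0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1), (-1, 0), (-1, 1)]
    ys, xs = np.nonzero(edge_mask)
    for y, x in zip(ys, xs):
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if dy == 0 and dx == 0:
                    continue
                ny, nx = y + dy, x + dx
                if not (0 <= ny < h and 0 <= nx < w):
                    continue
                # quantize the neighbor's GVF vector to one of 8 directions
                ang = np.arctan2(v[ny, nx], u[ny, nx])
                sy, sx = offs[int(np.round(ang / (np.pi / 4))) % 8]
                if (ny + sy, nx + sx) == (y, x):      # neighbor's force points at (y, x)
                    counts[y, x] += 1
    return (counts >= min_forces) & edge_mask.astype(bool)

def text_candidates(edge_mask, dominant):
    """Keep Sobel edge components that contain at least one dominant pixel."""
    labels, _ = ndimage.label(edge_mask)
    keep = np.unique(labels[dominant])
    keep = keep[keep > 0]
    return np.isin(labels, keep)
```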
C. First Grouping for Candidate Text Components

For each text candidate shown in Figure 5(b), the method finds its perimeter and allows the perimeter to grow for five iterations, pixel by pixel, in the direction of the text line in the Sobel edge map of the input frame to group neighboring text candidates. The perimeter is defined as the contour of the text candidate. The method computes the minor axis of the perimeter of the text candidate and uses the length of the minor axis as the radius to expand the perimeter. At every iteration, the method traverses the expanded perimeter to find a text pixel (white pixel) of a neighboring text candidate in the text line. The objective of this step is to merge segments of character components and neighboring characters to form a word. This process merges text candidates which are in close proximity within five iterations of the perimeter. The value five is determined empirically by studying the space between the text candidates. The five pixel tolerance is acceptable because it is lower than the space between the characters. As a result, we get two groups of text candidates, namely the current group and the neighbor group. The method then verifies the following properties, based on the size and angle of the text candidate groups, before merging them. Generally, the major axes of the character components have almost the same lengths, and the angle differences between the character components are almost the same.

Size: medianLength(g) × 3/4 < length(c) < medianLength(g) × 3

where length(.) is the length of the major axis of a text candidate group and medianLength(.) is the median length of the major axes of all the text candidates in the group so far.

Angle: g = gprev ∪ {clast}, gnext = g ∪ {c}
Δθ1 = |angle(g) − angle(gprev)|
Δθ2 = |angle(g) − angle(gnext)|

where g is the current group, clast is the text candidate group that was last added to g, and c is the new text candidate group that we are considering adding to g. It follows that gprev and gnext are the group immediately before the current group and the candidate (next) group, respectively. angle(.) returns the orientation of the major axis of each group based on PCA. The angle condition is:

|Δθ1 − Δθ2| ≤ θmax1

This condition is only checked when g has at least four components. We fix θmax1 at 5° because, in the case of arbitrarily oriented text, each character has a slightly different orientation according to the nature of the text line orientation; the 5° tolerance takes care of this small orientation variation. If a text candidate group passes these two conditions, we merge the neighbor group with the current group to get candidate text components (word patches). These two conditions fail when we get a large angle difference between two words due to cluttered background while grouping. The grouping is illustrated in Figure 6, where (a)-(e) show g, c, gprev, clast and gnext, respectively, chosen from Figure 5(b). In this case, Δθ1 = 5.33°, Δθ2 = 4.02°, length(c) = 11.95 and medianLength(g) = 12.64, so the conditions are satisfied and c is merged into g, as shown in Figure 6(e). In this way, the method groups the broken segments and neighboring characters to get candidate text components. The final grouping results for the text candidates in Figure 5(b) are shown in Figure 7(a), where different colors represent the different groups that are formed. The staircase effect in Figure 7(b) shows the grouping mechanism for obtaining candidate text groups. This process repeats until there are no remaining unvisited text candidates. This grouping essentially gives word patches by grouping character components.

Figure 6. Illustration of candidate text component selection: (a) g, (b) c, (c) gprev, (d) clast, (e) gnext.

Figure 7. Word patch extraction: (a) first grouping, (b) staircase effect, (c) skeleton, (d) end and junction points, (e) candidate text components after false positive elimination.
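To make the two merge tests concrete, the following small sketch applies the size and angle checks to a candidate group before merging. group_stats estimates the major-axis length and orientation by PCA on a component's pixel coordinates; the helper names, the length estimate and the way the groups are passed in are illustrative assumptions, and the 3/4 and 3 factors simply mirror the size condition as written above.

```python
import numpy as np

def group_stats(coords):
    """Major-axis length and orientation (degrees) of a set of (y, x) pixel
    coordinates, estimated by PCA as in Section II.C."""
    pts = coords - coords.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(pts.T))
    major = vecs[:, np.argmax(vals)]                    # principal direction
    length = 4.0 * np.sqrt(np.max(vals))                # rough full major-axis extent
    angle = np.degrees(np.arctan2(major[0], major[1]))  # orientation of the major axis
    return length, angle

def can_merge(g_coords, g_prev_coords, c_coords, member_lengths, theta_max1=5.0):
    """Size and angle conditions for merging candidate group c into current group g.
    g_prev_coords are the pixels of g without its last-added member; member_lengths
    are the major-axis lengths of the members already in g."""
    len_c, _ = group_stats(c_coords)
    med = np.median(member_lengths)
    size_ok = med * 3.0 / 4.0 < len_c < med * 3.0       # size condition
    _, ang_g = group_stats(g_coords)
    _, ang_prev = group_stats(g_prev_coords)
    _, ang_next = group_stats(np.vstack([g_coords, c_coords]))
    d1 = abs(ang_g - ang_prev)                          # delta-theta-1
    d2 = abs(ang_g - ang_next)                          # delta-theta-2
    return size_ok and abs(d1 - d2) <= theta_max1
```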
It is observed from Figure 7(b) that there are false text candidate groups. To eliminate them, we check the skeleton of each group, as shown in Figure 7(c), and count the number of junction points, shown in Figure 7(d). If

|intersection(skeleton(g))| > 0, then g is a false text candidate group and is not retained,

where skeleton(.) returns the skeleton of a group and intersection(.) returns the set of intersection (junction) points. The final result after removing false text candidate groups can be seen in Figure 7(e). However, some false text candidate groups still remain.

D. Second Grouping for Text Line Detection

The first grouping described above produces word patches by grouping character components. For each word patch, the second grouping now finds the two tail ends using the major axis of the word patch. The method considers the text candidates at both tail ends of the word and grows their perimeters, based on the direction of the major axis, for a few iterations to find neighboring word patches. The number of iterations is determined based on experiments on the space between words and characters. While growing the perimeter pixel by pixel, the method looks for white pixels of the neighboring word patches. The Sobel edge map of the input frame is used for growing and finding neighboring word patches. Two word patches are grouped based on their angle properties. Let t1 and t2 be the right tail end of the first word patch and the left tail end of the second word patch, respectively:

t1 = tail(w1, c1), t2 = tail(w2, c2), t12 = t1 ∪ t2
Δθ1 = |angle(t1) − angle(t12)|, Δθ2 = |angle(t2) − angle(t12)|

where w1 is the current word patch and c1 is the text candidate that is being used for growing; c2 is the text candidate of the word patch w2 to which it belongs. The idea is to check that the "tail angles" of the two words are compatible with each other. tail(w, c) returns up to three text candidates immediately connected to c in w, and t12 is the combined tail formed by t1 and t2. The angle condition is:

Δθ1 ≤ θmax2 ∧ Δθ2 ≤ θmax2

This condition is only checked if both t1 and t2 contain three components. If a word patch passes this condition, it is merged into the current word. Here we set θmax2 to 25° to take care of the orientation difference between the words in a text line. A small orientation difference between the words is expected because the input is arbitrarily oriented text. This 25° tolerance does not affect the grouping process much because there is enough space between the text lines. An illustration of grouping word patches chosen from Figure 7(e) can be seen in Figure 8, where (a)-(e) represent w1, w2, t1, t2 and t12, respectively. Suppose we are considering whether to merge w1 and w2. In this case, Δθ1 = 20.87° and Δθ2 = 20.68°, so the condition is satisfied and w1 and w2 are merged, as shown in Figure 9(a) in red color. This process repeats until there are no remaining unvisited words. The output of the second grouping is shown in Figure 9(a), where the staircase effect with different colors shows how the words are grouped, and the final result is shown in Figure 9(b), where the curved text line is extracted with one false positive.

Figure 8. Illustration of word grouping: (a) w1, (b) w2, (c) t1, (d) t2, (e) t12.

Figure 9. Arbitrary text extraction: (a) second grouping, (b) text line detection.

E. False Positive Removal

Sometimes false positives are merged with the text lines (as in the above case), which makes them difficult to remove. In other cases, however, the false positives stand alone, and thus we propose the following rules to remove these kinds of false positives. Rules for eliminating such false positives based on geometrical properties of the text block are common practice in text detection [14-32] to improve accuracy. Therefore, we also propose similar rules in this work.

False positive checking: if area(w) < 200 or edge_density(w) < 0.05, then w is a false positive and is removed, where

edge_density(w) = edge_length(sobel(w)) / area(w)

sobel(.) returns the Sobel edge map and edge_length(.) returns the total length of all edges in the edge map.
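A compact sketch of the two kinds of false positive filters described in Sections II.C and II.E (the junction-point test on the skeleton of a group, and the area and edge-density rules) might look as follows. It assumes scikit-image for skeletonization, treats a skeleton pixel with more than two skeleton neighbors as a junction point, and approximates edge_length by the number of Sobel edge pixels inside the block; these choices and the helper names has_junction and is_false_positive are ours, not the paper's.

```python
import numpy as np
from scipy import ndimage
from skimage.morphology import skeletonize

def has_junction(group_mask):
    """True if the skeleton of a candidate group contains a junction point,
    i.e. a skeleton pixel with more than two skeleton neighbors."""
    skel = skeletonize(group_mask.astype(bool))
    # count the 8-neighbors of every skeleton pixel
    neighbors = ndimage.convolve(skel.astype(int), np.ones((3, 3), int),
                                 mode="constant", cval=0) - skel.astype(int)
    return bool(np.any(skel & (neighbors > 2)))

def is_false_positive(block_mask, sobel_edges, min_area=200, min_density=0.05):
    """Area and edge-density rules of Section II.E applied to a detected block
    (block_mask and sobel_edges are boolean arrays of the same shape)."""
    area = int(block_mask.sum())
    if area < min_area:
        return True
    edge_length = int((sobel_edges & block_mask).sum())   # edge pixels inside the block
    return edge_length / area < min_density
```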
Figure 10(a) shows the input, (b) shows the result before false positive elimination, (c) shows the result of false positive elimination using the area of the text block, and (d) shows the result of false positive elimination using the edge density of the text block.

Figure 10. Illustration of false positive elimination: (a) input, (b) before false positive removal, (c) area for false positive removal, (d) density for false positive removal.

III. EXPERIMENTAL RESULTS

We create our own dataset for evaluating the proposed method, along with the standard Hua's data of 45 video frames [34]. Our dataset includes 142 arbitrarily-oriented text frames (almost all scene text frames), 220 non-horizontal text frames (176 scene text frames and 44 graphics text frames), 800 horizontal text frames (160 Chinese text frames, 155 scene text frames and 485 English graphics text frames), and the publicly available Hua's data of 45 frames (12 scene text frames and 33 graphics text frames). We also test our method on the ICDAR-03 competition dataset [35] of 251 camera images (all scene text images) to check the effectiveness of our method on camera based images. In total, 1207 (142 + 220 + 800 + 45) video frames and 251 camera images are used for experimentation. To compare the results of the proposed method with existing methods, we consider seven popular existing methods: the Bayesian and boundary growing based method [31], the Laplacian and skeleton based method [30], Zhou et al. [29], the Fourier-RGB based method [28], Liu et al. [26], Wong and Chen [21] and Cai et al. [22]. The main reason for considering these existing methods is that they work with fewer constraints and for complex background, without a classifier and training, as in our proposed method. We evaluate the performance of the proposed method at the text line level, which is a common granularity level in the literature [17-25], rather than at the word or character level, because we have not considered text recognition in this work. The following categories are defined for each block detected by a text detection method. Truly Detected Block (TDB): a detected block that contains at least one true character; thus, a TDB may or may not fully enclose a text line. Falsely Detected Block (FDB): a detected block that does not contain text. Text Block with Missing Data (MDB): a detected block that misses more than 20% of the characters of a text line (MDB is a subset of TDB). The percentage is chosen according to [30-31], in which a text block is considered correctly detected if it overlaps at least 80% of the pixels of the ground-truth block. We count the Actual Number of Text Blocks (ATB) in the images manually and consider it as the ground truth for evaluation. The performance measures are defined as follows: Recall (R) = TDB / ATB, Precision (P) = TDB / (TDB + FDB), F-measure (F) = (2 × P × R) / (P + R), and Misdetection Rate (MDR) = MDB / TDB.
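For completeness, the block-level measures can be computed directly from the four counts defined above; the sketch below simply encodes those formulas (the class and field names are our own).

```python
from dataclasses import dataclass

@dataclass
class DetectionCounts:
    atb: int   # actual number of text blocks (ground truth)
    tdb: int   # truly detected blocks
    fdb: int   # falsely detected blocks
    mdb: int   # detected blocks missing more than 20% of a line's characters

def measures(c: DetectionCounts):
    recall = c.tdb / c.atb
    precision = c.tdb / (c.tdb + c.fdb)
    f_measure = 2 * precision * recall / (precision + recall)
    mdr = c.mdb / c.tdb
    return recall, precision, f_measure, mdr

# example: 100 ground-truth lines, 90 detected, 12 false alarms, 9 partial detections
print(measures(DetectionCounts(atb=100, tdb=90, fdb=12, mdb=9)))
```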
In addition, we also measure the Average Processing Time (APT, in seconds) for each method in our experiments.

A. Experiment on Video Text Data

In order to show the effectiveness of the proposed method over the existing methods, we assemble the 142 arbitrary images with the 800 horizontal and 220 non-horizontal images to form a representative variety of general video data and calculate the performance measures, namely recall, precision, F-measure and misdetection rate. The quantitative results of the proposed and the existing methods for the 1162 images (142 + 800 + 220) are reported in Table 2. We highlight sample arbitrary, non-horizontal and horizontal images for discussion in Figure 11, Figure 12 and Figure 13, respectively. For a curved, circle-shaped text line such as the one shown in Figure 11(a), the proposed method extracts the text lines with one false positive, while the existing methods fail to detect the curved text line properly. The main reason is that the existing methods are developed for horizontal and non-horizontal text line detection but not for arbitrary text detection. It is observed from Figure 12 that, for the input frame having different orientations and a complex background shown in Figure 12(a), the proposed method detects almost all text with a few misdetections, as shown in Figure 12(b), while the Bayesian method does not fix the bounding boxes properly, as shown in Figure 12(c), and the Laplacian method detects two text lines and loses one text line, as shown in Figure 12(d), due to the complex background in the frame. On the other hand, Zhou et al.'s method fails to detect text, as shown in Figure 12(e), as it is limited to horizontal and vertical text lines only and to caption text rather than scene text and multi-oriented text. It is also observed from Figure 12 that the Fourier-RGB method, Liu et al.'s, Wong and Chen's and Cai et al.'s methods fail to detect the text lines, because these methods are developed for horizontal text detection but not for non-horizontal text detection.

Figure 11. Sample results for arbitrarily-oriented text detection: (a) input, (b) proposed, (c) Bayesian, (d) Laplacian, (e) Zhou et al., (f) Fourier-RGB, (g) Liu et al., (h) Wong and Chen, (i) Cai et al.

Figure 12. Sample results for non-horizontal text detection: (a) input, (b) proposed, (c) Bayesian, (d) Laplacian, (e) Zhou et al., (f) Fourier-RGB, (g) Liu et al., (h) Wong and Chen, (i) Cai et al.

Sample experimental results for both the proposed and existing methods on horizontal text detection are shown in Figure 13, where the input image shown in Figure 13(a) has a complex background with horizontal text. It is noticed from Figure 13 that the proposed method, the Bayesian, the Laplacian, the Fourier-RGB and Cai et al.'s methods detect almost all text lines, while the other methods miss text lines. The Bayesian method does not fix the bounding boxes properly and gives more false positives due to the problem of boundary growing. The Fourier-RGB method detects text properly. The other existing methods do not detect text properly: Zhou et al.'s method misses a few text lines, Liu et al.'s method misses a few words in addition to producing false positives, while Wong and Chen's and Cai et al.'s methods do not fix the bounding boxes properly for the text lines.

Figure 13. Sample results for horizontal text detection: (a) input, (b) proposed, (c) Bayesian, (d) Laplacian, (e) Zhou et al., (f) Fourier-RGB, (g) Liu et al., (h) Wong and Chen, (i) Cai et al.
Observations of the above sample images show that the proposed method detects arbitrary, non-horizontal and horizontal text well compared to the existing methods. The quantitative results reported in Table 2 also show that the proposed method outperforms the existing methods in terms of recall, precision, F-measure and misdetection rate. However, the Average Processing Time (APT) of the proposed method is longer than that of most of the existing methods, except for the Fourier-RGB and Liu et al.'s methods, as shown in Table 2 as well as in the subsequent experiments, namely Tables 3 and 4. The higher APT is attributed to the GVF determination and grouping processes, which incur higher computational cost. It is this GVF process that enables the proposed method to deal with arbitrarily-oriented text lines. Our previous methods, namely the Bayesian and the Laplacian methods, give lower accuracy than the proposed method according to Table 2. This is because these methods were developed for non-horizontal and horizontal text detection but not for arbitrary orientation text detection. As a result, the boundary growing and the skeleton based methods proposed, respectively, in the Bayesian and the Laplacian methods for handling the multi-oriented problem fail to perform on arbitrary text. Zhou et al.'s method works well only for vertical and horizontal caption text but not for arbitrary orientation and scene text, and hence the method gives poor accuracy. Since Liu et al.'s, Wong and Chen's and Cai et al.'s methods were developed for horizontal text detection but not for non-horizontal and arbitrary orientation text detection, these methods give poor accuracy compared to the proposed method.

Table 2. Performance on arbitrary + non-horizontal + horizontal data (142 + 220 + 800 = 1162)
Methods               R      P      F      MDR    APT (sec)
Proposed Method       0.78   0.79   0.78   0.10   14.6
Bayesian [31]         0.75   0.69   0.71   0.15   10.3
Laplacian [30]        0.74   0.77   0.75   0.19   9.6
Zhou et al. [29]      0.54   0.72   0.61   0.28   1.5
Fourier-RGB [28]      0.63   0.77   0.69   0.13   16.9
Liu et al. [26]       0.57   0.64   0.60   0.12   23.3
Wong and Chen [21]    0.54   0.76   0.63   0.12   1.8
Cai et al. [22]       0.54   0.41   0.46   0.17   7.4

B. Experiment on Independent Data (Hua's Data)

We found a small publicly available dataset of 45 video frames [34], namely Hua's dataset, for evaluating the performance of the proposed method in comparison with the existing methods. We include this set in our experiment as it serves as an independent test set in addition to our own dataset used in the preceding section. We caution, however, that this set contains only horizontal text and hence does not give a full comparison over the entire spectrum of text detection capability, from horizontal and non-horizontal to arbitrary orientation. Figure 14 shows sample results for the proposed and existing methods, where (a) is an input frame having both huge and small font text, and (b)-(i) are the results of the proposed and existing methods, respectively.

Figure 14. Sample results for Hua's data: (a) input, (b) proposed, (c) Bayesian, (d) Laplacian, (e) Zhou et al., (f) Fourier-RGB, (g) Liu et al., (h) Wong and Chen, (i) Cai et al.

It is observed from Figure 14 that the proposed method detects both text lines in the input frame, while the Bayesian method does not detect all the text and the Laplacian method fails to detect complete text lines, rendering them as either misdetections or false positives.
Therefore, the misdetection rate is high compared to the proposed method, as shown in Table 3. The Fourier-RGB method detects text properly and hence gives good recall. The other existing methods fail to detect the text lines in the input frame due to font variation. From Table 3, it can be concluded that the proposed method and our earlier methods [30, 31] outperform the other existing methods in terms of recall, precision, F-measure and misdetection rate. We note that the Bayesian method [31] and the Laplacian method [30] achieve a better F-measure than the proposed method. However, as we cautioned earlier, Hua's dataset does not contain arbitrarily oriented text, so both the Bayesian and the Laplacian methods are given the advantage of not being tested on arbitrary text lines. If Hua's dataset had contained arbitrarily oriented text lines, the Bayesian and the Laplacian methods would have shown poorer F-measures, as in Table 2.

Table 3. Performance with Hua's data
Methods               R      P      F      MDR    APT (sec)
Proposed Method       0.88   0.74   0.80   0.05   10.5
Bayesian [31]         0.87   0.85   0.85   0.18   5.6
Laplacian [30]        0.93   0.81   0.87   0.07   11.7
Zhou et al. [29]      0.72   0.82   0.77   0.44   1.13
Fourier-RGB [28]      0.81   0.73   0.76   0.06   14.6
Liu et al. [26]       0.75   0.54   0.63   0.16   24.9
Wong and Chen [21]    0.51   0.75   0.61   0.13   1.6
Cai et al. [22]       0.69   0.43   0.53   0.13   9.2

C. Experiment on ICDAR-03 Data (Camera Images)

We add another independent test set in this experiment, as in the preceding section. The objective of this experiment is to show that the proposed method works well for high resolution camera images as well as for low resolution video frames. This dataset is publicly available [35] as the ICDAR-03 competition data for text detection from natural scene images. We show sample results for the proposed and existing methods in Figure 15, where (a) is a sample input frame and (b)-(i) show the results of the proposed and the existing methods, respectively. It is observed from Figure 15 that the proposed method, the Fourier-RGB method and Cai et al.'s method work well for the input frame, but the other methods, including our earlier methods, namely the Bayesian and the Laplacian methods, fail to detect the text lines properly. The results reported in Table 4 show that the proposed method is better in terms of recall, F-measure and misdetection rate compared to the Bayesian, the Laplacian and the Fourier-RGB methods. This is because, for high contrast and high resolution images, the classification methods proposed in the Bayesian and the Laplacian methods and the dynamic threshold used in Fourier-RGB all fail to classify text and non-text pixels properly. The proposed method and our earlier methods are also better than the other existing methods in terms of recall, precision and F-measure, but in terms of misdetection rate, Wong and Chen's method is better, according to the results reported in Table 4. However, Wong and Chen's method is the worst in recall, precision and F-measure compared to the proposed method. This experiment shows that the proposed method is good even for high resolution and high contrast images.

Figure 15. Sample results for scene text detection (ICDAR-2003 data): (a) input, (b) proposed, (c) Bayesian, (d) Laplacian, (e) Zhou et al., (f) Fourier-RGB, (g) Liu et al., (h) Wong and Chen, (i) Cai et al.

Table 4. Line level performance on ICDAR-03 data
Methods               R      P      F      MDR    APT (sec)
Proposed Method       0.92   0.76   0.83   0.13   12.7
Bayesian [31]         0.87   0.72   0.78   0.14   7.9
Laplacian [30]        0.86   0.76   0.81   0.13   6.8
Zhou et al. [29]      0.66   0.83   0.73   0.26   1.2
Fourier-RGB [28]      0.80   0.66   0.72   0.04   15.5
Liu et al. [26]       0.53   0.61   0.57   0.24   16.1
Wong and Chen [21]    0.52   0.83   0.64   0.08   1.0
Cai et al. [22]       0.67   0.33   0.44   0.43   6.1
We also conduct experiments on the ICDAR data using the ICDAR 2003 measures for our proposed method, and the results are reported in Table 5. Since our primary goal is to detect text in video, we develop and evaluate the method at the line level, as is common practice in the video text detection literature [14-32]. In order to calculate recall, precision and F-measure according to ICDAR 2003, we modify the method to fix a bounding box for each word in the image based on the space between the words and characters. Table 5 shows that the proposed method does not achieve better accuracy than the best method (Hinner Becker), but it stands in third position among the methods in Table 5. The lower accuracy is due to the problems of word segmentation, fixing tight bounding boxes and the strict measures. In addition, the method does not exploit the advantage of high resolution images, whereas the participating methods use connected component analysis for text detection and grouping. Hence, the proposed method misses true text blocks. The results of the participating methods reported in Table 5 are taken from ICDAR 2005 [35] for comparison with the proposed method.

Table 5. Word level performance on ICDAR 2003 data
Methods               R      P      F
Proposed Method       0.42   0.36   0.35
Hinner Becker [35]    0.67   0.62   0.62
Alex Chen [35]        0.60   0.60   0.58
Qiang Zhu [35]        0.40   0.33   0.33
Jisoo Kim [35]        0.28   0.22   0.22
Nobuo Ezaki [35]      0.36   0.18   0.22

IV. CONCLUSION AND FUTURE WORK

In this paper, we have explored GVF information for the first time for text detection in video by selecting dominant text pixels and text candidates with the help of the Sobel edge map. This dominant text pixel selection helps in removing non-text information in the complex background of video frames. Text candidate selection and the first grouping method ensure that text pixels are not missed. The second grouping tackles the problems created by arbitrarily-oriented text to achieve better accuracy for text detection in video. Experimental results on a variety of datasets, such as arbitrarily-oriented data, non-horizontal data, horizontal data, Hua's data and ICDAR-03 data, show that the proposed method works well for text detection irrespective of contrast, orientation, background, script, font and font size. However, the proposed method may not give good accuracy for horizontal text lines with little spacing between the text lines. To overcome this problem, we plan to develop another method which can detect text lines without considering their spacing, using an alternative grouping criterion, in future work.

ACKNOWLEDGEMENT

This research is supported in part by the A*STAR grant 092 101 0051 (WBS no. R252-000-402-305). The authors are grateful to the Editor and reviewers for their constructive comments and suggestions, which greatly improved the quality of the paper.

REFERENCES

[1] N. Sharma, U. Pal and M. Blumenstein, "Recent Advances in Video Based Document Processing: A Review", In Proc. DAS, 2012, 63-68.
[2] J. Zhang and R. Kasturi, "Extraction of Text Objects in Video Documents: Recent Progress", In Proc. DAS, 2008, 5-17.
[3] K. Jung, K. I. Kim and A. K. Jain, "Text Information Extraction in Images and Video: a Survey", Pattern Recognition, 2004, 977-997.
[4] D. Crandall and R. Kasturi, "Robust Detection of Stylized Text Events in Digital Video", In Proc. ICDAR, 2001, 865-869.
[5] D. Zhang and S. F. Chang, "Event Detection in Baseball Video using Superimposed Caption Recognition", In Proc. ACM MM, 2002, 315-318.
[6] C. Xu, J. Wang, K. Wan, Y. Li and L. Duan, "Live Sports Event Detection based on Broadcast Video and Web-casting Text", In Proc. ACM MM, 2006, 221-230.
[7] W. Wu, X. Chen and J. Yang, "Incremental Detection of Text on Road Signs from Video with Application to a Driving Assistant System", In Proc. ACM MM, 2004, 852-859.
[8] B. Epshtein, E. Ofek and Y. Wexler, "Detecting Text in Natural Scenes with Stroke Width Transform", In Proc. CVPR, 2010, 2963-2970.
[9] Y. F. Pan, X. Hou and C. L. Liu, "A Hybrid Approach to Detect and Localize Texts in Natural Scene Images", IEEE Trans. on IP, 2011, 800-813.
[10] X. Chen, J. Yang, J. Zhang and A. Waibel, "Automatic Detection and Recognition of Signs from Natural Scenes", IEEE Trans. on IP, 2004, 87-99.
[11] C. Yao, X. Bai, W. Liu, Y. Ma and Z. Tu, "Detecting Texts of Arbitrary Orientations in Natural Images", In Proc. CVPR, 2012, 1083-1090.
[12] L. Neumann and J. Matas, "Real-Time Scene Text Localization and Recognition", In Proc. CVPR, 2012, 3538-3545.
[13] T. Q. Phan, P. Shivakumara and C. L. Tan, "Detecting Text in the Real World", In Proc. ACM MM, 2012, 765-768.
[14] A. K. Jain and B. Yu, "Automatic Text Location in Images and Video Frames", Pattern Recognition, 1998, 2055-2076.
[15] V. Y. Mariano and R. Kasturi, "Locating Uniform-Colored Text in Video Frames", In Proc. ICPR, 2000, 539-542.
[16] H. Li, D. Doermann and O. Kia, "Automatic Text Detection and Tracking in Digital Video", IEEE Trans. on IP, 2000, 147-156.
[17] Y. Zhong, H. Zhang and A. K. Jain, "Automatic Caption Localization in Compressed Video", IEEE Trans. on PAMI, 2000, 385-392.
[18] K. I. Kim, K. Jung and J. H. Kim, "Texture-Based Approach for Text Detection in Images using Support Vector Machines and Continuously Adaptive Mean Shift Algorithm", IEEE Trans. on PAMI, 2003, 1631-1639.
[19] V. Wu, R. Manmatha and E. M. Riseman, "TextFinder: An Automatic System to Detect and Recognize Text in Images", IEEE Trans. on PAMI, 1999, 1224-1229.
[20] R. Lienhart and A. Wernicke, "Localizing and Segmenting Text in Images and Videos", IEEE Trans. on CSVT, 2002, 256-268.
[21] E. K. Wong and M. Chen, "A New Robust Algorithm for Video Text Extraction", Pattern Recognition, 2003, 1397-1406.
[22] M. Cai, J. Song and M. R. Lyu, "A New Approach for Video Text Detection", In Proc. ICIP, 2002, 117-120.
[23] A. Jamil, I. Siddiqi, F. Arif and A. Raza, "Edge-based Features for Localization of Artificial Urdu Text in Video Images", In Proc. ICDAR, 2011, 1120-1124.
[24] M. Anthimopoulos, B. Gatos and I. Pratikakis, "A Two-Stage Scheme for Text Detection in Video Images", Image and Vision Computing, 2010, 1413-1426.
[25] X. Peng, H. Cao, R. Prasad and P. Natarajan, "Text Extraction from Video using Conditional Random Fields", In Proc. ICDAR, 2011, 1029-1033.
[26] C. Liu, C. Wang and R. Dai, "Text Detection in Images Based on Unsupervised Classification of Edge-based Features", In Proc. ICDAR, 2005, 610-614.
[27] P. Shivakumara, W. Huang, C. L. Tan and P. Q. Trung, "Accurate Video Text Detection Through Classification of Low and High Contrast Images", Pattern Recognition, 2010, 2165-2185.
[28] P. Shivakumara, T. Q. Phan and C. L. Tan, "New Fourier-Statistical Features in RGB Space for Video Text Detection", IEEE Trans. on CSVT, 2010, 1520-1532.
[29] J. Zhou, L. Xu, B. Xiao and R. Dai, "A Robust System for Text Extraction in Video", In Proc. ICMV, 2007, 119-124.
[30] P. Shivakumara, T. Q. Phan and C. L. Tan, "A Laplacian Approach to Multi-Oriented Text Detection in Video", IEEE Trans. on PAMI, 2011, 412-419.
[31] P. Shivakumara, R. P. Sreedhar, T. Q. Phan, S. Lu and C. L. Tan, "Multi-Oriented Video Scene Text Detection Through Bayesian Classification and Boundary Growing", IEEE Trans. on CSVT, 2012, 1227-1235.
[32] N. Sharma, P. Shivakumara, U. Pal, M. Blumenstein and C. L. Tan, "A New Method for Arbitrarily-Oriented Text Detection in Video", In Proc. DAS, 2012, 74-78.
[33] C. Xu and J. L. Prince, "Snakes, Shapes, and Gradient Vector Flow", IEEE Trans. on IP, 1998, 359-369.
[34] X. S. Hua, L. Wenyin and H. J. Zhang, "An Automatic Performance Evaluation Protocol for Video Text Detection Algorithms", IEEE Trans. on CSVT, 2004, 498-507. (http://www.cs.cityu.edu.hk/~liuwy/PE_VTDetect/)
[35] S. M. Lucas, "ICDAR 2005 Text Locating Competition Results", In Proc. ICDAR, 2005, 80-84.