Wavelet-Gradient-Fusion for Video Text Binarization

Sangheeta Roy (a), Palaiahnakote Shivakumara (b), Partha Pratim Roy (c) and Chew Lim Tan (b)
(a) Tata Consultancy Services, Kolkata, India; roy.sangheeta@tcs.com
(b) School of Computing, National University of Singapore, Singapore; {shiva, tancl}@comp.nus.edu.sg
(c) Laboratoire d'Informatique, Université François Rabelais, Tours, France; partha.roy@univ-tours.fr

Abstract—Achieving a good character recognition rate on video images is harder than on scanned documents because of the low resolution and complex background of video. In this paper, we propose a new method that fuses the horizontal, vertical and diagonal information obtained by the wavelet and the gradient of text line images to enhance the text information. We apply k-means with k = 2 on row-wise and column-wise pixels separately to extract possible text information, and the union of the row-wise and column-wise clusters gives the text candidates. With the help of the Canny edge map of the input image, the method identifies disconnections based on a mutual nearest neighbor criterion on end points and compares each disconnected area with the text candidates to restore the missing information. Next, the method uses connected component analysis to merge sub-components based on a nearest neighbor criterion. The foreground (text) and background (non-text) are separated based on the new observation that the color values at the edge pixels of a component are larger than the color values of the pixels inside the component. Finally, we use the Google Tesseract OCR engine to validate our results, and the results are compared with baseline thresholding techniques to show that the proposed method is superior to existing methods in terms of recognition rate on 236 video and 258 ICDAR 2003 text lines.

Keywords—Wavelet-Gradient-Fusion; video text lines; video text restoration; video character recognition

I. INTRODUCTION

Character recognition in document analysis is the most successful application in the field of pattern recognition. However, if we test the same OCR engine on video scene characters, it reports poor accuracy, because OCR engines were developed mainly for scanned document images with simple background and high contrast, not for video images with complex background and low contrast. It is evident from the natural scene character recognition methods [1-6] that document OCR engines do not work for camera-based natural scene images due to the failure of binarization in handling non-uniform background and illumination. Accordingly, a poor character recognition rate (67%) is reported for the ICDAR 2003 competition data [7]. This shows that despite the high contrast of camera images, the best accuracy reported so far is 67%; achieving a good character recognition rate for video images is thus still an elusive goal, because of the lack of a good binarization method that can tackle both low contrast and complex background to separate foreground and background accurately [8]. It is noted that the character recognition rate varies from 0% to 45% [8] if we apply OCR directly to video text, which is much lower than scene character recognition accuracy. Our experiments with the existing baseline methods of Niblack [9] and Sauvola et al. [10] show that these thresholding techniques give poor accuracy for video and scene images.
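For reference, both baselines compute a per-pixel threshold from local statistics: Niblack [9] uses T = m + k·s and Sauvola [10] uses T = m·(1 + k·(s/R − 1)), where m and s are the mean and standard deviation over a local window. The sketch below uses common default parameters (a 25×25 window, k = −0.2 for Niblack, k = 0.5 and R = 128 for Sauvola); these are illustrative choices, not the settings tuned in our experiments.

```python
# Minimal sketch of the two baseline thresholding methods.
import numpy as np
import cv2

def local_stats(gray, win=25):
    """Per-pixel mean and standard deviation over a win x win window."""
    g = gray.astype(np.float32)
    mean = cv2.boxFilter(g, -1, (win, win))
    sq_mean = cv2.boxFilter(g * g, -1, (win, win))
    std = np.sqrt(np.maximum(sq_mean - mean * mean, 0))
    return mean, std

def niblack(gray, win=25, k=-0.2):
    # T = m + k*s; pixels brighter than T are kept as foreground (white).
    mean, std = local_stats(gray, win)
    return ((gray > mean + k * std) * 255).astype(np.uint8)

def sauvola(gray, win=25, k=0.5, R=128.0):
    # T = m * (1 + k * (s/R - 1)) adapts the threshold in flat regions.
    mean, std = local_stats(gray, win)
    return ((gray > mean * (1.0 + k * (std / R - 1.0))) * 255).astype(np.uint8)
```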
It is reported in [11] that the performance of these thresholding techniques is not consistent, because the character recognition rate changes as the application and dataset change. In this paper, we propose a new method for separating foreground (text) and background (non-text) such that the OCR engine provides better accuracy than the accuracies reported in the literature.

Several papers have addressed the video text binarization and localization problem based on edge, stroke, color and corner information to improve the character recognition rate. Ntirogiannis et al. [12] proposed a binarization method based on baseline and stroke width extraction to obtain the body of the text, followed by convex hull analysis with adaptive thresholding to obtain the final text. However, this method focuses on artificial text, where pixels have uniform color, and not on both artificial and scene text, where pixels do not have uniform color values. An automatic binarization method for color text areas in images and video based on a convolutional neural network was proposed by Saidane and Garcia [13]; its performance depends on the number of training samples. Recently, an edge-based binarization method for video text images was proposed by Zhou et al. [14] to improve the video character recognition rate. This method takes the Canny edge map of the input image as input and proposes a modified flood fill algorithm to fill small gaps on the contour. It works well for small gaps but not for big gaps on the contours; in addition, its primary focus is graphics text in big fonts, not both graphics and scene text. From the above discussion, it can be concluded that there are methods that improve the video character recognition rate through binarization, but they concentrate on big-font graphics text in video rather than on both graphics and scene text, where we can expect much more variation in contrast and background. Improving video character recognition through binarization irrespective of text type, contrast and background complexity is therefore challenging. Hence, in this work, we propose a new Wavelet-Gradient-Fusion (WGF) method based on the fusion of wavelet and gradient information, together with a new way of obtaining text candidates, to overcome the above problems.

II. PROPOSED METHOD

While we note that there are several sophisticated methods for text line detection in video irrespective of contrast, text type, orientation and background variation, we use our method [15], based on the Laplacian approach and skeleton analysis, to segment the text lines from the video frames. The output of our text detection method is thus the input to the method proposed in this work. Multi-oriented text lines segmented from the video frames are converted to horizontal text lines based on the direction of the text lines; hence, non-horizontal text lines are treated as horizontal text lines to make the implementation easier. The proposed method is structured into four sub-sections. In Section A, we propose a novel method that fuses wavelet and gradient information to enhance the text information in video text lines. Text candidates are obtained by a new way of clustering on the enhanced image in Section B. The possible text information is restored with the help of the Canny edge map of the input image and the text candidate image in Section C. Finally, in Section D, the method for separating foreground and background based on the color features of edge pixels and inside-component pixels is presented.

A. Wavelet-Gradient-Fusion Method

It is noted that wavelet decomposition is good for enhancing low contrast pixels in the video frame, because its multi-resolution analysis gives horizontal (H), vertical (V) and diagonal (D) information, while the gradient operation in the same direction on the video image gives fine detail of the edge pixels in the video text line image. To overcome the problem of unpredictable video characteristics, the work presented in [16] suggested fusing the values given by the low bands of the input images to increase the resolution of the image. Inspired by this work, we propose an operation that chooses the highest pixel value among the corresponding pixel values of the wavelet and gradient images of each direction as the fusion criterion. This is shown in Figure 1, where one can see how the sub-bands of the wavelet are fused with the gradient images and the final fused image is obtained by fusing the three images Fusion-1, Fusion-2 and Fusion-3.

Figure 1. Flow diagram for the wavelet-gradient-fusion: the H, V and D sub-bands of the wavelet and the gradient yield Fusion-1, Fusion-2 and Fusion-3, which are fused into the final image.

For example, for the input image shown in Figure 2(a), the method compares the pixel values in the horizontal wavelet (Figure 2(b)) with the corresponding pixel values in the horizontal gradient (Figure 2(c)) and chooses the higher pixel value to obtain the fusion image shown in Figure 2(d). In the same way, the method obtains the fusion image for the vertical wavelet and the vertical gradient, as shown in Figures 2(e)-(g), and for the diagonal wavelet and the diagonal gradient, as shown in Figures 2(h)-(j). The same operation is performed on the three fused images to get the final fused image shown in Figure 2(k), where we can see that the text information is sharpened compared to the results shown in Figures 2(d), (g) and (j).

Figure 2. Intermediate results of the WGF method: (a) input text line image, (b) horizontal wavelet, (c) horizontal gradient, (d) Fusion-1 of (b) and (c), (e) vertical wavelet, (f) vertical gradient, (g) Fusion-2 of (e) and (f), (h) diagonal wavelet, (i) diagonal gradient, (j) Fusion-3 of (h) and (i), (k) fusion of Fusion-1, Fusion-2 and Fusion-3.
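The fusion can be summarized by the following sketch. The pixel-wise maximum is the criterion described above; the wavelet family (Haar), the use of Sobel operators as the directional gradients, the normalization of each response map, and the upsampling of the half-resolution wavelet sub-bands back to image size are illustrative assumptions that the description above leaves open.

```python
# A sketch of Wavelet-Gradient-Fusion (WGF) under the stated assumptions.
import numpy as np
import cv2
import pywt

def norm01(a):
    """Rescale absolute responses to [0, 1] so the two sources are comparable."""
    a = np.abs(a).astype(np.float32)
    rng = float(a.max() - a.min())
    return (a - a.min()) / rng if rng > 0 else np.zeros_like(a)

def wgf(gray):
    # One-level wavelet decomposition: H, V and D detail sub-bands.
    _, (cH, cV, cD) = pywt.dwt2(gray.astype(np.float32), 'haar')
    h, w = gray.shape
    bands = [cv2.resize(norm01(b), (w, h)) for b in (cH, cV, cD)]
    # Directional Sobel gradients paired with the sub-band of the same
    # direction: (0,1) -> horizontal edges, (1,0) -> vertical, (1,1) -> diagonal.
    grads = [norm01(cv2.Sobel(gray, cv2.CV_32F, dx, dy))
             for dx, dy in ((0, 1), (1, 0), (1, 1))]
    # Fusion-1..3: keep the stronger response at every pixel, then fuse again.
    fused = [np.maximum(b, g) for b, g in zip(bands, grads)]
    return np.maximum.reduce(fused)  # final fused image
```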
B. Text Candidates

It is observed from the result of the previous section that the WGF method widens the gap between text and non-text pixels. Therefore, to classify text and non-text pixels, we use k-means clustering with k = 2 in a novel way, applying it to each row and each column separately, as shown in Figures 3(a) and (b), where the result of row-wise clustering loses some text information while the result of column-wise clustering does not. In each run, the cluster with the higher mean of the two is considered the text cluster. This is the advantage of the new row-wise and column-wise clustering: it helps in restoring the possible text information. The union of the row-wise and column-wise clustering results is considered as the text candidates for separating text and non-text information, as shown in Figure 3(c), where it is seen that the union operation includes some background information in addition to text.

Figure 3. Text candidates for text binarization: (a) k-means clustering row-wise, (b) k-means clustering column-wise, (c) union of (a) and (b).
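A minimal sketch of this step follows; the use of scikit-learn's KMeans is an implementation choice, and any two-cluster split of the one-dimensional intensity values would serve.

```python
# Row-wise and column-wise k-means (k = 2) with union of the text clusters.
import numpy as np
from sklearn.cluster import KMeans

def cluster_1d(values):
    """Cluster a 1-D intensity array into two groups; the higher-mean
    cluster is labeled text (1), the other non-text (0)."""
    km = KMeans(n_clusters=2, n_init=10).fit(values.reshape(-1, 1))
    text_label = int(np.argmax(km.cluster_centers_.ravel()))
    return (km.labels_ == text_label).astype(np.uint8)

def text_candidates(fused):
    rows = np.stack([cluster_1d(r) for r in fused])            # row-wise
    cols = np.stack([cluster_1d(c) for c in fused.T], axis=1)  # column-wise
    return rows | cols                                          # union
```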
C. Smoothing

It is observed from the text candidates that the shape of the characters is almost preserved, although the image may contain other background information. Therefore, the method considers the text candidate image as the reference image for cleaning up the background. The method identifies disconnections in the Canny edge map of the input image by testing a mutual nearest neighbor criterion on end points, as shown in Figure 4(a), where disconnections are marked by red rectangles. The mutual nearest neighbor criterion is defined as follows: if end point P1 is nearest to end point P2, then P2 must also be nearest to P1. This step is necessary because Canny gives good edge information for video text line images but also produces many disconnections due to low contrast and complex background. Each identified disconnection area is matched locally against the same position in the text candidate image to restore the missing text information, since the text candidate image loses much less text information than the Canny edge image. The result is shown in Figure 4(b), where almost all components are filled by a flood fill operation. However, noisy pixels remain in the background. To eliminate them, we perform projection profile analysis, which results in clear text information on a clean background, as shown in Figure 4(c).

Figure 4. Process of smoothing: (a) gap identification based on the mutual nearest neighbor criterion, (b) disconnections filled and noisy pixels identified, (c) clear connected components.

D. Foreground and Background Separation

The method considers the text in the smoothed image obtained from step C as connected components and fixes a bounding box around each to merge sub-components, if any, based on the nearest neighbor criterion, as shown in Figure 5(a). For each component in the merged image, the method extracts the maximum color information from the input image corresponding to the pixels of that component. It is found from the results of the maximum color extraction that the extracted color values refer to the border/edge of the components. This is valid because color values at or near edges are usually higher than those of pixels inside a component when there are holes inside the component. This observation helps us find the hole of each component by setting low values to black and high values to white, as shown in Figure 5(b). After separating text and non-text, the result is fed to the OCR engine [17] to test the recognition results. For example, for the result shown in Figure 5(b), the OCR engine recognizes the whole text correctly, as shown in Figure 5(c), where the recognition result is given in quotes.

Figure 5. Foreground and background separation by analyzing the color values at edge pixels and inside the components: (a) color values of edge pixels and inside the character, (b) foreground and background separated, (c) recognition result "successive year".
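The following sketch illustrates this separation. Grayscale intensities stand in for the color values, and the per-component Otsu split is one way to realize the assignment of low values to black and high values to white; both are assumptions, since no specific rule is fixed above.

```python
# Per-component separation of stroke (bright edge) and hole (dark interior).
import numpy as np
import cv2

def separate_fg_bg(gray, component_mask):
    """gray: input grayscale image; component_mask: 8-bit binary mask of the
    smoothed, merged components. Returns a binary text image."""
    out = np.zeros_like(gray, dtype=np.uint8)
    n, labels = cv2.connectedComponents(component_mask)
    for cid in range(1, n):
        region = labels == cid
        vals = gray[region].reshape(-1, 1)
        # Split this component's intensities into a bright (edge/stroke)
        # group and a dark (interior hole) group with Otsu's threshold.
        t, _ = cv2.threshold(vals, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        out[region & (gray > t)] = 255
    return out
```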
III. EXPERIMENTAL RESULTS

As there is no standard database for evaluating the proposed method, we created our own dataset, which includes 236 text lines selected from different news video sources and 258 text lines selected randomly from ICDAR 2003 competition scene images; in total, 494 text line images are considered. To measure the performance of the proposed method, we use the character recognition rate. For the comparative study, we implemented two baseline binarization methods [9, 10], and all methods are evaluated in terms of recognition rate.

Sample results of the proposed and existing methods on both video and ICDAR data are shown in Table 1, where we consider input images with low contrast, complex background, distorted text and different fonts. The quoted recognition results in Table 1 show that the OCR engine recognizes almost all of the outputs of the proposed method, while Niblack's method gives better results than Sauvola's method but worse results than the proposed method. For Sauvola's method, the OCR engine returns an empty string (" ") for almost all input images. The reason for the poor results lies in the use of thresholds for binarization: it is hard to fix optimal thresholds for video text lines because of their unpredictable characteristics. The proposed method, on the other hand, does not fix any threshold and takes advantage of the Wavelet-Gradient-Fusion and the color features for foreground and background separation. However, the proposed method sometimes fails for images of very low contrast, as shown in the last rows of the video and ICDAR data in Table 1. There is therefore scope for further improvement.

Table 1. Sample results of the proposed and existing methods. [Input text line images from video and ICDAR 2003 competition data, with the recognition results of the proposed (WGF), Niblack [9] and Sauvola [10] methods shown in quotes for each image.]

The OCR engine is also used to calculate the recognition rate for the input images without binarization (the "Before" column in Tables 2 and 3); the results for video data and ICDAR data are reported in Tables 2 and 3, respectively. The OCR engine gives slightly better results for ICDAR data than for video data. This is expected, because the ICDAR images have high contrast, whereas the video data has low contrast in addition to complex background. The results in Tables 2 and 3 show that the proposed method improves the recognition rate by 16.08% on video data and 15.79% on ICDAR data compared to recognition before binarization.

Table 2. Recognition rate of the proposed and existing methods on video data (in %)

              Before   After   Improvement
Proposed       48.49   64.57        +16.08
Niblack [9]    48.49   47.03         -1.46
Sauvola [10]   48.49   17.26        -31.23
Table 3. Recognition rate of the proposed and existing methods on ICDAR data (in %)

              Before   After   Improvement
Proposed       51.62   67.41        +15.79
Niblack [9]    51.62   42.30         -9.32
Sauvola [10]   51.62   19.98        -31.64

IV. CONCLUSION

In this work, we have proposed a new fusion method based on wavelet sub-bands and gradients of different directions, and we have shown that this fusion helps in enhancing text information. We used the k-means clustering algorithm in a new row-wise and column-wise way to obtain text candidates. The mutual nearest neighbor concept was proposed to identify true pairs of end pixels and restore missing text information. To separate foreground and background, we explored the color values at the edges and inside the components. The experimental results of the proposed and existing methods show that the proposed method outperforms the existing methods in terms of recognition rate. However, the reported recognition rate is not as high as in document analysis, because the Tesseract OCR engine is not font independent and robust; we plan to explore learning based methods to improve the recognition rate on a larger dataset.

ACKNOWLEDGMENT

This work was done jointly by the National University of Singapore (NUS), Singapore and Département Informatique, Polytech'Tours, France. This research is also supported in part by A*STAR grant 092 101 0051 (WBS no. R252-000-402305).

REFERENCES

[1] D. Doermann, J. Liang and H. Li, "Progress in Camera-Based Document Image Analysis", In Proc. ICDAR, 2003, pp. 606-616.
[2] J. Zang and R. Kasturi, "Extraction of Text Objects in Video Documents: Recent Progress", In Proc. DAS, 2008, pp. 5-17.
[3] K. Wang and S. Belongie, "Word Spotting in the Wild", In Proc. ECCV, 2010, pp. 591-604.
[4] X. Tang, X. Gao, J. Liu and H. Zhang, "A Spatial-Temporal Approach for Video Caption Detection and Recognition", IEEE Trans. Neural Networks, 2002, pp. 961-971.
[5] M. R. Lyu, J. Song and M. Cai, "A Comprehensive Method for Multilingual Video Text Detection, Localization, and Extraction", IEEE Trans. CSVT, 2005, pp. 243-255.
[6] A. Mishra, K. Alahari and C. V. Jawahar, "An MRF Model for Binarization of Natural Scene Text", In Proc. ICDAR, 2011, pp. 11-16.
[7] L. Neumann and J. Matas, "A Method for Text Localization and Recognition in Real-World Images", In Proc. ACCV, 2011, pp. 770-783.
[8] D. Chen and J. M. Odobez, "Video text recognition using sequential Monte Carlo and error voting methods", Pattern Recognition Letters, 2005, pp. 1386-1403.
[9] W. Niblack, "An Introduction to Digital Image Processing", Prentice Hall, Englewood Cliffs, 1986.
[10] J. Sauvola, T. Seppänen, S. Haapakoski and M. Pietikäinen, "Adaptive Document Binarization", In Proc. ICDAR, 1997, pp. 147-152.
[11] J. He, Q. D. M. Do, A. C. Downton and J. H. Kim, "A Comparison of Binarization Methods for Historical Archive Documents", In Proc. ICDAR, 2005, pp. 538-542.
[12] K. Ntirogiannis, B. Gatos and I. Pratikakis, "Binarization of Textual Content in Video Frames", In Proc. ICDAR, 2011, pp. 673-677.
[13] Z. Saidane and C. Garcia, "Robust Binarization for Video Text Recognition", In Proc. ICDAR, 2007, pp. 874-879.
[14] Z. Zhou, L. Li and C. L. Tan, "Edge based Binarization of Video Text Images", In Proc. ICPR, 2010, pp. 133-136.
[15] P. Shivakumara, T. Q. Phan and C. L. Tan, "A Laplacian Approach to Multi-Oriented Text Detection in Video", IEEE Trans. PAMI, 2011, pp. 412-419.
[16] G. Pajares and J. M. Cruz, "A wavelet-based image fusion tutorial", Pattern Recognition, 2004, pp. 1855-1872.
[17] Tesseract OCR. http://code.google.com/p/tesseract-ocr/