International Journal of Engineering Trends and Technology (IJETT) – Volume 22 Number 11 - April 2015

An Efficient Extraction of Text Objects from Images and Videos

Lucy Lalmuanchhungi #1, Dr. Suresh M. B. #2
#1 PG Student, #2 Prof. and Head, Department of Information Science & Engineering, East West Institute of Technology, Bangalore 560091, Karnataka, India.

Abstract—Text extraction from images and videos is used in a wide variety of applications such as mobile robot navigation, document retrieval, object identification, and vehicle license plate detection, so algorithms that automatically index text objects are needed. Such text often gives an indication of a scene's semantic content, but it can have arbitrary color, size, and orientation, and backgrounds may be complex and changing. Existing systems have focused only on detecting the spatial extent of text in individual video frames and images. Text in video, however, usually persists for several seconds; this constitutes a text event that should be entered only once in the index, so it is also necessary to determine the temporal extent of text events. Such text effects are common in television programs and commercials to attract viewer attention. In this paper, a text extraction algorithm using the Dual Tree Discrete Wavelet Transform is proposed.

Keywords—Connected component, Detection, Dual Tree Discrete Wavelet Transform, Extraction, Localization, Segmentation.

I. INTRODUCTION

Information nowadays is becoming increasingly enriched by multimedia components. Libraries that were originally pure text are continuously adding images, videos, and audio clips to their repositories, and large digital image and video libraries are emerging as well. They all need an automatic means to efficiently index and retrieve multimedia components. Text detection is the process of detecting and locating the regions that contain text in a given image or video, and it is the first step in obtaining textual information. However, text variations in size, style, orientation, and alignment, as well as low contrast and complex backgrounds, make automatic text detection extremely challenging. Text in images, especially in images that are part of web pages, and in videos is a powerful source of high-level semantics. If these text occurrences could be detected, segmented, and recognized automatically, they would be a valuable source of high-level semantics for indexing and retrieval.

Text in images carries useful information for the automatic annotation, indexing, and structuring of images. Text in videos can be classified into two types. The first type is caption or artificial text, which is superimposed on the video at editing time; caption text summarizes the content of the video and therefore plays an important role in video indexing. The second type is scene text, which naturally occurs in the field of view of the camera during video capture. Scene text occurring on banners and posters also provides important keywords that describe the video content, but its detection is a challenging task due to varying lighting, complex movement, and transformation.

Generally, text extraction in videos can be divided into the following stages: 1) text detection, finding regions in a video frame that contain text; 2) text localization, grouping text regions into text instances and generating a set of tight bounding boxes around all text instances; 3) text tracking, following a text event as it moves or changes over time and determining the temporal and spatial locations and extents of text events; 4) text binarization, binarizing the text bounded by text regions and marking text as one binary level and the background as the other; 5) text recognition, performing OCR on the binarized text image. Occasionally the binarization step is eliminated in favor of applying OCR directly on color or gray-level images. A skeletal outline of this pipeline is sketched below.
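As a rough illustration of how these five stages fit together, the following Python skeleton chains them for a single frame. Every function name here is a placeholder assumption rather than code from the paper, and the actual detection, tracking, binarization, and OCR methods are deliberately left unimplemented.

# Illustrative skeleton of the five-stage text-extraction pipeline described above.
# All function names are placeholders (assumptions), not the paper's actual code.
from typing import List, Tuple
import numpy as np

Box = Tuple[int, int, int, int]          # (x, y, width, height)

def detect_text_regions(frame: np.ndarray) -> np.ndarray:
    """Stage 1: return a binary mask of pixels likely to belong to text."""
    raise NotImplementedError

def localize_text(mask: np.ndarray) -> List[Box]:
    """Stage 2: group text pixels into instances and return tight bounding boxes."""
    raise NotImplementedError

def track_text(boxes_per_frame: List[List[Box]]) -> List[dict]:
    """Stage 3: follow each text event across frames (temporal and spatial extent)."""
    raise NotImplementedError

def binarize_text(frame: np.ndarray, box: Box) -> np.ndarray:
    """Stage 4: binarize the region inside a bounding box (text vs. background)."""
    raise NotImplementedError

def recognize_text(binary_patch: np.ndarray) -> str:
    """Stage 5: run OCR on the binarized patch (e.g. an external OCR engine)."""
    raise NotImplementedError

def extract_text(frame: np.ndarray) -> List[str]:
    """Chain stages 1, 2, 4 and 5 for a single frame; tracking applies across frames."""
    mask = detect_text_regions(frame)
    return [recognize_text(binarize_text(frame, box)) for box in localize_text(mask)]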
Examples of caption text and scene text are shown in figure 1.

Figure 1. Examples of (a) Caption text and (b) Scene text

II. LITERATURE SURVEY

A literature survey in this area summarizes some of the past works that have already been done in this field. For each work, a summary of the method and a few highlights of the approach are given.

Rainer Lienhart and Axel Wernicke [1] proposed a novel method for localizing and segmenting text in complex images and videos. Text lines are identified by using a complex-valued multilayer feed-forward network trained to detect text at a fixed scale and position. The network's output at all scales and positions is integrated into a single text-saliency map, serving as a starting point for candidate text lines. In the case of video, these candidate text lines are refined by exploiting the temporal redundancy of text. Localized text lines are then scaled to a fixed height of 100 pixels and segmented into a binary image with black characters on a white background. For videos, temporal redundancy is also exploited to improve segmentation performance. Input images and videos can be of any size due to a true multi-resolution approach. Moreover, the system is not only able to locate and segment text occurrences into large binary images, but is also able to track each text line with sub-pixel accuracy over the entire occurrence in a video, so that one text bitmap is created for all instances of that text line.

Kwang In Kim and Keechul Jung [2] proposed a novel texture-based method for detecting text in images. A support vector machine (SVM) is used to analyze the textural properties of text. No external texture feature extraction module is used; rather, the intensities of the raw pixels that make up the textural pattern are fed directly to the SVM, which works well even in high-dimensional spaces. Next, text regions are identified by applying a continuously adaptive mean shift algorithm (CAMSHIFT) to the results of the texture analysis. The combination of CAMSHIFT and SVMs produces both robust and efficient text detection, as time-consuming texture analysis of less relevant pixels is avoided, leaving only a small part of the input image to be texture-analyzed.

Lifang Gu [3] proposed an efficient algorithm for detecting and extracting text directly in MPEG video sequences. Most existing methods for text detection in video operate on raw pixel data and only output the location of the detected text regions in a single frame. This algorithm instead makes use of the features encoded in the compressed data to perform fast text detection. Motion information and the characteristics of text region continuity in multiple frames are then used to fine-tune the detected candidate text regions. Furthermore, characters are reliably extracted by an adaptive thresholding method after applying noise reduction filtering in multiple frames.
Such extracted characters can be directly fed into a conventional OCR system for recognition. Experimental results on several video sequences show that the algorithm is able to detect and extract text in MPEG video sequences with various scene complexities.

David Crandall and Sameer Antani [4] proposed an approach to extract text appearing in video, which often reflects a scene's semantic content. This is a difficult problem due to the unconstrained nature of general-purpose video: text can have arbitrary color, size, and orientation, and backgrounds may be complex and changing. Most work so far has made restrictive assumptions about the nature of text occurring in video and is therefore not directly applicable to unconstrained, general-purpose video. In addition, most work so far has focused only on detecting the spatial extent of text in individual video frames. However, text occurring in video usually persists for several seconds; this constitutes a text event that should be entered only once in the video index, so it is also necessary to determine the temporal extent of text events. This is a non-trivial problem because text may move, rotate, grow, shrink, or otherwise change over time. Such text effects are common in television programs and commercials but have so far received little attention in the literature. Their paper discusses detecting, binarizing, and tracking caption text in general-purpose MPEG-1 video. Solutions are proposed for each of these problems and compared with existing work found in the literature.

III. PROPOSED METHOD

The architecture of the proposed system is shown in figure 2. The input to the top layer of the architecture is either an image or a video. The video is first converted into video frames. Image preprocessing is used to change each video frame into a grayscale image and to resize it. The Dual Tree Discrete Wavelet Transform is then applied to the image or video frames. The next step is to extract the edges of the text from the frames and remove non-text regions. Finally, after applying segmentation methods, the extracted text output is obtained.

Figure 2. Proposed system architecture

A. Preprocessing of the Input

The test input, either an image or video frames, undergoes a preprocessing step in order to improve the quality of its appearance. Preprocessing may involve different processes such as rescaling, resizing, or noise removal. For color input images or video, the RGB components are combined into a single intensity image.

B. Dual Tree Discrete Wavelet Transform

A wavelet is basically used to represent a data set in the form of differences and averages, called detail components and average components. These components are used to find variations at different scales present in the data set. The Dual Tree Discrete Wavelet Transform (DTDWT) decomposes the input signal into four different sub-bands: one average component (LL) and three detail components (LH, HL, HH), as shown in figure 3. The text pixels in the input image are enhanced by taking the average of these three detail components, which increases the gap between the text and non-text pixels in the input image.
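To make the sub-band averaging concrete, the following minimal sketch implements the preprocessing and enhancement steps of Sections III.A-B. A single-level standard DWT from PyWavelets is used here as a stand-in for the authors' dual-tree transform, since it yields exactly the LL/LH/HL/HH layout described above; the wavelet choice, target size, and function name are illustrative assumptions rather than details reported in the paper.

# Minimal sketch of the preprocessing and sub-band averaging steps (Sections III.A-B).
# A single-level standard DWT is used as a stand-in for the Dual Tree DWT.
import cv2
import numpy as np
import pywt

def enhance_text_pixels(frame_bgr: np.ndarray, size=(512, 512)) -> np.ndarray:
    # Preprocessing: combine the RGB components into one intensity image and resize.
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.resize(gray, size).astype(np.float64)

    # Decompose into one average (LL) and three detail sub-bands (LH, HL, HH).
    LL, (LH, HL, HH) = pywt.dwt2(gray, 'haar')

    # Average the three detail sub-bands; text pixels, which produce strong
    # high-frequency responses, stand out against smoother background regions.
    detail_avg = (np.abs(LH) + np.abs(HL) + np.abs(HH)) / 3.0
    return detail_avg

Calling enhance_text_pixels on a frame read with cv2.imread or cv2.VideoCapture yields the enhanced detail map consumed by the edge-extraction stage that follows.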
The traditional Discrete Wavelet Transform (DWT) produces broadly similar results, but it lacks directionality and shift-invariance. The DTDWT, on the other hand, provides both properties and therefore yields more reliable results.

Figure 3. Result of DTDWT decomposition: (a) original image, (b) R band, (c) G band, (d) B band, (e) average of all, (f) Laplacian mask

C. Extraction of Text Edges

After wavelet decomposition, we obtain the variations in the values of text and non-text pixels from the three detail sub-bands. These variation features are processed with a Laplacian mask in order to detect discontinuities in four different directions: horizontal, vertical, up-left, and up-right. After applying Laplacian filtering, we obtain positive and negative values; the transition between these two values corresponds to the transition between text and background. The Maximum Gradient Difference (MGD) method is then applied to these values in order to form a map of maximum and minimum values in the image. Candidate text regions always have larger MGD values than non-text regions because they contain a greater number of positive and negative peaks. Normalization is then performed on the input image based on these values in order to obtain a binary image. Figure 4 shows the wavelet colors and features of the input image.

Figure 4. Wavelet colors and features: (g) MGD, (h) binary image

D. Non-Text Region Removal

Before removal of non-text regions from the image, different morphological operations such as dilation or filling are performed in order to bridge any gaps present in the edges of the input image. Connected components in the image regions are then calculated, resulting in a label map with values starting from zero, where zero indicates background and any other label indicates a text object in the image region. Different properties of these regions are then measured and used to plot the bounding boxes of the text regions. Finally, thresholding is performed based on these properties to extract the text output from the input image.
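A rough sketch of these two steps, operating on the averaged detail map produced by the earlier sketch, is given below. The MGD window, binarization threshold, morphological kernel, and area/aspect-ratio filter are all illustrative assumptions rather than values reported in the paper.

# Rough sketch of the edge-extraction and non-text filtering steps (Sections III.C-D).
import cv2
import numpy as np
from scipy.ndimage import maximum_filter, minimum_filter

def extract_text_boxes(detail_avg: np.ndarray, mgd_window=21):
    # Laplacian filtering highlights discontinuities (transitions between
    # text strokes and background) as alternating positive and negative peaks.
    lap = cv2.Laplacian(detail_avg, cv2.CV_64F, ksize=3)

    # Maximum Gradient Difference: within a horizontal sliding window, text
    # regions show a large spread between the strongest positive and negative peaks.
    mgd = maximum_filter(lap, size=(1, mgd_window)) - minimum_filter(lap, size=(1, mgd_window))

    # Normalize the MGD map and threshold it to obtain a binary candidate map.
    mgd = (mgd - mgd.min()) / (mgd.max() - mgd.min() + 1e-9)
    binary = (mgd > 0.4).astype(np.uint8)

    # Morphological closing bridges small gaps along character edges.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 3))
    binary = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)

    # Connected-component labelling: label 0 is background, all other labels are
    # candidate text objects; simple geometric properties prune non-text regions.
    n, labels, stats, _ = cv2.connectedComponentsWithStats(binary)
    boxes = []
    for i in range(1, n):
        x, y, w, h, area = stats[i]
        if area > 50 and w > h:          # crude text-like shape filter (assumed values)
            boxes.append((x, y, w, h))
    return boxes

Note that, because the wavelet sub-bands are half the size of the input frame, the returned boxes would need to be scaled up by a factor of two before being drawn on the original image.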
IV. EXPERIMENTAL RESULTS

When the input video shown in figure 5 is given to the proposed text extraction system, the results shown in figures 6-10 are obtained. Figure 6 shows the image after applying the Laplacian mask, figure 7 shows the image after morphological region filling, and figure 8 shows the image after noise removal. Figure 9 shows the input image frame with bounding boxes around the text regions, and finally figure 10 shows the extracted text output.

Figure 5. Input to the system
Figure 6. Laplacian mask
Figure 7. Region filling
Figure 8. Noise removal
Figure 9. Text with bounding boxes
Figure 10. Extracted text output

V. CONCLUSION

The proposed text extraction system has been tested on various types of input. The system uses methods based on edge detection and makes use of the Dual Tree Discrete Wavelet Transform, which proves to be a very efficient wavelet decomposition method compared to traditional transform methods. The system is designed so that text present in the input images is detected automatically and extracted efficiently. However, there are still areas that need improvement, such as detecting blurred text regions in images of very poor resolution.

REFERENCES

[1] Rainer Lienhart and Axel Wernicke, "Localizing and Segmenting Text in Images and Videos," IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, no. 4, April 2002.
[2] Kwang In Kim, Keechul Jung, and Jin Hyung Kim, "Texture-Based Approach for Text Detection in Images Using Support Vector Machines and Continuously Adaptive Mean Shift Algorithm," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 12, December 2003.
[3] Lifang Gu, "Text Detection and Extraction in MPEG Video Sequences," CBMI '01, Brescia, Italy, September 19-21, 2001.
[4] David Crandall, Sameer Antani, and Rangachar Kasturi, "Extraction of special effects caption text events from digital video," IJDAR, vol. 5, pp. 138–157, 2003.
[5] M. Padmaja and J. Sushma, "Text Detection in Color Images," International Conference on Intelligent Agent & Multi-Agent Systems, Chennai, 22-24 July 2009, pp. 1-6, 2009.
[6] Keechul Jung, Kwang In Kim, and Anil K. Jain, "Text information extraction in images and video: a survey," Pattern Recognition, vol. 37, no. 5, pp. 977–997, 2004.
[7] Chung-Wei Liang and Po-Yueh Chen, "DWT Based Text Localization," International Journal of Applied Science and Engineering, pp. 105-116, 2004.
[8] Xiaoqing Liu and Jagath Samarabandu, "Multiscale edge-based text extraction from complex images," IEEE International Conference on Multimedia and Expo, Toronto, Canada, pp. 1721-1724, 2006.
[9] Xiao-Wei Zhang, Xiong-Bo Zheng, and Zhi-Juan Weng, "Text Extraction Algorithm Under Background Image Using Wavelet Transforms," Proceedings of the 2008 International Conference on Wavelet Analysis and Pattern Recognition, Hong Kong, 30-31 Aug. 2008.
[10] M. Cai, J. Song, and M. R. Lyu, "A new approach for video text detection," in Proc. Int. Conf. Image Processing, Rochester, NY, Sep. 2002, pp. 117–120.
[11] L. Agnihotri and N. Dimitrova, "Text detection for video analysis," in Proc. IEEE Workshop on Content-Based Access of Image and Video Libraries, 1999, pp. 109–113.