An Efficient Extraction of Text Objects from Images and Videos
Lucy Lalmuanchhungi#1, Dr. Suresh M.B.#2
#1 PG Student, #2 Prof. and Head, Department of Information Science & Engineering, East West Institute of Technology, Bangalore 560091, Karnataka, India.
Abstract—Text extraction from images and videos is used in a large variety of applications such as mobile robot navigation, document retrieval, object identification, and vehicle license plate detection. Algorithms that automatically index text objects are therefore needed, since such text often gives an indication of a scene's semantic content. Text can have arbitrary color, size, and orientation, and backgrounds may be complex and changing. Existing systems have focused only on detecting the spatial extent of text in individual video frames and images. However, text occurring in video usually persists for several seconds; it constitutes a text event that should be entered only once in the index, so it is also necessary to determine the temporal extent of text events. Such text effects are common in television programs and commercials to attract viewer attention. A text extraction algorithm using the Dual Tree Discrete Wavelet Transform is proposed.
Keywords—Connected component, Dual Tree Discrete Wavelet Transform, Detection, Extraction, Localization and Segmentation.
I. INTRODUCTION
Information nowadays is becoming increasingly
enriched by multimedia components. Libraries that were
originally pure text are continuously adding images, videos,
and audio clips to their repositories, and large digital image
and video libraries are emerging as well. They all need an
automatic means to efficiently index and retrieve
multimedia components. Text detection is the process of
detecting and locating those regions that contain text in
a given image or video and is the first step in obtaining
textual information. However, text variations related to size,
style, orientation, and alignment, as well as low contrast
and complex backgrounds make the problem of automatic
text detection extremely challenging.
Text in images, especially in images that are part of web pages and in videos, is one powerful source of high-level semantics. If these text occurrences could be detected, segmented, and recognized automatically, they would be a valuable source of high-level semantics for indexing and retrieval. Text in images includes useful information for the automatic annotation, indexing, and structuring of images. Text in videos can be classified into two types. The
first type is caption or artificial text which is artificially
superimposed on the video at the time of editing. Caption
text summarizes the content of the video, and therefore it
plays an important role in making it easier for the task of
video indexing. The second type of text is scene text, which naturally occurs in the field of view of the camera during video capture. Scene text occurring on banners and posters also provides important keywords that describe the video content. Obviously, detection of scene text is a challenging task due to varying lighting, complex movement, and transformation.

Generally, text extraction in videos can be divided into the following stages: 1) Text detection, finding regions in a video frame that contain text; 2) Text localization, grouping text regions into text instances and generating a set of tight bounding boxes around all text instances; 3) Text tracking, following a text event as it moves or changes over time and determining the temporal and spatial locations and extents of text events; 4) Text binarization, binarizing the text bounded by text regions and marking text as one binary level and the background as the other; 5) Text recognition, performing OCR on the binarized text image. Occasionally, the binarization step is eliminated in favor of applying OCR directly on color/gray-level images. Examples of caption text and scene text are shown in figure 1.
Figure 1. Examples of (a) Caption text and (b) Scene text
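To make the notion of a text event in stage 3 above concrete, the short sketch below shows one possible record for a tracked text occurrence, combining its spatial extent (a tight bounding box) with its temporal extent (first and last frame of appearance). The field names are illustrative assumptions and are not part of the original system.

from dataclasses import dataclass
from typing import Tuple

@dataclass
class TextEvent:
    """Hypothetical record for one tracked text occurrence (illustrative only)."""
    box: Tuple[int, int, int, int]  # spatial extent: x, y, width, height of the bounding box
    first_frame: int                # temporal extent: frame where the text first appears
    last_frame: int                 # temporal extent: frame where the text disappears
    text: str = ""                  # recognition result after binarization and OCR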
II. LITERATURE SURVEY
This literature survey summarizes some of the past work that has already been done in this field. For each work, a summary of the methods and a few highlights of the approach are given.
Rainer Lienhart and Axel Wernicke [1] proposed a
novel method for localizing and segmenting text in complex
images and videos. Text lines are identified by using a
complex-valued multilayer feed-forward network trained to
detect text at a fixed scale and position. The network's
output at all scales and positions is integrated into a single
text-saliency map, serving as a starting point for candidate
text lines. In the case of video, these candidate text lines are
refined by exploiting the temporal redundancy of text in
video. Localized text lines are then scaled to a fixed height
of 100 pixels and segmented into a binary image with black
characters on white background. For videos, temporal
redundancy is exploited to improve segmentation
performance. Input images and videos can be of any size
due to a true multi-resolution approach. Moreover, the
system is not only able to locate and segment text
occurrences into large binary images, but is also able to
track each text line with sub-pixel accuracy over the entire
occurrence in a video, so that one text bitmap is created for
all instances of that text line.
Kwang In Kim and Keechul Jung [2] proposed a
novel texture-based method for detecting texts in images. A
support vector machine (SVM) is used to analyze the
textural properties of texts. No external texture feature
extraction module is used; rather, the intensities of the raw
pixels that make up the textural pattern are fed directly to
the SVM, which works well even in high-dimensional
spaces. Next, text regions are identified by applying a
continuously adaptive mean shift algorithm (CAMSHIFT)
to the results of the texture analysis. The combination of
CAMSHIFT and SVMs produces both robust and efficient
text detection, as time-consuming texture analyses for less
relevant pixels are restricted, leaving only a small part of
the input image to be texture-analyzed.
Lifang Gu [3] proposed an efficient algorithm for
detecting and extracting text directly in MPEG video
sequences. Most existing methods for text detection in
video usually operate on raw pixel data and only output the
location of the detected text regions in a single frame. Gu's
algorithm makes use of the features encoded in compressed
data to perform fast text detection. Motion information and
the characteristics of text region continuity in multiple
frames are then used to fine-tune the detected candidate text
regions. Furthermore, characters are reliably extracted by
an adaptive thresholding method after applying some noise
reduction filtering in multiple frames. Such extracted
characters can be directly fed into a conventional OCR
system for recognition. Experimental results on several
video sequences show that the proposed algorithm is able to
detect and extract text in MPEG video sequences with
various scene complexities.
David Crandall and Sameer Antani [4] proposed an
approach to extract text appearing in video, which often
reflects a scene's semantic content. This is a difficult
problem due to the unconstrained nature of general-purpose
video. Text can have arbitrary color, size, and orientation.
Backgrounds may be complex and changing. Most work so
far has made restrictive assumptions about the nature of text
occurring in video. Such work is therefore not directly
applicable to unconstrained, general-purpose video. In
addition, most work so far has focused only on detecting
the spatial extent of text in individual video frames.
However, text occurring in video usually persists for
several seconds. This constitutes a text event that should be
entered only once in the video index. Therefore it is also
necessary to determine the temporal extent of text events.
This is a non-trivial problem because text may move, rotate,
grow, shrink, or otherwise change over time. Such text
effects are common in television programs and
commercials but so far have received little attention in the
literature. This paper discusses detecting, binarizing, and
tracking caption text in general-purpose MPEG-1 video.
Solutions are proposed for each of these problems and
compared with existing work found in the literature.
III. PROPOSED METHOD
The architecture of the proposed system is shown in figure 2. The input to the top layer of the architecture is either an image or a video. A video is first converted into video frames. Image pre-processing is used to change each video frame into a gray scale image and also to resize the image. The Dual Tree Discrete Wavelet Transform algorithm is then applied to the images or video frames. The next step is to extract the edges of the text from the video frames and remove non-text regions. Finally, after applying segmentation methods, the extracted text output is obtained.
Figure 2. Proposed system architecture
A. Preprocessing of the Input
The test input, either an image or video frames, undergoes a preprocessing step in order to improve the quality of its appearance. The preprocessing step may involve different processes such as rescaling, resizing, or noise removal. For a color input image or video, the RGB components undergo a preprocessing step that results in a combined intensity image.
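As an illustration, the snippet below shows one plausible OpenCV-based realization of this step. The target width and the median-filter size are assumed values, since the paper does not specify the rescaling or noise-removal parameters.

import cv2

def preprocess(frame, target_width=640):
    """Combine the RGB components into an intensity image, resize, and denoise.
    target_width and the 3x3 median filter are assumptions, not paper values."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)   # combined intensity image
    scale = target_width / gray.shape[1]
    gray = cv2.resize(gray, None, fx=scale, fy=scale,
                      interpolation=cv2.INTER_AREA)  # rescale to a common width
    return cv2.medianBlur(gray, 3)                   # simple noise removal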
B. Dual Tree Discrete Wavelet Transform
A wavelet is basically used to represent a data set in the form of differences and averages, called detail components and average components. These averages are used to find variations at different scales present in the data set. The Dual Tree Discrete Wavelet Transform (DTDWT) decomposes the input signal into four different sub-bands, with one average component (LL) and three detail components (LH, HL, HH), as shown in figure 3. The text pixels in the input image are enhanced by taking the average of these three detail components. Averaging these averages again further increases the gap between the text and non-text pixels in the input image.
The traditional Discrete Wavelet Transform (DWT) produces nearly similar results, but it lacks the properties of directionality and shift invariance and is also computationally slower. The DTDWT, on the other hand, provides both directionality and shift invariance and therefore yields more efficient results.
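A minimal sketch of the sub-band averaging step is given below. It uses the standard single-level 2-D DWT from PyWavelets as a stand-in for the dual-tree transform (a full DTDWT implementation, for example the dtcwt package, would add the directional and shift-invariant properties discussed above); the choice of the Haar wavelet is an assumption.

import numpy as np
import pywt

def detail_average(gray):
    """Decompose into LL, LH, HL, HH and average the three detail sub-bands
    to enhance text pixels. A plain Haar DWT stands in for the DTDWT here."""
    ll, (lh, hl, hh) = pywt.dwt2(gray.astype(np.float64), 'haar')
    detail = (np.abs(lh) + np.abs(hl) + np.abs(hh)) / 3.0
    return ll, detail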
Figure 3. Result of DTDWT decomposition [(a) original image, (b) R band, (c) G band, (d) B band, (e) average of all, (f) Laplacian mask]
C. Extraction of Text Edges
After wavelet decomposition, we obtain variations in the values of text and non-text pixels using the three detail sub-bands. These variation features are processed with a Laplacian mask in order to detect discontinuities in four different directions: horizontal, vertical, up-left, and up-right. After applying Laplacian filtering, we obtain positive and negative values; the transition between these two values corresponds to the transition between text and background. The Maximum Gradient Difference (MGD) method is then applied to these values in order to form a map of maximum and minimum values in the image. Candidate text regions always have larger MGD values than non-text regions because they contain more positive and negative peaks. Normalization is then performed on the input image based on these values in order to obtain a binary image. Figure 4 shows the wavelet colors and features of the input image.
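A rough sketch of this step is shown below. The isotropic SciPy Laplacian stands in for the four directional masks, and the window size and thresholding rule are assumptions made only for illustration, not values given in the paper.

import numpy as np
from scipy import ndimage

def candidate_text_map(detail, window=21):
    """Laplacian filtering followed by a Maximum Gradient Difference (MGD) map.
    The 21-pixel horizontal window and mean+std threshold are assumed values."""
    lap = ndimage.laplace(detail)                  # positive/negative transitions at edges
    mgd = (ndimage.maximum_filter(lap, size=(1, window)) -
           ndimage.minimum_filter(lap, size=(1, window)))
    thresh = mgd.mean() + mgd.std()                # simple normalization rule
    return (mgd > thresh).astype(np.uint8)         # binary map of candidate text pixels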
D. Non-Text Region Removal
Before removal of non-text regions from the image, different morphological operations such as dilation or filling are performed in order to bridge any gaps present in the edges of the input image. Connected components in the image regions are then calculated, resulting in a label matrix in which the value zero indicates background and non-zero values indicate text objects in the image region. Different properties of these image regions are then measured and used to plot the bounding boxes of the text regions. Finally, thresholding is performed based on these properties to extract the text output from the input image.
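The sketch below illustrates this step with SciPy's labeling utilities; the dilation amount and the area and aspect-ratio thresholds are illustrative assumptions rather than values from the paper. Chained with the previous step, boxes = text_bounding_boxes(candidate_text_map(detail)) would yield candidate text boxes for one frame.

import numpy as np
from scipy import ndimage

def text_bounding_boxes(binary, min_area=50, min_aspect=1.2):
    """Bridge gaps, label connected components (0 = background), and keep
    components whose measured properties look text-like."""
    closed = ndimage.binary_fill_holes(ndimage.binary_dilation(binary, iterations=2))
    labels, _ = ndimage.label(closed)
    boxes = []
    for region in ndimage.find_objects(labels):
        h = region[0].stop - region[0].start
        w = region[1].stop - region[1].start
        if w * h >= min_area and w / max(h, 1) >= min_aspect:       # simple property thresholds
            boxes.append((region[1].start, region[0].start, w, h))  # x, y, width, height
    return boxes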
Figure 4. Wavelet colors and features [(g) MGD, (h) binary image]
IV. EXPERIMENTAL RESULTS
When the input video shown in figure 5 is given to the proposed text extraction system, the results shown in the following figures are obtained. Figure 6 shows the image after applying the Laplacian mask, figure 7 shows the image after morphological region filling, and figure 8 shows the image after noise removal. Figure 9 shows the input image frame with bounding boxes around the text regions, and finally figure 10 shows the extracted text output.
Figure 5. Input to the system
Figure 6. Laplacian Mask
Figure 7. Region Filling
Figure 8. Noise Removal
Figure 9. Text with Bounding Boxes
Figure 10. Extracted text output

V. CONCLUSION
The proposed text extraction system has been tested on various types of input. The proposed system uses methods based on edge detection. It makes use of the Dual Tree Discrete Wavelet Transform, which proves to be a very efficient wavelet decomposition method compared to traditional transform methods. The system is designed in such a way that text present in the input images is detected automatically and hence extracted efficiently. However, there is still room for improvement in detecting blurred text regions and images with very poor resolution.
REFERENCES
[1] Rainer Lienhart and Axel Wernicke, "Localizing and Segmenting Text in Images and Videos," IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, no. 4, April 2002.
[2] Kwang In Kim, Keechul Jung, and Jin Hyung Kim, "Texture-Based Approach for Text Detection in Images Using Support Vector Machines and Continuously Adaptive Mean Shift Algorithm," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 12, December 2003.
[3] Lifang Gu, "Text Detection and Extraction in MPEG Video Sequences," CBMI '01, Brescia, Italy, September 19-21, 2001.
[4] David Crandall, Sameer Antani, and Rangachar Kasturi, "Extraction of special effects caption text events from digital video," IJDAR (2003) 5: 138–157.
[5] M. Padmaja and J. Sushma, "Text Detection in Color Images," International Conference on Intelligent Agent & Multi-Agent Systems, Chennai, 22-24 July 2009, pp. 1-6, 2009.
[6] Keechul Jung, Kwang In Kim, and Anil K. Jain, "Text information extraction in images and video: a survey," Pattern Recognition, vol. 37(5), pp. 977–997, 2004.
[7] Chung-Wei Liang and Po-Yueh Chen, "DWT Based Text Localization," International Journal of Applied Science and Engineering, pp. 105-116, 2004.
[8] Xiaoqing Liu and Jagath Samarabandu, "Multiscale edge-based text extraction from complex images," IEEE International Conference on Multimedia and Expo 2006, Toronto, Canada, pp. 1721-1724, 2006.
[9] Xiao-Wei Zhang, Xiong-Bo Zheng, and Zhi-Juan Weng, "Text Extraction Algorithm Under Background Image Using Wavelet Transforms," Proceedings of the 2008 International Conference on Wavelet Analysis and Pattern Recognition, Hong Kong, 30-31 Aug. 2008.
[10] M. Cai, J. Song, and M. R. Lyu, "A new approach for video text detection," in Proc. Int. Conf. Image Process., Rochester, NY, Sep. 2002, pp. 117–120.
[11] L. Agnihotri and N. Dimitrova, "Text detection for video analysis," in Proc. IEEE Workshop Content-Based Access Image Video Libraries, 1999, pp. 109–113.