INTERNATIONAL JOURNAL OF ENGINEERING TECHNOLOGY AND COMPUTER APPLICATIONS.
Dr. Sami M. Halawani and Mohamed Shajahan
King Abdulaziz University, Rabigh, KSA
Abstract: Overlay text provides important semantic clues for video content analysis tasks such as video information retrieval and summarization, since the content of the scene or the editor's intention can be well represented by inserted text. Overlay text extraction for video optical character recognition (OCR) is more challenging than text extraction for OCR of document images, owing to the numerous difficulties resulting from complex backgrounds, unknown text color, size, and so on. The proposed system addresses overlay text detection, the OCR process, and overlay removal. For overlay text detection a transition map is used; for the OCR process a font matching method is used; and overlay removal is handled by a hybrid inpainting method that combines a sub-patch method and a weighted interpolation method. Using the proposed method one obtains the marked overlay texts, a list of the unique overlay texts, and videos with the overlay text removed. The method performs better than previous methods, and the resulting accuracy is greatly improved.

Keywords
Overlay text, Transition map, Video optical character recognition, Inpaint, Video restoration.

1. Introduction
With the development of video editing technology, overlay text is increasingly inserted into video content to provide viewers with better visual understanding. Most broadcast videos use overlay text to convey a more direct summary of the semantics and to deliver a better viewing experience. For example, headlines summarize the reports in news videos, and subtitles in documentary dramas help viewers understand the content. Sports videos also contain text describing the scores and team or player names [1]. In general, text displayed in videos can be classified into scene text and overlay text [2]. Scene text occurs naturally in the background as a part of the scene, such as advertising boards, banners, and so on. In contrast, overlay text is superimposed on the video scene and used to help viewers' understanding. Since overlay text is highly compact and structured, it can be used for video indexing and retrieval [3]. The main aims of this research are:
To propose a novel framework to detect the overlay text information in video frames.
To extract the overlay text into ASCII format strings.
To restore the video frame without overlay text using hybrid inpainting.

2. Implementation Methodology
Many methods have already been implemented for overlay text detection. Color-based methods do not work well because of non-uniform color distribution. Most existing video text detection methods have been proposed on the basis of color, edge, and texture-based features. The method proposed by Agnihotri [13] concentrates on the red color component instead of all three color components. Some methods use high-contrast video frames to extract the text. Kim et al. [14] use the RGB color space and a clustering concept, but no clustering method is fully efficient, so text detection is not much better in this case. Edge-based methods have had little success because of complex backgrounds. A modified edge map introduced by Lyu et al. [15] provides some improvement in overlay text detection. Texture-based approaches are also used for overlay text detection; another method, based on the similarity of corner points, is explained in Bertini [16]. Later methods use interpolation filters, wavelet coefficients, and so on. In the method proposed by Wonjun Kim [5] the transition map model is introduced, but it supports neither an OCR process nor overlay text removal. That method also takes more time because the frame update is done after text detection.
Thus, existing methods experience difficulties in handling text with various contrasts or text inserted into a complex background. This research work proposes a novel framework for detecting and extracting the overlay text from the video scene, converting it to ASCII text, and restoring the video. Based on the observation that there exist transient colors between inserted text and its adjacent background, a transition map is first generated. Candidate regions are then extracted by a reshaping method, and the overlay text regions are determined based on the occurrence of overlay text in each candidate. The detected overlay text regions are localized accurately using the projection of overlay text pixels in the transition map, and the text extraction is finally conducted. A video OCR method is adopted to convert the text to ASCII form. Finally, the video is restored (without overlay text) using the hybrid inpaint method. The steps of the working methodology of the proposed system are described as follows.
Video Frame Extraction
The input video should be in .AVI format and composed of 24-bit frames. The video is split into frames, and each frame is converted into an image.
Transition Map Generation
As a rule of thumb, if the background of overlay text is dark, the overlay text tends to be bright; conversely, the overlay text tends to be dark if the background is bright. Therefore, there exist transient colors between overlay text and its adjacent background due to color bleeding caused by the lossy video compression. The intensities at the boundary of overlay text are observed to have a logarithmical change: the intensities of three consecutive pixels decrease logarithmically at the boundary of bright overlay text, and increase exponentially at the boundary of dark overlay text. To find the intensity change in the transition region, three steps are adopted:
1. Saturation calculation
2. Modified saturation calculation
3. Transition map generation
If a pixel satisfies the logarithmical change constraint, the three consecutive pixels centered on the current pixel are detected as transition pixels and the transition map is generated. A detailed explanation is given below.

The saturation is computed from the R, G, B components as

  S(x, y) = 1 - [3 / (R + G + B)] * min(R, G, B)    (1)

The modified saturation is defined as

  S~(x, y) = S(x, y) / max_S(x, y), where
  max_S(x, y) = 2 * I~(x, y)        if I~(x, y) <= 0.5
              = 2 * (1 - I~(x, y))  otherwise       (2)

where S(x, y) and max_S(x, y) denote the saturation value and the maximum saturation value at the corresponding intensity level, respectively, and I~(x, y) denotes the intensity at (x, y), normalized to [0, 1]. Based on the conical HSI color model [4], the maximum value of saturation is normalized in accordance with I~(x, y) compared to 0.5 in (2). The transitions D_L and D_H can thus be defined by combining the change of intensity and the modified saturation as follows:

  D_L(x, y) = (1 + dS_L(x, y)) * |I(x - 1, y) - I(x, y)|
  D_H(x, y) = (1 + dS_H(x, y)) * |I(x + 1, y) - I(x, y)|    (3)

where dS_L(x, y) = |S~(x - 1, y) - S~(x, y)| and dS_H(x, y) = |S~(x + 1, y) - S~(x, y)|. Since the weights dS_L and dS_H can be zero for achromatic overlay text and background, 1 is added to the weight in (3). If a pixel satisfies the logarithmical change constraint given in (4), the three consecutive pixels centered on the current pixel are detected as transition pixels and the transition map T is generated:

  T(x, y) = 1, if D_H(x, y) - D_L(x, y) > TH
          = 0, otherwise                           (4)

The threshold value TH is empirically set to 80 in consideration of the logarithmical change.
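For illustration, the transition-map computation of eqs. (1)-(4) can be sketched in plain Python for a single image row. This is a sketch, not the authors' implementation: the averaged intensity, the per-side intensity differences, and the exact form of the constraint test are assumptions made to match the equations above.

```python
def transition_map(rgb_row, th=80.0):
    """Sketch of the transition map for one image row (eqs. (1)-(4)).

    rgb_row: list of (R, G, B) tuples with components in 0..255.
    Returns a 0/1 list marking transition pixels.
    """
    n = len(rgb_row)

    def intensity(p):                       # plain average intensity, 0..255
        return sum(p) / 3.0

    def saturation(p):                      # eq. (1)
        s = sum(p)
        return 1.0 - 3.0 * min(p) / s if s else 0.0

    def mod_saturation(p):                  # eq. (2), conical HSI model
        i_norm = intensity(p) / 255.0       # intensity normalized to [0, 1]
        max_s = 2.0 * i_norm if i_norm <= 0.5 else 2.0 * (1.0 - i_norm)
        return saturation(p) / max_s if max_s else 0.0

    t = [0] * n
    for x in range(1, n - 1):
        left, cur, right = rgb_row[x - 1], rgb_row[x], rgb_row[x + 1]
        ds_l = abs(mod_saturation(left) - mod_saturation(cur))
        ds_h = abs(mod_saturation(right) - mod_saturation(cur))
        d_l = (1.0 + ds_l) * abs(intensity(left) - intensity(cur))   # eq. (3)
        d_h = (1.0 + ds_h) * abs(intensity(right) - intensity(cur))
        if d_h - d_l > th:                  # eq. (4), TH = 80
            t[x - 1] = t[x] = t[x + 1] = 1  # mark three consecutive pixels
    return t
```

For an achromatic dark-to-bright step such as three pixels of gray level 10 followed by three of 250, the pixel just before the jump satisfies the constraint, and it and its two neighbors are marked as transition pixels.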
Update Frames
The difference between the previous frame's transition map and the current frame's transition map decides whether to process or neglect the current frame. A threshold is used here for the decision making.
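This decision can be sketched as follows; the 5% change ratio is an illustrative threshold, as the text does not specify a value.

```python
def should_process(prev_map, cur_map, thresh=0.05):
    """Decide whether to process the current frame.

    If the transition maps of consecutive frames differ in fewer than
    `thresh` of their pixels, the overlay text is assumed unchanged and
    the current frame is neglected. `thresh` is an assumed value.
    """
    changed = sum(1 for a, b in zip(prev_map, cur_map) if a != b)
    return changed / len(cur_map) > thresh
```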
Candidate Region Detection
The transition map can be utilized as a useful indicator of the overlay text region. To generate connected components, a linked map is first generated [5]: if a gap of consecutive pixels between two nonzero points in the same row is shorter than 7% of the image width, the gap pixels are filled with 1s. Next, a hole-filling algorithm is used to fill the small gaps and maintain connectivity. Each connected component is then reshaped to have smooth boundaries. Since it is reasonable to assume that overlay text regions are generally rectangular, a rectangular bounding box is generated by linking the four points (min_x, min_y), (max_x, min_y), (min_x, max_y), (max_x, max_y) taken from the link map, yielding the candidate regions.
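The two operations above, the row-wise gap filling of the linked map and the four-corner bounding box, can be sketched as:

```python
def linked_map_row(row, width_frac=0.07):
    """Fill gaps between nonzero transition pixels in one row when the gap
    is shorter than 7% of the image width (the linked-map step of [5])."""
    w = len(row)
    out = list(row)
    nonzero = [i for i, v in enumerate(row) if v]
    for a, b in zip(nonzero, nonzero[1:]):
        if b - a - 1 < width_frac * w:      # gap shorter than 7% of width
            for i in range(a + 1, b):
                out[i] = 1
    return out

def bounding_box(points):
    """Rectangular bounding box of candidate pixels, as the four corners
    (min_x, min_y), (max_x, min_y), (min_x, max_y), (max_x, max_y)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return ((min(xs), min(ys)), (max(xs), min(ys)),
            (min(xs), max(ys)), (max(xs), max(ys)))
```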
Overlay Text Region Determination
In this subsection, we introduce a texture-based approach for overlay text region determination. Based on the observation that the intensity variation around a transition pixel is large due to the complex structure of the overlay text, we employ the local binary pattern (LBP) introduced in [6] to describe the texture around the transition pixel. LBP is a very efficient and simple tool for representing the consistency of texture using only the intensity pattern. LBP forms a binary pattern from the current pixel and all of its square neighbor pixels, and can be converted into a decimal number as follows:
  LBP_P = sum over i = 0 .. P-1 of s(g_i - g_c) * 2^i, where
  s(x) = 1 if x >= 0
       = 0 otherwise                                    (5)

Here P denotes the user-chosen number of square neighbor pixels of a given pixel, g_i the intensity of the ith neighbor pixel, and g_c the intensity of the current pixel.

Next, we define the probability of overlay text (POT) using the transition pixels and the LBP data [5]. If the POT of a candidate region is larger than a predefined value, the corresponding region is finally determined to be an overlay text region. This process rejects the false candidate regions formed by ordinary texture.
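Eq. (5) for the common case of the eight immediate neighbors (P = 8) can be sketched as:

```python
def lbp8(patch):
    """8-neighbor local binary pattern (eq. (5)) for the center pixel of a
    3x3 intensity patch: each neighbor contributes s(g_i - g_c) * 2^i."""
    gc = patch[1][1]
    # clockwise neighbors starting at the top-left corner
    neighbors = [patch[0][0], patch[0][1], patch[0][2], patch[1][2],
                 patch[2][2], patch[2][1], patch[2][0], patch[1][0]]
    return sum((1 if g >= gc else 0) << i for i, g in enumerate(neighbors))
```

A perfectly uniform patch yields the all-ones code 255, while a center pixel brighter than all neighbors yields 0; textured overlay text produces intermediate, varied codes.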
Overlay Text Region Refinement
The overlay text region, or the bounding box obtained in the preceding subsection, needs to be refined for more accurate text extraction. In this subsection, we use a modified projection of the transition pixels [7] in the transition map to perform the refinement. First, a horizontal projection accumulates the transition pixel count in each row of the detected overlay text region to form a histogram of the number of transition pixels. Rows with small counts, which denote small candidate regions, are removed, and the separated regions are re-labeled. The projection is then conducted vertically, and small-count regions are removed once again.
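The horizontal projection step can be sketched as follows; the vertical pass is the same operation on the transposed region, and `min_count` is an illustrative threshold, not a value from the text.

```python
def refine_rows(region, min_count=2):
    """Horizontal projection refinement: count transition pixels per row of
    the detected region and keep only rows whose count reaches a minimum.

    region: list of rows of 0/1 transition-map values.
    Returns the indices of the surviving rows.
    """
    histogram = [sum(row) for row in region]   # transition pixels per row
    return [r for r, count in enumerate(histogram) if count >= min_count]
```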
Color Polarity Change
Before applying the video OCR step, the refined overlay text regions need to be converted to a binary image in which all pixels belonging to overlay text are highlighted and the others suppressed. Since the text color may be either brighter or darker than the background color, an efficient scheme is required to turn the overlay text into bright data. If the first encountered transition pixel belongs to 1 while the pixel two pixel distances away belongs to 0, the pixel values in the text region are inverted to make the overlay text brighter than the surrounding background. Note that the inversion is simply done by subtracting each pixel value from the maximum pixel value.

Overlay Text Marking
A rectangular bounding box is projected around the extracted overlay text region. Using the four corner points of the candidate region, we can mark the text data.

ASCII Text Conversion
Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic translation of scanned images of handwritten, typewritten, or printed text into machine-encoded text. OCR makes it possible to edit the text, search for a word or phrase, store it more compactly, display or print a copy free of scanning artifacts, and apply techniques such as machine translation, text-to-speech, and text mining to it. OCR is a field of research in pattern recognition, artificial intelligence, and computer vision. Here, a font database is created, and the database characters are extracted and stored in memory. The pure overlay content is defined by the Otsu method [8]. Using vertical histogram projection, the overlay text characters (in image format) are extracted. Using the SSD (sum of squared differences) method, the matching alphabet character is found, and the text characters (in ASCII format) are collected into string form. From this collection of overlay texts, the unique strings are found and maintained as a list of overlay texts.

Overlay Text Removal or Video Restoration
Inpainting means region filling after object removal from a digital photograph or video [9], [10], [11]. This step removes the overlay texts from the video frames, i.e., it restores the video without overlay texts. The inpainting problem is defined as how to "guess" the Lacuna region after removal of an object by replicating a part of the remainder of the whole image with visually plausible quality [12]. We adopt a hybrid region-filling algorithm [12] composed of a texture synthesis technique and an efficient interpolation method with a refinement approach:
1) The "subpatch texture synthesis technique" can synthesize the Lacuna region with significant accuracy;
2) The "weighted interpolation method" is applied to reduce computation time.
In the region-filling procedure, color texture distribution analysis is used to choose whether the subpatch texture synthesis technique or the weighted interpolation method should be applied. In the subpatch texture synthesis technique, the actual pixel values of the Lacuna region are synthesized by adaptively sampling from the source region. The restored video frames are inserted into a new video (.avi) file.

3. Results and Discussion
Most existing video text detection methods have been proposed on the basis of color, edge, and texture-based features. Color-based approaches assume that the video text is composed of a uniform color. However, it is rarely true that overlay text consists of a uniform color, owing to degradation resulting from compression coding and low contrast between text and background. Edge-based approaches are also considered useful for overlay text detection, since text regions contain rich edge information. The commonly adopted method is to apply an edge detector to the video frame and then identify regions with high edge density and strength. This method performs well if there is no complex background, and it becomes less reliable as the scene contains more edges in the background. Texture-based approaches, such as salient point detection and the wavelet transform, have also been used to detect text regions. However, since it is almost impossible to detect text in a real video using only one characteristic of text, some methods take advantage of combined features to detect video text.
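The SSD font matching used in the ASCII Text Conversion step above can be sketched as follows; the tiny glyph templates in the test are hypothetical stand-ins for the font database described in the text.

```python
def ssd(a, b):
    """Sum of squared differences between two equal-size binary glyph images."""
    return sum((pa - pb) ** 2
               for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b))

def match_char(glyph, font_db):
    """Return the ASCII character whose stored template minimizes the SSD
    against the extracted glyph image."""
    return min(font_db, key=lambda ch: ssd(glyph, font_db[ch]))
```

A glyph extracted by the vertical histogram projection is compared against every template, and the character with the smallest SSD is appended to the output string.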
Overall Structure
[Block diagram: the extracted frames (input) are processed to produce three outputs: (1) the marked overlay text video frames, (2) the list of overlay texts in the full video, and (3) the overlay-text-removed video restored by the inpaint operation.]

Sample Outputs
Fig 1: Original frame (sample frame)
Fig 2: Transition map
Fig 3: Linked map
Fig 4: Filling data
Fig 5: Candidate region determination
Fig 6: Candidate region refinement
Fig 7: Color polarity change output
Fig 8: Extracted overlay text
Fig 9: Overlay text removed frame
The previous methods are not robust to different character sizes, but the proposed method is, because it works on the transition map concept. Another advantage is that the text can be placed anywhere in the frame; the transition map can still indicate it successfully. The proposed system is also robust to color and contrast variance.

The previous method [5] performs the frame update after the overlay text detection, which takes more time. To avoid this overhead, this system performs the frame update at the time the transition map is calculated, which reduces the execution time and speeds up the process.

The previous methods are not combined with an OCR process. This system also provides an improved OCR process adapted to video scenes.

No earlier method comes with an overlay text removal facility. Here we adopt a better inpainting methodology on the extracted overlay image output: the hybrid inpaint method is used to restore the video by removing the overlay text. This hybrid method is the combination of subpatch filling and weighted interpolation. Because of the weighted interpolation, the inpainting is sped up, and the restoration is also very accurate.
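The weighted interpolation half of the hybrid method can be illustrated with a much-simplified sketch: each removed (text) pixel is filled from the nearest known pixels in the four axis directions, weighted by inverse distance. This is an illustrative stand-in for the method of [12], not its actual algorithm.

```python
def weighted_interpolation(img, mask):
    """Fill masked (overlay text) pixels by inverse-distance weighting of
    the nearest unmasked pixels in the four axis directions.

    img:  2-D list of gray levels; mask: 2-D list, 1 = pixel to fill.
    """
    h, w = len(img), len(img[0])
    out = [list(row) for row in img]
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            samples = []                     # (value, inverse-distance weight)
            for dx, dy in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                cx, cy, d = x + dx, y + dy, 1
                # walk outward until a known (unmasked) pixel is found
                while 0 <= cx < w and 0 <= cy < h and mask[cy][cx]:
                    cx += dx
                    cy += dy
                    d += 1
                if 0 <= cx < w and 0 <= cy < h:
                    samples.append((img[cy][cx], 1.0 / d))
            total = sum(wt for _, wt in samples)
            out[y][x] = sum(v * wt for v, wt in samples) / total
    return out
```

Because each fill is a handful of additions rather than a patch search, this step is far cheaper than subpatch synthesis, which is why the hybrid method reserves synthesis for textured areas only.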
4. Conclusion
Various processes for overlay text detection and extraction from complex videos are proposed in this paper. The main concept of the work is based on the observation that there exist transient colors between inserted text and its adjacent background. Linked maps are generated to form connected components for each candidate region, and then each connected component is reshaped to have smooth boundaries. We compute the density of transition pixels and the consistency of texture around the transition pixels to distinguish the overlay text regions from other candidate regions. The local binary pattern is used for the intensity variation around the transition pixel in the proposed method. The boundaries of the detected overlay text regions are localized accurately using the projection of overlay text pixels in the transition map. Next, the overlay text data are converted into ASCII text form. Finally, the overlay texts are removed by the hybrid inpaint method [12]. This research is well suited to video data processing.

5. References
[1] C. G. M. Snoek and M. Worring, "Time interval maximum entropy based event indexing in soccer video," in Proc. Int. Conf. Multimedia and Expo, Jul. 2003, vol. 3, pp. 481-484.
[2] J. Gllavata, R. Ewerth, and B. Freisleben, "Text detection in images based on unsupervised classification of high-frequency wavelet coefficients," in Proc. Int. Conf. Pattern Recognition, Aug. 2004, vol. 1, pp. 425-428.
[3] J. Cho, S. Jeong, and B. Choi, "News video retrieval using automatic indexing of Korean closed-caption," Lecture Notes in Computer Science, vol. 2945, pp. 694-703, Aug. 2004.
[4] R. C. Gonzalez and R. E. Woods, Digital Image Processing, 2nd ed. Upper Saddle River, NJ: Prentice-Hall, 2002.
[5] W. Kim and C. Kim, "A new approach for overlay text detection and extraction from complex video scene," IEEE Trans. Image Process., vol. 18, no. 2, Feb. 2009.
[6] T. Ojala, M. Pietikainen, and T. Maenpaa, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 7, pp. 971-987, Jul. 2002.
[7] M. R. Lyu, J. Song, and M. Cai, "A comprehensive method for multilingual video text detection, localization, and extraction," IEEE Trans. Circuits and Systems for Video Technology, vol. 15, no. 2, pp. 243-255, Feb. 2005.
[8] N. Otsu, "A threshold selection method from gray-level histograms," IEEE Trans. Syst., Man, Cybern., vol. 9, no. 1, pp. 62-66, Mar. 1979.
[9] T. K. Shih, N. C. Tang, and J.-N. Hwang, "Exemplar-based video inpainting without ghost shadow artifacts by maintaining temporal continuity," IEEE Trans. Circuits Syst. Video Technol., vol. 19, no. 3, Mar. 2009.
[10] K. A. Patwardhan, G. Sapiro, and M. Bertalmio, "Video inpainting under constrained camera motion," IEEE Trans. Image Process., vol. 16, no. 2, pp. 545-553, Feb. 2007.
[11] A. Criminisi, P. Perez, and K. Toyama, "Region filling and object removal by exemplar-based image inpainting," IEEE Trans. Image Process., vol. 13, no. 9, pp. 1200-1212, Sep. 2004.
[12] H.-J. Hsu and J.-F. Wang, "A hybrid algorithm with artifact detection mechanism for region filling after object removal from a digital photograph," IEEE Trans. Image Process., vol. 16, no. 6, Jun. 2007.
[13] L. Agnihotri and N. Dimitrova, "Text detection for video analysis," in Proc. IEEE Int. Workshop on Content-Based Access of Image and Video Libraries, pp. 109-113, Jun. 1999.
[14] K. C. K. Kim et al., "Scene text extraction in natural scene images using hierarchical feature combining and verification," in Proc. Int. Conf. Pattern Recognition, vol. 2, pp. 679-682, Aug. 2004.
[15] M. R. Lyu, J. Song, and M. Cai, "A comprehensive method for multilingual video text detection, localization, and extraction," IEEE Trans. Circuits and Systems for Video Technology, vol. 15, no. 2, pp. 243-255, Feb. 2005.
[16] M. Bertini, C. Colombo, and A. Del Bimbo, "Automatic caption localization in videos using salient points," in Proc. Int. Conf. Multimedia and Expo, Aug. 2001, pp. 68-71.