SEARCH IN CLASSROOM VIDEOS
WITH OPTICAL CHARACTER RECOGNITION
FOR VIRTUAL LEARNING
A Thesis
Presented to
the Faculty of the Department of Computer Science
University of Houston
In Partial Fulfillment
of the Requirements for the Degree
Master of Science
By
Tayfun Tuna
December 2010
SEARCH IN CLASSROOM VIDEOS
WITH OPTICAL CHARACTER RECOGNITION
FOR VIRTUAL LEARNING
Tayfun Tuna
APPROVED:
Dr. Jaspal Subhlok, Advisor
Dep. of Computer Science, University of Houston
Dr. Shishir Shah
Dep. of Computer Science, University of Houston
Dr. Lecia Barker
School of Information, University of Texas at Austin
Dean, College of Natural Sciences and Mathematics
Acknowledgements
I am very grateful to my advisor, Dr. Jaspal Subhlok, for his guidance, encouragement, and support during this work. He kept me motivated with his insightful suggestions for solving many problems that would otherwise have seemed impossible to solve. I would not have been able to complete my work in time without his guidance and encouragement.
I would like to express my deepest gratitude to Dr. Shishir Shah, who gave me innumerable suggestions in our weekly meetings and in his image processing class, both of which helped me many times to solve difficult problems in this research.
I am heartily thankful to Dr. Lecia Barker for her support and for agreeing to be a
part of my thesis committee.
Without the love and support of my wife, it would have been hard to get my thesis
done on time. I am forever indebted to my wife Naile Tuna.
SEARCH IN CLASSROOM VIDEOS
WITH OPTICAL CHARACTER RECOGNITION
FOR VIRTUAL LEARNING
An Abstract of a Thesis
Presented to
the Faculty of the Department of Computer Science
University of Houston
In Partial Fulfillment
of the Requirements for the Degree
Master of Science
By
Tayfun Tuna
December 2010
Abstract
Digital videos have been extensively used for educational purposes and distance
learning. Tablet PC based lecture videos have been commonly used at UH for many
years. To enhance the user experience and improve usability of classroom lecture videos,
we designed an indexed, captioned and searchable (ICS) video player. The focus of this
thesis is search.
Searching inside a lecture is especially useful for long videos; instead of
spending an hour watching the entire video, it allows us to find the relevant scenes
instantly. This feature requires extracting the text from video screenshots by using
Optical Character Recognition (OCR). Since ICS video frames include complex images,
graphs, and shapes in different colors with non-uniform backgrounds, our text detection
requires a more specialized approach than is provided by off-the-shelf OCR engines,
which are designed primarily for recognizing text within scanned documents in black and
white format.
In this thesis, we describe how we used these OCR engines and improved their detection
for the ICS video player. We surveyed current OCR engines on ICS video frames and
realized that the accuracy of recognition could be increased by preprocessing the images.
By using image processing techniques such as resizing, segmentation, and inversion on the
images, we increased the accuracy rate of search in the ICS video player.
Table of Contents
CHAPTER 1. INTRODUCTION ................................................................................................................ 1
1.1 MOTIVATION .................................................................................................................................. 1
1.2 BACKGROUND................................................................................................................................. 2
1.2.1 VIDEO INDEXER ....................................................................................................................... 3
1.2.2 OVERVIEW OF OPTICAL CHARACTER RECOGNITION (OCR) TOOL ............................. 5
1.2.3 ICS VIDEO PLAYER .................................................................................................................. 7
1.3 RELATED WORK.............................................................................................................................10
1.3.1 VIDEO PLAYERS ......................................................................................................................11
1.3.2 OCR IMPLEMENTATIONS IN VIDEOS .................................................................................12
1.4 THESIS OUTLINE .. ........................................................................................................................14
CHAPTER 2. SURVEY OF OCR TOOLS ...............................................................................................15
2.1 POPULAR OCR TOOLS ..................................................................................................................15
2.2 THE CRITERIA FOR A “GOOD” OCR TOOL FOR ICS VIDEO IMAGES ...................................16
2.3 SIMPLE OCR ....................................................................................................................................18
2.4 ABBYY FINEREADER ....................................................................................................................20
2.5 TESSERACT OCR ............................................................................................................................21
2.6 GOCR ................................................................................................................................................22
2.7 MICROSOFT OFFICE DOCUMENT IMAGING (MODI) ..............................................................24
2.8 CONCLUSION ..................................................................................................................................27
CHAPTER 3. OCR CHALLENGES AND ENHANCEMENTS.............................................................28
3.1 WHAT IS OCR ? ..............................................................................................................................28
3.2 HOW DOES OCR WORK ? ..............................................................................................................29
3.3 CAUSES OF FALSE DETECTION ..................................................................................................31
CHAPTER 4. ENHANCEMENTS FOR OCR .........................................................................................34
4.1 SEGMENTATION ............................................................................................................................34
4.1.1 THRESHOLDING ......................................................................................................................37
4.1.2 EROSION AND DILATION ......................................................................................................39
4.1.3 EDGE DETECTION ...................................................................................................................41
4.1.4 BLOB EXTRACTION ................................................................................................................43
4.2 RESIZING FOR TEXT FONT SIZE ................................................................................................44
4.3 INVERSION .....................................................................................................................................46
4.4 RESIZING IMAGE ...........................................................................................................................48
4.5 INVERSION .....................................................................................................................................49
CHAPTER 5. OCR ACCURACY CRITERIA AND TEST RESULTS ....................................................51
5.1 TEST DATA .....................................................................................................................................51
5.1.1 THE IMAGES FOR OCR DETECTION TEST .......................................................................51
5.1.2 THE TEXT FOR OCR DETECTION TEST ............................................................................53
5.1.3 SEARCH ACCURACY ............................................................................................54
5.2 WORD ACCURACY AND SEARCH ACCURACY .......................................................................54
5.3 PREPARING AND TESTING TOOLS ...............................................................................................55
5.3.1 TEXTPICTURE EDITOR ........................................................................................................56
5.3.2 OCR TOOL MANAGER AND ACCURACY TESTER ........................................................57
5.3.3 SEARCH ACCURACY ............................................................................................59
5.4 EXPERIMENTS AND TEST RESULTS .........................................................................................60
CHAPTER 6. CONCLUSION....................................................................................................................69
REFERENCES ............................................................................................................................................72
List of Figures
FIGURE 1.1: BLOCK DIAGRAM OF THE VIDEO INDEXER .................................................................... 3
FIGURE 1.2: A SNAPSHOT FROM THE VIDEO INDEXER .................................................................... 4
FIGURE 1.3 A SNAPSHOT OF AN OUTPUT FROM THE VIDEO INDEXER............................................. 4
FIGURE 1.4. THE OCR TOOL FUNCTION IN ICS VIDEO PLAYER .......................................................... 5
FIGURE 1.5. A SNAPSHOT OF AN OUTPUT FOLDER OF OCR TOOL .................................................... 5
FIGURE 1.6. A SNAPSHOT OF RUNNING OCR TOOL ......................................................................... 6
FIGURE 1.7. A SNAPSHOT OF ICS VIDEO PLAYER XML OUTPUT OF OCR TOOL ................................ 6
FIGURE 1.8. FLOW OF ICS VIDEO PLAYER ....................................................................................... 7
FIGURE 1.9. A SNAPSHOT OF THE VIDEO PLAYER SCREEN................................................................ 8
FIGURE 1.10. LIST VIEW OF SEARCH FEATURE ................................................................................ 9
FIGURE 1.11.ICS VIDEO PLAYER PROGRESSBAR ............................................................................. 10
FIGURE 2.1 SIMPLE OCR DETECTION EXAMPLE 1 .......................................................................... 18
FIGURE 2.2 SIMPLE OCR DETECTION EXAMPLE 2. .......................................................................... 19
FIGURE 2.3 SIMPLE OCR DETECTION EXAMPLE 3. .......................................................................... 19
FIGURE 2.4 ABBYY FINEREADER DETECTION EXAMPLE. ................................................................. 20
FIGURE 2.5 USER INTERFACE OF ABBYY FINEREADER. .................................................................... 20
FIGURE 2.6 TESSERACTOCR DETECTION EXAMPLE 1. .................................................................... 21
FIGURE 2.7 TESSERACTOCR DETECTION EXAMPLE 2. .................................................................... 22
FIGURE 2.8 TESSERACTOCR DETECTION EXAMPLE 3. .................................................................... 22
FIGURE 2.9 GOCR TOOL DETECTION EXAMPLE 1. ........................................................................... 23
FIGURE 2.10 GOCR TOOL DETECTION EXAMPLE 2. ......................................................................... 23
FIGURE 2.11 GOCR TOOL DETECTION EXAMPLE 3. ......................................................................... 24
FIGURE 2.12 USING THE MODI OCR ENGINE IN C# PROGRAMMING LANGUAGE. ............................... 25
FIGURE 2.13 MODI DETECTION EXAMPLE 1. ................................................................................... 25
FIGURE 2.14 MODI DETECTION EXAMPLE 2. ................................................................................... 26
FIGURE 2.15 MODI DETECTION EXAMPLE 3. ................................................................................... 26
FIGURE 3.1 PATTERN RECOGNITION STEPS FOR CLASSIFICATION .................................................. 30
FIGURE 3.2 CHARACTER REPRESENTATION FOR FEATURE EXTRACTION. ........................................ 30
FIGURE 3.3 DISTORTED IMAGE ANALYSIS. ..................................................................................... 31
FIGURE 3.4 CONTRAST AND COLOR DIFFERENCES IN CHARACTERS IN AN IMAGE. ........................ 32
FIGURE 3.5 SIZE DIFFERENCE IN CHARACTERS IN AN IMAGE. ........................................................ 32
FIGURE 4.1 BLACK FONT TEXT ON THE WHITE COLOR BACKGROUND. .......................................... 35
FIGURE 4.2 COMPLEX BACKGROUND WITH DIFFERENT COLOR FONT. ........................................... 35
FIGURE 4.3 OCR RESULTS FOR A WHOLE IMAGE. ........................................................................... 36
FIGURE 4.4 OCR RESULTS FOR A SEGMENTED IMAGE. ................................................................... 36
FIGURE 4.5 SIS THRESHOLD EXAMPLE 1......................................................................................... 38
FIGURE 4.6 SIS THRESHOLD EXAMPLE 2......................................................................................... 38
FIGURE 4.7 STRUCTURAL ELEMENT MOVEMENT FOR MORPHOLOGICAL OPERATIONS. ................. 40
FIGURE 4.8 STRUCTURED ELEMENT FOR EROSION AND DILATION. ................................................ 40
FIGURE 4.9 DILATION EFFECT ON AN IMAGE. ............................................................................ 41
FIGURE 4.10 EDGE DETECTION EFFECT ON A DILATED IMAGE...................................................... 42
FIGURE 4.11 BLOB EXTRACTION EXAMPLE ON AN IMAGE. ............................................................ 44
FIGURE 4.12 RESIZE PROCESS IN AN EXAMPLE............................................................................... 45
FIGURE 4.13 RESIZE PROCESS IN INTERPOLATION.......................................................................... 45
FIGURE 4.14 RESIZE PROCESS IN BILINEAR INTERPOLATION.......................................................... 46
FIGURE 4.15 RGB COLOR MODEL. ................................................................................................... 47
FIGURE 4.16 THE INVERSION OPERATION ON THE LEFT INPUT IMAGE. .......................................... 48
FIGURE 4.17 INVERSION EQUATIONS AND THEIR EFFECT ON THE IMAGES..................................... 49
FIGURE 4.18 OCR ENGINES’ DETECTIONS FOR ORIGINAL IMAGE.................................................... 50
FIGURE 5.1 EXAMPLE ICS VIDEO IMAGE. ........................................................................................ 52
FIGURE 5.2 EXAMPLES OF SOME IMAGES THAT ARE NOT INCLUDED IN THE TEST. ........................ 52
FIGURE 5.3 AN EXAMPLE OF SOME TEXT THAT ARE NOT INCLUDED IN THE TEST. ......................... 53
FIGURE 5.4 SCREENSHOT OF TEXTPICTURE EDITOR TOOL. ............................................................ 56
FIGURE 5.5 INPUT FOLDER FOR OCR TEST CREATED BY TEXTPICTURE EDITOR TOOL. ................... 57
FIGURE 5.6 SCREENSHOT OF OCR TOOL MANAGER AND ACCURACY TESTER .............................. 58
FIGURE 5.7 SCREENSHOT OF OCR TOOL MANAGER AND ACCURACY TESTER .............................. 58
FIGURE 5.8 EXCEL FILE CREATED BY OCR MANAGER TOOL FOR FOLDER. ..................................... 59
FIGURE 5.9 EXCEL FILE CREATED BY OCR MANAGER TOOL FOR AN IMAGE. .................................. 59
FIGURE 5.10 EXAMPLE SCREENS FROM THE VIDEOS WITH THE HIGHEST FALSE POSITIVES. ............. 67
FIGURE 5.11 EXAMPLE SCREENS FROM THE VIDEOS WITH THE HIGHEST WORD DETECTIONS .......... 67
FIGURE 5.12 EXAMPLE SCREENS FROM THE VIDEOS WITH THE LOWEST DETECTION ....................... 67
List of Graphs
GRAPH 5.1 OCR ACCURACY TEST GRAPH FOR ‘WORD ACCURACY’ ................................................ 63
GRAPH 5.2 GRAPH FOR OCR TEST RESULTS OF ‘SEARCH ACCURACY’ ............................................ 64
GRAPH 5.3 GRAPH FOR OCR TEST RESULTS OF EXECUTION TIMES ................................................. 64
GRAPH 5.4 OCR TEST RESULTS FOR FALSE POSITIVES .................................................................... 65
GRAPH 5.5 GRAPH FOR OCR TEST RESULTS OF SEARCH ACCURACY RATE FOR ALL VIDEOS .......... 68
List of Tables
TABLE 2.1 POPULAR OCR TOOLS .................................................................................................. 15
TABLE 2.2 SELECTED OCR TOOLS TEST ........................................................................................ 18
TABLE 5.1 FORMULATION OF ‘WORD ACCURACY’ ........................................................................ 54
TABLE 5.2 FORMULATION OF ‘SEARCH ACCURACY’ ..................................................................... 55
TABLE 5.3 OCR ACCURACY TEST RESULTS FOR ‘WORD ACCURACY’ ............................................. 61
TABLE 5.4 NUMBER OF UNDETECTED WORDS WITH METHODS ..................................................... 62
TABLE 5.5 OCR ACCURACY TEST RESULTS FOR ‘SEARCH ACCURACY’ ......................................... 62
TABLE 5.6 TEST RESULTS FOR ‘EXECUTION TIMES’ ...................................................................... 64
TABLE 5.7 NUMBER OF ‘FALSE POSITIVES’ ................................................................................... 65
TABLE 5.8 VIDEOS WHICH HAVE THE HIGHEST ‘FALSE POSITIVES’ ............................................ 64
Chapter 1: Introduction
1.1 Motivation
There is a huge database of digital videos in any school that employs lecture
video recording. Traditionally, students would download the video and watch using a
basic video player. This method is not suitable for users, such as students, who want to
quickly refer to a specific topic in a lecture video, since it is hard to tell exactly when that
topic was taught. It is also not suitable for deaf students. To make these videos more
accessible and engaging, we needed to make the content inside videos easily navigable
and searchable, and to associate closed captions with videos, through a visually attractive
and easy-to-use video player interface.
To provide easy access to video content and enhance user experience, we
designed a video player in the ICS video project, focused on making the video content more
accessible and navigable to users. This video player allows users to search for a topic
they want in a lecture video, which saves time as users do not need to view the whole
lecture stream to find what they are looking for.
To provide search capability in our video player, we need to extract the text of each video
frame. This can be done using optical character recognition (OCR). Since ICS video
frames include complex images, graphs, and shapes in different colors with non-uniform
backgrounds, our text detection requires a more specialized approach than is provided by
off-the-shelf OCR software, which is designed primarily for recognizing text within
black-and-white scanned documents. Apart from choosing the right OCR tool for the ICS
video player, basic image preprocessing techniques are required to improve accuracy.
1.2 Background
Digital videos in education have been a successful medium for students to study
or revise the subject matter taught in a classroom [1]. Although a practical method of
education, video was never meant to replace live classroom interaction, since a live
lecture and student-instructor interaction cannot be fully retained in a recording; we
still provide anytime-anywhere accessibility by allowing web-based access to lecture
videos [2]. We wanted to enhance the user experience and make the content of video
lectures easily accessible to students by designing a player that supports indexing
(or visual transition points), search, and captioning.
At the University of Houston, video recordings have been used for many years for
distance learning. In all those years, lecture videos have only grown in popularity [2,3,4].
A problem that students face while viewing these videos is that it is difficult to access
specific content. To solve this problem we started a project known as Indexed, Captioned
and Searchable (ICS) Videos. Indexing (a process of locating visual transitions in video),
searching and captioning have been incorporated in the project to attain the goal of
making lecture videos accessible to a wide variety of users in an easy to use manner.
We are looking at the project from the perspective of an end user (most likely a
student). To increase the usefulness of the ICS Video project, all videos have
meta-information associated with them: a description of the lecture, a series of points
in the video time-line where a visual transition exists (also known as index points),
the keywords needed for search, and closed-caption text. The indexer, explained in the
following sections, creates the index and transition points of the video as image files
for the OCR tool. The OCR tool detects the text in these images and stores it so that the
ICS Video Player, also explained in the following sections, can organize this
meta-information in a manner that is practical for the end user while keeping the
emphasis on the video.
As stated earlier, this work is the culmination of a larger ICS Video project. In this
section we present a summary of contributions made by others to this project.
1.2.1 Video Indexer
The job of the indexer is to divide the video into segments where each division
occurs at a visual transition as shown in figure 1.1. By dividing a video in this manner we
get a division of topics taught in a lecture because the visual transitions in a video are
nothing but slide transitions. The indexer is also supposed to eliminate duplicate
transition points and place index points at approximately regular time intervals.
Figure 1.1: Block diagram of the video indexer. The output from the indexer is image
files and a textual document which essentially contains a list of index points i.e. time
stamps where a visual transition exists.
Joanna Li[3] outlined a method to identify visual transitions and eliminate
duplicates by filtering. Later, this approach was enhanced with new algorithms [4].
Figure 1.2: A snapshot from the video indexer. It is running to find index and
transition points.
Figure 1.3: Output from the video indexer. It has created all transition points and a data
file indicating which ones are index points.
In figure 1.2 a snapshot from the video indexer is shown. After it finishes
processing, it creates its output in a folder for the OCR tool, as shown in figure 1.3.
1.2.2 Overview of Optical Character Recognition (OCR) Tool
We will discuss OCR in depth in the following chapters. Figure 1.4 shows the
workflow with a short description.
Figure 1.4: The OCR tool takes each frame where an index (or visual transition) exists
and extracts a list of keywords written on it. This list is then organized in such a way that
it can be cross referenced by the index points.
After the video indexer creates the index and transition points, which are image
files, the OCR module runs to extract the keywords from the text written on these video
frames (which are essentially PowerPoint slides). As a result, we get all the keywords
for a video segment from this tool. These keywords, among other data, are then used to
facilitate the search functions in the video player.
Figure 1.5: The OCR tool renames files according to their index point number.
L1-082310_i_1_1 refers to the first index point and first transition point; L1-082310_t_1_2
refers to the first index point and second transition point.
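As a small illustration of this naming convention, the sketch below pulls the index and transition numbers out of such a file name. The regular expression and method name are introduced here only for illustration and are not part of the OCR tool itself.

using System;
using System.Text.RegularExpressions;

class FrameNameExample
{
    // Parse a frame file name such as "L1-082310_i_1_1" (index frame) or
    // "L1-082310_t_1_2" (transition frame) into its kind, index number,
    // and transition number, following the convention shown in Figure 1.5.
    static void Main()
    {
        string name = "L1-082310_t_1_2";
        Match m = Regex.Match(name, @"_(?<kind>[it])_(?<index>\d+)_(?<transition>\d+)$");
        if (m.Success)
        {
            Console.WriteLine("kind = " + m.Groups["kind"].Value);              // "t" (transition frame)
            Console.WriteLine("index = " + m.Groups["index"].Value);            // "1"
            Console.WriteLine("transition = " + m.Groups["transition"].Value);  // "2"
        }
    }
}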
Figure 1.6: The OCR tool running for extracting text from images.
The OCR tool finishes extracting text from all images one by one and then creates an
XML output file that includes the keywords for each transition point, as shown in
figure 1.7.
Figure 1.7: XML file output of the OCR tool.
Once the XML file is ready, the ICS Video Player can use it in its interface.
In the next section we discuss how the information supplied by the indexer and the OCR
tool is used in the ICS Video Player.
1.2.3 ICS video player
Figure 1.8: The video player is fed the meta-information consisting of index points and
keywords, along with information about the course the lecture belongs to and about the
lecture itself. The caption file is an optional resource; if present, captions are
displayed below the video as shown in Figure 1.9.
In essence the ICS Videos project aims at providing three features, indexing,
captioning and search, to digital video distance education. Here is an overview of how
these three features were integrated into the video player:
1. Indexing
The recorded video lectures were divided into segments where the division
occurs at a visual transition (which is assumed to be a change of topic in a
lecture). These segments (or index points) were organized in a list of index
points in the player interface (see Figure 1.9 (d)).
2. Captioning
The video player was designed to contain a panel which can be minimized and
display closed captions if a closed caption file was associated with that video (see
Figure 1.9 (f)). At the time of writing this, the captions file needs to be manually
generated by the author.
3. Search
The user can search for a topic of interest in a video by searching for a keyword. The
result shows all occurrences of that keyword among the video segments. This is
implemented by incorporating the indexer and OCR tool discussed earlier into the
video processing pipeline. The search result allows users to easily navigate to the
video segment where a match for the search keyword was found, as shown in figure
1.9 (b) and figure 1.10.
Figure 1.9: A snapshot of the video player screen. Highlighted components - (a)
video display, (b) search box, (c) lecture title, (d) index list, (e) playhead slider, (f)
closed captions, (g) video controls
In Figure 1.9 we show a running example of the video player. The lecture in
figure 1.9 belongs to the COSC 1410 - Introduction to Computer Science course taught by Dr.
Nouhad Rizk at the University of Houston. The figure gives a view of the player as a whole
along with every component. The player interface is mostly self-explanatory, but we
should clarify some of the functionality. Video display (figure 1.9 (a)) shows the current
status of the video. If the video is paused, it shows a gray overlay over the video with a
play button. Index list (Figure 1.9 (d)) contains a descriptive entry for each index point
(also known as visual transition) in the video. Each entry in the index list is made up of a
snapshot image of the video at the index point, the name of the index, and its description,
as shown in figure 1.10.
Figure 1.10: The Figure shows the list of results when the user searched for the keyword
"program" in the lecture.
One component that is not shown in figure 1.9 is the search result component.
When the user searches for a keyword, all indices that contain that keyword in their
keyword list are displayed in the list of search results. The user can then click on a result
to go to that index point in the video. As shown in Figure 1.10, every result also contains
the snapshot of the video at the index point with the name and description of the index
point. It also shows where the keyword was found: in the keyword list (along with the
number of matches), in the title, or in the description of the index point. All of this
information comes from the XML file created by the OCR tool, as we explained in the previous
section.
Figure 1.11: The progress bar of the ICS Video Player. In this case the video
is playing at index point 1.
One thing we need to point out about the search feature of this player is that when a
user searches for a keyword and finds it in a keyword list, the progress bar pointer goes to the
beginning of that index region; it does not go to the exact position within the video.
We have briefly described the workflow of the ICS Video player and how the OCR tool
is used in the project. The purpose of the work done in this thesis is to create an OCR
tool for the video player that provides the text of the video frames, so that the user
is able to search inside a video using the ICS video player. There are several ways to
design an OCR tool that creates text for the ICS video player; our main goal is to make it
accurate enough for the end user to find the keyword in the right place in the
video content. For this, we tested the current OCR tools and used some preprocessing
techniques to improve their accuracy. From the experiments and results, we concluded
that the OCR tools presented here can be used for the ICS Video Player, and that modifying
images with image processing techniques before sending them to the OCR tool
increases the accuracy of these tools.
1.3 Related Work
There have been efforts around the industry along the lines of video indexing and
usage of OCR for getting text from videos. We will take a look at each of them
separately.
1.3.1 Video Players
Google Video is one of the most famous video players; it is a free video sharing
website developed by Google Inc. [5]. Google Video has incorporated an indexing feature
that allows users to search for a particular type of video among any video
available publicly on the internet. However, videos are indexed based on the context in
which they were found, not on the content of the video itself: if a video is located on a
website about animals and one searches for a video about animals, there is a chance
that this video will appear in the search results. The main difference here is that the search
result does not guarantee that the videos it returns actually contain the
required content (because indexing was not done on the video content). This method does
not suit our needs, because one of the main requirements of the project was to allow
students to locate the topic they were looking for inside a video.
Another implementation, known as Project Tuva and implemented by Microsoft Research,
features searchable videos along with closed captions and annotations [6]. It also
features division of the video time-line into segments, where a segment represents a topic
taught in the lecture. However, the division of the video into segments is done manually in
Project Tuva. Tuva also offers an Enhanced Video Player to play videos.
There exists a related technology known as hypervideo, which can synchronize
content inside a video with annotations and hyperlinks [7]. Hypervideo allows a user to
navigate between video chunks using these annotations and hyperlinks. Detail-on-demand
video is a type of hypervideo that allows users to locate information in an
interrelated video [8]. For editing and authoring detail-on-demand hypervideo there
exists a video editor known as Hyper-Hitchcock [8, 9, 10]. The Hyper-Hitchcock video player
can support indexing because it plays hypervideos, but one still has to manually put
annotations and hyperlinks in the hypervideo to index it.
There has also been research on implementing search for content and topics
inside a video. The authors of [11] developed a system known as iView that features
intelligent searching of English and Chinese content inside a video. It uses image
processing to extract keywords; in addition, iView features speech processing techniques.
Searchinsidevideo is another implementation of indexing, searching and
captioning videos. Searchinsidevideo is able to automatically transcribe the video content
and let the search engines accurately index the content, so that they can include it within
their search results. Users can also find all of the relevant results for their searches across
all of the content (text, audio and video) in a single, integrated search [12].
1.3.2 OCR implementations in Videos
OCR is used on videos in many applications, such as license plate recognition in
surveillance cameras or text recognition in news and sports videos. There are also
many projects going on at universities that aim to achieve better OCR detection
in videos.
SRI International (SRI) has developed ConTEXTract™, a text recognition
technology that can find and read text (such as street signs, name tags, and billboards) in
real scenes. This optical character recognition (OCR) for text within imagery and video
requires a more specialized approach than is provided by off-the-shelf OCR software,
which is designed primarily for recognizing text within documents. ConTEXTract
distinguishes lines of text from other contents in the imagery, processes the lines, and
then sends them to an OCR sub module, which recognizes the text. Any OCR engine can
be integrated into ConTEXTract with minor modifications [14]. The idea of segmenting
images in our work is inspired by this system.
Authors of [13] proposed a fully automatic method for summarizing and indexing
unstructured presentation videos based on text extracted from the projected slides. They
use changes of text in the slides as a means to segment the video into semantic shots.
Unlike previous approaches, their method does not depend on the availability of the
electronic source of the slides, but rather extracts and recognizes the text directly from
the video. Once text regions are detected within key frames, a novel binarization
algorithm, Local Adaptive Otsu (LOA), is employed to deal with the low quality of video
scene text. We are inspired by this work's application of thresholding to images and its
use of the Tesseract OCR tool.
The authors of [15] worked on automatic video text localization and recognition
for content-based video indexing of sports applications, using a multi-modal approach.
They used dilation-based segmentation for localization. The segmentation method in our
work is inspired by theirs.
The authors of [16], from Augsburg University, worked on a project named MOCA
in which they published a paper on automatic text segmentation (also known as text
localization) and text recognition for video indexing. They used OCR engines to detect
text in TV programs. To increase OCR engine accuracy, they presented a new
approach to text segmentation and text recognition in digital video and demonstrated its
suitability for indexing and retrieval. Their idea of using different snapshots of the same
scene is not applicable to our work, since our videos are indexed and only these indexed
screenshots are available to the OCR tool.
1.4 Thesis Outline
This thesis is organized as follows: Chapter 2 gives an introduction to commonly
used OCR engines and explains the reasons for using the three OCR
engines MODI, Tesseract OCR, and GOCR. In Chapter 3 we discuss OCR challenges for
ICS video images and explain our approaches to dealing with these challenges. The methods
we used to enhance text recognition are discussed in Chapter 4. The criteria for
evaluating text detection enhancement are explained in Chapter 5, where we also show
the results of our experiments. The work is concluded in Chapter 6.
Chapter 2: Survey of OCR Tools
Developing a proprietary OCR system is a complicated task and requires a lot of
effort. Instead of creating a new OCR tool it is better to use the existing ones.
In the previous chapter we mentioned that there are many OCR tools that allow us
to extract text from an image. In this chapter, we discuss the criteria for a good OCR tool
suitable for our goals and then we mention some of the tools we tested and justify our
choice(s).
2.1 Popular OCR Tools
Table 2.1 shows us some popular OCR tools.
ABBYY FineReader
Puma.NET
AnyDoc Software
Readiris
Brainware
ReadSoft
CuneiForm/OpenOCR
RelayFax
ExperVision TypeReader & RTK
Scantron Cognition
GOCR
SimpleOCR
LEADTOOLS
SmartScore
Microsoft Office Document Imaging
Tesseract
Ocrad
Transym OCR
OCRopus
Zonal OCR
OmniPage
Table 2.1: Popular OCR tools
These tools can be classified into two types:
a) Tools that can be integrated into our project
Open source tools, and some commercial tools whose OCR module can be
integrated into a project, such as Office 2007 MODI.
b) Tools that cannot be integrated into our project
Commercial tools such as ABBYY FineReader that encapsulate OCR and mainly aim
at scanning, printing, and editing. They are successful at extracting text, but the OCR
component cannot be imported as a module into a project, as they have their own custom
user interface and everything must be done through that interface.
2.2 The criteria for a “Good” OCR Tool for ICS video images
There are many criteria for a good OCR tool in general such as design, user
interface, performance, accessibility etc. The priorities for our project are the
accessibility, usability and accuracy. So the criteria for being a good tool for our project
are:
1. Accessibility-Usability: In the ICS video project we will process many image files,
and we need to do so automatically. We cannot handle them one by one through a manual
process of opening a program -> browsing files -> running the tool -> getting the text ->
putting the text somewhere we can use it. Accessibility is our first concern. How can we
access the tool? Can we call it from a command prompt so that we can access it from C++,
C#, or Java? Can we invoke it in the Windows operating system with parameters, as many
times as needed and at any time? Can we include the tool as a package, a DLL, or a header
in our project so that we can import it and use it as part of our project?
2. Accuracy: The tool should have a reasonable rate of accuracy in
converting images to text. It is also important that the accuracy we care about is
measured on our project's inputs. Most OCR tools are designed for scanned images, which
are mostly black and white; they may claim accuracy of up to 95%, but what about the
accuracy for colored images? So accuracy on our inputs is another important criterion
for deciding whether a tool is good.
3. Complexity: A program doing one task can be considered simple, while a
program doing many tasks can be considered complex. In this sense, we
only need the tool to extract the text from images; anything else it does
increases its complexity.
4. Updatability: No algorithm can be considered final. Can we change the tool so that
it works better for our project, to increase its accuracy or performance? A tool may be
good but not support our input type (JPEG files); can we update it so that it is able to
process our inputs?
5. Performance and Space Usage: Most tools we examined have reasonable
performance and use reasonable memory and disk space. For the ICS
video project, the OCR module runs on the server side; this means the speed of
converting images to text and the space or memory usage of the OCR tool are
less important.
Now we can take a look at the tools themselves. Since testing all the tools, which
run in different environments and operating systems, would be time consuming, we
test the tools below as examples of their groups. We filtered the popular tools shown in
table 2.1 down to table 2.2 according to the classification of accessibility and
complexity described above, presenting one example for each group: non-importable
tools (both large commercial tools and small applications) and importable
tools such as MODI. Among open source tools we tested two OCR engines, GOCR and
Tesseract OCR, which can work in a Windows environment.
NAME                    CATEGORY
Simple OCR              Not importable - small application
ABBYY FineReader        Not importable - big application
Tesseract OCR           Importable - open source
GOCR                    Importable - open source
Office 2007 MODI        Importable - big application
Table 2.2: Selected OCR tools to test
2.3 Simple OCR
SimpleOCR is a free tool that can read bi-level and grayscale images and create TIFF files
containing bi-level (i.e., black and white) images. It works with all fully compliant TWAIN
scanners and also accepts input from TIFF files. With this tool, it is expected that a
paper document can be easily and accurately converted into editable electronic text for
use in any application, including Word and WordPerfect, with 99% accuracy [17].
Figure 2.1: SimpleOCR detection example 1; a- Input, b- Output
SimpleOCR has a user interface in which we can only open a file by clicking and
browsing, run the tool, and copy and paste the result manually. In other words, it has no
command-line usability, and it is not importable into our tool; hence it could not be
adopted for our project. It also failed to produce text for colored images such as those in
Figure 2.1 and Figure 2.2, giving the error message “Could not convert the page to text.”
Figure 2.2: SimpleOCR detection example 2; a- Input, b- Output
SimpleOCR was able to detect some of the text in colored images such as figure 2.3, but
with low accuracy; only the word Agents was detected correctly.
Figure 2.3: SimpleOCR detection example 3; a- Input, b- Output
SimpleOCR failed to satisfy our first and second criteria for a good OCR tool for our
project: accessibility-usability and accuracy.
2.4 ABBYY FineReader
ABBYY is a leading provider of document conversion, data capture, and
linguistic software and services. The key areas of ABBYY's research and development
include document recognition and linguistic technologies. [18]
Figure 2.4: ABBYY FineReader detection example; a- Input, b- Output
ABBYY showed good accuracy for our test images (Figure 2.4). But it is not
applicable to our project, because to use the OCR part of ABBYY FineReader one must
work through its own interface to get the text: open the files from the menu, run the
OCR engine, and check whether the text is accurate or has to be corrected manually, as
shown in figure 2.5.
Figure 2.5: User interface of ABBYY FineReader
Even though the accuracy of ABBYY FineReader is high, it is not a “good” tool for
our ICS Video project, because it does not satisfy our first criterion, accessibility and
usability.
2.5 Tesseract OCR
The Tesseract OCR engine is one of the most accurate open source OCR engines
available. The source code will read a binary, grey, or color image and output text. A TIFF
reader is built in that reads uncompressed TIFF images, and libtiff can be added to read
compressed images. Most of the work on Tesseract is sponsored by Google [19].
Figure 2.6: Tesseract OCR detection example 1; a- Input, b- Output
The Tesseract OCR engine is updated frequently, and its accuracy for colored images is
high. Figure 2.6 and figure 2.7 are good examples of the detection capabilities of
Tesseract OCR.
Figure 2.7: Tesseract OCR detection example 2; a- Input, b- Output
Tesseract OCR may be the most accurate open source tool, but its accuracy is not
perfect: in figure 2.6 the last line is not recognized at all, and while the image in figure 2.7
is recognized precisely, in figure 2.8 the word Summary is missed. It is, however,
accessible, easy to use, and can be called from the command prompt in any programming
language.
Figure 2.8: Tesseract OCR detection example 3; a- Input, b- Output
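Because Tesseract runs as a command-line program, it can be driven from C# (or any other language) without importing a library. The sketch below is a minimal, hedged example: it assumes the tesseract executable is on the system PATH and follows the usual "tesseract <image> <outputbase>" convention, which writes the recognized text to <outputbase>.txt.

using System;
using System.Diagnostics;
using System.IO;

class TesseractExample
{
    // Run the Tesseract command-line tool on one frame image and return the
    // recognized text. Assumes "tesseract" is on the PATH.
    static string Recognize(string imagePath)
    {
        string outBase = Path.Combine(Path.GetTempPath(), "frame_ocr");

        ProcessStartInfo psi = new ProcessStartInfo("tesseract",
            "\"" + imagePath + "\" \"" + outBase + "\"");
        psi.UseShellExecute = false;
        psi.CreateNoWindow = true;

        using (Process p = Process.Start(psi))
        {
            p.WaitForExit();   // wait until the text file has been written
        }
        return File.ReadAllText(outBase + ".txt");
    }

    static void Main(string[] args)
    {
        Console.WriteLine(Recognize(args[0]));
    }
}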
2.6 GOCR
GOCR is an OCR program developed under the GNU Public License, initially
written by Jörg Schulenburg; it is also called JOCR. It converts scanned images to text
files [20].
The GOCR engine assumes no colors (black on white only), no rotation, and a single
font with all characters separated; every character is recognized empirically based on
its pixel pattern [21].
Figure 2.9: GOCR tool detection example 1; a- Input, b- Output
Figure 2.10: GOCR tool detection example 2; a- Input, b- Output
Figure 2.11: GOCR tool detection Example 3 a- Input b- Output
Like Tesseract, GOCR is accessible, easy to use, and can be called from the command
prompt in any programming language, and its detection accuracy for the images in
figures 2.9-2.11 is similar to Tesseract's. Unlike Tesseract, however, GOCR is not
regularly updated.
2.7 Microsoft Office Document Imaging (MODI)
Microsoft Office Document Imaging (MODI) is a Microsoft Office application
that supports editing documents scanned by Microsoft Office Document Scanning. It was
first introduced in Microsoft Office XP and is included in later Office versions.
Via COM, MODI provides an object model based on 'document' and 'image' (page)
objects. One feature that has elicited particular interest on the Web is MODI's ability to
convert scanned images to text under program control, using its built-in OCR engine.
The MODI object model is accessible from development tools that support the
Component Object Model (COM) by using a reference to the Microsoft Office Document
Imaging 11.0 Type Library. The MODI Viewer control is accessible from any
development tool that supports ActiveX controls by adding Microsoft Office Document
Imaging Viewer Control 11.0 or 12.0 (MDIVWCTL.DLL) to the application project.
When optical character recognition (OCR) is performed on a scanned document,
text is recognized using sophisticated pattern-recognition software that compares scanned
text characters with a built-in dictionary of character shapes and sequences. The
dictionary supplies all uppercase and lowercase letters, punctuation, and accent marks
used in the selected language [22].
In the images tested, the accuracy of MODI was very good, and it was easy to
access via code. After adding a reference to the Microsoft Office Document Imaging 12.0 Type
Library, it is accessible from any development tool that supports COM, as shown in figure 2.12.
MODI.Document md = new MODI.Document();              // create an empty MODI document
md.Create(FileName);                                  // load the image file into it
md.OCR(MODI.MiLANGUAGES.miLANG_ENGLISH, true, true);  // run OCR in English with orientation and skew correction
MODI.Image image = (MODI.Image)md.Images[0];          // take the first (and only) page
writeFile.Write(image.Layout.Text);                   // write out the recognized text
Figure 2.12: Using the MODI OCR engine in C# programming language
Figure 2.13: MODI detection example 1; a- Input, b- Output
Figure 2.14: MODI detection example 2; a- Input, b- Output
Figure 2.15: MODI detection example 3; a- Input b- Output
The accessibility and usability of MODI made it easy to import into a C# project, and
it has a very good accuracy rate for ICS video images. In figure 2.15 it was even able to
detect the text in the video thumbnail on the left.
We found MODI to be a friendly engine for the type of images in ICS videos,
which generally contain fully colored text and graphics.
2.8 Conclusion
We presented some popular OCR tools and checked whether they can be integrated into
our ICS video project. We justified our choices by giving examples: three
input images and the results from each OCR tool. We conclude that any of
the three tools GOCR, Tesseract OCR, and MODI can be integrated. It is hard to say which one
is best by looking at the outputs of three examples, so we decided to include all of them in our
experiments. A large-scale set of test images from ICS videos and the results of these
tools will give us a better perspective; this is done in the experiments and results
section of chapter 5. Before that, in the following chapter we look at the challenges
of OCR and our proposed methods to enhance detection.
Chapter 3: OCR and Challenges
In the previous chapter we surveyed OCR tools and decided which to
use: MODI, GOCR, and Tesseract OCR. In the examples provided there, we saw that colored
images confuse OCR tools and lower their accuracy. That means there are issues we need
to deal with between the OCR engines and ICS video images. In this chapter we introduce
what OCR is, how OCR works, and what the challenges are in OCR detection for ICS video
frames, so that we can discuss how to deal with them in the next chapter.
3.1 What is OCR?
Optical character recognition, more commonly known as OCR, is the
interpretation of scanned images of handwritten, typed or printed text into text that can be
edited on a computer. There are various components that work together to perform
optical character recognition. These elements include pattern identification, artificial
intelligence and machine vision. Research in this area continues, developing
more effective read rates and greater precision [23].
In 1929 Gustav Tauschek obtained a patent on OCR in Germany, followed by
Handel, who obtained a US patent on OCR in the USA in 1933 (U.S. Patent 1,915,993). In
1935 Tauschek was also granted a US patent on his method (U.S. Patent 2,026,329).
Tauschek's machine was a mechanical device that used templates and a photodetector.
RCA engineers in 1949 worked on the first primitive computer-type OCR to help
blind people, for the US Veterans Administration; but instead of converting the printed
characters to machine language, their device read the characters and then spoke the letters
aloud. It proved far too expensive and was not pursued after testing [24].
Since that time, OCR has been used for credit card imprints for billing purposes, for
digitizing the serial numbers on coupons returned from advertisements, for sorting mail
in the United States Postal Service, for converting text so that a computer can read it
aloud to blind people, and for digitizing and storing scanned documents in archives such
as hospitals and libraries.
3.2 How Does OCR Work?
OCR engines are good pattern recognition engines and robust classifiers, with the
ability to generalize in decision making based on imprecise input data. They offer ideal
solutions to a variety of character classification problems.
There are two basic methods used for OCR: Matrix matching and feature extraction.
Of the two ways to recognize characters, matrix matching is the simpler and more
common. Matrix Matching compares what the OCR scanner sees as a character with a
library of character matrices or templates. When an image matches one of these
prescribed matrices of dots within a given level of similarity, the computer labels that
image as the corresponding ASCII character. Matrix matching works best when the OCR
encounters a limited repertoire of type styles, with little or no variation within each style.
Where the characters are less predictable, feature, or topographical, analysis is superior.
Feature extraction is OCR without strict matching to prescribed templates. Also known
as Intelligent Character Recognition (ICR) or topological feature analysis, this method
varies by how much "computer intelligence" is applied by the manufacturer. The
computer looks for general features such as open areas, closed shapes, diagonal lines, line
intersections, etc. This method is much more versatile than matrix matching, but it needs a
pattern recognition process as shown in Figure 3.1 [25].
Figure 3.1 Pattern Recognition steps for Classification
In OCR engines, for feature extraction, the computer needs to define which pixels are
part of a character path and which are not. In other words, it needs to classify every pixel
as path or non-path; path pixels can be treated as 1 and the others as 0. In figure 3.2 the
path pixels form the character E.
Figure 3.2: Character representation for feature extraction: a) Black and white image
b) Binary representation of the image [26]
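As a toy illustration of matrix matching and of the binary path/non-path representation in Figure 3.2, the sketch below compares a hand-made 5x5 binary glyph against a stored template of the letter E and reports the fraction of matching pixels. The grids and the 90% acceptance level are invented here for illustration only; real engines use much larger templates and many font variants.

using System;

class MatrixMatchingExample
{
    // Fraction of pixels at which the scanned glyph and the template agree.
    static double MatchScore(int[,] glyph, int[,] template)
    {
        int rows = template.GetLength(0), cols = template.GetLength(1);
        int matches = 0;
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
                if (glyph[r, c] == template[r, c]) matches++;
        return (double)matches / (rows * cols);
    }

    static void Main()
    {
        // 1 = "path" (ink) pixel, 0 = background, as in Figure 3.2.
        int[,] templateE =
        {
            { 1, 1, 1, 1, 1 },
            { 1, 0, 0, 0, 0 },
            { 1, 1, 1, 1, 0 },
            { 1, 0, 0, 0, 0 },
            { 1, 1, 1, 1, 1 }
        };
        int[,] scanned =
        {
            { 1, 1, 1, 1, 0 },   // slightly distorted top stroke
            { 1, 0, 0, 0, 0 },
            { 1, 1, 1, 1, 0 },
            { 1, 0, 0, 0, 0 },
            { 1, 1, 1, 1, 1 }
        };

        double score = MatchScore(scanned, templateE);   // 0.96 here (24 of 25 pixels agree)
        // Label the glyph 'E' if the similarity exceeds the chosen level.
        Console.WriteLine(score >= 0.90 ? "recognized as E" : "no match");
    }
}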
3.3 Causes of False Detection
OCR engines work by pattern recognition on the images: before recognizing a path,
they need to classify each image pixel as path (1) or not path (0). Like most images,
ICS video images are colored in different shades and are sometimes distorted or
noisy, which makes pattern recognition fail at some level. Even though the strength of
OCR engines lies in their resilience against distortions in the input data and in their
capability to learn, they have a limit; after a certain level of distortion they start to
make mistakes.
Figure 3.3: Distorted image pattern analysis. Image (a) is distorted but can still be detected
and interpreted as (b); image (c) is distorted more and cannot be detected.
This pattern recognition and machine learning problem is related to the computer
vision problem, which in turn is related to the human visual system. In that sense, we can say
that if a picture is hard for humans to read, it is also hard for computers to read. (The reverse
does not hold: an irony of computer science is that tasks humans struggle with can be
performed easily by computer programs, while tasks humans can perform effortlessly
remain difficult for computers. We can write a computer program to beat the very best
human chess players, but we cannot write a program that identifies objects in a photo or
understands a sentence with anywhere near the precision of even a child.)
The human visual system and visibility are affected by the following factors:
1) Contrast – the relationship between the luminance of an object and the luminance
of the background. The luminance (the proportion of incident light reflected into the
eye) can be affected by the location of light sources and by room reflectance (glare
problems).
2) Size – the larger the object, the easier it is to see. However, it is the size of the
image on the retina, not the size of the object per se, that is important; therefore we
bring smaller objects closer to the eye to see details.
3) Color – not really a factor in itself, but closely related to both contrast and
luminance.
For humans, it is essential to have a certain amount of contrast in an image to define
what it is, which is the same for computers: computers need a certain amount of contrast
in shapes to be able to detect differences. This is also important for character recognition.
Characters in the image should have enough contrast to be able to be defined.
Figure 3.4: Contrast and color difference in characters in an image. White text has high
contrast and is easy to read, blue text has low contrast and it is hard to read.
Figure 3.5 Size difference in characters in an image. Bigger size text is easier to read.
The best OCR results depend on various factors, the most important being the font and
size of the text. Other noted factors are color, contrast, brightness, and density of content. OCR
engines fail at pattern recognition on low-contrast, small, and complexly colored text.
We can apply image processing techniques to modify images before using OCR to
reduce the number of OCR failures.
OCR engine detection is also affected by the font style of the text: to detect a character
in a certain font style, that style must have been previously defined and stored. Since the
fonts in our ICS video frames are of the type most OCR engines support, such as
Tahoma, Arial, sans-serif fonts, and Times New Roman, font style will barely
affect the detection of our OCR engines. So our enhancements concern segmentation,
text size, color, contrast, brightness, and density of content. We discuss the approach we
used in the next chapter.
Chapter 4: Enhancements for OCR detection
In the previous chapter, we looked at the challenges in OCR detection for ICS
Video frames. Here, we will discuss the approach and the methods we used to get a better
recognition from OCR engines.
OCR engines possess complex algorithms, predefined libraries, and training
datasets, and modifying an OCR algorithm requires an understanding of the algorithm from
beginning to end. Apart from that, ICS video images are sometimes too complex for the
engines to handle as-is; therefore, enhancing an image before sending it to the OCR
engine can yield better results.
4.1 Segmentation
In the previous chapter, we stated that OCR engines use segmentation that is mostly
designed for scanned images with a black font on a white background. Segmentation of
text has two phases: detection of the word and detection of the character. Detecting the word
can be thought of as locating the word's place in the image, and detecting the character
as locating each character within the word. While using OCR engines, we
saw that this segmentation is not enough for some ICS video images: due to the lack of
segmentation, OCR engines make mistakes on images with complex colored objects.
The mistakes are introduced while setting a threshold of the image for correct
binarization.
A successful segmentation of a black and white image and a failed
segmentation of a colored image with a complex background are represented by figure 4.1
and figure 4.2, respectively. In these figures, segmentation 1 can be considered the
segmentation into words and segmentation 2 the segmentation into characters.
Figure 4.1: Black font text on the white color background segmentation for OCR
character recognition.
In figure 4.1, segmentation 1 and segmentation 2 succeed, so all text can be separated into words and then into characters. In figure 4.2, because there is too little difference between text 1 and its background, text 1 is not segmented as a word and therefore cannot be segmented into characters either. Text 2 in figure 4.2 sits on a distinct background, so it can be segmented into a word and then into characters. The backgrounds of text 3 and text 4 in figure 4.2 are very close to each other, so they are treated as a single word; and since their font and background colors are also close, character segmentation fails.
Figure 4.2: Complex background with different font color text segmentation for OCR.
Note that these figures are used only to illustrate OCR segmentation. To see the importance of segmentation, we can look at actual ICS video images and their OCR outputs. Figure 4.3 shows that without segmentation the OCR engines fail, whereas in figure 4.4 a segmented input yields much better OCR performance.
Figure 4.3: OCR results for a whole image containing complex objects in different colors: a) input image, b) GOCR result, c) MODI OCR result, d) Tesseract OCR result. (All three OCR outputs are unreadable, garbled character strings.)
Figure 4.4: OCR results for a segmented image containing complex objects in different colors: a) segmented parts of the input, b) GOCR result, c) MODI OCR result, d) Tesseract OCR result. (On the segmented input, the labels "squared responses", "vertical", "classification", "horizontal", and "smoothed mean" are recognized largely correctly.)
Image segmentation is probably the most widely studied topic in computer vision and image processing, and many studies target segmentation for a particular application; we mentioned some of them in chapter 1. In our approach, we simply group the objects on the screen by thresholding, dilation, and blob extraction, which are explained in the following sections.
4.1.1 Thresholding
Images in ICS videos are colored; therefore, we need to convert them to black-and-white images for segmentation and morphological operations. We do so by performing image binarization, known as thresholding. We used the SIS filter from the AForge Image Library, a free image processing library for C#, which performs image thresholding by calculating the threshold automatically using the simple image statistics method. For each pixel:

- two gradients are calculated,
  ex = |I(x + 1, y) − I(x − 1, y)| and ey = |I(x, y + 1) − I(x, y − 1)|;
- the weight is calculated as the maximum of the two gradients;
- the sum of weights is updated (weightTotal += weight);
- the sum of weighted pixel values is updated (total += weight * I(x, y)).

The resulting threshold is calculated as the sum of weighted pixel values divided by the sum of weights [27].
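As an illustration, the following C# sketch mirrors this weighting scheme on a plain grayscale intensity array. It only restates the statistics described above; the method name and array representation are assumptions of ours, and in the actual pipeline the AForge SIS filter performs this step.

using System;

static class SisExample
{
    // Illustrative SIS threshold on a grayscale image stored as img[y, x] (0..255).
    // Mirrors the steps above; the real system uses AForge's SIS threshold filter.
    public static int ComputeSisThreshold(byte[,] img)
    {
        int height = img.GetLength(0), width = img.GetLength(1);
        double weightTotal = 0, total = 0;

        for (int y = 1; y < height - 1; y++)
        {
            for (int x = 1; x < width - 1; x++)
            {
                // Two gradients: horizontal (ex) and vertical (ey).
                double ex = Math.Abs(img[y, x + 1] - img[y, x - 1]);
                double ey = Math.Abs(img[y + 1, x] - img[y - 1, x]);

                // Weight is the maximum of the two gradients.
                double weight = Math.Max(ex, ey);

                weightTotal += weight;            // sum of weights
                total += weight * img[y, x];      // sum of weighted pixel values
            }
        }

        // Threshold = sum of weighted pixel values / sum of weights.
        return weightTotal > 0 ? (int)(total / weightTotal) : 128;
    }
}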
a) Input Image
b) Thresholded Image
Figure 4.5: SIS threshold example 1: the output has a white foreground on a black background.
SIS thresholding results can differ, as shown in Figure 4.5 and Figure 4.6. In Figure 4.5 the output has a white foreground and black background; in Figure 4.6 it is reversed.
a) Input Image
b) Thresholded Image
Figure 4.6: SIS threshold example 2: the output has a black foreground on a white background.
For the morphological operation, we use either erosion or dilation, and we must decide which one to apply according to the image. If the image has a white foreground and a black background, dilation would tend to remove the foreground, which is not desirable, so we make the decision by calculating the Average Optical Density (AOD):

AOD(I) = (1/N²) Σ_{i=0}^{N−1} Σ_{j=0}^{N−1} I(i, j)

We calculate the AOD of a binary image whose values are 0 (white) and 1 (black), which puts the AOD between 0 and 1. We found that for ICS video frames, AOD > 0.15 indicates a black foreground on a white background, in which case we use erosion. Otherwise, when AOD <= 0.15, the image has a white foreground on a black background and we use dilation.
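A minimal sketch of this decision is shown below, assuming the binarized image is stored as an int array with 0 for white and 1 for black; the method and parameter names are ours, not part of the actual tool.

static class AodExample
{
    // Compute AOD of a binary image (0 = white, 1 = black) and decide the operation.
    public static bool UseErosion(int[,] binary)
    {
        int height = binary.GetLength(0), width = binary.GetLength(1);
        double sum = 0;
        for (int i = 0; i < height; i++)
            for (int j = 0; j < width; j++)
                sum += binary[i, j];

        // AOD(I) = (1/N^2) * sum over all pixels; generalized here to height * width pixels.
        double aod = sum / (height * width);

        // AOD > 0.15: black foreground on white background -> erosion;
        // otherwise: white foreground on black background -> dilation.
        return aod > 0.15;
    }
}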
4.1.2 Erosion and Dilation
In the previous sections we binarized the image, calculated the AOD, and decided which morphological operation to use for segmentation. Here, we discuss what erosion and dilation mean and how they affect images.
Erosion and dilation are morphological operations that affect the shapes of objects and regions in binary images. All processing is done on a local basis, so region or blob shapes are affected in a local manner. A structuring element (a geometric relationship between pixels) is moved over the image in such a way that it is centered over every image pixel at some point, row by row and column by column. An illustration of the movements of a structuring element is shown in figure 4.7.
Figure 4.7: Structuring element movements for morphological operations
Given a window (structuring element) B and a binary image I:
J1 = DILATE(I, B), where J1(i, j) = OR{ I(i − m, j − n) : (m, n) ∈ B }
J2 = ERODE(I, B), where J2(i, j) = AND{ I(i − m, j − n) : (m, n) ∈ B } [20]
Dilation removes too-small holes in objects on an image with a black foreground and white background. Erosion is the reverse of dilation, but when applied to an image with a white foreground and black background it produces the same effect, removing too-small holes in objects, so we refer to both cases as the dilation effect.
For the erosion and dilation operations we used the structuring element shown in Figure 4.8:

0 0 0
1 1 1
0 0 0

Figure 4.8: Structuring element for erosion and dilation
Thus, the dilation effect allows separate objects to grow or to join together. We use it to join characters and create groups for segmentation. We chose a horizontal window so that the characters tend to merge to the right and left, as shown in Figure 4.9. Through trial and error, we found that 8 iterations of the dilation/erosion process are reasonable for joining characters in most ICS video images; a sketch of this operation follows the figure.
Figure 4.9: Dilation effect on an image, shown over 8 successive iterations: dilation joins the small objects (characters) and fills the small holes, so the text regions are converted into solid rectangular blobs.
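The following sketch shows binary dilation with the horizontal structuring element of Figure 4.8, repeated for a given number of iterations. It is a simplified stand-in for the AForge dilation/erosion filters used in the actual pipeline; the method name and binary-array representation are assumptions.

static class DilationExample
{
    // Binary dilation with the 3x3 horizontal structuring element (0 0 0 / 1 1 1 / 0 0 0):
    // a pixel becomes foreground (1) if it or either horizontal neighbour is foreground.
    public static int[,] DilateHorizontal(int[,] src, int iterations = 8)
    {
        int h = src.GetLength(0), w = src.GetLength(1);
        var current = (int[,])src.Clone();

        for (int it = 0; it < iterations; it++)
        {
            var next = new int[h, w];
            for (int y = 0; y < h; y++)
            {
                for (int x = 0; x < w; x++)
                {
                    int left  = x > 0     ? current[y, x - 1] : 0;
                    int right = x < w - 1 ? current[y, x + 1] : 0;
                    next[y, x] = left | current[y, x] | right;   // OR over the window
                }
            }
            current = next;
        }
        return current;   // after about 8 iterations, characters merge into word/line blobs
    }
}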
4.1.3 Edge detection
In the previous step we grouped small objects, such as characters, into single larger objects. Next, we used the Sobel operator from the AForge Image Library to detect edges. Detecting the edges unifies the objects and provides more accurate detection of the groups.
The filter searches for objects' edges by applying the Sobel operator. Each pixel in the resulting image is calculated as an approximate absolute gradient magnitude for the corresponding pixel of the source image:

|G| = |Gx| + |Gy|,

where Gx and Gy are calculated using the Sobel convolution kernels:

Gx:            Gy:
-1  0 +1       +1 +2 +1
-2  0 +2        0  0  0
-1  0 +1       -1 -2 -1

Using the above kernels and the 3x3 neighborhood of pixel x,

P1 P2 P3
P8  x P4
P7 P6 P5

the approximated magnitude for pixel x is calculated with the following equation [28]:

|G| = |P1 + 2P2 + P3 − P7 − 2P6 − P5| + |P3 + 2P4 + P5 − P1 − 2P8 − P7|
Edge detection effect is shown in Figure 4.10.
a) Dilated image
b) Edge detected image
Figure 4.10: Edge Detection effect on a dilated image
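As a sketch, the per-pixel computation above can be written as follows. The actual system applies AForge's Sobel edge detection filter to the dilated image; this hand-rolled version, with assumed method name and array representation, is for illustration only.

using System;

static class SobelExample
{
    // Approximate Sobel gradient magnitude |G| = |Gx| + |Gy| for each interior pixel.
    public static byte[,] SobelMagnitude(byte[,] img)
    {
        int h = img.GetLength(0), w = img.GetLength(1);
        var result = new byte[h, w];

        for (int y = 1; y < h - 1; y++)
        {
            for (int x = 1; x < w - 1; x++)
            {
                // Neighbourhood labelled P1..P8 as in the text (P1 = top-left, clockwise).
                int p1 = img[y - 1, x - 1], p2 = img[y - 1, x], p3 = img[y - 1, x + 1];
                int p8 = img[y, x - 1],                         p4 = img[y, x + 1];
                int p7 = img[y + 1, x - 1], p6 = img[y + 1, x], p5 = img[y + 1, x + 1];

                int g = Math.Abs(p1 + 2 * p2 + p3 - p7 - 2 * p6 - p5)
                      + Math.Abs(p3 + 2 * p4 + p5 - p1 - 2 * p8 - p7);

                result[y, x] = (byte)Math.Min(255, g);
            }
        }
        return result;
    }
}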
4.1.4 Blob Extraction
After edge detection has ensured the continuity and completeness of the groups, we count and extract the stand-alone objects in the image using the connected-components labeling algorithm of the AForge Image Library. This is an application of graph theory in which subsets of connected components are uniquely labeled based on a given heuristic. The filter labels the objects in the source image and colors each separate object with a different color, treating all non-black pixels as object pixels and all black pixels as background.
The AForge blob extractor extracts blobs from its input images (which in our case are the thresholded and dilated images). However, OCR needs the original image as input, so we use blob extraction only to locate the objects' positions, which we then map back onto the original image.
A blob extraction example is shown in Figure 4.11. One might expect more blobs in the extracted result; however, they were filtered using several criteria. If a blob contains other blobs, we do not extract it. We also discard a blob when its width/height ratio is less than 1.5: the text we want to detect is at least two characters long, and since we dilated the text to the right and left, a text blob is always wider than it is tall. In figure 4.11, the man's body is not extracted because of its height-to-width ratio. Very small blobs are excluded as well. After filtering according to these criteria, we pass the remaining parts to the Tesseract OCR engine, and if no text is detected in a blob, we remove it; this step can be considered a back-and-forth verification operation.
Figure 4.11: Blob extraction example using the edge-detected image: a) original image, b) edge-detected image, c) extracted blobs.
Increasing the number of dilation iterations shown in figure 4.9, or increasing the size of the structuring element, reduces the number of blobs detected. However, it may also merge text with other objects; thus, we keep the structuring element small (3x3) and the number of iterations at 8. A sketch of the blob-filtering rules described above follows.
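The sketch below applies these filtering rules to the bounding rectangles produced by a blob extractor (such as the AForge blob counter). The minimum-area value and the method name are illustrative assumptions, and the final OCR verification step is only indicated by a comment.

using System.Collections.Generic;
using System.Drawing;
using System.Linq;

static class BlobFilterExample
{
    // Keep only blobs that look like text regions according to the rules above.
    public static List<Rectangle> FilterBlobs(IList<Rectangle> blobs, int minArea = 100)
    {
        var kept = new List<Rectangle>();
        foreach (var b in blobs)
        {
            // Text blobs are wider than tall (characters were dilated left/right).
            if (b.Height == 0 || (double)b.Width / b.Height < 1.5) continue;

            // Very small blobs are discarded.
            if (b.Width * b.Height < minArea) continue;

            // Blobs that contain other blobs are not extracted.
            if (blobs.Any(o => o != b && b.Contains(o))) continue;

            kept.Add(b);
            // Each kept rectangle is later cropped from the ORIGINAL image and passed
            // to Tesseract; blobs in which no text is detected are removed afterwards.
        }
        return kept;
    }
}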
4.2 Resizing Text Font Size
In the previous section we separated the text from its complex background by segmentation. But what if the segmented text's font size is too small to be detected correctly?
We know that the best font size for OCR is 10 to 12 points. Smaller font sizes lead to poor-quality OCR, and font sizes greater than 72 points are treated as images and should therefore be avoided. Usually dark, bold letters on a light background, or vice versa, yield good results, and the textual content should ideally be well placed, with good spacing between words and sentences [29].
We need fonts that are big enough for the text to be detected. A big image by itself is not enough; we need bigger text, just as in the human visual system a large object is not enough by itself, because what matters is a large image on the retina. Increasing the size of the text is easily achieved by resizing the image: for instance, scaling the image by 1.5 makes everything in it, including the text, 1.5 times bigger.
a) Original image
b) Resized image (×1.5)
Figure 4.12: Resize process on an example image
Font sizes under 10 px are hard for OCR engines to recognize. We have plenty of small characters (mostly in explanations attached to images or graphs), so before the image processing we enlarge the images by a factor of 1.5 by default. Resizing is done with the bilinear interpolation method, which works in two directions and tries to achieve the best approximation of a pixel's color and intensity based on the values of the surrounding pixels. The following example illustrates how the resizing/enlargement works:
Figure 4.13: Resize process by interpolation: a) original image, b) pixels divided to fill by interpolation, c) resized image.
Bilinear interpolation considers the closest 2x2 neighborhood of known pixel values surrounding the unknown pixel and takes a weighted average of these 4 pixels to arrive at the final interpolated value. This produces much smoother-looking images than nearest-neighbor interpolation.
Figure 4.14: Resize process with bilinear interpolation at the pixel level.
In the diagram in Figure 4.14, the left side shows the case where the distances to all known pixels are equal, so the interpolated value is simply their sum divided by four [30]. A sketch of this computation follows.
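The sketch below performs this interpolation for a grayscale image, scaling by the default factor of 1.5. The actual pipeline uses the AForge bilinear resize filter; the method name and array representation here are assumptions.

using System;

static class ResizeExample
{
    // Bilinear resize of a grayscale image by the given factor (1.5 by default).
    public static byte[,] ResizeBilinear(byte[,] src, double factor = 1.5)
    {
        int h = src.GetLength(0), w = src.GetLength(1);
        int nh = (int)(h * factor), nw = (int)(w * factor);
        var dst = new byte[nh, nw];

        for (int y = 0; y < nh; y++)
        {
            for (int x = 0; x < nw; x++)
            {
                // Map the destination pixel back into source coordinates.
                double sy = Math.Min(y / factor, h - 1.0001);
                double sx = Math.Min(x / factor, w - 1.0001);
                int y0 = (int)sy, x0 = (int)sx;
                double fy = sy - y0, fx = sx - x0;

                // Weighted average of the closest 2x2 neighbourhood of known pixels.
                double value =
                    src[y0, x0]         * (1 - fx) * (1 - fy) +
                    src[y0, x0 + 1]     * fx       * (1 - fy) +
                    src[y0 + 1, x0]     * (1 - fx) * fy       +
                    src[y0 + 1, x0 + 1] * fx       * fy;

                dst[y, x] = (byte)Math.Round(value);
            }
        }
        return dst;
    }
}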
4.3 Inversion
In the previous sections we segmented and resized the images and saw that applying both steps improved detection. Detection can be increased further by altering the intensity relationship between the text and its background using the inversion method. Before explaining inversion, we need to look at the RGB color model.
The RGB color model is an additive color model in which red, green, and blue light
are added together in various ways to reproduce a broad array of colors. The name of the
model comes from the initials of the three additive primary colors, red, green, and blue
shown in figure 4.15.
Figure 4.15 RGB Color Model
Image file formats such as BMP, JPEG, TGA, and TIFF commonly use a 24-bit RGB representation: the color value of each pixel is encoded with 24 bits per pixel, where three 8-bit unsigned integers (0 through 255) represent the intensities of red, green, and blue. For example:

- (0, 0, 0) is black
- (255, 255, 255) is white
- (255, 0, 0) is red
- (0, 255, 0) is green
- (0, 0, 255) is blue
- (255, 255, 0) is yellow
- (0, 255, 255) is cyan
- (255, 0, 255) is magenta
Inverting colors is basically manipulating the RGB values. When we invert an image in the classical way, we take the inverse of each RGB value. For example, the inverse of the color (1, 0, 100) is (255−1, 255−0, 255−100) = (254, 255, 155). This changes the appearance of the image, but it does not change the difference between the text and the background, since all channels are subtracted from the same number.
Figure 4.16: The inversion operation: input image on the left, inverted image on the right.
In our approach, we expand this technique from 1 to 7 inversions using the equations in Figure 4.17. OCR engines give different results for inverted images; sometimes, for example, the 5th inversion works better than the 1st. Using inverted images in addition to the original image improved the OCR results, but since we do not know in advance which inversion will be best for the OCR engine, we apply all of them and take the union of the results.
Original image:  R       G       B
Inversion 1:     255−R   G       B
Inversion 2:     R       255−G   B
Inversion 3:     R       G       255−B
Inversion 4:     255−R   255−G   B
Inversion 5:     R       255−G   255−B
Inversion 6:     255−R   G       255−B
Inversion 7:     255−R   255−G   255−B

Figure 4.17: Inversion equations and their effect on the images.
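A sketch of generating these seven variants is shown below; the loop enumerates the non-empty subsets of the {R, G, B} channels (the enumeration order differs from the numbering in Figure 4.17, but the same seven combinations are produced). The method name and the Color-array representation are assumptions.

using System.Collections.Generic;
using System.Drawing;

static class InversionExample
{
    // Build the 7 channel-inverted variants of an image stored as a Color array.
    public static List<Color[,]> BuildInversions(Color[,] original)
    {
        int h = original.GetLength(0), w = original.GetLength(1);
        var variants = new List<Color[,]>();

        for (int mask = 1; mask <= 7; mask++)      // every non-empty subset of {R, G, B}
        {
            var img = new Color[h, w];
            for (int y = 0; y < h; y++)
            {
                for (int x = 0; x < w; x++)
                {
                    Color c = original[y, x];
                    int r = (mask & 1) != 0 ? 255 - c.R : c.R;
                    int g = (mask & 2) != 0 ? 255 - c.G : c.G;
                    int b = (mask & 4) != 0 ? 255 - c.B : c.B;
                    img[y, x] = Color.FromArgb(r, g, b);
                }
            }
            variants.Add(img);
        }
        return variants;   // each variant, plus the original, is sent to the OCR engines
    }
}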
Figure 4.18: OCR engines' (MODI and Tesseract) detections for the original image and for inversions 1, 3, 5, and 7 of an example slide; the amount of correctly recognized text varies between inversions.
Chapter 5: OCR Accuracy Criteria and Test Results
In this chapter, we look at the OCR accuracy criteria and the tools we used for testing, followed by the results of the experiments. We start the discussion by defining the test data.
5.1 Test Data
The test data consists of the test images and the text contained in those images. We look at each in turn.
5.1.1 The Images for OCR Accuracy Test
The test data consists of 1387 different images created by an indexer from 20 selected ICS videos. Most of them were also tested with the ICS video indexer [4], which makes our inputs more reliable. The selected videos are diverse in templates and color styles since they were prepared by different instructors: 15 different instructors are represented, with 14 of the videos coming from the Computer Science Department and 6 from other departments at the University of Houston.
Figure 5.1 shows examples that illustrate the variety of the images in the ICS video test data.
Figure 5.1: Example ICS video images included in test data.
Figure 5.2: Examples of images that are not included in the test.
Some images that do not include any text were removed from the list, as shown in figure 5.2: empty video screens, or screens with no relevant text information, such as the beginning or the end of a video.
5.1.2 The text for OCR test
For each image, all text in the image (main body, tables, figures and their captions, footnotes, page numbers, and headers) should in principle be counted, although there are some exceptions.
Figure 5.3: An example of some text that is not included in the test.
If the text in the image is too small to read, as shown in figure 5.3, it is not included in the text data for that image. Deciding whether text is small enough to omit also depends on our own ability to read it: if we cannot read it accurately, we cannot write the ground truth text against which the tools' results are compared.
For search, case information is not useful; people do not search specifically for uppercase letters, so our recognition is not case sensitive and all data is treated as lowercase.
5.2. Word Accuracy and Search Accuracy
If there are n different words and each word has a repetition frequency as shown in table 5.1, then the word accuracy is calculated according to the "WA" formula below. In other words, "word accuracy" tells us what fraction of the words is detected correctly.
Word     Ground truth frequency    Detected frequency    Missed
w1       F1                        f1                    M1 = F1 − f1
w2       F2                        f2                    M2 = F2 − f2
w3       F3                        f3                    M3 = F3 − f3
...      ...                       ...                   ...
wn−1     Fn−1                      fn−1                  Mn−1 = Fn−1 − fn−1
wn       Fn                        fn                    Mn = Fn − fn

Table 5.1: Formulation of "word accuracy"
Mi --> number of missed occurrences of word wi
NTW --> number of total words: F1 + F2 + F3 + ... + Fn
MW --> total number of missed words: M1 + M2 + ... + Mn
WA --> word accuracy, given by
WA = 1 − MW/NTW = 1 − [M1 + M2 + ... + Mn] / (F1 + F2 + F3 + ... + Fn)
   = 1 − [(F1 − f1) + (F2 − f2) + (F3 − f3) + ... + (Fn − fn)] / (F1 + F2 + F3 + ... + Fn)
Search accuracy is related to the probability of a successful search. Again there are n different words, each with a frequency, but here we cap every frequency at 1: if a word is detected in an image at all, we do not need to know how many times, because for search the only question is whether the word exists in that image or not. The formulation of search accuracy is given in table 5.2, and the calculation follows the SA formula.
Word     Ground truth frequency    Detected frequency    Search missed
w1       F1                        f1                    M1 = 1 if f1 < 1, else M1 = 0
w2       F2                        f2                    M2 = 1 if f2 < 1, else M2 = 0
w3       F3                        f3                    M3 = 1 if f3 < 1, else M3 = 0
...      ...                       ...                   ...
wn−1     Fn−1                      fn−1                  Mn−1 = 1 if fn−1 < 1, else Mn−1 = 0
wn       Fn                        fn                    Mn = 1 if fn < 1, else Mn = 0

Table 5.2: Formulation of "search accuracy"
SMWSearch missed word value will be M1+M2+M3…Mn.
SASearch Accuracy will be given by
SA=SMW/n= (M1+M2+M3…Mn) /n;
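The following sketch computes both measures from the ground truth and detected word lists of a single image, following the formulas above. The names and list representation are ours; the actual accuracy tester reads the ground truth from the txt files described in the next section.

using System;
using System.Collections.Generic;
using System.Linq;

static class AccuracyExample
{
    // Word accuracy (WA) and search accuracy (SA) for one image, case-insensitive.
    public static (double wa, double sa) Accuracies(
        IEnumerable<string> groundTruth, IEnumerable<string> detected)
    {
        var truthCounts = groundTruth.Select(w => w.ToLowerInvariant())
                                     .GroupBy(w => w)
                                     .ToDictionary(g => g.Key, g => g.Count());
        var detectedCounts = detected.Select(w => w.ToLowerInvariant())
                                     .GroupBy(w => w)
                                     .ToDictionary(g => g.Key, g => g.Count());

        int totalWords = truthCounts.Values.Sum();   // NTW
        int missedWords = 0;                         // MW
        int missedUnique = 0;                        // SMW

        foreach (var kv in truthCounts)
        {
            int fi = detectedCounts.TryGetValue(kv.Key, out int d) ? d : 0;
            missedWords += Math.Max(0, kv.Value - fi);   // Mi = Fi - fi
            if (fi < 1) missedUnique++;                  // word never found at all
        }

        double wa = 1.0 - (double)missedWords / totalWords;         // WA = 1 - MW / NTW
        double sa = 1.0 - (double)missedUnique / truthCounts.Count; // SA = 1 - SMW / n
        return (wa, sa);
    }
}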
5.3 Preparing and Testing Tools
We created two tools for testing and experiments: one for preparing the ground truth text data and the other for running the experiments and testing accuracy. We discuss each of them separately.
5.3.1 TextPictureEditor
Before we can test accuracy, we have to prepare the ground truth for the test: each image we want to test needs a corresponding text. Creating this text by looking at the pictures and typing manually takes a lot of time, so we designed a small tool to help. The tool lets us move back and forth through the images in a folder from its user interface; we can see each image and type the text we see in it into the text area below the picture.
Figure 5.4: Screenshot of TextPictureEditor tool.
The steps for running the tool are:
- Open a folder; the tool automatically loads a picture from that folder.
- If no text has been created for the picture yet, send the picture to OCR and place the resulting text in the text area.
- If the text has already been created, check whether it is correct; if not, update it and save it from the text area.
Figure 5.5: Input folder for OCR test created by TextPictureEditor tool.
By stepping back and forth through the folder, we create text files holding the ground truth text for all images. All comparisons are based on these txt files.
5.3.2 OCR Tool Manager and Accuracy Tester
The rest of the job is done by the accuracy tester: applying the image processing techniques to the images, managing the OCR tools, and testing the accuracy of these tools. This is illustrated in figure 5.6.
The regions of the tool, as shown in figure 5.6, are:
1) Selecting the folders for the accuracy test, or choosing an image to experiment with.
2) Selecting the image processing techniques used to modify the images.
3) Input image region.
4) Modified image region.
5) Selecting the OCR engines.
6) The output of MODI is shown here.
7) The outputs of GOCR and Tesseract are shown here.
Figure 5.6: Screenshot of OCR tool manager and accuracy tester
When the accuracy test is run on a folder of ICS video images, the tool modifies the images and obtains the text from each OCR engine. It then compares each engine's result to the ground truth data and creates an Excel file showing the statistics for the images, as shown in figures 5.7 and 5.8.
Figure 5.7: Screenshot of OCR tool manager and accuracy tester
Figure 5.8: Excel file created by the OCR Manager Tool for a folder.
As shown in figure 5.8, the Excel file lists the missed words, missed searches, and missed characters for each image separately, along with totals over all images. The tool also creates a separate Excel file for each image with a detailed view, shown in figure 5.9, listing which words were detected and which were not; the accuracy of each tool can be read from this file.
Figure 5.9: Excel file created by the OCR Manager Tool for a single image.
5.4 Experiments and Test Results
As mentioned before, we tested 3 different OCR tools, MODI, GOCR, and Tesseract OCR; our purpose is to find the best OCR engine to use in the ICS video player. The test has two phases: the first compares the accuracy rates of the 3 tools without any image modifications, and the second compares the results of each tool after the image modifications.
We used 20 different video folders containing 1387 different pictures in total, created by the video indexer and reduced from about 2000 images after eliminating some of them under the criteria defined in the previous section. We then created the ground truth text using TextPictureEditor, the tool that makes it easier to write the text for an image.
We kept the image files and text files together in each folder, giving 20 different folders. For each image we ran the three OCR tools and compared their results with the ground truth data, creating separate Excel files with the statistical information. Their accuracy rates according to our criteria give us an idea of which tool is better.
For the second phase we modified the inputs by applying the image processing techniques; in other words, we preprocessed the images in the hope of getting better results from the OCR tools. All results were written to the same Excel file for each video, and finally we merged these 20 Excel files manually for an overall picture.
Method                     # Word Miss   # Expected Words   Word Accuracy
Modi                       1823          27201              93.30%
Gocr                       7117          27201              73.84%
Tesseract                  4406          27201              83.80%
Modi-Gocr-Tesseract        1068          27201              96.07%
IE+Modi                    766           27201              97.18%
IE+Gocr                    4829          27201              82.25%
IE+Tesseract               2148          27201              92.10%
IE+Modi-Gocr-Tesseract     589           27201              97.83%

Table 5.3: OCR accuracy test results for "Word Accuracy"
Graph 5.1: OCR accuracy test graph for “Word Accuracy”
Word accuracy results are shown in table 5.3 and graph 5.1. We can conclude that among the 3 OCR tools, MODI has the highest word accuracy with 93.30%, followed by Tesseract OCR with 83.80% and GOCR with 73.84%. When we use the tools together, the word accuracy increases to 96.07%: some words are detected by one OCR tool but not by the others, so combining the tools by taking the union of their results increases accuracy, as the tools complement each other.
Table 5.3 and graph 5.1 also show that our proposed image enhancement method (IE) works well: it increased the word accuracy of every method. IE raised the MODI word accuracy from 93.30% to 97.18%, the GOCR word accuracy from 73.84% to 82.25%, and the Tesseract word accuracy from 83.80% to 92.10%. IE also raised the word accuracy of the combined method from 96.07% to 97.83%. The increase for the combined method is not as large as for the individual methods, which can also be seen from the table below.
Method                     # detection increase with IE
Modi                       1026
Gocr                       2140
Tesseract                  2155
Modi-Gocr-Tesseract        460

Table 5.4: Number of words each method missed on its own that are detected with IE.
Similar results are obtained in table 5.5 and graph 5.2 for the search accuracy rate: MODI has the highest search accuracy, and IE increased the search accuracy of all the OCR tools.
Method                     # Search Miss   # Expected Unique Words   Search Accuracy
Modi                       1784            20006                     91.08%
Gocr                       6736            20006                     66.33%
Tesseract                  4113            20006                     79.44%
Modi-Gocr-Tesseract        1044            20006                     94.78%
IE+Modi                    758             20006                     96.21%
IE+Gocr                    4596            20006                     77.03%
IE+Tesseract               1958            20006                     90.21%
IE+Modi-Gocr-Tesseract     584             20006                     97.08%

Table 5.5: OCR accuracy test results for "Search Accuracy"
Graph 5.2: OCR accuracy test graph for “Search Accuracy”
Method                     Execution Time of IE (ms)   Execution Time of Method (ms)   Total (ms)
Modi                       0                           987940574                       987940574
Gocr                       0                           988302280                       988302280
Tesseract                  0                           989242186                       989242186
Modi-Gocr-Tesseract        0                           4000630424                      4000630424
IE+Modi                    999164776                   1018913396                      2018078172
IE+Gocr                    999164776                   1023043090                      2022207866
IE+Tesseract               999164776                   1035144464                      2034309240
IE+Modi-Gocr-Tesseract     999164776                   4112253332                      5111418108

Table 5.6: OCR test results for "Execution Time (ms)"
Graph 5.3: OCR graph for “Execution Times”
IE increased both word accuracy and search accuracy. But as we can see in table 5.6 and graph 5.3, the IE operation also increased the execution times, which almost doubled for the individual methods.
Method                     # of False Positives
Modi                       19271
Gocr                       10363
Tesseract                  13613
Modi-Gocr-Tesseract        45473
IE+Modi                    81499
IE+Gocr                    52764
IE+Tesseract               93928
IE+Modi-Gocr-Tesseract     150913

Table 5.7: Number of false positives
Graph 5.4: OCR test graph for the number of false positives
The tools also detected some words that do not exist in the image; table 5.7 and graph 5.4 show the number of such false positives. Among the individual tools, GOCR produces the fewest false positives, followed by Tesseract and MODI. Using more than one method at a time increases the number of false detections considerably, but the largest increase comes from the IE method, which can be explained: IE creates 7 different inverted versions of each image, and some of these inversions produce additional false positives. Combining the tools with IE gave more than 150 thousand false positives, which is more than the total number of words.
Method                     Computer Science 4          Computer Science 10
                           False Positive Total        False Positive Total
Modi                       3911                        3112
Gocr                       976                         865
Tesseract                  2773                        1961
Modi-Gocr-Tesseract        18965                       15497
IE+Modi                    6337                        5316
IE+Gocr                    17597                       12258
IE+Tesseract               8484                        6533
IE+Modi-Gocr-Tesseract     29008                       23587

Table 5.8: The 2 videos with the highest numbers of false positives
When we examined the false positives for each video individually, we found that the Computer Science 4 and Computer Science 10 videos have the highest numbers. Looking for the reason, we saw that both videos were prepared in Classroom Presenter, which shows thumbnails of the slides on the left. The slides also use a black font on a white background, which led the tools to detect text even in those small thumbnail regions, where they should not. Figure 5.10 shows some examples from these videos.
Similarly, we looked at the details of the videos with the highest and lowest detection rates; example slides from these videos are shown in figure 5.11 and figure 5.12, respectively.
a) Computer Science 4
b) Computer Science 10
Figure 5.10: Example screens from the videos which have highest false positives
a) Computer Science 14
b) Computer Science 20
Figure 5.11: Example screens from the videos which have highest word detection
a) Computer Science 2
b) Computer Science 17
Figure 5.12: Example screens from the videos which have the lowest word detection
Graph 5.5: OCR test results of search accuracy rate for all videos
Chapter 6: Conclusion
In this work, we have demonstrated that searching for keywords in a video is possible using current OCR techniques. We surveyed the popular OCR tools and chose 3 OCR engines, MODI, GOCR, and Tesseract OCR, that can be integrated into the ICS video project, and we ran experiments on the accuracy of these tools.
For the experiments we needed tools to create the ground truth text data for checking OCR accuracy. We designed TextPictureEditor in C# and prepared the text of 1387 different images extracted from 20 different ICS videos. This test data contains 20007 unique words and 27201 total words (each more than 1 character long), with 144613 characters in total. We used this data for testing with an OCR engine manager and accuracy checker, also designed in C#. The accuracy and performance results of this testing showed that MODI OCR is the best choice for the ICS video player in terms of both accuracy and performance.
We have also demonstrated that these OCR engines struggle with images such as ICS video frames, which have complex colors, non-uniform backgrounds, and poor contrast between text and background. We proposed a method to deal with this problem using image processing techniques; in other words, a method to improve the accuracy of these tools by preprocessing the images. After performing SIS thresholding of the image, several iterations of the dilation operation to connect the text, and Sobel edge detection, we were able to segment the text for OCR input, which helped the tools recognize the text more accurately. Graphs 5.1 and 5.2 show that image enhancement increased the accuracy for all tools. However, the number of false positives increased with IE, as did the execution time. Considering the gain in accuracy, our approach of modifying the inputs is applicable whenever accuracy, rather than performance, is the first priority.
In our experiments we also tested whether running all of the OCR tools one after another and combining their results increases accuracy. The idea of combining the tools was inspired by ensemble learning, a machine learning approach to classification that applies several different methods to a single problem and combines their results. The experiments showed that it does increase accuracy, but only slightly and at a high performance cost.
With detailed results for each video, we could also classify the videos as hard or easy to detect, as shown in graph 5.5.
Creating the ground truth test data, that is, writing or correcting the text of 1387 images, was very challenging and time consuming. Deciding on the criteria was also challenging: should we include the captions of images in the test? Should we include mathematical formulas and operators? What about parentheses? Finding a good segmentation algorithm was another challenge.
For future work, machine learning algorithms for training on the data or for classifying the images, together with other image processing techniques for better segmentation, could transform complex-background images into a form that current OCR engines can handle without confusing the text. The Tesseract OCR engine can be trained, so training it with ICS video images should increase accuracy.
An evaluation of the ICS video player's search feature will guide us along the right path: how useful the search feature is, whether the search accuracy is sufficient, and whether false positives affect the users. A broad evaluation would answer all of these questions. Until then, current OCR engines combined with the image enhancement we provide can be used for search in ICS videos.
References
[1] Todd Smith, Anthony Ruocco, and Bernard Jansen, Digital video in education,
SIGCSE Bull. 31 (1999), no. 1, 122-126.
[2] Jaspal Subhlok, Olin Johnson, Venkat Subramaniam, Ricardo Vilalta, and Chang
Yun, Tablet PC video based hybrid coursework in computer science: Report from a pilot
project., SIGCSE '07 Proceedings of the 38th SIGCSE technical symposium on
Computer science education, 2007
[3] Joanna Li, Automatic indexing of classroom lecture videos, Master's thesis, University
of Houston, 2008.
[4] Gautam Bhatt, Efficient automatic indexing for lecture videos, Master's thesis,
University of Houston , April 2010.
[5] Google Inc., Google Video, http://en.wikipedia.org/wiki/Google_Video, January 2005.
[6] Microsoft, Project tuva, http://research.microsoft.com/apps/tools/tuva/, 2009.
[7] Wei-hsiu Ma, Yen-Jen Lee, David H. C. Du, and Mark P. McCahill, Video-based
hypermedia for education-on-demand, MULTIMEDIA '96: Proceedings of the fourth
ACM international conference on Multimedia (New York, NY, USA), ACM, 1996, pp.
449-450.
[8] Andreas Girgensohn, Lynn Wilcox, Frank Shipman, and Sara Bly, Designing affordances for the navigation of detail-on-demand hypervideo, AVI '04: Proceedings of the working conference on Advanced visual interfaces (New York, NY, USA), ACM, 2004, pp. 290-297.
[9] Andreas Girgensohn, Frank Shipman, and Lynn Wilcox, Hyper-Hitchcock: authoring interactive videos and generating interactive summaries, MULTIMEDIA '03: Proceedings of the eleventh ACM international conference on Multimedia (New York, NY, USA), ACM, 2003, pp. 92-93.
[10] Frank Shipman, Andreas Girgensohn, and Lynn Wilcox, Hypervideo expression: experiences with Hyper-Hitchcock, HYPERTEXT '05: Proceedings of the sixteenth ACM conference on Hypertext and hypermedia (New York, NY, USA), ACM, 2005, pp. 217-226.
[11] Michael R. Lyu, Edward Yau, and Sam Sze, A multilingual, multimodal digital video
library system, JCDL '02: Proceedings of the 2nd ACM/IEEE-CS joint conference on
Digital libraries (New York, NY, USA), ACM, 2002, pp. 145-153.
[12] Search in Videos , http://searchinsidevideo.com/#home.
[13] Michele Merler and John R. Kender, Semantic keyword extraction via adaptive text
binarization of unstructured unsourced video, Nov. 2009 , ISSN: 1522-4880
[14] Video Text Recognition, http://www.sri.com/esd/automation/video_recog.html
[15] Anshul Verma, Design, Development and Evaluation of a Player for Indexed, Captioned and Searchable Videos, Master's thesis, University of Houston, August 2010.
[16] Rainer Lienhart and Wolfgang Effelsberg, Automatic Text Segmentation and Text Recognition for Video Indexing, ACM/Springer Multimedia Systems, Vol. 8, pp. 69-81, January 2000.
[17] Simple OCR, http://www.simpleocr.com/Info.asp.
[18] ABBYY FinerReader, http://www.abbyy.com/company/
[19] Tesseract OCR, http://code.google.com/p/tesseract-ocr/.
[20] GOCR, “Information” , http://jocr.sourceforge.net/.
[21] GOCR, "Linux Tag", http://www-e.uni-magdeburg.de/jschulen/ocr/linuxtag05/w_lxtg05.pdf.
[22] MODI, http://office.microsoft.com/en-us/help/about-ocr-international-issues-HP003081238.aspx.
[23] About OCR, http://www.ehow.com/how-does_4963233_ocr-work.html
[24] OCR , http://en.wikipedia.org/wiki/Optical_character_recognition
[25] OCR, http://www.dataid.com/aboutocr.htm
[26] Pattern Recognition, http://www.dontveter.com/basisofai/char.html
[27] SIS Thresholding, http://www.aforgenet.com/framework/docs/html/39e861e0-e4bb-7e09-c067-6cbda5d646f3.htm.
[28] Sobel Edge Detector , http://www.aforgenet.com/framework/docs/
[29] Best OCR Font Size, Computer Vision, http://www.cvisiontech.com/pdf/pdf-ocr/best-font-and-size-for-ocr.html?lang=eng.
[30] Image Interpolation, http://www.cambridgeincolour.com/tutorials/image-interpolation.htm.
[31] Computer Vision, http://en.wikipedia.org/wiki/Computer_vision.