SEARCH IN CLASSROOM VIDEOS WITH OPTICAL CHARACTER RECOGNITION FOR VIRTUAL LEARNING A Thesis Presented to the Faculty of the Department of Computer Science University of Houston In Partial Fulfillment of the Requirements for the Degree Master of Science By Tayfun Tuna December 2010 SEARCH IN CLASSROOM VIDEOS WITH OPTICAL CHARACTER RECOGNITION FOR VIRTUAL LEARNING Tayfun Tuna APPROVED: Dr. Jaspal Subhlok, Advisor Dep. of Computer Science, University of Houston Dr. Shishir Shah Dep. of Computer Science, University of Houston Dr. Lecia Barker School of Information, University of Texas at Austin Dean, College of Natural Sciences and Mathematics ii Acknowledgements I am very much grateful to my advisor, Dr. Jaspal Subhlok for his guidance, encouragement, and support during this work. He kept me motivated by his insightful suggestions for solving many problems, which would otherwise seem impossible to solve. I would not be able to complete my work in time without his guidance and encouragement. I would like to express my deepest gratitude towards Dr. Shishir Shah, who gave me innumerable suggestions in weekly meetings and in image processing class, both of them helped me many times to solve difficult problems in this research. I am heartily thankful to Dr Lecia Barker for her support and for agreeing to be a part of my thesis committee. Without the love and support of my wife, it would have been hard to get my thesis done on time. I am forever indebted to my wife Naile Tuna. iii SEARCH IN CLASSROOM VIDEOS WITH OPTICAL CHARACTER RECOGNITION FOR VIRTUAL LEARNING An Abstract of a Thesis Presented to the Faculty of the Department of Computer Science University of Houston In Partial Fulfillment of the Requirements for the Degree Master of Science By Tayfun Tuna December 2010 iv Abstract Digital videos have been extensively used for educational purposes and distance learning. Tablet PC based lecture videos have been commonly used at UH for many years. To enhance the user experience and improve usability of classroom lecture videos, we designed an indexed, captioned and searchable (ICS) video player. The focus of this thesis is search. Searching inside of a lecture is useful especially for long videos; instead of losing an hour watching the entire video, it will allow us to find the relevant scenes instantly. This feature requires extracting the text from video screenshots by using Optical Character Recognition (OCR). Since ICS video frames include complex images, graphs, and shapes in different colors with non-uniform backgrounds, our text detection requires a more specialized approach than is provided by off-the-shelf OCR engines, which are designed primarily for recognizing text within scanned documents in black and white format. In this thesis, we describe how we used and increased the detection of these OCR engines for ICS video player. We surveyed the current OCR engines for ICS video frames and realized that the accuracy of recognition should be increased by preprocessing the images. By using some image processing techniques such as resizing, segmentation, inversion on images, we increased the accuracy rate of search in ICS video player. v Table of Contents CHAPTER 1. INTRODUCTION ................................................................................................................ 1 1.1 MOTIVATION .................................................................................................................................. 
1
1.2 BACKGROUND .......... 2
1.2.1 VIDEO INDEXER .......... 3
1.2.2 OVERVIEW OF OPTICAL CHARACTER RECOGNITION (OCR) TOOL .......... 5
1.2.3 ICS VIDEO PLAYER .......... 7
1.3 RELATED WORK .......... 10
1.3.1 VIDEO PLAYERS .......... 11
1.3.2 OCR IMPLEMENTATIONS IN VIDEOS .......... 12
1.4 THESIS OUTLINE .......... 14
CHAPTER 2. SURVEY OF OCR TOOLS .......... 15
2.1 POPULAR OCR TOOLS .......... 15
2.2 THE CRITERIA FOR A “GOOD” OCR TOOL FOR ICS VIDEO IMAGES .......... 16
2.3 SIMPLE OCR .......... 18
2.4 ABBYY FINEREADER .......... 20
2.5 TESSERACT OCR .......... 21
2.6 GOCR .......... 22
2.7 MICROSOFT OFFICE DOCUMENT IMAGING (MODI) .......... 24
2.8 CONCLUSION .......... 27
CHAPTER 3. OCR CHALLENGES AND ENHANCEMENTS .......... 28
3.1 WHAT IS OCR? .......... 28
3.2 HOW DOES OCR WORK? .......... 29
3.3 CAUSES OF FALSE DETECTION .......... 31
CHAPTER 4. ENHANCEMENTS FOR OCR .......... 34
4.1 SEGMENTATION .......... 34
4.1.1 THRESHOLDING .......... 37
4.1.2 EROSION AND DILATION .......... 39
4.1.3 EDGE DETECTION .......... 41
4.1.4 BLOB EXTRACTION .......... 43
4.2 RESIZING FOR TEXT FONT SIZE .......... 44
4.3 INVERSION .......... 46
4.4 RESIZING IMAGE .......... 48
4.5 INVERSION .......... 49
CHAPTER 5. OCR ACCURACY CRITERIA AND TEST RESULTS .......... 51
5.1 TEST DATA .......... 51
5.1.1 THE IMAGES FOR OCR DETECTION TEST .......... 51
5.1.2 THE TEXT FOR OCR DETECTION TEST .......... 53
5.1.3 SEARCH ACCURACY .......... 54
5.2 WORD ACCURACY AND SEARCH ACCURACY .......... 54
5.3 PREPARING AND TESTING TOOLS .......... 55
5.3.1 TEXTPICTURE EDITOR .......... 56
5.3.2 OCR TOOL MANAGER AND ACCURACY TESTER .......... 57
5.3.3 SEARCH ACCURACY .......... 59
5.4 EXPERIMENTS AND TEST RESULTS .......... 60
CHAPTER 6. CONCLUSION .......... 69
REFERENCES .......... 72
vi
List of Figures
FIGURE 1.1: BLOCK DIAGRAM OF THE VIDEO INDEXER .......... 3
FIGURE 1.2: A SNAPSHOT FROM THE VIDEO INDEXER .......... 4
FIGURE 1.3: A SNAPSHOT OF AN OUTPUT FROM THE VIDEO INDEXER .......... 4
FIGURE 1.4: THE OCR TOOL FUNCTION IN ICS VIDEO PLAYER ..........
5 FIGURE 1.5.A SNAPSHOOT OF AN OUTPUT FOLDER OF OCR TOOL .................................................... 5 FIGURE 1.6. A SNAPSHOOT OF RUNNING OCR TOOL ......................................................................... 6 FIGURE 1.7. A SNAPSHOOT OF ICS VIDEO PLAYER XML OUTPUT OF OCR TOOL ................................ 6 FIGURE 1.8. FLOW OF ICS VIDEO PLAYER ....................................................................................... 7 FIGURE 1.9. A SNAPSHOT OF THE VIDEO PLAYER SCREEN................................................................ 8 FIGURE 1.10. LIST WIEW OF SEARCH FEATURE ................................................................................ 9 FIGURE 1.11.ICS VIDEO PLAYER PROGRESSBAR ............................................................................. 10 FIGURE 2.1 SIMPLE OCR DETECTION EXAMPLE 1 .......................................................................... 18 FIGURE 2.2 SIMPLE OCR DETECTION EXAMPLE 2. .......................................................................... 19 FIGURE 2.3 SIMPLE OCR DETECTION EXAMPLE 3. .......................................................................... 19 FIGURE 2.4 ABBY FINE READER DETECTION EXAMPLE. ................................................................. 20 FIGURE 2.5 USER INTERFACE OF ABBY FINE READER. .................................................................... 20 FIGURE 2.6 TESSERACTOCR DETECTION EXAMPLE 1. .................................................................... 21 FIGURE 2.7 TESSERACTOCR DETECTION EXAMPLE 2. .................................................................... 22 FIGURE 2.8 TESSERACTOCR DETECTION EXAMPLE 3. .................................................................... 22 FIGURE 2.9 GOCR TOOL DETECTION EXAMPLE 1. ........................................................................... 23 FIGURE 2.10 GOCR TOOL DETECTION EXAMPLE 2. ......................................................................... 23 FIGURE 2.11 GOCR TOOL DETECTION EXAMPLE 3. ......................................................................... 24 FIGURE 2.12 USING THE MODI OCR ENGINE IN C PROGRAMMING LANGUAGE. ............................... 25 FIGURE 2.13 MODI DETECTION EXAMPLE 1. ................................................................................... 25 FIGURE 2.14 MODI DETECTION EXAMPLE 2. ................................................................................... 26 FIGURE 2.15 MODI DETECTION EXAMPLE 3. ................................................................................... 26 FIGURE 3.1 PATTERN RECOGNITION.STEPS FOR CLASSIFICATION .................................................. 30 FIGURE 3.2 CHARACTER REPRESENTATION FOR FUTURE EXTRACTION. ........................................ 30 FIGURE 3.3 DISTORTED IMAGE ANALYSIS. ..................................................................................... 31 FIGURE 3.4 CONTRAST AND COLOR DIFFERENCES IN CHARACTERS IN AN IMAGE. ........................ 32 FIGURE 3.5 SIZE DIFFERENCE IN CHARACTERS IN AN IMAGE. ........................................................ 32 FIGURE 4.1 BLACK FONT TEXT ON THE WHITE COLOR BACKGROUND. .......................................... 35 FIGURE 4.2 COMPLEX BACKGROUND WITH DIFFERENT COLOR FONT. ........................................... 35 FIGURE 4.3 OCR RESULTS FOR A WHOLE IMAGE. ........................................................................... 36 FIGURE 4.4 OCR RESULTS FOR A SEGMENTED IMAGE. 
................................................................... 36 FIGURE 4.5 SIS THRESHOLD EXAMPLE 1......................................................................................... 38 FIGURE 4.6 SIS THRESHOLD EXAMPLE 2......................................................................................... 38 FIGURE 4.7 STRUCTURAL ELEMENT MOVEMENT FOR MORPHOLOGICAL OPERATIONS. ................. 40 FIGURE 4.8 STRUCTURED ELEMENT FOR EROSION AND DILATION. ................................................ 40 FIGURE 4.9 DILATATION AFFECT ON AN IMAGE. ............................................................................ 41 FIGURE 4.10 EDGE DETECTION EFFECT ON A DILATING IMAGE...................................................... 42 FIGURE 4.11 BLOB EXTRACTION EXAMPLE ON AN IMAGE. ............................................................ 44 FIGURE 4.12 RESIZE PROCESS IN AN EXAMPLE............................................................................... 45 FIGURE 4.13 RESIZE PROCESS IN INTERPOLATION.......................................................................... 45 FIGURE 4.14 RESIZE PROCESS IN BILINEAR INTERPOLATION.......................................................... 46 FIGURE 4.15 RGP COLOR MODEL. ................................................................................................... 47 FIGURE 4.16 THE INVERSION OPERATION ON THE LEFT INPUT IMAGE. .......................................... 48 vii FIGURE 4.17 INVERSION EQUATIONS AND THEIR EFFECT ON THE IMAGES..................................... 49 FIGURE 4.18 OCR ENGINES’ DETECTIONS FOR ORIGINAL IMAGE.................................................... 50 FIGURE 5.1 EXAMPLE ICS VIDEO IMAGE. ........................................................................................ 52 FIGURE 5.2 EXAMPLES OF SOME IMAGES THAT ARE NOT INCLUDED IN THE TEST. ........................ 52 FIGURE 5.3 AN EXAMPLE OF SOME TEXT THAT ARE NOT INCLUDED IN THE TEST. ......................... 53 FIGURE 5.4 SCREENSHOOT OF TEXTPICTURE EDITOR TOOL. .......................................................... 56 FIGURE 5.5 INPUT FOLDER FOR OCR TEST CREATED BY TEXTPICTURE EDITOR TOOL. ................... 57 FIGURE 5.6 .SCREENSHOOT OF OCR TOOL MANAGER AND ACCURACY TESTER ............................. 58 FIGURE 5.7 SCREENSHOOT OF OCR TOOL MANAGER AND ACCURACY TESTER ............................. 58 FIGURE 5.8 EXCEL FILE CREATED BY OCR MANAGER TOOL FOR FOLDER. ..................................... 59 FIGURE 5.9 EXCEL FILE CREATED BY OCR MANAGER TOOL FOR AN IMAGE. .................................. 59 FIGURE 5.10 EXAMPLE SCREENS FROM THE VIDEOS HAVE HIGHEST FALSE POSITIVES. ................ 67 FIGURE 5.11 EXAMPLE SCREENS FROM VIDEOS HAVE HIGHEST WORD DETECTIONS ..................... 67 FIGURE 5.12 EXAMPLE SCREENS FROM THE VIDEOS WHICH HAVE LOWEST DETECTION ............... 67 List of Graphs GRAPH 5.1 OCR ACCURACY TEST GRAPH FOR ‘WORD ACCURACY’ ................................................ 63 GRAPH 5.2 GRAPH FOR OCR TEST RESULTS OF ‘SEARCH ACCURACY’ ............................................ 64 GRAPH 5.3 GRAPH FOR OCR TEST RESULTS OF EXECUTION TIMES ................................................. 64 GRAPH 5.4 OCR TEST RESULTS FOR FALSE POSITIVES .................................................................... 65 GRAPH 5.5 GRAPH FOR OCR TEST RESULTS OF SEARCH ACCURACY RATE FOR ALL VIDEOS .......... 
68 List of Tables TABLE 2.1 POPULAR OCR TOOLS .................................................................................................. 15 TABLE 2.2 SELECTED OCR TOOLS TEST ........................................................................................ 18 TABLE 5.1 FORMULATION OF ‘WORD ACCURACY’ ........................................................................ 54 TABLE 5.2 FORMULATION OF ‘SEARCH ACCURACY’ ..................................................................... 55 TABLE 5.3 OCR ACCURACY TEST RESULTS FOR ‘WORD ACCURACY’ ............................................. 61 TABLE 5.4 NUMBER OF UNDETECTED WORDS WITH METHODS ..................................................... 62 TABLE 5.5 OCR ACCURACY TEST RESULTS FOR ‘SEARCH ACCURACY’ ......................................... 62 TABLE 5.6 TEST RESULTS FOR ‘EXECUTION TIMES’ ...................................................................... 64 TABLE 5.7 NUMBER OF ‘FALSE POSITIVES’ ................................................................................... 65 TABLE 5.8 VIDEOS WHICH HAVE THE HIGHEST ‘FALSE POSITIVIVES’ ............................................ 64 viii Chapter 1: Introduction 1.1 Motivation There is a huge database of digital videos in any school that employs lecture video recording. Traditionally, students would download the video and watch using a basic video player. This method is not suitable for some users like students who want to quickly refer to a specific topic in a lecture video as it is hard to tell exactly when that topic was taught. It is not also suitable for deaf students. To make these videos more accessible and exciting, we needed to make the content inside videos easily navigable, searchable and associate closed captions with videos through a visually attractive and easy to use interface of a video player. To provide easy access to video content and enhance user experience, we designed a video player in ICS video project, focused on making the video content more accessible and navigable to users. This video player allows users to search for a topic they want in a lecture video, which saves time as users do not need to view the whole lecture stream to find what they were looking for. To provide search ability to our video player, we need to get the text of each video frames. This can be done by using optical character recognition (OCR). Since ICS video frames include complex images, graphs, and shapes in different colors with non-uniform backgrounds, our text detection requires a more specialized approach than is provided by off-the-shelf OCR softwares, which are designed primarily for recognizing text within scanned documents in black and white format. Apart from the choosing the right OCR 1 tool for ICS video player, using basic pre-image processing techniques to improve accuracy are required. 1.2 Background Digital videos in education have been a successful medium for students to study or revise the subject matter taught in a classroom [1]. Although, a practical method of education, it was never meant to replace or substitute live classroom interaction as a live classroom lecture and student-instructor interaction cannot be retained in a video, but we still provide anytime-anywhere accessibility by allowing web based access to lecture videos[2]. We wanted to enhance the user experience and make the content of video lectures easily accessible to students by designing a player which could support indexing (or visual transition points), search and captioning. 
At the University of Houston, video recordings have been used for many years for distance learning. In all those years, lecture videos have only grown in popularity [2,3,4]. A problem that students face while viewing these videos is that it is difficult to access specific content. To solve this problem we started a project known as Indexed, Captioned and Searchable (ICS) Videos. Indexing (a process of locating visual transitions in video), searching and captioning have been incorporated in the project to attain the goal of making lecture videos accessible to a wide variety of users in an easy to use manner. We are looking at the project from the perspective of an end user (most likely a student). To increase usefulness of the ICS Video project, all videos contain metainformation associated with them. This meta-information contains information like description of lecture, a series of points in the video time-line where a visual transition exists (also known as index points) along with keywords needed to search and closed 2 caption text. The indexer, explained in the following sections, creates index and transition points of the video as image files for OCR tool. OCR tool detects the text from these images and stores it in a way that ICS Video Player, explained in the following sections, organizes this meta-information in a manner which is practical to the end user while preserving the emphasis on the video. As stated earlier, this work is a culmination of a larger ICS Video project. In this section we present a summary of contributions made by others for this project. 1.2.1 Video Indexer The job of the indexer is to divide the video into segments where each division occurs at a visual transition as shown in figure 1.1. By dividing a video in this manner we get a division of topics taught in a lecture because the visual transitions in a video are nothing but slide transitions. The indexer is also supposed to eliminate duplicate transition points and place index points at approximately similar time intervals. Figure 1.1: Block diagram of the video indexer. The output from the indexer is image files and a textual document which essentially contains a list of index points i.e. time stamps where a visual transition exists. 3 Joanna Li[3] outlined a method to identify visual transitions and eliminate duplicates by filtering. Later, this approach was enhanced with new algorithms [4]. Figure 1.2: A Snapshot from the video indexer. It is running to find index point and transition points. Figure 1.3: Output from the video indexer. It created all transition points and a data file shows which one is index point. In figure 1.2 a snapshot from the video indexer is shown. After it finishes processes, it creates the outputs in a folder for OCR tool as shown in figure 1.3. 4 1.2.2 Overview of Optical Character Recognition (OCR) Tool We will discuss OCR deeply in the following chapters. Figure 1.4 shows a workflow with a short description. Figure 1.4: The OCR tool takes each frame where an index (or visual transition) exists and extracts a list of keywords written on it. This list is then organized in such a way that it can be cross referenced by the index points. After video indexer creates the index point and transition points which are image files, OCR module runs to get the keywords from the written text on these video frames (which are essentially power point slides). As a result we get all the keywords for a video segment from this tool. 
These keywords, among other data, are then used to facilitate search functions in the video player. Figure 1.5: The OCR tool rename files according to their index point number. L1082310_i_1_1 refers to first index point and first transition points. L1-082310_t_1_2 refers to first index point and second transition points. 5 Figure 1.6: The OCR tool running for extracting text from images. OCR tool finishes extracting text from all images one by one and then creates an XML file for output that includes the keywords for each transition point as shown in figure 1.7. Figure 1.7: XML file, output of OCR tool. Once the xml file is ready, the ICS Video Player is ready to use it on its interface. We discuss in the next chapter how the information supplied by the indexer and the OCR tool is used in the ICS Video Player. 6 1.2.3 ICS video player Figure 1.8: The video player is fed the meta-information consisting of index points and keywords along with some of the course information, which that lecture belongs to, and the information about the lecture itself. Caption file is an optional resource, if present, will be displayed below the video as shown in Figure 1.9. In essence the ICS Videos project aims at providing three features - indexing, captioning and search to distance digital video education. Here is an overview of how those three features were integrated in the video player: 1. Indexing The recorded video lectures were divided into segments where the division occurs at a visual transition (which is assumed to be a change of topic in a lecture). These segments (or index points) were organized in a list of index points in the player interface (see Figure 1.9 (d)). 2. Captioning The video player was designed to contain a panel which can be minimized and display closed captions if a closed caption file was associated with that video (see Figure 1.9 (f)). At the time of writing this, the captions file needs to be manually generated by the author. 7 3. Search User can search for a topic of interest in a video by searching for a keyword. The result shows all occurrences of that keyword among the video segments. This is implemented by incorporating the indexer and OCR tool discussed earlier in the video processing pipeline. The search result allows users to easily navigate to the video segment where a match for the search keyword was found as shown figure 1.9 (b) and figure 1.10. Figure 1.9: A snapshot of the video player screen. Highlighted components - (a) video display, (b) search box, (c) lecture title, (d) index list, (e) playhead slider, (f) closed captions, (g) video controls In Figure 1.9 we show a running example of the video player. The lecture in figure 1.9 belongs to the COSC 1410 - Introduction to Computer Science course by Dr. 8 Nouhad Rizk in University of Houston. The Figure gives a view of the player as a whole along with every component. The player interface is mostly self explanatory, but we should clarify some of the functionality. Video display (figure 1.9 (a)) shows the current status of the video. If the video is paused, it shows a gray overlay over the video with a play button. Index list (Figure 1.9 (d)) contains a descriptive entry for each index point (also known as visual transition) in the video. Each entry in the index list is made up of a snapshot image of the video at the index point, name of the index and its description shown in figure 1.10. Figure 1.10: The Figure shows the list of results when the user searched for the keyword "program" in the lecture. 
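To make the search mechanics concrete, the following C# sketch shows one way the keyword lookup described above could be implemented on top of the OCR tool's XML output. It is only an illustration, not the player's actual code: the element and attribute names (video, index, keyword, id, title), the file name, and the restriction to the keyword list alone are assumptions made here for the example, since the exact schema of the file in Figure 1.7 is not reproduced in the text.

using System;
using System.Linq;
using System.Xml.Linq;

class KeywordSearchExample
{
    static void Main(string[] args)
    {
        // Hypothetical layout: <video><index id="1" title="..."><keyword>program</keyword>...</index></video>
        XDocument doc = XDocument.Load("L1-082310_keywords.xml");
        string query = "program";

        var hits = doc.Descendants("index")
            .Select(ix => new
            {
                Id = (string)ix.Attribute("id"),
                Title = (string)ix.Attribute("title"),
                // Count how many keywords of this index point match the query.
                Matches = ix.Elements("keyword")
                    .Count(k => string.Equals((string)k, query, StringComparison.OrdinalIgnoreCase))
            })
            .Where(r => r.Matches > 0);

        // Each hit corresponds to one entry in the list of search results (Figure 1.10).
        foreach (var hit in hits)
            Console.WriteLine("Index {0} ({1}): {2} match(es)", hit.Id, hit.Title, hit.Matches);
    }
}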
One component that is not shown in figure 1.9 is the search result component. When the user searches for a keyword, all indices that contain that keyword in their keyword list, are displayed in the list of search results. The user can then click on a result to go to that index point in the video. As shown in Figure 1.10, every result also contains the snapshot of the video at the index point with the name and description of the index point. It also shows where the keyword was found - in the keyword list (along with 9 number of matches), in title or in the description of the index point. All of the information here comes from the xml file created in OCR tool as we explained in the previous section. Figure 1.11 The Figure shows the progress bar of ICS Video Player. In this case the video is playing at index point 1. One thing we need to point out for the search feature on this player is that when a user searches a keyword and finds it in a keyword list, the progress bar pointer goes to the beginning of that index region; it does not go to the exact position of the videos. We talked about the work flow of ICS Video player briefly and how the OCR tool is used in the project. The purpose of the work done in this thesis is to create an OCR tool, for the video player, which will provide the text of the video frames, so that the user will be able search inside a video by using ICS video player. There are several ways to design an OCR tool that will create text for ICS video player; our main goal is to make it accurate enough for the end user to use and find the keyword in the right place of the video content. For this, we tested the current OCR tools and used some pre-processing techniques to improve their accuracy. With the experiments and results, we concluded that the OCR tools we presented here can be used for ICS Video Player. Modifying images by using image processing techniques, prior to sending to the OCR tool will increase the accuracy of these tools. 1.3 Related Work There have been efforts around the industry along the lines of video indexing and usage of OCR for getting text from videos. We will take a look at each of them separately. 10 1.3.1 Video Players Google video is one of the most famous video players and it is a free video sharing website developed by Google Inc.[5]. Google Video has incorporated indexing feature in their videos which allows users to search for a particular type of video among any video available publicly on the internet. But, they index videos based on the context it was found in and not on the content of the video itself, which means if a video is located in a website about animals and one searches for a video about animals then there is a chance that this video will appear in the search result. The main difference here is that the search result does not guarantee that videos appearing in search result do indeed have the required content in them (attributed to the fact that indexing was not done on the video content). This method does not suit our needs because one of the main requirements of the project was to allow students to be able to locate the topic they were looking for inside a video. Another implementation, known as Project Tuva, implemented by Microsoft Research, features searchable videos along with closed captions and annotations [6]. It also features division of the video time-line into segments where a segment represents a topic taught in the lecture. However, the division of video into segments is done manually in Project Tuva. 
Tuva also offers an Enhanced Video Player to play videos. There exists a related technology known as hypervideo which can synchronize content inside a video with annotations and hyperlinks [7]. Hypervideo allows a user to navigate between video chunks using these annotations and hyperlinks. Detail-ondemand video is a type of hypervideo which allows users to locate information in an interrelated video [8]. For editing and authoring detail-on-demand type hypervideo there 11 exists a video editor known as Hyper-Hitchcock [8, 9, 10]. Hyper-Hitchcock video player can support indexing because it plays hypervideos, but one still has to manually put annotations and hyperlinks in the hypervideo to index it. There has been some research for implementation of search function and topics inside a video. Authors of [11] have developed a system known as iView that features intelligent searching of English and Chinese content inside a video. It contains image processing to extract keywords. In addition to this iView, it also features speech processing techniques. Searchinsidevideo is another implementation of indexing, searching and captioning videos. Searchinsidevideo is able to automatically transcribe the video content and let the search engines accurately index the content, so that they can include it within their search results. Users can also find all of the relevant results for their searches across all of the content (text, audio and video) in a single, integrated search.[12] 1.3.2 OCR implementations in Videos OCR is used in videos in a lot of applications such as car plate number recognition in surveillance cameras, or text recognition on news and sport videos. And there are a lot of projects going on in universities that aim to get a better OCR detection in videos. SRI International (SRI) has developed ConTEXTract™, a text recognition technology that can find and read text (such as street signs, name tags, and billboards) in real scenes. This optical character recognition (OCR) for text within imagery and video requires a more specialized approach than is provided by off-the-shelf OCR software, which is designed primarily for recognizing text within documents. ConTEXTract 12 distinguishes lines of text from other contents in the imagery, processes the lines, and then sends them to an OCR sub module, which recognizes the text. Any OCR engine can be integrated into ConTEXTract with minor modifications.[14] The idea of segmenting in this work is inspired by this work. Authors of [13] proposed a fully automatic method for summarizing and indexing unstructured presentation videos based on text extracted from the projected slides. They use changes of text in the slides as a means to segment the video into semantic shots. Unlike precedent approaches, their method does not depend on availability of the electronic source of the slides, but rather extracts and recognizes the text directly from the video. Once text regions are detected within key frames, a novel binarization algorithm, Local Adaptive Otsu (LOA), is employed to deal with the low quality of video scene text. We are inspired by this work by its application of threshold to images and its use of the Tesseract OCR tool. Authors of [15], worked on Automatic Video Text Localization and Recognition for Content-based video indexing for sports applications using multi-modal approach. They used segmentation by using dilation methods for localizing. The method for segmentation in our work is inspired by this work. 
Authors of [16] from Augsburg University worked on a project named MOCA where they have a paper for Automatic Text Segmentation, known also as text localization, and Text Recognition for Video Indexing. They used OCR engines to detect the text from TV programs. To increase the OCR engine accuracy they presented a new approach to text segmentation and text recognition in digital video and demonstrated its suitability for indexing and retrieval. Their idea of using different snapshots of the same 13 scene is not applicable to our work since our videos are indexed and only these indexed screenshots are available to the OCR tool. 1.4 Thesis Outline This thesis is organized as follows: Chapter 2 gives an introduction to commonly used OCR engines in todays world and explains the reason for using the three OCR engines MODI Tesseract OCR and GOCR. In Chapter 3 we discuss OCR challenges for ICS video images and explain our approaches to deal with these challenges. The methods we used to enhance text recognition are discussed in Chapter 4. The criteria for enhancement on text detection explained in Chapter 5. We also show the results of experiments in Chapter 5. The work is finally concluded in Chapter 6. 14 Chapter 2: Survey of OCR Tools Developing a proprietary OCR system is a complicated task and requires a lot of effort. Instead of creating a new OCR tool it is better to use the existing ones. In the previous chapter we mentioned that there are many OCR tools that allow us to extract text from an image. In this chapter, we discuss the criteria for a good OCR tool suitable for our goals and then we mention some of the tools we tested and justify our choice(s). 2.1 Popular OCR Tools Table 2.1 shows us some popular OCR tools. ABBYY FineReader Puma.NET AnyDoc Software Readiris Brainware ReadSoft CuneiForm/OpenOCR RelayFax ExperVision TypeReader & RTK Scantron Cognition GOCR SimpleOCR LEADTOOLS SmartScore Microsoft Office Document Imaging Tesseract Ocrad Transym OCR OCRopus Zonal OCR OmniPage Table 2.1: Popular OCR tools These tools can be classified into 2 types: a) Tools that can be integrated into our project Open Source tools and some commercial tools for which the OCR module can be integrated into a project such as Office 2007 MODI. b) Tools that cannot be integrated into our project 15 Commercial Tools such as ABBY FineReader that encapsulate OCR, mainly aim to scan, print and edit. They are successful in getting text, but the OCR component cannot be imported to as a module to a project as they have their own custom user interface and everything should be done in this interface. 2.2 The criteria for a “Good” OCR Tool for ICS video images There are many criteria for a good OCR tool in general such as design, user interface, performance, accessibility etc. The priorities for our project are the accessibility, usability and accuracy. So the criteria for being a good tool for our project are: 1. Accessibility-Usability: In ICS video project we will process many image files and we need to do them automatically. We will not do it one by one or go to a process of clicking to a program->browse files-> run the tool ->get the text-> put the text to a place we can use. Accessibility is our first concern. How can we access the tool? Can we call it in a command prompt so that we can access in C++, C# or JAVA programming languages? Or can we access in windows operating system and use it with parameters as many times as possible and any time we want? 
Can we include this tool as a package, a dll or a header to our project so that we can import and use as a part of our project? 2. Accuracy: The tool should also have a reasonable rate of accuracy in converting images to text. It is also important that the accuracy of the text recognition we are looking for is only for our project inputs. In other words, most of the OCR tools are designed for scanned images which are mostly black and white. They may claim their accuracy is up to 95%, but what about the accuracy 16 for colored images? So accuracy for our inputs is another important criterion to decide if the tool is good. 3. Complexity: A program doing one task can be considered simple, while a program doing lot of tasks can be considered a complex tool. In this sense, we only need the tool to extract the text from images. Anything else it does will increase the complexity of it. 4. Updatability: No algorithm can be considered as the final algorithm. To increase the accuracy or performance can we change it so it will work better for our project? It may be a good tool, but it may not support our input type (which is jpg files). Can we update it so that it will be able to processes our inputs? 5. Performance and Use Space: Most tools we examined have reasonable performance and they use reasonable memory and hard drive space. For the ICS video project, the OCR module will work in server side as a web server. That means the speed for converting the images to text or the required space or memory usage for the OCR tool is not as important. Now we can take a look at the tools in the next section. Since testing all tools that work mostly in different environments-operating systems- would be time consuming, we test the tools below as examples for their group. We filtered the popular tools shown in table 2.1 to a table 2.2 as an example according to the classification of accessibility and complexity in previous section. We presented one example for each group. Not importable tools such as commercial big tools and small applications and importable tools such as MODI. We tested two OCR tools in open source tools: GOCR and Tesseract OCR which can work in windows environment. 17 NAME Simple OCR ABBYY FineReader Tesseract OCR GOCR Office 2007 MODI CATEGORY Not importable –Small Application Not importable –Big Application Importable –Open Source Importable –Open Source Importable – Big Application Table 2.2: Selected OCR tools to test 2.3 Simple OCR SimpleOCR is a free tool that can read bi-level and grayscale, and create TIFF files containing bi-level (i.e. black & white) images. It works with all fully compliant TWAIN scanners and also accepts input from TIFF files. With this tool, it is expected that we could easily and accurately convert a paper document into editable electronic text for use in any application including Word and WordPerfect, with 99% accuracy. [17] a) b) Figure 2.1: SimpleOCR detection example 1: a- Input, b- Output SimpleOCR has a user interface in which we can open a file by clicking, browsing, running copying and pasting manually. In other words, it does not have command line usability and also it is not importable to our tool. Hence it could not be 18 adopted for our project. It also failed to create text in colored images like Figure 2.1 and Figure 2.2. 
It gives an error message “Could not convert the page to text.” b a Figure 2.2: SimpleOCR detection example 2: a- Input, b- Output SimpleOCR was able detect some of the text in colored images like figure 2.3 but with a low accuracy, only the word Agents detected correctly. a b Figure 2.3: SimpleOCR detection example 3, a- Input b- Output SimpleOCR failed to be a good OCR tool for our project in the first and second criterion, Accessibility & Usability and Accuracy. 19 2.4 ABBYY FineReader ABBYY is a leading provider of document conversion, data capture, and linguistic software and services. The key areas of ABBYY's research and development include document recognition and linguistic technologies. [18] b a Figure 2.4 ABBYY FineReader detection example a- Input b- Output ABBYY showed good accuracy for our test images, Figure 2.4. But it is not applicable to our project: to be able to use OCR part of ABBYY Fine Reader, it is required to use its own interface to get the text. Open the files from the menu, run the OCR engine and see if the text is accurate or have to be corrected manually as shown in figure 2.5. Figure 2.5 User interface of ABBY Fine Reader 20 Even though the accuracy of ABBY Fine reader is high, it is not a “good” tool for our ICS Video Project. It does not satisfy our first criteria which is Accessibility & Usability. 2.5 Tesseract OCR The Tesseract OCR engine is one of the most accurate open source OCR engines available. The source code will read a binary, grey or color image and output text. A tiff reader is built in that will read uncompressed TIFF images, or lib tiff can be added to read compressed images. Most of the work on Tesseract is sponsored by Google [19]. b a Figure 2.6: Tesseract OCR detection Example 1; a- Input b- Output Tesseract OCR engine is being updated frequently and the accuracy of the tool is precise for colored images. Figure 2.6 and figure 2.7 are good examples detection of capabilities of Tesseract OCR. 21 a b Figure 2.7 Tesseract OCR detection example 2 a- Input b- Output Tesseract OCR may be the most accurate open source tool, but the accuracy rate is not perfect, in figure 2.6, the last line is not recognized at all. The image in figure 2.7 is recognized precisely, whereas in figure 2.8 is missed the word Summary. But it is accessible, easy to use and can be called from command prompt in any programming languages. a) b) Figure 2.8 Tesseract OCR detection Example 3; a- Input b- Output 22 2.6 GOCR GOCR is an OCR program, developed under the GNU Public License, initially written by Jörg Schulenburg, it is also called JOCR. It converts scanned images to text files [20]. GOCR engine assumes no colors, black on white only, assumes no rotation, same font, all characters are separated and every character is recognized empirically based on its pixel pattern [21]. a b Figure 2.9: GOCR tool detection Example 1; a- Input b- Output a b Figure 2.10: GOCR tool detection Example 2; a- Input b- Output 23 Figure 2.11: GOCR tool detection Example 3 a- Input b- Output GOCR is also accessible, easy to use and can be called from command prompt in any programming languages like Tesseract and the detection accuracies for the images in figure 2.9-2.11 are similar to Tesseract. GOCR is not regularly updated like Tesseract. 2.7 Microsoft Office Document Imaging (MODI) Microsoft Office Document Imaging (MODI) is a Microsoft Office application that supports editing documents scanned by Microsoft Office Document Scanning. 
It was first introduced in Microsoft Office XP and is included in later Office versions including Office. Via COM, MODI provides an object model based on 'document' and 'image' (page) objects. One feature that has elicited particular interest on the Web is MODI's ability to convert scanned images to text under program control, using its built-in OCR engine. The MODI object model is accessible from development tools that support the Component Object Model (COM) by using a reference to the Microsoft Office Document Imaging 11.0 Type Library. The MODI Viewer control is accessible from any 24 development tool that supports ActiveX controls by adding Microsoft Office Document Imaging Viewer Control 11.0 or 12.0 (MDIVWCTL.DLL) to the application project. When optical character recognition (OCR) is performed on a scanned document, text is recognized using sophisticated pattern-recognition software that compares scanned text characters with a built-in dictionary of character shapes and sequences. The dictionary supplies all uppercase and lowercase letters, punctuation, and accent marks used in the selected language [22]. In the images tested, the accuracy of Modi was very good and it was easy to access via code. After importing the Microsoft Office Document Imaging 12.0 Type Library it is accessible from any development tool that supports ActiveX MODI.Document md = new MODI.Document(); md.Create(FileName)); md.OCR(MODI.MiLANGUAGES.miLANG_ENGLISH, true, true); MODI.Image image = (MODI.Image)md.Images[]; writeFile.Write(image.Layout.Text) Figure 2.12: Using the MODI OCR engine in C# programming language b a Figure 2.13: MODI detection example 1 a- Input b- Output 25 a) b) Figure 2.14: MODI detection example 2; a- Input b- Output Figure 2.15: MODI detection example 3; a- Input b- Output Accessibility-Usability of Modi made it easier to be imported to a C# project and it has a very good accuracy rate for the ICS video images. In figure 2.15 it was able to detect the thumbnail of video in the left. We found MODI to be a friendly engine for the type of images in ICS videos which are generally fully colored texts and images. 26 2.7 Conclusion: We presented some popular OCR tools and checked if they can be integrated to our ICS video project or not. And we justified our choice by giving some examples, 3 input images and the results of for each OCR tool. We conclude that we can use any of the 3 tools GOCR, Tesseract OCR or MODI can be integrated. It is hard to say which one is best, by looking at the outputs of 3 examples. We decided to include all of them in our experiments. A large scale of test images from ICS video images and the results of these tools will give us a better perspective and this will be done in the experiments and results section in the chapter 5. Before that, in the following chapter we will look at challenges of OCR and our proposed methods to enhance the detection. 27 Chapter 3: OCR and Challenges In the previous chapter we looked for the OCR tools and decided which tools to use: MODI, GOCR and Tesseract OCR. In the examples we provided in the previous section, we saw that colored images confuses OCR tools and their accuracy goes down. That means there are some issues we need to deal with in OCR engines and ICS video images. We introduce what is OCR, how OCR works and what are the challenges in OCR detection for ICS Video frames so that we can discuss how we can deal with them in the next chapter. 3.1 What is OCR? 
Optical character recognition, more commonly known as OCR, is the interpretation of scanned images of handwritten, typed or printed text into text that can be edited on a computer. There are various components that work together to perform optical character recognition. These elements include pattern identification, artificial intelligence and machine vision. Research in this area continues, developing more effective read rates and greater precision [23]. In 1929 Gustav Tauschek obtained a patent on OCR in Germany, followed by Handel who obtained a US patent on OCR in USA in 1933(U.S. Patent 1,915,993). In 1935 Tauschek was also granted a US patent on his method (U.S. Patent 2,026,329). Tauschek's machine was a mechanical device that used templates and a photo detector. RCA engineers in 1949 worked on the first primitive computer-type OCR to help blind people for the US Veterans Administration, but instead of converting the printed 28 characters to machine language, their device converted it to machine language and then spoke the letters. It proved far too expensive and was not pursued after testing [24]. Since that time, OCR is used for credit card imprints for billing purposes, for digitizing the serial numbers on coupons returned from advertisements, sorting mails in United States Postal Service, converting the text for blind people to have a computer read text to them out loud and digitizing and storing scanned documents in archives like hospitals, libraries etc. 3.2 How Does OCR Work? OCR engines are good pattern recognition engines and robust classifiers, with the ability to generalize in decision making based on imprecise input data. They offer ideal solutions to a variety of classification character. There are two basic methods used for OCR: Matrix matching and feature extraction. Of the two ways to recognize characters, matrix matching is the simpler and more common. Matrix Matching compares what the OCR scanner sees as a character with a library of character matrices or templates. When an image matches one of these prescribed matrices of dots within a given level of similarity, the computer labels that image as the corresponding ASCII character. Matrix matching works best when the OCR encounters a limited repertoire of type styles, with little or no variation within each style. Where the characters are less predictable, feature, or topographical analysis is superior. Feature Extraction is OCR without strict matching to prescribed templates. Also known as Intelligent Character Recognition (ICR), or Topological Feature Analysis, this method varies by how much "computer intelligence" is applied by the manufacturer. The computer looks for general features such as open areas, closed shapes, diagonal lines, line 29 intersections, etc. This method is much more versatile than matrix matching but it needs a pattern recognition process as shown in Figure 3.1. [25] Figure 3.1 Pattern Recognition steps for Classification In OCR engines, for future extraction, computer needs to define which pixels are path and which are not. In another words, it needs to classify all the pixels as paths and not paths. Paths can be considered as 1, others can be considered as 0. In figure 3.2 paths creates a character of E. 
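As a toy illustration of the matrix matching approach described above, the following C# sketch compares a binarized glyph against stored character templates by counting matching pixels. The 5x4 templates and the similarity measure are invented here purely for illustration and are far simpler than what real OCR engines use.

using System;

class MatrixMatchingExample
{
    // Count how many pixels of a binarized glyph agree with a stored template.
    static int Similarity(int[,] glyph, int[,] template)
    {
        int score = 0;
        for (int r = 0; r < glyph.GetLength(0); r++)
            for (int c = 0; c < glyph.GetLength(1); c++)
                if (glyph[r, c] == template[r, c]) score++;
        return score;
    }

    static void Main()
    {
        // 5x4 binary templates (1 = path pixel, 0 = background), in the spirit of Figure 3.2.
        int[,] templateE = { {1,1,1,1}, {1,0,0,0}, {1,1,1,0}, {1,0,0,0}, {1,1,1,1} };
        int[,] templateF = { {1,1,1,1}, {1,0,0,0}, {1,1,1,0}, {1,0,0,0}, {1,0,0,0} };

        // A scanned glyph with one noisy pixel in the fourth row.
        int[,] scanned   = { {1,1,1,1}, {1,0,0,0}, {1,1,1,0}, {1,0,0,1}, {1,1,1,1} };

        // The template with the highest score is reported as the recognized character.
        Console.WriteLine("E score: " + Similarity(scanned, templateE)); // 19 of 20 pixels match
        Console.WriteLine("F score: " + Similarity(scanned, templateF)); // 16 of 20 pixels match
    }
}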
a b Figure 3.2 Character Representation for Future Extraction: a) Black and white image b) Binary representation of image [26] 30 3.3 Causes of False Detection OCR engines works on pattern recognition on the images, before recognition path they need to classify the each image pixels as path (1) or not path(0). Like most of the images ICS video images are colored in different shades and sometimes distorted or noisy. This makes pattern recognition fail at some level. Even though they have the ability to lie in their resilience against distortions in the input data and their capability to learn, they have a limit. After a certain point of distortions they start to make mistakes. a b c d Figure 3.3 Distorted image pattern analyses. a is distorted but could be detected and considered as b, c is distorted more and could not be detected. This pattern recognition and machine learning problem is related to computer vision problem which is related to the human visual system. In that sense, we can say if a picture is hard to read for humans, it is also hard to read for computers. (Reverse is not applicable: An irony of computer science is that tasks humans struggle with can be performed easily by computer programs, but tasks humans can perform effortlessly remain difficult for computers. We can write a computer program to beat the very best human chess players, but we can't write a program to identify objects in a photo or understand a sentence with anywhere near the precision of even a child.) 31 The human visual system and visibility is affected by the following factors: 1) Contrast – relationship between the luminance of an object and the luminance of the background. The luminance- proportion of incident light reflected into the eye- can be affected by location of light sources and room reflectance (glare problems). 2) Size – The larger the object, the easier it is to see. However, it is the size of the image on the retina, not the size of the object per se that is important. Therefore we bring smaller objects closer to the eye to see details. 3) Color – not really a factor in itself, but closely related both to contrast and luminance factors. For humans, it is essential to have a certain amount of contrast in an image to define what it is, which is the same for computers: computers need a certain amount of contrast in shapes to be able to detect differences. This is also important for character recognition. Characters in the image should have enough contrast to be able to be defined. Figure 3.4: Contrast and color difference in characters in an image. White text has high contrast and is easy to read, blue text has low contrast and it is hard to read. Figure 3.5 Size difference in characters in an image. Bigger size text is easier to read. Best OCR results depend on various factors, the most important being font and size used for OCR. Other noted factors are color, contrast, brightness, and density of content. OCR 32 engines fail in pattern recognition in low contrast, small size and complex colored text. We can do some image processing techniques to modify images before using OCR to reduce the number of fails of OCR. OCR engine detection is also affected by font style of text. To detect a certain font style of a character it should be previously defined and stored. Since our ICS video player’s fonts are in the type of fonts which most of the OCR engines supported such as Tahoma, Arial, San Serif, Times News Roman etc. so font style problem will barely affect detection of our OCR engines. 
So our enhancement would be about segmentation, text size, color, contrast, brightness, and density of content. We will talk about what approach we used in the next section. 33 Chapter 4: Enhancements for OCR detection In the previous chapter, we looked at the challenges in OCR detection for ICS Video frames. Here, we will discuss the approach and the methods we used to get a better recognition from OCR engines. OCR engines possess complex algorithms, predefined libraries, and training datasets. Modifying an OCR algorithm requires an understanding of the algorithm from the beginning to the end. Apart from that, sometimes images become too complex to be defined as ICS video images; therefore, for better OCR engine results, doing enhancements on the image before sending it to the OCR engine can be used. 4.1 Segmentation In the previous chapter, we stated that OCR engines use segmentation mostly designed for scanned images with black font on a white background. Segmentation of text is two phase: detection of the word and detection of character. Detecting the word can be considered to be locating the word’s text place in the image, and detection of character is locating the character in the image as well. While using OCR engines, we saw that this segmentation is not enough for some ICS video images. Due to the lack of segmentation, OCR engines make mistakes in an image with complex colored objects. The mistakes are introduced during setting up a threshold of the image to correct binarization. A successful image segmentation for a black and white image and a failed segmentation for a colored image on a uniform background are represented by figure 4.1 and figure 4.2 respectively. Segmentation 1 can be considered to be a segmentation of the 34 words and segmentation 2 can be considered to be a segmentation of the characters in these figures. Figure 4.1: Black font text on the white color background segmentation for OCR character recognition. In Figure 4.1, segmentation 1 and segmentation 2 are done successfully so that all texts could be separated to words and then to characters. In figure 4.2, due to the lack of difference between text 1 and background, text 1 is not segmented as a word. So it is not segmented to characters also. Text 2 in figure 4.2, is in a different background that could be segmented to word and then to characters also. Text3 and Text4 backgrounds are very close to each other in figure 4.2; because of that they are considered as a single word, but since their font and background color are close to one another, character segmentation could not be done. Figure 4.2: Complex background with different font color text segmentation for OCR. 35 We need to remember that these figures are used only for the illustration of OCR segmentation. We may have to look at ICS video image examples and OCR outputs to see the importance of segmentation. Figure 4.3 shows that without segmentation OCR engines fail, whereas in figure 4.4, segmented input allows for a better performance of OCR engines. __c____c_0__ _0_____ c0___0___ ___________ ____0 0_00__ ___0 _0 0 _ \l_l_;_l'__ll l'__\_)l)l)\, __\ _'___-li_'i__ ' __ l il\.\ i t i__ il t i__) i- ''-" _ a) b) -4 I. I’-’ 4 C I 41 ——* \'Cl'\iCil| n;_i';" -. 
4.1.1 Thresholding

Images in ICS videos are colored, so for segmentation and for the morphological operations we first convert them to black-and-white images. We do this by image binarization, known as thresholding. We used the SIS filter of the AForge Image library, a free, open-source image processing library for C#, which performs thresholding with a threshold computed automatically by the simple image statistics (SIS) method. For each pixel, two gradients are calculated,

e_x = |I(x + 1, y) - I(x - 1, y)|  and  e_y = |I(x, y + 1) - I(x, y - 1)|;

the pixel weight is the maximum of the two gradients; the weight is added to a running sum of weights (weightTotal += weight); and the weighted pixel value is added to a running sum of weighted values (total += weight * I(x, y)). The resulting threshold is the sum of the weighted pixel values divided by the sum of the weights, total / weightTotal [27].

Figure 4.5: SIS threshold, example 1: a) input image, b) thresholded image. The output has a white foreground on a black background.

SIS thresholding results can differ, as shown in Figures 4.5 and 4.6: in Figure 4.5 the output is a white foreground on a black background, while in Figure 4.6 it is the reverse.

Figure 4.6: SIS threshold, example 2: a) input image, b) thresholded image. The output has a black foreground on a white background.

For the morphological operation we use either erosion or dilation, and we must decide which one to apply for a given image. Whether the foreground is white on black or black on white determines which operation grows the text; applying the wrong one tends to remove the foreground instead of joining it, which is not desirable. We therefore decide between erosion and dilation using the Average Optical Density (AOD) of the binarized image,

AOD(I) = (1 / N^2) * sum_{i=0..N-1} sum_{j=0..N-1} I(i, j)

for an N x N image. We compute the AOD of the binary image, whose pixel values are 0 (white) and 1 (black), so the AOD lies between 0 and 1. For ICS video frames we found that AOD > 0.15 indicates a black foreground on a white background, in which case we use erosion; AOD <= 0.15 indicates a white foreground on a black background, in which case we use dilation.
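To make this decision concrete, the sketch below computes the AOD of a binarized frame and selects the morphological operation. It is a minimal sketch under the stated convention (0 = white, 1 = black); the 2-D array representation and the method names are ours, not the thesis tool's, and the sketch generalizes the N x N formula to rectangular images.

    // AOD-based choice between erosion and dilation (Section 4.1.1).
    using System;

    static class MorphologyChoice
    {
        // binary[i, j] is 0 (white) or 1 (black).
        public static double AverageOpticalDensity(int[,] binary)
        {
            int rows = binary.GetLength(0), cols = binary.GetLength(1);
            long sum = 0;
            for (int i = 0; i < rows; i++)
                for (int j = 0; j < cols; j++)
                    sum += binary[i, j];
            return (double)sum / ((long)rows * cols);   // value in [0, 1]
        }

        public static string ChooseOperation(int[,] binary)
        {
            // AOD >  0.15 -> black foreground on white background -> erosion
            // AOD <= 0.15 -> white foreground on black background -> dilation
            return AverageOpticalDensity(binary) > 0.15 ? "erosion" : "dilation";
        }
    }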
4.1.2 Erosion and Dilation

In the previous sections we binarized the image, calculated its AOD, and decided which morphological operation to use for segmentation. Here we explain what erosion and dilation are and how they affect images. Erosion and dilation are morphological operations that change the shapes of objects and regions in binary images; all processing is done locally, so region and blob shapes are affected in a local manner. A structuring element (a geometric relationship between pixels) is moved over the image, row by row and column by column, so that at some point it is centered over every image pixel. The movement of a structuring element is illustrated in Figure 4.7.

Figure 4.7: Structuring element movements for morphological operations.

Given a structuring element (window) B and a binary image I:

J1 = DILATE(I, B), where J1(i, j) = OR{ I(i - m, j - n) : (m, n) ∈ B }
J2 = ERODE(I, B),  where J2(i, j) = AND{ I(i - m, j - n) : (m, n) ∈ B }  [20]

Dilation fills holes and gaps that are too small in an image with a white foreground on a black background. Erosion is the reverse of dilation, but when applied to an image with a black foreground on a white background it produces the same effect on the text, so in both cases we speak of the dilation effect. For the erosion and dilation operations we used the structuring element shown in Figure 4.8:

0  0  0
1  1  1
0  0  0

Figure 4.8: Structuring element used for erosion and dilation.

The dilation effect thus lets separate objects grow and join. We use it to join characters into groups for segmentation, and we chose a horizontal window so that characters tend to merge to the left and right, as shown in Figure 4.9. Through trial and error, we found that 8 iterations of the dilation/erosion process are sufficient to join the characters in most ICS video images.

Figure 4.9: Effect of eight successive dilations on an image: dilation joins the small objects (characters) and fills the small holes, so lines of text are turned into solid rectangular blobs.

4.1.3 Edge Detection

In the previous step we grouped all small objects, such as characters and other small items, into larger single objects. Next, we used the Sobel operator of the AForge Image library to detect edges; detecting the edges unifies the objects and provides more accurate detection of the groups. The filter searches for object edges by applying the Sobel operator: each pixel of the resulting image is an approximate absolute gradient magnitude for the corresponding pixel of the source image, |G| = |Gx| + |Gy|, where Gx and Gy are computed with the Sobel convolution kernels

Gx:              Gy:
-1  0  +1        +1  +2  +1
-2  0  +2         0   0   0
-1  0  +1        -1  -2  -1

Using these kernels on the 3 x 3 neighborhood of a pixel x,

P1  P2  P3
P8   x  P4
P7  P6  P5

the approximate magnitude is calculated as

|G| = |P1 + 2*P2 + P3 - P7 - 2*P6 - P5| + |P3 + 2*P4 + P5 - P1 - 2*P8 - P7|  [28]

The effect of edge detection is shown in Figure 4.10.

Figure 4.10: Effect of edge detection on a dilated image: a) dilated image, b) edge-detected image.
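To make the per-pixel computation concrete, the sketch below applies the |G| = |Gx| + |Gy| approximation above to an 8-bit grayscale image stored as a 2-D byte array. It is a minimal illustration of the formula only (border pixels are simply left at zero); the array representation is our assumption, and in practice this step is performed by the library's Sobel edge detection filter.

    // Sobel gradient-magnitude approximation (Section 4.1.3).
    using System;

    static class Sobel
    {
        public static byte[,] Magnitude(byte[,] gray)
        {
            int h = gray.GetLength(0), w = gray.GetLength(1);
            var result = new byte[h, w];
            for (int y = 1; y < h - 1; y++)
            {
                for (int x = 1; x < w - 1; x++)
                {
                    // Neighborhood:  P1 P2 P3 / P8 x P4 / P7 P6 P5
                    int p1 = gray[y - 1, x - 1], p2 = gray[y - 1, x], p3 = gray[y - 1, x + 1];
                    int p8 = gray[y, x - 1],                          p4 = gray[y, x + 1];
                    int p7 = gray[y + 1, x - 1], p6 = gray[y + 1, x], p5 = gray[y + 1, x + 1];

                    int g = Math.Abs(p1 + 2 * p2 + p3 - p7 - 2 * p6 - p5)   // vertical gradient
                          + Math.Abs(p3 + 2 * p4 + p5 - p1 - 2 * p8 - p7);  // horizontal gradient
                    result[y, x] = (byte)Math.Min(255, g);                  // clip to 8 bits
                }
            }
            return result;
        }
    }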
4.1.4 Blob Extraction

After detecting the edges, which provides continuity and completeness of the groups, we count and extract the stand-alone objects in the image using a connected-components labeling algorithm. This is an algorithmic application of graph theory in which subsets of connected components are uniquely labeled according to a given heuristic; we use the implementation in the AForge Image library. The filter labels the objects in the source image and colors each separate object with a different color, treating all non-black pixels as object pixels and all black pixels as background. The AForge blob extractor extracts blobs from the input images, which in our case are the thresholded and dilated images. However, the OCR engines need the original image as input, so we use blob extraction only to locate the objects in the original image. A blob extraction example is shown in Figure 4.11.

One might expect more blobs to be extracted; however, they are filtered with several criteria. If a blob contains other blobs, we do not extract it. If blob width / blob height < 1.5, we do not extract it (the text we want to detect is at least two characters long, and since we dilate the text to the left and right, its width is always greater than its height). In Figure 4.11, the man's body is not extracted because of its height-to-width ratio. Very small blobs are also excluded. After filtering according to these criteria, we pass the remaining parts to the Tesseract OCR engine, and if no text is detected in a blob we remove it; this last step can be considered a feedback (verification) operation.

Figure 4.11: Blob extraction example using the edge-detected image: a) original image, b) edge-detected image, c) extracted blobs.

Increasing the number of dilation iterations (Figure 4.9) or the size of the structuring element reduces the number of detected blobs, but it may also merge text with other objects; we therefore keep the structuring element small (3 x 3) and the number of iterations at 8.

4.2 Resizing Text Font Size

In the previous section we separated the text from the complex background by segmentation. But what if the segmented text is too small to be detected correctly? The best font size for OCR is about 10 to 12 points; smaller font sizes lead to poor OCR quality, and font sizes greater than 72 points tend to be treated as images and should be avoided. Dark, bold letters on a light background (and vice versa) usually yield good results, and the text should ideally be well spaced between words and sentences [29].

We need the fonts to be large enough for the text to be detected. A large image alone is not enough; we need larger text, much as in the human visual system, where a large object alone is not enough because what matters is the size of the image on the retina. Increasing the size of the text is easily achieved by resizing the image: scaling the image by a factor of 1.5, for instance, makes everything in it, including the text, 1.5 times larger.

Figure 4.12: Resizing example: a) original image, b) image resized by a factor of 1.5.

Text smaller than about 10 px becomes hard for OCR engines to recognize. ICS video images contain plenty of small characters (mostly in the labels and explanations of images or graphs), so before the other image processing steps we enlarge the images by a default factor of 1.5. Resizing is done with bilinear interpolation, which works in both directions and approximates each new pixel's color and intensity from the values of the surrounding pixels. The following figures illustrate how the enlargement works.

Figure 4.13: Resizing by interpolation: a) original image, b) pixel grid to be filled by interpolation, c) resized image.

Bilinear interpolation considers the closest 2 x 2 neighborhood of known pixel values surrounding the unknown pixel and takes a weighted average of these four pixels to arrive at the interpolated value. This produces much smoother-looking images than nearest-neighbor interpolation.

Figure 4.14: Bilinear interpolation at the pixel level.

In the diagram of Figure 4.14, the left side shows the case where all distances to the known pixels are equal, so the interpolated value is simply their sum divided by four [30].
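The sketch below shows this weighted-average computation when a single-channel image is enlarged by an arbitrary factor. It is a minimal illustration of bilinear interpolation, not the resizing code of our tool, which relies on the library's bilinear resize filter; the array representation and method name are ours.

    // Bilinear enlargement of a single-channel image (Section 4.2).
    using System;

    static class BilinearResize
    {
        public static byte[,] Resize(byte[,] src, double scale)
        {
            int h = src.GetLength(0), w = src.GetLength(1);
            int nh = (int)(h * scale), nw = (int)(w * scale);
            var dst = new byte[nh, nw];
            for (int y = 0; y < nh; y++)
            {
                for (int x = 0; x < nw; x++)
                {
                    // Position of the output pixel in source coordinates.
                    double sy = y / scale, sx = x / scale;
                    int y0 = (int)sy, x0 = (int)sx;
                    int y1 = Math.Min(y0 + 1, h - 1), x1 = Math.Min(x0 + 1, w - 1);
                    double fy = sy - y0, fx = sx - x0;

                    // Weighted average of the surrounding 2x2 known pixels.
                    double value =
                        src[y0, x0] * (1 - fx) * (1 - fy) +
                        src[y0, x1] * fx       * (1 - fy) +
                        src[y1, x0] * (1 - fx) * fy       +
                        src[y1, x1] * fx       * fy;
                    dst[y, x] = (byte)Math.Round(value);
                }
            }
            return dst;
        }
    }

When all four distances are equal (fx = fy = 0.5), every weight is 0.25 and the result reduces to the sum of the four neighbors divided by four, as in the left side of Figure 4.14.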
4.3 Inversion

In the previous sections we segmented and resized the images and saw that applying both improved detection. Detection can be improved further by manipulating the intensity relationship between the text and the background using inversion. Before explaining inversion, we need to look at the RGB color model. The RGB color model is an additive color model in which red, green, and blue light are combined in various proportions to reproduce a broad range of colors; the name comes from the initials of the three additive primary colors shown in Figure 4.15.

Figure 4.15: RGB color model.

Image file formats such as BMP, JPEG, TGA, and TIFF commonly use a 24-bit RGB representation, in which the color of each pixel is encoded with three 8-bit unsigned integers (0 through 255) giving the intensities of red, green, and blue:

(0, 0, 0) is black        (255, 255, 255) is white
(255, 0, 0) is red        (0, 255, 0) is green
(0, 0, 255) is blue       (255, 255, 0) is yellow
(0, 255, 255) is cyan     (255, 0, 255) is magenta

Inverting colors is essentially manipulating these RGB values. In the classical inversion we take the complement of each RGB value; for example, the inverse of the color (1, 0, 100) is (255-1, 255-0, 255-100) = (254, 255, 155). This changes the appearance of the image but not the difference between the text and the background, because the same quantity is subtracted from every channel.

Figure 4.16: The inversion operation: input image on the left, inverted image on the right.

In our approach we expand this technique from one inversion to seven, using the equations in Figure 4.17. The OCR engines give different results for the inverted images; sometimes, for example, the fifth inversion is recognized better than the first. Since we do not know in advance which inversion will be best for a given OCR engine, we run all of them and take the union of the results, and using the inverted images in addition to the original image improved the OCR results.

Original image:  (R, G, B)
Inversion 1:     (255-R, G, B)
Inversion 2:     (R, 255-G, B)
Inversion 3:     (R, G, 255-B)
Inversion 4:     (255-R, 255-G, B)
Inversion 5:     (R, 255-G, 255-B)
Inversion 6:     (255-R, G, 255-B)
Inversion 7:     (255-R, 255-G, 255-B)

Figure 4.17: Inversion equations and their effect on the images.
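A minimal sketch of generating the seven channel-inversion variants of Figure 4.17 for a single pixel (and, applied pixel by pixel, for a whole image) is shown below. The tuple representation and the method name are ours; in practice each variant image is passed to the OCR engines and the detected words are then unioned.

    // The seven channel inversions of Figure 4.17 plus the original pixel (Section 4.3).
    using System;
    using System.Collections.Generic;

    static class ChannelInversion
    {
        public static IEnumerable<(byte R, byte G, byte B)> Variants(byte r, byte g, byte b)
        {
            yield return (r, g, b);                                           // original
            yield return ((byte)(255 - r), g, b);                             // inversion 1
            yield return (r, (byte)(255 - g), b);                             // inversion 2
            yield return (r, g, (byte)(255 - b));                             // inversion 3
            yield return ((byte)(255 - r), (byte)(255 - g), b);               // inversion 4
            yield return (r, (byte)(255 - g), (byte)(255 - b));               // inversion 5
            yield return ((byte)(255 - r), g, (byte)(255 - b));               // inversion 6
            yield return ((byte)(255 - r), (byte)(255 - g), (byte)(255 - b)); // inversion 7
        }
    }

Applying Variants to every pixel yields eight candidate images per input frame; each is sent to OCR and the resulting word lists are merged, as described above.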
Figure 4.18: MODI and Tesseract detections for an example slide (a quiz question about Mrs. Bethune) using the original image and inversions 1, 3, 5, and 7. Different inversions recover different parts of the text, and some inversions yield mostly unreadable output.

Chapter 5: OCR Accuracy Criteria and Test Results

In this chapter we describe the OCR accuracy criteria and the tools we used for testing, followed by the results of the experiments. We start by defining the test data.

5.1 Test Data

The test data consists of the test images and of the texts contained in those images; we look at each in turn.

5.1.1 The Images for the OCR Accuracy Test

The image data consists of 1387 different images created by the indexer from 20 selected ICS videos. Most of them were also tested with the ICS video indexer [4], which makes our inputs more reliable. The selected videos are diverse in templates and color styles because they were prepared by different instructors: 15 different instructors are represented, and 14 of the videos are from the Computer Science Department while 6 are from other departments at the University of Houston. Figure 5.1 shows examples that illustrate the variety of the images in the ICS video test data.

Figure 5.1: Example ICS video images included in the test data.

Figure 5.2: Examples of images that are not included in the test.

Images that do not contain any text were removed from the list, as shown in Figure 5.2: empty video screens, or screens with no relevant textual information such as the beginning or the end of a video.

5.1.2 The Text for the OCR Test

For each image, all text in the image (main body, tables, figures and their captions, footnotes, page numbers, and headers) is in principle counted, with some exceptions.

Figure 5.3: An example of text that is not included in the test.

If text in an image is too small to read, as in Figure 5.3, it is not included in the text data for that image. Deciding whether text is small enough to omit is also tied to our own ability to read it: if we cannot read it accurately, we cannot write down the ground truth to compare against the tools' results. For search, case information is not useful, since users will not search specifically for uppercase letters, so our recognition is not case sensitive and all text is treated as lowercase.

5.2 Word Accuracy and Search Accuracy

If there are n different words, each with a ground-truth frequency and a detected frequency as shown in Table 5.1, the word accuracy is calculated by the formula WA below. In other words, word accuracy tells us what fraction of the words is detected correctly.

Word    Ground truth frequency    Detected frequency    Missed
w1      F1                        f1                    M1 = F1 - f1
w2      F2                        f2                    M2 = F2 - f2
w3      F3                        f3                    M3 = F3 - f3
...     ...                       ...                   ...
wn-1    Fn-1                      fn-1                  Mn-1 = Fn-1 - fn-1
wn      Fn                        fn                    Mn = Fn - fn

Table 5.1: Formulation of word accuracy.

Here the Mi are the missed counts, the number of total words is NTW = F1 + F2 + ... + Fn, and the number of missed words is MW = M1 + M2 + ... + Mn. Word accuracy is the fraction of words that is not missed:

WA = 1 - MW / NTW = 1 - [(F1 - f1) + (F2 - f2) + ... + (Fn - fn)] / (F1 + F2 + ... + Fn)

Search accuracy is related to the probability of a successful search. If there are n different words, each word still has a frequency, but we treat every frequency as 1: if a word is detected at all in an image, we do not need to know how many times it is detected, because for search the only question is whether the word exists in that image or not. The formulation of search accuracy is given in Table 5.2 and the calculation follows the formula SA below.

Word    Ground truth frequency    Detected frequency    Search missed
w1      F1                        f1                    M1 = 1 if f1 < 1, else 0
w2      F2                        f2                    M2 = 1 if f2 < 1, else 0
w3      F3                        f3                    M3 = 1 if f3 < 1, else 0
...     ...                       ...                   ...
wn      Fn                        fn                    Mn = 1 if fn < 1, else 0

Table 5.2: Formulation of search accuracy.

The number of search-missed words is SMW = M1 + M2 + ... + Mn, and the search accuracy is

SA = 1 - SMW / n
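The two metrics follow directly from these definitions. The sketch below computes word accuracy and search accuracy from ground-truth and detected word-frequency maps; it is a minimal sketch, and the dictionary-based representation and method names are ours rather than the accuracy tester's actual code.

    // Word accuracy (WA) and search accuracy (SA) as defined in Section 5.2.
    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class AccuracyMetrics
    {
        // groundTruth: word -> frequency in the image text (Fi)
        // detected:    word -> frequency in the OCR output  (fi)
        public static double WordAccuracy(Dictionary<string, int> groundTruth,
                                          Dictionary<string, int> detected)
        {
            int totalWords = groundTruth.Values.Sum();                     // NTW
            int missedWords = groundTruth.Sum(kv =>
            {
                int fi = detected.TryGetValue(kv.Key, out int d) ? d : 0;
                // Mi = Fi - fi, clamped at 0 here so that extra detections
                // (false positives) cannot offset misses; the clamp is our addition.
                return Math.Max(kv.Value - fi, 0);
            });
            return 1.0 - (double)missedWords / totalWords;                 // WA
        }

        public static double SearchAccuracy(Dictionary<string, int> groundTruth,
                                            Dictionary<string, int> detected)
        {
            int uniqueWords = groundTruth.Count;                           // n
            int searchMissed = groundTruth.Keys.Count(w =>
                !detected.TryGetValue(w, out int fi) || fi < 1);           // Mi = 1 if fi < 1
            return 1.0 - (double)searchMissed / uniqueWords;               // SA
        }
    }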
5.3 Preparing and Testing Tools

We created two tools for the experiments: one for preparing the ground-truth text data and one for running the experiments and measuring accuracy. We discuss each of them separately.

5.3.1 TextPictureEditor

Before testing accuracy we have to prepare the ground truth: each image to be tested needs a corresponding text. Writing these texts entirely by hand while looking at the images would take a long time, so we designed a small tool to help. It lets us move back and forth through the images in a folder, showing each image in the user interface with a text area below it in which we write the text we see in the picture.

Figure 5.4: Screenshot of the TextPictureEditor tool.

The steps for running the tool are:
- open a folder; the tool automatically loads a picture from that folder;
- if no text has been created yet for the picture, send the picture to OCR and load the resulting text into the text area;
- if the text already exists, check whether it is correct and, if not, update it and save it.

After going back and forth through the folder, we obtain text files holding the ground-truth text for all images; all comparisons are based on these .txt files.

Figure 5.5: Input folder for the OCR test created by the TextPictureEditor tool.

5.3.2 OCR Tool Manager and Accuracy Tester

The rest of the job is done by the accuracy tester: applying the image processing techniques to the images, managing the OCR tools, and testing the accuracy of these tools. This is illustrated in Figure 5.6, whose numbered regions are:
1) selecting the folders for an accuracy test, or choosing a single image to experiment with;
2) selecting the image processing techniques used to modify the images;
3) the input image region;
4) the modified image region;
5) selecting the OCR engines;
6) the output of MODI;
7) the outputs of GOCR and Tesseract.

Figure 5.6: Screenshot of the OCR tool manager and accuracy tester.

When the accuracy test is run on a folder of ICS video images, the tool modifies the images and obtains the text from each OCR tool. It then compares each tool's result with the ground-truth data and creates an Excel file showing the statistics for the folder and for each image, as shown in Figures 5.7 and 5.8.

Figure 5.7: Screenshot of the OCR tool manager and accuracy tester.

Figure 5.8: Excel file created by the OCR Manager tool for a folder.

As shown in Figure 5.8, the Excel file lists the missed words, missed searches, and missed characters for each image separately as well as the totals over all images. The tool also creates a separate Excel file for each image to give a detailed view, shown in Figure 5.9.
In this per-image file, the words that are detected and those that are not are listed, and the accuracy of a tool on that image can be read directly from the file.

Figure 5.9: Excel file created by the OCR Manager tool for a single image.

5.4 Experiments and Test Results

As mentioned before, we tested three different OCR tools, MODI, GOCR, and Tesseract OCR; our purpose is to find the best OCR engine to use in the ICS video player. The test has two phases: the first compares the accuracy of the three tools without any image modification, and the second compares the results of each tool after image modification. We used 20 video folders containing 1387 images in total, created by the video indexer and reduced from about 2000 images after eliminating some of them under the criteria defined in the previous section. We then created the ground-truth text with TextPictureEditor, keeping the image files and text files together in each of the 20 folders. For each image we ran the three OCR tools and compared their results with the ground-truth data, creating an Excel file of statistical information per image; the accuracy rates according to our criteria indicate which tool is better. For the second phase we modified the inputs by applying the image processing techniques; in other words, we preprocessed the images in the hope of getting better results from the OCR tools. All results were written to the same Excel file for each video, and finally we merged the 20 Excel files manually to obtain the overall picture.

Method                      # Word Miss    # Expected Words    Word Accuracy
MODI                        1823           27201               93.30%
GOCR                        7117           27201               73.84%
Tesseract                   4406           27201               83.80%
MODI-GOCR-Tesseract         1068           27201               96.07%
IE+MODI                     766            27201               97.18%
IE+GOCR                     4829           27201               82.25%
IE+Tesseract                2148           27201               92.10%
IE+MODI-GOCR-Tesseract      589            27201               97.83%

Table 5.3: OCR accuracy test results for word accuracy.

Graph 5.1: OCR accuracy test graph for word accuracy.

The word accuracy results are shown in Table 5.3 and Graph 5.1. Among the three OCR tools, MODI has the highest word accuracy with 93.30%, followed by Tesseract OCR with 83.80% and GOCR with 73.84%. When the tools are used together, the word accuracy rises to 96.07%: some words are detected by one OCR tool but not by the others, so combining the tools and taking the union of the three results increases accuracy; the tools complement each other.

Table 5.3 and Graph 5.1 also show that our proposed image enhancement method (IE) works well: it increased the word accuracy of every method. IE raised the MODI word accuracy from 93.30% to 97.18%, the GOCR word accuracy from 73.84% to 82.25%, and the Tesseract word accuracy from 83.80% to 92.10%. IE also raised the word accuracy of the combined method, from 96.07% to 97.83%, although the improvement for the combined method is smaller than for the individual methods, as the table below also shows.

Method                      # of additional words detected with IE
MODI                        1026
GOCR                        2140
Tesseract                   2155
MODI-GOCR-Tesseract         460

Table 5.4: Number of words undetected by each method that are detected once IE is added.

Similar results are obtained for the search accuracy rate, shown in Table 5.5 and Graph 5.2: MODI has the highest search accuracy rate, and IE increases the search accuracy rate of all the OCR tools.
Method                      # Search Miss    # Expected Unique Words    Search Accuracy
MODI                        1784             20006                      91.08%
GOCR                        6736             20006                      66.33%
Tesseract                   4113             20006                      79.44%
MODI-GOCR-Tesseract         1044             20006                      94.78%
IE+MODI                     758              20006                      96.21%
IE+GOCR                     4596             20006                      77.03%
IE+Tesseract                1958             20006                      90.21%
IE+MODI-GOCR-Tesseract      584              20006                      97.08%

Table 5.5: OCR accuracy test results for search accuracy.

Graph 5.2: OCR accuracy test graph for search accuracy.

Method                      Execution time of IE (ms)    Execution time of method (ms)    Total (ms)
MODI                        0                            987940574                        987940574
GOCR                        0                            988302280                        988302280
Tesseract                   0                            989242186                        989242186
MODI-GOCR-Tesseract         0                            4000630424                       4000630424
IE+MODI                     999164776                    1018913396                       2018078172
IE+GOCR                     999164776                    1023043090                       2022207866
IE+Tesseract                999164776                    1035144464                       2034309240
IE+MODI-GOCR-Tesseract      999164776                    4112253332                       5111418108

Table 5.6: OCR test results for execution time (ms).

Graph 5.3: OCR graph for execution times.

IE increased both word accuracy and search accuracy. However, as Table 5.6 and Graph 5.3 show, the IE operation also increased the execution times, which nearly double for the individual methods.

Method                      # of false positives
MODI                        19271
GOCR                        10363
Tesseract                   13613
MODI-GOCR-Tesseract         45473
IE+MODI                     81499
IE+GOCR                     52764
IE+Tesseract                93928
IE+MODI-GOCR-Tesseract      150913

Table 5.7: Number of false positives.

Graph 5.4: OCR test graph for the number of false positives.

The tools also detected some words that do not exist in the images; Table 5.7 and Graph 5.4 show the number of such false positives. MODI has the fewest false positives. Using more than one tool at a time increases the number of false detections considerably, but the largest increase occurs with the IE method, which can be explained: IE creates seven inverted images, and some of these inversions produce additional false positives. Combining the tools together with IE yields more than 150 thousand false positives, which is more than the total number of words.

Method                      Computer Science 4, false positive total    Computer Science 10, false positive total
MODI                        3911                                        3112
GOCR                        976                                         865
Tesseract                   2773                                        1961
MODI-GOCR-Tesseract         18965                                       15497
IE+MODI                     6337                                        5316
IE+GOCR                     17597                                       12258
IE+Tesseract                8484                                        6533
IE+MODI-GOCR-Tesseract      29008                                       23587

Table 5.8: The two videos with the highest numbers of false positives.

When we examined the false positives of the videos individually, we found that the Computer Science 4 and Computer Science 10 videos have the highest numbers. Both of these videos were prepared in Classroom Presenter, which shows thumbnails of the slides on the left, and they use a black font on a white background; this led the tools to detect text even in these small thumbnail regions, where they should not. Figure 5.10 shows example screens from these videos. Similarly, we looked at the details of the videos with the highest and lowest detection rates; example slides from these videos are shown in Figures 5.11 and 5.12, respectively.
Figure 5.10: Example screens from the videos with the most false positives: a) Computer Science 4, b) Computer Science 10.

Figure 5.11: Example screens from the videos with the highest word detection: a) Computer Science 14, b) Computer Science 20.

Figure 5.12: Example screens from the videos with the lowest word detection: a) Computer Science 2, b) Computer Science 17.

Graph 5.5: OCR test results for the search accuracy rate of all videos.

Chapter 6: Conclusion

In this work, we have demonstrated that with current OCR techniques it is possible to search for keywords in a video. We surveyed the popular OCR tools and chose three OCR engines, MODI, GOCR, and Tesseract OCR, that can be integrated into the ICS video project, and we ran experiments on the accuracy of these tools.

We needed tools to create the ground-truth text data for checking OCR accuracy. We designed TextPictureEditor in C# and prepared the text of 1387 images extracted from 20 different ICS videos. This test data contains 20006 unique words and 27201 total words (each longer than one character), comprising 144613 characters in total (see the tables in Chapter 5). We used this data for testing with an OCR engine manager and accuracy checker, also written in C#. The accuracy and performance results show that MODI OCR is the best choice for the ICS video player in terms of both accuracy and performance.

We also demonstrated that these OCR engines struggle with images such as ICS video frames, which have complex colors, non-uniform backgrounds, and poor contrast between text and background. We proposed a method to deal with this problem using image processing techniques; in other words, we improve the accuracy of these tools by preprocessing the images. After SIS thresholding of the image, several iterations of dilation (or erosion) to connect the text, and Sobel edge detection, we were able to segment the text for OCR input, which helped the tools recognize the text more accurately. Graphs 5.1 and 5.2 show that image enhancement increased the accuracy of all the tools. However, the number of false positives increased with IE, as did the execution time. Considering the gain in accuracy, our approach of modifying the inputs is applicable whenever accuracy, rather than performance, is the first priority.

In our experiments we also tested whether running all of the OCR tools one after another and combining their results increases accuracy. The idea of combining the tools was inspired by ensemble learning, a machine learning approach to classification that applies several different methods to a single problem and combines their results. The experiments showed that it does increase accuracy, but only slightly and at a high performance cost. With the detailed per-video results, we could classify the videos as hard or easy to detect, as shown in Graph 5.5.

Creating the ground-truth test data, that is, writing or correcting the text of 1387 images, was very challenging and time consuming. Deciding on the criteria was also challenging: should we include the captions of images in the test? Should we include mathematical formulas and operators? What about parentheses? Finding a good segmentation algorithm was another challenge.
For future work, machine learning algorithms for training on the data or for classifying images, together with other image processing techniques for better segmentation, could transform complex-background images into a form that current OCR engines can read without confusion. The Tesseract OCR engine is trainable, so training it with ICS video images should increase its accuracy. An evaluation of the search feature of the ICS video player would also point us in the right direction: how useful is the search feature, is the accuracy of search sufficient, and do false positives affect the users? A broad evaluation would answer these questions; until then, the current OCR engines with the image enhancement we provide can be used for search in ICS videos.

References

[1] Todd Smith, Anthony Ruocco, and Bernard Jansen, Digital video in education, SIGCSE Bull. 31 (1999), no. 1, 122-126.

[2] Jaspal Subhlok, Olin Johnson, Venkat Subramaniam, Ricardo Vilalta, and Chang Yun, Tablet PC video based hybrid coursework in computer science: Report from a pilot project, SIGCSE '07: Proceedings of the 38th SIGCSE Technical Symposium on Computer Science Education, 2007.

[3] Joanna Li, Automatic indexing of classroom lecture videos, Master's thesis, University of Houston, 2008.

[4] Gautam Bhatt, Efficient automatic indexing for lecture videos, Master's thesis, University of Houston, April 2010.

[5] Google Inc., Google Video, http://en.wikipedia.org/wiki/Google_Video, January 2005.

[6] Microsoft, Project Tuva, http://research.microsoft.com/apps/tools/tuva/, 2009.

[7] Wei-hsiu Ma, Yen-Jen Lee, David H. C. Du, and Mark P. McCahill, Video-based hypermedia for education-on-demand, MULTIMEDIA '96: Proceedings of the Fourth ACM International Conference on Multimedia (New York, NY, USA), ACM, 1996, pp. 449-450.

[8] Andreas Girgensohn, Lynn Wilcox, Frank Shipman, and Sara Bly, Designing affordances for the navigation of detail-on-demand hypervideo, AVI '04: Proceedings of the Working Conference on Advanced Visual Interfaces (New York, NY, USA), ACM, 2004, pp. 290-297.

[9] Andreas Girgensohn, Frank Shipman, and Lynn Wilcox, Hyper-Hitchcock: authoring interactive videos and generating interactive summaries, MULTIMEDIA '03: Proceedings of the Eleventh ACM International Conference on Multimedia (New York, NY, USA), ACM, 2003, pp. 92-93.

[10] Frank Shipman, Andreas Girgensohn, and Lynn Wilcox, Hypervideo expression: experiences with Hyper-Hitchcock, HYPERTEXT '05: Proceedings of the Sixteenth ACM Conference on Hypertext and Hypermedia (New York, NY, USA), ACM, 2005, pp. 217-226.

[11] Michael R. Lyu, Edward Yau, and Sam Sze, A multilingual, multimodal digital video library system, JCDL '02: Proceedings of the 2nd ACM/IEEE-CS Joint Conference on Digital Libraries (New York, NY, USA), ACM, 2002, pp. 145-153.

[12] Search in Videos, http://searchinsidevideo.com/#home.

[13] Michele Merler and John R. Kender, Semantic keyword extraction via adaptive text binarization of unstructured unsourced video, November 2009, ISSN 1522-4880.

[14] Video Text Recognition, http://www.sri.com/esd/automation/video_recog.html.

[15] Anshul Verma, Design, Development and Evaluation of a Player for Indexed, Captioned and Searchable Videos, Master's thesis, University of Houston, August 2010.

[16] Rainer Lienhart and Wolfgang Effelsberg, Automatic text segmentation and text recognition for video indexing, ACM/Springer Multimedia Systems, Vol. 8, pp. 69-81, January 2000.
[17] SimpleOCR, http://www.simpleocr.com/Info.asp.

[18] ABBYY FineReader, http://www.abbyy.com/company/.

[19] Tesseract OCR, http://code.google.com/p/tesseract-ocr/.

[20] GOCR, Information, http://jocr.sourceforge.net/.

[21] GOCR, LinuxTag, http://www-e.uni-magdeburg.de/jschulen/ocr/linuxtag05/w_lxtg05.pdf.

[22] MODI, About OCR international issues, http://office.microsoft.com/en-us/help/about-ocr-international-issues-HP003081238.aspx.

[23] About OCR, http://www.ehow.com/how-does_4963233_ocr-work.html.

[24] OCR, http://en.wikipedia.org/wiki/Optical_character_recognition.

[25] OCR, http://www.dataid.com/aboutocr.htm.

[26] Pattern Recognition, http://www.dontveter.com/basisofai/char.html.

[27] SIS Thresholding, http://www.aforgenet.com/framework/docs/html/39e861e0e4bb-7e09-c067-6cbda5d646f3.htm.

[28] Sobel Edge Detector, http://www.aforgenet.com/framework/docs/.

[29] Best OCR Font Size, http://www.cvisiontech.com/pdf/pdf-ocr/best-font-and-size-for-ocr.html?lang=eng.

[30] Image Interpolation, http://www.cambridgeincolour.com/tutorials/image-interpolation.htm.

[31] Computer Vision, http://en.wikipedia.org/wiki/Computer_vision.