MÀSTER EN INFORMÀTICA INDUSTRIAL I AUTOMÀTICA
UNIVERSITAT DE GIRONA

Document Classification

Master Thesis

Marius Vila Durán

Directors: Dr. Mateu Sbert Cassassayas, Dr. Miquel Feixas Feixas

July 2009

Contents

1 Introduction
  1.1 Framework
  1.2 Main Research Lines in Document Classification
  1.3 Objectives
  1.4 Methodology
  1.5 Document Outline

2 State of the Art and Background
  2.1 Introduction
  2.2 Document Classification
    2.2.1 Problem Statement
    2.2.2 Classifier Architecture
    2.2.3 Performance Evaluation
  2.3 Information Theory
  2.4 Image Preprocessing Techniques
    2.4.1 Skew
    2.4.2 Image Segmentation
  2.5 Image Registration
    2.5.1 Image Registration Pipeline
    2.5.2 Similarity Metrics
    2.5.3 Challenges in Image Registration

3 Classification Framework
  3.1 Problem Statement
  3.2 Classifier Architecture
  3.3 Document Database
  3.4 Document Filtering
    3.4.1 Size Filter
    3.4.2 K-means Filter
    3.4.3 NMI Filter
  3.5 Interface
    3.5.1 “Arxiu” Menu
    3.5.2 “Finestra” Menu
    3.5.3 “Base de dades” Menu
    3.5.4 “Pre-procés” Menu
    3.5.5 “Registre” Menu

4 Image Preprocessing and Segmentation
  4.1 Introduction
  4.2 Image Preparation Methods
    4.2.1 Cropping
    4.2.2 Check Position
    4.2.3 Skew
    4.2.4 Image Rotation and Scaling
  4.3 Feature Extraction Methods
    4.3.1 Logo Extraction
    4.3.2 Color and Gray-scale Detection
    4.3.3 K-means

5 Image Registration
  5.1 Introduction
  5.2 Normalized Mutual Information
  5.3 Logo Registration
  5.4 Performance Evaluation
    5.4.1 NMI Registration Test
    5.4.2 Logo Registration Test

6 Conclusions and Future Work
  6.1 Conclusions
  6.2 Future Work
A Qt
B MySQL
C OpenCV
D .NET

List of Figures

1.1 Document taxonomy defined by Nagy [1].
1.2 Basic pipeline of the master thesis.
2.1 Three components of a document classifier.
2.2 Three possible partitions of the document space.
2.3 A typical sequence of document recognition for mostly-text document images.
2.4 Five categories of structured documents and their recommended feature representation.
2.5 Binary entropy.
2.6 Venn diagram of a discrete channel.
2.7 The image before and after image smoothing.
2.8 Image windows of size Xwin × Ywin and region [-L, L].
2.9 Correlation matrix of the vertical line d1 towards d2.
2.10 Skew detection using the correlation matrix of Fig. 2.9.
2.11 The main components of the registration framework.
3.1 Three components of our document classifier.
3.2 Some documents that make up our document space.
3.3 Examples of document classes.
3.4 Classification algorithm.
3.5 Database.
3.6 Example of the distance calculation between the colors of two images.
3.7 Design of the interface with Qt Designer.
3.8 Main screen of the application.
3.9 “Arxiu” menu.
3.10 “Finestra” menu.
3.11 Cascade mode.
3.12 Mosaic mode.
3.13 “Base de dades” menu.
3.14 Database configuration menu.
3.15 “Pre-procés” menu.
3.16 Preprocessing configuration menu.
3.17 “Registre” menu.
3.18 Registration methods configuration menu.
3.19 Result returned by the application.
4.1 a) Scanned image in a wrong position. b) Image with a skew error. c) Scanners generally generate A4 images even when a smaller document is scanned; in this concrete case an adjustment is necessary. This problem also appears in cases a) and b).
4.2 First idea to implement the white zone elimination.
4.3 The white zone elimination problem with a black and white image.
4.4 Solution for the white zone elimination problem.
4.5 The result of applying the cropping method to a ticket.
4.6 The result of applying the cropping method to a receipt.
4.7 a) Original image, b) image with horizontal solid lines, and c) image with vertical solid lines.
4.8 The result of applying the check position method to a receipt in a wrong position.
4.9 The result of applying the check position method to a ticket in a wrong position.
4.10 The result of applying the check position method to a receipt in a correct position.
4.11 The result of applying the skew method to a ticket.
4.12 The result of applying the skew method to a ticket.
4.13 The result of applying the skew method to a ticket.
4.14 The result of applying the scale method to an invoice.
4.15 The result of applying the scale method to a receipt.
4.16 Logo extraction of a ticket.
4.17 Logo extraction of an invoice.
4.18 K-means algorithm diagram.
4.19 K-means algorithm example. 1) k initial “means” (in this case k = 3) are randomly selected from the data set (shown in color). 2) k clusters are created by associating every observation with the nearest mean; the partitions here represent the Voronoi diagram generated by the means. 3) The centroid of each of the k clusters becomes the new mean. 4) Steps 2 and 3 are repeated until convergence is reached.
4.20 The result of applying the k-means algorithm to an invoice. a) Original image. b) Result for k = 2, i.e., 2 groups/colors. c) Result for k = 3, i.e., 3 groups/colors.
4.21 The result of applying the k-means algorithm to an invoice. a) Original image. b) Result for k = 3, i.e., 3 groups/colors. c) Result for k = 6, i.e., 6 groups/colors.
5.1 Scaled image used in normalized mutual information registration.
5.2 Superposition of two registered images using normalized mutual information.
5.3 Graphical summary of the logo registration method. We want to classify an input image and we have three possible candidates. In this case, only the last reference image can obtain a good similarity value (the third one in the right column).
5.4 Superposition of two registered images using the logo registration method.
6.1 The three parts of a ticket.
6.2 NMI registration of tickets after removing the variable parts.
A.1 Qt Designer interface.

List of Tables

3.1 Comparison of the results of computing one hundred distances between color images using backtracking and our approach.
5.1 Results summary of the NMI registration method.
5.2 Results summary of the logo registration method.

Chapter 1

Introduction

In this chapter, we present the framework of this master thesis, the main research lines in document classification, the objectives of this master thesis, and the methodology used to develop the application. Finally, we show the document outline.

1.1 Framework

This master thesis takes place within the framework of a project between the research group Gilab (Graphics and Imaging Laboratory) of the University of Girona and the GIDOC INTEGRAL company. GIDOC INTEGRAL is a company specialized in document management. The main objective of the project is to develop a series of algorithms with the aim of creating a system that allows automatic document classification with minimal human intervention. Thus, this master thesis is based on research work which will be applied to solve the needs of a company in the field of document classification. For confidentiality reasons, the information of some document images has been removed or distorted.

1.2 Main Research Lines in Document Classification

Basically, document classification can follow two main lines, differentiated by whether or not the text content of the document is used.
We use the following terminology to refer to these two possibilities:

• Document image classification: assign a single-page document image to one of a set of predefined document classes.

• Text classification: assign a text document (ASCII, HTML/XML, . . . ) to one of a set of predefined document classes. Text classification techniques can be applied as part of document image classification, using OCR results extracted from the document image. Sebastiani [2] provides a comprehensive survey of text categorization, which is an active research area in information retrieval.

Classification can also be based on various features, such as image-level, structural, or textual features (e.g., word frequency or word histogram).

Using the document taxonomy (see Figure 1.1) defined by Nagy [1], documents can be divided into two basic groups: mostly-text documents and mostly-graphics documents. Mostly-text documents include business letters, forms, newspapers, technical reports, proceedings, journal papers, etc. These are in contrast to mostly-graphics documents such as engineering drawings, diagrams, and sheet music.

Figure 1.1: Document taxonomy defined by Nagy [1].

Nagy’s characterization of documents focuses on document format: mostly-graphics or mostly-text, handwritten or typeset, etc. Another way of characterizing documents is by application domain, such as income tax documents or documents from insurance companies. Many classifiers use document spaces that are restricted to a single application domain. On the other hand, other classifiers use document spaces that span several application domains. We present here a summary of the document spaces of selected classifiers, characterized by application domain.

• A single-domain document space:
  – Bank documents [3].
  – Business letters or reports.
  – Invoices [3, 4].
  – Business forms [5].
  – Forms in banking applications.
  – Tax forms [6].
  – Documents from insurance companies.
  – Book pages.
  – Journal pages.
• A multiple-domain document space:
  – Articles, advertisements, dictionaries, forms, manuals, etc.
  – Journal pages, business letters, and magazines [7].
  – Bills, tax forms, journals, and mail pieces.
  – Journal papers and tax forms [6].
  – Business letters, memoranda, and documents from other domains.

Bagdanov and Worring characterize document classification at two levels of detail, coarse-grained and fine-grained [8]. A coarse-grained classification is used to classify documents with distinctly different features, such as business letters versus technical articles, whereas a fine-grained classification is used to classify documents with similar features, such as business letters from different senders, or journal title pages from various journals.

1.3 Objectives

The main goal of this master thesis is the recognition and classification of different documents within a database. The fundamental pillars to achieve this basic objective are the image preprocessing techniques and, in particular, the image registration measures. Another important objective has been to present the state of the art of document classification.

Figure 1.2: Basic pipeline of the master thesis.

We consider a document to be an image whose text content is not identified as such. Given the image of a document, the basic objective is to identify this image within a previously created database containing the documents of a company. In this project, we focus our attention on a database of invoices, receipts, and tickets. If we cannot identify a document within the database, we can make the following two interpretations:
Today a very high time is spent on the document classification and indexing process when a document is incorporated into a document management system or when a data extraction system is fed (supplier invoices, supplier delivery notes, opinion polls, information request, task demands, etc). One of the most important components of a massive capture of information is the preprocessing stage where data are extracted from scanned or imported documents for its later classification. There are two major ways to classify electronic documents. First it is its classification according to the form of its components. In this project we focus on this line. Although there are many publications on registration, segmentation and image classification, there are few specific works on recognition of document typologies, although this is an essential task for document management in large companies (invoices, receipts, documents of diverse typologies,). The second possibility is to identify the text of the documents with OCR tools. This option is important for the classification of completely open documents where we can not assume any such form or labeling. It is important to consider the rapidity and reliability of the used techniques. To this end it has been necessary to apply techniques for pruning the tree search in the database documents. This is one of the key parts of the project that has required the investigation of descriptors so that we can perform an efficient pruning. Finally also the use of a images hierarchy of different resolution has been investigated. This study of the image behavior with different resolution is also a key element in the project development 1.4 Methodology The methodology used to carry out the application implementation is based on Extreme Programming. 
Extreme Programming (also known as XP) is a software engineering methodology that leads to a development process more responsive to customer needs than traditional methods, and it makes it possible to create higher-quality software since it accommodates changes in the requirements on the part of the user. The defenders of this methodology consider this a normal and desirable aspect of software development projects. Thinking about requirements adaptation at any point of the project is more realistic than attempting to define all requirements at the project beginning and then losing too much time making adjustments to the initial requirements in the event of changes in plans.

Fundamental features of Extreme Programming:

• Iterative and incremental development.
• Continuous unit testing.
• Pair programming.
• Frequent interaction between the programming team and the client or user.
• Correcting all errors before adding new functionality, as well as making frequent deliveries.
• Rewriting parts of the code in order to increase their readability, but without changing their behavior.
• Simplicity in the code.

We tried to meet all these guidelines, except pair programming, since the master thesis is an individual project. In conclusion, Extreme Programming allows us to develop software that is easy to adapt in case of changes or additions.

Algorithms have been implemented using C++ and have been translated to .NET (see Appendix D). We have also used different tools, such as Qt, MySQL, and OpenCV (see Appendixes A, B, and C, respectively).

1.5 Document Outline

This master thesis is organized into six chapters. The first two chapters are introductory and deal with previous work. The next three chapters focus on our document classifier and the methods of preprocessing, segmentation, and registration used by our application. Finally, a concluding chapter is presented.
In more detail:

• Chapter 2: State of the Art and Background. In this chapter, we present the state of the art of document classification and the main tools needed to develop this master thesis, such as information theory, image preprocessing techniques, and image registration.

• Chapter 3: Classification Framework. In this chapter, we present a description of our document image classification framework based on the information collected in the state of the art. Specifically, we introduce the problem statement, the classifier architecture, the document database, the document filtering, and the application interface.

• Chapter 4: Image Preprocessing and Segmentation. In this chapter, several methods for preprocessing (cropping, skew, and check position) and segmentation (logo extraction, k-means, and color and gray-scale detection) are presented.

• Chapter 5: Image Registration. In this chapter, two registration methods (NMI registration and logo registration) and their performance evaluation are presented.

• Chapter 6: Conclusions and Future Work. In this chapter, the conclusions of this master thesis are presented, as well as some indications about our current and future research.

Moreover, four appendixes describe the four basic tools (Qt, MySQL, OpenCV, and .NET) used to develop the application described in this master thesis.

Chapter 2

State of the Art and Background

In this chapter, we present the state of the art of document classification and the main tools needed to develop this master thesis, such as information theory, image preprocessing techniques, and image registration.

2.1 Introduction

Document image classification is an important task in document image processing and is used in the following applications [9]:

• Office automation. Document classification allows the automatic distribution or archiving of documents.
For example, after classification of business letters according to the sender and message type (such as order, offer, or inquiry), the letters are sent to the appropriate departments for processing.

• Digital libraries. Document classification improves the indexing efficiency in digital library construction. For example, the classification of documents into table of contents pages or title pages can narrow the set of pages from which to extract specific meta-data, such as the title or table of contents of a book.

• Image retrieval. Document classification plays an important role in document image retrieval. For example, consider a document image database containing a large heterogeneous collection of document images. Users have many retrieval demands, such as retrieval of papers from one specific journal, or retrieval of document pages containing tables or graphics. Classification of documents based on visual similarity helps to limit the search and improves retrieval efficiency and accuracy.

• Other document image analysis applications. Document classification facilitates higher-level document analysis. Due to the complexity of document understanding, most high-level document analysis systems rely on domain-dependent knowledge to obtain high accuracy. Many available information extraction systems are specially designed for a specific type of document, such as forms processing or postal address processing, to achieve high speed and performance. To process a broad range of documents, it is necessary to classify the documents first, so that a suitable document analysis system for each specific document type can be adopted.

2.2 Document Classification

There is great diversity in document classifiers. Classifiers solve a variety of document classification problems, differ in how they use training data to construct models of document classes, and differ in their choice of document features and recognition algorithms.
Chen [9] surveys this diverse literature using three components: the problem statement, the classifier architecture, and the performance evaluation. These components are illustrated in Figure 2.1.

The problem statement (see Section 2.2.1) for a document classifier defines the problem being solved by the classifier. It consists of two aspects: the document space and the set of document classes. The document space defines the range of input document samples; the training samples and the test samples are drawn from the document space. The set of document classes defines the possible outputs produced by the classifier and is used to label document samples. Most surveyed classifiers use manually defined document classes, with class definitions based on similarity of contents, form, or style. The problem statement corresponding to this master thesis is discussed further in Section 3.1.

The classifier architecture (see Section 2.2.2) includes four aspects: document features and recognition stages, feature representations, class models and classification algorithms, and learning mechanisms.

Performance evaluation (see Section 2.2.3) is used to measure the performance of a classifier and to permit performance comparisons between classifiers. The diversity among document classifiers makes performance comparisons difficult. Issues in performance evaluation include the need for standard data sets and standardized performance metrics, and the difficulty of separating the classifier performance from the pre-processor performance.

2.2.1 Problem Statement

The problem statement for a document classifier has two aspects: the document space and the set of document classes. The former defines the range of input documents, and the latter defines the output that the classifier can produce.

Document Space

The document space is the set of documents that a classifier is expected to handle. The labeled training samples and test samples are all drawn from this document space.
The training samples are assumed to be representative of the defined set of classes. The document space may include documents that should be rejected, because they do not lie within any document class. In this case, the training samples might consist of positive samples only, or they might consist of a mixture of positive and negative samples.

Figure 2.1: Three components of a document classifier.

Set of Document Classes

The set of document classes defines how the document space is partitioned. The name of a document class is the output produced by the classifier. Several possible partitions of the document space are shown in Figure 2.2. A set of document classes may uniquely separate the document space (see Figure 2.2.a), with a single class label assigned to a document. If the document space is larger than the union of the document classes (see Figure 2.2.b), the classifier is expected to reject all documents that do not belong to any document class. Fuzziness may exist in the definition of document classes (see Figure 2.2.c), with multiple class labels assigned to a document.

Figure 2.2: Three possible partitions of the document space.

A document class (also called document type or document genre) is defined as a set of documents characterized by the similarity of expressions, style, form, or contents. This definition states that various criteria can be used for defining document classes. Document classes can be defined based on similarity of contents. For example, consider pages in conference papers, with classes consisting of “pages with experimental results”, “pages with conclusions”, and “pages with description of a method”. Alternatively, document classes can be defined based on similarity of form and style (also called visual similarity), such as page layout, use of figures, or choice of fonts.
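The rejection scenario of Figure 2.2.b can be made concrete with a small sketch (not part of the thesis implementation; the class names, feature values, and threshold are all illustrative): a document feature vector is assigned to the nearest class centroid, and rejected when even the nearest class lies too far away.

```python
import math

def classify(features, centroids, reject_threshold):
    """Assign a feature vector to the nearest class centroid.

    Returns the class name, or None when the document is farther than
    reject_threshold from every class: the case where the document space
    is larger than the union of the document classes.
    """
    best_class, best_dist = None, float("inf")
    for name, centroid in centroids.items():
        dist = math.dist(features, centroid)  # Euclidean distance
        if dist < best_dist:
            best_class, best_dist = name, dist
    return best_class if best_dist <= reject_threshold else None

# Hypothetical two-feature space: (black-pixel density, height/width ratio).
centroids = {"invoice": (0.10, 1.41), "ticket": (0.25, 3.00)}
print(classify((0.11, 1.40), centroids, reject_threshold=0.5))  # invoice
print(classify((0.90, 9.00), centroids, reject_threshold=0.5))  # None (reject)
```

A mixture of positive and negative training samples, as mentioned above, is what would typically be used to tune such a rejection threshold.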
2.2.2 Classifier Architecture

Chen [9] uses the following four aspects to characterize the classifier architecture:

1. Document features and recognition stage.
2. Feature representations.
Figure 2.3 shows a typical sequence of document recognition for mostly-text document images, where: • Block segmentation and classification identify rectangular blocks (or zones) enclosing homogeneous content portions, such as text, table, figure, and halftone image. • Physical layout analysis (also called structural layout analysis or geometric layout analysis) extracts layout structure: a hierarchical description of the objects in a document image, based on the geometric arrangements in the image. For example, an intelligent document processing system that can transform paper documents into XML format called WISDOM++ uses six levels of layout hierarchy: basic blocks, lines, sets of lines, frame 1, frame 2, and page [13]. • Logical layout analysis (also called logical labeling) extracts the logical structure: a hierarchy of logical objects, based on the human-perceptible 19 Chapter 2. State of the Art and Background meaning of the document contents. For example, the logical structure of a journal page is a hierarchy of logical objects, such as title, authors, abstract, and sections [12]. Figure 2.3: A typical sequence of document recognition for mostly-text document images. Document classification can be performed at various recognition stages. The choice of this recognition stage depends on the goal of document classification and the type of documents. Choice of Document Features Chen [9] characterizes document features using three categories: image features, structural features, and textual features, where: • Image features are either extracted directly from the image (e.g., the density of black pixels in a region) or extracted from a segmented image (e.g., the number of horizontal lines in a segmented block). Image features extracted at the level of a whole image are called global image features; image features extracted from the regions of an image are called local image features. 
• Structural features (e.g., relationships between objects in the page) are obtained from physical or logical layout analysis.

• Textual features (e.g., presence of keywords) may be computed from OCR output or directly from document images.

Some classifiers use only image features, only structural features, or only textual features; others use a combination of features from several groups.

The classifiers that use only image features are fast, since they can be applied before document layout analysis. But they may be limited to providing coarse classification, since image features alone do not capture characteristic structural information; more elaborate methods are needed to verify the classification result. Shin et al. [6] measure document image features directly from the unsegmented bitmap image. The document features include density of content area, statistics of features of connected components, column/row gaps, and relative point sizes of fonts. These features are measured in four types of windows: cell windows, horizontal strip windows, vertical strip windows, and the page window.

Most systems use a combination of physical layout features and local image features. This provides a good characterization of structured images. The classification is done before logical labeling, allowing the classification results to be used to tailor logical labeling; that is, we could use physical layout features to classify the document, and then adapt the logical labeling phase to the document class.

Document classification using logical structural features is expensive, since it needs a domain-specific logical model for each type of document.

Classification using textual features is closely related to text categorization in information retrieval. Purely textual measures, such as frequency and weights of keywords or index terms, can be used on their own, or in combination with image features.
Textual features may be extracted from OCR results, which may be noisy. Alternatively, textual features may be extracted directly from document images [6].

Feature Representations

Document features extracted from each sample document in a classifier can be represented in various ways, such as a flat representation (fixed-length vector or string), a structural representation, or a knowledge base. Document features that do not provide structural information are usually represented as fixed-length feature vectors. Features that provide structural information are represented in various formats, such as a tree [3, 4], a list, or a graph (see Figure 2.4). Diligenti et al. [4] claim that a flat representation does not carry robust information about the position and the number of basic constituents of the image, whereas a recursive representation preserves relationships among the image constituents. Chen et al. [9] show a table proposed by Watanabe (see Figure 2.4) in which structured documents are categorized into five groups and each category is associated with a recommended feature representation. Watanabe also gives the following guideline for the selection of a feature representation: "The simpler, the better". If the document can be represented using a list, then a list should be used, because of its higher processing efficiency and easier knowledge definition and management. Similarly, a tree representation is better than a graph representation due to its relative simplicity. The choice of a feature representation is also constrained by the kind of class model and classification algorithm that is used.

Figure 2.4: Five categories of structured documents and their recommended feature representation.

Class Models and Classification Algorithms

Class models define the characteristics of the document classes. The class models can take various forms, including grammars, rules, and decision trees.
The class models are trained using features extracted from the training samples. They are either manually built by a person or automatically built using machine learning techniques. Class models and classification algorithms are tightly coupled. A class model and classification algorithm must allow for noise or uncertainty in the matching process. Traditional statistical and structural pattern classification techniques that have been applied to document classification are reviewed by Chen et al. [9] as follows.

• Statistical pattern classification techniques: There are many traditional statistical pattern classification techniques, such as nearest neighbor, decision tree, and neural network. These techniques are relatively mature, and there are libraries and classification toolboxes implementing them. Traditional statistical classifiers represent each document instance with a fixed-length feature vector. This makes it difficult to capture much of the layout structure of document images. Therefore, these techniques are less suitable for fine-grained document classification.

• Structural pattern classification techniques: These techniques have higher computational complexity than statistical pattern recognition techniques. Also, machine learning techniques for creating class models based on structural representations are not yet standard. Many authors provide their own methods for training class models [3, 4].

• Knowledge-based document classification techniques: A knowledge-based document classification technique uses a set of rules or a hierarchy of frames encoding expert knowledge on how to classify documents into a given set of classes. The knowledge base can be constructed manually or automatically. Manually built knowledge-based systems only perform what they were programmed to do. Significant efforts are required to acquire knowledge from domain experts and to maintain and update the knowledge base.
Moreover, it is not easy to adapt the system to a different domain [2]. Recently developed knowledge-based systems learn rules automatically from labeled training samples [13].

• Template matching: Template matching is used to match an input document with one or more prototypes of each class. This technique is most commonly applied in cases where document images have fixed geometric configurations, such as forms. Matching an input form with each of a few hundred templates is time consuming; the computational cost can be reduced by hierarchical template matching. Byun and Lee [5] propose a partial matching method, in which only some areas of the input form are considered. Template matching has also been applied to broad classification tasks, with documents from various application domains such as business letters, reports, and technical papers. The template for each class is defined by one user-provided input document, and the template does not describe the structural variability within the class. Therefore, the template is only suitable for coarse classification.

• Combination of multiple classifiers: Multiple classifiers may be combined to improve classification performance.

• Multi-stage classification: A document classifier can perform classification in multiple stages, first classifying documents into a small number of coarse-grained classes, and then refining this classification. Maderlechner et al. [14] implement a two-stage classifier, where the first stage classifies documents as either journal articles or business letters, based on physical layout information. The second stage further classifies business letters into 16 application categories according to content information from OCR.

Learning Mechanisms

A learning mechanism provides an automated way for a classifier to construct or tune class models, based on observation of training samples.
Hand coding of class models is most feasible in applications that use a small number of document classes, with document features that are easily generalized by a system designer. For example, Taylor et al. [15] manually construct a set of rules to identify functional components in a document and learn the frequency of those components from training data. However, manual creation of entire class models is difficult in applications involving a large number of document classes, especially when users are allowed to define document classes. With a learning mechanism, the classifier can adapt to changing conditions, by updating class models or adding new document classes.

2.2.3 Performance Evaluation

Performance evaluation is a critically important component of a document classifier. It involves challenging issues, including difficulties in defining standard datasets and standardized performance metrics, the difficulty of comparing multiple document classifiers, and the difficulty of separating classifier performance from preprocessor performance. Performance evaluation includes metrics for evaluating a single classifier and metrics for comparing multiple classifiers. Most classification systems measure the effectiveness of the classifiers, that is, their ability to make the right classification decisions. Various performance metrics are used to evaluate classification effectiveness, including accuracy [3], correct rate, recognition rate [5], error rate [15], false rate [13], reject rate, recall, and precision. The significance of the reported effectiveness is not entirely standard, since some classifiers have reject ability while others do not, and some classifiers output a ranked list of results [3, 7], while others produce a single result. Standard performance metrics are necessary to evaluate performance.
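To make these effectiveness metrics concrete, here is a minimal sketch (our illustration, not part of the thesis) of accuracy, precision, and recall for a single document class, computed one-vs-rest from confusion counts; the function name and the example counts are made up:

```python
def effectiveness(tp, fp, fn, tn):
    """Accuracy, precision, and recall for one document class
    (one-vs-rest) from true/false positive and negative counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Hypothetical run: 80 invoices correctly accepted, 5 letters wrongly
# accepted, 10 invoices wrongly rejected, 105 letters correctly rejected.
acc, prec, rec = effectiveness(tp=80, fp=5, fn=10, tn=105)
print(round(acc, 3), round(prec, 3), round(rec, 3))  # 0.925 0.941 0.889
```

Accuracy alone can be misleading when classes are unbalanced, which is one reason the literature reports several of the metrics listed above.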
Document classifiers are often difficult to compare because they solve different classification problems, drawing documents from different input spaces and using different sets of classes as possible outputs. For example, it is difficult to compare a classifier that deals with fixed-layout documents (forms or table-forms) to one that classifies documents with variable layouts (newspapers or articles). Another complication is that the number of document classes varies widely: classifiers use as few as 3 classes [13] and as many as 500 classes, and various criteria are used to define these classes. Also, many researchers collect their own data sets for training and testing their document classifiers. These data sets are of varying size, ranging from a few dozen [5, 7], or a few hundred [3], to thousands of document instances. The sizes of the training set and test set affect the classifier performance. These factors make it very difficult to compare the performance of document classifiers. Some authors lead in the right direction by making their data available online; in this way, other authors can use the data provided and add their own data to test their classification systems.

To compare the performance of two classifiers, a standard data set providing ground-truth information should be used to train and test the classifiers. The University of Washington document image database is one source of ground-truth data for document image analysis and understanding research [16]. Some authors conclude that the UW data is far from optimal for document classification, since it has a small number of documents from a relatively large number of classes. Finland's MTDB Oulu Document Database defines 19 document classes and provides ground-truth information for document recognition [17]. The number of documents per class ranges from fewer than ten up to several hundred. The documents in this database are diverse and assigned to pre-defined document classes, making this database a useful starting point for research into document classification.

It is difficult to separate the classifier performance from the preprocessor performance. The performance of a classifier depends on the quality of document processing performed prior to classification. For example, classification based on layout-analysis results is affected by the quality of the layout analysis, i.e., by the number of split and merged blocks. Similarly, OCR errors affect classification based on textual features. In order to compare classifier performance, it is important to use standardized document processing prior to the classification step. One method of achieving this is through the use of a standard document database that includes not only labeled document images, but also sample results from intermediate stages of document recognition. This would allow document classifiers to be tested under the same conditions, classifying documents based on the same document-recognition results. Construction of such databases is a difficult and time-consuming task.

2.3 Information Theory

In 1948, Claude Shannon published "A mathematical theory of communication" [18], which marks the beginning of information theory. In this paper, he defined measures such as entropy and mutual information, and introduced the fundamental laws of data compression and transmission. In this section, we present some basic measures of information theory. Good references are the texts by Cover and Thomas [19], and Yeung [20].

Entropy

The Shannon entropy is the classical measure of information, where information is simply the outcome of a selection among a finite number of possibilities. Entropy also measures uncertainty or ignorance. The Shannon entropy H(X) of a discrete random variable X with values in the set X = {x1 , x2 , . . .
, xn } is defined as

H(X) = −Σ_{x∈X} p(x) log p(x),    (2.1)

where p(x) = Pr[X = x], the logarithms are taken in base 2 (entropy is expressed in bits), and we use the convention that 0 log 0 = 0, which is justified by continuity. We use interchangeably the notation H(X) or H(p) for the entropy, where p is the probability distribution {p1, p2, . . . , pn}. As −log p(x) represents the information associated with the result x, the entropy gives us the average information or uncertainty of a random variable. Information and uncertainty are opposites: uncertainty is considered before the event, information after. So, information reduces uncertainty. Note that the entropy depends only on the probabilities.

Figure 2.5: Binary entropy.

Some relevant properties [18] of the entropy are:

1. 0 ≤ H(X) ≤ log n
   • H(X) = 0 if and only if all the probabilities except one are zero, this one having the unit value, i.e., when we are certain of the outcome.
   • H(X) = log n when all the probabilities are equal. This is the most uncertain situation.
2. If we equalize the probabilities, entropy increases.

When n = 2, the binary entropy (Figure 2.5) is given by

H(X) = −p log p − (1 − p) log(1 − p),    (2.2)

where the variable X takes the value 1 with probability p and the value 0 with probability 1 − p.

If we consider another random variable Y with probability distribution p(y) corresponding to values in the set Y = {y1, y2, . . . , ym}, the joint entropy of X and Y is defined as

H(X, Y) = −Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(x, y),    (2.3)

where p(x, y) = Pr[X = x, Y = y] is the joint probability.

Also, the conditional entropy is defined as

H(X|Y) = −Σ_{y∈Y} Σ_{x∈X} p(x, y) log p(x|y),    (2.4)

where p(x|y) = Pr[X = x | Y = y] is the conditional probability. The Bayes theorem expresses the relation between the different probabilities:

p(x, y) = p(x) p(y|x) = p(y) p(x|y).    (2.5)

If X and Y are independent, then p(x, y) = p(x) p(y).

The conditional entropy can be thought of in terms of a channel whose input is the random variable X and whose output is the random variable Y. H(X|Y) corresponds to the uncertainty in the channel input from the receiver's point of view, and H(Y|X) to the uncertainty in the channel output from the sender's point of view. Note that in general H(X|Y) ≠ H(Y|X). The following properties are also met:

1. H(X, Y) ≤ H(X) + H(Y)
2. H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
3. H(X) ≥ H(X|Y) ≥ 0

Mutual Information

The mutual information between two random variables X and Y is defined as

I(X, Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)
        = −Σ_{x∈X} p(x) log p(x) + Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(x|y)
        = Σ_{x∈X} Σ_{y∈Y} p(x, y) log [p(x, y) / (p(x) p(y))].    (2.6)

Mutual information represents the amount of information that one random variable, the output of the channel, gives (or contains) about a second random variable, the input of the channel, and vice versa, i.e., how much the knowledge of X decreases the uncertainty of Y and vice versa. Therefore, I(X, Y) is a measure of the shared information between X and Y. Mutual information I(X, Y) has the following properties:

1. I(X, Y) ≥ 0, with equality if, and only if, X and Y are independent.
2. I(X, Y) = I(Y, X)
3. I(X, Y) = H(X) + H(Y) − H(X, Y)
4. I(X, Y) ≤ H(X)

The relationship between all the above measures can be expressed by a Venn diagram, as shown in Figure 2.6.

Figure 2.6: Venn diagram of a discrete channel.

2.4 Image Preprocessing Techniques

Preprocessing generally consists of a series of image-to-image transformations. It does not increase our knowledge of the contents of the document, but may help to extract it. Some of the early stages of processing scanned documents are independent of the type of document. Many noise filtering, binarization, edge extraction, and segmentation methods can be applied equally well to printed or handwritten text, line drawings or maps.
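As a quick illustration (ours, not part of the thesis), the measures above can be computed numerically from a joint probability table; the array `pxy` below is a made-up joint distribution of two binary variables:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; the convention 0 log 0 = 0 is applied
    by dropping zero-probability entries."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Hypothetical joint distribution p(x, y).
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)   # marginal distributions

hx, hy, hxy = entropy(px), entropy(py), entropy(pxy.ravel())
mi = hx + hy - hxy     # property 3: I(X,Y) = H(X) + H(Y) - H(X,Y)

print(hx, hy)          # both 1.0 bit (uniform marginals)
print(hxy)             # joint entropy H(X,Y), less than H(X) + H(Y)
print(mi)              # shared information I(X,Y), positive here
```

Since the two variables are dependent in this example, the mutual information comes out strictly positive, consistent with property 1.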
For the purposes of this master thesis, we review a skew detection method and the k-means segmentation method in the context of global representation techniques.

2.4.1 Skew

The most important preprocessing step in this master thesis is skew correction. In Section 4.2.3 we explain some modifications made to the algorithm introduced in this section in order to suit our needs. Another preprocessing step essential to our goals is called "cropping". This method is responsible for removing parts of the image that do not belong to the document itself, and is explained in detail in Section 4.2.1.

All of the algorithms developed for skew detection are accurate on full pages of uniformly aligned text. The better algorithms are less affected by the presence of graphics, paragraphs with different skew, curvilinear distortion arising from photocopying books, large areas of dark pixels near the margin, and few, short text lines.

The procedure used to solve the skew problem is proposed by Gatos et al. in [21]. This paper proposes a computationally efficient procedure for skew detection and text line position determination in digitized documents, based on the cross-correlation between the pixels of vertical lines in a document. Due to the text skew, each horizontal text line intersects a predefined set of vertical lines at non-horizontal positions. Using only the pixels on these vertical lines, the authors construct a correlation matrix and evaluate the skew angle of the document with high accuracy. In addition, using the same matrix, they compute the positions of text lines in the document. The method is tested on a variety of mixed-type documents and provides good and accurate results while requiring only a short computational time.

Document image classification systems often consist of a preprocessing stage, a document layout understanding and segmentation stage, a feature extraction stage, and a classification stage.
Many of these stages are facilitated if the document has not been skewed during the scanning process and its text lines are strictly horizontal. Although some image processing techniques can work on skewed documents too, they tend to be ineffective and involve great computational cost. It is therefore preferable, in the preprocessing stage, to determine the skew angle of the digitized documents.

There are several methods for skew detection, based on the Hough transform, projection-based approaches, and the Fourier transform. Hough transform methods are the most popular, but they are computationally expensive. In projection-based methods, the projections of the document onto specific directions are first calculated; the skew angle corresponds to a rotation angle for which some projection characteristics are satisfied. For example, using the horizontal projection, we can determine the skew angle as the rotation for which the mean square deviation of the projection histogram is maximized. This method gives accurate results but is computationally expensive due to its preprocessing stage and global maximum finding stage. Other methods follow the Fourier transform approach: the skew angle corresponds to the direction for which the density of the Fourier space becomes the largest. This approach requires computation of the Fourier transform of the document and, for this reason, is computationally expensive.

Gatos et al. [21] propose a new skew detection method based on the information existing on a set of equidistant vertical lines. They assume that a text document consists mainly of horizontal text lines, so they can restrict attention, among all the image pixels, to those lying on a set of equidistant vertical lines. For a text document, these pixels correspond mainly to pixels of text lines. By using only these pixels, they construct a correlation matrix between the vertical lines.
They do not need to relate every pixel of a line to all pixels of the other lines, but only to those pixels that lie in a specific region defined by the expected maximum skew. In this way, they reduce the computation time significantly without sacrificing accuracy. Finally, they form the vertical projection of the matrix, and the skew angle corresponds to the global maximum of the projection. The proposed method also works well with documents that, in addition to the usual horizontal text lines, contain images, line drawings, and tables. The innovations introduced by this method are the following:

• Efficiency: Instead of using all the image pixels, only those lying on certain vertical lines defined in the image are used. This results in a drastic decrease of the calculation time for skew detection. The basic matrix used for data storage is of much smaller dimension compared to other methods, which results in a faster algorithm implementation and minimum storage requirements.

• Accuracy: The method extracts the document skew with high accuracy, which can be further improved by using more than two vertical lines. The use of more than two vertical lines improves the accuracy, reduces the possibility of a wrong result due to noise, and diminishes the possibility of missing a text line of short length. This happens because the skew detection accuracy depends on the distance between the first and the last vertical lines.

• Robustness: The results of the method are robust to the presence of graphics in the document, which is not true for methods based on the Hough transform.

We now describe the process needed to implement this method using two vertical lines. The process can be divided into the following five basic steps:

1. Binary image.
Firstly, we define the binary text image B(x, y) ∈ {0, 1}, with the integers x, y taking values in the ranges 1 ≤ x ≤ Xwin and 1 ≤ y ≤ Ywin, and assuming that text pixels are assigned the value 1 and background pixels the value 0. All distances are expressed in units of pixel distance, that is, horizontal distances in terms of horizontal pixel distance and vertical distances in terms of vertical pixel distance.

2. Image smoothing. Before applying the method for skew detection to B(x, y), we preprocess the image so that text lines are transformed into thick solid lines. Specifically, if the number of background pixels (B(x, y) = 0) lying between two adjacent horizontal text pixels is less than or equal to a certain threshold T, then these background pixels are converted into text pixels (B(x, y) = 1). The proper value of T depends on the text characteristics and primarily on the character width; the threshold value is therefore selected according to the user's experience. Gatos et al. [21] found that a suitable value is T = 0.1 Xwin. Figure 2.7 shows the result of this procedure applied to a document.

Figure 2.7: The image before and after image smoothing.

3. Line data acquisition. We now define a set of two or more vertical lines in the document. Building on the previous smoothing step, we consider that a pixel belongs to a vertical line if its distance from it is less than or equal to T/2. Then, we define the pixels of every vertical line k, lying at horizontal distance m from the left margin, through the following line smoothing binary function:

line_k(y) = 1 if Σ_{i=m−T/2}^{m+T/2} B(i, y) ≠ 0, and line_k(y) = 0 otherwise, for y = 1, . . . , Ywin.    (2.7)

The function line_k(y) indicates text pixel existence at the vertical line k after the line smoothing transformation.
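Steps 2 and 3 can be sketched in a few lines of code (our illustration, not the authors' implementation; function names are made up):

```python
import numpy as np

def smooth_row(row, T):
    """Step 2 for one image row: background runs of length <= T lying
    between two text pixels are converted into text pixels."""
    row = row.copy()
    text = np.flatnonzero(row)
    for a, b in zip(text[:-1], text[1:]):
        if b - a - 1 <= T:          # gap between adjacent text pixels
            row[a:b] = 1
    return row

def line_function(img, m, T):
    """Step 3, equation (2.7): line_k(y) = 1 if any text pixel lies
    within T/2 of the vertical line at horizontal distance m."""
    lo, hi = max(0, m - T // 2), m + T // 2 + 1
    return (img[:, lo:hi].sum(axis=1) != 0).astype(int)

row = np.array([0, 1, 0, 0, 1, 0, 0, 0, 0, 1])
print(smooth_row(row, T=2))   # -> [0 1 1 1 1 0 0 0 0 1]
```

In the example, the gap of two background pixels is filled (it does not exceed T), while the longer gap is left untouched.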
In contrast with the Hough transform approach, where all the image pixels are used, only the pixels belonging to these vertical lines are used. Thus, less memory is needed and the algorithm is significantly faster.

4. Selection of pixels for skew detection. A common characteristic of a text document is the repetition of the horizontal text lines along the vertical direction. This is obvious (see Fig. 2.7) by observing the repetition of the pixel blocks along the vertical columns. These blocks correspond mainly to horizontal text lines. It is noted that, although in most cases the repetition of text lines is approximately periodical, this is not a pre-requirement for this approach. Examination of the blocks between two different vertical lines can give the necessary information for skew detection. We choose two vertical lines d1 and d2 (see Fig. 2.8), at distances D1 and D2 from the left margin of the image. The distances D1 and D2 are defined so that the image is divided into three equal parts: D1 = (1/3) Xwin and D2 = (2/3) Xwin. The skew angle estimation is based on the 2 Ywin pixels of these two lines obtained by equation (2.7).

Figure 2.8: Image window of size Xwin × Ywin and region [−L, L].

5. Skew detection from the correlation matrix of two vertical lines. We want to determine a matrix that records all the relative positions of the pixels of the vertical line d1 with respect to the pixels of the vertical line d2. Due to the text skew θ, a text line intersects the two vertical lines d1 and d2 in two points having vertical distance l = (D2 − D1) tan θ. Making the assumption that a document can be rotated up to ±5°, that is, θmax = 5° due to a scanning misplacement, the vertical distance l must satisfy the constraint −L < l < L, where

L = (D2 − D1) tan(2π θmax / 360),    (2.8)

and L is an integer, expressed in number of vertical pixels.
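A compact sketch (ours, not from [21]) of steps 4–5 as summarized earlier — correlate the two line functions over the admissible shift range, take the global maximum of the projection, and convert the winning shift into an angle:

```python
import numpy as np

def estimate_skew(line1, line2, D1, D2, theta_max_deg=5.0):
    """Skew angle (degrees) from the binary functions of two vertical
    lines: our sketch of the correlation-matrix idea of steps 4-5."""
    L = int((D2 - D1) * np.tan(np.radians(theta_max_deg)))
    scores = {}
    for lam in range(-L, L + 1):
        # Vertical projection P(lambda): number of y with line1(y) = 1
        # and line2(y + lambda) = 1. (np.roll wraps around, a small
        # simplification of "pixels outside the image are 0".)
        scores[lam] = int(np.sum(line1 * np.roll(line2, -lam)))
    lam_max = max(scores, key=scores.get)
    return float(np.degrees(np.arctan(lam_max / (D2 - D1))))

# Synthetic example: seven "text lines" crossing d1, each appearing
# 3 pixels lower on d2; with D2 - D1 = 200 the true skew is
# atan(3/200), about 0.86 degrees.
line1 = np.zeros(300, dtype=int)
line1[[12, 40, 75, 118, 160, 205, 260]] = 1
line2 = np.roll(line1, 3)
print(round(estimate_skew(line1, line2, 100, 300), 2))  # -> 0.86
```

Storing only the 2 Ywin line samples instead of the whole page is what gives the method its speed and memory advantage over Hough-based approaches.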
For every text pixel of the vertical line d1 (line1(yk) = 1), we search for text pixels of the vertical line d2 in a region [−L, L] centered at yk. We store this information in a correlation matrix C(yk, λ) ∈ {0, 1} defined as

C(yk, λ) = line1(yk) line2(yk + λ),  for 1 ≤ yk ≤ Ywin and −L ≤ λ ≤ L.    (2.9)

Pixels outside the image region are assumed to be 0. As we can see in Fig. 2.9, the correlation matrix C has zero elements for yk = 6, 7, 8, 14, 15, 16, 22, 23, 24 and 25. This is because there are no text pixels at line d1 for these yk values. We also have C(1, 3) = 1 because there is a text pixel at line d1 for yk = 1 and there is also a text pixel at line d2 for yk = 1 + 3 = 4.

Figure 2.9: Correlation matrix of the vertical line d1 towards d2.

If the image skew angle is θ, then the intersection of every text line with the two vertical lines d1 and d2 should have a vertical distance (D2 − D1) tan θ. So, the correlation matrix C will have maximum accumulation of points along the y-axis for λ = int[0.5 + (D2 − D1) tan θ] (where int[x] is the integer part of x). Reversing this reasoning, the image skew is obtained by detecting the global maximum of the vertical projection of the correlation matrix C. The vertical projection of the correlation matrix is given by formula (2.10):

P(λ) = Σ_{k=1}^{Ywin} C(k, λ),  ∀λ ∈ [−L, L].    (2.10)

According to the above, if the global maximum of P(λ) is at λ = λmax, then the document skew is given by relation (2.11):

θ = tan⁻¹[λmax / (D2 − D1)].    (2.11)

As we can see in Fig. 2.10, we have a global maximum of the projection for λ = 3, which means that the document skew angle is tan⁻¹[3/(D2 − D1)]. In Section 4.2.3 there is a detailed explanation of how we use this method and the variations introduced to adapt it to our application.

2.4.2 Image Segmentation

The segmentation process can be described as the identification of structures in an image.
It consists in subdividing an image into its constituent parts, a significant step towards image understanding [22].

Figure 2.10: Skew detection using the correlation matrix of Fig. 2.9.

Image segmentation is the process of labeling each pixel in an image dataset according to certain parameters or features. Since a segmented image provides richer information than the original one, it is an essential tool in image analysis. Segmentation is considered a very difficult task, and a lot of research is being done to develop automatic segmentation techniques. Unfortunately, the automatic process is not easy, since the regions to be segmented can vary with the image. Consequently, most proposed methods are application-specific, usually assuming some a priori information that must either be built into the system or provided by a human operator.

In the image processing literature, we can find many segmentation methods and also very diverse ways of classifying them [22, 23]. Automatic segmentation processes can be divided into two groups: global segmentation methods, where all image pixels are collected into clusters, and local segmentation methods, where only a region is taken into account, classifying the pixels as inside or outside of this region. For the purposes of this master thesis, we only review the most basic global segmentation methods. These methods are also referred to as classification methods, since each point is classified into a cluster, usually depending on its intensity value and the intensity of its neighbours, and not on its position in the image. The main global segmentation methods can be classified into these groups:

• Thresholding. This segmentation scheme relies upon the selection of a range of intensity levels, called threshold values, for each structure class. These intensity ranges are exclusive to a single class and span the dynamic range of the image.
A feature is then classified by selecting the class whose range of feature values contains the value of the feature. The determination of more than one threshold value is a process called multithresholding. The selection of the threshold generally depends on the visual identification of a peak in the histogram corresponding to a structure class, and the selection of a range of intensities around the peak that includes only that structure class. A possible criterion is to assign the histogram minima as the threshold values. More refined criteria are summarized in [24, 25].

• Segmentation by image enhancement. In image processing terminology, an operation for image enhancement improves the quality of the image in a particular manner, either subjectively or objectively. This segmentation model assumes that a structure class ideally has a single intensity, and that noise and scanning artifacts corrupt this level to produce the distribution of intensities observed for the structure class. Thus, by applying image enhancement techniques to reduce noise and smooth the image, the enhanced image approximates the ideal (segmented) image. The main drawback of this approach is that structures that do not have strong edges on all sides are smoothed over, leading to large classification errors when subsequent labeling is applied.

• Segmentation by unsupervised clustering. Clustering methods are algorithms that operate on an input dataset, grouping data into clusters based on the similarity of the data within these clusters. Clustering algorithms are unsupervised classifiers, assigning classes from scratch. They are also useful for data exploration, allowing a user to discover patterns of similarity in a dataset. A well-known clustering algorithm is the k-means [26].
The k-means algorithm accepts as input the number of clusters in which to organize the data, the initial locations of the cluster centers, and a dataset to cluster. The number of clusters is specified by the user, either as a parameter to experiment with or as the expected or desired number of classes to discern from the data. There are no conditions under which data are excluded from consideration: all data provided as input are classified. A given sample or feature measurement is assigned exclusively to one class (fuzzy k-means clustering, by contrast, assigns a degree of membership to each data item for each class). K-means is an iterative algorithm, assigning a class to each data element at each iteration, and it stops when the classification solution no longer changes. Each iteration classifies the dataset by comparing it to the current cluster centers: a data item is assigned to the class of the cluster center whose Euclidean distance to the data item is the least among all the cluster centers. Following class assignment, each cluster center is updated as the centroid of the data classified into its class.

2.5 Image Registration

Image registration is a fundamental task in image processing used to match two or more images or volumes obtained at different times, from different devices, or from different viewpoints. Basically, it consists in finding the geometrical transformation that enables us to align the images into a unique coordinate space. In the scope of this master thesis, we focus on 2D rigid registration techniques, because only transformations that consider translations and rotations are allowed. In this section, the main components of the image registration pipeline are presented.
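The k-means iteration described above can be sketched as follows for scalar pixel intensities (our illustration; the data and initial centers are made up):

```python
import numpy as np

def kmeans_1d(values, centers, max_iter=100):
    """Cluster scalar values (e.g. pixel intensities): assign each value
    to its nearest center, recompute each center as the centroid of its
    class, and stop when the assignment no longer changes."""
    centers = np.asarray(centers, dtype=float)
    labels = None
    for _ in range(max_iter):
        dist = np.abs(values[:, None] - centers[None, :])
        new_labels = dist.argmin(axis=1)
        if labels is not None and np.array_equal(labels, new_labels):
            break                      # no change: iteration ceases
        labels = new_labels
        for k in range(len(centers)):
            if np.any(labels == k):    # centroid update of each class
                centers[k] = values[labels == k].mean()
    return labels, centers

# Toy "image": background pixels near 20, text pixels near 200.
pixels = np.array([18, 22, 25, 19, 198, 205, 202, 21, 199], dtype=float)
labels, centers = kmeans_1d(pixels, centers=[0, 255])
print(labels)    # background -> class 0, text -> class 1
print(centers)   # converges to [21, 201]
```

For grayscale document images the same idea, with the full 2D pixel set flattened into `values`, yields a basic text/background segmentation.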
A classification of the most representative registration methods that have been proposed is also given. To end the section, the main challenges in the registration field are described.

2.5.1 Image Registration Pipeline

The image registration pipeline starts with the selection of the two images to be registered. One of them is defined as the fixed image and the other as the moving image. Given these images, registration is treated as an optimization problem whose goal is to find the spatial mapping that brings the moving image into alignment with the fixed one. The process is composed of four basic elements [27]: the transformation, the interpolator, the metric, and the optimizer (see Figure 2.11). The transformation component represents the spatial mapping of points from the fixed image space to points in the moving image space. The interpolator is used to evaluate the moving image intensity at non-grid positions. The metric provides a measure of how well the fixed image is matched by the transformed moving image; this measure forms the quantitative criterion to be optimized by the optimizer over the search space defined by the parameters of the transformation. Each of these components is now described in more detail.

Figure 2.11: The main components of the registration framework are the two input images, a transformation, a metric, an interpolator, and an optimizer.

1. Spatial transformation. The registration process consists in reading the input images, defining the reference space (i.e. resolution, positioning, and orientation) of each image, and establishing the correspondence between them (i.e. how to transform the coordinates of one image into the coordinates of the other). The spatial transformation defines the spatial relationship between both images. Basically, two groups of transformations can be considered:

• Rigid or affine transformations. These transformations can be defined with a single global transformation matrix. Rigid transformations are geometrical transformations that only consider translations and rotations, and thus preserve all distances. Affine transformations also allow shearing; they preserve the straightness of lines (and the planarity of surfaces) but not the distances.

• Nonrigid or elastic transformations. These transformations are defined at each point of the image by a transformation vector. For simplicity, sometimes only some control points are considered, and the transformation at the other points is obtained by interpolating the transformation at these control points. With these kinds of transformations, the straightness of lines is not ensured.

In this master thesis, rigid image registration is our reference point.

2. Interpolation. The interpolation strategy determines the intensity value of a point at a non-grid position. When a general transformation is applied to an image, the transformed points may not coincide with the regular grid, so an interpolation scheme is needed to estimate the values at these positions. One of the main problems of registration appears when there is no direct correspondence between the coordinates of the two models; in this situation, a criterion has to be fixed to determine how the point is approximated in the second model. Spatial transformations therefore rely for their proper implementation on interpolation and image resampling. Interpolation is the process of estimating intensities under the transformation, and resampling is the process in which intensity values are assigned to the pixels of the transformed image. Several interpolation schemes have been introduced [28]. The most common are:

• Nearest neighbour interpolation: the intensity of each point is given by that of the nearest grid point.
• Linear interpolation: the intensity of a point is obtained from the linearly weighted combination of the intensities of its neighbors.

• Splines: the intensity of a point is obtained from the spline-weighted combination of a grid-point kernel [29].

• Partial volume interpolation: the weights of the linear interpolation are used to update the histogram, without introducing new intensity values [30].

3. Metric. The metric evaluates the similarity (or disparity) between the two images to be registered. Several image similarity measures have been proposed. They can be classified according to the features they use:

• Geometrical features. A segmentation process detects some features, which are then aligned. These methods do not have a high computational cost; nevertheless, they depend greatly on the initial segmentation results.

• Correlation measures. The intensity values of each image are analyzed, and alignment is achieved when a certain correlation measure is maximized. Usually, a priori information is used in these metrics.

• Intensity occurrence. These measures depend on the probability of each intensity value and are based on information theory [18].

Despite this variety of measures, the last group has become the most popular. Due to the importance of the similarity measure in our research, a classification of registration techniques according to this parameter is given in Section 2.5.2.

4. Optimization. The optimizer finds the maximum (or minimum) value of the metric by varying the spatial transformation. For the registration problem, an analytical solution is not possible, so numerical methods are used to obtain the global extremum of a non-analytical function. The most used methods in the image registration context are Powell's method, the simplex method, gradient descent, the conjugate-gradient method, and genetic algorithms (such as one-plus-one evolutionary).
The choice of a method depends on the implementation criteria and on the properties of the measure (smoothness, robustness, etc.). A detailed description of several numerical optimization methods and their implementations can be found in [31].

2.5.2 Similarity Metrics

The registration metric characterizes the similarity (or disparity) of both images for a given transformation. The two images are considered registered when this similarity (or disparity) function is maximum (or minimum). The registration methods that have been proposed can be classified into two main groups according to the information used to compute the measure: (i) feature-based registration, which uses previously segmented objects from the images to achieve the alignment, and (ii) pixel-based methods, which use the whole data. A more detailed description of both groups is given below.

Feature-based Registration

Measures based on geometric features minimize the spatial disparity between selected features of the images (e.g. the distance between corresponding points). The main difference among the methods of this group is the feature selected for the registration, which can be points, surfaces, intrinsic features such as landmarks, or extrinsic features such as implanted markers. According to the features, two main categories of algorithms can be considered:

• Point-based registration algorithms. These algorithms select a set of points in each of the images; the minimum Euclidean distance between the sets then gives the best alignment. Since, in general, the point sets of the two images do not exactly coincide, an iterative algorithm is run until the distance between the sets of points is minimal. These methods are used extensively in the medical scenario due to their simplicity.

• Segmentation-based registration algorithms. Segmentation-based registration algorithms are based on the alignment of segmented structures.
The segmentation process takes an image and separates its elements into connected regions that present the same desired property or characteristic. Segmentation-based algorithms are generally accurate and fast if a good choice of features is made. The main drawback of this approach is that the registration accuracy is limited by the accuracy of the segmentation step. Feature-based registration requires specialized segmentation and feature extraction for each application; in addition, the methodology is not immune to noise and is sensitive to outliers. The main advantages of segmentation-based methods are that they give more accurate results than the intensity-based approach and that they are faster than intensity-based registration, since they use a smaller number of features and the optimization procedure needs fewer iterations.

Pixel-based Similarity Measures

The alternative to the feature-based approach is intensity-based registration. This approach assumes some relation between the optical densities of pixels and operates directly on the image grey values, without prior data reduction by the user and without segmentation. The registration is implicitly performed through the definition of a function that evaluates the quality of the alignment and thereby controls the optimization procedure. The information used for the alignment is not restricted to any specific feature, and therefore this approach is more flexible than the feature-based one. Two different methodologies distinguish the methods in this group:

• Intensity-based methods. These methods base the alignment on the evaluation of the intensity values, considering the images aligned when the differences between grey values are minimal. This restriction is ideal in cases where the two images are identical except for noise.
An important aspect to be considered is that the proposed functions are computed only on the overlap area between both images, which varies for different transformations. Some of the functions that have been proposed to describe the relation between grey values are:

– The sum of absolute value differences. This is the simplest and most direct measure of the similarity of two images. It is defined as

S(A, B) = Σ_{x ∈ A ∩ B} |f_A(x) − f_B(x)|,   (2.12)

where f_A(x) and f_B(x) represent the intensity at a point x of image A and image B, respectively. When this measure is applied, we assume that the image values are calibrated to the same scale.

– Correlation. In the alignment of two images, registration results in a strong linear relationship between corresponding values in the two images. A measure of similarity is then the correlation, which determines the fit of a line to the distribution of corresponding values. Correlation is expressed as

C(A, B) = Σ_{x ∈ A ∩ B} f_A(x) × f_B(x).   (2.13)

The main limitations of this measure are:

∗ Its dependence on the number of points over which it is evaluated, which tends to favour transformations yielding a large overlap. The normalized cross-correlation solves this problem simply by dividing the correlation by the number of points.

∗ Its dependence on the intensity values, which tends to favour high intensity values. As a solution to this second limitation, a better measure of alignment was proposed: the correlation coefficient, which measures the residual errors from fitting a line to the data by least-squares minimization.

• Methods based on the occurrences of intensity values. The basic idea behind these methods is that two values are related or similar if there are many other examples of those values occurring together in the overlapping image.
These measures belong to a class of more generic statistical measures that only look at the occurrence of image values and not at the values themselves. Most of these techniques are based on the feature space, or joint histogram. The joint histogram is a two-dimensional plot of the corresponding grey values in the images, showing the combinations of grey values of the two images at all corresponding points. It is constructed by counting the number of times each combination of grey values occurs: for each pair of corresponding points (x, y), where x is a point in the first image and y a point in the second image, the entry (f_A(x), f_B(y)) of the joint histogram is increased. The joint histogram depends on the alignment of the images. When the images are correctly registered, corresponding structures overlap and the joint histogram shows certain clusters for the grey values of those structures. Conversely, when the images are misaligned, structures in one image overlap with structures in the other image that are not their counterparts, which manifests itself in the joint histogram as a dispersion of the clustering. This property is exploited by defining measures of clustering (to be maximized) or dispersion (to be minimized). Most of these measures are based on information theory; for a detailed description of this theory see Section 2.3. In the information theory context, the registration of two images is represented by an information channel X → Y, where the random variables X and Y represent the images. Their marginal probability distributions, p(x) and p(y), and the joint probability distribution, p(x, y), are obtained by simple normalization of the marginal and joint intensity histograms of the overlapping areas of both images. Some of the measures based on the occurrences of intensity values are:

– Moments of the joint probability distribution.
The joint probability tells us the proportion of times one or more variables take some specific values. Empirically, as the images approach the registration position, the peaks of the joint probability distribution increase in height and the regions of the distribution that contain lower counts decrease in height. Therefore, the registration process has to rearrange the pixels so that each occurs with its most probable corresponding value in the other image. One possible way to quantify this shift from many lower probabilities to a smaller number of higher probabilities is to measure the skewness (the third moment) of the distribution of probabilities in the joint histogram. The skewness characterizes the degree of asymmetry of a distribution around its mean; it is a pure number that characterizes only the shape of the distribution.

– Joint entropy. In the joint histogram of two images, grey values disperse with misregistration, and the joint entropy is a measure of this dispersion. By finding the transformation that minimizes their joint entropy, the images should be registered [30]. The main drawback of this method is its high sensitivity to the overlap area.

– Mutual information (MI). Another measure is mutual information, which is less sensitive to the overlap area. The more dependent the datasets are, the higher the MI between them. Registration is assumed to correspond to maximum mutual information: the images have to be aligned in such a manner that the amount of information they contain about each other is maximal [32]. In the image registration context, Studholme [33] proposed the normalized mutual information (NMI), defined by

NMI(X, Y) = (H(X) + H(Y)) / H(X, Y) = 1 + I(X, Y) / H(X, Y),   (2.14)

which is more robust than MI due to its greater independence of the overlap area.
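To make these pixel-based measures concrete, here is a compact NumPy sketch of the sum of absolute differences (Eq. 2.12), correlation (Eq. 2.13), and NMI (Eq. 2.14) computed from a binned joint histogram. The bin count and the function names are our choices for illustration.

```python
import numpy as np

def sad(a, b):
    """Sum of absolute differences over the overlap (Eq. 2.12); lower is better."""
    return np.abs(a - b).sum()

def correlation(a, b):
    """Plain correlation (Eq. 2.13); higher is better, but biased towards
    large overlaps and high intensities (see the limitations above)."""
    return (a * b).sum()

def nmi(a, b, bins=16):
    """Normalized mutual information (Eq. 2.14) from the joint histogram."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()                  # joint probability p(x, y)
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)  # marginals p(x) and p(y)
    def H(p):                                  # Shannon entropy, 0 log 0 := 0
        p = p[p > 0]
        return -(p * np.log2(p)).sum()
    return (H(px) + H(py)) / H(pxy.ravel())
```

For identical images the joint histogram is diagonal, so H(X, Y) = H(X) and NMI reaches its maximum value of 2; for an image paired with a constant image, NMI is 1.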
To conclude this section, the most relevant properties of the intensity-based registration approach are summarized. Its main feature is its generality: it can be applied to any dataset with no previous pre-processing or segmentation. Moreover, as all the pixels are considered in the alignment process, the method is quite immune to noise and insensitive to outliers. The convergence of intensity-based registration is, however, in general very slow. Due to the considerable computational cost of these methodologies, multi-resolution and multi-scale strategies are incorporated into the process, first registering at lower resolutions and then increasing the resolution, in order to speed up convergence.

2.5.3 Challenges in Image Registration

In this section, the main problems currently being addressed by image registration researchers are briefly summarized.

Robustness and Accuracy

To evaluate the behaviour of a registration method, robustness and accuracy are the main parameters to be considered. Robustness refers to how the method behaves with respect to different initial states, i.e. different initial positions of the images, image noise, etc. Accuracy refers to how close the final solution of the method is to the ideal solution. New measures and new interpolation schemes constantly appear, trying to improve the robustness and accuracy of the standard measures.

Artifacts

In the registration process, the interpolation algorithm plays an important role, since the transformation usually brings the point to be evaluated to a non-grid position. Its importance is greater when the grid size coincides in both images, since the interpolation pattern is then repeated for each point.
When the mutual information or its derivations, which are the most common measures used in image registration, are computed, their value is affected by both the interpolation scheme and the selected sampling strategy, limiting the accuracy of the registration. The fluctuations of the measure are called artifacts and are well studied by Tsao [34].

Speed-up

One of the main user requirements of registration techniques is speed: users want results as fast as possible. The large amount of data acquired by current capture devices makes processing it time-consuming. Therefore, the definition of strategies able to accelerate the registration process is fundamental. Several multiresolution frameworks have been proposed, achieving better robustness and speeding up the process.

Chapter 3. Classification Framework

As we introduced in Section 1.3, the main goal of this master thesis is the recognition and classification of different documents (invoices, tickets, and receipts) within a database. In this master thesis, a document refers to a single-page typeset document image. The document image may be produced by a scanner or a fax machine, or by converting an electronic document into an image format (usually TIFF). Given the image of a document, the basic objective is to identify this image within a previously created database containing all company documents. Although our images have no text content identified as such, most of the documents are of the mostly-text type; even so, OCR techniques are not applied in our project. It is also important to emphasize that, in our document classifier, the document space is not restricted to a single application domain (e.g., only invoices, only bank documents, or only tickets), since it is extended to include several application domains: tickets, invoices, and receipts.
As we have seen in Section 2.2, Bagdanov and Worring characterize document classification at two levels of detail: coarse-grained and fine-grained. Coarse-grained classification is used to classify documents with very different features, such as business letters versus technical articles, whereas fine-grained classification is used to classify documents with similar features, such as business letters from different senders or journal title pages from various journals. In our case, the coarse-grained level corresponds to filtering the database by different document features (see Section 3.3), while the fine-grained level corresponds to applying normalized mutual information registration (see Section 5.2). In summary, some of the main features of our document classifier are the following:

• Classification is based on image-level features.

• Using the document taxonomy defined by Nagy [1], our documents can be included in the mostly-text documents group.

• Our classifier uses a document space that spans several application domains.

• Our classifier is characterized by two levels:

– Coarse-grained. This level corresponds to filtering the database by different document features (see Section 3.3).

– Fine-grained. This level corresponds to applying normalized mutual information registration (see Section 5.2).

In Section 2.2, we have seen that Chen [9] defines a document classifier using three components (the problem statement, the classifier architecture, and the performance evaluation), represented in Figure 2.1. We have modified that diagram with the aim of adapting it to our document classifier; the result is represented in Figure 3.1.

Figure 3.1: Three components of our document classifier.

In the first component, the problem statement, we obtain a series of document images (document samples) that we can divide into two groups: reference samples and input samples.
• Reference samples: This group is formed by a set of documents where, ideally, all documents differ from one another and each document represents a document type that the classifier can identify. This group of documents forms our database.

• Input samples: This group is formed by the set of documents that we want to use as classifier input, with the aim of determining to which document type each one belongs; that is, the objective is to find the corresponding document within the database formed by the reference samples.

Both sets of documents have to be preprocessed, within the second component (the classifier architecture), with the aim of preparing the images and extracting their features for the registration process. The classifier architecture also includes a classification algorithm, which is responsible for the registration process and is therefore a very important part of this master thesis. Section 3.2 gives a detailed explanation of this part, along with Figure 3.4, which summarizes its main stages. Last but not least, the third component, performance evaluation, allows us to obtain an estimation of the speed, robustness, and efficiency of the implemented classifier. Sections 3.1, 3.2, and 5.4 give a detailed explanation of the three components (problem statement, classifier architecture, and performance evaluation) of our document classifier.

3.1 Problem Statement

In this section, we define, using the concepts collected in Section 2.2.1, the problem that is solved by the classifier. The problem statement for a document classifier consists of two aspects [9]: the document space and the set of document classes.

The Document Space

The document space defines the range of input document samples and may include documents that should be rejected because they do not lie within any document class. In this case, the rejected document can be added to the group of reference images.
In this way, when the next document of this kind enters the system, it may be recognized and properly classified. Our classifier uses a document space constituted by invoices, tickets, and receipts (see Figure 3.2).

The Set of Document Classes

As we have seen in Section 2.2.1, the set of document classes defines how the document space is partitioned. Specifically, our document space is larger than the union of the document classes (see Figure 2.2.b). We have also seen that a document class is defined as a set of documents characterized by the similarity of expressions, style, form, or contents. In our case, document classes are defined based on the similarity of form and style (also called visual similarity), such as page layout (see Figure 3.3).

Figure 3.2: Some documents that make up our document space.

Figure 3.3: Examples of document classes.

3.2 Classifier Architecture

As we have seen in Section 2.2.2, Chen [9] uses the following four aspects to characterize the classifier architecture:

1. Document features and recognition stage.
2. Feature representations.
3. Class models and classification algorithms.
4. Learning mechanisms.

In this section we explain these four aspects of our classifier.

Document Features and Recognition Stage

In our case, we use document features to build a good filter for the database containing the reference images. Initially, the database contains a very high number of reference images, which makes registering the input image against the whole database too expensive in terms of execution time. Using document features, we can filter the database by removing documents whose features differ from those of the input image. This allows us to reduce the database size considerably and, in turn, the execution time.
Before discussing the choice of document features further, we first consider the document recognition stage at which the classification is performed.

Document Recognition Stages

As we have seen in Section 2.2.2, document classification can be performed at various stages of document processing, and the choice of document features is constrained by the stage at which classification is performed. Figure 2.3 shows a typical sequence of document recognition for mostly-text document images. In our case, we only need to apply “Image preprocessing”. Specifically, we perform the following preprocessing:

• Cropping removes the image margins, so that the image is adjusted to the document it represents (see Section 4.2.1).

• Check position verifies that the document text is in horizontal position; otherwise, a correction is applied by rotating the image 90 degrees (see Section 4.2.2).

• Skew correction is applied (see Section 4.2.3).

• Logo extraction (see Section 4.3.1) allows us to apply logo registration (see Section 5.3). The user must manually select the logo of the reference images; otherwise, logo registration cannot be applied.

• K-means allows us to identify the main colors of a document and their ratios (see Section 4.3.3).

In Chapter 4, we can find a detailed explanation and visual examples of these and other implemented preprocessing techniques.

Choice of Document Features

Chen [9] characterizes document features using three categories: image features, structural features, and textual features. In our classifier we only consider image features taken directly from the image; these are called global image features. This makes our classifier very quick, since it is implemented right after the preprocessing process. Thus, we avoid processes such as block segmentation, physical layout analysis, logical layout analysis, or OCR.
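Returning to the skew-correction step listed above: as a generic illustration (not necessarily the method of Section 4.2.3), text skew can be estimated by shearing a binary text image over a range of candidate slopes and maximizing the variance of the horizontal projection profile, which is most peaked when the text lines are horizontal. All names below are ours; a vertical shear stands in for a small rotation.

```python
import numpy as np

def estimate_skew(img, slopes):
    """Estimate text skew (as a slope, i.e. the tangent of the angle) by
    un-shearing each column vertically and maximizing the variance of the
    row-projection profile."""
    best, best_var = slopes[0], -1.0
    h, w = img.shape
    for s in slopes:
        sheared = np.empty_like(img)
        for x in range(w):
            # undo a skew of slope s: shift column x up by round(s * x) rows
            sheared[:, x] = np.roll(img[:, x], -int(round(s * x)))
        var = np.var(sheared.sum(axis=1))  # peaked profile => high variance
        if var > best_var:
            best, best_var = s, var
    return best
```

The recovered slope corresponds to the skew angle arctan(slope); correction then applies the inverse rotation.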
Finally, we have considered the use of the following document features:

• Image size: It allows an initial filtering of the database. Thus, we can distinguish between different typologies of documents, such as tickets, invoices, and receipts.

• Logo and logo position: Logo reference images and their positions are stored in the database. This allows a logo registration to find the logo (around the stored position) within the input image (see Section 5.3).

• Color/gray level: It allows us to filter color or gray documents depending on the input image.

• Colors: The main colors of a document and their ratios allow us to filter out documents with different colors or color ratios.

• First NMI registration: In theory, after preprocessing, the registration of two images of the same type without translation has to exceed a certain threshold; the images that do not exceed this threshold are filtered out.

We have also studied the possibility of using the following document features, but the results were not sufficiently reliable and robust to be used for filtering the database:

• Number of regions in the image: This value was obtained by a split-and-merge segmentation method.

• Number of horizontal and vertical lines in the image: The method for obtaining these values requires working at high resolution, which increases the execution time considerably.

• Entropy: It is not a good discriminating value for this specific task.

Feature Representations

In Section 2.2.2 we have seen that the document features extracted from each sample document can be represented in various ways, such as a flat representation (fixed-length vector or string), a structural representation, or a knowledge base. As our document features do not provide structural information, they are represented in fixed-length feature vectors, unlike features that provide structural information, which are represented in formats such as a tree, a list, or a graph. Specifically, we store document features as image class attributes, since it is not necessary to take into account any kind of structural information. Thus we respect the guideline for the selection of feature representations given by Watanabe: “The simpler, the better”. Our document representation is the simplest possible: it contains the minimum information necessary to filter the database since, in our case, document features do not participate in the registration process.

Class Models and Classification Algorithms

Class models define the characteristics of the document classes. In our case, the class model is defined by the document image, its features, and one or more logos. The document features allow us to filter the database, the document image allows us to apply NMI registration (see Section 5.2), and, finally, with the logos of the documents we can use logo registration (see Section 5.3). Figure 3.4 shows in detail the classification algorithm presented in Figure 3.1.

Figure 3.4: Classification algorithm.

In NMI registration we apply a new method, based on information theory, that has not been used before in document classification. On the other hand, in logo registration we use a classification algorithm presented in Section 2.2.2: template matching, which is used to match an input document with one or more logos of each class model.

Learning Mechanisms

A learning mechanism provides an automated way for a classifier to construct or tune class models. Our classifier has only one learning mechanism, shown in Figure 3.1. This mechanism consists in adding to the database, as new reference images, the input images that could not be classified. Thus, if a new input image of this type is introduced into the system, it can be classified correctly.
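Putting the pieces together, the overall flow of the classification algorithm of Figure 3.4 can be sketched as follows. This is a hypothetical outline: the helper names, the callable filters and registration function, and the threshold value are all our own illustrative choices, not the thesis implementation.

```python
def classify(input_doc, database, filters, register, threshold):
    """Sketch of the two-level classifier: coarse-grained filtering of the
    database, then fine-grained registration of the surviving candidates.
    Unclassified inputs become new references (the learning mechanism)."""
    # Coarse-grained stage: keep only references that pass every feature filter.
    candidates = [ref for ref in database
                  if all(f(input_doc, ref) for f in filters)]
    # Fine-grained stage: register and keep the best score above the threshold.
    best, best_score = None, threshold
    for ref in candidates:
        score = register(input_doc, ref)  # e.g. NMI of the registered pair
        if score > best_score:
            best, best_score = ref, score
    if best is None:
        database.append(input_doc)  # learning mechanism: add as new reference
    return best
```

With toy one-dimensional "documents", a nearness filter, and a similarity score, an input close to a reference is matched, while an outlier is rejected and appended to the database.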
3.3 Document Database

In the previous sections, we have seen the great importance of the database within the system. The database stores the paths of the reference images together with their features. Thus, the database can be filtered by means of the input image features, which allows us to register the input image only against the reference images that have similar features. This implies an important reduction in the number of images that need to be registered and, therefore, a decrease in the execution time. Our database is very simple: it consists of a single table of 13 columns, each representing some kind of document information. The following fields are stored (see Figure 3.5):

• “identificador”: An identifier assigned to each reference image.

• “nom”: The name and extension of the reference image.

• “imatge”: The reference image path.

• “logo”: The logo image path.

• “xLogo”: The “x” position in pixels of the logo origin (its upper-left corner) within the reference image.

• “yLogo”: The “y” position in pixels of the logo origin within the reference image.

• “amplLogo”: The width in pixels of the logo.

• “alLogo”: The height in pixels of the logo.

• “imReduida”: The thumbnail path of the reference image. In Section 5.2 we will see that NMI registration is applied to the thumbnails and not to the original images.

• “xcm”: The width in centimeters of the reference image.

• “ycm”: The height in centimeters of the reference image.

• “centroides”: The path of the text file that contains the result of applying the k-means method (see Section 4.3.3) to the reference image, that is, its main colors and their ratios.
• "color": A boolean value that indicates whether the reference image is in color or in gray scale.

Figure 3.5: Database.

3.4 Document Filtering

This section explains the central part of this master thesis: the filtering process. The main objective is to classify an input image from the reference documents stored in the database. Trying to achieve this goal by registering the input image against all the images stored in the database would be computationally very expensive, since the database contains a large number of reference images. Therefore, it is necessary to reduce this time by reducing the number of possible candidates, that is, by filtering the database. We currently use 3 filters: the size filter (see Section 3.4.1), the k-means filter (see Section 3.4.2), and the NMI filter (see Section 3.4.3).

3.4.1 Size Filter

This simple filter allows us to consider only the reference images with a size similar to the input image. Therefore, we can discard the reference images with a typology different from the input image, that is, we can differentiate between invoices, receipts, and tickets. Tickets of the same type can have different heights, so they are a special case: we only consider their width in the filtering process.

3.4.2 K-means Filter

This filter allows us to consider only the reference images with colors similar to the input image. It is based on the k-means method (see Section 4.3.3), which we use to calculate the main colors and their ratios for both the reference and the input images. The k-means filter then calculates the difference between the colors of each reference image and those of the input image; if the difference is greater than a certain threshold, the reference image is discarded. It has been necessary to implement a method to calculate this difference between colors.
The main objective is to assign each main color of the reference image to one of the main colors of the input image. This allows us to calculate a distance value based on the image colors and to decide whether to keep or discard the reference image. In a first version we implemented a method that finds the optimal solution using backtracking. The solution was the best possible, but the method was too slow. The scanned images have a lot of noise due to the acquisition process and, therefore, we considered that it was not necessary to spend so much time looking for the optimal solution: an approximation was sufficient to apply this filter. Thus, in a second version, we decided to compute only two particular cases and take the best of them as the valid solution. We consider the following two cases:

• In each iteration we group the two colors of the different images separated by the minimum distance (best candidates), taking the color proportions into account. We always group an exact number of pixels of each image.

• In each iteration we group the most remote color (worst candidate) with the color of the other image that is at minimum distance from it.

These two cases are best understood by looking at the two examples of Figure 3.6. In both examples we compare the 3 main colors of two images, represented in a 2D plane to facilitate comprehension. The left images state the problem: they locate the three colors of each image in 2D space, together with the color proportions (represented by the value inside each circle and by the circle size) and the distances between them (represented by dashed lines and quantified by a black number). The central images show the solutions obtained by applying the first case, and the right images show the solutions obtained by applying the second case.
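The first of the two cases can be sketched as a greedy matching over the two palettes. The sketch below is only an illustration of the idea, assuming each palette is given as (color, proportion) pairs with equal total proportions; the names are ours, not from the thesis code.

```python
import math

def color_dist(c1, c2):
    """Euclidean distance between two RGB colors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

def greedy_palette_distance(pal_a, pal_b):
    """First greedy case: repeatedly group the closest remaining pair of
    colors (best candidates), transferring the matched proportion of
    pixels, and accumulate distance weighted by that proportion."""
    a = [[c, p] for c, p in pal_a]
    b = [[c, p] for c, p in pal_b]
    total = 0.0
    while any(p > 1e-9 for _, p in a):
        # find the closest remaining pair across the two palettes
        pairs = [(color_dist(ca, cb), i, j)
                 for i, (ca, pa) in enumerate(a) if pa > 1e-9
                 for j, (cb, pb) in enumerate(b) if pb > 1e-9]
        d, i, j = min(pairs)
        w = min(a[i][1], b[j][1])   # exact amount of pixels grouped
        total += d * w
        a[i][1] -= w
        b[j][1] -= w
    return total
```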
We can see that in the first example a better result is obtained with the second case, whereas in the second example a better result is obtained with the first case.

Figure 3.6: Example of the distance calculation between the colors of two images.

In order to demonstrate the reliability of this approach, we calculated 100 distances between images using both backtracking and our approach. Comparing these results, we can see that the difference between the optimal distance and our approach is always less than 10, while the distance between two colors always lies between 0 and 442 (the distance between black and white, that is, \sqrt{3} \cdot 255 \approx 442). Table 3.1 shows the results in more detail.

Interval of differences
(Approach distance − Optimal distance)    Number of differences
[0, 1]                                    42
(1, 3]                                    21
(3, 6]                                    26
(6, 10]                                   11
Total                                     100

Table 3.1: Comparison of the results of calculating one hundred distances between color images using backtracking and our approach.

3.4.3 NMI Filter

After applying the image preparation methods (see Section 4.2), it is reasonable to expect the reference images and the input image to be practically registered. Therefore we apply the NMI registration method (see Section 5.2) without applying any kind of transformation. If we obtain a registration value smaller than a certain threshold, the image is discarded.

3.5 Interface

The graphical interface of the application (see Figure 3.7) was designed by means of Qt Designer.

Figure 3.7: Design of the interface with Qt Designer.

When the application runs, a main screen appears (see Figure 3.8) with the following five main sub-menus:

• "Arxiu"
• "Finestra"
• "Base de dades"
• "Pre-procés"
• "Registre"

Figure 3.8: Main screen of the application.

3.5.1 "Arxiu" Menu

The "Arxiu" menu contains the following options (see Figure 3.9):

• "Obrir": It allows us to load an image from the disk.
It is necessary to load at least one image to activate some options like "tancar finestra", "tancar totes les finestres", "apropar", "allunyar", "cascada", "afegir/actualitzar imatge", "esborrar imatge", etc.

• "Guardar": It allows us to store a loaded image on the disk.
• "Sortir": It allows us to leave the application.

Figure 3.9: "Arxiu" menu.

3.5.2 "Finestra" Menu

The "Finestra" menu contains the following options (see Figure 3.10):

• "Apropar": It zooms in on the selected image.
• "Allunyar": It zooms out of the selected image.
• "Tamany original": The selected image is displayed at its original size.
• "Cascada": Windows are arranged in cascade mode (see Figure 3.11).
• "Mosaic": Windows are arranged in mosaic mode (see Figure 3.12).
• "Tancar": The selected window is closed.
• "Tancar totes": All open windows in the application are closed.
• Windows list: It shows the list of open windows and allows us to change the selected window.

Figure 3.10: "Finestra" menu.

Figure 3.11: Cascade mode.

Figure 3.12: Mosaic mode.

3.5.3 "Base de dades" Menu

The "Base de dades" menu contains the following options (see Figure 3.13):

• "Connectar": It allows us to connect to a database. Initially the application is connected to the default database. If we want to change the database, we only need to disconnect it. This activates the connection option, which gives us access to the configuration menu (see Figure 3.14), where a new connection can be specified (database, user name, password, etc.).
• "Desconnectar": It allows us to disconnect from a database.
• "Afegir/Actualitzar imatge": If the selected image does not exist, it is added to the database; otherwise, only its fields are updated.
• "Esborrar imatge": If the selected image is in the database, it is erased.
• "Restaurar base de dades": It allows us to initialize a database from the images contained in a specific folder.
• "Esborrar base de dades": The database is completely removed.

Figure 3.13: "Base de dades" menu.

3.5.4 "Pre-procés" Menu

The "Pre-procés" menu only contains the "Aplicar" option (see Figure 3.15). This option gives us access to the preprocess configuration menu (see Figure 3.16), where we can carry out the preprocess selection and configuration. In Chapter 4 the different preprocesses are explained in detail.

Figure 3.14: Database configuration menu.

Figure 3.15: "Pre-procés" menu.

Figure 3.16: Preprocess configuration menu.

Figure 3.17: "Registre" menu.

3.5.5 "Registre" Menu

The "Registre" menu only contains the "Aplicar" option (see Figure 3.17). This option gives us access to the registration configuration menu (see Figure 3.18), where we can carry out the registration method selection and configuration. In Chapter 5 the different registration methods are explained in detail.

Figure 3.18: Registration methods configuration menu.

Figure 3.19 shows how the result of the classification is returned.

Figure 3.19: Result returned by the application.

Chapter 4

Image Preprocessing and Segmentation

4.1 Introduction

As we stated in the introduction to Section 2.4, preprocessing generally consists of a series of image-to-image transformations. It does not increase our knowledge of the documents, but may help to extract it. On the other hand, the segmentation process can be described as the identification of structures in an image. It consists in subdividing an image into its constituent parts (see Section 2.4.2). Using these algorithms, we try to find diverse document features. The basic idea consists in adding to the database the reference images together with the found features.
This will allow us to filter the database using the features calculated on the input image and to discard those images that do not have similar features. If we have a database with 1000 images, it would be ideal to reduce it to 9 or 10 candidates by filtering. Therefore, the more differentiated the features are, the more images will be discarded and, thus, the faster and more efficient the classification will be.

The implemented methods can be divided into two main groups according to their functionality: "image preparation methods" and "feature extraction methods". "Image preparation methods" belong to the preprocess method group and comprise the following methods:

• "Ajustar" (cropping, see Section 4.2.1)
• "Comprovar posició" (check position, see Section 4.2.2)
• "Skew" (see Section 4.2.3)
• "Escalar" (image scaling, see Section 4.2.4)
• "Rotar" (image rotation, see Section 4.2.4)

"Feature extraction methods" belong to the segmentation method group and comprise the following methods:

• "Seleccionar logo" (logo selection, see Section 4.3.1)
• "Color" (color and gray-scale detection, see Section 4.3.2)
• "K-means" (see Section 4.3.3)

In Sections 4.2 and 4.3 we explain these methods with examples and results. The preprocessing and segmentation methods presented in this chapter have been tested with 100 images and we have obtained a 100% success rate.

4.2 Image Preparation Methods

Generally, documents are digitized using a scanner. This involves several problems, such as position errors, skew errors, and the fact that the scanner normally generates A4 images even when a smaller document, like a ticket or a banking receipt, is scanned (see Figure 4.1). Therefore, it has been necessary to implement methods to correct these errors. This is essential to obtain a more robust, fast, and efficient registration (see Chapter 5).

Figure 4.1: a) Scanned image in a wrong position. b) Image with a skew error.
c) Generally, scanners generate A4 images even when a smaller document is scanned; in this concrete case an adjustment is necessary. This problem also appears in cases a) and b).

4.2.1 Cropping

We have seen that the scanner normally generates A4 images even when a smaller document is scanned. Therefore, it has been necessary to implement a method that fits the scanned images. Looking at the image of Figure 4.1, we identify two main problems: "white zone elimination" and "gray zone elimination".

White Zone Elimination

This problem is very simple to solve because in the white zone all pixels are pure white (R = 255, G = 255, and B = 255). The first idea was to search in a straight line from the point marked in red in Figure 4.2 to the first non-white pixel and then remove the white zone (this process is repeated, starting the search from the top, left, and right sides, in this order, to fit the image on all sides). This method is very fast because we only process a single line of pixels per side. The principal problem is that this method cannot be applied to black and white images, because we risk eliminating important information from the image (see Figure 4.3). When we want to remove the white zones of a black and white image, we follow the same process, but instead of processing a single line of pixels per side, we process whole lines and stop the search when we find a line containing some non-white pixel. Thus we solve the problem and get a correct result. In Figure 4.4 we can see that a small zone has been removed, but in this case the removal is correct. Finally, after several tests, we have seen that applying the second method in all cases is more efficient than discriminating between color images and black and white images. Therefore, the second method is applied in all cases.

Gray Zone Elimination

After removing the white zones, it is necessary to eliminate the gray zones.
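Before turning to the gray zones, the robust white-zone scan just described (whole lines of pixels, stopping at the first line that contains a non-white pixel) can be sketched as follows; the function name and list-of-rows representation are illustrative, not from the thesis code.

```python
def crop_white_margins(img):
    """Second cropping variant described above: from each side, advance
    while an entire line of pixels is pure white (255).  `img` is a list
    of rows of grey values; a simplified sketch."""
    def all_white(line):
        return all(p == 255 for p in line)

    top = 0
    while top < len(img) and all_white(img[top]):
        top += 1
    bottom = len(img)
    while bottom > top and all_white(img[bottom - 1]):
        bottom -= 1
    rows = img[top:bottom]
    cols = list(zip(*rows)) if rows else []
    left = 0
    while left < len(cols) and all_white(cols[left]):
        left += 1
    right = len(cols)
    while right > left and all_white(cols[right - 1]):
        right -= 1
    return [list(r[left:right]) for r in rows]
```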
These zones cannot be eliminated as directly as the white zones, because the pixels in the gray zones suffer a slight variation of intensity. The method implemented to solve this problem is very similar to the previous ones. The basic difference is that in this case the search does not end when a line of pixels with some non-white pixels is found, but when the intensity variance in the line of pixels is above a certain threshold. Thus the fitting problem is solved. Figures 4.5 and 4.6 show some results of applying the cropping method.

Figure 4.2: First idea to implement the white zone elimination.

Figure 4.3: The white zone elimination problem with a black and white image.

Figure 4.4: Solution for the white zone elimination problem.

Figure 4.5: The result of applying the cropping method to a ticket.

Figure 4.6: The result of applying the cropping method to a receipt.

4.2.2 Check Position

After fitting the images, it is necessary to verify that the document is in the correct position. Therefore, it has been necessary to implement a method that detects the position of a document and rotates the image 90 degrees if necessary. In order to know whether the document is in the correct position, we convert the document lines into solid lines: first we obtain horizontal solid lines and then vertical solid lines. The results of this process are shown in Figure 4.7.

Figure 4.7: a) Original image, b) image with horizontal solid lines, and c) image with vertical solid lines.

At this point, we can deduce the correct position of the document by comparing the average line lengths in the two images shown in Figures 4.7b and 4.7c. If the average length of the horizontal lines is greater than the average length of the vertical lines, the document is in the correct position. Otherwise the document must be rotated 90 degrees.
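The decision rule above can be sketched as follows, assuming the solid lines have already been extracted as boolean scan lines (all names here are illustrative):

```python
def mean_run_length(lines):
    """Average length of the solid (non-white) runs in a set of scan
    lines; each line is a list of booleans (True = ink)."""
    runs = []
    for line in lines:
        n = 0
        for ink in line:
            if ink:
                n += 1
            elif n:
                runs.append(n)
                n = 0
        if n:
            runs.append(n)
    return sum(runs) / len(runs) if runs else 0.0

def needs_rotation(horizontal_lines, vertical_lines):
    """Check-position rule: the document is in the correct position when
    horizontal solid lines are, on average, longer than vertical ones."""
    return mean_run_length(horizontal_lines) <= mean_run_length(vertical_lines)
```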
Thus, the position problem is solved. Figures 4.8 and 4.9 show some results of applying the check position method when the document is in a wrong position, and Figure 4.10 shows the result of applying it when the document is in the correct position.

4.2.3 Skew

After fitting and checking the image position, a last step is necessary to have the image properly prepared for feature extraction and, finally, for the execution of the registration algorithm. This last step consists in correcting the small skew errors that may have occurred in the scanning process. To do this we have implemented the method described in Section 2.4.1.

Figure 4.8: The result of applying the check position method to a receipt in a wrong position.

Figure 4.9: The result of applying the check position method to a ticket in a wrong position.

Figure 4.10: The result of applying the check position method to a receipt in the correct position.

In order to make the method more robust we add the following process. First we calculate the skew error using two vertical lines, as explained in Section 2.4.1. Then we repeat the process with 3 lines, obtaining two skew error values, and we calculate their average. If this average agrees with the previously calculated skew error, we consider that the skew error is correctly calculated; otherwise we add another line. The process ends when the last calculated skew error matches the previously calculated one. Finally, the rotation method (see Section 4.2.4) is applied to correct the detected skew error. Although the skew error is usually less than 5 degrees, Figures 4.11, 4.12, and 4.13 show some results of applying the skew method to documents with extreme skew errors in order to demonstrate the great robustness of the method.

Figure 4.11: The result of applying the skew method to a ticket.
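The convergence loop just described can be sketched as follows, where `skew_from_lines(n)` stands for the (hypothetical) routine that returns the averaged skew angle measured with n vertical lines:

```python
def estimate_skew(skew_from_lines, max_lines=10, tol=0.5):
    """Iterative skew estimation described above: start with two
    vertical lines and keep adding lines until two successive averaged
    estimates agree within `tol` degrees.  All names are illustrative;
    the thesis measures angles from the document's text lines."""
    prev = skew_from_lines(2)
    for n in range(3, max_lines + 1):
        cur = skew_from_lines(n)   # average of the n - 1 measured values
        if abs(cur - prev) <= tol:
            return cur             # two successive estimates agree
        prev = cur
    return prev                    # fall back to the last estimate
```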
4.2.4 Image Rotation and Scaling

Two simple methods allow us to apply any kind of rotation or scaling to a given image. These methods have been implemented using tools provided by the Qt libraries. The rotation method is used by the skew method (see Section 4.2.3) to correct the detected skew error. The scaling method is used by the NMI registration method, which is applied to low-resolution images as we describe in Section 5.2.

Figure 4.12: The result of applying the skew method to a ticket.

Figure 4.13: The result of applying the skew method to a ticket.

Figures 4.11, 4.12, and 4.13 show results of applying the rotation method, and Figures 4.14 and 4.15 show results of applying the scaling method.

Figure 4.14: The result of applying the scale method to an invoice.

Figure 4.15: The result of applying the scale method to a receipt.

4.3 Feature Extraction Methods

After solving the problems produced during document scanning, the image is prepared and the necessary features can be extracted. We basically focus on the following 3 features:

• Logo: We need the document logo and its position within the image in order to apply logo registration (see Section 5.3).
The final objective of this method is that each reference image in the database has also a logo image and its position within the reference image. We only define one logo by image, but we could easily define two or more logos by image. This has not been done because we considered that only one logo is enough to make a good logo registration. Figures 4.16 and 4.17 show reference images with its extracted logos. Figure 4.16: Logo extraction of a ticket. 75 Chapter 4. Image Preprocessing and Segmentation Figure 4.17: Logo extraction of an invoice. 4.3.2 Color and Gray-scale Detection Color and gray-scale detection is a simple method that checks if the image is in color or gray-scale. To know if a image is a gray-scale image, R = G = B must be satisfied by all pixels of the image, otherwise the image is a color image. 4.3.3 K-means The last extraction method uses the k-means algorithm (see Segmentation by unsupervised clustering in Section 2.4.2). This method is only applied to the color images and allows us to extract the main colors and its ratios. Thus, the database can be filtered taking into account only similar color images. K-means is one of the simplest unsupervised learning algorithms that solve the clustering problem. The procedure follows a simple and easy way to classify a given data set (color pixels) through a certain number k of clusters fixed a priori. In or case, k is the number of colors that we want to find. The algorithm proceeds as follows. First, the centroids are placed randomly, although different locations could cause different results. The next step is to take each point belonging to a given data set and associate it to the nearest centroid. When no point is pending, the first step is completed and an early groupage is done. At this moment we need to re-calculate k new centroids as barycenters of the clusters resulting from the previous step. After we have these k new centroids, a new binding has to be done 76 Chapter 4. 
between the same data set points and the nearest new centroid. Thus, a loop has been generated. As a result of this loop we may notice that the k centroids change their location step by step until no more changes are made; in other words, until the centroids do not move any more. Finally, this algorithm aims at minimizing a function, in this case a squared error function:

J = \sum_{j=1}^{k} \sum_{i=1}^{n} (x_i - c_j)^2,   (4.1)

where (x_i - c_j)^2 is a chosen distance measure between a data point x_i and the cluster center c_j. This function J is an indicator of the distance of the n data points from their respective cluster centers.

The algorithm is composed of the following steps, represented in Figure 4.18:

1. Place k points into the space represented by the objects that are being clustered. These points represent the initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the k centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.

Figure 4.19 shows a step-by-step representation of the algorithm on a simple example. Figures 4.20 and 4.21 show the results generated by the k-means algorithm. It is important to emphasize that this method generates a file where the centroid values and the percentage of image pixels belonging to each centroid are stored. This allows us to implement a color comparison algorithm that takes into account the proportion of each color within the image. Thus, this algorithm gives more importance to the colors with larger proportions and little or no importance to the colors with smaller proportions.

Figure 4.18: K-means algorithm diagram.

Figure 4.19: K-means algorithm example. 1) k initial "means" (in this case k = 3) are randomly selected from the data set (shown in color).
2) k clusters are created by associating every observation with the nearest mean. The partitions here represent the Voronoi diagram generated by the means. 3) The centroid of each of the k clusters becomes the new mean. 4) Steps 2 and 3 are repeated until convergence has been reached.

Figure 4.20: The result of applying the k-means algorithm to an invoice. a) Original image. b) Result of applying the k-means algorithm for k = 2, i.e., 2 groups/colors. c) Result of applying the k-means algorithm for k = 3, i.e., 3 groups/colors.

Figure 4.21: The result of applying the k-means algorithm to an invoice. a) Original image. b) Result of applying the k-means algorithm for k = 3, i.e., 3 groups/colors. c) Result of applying the k-means algorithm for k = 6, i.e., 6 groups/colors.

Chapter 5

Image Registration

5.1 Introduction

As we have seen in Section 2.5, registration is a fundamental task in image processing used to match two or more images or volumes obtained at different times, from different devices, or from different viewpoints. Basically, it consists in finding the geometrical transformation that enables us to align images into a unique coordinate space. To carry out the image registration we have focused on two methods:

• NMI registration: This method uses techniques based on information theory (see Section 5.2).
• Logo registration: This method uses template matching techniques based on the normalized correlation coefficient (see Section 5.3).

As we described in Section 3.4, the registration method is only applied to the group of reference images in the database that have passed a series of filters. There are two different ways to apply the registration method to this group of reference images: brute-force search and basic search.

• Brute-force search: The algorithm applies the registration method to all images of the group. Then, it returns a list of all the images sorted according to the obtained similarity value.
• Basic search: The algorithm stops when it finds the first image with a similarity value greater than a certain threshold. Thus, it returns a list of the registered images: in the best case, the list contains a single image, and in the worst case it contains all the images, as in the brute-force case.

Both approaches have advantages and disadvantages. Brute-force search may be interesting when the database contains more than one image of the same type, since it returns all the images similar to the input image and it always returns the best possible result. On the other hand, it is slower than the basic search, since it does not skip any registration operations. In theory, in our specific case, the database contains only one document of each type. In practice, the database can contain more than one document of each type (this makes the method more robust, but also slower).

5.2 Normalized Mutual Information

Mutual information has been successfully used in many different fields, such as medical image registration. We propose here to use the normalized mutual information (NMI) to register documents. Thus, we can register an input document against a series of reference documents: by maximizing the normalized mutual information, we obtain the most similar documents.

The image registration process is explained in detail in Section 2.5. We have seen that it can be described as a process composed of four basic elements: the transformation, the interpolator, the metric, and the optimizer. We now explain these components for our registration method:

• Spatial transformation: The spatial transformation defines the spatial relationship between both images. We only consider 2D translations, that is, a particular case of rigid transformation. In order to facilitate and speed up the image registration, a skew correction method (see Section 4.2.3) has been implemented.
Thus, we assume that the translation in both axes is the only rigid transformation that has to be carried out between the two images.

• Interpolator: The interpolation strategy determines the intensity value of a point at a non-grid position. We specifically use linear interpolation, that is, the intensity of a point is obtained from the linearly weighted combination of the intensities of its neighbors.

• Metric: The metric evaluates the similarity (or disparity) between the two images to be registered. We specifically use NMI as the image similarity measure. It belongs to the intensity occurrence measures, that is, it depends on the probability of each intensity value and is based on information theory.

• Optimization: The optimizer finds the maximum (or minimum) value of the metric by varying the spatial transformation. We specifically use Powell's method [31]. Powell's method, also called the conjugate direction method, is a zero-order optimization method. The basic concept of Powell's conjugate direction approach is to use sequential one-dimensional search directions and generate a new direction toward the next iterative point. Once the unidirectional search vectors S_i, i = 1, ..., n (for n variables) are defined, the conjugate direction is determined by the sum of all unidirectional vectors:

S_{n+1} = \sum_{i=1}^{n} \alpha_i S_i,   (5.1)

where S_{n+1} is the conjugate direction and \alpha_i is the scaling factor.

Finally, we have carried out several tests at different image resolutions. After some experiments, we have seen that the NMI values obtained with high-resolution images are lower than those obtained with low-resolution images. This is logical because in high-resolution images we have a higher dispersion of intensities. If we scale down (compress) the image, we obtain an image where the significant parts gain more prominence.
After this study, we have seen that very good results are obtained using images whose longer side is scaled to 50 pixels (see Figure 5.1). This also greatly accelerates the registration process.

Figure 5.1: Scaled image used in normalized mutual information registration.

Figure 5.2 shows the superposition of two registered images using the normalized mutual information.

5.3 Logo Registration

Unlike the NMI registration method, the logo registration method only registers a small portion of the image, at high resolution. After the logo extraction preprocess (see Section 4.3.1), we keep in the database the logo image and its position within the reference image. When we want to classify a new input image using this registration method, we only search for the logo in a window surrounding the position stored in the database. This has the advantage that if two documents have a different structure but the same logo at approximately the same location, the image registration should be correct. This explanation of the logo registration method is graphically summarized in Figure 5.3.

Figure 5.2: Superposition of two registered images using normalized mutual information.

Figure 5.3: Graphical summary of the logo registration method. We want to classify an input image and we have three possible candidates. In this case, only the last reference image can obtain a good similarity value (the third one in the right column).

Now we define the four components of our logo registration method:

• Spatial transformation: This component is identical to that of the NMI registration method (see Section 5.2).

• Interpolator: Unlike the NMI registration method, this method does not need an interpolator, because its translations are integers (in pixels).

• Metric: In this case, we use the normalized correlation coefficient as the image similarity measure (see Section 2.5.2).
• Optimization: Unlike the NMI registration method, this method does not need an optimization process, because in this case the similarity value is calculated for all positions in the image and then we select the maximum value.

Figure 5.4 shows the superposition of two registered images using the logo registration method.

Figure 5.4: Superposition of two registered images using the logo registration method.

5.4 Performance Evaluation

We now evaluate the performance of our NMI and logo registration methods.

5.4.1 NMI Registration Test

We have classified 77 input images using a database with 100 reference images. The 77 input images are divided into 40 invoices, 30 receipts, and 7 tickets, and the 100 reference images are divided into 60 invoices, 25 receipts, and 15 tickets. After classifying the 77 input images, we obtained the following results: 71 input images (92.2%) are classified correctly and 6 input images (7.8%) are classified incorrectly. The successful cases are divided as follows:

• 9 input images do not have a reference image and the application did not find matches (true negatives).
• 62 input images have a reference image and the application found matches (true positives).

The error cases are divided as follows:

• 3 errors occurred in ticket classification: the input images have reference images, but the application did not find matches (false negatives).
• 1 input image does not have a reference image, but the application found matches (false positive).
• 2 input images have reference images, but the application found bad matches (wrong positives).

The registration computation times vary between 300 ms and 3 seconds. To classify the 77 input images of the test, the application took approximately 2.5 minutes (about two seconds per image). After these experiments, we have seen that the main problems appear with the tickets. We propose a possible solution in the future work section (see Chapter 6).
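For reference, the NMI value maximized in these experiments can be computed from the joint grey-level histogram of the two images. The sketch below assumes the common normalization NMI = (H(A) + H(B)) / H(A, B); the thesis code may normalize differently, and the flat-list image representation is ours.

```python
from collections import Counter
from math import log2

def entropy(counts, total):
    """Shannon entropy (in bits) of a histogram given as bin counts."""
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def nmi(img_a, img_b, bins=32):
    """Normalized mutual information NMI = (H(A) + H(B)) / H(A, B) over
    quantized grey levels; images are equal-length flat lists of 0-255
    intensities.  A sketch, not the thesis implementation."""
    step = 256 // bins
    qa = [v // step for v in img_a]
    qb = [v // step for v in img_b]
    n = len(qa)
    ha = entropy(Counter(qa).values(), n)
    hb = entropy(Counter(qb).values(), n)
    hab = entropy(Counter(zip(qa, qb)).values(), n)
    return (ha + hb) / hab
```

For identical images H(A, B) = H(A) = H(B), so NMI reaches its maximum value of 2; unrelated images score closer to 1.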
In Table 5.1 a summary of the results is presented.

                   Successful cases          Error cases
Document type    True pos.   True neg.   False neg.   False pos.   Wrong pos.
Tickets               2           2           3            0            0
Receipts             25           4           0            0            1
Invoices             28          10           0            1            1
Subtotal             55          16           3            1            2
Total                     71                            6

Table 5.1: Results summary of the NMI registration method.

5.4.2 Logo Registration Test

To evaluate the performance of our logo registration method, we use the same set of images as in the performance evaluation of the NMI registration method (see Section 5.4.1). After classifying the 77 input images, we obtained the following results: 76 input images (98.7%) were classified correctly and 1 input image (1.3%) was classified incorrectly. The successful cases are divided as follows:

• 10 input images do not have a reference image and the application did not find a match (true negative).

• 66 input images have a reference image and the application found the correct match (true positive).

The single error case is an input image that has a reference image but for which the application did not find a match (false negative). This error could be solved by using a larger search window. The registration computation time varies between 300 ms and 3 seconds. To classify the 77 input images of the test, the application took approximately 1.4 minutes (about 1.2 seconds per image). After these experiments, we have seen that the logo registration method is more robust and faster than the NMI registration method. Its principal disadvantage is that it requires human intervention whenever a reference image is added to the database. In Table 5.2 a summary of the results is presented.

                   Successful cases          Error cases
Document type    True pos.   True neg.   False neg.   False pos.   Wrong pos.
Tickets               5           2           0            0            0
Receipts             26           4           0            0            0
Invoices             29          11           1            0            0
Subtotal             60          16           1            0            0
Total                     76                            1

Table 5.2: Results summary of the logo registration method.
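The exhaustive search at the heart of the logo registration results above (Section 5.3) amounts to evaluating the normalized correlation coefficient at every integer translation inside the search window and keeping the maximum. As a rough stdlib-only C++ sketch of this idea (the struct and function names are illustrative assumptions, not the thesis implementation):

```cpp
#include <cmath>
#include <vector>

// A grayscale image stored row-major: w*h pixel values.
struct Gray {
    int w, h;
    std::vector<double> px;
    double at(int x, int y) const { return px[y * w + x]; }
};

// Normalized correlation coefficient (in [-1, 1]) between the logo and the
// image patch whose top-left corner is (ox, oy).
double ncc(const Gray& img, const Gray& logo, int ox, int oy) {
    const int n = logo.w * logo.h;
    double mi = 0.0, ml = 0.0;
    for (int y = 0; y < logo.h; ++y)
        for (int x = 0; x < logo.w; ++x) {
            mi += img.at(ox + x, oy + y);
            ml += logo.at(x, y);
        }
    mi /= n;  // patch mean
    ml /= n;  // logo mean
    double num = 0.0, di = 0.0, dl = 0.0;
    for (int y = 0; y < logo.h; ++y)
        for (int x = 0; x < logo.w; ++x) {
            const double a = img.at(ox + x, oy + y) - mi;
            const double b = logo.at(x, y) - ml;
            num += a * b;
            di += a * a;
            dl += b * b;
        }
    const double den = std::sqrt(di * dl);
    return den > 0.0 ? num / den : 0.0;  // constant patch: no correlation
}

// Exhaustive search: evaluate the NCC at every integer translation inside a
// search window (top-left (wx, wy), size ww x wh) and keep the maximum --
// no interpolator and no iterative optimizer are needed.
struct Match { int x, y; double score; };

Match findLogo(const Gray& img, const Gray& logo,
               int wx, int wy, int ww, int wh) {
    Match best{wx, wy, -2.0};
    for (int y = wy; y + logo.h <= wy + wh && y + logo.h <= img.h; ++y)
        for (int x = wx; x + logo.w <= wx + ww && x + logo.w <= img.w; ++x) {
            const double s = ncc(img, logo, x, y);
            if (s > best.score) best = {x, y, s};
        }
    return best;
}
```

Restricting (wx, wy, ww, wh) to a window around the logo position stored in the database is what keeps this brute-force search fast in practice.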
Chapter 6

Conclusions and Future Work

Document image classification is an important focus of research. The development of new techniques that assist and enhance the recognition and classification of different document types is fundamental in business environments and has been the main objective of this master thesis. In this chapter, the conclusions of this master thesis and the directions for our future research are presented.

6.1 Conclusions

In this master thesis we achieved two basic objectives. First, a state of the art has been presented with the aim of summarizing the current state of the classification and recognition of document images. Second, several image processing tools have been presented in order to classify different document classes, such as invoices, receipts, and tickets. Next, the main contributions of this master thesis are summarized:

• In Chapter 2, a state of the art of document classification was presented. This allowed us to see that there is still much work to do in the field of document recognition and classification based on document images.

• In Chapter 3, we presented a description of our document image classification framework. Specifically, we introduced the problem statement, the classifier architecture, the document database, the document filtering process together with the methods used to perform it (the k-means filter based on color comparison, the NMI filter based on information theory, and the size filter), and the application interface. We have always worked with the objective of reducing computation times as much as possible.

• In Chapter 4, several methods for preprocessing and segmentation were presented. These methods allow us to prepare the document image in order to speed up the registration process. The image preparation methods (cropping, skew, and check position) consist of a series of image-to-image transformations.
They do not increase our knowledge of the contents of the document, but they may help to extract it. On the other hand, the feature extraction methods (logo extraction, k-means, and color and gray-scale detection) allow us to extract important document features (the principal colors, the logo, and the logo position) and really do increase our knowledge of the image contents.

• In Chapter 5, a new use of NMI registration was presented. We have seen that this method is a fast, robust, and fully automatic classification technique. We also presented a logo registration method that obtains very good results in the classification of all kinds of document images, but it has the disadvantage that the logo of each reference image must be selected manually before the image is added to the database (i.e., this method needs a priori human intervention). In order to evaluate the performance of these registration methods, we studied the results and computation times obtained by each method.

• In Section 3.5, an interface to apply all the implemented methods was presented. This interface has been very useful to test the methods and to see the final results.

6.2 Future Work

The ideas presented in this master thesis can be improved or expanded in different directions. We propose the following future work:

• We have seen that the main source of errors in the NMI registration are the tickets (see Section 5.4). The classification of tickets is very difficult because they are document images with very little information. Moreover, unlike invoices or receipts, the same type of ticket does not always have the same size; usually the ticket size depends on the information it contains. In order to improve the processing of tickets, we think that a ticket could be divided into three parts: an invariable top part, a variable central part, and an invariable bottom part (see Figure 6.1).
The idea is to remove the variable central part and then register only the invariable parts (top and bottom), obtaining a higher similarity value (see Figure 6.2).

• To try other interesting image registration techniques, such as image registration based on compression. The basic idea is the conjecture that two images are correctly registered when we can maximally compress one image given the information in the other. It would be interesting to demonstrate that the image registration process can be formulated as a compression problem.

Figure 6.1: The three parts of a ticket.

Figure 6.2: NMI registration of tickets after removing the variable parts.

• To improve the efficiency of the implemented methods. Although the obtained computation times are good, time is an essential factor in the field of document classification and therefore we must minimize it as much as possible.

• To set up new learning mechanisms. For example, a mechanism that can detect the invariable parts of a document class from a group of images belonging to that class.

• To implement a new method to automatically locate the logo position within a document image. This would allow us to fully automate the logo classification.

Appendix A

Qt

To design and implement the GUI of the application, we decided to use Qt [35]. Qt is a cross-platform application development framework, widely used for the development of GUI programs (in which case it is known as a widget toolkit), and also used for developing non-GUI programs such as console tools and servers. Qt is most notably used in KDE, Opera, Google Earth, Skype, Qt Extended, Adobe Photoshop Album, VirtualBox, and OPIE. It is produced by the Norwegian company Qt Software, formerly known as Trolltech, a wholly owned subsidiary of Nokia since June 17, 2008.
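Qt's central communication idea, signals and slots (described in the feature list of this appendix), can be mimicked in plain C++. The following is NOT Qt code: it is a minimal stdlib-only sketch of the connect/emit pattern, with illustrative names; a real Qt program would use QObject, the Q_OBJECT macro, and the moc preprocessor instead:

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// A Signal stores the connected slots (arbitrary callables) and invokes all
// of them, in connection order, when the signal is emitted.
template <typename Arg>
class Signal {
public:
    void connect(std::function<void(const Arg&)> slot) {
        slots_.push_back(std::move(slot));
    }
    void emitSignal(const Arg& value) {
        for (auto& slot : slots_) slot(value);
    }
private:
    std::vector<std::function<void(const Arg&)>> slots_;
};
```

For instance, connecting two lambdas to a `Signal<std::string>` and emitting it calls both in order. Real Qt signals and slots additionally offer type-checked connections, queued cross-thread delivery, and automatic disconnection when a QObject is destroyed, none of which this sketch attempts.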
Qt uses C++ with several non-standard extensions implemented by an additional preprocessor that generates standard C++ code before compilation. Qt can also be used in several other programming languages via language bindings. It runs on all major platforms and has extensive internationalization support. Non-GUI features include SQL database access, XML parsing, thread management, network support, and a unified cross-platform API for file handling. Distributed under the terms of the GNU Lesser General Public License (among others), Qt is free and open source software. On the market we can find the following Qt distributions:

• Qt Enterprise Edition and Qt Professional Edition, available for developing software for commercial purposes; they include technical support and have extensions available.

• Qt Free Edition, the Unix/X11 version for the development of free and open source software. It is freely available under the terms of the Q Public License and the GNU General Public License. A non-commercial Qt version is also available for Windows platforms.

• Qt Educational Edition, a version of Qt Professional Edition licensed only for educational purposes.

• Qt/Embedded Free Edition.

Features:

• Qt is a library for creating graphical interfaces. It is distributed freely under the GPL (or QPL) license, which allows us to incorporate Qt into our open-source applications.

• It is available for a large number of platforms: Linux, Mac OS X, Solaris, HP-UX, UNIX with X11, etc.

• It is object oriented, which facilitates software development.

• It uses C++, although it can also be used in several other programming languages (e.g., Python or Perl) via language bindings.

• It is a library based on the concepts of widgets (objects), signals and slots, and events (e.g., mouse clicks).

• Signals and slots are the mechanism that enables communication between widgets.

• A widget can contain any number of children. The "top-level" widget can be of any kind: a window, a button, etc.
• Some attributes, such as text labels, are modified similarly to HTML.

• Qt provides additional functionality:

– Basic libraries → input/output, network management, XML.

– Database interfaces → Oracle, MySQL, PostgreSQL, ODBC.

– Plugins and dynamic libraries (images, formats, ...).

– Unicode and internationalization.

Qt Designer is a tool for designing and implementing user interfaces built with the Qt multiplatform GUI toolkit. Qt Designer makes it easy to experiment with user interface design. At any time you can generate the code required to reproduce the user interface from the files Qt Designer produces, changing your design as often as you like. It has a very complete palette of widgets, including the most common widgets of the Qt libraries. Figure A.1 shows the interface of this tool.

Figure A.1: Qt Designer interface.

Appendix B

MySQL

We decided to use MySQL [36] as the database for managing images. MySQL is one of the most popular open source databases. Its continued development and growing popularity are making MySQL an increasingly direct competitor of giants in the database field such as Oracle. MySQL is a relational database management system (RDBMS). The program runs as a server providing multi-user access to a number of databases. There are many types of databases, from a simple file system to a relational object-oriented database. MySQL, as a relational database, uses multiple tables to store and organize information. It was written in C and C++ and emphasizes its great adaptability to different development environments, allowing interaction with the most widely used programming languages, such as PHP, Perl, and Java, and its integration into different operating systems. Its open source nature is also remarkable: it is free to use and can even be modified with total freedom, since its source code can be downloaded.
This has had a very positive effect on its development and continuous updates, making MySQL one of the most useful tools for Internet-oriented programmers.

Appendix C

OpenCV

OpenCV [37] is a computer vision library originally developed by Intel. It is free for commercial and research use under the open source BSD license. The library is cross-platform and runs on Windows, Mac OS X, Linux, PSP, VCRT (a real-time OS on smart cameras), and other embedded devices. It focuses mainly on real-time image processing; as such, if it finds Intel's Integrated Performance Primitives on the system, it will use these commercially optimized routines to accelerate itself. Example applications of the OpenCV library are human-computer interaction (HCI); object identification, segmentation, and recognition; face recognition; gesture recognition; motion tracking, ego motion, and motion understanding; structure from motion (SFM); stereo and multi-camera calibration and depth computation; and mobile robotics.

General description:

• Open source computer vision library in C/C++.

• Optimized and intended for real-time applications.

• OS/hardware/window-manager independent.

• Generic image/video loading, saving, and acquisition.

• Both low- and high-level API.

• Provides an interface to Intel's Integrated Performance Primitives (IPP) with processor-specific optimizations (Intel processors).

Features:

• Image data manipulation (allocation, release, copying, setting, conversion).

• Image and video I/O (file- and camera-based input, image/video file output).

• Matrix and vector manipulation and linear algebra routines (products, solvers, eigenvalues, SVD).

• Various dynamic data structures (lists, queues, sets, trees, graphs).

• Basic image processing (filtering, edge detection, corner detection, sampling and interpolation, color conversion, morphological operations, histograms, image pyramids).
• Structural analysis (connected components, contour processing, distance transform, various moments, template matching, Hough transform, polygonal approximation, line fitting, ellipse fitting, Delaunay triangulation).

• Camera calibration (finding and tracking calibration patterns, calibration, fundamental matrix estimation, homography estimation, stereo correspondence).

• Motion analysis (optical flow, motion segmentation, tracking).

• Object recognition (eigen-methods, HMM).

• Basic GUI (display image/video, keyboard and mouse handling, scroll bars).

• Image labeling (line, conic, polygon, and text drawing).

OpenCV modules:

• cv - main OpenCV functions.

• cvaux - auxiliary (experimental) OpenCV functions.

• cxcore - data structures and linear algebra support.

• highgui - GUI functions.

Appendix D

.NET

The different methods presented in this master thesis have been translated into Visual Basic using .NET [38], and they are encompassed within a DLL (dynamic-link library). The objective of this translation is that the company, in addition to having the application implemented using Qt, also has a Qt-independent library through which the different methods can be used from other applications programmed in Visual Basic. The company specifically asked us to implement the library in .NET for its convenience. The Microsoft .NET Framework is a software framework that can be installed on computers running Microsoft Windows operating systems. It includes a large library of coded solutions to common programming problems and a virtual machine that manages the execution of programs written specifically for the framework. The .NET Framework is a key Microsoft offering and is intended to be used by most new applications created for the Windows platform.
The framework's Base Class Library provides a large range of features including user interface, data and data access, database connectivity, cryptography, web application development, numeric algorithms, and network communications. The class library is used by programmers, who combine it with their own code to produce applications. Programs written for the .NET Framework execute in a software environment that manages the program's runtime requirements. Also part of the .NET Framework, this runtime environment is known as the Common Language Runtime (CLR). The CLR provides the appearance of an application virtual machine so that programmers need not consider the capabilities of the specific CPU that will execute the program. The CLR also provides other important services such as security, memory management, and exception handling. The class library and the CLR together constitute the .NET Framework. The principal features of .NET are:

• Interoperability, because interaction between new and older applications is commonly required. The .NET Framework provides means to access functionality that is implemented in programs that execute outside the .NET environment.

• Common runtime engine: the Common Language Runtime (CLR) is the virtual machine component of the .NET Framework. All .NET programs execute under the supervision of the CLR, guaranteeing certain properties and behaviors in the areas of memory management, security, and exception handling.

• Language independence: the .NET Framework introduces a Common Type System (CTS). The CTS specification defines all possible data types and programming constructs supported by the CLR and how they may or may not interact with each other. Because of this feature, the .NET Framework supports the exchange of instances of types between programs written in any of the .NET languages.
• Base Class Library: the Base Class Library (BCL), part of the Framework Class Library (FCL), is a library of functionality available to all languages using the .NET Framework. The BCL provides classes which encapsulate a number of common functions, including file reading and writing, graphic rendering, database interaction, and XML document manipulation.

• Simplified deployment: the .NET Framework includes design features and tools that help manage the installation of computer software, to ensure that it does not interfere with previously installed software and that it conforms to security requirements.

• Security: the design is meant to address some of the vulnerabilities, such as buffer overflows, that have been exploited by malicious software. Additionally, .NET provides a common security model for all applications.

• Portability: the design of the .NET Framework allows it to theoretically be platform agnostic, and thus cross-platform compatible. That is, a program written to use the framework should run without change on any type of system for which the framework is implemented. Microsoft's commercial implementations of the framework cover Windows, Windows CE, and the Xbox 360.

Bibliography

[1] Nagy, G.: Twenty years of document image analysis in PAMI. IEEE Trans. Pattern Anal. Mach. Intell. 22 (2000) 38–62

[2] Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. 34 (2002) 1–47

[3] Appiani, E., Cesarini, F., Colla, A.M., Diligenti, M., Gori, M., Marinai, S., Soda, G.: Automatic document classification and indexing in high-volume applications. IJDAR 4 (2001) 69–83

[4] Diligenti, M., Frasconi, P., Gori, M.: Hidden tree Markov models for document image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (2003)

[5] Byun, Y., Lee, Y.: Form classification using DP matching.
In: SAC '00: Proceedings of the 2000 ACM Symposium on Applied Computing, New York, NY, USA, ACM (2000) 1–4

[6] Shin, C., Doermann, D., Rosenfeld, A.: Classification of document pages using structure-based features. International Journal on Document Analysis and Recognition 3 (2001) 232–247

[7] Hu, J., Kashi, R., Wilfong, G.: Document classification using layout analysis. In: DEXA '99: Proceedings of the 10th International Workshop on Database & Expert Systems Applications, Washington, DC, USA, IEEE Computer Society (1999) 556

[8] Bagdanov, A.D., Worring, M.: Fine-grained document genre classification using first order random graphs. In: Proceedings of the Sixth International Conference on Document Analysis and Recognition (ICDAR 2001) (2001) 79

[9] Chen, N., Blostein, D.: A survey of document image classification: problem statement, classifier architecture and performance evaluation. Int. J. Doc. Anal. Recognit. 10 (2007) 1–16

[10] Trier, Ø.D., Jain, A.K., Taxt, T.: Feature extraction methods for character recognition - a survey. Pattern Recognition 29 (1996) 641–662

[11] Okun, O., Doermann, D., Pietikäinen, M.: Page segmentation and zone classification: the state of the art. Technical Report LAMP-TR-036, CAR-TR-927, CS-TR-4079, University of Maryland, College Park (1999)

[12] Mao, S., Rosenfeld, A., Kanungo, T.: Document structure analysis algorithms: a literature survey. Volume 5010, SPIE (2003) 197–207

[13] Esposito, F., Malerba, D., Lisi, F.A., Ras, W.: Machine learning for intelligent processing of printed documents. Journal of Intelligent Information Systems 14 (2000) 175–198

[14] Maderlechner, G., Suda, P., Brückner, T.: Classification of documents by form and content. Pattern Recogn. Lett. 18 (1997) 1225–1231

[15] Taylor, S., Lipshutz, M., Nilson, R.: Classification and functional decomposition of business documents. Volume 2 (1995) 563–566

[16] Phillips, I.T., Chen, S., Haralick, R.M.: CD-ROM document database standard.
(1995) 198–203

[17] Sauvola, J., Kauniskangas, H.: MediaTeam document database. Website (1999) http://www.mediateam.oulu.fi/MTDB/ (last visit 08/07/09)

[18] Shannon, C.E.: A mathematical theory of communication. The Bell System Technical Journal 27 (1948) 379–423, 623–656

[19] Cover, T.M., Thomas, J.A.: Elements of Information Theory. John Wiley and Sons Inc. (1991)

[20] Yeung, R.W.: A First Course in Information Theory. Springer (2002)

[21] Gatos, B., Papamarkos, N., Chamzas, C.: Skew detection and text line position determination in digitized documents. Pattern Recognition 30 (1997) 1505–1519

[22] Gonzalez, R.C., Woods, R.E.: Digital Image Processing. Prentice Hall, Upper Saddle River (NJ), USA (2002)

[23] Freixenet, J., Muñoz, X., Raba, D., Martí, J., Cufí, X.: Yet another survey on image segmentation: region and boundary information integration. In: European Conference on Computer Vision, Copenhagen, Denmark (2002) 408–422

[24] Sahoo, P.K., Soltani, S., Wong, A.K., Chen, Y.C.: A survey of thresholding techniques. Comput. Vision Graph. Image Process. 41 (1988) 233–260

[25] Sezgin, M., Sankur, B.: Survey over image thresholding techniques and quantitative performance evaluation. Journal of Electronic Imaging 13 (2004) 146–168

[26] Hartigan, J.A., Wong, M.A.: A k-means clustering algorithm. Applied Statistics 28 (1979) 100–108

[27] Lavallee, S.: Registration for computer-integrated surgery: methodology, state of the art. Computer Integrated Surgery: Technology and Clinical Applications (1995) 77–97, MIT Press, Cambridge, Massachusetts

[28] Lehmann, T.M., Gonner, C., Spitzer, K.: Survey: interpolation methods in medical image processing. IEEE Transactions on Medical Imaging 18 (1999) 1049–1074

[29] Unser, M.: Splines: a perfect fit for signal and image processing. IEEE Signal Processing Magazine 16 (1999) 22–38

[30] Collignon, A., Vandermeulen, D., Suetens, P., Marchal, G.: Automated multi-modality image registration based on information theory.
Computational Imaging and Vision 3 (1995) 263–274

[31] Press, W., Teukolsky, S., Vetterling, W., Flannery, B.: Numerical Recipes in C. Cambridge University Press (1992)

[32] Maes, F., Vandermeulen, D., Marchal, G., Suetens, P.: Multimodality image registration by maximization of mutual information. In: IEEE Proceedings of the Workshop on Mathematical Methods in Biomedical Image Analysis (1996) 187–198

[33] Studholme, C.: Measures of 3D Medical Image Alignment. PhD thesis, Computational Imaging Science Group, Division of Radiological Sciences, United Medical and Dental Schools of Guy's and St Thomas's Hospitals (1997)

[34] Tsao, J.: Interpolation artifacts in multimodal image registration based on maximization of mutual information. IEEE Transactions on Medical Imaging 22 (2003) 854–864

[35] Nokia Corporation: Qt software. Website http://www.qtsoftware.com/products (last visit 08/07/09)

[36] Sun Microsystems: MySQL. Website http://www.mysql.com/ (last visit 08/07/09)

[37] Bradski, G., Kaehler, A.: Learning OpenCV: Computer Vision with the OpenCV Library. O'Reilly, Cambridge, MA (2008)

[38] Microsoft Corporation: Microsoft .NET Framework. Website http://www.microsoft.com/NET/ (last visit 08/07/09)