Document Classification

Màster en Informàtica Industrial i Automàtica
Universitat de Girona
Document Classification
Master Thesis
Marius Vila Durán
Directors:
Dr. Mateu Sbert Cassassayas
Dr. Miquel Feixas Feixas
July 2009
Contents

1 Introduction
   1.1 Framework
   1.2 Main Research Lines in Document Classification
   1.3 Objectives
   1.4 Methodology
   1.5 Document Outline

2 State of the Art and Background
   2.1 Introduction
   2.2 Document Classification
       2.2.1 Problem Statement
       2.2.2 Classifier Architecture
       2.2.3 Performance Evaluation
   2.3 Information Theory
   2.4 Image Preprocessing Techniques
       2.4.1 Skew
       2.4.2 Image Segmentation
   2.5 Image Registration
       2.5.1 Image Registration Pipeline
       2.5.2 Similarity Metrics
       2.5.3 Challenges in Image Registration

3 Classification Framework
   3.1 Problem Statement
   3.2 Classifier Architecture
   3.3 Document Database
   3.4 Document Filtering
       3.4.1 Size Filter
       3.4.2 K-means Filter
       3.4.3 NMI Filter
   3.5 Interface
       3.5.1 “Arxiu” Menu
       3.5.2 “Finestra” Menu
       3.5.3 “Base de dades” Menu
       3.5.4 “Pre-procés” Menu
       3.5.5 “Registre” Menu

4 Image Preprocessing and Segmentation
   4.1 Introduction
   4.2 Image Preparation Methods
       4.2.1 Cropping
       4.2.2 Check Position
       4.2.3 Skew
       4.2.4 Image Rotation and Scaling
   4.3 Feature Extraction Methods
       4.3.1 Logo Extraction
       4.3.2 Color and Gray-scale Detection
       4.3.3 K-means

5 Image Registration
   5.1 Introduction
   5.2 Normalized Mutual Information
   5.3 Logo Registration
   5.4 Performance Evaluation
       5.4.1 NMI Registration Test
       5.4.2 Logo Registration Test

6 Conclusions and Future Work
   6.1 Conclusions
   6.2 Future Work

A Qt
B MySQL
C OpenCV
D .NET
List of Figures

1.1 Document taxonomy defined by Nagy [1].
1.2 Basic pipeline of the master thesis.
2.1 Three components of a document classifier.
2.2 Three possible partitions of the document space.
2.3 A typical sequence of document recognition for mostly-text document images.
2.4 Five categories of structured documents and their recommended feature representation.
2.5 Binary entropy.
2.6 Venn diagram of a discrete channel.
2.7 The image before and after image smoothing.
2.8 Image windows of size Xwin × Ywin and region [−L, L].
2.9 Correlation matrix of the vertical line d1 towards d2.
2.10 Skew detection using the correlation matrix of Fig. 2.9.
2.11 The main components of the registration framework.
3.1 Three components of our document classifier.
3.2 Some documents that make up our document space.
3.3 Examples of document classes.
3.4 Classification algorithm.
3.5 Database.
3.6 Example of the distance calculation between the colors of two images.
3.7 Design of the interface with Qt Designer.
3.8 Main screen of the application.
3.9 “Arxiu” menu.
3.10 “Finestra” menu.
3.11 Cascade mode.
3.12 Mosaic mode.
3.13 “Base de dades” menu.
3.14 Configuration database menu.
3.15 “Pre-procés” menu.
3.16 Preprocess configuration menu.
3.17 “Registre” menu.
3.18 Registration configuration methods menu.
3.19 Result returned by the application.
4.1 a) Scanned image in a wrong position. b) Image with a skew error. c) Generally, scanners generate A4 images although a smaller document is scanned; in this concrete case an adjustment is necessary. This problem also appears in cases a) and b).
4.2 First idea to implement the white zone elimination.
4.3 The white zone elimination problem with a black and white image.
4.4 Solution for the white zone elimination problem.
4.5 The result of applying the cropping method to a ticket.
4.6 The result of applying the cropping method to a receipt.
4.7 a) Original image, b) image with horizontal solid lines, and c) image with vertical solid lines.
4.8 The result of applying the check position method to a receipt in wrong position.
4.9 The result of applying the check position method to a ticket in wrong position.
4.10 The result of applying the check position method to a receipt in correct position.
4.11 The result of applying the skew method to a ticket.
4.12 The result of applying the skew method to a ticket.
4.13 The result of applying the skew method to a ticket.
4.14 The result of applying the scale method to an invoice.
4.15 The result of applying the scale method to a receipt.
4.16 Logo extraction of a ticket.
4.17 Logo extraction of an invoice.
4.18 K-means algorithm diagram.
4.19 K-means algorithm example. 1) k initial “means” (in this case k = 3) are randomly selected from the data set (shown in color). 2) k clusters are created by associating every observation with the nearest mean. The partitions here represent the Voronoi diagram generated by the means. 3) The centroid of each of the k clusters becomes the new mean. 4) Steps 2 and 3 are repeated until convergence has been reached.
4.20 The result of applying the k-means algorithm to an invoice. a) Original image. b) Result of applying the k-means algorithm for k = 2, i.e., 2 groups/colors. c) Result of applying the k-means algorithm for k = 3, i.e., 3 groups/colors.
4.21 The result of applying the k-means algorithm to an invoice. a) Original image. b) Result of applying the k-means algorithm for k = 3, i.e., 3 groups/colors. c) Result of applying the k-means algorithm for k = 6, i.e., 6 groups/colors.
5.1 Scaled image used in normalized mutual information registration.
5.2 Superposition of two registered images using normalized mutual information.
5.3 Graphical summary of the logo registration method. We want to classify an input image and we have three possible candidates. In this case, only the last reference image can obtain a good similarity value (the third one in the right column).
5.4 Superposition of two registered images using the logo registration method.
6.1 The three parts of a ticket.
6.2 NMI registration of tickets after removing the variable parts.
A.1 Qt Designer Interface.

List of Tables

3.1 Comparison of the results of calculating one hundred distances between color images using backtracking and our approach.
5.1 Results summary of the NMI registration method.
5.2 Results summary of the logo registration method.
Chapter 1
Introduction
In this chapter, we present the framework of this master thesis, the main research
lines in document classification, the objectives of this master thesis, and the methodology used to develop the application. Finally, we show the document outline.
1.1 Framework
This master thesis takes place within the framework of a project between the research group Gilab (Graphics and Imaging Laboratory) of the University of Girona
and the GIDOC INTEGRAL company. GIDOC INTEGRAL is a company specialized in document management.
The main objective of the project is to develop a series of algorithms with the
aim of creating a system that allows automatic document classification with minimal human intervention. Thus, this master thesis is based on research work which will be applied to solve the needs of a company in the field of document
classification.
For confidentiality reasons, the information of some document images has been
removed or distorted.
1.2 Main Research Lines in Document Classification
Basically, document classification can follow two main lines differentiated by the
use or non-use of the text content of the document. We use the following terminology to refer to these two possibilities:
• Document image classification: Assign a single-page document image to
one of a set of predefined document classes.
• Text classification: Assign a text document (ASCII, HTML/XML, . . . ) to
one of a set of predefined document classes. Text classification techniques
can be applied as part of document image classification, using OCR results
extracted from the document image. Sebastiani [2] provides a comprehensive survey of text categorization, which is an active research area in information retrieval.
Classification can also be based on various features, such as image-level features, structural features, or textual features (e.g., word frequency or word histogram).
Using the document taxonomy (see Figure 1.1) defined by Nagy [1], documents can be divided into two basic groups: mostly-text documents and mostly-graphics documents. Mostly-text documents include business letters, forms, newspapers, technical reports, proceedings, journal papers, etc. These are in contrast to mostly-graphics documents such as engineering drawings, diagrams, and sheet music.
Figure 1.1: Document taxonomy defined by Nagy [1].
Nagy’s characterization of documents focuses on document format: mostly-graphics or mostly-text, handwritten or typeset, etc. Another way of characterizing documents is by the application domain, such as documents related to income tax or documents from insurance companies. Many classifiers use document spaces that are restricted to a single application domain. On the other hand, other classifiers use document spaces that span several application domains. We present here a summary of the document spaces of selected classifiers characterized by application domain.
• A single-domain document space
Bank documents [3].
Business letters or reports.
Invoices [3] [4].
Business forms [5].
Forms in banking applications.
Tax forms [6].
Documents from insurance companies.
Book pages.
Journal pages.
• A multiple-domain document space
Articles, advertisements, dictionaries, forms, manuals, etc.
Journal pages, business letters, and magazines [7].
Bills, tax forms, journals, and mail pieces.
Journal papers and tax forms [6].
Business letters, memoranda, and documents from other domains.
Bagdanov and Worring characterize document classification at two levels of
detail, coarse-grained and fine-grained [8]. A coarse-grained classification is used to classify documents with distinctly different features, such as business letters versus technical articles, whereas a fine-grained classification is used to classify documents with similar features, such as business letters from different senders, or journal title pages from various journals.
1.3 Objectives
The main goal of this master thesis is the recognition and classification of different documents within a database. The fundamental pillars to achieve this basic
objective are the image preprocessing techniques and, in particular, the image registration measures. Another important objective has been to present the state of the
art of document classification.
Figure 1.2: Basic pipeline of the master thesis.
We consider that a document is an image whose text content is not identified as such. Given the image of a document, the basic objective is to identify this image within a previously created database containing the documents of a company. In this project, we focus our attention on a database of invoices, receipts, and tickets. If we cannot identify a document within the database, we can make the following two interpretations:
• An error has occurred and it must be treated properly.
• The document that we want to identify has not yet been entered into the
database.
In the context of a company, scanned images or imported electronic documents can come from different supplier companies, customers, marketing, etc., or from different departments of the company itself. Today, a great deal of time is spent on the document classification and indexing process when a document is incorporated into a document management system or when a data extraction system is fed (supplier invoices, supplier delivery notes, opinion polls, information requests, task demands, etc.). One of the most important components of a massive information capture is the preprocessing stage, where data are extracted from scanned or imported documents for their later classification.
There are two major ways to classify electronic documents. The first is classification according to the form of their components; in this project, we focus on this line. Although there are many publications on registration, segmentation, and image classification, there are few specific works on the recognition of document typologies, although this is an essential task for document management in large companies (invoices, receipts, documents of diverse typologies, etc.). The second possibility is to identify the text of the documents with OCR tools. This option is important for the classification of completely open documents, where we cannot assume any particular form or labeling.
It is important to consider the speed and reliability of the techniques used. To this end, it has been necessary to apply techniques for pruning the search tree over the database documents. This is one of the key parts of the project, and it has required the investigation of descriptors that allow an efficient pruning to be performed. Finally, the use of a hierarchy of images at different resolutions has also been investigated. This study of image behavior at different resolutions is also a key element in the development of the project.
1.4 Methodology
The methodology used to carry out the application implementation is based on
Extreme Programming.
Extreme Programming (also known as XP) is a software engineering methodology that leads to a development process that is more responsive to customer needs than traditional methods and allows higher-quality software to be created, since it accepts changes in the requirements on the part of the user. The defenders of this methodology consider this a normal and desirable aspect of software development projects.
Thinking about requirements adaptation at any point of the project is more realistic than attempting to define all requirements at the beginning of the project and then having to lose too much time making adjustments to the initial requirements when plans change.
The fundamental features of Extreme Programming are:
• Iterative and incremental development.
• Continuous unit testing.
• Pair programming.
• Frequent interaction between the programming team and the client or user.
• Correction of all errors before adding new functionality, as well as frequent deliveries.
• Rewriting parts of the code in order to increase its readability, without changing its behavior.
• Simplicity in the code.
We tried to follow all these guidelines, except pair programming, since the master thesis is an individual project.
In conclusion, Extreme Programming allows us to develop software that is easy to adapt in case of changes or additions.
Algorithms have been implemented using C++ and have been translated to
.NET (see Appendix D). We have also used different tools, such as Qt, MySQL,
and OpenCV (see Appendixes A, B, and C, respectively).
1.5 Document Outline
This master thesis is organized into six chapters. The first two chapters are introductory and deal with previous work. The next three chapters focus on our document classifier and the methods of preprocessing, segmentation, and registration used by our application. Finally, a concluding chapter is presented. In more detail:
• Chapter 2: State of the Art and Background
In this chapter, we present the state of the art of document classification
and the main tools needed to develop this master thesis, such as information
theory, image preprocessing techniques, and image registration.
• Chapter 3: Classification Framework
In this chapter, we present a description of our document image classification framework based on the information collected in the state of the art.
Specifically, we introduce the problem statement, the classifier architecture,
the document database, the document filtering, and the application interface.
• Chapter 4: Image Preprocessing and Segmentation
In this chapter, several methods for preprocessing (cropping, skew, and check position) and segmentation (logo extraction, k-means, and color and gray-scale detection) are presented.
• Chapter 5: Image Registration
In this chapter, two registration methods (NMI registration and logo registration) and their performance evaluation are presented.
• Chapter 6: Conclusions and Future Work
In this chapter, the conclusions of this master thesis are presented, as well as
some indications about our current and future research.
Moreover, four appendixes describe the four basic tools (Qt, MySQL, OpenCV, and .NET) used to develop the application described in this master thesis.
Chapter 2
State of the Art and Background
In this chapter, we present the state of the art of document classification and the
main tools needed to develop this master thesis, such as information theory, image
preprocessing techniques, and image registration.
2.1 Introduction
Document image classification is an important task in document image processing
and it is used in the following applications [9]:
• Office automation. Document classification allows the automatic distribution or archiving of documents. For example, after classification of business
letters according to the sender and message type (such as order, offer, or
inquiry), the letters are sent to the appropriate departments for processing.
• Digital libraries. Document classification improves the indexing efficiency
in digital library construction. For example, the classification of documents
into table of contents page or title page can narrow the set of pages from
which to extract specific meta-data, such as the title or table of contents of a
book.
• Image retrieval. Document classification plays an important role in document image retrieval. For example, consider a document image database
containing a large heterogeneous collection of document images. Users have
many retrieval demands, such as retrieval of papers from one specific journal, or retrieval of document pages containing tables or graphics. Classification of documents based on visual similarity helps to limit the search and
improves retrieval efficiency and accuracy.
• Other document image analysis applications. Document classification facilitates higher-level document analysis. Due to the complexity of document understanding, most high-level document analysis systems rely on domain-dependent knowledge to obtain high accuracy. Many available information extraction systems are specially designed for a specific type of document, such as forms processing or postal address processing, to achieve high speed and performance. To process a broad range of documents, it is necessary to classify the documents first, so that a suitable document analysis system for each specific document type can be adopted.
2.2 Document Classification
There is a great diversity in document classifiers. Classifiers solve a variety of
document classification problems, differ in how they use training data to construct
models of document classes, and differ in their choice of document features and
recognition algorithms. Chen [9] surveys this diverse literature using three components: the problem statement, the classifier architecture, and the performance
evaluation. These components are illustrated in Figure 2.1.
The problem statement (see Section 2.2.1) for a document classifier defines the
problem being solved by the classifier. It consists of two aspects: the document
space and the set of document classes. The document space defines the range of
input document samples. The training samples and the test samples are drawn
from the document space. The set of document classes defines the possible outputs
produced by the classifier and is used to label document samples. Most surveyed
classifiers use manually defined document classes, with class definitions based on
similarity of contents, form, or style. The problem statement corresponding to this
master thesis is discussed further in Section 3.1.
The classifier architecture (see Section 2.2.2) includes four aspects: document
features and recognition stages, feature representations, class models and classification algorithms, and learning mechanisms.
Performance evaluation (see Section 2.2.3) is used to measure the performance
of a classifier, and to permit performance comparisons between classifiers. The diversity among document classifiers makes performance comparisons difficult. Issues in performance evaluation include the need for standard data sets, standardized
performance metrics, and the difficulty of separating the classifier performance
from the pre-processor performance.
2.2.1 Problem Statement
The problem statement for a document classifier has two aspects: the document
space and the set of document classes. The former defines the range of input documents, and the latter defines the output that the classifier can produce.
Document Space
The document space is the set of documents that a classifier is expected to handle.
The labeled training samples and test samples are all drawn from this document
space. The training samples are assumed to be representative of the defined set of classes. The document space may include documents that should be rejected, because they do not lie within any document class. In this case, the training samples might consist of positive samples only, or they might consist of a mixture of positive and negative samples.

Figure 2.1: Three components of a document classifier.
Set of Document Classes
The set of document classes defines how the document space is partitioned. The
name of a document class is the output produced by the classifier. Several possible partitions of document space are shown in Figure 2.2. A set of document
classes may uniquely separate the document space (see Figure 2.2.a), with a single class label assigned to a document. If the document space is larger than the
union of the document classes (see Figure 2.2.b), the classifier is expected to reject all documents that do not belong to any document class. Fuzziness may exist
in the definition of document classes (see Figure 2.2.c), with multiple class labels
assigned to a document.
Figure 2.2: Three possible partitions of the document space.
A document class (also called document type or document genre) is defined
as a set of documents characterized by the similarity of expressions, style, form
or contents. This definition states that various criteria can be used for defining
document classes. Document classes can be defined based on similarity of contents. For example, consider pages in conference papers, with classes consisting
of “pages with experimental results”, “pages with conclusions”, “pages with description of a method”. Alternatively, document classes can be defined based on
similarity of form and style (also called visual similarity), such as page layout, use
of figures, or choice of fonts.
2.2.2 Classifier Architecture
Chen [9] uses the following four aspects to characterize the classifier architecture:
1. Document features and recognition stage.
2. Feature representations.
3. Class models and classification algorithms.
4. Learning mechanisms.
These aspects are interrelated: design decisions made regarding one aspect influence the design of the other aspects. For example, if document features are
represented in fixed-length feature vectors, then statistical models and classification algorithms are usually considered.
Document Features and Recognition Stage
Choice of document features is an important step in classifier design. Relevant surveys about document features include the following. Commonly used features in
OCR are surveyed in [10]. A set of commonly used features for page segmentation
and document zone classification are given in [11]. Structural features produced in
physical and logical layout analysis are surveyed in [12, 1].
The majority of features are extracted from black and white document images.
The gray-scale or color images (e.g., advertisements and magazine articles) are
binarized. Unavoidably, for certain documents, the binarization process removes
essential discriminative information. More research should be devoted to the use of
features extracted directly from gray-scale or color images to classify documents.
Before discussing the choice of document features further, we first consider the
document recognition stage at which classification is performed.
Document Recognition Stages
Document classification can be performed at various stages of document processing. The choice of document features is constrained by the document recognition stage at which document classification is performed.
Figure 2.3 shows a typical sequence of document recognition for mostly-text
document images, where:
• Block segmentation and classification identify rectangular blocks (or zones)
enclosing homogeneous content portions, such as text, table, figure, and halftone image.
• Physical layout analysis (also called structural layout analysis or geometric
layout analysis) extracts layout structure: a hierarchical description of the
objects in a document image, based on the geometric arrangements in the
image. For example, WISDOM++, an intelligent document processing system that can transform paper documents into XML format, uses six levels of layout hierarchy: basic blocks, lines, sets of lines, frame 1, frame 2, and page [13].
• Logical layout analysis (also called logical labeling) extracts the logical
structure: a hierarchy of logical objects, based on the human-perceptible
meaning of the document contents. For example, the logical structure of a
journal page is a hierarchy of logical objects, such as title, authors, abstract,
and sections [12].
Figure 2.3: A typical sequence of document recognition for mostly-text document
images.
Document classification can be performed at various recognition stages. The
choice of this recognition stage depends on the goal of document classification and
the type of documents.
Choice of Document Features
Chen [9] characterizes document features using three categories: image features, structural features, and textual features, where:
• Image features are either extracted directly from the image (e.g., the density
of black pixels in a region) or extracted from a segmented image (e.g., the
number of horizontal lines in a segmented block). Image features extracted
at the level of a whole image are called global image features; image features
extracted from the regions of an image are called local image features.
• Structural features (e.g., relationships between objects in the page) are obtained from physical or logical layout analysis.
• Textual features (e.g., presence of keywords) may be computed from OCR
output or directly from document images.
Some classifiers use only image features, only structural features, or only textual features; others use a combination of features from several groups.
The classifiers that use only image features are fast since they can be implemented before document layout analysis. But they may be limited to providing
coarse classification, since image features alone do not capture characteristic structural information. More elaborate methods are needed to verify the classification result.
Shin et al. [6] measure document image features directly from the unsegmented
bitmap image. The document features include density of content area, statistics
of features of connected components, column/row gaps, and relative point sizes
of fonts. These features are measured in four types of windows: cell windows,
horizontal strip windows, vertical strip windows, and the page window.
Most systems use a combination of physical layout features and local image
features. This provides a good characterization of structured images. The classification is done before logical labeling, allowing the classification results to be used
to tailor logical labeling, that is, we could use physical layout features to classify
the document, and then adapt the logical labeling phase to the document class.
Document classification using logical structural features is expensive since it
needs a domain-specific logical model for each type of document.
Classification using textual features is closely related to text categorization in
information retrieval. Purely textual measures, such as frequency and weights of
keywords or index terms, can be used on their own, or in combination with image
features. Textual features may be extracted from OCR results which may be noisy.
Alternatively, textual features may be extracted directly from document images [6].
Feature Representations
Document features extracted from each sample document in a classifier can be
represented in various ways, such as a flat representation (fixed-length vector or
string), a structural representation, or a knowledge base. Document features that
do not provide structural information are usually represented in fixed-length feature
vectors. Features that provide structural information are represented in various formats, such as a tree [3, 4], a list, or a graph (see Figure 2.4).
Diligenti et al. [4] claim that a flat representation does not carry robust information about the position and the number of basic constituents of the image, whereas
a recursive representation preserves relationships among the image constituents.
Chen et al. [9] show a table proposed by Watanabe (see Figure 2.4) where structured documents are categorized into five groups and each category is related to a recommended feature representation.
Watanabe also gives the following guideline for the selection of a feature representation: “The simpler, the better”. If the document can be represented using
a list, then use a list, because of its higher processing efficiency and easier knowledge definition and management. Similarly, a tree representation is better than a graph
representation due to its relative simplicity.
The choice of a feature representation is also constrained by the kind of class
model and classification algorithm that is used.
Figure 2.4: Five categories of structured documents and their recommended feature
representation.
Class Models and Classification Algorithms
Class models define the characteristics of the document classes. The class models can take various forms, including grammars, rules, and decision trees. The
class models are trained using features extracted from the training samples. They
are either manually built by a person or automatically built using machine learning techniques. Class models and classification algorithms are tightly coupled. A
class model and classification algorithm must allow for noise or uncertainty in the
matching process. Traditional statistical and structural pattern classification techniques that have been applied to document classification are reviewed by Chen et
al. [9] as follows.
• Statistical pattern classification techniques: There are many traditional statistical pattern classification techniques, such as nearest neighbor, decision trees, and neural networks. These techniques are relatively mature, and there are libraries and classification toolboxes implementing them. Traditional statistical classifiers represent each document instance with a fixed-length feature vector. This makes it difficult to capture much of the layout structure of document images. Therefore, these techniques are less suitable for fine-grained document classification (a minimal nearest-neighbor sketch is given after this list).
• Structural pattern classification techniques: These techniques have higher
computational complexity than statistical pattern recognition techniques. Also,
machine learning techniques for creating class models based on structural
representations are not yet standard. Many authors provide their own methods for training class models [3, 4].
• Knowledge-based document classification techniques: A knowledge-based
document classification technique uses a set of rules or a hierarchy of frames
encoding expert knowledge on how to classify documents into a given set of
classes.
The knowledge base can be constructed manually or automatically. Manually built knowledge-based systems only perform what they were programmed to do.
Significant efforts are required to acquire knowledge from domain experts
and to maintain and update the knowledge base. Moreover, it is not easy to
adapt the system to a different domain [2].
Recently developed knowledge-based systems learn rules automatically from labeled training samples [13].
• Template matching: Template matching is used to match an input document with one or more prototypes of each class. This technique is most
commonly applied in cases where document images have fixed geometric
configurations, such as forms. Matching an input form with each of a few
hundred templates is time consuming. Computational cost can be reduced by
hierarchical template matching. Byun and Lee [5] propose a partial matching
method, in which only some areas of the input form are considered. Template matching has also been applied to broad classification tasks, with documents from various application domains such as business letters, reports,
and technical papers. The template for each class is defined by one user-provided input document, and the template does not describe the structural variability within the class. Therefore, the template is only suitable for coarse classification.
• Combination of multiple classifiers: Multiple classifiers may be combined
to improve classification performance.
• Multi-stage classification: A document classifier can perform classification in multiple stages, first classifying documents into a small number of
coarse-grained classes, and then refining this classification. Maderlechner
et al. [14] implement a two-stage classifier, where the first stage classifies
documents as either journal articles or business letters, based on physical
layout information. The second stage further classifies business letters into
16 application categories according to content information from OCR.
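As an illustration of the first group of techniques above, the following minimal C++ sketch classifies a document represented by a fixed-length feature vector using the 1-nearest-neighbor rule. It is our own illustration of the representation described by the survey, not code from any surveyed system; the structure and function names are hypothetical.

```cpp
#include <limits>
#include <string>
#include <vector>

// A labeled training sample: a fixed-length feature vector and its class label.
struct Sample {
    std::vector<double> features;
    std::string label;
};

// Squared Euclidean distance between two feature vectors of equal length.
static double squaredDistance(const std::vector<double>& a,
                              const std::vector<double>& b) {
    double d = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        d += (a[i] - b[i]) * (a[i] - b[i]);
    return d;
}

// 1-nearest-neighbor rule: return the label of the training sample
// closest to the query vector.
std::string classifyNearestNeighbor(const std::vector<Sample>& training,
                                    const std::vector<double>& query) {
    double bestDistance = std::numeric_limits<double>::max();
    std::string bestLabel;
    for (const Sample& s : training) {
        double d = squaredDistance(s.features, query);
        if (d < bestDistance) {
            bestDistance = d;
            bestLabel = s.label;
        }
    }
    return bestLabel;
}
```

The fixed-length representation is exactly what makes such classifiers simple and fast, and also what limits them: positional and structural relations between page objects cannot be encoded in the vector, which is why they are considered less suitable for fine-grained classification.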
Learning Mechanisms
A learning mechanism provides an automated way for a classifier to construct
or tune class models, based on observation of training samples. Hand coding of
class models is most feasible in applications that use a small number of document
classes, with document features that are easily generalized by a system designer.
For example, Taylor et al. [15] manually construct a set of rules to identify functional components in a document and learn the frequency of those components
from training data. However, manual creation of entire class models is difficult in
applications involving a large number of document classes, especially when users
are allowed to define document classes. With a learning mechanism, the classifier
can adapt to changing conditions, by updating class models or adding new document classes.
2.2.3 Performance Evaluation
Performance evaluation is a critically important component of a document classifier. It involves challenging issues, including difficulties in defining standard
datasets and standardized performance metrics, the difficulty of comparing multiple document classifiers, and the difficulty of separating classifier performance
from preprocessor performance.
Performance evaluation includes metrics for evaluating a single classifier and metrics for comparing multiple classifiers. Most classification systems measure the effectiveness of the classifiers, that is, the ability to make the right classification decisions. Various performance metrics are used to evaluate classification effectiveness, including accuracy [3], correct rate, recognition rate [5], error rate [15], false rate [13], reject rate, recall, and precision.
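For reference, the sketch below shows how three of the most common of these metrics can be computed for a single class from raw decision counts. The struct and its field names are ours, added for illustration only.

```cpp
// Decision counts for one document class (names are ours, for illustration).
struct ClassCounts {
    int truePositives;   // samples of the class, correctly accepted
    int falsePositives;  // samples of other classes, wrongly accepted
    int falseNegatives;  // samples of the class, wrongly rejected
    int trueNegatives;   // samples of other classes, correctly rejected
};

// Fraction of all decisions that were correct.
double accuracy(const ClassCounts& c) {
    int total = c.truePositives + c.falsePositives
              + c.falseNegatives + c.trueNegatives;
    return total > 0 ? double(c.truePositives + c.trueNegatives) / total : 0.0;
}

// Fraction of accepted samples that really belong to the class.
double precision(const ClassCounts& c) {
    int accepted = c.truePositives + c.falsePositives;
    return accepted > 0 ? double(c.truePositives) / accepted : 0.0;
}

// Fraction of the class's samples that were accepted.
double recall(const ClassCounts& c) {
    int relevant = c.truePositives + c.falseNegatives;
    return relevant > 0 ? double(c.truePositives) / relevant : 0.0;
}
```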
The significance of the reported effectiveness is not entirely standard,
since some classifiers have reject ability while others do not, and some classifiers
output a ranked list of results [3, 7], while others produce a single result. Standard
performance metrics are necessary to evaluate performance.
Document classifiers are often difficult to compare because they are solving
different classification problems, drawing documents from different input spaces,
and using different sets of classes as possible outputs. For example, it is difficult to
compare a classifier that deals with fixed-layout documents (forms or table-forms)
to one that classifies documents with variable layouts (newspapers or articles). Another complication is that the number of document classes varies widely. The classifiers use as few as 3 classes [13] to as many as 500 classes, and various criteria
are used to define these classes. Also many researchers collect their own data sets
for training and testing their document classifiers. These data sets are of varying size, ranging from a few dozen [5, 7], or a few hundred [3], to thousands of
document instances. The sizes of the training set and the test set affect the classifier performance. These factors make it very difficult to compare the performance of document
classifiers. Some authors lead in the right direction by making data available online. In this way other authors can use the data provided and add their own data to
test their classification system.
To compare the performance of two classifiers, a standard data set providing
ground-truth information should be used to train and test the classifiers. The University of Washington document image database is one source of ground truth data
for document image analysis and understanding research [16]. Some authors conclude that UW data is far from optimal for document classification, since it has a
small number of documents from a relatively large number of classes.
Finland’s MTDB Oulu Document Database defines 19 document classes and provides ground-truth information for document recognition [17]. The number of documents per class ranges from less than ten up to several hundred. The documents in this database are diverse, and assigned to pre-defined document classes,
making this database a useful starting point for research into document classification.
It is difficult to separate the classifier performance from the preprocessor performance. The performance of a classifier depends on the quality of document
processing performed prior to classification. For example, classification based on
layout-analysis results is affected by the quality of the layout analysis, by the number of split and merged blocks. Similarly, OCR errors affect classification based
on textual features. In order to compare classifier performance, it is important to
use standardized document processing prior to the classification step. One method
of achieving this is through use of a standard document database that includes
not only labeled document images, but also includes sample results from intermediate stages of document recognition. This would allow document classifiers
to be tested under the same conditions, classifying documents based on the same
document-recognition results. Construction of such databases is a difficult and
time-consuming task.
2.3 Information Theory
In 1948, Claude Shannon published “A mathematical theory of communication”
[18] which marks the beginning of information theory. In this paper, he defined
measures such as entropy and mutual information, and introduced the fundamental
laws of data compression and transmission.
In this section, we present some basic measures of information theory. Good
references are the texts by Cover and Thomas [19], and Yeung [20].
Entropy
The Shannon entropy is the classical measure of information, where information is
simply the outcome of a selection among a finite number of possibilities. Entropy
also measures uncertainty or ignorance.
The Shannon entropy H(X) of a discrete random variable X with values in the
set X = {x1, x2, . . . , xn} is defined as

\[
H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x), \qquad (2.1)
\]
where p(x) = Pr[X = x], the logarithms are taken in base 2 (entropy is
expressed in bits), and we use the convention that 0 log 0 = 0, which is justified
by continuity. We can use interchangeably the notation H(X) or H(p) for the
entropy, where p is the probability distribution {p1 , p2 , . . . , pn }. As − log p(x)
represents the information associated with the result x, the entropy gives us the
average information or uncertainty of a random variable. Information and uncertainty are opposite. Uncertainty is considered before the event, information after.
So, information reduces uncertainty. Note that the entropy depends only on the
probabilities.
Some relevant properties [18] of the entropy are:
1. 0 ≤ H(X) ≤ log n
• H(X) = 0 if and only if all the probabilities except one are zero, this
one having the unit value, i.e., when we are certain of the outcome.
• H(X) = log n when all the probabilities are equal. This is the most
uncertain situation.
2. If we equalize the probabilities, entropy increases.
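To make Equation 2.1 concrete, the following C++ sketch computes the entropy of a discrete distribution in bits. It is our own illustration, not code from the thesis application, and the function name is ours.

```cpp
#include <cmath>
#include <vector>

// Shannon entropy H(p) in bits of a discrete probability distribution p.
// Terms with p_i = 0 are skipped, following the convention 0 log 0 = 0.
double shannonEntropy(const std::vector<double>& p) {
    double h = 0.0;
    for (double pi : p)
        if (pi > 0.0)
            h -= pi * std::log2(pi);
    return h;
}
```

For the uniform distribution {0.25, 0.25, 0.25, 0.25}, the function returns log 4 = 2 bits, the maximum for n = 4, in agreement with property 1, while a distribution such as {1, 0, 0, 0} gives 0 bits.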
When n = 2, the binary entropy (Figure 2.5) is given by

\[
H(X) = -p \log p - (1-p) \log (1-p), \qquad (2.2)
\]

where the variable X is defined by

\[
X = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1-p. \end{cases}
\]

Figure 2.5: Binary entropy.
If we consider another random variable Y with probability distribution p(y) corresponding to values in the set Y = {y1, y2, . . . , ym}, the joint entropy of X and Y is defined as

\[
H(X,Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(x,y), \qquad (2.3)
\]

where p(x, y) = Pr[X = x, Y = y] is the joint probability.
Also, the conditional entropy is defined as

\[
H(X|Y) = -\sum_{y \in \mathcal{Y}} \sum_{x \in \mathcal{X}} p(x,y) \log p(x|y), \qquad (2.4)
\]

where p(x|y) = Pr[X = x|Y = y] is the conditional probability.
The Bayes theorem expresses the relation between the different probabilities:

\[
p(x,y) = p(x)\,p(y|x) = p(y)\,p(x|y). \qquad (2.5)
\]
If X and Y are independent, then p(x, y) = p(x)p(y).
The conditional entropy can be thought of in terms of a channel whose input
is the random variable X and whose output is the random variable Y . H(X|Y )
corresponds to the uncertainty in the channel input from the receiver’s point of
view, and vice versa for H(Y|X). Note that in general H(X|Y) ≠ H(Y|X).
The following properties are also met:
1. H(X, Y ) ≤ H(X) + H(Y )
2. H(X, Y ) = H(X) + H(Y |X) = H(Y ) + H(X|Y )
3. H(X) ≥ H(X|Y ) ≥ 0
Mutual Information
The mutual information between two random variables X and Y is defined as

\[
\begin{aligned}
I(X,Y) &= H(X) - H(X|Y) = H(Y) - H(Y|X) \\
       &= -\sum_{x \in \mathcal{X}} p(x) \log p(x) + \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(x|y) \\
       &= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}. \qquad (2.6)
\end{aligned}
\]
Mutual information represents the amount of information that one random variable, the output of the channel, gives (or contains) about a second random variable,
the input of the channel, and vice versa, i.e., how much the knowledge of X decreases the uncertainty of Y and vice versa. Therefore, I(X, Y ) is a measure of
the shared information between X and Y .
Mutual information I(X, Y) has the following properties:
1. I(X, Y) ≥ 0, with equality if, and only if, X and Y are independent.
2. I(X, Y) = I(Y, X)
3. I(X, Y) = H(X) + H(Y) − H(X, Y)
4. I(X, Y) ≤ H(X)
The relationship between all the above measures can be expressed by a Venn diagram, as shown in Figure 2.6.

Figure 2.6: Venn diagram of a discrete channel.
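As a companion to these properties, the sketch below computes I(X, Y) from a joint probability table using property 3, I(X, Y) = H(X) + H(Y) − H(X, Y). It is again our own illustration, reusing the shannonEntropy function defined above.

```cpp
#include <cstddef>
#include <vector>

// Mutual information I(X;Y) in bits from a joint distribution joint[x][y],
// computed as I(X;Y) = H(X) + H(Y) - H(X,Y).
// Reuses shannonEntropy() from the sketch in the Entropy subsection.
double mutualInformation(const std::vector<std::vector<double>>& joint) {
    std::vector<double> px(joint.size(), 0.0);                        // marginal p(x)
    std::vector<double> py(joint.empty() ? 0 : joint[0].size(), 0.0); // marginal p(y)
    std::vector<double> pxy;                                          // flattened p(x,y)
    for (std::size_t x = 0; x < joint.size(); ++x) {
        for (std::size_t y = 0; y < joint[x].size(); ++y) {
            px[x] += joint[x][y];
            py[y] += joint[x][y];
            pxy.push_back(joint[x][y]);
        }
    }
    return shannonEntropy(px) + shannonEntropy(py) - shannonEntropy(pxy);
}
```

For the table {{0.25, 0.25}, {0.25, 0.25}}, where X and Y are independent fair bits, the function returns 0, in agreement with property 1. In image registration (Section 2.5), the joint table is typically the normalized joint histogram of the gray levels of the two images being aligned.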
2.4 Image Preprocessing Techniques
Preprocessing generally consists of a series of image-to-image transformations. It
does not increase our knowledge of the contents of the document, but may help to
extract it.
Some of the early stages of processing scanned documents are independent of
the type of documents. Many noise filtering, binarization, edge extraction, and
segmentation methods can be applied equally well to printed or handwritten text,
line drawings or maps.
For the purpose of this master thesis we review a skew method and the k-means
segmentation method in the context of global representation techniques.
2.4.1 Skew
The most important preprocessing step in this master thesis is skew correction. In Section 4.2.3 we explain some modifications made to the algorithm introduced in this section in order to suit our needs.
Another preprocessing step that is essential to our goals is called “cropping”. This method is responsible for removing the parts of the image that do not belong to the document itself. It is explained in detail in Section 4.2.1.
All of the algorithms developed for skew detection are accurate on full pages
of uniformly aligned text. The better algorithms are less affected by the presence
of graphics, paragraphs with different skew, curvilinear distortion arising from
photocopying books, large areas of dark pixels near the margin, and few, short text
lines.
The procedure used to solve the skew problem is proposed by Gatos et al. in [21]. This paper proposes a computationally efficient procedure for skew detection and text line position determination in digitized documents, which is based on
the cross-correlation between the pixels of vertical lines in a document. Due to the
text skew, each horizontal text line intersects a predefined set of vertical lines at
non-horizontal positions. Using only the pixels on these vertical lines they construct a correlation matrix and evaluate the skew angle of the document with high
accuracy. In addition, using the same matrix, they compute the positions of text
lines in the document. The proposed method is tested on a variety of mixed-type
documents and it provides good and accurate results while it requires only a short
computational time.
Many document image classification systems consist of a preprocessing stage, a document layout understanding and segmentation stage, a feature extraction stage, and a classification stage. Many of these stages are facilitated if the
document has not been skewed during the scanning process and its text lines are
strictly horizontal. Although there are some techniques of image processing that
can work on skewed documents too, they tend to be ineffective and involve great
computational cost. It is therefore preferable, in the preprocessing stage, to determine the skew angle of the digitized documents.
There are several methods for skew detection. These methods are based on the
Hough transform, projection-based approaches, and the Fourier transform. Hough
transform methods are the most popular, but they are computationally expensive. In
projection-based methods, the projections of the document onto specific directions
are first calculated. The skew angle corresponds to a rotation angle where some
projection characteristics are satisfied. For example, using horizontal projection,
we can determine that the skew angle corresponds to a rotation for which the mean
square deviation of a projection histogram is maximized. This method gives good
accurate results but is computationally expensive due to its preprocessing stage
and global maximum finding stage. Other methods belong to the Fourier transform
approaches. According to this kind of method, the skew angle corresponds to the
direction for which the density of the Fourier space becomes the largest. This
method requires computation of the Fourier transform of the document and for this
reason, it is computationally expensive.
Gatos et al. [21] propose a new skew detection method based on the information existing on a set of equidistant vertical lines. They accept that a text document
consists mainly of horizontal text lines and they can further choose from all the image pixels those lying on a set of equidistant vertical lines. For a document, these
pixels correspond to pixels of text lines. By using only these pixels they construct
a correlation matrix between the vertical lines. They do not need to relate every
pixel of a line to all pixels of the other lines, but only to those pixels that lie in
a specific region defined by the expected maximum skew. In this way, they reduce the computation time significantly without sacrificing accuracy. Finally, they
form the vertical projection of the matrix, and the skew angle corresponds to the
global maximum of the projection. This proposed method also works well with
documents that, in addition to the usual horizontal text lines, contain images, line
draws, and tables.
The innovations introduced by this new method are the following:
• Efficiency: Instead of using all the image pixels, we use only those lying on
certain vertical lines defined in the image. This results in a drastic decrease of
the calculation time for skew detection. The basic matrix used for data storage is of much smaller dimension compared to other methods, which results
in a faster algorithm implementation and minimum storage requirements.
• Accuracy: This method extracts the document skew with high accuracy. This
can be further improved by using more than two vertical lines. The use of
more than two vertical lines improves the accuracy, reduces the possibility
of a wrong result due to noise and diminishes the possibility of missing a
text line of short length. This happens because the skew detection accuracy
depends on the distance between the first and the last vertical lines.
• Robustness: The results of this proposed method are robust to the presence of
graphics in the document, which is not true for methods based on the Hough
transform.
Now, we describe the process that we need to follow to implement this method
using two vertical lines. We can divide this process into the following five basic steps:
1. Binary image. Firstly, we define the binary text image B(x, y) ∈ {0, 1}, with the integers x, y taking values in the ranges 1 < x < Xwin and 1 < y < Ywin, and assuming that text pixels are assigned the value 1 and background pixels the value 0. All distances are expressed in units of pixel distance, that is, the horizontal distances in terms of horizontal pixel distance and the vertical distances in terms of vertical pixel distance.
2. Image smoothing. Before applying the method for skew detection to B(x, y),
we preprocess the image, so that text lines are transformed to thick solid
lines. According to it, if the number of background pixels (B(x, y) = 0)
lying between two adjacent horizontal text pixels is less than or equal to a
certain threshold T , then these background pixels are converted into text pixels (B(x, y) = 1). The proper value of T depends on the text characteristics
and primarily on the character width. Therefore, the proper threshold value
is selected according to the user’s experience. Gatos et al. [21] found that
a suitable value for T is T = 0.1Xwin . Figure 2.7 shows the results of the
above procedure applied to a document.
Figure 2.7: The image before and after image smoothing.
3. Line data acquisition. We define now a set of two or more vertical lines in the document. Building on the previous preprocessing tool, we define that a pixel belongs to a vertical line if its distance from it is less than or equal to T/2. Then, we define the pixels of every vertical line k, lying at horizontal distance m from the left margin, through the following line smoothing binary function:

\[
\text{line}_k(y) = \begin{cases} 1 & \text{if } \sum_{i=m-T/2}^{m+T/2} B(i,y) \neq 0, \\ 0 & \text{otherwise,} \end{cases} \qquad y = 1, \dots, Y_{win}. \qquad (2.7)
\]
We can say that the function line_k(y) indicates text pixel existence at the vertical line k after the line smoothing transformation. In contrast with the Hough transform approach, where all the image pixels are used, we will use only the pixels belonging to these vertical lines. Thus, we need less memory and we significantly speed up the algorithm.
4. Selection of pixels for skew detection. A common characteristic of a text
document is the repetition of the horizontal text lines along the vertical direction. This is obvious (see Fig. 2.7) by observing the repetition of the
pixel-blocks along the vertical columns. These blocks correspond mainly to
horizontal text lines. It is noted that, although in most cases the repetition of
text lines is approximately periodical, this is not a pre-requirement for this
approach. Examination of the blocks between two different vertical lines can
give the necessary information for skew detection. We choose two vertical
lines d1 and d2 (see Fig. 2.8), at distances D1 and D2 from the left margin
of the image. Distances D1 and D2 are defined so the image is divided into
equal parts: $D_1 = \frac{1}{3}X_{win}$ and $D_2 = \frac{2}{3}X_{win}$. The skew angle estimation is based on the $2Y_{win}$ pixels lying on these two lines, obtained by Equation 2.7.
5. Skew detection from the correlation matrix of two vertical lines.
Figure 2.8: Image window of size $X_{win} \times Y_{win}$ and region $[-L, L]$.
We want to determine a matrix that records all the relative positions of the pixels
of the vertical line d1 to the pixels of the vertical line d2 . We notice that due
to the text skew θ, a text line intersects the two vertical lines d1 and d2 in two
points having vertical distance l = (D2 − D1 ) tan θ. Making the assumption
that a document can be rotated up to ±5◦ , that is, θmax = 5◦ due to a
scanning misplacement, the vertical distance l must satisfy the constraint:
\[
-L < l < L, \quad \text{where } L = (D_2 - D_1)\tan\frac{2\pi\theta_{max}}{360}, \tag{2.8}
\]
with $L$ an integer, expressed in number of vertical pixels.
For every text pixel of the vertical line d1 (line1 (yk ) = 1), we search for
text pixels at the vertical line d2 in a region [−L, L] centered at yk . We store
this information in a correlation matrix C(yk , λ) ∈ {0, 1} defined as
\[
C(y_k, \lambda) = \mathrm{line}_1(y_k)\,\mathrm{line}_2(y_k + \lambda), \quad \text{for } 1 \le y_k \le Y_{win} \text{ and } -L \le \lambda \le L. \tag{2.9}
\]
Pixels outside the image region are assumed to be 0. As we can see in Fig.
2.9, the correlation matrix C has zero elements for yk = 6, 7, 8, 14, 15, 16,
22, 23, 24 and 25. This is because there are no text pixels at line d1 for these
yk values. We also have C(1, 3) = 1 because there is a text pixel at line d1
for yk = 1 and there is also a text pixel at line d2 for yk = 1 + 3 = 4.
If the image skew angle is $\theta$, then the intersection of every text line with the two vertical lines $d_1$ and $d_2$ has a vertical distance of $(D_2 - D_1)\tan\theta$.
So, the correlation matrix $C$ will have maximum accumulation of points along the y-axis for $\lambda = \mathrm{int}[0.5 + (D_2 - D_1)\tan\theta]$ (where $\mathrm{int}[x]$ is the integer part of $x$).
Figure 2.9: Correlation matrix of the vertical line d1 towards d2.
Reversing this reasoning, the image skew
is obtained if we detect the global maximum of the vertical projection of the
correlation matrix C.
The vertical projection of the correlation matrix is given by
\[
P(\lambda) = \sum_{k=1}^{Y_{win}} C(k, \lambda), \quad \forall \lambda \in [-L, L]. \tag{2.10}
\]
According to the above, if the global maximum of P (λ) is at λ = λmax ,
then the document skew is given by
\[
\theta = \tan^{-1}\frac{\lambda_{max}}{D_2 - D_1}. \tag{2.11}
\]
As we can see in Fig. 2.10, we have a global maximum of the projection for
$\lambda = 3$, which means that the document skew angle is $\tan^{-1}[3/(D_2 - D_1)]$.
In Section 4.2.3 we can find a detailed explanation about how we use this
method and the variations introduced to adapt it to our application.
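To make the previous steps concrete, the following minimal sketch implements the two-line correlation method in Python with NumPy. It follows the description above (T = 0.1 Xwin, theta_max = 5 degrees, D1 = Xwin/3, D2 = 2 Xwin/3); the function name, the array conventions, and the helper details are illustrative assumptions, not the exact implementation used in this thesis.

    import numpy as np

    def detect_skew(B, theta_max_deg=5.0):
        """Estimate the skew angle (degrees) of a binary text image B
        (text = 1, background = 0) from the correlation of two vertical
        lines, following steps 1-5 above."""
        y_win, x_win = B.shape
        T = max(2, int(0.1 * x_win))            # smoothing threshold (step 2)

        # Step 2: fill background runs shorter than T between text pixels.
        S = B.copy()
        for row in S:
            text = np.flatnonzero(row)
            for a, b in zip(text[:-1], text[1:]):
                if b - a - 1 <= T:
                    row[a:b] = 1

        # Step 3: vertical lines at D1 = Xwin/3 and D2 = 2*Xwin/3 (Eq. 2.7).
        D1, D2 = x_win // 3, 2 * x_win // 3
        half = T // 2
        line1 = (S[:, D1 - half:D1 + half + 1].sum(axis=1) > 0).astype(int)
        line2 = (S[:, D2 - half:D2 + half + 1].sum(axis=1) > 0).astype(int)

        # Step 5: vertical projection of the correlation matrix (Eqs. 2.9-2.10).
        L = int((D2 - D1) * np.tan(np.deg2rad(theta_max_deg)))
        P = np.zeros(2 * L + 1)
        for idx, lam in enumerate(range(-L, L + 1)):
            shifted = np.roll(line2, -lam)      # shifted[y] = line2[y + lam]
            if lam > 0:                          # pixels outside the image are 0
                shifted[-lam:] = 0
            elif lam < 0:
                shifted[:-lam] = 0
            P[idx] = np.sum(line1 * shifted)

        lam_max = int(np.argmax(P)) - L
        return np.degrees(np.arctan2(lam_max, D2 - D1))   # Eq. 2.11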
2.4.2 Image Segmentation
The segmentation process can be described as the identification of structures in an
image. It consists in subdividing an image into its constituent parts, a significant
step towards image understanding [22].
Figure 2.10: Skew detection using the correlation matrix of Fig. 2.9.
Image segmentation is the process of labeling each pixel in an image dataset according to certain parameters or features. Since a segmented image provides richer information than the original one, it is an essential tool in image analysis. Segmentation is considered a very difficult task and a lot of research is being done to develop automatic segmentation techniques.
Unfortunately, the automatic process is not easy since the regions to be segmented can vary with the image. Consequently, most proposed methods are specific, usually assuming some a priori information that must either be built into the
system or provided by a human operator. In the image processing literature, we
can find a lot of segmentation methods and also very diverse ways of classifying
them [22, 23]. Automatic segmentation processes can be divided into two groups:
the global segmentation methods, where all image pixels are collected in some
clusters, and the local segmentation methods, where only a region is taken into
account classifying the pixels inside or outside of this region.
For the purposes of this master thesis, we only review the most basic global
segmentation methods. These methods are also referred to as classification methods, since each point is classified into a cluster, usually depending on its intensity
value and the intensity of its neighbours, and not on its position in the image.
The main global segmentation methods can be classified into these groups:
• Thresholding
This segmentation scheme relies upon the selection of a range of intensity
levels, called threshold values, for each structure class. These intensity
ranges are exclusive to a single class, and span the dynamic range of the
image. Subsequently a feature is classified by selecting the class in which
the value of the feature falls within the range of feature values of the class.
The determination of more than one threshold value is a process called multi-thresholding.
The selection of the threshold generally depends on the visual identification
of a peak in the histogram corresponding to a structure class, and the selection of a range of intensities around the peak to include only the structure
class. A possible criterion is to assign the histogram minima as the threshold
values. More refined criteria are summarized in [24, 25].
• Segmentation by image enhancement
In image processing terminology, an operation for image enhancement improves the quality of the image in a particular manner, either subjectively or
objectively. This segmentation model assumes that a structure class ideally
has a single intensity, and that noise and scanning artifacts corrupt this level
to produce the distribution of intensities observed for a structure class. Thus,
by the application of image enhancement techniques for reducing noise and
smoothing the image, the enhanced image approximates the ideal image (the
segmented one).
The main drawback of this segmentation approach is that the structures that
do not have strong edges on all sides are smoothed, leading to large classification errors when subsequent labelling is applied.
• Segmentation by unsupervised clustering
Clustering methods are algorithms that operate on an input dataset, grouping
data into clusters based on the similarity of the data in these clusters. Clustering algorithms are unsupervised classifiers, assigning class labels from scratch.
They are also useful for data exploration, allowing a user to discover patterns
of similarities in a dataset.
A well-known clustering algorithm is k-means [26]. The k-means algorithm accepts as input the number of clusters into which to organize the data, the initial location of the cluster centers, and a dataset to cluster. The number of clusters is a parameter the user may experiment with, or it may reflect the expected or desired number of classes to discern from the data. There are no conditions upon which data are excluded from consideration or prevented from fitting into
a class; all data provided as input are classified. A given sample or feature
measurement is assigned exclusively to one class (fuzzy k-means clustering
assigns a degree of membership to each data item for each class).
The algorithm is iterative, assigning a class to each data element at every iteration, and it ceases when there are no changes
in the classification solution. Each iteration consists in classifying the dataset
by comparison of the dataset to the current cluster centers. A data item
is assigned to the same class as a cluster center if the Euclidean distance
between the data item and the cluster center is the least distance between
the data item and all the cluster centers. Following class assignment, cluster
centers are updated by computation of the centroid of the dataset classified
as the same class.
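As an illustration of this description, a minimal k-means sketch might look as follows (the function signature and conventions are our own; in practice a library routine would typically be used):

    import numpy as np

    def kmeans(data, centers, max_iter=100):
        """Minimal k-means. `data` is an (n, d) array, `centers` an initial
        (k, d) array of cluster centers. Returns (labels, centers)."""
        centers = np.array(centers, dtype=float)
        labels = np.full(len(data), -1)
        for _ in range(max_iter):
            # Assign each item to the class of its nearest cluster center
            # (least Euclidean distance).
            dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
            new_labels = dists.argmin(axis=1)
            if np.array_equal(new_labels, labels):
                break                   # no changes: the iteration ceases
            labels = new_labels
            # Update each center to the centroid of its assigned items.
            for k in range(len(centers)):
                members = data[labels == k]
                if len(members) > 0:
                    centers[k] = members.mean(axis=0)
        return labels, centers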
2.5 Image Registration
Image registration is a fundamental task in image processing used to match two or
more images or volumes obtained at different times, from different devices or from
different viewpoints. Basically, it consists in finding the geometrical transformation that enables us to align images into a unique coordinate space. In the scope of
this master thesis we will focus on 2D rigid registration techniques because only
transformations that consider translations and rotations are allowed.
In this section, the main components of the image registration pipeline are
presented. A classification of the most representative registration methods that
have been proposed is also given. To end the section, the main challenges in the
registration field are described.
2.5.1 Image Registration Pipeline
The image registration pipeline starts with the selection of the two images to be
registered. One of the two images is defined as the fixed image and the other one
as the moving image. Given these images, registration is treated as an optimization
problem with the goal of finding the spatial mapping that will bring the moving
image into alignment with the fixed one. This process is composed of four basic elements [27]: the transformation, the interpolator,
the metric and the optimizer (see Figure 2.11). The transformation component
represents the spatial mapping of points from the fixed image space to points in the
moving image space. The interpolator is used to evaluate moving image intensity
at non-grid positions. The metric component provides a measure of how well the
fixed image is matched by the transformed moving image. This measure forms
the quantitative criterion to be optimized by the optimizer over the search space
defined by the parameters of the transformation. Each of these components is now
described in more detail.
1. Spatial transformation. The registration process consists in reading the
input image, defining the reference space (i.e. its resolution, positioning and
orientation) for each of these images, and establishing the correspondence
between them (i.e. how to transform the coordinates from one image to the
coordinates of the other image).
The spatial transformation defines the spatial relationship between both images. Basically, two groups of transformations can be considered:
• Rigid or affine transformations. These transformations can be defined
with a single global transformation matrix.
Figure 2.11: The main components of the registration framework are the two input images, a transformation, a metric, an interpolator, and an optimizer.
Rigid transformations are defined as geometrical transformations that only consider translations and rotations, and, thus, they preserve all distances. Affine transformations also allow shearing transforms and they preserve the straightness
of lines (and the planarity of surfaces) but not the distances.
• Nonrigid or elastic transformations. These transforms are defined for
each of the points of the images with a transformation vector. For simplification purposes, sometimes only some control points are considered and the transformation at the other points is obtained by interpolating the transformation at these control points. Using these kinds of
transformations, the straightness of lines is not ensured.
In this master thesis, rigid image registration is our reference point.
2. Interpolation. The interpolation strategy determines the intensity value of a
point at a non-grid position. When a general transformation is applied to an
image, the transformed points may not coincide with the regular grid. So, an
interpolation scheme is needed to estimate the values at these positions.
One of the main problems of registration appears when there is no direct correspondence between the coordinates of the two models. In this situation, a criterion has to be fixed to determine how a point is approximated in the second model. Therefore, spatial transformations rely on interpolation and image resampling for their proper implementation. Interpolation estimates intensities at transformed positions, and resampling assigns intensity values to the pixels of the transformed image. (A short sketch of the two most common schemes is given after this list.)
Several interpolation schemes have been introduced [28]. The most common
are:
• Nearest neighbour interpolation: the intensity of each point is given
by the one of the nearest grid-point.
• Linear interpolation: the intensity of a point is obtained from the linear-weighted combination of the intensities of its neighbors.
• Splines: the intensity of a point is obtained from the spline-weighted
combination of a grid-point kernel [29].
• Partial volume interpolation: the weights of the linear interpolation
are used to update the histogram, without introducing new intensity
values [30].
3. Metric. The metric evaluates the similarity (or disparity) between the two
images to be registered. Several image similarity measures have been proposed. They can be classified depending on the features they use:
• Geometrical features. A segmentation process detects some features
and, then, they are aligned. These methods do not have high computational cost. Nevertheless, there is a great dependence on the initial
segmentation results.
• Correlation measures. The intensity values of each image are analyzed
and the alignment is achieved when a certain correlation measure is
maximized. Usually, a priori information is used in these metrics.
• Intensity occurrence. These measures depend on the probability of
each intensity value and are based on information theory [18].
Despite this variety of measures, this last group has become the most popular. Due to the importance of the similarity measure in our research, a
classification of registration techniques according to this parameter will be
given in Section 2.5.2.
4. Optimization. The optimizer finds the maximum (or minimum) value of
the metric by varying the spatial transformation. For the registration problem, an analytical solution is not possible, so numerical methods are used to obtain the global extremum of a non-analytical function. The most
used methods in the image registration context are Powell’s method, simplex method, gradient descent, conjugate-gradient method, and genetic algorithms (such as one-plus-one evolutionary).
The choice of a method will depend on the implementation criteria and the
measure features (smoothness, robustness, etc.). A detailed description of
several numerical optimization methods and their implementations can be
found in [31].
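As announced in the interpolation discussion above, the following minimal sketch illustrates the two most common schemes, nearest neighbour and (bi)linear interpolation, on a 2D image. It assumes the point (x, y) lies strictly inside the image grid; the conventions are illustrative.

    import numpy as np

    def nearest_neighbour(img, x, y):
        """Intensity given by the nearest grid point."""
        return img[int(round(y)), int(round(x))]

    def bilinear(img, x, y):
        """Linear-weighted combination of the four neighbouring grid points."""
        x0, y0 = int(np.floor(x)), int(np.floor(y))
        dx, dy = x - x0, y - y0
        return ((1 - dx) * (1 - dy) * img[y0, x0] +
                dx * (1 - dy) * img[y0, x0 + 1] +
                (1 - dx) * dy * img[y0 + 1, x0] +
                dx * dy * img[y0 + 1, x0 + 1])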
2.5.2 Similarity Metrics
The registration metric characterizes the similarity (or disparity) of both images
for a given transformation. It is considered that the two images are registered when
this similarity (or disparity) function is maximum (or minimum).
The registration methods that have been proposed can be classified into two
main groups according to the information considered to compute the measure:
(i) feature-based registration, which uses previously segmented objects from the
images to achieve the alignment and (ii) pixel-based methods, which use the whole
data. A more detailed description of both groups is given below.
Feature-based Registration
Measures based on geometric features minimize spatial disparity between selected
features from the images (e.g. distance between corresponding points). The main
difference between the methods of this group is the feature selected for the registration, which can be points, surfaces, intrinsic features such as landmarks, or extrinsic
measures such as implanted markers. According to the features, two main categories of algorithms can be considered:
• Point-based registration algorithms
The basis of these algorithms is the selection of a set of points in each of the
images; the minimum Euclidean distance between them then gives the
best alignment.
Since, in general, the point sets of each image do not exactly coincide, an iterative algorithm is performed until the distance between these sets of points
is minimal. These methods are used extensively in the medical scenario due
to their simplicity.
• Segmented-based registration algorithms
Segmentation-based registration algorithms are based on the alignment of
segmented structures. The segmentation process takes an image and separates its elements into connected regions, which present the same desired
property or characteristic.
The segmentation-based algorithms are generally accurate and fast if a good
choice of features is performed. The main drawback of this approach is that
the registration accuracy is limited to the accuracy of the segmentation step.
Feature-based registration requires specialized segmentation and feature extraction for each application. In addition, the methodology is not immune to noise and is sensitive to outliers. The main advantage of segmentation-based methods is that they give more accurate results than the intensity-based approach. They are also faster than intensity-based registration, as they use a lower number of features and the optimization procedure needs fewer iterations.
Pixel-based Similarity Measures
The alternative to the feature-based approach is the intensity-based registration.
This approach assumes some relation between the optical densities of pixels and
operates directly on the image grey values without prior data reduction by the user
or segmentation. The registration is implicitly performed by the definition of a
function which evaluates the quality of alignment and thereby controls the optimization procedure. The information used for the alignment is not restricted to any
specific feature and therefore this approach is more flexible than the feature-based
one.
There are two different methodologies distinguishing the methods in this group:
• Intensity-based methods
These methods base the alignment on the evaluation of the intensity values
considering the images aligned when the differences between grey values are
minimal. This restriction is ideal in cases where the two images are identical
except for noise. An important aspect to be considered is that the proposed
functions are only computed on the overlap area between both images, which
varies for different transformations.
Some of the functions that have been proposed to describe the relation between grey values are:
– The sum of absolute value differences. This is the simplest and most direct measure of similarity of two image values. This measure is defined
as
\[
S(A, B) = \sum_{x \in A \cap B} |f_A(x) - f_B(x)|, \tag{2.12}
\]
where fA (x) and fB (x) represent the intensity at a point x of the image
A and B, respectively. When this measure is applied we assume that
the image values are calibrated to the same scale.
– Correlation. In the alignment of two images, registration results in
a strong linear relationship between corresponding values in the two
images. A measure of similarity would be the correlation, which determines the fit of a line to the distribution of corresponding values.
Correlation is expressed as
\[
C(A, B) = \sum_{x \in A \cap B} f_A(x) \times f_B(x). \tag{2.13}
\]
The main limitations of this measure are:
∗ Its dependence on the number of points over which it is evaluated.
This tends to favour transformations yielding large overlap. The
normalized cross-correlation solves this problem simply by dividing correlation by the number of points.
∗ Its dependence on the intensity values, which tends to favour high
intensity values. As a solution to this second limitation a better
measure of alignment was proposed: the correlation coefficient.
The correlation coefficient is a measure of the residual errors from
the fitting of a line to the data by minimization of the least squares.
• Methods based on the occurrences of intensity values
The basic idea behind these methods is that two values are related or similar
if there are many other examples of those values occurring together in the
overlapping image. These measures are a class of more generic statistical
measures which only look at the occurrence of image values and not at the
values themselves.
Most of these techniques are based on the feature space or joint histogram.
The joint histogram is a two-dimensional plot of the corresponding grey values in the images showing the combinations of grey values in each of the
two images for all corresponding points. The joint histogram is constructed
by counting the number of times a combination of grey values occurs. For
each pair of corresponding points (x, y), where x is a point in the first image
and y a point in the second image, the entry (fA (x), fB (y)) in the joint histogram is increased.
The joint histogram depends on the alignment of the images. When the images are correctly registered, corresponding structures overlap and the joint
histogram will show certain clusters for the grey values of those structures.
Conversely, when the images are misaligned, structures in one image will
overlap with structures in the other image that are not their counterparts.
This will be manifested in the joint histogram by a dispersion of the clustering. This property is exploited by defining measures of clustering or dispersion which have to be maximized and minimized respectively. Most of
these measures are based on information theory. For a detailed description
of this theory see Section 2.3. In the information theory context, the registration of two images is represented by an information channel X → Y ,
where the random variables X and Y represent the images. Their marginal
probability distributions, p(x) and p(y), and the joint probability distribution, p(x, y), are obtained by simple normalization of the marginal and joint
intensity histograms of the overlapping areas of both images.
Some of the measures based on the occurrences of intensity values are
– Moments of the joint probability distribution. The joint probability tells
us the proportion of times one or more variables hold some specific
values. Empirically, as the images approach the registration position,
the values of the peaks in the joint probability distribution increase
in height and the values on the regions of the probability distribution
which contain lower counts decrease in height. Therefore, the registration process has to re-arrange the pixels so that they occur with
their most probable corresponding value in the other image. One possible approach to quantify this shift from lower probabilities in the joint
probability distribution to a smaller number of higher probabilities is
to measure skewness (or the third moment) in the distribution of probabilities in the joint histogram.
The skewness characterizes the degree of asymmetry of a distribution
around its mean. It is a pure number that characterizes only the shape
of the distribution.
– Joint entropy. In the joint histogram of two images, grey values disperse with misregistration and the joint entropy is a measure of this
dispersion. By finding the transformation that minimizes their joint entropy, images should be registered [30]. The main drawback of this
method is its high sensitivity to the overlap area.
– Mutual information (MI). Another measure is mutual information which
is less sensitive to the overlap area. The more dependent the datasets
are, the higher the MI between them. Registration is assumed to correspond to the maximum mutual information: the images have to be
aligned in such a manner that the amount of information they contain
about each other is maximal [32].
In the image registration context, Studholme [33] proposed the normalized mutual information (NMI) defined by
\[
NMI(X, Y) = \frac{H(X) + H(Y)}{H(X, Y)} = 1 + \frac{I(X, Y)}{H(X, Y)}, \tag{2.14}
\]
which is more robust than MI, due to its greater independence of the
overlap area.
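To illustrate these pixel-based measures, the following minimal sketch computes the sum of absolute differences (Eq. 2.12), the correlation (Eq. 2.13), and the NMI (Eq. 2.14) for two equally sized grey-level arrays representing the overlap area. The floating-point arrays and the histogram binning are illustrative assumptions.

    import numpy as np

    def sad(fa, fb):
        """Sum of absolute value differences (Eq. 2.12) over the overlap."""
        return np.sum(np.abs(fa - fb))

    def correlation(fa, fb):
        """Correlation (Eq. 2.13); dividing by the number of points would
        give the normalized cross-correlation."""
        return np.sum(fa * fb)

    def nmi(fa, fb, bins=256):
        """Normalized mutual information (Eq. 2.14) from the joint histogram."""
        joint, _, _ = np.histogram2d(fa.ravel(), fb.ravel(), bins=bins)
        p_xy = joint / joint.sum()            # joint probability p(x, y)
        p_x = p_xy.sum(axis=1)                # marginal p(x)
        p_y = p_xy.sum(axis=0)                # marginal p(y)

        def entropy(p):
            p = p[p > 0]
            return -np.sum(p * np.log2(p))

        return (entropy(p_x) + entropy(p_y)) / entropy(p_xy.ravel())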
To conclude this section, the most relevant properties of the intensity-based
registration approach are summarized. The main feature of intensity-based
registration is its generality; it can be applied to any dataset with no previous
pre-processing or segmentation. Moreover, as all the pixels are considered in the alignment process, the method is quite immune to noise and insensitive to outliers. On the other hand, the convergence of intensity-based registration is in general very slow; due to the considerable computational cost, multi-resolution and multi-scale strategies are incorporated into the process, first registering at lower resolutions and then increasing the resolution, in order to speed up convergence.
2.5.3 Challenges in Image Registration
In this section, the main problems currently being addressed by image registration
researchers are briefly summarized.
Robustness and Accuracy
To evaluate the behaviour of a registration method, robustness and accuracy are the main parameters to be considered. The first parameter, robustness, refers to how the method behaves with respect to different initial states, i.e. different initial positions of the images, image noise, etc. The second parameter, accuracy, refers to how close the final solution is to the ideal one. New measures and new interpolation schemes constantly appear, trying to improve the robustness and the accuracy of the standard measures.
Artifacts
In the registration process, the interpolator algorithm plays an important role, since
usually the transformation brings the point to be evaluated into a non-grid position.
This importance is greater when the grid size coincides in both images, since the
interpolator pattern is repeated for each point. When the mutual information or its
derivations, which are the most common measures used in image registration, are
computed, their value is affected by both the interpolation scheme and the selected
sampling strategy, limiting the accuracy of the registration. The fluctuations of the
measure are called artifacts and are well studied by Tsao [34].
Speed-up
One of the main user requirements when using registration techniques is speed: users desire results as fast as possible. The large amount of data acquired by current capture devices makes its processing time-consuming. Therefore, the definition of strategies able to accelerate the registration process is fundamental. Several
multiresolution frameworks have been proposed achieving better robustness and
speeding up the process.
Chapter 3
Classification Framework
As we introduced in Section 1.3, the main goal of this master thesis is the recognition and classification of different documents (invoices, tickets, and receipts)
within a database.
In this master thesis, a document refers to a single-page typeset document image. The document image may be produced from a scanner, a fax machine, or by converting an electronic document into an image format (usually TIFF).
Our documents are images without text content identified as such. Given the image of a document, the basic objective is to identify this image within a previously created database of all company documents. Although text is not identified as such, most of our documents are of the mostly-text type; thus, OCR techniques are not applied in our project. It is also important to emphasize that, in our document classifier, the document space is not restricted to a single application domain (e.g., only invoices or only bank documents
or only tickets . . . ), since it is extended to include several application domains
such as tickets, invoices, and receipts. As we have seen in Section 2.2, Bagdanov and Worring characterize document classification at two levels of detail, coarse-grained and fine-grained. Coarse-grained classification is used to classify documents with very different features, such as business letters versus technical articles, whereas fine-grained classification is used to classify documents with similar features, such as business letters from different senders or journal title pages from various journals. In our case, the coarse-grained level corresponds to filtering the database by different document features (see Section 3.3), while the fine-grained level corresponds to applying normalized mutual information registration (see Section 5.2).
In summary, some of the main features of our document classifier are the following:
• Classification is based on image-level features.
• Using the document taxonomy defined by Nagy [1], our documents can be
included in the mostly-text documents group.
• Our classifier uses a document space that spans several application domains.
• Our classifier is characterized by two levels:
– Coarse-grained. This level corresponds to filtering the database by different document features (see Section 3.3).
– Fine-grained. This level corresponds to applying normalized mutual information registration (see Section 5.2).
In Section 2.2, we have seen that Chen [9] defines a document classifier using
three components (the problem statement, the classifier architecture, and the performance evaluation), represented in Figure 2.1. We have modified the diagram of Figure 2.1
with the aim of adapting it to our document classifier. The result is represented in
Figure 3.1.
Figure 3.1: Three components of our document classifier.
In the first component, problem statement, we obtain a series of document
images (document samples) that we can divide into two groups: reference samples
and input samples.
• Reference samples: This group is formed by a document set where, ideally, all documents are different from each other and where each document
represents a document type that the classifier can identify. This group of
documents will form our database.
• Input samples: This group is formed by a document set that we want to use as classifier input, with the aim of determining the document type to which each one belongs; that is, the objective is to find the corresponding document within the database formed by the reference samples.
Both sets of documents have to be preprocessed, within the second component (classifier architecture), with the aim of preparing images and extracting their
features for the registration process.
The classifier architecture also includes a classification algorithm, which is responsible for the registration process and is therefore a very important part of this
master thesis. In Section 3.2 we can find a detailed explanation of this part along
with a Figure 3.4 that summarizes its main stages.
Last but not least, the third component, performance evaluation, allows us to
obtain an estimation of the speed, robustness, and efficiency of the implemented
classifier.
In the next Sections 3.1, 3.2 and 5.4 we can find a detailed explanation of
the three components (problem statement, classifier architecture, and performance
evaluation) of our document classifier.
3.1 Problem Statement
In this section we define, using the data collected in Section 2.2.1, the problem that
is solved by the classifier. The problem statement for a document classifier consists
of two aspects [9]: the document space and the set of document classes.
The Document Space
The document space defines the range of input document samples and may include
documents that should be rejected, because they do not lie within any document
class. In this case, the rejected document can be added to the group of reference images. In this way, when the next document of this kind enters the system, it may be
recognized and properly classified.
Our classifier uses a document space constituted by invoices, tickets, and receipts (see Figure 3.2).
The Set of Document Classes
As we have seen in Section 2.2.1, the set of document classes defines how the
document space is partitioned. Specifically, our document space is larger than the
union of the document classes (see Figure 2.2.b).
We have also seen that a document class is defined as a set of documents characterized by the similarity of expressions, style, form, or contents. In our case,
Figure 3.2: Some documents that make up our document space.
document classes are defined based on the similarity of form and style (also called
visual similarity), such as page layout (see Figure 3.3).
Figure 3.3: Examples of document classes.
3.2 Classifier Architecture
As we have seen in Section 2.2.2, Chen [9] uses the following four aspects to characterize the classifier architecture:
1. Document features and recognition stage.
2. Feature representations.
3. Class models and classification algorithms.
4. Learning mechanisms.
In this section we will explain these four aspects of our classifier.
Document Features and Recognition Stage
In our case we use document features in order to effectively filter the database containing the reference images.
Initially, the database contains a very high number of reference images, so registering the input image with all the database images is too expensive in terms of execution time.
Using document features we can filter the database by removing documents whose features differ from those of the input image. This allows us to considerably reduce the database size and, in turn, the execution time.
Before discussing the choice of document features further, we first consider the
document recognition stage at which the classification is performed.
Document Recognition Stages As we have seen in Section 2.2.2, document
classification can be performed at various stages of document processing. The
choice of document features is constrained by the document recognition stage at
which document classification is performed.
Figure 2.3 shows a typical sequence of document recognition for mostly-text
document images. In our case, we only need to apply “Image preprocessing”.
Specifically, we do the following preprocessing:
• Cropping removes image margins. Thus, the image is adjusted to the document that it represents (see Section 4.2.1).
• Check position verifies that the document text is in horizontal position. Otherwise, a rotation of 90 degrees is applied (see Section 4.2.2).
• Skew correction is applied (see Section 4.2.3).
• Logo extraction (see Section 4.3.1) allows us to apply a logo registration
(see Section 5.3). It is necessary that the user manually selects the logo of
the reference images. Otherwise, logo registration cannot be applied.
• K-means allows us to identify the main colors of a document and its ratios
(see Section 4.3.3).
In Chapter 4, we can find a detailed explanation and visual examples of these
and other implemented preprocessing techniques.
Choice of Document Features Chen [9] characterizes document features using
three categories: image features, structural features, and textual features.
In our classifier we only consider image features taken directly from the image.
These are called global image features. This makes our classifier very quick, since they are computed right after the preprocessing stage. Thus, we avoid processes such
as block segmentation, physical layout analysis, logical layout analysis, or OCR.
Finally, we have considered the use of the following document features:
• Image size: It allows an initial filtering of the database. Thus, we can distinguish between different typologies of documents such as tickets, invoices,
and receipts.
• Logo and logo position: Logo reference images and their positions are stored
in the database. This allows a logo registration to find the logo (around
the stored position) within the input image (see Section 5.3).
• Color/gray level: It allows us to filter color or gray documents depending on
the input image.
• Colors: Main colors of a document and its ratios allow us to filter documents
with different colors or color ratios.
• First NMI registration: In theory, after the preprocessing process, the registration value of two images of the same type, without any translation, has to exceed a certain threshold. The images that do not exceed this threshold are filtered out.
We have also studied the possibility of using the following document features,
but the results were not sufficiently reliable and robust to be used for the filtering
of the database.
• Number of regions in the image: This value was obtained by a split-and-merge segmentation method.
• Number of horizontal and vertical lines in the image: The method for obtaining these values requires working with high resolution, which increases
execution time considerably.
• Entropy: It is not a good discriminating value for this specific task.
Feature Representations
In Section 2.2.2 we have seen that document features extracted from each sample
document in a classifier can be represented in various ways, such as a flat representation (fixed-length vector or string), a structural representation, or a knowledge
base. As our document features do not provide structural information, they are represented in fixed-length feature vectors, unlike features that provide structural information, which are represented in various formats such as a tree, a list, or a graph.
Specifically, we store document features as image class attributes, since it
is not necessary to take into account any kind of structural information. Thus we
respect the guideline for the selection of feature representations given by Watanabe:
“The simpler, the better”. Our document representation is the simplest possible. It
contains the minimum necessary information to filter the database since, in our
case, document features do not participate in the registration process.
Class Models and Classification Algorithms
Class models define the characteristics of the document classes. In our case, the
class model is defined by the document image, its features, and one or more
logos. Document features allow us to filter the database, document image allows
us to apply NMI registration (see Section 5.2), and finally with the logos of the
documents we can use the logo registration (see Section 5.3). Figure 3.4 shows in
detail the classification algorithm presented in Figure 3.1.
Figure 3.4: Classification algorithm.
In NMI registration we apply a new method based on information theory that
has not been used before in document classification. On the other hand, in logo
registration, we use a classification algorithm presented in Section 2.2.2: template matching, which is used to match an input document with one or more logos
of each model class.
Learning Mechanisms
A learning mechanism provides an automated way for a classifier to construct or
tune class models. Our classifier only has one learning mechanism shown in Figure
3.1. This mechanism consists in adding to the database, as new reference images, the input images that could not be classified. Thus, if a new input image of this type is introduced into the system, it can be classified correctly.
3.3 Document Database
In Chapter 3, we have seen the great importance of the database within the system. The database stores the path of reference images together with their features.
Thus, the database can be filtered by means of input image features. This allows
us to register the input image only with the reference images that have similar features. This implies an important reduction in the number of images that need to be
registered and therefore a decrease in the execution time.
Our database is very simple. It consists of a single table of 13 columns. Each
column represents some kind of document information.
In the database, the following fields are stored (see Figure 3.5):
• “identificador”: An identifier is assigned to each reference image.
• “nom”: It corresponds to the name and extension of the reference image.
• “imatge”: It stores the reference image path.
• “logo”: It stores the logo image path.
• “xLogo”: It corresponds to the “x” position in pixels of the logo origin (its upper-left corner) within the reference image.
• “yLogo”: It corresponds to the “y” position in pixels of the logo origin within the reference image.
• “amplLogo”: It represents the width in pixels of the logo.
• “alLogo”: It represents the height in pixels of the logo.
• “imReduida”: It stores the thumbnail path of the reference image. In Section
5.2 we will see that the NMI registration is applied to thumbnail and not to
the original images.
• “xcm”: It corresponds to the width in centimeters of the reference image.
• “ycm”: It corresponds to the height in centimeters of the reference image.
• “centroides”: It stores the text file path that contains the result of applying
the k-means method (see Section 4.3.3) to the reference image, that is, it
contains the main colors of the reference image and its ratios.
• “color”: It is a boolean value that indicates whether the reference image is
in color or gray scale.
Figure 3.5: Database.
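As an illustration, the table could be created as follows. This is a hypothetical SQLite sketch mirroring the thirteen fields described above; the actual database engine and column types used by the application are not detailed here.

    import sqlite3

    # Hypothetical SQLite schema mirroring the thirteen fields above.
    SCHEMA = """
    CREATE TABLE IF NOT EXISTS documents (
        identificador INTEGER PRIMARY KEY,  -- reference image identifier
        nom        TEXT,     -- file name and extension
        imatge     TEXT,     -- reference image path
        logo       TEXT,     -- logo image path
        xLogo      INTEGER,  -- x position (pixels) of the logo origin
        yLogo      INTEGER,  -- y position (pixels) of the logo origin
        amplLogo   INTEGER,  -- logo width (pixels)
        alLogo     INTEGER,  -- logo height (pixels)
        imReduida  TEXT,     -- thumbnail path
        xcm        REAL,     -- image width (cm)
        ycm        REAL,     -- image height (cm)
        centroides TEXT,     -- path of the k-means result file
        color      INTEGER   -- 1 = color, 0 = gray scale
    );
    """

    conn = sqlite3.connect("documents.db")
    conn.executescript(SCHEMA)
    conn.close()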
3.4 Document Filtering
In this section, the central part of this master thesis is explained: the filtering process.
The main objective is to classify an input image from the reference documents
stored in the database. If we try to achieve this goal by registering the input image
with all the images stored in the database, it would be computationally very expensive
since the database contains a large number of reference images. Therefore, it is
necessary to implement a system to reduce this time. To do this, we should reduce
the number of possible candidates, that is, we need to filter the database.
We currently use three filters: the size filter (see Section 3.4.1), the k-means filter (see Section 3.4.2), and the NMI filter (see Section 3.4.3); a sketch of how they chain together follows.
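Conceptually, each filter stage keeps only the plausible candidates, and the survivors go on to the full registration. In the minimal sketch below, the predicate functions passed in are placeholders for the filters of the following subsections.

    def filter_database(input_img, references, filters):
        """Chain the database filters (size, k-means, NMI); each stage
        keeps only the reference images that pass the given predicate."""
        candidates = list(references)
        for passes in filters:          # e.g. [size_ok, colors_ok, nmi_ok]
            candidates = [ref for ref in candidates if passes(input_img, ref)]
        return candidates               # survivors go to full registration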
53
Chapter 3. Classification Framework
3.4.1 Size Filter
This is a simple filter that allows us to consider only the reference images with a
similar size to the input image. Therefore, we can discard the reference images with a different typology from the input image, that is, we can differentiate between invoices, receipts, and tickets.
Tickets of the same type can have different heights, so they are a special case: we only consider their width in the filtering process.
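A minimal sketch of this filter, assuming each image exposes its width and height in centimeters as the xcm/ycm fields of Section 3.3 (the tolerance and the ticket-width heuristic are illustrative assumptions):

    def size_ok(inp, ref, tol_cm=1.0, ticket_width_cm=9.0):
        """Keep `ref` only if its size is similar to the input image's.
        Tickets (recognized here by a small width, an illustrative
        heuristic) are compared by width only, since their height varies."""
        if ref.xcm <= ticket_width_cm:
            return abs(inp.xcm - ref.xcm) <= tol_cm
        return (abs(inp.xcm - ref.xcm) <= tol_cm and
                abs(inp.ycm - ref.ycm) <= tol_cm)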
3.4.2 K-means Filter
This filter allows us to consider only the reference images with similar colors to the
input image. It is based on the k-means method (see Section 4.3.3), with which we calculate the main colors and their ratios for the reference and input images. Then, the k-means filter calculates the difference between the
reference images and the input image colors. If the difference is greater than a
certain threshold, the image is discarded.
It has been necessary to implement a method to calculate the difference between colors. The main objective is to assign each main color of the reference image to one main color of the input image. This allows us to calculate a color-based distance value to decide whether to keep or discard the reference image. In a first version we implemented a method that finds the optimal assignment using backtracking. The solution was the best possible, but the method was too slow.
The scanned images have a lot of noise due to the acquisition process, and,
therefore, we considered that it was not necessary to spend so much time looking
for the optimal solution and that an approximation of the solution was sufficient
to apply this filter. Thus, in a second version, we decided to calculate only two
particular cases and consider the best of them as the valid solution.
We consider the following two cases:
• In each iteration we group the two colors of different images separated by the
minimum distance (best candidates) taking into account color proportions.
We always group an exact number of pixels of each image.
• In each iteration we group the color that is most remote (worst candidate) with the color of the other image that is at minimum distance from it.
These two cases are best understood by looking at the two examples of Figure
3.6. In the two examples we compare the three main colors of two images. We represent the colors in a 2D plane to facilitate comprehension. The left images state the problem: they locate the three colors of each image in 2D space, they represent the color proportions (by the value inside each circle and the circle size), and they show the distances between colors (dashed lines quantified by a black number). The central
images show the solutions that we obtain by applying the first case described. Finally, the right images show the solutions that we obtain by applying the second
case described.
We can see that in the first example a better result is obtained with the second case, whereas in the second example a better result is obtained with the first case.
Figure 3.6: Example of the distance calculation between the colors of two images.
In order to demonstrate the reliability of this approach, we calculated 100 distances between images using both backtracking and our approach. Comparing these results, we can see that the difference between the optimal distance and our approach is always less than 10, while the distance between two colors is always between 0 and 442 (the distance between black and white, that is, $\sqrt{3}\cdot 255 \approx 441.7$). Table 3.1 shows the results in more detail.
Interval of differences                      Number of
(approach distance − optimal distance)       differences
[0, 1]                                       42
(1, 3]                                       21
(3, 6]                                       26
(6, 10]                                      11
Total                                        100

Table 3.1: Comparison of the results of calculating one hundred distances between color images using backtracking and our approach.
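A sketch of the first (best-candidate) case of this approximation: each image is described by its k main colors and their pixel proportions, and in each iteration the closest remaining pair of colors is grouped, transferring an exact amount of pixels from each image. Names and conventions are illustrative, and the proportions of both images are assumed to sum to one.

    import numpy as np

    def color_distance_best_case(colors_a, props_a, colors_b, props_b):
        """Greedy approximation of the distance between two color signatures.
        colors_*: (k, 3) RGB arrays; props_*: pixel proportions summing to 1."""
        pa, pb = list(props_a), list(props_b)
        dist = 0.0
        while sum(pa) > 1e-9:
            # Pairwise Euclidean distances between the remaining colors.
            d = np.linalg.norm(colors_a[:, None, :].astype(float)
                               - colors_b[None, :, :].astype(float), axis=2)
            d[np.array(pa) <= 0, :] = np.inf   # exhausted colors drop out
            d[:, np.array(pb) <= 0] = np.inf
            i, j = np.unravel_index(np.argmin(d), d.shape)
            moved = min(pa[i], pb[j])          # group an exact amount of pixels
            dist += moved * d[i, j]
            pa[i] -= moved
            pb[j] -= moved
        return dist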
3.4.3 NMI Filter
After applying the image preparation methods (see Section 4.2), it is reasonable to expect the reference images and the input image to be practically registered. Therefore, we apply the NMI registration method (see Section 5.2) without applying any kind of transformation. If we obtain a registration value smaller than a certain threshold, the image is discarded.
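Using the nmi function sketched in Section 2.5.2, this filter reduces to a single comparison (the threshold value and the thumbnail attribute are illustrative assumptions):

    def nmi_ok(inp, ref, threshold=1.10):
        """Discard `ref` when the NMI of the thumbnails, computed without
        any transformation, stays below a threshold (value illustrative;
        by Eq. 2.14, NMI >= 1)."""
        return nmi(inp.thumbnail, ref.thumbnail) >= threshold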
3.5 Interface
The graphical interface of the application (see Figure 3.7) was designed by means of Qt Designer (A).
Figure 3.7: Design of the interface with Qt Designer.
When the application runs, a main screen appears (see Figure 3.8) with the following five sub-menus:
• “Arxiu”
• “Finestra”
• “Base de dades”
• “Pre-procés”
• “Registre”
Figure 3.8: Main screen of the application.
3.5.1 “Arxiu” Menu
“Arxiu” menu contains the following options (see Figure 3.9):
• “Obrir”: It allows us to load an image from the disk. It is necessary to load at
least one image to activate some options like “tancar finestra”, “tancar totes
les finestres”, “apropar”, “allunyar”, “cascada”, “afegir/actualizar imatge”,
“esborrar imatge”, . . .
• “Guardar”: It allows us to store a loaded image on the disk.
• “Sortir”: It allows us to leave the application.
57
Chapter 3. Classification Framework
Figure 3.9: “Arxiu” menu.
3.5.2 “Finestra” Menu
“Finestra” menu contains the following options (see Figure 3.10):
• “Apropar”: It zooms in to the selected image.
• “Allunyar”: It zooms out to the selected image.
• “Tamany original”: The selected image is displayed at its original size.
• “Cascada”: Windows are arranged in cascade mode (see Figure 3.11).
• “Mosaic”: Windows are arranged in mosaic mode (see Figure 3.12).
• “Tancar”: The selected window is closed.
• “Tancar totes”: All open windows in the application are closed.
• Window list: It shows the list of open windows and allows us to change the selected window.
Figure 3.10: “Finestra” menu.
Figure 3.11: Cascade mode.
Figure 3.12: Mosaic mode.
3.5.3 “Base de dades” Menu
“Base de dades” menu contains the following options (see Figure 3.13):
• “Connectar”: It allows us to connect to a database. Initially the application is connected to the default database. If we want to change the database, we only need to disconnect it; this activates the connection option, which allows us to access the configuration menu (see
Figure 3.14), where a new connection can be specified (database, user name,
password,. . . ).
• “Desconnectar”: It allows us to disconnect a database.
• “Afegir/Actualitzar imatge”: If the selected image does not exist, it is added
to the database; otherwise, only its fields are updated.
• “Esborrar imatge”: If the selected image is in the database, it is erased.
• “Restaurar base de dades”: It allows us to initialize a database from the images contained in a specific folder.
• “Esborrar base de dades”: The database is completely removed.
Figure 3.13: “Base de dades” menu.
3.5.4 “Pre-procés” Menu
“Pre-procés” menu only contains the “Aplicar” option (see Figure 3.15).
This option allows us to access the preprocess configuration menu (see Figure 3.16), where we can carry out the preprocess selection and configuration.
In Chapter 4 the different preprocesses are explained in detail.
Figure 3.14: Configuration database menu.
Figure 3.15: “Pre-procés” menu.
Figure 3.16: Preprocess configuration menu.
Figure 3.17: “Registre” menu.
3.5.5 “Registre” Menu
“Registre” menu only contains the “Aplicar” option (see Figure 3.17).
This option allows us to access the registration configuration menu (see Figure 3.18), where we can carry out the registration method selection and configuration.
In Chapter 5 the different registration methods are explained in detail.
Figure 3.18: Registration configuration methods menu.
Figure 3.19 shows how the result of the classification is returned.
Figure 3.19: Result returned by the application.
Chapter 4
Image Preprocessing and Segmentation
4.1 Introduction
As we stated in the introduction to Section 2.4, preprocessing generally consists of
a series of image-to-image transformations. It does not increase our knowledge of
the documents, but may help to extract it.
On the other hand, the segmentation process can be described as the identification of structures in an image. It consists in subdividing an image into its
constituent parts (see Section 2.4.2).
Using this type of algorithms, we try to find diverse document features. The
basic idea consists in adding to the database the reference images and also the
found features. This will allow us to filter the database using the calculated input
image features and discarding those images that do not have similar features.
If we have a database with 1000 images, it would be ideal to reduce these to 9 or 10 candidates by filtering. The more discriminating the features are, the more images will be discarded and, thus, the faster and more efficient the classification will be.
Implemented methods can be divided into two main groups according to their
functionality: “image preparation methods” and “feature extraction methods”.
“Image preparation methods” are included inside the preprocess method group
and they are formed by the following methods:
• “Ajustar” (cropping, see Section 4.2.1)
• “Comprovar posició” (check position, see Section 4.2.2)
• “Skew” (see Section 4.2.3)
• “Escalar” (to scale image, see Section 4.2.4)
• “Rotar” (to rotate image, see Section 4.2.4)
“Feature extraction methods” are included inside the segmentation method
group and they are formed by the following methods:
• “Seleccionar logo” (logo selection, see Section 4.3.1)
• “Color” (color and gray scale detection, see Section 4.3.2)
• “K-means” (see Section 4.3.3)
In Sections 4.2 and 4.3 we will explain these methods with examples and results.
The preprocessing and segmentation methods presented in this chapter have
been tested with 100 images, obtaining a 100% success rate.
4.2 Image Preparation Methods
Generally, documents are digitized using a scanner. This involves several problems, such as position errors, skew errors, or the fact that the scanner normally generates A4 images even when a smaller document, like a ticket or a banking receipt, is scanned
(see Figure 4.1). Therefore, it has been necessary to implement methods to correct
these errors. This is essential to obtain a more robust, fast, and efficient registration
(see Chapter 5).
Figure 4.1: a) Scanned image in a wrong position. b) Image with a skew error. c)
Generally, scanners generate A4 images although a smaller document is scanned.
In this concrete case an adjustment is necessary. This problem also appears in cases
a) and b).
4.2.1 Cropping
We have seen that scanners normally generate A4 images even when a smaller document is scanned. Therefore, it has been necessary to implement a method that fits the scanned images.
Looking at the image of Figure 4.1, we identify two main problems: “white zone elimination” and “gray zone elimination”.
White Zone Elimination
This problem is very simple to solve because in the white zone all pixels are pure
white (R = 255, G = 255, and B = 255).
The first idea was to search in a straight line from the point marked in red in Figure 4.2 to the first non-white pixel and to remove the white zone (this process is repeated, initiating the search from the top, left, and right sides, in this order, to fit the image on all sides). This method is very fast because we only process a single line of pixels per side.
The principal problem is that this method cannot be applied to black and white images, because we risk eliminating important information from the image (see Figure 4.3). To remove the white zones of a black and white image we follow the same process, but instead of processing a single pixel per line we process whole lines of pixels and stop the search when we find a line with some non-white pixel. Thus we solve the problem and obtain a correct result. In Figure 4.4 we can see that a small zone has been removed, but in this case the removal is correct.
Finally, after several tests, we have seen that applying the second method in all cases is more efficient than discriminating between color images and black and white images. Therefore, the second method is applied in all cases.
Gray Zone Elimination
After removing the white zones, it is necessary to eliminate the gray zones. These zones cannot be eliminated as directly as the white zones, because the pixels in the gray zones suffer a slight variation of intensity.
The method implemented to solve this problem is very similar to the previous ones. The basic difference is that in this case the search does not end when a line of pixels with some non-white pixels is found, but when the intensity variance in the pixel line is above a certain threshold.
Thus the fit problem is solved. Figures 4.5 and 4.6 show some results of the
cropping method application.
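A minimal sketch of both eliminations on a grey-scale image: a border row or column is considered margin if it is pure white or if its intensity variance stays below a threshold, and the scan is repeated from each side. The threshold value is an illustrative assumption.

    import numpy as np

    def crop_margins(img, var_threshold=50.0):
        """Trim white and gray scanner margins from a grey-scale image.
        A border row/column is margin if it is all pure white (255) or
        if its intensity variance stays below `var_threshold`."""
        def is_margin(line):
            return np.all(line == 255) or np.var(line) < var_threshold

        top = 0
        while top < img.shape[0] - 1 and is_margin(img[top]):
            top += 1
        bottom = img.shape[0]
        while bottom > top + 1 and is_margin(img[bottom - 1]):
            bottom -= 1
        left = 0
        while left < img.shape[1] - 1 and is_margin(img[:, left]):
            left += 1
        right = img.shape[1]
        while right > left + 1 and is_margin(img[:, right - 1]):
            right -= 1
        return img[top:bottom, left:right]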
4.2.2 Check Position
After fitting the images, it is necessary to verify that the document is in the correct
position. Therefore, it has been necessary to implement a method that detects the
Figure 4.2: First idea to implement the white zone elimination.
Figure 4.3: The white zone elimination problem with a black and white image.
Figure 4.4: Solution for the white zone elimination problem.
Figure 4.5: The result of applying the cropping method to a ticket.
Figure 4.6: The result of applying the cropping method to a receipt.
position of a document and rotates the image 90 degrees if necessary.
In order to know if the document is in the correct position, we convert the text lines into solid lines. First, we obtain horizontal solid lines and then vertical solid lines.
Results of this process are shown in Figure 4.7.
Figure 4.7: a) Original image, b) image with horizontal solid lines, and c) image with vertical solid lines.
At this point, we can deduce the correct position of the document by comparing the average line lengths of the two images shown in Figures 4.7b and 4.7c. If the average length of the horizontal lines is greater than that of the vertical lines, the document is in the correct position; otherwise, the document must be rotated 90 degrees. Thus, the position problem is solved.
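A simplified sketch of this comparison follows; it measures the dark-pixel runs directly instead of first building the solid-line images shown in Figure 4.7, and the binarization threshold is an illustrative assumption.

#include <QImage>
#include <QColor>
#include <QTransform>

// Average length of the dark-pixel runs found when traversing the image
// row by row (horizontal == true) or column by column.
static double averageRunLength(const QImage &img, bool horizontal,
                               int darkThreshold = 128)
{
    const int outer = horizontal ? img.height() : img.width();
    const int inner = horizontal ? img.width() : img.height();
    long long total = 0, runs = 0;
    for (int i = 0; i < outer; ++i) {
        int run = 0;
        for (int j = 0; j < inner; ++j) {
            const int g = horizontal ? qGray(img.pixel(j, i))
                                     : qGray(img.pixel(i, j));
            if (g < darkThreshold) {
                ++run;
            } else if (run > 0) {
                total += run;
                ++runs;
                run = 0;
            }
        }
        if (run > 0) { total += run; ++runs; }
    }
    return runs > 0 ? double(total) / runs : 0.0;
}

// Horizontal runs dominating means the document is upright; otherwise it
// must be rotated by 90 degrees.
QImage checkPosition(const QImage &img)
{
    if (averageRunLength(img, true) >= averageRunLength(img, false))
        return img;
    return img.transformed(QTransform().rotate(90));
}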
Figures 4.8 and 4.9 show results of applying the check position method when the document is in a wrong position, and Figure 4.10 shows the result when the document is already in the correct position.
4.2.3 Skew
After fitting and checking the image position, a last step is necessary to have the image properly prepared for image feature extraction and, finally, for the execution of the registration algorithm. This last step consists of correcting the small skew errors that may have been introduced during the scanning process. To do this, we have implemented the method described in Section 2.4.1.
Figure 4.8: The result of applying the check position method to a receipt in wrong
position.
Figure 4.9: The result of applying the check position method to a ticket in wrong
position.
Figure 4.10: The result of applying the check position method to a receipt in correct
position.
In order to make the method more robust, we add the following process. First, we calculate the skew error using two vertical lines, as explained in Section 2.4.1. Then we repeat the process with three lines, obtaining two skew error values, and compute their average. If this average agrees with the previously calculated skew error, we consider the skew error correctly estimated; otherwise, we add another line. The process ends when the last calculated skew error matches the previous one. Finally, the rotating method (see Section 4.2.4) is applied to correct the detected skew error.
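A sketch of this refinement loop is given below, where estimateSkew stands for the estimator of Section 2.4.1 generalized to a given number of sampling lines; the agreement tolerance and the iteration cap are assumptions, not values from the actual implementation.

#include <QImage>
#include <cmath>

// Assumed helper: estimates the skew angle (in degrees) from `numLines`
// vertical sampling lines, following the method of Section 2.4.1.
double estimateSkew(const QImage &img, int numLines);

// Adds one sampling line at a time until two consecutive estimates agree.
double robustSkew(const QImage &img, double tolerance = 0.1, int maxLines = 10)
{
    double previous = estimateSkew(img, 2);
    for (int lines = 3; lines <= maxLines; ++lines) {
        const double current = estimateSkew(img, lines);
        if (std::fabs(current - previous) <= tolerance)
            return current;           // consecutive estimates agree
        previous = current;
    }
    return previous;                  // fall back to the last estimate
}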
Although the skew error is usually less than 5 degrees, Figures 4.11, 4.12, and 4.13 show results of applying the skew method to documents with extreme skew errors, in order to demonstrate the robustness of the method.
Figure 4.11: The result of applying the skew method to a ticket.
4.2.4 Image Rotation and Scaling
Two simple methods allow us to apply any rotation or scaling to a given image. These methods have been implemented using tools provided by the Qt libraries.
The rotating method is used by the skew method (see Section 4.2.3) to correct the detected skew error.
The scaling method is used by the NMI registration method; it is applied to obtain low-resolution images, as described in Section 5.2.
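A minimal sketch of both operations follows; the wrapper names are ours, but QImage::transformed and QImage::scaled are the calls actually provided by Qt.

#include <QImage>
#include <QTransform>

// Rotation used by the skew method (Section 4.2.3), with smooth
// (bilinear) resampling.
QImage rotateImage(const QImage &img, double degrees)
{
    return img.transformed(QTransform().rotate(degrees),
                           Qt::SmoothTransformation);
}

// Scaling used by the NMI registration (Section 5.2): shrink the image so
// that its larger side measures `maxSide` pixels, preserving aspect ratio.
QImage scaleImage(const QImage &img, int maxSide)
{
    return img.scaled(maxSide, maxSide, Qt::KeepAspectRatio,
                      Qt::SmoothTransformation);
}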
Figure 4.12: The result of applying the skew method to a ticket.
Figure 4.13: The result of applying the skew method to a ticket.
Figures 4.11, 4.12, and 4.13 show results of applying the rotating method, and Figures 4.14 and 4.15 show results of applying the scaling method.
Figure 4.14: The result of applying the scale method to an invoice.
Figure 4.15: The result of applying the scale method to a receipt.
4.3 Feature Extraction Methods
After solving the problems produced during document scanning, the image is ready and the necessary features can be extracted. We basically focus on the following three features:
• Logo: We need to have the document logo and its position within the image
in order to apply the logo registration (see Section 5.3).
• Color and gray scale: This is an important feature. If the input image is a color image, all the gray-scale reference images are discarded, and vice versa.
• K-means: If the input image is a color image, we apply the k-means algorithm to detect the main colors and their ratios.
4.3.1 Logo Extraction
We implemented a method that allows the user to manually select the portion of the document image considered to be the logo. We consider a logo to be anything that is repeated in all documents of the same type; it may therefore be an image, a word, a structure, etc.
The final objective of this method is that each reference image in the database also has an associated logo image and its position within the reference image. We define only one logo per image, although two or more logos per image could easily be defined. This has not been done because we consider that one logo is enough to obtain a good logo registration.
Figures 4.16 and 4.17 show reference images with their extracted logos.
Figure 4.16: Logo extraction of a ticket.
Figure 4.17: Logo extraction of an invoice.
4.3.2 Color and Gray-scale Detection
Color and gray-scale detection is a simple method that checks whether the image is in color or in gray scale. An image is a gray-scale image if R = G = B holds for all its pixels; otherwise, it is a color image.
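As a minimal sketch of this check (the function name is ours):

#include <QImage>
#include <QColor>

// An image is gray-scale when R = G = B holds for every pixel.
bool isGrayScale(const QImage &img)
{
    for (int y = 0; y < img.height(); ++y)
        for (int x = 0; x < img.width(); ++x) {
            const QRgb px = img.pixel(x, y);
            if (qRed(px) != qGreen(px) || qRed(px) != qBlue(px))
                return false;           // a single colored pixel suffices
        }
    return true;
}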
4.3.3 K-means
The last extraction method uses the k-means algorithm (see Segmentation by unsupervised clustering in Section 2.4.2). This method is only applied to color images and allows us to extract the main colors and their ratios. Thus, the database can be filtered to take into account only images with similar colors.
K-means is one of the simplest unsupervised learning algorithms that solve the clustering problem. The procedure classifies a given data set (here, color pixels) into a number k of clusters fixed a priori; in our case, k is the number of colors that we want to find. The algorithm proceeds as follows. First, the centroids are placed randomly (different initial locations may lead to different results). The next step is to take each point of the data set and associate it with the nearest centroid. When no point is pending, the first step is completed and an early grouping is obtained. At this moment, k new centroids are re-calculated as the barycenters of the clusters resulting from the previous step. Once these k new centroids are available, a new binding has to be done between the same data set points and the nearest new centroid. A loop has thus been generated: as a result of it, the k centroids change their location step by step until no more changes occur, that is, until the centroids no longer move. The algorithm aims at minimizing an objective function, in this case a squared error function:
$$ J = \sum_{j=1}^{k} \sum_{i=1}^{n} (x_i - c_j)^2, \qquad (4.1) $$
where $(x_i - c_j)^2$ is a chosen distance measure between a data point $x_i$ and the cluster center $c_j$. The function $J$ is an indicator of the distance of the $n$ data points from their respective cluster centers.
The algorithm is composed of the following steps, represented in Figure 4.18:
1. Place k points into the space represented by the objects that are being clustered. These points represent the initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the k centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a
separation of the objects into groups from which the metric to be minimized
can be calculated.
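The following is a minimal sketch of these steps applied to the RGB pixels of an image; the random initialization, the convergence test, and the data structures are illustrative choices rather than the exact implementation.

#include <QImage>
#include <QColor>
#include <cstdlib>
#include <vector>

struct Centroid { double r, g, b; long long count; };

static double dist2(QRgb p, const Centroid &c)
{
    const double dr = qRed(p) - c.r, dg = qGreen(p) - c.g, db = qBlue(p) - c.b;
    return dr * dr + dg * dg + db * db;
}

// k-means over the RGB pixels of an image: random initial centroids, then
// alternating assignment and centroid update until the centroids stop moving.
std::vector<Centroid> kMeansColors(const QImage &img, int k, int maxIter = 100)
{
    std::vector<Centroid> c(k);
    for (int j = 0; j < k; ++j) {                       // step 1: initialization
        const QRgb p = img.pixel(std::rand() % img.width(),
                                 std::rand() % img.height());
        const Centroid init = { double(qRed(p)), double(qGreen(p)),
                                double(qBlue(p)), 0 };
        c[j] = init;
    }
    for (int iter = 0; iter < maxIter; ++iter) {
        const Centroid zero = { 0.0, 0.0, 0.0, 0 };
        std::vector<Centroid> next(k, zero);
        for (int y = 0; y < img.height(); ++y)          // step 2: assignment
            for (int x = 0; x < img.width(); ++x) {
                const QRgb p = img.pixel(x, y);
                int best = 0;
                for (int j = 1; j < k; ++j)
                    if (dist2(p, c[j]) < dist2(p, c[best]))
                        best = j;
                next[best].r += qRed(p);
                next[best].g += qGreen(p);
                next[best].b += qBlue(p);
                ++next[best].count;
            }
        bool moved = false;                             // step 3: recompute
        for (int j = 0; j < k; ++j) {
            if (next[j].count == 0) { next[j] = c[j]; continue; }
            next[j].r /= next[j].count;
            next[j].g /= next[j].count;
            next[j].b /= next[j].count;
            moved = moved || next[j].r != c[j].r || next[j].g != c[j].g ||
                    next[j].b != c[j].b;
        }
        c = next;
        if (!moved)                                     // step 4: converged
            break;
    }
    return c;    // count / (width * height) gives each color's ratio
}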
Figure 4.19 illustrates the algorithm steps with a simple example. Figures 4.20 and 4.21 show results generated by the k-means algorithm.
It is important to emphasize that this method generates a file where the centroid values and the percentage of image pixels belonging to each centroid are stored. This allows us to implement a color comparison algorithm that takes into account each color's proportion within the image. Thus, the algorithm gives more weight to the colors with larger proportions and little or no weight to those with smaller proportions.
Figure 4.18: K-means algorithm diagram.
Figure 4.19: K-means algorithm example. 1) k initial “means” (in this case k =
3) are randomly selected from the data set (shown in color). 2) k clusters are
created by associating every observation with the nearest mean. The partitions
here represent the Voronoi diagram generated by the means. 3) The centroid of
each of the k clusters becomes the new mean. 4) Steps 2 and 3 are repeated until
convergence has been reached.
Figure 4.20: The result of applying the k-means algorithm to an invoice. a) Original
Image. b) Result of applying the k-means algorithm for k = 2, i.e., 2 groups/colors.
c) Result of applying the k-means algorithm for k = 3, i.e., 3 groups/colors.
Figure 4.21: The result of applying the k-means algorithm to an invoice. a) Original
Image. b) Result of applying the k-means algorithm for k = 3, i.e., 3 groups/colors.
c) Result of applying k-means algorithm for k = 6, i.e., 6 groups/colors.
Chapter 5
Image Registration
5.1 Introduction
As we have seen in Section 2.5, registration is a fundamental task in image processing, used to match two or more images or volumes obtained at different times, from different devices, or from different viewpoints. Basically, it consists of finding the geometrical transformation that enables us to align the images into a single coordinate space.
To carry out the image registration we have focused on two methods:
• NMI registration: This method uses techniques based on information theory
(see Section 5.2).
• Logo registration: This method uses template matching techniques based on
the normalized correlation coefficient (see Section 5.3).
As we described in Section 3.4, the registration method is only applied to a
group of reference images in the database that have passed a series of filters. There
are two different ways to apply the registration method to this group of reference
images: brute-force search and basic search.
• Brute-force search: The algorithm applies the registration method to all images of the group. Then, it returns a list of all the images sorted according to
the obtained similarity value.
• Basic search: The algorithm stops when it finds the first image with a similarity value greater than a certain threshold, and returns the list of images registered up to that point (a sketch of this search follows this list). In the best case, the list contains a single image; in the worst case, it contains all the images, as in the brute-force case.
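A sketch of the basic search loop follows; the registerImage helper, which returns the similarity value obtained for one reference image, and the returned index-similarity pairs are our assumptions.

#include <QImage>
#include <utility>
#include <vector>

// Assumed helper: registers the input against one reference image and
// returns the obtained similarity value.
double registerImage(const QImage &input, const QImage &reference);

// Basic search: register the references one by one and stop as soon as one
// exceeds the similarity threshold. In the best case the returned list holds
// a single image; in the worst case it holds them all, as in the brute-force
// search.
std::vector<std::pair<int, double> >
basicSearch(const QImage &input, const std::vector<QImage> &refs,
            double threshold)
{
    std::vector<std::pair<int, double> > registered;
    for (size_t i = 0; i < refs.size(); ++i) {
        const double similarity = registerImage(input, refs[i]);
        registered.push_back(std::make_pair(int(i), similarity));
        if (similarity > threshold)
            break;                    // first sufficiently similar reference
    }
    return registered;
}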
Both ways have their advantages and disadvantages. Using the brute-force method may be interesting when the database contains more than one image of the same type, since it returns all the images similar to the input image and always returns the best possible result. On the other hand, it is slower than the basic search, since it does not skip any registration operation. In theory, in our specific case, the database contains only one document of each type. In practice, it can contain more than one document of each type (this makes the method more robust, but also slower).
5.2 Normalized Mutual Information
Mutual information has been successfully used in many different fields, for instance in medical imaging for image registration. We propose here to use the normalized mutual information (NMI) to register documents. Thus, we can register an input document against a series of reference documents; by maximizing the normalized mutual information, we obtain the most similar documents.
The image registration process is explained in detail in Section 2.5. We have seen that it can be described as a process composed of four basic elements: the transformation, the interpolator, the metric, and the optimizer.
We now explain these components for our registration method:
• Spatial transformation: The spatial transformation defines the spatial relationship between both images. We only consider 2D translations, that is, a particular case of rigid transformation. In order to facilitate and speed up the image registration, a skew correction method (see Section 4.2.3) has been implemented. Thus, we assume that a translation in both axes is the only rigid transformation that has to be carried out between the two images.
• Interpolator: The interpolation strategy determines the intensity value of a point at a non-grid position (i.e., a position that does not coincide with a pixel center). We specifically use linear interpolation, that is, the intensity of a point is obtained from the linearly weighted combination of the intensities of its neighbors.
• Metric: The metric evaluates the similarity (or disparity) between the two images to be registered. We specifically use NMI as the image similarity measure (a sketch of its computation is given after this list). It belongs to the intensity occurrence measures, that is, it depends on the probability of each intensity value, and it is based on information theory.
• Optimization: The optimizer finds the maximum (or minimum) value of the metric by varying the spatial transformation. We specifically use Powell's method [31]. Powell's method, also called the conjugate direction method, is a zero-order optimization method. The basic concept of Powell's conjugate direction approach is to perform sequential one-dimensional searches and to generate a new direction toward the next iterate. Once the unidirectional search vectors $S_i$, $i = 1, \dots, n$ (for $n$ variables), are defined, the conjugate direction is determined as the sum of all unidirectional vectors:
$$ S_{n+1} = \sum_{i=1}^{n} \alpha_i S_i, \qquad (5.1) $$
where $S_{n+1}$ is the conjugate direction and $\alpha_i$ are the scaling factors.
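As announced above, the following sketch computes the metric for two equally sized gray-scale images (during registration, the second image would be the reference resampled at the current translation). It assumes the normalization NMI(A, B) = (H(A) + H(B)) / H(A, B) proposed by Studholme [33]; the histogram size and the function name are our choices.

#include <QImage>
#include <QColor>
#include <cmath>
#include <vector>

// Normalized mutual information of two equally sized gray-scale images,
// computed from the marginal and joint intensity histograms.
double nmi(const QImage &a, const QImage &b)
{
    const int bins = 256;
    std::vector<double> joint(bins * bins, 0.0), pa(bins, 0.0), pb(bins, 0.0);
    const double n = double(a.width()) * a.height();
    for (int y = 0; y < a.height(); ++y)
        for (int x = 0; x < a.width(); ++x) {
            const int ia = qGray(a.pixel(x, y));
            const int ib = qGray(b.pixel(x, y));
            joint[ia * bins + ib] += 1.0 / n;   // joint probability p(a, b)
            pa[ia] += 1.0 / n;                  // marginal probability p(a)
            pb[ib] += 1.0 / n;                  // marginal probability p(b)
        }
    double ha = 0.0, hb = 0.0, hab = 0.0;       // Shannon entropies
    for (int i = 0; i < bins; ++i) {
        if (pa[i] > 0.0) ha -= pa[i] * std::log(pa[i]);
        if (pb[i] > 0.0) hb -= pb[i] * std::log(pb[i]);
    }
    for (int i = 0; i < bins * bins; ++i)
        if (joint[i] > 0.0) hab -= joint[i] * std::log(joint[i]);
    return hab > 0.0 ? (ha + hb) / hab : 0.0;
}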
Finally, we have performed several tests at different image resolutions. After some experiments, we have seen that the NMI values obtained with high-resolution images are lower than those obtained with low-resolution images. This is logical because high-resolution images present a higher dispersion of intensities; if we scale down (compress) the image, we obtain an image where the significant parts gain prominence. After this study, we have seen that very good results are obtained using images whose larger side measures 50 pixels (see Figure 5.1). This also greatly accelerates the registration process.
Figure 5.1: Scaled image used in normalized mutual information registration.
Figure 5.2 shows the superposition of two registered images using the normalized mutual information.
5.3 Logo Registration
Unlike the NMI registration method, the logo registration method only registers a small portion of the image, at high resolution. After the logo extraction preprocess (see Section 4.3.1), we keep in the database the logo image and its position within the reference image. When we want to classify a new input image using this registration method, we only search for the logo in a window surrounding the position stored in the database. This has the advantage that, if two documents have a different structure but the same logo in approximately the same place, the registration should still be correct. This explanation of the logo registration method is summarized graphically in Figure 5.3.
Now we define the four components of our logo registration method:
Figure 5.2: Superposition of two registered images using normalized mutual information.
Figure 5.3: Graphical summary of the logo registration method. We want to classify an input image and we have three possible candidates. In this case, only the last reference image (the third one in the right column) can obtain a good similarity value.
• Spatial transformation: This component is identical to the NMI registration
method (see Section 5.2).
• Interpolator: Unlike the NMI registration method, this method does not need an interpolator, because its translations are integer (pixel) displacements.
• Metric: In this case, we use the normalized correlation coefficient as the
image similarity measure (see Section 2.5.2).
• Optimization: Unlike the NMI registration method, this method does not
need an optimization process, because in this case the similarity value is
calculated for all positions in the image and then we select the maximum
value.
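A sketch of this windowed search follows, using OpenCV's matchTemplate function (see Appendix C), whose TM_CCOEFF_NORMED mode implements the normalized correlation coefficient; the window margin and the function signature are our assumptions.

#include <opencv2/opencv.hpp>

// Searches for the stored logo inside a window around its reference
// position; returns the normalized correlation coefficient of the best
// match and its location in document coordinates.
double matchLogo(const cv::Mat &document, const cv::Mat &logo,
                 cv::Point expected, int margin, cv::Point &found)
{
    // Search window centered on the stored logo position, clipped to the
    // document bounds.
    cv::Rect window(expected.x - margin, expected.y - margin,
                    logo.cols + 2 * margin, logo.rows + 2 * margin);
    window &= cv::Rect(0, 0, document.cols, document.rows);

    // The similarity is computed for every position inside the window; no
    // optimizer is needed, the maximum is selected directly.
    cv::Mat result;
    cv::matchTemplate(document(window), logo, result, cv::TM_CCOEFF_NORMED);

    double maxVal;
    cv::Point maxLoc;
    cv::minMaxLoc(result, 0, &maxVal, 0, &maxLoc);
    found = maxLoc + window.tl();
    return maxVal;                    // normalized coefficient in [-1, 1]
}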
Figure 5.4 shows the superposition of two registered images using the logo
registration method.
Figure 5.4: Superposition of two registered images using the logo registration
method.
5.4 Performance Evaluation
We now evaluate the performance of our NMI and logo registration methods.
5.4.1 NMI Registration Test
We have classified 77 input images, using a database with 100 reference images.
These 77 input images are divided into 40 invoices, 30 receipts, and 7 tickets, and
the 100 reference images are divided into 60 invoices, 25 receipts, and 15 tickets.
After classifying the 77 input images, we have obtained the following results:
71 input images (92.2%) are classified correctly and 6 input images (7.8%) are
classified incorrectly.
The successful cases are divided as follows:
• 9 input images do not have a reference image and the application did not find
matches (true negative).
• 62 input images have a reference image and the application found matches (true positive).
The error cases are divided as follows:
• 3 errors were in ticket classification. The input images have reference images
but the application did not find matches (false negative).
• 1 input image does not have a reference image, but the application found
matches (false positive).
• 2 input images have reference images, but the application found bad matches
(wrong positive).
The registration time per image varies between 300 ms and 3 seconds. To classify the 77 input images of the test, the application took approximately 2.5 minutes (about two seconds per image).
After these experiments, we have seen that the main problems appear with the tickets. We propose a possible solution in the future work section (see Section 6.2).
In Table 5.1 a summary of the results is presented.
5.4.2 Logo Registration Test
To evaluate the performance of our logo registration method, we use the same set
of images used in the performance evaluation of the NMI registration method (see
Section 5.4.1).
After classifying the 77 input images, we have obtained the following results: 76 input images (98.7%) are classified correctly and 1 input image (1.3%) is classified incorrectly.
The successful cases are divided as follows:
• 10 input images do not have a reference image and the application did not
find matches (true negative).
                Successful cases       Error cases
Document      True      True      False     False     Wrong
type          positive  negative  negative  positive  positive
Tickets           2         2         3         0         0
Receipts         25         4         0         0         1
Invoices         28        10         0         1         1
Subtotal         55        16         3         1         2
Total                71                        6
Table 5.1: Results summary of the NMI registration method.
                Successful cases       Error cases
Document      True      True      False     False     Wrong
type          positive  negative  negative  positive  positive
Tickets           5         2         0         0         0
Receipts         26         4         0         0         0
Invoices         29        11         1         0         0
Subtotal         60        16         1         0         0
Total                76                        1
Table 5.2: Results summary of the logo registration method.
• 66 input images have a reference image and the application found the correct matches (true positive).
The single error case is an input image that has a reference image but for which the application did not find a match (false negative). This error could be solved by using a larger search window.
The registration time per image varies between 300 ms and 3 seconds. To classify the 77 input images of the test, the application took approximately 1.4 minutes (about 1.2 seconds per image).
After these experiments, we have seen that the logo registration method is more robust and faster than the NMI registration method. Its principal disadvantage is that it requires human intervention whenever a reference image is added to the database.
In Table 5.2 a summary of the results is presented.
Chapter 6
Conclusions and Future Work
Document image classification is an important focus of research. The development
of new techniques that assist and enhance the recognition and classification of different document types is fundamental in business environments and has been the
main objective of this master thesis. In this chapter, the conclusions of this master
thesis and the directions for our future research are presented.
6.1 Conclusions
In this master thesis we achieved two basic objectives. First, a state of the art has
been presented with the aim of summarizing the current state of the classification
and recognition of document images. Second, several image processing tools have
been presented in order to classify different document classes, such as invoices,
receipts, and tickets.
Next, the main contributions of this master thesis are summarized:
• In Chapter 2, a state of the art of document classification has been presented.
This allowed us to see that there is much work to do in the field of document
recognition and classification based on document images.
• In Chapter 3, we presented a description of our document image classification framework. Specifically, we introduced the problem statement, the classifier architecture, the document database, the document filtering, where we presented the methods used in the filtering process (the k-means filter based on color comparison, the NMI filter based on information theory, and the size filter), and the application interface. We have always worked with the objective of reducing computation times as much as possible.
• In Chapter 4, several methods for preprocessing and segmentation are presented. These methods allow us to prepare the document image in order to speed up the registration process. The image preparation methods (cropping, skew, and check position) consist of a series of image-to-image transformations. They do not increase our knowledge of the contents of the document, but may help to extract it. On the other hand, the feature extraction methods (logo extraction, k-means, and color and gray-scale detection) allow us to extract important document features (the principal colors, the logo, and the logo position) and really increase our knowledge of the image contents.
• In Chapter 5, a new use of NMI registration has been presented. We have seen that this method is a fast, robust, and fully automatic classification technique. We also presented a logo registration method that obtains very good results in the classification of all kinds of document images, but with the disadvantage that the logo of each reference image must be selected manually before it is added to the database (i.e., this method needs prior human intervention). In order to evaluate the performance of these registration methods, we studied the results and computation times obtained by each method.
• In Section 3.5, an interface to apply all the implemented methods has been presented. This interface has been very useful for testing the methods and viewing the final results.
6.2 Future Work
The ideas presented in this master thesis can be improved or expanded in different directions. We propose the following future work:
• We have seen that the main source of errors in the NMI registration is the tickets (see Section 5.4). Ticket classification is very difficult because tickets are document images with very little information. Moreover, unlike invoices or receipts, the same type of ticket does not always have the same size; the ticket size usually depends on the information it contains. In order to improve ticket processing, we think that a ticket could be divided into three parts: an invariable top part, a variable central part, and an invariable bottom part (see Figure 6.1). The idea is to remove the variable central part and register only the invariable parts (top and bottom), obtaining a higher similarity value (see Figure 6.2).
• To try other interesting image registration techniques such as image registration based on compression. The basic idea is the conjecture that two images
are correctly registered when we can maximally compress one image given
the information in the other. It would be interesting to demonstrate that the
image registration process can be formulated as a compression problem.
Figure 6.1: The three parts of a ticket.
Figure 6.2: NMI registration of tickets after removing the variable parts.
• To improve the efficiency of the implemented methods. Although the obtained computation times are good, time is an essential factor to take into
account in the field of document classification and therefore we must minimize it as much as possible.
• Setting up new learning mechanisms; for example, a mechanism that can detect the invariable parts of a document class from a group of images belonging to that class.
• To implement a new method to automatically locate the logo position within a document image. This would allow us to fully automate the logo classification.
Appendix A
Qt
To design and implement the GUI of the application, it was decided to use Qt [35].
Qt is a cross-platform application development framework, widely used for the
development of GUI programs (in which case it is known as a widget toolkit), and
also used for developing non-GUI programs such as console tools and servers. Qt is
most notably used in KDE, Opera, Google Earth, Skype, Qt Extended, Adobe Photoshop Album, VirtualBox and OPIE. It is produced by the Norwegian company Qt
Software, formerly known as Trolltech, a wholly owned subsidiary of Nokia since
June 17, 2008.
Qt uses C++ with several non-standard extensions implemented by an additional pre-processor that generates standard C++ code before compilation. Qt can also be used in several other programming languages via language bindings. It runs on all major platforms and has extensive internationalization support. Non-GUI features include SQL database access, XML parsing, thread management,
network support and a unified cross-platform API for file handling. Distributed
under the terms of the GNU Lesser General Public License (among others), Qt is
free and open source software.
In the market we can find the following Qt distributions:
• Qt Enterprise Edition and Qt Professional Edition, available for developing software for commercial purposes, include technical support, and extensions are available.
• Qt Free Edition is the version for Unix/X11 for the development of free and open source software. It is freely available under the terms of the Q Public License and the GNU General Public License. A non-commercial Qt version is also available for Windows platforms.
• Qt Educational Edition is a Professional Edition Qt version licensed only for
educational purposes.
• Qt/Embedded Free Edition.
Features:
• Qt is a library for creating graphical interfaces. It is distributed freely under the GPL (or QPL) license, which allows us to incorporate Qt into our open-source applications.
• It is available for a large number of platforms: Linux, Mac OS X, Solaris, HP-UX, UNIX with X11, etc.
• It is object oriented, which facilitates software development.
• It uses C++, although it can also be used in several other programming languages (e.g., Python or Perl) via language bindings.
• It is a library based on the concepts of widgets (objects), signals and slots, and events (e.g., mouse clicks).
• Signals and slots are the mechanism that enables communication between widgets (see the minimal example after this list).
• A widget can contain any number of children. The “top-level” widget can be of any kind: a window, a button, etc.
• Some attributes, such as text labels, are modified similarly to HTML.
• Qt provides additional functionality:
– Basic libraries → Input/Output, Network Management, XML.
– Database Interface → Oracle, MySQL, PostgreSQL, ODBC.
– Plugins, dynamic libraries (Images, formats, ...)
– Unicode, internationalization.
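As a minimal illustration of the signal-slot mechanism described above (Qt 4 syntax; the example itself is ours, not taken from the application):

#include <QApplication>
#include <QPushButton>

int main(int argc, char *argv[])
{
    QApplication app(argc, argv);
    QPushButton button("Quit");
    // The button's clicked() signal is connected to the application's
    // quit() slot: clicking the button ends the event loop.
    QObject::connect(&button, SIGNAL(clicked()), &app, SLOT(quit()));
    button.show();
    return app.exec();
}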
Qt Designer is a tool for designing and implementing user interfaces built with
the Qt multiplatform GUI toolkit. Qt Designer makes it easy to experiment with
user interface design. At any time you can generate the code required to reproduce
the user interface from the files Qt Designer produces, changing your design as
often as you like. It has a very complete palette of widgets, including the most common widgets in the Qt libraries. Figure A.1 shows the interface of this tool.
Figure A.1: Qt Designer Interface.
Appendix B
MySQL
We decided to use MySQL [36] as a database for managing images.
MySQL is one of the most popular open source databases. Its continued development and growing popularity are making MySQL an increasingly direct competitor of giants in the database field such as Oracle.
MySQL is a relational database management system (RDBMS). The program
runs as a server providing multi-user access to a number of databases.
There are many types of databases, from a simple file system to a relational
object-oriented database.
MySQL, as a relational database, uses multiple tables to store and organize the information. It is written in C and C++ and stands out for its great adaptability to different development environments, allowing interaction with the most widely used programming languages, such as PHP, Perl, and Java, and integration into different operating systems.
Its open source nature is also remarkable: its use is free of charge, and it can even be modified with total freedom, since its source code can be downloaded. This has had a very positive effect on its development and continuous updates, making MySQL one of the tools most used by Internet-oriented programmers.
Appendix C
OpenCV
OpenCV [37] is a computer vision library originally developed by Intel. It is free
for commercial and research use under the open source BSD license. The library is
cross-platform, and runs on Windows, Mac OS X, Linux, PSP, VCRT (Real-Time
OS on Smart camera) and other embedded devices. It focuses mainly on real-time image processing; as such, if it finds Intel's Integrated Performance Primitives on the system, it will use these commercially optimized routines to accelerate itself.
Example applications of the OpenCV library are Human-Computer Interaction (HCI); Object Identification, Segmentation and Recognition; Face Recognition; Gesture Recognition; Motion Tracking, Ego Motion, Motion Understanding;
Structure From Motion (SFM); Stereo and Multi-Camera Calibration and Depth
Computation; Mobile Robotics.
General description:
• Open source computer vision library in C/C++.
• Optimized and intended for real-time applications.
• OS/hardware/window-manager independent.
• Generic image/video loading, saving, and acquisition.
• Both low and high level API.
• Provides interface to Intel’s Integrated Performance Primitives (IPP) with
processor specific optimization (Intel processors).
Features:
• Image data manipulation (allocation, release, copying, setting, conversion).
• Image and video I/O (file and camera based input, image/video file output).
• Matrix and vector manipulation and linear algebra routines (products, solvers,
eigenvalues, SVD).
• Various dynamic data structures (lists, queues, sets, trees, graphs).
• Basic image processing (filtering, edge detection, corner detection, sampling
and interpolation, color conversion, morphological operations, histograms,
image pyramids).
• Structural analysis (connected components, contour processing, distance transform, various moments, template matching, Hough transform, polygonal approximation, line fitting, ellipse fitting, Delaunay triangulation).
• Camera calibration (finding and tracking calibration patterns, calibration,
fundamental matrix estimation, homography estimation, stereo correspondence).
• Motion analysis (optical flow, motion segmentation, tracking).
• Object recognition (eigen-methods, HMM).
• Basic GUI (display image/video, keyboard and mouse handling, scroll-bars).
• Image labeling (line, conic, polygon, text drawing)
OpenCV modules:
• cv - Main OpenCV functions.
• cvaux - Auxiliary (experimental) OpenCV functions.
• cxcore - Data structures and linear algebra support.
• highgui - GUI functions.
Appendix D
.NET
The different methods presented in this master thesis have been translated into Visual Basic using .NET [38] and packaged within a DLL (dynamic-link library). The objective of this translation is that the company, in addition to the application implemented with Qt, also has a Qt-independent library through which the different methods can be used from other applications programmed in Visual Basic. The company specifically asked us for a .NET implementation of the library for its convenience.
The Microsoft .NET Framework is a software framework that can be installed
on computers running Microsoft Windows operating systems. It includes a large library of coded solutions to common programming problems and a virtual machine
that manages the execution of programs written specifically for the framework.
The .NET Framework is a key Microsoft offering and is intended to be used by
most new applications created for the Windows platform. The framework’s Base
Class Library provides a large range of features including user interface, data and
data access, database connectivity, cryptography, web application development,
numeric algorithms, and network communications. The class library is used by
programmers, who combine it with their own code to produce applications. Programs written for the .NET Framework execute in a software environment that
manages the program’s runtime requirements. Also part of the .NET Framework,
this runtime environment is known as the Common Language Runtime (CLR). The
CLR provides the appearance of an application virtual machine so that programmers need not consider the capabilities of the specific CPU that will execute the
program. The CLR also provides other important services such as security, memory management, and exception handling. The class library and the CLR together
constitute the .NET Framework.
The principal features of .NET are:
• Interoperability, because interaction between new and older applications is
commonly required. The .NET Framework provides means to access functionality that is implemented in programs that execute outside the .NET environment.
• Common runtime engine: The Common Language Runtime (CLR) is the
virtual machine component of the .NET framework. All .NET programs
execute under the supervision of the CLR, guaranteeing certain properties
and behaviors in the areas of memory management, security, and exception
handling.
• Language independence: The .NET Framework introduces a Common Type
System, or CTS. The CTS specification defines all possible data types and
programming constructs supported by the CLR and how they may or may
not interact with each other. Because of this feature, the .NET Framework
supports the exchange of instances of types between programs written in any
of the .NET languages.
• Base Class Library: the Base Class Library (BCL), part of the Framework
Class Library (FCL), is a library of functionality available to all languages
using the .NET Framework. The BCL provides classes which encapsulate
a number of common functions, including file reading and writing, graphic
rendering, database interaction and XML document manipulation.
• Simplified deployment: The .NET framework includes design features and
tools that help manage the installation of computer software to ensure that it
does not interfere with previously installed software, and that it conforms to
security requirements.
• Security: The design is meant to address some of the vulnerabilities, such as
buffer overflows, that have been exploited by malicious software. Additionally, .NET provides a common security model for all applications.
• Portability: The design of the .NET Framework allows it to theoretically
be platform agnostic, and thus cross-platform compatible. That is, a program written to use the framework should run without change on any type
of system for which the framework is implemented. Microsoft’s commercial
implementations of the framework cover Windows, Windows CE, and the
Xbox 360.
Bibliography
[1] Nagy, G.: Twenty years of document image analysis in PAMI. IEEE Trans.
Pattern Anal. Mach. Intell. 22 (2000) 38–62
[2] Sebastiani, F.: Machine learning in automated text categorization. ACM
Comput. Surv. 34 (2002) 1–47
[3] Appiani, E., Cesarini, F., Colla, A.M., Diligenti, M., Gori, M., Marinai, S.,
Soda, G.: Automatic document classification and indexing in high-volume
applications. IJDAR 4 (2001) 69–83
[4] Diligenti, M., Frasconi, P., Gori, M.: Hidden tree Markov models for document image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (2003)
[5] Byun, Y., Lee, Y.: Form classification using dp matching. In: SAC ’00:
Proceedings of the 2000 ACM symposium on Applied computing, New York,
NY, USA, ACM (2000) 1–4
[6] Shin, C., Doermann, D., Rosenfeld, A.: Classification of Document Pages
Using Structure-Based Features. International Journal on Document Analysis
and Recognition 3 (2001) 232–247
[7] Hu, J., Kashi, R., Wilfong, G.: Document classification using layout analysis.
In: DEXA ’99: Proceedings of the 10th International Workshop on Database
& Expert Systems Applications, Washington, DC, USA, IEEE Computer Society (1999) 556
[8] Bagdanov, A.D., Worring, M.: Fine-grained document genre classification using first order random graphs. In: Proceedings of the Sixth International Conference on Document Analysis and Recognition (ICDAR 2001). (2001) 79
[9] Chen, N., Blostein, D.: A survey of document image classification: problem
statement, classifier architecture and performance evaluation. Int. J. Doc.
Anal. Recognit. 10 (2007) 1–16
[10] Trier, Ø.D., Jain, A.K., Taxt, T.: Feature extraction methods for character recognition - a survey. Pattern Recognition 29 (1996) 641–662
[11] Okun, O., Doermann, D., Pietikainen, M.: Page Segmentation and Zone Classification: The State of the Art. Technical Report LAMP-TR-036, CAR-TR-927, CS-TR-4079, University of Maryland, College Park (1999)
[12] Mao, S., Rosenfeld, A., Kanungo, T.: Document structure analysis algorithms: a literature survey. Volume 5010., SPIE (2003) 197–207
[13] Esposito, F., Malerba, D., Lisi, F.A.: Machine learning for intelligent processing of printed documents. Journal of Intelligent Information Systems 14 (2000) 175–198
[14] Maderlechner, G., Suda, P., Brückner, T.: Classification of documents by
form and content. Pattern Recogn. Lett. 18 (1997) 1225–1231
[15] Taylor, S., Lipshutz, M., Nilson, R.: Classification and functional decomposition of business documents. Volume 2. (1995) 563–566 vol.2
[16] Phillips, I.T., Chen, S., Haralick, R.M.: CD-ROM document database standard. (1995) 198–203
[17] Sauvola, J., Kauniskangas, H.: MediaTeam document database. Website (1999) http://www.mediateam.oulu.fi/MTDB/ (last visit 08/07/09).
[18] Shannon, C.E.: A mathematical theory of communication. The Bell System
Technical Journal 27 (1948) 379–423, 623–656
[19] Cover, T.M., Thomas, J.: Elements of Information Theory. John Wiley and
Sons Inc. (1991)
[20] Yeung, R.W.: A First Course in Information Theory. Springer (2002)
[21] Gatos, B., Papamarkos, N., Chamzas, C.: Skew detection and text line position determination in digitized documents. Pattern Recognition 30 (1997) 1505–1519
[22] Gonzalez, R.C., Woods, R.E.: Digital Image Processing. Prentice Hall, Upper
Saddle River (NJ), USA (2002)
[23] Freixenet, J., Muñoz, X., Raba, D., Martı́, J., Cufı́, X.: Yet another survey
on image segmentation: Region and boundary information integration. In:
European Conference on Computer Vision, Copenhagen, Denmark (2002)
408–422
[24] Sahoo, P.K., Soltani, S., Wong, A.K., Chen, Y.C.: A survey of thresholding
techniques. Comput. Vision Graph. Image Process. 41 (1988) 233–260
[25] Sezgin, M., Sankur, B.: Survey over image thresholding techniques and
quantitative performance evaluation. Journal of Electronic Imaging 13 (2004)
146–168
[26] Hartigan, J.A., Wong, M.A.: A k-means clustering algorithm. Applied Statistics 28 (1979) 100–108
[27] Lavallee, S.: Registration for computer-integrated surgery: Methodology, state of the art. In: Computer Integrated Surgery: Technology and Clinical Applications. MIT Press, Cambridge, Massachusetts (1995) 77–97
[28] Lehmann, T.M., Gönner, C., Spitzer, K.: Survey: Interpolation methods in medical image processing. IEEE Transactions on Medical Imaging 18 (1999) 1049–1074
[29] Unser, M.: Splines. A perfect fit for signal and image processing. IEEE
Signal Processing Magazine 16 (1999) 22–38
[30] Collignon, A., Vandermeulen, D., Suetens, P., Marchal, G.: Automated
multi-modality image registration based on information theory. Computational Imaging and Vision 3 (1995) 263–274
[31] Press, W., Teulokolsky, S., Vetterling, W., Flannery, B.: Numerical Recipes
in C. Cambridge University Press (1992)
[32] Maes, F., Vandermeulen, D., Marchal, G., Suetens, P.: Multimodality image
registration by maximization of mutual information. In: IEEE Proceedings
of the Workshop on Mathematical Methods in Biomedical Image Analysis.
(1996) 187–198
[33] Studholme, C.: Measures of 3D medical image alignment. PhD thesis, Computational Imaging Science Group, Division of Radiological Sciences, United Medical and Dental Schools of Guy's and St Thomas' Hospitals (1997)
[34] Tsao, J.: Interpolation artifacts in multimodal image registration based on
maximization of mutual information. IEEE Transactions on Medical Imaging
22 (2003) 854–864
[35] Nokia: Qt Software. (Website) http://www.qtsoftware.com/products (last visit 08/07/09).
[36] Sun Microsystems: MySQL. (Website) http://www.mysql.com/ (last visit 08/07/09).
[37] Bradski, G., Kaehler, A.: Learning OpenCV: Computer Vision with the
OpenCV Library. O’Reilly, Cambridge, MA (2008)
[38] Microsoft Corporation: Microsoft .NET Framework. (Website) http://www.microsoft.com/NET/ (last visit 08/07/09).