There are several practical uses for a system that can automatically detect and recognise text in natural images. For example, it may assist visually impaired people in navigating grocery stores [28] or urban environments [3], or provide an additional source of information for an autonomous navigation system. More generally, the text in natural images is a valuable source of information about the underlying scene being depicted.
Text recognition in natural photographs, on the other hand, presents a unique set of challenges. Optical character recognition (OCR) for scanned documents is often nearly flawless, but the more general problem of detecting and recognising text in unconstrained photos remains unsolved. Text recognition in scene photographs is substantially more difficult because of the wide variation in backgrounds, textures, typefaces, and lighting conditions. As a result, developing models and representations that are robust to these variations is essential if we are to design an end-to-end text recognition system. It is therefore no surprise that today's complete text detection and character recognition systems rely on intelligently hand-engineered features [10, 11] to capture and represent the underlying data. The raw detection or recognition responses must often be combined into a complete system using complex models such as conditional random fields (CRFs) [32] or pictorial-structure models [38].
Drawing on recent developments in machine learning, and more specifically in unsupervised feature learning, I present an alternative approach to text recognition in this thesis. Feature learning techniques [8, 15, 16, 19, 23, 33] can learn low-level features automatically, offering an alternative to hand-engineering the features used for representation. Similar methods have been applied in many related areas, such as image classification [42] and action classification [20]. In particular, a simple feature-learning architecture that requires minimal feature engineering or prior knowledge has already enabled text detection and character recognition [7].
Using these feature-learning methods, we created a set of features tailored specifically to the text recognition problem. We then trained a larger convolutional neural network (CNN) on top of the learned features. Hierarchical neural networks with huge representational capacity have been used successfully for handwriting recognition [21], visual object recognition [4], and character recognition, to name just a few applications. These architectures allowed us to train highly accurate text detection and character recognition modules. Because of their structural similarity, we designed network architectures that can be employed for both text detection and character recognition. Using basic and standard post-processing procedures such as non-maximal suppression (NMS) [29] and beam search [34], a complete end-to-end system could be built. Our method beat the previous state of the art on the ICDAR 2003 [25] and Street View Text (SVT) [38] benchmarks while remaining straightforward to implement. In other words, our results show that a text recognition system built from scratch, without hand-coded features or prior knowledge, is a feasible alternative.
All of this, including a review of the relevant literature and a detailed analysis of the proposed approach, is contained in this thesis. Chapter 2 covers the background and related work on scene text recognition, unsupervised feature learning, and convolutional neural networks. Chapter 3 gives a high-level explanation of the various components of the complete recognition system, including a thorough description of the text detection module. Because Tao Wang did the bulk of the work on the character recognition and final integration modules, this thesis does not go into as much detail about them. Experimental results for text detection and end-to-end recognition are examined in Chapter 4. The final chapter summarises the main findings of the thesis and gives some closing thoughts.
Finally, Adam Coates and Professor Andrew Ng provided guidance and support throughout the development of the end-to-end text recognition system built with Tao Wang. To highlight the collaborative character of this effort, I often use the terms "our system" and "our work" when discussing the system.
Text recognition has been a long-standing problem in machine learning and computer vision. There are two essential components to full end-to-end recognition: text localisation and word recognition. Localisation identifies the locations of specific words or lines of text within an image; once their locations are known, individual words and lines of text can be segmented and recognised. Figure 2.1 shows an example of the end-to-end recognition task.
A great deal of time and effort has gone into the various aspects of the text recognition problem over the years. There are already algorithms that perform exceedingly well on specific tasks such as digit recognition in constrained environments. For example, the algorithm in [5] performs handwritten digit recognition almost as well as a person, and the system in [7] achieves very high accuracy in recognising English characters. There is still a long way to go, however, before we can reliably detect and recognise text in complex natural images.
While scene text recognition has been studied extensively, many researchers have focused on a single component. Text detection, character/word segmentation, and character/word recognition have attracted the most attention. I describe each of these subsystems in detail in the following sections.
As mentioned previously, text detection or localisation aims to find the text regions in an image. Each word or line of text in the image is typically identified by a surrounding box or rectangle produced by the detection process. Several approaches have been proposed for text detection. These systems vary in complexity, from simple off-the-shelf classifiers with hand-coded features to more complicated multistage pipelines integrating many distinct algorithms and processing layers. For example, the system in [32] combines a conditional random field (CRF), a multi-stage pipeline, and substantial preprocessing, such as binarisation of the input image, to detect text lines. Other work on text detection has devised novel features and transformations that are well suited to this particular task. For instance, the robust, state-of-the-art text detection algorithm in [11] uses the regularity of character stroke width.
Text segmentation and recognition follow a similar path. Let me begin with a quick overview of text segmentation and recognition as a research topic. Segmentation is the process of separating a single line or word of text into its component words or characters. In the context of an end-to-end system, the input line or word is a region produced by the text detector. Character or word recognition, in turn, identifies the segmented characters and words. By recognising each character in a word and concatenating the results, we can identify the underlying word. In the end, the output of segmentation and recognition is a collection of annotated bounding boxes, as shown in Figure 2.1, where each bounding box's label or annotation is the word it contains.
Many approaches have been used to tackle segmentation and recognition, just as for detection. These include probabilistic graphical models for joint segmentation and recognition, multi-stage hypothesis-verification pipelines, and pictorial structure models [13, 40, 41]. Many models, geometric or linguistic, have been used to integrate prior knowledge into this problem. For example, the notion that certain letters are taller than others, or may be interchangeable because of their similarity (e.g., the uppercase "S" and lowercase "s"), is encoded in the models of [30]. Language models, on the other hand, describe the typical distribution of characters within words; for example, they can capture the fact that certain bigrams (two-character sequences) are more prevalent than others in English.
A lexicon is a common component used to improve the performance of recognition systems, since the ultimate objective is to recognise and locate text in images (e.g., [30, 38, 40]). The set of terms expected in the scene or setting is called the lexicon. The lexicon might simply be a list of English words, but it can also include famous names, locations, and abbreviations. Using a lexicon allows the model to correct some of its errors: it may misread a letter in a word but still recognise the proper term by selecting the closest match to the predicted word from the list of available words.
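To make this correction step concrete, here is a minimal sketch (an illustration, not part of the original system) that snaps a raw prediction to the closest lexicon entry using a standard edit distance:

```python
def edit_distance(a, b):
    """Standard Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def correct_with_lexicon(predicted, lexicon):
    """Return the lexicon word closest to the raw prediction, so a single
    misread letter (e.g. 'shcp') can still map to the intended word ('shop')."""
    return min(lexicon, key=lambda w: edit_distance(predicted.lower(), w.lower()))
```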
A text recognition system's performance improves as the lexicon shrinks. At first, requiring a lexicon may seem to be a significant restriction on the system; yet, in many cases, a small and constrained lexicon can be obtained with ease. Consider how many words would appear in a grocery store scene: probably a small fraction of all words in the English language. Similarly, location data often carries information about nearby street or business signs. When a user's location is known, an Internet search may discover local stores, and the system can use this information to create a relevant lexicon. A small lexicon (containing 50-500 words) lowers the model's generality, but in many circumstances this is acceptable. Because our system operates mostly in this reduced, lexicon-constrained environment, we focused on it when designing our approach. I call this the lexicon-driven or lexicon-constrained recognition framework. This thesis also describes an alternative method that can be used in a more general setting when a specialised lexicon is not readily available.
The way data is represented significantly impacts a model's performance in machine learning, so developing a high-performance model requires appropriate representations of the data. The examination of scene text recognition in Section 2.1 shows that many strategies applied successfully in text detection and recognition have depended on carefully hand-engineered features [3, 11, 30]. Such features are needed to provide a robust representation of the data in the face of variations in lighting, texture, typeface, background, and image quality, as described in Chapter 1. Hand-engineering, however, is a time-consuming and costly process that cannot solve this representation problem in many circumstances.
Machine learning researchers have recently concentrated on developing algorithms capable of learning these underlying representations, or features, from the data itself. With unsupervised feature learning methods, these features are learned from unlabeled data. Such feature-learning systems therefore offer a new way to develop highly specialised features for text detection and recognition. These methods can also produce many more features than hand-engineering alone, and the resulting higher-dimensional feature banks can then be used to boost the performance of existing classification methods. For instance, using an unsupervised learning technique, the system in [8] learns more than 4,000 features to attain state-of-the-art character recognition performance.
Image classification [42], sentiment analysis [26], and text recognition [7] are just a few of the machine learning applications that have made use of unsupervised feature-learning techniques. Sparse autoencoders, sparse coding, and K-means are all examples of feature learning algorithms that have been extensively studied in the literature, and the methods differ in computational complexity and scalability. Because of the high computational costs involved, most of these techniques are not suitable for large images such as those supplied to a text recognition system. Because of its speed and simplicity (requiring almost no hyperparameters), the K-means variant described in [7, 8] was used in our study.
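As a rough illustration of this style of feature learning, the sketch below runs a spherical K-means variant over flattened grayscale patches. It is a simplified stand-in for the procedure of [7, 8]: whitening and other preprocessing are omitted, and the patch size and number of centroids are assumptions made only for the example.

```python
import numpy as np

def learn_kmeans_features(patches, k=96, iters=10, eps=1e-8):
    """Learn a bank of low-level filters with (spherical) K-means.
    `patches` is an (n, 64) array of flattened 8-by-8 grayscale patches."""
    # Normalise each patch (brightness and contrast) before clustering.
    patches = patches - patches.mean(axis=1, keepdims=True)
    patches = patches / (patches.std(axis=1, keepdims=True) + eps)

    # Initialise k centroids from randomly chosen normalised patches.
    centroids = patches[np.random.choice(len(patches), k, replace=False)].copy()
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True) + eps

    for _ in range(iters):
        # Assign each patch to its most similar centroid (dot-product similarity).
        assignments = (patches @ centroids.T).argmax(axis=1)
        # Re-estimate each centroid as the mean of its assigned patches.
        for j in range(k):
            members = patches[assignments == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
        centroids /= np.linalg.norm(centroids, axis=1, keepdims=True) + eps

    return centroids  # each row is one learned 8-by-8 filter
```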
Machine learning and computer vision can thus benefit from automatically learning domain-specific features from the data, as an alternative to the more standard approach of hand-engineering. As I show in this thesis, we can create a high-performing and resilient model with essentially no hand-tuning by combining these learned features with the representational capability of a convolutional neural network. In other words, this work breaks away from several commonly used techniques for text detection and recognition.
So far, I have outlined some of the detection and recognition approaches researchers currently employ. Some of these solutions combine complex models with intelligently crafted features for the specific task at hand. With this project, I aim to show an alternative design that does not rely on custom-made features or very complex models incorporating a large amount of prior knowledge. Unsupervised feature learning replaces the hand-engineering of features, and we subsequently used these learned features to train a convolutional neural network (CNN) [21, 22]. Both standard neural networks and convolutional neural networks are covered in this section.
After this preparatory work, the structure of the convolutional neural network can now be described in more detail. At its most basic level, a convolutional neural network is a multilayer, hierarchical neural network. Local receptive fields, weight sharing, and spatial pooling or subsampling layers separate the CNN from the basic feedforward neural networks discussed in Section 2.3.1. In the context of a visual recognition task, I examine each of these three characteristics separately. Consider, for example, a single 32-by-32 image patch as the CNN's input; this input could be a 32-by-32 grid of pixel intensity values.
In the fully connected networks of Section 2.3.1, every neuron in a layer was linked to every neuron in the previous layer: each hidden-layer neuron computed a function of the values of every node in the input layer. Visual recognition, on the other hand, often benefits from exploiting the local structure of the image. Nearby pixels tend to be strongly correlated, while pixels far apart are uncorrelated or only loosely correlated. It is therefore not surprising that many common feature descriptors in computer vision are based on local characteristics of the image [9, 24]. In the CNN architecture, each neuron uses only the variables from the previous layer that are spatially local to it.
For example, if the input is a 32-by-32 image patch, each neuron in the first hidden layer of a CNN might depend only on an 8-by-8 sub-window. The collection of input-layer nodes that influence a neuron's activity is known as the receptive field of the neuron; intuitively, this is the part of the image that the neuron "sees". Since each neuron depends only on its own local receptive field, a CNN's connectivity is localised rather than global: adjacent layers are not fully connected, resulting in a sparser set of edges in the network design.
The second characteristic separating CNNs from standard neural networks is that edge weights are shared across neurons within a hidden layer. Each neuron computes a weighted linear combination of its inputs, which is then fed into the rest of the network; we may think of this as evaluating a linear filter over the input. When weights are shared, many neurons in a hidden layer evaluate the same filter across different sub-windows of the input image. Thus, the CNN may be seen as learning a set of filters F = {F1, ..., Fn}, where each filter is applied to every sub-window of the input image. Applying the same set of filters across the whole image forces the network to learn a generic encoding or representation of the underlying data. Constraining the weights of different neurons also has a regularising effect on the CNN, allowing the network to generalise better in many visual recognition settings. Weight sharing has the further advantage of reducing the number of free parameters in the CNN, which makes training simpler and more efficient. Evaluating a filter F over every window of an input image I is simply the convolution of I with F. As a result, in the convolutional stage of a CNN, we convolve the input image with each filter in F to obtain the convolutional responses.
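To make the convolutional stage concrete, here is a deliberately naive sketch of evaluating a shared filter bank over every sub-window of an image; the explicit loops are for clarity only, and practical implementations use optimised convolution routines instead.

```python
import numpy as np

def convolve_filters(image, filters):
    """Evaluate every filter over every sub-window of `image`.
    `image` is (H, W); `filters` is (n, f, f). Returns an
    (H - f + 1, W - f + 1, n) convolutional response map."""
    H, W = image.shape
    n, f, _ = filters.shape
    out_h, out_w = H - f + 1, W - f + 1
    responses = np.zeros((out_h, out_w, n))
    for y in range(out_h):
        for x in range(out_w):
            window = image[y:y + f, x:x + f]
            # The same (shared-weight) filter bank is applied to every window.
            responses[y, x, :] = (filters * window).sum(axis=(1, 2))
    return responses
```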
Subsampling or pooling layers are the last characteristic distinguishing a CNN from other classifiers. They serve the dual purpose of reducing the dimensionality of the convolutional responses and increasing the model's translational invariance. Spatial pooling [2] is the conventional method: the convolutional response map is divided into (generally disjoint) m-by-n blocks, and a pooling function is evaluated over the responses in each block. This yields a smaller response map with one response per block. The pooling function is typically either max pooling, in which the pooled response is the maximum value across the block, or average pooling, in which it is the average. An example of average pooling is shown in Figure 2.3. Here, the 4-by-4 grid of the convolutional response map is divided into four 2-by-2 blocks (the shaded regions in Figure 2.3) arranged in a 2-by-2 grid. The pooled response for each block is the average of all the values in the block, giving a final 2-by-2 pooled response map. This is a considerable reduction in dimensionality compared to the original 4-by-4 convolutional response map.
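The following short sketch illustrates average pooling over disjoint blocks, including the 4-by-4 to 2-by-2 reduction described above (the numeric values are illustrative, not those of Figure 2.3):

```python
import numpy as np

def average_pool(response_map, block):
    """Average-pool an (H, W) response map over disjoint block-by-block regions;
    H and W are assumed to be multiples of `block`."""
    H, W = response_map.shape
    reshaped = response_map.reshape(H // block, block, W // block, block)
    return reshaped.mean(axis=(1, 3))

# A 4-by-4 response map pooled over 2-by-2 blocks yields a 2-by-2 map.
m = np.arange(16, dtype=float).reshape(4, 4)
print(average_pool(m, 2))
```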
CNNs alternate multiple layers of convolution and pooling. For example, we may add a second convolution-pooling layer on top of the outputs of the first, feeding the first layer's results into the second set of convolution and pooling operations. This method can be used to create multilayered architectures. The low-level convolutional filters of the initial layers can be thought of as encoding the input data at the lowest level; for image data, these may be basic edge filters. Increasingly complex structures are learned as the network progresses through higher layers. By using many layers and a large number of filters, the CNN architecture can deliver an enormous amount of representational power.
Error backpropagation, a standard neural network training method, is used to train a CNN. Handwriting recognition [21], visual object recognition [4], and character recognition [35] are just a few of the classification tasks on which convolutional neural networks have seen success. With distributed computation and GPUs, it is now feasible to train significantly larger and more powerful CNNs that attain state-of-the-art performance on standard benchmarks [4, 27]. In short, we can combine the representational capacity of these networks with the resilience of features obtained from unsupervised techniques to create robust text detection and recognition systems that are straightforward to implement. With such accurate and robust components, end-to-end results may be obtained using only simple post-processing procedures.
This chapter describes the learning architecture used to train our text detection and character recognition components, the foundational pillars of our end-to-end system. After that, I go into detail about how we combined these two components into a single, seamless design. To set the stage, I first provide a high-level overview of our text recognition system. The system had two main components: the text detector and the character classifier.
To create the text detector, we first trained a binary classifier that decides whether or not a single image patch contains a well-centred character. We then evaluated this binary classifier over all 32-by-32 windows in the image to compute detector responses across the whole image. By sliding the fixed-size detection window over the entire picture, we were able to identify candidate lines and groups of text. The character classifier was trained to determine which of 62 possible characters (26 uppercase letters, 26 lowercase letters, and ten digits) was present in a 32-by-32 input patch. Notably, we did not include a non-character class in the character classifier, since it was built on the assumption that the input patch contained exactly one character. We then identified the characters in each region or line of text by sliding the character classifier over the regions or lines that the text detector had found. To obtain the final result, a set of annotated bounding boxes for each word in the image, we used a beam search to integrate the outputs of the text detection and text recognition modules. Figure 3.1 shows a simplified version of this recognition pipeline.
The text detection and recognition modules are discussed in detail in this part, followed by a brief discussion of the datasets used to train the detector and the recogniser. A two-layer convolutional neural network was the foundation for both the text detector and the character recogniser. In both cases, unsupervised feature learning algorithms were used to discover a collection of low-level features that could describe the data, and the network was then trained discriminatively by backpropagating the L2-SVM classification error. We employed a convolutional neural network with two layers, similar to those described in [21, 35], for both text detection and character recognition. As stated in Section 2.3.2, each convolutional layer of the CNN was followed by a spatial pooling layer. The outputs of the second spatial pooling layer were fed into a fully connected classification layer that performed either binary classification (for the text detection module) or 62-way classification (for the character recognition module).
This first-layer representation was computed efficiently by convolving the input patch with each of our learned filters. A spatial pooling layer was then added on top of the convolutional layer. Using average pooling, we reduced the dimensionality of the response map and achieved a degree of translational invariance: rather than keeping the full 25-by-25-by-d response map from the first convolutional layer, we averaged the activations over 25 blocks arranged in a 5-by-5 grid. This procedure yielded a 5-by-5-by-d response map.
On top of the first layer's output, we applied a second layer of convolution and average pooling. The second convolutional layer used a set of 2-by-2-by-d filters. As in the first layer, we convolved each 2-by-2-by-d filter with the 5-by-5-by-d response from the first layer, producing 4-by-4-by-d2 convolutional response maps in the second layer, where d2 is the number of second-layer filters. We used d2 = 256 filters for detection and d2 = 720 filters for recognition. More filters were needed for the recogniser because it performed a 62-way classification rather than a binary classification like the detector; to obtain comparable results, the character recogniser needed more expressive capacity and thus more filters. We applied the activation function h to the convolutional responses in the second layer, just as in the first convolutional layer. The final output of the second layer was a 2-by-2-by-d2 representation obtained by averaging the responses over four blocks arranged in a 2-by-2 grid over the image. A one-versus-all, multiclass support vector machine (SVM) was trained on this 2-by-2-by-d2 representation, performing binary classification for detection and 62-way classification for recognition.
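As a hedged sketch of this architecture, the layer sizes described above can be written as follows. PyTorch is used purely as illustrative notation, not as the framework actually employed; the ReLU stands in for the unspecified activation h, and the first-layer filter count d1 is an assumption for the example.

```python
import torch.nn as nn

def build_cnn(d1=96, d2=256, n_classes=2):
    """Two-layer convolutional architecture sketched from the description above:
    d2 = 256 and n_classes = 2 for detection, d2 = 720 and n_classes = 62 for recognition."""
    return nn.Sequential(
        nn.Conv2d(1, d1, kernel_size=8),   # 32x32 input -> 25x25xd1 responses
        nn.ReLU(),                         # stand-in for the activation h
        nn.AvgPool2d(kernel_size=5),       # 25x25xd1 -> 5x5xd1 (5-by-5 grid of blocks)
        nn.Conv2d(d1, d2, kernel_size=2),  # 5x5xd1 -> 4x4xd2
        nn.ReLU(),
        nn.AvgPool2d(kernel_size=2),       # 4x4xd2 -> 2x2xd2
        nn.Flatten(),
        nn.Linear(4 * d2, n_classes),      # fully connected classification layer
    )
```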
Using error backpropagation [1], we minimised this objective by fine-tuning, or updating, the CNN weights. Only the parameters of the classification layer and of the second convolutional and average-pooling layers were fine-tuned; the pretrained filters in the first layer were kept fixed. Fine-tuning only a portion of the CNN layers was computationally cheaper, which was part of the motivation for using K-means features as the low-level encoding. Because we were working with fairly large networks, we distributed the fine-tuning over multiple graphics processing units (GPUs). GPUs and distributed computing are becoming more widespread in large-scale machine learning [4, 6] and are frequently required to build and train more sophisticated, representationally powerful models.
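A minimal sketch of this partial fine-tuning, assuming the model built in the previous sketch, freezes the first (K-means-initialised) convolutional layer and backpropagates the squared-hinge (L2-SVM) error through the remaining layers; the multi-GPU distribution is omitted, and the optimiser settings are illustrative.

```python
import torch

def fine_tune(model, loader, epochs=1, lr=1e-3):
    """Fine-tune all layers except the first convolutional layer."""
    loss_fn = torch.nn.MultiMarginLoss(p=2)   # multiclass squared hinge (L2-SVM) loss
    for p in model[0].parameters():           # keep the pretrained first-layer filters fixed
        p.requires_grad = False
    trainable = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.SGD(trainable, lr=lr)
    for _ in range(epochs):
        for x, y in loader:                   # x: (B, 1, 32, 32) patches, y: class labels
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
    return model
```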
As an introduction, I briefly describe the training datasets for the text detector and character recogniser before moving on to the integration method. A "positive" text detection example was defined as a single 32-by-32 image patch containing a well-centred character (some examples are shown in Figure 3.4). Negative examples were patches in which the character was off-centre or which contained only background. This distinction gives a clear definition of what constitutes a positive example. Under this definition of "positive" and "negative", the detector selected for and responded to windows with well-centred characters.
As discussed in Section 3.3.1, this definition significantly influenced the space estimation procedure. In much of the current text recognition literature [7, 30, 38], researchers have included synthetic training examples in their datasets to increase model performance. We followed the same approach and created high-quality synthetic training images using 665 fonts to train both the text detector and the character recogniser. Our first step was to sample the character and background grayscale levels from normal distributions with the same mean and standard deviation as the ICDAR 2003 training images [25], and use these values to build the synthetic instances. We then added Gaussian blurring to random areas of the image and, as a final touch, composited natural backgrounds to create the impression of background clutter.
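The sketch below illustrates the synthetic-example idea using PIL; the font path, glyph placement, and blur range are assumptions made only for illustration, and the compositing onto natural backgrounds is omitted.

```python
import numpy as np
from PIL import Image, ImageDraw, ImageFont, ImageFilter

def synth_character(char, font_path, fg_mu, fg_sigma, bg_mu, bg_sigma, size=32):
    """Render one synthetic 32-by-32 character patch: sample character and
    background grey levels from normal distributions, draw the glyph, then blur."""
    fg = int(np.clip(np.random.normal(fg_mu, fg_sigma), 0, 255))
    bg = int(np.clip(np.random.normal(bg_mu, bg_sigma), 0, 255))
    img = Image.new("L", (size, size), color=bg)
    font = ImageFont.truetype(font_path, size=24)       # one of the fonts used
    ImageDraw.Draw(img).text((size // 4, 0), char, fill=fg, font=font)
    img = img.filter(ImageFilter.GaussianBlur(radius=np.random.uniform(0.0, 1.5)))
    return np.asarray(img)
```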
The figure compares our synthetic examples with samples from the ICDAR 2003 training images. In addition to the synthetic data, we trained the detector and the recogniser on examples from the ICDAR 2003 training images [25] and an English subset of the Chars74k dataset [10], using almost 100,000 examples in total.
In this section, I describe how we generate a list of candidate lines of text from a high-resolution image, beginning with a high-level overview of the procedure. We first used the text detector to perform sliding-window detection over the image, producing a response map in which each entry represented the likelihood that the corresponding 32-by-32 window contained text. We repeated this procedure at various image scales to obtain a collection of multiscale response maps. From these response maps, we derived and scored bounding boxes for candidate text lines in the original image. This collection of candidate bounding boxes was then reduced to our final set of bounding boxes using non-maximal suppression (NMS) [29].
The input was a high-resolution image (640-by-480 to 1600-by-1200 pixels) over which the 32-by-32 window detector responses were evaluated. Sliding-window detection is useful for a broad range of visual tasks, such as pedestrian detection or face recognition. The detection CNN computed the detector response for each 32-by-32-pixel window in the image with a forward propagation step, so the detector assigned each window a confidence score. A large positive value indicated high confidence that the window contained text, whereas a large negative value indicated high confidence that the window contained no text. Note that, by our standards in Section 3.1.3, a well-centred character is required for a window to "contain text".
Because the detection window was fixed at 32-by-32 pixels, we had to apply this sliding-window procedure at various image scales to recognise text of different sizes. Both upsampling and downsampling of the image were used to capture various text sizes. Thirteen scales were employed in the final system, ranging from 150 per cent down to 10 per cent of the original image size. For example, if an input image is scaled down by 50%, a 32-by-32 detection window corresponds to a 64-by-64 window over the original image, so text of different sizes can be detected with a single fixed-size window. We created a multiscale response map for each of these scales. This sliding-window response procedure is shown in Figure 3.5.
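A simplified sketch of this multiscale sliding-window step follows; `detector` is a placeholder for the detection CNN's scoring function, and the explicit per-window loop stands in for the single convolutional pass actually used to evaluate all windows at once.

```python
import numpy as np

def multiscale_response_maps(image, detector, scales=np.linspace(1.5, 0.1, 13), win=32):
    """Rescale the PIL image to each scale and score every win-by-win window."""
    maps = {}
    for s in scales:
        w, h = int(image.width * s), int(image.height * s)
        if w < win or h < win:
            continue
        scaled = np.asarray(image.resize((w, h)), dtype=float)
        resp = np.zeros((h - win + 1, w - win + 1))
        for y in range(resp.shape[0]):
            for x in range(resp.shape[1]):
                resp[y, x] = detector(scaled[y:y + win, x:x + win])
        maps[s] = resp   # one response map per scale
    return maps
```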
From the multiscale response maps for an input image, we then estimated a bounding box for each line of text. To do so, we made the simple assumption that text tends to be horizontally aligned in natural images; most text in natural photos does fall along horizontal lines, so this was a reasonable assumption. For horizontal lines (or lines with a slight tilt), the detector responses along a single line of the image (a 32-by-w window, where w is the image width) correspond to one row of the response map. Figures 3.6a and 3.6b compare the detector responses along two distinct lines in the input image.
There are two things to take away from these examples. First, the detector responses are generally positive for centred text and generally negative for uncentred text. This behaviour was expected, because a positive example had been defined as a window containing a well-centred character; windows showing only part of a character, such as those along an off-centre line of text, would not generate a positive response. Second, we observed that the peaks of the detector response aligned with character locations. Again, given our definition of a positive example, this behaviour was not unexpected.
Because a line of text often contains multiple characters, and those characters appear as multiple peaks in the detector response profile along the line, we applied non-maximal suppression (NMS) [29]. We could then determine whether a line contained text based on the number of peaks and the magnitudes of the peak responses. Let R(x) denote the detector response at each position x along a single line. The non-maximally suppressed response R'(x) was computed from R(x) by keeping R(x) only where it is the maximum within a window of width δ around x, and setting it to zero otherwise, where δ is the window width over which we performed NMS, chosen by cross-validation. Thus NMS eliminated any responses that were not maximal in their immediate neighbourhood. Figure 3.6c shows NMS in action on a centred line of text.
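The non-maximal suppression along a single line can be sketched as follows, with `delta` playing the role of the cross-validated window width described above:

```python
import numpy as np

def nms_1d(r, delta):
    """Keep r[x] only if it is the maximum within +/- delta positions; zero otherwise."""
    r = np.asarray(r, dtype=float)
    out = np.zeros_like(r)
    for x in range(len(r)):
        lo, hi = max(0, x - delta), min(len(r), x + delta + 1)
        if r[x] >= r[lo:hi].max():
            out[x] = r[x]
    return out
```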
We then computed the average peak value of R' along a single line and compared it to a predetermined threshold, chosen by cross-validation. The line was considered positive (containing text) if the average peak exceeded the threshold and negative (not containing text) otherwise. If the line contained text, the leftmost and rightmost peaks in the NMS response map were used as the extents of the text bounding box. This approach to estimating the bounding box has two possible flaws. First, there were instances when two or more groups of words lay on the same line but were separated by a significant gap or non-text region. Figure 3.7 depicts one such example: the terms Ipswich and London, and Gt Yarmouth and Chelmsford, lie nearly on the same lines. A single large bounding box covering the whole image width would be illogical, since each such line contains two distinct text regions.
The second issue was that the response maps could include false positives, resulting in misleading peaks in the NMS response along a line, so that non-text regions were mistaken for text regions. The variation in the spacing between NMS peaks differentiated these false detections from true text: in a genuine line of text, the distance between peaks is relatively constant, like the spacing between letters, whereas spurious peaks show no such regularity. As a result, we were able to use the consistency of the responses both to separate groups of words that belong on the same line and to eliminate some of the spurious detections. Specifically, from the NMS profile along the line, we computed the distances between consecutive peaks and their median m. Whenever a peak separation was greater than some scalar multiple of the median separation m, we split the line at that point. If either of the two segments created by the split contained only a single peak, we eliminated that segment, because a single isolated peak was considerably more likely to be caused by a spurious detection than by a single, solitary character in the image.
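The splitting heuristic can be sketched as below; the multiplier applied to the median gap is an illustrative assumption rather than the value actually used.

```python
import numpy as np

def split_line_at_gaps(peak_positions, factor=2.0):
    """Split a line wherever the gap between consecutive NMS peaks exceeds
    `factor` times the median gap, then drop single-peak segments."""
    peaks = np.sort(np.asarray(peak_positions))
    if len(peaks) < 2:
        return []
    gaps = np.diff(peaks)
    median_gap = np.median(gaps)
    segments, start = [], 0
    for i, gap in enumerate(gaps):
        if gap > factor * median_gap:       # unusually large gap: word-group boundary
            segments.append(peaks[start:i + 1])
            start = i + 1
    segments.append(peaks[start:])
    # Isolated peaks are more likely spurious detections than real characters.
    return [seg for seg in segments if len(seg) > 1]
```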
Let me illustrate this line-splitting technique with a concrete example. Consider again Figure 3.7, and the line containing Ipswich and London in particular. The NMS profile for this line shows peaks for the characters of Ipswich and of London, but no significant activations in the region between the two words. The median distance between peaks is roughly the spacing between adjacent characters within Ipswich or within London, whereas the gap between the peak for the last character of Ipswich and the peak for the first character of London is much larger than this median. Consequently, the line was split into two segments, one ending at the final character of Ipswich and the other beginning at the first character of London. Using this heuristic, the line is divided into its groups of words. Spurious responses, in turn, appear as isolated peaks in the NMS profile over a line and are filtered out.
As a further step, we computed a score for each line's bounding box as the average magnitude of the peaks within that box. As previously mentioned, strong peak responses indicated greater certainty that a character occupied the underlying position, so higher scores were associated with boxes that were more likely to contain text.
In the previous section, I described a multiscale, sliding-window strategy for recognising lines and groups of text in an image. Here I present our process for obtaining final end-to-end text recognition results from these bounding boxes. At a high level, there were three stages to this procedure. The first step was to estimate the locations of the word boundaries within a particular bounding box. Next, we used the character recogniser to identify the characters inside the bounding box. Finally, we used a beam search to find the most probable word boundaries (spaces) and the individual word identities along the line. The outcome of this approach is an annotated set of bounding boxes; a sample image with its annotated bounding boxes is shown in the following figure.
Several steps were needed before the end-to-end result could be obtained. To estimate candidate spaces, we used the same approach outlined in Section 3.2.2 and examined the detector response profile along a single line of text. Recall from Section 3.1.3 that the detector was trained to respond positively to images with well-centred characters; negative examples were images in which the characters were only partly visible or off-centre. Consequently, if a window is centred on the gap between two characters, the detector should respond negatively. Based on this, we could estimate where the empty spaces were located within a single line of text: just as we had used positive peak responses to find characters, we looked for negative peak responses to find blank gaps along a line, applying NMS to the negative responses along the line.
The next two parts of the end-to-end system description explain how the character recognition module was integrated with the text detection and space estimation procedures. Please note that Tao Wang did most of the work detailed here and in the part that follows; I include it for completeness.
First, I describe the situation in which we are given a single word in a tightly cropped bounding box, together with a lexicon L of words that may appear in the box. The word recognition task is to identify the word contained in the bounding box. Here we used the same strategy as in the detection phase: the character classifier was slid over the bounding box in a sliding-window fashion. Because we were given a word-level bounding box, we only had to move the character classifier horizontally across the box; contrast this with detection, where the binary classifier must be slid over the entire image. The character classifier generated a 62-way classification response at each horizontal position in the bounding box (an example is displayed in Figure 3.9). A positive response indicated a confident prediction that the window contained a character of that class, while a negative response indicated the opposite. We did not use the peaks of the detection profile to estimate where each character was located, because the detection profile did not always agree with the character recogniser's confidences; since the purpose was to identify the letters in a word, using the classifier responses directly was the more appropriate choice.
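As a sketch of this horizontal sweep of the character classifier, the helper below builds the per-position 62-way response matrix and reads off the most likely class at each position; `char_classifier`, the one-pixel stride, and the class ordering are illustrative assumptions.

```python
import numpy as np

CHARSET = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
               "abcdefghijklmnopqrstuvwxyz"
               "0123456789")

def character_response_matrix(word_image, char_classifier, win=32):
    """Slide the 62-way classifier across a height-32 word image.
    `char_classifier` maps a (32, win) patch to a length-62 score vector."""
    h, w = word_image.shape
    scores = np.stack([char_classifier(word_image[:, x:x + win])
                       for x in range(w - win + 1)])          # (positions, 62)
    best_chars = [CHARSET[i] for i in scores.argmax(axis=1)]  # top class per position
    return scores, best_chars
```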
In this part, I explain how we jointly segmented and recognised a line or group of words. As discussed in Section 3.2, the output of the text detector was often a single line or group of words, and in Section 3.3.1 I explained how we estimated a set of candidate spaces for the word groups found by the text detector. To build a complete end-to-end system, we first assumed that the true segmentation of each bounding box B computed by the text detector could be constructed from the set S of estimated spaces for that bounding box; in particular, if B contains several words, then the space between every pair of adjacent words in B must be included in S. Over-segmentation was permissible for S, but under-segmentation (the absence of a space between two words in B) was not. A beam search is a breadth-first search that explores only the top N partial segmentations of a line or group of words, according to some heuristic score. Using this assumption, we were able to systematically evaluate alternative word-level segmentations, using the heuristic (alignment) score indicated in Section 3.3.2 to judge whether a candidate segmentation was a good fit. We recovered the correct segmentation of the words in the bounding box by searching over the collection of segmentations permitted by the list of candidate spaces. We also thresholded individual segments on their alignment scores to avoid false positives from the text detection step, removing "non-text" segments with poor recognition scores. After the beam search, we obtained both the segmentation of the words in the bounding box and the identity of the word within each segment. As a result, we were able to accomplish full end-to-end recognition.
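The beam search over candidate segmentations can be sketched as follows; `segment_score` is a placeholder for the alignment scoring of Section 3.3.2, and the pruning width and tie-breaking details are assumptions made only for the example.

```python
import heapq

def beam_search_segment(spaces, segment_score, beam_width=10):
    """`spaces` are sorted candidate split positions (including the line ends);
    `segment_score(a, b)` returns the best (word, score) for the span [a, b)."""
    beam = [(0.0, 0, [])]        # (negative running score, index into spaces, words so far)
    complete = []
    while beam:
        next_beam = []
        for neg_score, i, words in beam:
            if i == len(spaces) - 1:                 # reached the end of the line
                complete.append((neg_score, words))
                continue
            for j in range(i + 1, len(spaces)):      # try every possible next word span
                word, score = segment_score(spaces[i], spaces[j])
                next_beam.append((neg_score - score, j, words + [word]))
        beam = heapq.nsmallest(beam_width, next_beam)  # keep only the top partial paths
    return min(complete)[1] if complete else []
```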
The presence of a lexicon has been critical to most of the work discussed so far. There are many distinct sources of specialised lexicons, as mentioned in Section 2.1. A general-purpose system, however, should not assume that a specialised lexicon will always be available. The methods we used to adapt our approach to the absence of a specialised lexicon are discussed in this section. When a specific lexicon is not easily available, the complete English dictionary, perhaps supplemented with a list of common names, abbreviations, and proper nouns, might serve as the system's lexicon. However, the approach described in Sections 3.3.2 and 3.3.3 requires one pass through the lexicon to calculate the alignment score for each candidate segment; because of this, employing the whole English vocabulary would be computationally infeasible, so we looked at other options.
Our strategy was to use a spell-checking system to build a lexicon dynamically. The text detection system was first run as in the lexicon-constrained configuration to generate a candidate set of bounding boxes for the text regions. For each bounding box, we computed the character classifier responses and, as in Section 3.3.2, applied NMS to those responses to estimate the position of each character in the bounding box. Our initial guess for the underlying word was formed by concatenating the most probable character at each estimated position. We then ran the proposed word through Hunspell, an open-source spell checker, whose basic English dictionary we supplemented with lists of commonly used names, locations, and other proper nouns. Hunspell returned a collection of suggested words, which we used to build a lexicon dynamically. Obtaining our final end-to-end results was then simply a matter of applying the method presented in Section 3.3.3. This technique broadened the scope of our system to situations where we lacked access to a specialised lexicon.
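A minimal sketch of this dynamic-lexicon step is given below; it assumes the pyhunspell Python binding and typical Linux dictionary paths, neither of which is specified in the original system.

```python
import hunspell  # pyhunspell binding (assumed); dictionary paths are typical defaults

checker = hunspell.HunSpell('/usr/share/hunspell/en_US.dic',
                            '/usr/share/hunspell/en_US.aff')

def dynamic_lexicon(raw_guess, extra_terms=()):
    """Build a small lexicon from the raw character-level guess: keep the guess if
    the spell checker accepts it, add its suggested corrections, and append any
    supplementary names, places, and abbreviations supplied separately."""
    lexicon = set(extra_terms)
    if checker.spell(raw_guess):
        lexicon.add(raw_guess)
    lexicon.update(checker.suggest(raw_guess))
    return sorted(lexicon)
```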
This chapter presents our evaluation of the whole system on text detection and character recognition, and shows how our text recognition system compares with previously published systems on standard benchmarks. End-to-end evaluations of the recognition system were performed on the Street View Text (SVT) [38] and ICDAR 2003 Robust Reading [25] datasets.
First, let us look at some of the specifics of our evaluations. The ICDAR dataset is a standard benchmark for text detection and recognition systems; its images come from various sources, including book covers, street signs, and business signs. The SVT dataset, in contrast, consists of Google Street View photos. Images from this dataset frequently have lower resolution and display far more variability than those from the ICDAR dataset, which makes recognition over the SVT dataset substantially harder. On the other hand, the SVT dataset better reflects the real-world settings in which a text recognition system would operate. The key point is that the SVT dataset was built for lexicon-driven, end-to-end text recognition: the lexicon does not include all of the text in each picture, and the task is to identify only the words from the lexicon that appear in the image. Words that appear in the scene but are not in the lexicon are excluded from the analysis. Figure 4.1 shows SVT images with their corresponding ground truth annotations.
According to our test results, our text detection module performed worse than many others. A significant drawback of our text detection algorithm was that it could not crop the text tightly. The bounding boxes in Figure 4.2 show that the estimated boxes are larger than the ground truth bounding boxes, which have little padding between the box and the word extents. I discuss two possible reasons why our performance on the ICDAR detection measure was low. First, our detector was trained on well-centred characters, not on tightly cropped words; Figure 3.4 shows examples of the positive patches used to train the detector. The resulting detector was sensitive to the presence of a well-centred character but insensitive to the amount of padding around it. Consequently, a well-centred set of characters in a tightly cropped bounding box received the same score as one in a loosely padded box. Second, the limited number of scales we examined was a contributing factor: because we only looked at a fixed set of scales, the possible heights of the predicted bounding boxes were restricted to a narrow set (one per scale). In light of this constraint, it is likely that the detector simply could not have produced a tighter bounding box for a particular line or group of words.
The purpose-built methods in [11, 30, 32], which often use additional processing steps to produce well-cropped bounding boxes, do not share the limitations of our detection approach described here. That these more specialised algorithms obtained better-cropped bounding boxes, and consequently better performance on this specific criterion, should not come as a big surprise. As a reminder, our objective was not to obtain neatly cropped bounding boxes for their own sake, but to achieve good end-to-end performance. The bounding boxes supplied by the detector were adequate as long as the character recogniser was robust to these differences in padding.
The ICDAR 2003 test set was used to assess the detector's performance, and we report final results at the detection threshold with the greatest F-score. Table 4.1 compares our technique with other methods on the ICDAR 2003 dataset. The results in the table show that our technique fared poorly by the standard measure compared to the alternatives. The ICDAR 2003 Robust Reading ground truths comprise bounding boxes for individual words, while our detection system generated bounding boxes for complete lines of text. Consequently, even when the detector correctly detected the text in the image, the match scores between a bounding box for a whole line and a bounding box for a single word were often relatively low. We therefore also modified the ground truth bounding boxes to see how well the detector could identify lines and groups of text: by coalescing ground truth bounding boxes on the same line, we could measure the detector's precision and recall against this merged ground truth. Figure 4.2 shows an example of this method in action. Testing against the merged bounding boxes significantly improved the measured performance of the detector; in particular, the F-score rose by 0.16 on the ICDAR test set. Because of this change in the ground truth bounding boxes, our results are no longer directly comparable to those of other researchers in this field. On the other hand, this corrected statistic provides a more accurate picture of how effectively the detector identified lines of text.
This part evaluates the entire text recognition pipeline. As a starting point, I briefly describe the system's performance on cropped-character and cropped-word recognition tasks, both tested on the ICDAR 2003 Robust Reading dataset [25]. For our first evaluation, the character recognition module was tested on the ICDAR 2003 test set of 5198 characters from 62 classes. In the cropped character recognition task, a single bounding box containing a well-centred character is provided, and the aim is to identify the character inside the bounding box; this is essentially the same 62-way classification problem discussed in Chapter 3. The character recognition module was trained as a two-layer CNN, as described in Section 3.1.2. According to Table 4.2, we outperformed other character recognition systems on the ICDAR 2003 test set. Cropped character recognition is a challenging task, yet our system surpassed the previous state of the art in accuracy. This performance demonstrated the value of our convolutional architecture: we trained a highly robust and scalable classifier using supervised fine-tuning on a large, multilayer convolutional network.
The additional experiments detailed in the following sections use the lexicon-constrained framework described in Section 2.1.3. In particular, we assumed the presence of a lexicon for the experiments on cropped word recognition and full end-to-end recognition. To give crucial context, the performance of our system is compared with that of a lexicon-driven system comparable to the one reported in [38]. My evaluation here focuses on the word recognition system from Section 3.3.2.
We tested cropped word recognition using the ICDAR 2003 Robust Word Recognition dataset. In this task, the input images consist of neatly cropped words, such as those in Figure 4.3. Assuming a flawless text detection system, performance on this word recognition task is equivalent to the best possible recall of a complete end-to-end system; performance on this task can therefore be taken as a reasonable upper bound on overall system performance (as discussed in Section 3.3.2). We evaluated our word recognition algorithm using case-insensitive word-level accuracy (the percentage of correctly identified words, ignoring case). This case-insensitive metric differs from the earlier character recognition experiment, which measured performance in a case-sensitive setting. Because word recognition is a more difficult task, we relaxed the case-sensitivity requirement and concentrated on identifying the word itself rather than distinguishing letter case. In practice, it is usually enough to recognise what word is present without worrying about the exact case of its letters, so switching to a case-insensitive configuration did not cost us much information or generality.
Because our trials were lexicon-constrained, we had to provide the word recognition system with a list of possible words. For the experiment labelled ICDAR (50), we supplied a lexicon that included both the words occurring in the image and a list of 50 randomly chosen "distractor" words (words that do not appear in the image); given an image as input, the objective was to choose the correct word from a list of around 50 options. For the second experiment, dubbed ICDAR (Full), we built a lexicon of over 500 words from the words in the ICDAR 2003 test set. We employed the same lexicons as the end-to-end system evaluated in [38] so that our results could be compared directly with theirs.
Table 4.3 compares our word recognition system's performance with [38]; to my knowledge, no other word recognition system has been evaluated on the ICDAR 2003 Word Recognition dataset. As Table 4.3 shows, our word recognition system produced substantially higher accuracies than the system in [38]. Furthermore, our system's performance dropped by just 6%, compared with a 14% drop for [38], when the lexicon grew from roughly 50 words to more than 500. This suggests that our model copes better with larger, more uncertain lexicons.
In particular, our end-to-end recognition system is less dependent on having a small, specialised lexicon. Much of the performance improvement was attributable to our character classifier's increased accuracy: while the character classifier in [38] achieved only 64% accuracy on the ICDAR 2003 character sets, ours achieved over 84%. With a more accurate character recogniser, we could better distinguish the characters within a word and therefore make more accurate guesses about the underlying word. This illustrates that state-of-the-art word recognition can be achieved by building a high-performance character recognition system rather than relying on more elaborate approaches such as probabilistic inference or pictorial structure models.
This section describes end-to-end system performance on the ICDAR 2003 Robust Reading dataset and the Street View Text dataset. In the full end-to-end task, every word in the image must be located and identified. Using the text detector, we produced bounding boxes for lines and groups of text; as described in Section 3.3.1, we estimated the placement of the spaces, and we then used a beam search over these estimated space locations to find the appropriate line segmentation and word identities. As in the word recognition experiments, the task was lexicon-constrained. We used the lexicons provided by the authors of [38] for the SVT dataset, and lexicons of 5, 20, and 50 distractor words for the ICDAR dataset; the full lexicon from the ICDAR test set was likewise used in an end-to-end experiment. These trials are referred to as SVT, ICDAR (5), ICDAR (20), ICDAR (50), and ICDAR (Full) in the discussion that follows. In every case, we used the standard evaluation criteria established in [25] and used in [38]; other object recognition contests, such as the PASCAL Visual Object Recognition Challenge [12], use very similar criteria.
A predicted bounding box is called a "match" if it overlaps a ground truth bounding box by more than 50% and the predicted word agrees with the ground truth word; as in the previous section, case was ignored when comparing the predicted word with the ground truth word. The precision and recall of the whole end-to-end system were assessed using this criterion. Figure 4.4 shows the precision and recall curves for these experiments.
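The matching criterion can be sketched as follows; the overlap is measured here as intersection-over-union, which is one common reading of the 50% overlap requirement in [25].

```python
def box_area(b):
    return (b[2] - b[0]) * (b[3] - b[1])

def overlap_ratio(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = box_area(box_a) + box_area(box_b) - inter
    return inter / union if union > 0 else 0.0

def is_match(pred_box, pred_word, gt_box, gt_word, thresh=0.5):
    """A prediction matches when the boxes overlap by more than 50% and the
    predicted word equals the ground-truth word, ignoring case."""
    return overlap_ratio(pred_box, gt_box) > thresh and pred_word.lower() == gt_word.lower()
```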
To summarise end-to-end performance in each experiment with a single number, we computed the highest F-score (the harmonic mean of precision and recall) attained along each curve. Table 4.4 compares our end-to-end system's F-scores with those of the system in [38]. Our system consistently obtained higher F-scores, and the margin was largest on the more demanding benchmarks, such as ICDAR (Full) and SVT (a difference of 0.08 in F-score). This suggests that our method is more robust in general settings, where no specialised vocabulary is available and images contain more background clutter and greater variation in typefaces and lighting conditions.
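For completeness, the summary statistic is simply the harmonic mean of precision and recall, maximised over the operating points of each curve; the numbers in the example below are made up.

    def f_score(precision, recall):
        # Harmonic mean of precision and recall.
        return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

    def best_f_score(curve):
        # Maximum F-score over a list of (precision, recall) operating points.
        return max(f_score(p, r) for p, r in curve)

    # Illustrative (made-up) operating points:
    # best_f_score([(0.90, 0.40), (0.75, 0.55), (0.60, 0.65)])  ->  about 0.63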
In our final experiment we extended the approach to the more general setting in which no specialised lexicon is available. As described in Section 3.3.4, we used Hunspell, an open-source spell checker, to dynamically build a lexicon from the raw classifier responses. Figure 4.4d shows the precision-recall curve of this general-lexicon system on the ICDAR dataset. Our method outperforms the approach of [30], which likewise uses no dictionary or lexicon; I am not aware of any other comparable system evaluated on the ICDAR 2003 Robust Reading dataset.
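A rough sketch of this dynamic-lexicon step is shown below. It assumes the pyhunspell Python bindings and standard en_US dictionary files; the raw_words input, the dictionary paths, and the cap on suggestions are illustrative assumptions rather than the exact procedure of Section 3.3.4.

    import hunspell  # pyhunspell bindings; dictionary paths are system-dependent

    def build_dynamic_lexicon(raw_words,
                              dic_path="/usr/share/hunspell/en_US.dic",
                              aff_path="/usr/share/hunspell/en_US.aff",
                              max_suggestions=5):
        # raw_words: word guesses read directly off the raw classifier responses.
        # Returns a set of candidate words to use as the lexicon for re-recognition.
        checker = hunspell.HunSpell(dic_path, aff_path)
        lexicon = set()
        for raw in raw_words:
            if checker.spell(raw):                       # already a valid word: keep it
                lexicon.add(raw)
            for suggestion in checker.suggest(raw)[:max_suggestions]:
                lexicon.add(suggestion)
        return lexicon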
When the lexicon was built dynamically with Hunspell, the recognition system performed considerably worse than when it was given a specialised lexicon. This drop in performance is not surprising, since the end-to-end system then operates under much weaker linguistic constraints. It is nonetheless worth highlighting that this simple modification, using a publicly available spell-checking programme, extends our end-to-end system to settings in which no specific lexicon exists.
Even in this setting, our system's performance remains comparable to that of the best available methods. Figure 4.5 shows example outputs of the system on the ICDAR 2003 Robust Reading and SVT datasets.
This thesis tackles the challenge of end-to-end text recognition using unsupervised feature learning and large-scale convolutional architectures. We have developed a method that combines the specificity of learnt features with the large representational capacity of a two-layer convolutional neural network. Using the same convolutional architecture, we can train highly accurate and robust text detection and character recognition modules. These two components can then be combined, using simple non-maximal suppression and a beam search, into a complete end-to-end system. Previous text detection and recognition systems required complex, multi-stage pipelines, extensive hand-engineering, or some other form of prior knowledge. Our results demonstrate the robustness of the approach: I have reported its performance on character recognition, lexicon-driven word recognition, and full end-to-end text recognition.
Even when no lexicon is available, the system is easily extended using a publicly available, open-source spell checker such as Hunspell, and in this more general setting its performance remains on a par with the best existing methods.
Our findings thus show that large, multilayer convolutional neural networks can be applied effectively to the text recognition challenge in place of purpose-built, hand-engineered solutions.
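As a concrete illustration of the non-maximal suppression step mentioned above, a minimal greedy, box-level sketch follows; the (x, y, w, h) box format and the 0.5 overlap threshold are assumptions, not the exact procedure of [29].

    def iou(a, b):
        # Intersection-over-union of two (x, y, w, h) boxes.
        iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
        ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
        inter = iw * ih
        union = a[2] * a[3] + b[2] * b[3] - inter
        return inter / union if union > 0 else 0.0

    def non_max_suppression(boxes, scores, overlap_threshold=0.5):
        # Keep the highest-scoring boxes, discarding any box that overlaps an
        # already-kept box by more than the threshold. Returns surviving indices.
        order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
        keep = []
        for i in order:
            if all(iou(boxes[i], boxes[j]) <= overlap_threshold for j in keep):
                keep.append(i)
        return keep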
In this part I point out some of the present system's limitations and suggest improvements. First, we observed a significant gap between the system's performance on lexicon-driven, cropped-word recognition and its performance on the full end-to-end text recognition task. Several factors contributed to this gap. As noted in Section 4.1, the bounding boxes estimated by the text detection system were often not well cropped, and a recognition module tuned to work on precisely cropped, word-level bounding boxes suffers when given lower-quality boxes. In addition, the end-to-end pipeline had little capacity to recover from earlier errors: a word missed by the text detector, for example, could not be recovered at a later stage. Improving the detector's recall would therefore have a significant impact on overall performance. Future work might thus focus on improving the text detection system so that it produces better-cropped bounding boxes. One viable technique is to train the binary classifier used for text detection on consistently and tightly cropped characters, making the detector more selective and its bounding boxes better cropped, which in turn benefits the text recognition module.
Second, the text detection algorithm only searched for horizontal lines of text (Section 3.2.2). While this assumption is reasonable and holds in many real scenes, it limited our approach: when text was slightly slanted, it appeared to span several lines, and the detector could not produce a well-cropped bounding box for the whole word or line. Likewise, the detector could not correctly locate words whose letters were arranged vertically or along a curve. Extending the end-to-end system to handle text that is not horizontally aligned is therefore a natural direction for future work.
Third, our present end-to-end system lacked a reliable method for estimating word boundaries within a line of text; in other words, word-level segmentation remained an open problem. In the lexicon-constrained setting we obtained good results by combining the simple space estimation procedure of Section 3.3.1 with a beam search over candidate segmentations, but this combination was much less effective in the more general setting where we relied on Hunspell. Improved segmentation methods that do not depend on such heuristics would significantly benefit the end-to-end system, particularly when no specialised vocabulary is available.
Finally, one could design a system that leverages a specialised lexicon while still being able to recognise words outside it. Even when a specialised lexicon is available, it is not always safe to assume that every word appearing in the scene is contained in it. In such a system, the lexicon would guide recognition of expected words while the model could still predict words in the scene that fall outside the lexicon. A system of this kind would combine the lexicon-constrained framework and the spell-correction framework described in this thesis into a single system. This hybrid framework would be an alternative to both the fully lexicon-driven and the completely generic approaches, and would better reflect real-world operating conditions than either.
This thesis has addressed end-to-end text recognition and described our technique in detail. The full end-to-end text recognition problem nonetheless remains unsolved, and there is ample scope for future work in this fascinating field.