KNOWLEDGE & TECHNOLOGY
Bangladesh Army University of Engineering & Technology
Department of Computer Science and Engineering
A Thesis Report on
Bangla Handwritten and Scripted Character Recognition Using
Transfer Learning
A thesis submitted in partial fulfillment of the requirements for the degree of Bachelor
of Science in Computer Science and Engineering.
Submitted by
Farhan Nafiz
ID No.: 18104020
Supervised by
Md. Omar Faruq
Lecturer, Department of CSE, BAUET
Department of Computer Science and Engineering
Bangladesh Army University of Engineering & Technology
July, 2022
CERTIFICATE
This is to certify that the thesis entitled “Bangla Handwritten and Scripted Character
Recognition Using Transfer Learning” by Farhan Nafiz, ID No.: 18104020, has been
accepted as satisfactory in partial fulfillment of the requirements for the degree of Bachelor
of Science in Computer Science and Engineering in July 2022.
Signature of Supervisor
…………………………………
(Md. Omar Faruq)
(Lecturer)
Department of Computer Science and Engineering
Bangladesh Army University of Engineering & Technology
DECLARATION
I hereby declare that my thesis entitled “Bangla Handwritten and Scripted Character
Recognition Using Transfer Learning” is the result of my own work. I also declare that it
was not previously submitted or published elsewhere for the award of any degree or
diploma.
The work has been accepted for the degree of Bachelor of Science in Computer Science
and Engineering at Bangladesh Army University of Engineering & Technology (BAUET).
Author
……………………………….
(Farhan Nafiz)
ACKNOWLEDGMENT
I would like to sincerely thank my supervisor Md. Omar Faruq, Lecturer, Department of
Computer Science and Engineering, BAUET, for his valuable guidance, suggestions, and
encouragement to complete this work. His motivation, suggestions, and insights for this
research have been invaluable; without his support and proper guidance this research
would never have been possible. The valuable opinion, time, and input he provided
throughout the thesis work, from the first phase of topic introduction, subject selection,
and algorithm proposal through to modification, implementation, and finalization, helped
me carry out the thesis properly. I am really grateful to him.
Finally, I express my heartfelt thanks to all of my friends who helped me in the successful
completion of this project.
Farhan Nafiz
ID No.:18104020
ABSTRACT
In order to decipher images of Bangla handwritten and scripted characters into an
electronically editable format, which plays a crucial role in enhancing and digitalizing
many analog applications, this thesis proposes a mechanism for handwritten and scripted
letter and digit recognition (HLDR). This mechanism will not only pave the way for future
research but will also have many practical applications in the present.
Handwritten and scripted character recognition has been studied extensively over the past
50 years, and the explosive growth of main memory and computational power has made it
possible to implement more efficient and complex HLDR methodologies, which has
increased demand in a number of upcoming application domains. Adopting a deep,
optimized, data-processing architecture is one of the most effective ways to increase
accuracy and reduce error rates in the field of pattern recognition. Consequently, this study
suggests that by utilizing the recently published BanglaLekha-Isolated dataset and the
Bangla Scripted Character dataset, along with the transfer learning models MobileNetV2
and MobileNetV3Large, we may provide results superior to those of previous studies.
List of Contents

Certificate
Declaration
Acknowledgment
Abstract
List of Contents
List of Figures
List of Tables

Chapter 1: INTRODUCTION
1.1 Introduction
1.2 Problem Definition
1.3 Objectives
1.4 Motivation

Chapter 2: LITERATURE REVIEW
2.1 Introduction
2.2 Early Work
2.3 Dataset
2.4 Conclusion

Chapter 3: METHODOLOGY
3.1 Introduction
3.2 Proposed Methodology
3.3 Data Preprocessing
3.4 Transfer Learning
3.5 MobileNetV2 Architecture
3.6 MobileNetV3 Architecture
3.7 Conclusion

Chapter 4: RESULT ANALYSIS AND DISCUSSIONS
4.1 Introduction
4.2 Performance Analysis
4.3 Model Wise Result Comparison
4.4 Conclusion

Chapter 5: CONCLUSION & FUTURE WORK
5.1 Introduction
5.2 Future Work

REFERENCES
List of Figures

Fig 1: Different characteristics of Bangla Scripts
Fig 2: From left to right, matra jukto, ordho-matra and matraheen alphabets
Fig 3: Example of data collection form
Fig 4: Dataset with added noise and blur
Fig 5: Proposed Methodology
Fig 6: Dataset with noise and blur
Fig 7: Task of Traditional Machine Learning
Fig 8: Task of Transfer Learning
Fig 9: Architecture of MobileNetV2
Fig 10: Architecture of MobileNetV3
Fig 11: Train and Validation Loss using MobileNetV2 for 1st dataset
Fig 12: Train and Test accuracy using MobileNetV2 for 1st dataset
Fig 13: Train and Validation loss using MobileNetV3Large for 1st dataset
Fig 14: Train and Test accuracy using MobileNetV3Large for 1st dataset
Fig 15: Train and validation loss using MobileNetV2 for 2nd dataset
Fig 16: Train and test accuracy using MobileNetV2 for 2nd dataset
Fig 17: Train and test accuracy using MobileNetV3Large for 2nd dataset
Fig 18: Train and validation loss using MobileNetV3Large for 2nd dataset
Fig 19: Confusion Matrix

List of Tables

Table 1: Model wise Result Comparison
Table 2: Result analysis for 1st dataset
Table 3: Result analysis for 2nd dataset
Table 4: Result comparison with previous work
Table 5: Scores achieved by some classes
CHAPTER 1
INTRODUCTION
1.1 Introduction
Bangla is the language of 220 million people in the region and of 300 million people
worldwide, which makes Bangla the fourth most widely spoken language in the world.
Bangla is the second-most popular language on the Indian subcontinent. Furthermore, it is
the native tongue of the people of Bangladesh, and it is a member of the larger
Indo-European language family. It derives its primary roots from Sanskrit, and it has
continued to develop through the incorporation of words from other languages over the
thousands of years of its existence. The most up-to-date form of the language has a total of
50 fundamental letters, supplemented by a further 24 compound letters and 10 numerals.
Research in pattern recognition, artificial intelligence, and computer vision paved the way
to OCR. Despite continuous academic research in the field, the focus of OCR has shifted
primarily to the implementation of proven techniques in accordance with its application
potential in banks, post offices, defense organizations, license plate recognition, reading
aids for the visually impaired, library automation, language processing, and multimedia
system design. Recognition of Bangla characters, being based on one of the most popular
languages in the Indian subcontinent with around 243 million people using it as a mode of
communication, is of special interest to us. A substantial amount of work has already been
done in the area of Bangla handwritten character recognition. However, the amount of
work done in the area of scripted character recognition is not significant. Various
approaches have been introduced by researchers in the field of handwritten OCR, but in
this thesis we came up with an efficient approach to detect characters from a printed
document. Bangla documents can be categorized into two types: printed and handwritten.
To extract characters from an image of a printed/scripted document, we begin by detecting
individual lines and then identifying individual words in those lines. Next, we extract
individual characters from the extracted words. The system proposed in this thesis
attempts to detect 202 individual Bangla characters. There are no predecessors in this
area that attempt to recognize individual characters to such a large extent. The Bangla
character set comprises 11 vowels, 39 consonants, and 10 numerals [1]. There are also
compound characters formed by combinations of consonants as well as unions between a
consonant and a vowel. A vowel following a consonant can take a modified shape called a
vowel modifier. Many characters of the Bangla script have a horizontal line above them
called a "matra" or headline [1], as illustrated in Fig. 1. A Bangla text can be segmented
into three zones [2]. The upper zone denotes the portion above the headline, the middle
zone covers the portion of basic or compound characters below the headline, and the lower
zone is the region where some of the modifiers can reside. The imaginary line separating
the middle and lower zones is called the base line. Fig. 1 depicts the different zones
existing in the Bangla script. The concept of uppercase and lowercase letters is absent in
the Bangla script, and the writing style follows the left-to-right horizontal convention.
Fig.1. Different characteristics of Bangla Scripts
Fig.2. From left to right, matra jukto, ordho-matra and matraheen alphabets
1.2 Problem Definition
Over the last several decades, the delivery of information has evolved from
handwritten hard copy papers to digital file forms, which are more trustworthy and
durable. Despite the move to new forms of document management, a considerable
fraction of older records are still stored in handwritten formats. This is especially true
for our government's efforts to transfer its current Bangla-written archives into digital
ones. The issue arises when attempting to convert them, as traditional solutions rely
on human typing to transfer an existing archive. This sluggish and unimaginative
method can take a long time to study the papers and a large amount of personnel to
create exact copies of each and every document. The challenge is exacerbated while
attempting to understand the writing style of handwritten papers, because everyone
approaches Bangla handwriting differently. Furthermore, Bangla characters feature a
complicated arrangement of curvatures with various types of compound characters
that complement other basic letters.
1.3 Objectives
Our goal is to classify and recognize handwritten Bangla and scripted characters from
sample images of handwritten isolated words, for which we have decided to classify all 50
simple characters, 24 compound characters, and 10 digits, totaling 84 classes, and images
of scripted Bangla characters, totaling 202 classes. Covering the full domain of the Bangla
language family is a monumental effort that necessitates the use of complicated and
efficient machine learning methods as well as top-tier hardware to train a model in the
shortest period of time. It is critical for us that the model achieves at least the validation
accuracy of the previous generation of state-of-the-art systems.
1.4 Motivation
This dilemma led us to develop a method for detecting handwritten Bangla text using a
machine learning model that can categorize handwritten Bangla letters from photographs
of documents.
This technique provides an alternate option to the traditional way of transcription of
handwritten Bangla documents, reducing labor, associated expenses, and overall
process time.
Furthermore, there are several possible applications for Bangla HLDR, including
Bangla traffic number plate recognition, automatic postal code identification, data
extraction from hardcopy forms, automatic ID card reading, automatic reading of
bank checks, and document digitization, among others.
We anticipate that this system will eventually supplement existing systems that will
help governments and organizations to become more efficient.
Chapter 2
Literature Review
2.1 Introduction:
Both scripted (optical) character recognition and handwritten character recognition in
Bangla have received a lot of research attention. The recognition methods face different
difficulties when dealing with optical versus handwritten
characters. The first optical character recognition research was done by Dutta and
Chaudhury [3] in 1993 on Bengali alpha-numeric character recognition using
curvature features, and later in 1998 by Chaudhuri and Pal [4], who implemented
character recognition by implementing a structural-feature-based tree classifier. By
using a tree classifier and their template matching method, they took on the problem
of compound character recognition. Additionally, Sural and Das [5] developed an
MLP for Bengali script utilizing fuzzy feature extraction based on the Hough
transform. Chowdhury et al. [6] provided a successful strategy in the latter works. In
their subsequent work, Majumder et al. [9] refined the technique by using the K
Nearest Neighbor classifier and a feature set based on curvelet coefficients. Bag et al.
[10], one of the latter works, employed string matching as the classifier and was based
on a structural topological feature set. None of these methods utilized Convolutional
Neural Networks (CNN) or deep learning for the recognition process. Although CNN
has been around for a while, its full potential was not realized until Alex Krizhevsky
et al. [11] deployed it in 2012 and outperformed every other image recognition method.
Since then, a few methods for handwritten character recognition have been developed
based on CNN [12]–[14]. Earlier methods for reading handwritten characters were those of
Rahman and Kaykobad [15] in 1998 and Rahman et al. [16] in 2002. For the purpose
of recognizing Bangla handwritten characters, they created a multistage classifier.
Rahman and Saddik [17] produced more sophisticated work in 2007, demonstrating
adequate performance by putting a string-matching algorithm to use that could
reliably identify a variety of patterns. By creating a classifier based on a support
vector machine (SVM), Bhowmik et al. [18] improved performance in 2009.
Additionally, Pal et al. in 2009 [19] used the histogram of the directional chain code
of the contour pixels of the character picture to detect handwritten Bangla words.
Numerous feature descriptors have demonstrated promising performance in digit
recognition and other applications [20]–[22]. In [20], KNN and SVM classifiers were
employed after extracting features using Local Directional Pattern and Gradient
Directional Pattern.
2.2 Early Work:
Several state-of-the-art implementations on classification of Bangla handwritten
characters and digits using machine learning algorithms have been presented by
various scholars, where most early systems relied on common shallow learning
techniques like feature extraction and Multilayer Perceptron (MLP) classification.
Among these, some of the most well-known works are those of A. Roy, Bhattacharya,
B. Chaudhuri, and U. Pal, who are pioneers in the field of classifying Bangla
handwritten characters and digits and have raised the bar for implementation and
future scholarly research. A rectangular grid of evenly spaced horizontal and vertical
lines is superimposed on the character bounding box in the two-stage recognition
scheme proposed by Bhattacharya et al. [1]. The feature vector for the first classifier
is then computed, and the response of this first classifier is examined to determine
whether it confused any of the 50 classes of basic Bangla characters. In order to
compute the feature vector in the second step of classification, another rectangular
grid is placed over the character bounding box, but this time the rectangular grid is
made up of unevenly spaced horizontal and vertical lines. They employed MLP and
the Modified Quadratic Discriminant Function (MQDF) classifiers, respectively, at
both stages. A new architecture was presented by Basu et al. [2] that employs MLP
for classification and a hierarchical technique to separate characters from words.
Utilizing three separate feature extraction algorithms, they segmented character
patterns into 36 classes by combining related characters into a single class. Using
wavelet transformation to extract features from character images and Multilayer
Perceptron (MLP), RBF network, and SVM for classification, Bhowmik et al. [3]
proposed a fusion classifier for 45 classes. When classifying the data, they took into
account some similar characters as a single pattern. Character recognition covering both
handwritten and scripted Bangla characters has not yet received much effort. Since we
concentrate on both handwritten and scripted characters, there is ample scope to modify
the recognition and segmentation algorithms so they operate more quickly and accurately.
2.3 Dataset:
The first dataset is BanglaLekha-Isolated, which features isolated Bangla handwritten
characters. Bangla basic characters (50), Bangla numerals (10), and compound characters
(24) make up the 84 characters in this dataset. For each of the 84 characters, 2,000
handwriting samples were gathered, scanned, and pre-processed. The final dataset includes
166,105 handwritten character pictures after removing typos and scribbles. It is also worth
noting that each subject's age and gender are included with their sample data in the
dataset. The status of automated Bangla handwriting recognition research trails
significantly behind, despite a lot of progress in the automatic identification of handwritten
English text. Deep learning approaches, in particular, have recently been proven to be very
successful in handling handwriting identification challenges; such learning techniques, on
the other hand, often need a substantial amount of labeled data. Isolated characters are the
focus of this dataset. An example of the form used to gather samples of handwriting may
be seen in Figure 3.
Fig 3: Example of data collection form
To gauge their speed of completion, people were given five minutes, and then two
minutes, to complete the forms; achieving a uniform distribution of handwriting quality
was the goal of this procedure. Additionally, the dataset includes a spreadsheet with marks
given to particular forms (a group of 84 characters) as an evaluation of the aesthetic quality
of the letters (how lovely is the handwriting?). A widely acknowledged handwriting expert
in Bangladesh specified the following standards for marking the characters: a) consistent
size and format; b) clear and easy to read; c) one style throughout the form; d) correct
dimension; e) correctness. This data collection is well-balanced.
Scripted Bangla Characters is the second collection of data. Selecting 20 Bangla fonts
from a pool of diverse Bangla typefaces was the first step in this process. Each of the 202
characters was then entered into a word document and converted into an image file for
each typeface, and we were able to separate the characters using the picture as a guide. At
this point, a character may have up to 20 distinct visual representations. We had a good
selection of standard-issue characters to work with. However, real-world images may be
distorted or overexposed, so for each sample in the dataset, a 'Salt and Pepper Noise'
sample and a 'Gaussian Blur' sample were added. Since test pictures might differ in many
ways, it was challenging to train the model with just 60 image examples per character.
Each example picture was therefore further distorted and rotated to better represent
real-world conditions. The following procedure yielded 1,000 samples from each
character's 60 originals: a rotation in the -10 to +10 degree range with a probability of 0.7,
and a distortion of strength between 1.1 and 1.5. The presence of salt-and-pepper noise,
Poisson noise, and other types of noise in input pictures makes it difficult to achieve good
segmentation and prediction accuracy. The input picture is therefore first passed through a
median filter, which works very well on grayscale document images.
After scanning through the input picture, the median filter replaces each pixel's value
with the median of the pixel values in its square neighborhood. In this case, the window
has nine elements, since it is a 3x3 window; an odd number of items makes it simpler to
calculate the median. Most of the salt-and-pepper, Poisson, and other noise is removed by
this median filter. The noisy picture is shown on the left of Fig. 4, whereas the image on
the right is the result of removing the noise. This data collection is well-balanced.
Fig 4: Dataset with added noise and blur
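As a concrete illustration, the 3x3 median filtering step described above can be sketched as follows. This is a minimal example assuming OpenCV is available; the file names are hypothetical, not from the thesis.

```python
# Minimal sketch of 3x3 median filtering for a scanned character image,
# assuming OpenCV; "noisy.png" and "denoised.png" are hypothetical names.
import cv2

# Load the scanned character image in grayscale.
img = cv2.imread("noisy.png", cv2.IMREAD_GRAYSCALE)

# Replace each pixel with the median of its 3x3 neighborhood (9 values),
# which removes most salt-and-pepper and Poisson noise.
denoised = cv2.medianBlur(img, 3)

cv2.imwrite("denoised.png", denoised)
```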
2.4 Conclusion:
The major goal of this thesis is to provide a system for identifying printed and
handwritten Bangla characters in spite of noise, color change, various fonts, sizes,
sources, and spacing. In our work, we offer a comprehensive technique built on a
Transfer Learning model that is intended to function in the actual world with little
user input required. We preferred Transfer Learning applications for higher accuracy
and better efficiency. The user does not need to know which text is in the foreground or
the background; the user only needs to upload a picture of the document to the
system. Following the advice of Bryant et al. [24], we took into account a few crucial
factors while creating software from models. From the user's perspective, the entire
procedure is automated.
Chapter 3
METHODOLOGY
3.1 Introduction
A pre-trained model is used as the foundation for a new model in the machine
learning technique known as transfer learning. Simply put, a model trained on a first task
is reused, as an optimization that enables quick progress, when modeling a second, related
task. By applying transfer learning to a new task, one may attain considerably better
performance than training from scratch with only a modest quantity of data. Transfer
learning is so widespread that it is uncommon to train a model from scratch for tasks
related to image or natural language processing. Instead, data scientists and academics
prefer to begin with a model that has already been trained to recognize generic properties
in photos, such as edges and forms, and to categorize things. Models pre-trained on large
datasets such as ImageNet, for example Inception and AlexNet, are common foundations
for transfer learning. In the transfer learning method of this work, I used two different
Keras Applications, MobileNetV2 and MobileNetV3Large.
3.2 Proposed Methodology
This thesis presents a MobileNetV2-based method for the datasets described above. A
step-by-step visual guide to the whole process is provided in the figure below. First, the
dataset has to be prepared: the data is augmented and normalized for the transfer learning
model. After that, the image data is fed into the model. After passing through
MobileNetV2, each character image goes through three dense layers: two hidden layers
and one output layer. The current best error rate on the MNIST digit-recognition challenge
is less than 0.3 percent, which is close to human performance [7]. The recognition of
handwritten Bangla characters, on the other hand, is quite different due to the vast number
of classes and the variety of writing styles. In order to correctly detect Bengali characters,
we were required to choose a suitable convolutional model that is both effective and
capable of being quickly optimized. In addition, the dataset required some preliminary
processing before we could get the outcome we wanted, and while the model was being
trained, efforts were also made to minimize overfitting and reach optimal performance.
Convolutional networks are suitable for higher-dimensional data such as images, which
are typically two-dimensional. Their fundamental component is the "filter", which slides
over the input and applies the convolutional transformation. Then there are "pooling"
layers, which downsample the image to reduce the computational work required. In
addition, there are regularization techniques such as Dropout and Batch Normalization.
Other key terms are "strides", the number of steps the filter travels across the input matrix;
"padding", the addition of extra cells to the border of the input matrix; and "kernel", the
name given to the filter itself. Because the filters contain learnable parameters that are
improved during training by backpropagation, the feature-extraction stage of the model is
also learned during training. After the final feature map has been flattened into a
one-dimensional vector, it is fed into a densely connected feedforward neural network.
This last layer produces the final classification result for the task at hand. A concise
explanation of each step may be found in the next section.
Fig 5: Proposed Methodology
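A minimal Keras sketch of this pipeline is shown below. The frozen base and the hidden-layer widths (256 and 128) are illustrative assumptions; the thesis only specifies a MobileNetV2 base followed by two hidden dense layers and one output layer.

```python
# Sketch of the proposed pipeline of Fig 5, assuming TensorFlow/Keras.
import tensorflow as tf

NUM_CLASSES = 84  # 50 basic + 24 compound characters + 10 digits

# Pre-trained MobileNetV2 base; ImageNet weights are reused even though
# the thesis trains on 64x64 inputs (Keras loads the shape-independent
# convolutional weights and emits a warning for non-default sizes).
base = tf.keras.applications.MobileNetV2(
    input_shape=(64, 64, 3), include_top=False, weights="imagenet")
base.trainable = False  # transfer learning: keep pre-trained features frozen

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(256, activation="relu"),             # hidden layer 1
    tf.keras.layers.Dense(128, activation="relu"),             # hidden layer 2
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),  # output layer
])

model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="categorical_crossentropy", metrics=["accuracy"])
```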
3.3 Data Preprocessing:
The BanglaLekha-Isolated dataset already has the following characteristics: its
foreground and background have been inverted, noise has been removed using a median
filter and an edge-thickening filter, and the images have been scaled to be square and
given the required paddings [5]. After scanning the forms, each handwritten character is
extracted automatically, and the extraction is then checked by hand. Because the dataset is
intended for machine learning and pattern recognition tasks, the background was changed
to black and the character samples to white. An edge-thickening operation was applied to
give clarity to the pictures, and a median filter was then used to minimize the image noise
that was present. However, in order to reach a higher degree of certainty about the
correctness of the results, we normalized the dataset by dividing each picture by 255, as
follows:
Out(i, j) = In(i, j) / 255.0
During the training process, further data is added to the dataset in real time. The
augmentation was accomplished in dataspace through elastic distortions [8] and by
shifting in width and height; for this shifting, the range was kept at 0.4. Each filename
encodes metadata about its sample, with underscores as delimiters between the fields, so it
is possible to deduce from a filename not only the age and gender of the individual who
filled out the form but also the character whose picture is contained in the file. For
organization and convenience, the image files have been arranged into folders according
to character; there are 84 folders in total, one for each character.
Sample Manipulation:
The character samples that were made available to us met our expectations in a
satisfactory manner. However, actual picture samples may be grainy or noisy. Therefore,
we augmented the dataset by adding a 'Salt and Pepper Noise' sample and a 'Gaussian
Blur' sample for each and every sample in the dataset, as shown in Fig. 6.
Fig 6: Dataset with noise and blur
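A minimal sketch of how the two extra samples per image might be generated, assuming NumPy and OpenCV; the noise amount and blur kernel size are illustrative assumptions, not values stated in the thesis.

```python
# Sketch of the per-sample noise and blur augmentation described above.
import cv2
import numpy as np

def salt_and_pepper(img, amount=0.02):
    """Flip a random fraction of pixels to white (salt) or black (pepper)."""
    noisy = img.copy()
    mask = np.random.rand(*img.shape)
    noisy[mask < amount / 2] = 0          # pepper
    noisy[mask > 1 - amount / 2] = 255    # salt
    return noisy

def gaussian_blur(img, ksize=5):
    """Blur with a ksize x ksize Gaussian kernel."""
    return cv2.GaussianBlur(img, (ksize, ksize), 0)

img = cv2.imread("sample.png", cv2.IMREAD_GRAYSCALE)  # hypothetical path
cv2.imwrite("sample_sp.png", salt_and_pepper(img))
cv2.imwrite("sample_blur.png", gaussian_blur(img))
```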
Augmentation:
Because test photos might differ in a wide variety of ways, it was challenging for us to
train the model using just 60 image examples. Every example picture therefore had
additional distortion and rotation applied to it so that it more accurately represents the
actual world. In all, we created 1,000 samples for each trained character by using the
following procedure:
• a rotation ranging from -10 degrees to +10 degrees, applied with a probability of 0.2
according to a certain probability distribution;
• a strength distortion ranging from 1.1 to 1.5, applied with a certain likelihood of
occurrence.
Specifics of the implementation and the training:
The whole project was carried out on Google Colab. PyTorch was chosen as the deep
learning framework for these investigations, and an Nvidia Tesla K80 was used as the
GPU. During the training process, the input photos were modified using a variety of image
transformation methods, including elastic transformation, mirror rotation, random
cropping, and flipping, among others. The original photos were shrunk down to 64 by 64
pixels before being fed to the model. The training learning rate ranged from 0.00001 to
0.0001, and several different learning rate schedulers, including exponential LR decay,
reduction on plateau, cosine annealing, and cosine annealing with restarts, were
experimented with.
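For illustration, the learning-rate schedules mentioned above can be set up in PyTorch as sketched below; the placeholder model, the epoch count, and the exact scheduler parameters are assumptions rather than the thesis's actual configuration.

```python
# Sketch of the learning-rate schedules experimented with, assuming PyTorch.
import torch

model = torch.nn.Linear(64 * 64, 84)  # placeholder model for illustration
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Cosine annealing: decay the LR from 1e-4 toward 1e-5 over 50 epochs.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=50, eta_min=1e-5)

# Alternatives mentioned in the text (warm restarts, reduce-on-plateau):
# torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10)
# torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1)

for epoch in range(50):
    # ... training loop over batches would go here ...
    scheduler.step()  # update the learning rate once per epoch
```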
3.4 Transfer Learning
Transfer learning is a sort of optimization that makes it possible to make quick
progress or increase performance when modeling a second task. It is the improvement of
learning in a new task through the transfer of knowledge from a related task that has
already been learned. In traditional supervised machine learning, when we want to train a
model for some task and domain A, we assume that we are given labeled data for that
same task and domain. This is made evident in Figure 7, which shows that the task and
domain of both the training data and the test data used by our model A are identical. In a
subsequent section, we will give a more precise definition of what a task and a domain
are. For the time being, let us assume that a task is a goal our model tries to fulfill, such as
recognizing objects in photos, and that a domain is where our data originates, such as
photographs taken in San Francisco coffee shops.
Fig 7: Task of Traditional Machine Learning
We can now train a model A on this dataset and expect it to perform well on unseen
data from the same task and domain. For a different task or domain B, we again need a
labeled dataset of the same kind in order to build a new model that we can expect to
perform well on that data. The classic supervised learning paradigm fails when we do not
have enough labeled data to train a credible model for the task or domain we care about.
For example, a model trained on daytime photographs can be reused to learn to recognize
pedestrians in night-time photographs, even though the two differ. In practice, a decrease
in performance is common, since the model has inherited the bias of its training data and
does not know how to generalize to the new context. And if we want to train a model for a
new task, such as spotting bikers, we cannot even reuse an existing model, since the labels
differ between the tasks. Transfer learning may help us in these situations by using labeled
data from a comparable task or domain. As shown in Figure 8, we attempt to apply the
knowledge obtained from solving the source task in the source domain to our problem of
interest.
Fig 8: Task of Transfer Learning
In practice, as much information as possible is transferred from the source setting to
the target task or domain. Depending on the data, this knowledge may take numerous
forms: for example, how items are composed, so that we can more readily recognize
unfamiliar objects, or how individuals use words to express their thoughts.
3.5 MobileNetV2 Architecture
MobileNetV2 is very similar to the original MobileNet, except that it uses inverted
residual blocks with bottlenecking features. It has a drastically lower parameter count
than the original MobileNet. MobileNets support any input size greater than 32 x 32,
with larger image sizes offering better performance. The Keras function returns an image
classification model, optionally loaded with weights pre-trained on ImageNet, for transfer
learning use cases. Each Keras Application expects a specific kind of input preprocessing:
for MobileNetV2, call tf.keras.applications.mobilenet_v2.preprocess_input on your inputs
before passing them to the model. mobilenet_v2.preprocess_input will scale input pixels
between -1 and 1.
Arguments

input_shape: Optional shape tuple, to be specified if you would like to use a model with
an input image resolution that is not (224, 224, 3). It should have exactly 3 input channels.

alpha: Float, larger than zero, controls the width of the network. This is known as the
width multiplier in the MobileNetV2 paper, but the name is kept for consistency with the
MobileNetV1 model in Keras.
• If alpha < 1.0, proportionally decreases the number of filters in each layer.
• If alpha > 1.0, proportionally increases the number of filters in each layer.
• If alpha = 1.0, the default number of filters from the paper is used at each layer.

include_top: Boolean, whether to include the fully-connected layer at the top of the
network. Defaults to True.

weights: String, one of None (random initialization), 'imagenet' (pre-training on
ImageNet), or the path to the weights file to be loaded.

input_tensor: Optional Keras tensor (i.e. output of layers.Input()) to use as image
input for the model.

pooling: String, optional pooling mode for feature extraction when include_top is False.
• None means that the output of the model will be the 4D tensor output of the last
convolutional block.
• avg means that global average pooling will be applied to the output of the last
convolutional block, and thus the output of the model will be a 2D tensor.
• max means that global max pooling will be applied.

classes: Optional integer number of classes to classify images into, only to be specified if
include_top is True and no weights argument is specified.

classifier_activation: A str or callable. The activation function to use on the "top"
layer. Ignored unless include_top=True. Set classifier_activation=None to return the
logits of the "top" layer. When loading pretrained weights, classifier_activation can
only be None or "softmax".
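A minimal usage sketch of the API described above; the random input batch is purely for illustration.

```python
# Sketch: instantiate MobileNetV2 and run one preprocessed batch through it.
import numpy as np
import tensorflow as tf

model = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3),  # default resolution
    alpha=1.0,                  # width multiplier from the paper
    include_top=True,
    weights="imagenet",
    classifier_activation="softmax",
)

# preprocess_input scales pixels from [0, 255] to [-1, 1].
batch = np.random.randint(0, 256, (1, 224, 224, 3)).astype("float32")
batch = tf.keras.applications.mobilenet_v2.preprocess_input(batch)
probs = model.predict(batch)  # shape (1, 1000): ImageNet class probabilities
```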
Fig 9: Architecture of MobileNetV2
3.6 MobileNetV3Large Architecture
MobileNetV3 is a convolutional neural network that is tuned to mobile phone CPUs
through a combination of hardware-aware network architecture search (NAS)
complemented by the NetAdapt algorithm, and then subsequently improved through
novel architecture advances.
Arguments

input_shape: Optional shape tuple, to be specified if you would like to use a model with
an input image resolution that is not (224, 224, 3). It should have exactly 3 input channels.
You can also omit this option if you would like to infer input_shape from an
input_tensor. If you choose to include both input_tensor and input_shape, then
input_shape will be used if they match; if the shapes do not match, an error will be
thrown. E.g. (160, 160, 3) would be one valid value.

alpha: Controls the width of the network. This is known as the depth multiplier in the
MobileNetV3 paper, but the name is kept for consistency with MobileNetV1 in Keras.
• If alpha < 1.0, proportionally decreases the number of filters in each layer.
• If alpha > 1.0, proportionally increases the number of filters in each layer.
• If alpha = 1, the default number of filters from the paper is used at each layer.

minimalistic: In addition to large and small models, this module also contains so-called
minimalistic models. These have the same per-layer dimensions as MobileNetV3;
however, they do not utilize any of the advanced blocks (squeeze-and-excite units,
hard-swish, and 5x5 convolutions). While these models are less efficient on CPU, they are
much more performant on GPU/DSP.

include_top: Boolean, whether to include the fully-connected layer at the top of the
network. Defaults to True.

weights: String, one of None (random initialization), 'imagenet' (pre-training on
ImageNet), or the path to the weights file to be loaded.

input_tensor: Optional Keras tensor (i.e. output of layers.Input()) to use as image
input for the model.

pooling: String, optional pooling mode for feature extraction when include_top is False.
• None means that the output of the model will be the 4D tensor output of the last
convolutional block.
• avg means that global average pooling will be applied to the output of the last
convolutional block, and thus the output of the model will be a 2D tensor.
• max means that global max pooling will be applied.

classes: Integer, optional number of classes to classify images into, only to be specified if
include_top is True and no weights argument is specified.

dropout_rate: Fraction of the input units to drop on the last layer.

classifier_activation: A str or callable. The activation function to use on the "top"
layer. Ignored unless include_top=True. Set classifier_activation=None to return the
logits of the "top" layer. When loading pretrained weights, classifier_activation can
only be None or "softmax".

include_preprocessing: Boolean, whether to include the preprocessing layer
(Rescaling) at the bottom of the network. Defaults to True.
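A sketch of instantiating MobileNetV3Large with these arguments as a feature extractor; attaching a 202-class head mirrors the second dataset, though the exact head used in the thesis is not specified and is an assumption here.

```python
# Sketch: MobileNetV3Large as a frozen-weight feature extractor in Keras.
import tensorflow as tf

base = tf.keras.applications.MobileNetV3Large(
    input_shape=(224, 224, 3),
    alpha=1.0,
    minimalistic=False,
    include_top=False,           # drop the ImageNet classifier head
    weights="imagenet",
    pooling="avg",               # global average pooling -> 2D output tensor
    include_preprocessing=True,  # built-in Rescaling layer
)

# Assumed classifier head for the 202 scripted-character classes.
outputs = tf.keras.layers.Dense(202, activation="softmax")(base.output)
model = tf.keras.Model(base.input, outputs)
```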
Fig 10: Architecture of MobileNetV3
3.7 Conclusion:
Several factors hindered the progress of the implementation, either consuming our
valuable time or creating restrictions that we were forced to accept. For example, due to
the unavailability of sufficient RAM, the efficiency of the model was bottlenecked to a
certain degree, with memory restrictions limiting the accuracy that could be achieved.
Since our GPU and CPU were of ordinary specification, increasing input parameters such
as image size would greatly increase the training time needed for the model, which forced
us to compromise and work on a simple network instead of a complex one, sacrificing
opportunities to further improve the efficiency of the model. Despite these limitations, the
two models described above were used for the task.
Chapter 4
Result Analysis and Discussion
4.1 Introduction
Recognition of Bangla handwritten characters and scripted characters differs
considerably from that of other languages. To solve this complex problem, we address the
degradation issue by adopting deep architectures built on residual blocks. Our purpose is to
classify Bengali alphabets, numerals, and compound characters with a single classifier. We
have experimented with some well-known and also custom architectures to achieve the
best possible solution. I used a 75:25 train/test split on 84 classes for the 1st dataset and a
75:25 train/test split on 202 classes for the 2nd dataset.
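A minimal sketch of how such a 75:25 split can be obtained in Keras when the images are arranged in one folder per class (the layout used by BanglaLekha-Isolated); the directory name and target size are assumptions.

```python
# Sketch of a 75:25 train/test split over a class-per-folder image dataset.
import tensorflow as tf

datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1.0 / 255,
    validation_split=0.25,  # hold out 25% of each class for testing
)

train = datagen.flow_from_directory(
    "dataset/", target_size=(64, 64), subset="training")    # 75%
test = datagen.flow_from_directory(
    "dataset/", target_size=(64, 64), subset="validation")  # 25%
```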
4.2 Performance Analysis
1st Dataset Experiments and Results:
Experimental results of the proposed recognition scheme were collected on the samples
of the BanglaLekha-Isolated dataset. I performed batch-wise training in this study due to
the large training set; the batch size (BS) is treated as a user-defined parameter. Likewise,
the learning rate (LR) is an element that influences learning. At first, we tried
MobileNetV2 with a dataset image size of 32x32 pixels, which gave a loss of
approximately 0.59 after 50 epochs.
Fig 11: Train and Validation Loss using MobileNetV2 for 1st dataset
Then we tried decreasing the maxpool kernel from MobileNetV2's default of 3x3 to 2x2
while keeping the default stride of 2x2. This network architecture gave a test loss of 0.36.
We also tried removing the stride, which gave the same loss but an increase in accuracy.
After we updated the image size to 64x64, the test loss decreased to 0.30. Further, I tried
excluding the maxpool layer from MobileNet, which gave a slight increase in accuracy
while the test loss stayed at 0.30. Moreover, increasing the image size again to 64x64 gave
an improved loss of 0.29. Previously, using only a 32x32 pixel image size, the accuracy
was around 78%; after increasing the image size to 64x64 pixels, the accuracy improved to
around 84.43%-87.23%. Increasing the image size yet again only increased the
computational time of the program.
Fig 12: Train and Test accuracy using MobileNetV2 for 1st dataset
Further, I tried excluding the maxpool layer from MobileNetV3Large, which gave a
slight increase in accuracy. With an image size of 32x32 it gave a good result, but not as
good as MobileNetV2. Previously, using only a 32x32 pixel image size, the accuracy was
around 76%; after increasing the image size to 64x64 pixels, the accuracy improved to
around 82.33%-86.23%. Increasing the image size again only increased the computational
time of the program.
Fig 13: Train and Validation loss using MobileNetV3Large for 1st dataset
After this experiment with the MobileNetV3Large application, I obtained an accuracy of
86.23% with a loss of 0.30.
Fig 14: Train and Test accuracy using MobileNetV3Large for 1st dataset
2nd Dataset Experiments and Results:
Experimental results of the proposed recognition scheme were collected on the samples
of the BanglaScriptedCHARACTER dataset. I performed batch-wise training in this study
due to the large training set; the batch size (BS) is treated as a user-defined parameter.
Likewise, the learning rate (LR) is an element that influences learning. At first, we tried
MobileNetV2 with a dataset image size of 32x32 pixels, which gave a loss of
approximately 0.59 after 50 epochs.
Fig 15: Train and validation loss using MobileNetV2 for 2nd dataset
After that, we experimented with lowering the maxpool kernel from MobileNetV2's
default of 3x3 to 2x2 while maintaining the usual stride of 2x2 throughout. This particular
network architecture gave a test loss of 0.36. Additionally, we tested deleting the stride,
which resulted in the same loss but an improvement in accuracy. After we changed the
picture size to 64x64, the test loss went down to 0.30. In addition, I experimented with
omitting the maxpool layer from MobileNetV2, which resulted in a marginal improvement
in accuracy and a reduction of the test loss to 0.28. Expanding the picture size further to
64x64 resulted in a better loss of 0.27. In the past, with an image size of 32x32 pixels, the
accuracy was around 78 percent; when the picture size was increased to 64x64 pixels, the
accuracy improved significantly to between 82.44 and 86.23 percent. Enhancing the
picture size once again only made the software take longer to compute its results.
Fig 16: Train and test accuracy using MobileNetV2 for 2nd dataset
In addition, I experimented with omitting the maxpool layer from MobileNetV3Large,
which resulted in a marginal improvement in accuracy. With a picture size of 32x32 it
produced satisfactory results, though not as satisfactory as MobileNetV2's. In the past,
with an image size of 32x32 pixels, the accuracy was around 76%; when the picture size
was increased to 64x64 pixels, the results improved significantly in accuracy, to
somewhere between 82.33 and 86.23 percent. Making the images larger once again only
increased the amount of time the software needed to compute.
Fig 17: Train and test accuracy using MobileNetV3Large for 2nd dataset
After this experiment with the MobileNetV3Large application, I obtained an accuracy of
85.13% with a loss of 0.34.
Fig 18: Train and validation loss using MobileNetV3Large for 2nd dataset
4.3 Model Wise Result Comparison:
Each of the models, trained with different image sizes and modified layers, produced
some singular results that were not anticipated. However, a pattern emerged establishing
that the image size of the training dataset is correlated with the validation accuracy that
can be obtained by the model. Unfortunately, increasing the image size comes with an
additional cost in computational requirements and training time. Thus, we had to be
economical with our available resources and model architecture, gradually increasing the
size of images and changing other parameters to obtain the optimum accuracy.
As shown in Table 1, we started our training with image sizes from 32x32 to 64x64 and
applied Adam optimization and height/width-shift augmentation for most of the models.
The learning rate for all models started at 0.00001 and was decayed within the range 1e-4
to 1e-5 whenever the validation loss stopped decreasing. MobileNetV2 and
MobileNetV3Large were used with varying degrees of modification, adding or changing
layers to reduce overfitting and to increase the validation accuracy of each particular
model as much as possible.
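A minimal sketch of this optimizer and decay-on-plateau setup, assuming Keras; the decay factor, patience, and learning-rate floor are illustrative assumptions rather than the exact values used.

```python
# Sketch: Adam at 1e-5 with LR decay when the validation loss plateaus.
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)

# Reduce the learning rate when the validation loss stops decreasing.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.5, patience=3, min_lr=1e-7)

# model.compile(optimizer=optimizer, loss="categorical_crossentropy",
#               metrics=["accuracy"])
# model.fit(train, validation_data=test, epochs=50, callbacks=[reduce_lr])
```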
Our first model, MobileNetV2 with a 32x32 image size, gave 78% accuracy without
augmentation, which increased to 84.43% when augmentation was applied. From there
onwards, we decided to apply augmentation to all the models we tested in order to achieve
maximum accuracy. We applied the same parameters to the MobileNetV3Large model,
and the accuracy slightly decreased to 82.43%. With additional maxpool and stride layers,
the validation accuracy still remained at 84.43%; however, when removing the stride, the
accuracy increased by 0.2%. Then we increased our training image size to 64x64 and
applied it to the previous model with the same parameters, and, as expected, accuracy
increased to 87.23%. Surprisingly, without any maxpool and strides, the same model gave
a lower accuracy of 86.23%, which is not better than before. After acquiring better
hardware (CPU, GPU, and RAM), we began to test our models with even larger images to
achieve greater accuracy. We started with an image size of 64x64 on the earlier
MobileNetV2 model and obtained 87.23% accuracy, which is surprisingly higher than the
previous model with an image size of 64x64, despite using a dataset with a larger image
size. Later, our experiment training the model with RMSprop optimization could not
surpass the validation accuracy of the previous model with Adam optimization. Then we
shifted to MobileNetV2 with the same image size, and our accuracy increased to 87.23%,
which later increased by 0.2% when we applied augmentation with a shift of 0.2 in height
and width. Finally, we again tested MobileNetV2, this time applying dropout to each of the
residual cells, with an image size of 64x64. This model made a breakthrough in validation
accuracy, reaching 87.23%, which, as of writing this thesis, is the highest accuracy
reported among previous state-of-the-art systems on Bangla handwritten and scripted
character recognition.
Table 1: Model wise Result Comparison

SL | Used Model | Image Size | Optimizer | Augmentation | Accuracy
1 | MobileNetV2 | 32x32 | Adam | No | 78%
2 | MobileNetV2 | 64x64 | Adam | Yes | 87.23%
3 | MobileNetV3Large | 32x32 | Adam | No | 76%
4 | MobileNetV3Large | 64x64 | Adam | Yes | 86.23%
Table 2: Result analysis for 1st dataset
Table 3: Result analysis for 2nd dataset
Lastly, I compared the accuracy of the proposed work with some previous work, as shown
in Table 4.
Table 4: Result comparison with previous work

Description of previous work | Accuracy (%) | Proposed Work | Accuracy (%)
Recognition of Handwritten Bangla Characters Using Gabor Filter and Artificial Neural Network [6] | 79.4% | Bangla Handwritten and Scripted Character recognition using Transfer Learning | 87.23%
Recognition of Bangla handwritten basic characters and digits using convex hull-based feature set [4] | 76.86% | Bangla Handwritten and Scripted Character recognition using Transfer Learning | 87.23%
Bangla Hand-Written Character Recognition Using Support Vector Machine [7] | 93.43% | Bangla Handwritten and Scripted Character recognition using Transfer Learning | 87.23%
Confusion matrix evaluation
In order to test the performance of our classifier, we computed a confusion matrix over the
84 classes against a set of test data for which the true labels are known. The result is
shown below in Fig. 19.
Fig 19: Confusion Matrix
As we can see, each class scored highest in its own test case, showing that the classifier
successfully matched each class against its correct label and that its performance is
uniform across the classes. Later, we scrutinized the scores achieved by the classes, some
of which are presented in Table 5. We discovered that the Bangla character ‘ক্ষ’ scored the
lowest precision of 0.78 with a recall of 0.86, while the character ‘◌ং’ scored the highest
precision, exactly 1.00, with a recall of 0.99. These results can be attributed to the fact that
the complexity of a character and its degree of cursive strokes can reduce the classifier's
ability to correctly predict that character's class.
Table 5: Scores achieved by some classes

Class | Precision | Recall | F1-Score | Support
অ | 0.98 | 0.98 | 0.98 | 401
ড | 0.87 | 0.88 | 0.88 | 422
ষ | 0.95 | 0.94 | 0.94 | 373
◌ং | 1.00 | 0.99 | 0.99 | 412
ক্ষ | 0.78 | 0.86 | 0.82 | 405
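For reference, per-class scores of the kind reported in Table 5, together with a confusion matrix like Fig. 19, can be computed with scikit-learn; the label arrays below are placeholders for the true and predicted class indices.

```python
# Sketch: confusion matrix and per-class precision/recall/F1/support.
from sklearn.metrics import classification_report, confusion_matrix

y_true = [0, 1, 2, 2, 1]   # placeholder ground-truth labels
y_pred = [0, 1, 2, 1, 1]   # placeholder model predictions

print(confusion_matrix(y_true, y_pred))
# Precision, recall, F1-score, and support for every class:
print(classification_report(y_true, y_pred))
```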
4.4 Conclusion
I showed that Transfer Learning can achieve better performance in classifying and
recognizing Bangla handwritten and scripted characters into digitally readable formats
than shallower learning methods like MLP and SVM. From the results section, I showed
that for both datasets MobileNetV2 performs better than the MobileNetV3Large model.
Using the same parameters for every training and testing (validation) run, the highest
result, 87.23%, came from the MobileNetV2 model. Although MobileNetV3Large is
newer than MobileNetV2, its parameter count is lower, which may explain why
MobileNetV2 performs better for this task.
Chapter 5
Conclusion and Future Work
5.1 Introduction
We showed that effective utilization of Transfer Learning can achieve better
performance in classifying and recognizing Bangla handwritten characters and scripted
characters into digitally readable formats than shallower learning methods like MLP and
SVM. With more hardware support, the accuracy would have been even better, as this
method gave a promising result. The results are comparable with previous related work;
however, previous systems were not tested on a handwritten character dataset as large as
BanglaLekha-Isolated, which helped us to surpass the previous highest validation accuracy
of CNN-based systems. Experiments on a large dataset showed the robustness of this
model for Bangla handwritten character recognition. In the future, with more resources and
a bigger CNN network architecture, we can achieve a better result and improve the state of
the art for Bangla handwritten letter and digit recognition. Despite being the seventh
most-spoken language in the world, Bengali has not received many contributions in this
domain due to its huge number of characters with complex shapes. The proposed study on
Transfer Learning with MobileNetV2 and MobileNetV3Large demonstrates that both
Keras applications perform well for the recognition of Bangla handwritten and scripted
characters.
5.2 Future Work:
Our future efforts for this system will aim to further increase the achievable accuracy
through the use of cutting-edge hardware, which will give us the advantage in building a
more complicated network.
Additionally, we would like to expand the classes to incorporate more intricate Bangla
word structures that blend basic and compound characters into a single character.
We would also like to broaden the use of this system to isolated Bangla handwritten words
and scripted characters. To do this, the characters will first be segmented to isolate them,
and then the core classifier will carry out its function independently. This may be
expanded even further to include a classifier for handwritten sentences, where word
segmentation and a classifier for isolated characters are combined into one system. In the
future, we will focus on improving the architecture further and applying it to more
datasets. Furthermore, the lack of an ideal Bengali handwritten character dataset is also a
dilemma in this domain of research; in the future, we will work in this regard as well.
References
Tapotosh Ghosh, Md. Min-ha-zul Abedin, Shayer Mahmud Chowdhury, and
Mohammad Abu Yousuf. “A Comprehensive Review on Recognition Techniques for
Bangla Handwritten Characters” In 2019 International Conference on Bangla Speech
and Language Processing (ICBSLP), pages 1–6, September 2019.
Mridul Ghosh, Himadri Mukherjee, Sk Md Obaidullah, K. C. Santosh, Nibaran Das, and
Kaushik Roy, “LWSINet: A deep learning-based approach towards video script
identification,” Springer Science+Business Media, 2021.
Changsi Shu, Ke Xu, Tanfeng Sun, and Xinghao Jiang, “YM-NET: A New Network
Structure for License Plate Detection in Complex Scenarios,” ICASIT 2020: 2020
International Conference on Aviation Safety and Information Technology.
Nibaran Das and Sandip Pramanik, “Recognition of Handwritten Bangla Basic Character
and Digit Using Convex Hull Basic Feature,” 2009 International Conference on
Artificial Intelligence and Pattern Recognition (AIPR-09).
Halima Begum et al., “Recognition of Handwritten Bangla Characters using Gabor
Filter and Artificial Neural Network,” International Journal of Computer Technology
& Applications, vol. 8(5), pp. 618-621, ISSN: 2229-6093.
Riasat Azim et al., “Bangla Hand-Written Character Recognition Using Support Vector
Machine,” International Journal of Engineering Works, vol. 3, issue 6, pp. 36-46, June
2016.
Tasnim Ahmed, Md. Nishat Raihan, Rafsanjany Kushol, and Md Sirajus Salekin, “A
Complete Bangla Optical Character Recognition System: An Effective Approach,”
2019 22nd International Conference on Computer and Information Technology
(ICCIT), 18-20 December 2019.
Farisa Benta Safir. The proposed CNN architectures, including DenseNet, Xception,
NASNet, and MobileNet, build an end-to-end OCR system that recognizes handwritten
Bengali words from handwritten word images, with 86.22% accuracy.
Shyla Afroge, “A Hybrid Model for Recognizing Handwritten Bangla Characters using
Support Vector Machine,” with 92.06% accuracy.
I Kadek Gunawan, I Putu Agung Bayupati, Kadek Suar Wibawa, I Made Sukarsa, and
Laurensius Adi Kurniawan, “Indonesian Plate Number Identification Using YOLACT
and Mobilenetv2 in the Parking Management System.”
Jiachen Li, Yuan Lin, Rongrong Liu, Chiu Man Ho, and Humphrey Shi, “RSCA:
Real-Time Segmentation-Based Context-Aware Scene Text Detection,” Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Workshops, 2021, pp. 2349-2358.
Mithun Biswas, Rafiqul Islam, Gautam Kumar Shom, Md. Shopon, Nabeel Mohammed,
Sifat Momen, and Anowarul Abedin, “BanglaLekha-Isolated: A multi-purpose
comprehensive dataset of Handwritten Bangla Isolated characters.”
https://doi.org/10.1016/j.dib.2017.03.035
Majumdar and B. Chaudhuri, “A MLP classifier for both printed and handwritten
Bangla numeral recognition,” in Computer Vision, Graphics and Image Processing,
Springer, 2006, pp. 796–804.
S. Bag, G. Harit, and P. Bhowmick, “Topological features for recognizing printed and
handwritten bangla characters,” in Proceedings of the 2011 Joint Workshop on
Multilingual OCR and Analytics for Noisy Unstructured Text Data. ACM, 2011, p. 10
A. F. R. Rahman, R. Rahman, and M. C. Fairhurst, “Recognition of handwritten
bengali characters: a novel multistage approach,” Pattern Recognition, vol. 35, no. 5,
pp. 997–1006, 2002.
A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep
convolutional neural networks,” in Advances in Neural Information Processing
Systems, 2012, pp. 1097–1105.
A. Rahman and M. Kaykobad, “A complete bengali ocr: A novel hybrid approach to
handwritten bengali character recognition,” Journal of Computing and Information
Technology, vol. 6, no. 4, pp. 395–413, 1998.
M. A. Rahman and A. El Saddik, “Modified syntactic method to recognize bengali
handwritten characters,” IEEE Transactions on Instrumentation and Measurement,
vol. 56, no. 6, pp. 2623–2632, 2007 (Notice of Violation of IEEE Publication
Principles).
T. K. Bhowmik, P. Ghanty, A. Roy, and S. K. Parui, “Svm-based hierarchical
architectures for handwritten bangla character recognition,” International Journal on
Document Analysis and Recognition (IJDAR), vol. 12, no. 2, pp. 97–108, 2009.
B. R. Bryant, J. Gray, M. Mernik, P. J. Clarke, R. B. France, and G. Karsai,
“Challenges and directions in formalizing the semantics of modeling languages,”
2011.
Y. Ji, Y.-H. Zhang, and W.-M. Zheng, “Modelling spiking neural network from the
architecture evaluation perspective,” Journal of computer science and technology, vol.
31, no. 1, pp. 50–59, 2016.