Bangladesh Army University of Engineering & Technology
Department of Computer Science and Engineering

A Thesis Report on
Bangla Handwritten and Scripted Character Recognition Using Transfer Learning

A thesis submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering.

Submitted by
Farhan Nafiz
ID No.: 18104020

Supervised by
Md. Omar Faruq
Lecturer, Department of CSE, BAUET

Department of Computer Science and Engineering
Bangladesh Army University of Engineering & Technology
July, 2022

CERTIFICATE

This is to certify that the thesis entitled "Bangla Handwritten and Scripted Character Recognition Using Transfer Learning" by Farhan Nafiz, ID No.: 18104020, has been accepted as satisfactory in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering in July 2022.

Signature of Supervisor
…………………………………
(Md. Omar Faruq)
Lecturer
Department of Computer Science and Engineering
Bangladesh Army University of Engineering & Technology

DECLARATION

I hereby declare that my thesis entitled "Bangla Handwritten and Scripted Character Recognition Using Transfer Learning" is the result of my own work. I also declare that it was not previously submitted or published elsewhere for the award of any degree or diploma. The work has been accepted for the degree of Bachelor of Science in Computer Science and Engineering at Bangladesh Army University of Engineering & Technology (BAUET).

Author
……………………………….
(Farhan Nafiz)

ACKNOWLEDGMENT

I would like to sincerely thank my supervisor Md. Omar Faruq, Lecturer, Department of Computer Science and Engineering, BAUET, for his valuable guidance, suggestions, and encouragement throughout this work. His motivation and insight have been invaluable; without his support and proper guidance this research would never have been possible. He gave his opinion, time, and input generously at every phase of the thesis, from the introduction of thesis topics, subject selection, and the proposal and refinement of the algorithm through to implementation and finalization, all of which helped me carry out the work properly. I am truly grateful to him. Finally, I express my heartfelt thanks to all of my friends who helped me complete this project successfully.

Farhan Nafiz
ID No.: 18104020

ABSTRACT

To decipher images of Bangla handwritten and scripted characters into an electronically editable format, which plays a crucial role in enhancing and digitizing many analog applications, this thesis proposes a mechanism for handwritten and scripted letter and digit recognition. This mechanism will not only pave the way for future research but also has many practical applications today. Handwritten letter and digit recognition (HLDR) has been studied extensively over the past 50 years, and the explosive growth of main memory and computational power has made it possible to implement more efficient and complex HLDR methodologies, which has increased demand in a number of upcoming application domains.
Adopting a deep, optimized, data-driven architecture is one of the most effective ways to increase accuracy and reduce error rates in pattern recognition. Consequently, this study shows that by utilizing the recently published BanglaLekha-Isolated dataset and the Bangla Scripted Character dataset, together with the MobileNetV2 and MobileNetV3Large transfer learning architectures, we can deliver results superior to those of previous studies.

List of Contents

Certificate
Declaration
Acknowledgment
Abstract
List of Contents
List of Figures
List of Tables

Chapter 1: INTRODUCTION
1.1 Introduction
1.2 Problem Definition
1.3 Objectives
1.4 Motivation

Chapter 2: LITERATURE REVIEW
2.1 Introduction
2.2 Early Work
2.3 Dataset
2.4 Conclusion

Chapter 3: METHODOLOGY
3.1 Introduction
3.2 Proposed Methodology
3.3 Data Preprocessing
3.4 Transfer Learning
3.5 MobileNetV2 Architecture
3.6 MobileNetV3 Architecture
3.7 Conclusion

Chapter 4: RESULT ANALYSIS AND DISCUSSIONS
4.1 Introduction
4.2 Performance Analysis
4.3 Model Wise Result Comparison
4.4 Conclusion

Chapter 5: CONCLUSION & FUTURE WORK
5.1 Introduction
5.2 Future Work

REFERENCES

List of Figures

Fig 1: Different characteristics of Bangla scripts
Fig 2: From left to right, matra jukto, ordho-matra and matraheen alphabets
Fig 3: Example of data collection form
Fig 4: Dataset with added noise and blur
Fig 5: Proposed methodology
Fig 6: Dataset with noise and blur
Fig 7: Task of traditional machine learning
Fig 8: Task of transfer learning
Fig 9: Architecture of MobileNetV2
Fig 10: Architecture of MobileNetV3
Fig 11: Train and validation loss using MobileNetV2 for 1st dataset
Fig 12: Train and test accuracy using MobileNetV2 for 1st dataset
Fig 13: Train and validation loss using MobileNetV3Large for 1st dataset
Fig 14: Train and test accuracy using MobileNetV3Large for 1st dataset
Fig 15: Train and validation loss using MobileNetV2 for 2nd dataset
Fig 16: Train and test accuracy using MobileNetV2 for 2nd dataset
Fig 17: Train and test accuracy using MobileNetV3Large for 2nd dataset
Fig 18: Train and validation loss using MobileNetV3Large for 2nd dataset
Fig 19: Confusion matrix

List of Tables

Table 1: Model wise result comparison
Table 2: Result analysis for 1st dataset
Table 3: Result analysis for 2nd dataset
Table 4: Result comparison with previous work
Table 5: Scores achieved by some classes

CHAPTER 1
INTRODUCTION

1.1 Introduction

Bangla is the language of 220 million people in this region and roughly 300 million people around the globe, making it one of the most widely spoken languages in the world (it is often ranked seventh) and the second most popular language of the Indian subcontinent. It is the native tongue of the people of Bangladesh and a member of the larger Indo-European language family. It derives its primary roots from Sanskrit and has continued to develop through the incorporation of words from other languages over the thousands of years of its existence. The modern form of the language has 50 basic letters, supplemented by compound letters (24 of which are considered in this work) and 10 numerals. Research in pattern recognition, artificial intelligence, and computer vision paved the way to optical character recognition (OCR).
Despite continuous academic research in the field, the focus of OCR work has shifted primarily to the implementation of proven techniques, in line with its application potential in banks, post offices, defense organizations, license plate recognition, reading aids for the visually impaired, library automation, language processing, and multimedia system design. Recognition of Bangla characters, being based on one of the most popular languages of the Indian subcontinent with around 243 million people using it as a mode of communication, is of special interest to us. A substantial amount of work has already been done on Bangla handwritten character recognition; however, the work done on scripted character recognition is far less extensive. Researchers have introduced various approaches to handwritten OCR, but in this thesis we present an efficient approach to detecting characters from a printed document.

Bangla documents can be categorized into two types: printed and handwritten. To extract characters from an image of a printed/scripted document, we begin by detecting individual lines and then identifying individual words in those lines. Next, we extract individual characters from the extracted words. The system proposed in this thesis attempts to detect 202 individual Bangla characters; few predecessors in this area attempt to recognize individual characters on such a large scale.

The Bangla character set comprises 11 vowels, 39 consonants, and 10 numerals [1]. There are also compound characters formed by combinations of consonants as well as unions of a consonant and a vowel. A vowel following a consonant can take a modified shape called a vowel modifier. Many characters of the Bangla script have a horizontal line above them called a "matra" or headline [1], as illustrated in Fig. 1. A Bangla text can be segmented into three zones [2]: the upper zone denotes the portion above the headline, the middle zone covers the portion of basic or compound characters below the headline, and the lower zone is the region where some of the modifiers can reside. The imaginary line separating the middle and lower zones is called the baseline. Fig. 1 depicts the different zones of the Bangla script. The concept of uppercase and lowercase letters is absent in the Bangla script, and the writing style follows the left-to-right horizontal convention.

Fig 1: Different characteristics of Bangla scripts

Fig 2: From left to right, matra jukto, ordho-matra and matraheen alphabets

1.2 Problem Definition

Over the last several decades, the delivery of information has evolved from handwritten hard-copy papers to digital file formats, which are more trustworthy and durable. Despite the move to new forms of document management, a considerable fraction of older records is still stored in handwritten form. This is especially true for our government's efforts to transfer its existing Bangla-written archives into digital ones. The issue arises when attempting to convert them, as traditional solutions rely on human typists to transcribe an existing archive. This slow, labor-intensive method can require a long time to study the papers and a large staff to create exact copies of every document. The challenge is exacerbated when trying to interpret the writing style of handwritten papers, because everyone approaches Bangla handwriting differently.
Furthermore, Bangla characters feature a complicated arrangement of curvatures, with various types of compound characters that complement the basic letters.

1.3 Objectives

Our goal is to classify and recognize handwritten Bangla and scripted characters: from sample images of handwritten isolated characters we classify all 50 simple characters, 24 compound characters, and 10 digits, totaling 84 classes, and from images of scripted Bangla characters we classify 202 classes. Covering the full domain of the Bangla language family is a monumental effort that necessitates complicated and efficient machine learning methods as well as top-tier hardware to train a model in the shortest possible time. It is critical for us that the model achieves at least the validation accuracy of the previous generation of state-of-the-art systems.

1.4 Motivation

This dilemma led us to develop a method for detecting handwritten Bangla text using a machine learning model that can categorize handwritten Bangla alphabets from photographs of documents. This technique provides an alternative to the traditional way of transcribing handwritten Bangla documents, reducing labor, associated expenses, and overall process time. Furthermore, there are several possible applications for Bangla HLDR, including Bangla traffic number plate recognition, automatic postal code identification, data extraction from hard-copy forms, automatic ID card reading, automatic reading of bank checks, and document digitization, among others. We anticipate that this system will eventually supplement existing systems and help governments and organizations become more efficient.

Chapter 2
Literature Review

2.1 Introduction

Both scripted (optical) character recognition and handwritten character recognition in Bangla have received a lot of research attention, and the identification methods face different difficulties when dealing with optical versus handwritten characters. The first optical character recognition research was done by Dutta and Chaudhury [3] in 1993 on Bengali alpha-numeric character recognition using curvature features, followed in 1998 by Chaudhuri and Pal [4], who implemented character recognition with a structural-feature-based tree classifier. Using a tree classifier and their template matching method, they took on the problem of compound character recognition. Additionally, Sural and Das [5] developed an MLP for Bengali script utilizing fuzzy feature extraction based on the Hough transform. Chowdhury et al. [6] provided a successful strategy in later works. In their subsequent work, Majumder et al. [9] refined the technique by using the K-Nearest Neighbor classifier and a feature set based on curvelet coefficients. Bag et al. [10], one of the later works, employed string matching as the classifier based on a structural topological feature set. None of these methods utilized Convolutional Neural Networks (CNN) or deep learning for the recognition process. Although CNNs have been around for a while, their full potential was not realized until Alex Krizhevsky et al. [11] deployed them in 2012 and outperformed every other object detection method. Since then, a few methods for handwritten character recognition have been developed based on CNNs [12]-[14]. The earlier methods for reading handwritten characters were those of Rahman and Kaykobad [15] in 1998 and Rahman et al. [16] in 2002.
For the purpose of recognizing Bangla handwritten characters, they created a multistage classifier. Rahman and El Saddik [17] produced more sophisticated work in 2007, demonstrating adequate performance with a string-matching algorithm that could reliably identify a variety of patterns. By creating a classifier based on a support vector machine (SVM), Bhowmik et al. [18] improved performance in 2009. Additionally, Pal et al. [19], also in 2009, used the histogram of the directional chain code of the contour pixels of the character image to detect handwritten Bangla words. Numerous feature descriptors have demonstrated promising performance in digit recognition and other applications [20]-[22]. In [20], KNN and SVM classifiers were employed after extracting features using the Local Directional Pattern and the Gradient Directional Pattern.

2.2 Early Work

Several state-of-the-art implementations of Bangla handwritten character and digit classification using machine learning have been presented by various scholars, where most early systems relied on common shallow learning techniques such as hand-crafted feature extraction and Multilayer Perceptron (MLP) classifiers. Among these, some of the most well-known works are those of A. Roy, Bhattacharya, B. Chaudhuri, and U. Pal, who are pioneers in the field of classifying Bangla handwritten characters and digits and have raised the bar for implementation and future scholarly research. In the two-stage recognition scheme proposed by Bhattacharya et al. [1], a rectangular grid of evenly spaced horizontal and vertical lines is superimposed on the character bounding box. The feature vector for the first classifier is then computed, and the response of this first classifier is examined to determine whether it confused any of the 50 classes of basic Bangla characters. In the second stage of classification, another rectangular grid is placed over the character bounding box to compute the feature vector, but this time the grid consists of unevenly spaced horizontal and vertical lines. They employed MLP and Modified Quadratic Discriminant Function (MQDF) classifiers at the two stages, respectively. A new architecture was presented by Basu et al. [2] that employs an MLP for classification and a hierarchical technique to separate characters from words. Utilizing three separate feature extraction algorithms, they segmented character patterns into 36 classes by combining similar characters into a single class. Using wavelet transformation to extract features from character images and a Multilayer Perceptron (MLP), an RBF network, and an SVM for classification, Bhowmik et al. [3] proposed a fusion classifier for 45 classes; they likewise treated some similar characters as a single pattern. Recognition of handwritten and scripted Bangla characters together has not yet received much attention, so there is ample scope to improve the recognition and segmentation algorithms so that they operate more quickly and accurately on both handwritten and scripted characters.

2.3 Dataset

The first dataset is BanglaLekha-Isolated, which contains isolated Bangla handwritten characters. Bangla basic characters (50), Bangla numerals (10), and compound characters (24) make up the 84 character classes in this dataset. For each of the 84 characters, 2000 handwriting samples were gathered, scanned, and pre-processed. The final dataset includes 166,105 handwritten character images after removing mistakes and scribbles.
It is also worth noting that each subject's age and gender are included with their sample data in the dataset. The state of automated Bangla handwriting recognition research trails significantly behind, despite substantial progress in the automatic identification of handwritten English text. Deep learning approaches in particular have recently proved very successful at handwriting identification challenges; such learning techniques, on the other hand, often need a substantial amount of labeled data. Isolated characters are the focus of this dataset. An example of the form used to gather samples of handwriting is shown in Fig. 3.

Fig 3: Example of data collection form

To gauge their speed of completion, subjects were given five minutes and then two minutes to complete the forms; the goal of this procedure was to achieve a uniform distribution of handwriting quality. Additionally, the dataset includes a spreadsheet with marks given to particular forms (each a group of 84 characters) as an evaluation of the aesthetic quality of the letters (how lovely is the handwriting?). A widely acknowledged handwriting expert in Bangladesh specified the following standards for marking the characters: a) consistent size and format; b) clear and easy to read; c) one style throughout the form; d) correct dimension; e) correctness. This data collection is well balanced.

Scripted Bangla Characters is the second dataset. The first step in building it was selecting 20 Bangla fonts from a pool of diverse Bangla typefaces. Each of the 202 characters was then entered into a word document in each typeface and converted into an image file, and the characters were separated using the image as a guide. At this point, a character may have up to 20 distinct visual representations. We had a good selection of standard characters to work with; however, real-world images may be distorted or overexposed. So, for each sample in the dataset, a 'salt and pepper noise' sample and a 'Gaussian blur' sample were added. Since test pictures can differ in many ways, it was challenging to train the model with just 60 image examples per character (20 fonts, each with an original, a noisy, and a blurred variant). Each example picture was therefore further distorted and rotated to better represent real-world conditions. The following procedure yielded 1000 samples per character from the 60 examples: a rotation in the -10 to +10 degree range, occurring with probability 0.7, and a distortion with strength between 1.1 and 1.5.

The presence of salt-and-pepper noise, Poisson noise, and other types of noise in input pictures makes it difficult to achieve good segmentation and prediction accuracy. The input picture is therefore first passed through a median filter, which works very well on grayscale document images. Scanning across the input picture, the median filter replaces each pixel's value with the median of the pixel values in its square neighborhood. In this case the window is 3x3 in size, so it holds nine elements; an odd number of elements makes the median simple to compute. This median filter removes most of the salt-and-pepper, Poisson, and other noise. The noisy picture is shown on the left in Fig. 4, and the image on the right is the result of removing the noise.

Fig 4: Dataset with added noise and blur
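As an illustration of this denoising step, the following is a minimal sketch using OpenCV; the file path is a placeholder, and the 3x3 window matches the description above.

```python
import cv2

# Read a scanned character image in grayscale ("sample.png" is a placeholder).
img = cv2.imread("sample.png", cv2.IMREAD_GRAYSCALE)

# 3x3 median filter: every pixel is replaced by the median of the nine
# pixel values in its square neighborhood, suppressing salt-and-pepper
# style impulse noise while keeping stroke edges sharp.
denoised = cv2.medianBlur(img, 3)

cv2.imwrite("sample_denoised.png", denoised)
```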
2.4 Conclusion

The major goal of this thesis is to provide a system for identifying printed and handwritten Bangla characters in spite of noise, color changes, and varying fonts, sizes, sources, and spacing. We offer a comprehensive technique built on a transfer learning model that is intended to function in the real world with little user input required; we preferred transfer learning applications for their higher accuracy and better efficiency. The user does not need to know which text is in the foreground or the background; the user only needs to upload a picture of the document to the system. Following the advice of Bryant et al. [24], we took a few crucial factors into account while creating software from models. From the user's perspective, the entire procedure is automated.

Chapter 3
METHODOLOGY

3.1 Introduction

In the machine learning technique known as transfer learning, a pre-trained model is used as the foundation for a new model. Simply expressed, an optimization that enables quick progress when modeling a second task is applied to a model that was trained on a first, related task. By applying transfer learning to a new task, one may attain considerably better performance than training from scratch with only a modest quantity of data. Transfer learning is so widespread that it is now uncommon to train a model from scratch for tasks related to image or natural language processing. Instead, data scientists and researchers prefer to begin with a model that has already been trained to recognize generic properties of images, such as edges and shapes, and to categorize objects; models pre-trained on ImageNet, such as Inception and AlexNet, are common foundations for transfer learning. For this work, I used two Keras applications, MobileNetV2 and MobileNetV3Large, as transfer learning models.

3.2 Proposed Methodology

This thesis presents a MobileNetV2-based method designed around the datasets described above; a step-by-step visual guide to the whole process is provided in the figure below. First, the dataset must be prepared, which means augmenting and normalizing the data for the transfer learning model. After that, the image data is fed into the model: each character passes through MobileNetV2 and then through three dense layers, of which two are hidden and one is the output layer. The current best error rate on the MNIST digit-recognition challenge is less than 0.3 percent, which is close to human performance [7]. The identification of handwritten Bangla characters, by contrast, is quite different, owing to the vast number of classes and the variety of written forms. To detect Bengali characters correctly, we had to choose a suitable convolutional model that is both effective and capable of being optimized quickly. In addition, the dataset required some preliminary processing before we could get the outcome we wanted, and while the model was being trained, efforts were made to minimize overfitting and reach optimal performance.

A convolutional network is well suited to higher-dimensional data such as images, which are essentially two-dimensional. Its fundamental components include the hidden "filter", which has the same depth as the input and applies the convolutional transformation to it. Then there are "pooling" layers, which downsample the image to reduce the computational work that has to be done, and regularization techniques such as dropout and batch normalization. Other key terms are "stride", the number of steps the filter moves across the input matrix; "padding", the addition of extra cells around the border of the input matrix; and "kernel", the name given to the filter itself. Because the filters contain learnable parameters that are improved during training by backpropagation, the feature-extraction stage of the model is also learned during training. After the final feature map has been flattened into a one-dimensional vector, it is fed into a densely connected feedforward network whose last layer produces the final classification result for the task at hand. A concise explanation of each step may be found in the next sections.

Fig 5: Proposed Methodology
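The following is a minimal sketch of the model just described: a MobileNetV2 backbone followed by two hidden dense layers and an output layer. The 84-class output corresponds to the first dataset (202 for the second); the hidden-layer widths are illustrative assumptions, as the text does not fix them.

```python
import tensorflow as tf

NUM_CLASSES = 84  # 50 basic + 24 compound characters + 10 digits

# Pre-trained MobileNetV2 backbone; the ImageNet classification top is
# dropped and global average pooling yields a 1D feature vector.
base = tf.keras.applications.MobileNetV2(
    input_shape=(64, 64, 3),
    include_top=False,
    weights="imagenet",
    pooling="avg",
)

# Two hidden dense layers and one output layer, as described above.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(256, activation="relu"),  # hidden layer (width assumed)
    tf.keras.layers.Dense(128, activation="relu"),  # hidden layer (width assumed)
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.summary()
```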
Then there are "Pooling" Layers, which work toward down sampling the picture in order to reduce the amount of work that has to be done computationally. In addition, there are techniques of regularization such as Dropout and Batch Normalization. Other key terms are "strides," which refers to the number of steps the filter will travel across the input matrix; "padding," which refers to the addition of additional cells to the border of the input matrix; and "kernel," which is the name that is given to the filter itself. Because the filters contain learnable parameters that are improved during training by backpropagation, the step of the model that is responsible for feature extraction is also taught during training. A densely connected feedforward neural network is used on the last layer after the final feature map has been flattened to generate a one-dimensional vector. This vector is then fed into the network. This last layer provides the completed categorization result for the job that was asked of it. A concise explanation of each step may be found in the next section. Fig 5: Proposed Methodology 12 3.3 Data Preprocessing: the BanglaLekha-isolated dataset already has the following characteristics: its foreground and background have been inverted, noise has been removed using a median filter and an edge thickening filter, and the dataset has been scaled to be square and given the required paddings [5]. After scanning the forms, each handwritten character is extracted automatically, and then the extraction is checked by hand. Because it is anticipated that the dataset would be used for activities involving machine learning and pattern recognition, the backdrop was changed to black, and the character samples were changed to white. Following the use of an edge thickening operation, which was done in order to provide clarity to the pictures, a median filter was used in order to minimize the amount of image noise that was present (Figure 3.2). However, in order to reach a higher degree of certainty about the correctness of the results, we normalized the dataset by dividing each picture by 255, as follows: Out(i,j) = In(i,j)/255.0 During the training process, further data is added to the dataset in real time. The enhancement was accomplished in dataspace by the use of elastic distortions [8] by moving in width and height. For the purpose of this shifting, the range was maintained at 0.4. Another underscore serves as the delineator between the two sections. Therefore, it is possible to deduce from the filenames not only the age and gender of the individual who filled out the form but also the character whose picture is included inside the file. The picture files have been arranged in folders according to the character for the sake of organization and convenience. There are 84 folders total, one for each character. Sample Manipulation: The examples of personalities that were made accessible to us met our expectations in a satisfactory manner. However, there is a possibility that actual picture samples will be grainy or noisy. Therefore, we augmented the dataset by adding a sample with 'Salt and Pepper Noise' and another sample with 'Gaussian Blur' for each and every sample in the dataset, as shown in Fig. 6. 13 Fig 6: Dataset with noise and blur Augmentation: Due to the fact that test photos might differ in a wide variety of ways, it was challenging for us to train the model using just 60 image examples. 
Augmentation: Because test photos can differ in a wide variety of ways, it was challenging to train the model using just 60 image examples per character. Every example picture therefore had additional distortion and rotation applied to it so that it more accurately represents real-world conditions. In all, we created 1000 samples for each trained character using the following procedure:
• a rotation ranging from -10 degrees to +10 degrees, applied with a probability of 0.2;
• a strength distortion ranging from 1.1 to 1.5, applied with some probability of occurrence.

Implementation and training details: Google Colab was used for the whole project. PyTorch was chosen as the deep learning framework for these investigations, and an Nvidia Tesla K80 was used as the GPU. During the training process, the input photos were modified using a variety of image transformations, including elastic transformation, mirror rotation, random cropping, and flipping, among others. The original photos were resized to 64 by 64 pixels before being fed in. The training learning rate ranged from 0.00001 to 0.0001, and several learning-rate schedulers were experimented with, including exponential LR decay, reduction on plateau, cosine annealing, and cosine annealing with restarts.

3.4 Transfer Learning

Transfer learning is a sort of optimization that makes it possible to make quick progress or increase performance when modeling a second task: learning in a new task is enhanced by transferring knowledge from a related task that has already been learned. In the traditional kind of machine learning known as supervised learning, when we want to train a model for some task and domain A, we assume that we are given labeled data for that same task and domain. This is made evident in Fig. 7, which shows that the task and domain of both the training data and the test data used by model A are identical. We will explain precisely what a task and a domain are in more depth later; for the time being, assume that a task is a goal our model tries to fulfill, such as recognizing objects in photos, and that a domain is where our data originates, such as photographs taken in San Francisco coffee shops.

Fig 7: Task of Traditional Machine Learning

We can now train a model A on this dataset and expect it to perform well on unseen data from the same task and domain. For a different task or domain B, we again need a labeled dataset of the same kind to build a whole new model that we can expect to do well on that data. The classic supervised learning paradigm fails when we do not have enough labeled data to train a credible model for the task or domain we care about. For example, a model trained on daytime photographs might be reused to recognize pedestrians in night-time photographs, but a decrease in performance is common, since the model has been trained on biased data and does not know how to generalize to the new context. And if the labels change between tasks, such as spotting cyclists instead, we cannot even reuse a current model to train a new one. Transfer learning helps us in these situations by using labeled data from a comparable task or domain.
This knowledge, obtained by performing the source task in the source domain, is applied to our problem of interest, as shown in Fig. 8.

Fig 8: Task of Transfer Learning

In practice, as much knowledge as possible is transferred from the source setting to the target task or domain. Depending on the data, this knowledge may take numerous forms, such as how items are put together, so that we can more readily recognize unfamiliar objects, or how individuals use words to communicate their thoughts.

3.5 MobileNetV2

MobileNetV2 is very similar to the original MobileNet, except that it uses inverted residual blocks with bottlenecking features, and it has a drastically lower parameter count than the original MobileNet. MobileNets support any input size greater than 32x32, with larger image sizes offering better performance. The Keras function returns an image classification model, optionally loaded with weights pre-trained on ImageNet, for transfer learning use cases. Each Keras application expects a specific kind of input preprocessing: for MobileNetV2, call tf.keras.applications.mobilenet_v2.preprocess_input on your inputs before passing them to the model; it scales input pixels to between -1 and 1.

Arguments:
• input_shape: optional shape tuple, to be specified if you would like to use a model with an input image resolution other than (224, 224, 3); it should have exactly 3 input channels.
• alpha: float, larger than zero, controls the width of the network. This is known as the width multiplier in the MobileNetV2 paper, but the name is kept for consistency with the MobileNetV1 application in Keras. If alpha < 1.0, the number of filters in each layer is proportionally decreased; if alpha > 1.0, it is proportionally increased; if alpha = 1.0, the default number of filters from the paper is used at each layer.
• include_top: Boolean, whether to include the fully-connected layer at the top of the network. Defaults to True.
• weights: string, one of None (random initialization), 'imagenet' (pre-training on ImageNet), or the path to a weights file to be loaded.
• input_tensor: optional Keras tensor (i.e. the output of layers.Input()) to use as the image input for the model.
• pooling: string, optional pooling mode for feature extraction when include_top is False. None means the output of the model will be the 4D tensor output of the last convolutional block; 'avg' means global average pooling will be applied to the output of the last convolutional block, so the output of the model will be a 2D tensor; 'max' means global max pooling will be applied.
• classes: optional integer number of classes to classify images into, only to be specified if include_top is True and no weights argument is specified.
• classifier_activation: a str or callable; the activation function to use on the "top" layer, ignored unless include_top=True. Set classifier_activation=None to return the logits of the "top" layer. When loading pretrained weights, classifier_activation can only be None or "softmax".

Fig 9: Architecture of MobileNetV2
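A short sketch of how these arguments fit together when MobileNetV2 is used as a feature extractor; the 64x64 input size is an assumption consistent with the experiments in Chapter 4, and the random batch is a stand-in for real data.

```python
import numpy as np
import tensorflow as tf

# Feature extraction: no classification top, ImageNet weights, global
# average pooling, default width multiplier.
extractor = tf.keras.applications.MobileNetV2(
    input_shape=(64, 64, 3),
    alpha=1.0,
    include_top=False,
    weights="imagenet",
    pooling="avg",
)

# preprocess_input scales raw [0, 255] pixels to the [-1, 1] range that
# MobileNetV2 was trained on.
batch = np.random.randint(0, 256, (8, 64, 64, 3)).astype("float32")
batch = tf.keras.applications.mobilenet_v2.preprocess_input(batch)

features = extractor(batch)
print(features.shape)  # (8, 1280): a 2D tensor because pooling="avg"
```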
3.6 MobileNetV3Large Architecture

MobileNetV3 is a convolutional neural network tuned for mobile phone CPUs through a combination of hardware-aware network architecture search (NAS) complemented by the NetAdapt algorithm, and subsequently improved through novel architecture advances.

Arguments:
• input_shape: optional shape tuple, to be specified if you would like to use a model with an input image resolution other than (224, 224, 3); it should have exactly 3 input channels. You can also omit this option if you would like to infer input_shape from an input_tensor; if you choose to include both input_tensor and input_shape, then input_shape will be used if they match, and an error is raised if the shapes do not match. E.g. (160, 160, 3) would be one valid value.
• alpha: controls the width of the network. This is known as the depth multiplier in the MobileNetV3 paper, but the name is kept for consistency with MobileNetV1 in Keras. If alpha < 1.0, the number of filters in each layer is proportionally decreased; if alpha > 1.0, it is proportionally increased; if alpha = 1, the default number of filters from the paper is used at each layer.
• minimalistic: in addition to large and small models, this module also contains so-called minimalistic models; these have the same per-layer dimensions as MobileNetV3, but they do not utilize any of the advanced blocks (squeeze-and-excite units, hard-swish, and 5x5 convolutions). While these models are less efficient on CPU, they are much more performant on GPU/DSP.
• include_top: Boolean, whether to include the fully-connected layer at the top of the network. Defaults to True.
• weights: string, one of None (random initialization), 'imagenet' (pre-training on ImageNet), or the path to a weights file to be loaded.
• input_tensor: optional Keras tensor (i.e. the output of layers.Input()) to use as the image input for the model.
• pooling: string, optional pooling mode for feature extraction when include_top is False. None means the output of the model will be the 4D tensor output of the last convolutional block; 'avg' means global average pooling will be applied to the output of the last convolutional block, so the output of the model will be a 2D tensor; 'max' means global max pooling will be applied.
• classes: integer, optional number of classes to classify images into, only to be specified if include_top is True and no weights argument is specified.
• dropout_rate: fraction of the input units to drop on the last layer.
• classifier_activation: a str or callable; the activation function to use on the "top" layer, ignored unless include_top=True. Set classifier_activation=None to return the logits of the "top" layer. When loading pretrained weights, classifier_activation can only be None or "softmax".
• include_preprocessing: Boolean, whether to include the preprocessing layer (Rescaling) at the bottom of the network. Defaults to True.

Fig 10: Architecture of MobileNetV3
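Correspondingly, a sketch of instantiating MobileNetV3Large with the arguments documented above; all values shown are illustrative rather than the exact settings used in this work.

```python
import tensorflow as tf

# MobileNetV3Large as a feature extractor. With include_preprocessing=True
# the model contains its own Rescaling layer, so it expects raw pixel
# values in [0, 255] rather than pre-scaled inputs.
base = tf.keras.applications.MobileNetV3Large(
    input_shape=(64, 64, 3),
    alpha=1.0,
    minimalistic=False,        # keep squeeze-and-excite, hard-swish, 5x5 convs
    include_top=False,
    weights="imagenet",
    pooling="avg",
    dropout_rate=0.2,          # affects only the (omitted) top layer
    include_preprocessing=True,
)
```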
3.7 Conclusion

Several factors hindered the progress of the implementation, either consuming valuable time or creating restrictions we were forced to accept. For example, due to the unavailability of sufficient RAM, the efficiency of the model was bottlenecked to a certain degree, with memory restrictions limiting the accuracy that could be achieved. Since our GPU and CPU were of ordinary specification, increasing input parameters such as image size would greatly increase the training time needed for the model, which forced us to compromise and work with a simpler network instead of a more complex one, sacrificing opportunities to further improve the efficiency of the model. Despite these limitations, the two models described above were used for the task.

Chapter 4
Result Analysis and Discussion

4.1 Introduction

Recognition of Bangla handwritten and scripted characters differs considerably from that of other languages. To solve this complex problem, we rely on deep transfer learning architectures whose inverted residual blocks address the degradation problem of deep networks. Our purpose is to classify Bengali alphabets, numerals, and compound characters with a single classifier. We experimented with several well-known and custom architectures to reach the best possible solution. I used a 75:25 train/test split over 84 classes for the 1st dataset and a 75:25 train/test split over 202 classes for the 2nd dataset.

4.2 Performance Analysis

1st Dataset Experiments and Results: Experimental results of the proposed recognition scheme were collected on the samples of the prepared BanglaLekha dataset. I performed batch-wise training in this study because of the large training set; the batch size (BS) is treated as a user-defined parameter, and the learning rate (LR) is another element that influences learning. At first, we tried MobileNetV2 with a dataset image size of 32x32 pixels, which gave a loss of approximately 0.59 after 50 epochs.

Fig 11: Train and Validation Loss using MobileNetV2 for 1st dataset

Then we tried decreasing the max-pool window from MobileNetV2's default of 3x3 to 2x2 while keeping the default stride of 2x2; this architecture gave a test loss of 0.36. Removing the stride gave the same loss but an increase in accuracy. After we updated the image size to 64x64, the test loss decreased to 0.30. Further, excluding the max-pool layer gave a slight increase in accuracy with the test loss remaining at 0.30, and a further adjustment improved the loss to 0.29. With the 32x32 image size the accuracy was around 78%, but after increasing the image size to 64x64 the accuracy improved to roughly 84.43%-87.23%. Increasing the image size yet again mainly increased the computational time of the program.

Fig 12: Train and Test accuracy using MobileNetV2 for 1st dataset

Further, I tried excluding the max-pool layer from MobileNetV3Large, which gave a slight increase in accuracy. At a 32x32 image size the results were good, but not as good as MobileNetV2's: around 76% accuracy, improving to roughly 82.33%-86.23% once the image size was increased to 64x64. Increasing the image size again increased the computational time of the program.

Fig 13: Train and Validation loss using MobileNetV3Large for 1st dataset

After this experiment with the MobileNetV3Large application, I obtained an accuracy of 86.23% with a loss of 0.30.

Fig 14: Train and Test accuracy using MobileNetV3Large for 1st dataset

2nd Dataset Experiments and Results: Experimental results of the proposed recognition scheme were collected on the samples of the prepared Bangla Scripted Character dataset. I again performed batch-wise training because of the large training set, with the batch size (BS) as a user-defined parameter and the learning rate (LR) as another element that influences learning. At first, we tried MobileNetV2 with a dataset image size of 32x32 pixels, which gave a loss of approximately 0.59 after 50 epochs. A hedged sketch of this training configuration is given below.
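The sketch uses Adam with a small starting learning rate, batch-wise training for 50 epochs, and a reduce-on-plateau learning-rate decay as described in Section 4.3. The stand-in model and random data are placeholders for the MobileNetV2-based network of Chapter 3 and the prepared dataset; the batch size and callback settings are illustrative.

```python
import numpy as np
import tensorflow as tf

NUM_CLASSES = 84  # 202 for the scripted-character dataset

# Stand-ins so the sketch runs; substitute the MobileNetV2-based model
# from Chapter 3 and the real prepared dataset.
model = tf.keras.Sequential([
    tf.keras.layers.Input((64, 64, 3)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
train_images = np.random.rand(100, 64, 64, 3).astype("float32")
train_labels = tf.keras.utils.to_categorical(
    np.random.randint(0, NUM_CLASSES, 100), NUM_CLASSES
)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

# Decay the learning rate whenever the validation loss stops decreasing.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.5, patience=3, min_lr=1e-6  # illustrative values
)

history = model.fit(
    train_images, train_labels,
    validation_split=0.25,  # 75:25 train/test ratio used in this work
    batch_size=64,          # BS is a user-defined parameter
    epochs=50,
    callbacks=[reduce_lr],
)
```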
Fig 15: Train and validation loss using MobileNetV2 for 2nd dataset

After that, we experimented with lowering the max-pool window from MobileNetV2's default of 3x3 to 2x2, while maintaining the usual stride of 2x2 throughout. This network architecture gave a test loss of 0.36. Additionally, we tested deleting the stride, which resulted in the same loss but an improvement in accuracy. After we changed the picture size to 64x64, the test loss went down to 0.30. Omitting the max-pool layer from MobileNetV2 resulted in a marginal improvement in accuracy and reduced the test loss to 0.28, and a further adjustment improved the loss to 0.27. With an image size of 32x32 pixels the accuracy was around 78 percent; when the picture size was increased to 64x64 pixels, the accuracy improved significantly, to between 82.44 and 86.23 percent. Enlarging the pictures once more chiefly made the software take longer to compute its results.

Fig 16: Train and test accuracy using MobileNetV2 for 2nd dataset

I also experimented with omitting the max-pool layer from MobileNetV3Large, which resulted in a marginal improvement in accuracy. At a 32x32 picture size it produced satisfactory results, though not as satisfactory as MobileNetV2's: around 76% accuracy, improving to between 82.33 and 86.23 percent when the picture size was increased to 64x64. Making the images larger again increased the computation time.

Fig 17: Train and test accuracy using MobileNetV3Large for 2nd dataset

After this experiment with the MobileNetV3Large application, I obtained an accuracy of 85.13% with a loss of 0.34.

Fig 18: Train and validation loss using MobileNetV3Large for 2nd dataset

4.3 Model Wise Result Comparison

Each of the models, trained with different image sizes and modified layers, produced some singular results that were not anticipated. However, a pattern emerged establishing that the image size of the training dataset is correlated with the validation accuracy the model can obtain. Unfortunately, increasing the image size comes with an additional cost in computational requirements and training time. Thus, we had to be economical with our available resources and model architecture, gradually increasing the image size and changing other parameters to obtain the optimum accuracy. As shown in Table 1, we started training with image sizes from 32x32 to 64x64 and applied Adam optimization and height/width-shift augmentation for most of the models, where the learning rate for all models started at 0.00001 and was decayed, within roughly the 1e-4 to 1e-5 range, whenever the validation loss stopped decreasing. MobileNetV2 and MobileNetV3Large were used with varying degrees of modification, adding or changing layers to reduce overfitting and increase the validation accuracy of each particular model as much as possible. Our first model, MobileNetV2 with a 32x32 image size, gave 78% accuracy without augmentation, which increased to 84.43% when augmentation was applied; a sketch of this shift augmentation follows.
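For reference, the height/width-shift augmentation mentioned here can be expressed with Keras' ImageDataGenerator as below; the 0.2 shift matches the value reported later in this section, while the stand-in batch and label arrays are placeholders.

```python
import numpy as np
import tensorflow as tf

# Random height/width shifts (plus /255 rescaling) applied on the fly.
datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    width_shift_range=0.2,
    height_shift_range=0.2,
    rescale=1.0 / 255.0,
)

images = np.random.randint(0, 256, (32, 64, 64, 3)).astype("float32")  # stand-in batch
labels = np.zeros((32, 84), dtype="float32")                           # stand-in one-hot labels

x_batch, y_batch = next(datagen.flow(images, labels, batch_size=16))
print(x_batch.shape)  # (16, 64, 64, 3), shifted and rescaled
```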
From here on, we decided to apply augmentation to all the models we tested in order to achieve maximum accuracy. We applied the same parameters to the MobileNetV3Large model, and the accuracy decreased slightly, to 82.43%. With additional max-pool and stride layers the validation accuracy still remained at 84.43%; however, when removing the stride, the accuracy increased by 0.2%. Then we increased our training image size to 64x64 and applied the previous model with the same parameters, and as expected, accuracy increased to 87.23%. Surprisingly, without any max-pool and strides, the same model gave a lower accuracy of 86.23%, which is not better than before. After acquiring better hardware (CPU, GPU, and RAM), we began to test our models with even larger images to achieve greater accuracy. We started with an image size of 64x64 on the earlier MobileNetV2 model and obtained 87.23% accuracy. A later experiment training the model with RMSprop optimization could not surpass the validation accuracy of the earlier model with Adam optimization. We then returned to MobileNetV2 with the same image size, and our accuracy increased to 87.23%, which later improved by a further 0.2% when we applied augmentation with a shift of 0.2 in height and width. Finally, we tested MobileNetV2 once more, this time applying dropout regularization, with an image size of 64x64. This model reached the sweet spot of 87.23% validation accuracy which, as of this writing, is the highest accuracy reached by a single system on combined Bangla handwritten and scripted character recognition.

Table 1: Model wise Result Comparison
SL | Used Model | Image Size | Optimizer | Augmentation | Accuracy
1 | MobileNetV2 | 32x32 | Adam | No | 78%
2 | MobileNetV2 | 64x64 | Adam | Yes | 87.23%
3 | MobileNetV3Large | 32x32 | Adam | No | 76%
4 | MobileNetV3Large | 64x64 | Adam | Yes | 86.23%

Table 2: Result analysis for 1st dataset

Table 3: Result analysis for 2nd dataset

Lastly, I compared the accuracy of the proposed work with some previous work, as shown in Table 4.

Table 4: Result comparison with previous work
Description of previous work | Accuracy (%) | Proposed work | Accuracy (%)
Recognition of Handwritten Bangla Characters Using Gabor Filter and Artificial Neural Network [6] | 79.4 | Bangla Handwritten and Scripted Character recognition using Transfer Learning | 87.23
Recognition of Bangla handwritten basic characters and digits using a convex hull-based feature set [4] | 76.86 | Bangla Handwritten and Scripted Character recognition using Transfer Learning | 87.23
Bangla Hand-Written Character Recognition Using Support Vector Machine [7] | 93.43 | Bangla Handwritten and Scripted Character recognition using Transfer Learning | 87.23

Confusion matrix evaluation: In order to test the performance of our classifier, we computed a confusion matrix over the 84 classes against a set of test data for which the true labels are known. The result is shown below in Fig. 19.

Fig 19: Confusion Matrix

As we can see, each of the classes scored highest in its own test case, showing that the classifier successfully matched each class to its correct label and that its performance is uniform across all the classes. Later, we scrutinized the scores achieved by the classes, some of which are presented in Table 5. We discovered that the Bangla character ‘ক্ষ’ scored the lowest precision (0.78) and recall (0.86), while the character ‘◌ং’ scored the highest precision (exactly 1.00) with a recall of 0.99. These results can be attributed to the fact that the complexity of a character and the degree of its cursive strokes can reduce the classifier’s ability to correctly predict that character’s class.

Table 5: Scores achieved by some classes
Class | Precision | Recall | F1-Score | Support
অ | 0.98 | 0.98 | 0.98 | 401
ড | 0.87 | 0.88 | 0.88 | 422
ষ | 0.95 | 0.94 | 0.94 | 373
◌ং | 1.00 | 0.99 | 0.99 | 412
ক্ষ | 0.78 | 0.86 | 0.82 | 405
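Per-class precision, recall, F1-score, and support as in Table 5, and the confusion matrix of Fig. 19, can be produced as in the sketch below; scikit-learn is an assumption (the thesis does not name its evaluation tooling), and the random predictions are placeholders for model output.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Placeholders; in practice y_pred = model.predict(test_images).argmax(axis=1)
y_true = np.random.randint(0, 84, 1000)
y_pred = np.random.randint(0, 84, 1000)

# 84x84 matrix: cell (i, j) counts samples of class i predicted as class j.
cm = confusion_matrix(y_true, y_pred)

# Per-class precision, recall, F1-score, and support, as in Table 5.
print(classification_report(y_true, y_pred, digits=2))
```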
We discovered that the Bangla character ‘ক্ষ’ scored the lowest precision score of 0.78 and recall of 0.86, while the character ‘◌ং ’ scored the highest 32 precision score with exact 1 and recall of 0.99. These results can be attributed to the fact that complexity of characters and its degree of cursive strokes can reduce the classifier’s ability to correctly predict that character’s class. Table 5: Scores achieved by some classes Class Precision Recall F1-Score Support অ 0.98 0.98 0.98 401 ড 0.87 0.88 0.88 422 ষ 0.95 0.94 0.94 373 ◌ং 1.00 0.99 0.99 412 ক্ষ 0.78 0.86 0.82 405 4.4 Conclusion I showed the effectiveness of Transfer Learning can achieve better performance to classify and recognize Bangla Handwritten and Scripted characters into digitally readable formats, than the shallower learning methods like MLP and SVM. From the result section, I showed that for both of the datasets, MobileNetV2 performs better than MobileNetV3Large model. Using same parameters for every training and testing(validation) the highest result 87.23% came out from MobileNetV2 model. Though the MobileNetV3Large is newer than MobileNetV2 the parameter count is lower. This is why MobileNetV2 performs better for this task. 33 Chapter 5 Conclusion and Future Work 5.1 Introduction We showed that effective utilization of Transfer Learning can achieve better performance to classify and recognize Bangla handwritten characters and Scripted Characters into digitally readable formats, than the shallower learning methods like MLP and SVM. If we had more hardware support, then the accuracy would have been better as this method gave a promising result. The results are analogous with previous related work; however, they did not test on a large amount of handwritten character dataset like Bangla-Lekha, we helped us to surpass the previous highest value accuracy of systems based on CNN. Experiments on a large dataset showed the robustness of this model for Bangla handwritten character recognition. In future with more resource and bigger CNN network architecture, we can achieve a better result and improve the state-of-the-art scale for Bangla handwritten letter and digit recognition. Despite being the seventh most-spoken language in the world, Bengali has not received many contributions in this domain due to a huge number of characters with complex shapes. The proposed study on Transfer Learning of MobileNetV2 and MobileNetV3Large demonstrates that both Keras applications perform well for the recognition of Bangla handwritten and scripted characters. 5.2 Future Work: Through the use of cutting-edge hardware that will give us the advantage in building a more complicated network, our future efforts for this system will aim to further increase the precision that can be acquired. Additionally, we would like to expand the classes to incorporate more intricate Bangla word structures that blend basic and compound characters into a single character. Additionally, we would like to broaden the use of this system to isolated Bangla handwritten words and scripted characters. To do this, the characters will first be divided to isolate them, and then the core classifier will carry out its function 34 independently. This may be expanded even further to include a classifier for handwritten sentences, where the previous word segmentation and a classifier for isolated characters are combined into one system. In the future, we will focus on improving the architecture more and apply the architecture to more datasets. 
Furthermore, the lack of an ideal Bengali handwritten character dataset is also a dilemma in this domain of research; in the future we will work in this regard as well.

REFERENCES

Tapotosh Ghosh, Md. Min-ha-zul Abedin, Shayer Mahmud Chowdhury, and Mohammad Abu Yousuf, "A Comprehensive Review on Recognition Techniques for Bangla Handwritten Characters," in 2019 International Conference on Bangla Speech and Language Processing (ICBSLP), pp. 1-6, September 2019.

Mridul Ghosh, Himadri Mukherjee, Sk Md Obaidullah, K. C. Santosh, Nibaran Das, and Kaushik Roy, "LWSINet: A deep learning-based approach towards video script identification," Springer Science+Business Media, 2021.

Changsi Shu, Ke Xu, Tanfeng Sun, and Xinghao Jiang, "YM-NET: A New Network Structure for License Plate Detection in Complex Scenarios," in ICASIT 2020: 2020 International Conference on Aviation Safety and Information Technology.

Nibaran Das and Sandip Pramanik, "Recognition of Handwritten Bangla Basic Character and Digit Using Convex Hull Basic Feature," in 2009 International Conference on Artificial Intelligence and Pattern Recognition (AIPR-09).

Halima Begum et al., "Recognition of Handwritten Bangla Characters using Gabor Filter and Artificial Neural Network," International Journal of Computer Technology & Applications, vol. 8, no. 5, pp. 618-621, ISSN 2229-6093.

Riasat Azim et al., "Bangla Hand-Written Character Recognition Using Support Vector Machine," International Journal of Engineering Works, vol. 3, no. 6, pp. 36-46, June 2016.

Tasnim Ahmed, Md. Nishat Raihan, Rafsanjany Kushol, and Md Sirajus Salekin, "A Complete Bangla Optical Character Recognition System: An Effective Approach," in 2019 22nd International Conference on Computer and Information Technology (ICCIT), 18-20 December 2019.

Farisa Benta Safir et al., an end-to-end OCR architecture built on CNN models (DenseNet, Xception, NASNet, and MobileNet) that recognizes handwritten Bengali words from handwritten word images, with 86.22% accuracy.

Shyla Afroge, "A Hybrid Model for Recognizing Handwritten Bangla Characters using Support Vector Machine," with 92.06% accuracy.

I Kadek Gunawan, I Putu Agung Bayupati, Kadek Suar Wibawa, I Made Sukarsa, and Laurensius Adi Kurniawan, "Indonesian Plate Number Identification Using YOLACT and MobileNetV2 in the Parking Management System."

Jiachen Li, Yuan Lin, Rongrong Liu, Chiu Man Ho, and Humphrey Shi, "RSCA: Real-Time Segmentation-Based Context-Aware Scene Text Detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2021, pp. 2349-2358.

Mithun Biswas, Rafiqul Islam, Gautam Kumar Shom, Md. Shopon, Nabeel Mohammed, Sifat Momen, and Anowarul Abedin, "BanglaLekha-Isolated: A multi-purpose comprehensive dataset of Handwritten Bangla Isolated characters," https://doi.org/10.1016/j.dib.2017.03.035.

Majumdar and B. Chaudhuri, "A MLP classifier for both printed and handwritten Bangla numeral recognition," in Computer Vision, Graphics and Image Processing, Springer, 2006, pp. 796-804.

S. Bag, G. Harit, and P. Bhowmick, "Topological features for recognizing printed and handwritten Bangla characters," in Proceedings of the 2011 Joint Workshop on Multilingual OCR and Analytics for Noisy Unstructured Text Data, ACM, 2011, p. 10.
Fairhurst, “Recognition of handwritten bengali characters: a novel multistage approach,” Pattern Recognition, vol. 35, no. 5, pp. 997–1006, 2002. A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105 A. Rahman and M. Kaykobad, “A complete bengali ocr: A novel hybrid approach to handwritten bengali character recognition,” Journal of computing and information technology, vol. 6, no. 4, pp. 395–413, 1998 M. A. Rahman and A. El Saddik, “Notice of violation of ieee publication principles modified syntactic method to recognize bengali handwritten characters,” IEEE Transactions on Instrumentation and Measurement, vol. 56, no. 6, pp. 2623–2632, 2007. T. K. Bhowmik, P. Ghanty, A. Roy, and S. K. Parui, “Svm-based hierarchical architectures for handwritten bangla character recognition,” International Journal on Document Analysis and Recognition (IJDAR), vol. 12, no. 2, pp. 97–108, 2009. B. R. Bryant, J. Gray, M. Mernik, P. J. Clarke, R. B. France, and G. Karsai, “Challenges and directions in formalizing the semantics of modeling languages,” 2011. Y. Ji, Y.-H. Zhang, and W.-M. Zheng, “Modelling spiking neural network from the architecture evaluation perspective,” Journal of computer science and technology, vol. 31, no. 1, pp. 50–59, 2016. 38