Mini Project Report
On
OCR & Natural Language Processing

(In partial fulfilment of the requirements for the award of the degree of Bachelor of Technology in Information & Technology)

Submitted By
Name : David Lalrinawma
Roll No. : IT/18/09
Semester : 7th Semester

Under the guidance of
Mrs. Vanlalmuansangi, Asst. Prof.

DEPARTMENT OF INFORMATION & TECHNOLOGY
MIZORAM UNIVERSITY (A CENTRAL UNIVERSITY)
AIZAWL-796004, INDIA
2021

CERTIFICATE OF APPROVAL

Certified that the project work entitled “OCR and Natural Language Processing” is hereby approved as a creditable work carried out in the 7th semester by “DAVID LALRINAWMA (IT/18/09)” and presented in a satisfactory manner in partial fulfilment of the requirements of Mizoram University. It is understood by this approval that the undersigned do not endorse or approve any statement made, opinion expressed or conclusion drawn therein, but approve the report only for the purpose for which it has been submitted.

Mrs. Vanlalmuansangi, Asst. Prof. (Mini-Project Guide)
Mrs. Chawngsangpuii, Asst. Prof. (Mini-Project Coordinator)
External Examiner
Mrs. Chawngsangpuii (Head of the Department)
Department of Information & Technology, Mizoram University (A Central University)

ACKNOWLEDGEMENT

With immense pleasure I, Mr DAVID LALRINAWMA (IT/18/09), present the “OCR AND NATURAL LANGUAGE PROCESSING” mini project report in partial fulfilment of the requirements for the degree of Bachelor of Technology in Information Technology Engineering. I wish to thank all the people who gave me unending support. I express my profound thanks to my project guide Mrs. Vanlalmuansangi, our project coordinator Mrs. Chawngsangpuii and those who have indirectly guided and helped me in preparation.
DAVID LALRINAWMA (IT/18/09)

ABSTRACT

Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

Optical character recognition or optical character reader (OCR) is the electronic or mechanical conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene-photo or from subtitle text superimposed on an image.

CONTENTS

Introduction
What is Natural Language Processing (NLP)
    Uses of NLP
What is Optical Character Recognition (OCR)
How OCR works
Advantages of OCR
Objective
System design
OCR Framework
Language Used
EasyOCR Framework
    Facts
    Languages currently supported by EasyOCR
    How to install EasyOCR on machine
Streamlit
    Introduction of Streamlit
    1. The flow of Streamlit while developing a web app
        1. Running a streamlit app
        2. Development flow
        3. Data flow
        4. Displaying the data
        5. Widgets
        6. Layout
    2. Let’s summarize the working of streamlit
Let’s get started with our OCR Web App
Browse and Drop a file
Browsing a file to upload
Upload file
Reader
Reading and Writing the text read from the file
Conclusion of project
Reference

INTRODUCTION

Natural language processing has its roots in the 1950s. Already in 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence, a task that involves the automated interpretation and generation of natural language, but at the time not articulated as a problem separate from artificial intelligence.

Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

OCR stands for "Optical Character Recognition." It is a technology that recognizes text within a digital image. It is commonly used to recognize text in scanned documents and images. OCR software can be used to convert a physical paper document, or an image, into an accessible electronic version with text.
It is the electronic or mechanical conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene-photo (for example the text on signs and billboards in a landscape photo) or from subtitle text superimposed on an image. Widely used as a form of data entry from printed paper data records (whether passport documents, invoices, bank statements, computerized receipts, business cards, mail, printouts of static data, or any suitable documentation), it is a common method of digitizing printed texts so that they can be electronically edited, searched, stored more compactly, displayed online, and used in machine processes such as cognitive computing, machine translation, (extracted) text-to-speech, key data and text mining. OCR is a field of research in pattern recognition, artificial intelligence and computer vision.

What is Natural Language Processing (NLP)?

The field of study that focuses on the interactions between human language and computers is called natural language processing, or NLP for short. It sits at the intersection of computer science, artificial intelligence, and computational linguistics.

Uses of Natural Language Processing :

Email filters :
Email filters are one of the most basic and earliest applications of NLP online. It started out with spam filters, which look for certain words or phrases that signal a spam message. But filtering has improved, just like early adaptations of NLP. One of the more prevalent, newer applications of NLP is found in Gmail's email classification. The system recognizes whether emails belong in one of three categories (primary, social, or promotions) based on their contents. For all Gmail users, this keeps your inbox to a manageable size with important, relevant emails you wish to review and respond to quickly.
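
The earliest kind of spam filter described above, one that looks for words or phrases that signal spam, can be sketched in a few lines of Python. This is only an illustration and not how Gmail classifies mail; the phrase list and the function name looks_like_spam are hypothetical and chosen for this example.

```python
import re

# Hypothetical phrases chosen for illustration; a real filter would learn these from data.
SPAM_PHRASES = ("free money", "act now", "winner")


def looks_like_spam(message, phrases=SPAM_PHRASES):
    """Return True if the message contains any of the phrases as whole words."""
    text = message.lower()
    return any(re.search(rf"\b{re.escape(phrase)}\b", text) for phrase in phrases)


print(looks_like_spam("You are a WINNER! Act now to claim your prize"))
print(looks_like_spam("The meeting is moved to 10 am tomorrow"))
```

Matching on whole words keeps a word such as "winners" from triggering the "winner" rule, which shows how quickly fixed word lists become limiting and why later filters moved to learned classification.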

Smart assistants :
Smart assistants like Apple’s Siri and Amazon’s Alexa recognize patterns in speech thanks to voice recognition, then infer meaning and provide a useful response. We’ve become used to the fact that we can say “Hey Siri,” ask a question, and she understands what we said and responds with relevant answers based on context. And we’re getting used to seeing Siri or Alexa pop up throughout our home and daily life as we have conversations with them through items like the thermostat, light switches, car, and more. We now expect assistants like Alexa and Siri to understand contextual clues as they improve our lives and make certain activities, like ordering items, easier, and we even appreciate it when they respond humorously or answer questions about themselves. Our interactions will grow more personal as these assistants get to know more about us. As the New York Times article “Why We May Soon Be Living in Alexa’s World” explained: “Something bigger is afoot. Alexa has the best shot of becoming the third great consumer computing platform of this decade.”

Search results :
Search engines use NLP to surface relevant results based on similar search behaviors or user intent so the average person finds what they need without being a search-term wizard. For example, Google not only predicts what popular searches may apply to your query as you start typing, but it looks at the whole picture and recognizes what you’re trying to say rather than the exact search words. Someone could put a flight number in Google and get the flight status, type a ticker symbol and receive stock information, or see a calculator come up when entering a math equation. These are some variations you may see when completing a search, as NLP in search associates the ambiguous query with a relevant entity and provides useful results.

Predictive text :
Things like autocorrect, autocomplete, and predictive text are so commonplace on our smartphones that we take them for granted.
Autocomplete and predictive text are similar to search engines in that they predict things to say based on what you type, finishing the word or suggesting a relevant one. And autocorrect will sometimes even change words so that the overall message makes more sense. They also learn from you. Predictive text will customize itself to your personal language quirks the longer you use it. This makes for fun experiments where individuals will share entire sentences made up entirely of predictive text on their phones. The results are surprisingly personal and enlightening; they’ve even been highlighted by several media outlets.

Language translation :
One of the tell-tale signs of cheating on your Spanish homework is that grammatically, it’s a mess. Many languages don’t allow for straight translation and have different orders for sentence structure, which translation services used to overlook. But, they’ve come a long way. With NLP, online translators can translate languages more accurately and present grammatically-correct results. This is infinitely helpful when trying to communicate with someone in another language. Not only that, but when translating from another language to your own, tools now recognize the language based on inputted text and translate it.

What is Optical Character Recognition (OCR)?

Optical Character Recognition refers to a software technology that electronically identifies text (written or printed) inside an image file or physical document, such as a scanned document, and converts it into a machine-readable text form to be used for data processing. It is also known as text recognition. In short, optical character recognition software helps convert images or physical documents into a searchable form. Examples of OCR are text extraction tools, PDF to .txt converters, and Google’s image search function. To see OCR software in action, you can try using Text Extractor Tool by Brandfolder.
This online optical character recognition tool can convert an image of text (such as a screenshot) into plain text.

How does Optical Character Recognition work?

The concept of OCR is straightforward. However, its implementation can be quite challenging due to several factors, such as the variety of fonts or the methods used for letter formation. For example, an OCR implementation becomes considerably more complex when non-digital handwriting samples are used as input instead of typed writing. The entire process of OCR involves a series of steps with three main objectives: preprocessing of the image, character recognition, and post-processing of the output. In the following, we show how optical character recognition works and explain the main steps of OCR technologies.

1. Scanning the Document
This is the first step of OCR, in which the document is scanned with a scanner. Scanning the document decreases the number of variables to account for when creating the OCR software, since it standardizes the inputs. This step also improves the efficiency of the entire process by ensuring consistent alignment and sizing of the document.

2. Refining the Image
In this step, the optical character recognition software improves the elements of the document that need to be captured. Imperfections such as dust particles are eliminated, and edges, as well as pixels, are smoothed to get plain and clear text. This step makes it easier for the program to clearly “see” the words being input without, for instance, smudges or irregular dark areas.

3. Binarization
The refined image document is then converted into a bi-level document image, containing only black and white colors, where black or dark areas are identified as characters. At the same time, white or light areas are identified as background.
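
The binarization idea can be illustrated with a minimal Python sketch. It assumes a grayscale image given as rows of integers from 0 (black) to 255 (white) and a fixed threshold of 128; both the tiny sample image and the threshold value are assumptions made for this example, and real OCR software typically chooses the threshold adaptively (for instance with Otsu's method).

```python
def binarize(gray, threshold=128):
    """Convert a grayscale image (rows of 0-255 ints) into a bi-level image.

    Pixels darker than the threshold become 1 (character/foreground),
    all other pixels become 0 (background).
    """
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray]


# A tiny 3x4 sample "image": dark strokes (low values) on a light page (high values).
page = [
    [250, 30, 245, 240],
    [248, 25, 20, 251],
    [252, 35, 249, 247],
]

# Print foreground pixels as '#' and background pixels as '.'.
for row in binarize(page):
    print("".join("#" if bit else "." for bit in row))
```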
The binarization step aims to segment the document so that the foreground text is easily differentiated from the background, which allows for the optimal recognition of characters.

4. Recognizing the Characters
In this step, the black areas are further processed to identify letters or digits. Usually, an OCR focuses on one character or block of text at a time. The recognition of characters is carried out by using one of the following two algorithms:
a) Pattern recognition. The pattern recognition algorithm involves feeding text in different fonts and formats into the OCR software. The software trained in this way is then used for comparing and recognizing the characters in the scanned document.
b) Feature detection. Through the feature detection algorithm, OCR software applies rules about the features of a certain letter or number to identify characters in the scanned document. Examples of features include the number of angled lines, crossed lines, or curves used for comparing and identifying characters.
Simple OCR software compares the pixels of every scanned letter with an existing database to identify the closest match. However, sophisticated forms of OCR divide every character into its components, such as curves and corners, to compare and match physical features with corresponding letters.

5. Verifying the Accuracy
After the successful recognition of characters, the results are cross-referenced against the internal dictionaries of the OCR software to ensure accuracy. OCR accuracy is measured by taking the output of an OCR analysis and comparing it to the contents of the original version. There are two typical methods for analyzing the accuracy of OCR software:
Character-level accuracy, counting how many characters were detected correctly.
Word-level accuracy, counting how many words were recognized correctly.
In most cases, 98-99% accuracy is the acceptable accuracy rate, measured at the page level.
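
The two accuracy measures can be sketched as follows. This is a minimal illustration, not a standard OCR benchmark: it assumes the original text is available as a string and uses difflib.SequenceMatcher from the Python standard library to count the characters (or words) of the original that also appear, in order, in the OCR output. The function names and the sample sentences are made up for this example.

```python
from difflib import SequenceMatcher


def matched_fraction(reference, recognized):
    """Fraction of reference items that also appear, in order, in the OCR output."""
    if not reference:
        return 1.0
    matcher = SequenceMatcher(None, reference, recognized, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(reference)


def character_accuracy(reference, recognized):
    """Character-level accuracy: compare the two strings character by character."""
    return matched_fraction(reference, recognized)


def word_accuracy(reference, recognized):
    """Word-level accuracy: compare the two texts as lists of words."""
    return matched_fraction(reference.split(), recognized.split())


reference = "Optical character recognition converts images into text"
recognized = "Optical charaoter recognition converts images into text"  # one wrong letter

print(f"character accuracy: {character_accuracy(reference, recognized):.1%}")
print(f"word accuracy: {word_accuracy(reference, recognized):.1%}")
```

A single wrong letter costs only one character at the character level but a whole word at the word level, which is why word-level accuracy is usually the lower of the two figures.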
At a 98-99% accuracy rate, a page of around 1,000 characters should have 980-990 characters accurately identified by the OCR software.

Advantages Of Optical Character Recognition

Optical Character Recognition offers a wide range of benefits, many of which were reviewed in this report. The most important benefits of OCR are listed below for reference.
Improved accuracy: Software-based character recognition reduces the human errors of manual data entry, resulting in improved accuracy.
Faster processes: The technology converts unstructured data into searchable information, making the required data available faster and subsequently speeding up business processes.
Cost-effective: OCR technology does not require a lot of resources, which reduces the processing costs and subsequently reduces the overall costs of a business.
Enhanced customer satisfaction: The accessibility of searchable data to customers ensures a good experience and better customer satisfaction.
Improved productivity: Easy access to searchable data makes for a less stressful environment for employees, allowing them to focus on the main goals and boosting the productivity of a business.

OBJECTIVE

The objective of this project on OCR is to convert any form of text or text-containing document, such as handwritten text or printed or scanned text images, into an editable digital format for further processing.

SYSTEM DESIGN

Once the planning and analysis of the project are completed, the design phase begins. The goal of the system is to let users choose a file to upload, display the file, read it with OCR, and write out the text in the file so that it can be edited.

OCR FRAMEWORK

LANGUAGE USED:
Python is used to build this system. In Python, the EasyOCR package is used for extracting the text from the image. Some example code:

    import easyocr

    def load_model():
        reader = easyocr.Reader(['en'], model_storage_directory='.')
        return reader

Let's break down the code line by line:
1. We create a Reader object from the easyocr package and pass ['en'] as the language list, which means that the reader is set up to recognize only the English text in the image; text in other languages such as Chinese or Japanese will not be read correctly.
2. Along with the language, we pass model_storage_directory, which sets the directory where the model files are stored (here the current directory, '.'). The returned reader is later used to call the readtext() function on the image.

EASYOCR FRAMEWORK

Facts :
At the time of writing, EasyOCR supports 42 languages: Afrikaans (af), Azerbaijani (az), Bosnian (bs), Czech (cs), Welsh (cy), Danish (da), German (de), English (en), Spanish (es), Estonian (et), French (fr), Irish (ga), Croatian (hr), Hungarian (hu), Indonesian (id), Icelandic (is), Italian (it), Japanese (ja), Korean (ko), Kurdish (ku), Latin (la), Lithuanian (lt), Latvian (lv), Maori (mi), Malay (ms), Maltese (mt), Dutch (nl), Norwegian (no), Polish (pl), Portuguese (pt), Romanian (ro), Slovak (sk), Slovenian (sl), Albanian (sq), Swedish (sv), Swahili (sw), Thai (th), Tagalog (tl), Turkish (tr), Uzbek (uz), Vietnamese (vi), Chinese (zh). (Source: JaidedAI)
It provides the flexibility to run text detection with or without a GPU.

How to install EasyOCR on machine

To get started installing EasyOCR, the recommendation is to follow a pip install opencv tutorial with an important caveat: we install opencv-python and not opencv-contrib-python in our virtual environment. If both of these packages are present in the same environment, it could lead to unintended consequences. It is unlikely that pip would complain if both are installed, so be diligent and check with the pip freeze command. The recommendation is to dedicate a separate Python virtual environment on the system to EasyOCR (Option B of the pip install opencv guide). However, although Option B suggests naming the virtual environment cv, we recommend naming it easyocr, ocr_easy, or something similar. PyTorch must be installed in this environment too.
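
The check that pip freeze makes possible can also be done from Python. The sketch below is one possible way to do it, assuming Python 3.8 or newer so that importlib.metadata is available in the standard library; the helper name installed_opencv_variants is hypothetical and not part of EasyOCR or OpenCV.

```python
from importlib import metadata

# The two OpenCV packages that should not be installed side by side.
OPENCV_VARIANTS = ("opencv-python", "opencv-contrib-python")


def installed_opencv_variants(variants=OPENCV_VARIANTS):
    """Return {package name: version} for each listed package that is installed."""
    found = {}
    for name in variants:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass  # this package is not installed in the current environment
    return found


found = installed_opencv_variants()
if len(found) > 1:
    print("Conflict: more than one OpenCV package is installed:", found)
elif found:
    print("OK, one OpenCV package found:", found)
else:
    print("No OpenCV package found in this environment.")
```

Run inside the virtual environment dedicated to EasyOCR, the script reports whether both packages ended up installed together.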

STREAMLIT

INTRODUCTION OF STREAMLIT:
Streamlit is an open-source Python framework for building web apps for Machine Learning and Data Science. We can quickly develop web apps and deploy them easily using Streamlit. Streamlit allows you to write an app the same way you write ordinary Python code, and it makes it seamless to work in the interactive loop of coding and viewing results in the web app.

The flow of Streamlit while developing a web app

Running a streamlit app
First, we create a Python script with Streamlit commands and execute the script using the following command:

    streamlit run <yourscript.py>

After running this, the app opens in a new tab in our default browser.

Development flow
If the source code of the script changes, the app asks in the top-right corner whether to rerun the application. We can also select the ‘Always rerun’ option to rerun automatically whenever the source script changes. This makes the development flow much easier: every change is reflected immediately in the web app. This loop between coding and viewing results live lets you work seamlessly with Streamlit.

Data flow
Streamlit has a distinctive data flow: any time something changes in our code or anything needs to be updated on the screen, Streamlit reruns our Python script entirely from top to bottom. This happens when the user interacts with widgets like a select box or drop-down box, or when the source code is changed. If the web app has costly operations that would be repeated on each rerun, like loading data from databases, we can use Streamlit’s st.cache method to cache those datasets so that the app loads faster.

Displaying the data
Streamlit provides many methods to display various types of data like arrays, tables, and data frames. To write a string, simply use st.write("Your string"). To display a data frame, use the st.dataframe method.

Widgets
There are several widgets available in Streamlit, such as st.selectbox, st.checkbox, and st.slider. Let us see an example of a widget in Streamlit.

    import streamlit as st

    value = st.slider('val')  # this is a widget
    st.write(value, 'squared is', value * value)

During the first run, the web app outputs the string “0 squared is 0”. If the user then moves the slider, Streamlit reruns the code from top to bottom and assigns the present state of the widget to the variable. For example, if the user moves the slider to 15, Streamlit reruns the code and outputs the string “15 squared is 225”.

Layout
You can easily arrange your widgets or data using st.sidebar, which places elements in the left sidebar panel. For example, st.sidebar.selectbox displays a select box in the left panel.

Let’s summarize the working of streamlit
1. Streamlit runs the Python script from top to bottom.
2. Each time the user interacts with the app, the script is rerun from top to bottom.
3. Streamlit allows you to use caching for costly operations like loading large datasets.

Let’s get started with our OCR Web App:

Let’s write a title and a subtitle in the web app using Streamlit.

    import streamlit as st

    # Title
    st.title('Easy OCR - Extract Text from Image')

    # Subtitle
    st.markdown("## Optical Character Recognition - Using 'easyocr', 'streamlit'")

Now execute the following command in your console:

    streamlit run streamlit_app.py

After executing the command, access the following URL to see the result: http://localhost:8501/

In the above code, the st.title method writes the given string as the page title, and the st.markdown method renders the given Markdown string, here as a subtitle heading.

Browse and Drop a file :
We write the code below to browse for or drop a file in our Web App.

    # image uploader
    image = st.file_uploader(label="Upload your image here", type=['png', 'jpg', 'jpeg'])

We are able to browse for a file or drop one using st.file_uploader. The type argument restricts uploads to image files (png, jpg and jpeg), since the OCR reads only from images and any other type of file is not accepted.

Browsing a file to upload:
The storage directory is given using the code below:

    import easyocr as ocr

    def load_model():
        # model_storage_directory is where EasyOCR keeps its model files
        reader = ocr.Reader(['en'], model_storage_directory='.')
        return reader

The file to be uploaded is selected by browsing. In the screenshot, the red arrow points to the folder in our current directory from which the file can be uploaded, but note that we can still browse in other folders too. The storage directory is given as a relative path ('.'), that is, the directory where the code resides; the file is in the current directory.

Upload file:
The uploaded file is displayed using:

    from PIL import Image

    input_image = Image.open(image)  # read image
    st.image(input_image)  # display image

In the screenshot, the yellow arrow points at the uploaded file and the red arrow points at the displayed image. In input_image = Image.open(image), Image comes from PIL, the Python Imaging Library, which provides the Python interpreter with image editing capabilities. The Image module provides a class with the same name which is used to represent a PIL image. The module also provides a number of factory functions, including functions to load images from files and to create new images. The uploaded file is then displayed using st.image(input_image).

Reader:

    def load_model():
        reader = ocr.Reader(['en'], model_storage_directory='.')
        return reader

    reader = load_model()  # load model

We call load_model() and store the Reader object it returns in the variable reader, which is then used to read the text from the image.
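
Building the reader is a costly step, so it is worth building it once and reusing it. The sketch below only illustrates that idea in plain Python: FakeReader is a hypothetical stand-in for the EasyOCR reader so that the example runs without downloading a model, and functools.lru_cache is used as a simple memoizer. This is an assumption-laden illustration rather than the project's code; inside a Streamlit app the script is rerun from top to bottom, so Streamlit's own caching (the st.cache method mentioned earlier) would be the tool that keeps the reader across reruns.

```python
from functools import lru_cache


class FakeReader:
    """Hypothetical stand-in for the OCR reader so the sketch runs without a model."""

    instances_created = 0

    def __init__(self, languages, model_storage_directory="."):
        FakeReader.instances_created += 1  # count how often the costly setup runs
        self.languages = languages
        self.model_storage_directory = model_storage_directory


@lru_cache(maxsize=1)
def load_model():
    # In the real app this line would construct the EasyOCR reader instead.
    return FakeReader(["en"], model_storage_directory=".")


# Call the loader three times: the reader object is built only on the first call.
readers = [load_model() for _ in range(3)]
print("readers built:", FakeReader.instances_created)
print("same object reused:", readers[0] is readers[2])
```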

Reading and Writing the text read from the file :
Code :

    import numpy as np

    result = reader.readtext(np.array(input_image))
    result_text = []  # empty list for storing the text
    for text in result:
        result_text.append(text[1])  # text[1] is the recognized string
    st.write(result_text)

The text in the file is read by readtext, which returns its findings in the list named result. Each entry of result holds the bounding box, the recognized text and a confidence value, so the for loop appends the recognized text, text[1], entry by entry to the initially empty list result_text. st.write(result_text) is then used to display the extracted text from the file; the text can be copied and saved in a text file for present or future use.

CONCLUSION OF PROJECT:

In conclusion, Optical Character Recognition (OCR) is a remarkable technology that holds a lot of potential. In this day and age, such tools are already quite advanced, and OCR is likely to improve even further in the future.

REFERENCE:
https://www.geeksforgeeks.org/python-programming-language/
https://streamlit.io/
https://www.youtube.com/