
David Lalrinawma, Roll No. IT-18-09, 7th semester, Mini Project Report on OCR and NLP

Mini Project Report
On
OCR & Natural Language Processing
(In partial fulfillment of the requirements for the award of the
degree
of Bachelor of Technology in Information & Technology)
Submitted By
Name : David Lalrinawma
Roll No. : IT/18/09
Semester : 7th Semester
Under the guidance of
Mrs. Vanlalmuansangi Asst.Prof.
DEPARTMENT OF INFORMATION & TECHNOLOGY
MIZORAM UNIVERSITY(A CENTRAL UNIVERSITY)
AIZAWL-796004, INDIA
2021
CERTIFICATE OF APPROVAL
Certified that the project work entitled “OCR and Natural Language Processing” is hereby approved as
a creditable work carried out in the 7th semester by “DAVID LALRINAWMA (IT/18/09)” and
presented in a satisfactory manner in partial fulfilment of the requirements of Mizoram University. It is understood by
this approval that the undersigned do not endorse or approve any statement made, opinion expressed or
conclusion drawn therein, but approve the work only for the purpose for which it has been submitted.
Mrs.Vanlalmuansangi, Asst.Prof. Mrs.Chawngsangpuii, Asst.Prof.
(Mini-Project Guide) (Mini-Project Coordinator)
External Examiner
Mrs.Chawngsangpuii
(Head of the Department)
Department of Information & Technology, Mizoram University
(A Central University)
ACKNOWLEDGEMENT
With immense pleasure I, Mr DAVID LALRINAWMA (IT/18/09), am presenting the “OCR AND
NATURAL LANGUAGE PROCESSING” mini project report in partial fulfilment of the
requirements for the degree of Bachelor of Technology in Information Technology
Engineering. I wish to thank all the people who gave me unending support.
I express my profound thanks to my project guide Mrs. Vanlalmuansangi, our project
coordinator Mrs. Chawngsangpuii and those who have indirectly guided and helped me in
the preparation of this report.
DAVID LALRINAWMA
(IT/18/09)
ABSTRACT
Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial
intelligence concerned with the interactions between computers and human language, in particular
how to program computers to process and analyze large amounts of natural language data. The goal is
a computer capable of "understanding" the contents of documents, including the contextual nuances of
the language within them. The technology can then accurately extract information and insights
contained in the documents as well as categorize and organize the documents themselves.
Optical character recognition or optical character reader (OCR) is the electronic or mechanical
conversion of images of typed, handwritten or
printed text into machine-encoded text, whether from a scanned document, a photo of a document, a
scene-photo or from subtitle text superimposed on an image.
CONTENTS
Introduction
What is Natural Language Processing (NLP)
   Uses of NLP
What is Optical Character Recognition (OCR)
How OCR works
Advantages of OCR
Objective
System design
OCR Workframe
Language Used
EasyOCR Framework
Facts
   EasyOCR currently supported languages
   How to install EasyOCR on machine
Streamlit
   Introduction of Streamlit
   1. The flow of Streamlit while developing a web app
      1. Running a streamlit app
      2. Development flow
      3. Data flow
      4. Displaying the data
      5. Widgets
      6. Layout
   2. Let’s summarize the working of streamlit
Let’s get started with our OCR Web App
Browse and Drop a file
Browsing a file to upload
Upload file
Reader
Reading and Writing the read text from the file
Conclusion of project
Reference
INTRODUCTION
Natural language processing has its roots in the 1950s. Already in 1950, Alan Turing published an
article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing
test as a criterion of intelligence, a task that involves the automated interpretation and generation of
natural language, but at the time not articulated as a problem separate from artificial intelligence.
Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial
intelligence concerned with the interactions between computers and human language, in particular
how to program computers to process and analyze large amounts of natural language data. The goal is
a computer capable of "understanding" the contents of documents, including the contextual nuances of
the language within them. The technology can then accurately extract information and insights
contained in the documents as well as categorize and organize the documents themselves.
OCR stands for "Optical Character Recognition." It is a technology that recognizes text within
a digital image. It is commonly used to recognize text in scanned documents and images. OCR
software can be used to convert a physical paper document, or an image into an accessible
electronic version with text. It is the electronic or mechanical conversion of images of typed,
handwritten or printed text into machine-encoded text, whether from a scanned document, a
photo of a document, a scene-photo (for example the text on signs and billboards in a landscape
photo) or from subtitle text superimposed on an image.
Widely used as a form of data entry from printed paper data records – whether passport
documents, invoices, bank statements, computerized receipts, business cards, mail, printouts of
static-data, or any suitable documentation – it is a common method of digitizing printed texts so
that they can be electronically edited, searched, stored more compactly, displayed on-line, and
used in machine processes such as cognitive computing, machine translation, (extracted) text-to-speech, key data and text mining. OCR is a field of research in pattern recognition, artificial
intelligence and computer vision.
What is Natural Language Processing (NLP)?
The field of study that focuses on the interactions between human language and computers is called
natural language processing, or NLP for short. It sits at the intersection of computer science, artificial
intelligence, and computational linguistics.
Uses of Natural Language Processing:

Email filters :
Email filters are one of the most basic and initial applications of NLP online. It started out
with spam filters, uncovering certain words or phrases that signal a spam message. But
filtering has upgraded, just like early adaptations of NLP. One of the more prevalent, newer
applications of NLP is found in Gmail's email classification. The system recognizes if emails
belong in one of three categories (primary, social, or promotions) based on their contents.
For all Gmail users, this keeps your inbox to a manageable size with important, relevant
emails you wish to review and respond to quickly.

Smart assistants :
Smart assistants like Apple’s Siri and Amazon’s Alexa recognize patterns in speech thanks to
voice recognition, then infer meaning and provide a useful response. We’ve become used to
the fact that we can say “Hey Siri,” ask a question, and she understands what we said and
responds with relevant answers based on context. And we’re getting used to seeing Siri or
Alexa pop up throughout our home and daily life as we have conversations with them
through items like the thermostat, light switches, car, and more. We now expect assistants
like Alexa and Siri to understand contextual clues as they improve our lives and make certain
activities easier like ordering items, and even appreciate when they respond humorously or
answer questions about themselves. Our interactions will grow more personal as these
assistants get to know more about us. As a New York Times article “Why We May Soon Be
Living in Alexa’s World,” explained: “Something bigger is afoot. Alexa has the best shot of
becoming the third great consumer computing platform of this decade.”

Search results
Search engines use NLP to surface relevant results based on similar search behaviors or user
intent so the average person finds what they need without being a search-term wizard. For
example, Google not only predicts what popular searches may apply to your query as you
start typing, but it looks at the whole picture and recognizes what you’re trying to say rather
than the exact search words. Someone could put a flight number in Google and get the flight
status, type a ticker symbol and receive stock information, or a calculator might come up
when inputting a math equation. These are some variations you may see when completing a
search, as NLP in search associates the ambiguous query with a relevant entity and provides
useful results.

Predictive text
Things like autocorrect, autocomplete, and predictive text are so commonplace on our
smartphones that we take them for granted. Autocomplete and predictive text are similar to
search engines in that they predict things to say based on what you type, finishing the word
or suggesting a relevant one. And autocorrect will sometimes even change words so that the
overall message makes more sense. They also learn from you. Predictive text will customize
itself to your personal language quirks the longer you use it. This makes for fun experiments
where individuals will share entire sentences made up entirely of predictive text on their
phones. The results are surprisingly personal and enlightening; they’ve even been
highlighted by several media outlets.

Language translation
One of the tell-tale signs of cheating on your Spanish homework is that grammatically, it’s a
mess. Many languages don’t allow for straight translation and have different orders for
sentence structure, which translation services used to overlook. But, they’ve come a long
way. With NLP, online translators can translate languages more accurately and present
grammatically-correct results. This is infinitely helpful when trying to communicate with
someone in another language. Not only that, but when translating from another language to
your own, tools now recognize the language based on inputted text and translate it.
What is Optical Character Recognition (OCR)?
Optical Character Recognition refers to a software technology that electronically identifies text
(written or printed) inside an image file or physical document, such as a scanned document, and
converts it into a machine-readable text form to be used for data processing. It is also known as text
recognition.
In short, optical character recognition software helps convert images or physical documents into a
searchable form. Examples of OCR are text extraction tools, PDF to .txt converters, and Google’s
image search function.
To see OCR software in action, you can try using Text Extractor Tool by Brandfolder. This optical
character recognition online tool can convert an image of text (such as a screenshot) into plaintext.
How does Optical Character Recognition work?
The concept of OCR is straightforward. However, its implementation can be quite challenging due to
several factors, such as the variety of fonts or the methods used for letter formation. For example,
an OCR implementation can get exponentially more complex when non-digital handwriting samples
are used as input instead of typed writing.
The entire process of OCR involves a series of steps that mainly contain three objectives: preprocessing of the image, character recognition, and post-processing the specific output.
In the following, we will show how optical character recognition works and explain the main steps
of OCR technologies.
1. Scanning the Document
This is the prime step of OCR which connects to a scanner to scan the document. Scanning the
document decreases the number of variables to account for when creating the OCR software since it
standardizes the inputs. Also, this step specifically enhances the efficiency of the entire process by
ensuring perfect alignment and sizing of the specific document.
2. Refining the Image
In this step, the optical character recognition software improves the elements of the document that
need to be captured. Any imperfections such as dust particles are eliminated, and edges, as well as
pixels, are smoothed to get a plain and clear text. This step makes it easier for the program to
capture and be able to clearly “see” the words being inputted without, for instance, smudges or
irregular dark areas.
3. Binarization
The refined image document is then converted into a bi-level document image, containing only black
and white colors, where black or dark areas are identified as characters. At the same time, white or
light areas are identified as background. This step aims to apply segmentation to the document to
easily differentiate the foreground text from the background, which allows for the optimal
recognition of characters.
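The binarization step can be sketched in a few lines of Python. This is only a minimal illustration that applies a fixed global threshold to a small grayscale grid. The function name binarize, the threshold of 128 and the sample pixel values are assumptions made for this example, and real OCR software typically chooses the threshold adaptively.

```python
def binarize(gray_image, threshold=128):
    """Convert a grayscale image (rows of 0-255 values) to a bi-level image.

    Pixels darker than the threshold become 1 (text / foreground),
    all other pixels become 0 (background).
    """
    return [[1 if pixel < threshold else 0 for pixel in row]
            for row in gray_image]


# Hypothetical 3x4 grayscale patch: low values are dark ink, high values are paper.
sample = [
    [250, 30, 40, 245],
    [240, 20, 235, 250],
    [255, 25, 35, 248],
]
binary = binarize(sample)
for row in binary:
    print(row)
```

After this step every pixel is either foreground or background, which is what the recognition step works on.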
4. Recognizing the Characters
In this step, the black areas are further processed to identify letters or digits. Usually, an OCR focuses
on one character or block of text at a time. The recognition of characters is carried out by using one
of the following two algorithms:
a) Pattern recognition. The pattern recognition algorithm involves training the OCR software
on examples of text in different fonts and formats. The trained software is then used for comparing
and recognizing the characters in the scanned document.
b) Feature detection. Through the feature detection algorithm, OCR software applies rules
considering the features of a certain letter or number to identify characters in the scanned
document. Examples of features include the number of angled lines, crossed lines, or curves
used for comparing and identifying characters.
Simple OCR software compares the pixels of every scanned letter with an existing database to
identify the closest match. However, sophisticated forms of OCR divide every character into its
components, such as curves and corners, to compare and match physical features with
corresponding letters.
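The simple pixel-comparison approach described above can be sketched as follows. This is a toy illustration and not how any particular OCR engine is implemented: the 3x3 templates, the function names and the use of the Hamming distance (the number of differing pixels) as the "closest match" measure are all assumptions made for this example.

```python
def hamming_distance(glyph_a, glyph_b):
    """Count the pixel positions at which two equal-sized binary glyphs differ."""
    return sum(
        pixel_a != pixel_b
        for row_a, row_b in zip(glyph_a, glyph_b)
        for pixel_a, pixel_b in zip(row_a, row_b)
    )


def recognize(glyph, templates):
    """Return the label of the stored template closest to the scanned glyph."""
    return min(templates, key=lambda label: hamming_distance(glyph, templates[label]))


# Hypothetical 3x3 template database (1 = ink, 0 = background).
TEMPLATES = {
    "I": [[0, 1, 0],
          [0, 1, 0],
          [0, 1, 0]],
    "L": [[1, 0, 0],
          [1, 0, 0],
          [1, 1, 1]],
    "T": [[1, 1, 1],
          [0, 1, 0],
          [0, 1, 0]],
}

# A scanned "L" with one noisy pixel in the top-right corner.
scanned = [[1, 0, 1],
           [1, 0, 0],
           [1, 1, 1]]
print(recognize(scanned, TEMPLATES))
```

The noisy glyph differs from the "L" template in only one pixel, so "L" is still the closest match. Feature detection would instead compare properties such as the number of lines and curves.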
5. Verifying the Accuracy :
After the successful recognition of characters, the results are cross-referenced by utilizing the
internal dictionaries of the OCR software to ensure accuracy. Measuring OCR accuracy is done by
taking the output of an analysis conducted by an OCR and comparing it to the contents of the
original version.
There are two typical methods for analyzing the accuracy of OCR software:
Character-level accuracy, counting how many characters were detected correctly.
Word-level accuracy, counting how many words were recognized correctly.
In most cases, 98-99% accuracy is the acceptable accuracy rate, measured at the page level. This
means that in a page of around 1,000 characters, 980-990 characters should be accurately identified
by the OCR software.
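The two accuracy measures can be sketched in Python. As an assumption, this example counts errors with the Levenshtein (edit) distance between the original text and the OCR output, which is one common way to do it; the function names and sample strings are made up for illustration.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_item != hyp_item)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def accuracy(reference, hypothesis):
    """Share of the reference recovered correctly, never below 0."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    errors = edit_distance(reference, hypothesis)
    return max(0.0, 1 - errors / len(reference))


def character_accuracy(reference_text, ocr_text):
    """Accuracy measured over individual characters."""
    return accuracy(reference_text, ocr_text)


def word_accuracy(reference_text, ocr_text):
    """Accuracy measured over whitespace-separated words."""
    return accuracy(reference_text.split(), ocr_text.split())


original = "optical character recognition"
ocr_output = "optical charactor recognition"  # one wrong character
print(round(character_accuracy(original, ocr_output), 3))
print(round(word_accuracy(original, ocr_output), 3))
```

A single wrong character costs little at character level but a whole word at word level, which is why word-level accuracy is usually the lower of the two figures.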
Advantages Of Optical Character Recognition
Optical Character Recognition offers a wide range of benefits, many of which were reviewed in this
report. The most important benefits of OCR are listed below.
• Improved accuracy: Software-based character recognition reduces human errors,
resulting in improved accuracy.
• Faster processes: The technology converts unstructured data into searchable
information, making the required data available at faster rates and subsequently speeding
up business processes.
• Cost-effective: OCR technology does not require a lot of resources, which reduces the
processing costs and subsequently reduces the overall costs of a business.
• Enhanced customer satisfaction: The accessibility of searchable data by the customers
ensures a good experience, assuring better customer satisfaction.
• Improved productivity: The easy accessibility of searchable data makes a stress-free
environment for the employees, allowing them to focus on the main goals, boosting the
productivity of a business.
OBJECTIVE
The objective of this project on OCR is to achieve modification or conversion of any
form of text or text-containing documents such as handwritten text, printed or scanned
text images, into an editable digital format for deeper and further processing.
SYSTEM DESIGN
Once the planning and analysis of the project are completed, the design phase begins. The
goal of the system is to let users choose an image file to upload, display the file,
read it with OCR, and write out the recognized text in a form that can be edited.
OCR WORKFRAME
LANGUAGE USED:
Python is used to build this system. The EasyOCR package for Python is used for extracting the
text from the image.
Some example code:
import easyocr

def load_model():
    # English-only reader; model files are stored in the current directory
    reader = easyocr.Reader(['en'], model_storage_directory='.')
    return reader
Let's break down the code line by line:
1. We create an instance of the Reader class from the easyocr package and pass ['en'] as the
language list, which means that the reader recognizes only English text in the image. Text in
other languages such as Chinese or Japanese cannot be recognized correctly with this setting.
2. Having set the language, we pass model_storage_directory='.' so that the model files are
stored in and loaded from the current directory. The returned reader is later used to call the
readtext() function on the image.
EASYOCR FRAMEWORK
Facts :
EasyOCR currently supports 42 languages:
Afrikaans (af), Azerbaijani (az), Bosnian (bs), Czech (cs), Welsh (cy), Danish (da), German (de),
English (en), Spanish (es), Estonian (et), French (fr), Irish (ga), Croatian (hr), Hungarian (hu),
Indonesian (id), Icelandic (is), Italian (it), Japanese (ja), Korean (ko), Kurdish (ku), Latin (la),
Lithuanian (lt), Latvian (lv), Maori (mi), Malay (ms), Maltese (mt), Dutch (nl), Norwegian (no),
Polish (pl), Portuguese (pt), Romanian (ro), Slovak (sk), Slovenian (sl), Albanian (sq), Swedish
(sv), Swahili (sw), Thai (th), Tagalog (tl), Turkish (tr), Uzbek (uz), Vietnamese (vi), Chinese (zh).
Source: JaidedAI
EasyOCR is flexible enough to run text detection with or without a GPU.
How to install EasyOCR on machine
To get started installing EasyOCR, we recommend following the pip install opencv tutorial
with an important caveat:
We install opencv-python and not opencv-contrib-python in our virtual environment. Having both
of these packages in the same environment could lead to unintended consequences.
pip is unlikely to complain if both are installed, so we should be diligent and check with the pip
freeze command.
We also recommend dedicating a separate Python virtual environment on our system to
EasyOCR (Option B of the pip install opencv guide).
Although Option B suggests naming the virtual environment cv, we recommend naming
it easyocr, ocr_easy, or something similar. We install PyTorch in it as well.
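The check suggested above can also be done from Python instead of reading the pip freeze output by eye. The following is a minimal sketch using the standard library module importlib.metadata (Python 3.8 or later); the helper names find_opencv_conflict and installed_package_names are made up for this example.

```python
from importlib import metadata

# The two packages that should not share one environment.
CONFLICTING = {"opencv-python", "opencv-contrib-python"}


def find_opencv_conflict(installed_names):
    """Return both package names (sorted) if both are present, else an empty list."""
    normalized = {name.lower().replace("_", "-") for name in installed_names}
    present = CONFLICTING & normalized
    return sorted(present) if present == CONFLICTING else []


def installed_package_names():
    """Names of all distributions visible in the current environment."""
    names = []
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip broken installs that report no name
            names.append(name)
    return names


conflict = find_opencv_conflict(installed_package_names())
if conflict:
    print("Both packages are installed:", ", ".join(conflict))
else:
    print("No opencv-python / opencv-contrib-python conflict found.")
```

Running this inside the activated virtual environment reports whether both OpenCV packages ended up in it.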
STREAMLIT:
INTRODUCTION OF STREAMLIT:
Streamlit is an open-source python framework for building web apps for Machine Learning and Data
Science. We can quickly develop web apps and deploy them easily using Streamlit. Streamlit allows
us to write an app the same way we write ordinary python code, and it makes it seamless to work in the
interactive loop of coding and viewing results in the web app.
The flow of Streamlit while developing a web app
Running a streamlit app
First, we create a python script with streamlit commands and execute the script using the following
command,
streamlit run <yourscript.py>
After running this, the app opens in a new tab in our default browser.
Development flow
If the source code of the streamlit python script changes, the app asks in the top-right corner whether
to rerun the application. We can also select the ‘Always rerun’ option to rerun
automatically whenever the source script changes.
This makes our development flow much easier: every change we make is reflected
immediately in the web app. This loop between coding and viewing results live lets us work
seamlessly with streamlit.
Data flow
Streamlit has a distinctive data flow: any time something changes in our code or anything needs
to be updated on the screen, streamlit reruns our python script entirely from top to bottom.
This happens when the user interacts with a widget like a select box or drop-down box, or when
the source code is changed.
If we have some costly operations while rerunning our web app, like loading data from databases,
we can use streamlit’s st.cache decorator (superseded by st.cache_data and st.cache_resource in
newer Streamlit releases) to cache those datasets, so that the app loads faster.
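The effect of caching under this rerun model can be sketched without Streamlit. As a stand-in for streamlit's own cache decorator, this example uses functools.lru_cache from the standard library, and the function run_script plays the role of one top-to-bottom run of the app script. All names and the placeholder data are assumptions made for illustration.

```python
from functools import lru_cache

load_calls = 0  # counts how often the costly load really runs


@lru_cache(maxsize=None)
def load_dataset(name):
    """Stand-in for a costly operation such as a database query."""
    global load_calls
    load_calls += 1
    return tuple(range(5))  # placeholder data


def run_script():
    """One top-to-bottom run of the app script."""
    data = load_dataset("demo")  # served from the cache after the first run
    return sum(data)


# Three "reruns", as if the user had interacted with a widget three times.
results = [run_script() for _ in range(3)]
print(results, "- load executed", load_calls, "time(s)")
```

The script body runs three times, but the costly load runs only once. This is the behaviour that caching provides in a streamlit app, and it is also why the OCR model in this project is loaded inside a separate function.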
Displaying the data
Streamlit provides you with many methods to display various types of data like arrays, tables, and
data frames.
• To write a string, simply use st.write("Your string").
• To display a data frame, use the st.dataframe method.
Widgets
There are several widgets available in streamlit, such as st.selectbox, st.checkbox and st.slider. Let
us see an example of a widget in streamlit.
import streamlit as st
value = st.slider('val') # this is a widget
st.write(value, 'squared is', value * value)
During the first run, the web app outputs the string “0 squared is 0”. If the user then moves
the slider, streamlit reruns the code from top to bottom and assigns the present
state of the widget to the variable.
For example, if the user slides the slider to point 15, streamlit will rerun the code and output the
following string “15 squared is 225”.
Layout
You can easily arrange your widgets or data using st.sidebar, which places
them in the left sidebar panel. For example, all you have to do is use st.sidebar.selectbox
to display a select box in the left panel.
Let’s summarize the working of streamlit
• Streamlit runs the python script from top to bottom.
• Each time the user interacts with the app, the script is rerun from top to bottom.
• Streamlit allows you to use caching for costly operations like loading large datasets.
Let’s get started with our OCR Web App:
Let’s write a title and a subtitle in the web app using streamlit.
import streamlit as st
# Title
st.title('Easy OCR - Extract Text from Image')
# Subtitle (Markdown needs a space after the ## heading marker)
st.markdown("## Optical Character Recognition - Using 'easyocr', 'streamlit'")
Now execute the following command in your console,
streamlit run streamlit_app.py
After executing the command, open the following URL to see the
result: http://localhost:8501/
In the above code, the st.title method writes the given string as the page title, and the st.markdown
method renders the given Markdown string, which is used here as the subtitle.
Browse and Drop a file:
We write the code below to browse for or drop a file in our Web App.
# Image uploader
image = st.file_uploader(label="Upload your image here", type=['png', 'jpg', 'jpeg'])
Using st.file_uploader, we are able to browse for a file in a folder or drag and drop it into the
app. The type argument restricts uploads to image files (png, jpg, jpeg), as the OCR reads only
from images, and any other type of file is rejected.
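The effect of the type restriction can be pictured as a simple check on the file extension. The sketch below only illustrates that idea and is not Streamlit's internal implementation; the helper name is_allowed_image and the sample file names are assumptions.

```python
from pathlib import Path

ALLOWED_TYPES = ("png", "jpg", "jpeg")  # same list as passed to st.file_uploader


def is_allowed_image(filename, allowed=ALLOWED_TYPES):
    """Return True if the file name ends in one of the allowed extensions."""
    extension = Path(filename).suffix.lower().lstrip(".")
    return extension in allowed


# Hypothetical file names, used only to show the behaviour.
for name in ["receipt.PNG", "scan.jpeg", "notes.pdf", "README"]:
    print(name, "->", "accepted" if is_allowed_image(name) else "rejected")
```

A check like this can also be reused when images are read from a folder instead of through the uploader.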
Browsing a file to upload:
The model storage directory is given using the code below:
import easyocr as ocr

def load_model():
    # '.' stores the model files in the current directory
    reader = ocr.Reader(['en'], model_storage_directory='.')
    return reader
The file to be uploaded is selected with Browse.
In the screenshot, the red arrow points to the folder from which the file can be uploaded,
here our current directory, but note that we can still browse other folders too.
The storage directory is given as a relative path ('.'), that is, the directory from which the app
is run, so the model files are kept in the current directory.
Upload file:
The uploaded file is displayed using:
from PIL import Image
input_image = Image.open(image)  # read the uploaded image
st.image(input_image)  # display the image
In the screenshot, the yellow arrow points at the uploaded file and the red arrow points at the
displayed image. In input_image = Image.open(image), Image comes from PIL, the Python
Imaging Library, which provides the python interpreter with image editing capabilities.
The Image module provides a class with the same name which is used to represent a PIL image.
The module also provides a number of factory functions, including functions to load images from
files and to create new images. The uploaded file is then displayed using st.image(input_image).
Reader:
def load_model():
    reader = ocr.Reader(['en'], model_storage_directory='.')
    return reader

reader = load_model()  # load model
We create a Reader object inside load_model() and store the returned, loaded model in the
variable reader.
Reading and Writing the read text from the file:
Code:
import numpy as np
result = reader.readtext(np.array(input_image))  # one entry per detected text region
result_text = []  # empty list for storing the recognized text
for text in result:
    result_text.append(text[1])  # text[1] is the recognized string
st.write(result_text)
The text in the file is read by readtext, which returns a list named result with one entry per
detected text region. Using a for loop, the recognized string of each entry, text[1], is appended
to the initially empty list result_text. Finally, st.write(result_text) displays the extracted text,
which can be copied and saved in a text file for present or future use.
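By default, each entry returned by readtext contains the bounding box, the recognized text and a confidence value, which is why text[1] is used above. The following sketch shows how such a result could be filtered by confidence and saved to a text file. The helper names, the confidence cut-off and the sample result are made up for illustration and are not actual output of this project.

```python
def extract_text(result, min_confidence=0.0):
    """Collect recognized strings from (box, text, confidence) entries."""
    return [text for _box, text, confidence in result if confidence >= min_confidence]


def save_text(lines, path):
    """Write one recognized line per row to a UTF-8 text file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines))


# Hypothetical readtext() output, made up for illustration only.
sample_result = [
    ([[10, 10], [120, 10], [120, 40], [10, 40]], "Mini Project", 0.97),
    ([[10, 50], [90, 50], [90, 80], [10, 80]], "Rep0rt", 0.41),
]

print(extract_text(sample_result))                      # all entries
print(extract_text(sample_result, min_confidence=0.5))  # confident entries only
```

In the web app, the list returned by extract_text could be passed to st.write in place of result_text, and save_text could be used to keep a copy of the extracted text.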
CONCLUSION OF PROJECT:
In conclusion, Optical Character Recognition (OCR) is a remarkable technology that holds a lot
of potential. In this day and age, such tools are already quite advanced, and OCR is likely to
improve even further in the future.
REFERENCE:
• https://www.geeksforgeeks.org/python-programming-language/
• https://streamlit.io/
• https://www.youtube.com/