2017 International Conference on Circuits, Power and Computing Technologies [ICCPCT]
TEXT RECOGNITION AND FACE DETECTION
AID FOR VISUALLY IMPAIRED PERSON USING
RASPBERRY PI
Mr. Rajesh M.
Scientist/Engineer ‘C’
NIELIT, NIT campus
Calicut, Kerala, India
rajeshm@nielit.gov.in
Ms. Bindhu K. Rajan
Electronics and Communication Engineering
Jyothi Engineering College Cheruthuruthy,
Thrissur, Kerala, India
bindhukrajan09@gmail.com
Ajay Roy, Almaria Thomas K,
Ancy Thomas, Bincy Tharakan T, Dinesh C
Electronics and Communication Engineering
Jyothi Engineering College Cheruthuruthy,
Thrissur, Kerala, India
ajayroyp1995@gmail.com, almariakt@gmail.com,
ancytpjec2015@gmail.com, bincytharakan96@gmail.com,
dineshrajagiri@gmail.com
Abstract— Speech and text are the main media of human communication. A person needs vision to access the information in a text, but those with poor vision can gather information from voice. This paper proposes a camera-based assistive text reader to help visually impaired persons read the text present in a captured image. Faces can also be detected when a person enters the frame, with the mode selected through a mode-control switch. The proposed idea involves extracting text from the scanned image using the Tesseract Optical Character Recognition (OCR) engine and converting the text to speech with the e-Speak tool, a process which enables visually impaired persons to read the text. This is a prototype that helps blind people recognize products in the real world by extracting the text on an image and converting it into speech. The proposed method is carried out on a Raspberry Pi, and portability is achieved with a battery backup, so the user can carry the device anywhere and use it at any time. As a future extension, previously stored faces entering the camera view can be identified and announced to the user. This technology can help the millions of people in the world who experience a significant loss of vision.
Index Terms—Stroke Width Transform (SWT), Gaussian filter, adaptive Gaussian thresholding, OCR engine, e-Speak
I. INTRODUCTION
Due to eye diseases, age-related causes, uncontrolled diabetes, accidents and other reasons, the number of visually impaired persons increases every year. One of the most significant difficulties for a visually impaired person is reading. Recent developments in mobile phones and computers, and the availability of digital cameras, make it feasible to assist blind persons by developing camera-based applications that combine computer vision tools with other existing beneficial products such as an Optical Character Recognition (OCR) system.
In the proposed system, text recognition is done with Open Computer Vision (OpenCV), a library of functions used for implementing image processing techniques. Image processing is the application of mathematical operations to images; any form of input, such as a single image, a series of images, or a video, can be processed. The output of image processing is an image or a set of characteristics or parameters related to the image. Image processing has various applications such as computer graphics, scanning, facial recognition and text recognition. Various features of text, such as its font, font size, alignment and background, influence its recognition. Number plate recognition is a fair example of text extraction.
Text extraction from an image is carried out by OCR, a method of converting images of writing on labels, printed books, sign boards, etc. into machine-readable text. OCR helps to create reading devices for visually impaired persons and technologies involving telegraphy. The binary image is converted to text by the Tesseract library in the OCR engine, which detects outlines, slope, pitch, white spaces and joined letters. It also checks the quality of the recognized text. In this system, the conversion of text to voice output is done by e-Speak, a Text-To-Speech (TTS) system which converts text into speech. The artificial production of human speech is known as speech synthesis, and the platform used for this purpose is known as a speech synthesizer; it can be implemented as a software or a hardware product. The storage of entire words or sentences allows for high-quality output in specific usage domains. A synthesizer can also incorporate a model of the vocal tract and other human voice characteristics.
This paper aims to build an efficient camera-based assistive text reading device. The idea involves extracting text from images taken by a camera mounted on a spectacle. The extracted text is then converted to audio signals and played as voice output. The device also detects a person's face in the frame. The system is built on a Raspberry Pi, with portability as the main aim, achieved by providing a battery backup.
This paper is organized as follows. Section II introduces the work related to the proposed system. The proposed methodology is presented in Section III. The hardware and software implementations are explained in Sections IV and V respectively. Results are shown in Section VI, and Section VII concludes the paper.
II. LITERATURE SURVEY
Earlier works help to support visually impaired persons. Most of the existing systems are built on MATLAB platforms, and a few of them use laptops, so they are not portable. Algorithms used in earlier systems lack efficiency and accuracy.
The paper [1] presents a prototype for extracting text from images using a Raspberry Pi. The images are captured using a web cam and processed using OpenCV and Otsu's algorithm. Initially the captured images are converted to grayscale. The images are rescaled and cosine transformations are applied by setting vertical and horizontal ratios. After some morphological transformations, Otsu's thresholding, an automatic thresholding algorithm, is applied to the images. After thresholding, contours for the images are generated using dedicated functions in OpenCV. Using these contours, bounding boxes are drawn around the objects and text in the images; each character present in the image is extracted and passed to the OCR engine to recognize the text present in the image.
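This pipeline maps naturally onto a few OpenCV calls. The following is a minimal sketch of the idea, not the code of [1]; OpenCV 4.x return signatures are assumed and all names are illustrative.

```python
import cv2

def extract_character_boxes(image_path):
    """Sketch of the pipeline described in [1]: grayscale, Otsu
    thresholding, contour detection, and bounding boxes."""
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Otsu's method picks the threshold automatically from the histogram.
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Contours of the thresholded blobs approximate character outlines.
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)

    # Bounding boxes around each contour; each crop could then be
    # handed to the OCR engine.
    boxes = [cv2.boundingRect(c) for c in contours]
    return [gray[y:y + h, x:x + w] for (x, y, w, h) in boxes]
```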
The work in [3] proposed a camera-based assistive text reading framework to help visually impaired persons read text labels and product packaging on hand-held objects in daily life. The system proposes a motion-based method to define a Region Of Interest (ROI) for isolating the object from untidy backgrounds or other surrounding objects in the camera view. A mixture-of-Gaussians-based background subtraction technique is used to extract the moving object region. To acquire text details from the ROI, text localization and recognition are conducted, and text regions in the object ROI are automatically focused. A novel text localization algorithm learns gradient features of stroke orientations and distributions of edge pixels in an Adaboost model. Text characters in the localized text regions are binarized and recognized by off-the-shelf optical character recognition software.
To detect text in natural scenes using the SWT [4], a bottom-up integration of information is performed, merging pixels of similar stroke width into connected components. This allows letters to be detected across a wide range of scales in the same image. Since the method does not use a filter bank of a few discrete orientations, it can detect strokes (and, consequently, text lines) of any direction. The method carries enough information for accurate text segmentation, so a good mask is readily available for detected text. The need for integration over scales and filter orientations, and the inherent bias toward horizontal texts, are the limitations of this method. The definition of stroke is related to the linear features used in the remote sensing and medical imaging domains. In road detection, the range of road widths in an aerial or satellite photo is known and limited, whereas text appearing in a natural image can vary drastically in scale. Additionally, roads are typically elongated linear structures with low curvature, which is again not true for text.
The work in [5] describes a camera-based text reading system for blind persons. A binary image is created using global or local thresholding, the choice being decided by the Fisher's Discriminant Rate (FDR). The technique is essentially based on Otsu's binarization method, an automatic threshold selection, region-based segmentation method. When characters are present in a frame, the local histogram has two peaks, which is reflected as a high value of the FDR. For quasi-uniform frames the value of the FDR is small and the histogram has only one peak. In the case of complex areas the histogram is dispersed, resulting in higher FDR values, which are still lower than in the case of text areas. The FDR is thus used to detect image frames with a bimodal gray-level histogram: frames with high FDR values are binarized using the local Otsu threshold, while frames with low FDR values are binarized globally.
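The exact FDR formula of [5] is not reproduced here, but a common two-class form of the Fisher criterion is FDR = (μ1 − μ2)² / (σ1² + σ2²), with the classes split at a candidate threshold. A rough sketch of such a test, taking the split point from Otsu's method, might look as follows; the function name and the epsilon guard are ours.

```python
import cv2
import numpy as np

def fisher_discriminant_rate(gray):
    """Split the pixels at the Otsu threshold and compute a two-class
    Fisher criterion: a bimodal (text-like) frame scores high."""
    t, _ = cv2.threshold(gray, 0, 255,
                         cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    low = gray[gray <= t].astype(np.float64)
    high = gray[gray > t].astype(np.float64)
    if low.size == 0 or high.size == 0:   # quasi-uniform frame
        return 0.0
    # (mu1 - mu2)^2 / (var1 + var2), with a small guard against
    # division by zero on perfectly flat regions.
    return (low.mean() - high.mean()) ** 2 / (low.var() + high.var() + 1e-9)
```

A frame whose score exceeds some tuned cutoff would then be binarized with the local Otsu threshold, and globally otherwise.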
The proposed system carries out image processing using OpenCV and Tesseract. The camera used for capturing the images is installed on a spectacle. The system is made efficient, effective and portable by providing a battery backup.
III. PROPOSED METHODOLOGY
The proposed method helps blind persons read the text present on text labels, printed notes and products, acting as a camera-based assistive text reader. The implemented idea involves text recognition and face detection in images taken by the camera on the spectacle; the text is recognized using OCR, and the recognized text file is converted to voice output by the e-Speak tool. The system is designed for portability, which is achieved by providing a battery backup.

The portability allows the user to carry the device anywhere and use it at any time. A prototype was developed which uses a camera on a spectacle and a Raspberry Pi that works in real time.

The proposed system has two different modes, as shown in Fig. 1. The face and text modes are selected using a mode-control switch. The system captures a frame and checks for the presence of text in it. It also checks for the presence of a face in the frame and informs the user via an audio message. If a character is found by the camera, the user is informed that an image with some text was detected. Thus, if the user wants to hear the content of the image, he can use a switch to capture the image. A sketch of a face-detection step in this style is given below.
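The paper does not name the face detector used in face mode; a lightweight and common choice on a Raspberry Pi with OpenCV is a Haar cascade classifier, which is what the following sketch assumes (the pre-trained cascade file ships with the opencv-python package).

```python
import cv2

# Pre-trained frontal-face Haar cascade bundled with OpenCV.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def faces_present(frame):
    """Return True if at least one face is found in the frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1,
                                          minNeighbors=5)
    return len(faces) > 0
```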
Fig. 1. System architecture
The captured image is first converted to grayscale and then filtered using a Gaussian filter to reduce the noise in the image. The filtered image is then converted to binary using adaptive Gaussian thresholding. The binarized image is cropped so that the portions of the image with no characters are removed. The cropped frame is loaded into the Tesseract OCR engine to perform text recognition. The output of Tesseract OCR is a text file, which becomes the input of e-Speak. e-Speak creates an analog signal corresponding to the text file given as input, and this signal is fed to a headphone to produce the audio output.
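The chain just described (grayscale, Gaussian filtering, adaptive Gaussian thresholding, cropping) translates into a short OpenCV routine. This is a sketch of the described steps rather than the authors' code; the blur kernel, block size, constant C and the crop heuristic are assumed tuning choices.

```python
import cv2

def preprocess_for_ocr(frame):
    """Grayscale -> Gaussian blur -> adaptive Gaussian threshold ->
    crop to the region that actually contains dark marks."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)

    # Adaptive Gaussian thresholding: each pixel is compared against a
    # Gaussian-weighted mean of its 31x31 neighbourhood minus C=10.
    binary = cv2.adaptiveThreshold(blurred, 255,
                                   cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY, 31, 10)

    # Crop away empty borders by bounding the dark (text) pixels.
    dark = cv2.findNonZero(255 - binary)
    if dark is None:
        return binary          # nothing to crop
    x, y, w, h = cv2.boundingRect(dark)
    return binary[y:y + h, x:x + w]
```

The cropped binary image is what gets handed to Tesseract in the next stage.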
IV. HARDWARE IMPLEMENTATION
The hardware components used in this system are a Raspberry Pi, a camera on a spectacle and a power bank. The camera on the spectacle captures the image from the frame. The captured images are sent to the Raspberry Pi, where all the image processing is done. The voice output is available on the audio jack and can be heard through a headphone. Power for the system is supplied by a 16,000 mAh power bank. A rectifier circuit is used to charge the power source.
Raspberry Pi
The Raspberry Pi is a small programmable computer. It works like a Linux-based computer and can perform all the normal operations of a PC on an open source platform. A Raspberry Pi 2 Model B with 1 GB RAM is used in this system. This model comes with 40 GPIO pins and 4 USB ports, which makes it very useful; it also has a camera interface and a 3.5 mm audio jack. The USB ports available on the board are used to connect the camera to the Raspberry Pi. Three GPIO pins are used: one for capturing the image, one for mode control and one for shutting down the system. The board is operated in such a way that the code starts executing when it is powered on. The audio output is available through the audio jack.
Camera
A compact camera mounted on the spectacle is used for image capturing. It has auto-focusing capability and a resolution of 1280x720, which is capable of capturing good quality images. A USB-powered camera is used so that it can be connected to the Raspberry Pi board.

Power Bank
A power bank with a capacity of 16,000 mAh is used to make the model portable. It can keep the system live for a maximum of 8 hours and can be recharged. It provides a steady 5 V, 2 A supply for a long period, making the system more compact and flexible.

V. SOFTWARE IMPLEMENTATION
The Raspberry Pi runs Raspbian, which is derived from the Debian operating system. The algorithms are written in the Python scripting language, and the functions in the algorithm are called from the OpenCV library, an open source computer vision library. Text recognition is performed by Tesseract, an open source OCR engine.
Fig. 2. Flow chart

The flow chart for the proposed system is shown in Fig. 2. The system initializes the values of count and mode to zero. Count stores the number of frames, and the mode value selects the text or face mode. When the number of frames reaches 120, the system checks for a face or text depending on the mode and gives the voice output. The switches 'c' and 'm' are used in the system: when switch 'c' is high, the system captures the image and does the processing, and the mode control is done by switch 'm'. The system also includes a switch 's' to shut it down. A sketch of such a main loop, under assumed pin assignments, is given below.
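The flow chart translates into a simple polling loop. The sketch below is our reading of it: the BCM pin numbers for switches 'c', 'm' and 's' are invented for illustration, and handle_text/handle_face are placeholders for the OCR and face-detection paths described elsewhere in the paper.

```python
import os
import cv2
import RPi.GPIO as GPIO

CAPTURE_PIN, MODE_PIN, SHUTDOWN_PIN = 17, 27, 22   # assumed BCM pins

def handle_text(frame):
    """Placeholder for the preprocessing + Tesseract + e-Speak path."""

def handle_face(frame):
    """Placeholder for the face-detection + audio announcement path."""

GPIO.setmode(GPIO.BCM)
for pin in (CAPTURE_PIN, MODE_PIN, SHUTDOWN_PIN):
    GPIO.setup(pin, GPIO.IN, pull_up_down=GPIO.PUD_DOWN)

camera = cv2.VideoCapture(0)
count, mode = 0, 0                     # both start at zero, as in Fig. 2

while True:
    ok, frame = camera.read()
    if not ok:
        continue
    count += 1

    if GPIO.input(MODE_PIN):           # switch 'm' toggles text/face mode
        mode = 1 - mode                # (a real build would debounce this)
    if GPIO.input(SHUTDOWN_PIN):       # switch 's' powers the Pi down
        os.system("sudo shutdown now")

    if count >= 120:                   # every 120 frames, check the scene
        count = 0
        if mode == 0:
            handle_text(frame)
        else:
            handle_face(frame)

    if GPIO.input(CAPTURE_PIN):        # switch 'c' captures and processes
        cv2.imwrite("capture.png", frame)
        handle_text(frame)
```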
Conversion of image to text using the OCR tool
Tesseract is an open source OCR engine. It assumes that its input is a binary image with optional polygonal text regions defined. The first step is a connected component analysis in which the outlines of the components are stored. By inspecting the nesting of outlines, it is easy to detect inverse text and recognize it as easily as black-on-white text. At this stage, outlines are gathered together, purely by nesting, into blobs. Blobs are organized into text lines, and the lines and regions are analyzed for fixed-pitch or proportional text. The slope across the line is used to find text lines. These lines are broken into words differently according to the kind of character spacing: fixed-pitch text is chopped immediately by character cells, and the cells are checked for joined letters, which are separated if found. The quality of the recognized text is then verified; if the clarity is not sufficient, the text is passed to an associator. The classifier compares each recognized letter with training data, and word recognition is done by considering confidence and rating [6].
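The paper drives Tesseract from Python. One common way to do that is the pytesseract wrapper, which is an assumption here; the authors may equally call the tesseract binary directly. A minimal sketch:

```python
import cv2
import pytesseract

def image_to_text(binary_image_path):
    """Run Tesseract on a preprocessed (binarized) image and return
    the recognized text."""
    img = cv2.imread(binary_image_path, cv2.IMREAD_GRAYSCALE)
    # --psm 6 assumes a single uniform block of text; tune as needed.
    return pytesseract.image_to_string(img, config="--psm 6")
```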
Conversion of text to voice using e-Speak
Text-to-speech conversion is done using e-Speak, a TTS system. The artificial production of human speech is known as speech synthesis. A speech computer or speech synthesizer is used for this purpose and can be implemented as a software or hardware product. The storage of entire words or sentences for specific usage domains allows for high-quality output. To create a completely "synthetic" voice output, a synthesizer can incorporate a model of the vocal tract and other human voice characteristics.

Fig. 3. Schematic diagram for e-Speak

Fig. 3 illustrates the e-Speak pipeline. A TTS system (or "engine") is composed of two parts: a front-end and a back-end. The front-end has two major tasks, the normalization and the phonetic transcription of the text. Normalization, pre-processing or tokenization is the conversion of text containing symbols like abbreviations and numbers into equivalent written-out words. The front-end then assigns a phonetic transcription to each word (text-to-phoneme conversion) and marks and divides the text into prosodic units like clauses, sentences and phrases. The output of the front-end is a symbolic linguistic representation made up of the phonetic transcriptions and the prosody information. The back-end performs the function of a synthesizer, converting the symbolic linguistic representation into sound. The most important qualities of a speech synthesis system are naturalness and intelligibility: naturalness describes how closely the output resembles human speech, while intelligibility is the ease with which the output is understood. Speech synthesis systems usually try to maximize both, the characteristics of an ideal speech synthesizer.
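On Raspbian, e-Speak is available as the espeak command-line program, so the text file produced by the OCR stage can be spoken with a single subprocess call. The speed flag below is an illustrative default, not a value from the paper.

```python
import subprocess

def speak_text_file(path, words_per_minute=140):
    """Read the OCR output file aloud through the espeak CLI; the audio
    goes to the Pi's default output (the 3.5 mm jack here)."""
    subprocess.run(["espeak", "-s", str(words_per_minute), "-f", path],
                   check=True)

# Example: speak_text_file("ocr_output.txt")
```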
VI. RESULTS
The results obtained from the procedure described above are illustrated in the figures below. Fig. 4 shows the image captured using the camera on the spectacle, Fig. 5 shows the preprocessed image given to the Tesseract OCR engine to extract the text, and Fig. 6 shows the output from the Tesseract OCR engine. The accuracy can be improved by using an HD auto-focus camera.

Fig. 4. Captured Image
Fig. 5. Image given to Tesseract OCR
Fig. 6. Output from the OCR
VII. CONCLUSION
In the proposed idea, the portability issue is solved by using a Raspberry Pi. MATLAB is replaced with OpenCV, which results in faster processing; OpenCV, a modern tool for image processing, has more supporting libraries than MATLAB. The device consists of a camera installed on a spectacle. The processor used is very small and can be kept inside the user's pocket. A wired connection to the camera is provided for fast access. The power bank supplied with the system keeps the device working for about 6 to 8 hours. These features make the device simple, reliable and user friendly.

The proposed system can be improved through the addition of various components. Adding GPS will enable the user to get directions and information about his present location. The device can also be used for face recognition: a visually impaired person need not guess who people are, since he can identify them as the camera captures their faces. A GSM module can be added to implement a panic button; if the user is in trouble, he can use it to seek help by sending his location to some predefined mobile numbers, which will increase the safety of blind people. The device could give better results if some training is given to the visually impaired person. By adding an object detection feature to the visual narrator, it could recognize objects that are commonly used by visually impaired people; recognizing objects like currency, tickets, visa cards, or numbers and details on a smart phone could make the life of blind people easier. Identification of traffic signals, sign boards and other landmarks could be helpful in traveling. A Bluetooth facility could be added in order to remove the wired connection between the spectacle and the Raspberry Pi.

REFERENCES
[1] Rupali D. Dharmale and P. V. Ingole, "Text Detection and Recognition with Speech Output for Visually Challenged Person," vol. 5, issue 1, January 2016.
[2] L. Nagaraja et al., "Vision Based Text Recognition Using Raspberry Pi," National Conference on Power Systems and Industrial Automation (NCPSIA 2015).
[3] N. Rajkumar, M. G. Anand and N. Barathiraja, "Portable Camera Based Product Label Reading for Blind People," IJETT, vol. 10, no. 11, April 2014.
[4] Boris Epshtein, Eyal Ofek and Yonatan Wexler, "Detecting Text in Natural Scenes with Stroke Width Transform."
[5] Nobuo Ezaki et al., "Improved Text-Detection Methods for a Camera-Based Text Reading System for Visually Impaired Persons," Eighth International Conference on Document Analysis and Recognition (ICDAR'05), IEEE, 2005.
[6] Ray Smith, "An Overview of the Tesseract OCR Engine."
[7] Chucai Yi, Yingli Tian and Aries Arditi, "Portable Camera-Based Assistive Text and Product Label Reading from Hand-Held Objects for Visually Impaired Persons," IEEE/ASME Transactions on Mechatronics, vol. 19, no. 3, p. 808, June 2014.
[8] Sherine Sebastian and Priya S., "Text Detection and Recognition from Images as an Aid to Visually Impaired Persons Accessing Unfamiliar Environments," ARPN Journal of Engineering and Applied Sciences, ISSN 1819-6608, vol. 10, no. 17, September 2015.