2017 International Conference on Circuits, Power and Computing Technologies [ICCPCT]

TEXT RECOGNITION AND FACE DETECTION AID FOR VISUALLY IMPAIRED PERSON USING RASPBERRY PI

Mr. Rajesh M., Scientist/Engineer 'C', NIELIT, NIT Campus, Calicut, Kerala, India, rajeshm@nielit.gov.in
Ms. Bindhu K. Rajan, Electronics and Communication Engineering, Jyothi Engineering College, Cheruthuruthy, Thrissur, Kerala, India, bindhukrajan09@gmail.com
Ajay Roy, Almaria Thomas K, Ancy Thomas, Bincy Tharakan T, Dinesh C, Electronics and Communication Engineering, Jyothi Engineering College, Cheruthuruthy, Thrissur, Kerala, India, ajayroyp1995@gmail.com, almariakt@gmail.com, ancytpjec2015@gmail.com, bincytharakan96@gmail.com, dineshrajagiri@gmail.com

Abstract— Speech and text are the main media of human communication. A person needs vision to access the information in a text, but those with poor vision can gather information from voice. This paper proposes a camera-based assistive text reader to help visually impaired persons read the text present in a captured image. Faces can also be detected when a person enters the frame, with the mode selected by a mode control. The proposed idea involves extracting text from the scanned image using the Tesseract Optical Character Recognition (OCR) engine and converting the text to speech with the e-Speak tool, a process which enables visually impaired persons to read text. The result is a prototype that lets blind people recognize products in the real world by extracting the text in an image and converting it into speech. The proposed method is implemented on a Raspberry Pi, and portability is achieved with a battery backup, so the user can carry the device anywhere and use it at any time. Identifying and announcing previously stored faces when they enter the camera view can be implemented as a future extension. This technology can help the millions of people in the world who experience a significant loss of vision.

Index Terms—Stroke Width Transform (SWT), Gaussian filter, adaptive Gaussian thresholding, OCR engine, e-Speak

I. INTRODUCTION

Due to eye diseases, age-related causes, uncontrolled diabetes, accidents and other reasons, the number of visually impaired persons increases every year. One of the most significant difficulties for a visually impaired person is reading. Recent developments in mobile phones and computers, and the availability of digital cameras, make it feasible to assist blind persons with camera-based applications that combine computer vision tools with other existing beneficial products such as an Optical Character Recognition (OCR) system. In the proposed system, text recognition is done with Open Computer Vision (OpenCV), a library of functions for implementing image processing techniques.

Image processing applies mathematical operations to images; the input can be an image, a series of images, or a video, and the output is an image or a set of characteristics or parameters related to the image. Image processing has various applications such as computer graphics, scanning, facial recognition and text recognition. Various features of text, such as its font, font size, alignment and background, influence its recognition; number plate recognition is a fair example of text extraction. Text extraction from an image is carried out by OCR, a method of converting images of writing on labels, printed books, sign boards, etc. into text.
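As an illustration of this image-to-text step, the following is a minimal sketch using the pytesseract wrapper around the Tesseract engine; the wrapper, the Pillow dependency and the file name are assumptions for illustration, as the paper itself only states that Tesseract is used:

```python
# Minimal image-to-text sketch with Tesseract via the pytesseract wrapper.
# Assumes the tesseract binary is installed and on the PATH.
from PIL import Image
import pytesseract

image = Image.open("label.png")              # hypothetical captured image
text = pytesseract.image_to_string(image)   # run the Tesseract OCR engine
print(text)
```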
OCR helps to create reading devices for visually impaired persons and technologies involving telegraphy. The binary image is converted to text by the Tesseract library in the OCR engine, which detects outlines, slope, pitch, white spaces and joined letters; it also checks the quality of the recognized text.

In this system, the conversion of text to voice output is done by the e-Speak algorithm. e-Speak is a Text-To-Speech (TTS) system that converts text into speech. The artificial production of human speech is known as speech synthesis, and the platform used for this purpose, a speech synthesizer, can be implemented in software or hardware. Storing entire words or sentences allows for high-quality output in specific usage domains, and a synthesizer can incorporate a model of the vocal tract and other human voice characteristics.

This paper aims to build an efficient camera-based assistive text reading device. The idea involves extracting text from an image taken by a camera installed on a spectacle; the extracted text is then converted to audio signals and to voice output. The device is also used to detect a person's face in the frame. The system is built on a Raspberry Pi, with portability as the main aim, achieved by providing a battery backup.

This paper is organized as follows. Section II introduces the related work on the proposed system. The proposed methodology is presented in Section III. Hardware and software implementations are explained in Sections IV and V respectively. Section VI presents the results, and Section VII concludes the paper.

II. LITERATURE SURVEY

Earlier works help to support visually impaired persons. Most of the existing systems are built on MATLAB platforms, and a few of them use laptops, so they are not portable. Algorithms used in earlier systems lack efficiency and accuracy.

The paper [1] presents a prototype for extracting text from images using a Raspberry Pi. Images are captured using a web cam and processed using OpenCV and Otsu's algorithm. The captured images are first converted to grayscale. The images are then rescaled and cosine transformations are applied by setting the vertical and horizontal ratio. After some morphological transformations, Otsu's thresholding, an adaptive thresholding algorithm, is applied to the images. After thresholding, contours for the images are generated using special functions in OpenCV. Using these contours, bounding boxes are drawn around the objects and text in the images, and every character present in the image is extracted and passed to the OCR engine to recognize the text.
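A rough sketch of the grayscale/Otsu/contour pipeline summarized above is given below, assuming OpenCV 4.x; the file names and drawing parameters are illustrative rather than taken from [1]:

```python
# Sketch of [1]'s pipeline: grayscale, Otsu thresholding, contours, boxes.
import cv2

img = cv2.imread("capture.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Otsu's method selects the threshold automatically from the histogram.
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
# Contours outline connected regions; bounding boxes isolate candidate characters.
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
for c in contours:
    x, y, w, h = cv2.boundingRect(c)
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 1)
cv2.imwrite("boxes.jpg", img)   # boxed regions would then be passed to OCR
```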
In [3], a camera-based assistive text reading framework is proposed to help visually impaired persons read text labels and product packaging on hand-held objects in daily life. The system proposes a motion-based method to define a Region Of Interest (ROI) for isolating the object from cluttered backgrounds or other surrounding objects in the camera view. A mixture-of-Gaussians-based background subtraction technique is used to extract the moving object region. To acquire text from the ROI, text localization and recognition are conducted, and text regions within the object ROI are automatically focused. A novel text localization algorithm learns gradient features of stroke orientations and distributions of edge pixels in an AdaBoost model. Text characters in the localized text regions are then binarized and recognized by off-the-shelf optical character recognition software.

In [4], text in natural scenes is detected using the Stroke Width Transform (SWT): a bottom-up integration of information merges pixels of similar stroke width into connected components. This allows letters to be detected across a wide range of scales in the same image. Since it does not use a filter bank of a few discrete orientations, it can detect strokes (and, consequently, text lines) of any direction, avoiding the limitations of filter-based methods: the need for integration over scales and filter orientations and an inherent bias toward horizontal text. The method carries enough information for accurate text segmentation, so a good mask is readily available for detected text. The definition of a stroke is related to the linear features used in the remote sensing and medical imaging domains. In road detection, however, the range of road widths in an aerial or satellite photo is known and limited, whereas text appearing in a natural image can vary drastically in scale; additionally, roads are typically elongated linear structures with low curvature, which is again not true for text.

Reference [5] describes a camera-based text reading system for blind persons. A binary image is created using global or local thresholding, with the choice decided by the Fisher's Discriminant Rate (FDR). The technique is essentially based on Otsu's binarization method, an automatic threshold-selection, region-based segmentation method. When characters are present in a frame, the local histogram has two peaks, which is reflected as a high FDR value. For quasi-uniform frames the FDR value is small and the histogram has only one peak. In the case of complex areas the histogram is dispersed, resulting in FDR values that are higher, but still lower than for text areas. The FDR is thus used to detect image frames with a bimodal gray-level histogram: frames with high FDR values are binarized using the local Otsu threshold, while a global threshold is used for frames with low FDR values.

The proposed system carries out image processing using OpenCV and Tesseract. The camera used for capturing the images is installed on a spectacle, and a battery backup makes the system more efficient, effective and portable.

III. PROPOSED METHODOLOGY

The proposed method helps blind persons read the text present on text labels, printed notes and products, acting as a camera-based assistive text reader. The implemented idea involves recognizing text and detecting faces in images taken by the camera on the spectacle, recognizing the text using OCR, and converting the recognized text file to voice output with the e-Speak algorithm. The system is designed for portability, achieved by providing a battery backup, which allows the user to carry the device anywhere and use it at any time. A prototype that works in real time was developed using a camera on a spectacle and a Raspberry Pi.

The proposed system has two different modes, as shown in Fig. 1; the face and text modes are selected using a mode control switch. The system captures a frame and checks for the presence of text in it. It also checks for the presence of a face in the frame and informs the user via an audio message. If a character is found by the camera, the user is informed that an image with some text was detected; if the user then wants to hear the content of the image, he can use a switch to capture it.

Fig. 1. System architecture

The captured image is first converted to grayscale and filtered using a Gaussian filter to reduce noise, and adaptive Gaussian thresholding is then used to convert the filtered image to binary. The binarized image is cropped so that portions of the image with no characters are removed. The cropped frame is loaded into the Tesseract OCR engine to perform text recognition. The output of Tesseract is a text file, which becomes the input to e-Speak. e-Speak creates an analog signal corresponding to the input text file, and this signal is fed to a headphone to produce the audio output.
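The preprocessing chain just described could look roughly as follows in OpenCV; the kernel size and adaptive-threshold block size are assumptions, since the paper does not specify them:

```python
# Sketch of the proposed preprocessing: grayscale, Gaussian smoothing,
# then adaptive Gaussian thresholding to a binary image for Tesseract.
import cv2

gray = cv2.imread("capture.jpg", cv2.IMREAD_GRAYSCALE)
blurred = cv2.GaussianBlur(gray, (5, 5), 0)        # suppress sensor noise
binary = cv2.adaptiveThreshold(
    blurred, 255,
    cv2.ADAPTIVE_THRESH_GAUSSIAN_C,                # local Gaussian-weighted threshold
    cv2.THRESH_BINARY, 11, 2)
cv2.imwrite("binary.png", binary)                  # cropped and fed to Tesseract
```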
IV. HARDWARE IMPLEMENTATION

The hardware components used in this system are a Raspberry Pi, a camera on a spectacle and a power bank. The camera on the spectacle captures the image of the frame. The captured images are sent to the Raspberry Pi, where all the image processing is done. The voice output is available on the audio jack and can be heard through a headphone. The system is powered by a 16,000 mAh power bank, and a rectifier circuit is used to charge the power source.

Raspberry Pi

The Raspberry Pi is a small programmable computer. It works like a Linux-based computer that can perform all the normal operations of a PC, and it runs on an open-source platform. A Raspberry Pi 2 Model B with 1 GB of RAM is used in this system. This model comes with 40 GPIO pins and 4 USB ports, which makes it very useful; it also has a camera interface and a 3.5 mm audio jack. The USB ports on the board are used to connect the camera to the Raspberry Pi. Three GPIO pins are used: one for capturing an image, one for mode control and one for shutting down the system. The board is operated in such a way that the code starts executing when it is powered on, and the audio output is available through the audio jack.

Camera

A compact camera on a spectacle is used for image capture. It has auto-focusing capability with a resolution of 1280x720, which is capable of capturing good-quality images. A USB-powered camera is used so that it can be connected to the Raspberry Pi board.

Power Bank

A power bank with a capacity of 16,000 mAh is used to make the model portable. This power bank can keep the system live for a maximum of about 8 hours and can be recharged. It can provide a steady 5 V at 2 A for a long period, making the system more compact and flexible.

V. SOFTWARE IMPLEMENTATION

The Raspberry Pi runs Raspbian, which is derived from the Debian operating system. The algorithms are written in Python, a scripting language, and call functions from OpenCV, an open-source computer vision library. Tesseract, an open-source OCR engine, performs the character recognition; it assumes that its input is a binary image with optional polygonal text regions defined.

Fig. 2. Flow chart

The flow chart for the proposed system is shown in Fig. 2. The system initializes the values of count and mode to zero; count stores the number of frames, and the mode value selects the text or face mode. When the number of frames reaches 120, the system checks for a face or text depending on the mode and gives the voice output. Switches 'c' and 'm' are used in the system: when switch 'c' is high, the system captures the image and does the processing, and mode control is done by switch 'm'. The system also includes a switch 's' to shut down the system.
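A minimal sketch of how these three switches might be read from the GPIO pins is shown below; the pin numbers and pull-down wiring are assumptions for illustration, as the paper does not specify them:

```python
# Illustrative polling of the capture ('c'), mode ('m') and shutdown ('s')
# switches; the BCM pin numbers are hypothetical.
import os
import RPi.GPIO as GPIO

CAPTURE_PIN, MODE_PIN, SHUTDOWN_PIN = 17, 27, 22

GPIO.setmode(GPIO.BCM)
for pin in (CAPTURE_PIN, MODE_PIN, SHUTDOWN_PIN):
    GPIO.setup(pin, GPIO.IN, pull_up_down=GPIO.PUD_DOWN)

mode = 0                               # 0 = text mode, 1 = face mode
if GPIO.input(MODE_PIN):               # switch 'm' toggles the mode
    mode = 1 - mode
if GPIO.input(CAPTURE_PIN):            # switch 'c' triggers capture and processing
    pass                               # capture frame, then OCR or face detection
if GPIO.input(SHUTDOWN_PIN):           # switch 's' shuts the system down
    os.system("sudo shutdown now")
```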
Conversion of image to text using the OCR tool

Tesseract is an open-source OCR engine. It assumes that its input is a binary image with optional polygonal text regions defined. The first step is a connected component analysis in which the outlines of the components are stored. By inspecting the nesting of outlines, it is easy to detect inverse text and recognize it as easily as black-on-white text. At this stage, outlines are gathered together, purely by nesting, into blobs. Blobs are organized into text lines, and the lines and regions are analyzed for fixed-pitch or proportional text. The slope across the line is used to find text lines, and these lines are broken into words differently according to the kind of character spacing: fixed-pitch text is chopped immediately by character cells. The cells are checked for joined letters, which are separated if found. The quality of the recognized text is then verified; if the clarity is not sufficient, the text is passed to an associator. The classifier compares each recognized letter with training data, and word recognition is done by considering confidence and rating [6].

Conversion of text to voice using e-Speak

Text-to-speech conversion is done using e-Speak, which is a TTS system. The artificial production of human speech is known as speech synthesis; a speech computer or speech synthesizer is used for this purpose and can be implemented in software or hardware. Storing entire words or sentences for specific usage domains allows for high-quality output, and to create a completely "synthetic" voice output a synthesizer can incorporate a model of the vocal tract and other human voice characteristics.

Fig. 3. Schematic diagram for e-Speak

Fig. 3 explains the e-Speak algorithm. A TTS system (or "engine") is composed of two parts, a front-end and a back-end. The front-end has two major tasks: normalization and phonetic transcription of the text. Normalization, also called pre-processing or tokenization, converts text containing symbols such as abbreviations and numbers into equivalent written-out words. The front-end then assigns a phonetic transcription to each word, a process known as text-to-phoneme conversion, and marks and divides the text into prosodic units such as clauses, sentences and phrases. The output of the front-end is a symbolic linguistic representation built from the phonetic transcriptions and the prosody information. The back-end performs the function of a synthesizer: it converts the symbolic linguistic representation into sound. The most attractive features of a speech synthesis system are naturalness and intelligibility: naturalness describes how closely the output sounds like human speech, and intelligibility describes the ease with which the output is understood. Speech synthesis systems usually try to maximize both, which are the characteristics of an ideal speech synthesizer.
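In practice, handing the recognized text to the synthesizer can be as simple as invoking the espeak command-line tool; the speed and voice flags below are standard espeak options chosen here only for illustration:

```python
# Minimal sketch of the text-to-speech step using the espeak CLI.
# Assumes the espeak package is installed on the Raspberry Pi.
import subprocess

recognized_text = "hello world"   # hypothetical output of the OCR stage
subprocess.run(["espeak", "-s", "140", "-v", "en", recognized_text])
```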
VI. RESULT

The results obtained from the procedure described above are illustrated in the figures below. Fig. 4 shows the image captured using the camera on the spectacle, Fig. 5 shows the preprocessed image that is given to the Tesseract OCR engine to extract the text, and Fig. 6 shows the output from the Tesseract OCR engine. The accuracy can be improved by making use of an HD auto-focus camera.

Fig. 4. Captured image

Fig. 5. Image given to Tesseract OCR

Fig. 6. Output from the OCR

VII. CONCLUSION

In the proposed idea, the portability issue is solved by using the Raspberry Pi. MATLAB is replaced with OpenCV, which results in faster processing; OpenCV, the latest tool for image processing, has more supporting libraries than MATLAB. The device consists of a camera installed on a spectacle, and the processor used is very small and can be kept inside the user's pocket. A wired connection is provided to the camera for fast access. The power bank provided for the system lets the device work for about 6 to 8 hours. These features make the device simple, reliable and more user-friendly.

The proposed system can be improved through the addition of various components. Adding GPS to the present system would enable the user to get directions and information regarding his present location. The device could also be used for face recognition: a visually impaired person would not need to guess who people are, since he could identify them as the camera captures their faces. A GSM module could be added to implement a panic button; if the user is in trouble, he could use the panic button to seek help by sending his location to predefined mobile numbers, which would increase the safety of blind people. The device could give better results if some training is provided to the visually impaired person. By adding an object detection feature to the visual narrator, it could recognize objects that are commonly used by visually impaired people; recognizing objects such as currencies, tickets, visa cards, or numbers and details on a smartphone could make the lives of blind people easier. Identification of traffic signals, sign boards and other landmarks could be helpful in traveling. Bluetooth could also be added to remove the wired connection between the spectacle and the Raspberry Pi.

REFERENCES

[1] Rupali D. Dharmale and P. V. Ingole, "Text Detection and Recognition with Speech Output for Visually Challenged Person," vol. 5, no. 1, January 2016.
[2] L. Nagaraja et al., "Vision Based Text Recognition Using Raspberry Pi," National Conference on Power Systems and Industrial Automation (NCPSIA 2015).
[3] N. Rajkumar, M. G. Anand and N. Barathiraja, "Portable Camera Based Product Label Reading for Blind People," IJETT, vol. 10, no. 11, April 2014.
[4] B. Epshtein, E. Ofek and Y. Wexler, "Detecting Text in Natural Scenes with Stroke Width Transform."
[5] N. Ezaki et al., "Improved Text-Detection Methods for a Camera-Based Text Reading System for Visually Impaired Persons," Eighth International Conference on Document Analysis and Recognition (ICDAR '05), IEEE, 2005.
[6] R. Smith, "An Overview of the Tesseract OCR Engine."
[7] C. Yi, Y. Tian and A. Arditi, "Portable Camera-Based Assistive Text and Product Label Reading from Hand-Held Objects for Visually Impaired Persons," IEEE/ASME Transactions on Mechatronics, vol. 19, no. 3, p. 808, June 2014.
[8] S. Sebastian and S. Priya, "Text Detection and Recognition from Images as an Aid to Visually Impaired Persons Accessing Unfamiliar Environments," ARPN Journal of Engineering and Applied Sciences, ISSN 1819-6608, vol. 10, no. 17, September 2015.