International Journal of Emerging Technology and Advanced Engineering Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 5, Issue 1, January 2015)

Text to Speech Conversion System using OCR

Jisha Gopinath1, Aravind S2, Pooja Chandran3, Saranya S S4
1,3,4 Student, 2Asst. Prof., Department of Electronics and Communication, SBCEW, Kerala, India

Abstract - There are about 45 million blind people and 135 million visually impaired people worldwide. The inability to read visual text has a huge impact on the quality of life of visually disabled people. Although several devices have been designed to help visually disabled people perceive objects through an alternative sense such as sound or touch, the development of text reading devices is still at an early stage. Existing systems for text recognition are typically limited by an explicit reliance on specific shapes or colour masks, by a need for user assistance, or by high cost. Therefore a low-cost system is needed that can automatically locate text and read it aloud to visually impaired persons. The main idea of this project is to recognize text characters and convert them into a speech signal. The text contained in the page is first pre-processed; the preprocessing module prepares the text for recognition. The text is then segmented to separate the characters from each other. Segmentation is followed by extraction of the letters, which are resized and stored in a text file. These steps are carried out in MATLAB. The resulting text is then converted into speech.

Index terms - Binarization, OCR, Segmentation, Templates, TTS.

I. INTRODUCTION

Machine replication of human functions, like reading, is an ancient dream. However, over the last five decades, machine reading has grown from a dream to reality. Speech is probably the most efficient medium for communication between humans. Optical character recognition has become one of the most successful applications of technology in the fields of pattern recognition and artificial intelligence. Character recognition, or optical character recognition (OCR), is the process of converting scanned images of machine-printed or handwritten text (numerals, letters and symbols) into computer-readable text. Speech synthesis is the artificial synthesis of human speech [1]. A Text-To-Speech (TTS) synthesizer is a computer-based system that should be able to read any text aloud, whether it was directly introduced in the computer by an operator or scanned and submitted to an Optical Character Recognition (OCR) system [1]. The operational stages [2] of the system are image capture, image preprocessing, image filtering, character recognition and text-to-speech conversion. The software platforms used are MATLAB, LabVIEW and the Android platform.

II. TEXT SYNTHESIS

Recognition of scanned document images using OCR is now generally considered to be a solved problem for some scripts. The components of an OCR system are optical scanning, binarization, segmentation, feature extraction and recognition.

Fig 1: Components of an OCR system

With the help of a digital scanner the analog document is digitized, and the extracted text is pre-processed. Each symbol is extracted through a segmentation process [2]. The identity of each symbol is found by comparing the extracted features with descriptions of the symbol classes obtained through a previous learning phase. Contextual information is used to reconstruct the words and numbers of the original text.

III. SPEECH SYNTHESIS

Speech is the vocalized form of human communication, and in many real-world applications it is a more effective communication medium than text. Speech synthesis is the artificial production of human speech.
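The OCR stages listed in Section II begin with optical scanning followed by binarization. As an illustrative sketch only (the paper's implementation is in MATLAB and LabVIEW; this plain-Python analogue, including the toy pixel values, is not taken from the paper), global thresholding in the style of Otsu's method can be written as:

```python
# Illustrative sketch of the binarization step of an OCR pipeline:
# pick a global threshold by Otsu's method (maximize between-class
# variance), then map pixels to ink (1) or background (0).

def otsu_threshold(gray):
    """Threshold maximizing between-class variance for an 8-bit
    grayscale image given as a flat list of ints."""
    hist = [0] * 256
    for px in gray:
        hist[px] += 1
    total = len(gray)
    total_sum = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_bg = sum_bg = 0
    for t in range(256):
        w_bg += hist[t]                # pixels at or below t
        if w_bg == 0:
            continue
        w_fg = total - w_bg            # pixels above t
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (total_sum - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def binarize(gray, threshold):
    """Dark-on-light text assumed: dark pixels become ink (1)."""
    return [1 if px <= threshold else 0 for px in gray]

# A toy "image": dark text pixels (~20) on a light page (~220).
page = [20, 25, 22, 210, 220, 215, 21, 230, 225, 19]
t = otsu_threshold(page)
bw = binarize(page, t)
```

Any threshold between the two pixel clusters separates text from page here; a real system would apply the same idea per image, or adaptively per region for uneven lighting.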
A system used for this purpose is called a speech synthesizer; it can be implemented in software or hardware. Synthesized speech can be created by concatenating pieces of recorded speech stored in a database. The quality of a speech synthesizer is judged by its similarity to the human voice and by how well it can be understood.

IV. TEXT TO SPEECH SYNTHESIS

A Text-To-Speech (TTS) synthesizer is a computer-based system that should be able to read any text aloud. The block diagram given below illustrates the process [3].

Fig 2: Overall block diagram

A text-to-speech system (or "engine") is composed of two parts: a front-end and a back-end. The front-end has two major tasks. First, it converts raw text containing symbols like numbers and abbreviations into the equivalent of written-out words; this process is often called text normalization, pre-processing, or tokenization [4]. The front-end then assigns phonetic transcriptions to each word, and divides and marks the text into prosodic units, like phrases, clauses, and sentences. The process of assigning phonetic transcriptions to words is called text-to-phoneme conversion. In certain systems, this part also includes the computation of the target prosody (pitch contour, phoneme durations), which is then imposed on the output speech. The back-end, often referred to as the synthesizer, converts the symbolic linguistic representation into sound.

V. SYSTEM IMPLEMENTATION

a) Using LabVIEW

LabVIEW is a graphical programming language that uses icons instead of lines of text to create applications. It uses dataflow programming, where the flow of data through the nodes on the block diagram determines the execution order of the VIs and functions. VIs, or virtual instruments, are LabVIEW programs that imitate physical instruments. In LabVIEW, the user builds a user interface, known as the front panel, using a set of tools and objects, and then adds code using graphical representations of functions to control the front panel objects. This graphical source code is also known as G code or block diagram code.

i) LabVIEW Program Structure

A LabVIEW program is similar to a text-based program with functions and subroutines; however, in appearance it functions like a virtual instrument (VI) [5]. Just as a real instrument may accept an input, process it and output a result, a LabVIEW VI behaves in the same manner. A LabVIEW VI has three main parts:

a) Front Panel window. Every user-created VI has a front panel that contains the graphical interface with which a user interacts. The front panel can house various graphical objects ranging from simple buttons to complex graphs [6].

b) Block Diagram window. Nearly every VI has a block diagram containing program logic that modifies data as it flows from sources to sinks. The block diagram houses a pipeline structure of sources, sinks, VIs and structures wired together to define this program logic. Most importantly, every data source and sink on the front panel has its counterpart on the block diagram. This representation allows input values from the user to be accessed from the block diagram; likewise, new output values can be shown on the front panel by code executed in the block diagram.

c) Controls, Functions and Tools Palettes. These windows contain icons associated with extensive libraries of software functions, subroutines, etc.

ii) Process Flowchart

(Flowchart: Start → create an OCR session → check whether Microsoft Win32 SAPI is available; if not, report an error. On the recognition path, the image is captured, the image and character set file are read, the region of interest (ROI) is obtained, the text is read, bounding boxes are drawn, and characters are recognized by correlation and written to a text file for text analysis. On the speech path, a server is made for Win32 SAPI, a voice object is obtained from it, the input string is compared with the SAPI string, the voice is extracted, the wave player is initialized, and speech is output → Stop.)
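Before recognized text can be spoken, the TTS front-end described in Section IV normalizes it. A minimal sketch of that step (illustrative only; the abbreviation table and digit-by-digit number expansion below are invented simplifications, and the paper's systems delegate this work to SAPI and the Android TTS engine):

```python
# Sketch of TTS front-end text normalization: expand digits and known
# abbreviations into written-out words before phonetization. A real
# front-end also handles ordinals, dates, currency, and so on.

ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "etc.": "et cetera"}
ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]

def expand_number(token):
    """Spell out an integer digit by digit."""
    return " ".join(ONES[int(d)] for d in token)

def normalize(text):
    out = []
    for token in text.split():
        if token in ABBREVIATIONS:
            out.append(ABBREVIATIONS[token])
        elif token.isdigit():
            out.append(expand_number(token))
        else:
            out.append(token)
    return " ".join(out)

result = normalize("Dr. Smith lives at 42 Elm St.")
# -> "Doctor Smith lives at four two Elm Street"
```

The normalized word sequence is what the text-to-phoneme stage then transcribes; ambiguous tokens ("1994" as a year versus a count) are why production front-ends use context-dependent rules rather than a lookup like this.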
b) Using Android

Android is a Linux-based operating system for mobile devices such as smartphones and tablet computers. It is developed by the Open Handset Alliance, led by Google, and Google releases the Android code as open source under the Apache License [7]. Android has seen a number of updates since its original release, each fixing bugs and adding new features. Android consists of a kernel based on the Linux kernel, with middleware, libraries and APIs written in C, and application software running on an application framework which includes Java-compatible libraries based on Apache Harmony. Android uses the Dalvik virtual machine with just-in-time compilation to run Dalvik dex-code (Dalvik Executable), which is usually translated from Java bytecode. The main hardware platform for Android is the ARM architecture. There is support for x86 from the Android-x86 project, and Google TV uses a special x86 version of Android.

i) System design

The open-source OCR engine Tesseract is used as the basis for implementing the text reading system for the visually disabled on the Android platform. Google currently develops and sponsors the open development project. Today, Tesseract is considered the most accurate free OCR engine in existence.
The user can select an image already stored on the Android device or use the device's camera to capture a new image. The image is run through an image rectification algorithm and passed to the Tesseract service. When the OCR process is complete it returns a string of text, which is displayed on the user interface screen, where the user is also allowed to edit it. The TTS API then enables the Android device to speak the text in different languages. The TTS engine that ships with the Android platform supports a number of languages: English, French, German, Italian and Spanish; both American and British accents are supported for English. The TTS engine needs to know which language to speak, so the voice and dictionary are language-specific resources that must be loaded before the engine can start to speak [8], [9].

ii) Process flowchart

(Flowchart: Start → image capture → correct orientation → generate a bitmap in ARGB8888 format → pass the image to the Tesseract OCR engine → display the text output given by the OCR engine → pass the text field to the TTS API → output speech → Stop.)

c) Using MATLAB

i) System architecture

The system consists of a portable camera, a computing device and a speaker or headphones. Images are captured using the camera; for better results, a camera with zoom and autofocus capability can be used. OCR-based speech synthesis applications require a computer system with a high processing speed. It is possible to run on a 100 MHz processor with 16 MB of RAM, but for fast processing (large dictionaries, complex recognition schemes, or high sample rates) a minimum of 400 MHz and 128 MB of RAM is advisable. Because of the processing required, most software packages list their minimum requirements. An operating system and sound support must be installed on the PC, and a good-quality speaker is required to produce good-quality sound.
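The MATLAB pipeline's core recognition stages, cropping lines and letters, resizing them, and labelling each letter by correlation against stored templates, can be sketched in plain Python (illustrative only, not the paper's MATLAB code; the 3x3 toy glyphs, the agreement-based correlation score, and the sample page are all invented for the example):

```python
# Sketch of segmentation plus template-matching recognition on a
# binary page (1 = ink, 0 = background): projection profiles locate
# lines and letters; each cropped letter is resized to the template
# size and labelled by its best-matching template.

def runs(profile):
    """(start, end) spans where a projection profile is non-zero."""
    spans, start = [], None
    for i, v in enumerate(profile):
        if v and start is None:
            start = i
        elif not v and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans

def segment_lines(page):
    """Row sums locate horizontal bands of ink (text lines)."""
    return runs([sum(row) for row in page])

def segment_letters(page, top, bottom):
    """Column sums inside one line locate individual letters."""
    cols = range(len(page[0]))
    return runs([sum(page[r][c] for r in range(top, bottom)) for c in cols])

def resize(glyph, h, w):
    """Nearest-neighbour resize to the template size (h x w)."""
    gh, gw = len(glyph), len(glyph[0])
    return [[glyph[r * gh // h][c * gw // w] for c in range(w)]
            for r in range(h)]

def correlation(a, b):
    """Fraction of agreeing cells between two equal-sized glyphs
    (a simple stand-in for normalized 2-D correlation)."""
    pairs = [(x, y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(x == y for x, y in pairs) / len(pairs)

def recognize(glyph, templates):
    """Label of the template with the highest correlation score."""
    return max(templates, key=lambda k: correlation(glyph, templates[k]))

TEMPLATES = {
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
    "T": [[1, 1, 1], [0, 1, 0], [0, 1, 0]],
}

# A toy page holding one text line containing the letters "L" and "T".
page = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 1, 1],
    [0, 1, 0, 0, 0, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
top, bottom = segment_lines(page)[0]
letters = segment_letters(page, top, bottom)
glyphs = [resize([row[a:b] for row in page[top:bottom]], 3, 3)
          for a, b in letters]
text = "".join(recognize(g, TEMPLATES) for g in glyphs)
```

Projection profiles assume a deskewed page, which is why the pipeline corrects orientation before segmentation; the template set would in practice cover the full character set at a larger, fixed glyph size.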
ii) Process flowchart

(Flowchart: image capture → image preprocessing → image filtering → crop lines → crop letters → extract letters → resize letters → load templates → correlation → write to text file → text analysis.)

VI. RESULT

The text reading system has two main parts: image-to-text conversion and text-to-voice conversion. The image is converted into text, and that text into speech, using MATLAB, LabVIEW and the Android platform. For image-to-text conversion in MATLAB and LabVIEW, the image is first converted into a gray image, then into a black-and-white image, and then into text; on the Android platform the RGB image is processed directly. The Microsoft Win32 Speech API (SAPI) library has been used in MATLAB and LabVIEW to make speech information available to the computer. This library allows the voice and audio device to be selected; voices can be chosen from a list, and the pace and volume can be changed and auditioned through an installed wave player. The Android implementation uses the Android text-to-speech API.

a) Using LabVIEW
Input: input image. Output: output speech.
Fig 3: Front panel of the text reading system

b) Using Android
Input: input image. Output: output text.

c) Using MATLAB
Input: input image. Output: output speech.

VII. CONCLUSION

This paper suggests an approach for image-to-speech conversion using optical character recognition and text-to-speech technology.
The application developed is user-friendly, cost-effective and applicable in real time. With this approach we can read text from a document, web page or e-book and generate synthesized speech through a computer's or phone's speakers. The developed software sets out, for each and every alphabet, the corresponding signal, its pronunciation, and the way it is used in grammar and the dictionary. This can save time by allowing the user to listen to background materials while performing other tasks. The system can also make information browsing available to people who do not have the ability to read or write. The approach can also be used in part: image-to-text conversion alone is possible, and so is text-to-speech conversion alone. People with poor vision, visual dyslexia or total blindness can use this approach to read documents and books, and people with speech loss can use it to turn typed words into vocalization. Experiments have been performed to test the text reading system, and good results have been achieved.

REFERENCES

[1] T. Dutoit, "High quality text-to-speech synthesis: a comparison of four candidate algorithms," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-94), vol. 1, pp. I/565-I/568, 19-22 Apr. 1994.
[2] B. M. Sagar, G. Shobha and R. P. Kumar, "OCR for printed Kannada text to machine editable format using database approach," WSEAS Transactions on Computers, vol. 7, pp. 766-769, June 2008.
[3] http://www.voicerss.org/tts/
[4] http://www.comsys.net/technology/speechframe/text-to-speech-tts.html
[5] C. G. Relf, Image Acquisition and Processing with LabVIEW, CRC Press, 2004.
[6] http://www.rspublication.com/ijst/aug%2013/6.pdf
[7] S. Bhaskar, N. Lavassar and S. Green, "Implementing Optical Character Recognition on the Android Operating System for Business Cards," EE 368 Digital Image Processing.
[8] J. Liang et al., "Geometric Rectification of Camera-captured Document Images," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 591-605, July 2006.
[9] G. Zhu and D. Doermann, "Logo Matching for Document Image Retrieval," in Proc. International Conference on Document Analysis and Recognition (ICDAR 2009), pp. 606-610, 2009.

Authors' Biographies

Jisha Gopinath is pursuing her final-year B.Tech degree in Electronics and Communication Engineering at Mahatma Gandhi University, Kerala, India. She completed a Diploma in Electronics Engineering from the Technical Board of Education, Kerala.

Aravind S is an Assistant Professor in the Department of Electronics and Communication Engineering, Sree Buddha College of Engineering for Women, Mahatma Gandhi University, Kerala, India. He obtained his M.Tech degree in VLSI and Embedded Systems with Distinction from Govt. College of Engineering Chengannur, Cochin University, in 2012. He received his B.Tech degree in Electronics and Communication Engineering with Distinction from the main campus of Cochin University of Science and Technology, School of Engineering, Kerala, India, in 2009. He has published ten research papers in various international journals and presented three papers at national conferences. He has an excellent and consistent academic record and very good verbal and written communication skills.
He has guided nine projects for graduate engineering students and one project for a postgraduate student. He has three years of academic experience and 1.6 years of industrial experience. For postgraduate students he has handled subjects such as Electronic Design Automation Tools, VLSI Circuit Design and Technology, Designing with Microcontrollers, and Adaptive Signal Processing. For B.Tech students he has taught subjects such as Network Theory, DSP, Embedded Systems, Digital Electronics, Microcontrollers and Applications, Computer Organisation and Architecture, Microprocessors and Applications, Microwave Engineering, Computer Networks and VLSI.

Pooja Chandran is pursuing her final-year B.Tech degree in Electronics and Communication Engineering at Mahatma Gandhi University, Kerala, India.

Saranya S S is pursuing her final-year B.Tech degree in Electronics and Communication Engineering at Mahatma Gandhi University, Kerala, India. She completed a Diploma in Electronics Engineering from the Technical Board of Education, Kerala.