International Journal of Emerging Technology and Advanced Engineering
Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 5, Issue 1, January 2015)
Text to Speech Conversion System using OCR
Jisha Gopinath1, Aravind S2, Pooja Chandran3, Saranya S S4
1,3,4Student, 2Asst. Prof., Department of Electronics and Communication, SBCEW, Kerala, India
Abstract— There are about 45 million blind people and 135 million visually impaired people worldwide. The inability to read visual text has a huge impact on the quality of life of visually disabled people. Although several devices have been designed to help the visually disabled perceive objects using an alternative sense such as sound or touch, the development of text reading devices is still at an early stage. Existing systems for text recognition are typically limited by explicitly relying on specific shapes or colour masks, by requiring user assistance, or by high cost. Therefore, a low-cost system is needed that can automatically locate text and read it aloud to visually impaired persons. The main idea of this project is to recognize text characters and convert them into a speech signal. The text contained in the page is first pre-processed; the pre-processing module prepares the text for recognition. The text is then segmented to separate the characters from each other. Segmentation is followed by extracting the letters, resizing them, and storing them in a text file. These processes are done with the help of MATLAB. The resulting text is then converted into speech.
Index terms— Binarization, OCR, Segmentation, Templates, TTS.

I. INTRODUCTION

Machine replication of human functions, like reading, is an ancient dream. However, over the last five decades, machine reading has grown from a dream to reality. Speech is probably the most efficient medium for communication between humans. Optical character recognition has become one of the most successful applications of technology in the fields of pattern recognition and artificial intelligence. Character recognition, or optical character recognition (OCR), is the process of converting scanned images of machine-printed or handwritten text (numerals, letters, and symbols) into computer-formatted text. Speech synthesis is the artificial synthesis of human speech [1]. A Text-To-Speech (TTS) synthesizer is a computer-based system that should be able to read any text aloud, whether it was directly introduced in the computer by an operator or scanned and submitted to an Optical Character Recognition (OCR) system [1]. Operational stages [2] of the system consist of image capture, image preprocessing, image filtering, character recognition and text-to-speech conversion. The software platforms used are MATLAB, LabVIEW and the Android platform.

II. TEXT SYNTHESIS

Recognition of scanned document images using OCR is now generally considered to be a solved problem for some scripts. The components of an OCR system are optical scanning, binarization, segmentation, feature extraction and recognition.

Fig 1: Components of an OCR system.

With the help of a digital scanner the analog document is digitized, and the extracted text is pre-processed. Each symbol is extracted through a segmentation process [2]. The identity of each symbol is found by comparing the extracted features with descriptions of the symbol classes obtained through a previous learning phase. Contextual information is used to reconstruct the words and numbers of the original text.
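As an illustration of the segmentation stage described above, the sketch below separates the characters of a binary text line by looking for blank columns between them. This is a minimal Python sketch under stated assumptions; the function name, the vertical-projection approach and the toy image are illustrative, not the paper's MATLAB implementation.

```python
def segment_characters(binary_line):
    """Split a binary text line (list of rows, 1 = ink, 0 = background)
    into per-character column ranges using a vertical projection."""
    width = len(binary_line[0])
    # Vertical projection: number of ink pixels in each column.
    projection = [sum(row[col] for row in binary_line) for col in range(width)]
    segments, start = [], None
    for col, ink in enumerate(projection):
        if ink and start is None:            # entering a character
            start = col
        elif not ink and start is not None:  # leaving a character
            segments.append((start, col))
            start = None
    if start is not None:                    # character touches the right edge
        segments.append((start, width))
    return segments

# Toy line containing two "characters" separated by one blank column.
line = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 1],
]
print(segment_characters(line))  # → [(0, 2), (3, 4)]
```

Each returned pair gives the column range of one extracted symbol, which would then be resized and passed to the recognition stage.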
III. SPEECH SYNTHESIS

Speech is the vocalized form of human communication. Speech communication is a more effective medium than text communication in many real-world applications.
Speech synthesis is the artificial production of human
speech. A system used for this purpose is called a speech
synthesizer, and can be implemented in software or
hardware. Synthesized speech can be created by
concatenating pieces of recorded speech that are stored in a
database. The quality of a speech synthesizer is judged by
its similarity to the human voice, and by its ability to be
understood.
IV. TEXT TO SPEECH SYNTHESIS

A Text-To-Speech (TTS) synthesizer is a computer-based system that should be able to read any text aloud. The block diagram given below explains the same [3].

Fig 2: Overall block diagram

A text-to-speech system (or "engine") is composed of two parts: a front-end and a back-end. The front-end has two major tasks. First, it converts raw text containing symbols such as numbers and abbreviations into the equivalent of written-out words; this process is often called text normalization, pre-processing, or tokenization [4]. The front-end then assigns phonetic transcriptions to each word and divides and marks the text into prosodic units such as phrases, clauses, and sentences. The process of assigning phonetic transcriptions to words is called text-to-phoneme conversion. The back-end, often referred to as the synthesizer, then converts the symbolic linguistic representation into sound. In certain systems, this part also includes the computation of the target prosody (pitch contour, phoneme durations), which is then imposed on the output speech.

V. SYSTEM IMPLEMENTATION

a) Using LabVIEW

LabVIEW is a graphical programming language that uses icons instead of lines of text to create applications. LabVIEW uses dataflow programming, where the flow of data through the nodes on the block diagram determines the execution order of the VIs and functions. VIs, or virtual instruments, are LabVIEW programs that imitate physical instruments. In LabVIEW, the user builds a user interface, known as the front panel, using a set of tools and objects, and then adds code using graphical representations of functions to control the front panel objects. This graphical source code is also known as G code or block diagram code.

i) LabVIEW Program Structure

A LabVIEW program is similar to a text-based program with functions and subroutines; however, in appearance it functions like a virtual instrument (VI) [5]. A real instrument may accept an input, process it, and then output a result; a LabVIEW VI behaves in the same manner. A LabVIEW VI has three main parts:

a) Front Panel window

Every user-created VI has a front panel that contains the graphical interface with which a user interacts. The front panel can house various graphical objects ranging from simple buttons to complex graphs [6].

b) Block Diagram window

Nearly every VI has a block diagram containing some kind of program logic that serves to modify data as it flows from sources to sinks. The block diagram houses a pipeline structure of sources, sinks, VIs, and structures wired together to define this program logic. Most importantly, every data source and sink on the front panel has its analogous source and sink on the block diagram. This representation allows the input values from the user to be accessed from the block diagram; likewise, new output values can be shown on the front panel by code executed in the block diagram.

c) Controls, Functions and Tools Palettes

These are windows containing icons associated with extensive libraries of software functions, subroutines, etc.
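As an illustration of the text-normalization task that the TTS front-end performs (Section IV), the Python sketch below spells out digits and expands a few common abbreviations into written-out words. The abbreviation table, digit list and function name are assumptions made for this sketch only; a real front-end would verbalize full numbers and handle many more cases.

```python
# Illustrative text normalization: abbreviations and digits become words.
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "no.": "number"}

def normalize(text):
    """Rewrite abbreviations and digits as written-out, lowercase words."""
    words = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif token.isdigit():
            # Spell out each digit individually (a simplification).
            words.extend(DIGITS[int(d)] for d in token)
        else:
            words.append(token)
    return " ".join(words)

print(normalize("Dr. Smith lives at No. 42"))
# → "doctor smith lives at number four two"
```

The normalized word sequence is what the front-end would then pass on to text-to-phoneme conversion.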
390
International Journal of Emerging Technology and Advanced Engineering
Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 5, Issue 1, January 2015)
ii) Process Flowchart

Start → Image capture → Read image and character set file → Get ROI → Read text → Draw bounding boxes → Correlation → Recognize character and write to text file → Text analysis → Is Microsoft Win32 SAPI available? (if No: error) → Make a server for Win32 SAPI → Get voice object from Win32 SAPI → Compare input string with SAPI string → Extract voice → Wave player initialization → Output speech → Stop
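The "Correlation" and "Recognize character" steps of the flowchart above can be sketched as template matching: an extracted glyph is compared against stored binary templates, and the best-matching label is taken as the recognized character. The 3x3 templates and the agreement-based correlation measure below are toy illustrations, not the system's actual template set.

```python
def correlation(a, b):
    """Fraction of pixels on which two equally sized binary images agree."""
    total = len(a) * len(a[0])
    agree = sum(pa == pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return agree / total

# Hypothetical template set; a real system stores one per character.
TEMPLATES = {
    "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
}

def recognize(glyph):
    """Return the label of the template best correlated with the glyph."""
    return max(TEMPLATES, key=lambda label: correlation(glyph, TEMPLATES[label]))

noisy_l = [[1, 0, 0], [1, 0, 0], [1, 1, 0]]  # an "L" with one pixel missing
print(recognize(noisy_l))  # → "L"
```

Because the match is scored rather than exact, a slightly degraded glyph still maps to the nearest stored character.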
b) Using Android
Android is a Linux-based operating system for mobile
devices such as smartphones and tablet computers. It is
developed by the Open Handset Alliance led by Google.
Google releases the Android code as open-source, under
the Apache License [7]. Android has seen a number of
updates since its original release, each fixing bugs and
adding new features. Android consists of a kernel based on
the Linux kernel, with middleware, libraries and APIs
written in C and application software running on an
application framework which includes Java-compatible
libraries based on Apache Harmony. Android uses the
Dalvik virtual machine with just-in-time compilation to run
Dalvik dex-code (Dalvik Executable), which is usually
translated from Java bytecode. The main hardware platform
for Android is the ARM architecture. There is support for
x86 from the Android x86 project, and Google TV uses a
special x86 version of Android.
i) System design

Open-source OCR software called Tesseract is used as the basis for the implementation of the text reading system for the visually disabled on the Android platform. Google currently develops and sponsors the open development project, and today Tesseract is considered the most accurate free OCR engine in existence. The user can select an image already stored on the Android device or use the device's camera to capture a new image; the image is then run through an image rectification algorithm and passed to the Tesseract service.

When the OCR process is complete, it returns a string of text which is displayed on the user interface screen, where the user is also allowed to edit the text. The TTS API then enables the Android device to speak the text in different languages. The TTS engine that ships with the Android platform supports a number of languages: English, French, German, Italian and Spanish; for English, both American and British accents are supported. The TTS engine needs to know which language to speak, so the voice and dictionary are language-specific resources that must be loaded before the engine can start to speak [8,9].

ii) Process flowchart

Start → Image capture → Correct orientation → Generate bitmap in ARGB8888 → Pass image to Tesseract OCR engine → Display the text output given by OCR engine → Pass the text field to TTS API → Output speech → Stop

c) Using MATLAB

i) System architecture

The system consists of a portable camera, a computing device and a speaker or headphone. Images are captured using the camera; for better results, a camera with zooming and auto-focus capability can be used. OCR-based speech synthesis applications require a computer system with high processing speed to perform the specified task. It is possible to run with a 100 MHz processor and 16 MB of RAM, but for fast processing (large dictionaries, complex recognition schemes, or high sample rates) a minimum of 400 MHz and 128 MB of RAM is recommended. Because of the processing required, most software packages list their minimum requirements. An operating system and sound support must be installed on the PC, and a good-quality speaker is required to produce good-quality sound.
ii) Process flowchart

Image Capture → Image Preprocessing → Image Filtering → Crop Lines → Crop Letters → Resize Letters → Extract Letters → Load Templates → Correlation → Write to Text File → Text Analysis

VI. RESULT

The text reading system has two main parts: image-to-text conversion and text-to-voice conversion. The image is converted into text, and that text into speech, using MATLAB, LabVIEW and the Android platform. For image-to-text conversion, the image is first converted into a gray image, then into a black-and-white image, and then into text using MATLAB and LabVIEW; on the Android platform the RGB image is processed directly. The Microsoft Win32 Speech Application Program Interface (SAPI) library has been used in MATLAB and LabVIEW to produce the speech output. This library allows selecting the voice and audio device one would like to use: the voice can be selected from a list, and the pace and volume can be changed; the result can be heard through a wave player. The Android implementation uses the Android text-to-speech application program interface.

a) Using LabVIEW

Input: input image
Output: output speech

Fig 3: Front panel of text reading system
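The gray and black-and-white conversions described above (analogous to MATLAB's rgb2gray and im2bw) can be sketched in Python as follows. The luminance weights are the standard ITU-R BT.601 coefficients; the threshold value and function names are illustrative assumptions.

```python
def rgb_to_gray(pixel):
    """Weighted luminance of an (r, g, b) pixel, as rgb2gray does."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b

def to_black_and_white(image, threshold=128):
    """Map an RGB image (rows of (r, g, b) tuples) to 0/1, 1 meaning ink.

    The fixed threshold is a simplification; im2bw-style conversion
    typically derives the threshold from the image itself."""
    return [[1 if rgb_to_gray(p) < threshold else 0 for p in row]
            for row in image]

image = [[(255, 255, 255), (0, 0, 0)],
         [(10, 10, 10), (200, 200, 200)]]
print(to_black_and_white(image))  # → [[0, 1], [1, 0]]
```

The resulting binary image is what the segmentation and correlation stages then operate on.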
b) Using Android

Input: input image
Output: output text
VII. CONCLUSION

This paper is an effort to suggest an approach for image-to-speech conversion using optical character recognition and text-to-speech technology. The application developed is user friendly, cost effective and applicable in real time. With this approach, text from a document, Web page or e-Book can be read and synthesized speech generated through a computer's or phone's speakers. The developed software defines, for each letter of the alphabet, the corresponding signal, its pronunciation methodology, and the way it is used in grammar and the dictionary. This can save time by allowing the user to listen to background material while performing other tasks. The system can also make information browsing possible for people who do not have the ability to read or write. The approach can also be used in part: image-to-text conversion alone is possible, as is text-to-speech conversion alone. People with poor vision, visual dyslexia or total blindness can use this approach to read documents and books, and people with speech loss can utilize it to turn typed words into vocalization. Experiments have been performed to test the text reading system, and good results have been achieved.
c) Using MATLAB

Input: input image
Output: output speech
REFERENCES

[1] T. Dutoit, "High quality text-to-speech synthesis: a comparison of four candidate algorithms," Proc. 1994 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-94), vol. 1, pp. I/565-I/568, 19-22 Apr 1994.
[2] B. M. Sagar, Shobha G and R. P. Kumar, "OCR for printed Kannada text to machine editable format using database approach," WSEAS Transactions on Computers, vol. 7, pp. 766-769, June 2008.
[3] http://www.voicerss.org/tts/
[4] http://www.comsys.net/technology/speechframe/text-to-speech-tts.html
[5] C. G. Relf, Image Acquisition and Processing with LabVIEW, CRC Press, 2004.
[6] http://www.rspublication.com/ijst/aug%2013/6.pdf
[7] S. Bhaskar, N. Lavassar and S. Green, "Implementing Optical Character Recognition on the Android Operating System for Business Cards," EE 368 Digital Image Processing.
[8] J. Liang et al., "Geometric Rectification of Camera-captured Document Images," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 591-605, July 2006.
[9] G. Zhu and D. Doermann, "Logo Matching for Document Image Retrieval," International Conference on Document Analysis and Recognition (ICDAR 2009), pp. 606-610, 2009.
Author Biographies

Jisha Gopinath, pursuing final-year B.Tech degree in Electronics and Communication Engineering from Mahatma Gandhi University, Kerala, India. Completed Diploma in Electronics Engineering from the Technical Board of Education, Kerala.
Aravind S, Assistant Professor in the Department of Electronics and Communication Engineering, Sree Buddha College of Engineering for Women, Mahatma Gandhi University, Kerala, India. He obtained his M.Tech degree in VLSI and Embedded Systems with Distinction from Govt. College of Engineering Chengannur, Cochin University, in 2012. He received his B.Tech degree in Electronics and Communication Engineering with Distinction from the main campus of Cochin University of Science and Technology, School of Engineering, Kerala, India, in 2009. He has published ten research papers in various international journals and has presented three papers in national conferences. He has excellent and consistent academic records and very good verbal and written communication skills. He has guided nine projects for graduate engineering students and one project for a postgraduate student. He has academic experience of 3 years and industrial experience of 1.6 years. For postgraduate students he has handled subjects such as Electronic Design Automation Tools, VLSI Circuit Design and Technology, Designing with Microcontrollers, and Adaptive Signal Processing. For B.Tech students he has taught subjects such as Network Theory, DSP, Embedded Systems, Digital Electronics, Microcontrollers and Applications, Computer Organisation and Architecture, Microprocessors and Applications, Microwave Engineering, Computer Networks and VLSI.
Pooja Chandran, pursuing final-year B.Tech degree in Electronics and Communication Engineering from Mahatma Gandhi University, Kerala, India.

Saranya S S, pursuing final-year B.Tech degree in Electronics and Communication Engineering from Mahatma Gandhi University, Kerala, India. Completed Diploma in Electronics Engineering from the Technical Board of Education, Kerala.