Uploaded by Asad Hameed

Research Paper TBW TTS

The Evolution of Text-to-Speech over time
Mois Shabbir
Department of Electrical Engineering, EME College
HU-212: Technical and Business Writing
Asst Prof Rukaiza Khan
Jan. 9, 2023
Abstract:
This research paper examines the evolution of text-to-speech (TTS) technology over time. TTS
began as early attempts to mimic the human voice and has since grown in sophistication and
capability, becoming an important tool for people with disabilities and for those who are unable
to read aloud. The paper analyzes the history of TTS development from its earliest days to
current advancements and future potential. It also explores how TTS has been used to facilitate
communication and language learning, discusses the current challenges faced by the technology,
and offers insights into future trends in text-to-speech development that may shape its use for
years to come. The evolution of TTS was analyzed using recorded audio samples from different
systems spanning a long period of time; improvements in clarity, quality, and other parameters
were judged from these samples.
Keywords: Text-to-Speech, Technology, Evolution, History, Analysis.
Evolution of text-to-speech technology over time
1. Introduction:
“Any sufficiently advanced technology is indistinguishable from magic” (Clarke, A. C.). Who, a
hundred years ago, would have imagined the world we live in today? As time passes, technology
keeps becoming more diverse. In this research paper I analyze one field that keeps advancing
rapidly: the evolution of text-to-speech (TTS) technology, from its initial development in the
1950s up to modern applications. TTS has come a long way since its early days, when only basic
words and phrases could be converted into speech. Today’s systems are more sophisticated and
can interpret complex language accurately, allowing for more natural interactions between humans
and machines. The paper examines how TTS has been used over time and looks at the advancements
made in accuracy, speed, and voice quality. In addition, it discusses potential future
developments for this technology. Finally, it considers the challenges the technology faces in
the modern era. Earlier research on this topic approached it from narrower angles: it examined
TTS from one person’s perspective, or described the work of a single person in the field, and
did not go into much depth. This paper covers the subject in greater detail. Overall, it
analyzes the evolution and present state of TTS.
2. Methodology:
In this research paper, the evolution of TTS technology is evaluated by gathering recorded audio
samples from different TTS systems over time and analyzing them for improvements in parameters
such as speed, accuracy, voice clarity, and quality. Because the data were gathered from several
different sources, the methodology is treated as quantitative.
3. History:
TTS has always offered wide scope for improvement. To analyze its evolution over time, we first
need to see where it originated.
3.1 Speech Synthesizer:
The improvement in the quality of speech synthesizers over time has helped greatly in the
evolution of TTS. The first effort to produce artificial speech was made in 1779, when the
Russian professor Christian Kratzenstein built a device that produced the five vowels. That was
the starting point of this technology, which has kept evolving ever since.
3.1.1. VODER:
The VODER (Voice Operating Demonstrator) is a speech synthesis device developed by Homer
Dudley at Bell Laboratories in the 1930s. It was one of the earliest attempts to create an
electronic talking machine and can be considered a precursor to modern speech synthesizers. The
device consists of two main parts: an electrical circuit that creates sound waves, and a
keyboard-like control panel with which the user shapes those sounds into recognizable words or
phrases. The VODER was demonstrated publicly at several world's fairs during the 1930s,
including the New York World's Fair of 1939–1940, where it received much attention from
visitors (Harold L Andrews, 2001). The audio sample for VODER is given below:
VODER.wav
3.1.2. Parametric Artificial Talker (PAT):
PAT, the first formant synthesizer, was introduced by Walter Lawrence in 1953. It produced
speech by controlling formant frequencies, and its operation was rather complicated (Sami
Lemmetty, 1999). The audio sample for PAT is given below:
Parametric Artificial Talker.wav
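The general idea behind formant synthesis can be illustrated with a short Python sketch. This is not a reconstruction of PAT's circuit: the sample rate, the formant values for an "ah"-like vowel, the parallel arrangement of resonators, and all function names are assumptions chosen for illustration. A buzz source (an impulse train at the voice pitch) is passed through a few two-pole digital resonators, each tuned to one formant frequency, and the outputs are summed.

```python
import math

SAMPLE_RATE = 16000  # samples per second (assumed for this sketch)


def resonator_coefficients(freq_hz, bandwidth_hz, sample_rate=SAMPLE_RATE):
    """Coefficients of a two-pole digital resonator centred on freq_hz."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalizes the gain at 0 Hz to 1
    return a, b, c


def resonate(signal, freq_hz, bandwidth_hz):
    """Filter a signal through one resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    a, b, c = resonator_coefficients(freq_hz, bandwidth_hz)
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def buzz(pitch_hz, duration_s):
    """Impulse train standing in for the voiced 'buzz' source."""
    period = int(SAMPLE_RATE / pitch_hz)
    n = int(SAMPLE_RATE * duration_s)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def synthesize_vowel(formants, pitch_hz=120.0, duration_s=0.5):
    """Sum the outputs of one resonator per (frequency, bandwidth) pair."""
    source = buzz(pitch_hz, duration_s)
    branches = [resonate(source, f, bw) for f, bw in formants]
    return [sum(values) for values in zip(*branches)]


# Rough formant frequencies and bandwidths in Hz for an "ah"-like vowel (assumed).
AH = [(700.0, 80.0), (1200.0, 90.0), (2600.0, 120.0)]
samples = synthesize_vowel(AH)
```

Changing the formant frequencies in `AH` changes which vowel the output resembles, which is the sense in which a formant synthesizer "works on frequencies".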
3.2 Text To Speech (TTS):
As the technology advanced, there was a need for systems that could help both ordinary users and
people with disabilities and make their lives easier. TTS systems were introduced for this
purpose.
3.2.1. First TTS (1968):
In 1968, Noriko Umeda and colleagues introduced the first complete, functioning text-to-speech
system for the English language. The speech was quite understandable but monotonous.
The audio sample is given below:
First T T S.wav
3.2.2. MITalk:
In 1979, Allen, Hunnicutt, and Klatt developed the MITalk system at MIT for the English language.
This system converted text to synthesized speech in several stages. In the first stage,
abbreviations, numbers, and symbols were transformed into words. Then, using a lexicon of 12,000
morphs (prefixes, roots, and suffixes), words were converted to their phonetic equivalents. Words
not in the lexicon were converted to phonemes by rule (Suhas Mache et al., 2015).
The audio sample for MITalk is given below:
MITalk.wav
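The stages described for MITalk can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not MITalk's actual code: the tiny tables, the phoneme symbols, the greedy longest-match morph search, and the one-letter-per-phoneme fallback are all hypothetical stand-ins for the real 12,000-morph lexicon and letter-to-sound rules.

```python
import re

# Hypothetical toy tables; the real MITalk lexicon held about 12,000 morphs.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "%": "percent"}
DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
          "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}
MORPH_LEXICON = {"un": "AH N", "read": "R IY D", "able": "AH B AH L",
                 "talk": "T AO K", "ing": "IH NG"}
# Crude one-letter-per-phoneme fallback rules (assumed, far simpler than real rules).
LETTER_RULES = {"a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH", "f": "F",
                "g": "G", "h": "HH", "i": "IH", "j": "JH", "k": "K", "l": "L",
                "m": "M", "n": "N", "o": "AA", "p": "P", "q": "K", "r": "R",
                "s": "S", "t": "T", "u": "AH", "v": "V", "w": "W", "x": "K S",
                "y": "Y", "z": "Z"}


def normalize(text):
    """Stage 1: turn abbreviations, digits, and symbols into plain words."""
    words = []
    for token in re.findall(r"[a-z]+\.?|\d|[%]", text.lower()):
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif token.isdigit():
            words.append(DIGITS[token])  # digits are read one at a time
        else:
            words.append(token.rstrip("."))
    return words


def split_morphs(word):
    """Stage 2: greedy longest-match split into lexicon morphs, or None on failure."""
    morphs = []
    pos = 0
    while pos < len(word):
        for end in range(len(word), pos, -1):
            if word[pos:end] in MORPH_LEXICON:
                morphs.append(word[pos:end])
                pos = end
                break
        else:
            return None  # no morph starts here, so the lexicon cannot cover the word
    return morphs


def to_phonemes(word):
    """Use the morph lexicon when possible, otherwise fall back to letter rules."""
    morphs = split_morphs(word)
    if morphs is not None:
        return " ".join(MORPH_LEXICON[m] for m in morphs)
    return " ".join(LETTER_RULES[ch] for ch in word)


def text_to_phonemes(text):
    """Run both stages over a whole sentence."""
    return [to_phonemes(word) for word in normalize(text)]
```

For example, `split_morphs("unreadable")` finds the prefix, root, and suffix in the toy lexicon, while a name such as "smith" is not covered and falls through to the letter rules.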
3.2.3. Klattalk system:
The Klattalk text-to-speech program contains a system of rules for generating synthetic speech
from an abstract representation that includes some syntactic markers, lexical stress, and
phonemes. The program is a valuable tool for the evaluation of timing rules in English, and
correct timing specifications at the acoustic level are shown to be essential to sentence
comprehension and acceptability (Dennis H. Klatt, 1983).
The audio sample for Klattalk is given below:
Khlattalk.wav
3.2.4. DEC Talk:
Digital Equipment Corporation's DEC Talk was based on the Klattalk system and is available in
American English, German, and Spanish. The DEC Talk system became commercially available in
1983. The system can pronounce most proper names, e-mail addresses, and URLs, and it supports a
customized pronunciation dictionary. It also offers punctuation control for pauses, pitch, and
stress; voice control commands are inserted in a text file that is used by DEC Talk software
applications. The speaking rate is adjustable between 75 and 650 words per minute
(Suhas Mache et al., 2015). Some audio samples are given below:
DEC Talk 5 voices.wav
The DEC Talk audio sample at 300 wpm is given below:
DEC Talk 300 wpm.wav
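To put the quoted speaking-rate range in perspective, the short Python sketch below converts a rate in words per minute into the time needed to speak a passage. The function name and the 300-word passage length are hypothetical; only the 75 and 650 wpm bounds and the 300 wpm sample rate come from the text.

```python
MIN_WPM = 75    # lower bound quoted in the text
MAX_WPM = 650   # upper bound quoted in the text


def speaking_time_seconds(word_count, words_per_minute):
    """Return how long it takes to speak word_count words at the given rate."""
    if not MIN_WPM <= words_per_minute <= MAX_WPM:
        raise ValueError("rate must be between 75 and 650 words per minute")
    return word_count * 60.0 / words_per_minute


if __name__ == "__main__":
    passage_words = 300  # hypothetical passage length
    for rate in (MIN_WPM, 300, MAX_WPM):
        seconds = speaking_time_seconds(passage_words, rate)
        print(f"{rate:>3} wpm: {seconds:6.1f} s")
```

A 300-word passage takes four minutes at the slowest setting, one minute at 300 wpm, and under half a minute at the fastest setting.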
3.2.5. Microsoft Narrator:
In 2000, Microsoft released Narrator, a screen reader that was a great help to blind and
visually impaired people. Narrator is the built-in screen reader for Windows and is included at
no extra charge. Although it has been in development since 2000, recent years have brought
features that put Narrator on a par with NVDA. While many other screen readers are available
for Windows, Narrator is installed by default, which makes it especially handy when installing
Windows or setting up a new PC.
4. Present:
In recent years, further advances have been made in text-to-speech technology that allow for
more natural interactions between humans and machines. These include improvements in the
language-processing algorithms that interpret complex text accurately, as well as the use of
artificial intelligence and machine learning to generate more natural-sounding voices.
The following is an audio sample of Google TTS:
Google Text To Speech.mp3
5. Challenges:
Despite these advances, text-to-speech technology still faces several challenges. One is the lack
of standardization across different systems, which can make it difficult for users to find a TTS
system that meets their needs. Another is the difficulty of generating natural-sounding voices
that convey emotion and context correctly. Finally, there is the issue of cost: many commercial
TTS applications are expensive and may be out of reach for some users.
6. Results:
Listening to the audio samples given in this paper shows how much these systems have improved at
generating natural voices with greater clarity and accuracy. From dull and monotonous voices,
the field has progressed to far more sophisticated, human-like voices. Cost was another major
issue for users in the early development stages because of inefficient circuits, but it has
fallen greatly as the field has evolved. Overall, the audio samples show how much improvement
has been achieved. The following table summarizes the results of this research by rating each
system on several parameters:
Sr. no.   Platform                       Quality   Clarity   Cost      Year
1.        VODER                          Normal    Normal    High      1937-1938
2.        Parametric Artificial Talker   Poor      Poor      Normal    1953
3.        First TTS                      Good      Normal    High      1968
4.        MITalk                         Poor      Good      Normal    1979
5.        Klattalk                       Normal    Normal    Low       1981
6.        DEC Talk                       Good      Good      Normal    1983
7.        Google TTS                     Good      Great     No cost   Present
7. Conclusion:
Text-to-speech technology has come a long way since its beginnings in 1779, when Christian
Kratzenstein built a device to produce the five vowels artificially. The evolution of this field
has already been discussed in detail. Today, TTS systems are used in many everyday applications
such as virtual assistants and navigation systems. Additionally, researchers have been working
on improving these systems with advancements in natural language processing and speech synthesis
techniques. The first commercial speech synthesis systems were hardware based, and progress in
the field was slow because development was expensive and time-consuming. As computers have
become more powerful, most speech synthesizers are now software based. Software-based systems
are easier to update than hardware and are also less expensive and less time-consuming to
develop. Many further advancements can therefore be expected in this field in the near future.
References:
Mache, S., et al. (2015). Review on Text-To-Speech Synthesizer. A short review related to speech
synthesis; also the source for the information about DEC Talk and MITalk.
https://www.researchgate.net/publication/304600612_Review_on_Text-To-Speech_Synthesizer
Klatt, D. H. (May 1983). Timing rules in Klattalk: Implications for models of speech production.
The Journal of the Acoustical Society of America 73(S1):66.
http://dx.doi.org/10.1121/1.2020502
Lemmetty, S. (1999). A thesis on the speech synthesizing techniques in early times. Source for
the information about the Parametric Artificial Talker (PAT).
http://research.spa.aalto.fi/publications/theses/lemmetty_mst/chap2.html
For basic information about speech synthesizers. https://doi.org/10.1121/1.395275
For the audio sample of Google Speech synthesizer. https://cloud.google.com/text-to-speech
For the audio samples of first TTS device and Klattalk system.
https://acousticstoday.org/klatts-speech-synthesis-d/
For other audio samples used in the research paper. https://youtu.be/huq2TSV99hI