The Evolution of Text-to-Speech over Time
Mois Shabbir
Department of Electrical Engineering, EME College
HU-212: Technical and Business Writing
Asst Prof Rukaiza Khan
Jan. 9, 2023

Abstract:
This research paper examines the evolution of text-to-speech (TTS) technology over time. TTS began as early attempts to mimic the human voice and has since grown in sophistication and capability, becoming an important tool for people with disabilities and for those who are unable to read aloud. The paper traces the history of TTS development from its earliest days to current advancements and future potential. It also explores how TTS has been used to facilitate communication and language learning, discusses the challenges the technology currently faces, and offers insights into future trends that may shape its use for years to come. The analysis is based on recorded audio samples from different systems spanning a long period of time; improvements in clarity and other qualities were assessed by listening to these samples.
Keywords: Text-to-Speech, Technology, Evolution, History, Analyze.

Evolution of text-to-speech technology over time

1. Introduction:
“Any sufficiently advanced technology is indistinguishable from magic” (Clarke, A. C.). A hundred years ago, who would have imagined that the world we live in would look like this? As time passes, technology keeps becoming more diverse. This research paper analyzes one field that keeps advancing over time: text-to-speech (TTS) technology, from its earliest development up to modern applications. TTS has come a long way since its early days, when only basic words and phrases could be converted into speech.
Today’s systems are more sophisticated and can interpret complex language accurately, allowing for more natural interactions between humans and machines. This paper examines how TTS has been used over time and looks at the advancements made in accuracy, speed, and voice quality. In addition, it discusses potential future developments and considers the challenges the technology faces in the modern era. Earlier research on this topic approaches it from narrower angles: it examines TTS from one person's perspective, or it documents the work of a single person in the field, and it does not go into much depth. This paper instead analyzes the evolution and present state of TTS in detail.

2. Methodology:
The evolution of TTS technology is evaluated by gathering recorded audio samples from different TTS systems over time and analyzing them for improvements in parameters such as speed, accuracy, voice clarity, and quality. Because the samples were gathered from different sources and judged by listening rather than by measurement, the assessment is comparative and qualitative.

3. History:
TTS has always offered wide scope for improvement. To analyze its evolution over time, we first need to see where it originated.

3.1 Speech Synthesizer:
Improvements in the quality of speech synthesizers over time have greatly helped the evolution of TTS. The first effort to produce artificial speech was made in 1779, when the Russian professor Christian Kratzenstein built a system that pronounced the five vowels. This was the starting point from which the technology kept evolving in the following years.

3.1.1. VODER:
The VODER (Voice Operating Demonstrator) is a speech synthesis device developed by Homer Dudley at Bell Laboratories in the 1930s.
It was one of the earliest attempts to create an electronic talking machine and can be considered a precursor to modern speech synthesizers. The device consists of two main parts: an electrical circuit that creates sound waves, and a keyboard-like control panel with which the operator shapes those sounds into recognizable words or phrases. The VODER was demonstrated publicly at several world's fairs during the 1930s, including the New York World's Fair of 1939–1940, where it received much attention from visitors (Andrews, 2001). The audio sample for VODER is given below:
VODER.wav

3.1.2. Parametric Artificial Talker (PAT):
PAT was the first formant synthesizer, introduced by Walter Lawrence in 1953. It produced speech by controlling formant frequencies, and its operation was rather complicated (Lemmetty, 1999). The audio sample for PAT is given below:
Parametric Artificial Talker.wav

3.2 Text To Speech (TTS):
As the technology in this field advanced, a system was needed that could help ordinary users as well as people with disabilities and make their lives easier. TTS systems were introduced for this purpose.

3.2.1. First TTS (1968):
In 1968, Noriko Umeda and colleagues introduced the first complete, functioning text-to-speech system for the English language. The speech was quite understandable but tedious to listen to. The audio sample is given below:
First T T S.wav

3.2.2. MITalk:
In 1979, Allen, Hunnicutt, and Klatt demonstrated MITalk, an English text-to-speech system developed at MIT. MITalk converted text to synthesized speech in several levels. In the first level, abbreviations, numbers, and symbols were transformed into words. Then, using a lexicon of 12,000 morphs (prefixes, roots, and suffixes), words were converted to their phonetic equivalents. Words not in the lexicon were converted to phonemes by rule (Mache et al., 2015). The audio sample for MITalk is given below:
MITalk.wav

3.2.3.
Klattalk system:
The Klattalk text-to-speech program contains a system of rules for generating synthetic speech from an abstract representation that includes some syntactic markers, lexical stress, and phonemes. The program is a valuable tool for evaluating timing rules in English. According to Klatt, correct timing specifications at the acoustic level are essential to sentence comprehension and acceptability (Klatt, 1983). The audio sample for Klattalk is given below:
Khlattalk.wav

3.2.4. DEC Talk:
Digital Equipment Corporation's DEC Talk was based on the Klattalk system and is available in American English, German, and Spanish. It became commercially available in 1983. The system can pronounce most proper names, e-mail addresses, and URLs, and it supports a customized pronunciation dictionary. It also offers punctuation control for pauses, pitch, and stress; voice control commands are inserted in a text file that is read by DEC Talk software applications. The speaking rate is adjustable from 75 to 650 words per minute (Mache et al., 2015). Some audio samples are given below:
DEC Talk 5 voices.wav
A DEC Talk audio sample at 300 wpm is given below:
DEC Talk 300 wpm.wav

3.2.5. Microsoft Narrator:
In 2000, Microsoft released Narrator, a screen reader that was a great help to blind and visually impaired people. Narrator is the built-in screen reader for Windows, included at no extra charge. Although it has been in development since 2000, only in recent years have new features brought it up to par with NVDA (NonVisual Desktop Access), a free screen reader. While many other screen readers are available for Windows, Narrator is installed by default, which makes it especially handy when installing Windows or setting up a new PC.

4. Present:
In recent years, further advances in text-to-speech technology have allowed for more natural interactions between humans and machines.
This includes improvements in text analysis algorithms that can accurately interpret complex language, as well as the use of artificial intelligence and machine learning to generate more natural-sounding voices. The following is an audio sample of Google TTS:
Google Text To Speech.mp3

5. Challenges:
Despite these advances, text-to-speech technology still faces several challenges. One is the lack of standardization across different systems, which can make it difficult for users to find a TTS system that meets their needs. Another is the difficulty of generating natural-sounding voices that convey emotion and context correctly. Finally, there is the issue of cost: many commercial TTS applications are expensive and may be out of reach for some users.

6. Results:
Listening to the audio samples given in this paper shows how much these systems have improved at generating natural voices with greater clarity and accuracy. From dull, monotonous voices, the field has moved to far more sophisticated, human-like voices. Cost was another major issue for users in the early stages of development because of inefficient circuits, but it has fallen greatly as the field has evolved. Overall, the audio samples show how much improvement has been achieved. The following table summarizes the results of this research by rating each system on different parameters:

Sr. no.  Platform                      Quality  Clarity  Cost     Year
1.       VODER                         Normal   Normal   High     1937-1938
2.       Parametric Artificial Talker  Poor     Poor     Normal   1953
3.       First TTS                     Good     Normal   High     1968
4.       MITalk                        Poor     Good     Normal   1979
5.       Klattalk                      Normal   Normal   Low      1981
6.       DEC Talk                      Good     Good     Normal   1983
7.       Google TTS                    Good     Great    No cost  Present

7.
Conclusion:
Text-to-speech technology has come a long way since its beginnings in 1779, when Christian Kratzenstein built a system that pronounced the five vowels. The preceding sections traced this evolution in detail. Today, TTS systems are used in many everyday applications such as virtual assistants and navigation systems. Researchers continue to improve these systems through advances in natural language processing and speech synthesis techniques. The first commercial speech synthesis systems were hardware based, and progress in the field was slow because development was expensive and time-consuming. As computers have become more powerful, most speech synthesizers are now software based. Software-based systems are easier to update than hardware and are also less expensive and less time-consuming to develop. Many further advances can therefore be expected in the near future.

References:
Mache, S., et al. (2015). Review on text-to-speech synthesizer. A short review of speech synthesis; source for the information about DEC Talk and MITalk. https://www.researchgate.net/publication/304600612_Review_on_Text-To-Speech_Synthesizer
Klatt, D. H. (1983, May). Timing rules in Klattalk: Implications for models of speech production. The Journal of the Acoustical Society of America, 73(S1), 66. http://dx.doi.org/10.1121/1.2020502
Lemmetty, S. (1999). A thesis on early speech synthesis techniques; source for the information about the Parametric Artificial Talker (PAT). http://research.spa.aalto.fi/publications/theses/lemmetty_mst/chap2.html
For basic information about speech synthesizers. https://doi.org/10.1121/1.395275
For the audio sample of the Google speech synthesizer. https://cloud.google.com/text-to-speech
For the audio samples of the first TTS device and the Klattalk system.
https://acousticstoday.org/klatts-speech-synthesis-d/
For other audio samples used in the research paper. https://youtu.be/huq2TSV99hI