Overview of Text to Speech
“Getting the computer to read your printed document out loud”
Text to Speech
• Text-to-speech software is used to convert words from a computer document (e.g. a word processor document or a web page) into audible speech spoken through the computer speaker.
Benefits
• The benefits of speech synthesis have been many, including computers that can read books to people, better hearing aids, more simultaneous telephone conversations on the same cable, talking machines for vocally impaired or deaf people, and better aids for speech therapy.
The history of speech synthesis
• What you may not know is that the first synthetic speech was produced as early as the late 18th century.
• The machine was built of wood and leather and was very complicated to use when generating audible speech. It was constructed by Wolfgang von Kempelen and was of great importance in the early study of phonetics.
• The original construction can be seen at the Deutsches Museum (von Meisterwerken der Naturwissenschaft und Technik) in Munich, Germany.
[Picture: von Kempelen's machine]
Voder
• In the early 20th century, when it became possible to use electricity to create synthetic speech, the first known electric speech synthesizer was the "Voder"; its creator, Homer Dudley, showed it to a broader audience at the 1939 World's Fair in New York.
OVE
• One of the pioneers of the development of speech synthesis in Sweden was Gunnar Fant.
• During the 1950s he was responsible for the development of the first Swedish speech synthesizer, OVE (Orator Verbis Electris).
• At that time, only Walter Lawrence's Parametric Artificial Talker (PAT) could compete with OVE in speech quality.
• OVE and PAT were text-to-speech systems using formant (parametric) synthesis.
Speech synthesis becomes more human-like
• The greatest improvements in naturalness have come during the last ten years.
• The first voices we used for ReadSpeaker back in 2001 were produced using diphone synthesis.
• These voices are sampled from real recorded speech and split into small units of human speech. This was the first example of concatenative synthesis. However, they still have an artificial, synthetic sound. We still use diphone voices for some smaller languages, and they are widely used to speech-enable handheld computers and mobile phones because of their limited resource consumption, in both memory and CPU.
Unit Selection
• It wasn't until the introduction of a technique called unit selection that voices became very natural-sounding. This is still concatenative synthesis, but the units used are larger than phonemes, sometimes a complete sentence.
Why use Speech Synthesis
• Visual issue (difficulty seeing text)
• Cognitive issue (low reading level/comprehension)
• Motor issue (difficulty handling a book or paper)
Forms of Text
• E-text: most of the text you see on your computer. Examples: Internet, email, word processor documents, e-books.
• Paper text: any text printed on paper. Examples: newspaper, book, magazine.
Characteristics of speech synthesis systems
• Many speech synthesis systems take text as their input and produce speech as their output.
• Hence they are often known as text-to-speech (TTS) systems.
• The naturalness of a speech synthesizer usually refers to how much the output sounds like the speech of a real person.
• The intelligibility of a speech synthesizer refers to how easily the output can be understood.
Parts of Speech Synthesizers
• Speech synthesizers usually consist of two parts: a front end and a back end.
First Part
• The first part, the front end, has two major tasks. First it takes the raw text and converts things like numbers and abbreviations into their written-out word equivalents. This process is often called text normalization, pre-processing, or tokenization. Then it assigns phonetic transcriptions to each word, and divides and marks the text into various linguistic units such as phrases, clauses, and sentences. The combination of phonetic transcriptions and prosody information makes up the symbolic linguistic representation that the first part of the system passes to the second part.
Second Part
• The other part, the back end, takes the symbolic linguistic representation and converts it into actual sound output.
• The back end is often referred to as the synthesizer.
Text normalization challenges
• The process of normalizing text is rarely straightforward. Texts are full of homographs (words that are spelt the same but pronounced differently, e.g. "Read the book" vs. "The book was read"), numbers, and abbreviations, all of which ultimately require expansion into a phonetic representation.
• There are many words in English which are pronounced differently based on context (i.e. homographs). Some examples:
• project: My latest project is to learn how to better project my voice.
• bow: The girl with the bow in her hair was told to bow deeply when greeting her superiors.
• Most TTS systems do not generate semantic representations of their input texts, as processes for doing so are not reliable, well understood, or computationally effective.
Numbers
• Deciding how to convert numbers is another problem TTS systems have to address.
• It is a fairly simple programming challenge to convert a number into words, like 1325 becoming "one thousand three hundred twenty-five".
• However, numbers occur in many different contexts in texts, and 1325 should probably be read as "thirteen twenty-five" when part of an address (1325 Main St.) and as "one three two five" if it is the last four digits of a social security number.
• Often a TTS system can infer how to expand a number based on surrounding words, numbers, and punctuation, and sometimes the systems provide a way to specify the type of context if it is ambiguous (a toy sketch of this appears after the next slide).
Abbreviations
• Similarly, abbreviations like "etc." are easily rendered as "et cetera", but abbreviations can often be ambiguous.
• For example, the abbreviation "in." in the following example: "Yesterday it rained 3 in. Take 1 out, then put 3 in."
• "St." can also be ambiguous: "St. John St."
• TTS systems with intelligent front ends can make educated guesses about how to deal with ambiguous abbreviations, while others do the same thing in all cases, resulting in nonsensical but sometimes comical outputs: "Yesterday it rained three in." or "Take one out, then put three inches."
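To make the normalization step concrete, here is a minimal Python sketch of context-dependent number expansion along the lines described above. The function names and the crude one-word "context" label are invented for illustration; a real front end uses much richer context modelling.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def under_100(n):
    """Read 0-99 as words."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")

def cardinal(n):
    """Read 0-9999 as an ordinary cardinal number."""
    parts = []
    if n >= 1000:
        parts.append(ONES[n // 1000] + " thousand")
        n %= 1000
    if n >= 100:
        parts.append(ONES[n // 100] + " hundred")
        n %= 100
    if n or not parts:
        parts.append(under_100(n))
    return " ".join(parts)

def expand_number(token, context):
    """Pick a reading for a four-digit string from a crude context label."""
    if context == "address":      # street numbers read pairwise
        return under_100(int(token[:2])) + " " + under_100(int(token[2:]))
    if context == "id_digits":    # trailing ID digits read one by one
        return " ".join(ONES[int(d)] for d in token)
    return cardinal(int(token))   # default cardinal reading

print(expand_number("1325", "address"))    # thirteen twenty-five
print(expand_number("1325", "id_digits"))  # one three two five
print(expand_number("1325", "cardinal"))   # one thousand three hundred twenty-five
```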
Text-to-phoneme challenges
• Speech synthesis systems use two basic approaches to determine the pronunciation of a word based on its spelling, a process often called text-to-phoneme or grapheme-to-phoneme conversion, "phoneme" being the term linguists use for the distinctive sounds of a language.
Dictionary Based approach
• The simplest approach to text-to-phoneme conversion is the dictionary-based approach, where the program stores a large dictionary containing all the words of a language and their correct pronunciations. Determining the correct pronunciation of each word is then a matter of looking it up in the dictionary and replacing the spelling with the pronunciation specified there.
Rule based approach
• The other approach to text-to-phoneme conversion is the rule-based approach, where pronunciation rules are applied to words to work out their pronunciations from their spellings. This is similar to the "sounding out" approach to learning to read.
Synthesizer technologies
• There are two main technologies used for generating synthetic speech waveforms: concatenative synthesis and formant synthesis, sometimes called parametric speech synthesis.
• There are others, such as:
• Recorded prompts
• Intonation modeling
Formant Synthesis
• Formant synthesis does not use any human speech samples at runtime. Instead, the synthesized speech output is created using an acoustic model.
• Parameters such as frequency and amplitude are varied over time to create a waveform of artificial speech.
• This method is sometimes called rule-based synthesis, but some argue that because many concatenative systems also use rule-based components for some parts of the system, such as the front end, the term is not specific enough.
• Many systems based on formant synthesis technology generate artificial, robotic-sounding speech, and the output would never be mistaken for the speech of a real human. However, maximum naturalness is not always the goal of a speech synthesis system, and formant synthesis systems have some advantages over concatenative systems.
• Formant-synthesized speech can be very reliably intelligible, even at very high speeds, avoiding the acoustic glitches that often plague concatenative systems.
• High-speed synthesized speech is often used by the visually impaired for quickly navigating computers using a screen reader.
• Second, formant synthesizers are often smaller programs than concatenative systems because they do not need a database of speech samples.
• Last, because formant-based systems have total control over all aspects of the output speech, a wide variety of prosody can be produced, conveying not just questions and statements but a variety of emotions and tones of voice.
Formant
• This synthesis is a kind of source-filter method based on mathematical models of the human speech organs. The vocal tract is modelled as a number of resonances resembling the formants (frequency bands with high energy) of natural speech. The first electronic voices, Voder and later OVE and PAT, spoke with entirely synthetic, electronically produced sounds using formant synthesis. As with articulatory synthesis, memory consumption is small but CPU usage is large.
The Source-Filter Model of Formant Synthesis
• Excitation or voicing source(s) to model the sound source:
– a standard wave of glottal pulses for voiced sounds
– randomly varying noise for unvoiced sounds
– modification of airflow due to the lips, etc.
Formant Synthesis continued
– high frequency (F0 rate), quasi-periodic, choppy
– modeled with a vector of glottal waveform patterns in voiced regions
• Acoustic filter(s):
– shape the frequency character of the vocal tract and the radiation character at the lips
– relatively slow-changing and stationary (samples around every 5 ms suffice)
– modeled with LPC (linear predictive coding)
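As a concrete illustration of the source-filter idea, here is a small Python sketch (assuming numpy and scipy are available) that passes a crude glottal pulse train through three second-order resonators placed at roughly textbook formant frequencies for the vowel /a/. The filter design and formant values are simplified for illustration; this is not a production formant synthesizer.

```python
import numpy as np
from scipy.signal import lfilter

FS = 16000  # sampling rate (Hz)

def glottal_source(f0, dur):
    """Quasi-periodic pulse train: a crude voiced excitation at f0 Hz."""
    x = np.zeros(int(FS * dur))
    x[::int(FS / f0)] = 1.0
    return x

def resonator(x, freq, bw):
    """One formant as a second-order IIR resonance (freq and bw in Hz)."""
    r = np.exp(-np.pi * bw / FS)
    theta = 2 * np.pi * freq / FS
    b = [1.0 - r]                               # rough gain normalisation
    a = [1.0, -2.0 * r * np.cos(theta), r * r]  # pole pair at the formant
    return lfilter(b, a, x)

# Approximate first three formants of the vowel /a/ (textbook values).
out = glottal_source(f0=120, dur=0.5)
for freq, bw in [(700, 130), (1220, 70), (2600, 160)]:
    out = resonator(out, freq, bw)
out /= np.abs(out).max()  # normalise to [-1, 1] before writing to audio
```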
Concatenative synthesis
• Concatenative synthesis is based on the concatenation (or stringing together) of segments of recorded speech.
• Generally, concatenative synthesis gives the most natural-sounding synthesized speech.
• However, natural variation in speech and automated techniques for segmenting the waveforms sometimes result in audible glitches in the output, detracting from the naturalness. There are three main subtypes of concatenative synthesis:
Subtypes
• Unit selection synthesis uses large speech databases (more than one hour of recorded speech). During database creation, each recorded utterance is segmented into some or all of the following linguistic constructs: phonemes, words, phrases, and sentences.
• Diphone synthesis uses a minimal speech database containing all the diphones (sound-to-sound transitions) occurring in a given language. In diphone synthesis, only one example of each diphone is contained in the speech database.
• Domain-specific synthesis concatenates pre-recorded words and phrases to create complete utterances.
Concatenative Synthesis
– Record a basic inventory of sounds
– Retrieve the appropriate sequence of units at run time
– Concatenate and adjust durations and pitch
– Synthesize the waveform
Concatenating synthesis
• A concatenative synthesis is built from recorded pieces of speech (sound clips) that are then segmented into units and joined to form speech.
• Depending on the length of the sound clips used, the result is a diphone or a polyphone synthesis.
• The latter, in a more developed version, is also called unit selection synthesis, where the synthesizer has access to both long and short segments of speech and chooses the best segments for the actual context.
Diphone
• In phonetics, a diphone is an adjacent pair of phones. The term usually refers to a recording of the transition between two phones.
• A phone is the actual pronunciation of a phoneme.
Diphone and Polyphone Synthesis
• Phone sequences capture co-articulation
• That is, how combinations of phones sound
Diphone and Polyphone Synthesis
• Data collection methods:
– Collect data from a single (professional) speaker
– Select text with maximal coverage (typically with a greedy algorithm), or
– Record minimal pairs in desired contexts (real words or nonsense)
• Reduce the number of units collected by:
– phonotactic constraints
– collapsing in cases of no co-articulation
• Cut speech at positions that minimize context contamination
• Single phones, diphones, and sometimes triphones are needed
Diphone
• For a diphone synthesis, the elements taken from the recorded speech are very small.
• The speech may sound a bit monotonous.
• Diphone synthesis doesn't work that well:
• in languages with many inconsistencies in the pronunciation rules (English, Swedish, etc.)
• in special cases where letters are pronounced differently than in general.
• Diphone synthesis works better for languages with very consistent pronunciation (Spanish, Finnish, etc.).
• Another advantage is that the prosody, the intonation, can be described in much detail.
Signal Processing for Concatenative Synthesis
• Diphones recorded in one context must be generated in other contexts
• Features are extracted from the recorded units
• Signal processing manipulates features to smooth the boundaries where units are concatenated (a toy sketch follows this list)
• Signal processing modifies the signal via 'interpolation' of:
– intonation
– duration
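One simple way to smooth a concatenation boundary is a short crossfade between consecutive units. The Python sketch below is illustrative only: the load_wave helper and the diphone names are hypothetical, and real systems use pitch-synchronous techniques (e.g. PSOLA) rather than a plain linear crossfade.

```python
import numpy as np

def crossfade_join(units, fade_ms=10, fs=16000):
    """Join recorded unit waveforms, smoothing each boundary
    with a short linear crossfade to hide concatenation glitches."""
    n = int(fs * fade_ms / 1000)
    fade_in = np.linspace(0.0, 1.0, n)
    fade_out = 1.0 - fade_in
    out = units[0]
    for u in units[1:]:
        overlap = out[-n:] * fade_out + u[:n] * fade_in
        out = np.concatenate([out[:-n], overlap, u[n:]])
    return out

# Hypothetical usage: the word "cat" as the diphone sequence
# sil-k, k-ae, ae-t, t-sil, each loaded from a recorded database.
# units = [load_wave(name) for name in ["sil-k", "k-ae", "ae-t", "t-sil"]]
# audio = crossfade_join(units)
```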
Unit selection
• The greatest difference between a unit selection voice and a diphone voice is the length of the speech segments used.
• Entire words and phrases are stored in the unit database. This implies that the databases for unit selection voices are many times bigger than those for diphone voices.
• Thus, the memory consumption is huge while the CPU consumption is low.
Unit Selection
• The most important issue is still to get natural and smooth prosody.
• This is hard because the units contain both intonation and pronunciation, since entire phrases are used almost directly from the recorded data.
• Since the first unit selection voice was released, over eight years ago, there has been much improvement with every new voice release.
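At run time, unit selection is typically framed as a search for the candidate sequence that minimises a target cost (how well a unit matches the wanted specification) plus a join cost (how smoothly adjacent units concatenate). The Python sketch below is a toy version of that search: the cost functions use only pitch (f0) features, and all field names are invented for illustration.

```python
def target_cost(cand, target):
    # Mismatch between a candidate unit and the target specification;
    # real systems combine many features, here only average pitch.
    return abs(cand["f0"] - target["f0"])

def join_cost(prev, cand):
    # Discontinuity at the concatenation point (pitch difference only).
    return abs(prev["f0_end"] - cand["f0_start"])

def select_units(targets, candidates):
    """Viterbi-style search: candidates[i] lists recorded units for targets[i]."""
    best = [(target_cost(c, targets[0]), [c]) for c in candidates[0]]
    for i in range(1, len(targets)):
        new = []
        for c in candidates[i]:
            cost, path = min(
                ((b_cost + join_cost(b_path[-1], c), b_path)
                 for b_cost, b_path in best),
                key=lambda t: t[0])
            new.append((cost + target_cost(c, targets[i]), path + [c]))
        best = new
    return min(best, key=lambda t: t[0])[1]

# Two target phones, two candidate recordings each (toy pitch values in Hz).
targets = [{"f0": 120}, {"f0": 130}]
candidates = [
    [{"f0": 118, "f0_start": 115, "f0_end": 122},
     {"f0": 140, "f0_start": 138, "f0_end": 141}],
    [{"f0": 131, "f0_start": 123, "f0_end": 128},
     {"f0": 90,  "f0_start": 92,  "f0_end": 95}],
]
print(select_units(targets, candidates))  # picks the low-cost, smooth path
```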
HMM synthesis
• A relatively new technology is speech synthesis based on HMMs, a mathematical concept called hidden Markov models.
• It is a statistical method where the text-to-speech system is based on a model that is not known beforehand but is refined by continuous training.
• The technique consumes large CPU resources but very little memory.
• This approach seems to give better prosody, without glitches, while still producing very natural-sounding, human-like speech.
Recorded Prompts
• The simplest (and most common) solution is to record prompts spoken by a (trained) human.
• Produces a human-quality voice.
• Limited by the number of prompts that can be recorded.
• Can be extended by limited cut-and-paste or template filling.
Articulatory synthesis
• In articulatory synthesis, models of the human articulators (tongue, lips, teeth, jaw) and vocal folds are used to simulate how airflow passes through the vocal tract, in order to calculate what the resulting sound will be like. It is a great challenge to find good mathematical models, and the development of articulatory synthesis is therefore still at the research stage. The technique is very computation-intensive, but its memory requirements are minimal.
Sable
• SABLE is an emerging standard extending SGML: http://www.cstr.ed.ac.uk/projects/sable.html
– marks: emphasis(#), break(#), pitch(base/mid/range,#), rate(#), volume(#), semanticMode(date/time/email/URL/...), speaker(age,sex)
– Implemented in the Festival synthesizer (free for research, etc.): http://www.cstr.ed.ac.uk/projects/festival.html
Assistive Applications of speech synthesis
• Systems that provide voice synthesis output for blind users are generally referred to as screen readers. Brown (1989) [Cook and Hussey 95] identified the capabilities an ideal voice output system should have.
Key Features
• 1: A good audio environment: no background noise, good speakers, earphones, etc.
• 2: Good intelligibility. The output should be intelligible; studies have shown this to be paramount. Studies have also shown that naturalness of the voice is desirable, particularly for female users of speech synthesis. Highly synthetic-sounding voices are less acceptable.
• 3: The screen reader should work with all commercially available software, i.e. the blind user should have access to the same software the sighted user has.
• This includes access to both text and graphics.
• 4: The adapted output system should work with a variety of speech synthesizer systems.
User Interface
• The user interface should have the following characteristics.
• 1: Spoken letters often sound the same, e.g. "b" and "v". To reduce ambiguity the synthesizer should have access to the aviator's alphabet (Alpha, Bravo, Charlie, etc.).
• 2: To match the capabilities of normal vision, the screen reader should be able to read forwards or backwards, and read punctuation, highlights, and other syntactical conventions.
• 3: A sighted reader often scans whole passages to get context or a sense of the text. The screen reader should likewise be able to read complete sentences and passages.
• 4: Computer programs often generate prompts and output messages. The screen reader should be able to read these.
Operational Characteristics
• The following are desirable operational characteristics of a screen reader.
• 1: It should be easy to use and maintain, and shouldn't require huge amounts of training. Screen readers are often complex and difficult to master.
• 2: Screen readers have two modes, application and review. Review is where the reader is basically reading. Application is where the functionality of the application can be accessed. For example, a document in a word processor could be read in review mode but edited in application mode. Ideally the two modes should be merged: if a mistake were noticed while a document is being read, it would be beneficial to change it there and then, without having to switch out of review into application mode.
More
• 3: Central to the success of screen readers is the notion of cursor routing. Here the navigation path of the cursor through the document is recorded so that we can return to various points if we have to switch between modes. For example, if we have to switch to application mode, we can retrace our steps through the document.
• This is similar to the macro capability provided by many commercial systems.
• 4: In windows-based systems, a series of windows can be generated. Screen readers need to be able to locate and move between windows, and track the output and changes that might occur.
Graphical User Interfaces (GUIs) and Screen readers
• GUIs present unique and difficult problems to the blind user.
• GUIs use visual metaphors to represent information. For example, there are files, folders, briefcases, trashcans and more, each represented by a graphical icon. These icons are not easily represented by speech synthesizers designed to convert text to phonemes. It is absolutely essential for these icons to have an associated text label which can then be spoken when the icon is selected.
More
• Visual information is spatial: location is a component of this space, and the relative positions of objects are organized within a two-dimensional space.
• Auditory information is temporal (time-based). It is difficult to convey the position of a pointer by speech, yet this is fundamental to the process of selecting icons in GUIs.
• An alternative approach to pointer movement is to tab through the screen. Each tab stop can then be highlighted by the screen reader with an auditory prompt. Navigation thus consists of a series of Tab presses followed by the announcement of the icon labels as each is highlighted in turn (a small sketch of this idea follows).
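The following toy Python sketch illustrates the two user-interface ideas above: announcing labelled items in tab order, and spelling out easily confused letters with the aviator's alphabet. The item list, labels, and wording are invented for illustration.

```python
# Hypothetical focusable items, in the order the Tab key visits them.
TAB_ORDER = [
    ("icon", "Trash"),
    ("icon", "My Documents"),
    ("button", "OK"),
]

# Excerpt of the aviator's (NATO phonetic) alphabet for confusable letters.
NATO = {"b": "Bravo", "v": "Victor", "d": "Delta", "p": "Papa", "t": "Tango"}

def announce(kind, label):
    """The utterance a screen reader might speak at each tab stop."""
    return f"{label}, {kind}"

def spell(word):
    """Spell a word, substituting NATO words for easily confused letters."""
    return " ".join(NATO.get(ch.lower(), ch.upper()) for ch in word)

for kind, label in TAB_ORDER:   # simulate pressing Tab repeatedly
    print(announce(kind, label))
print(spell("bv"))              # "Bravo Victor" -- no b/v ambiguity
```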
Applications of Screen readers
• Screen readers are ideally suited to applications that consist of text. The following are some of the major application areas to which screen readers are applied.
Auditory Reading Substitute
• The oldest and most prevalent use of auditory substitution is talking books. Traditionally the books have been read by actors or others, and the reading has been recorded on tape or disk.
• If the text of a book is available electronically, then auditory output may be provided by synthetic speech devices. A key issue here is the intelligibility of the output.
Using the Web with Screen readers
• The web is a hugely important gateway to inclusion. It provides access to information, commerce, entertainment, news, communication, e-learning, and many other applications and functions. Screen readers are a hugely important technology through which people with poor vision and reading difficulties access the web.
• The following is a list of screen reader features and the web issues associated with them.
Sequential Reading
• Screen readers read sequentially from top to bottom, left to right (they are easily confused by columns).
• As speech synthesis technology matures, browsers designed specifically to read HTML will make greater use of HTML tags to format output and provide options for the user. Tags that are not used according to the HTML 3.2 specification will create problems for such browsers.
• Does your site have tables that do not "read" from top to bottom, left to right? This includes the use of tables to achieve the effect of columnar text layout. If so, is the information alternatively provided in some other non-columnar format?
More web
• Immediate Start
• Screen readers begin reading as soon as the page is loaded.
• Do you have excessive standard text or navigational tools at the top of every page? It is difficult for a speech synthesizer to be manipulated to "ignore" such items.
• Navigation
• Navigation is by link, word, line, or character, but not by sentence or paragraph.
• If a user hops from link to link on your web page, will she or he hear "click here" repeated over and over again, or is the link text brief but meaningful?
More web
• Background Wallpaper
• Screen readers cannot interpret background graphical wallpaper (because there is no ALT attribute).
• Do you use background images that contain important content? If so, do you provide an alternate (text-based) method for viewing that content?
• Blinking Text
• Screen readers cannot reliably read blinking text (sometimes they skip it and sometimes they fixate on it).
• Many non-blind people object to blinking text on aesthetic grounds, and it can affect speech synthesizer software negatively as well. Do you really need to use it?
More web
• Search and Punctuation
• Screen readers can search for text strings or attributes (jumping from link to link is accomplished by searching for underlined text). They use punctuation (periods, commas, etc.) to structure speech output.
• In addition to its negative consequences for speech synthesis software, incorrect use of punctuation and spelling irritates many members of the general public. Do you use punctuation correctly and consistently throughout your site? Have you checked your spelling?
Images
• Screen readers read the content of ALT attributes with images but cannot interpret the images themselves.
• Do you use brief but meaningful ALT text with all images? Does your ALT text describe the function of each visual image rather than just its appearance? (Example: "change of topic" rather than "blue line".)
More Images
• If an image is important to the understanding or appreciation of your web site, do you have an adequate text description of it available?
• Blind people, people using text-only browsers, and those who have turned off automatic downloading of images see no information when a web page contains no text. Does the home page of your web site contain text that could guide such a user to a non-graphical alternative?
• Do you provide text descriptions for video clips and video feeds?
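Missing ALT text is easy to check for automatically. Below is a minimal sketch using Python's standard html.parser module that flags img tags with no usable ALT attribute; a real accessibility audit would of course check far more (link text, table structure, and so on).

```python
from html.parser import HTMLParser

class AltChecker(HTMLParser):
    """Collect <img> tags that lack a meaningful ALT attribute."""
    def __init__(self):
        super().__init__()
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attrs = dict(attrs)
            alt = attrs.get("alt")
            if alt is None or not alt.strip():
                self.problems.append(attrs.get("src", "<unknown>"))

checker = AltChecker()
checker.feed('<p><img src="logo.gif" alt="Company logo">'
             '<img src="blueline.gif"></p>')
print(checker.problems)  # ['blueline.gif'] -- an image a screen reader cannot describe
```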
Maths with screen readers
• Reading and writing mathematics is fundamentally different from reading and writing text.
• Just being able to read mathematics is very difficult for visually disabled people. Non-visual representations (such as braille) are not as powerful and flexible as visual ones.
• Take for instance the quadratic formula: x = (−b ± √(b² − 4ac)) / 2a
• This uses the two dimensions of the paper to present the fraction as a clear single component. It also uses the relative positions of the different components in a semantically meaningful way.
• For example, the bar on top of the square root symbol extends over all the symbols governed by it.
• There is also no redundancy in the representation (unlike written text): if any one symbol were deleted, the meaning of the equation would change completely.
• A problem with reading mathematics in non-visual formats is being able to control the reading.
• For instance, if one listens to a whole equation at once, one only remembers a small part of it. When reading visually, people can focus on the parts of the equation which are of interest and ignore the rest.
• A step further on from being able to read mathematical material is being able to manipulate it: to create new equations and modify existing ones. This involves the concepts of selection and re-writing.
• The difficulty of following mathematical representations in speech output comes down to the nature of text versus mathematics.
• Text is linear in nature, while mathematical equations are two-dimensional. What you have been reading in this text is a good example of this problem. In contrast, examine the following relatively simple equation:
[Figure 1: a = √((x² − y) / z), shown in standard two-dimensional notation]
• One will immediately notice that the equation contains a superscript and a fraction, both two-dimensional in nature. The equation could have been written in a linear form, for example:
• a = sqrt(((x super 2) - y) / z)
• For this relatively simple equation, a linear representation is adequate for reading to a blind user. But with any increase in complexity it becomes apparent that linear representations are no longer useful. (A small sketch of such linearization follows.)
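To make the idea of linearising a two-dimensional expression concrete, here is a small Python sketch that renders a toy expression tree as speakable text with explicit grouping cues ("end fraction", "end root"). The tree encoding and the wording are invented for illustration; real math-to-speech systems use much richer rule sets.

```python
def speak(node):
    """Render a toy expression tree as linear, speakable text."""
    if isinstance(node, (int, str)):
        return str(node)
    op, *args = node
    if op == "=":
        return speak(args[0]) + " equals " + speak(args[1])
    if op == "-":
        return speak(args[0]) + " minus " + speak(args[1])
    if op == "sup":
        return speak(args[0]) + " to the power " + speak(args[1])
    if op == "frac":  # explicit begin/end cues replace the 2-D layout
        return ("the fraction " + speak(args[0]) + " over "
                + speak(args[1]) + ", end fraction")
    if op == "sqrt":  # the spoken "end root" plays the role of the bar
        return "the square root of " + speak(args[0]) + ", end root"
    raise ValueError("unknown operator: " + op)

# a = sqrt((x^2 - y) / z), the "relatively simple equation" above
eq = ("=", "a", ("sqrt", ("frac", ("-", ("sup", "x", 2), "y"), "z")))
print(speak(eq))
# a equals the square root of the fraction x to the power 2 minus y
# over z, end fraction, end root
```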
• Making mathematics accessible to the blind is a challenging and difficult process. The computer and its range of output devices have become the foundation of numerous projects that have brought this goal closer to reality.
• With I/O devices such as high-quality speech, musical tones, refreshable braille, haptic feedback, and high-reliability speech input, new and effective tools will soon be on the market.
• Other research, into direct neural connectivity, will in the future make the picture even brighter.
References
• Special Access Technology by Paul Nisbet & Patrick Poon: http://callcentre.education.ed.ac.uk/About_CALL/Publications_CAA/Books_CAB/SAT_CAC/sat_cac.html
• http://www.callcentrescotland.org or http://callcentre.education.ed.ac.uk/
• [Brown 89] Brown, C., Computer Access in Higher Education for Students with Disabilities, 2nd ed., Monterey, California: US Dept. of Education, 1989.
• [Cook and Hussey 1995] Cook, A.M. and Hussey, S.M., Assistive Technologies: Principles and Practice, Baltimore: Mosby, 1995.
• [Dutoit 96] Dutoit, T., An Introduction to Text-To-Speech Synthesis, Kluwer Academic Publishers, 1996, 326 pp.
• [Redish and Theofanos 2003] Redish, J. and Theofanos, M.F., "Observing Users Who Listen to Web Sites", STC Usability SIG Newsletter, Usability Interface, April 2003 (Vol. 9, No. 4).
Text-To-Speech (TTS) "Voice" Resources
• http://www.microsoft.com/msagent/downloads/user.asp
• http://www.bytecool.com/voices.htm
• http://www.digitalfuturesoft.com/texttospeechproducts.php
• http://www.neospeech.com/product/technologies/tts.php
• http://nextup.com/TextAloud/SpeechEngine/voices.html#morefreevoices
Free Text Readers
• NaturalReader (100-character limit)
• ReadPlease
• WordTalk
• Adobe Reader
• Microsoft Reader
• Bookshelf
Commercial Text Readers
• NaturalReader Professional/Enterprise
• WYNN
• Premier Assistive Technology TextOutloud
• Kurzweil 3000
E-books
• http://www.ebooks.com/
• http://library.netlibrary.com/Home.aspx
• http://www.amazon.com/exec/obidos/tg/browse/-/551440/ref=b_tn_bh_eb/002-1204779-5767200
• http://www.diesel-ebooks.com/cgi-bin/category.cgi
• http://www.ereader.com/
• Or go to Google and type in "e-books".