Text-to-Speech Text-to-Speech Part I Intelligent Robot Lecture Note 1 Text-to-Speech Introduction Intelligent Robot Lecture Note 2 Text-to-Speech Introduction • History ► Long before electronic signal processing was invented, speech researchers tried to build machines to create human speech. ◦ Gerbert of Aurillac (d. 1003 AD) ◦ Albertus Magnus (1198-1280) ◦ Roger Bacon (1214-1294) ► In 1779, the Danish scientist Christian Kratzenstein built models of the human vocal tract that could produce the vowel sounds. ◦ This was followed by the bellows-operated “acoustic-mechanical speech machine” by Wolfgang von Kempelen. ◦ This machine added models of the tongue and lips, enabling it to produce consonants as well as vowels. ◦ In 1837, Charles Wheatstone produced a “speaking machine” based on von Kempelen’s design. ◦ In 1857, M. Faber built the “Euphonia”.
Intelligent Robot Lecture Note 3 Text-to-Speech Introduction • History ► In the 1930s, Bell Labs developed the VOCODER, a keyboard-operated electronic speech analyzer and synthesizer that was said to be clearly intelligible. ► The Pattern Playback was built by Franklin S. Cooper and his colleagues in the late 1940s and completed in 1950. ◦ There were several different versions of this hardware device but only one currently survives. ◦ The machine converts pictures of the acoustic patterns of speech in the form of a spectrogram back into sound. ◦ Using this device, Alvin Liberman and colleagues were able to discover acoustic cues for the perception of phonetic segments (consonants and vowels). 
Intelligent Robot Lecture Note 4 Text-to-Speech Text and Phonetic Analysis Intelligent Robot Lecture Note 5 Text-to-Speech Text and Phonetic Analysis • Modules and Data Flow • Text Normalization • Linguistic Analysis • Homograph Disambiguation • Morphological Analysis • Letter-to-Sound Conversion • Evaluation
Intelligent Robot Lecture Note 6 Text-to-Speech Modules and Data Flow [Figure: raw text or tagged text → Text Analysis (Document Structure Detection, Text Normalization, Linguistic Analysis, consulting a Lexicon) → tagged text → Phonetic Analysis (Homograph Disambiguation, Morphological Analysis, Letter-to-Sound Conversion) → tagged text & phones] Modularized functional blocks for text and phonetic analysis components [Huang et al., 2001]
Intelligent Robot Lecture Note 7 Text-to-Speech Modules and Data Flow • Modules ► Document structure detection ◦ Document structure is important to provide a context for all later processes. In addition, some elements of document structure, such as sentence breaking and paragraph segmentation, may have direct implications for prosody. (not covered here) ► Text normalization ◦ Text normalization is the conversion from the variety of symbols, numbers, and other non-orthographic entities of text into a common orthographic transcription suitable for subsequent phonetic conversion. ► Linguistic analysis ◦ Linguistic analysis recovers the syntactic constituency and semantic features of words, phrases, clauses, and sentences, which is important for both pronunciation and prosodic choices in the successive processes.
Intelligent Robot Lecture Note 8 Text-to-Speech Modules and Data Flow • Modules ► Homograph disambiguation ◦ It is important to disambiguate words with different senses to determine proper phonetic pronunciations, such as object (/ah b jh eh k t/) as a verb or as a noun (/aa b jh eh k t/). ► Morphological analysis ◦ Analyzing the component morphemes provides important cues to attain the pronunciations for inflectional and derivational words. 
► Letter-to-sound conversion ◦ The last stage of the phonetic analysis generally includes general letter-to-sound rules (or modules) and a dictionary lookup to produce accurate pronunciations for any arbitrary word.
Intelligent Robot Lecture Note 9 Text-to-Speech Modules and Data Flow • Data flow [Figure: annotation tiers for the input sentence “A skilled electrician reported.” — an S(yntax/semantics) tier with feature bundles [f1, f2, …, fn] over constituents such as NP and VP; a W(ord) tier W1–W4; a Σ (syllable) tier showing syllable divisions; a C(ontrol) tier with the phonemic symbols of each syllable; an F tier with metrical feet F1–F5; a P(rosody) tier with intonational phrases IP1, IP2; and a U(tterance) tier.] Annotation tiers indicating incremental analysis based on an input (text) sentence “A skilled electrician reported.” Flow of incremental annotation is indicated by arrows on the left side. [Huang et al., 2001]
Intelligent Robot Lecture Note 10 Text-to-Speech Modules and Data Flow • Data flow ► W(ords) → Σ, C(ontrols) ◦ The syllabic structure (Σ) and the basic phonemic form of a word are derived from lexical lookup and/or the application of rules. The Σ tier shows the syllable divisions (written in text form for convenience). The C tier, at this stage, shows the basic phonemic symbols for each word’s syllables. ► W(ords) → S(yntax/semantics) ◦ The word stream from text is used to infer a syntactic and possibly semantic structure (S tier) for an input sentence. Syntactic and semantic structure above the word would include syntactic constituents such as Noun Phrase (NP), Verb Phrase (VP), etc. and any semantic features that can be recovered from the current sentence or analysis of other contexts that may be available (such as an entire paragraph or document). The lower-level phrases such as NP and VP may be grouped into broader constituents such as Sentence (S), depending on the parsing architecture.
Intelligent Robot Lecture Note 11 Text-to-Speech Modules and Data Flow • Data flow ► S(yntax/semantics) → P(rosody) ◦ The P(rosodic) tier is also called the symbolic prosodic module. 
If a word is semantically important in a sentence, that importance can be reflected in speech with a little extra phonetic prominence, called an accent. Some synthesizers begin building a prosodic structure by placing metrical foot boundaries to the left of every accented syllable. The resulting metrical foot structure is shown as F1, F2, etc. Over the metrical foot structure, higher-order prosodic constituents, with their own characteristic relative pitch ranges, boundary pitch movements, etc. can be constructed, shown as intonational phrases IP1, IP2.
Intelligent Robot Lecture Note 12 Text-to-Speech Text Normalization • Abbreviations ► An abbreviation is a shortened form of a word or phrase. ► Since any abbreviation is potentially ambiguous, and there are several distinct types of ambiguity, a system should combine knowledge from a variety of contextual sources such as document structure and origin, in order to resolve abbreviations. ◦ Dr. (doctor or drive) ◦ MD (medicinae doctor or Maryland) ► Types of abbreviations ◦ Acronyms* ◦ Apocope ◦ Clipping ◦ Elision ◦ Syncope ◦ Syllabic abbreviation ◦ Portmanteau
Intelligent Robot Lecture Note 13 Text-to-Speech Text Normalization • Abbreviations ► Acronyms are words created from the first letters of other words. 
► Examples of acronyms ◦ Pronounced as a word – NATO: North Atlantic Treaty Organization – scuba: self-contained underwater breathing apparatus ◦ Pronounced as the names of letters – DNA: deoxyribonucleic acid – LED: light-emitting diode ◦ Pronounced as the names of letters but with a shortcut – IEEE: Institute of Electrical and Electronics Engineers – W3C: World Wide Web Consortium ◦ Pseudo-acronyms – IOU: “I owe you” – CQR: “secure”, a brand of boat anchor
Intelligent Robot Lecture Note 14 Text-to-Speech Text Normalization • Number formats ► Phone numbers: 02-1234-5678, (02) 1234-5678, +82-2-1234-5678 ► Dates: 12/19/94 → December nineteenth ninety four; 04/27/1992 → April twenty seventh nineteen ninety two; May 27, 1995 → May twenty seventh nineteen ninety five; 1,994 → one thousand nine hundred and ninety four; 1994 → nineteen ninety four
Intelligent Robot Lecture Note 15 Text-to-Speech Text Normalization • Number formats ► Times: 11:15 → eleven fifteen; 5:20 am → five twenty am; 12:15:20 → twelve hours fifteen minutes and twenty seconds; 07:55:46 → seven hours fifty-five minutes and forty-six seconds ► Money and currency: $40 → forty dollars; £200 → two hundred pounds; 5¥ → five yen; 300 € → three hundred euros
Intelligent Robot Lecture Note 16 Text-to-Speech Text Normalization • Number formats ► Fractions and ordinal numbers: 1/2 → one half; 1/3 → one third; 1/4 → one quarter or one fourth; 3/10 → three tenths ► Cardinal numbers: 123 → one two three, or one hundred (and) twenty three; 1,234 → one thousand two hundred (and) thirty four; 2426 → two four two six, twenty four twenty six, or two thousand four hundred (and) twenty six
Intelligent Robot Lecture Note 17 Text-to-Speech Text Normalization • Domain-specific tags ► Mathematical expressions (MathML) ◦ (x + 2)^2 <EXPR> <EXPR> x <PLUS/> 2 </EXPR> <POWER/> 2 </EXPR> ► Chemical formulae (CML) ◦ C2OCHO4 <FORMULA> <XVAR BUILTIN=“STOICH”> C C O C O H H H H </XVAR> </FORMULA>
Intelligent Robot Lecture Note 18 Text-to-Speech Text Normalization • Miscellaneous formats ► Approximately or tilde ◦ The 
symbol ~ is spoken as approximately before a numeral or currency amount; otherwise it is the character named tilde. ► Accented Roman characters ◦ A table of folding characters can be provided so that for a term such as Über-mensch, rather than spell out the word Über, or ignore it, the system can convert it to its nearest English-orthography equivalent: Uber. ► High ASCII characters ◦ Ⓒ (copyright), ™ (trademark), @ (at), ® (registered mark) ► Asterisk ◦ “Larry has *never* been here.” ◦ This may be suppressed for asterisks spanning two or more words. ► Emoticons ◦ :-) (smiley face), :-( (frowning face), ;-) (winking smiley face)
Intelligent Robot Lecture Note 19 Text-to-Speech Linguistic Analysis • A minimal set of modular functions or services ► Sentence breaking Mr. Smith came by. He knows that it costs $1.99, but I don’t know when he’ll be back (he didn’t ask, “when should I return?”)... His Web site is www.mrsmithhhhhh.com. The car is 72.5 in. long (we don’t know which parking space he’ll put his car in.) but he said “...and the truth shall set you free,” an interesting quote. ◦ Abbreviation processing ◦ Rules or CART built upon features based on: document structure, whitespace, case conventions, etc. ◦ Statistical frequencies on sentence-initial word likelihood ◦ Statistical frequencies of typical lengths of sentences for various genres ◦ Streaming syntactic/semantic (linguistic) analysis
Intelligent Robot Lecture Note 20 Text-to-Speech Linguistic Analysis • A minimal set of modular functions or services ► POS tagging [Manning et al., 1999] Influential/JJ members/NNS of/IN the/DT House/NNP Ways/NNP and/CC Means/NNP Committee/NNP introduced/VBD legislation/NN that/WDT would/MD restrict/VB how/WRB the/DT new/JJ savings-and-loan/NN bailout/NN agency/NN can/MD raise/VB capital/NN ,/, creating/VBG another/DT potential/JJ obstacle/NN to/TO the/DT government/NN ’s/POS sale/NN of/IN sick/JJ thrifts/NNS ./. 
◦ A system uses the POS labels to decide alternative pronunciations and to assign differing degrees of prosodic prominence. ◦ In addition, the bracketing might assist in deciding where to place pauses for greater intelligibility.
Intelligent Robot Lecture Note 21 Text-to-Speech Linguistic Analysis • A minimal set of modular functions or services ► Homograph disambiguation ◦ Homograph disambiguation in general refers to the case of words with the same orthographic representation (written form) but having different semantic meanings and sometimes even different pronunciations. ► Noun phrase (NP) and clause detection ◦ Basic NP and clause information could be critical for a prosodic generation module to generate segmental durations. It also provides useful clues to introduce necessary pauses for intelligibility and naturalness. Phrase and clause structure are well covered by standard parsing techniques. ► Sentence type identification ◦ Sentence types (declarative, yes-no question, etc.) are critical for macro-level prosody for the sentence.
Intelligent Robot Lecture Note 22 Text-to-Speech Homograph Disambiguation • Homograph variation ► Homograph variation can often be resolved on the basis of POS (grammatical) category. ► Deep semantic and/or discourse analysis would be required to resolve the tense ambiguity. ◦ “If you read the book, he’ll be angry.” • Two special sources of pronunciation ambiguity ► A variation of dialects ◦ Tomato: tom[ey]to, tom[aa]to ◦ Boston natives tend to reduce the /r/ sound in sentences like “Park your car in Harvard yard.” ► Speech rate and formality level ◦ The /g/ sound in recognize may be omitted in faster speech.
Intelligent Robot Lecture Note 23 Text-to-Speech Homograph Disambiguation • Examples of homographs ► Stress homographs: noun with front-stress vowel, verb with end-stress vowel ◦ “an absent boy” vs. “Do you choose to absent yourself?” ► Voicing: noun/verb or adjective/verb distinction made by voicing the final consonant ◦ “They will abuse him.” vs. 
“They won’t take abuse.” ► -ate words: noun/adjective sense uses schwa, verb sense uses a full vowel ◦ “He will graduate.” vs. “He is a graduate.” ► Double stress: front-stressed before noun, end-stressed when final in phrase ◦ “an overnight bag” vs. “Are you staying overnight?” ► -ed adjectives with matching verb past tenses ◦ “He is a learned man.” vs. “He learned to play piano.”
Intelligent Robot Lecture Note 24 Text-to-Speech Homograph Disambiguation • Examples of homographs ► Ambiguous abbreviations ◦ in, am, SAT (Saturday vs. Scholastic Aptitude Test) ► Borrowed words from other languages ◦ They could sometimes be distinguishable based on capitalization. ◦ “El Camino Real road in California” vs. “real world” ◦ “polish shoes” vs. “Polish accent” ► Miscellaneous ◦ “The sewer overflowed.” vs. “A sewer is not a tailor.” ◦ “He moped since his parents refused to buy a moped.” ◦ “Agape is a Greek word.” vs. “His mouth was agape.”
Intelligent Robot Lecture Note 25 Text-to-Speech Morphological Analysis • Morphological analysis ► Decomposition process: re + nation + al + ize (prefix + stem + suffixes) ► When a dictionary does not list a given orthographic form explicitly, it is sometimes possible to analyze the new word in terms of shorter forms already present.
Intelligent Robot Lecture Note 26 Text-to-Speech Morphological Analysis • Prefixes and suffixes ► The prefixes and suffixes are generally considered bound, in the sense that they cannot stand alone but must combine with a stem. ◦ A word such as establishment may be decomposed into a “stem” establish and a suffix -ment. ◦ Should a system attempt to further decompose the stem establish into establ and -ish? – Since a difference that makes no difference is no difference, it is best to be conservative and list only obvious and highly productive affixes. ► Prefix and suffix stripping gives an analysis for many common inflected and some derived words. 
◦ It helps in saving system storage, but it also makes mistakes (from a synchronic point of view: basement is not base + -ment). ◦ Adding irregular morphological formations into a system dictionary is always a desirable solution.
Intelligent Robot Lecture Note 27 Text-to-Speech Letter-to-Sound Conversion • Letter-to-sound conversion [Damper et al., 1998] ► Also known as grapheme-to-phoneme conversion ► For most words encountered in the input of a TTS system, a canonical (or ‘baseform’) pronunciation is easily obtained by dictionary look-up. ► The traditional default strategy uses a set of context-dependent phonological rules written by an expert. ◦ However, the task of manually writing such a set of rules, deciding the rule order so as to resolve conflicts appropriately, maintaining the rules as mispronunciations are discovered, etc., is very considerable and requires an expert with deep knowledge of the specific language. ► More recent attention has focused on the application of automatic techniques based on machine learning from large corpora.
Intelligent Robot Lecture Note 28 Text-to-Speech Letter-to-Sound Conversion • Phonological rules ► Assumption ◦ The pronunciation of a letter or letter substring can be found if enough is known of its context, i.e. the surrounding letters. ► The form of the rules is A[B]C → D, which states that the letter substring B with left context A and right context C receives the pronunciation (i.e. phoneme substring) D. ► Because of the complexities of letter-to-sound correspondence, more than one rule generally applies at each stage of transcription. ◦ The conflicts which arise are resolved by maintaining the rules in a set of sublists, grouped by (initial) letter and with each sublist ordered by specificity. ◦ The most specific rule is usually at the top and the most general at the bottom. ◦ Transcription is usually a one-pass, left-to-right process. 
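The rule mechanism just described can be sketched in a few lines. The rules and phone symbols below are toy examples invented for illustration, not a real English rule set:

```python
# Context-dependent letter-to-sound rules of the form A [ B ] C -> D:
# the letter substring B with left context A and right context C is
# transcribed as phoneme string D. Toy rules only, ordered most specific first.

RULES = [
    # (left context, letters, right context, phonemes); "" matches anywhere.
    ("", "ch", "", ["ch"]),
    ("", "c", "e", ["s"]),      # 'c' before 'e' -> /s/, as in "cell"
    ("", "c", "i", ["s"]),
    ("", "c", "", ["k"]),       # default 'c' -> /k/
    ("", "a", "", ["ae"]),
    ("", "t", "", ["t"]),
]

def transcribe(word):
    """One-pass, left-to-right transcription: at each position apply the
    first (most specific) rule whose letters and contexts match."""
    word = "#" + word + "#"          # explicit word-boundary markers
    pos, phones = 1, []
    while pos < len(word) - 1:
        for left, letters, right, out in RULES:
            end = pos + len(letters)
            if (word[pos:end] == letters
                    and word[:pos].endswith(left)
                    and word[end:].startswith(right)):
                phones.extend(out)
                pos = end
                break
        else:
            pos += 1                 # no rule matched; skip the letter
            # (a real rule set would guarantee coverage with defaults)
    return phones
```

With these toy rules, `transcribe("cat")` yields `["k", "ae", "t"]` while `transcribe("chat")` yields `["ch", "ae", "t"]`, showing how the more specific "ch" rule wins over the default "c" rule.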
Intelligent Robot Lecture Note 29 Text-to-Speech Letter-to-Sound Conversion • Pronunciation by analogy ► Pronunciation by analogy exploits the phonological knowledge implicitly contained in a dictionary of words and their corresponding pronunciations. ► The underlying idea is that a pronunciation for an unknown word is assembled by matching substrings of the input to substrings of known, lexical words, hypothesizing a partial pronunciation for each matched substring from the phonological knowledge, and concatenating the partial pronunciations.
Intelligent Robot Lecture Note 30 Text-to-Speech Letter-to-Sound Conversion • Decision trees [Black et al., 1998] ► A decision tree takes as input a sliding window over the graphemes, with three letters to the left and three letters to the right of the target letter. ► This method is appropriate for discrete features and produces rather compact models, whose size is defined by the total number of questions and leaf nodes in the output tree. ► Disadvantages ◦ It assumes that the decisions are independent of one another, so it cannot use the prediction of the previous phone as the reference to predict the next one. ◦ Another limitation introduced by binary decision trees is that every time a question is asked the training corpus is divided into two parts, and further questions are asked only over the remaining parts of the corpus.
Intelligent Robot Lecture Note 31 Text-to-Speech Letter-to-Sound Conversion • Finite state transducers ► This approach chooses the pronunciation φ that maximizes the probability of a phoneme sequence given the letter sequence g: $\hat{\phi} = \arg\max_{\phi} \{ p(\phi \mid g) \}$ ► This estimation can be done using standard n-gram methods: $p(g, \phi) = \prod_{i=1}^{N} p(g_i, \phi_i \mid g_1^{i-1}, \phi_1^{i-1})$ ◦ where N is the number of letters in the word. 
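A toy instance of this joint factorization, truncated to a bigram over grapheme-phoneme pairs, shows how the argmax over candidate pronunciations falls out. The probability table below is invented purely for illustration:

```python
# Joint bigram over (grapheme, phoneme) pairs: p(pair | previous pair),
# approximating the product formula above. "<s>" is a start-of-word history.
import math

BIGRAM = {
    ("<s>", ("c", "k")): 0.6,
    ("<s>", ("c", "s")): 0.4,
    (("c", "k"), ("a", "ae")): 0.9,
    (("a", "ae"), ("t", "t")): 0.8,
}

def log_prob(pairs):
    """Sum of log p(g_i, phi_i | previous pair) over an aligned sequence."""
    lp, hist = 0.0, "<s>"
    for pair in pairs:
        lp += math.log(BIGRAM.get((hist, pair), 1e-6))  # floor unseen events
        hist = pair
    return lp

def best_pronunciation(candidates):
    """argmax over candidate aligned sequences, as in the FST decoding."""
    return max(candidates, key=log_prob)
```

Here the aligned sequence `[("c","k"), ("a","ae"), ("t","t")]` outscores the /s/-initial alternative because the latter falls off the observed bigram table.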
Intelligent Robot Lecture Note 32 Text-to-Speech Letter-to-Sound Conversion • Finite state transducers ► n-grams can be represented by a finite-state automaton, where a new state is defined for each history h and an arc is created for each new grapheme-phoneme pair (g, φ). ► These arcs are labeled with a grapheme-phoneme pair and weighted with the probability of (g, φ) given the history h. ► To derive the finite state transducer, the labels attached to the automaton edges are split in a way that letters become input and phonemes become output.
Intelligent Robot Lecture Note 33 Text-to-Speech Letter-to-Sound Conversion • Hidden Markov models [Taylor, 2005] ► Each phoneme is represented by one HMM and letters are the emitted observations. ► The probability of transitions between models is equal to the probability of the phoneme given the previous phoneme. ► The objective of this method is to find the most probable sequence of hidden models (phonemes) given the observations (letters), using the probability distributions found during the model training: $\hat{\phi} = \arg\max_{\phi} p(g \mid \phi)\, p(\phi)$ ◦ where p(φ) is the prior probability of a sequence of phonemes occurring, and p(g | φ) is the likelihood of grapheme sequence g given phoneme sequence φ.
Intelligent Robot Lecture Note 34 Text-to-Speech Evaluation • Text and phonetic analysis ► The evaluation is usually feasible, because the input and output of such a module are relatively well defined. ► The evaluation focuses mainly on the symbolic and linguistic levels in contrast to the acoustic level. • Automatic detection of document structures ► The evaluation typically focuses on sentence breaking and sentence type detection. ► Since the definitions of these two types of document structures are straightforward, a standard evaluation database can be easily established. 
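Once such a reference database exists, scoring is mechanical. A minimal sketch for sentence breaking, using precision/recall over predicted boundary positions (a common choice, though the lecture does not prescribe a particular metric):

```python
# Score predicted sentence-break positions against a gold-standard
# segmentation. Positions are, e.g., character offsets where sentences end.

def boundary_scores(predicted, reference):
    """predicted, reference: sets of boundary positions."""
    tp = len(predicted & reference)                    # correctly placed breaks
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    return precision, recall

p, r = boundary_scores({17, 42, 80}, {17, 42, 60})
# two of the three predicted breaks match the reference: p = r = 2/3
```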
• LTS conversion analysis ► An automated test framework for the LTS conversion analysis minimally includes a set of test words and their phonetic transcriptions for automated lookup and comparison tests.
Intelligent Robot Lecture Note 35 Text-to-Speech Reading List • Black et al., 1998. “Issues in building general letter to sound rules”. 3rd ESCA Workshop on Speech Synthesis, Jenolan Caves, Australia, pp. 77-80. • Damper et al., 1998. “A comparison of letter-to-sound conversion techniques for English text-to-speech synthesis”. Proceedings of the Institute of Acoustics 20(6), pp. 245-254. • Huang et al., 2001. Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. New Jersey, Prentice Hall. • Manning and Schütze, 1999. Foundations of Statistical Natural Language Processing. Cambridge, MIT Press. • Taylor, 2005. “Hidden Markov Models for Grapheme to Phoneme Conversion”. Interspeech-2005, Lisbon, pp. 1973-1976.
Intelligent Robot Lecture Note 36 Text-to-Speech Prosody Intelligent Robot Lecture Note 37 Text-to-Speech Prosody • General Prosody • Speaking Style • Symbolic Prosody • Duration Assignment • Pitch Generation • Prosody Markup Languages • Prosody Evaluation
Intelligent Robot Lecture Note 38 Text-to-Speech General Prosody • It is not what you said; it is how you said it! 
► An important supporting role in guiding a listener’s recovery of the basic messages ► A starring role in signaling connotation ► The speaker’s attitude toward the message, toward the listener • Prosody is often defined on two different levels ► An abstract, phonological level ◦ (phrase, accent and tone structure) ► A physical, phonetic level ◦ (fundamental frequency, intensity or amplitude, and duration)
Intelligent Robot Lecture Note 39 Text-to-Speech General Prosody • From the listener’s point of view, prosody consists of systematic perception and recovery of a speaker’s intentions based on: ► Pauses ◦ To indicate phrases and to avoid running out of air ► Pitch ◦ Rate of vocal-fold cycling (fundamental frequency or F0) as a function of time ► Rate/relative duration ◦ Phoneme durations, timing, and rhythm ► Loudness ◦ Relative amplitude/volume
Intelligent Robot Lecture Note 40 Text-to-Speech General Prosody • The analysis of prosody ► Two approaches in the prosody literature ◦ Create an abstract descriptive system which characterizes observations of the behavior of the parameters of prosody within the acoustic signal (fundamental frequency movement, intensity changes and durational movement), and promote the system to a symbolic phonological role. ◦ Create a symbolic phonological system which can be used as input to processes which eventually result in an acoustic signal judged by listeners to have a proper prosody. 
Intelligent Robot Lecture Note 41 Text-to-Speech General Prosody [Figure: how linguists take a sound wave — phonetic analysis strips the random element and the non-random, device-constrained phonetic variations, preserving the generalized stripped information in a static phonetic model; phonological analysis further strips the non-random but non-essential language-specific phonological variations, preserving the generalized stripped information in a static phonological model, down to an abstract underlying representation or abstract markup. Synthesis runs the other way: high-level synthesis reconstructs phonological variations from the abstract markup, and low-level synthesis reconstructs phonetic variations and adds a random element to give the synthesized sound wave.] [Monaghan, 2002]
Intelligent Robot Lecture Note 42 Text-to-Speech General Prosody [Figure: parsed text and phone string → pause insertion and prosodic phrasing → F0 contour, duration, and loudness generation, all conditioned on speaking style → enriched prosodic representation] Block diagram of a prosody generation system
Intelligent Robot Lecture Note 43 Text-to-Speech Speaking Style • Prosody depends not only on the linguistic content of a sentence. • Different people generate different prosody for the same sentence. • Even the same person generates a different prosody depending on his or her mood.
Intelligent Robot Lecture Note 44 Text-to-Speech Speaking Style • Character ► Character, as a determining element in prosody, refers primarily to long-term, stable, extra-linguistic properties of a speaker. ◦ Speaker’s region and economic status ◦ Gender, age, speech defects ◦ Fatigue, inebriation, talking with mouth full ► Many of these elements have implications for both the prosody and voice quality of speech output.
Intelligent Robot Lecture Note 45 Text-to-Speech Speaking Style • Emotion ► Temporary emotional conditions such as amusement, anger, contempt, grief, sympathy, suspicion, etc. have an effect on prosody. ► A few preliminary conclusions from existing research on emotion in speech [Murray et al., 1993]: 
◦ Speakers vary in their ability to express emotive meaning vocally in controlled situations. ◦ Listeners vary in their ability to recognize and interpret emotions from recorded speech. ◦ Some emotions are more readily expressed and identified than others. ◦ Similar intensity of two emotions can lead to confusing one with the other.
Intelligent Robot Lecture Note 46 Text-to-Speech Speaking Style • Some basic emotions that have been studied in speech include: ► Anger, though well studied in the literature, may be too broad a category for coherent analysis. One could imagine that both a threatening kind of anger and a more overtly expressive type of tantrum could be correlated with a wide, raised pitch range. ► Joy is generally correlated with an increase in pitch and pitch range, and with an increase in speech rate. Smiling generally raises F0 and formant frequencies and can be well identified by untrained listeners. ► Sadness generally has normal or lower than normal pitch realized in a narrow range, with a slow rate and tempo. It may also be characterized by slurred pronunciation and irregular rhythm. ► Fear is characterized by high pitch in a wide range, variable rate, precise pronunciation, and irregular voicing (perhaps due to a disturbed respiratory pattern).
Intelligent Robot Lecture Note 47 Text-to-Speech Symbolic Prosody • Abstract or symbolic prosodic structure is the link between the infinite multiplicity of pragmatic, semantic, and syntactic features of an utterance and the relatively limited F0, phone durations, energy, and voice quality. • Symbolic prosody deals with: ► Breaking the sentence into prosodic phrases, possibly separated by pauses ► Assigning labels, such as emphasis, to different syllables or words within each prosodic phrase.
Intelligent Robot Lecture Note 48 Text-to-Speech Symbolic Prosody • Words are normally spoken continuously, unless there are specific linguistic reasons to signal a discontinuity. 
• The term juncture refers to prosodic phrasing: that is, where words cohere, and where prosodic breaks (pauses and/or special pitch movements) occur. • Juncture effects, expressing the degree of cohesion or discontinuity between adjacent words, are determined by physiology (running out of breath), phonetics, syntax, semantics, and pragmatics.
Intelligent Robot Lecture Note 49 Text-to-Speech Symbolic Prosody • The primary phonetic means of signaling juncture are: ► Silence insertion ► Characteristic pitch movements in the phrase-final syllable ► Lengthening of a few phones in the phrase-final syllable ► Irregular voice quality such as vocal fry
Intelligent Robot Lecture Note 50 Text-to-Speech Symbolic Prosody [Figure: parsed text and phone string → Symbolic Prosody (pauses, prosodic phrases, accent, tone, tune) and Prosody Attributes (pitch range, prominence, declination), both conditioned on speaking style → F0 Contour Generation → F0 contour] Pitch generation decomposed into symbolic and phonetic prosody
Intelligent Robot Lecture Note 51 Text-to-Speech Symbolic Prosody • Pauses ► In a long sentence, speakers normally and naturally pause a number of times. ► These pauses have traditionally been thought to correlate with syntactic structure but might more properly be thought of as markers of information structure [Steedman, 2000]. ► They may also be motivated by poorly understood stylistic idiosyncrasies of the speaker, or physical constraints. ► There are many reasonable places to pause in a long sentence, but a few where it is critical not to pause. The goal of a TTS system should be to avoid placing pauses anywhere that might lead to ambiguity, misinterpretation, or complete breakdown of understanding.
Intelligent Robot Lecture Note 52 Text-to-Speech Symbolic Prosody • Pauses ► The CART can be used for pause assignment [Ostendorf et al., 1998]. 
◦ Use POS categories of words, punctuation, and a few structural measures, such as the overall length of a phrase and its length relative to neighboring phrases, to construct the classification tree. ◦ Is this a sentence boundary (marked by punctuation)? ◦ Is the left word a content word and the right word a function word? ◦ What is the function word type of the word to the right? ◦ Is either adjacent word a proper name? ◦ What is the current location in the sentence? ◦ …
Intelligent Robot Lecture Note 53 Text-to-Speech Symbolic Prosody • Prosodic phrases ► An end-of-sentence period may trigger an extreme lowering of pitch, a comma-terminated prosodic phrase may exhibit a small continuation rise at its end, signaling more to come, etc. ► Prosodic junctures that are clearly signaled by silence (and usually by characteristic pitch movement as well), also called intonational phrases, are required between utterances and usually at punctuation boundaries. ► Prosodic junctures that are not signaled by silence but rather by characteristic pitch movement only, also called phonological phrases, may be harder to place with certainty and to evaluate.
Intelligent Robot Lecture Note 54 Text-to-Speech Symbolic Prosody • Prosodic phrases ► To discuss linguistically significant juncture types and pitch movement, it helps to have a simple standard vocabulary. ► ToBI (for Tones and Break Indices) [Silverman, 1992] is a proposed standard for transcribing symbolic intonation of American English utterances, though it can be adapted to other languages as well.
Intelligent Robot Lecture Note 55 Text-to-Speech Symbolic Prosody • Prosodic phrases ► The Break Indices part of ToBI specifies an inventory of numbers expressing the strength of a prosodic juncture. ► The Break Indices are marked for any utterance on their own discrete break index tier (or layer of information), with the BI notations aligned in time with a representation of the speech phonetics and pitch track. 
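The CART questions for pause assignment listed above can be mimicked with a hand-written decision cascade. The real tree in [Ostendorf et al., 1998] is trained from data; the function-word tag set and the rules below are invented for illustration only:

```python
# Toy pause-insertion cascade over the CART-style features: sentence
# boundary, punctuation, and content-word -> function-word transition.

FUNCTION_TAGS = {"DT", "IN", "CC", "TO", "MD", "PRP", "POS"}  # illustrative set

def insert_pause(left_word, left_tag, right_tag, sentence_boundary):
    """Decide whether to place a pause between two adjacent words."""
    if sentence_boundary:                      # "Is this a sentence boundary?"
        return True
    if left_word.endswith((",", ";", ":")):    # punctuation cue
        return True
    # "Is the left word a content word and the right word a function word?"
    # A content-to-function transition often marks a phrase edge.
    if left_tag not in FUNCTION_TAGS and right_tag in FUNCTION_TAGS:
        return True
    return False
```

For instance, on the tagged sentence from the POS-tagging slide, `insert_pause("capital,", "NN", "VBG", False)` fires on the comma, while `insert_pause("the", "DT", "NN", False)` correctly stays silent inside the noun phrase.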
Intelligent Robot Lecture Note 56 Text-to-Speech Symbolic Prosody • Prosodic phrases ► On the break index tier, the prosodic association of words in an utterance is shown by labeling the end of each word for the subjective strength of its association with the next word, on a scale from 0 (strongest perceived conjoining) to 4 (most disjoint). ◦ 0 for cases of clear phonetic marks of clitic groups (pronounced as part of another word, as in ’ve in I’ve) ◦ 1 for most phrase-medial word boundaries ◦ 2 for a strong disjuncture marked by a pause or virtual pause, but with no tonal marks ◦ 3 for an intermediate intonation phrase boundary ◦ 4 for a full intonation phrase boundary ◦ “Did/0 you/1 want/0 an/1 example?/4”
Intelligent Robot Lecture Note 57 Text-to-Speech Symbolic Prosody • Accents ► We should briefly clarify the use of terms such as stress and accent. ► Stress generally refers to an idealized location in an English word that is a potential site for phonetic prominence effects, such as extruded pitch and/or lengthened duration. ◦ “Acme Industries is the biggest employer in the area.” ► Accent is the signaling of semantic salience by phonetic means. ◦ “I didn’t say employer, I said employee.”
Intelligent Robot Lecture Note 58 Text-to-Speech Symbolic Prosody • Tones ► Tones can be understood as labels for perceptually salient levels or movements of F0 on syllables. ► Pitch levels and movements on accented and phrase-boundary syllables can exhibit a bewildering diversity, based on the speaker’s characteristics and the nature of the speech event.
Intelligent Robot Lecture Note 59 Text-to-Speech Symbolic Prosody • Tones ► Chinese, a lexical tone language, is said to have an inventory of 4 lexical tones (5 if neutral tone is included). [Figure: the four Chinese tones]
Intelligent Robot Lecture Note 60 Text-to-Speech Symbolic Prosody • Tones ► A basic set of tonal contrasts has been codified for American English as part of the Tones and Break Indices (ToBI) system [Silverman, 1992]. 
  ► ToBI can be used for annotation of prosodic training data for machine learning, and also for internal modular control of F0 generation in a TTS system.
    ◦ The set specifies 2 abstract levels, H (High) and L (Low), indicating a relatively higher or lower point in a speaker's range.

ToBI pitch accent tones
  ► H*: Peak accent – a tone target on an accented syllable which is in the upper part of the speaker's pitch range.
  ► L*: Low accent – a tone target on an accented syllable which is in the lowest part of the speaker's pitch range.
  ► L*+H: Scooped accent – a low tone target on an accented syllable which is immediately followed by a relatively sharp rise to a peak in the upper part of the speaker's pitch range.
  ► L*+!H: Scooped downstep accent – a low tone target on an accented syllable which is immediately followed by a relatively flat rise to a downstepped peak.
  ► L+H*: Rising peak accent – a high peak target on an accented syllable which is immediately preceded by a relatively sharp rise from a valley in the lowest part of the speaker's pitch range.
  ► !H*: Downstepped high tone – a clear step down onto an accented syllable from a high pitch which itself cannot be accounted for by an H phrasal tone ending the preceding phrase or by a preceding H pitch accent in the same phrase.

ToBI intermediate phrasal tones
  ► L-: Phrase accent, which occurs at an intermediate phrase boundary (break index 3 and above).
  ► H-: Phrase accent, which occurs at an intermediate phrase boundary (break index 3 and above).

ToBI boundary tones
  ► L-L%: For a full intonation phrase with an L phrase accent ending its final intermediate phrase and an L% boundary tone falling to a point low in the speaker's range, as in the standard 'declarative' contour of American English.
  ► L-H%: For a full intonation phrase with an L phrase accent closing the last intermediate phrase, followed by an H boundary tone, as in 'continuation rise'.
  ► H-H%: For an intonation phrase with a final intermediate phrase ending in an H phrase accent and a subsequent H boundary tone, as in the canonical 'yes-no question' contour. Note that the H- phrase accent causes 'upstep' on the following boundary tone, so that the H% after H- rises to a very high value.
  ► H-L%: For an intonation phrase in which the H phrase accent of the final intermediate phrase upsteps the L% to a value in the middle of the speaker's range, producing a final level plateau.
  ► %H: High initial boundary tone; marks a phrase that begins relatively high in the speaker's pitch range when not explained by an initial H* or preceding H%.

[Figure: "Marianna made the marmalade", with an H* accent on Marianna and marmalade, and a final L-L% marking the characteristic sentence-final pitch drop.]

• Prosodic transcription systems
  ► INTSINT
    ◦ INTSINT is a coding system of intonation [Hirst, 1994].
  ► TILT
    ◦ TILT is one of the most interesting models of prosodic annotation.
    ◦ It can represent a curve in both its qualitative (ToBI-like) and quantitative (parameterized) aspects.
  ► K-ToBI
    ◦ Prosodic transcription convention for standard Korean [Jun, 2000].

• INTSINT
  ► INTSINT is a coding system of intonation [Hirst, 1994].
    ◦ It provides a formal encoding of the symbolic or phonologically significant events on a pitch curve.
    ◦ Each target point of the stylized curve is coded by a symbol, either as an absolute tone, scaled globally with respect to the speaker's pitch range, or as a relative tone, defined locally in relation to the neighboring target points.
The INTSINT tone symbols:
  ► Absolute tones (scaled to the speaker's pitch range):
    ◦ T: Top of the speaker's pitch range
    ◦ M: Initial, mid value
    ◦ B: Bottom of the speaker's pitch range
  ► Relative tones (defined with respect to neighboring targets):
    ◦ H: Target higher than both immediate neighbors
    ◦ S: Target not different from the preceding target
    ◦ L: Target lower than both immediate neighbors
    ◦ U: Target in a rising (upstepped) sequence
    ◦ D: Target in a falling (downstepped) sequence

[Figure: INTSINT absolute tones (Top, Mid, Bottom) and relative tones (Higher, Same, Lower, Upstepped, Downstepped).]

• TILT
  ► The automatic parameterization of a pitch event on a syllable is in terms of:
    ◦ Starting F0 value (Hz)
    ◦ Duration
    ◦ Amplitude of rise (A_rise, in Hz)
    ◦ Amplitude of fall (A_fall, in Hz)
    ◦ Starting point, time-aligned with the signal and with the vowel onset
  ► The tone shape, mathematically represented by its tilt, is a value computed directly from the F0 curve by the following formula:

    tilt = (|A_rise| - |A_fall|) / (|A_rise| + |A_fall|)

  ► Label scheme for syllables:
    ◦ sil: Silence
    ◦ c: Connection
    ◦ a: Major pitch accent
    ◦ fb: Falling boundary
    ◦ rb: Rising boundary
    ◦ afb: Accent + falling boundary
    ◦ arb: Accent + rising boundary
    ◦ m: Minor accent
    ◦ mfb: Minor accent + falling boundary
    ◦ mrb: Minor accent + rising boundary
    ◦ l: Level accent
    ◦ lrb: Level accent + rising boundary
    ◦ lfb: Level accent + falling boundary

• K-ToBI
  ► Prosodic transcription convention for standard Korean [Jun, 2000].
  ► Based on the design principles of the original English ToBI and the Japanese ToBI system.
  ► Assumes an intonational phonology with a close relationship to a hierarchical model of prosodic constituents.
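The TILT shape parameter discussed above is a simple ratio of the rise and fall amplitudes, so a pure rise gives +1, a pure fall gives -1, and a symmetric rise-fall gives 0. A minimal numeric sketch:

```python
# TILT shape parameter: tilt = (|A_rise| - |A_fall|) / (|A_rise| + |A_fall|).
# Amplitudes are in Hz, taken from the F0 curve of one pitch event.

def tilt(a_rise, a_fall):
    denom = abs(a_rise) + abs(a_fall)
    if denom == 0:
        return 0.0                     # flat event: no rise and no fall
    return (abs(a_rise) - abs(a_fall)) / denom

print(tilt(30.0, 0.0))    # 1.0  (pure rise)
print(tilt(0.0, 30.0))    # -1.0 (pure fall)
print(tilt(20.0, 20.0))   # 0.0  (symmetric rise-fall)
```

The zero-denominator guard is our own addition for flat events; the formula itself is the one given above.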
• K-ToBI
  [Figure: Intonational structure of Korean. An IP (intonation phrase) dominates one or more APs (accentual phrases), each containing phonological words (W) made of syllables (s), with AP tones T H L H and an IP boundary tone (%).]
    ◦ IP: intonation phrase
    ◦ AP: accentual phrase
    ◦ W: phonological word
    ◦ s: syllable
    ◦ T = H when the syllable-initial segment is aspirated/tense; otherwise T = L
    ◦ %: intonation phrase boundary tone

• K-ToBI
  ► Two prosodic units defined by intonation:
  ► Intonation Phrase (IP)
    ◦ Marked by a boundary tone (%) and phrase-final lengthening
  ► Accentual Phrase (AP)
    ◦ Smaller than an IP but larger than a word, and demarcated by phrasal tones
    ◦ AP phrasal tones
      – LHLH when the phrase has more than 3 syllables
      – LH, LLH, LHH when the phrase has fewer than 4 syllables
    ◦ AP initial tone
      – Changes depending on the laryngeal feature of the phrase-initial segment
      – H when the AP begins with an aspirated or a tense obstruent (e.g., HHLH)
      – Otherwise L

[Figure: K-ToBI transcription of "I hate Younga".]

• Structure of K-ToBI
  ► ToBI (4 tiers): word tier, tone tier, break index tier, miscellaneous tier
  ► K-ToBI (5 tiers): word tier, phonological tone tier, phonetic tone tier, break index tier, miscellaneous tier

• Why does K-ToBI need 2 tone tiers?
  ► Surface tonal variations are neither distinctive nor predictable in Korean prosody.
  ► Rather, what is distinctive is the phrasing and the IP boundary tones.
    ◦ The presence or absence of an AP or IP boundary can change the meaning of an utterance, as with the distinction between wh-questions and yes/no-questions, and the disambiguation of syntactically ambiguous structures.
    ◦ Also, the IP boundary tone delivers semantic as well as pragmatic meaning for an utterance.
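The AP tone rules above (underlying LHLH, with the initial tone conditioned on the laryngeal feature of the phrase-initial segment) can be sketched as a small function. This is a simplification under our own assumptions: for phrases of fewer than 4 syllables the source lists several reduced variants (LH, LLH, LHH), and we simply return the LH default rather than model which variant surfaces.

```python
# Sketch of the K-ToBI AP phrasal-tone rule.  The initial tone T is H for
# an aspirated/tense phrase-initial obstruent and L otherwise; phrases of
# more than 3 syllables take the full (T)HLH pattern.  Returning plain
# (T)H for short phrases is our simplifying assumption.

def ap_tone_pattern(n_syllables, initial_aspirated_or_tense=False):
    initial = "H" if initial_aspirated_or_tense else "L"
    if n_syllables > 3:
        return initial + "HLH"      # e.g. LHLH, or HHLH after aspirated/tense
    return initial + "H"            # short phrase: default reduced pattern

print(ap_tone_pattern(5))                                    # LHLH
print(ap_tone_pattern(5, initial_aspirated_or_tense=True))   # HHLH
print(ap_tone_pattern(2))                                    # LH
```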
• Phonological tone tier
  ► Labels the boundaries of the two prosodic units:
    ◦ A boundary tone (X%) at the end of an IP
      – X% can be one of the 9 IP boundary tones (L%, H%, HL%, LH%, LHL%, HLH%, HLHL%, LHLH%, LHLHL%)
      – We note that it is possible that not all of the IP boundary tones are distinctive (e.g., LHLH% vs. LHLHL%), but until we find further evidence of distinctive meaning, or lack thereof, we will use all of these tones.
    ◦ LHa at the end of an AP
      – Aligned with the end of the AP-final segment determined from the waveform.
    ◦ T%
      – Marks the end of an IP
      – Aligned with the end of the IP-final segment determined from the waveform.

• Phonetic tone tier
  ► IP boundary tones: the same as those in the phonological tone tier.
  ► AP tones: 14 types of surface tonal patterns
    ◦ (LHa, LHHa, LLHa, HLHa, HHa, HLa, LHLa, HHLa, HLLa, LLa, HHLHa, LHLHa, HHLLa, LHLLa)
    ◦ Labeled by 3 AP initial tones (L, H, +H) and 3 AP final tones (La, Ha, L+)
    ◦ These 6 tones are not always realized, but their combinations can represent all 14 tonal types.

[Figures: schematic F0 contours of the 8 IP boundary tones and of the 14 AP tonal patterns.]

• The break index tier
  ► Represents the degree of juncture perceived between each pair of words.
  ► 4 break indices:
    ◦ 3: a strong phrasal disjuncture such as an IP
    ◦ 2: a minimal phrasal disjuncture such as an AP
    ◦ 1: phrase-internal word boundaries
    ◦ 0: a juncture smaller than a word boundary
  ► 3 more break indices for a mismatch between the perceived degree of juncture and the tonal pattern:
    ◦ 3m: used when the juncture is 3 but has an AP tonal pattern
    ◦ 2m: used when the juncture is 2 but has either no AP tonal pattern or the tonal pattern of an IP
    ◦ 1m: used when the juncture is 1 but there is
an AP tonal pattern
  ► A "-" diacritic affixed directly to the right of the higher break index:
    ◦ 1-: uncertainty between 0 and 1
    ◦ 2-: uncertainty between 1 and 2

• Miscellaneous tier
  ► Contains labeler comments concerning events such as silence, audible breathing, laughter, or other disfluencies.
  [Figure: labeled example — "These days, that kind of church, eh, Year 2000, millennium…"]

Reading List
• Allen, J., M.S. Hunnicutt, and D.H. Klatt, From Text to Speech: The MITalk System, 1987, Cambridge, UK: Cambridge University Press.
• Black, A. and A. Hunt, "Generating F0 Contours from ToBI Labels Using Linear Regression," Proc. of the Int. Conf. on Spoken Language Processing, 1996, pp. 1385-1388.
• Fujisaki, H. and H. Sudo, "A Generative Model of the Prosody of Connected Speech in Japanese," Annual Report of Eng. Research Institute, 1971, 30, pp. 75-80.
• Hirst, D.H., "The Symbolic Coding of Fundamental Frequency Curves: From Acoustics to Phonology," Proc. of Int. Symposium on Prosody, 1994, Yokohama, Japan.
• Huang, X., et al., "Whistler: A Trainable Text-to-Speech System," Int. Conf. on Spoken Language Processing, 1996, Philadelphia, PA, pp. 2387-2390.
• Jun, S., K-ToBI (Korean ToBI) Labeling Conventions (version 3.1), http://www.linguistics.ucla.edu/people/jun/ktobi/K-tobi.html, 2000.
• Monaghan, A., "State-of-the-Art Summary of European Synthetic Prosody R&D," Improvements in Speech Synthesis, Chichester: Wiley, 1993, pp. 93-103.
• Murray, I. and J. Arnott, "Toward the Simulation of Emotion in Synthetic Speech: A Review of the Literature on Human Vocal Emotion," Journal of the Acoustical Society of America, 1993, 93(2), pp. 1097-1108.
• Ostendorf, M. and N. Veilleux, "A Hierarchical Stochastic Model for Automatic Prediction of Prosodic Boundary Location," Computational Linguistics, 1994, 20(1), pp. 27-54.
• Plumpe, M. and S. Meredith, "Which is More Important in a Concatenative Text-to-Speech System: Pitch, Duration, or Spectral Discontinuity?," Third ESCA/COCOSDA Int. Workshop on Speech Synthesis, 1998, Jenolan Caves, Australia, pp. 231-235.
• Silverman, K., The Structure and Processing of Fundamental Frequency Contours, Ph.D. Thesis, 1987, University of Cambridge, Cambridge, UK.
• Steedman, M., "Information Structure and the Syntax-Phonology Interface," Linguistic Inquiry, 2000.
• van Santen, J., "Assignment of Segmental Duration in Text-to-Speech Synthesis," Computer Speech and Language, 1994, 8, pp. 95-128.
• W3C, Speech Synthesis Markup Requirements for Voice Markup Languages, 2000, http://www.w3.org/TR/voice-tts-reqs/.