Text to Speech Systems (TTS) EE 516 Spring 2009 Alex Acero
Acknowledgments • Thanks to Dan Jurafsky for a lot of slides • Also thanks to Alan Black, Jennifer Venditti, Richard Sproat
Outline • History of TTS • Architecture • Text Processing • Letter-to-Sound Rules • Prosody • Waveform Generation • Evaluation
Dave Barry on TTS "And computers are getting smarter all the time; scientists tell us that soon they will be able to talk with us. (By 'they', I mean computers; I doubt scientists will ever be able to talk to us.)"
Von Kempelen 1780 • Small whistles controlled consonants • Rubber mouth and nose; nose had to be covered with two fingers for non-nasals • Unvoiced sounds: mouth covered, auxiliary bellows driven by string provides puff of air From Traunmüller's web site
Closer to a natural vocal tract: Riesz 1937
The 1936 UK Speaking Clock From http://web.ukonline.co.uk/freshwater/clocks/spkgclock.htm
The UK Speaking Clock • July 24, 1936 • Photographic storage on 4 glass disks • 2 disks for minutes, 1 for hour, one for seconds. • Other words in sentence distributed across 4 disks, so all 4 used at once. • Voice of "Miss J.
Cain" A technician adjusts the amplifiers of the first speaking clock From http://web.ukonline.co.uk/freshwater/clocks/spkgclock.htm
Homer Dudley's VODER 1939 • Synthesizing speech by electrical means • 1939 World's Fair • Manually controlled through complex keyboard • Operator training was a problem • 1939 vocoder
Cooper's Pattern Playback
Dennis Klatt's history of TTS (1987) • More history at http://www.festvox.org/history/klatt.html (Dennis Klatt)
Outline • History of TTS • Architecture • Text Processing • Letter-to-Sound Rules • Prosody • Waveform Generation • Evaluation
Types of Modern Synthesis • Articulatory Synthesis: – Model movements of articulators and acoustics of vocal tract • Formant Synthesis: – Start with acoustics, create rules/filters to create each formant • Concatenative Synthesis: – Use databases of stored speech to assemble new utterances • HMM-Based Synthesis – Run an HMM in generation mode
Formant Synthesis • These were the most common commercial systems while computers were relatively underpowered. • 1979 MIT MITalk (Allen, Hunnicutt, Klatt) • 1983 DECtalk system • The voice of Stephen Hawking
Concatenative Synthesis • All current commercial systems. • Diphone Synthesis – Units are diphones; middle of one phone to middle of next. – Why? Middle of phone is steady state. – Record 1 speaker saying each diphone • Unit Selection Synthesis – Larger units – Record 10 hours or more, so have multiple copies of each unit – Use search to find best sequence of units
TTS Demos (all are Unit-Selection) • ATT: – http://www.research.att.com/~ttsweb/tts/demo.php • Microsoft – http://research.microsoft.com/en-us/groups/speech/tts.aspx • Festival – http://www-2.cs.cmu.edu/~awb/festival_demos/index.html • Cepstral – http://www.cepstral.com/cgi-bin/demos/general • IBM – http://www-306.ibm.com/software/pervasive/tech/demos/tts.shtml
Text Normalization • Analysis of raw text into pronounceable words • Sample problems: – He stole $100 million from the bank – It's 13 St.
Andrews St. – The home page is http://ee.washington.edu – yes, see you the following tues, that's 11/12/01 • Steps – Identify tokens in text – Chunk tokens into reasonably sized sections – Map tokens to words – Identify types for words
Grapheme to Phoneme • How to pronounce a word? Look in dictionary! But: – Unknown words and names will be missing – Turkish, German, and other hard languages • uygarlaStIramadIklarImIzdanmISsInIzcasIna • "(behaving) as if you are among those whom we could not civilize" • uygar +laS +tIr +ama +dIk +lar +ImIz +dan +mIS +sInIz +casIna = civilized +bec +caus +NegAble +ppart +pl +p1pl +abl +past +2pl +AsIf • So need Letter to Sound Rules • Also homograph disambiguation (wind, live, read)
Prosody: from words+phones to boundaries, accent, F0, duration • Prosodic phrasing – Need to break utterances into phrases – Punctuation is useful, not sufficient • Accents: – Prediction of accents: which syllables should be accented – Realization of F0 contour: given accents/tones, generate F0 contour • Duration: – Predicting duration of each phone
Waveform synthesis: from segments, F0, duration to waveform • Collecting diphones: – need to record diphones in correct contexts • l sounds different in onset than coda, t is flapped sometimes, etc. – need quiet recording room, maybe EGG (electroglottograph), etc. – then need to label them very, very exactly • Unit selection: how to pick the right unit?
Search • Joining the units • dumb (just stick 'em together) • PSOLA (Pitch-Synchronous Overlap and Add) • MBROLA (multi-band overlap and add)
Festival • http://festvox.org/festival/ • Open source speech synthesis system • Multiplatform (Windows/Unix) • Designed for development and runtime use – Use in many academic systems (and some commercial) – Hundreds of thousands of users – Multilingual • No built-in language • Designed to allow addition of new languages – Additional tools for rapid voice development • Statistical learning tools • Scripts for building models
Festival as software • C/C++ code with Scheme scripting language • General replaceable modules: – Lexicons, LTS, duration, intonation, phrasing, POS tagging, tokenizing, diphone/unit selection, signal processing • General tools – Intonation analysis (f0, Tilt), signal processing, CART building, Ngram, SCFG, WFST
CMU FestVox project • Festival is an engine; how do you make voices? • Festvox: building synthetic voices: – Tools, scripts, documentation – Discussion and examples for building voices – Example voice databases – Step by step walkthroughs of processes • Support for English and other languages • Support for different waveform synthesis methods – Diphone – Unit selection – Limited domain
Outline • History of TTS • Architecture • Text Processing • Letter-to-Sound Rules • Prosody • Waveform Generation • Evaluation
Text Processing • He stole $100 million from the bank • It's 13 St. Andrews St. • The home page is http://ee.washington.edu • Yes, see you the following tues, that's 11/12/01 • IV: four, fourth, I.V. • IRA: I.R.A.
or Ira • 1750: seventeen fifty (date, address) or one thousand seven… (dollars)
Steps • Identify tokens in text • Chunk tokens • Identify types of tokens • Convert tokens to words
Step 1: identify tokens and chunk • Whitespace can be viewed as separators • Punctuation can be separated from the raw tokens • Festival converts text into – ordered list of tokens – each with features: • its own preceding whitespace • its own succeeding punctuation
End-of-utterance detection • Relatively simple if utterance ends in ?! • But what about ambiguity of "." • Ambiguous between end-of-utterance and end-of-abbreviation – My place on Forest Ave. is around the corner. – I live at 360 Forest Ave. – (Not "I live at 360 Forest Ave..") • How to solve this period-disambiguation task?
Some rules used • A dot with one or two letters is an abbrev • A dot with 3 cap letters is an abbrev. • An abbrev followed by 2 spaces and a capital letter is an end-of-utterance • Non-abbrevs followed by capitalized word are breaks • This fails for – Cog. Sci. Newsletter – Lots of cases at end of line. – Badly spaced/capitalized sentences
More sophisticated decision tree features • Prob(word with "." occurs at end-of-s) • Prob(word after "." occurs at begin-of-s) • Length of word with "." • Length of word after "." • Case of word with ".": Upper, Lower, Cap, Number • Case of word after ".": Upper, Lower, Cap, Number • Punctuation after "." (if any) • Abbreviation class of word with "." (month name, unit-of-measure, title, address name, etc)
CART • Breiman, Friedman, Olshen, Stone. 1984. Classification and Regression Trees. Chapman & Hall, New York.
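The hand-written period rules above can be sketched as a toy disambiguator. This is a simplified illustration (function names are made up, and the "2 spaces" detail of the abbreviation rule is dropped), not a production sentence splitter:

```python
import re

def is_abbreviation(token):
    # "A dot with one or two letters is an abbrev";
    # "a dot with 3 cap letters is an abbrev."
    return bool(re.fullmatch(r"[A-Za-z]{1,2}\.", token) or
                re.fullmatch(r"[A-Z]{3}\.", token))

def is_sentence_end(token, next_token=None):
    # "?" and "!" endings are the easy case.
    if token.endswith(("?", "!")):
        return True
    if not token.endswith("."):
        return False
    # A non-abbreviation ending in "." counts as an end-of-utterance.
    if not is_abbreviation(token):
        return True
    # An abbreviation followed by a capitalized word may end the utterance.
    return bool(next_token) and next_token[:1].isupper()

print(is_sentence_end("corner.", "I"))   # plain word + "." -> utterance end
print(is_sentence_end("St.", "the"))     # abbrev + lowercase word -> not an end
```

As the slide warns, such rules misfire on cases like "Cog. Sci. Newsletter" or "St. Andrews"; that is what motivates the decision-tree features listed above.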
• Description/Use: – Binary tree of decisions; terminal nodes determine the prediction ("20 questions") – If the dependent variable is categorical, "classification tree" – If continuous, "regression tree"
Learning DTs • DTs are rarely built by hand • Hand-building only possible for very simple features, domains • Lots of algorithms for DT induction • I'll give quick intuition here
CART Estimation • Creating a binary decision tree for classification or regression involves 3 steps: – Splitting Rules: Which split to take at a node? – Stopping Rules: When to declare a node terminal? – Node Assignment: Which class/value to assign to a terminal node?
Splitting Rules • Which split to take at a node? • Candidate splits considered: – Binary cuts: for continuous x (-inf < x < inf), consider splits of the form: • x <= k vs. x > k – Binary partitions: for categorical x in X = {1, 2, …}, consider splits of the form: • x in A vs. x in X-A, for A a subset of X
Splitting Rules • Choosing the best candidate split. – Method 1: Choose k (continuous) or A (categorical) that minimizes estimated classification (regression) error after the split – Method 2 (for classification): Choose k or A that minimizes estimated entropy after that split.
Decision Tree Stopping • When to declare a node terminal? • Strategy (Cost-Complexity pruning): 1. Grow over-large tree 2. Form sequence of subtrees, T0…Tn, ranging from the full tree to just the root node. 3. Estimate "honest" error rate for each subtree. 4. Choose the tree size with minimum "honest" error rate. • To estimate the "honest" error rate, test on data different from the training data (i.e., grow the tree on 9/10 of the data, test on 1/10, repeating 10 times and averaging: cross-validation).
Sproat's EOS tree
Steps 3+4: Identify Types of Tokens, and Convert Tokens to Words • Pronunciation of numbers often depends on type: – 1776 date: seventeen seventy six.
– 1776 phone number: one seven seven six – 1776 quantifier: one thousand seven hundred (and) seventy six – 25 day: twenty-fifth
Festival rule for dealing with "$1.2 million":
(define (token_to_words utt token name)
  (cond
   ((and (string-matches name "\\$[0-9,]+\\(\\.[0-9]+\\)?")
         (string-matches (utt.streamitem.feat utt token "n.name") ".*illion.?"))
    (append
     (builtin_english_token_to_words utt token (string-after name "$"))
     (list (utt.streamitem.feat utt token "n.name"))))
   ((and (string-matches (utt.streamitem.feat utt token "p.name") "\\$[0-9,]+\\(\\.[0-9]+\\)?")
         (string-matches name ".*illion.?"))
    (list "dollars"))
   (t (builtin_english_token_to_words utt token name))))
Rule-based versus machine learning • As always, we can do things either way, or more often by a combination • Rule-based: – Simple – Quick – Can be more robust • Machine Learning – Works for complex problems where rules are hard to write – Higher accuracy in general – But worse generalization to very different test sets • Real TTS and NLP systems – Often use aspects of both.
Machine learning method for Text Normalization • From the 1999 Hopkins summer workshop "Normalization of Non-Standard Words" • Sproat, R., Black, A., Chen, S., Kumar, S., Ostendorf, M., and Richards, C. 2001. Normalization of Non-standard Words, Computer Speech and Language, 15(3):287-333 • NSW examples: – Numbers: • 123, 12 March 1994 – Abbreviations, contractions, acronyms: • approx., mph, ctrl-C, US, pp, lb – Punctuation conventions: • 3-4, +/-, and/or – Dates, times, urls, etc
How common are NSWs? • Varies over text type • Word not in lexicon, or with non-alphabetic characters:
Text Type    % NSW
novels       1.5%
press wire   4.9%
e-mail       10.7%
recipes      13.7%
classified   17.9%
How hard are NSWs?
• Identification: – Some homographs "Wed", "PA" – False positives: OOV • Realization: – Simple rule: money, $2.34 – Type identification+rules: numbers – Text type specific knowledge (in classified ads, BR for bedroom) • Ambiguity (acceptable multiple answers) – "D.C." as letters or full words – "MB" as "meg" or "megabyte" – 250
Step 1: Splitter • Letter/number conjunctions (WinNT, SunOS, PC110) • Hand-written rules in two parts: – Part I: group things not to be split (numbers, etc; including commas in numbers, slashes in dates) – Part II: apply rules: • At transitions from lower to upper case • After penultimate upper-case char in transitions from upper to lower • At transitions from digits to alpha • At punctuation
Step 2: Classify token into 1 of 20 types • EXPN: abbrev, contractions (adv, N.Y., mph, gov't) • LSEQ: letter sequence (CIA, D.C., CDs) • ASWD: read as word, e.g. CAT, proper names • MSPL: misspelling • NUM: number (cardinal) (12, 45, 1/2, 0.6) • NORD: number (ordinal) e.g. May 7, 3rd, Bill Gates II • NTEL: telephone (or part) e.g. 212-555-4523 • NDIG: number as digits e.g. Room 101 • NIDE: identifier, e.g. 747, 386, I5, PC110 • NADDR: number as street address, e.g.
5000 Pennsylvania • NZIP, NTIME, NDATE, NYER, MONEY, BMONY, PRCT, URL, etc • SLNT: not spoken (KENT*REALTY)
More about the types • 4 categories for alphabetic sequences: – EXPN: expand to full word or word seq (fplc for fireplace, NY for New York) – LSEQ: say as letter sequence (IBM) – ASWD: say as standard word (either OOV or acronyms) • 5 main ways to read numbers: – Cardinal (quantities) – Ordinal (dates) – String of digits (phone numbers) – Pair of digits (years) – Trailing unit: serial until last non-zero digit: 8765000 is "eight seven six five thousand" (some phone numbers, long addresses) – But still exceptions: (947-3030, 830-7056)
Type identification algorithm • Create large hand-labeled training set and build a DT to predict type • Example of features in tree for subclassifier for alphabetic tokens: – $P(t \mid o) = p(o \mid t)\,p(t)/p(o)$ – $p(o \mid t)$, for $t$ in ASWD, LSEQ, EXPN, from a trigram letter model: $p(o \mid t) = \prod_{i=1}^{N} p(l_i \mid l_{i-1}, l_{i-2})$ – $p(t)$ from counts of each tag in text – $p(o)$ normalization factor
Type identification algorithm • Hand-written context-dependent rules: – List of lexical items (Act, Advantage, amendment) after which Roman numbers are read as cardinals, not ordinals • Classifier accuracy: – 98.1% in news data – 91.8% in email
Step 3: expanding NSW Tokens • Type-specific heuristics – ASWD expands to itself – LSEQ expands to list of words, one for each letter – NUM expands to string of words representing cardinal – NYER expands to 2 pairs of NUM digits… – NTEL: string of digits with silence for punctuation – Abbreviation: • use abbrev lexicon if it's one we've seen • Else use training set to know how to expand • Cute idea: if "eat in kit" occurs in text, "eat-in kitchen" will also occur somewhere.
4 steps to Sproat et al.
algorithm 1) Splitter (on whitespace or also within word ("AltaVista")) 2) Type identifier: for each split token identify type 3) Token expander: for each typed token, expand to words • Deterministic for number, date, money, letter sequence • Only hard (nondeterministic) for abbreviations 4) Language Model: to select between alternative pronunciations
Homograph disambiguation • Most frequent homographs, from Liberman and Church • Not a huge problem, but still important: record 195, house 150, contract 143, lead 131, live 130, lives 105, protest 94, survey 91, project 90, separate 87, present 80, read 72, subject 68, rebel 48, finance 46, estimate 46
POS Tagging for homograph disambiguation • Many homographs can be distinguished by POS: live l ay v vs. l ih v; REcord vs. reCORD; INsult vs. inSULT; OBject vs. obJECT; OVERflow vs. overFLOW; DIScount vs. disCOUNT; CONtent vs. conTENT
Part of speech tagging • 8 (ish) traditional parts of speech – This idea has been around for over 2000 years (Dionysius Thrax of Alexandria, c. 100 B.C.) – Called: parts-of-speech, lexical category, word classes, morphological classes, lexical tags, POS – We'll use POS most frequently
POS examples
N (noun): chair, bandwidth, pacing
V (verb): study, debate, munch
ADJ (adjective): purple, tall, ridiculous
ADV (adverb): unfortunately, slowly
P (preposition): of, by, to
PRO (pronoun): I, me, mine
DET (determiner): the, a, that, those
POS Tagging: Definition • The process of assigning a part-of-speech or lexical class marker to each word in a corpus: WORDS: the koala put the keys on the table; TAGS: N, V, P, DET
POS Tagging example WORD/TAG: the/DET child/N put/V the/DET keys/N on/P the/DET table/N
Open and closed class words • Closed class: a relatively fixed membership – Prepositions: of, in, by, … – Auxiliaries: may, can, will, had, been, … – Pronouns: I, you, she, mine, his, them, … – Usually function words (short common words which play a role in grammar) • Open class: new ones can be created all the time – English has 4: Nouns, Verbs, Adjectives, Adverbs – Many languages
have all 4, but not all! – In Lakhota and possibly Chinese, what English treats as adjectives act more like verbs.
Open class words • Nouns – Proper nouns (Seattle University, Boulder, Neal Snider, William Gates Hall). English capitalizes these. – Common nouns (the rest). German capitalizes these. – Count nouns and mass nouns • Count: have plurals, get counted: goat/goats, one goat, two goats • Mass: don't get counted (snow, salt, communism) (*two snows) • Adverbs: tend to modify things – Unfortunately, John walked home extremely slowly yesterday – Directional/locative adverbs (here, home, downhill) – Degree adverbs (extremely, very, somewhat) – Manner adverbs (slowly, delicately) • Verbs: – In English, have morphological affixes (eat/eats/eaten)
Closed Class Words • Idiosyncratic • Examples: – prepositions: on, under, over, … – particles: up, down, on, off, … – determiners: a, an, the, … – pronouns: she, who, I, … – conjunctions: and, but, or, … – auxiliary verbs: can, may, should, … – numerals: one, two, three, third, …
POS tagging: Choosing a tagset • There are so many parts of speech, potential distinctions we can draw • To do POS tagging, need to choose a standard set of tags to work with • Could pick very coarse tagsets – N, V, Adj, Adv. • More commonly used set is finer grained, the "UPenn TreeBank tagset", 45 tags – PRP$, WRB, WP$, VBG • Even more fine-grained tagsets exist
Using the UPenn tagset • The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./. • Prepositions and subordinating conjunctions marked IN ("although/IN I/PRP..") • Except the preposition/complementizer "to" is just marked "to".
POS Tagging • Words often have more than one POS: back – The back door = JJ – On my back = NN – Win the voters back = RB – Promised to back the bill = VB • The POS tagging problem is to determine the POS tag for a particular instance of a word.
How hard is POS tagging?
Measuring ambiguity (word types in the Brown corpus):
Unambiguous (1 tag): 38,857
Ambiguous (2-7 tags): 8,844
2 tags: 6,731; 3 tags: 1,621; 4 tags: 357; 5 tags: 90; 6 tags: 32; 7 tags: 6 (well, set, round, open, fit, down); 8 tags: 4 ('s, half, back, a); 9 tags: 3 (that, more, in)
3 methods for POS tagging 1. Rule-based tagging – (ENGTWOL) 2. Stochastic (= probabilistic) tagging – HMM (Hidden Markov Model) tagging 3. Transformation-based tagging – Brill tagger
Rule-based tagging • Start with a dictionary • Assign all possible tags to words from the dictionary • Write rules by hand to selectively remove tags • Leaving the correct tag for each word
Start with a dictionary • she: PRP • promised: VBN, VBD • to: TO • back: VB, JJ, RB, NN • the: DT • bill: NN, VB • Etc… for the ~100,000 words of English
Use the dictionary to assign every possible tag: She/PRP promised/{VBN,VBD} to/TO back/{VB,JJ,RB,NN} the/DT bill/{NN,VB}
Write rules to eliminate tags: Eliminate VBN if VBD is an option when VBN|VBD follows "<start> PRP", leaving She/PRP promised/VBD to/TO back/{NN,RB,JJ,VB} the/DT bill/{NN,VB}
Stochastic Tagging • Intuition: assign each word the "most probable" tag • Simplest way to define "most probable" – Collect a training corpus – Choose the tag which is most frequent for that word in the training corpus – I.e., choose the tag such that p(tag|word) is high – Of all the times that "use" occurred in a training corpus, what percentage was it V, what percentage N? Choose the higher probability tag. • Does it work? – Achieves: 90%! But we can do better: – How? Context: "to use": use is V; "the use of": use is N
HMM Tagger • Intuition: Pick the most probable tag sequence for a series of words: $\hat{t}_1^n = \arg\max_{t_1^n} P(t_1^n \mid w_1^n)$ • But how to make the right-hand side operational?
• Use Bayes' rule: $P(x \mid y) = \frac{P(y \mid x)\,P(x)}{P(y)}$ • Substituting: $\hat{t}_1^n = \arg\max_{t_1^n} \frac{P(w_1^n \mid t_1^n)\,P(t_1^n)}{P(w_1^n)}$
HMM Tagger: fundamental equations • Since the word sequence is constant: $\hat{t}_1^n = \arg\max_{t_1^n} P(w_1^n \mid t_1^n)\,P(t_1^n)$ (likelihood times prior) • Still too hard to compute directly
HMM Tagger: Two simplifying assumptions • Prob of word independent of other words and their tags: $P(w_1^n \mid t_1^n) \approx \prod_{i=1}^{n} P(w_i \mid t_i)$ • Prob of tag is only dependent on previous tag: $P(t_1^n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})$ • Combining: $\hat{t}_1^n = \arg\max_{t_1^n} P(t_1^n \mid w_1^n) \approx \arg\max_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\,P(t_i \mid t_{i-1})$
Estimating these probabilities • Determiners precede nouns in English, so expect P(NN|DT) to be high: $P(t_i \mid t_{i-1}) = \frac{C(t_{i-1}, t_i)}{C(t_{i-1})}$, $P(w_i \mid t_i) = \frac{C(t_i, w_i)}{C(t_i)}$ • In the tagged 1-million word Brown corpus: • P(NN|DT) = C(DT,NN)/C(DT) = 56509/116454 = .49 • P(is|VBZ) = C(VBZ,is)/C(VBZ) = 10073/21627 = .47
An example • Secretariat/NNP is/BEZ expected/VBN to/TO race/VB tomorrow/NR • People/NNS continue/VB to/TO inquire/VB the/AT reason/NN for/IN the/AT race/NN for/IN outer/JJ space/NN
An example of two tag sequences [figure] Picture of HMM [figure]
Viterbi Algorithm [trellis figure: states S1-S5 with candidate tags {JJ, DT, VB, NNP, NN, RB, VBN, TO, VBD} over "promised to back the bill"]
Evaluation • The result is compared with a manually coded "Gold Standard" – Typically accuracy reaches 96-97% – This may be compared with the result for a baseline tagger (one that uses no context).
• Important: 100% is impossible even for human annotators
Outline • History of TTS • Architecture • Text Processing • Letter-to-Sound Rules • Prosody • Waveform Generation • Evaluation
Lexicons and Lexical Entries • You can explicitly give pronunciations for words – Each language/dialect has its own lexicon – You can lookup words with • (lex.lookup WORD) – You can add entries to the current lexicon • (lex.add.entry NEWENTRY) – Entry: (WORD POS (SYL0 SYL1…)) – Syllable: ((PHONE0 PHONE1 …) STRESS) – Example: ("cepstra" n (((k eh p) 1) ((s t r aa) 0)))
Converting from words to phones • Two methods: – Dictionary-based – Rule-based (Letter-to-sound = LTS) • Early systems: all LTS • MITalk was radical in having a huge 10K word dictionary • Now systems use a combination • CMU dictionary: 127K words – http://www.speech.cs.cmu.edu/cgi-bin/cmudict
Dictionaries aren't always sufficient • Unknown words – Seem to be linear with number of words in unseen text – Mostly person, company, product names – But also foreign words, etc. • So commercial systems have a 3-part system: – Big dictionary – Special code for handling names – Machine learned LTS system for other unknown words
Letter-to-Sound Rules • Festival LTS rules: • (LEFTCONTEXT [ ITEMS ] RIGHTCONTEXT = NEWITEMS) • Example: – ( # [ c h ] C = k ) – ( # [ c h ] = ch ) • # denotes beginning of word • C means all consonants • Rules apply in order – "christmas" pronounced with [k] – But word with ch followed by non-consonant pronounced [ch] • E.g., "choice"
Stress rules in LTS • English famously evil: one from Allen et al 1987 • V -> [1-stress] / X _ C* {Vshort C C? | V} {Vshort C* | V} • Where X must contain all prefixes: • Assign 1-stress to the vowel in a syllable preceding a weak syllable followed by a morpheme-final syllable containing a short vowel and 0 or more consonants (e.g. difficult) • Assign 1-stress to the vowel in a syllable preceding a weak syllable followed by a morpheme-final vowel (e.g.
oregano) • etc
Modern method: Learning LTS rules automatically • Induce LTS from a dictionary of the language • Black et al. 1998 • Applied to English, German, French • Two steps: alignment and (CART-based) rule-induction
Alignment • Letters: c h e c k e d • Phones: ch _ eh _ k _ t • Black et al Method 1: – First scatter epsilons in all possible ways to cause letters and phones to align – Then collect stats for P(letter|phone) and select the best to generate new stats – This is iterated a number of times until it settles (5-6 iterations) – This is the EM (expectation maximization) algorithm
Alignment • Black et al method 2 • Hand specify which letters can be rendered as which phones – C goes to k/ch/s/sh – W goes to w/v/f, etc • Once the mapping table is created, find all valid alignments, find p(letter|phone), score all alignments, take the best
Alignment • Some alignments will turn out to be really bad. • These are just the cases where the pronunciation doesn't match the letters: – Dept d ih p aa r t m ah n t – CMU s iy eh m y uw – Lieutenant l eh f t eh n ax n t (British) • Also foreign words • These can just be removed from alignment training
Building CART trees • Build a CART tree for each letter in the alphabet (26 plus accented) using a context of ±3 letters • # # # c h e c -> ch • c h e c k e d -> _ • This produces 92-96% correct LETTER accuracy (58-75% word accuracy) for English
Improvements • Take names out of the training data • And acronyms • Detect both of these separately • And build special-purpose tools to do LTS for names and acronyms
Names • Big problem area is names • Names are common – 20% of tokens in typical newswire text will be names – 1987 Donnelly list (72 million households) contains about 1.5 million names – Personal names: McArthur, D'Angelo, Jimenez, Rajan, Raghavan, Sondhi, Xu, Hsu, Zhang, Chang, Nguyen – Company/Brand names: Infinit, Kmart, Cytyc, Medamicus, Inforte, Aaon, Idexx Labs, Bebe
Names • Methods: – Can do morphology (Walters -> Walter, Lucasville) – Can write stress-shifting rules
(Jordan -> Jordanian) – Rhyme analogy: Plotsky by analogy with Trotsky (replace tr with pl) – Liberman and Church: for 250K most common names, got 212K (85%) from these modified-dictionary methods, used LTS for rest. – Can do automatic country detection (from letter trigrams) and then do country-specific rules
Outline • History of TTS • Architecture • Text Processing • Letter-to-Sound Rules • Prosody • Waveform Generation • Evaluation
Defining Intonation • Ladd (1996) "Intonational phonology" • "The use of suprasegmental phonetic features (suprasegmental = above and beyond the segment/phone) – F0 – Intensity (energy) – Duration – to convey sentence-level pragmatic meanings" – i.e., meanings that apply to phrases or utterances as a whole, not lexical stress, not lexical tone.
Three aspects of prosody • Prominence: some syllables/words are more prominent than others • Structure/boundaries: sentences have prosodic structure – Some words group naturally together – Others have a noticeable break or disjuncture between them • Tune: the intonational melody of an utterance.
Prosodic Prominence: Pitch Accents A: What types of foods are a good source of vitamins? B1: Legumes are a good source of VITAMINS. B2: LEGUMES are a good source of vitamins. • Prominent syllables are: • Louder • Longer • Have higher F0 and/or sharper changes in F0 (higher F0 velocity) Slides from Jennifer Venditti
Prosodic Boundaries French [bread and cheese] vs. [French bread] and [cheese]
Prosodic Tunes • Legumes are a good source of vitamins. • Are legumes a good source of vitamins?
TOPIC #1 Thinking about F0
Graphic representation of F0 [F0 contour plot, F0 (in Hertz) vs. time: "legumes are a good source of VITAMINS"]
The 'ripples' [F0 contour with gaps at [s], [t], [s]: "legumes are a good source of VITAMINS"] F0 is not defined for consonants without vocal fold vibration.
The 'ripples' [F0 contour with perturbations at [g], [z], [g], [v]: "legumes are a good source of VITAMINS"] ...
and F0 can be perturbed by consonants with an extreme constriction in the vocal tract.
Abstraction of the F0 contour [smoothed F0 contour: "legumes are a good source of VITAMINS"] Our perception of the intonation contour abstracts away from these perturbations.
The 'waves' and the 'swells' [F0 contour with 'wave' = accent and 'swell' = phrase: "legumes are a good source of VITAMINS"]
TOPIC #2 Accent Placement and Intonational Tunes
Stress vs. accent • Stress is a structural property of a word — it marks a potential (arbitrary) location for an accent to occur, if there is one. • Accent is a property of a word in context — it is a way to mark intonational prominence in order to 'highlight' important words in the discourse. [metrical grids for the syllables of "vi ta mins" and "Ca li for nia", with rows marking syllables, full vowels, the stressed syllable, and the (accented) syllable]
Stress vs. accent (2) • The speaker decides to make the word vitamin more prominent by accenting it. • Lexical stress tells us that this prominence will appear on the first syllable, hence VItamin.
Which word receives an accent? • It depends on the context. For example, the 'new' information in the answer to a question is often accented, while the 'old' information usually is not. – Q1: What types of foods are a good source of vitamins? – A1: LEGUMES are a good source of vitamins. – Q2: Are legumes a source of vitamins? – A2: Legumes are a GOOD source of vitamins. – Q3: I've heard that legumes are healthy, but what are they a good source of? – A3: Legumes are a good source of VITAMINS.
Same 'tune', different alignment [F0 contour: "LEGUMES are a good source of vitamins"] The main rise-fall accent (= "I assert this") shifts locations.
Same 'tune', different alignment [F0 contour: "Legumes are a GOOD source of vitamins"] The main rise-fall accent (= "I assert this") shifts locations.
Same 'tune', different alignment [F0 contour: "legumes are a good source of VITAMINS"] The main rise-fall accent (= "I assert this") shifts locations.
Broad focus [F0 contour: "legumes are a good source of vitamins"] In the absence of narrow focus, English tends to mark the first and last 'content' words with perceptually prominent accents.
Yes-No question tune [F0 contour: "are LEGUMES a good source of vitamins"] Rise from the main accent to the end of the sentence.
Yes-No question tune [F0 contour: "are legumes a GOOD source of vitamins"] Rise from the main accent to the end of the sentence.
Yes-No question tune [F0 contour: "are legumes a good source of VITAMINS"] Rise from the main accent to the end of the sentence.
WH-questions [I know that many natural foods are healthy, but ...] [F0 contour: "WHAT are a good source of vitamins"] WH-questions typically have falling contours, like statements.
Rising statements [F0 contour: "legumes are a good source of vitamins"] [... does this statement qualify?] High-rising statements can signal that the speaker is seeking approval.
'Surprise-redundancy' tune [How many times do I have to tell you ...] [F0 contour: "legumes are a good source of vitamins"] Low beginning followed by a gradual rise to a high at the end.
'Contradiction' tune "I've heard that linguini is a good source of vitamins." [F0 contour: "linguini isn't a good source of vitamins"] [... how could you think that?] Sharp fall at the beginning, flat and low, then rising at the end.
TOPIC #3 Intonational phrasing and disambiguation
A single intonation phrase [F0 contour: "legumes are a good source of vitamins"] Broad focus statement consisting of one intonation phrase (that is, one intonation tune spans the whole unit).
Multiple phrases [F0 contour: "legumes are a good source of vitamins"] Utterances can be 'chunked' up into smaller phrases in order to signal the importance of information in each unit.
Phrasing can disambiguate • Global ambiguity: Sally saw % the man with the binoculars. Sally saw the man % with the binoculars.
Phrasing can disambiguate • Temporary ambiguity: When Madonna sings the song ...
Phrasing can disambiguate • Temporary ambiguity: When Madonna sings the song is a hit.
Phrasing can disambiguate • Temporary ambiguity: When Madonna sings % the song is a hit. When Madonna sings the song % it's a hit. [from Speer & Kjelgaard (1992)]
Phrasing can disambiguate [F0 contour: "I met Mary and Elena's mother at the mall yesterday", read as (Mary & Elena)'s mother] One intonation phrase with relatively flat overall pitch range.
Phrasing can disambiguate [F0 contour: "I met Mary and Elena's mother at the mall yesterday", read as Mary, and Elena's mother] Separate phrases, with expanded pitch movements.
TOPIC #4 The TOBI Intonational Transcription Theory
ToBI: Tones and Break Indices • Pitch accent tones – H* "peak accent" – L* "low accent" – L+H* "rising peak accent" (contrastive) – L*+H "scooped accent" – H+!H* downstepped high • Boundary tones – L-L% (final low; Am. Eng. declarative contour) – L-H% (continuation rise) – H-H% (yes-no question) • Break indices – 0: clitics – 1: word boundaries – 2: short pause – 3: intermediate intonation phrase – 4: full intonation phrase/final boundary.
Examples of the TOBI system • I don't eat beef. L* L* L*L-L% • Marianna made the marmalade. H* L-L% L* H-H% • "I" means insert. H* H* H*L-L% 1 H*LH*L-L% 3
ToBI • http://www.ling.ohio-state.edu/~tobi/ • TOBI for American English – http://www.ling.ohio-state.edu/~tobi/ame_tobi/ • Silverman, K., Beckman, M., Pitrelli, J., Ostendorf, M., Wightman, C., Price, P., Pierrehumbert, J., and Hirschberg, J. (1992). ToBI: a standard for labelling English prosody.
In Proceedings of ICSLP 1992, volume 2, pages 867-870.
• Pitrelli, J. F., Beckman, M. E., and Hirschberg, J. (1994). Evaluation of prosodic transcription labeling reliability in the ToBI framework. In ICSLP 1994, volume 1, pages 123-126.
• Pierrehumbert, J., and Hirschberg, J. (1990). The meaning of intonation contours in the interpretation of discourse. In P. R. Cohen, J. Morgan, and M. E. Pollack, eds., Intentions in Communication, 271-311. MIT Press.
• Beckman and Elam. Guidelines for ToBI Labelling. Web.

TOPIC #5 PRODUCING INTONATION IN TTS

Intonation in TTS
1) Accent: decide which words are accented, which syllable has the accent, and what sort of accent
2) Boundaries: decide where the intonational boundaries are
3) Duration: specify the length of each segment
4) F0: generate the F0 contour from these

TOPIC #5a Predicting pitch accent

Factors in accent prediction
• Contrast
  – Legumes are a poor source of VITAMINS
  – No, legumes are a GOOD source of vitamins
  – I think JOHN and MARY should go
  – No, I think JOHN AND MARY should go

But it’s more than just contrast
• List intonation:
  I went and saw ANNA, LENNY, MARY, and NORA.

In fact, accents are common!
• A Broadcast News example from Hirschberg (1993):
• SUN MICROSYSTEMS INC, the UPSTART COMPANY that HELPED LAUNCH the DESKTOP COMPUTER industry TREND TOWARD HIGH powered WORKSTATIONS, was UNVEILING an ENTIRE OVERHAUL of its PRODUCT LINE TODAY. SOME of the new MACHINES, PRICED from FIVE THOUSAND NINE hundred NINETY five DOLLARS to seventy THREE thousand nine HUNDRED dollars, BOAST SOPHISTICATED new graphics and DIGITAL SOUND TECHNOLOGIES, HIGHER SPEEDS AND a CIRCUIT board that allows FULL motion VIDEO on a COMPUTER SCREEN.
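The four intonation steps listed here (accent, boundaries, duration, F0) can be sketched as a skeletal pipeline. All names and the tiny heuristics below are illustrative stand-ins, not Festival’s actual API:

```python
# Sketch of the four intonation steps: accent -> boundaries -> duration -> F0.
# Function names and heuristics are hypothetical, for illustration only.

FUNCTION_WORDS = {"a", "an", "the", "of", "to", "and", "but", "or", "that",
                  "is", "are", "was", "were", "in", "on", "for"}

def predict_accents(words):
    # Step 1: crude heuristic -- content words get a default H* accent.
    return ["H*" if w.lower() not in FUNCTION_WORDS else None for w in words]

def predict_boundaries(words):
    # Step 2: trivially, a full intonation-phrase boundary (break index 4)
    # only at the end of the utterance; word boundaries (1) elsewhere.
    return [4 if i == len(words) - 1 else 1 for i in range(len(words))]

def predict_durations(words):
    # Step 3: fixed 100 ms per phone, with letters standing in for phones.
    return [0.100 * len(w) for w in words]

def generate_f0(accents, base=130.0, decline=5.0, bump=40.0):
    # Step 4: declination (gradual fall) plus a bump on accented words.
    return [base - decline * i + (bump if a else 0.0)
            for i, a in enumerate(accents)]

words = "legumes are a good source of vitamins".split()
accents = predict_accents(words)
f0 = generate_f0(accents)
```

A real system replaces each stub with a trained model (CART, CRF, etc.), but the data flow is the same.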
Factors in accent prediction
• Part of speech:
  – Content words are usually accented
  – Function words are rarely accented
  – of, for, in, on, that, the, a, an, no, to, and, but, or, will, may, would, can, her, is, their, its, our, there, am, are, was, were, etc.

Factors in accent prediction
• Word order: preposed items are accented more frequently
  – TODAY we will BEGIN to LOOK at FROG anatomy.
  – We will BEGIN to LOOK at FROG anatomy today.

Factors in accent prediction
• Information status: new versus old information. Old information is deaccented:
  – There are LAWYERS, and there are GOOD lawyers

Complex Noun Phrase Structure
• Sproat, R. 1994. English noun-phrase accent prediction for text-to-speech. Computer Speech and Language 8:79-94.
• Proper names: stress on the right-most word
  – New York CITY; Paris, FRANCE
• Adjective-noun combinations: stress on the noun
  – large HOUSE, red PEN, new NOTEBOOK
• Noun-noun compounds: stress the left noun
  – HOT dog (food) versus hot DOG (overheated animal)
  – WHITE house (place) versus white HOUSE (made of stucco)
• Examples:
  – Madison AVENUE, PARK street, MEDICAL building
  – APPLE cake, cherry PIE
• Some rules:
  – Furniture + Room -> RIGHT (e.g., kitchen TABLE)
  – Proper-name + Street -> LEFT (e.g., PARK street)

Other features
• POS; POS of previous word; POS of next word
• Stress of current, previous, and next syllable
• Unigram probability of the word; bigram probability of the word
• Position of the word in the sentence

State of the art
• Hand-label large training sets
• Use CART, SVM, CRF, etc., to predict accent
• Lots of rich features from context
• Classic lit: Hirschberg, Julia. 1993. Pitch Accent in context: predicting intonational prominence from text. Artificial Intelligence 63, 305-340.

TOPIC #5b Predicting boundaries

Predicting Boundaries
• Intonation phrase boundaries
  – Intermediate phrase boundaries
  – Full intonation phrase boundaries

More examples
• From Ostendorf and Veilleux.
1994. “Hierarchical Stochastic Model for Automatic Prediction of Prosodic Boundary Location”, Computational Linguistics 20(1).
• Computer phone calls, || which do everything | from selling magazine subscriptions || to reminding people about meetings || have become the telephone equivalent | of junk mail. ||
• Doctor Norman Rosenblatt, || dean of the college | of criminal justice at Northeastern University, || agrees. ||
• For WBUR, || I’m Margo Melnicove.

Ostendorf and Veilleux CART [decision-tree figure]

TOPIC #5c Predicting duration

Duration
• Simplest: fixed size for all phones (100 ms)
• Next simplest: average duration for that phone (from training data). Samples from SWBD, in ms:
  aa 118    b  68
  ax  59    d  68
  ay 138    dh 44
  eh  87    f  90
  ih  77    g  66
• Next: add in phrase-final and phrase-initial lengthening, plus stress
• Better: average duration for each triphone

Duration in Festival
• Klatt duration rules. Modify duration based on:
  – Position in clause
  – Syllable position in word
  – Syllable type
  – Lexical stress
  – Left + right context phone
  – Prepausal lengthening
• Festival: 2 options
  – Klatt rules
  – Use a labeled training set with Klatt features to train a CART

Duration: state of the art
• Lots of fancy models of duration prediction:
  – Using z-scores and other clever normalizations
  – Sum-of-products model
  – New features like word predictability: words with higher bigram probability are shorter

TOPIC #5d F0 Generation

F0 Generation
• Generation in Festival
  – F0 generation by rule
  – F0 generation by linear regression
• Some constraints
  – F0 is constrained by accents and boundaries
  – F0 declines gradually over an utterance (“declination”)

F0 generation by rule
• Generate a list of target F0 points for each syllable
• Here’s a rule to generate a simple H* “hat” accent (with fixed, speaker-specific F0 values):

(define (targ_func1 utt syl)
  "(targ_func1 UTT STREAMITEM)
Returns a list of targets for the given syllable."
  (let ((start (item.feat syl 'syllable_start))
        (end (item.feat syl 'syllable_end)))
    (if (equal? (item.feat syl "R:Intonation.daughter1.name") "Accented")
        (list (list start 110)
              (list (/ (+ start end) 2.0) 140)
              (list end 100)))))

F0 generation by regression
• Supervised machine learning again
• We predict: the value of F0 at 3 places in each syllable
• Predictor features:
  – Accent of the current, next, and previous word
  – Boundaries
  – Syllable type, phonetic information
  – Stress information
• Need training sets with pitch accents labeled

Outline
• History of TTS • Architecture • Text Processing • Letter-to-Sound Rules • Prosody • Waveform Generation • Evaluation

Articulatory Synthesis
The vocal tract is divided into a large number of short tubes, as in the electrical transmission-line analog, which are then combined and the resonant frequencies calculated. From Sinder, 1999 (thesis work with Flanagan, Rutgers).

Formant Synthesis
• Instead of specifying mouth shapes, formant synthesis specifies frequencies and bandwidths of resonators, which are used to filter a source waveform.
• Formant frequency analysis is difficult; bandwidth estimation is even more difficult. But the biggest perceptual problem in formant synthesis is not in the resonances, but in a “buzzy” quality most likely due to the glottal source model.
• Formant synthesis can sound identical to a natural utterance if details of the glottal source and formants are well modeled. [Audio: natural vs. synthetic speech, John Holmes, 1973]
• Each formant is a second-order resonator, with frequency $f_i$ and bandwidth $b_i$ normalized to the sampling rate:

  $H_i(z) = \dfrac{1}{1 - 2e^{-\pi b_i}\cos(2\pi f_i)\,z^{-1} + e^{-2\pi b_i}\,z^{-2}}$

Klatt’s formant synthesizer
[Block diagram: voicing (AV) and noise (AH, AF) sources; glottal resonators RGP, RGZ, RGS; a cascade branch of resonators R1–R5 with nasal pole/zero RNP, RNZ; and a parallel branch R1–R6 with per-formant amplitudes A1–A6 plus bypass AB; cascade and parallel outputs summed.]
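A formant resonator like the R1–R6 boxes in Klatt’s synthesizer can be implemented as a two-pole IIR filter. A minimal sketch in pure Python; the unity-DC-gain scaling follows Klatt’s convention, and the helper names are illustrative:

```python
import math, cmath

def resonator(x, f_hz, bw_hz, fs):
    """One formant as a second-order IIR resonator:
    y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    b = 2.0 * math.exp(-math.pi * bw_hz / fs) * math.cos(2.0 * math.pi * f_hz / fs)
    c = -math.exp(-2.0 * math.pi * bw_hz / fs)
    a = 1.0 - b - c          # scale for unity gain at DC
    y1 = y2 = 0.0
    out = []
    for s in x:
        y = a * s + b * y1 + c * y2
        out.append(y)
        y1, y2 = y, y1
    return out

# The impulse response rings at (roughly) the formant frequency.
fs = 8000
y = resonator([1.0] + [0.0] * 799, f_hz=500.0, bw_hz=60.0, fs=fs)

def mag_at(sig, f, fs):
    # Magnitude of the DFT of `sig` evaluated at frequency f (Hz).
    return abs(sum(s * cmath.exp(-2j * math.pi * f * n / fs)
                   for n, s in enumerate(sig)))
```

Cascading several such resonators with different (f, bw) pairs gives the cascade branch of the synthesizer; driving them with a pulse train instead of a single impulse gives voiced sound.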
Klatt’s parameter values

 N  Symbol  Name                                   Min    Max     Typ
 1  AV      Amplitude of voicing (dB)                0     80       0
 2  AF      Amplitude of frication (dB)              0     80       0
 3  AH      Amplitude of aspiration (dB)             0     80       0
 4  AVS     Amplitude of sinusoidal voicing (dB)     0     80       0
 5  F0      Fundamental frequency (Hz)               0    500       0
 6  F1      First formant frequency (Hz)           150    900     500
 7  F2      Second formant frequency (Hz)          500   2500    1500
 8  F3      Third formant frequency (Hz)          1300   3500    2500
 9  F4      Fourth formant frequency (Hz)         2500   4500    3500
10  FNZ     Nasal zero frequency (Hz)              200    700     250
11  AN      Nasal formant amplitude (dB)             0     80       0
12  A1      First formant amplitude (dB)             0     80       0
13  A2      Second formant amplitude (dB)            0     80       0
14  A3      Third formant amplitude (dB)             0     80       0
15  A4      Fourth formant amplitude (dB)            0     80       0
16  A5      Fifth formant amplitude (dB)             0     80       0
17  A6      Sixth formant amplitude (dB)             0     80       0
18  AB      Bypass path amplitude (dB)               0     80       0
19  B1      First formant bandwidth (Hz)            40    500      50
20  B2      Second formant bandwidth (Hz)           40    500      70
21  B3      Third formant bandwidth (Hz)            40    500     110
22  SW      Cascade/parallel switch                  0      1       0
 …
32  FNP     Nasal pole frequency (Hz)              200    500     250
33  BNP     Nasal pole bandwidth (Hz)               50    500     100
34  BNZ     Nasal zero bandwidth (Hz)               50    500     100
35  BGS     Glottal resonator 2 bandwidth (Hz)     100   1000     200
36  SR      Sampling rate (Hz)                     500  20000   10000
37  NWS     Number of waveform samples per chunk     1    200      50
38  G0      Overall gain control (dB)                0     80      48
39  NFC     Number of cascaded formants              4      6       5

Formant systems: Rule-Based Synthesis
• For synthesis of arbitrary text, formants and bandwidths for each phoneme are determined by analyzing the speech of a single person.
• The model of each phoneme may be a single set of formant frequencies and bandwidths for a canonical phoneme at a single point in time, or a trajectory of frequencies, bandwidths, and source models over time.
• The formant frequencies for each phoneme are combined over time using a model of coarticulation, such as Klatt’s modified locus theory.
• Duration, pitch, and energy rules are applied
• Result: something like this: [audio example]

Concatenative Synthesis
• Copy synthesis sounds great, but synthesis by rule using formants does not. Why? A problem with the glottal source? With coarticulation and formant transitions? With prosody?
• Formant synthesis was the main TTS technique until the early-to-mid 1990s, when increasing memory size and CPU speed made concatenative synthesis a viable approach.
• Concatenative synthesis uses recordings of small units of speech (typically the region from the middle of one phoneme to the middle of another phoneme, a diphone unit), and glues these units together to form words and sentences.
• We don’t have to worry about glottal source models or coarticulation, since the synthesis is just a concatenation of different waveforms containing “natural” glottal source and coarticulation.

Concatenative Synthesis: Units
• The basic unit for concatenative synthesis is the diphone: sil-jh jh-aa aa-n n-sil
• More recent TTS research is on using larger units. Issues include:
  – How to decide which units will be used?
  – How to select the best unit from a very large database?
• With increasing size and variety of units, there is exponential growth in the database size. Yet, despite massive databases that may take months to record, coverage is nowhere near complete: there is a very large number of infrequent events in speech.
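Turning a phone sequence into the diphone units to fetch from the database is just pairing adjacent phones after padding with silence. A minimal sketch, using the sil-jh jh-aa aa-n n-sil example (the phone names are the only thing taken from the slides):

```python
def to_diphones(phones):
    """Pad a phone sequence with silence and pair adjacent phones into
    diphone unit names (middle-of-phone to middle-of-phone units)."""
    padded = ["sil"] + list(phones) + ["sil"]
    return ["%s-%s" % (a, b) for a, b in zip(padded, padded[1:])]

# The word "John" as the phones jh aa n:
units = to_diphones(["jh", "aa", "n"])
# units == ['sil-jh', 'jh-aa', 'aa-n', 'n-sil']
```

Note that n phones always yield n+1 diphone units, which is why a diphone inventory for a language with P phones needs on the order of P² recorded units.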
Joining Units
• Dumb: just join.
• Better: join at low-amplitude regions.
• TD-PSOLA
  – Time-domain pitch-synchronous overlap-and-add
  – Join at pitch periods (with windowing)

Diphone boundaries in stops [figure]

Prosodic Modification
• Modifying pitch and duration independently
• Changing the sample rate modifies both: “Alvin and the Chipmunks” speech
• Duration: duplicate/remove parts of the signal
• Pitch: resample to change pitch

Speech as short-term signals [figure]

Duration modification
• Duplicate/remove short-term signals

Pitch modification
• Move short-term signals closer together/further apart

Overlap-and-Add (OLA)
• Hanning windows of length 2N are used to multiply the analysis signal
• The resulting windowed signals are added
• Analysis windows spaced 2N; synthesis windows spaced N
• Time compression is uniform with a factor of 2
• Pitch periodicity is somewhat lost around the 4th window

TD-PSOLA™
• Time-Domain Pitch-Synchronous Overlap-and-Add
• Patented by France Telecom (CNET)
• Very efficient: no FFT (or inverse FFT) required
• Can modify F0 up to two times or by half

HMM-Based Synthesis
• Generate the most likely sequence of spectral (e.g., MFCC) and excitation (F0) parameters for the given phone sequence using an HMM
• Create a filter from the spectral parameters
• Pass the excitation parameters (F0, noise) through the filter to generate the waveform

Block Diagram • Zen & Toda (2005) [figure]

HMM parameter generation
• Each model represents a phone or a subphone (diphone, triphone, etc.)
• Each model consists of multiple states
  – Tri-state model
  – Each Gaussian mixture component is represented by a different state, with the transition probability as the mixture weight
• Each state emits a spectral/F0 feature vector
  – 12-13 MFCCs, deltas, (delta-deltas)
  – F0, delta, (delta-delta)

Problem for HMM parameter generation
• We know which models to concatenate, in what order
• We do NOT know
  – which state in the model to use to generate each frame
  – which value to choose from the set of values observable within each state

Tokuda et al. (1995)
• We need to solve

  $\hat{c} = \arg\max_c P(O, q \mid \lambda)$

  – O: observation sequence, where each observation is a feature vector consisting of MFCCs ($c$) and their deltas ($\Delta c$)
  – q: state sequence
  – λ: HMM
• The problem is that we don’t exactly know what q is

Solution (1)
• Let’s assume that we know the state sequence q. Since $P(q \mid \lambda)$ does not depend on $c$:

  $\arg\max_c P(O, q \mid \lambda) = \arg\max_c P(O \mid q, \lambda)\,P(q \mid \lambda) = \arg\max_c P(O \mid q, \lambda)$

  $P(O \mid q, \lambda) = b_{q_1}(o_1)\, b_{q_2}(o_2)\, b_{q_3}(o_3) \cdots b_{q_T}(o_T)$

  $b_{q_t}(o_t) = \mathcal{N}(c_t;\, \mu_{q_t}, \Sigma_{q_t})$

  $\mathcal{N}(o;\, \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^N |\Sigma|}} \exp\!\left(-\tfrac{1}{2}(o-\mu)'\,\Sigma^{-1}(o-\mu)\right)$

Solution (1)
• Result: a non-smooth spectrum [plot: MFCC1 over time jumps between state means]

Solution (2)
• Add deltas:

  $b_{q_t}(o_t) = \mathcal{N}(c_t;\, \mu_{q_t}, \Sigma_{q_t})\; \mathcal{N}(\Delta c_t;\, \mu^{(\Delta)}_{q_t}, \Sigma^{(\Delta)}_{q_t})$

1. Differentiate the log-probability with respect to $c_t$
2.
Solve the set of linear equations for $c_t$ [plot: MFCC1 over time is now a smooth trajectory]

Digression: Delta
• Simple calculation of delta:

  $d_t = \frac{c_{t+1} - c_{t-1}}{2}$

• More robust calculation of delta:

  $d_t = \frac{\sum_{l=1}^{L} l\,(c_{t+l} - c_{t-l})}{2\sum_{l=1}^{L} l^2}$

• Typically rewritten as below, where $w_l$ is derived from the above:

  $d_t = \sum_{l=-L}^{L} w_l\, c_{t+l}$

Finding the state sequence
• Recall that our problem was that we do NOT know
  1) which state in the model to use to generate each frame
  2) which value to choose from the set of values observable within each state
• The solution discussed thus far solves (2) assuming that we know the answer to (1)
• To really solve the problem, we should consider all possible state sequences and choose the $c$ that gives us the highest observation probability
• Directly solving the equation for all possible state sequences takes too much time

How about excitation?
• Unvoiced speech: white noise. This is fine!
• Voiced speech: an impulse train through a glottal pulse filter $h[n]$
• Problems:
  – Voiced speech has frication
  – Hard decisions are hard

How about excitation?
• Use a mixed excitation model: an impulse train through $h[n]$ plus noise through $g[n]$
• Learn model parameters from data with the HMM
• Multi-band noise is better

HMM-based concatenative synthesis
• Given a big database, find the string of units that maximizes the probability under the HMM:

  $U^{*} = \arg\max_U p(U \mid \lambda) = \arg\max_U \prod_{t=1}^{T} p(u_t \mid \lambda)$

• Intra-segment scores can be precomputed
• Concatenation scores could also be precomputed
  – All possible joins (could be large!)
  – Delta means and variances at boundaries are the key!
• Good job at concatenation matching! How about prosody? Use an HMM too!
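The regression form of the delta computation is easy to turn into code. A minimal sketch in pure Python; the edge-frame convention (repeating the boundary value) is a common choice, not something specified on the slides:

```python
def delta(c, L=2):
    """Regression delta coefficients over a window of +/- L frames:
    d_t = sum_{l=1..L} l*(c[t+l] - c[t-l]) / (2 * sum_{l=1..L} l^2).
    Edge frames repeat the boundary value (an assumed convention)."""
    denom = 2.0 * sum(l * l for l in range(1, L + 1))
    T = len(c)
    return [sum(l * (c[min(t + l, T - 1)] - c[max(t - l, 0)])
                for l in range(1, L + 1)) / denom
            for t in range(T)]

# On a linear ramp the delta recovers the slope at interior frames.
d = delta([0.0, 1.0, 2.0, 3.0, 4.0], L=2)
```

With L=1 this reduces to the simple formula $(c_{t+1} - c_{t-1})/2$; in a real front end the same window is applied to each MFCC dimension independently.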
Stylistic TTS
TTS in Windows since Windows 2000.
[Block diagram: Text → Text Analysis Rules → Letter-to-Sound Dictionary and Rules → Prosody → Waveform Concatenation → Speech, driven by a database of recorded speech (a read-speech voice for standard TTS, or a stylistic database for stylistic TTS). Thanks to Min Chu, MSR Asia.]

Outline
• History of TTS • Architecture • Text Processing • Letter-to-Sound Rules • Prosody • Waveform Generation • Evaluation

Evaluation of TTS
• Intelligibility tests
  – Diagnostic Rhyme Test (DRT)
    • Listeners make a forced choice between two words differing by a single phonetic feature
      – Voicing, nasality, sustention, sibilation
    • 96 rhyming pairs
      – veal/feel, meat/beat, vee/bee, zee/thee, etc.
    • The subject hears “veal” and chooses either “veal” or “feel”
    • The subject also hears “feel” and chooses either “veal” or “feel”
    • The % of right answers is the intelligibility score
• Overall quality tests
  – Have listeners rate speech on a scale from 1 (bad) to 5 (excellent)
• Preference tests (prefer A, prefer B)
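Scoring a DRT session is just the percentage of correct forced choices. A minimal sketch, with made-up response data for illustration:

```python
def drt_score(trials):
    """Percent of correct forced choices in a Diagnostic Rhyme Test.
    trials: list of (word_chosen, word_presented) pairs."""
    correct = sum(1 for chosen, presented in trials if chosen == presented)
    return 100.0 * correct / len(trials)

# Made-up responses for illustration: 3 of 4 correct.
trials = [("veal", "veal"), ("feel", "feel"), ("veal", "feel"), ("meat", "meat")]
score = drt_score(trials)
```

A full DRT run aggregates such scores over all 96 pairs, and often also per feature (voicing, nasality, etc.) to diagnose which distinctions the synthesizer renders poorly.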