Text to Speech Systems (TTS)
EE 516 Spring 2009
Alex Acero
Acknowledgments
• Thanks to Dan Jurafsky for a lot of slides
• Also thanks to Alan Black, Jennifer Venditti,
Richard Sproat
Outline
• History of TTS
• Architecture
• Text Processing
• Letter-to-Sound Rules
• Prosody
• Waveform Generation
• Evaluation
Dave Barry on TTS
“And computers are getting smarter all the time; scientists tell
us that soon they will be able to talk with us.
(By "they", I mean computers; I doubt scientists will ever be
able to talk to us.)
Von Kempelen 1780
• Small whistles controlled
consonants
• Rubber mouth and nose;
nose had to be covered with
two fingers for non-nasals
• Unvoiced sounds: mouth
covered, auxiliary bellows
driven by string provides puff
of air
From Traunmüller’s web site
Closer to a natural vocal tract:
Riesz 1937
The 1936 UK Speaking Clock
From http://web.ukonline.co.uk/freshwater/clocks/spkgclock.htm
The UK Speaking Clock
• July 24, 1936
• Photographic storage on 4 glass disks
• 2 disks for minutes, 1 for hours, 1 for seconds
• Other words in the sentence distributed across the 4 disks, so all 4
used at once
• Voice of “Miss J. Cain”
A technician adjusts the amplifiers of
the first speaking clock
From http://web.ukonline.co.uk/freshwater/clocks/spkgclock.htm
Homer Dudley’s VODER 1939
•Synthesizing speech by electrical means
•1939 World’s Fair
•Manually controlled through complex keyboard
•Operator training was a problem
•1939 vocoder
Cooper’s Pattern Playback
Dennis Klatt’s history of TTS (1987)
• More history at http://www.festvox.org/history/klatt.html
(Dennis Klatt)
Outline
• History of TTS
• Architecture
• Text Processing
• Letter-to-Sound Rules
• Prosody
• Waveform Generation
• Evaluation
Types of Modern Synthesis
• Articulatory Synthesis:
– Model movements of articulators and acoustics of vocal tract
• Formant Synthesis:
– Start with acoustics, create rules/filters to create each formant
• Concatenative Synthesis:
– Use databases of stored speech to assemble new utterances
• HMM-Based Synthesis
– Run an HMM in generation mode
Formant Synthesis
• Were the most common commercial systems while
computers were relatively underpowered.
• 1979 MIT MITalk (Allen, Hunnicutt, Klatt)
• 1983 DECtalk system
• The voice of Stephen Hawking
Concatenative Synthesis
• All current commercial systems.
• Diphone Synthesis
– Units are diphones; middle of one phone to middle of next.
– Why? Middle of phone is steady state.
– Record 1 speaker saying each diphone
• Unit Selection Synthesis
– Larger units
– Record 10 hours or more, so have multiple copies of each unit
– Use search to find best sequence of units
TTS Demos (all are Unit-Selection)
• ATT:
– http://www.research.att.com/~ttsweb/tts/demo.php
• Microsoft
– http://research.microsoft.com/en-us/groups/speech/tts.aspx
• Festival
– http://www-2.cs.cmu.edu/~awb/festival_demos/index.html
• Cepstral
– http://www.cepstral.com/cgi-bin/demos/general
• IBM
– http://www-306.ibm.com/software/pervasive/tech/demos/tts.shtml
Text Normalization
• Analysis of raw text into pronounceable words
• Sample problems:
– He stole $100 million from the bank
– It's 13 St. Andrews St.
– The home page is http://ee.washington.edu
– yes, see you the following tues, that's 11/12/01
• Steps
– Identify tokens in text
– Chunk tokens into reasonably sized sections
– Map tokens to words
– Identify types for words
Grapheme to Phoneme
• How to pronounce a word? Look in dictionary! But:
– Unknown words and names will be missing
– Turkish, German, and other hard languages
• uygarlaStIramadIklarImIzdanmISsInIzcasIna
• “(behaving) as if you are among those whom we could not civilize”
• uygar +laS +tIr +ama +dIk +lar +ImIz +dan +mIS +sInIz +casIna
  civilized +bec +caus +NegAble +ppart +pl +p1pl +abl +past +2pl +AsIf
• So need Letter to Sound Rules
• Also homograph disambiguation (wind, live, read)
Prosody:
from words+phones to boundaries, accent, F0, duration
• Prosodic phrasing
– Need to break utterances into phrases
– Punctuation is useful, not sufficient
• Accents:
– Predictions of accents: which syllables should be accented
– Realization of F0 contour: given accents/tones, generate F0
contour
• Duration:
– Predicting duration of each phone
Waveform synthesis:
from segments, f0, duration to waveform
• Collecting diphones:
– need to record diphones in correct contexts
• l sounds different in onset than coda, t is flapped sometimes, etc.
– need quiet recording room, maybe EGG (electroglottograph), etc.
– then need to label them very very exactly
• Unit selection: how to pick the right unit? Search
• Joining the units
– dumb (just stick ’em together)
– PSOLA (Pitch-Synchronous Overlap and Add)
– MBROLA (multi-band overlap and add)
Festival
• http://festvox.org/festival/
• Open source speech synthesis system
• Multiplatform (Windows/Unix)
• Designed for development and runtime use
– Use in many academic systems (and some commercial)
– Hundreds of thousands of users
– Multilingual
• No built-in language
• Designed to allow addition of new languages
– Additional tools for rapid voice development
• Statistical learning tools
• Scripts for building models
Festival as software
• C/C++ code with Scheme scripting language
• General replaceable modules:
– Lexicons, LTS, duration, intonation, phrasing, POS tagging,
tokenizing, diphone/unit selection, signal processing
• General tools
– Intonation analysis (f0, Tilt), signal processing, CART building, Ngram, SCFG, WFST
CMU FestVox project
• Festival is an engine, how do you make voices?
• Festvox: building synthetic voices:
– Tools, scripts, documentation
– Discussion and examples for building voices
– Example voice databases
– Step by step walkthroughs of processes
• Support for English and other languages
• Support for different waveform synthesis methods
– Diphone
– Unit selection
– Limited domain
Outline
• History of TTS
• Architecture
• Text Processing
• Letter-to-Sound Rules
• Prosody
• Waveform Generation
• Evaluation
Text Processing
• He stole $100 million from the bank
• It’s 13 St. Andrews St.
• The home page is http://ee.washington.edu
• Yes, see you the following tues, that’s 11/12/01
• IV: four, fourth, I.V.
• IRA: I.R.A. or Ira
• 1750: seventeen fifty (date, address) or one thousand seven… (dollars)
Steps
• Identify tokens in text
• Chunk tokens
• Identify types of tokens
• Convert tokens to words
Step 1: identify tokens and
chunk
• Whitespace can be viewed as separators
• Punctuation can be separated from the raw tokens
• Festival converts text into
– ordered list of tokens
– each with features:
• its own preceding whitespace
• its own succeeding punctuation
End-of-utterance detection
• Relatively simple if utterance ends in ?!
• But what about ambiguity of “.”
• Ambiguous between end-of-utterance and end-of-abbreviation
– My place on Forest Ave. is around the corner.
– I live at 360 Forest Ave.
– (Not “I live at 360 Forest Ave..”)
• How to solve this period-disambiguation task?
Some rules used
• A dot with one or two letters is an abbrev
• A dot with 3 cap letters is an abbrev.
• An abbrev followed by 2 spaces and a capital letter is an
end-of-utterance
• Non-abbrevs followed by capitalized word are breaks
• This fails for
– Cog. Sci. Newsletter
– Lots of cases at end of line.
– Badly spaced/capitalized sentences
More sophisticated decision tree
features
• Prob(word with “.” occurs at end-of-s)
• Prob(word after “.” occurs at begin-of-s)
• Length of word with “.”
• Length of word after “.”
• Case of word with “.”: Upper, Lower, Cap, Number
• Case of word after “.”: Upper, Lower, Cap, Number
• Punctuation after “.” (if any)
• Abbreviation class of word with “.” (month name, unit-of-measure, title, address name, etc.)
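As a concrete illustration, here is a minimal sketch of training such a period-disambiguation classifier with a decision tree. The feature encoding and the tiny training set are invented for the example, not taken from a real system.

from sklearn.tree import DecisionTreeClassifier

def case_class(tok):
    """Coarse case class: 0=Upper, 1=Cap, 2=Lower, 3=Number."""
    if tok.isdigit():
        return 3
    if tok.isupper():
        return 0
    if tok.istitle():
        return 1
    return 2

def features(word_with_dot, next_word):
    """Lengths and case classes of the word with '.' and the word after it."""
    return [len(word_with_dot), len(next_word),
            case_class(word_with_dot), case_class(next_word)]

# Toy training pairs: (word ending in '.', following word) -> 1 if end-of-utterance
train = [(("Ave.", "is"), 0), (("Ave.", "The"), 1),
         (("corner.", "I"), 1), (("St.", "Andrews"), 0)]
X = [features(w, n) for (w, n), _ in train]
y = [label for _, label in train]
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([features("Ave.", "She")]))  # capitalized follower: likely a break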
CART
• Breiman, Friedman, Olshen, Stone. 1984. Classification
and Regression Trees. Chapman & Hall, New York.
• Description/Use:
– Binary tree of decisions, terminal nodes determine prediction
(“20 questions”)
– If dependent variable is categorical, “classification tree”,
– If continuous, “regression tree”
Learning DTs
• DTs are rarely built by hand
• Hand-building only possible for very simple features, domains
• Lots of algorithms for DT induction
• I’ll give quick intuition here
CART Estimation
• Creating a binary decision tree for classification or regression
involves 3 steps:
– Splitting Rules: Which split to take at a node?
– Stopping Rules: When to declare a node terminal?
– Node Assignment: Which class/value to assign to a terminal
node?
Splitting Rules
• Which split to take at a node?
• Candidate splits considered:
– Binary cuts: for continuous x (−∞ < x < ∞), consider splits of the form:
  • x ≤ k vs. x > k, ∀k
– Binary partitions: for categorical x ∈ {1,2,…} = X, consider splits of the form:
  • x ∈ A vs. x ∈ X−A, ∀A ⊂ X
Splitting Rules
• Choosing best candidate split.
– Method 1: Choose k (continuous) or A (categorical) that
minimizes estimated classification (regression) error after split
– Method 2 (for classification): Choose k or A that minimizes
estimated entropy after that split.
Decision Tree Stopping
• When to declare a node terminal?
• Strategy (cost-complexity pruning):
  1. Grow over-large tree
  2. Form sequence of subtrees, T0…Tn, ranging from full tree to just the root node
  3. Estimate “honest” error rate for each subtree
  4. Choose tree size with minimum “honest” error rate
• To estimate the “honest” error rate, test on data different from the training data (i.e., grow the tree on 9/10 of the data, test on 1/10, repeating 10 times and averaging: cross-validation), as sketched below
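This pruning recipe maps directly onto scikit-learn's CART implementation. A minimal sketch, assuming scikit-learn is available and using the Iris data purely as a stand-in for a real training set:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Steps 1-2: grow the full tree and get the subtree sequence T0..Tn,
# indexed by the cost-complexity parameter alpha.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Step 3: estimate an "honest" error rate for each subtree by 10-fold cross-validation.
scores = [(a, cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a),
                              X, y, cv=10).mean())
          for a in path.ccp_alphas]

# Step 4: keep the subtree (alpha) with the best cross-validated accuracy.
best_alpha, best_acc = max(scores, key=lambda s: s[1])
print(best_alpha, best_acc)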
Sproat’s EOS tree
Steps 3+4: Identify Types of Tokens, and
Convert Tokens to Words
• Pronunciation of numbers often depends on type:
– 1776 date: seventeen seventy six
– 1776 phone number: one seven seven six
– 1776 quantifier: one thousand seven hundred (and) seventy six
– 25 day: twenty-fifth
Festival rule for dealing with “$1.2
million”
(define (token_to_words utt token name)
  (cond
   ((and (string-matches name "\\$[0-9,]+\\(\\.[0-9]+\\)?")
         (string-matches (utt.streamitem.feat utt token "n.name") ".*illion.?"))
    (append
     (builtin_english_token_to_words utt token (string-after name "$"))
     (list (utt.streamitem.feat utt token "n.name"))))
   ((and (string-matches (utt.streamitem.feat utt token "p.name")
                         "\\$[0-9,]+\\(\\.[0-9]+\\)?")
         (string-matches name ".*illion.?"))
    (list "dollars"))
   (t
    (builtin_english_token_to_words utt token name))))
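For readers who don't know Scheme, here is a rough Python rendering of the same two-branch rule. The helper number_to_words is a hypothetical stand-in for Festival's built-in English token expansion:

import re

MONEY = re.compile(r"\$[0-9,]+(\.[0-9]+)?$")
ILLION = re.compile(r".*illion.?$")

def token_to_words(token, prev_token, next_token, number_to_words):
    if MONEY.match(token) and next_token and ILLION.match(next_token):
        # "$1.2" followed by "million": say the number, then "million"
        return number_to_words(token[1:]) + [next_token]
    if prev_token and MONEY.match(prev_token) and ILLION.match(token):
        # the "million" token itself becomes "dollars"
        return ["dollars"]
    return number_to_words(token)

stub = lambda s: [s]   # pretend number expander, for the demo
print(token_to_words("$1.2", None, "million", stub))   # ['1.2', 'million']
print(token_to_words("million", "$1.2", None, stub))   # ['dollars']

So "$1.2 million" comes out as "one point two million dollars" once a real number expander is plugged in.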
Rule-based versus machine learning
• As always, we can do things either way, or more often by a
combination
• Rule-based:
– Simple
– Quick
– Can be more robust
• Machine Learning
– Works for complex problems where rules hard to write
– Higher accuracy in general
– But worse generalization to very different test sets
• Real TTS and NLP systems
– Often use aspects of both.
Machine learning method for Text
Normalization
• From 1999 Hopkins summer workshop “Normalization of
Non-Standard Words”
• Sproat, R., Black, A., Chen, S., Kumar, S., Ostendorf, M., and Richards, C.
2001. Normalization of Non-standard Words, Computer Speech and
Language, 15(3):287-333
• NSW examples:
– Numbers:
• 123, 12 March 1994
– Abbreviations, contractions, acronyms:
• approx., mph, ctrl-C, US, pp, lb
– Punctuation conventions:
• 3-4, +/-, and/or
– Dates, times, urls, etc
How common are NSWs?
• Varies over text type
• Word not in lexicon, or with non-alphabetic characters:

  Text type     % NSW
  novels         1.5%
  press wire     4.9%
  e-mail        10.7%
  recipes       13.7%
  classified    17.9%
How hard are NSWs?
• Identification:
– Some homographs “Wed”, “PA”
– False positives: OOV
• Realization:
– Simple rule: money, $2.34
– Type identification+rules: numbers
– Text type specific knowledge (in classified ads, BR for bedroom)
• Ambiguity (acceptable multiple answers)
– “D.C.” as letters or full words
– “MB” as “meg” or “megabyte”
– 250: “two fifty” or “two hundred (and) fifty”
Step 1: Splitter
• Letter/number conjunctions (WinNT, SunOS, PC110)
• Hand-written rules in two parts:
– Part I: group things not to be split (numbers, etc; including commas
in numbers, slashes in dates)
– Part II: apply rules:
  • At transitions from lower to upper case
  • After penultimate upper-case char in transitions from upper to lower
  • At transitions from digits to alpha
  • At punctuation
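A minimal regex sketch of the Part II rules; Part I grouping is assumed to have already protected commas in numbers and slashes in dates. The alpha-to-digit split is an extra assumption so that examples like PC110 come apart (the slide lists only the digit-to-alpha direction):

import re

def split_token(tok):
    tok = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", tok)        # lower -> upper: "WinNT"
    tok = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", " ", tok)   # after penultimate upper-case char
    tok = re.sub(r"(?<=[0-9])(?=[A-Za-z])|(?<=[A-Za-z])(?=[0-9])", " ", tok)  # digit/alpha
    tok = re.sub(r"([^A-Za-z0-9\s])", r" \1 ", tok)       # at punctuation
    return tok.split()

print(split_token("WinNT"), split_token("PC110"), split_token("AltaVista"))
# ['Win', 'NT'] ['PC', '110'] ['Alta', 'Vista']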
Step 2: Classify token into 1 of 20 types
• EXPN: abbrev, contractions (adv, N.Y., mph, gov’t)
• LSEQ: letter sequence (CIA, D.C., CDs)
• ASWD: read as word, e.g. CAT, proper names
• MSPL: misspelling
• NUM: number (cardinal) (12, 45, 1/2, 0.6)
• NORD: number (ordinal), e.g. May 7, 3rd, Bill Gates II
• NTEL: telephone (or part), e.g. 212-555-4523
• NDIG: number as digits, e.g. Room 101
• NIDE: identifier, e.g. 747, 386, I5, PC110
• NADDR: number as street address, e.g. 5000 Pennsylvania
• NZIP, NTIME, NDATE, NYER, MONEY, BMONY, PRCT, URL, etc.
• SLNT: not spoken (KENT*REALTY)
More about the types
• 4 categories for alphabetic sequences:
– EXPN: expand to full word or word seq (fplc for fireplace, NY for New
York)
– LSEQ: say as letter sequence (IBM)
– ASWD: say as standard word (either OOV or acronyms)
• 5 main ways to read numbers:
– Cardinal (quantities)
– Ordinal (dates)
– String of digits (phone numbers)
– Pair of digits (years)
– Trailing unit: serial until last non-zero digit: 8765000 is “eight seven six five thousand” (some phone numbers, long addresses)
– But still exceptions: (947-3030, 830-7056)
Type identification algorithm
• Create large hand-labeled training set and build a DT to
predict type
• Example of features in tree for subclassifier for alphabetic
tokens:
– P(t|o) = P(o|t)P(t)/P(o)
– P(o|t), for t in ASWD, LSEQ, EXPN, from a letter trigram model:
$$p(o \mid t) = \prod_{i=1}^{N} p(l_i \mid l_{i-1}, l_{i-2})$$
– P(t) from counts of each tag in text
– P(o): normalization factor
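A toy sketch of this Bayesian subclassifier: one letter-trigram model per type, combined with the prior p(t). The training words, add-one smoothing, and alphabet size are all stand-ins:

import math
from collections import Counter

class LetterTrigram:
    def __init__(self, words, alpha=1.0):
        self.tri, self.bi, self.alpha = Counter(), Counter(), alpha
        for w in words:
            s = "##" + w.lower() + "#"            # pad with boundary symbols
            for i in range(2, len(s)):
                self.tri[s[i-2:i+1]] += 1
                self.bi[s[i-2:i]] += 1

    def logprob(self, w):
        s = "##" + w.lower() + "#"
        return sum(math.log((self.tri[s[i-2:i+1]] + self.alpha) /
                            (self.bi[s[i-2:i]] + 27 * self.alpha))  # 26 letters + '#'
                   for i in range(2, len(s)))

train = {"ASWD": ["cat", "dog", "word"], "LSEQ": ["ibm", "cia", "usa"]}
models = {t: LetterTrigram(ws) for t, ws in train.items()}
total = sum(len(ws) for ws in train.values())
prior = {t: len(ws) / total for t, ws in train.items()}   # p(t) from tag counts

def classify(token):
    return max(models, key=lambda t: math.log(prior[t]) + models[t].logprob(token))

print(classify("cats"))   # picks the type whose letter model fits the spelling best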
Type identification algorithm
• Hand-written context-dependent rules:
– List of lexical items (Act, Advantage, amendment) after which
Roman numbers read as cardinals not ordinals
• Classifier accuracy:
– 98.1% in news data,
– 91.8% in email
Step 3: expanding NSW Tokens
• Type-specific heuristics
– ASWD expands to itself
– LSEQ expands to list of words, one for each letter
– NUM expands to string of words representing cardinal
– NYER expands to 2 pairs of NUM digits…
– NTEL: string of digits with silence for punctuation
– Abbreviation:
  • use abbrev lexicon if it’s one we’ve seen
  • else use training set to know how to expand
  • cute idea: if “eat in kit” occurs in text, “eat-in kitchen” will also occur somewhere
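A toy sketch of a few of these heuristics; the cardinal expander handles only numbers below 100, and everything here is illustrative rather than a real normalizer:

ONES = "zero one two three four five six seven eight nine".split()
TEENS = ("ten eleven twelve thirteen fourteen fifteen "
         "sixteen seventeen eighteen nineteen").split()
TENS = "zero ten twenty thirty forty fifty sixty seventy eighty ninety".split()

def cardinal(n):                       # cardinals below 100, for the demo
    if n < 10:
        return ONES[n]
    if n < 20:
        return TEENS[n - 10]
    return TENS[n // 10] + ("" if n % 10 == 0 else " " + ONES[n % 10])

def expand(token, typ):
    if typ == "ASWD":                  # expands to itself
        return token
    if typ == "LSEQ":                  # one word per letter
        return " ".join(token.replace(".", ""))
    if typ == "NUM":                   # cardinal
        return cardinal(int(token))
    if typ == "NDIG":                  # digit by digit
        return " ".join(ONES[int(d)] for d in token)
    if typ == "NYER":                  # two pairs of digits
        return cardinal(int(token[:2])) + " " + cardinal(int(token[2:]))
    return token

print(expand("CIA", "LSEQ"))    # C I A
print(expand("1776", "NYER"))   # seventeen seventy six
print(expand("101", "NDIG"))    # one zero one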
4 steps to Sproat et al. algorithm
1) Splitter (on whitespace, or also within a word: “AltaVista”)
2) Type identifier: for each split token identify type
3) Token expander: for each typed token, expand to words
• Deterministic for number, date, money, letter sequence
• Only hard (nondeterministic) for abbreviations
4) Language Model: to select between alternative
pronunciations
Homograph disambiguation
• Most frequent homographs, from
Liberman and Church
• Not a huge problem, but still
important
record    195        project    90
house     150        separate   87
contract  143        present    80
lead      131        read       72
live      130        subject    68
lives     105        rebel      48
protest    94        finance    46
survey     91        estimate   46
POS Tagging for homograph
disambiguation
• Many homographs can be distinguished by POS
live:     l ay v  vs.  l ih v
REcord    vs.  reCORD
INsult    vs.  inSULT
OBject    vs.  obJECT
OVERflow  vs.  overFLOW
DIScount  vs.  disCOUNT
CONtent   vs.  conTENT
Part of speech tagging
• 8 (ish) traditional parts of speech
– This idea has been around for over 2000 years (Dionysius
Thrax of Alexandria, c. 100 B.C.)
– Called: parts-of-speech, lexical category, word classes,
morphological classes, lexical tags, POS
– We’ll use POS most frequently
POS examples
N    noun          chair, bandwidth, pacing
V    verb          study, debate, munch
ADJ  adjective     purple, tall, ridiculous
ADV  adverb        unfortunately, slowly
P    preposition   of, by, to
PRO  pronoun       I, me, mine
DET  determiner    the, a, that, those
POS Tagging: Definition
• The process of assigning a part-of-speech or lexical class
marker to each word in a corpus:
WORDS: the koala put the keys on the table
TAGS: {N, V, P, DET}
POS Tagging example
WORD    TAG
the     DET
child   N
put     V
the     DET
keys    N
on      P
the     DET
table   N
Open and closed class words
• Closed class: a relatively fixed membership
– Prepositions: of, in, by, …
– Auxiliaries: may, can, will, had, been, …
– Pronouns: I, you, she, mine, his, them, …
– Usually function words (short common words which play a role in grammar)
• Open class: new ones can be created all the time
– English has 4: Nouns, Verbs, Adjectives, Adverbs
– Many languages have all 4, but not all!
– In Lakhota and possibly Chinese, what English treats as adjectives
act more like verbs.
Open class words
• Nouns
– Proper nouns (Seattle University, Boulder, Neal Snider, William Gates
Hall). English capitalizes these.
– Common nouns (the rest). German capitalizes these.
– Count nouns and mass nouns
• Count: have plurals, get counted: goat/goats, one goat, two goats
• Mass: don’t get counted (snow, salt, communism) (*two snows)
• Adverbs: tend to modify things
–
–
–
–
Unfortunately, John walked home extremely slowly yesterday
Directional/locative adverbs (here, home, downhill)
Degree adverbs (extremely, very, somewhat)
Manner adverbs (slowly, delicately)
• Verbs:
– In English, have morphological affixes (eat/eats/eaten)
Closed Class Words
• Idiosyncratic
• Examples:
– prepositions: on, under, over, …
– particles: up, down, on, off, …
– determiners: a, an, the, …
– pronouns: she, who, I, …
– conjunctions: and, but, or, …
– auxiliary verbs: can, may, should, …
– numerals: one, two, three, third, …
POS tagging: Choosing a tagset
• There are many parts of speech and potential distinctions we can draw
• To do POS tagging, need to choose a standard set of tags to work with
• Could pick very coarse tagsets
– N, V, Adj, Adv.
• More commonly used set is finer grained, the “UPenn TreeBank
tagset”, 45 tags
– PRP$, WRB, WP$, VBG
• Even more fine-grained tagsets exist
Using the UPenn tagset
• The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
• Prepositions and subordinating conjunctions marked IN
(“although/IN I/PRP..”)
• Except the preposition/complementizer “to” is just marked
“to”.
POS Tagging
• Words often have more than one POS: back
– The back door = JJ
– On my back = NN
– Win the voters back = RB
– Promised to back the bill = VB
• The POS tagging problem is to determine the POS tag for a
particular instance of a word.
How hard is POS tagging? Measuring
ambiguity
Unambiguous (1 tag):    38,857
Ambiguous (2–9 tags):    8,844
  2 tags   6,731
  3 tags   1,621
  4 tags     357
  5 tags      90
  6 tags      32
  7 tags       6   (well, set, round, open, fit, down)
  8 tags       4   (’s, half, back, a)
  9 tags       3   (that, more, in)
3 methods for POS tagging
1. Rule-based tagging
   – ENGTWOL
2. Stochastic (= probabilistic) tagging
   – HMM (Hidden Markov Model) tagging
3. Transformation-based tagging
   – Brill tagger
Rule-based tagging
• Start with a dictionary
• Assign all possible tags to words from the dictionary
• Write rules by hand to selectively remove tags
• Leaving the correct tag for each word
Start with a dictionary
• she:       PRP
• promised:  VBN, VBD
• to:        TO
• back:      VB, JJ, RB, NN
• the:       DT
• bill:      NN, VB
• Etc… for the ~100,000 words of English
Use the dictionary to assign every
possible tag
She    promised   to    back            the    bill
PRP    VBN,VBD    TO    NN,RB,JJ,VB     DT     VB,NN
Write rules to eliminate tags
Eliminate VBN if VBD is an option when VBN|VBD follows
“<start> PRP”
She    promised   to    back            the    bill
PRP    VBD        TO    NN,RB,JJ,VB     DT     VB,NN
(the VBN reading of “promised” has been eliminated)
Stochastic Tagging
• Intuition: assign each word its “most probable” tag
• Simplest way to define “most probable”:
  – Collect a training corpus
  – Choose the tag which is most frequent for that word in the training corpus
  – I.e., choose the tag such that p(tag|word) is highest
  – Of all the times that “use” occurred in a training corpus, what percentage was it V, what percentage N? Choose the higher-probability tag.
• Does it work?
  – Achieves 90%! But we can do better.
  – How? Context: in “to use”, use is V; in “the use of”, use is N
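A minimal sketch of this baseline; the tiny tagged corpus is invented. Note that the tagger must give every instance of "use" the same tag, which is exactly why context helps:

from collections import Counter, defaultdict

def train_unigram_tagger(tagged_sents):
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word.lower()][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}  # argmax p(tag|word)

corpus = [[("the", "DT"), ("use", "NN"), ("of", "IN"), ("force", "NN")],
          [("we", "PRP"), ("use", "VB"), ("force", "NN")],
          [("they", "PRP"), ("use", "VB"), ("it", "PRP")]]
tagger = train_unigram_tagger(corpus)
print([tagger.get(w, "NN") for w in ["the", "use", "of", "force"]])
# "use" comes out VB even after "the", since 2 of its 3 training tokens were VB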
HMM Tagger
• Intuition: Pick the most probable tag sequence for a series
of words
$$\hat{t}_1^n = \arg\max_{t_1^n} P(t_1^n \mid w_1^n)$$
• But how to make the right-hand side operational?
• Use Bayes’ rule:
$$P(x \mid y) = \frac{P(y \mid x)\,P(x)}{P(y)}$$
• Substituting:
$$\hat{t}_1^n = \arg\max_{t_1^n} \frac{P(w_1^n \mid t_1^n)\,P(t_1^n)}{P(w_1^n)}$$
HMM Tagger: fundamental equations
$$\hat{t}_1^n = \arg\max_{t_1^n} \frac{P(w_1^n \mid t_1^n)\,P(t_1^n)}{P(w_1^n)}$$
• Since the word sequence is constant:
$$\hat{t}_1^n = \arg\max_{t_1^n} \underbrace{P(w_1^n \mid t_1^n)}_{\text{likelihood}}\; \underbrace{P(t_1^n)}_{\text{prior}}$$
• Still too hard to compute directly
HMM Tagger: Two simplifying assumptions
• Prob of word independent of other words and their tags:
$$P(w_1^n \mid t_1^n) \approx \prod_{i=1}^{n} P(w_i \mid t_i)$$
• Prob of tag is only dependent on previous tag:
$$P(t_1^n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})$$
• Combining:
$$\hat{t}_1^n = \arg\max_{t_1^n} P(t_1^n \mid w_1^n) \approx \arg\max_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$$
Estimating these probabilities
• Determiners precede nouns in English, so we expect P(NN|DT) to be high
$$P(t_i \mid t_{i-1}) = \frac{C(t_{i-1}, t_i)}{C(t_{i-1})} \qquad\qquad P(w_i \mid t_i) = \frac{C(t_i, w_i)}{C(t_i)}$$
• In the tagged 1-million-word Brown corpus:
– P(NN|DT) = C(DT,NN)/C(DT) = 56509/116454 = .49
– P(is|VBZ) = C(VBZ,is)/C(VBZ) = 10073/21627 = .47
An example
• Secretariat/NNP is/BEZ expected/VBN to/TO race/VB
tomorrow/NR
• People/NNS continue/VB to/TO inquire/VB the/AT reason/NN
for/IN the/AT race/NN for/IN outer/JJ space/NN
An example of two tag sequences
[Figure: HMM tag lattice showing the two candidate tag sequences for “race”]
Picture of HMM
[Figure: HMM state topology over tags]
Viterbi Algorithm
[Figure: Viterbi trellis with states S1–S5 over “promised to back the bill”, with the candidate tags (VBN/VBD/NNP; TO; JJ/RB/VB/NN; DT; NN/VB) at each word position]
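A compact sketch of Viterbi decoding for this bigram HMM. The toy transition and emission probabilities below stand in for the corpus-estimated ones:

import math

def viterbi(words, tags, log_trans, log_emit, start="<s>"):
    """log_trans[(t_prev, t)] and log_emit[(t, w)] are log probabilities."""
    LOW = -1e9  # stand-in for log(0)
    V = [{t: log_trans.get((start, t), LOW) + log_emit.get((t, words[0]), LOW)
          for t in tags}]
    back = []
    for w in words[1:]:
        col, bp = {}, {}
        for t in tags:
            prev = max(V[-1], key=lambda p: V[-1][p] + log_trans.get((p, t), LOW))
            col[t] = (V[-1][prev] + log_trans.get((prev, t), LOW)
                      + log_emit.get((t, w), LOW))
            bp[t] = prev
        V.append(col)
        back.append(bp)
    path = [max(V[-1], key=V[-1].get)]       # best final tag...
    for bp in reversed(back):                # ...then follow backpointers
        path.append(bp[path[-1]])
    return list(reversed(path))

tags = ["DT", "NN", "VB"]
log_trans = {k: math.log(v) for k, v in {("<s>", "DT"): .7, ("<s>", "NN"): .2,
             ("DT", "NN"): .9, ("NN", "VB"): .5, ("VB", "DT"): .6}.items()}
log_emit = {k: math.log(v) for k, v in {("DT", "the"): .5, ("NN", "dog"): .1,
             ("VB", "runs"): .05}.items()}
print(viterbi(["the", "dog", "runs"], tags, log_trans, log_emit))  # ['DT', 'NN', 'VB']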
Evaluation
• The result is compared with a manually coded “Gold
Standard”
– Typically accuracy reaches 96-97%
– This may be compared with result for a baseline tagger (one
that uses no context).
• Important: 100% is impossible even for human annotators
Outline
• History of TTS
• Architecture
• Text Processing
• Letter-to-Sound Rules
• Prosody
• Waveform Generation
• Evaluation
Lexicons and Lexical Entries
• You can explicitly give pronunciations for words
– Each language/dialect has its own lexicon
– You can lookup words with
• (lex.lookup WORD)
– You can add entries to the current lexicon
• (lex.add.entry NEWENTRY)
– Entry: (WORD POS (SYL0 SYL1…))
– Syllable: ((PHONE0 PHONE1 …) STRESS )
– Example:
(“cepstra” n (((k eh p) 1) ((s t r aa) 0)))
Converting from words to phones
• Two methods:
– Dictionary-based
– Rule-based (Letter-to-sound=LTS)
• Early systems: all LTS
• MITalk was radical in having a huge 10K-word dictionary
• Now systems use a combination
• CMU dictionary: 127K words
– http://www.speech.cs.cmu.edu/cgi-bin/cmudict
Dictionaries aren’t always sufficient
• Unknown words
– Seem to be linear with number of words in unseen text
– Mostly person, company, product names
– But also foreign words, etc.
• So commercial systems have 3-part system:
– Big dictionary
– Special code for handling names
– Machine learned LTS system for other unknown words
Letter-to-Sound Rules
• Festival LTS rules:
• (LEFTCONTEXT [ ITEMS] RIGHTCONTEXT = NEWITEMS )
• Example:
– ( # [ c h ] C = k )
– ( # [ c h ] = ch )
• # denotes beginning of word
• C means all consonants
• Rules apply in order
– “christmas” pronounced with [k]
– But word with ch followed by non-consonant pronounced [ch]
• E.g., “choice”
Stress rules in LTS
• English famously evil: one from Allen et al 1987
• V → [1-stress] / X _ C* {Vshort C C? | V} {Vshort C* | V}
• Where X must contain all prefixes:
• Assign 1-stress to the vowel in a syllable preceding a weak
syllable followed by a morpheme-final syllable containing a
short vowel and 0 or more consonants (e.g. difficult)
• Assign 1-stress to the vowel in a syllable preceding a weak
syllable followed by a morpheme-final vowel (e.g. oregano)
• etc
Modern method: Learning LTS rules
automatically
• Induce LTS from a dictionary of the language
• Black et al. 1998
• Applied to English, German, French
• Two steps: alignment and (CART-based) rule induction
Alignment
• Letters: c h e c k e d
• Phones: ch _ eh _ k _ t
• Black et al Method 1:
– First scatter epsilons in all possible ways to cause letters and
phones to align
– Then collect stats for P(letter|phone) and select best to
generate new stats
– This is iterated a number of times until it settles (5-6 iterations)
– This is the EM (expectation maximization) algorithm
Alignment
• Black et al method 2
• Hand specify which letters can be rendered as which phones
– C goes to k/ch/s/sh
– W goes to w/v/f, etc
• Once mapping table is created, find all valid alignments, find
p(letter|phone), score all alignments, take best
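A toy sketch of method 2: enumerate the alignments licensed by a hand-written letter-to-phone table (with "_" for epsilon), then score each with a product of per-pair probabilities and keep the best. Both tables are tiny stand-ins:

ALLOWED = {"c": {"k", "ch", "s"}, "h": {"hh", "_"},
           "e": {"eh", "iy", "_"}, "k": {"k", "_"}}

def alignments(letters, phones):
    """All ways to pair each letter with one allowed phone or epsilon."""
    if not letters:
        return [[]] if not phones else []
    out, l = [], letters[0]
    if "_" in ALLOWED.get(l, ()):                     # letter -> epsilon
        out += [[(l, "_")] + r for r in alignments(letters[1:], phones)]
    if phones and phones[0] in ALLOWED.get(l, ()):    # letter -> next phone
        out += [[(l, phones[0])] + r for r in alignments(letters[1:], phones[1:])]
    return out

P = {("c", "ch"): .3, ("c", "k"): .6, ("h", "_"): .5,
     ("e", "eh"): .5, ("k", "_"): .2}                 # toy pair probabilities

def score(a):
    s = 1.0
    for pair in a:
        s *= P.get(pair, 1e-4)
    return s

cands = alignments(list("check"), ["ch", "eh", "k"])
print(max(cands, key=score))
# [('c', 'ch'), ('h', '_'), ('e', 'eh'), ('c', 'k'), ('k', '_')]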
Alignment
• Some alignments will turn out to be really bad.
• These are just the cases where pronunciation doesn’t
match letters:
– Dept
d ih p aa r t m ah n t
– CMU
s iy eh m y uw
– Lieutenant
l eh f t eh n ax n t (British)
• Also foreign words
• These can just be removed from alignment training
Building CART trees
• Build a CART tree for each letter in the alphabet (26 plus accented) using a context of ±3 letters
• # # # c h e c -> ch
• c h e c k e d -> _
• This produces 92-96% correct LETTER accuracy (58-75% word accuracy) for English
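A toy sketch of one such per-letter classifier (here, for the letter "c") with a ±3-letter window, using scikit-learn's decision tree as the CART stand-in. The pre-aligned training entries and the crude character encoding are illustrative only:

from sklearn.tree import DecisionTreeClassifier

def windows(word, pad="#"):
    """One 7-character window (letter ±3) per letter of the word."""
    w = pad * 3 + word + pad * 3
    return [w[i:i + 7] for i in range(len(word))]

# Toy dictionary entries, already aligned letter-by-letter ("_" = epsilon).
data = [("checked", ["ch", "_", "eh", "_", "k", "_", "t"]),
        ("cat",     ["k", "ae", "t"]),
        ("cell",    ["s", "eh", "l", "_"])]

X, y = [], []
for word, phones in data:
    for ctx, letter, phone in zip(windows(word), word, phones):
        if letter == "c":
            X.append([ord(ch) for ch in ctx])   # crude numeric encoding of the window
            y.append(phone)

tree_for_c = DecisionTreeClassifier().fit(X, y)
ctx = windows("cider")[0]                        # the "c" at the start of "cider"
print(tree_for_c.predict([[ord(ch) for ch in ctx]]))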
Improvements
• Take names out of the training data
• And acronyms
• Detect both of these separately
• And build special-purpose tools to do LTS for names and acronyms
Names
• Big problem area is names
• Names are common
– 20% of tokens in typical newswire text will be names
– 1987 Donnelly list (72 million households) contains about 1.5
million names
– Personal names: McArthur, D’Angelo, Jimenez, Rajan, Raghavan,
Sondhi, Xu, Hsu, Zhang, Chang, Nguyen
– Company/Brand names: Infinit, Kmart, Cytyc, Medamicus, Inforte,
Aaon, Idexx Labs, Bebe
Names
• Methods:
– Can do morphology (Walters -> Walter, Lucasville)
– Can write stress-shifting rules (Jordan -> Jordanian)
– Rhyme analogy: Plotsky by analogy with Trotsky (replace tr with pl)
– Liberman and Church: for 250K most common names, got 212K
(85%) from these modified-dictionary methods, used LTS for rest.
– Can do automatic country detection (from letter trigrams) and then
do country-specific rules
Outline
• History of TTS
• Architecture
• Text Processing
• Letter-to-Sound Rules
• Prosody
• Waveform Generation
• Evaluation
Defining Intonation
• Ladd (1996) “Intonational phonology”
• “The use of suprasegmental phonetic features … to convey sentence-level pragmatic meanings”
• Suprasegmental = above and beyond the segment/phone:
– F0
– Intensity (energy)
– Duration
• Sentence-level pragmatic meanings: i.e., meanings that apply to phrases or utterances as a whole, not lexical stress, not lexical tone
Three aspects of prosody
• Prominence: some syllables/words are more prominent than
others
• Structure/boundaries: sentences have prosodic structure
– Some words group naturally together
– Others have a noticeable break or disjuncture between them
• Tune: the intonational melody of an utterance.
Prosodic Prominence: Pitch Accents
A: What types of foods are a good source of vitamins?
B1: Legumes are a good source of VITAMINS.
B2: LEGUMES are a good source of vitamins.
• Prominent syllables are:
– Louder
– Longer
– Have higher F0 and/or sharper changes in F0 (higher F0 velocity)
Slides from Jennifer Venditti
Prosodic Boundaries
French [bread and cheese]
[French bread] and [cheese]
Prosodic Tunes
• Legumes are a good source of vitamins.
• Are legumes a good source of vitamins?
TOPIC #1
Thinking about F0
Graphic representation of F0
[F0 contour plot: F0 (Hz), 50-400, over time for “legumes are a good source of VITAMINS”]
The ‘ripples’
[F0 contour for “legumes are a good source of VITAMINS”, with gaps where F0 is undefined at the voiceless [s] and [t]]
F0 is not defined for consonants without vocal
fold vibration.
The ‘ripples’
[F0 contour for the same sentence, showing local perturbations around the consonants [g], [z], [v]]
... and F0 can be perturbed by consonants with
an extreme constriction in the vocal tract.
Abstraction of the F0 contour
[Smoothed abstraction of the F0 contour for “legumes are a good source of VITAMINS”]
Our perception of the intonation contour abstracts
away from these perturbations.
The ‘waves’ and the ‘swells’
[F0 contour with local ‘waves’ (accents) riding on a global ‘swell’ (phrase) for “legumes are a good source of VITAMINS”]
TOPIC #2
Accent Placement and
Intonational Tunes
Stress vs. accent
• Stress is a structural property of a word — it marks a potential
(arbitrary) location for an accent to occur, if there is one.
• Accent is a property of a word in context — it is a way to mark
intonational prominence in order to ‘highlight’ important words in the
discourse.
[Metrical grids for “vi-ta-mins” and “Ca-li-for-nia”: all syllables on the bottom row, full vowels marked above, stressed syllables above that, and the (optional) accented syllable on top]
Stress vs. accent (2)
• The speaker decides to make the word vitamin more
prominent by accenting it.
• Lexical stress tells us that this prominence will appear on the first syllable, hence VItamin.
Which word receives an accent?
• It depends on the context. For example, the ‘new’ information in
the answer to a question is often accented, while the ‘old’
information usually is not.
– Q1: What types of foods are a good source of vitamins?
– A1: LEGUMES are a good source of vitamins.
– Q2: Are legumes a source of vitamins?
– A2: Legumes are a GOOD source of vitamins.
– Q3: I’ve heard that legumes are healthy, but what are they a good
source of ?
– A3: Legumes are a good source of VITAMINS.
Same ‘tune’, different alignment
[F0 contour: “LEGUMES are a good source of vitamins”, with the rise-fall accent on “legumes”]
The main rise-fall accent (= “I assert this”) shifts locations.
Same ‘tune’, different alignment
[F0 contour: “Legumes are a GOOD source of vitamins”, with the rise-fall accent on “good”]
The main rise-fall accent (= “I assert this”) shifts locations.
Same ‘tune’, different alignment
[F0 contour: “legumes are a good source of VITAMINS”, with the rise-fall accent on “vitamins”]
The main rise-fall accent (= “I assert this”) shifts locations.
Broad focus
400
350
300
250
200
150
100
50
legumes are a good source of vitamins
In the absence of narrow focus, English tends to mark the first
and last ‘content’ words with perceptually prominent accents.
Yes-No question tune
[F0 contour: “are LEGUMES a good source of vitamins”, rising to the end]
Rise from the main accent to the end of the sentence.
Yes-No question tune
[F0 contour: “are legumes a GOOD source of vitamins”, rising to the end]
Rise from the main accent to the end of the sentence.
Yes-No question tune
[F0 contour: “are legumes a good source of VITAMINS”, rising to the end]
Rise from the main accent to the end of the sentence.
WH-questions
[I know that many natural foods are healthy, but ...]
[F0 contour: “WHAT are a good source of vitamins”, with a falling contour]
WH-questions typically have falling contours, like statements.
Rising statements
[F0 contour: “legumes are a good source of vitamins” with a high rise at the end]
[... does this statement qualify?]
High-rising statements can signal that the speaker
is seeking approval.
‘Surprise-redundancy’ tune
[How many times do I have to tell you ...]
[F0 contour: “legumes are a good source of vitamins”, starting low and rising gradually to a high at the end]
Low beginning followed by a gradual rise to a high at the end.
‘Contradiction’ tune
“I’ve heard that linguini is a good source of vitamins.”
[F0 contour: “linguini isn’t a good source of vitamins”, with a sharp initial fall, a low flat stretch, and a final rise]
[... how could you think that?]
Sharp fall at the beginning, flat and low, then rising at the end.
TOPIC #3
Intonational phrasing
and disambiguation
A single intonation phrase
[F0 contour: one intonation tune spanning “legumes are a good source of vitamins”]
Broad focus statement consisting of one intonation phrase
(that is, one intonation tune spans the whole unit).
Multiple phrases
[F0 contour: “legumes” as its own phrase, followed by “are a good source of vitamins”]
Utterances can be ‘chunked’ up into smaller phrases
in order to signal the importance of information in each unit.
Phrasing can disambiguate
• Global ambiguity:
Sally saw % the man with the binoculars.
Sally saw the man % with the binoculars.
Phrasing can disambiguate
• Temporary ambiguity:
When Madonna sings the song ...
Phrasing can disambiguate
• Temporary ambiguity:
When Madonna sings the song is a hit.
Phrasing can disambiguate
• Temporary ambiguity:
When Madonna sings % the song is a hit.
When Madonna sings the song % it’s a hit.
[from Speer & Kjelgaard (1992)]
Phrasing can disambiguate
[F0 contour: “I met Mary and Elena’s mother at the mall yesterday” as one intonation phrase, with “Mary & Elena’s mother” and “mall” in a relatively flat overall pitch range]
One intonation phrase with relatively flat overall pitch range.
Phrasing can disambiguate
[F0 contour: the same sentence with “Mary”, “Elena’s mother”, and “mall” in separate phrases, with expanded pitch movements]
Separate phrases, with expanded pitch movements.
TOPIC #4
The TOBI Intonational
Transcription Theory
ToBI: Tones and Break Indices
• Pitch accent tones
– H* “peak accent”
– L* “low accent”
– L+H* “rising peak accent” (contrastive)
– L*+H “scooped accent”
– H+!H* “downstepped high”
• Boundary tones
– L-L% (final low; Am. Eng. declarative contour)
– L-H% (continuation rise)
– H-H% (yes-no question)
• Break indices
– 0: clitics; 1: word boundaries; 2: short pause
– 3: intermediate intonation phrase boundary
– 4: full intonation phrase/final boundary
Examples of the TOBI system
• I don’t eat beef.
    L*      L*  L*  L-L%
• Marianna made the marmalade.
    H*                  L-L%
    L*                  H-H%
• “I” means insert.
    H*  H*  H*  L-L%
    H*  L-  H*  L-L%   (break index 1 vs. 3 after “I”)
ToBI
• http://www.ling.ohio-state.edu/~tobi/
• TOBI for American English
– http://www.ling.ohio-state.edu/~tobi/ame_tobi/
• Silverman, K., Beckman, M., Pitrelli, J., Ostendorf, M., Wightman, C., Price, P.,
Pierrehumbert, J., and Hirschberg, J. (1992). ToBI: a standard for labelling English
prosody. In Proceedings of ICSLP92, volume 2, pages 867-870
• Pitrelli, J. F., Beckman, M. E., and Hirschberg, J. (1994). Evaluation of prosodic
transcription labeling reliability in the ToBI framework. In ICSLP94, volume 1, pages
123-126
• Pierrehumbert, J., and Hirschberg, J. (1990). The meaning of intonation contours in the interpretation of discourse. In P. R. Cohen, J. Morgan, and M. E. Pollack, eds., Intentions in Communication, 271-311. MIT Press.
• Beckman and Elam. Guidelines for ToBI Labelling. Web.
TOPIC #5
PRODUCING
INTONATION IN TTS
Intonation in TTS
1) Accent: Decide which words are accented, which syllable
has accent, what sort of accent
2) Boundaries: Decide where intonational boundaries are
3) Duration: Specify length of each segment
4) F0: Generate F0 contour from these
TOPIC #5a
Predicting pitch accent
Factors in accent prediction
• Contrast
– Legumes are a poor source of VITAMINS
– No, legumes are a GOOD source of vitamins
– I think JOHN and MARY should go
– No, I think JOHN AND MARY should go
But it’s more than just contrast
• List intonation:
• I went and saw ANNA, LENNY, MARY, and NORA.
In fact, accents are common!
• A Broadcast News example from Hirschberg (1993)
• SUN MICROSYSTEMS INC, the UPSTART COMPANY that HELPED
LAUNCH the DESKTOP COMPUTER industry TREND TOWARD
HIGH powered WORKSTATIONS, was UNVEILING an ENTIRE
OVERHAUL of its PRODUCT LINE TODAY. SOME of the new
MACHINES, PRICED from FIVE THOUSAND NINE hundred NINETY
five DOLLARS to seventy THREE thousand nine HUNDRED dollars,
BOAST SOPHISTICATED new graphics and DIGITAL SOUND
TECHNOLOGIES, HIGHER SPEEDS AND a CIRCUIT board that
allows FULL motion VIDEO on a COMPUTER SCREEN.
Factors in accent prediction
• Part of speech:
– Content words are usually accented
– Function words are rarely accented
• Of, for, in, on, that, the, a, an, no, to, and, but, or, will, may, would, can, her, is, their, its, our, there, am, are, was, were, etc.
Factors in accent prediction
• Word order: preposed items are accented more frequently
– TODAY we will BEGIN to LOOK at FROG anatomy.
– We will BEGIN to LOOK at FROG anatomy today.
Factors in Accent Prediction
• Information status: new versus old information
– Old information is deaccented
– There are LAWYERS, and there are GOOD lawyers
Complex Noun Phrase Structure
• Sproat, R. 1994. English noun-phrase accent prediction for text-to-speech.
Computer Speech and Language 8:79-94.
• Proper Names, stress on right-most word
– New York CITY; Paris, FRANCE
• Adjective-Noun combinations, stress on noun
– Large HOUSE, red PEN, new NOTEBOOK
• Noun-Noun compounds: stress left noun
– HOTdog (food) versus HOT DOG (overheated animal)
– WHITE house (place) versus WHITE HOUSE (made of stucco)
• Examples (and exceptions):
– Madison AVENUE, but PARK street; MEDICAL building
– APPLE cake, but cherry PIE
• Some Rules:
– Furniture+Room -> RIGHT (e.g., kitchen TABLE)
– Proper-name + Street -> LEFT (e.g. PARK street)
Other features
• POS
• POS of previous word
• POS of next word
• Stress of current, previous, next syllable
• Unigram probability of word
• Bigram probability of word
• Position of word in sentence
State of the art
• Hand-label large training sets
• Use CART, SVM, CRF, etc. to predict accent
• Lots of rich features from context
• Classic lit:
– Hirschberg, Julia. 1993. Pitch accent in context: predicting intonational prominence from text. Artificial Intelligence 63:305-340.
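A minimal sketch of accent prediction as supervised classification over features like those above; the POS encoding, feature values, and labels are invented for illustration:

from sklearn.tree import DecisionTreeClassifier

POS = {"NN": 0, "VB": 1, "JJ": 2, "DT": 3, "IN": 4}

def feats(pos, prev_pos, next_pos, unigram_logp, position):
    return [POS[pos], POS[prev_pos], POS[next_pos], unigram_logp, position]

# Toy pattern: content words accented, function words not,
# with word (log) probability as an extra cue.
X = [feats("DT", "IN", "NN", -2.0, 0), feats("NN", "DT", "VB", -8.0, 1),
     feats("VB", "NN", "DT", -7.5, 2), feats("IN", "VB", "DT", -1.5, 3)]
y = [0, 1, 1, 0]                     # 1 = accented

clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([feats("JJ", "DT", "NN", -9.0, 1)]))   # rare adjective: accented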
TOPIC #5b
Predicting boundaries
Predicting Boundaries
• Intonation phrase boundaries
– Intermediate phrase boundaries
– Full intonation phrase boundaries
More examples
• From Ostendorf and Veilleux. 1994 “Hierarchical
Stochastic model for Automatic Prediction of Prosodic
Boundary Location”, Computational Linguistics 20:1
• Computer phone calls, || which do everything | from selling
magazine subscriptions || to reminding people about
meetings || have become the telephone equivalent | of junk
mail. ||
• Doctor Norman Rosenblatt, || dean of the college | of
criminal justice at Northeastern University, || agrees.||
• For WBUR, || I’m Margo Melnicove.
Ostendorf and Veilleux CART
TOPIC #5c
Predicting duration
Duration
• Simplest: fixed size for all phones (100 ms)
• Next simplest: average duration for that phone (from
training data). Samples from SWBD in ms:
  aa  118        b   68
  ax   59        d   68
  ay  138        dh  44
  eh   87        f   90
  ih   77        g   66
• Next Next Simplest: add in phrase-final and initial
lengthening plus stress:
• Better: average duration for each triphone
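A sketch of these simple duration models, using the SWBD averages above plus invented stress and phrase-final lengthening factors:

AVG_MS = {"aa": 118, "ax": 59, "ay": 138, "eh": 87, "ih": 77,
          "b": 68, "d": 68, "dh": 44, "f": 90, "g": 66}

def duration_ms(phone, stressed=False, phrase_final=False):
    d = AVG_MS.get(phone, 100)      # fall back to the fixed 100 ms default
    if stressed:
        d *= 1.2                    # illustrative stress lengthening
    if phrase_final:
        d *= 1.4                    # illustrative phrase-final lengthening
    return d

print(duration_ms("aa", stressed=True, phrase_final=True))   # 118 * 1.2 * 1.4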
Duration in Festival (2)
• Klatt duration rules. Modify duration based on:
– Position in clause
– Syllable position in word
– Syllable type
– Lexical stress
– Left+right context phone
– Prepausal lengthening
• Festival: 2 options
– Klatt rules
– Use labeled training set with Klatt features to train CART
Duration: state of the art
• Lots of fancy models of duration prediction:
– Using Z-scores and other clever normalizations
– Sum-of-products model
– New features like word predictability
• Words with higher bigram probability are shorter
TOPIC #5d
F0 Generation
F0 Generation
• Generation in Festival
– F0 Generation by rule
– F0 Generation by linear regression
• Some constraints
– F0 is constrained by accents and boundaries
– F0 declines gradually over an utterance (“declination”)
F0 Generation by rule
• Generate a list of target F0 points for each syllable
• Here’s a rule to generate a simple H* “hat” accent (with fixed, speaker-specific F0 values):
(define (targ_func1 utt syl)
  "(targ_func1 UTT STREAMITEM)
Returns a list of targets for the given syllable."
  (let ((start (item.feat syl 'syllable_start))
        (end (item.feat syl 'syllable_end)))
    (if (equal? (item.feat syl "R:Intonation.daughter1.name") "Accented")
        (list
         (list start 110)
         (list (/ (+ start end) 2.0) 140)
         (list end 100)))))
F0 generation by regression
• Supervised machine learning again
• We predict: value of F0 at 3 places in each syllable
• Predictor features:
– Accent of current word, next word, previous word
– Boundaries
– Syllable type, phonetic information
– Stress information
• Need training sets with pitch accents labeled
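A minimal sketch of F0-by-regression: a single linear regression jointly predicting F0 at three points per syllable. The feature set, target values, and data are toy stand-ins:

import numpy as np
from sklearn.linear_model import LinearRegression

# Features per syllable: [accented, boundary_follows, stressed, position_in_phrase]
X = np.array([[1, 0, 1, 0], [0, 0, 0, 1], [0, 1, 0, 2],
              [1, 1, 1, 0], [0, 0, 1, 1]])
# Targets: F0 (Hz) at the start, middle, and end of each syllable.
Y = np.array([[110, 140, 120], [105, 108, 100], [100, 95, 80],
              [115, 150, 110], [108, 112, 104]])

model = LinearRegression().fit(X, Y)           # multi-output: one fit per target point
print(model.predict([[1, 0, 1, 0]]).round())   # accented phrase-initial syllable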
Outline
• History of TTS
• Architecture
• Text Processing
• Letter-to-Sound Rules
• Prosody
• Waveform Generation
• Evaluation
Articulatory Synthesis
The vocal tract is divided into a large number of short tubes,
as in the electrical transmission line analog, which are then
combined and resonant frequencies calculated.
from Sinder, 1999 (thesis work with Flanagan, Rutgers)
Formant Synthesis
• Instead of specifying mouth shapes, formant synthesis specifies frequencies and bandwidths of resonators, which are used to filter a source waveform.
• Formant frequency analysis is difficult; bandwidth estimation is even more difficult. But the biggest perceptual problem in formant synthesis is not in the resonances, but in a “buzzy” quality most likely due to the glottal source model.
• Formant synthesis can sound identical to a natural utterance if details of the glottal source and formants are well modeled.
[Audio example: NATURAL SPEECH vs. SYNTHETIC SPEECH (John Holmes, 1973)]

Each formant i is realized as a second-order resonator:
$$H_i(z) = \frac{1}{1 - 2e^{-\pi b_i}\cos(2\pi f_i)\, z^{-1} + e^{-2\pi b_i}\, z^{-2}}$$
where f_i and b_i are the formant frequency and bandwidth, normalized by the sampling rate.
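A sketch of one such second-order digital resonator, derived directly from the transfer function above; the sampling rate, formant values, and impulse-train excitation are chosen just for illustration:

import numpy as np

def resonator_coeffs(f, b, fs=10000.0):
    """Coefficients of y[n] = x[n] + B*y[n-1] + C*y[n-2] for formant f, bandwidth b (Hz)."""
    C = -np.exp(-2 * np.pi * b / fs)
    B = 2 * np.exp(-np.pi * b / fs) * np.cos(2 * np.pi * f / fs)
    return B, C

def resonate(x, f, b, fs=10000.0):
    B, C = resonator_coeffs(f, b, fs)
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = x[n] + (B * y[n-1] if n >= 1 else 0.0) + (C * y[n-2] if n >= 2 else 0.0)
    return y

x = np.zeros(1000)
x[::100] = 1.0                       # 100 Hz impulse train at fs = 10 kHz
y = resonate(x, f=500.0, b=60.0)     # one F1-like resonance; cascade several for a vowel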
Klatt’s formant synthesizer
[Block diagram: an impulse generator (AV) with glottal resonators RGP/RGZ/RGS and a random-number noise source (AH; AF modulated and low-pass filtered) feed two branches: a cascade transfer function of resonators R1–R5 with nasal pole/zero RNP/RNZ, and a parallel transfer function of amplitude-weighted resonators (AN·RNP, A1·R1 … A6·R6, plus bypass AB); the branch outputs are summed and first-differenced to produce the waveform]
Klatt’s parameter values
N   Symbol  Name                                      Min    Max    Typ
1   AV      Amplitude of voicing (dB)                 0      80     0
2   AF      Amplitude of frication (dB)               0      80     0
3   AH      Amplitude of aspiration (dB)              0      80     0
4   AVS     Amplitude of sinusoidal voicing (dB)      0      80     0
5   F0      Fundamental frequency (Hz)                0      500    0
6   F1      First formant frequency (Hz)              150    900    500
7   F2      Second formant frequency (Hz)             500    2500   1500
8   F3      Third formant frequency (Hz)              1300   3500   2500
9   F4      Fourth formant frequency (Hz)             2500   4500   3500
10  FNZ     Nasal zero frequency (Hz)                 200    700    250
11  AN      Nasal formant amplitude (dB)              0      80     0
12  A1      First formant amplitude (dB)              0      80     0
13  A2      Second formant amplitude (dB)             0      80     0
14  A3      Third formant amplitude (dB)              0      80     0
15  A4      Fourth formant amplitude (dB)             0      80     0
16  A5      Fifth formant amplitude (dB)              0      80     0
17  A6      Sixth formant amplitude (dB)              0      80     0
18  AB      Bypass path amplitude (dB)                0      80     0
19  B1      First formant bandwidth (Hz)              40     500    50
20  B2      Second formant bandwidth (Hz)             40     500    70
21  B3      Third formant bandwidth (Hz)              40     500    110
22  SW      Cascade/parallel switch                   0      1      0
…
32  FNP     Nasal pole frequency (Hz)                 200    500    250
33  BNP     Nasal pole bandwidth (Hz)                 50     500    100
34  BNZ     Nasal zero bandwidth (Hz)                 50     500    100
35  BGS     Glottal resonator 2 bandwidth (Hz)        100    1000   200
36  SR      Sampling rate (Hz)                        500    20000  10000
37  NWS     Number of waveform samples per chunk      1      200    50
38  G0      Overall gain control (dB)                 0      80     48
39  NFC     Number of cascaded formants               4      6      5
Formant systems: Rule-Based Synthesis
• For synthesis of arbitrary text, formants and bandwidths for each phoneme are determined by analyzing the speech of a single person.
• The model of each phoneme may be a single set of formant frequencies and bandwidths for a canonical phoneme at a single point in time, or a trajectory of frequencies, bandwidths, and source models over time.
• The formant frequencies for each phoneme are combined over time using a model of coarticulation, such as Klatt’s modified locus theory.
• Duration, pitch, and energy rules are applied.
• Result: something like this:
Concatenative Synthesis
• Copy synthesis sounds great, but synthesis by rule using formants does not. Why? A problem with the glottal source? With coarticulation and formant transitions? With prosody?
• Formant synthesis was the main TTS technique until the early-to-mid 1990s, when increasing memory size and CPU speed allowed concatenative synthesis to become a viable approach.
• Concatenative synthesis uses recordings of small units of speech (typically the region from the middle of one phoneme to the middle of another phoneme, a diphone unit), and glues these units together to form words and sentences.
• We don’t have to worry about glottal source models or coarticulation, since the synthesis is just a concatenation of different waveforms containing “natural” glottal source and coarticulation.
Concatenative Synthesis: Units
• The basic unit for concatenative synthesis is the diphone, e.g. for “John”:
  sil-jh   jh-aa   aa-n   n-sil
• More recent TTS research is on using larger units. Issues include:
  – how to decide what units will be used?
  – how to select the best unit from a very large database?
• With increasing size and variety of units, there is exponential growth in the database size. Yet, despite massive databases that may take months to record, coverage is nowhere near complete. There is a very large number of infrequent events in speech.
Joining Units
• Dumb:
– just join
– Better: at low amplitude regions
• TD-PSOLA
– Time-domain pitch-synchronous overlap-and-add
– Join at pitch periods (with windowing)
Diphone boundaries in stops
Prosodic Modification
• Modifying pitch and duration independently
• Changing sample rate modifies both:
– “Alvin and the Chipmunks” speech
• Duration: duplicate/remove parts of the signal
• Pitch: resample to change pitch
Speech as Short Term signals
Duration modification
• Duplicate/remove short term signals
Pitch Modification
• Move short-term signals closer together/further apart
Overlap and Add (OLA)
• Hanning windows of length 2N used to multiply the analysis
signal
• Resulting windowed signals are added
• Analysis windows, spaced 2N
• Synthesis windows, spaced N
• Time compression is uniform with factor of 2
• Pitch periodicity somewhat lost around 4th window
TD-PSOLA ™
• Time-Domain Pitch Synchronous Overlap and Add
• Patented by France Telecom (CNET)
• Very efficient
– No FFT (or inverse FFT) required
• Can modify Hz up to two times or by half
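A toy TD-PSOLA-style duration change: window the signal pitch-synchronously, then repeat or drop frames while keeping the output pitch marks at the original spacing (so pitch is preserved). Real systems use measured epoch marks; here they are simply assumed:

import numpy as np

def psola_stretch(x, epochs, factor):
    """Lengthen (factor > 1) or shorten speech at roughly constant pitch."""
    period = int(np.median(np.diff(epochs)))     # nominal pitch period in samples
    win = np.hanning(2 * period)
    n_out = int(len(epochs) * factor)
    y = np.zeros(n_out * period + 2 * period)
    for k in range(n_out):
        e = epochs[min(int(k / factor), len(epochs) - 1)]   # reuse/skip input frames
        frame = x[max(e - period, 0):e + period]
        start = k * period                  # output marks keep the original spacing
        y[start:start + len(frame)] += frame * win[:len(frame)]
    return y[:int(len(x) * factor)]

fs = 10000
x = np.sin(2 * np.pi * 100 * np.arange(fs) / fs)   # 1 s of a 100 Hz tone
epochs = np.arange(50, len(x) - 100, 100)          # assumed pitch marks, one per period
slow = psola_stretch(x, epochs, 1.5)               # 50% longer, same pitch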
HMM-Based synthesis
• Generate the most likely sequence of spectral (e.g. MFCCs)
and excitation (F0) parameters for the given phone sequence
using HMM
• Create a filter using the spectral parameters
• Pass the excitation parameters (F0, noise) through the filter to
generate the waveform
Block Diagram
• Zen & Toda (2005)
HMM parameter generation
• Each model represents a phone or a subphone (diphone,
triphone, etc.)
• Each model consists of multiple states
– Tri-state model
– Each Gaussian mixture represented by a different state with the
transitional probability as the mixture weight
• Each state emits spectral/F0 feature vector
– 12~13 MFCCs, deltas, (delta-deltas)
– F0, delta, (delta-delta)
Problem for HMM parameter generation
• We know which models to concatenate in what order
• We do NOT know
– which state in the model to use to generate each frame
– which value to choose from a set of values observable within
each state
Tokuda et al. (1995)
• We need to solve
$$\hat{c} = \arg\max_{c} P(O, q \mid \lambda)$$
– O: observation sequence, where each observation is a feature vector consisting of MFCCs (c) and their deltas (Δc)
– q: state sequence
– λ: HMM
• The problem is that we don’t exactly know what q is
Solution – (1)
$$\arg\max_{c} P(O, q \mid \lambda) = \arg\max_{c} P(O \mid q, \lambda)\, P(q \mid \lambda) = \arg\max_{c} P(O \mid q, \lambda)$$
Let’s assume that we know the state sequence q:
$$P(O \mid q, \lambda) = b_{q_1}(o_1)\, b_{q_2}(o_2)\, b_{q_3}(o_3) \cdots b_{q_T}(o_T)$$
$$b_{q_t}(o_t) = N(c_t;\, \mu_{q_t}, \Sigma_{q_t})$$
$$N(o;\, \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^N |\Sigma|}}\, \exp\!\left(-\tfrac{1}{2}\,(o-\mu)'\, \Sigma^{-1}\, (o-\mu)\right)$$
Solution – (1)
• Problem: the generated spectrum is not smooth
[Figure: MFCC1 over time is piecewise constant, jumping at state boundaries]
Solution – (2) Add deltas
$$P(O \mid q, \lambda) = b_{q_1}(o_1)\, b_{q_2}(o_2) \cdots b_{q_T}(o_T), \qquad b_{q_t}(o_t) = N(c_t;\, \mu_{q_t}, \Sigma_{q_t}) \cdot N(\Delta c_t;\, \mu^{\Delta}_{q_t}, \Sigma^{\Delta}_{q_t})$$
1. Differentiate the log-probability with respect to c_t
2. Solve the set of linear equations for c_t
[Figure: the generated MFCC1 trajectory is now smooth over time]
Digression: Delta
• Simple calculation of delta:
$$d_t = \frac{c_{t+1} - c_{t-1}}{2}$$
• More robust calculation of delta:
$$d_t = \frac{\sum_{l=1}^{L} l\,(c_{t+l} - c_{t-l})}{2\sum_{l=1}^{L} l^2}$$
• Typically rewritten as below, where w_l is derived from the above:
$$d_t = \sum_{l=-L}^{L} w_l\, c_{t+l}$$
Finding the state sequence
• Recall that our problem was that we do NOT know
  1) which state in the model to use to generate each frame
  2) which value to choose from a set of values observable within each state
• The solution discussed thus far solved (2), assuming that we know the answer to (1)
• To really solve the problem, we should consider all possible state sequences and choose the c that gives the highest observation probability
• Directly solving the equation for all possible state sequences takes too much time
How about excitation?
• Unvoiced speech: white noise. This is fine!
• Voiced speech: Impulse train
h[n]
• Problems:
– Voiced speech has frication
– Hard decisions are hard
How about excitation?
• Use a mixed excitation model
h[n]
+
g[n]
• Learn model parameters from data with HMM
• Multi-band noise is better
HMM-Based concatenative Synthesis
• Given a big database
• Find string of units that maximizes probability of HMM
$$U^{*} = \arg\max_{U} p(U \mid \lambda) = \arg\max_{U} \prod_{t=1}^{T} p(u_t \mid \lambda)$$
• Intrasegment scores can be precomputed
• Concatenation scores could also be precomputed
– All possible joints (could be large!)
– Delta means and variances at boundaries are the key!
• Good job at concatenation matching!
• How about prosody?
• Use HMM too!
Stylistic TTS
[Block diagram (TTS in Windows since Windows 2000): Text → Text Analysis (rules) → Letter-to-Sound (dictionary and rules) → Prosody → Waveform concatenation → Speech; the stylistic version swaps in a read-speech voice for prosody and a database of recorded speech for concatenation]
Thanks to Min Chu, MSR Asia
Outline
• History of TTS
• Architecture
• Text Processing
• Letter-to-Sound Rules
• Prosody
• Waveform Generation
• Evaluation
Evaluation of TTS
• Intelligibility Tests
– Diagnostic Rhyme Test (DRT)
• Humans do listening identification choice between two words differing
by a single phonetic feature
– Voicing, nasality, sustenation, sibilation
• 96 rhyming pairs
• Veal/feel, meat/beat, vee/bee, zee/thee, etc
– Subject hears “veal”, chooses either “veal” or “feel”
– Subject also hears “feel”, chooses either “veal” or “feel”
• % of right answers is the intelligibility score
• Overall Quality Tests
– Have listeners rate speech on a scale from 1 (bad) to 5 (excellent)
• Preference Tests (prefer A, prefer B)