Speech Synthesis: Then and Now Julia Hirschberg CS 4706 7/15/2016

Today
• Early speech synthesizers
– Articulatory synthesis
– Formant (acoustic) synthesis
– Concatenative synthesis
• Components of a Modern TTS System
Synthesizer Components
• Front end: From input to control parameters
– From acoustic/phonetic representations
– From naturally occurring text
– From constrained mark-up language
– From semantic/conceptual representations
• Back end: From control parameters to waveform
– Articulatory synthesis
– Formant/acoustic synthesis
– Concatenative synthesis
The First ‘Speaking Machine’
• Wolfgang von Kempelen, Mechanismus der menschlichen Sprache nebst Beschreibung einer sprechenden Maschine, 1791 (the machine is still in the Deutsches Museum and still playable)
• First to produce whole words, phrases – in many languages
Joseph Faber’s Euphonia, 1846
• Constructed 1835 with pedal and keyboard control
– Whispered and ordinary speech
– Model of tongue, pharyngeal cavity with changeable shape
– Singing too: “God Save the Queen”
• Modern Articulatory Synthesis: Dennis Klatt (1987)
• World’s Fair in NY, 1939
• Requires much training to ‘play’
• Purpose: reduce the bandwidth needed to transmit speech, so that many phone calls can be sent over a single line
• Answers:
– These days a chicken leg is a rare dish.
– It’s easy to tell the depth of a well.
– Four hours of steady work faced us.
• ‘Automatic’ synthesis from spectrogram, but can also use hand-painted spectrograms as input
• Purpose: understand perceptual effect of spectral details
Formant/Resonance/Acoustic Synthesis
• Parametric or resonance synthesis
– Specify minimal parameters, e.g. f0 and the first 3 formants
– Pass an electronic source signal through a filter
• Harmonic tone for voiced sounds
• Aperiodic noise for unvoiced
• Filter simulates the different resonances of the vocal tract
• E.g.
– Walter Lawrence’s Parametric Artificial Talker (1953) for vowels and consonants
– Gunnar Fant’s Orator Verbis Electris (1953) for vowels
– Formant synthesis download (demo)
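The source-filter idea above can be sketched in a few lines of Python. This is a minimal illustration, not any historical synthesizer: the source is an impulse train at f0 (voiced) or white noise (unvoiced), and each formant is a second-order digital resonator applied in cascade. The function names, the formant frequencies and bandwidths for an "ah"-like vowel, and the sample rate are all assumptions chosen for illustration.

```python
import math
import random


def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Second-order digital resonator modelling one formant."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalizes the gain to 1 at 0 Hz
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def source(n_samples, sample_rate, f0_hz=None, seed=0):
    """Impulse train at f0 for voiced sounds; white noise when f0_hz is None."""
    if f0_hz is None:
        rng = random.Random(seed)
        return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]
    period = int(round(sample_rate / f0_hz))
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]


def synthesize(formants, f0_hz, duration_s=0.3, sample_rate=16000):
    """Pass the source through one resonator per (frequency, bandwidth) pair."""
    n = int(round(duration_s * sample_rate))
    signal = source(n, sample_rate, f0_hz)
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    peak = max(abs(s) for s in signal) or 1.0
    return [s / peak for s in signal]  # scale to the range [-1, 1]


# Illustrative (assumed) formant values for an "ah"-like vowel: (Hz, bandwidth Hz)
AH_FORMANTS = [(730, 60), (1090, 90), (2440, 120)]
vowel = synthesize(AH_FORMANTS, f0_hz=120)
```

Calling synthesize(AH_FORMANTS, f0_hz=None) swaps the harmonic source for aperiodic noise while keeping the same filter, which gives a whispered version of the same vowel quality.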
Synthesis by Computer
• Beginnings ~1960; dominant from 1970 onward
Concatenative Synthesis
• Most common type today
• First practical application in 1936: British phone company’s Talking Clock
– Optical storage for words, part-words, phrases
– Concatenated to tell time
• E.g.
• And a ‘similar’ example
• Bell Labs TTS (1977) (1985)
Variants of Concatenative Synthesis
• Inventory units
– Diphone synthesis (e.g. Festival)
– Microsegment synthesis
– “Unit Selection” – large, variable units
• Issues
– How well do units fit together?
– What is the perceived acoustic quality of the concatenated units?
– Is post-processing of the output possible, to improve quality?
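To make the diphone inventory concrete, the sketch below turns a phone sequence into the diphone units a synthesizer would need to look up, and scores how well two units fit together with a simple distance between feature vectors at the join. The function names, the "sil" padding symbol, and the use of Euclidean distance are illustrative assumptions, not the method of Festival or any other system.

```python
def to_diphones(phones, silence="sil"):
    """Pair each phone with its successor, padding both ends with silence."""
    padded = [silence] + list(phones) + [silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


def join_cost(left_frame, right_frame):
    """Euclidean distance between feature vectors on either side of a join.

    A small value suggests the two units fit together smoothly.
    """
    return sum((l - r) ** 2 for l, r in zip(left_frame, right_frame)) ** 0.5


print(to_diphones(["k", "ae", "t"]))
# ['sil-k', 'k-ae', 'ae-t', 't-sil']
```

Each diphone runs from the middle of one phone to the middle of the next, so joins fall in the relatively stable part of a phone rather than at the transition.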
TTS Production Levels: Back End and Front End
• Orthographic input: The children read to Dr. Smith
• Levels of knowledge: world knowledge, semantics, syntax, lexical, phonology, acoustics
• Processing steps, in order:
– Text normalization
– Word pronunciation
– Intonation assignment
– Intonation realization (F0, amplitude, duration)
– Synthesis
Text Normalization
• Reading is what W. hates most.
• Reading is what Wilde hated most.
• Have the students read the questions.
• In 1996 she sold 1995 shares and deposited $42 in her 401(k).
• The duck dove supply.
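The numeric example above shows why normalization needs context: 1996 is read as a year, 1995 as a quantity, and $42 as a currency amount. The sketch below handles just those three cases with hand-written rules. The rule that a four-digit number after "in" is a year is a deliberately crude assumption, the function names are hypothetical, and tokens such as 401(k) are passed through unchanged.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def cardinal(n):
    """Spell out an integer from 0 to 9999 as a cardinal number."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + cardinal(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return ONES[thousands] + " thousand" + (" " + cardinal(rest) if rest else "")


def year(n):
    """Read a four-digit year as two pairs of digits, e.g. 1996."""
    if 2000 <= n <= 2009:
        return cardinal(n)
    high, low = divmod(n, 100)
    if low == 0:
        return cardinal(high) + " hundred"
    if low < 10:
        return cardinal(high) + " oh " + ONES[low]
    return cardinal(high) + " " + cardinal(low)


def normalize(text):
    """Expand dollar amounts, years, and plain integers; leave the rest alone."""
    out, prev = [], ""
    for token in text.split():
        core = token.rstrip(".,;:!?")
        tail = token[len(core):]
        if re.fullmatch(r"\$\d+", core):
            word = cardinal(int(core[1:])) + " dollars"
        elif re.fullmatch(r"\d{4}", core) and prev.lower() == "in":
            word = year(int(core))  # assumed heuristic: "in" + 4 digits = year
        elif re.fullmatch(r"\d+", core) and int(core) < 10000:
            word = cardinal(int(core))
        else:
            word = core
        out.append(word + tail)
        prev = core
    return " ".join(out)


print(normalize("In 1996 she sold 1995 shares and deposited $42 in her 401(k)."))
```

On the example sentence this reads 1996 as "nineteen ninety-six" and 1995 as "one thousand nine hundred ninety-five". The other examples on the slide (W. versus Wilde, the tense of "read", "dove") need lexical and syntactic knowledge that simple pattern rules like these cannot supply.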
Pronunciation in Context
Intonation Assignment: Phrasing
• Traditional: hand-built rules
– Punctuation: 234-5682
– Context/function word: no breaks after a function word, e.g. He went to dinner
– Parse? She favors the nuts and bolts approach
• Current: statistical analysis of a large labeled corpus
– Punctuation, part-of-speech window, utterance length, …
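A hand-built phrasing rule set of the traditional kind might look like the sketch below. It breaks after punctuation, and otherwise breaks once a phrase reaches a maximum length, but never directly after a function word. The function-word list is a tiny illustrative sample, and the max_len length rule is a hypothetical addition used only to show the function-word constraint in action.

```python
# Tiny illustrative sample; a real system would use a much fuller list.
FUNCTION_WORDS = {"a", "an", "the", "to", "of", "in", "on", "at",
                  "and", "or", "he", "she", "it"}


def split_phrases(text, max_len=4):
    """Split text into intonational phrases using simple hand-built rules."""
    phrases, current = [], []
    words = text.split()
    for i, word in enumerate(words):
        current.append(word)
        bare = word.strip(".,;:!?").lower()
        at_punctuation = word[-1] in ".,;:!?"
        is_last = i == len(words) - 1
        long_enough = len(current) >= max_len
        # Rule: never place a length-based break right after a function word.
        if at_punctuation or is_last or (long_enough and bare not in FUNCTION_WORDS):
            phrases.append(" ".join(current))
            current = []
    return phrases


print(split_phrases("He went to dinner", max_len=3))
# ['He went to dinner']
print(split_phrases("He went to dinner", max_len=2))
# ['He went', 'to dinner']
```

With max_len=3 the length rule alone would split after "to", but the function-word rule blocks that break, so the sentence stays in one phrase.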
Intonation Assignment: Accent
• Hand-built rules
– Function/content distinction: He went out the back door / He threw out the trash
– Complex nominals:
• Main Street/Park Avenue
• city hall parking lot
• Statistical procedures trained on large corpora
• Contrastive stress, given/new distinction?
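The function/content rule can be sketched as a lookup on part-of-speech tags. The example below uses Penn Treebank style tags supplied by hand (an assumption; a real front end would run a tagger), and treats the particle tag RP as accentable while the preposition tag IN is not, which is one way to separate the two uses of "out" in the example sentences. The tag set and function names are illustrative only.

```python
# Tags whose words receive a pitch accent in this sketch (assumed set).
ACCENTED_TAGS = {"NN", "NNS", "NNP", "VB", "VBD", "VBZ", "JJ", "RB", "RP"}


def assign_accents(tagged_words):
    """Return (word, accented) pairs from (word, tag) pairs."""
    return [(word, tag in ACCENTED_TAGS) for word, tag in tagged_words]


def render(tagged_words):
    """Show accented words in upper case for quick inspection."""
    return " ".join(word.upper() if accented else word
                    for word, accented in assign_accents(tagged_words))


preposition_use = [("He", "PRP"), ("went", "VBD"), ("out", "IN"),
                   ("the", "DT"), ("back", "JJ"), ("door", "NN")]
particle_use = [("He", "PRP"), ("threw", "VBD"), ("out", "RP"),
                ("the", "DT"), ("trash", "NN")]

print(render(preposition_use))  # He WENT out the BACK DOOR
print(render(particle_use))     # He THREW OUT the TRASH
```

A rule like this has nothing to say about complex nominals such as Main Street versus Park Avenue, where every word is a content word but the accent placement differs.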
Intonation Assignment: Contours
• Simple rules
– ‘.’ = declarative contour
– ‘?’ = yes-no-question contour, unless a wh-word is present at or near the front of the sentence
• Well, how did he do it? And what do you know?
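The contour rules above translate almost directly into code. In this sketch, "near the front" is approximated by a window of the first two words, which is an assumption, as are the wh-word list and the contour labels. The slide does not say which contour a wh-question should get, so the sketch only labels it separately.

```python
WH_WORDS = {"who", "whom", "whose", "what", "when", "where", "why",
            "which", "how"}


def choose_contour(sentence, window=2):
    """Pick a contour label from final punctuation and leading wh-words."""
    text = sentence.strip()
    if not text.endswith("?"):
        return "declarative"
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    # A wh-word within the first `window` words overrides the yes-no contour.
    if any(w in WH_WORDS for w in words[:window]):
        return "wh-question"
    return "yes-no-question"


for s in ["It's easy to tell the depth of a well.",
          "Do you want to fly first class?",
          "Well, how did he do it?",
          "And what do you know?"]:
    print(s, "->", choose_contour(s))
```

The last two sentences are the hard cases from the slide: the wh-word is not sentence-initial, so a rule that only checks the first word would assign them the yes-no contour.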
Phonological and Acoustic Realization
• Task:
– Produce a phonological representation from phonemes and intonational assignment
• Pitch contour aligned with text
• Durations, intensity
– Select best concatenative units from inventory
– Post-process if needed/possible to smooth joins and to modify pitch, duration, intensity, rate from original units
– Produce acoustic waveform as output
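Selecting the best units from an inventory is commonly framed as a shortest-path search: each candidate pays a target cost (how far it is from the requested specification) plus a join cost with its predecessor, and dynamic programming finds the cheapest sequence. The sketch below is a toy version under stated assumptions: units carry only a pitch value, both costs are absolute pitch differences, and the inventory and all names are made up for illustration.

```python
def select_units(targets, inventory, target_cost, join_cost):
    """Viterbi search for the cheapest unit sequence.

    targets:   list of target specifications, each with a "name" key
    inventory: dict mapping a unit name to its list of candidate units
    Returns (chosen units, total cost).
    """
    candidates = [inventory[t["name"]] for t in targets]
    # best[i][j] = (cheapest cost of a path ending in candidate j, backpointer)
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for unit in candidates[i]:
            path_cost, back = min(
                (best[i - 1][j][0] + join_cost(prev, unit), j)
                for j, prev in enumerate(candidates[i - 1])
            )
            row.append((path_cost + target_cost(targets[i], unit), back))
        best.append(row)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total


# Toy data: pitch in Hz is the only feature (an assumption for illustration).
targets = [{"name": "h-e", "pitch": 120}, {"name": "e-l", "pitch": 120}]
inventory = {
    "h-e": [{"id": "h-e#1", "pitch": 118}, {"id": "h-e#2", "pitch": 140}],
    "e-l": [{"id": "e-l#1", "pitch": 150}, {"id": "e-l#2", "pitch": 122}],
}
units, total = select_units(
    targets, inventory,
    target_cost=lambda t, u: abs(t["pitch"] - u["pitch"]),
    join_cost=lambda a, b: abs(a["pitch"] - b["pitch"]),
)
print([u["id"] for u in units], total)  # ['h-e#1', 'e-l#2'] 8
```

When the chosen units already match the targets and each other closely, little post-processing is needed. A large total cost signals that pitch or duration modification, with its risk to naturalness, would be required.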
TTS: Where are we now?
• Natural sounding speech for some utterances
– Where there is a good match between input and database
• Still… hard to vary prosodic features and retain naturalness
– Yes-no questions: Do you want to fly first class?
• Context-dependent variation still hard to infer from text and hard to realize naturally:
– Appropriate contours from text
– Emphasis, de-emphasis to convey focus, given/new distinction: I own a cat. Or, rather, my cat owns me.
– Variation in pitch range, rate, pausal duration to convey topic structure
• Characteristics of ‘emotional speech’ little understood, so hard to convey: … a voice that sounds friendly, sympathetic, authoritative…
• How to mimic real voices?
• ScanSoft/Nuance demo
Next Week
• Pronunciation Modeling for speech synthesis
• Homework 2 due