ISCA - International Speech Communication Association

Perspectives for Articulatory
Speech Synthesis
Bernd J. Kröger
Department of Phoniatrics, Pedaudiology, and Communication Disorders
University Hospital Aachen and RWTH Aachen, Germany
bkroeger@ukaachen.de
Examples: ASS by Peter Birkholz (2003-2007), Univ. of Rostock
• application: railway announcement system
• application: dialog system
Outline
• Introduction: Perspectives
• Vocal tract models
• Aerodynamic and acoustic models
• Glottis models and noise source models
• Control models: Generation of speech movements
• Towards neural control concepts
• Conclusions
Perspectives for Articulatory Speech Synthesis?
• Commercial or technical vs. scientific:
Is high-quality articulatory speech synthesis a realistic goal? Yes!
If we have it, the advantage over current corpus-based synthesis methods is variability:
– Different voices simply by parameter variation (sex, age, voice quality)
→ no need for different corpora
– Individual differences in articulation (e.g. degree of nasalization, individual sound/syllable realizations)
→ no need for different corpora
– Different languages
→ no need for different corpora
Perspectives for Articulatory Speech Synthesis?
• Commercial or technical vs. scientific goals:
– audiovisual speech synthesis: modeling 3D talking heads
– towards “the virtual human” (avatars)
– towards “humanoid robots”
→ need for more natural talking heads
Engwall, KTH Stockholm (1995-2001)
Perspectives for Articulatory Speech Synthesis?
• Scientific perspectives:
ASS may help to collect and condense knowledge of speech production:
– … of articulation (sound geometries, speech movements, coarticulation)
– … of vocal tract acoustics
– … of control concepts; different approaches exist:
• neural control: self-organization, training algorithms (Kröger et al. 2007)
• gestural control: concept for articulatory movements (Birkholz et al. 2006, Kröger 1998)
• segmental control (Kröger 1998)
• corpus-based control (Birkholz et al., this meeting)
Outline
• Introduction: Components of ASS systems
• Vocal tract models
• Aerodynamic and acoustic models
• Glottis models and noise source models
• Control models: Generation of speech movements
• Towards neural control concepts
• Conclusions
Components of articulatory speech synthesis
control module → vocal tract and glottis model → area function → tube model + aerodynamic-acoustic simulation → speech signal
Outline
• Introduction
• Vocal tract models: types
• Aerodynamic and acoustic models
• Glottis models and noise source models
• Control models: Generation of speech movements
• Towards neural control concepts
• Conclusions
Different types of vocal tract models
• Statistical models: parameters derived on the basis of statistical analysis, e.g. of MR/CT/X-ray image corpora (Maeda 1990, Badin et al. 2003)
Figure: Badin et al. (2003): gridline system, 2D and 3D
• Geometrical models: vocal tract shape is described by a-priori defined parameters → preferred for ASS
– area-function related (1D): Stevens & House (1955), Flanagan et al. (1980)
– articulator related: Mermelstein (1973), Birkholz (2007)
• Biomechanical models: modeling of articulators using finite element methods (Dang 2004, Engwall 2003, Wilhelms-Tricarico 1997)
Figures: Dang et al. (2004) (2D+), Engwall (2003) (3D)
• 1D, 2D, 2D+, 3D models
Example:
Geometrical 3D vocal tract model: Birkholz (2007)
[a] [i] [schwa]
based on MRI data of one speaker (and CT data of a replica of teeth and hard palate)
Vocal tract parameters (a priori)
23 basis parameters:
• Lips (2 DOF)
• Mandible (3 DOF)
• Hyoid (2 DOF)
• Velum (1 DOF)
• Tongue (12 DOF) !!!
• Minimal cross-sectional areas (3 DOF)
• …
Meshes of the vocal tract model (Birkholz 2005)
Figure of the complete vocal tract model (Birkholz 2005)
Variation of individual parameters
Variation of the lower jaw, leaving all other parameters constant
→ co-movement of dependent articulators: lips and tongue
Outline
• Introduction
• Vocal tract models
• Aerodynamic and acoustic models
• Glottis models and noise source models
• Control models: Generation of speech movements
• Towards neural control concepts
• Conclusions
Aero-acoustic simulation
• Four major types of models
– Reflection-type line analog: forward and backward traveling partial flow or pressure waves are calculated; time-domain simulation (e.g. Kelly & Lochbaum 1962, Strube et al. 1989, Kröger 1993) → preferred
Problem: no variation of vocal tract length; constant tube length
– Transmission line circuit analog: digital simulation of electrical circuit elements; time-domain simulation (e.g. Flanagan 1975, Maeda 1982, Birkholz 2005)
Problem: modeling frequency-dependent losses (radiation, …)
– Hybrid time-frequency domain models (Sondhi & Schroeter 1987, Allen & Strong 1985)
Problem: flow calculation → modeling aerodynamics; sound sources
– Three-dimensional FE modeling of acoustic wave propagation and aerodynamics (e.g. ElMasri et al. 1996, Matsuzaki and Motoki 2000): helpful for an exact formulation of aero-acoustics in the vicinity of noise sources (glottis, frication)
Problem: complexity and high computational effort; real-time synthesis cannot be achieved
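The reflection-type line analog can be sketched in a few lines: forward and backward traveling pressure waves are scattered at each tube junction according to the neighbouring areas. This is a minimal illustrative sketch, not the Kelly & Lochbaum (1962) or Kröger (1993) implementation; the area values and the boundary reflection coefficients at glottis and lips are assumptions.

```python
import numpy as np

def kelly_lochbaum(areas, excitation, r_glottis=0.75, r_lips=-0.85):
    """Reflection-type line analog: forward/backward traveling partial
    pressure waves are scattered at the junctions between tube sections.
    Boundary reflection coefficients at glottis and lips are assumptions."""
    n = len(areas)
    # reflection coefficients at the n-1 interior junctions
    r = [(areas[i] - areas[i + 1]) / (areas[i] + areas[i + 1])
         for i in range(n - 1)]
    fwd = np.zeros(n)  # right-traveling wave in each section
    bwd = np.zeros(n)  # left-traveling wave in each section
    out = np.zeros(len(excitation))
    for t, e in enumerate(excitation):
        new_fwd = np.zeros(n)
        new_bwd = np.zeros(n)
        # glottis end: excitation plus partial reflection of the backward wave
        new_fwd[0] = e + r_glottis * bwd[0]
        # interior junctions: pressure-wave scattering
        for i in range(n - 1):
            new_fwd[i + 1] = (1 + r[i]) * fwd[i] - r[i] * bwd[i + 1]
            new_bwd[i] = r[i] * fwd[i] + (1 - r[i]) * bwd[i + 1]
        # lip end: partial reflection, the rest is radiated as output
        new_bwd[n - 1] = r_lips * fwd[n - 1]
        out[t] = (1 + r_lips) * fwd[n - 1]
        fwd, bwd = new_fwd, new_bwd
    return out
```

Note the problem stated above: each section corresponds to one sample delay, so the tube length is fixed by the sampling rate and the number of sections.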
Extraction of the area function: Birkholz (2007)
Midline of the VT; cross-sections perpendicular to the airflow, varying in shape
cross-sectional area values from glottis to mouth → area function
Note: The area function cannot be calculated from 2D vocal tract models: from midsagittal data we cannot deduce cross-sectional data!
Illustration:
Calculation of the area function for KL synthesis (Kröger 1993): needs constant VT length
Begin: VT model; end: discrete area function
Green: midsagittal view
White: gridline system for the calculation of the area function
continuous area function (varying vocal tract length) → discrete area function → defining tube sections for the acoustic model
… for a complete sentence:
“Das ist mein Haus” (“That’s my house”)
Second example:
From VT over area function to vocal tract tubes, now for a transmission line circuit model (Birkholz 2005):
… vocal tract tubes can vary in length:
– trachea / subglottal system
– glottis
– vocal tract: pharyngeal and oral part, teeth, mouth opening
– branch to the nasal cavity and sinuses (i.e. indirectly coupled nasal cavities: Dang & Honda 1994)
Next step: From tube model to acoustic signal
using the transmission line circuit analog (Birkholz 2005):
On the basis of the geometrical parameters of an (elliptical) tube section – length l, cross-sectional area A, perimeter S (elliptic small vs. round) – the acoustic parameters of the tube section are derived:
parameters of the lumped elements of the electrical transmission line, representing the mass (inertia), compressibility, and losses of the air column
→ calculation of pressure and flow for each tube section
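The geometry-to-lumped-element mapping can be illustrated as follows: each section of length l and area A contributes an acoustic mass L = ρl/A and a compliance C = lA/(ρc²); chaining the resulting two-ports and evaluating the volume-velocity transfer to the open mouth end yields the formants. A sketch under simplifying assumptions (lossless, ideal open end, no perimeter-dependent losses; values illustrative):

```python
import numpy as np

RHO, C_AIR = 1.2, 350.0  # air density (kg/m^3), speed of sound (m/s)

def formant_peak(lengths, areas, freqs):
    """Chain the lumped L-C two-ports of all tube sections (glottis to lips)
    and return the frequency (Hz) where the volume-velocity transfer
    |U_lips/U_glottis| peaks, i.e. the first resonance of the ladder."""
    gains = []
    for f in freqs:
        w = 2.0 * np.pi * f
        M = np.eye(2, dtype=complex)
        for l, A in zip(lengths, areas):
            L = RHO * l / A                 # acoustic mass (inertia)
            C = l * A / (RHO * C_AIR ** 2)  # acoustic compliance (compressibility)
            # series-L / shunt-C two-port of one tube section
            sec = np.array([[1.0 - w * w * L * C, 1j * w * L],
                            [1j * w * C, 1.0]])
            M = M @ sec
        # ideal open lips: p_out = 0  ->  U_out/U_in = 1/D
        gains.append(1.0 / abs(M[1, 1]))
    return freqs[int(np.argmax(gains))]
```

For a uniform closed-open tube of 17.5 cm this ladder reproduces the quarter-wave resonance c/(4·0.175) ≈ 500 Hz.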
Illustration:
Calculation of the acoustic speech signal: Kröger (1993)
– lung pressure, vocal fold tension, glottal aperture for the whole utterance
– tube section model (area function): oral part, nasal part
– time line for the complete utterance (red arrows: insertion of the Bernoulli pressure drop and of the noise source)
– white: progress of calculation (progress bar)
– instantaneous acoustic signal (20 ms window)
… for a complete sentence:
“Das ist mein Haus” (“That’s my house”)
Display of air flow and air pressure calculated along the transmission line: Kröger (1993)
– lung pressure, vocal fold tension, glottal aperture for the whole utterance
– tube section model (current area function)
– white: progress of calculation (progress bar)
– magenta: pressure; blue: flow
– acoustic signal just calculated (20 ms window)
– red: glottal mass pair; light blue: force on the mass pair
– current pressure values of each tube section
– strong acoustic excitation at the time instant of glottal closure (after the glottal closing phase)
– high flow values during glottal opening
… for one glottal cycle within a complete sentence:
“Das ist mein Haus” (“That’s my house”)
Summarizing: Vocal tract models
and acoustic simulation
Vocal tract models:
• The area function is the basis for the calculation of the acoustic signal.
• The area function cannot be calculated in 2D models → this disqualifies 2D VT models for articulatory-acoustic speech synthesis.
• Parametric VT models should currently be preferred for building high-quality articulatory speech synthesizers. Advantages:
– low computational effort for calculating vocal tract geometries
– strong flexibility to reach auditorily satisfying sound targets
In the future these models should be replaced by statistically based models and by biomechanical models.
Summarizing: Vocal tract models
and acoustic simulation
Acoustic simulation:
• Problems occurring with different acoustic models:
– variation of the length of tube sections
– modeling frequency-dependent losses
– computational effort
• Conclusion:
The transmission line circuit analog (e.g. Birkholz et al. 2007) allows a compromise between quality and computational effort: real-time synthesis using the TLCA should be possible in the near future on normal PCs.
Outline
• Introduction
• Vocal tract models
• Aero-acoustic simulation of speech sounds
• Glottis models and noise source models
• Control models: Generation of speech movements
• Towards neural control concepts
• Conclusions
Glottis models
• Self-oscillating models (e.g. Ishizaka & Flanagan 1972: two-mass model and derivatives) → preferred
– physiological control parameters: vocal fold tension, glottal aperture, …
– calculation of the glottal area waveform (lower, upper) and the glottal flow
• Parametric glottal area models (e.g. Titze 1984 and derivatives)
– glottal waveform (opening-closing movement) is given
– calculation of the glottal flow
• Parametric glottal flow models (e.g. the LF model, Fant et al. 1985)
– acoustically relevant control parameters: F0, open quotient, maximum negative peak of the flow derivative (time derivative of glottal flow), return phase, …
→ direct control of acoustic voice quality
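A parametric glottal area model of the second kind can be sketched with a Rosenberg-type pulse, an illustrative stand-in rather than Titze's (1984) parameterization; the pulse shapes and all constants below are assumptions:

```python
import numpy as np

def glottal_area_pulse(t, T0=0.008, open_quotient=0.6,
                       speed_quotient=2.0, A_max=0.15):
    """One cycle of a parametric glottal area waveform (Rosenberg-type
    sketch, area in cm^2): slow raised-cosine opening, faster quarter-cosine
    closing, then a closed phase. T0 is the period (s); the speed quotient
    is the ratio of opening to closing duration."""
    tau = np.mod(t, T0)                    # position within the cycle
    To = open_quotient * T0                # duration of the open phase
    Tp = To * speed_quotient / (1.0 + speed_quotient)  # instant of max area
    a = np.zeros_like(tau)
    rise = tau < Tp
    fall = (tau >= Tp) & (tau < To)
    a[rise] = A_max * 0.5 * (1.0 - np.cos(np.pi * tau[rise] / Tp))
    a[fall] = A_max * np.cos(0.5 * np.pi * (tau[fall] - Tp) / (To - Tp))
    return a                               # zero during the closed phase
```

The glottal flow would then follow from this given area waveform via the aerodynamic model, as the slide states.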
Different phonation types using
a self-oscillating model (Kröger 1997)
Using a self-oscillating model extended by a chink (leak) and only two control parameters – vocal fold tension and glottal aperture – the model is able to produce:
– normal, loud, breathy, and creaky phonation
– F0 contours
– the voiced-voiceless contrast
Mechanisms for the generation of noise
Noise is produced at narrow passages within the VT.
• Separate:
– volume velocity sources (no-obstacle case)
– pressure sources (obstacle case, Stevens 1998)
Occurrence of noise sources in the VT
Birkholz (2007)
Noise sources occur simultaneously at different places within the VT:
• pressure sources: lung section (no noise), epiglottis, at obstacles (e.g. teeth)
• volume velocity sources: at the exit of each VT constriction
Controlled by the degree of VT constriction and the amplitude of the air flow
Voiceless excitation of the vocal tract
• Noise is produced at narrow passages within the VT.
• The mechanisms of noise generation are not completely understood (no satisfying 3D FE models solving the Navier-Stokes equations).
• Current solution: parameter optimization.
• The art of constructing a good noise source model is to
– find the right places for the insertion of noise sources
– optimize the parameters (spectral shape, strength, …) of the noise source
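A common way to realize such a noise source is to gate white noise by a Reynolds-number threshold computed from the flow and the constriction area; the gating law and all constants below are illustrative assumptions, not the optimized model of Birkholz (2005):

```python
import numpy as np

RHO, MU = 1.2, 1.8e-5  # air density (kg/m^3) and dynamic viscosity (Pa*s)

def noise_source(flow, area, re_crit=1800.0, gain=1e-9, seed=0):
    """Turbulence noise gated by a Reynolds-number threshold (illustrative
    constants): the amplitude grows with Re^2 - Re_crit^2 above the critical
    value and is zero below it, modulating a white-noise carrier.
    flow: volume velocity per sample (m^3/s), area: constriction area (m^2)."""
    rng = np.random.default_rng(seed)
    d = 2.0 * np.sqrt(area / np.pi)       # equivalent diameter of the passage
    re = RHO * (flow / area) * d / MU     # Reynolds number (particle velocity)
    amp = gain * np.maximum(re ** 2 - re_crit ** 2, 0.0)
    return amp * rng.standard_normal(len(flow))
```

This reproduces the control stated on the slide: the source is governed by the degree of constriction and the flow amplitude.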
Noise source parameter optimization: examples
• Synthesis examples (real – synthetic, /aCa/), Birkholz (2005): /f/, /s/, /sh/, /ch/
• But compare with Mawass, Badin & Bailly (2000): /f/, /s/, /sh/, /x/
Summarizing: glottis models
and noise source models
• Take self-oscillating vocal fold models; they can be used for high-quality articulatory speech synthesis:
– vocal fold tension → mainly determines F0
– glottal aperture → voice qualities: pressed – normal – breathy
– glottal aperture → segmental changes: glottal stop – voiced – voiceless
• Take simple noise models (pressure and velocity sources).
• 3D acoustic noise source models (solving the Navier-Stokes equations) are currently not satisfying.
Outline
• Introduction
• Vocal tract models
• Aero-acoustic simulation of speech sounds
• Glottis models and noise source models
• Control models: Generation of speech movements
• Towards neural control concepts
• Conclusions
Generation of speech movements
• Starting with segmental input: text → sound chain, phoneme chain (text-to-phoneme conversion)
• But: how to convert a chain of segments (phones) into articulatory movements?
• A theoretically and practically elegant solution: the concept of articulatory gestures, with its bivalent character:
– discrete phonological units
– quantitative units for controlling articulatory movements
phonological plan → motor plan (discrete)
Example: „Kompass“
Figure: gestural score for „Kompass“ – the segments /k O m p a s/ are aligned with discrete gestures arranged in rows per vocal organ (C-row 1, C-row 2, V-row, and a row for glottis and velopharyngeal port); each discrete gesture (labels in the original figure: do, fc, og, la, ov, ap, nc) specifies a vocal organ target and is mapped onto the quantitative control units of the motor plan.
From discrete to quantitative realisation of a gesture
dorsal full closing gesture: {fcdo}
Quantitative gestural parameters: activation interval (Ton, Toff) and the time function for the articulator movement (Ttarg, voc_org, loctarg)
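The quantitative side of a gesture can be sketched as an activation interval plus a target-approaching time function; the sketch below uses a critically damped second-order system in the spirit of task-dynamic and dynamic-approximation control (Saltzman & Munhall 1989, Birkholz 2007), with illustrative parameter values, not the original parameterization:

```python
import numpy as np

def gesture_trajectory(t, t_on, t_off, target, start=0.0, tau=0.015):
    """Articulator time function for one gesture: within the activation
    interval [t_on, t_off) the articulator position approaches the gestural
    target like a critically damped second-order system with time constant
    tau (s); outside the interval the position is held."""
    dt = t[1] - t[0]
    x, v = start, 0.0          # position and velocity of the articulator
    out = np.empty_like(t)
    for i, ti in enumerate(t):
        if t_on <= ti < t_off:
            # critically damped: x'' = -(2/tau) x' - (x - target)/tau^2
            a = -(2.0 / tau) * v - (x - target) / tau ** 2
            v += a * dt        # semi-implicit Euler integration
            x += v * dt
        out[i] = x
    return out
```

Superposing several such trajectories, each with its own activation interval, gives overlapping gestures and hence coarticulation.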
Modeling reduction is easily possible:
Example: “mit dem Boot” (Kröger 1993)
increase in speech rate → increase in gestural overlap → segmental changes
motor plan in 9 steps, from not reduced to fully reduced (quantitative and qualitative display): all gestures still exist!
Connected speech using gestural control:
Examples (1)
„Der Zug...“ (“The train...”)
„Guten Tag...“ (“Good day...”)
Connected speech using gestural control:
Examples (2)
„Nächster Halt...“ (“Next stop...”)
„Nächster Halt...“ (“Next stop...”)
Summarizing: control concepts
A gesture-based control concept can be used:
– It links phonemes to articulation via the bivalent character of gestures:
• discrete phonological units
• quantitative units for motor control (activation interval, targets, transition velocities, …)
– Gestures quantitatively comprise
• the description of the target-directed movement
• the definition of the target itself (not incompatible with target concepts)
– Gestures model the segmental changes (assimilations, elisions) occurring in reduction by an increase in the temporal overlap of gestures.
– Open question: how to deduce rules for the coordination of speech gestures for syllables, words, and complete utterances?
Outline
• Introduction
• Vocal tract models
• Aero-acoustic simulation of speech sounds
• Glottis models and noise source models
• Control models: Generation of speech movements
• Towards neural control concepts
• Conclusions
Note
We have a lot of knowledge concerning the plant (no problem):
– articulatory geometries
– speech acoustics
Birkholz et al. (2007)
We have much less knowledge concerning the neural control of speech articulation (problem).
Idea
• Copy or mimic speech acquisition:
• Start like a toddler with babbling: i.e. explore your vocal apparatus and combine motor states with the resulting sensory states (auditory, somatosensory).
• Imitation: copying the mother's (caretaker's) speech signals is now possible, since the auditory-to-motor relations are already trained.
• Idea: build up a corpus of trained speech items (known as the mental syllabary, postulated by Levelt and Wheeldon 1994)
→ idea: corpus-based neuro-articulatory speech synthesis
• It is based purely on acoustic data; articulatory data (EMA, …) are not needed: toddlers are able to learn to speak from acoustic stimulation.
Figure: structure of the neurophonetic model of speech production – a phonological plan (from the mental lexicon and syllabification) is processed by cortical motor planning in the frontal lobe (phonemic map → phonetic map; frequent syllables retrieved directly, infrequent syllables assembled), passed on as a motor plan to motor execution (primary motor map), and realized by subcortical and peripheral neuromuscular processing (cerebellum, basal ganglia, thalamus; muscles and articulators: tongue, lips, jaw, velum, …); the resulting articulatory and acoustic signals are fed back via somatosensory and auditory receptors to the primary somatosensory and auditory maps (parietal and temporal lobe) for control and corrections; an external speaker (the mother) provides the auditory input for imitation.
Neurophonetic model of speech production: DFG grant KR1439/13-1, 2007-2010
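Maps such as the phonemic and phonetic maps in this kind of model are typically realized as self-organizing maps (Kohonen 2001). A minimal SOM training sketch on synthetic data (grid size, schedules, and data are illustrative assumptions, not the model trained under the DFG project):

```python
import numpy as np

def train_som(data, grid=10, epochs=30, seed=0):
    """Minimal self-organizing map (Kohonen-style): each node of a
    grid x grid sheet holds a weight vector; the best-matching node and its
    neighbours are pulled toward each input, so neighbouring nodes come to
    code similar sensory or motor states."""
    rng = np.random.default_rng(seed)
    dim = data.shape[1]
    w = rng.random((grid * grid, dim))          # node weight vectors
    pos = np.array([(i // grid, i % grid)       # node positions on the sheet
                    for i in range(grid * grid)], float)
    for e in range(epochs):
        lr = 0.5 * (1.0 - e / epochs)           # learning rate decays
        sigma = max(grid / 2 * (1.0 - e / epochs), 0.8)  # neighbourhood shrinks
        for x in data[rng.permutation(len(data))]:
            bmu = np.argmin(((w - x) ** 2).sum(axis=1))  # best-matching unit
            h = np.exp(-((pos - pos[bmu]) ** 2).sum(axis=1)
                       / (2.0 * sigma ** 2))    # neighbourhood function
            w += lr * h[:, None] * (x - w)
    return w
```

After training, nearby nodes respond to similar inputs, which is the self-organization property the slide set refers to.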
Outline
• Introduction
• Vocal tract models
• Aero-acoustic simulation of speech sounds
• Glottis models and noise source models
• Control models: Generation of speech movements
• Towards neural control concepts
• Conclusions
What are the perspectives
for articulatory speech synthesis?
• Practically: ASS could reach high-quality standards over the next decades.
My recommendation: use
– 3D geometrical (or statistical) articulatory vocal tract models
– simple self-oscillating glottis models (2 masses and a chink)
– transmission-line-analog time-domain acoustic models (1D) with optimized simulation of losses
– an optimized simple noise source model
– a gestural control concept
– an acoustic database for generating gestural coordination and prosody (cp. Birkholz et al. 2007, this meeting)
• Example for singing using ASS: Dona nobis pacem (Birkholz 2007)
Clinical application of a 2D VT model
A 2D articulatory model synchronized with natural speech, used in speech therapy (Kröger 2005): visual stimulation technique
Thank you!!
What do you think about these ideas? – I like this stuff. It is good for our future!
Literature
• Badin P, Bailly G, Revéret L, Baciu M, Segebarth C, Savariaux C (2002) Three-dimensional articulatory modeling of tongue, lips and face, based on MRI and video images. Journal of Phonetics 30: 533-553
• Birkholz P, Jackèl D, Kröger BJ (2007) Simulation of losses due to turbulence in the time-varying vocal system. IEEE Transactions on Audio, Speech, and Language Processing 15: 1218-1225
• Birkholz P (2007) Control of an articulatory speech synthesizer based on dynamic approximation of spatial articulatory targets. Proceedings of Interspeech 2007 – Eurospeech. Antwerp, Belgium
• Birkholz P (2005) 3D-Artikulatorische Sprachsynthese [3D articulatory speech synthesis]. Unpublished PhD thesis. University of Rostock
• Birkholz P, Jackèl D (2004) Influence of temporal discretization schemes on formant frequencies and bandwidths in time domain simulations of the vocal tract system. Proceedings of Interspeech 2004-ICSLP. Jeju, Korea, pp. 1125-1128
• Birkholz P, Kröger BJ (2006) Vocal tract model adaptation using magnetic resonance imaging. Proceedings of the 7th International Seminar on Speech Production. Belo Horizonte, Brazil, pp. 493-500
• Birkholz P, Jackèl D, Kröger BJ (2006) Construction and control of a three-dimensional vocal tract model. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2006). Toulouse, France, pp. 873-876
• Birkholz P, Steiner I, Breuer S (2007) Control concepts for articulatory speech synthesis. Proceedings of the 6th ISCA Speech Synthesis Research Workshop. Universität Bonn
• Browman CP, Goldstein L (1990) Gestural specification using dynamically-defined articulatory structures. Journal of Phonetics 18: 299-320
• Browman CP, Goldstein L (1992) Articulatory phonology: An overview. Phonetica 49: 155-180
Literature
• Cranen B, Schroeter J (1995) Modeling a leaky glottis. Journal of Phonetics 23: 165-177
• Dang J, Honda K (1994) Morphological and acoustical analysis of the nasal and the paranasal cavities. Journal of the Acoustical Society of America 96: 2088-2100
• Engwall O (1999) Modeling of the vocal tract in three dimensions. EUROSPEECH'99: 113-116
• Flanagan JL (1965) Speech Analysis, Synthesis and Perception. Springer-Verlag, Berlin
• Guenther FH (2006) Cortical interactions underlying the production of speech sounds. Journal of Communication Disorders 39: 350-365
• Guenther FH, Ghosh SS, Tourville JA (2006) Neural modeling and imaging of the cortical interactions underlying syllable production. Brain and Language 96: 280-301
• Kohonen T (2001) Self-organizing maps. Springer, Berlin, 3rd edition
• Kröger BJ (1998) Ein phonetisches Modell der Sprachproduktion [A phonetic model of speech production]. Niemeyer, Tübingen
• Kröger BJ (1993) A gestural production model and its application to reduction in German. Phonetica 50: 213-233
• Kröger BJ (2003) Ein visuelles Modell der Artikulation [A visual model of articulation]. Laryngo-Rhino-Otologie 82: 402-407
• Kröger BJ, Birkholz P, Kannampuzha J, Neuschaefer-Rube C (2007) Multidirectional mappings and the concept of a mental syllabary in a neural model of speech production. In: Fortschritte der Akustik: 33. Deutsche Jahrestagung für Akustik, DAGA '07. Stuttgart
• Kröger BJ, Birkholz P, Kannampuzha J, Neuschaefer-Rube C (2006) Learning to associate speech-like sensory and motor states during babbling. Proceedings of the 7th International Seminar on Speech Production. Belo Horizonte, Brazil, pp. 67-74
• Kröger BJ, Gotto J, Albert S, Neuschaefer-Rube C (2005) A visual articulatory model and its application to therapy of speech disorders: a pilot study. In: Fuchs S, Perrier P, Pompino-Marschall B (eds.) Speech production and perception: Experimental analyses and models. ZASPiL 40: 79-94
Literature
• Mermelstein P (1973) Articulatory model for the study of speech production. Journal of the Acoustical Society of America 53: 1070-1082
• Saltzman EL, Munhall KG (1989) A dynamic approach to gestural patterning in speech production. Ecological Psychology 1: 333-382
• Titze IR (1984) Parameterization of the glottal area, glottal flow, and vocal fold contact area. Journal of the Acoustical Society of America 75: 570-580
Observation:
Hannah (age 0-2): each morning during wake-up
Training set: “silent mouthing”
• combination of min, (mid,) and max values {0, (0.5,) 1} of all 10 joint parameters (Kröger et al. 2006, DAGA Braunschweig)
• double closures and non-physiological articulations are avoided
• subsets for lips and for tongue
→ 4608 patterns of training data
Training
• Design of the net: one-layer feed-forward, 25+18 input neurons (somatosensory), 40 output neurons (motor) → ca. 2000 links
• Set of 4608 patterns of training data → min-max combination training set; “silent mouthing”
• 5000 cycles of batch training → mean error ca. 10% for the prediction of an articulatory state (Kröger et al. 2006b, ISSP, Ubatuba, Brazil)
• Software: Java version of SNNS (Stuttgart Neural Network Simulator), http://www-ra.informatik.uni-tuebingen.de/SNNS/
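The training step above can be sketched with a one-layer feed-forward net trained by batch gradient descent; the 43 → 40 dimensionality follows the slide, but the data below are synthetic random patterns, not the actual silent-mouthing training set, and the learning schedule is an assumption:

```python
import numpy as np

def train_one_layer(X, Y, epochs=2000, lr=1.0, seed=0):
    """One-layer feed-forward net with sigmoid outputs, trained by batch
    gradient descent (delta rule) on input patterns X (n x 43, sensory)
    and target patterns Y (n x 40, motor); returns the weights, the biases,
    and the final mean absolute prediction error."""
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.standard_normal((X.shape[1], Y.shape[1]))
    b = np.zeros(Y.shape[1])
    for _ in range(epochs):
        out = 1.0 / (1.0 + np.exp(-(X @ W + b)))  # sigmoid activations
        grad = (out - Y) * out * (1.0 - out)      # dMSE/dz for sigmoid units
        W -= lr * X.T @ grad / len(X)             # batch weight update
        b -= lr * grad.mean(axis=0)
    out = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return W, b, float(np.mean(np.abs(out - Y)))
```

With a realizable synthetic mapping, the mean prediction error after training falls well below its initial value, analogous to the ca. 10% prediction error reported on the slide.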
Training results: “motor equivalence”
… despite a prediction error of 10%:
– labial, apical, and dorsal closures, each produced with the lower jaw in a low and in a high position
– in each column the somatosensory values are the same (except for the jaw parameter)
→ acoustically relevant closures are kept despite strong jaw perturbation