Perspectives for Articulatory Speech Synthesis

Bernd J. Kröger
Department of Phoniatrics, Pedaudiology, and Communication Disorders
University Hospital Aachen and RWTH Aachen, Germany
bkroeger@ukaachen.de

Examples: articulatory speech synthesis (ASS) by Peter Birkholz (2003-2007), Univ. of Rostock
– application: railway announcement system
– application: dialog system

Outline
• Introduction
• Vocal tract models
• Aerodynamic and acoustic models
• Glottis models and noise source models
• Control models: Generation of speech movements
• Towards neural control concepts
• Conclusions

Outline • Introduction: Perspectives • Vocal tract models • Aerodynamic and acoustic models • Glottis models and noise source models • Control models: Generation of speech movements • Towards neural control concepts • Conclusions

Perspectives for Articulatory Speech Synthesis?
• Commercial or technical vs. scientific: Is high-quality articulatory speech synthesis a realistic goal? Yes! If we have it, its advantage over current corpus-based synthesis methods is variability:
– different voices simply by parameter variation (sex, age, voice quality) → no need for different corpora
– individual differences in articulation (e.g. degree of nasalization, individual sound/syllable realizations) → no need for different corpora
– different languages → no need for different corpora

Perspectives for Articulatory Speech Synthesis?
• Commercial or technical vs. scientific goals:
– audiovisual speech synthesis: modeling 3D talking heads
– towards "the virtual human" (avatars)
– towards "humanoid robots"
Need for more natural talking heads (Engwall, KTH Stockholm, 1995-2001)

Perspectives for Articulatory Speech Synthesis?
• Scientific perspectives: ASS may help to collect and condense knowledge of speech production:
– … of articulation (sound geometries, speech movements, coarticulation)
– … of vocal tract acoustics
– … of control concepts; different approaches exist:
• neural control: self-organization, training algorithms (Kröger et al. 2007)
• gestural control: a concept for articulatory movements (Birkholz et al. 2006, Kröger 1998)
• segmental control (Kröger 1998)
• corpus-based control (Birkholz et al., this meeting)

Outline • Introduction: Components of ASS systems • Vocal tract models • Aerodynamic and acoustic models • Glottis models and noise source models • Control models: Generation of speech movements • Towards neural control concepts • Conclusions

Components of articulatory speech synthesis: a control module drives the vocal tract and glottis model; its area function feeds a tube model with aerodynamic-acoustic simulation, which produces the speech signal.

Outline • Introduction • Vocal tract models • Aerodynamic and acoustic models • Glottis models and noise source models • Control models: Generation of speech movements • Towards neural control concepts • Conclusions

Outline • Introduction • Vocal tract models: types • Aerodynamic and acoustic models • Glottis models and noise source models • Control models: Generation of speech movements • Towards neural control concepts • Conclusions

Different types of vocal tract models
• Statistical models: parameters derived from statistical analysis of MR/CT/X-ray image corpora (Maeda 1990, Badin et al. 2003) [figure: Badin et al. (2003): gridline system, 3D and 2D]
• Geometrical models: the vocal tract shape is described by a-priori defined parameters → preferred for ASS
– area-function related: Stevens & House (1955), Flanagan et al. (1980)
– articulator related: Mermelstein (1973), Birkholz (2007)
• Biomechanical models: modeling of the articulators using finite element methods (Dang 2004, Engwall 2003, Wilhelms-Tricarico 1997)
• 1D, 2D, 2D+, and 3D models, e.g. 3D: Dang et al.
(2004); 2D+: Engwall (2003)

Example: geometrical 3D vocal tract model (Birkholz 2007): [a], [i], [schwa]; based on MRI data of one speaker (and CT data of a replica of the teeth and hard palate).

Vocal tract parameters (a priori): 23 basis parameters:
• lips (2 DOF)
• mandible (3 DOF)
• hyoid (2 DOF)
• velum (1 DOF)
• tongue (12 DOF)
• minimal cross-sectional areas (3 DOF)

[Figures: meshes of the vocal tract model and the complete vocal tract model (Birkholz 2005)]

Variation of individual parameters: varying the lower jaw while leaving all other parameters constant → co-movement of the dependent articulators: lips and tongue.

Outline • Introduction • Vocal tract models • Aerodynamic and acoustic models • Glottis models and noise source models • Control models: Generation of speech movements • Towards neural control concepts • Conclusions

Aero-acoustic simulation: four major types of models
– Reflection-type line analog: forward- and backward-traveling partial flow or pressure waves are calculated; time-domain simulation (e.g. Kelly & Lochbaum 1962, Strube et al. 1989, Kröger 1993); preferred. Problem: no variation of vocal tract length (constant tube length).
– Transmission line circuit analog: digital simulation of electrical circuit elements; time-domain simulation (e.g. Flanagan 1975, Maeda 1982, Birkholz 2005). Problem: modeling frequency-dependent losses (radiation, …).
– Hybrid time-frequency domain models (Sondhi & Schroeter 1987, Allen & Strong 1985). Problem: flow calculation, modeling of aerodynamics and sound sources.
– Three-dimensional FE modeling of acoustic wave propagation and aerodynamics (e.g. ElMasri et al. 1996, Matsuzaki and Motoki 2000): helpful for an exact formulation of the aero-acoustics in the vicinity of noise sources (glottis, frication). Problem: complexity and high computational effort; real-time synthesis cannot be achieved.

Extraction of the area function (Birkholz 2007): the midline of the VT and the cross-sections vary in shape!
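The reflection-type line analog mentioned above can be sketched in a few lines: reflection coefficients are computed from a discrete area function, and forward/backward partial pressure waves are scattered at each tube junction once per sample, which is exactly why the vocal tract length is fixed in this model type. This is a minimal illustration only; half-sample delays, losses, and the glottal/lip termination models of the cited systems are simplified away, and the termination coefficients `r_glottis` and `r_lips` are illustrative assumptions.

```python
import numpy as np

def reflection_coefficients(areas):
    """Reflection coefficients at the junctions of adjacent tube sections:
    k_i = (A_i - A_{i+1}) / (A_i + A_{i+1}) for pressure waves."""
    a = np.asarray(areas, dtype=float)
    return (a[:-1] - a[1:]) / (a[:-1] + a[1:])

def kelly_lochbaum(source, areas, r_glottis=0.98, r_lips=-0.9):
    """Propagate forward/backward partial pressure waves through n sections.

    One time step corresponds to the travel time through one section, so the
    total delay (and hence the tract length) is constant. Termination
    coefficients are illustrative placeholders, not calibrated values."""
    k = reflection_coefficients(areas)
    n = len(areas)
    f = np.zeros(n)                    # forward-traveling wave per section
    b = np.zeros(n)                    # backward-traveling wave per section
    out = np.zeros(len(source))
    for t, s in enumerate(source):
        out[t] = (1 + r_lips) * f[n - 1]   # wave transmitted at the lips
        f_new = np.empty(n)
        b_new = np.empty(n)
        f_new[0] = s + r_glottis * b[0]    # reflection at the glottal end
        for i in range(n - 1):             # scattering at inner junctions
            f_new[i + 1] = (1 + k[i]) * f[i] - k[i] * b[i + 1]
            b_new[i] = k[i] * f[i] + (1 - k[i]) * b[i + 1]
        b_new[n - 1] = r_lips * f[n - 1]   # reflection at the lip end
        f, b = f_new, b_new
    return out
```

For a uniform tube (all areas equal) all inner reflection coefficients vanish and an impulse simply travels to the lips in n samples, scaled by the lip transmission factor.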
Cross-sections are taken perpendicular to the airflow; the cross-sectional area values from glottis to mouth form the area function.

Note: The area function cannot be calculated from 2D vocal tract models: from midsagittal data we cannot deduce cross-sectional data!

Illustration: calculation of the area function for KL synthesis (Kröger 1993); needs a constant VT length. Begin: VT model; end: discrete area function defining the tube sections for the acoustic model. [Figure: green: midsagittal view; white: gridline system for the calculation of the area function; continuous area function (varying vocal tract length); discrete area function] … for a complete sentence: "Das ist mein Haus" ("That's my house").

Second example: from the VT over the area function to vocal tract tubes, now for a transmission line circuit model (Birkholz 2005). Vocal tract tubes can vary in length: vocal tract (pharyngeal and oral part), trachea/subglottal system, mouth opening branch, glottis, teeth, nasal cavity and sinuses (i.e. indirectly coupled nasal cavities: Dang & Honda 1994).

Next step: from the tube model to the acoustic signal using the transmission line circuit analog (Birkholz 2005). On the basis of the geometrical parameters of each (elliptical) tube section (length l, cross-sectional area A, perimeter S, from elliptic-small to round), the acoustic parameters of the section (mass (inertia), compressibility, and losses of the air column) yield the lumped elements of the electrical transmission line → calculation of pressure and flow for each tube section.

Illustration: calculation of the acoustic speech signal (Kröger 1993): lung pressure, vocal fold tension, and glottal aperture for the whole utterance; tube section model (area function), oral and nasal part; time line for the complete utterance
(red arrows: insertion of the Bernoulli pressure drop and of the noise source; white: progress of the calculation (progress bar); instantaneous acoustic signal in a 20 ms window) … for a complete sentence: "Das ist mein Haus" ("That's my house").

Display of air flow and air pressure calculated along the transmission line (Kröger 1993): lung pressure, vocal fold tension, and glottal aperture for the whole utterance; tube section model (current area function); white: progress of the calculation (progress bar); magenta: pressure; acoustic signal just calculated (20 ms window); blue: flow; red: glottal mass pair; light blue: force on the mass pair; current pressure values of each tube section. Strong acoustic excitation at the time instant of glottal closure (after the glottal closing phase); high flow values during glottal opening. … for one glottal cycle within a complete sentence: "Das ist mein Haus" ("That's my house").

Summarizing: vocal tract models and acoustic simulation

Vocal tract models:
• The area function is the basis for the calculation of the acoustic signal.
• The area function cannot be calculated in 2D models → this disqualifies 2D VT models for articulatory-acoustic speech synthesis.
• Parametric VT models should currently be preferred for building high-quality articulatory speech synthesizers. Advantages:
– low computational effort for calculating vocal tract geometries
– strong flexibility for reaching auditorily satisfying sound targets
In the future these models should be replaced by statistically based models and by biomechanical models.

Acoustic simulation:
• Problems occurring with the different acoustic models:
– variation of the length of the tube sections
– modeling frequency-dependent losses
– computational effort
• Conclusion: The transmission line circuit analog (e.g. Birkholz et al.
2007) allows a compromise between quality and computational effort: real-time synthesis should be possible in the near future on normal PCs using the TLCA.

Outline • Introduction • Vocal tract models • Aero-acoustic simulation of speech sounds • Glottis models and noise source models • Control models: Generation of speech movements • Towards neural control concepts • Conclusions

Glottis models
• Self-oscillating models (e.g. the Ishizaka & Flanagan 1972 two-mass model and its derivatives)
– physiological control parameters: vocal fold tension, glottal aperture, …
– calculation of the glottal area waveform (lower and upper mass) and of the glottal flow
• Parametric glottal area models (e.g. Titze 1984 and derivatives)
– the glottal waveform (opening-closing movement) is given
– calculation of the glottal flow
• Parametric glottal flow models (e.g. the LF model, 1985)
– acoustically relevant control parameters: F0, open quotient, maximum negative peak flow (of the time derivative of the glottal flow), return phase, … → direct control of acoustic voice quality; preferred

Different phonation types using a self-oscillating model (Kröger 1997): using a self-oscillating model extended by a chink (leak), with simply two control parameters (vocal fold tension and glottal aperture), the model is able to produce:
– normal, loud, breathy, and creaky phonation
– F0 contours
– the voiced-voiceless contrast

Mechanisms for the generation of noise: noise is produced at narrow passages within the VT. Two types are separated:
– volume velocity sources (no-obstacle case)
– pressure sources (obstacle case; Stevens 1998)

Occurrence of noise sources in the VT (Birkholz 2007): they occur simultaneously at different places within the VT:
• pressure sources: lung section (no noise), epiglottis, at obstacles (e.g.
teeth)
• volume velocity sources: at the exit of each VT constriction
They are controlled by the degree of VT constriction and the amplitude of the air flow.

Voiceless excitation of the vocal tract
• Noise is produced at narrow passages within the VT.
• The mechanisms of noise generation are not completely understood (there are no satisfying 3D FE models solving the Navier-Stokes equations).
• Current solution: parameter optimization. The art of constructing a good noise source model is to
– find the right places for the insertion of noise sources
– optimize the parameters (spectral shape, strength, …) of the noise source

Noise source parameter optimization: examples
• Synthesis examples (real vs. synthetic /aCa/), Birkholz (2005): /f/ /s/ /sh/ /ch/
• But compare with Mawass, Badin & Bailly (2000): /f/ /s/ /sh/ /x/

Summarizing: glottis models and noise source models
• Take self-oscillating vocal fold models; they can be used for high-quality articulatory speech synthesis:
– vocal fold tension mainly determines F0
– glottal aperture → voice qualities: pressed, normal, breathy
– glottal aperture → segmental changes: glottal stop, voiced, voiceless
• Take simple noise models (pressure and velocity sources).
• 3D acoustic noise source models (solving the Navier-Stokes equations) are currently not satisfying.

Outline • Introduction • Vocal tract models • Aero-acoustic simulation of speech sounds • Glottis models and noise source models • Control models: Generation of speech movements • Towards neural control concepts • Conclusions

Generation of speech movements
• Starting with a segmental input: text → sound chain / phoneme chain (text-to-phoneme conversion)
• But: how to convert a chain of segments (phones) into articulatory movements?
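Before turning to control, the flow-controlled noise sources summarized above can be sketched as amplitude-modulated noise gated by a Reynolds-number criterion, in the spirit of Flanagan-style aerodynamic noise models: below a critical Reynolds number the flow in the constriction is laminar and no noise is generated. All constants here (`re_crit`, `gain`, the viscosity value) are illustrative assumptions, not the optimized parameters of the cited systems.

```python
import numpy as np

def noise_source(flow, area, re_crit=1800.0, gain=1e-6, seed=0):
    """Amplitude-modulated noise injected at a vocal tract constriction.

    Noise strength depends on the degree of constriction and the amplitude
    of the air flow: the noise amplitude grows with Re^2 - Re_crit^2 and is
    zero below the critical Reynolds number. Constants are placeholders."""
    rng = np.random.default_rng(seed)
    nu = 1.5e-5                         # kinematic viscosity of air [m^2/s]
    flow = np.asarray(flow, dtype=float)   # volume velocity [m^3/s]
    area = np.asarray(area, dtype=float)   # constriction area [m^2]
    d = 2.0 * np.sqrt(area / np.pi)        # equivalent constriction diameter
    v = flow / area                        # particle velocity in constriction
    re = v * d / nu                        # Reynolds number
    amp = gain * np.maximum(0.0, re ** 2 - re_crit ** 2)
    return amp * rng.standard_normal(len(flow))
```

With a weak flow through a 0.1 cm2 constriction the Reynolds number stays far below the threshold and the source is silent; with a strong flow through the same constriction it produces noise.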
• A theoretically and practically elegant solution: the concept of articulatory gestures, with a bivalent character:
– discrete phonological units
– quantitative units for controlling articulatory movements
phonological plan → motor plan (discrete)

Example: "Kompass" [Figure: gestural score: segments (k, o, m, p, a, s), gesture rows (C-row 1, V-row, C-row 2, glottis/velopharyngeal port), discrete gestures specified per vocal organ and target, and the resulting quantitative motor plan]

From the discrete to the quantitative realization of a gesture: dorsal full closing gesture {fcdo}. Quantitative gestural parameters: activation interval (Ton, Toff, Ttarg), the time function for the articulator movement, voc_org, loctarg.

Modeling reduction is easily possible: example "mit dem Boot" (Kröger 1993): an increase in speech rate → an increase in gestural overlap → segmental changes. Motor plan in 9 steps from not reduced to fully reduced (quantitative as well as qualitative changes); all gestures still exist!

Connected speech using gestural control: examples (1): "Der Zug...", "Guten Tag..."
Connected speech using gestural control: examples (2): "Nächster Halt...", "Nächster Halt..."

Summarizing: control concepts. The gesture-based control concept can be used:
– It links phonemes to articulation via the bivalent character of gestures:
• discrete phonological units
• quantitative units for motor control (activation interval, targets, transition velocities, …)
– Gestures quantitatively comprise
• the description of the target-directed movement
• the definition of the target itself (thus not incompatible with target concepts)
– Gestures model the segmental changes (assimilations, elisions) occurring in reduction by an increase in the temporal overlap of gestures.
– Open question: how to deduce rules for the coordination of speech gestures for syllables, words, and complete utterances?
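The step from a discrete gesture to a quantitative movement can be sketched as follows: during its activation interval, a gesture drives one articulator variable toward its target via critically damped second-order dynamics, a common choice for target-directed articulatory movements. The parameter names mirror the gestural parameters in the text (activation interval, target), but the specific dynamics and the time constant `tau` are illustrative assumptions, not the exact formulation of the cited models.

```python
import numpy as np

def realize_gesture(x0, target, t_on, t_off, dt=0.001, t_end=0.5, tau=0.015):
    """Trajectory of one articulator variable under a single gesture.

    During the activation interval [t_on, t_off) the articulator is pulled
    toward the gestural target by critically damped second-order dynamics
    (time constant tau, no overshoot); outside the interval it holds its
    current state. Increasing gestural overlap at faster speech rates then
    amounts to shifting and overlapping these activation intervals.
    """
    n = int(round(t_end / dt))
    x = np.empty(n)
    xi, v = float(x0), 0.0
    for i in range(n):
        t = i * dt
        if t_on <= t < t_off:
            # x'' = (target - x)/tau^2 - 2*x'/tau  (critically damped)
            a = (target - xi) / tau ** 2 - 2.0 * v / tau
            v += a * dt
            xi += v * dt          # semi-implicit Euler step (stable)
        x[i] = xi
    return x
```

With an activation interval much longer than `tau`, the articulator reaches the target smoothly and holds it after the gesture is deactivated; shortening the interval yields undershoot, the quantitative counterpart of reduction.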
Outline • Introduction • Vocal tract models • Aero-acoustic simulation of speech sounds • Glottis models and noise source models • Control models: Generation of speech movements • Towards neural control concepts • Conclusions

Note: We have a lot of knowledge concerning the plant: articulatory geometries and speech acoustics (Birkholz et al. 2007) → no problem. We have much less knowledge concerning the neural control of speech articulation → a problem. [Cartoon: Homer Simpson]

Idea: Copy or mimic speech acquisition:
• Start like a toddler with babbling, i.e. explore your vocal apparatus and associate motor states with the resulting sensory states (auditory, somatosensory).
• Imitation: copying the mother's (caretaker's) speech signals is now possible, since the auditory-to-motor relations are already trained.
• Idea: build up a corpus of trained speech items (known as the mental syllabary, postulated by Levelt and Wheeldon 1994) → corpus-based neuro-articulatory speech synthesis.
• This is based purely on acoustic data; articulatory data (EMA, …) are not needed: toddlers are able to learn to speak from acoustic stimulation alone.

[Figure: neurophonetic model of speech production (DFG grant KR 1439/13-1, 2007-2010): from comprehension via the mental lexicon and syllabification to the phonological plan; cortical maps (phonemic map, phonetic map, primary motor map, primary somatosensory map, auditory map) for motor planning, motor execution, and somatosensory-phonetic and auditory-phonetic processing of frequent and infrequent syllables and of prosody; subcortical structures (cerebellum, basal ganglia, thalamus) and neuromuscular processing; the periphery (muscles and articulators: tongue, lips, jaw, velum, …; skin, ears, and sensory pathways) linking the motor state to the articulatory and acoustic signals; an external speaker (the mother) provides auditory input]

Outline • Introduction • Vocal tract models • Aero-acoustic simulation of speech sounds • Glottis models and noise source models • Control models: Generation of speech movements • Towards neural control concepts • Conclusions

What are the perspectives for articulatory speech synthesis?
• Practically: ASS could reach high-quality standards over the next decades. My recommendation: use
– 3D geometrical (or statistical) articulatory vocal tract models
– simple self-oscillating glottis models (two masses and a chink)
– transmission-line-analog time-domain acoustic models (1D) with an optimized simulation of losses
– an optimized simple noise source model
– the gestural control concept
– an acoustic database for generating gestural coordination and prosody (cp. Birkholz et al. 2007, this meeting)
• Example of singing using ASS: Dona nobis pacem (Birkholz 2007)

Clinical application of a 2D VT model: a 2D articulatory model synchronized with natural speech is used in speech therapy (Kröger 2005): the visual stimulation technique.

Thank you!! What do you think about these ideas? I like this stuff. It is good for our future!

Literatur
• Badin P, Bailly G, Revéret L, Baciu M, Segebarth C, Savariaux C (2002) Three-dimensional articulatory modeling of tongue, lips and face, based on MRI and video images. Journal of Phonetics 30: 533-553
• Birkholz P, Jackèl D, Kröger BJ (2007) Simulation of losses due to turbulence in the time-varying vocal system.
IEEE Transactions on Audio, Speech, and Language Processing 15: 1218-1225
• Birkholz P (2007) Control of an articulatory speech synthesizer based on dynamic approximation of spatial articulatory targets. Proceedings of Interspeech 2007 – Eurospeech. Antwerp, Belgium
• Birkholz P (2005) 3D-Artikulatorische Sprachsynthese [3D articulatory speech synthesis]. Unpublished PhD thesis. University of Rostock
• Birkholz P, Jackèl D (2004) Influence of temporal discretization schemes on formant frequencies and bandwidths in time domain simulations of the vocal tract system. Proceedings of Interspeech 2004 – ICSLP. Jeju, Korea, pp. 1125-1128
• Birkholz P, Kröger BJ (2006) Vocal tract model adaptation using magnetic resonance imaging. Proceedings of the 7th International Seminar on Speech Production. Belo Horizonte, Brazil, pp. 493-500
• Birkholz P, Jackèl D, Kröger BJ (2006) Construction and control of a three-dimensional vocal tract model. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2006). Toulouse, France, pp. 873-876
• Birkholz P, Steiner I, Breuer S (2007) Control concepts for articulatory speech synthesis. Proceedings of the 6th ISCA Speech Synthesis Research Workshop. Universität Bonn
• Browman CP, Goldstein L (1990) Gestural specification using dynamically-defined articulatory structures. Journal of Phonetics 18: 299-320
• Browman CP, Goldstein L (1992) Articulatory phonology: An overview. Phonetica 49: 155-180
• Cranen B, Schroeter J (1995) Modeling a leaky glottis. Journal of Phonetics 23: 165-177
• Dang J, Honda K (1994) Morphological and acoustical analysis of the nasal and the paranasal cavities. Journal of the Acoustical Society of America 96: 2088-2100
• Engwall O (1999) Modeling of the vocal tract in three dimensions. Proceedings of EUROSPEECH'99: 113-116
• Flanagan JL (1965) Speech Analysis, Synthesis and Perception. Springer-Verlag, Berlin
• Guenther FH (2006) Cortical interactions underlying the production of speech sounds.
Journal of Communication Disorders 39: 350-365
• Guenther FH, Ghosh SS, Tourville JA (2006) Neural modeling and imaging of the cortical interactions underlying syllable production. Brain and Language 96: 280-301
• Kohonen T (2001) Self-Organizing Maps. 3rd edition. Springer, Berlin
• Kröger BJ (1998) Ein phonetisches Modell der Sprachproduktion [A phonetic model of speech production]. Niemeyer, Tübingen
• Kröger BJ (1993) A gestural production model and its application to reduction in German. Phonetica 50: 213-233
• Kröger BJ (2003) Ein visuelles Modell der Artikulation [A visual model of articulation]. Laryngo-Rhino-Otologie 82: 402-407
• Kröger BJ, Birkholz P, Kannampuzha J, Neuschaefer-Rube C (2007) Multidirectional mappings and the concept of a mental syllabary in a neural model of speech production. In: Fortschritte der Akustik: 33. Deutsche Jahrestagung für Akustik, DAGA '07. Stuttgart
• Kröger BJ, Birkholz P, Kannampuzha J, Neuschaefer-Rube C (2006) Learning to associate speech-like sensory and motor states during babbling. Proceedings of the 7th International Seminar on Speech Production. Belo Horizonte, Brazil, pp. 67-74
• Kröger BJ, Gotto J, Albert S, Neuschaefer-Rube C (2005) A visual articulatory model and its application to therapy of speech disorders: a pilot study. In: Fuchs S, Perrier P, Pompino-Marschall B (eds.) Speech production and perception: Experimental analyses and models. ZASPiL 40: 79-94
• Mermelstein P (1973) Articulatory model for the study of speech production. Journal of the Acoustical Society of America 53: 1070-1082
• Saltzman EL, Munhall KG (1989) A dynamic approach to gestural patterning in speech production. Ecological Psychology 1: 333-382
• Titze IR (1984) Parameterization of the glottal area, glottal flow, and vocal fold contact area. Journal of the Acoustical Society of America 75: 570-580

Observation: Hannah (0-2 years): each morning during waking up.

Training set: "silent mouthing"
• combination of the minimum, (mid,) and maximum values {0, (0.5,) 1} of all 10 joint parameters (Kröger et al.
2006, DAGA Braunschweig)
• double closures and non-physiological articulations are avoided
• subsets for the lips and for the tongue → 4608 patterns of training data

Training
• Design of the net: one-layer feed-forward, 25+18 input neurons (somatosensory), 40 output neurons (motor) → ca. 2000 links
• Set of 4608 training patterns: the min-max combination training set ("silent mouthing")
• 5,000 cycles of batch training → mean error of ca. 10% for the prediction of an articulatory state (Kröger et al. 2006b, ISSP, Ubatuba, Brazil)
• Software: Java version of SNNS (Stuttgart Neural Network Simulator), http://www-ra.informatik.uni-tuebingen.de/SNNS/

Training results: "motor equivalence" … despite a prediction error of 10%: [Figure: position of the lower jaw low vs. high; labial, apical, and dorsal closures; in each column the somatosensory values are the same (except for the jaw parameter)] The acoustically relevant closures are kept despite a strong jaw perturbation.
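The training setup above (25+18 somatosensory input neurons, 40 motor output neurons, one-layer feed-forward, batch training over 4608 patterns) might be sketched with plain NumPy as follows. The data here are random placeholders standing in for the "silent mouthing" patterns, and the learning rate, epoch count, and synthetic targets are illustrative assumptions; the original work used the SNNS simulator.

```python
import numpy as np

rng = np.random.default_rng(1)

# 25 + 18 somatosensory input neurons, 40 motor output neurons (as in the text)
n_in, n_out = 25 + 18, 40

# Placeholder data: random stand-ins for the 4608 "silent mouthing" patterns.
# The real training set pairs somatosensory states with motor states.
X = rng.random((4608, n_in))
W_hidden_truth = rng.standard_normal((n_in, n_out)) * 0.1
Y = 1.0 / (1.0 + np.exp(-(X @ W_hidden_truth)))   # synthetic "motor" targets

# One-layer feed-forward net: a weight matrix, a bias, and sigmoid outputs.
# n_in * n_out = 1720 weights, matching the "ca. 2000 links" of the slide.
W = np.zeros((n_in, n_out))
b = np.zeros(n_out)

def forward(X):
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

lr = 0.1
losses = []
for epoch in range(300):                   # batch training over all patterns
    P = forward(X)
    err = P - Y
    losses.append(float(np.mean(err ** 2)))
    W -= lr * (X.T @ err) / len(X)         # logistic-loss gradient step
    b -= lr * err.mean(axis=0)
```

Batch training means one gradient step per pass over the full pattern set; the mean squared error recorded in `losses` decreases over the epochs.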