3D Computer Simulation of the Human
Vocal Tract
Towards an Extensible Infrastructure for a
Three-dimensional Face and Vocal-Tract
Sid Fels, ECE, Bryan Gick, Linguistics
Florian Vogt, ECE, Ian Wilson, Linguistics, Carol Jaeger, ECE
University of British Columbia
Vancouver, BC, Canada
V6T 1Z4
Introduction and Motivation
• Want speech synthesis that is:
– low bandwidth
– high quality
– visually coordinated
– physically based
• Important developments
– computational speed increases
– new imaging techniques
– better understanding of vocal tract
Speech Synthesis Techniques
• Three main synthesis techniques:
– Time domain
• LPC (Markel & Gray, 1972), CELP (Schroeder & Atal),
Multipulse (Atal & Remde)
• CELP used in cell phones: good quality at 4.8 kbps
• used in text-to-speech applications too
• concatenation based systems
– Frequency domain
– Articulatory domain
Time domain Text To Speech (TTS)
• TTS in time domain
– concatenate prerecorded speech segments
– pitch change and transitions tricky
• research on different ways to do this
• Here are a few examples:
– CSTR Edinburgh: diphone synthesis / non-uniform unit
selection 1, 2, 3
– CHATR ATR-ITL Kyoto / Japan, non-uniform unit
selection 1
– BellLabs-TTS-System, LPC diphone synthesis 1, 2
– PSOLA (Verhelst, 1990) and more
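A minimal sketch of the concatenation step, assuming segments are plain lists of samples and that a linear crossfade is an acceptable transition. The systems listed above use more elaborate, pitch-aware joins; the name `crossfade_concat` and the `overlap` parameter are illustrative and not taken from any of them.

```python
def crossfade_concat(a, b, overlap):
    """Join two segments, blending the last `overlap` samples of `a`
    with the first `overlap` samples of `b` using linear weights."""
    if not 0 <= overlap <= min(len(a), len(b)):
        raise ValueError("overlap must fit inside both segments")
    head = len(a) - overlap
    out = list(a[:head])
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of segment b, rising toward 1
        out.append((1.0 - w) * a[head + i] + w * b[i])
    out.extend(b[overlap:])
    return out
```

A crossfade only hides amplitude discontinuities; it does not align pitch periods, which is one reason pitch change and transitions remain tricky.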
Frequency domain TTS
• TTS using formants:
– spectral changes are slow
• should interpolate well
– rules for transitions can’t be simply linear
– change speaker characteristics easily
• not so natural though
– change intonation (somewhat)
– Main synthesizers: (Klatt, 1980) and (Rye &
Holmes, 1982)
– Examples:
• Infovox, Telia Promotor / KTH Stockholm 1
• Multilingual TTS system, TI Uni Duisburg 1
• DECtalk: regular 1 2, affective modification (Cahn) 1 2 3
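A sketch of the second-order digital resonator that formant synthesizers in the style of (Klatt, 1980) apply once per formant. The function names and the frequency, bandwidth and sampling-rate values are assumptions chosen for illustration; a full synthesizer would cascade or parallel several such resonators and add a source model.

```python
import math


def resonator_coeffs(freq_hz, bw_hz, fs_hz):
    """Coefficients of y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / fs_hz
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalises the gain at 0 Hz to one
    return a, b, c


def resonate(samples, freq_hz, bw_hz, fs_hz):
    """Filter `samples` through a single formant resonator."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, fs_hz)
    y1 = y2 = 0.0
    out = []
    for x in samples:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out
```

Changing `freq_hz` and `bw_hz` over time is what makes speaker characteristics easy to alter in this domain; the transition rules themselves are not covered by this sketch.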
Articulatory Synthesis
• Parameterize human vocal tract, glottis
and lungs
– mechanical or electrical systems
2D Articulatory Synthesis
• Articulatory model used at
– Mermelstein, 1971
– Coker, 1976
• C: Tongue body
• H: Hyoid
• J: Jaw
• L: Lips
• T: Tongue Tip
• V: Velum
2D Articulatory Synthesis
• To synthesize speech:
– convert to area function
– use acoustic tube model
– activate with sound source
• Sound/excitation source
– waveform
– model
• Glottal and Lung model from (Flanagan, Ishizaka, and
Shipley, 1975)
• 2 masses and springs (oscillator)
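A small sketch of the acoustic tube step, assuming the area function is already available as a list of cross-sectional areas ordered from glottis to lips. The reflection coefficient uses the pressure-wave sign convention (some texts use the opposite sign), and the uniform-tube formula assumes a tube closed at the glottis and open at the lips. The function names and the default sound speed of 350 m/s are illustrative choices.

```python
def reflection_coefficients(areas):
    """Pressure reflection coefficient at each junction between
    adjacent tube sections with cross-sectional areas `areas`."""
    return [(a0 - a1) / (a0 + a1) for a0, a1 in zip(areas, areas[1:])]


def uniform_tube_formants(length_m, n_formants, c=350.0):
    """Resonances of a uniform tube closed at one end and open at the
    other: F_n = (2n - 1) * c / (4 * L)."""
    return [(2 * n - 1) * c / (4.0 * length_m) for n in range(1, n_formants + 1)]


# A 17.5 cm uniform tube gives resonances near 500, 1500 and 2500 Hz.
formants = uniform_tube_formants(0.175, 3)
```

A uniform tube has no reflections between sections (all coefficients are zero); constrictions produce non-zero coefficients that shift the resonances away from these neutral values.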
Articulatory Synthesis: Haskins
Articulatory Synthesis: History
• Interesting history
– sometimes hot (1700s, 2000) and
sometimes not (1800s, 1970-1980s)
• Important now because of “Talking
Head” research
– McGurk effect
Articulatory Synthesis: Talking Heads
• Visual and auditory signals interact
– visual signal can make auditory signal hard to hear
• McGurk Effect Demo
• Talking heads important for:
– more natural interaction
– dubbing new voices
– compact encoding of voice and image
• Can we create good talking head from acoustic signal?
– Not so easy: e.g., Bregler, Slaney and Covell
– articulatory synthesis provides necessary articulatory movement
with audio waveform
see “Speech Recognition and Sensory Integration”, Massaro and Stork,
American Scientist, Vol. 86, 1998.
Articulatory Synthesis: History
• Kratzenstein resonators (1770 - Imperial
Academy of St. Petersburg contest)
Articulatory Synthesis: AVTs
• von Kempelen’s AVT (1791)
Articulatory Synthesis: more T.H.
• R. R. Riesz's talking mechanism, 1937
Articulatory Synthesis: electronic AVTs
• The Voder (Dudley, Riesz and Watson,
• Example
Articulatory Synthesis: more AVTs
• Alexander G. Bell and Melville Bell
– moulded human skull
• Glove-TalkII (Fels and Hinton, 1998)
– uses gestural articulator model
Articulatory TTS Synthesis
• Text-to-speech possible:
– Pavarobotti, National Center
for Voice and Speech, (U.
Iowa) 1, 2
• See also Perry Cook’s work
at Princeton (formerly at
Stanford (CCRMA))
• SPASM (example song)
P. R. Cook, "Synthesis of the Singing Voice Using a
Physically Parameterized Model of the Human Vocal
Tract," ICMC, 1989.
Where we are going:
3D Articulatory Synthesis
• Creating infrastructure for research on
3D articulatory synthesis
• Use for:
– speech synthesis
– speech analysis
– articulation analysis
Where we are at:
• Developing and extending scenegraph
methodology for model specification
– multi-scale and multi-level capable
• Elaborating scenegraph model based on
– base-line for refinement or simplification
• Exploring techniques for synthesis
– support 2D to 3D techniques
– integrate source-tract models as well
• Developing measurement tools
– ultrasound tongue tracking; integrate with Haskins
Articulatory Synthesis: Project Overview
• Build S/W Infrastructure to support
modular approach to modeling
– extrapolate and abstract from Lee, Waters
and Terzopoulos work
Lee, Terzopoulos, Waters, 1998
Haber, Kaehler, et al., 2002
Software Architecture
• 5 main components to deal with:
1. simulator engine,
2. three-dimensional geometry module
3. graphical user interface (GUI) module,
4. synthesis engine and
5. numerics engine.
Structure of Simulator
3D geometry
Imaging data
3D Geometry: Scene Graph
• Base model notation on Scene Graph
– basis of 3D animation
– nodes for specifying graphical models including
• shapes, cameras, lights, properties, transformation,
engines, selection, view etc.
• Extend and add nodes to represent
– muscles, constraints, nerves, dynamics
– may need multiple passes per iteration
3D Geometry: Scene Graph
• Example decomposition
– Head
• Skull
• Mouth
– jaw
– teeth
– tongue
» hyoid
– cheek, other soft structures
Respiratory tract
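The decomposition above can be sketched as a tiny scene graph. This is a minimal illustration, not the project's node design: the `Node` and `MuscleNode` classes, the `activation` field and the root node called "Model" are all assumptions introduced here.

```python
class Node:
    """Generic scene-graph node holding a name and child nodes."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = list(children or [])

    def traverse(self, depth=0):
        """Depth-first traversal yielding (depth, name) pairs."""
        yield depth, self.name
        for child in self.children:
            yield from child.traverse(depth + 1)


class MuscleNode(Node):
    """Hypothetical extension node carrying a muscle activation level."""

    def __init__(self, name, activation=0.0, children=None):
        super().__init__(name, children)
        self.activation = activation


model = Node("Model", [
    Node("Head", [
        Node("Skull"),
        Node("Mouth", [
            Node("jaw"),
            Node("teeth"),
            Node("tongue", [Node("hyoid")]),
            Node("cheek"),
        ]),
    ]),
    Node("Respiratory tract"),
])

for depth, name in model.traverse():
    print("  " * depth + name)
```

New node types such as muscles or constraints would subclass `Node` in the same way, so that a traversal pass can treat graphical and physical nodes uniformly.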
Using Image Data
• Data from MRI, ultrasound, EMA and
other imaging devices
– used in real-time or offline
– create geometry
– provide constraints on system
Structure of Simulator
• Separate out rendering of model
• Integrate with other 3D animation tools
e.g., Blender, Maya, 3ds Max.
Structure of Simulator
• Separate out to make simulation code clean
• Allow multiple access points to reduce
– command line, GUI, scripted, stdin/stdout, files
(save state)
• Control module behaviour
• Automate as much as possible
– extensions get support for GUIs
Structure of Simulator
• All numerical processes separated out from
– allow for improvements to numerical routines
– allow flexibility to switch methods
• e.g., implicit vs. explicit Euler integration, FEM solver
• Each pass through scene graph builds up
– numerics operate on state and return state back to
scene graph
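A sketch of why the integration method is worth keeping swappable, assuming an undamped oscillator x'' = -omega2 * x as the test system. Each integrator takes a state and returns a new state, mirroring the "operate on state and return state" idea; the function names and the test system are illustrative, not part of the described software.

```python
def explicit_euler(state, h, omega2):
    """Forward Euler step for x'' = -omega2 * x."""
    x, v = state
    return (x + h * v, v - h * omega2 * x)


def implicit_euler(state, h, omega2):
    """Backward Euler step; the 2x2 linear system is solved in closed form."""
    x, v = state
    det = 1.0 + h * h * omega2
    return ((x + h * v) / det, (v - h * omega2 * x) / det)


def integrate(step, state, h, omega2, n_steps):
    """Advance `state` by `n_steps` steps of the chosen method."""
    for _ in range(n_steps):
        state = step(state, h, omega2)
    return state


def energy(state, omega2):
    """Total energy per unit mass of the oscillator."""
    x, v = state
    return 0.5 * v * v + 0.5 * omega2 * x * x


start = (1.0, 0.0)
e_explicit = energy(integrate(explicit_euler, start, 0.05, 100.0, 100), 100.0)
e_implicit = energy(integrate(implicit_euler, start, 0.05, 100.0, 100), 100.0)
```

On this system the explicit method gains energy every step and eventually blows up, while the implicit method loses energy and stays bounded, which is the usual trade-off between cheap steps and stability for stiff tissue models.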
Structure of Simulator
• As simple as possible:
– infinite loop that updates state of simulation
– traverses scene graph
• calculate new state
• render
• simulate airflow
– can be thought of as an independent
module as well
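The loop described above can be sketched as follows. This is an assumption-laden outline: the `Simulator` class, its callback parameters, the flattened node list and the `FallingPoint` stand-in node are all hypothetical, and `max_steps` exists only so the otherwise infinite loop can be stopped in a demo.

```python
class Simulator:
    """Minimal engine: update state, render, then simulate airflow."""

    def __init__(self, nodes, render, simulate_airflow, dt=0.001):
        self.nodes = nodes
        self.render = render
        self.simulate_airflow = simulate_airflow
        self.dt = dt
        self.steps = 0

    def step(self):
        for node in self.nodes:      # traverse the (flattened) scene graph
            node.update(self.dt)     # calculate new state
        self.render(self.nodes)
        self.simulate_airflow(self.nodes, self.dt)
        self.steps += 1

    def run(self, max_steps=None):
        """Loop forever when max_steps is None, else stop after max_steps."""
        while max_steps is None or self.steps < max_steps:
            self.step()


class FallingPoint:
    """Hypothetical stand-in node: a point accelerating downward."""

    def __init__(self):
        self.height = 1.0
        self.velocity = 0.0

    def update(self, dt):
        self.velocity -= 9.81 * dt
        self.height += self.velocity * dt


log = []
sim = Simulator(
    [FallingPoint()],
    render=lambda nodes: log.append("render"),
    simulate_airflow=lambda nodes, dt: log.append("airflow"),
)
sim.run(max_steps=3)
```

Passing the render and airflow stages in as callables keeps the engine itself small and lets either stage be replaced or run as an independent module.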
3D Articulator Techniques & Issues
1. Vocal Tract Model
– geometric versus physical model
– static versus dynamic model
– parameter extraction/tuning from
• real data
– X-Ray, MRI, EMA, Ultrasound, Electropalatography (Stone
and Lundberg, 1996)
• anatomy
3D Articulatory Synthesis Model
– Our direction
• use physical model of vocal tract and tongue
– articulator/muscle based model
– match face model
• use dynamic modeling of soft tissues
– tongue, lips, cheeks, pharynx, etc.
– include volume constraints
– collision detection
3D Articulatory Vocal Tract Model
• Which technique to use for soft tissues?
(see Gibson and Mirtich, 1997 for review)
a) Non-physical models
– splines, patches
– difficult to get deformations correct
– may be good for representing static 3D shape of
vocal tract
b) Spring-mass models
• use a collection of point masses connected by springs
• popular in facial animations
– e.g., Waters, 1987 and Lee, Terzopoulos, Waters, 1998
• may be difficult to model stiff areas well
– numerical instabilities
• volumetric constraints difficult to model well
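A minimal sketch of the force computation in a spring-mass model, assuming 2D point positions and linear (Hooke's law) springs. The function name and the tuple layout `(i, j, rest_length, stiffness)` are illustrative choices, not taken from the cited facial animation systems.

```python
import math


def spring_forces(positions, springs):
    """Return the net spring force on each point mass.

    positions: list of (x, y) tuples
    springs:   list of (i, j, rest_length, stiffness) tuples
    """
    forces = [[0.0, 0.0] for _ in positions]
    for i, j, rest_length, stiffness in springs:
        dx = positions[j][0] - positions[i][0]
        dy = positions[j][1] - positions[i][1]
        dist = math.hypot(dx, dy)
        if dist == 0.0:
            continue  # direction undefined for coincident points
        magnitude = stiffness * (dist - rest_length)  # positive when stretched
        fx = magnitude * dx / dist
        fy = magnitude * dy / dist
        forces[i][0] += fx
        forces[i][1] += fy
        forces[j][0] -= fx  # equal and opposite reaction
        forces[j][1] -= fy
    return forces
```

Stiff regions need large `stiffness` values, which force smaller time steps for explicit integration; this is the source of the numerical instabilities noted above. Nothing in this formulation preserves volume, so volume constraints must be added separately.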
3D Articulatory Vocal Tract Model
c) Boundary Line Integral and Boundary
Element Method (James and Pai, 1999)
• Use boundary integral equation formulation
of static linear elasticity
– use boundary element method to solve
– limited to boundary only
» how to deal with heterogeneous tissue?
Tongue may be difficult
• Limited to linear elasticity
– should be OK for small deformations
d) Continuum models and Finite
Element Methods
• some human tissue models have been created:
– Payan et al., 2001, Gourret et al., 1989, Chen and
Zeltzer, 1992, Bro-Nielsen, 1997.
3D Articulator Synthesis: Engwall / Badin
• One attempt by (Engwall / Badin, 1999)
– geometric model of vocal tract derived from
articulator parameters
• array of vertices (polygonal mesh)
• symmetric around midsagittal plane
– tongue model is set of filtered vertices
• 5 parameters
– synthesis model
• acoustic tube
3D Articulatory Synthesis Model
Anatomical Models for simulation
– FEM-Tongue models
e.g. Dung 2000
– Ultrasound Tongue models
e.g. Stone 1998
– Anatomical Tongue model
Takemoto 2001
3D Articulatory Synthesis Model
2) Synthesis Model
– simulate propagation of air pressure waves
through the 3D vocal tract model
• simplified source model
• fluid dynamic models
• maybe modification of ray-tracing
– 2D acoustic tube model
– dynamic models and time domain source
– classical electrical analog
» enhanced with airflow model (Jackson, 2000)
– 2.5D and 3D acoustic tube
– with and without source models
– FEM or BEM analysis for flow and turbulence
3D Articulatory Synthesis: Applications
• Surgical prediction (Payan et al., 2002)
• compression
– videophone, multimedia data
• speech research tool
• text-to-speech synthesis
• lip synchronization in movies (Haber et al., 2001)
• new musical instruments (Vogt et al., 2001)
– based on 3D models + wave propagation
• Some applications may not require complete,
3D articulatory synthesis model
3D Articulatory Synthesis: Summary
• Continue building framework for public
domain 3D Articulatory Synthesizer
• Construction of modular S/W architecture
– integrating simple models of vocal tract behaviour
– providing support for multiple level-of-detail modeling
– 2D tube model for synthesis
• Developing 3D tube models
• Building anatomically based vocal tract
– development of scene graph semantics and syntax
Articulatory Synthesis: Haskins
• Haskins lab articulatory synthesis (ASY)
– synthesize static vowel sounds (/u/, /i/)
– vowel space mostly function of tongue
position and height
Articulatory Synthesis: Haskins
• Can also synthesize utterances
• Use table (script) for specifying
configuration of vocal tract (articulation)
• Use table (control) that specifies timing
– source
– articulation
– like key framing or morphing
• synthesis done every pitch pulse
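The key-framing idea can be sketched as linear interpolation between timed articulator configurations. This is an illustration under assumptions: the interpolation scheme actually used by ASY is not stated here, the function name is made up, and the parameter names "J" and "T" with their values are placeholder data.

```python
def interpolate_keyframes(keyframes, t_ms):
    """Linearly interpolate articulator parameters at time `t_ms`.

    keyframes: list of (time_ms, {name: value}) pairs sorted by strictly
    increasing time, all sharing the same parameter names.
    """
    if t_ms <= keyframes[0][0]:
        return dict(keyframes[0][1])
    for (t0, p0), (t1, p1) in zip(keyframes, keyframes[1:]):
        if t_ms <= t1:
            w = (t_ms - t0) / (t1 - t0)  # 0 at the first frame, 1 at the second
            return {k: (1.0 - w) * p0[k] + w * p1[k] for k in p0}
    return dict(keyframes[-1][1])  # hold the last frame after the end


# Placeholder script: two frames 75 ms apart.
script = [
    (0.0, {"J": 0.0, "T": 1.0}),
    (75.0, {"J": 1.0, "T": 0.0}),
]
midpoint = interpolate_keyframes(script, 37.5)
```

Evaluating the function once per pitch pulse would give the vocal tract configuration needed for each synthesis step.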
Articulatory Synthesis: Haskins
• Examples:
1st frame
– /da/
– about 75 ms from start to end
– sound
– utterance
2nd frame
Articulatory Synthesis: Haskins
• Formant Tracks
Articulatory Synthesis: Haskins
• Spectrogram
Articulatory Synthesis: Haskins
• Problems:
– vowel sounds OK
– plosives, fricatives, liquids and aspirates not OK
– where to get articulator data?
• Measurements
electromagnetic articulograph
• Models
– only use rigid 2D models
Articulatory Synthesis: more Talking Heads
• Joseph Faber's Amazing Talking
Machine: Euphonia (1830-40's)
Articulatory Synthesis: analog AVT
• Dunn's Electrical Vocal Tract, Journal of
the Acoustical Society of America, 1950