
Motor Control Strategies
for Chinese Intonation
Greg Kochanski (University of Oxford, UK)
Chilin Shih (University of Illinois, Urbana-Champaign)
Tan Lee (Chinese University of Hong Kong)
Hongyan Jing (IBM)
• The Goal:
– Explain intonation in a way that is:
• Consistent with linguistic assumptions.
• Consistent with known Physiology and Neuroscience.
• The Method:
– Motion planning over a phrase.
– Minimize the sum of:
• Error between actual pitch and linguistic target
• An “effort” cost term that penalizes rapid, jerky motions.
• The Result:
– Intonation in tone languages can be represented by:
• A lexically-specified tone template (i.e., you use a dictionary to look up
which tone a syllable has).
• A continuous cost-of-error parameter, one per word.
– Evidence that the cost-of-misinterpretation values we measure are real:
• Cross-language similarities
• Metrical patterns
• Other
Tone languages provide the ideal test case for motor control, because:
1. tone is important, and
2. you can be sure what the speaker is trying to accomplish.
The meaning of each syllable is determined by the pitch contour
over the syllable.
Ma (high tone) = “Mother”
Ma (rising tone) = “Hemp”
Ma (low falling tone) = “Horse”
Ma (high falling tone) = “to scold”
You can look up the tone in the dictionary.
Pitch contour is determined primarily by muscle tension in the
vocal folds.
Tone shapes
[Figure: F0 (Hz) vs. time (10 ms intervals); typical tone shapes in green.]
Another Challenge
People talk nearly as fast as possible, therefore
dynamics must be important.
[Figure: pitch (f0) for a maximum-rate warble, with Mandarin on the same time scale.]
The Data
Male speaker of Mandarin (Chinese)
Female speaker of Cantonese (Chinese)
Text from newspaper news stories.
737 syllables for Mandarin
– 4 ± 1.4 syllables per second
– 1.20.7 seconds per phrase (between pauses).
• Segmented into words by three independent native
speakers (Mandarin)
• Tracks of fundamental frequency vs. time (pitch)
extracted by get_f0 from ESPS/Waves package.
Basic assumptions used in modeling
• People plan their utterances several syllables in advance.
• People produce optimal, highly practiced speech.
– Most of what we say is made from bits and pieces we’ve said before.
– There are only 4 (Mandarin) or 6 (Cantonese) tones to combine.
– A speaker has the chance to practice and optimize all the common 3- and 4-tone sequences.
• A simple model for f0 (pitch): f0 is linearly related
to muscle tensions.
• A simple model of the muscle control strategy.
– No reason to believe pitch is controlled differently from other muscle motions.
Optimize what?
• People want to minimize the chance that they will be
significantly misunderstood. Some words will be
more important than others:
– Risk = P(misinterpreted) * cost-of-misinterpretation
– Perhaps weight matches importance.
• People want to minimize effort and/or talk faster
– Chairs, Cars
• How to combine the two?
– A weighted sum.
– Cost-of-misinterpretation plays the role of the weight.
What is the unit of motion planning?
Probably a phrase or a sentence.
(Data courtesy Chilin Shih)
People start at a higher pitch when they begin longer sentences.
Also planning of inhaled air volume.
Therefore, there is some plan ~300 ms before start of speech.
Modeling math
p is the realized pitch
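“We’re optimizing something”:
$p(t) = \arg\min_{p(t)} \, (G + R)$
p is implicitly a function of time.
G is the effort: $G = \int dt \, \left( p^2 + c_1^2 \, \dot{p}^2 + c_2^2 \, \ddot{p}^2 \right)$, where dots denote time derivatives and $c_1$, $c_2$ are constant weights.
$R = \sum_i s_i r_i$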
R is the total risk for the utterance: $r_i$ is the error of the ith target, and $s_i$ is the cost if this particular word is misinterpreted.
$r_i = \int_{t \in \text{target } i} (p - y)^2 \, dt$
(This is an approximation; see elsewhere for the correct, more detailed equation.)
y(t) is the pitch of a point in the ith target.
The time-dependence is suppressed for clarity.
Modeling math – more detail.
Total risk for the utterance: $R = \sum_i s_i r_i$
$s_i$ is the cost of a misinterpretation of the ith syllable.
$r_i$ is the error of the ith target:
$r_i = \sum_{t \in \text{target } i} \alpha \left[ (p - \bar{p}_i) - (y - \bar{y}_i) \right]^2 + \beta \left( \bar{p}_i - \bar{y}_i \right)^2$
Alpha ($\alpha$) controls how much the shape of the pitch contour matters.
Beta ($\beta$) controls how much the average pitch of the syllable matters.
y is the pitch of a point in the ith target.
A bar denotes an average over a target.
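The error and risk terms on this slide can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `target_error` and `total_risk` are hypothetical, pitch is assumed to be sampled at discrete times, and the mean-pitch term is assumed to be counted once per target rather than once per sample.

```python
def target_error(p, y, alpha, beta):
    """Error r_i for one target: a shape term plus a mean-pitch term.

    p, y   -- equal-length lists of realized and target pitch samples
    alpha  -- weight on the shape of the contour (assumed supplied by caller)
    beta   -- weight on the average pitch (assumed supplied by caller)
    """
    if len(p) != len(y) or not p:
        raise ValueError("p and y must be non-empty and of equal length")
    p_bar = sum(p) / len(p)
    y_bar = sum(y) / len(y)
    # Shape term: compare the contours after removing each one's average.
    shape = sum(((pt - p_bar) - (yt - y_bar)) ** 2 for pt, yt in zip(p, y))
    # Mean term: compare the averages (assumption: counted once per target).
    mean = (p_bar - y_bar) ** 2
    return alpha * shape + beta * mean


def total_risk(strengths, errors):
    """Total risk R: the sum over targets of s_i * r_i."""
    return sum(s * r for s, r in zip(strengths, errors))
```

With these definitions, a contour that has the right shape but is shifted in pitch is penalized only through beta, while a contour with the right average but the wrong shape is penalized only through alpha.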
How does G depend on the form of the pitch curve?
Large effort implies a curve with larger slopes and sharper corners: wigglier.
$G = \int dt \, \left( p^2 + c_1^2 \, \dot{p}^2 + c_2^2 \, \ddot{p}^2 \right)$
Model behavior
• For s>>1, Error (R) dominates, and pitch matches target.
• For s<<1, Effort (G) dominates, both speaker and listener
accept large deviations, and pitch smoothly interpolates.
• For s~1, everything compromises.
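The three regimes above can be reproduced with a small numerical sketch. It is not the authors' implementation: it assumes pitch sampled at discrete times, a simplified effort term built only from squared first and second differences (slopes and corners), the simple squared-error form of the target error, and hypothetical names (`plan_pitch`, `rms_deviation`, the weight `c2`). Because the cost is then quadratic in p, the minimum is found by solving one linear system.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[k]] for k, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def add_difference_penalty(A, stencil, weight):
    """Add weight * D^T D to A, where D applies `stencil` at each position."""
    n, k = len(A), len(stencil)
    for start in range(n - k + 1):
        for i in range(k):
            for j in range(k):
                A[start + i][start + j] += weight * stencil[i] * stencil[j]


def plan_pitch(targets, strengths, c2=1.0):
    """Minimize effort + sum_i s_i * (squared error to target i).

    targets   -- one list of target pitch samples per syllable
    strengths -- one cost-of-misinterpretation s_i per syllable
    c2        -- assumed weight on the second-difference (corner) term
    """
    y = [v for tgt in targets for v in tgt]
    w = [s for tgt, s in zip(targets, strengths) for _ in tgt]
    n = len(y)
    A = [[0.0] * n for _ in range(n)]
    add_difference_penalty(A, [-1.0, 1.0], 1.0)            # slopes
    add_difference_penalty(A, [1.0, -2.0, 1.0], c2 ** 2)   # corners
    for t in range(n):
        A[t][t] += w[t]
    # Setting the gradient to zero gives (Q + W) p = W y.
    return solve_linear(A, [w[t] * y[t] for t in range(n)])


def rms_deviation(p, targets):
    """RMS difference between the planned pitch and the targets."""
    y = [v for tgt in targets for v in tgt]
    return (sum((a - b) ** 2 for a, b in zip(p, y)) / len(y)) ** 0.5


# Three syllables with flat high, low, high targets (values in Hz, made up).
targets = [[200.0] * 5, [120.0] * 5, [200.0] * 5]
careful = plan_pitch(targets, [100.0] * 3)   # s >> 1
sloppy = plan_pitch(targets, [0.01] * 3)     # s << 1
print(rms_deviation(careful, targets), rms_deviation(sloppy, targets))
```

In this sketch the high-strength plan tracks the targets closely, while the low-strength plan flattens into a smooth curve between them, which is the qualitative behavior the slide describes.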
The rest of the model:
• A model is a sequence of targets.
– The type of the target (tone1, tone2, …)
is looked up in a dictionary.
• Each target has a cost-of-misinterpretation.
– The cost is adjustable for each word
– Syllables within a word are derived from word cost via
the metrical pattern for words of a certain length.
• One target per tone.
• Targets are stretched to fit syllable duration.
• Only one phonological rule: 33 → 23
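The single phonological rule in this list is Mandarin third-tone sandhi: a tone 3 directly before another tone 3 is produced as tone 2. A minimal sketch follows, assuming tones are given as integers and using a hypothetical function name. The rule is applied naively to the dictionary tones; how longer runs of third tones are grouped depends on phrasing, which this sketch does not model.

```python
def apply_third_tone_sandhi(tones):
    """Return surface tones after applying 33 -> 23 to dictionary tones."""
    surface = list(tones)
    for i in range(len(tones) - 1):
        # The test uses the dictionary (underlying) tones, not the output.
        if tones[i] == 3 and tones[i + 1] == 3:
            surface[i] = 2
    return surface


print(apply_third_tone_sandhi([1, 3, 3, 4]))
```

The output of this step would then be the tone sequence used to pick a target for each syllable.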
What’s the procedure?
• Inputs: sequence of tones; costs of misinterpretations.
• Compute the pitch curve as a function of phonological inputs and the cost of a misinterpretation.
• Nonlinear least-squares fitting algorithm.
Model fits for Mandarin Chinese
Tone class (input)
Inside a word, the cost of a misinterpretation is distributed by the metrical pattern.
Model fits to Mandarin Chinese
0.61 free parameters per syllable, 13 Hz RMS error.
Results are stable under small changes in the model.
This model allows extra freedom: different tones are allowed to define their targets
Costs for misinterpreting different syllables.
The two models have words defined by different labelers.
This model allows less freedom: all tones have the same type of target.
Model parameters
Phrasing is
marked in
Data courtesy of Prof. Tan Lee.
Metrical patterns inside words (Mandarin)
The metrical pattern controls how the cost-of-misinterpretation is split up inside
a word. Syllables are marked with . The vertical position is proportional to
log(s) for each syllable, so higher syllables have larger s, and will be executed
more carefully. For 4-syllable words, the error bars are shown by the pairs of
segmentation of
characters into
Random segmentation of
characters into words –
Note that the metrical
pattern disappears, showing
that we are measuring
something real that is tied to
Another nice property
• The cost-of-misinterpretation parameter for a syllable is correlated with its mutual information with the preceding syllable:
• r = -0.175
• >95% confidence
• Pitch patterns are implemented
• sloppily for syllables that are unsurprising, and
• precisely for surprising ones.
(Mutual information values from a database of 15000 newspaper sentences.
Syllable identity was defined by phoneme content and tone.)
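One way to compute this kind of statistic is sketched below. It is an assumption-laden illustration, not the procedure used for the slide: it takes "mutual information with the preceding syllable" to mean pointwise mutual information over adjacent syllable pairs, estimates probabilities from raw counts, and uses hypothetical function names and made-up syllable labels such as "ma1" (phoneme content plus tone).

```python
import math
from collections import Counter


def pointwise_mutual_information(sentences):
    """Map each adjacent pair (prev, cur) to log2 P(prev, cur) / (P(prev) P(cur)).

    sentences -- iterable of lists of syllable labels
    """
    unigrams = Counter()
    bigrams = Counter()
    for syllables in sentences:
        unigrams.update(syllables)
        bigrams.update(zip(syllables, syllables[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    pmi = {}
    for (prev, cur), count in bigrams.items():
        p_pair = count / n_bi
        p_prev = unigrams[prev] / n_uni
        p_cur = unigrams[cur] / n_uni
        pmi[(prev, cur)] = math.log2(p_pair / (p_prev * p_cur))
    return pmi


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


# Toy corpus with made-up syllable labels.
corpus = [["ma1", "ma5"], ["ma1", "ma5"], ["hao3", "ma3"]]
pmi = pointwise_mutual_information(corpus)
```

Given a fitted cost-of-misinterpretation value for each syllable, `pearson_r` could then be applied to the per-syllable costs and the matching mutual information values; a negative r would mean that predictable syllables get lower costs.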
• Models with motor planning capture important aspects of speech.
• They allow a very compact representation of complex behaviors.
• Intonation is represented as:
– a small set of discrete symbols, in sequence,
– modulated by a cost-of-misinterpretation, with
• The cost-of-misinterpretation parameter seems real:
– Similar across languages
– Matches language structure
• This model can be applied broadly:
• Two dialects of Chinese
• Some aspects of English
• Separating different singing and speaking styles from the content
• See http://kochanski.org/papers .