Acoustic effects of variation in vocal effort
by men, women and children
Hartmut Traunmüller and Anders Eriksson
assisted by
Anita Andersson, Ingegerd Eklund and Jessika Rundlöv
with financial support from HSFR and NUTEK for the period 94-07-01 -- 96-06-31
and from SU
Acoustic properties of speech sounds vary
because of
linguistic - expressive - organic - perspectival
factors
This investigation is mainly concerned with
expressive variation
–vocal effort
–mode of phonation (whispering vs. phonating).
and interactions with
organic variation
–age
–sex
(men, women, children)
Vocal effort: a subjective,
physiological quantity
Voice level: an acoustic quantity
(SPL of a standard utterance measured at a
standard distance)
Alternative ways of controlling voice level:
Trained speaker's/singer's technique
–More variation in pulmonic pressure
–F0 less affected
Ordinary speaker's technique
–More variation in vocal fold tension
–F0 more affected
Adopted definition & quantification of
"vocal effort"
“Vocal effort” = The quantity that ordinary
speakers vary when they adapt their speech to
the demands of an increased or decreased
communicational distance.
Adjusting "loudness level"
(Holmberg, Hillman and Perkell, 1988)
Shouting
(Rostolland, 1982)
Speaking in noise
(Rastatter and Rivers, 1983; Loren, Colcord, and
Rastatter, 1986; Van Summers, Pisoni, Bernacki,
Pedlow, and Stokes, 1988; Bond, Moore, and Gable,
1989).
Different effects of white and multitalker
noise with same SPL
(Rivers and Rastatter, 1985)
Variation in vocal effort affects the shape
of the glottal pulses
(vocal fold closing velocity and relative closed interval duration)
(Holmberg et al., 1988; Holmberg, Hillman, Perkell, Guiod, and
Goldman, 1995; Södersten, Hertegård and Hammarberg, 1995).
... reflected in the spectral emphasis of the
partials above the first
(Gauffin and Sundberg, 1989; Childers and Lee, 1991; Granström and
Nord, 1992)
Variation in vocal effort affects F1
(Frøkjær-Jensen, 1966; Rostolland and Parant, 1974; Schulman, 1985; Bond
et al., 1989; Liénard and Di Benedetto, 1999).
F1 difficult to measure
but more open mouth >> higher F1
Variation in vocal effort affects
segment durations
(Fónagy and Fónagy, 1966, Rostolland, 1982, and Bonnot and
Chevrie-Muller, 1991)
larger effort: longer vowels but
somewhat shorter consonants
SPL as a measure of vocal effort?
(Liénard and Di Benedetto, 1999)
SPL plays no major part in judgments of
vocal effort
(Rundlöf, 1996; Traunmüller, 1997; Eriksson and Traunmüller, 1999)
SPL varies widely as a function of
perspectival factors.
Listeners distinguish variations in a speaker’s
vocal effort from variations in their own
distance from the speaker.
(Wilkens and Bartel, 1977, Eriksson and Traunmüller, 1999)
Our measure of vocal effort:
The average rating, by a group of
listeners, of the communicational distance
for each stimulus.
Our aim:
Acquire detailed quantitative knowledge
about those acoustic effects of variations
in vocal effort that are of perceptual
importance.
Relevant to:
Speech synthesis with desired paralinguistic
quality
Automatic recognition of linguistic information
Automatic recognition of expressive
information
Automatic recognition of organic information
Conversion of paralinguistic quality
Automatic speech-to-speech translation with
conserved paralinguistic quality
Theories of speech: The Modulation Theory
Subjects
6 male adults, 20–51 years
6 female adults, 20–38 years
4 boys, 7 years
4 girls, 7 years
all speaking Stockholm Swedish
Speech material
Anita: “Hur många kort tog du av varje färg?” (“How many cards did you take of each color?”)
Jag tog ett violett, åtta svarta och sex vita (“I took one violet, eight black and six white”)
5 phonated and 2 whispered versions
Recording
Place: Långängen, Lidingö
DAT-recorder
High quality microphone, wind protected, 50 mm
from speaker's lips
Stepwise attenuator 0, 8, 16, 24, and 32 dB
Sampling at 16 kHz, 16 bits per sample
HP-filtering at 70 Hz, 48 dB/octave
ESPS/Waves
For formant frequency measurements
resampled at 6.4 kHz for men, 8 kHz for women
and 10.667 kHz for children.
Table I. Distances between speaker and addressee. The full
range was used for phonated speech. Whispered speech was
only used at the two shortest distances.
Version        1     2     3     4      5
Distance (m)   0.3   1.5   7.5   37.5   187.5
Acoustic measurements
Sound pressure levels
SPLV (voiced segments & potentially voiced)
SPLS (three [s]-es)
SPL0 (voiced segments LP filtered at 1.5 F0mean, 18 dB/oct.)
Spectral emphasis
SPLV - SPL0
Fundamental frequency
F0 (mean and SD, excl. creaky voiced sections)
Formant frequencies
F1a (average of four [a]-s)
F3 (average of voiced segments & potentially voiced)
Segment durations
durV (average of 14 vowels, 3 [v] and 1 [j])
durC (average of 8 stops, 3 [s] and 1 [l])
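A minimal sketch of how the spectral emphasis follows from the two level measurements. It assumes the low-pass filtered signal is already available (the 18 dB/oct filter itself is not implemented; a pure fundamental stands in for it here), and the function names `level_db` and `spectral_emphasis` are hypothetical. Levels use an arbitrary reference.

```python
import math

def level_db(samples, ref=1.0):
    """Level in dB of the mean power of `samples`, relative to an arbitrary reference."""
    mean_power = sum(x * x for x in samples) / len(samples)
    return 10.0 * math.log10(mean_power / (ref * ref))

def spectral_emphasis(voiced, lowpassed):
    """SPLv - SPL0: level of the voiced signal minus the level of its low-pass filtered version."""
    return level_db(voiced) - level_db(lowpassed)

# Synthetic check: a fundamental plus one equally strong second partial.
fs, f0, n = 16000, 200, 1600  # 1600 samples = 20 full periods of 200 Hz
fundamental = [math.sin(2 * math.pi * f0 * i / fs) for i in range(n)]
second = [math.sin(2 * math.pi * 2 * f0 * i / fs) for i in range(n)]
voiced = [a + b for a, b in zip(fundamental, second)]

# Stand-in for the low-pass filtered signal: the fundamental alone (an assumption).
emphasis = spectral_emphasis(voiced, fundamental)
print(round(emphasis, 2))  # about 3.01 dB: the second partial doubles the power
```

Stronger partials above the first raise SPLv but leave SPL0 nearly unchanged, so the difference grows with the energy above the fundamental.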
The measure of vocal effort
Exp. 1
20 listeners
phonated utterances
original SPL
Exp. 2
20 listeners
phonated utterances
SPL random +/- 6 dB
Geometric means of distances in meters
Real         0.375   1.5    7.5   37.5   187.5
Estimated    0.47    0.69   1.9   7.5    31
Exp. 2 (dep.) vs. Exp. 1 (indep.): r = 0.993, slope = 0.93.
Estimated (dep.) vs. real distance (indep.): r = 0.90
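A minimal sketch of how one stimulus could be turned into a vocal effort value: the listeners' distance ratings are averaged geometrically, and the result is expressed as VEL = log2(d), the definition given in the caption of Fig. 4. The ratings below are invented for illustration, and the function names are hypothetical.

```python
import math

def geometric_mean(values):
    """Geometric mean: the exponential of the mean of the logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def vocal_effort_level(distance_m):
    """VEL = log2(d), with d the perceived communicational distance in meters."""
    return math.log2(distance_m)

# Hypothetical ratings (in meters) given by four listeners to one stimulus.
ratings_m = [1.0, 2.0, 4.0, 8.0]
d = geometric_mean(ratings_m)
vel = vocal_effort_level(d)
print(round(d, 2), round(vel, 2))  # about 2.83 m, VEL 1.5
```

Averaging in the logarithmic domain keeps a single very large rating from dominating the mean.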
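Rundlöf, J. (1996). Perceptuella ledtrådar vid auditiv bedömning av avståndet
mellan talare och lyssnare [Perceptual cues in auditory judgment of the distance
between speaker and listener]. D-uppsats (D-level thesis), lingvistik, SU.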
Extrinsic factors
(1) Communicational distance: log2(distance in meters)
(2) “Closeness”: e^(1-n) (see Fig. 1)
(3) Wind noise (wind velocity in m/s)
(4) Speaker age: log2(age in years)
(5) Boyhood (1, 0)
(6) Manhood (1, 0)
(7) Speaker-specific constants (speaker-specific average prediction error)
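The factors above can be sketched as a predictor vector for one utterance. This is an illustration under assumptions, not the authors' implementation: `n` is taken to be the version number from Table I (1 for 0.3 m up to 5 for 187.5 m), the speaker-specific constant is left out, and the function and key names are hypothetical.

```python
import math

def predictors(distance_m, version, wind_m_per_s, age_years, is_boy, is_man):
    """Return extrinsic factors (1) to (6) for one utterance."""
    return {
        "log2_distance": math.log2(distance_m),  # (1) communicational distance
        "closeness": math.exp(1 - version),      # (2) e^(1-n); n assumed to be the version number
        "wind": wind_m_per_s,                    # (3) wind velocity in m/s
        "log2_age": math.log2(age_years),        # (4) speaker age
        "boyhood": 1 if is_boy else 0,           # (5)
        "manhood": 1 if is_man else 0,           # (6)
    }

# Example: a 30-year-old woman addressing someone at 1.5 m (version 2), no wind.
x = predictors(1.5, 2, 0.0, 30, is_boy=False, is_man=False)
print(x)
```

Under this assumption the closeness term equals 1 at the shortest distance and decays toward 0 as the distance grows, so it only matters for the nearest conditions.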
FIG. 1. The average sound pressure level (SPLv), with an arbitrary reference, of the
voiced and potentially voiced segments in the phonated and whispered utterances
produced by men, women, boys, and girls.
[Figure: SPLv (dB) plotted against distance (m): 0.3, 1.5, 7.5, 37.5, 187.5; separate curves for phonated and whispered utterances.]
FIG. 2. The contribution of the environmental and speaker-specific
factors (1) communicational distance, (2) “closeness”, (3) wind noise, (4)
speaker age, (5) boyhood, (6) manhood, and (7) speaker-specific
constants, to the variation in acoustic variables measured in the phonated
utterances. These variables were (from left to right) SPLv, SPL0, spectral
emphasis (SPLv–SPL0), SPLs, utterance average F0, F1a, F3, and the
durations of vowel-like (durV) and consonantal segments (durC).
[Figure: one bar group per acoustic property; legend: Lg2(d), Closeness, Wind noise, Lg2(age), Boyhood, Manhood, Speaker spec. const.]
Sound pressure levels
The dependent variables were SPLv, SPL0, spectral emphasis (SPLv–
SPL0), and SPLs, for all of which the effect is expressed in dB.
                                  SPLv       SPL0       Emph.     SPLs
r2                                0.94       0.92       0.79      0.79
r2, speaker specific              0.98       0.96       0.90      0.88
Reference value                   58.6 dB    53.8 dB    4.9 dB    47.7 dB
1. Distance doubled               +4.6 *     +3.3 *     +1.4 *    +2.0 *
   Distance fivefold              +10.8 *    +7.6 *     +3.2 *    +4.6 *
2. “Closeness” 0.3 vs. 1.5 m      +9.5 *     +7.0 *     +2.6 *    +2.6 *
   “Closeness” 0.3 m vs. ∞        +15.0 *    +11.0 *    +4.1 *    +4.1 *
3. Wind velocity +1 m/s           +0.6 *     +0.5 *     +0.1 *    +0.6 *
4. Speaker age 30 vs. 7.5 yrs.    +3.7 *     +2.7 *     +1.0 *    +9.0 *
5. Boy                            +4.2 *     +4.2 *     –0.0 *    +6.4 *
6. Man                            –0.5 *     –1.1 *     +0.7 *    +2.2 *
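Because factor (1) is the base-2 logarithm of the distance, a per-doubling effect in dB can be scaled to any distance ratio by multiplying with log2 of that ratio. The sketch below is an illustration under that assumption (the helper name `effect_db` is hypothetical). It uses the "distance doubled" values of the sound pressure level table above.

```python
import math

def effect_db(per_doubling_db, distance_ratio):
    """Scale an effect given per doubling of distance to an arbitrary distance ratio."""
    return per_doubling_db * math.log2(distance_ratio)

# "Distance doubled" effects in dB, copied from the table above.
per_doubling = {"SPLv": 4.6, "SPL0": 3.3, "Emph": 1.4, "SPLs": 2.0}

fivefold = {name: round(effect_db(v, 5.0), 1) for name, v in per_doubling.items()}
print(fivefold)
```

The results land close to the fivefold values in the table; small deviations are expected because the per-doubling coefficients are themselves rounded.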
Table III. Occurrence of creaky voice, in % of the total
duration of the voiced segments.
Distance (m)   0.3   1.5   7.5   37.5   187.5   Mean
Men            1.4   7.8   0.5   0.7    0.0     2.1
Women          7.1   4.4   1.7   0.5    0.0     2.7
F0 and formant frequencies
The dependent variables were F0, F1 of the [a]-segments, and F3 of the
voiced segments, for all of which the effect is expressed as a factor.
                                  F0        F1a       F3
r2                                0.91      0.84      0.93
r2, speaker specific              0.97      0.93      0.97
Reference value                   175 Hz    580 Hz    2687 Hz
1. Distance doubled               1.13 *    1.08 *    1.00 *
   Distance fivefold              1.37 *    1.19 *    1.01 *
2. “Closeness” 0.3 vs. 1.5 m      1.36 *    1.09 *    1.01 *
   “Closeness” 0.3 m vs. ∞        1.63 *    1.15 *    1.02 *
3. Wind velocity +1 m/s           1.04 *    1.03 *    1.01 *
4. Speaker age 30 vs. 7.5 yrs.    0.74 *    0.79 *    0.75 *
5. Boy                            1.00 *    1.05 *    1.00 *
6. Man                            0.61 *    0.84 *    0.88 *
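Effects expressed as frequency factors can also be read as musical intervals. The sketch below applies the standard conversion of 12 semitones per octave to two F0 factors from the table above. The helper name is hypothetical, and the source does not state how its own semitone values were computed.

```python
import math

def ratio_to_semitones(ratio):
    """Convert a frequency ratio to semitones (12 semitones per octave)."""
    return 12.0 * math.log2(ratio)

# F0 factors copied from the table above.
f0_distance_doubled = 1.13
f0_man = 0.61

print(round(ratio_to_semitones(f0_distance_doubled), 1))  # about 2.1 semitones up
print(round(ratio_to_semitones(f0_man), 1))               # negative: F0 is lower for men
```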
Table IV. Mean values and standard deviations of F0 as a function
of distance. Standard deviations also expressed in semitones.
Columns: distance (m); Men (Hz, st); Women (Hz, st); Boys (Hz, st); Girls (Hz, st).
Segment durations
The dependent variables were the durations of vowel-like (durV) and
consonantal segments (durC), for which the effect is expressed as a
factor.
                                  durV      durC
r2                                0.66      0.28
r2, speaker specific              0.88      0.63
Reference value                   58 ms     70 ms
1. Distance doubled               1.11 *    1.02 *
   Distance fivefold              1.27 *    1.04 *
2. “Closeness” 0.3 vs. 1.5 m      1.35 *    1.17 *
   “Closeness” 0.3 m vs. ∞        1.61 *    1.27 *
3. Wind velocity +1 m/s           1.05 *    1.00 *
4. Speaker age 30 vs. 7.5 yrs.    0.69 *    0.85 *
5. Boy                            0.99 *    0.93 *
6. Man                            1.04 *    1.00 *
Table V. The mean pausing time, in ms, in all phonated and whispered
utterances after the word listed in the first column.
Position    Men    Women   Boys   Girls
Jag         0      0       0      0
tog         29     51      68     252
ett         10     13      12     10
violett,    146    237     192    167
åtta        0      0       0      3
svarta      65     148     160    64
och         0      0       16     20
sex         15     6       17     6
vita.
Total       265    455     465    522
FIG. 3. The mean of the total pause duration (in ms) in phonated
and whispered utterances shown as a function of the communicational
distance for men, women, boys, and girls.
[Figure: pause duration (ms) plotted against distance (m) for phonated and whispered utterances.]
FIG. 4. SPLv (above), SPLs (middle), and the spectral emphasis SPLv–SPL0 (below)
shown as a function of vocal effort level VEL = log2(d), where d is the perceived
communicational distance in meters. Regression lines fitted to the whole set of data
for SPLv and emphasis, and to those obtained from each speaker group, men (solid
lines), women (broken), boys (dashed), and girls (dotted) for SPLs.
[Figure: x-axis Vocal Effort Level.]
Men      SPLv = 20.956 + 1.556 VEL (r = 0.99)
Women    SPLv = 20.413 + 1.609 VEL (r = 0.99)
Boys     SPLv = 21.901 + 1.477 VEL (r = 0.98)
Girls    SPLv = 20.631 + 1.490 VEL (r = 0.98)
Men      SPLs = 17.199 + 1.120 VEL (r = 0.94)
Women    SPLs = 16.585 + 0.696 VEL (r = 0.92)
Boys     SPLs = 16.244 + 0.623 VEL (r = 0.72)
Girls    SPLs = 14.087 + 0.391 VEL (r = 0.83)
Men      SPLv–SPL0 = 2.275 + 0.435 VEL (r = 0.88)
Women    SPLv–SPL0 = 1.618 + 0.553 VEL (r = 0.95)
Boys     SPLv–SPL0 = 1.973 + 0.373 VEL (r = 0.92)
Girls    SPLv–SPL0 = 1.901 + 0.522 VEL (r = 0.92)
Fig. 5. Mean values of F0, F1a, and F3, shown as a function of VEL for
men, women, boys, and girls. Regression lines fitted to
each variable (solid, dotted, broken lines) and speaker group.
[Figure: F0, F1a, F3 (Hz) plotted against Vocal Effort Level.]
Equations of the regression lines:
Men      log2(F0) = 6.918 + 0.217 VEL (r = 0.98)
Women    log2(F0) = 7.792 + 0.162 VEL (r = 0.94)
Boys     log2(F0) = 8.248 + 0.154 VEL (r = 0.93)
Girls    log2(F0) = 8.331 + 0.132 VEL (r = 0.95)
Men      log2(F1a) = 9.126 + 0.095 VEL (r = 0.91)
Women    log2(F1a) = 9.368 + 0.128 VEL (r = 0.95)
Boys     log2(F1a) = 9.764 + 0.155 VEL (r = 0.93)
Girls    log2(F1a) = 9.746 + 0.172 VEL (r = 0.94)
Men      log2(F3) = 11.217 + 0.003 VEL (r = 0.12)
Women    log2(F3) = 11.473 + 0.017 VEL (r = 0.55)
Boys     log2(F3) = 11.857 + 0.000 VEL (r = 0.02)
Girls    log2(F3) = 11.871 + 0.006 VEL (r = 0.18)
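Since the regression lines for Fig. 5 are stated for the base-2 logarithm of the frequency, a predicted value in Hz is obtained by raising 2 to the fitted value. The sketch below uses the four F0 lines, with the coefficients assigned to men, women, boys, and girls in the order listed. The dictionary and function names are hypothetical.

```python
# Coefficients (a, b) of log2(F0) = a + b * VEL, as listed for Fig. 5.
F0_LINES = {
    "men": (6.918, 0.217),
    "women": (7.792, 0.162),
    "boys": (8.248, 0.154),
    "girls": (8.331, 0.132),
}

def predicted_f0_hz(group, vel):
    """F0 in Hz predicted by the group's regression line at a given VEL."""
    a, b = F0_LINES[group]
    return 2.0 ** (a + b * vel)

for group in F0_LINES:
    # VEL = 0 corresponds to a perceived distance of 1 m; VEL = 4 to 16 m.
    print(group, round(predicted_f0_hz(group, 0.0)), round(predicted_f0_hz(group, 4.0)))
```

A slope of 0.217 per VEL unit means that each doubling of the perceived distance multiplies the men's F0 by 2^0.217, or about 1.16.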
Fig. 6. Mean values of F0, F1 of the [a]-segments, and F3, plotted as a function of
F0. Regression lines shown for each variable and speaker group, men (solid lines),
women (broken), boys (dashed), and girls (dotted).
[Figure: F0, F1a, F3 (Hz) plotted against F0 (Hz).]
For a 100% increase in F0, F1a increased by
42% for men (r = 0.90),
71% for women (r = 0.92),
95% for boys (r = 0.94),
124% for girls (r = 0.94).
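If the relation between F1a and F0 is read as a straight line on logarithmic axes (an assumption made for this illustration), the percentages above translate into power-law exponents: F1a grows as F0 raised to k, with k = log2(1 + increase). The names below are hypothetical.

```python
import math

# Increase in F1a (in %) for a 100% increase in F0, as stated above.
INCREASE_PERCENT = {"men": 42, "women": 71, "boys": 95, "girls": 124}

def exponent(group):
    """Exponent k in F1a ~ F0**k implied by the stated increase per F0 doubling."""
    return math.log2(1.0 + INCREASE_PERCENT[group] / 100.0)

def f1a_factor(group, f0_factor):
    """Factor by which F1a changes when F0 changes by f0_factor."""
    return f0_factor ** exponent(group)

for group in INCREASE_PERCENT:
    print(group, round(exponent(group), 2))
```

An exponent below 1 (men, women) means F1a rises less than proportionally with F0, while the girls' exponent above 1 means it rises more than proportionally.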
There is a positive correlation between F1 and F0
(large effect)
in realizations of the same linguistic strings
by speakers who differ in age and/or sex,
and by the same speakers who alter their pitch register.
“Intrinsic pitch”: a negative correlation between F1 and F0
(small effect)
in vowels produced by a given speaker
in the same linguistic and paralinguistic context.
Increases in vocal effort simultaneously involve increases in
–subglottal pressure (higher SPL, …)
–vocal fold tension (higher F0, …)
–vocal tract openness (higher F1, …)
Recognition of vocal effort
Correlation coefficients of acoustic variables with vocal effort
level (VEL)
SPL0                                 0.95 (exceptional)
SPLv                                 0.98 (exceptional)
SPLv–SPL0                            0.90
F0 and F3                            0.87
F0, F3, and Emph                     0.96
F0, F3, Emph, and log2(durV/durC)    0.97 (std. err. of est. 0.64 units)
Whispering [no F0, no spectral emphasis]
F3, F1a, and log2(durV/durC)         0.90
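To make the idea of recognizing vocal effort concrete, the sketch below inverts a single regression line: it estimates VEL from a measured F0 and converts it to a distance through d = 2^VEL. This is a deliberate simplification. It is not the multi-variable regression behind the correlations listed above. It uses only the F0 coefficients given for Fig. 5, and the function names are hypothetical.

```python
import math

# Coefficients (a, b) of log2(F0) = a + b * VEL, as listed for Fig. 5.
F0_LINES = {
    "men": (6.918, 0.217),
    "women": (7.792, 0.162),
    "boys": (8.248, 0.154),
    "girls": (8.331, 0.132),
}

def estimate_vel_from_f0(group, f0_hz):
    """Invert the group's regression line to get VEL from a measured F0."""
    a, b = F0_LINES[group]
    return (math.log2(f0_hz) - a) / b

def vel_to_distance_m(vel):
    """Perceived communicational distance implied by VEL = log2(d)."""
    return 2.0 ** vel

# Round trip: take the men's regression value at VEL = 2 and recover VEL from it.
f0 = 2.0 ** (6.918 + 0.217 * 2.0)
vel = estimate_vel_from_f0("men", f0)
print(round(vel, 3), round(vel_to_distance_m(vel), 2))  # VEL 2.0, distance 4.0 m
```

A single predictor is sensitive to speaker differences in habitual pitch, which is one reason combining several variables yields the higher correlations reported above.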
Fig. 7. Mean durations of vowel-like segments (above) and consonantal
segments (below) shown as a function of VEL. Locally weighted
least squares regression lines fitted to the data obtained from each
speaker group, men (solid lines), women (broken), boys (dashed),
and girls (dotted).
[Figure: segment durations plotted against Vocal Effort Level.]
Equations for log2(durV/durC) as a linear function of VEL (the sign of the constant term is not legible in the source):
         Constant   VEL coefficient   r
Men      0.117      +0.122            0.82
Women    0.066      +0.149            0.84
Boys     0.410      +0.147            0.90
Girls    0.382      +0.108            0.71
Table VI. Mean values and standard deviations of differences between
whispered and voiced versions of the same utterance produced by the same
speakers at the same communicational distance (0.3 and 1.5 m). The
significance level of the difference between the age groups is also
indicated.
            n     SPLv            SPLs           F1a          F3             durV         durC
Adults      23    17.7 ± 4.5 dB   0.7 ± 2.7 dB   +24 ± 12 %   +5.1 ± 4 %     +16 ± 17 %   +11 ± 14 %
Children    15    20.8 ± 2.0 dB   4.6 ± 2.8 dB   +26 ± 12 %   +3.3 ± 6.3 %   +7 ± 23 %    14 ± 21 %
Sign.             **              ***            n.s.         n.s.           n.s.         ***
Table VII. Mean perceived and calculated distances between speaker and
addressee for the phonated versions compared with distances calculated
using the same equations for the whispered versions. The independent
variables were F1a, F3, durV, and durC.
Perc. dist. (m), phonated     .47   .69   1.9   7.5   31
Calc. dist. (m), phonated     .52   .82   2.0   7.7   22
Calc. dist. (m), whispered    2.0   3.3
Fig. 8. The gross difference in spectral energy distribution between whispered
and phonated versions of the same utterance produced by men, women,
boys, and girls at the same communicational distance (0.3 and 1.5 m),
based on level measurements in frequency bands covering 3 critical bands with
overlap.
[Figure: x-axis Center frequency (Hz); one curve per speaker group (m, w, b, g).]