Acoustic effects of variation in vocal effort by men, women and children

Hartmut Traunmüller and Anders Eriksson
assisted by Anita Andersson, Ingegerd Eklund and Jessika Rundlöv
with financial support from HSFR and NUTEK for the period 94-07-01 -- 96-06-31 and from SU

Acoustic properties of speech sounds vary because of
- linguistic
- expressive
- organic
- perspectival
factors.

This investigation is mainly concerned with expressive variation
- vocal effort
- mode of phonation (whispering vs. phonating)
and interactions with organic variation
- age
- sex
(men, women, children).

Vocal effort: a subjective, physiological quantity.
Voice level: an acoustic quantity (SPL of a standard utterance measured at a standard distance).

Alternative ways of controlling voice level:
Trained speaker's/singer's technique
- More variation in pulmonic pressure
- F0 less affected
Ordinary speaker's technique
- More variation in vocal fold tension
- F0 more affected

Adopted definition & quantification of "vocal effort"

"Vocal effort" = The quantity that ordinary speakers vary when they adapt their speech to the demands of an increased or decreased communicational distance.

Previous work
- Adjusting "loudness level" (Holmberg, Hillman and Perkell, 1988)
- Shouting (Rostolland, 1982)
- Speaking in noise (Rastatter and Rivers, 1983; Loren, Colcord, and Rastatter, 1986; Van Summers, Pisoni, Bernacki, Pedlow, and Stokes, 1988; Bond, Moore, and Gable, 1989)
- Different effects of white and multitalker noise with same SPL (Rivers and Rastatter, 1985)

Variation in vocal effort affects the shape of the glottal pulses (vocal fold closing velocity and relative closed interval duration) (Holmberg et al., 1988; Holmberg, Hillman, Perkell, Guiod, and Goldman, 1995; Södersten, Hertegård and Hammarberg, 1995). This is reflected in the spectral emphasis of the partials above the first (Gauffin and Sundberg, 1989; Childers and Lee, 1991; Granström and Nord, 1992).

Variation in vocal effort affects F1 (Frøkjær-Jensen, 1966; Rostolland and Parant, 1974; Schulman, 1985; Bond et al., 1989; Liénard and Di Benedetto, 1999). F1 is difficult to measure, but a more open mouth gives a higher F1.

Variation in vocal effort affects segment durations (Fónagy and Fónagy, 1966; Rostolland, 1982; Bonnot and Chevrie-Muller, 1991). Larger effort: longer vowels but somewhat shorter consonants.

SPL as a measure of vocal effort? (Liénard and Di Benedetto, 1999)
- SPL plays no major part in judgments of vocal effort (Rundlöf, 1996; Traunmüller, 1997; Eriksson and Traunmüller, 1999).
- SPL varies widely as a function of perspectival factors.
- Listeners distinguish variations in a speaker's vocal effort from variations in their own distance from the speaker (Wilkens and Bartel, 1977; Eriksson and Traunmüller, 1999).

Our measure of vocal effort: The average rating, by a group of listeners, of the communicational distance for each stimulus.

Our aim: Acquire detailed quantitative knowledge about those acoustic effects of variations in vocal effort that are of perceptual importance.
Relevant to:
- Speech synthesis with desired paralinguistic quality
- Automatic recognition of linguistic information
- Automatic recognition of expressive information
- Automatic recognition of organic information
- Conversion of paralinguistic quality
- Automatic speech-to-speech translation with conserved paralinguistic quality
- Theories of speech: The Modulation Theory

Subjects
- 6 male adults, 20–51 years
- 6 female adults, 20–38 years
- 4 boys, 7 years
- 4 girls, 7 years
all speaking Stockholm Swedish

Speech material
Anita: "Hur många kort tog du av varje färg?" (How many cards did you take of each colour?)
Speaker: "Jag tog ett violett, åtta svarta och sex vita." (I took one violet, eight black and six white.)
5 phonated and 2 whispered versions

Recording
- Place: Långängen, Lidingö
- DAT recorder
- High quality microphone, wind protected, 50 mm from speaker's lips
- Stepwise attenuator: 0, 8, 16, 24, and 32 dB
- Sampling at 16 kHz, 16 bits per sample
- HP-filtering at 70 Hz, 48 dB/octave
- ESPS/Waves
- For formant frequency measurements, resampled at 6.4 kHz for men, 8 kHz for women and 10.667 kHz for children

Table I. Distances between speaker and addressee. The full range was used for phonated speech. Whispered speech was only used at the two shortest distances.

Version    Distance (m)
1          0.3
2          1.5
3          7.5
4          37.5
5          187.5

Acoustic measurements

Sound pressure levels
- SPLv (voiced segments & potentially voiced)
- SPLs (three [s]-es)
- SPL0 (voiced segments LP filtered at 1.5 F0mean, 18 dB/oct.)
Spectral emphasis
- SPLv - SPL0
Fundamental frequency
- F0 (mean and SD, excl. creaky voiced sections)
Formant frequencies
- F1a (average of four [a]-s)
- F3 (average of voiced segments & potentially voiced)
Segment durations
- durV (average of 14 vowels, 3 [v] and 1 [j])
- durC (average of 8 stops, 3 [s] and 1 [l])

The measure of vocal effort

Exp. 1: 20 listeners, phonated utterances, original SPL
Exp. 2: 20 listeners, phonated utterances, SPL random +/- 6 dB

Geometric means of distances in meters

Real       Estimated
0.375      0.47
1.5        0.69
7.5        1.9
37.5       7.5
187.5      31

Exp. 2 (dep.) vs. Exp. 1 (indep.): r = 0.993, slope = 0.93.
Estimated (dep.) vs. real distance (indep.): r = 0.90

Rundlöf J. (1996). Perceptuella ledtrådar vid auditiv bedömning av avståndet mellan talare och lyssnare. D-level thesis, Linguistics, SU.

Extrinsic factors
(1) Communicational distance: log2(distance in meters)
(2) "Closeness": e^(1-n) (see Fig. 1)
(3) Wind noise (wind velocity in m/s)
(4) Speaker age: log2(age in years)
(5) Boyhood (1, 0)
(6) Manhood (1, 0)
(7) Speaker-specific constants (speaker specific average prediction error)

[Figure 1: SPLv (dB) plotted against distance (m), with separate curves for phonated and whispered utterances.]
FIG. 1. The average sound pressure level (SPLv), with an arbitrary reference, of the voiced and potentially voiced segments in the phonated and whispered utterances produced by men, women, boys, and girls.

[Figure 2: contribution of each factor (Lg2(d), Closeness, Wind noise, Lg2(age), Boyhood, Manhood, Speaker spec. const.) to each acoustic property.]
FIG. 2. The contribution of the environmental and speaker specific factors (1) communicational distance, (2) "closeness", (3) wind noise, (4) speaker age, (5) boyhood, (6) manhood, and (7) speaker-specific constants, to the variation in acoustic variables measured in the phonated utterances. These variables were (from left to right) SPLv, SPL0, spectral emphasis (SPLv - SPL0), SPLs, utterance average F0, F1a, F3, and the durations of vowel-like (durV) and consonantal segments (durC).

Sound pressure levels

The dependent variables were SPLv, SPL0, spectral emphasis (SPLv - SPL0), and SPLs, for all of which the effect is expressed in dB.

                                   SPLv       SPL0       Emph.      SPLs
r2                                 0.94       0.92       0.79       0.79
r2, speaker specific               0.98       0.96       0.90       0.88
Reference value                    58.6 dB    53.8 dB    4.9 dB     47.7 dB
1. Distance doubled                +4.6 *     +3.3 *     +1.4 *     +2.0 *
   Distance increased fivefold     +10.8 *    +7.6 *     +3.2 *     +4.6 *
2. "Closeness" 0.3 vs. 1.5 m       +9.5 *     +7.0 *     +2.6 *     +2.6 *
   "Closeness" 0.3 m vs.           +15.0 *    +11.0 *    +4.1 *     +4.1 *
3. Wind velocity +1 m/s            +0.6 *     +0.5 *     +0.1 *     +0.6 *
4. Speaker age 30 vs. 7.5 yrs.     +3.7 *     +2.7 *     +1.0 *     +9.0 *
5. Boy                             +4.2 *     +4.2 *     -0.0 *     +6.4 *
6. Man                             -0.5 *     -1.1 *     +0.7 *     +2.2 *

Table III. Occurrence of creaky voice, in % of the total duration of the voiced segments.

Distance (m)    Men    Women
0.3             1.4    7.1
1.5             7.8    4.4
7.5             0.5    1.7
37.5            0.7    0.5
187.5           0.0    0.0
Mean            2.1    2.7

F0 and formant frequencies

The dependent variables were F0, F1 of the [a]-segments, and F3 of the voiced segments, for all of which the effect is expressed as a factor.

                                   F0         F1a        F3
r2                                 0.91       0.84       0.93
r2, speaker specific               0.97       0.93       0.97
Reference value                    175 Hz     580 Hz     2687 Hz
1. Distance doubled                1.13 *     1.08 *     1.00 *
   Distance increased fivefold     1.37 *     1.19 *     1.01 *
2. "Closeness" 0.3 vs. 1.5 m       1.36 *     1.09 *     1.01 *
   "Closeness" 0.3 m vs.           1.63 *     1.15 *     1.02 *
3. Wind velocity +1 m/s            1.04 *     1.03 *     1.01 *
4. Speaker age 30 vs. 7.5 yrs.     0.74 *     0.79 *     0.75 *
5. Boy                             1.00 *     1.05 *     1.00 *
6. Man                             0.61 *     0.84 *     0.88 *

Table IV. Mean values and standard deviations of F0 as a function of distance. Standard deviations also expressed in semitones.

Distance (m)    Men (Hz, st)    Women (Hz, st)    Boys (Hz, st)    Girls (Hz, st)

Segment durations

The dependent variables were the durations of vowel-like (durV) and consonantal segments (durC), for which the effect is expressed as a factor.

                                   durV       durC
r2                                 0.66       0.28
r2, speaker specific               0.88       0.63
Reference value                    58 ms      70 ms
1. Distance doubled                1.11 *     1.02 *
   Distance increased fivefold     1.27 *     1.04 *
2. "Closeness" 0.3 vs. 1.5 m       1.35 *     1.17 *
   "Closeness" 0.3 m vs.           1.61 *     1.27 *
3. Wind velocity +1 m/s            1.05 *     1.00 *
4. Speaker age 30 vs. 7.5 yrs.     0.69 *     0.85 *
5. Boy                             0.99 *     0.93 *
6. Man                             1.04 *     1.00 *

Table V. The mean pausing time, in ms, in all phonated and whispered utterances after the word listed in the first column.

Word        Men    Women    Boys    Girls
Jag           0        0       0        0
tog          29       51      68 *    252
ett          10       13      12       10
violett     146      237     192      167
åtta          0        0       0        3
svarta       65      148     160       64
och           0        0      16       20
sex          15        6      17        6
Total       265      455     465      522

[Figure 3: total pause duration (ms) plotted against distance (m), for phonated and whispered utterances.]
FIG. 3. The mean of the total pause duration (in ms) in phonated and whispered utterances shown as a function of the communicational distance for men, women, boys, and girls.

[Figure 4: SPLv, SPLs, and SPLv - SPL0 plotted against Vocal Effort Level.]
FIG. 4. SPLv (above), SPLs (middle), and the spectral emphasis SPLv - SPL0 (below) shown as a function of vocal effort level VEL = log2(d), where d is the perceived communicational distance in meters. Regression lines fitted to the whole set of data for SPLv and emphasis, and to those obtained from each speaker group, men (solid lines), women (broken), boys (dashed), and girls (dotted) for SPLs.

Equations of the regression lines:

Men      SPLv = 20.956 + 1.556 VEL          (r = 0.99)
Women    SPLv = 20.413 + 1.609 VEL          (r = 0.99)
Boys     SPLv = 21.901 + 1.477 VEL          (r = 0.98)
Girls    SPLv = 20.631 + 1.490 VEL          (r = 0.98)

Men      SPLs = 17.199 + 1.120 VEL          (r = 0.94)
Women    SPLs = 16.585 + 0.696 VEL          (r = 0.92)
Boys     SPLs = 16.244 + 0.623 VEL          (r = 0.72)
Girls    SPLs = 14.087 + 0.391 VEL          (r = 0.83)

Men      SPLv - SPL0 = 2.275 + 0.435 VEL    (r = 0.88)
Women    SPLv - SPL0 = 1.618 + 0.553 VEL    (r = 0.95)
Boys     SPLv - SPL0 = 1.973 + 0.373 VEL    (r = 0.92)
Girls    SPLv - SPL0 = 1.901 + 0.522 VEL    (r = 0.92)
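The definition of VEL as the base-2 logarithm of the perceived distance, together with the SPLv regression lines given for Fig. 4, can be evaluated with a short Python sketch. The function and dictionary names are hypothetical; the coefficients are copied from the figure in the order men, women, boys, girls, and the result is on whatever level scale the figure uses, which is not stated here.

```python
import math

# Intercept and slope of the SPLv regression lines listed with Fig. 4.
# The assignment to groups assumes the order men, women, boys, girls.
SPLV_LINES = {
    "men": (20.956, 1.556),
    "women": (20.413, 1.609),
    "boys": (21.901, 1.477),
    "girls": (20.631, 1.490),
}


def vocal_effort_level(distance_m: float) -> float:
    """VEL = log2(d), with d the perceived communicational distance in meters."""
    if distance_m <= 0:
        raise ValueError("distance must be positive")
    return math.log2(distance_m)


def splv_from_vel(group: str, vel: float) -> float:
    """Evaluate the Fig. 4 regression line for SPLv (scale as in the figure)."""
    intercept, slope = SPLV_LINES[group]
    return intercept + slope * vel


if __name__ == "__main__":
    # Perceived distances: the geometric means reported for the five versions.
    for d in (0.47, 0.69, 1.9, 7.5, 31.0):
        vel = vocal_effort_level(d)
        print(f"d = {d:5.2f} m  VEL = {vel:+.2f}  SPLv(men) = {splv_from_vel('men', vel):.2f}")
```

With this definition, each doubling of the perceived distance raises VEL by one unit, so the slope of a line is the change in the dependent variable per doubling of perceived distance.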
[Figure 5: F0, F1a, and F3 (Hz, logarithmic scale) plotted against Vocal Effort Level.]
Fig. 5. Mean values of F0, F1a, and F3, shown as a function of VEL for men, women, boys, and girls. Regression lines fitted to each variable (solid, dotted, broken lines) and speaker group.

Equations of the regression lines:

Men      log2(F0) = 6.918 + 0.217 VEL       (r = 0.98)
Women    log2(F0) = 7.792 + 0.162 VEL       (r = 0.94)
Boys     log2(F0) = 8.248 + 0.154 VEL       (r = 0.93)
Girls    log2(F0) = 8.331 + 0.132 VEL       (r = 0.95)

Men      log2(F1a) = 9.126 + 0.095 VEL      (r = 0.91)
Women    log2(F1a) = 9.368 + 0.128 VEL      (r = 0.95)
Boys     log2(F1a) = 9.764 + 0.155 VEL      (r = 0.93)
Girls    log2(F1a) = 9.746 + 0.172 VEL      (r = 0.94)

Men      log2(F3) = 11.217 + 0.003 VEL      (r = 0.12)
Women    log2(F3) = 11.473 + 0.017 VEL      (r = 0.55)
Boys     log2(F3) = 11.857 + 0.000 VEL      (r = 0.02)
Girls    log2(F3) = 11.871 + 0.006 VEL      (r = 0.18)

[Figure 6: F0, F1a, and F3 (Hz) plotted against F0 (Hz), both on logarithmic scales.]
Fig. 6. Mean values of F0, F1 of the [a]-segments, and F3, plotted as a function of F0. Regression lines shown for each variable and speaker group, men (solid lines), women (broken), boys (dashed), and girls (dotted).

For a 100% increase in F0, F1a increased by
- 42% for men (r = 0.90),
- 71% for women (r = 0.92),
- 95% for boys (r = 0.94),
- 124% for girls (r = 0.94).

There is a positive correlation between F1 and F0 (large effect) in realizations of the same linguistic strings by speakers who differ in age and/or sex, and by the same speakers who alter their pitch register.

"Intrinsic pitch": a negative correlation between F1 and F0 (small effect) in vowels produced by a given speaker in the same linguistic and paralinguistic context.
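The F0 regression lines of Fig. 5 and the F1a-versus-F0 percentages of Fig. 6 can be turned into numbers with the following Python sketch. All names are hypothetical, and the group order men, women, boys, girls is assumed for the coefficients. The F1a part additionally assumes a power-law relation between F1a and F0 (a straight line on doubly logarithmic axes), which is an interpretation of "for a 100% increase in F0", not something stated explicitly.

```python
import math

# (intercept, slope) of log2(F0) = intercept + slope * VEL, from Fig. 5.
F0_LINES = {
    "men": (6.918, 0.217),
    "women": (7.792, 0.162),
    "boys": (8.248, 0.154),
    "girls": (8.331, 0.132),
}

# Relative increase in F1a for a doubling of F0, from the list after Fig. 6.
F1A_RISE_PER_F0_DOUBLING = {"men": 0.42, "women": 0.71, "boys": 0.95, "girls": 1.24}


def f0_hz(group: str, vel: float) -> float:
    """F0 in Hz predicted by the Fig. 5 regression line at a given VEL."""
    intercept, slope = F0_LINES[group]
    return 2.0 ** (intercept + slope * vel)


def f0_factor_per_vel_unit(group: str) -> float:
    """Factor by which F0 rises when the perceived distance doubles."""
    return 2.0 ** F0_LINES[group][1]


def f1a_factor(group: str, f0_factor: float) -> float:
    """F1a factor for a given F0 factor, ASSUMING a power-law relation."""
    per_doubling = 1.0 + F1A_RISE_PER_F0_DOUBLING[group]
    return per_doubling ** math.log2(f0_factor)


if __name__ == "__main__":
    for group in F0_LINES:
        print(
            f"{group:6s} F0 at VEL 0: {f0_hz(group, 0.0):6.1f} Hz, "
            f"x{f0_factor_per_vel_unit(group):.3f} per VEL unit, "
            f"F1a x{f1a_factor(group, 1.5):.3f} for F0 x1.5"
        )
```

Because the regressions are linear in log2 of the frequency, a fixed step in VEL corresponds to a fixed frequency ratio rather than a fixed number of Hz.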
Increases in vocal effort simultaneously involve:
- increased subglottal pressure (higher SPL, …)
- increased vocal fold tension (higher F0, …)
- increased vocal tract openness (higher F1, …)

Recognition of vocal effort

Correlation coefficients of acoustic variables with vocal effort level (VEL)

SPL0                               0.95 (exceptional)
SPLv                               0.98 (exceptional)
(SPLv - SPL0)                      0.90
F0 and F3                          0.87
F0, F3, and Emph.                  0.96
F0, F3, Emph., log2(durV/durC)     0.97 (standard error of estimate 0.64 units)

Whispering [no F0, no spectral emphasis]
F3, F1a, and log2(durV/durC)       0.90

[Figure 7: durations of vowel-like segments (above) and consonantal segments (below), on a logarithmic scale, plotted against Vocal Effort Level.]
Fig. 7. Mean durations of vowel-like segments (above) and consonantal segments (below) shown as a function of VEL. Locally weighted least squares regression lines fitted to the data obtained from each speaker group, men (solid lines), women (broken), boys (dashed), and girls (dotted).

Equations for log2(durV/durC), linear in VEL:

         Constant    VEL coefficient    r
Men      0.117       0.122              0.82
Women    0.066       0.149              0.84
Boys     0.410       0.147              0.90
Girls    0.382       0.108              0.71

Table VI. Mean values and standard deviations of differences between whispered and voiced versions of the same utterance produced by the same speakers at the same communicational distance (0.3 and 1.5 m). The significance level of the difference between the age groups is also indicated.

            n     SPLv             SPLs            F1a           F3              durV          durC
Adults      23    17.7 ± 4.5 dB    0.7 ± 2.7 dB    +24 ± 12 %    +5.1 ± 4 %      +16 ± 17 %    +11 ± 14 %
Children    15    20.8 ± 2.0 dB    4.6 ± 2.8 dB    +26 ± 12 %    +3.3 ± 6.3 %    +7 ± 23 %     14 ± 21 %
Sign.             **               ***             n.s.          n.s.            n.s.          ***

Table VII. Mean perceived and calculated distances between speaker and addressee for the phonated versions compared with distances calculated using the same equations for the whispered versions. The independent variables were F1a, F3, durV, and durC.

Perc. dist. (m), phonated    Calc. dist. (m), phonated    Calc. dist. (m), whispered
0.47                         0.52                         2.0
0.69                         0.82                         3.3
1.9                          2.0
7.5                          7.7
31                           22

[Figure 8: level difference (dB) between whispered and phonated versions plotted against center frequency (Hz), with curves labelled m, w, b, g.]
Fig. 8. The gross difference in spectral energy distribution between whispered and phonated versions of the same utterance produced by men, women, boys, and girls at the same communicational distance (0.3 and 1.5 m), based on level measurements in frequency bands covering 3 critical bands with overlap.
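The distances in Table VII can be compared on the VEL scale, where a difference of one unit corresponds to a factor of two in distance. The Python sketch below is a minimal illustration with hypothetical names. The row pairing is an assumption read from the table: the two whispered values are taken to belong to the two shortest distances, the only ones at which whispered speech was used.

```python
import math

# Rows of Table VII: (perceived phonated, calculated phonated, calculated whispered).
# None marks distances at which no whispered versions exist.
TABLE_VII = [
    (0.47, 0.52, 2.0),
    (0.69, 0.82, 3.3),
    (1.9, 2.0, None),
    (7.5, 7.7, None),
    (31.0, 22.0, None),
]


def vel_difference(calculated_m: float, perceived_m: float) -> float:
    """Difference in VEL units: log2(calculated) - log2(perceived)."""
    return math.log2(calculated_m / perceived_m)


def phonated_differences() -> list[float]:
    """VEL difference between calculated and perceived distance, phonated."""
    return [vel_difference(calc, perc) for perc, calc, _ in TABLE_VII]


def whispered_differences() -> list[float]:
    """VEL difference of the whispered estimates against the perceived phonated distance."""
    return [vel_difference(wh, perc) for perc, _, wh in TABLE_VII if wh is not None]


if __name__ == "__main__":
    print("phonated :", [f"{x:+.2f}" for x in phonated_differences()])
    print("whispered:", [f"{x:+.2f}" for x in whispered_differences()])
```

Under these assumptions, the equations fitted to phonated speech place both whispered versions more than two VEL units (a factor of more than four in distance) beyond the perceived distance of the corresponding phonated versions.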