Glottal noise during speech

Dept. for Speech, Music and Hearing
Quarterly Progress and
Status Report
Glottal noise during speech
Rothenberg, M.
journal: STL-QPSR
volume: 15
number: 2-3
year: 1974
pages: 001-010
http://www.speech.kth.se/qpsr
STL-QPSR 2-3/1974
I. SPEECH PRODUCTION
A. GLOTTAL NOISE DURING SPEECH
M. Rothenberg*
Abstract
Additive aperiodic components in the voice of one adult male speaker were investigated by using a high pass filter to eliminate most of the periodic components. The glottal volume velocity waveform was recorded simultaneously by inverse-filtering the oral volume velocity. It was found that for flow rates below one liter/second the noise tended to be strongest when the instantaneous glottal volume velocity was about 200 to 300 milliliters/second. When the average glottal air flow was increased, as when the vocal folds were partially abducted for an intervocalic [h], another aperiodic component was sometimes noted at instantaneous flow rates above about 1.2 to 1. liter/second. Visual observations using a fiberoptic probe, spectrograms, and measurements of the pressure just above the glottis indicated that this high-flow-rate noise was actually generated in the vocal tract above the glottis, and not within the glottis. These patterns in the occurrence of aperiodic components were used to add filtered gaussian noise to a three-parameter, "black-box" model of the voice source. Results of informal listening tests are reported.
Introduction
It is generally recognized that during normal, non-whispered speech there can be energy components generated at or near the glottis that are much less periodic than the quasi-periodic pulse trains usually used as a model for the glottal source during voiced speech. Most obvious is the noise-like energy commonly associated with the unvoiced vowel-like sound [h], sometimes referred to phonologically as a glottal fricative (Strevens 1960). However, even during voicing there are often slight aperiodicities that have been modeled by small random variations of glottal period (Lieberman 1961, Dolansky and Tjernlund 1968) or by randomly occurring additive noise components (Fujimura 1968). We report here an attempt to investigate experimentally the mechanisms underlying some of the additive noise components present during voicing and during glottal fricatives.
Additive noise during voicing
There is very little evidence as to the nature of the noise components during voicing, and how they vary as a function of laryngeal adjustment. To help fill this gap the following experiment was performed. For a single adult male speaker (the author), numerous samples of the glottal air flow waveform were displayed on one trace of a dual trace storage oscilloscope. The glottal air flow was obtained by inverse-filtering oral volume velocity (Rothenberg 1973). On the other trace was displayed the energy in the radiated acoustic pressure waveform above about 5 kHz. A number of sharp-cutoff filters were used at different times to remove the energy below 5 kHz, but the most successful was the "8 kHz" (center frequency) octave band filter of a Brüel & Kjær model 2113 spectrum analyzer. The lower limit of 5 kHz was chosen because for this speaker the energy above 5 kHz appeared to be primarily random in nature, in that the waveform did not repeat from cycle to cycle except for a small component just following the glottal closure. It also seemed likely that the random energy above 5 kHz could be used as a measure of the relative strength of the random energy over a considerably wider bandwidth. (To actually measure the random energy below 5 kHz, where it overlapped with the stronger periodic components, is theoretically possible, but would require a much more complex analysis procedure.)
* Guest researcher from Syracuse University, Syracuse, N.Y., U.S.A., from September 26, 1973 to March 15, 1974.
Though there was a considerable variation in the strength and location of the noise energy during the glottal cycle, the most consistent pattern in the occurrence of the noise is shown in Fig. I-A-1. These photos were taken in an almost consecutive sequence while attempting to vocalize with varying degrees of compression of the vocal folds, from "breathy" (vocal folds partially abducted) to "tight" (vocal folds adducted or pressed together more than in normal voicing) (Rothenberg 1973). Though subglottal pressure was not measured, the speaker had considerable previous experience measuring his own subglottal pressure, and could estimate subjectively with some confidence that the pressure stayed within a range of about 7-10 cm H2O. The Brüel & Kjær octave filter was used. The shutter on the camera was set for a one second exposure. This was long enough to capture about 15 consecutive traces. Afterwards the camera was retriggered with a closed glottis to record zero air flow.
The patterning of the noise relative to the glottal cycle can be seen best in the left hand half of each photo, where the time alignment between the traces was best. Since the inverse-filter and the band pass filter both had a delay of about 0.7 milliseconds, the upper and lower traces can be considered to be in approximately correct time alignment.
[Figure: oscilloscope photos. Recovered panel labels: BREATHY VOICE, NORMAL VOICE; traces: AIRFLOW, ACOUSTIC ENERGY; time axis 0 to 50 msec.]
Fig. I-A-1. Simultaneous recordings of glottal air flow and radiated acoustic energy above 5500 Hz for voicing at three levels of vocal fold adduction.
In the "normal" voice, and perhaps also in the "tight" voice, there is a small burst of added energy just after the glottal closure which might have been caused by the periodic energy of the glottal closure. If one neglects this component, the following pattern emerges:
(1) The random component was smallest in the "normal" voice, where it tended to be most predominant during the middle of the opening and closing phases, and minimum when the air flow was zero or near its peak value.
(2) The random component during the "tight" voice occurred primarily during the glottal pulse and was rather uniform in amplitude throughout the pulse.
(3) The random component during the "breathy" voice occurred primarily when the air flow was minimum, and was least strong (almost at the level of the background and system noise) when the air flow was near its peak.
These observations are consistent with a mechanism for noise production that generates noise most strongly at a glottal opening equivalent to an air flow of about 200 to 300 milliliters/second. This result was surprising in view of the fact that the noise in breathy voicing is commonly thought to be caused by the high air flow. However, an examination of the air flow pattern at the glottal orifice makes these results at least plausible. An increase in noise with increased glottal opening would presumably be due to a transition from laminar to turbulent streaming within the glottis. This would occur if the Reynolds number for the glottal flow,
\[ \mathrm{Re} = \frac{\text{particle velocity in cm/sec} \times \text{glottal width in cm}}{0.15} \]
exceeded the critical value for an orifice the shape of the glottis (Meyer-Eppler 1953). As the glottal width increases, the particle velocity decreases slightly due to the lowered subglottal pressure, and so the Reynolds number, though it increases, increases more slowly than the glottal opening. The critical Reynolds number for turbulence production is difficult to determine from the literature, since it depends greatly on the details of the shape of the glottis, and of its entrance and exit pathways. Since the discontinuity in area tends to be less as the glottis opens, the critical Reynolds number probably increases somewhat with glottal width. These factors together indicate that there may be no tendency to turbulence at the glottis as it opens. As a simple experiment, one can blow through the pursed lips separated very slightly to imitate the glottal opening. For intraoral pressures similar in magnitude to the subglottal pressure during speech, there is little audible evidence of turbulence as the separation between the lips is increased.
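The Reynolds number argument can be checked numerically. The sketch below evaluates the expression quoted in the text; the constant 0.15 matches the kinematic viscosity of air in cm^2/sec. The widths and velocities are hypothetical illustration values, not measurements from this study, chosen only to show a doubled width with a slightly reduced velocity.

```python
# Constant from the Reynolds number expression in the text; it matches the
# kinematic viscosity of air expressed in cm^2/sec.
KINEMATIC_VISCOSITY_CM2_PER_SEC = 0.15


def reynolds_number(particle_velocity_cm_per_sec, glottal_width_cm):
    """Return Re = velocity (cm/sec) x width (cm) / 0.15."""
    return (particle_velocity_cm_per_sec * glottal_width_cm
            / KINEMATIC_VISCOSITY_CM2_PER_SEC)


# HYPOTHETICAL values, not measurements from the paper: the width doubles
# while the particle velocity drops slightly.
narrow = reynolds_number(3000.0, 0.05)
wide = reynolds_number(2800.0, 0.10)

# Ratio of the two Reynolds numbers, to compare with the width ratio of 2.
growth = wide / narrow
print(f"Re (narrow) = {narrow:.0f}, Re (wide) = {wide:.0f}, growth = {growth:.2f}")
```

With these assumed inputs the Reynolds number grows by less than the factor of two by which the width grows, which is the qualitative point made in the text.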
The small amount of noise that was found to occur at small glottal openings might have been due to a decrease in the critical Reynolds number, as, for example, if a misalignment of the vocal cords caused a strong change in the direction of streaming when the cords were very close. Alternatively, there might have been some other effect more difficult to identify, such as the friction of the air stream along the moist glottal walls.
The amplitudes of the noise components shown in Fig. I-A-1 were about 50 dB below the amplitude of the first formant energy in normal voicing. (The vowel was [æ].) If we estimate that the amplitude of all the random energy in the waveform, including the frequency range excluded by the filter, was about 10 dB higher, we must conclude that the random component in the normal voicing would be only marginally audible, and then only above 3 or 4 kHz, where it is not masked by the periodic component. However, when the voicing was more breathy or more tight, the periodic components became smaller and the random components larger, and the random components could be a minor but perceptible auditory feature.
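The level estimate above is simple decibel arithmetic, sketched here for concreteness. The two dB figures are those quoted in the text; treating them as amplitude levels under the usual 20 log10 convention is an assumption of this sketch.

```python
def db_to_amplitude_ratio(level_db):
    """Convert a level in dB to an amplitude ratio (20*log10 convention)."""
    return 10.0 ** (level_db / 20.0)


# Figures quoted in the text.
filtered_noise_db = -50.0       # noise above 5 kHz re. first-formant amplitude
full_band_correction_db = 10.0  # estimated allowance for the excluded band

# Estimated level of all the random energy relative to the first formant.
total_noise_db = filtered_noise_db + full_band_correction_db
total_noise_ratio = db_to_amplitude_ratio(total_noise_db)
print(f"total noise: {total_noise_db:.0f} dB, amplitude ratio {total_noise_ratio:.3f}")
```

Under that assumption the full-band random component sits about 40 dB below the first formant, an amplitude ratio of roughly one to one hundred.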
Additive noise during glottal fricatives
The production mechanism for glottal fricatives such as /h/ in English involves some degree of abduction of the vocal folds. Though the evidence presented above indicates that glottal noise during voicing tends to decrease with glottal area and glottal air flow, listening to [h]-type sounds leads one to conclude that there must also be some source of aperiodic energy that tends to get stronger as the glottis opens. To investigate this, the author produced a large number of samples of initial and medial [h], while listening for noise during the [h] and observing the glottal air flow waveform on a storage oscilloscope. Pictures of the screen were taken occasionally. It was clear that there was some source of noise in the [h] that could occur at high air flow rates in certain vowel contexts. The perceived strength of this noise was very well correlated with certain features of the glottal air flow waveform. These features are illustrated by the set of oscilloscope photos shown in Fig. I-A-2.
The waveforms in Fig. I-A-2 are of the inverse-filtered glottal air flow in repetitions of the nonsense syllable [ha], low-pass filtered with a cutoff of about 800 Hz. They were chosen to represent different degrees of glottal opening during the [h]. The top two waveforms show an opening and closing of the glottis of approximately minimal duration, in which the air flow pattern during the most open state is almost sinusoidal, and offset from zero. This is the type of waveform we might expect when the vocal folds are abducted just enough during the [h] so that they do not come in contact during the vibratory cycle. There was very little perceived noise during these productions. The glottal opening movements in photos C and D of Fig. I-A-2 are stronger and of longer duration. In addition, they show a small flattening of the top of the oscillations whenever the air flow exceeded about 1.3 or 1.4 liter/sec. A significant amount of acoustic noise could be heard during these productions, and one might guess that the reduction in the peak flows occurred because of a transition from a low impedance laminar streaming to a high impedance turbulent streaming somewhere in the vocal tract.
The flattening effect noted in waveforms C and D of Fig. I-A-2 can be seen to be much stronger in waveforms E and F. In E, the compression is initiated at about 1.1 or 1.2 liter/second, and in F the compression seems to start at about 0.9 liter/second. In all productions in which the airflow was similar to that in E and F, the noise in the [h] was highly audible, and there was a feeling of constriction at the back of the throat that seemed to correlate well with the level of the noise. With some practice it was possible to vary this feeling of constriction, and thereby vary the amount of random energy produced, and the amount of "flattening" in the glottal air flow waveform. In waveform F, random oscillations on the non-oscillatory, high air flow part of the waveform make it clear that random energy was being generated during that interval.
To investigate further the mechanism of noise generation at high air flow rates, a number of samples of the glottal air flow waveform during the production of [h] were photographed, with an effort made to use varying degrees of "constriction". Fig. I-A-3 shows a set of 5 waveforms with different degrees of subjective feeling of constriction.
The
Fig. I-A-3. Glottal air flow near the onset of voicing during productions of [hæ] with different degrees of subjective pharyngeal constriction. (Time axis: 0 to 100 msec.)
at a conversational volume was always accompanied by a pressure of 1 to 2 cm water during the aspirated phase of the [h]. Small adjustments in the position of the catheter did not seem to affect the pressure reading, and so it could be assumed that the reading was not an artifact of the air flow impinging directly on the catheter tip.
It was also observed that in spectrograms made from recordings of [æ h ɛ] for this speaker the formants tended to rise 10 to 20 percent during an aspirated [h]. This rise could have been caused, at least in part, by the acoustical shortening of the vocal tract that would result from a high acoustical impedance one or two centimeters above the glottis, at the level of the epiglottis. An increase in the formant frequencies could also be partially due to acoustic coupling to the subglottal system through the open glottis (Fant, et al. 1972). This tendency for the formant structure to increase in frequency during the [h] was verified in spectrograms made from the voice of a second adult male speaker. A typical spectrogram for this speaker is shown in Fig. I-A-4, for the phonetic sequence [a ha ʔ æ h a ʔ ɛ h ɛ]. A Kay model 6061 spectrum analyzer was used.
Simulation for speech synthesis
The results of the above experiments were used in adding filtered gaussian noise to an electrical "black box" model of the voice source (Rothenberg, et al. 1974). The source was an ad hoc electrical network designed to have an output similar to an actual glottal air flow waveform, and was controlled by three voltages representing hypothetical neurological command parameters that would act to vary the Frequency, Loudness and Tightness (effect of vocal fold adduction) of the voice for an unconstricted supraglottal vocal tract. Figs. I-A-5 and I-A-6, respectively, illustrate how the output waveform varied over the full range of the Loudness (L) and Tightness (T) parameters. A "normal voicing" could be approximated by the waveform third from the bottom in Fig. I-A-5 and fourth from the bottom in Fig. I-A-6. The voice source was connected to the OVE III serial formant synthesizer at the Dept. of Speech Communication using the control scheme sketched by Rothenberg et al., and subjected to a number of informal listening tests using nonsense syllables and connected speech.
Fig. I-A-4. Wide band spectrogram of the phonetic sequence [a ha ʔ æ h a ʔ ɛ h ɛ]. (Frequency axis in kHz.)
Fig. I-A-5. Simulated glottal air flow waveform with L varied in equal steps. (Time scale marker: 5 msec.)
Fig. I-A-6. Simulated glottal air flow waveform with T varied in equal steps. (Time scale marker: 5 msec.)
To simulate the noise component found to occur at low flow rates during voicing, a circuit was included in the voice source to insert noise when the simulated air flow was in the appropriate range, about 100 to 400 milliliters/second. The noise amplitude was also made slightly dependent on the waveform derivative, in order to simulate the effect of flow inertance at small glottal openings. For a given glottal area, this inertance would tend to decrease the flow, and therefore the noise, during the glottal opening phase, and increase these parameters during the glottal closing phase. The noise level was made proportional to the loudness parameter. The noise spectrum was chosen by informal listening tests to be falling at about 3 dB per octave, but still might be considerably different from the spectrum of actual glottal noise. The noise level was set about 6 dB above the level at which the insertion of the random component would just be perceived with simulated normal voicing.
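The low-flow-rate noise rule described above can be sketched as a gain applied to a gaussian noise sample. This is a software illustration of the stated behavior, not the original analog circuit: the function names, the form of the derivative dependence, and the constant derivative_weight are assumptions, and the noise here is white rather than shaped to fall at about 3 dB per octave.

```python
import random

LOW_FLOW_MIN_ML_PER_SEC = 100.0   # lower edge of the range given in the text
LOW_FLOW_MAX_ML_PER_SEC = 400.0   # upper edge of the range given in the text


def low_flow_noise_gain(flow_ml_per_sec, flow_derivative, loudness,
                        derivative_weight=0.1):
    """Noise gain for one sample of the simulated glottal air flow.

    derivative_weight is a HYPOTHETICAL constant; the text only says the
    dependence on the waveform derivative was slight.
    """
    if not LOW_FLOW_MIN_ML_PER_SEC <= flow_ml_per_sec <= LOW_FLOW_MAX_ML_PER_SEC:
        return 0.0
    # Less noise while the flow is rising (opening phase), more while it is
    # falling (closing phase), following the inertance argument in the text.
    if flow_derivative > 0:
        slope_factor = 1.0 - derivative_weight
    elif flow_derivative < 0:
        slope_factor = 1.0 + derivative_weight
    else:
        slope_factor = 1.0
    # Noise level proportional to the loudness parameter L.
    return loudness * slope_factor


def low_flow_noise_sample(flow_ml_per_sec, flow_derivative, loudness, rng):
    """Scale one gaussian sample by the gain (white, not spectrally shaped)."""
    gain = low_flow_noise_gain(flow_ml_per_sec, flow_derivative, loudness)
    return gain * rng.gauss(0.0, 1.0)


rng = random.Random(0)
opening = low_flow_noise_gain(250.0, +1.0, loudness=1.0)
closing = low_flow_noise_gain(250.0, -1.0, loudness=1.0)
outside = low_flow_noise_sample(800.0, 0.0, 1.0, rng)
```

At the same flow, the gain is lower on the opening slope than on the closing slope, and no noise is produced outside the 100 to 400 milliliters/second window.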
Accurate acoustical modeling of the high-flow-rate noise source can be quite complex. In the electrical model of the voice source, we have so far only introduced an additive noise component as the simulated air flow waveform rises above the equivalent of about 1.2 liter/second. The noise amplitude increases roughly in proportion to simulated flow level above the threshold and to the loudness parameter L. (This second dependency may have been unrealistic.) The noise source is the same as that used for the low-flow-rate noise. The constant determining the noise level was set by informal listening tests. The formant structure is extrapolated directly from neighboring vowels, with no compensation for vocal tract excitation being above the glottis.
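The high-flow-rate rule can be written as a thresholded gain. The sketch below follows the description in the text (amplitude roughly proportional to the flow above about 1.2 liter/second and to L); the function name and level_constant are hypothetical, since the constant in the original model was set by informal listening.

```python
HIGH_FLOW_THRESHOLD_L_PER_SEC = 1.2  # threshold quoted in the text


def high_flow_noise_gain(flow_l_per_sec, loudness, level_constant=1.0):
    """Gain proportional to the flow above threshold and to loudness L.

    level_constant is HYPOTHETICAL; the text says its constant was set by
    informal listening tests.
    """
    # Flow in excess of the threshold; zero at or below the threshold.
    excess = max(0.0, flow_l_per_sec - HIGH_FLOW_THRESHOLD_L_PER_SEC)
    return level_constant * loudness * excess


below = high_flow_noise_gain(1.0, loudness=1.0)
above = high_flow_noise_gain(1.5, loudness=1.0)
louder = high_flow_noise_gain(1.5, loudness=2.0)
```

The gain is zero below the threshold and doubles when L doubles, which is the second dependency the text flags as possibly unrealistic.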
The insertion of the small amount of low-air-flow noise was generally considered to make the synthesized speech more natural sounding, though it is likely that such naturalness could also be obtained by much simpler ad hoc methods, such as the scheme suggested by Fujimura (1968). However, the more significant question is whether a more accurate simulation of this noise improves the perception of linguistically significant transitions in laryngeal adjustment, such as changes in the voice parameters termed here Loudness and Tightness. Since the level of the noise is small relative to the periodic component, we should expect that the variation of the frequency, amplitude, and spectral shape of the periodic component should be the more dominant factors in the perception of most transitions of laryngeal adjustment, and this was generally found to be the case in listening to transitions in the model parameters with and without the low-air-flow noise present. However, the informal listening tests used were not adequate to test for smaller, second-order effects.
In the synthesis procedure used, the high-air-flow noise might be generated by the voice source model during any "high-air-flow" sound, such as an unvoiced fricative or stop burst; however, we discuss here only the effect on the perception of [h]. The voice source model was used to generate an [h] sound by appropriately varying the Tightness parameter, i.e., increasing it to a "normal voicing" level after an open glottal state, or increasing then decreasing the parameter for an intervocalic [h]. The high-flow-rate noise, though not entirely satisfactory in spectral quality, seemed to improve the naturalness of those types of [h] sounds in which it might be expected to occur naturally, as after an open glottal state, or before a stressed vowel, when the glottal opening movement might be long (> 150 msec) and result in a larger than average adduction of the vocal folds. A simulation of a shorter intervocalic glottal opening transition, such as might occur in an intervocalic English /h/ in an unstressed position, tended to be heard as a natural [h] whether or not the noise was present. (In the final synthesis procedure, the time constant used for the Tightness parameter tended to prevent the simulated air flow from reaching the threshold for noise generation when the [h] was shorter than about 125 msec.) In fact, fairly natural sounding [h] sounds could be produced in any context with no random components at all added, provided the variation in the Tightness parameter was made with a natural time constant. With no noise present, the [h] was apparently signaled by transitions in the spectral distribution of the quasi-periodic components.
Acknowledgments
The author greatly appreciates the numerous conversations concerning the work reported here that he enjoyed with members of the staff of the Dept. of Speech Communication, especially P. Branderud, R. Carlson, B. Granström, and Dr. J. Lindqvist-Gauffin. The speech synthesis experiments were made in cooperation with Mr. Carlson and Mr. Granström, and the fiberoptic probe observations in cooperation with Dr. Lindqvist-Gauffin.
This work was partially supported by the National Institutes of Health, USA.
References
DOLANSKY, L. and TJERNLUND, P. (1968): "On Certain Irregularities of Voiced Speech Waveforms", IEEE Trans. Audio and Electroacoustics AU-16, pp. 57-64.
FANT, G., ISHIZAKA, K., LINDQVIST, J., and SUNDBERG, J. (1972): "Subglottal Formants", STL-QPSR 1/1972, pp. 1-12.
FUJIMURA, O. (1968): "An Approximation to Voice Aperiodicity", IEEE Trans. Audio and Electroacoustics AU-16, pp. 68-72.
LIEBERMAN, P. (1961): "Perturbations in Vocal Pitch", J. Acoust. Soc. Am. 33, pp. 597-603.
MEYER-EPPLER, W. (1953): "Zum Erzeugungsmechanismus der Geräuschlaute", Z. Phonetik 7, pp. 196-212.
ROTHENBERG, M. (1973): "A New Inverse-Filtering Technique for Deriving the Glottal Air Flow Waveform During Voicing", J. Acoust. Soc. Am. 53, pp. 1632-1645.
ROTHENBERG, M., CARLSON, R., GRANSTRÖM, B., and LINDQVIST-GAUFFIN, J. (1974): "A Three-Parameter Voice Source for Speech Synthesis", Proceedings of the Speech Communication Seminar, Stockholm, Aug. 1-3, 1974.
STREVENS, P. (1960): "Spectra of Fricative Noise in Human Speech", Language and Speech 3, pp. 32-49.