Analysis and Synthesis of the ... Lateral Consonant Adrienne Prahler

Analysis and Synthesis of the American English
Lateral Consonant
by
Adrienne Prahler
Submitted to the Department of Electrical Engineering and
Computer Science
in partial fulfillment of the requirements for the degree of
Master of Engineering in Electrical Engineering and Computer
Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June 1998
© Adrienne Prahler, MCMXCVIII. All rights reserved.
The author hereby grants to MIT permission to reproduce and
distribute publicly paper and electronic copies of this thesis
document in whole or in part, and to grant others the right to do so.
Author
Department of Electrical Engineering and Computer Science
May 22, 1998
Certified by
Kenneth Stevens
Clarence LeBel Professor
Thesis Supervisor
Accepted by
Arthur C. Smith
Chairman, Department Committee on Graduate Students
Analysis and Synthesis of the American English Lateral
Consonant
by
Adrienne Prahler
Submitted to the Department of Electrical Engineering and Computer Science
on May 22, 1998, in partial fulfillment of the
requirements for the degree of
Master of Engineering in Electrical Engineering and Computer Science
Abstract
The lateral consonant in English is generally produced with a backed tongue body, a
midline closure of the tongue blade at the alveolar ridge, and a path around one or
both of the lateral edges of the tongue blade. In pre-vocalic lateral consonants, the
release of the closure causes a discontinuity in the spectral characteristics of the sound.
Past attempts to synthesize syllable-initial lateral consonants using formant changes
alone have not been entirely satisfactory. Data from prior research has shown rapid
changes not only in the formant frequencies but also in the glottal source amplitude
and spectrum as well as in the amplitudes of the formant peaks at the consonant
release. Further measurements have been made on additional utterances, guided
by models of lateral production. Synthesized lateral-vowel syllables that include
additional changes in bandwidths, pole-zero pairs, spectral tilt, and the amplitude of
voicing are judged to be more natural than lateral-vowel syllables with only formant
transitions.
Thesis Supervisor: Kenneth Stevens
Title: Clarence LeBel Professor
Acknowledgments
I would like to thank Ken Stevens and all of the Speech Group for everything. Ken
has been an inspiration to me both professionally and personally, thank you. Thank
you to my parents and brother for all their love and support over the years - I would
definitely not be here without all of you. Thank you to Peter for all the laughs and
nights out - you kept me sane. Basak, what can I say except we are DONE!
Work supported in part by a LeBel Fellowship and by NIH Grant DC00075.
Contents
1 Introduction
  1.1 Motivation
  1.2 Properties
  1.3 Background
    1.3.1 Poles and Zeros of Vocal Tract Transfer Function
    1.3.2 Glottal Source Reduction
    1.3.3 Acoustic Losses
  1.4 Purpose of Research
2 Modeling
  2.1 Motivation
    2.1.1 Simple Lossless Tube
    2.1.2 Expected Numbers of Poles and Zeros
  2.2 Evaluation of Lateral Side Channels
    2.2.1 Equations
    2.2.2 Boundary Conditions
    2.2.3 Solve for Specific Cases of Lengths and Areas
  2.3 Evaluation of the Entire Model
    2.3.1 Equations
    2.3.2 Boundary Conditions
    2.3.3 Solution for specific cases of lengths and Areas
  2.4 Justification of the Model
    2.4.1 Calculation of the Number of Poles and Zeros and Approximate Locations
  2.5 Limitations of Model
    2.5.1 Assumptions
  2.6 Conclusions
3 Measurements
  3.1 Purpose of Measurements - What Makes a /l/?
  3.2 Method
  3.3 Measurements
    3.3.1 Singleton
    3.3.2 Cluster
  3.4 Singleton Analysis
  3.5 Analysis of Utterances with Consonant Clusters
  3.6 Comparisons of Environments
  3.7 Conclusions
    3.7.1 Abruptness
4 Synthesis
  4.1 Abruptness/Consonantal Quality
    4.1.1 Changes in Frequency
    4.1.2 Changes in Amplitude
  4.2 Method
5 Results of Perceptual Testing
  5.1 First Perceptual Experiments
  5.2 Final testing
6 Conclusions
  6.1 Further Research
A Matlab code for Modeling
  A.1 Solve Lateral Channel Equations
  A.2 Solve Equations for Lateral Branches when One Tube Disappears
  A.3 Solve for Entire Model
  A.4 Whole Model when One Side Branch Disappears
List of Figures
2-1 Model of side channels formed during lateral consonant production
2-2 Transfer function for model of side channels, $20\log|U_{out}/U_{in}|$, plotted for $A_{21} = A_{22} = 0.1$ cm$^2$ and $l_{21} = l_{22} = 8$ cm
2-3 Transfer function for model of side channels, $20\log|U_{out}/U_{in}|$, plotted for $A_{21} = 0.3$ cm$^2$, $A_{22} = 0.2$ cm$^2$ and $l_{21} = 8.5$ cm, $l_{22} = 7.5$ cm
2-4 Transfer function for model of side channels, $20\log|U_{out}/U_{in}|$, plotted for $A_{21} = A_{22} = 0.3$ cm$^2$ and $l_{21} = 4$ cm, $l_{22} = 12$ cm
2-5 Transfer function for model of side channels, $20\log|U_{out}/U_{in}|$, plotted for $A_{21} = 0.3$ cm$^2$, $A_{22} = 0.1$ cm$^2$ and $l_{21} = 4$ cm, $l_{22} = 12$ cm
2-6 Transfer function for model of side channels, $20\log|U_{out}/U_{in}|$, plotted for $A_{21}/A_{22}$ = 0.1 cm$^2$ to 0.5 cm$^2$ and $l_{21} + l_{22} = 16$ cm
2-7 Model of the vocal tract during lateral consonant production
2-8 Transfer function for model of vocal tract, $20\log|U_{out}/U_{in}|$, plotted for $A_{21} = A_{22} = 0.2$ cm$^2$ and $l_{21} = l_{22} = 8$ cm
2-9 Transfer function for model of vocal tract, $20\log|U_{out}/U_{in}|$, plotted for $A_{21} = A_{22} = 0.2$ cm$^2$, $l_{21} = 10$ cm, and $l_{22} = 8$ cm
2-10 Transfer function for model of vocal tract, $20\log|U_{out}/U_{in}|$, plotted for $A_{21} = 0.2$ cm$^2$, $A_{22} = 0.5$ cm$^2$ and $l_{21} = 11.5$ cm and $l_{22} = 4.5$ cm
2-11 Transfer function for model of vocal tract, $20\log|U_{out}/U_{in}|$, plotted for various values of $A_{21}$ and $A_{22}$ and $l_{21} + l_{22} = 16$ cm
3-1 The effect of the pre-emphasis on the spectra
3-2 Spectrogram of luck for male speaker
3-3 Spectrogram of voiced cluster utterance, bleed, for male speaker
3-4 Spectrogram of voiceless cluster utterance, plead, for male speaker
3-5 Changes in amplitudes of formants from lateral to vowel for singleton utterances: error bars represent standard deviation of data
3-6 Changes in formant frequencies from lateral to vowel for singleton utterances: error bars represent standard deviation of data
3-7 Changes in amplitudes of formants from liquid to vowel for voiced and some voiceless stop cluster utterances: error bars represent standard deviation of data
3-8 Changes in formant frequencies from liquid to vowel for voiced and some voiceless stop cluster utterances: error bars represent standard deviation of data
3-9 Comparison of Δ's of amplitudes of formants from lateral to vowel for singleton and cluster utterances: error bars represent standard deviation of data
3-10 Comparison of Δ's of formant frequencies from lateral to vowel for singleton and cluster utterances: error bars represent standard deviation of data
3-11 Measured values of formant frequencies during lateral consonant: error bars represent standard deviation of data
4-1 Spectrogram of natural utterance loot for male speaker
4-2 Spectrogram of first synthesized utterance of loot for male speaker
4-3 Spectrogram of second synthesized utterance of loot for male speaker
4-4 Spectra of natural utterance during /l/ using a 6.4 ms Hamming window
4-5 Spectra of first synthesized utterance during /l/ using a 6.4 ms Hamming window
4-6 Spectra of second synthesized utterance during /l/ using a 6.4 ms Hamming window
4-7 Formant trajectories for the synthesized word loot
4-8 Time varying voicing changes in synthesized utterance
4-9 Time varying additional pole and zero in synthesized utterance
4-10 Time varying bandwidth changes in synthesized utterance
4-11 Spectra of natural utterance during /l/ using 25.6 ms Hamming window
4-12 Spectra of first synthesized utterance during /l/ using 25.6 ms Hamming window
4-13 Spectra of second synthesized utterance during /l/ using 25.6 ms Hamming window
5-1 Results of perceptual experiments: % first utterance rated more natural by listeners
5-2 Total change in F1 and F2 between lateral and vowel vs. % times second synthesized utterances rated more natural than first synthesized utterance
List of Tables
3.1 Singleton utterances
3.2 Cluster utterances
3.3 Δ's of formant frequencies and amplitudes for singleton utterances; Amplitude changes are in dB, frequencies are in Hz, and standard deviations (sd) across speakers and repetitions are given in the right half of the table
3.4 Δ's of formant frequencies and amplitudes for voiced and some voiceless stop cluster utterances; Amplitude changes are in dB, frequencies are in Hz, and standard deviations (sd) across speakers and repetitions are given in the right half of the table
3.5 Comparison of singleton and cluster utterances for same speakers; Amplitude changes are in dB, frequencies are in Hz, and standard deviations (sd) across speakers and repetitions are given in the lower half of the table
3.6 Measured formant frequency values in Hz of the lateral for singleton and cluster utterances
4.1 Constant synthesis parameters of utterance incorporating formant frequency transitions
4.2 Constant synthesis parameters of utterance incorporating voicing changes
5.1 Results of perceptual experiments: % first utterance rated more natural by listeners
5.2 Total change in frequency between lateral and vowel for synthesized utterances
Chapter 1
Introduction
1.1 Motivation
Consonants in most languages are different from vowels in many ways including
abruptness at the consonant release and a reduction in overall power during the
sound[9]. The American English lateral consonant is very different from other sounds
in that a similar vocal tract configuration produces an extreme variety of frequency
spectra. The lateral consonant is sonorant and not continuant, so it is similar to nasal
consonants, but differs from glides. The lateral must be examined to develop a better
understanding of speech sounds in language.
The lateral is a liquid consonant that is more elusive than other consonants.
Most sounds in languages have distinctive contrasts and minimal pairs, but the lateral
lacks a clear contrasting sound, which allows more variation in its production.
The lateral can also be a syllable nucleus, which lacks abruptness; in that case
the perceptual cues must be contained in the spectrum alone. Understanding the
perceptual cues within the frequency spectra and in the transitions to vowels is a
practical motivation for studying the lateral. This information can be used for a variety
of purposes including synthesis, speech recognition, and teaching second languages
(to speakers of languages without a comparable lateral sound, e.g., Japanese).
1.2 Properties
The American English lateral /l/ is one of the semivowel consonants; it is produced
with a complete closure on the midline of the vocal tract, like many other consonants,
and this narrowing of the vocal tract differentiates it from vowels. The lateral is
produced with a backed and lowered tongue body and an occlusion at the alveolar ridge.
The closure does not extend across the full width of the tongue, so airflow continues
around its sides. The first formant (F1) is low, although higher than that typically found for
a high vowel, and the second formant (F2) is barely separated from the first formant.
The third formant (F3) generally has a relatively strong amplitude and is higher in
frequency than the third formant frequency for most vowels [16, 4, 3]. Espy-Wilson
reports the average formant frequencies of prevocalic /l/: F1 averaged across all
speakers is 399 Hz, F2 is 1074 Hz, and F3 is 2553 Hz [6, 5]. The lateral is prone to
considerable variation depending on the individual and the phonetic context, and this
variability makes it more difficult to characterize than other consonants [5, 15]. The
acoustic analysis of /l/ is not complete, and more work is needed to determine how to
characterize the lateral acoustically, both for speech synthesis directly and for
other applications including speech recognition and speech pathology.
This thesis will develop a theoretical model of the lateral, and will examine the validity
of the model by acoustically analyzing the prevocalic lateral for various speakers
and contexts, synthesizing the laterals using the theoretical model, and conducting
perceptual experiments to determine the primary acoustic cues for a lateral.
1.3 Background
1.3.1 Poles and Zeros of Vocal Tract Transfer Function
To better understand the acoustic characteristics of the American English laterals,
some work has focused on modeling the vocal tract for this class of sounds. The
acoustic theory of the laterals was first described by Fant [7]. Assuming that the
constriction sizes in the vocal tract are large enough (0.17 cm² or greater [15]) to avoid
the production of turbulence noise, the acoustic theory is simple for the laterals. The
vocal tract is modeled as a tube with constrictions and side branches; the first formant
is approximately a Helmholtz resonance, with the acoustic mass due to the lateral
constriction. The low second formant results from the pharyngeal constriction
caused by the tongue body backing in the cavity behind the constriction. The
third formant is roughly a resonance of the mouth cavity anterior to the constriction,
and the fourth formant is determined by the length of the entire cavity system. The
production of the lateral /l/ with an alveolar point of articulation creates an interior
cavity formed by the tongue blade. An additional cavity is created under the tongue
which couples with the back cavity. The result is that there are zeros as well as poles
in the transfer function for a lateral configuration. The lowest zero is possibly caused
by the cavity formed by the tongue blade while an additional pole is due to the entire
system. Fant suggests that the effect of the pole-zero pair is simply that the fourth
formant takes on the role of the third and so on, and a lateral can be synthesized
without the additional zero if the formants are simply shifted appropriately [7].
A modification of Fant's theory is proposed by Stevens. The side branches around
the tongue will affect the high frequency behavior of the lateral. The transfer function
of the vocal tract will have additional zeros when the branches in the airway are
asymmetrical. The exact locations of the zeros will vary since the lengths of the
side branches vary among speakers. Stevens suggests great variability in the
spectrum of the lateral in the frequency range of 2500-4000 Hz depending on the
surrounding environment and the individual speaker. The rate of release of the lateral
is slower than that of other alveolar consonants, and the possible interaction of the
additional poles and zeros with the surrounding speech segments is unknown [15]. Unlike
Fant, Stevens suggests that the effect of the poles and zeros, regardless of their actual
location, is not simply a shifting of the formants up in frequency, but instead an
overall modification of the spectrum in the high frequency range.
Previously, a limiting factor in creating an accurate model of the lateral was the
lack of data on the length of the lateral channels. Recent work by Narayanan et al.
[12, 13] at UCLA using Magnetic Resonance Imaging (MRI) and electropalatographic
(EPG) techniques has provided valuable data that can lead to a more accurate
understanding of the geometry, acoustics, and aerodynamics of the lateral. Using the
MRI and EPG studies, detailed 3-D data of the vocal tract during the production
of the laterals were obtained, and the appearance of side channels during the lateral
consonant was confirmed. However, MRI and EPG data show that the left and right
channels are not equal in either area or length, and there is great variation across
subjects and phonetic contexts. All sounds in the study, however, are produced with
a lingual occlusion approximately 1-1.5 cm away from the lip opening and a closure
length of 0.6-1.5 cm. The cross-sectional area of the channels formed varies from
0.1 to 0.5 cm², and the relative areas of the right and left channels show variation even
for the same speaker.
These data finally make it possible to know the physical
dimensions and configuration of the vocal tract and oral cavity during the production
of the lateral, and this knowledge permits more accurate modeling of the acoustics.
The lateral channel areas and low flow rates suggest no significant pressure drop
in the supraglottal constriction region and negligible chances for frication. These
observations support the assumption of Stevens [15] that turbulence noise can be
ignored in modeling the vocal tract for the lateral, at least when the glottis is in the
normal configuration for phonation. This finding also suggests sustained, uniform flow
throughout the duration of the sound. The first formant frequency can be attributed
to the Helmholtz resonance between the back cavity volume and oral constriction,
with the low frequency behavior being dominated by the back cavity configuration.
The change of F1 at the release is expected to be abrupt due to the abrupt changes
in area functions arising from the anterior tongue blade movement. The second
formant is associated with the back cavity resonance and can be greatly affected by
retracting or raising the posterior tongue body. Three-dimensional modeling from
the collected data suggests that the tongue blade has a tendency towards an inward
lateral compression which creates the lateral channels observed [12], supporting the
hypothesis of Sproat and Fujimura [14] that the tongue blade narrowing is a feature
of laterals.
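As a rough plausibility check on the Helmholtz interpretation of F1, the sketch below evaluates the textbook Helmholtz formula, f = (c / 2π) · sqrt(A / (V · l)). The constriction area and length are chosen inside the ranges reported above; the back-cavity volume of 50 cm³, the lumping of two channels into one constriction, and the function name are assumptions made only for illustration, and wall effects are ignored.

```python
import math

C_SOUND = 35400.0  # speed of sound in cm/s (value quoted in Chapter 2)

def helmholtz_frequency(area_cm2, length_cm, volume_cm3, c=C_SOUND):
    """Helmholtz resonance (Hz) of a cavity of the given volume behind a
    constriction of the given cross-sectional area and length."""
    return (c / (2.0 * math.pi)) * math.sqrt(area_cm2 / (volume_cm3 * length_cm))

# Hypothetical example: two 0.1 cm^2 channels lumped into one 0.2 cm^2
# constriction, 1.0 cm long, behind an ASSUMED 50 cm^3 back cavity.
f1_estimate = helmholtz_frequency(area_cm2=0.2, length_cm=1.0, volume_cm3=50.0)
print(f"Helmholtz estimate of F1: {f1_estimate:.0f} Hz")
```

With these assumed values the estimate falls in the few-hundred-hertz range, the same order as the average F1 of 399 Hz cited from Espy-Wilson; it should not be read as a fit to any particular speaker.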
1.3.2 Glottal Source Reduction
In addition to the extra poles and zeros in the frequency spectrum during laterals,
a study by Bickley and Stevens [1] found a change in the glottal waveform when a
constriction is formed in the vocal tract during the semivowels. The glottal source is
affected by the narrowing of the vocal tract during the production of the lateral /l/,
which creates an increased acoustic mass and changes the volume velocity waveform
at the glottis. As the acoustic impedance increases due to the decreasing constriction size, the pressure drop across the constriction increases causing a decrease in
the pressure across the glottis. The pressure drop across the constriction causes an
increase in the intraoral pressure during the open phase of the glottal waveform. The
decreased transglottal pressure affects the forces on the vocal folds and leads to a
decreased amplitude of the volume velocity waveform. Although there are individual
differences across speakers, the study suggested that there is always a decrease in
intensity during the lateral relative to the adjacent vowel. There is also a less abrupt
termination for the glottal pulse, leading to a greater spectrum tilt.
1.3.3 Acoustic Losses
Research shows that in addition to the presence of poles and zeros and glottal source
reduction during the lateral, acoustic losses in the vocal tract increase. As the acoustic
resistance at the constriction increases, there are significant effects on the bandwidth
of the first formant. The bandwidth of the second formant also increases significantly
during the lateral [16]. The study by Bickley and Stevens [1] found a change in amplitude
of the first harmonic to be on average 2.9 dB for /l/, while the average change
in amplitude of the first formant was 7.5 dB. This greater change in the first formant
amplitude relative to the amplitude of the fundamental suggests an increase in the
bandwidth of the first formant. The large bandwidth of F1 can be attributed to the
acoustic losses in the oral constriction causing the overall reduction of the amplitude
of the spectrum during the /l/ [2].
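The inference from the two decibel figures to a bandwidth increase can be made concrete with a small calculation. The sketch below assumes that the first-harmonic change tracks the source level alone and that, for a fixed source and formant frequency, the formant peak amplitude is inversely proportional to its bandwidth; both are simplifying assumptions introduced here, not statements from the cited study, and the function name is illustrative.

```python
def db_to_amplitude_ratio(delta_db):
    """Convert a level difference in dB to a linear amplitude ratio."""
    return 10.0 ** (delta_db / 20.0)

h1_change_db = 2.9  # average change in first-harmonic amplitude for /l/ [1]
a1_change_db = 7.5  # average change in first-formant amplitude for /l/ [1]

# Part of the formant-amplitude change not accounted for by the source level.
excess_db = a1_change_db - h1_change_db

# ASSUMPTION: peak amplitude is inversely proportional to bandwidth, so the
# excess reduction maps directly onto a bandwidth ratio.
bandwidth_ratio = db_to_amplitude_ratio(excess_db)
print(f"excess = {excess_db:.1f} dB, implied F1 bandwidth ratio ~ {bandwidth_ratio:.2f}")
```

Under these assumptions the 4.6 dB excess corresponds to an F1 bandwidth roughly 1.7 times that of the adjacent vowel. Any change in F1 frequency between the lateral and the vowel would also alter the peak amplitude, so the figure is only indicative.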
1.4 Purpose of Research
Previous research still leaves many unanswered questions concerning the acoustic
properties of the American English lateral /l/. Research has suggested several models
of the vocal tract for the production of prevocalic /l/. However, the variability among
individuals and phonetic contexts makes it difficult to determine the main features of
an accurate model without further research. Additional poles and zeros exist in the
frequency spectrum during the production of the /l/, but their exact placement and
perceptual importance are not known. Glottal source reduction and acoustic losses also
occur, but their amount and the best method to simulate them in synthesis are not known.
This research acoustically analyzes the lateral to determine the essential components
and perceptual cues for the /l/ for theoretical modeling and synthesis. A
database of utterances containing prevocalic /l/ is created for a variety of male and
female speakers. These laterals are acoustically analyzed, looking specifically for
pole-zero pairs caused by the unequal lateral channels which have been confirmed by the
MRI and EPG studies [13], back reactions on the glottal source during the lateral
due to the constriction of the vocal tract [2], and increasing bandwidths caused by
suggested acoustic losses [15]. Using the previous research and the gathered acoustic
evidence, a theoretical model of the vocal tract for the laterals is developed. A
selected set of database words is synthesized using the Klatt synthesizer, and perceptual
experiments in which the listeners rate the naturalness of the various synthesized
sounds are used to determine the validity of the theoretical model and pinpoint the
acoustic cues for the prevocalic /l/.
One of the primary acoustic features examined is the additional poles and zeros
in the spectrum during the lateral. It is hypothesized that the observed poles and
zeros found around 1.5-3 kHz are due to some combination of the lateral channels
and back cavity, but the great variability of the right and left channel area functions
for the same speaker and across speakers makes it difficult to determine the exact
effects and their perceptual importance [12]. This research attempts to determine
whether the exact placement of the pole and zero is important perceptually or if the
overall high frequency effects of a fast drop off of the spectral amplitude above the
frequency of the second formant, combined with the higher third formant frequency,
are the primary acoustic cues used by listeners for the prevocalic lateral.
Another acoustic feature to be examined in the database of sounds is the glottal
source during the lateral. Previous research suggests that with the narrowing of the
vocal tract during the lateral, there are some back effects that alter the actual glottal
source. Preliminary research also shows some sort of glottal back effect for word-initial
prevocalic laterals, but the reduction is not as apparent for prevocalic laterals
found in consonant clusters. The variability of the glottal source reduction between
speakers and phonetic contexts makes it difficult to determine what is perceptually
important. For accurate, natural sounding synthesized laterals and a valid theoretical
model of the lateral, the manifestation of glottal source reduction must be determined
through acoustic analysis of various speakers and contexts.
Acoustic losses are also examined in prevocalic laterals. Preliminary research
demonstrates increased acoustic losses during the lateral, but the exact magnitude
is still not determined. The most reliable measure of acoustic loss is the increased
bandwidth of the first formant. Acoustic analysis of the database is used to estimate
the amount of acoustic loss for modeling and synthesis.
The acoustic analysis of the prevocalic laterals is used to determine the validity
of a theoretical model of the vocal tract. This model is then used for synthesis
purposes. Previous synthesis work has contrasted the two liquid consonants, /l/
and /r/, and used the differences between the two sounds to define the lateral for
synthesis. The work incorporated a prominent F3 and a decreasing F3 value at
the release of the lateral to discriminate the sound from /r/ [11]. A change in the
voicing is created by lowering the amplitude of voicing during the liquid, but the
resulting synthesized laterals do not sound very natural. Synthesis performed in
the present study attempts to determine the parameters inherently important for the
natural synthesis of the lateral. The Klatt synthesizer parameters that are specifically
examined include TL (tilt), BW (bandwidths of formants), the possibility of poles
and zeros, and formant changes [8]. Initial synthesis of sample data suggests that the
TL parameter increases, the bandwidth of the first formant increases, a pole and zero
are present near the third formant, and abrupt transitions of the formants occur at
the release of the lateral.
The importance of the additional details in the model of the prevocalic lateral is
determined through perceptual testing. A set of words in the database is synthesized
for a speaker based on the theoretical model, and actual utterances are also used
to determine the synthesis parameters necessary for a good match of the synthesized
lateral to the natural lateral. Two versions of each word are synthesized. One version
time-varies only the formant transitions, while the other time-varies bandwidths, spectral
tilt, and a pole-zero pair in addition to the formant frequencies. Listeners rate the
naturalness of the various sounds, spoken and synthesized, in perceptual experiments.
These data are then analyzed to determine what parameters are primary and whether
a more complicated model of the lateral is necessary for good synthesis.
Chapter 2
Modeling
2.1 Motivation
In this chapter, a model of the acoustic behavior of the vocal tract for lateral consonants is developed. The aim is to interpret the changes in the frequency spectra
that occur during the production of the lateral consonant. We are interested in the
natural resonance frequencies of the vocal tract configuration during the lateral production, and also the frequencies of possible zeros in the transfer function. Acoustic
losses, including bandwidth changes and glottal source changes, are not considered
in the model. The proposed model offers some explanation of the acoustic attributes
of the lateral. It also attempts to explain how the same sound can be produced with
such large variability in spectral peaks and valleys. This variability suggests that the
acoustic cues are not limited to the locations of the resonances, but there may be
more complicated perceptual effects.
2.1.1 Simple Lossless Tube
To understand and model the vocal tract during lateral consonant production, a
simple lossless tube is first examined. A combination of such tubes is then used as
the basis for the model of the lateral consonant. The two variables of interest in the
tube are the sound pressure and the volume velocity. The sound pressure $p(x, t)$ and
volume velocity $U(x, t)$ for one-dimensional propagation satisfy
\[ \frac{\partial p}{\partial x} = -\frac{\rho}{A}\frac{\partial U}{\partial t} \tag{2.1} \]
\[ \frac{\partial U}{\partial x} = -\frac{A}{\gamma P_0}\frac{\partial p}{\partial t} \tag{2.2} \]
where $A$ is the cross-sectional area, $P_0$ is the ambient air pressure, $\rho$ is the ambient
density of the air (0.00114 gm/cm$^3$), and $\gamma$ is the ratio of the specific heat at constant
pressure to that at constant volume ($\gamma = 1.4$ for air). Assuming an exponential time
dependence and a constant area function, these equations reduce to
\[ \frac{d^2 p}{dx^2} + k^2 p = 0 \tag{2.3} \]
and
\[ U = -\frac{A}{j 2\pi f \rho}\frac{dp}{dx} \tag{2.4} \]
where $k = 2\pi f / c$ and $c = \sqrt{\gamma P_0 / \rho}$ (the velocity of sound, 35,400 cm/sec in the body) [15].
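As a quick numerical check on the relation between the velocity of sound, the ambient pressure, and the air density, the sketch below evaluates sqrt(γ·P0/ρ) with the density and γ quoted in the text. The ambient pressure of one standard atmosphere (1.013e6 dyn/cm²) is an assumption, since the text does not state the value used, and the function names are illustrative only.

```python
import math

GAMMA = 1.4     # ratio of specific heats for air (from the text)
RHO = 0.00114   # ambient air density in g/cm^3 (from the text)
P0 = 1.013e6    # ASSUMPTION: one standard atmosphere, in dyn/cm^2

def sound_speed(gamma=GAMMA, p0=P0, rho=RHO):
    """Velocity of sound c = sqrt(gamma * P0 / rho), in cm/s."""
    return math.sqrt(gamma * p0 / rho)

def wavenumber(f_hz, c):
    """Wavenumber k = 2 * pi * f / c, in rad/cm."""
    return 2.0 * math.pi * f_hz / c

c_est = sound_speed()
print(f"c ~ {c_est:.0f} cm/s, k at 1 kHz ~ {wavenumber(1000.0, c_est):.4f} rad/cm")
```

Under that assumed pressure the estimate comes out within about one percent of the 35,400 cm/sec quoted above.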
2.1.2 Expected Numbers of Poles and Zeros
Previous research verifies that adding channels to a model increases the number of
poles and in some cases causes zeros to appear in the transfer function. In the simple
case of a single uniform tube, the expected number of poles and zeros in the transfer
function is determined by the length of the tube and the position and
the type of source. The average spacing of poles for a uniform tube of length $l$ is
$c/(2l)$. By adding side branches of total length $l_s$, the average spacing of poles for the
system decreases to $c/(2(l + l_s))$. The number of poles up to a certain frequency, $f$,
is $n_p = 2(l + l_s)f/c$. Zeros also appear in the transfer function, and the number up to
frequency, $f$, is $n_z = 2 l_s f/c$. The first minimum, or zero, in the transfer function occurs
at $f_z = c/(4 l_s)$ for a uniform side branch that is closed at the end [15, 7].
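The counting argument can be tried on numbers with a few lines of code. The sketch below uses the usual uniform-tube estimates (average pole spacing of c/2l for a tube of length l, and a first zero at the quarter-wavelength frequency of a closed side branch); the 17.7 cm main tube, the 2.5 cm side branch, and all function names are hypothetical values chosen only to give round numbers.

```python
C_SOUND = 35400.0  # speed of sound in cm/s, as quoted in the text

def average_pole_spacing(l_main, l_side=0.0, c=C_SOUND):
    """Average pole spacing (Hz) for a tube of length l_main with side
    branches of total length l_side."""
    return c / (2.0 * (l_main + l_side))

def expected_poles(f_hz, l_main, l_side=0.0, c=C_SOUND):
    """Approximate number of poles below f_hz."""
    return 2.0 * (l_main + l_side) * f_hz / c

def expected_zeros(f_hz, l_side, c=C_SOUND):
    """Approximate number of zeros below f_hz due to the side branches."""
    return 2.0 * l_side * f_hz / c

def first_zero_closed_branch(l_side, c=C_SOUND):
    """Quarter-wavelength zero (Hz) of a uniform side branch closed at its end."""
    return c / (4.0 * l_side)

# Hypothetical dimensions, for illustration only.
spacing_plain = average_pole_spacing(17.7)
spacing_branch = average_pole_spacing(17.7, 2.5)
f_zero = first_zero_closed_branch(2.5)
print(spacing_plain, spacing_branch, f_zero)
```

With these hypothetical lengths the side branch narrows the average pole spacing from about 1000 Hz to under 900 Hz and places its first zero near 3.5 kHz.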
The lateral, however, consists of two parallel paths, not just an additional side
branch, and this affects the expected approximate number of poles and zeros. Stevens
hypothesizes that poles of the transfer function are replaced by pole-zero-pole clusters
and that the first appearance of such a cluster will occur at the half wavelength of the
lateral channels, f = (2(Lc12)) [15]. A difficulty with two parallel paths is determining
how they will interact; one path could be considered as the main channel, while the
other acts as a side branch.
2.2
Evaluation of Lateral Side Channels
Fant suggests the possibility of the formation of side channels during lateral consonant
production and this is confirmed by the data gathered at UCLA[7, 13]. The addition
of side channels during the lateral consonant changes the transfer function of the
vocal tract configuration. Previous research shows that speakers produce the lateral
consonant with a variety of vocal tract configurations depending on the individual
and the phonetic context[12]. Speakers can have two side channels of different lengths
and different areas or only one channel. The side channels formed during the lateral
consonant can be modeled approximately as two uniform tubes, as shown in Figure 2-1. We consider the acoustic behavior of such a configuration.
2.2.1
Equations
Solving Equations 2.3 and 2.4 for the input and output volume velocities and pressures, the equations reduce to
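$$\begin{pmatrix} U_{in,21} \\ P_{in,21} \end{pmatrix} = \begin{pmatrix} \cos(-kl_{21}) & -j\frac{A_{21}}{\rho c}\sin(-kl_{21}) \\ -j\frac{\rho c}{A_{21}}\sin(-kl_{21}) & \cos(-kl_{21}) \end{pmatrix} \begin{pmatrix} U_{out,21} \\ P_{out,21} \end{pmatrix} \qquad (2.5)$$
and
$$\begin{pmatrix} U_{in,22} \\ P_{in,22} \end{pmatrix} = \begin{pmatrix} \cos(-kl_{22}) & -j\frac{A_{22}}{\rho c}\sin(-kl_{22}) \\ -j\frac{\rho c}{A_{22}}\sin(-kl_{22}) & \cos(-kl_{22}) \end{pmatrix} \begin{pmatrix} U_{out,22} \\ P_{out,22} \end{pmatrix} \qquad (2.6)$$
where $l_{21}$ and $l_{22}$ are the lengths of each of the component paths, and $A_{21}$ and $A_{22}$ are the constant cross-sectional areas.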
Figure 2-1: Model of side channels formed during lateral consonant production
2.2.2
Boundary Conditions
The boundary conditions of the system are restrictions on the volume velocities and sound pressures of the two tubes at the input and output. The input volume velocity, $U_{in}$, is the sum of $U_{in,21}$ and $U_{in,22}$, and the output volume velocity, $U_{out}$, is the sum of the two output volume velocities. The volume velocities are assumed to sum without any interference at the two locations. The input and output pressures of the two tubes are also assumed to be equal. The transfer function, $U_{out}/U_{in}$, of the two side channels is obtained by solving Equations 2.5 and 2.6 for the boundary conditions of the system, assuming zero pressure at the output of the system, $P_{out} = 0$:
$$\frac{U_{out}}{U_{in}} = \frac{A_{21}\sin(kl_{22}) + A_{22}\sin(kl_{21})}{A_{21}\cos(kl_{21})\sin(kl_{22}) + A_{22}\sin(kl_{21})\cos(kl_{22})} \qquad (2.7)$$
2.2.3
Solve for Specific Cases of Lengths and Areas
The transfer function of the system depends on the lengths and areas of the two channels. The model is a lossless model and is useful for locating the resonance peaks and valleys of the system, but not the amplitudes. In this simple system, a zero will occur when the numerator of the transfer function is zero and a pole will occur when the denominator is zero. A zero occurs at the frequencies for which
$$\sin(kl_{22}) = -\frac{A_{22}}{A_{21}}\sin(kl_{21}) \qquad (2.8)$$
and a pole occurs when
$$\cos(kl_{21})\sin(kl_{22}) = -\frac{A_{22}}{A_{21}}\sin(kl_{21})\cos(kl_{22}) \qquad (2.9)$$
The transfer function has been calculated for several different cases of lengths and area configurations, reflecting the possible vocal tract configurations used by individuals.
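The zero condition of Equation 2.8 is equivalent to the numerator $A_{21}\sin(kl_{22}) + A_{22}\sin(kl_{21})$ of the transfer function vanishing, so candidate zero frequencies can be located numerically. The following is only a minimal sketch under that assumption: it scans a frequency grid for sign changes and refines each one by bisection. The function names, the 1 Hz grid and $c$ = 35,400 cm/sec are choices made here for illustration, and a candidate zero may still be cancelled by a coinciding pole.

```python
import numpy as np

C_SOUND = 35400.0  # cm/sec, velocity of sound quoted in the text


def numerator(f, a21, l21, a22, l22, c=C_SOUND):
    """Numerator of the side-channel transfer function (zero condition, Eq. 2.8)."""
    k = 2.0 * np.pi * f / c
    return a21 * np.sin(k * l22) + a22 * np.sin(k * l21)


def find_zeros(a21, l21, a22, l22, f_max=6000.0, step=1.0):
    """Return frequencies (Hz) in (0, f_max] where the numerator changes sign.

    Roots that fall exactly on a grid point are not detected by this simple scan.
    """
    grid = np.arange(step, f_max + step, step)
    vals = numerator(grid, a21, l21, a22, l22)
    roots = []
    for i in np.nonzero(vals[:-1] * vals[1:] < 0)[0]:
        lo, hi = grid[i], grid[i + 1]
        for _ in range(60):  # plain bisection on the bracketing interval
            mid = 0.5 * (lo + hi)
            if numerator(lo, a21, l21, a22, l22) * numerator(mid, a21, l21, a22, l22) <= 0:
                hi = mid
            else:
                lo = mid
        roots.append(0.5 * (lo + hi))
    return roots


if __name__ == "__main__":
    # Areas in cm^2 and lengths in cm for one of the unequal-length cases.
    print(find_zeros(0.3, 8.5, 0.2, 7.5))
```

For channel lengths of 8.5 cm and 7.5 cm, this scan finds two candidate zeros below 6 kHz, one near 2 kHz and one near 4.5 kHz.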
Figure 2-2: Transfer function for model of side channels, $20\log|U_{out}/U_{in}|$, plotted against frequency (0 to 6000 Hz) for $A_{21} = A_{22} = .1$ cm$^2$ and $l_{21} = l_{22} = 8$ cm
Side Channels the Same Lengths
The transfer function, $20\log|U_{out}/U_{in}|$, is plotted for $A_{21} = A_{22} = .1$ cm$^2$ and $l_{21} = l_{22} = 8$ cm in Figure 2-2. When the two cross-sectional areas, $A_{21}$ and $A_{22}$, and lengths, $l_{21}$ and $l_{22}$, are equal, the transfer function has no interference from the side channels and appears to be that of a simple, lossless tube of length $l_{tot} = l_{21} = l_{22} = 8$ cm.
Varying the cross-sectional areas of the tubes will not affect the locations of the resonance peaks or valleys when the lengths of the side channels are equal, as Equations 2.8 and 2.9 suggest. Some zeros are cancelled by poles at the same location when the ratio of the lengths is an integer. In this case, where the lengths are equal, the zeros created are cancelled by the additional poles and the system appears as a single tube of length 8 cm. As expected, the number of poles of the system up to 6 kHz is 3, using the rough approximation of the number of poles described in Section 2.1.2 with a single tube of length $l = 8$ cm.
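The equal-length case can be checked numerically. The sketch below evaluates the side-channel transfer function of Equation 2.7 and confirms that, for equal lengths, it coincides with the single-tube response $1/\cos(kl)$ away from the frequencies where $\sin(kl) = 0$. The function names and the choice $c$ = 35,400 cm/sec are assumptions made for this illustration, not part of the original computation.

```python
import numpy as np

C_SOUND = 35400.0  # cm/sec, velocity of sound quoted in the text


def side_channel_transfer(f, a21, l21, a22, l22, c=C_SOUND):
    """Lossless transfer function U_out/U_in of two parallel channels (Eq. 2.7)."""
    k = 2.0 * np.pi * np.asarray(f, dtype=float) / c
    num = a21 * np.sin(k * l22) + a22 * np.sin(k * l21)
    den = (a21 * np.cos(k * l21) * np.sin(k * l22)
           + a22 * np.sin(k * l21) * np.cos(k * l22))
    return num / den


def to_db(h):
    """Magnitude in dB, 20 log10 |H|."""
    return 20.0 * np.log10(np.abs(h))


if __name__ == "__main__":
    f = np.array([500.0, 1000.0, 3000.0])
    equal = side_channel_transfer(f, 0.1, 8.0, 0.1, 8.0)
    single_tube = 1.0 / np.cos(2.0 * np.pi * f * 8.0 / C_SOUND)
    print(np.allclose(equal, single_tube))  # equal lengths behave like one 8 cm tube
    print(to_db(side_channel_transfer(f, 0.3, 8.5, 0.2, 7.5)))
```

Because the model is lossless, the magnitudes near poles and zeros are unbounded; only the locations of the peaks and valleys are meaningful.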
Side Channels of Different Lengths
The transfer function, $20\log|U_{out}/U_{in}|$, is plotted for $A_{21} = .3$ cm$^2$, $A_{22} = .2$ cm$^2$ and $l_{21} = 8.5$ cm, $l_{22} = 7.5$ cm in Figure 2-3. The 1 cm difference between side channel lengths produces two pole-zero pairs, the first at approximately 2 kHz and the second at 4 kHz. The first zero location seems to be at a frequency corresponding to a single wavelength of the total length of the side channels (in this case 16 cm).
Figure 2-3: Transfer function for model of side channels, $20\log|U_{out}/U_{in}|$, plotted against frequency (0 to 6000 Hz) for $A_{21} = .3$ cm$^2$, $A_{22} = .2$ cm$^2$ and $l_{21} = 8.5$ cm, $l_{22} = 7.5$ cm
The transfer function, $20\log|U_{out}/U_{in}|$, is plotted for $A_{21} = A_{22} = .3$ cm$^2$ and $l_{21} = 4$ cm, $l_{22} = 12$ cm in Figure 2-4. As the difference between the lengths of the two side channels increases, the variation from a simple all-pole system is greater. Although the total number of poles present up to 6 kHz is not as expected, bunching may be occurring at higher frequencies and some cancellation of poles and zeros occurs. Since the ratio of the lengths is an integer and the cross-sectional areas are equal, some of the zeros are cancelled by additional poles. A pole-zero pair does appear with the zero at approximately 2100-2300 Hz. Varying the lengths of the side channels greatly alters the shape of the transfer function.
In Figure 2-5, the transfer function, $20\log|U_{out}/U_{in}|$, is plotted for $A_{21} = .3$ cm$^2$, $A_{22} = .1$ cm$^2$ and $l_{21} = 4$ cm, $l_{22} = 12$ cm. The total length and length ratios are the same as in Figure 2-4, but the ratio of the cross-sectional areas is different. The change in area ratio produces a great change in the transfer function when the lengths of the side channels are also different, and an additional pole-zero pair appears in the 1000-2000 Hz range. The second, lower pole-zero pair could possibly appear due to the change in the ratio of the areas. Even with these four limited cases, the variability in the spectra with this vocal tract is extreme, from a simple all-pole system to a system with two pole-zero pairs.
Figure 2-4: Transfer function for model of side channels, $20\log|U_{out}/U_{in}|$, plotted against frequency for $A_{21} = A_{22} = .3$ cm$^2$ and $l_{21} = 4$ cm, $l_{22} = 12$ cm
Figure 2-5: Transfer function for model of side channels, $20\log|U_{out}/U_{in}|$, plotted against frequency for $A_{21} = .3$ cm$^2$, $A_{22} = .1$ cm$^2$ and $l_{21} = 4$ cm, $l_{22} = 12$ cm
Figure 2-6: Transfer function for model of side channels, $20\log|U_{out}/U_{in}|$, plotted against frequency (0 to 6000 Hz) for $A_{21}$ and $A_{22}$ between .1 cm$^2$ and .5 cm$^2$ and $l_{21} + l_{22} = 16$ cm. Curves shown: $A_{22}/A_{21} = .33$, $l_{22}/l_{21} = .88$; $A_{22}/A_{21} = .2$, $l_{22}/l_{21} = 3$; $A_{22}/A_{21} = .5$, $l_{22}/l_{21} = 1.46$; $A_{22}/A_{21} = .25$, $l_{22}/l_{21} = .52$
To further demonstrate the variability possible with slight modifications of the vocal tract configuration, Figure 2-6 shows the transfer function for pairings of cross-sectional area ratios and channel lengths, with the sum of the side channel lengths held constant, $l_{21} + l_{22} = 16$ cm. These plots illustrate the extreme variability that can occur. An interesting observation is that all configurations produce a pole-zero pair in the 1.5-3 kHz range, corresponding approximately to one wavelength of a tube of length 16 cm.
2.3
Evaluation of the Entire Model
During lateral production, the vocal tract can be modeled in its entirety as two side channels coupled with a simple uniform tube on each end, as shown in Figure 2-7.
Figure 2-7: Model of the vocal tract during lateral consonant production
2.3.1
Equations
Solving Equations 2.3 and 2.4 for the input and output volume velocities and pressures, the equations for the additional two sections reduce to
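$$\begin{pmatrix} U_{in,1} \\ P_{in,1} \end{pmatrix} = \begin{pmatrix} \cos(-kl_{1}) & -j\frac{A_{1}}{\rho c}\sin(-kl_{1}) \\ -j\frac{\rho c}{A_{1}}\sin(-kl_{1}) & \cos(-kl_{1}) \end{pmatrix} \begin{pmatrix} U_{out,1} \\ P_{out,1} \end{pmatrix} \qquad (2.10)$$
and
$$\begin{pmatrix} U_{in,3} \\ P_{in,3} \end{pmatrix} = \begin{pmatrix} \cos(-kl_{3}) & -j\frac{A_{3}}{\rho c}\sin(-kl_{3}) \\ -j\frac{\rho c}{A_{3}}\sin(-kl_{3}) & \cos(-kl_{3}) \end{pmatrix} \begin{pmatrix} U_{out,3} \\ P_{out,3} \end{pmatrix} \qquad (2.11)$$
where $l_1$ and $l_3$ are the lengths of each of the systems, and $A_1$ and $A_3$ are the constant cross-sectional areas of the front and back tubes.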
2.3.2
Boundary Conditions
The transfer function, $U_{out}/U_{in}$, of the system is obtained by combining Equations 2.5, 2.6, 2.10 and 2.11. The boundary conditions for the system are determined for each of the sections. The input volume velocity, $U_{in,1}$, is the source volume velocity for the system. The output volume velocity of the left tube, $U_{out,1}$, is the sum of the two input volume velocities of the side channels, $U_{in,21}$ and $U_{in,22}$. The output pressure, $P_{out,1}$, is the input pressure of the side channels. The input volume velocity of the third section, $U_{in,3}$, is the sum of the output volume velocities of the side channels, $U_{out,21}$ and $U_{out,22}$. The output pressure of the two side channels is the same as the input pressure to the right section, $P_{in,3}$. The output pressure, $P_{out,3}$, is assumed to be zero because the end of the tube is open, and the output volume velocity, $U_{out,3}$, is the output volume velocity of the system, $U_{out}$. The transfer function is
$$\frac{U_{out}}{U_{in}} = \frac{-A_3\left(A_{21}\sin(kl_{22}) + A_{22}\sin(kl_{21})\right)}{D} \qquad (2.12)$$
with
$$\begin{aligned} D ={} & -2A_{21}A_{22}\cos(kl_1)\cos(kl_{21})\cos(kl_{22})\sin(kl_3) \\ & + A_1A_{21}\sin(kl_1)\sin(kl_{22})\cos(kl_{21})\sin(kl_3) \\ & + A_1A_3\sin(kl_1)\sin(kl_{22})\sin(kl_{21})\cos(kl_3) \\ & - A_{22}A_3\cos(kl_1)\cos(kl_{22})\sin(kl_{21})\cos(kl_3) \\ & + (A_{21}^2 + A_{22}^2)\cos(kl_1)\sin(kl_{21})\sin(kl_3)\sin(kl_{22}) \\ & + 2A_{21}A_{22}\cos(kl_1)\sin(kl_3) \\ & - A_{21}A_3\cos(kl_1)\cos(kl_{21})\cos(kl_3)\sin(kl_{22}) \\ & + A_1A_{22}\sin(kl_1)\cos(kl_{22})\sin(kl_3)\sin(kl_{21}) \end{aligned}$$
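Equation 2.12 is long enough that a direct transcription into code is a useful cross-check. The sketch below evaluates the closed form and tests one limiting case that follows from the tube equations: if the two side channels are identical and $A_1 = A_3 = A_{21} + A_{22}$, the whole model is a uniform tube of length $L = l_1 + l_{21} + l_3$, so the transfer function should reduce to $1/\cos(kL)$. The function name, argument order and $c$ = 35,400 cm/sec are assumptions made for this illustration.

```python
import numpy as np

C_SOUND = 35400.0  # cm/sec, velocity of sound quoted in the text


def full_model_transfer(f, a1, l1, a21, l21, a22, l22, a3, l3, c=C_SOUND):
    """Lossless U_out/U_in of the three-section model, transcribed from Eq. 2.12."""
    k = 2.0 * np.pi * np.asarray(f, dtype=float) / c
    s1, c1 = np.sin(k * l1), np.cos(k * l1)
    s21, c21 = np.sin(k * l21), np.cos(k * l21)
    s22, c22 = np.sin(k * l22), np.cos(k * l22)
    s3, c3 = np.sin(k * l3), np.cos(k * l3)
    num = -a3 * (a21 * s22 + a22 * s21)
    den = (-2.0 * a21 * a22 * c1 * c21 * c22 * s3
           + a1 * a21 * s1 * s22 * c21 * s3
           + a1 * a3 * s1 * s22 * s21 * c3
           - a22 * a3 * c1 * c22 * s21 * c3
           + (a21 ** 2 + a22 ** 2) * c1 * s21 * s3 * s22
           + 2.0 * a21 * a22 * c1 * s3
           - a21 * a3 * c1 * c21 * c3 * s22
           + a1 * a22 * s1 * c22 * s3 * s21)
    return num / den


if __name__ == "__main__":
    f = np.array([700.0, 1800.0, 2800.0])
    # Uniform-tube limit: side channels of 0.2 cm^2 each, end tubes of 0.4 cm^2.
    h = full_model_transfer(f, 0.4, 1.0, 0.2, 8.0, 0.2, 8.0, 0.4, 10.0)
    uniform = 1.0 / np.cos(2.0 * np.pi * f * 19.0 / C_SOUND)
    print(np.allclose(h, uniform))
    # Configuration with the end-section values used for the figures.
    print(full_model_transfer(f, 2.0, 1.0, 0.2, 10.0, 0.2, 8.0, 5.0, 10.0))
```

The closed form is undefined at frequencies where both numerator and denominator vanish (for example where $\sin(kl_{21}) = \sin(kl_{22}) = 0$), so those points should be avoided or handled separately when plotting.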
2.3.3
Solution for specific cases of lengths and Areas
The transfer function, $20\log|U_{out}/U_{in}|$, is solved for various lengths and cross-sectional areas to determine the effects on the frequency spectra. The lengths and areas of the first and third sections are not varied. In all of the following figures, $A_3 = 5$ cm$^2$, $l_3 = 10$ cm, $A_1 = 2$ cm$^2$, $l_1 = 1$ cm are kept constant and correspond roughly to measured values gathered by Narayanan et al. [13, 12]. The second formant in the model will not be as low as actually measured because all of the sections are assumed to have constant areas, and, according to perturbation theory, squeezing the cross-sectional area of section 1 will cause the second resonance peak to shift lower[15]. Also, squeezing section 3 near the middle (to simulate the backed tongue body) will lower the second resonance.
As in the situation with only the side channels, this system will have a zero when the numerator of the transfer function is zero and a pole when the denominator is zero. The zeros of the system are determined by the same equation as for the simple model. Zeros will occur when
$$\sin(kl_{22}) = -\frac{A_{22}}{A_{21}}\sin(kl_{21}). \qquad (2.13)$$
The locations of the poles in this more complicated model are different from the
pole locations for the simple model since the denominator is more complicated. In
some cases when poles cancel zeros in the simple model, cancellation may not occur
in the more complicated model. For a general idea of the number and location of the
additional zeros, the zeros for the simple model can be solved for using Equation 2.8
of the simple model.
Same Lengths
The transfer function, $20\log|U_{out}/U_{in}|$, is plotted in Figure 2-8 for $A_{21} = A_{22} = .2$ cm$^2$ and $l_{21} = l_{22} = 8$ cm. The system appears as an all-pole system, as expected. When the side channels are the same length, the effect on the transfer function is to appear as a single tube of the length of the side channels (not the sum). If the model accounted for losses, the area effects would be visible, but this model only locates peaks and valleys of the system. As expected, the number of poles of the system up to 6 kHz is 6, using the approximation described in Section 2.1.2.
Figure 2-8: Transfer function for model of vocal tract, $20\log|U_{out}/U_{in}|$, plotted against frequency for $A_{21} = A_{22} = .2$ cm$^2$ and $l_{21} = l_{22} = 8$ cm
Same Areas, Different Lengths
The transfer function, $20\log|U_{out}/U_{in}|$, is plotted in Figure 2-9 for $A_{21} = A_{22} = .2$ cm$^2$, $l_{21} = 10$ cm, and $l_{22} = 8$ cm. The different lengths of the side channels produce two pole-zero pairs with one in the 1800-2200 Hz range. The number of poles is related to the total length of the system, and using the approximation described in Section 2.1.2 the number of poles is expected to be 9 up to 6 kHz. Assuming the additional length to the system is 8 cm (the average length of the two side branches), the expected number of zeros with the addition of the side branches is between two and three. The number of zeros with two side channels does not depend only on the total length of the side branches but also on the interaction of the side channels described by Equation 2.8; nevertheless, a rough approximation of the number of zeros can be determined by the additional length of the side channels to the system.
Different Areas, Different Lengths
The transfer function, $20\log|U_{out}/U_{in}|$, is plotted in Figure 2-10 for $A_{21} = .2$ cm$^2$, $A_{22} = .5$ cm$^2$, $l_{21} = 11.5$ cm, and $l_{22} = 4.5$ cm. The difference in areas and lengths of the two side channels causes the transfer function to include three pole-zero pairs.
Figure 2-9: Transfer function for model of vocal tract, $20\log|U_{out}/U_{in}|$, plotted against frequency for $A_{21} = A_{22} = .2$ cm$^2$, $l_{21} = 10$ cm, and $l_{22} = 8$ cm
Figure 2-10: Transfer function for model of vocal tract, $20\log|U_{out}/U_{in}|$, plotted against frequency for $A_{21} = .2$ cm$^2$, $A_{22} = .5$ cm$^2$, $l_{21} = 11.5$ cm, and $l_{22} = 4.5$ cm
Figure 2-11: Transfer function for model of vocal tract, $20\log|U_{out}/U_{in}|$, plotted against frequency (0 to 6000 Hz) for various values of $A_{21}$ and $A_{22}$ and $l_{21} + l_{22} = 16$ cm. Curves shown: $A_{22}/A_{21} = 2.5$, $l_{22}/l_{21} = 3$; $A_{22}/A_{21} = 3$, $l_{22}/l_{21} = .88$; $A_{22}/A_{21} = 2$, $l_{22}/l_{21} = .52$
The transfer function, $20\log|U_{out}/U_{in}|$, is plotted in Figure 2-11 for various values of $A_{21}$ and $A_{22}$ and $l_{21} + l_{22} = 16$ cm. Even though the total length of the side channels, $l_{21} + l_{22}$, remains constant at 16 cm, the variation in the frequency spectra
of the transfer function is huge. The length of the side channels is not necessarily 16
cm for all individuals, but the variation does give a general idea of what occurs with
different side channel lengths and areas. Pole-zero pairs appear at approximately
1500-2000 Hz and continue to the high frequencies.
2.4
Justification of the Model
The model suggested here does not account for any acoustic losses including changes
in bandwidth or glottal source. The model does give a general idea of the location of
resonance peaks due to the addition of the side channels.
2.4.1
Calculation of the Number of Poles and Zeros and Approximate Locations
The complete model will give rise to zeros at the locations specified by Equation 2.8. Since the total number of poles and zeros cannot increase for the system, the transfer function, as a general approximation, will also show the formation of the same number of poles as zeros. The number of zeros expected up to a certain frequency, $f$, is $n_z = 2 l_s f/c$, where $l_s$ is the total length of the side branches, as described in Section 2.1.2. The exact pole locations are determined by the denominator of the transfer function. When the lengths are integer multiples of each other, pole and zero cancellations can be expected. The simpler model of the side channels in Figure 2-1 can be used to determine the possible locations of the zeros and the expected number of zeros.
2.5
Limitations of Model
2.5.1
Assumptions
Lengths
The total sum of the side branches is held constant at 16 cm, which is greater than the data gathered by Narayanan et al. suggest [12, 13]. However, the examination of natural utterances suggests the formation of a pole-zero pair around the third formant. Shortening the side channel lengths increases the frequency of the first zero above the third formant.
Constant Areas The assumption of constant areas for the model is not realistic,
but it simplifies the calculations and allows some generalizations to be made of the
expected effects.
Formant Locations
The second formant calculated by this model does not account
for the backed tongue body position and the F2 is not as low as observed. Perturbation
theory of the back cavity accounts for the lower F2 observed and indicates why the
simple model does not exhibit this behavior.
2.6
Conclusions
Although the total lengths of the side channels are kept constant, the locations of the resonance peaks and valleys are not constant. Small changes in the lengths and cross-sectional areas cause huge variations in the spectra. If the key perceptual cue for the lateral consonant is the location of the peaks, then speakers would have to maintain a certain configuration of the vocal tract that is not very stable. Since similar configurations create extremely different effects, the key perceptual cue cannot be the exact location of the resonance peaks and valleys of the spectrum. Instead, the cues must be more subtle, and possibly the addition of the poles and zeros assists in creating the cues. The constant attributes across the various lengths and cross-sectional areas are low first and second formant resonances and the addition of multiple pole-zero pairs beginning at approximately 1500 Hz.
Chapter 3
Measurements
3.1
Purpose of Measurements - What Makes a /l/?
The lateral consonant is extremely variable depending on the phonetic context and
speakers. Modeling of the vocal tract as a configuration with side channels suggests that the resonance frequencies of the system are susceptible to subtle changes in cross-sectional areas and lengths of the side channels. If the key perceptual cue cannot be the exact location of the resonance peaks and valleys, then what is the cue used by listeners to identify the lateral? Previous research suggests that other changes in the system occur during the lateral production, including changes in the glottal source [1, 2], bandwidth changes, and the addition of pole-zero pairs. Measurements are made
on lateral consonants produced by various speakers in order to better understand
the changes occurring during the lateral production and to determine what might
be potential perceptual cues used to discriminate the lateral consonant from other
sounds.
3.2
Method
All recordings were made for normal English speakers with normal hearing. The
speakers were recorded in the sound room at the MIT RLE Speech Group lab. The
recordings were made onto audio tape or onto DAT tapes. Utterances for male speakers were digitized at a sampling rate of 10 kHz and low pass filtered at 4.8 kHz while
female speakers were digitized at a sampling rate of 13 kHz and low pass filtered at
6.2 kHz. One set of utterances for a male speaker and a female speaker was directly digitized from DAT tapes and downsampled to 10 kHz and 13 kHz, respectively, using the Sound Design program.
Singleton
The singleton lateral utterances were recorded for six speakers, three female and three male. Three repetitions of six pre-vocalic /l/ utterances were recorded in isolation with an extra word at the end of the list to prevent intonation variations. The speakers were instructed to say the words in a normal tone and to maintain a constant intonation for all words. The word list is given in Table 3.1.
leap    loot
let     lap
law     luck
Table 3.1: Singleton utterances
The lateral consonant is released into vowels at the four extremes of the possible high and low tongue body configurations and two lax vowels at intermediate tongue-body heights. All speakers of the singleton utterances were also recorded for the cluster utterances.
Cluster
The cluster utterances were recorded for eight speakers, four female and four male. Two repetitions of the cluster words spoken in the phrase "Say word again" were recorded. The speakers were instructed to say the list at a comfortable level and pace. The word list is given in Table 3.2. The stop consonants were combined with /r/ and /l/ in combinations that occur in English and released into the vowels /i/ and /a/. Additionally, singleton stop consonants and /r/ were used for utterances releasing into the two vowels.
/i/       /a/
beat      bought
breed     broad
clean     clod
deep      dot
geese     got
green     grog
plead     plod
team      top
reed      rod
bleed     block
keep      cop
crete     craw
dream     drop
glean     glop
peat      pot
preach    prod
treat     trod
Table 3.2: Cluster utterances
3.3
Measurements
Windows, Placements
All measurements of the waveforms were made using a 6.4 ms Hamming window. The release of the liquid consonants was determined by examining the speech waveform and the spectra for the change in formants (particularly the second formant resonance peak) and voicing.
Pre-Emphasis
Pre-emphasis on the spectra was used to ensure a more accurate first formant frequency peak, since the first formant frequency is very low for the lateral consonant. The first harmonic interferes with the measurements, and F1 sometimes appears to be lower than 300 Hz even with pre-emphasis.
Figure 3-1 plots the attenuation in dB of the pre-emphasis filter for a 13 kHz sampling rate. The effect can be roughly thought of as a 6 dB/octave slope up to about 3 kHz. The pre-emphasis effect is removed from measurements of the spectrum amplitudes.
3.3.1
Singleton
Measurements of singleton utterances were made at two points in time using a method similar to that described by Stevens and Blumstein[16]. Measurements of the first three formants and their amplitudes were taken 20 ms prior to the release of the lateral and 20 ms after the release into the vowel. A series of spectra obtained with the 6.4 ms Hamming window were averaged over a 12 ms interval, and therefore included at least one full glottal period. Use of this averaging technique was convenient since it did not require careful placement of the window at the beginning of each glottal period. An example spectrogram of a singleton utterance is shown in Figure 3-2.
Figure 3-1: The effect of the pre-emphasis on the spectra
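The averaging procedure described above can be sketched as follows. This is a minimal illustration, not the analysis software actually used: it assumes that successive 6.4 ms Hamming-windowed frames are taken every 1 ms across the 12 ms interval, that a 256-point DFT is computed for each frame, and that the power spectra (rather than magnitudes or dB values) are averaged. The function name and its parameters are hypothetical.

```python
import numpy as np


def averaged_spectrum_db(x, fs, start_s, win_ms=6.4, span_ms=12.0,
                         hop_ms=1.0, n_fft=256):
    """Average short-time power spectra over an interval; returns (freqs, dB).

    Assumption: power spectra of successive Hamming-windowed frames are
    averaged; the source does not state which quantity was averaged.
    """
    x = np.asarray(x, dtype=float)
    win_len = int(round(win_ms * 1e-3 * fs))       # 64 samples at 10 kHz
    hop = max(1, int(round(hop_ms * 1e-3 * fs)))   # frame advance in samples
    n_frames = int(round(span_ms / hop_ms))        # 12 frames for 12 ms at 1 ms hop
    first = int(round(start_s * fs))
    window = np.hamming(win_len)
    power = np.zeros(n_fft // 2 + 1)
    for i in range(n_frames):
        a = first + i * hop
        frame = x[a:a + win_len]
        if len(frame) < win_len:
            raise ValueError("signal too short for the requested interval")
        # rfft zero-pads the windowed frame to n_fft points
        power += np.abs(np.fft.rfft(frame * window, n_fft)) ** 2
    power /= n_frames
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    return freqs, 10.0 * np.log10(power + 1e-12)


if __name__ == "__main__":
    fs = 10000
    t = np.arange(0, 0.1, 1.0 / fs)
    tone = np.sin(2.0 * np.pi * 1000.0 * t)  # synthetic test signal, not speech
    freqs, spec = averaged_spectrum_db(tone, fs, start_s=0.02)
    print(freqs[np.argmax(spec)])
```

Because the frames cover more than one glottal period, the averaged spectrum is less sensitive to where the first window falls relative to glottal closure, which is the convenience the text describes.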
Figure 3-2: Spectrogram of luck for male speaker (256-pt DFT, 6.4-ms Hamming window every 1 ms; time axis 0 to 1000 ms)
3.3.2
Cluster
Several measurements on utterances with clusters were made at different points in the
waveform, although not all of them are reported here. A 6.4 ms Hamming window
was centered on the initial burst produced by the stop consonant, and the frequencies
and amplitudes of the low frequency peak and high frequency peak (greater than
2.5 kHz for men and 3 kHz for women) were measured. If the stop consonant was
voiced, additional measurements of the frequencies and amplitudes of the low and
high frequency peaks were made 20 ms after the burst and 20 ms prior to voicing
onset with the Hamming window, and averaging for 12 ms. Measurements of the
first three formant peaks and amplitudes were taken 20 ms after the release of the
liquid consonant or 20 ms after voicing onset (20 ms into the vowel if there was no
liquid consonant in the utterance) for voiced and voiceless consonants using the 6.4
ms Hamming window and averaging for 12 ms. If the liquid consonant was sustained
for more than 20 ms by the speaker, an additional measurement was made 20 ms
prior to the release of the liquid using the 6.4 ms Hamming window averaging for 12
ms.
The spectrogram of a voiced labial cluster utterance, bleed, for a male speaker is
shown in Figure 3-3 and the spectrogram of a voiceless labial cluster utterance, plead,
is shown in Figure 3-4. The lateral is not sustained in the voiceless consonant cluster:
it appears that the lateral is being released as voicing onset occurs.
3.4
Singleton Analysis
Figures 3-5 and 3-6 show the changes in frequencies and amplitudes of the first three formants from the time 20 ms before the release to 20 ms after the release of the consonant. Clearly, the vowel has higher frequencies and amplitudes than the lateral consonant. The changes in frequencies depend on the vowel that follows. Although the change in F1 is small for both leap and loot, which have high vowels, these two words have the greatest average change in amplitude, A1. This change in amplitude is much greater than what would be expected based on a small change in F1. With greater changes in amplitudes occurring than what is expected based on the shifts in formant frequencies, other changes in the source must occur. Possibly the formant bandwidths and the glottal source change, causing a more abrupt transition and enhancing the effects resulting from changes in frequency. This enhancement provides strong perceptual cues that distinguish the lateral consonant from, for example, the glide /w/.
Figure 3-3: Spectrogram of voiced cluster utterance, bleed, for male speaker (256-pt DFT, 6.4-ms Hamming window every 1 ms; time axis 0 to 1000 ms)
Figure 3-4: Spectrogram of voiceless cluster utterance, plead, for male speaker (256-pt DFT, 6.4-ms Hamming window every 1 ms; time axis 0 to 1000 ms)
Figure 3-5: Changes in amplitudes of formants from lateral to vowel for singleton utterances (leap, let, lap, law, luck, loot): error bars represent standard deviation of data
word   ΔA1    ΔA2    ΔA3    ΔF1     ΔF2     sd a1  sd a2  sd a3  sd f1  sd f2
leap   9.4    9.0    19.5   15.9    875.9   2.0    3.2    2.0    14.4   108
let    6.4    13.8   17.0   237.1   461.2   3.8    3.8    7      13.8   105
lap    3.9    15.7   20.3   213.9   321.1   2.0    3.0    2.2    29.4   47.8
law    3.7    10.3   8.2    318.7   358.9   4.2    4.4    3.8    49.6   51
luck   4.8    13.5   14.2   248.8   119.6   3.6    4.2    3.2    40.6   50.8
loot   12.7   11.9   15.4   36.1    272.8   3.2    5.0    4.6    30.2   60.6
Table 3.3: Δ's of formant frequencies and amplitudes for singleton utterances; amplitude changes are in dB, frequencies are in Hz, and standard deviations (sd) across speakers and repetitions are given in the right half of the table
Figure 3-6: Changes in formant frequencies from lateral to vowel for singleton utterances (leap, let, lap, law, luck, loot): error bars represent standard deviation of data
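The relation noted for the singleton data, a small F1 change accompanied by a large A1 change, can be read directly from the ΔA1 and ΔF1 columns of Table 3.3. The short sketch below ranks the six words by each quantity; the values are copied from the table, while the variable names are illustrative only.

```python
# Word: (change in A1 in dB, change in F1 in Hz), values from Table 3.3.
delta = {
    "leap": (9.4, 15.9),
    "let": (6.4, 237.1),
    "lap": (3.9, 213.9),
    "law": (3.7, 318.7),
    "luck": (4.8, 248.8),
    "loot": (12.7, 36.1),
}

# Words ordered from smallest to largest change in F1.
by_f1 = sorted(delta, key=lambda w: delta[w][1])
# Words ordered from largest to smallest change in A1.
by_a1 = sorted(delta, key=lambda w: delta[w][0], reverse=True)

print(by_f1[:2])  # ['leap', 'loot']
print(by_a1[:2])  # ['loot', 'leap']
```

The two words with the smallest F1 change are the same two with the largest A1 change, which is the observation the text uses to argue for additional source or bandwidth changes at the release.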
3.5
Analysis of Utterances with Consonant Clusters
Figures 3-7 and 3-8 show the changes in frequencies and amplitudes of the first three formants from the point 20 ms before the release to 20 ms after the release of the consonant. The data include voiced stop consonants and voiceless stop consonants with a duration of at least 20 ms of the liquid. The lateral consonant has greater changes in both amplitudes and frequencies than the retroflex consonant for both vowels, especially changes in the amplitude of the third formant. The change in F1 is greater for /la/ than for /li/, but the change in F2 is greater for /li/ than for /la/.
Figure 3-7: Changes in amplitudes of formants (ΔA1, ΔA2, ΔA3) from liquid to vowel for voiced and some voiceless stop cluster utterances: error bars represent standard deviation of data
Figure 3-8: Changes in formant frequencies from liquid to vowel for voiced and some voiceless stop cluster utterances: error bars represent standard deviation of data
word   ΔA1   ΔA2   ΔA3    ΔF1     ΔF2     sd a1  sd a2  sd a3  sd f1  sd f2
la     0.8   7.5   8.8    242.4   149.7   3.4    9.4    8.2    167    255
li     3.9   2.6   12.0   -0.4    684.7   3.8    7.0    11.6   37.0   558
ra     0.6   3.9   4.8    114.4   41.8    5.8    6.8    14.8   182    148
ri     0.0   1.5   7.2    -3.3    490.4   5.2    9.8    14.0   53.2   274
Table 3.4: Δ's of formant frequencies and amplitudes for voiced and some voiceless stop cluster utterances; amplitude changes are in dB, frequencies are in Hz, and standard deviations (sd) across speakers and repetitions are given in the right half of the table
3.6
Comparisons of Environments
Figure 3-9: Comparison of Δ's of amplitudes of formants (ΔA1, ΔA2, ΔA3) from lateral to vowel for singleton and cluster utterances: error bars represent standard deviation of data
Figures 3-9 and 3-10 compare the data for the laterals as singletons and in clusters.
The figures show the changes in frequencies and amplitudes of the first three formants
from 20 ms into the vowel to 20 ms prior to the release of the lateral consonant. The
data used for the comparisons of cluster and singleton lateral consonant changes are
from the same six speakers, three female and three male. The changes in amplitudes and frequencies for the singleton lateral consonant are greater than changes in
amplitudes and frequencies for the cluster lateral consonants.
Figure 3-10: Comparison of Δ's of formant frequencies (ΔF1, ΔF2) from lateral to vowel for singleton and cluster utterances (cluster /a/, single /a/, cluster /i/, single /i/): error bars represent standard deviation of data
measurement   cluster /a/   single /a/   cluster /i/   single /i/
ΔA1           0.8           3.7          3.9           9.4
ΔA2           7.5           10.3         2.6           9.0
ΔA3           8.8           8.2          12.0          19.5
ΔF1           242           319          -0.4          15.9
ΔF2           150           359          685           876
sd a1         1.6           2.0          1.8           4.2
sd a2         4.8           3.0          3.4           4.4
sd a3         4.0           2.2          5.8           3.8
sd f1         88.2          29.2         18.6          49.4
sd f2         128           47.8         229           51
Table 3.5: Comparison of singleton and cluster utterances for same speakers; amplitude changes are in dB, frequencies are in Hz, and standard deviations (sd) across speakers and repetitions are given in the lower half of the table
Figure 3-11: Measured values of formant frequencies during lateral consonant, plotted against frequency (Hz): error bars represent standard deviation of data
              F1     F2     F3
cluster /a/   482    1022   2694
single /a/    397    1019   2885
cluster /i/   398    1244   2450
single /i/    398    1310   2722
Table 3.6: Measured formant frequency values in Hz of the lateral for singleton and cluster utterances
The averaged measured values of the formant frequencies during the lateral are
plotted against the changes in amplitudes in Figure 3-11 for singleton and cluster
utterances. The measurements used to determine the formant frequencies were made
20 ms prior to the release of the lateral. The values are also listed in Table 3.6.
3.7
Conclusions
3.7.1
Abruptness
The lateral is produced with a relatively open vocal tract configuration (compared to other consonants). Possibly, the lateral must increase the abruptness of the release into the vowel when it is a singleton to enhance the consonantal quality. The changes in amplitudes of the formants are greater for singleton laterals than for cluster laterals for all formants. The changes in formant frequencies are greater for singleton laterals than for cluster laterals. The total changes in formant frequencies depend on the vowel the lateral is being released into. It appears that when the change in formant frequency is small, the change in amplitude of that formant increases. The lateral consonant creates abruptness. The change in frequency is limited by the following vowel. When the formant frequency transition is minimal, the amplitude change is increased by some alteration of the glottal source and an increase of acoustic losses to create the consonantal quality. Cluster laterals do not necessarily need to enhance the consonantal quality since the stop consonant already creates abruptness in the spectra.
Chapter 4
Synthesis
This chapter describes the synthesis of the lateral using the model and the measured
values as a guide. Previous attempts at synthesis have not incorporated the observed
phenomena of bandwidth changes, additional spectral tilt, or pole-zero pairs, using
instead only formant transitions and some alteration of the amplitude of voicing[11].
The method of the synthesis is described for a lateral using these additional parameters in conjunction with formant transitions. These synthesized words are then used for perceptual experiments to determine the importance of the various synthesis parameters. Incorporating these changes makes natural synthesis of laterals possible despite the extreme spectral variability observed and expected.
4.1
Abruptness/Consonantal Quality
4.1.1
Changes in Frequency
The lateral has a low F1 and F2 which transition rapidly at the release into the vowel.
As previous attempts at synthesis have shown, these transitions are essential for the
accurate synthesis of a natural lateral. In singleton laterals, the transitions alone do
not create enough abruptness, and changes in the voicing amplitude must be added.
4.1.2
Changes in Amplitude
The model and gathered data for the lateral exhibit variability of amplitudes and in the positioning of poles and zeros, making it nearly impossible to make any generalizations. By switching the focus from particular locations of spectral minima or maxima to the general effects occurring during the lateral, a different approach to synthesis can be developed. Since the American English lateral has no minimal contrasting pair, more latitude in frequency characteristics is possible. Past research has shown that changes in the glottal source occur during the lateral consonant[2, 1]. The simple lossless model proposed in Chapter 2 suggests the formation of a pole-zero pair, but the location is variable depending on the lengths and cross-sectional areas of the side channels. Modeling and measured values suggest that glottal source changes are manifested in changes in amplitude of voicing (AV), spectral tilt (TL), bandwidth changes (BW), and a pole and zero. Changes are expected in the glottal source, but the exact mechanisms of the additional loss are not necessarily important for each individual utterance.
4.2
Method
The synthesis of the lateral utterances was performed using the Klatt synthesizer. Six
utterances, luck, let, lap, law, leap, and loot, spoken in isolation by a male speaker,
were used as the basis of the synthesis. The final stop consonant was spliced off (except
for law, which is a consonant-vowel (CV) word) and only the /l/ and vowel were synthesized.
The final version of each utterance is the synthesized portion
concatenated with the stop consonant. For each utterance, two synthesized versions
were created for evaluation. The first version consisted of only the appropriate formant locations and transitions, with constant parameters for voicing quality. Changes
in amplitude, changes in spectral tilt, bandwidth changes, and an additional pole and
zero were added to the formant transitions to produce the second synthesized version.
The synthesis method, including the time-varying parameters used for the word loot,
is detailed below. Spectrograms of the three versions of the word (the natural
utterance, the first synthesized version including formant transitions only, and the
second synthesized version including additional acoustic losses and formant transitions)
are shown in Figures 4-1, 4-2, and 4-3. The original utterance was evaluated using
a program (lspecto) that estimates the formant locations and amplitude of voicing.
Using these values as a template, the first synthesized version was created. The correct
formant frequencies were determined by comparing the natural utterance with the
synthesized version using a 6.4 ms Hamming window, averaging over 14 ms and placing
the window for measurements every 15 ms. Example spectra from this measurement
technique are shown in Figures 4-4, 4-5, and 4-6 for the three versions of the utterance.
Additionally, a 25.6 ms Hamming window placed every 20 ms was used to match the
formant peaks with the correct harmonics during the lateral and vowel. This process
allowed the formant frequency peaks to be matched after several iterations. For the
first synthesized version, the other parameters of the synthesizer, such as the open
quotient (OQ) and spectral tilt (TL), were set as constants chosen to maximize the
naturalness of the utterance for the speaker.
[Figures 4-1 to 4-3: spectrograms, 256-pt DFT, smart AGC, 6.4-ms Hamming window every 1 ms; horizontal axis TIME (ms), 0 to 1000]
Figure 4-1: Spectrogram of natural utterance loot for male speaker
Figure 4-2: Spectrogram of first synthesized utterance of loot for male speaker
Figure 4-3: Spectrogram of second synthesized utterance of loot for male speaker
Figure 4-4: Spectra of natural utterance during /l/ using a 6.4 ms Hamming window
Figure 4-5: Spectra of first synthesized utterance during /l/ using a 6.4 ms Hamming window
Figure 4-6: Spectra of second synthesized utterance during /l/ using a 6.4 ms Hamming window
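The short-window measurement described above can be sketched in a few lines of Python. This is only an illustration, not the lspecto or KLSPEC93 implementation: the 10 kHz sampling rate is an assumption (it makes a 25.6 ms window equal to 256 samples), and "averaging over 14 ms" is interpreted here as averaging the magnitude spectra of 6.4 ms windows whose start times are spread over a 14 ms span. All function names are hypothetical.

```python
import cmath
import math

FS = 10_000  # sampling rate in Hz (assumption, not stated in this chapter)


def hamming(n):
    """Return an n-point Hamming window."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def magnitude_spectrum(frame, nfft=256):
    """Magnitude of the nfft-point DFT of a short frame, bins 0..nfft/2.

    Zero-padding is implicit: samples beyond the frame contribute nothing.
    """
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * n / nfft)
                for n, x in enumerate(frame)))
        for k in range(nfft // 2 + 1)
    ]


def averaged_spectrum(signal, center_ms, win_ms=6.4, span_ms=14.0,
                      hop_ms=1.0, nfft=256):
    """Average magnitude spectra of windows whose starts span span_ms around center_ms."""
    win_len = int(round(win_ms * FS / 1000))
    window = hamming(win_len)
    first_ms = center_ms - span_ms / 2
    n_frames = int(span_ms / hop_ms) + 1
    total = [0.0] * (nfft // 2 + 1)
    for i in range(n_frames):
        start = int(round((first_ms + i * hop_ms) * FS / 1000))
        frame = [s * w for s, w in zip(signal[start:start + win_len], window)]
        for k, m in enumerate(magnitude_spectrum(frame, nfft)):
            total[k] += m
    return [t / n_frames for t in total]


def peak_frequency(spectrum, nfft=256):
    """Frequency in Hz of the largest bin in a magnitude spectrum."""
    k = max(range(len(spectrum)), key=spectrum.__getitem__)
    return k * FS / nfft


# Demonstration on a 100 ms, 1 kHz test tone (not thesis data).
tone = [math.sin(2 * math.pi * 1000 * n / FS) for n in range(1000)]
print(peak_frequency(averaged_spectrum(tone, center_ms=50.0)))
```

With a 256-point DFT at 10 kHz the bins are about 39 Hz apart, so a peak read from such a spectrum is only accurate to the nearest bin; the longer 25.6 ms window mentioned above resolves individual harmonics instead.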
The final formant transitions used for both versions of the synthesized utterances
are plotted in Figure 4-7. The amplitude of voicing and the amplitude of
aspiration are held constant throughout the utterance, with the aspiration amplitude
15 dB below the amplitude of voicing. The constant parameters used in the first
synthesized version to obtain appropriate voicing quality are listed in Table 4.1.
[Plot: formant frequency tracks versus Time (ms), 0 to 250 ms]
Figure 4-7: Formant trajectories for the synthesized word loot
Parameter   GV   GH   OQ   TL   B1   B2    B3    B4    B5
Value       64   64   55   0    45   200   350   125   150
Table 4.1: Constant synthesis parameters of utterance incorporating formant frequency transitions
The second synthesized version of the utterance is created by adding a pole-zero
pair, changing bandwidths, and altering the spectral tilt. Gathered data suggest that the
AV parameter increases as the /l/ is released. The observed voicing changes are due
to changes in the AV parameter in addition to other mechanisms of acoustic loss,
including changes in bandwidths and the formation of pole-zero pairs. Changes in
the AV parameter are added first, using the extrapolated voicing values from the
original utterance. The amplitude of aspiration is also varied to match the voicing
changes and maintain the 15 dB difference for the appropriate voicing quality. The
values of the voicing parameters, including the spectral tilt, which is discussed below,
are shown in Figure 4-8.
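As a rough illustration of how the aspiration amplitude can follow a time-varying voicing amplitude at a fixed 15 dB offset, the sketch below interpolates a voicing track from a few breakpoints and derives the aspiration track from it. The breakpoint values and the 5 ms frame spacing are hypothetical and are not the values plotted in Figure 4-8; the function name is also an assumption.

```python
def interpolate_track(breakpoints, frame_ms=5, end_ms=250):
    """Linearly interpolate (time_ms, value) breakpoints onto a fixed frame grid."""
    track = []
    for t in range(0, end_ms + 1, frame_ms):
        if t <= breakpoints[0][0]:
            # Hold the first value before the first breakpoint.
            track.append(float(breakpoints[0][1]))
        elif t >= breakpoints[-1][0]:
            # Hold the last value after the last breakpoint.
            track.append(float(breakpoints[-1][1]))
        else:
            for (t0, v0), (t1, v1) in zip(breakpoints, breakpoints[1:]):
                if t0 <= t <= t1:
                    track.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
                    break
    return track


# Hypothetical AV breakpoints in dB: a lower level during the lateral and a
# rapid rise around an assumed release near 80 to 90 ms.
AV_BREAKPOINTS = [(0, 50), (80, 52), (90, 60), (250, 60)]
OFFSET_DB = 15  # aspiration is kept 15 dB below voicing, as stated in the text

av = interpolate_track(AV_BREAKPOINTS)
ah = [v - OFFSET_DB for v in av]
```

Deriving the aspiration track from the voicing track, rather than specifying it separately, keeps the 15 dB difference intact whenever the voicing breakpoints are revised.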
A pole and zero are then added during the lateral, and these combine rapidly at the
release into the vowel. Modeling suggests that the pole and zero, while occurring in a
pair, do not always occur in the same order. By examining the spectra of the natural
utterance, the order of the pole and zero is determined as well as the bandwidths.
The locations of the pole and zero over time are plotted in Figure 4-9.
Following the pole and zero placement, the bandwidths of the formants and the
spectral tilt are adjusted. If a zero-pole pair occurs, the spectral effect at high frequencies is an overall increase in amplitude, whereas a pole-zero pair causes an attenuation
of the higher frequencies. When a zero-pole pair occurs, the entire high-frequency spectrum is raised, but by increasing the spectral tilt, the appropriate bandwidths and
prominences can be matched. When a pole-zero pair occurs, less spectral tilt needs to
be added since the overall effect of the pair in that order is to decrease the amplitude of the higher frequencies. From the ordering of the pair, the spectral tilt and
bandwidths are determined by matching the natural utterance with the synthesized
utterance. The varying bandwidths of the formants are shown in Figure 4-10. In this case, the bandwidth of the second formant is actually increased during the
vowel to match the natural utterance. During the lateral the second formant had a
small bandwidth and a more prominent peak than exhibited in other utterances. The
constant parameters for the second synthesized utterance are shown in Table 4.2.
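The effect of the ordering of the pair on the high frequencies can be checked with a small calculation. The sketch below assumes an idealized analog (s-plane) resonator and antiresonator, each normalized to unity gain at 0 Hz; the Klatt synthesizer itself uses digital resonators, so this is only an approximation of its behavior. The frequencies and the 100 Hz bandwidths are hypothetical.

```python
import math


def pair_gain_db(f_hz, pole_hz, zero_hz, pole_bw_hz=100.0, zero_bw_hz=100.0):
    """Gain in dB at f_hz of one pole pair plus one zero pair (0 dB at 0 Hz)."""
    s = complex(0.0, 2 * math.pi * f_hz)
    p = complex(-math.pi * pole_bw_hz, 2 * math.pi * pole_hz)
    z = complex(-math.pi * zero_bw_hz, 2 * math.pi * zero_hz)
    resonator = abs(p) ** 2 / ((s - p) * (s - p.conjugate()))
    antiresonator = (s - z) * (s - z.conjugate()) / abs(z) ** 2
    return 20 * math.log10(abs(resonator * antiresonator))


# Pole below zero (pole-zero order): high frequencies are attenuated.
pole_zero = pair_gain_db(4500.0, pole_hz=1500.0, zero_hz=2000.0)
# Zero below pole (zero-pole order): high frequencies are boosted.
zero_pole = pair_gain_db(4500.0, pole_hz=2000.0, zero_hz=1500.0)
print(pole_zero, zero_pole)
```

In this idealized form the gain well above both frequencies approaches 40 log10(pole/zero) dB, so the size of the high-frequency offset that the spectral tilt has to compensate depends on the spacing of the pair as well as on its order.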
[Plot: AV, TL, and AH versus Time (ms), 0 to 250 ms]
Figure 4-8: Time varying voicing changes in synthesized utterance
[Plot: frequency of the additional pole and zero versus Time (ms)]
Figure 4-9: Time varying additional pole and zero in synthesized utterance
[Plot: formant bandwidths versus Time (ms), 0 to 250 ms]
Figure 4-10: Time varying bandwidth changes in synthesized utterance
Parameter   GV   GH   OQ   B5
Value       64   64   55   100
Table 4.2: Constant synthesis parameters of utterance incorporating voicing changes
[Figures 4-11 to 4-13: KLSPEC93 spectra, 25.6 ms window; horizontal axis FREQ (kHz)]
Figure 4-11: Spectra of natural utterance during /l/ using 25.6 ms Hamming window
Figure 4-12: Spectra of first synthesized utterance during /l/ using 25.6 ms Hamming window
Figure 4-13: Spectra of second synthesized utterance during /l/ using 25.6 ms Hamming window
Spectra for the three versions during the lateral are shown in Figures 4-11, 4-12,
and 4-13 using a 25.6 ms Hamming window with no averaging. The minimum occurring at approximately 1800 Hz is apparent in the spectra of the natural utterance and
the second synthesized utterance. The first synthesized utterance instead has an
amplitude about 20 dB greater in this frequency range. Also, a better match in the
higher frequency bandwidths and relative amplitudes is obtained in the second synthesized version of the utterance. To obtain approximately the correct high frequency
amplitudes in the first synthesized utterance, the bandwidths are increased, thereby
decreasing the prominence of the spectral peaks in those areas.
Chapter 5
Results of Perceptual Testing
This chapter presents the results of the perceptual tests, rating the naturalness of the
synthesized utterances in relation to the natural spoken utterance and with respect
to each other. Once the synthesis work is performed, the new method of including
additional parameters must be evaluated. Two sets of perceptual experiments were
run to evaluate different synthesized utterances. The initial perceptual experiments
guided the second set of experiments to evaluate more accurately the naturalness of
the utterances.
5.1
First Perceptual Experiments
The first set of perceptual experiments was performed with an initial set of synthesized
words. In the initial synthesis work, natural utterances spoken between two other
words were used to guide the selection of parameters. The synthesized words were
presented to listeners in isolation. A disadvantage of this approach was that the
synthesized utterances were shorter than what one would expect for isolated words.
This work also attempted to separate the effects of all of the individual components
contributing to the changes in voicing, including the pole-zero pair, the changes in
bandwidths, and the addition of varying spectral tilt.
Subjects were asked to rate the naturalness of utterances presented
individually on a continuous scale of 0 to 1. The results were not conclusive except
that all the utterances were relatively natural.
Since no discrimination between data points could be statistically shown, the
data was instead used to guide the second set of more sensitive perceptual experiments. Three issues were identified as needing to be improved for the second set of
experiments. To increase the duration of the utterances to be evaluated, additional
utterances were recorded from speakers that included the lateral words spoken in isolation instead of within a phrase as before. The method of testing was also changed to
involve a comparison of two utterances instead of evaluating each utterance in isolation. Additionally, since no significant difference was apparent in the results of testing
the various parameters contributing to voicing changes individually, the parameters
were grouped as a whole for evaluation.
5.2
Final Testing
The naturalness of the synthesized utterances was evaluated using the three versions of
the six utterances: the natural spoken utterance, the first synthesized version including only formant frequency transitions, and the second synthesized version including
formant frequency transitions and changes of the glottal source and some formant
bandwidths. Five English-speaking subjects participated in the experiments.
For each word, six stimulus pairs were created, consisting of two different versions
of the three utterances in all possible ordered combinations. Five repetitions of each stimulus
were presented, so a total of ten comparisons of each of the utterances with each other
were made. The subjects were given these instructions: "In this listening test, you
will be asked to judge the naturalness of the pairs of utterances. Each stimulus is
composed of two utterances of the same word. When the utterances are presented,
you decide which of the utterances sounds more natural, specifically listening to the
/l/ sound in both utterances. It is okay to guess if you do not hear a difference."
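The stimulus counts in this design can be reproduced with a short enumeration. This is a sketch of the counting only; the labels natural, one, and two follow the column names of Table 5.1, and the variable names are hypothetical.

```python
from collections import Counter
from itertools import permutations

VERSIONS = ("natural", "one", "two")
REPETITIONS = 5

# Every ordered pair of two different versions: 3 * 2 = 6 stimulus pairs per word.
ordered_pairs = list(permutations(VERSIONS, 2))

# Each stimulus pair is presented five times: 30 trials per word.
trials = [pair for pair in ordered_pairs for _ in range(REPETITIONS)]

# Ignoring presentation order, each pairing of versions is compared ten times.
comparisons = Counter(frozenset(pair) for pair in trials)

print(len(ordered_pairs), len(trials), sorted(comparisons.values()))
```

Presenting both orders of every pair is what allows the order effect to be examined separately from the overall preference.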
The results are presented in Table 5.1 and plotted in Figure 5-1. The percentage
ratings are determined by adding the number of times each utterance was rated more
natural than the other regardless of order. Examination of the data suggests that
order does not alter the conclusions made here about which utterances are more
natural than others. The second synthesized version is rated more natural than the
first synthesized version an average of 67% of the time. For all the utterances, the
synthesized version incorporating the changes in voicing is rated more natural when
compared with the first synthesized version. Additionally, the second synthesized
version has an equal or higher rating of naturalness than the first synthesized version
when each is compared with the natural utterance. The highest naturalness ratings
when comparing the two synthesized versions are obtained for loot, lap, and luck (all
rated more natural than the simpler version more than 70% of the time). The total change in
formant frequencies, ΔF1 and ΔF2, for the utterances (with the exception of law)
is compared with the perceptual results of the second synthesized version and the
first synthesized version in Table 5.2 and Figure 5-2. As the total change in F1
and F2 increased, the effect of incorporating additional parameters decreased. This
would suggest that when a large abruptness occurs due to the change in formant
frequencies, additional parameters are not as essential to increasing the naturalness
of the utterance. However, even with the large abruptness in formant frequencies,
as in the case of leap, the naturalness of the utterance is increased by varying the
amplitude of voicing, bandwidths, spectral tilt, and the addition of a pole-zero pair.
word    natural-one   one-natural   natural-two   two-natural   one-two   two-one
let     76%           24%           68%           32%           40%       60%
luck    78%           22%           64%           36%           28%       72%
leap    62%           38%           62%           38%           40%       60%
lap     66%           34%           56%           44%           26%       74%
loot    78%           22%           58%           42%           24%       76%
law     72%           28%           62%           38%           38%       62%
total   72%           28%           62%           38%           33%       67%
Table 5.1: Results of perceptual experiments: % first utterance rated more natural
by listeners
[Bar chart: ratings from 0% to 80% for total, let, luck, leap, lap, loot, and law, grouped as one over natural, two over natural, and two over one]
Figure 5-1: Results of perceptual experiments: % first utterance rated more natural
by listeners
word    two-one rating   ΔF1 + ΔF2 (Hz)   ΔF1 (Hz)   ΔF2 (Hz)
let     60%              527.0            195        332
luck    72%              254.0            176        78
leap    60%              977.0            39         938
lap     74%              507.0            214        293
loot    76%              273.0            19         254
law     62%              97.0             137        -40
Table 5.2: Total change in frequency between lateral and vowel for synthesized utterances
[Scatter plot: Total Change in Formants (Hz), 0 to 1000, versus rating, 50% to 80%]
Figure 5-2: Total change in F1 and F2 between lateral and vowel vs. % times second
synthesized utterances rated more natural than first synthesized utterance
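The trend described above can be checked numerically from the values in Table 5.2. The sketch below recomputes the total formant change for each word and a Pearson correlation between that total and the two-one rating, leaving out law as in the text. The correlation is an illustrative summary added here, not a statistic reported in the thesis, and with only five words it is purely descriptive.

```python
import math

# word: (two-one rating in %, change in F1 in Hz, change in F2 in Hz), from Table 5.2
TABLE_5_2 = {
    "let": (60, 195, 332),
    "luck": (72, 176, 78),
    "leap": (60, 39, 938),
    "lap": (74, 214, 293),
    "loot": (76, 19, 254),
    "law": (62, 137, -40),
}


def total_change(word):
    """Sum of the F1 and F2 changes for one word, in Hz."""
    _, d_f1, d_f2 = TABLE_5_2[word]
    return d_f1 + d_f2


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


words = [w for w in TABLE_5_2 if w != "law"]  # law is excluded, as in the text
totals = [total_change(w) for w in words]
ratings = [TABLE_5_2[w][0] for w in words]
r = pearson(totals, ratings)
print(totals, r)
```

A negative coefficient is consistent with the statement that the benefit of the additional parameters shrinks as the total change in F1 and F2 grows.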
Chapter 6
Conclusions
The American English lateral consonant is prone to considerable variation depending
on the individual speaker and on the phonetic context, and this variability makes it
more difficult to characterize than some other consonants. The first two formants
are rather low and barely separated, while the third formant is generally higher in
frequency than that for vowels. Characterization of this elusive consonant and the
identification of the key perceptual cues are necessary for applications such as speech
synthesis. Modeling, further measurements on gathered data, synthesis, and evaluation of the synthesis work with perceptual experiments are
performed in this thesis for the lateral consonant in order to begin a complete evaluation of the sound.
Source-filter modeling of laterals can help to identify the various acoustic characteristics important for these sounds. The vocal tract during the lateral consonant
can be modeled as a tube with constrictions and side branches. The production of
the lateral /l/ with an alveolar point of articulation creates an interior cavity formed
by the tongue blade. An additional cavity is created under the tongue which couples
with the back cavity, creating poles and zeros during the lateral. The side branches
around the tongue affect the high frequency components of the lateral and account
for effects on the spectrum that can be attributed to pole-zero pairs. Individual differences in the exact locations of the pole-zero pairs are expected since the lengths
and cross-sectional areas of the side branches are so variable among speakers. The locations of the poles are due to the interaction of the entire system, while the locations
of the zeros are due to the configuration of the lateral side channels only.
A database of utterances was recorded by six different speakers. The utterances
contained prevocalic /l/ followed by six different vowels, and prevocalic /l/ in stop
consonant clusters followed by two different vowels. Acoustic analysis of these utterances examined attributes of the sound that provided information about back
reactions on the glottal source during the lateral, pole-zero pairs, and increased bandwidths. Measurements of the formant frequencies and amplitudes during the lateral
and the vowel show that the changes in amplitude, ΔA1 and ΔA2, are greater
for the singleton lateral than for the cluster lateral. The increase in A1 and A2
cannot be accounted for by the changes in formant frequencies alone during the release.
The additional increase in amplitudes suggests that changes in the source, the bandwidths, and pole-zero pairs are occurring in addition to the transition of the formant
frequencies.
The theoretical model of the lateral, together with data from the measurements,
was used to guide the synthesis of two words using the Klatt synthesizer [8]. The
Klatt synthesizer parameters that were manipulated were formant changes, TL (spectral tilt),
BW (bandwidths of formants), AV (amplitude of voicing), and additional poles and zeros. Six utterances containing singleton prevocalic /l/ followed by six different vowels
were synthesized for use in perceptual experiments. The first synthesized utterance
was created by altering only the formant frequencies at the lateral release. The second utterance included changes in bandwidths, the spectral tilt, and the addition of
a pole-zero pair as well as the changes in the formant frequencies. These two synthesized utterances were used with the natural utterance to determine the naturalness
of the synthesis. The utterances were presented in pairs and the listeners were asked
to judge which utterance in the pair sounded more natural. For all utterances, the
second synthesized utterance, containing additional changes in voicing, was judged
to be more natural, on average, than the simple synthesized utterance. The second
synthesized version was also judged more natural than the first synthesized utterance
when each was compared against the natural utterance.
The naturalness of the synthesized utterance approaches that of the original utterance
with the rapid transition of the formants, changes in the amplitude of voicing,
changes in the bandwidths of the formants, and the addition of a pole-zero pair.
As the modeling suggests, the lateral is inherently variable in high frequency content
because of the geometry of the vocal tract configuration. The key acoustic characteristic is not related to the exact placement of the high frequency formants or their
amplitudes. Instead, these results support the conclusion that the key acoustic characteristic of the pre-vocalic lateral /l/ is an abruptness at the release of the lateral
into the vowel, and a good-quality, natural-sounding synthesized lateral must include
this abruptness. This abruptness is created in part by rapid changes in the first two
formant frequencies and in part by rapid changes in spectrum amplitude over the
speech frequency range.
In speech recognition, the analysis should be capable of resolving these rapid
changes in order to distinguish /l/ from glides. In synthesis, these changes must be
present if the stimuli are to sound natural and are to be discriminated from other
sonorant consonants. Therapy to improve a speaker's production of /l/ should emphasize the need for creating this abruptness in the appropriate frequency ranges.
6.1
Further Research
Further research to improve the understanding of the lateral is needed. The creation
of a model which incorporates acoustic losses, including varying cross-sectional area
functions as discussed by Maeda [10], would help determine how the acoustic losses occur and whether they are speaker dependent or context dependent. Modeling which
also incorporates additional geometries of the vocal tract during lateral production
should be developed. The models should include configurations in which two side
channels are formed, but only one is connected to both the front and back cavities
and the other connects only to the front cavity.
Further examination of the pre-vocalic cluster laterals and of the differences between
them and the singleton laterals would also be helpful. Perceptual testing of synthesis work on
cluster laterals could begin to unravel the phonetic context mystery of the lateral. Key
perceptual cues for the lateral should cross the boundaries of the phonetic context,
and the model of the lateral could be improved by including this information.
In English, the lateral has more latitude in frequency content because there is no
minimal contrasting sound. Examining a language with contrasting lateral sounds,
such as Italian or Spanish, could provide more insight into the perceptual cues for
the lateral.
Appendix A
Matlab code for Modeling
This is the code used to solve the equations for the modeling section of my thesis.
A.1
Solve Lateral Channel Equations
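% Pin3 = 0, solving for Uout
% Volume velocity at the input of each side channel (lengths l21 and l22)
e2='Uin21=cos(-k*l21)*Uout21'
e3='Uin22=cos(-k*l22)*Uout22'
% Both side channels share the same input pressure Pout1
e4='Pout1=(-j*rho*c/A21)*sin(-k*l21)*Uout21'
e6='Pout1=(-j*rho*c/A22)*sin(-k*l22)*Uout22'
[Uout21,Uout22,Pout1,Uin21]=solve(e3,e2,e4,e6,'Uout21,Uout22,Pout1,Uin21')
% Total volume velocity into and out of the two side channels
Uin=symadd(Uin21,'Uin22')
Uout=symadd(Uout21,Uout22)
TF=symdiv(Uout,Uin)
TF=simplify(TF)
TF=simple(TF)
% Resulting transfer function of the side channels
TFside=(A21*sin(k*l22)+sin(k*l21)*A22)./ ...
    (cos(k*l21).*A21.*sin(k*l22)+sin(k*l21).*A22.*cos(k*l22))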
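The side-channel transfer function above can also be evaluated numerically. The Python sketch below is an independent check, not part of the original MATLAB code. The channel lengths, the areas, and the speed of sound (35400 cm/s) are assumed values chosen for illustration. For equal areas, the sum-to-product identities reduce the expression to cos(k(l22 - l21)/2) / cos(k(l21 + l22)/2), which places a zero at c / (2 |l22 - l21|) and a pole at c / (2 (l21 + l22)); this reduction is a derivation added here for the check.

```python
import math

C_CM_PER_S = 35400.0  # assumed speed of sound in cm/s


def tf_side(f_hz, l21, l22, a21, a22, c=C_CM_PER_S):
    """Evaluate the side-channel transfer function TFside at frequency f_hz."""
    k = 2 * math.pi * f_hz / c
    num = a21 * math.sin(k * l22) + math.sin(k * l21) * a22
    den = (math.cos(k * l21) * a21 * math.sin(k * l22)
           + math.sin(k * l21) * a22 * math.cos(k * l22))
    return num / den


def tf_side_equal_areas(f_hz, l21, l22, c=C_CM_PER_S):
    """Closed form of TFside when both side channels have the same area."""
    k = 2 * math.pi * f_hz / c
    return math.cos(k * (l22 - l21) / 2) / math.cos(k * (l21 + l22) / 2)


# Hypothetical side-channel lengths (cm) and equal areas (cm^2).
L21, L22, AREA = 2.0, 5.0, 0.5
zero_hz = C_CM_PER_S / (2 * abs(L22 - L21))
pole_hz = C_CM_PER_S / (2 * (L21 + L22))
print(zero_hz, pole_hz, tf_side(1000.0, L21, L22, AREA, AREA))
```

With these assumed lengths the zero depends only on the difference of the two channel lengths, which illustrates why its location would vary from speaker to speaker.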
A.2
Solve Equations for Lateral Branches when One Tube Disappears
% Pin3 = 0
% Only side channel 21 remains
e2='Uin21=cos(-k*l21)*Uout21'
e4='Pout1=(-j*rho*c/A21)*sin(-k*l21)*Uout21'
[Uout21,Pout1]=solve(e2,e4,'Uout21,Pout1')
TFnos=symdiv(Uout21,'Uin21')
TFnos=simplify(TFnos)
TFnos=simple(TFnos)
A.3
Solve for Entire Model
% Pin3 = 0
% Back cavity (length l1, area A1) feeding both side channels
e1='Uin1=(cos(-k*l1)*(Uin21+Uin22))+((A1/(j*rho*c))*sin(-k*l1)*Pout1)'
e2='Pin1=((-(j*rho*c)/A1)*sin(-k*l1)*(Uin21+Uin22))+(cos(-k*l1)*Pout1)'
% Side channels 21 and 22
e3='Uin21=cos(-k*l21)*Uout21+A21/(j*rho*c)*sin(-k*l21)*Pin3'
e4='Pout1=(-j*rho*c)/A21*sin(-k*l21)*Uout21+cos(-k*l21)*Pin3'
e5='Uin22=cos(-k*l22)*Uout22+A22/(j*rho*c)*sin(-k*l22)*Pin3'
e6='Pout1=(-j*rho*c)/A22*sin(-k*l22)*Uout22+cos(-k*l22)*Pin3'
% Front cavity (length l3, area A3)
e7='Uout21+Uout22=cos(-k*l3)*Uout3'
e8='Pin3=(-j*rho*c/A3)*sin(-k*l3)*Uout3'
[Uout3,Uin21,Uin22,Pout1,Uout22,Pin3,Uout21]= ...
    solve(e1,e3,e4,e5,e6,e7,e8,'Uout3,Uin21,Uin22,Pout1,Uout22,Pin3,Uout21')
% and the answer is...
TF=symdiv(Uout3,'Uin1')
TF=simplify(TF)
TF=simple(TF)
A.4
Whole Model when One Side Branch Disappears
% Pin3 = 0
% Back cavity feeding the single remaining side channel 22
e1='Uin1=(cos(-k*l1)*Uin22)+((A1/(j*rho*c))*sin(-k*l1)*Pout1)'
e2='Pin1=(-j*rho*c)/A1*sin(-k*l1)*Uin22+(cos(-k*l1)*Pout1)'
e3='Uin22=cos(-k*l22)*Uout22+A22/(j*rho*c)*sin(-k*l22)*Pin3'
e4='Pout1=(-j*rho*c)/A22*sin(-k*l22)*Uout22+cos(-k*l22)*Pin3'
% Front cavity
e7='Uout22=cos(-k*l3)*Uout3'
e8='Pin3=(-j*rho*c/A3)*sin(-k*l3)*Uout3'
[Uout3,Uin22,Pout1,Uout22,Pin3]= ...
    solve(e1,e3,e4,e7,e8,'Uout3,Uin22,Pout1,Uout22,Pin3')
TFno=symdiv(Uout3,'Uin1')
TFno=simple(TFno)
TFno=simplify(TFno)
Bibliography
[1] C.A. Bickley and K.N. Stevens. Effects of Vocal-tract Constriction on the Glottal
Source: Experimental and Modeling Studies. Journal of Phonetics, 14:373-382,
1986.
[2] C.A. Bickley and K.N. Stevens. Effects of Vocal Tract Constriction on the Glottal
Source: Data from Voiced Consonants. In T. Baer, C. Sasaki, and K. Harris, editors, Laryngeal Function in Phonation and Respiration, pages 239-253. College
Hill Press, San Diego, 1987.
[3] R.A.W. Bladon. The Production of Laterals: Some Acoustic Properties and their
Psychological Implications. In Current Issues in Linguistic Theory, volume 9 of
Amsterdam Studies in the Theory and History of Linguistic Science IV, pages
501-508. Amsterdam, 1979.
[4] R.M. Dalston. Acoustic Characteristics of English /w,r,l/ Spoken Correctly
by Young Children and Adults. Journal of the Acoustical Society of America,
57:462-469, 1975.
[5] C. Espy-Wilson. An Acoustic-Phonetic Approach to Speech Recognition: Application
to the Semivowels. PhD thesis, Massachusetts Institute of Technology,
1987.
[6] C. Espy-Wilson. Acoustic Measures for Linguistic Features Distinguishing the
Semivowels /wjrl/ in American English. Journal of the Acoustical Society of
America, 92:736-757, 1992.
[7] G. Fant. Acoustic Theory of Speech Production. The Hague, Mouton, 1960.
[8] D.H. Klatt and L. C. Klatt. Analysis, Synthesis, and Perception of Voice Quality
Variations among Female and Male Talkers. Journal of the Acoustical Society of
America, 87:820-857, 1990.
[9] P. Ladefoged and I. Maddieson. The Sounds of the World's Languages. Blackwell
Publishers, Cambridge, MA, 1996.
[10] Shinji Maeda. A Digital Simulation Method of the Vocal-Tract System. Speech
Communication, 1:199-229, 1982.
[11] K. Miyawaki, W. Strange, R. Verbrugge, A.M. Liberman, J.J. Jenkins, and
O. Fujimura. An Effect of Linguistic Experience: The Discrimination of [r] and
[l] by Native Speakers of Japanese and English. Perception and Psychophysics,
18(5):331-340, 1975.
[12] S. Narayanan, A. Alwan, and K. Haker. An Articulatory Study of Liquid Approximants in American English. In ICPhS Proceedings, volume 3, pages 576-579,
1995.
[13] S. Narayanan, A. Alwan, and K. Haker. Toward Articulatory-acoustic Models
for Liquid Approximants Based on MRI and EPG data. Part I. The Laterals.
Journal of the Acoustical Society of America, 101:1064-1078, 1997.
[14] R. Sproat and O. Fujimura. Allophonic Variation in English /l/ and its Implications for Phonetic Implementation. Journal of Phonetics, 21:291-311, 1993.
[15] K.N. Stevens. Acoustic Phonetics. MIT Press, Cambridge, MA, in press.
[16] K.N. Stevens and S.E. Blumstein. Attributes of Lateral Consonants. In Acoustical
Society of America Proceedings, 1994.