Analysis of Acoustic Cues for
Identifying the Consonant /ð/ in Continuous Speech
by
Ying Alisa Cao
Submitted to the Department of Electrical Engineering and Computer Science
In partial fulfillment of the requirements for the degree of
Master of Engineering in Electrical Engineering and Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June 13, 2002
Copyright © 2002 Ying Alisa Cao. All Rights Reserved.
The author hereby grants to M.I.T. permission to reproduce and distribute publicly paper and
electronic copies of this thesis and to grant others the right to do so.
Author ...............
Department of Electrical Engineering and Computer Science
June 13, 2002
Certified by .................
Kenneth N. Stevens
Professor, Research Laboratory for Electronics
Thesis Supervisor
Accepted by .............
Arthur C. Smith
Chairman, Department Committee on Graduate Theses
Analysis of Acoustic Cues for
Identifying the Consonant /ð/ in Continuous Speech
by
Ying Alisa Cao
Submitted to the Department of Electrical Engineering and Computer Science
June 13, 2002
in partial fulfillment of the requirements for the degree of
Master of Engineering in Electrical Engineering and Computer Science
Abstract
This project aims to advance automatic recognition of continuous speech by recognizing
individual phonemes in speech using acoustic cues unique to each phoneme. The project focuses
on the acoustic characteristics of one of the most prevalent phonemes of English: the
fricative consonant /ð/, as in the words the, this, and those. Since previous research has shown that /ð/
can assimilate to its preceding phoneme, characteristics of /ð/ and its close-sounding phonemes, /n/,
/d/, and /v/, are studied in the preceding context of a nasal consonant, a stop consonant, and a vowel,
respectively. By examining characteristics of /ð/ and its counterparts in phrases such as win
those and win nose, the goal is to find context-dependent and invariant cues for identifying /ð/.
Spectrum analysis tools are used to extract important acoustic information such as the formant
frequencies and their changes over time, the energy distribution over frequencies, and the duration
of utterances. In the context of a nasal consonant, F2 and the change in F2 over a fixed interval of
time (DeltaF2) are found to be the best cues: /ð/ has lower F2 and higher DeltaF2 than /n/. In the
context of a stop consonant, F2 and the amplitude difference in the noise burst between the high and low
frequency ranges are the best cues: /ð/ has lower F2 and a more negative amplitude difference than /d/. In
the context of a vowel, F3 and DeltaF3 are found to be the best cues: /ð/ generally has higher F3 and
DeltaF3 than /v/, although these cues are not as reliable as those of the other two contexts. The
formant frequencies are greatly influenced by the speaker's gender and the succeeding vowel, and
they vary among speakers of the same gender. Thus, the more contextual and speaker information
the identification criteria are based on, the more accurate the identification of /ð/ is likely to be.
This correlation suggests that the human auditory system is also likely to rely on contextual information
for the accurate processing of continuous speech.
Thesis Supervisor: Kenneth N. Stevens, Sc.D.
Title: Clarence J. Lebel Professor of Electrical Engineering, Research Laboratory for Electronics
Acknowledgments
My heartfelt gratitude goes to Professor Kenneth Stevens, for giving me this valuable
learning experience to engage in research and scientific writing, for personally guiding me through
the entire process, for his patience and understanding, and for his care and concern for my
well-being.
I would like to thank my friends at MIT who have given me many warm and joyous
memories and who have helped me immensely in all aspects of my life in the past five years.
I also want to thank my parents who have worked hard and sacrificed much to provide me
with the best education and much more.
Lastly, even though this thesis hardly measures up to all that I have received from them, I
would like to dedicate this thesis to my grandparents. They, too, have sacrificed much to give me
the best opportunities, and they gave me a wonderful childhood from which I draw faith and
confidence. Their sincerity, work ethic, simple living, dedication toward family, and service to
society have been and will always be my most cherished inspiration in life.
Table of Contents
I: Introduction
1.1 Motivation
1.2 Background Information
1.3 Preliminary Work
1.4 Research Objectives
II: Methodology
II.1 Overview
II.2 Database
II.3 Parameters Analyzed
II.3.1 Second and Third Formant Frequencies at the Onset of the Succeeding Vowel (F2 and F3)
II.3.2 Change of f2 and f3 Over a Period of 50 ms (DeltaF2 and DeltaF3)
II.3.3 Amplitude Difference (Amp(High-Mid)) and Duration of Burst of /ð/ and /d/
III: Results and Analysis
III.1 Context of Nasal Consonant
III.1.1 Second vs. Third Formant Frequencies (F2 vs. F3)
III.1.2 Second Formant at the Onset of the Succeeding Vowel (F2)
III.1.3 Movement of f2 Over a Period of 50 ms (DeltaF2)
III.1.4 F2 and DeltaF2
III.2 Context of Stop Consonant
III.2.1 Second vs. Third Formant Frequencies (F2 vs. F3)
III.2.2 The Second Formant Frequency at the Onset of the Succeeding Vowel (F2)
III.2.3 Amplitude Difference In The Burst (Amp(High-Mid))
III.2.4 Burst Duration
III.2.5 F2 and Amp(High-Mid)
III.3 Context of Vowel
III.3.1 Second vs. Third Formant Frequencies (F2 vs. F3)
III.3.2 Third Formant Frequency at the Onset of the Succeeding Vowel (F3)
III.3.3 F3 and DeltaF3
IV: Conclusion and Future Work
IV.1 Conclusion
IV.2 Future Work
V. References
VI. Appendices
Appendix A: Sentences Used in the Study
Appendix B: Additional Results
Appendix B.1 Context of Nasal Consonant
Appendix B.2 Context of Vowel
List of Figures
Figure 1-1: a) An example spectrogram, b) An example output of formant tracks extracted by xkl.
Note that this thesis denotes the second and third formant frequencies with lowercase "f", as
"f2" and "f3", and denotes the second and third formant frequencies at the onset of the
succeeding vowel with uppercase "F", as "F2" and "F3."
Figure 1-2: An idealized spectrogram and the important acoustic features measured in the
preliminary study.
Figure 1-3: Three commonly observed patterns in the second and third formant frequencies of /ð/
and /v/.
Figure 1-4: Example of a combination (F3 and -F3) that separates /ð/ and /v/. Notice that a line
representing some differentiation criteria can be drawn and would separate all utterances of /ð/
from those of /v/.
Figure 2-1: A male speaker's waveform of the utterance "V" in "TV" (Sentence #12). The changes
in the waveform at around 224 ms indicate the onset of the vowel.
Figure 2-2: An example spectrum showing auto-picked value of F3 (the dotted vertical line) by xkl.
Figure 2-3: Spectra of the burst of a) /ð/ vs. b) /d/, of a female speaker. Notice that a prominent
peak at around 4.7 kHz is seen in the spectrum of /d/ but not in that of /ð/.
Figure 2-4: Sample waveforms of a) /ð/ and b) /d/. In each waveform, the two vertical lines
represent the points of waveform change, and the time in between the lines is the duration of
the burst.
Figure 3-1: The mid-sagittal view of the vocal tract for /ð/, /n/, and /d/ [5]. The back cavity is more
restricted for /ð/, a difference that leads to a lower F2 for /ð/ than for /n/ and /d/.
Figure 3-2: The average and standard deviations of F2 of /ð/ vs. /n/ in nasal context of female
speakers, given the succeeding vowel /o/. Notice that the F2 ranges of /ð/ and /n/ do not
overlap.
Figure 3-3: The average and standard deviations of F2 of /ð/ vs. /n/ in nasal context of male
speakers, given the succeeding vowel /o/. Notice the slight overlap between the two F2 ranges.
Figure 3-4: F2 and DeltaF2 of /ð/ vs. /n/ in nasal context of female speakers, regardless of the
succeeding vowel. Notice that it is impossible to draw a line such that /ð/ points would be on
one side and /n/ points on the other.
Figure 3-5: F2 and DeltaF2 of /ð/ vs. /n/ in nasal context of male speakers, regardless of the
succeeding vowel. Notice that it is impossible to draw a line such that /ð/ points would be on
one side and /n/ points on the other.
Figure 3-6: F2 and DeltaF2 of /ð/ vs. /n/ in nasal context of female speakers, given succeeding
vowel /i/.
Figure 3-7: F2 and DeltaF2 of /ð/ vs. /n/ in nasal context of individual female speakers, a) F#1, b)
F#2, given succeeding vowel /i/. /ð/ and /n/ are more distinctly separated for each individual
speaker than for both speakers (Figure 3-6).
Figure 3-8: F2 and DeltaF2 of /ð/ vs. /n/ in nasal context of female speakers, given succeeding
vowel /o/. Notice that /ð/ and /n/ occupy different ranges of F2 but not of DeltaF2.
Figure 3-9: F2 and DeltaF2 of /ð/ vs. /n/ in nasal context of individual female speakers, a) F#1, b)
F#2, given succeeding vowel /o/. /ð/ and /n/ are more distinctly separated for each individual
speaker than for both speakers (Figure 3-8).
Figure 3-10: The average and standard deviations of F2 of /ð/ vs. /d/ in stop context of female
speakers, given the succeeding vowel /o/. Notice that the F2 of /ð/ is significantly lower than
that of /d/.
Figure 3-11: The average and standard deviations of F2 of /ð/ vs. /d/ in stop context of male
speakers, given the succeeding vowel /o/. Male speakers show a trend similar to that of female
speakers (Figure 3-10).
Figure 3-12: F2 and amplitude difference of /ð/ vs. /d/ in stop context of female speakers. Notice
that /ð/ and /d/ are clearly separated by both F2 and amplitude difference.
Figure 3-13: F2 and amplitude difference of /ð/ vs. /d/ in stop context of male speakers. Unlike in
the similar plot for female speakers (Figure 3-12), the regions of /ð/ and /d/ overlap slightly for
the male speakers.
Figure 3-14: F2 and amplitude difference of /ð/ vs. /d/ in stop context of individual female speakers
a) F#1, and b) F#2. Notice that regions of /ð/ and /d/ are clearly and consistently separated in
both speakers.
Figure 3-15: F2 and amplitude difference of /ð/ vs. /d/ in stop context of individual male speakers
a) M#1, b) M#2. /ð/ and /d/ are more distinctly separated for each individual speaker than for
both speakers (Figure 3-13).
Figure 3-16: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of a) female speakers, b) male speakers,
given succeeding vowel //.
Figure 3-17: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of a male speaker a) M#1, b) M#2
given succeeding vowel /i/. /ð/ and /v/ are more distinctly separated for each individual
speaker than for both speakers (Figure 3-16b).
Figure 3-18: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of a female speaker a) F#1, b) F#2
given succeeding vowel /i/. /ð/ and /v/ remain difficult to separate even when the speakers are
analyzed individually.
Figure 3-19: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of a male speaker a) M#1, b) M#2 in
two pairs of sentences succeeded by similar vowels (/o/ and //). Notice the large variations in
both F3 and DeltaF3 of /ð/ and /v/.
Figure 3-20: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of a male speaker (M#1) a) in sentences
#3 and #11 (succeeded by vowel /c/) b) in sentences #5 and #16 (succeeded by vowel /D/).
Notice the significant differences in F3 and DeltaF3 of both /ð/ and /v/ between the two plots.
Figure 3-21: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of a male speaker (M#2) a) in sentences
#3 and #11 (succeeded by vowel /c/) b) in sentences #5 and #16 (succeeded by vowel /;/).
Notice that decreases in F3 and DeltaF3 of both /ð/ and /v/ similar to those in Figure 3-20b are
seen in graph b).
Figure A-1: F2 and DeltaF2 of /ð/ vs. /n/ in nasal context of male speakers, given succeeding vowel
/i/. Notice that the regions of /ð/ and /n/ overlap.
Figure A-2: F2 and DeltaF2 of /ð/ vs. /n/ in nasal context of individual male speakers a) M#1, b)
M#2, given succeeding vowel /i/. Notice that /ð/ and /n/ are less separable for these male
speakers than they are for female speakers (Figure 3-7).
Figure A-3: F2 and DeltaF2 of /ð/ vs. /n/ in nasal context of male speakers, given succeeding vowel
/o/.
Figure A-4: F2 and DeltaF2 of /ð/ vs. /n/ in nasal context of individual male speakers a) M#1, b)
M#2, given succeeding vowel /o/. Notice that /ð/ and /n/ become more separable (mostly by
F2) for each individual speaker than for both speakers (Figure A-3), but the separation is not
as distinct as in similar plots for the female speakers (Figure 3-9).
Figure A-5: The average and standard deviation of F3 of /ð/ vs. /v/ in vowel context of a) female
speakers, b) male speakers, regardless of the succeeding vowel.
Figure A-6: The average and standard deviation of F3 of /ð/ vs. /v/ in vowel context of a) female
speakers, b) male speakers, given succeeding vowel /i/.
Figure A-7: The average and standard deviation of F3 of /ð/ vs. /v/ in vowel context, given
succeeding vowel /i/, of an individual male speaker (M#1). Notice that the overlap is
significantly smaller for a given speaker than for both male speakers (Figure A-6b).
Figure A-8: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of a) male, b) female speakers in two
pairs of sentences succeeded by similar vowels (/3/ and /r/). Notice the wide ranges of F3 and
DeltaF3 for both /ð/ and /v/.
Figure A-9: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of a female speaker a) F#1, b) F#2 in
two pairs of sentences succeeded by similar vowels (/3/ and /0/). Notice the large variations in
both F3 and DeltaF3 of /ð/ and /v/.
Figure A-10: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of a female speaker (F#1) a) in
sentences #3 and #11 (succeeded by I1/) b) in sentences #5 and #16 (succeeded by /o/).
Figure A-11: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of a female speaker (F#2) a) in
sentences #3 and #11 (succeeded by vowel //) b) in sentences #5 and #16 (succeeded by
vowel /3/). Notice the differences in F3 and especially in DeltaF3 between the two plots,
similar to those seen in the other speakers (Figures A-10, 3-20, 3-21).
Figure A-12: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of a) female speakers, b) male
speakers, given succeeding vowel /o/.
Figure A-13: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of individual female speakers a) F#1,
b) F#2, given succeeding vowel /o/. Notice that /ð/ and /v/ are mixed in the same region.
Figure A-14: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of individual male speakers a) M#1, b)
M#2, given succeeding vowel /o/. Notice that /ð/ has higher F3 and lower DeltaF3 than /v/; the
same general pattern is seen in Figure A-13, despite its exceptions in individual enunciations.
Figure A-15: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of a) female, b) male speakers, given
succeeding vowel /a/.
Figure A-16: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of individual speakers a) F#1, b) F#2,
c) M#1, and d) M#2, given succeeding vowel /w/. Notice that only c) shows clear distinction
between /ð/ and /v/. The average values of /ð/ and /v/ of each plot, however, consistently show
that /ð/ has higher F3 and lower DeltaF3 than /v/.
Figure A-17: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of a) female, b) male speakers given
succeeding vowel /I/. Notice that /ð/ tends to have higher F3 than /v/, but their DeltaF3 values tend
to be in the same range.
Figure A-18: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of individual speakers a) F#1, b) F#2,
c) M#1, d) M#2, given succeeding vowel /I/. Neither parameter is very effective at separating
/ð/ from /v/. In general, F3 is higher in /ð/. The average DeltaF3 of /ð/ seems slightly higher
than that of /v/, which is an exception to the trend of DeltaF3 observed for the rest of the
vowels.
List of Tables
Table 2-1: The contexts, close-sounding consonants, succeeding vowel, and phrases studied in the
project. The underlined segments are the parts of the phrases analyzed in detail.
Table 2-2: Number of utterances analyzed in the study for each context.
Table 3-1: F2 and F3 of /ð/ vs. /n/ in the nasal context, listed by speakers and vowels. Notice that
the corresponding F2 of /ð/ is less than that of /n/ for all cases, but F3 does not show such
consistency (the asterisks mark cases of inconsistency). Thus F2 is further analyzed.
Table 3-2: Average F2 values for /ð/ vs. /n/ in the nasal context, listed by gender and succeeding
vowel. Notice that F2 varies according to gender and the succeeding vowel, but F2 of /ð/ is
always lower than the corresponding F2 of /n/.
Table 3-3: Average DeltaF2 of /ð/ vs. /n/ in nasal context, listed by speaker and succeeding vowel.
The average DeltaF2 is positive for utterances succeeded by /o/ and negative for those by /i/.
DeltaF2 of /ð/ is greater than that of /n/ in all cases.
Table 3-4: Average F2 values for /ð/ vs. /d/ in stop context, listed by gender. Notice that F2 is
higher in /d/ than in /ð/. Note that for /ð/ and /d/ only succeeding vowel /o/ is examined.
Table 3-5: Amplitude difference between the high and mid frequency ranges of /ð/ vs. /d/ in stop
context, listed by speaker and enunciation. Notice that the amplitude difference of /ð/ is usually
negative and is less than that of /d/ in almost all cases.
Table 3-6: Burst duration of /ð/ vs. /d/ in stop context, listed by each speaker. Notice that no
consistent difference is seen between the burst duration of /ð/ and /d/ across all speakers.
Table 3-7: Average F2 and amplitude difference between high and mid frequency ranges of /ð/ vs.
/d/ in stop context, listed by gender. Notice that F2 is substantially lower in /ð/ than in /d/ and
Amp(High-Mid) is negative for /ð/ but positive for /d/.
Chapter I: Introduction
1.1 Motivation
Progress in the intense study of acoustics and speech recognition in the past century has
manifested in automatic speech recognition software such as SUMMIT used in the MIT-developed
Jupiter weather information system. Jupiter, in particular, has been achieving response accuracy of
about 90% for novice users and over 98% for experienced users [1]. Such software, however, has a
limited domain of acceptable queries, set by the limited number of recognized words; Jupiter can
only recognize about 2000 words. This limited scope of recognition is a result of the algorithm of
the speech recognition units, such as SUMMIT, which recognize words in the queries by matching
key segments of the speech signals with a pre-stored database of phonemes [1]. This kind of search
and match algorithm makes real-time recognition of continuous speech impractical, both in terms
of size limitations of the pre-stored library of vocabulary and the computation power currently
available to run such an algorithm.
In light of such limitations, this study seeks to advance automatic speech
recognition not by reducing the limitations of the search and match algorithm
mentioned above, but by finding sets of acoustic cues that can unambiguously identify individual
phonemes in continuous speech. This study examines the spectral characteristics of phonemes and
aims to incorporate these characteristics into algorithms that computers can use to achieve
automatic phoneme recognition. In the process, the goal is to gain further understanding of how the
human perceptual system processes, identifies, and differentiates between similar-sounding
phonemes. In addition to building better recognition tools that can potentially identify every
phoneme in continuous speech, this research can also contribute to the development of
computer-based methods for transforming speech into its written form. Such a function would be
highly valuable in a number of applications, such as taking notes for deaf and hard-of-hearing
individuals.
1.2 Background Information
This project focuses on finding the acoustic characteristics of /ð/ for two main reasons.
First, /ð/ is one of the most common phonemes in English [2]. Because /ð/ is found in common
function words such as they, them, those, then, and the, it is the 7th most frequent consonant in
spoken English and the most frequent consonant in word-initial position. Because function words
can be important for extracting the meaning of sentences, it is important to be able to recognize
intended /ð/ in natural speech. Second, because /ð/ is usually found in the same location, at the
beginning of words, it is easier to study.
/ð/ is a voiced fricative with weak noise, produced by the air turbulence created when air
from the lungs is forced through the vocal tract constriction formed by the tongue and the teeth [3].
/ð/ is generally unstressed, as in see the ball, but it can also be stressed, as in see that ball. As
mentioned, /ð/ is often found at the beginning of words, but it can also be found in the middle of words,
such as bother, father, and weather, and at the end of a few words, such as seethe, bathe, and teethe [4].
Recent research on consonants in the varied contexts that occur in normal speech has
shown that the acoustic features of phonemes vary according to the identity of adjacent phonemes
[3], [4]. Therefore, acoustic cues for identifying /ð/ could be context-dependent, meaning that the
auditory system may identify /ð/ based on one of several different sets of cues, depending on the
perceived context. At the same time, given that our auditory system can recognize intended /ð/ in
various contexts as /ð/, it is possible that there exists a set of invariant acoustic cues for speech
perception: characteristics that are common to /ð/ in all contexts. Since our purpose is to find such
invariant and context-dependent cues, characteristics of /ð/ are analyzed in various contexts.
Spectral characterization of /ð/ preceded by a nasal consonant has already been studied for
/n/ [4]. That research found that /ð/ assimilates and becomes nasalized when preceded by /n/, i.e., the
entire consonant region in the spectrogram of /ð/ shows characteristics like those found in /n/. At
the same time, acoustic evidence suggests that contextually nasalized /ð/ retains its dental place of
articulation [4]. This evidence is based on the second formant frequency (f2) in the following
vowel: F2 is considerably lower at the release of contextually nasalized /ð/ than at the release of a
true /n/. Furthermore, listeners can generally tell the difference between natural tokens of win nose
and win those, even when /ð/ is completely nasalized [4]. And when synthetic stimuli were
constructed in which signals differed only in F2 near the nasal consonant region of win nose and
win those, listeners systematically reported hearing the latter more often when F2 was low at the
release of the nasal consonant [4]. These results are consistent with the claims in the literature that,
despite contextual assimilation, listeners can recognize the intended phoneme [3]. Finding the
acoustic cues that help listeners to recognize all of the contextually modified /ð/ as the same
phoneme is the objective of this research.
To achieve this objective, this study first analyzes characteristics of /ð/ in isolated
enunciations of vowel-consonant(fricative)-vowel (VCV) combinations by comparing the
spectrogram of /ð/ to that of its two closest phonemes, /v/ and /z/. It then proceeds to examine /ð/ in
a variety of contexts in sentence material.
1.3 Preliminary Work
Preliminary characterization has been done for the spectrograms of the fricatives /ð/, /z/, and /v/
that occur in isolated vowel-fricative-vowel (VCV) combinations containing the vowels /a/, /x/, /e/,
/i/, and /u/.
The fricative /z/ has visible high frequency content in its spectrogram that is not found in
the spectrograms of /ð/ and /v/. Thus, /z/ can easily be differentiated from the other two fricatives,
and efforts are subsequently focused on finding the more subtle spectral differences between /ð/
and /v/. To do so, a program called xkl is used to extract frequency information from the original
spectrogram (see Figure 1-1). An idealized spectrogram illustrating the features of /ð/ and /v/ that
are measured and compared is displayed in Figure 1-2.
[Figure 1-1 image: panel a) spectrogram, frequency (kHz) versus time (ms); panel b) formant tracks versus time (ms)]
Figure 1-1: a) An example spectrogram, b) An example output of formant tracks extracted by xkl.
Note that this thesis denotes the second and third formant frequencies with lowercase "f", as "f2"
and "f3", and denotes the second and third formant frequencies at the onset of the succeeding vowel
with uppercase "F", as "F2" and "F3."
Figure 1-2: An idealized spectrogram and the important acoustic features measured in the
preliminary study.
The parameters measured include +F1, +F2, +F3, -F1, -F2, -F3, +Slope of F2, +Slope of F3,
-Slope of F2, -Slope of F3, and fricative duration. The most useful parameter for differentiating /6/
and /v/ turns out to be +F3. F3 values of /6/ are greater than those of /v/ in all of the VCV
enunciations studied. The rest of the parameters can also be useful in differentiating between /6/
and /v/, especially when the context is known.
Three common patterns of movements of F2 and F3 are observed, as shown in Figure 1-3.
Figure 1-3: Three commonly observed patterns in the second and third formant frequencies of /6/
and /v/.
Furthermore, using combinations of parameters, /6/ and /v/ can be distinguished in a number of
utterances, but not in all. Figure 1-4 shows an example of a combination of characteristics that is
successful in distinguishing /6/ and /v/.
Figure 1-4: Example of a combination (F3 and -F3) that separates /6/ and /v/. Notice that a line
representing a differentiation criterion can be drawn that would separate all utterances of /6/
from those of /v/.
I.4 Research Objectives
The objective of the project is to apply the methods and results of the analysis of the
spectral characteristics of /6/ and /v/ in the simpler VCV enunciation toward finding cues that can
identify all intended /6/'s in various contexts in continuous speech. More specifically, the research
intends to accomplish the following:
1. Identify the invariant and contextual-based differences between the spectrograms and
spectra of /6/ and its close-sounding phonemes.
2. Gain insight into how the human perceptual system processes and identifies the
contextually-varying /6/ and other phonemes during cognitive processing of continuous
speech.
Chapter II: Methodology
II.1 Overview
This project examines the invariant- and contextual-based characteristics of /6/. A set of
eighteen sentences is designed such that /6/ and its close-sounding phonemes are between the same
preceding and succeeding context. These sentences are listed in Appendix A. Three contexts that
are studied are nasal, stop, and vowel, and the corresponding close-sounding phonemes of /6/ are
/n/, /d/ and /v/, respectively. Based on the finding of the preliminary work, useful characteristics in
identifying /6/, such as F2 and F3, are measured and analyzed for the new database of sentences.
Additional parameters of the burst of /6/ and /d/ in the context of stop consonants are also measured
and analyzed.
II.2 Database
Eighteen sentences containing /6/ and its close-sounding phonemes, /n/, /d/ and /v/, are
constructed to form the database for this project (See Appendix A). Phrases in the original
sentences that contain the consonants of interest are included in Table 2-1.
The sequences of
preceding context, consonant (/6/; or /n/, /d/, and /v/), and succeeding vowel that are analyzed in
detail are underlined in Table 2-1. Because the succeeding vowel may influence the characteristics
of the consonants, the succeeding vowels of each pair of phrases are chosen to be the same, as seen
in the table.
Type of      Phoneme in the   /6/'s Close-Sounding   Succeeding   Phrases
Preceding    Preceding        Consonant              Vowel
Context      Context          (/6/ vs. ...)
Nasal        /n/              /n/                    /o/          1. win those / win nose
                                                     /i/          2. win these / win niece
Stop         /t/              /d/                    /o/          1. put those / put dough
Vowel        /i/              /v/                    /i/          1. guarantee these / guard TV's
                                                     /ɪ/          2. see this / see Victorian
             /u/                                     /ə/          3. to the racial / two veracious
             /ai/                                    /æ/          4. dye that / dye vat
             /a/                                     /o/          5. via those / via votes
             /ei/                                    /ɛ/          6. may then / may vend
Table 2-1: The contexts, close-sounding consonants, succeeding vowel, and phrases studied in the
project. The underlined segments are the parts of the phrases analyzed in detail.
To find the spectral characteristics of /6/ that are common to both genders, the sentences are
spoken by both male and female speakers. Because there may be variations among speakers and
enunciations, each sentence is spoken by two male and two female speakers and is repeated three
times by each speaker.
For each context, the database provides the following number of consonant pairs, listed in
Table 2-2.
Type of Preceding   Number of Pairs   Number of      Total Number of
Context             of Phrases        Repetitions    Utterances Analyzed
Nasal               2                 11             22
Stop                1                 11             11
Vowel               6                 11             66
Table 2-2: Number of utterances analyzed in the study for each context.
The actual number of data points analyzed may be less than the numbers listed here, because some
formant frequencies are impossible to determine in some enunciations. Also, 12 repetitions (4
speakers and 3 repetitions per speaker) were planned, but due to problems during recording, one set
of utterances by one of the female speakers was not used for the study. Thus only 11 repetitions of
each pair of sentences were available for analysis.
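The counts in Table 2-2 follow from simple arithmetic, which can be sketched as below. The variable names are hypothetical and chosen only for illustration; the numbers are those stated in the text and in Table 2-2.

```python
# Planned design: 4 speakers, 3 repetitions per speaker.
speakers = 4
repetitions_per_speaker = 3
discarded_sets = 1  # one set by one female speaker was not used

planned = speakers * repetitions_per_speaker   # repetitions planned
available = planned - discarded_sets           # repetitions actually usable

# Number of phrase pairs per preceding context (Table 2-2).
pairs_per_context = {"nasal": 2, "stop": 1, "vowel": 6}

# Upper bound on the utterances available per context.
totals = {ctx: n * available for ctx, n in pairs_per_context.items()}
```

These totals are upper bounds, since some formant frequencies could not be determined in some enunciations.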
II.3 Parameters Analyzed
II.3.1 Second and Third Formant Frequencies at the Onset of the Succeeding Vowel (F2 and
F3)
The onset time of the succeeding vowel is determined by the combination of three pieces of
information: changes in the enunciation's sound wave, the formant tracks extracted by xkl, and the
spectrogram. Because vowels, unlike consonants, are formed with no obstruction in the vocal tract,
significantly more sound energy is concentrated at the lower formant frequencies of vowels than in
the case of consonants. High amplitude of sound energy is represented by dark bands on the
spectrogram and by bold points in the formant tracks extracted by xkl. Thus, the abrupt appearance
of dark bands in the spectrogram and of bold formant-frequency points in the formant tracks gives
the approximate onset time of the succeeding vowel.
A more exact onset time is determined by looking for abrupt changes in the waveform.
Because the vocal tract has different shapes for vowels and consonants as a speaker switches from a
consonant to a vowel, the change is reflected in the waveform, as shown in Figure 2-1. In the
example given below, the point of abrupt change (the appearance of the first negative peak) at
around 224 ms marks the onset of the vowel; this value would correspond to the times of abrupt
appearance of dark bands and bold points in the spectrogram and formant tracks, respectively.
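In this study the onset was located by inspecting the waveform. As a rough illustration of what "abrupt change" could mean computationally, the sketch below picks the sample at which the ratio of short-time RMS energy after the point to that before the point is largest. This is an assumed, simplified criterion, not the procedure used with xkl; the function name `detect_onset_ms` and the 5 ms window are hypothetical choices.

```python
import math

def _rms(segment):
    """Root-mean-square amplitude of a list of samples."""
    return math.sqrt(sum(x * x for x in segment) / len(segment))

def detect_onset_ms(signal, fs, win_ms=5.0):
    """Return the time (ms) where the RMS after/before ratio is largest.

    signal: list of samples; fs: sampling rate in Hz.
    Returns None if the signal is shorter than two windows.
    """
    n = max(1, int(round(win_ms * fs / 1000.0)))
    best_index, best_ratio = None, 0.0
    for i in range(n, len(signal) - n + 1):
        before = _rms(signal[i - n:i])
        after = _rms(signal[i:i + n])
        ratio = after / max(before, 1e-12)  # guard against silence
        if ratio > best_ratio:
            best_index, best_ratio = i, ratio
    if best_index is None:
        return None
    return 1000.0 * best_index / fs
```

A criterion of this kind only gives an approximate onset time; it would still need to be checked against the spectrogram and formant tracks, as described above.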
Figure 2-1: A male speaker's waveform of the utterance "V" in "TV" (Sentence #12). The changes
in the waveform at around 224 ms indicate the onset of the vowel.
Because the interest is to study the pattern of change of the second and third formant
frequencies over time, a narrow time window that gives detailed time domain information is used.
In this case, a 6.0 millisecond Hamming window is chosen. Since the amplitudes of sound energy
in higher formant frequencies tend to be low for vowels, which makes the determination of F3
more problematic, the pre-emphasis parameter in xkl is set to 100 to raise the amplitude of the F3
prominence.
The formant tracks extracted by xkl (the right side figure in Figure 1-1) show abrupt jumps
that are uncharacteristic of formant frequencies, due to the measuring algorithm used in xkl. To get
a more accurate measurement of F2 and F3, the spectra computed with the 6.0 ms Hamming
window are averaged over 15 milliseconds, from 7 milliseconds before the onset time to 7
milliseconds after. An example of the averaged spectrum is shown in Figure 2-2. The auto-pick
option is turned on to let xkl find the most accurate values for the formant frequencies, which
correspond to the peak frequencies (x-coordinates) seen in the spectrum.
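The averaging and peak-picking step can be sketched as follows. This is a minimal illustration and not xkl's algorithm: it assumes a 1 ms step between the 6.0 ms Hamming-windowed frames, uses a plain zero-padded DFT, omits pre-emphasis, and picks the largest peak inside a caller-supplied frequency band. All function names are hypothetical.

```python
import cmath
import math

def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def magnitude_spectrum(frame, n_fft):
    """Magnitudes of DFT bins 0..n_fft//2 of a zero-padded frame (naive DFT)."""
    mags = []
    for k in range(n_fft // 2 + 1):
        acc = 0j
        for i, x in enumerate(frame):
            acc += x * cmath.exp(-2j * math.pi * k * i / n_fft)
        mags.append(abs(acc))
    return mags

def averaged_spectrum(signal, fs, center_ms, win_ms=6.0, half_span_ms=7, n_fft=512):
    """Average the spectra of frames centered from center-7 ms to center+7 ms."""
    win_len = int(round(win_ms * fs / 1000.0))
    window = hamming(win_len)
    total = [0.0] * (n_fft // 2 + 1)
    count = 0
    for offset_ms in range(-half_span_ms, half_span_ms + 1):  # assumed 1 ms step
        mid = int(round((center_ms + offset_ms) * fs / 1000.0))
        start = mid - win_len // 2
        frame = [s * w for s, w in zip(signal[start:start + win_len], window)]
        for k, m in enumerate(magnitude_spectrum(frame, n_fft)):
            total[k] += m
        count += 1
    return [t / count for t in total]

def peak_in_band(spectrum, fs, n_fft, lo_hz, hi_hz):
    """Frequency (Hz) of the largest spectral value between lo_hz and hi_hz."""
    k_lo = int(math.ceil(lo_hz * n_fft / fs))
    k_hi = int(math.floor(hi_hz * n_fft / fs))
    k_best = max(range(k_lo, k_hi + 1), key=lambda k: spectrum[k])
    return k_best * fs / n_fft
```

With a window as short as 6 ms the frequency resolution is coarse, so averaging over several frames steadies the picked peak; this is the trade-off between time detail and frequency detail noted above.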
Figure 2-2: An example spectrum showing the value of F3 auto-picked by xkl (the dotted vertical
line).
II.3.2 Change of f2 and f3 Over a Period of 50 ms (DeltaF2 and DeltaF3)
The same parameter setting of 6.0 ms Hamming window, 100 pre-emphasis, and 15 ms
time-averaging is used to read the f2 and f3 values at 50 ms after the onset. The midpoint of the
time average is the onset time plus 50 ms. DeltaF2 and DeltaF3 are the differences of the f2 and f3
values at the later time minus those at the earlier time, respectively.
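The definition of DeltaF2 and DeltaF3 amounts to a single subtraction, sketched below. The function names and the example frequencies are hypothetical and serve only to illustrate the sign convention.

```python
def delta_formant(f_onset_hz, f_later_hz):
    """Formant change in Hz: value 50 ms after the onset minus value at the onset."""
    return f_later_hz - f_onset_hz

def movement(delta_hz):
    """Describe the direction of formant movement from the sign of the change."""
    if delta_hz > 0:
        return "rising"
    if delta_hz < 0:
        return "falling"
    return "flat"

# Hypothetical example: F2 of 1660 Hz at the onset and f2 of 1455 Hz 50 ms later.
example_delta = delta_formant(1660, 1455)
```

Under this convention a negative DeltaF2 means that the second formant falls during the first 50 ms of the vowel.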
II.3.3 Amplitude Difference (Amp(High-Mid)) and Duration of Burst of /6/ and /d/
The burst spectrum of /d/ is expected to contain more energy in the higher frequency range
than that of /6/, and /d/ is expected to have longer duration than /6/. Thus the burst amplitude and
duration of the two consonant bursts are measured in the hope of finding characteristics that would
separate the two phonemes. Example spectra of bursts of /6/ and /d/ that illustrate the difference are
shown in Figure 2-3.
Figure 2-3: Spectra of the burst of a) /6/ vs. b) /d/, of a female speaker. Notice that a prominent
peak at around 4.7 kHz is seen in the spectrum of /d/ but not in that of /6/.
To measure and quantify this spectral difference, the amplitudes of the peaks in the high and mid
frequency ranges are measured, and the difference is taken. This difference is named the amplitude
difference, and will be denoted as Amp(High-Mid) from here on.
Due to the differences in formant frequencies between the genders, the Amp(High-Mid)
value is defined slightly differently for each gender; the cutoff frequency defining the ranges is
higher for female speakers than for male speakers. More specifically:
1. For female speakers, Amp(High-Mid) equals the amplitude difference between the
highest peak at frequencies higher than 4000 Hz and the highest peak (excluding the F1
peak) at under 3000 Hz.
2. For male speakers, Amp(High-Mid) equals the amplitude difference between the highest
peak at frequencies higher than 3500 Hz and the highest peak (excluding the F1 peak) at
under 2500 Hz.
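The two definitions above can be sketched as one function with gender-specific cutoffs. This is an illustration only: it assumes the spectral peaks are already available as (frequency in Hz, amplitude in dB) pairs and that the F1 peak frequency is known; the function name and the peak values in the usage note are hypothetical.

```python
# (high cutoff, mid cutoff) in Hz, as defined in the list above.
CUTOFFS_HZ = {"female": (4000.0, 3000.0), "male": (3500.0, 2500.0)}

def amp_high_mid(peaks, gender, f1_hz):
    """Return Amp(High-Mid) in dB, or None if either range has no peak.

    peaks: list of (frequency_hz, amplitude_db) spectral peaks.
    f1_hz: frequency of the F1 peak, which is excluded from the mid range.
    """
    high_cut, mid_cut = CUTOFFS_HZ[gender]
    high = [amp for freq, amp in peaks if freq > high_cut]
    mid = [amp for freq, amp in peaks if freq < mid_cut and freq != f1_hz]
    if not high or not mid:
        return None
    return max(high) - max(mid)
```

For example, with hypothetical peaks at 400 Hz (55 dB, the F1 peak), 1700 Hz (40 dB), 2600 Hz (35 dB), and 4700 Hz (48 dB), the value for a female speaker would be 48 - 40 = 8 dB.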
As done for the succeeding vowels, the spectra of /6/ and /d/ are also generated by a 6.0 ms
Hamming window and averaged over 15 ms, and the formant frequencies for the spectra are
auto-picked by xkl. The midpoint of the time average is the onset of /6/ or /d/, which is easily
determined by looking for abrupt changes in the waveform (see Figure 2-4). The pre-emphasis is
set at 0, because /6/ and /d/ both have enough energy distributed in the higher frequencies for f3 to
be determined unambiguously. Figure 2-4 also illustrates how the durations of /6/ and /d/ are
determined.
Figure 2-4: Sample waveforms of a) /6/ and b) /d/. In each waveform, the two vertical lines
represent the points of waveform change, and the time in between the lines is the duration of the
burst.
In each waveform, the former and latter points of change indicate the beginning and the end of the
burst of /6/ or /d/. The time difference between the two points is the duration of the burst.
Chapter III: Results and Analysis
The second and third formant frequencies at the onset of the succeeding vowel (F2 and F3)
and their change over time (DeltaF2 and DeltaF3) were measured and analyzed for all three
contexts. The duration and the amplitude differences of the burst of /6/ and /d/ are also examined
in the context of a stop consonant. The resulting measurements of the three contexts are presented
in the tables and figures below. Pairs of parameters that show consistent differences between /6/
and its close-sounding phonemes are graphed as (x, y) pairs for the purpose of finding criteria that
would separate the consonants.
III.1 Context of Nasal Consonant
III.1.1 Second vs. Third Formant Frequencies (F2 vs. F3)
F2 and F3, the second and third formants at the onset of the succeeding vowel, were
measured for /6/ and /n/. Table 3-1 shows the average value for each speaker, given a particular
consonant and succeeding vowel (F#1 denotes female speaker #1, M#1 denotes male speaker #1,
and so on).
Parameter
Vowel
Consonant
/o/
//
/n/
F2
/n/
/o/
/n/
F3
/i/
/n/
F2 of Individual Speaker (Hz)
M#2
F#2
M#J
F#1
1386
1654
1606
1260
2394
1952
1312
1533
2363
2552
2773
3072
3009
3087
2037
2394
2793
2856
2825
2993
2016
2058
2667
2741
2694*
2667*
1974
2032
2342*
2258*
2426
2436
1
Table 3-1: F2 and F3 of /6/ vs. /n/ in the nasal context, listed by speakers and vowels. Notice that
the corresponding F2 of /6/ is less than that of /n/ for all cases, but F3 does not show such
consistency (the asterisks mark cases of inconsistency). Thus F2 is further analyzed.
Notice that in Table 3-1, given the same speaker and vowel, the F2 of /6/ is less than that of
/n/ for all cases. This difference is not only for the average values but for each enunciation as well.
F3, however, does not show the same consistency. The average F3 of /6/ is less than that of /n/ for
only 6 out of 8 of the cases (see Table 3-1; not true for the two sets of F3 marked with "*"). The
relationship between F3 of /6/ and /n/ is even less consistent in each enunciation; only about half of
the enunciations have a higher F3 for /n/ than for /6/. Therefore, F2 is determined to be a better
parameter for identifying /6/ in nasal context and is thus further studied.
III.1.2 Second Formant at the Onset of the Succeeding Vowel (F2)
F2 of /6/ and /n/ is further analyzed by taking the average of F2 for all of the utterances
produced within each gender group. See Table 3-2 below for results.
Succeeding   F2 of Female Speakers (Hz)   F2 of Male Speakers (Hz)
Vowel        /6/        /n/               /6/        /n/
/o/          1660       1993              1360       1423
/i/          2200       2473              1995       2045
Table 3-2: Average F2 values for /6/ vs. /n/ in the nasal context, listed by gender and succeeding
vowel. Notice that F2 varies according to gender and the succeeding vowel, but F2 of /6/ is always
lower than the corresponding F2 of /n/.
Table 3-2 indicates that corresponding F2's differ significantly between the genders, with
the difference ranging from around 200 to 570 Hz. This gender difference is expected because male
speakers have longer vocal tracts, which produce lower formant frequencies, according to
perturbation properties of resonators [6].
Table 3-2 indicates that the succeeding vowel also significantly influences F2. F2 of both
/6/ and /n/ when succeeded by /o/ is consistently around 600 Hz lower than that of the consonants
when succeeded by /i/. F2's correlation with the succeeding vowels is expected, because different
vowels are results of different vocal tract shapes, which in turn resonate at different formant
frequencies, according to the perturbation properties of resonators [6]. The vowel /i/ is expected to
have a high F2 because /i/ has a fronted and high tongue body position [6]. On the other hand, /o/
has a back tongue position and thus should have a lower F2 [6]. The measured F2 agrees with these
expectations, and thus helps ensure the validity of the rest of the analysis.
The differences in F2 of the same consonant when succeeded by different vowels (Table 3-2)
suggest that knowledge of the identity of the succeeding vowel could be important in the correct
identification of /6/. More specifically, if the succeeding vowel is not taken into consideration, the
average F2 of /6/ for female speakers would be the average of 1660 and 2200 Hz (Table 3-2), or
1930 Hz; and F2 of /n/ would be the average of 1993 and 2473 Hz (Table 3-2), or 2233 Hz. Given
these average F2 values of /6/ and /n/, a reasonable cutoff F2 to differentiate /6/ and /n/ could be the
midpoint, or 2082 Hz, i.e. consonants with a higher F2 would be identified as /n/, and consonants
with a lower F2 would be identified as /6/. If this were the case, for /6/ and /n/ succeeded by /o/,
the vast majority of both /6/ and /n/ would be determined to be /6/, since more than half of the /n/
would have F2 less than 2082 Hz. To avoid this type of misclassification, the differentiation criteria
for the consonants should be set with regard to a particular succeeding vowel, which would make
them much more accurate.
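The arithmetic behind this argument can be checked with a short sketch. It uses the female averages from Table 3-2; the midpoint rule for vowel-specific cutoffs is an illustrative assumption, and the names (`dh` standing for /6/, `classify`, and so on) are hypothetical.

```python
# Average female F2 (Hz) at the vowel onset, from Table 3-2; "dh" stands for /6/.
f2_female = {
    "o": {"dh": 1660, "n": 1993},
    "i": {"dh": 2200, "n": 2473},
}

def midpoint(a, b):
    return (a + b) / 2

# Pooling over the two vowels, as in the text.
pooled_dh = midpoint(f2_female["o"]["dh"], f2_female["i"]["dh"])
pooled_n = midpoint(f2_female["o"]["n"], f2_female["i"]["n"])
pooled_cutoff = midpoint(pooled_dh, pooled_n)

def classify(f2_hz, cutoff_hz):
    """Label a consonant as /6/ ("dh") if its F2 is below the cutoff, else /n/."""
    return "dh" if f2_hz < cutoff_hz else "n"

# One cutoff per succeeding vowel instead (assumed midpoint rule).
vowel_cutoffs = {v: midpoint(c["dh"], c["n"]) for v, c in f2_female.items()}
```

With the pooled cutoff, even the average /n/ before /o/ (1993 Hz) falls on the /6/ side, whereas the vowel-specific cutoff for /o/ places it correctly.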
Table 3-2 also shows that for both genders and vowels, F2 of /n/ is consistently greater than
that of the corresponding /6/. This observation is reasonable based on the difference in the vocal
tract shape of the two consonants. Figure 3-1 shows that /6/ has more constriction in the back cavity than
/n/ (and /d/). The shape of the back cavity is known to have the strongest influence on the value of
F2: the more constriction, the lower F2 [6]. The data obtained are consistent with findings in the
literature and thus strengthen the validity of the results.
Figure 3-1: The mid-sagittal view of the vocal tract for /6/, /n/, and /d/ [5]. The back cavity is more
constricted for /6/, a difference that leads to a lower F2 for /6/ than for /n/ and /d/.
The average and standard deviation of F2 for vowel /o/ are calculated and graphed in Figures
3-2 and 3-3. The height of each bar represents the average F2, and the extension above and below
the bar represents one standard deviation above and below the average.
Figure 3-2: The average and standard deviations of F2 of /6/ vs. /n/ in nasal context of female
speakers, given the succeeding vowel /o/. Notice that the F2 ranges of /6/ and /n/ do not overlap.
Figure 3-3: The average and standard deviations of F2 of /6/ vs. /n/ in nasal context of male
speakers, given the succeeding vowel /o/. Notice the slight overlap between the two F2 ranges.
For female speakers, the ranges of F2 for /6/ and /n/ are relatively distinct. Most of their F2 values
fall in between 1590-1662 Hz for /6/, as opposed to 1886-2371 Hz for /n/. For male speakers,
however, F2 of /n/ and /6/ are less distinct and their ranges overlap (Figure 3-3). The ranges of the
majority of F2 values of male speakers are 1249-1398 Hz and 1301-1545 Hz, for /6/ and /n/
respectively.
The lack of overlap of F2 ranges in female speakers (Figure 3-2) suggests that substantial
differences exist between the F2's of the two consonants, and thus it is possible to set a criterion
that would separate most utterances of /6/ from /n/ on the basis of F2 alone. In this particular case,
the cutoff F2 that would best differentiate between the two consonants is somewhere between the
lower range of /n/ (1886 Hz) and the higher range of /6/ (1662 Hz). If the criterion is set half way
between the two values for simplicity, or at 1774 Hz, it would be 1.47 standard deviations away
from the mean F2 of /n/ and 4.11 standard deviations from the mean F2 of /6/. Therefore, about
93% of /n/ and virtually 100% of /6/ would be identified correctly, assuming Gaussian
distributions. With fine adjustments, the cutoff could be set so that it would be as many standard
deviations away from the average of both consonants as possible. Such adjustment would further
increase the accuracy of identifying the intended consonants.
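The Gaussian estimate can be reproduced with the standard normal cumulative distribution function. The sketch below takes the range limits and the two standard-deviation distances directly from the text; the helper name `normal_cdf` is hypothetical, and the Gaussian assumption is the one stated above.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Range limits quoted in the text (Hz): upper end for /6/, lower end for /n/.
upper_dh_hz = 1662
lower_n_hz = 1886
cutoff_hz = (upper_dh_hz + lower_n_hz) / 2  # halfway criterion

# Distances of the cutoff from each mean, in standard deviations (from the text).
z_n = 1.47
z_dh = 4.11

# Fraction of each consonant falling on the correct side of the cutoff.
p_n_correct = normal_cdf(z_n)
p_dh_correct = normal_cdf(z_dh)
```

Moving the cutoff toward /6/ would raise the fraction of /n/ identified correctly while lowering that of /6/ only slightly, which is the fine adjustment mentioned above.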
The criterion for differentiating /6/ and /n/ based on F2 is harder to set for the given male
speakers. But since the two consonants are articulated similarly by the two genders, the
apparent greater difficulty in differentiating /6/ and /n/ in the male speakers is most likely specific
to the particular speakers studied. More specifically, F2 data for each utterance by the
two male speakers reveal substantial variation of F2 between the two speakers: F2 of one speaker is
significantly and consistently higher than that of the other. For each speaker, however, F2 of /6/ and
/n/ show differences similar to those observed in the female speakers. Thus, the more acoustic
characteristics are known about a particular speaker, the more accurate the identification of /6/
is likely to be for that speaker.
III.1.3 Movement of f2 Over a Period of 50 ms (DeltaF2)
The second formant frequency at 50 ms after the onset of the vowel is measured by methods
described in Chapter II. The average frequency differences at the two times for a given speaker and
succeeding vowel are tabulated in Table 3-3.
Succeeding   Consonant   Average DeltaF2 of Individual Speaker (Hz)
Vowel                    F#1      F#2      M#1      M#2
/o/          /6/         -205     -94      -178     -147
             /n/         -409     -188     -179     -221
/i/          /6/         315      410      221      111
             /n/         205      241      179      74
Table 3-3: Average DeltaF2 of /6/ vs. /n/ in nasal context, listed by speaker and succeeding vowel.
The average DeltaF2 is negative for utterances succeeded by /o/ and positive for those succeeded
by /i/. DeltaF2 of /6/ is greater than that of /n/ in all cases.
Notice that the average DeltaF2 is negative for all speakers when the succeeding vowel is
/o/ and is positive when the succeeding vowel is /i/. This observation is also consistent with
characteristics of F2 of /o/ and /i/. As mentioned earlier, the tongue is raised and fronted when
producing /i/. These actions cause a widening of the back cavity of the vocal tract and are reflected
by a rising F2. The tongue is moved toward a back position when producing /o/, and the movement
results in a lowering of F2, as observed in the data.
III.1.4 F2 and DeltaF2
Since F2 and DeltaF2 are, for the most part, consistently different for /6/ and /n/, they are
chosen to be graphed as x- and y-coordinates, i.e. (F2, DeltaF2). Figures 3-4 and 3-5 show F2 and
DeltaF2 of /6/ vs. /n/, regardless of the succeeding vowel, of the female and male speakers,
respectively. Notice that /6/ and /n/ are mixed together in both figures. It is impossible to draw a
line, representing a certain differentiation criterion, such that points of /6/ would be on one side and
points of /n/ on the other.
The F2 and DeltaF2 of /6/ and /n/ of the female speakers when plotted for a particular
succeeding vowel (/i/), shown in Figure 3-6, show better separation between /6/ and /n/ than in the
case of mixed vowels in Figure 3-4. The separation becomes even clearer in Figure 3-7, when the
plots of F2 and DeltaF2 are further narrowed down to each speaker. Notice that for both female
speakers, /6/ has lower F2 and higher DeltaF2 than /n/. The same trend is also observed for both
female speakers in /6/ and /n/ succeeded by /o/ in Figures 3-8 and 3-9. After normalization by
vowel and speaker, points for /6/ and /n/ can be separated onto different sides of a positively sloped
line. Such a line would be the graphical representation of an algorithm that differentiates intended
/6/ from /n/, and it would be specified by assigning the appropriate coefficients to F2 and DeltaF2.
Figures 3-7 and 3-9 show that /6/ and /n/ are indeed distinct enough that it is possible for an
algorithm based on F2 and DeltaF2 values to correctly distinguish between the two consonants.
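One simple way to obtain such a line is the perpendicular bisector between the two cluster centers, which is equivalent to assigning each point to the nearer center. The sketch below is an assumed illustration, not the algorithm of this study. The two centers combine the female /i/ F2 averages of Table 3-2 with the F#1 /i/ DeltaF2 averages of Table 3-3 purely for illustration, and all names are hypothetical.

```python
def bisector(center_a, center_b):
    """Return (slope, intercept) of the perpendicular bisector of two centers.

    Each center is (F2 in Hz, DeltaF2 in Hz). Assumes the two centers differ
    in DeltaF2, so that the bisector is not vertical.
    """
    (ax, ay), (bx, by) = center_a, center_b
    slope = -(bx - ax) / (by - ay)
    mid_x, mid_y = (ax + bx) / 2, (ay + by) / 2
    intercept = mid_y - slope * mid_x
    return slope, intercept

def classify(point, center_dh, center_n):
    """Assign a (F2, DeltaF2) point to the nearer center; "dh" stands for /6/."""
    def dist2(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    return "dh" if dist2(point, center_dh) < dist2(point, center_n) else "n"

# Illustrative centers only (see the assumptions stated above).
center_dh = (2200.0, 315.0)
center_n = (2473.0, 205.0)
slope, intercept = bisector(center_dh, center_n)
```

Because /6/ has the lower F2 and the higher DeltaF2, the resulting line has a positive slope, with /6/ points lying above it and /n/ points below it.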
Similar plots of F2 and DeltaF2 of /6/ and /n/ of a particular vowel for each of the two
male speakers are included in Appendix B (Figures A-2 and A-4). Notice that each of the four
graphs shows more separation of the /6/ and /n/ points than the corresponding plot in which the
two male speakers' data are mixed (Figures A-1 and A-3). Again, the clearer separation of /6/ and
/n/ in the plots for each individual speaker shows that, in addition to knowledge of the
succeeding vowel, knowledge of the particular speaker also greatly assists in separating /6/ from its
close-sounding phoneme.
The extent of separation between /6/ and /n/ in the four plots of Figures A-2 and A-4 for
male speakers, however, is not as clear and consistent as for the females (Figures 3-7 and 3-9).
Both Figure A-2a and A-4b show the same kind of separation as seen in the female speakers but
with less distinction, whereas Figures A-2b and A-4a have overlapping /6/ and /n/ points. Despite
the lack of more convincing separation in the male speakers, the clusters of /6/ and /n/ points are
located in similar positions relative to each other. This similarity is expected because speakers of
both genders articulate /6/ and /n/ by shaping their vocal tracts in similar ways, thus resulting in
the same relationship in the values of F2 and DeltaF2 between /6/ and /n/. The lack of more
convincing separation for the male speakers is most likely due to the limited number of utterances
that were available for this study. Had more speakers of both genders been asked to repeat each
utterance more times, the pattern of relatively lower F2 and higher DeltaF2 of /6/ would probably
be more apparent, and the separation between the two consonants would be more distinct.
Figure 3-4: F2 and DeltaF2 of /6/ vs. /n/ in nasal context of female speakers, regardless of the
succeeding vowel. Notice that it is impossible to draw a line such that /6/ points would be on one
side and /n/ points on the other.
Figure 3-5: F2 and DeltaF2 of /6/ vs. /n/ in nasal context of male speakers, regardless of the
succeeding vowel. Notice that it is impossible to draw a line such that /6/ points would be on one
side and /n/ points on the other.
Figure 3-6: F2 and DeltaF2 of /6/ vs. /n/ in nasal context of female speakers, given succeeding
vowel /i/.
Figure 3-7: F2 and DeltaF2 of /6/ vs. /n/ in nasal context of individual female speakers, a) F#1, b)
F#2, given succeeding vowel /i/. /6/ and /n/ are more distinctly separated for each individual
speaker than for both speakers (Figure 3-6).
Figure 3-8: F2 and DeltaF2 of /6/ vs. /n/ in nasal context of female speakers, given succeeding
vowel /o/. Notice that /6/ and /n/ occupy different ranges of F2 but not of DeltaF2.
Figure 3-9: F2 and DeltaF2 of /6/ vs. /n/ in nasal context of individual female speakers, a) F#1, b)
F#2, given succeeding vowel /o/. /6/ and /n/ are more distinctly separated for each individual
speaker than for both speakers (Figure 3-8).
III.2 Context of Stop Consonant
III.2.1 Second vs. Third Formant Frequencies (F2 vs. F3)
Data on F2 and F3 similar to those in Table 3-1 were collected for phrases that compare /6/
and /d/. As in the case of nasal context, F2 of /6/ is consistently lower than that of /d/ whereas F3 of
/6/ and /d/ does not show a consistent pattern. Therefore, F2 is further analyzed.
III.2.2 The Second Formant Frequency at the Onset of the Succeeding Vowel (F2)
The second formant frequency of /6/ and /d/ is further analyzed by taking the average of F2
for all utterances produced by the female and male speakers separately. The results are summarized
in Table 3-4. Note that for /6/ and /d/ only succeeding vowel /o/ is examined.
Consonant   F2 of Female Speakers (Hz)   F2 of Male Speakers (Hz)
/6/         1695                         1367
/d/         2079                         1650
Table 3-4: Average F2 values for /6/ vs. /d/ in stop context, listed by gender. Notice that F2 is
higher in /d/ than in /6/. Note that for /6/ and /d/ only succeeding vowel /o/ is examined.
Again, F2 varies significantly between the genders, on the order of about 300 Hz. F2 of /6/ is lower
than that of /d/, because like /n/, /d/ also has less constriction in the back cavity than /6/ (see Figure
3-1).
The average and standard deviation of F2 of female and male speakers are shown in Figures
3-10 and 3-11. As in Figure 3-2, the height of each bar represents the average F2, whereas the
extension above and below the bar represents one standard deviation away from the average.
Figure 3-10: The average and standard deviations of F2 of /6/ vs. /d/ in stop context of female
speakers, given the succeeding vowel /o/. Notice that the F2 of /6/ is significantly lower than that of
/d/.
Figure 3-11: The average and standard deviations of F2 of /6/ vs. /d/ in stop context of male
speakers, given the succeeding vowel /o/. Male speakers show a trend similar to that of female
speakers (Figure 3-10).
For female speakers, the ranges of F2 of /6/ and /d/ are relatively distinct (Figure 3-10).
Most of the F2 values of /6/ fall between 1660-1729 Hz, as opposed to 2012-2146 Hz for /d/. For
male speakers, F2 of /6/ and /d/ are relatively distinct too (Figure 3-11). Most F2 values of male
speakers are in the range of 1329-1465 Hz and 1552-1747 Hz, for /6/ and /d/ respectively. Because
the ranges do not overlap, as in the case of /6/ and /n/ in the nasal context, it is possible to set up a
cutoff of F2 such that /6/ and /d/ would be correctly identified by their lower and higher F2 values,
respectively.
III.2.3 Amplitude Difference In The Burst (Amp(High-Mid))
As described in Chapter II, the energy distribution in the burst spectrum can potentially be
used to distinguish intended /6/ and /d/; the amplitude differences in the burst of /6/ and /d/ are
tabulated in Table 3-5.
Speaker   Consonant   Amplitude Difference (Amp(High-Mid)) (dB)
                      Enunciation #1   Enunciation #2   Enunciation #3   Average of the Enunciations
F#1       /6/         N/a              -11.1            -14.5            -12.8
          /d/         N/a              6.7              5.8              6.25
F#2       /6/         -18.1            -12.5            -8.1             -12.9
          /d/         17.3             14               6.3              12.6
M#1       /6/         -6.8             -8.1             -3.5             -6.1
          /d/         -7.8             -1.2             -4.5             -4.5
M#2       /6/         1.3              -1.3             -6.2             -2.1
          /d/         1                5.5              10.4             5.6
Table 3-5: Amplitude difference between the high and mid frequency ranges of /6/ vs. /d/ in stop
context, listed by speaker and enunciation. Notice /6/'s amplitude difference is usually negative and
is less than that of /d/ in almost all cases.
The negative amplitude difference that is consistently observed in most enunciations of /6/
is reasonable and expected because /6/ has more energy in the lower than in the higher frequency
range [6]. The stop consonant /d/ is expected to have a positive amplitude difference, because
alveolar stop consonants are known to have significant amount of energy in the high frequency
range [6]. Such positive amplitude differences are consistently observed in all speakers except
male speaker #1 (M#1). Although M#1's amplitude differences of /d/ are consistently negative, his
average value is not as negative as that of /6/. Thus even M#1 shows the expected relative
difference between /6/ and /d/ on average, and his negative values for /d/ are most likely a
speaker-specific characteristic of M#1 and not the result of experimental error. If more
enunciations were recorded, M#1 would likely continue to show negative amplitude differences for
/d/, but M#1 and the other speakers would be expected to have a more negative amplitude
difference for /6/ than for /d/.
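The per-speaker comparison can be verified from the individual enunciations in Table 3-5. In the sketch below the dictionary layout and names are hypothetical (`dh` stands for /6/), and the two unavailable enunciations of F#1 are simply left out.

```python
# Amp(High-Mid) in dB per enunciation, from Table 3-5; "dh" stands for /6/.
amp_db = {
    "F#1": {"dh": [-11.1, -14.5], "d": [6.7, 5.8]},  # enunciation #1 not available
    "F#2": {"dh": [-18.1, -12.5, -8.1], "d": [17.3, 14.0, 6.3]},
    "M#1": {"dh": [-6.8, -8.1, -3.5], "d": [-7.8, -1.2, -4.5]},
    "M#2": {"dh": [1.3, -1.3, -6.2], "d": [1.0, 5.5, 10.4]},
}

def mean(values):
    return sum(values) / len(values)

# Average amplitude difference per speaker and consonant.
averages = {
    spk: {cons: mean(vals) for cons, vals in by_cons.items()}
    for spk, by_cons in amp_db.items()
}

# Does /6/ have the lower (more negative) average for each speaker?
dh_below_d = {spk: avg["dh"] < avg["d"] for spk, avg in averages.items()}

# Speakers whose average for /d/ is negative rather than positive.
negative_d_speakers = [spk for spk, avg in averages.items() if avg["d"] < 0]
```

The relative ordering holds for every speaker on average, while the sign of the /d/ average differs only for M#1, which is the pattern described above.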
III.2.4 Burst Duration
The durations of the bursts of /d/ and /6/ are measured from the original sound waves, and the
results are tabulated in Table 3-6.
Consonant   Burst Duration of Individual Speaker (ms)
            F#1      F#2      M#1      M#2
/6/         15.5     11.8     16       16.1
/d/         13.1     21.4     10.3     16.3
Table 3-6: Burst duration of /6/ vs. /d/ in stop context, listed by each speaker. Notice that no
consistent difference is seen between the burst duration of /6/ and /d/ across all speakers.
The burst durations of /ð/ and /d/ are similar for some speakers of both genders (F#1 and M#2).
For speaker F#2, however, the burst of /d/ is much longer than that of /ð/, whereas the opposite
is true for speaker M#1. This inconsistency suggests that burst duration is not a useful
characteristic for distinguishing the two consonants. The inconsistency is most likely due to
differences in how speakers articulate the consonants. Some speakers pronounce /d/ more
deliberately than /ð/, leading to a longer burst for /d/, whereas other speakers pronounce /ð/
more deliberately than /d/ or about as deliberately, leading to a shorter or similar burst for
/d/, respectively.
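The inconsistency can be checked directly from the values in Table 3-6. The short sketch below
is only an illustration, not part of the original analysis: it subtracts the burst duration of
/d/ from that of /ð/ for each speaker and tests whether the sign of the difference is the same
for everyone. The label "dh" is an ASCII stand-in for /ð/ chosen here, and the function names
are hypothetical.

```python
# Burst durations in ms, copied from Table 3-6: speaker -> (/dh/, /d/).
burst_ms = {
    "F#1": (15.5, 13.1),
    "F#2": (11.8, 21.4),
    "M#1": (16.0, 10.3),
    "M#2": (16.1, 16.3),
}


def duration_differences(table):
    """Return {speaker: duration(/dh/) - duration(/d/)} in ms, rounded to 0.1 ms."""
    return {spk: round(dh - d, 1) for spk, (dh, d) in table.items()}


def has_consistent_sign(diffs):
    """True only if every speaker's difference has the same sign."""
    signs = {value > 0 for value in diffs.values()}
    return len(signs) == 1


diffs = duration_differences(burst_ms)
for spk, diff in diffs.items():
    print(f"{spk}: {diff:+.1f} ms")
print("consistent sign across speakers:", has_consistent_sign(diffs))
```

Two speakers come out with a positive difference and two with a negative one, which is the
pattern that rules out burst duration as a cue in this data set.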
III.2.5 F2 and Amp(High-Mid)
The average F2 and average amplitude difference of each gender are listed in Table 3-7.
Parameter             Consonant    Female Speakers    Male Speakers
F2 (Hz)               /ð/          1690               1337
                      /d/          2090               1650
Amp(High-Mid) (dB)    /ð/          -12.9              -4.1
                      /d/          3.4                0.55
Table 3-7: Average F2 and amplitude difference between high and mid frequency ranges of /ð/ vs.
/d/ in stop context, listed by gender. Notice that F2 is substantially lower in /ð/ than in /d/ and
Amp(High-Mid) is negative for /ð/ but positive for /d/.
Since F2 and the amplitude difference show consistent differences between /ð/ and /d/, they are
graphed as x- and y-coordinates, i.e. (F2, Amp(High-Mid)), in Figures 3-12 to 3-15.
Figure 3-12: F2 and amplitude difference of /ð/ vs. /d/ in stop context of female speakers. Notice
that /ð/ and /d/ are clearly separated by both F2 and amplitude difference.
Figure 3-13: F2 and amplitude difference of /ð/ vs. /d/ in stop context of male speakers. Unlike in
the similar plot for female speakers (Figure 3-12), the regions of /ð/ and /d/ overlap slightly for the
male speakers.
Figure 3-14: F2 and amplitude difference of /ð/ vs. /d/ in stop context of individual female speakers
a) F#1, and b) F#2. Notice that regions of /ð/ and /d/ are clearly and consistently separated in both
speakers.
Figure 3-15: F2 and amplitude difference of /ð/ vs. /d/ in stop context of individual male speakers
a) M#1, b) M#2. /ð/ and /d/ are more distinctly separated for each individual speaker than for both
speakers together (Figure 3-13).
As is the case for the separation of /ð/ and /n/ in the nasal context, the more information is
known about the context and speaker, the more distinct the characteristics of intended utterances
of /ð/ and /d/ become. For example, when the two male speakers are plotted together (Figure 3-13),
the regions of /ð/ and /d/ overlap. When the speakers are plotted individually (Figure 3-15),
however, the regions of /ð/ and /d/ become separable by a positively sloped line, because /ð/
generally has lower F2 and amplitude difference than /d/.
In this particular study, the two female speakers had similar values of F2 and amplitude
difference, and thus graphing them together or individually made little difference in
separating the consonants. The similar values are most likely due to the similar acoustic
characteristics of the two speakers. Had more utterances from more speakers, both male and female,
been studied, the separation of /ð/ and /d/ would most likely follow the same trend as observed in
the nasal context: the more information is known about the preceding and succeeding context and
the speaker, the more consistently the characteristics of /ð/ differ from those of /d/.
Therefore, if more speakers and utterances are surveyed, more accurate ranges of F2 and
amplitude difference can be established for each gender population, yielding better cutoff
values of F2 and amplitude difference and thus more accurate identification of intended
utterances of /ð/. More importantly, the more that is known about the acoustic characteristics of
a particular speaker, the more finely the cutoffs for F2 and amplitude difference can be tuned to
identify intended /ð/ in the continuous speech of that speaker.
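As a rough illustration of what gender-specific cutoffs could look like, the sketch below places
each cutoff halfway between the gender averages of /ð/ and /d/ as read from Table 3-7, and labels
a token only when both cues agree. The midpoint rule, the "ambiguous" outcome, and all names in the
code are assumptions made for this example; the study itself does not specify a cutoff rule, and
cutoffs derived from averages of so few speakers should not be taken as definitive. The label
"dh" is an ASCII stand-in for /ð/.

```python
# Gender averages as read from Table 3-7: (F2 in Hz, Amp(High-Mid) in dB).
AVERAGES = {
    "female": {"dh": (1690.0, -12.9), "d": (2090.0, 3.4)},
    "male": {"dh": (1337.0, -4.1), "d": (1650.0, 0.55)},
}


def midpoint_cutoffs(gender):
    """Assumed rule: put each cutoff halfway between the two consonant averages."""
    f2_dh, amp_dh = AVERAGES[gender]["dh"]
    f2_d, amp_d = AVERAGES[gender]["d"]
    return (f2_dh + f2_d) / 2.0, (amp_dh + amp_d) / 2.0


def classify(f2_hz, amp_db, gender):
    """Return 'dh', 'd', or 'ambiguous' when the two cues disagree."""
    f2_cut, amp_cut = midpoint_cutoffs(gender)
    f2_says_dh = f2_hz < f2_cut      # /dh/ has the lower F2
    amp_says_dh = amp_db < amp_cut   # /dh/ has the lower Amp(High-Mid)
    if f2_says_dh and amp_says_dh:
        return "dh"
    if not f2_says_dh and not amp_says_dh:
        return "d"
    return "ambiguous"


# Hypothetical tokens, not measurements from the study.
print(midpoint_cutoffs("female"))
print(classify(1700.0, -10.0, "female"))
print(classify(1600.0, 2.0, "male"))
print(classify(1400.0, 3.0, "male"))
```

With speaker-specific averages in place of the gender averages, the same function would give the
finer-tuned cutoffs described above.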
III.3 Context of Vowel
III.3.1 Second vs. Third Formant Frequencies (F2 vs. F3)
As with the previous two contexts, both F2 and F3 were measured initially. Unlike in the other
two contexts, however, it is F3, not F2, that maintains a consistent difference between /ð/ and
/v/. Across the 66 utterances preceded by a vowel in the database (see Table 2-2), the F3 of /ð/
is always greater than that of /v/; F2 was not nearly as consistent between /ð/ and /v/.
Therefore, the rest of the study focuses on F3.
III.3.2 Third Formant Frequency at the Onset of the Succeeding Vowel (F3)
As with the normalization for vowel and speaker done in the previous two contexts, the
average and standard deviation of F3 are graphed first regardless of the succeeding vowel, then for
a given succeeding vowel (/i/), and finally for a given succeeding vowel (/i/) and a particular
speaker (M#1) (see Figures A-5 to A-7 in Appendix B). The same pattern is observed in the vowel
context: the F3 ranges of /ð/ and /v/ overlap less and less as more contextual factors are taken
into account. Given the same succeeding vowel and speaker, the ranges of /ð/ and /v/ do not overlap
at all. Therefore, a cutoff F3 can be specified such that the vast majority of intended /ð/ and /v/
would have F3 above and below the cutoff, respectively.
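One way such a cutoff might be chosen is sketched below. It assumes that the "range" of each
consonant is the mean plus or minus one standard deviation, as in the plots of average and
standard deviation, and it places the cutoff halfway between the bottom of the /ð/ range and the
top of the /v/ range. Both the midpoint rule and the F3 values in the example are hypothetical;
they are not measurements from this study. The label "dh" is an ASCII stand-in for /ð/.

```python
from statistics import mean, stdev


def f3_range(values):
    """Return (mean - std, mean + std) for a list of F3 measurements in Hz."""
    m, s = mean(values), stdev(values)
    return m - s, m + s


def f3_cutoff(f3_dh, f3_v):
    """Return a cutoff between the two ranges, or None if the ranges overlap.

    Assumes /dh/ lies above /v/ in F3, as reported for the vowel context.
    """
    dh_low, _ = f3_range(f3_dh)
    _, v_high = f3_range(f3_v)
    if dh_low <= v_high:
        return None  # ranges overlap, so no clean cutoff exists
    return (dh_low + v_high) / 2.0


# Hypothetical F3 onsets (Hz) for one speaker and one succeeding vowel.
f3_dh = [2650.0, 2700.0, 2750.0]
f3_v = [2350.0, 2400.0, 2450.0]
cutoff = f3_cutoff(f3_dh, f3_v)
print("cutoff F3 (Hz):", cutoff)
```

When measurements from several vowels or speakers are pooled, the two ranges are more likely to
overlap and the function returns None, which mirrors the observation that the cutoff is only
well defined once vowel and speaker are fixed.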
III.3.3 F3 and DeltaF3
In general, the separation of /ð/ vs. /v/ is not as definite and consistent as that of /ð/ vs.
/n/ and /ð/ vs. /d/ in the previous two contexts. For example, in the plots of each speaker for
vowel /i/, shown below in Figures 3-17 and 3-18, /ð/ and /v/ of each of the male speakers clearly
have different ranges of F3 and DeltaF3 and thus are separable (Figure 3-17); /ð/ has higher F3 but
lower DeltaF3 than /v/. The same parameters of /ð/ and /v/ from the same utterances made by the
female speakers, however, overlap significantly (Figure 3-18). Likewise, F3 and DeltaF3 values of
/ð/ and /v/ obtained in phrases succeeded by vowel /ɪ/ tend to cluster within the same ranges
(Figure A-18 in Appendix B). All speakers' /ð/ seem to have higher F3 than /v/, but their DeltaF3
values overlap significantly in F#1 and M#2 (Figures A-18a and A-18d). The overlap is even more
evident in utterances with succeeding vowel /æ/, shown in Figure A-16. For three of the four
speakers (F#1, F#2, and M#2), points of /ð/ and /v/ are mixed within the same region. The average
F3 and average DeltaF3 of each speaker, however, are consistent in that /ð/ has higher average F3
but lower average DeltaF3 than /v/ for all speakers. This difference agrees with the pattern in /ð/
and /v/ of M#1 (Figure A-16c), who is the only speaker whose /ð/ and /v/ show a clear distinction.
Therefore, despite a number of exceptions in individual enunciations, the average F3 and DeltaF3
are different enough for /ð/ and /v/ that they are useful parameters for distinguishing the two
consonants. Furthermore, the average F3 is higher in /ð/ than in /v/, the same observation made
during the primary work discussed in Chapter I. This consistency helps validate the results
obtained in this study of the characteristics of /ð/ and /v/ in the context of vowels.
Data from all three contexts have shown that the succeeding vowel substantially influences
formant frequencies. Thus, utterances of /ð/ and /v/ succeeded by similar vowels and spoken by the
same speaker are expected to have similar acoustic characteristics. Figure 3-19, however, shows
that F3 and DeltaF3 of the same consonant in similar contexts by the same speaker can actually be
quite different. Figure 3-19 includes F3 and DeltaF3 of two pairs of sentences that are succeeded
by similar vowels, /ɛ/ and /ə/. In one pair, with succeeding vowel /ɛ/, "may then" (sentence #11)
and "may vend" (sentence #3) are compared. In the other pair, with succeeding vowel /ə/, "to the
racial" (sentence #16) and "two veracious" (sentence #5) are compared. The difference is especially
noticeable for the male speakers. In Figure 3-20, the DeltaF3 values of /ð/ from similar contexts
(sentences #11 and #16) differ by almost 900 Hz, while the DeltaF3 values of /v/ from similar
contexts differ by around 500 Hz. Likewise, in Figure 3-21, both F3 and DeltaF3 of /v/ from similar
contexts (sentences #3 and #5) differ by around 400 Hz. At the same time, Figures 3-20 and 3-21, in
which the two pairs of sentences are graphed individually, indicate that within the same sentence
pair, /ð/ and /v/ show a separation similar to that seen in the sentences with other vowels.
Therefore, aside from the succeeding vowel, there may still be other subtle contextual factors
that influence the phonemes of interest, such as /ð/ and /v/.
Again, the more contextual information is known and taken into consideration, the more
accurate the recognition would be. In this particular case, sentences #3 and #11 both have the
consonant /n/ after the succeeding vowel /ɛ/, whereas sentences #5 and #16 have the consonant /r/
after the vowel /ə/. The consonant /r/ is known to strongly influence its preceding vowel. In this
case, /ə/ is altered by /r/ and thus shows F3 and DeltaF3 values that are uncharacteristic of a
natural /ə/. The goal is to determine whether the sentence contains /ð/ or /v/ based on the
consonant's effect on the succeeding vowel, but the vowel /ə/ in this case is affected not only by
the preceding consonant but by /r/ as well. The effect of /r/ has most likely caused the decreases
in F3 and DeltaF3 observed in the two sets of sentences in Figures 3-20 and 3-21.
In general, F3 is not as reliable for identifying /ð/ in the vowel context as F2 is in the stop
and nasal contexts. F3 is impossible to determine in some cases given the tools used. Furthermore,
as seen in the plots, F3 has more exceptions to the expected general trend than F2 does in the
nasal and stop contexts. For example, DeltaF3 is higher in /ð/ for succeeding vowel /ɪ/, which is
the opposite of the pattern for all the other vowels. Despite some inconsistency in F3, however,
the F2 values of /ð/ and /v/ are even less consistent. Thus F3 and DeltaF3 remain the most useful
characteristics found thus far for differentiating between /ð/ and /v/. F3 is found to be higher in
/ð/ than in /v/, while DeltaF3 is found to be lower in /ð/ than in /v/ in most cases.
Figure 3-16: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of a) female speakers, b) male speakers,
given succeeding vowel /i/.
Figure 3-17: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of individual male speakers a) M#1,
b) M#2, given succeeding vowel /i/. /ð/ and /v/ are more distinctly separated for each individual
speaker than for both speakers together (Figure 3-16b).
Figure 3-18: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of individual female speakers a) F#1,
b) F#2, given succeeding vowel /i/. /ð/ and /v/ remain difficult to separate even when the speakers
are analyzed individually.
Figure 3-19: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of individual male speakers a) M#1,
b) M#2, in two pairs of sentences succeeded by similar vowels (/ə/ and /ɛ/). Notice the large
variations in both F3 and DeltaF3 of /ð/ and /v/.
Figure 3-20: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of a male speaker (M#1) a) in sentences
#3 and #11 (succeeded by vowel /ɛ/), b) in sentences #5 and #16 (succeeded by vowel /ə/). Notice
the significant differences in F3 and DeltaF3 of both /ð/ and /v/ between the two plots.
Figure 3-21: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of a male speaker (M#2) a) in sentences
#3 and #11 (succeeded by vowel /ɛ/), b) in sentences #5 and #16 (succeeded by vowel /ə/). Notice
that decreases in F3 and DeltaF3 of both /ð/ and /v/ similar to those in Figure 3-20b are seen in
graph b).
Chapter IV: Conclusion and Future Work
IV.1 Conclusion
Substantial Variations In Formant Frequencies
Much variation in the formant frequencies of /ð/ and its close-sounding consonants is observed,
and such variation is mainly caused by three factors. First, gender affects formant frequencies:
corresponding frequencies are lower in male speakers than in female speakers by hundreds of
hertz. This difference is due to male speakers' longer vocal tracts, which resonate at lower
formant frequencies [6]. Second, the succeeding vowel substantially affects formant frequencies:
certain vowels generally have higher formant frequencies than others. This difference is caused
by differences in the shaping of the vocal tract and in the movements of the articulators when
producing different succeeding vowels. Third, within each gender, the acoustic characteristics of
individual speakers affect formant frequencies. Such speaker effects are generally smaller than
those of gender and succeeding vowel. These speaker-specific characteristics are mostly due to
small variations among speakers in the length of the vocal tract, the width of the palatal vault,
the arrangement of the teeth, etc.
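The effect of vocal tract length can be illustrated with the textbook idealization of the vocal
tract as a uniform tube closed at the glottis and open at the lips, whose resonances fall at odd
multiples of c/4L. The sketch below is only an illustration under that assumption, not a model
used in this study; the tube lengths (0.17 m and 0.14 m) and the speed of sound (350 m/s) are
round values assumed here, and the function name is hypothetical.

```python
SPEED_OF_SOUND_M_S = 350.0  # assumed round value, not taken from the study


def tube_formants(length_m, count=3, c=SPEED_OF_SOUND_M_S):
    """Resonances (Hz) of a uniform tube closed at one end and open at the other.

    Uses F_n = (2n - 1) * c / (4 * L) for n = 1..count.
    """
    return [(2 * n - 1) * c / (4.0 * length_m) for n in range(1, count + 1)]


longer_tract = tube_formants(0.17)   # assumed length for a longer vocal tract
shorter_tract = tube_formants(0.14)  # assumed length for a shorter vocal tract

for n, (lo, hi) in enumerate(zip(longer_tract, shorter_tract), start=1):
    print(f"F{n}: {lo:7.1f} Hz (0.17 m)   {hi:7.1f} Hz (0.14 m)   diff {hi - lo:6.1f} Hz")
```

Under these assumed lengths the corresponding resonances of the two tubes differ by hundreds of
hertz, the same order of magnitude as the gender difference described above.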
Consistent Differences Between /ð/ And Its Close-sounding Phonemes
Despite variations between genders and among speakers, the same utterance repeated by the
same speaker generally results in similar formant frequencies. Some general trends that are
observed are:
1. In the context of a nasal consonant, /ð/ almost always assimilates to the nasalization of a
preceding /n/. F2 of /ð/ is almost always less than that of /n/. F2 alone can distinguish /ð/
from /n/, but the combination of F2 and DeltaF2 separates the two consonants better. /ð/ has
lower F2 and higher DeltaF2 than /n/.
2. In the context of a stop consonant, /ð/ is almost always produced as a consonant with an
abrupt release, rather than as a fricative. F2 of /ð/ is almost always less than that of /d/. The
high-frequency amplitude of the burst is lower in /ð/ than in /d/. The difference between the
high-frequency amplitude and the mid-frequency amplitude (in dB) is usually negative for /ð/ but
positive for /d/. F2 alone can distinguish /ð/ from /d/, but the combination of F2 and amplitude
difference distinguishes the consonants more reliably.
3. In the context of vowels, F3 of /ð/ is higher than that of /v/. DeltaF3 of /v/ is higher than
that of /ð/ in most cases. /ð/ and /v/ are harder to distinguish than /ð/ and /n/ or /ð/ and
/d/.
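The three trends above can be written down as a small lookup table that records, for each
preceding context, the competing phoneme and the direction in which each cue of /ð/ differs from
it. The sketch below is one possible encoding, offered only as an illustration: the cutoffs are
speaker- and context-specific and must be supplied by the caller, the simple vote count is an
assumption of this example rather than a rule from the study, and the names and numeric values
are hypothetical. The label "dh" is an ASCII stand-in for /ð/.

```python
# Direction in which /dh/ differs from its rival in each preceding context,
# transcribed from the three trends listed above.
TRENDS = {
    "nasal": {"rival": "n", "cues": {"F2": "lower", "DeltaF2": "higher"}},
    "stop": {"rival": "d", "cues": {"F2": "lower", "AmpHighMid": "lower"}},
    "vowel": {"rival": "v", "cues": {"F3": "higher", "DeltaF3": "lower"}},
}


def cue_votes(context, measured, cutoffs):
    """Count how many cues fall on the /dh/ side of their cutoffs.

    `measured` and `cutoffs` map cue names to numbers (Hz or dB).
    Returns (votes_for_dh, total_cues).
    """
    cues = TRENDS[context]["cues"]
    votes = 0
    for name, direction in cues.items():
        if direction == "lower":
            votes += int(measured[name] < cutoffs[name])
        else:
            votes += int(measured[name] > cutoffs[name])
    return votes, len(cues)


# Hypothetical token and cutoffs for the vowel context.
votes, total = cue_votes(
    "vowel",
    measured={"F3": 2700.0, "DeltaF3": 100.0},
    cutoffs={"F3": 2550.0, "DeltaF3": 250.0},
)
print(f"{votes} of {total} cues favor /dh/ over /{TRENDS['vowel']['rival']}/")
```

A token for which only some of the cues favor /ð/ would need further evidence, which is where
additional parameters or a weighted combination of cues would come in.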
Algorithm For Identifying /ð/ Harder To Specify
Due to the influences of context and speaker, the identification criteria for parameters such as
F2 and F3 are mostly context- and speaker-based. Invariant cues have been harder to specify for a
given speaker regardless of context, and even harder for the population of an entire gender. The
separation of /ð/ and its close-sounding phonemes in the two-dimensional plots suggests that
contextual criteria based on the two parameters of the plots can be set for a given speaker and can
identify the intended /ð/ with good accuracy. To set better criteria, studying a larger database
of utterances and speakers would help by providing the recognizer with phonemic characteristics
that are more representative of the population to compare with measurements made on the speech
signals. Gathering additional acoustic measurements for individual speakers, however, would be
even more useful for improving speech recognition accuracy.
Human Speech Recognition May Rely Heavily On Contextual Information
Given that most cues found useful in identifying /ð/ are context-based and that they
become more effective with more information about the speaker, it is likely that human ears and
brains also rely heavily on contextual information during speech recognition. This conclusion
corresponds to the general observation that it is easier to recognize words in context than in
isolation and that it is easier to understand familiar speakers than unfamiliar ones. This
speculation on humans' heavy reliance on contextual cues in natural speech processing, however,
does not exclude the possible existence of invariant cues.
IV.2 Future Work
Study Larger Database
The current database of sentences and speakers shows promising distinctions between the
acoustic characteristics of /ð/ and its close-sounding consonants. However, some exceptions are
observed, which would most likely become statistically insignificant if a similar analysis were
done on a larger database. Even for the utterances that have consistently shown the expected
differences between the consonants, a larger database would provide more representative values for
the acoustic parameters and help establish more useful cues for distinguishing between /ð/ and its
close-sounding phonemes. Thus, studying more speakers, sentences, and repetitions would lead to
more accurate identification of intended utterances of /ð/ in continuous speech.
Search For Additional Cues
The more independent parameters are found that show differences between /ð/ and its
close-sounding phonemes, the more likely it is that a combination of such parameters would lead to
better recognition of /ð/. Additional parameters that could be useful include those in the spectra
of /n/ vs. /ð/ in the nasal context and those in the spectra of /ð/ vs. /v/ in the vowel context.
Normalize Speakers
Data have consistently shown that acoustic characteristics differ among speakers, and
such differences make it difficult for non-speaker-based cues to identify /ð/ correctly. Thus, some
calibration of characteristics between speakers would be essential for accurate recognition. Since
average F3 is correlated with the length of the vocal tract for a particular speaker, F3 can be
measured and averaged over time to normalize the differences in speakers' vocal tracts. The
possibility of incorrect recognition due to speaker differences can be further reduced by
pre-recording and pre-analyzing each speaker's common vowels, if the situation permits. This
pre-stored, speaker-specific information would help the recognizer take speaker-specific
characteristics into consideration, and thus interpret the measurements made in the speech signal
more accurately.
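A minimal sketch of one such normalization is given below. It assumes that a formant measurement
is expressed as a ratio to the speaker's long-term average F3; dividing by the average is a choice
made for this example, since the text only proposes that average F3 be used, not how. All values
are hypothetical and were picked so that the two speakers differ by a uniform scale factor.

```python
from statistics import mean


def normalize_by_mean_f3(formant_hz, speaker_f3_history):
    """Express a formant as a ratio to the speaker's long-term average F3."""
    return formant_hz / mean(speaker_f3_history)


# Hypothetical running F3 measurements (Hz) for two speakers.
speaker_a_f3 = [2400.0, 2500.0, 2600.0]  # lower formants overall
speaker_b_f3 = [2880.0, 3000.0, 3120.0]  # higher formants overall

# Hypothetical raw F2 values (Hz) of the same consonant from each speaker.
raw_a, raw_b = 1500.0, 1800.0

norm_a = normalize_by_mean_f3(raw_a, speaker_a_f3)
norm_b = normalize_by_mean_f3(raw_b, speaker_b_f3)
print(f"raw F2:        {raw_a:.0f} Hz vs {raw_b:.0f} Hz")
print(f"normalized F2: {norm_a:.3f} vs {norm_b:.3f}")
```

In this constructed example the raw values differ by 300 Hz but the normalized values coincide,
so a single cutoff on the normalized scale could serve both speakers. Real speakers would not
differ by an exact scale factor, so some residual speaker variation should be expected.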
Develop Recognition Algorithm Using Discriminant Analysis
Given a number of parameters that have different values for /ð/ and its close-sounding
phonemes, discriminant analysis can be used to assign coefficients to the various parameters based
on their relative usefulness in identifying /ð/. For example, in the 2-D plots of F2 and DeltaF2,
discriminant analysis can be used to determine the coefficients of F2 and DeltaF2 that specify the
line that best separates /ð/ and /n/ in the plot.
For this study, discriminant analysis is not practical because only two parameters are found
helpful for each context and the sample of utterances is too small for the analysis to be
statistically meaningful. But given a larger sample, from which more useful cues would hopefully
be found, discriminant analysis can be applied to sets of cues to help optimize the recognition
algorithm.
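To make the idea concrete, the sketch below applies Fisher's linear discriminant, one common form
of discriminant analysis, to two classes of (F2, DeltaF2) points. It is an illustration under
stated assumptions, not the algorithm of this study: the data points are hypothetical, the
threshold is placed at the midpoint of the two class means, and the function names are invented
for the example. The label "dh" is an ASCII stand-in for /ð/.

```python
def mean_vec(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def scatter(points, m):
    """Within-class scatter of 2-D points about mean m, as (sxx, sxy, syy)."""
    sxx = sum((p[0] - m[0]) ** 2 for p in points)
    sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
    syy = sum((p[1] - m[1]) ** 2 for p in points)
    return sxx, sxy, syy


def fisher_line(class_a, class_b):
    """Return (w, c) such that a point x is assigned to class_a when w.x > c."""
    ma, mb = mean_vec(class_a), mean_vec(class_b)
    axx, axy, ayy = scatter(class_a, ma)
    bxx, bxy, byy = scatter(class_b, mb)
    sxx, sxy, syy = axx + bxx, axy + bxy, ayy + byy  # pooled scatter Sw
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("pooled scatter matrix is singular")
    dx, dy = ma[0] - mb[0], ma[1] - mb[1]
    # w = Sw^-1 (ma - mb), using the closed-form inverse of a 2x2 matrix.
    w = ((syy * dx - sxy * dy) / det, (-sxy * dx + sxx * dy) / det)
    mid = ((ma[0] + mb[0]) / 2.0, (ma[1] + mb[1]) / 2.0)
    c = w[0] * mid[0] + w[1] * mid[1]
    return w, c


def is_class_a(x, w, c):
    """True if point x falls on the class_a side of the line w.x = c."""
    return w[0] * x[0] + w[1] * x[1] > c


# Hypothetical (F2 in Hz, DeltaF2 in Hz) tokens; /dh/ has lower F2, higher DeltaF2.
dh_tokens = [(1500.0, 300.0), (1550.0, 250.0), (1450.0, 350.0), (1600.0, 280.0)]
n_tokens = [(1900.0, 50.0), (1950.0, 100.0), (1850.0, 20.0), (2000.0, 80.0)]

w, c = fisher_line(dh_tokens, n_tokens)
print("coefficients (F2, DeltaF2):", w, "threshold:", c)
print("new token (1580, 260) is /dh/:", is_class_a((1580.0, 260.0), w, c))
```

The coefficients in w play the role described above: their signs and relative sizes indicate how
much each parameter contributes to separating the two consonants. With only a handful of points
per class, as here, the estimated line is not statistically reliable, which is the limitation
noted for the present data set.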
Test On Continuous Speech
Finally, the algorithm should be tested and fine-tuned on utterances of /ð/ in natural
sentences and, eventually, in continuous speech.
V. References
[1] Spoken Language Systems, http://www.sls.lcs.mit.edu/sls/whatwedo/applications/jupiter.html,
http://www.sls.lcs.mit.edu/sls/whatwedo/architecture.html#SUMMIT, 1998.
[2] Denes, P. B. On the Statistics of Spoken English. The Journal of the Acoustical Society of
America, volume 88, number 6, p 894, 897, 1963.
[3] Denes, P. B. and Pinson, E. N. The Speech Chain. New York: W.H. Freeman and Company,
1993.
[4] Manuel, S. Y. Speakers Nasalize /ð/ After /n/, but Listeners Still Hear /ð/. Journal of
Phonetics, 23, 453-476, 1995.
[5] Perkell, J. S. and Klatt, D. H., editors. Invariance and Variability in Speech Processes.
Hillsdale: Lawrence Erlbaum Associates, 1986.
[6] Stevens, K. N. Acoustic Phonetics. Cambridge: MIT Press, 2000.
VI. Appendices
Appendix A: Sentences Used In The Study
In the context of nasal consonant:
1. Every kid wants to win nose of Rudolf. (Sentence #1)
Every kid wants to win those orange dolls. (Sentence #17)
2. She tried to win Niece Wendy's toys. (Sentence #9)
She tried to win these Wendy's toys. (Sentence #20)
In the context of stop consonant:
1. Put those in the refrigerator. (Sentence #18)
Put dough in the refrigerator. (Sentence #8)
In the context of vowels:
1. Try to guarantee these prices. (Sentence #7)
Try to guard TV's Prices. (Sentence #12)
2. Point to the racial statements. (Sentence #16)
Repeat two veracious statements. (Sentence #5)
3. See this era of classical paintings. (Sentence #13)
See Victorian classical paintings. (Sentence #15)
4. Do not dye that indigenous curtain. (Sentence #14)
Put the dye vat in Disney's corner. (Sentence #6)
5. They are chosen via those. (Sentence #10)
They are chosen via votes. (Sentence #4)
6. She may vend Finny's Kiosk. (Sentence #3)
She may then find the kiosk. (Sentence #11)
Appendix B: Additional Results
Appendix B.1 Context of Nasal Consonant
Succeeding vowel /i/, male speakers.
Figure A-1: F2 and DeltaF2 of /ð/ vs. /n/ in nasal context of male speakers, given succeeding vowel
/i/. Notice that the regions of /ð/ and /n/ overlap.
Figure A-2: F2 and DeltaF2 of /ð/ vs. /n/ in nasal context of individual male speakers a) M#1, b)
M#2, given succeeding vowel /i/. Notice that /ð/ and /n/ are less separable for these male speakers
than they are for the female speakers (Figure 3-7).
Succeeding vowel /o/, male speakers.
Figure A-3: F2 and DeltaF2 of /ð/ vs. /n/ in nasal context of male speakers, given succeeding vowel
/o/.
Figure A-4: F2 and DeltaF2 of /ð/ vs. /n/ in nasal context of individual male speakers a) M#1, b)
M#2, given succeeding vowel /o/. Notice that /ð/ and /n/ become more separable (mostly by F2) for
each individual speaker than for both speakers together (Figure A-3), but the separation is not as
distinct as in similar plots for the female speakers (Figure 3-9).
Appendix B.2 Context of Vowel
Figure A-5: The average and standard deviation of F3 of /ð/ vs. /v/ in vowel context of a) female
speakers, b) male speakers, regardless of the succeeding vowel.
Figure A-6: The average and standard deviation of F3 of /ð/ vs. /v/ in vowel context of a) female
speakers, b) male speakers, given succeeding vowel /i/.
Figure A-7: The average and standard deviation of F3 of /ð/ vs. /v/ in vowel context, given
succeeding vowel /i/, of an individual male speaker (M#1). Notice that the overlap is
significantly smaller for a given speaker than for both male speakers together (Figure A-6b).
/ð/ vs. /v/, given succeeding vowels /ɛ/ and /ə/ (2 pairs of sentences: #3 and #11, and #5 and #16)
Figure A-8: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of a) male, b) female speakers in two
pairs of sentences succeeded by similar vowels (/ə/ and /ɛ/). Notice the wide ranges of F3 and
DeltaF3 for both /ð/ and /v/.
Figure A-9: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of individual female speakers a) F#1,
b) F#2, in two pairs of sentences succeeded by similar vowels (/ə/ and /ɛ/). Notice the large
variations in both F3 and DeltaF3 of /ð/ and /v/.
Figure A-10: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of a female speaker (F#1) a) in
sentences #3 and #11 (succeeded by vowel /ɛ/), b) in sentences #5 and #16 (succeeded by vowel /ə/).
Figure A-11: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of a female speaker (F#2) a) in
sentences #3 and #11 (succeeded by vowel /ɛ/), b) in sentences #5 and #16 (succeeded by vowel
/ə/). Notice the differences in F3 and especially in DeltaF3 between the two plots, similar to those
seen in the other speakers (Figures A-10, 3-20, and 3-21).
/ð/ vs. /v/, given succeeding vowel /o/ (sentences #10 and #4)
Figure A-12: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of a) female speakers, b) male
speakers, given succeeding vowel /o/.
Figure A-13: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of individual female speakers a) F#1,
b) F#2, given succeeding vowel /o/. Notice that /ð/ and /v/ are mixed in the same region.
Figure A-14: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of individual male speakers a) M#1, b)
M#2, given succeeding vowel /o/. Notice that /ð/ has higher F3 and lower DeltaF3 than /v/; the same
general pattern is seen in Figure A-13, despite its exceptions in individual enunciations.
/ð/ vs. /v/, given succeeding vowel /æ/ (sentences #14 and #6)
Figure A-15: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of a) female, b) male speakers, given
succeeding vowel /æ/.
Figure A-16: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of individual speakers a) F#1, b) F#2,
c) M#1, and d) M#2, given succeeding vowel /æ/. Notice that only c) shows a clear distinction
between /ð/ and /v/. The average values of /ð/ and /v/ in each plot, however, consistently show that
/ð/ has higher F3 and lower DeltaF3 than /v/.
/ð/ vs. /v/, given succeeding vowel /ɪ/ (sentences #13 and #15)
Figure A-17: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of a) female, b) male speakers, given
succeeding vowel /ɪ/. Notice that /ð/ tends to have higher F3 than /v/, but their DeltaF3 values
tend to be in the same range.
Figure A-18: F3 and DeltaF3 of /ð/ vs. /v/ in vowel context of individual speakers a) F#1, b) F#2,
c) M#1, d) M#2, given succeeding vowel /ɪ/. Neither parameter is very effective at separating /ð/
from /v/. In general, F3 is higher in /ð/. The average DeltaF3 of /ð/ seems slightly higher than that
of /v/, which is an exception to the trend of DeltaF3 observed for the rest of the vowels.