Implementing the Matcher in the Lexical Access
System with Uncertainty in Data
by
Roy Kyung-Hoon Kim
Submitted to the Department of Electrical Engineering and Computer
Science
in partial fulfillment of the requirements for the degree of
Master of Engineering in Electrical Engineering and Computer
Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June 2000
© Roy Kyung-Hoon Kim, MM. All rights reserved.
The author hereby grants to MIT permission to reproduce and
distribute publicly paper and electronic copies of this thesis document
in whole or in part.
Author: Department of Electrical Engineering and Computer Science
May 18, 2000

Certified by: Kenneth Stevens
Professor
Thesis Supervisor

Accepted by: Arthur C. Smith
Chairman, Department Committee on Graduate Students
Implementing the Matcher in the Lexical Access System
with Uncertainty in Data
by
Roy Kyung-Hoon Kim
Submitted to the Department of Electrical Engineering and Computer Science
on May 18, 2000, in partial fulfillment of the
requirements for the degree of
Master of Engineering in Electrical Engineering and Computer Science
Abstract
The goal of this thesis is to modify the existing matcher of the lexical access system
being developed at the Research Laboratory of Electronics so that it provides efficient
and accurate results with limited segmental information. This information, provided
by the speech signal processor, contains a set of sublexical units called segments
and a set of features to characterize each of them. The nature of a feature is to
describe a particular characteristic of a given segment. Previously the matching
subsystem demanded a complete set of segments and features for each spoken word.
Specifically, the speech signal processor was required to be without fault in its efforts
to detect all available landmarks and cues and to convert them into the segmentally
formatted data that the matcher recognizes. But this requirement for impeccability is
nearly impossible to meet and must be relaxed for a real-world lexical access system.
Overall, this new, modified matcher in the lexical access system represents a real-world application that anticipates and responds to imperfections in the given data.
More specifically, the modified matcher has the ability to translate a series of segments
with incomplete sets of features into possible utterances that the series may represent.
With this new matcher, an experiment was performed to initiate a process to identify
features with the most acoustic information. For a given set of incomplete segmental
representations, the results of the experiment showed that the output of the matcher,
or number of matched utterances, increases exponentially as the input of the matcher,
or number of speaker-intended words, increases linearly. But the results also show that as more
features are defined in these incomplete representations, the growth in the number of possible
utterances becomes less exponential and more nearly linear.
Thesis Supervisor: Kenneth Stevens
Title: Professor
Acknowledgments
First and foremost to my Lord Jesus.
To express my gratitude, with which words do I begin?
In my joy, you rejoiced with me.
In my despair, you wiped away my tears.
The past five years are yours, Jesus. I love you.
To my parents, Sun Chul Kim and Ae Im Kim.
I will forever cherish all the prayers
and encouragements you have showered upon me.
When I was ready to give up at MIT,
your gentle, yet strong, words revived my soul.
Um-ma, Ah-pa, sa-rhang-hae-yo.
To my brothers, Peter and Mark.
Peter, your passion is found in only a few.
Thanks for sharpening me with your zeal.
Mark, you have grown tremendously over the year.
Your increased love for God has blessed me greatly.
To Ken.
Thank you for your advice and support.
I think every professor at MIT should
learn from the way you teach and care
for your students.
To my KCF brothers and sisters.
Your constant encouragements and challenges
will be treasured forever.
I'm deeply saddened to depart.. keep running the race.
To the '99-'00 Outreach Team: James, Sunny, Sera, and Dave.
Through this tough, difficult year,
each one of you meant the world to me.
Thank you for your friendship and partnership.
To my AAA friends and ballers.
This past year would not have been
the same without you "low-budget" guys.
Seek truth. Live life. No more second place.
To Young, my precious friend.
Your prayers have sustained me this year.
Thanks, bro. Hey, more years to come..
Contents

1 Introduction                                                13

2 Background                                                  15
  2.1 The Speech Chain                                        15
  2.2 Basic Terminology of Lexical Access                     16
      2.2.1 Effects of Anatomical Structures                  16
      2.2.2 Features                                          20
      2.2.3 Landmarks and Segments                            24
      2.2.4 Examples                                          25
  2.3 The Lexical Access Project                              26

3 Original Matcher                                            29
  3.1 Standard Phonemes                                       31
  3.2 Lexicon                                                 31
  3.3 Linguistic Rules                                        32
  3.4 The Original Matching Process                           32
      3.4.1 The Overall Method                                32
      3.4.2 The Detailed Process of the Original Matcher      35

4 Modified Matcher                                            44
  4.1 Variability in the Speech Signal                        45
      4.1.1 Contextual Variations                             45
      4.1.2 Extra-linguistic Variations                       47
  4.2 Matcher-related Uncertainties                           47
  4.3 The Modified Matcher Process                            48
      4.3.1 Missing Features                                  49
      4.3.2 Uncertainty of First and Last Segments            53

5 Experiment with Partial Feature Segments                    60
  5.1 Experiment                                              61
  5.2 Data                                                    65
      5.2.1 Feature-level Characteristics                     66
      5.2.2 Word-level Characteristics                        71
  5.3 Conclusion                                              76

6 Summary                                                     77

A 248-Word Lexicon                                            79
B Standard Phonemes                                           81
C Rules: An Example                                           88
D Source Code for Tree-Growth                                 90
E Results for One-Word Tests                                  98
F Results for Two-Word Tests                                  102
G Results for Four-Word Tests                                 106
H Manual for Match3 Program                                   110
List of Figures

2-1 Speech Chain. This figure shows the basic human communication system (from
    Denes and Pinson, 1993).                                                         16
2-2 General anatomical structures of the vocal tract. The vocal system may be
    partitioned into four functional parts (from Keyser and Stevens, 1994).          17
2-3 Basic tree with anatomical structures. All the end nodes correspond to a
    particular physical structure.                                                   19
2-4 Tree with anatomical structures and their corresponding features.                20
2-5 Tree diagram for two phonemes, /n/ and /ey/.                                     26
2-6 Overall model of the lexical access system. The system is broken down into two
    subsystems: the speech processing subsystem and the matching subsystem.          27
3-1 Basic model of matcher. For the original matcher, the series of segments has to
    be error-free and all the possible features have to be detected.                 29
3-2 Model of matcher with necessary information. In addition to the series of
    segments as shown in figure 3-1, standard phonemes, linguistic rules, and the
    lexicon are also needed.                                                         30
3-3 Matching model with linguistic rules using index. Two lexical words that match
    the segments of the utterance at index 0 are "an" and "another".                 33
3-4 Tree model of the matching process. This model shows the progression of index
    for matched lexical words. Each node may be modeled as a subtree which may
    beget another subtree.                                                           34
3-5 Block diagram of standard template initialization. _segments contains an array
    of standard phonemes in data[i], where i is the index of the array. Each phoneme
    is defined further by the _segment and _bundle classes.                          36
3-6 Block diagram of lexicon initialization. This process is relatively complex
    because a lexicon consists of an array of words, where each consists of an array
    of one or more distinct pronunciations.                                          38
3-7 Block diagram of rule set initialization. The rule set contains an array of
    linguistic rules which need to be translated (or initialized) for the matcher to
    recognize.                                                                       39
3-8 Block diagram of the tree growing process. This process recursively builds a
    tree where each node represents a distinct index and a matched word. The
    recursion occurs in this figure when grow-aux2 calls on grow-aux, which calls
    grow-aux2 back.                                                                  41
4-1 Effects of consonants on the first- and second-formant characteristics of eight
    types of vowels, where each follows the consonant-vowel-consonant (CVC) format.
    These graphs show three types of consonants: velars (open circles), postdentals
    (open triangles), and labials (open circles). Vowels in isolation are also shown
    (black circles) [12].                                                            46
4-2 Feature set space model for the original matcher, in part (A), and the modified
    matcher, in part (B). Part (A) shows that the original matcher requires the two
    feature-set spaces to be exactly the same, while part (B) shows that the segment
    space can be a subset of the standard phoneme space.                             51
4-3 Feature-match model for the original matcher, in part (A), and the modified
    matcher, in part (B). In this figure, "phoneme" refers to a set of features used
    by the lexicon and "segment" refers to a set of features, complete or incomplete,
    detected from the speech signal.                                                 52
4-4 Set models for the output of the matcher with different categories as described
    in table 4.9. Note that category D is the output of the AND-function of
    categories B and C.                                                              59
5-1 Histogram for the results of the 1-C test. The number of matches in a cohort of
    the matcher is the independent variable. Much of the distribution is concentrated
    at one match per data sample.                                                    67
5-2 Histogram for the results of the 1-IC test. The number of matches in a cohort of
    the matcher is the independent variable. Much of the distribution is concentrated
    at five or fewer matches per data sample.                                        67
5-3 Histogram for the results of the 2-C test. The number of matches in a cohort of
    the matcher is the independent variable. The results are very similar to those of
    the 1-C test in figure 5-1.                                                      68
5-4 Histogram for the results of the 2-IC test. The number of matches in a cohort of
    the matcher is the independent variable. The results resemble an exponentially
    decreasing function where the frequency becomes insignificant as the number of
    matches approaches 20.                                                           69
5-5 Histogram for the results of the 4-C test. The number of matches in a cohort of
    the matcher is the independent variable. The distribution resembles that seen in
    the one-complete-word and two-complete-word tests.                               70
5-6 Histogram for the results of the 4-IC test. The number of matches in a cohort of
    the matcher is the independent variable. The results resemble an exponentially
    decreasing function where the frequency becomes insignificant as the number of
    matches approaches 400.                                                          70
5-7 A magnified version of the histogram in figure 5-6 where the range is less than
    200. The distribution is generally uniform when the number of matches is greater
    than 30. The unexpected impulse at 200 is the sum of all that comes after that
    point.                                                                           71
5-8 Two graphs where the independent variable is the number of intended words in
    the input and the dependent variable is the average number of matches in the
    cohort. The graph with squares corresponds to the expected behavior and the
    graph with circles corresponds to the real behavior.                             74
5-9 This graph shows the difference between the expected and the real graphs from
    figure 5-8. As the number of words increases for the input, the difference
    increases exponentially.                                                         75
List of Tables

2.1 List of features and their possible values. The features may be categorized into
    three distinct groups: articulator-free, articulator, and articulator-bound (from
    Maldonaldo's Master Thesis, 1999).                                               21
2.2 Four consonant types and their corresponding manner features.                    22
2.3 Two subsets of articulator features.                                             23
2.4 Feature matrix for two phonemes, /n/ and /ey/, as seen by the matcher.           25
2.5 A listing of the English vowels, glides, and consonants in the Lexical Access
    System. AF=affricates, FR=fricatives, SO=sonorants, ST=stops.                    28
3.1 A list of English phonemes used in the lexical access system.                    31
4.1 Two sample series of phonemes as valid outputs of the speech processor. Though
    the second is a subset of the first series, both represent the utterance,
    "baby can".                                                                      48
4.2 Three sample segments where segment 1 is a standard phoneme, /ae/. The other
    two segments are subsets of the first segment. RoC = Release or Closure.         49
4.3 Matched results of the three segments from table 4.2. From left to right, more
    phonemes match because fewer features exist to constrain the matching process.   50
4.4 Segment input to the modified matcher. Segment (A) has a complete set of
    features for /p/ while segment (B) is a subset of the first set.                 53
4.5 Output of the modified matcher with segment from table 4.4 (A) as input. All
    the words in this list contain the phoneme /p/.                                  54
4.6 Output of the modified matcher with segment from table 4.4 (B) as input. Due
    to fewer features to match in the segment, the matcher outputs more matched
    words than table 4.5.                                                            55
4.7 Two consecutive segments with complete feature-sets as the input to the modified
    matcher. The first segment represents the phoneme /p/, and the second segment
    represents the phoneme /iy/.                                                     55
4.8 Output of the modified matcher with two consecutive segments from table 4.7 as
    input. Most of the matched utterances are over a series of two words. In this
    example, we can conclude that if the first segment is known to be the beginning
    of a word, then there are no matches.                                            56
4.9 All combinations of (nil), '#' and '%' in the first and last segments and their
    meanings. In fact, category D acts as the original matcher where a series of
    segments represents a complete word(s).                                          57
4.10 Two sample inputs to the modified matcher where the positions of the specific
    segments are constrained. (A), by using '%', forces the matcher to find word(s)
    that end with segment 2. (B), by using '#', forces the matcher to find word(s)
    that begin with segment 1.                                                       57
4.11 Output of the modified matcher with two segments from table 4.10 (A) as input.
    As expected, the last two segments of these words are /k/ and /iy/.              58
4.12 Output of the modified matcher with two segments from table 4.10 (B) as input,
    where '#' is placed after "Time:" of the first segment.                          58
4.13 A possible input to the modified matcher which follows the constraints described
    in table 4.9 (D).                                                                59
5.1 A series of segments representing the word "as" for two different categories of
    inputs in the experiment. The segments in (A) are described with a full set of
    features, while the segments in (B) are described with an incomplete set of
    features. RoC = Release or Closure.                                              62
5.2 A list of the six tests of the experiment and their corresponding names used in
    this thesis.                                                                     62
5.3 The two series of segments represent two sequential words, "on day". The
    segments in (A) are described with the conventions of the first category of
    inputs and form a possible data sample for the 2-C test. The segments in (B)
    are described with the conventions of the second category of inputs and form a
    possible data sample for the 2-IC test. RoC = Release or Closure.                64
5.4 Mean, standard deviation (std. dev.), and maximum of the number of matched
    utterances of the data inputs for each of the tests in the experiment.           66
5.5 Two words with feature sets constrained by the second convention for inputs.
    (A) is a series of segments representing the word "do". (B) is a series of
    segments representing the word "add". RoC = Release or Closure.                  72
5.6 Two cohorts corresponding to the two individual series of segments from table
    5.5. (A) is the cohort for the series that intended to represent the word "do".
    (B) is the cohort for the series that intended to represent the word "add".      73
5.7 Two series of segments from table 5.5 combined into one series. RoC = Release
    or Closure.                                                                      73
5.8 The cohort corresponding to the series of segments in table 5.7.                 73
5.9 Two cohorts corresponding to the two individual series of segments representing
    the words "caution" and "able".                                                  76
A.1 In all the tests, the index numbers above are used to indicate specific words.   80
Chapter 1
Introduction
The Lexical Access Project, which is an ongoing effort by the Speech Communication
Group in the Research Laboratory of Electronics (RLE) at Massachusetts Institute
of Technology (MIT), is aimed at creating a knowledge-based, rule-governed speech
recognition system based upon the methods human beings use to produce speech. The
idea of recognizing speech in this fashion differs from current approaches to automatic
speech recognition. Unlike systems which rely primarily on statistical analysis and
complex Markov models for detection, this system relies on a set of acoustic cues
in speech which provide information about the articulatory movements in the vocal
tract.
The overall function of the lexical access system is to convert a sound signal into
a series of segments (which will be defined later), match them to a predefined lexicon,
and produce the intended utterance. In order to accomplish this function, the system
is partitioned into two subsystems: the speech processing subsystem and the matching
subsystem [13]. When the speech signal enters the system, the processor detects and
extracts acoustic cues from the speech signal and labels the features in accordance
to these cues. From the labeling process, it converts the existing format to a format
that is compatible with the segmental information in the lexicon. Finally, the matcher
takes the converted segments from the processor and compares them to each of the
words in the lexicon to compile a list of possible utterances.
The older version of the matching subsystem, called the "original" matcher, demands
a complete series of segments and feature-sets for each spoken word. In other
words, the speech signal processor is required to be without fault in its efforts to
detect all available acoustic cues and to convert them into a segmentally formatted
data that the matcher recognizes. But this constraint for impeccability is realistically
impossible to satisfy and must be relaxed for a real-world lexical access system.
This thesis presents a "modified" matcher which, unlike the original matcher,
tolerates some uncertainty in the signal analysis of the processing subsystem. Modification is necessary because the processor cannot detect and identify all the available
acoustic cues in the speech signal. More specifically, the newly designed matcher
allows the segments to be defined with incomplete sets of features. Also, the series of
segments is not required to fully complete an utterance in the modified matcher.
Chapter 2 familiarizes the reader with important background information required
to understand the basics of the lexical access project. First, this chapter provides a
general overview of the human communication system, in which its main focus is
on the four functional subsystems of the human vocal tract. Then it describes the
essential components of the lexical access system and how the anatomical knowledge
is utilized in the structure of the system.
The following chapter introduces the purpose and the design of the original matcher.
First, it discusses the essential building blocks of any effective matcher. Then it steps
through the specific algorithm of the original matcher to show how it behaves when
a series of segments is given as its input.
Chapter 4 explains the purpose and the design of the modified matcher. In the
beginning, it uncovers some of the unrealistic, yet inherent, assumptions that the
original matcher makes in its algorithm, thereby highlighting the need for a modified
algorithm. Then it discusses the specific design modifications which are
implemented to relax those assumptions.
Finally, chapter 5 describes an experiment that was performed with the modified
matcher, which is an initial step towards identifying the more valuable features. Then
the collected data are presented and interpreted.
Chapter 2
Background
2.1
The Speech Chain
The lexical access project is based upon speech producing methods that human beings
use. Therefore it is essential to understand the basics of how we communicate with
each other before we can learn in detail about this ongoing project. First we will
discuss the background information needed to fully understand the project, and then
analyze the lexical access system in further detail.
Figure 2-1 illustrates a set of events that describe the overall communication system
of a human being. According to this model, the speaker first arranges the thoughts
by accessing memory for a set of words. After translating these words into linguistic
units called phonemes (which will be defined soon), the brain sends impulses to the
motor nerves in the vocal tract. Then the sound waves generated by movements of
the articulators in the vocal tract and by the respiratory system are radiated and
propagated to the listener. As illustrated, the ears of the speaker act as a feedback
system to produce the most desired acoustic effect.
The listener's ears pick up the sound waves and send the signal through a complex filtering system to the brain. By comparing a set of perceived cues and their
corresponding features to an existing lexicon in the memory, the brain deciphers these
features into a set of possible words that the speaker intended. Finally these words
are stored in the listener's lexicon for future use [3].
Figure 2-1: Speech Chain. This figure shows the basic human communication system
(from Denes and Pinson, 1993).
Evidence has shown that at the linguistic level of speech production and perception, there are units of representation that are more fundamental than words, called
phonemes [2]. A phoneme is defined as a unit of speech that distinguishes one word
from another and is represented as a segment that consists of a particular set of characteristic features. A series of these segments forms the lexical model for a spoken
utterance.
2.2 Basic Terminology of Lexical Access
2.2.1 Effects of Anatomical Structures
The sounds that a person produces are a result of physical manipulation of the vocal
system.
In fact, an utterance is stored in memory as a sequence of words, and
each word in turn can be represented as a sequence of phonemes [2]. Each of these
phonemes can then be characterized with specific physical attributes, or features.
The physical aspects of the vocal tract allow some regions to operate independently
of the controls by other regions, and this attribute creates a type of classification of
features in groups [6]. More specifically, the human vocal tract can be broken up
into four functional regions that are generally independent of each other, as shown in
figure 2-2.

Figure 2-2: General anatomical structures of the vocal tract. The vocal system may
be partitioned into four functional parts (from Keyser and Stevens, 1994).
Region 1 represents the vocal folds, which are a dominant subsystem of the vocal
tract in producing speech. The vocal folds periodically interrupt the normal flow of
air from the lungs to the mouth by producing a sequence of air puffs whose frequency
is determined by the rate at which the glottis, or the space between the vocal folds, is
spread or constricted [11]. These "periodic interruptions", or oscillations, are directly
correlated to the fundamental frequency.
Also the frequency can be changed by
increasing the stiffness of the vocal folds. Not only are the vocal folds independent
from the rest of the system, but they can be modeled as a source of sound to the rest
of the vocal tract; therefore the vocal cords can be modeled as an oscillating source
to the system consisting of regions 2, 3, and 4.
If the vocal tract is modeled as described, then the system is dependent upon the
structural characteristics of regions 2, 3, and 4, and will output a signal in accordance
to their unique physical features. So we will now examine these regions and their
effects on the output of the system.
Region 2 consists of the laryngeal and pharyngeal pathways from the vocal folds
to the oral cavity. The larynx is a cartilaginous structure in which many ligaments,
including the vocal folds, are interconnected. The structure moves vertically in producing speech and in the act of swallowing. The pharynx is a tube-like structure
surrounded by a set of constrictor muscles which connect the larynx to the base of
the oral cavity of the mouth. To effectively alter an utterance, the cross-sectional
dimensions of the tube in this region can be widened or contracted. In other words,
manipulation of the cross-sectional area within the pharyngeal region contributes to
the acoustic filtering of the sound source [11].
Region 3 simply contains the soft palate, which is a flap of tissue used to cover or
uncover the velopharyngeal opening. If the flap is lowered, then part of the air from
the source escapes through the nasal cavity. Such activity of the soft palate modifies
the characteristics of the vocal tract such that the nasal cavity becomes parallel to
the rest of the vocal cavity. This modifies the acoustic filter by enhancing the low
frequency energy while weakening the high frequencies to cause a nasal sound [4]. If
the soft palate is not lowered, then the air travels directly along the cavity of the
mouth.
Region 4 represents the structures in the oral cavity such as the tongue and the
lips. More specifically, this region is described with the tongue blade, the tongue
body, and the lips. These anatomical parts are so important that they are given
a distinctive title, usually known as the "articulators". The tongue is a muscular
mass that can move in almost any direction to alter the filtering characteristics of the
vocal system. Though it is true that the tongue blade and the body are anatomically
attached, each has some independence in acting as needed. The lips can take three
different positions: protruded, rounded, and spread, but for our analysis, only
the rounded cases will be considered.
In our analysis above, general regions that complete the human vocal system have
been examined. To summarize the basic components of the anatomical makeup in a
more hierarchical model, figure 2-3 is provided.
Figure 2-3: Basic tree with anatomical structures. All the end nodes correspond to a
particular physical structure.
Figure 2-4: Tree with anatomical structures and their corresponding features.
2.2.2
Features
Each of the anatomical parts in figure 2-3 acts in certain ways, or has particular
features, to produce the desired sounds. In other words, each feature has a corresponding set of acoustic correlates, which are used to distinguish different segments
of sounds [10]. As described in the previous section, the vocal folds can be stiffened or
slackened while the glottis can be spread or constricted. Other anatomical parts such
as the soft palate can be lowered or raised, causing the sound to be nasal or not. The
rest of the vocal structures and their physical features are illustrated in figure 2-4.
Some of the labels for the features are abbreviated, and a fuller description can be
found in table 2.1.
To assign values to each of the features, binary symbols of +/- are utilized. For
example, [- nasal] represents the fact that the sound was not nasal because the soft
palate was not lowered, while [+ nasal] represents the exact opposite. [+back] corresponds to the tongue body being pulled back from its rested position. Table 2.1 lists
Articulator-Free                 Articulator     Articulator-Bound                               Values
vowel                                                                                            +
glide                                                                                            +
consonant                                                                                        +
sonorant, continuant, strident                                                                   +/-
                                 vocal folds     stiff, slack                                    +/-
                                 glottis         spread, constricted                             +/-
                                 pharynx         advanced tongue root, constricted tongue root   +/-
                                 soft palate     nasal                                           +/-
                                 body            high, low, back                                 +/-
                                 blade           anterior, distributed, lateral, rhotic          +/-
                                 lips            round                                           +/-

Table 2.1: List of features and their possible values. The features may be categorized
into three distinct groups: articulator-free, articulator, and articulator-bound (from
Maldonaldo's Master Thesis, 1999).
the rest of the features that are used in the lexical access system and their possible
values.
Articulator-Free Features
A distinct category of features was not discussed in the previous sections. This category, consisting of the features listed under the "articulator-free" section in table 2.1,
relays information about the type of sound and the manner of articulation without
reference to a particular articulator. More specifically, the articulator-free features
show whether a constriction is made in the vocal tract, and if so, how it was made.
In fact, each phoneme is required to have one of the following articulator-free features
Consonant Type   Sonorant   Continuant   Strident
affricate        -          -            +
fricative        -          +            +/-
sonorant         +          -            -
stop             -          -            -

Table 2.2: Four consonant types and their corresponding manner features.
assigned a positive value: vowel, glide, or consonant.
If a phoneme is characterized with [+vowel], the root node is the dominant node
because all of the four regions can actively affect the production of sound for vowels,
while for some phonemes, such as /w/, the glide node is the dominant node. For
others, the consonant node is the dominant node where, by forming constrictions in
the oral cavity, mainly the articulators in Region 4 define the acoustic characteristics.
The other three features under the "articulator-free" column in table 2.1 are used
to distinguish the different types of consonants.
As listed in table 2.2, sonorant,
continuant, and strident features are used to assign a consonant to one of its four
possible types: affricate, fricative, sonorant, and stop. The sonorant feature refers
to whether or not there is a buildup of pressure behind a constriction in region 4 of
figure 2-2. [+sonorant] shows that there is a continuation of low frequency amplitude
at its closure while [-sonorant] shows that there is a decrease in the low frequency
amplitude. [+continuant] corresponds to the fact that there is only a partial closure
in the vocal tract while [-continuant] represents a complete closure in the oral cavity.
Finally [+strident] means that there are exceptionally strong fricative noises.
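
To make this concrete, the short sketch below (illustrative only, not part of the matcher's source code) classifies a consonant from its three manner features under one conventional reading of table 2.2; the type and function names are hypothetical.

#include <iostream>
#include <string>

enum class ConsonantType { Affricate, Fricative, Sonorant, Stop };

std::string name(ConsonantType t) {
    switch (t) {
        case ConsonantType::Affricate: return "affricate";
        case ConsonantType::Fricative: return "fricative";
        case ConsonantType::Sonorant:  return "sonorant";
        default:                       return "stop";
    }
}

// Classify a [+consonant] segment from its manner features: [+sonorant]
// consonants are the sonorants, [+continuant] obstruents are fricatives, and
// [-continuant] obstruents are affricates when [+strident] and stops otherwise.
ConsonantType classify(bool sonorant, bool continuant, bool strident) {
    if (sonorant)   return ConsonantType::Sonorant;
    if (continuant) return ConsonantType::Fricative;
    return strident ? ConsonantType::Affricate : ConsonantType::Stop;
}

int main() {
    std::cout << name(classify(true,  false, false)) << "\n";   // e.g. /n/  -> sonorant
    std::cout << name(classify(false, true,  true))  << "\n";   // e.g. /s/  -> fricative
    std::cout << name(classify(false, false, true))  << "\n";   // e.g. /ch/ -> affricate
    std::cout << name(classify(false, false, false)) << "\n";   // e.g. /p/  -> stop
}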
Articulator Features
The second of the three categories, represented in the second column of table 2.1, is
titled the "articulator" features. This category, which includes the blade, the body,
lips, pharynx, soft palate, vocal folds, and glottis, shows what anatomical structures
are used to produce particular sounds.
Primary Articulators   Secondary Articulators
Tongue Blade           Pharynx
Tongue Body            Soft Palate
Lips                   Vocal Folds
                       Glottis

Table 2.3: Two subsets of articulator features.
This category may be partitioned into two subsets to better understand the functional descriptions of the features. The two subsets of the "articulator" features are
seen in table 2.3. The primary articulators, consisting of the tongue blade, the tongue
body, and the lips, play a major role in defining the characteristics of the sound. The
secondary articulators, consisting of the pharynx, soft palate, vocal folds, and glottis,
also play an active role, though not to the same extent as the primary articulators [6].
Articulator-Bound Features
The third of these categories in table 2.1 is called the "articulator-bound" features
and its responsibility is to describe how the articulators are positioned and shaped
to produce particular sounds. For example, [+rounded] indicates that a segment
was produced by rounding the articulator called lips. The tongue blade has four
distinctive features: anterior, distributed, lateral, and rhotic. [+anterior] refers to
when the tip of the tongue makes contact with the anterior portion of the alveolar
ridge. [+distributed] indicates that a broader part of the tongue touched the alveolar
ridge. [+lateral] refers to the situation when the air flows around the side of the
tongue. And [+rhotic] is very similar to [+lateral], yet with a different shaping of the
tongue.
The three features that are associated with the tongue body are high, low, and
back. High and low refer to the vertical characteristics of the tongue while back
refers to its horizontal position. A couple of other articulator-bound features are
nasal and round, both of which are described earlier in detail. Advanced tongue root
and constricted tongue root features indicate whether the phoneme is tense or lax.
Constricted glottis and spread glottis indicate the state of glottis and play a vital role
in producing sounds such as /h/. And stiff vocal folds and slack vocal folds directly
dictate whether the consonant will be voiced or unvoiced.
2.2.3
Landmarks and Segments
As we noted earlier in the section describing the speech chain, the listener decodes
speech by identifying the phonemes through acoustic cues in the signal. Some of these
cues mark salient acoustic events, technically known as landmarks, while others signal
articulator-bound features. Depending on the characteristics of the landmarks, they
are classified into one of three groups: vowels, glides, or consonants.
Vowels are produced when there is no narrowing effect in the vocal tract. By
studying the formants, which are related to the natural frequencies of the vocal tract, the
landmark can be located. More specifically, the time associated with a vowel landmark
is marked near the place where the amplitude of the first formant (F1) is a maximum
[1].
There are only four glides in the lexical access system. These glides are produced
with an intermediate narrowing of the vocal tract, which keeps them from exhibiting
abrupt changes in the formants. In the lexical access machine, their landmarks are
chosen at times where the signal amplitude is a minimum [1].
Consonants are recognized by their closure and a subsequent release of energy in
the vocal tract, causing abrupt spectral changes. This effect is produced when the
vocal tract is almost or completely closed and then released.
Each of these landmarks, or parts of landmarks, identifies a particular segment,
or a phoneme. Along with the articulator-free features, each segment is identified
with its articulator-bound features, based on the hypothesis that segments are stored
in human communication systems in the form of discrete classes of features.
(A) /n/                      (B) /ey/
Time: (nil)                  Time: (nil)
Symbol: n                    Symbol: ey
Prosody: (nil)               Prosody: (nil)
Features:                    Features:
  + consonant                  + vowel
  - continuant                 + body
  + sonorant                   - high
  + blade                      - low
  + anterior                   - back
  - distributed                + adv-tongue-root
  + nasal                      - const-tongue-root

Table 2.4: Feature matrix for two phonemes, /n/ and /ey/, as seen by the matcher.
2.2.4
Examples
The features associated with a phoneme have traditionally been represented as an
array, or matrix, of feature values [2]. In this section, all the previous information
will be brought together into a couple of examples of phonemes so that we have a
better idea of how they can be modeled in feature matrices.
The first phoneme we will analyze is /n/, which is shown in table 2.4 (A). As seen
in this feature matrix, [+
consonant] is one of the articulator-free features, which
means that the main node is at the consonant node. The next two features describe
the rest of the articulator-free features. The articulator feature in this matrix is [+
blade]. [+ anterior], [- distributed], and [+ nasal], which are the rest of the features,
show the articulator-bound features of /n/. This is also seen in figure 2-5.
The second phoneme we will analyze is /ey/, which is also shown in table 2.4 (B).
From the table, we see that its main node is the root, or the vowel, node. Therefore,
there are no more articulator-free features, and the next feature is [+ body], which is
an articulator. The remaining features are articulator-bound features. This is illustrated
in figure 2-5.
Figure 2-5: Tree diagram for two phonemes, /n/ and /ey/
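
As a rough illustration of how such a feature matrix might be held in code, the sketch below builds the two bundles of table 2.4 as maps from feature names to binary values. This is a simplified stand-in, not the matcher's actual _segment or _bundle class.

#include <map>
#include <string>

// A segment modeled as a symbol plus a bundle of binary features.
// Features absent from the map are simply unspecified.
struct Segment {
    std::string symbol;
    std::map<std::string, bool> features;   // true = '+', false = '-'
};

int main() {
    // /n/ as in table 2.4 (A)
    Segment n{"n", {
        {"consonant", true}, {"continuant", false}, {"sonorant", true},
        {"blade", true}, {"anterior", true}, {"distributed", false},
        {"nasal", true}}};

    // /ey/ as in table 2.4 (B)
    Segment ey{"ey", {
        {"vowel", true}, {"body", true}, {"high", false}, {"low", false},
        {"back", false}, {"adv_tongue_root", true}, {"const_tongue_root", false}}};

    (void)n; (void)ey;   // in the real system these bundles are handed to the matcher
}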
2.3
The Lexical Access Project
Now that the basic theoretical information has been presented, we are ready to better
understand and appreciate the lexical access project. Figure 2-6 depicts a basic block
diagram of the lexical access system. Overall, this system interprets a sound signal
into a set of segments and matches them to a predefined lexicon. Although this may
sound simple, there are many complicated factors that affect the performance and
the output of this machine, as we shall shortly see. To accomplish this general task,
the system must first find all the landmarks and look for the distinguishing frequency
characteristics to identify which of the three articulator-free features it matches to:
vowel, glide or consonant. After a landmark is detected, the signal processing subsystem looks around the vicinity of the landmark to gather more information about
the particular segment.
After all the features are detected and given a value, the signal processing subsystem converts all the gathered information into a list of features for the particular
segment. These segments need to be converted to formats that are recognizable to
the matcher. Then this information, which will correspond to one of the phonemes
(or more than one if a partial representation of the features makes up the phoneme) in
Figure 2-6: Overall model of the lexical access system. The system is broken down
into two subsystems: speech processing subsystem and the matching subsystem.
table 2.5, is sent to the matching subsystem. This process can be seen in figure 2-6.
Using the processed information from the previous subsystem, the matcher attempts to recreate the utterance of the speaker. The output of the matcher may
consist of more than one possible utterance if more than one matches the descriptions of the input. As we shall see later, the matching process is a complex process
with problems that are speaker-dependent.
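
The data flow of figure 2-6 can be summarized with a few placeholder types and stub functions; every name below is hypothetical, since the two subsystems are separate programs whose interfaces are not spelled out in this chapter.

#include <vector>

// Placeholder types for the stages in figure 2-6.
struct SpeechSignal { /* sampled waveform */ };
struct Landmark     { /* time and landmark class: vowel, glide, or consonant */ };
struct Segment      { /* bundle of detected features around one landmark */ };
struct Utterance    { /* one candidate sequence of lexical words */ };

// Stub stages standing in for the speech processing subsystem and the matcher.
std::vector<Landmark>  detectLandmarks(const SpeechSignal&)            { return {}; }
std::vector<Segment>   detectFeatures(const SpeechSignal&,
                                      const std::vector<Landmark>&)    { return {}; }
std::vector<Utterance> match(const std::vector<Segment>&)              { return {}; }

// End-to-end flow: signal -> landmarks -> feature-labeled segments -> cohort.
std::vector<Utterance> lexicalAccess(const SpeechSignal& s) {
    std::vector<Landmark> landmarks = detectLandmarks(s);
    std::vector<Segment>  segments  = detectFeatures(s, landmarks);
    return match(segments);
}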
Vowels:          /iy/ /ih/ /ey/ /eh/ /ae/ /aa/ /ao/ /ow/ /ah/ /uh/ /uw/ /rr/ /er/ /ex/
Glides:          /h/ /w/ /y/ /r/
Consonants-AF:   /ch/ /dj/
Consonants-FR:   /f/ /th/ /s/ /sh/ /v/ /dh/ /z/ /zh/
Consonants-SO:   /l/ /m/ /n/ /ng/
Consonants-ST:   /b/ /d/ /g/ /k/ /p/ /t/

Table 2.5: A listing of the English vowels, glides, and consonants in the Lexical Access
System. AF=affricates, FR=fricatives, SO=sonorants, ST=stops.
Chapter 3
Original Matcher
The original matcher, which was developed by Zhang (1998) and Maldonaldo (1999),
is the older version of the matching subsystem. Before discussing the newer version
of the matcher, it may be useful to understand the details of the original matcher. In
this chapter, we will give a general description of how the original matcher functions.
The basic function of the matcher is to take an input represented by a series of
segments from the speech processing subsystem, and to produce a list of possible
utterances which best represent the input data, as shown in figure 3-1.
Although
the basic system appears simple in functionality, there are additional aspects of the
matching subsystem which act as sources of many complications.
One of the main features of the original matcher is that it requires the speech
Figure 3-1: Basic model of matcher. For the original matcher, the series of segments
has to be error-free and all the possible features have to be detected.
Figure 3-2: Model of matcher with necessary information. In addition to the series of
segments as shown in figure 3-1, standard phonemes, linguistic rules, and the lexicon
are also needed.
processor subsystem to be perfect and error-free in its analysis of the continuous
speech signal.
More specifically, for the original matcher to consider the data as
valid, the speech processor has to detect all the segments and all the features therein.
Although this is an unrealistic constraint for any matcher to place upon the speech
processor, the original matcher cannot function without it. This will be discussed
later in greater detail.
After a speech signal has been properly labeled and converted into discrete lexical representations that perfectly and completely describe all the cues around each
of the landmarks, the original matching process is ready to begin. But before the
actual implementation takes place, the matcher needs three other different types of
information: a list of linguistic rules which are to be applied during the matching, a
list of standard phonemes, and a lexicon containing a given list of words with their
corresponding phonemic representations. To illustrate these requirements, a modified
version of figure 3-1 is shown in figure 3-2. In this new diagram, we note that there
are four types of inputs needed for the matcher to perform its duties. In the next few
sections, we will describe the three new types of information: the standard phonemes,
the lexicon, and the list of linguistic rules.
iy  ih  ey  eh  ae  aa  ao  ow  ah  uh  uw  rr  er  x
h   w   y   r   l   m   n   ng
f   th  s   sh  v   dh  z   zh  ch  dj
b   d   g   k   p   t

Table 3.1: A list of English phonemes used in the lexical access system.
3.1
Standard Phonemes
In any matching algorithm, the data must be compared to a defined reference.
Since the input of the matcher is formatted to a series of segments, or phonemes, the
matching process is much simplified if the reference is formatted as a list of phonemes.
In fact, the standard reference for the original matcher is exactly that: a list of all
the possible phonemes in English. A full list of these phonemes is given in table 3.1.
Because the standard list of phonemes is a reference for the matcher, each phoneme
has to be perfectly described. In other words, each phoneme should have values for
all necessary features. No errors of any kind should exist in the description of any of
the phonemes.
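
Because the standard template is the reference against which all input segments are compared, a natural sanity check is that every feature of every standard phoneme carries an explicit value. The helper below is a hypothetical sketch of such a check, not code taken from the thesis.

#include <map>
#include <string>

// True only if every feature in the bundle is explicitly '+' or '-'.
// A standard phoneme with any unspecified feature would be an invalid reference.
bool isFullySpecified(const std::map<std::string, char>& featureBundle) {
    for (const auto& entry : featureBundle)
        if (entry.second != '+' && entry.second != '-') return false;
    return true;
}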
3.2
Lexicon
Not only does the matcher compare individual input segments to the list of standard
phonemes, but it also places them sequentially to produce particular English words.
Since the output will most likely be a series of words, the matcher needs a reference of
possible words where each word is a specific set of phonemes. During the development
stages of this subsystem in the lexical access system, a small lexicon of 248 words is
used. Appendix A has a copy of this lexicon with a list of words.
3.3
Linguistic Rules
A set of rules may be essential for an effective and useful lexical access speech recognizer in the real world. If the speaker were to enunciate every syllable very carefully
in the most proper manner, then these rules would not be necessary
in recognizing the utterances. But when the segments occur in contexts, particularly
in more complex sequences within a syllable or in sentences that are produced in a
conversational or casual style, their articulatory and acoustic attributes may be modified [11]. Due to these modifications, the matcher needs to take appropriate action
as defined by the linguistic rules of casual speech.
For example, let us consider a series of words, "bat man".
In casual speech,
a speaker who intended to say those words may actually utter "bap man".
This
alteration occurs because the /t/ in "bat" assimilates to a /p/ by assuming the place
of articulation of the /m/ in the "man".
Since these cases are prevalent in any
language, the lexical access machine is required to recognize them and adjust the
matching process accordingly; therefore linguistic rules are necessary.
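
To make the rule mechanism concrete, here is a toy version of the "bat man" place-assimilation rule, written as a function that adds an alternate pronunciation; the representation and names are illustrative and do not follow the matcher's actual rule format (appendix C shows the real format).

#include <string>
#include <vector>

// A pronunciation as a sequence of phoneme symbols (illustrative).
using Pronunciation = std::vector<std::string>;

bool isCoronalStopOrNasal(const std::string& p) {
    return p == "t" || p == "d" || p == "n";
}
bool isLabial(const std::string& p) {           // [+lips] consonants
    return p == "p" || p == "b" || p == "m";
}

// If a word ends in /t/, /d/ or /n/ and the next word begins with a [+lips]
// consonant, add an alternate pronunciation whose final segment has taken on
// the labial place of articulation (e.g. "bat" + "man" -> "bap man").
std::vector<Pronunciation> applyPlaceAssimilation(const Pronunciation& word,
                                                  const std::string& nextInitial) {
    std::vector<Pronunciation> variants{word};
    if (!word.empty() && isCoronalStopOrNasal(word.back()) && isLabial(nextInitial)) {
        Pronunciation alt = word;
        if (alt.back() == "t")      alt.back() = "p";
        else if (alt.back() == "d") alt.back() = "b";
        else                        alt.back() = "m";   // /n/ -> /m/
        variants.push_back(alt);
    }
    return variants;
}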
3.4 The Original Matching Process
3.4.1 The Overall Method
In the first stage of the process, also known as initialization, all of the four input files
are translated into feature matrices that the matcher can recognize and manipulate.
After reading these files into its data structure, the matcher modifies the lexicon by
using the given rule set to account for possible linguistic changes within the observed
speech. In other words, once initialization is completed, the lexicon is expanded by
applying the rule set and adding new pronunciations to words that could have arisen
as a result of some fluent speech modification at word boundaries [8]. In order to
understand the matching process more completely, let us step through an example
from Maldonaldo's thesis [9].
Suppose that the result of the labeling and conversion process demonstrates a
Figure 3-3: Matching model with linguistic rules using index. Two lexical words that
match the segments of the utterance at index 0 are "an" and "another".
sequence of phonemes corresponding to the phrase, "another ape back in power".
Although we will refer to the bundles by their phonemic symbols for the sake of
simplicity, this phrase is realistically represented by a sequence of bundles of features
and not their symbols. Also suppose that the lexicon consists of these following words:
[x n] ("an"), [x n ah dh x r] ("another"), [ey p] ("ape"), [b ae k] ("back"), [ih n] ("in"),
[p aa w rr] ("power"), and [n aa y n t iy n] ("nineteen").
Finally suppose that the
matcher is informed of only one linguistic rule which states that if a word ends in a
/d/, an /n/ or a /t/, and is followed by a consonant produced by [+lips] such as /p/,
then the last segment of the previous word can also gain [+lips] features.
First a temporary lexicon is created by applying the rule set to the existing lexicon.
As illustrated in figure 3-3, each lexical word is lined up, segment by segment, against
the output of the signal processing unit, which is the line across the top. Then each of
these words is tested for any possible linguistic modifications, and from the example
above, only "nineteen" ends with an /n/ and is followed by a /p/. Therefore a new
set of phonemes is added to "nineteen" in the temporary lexicon, in which the last
/n/ has [+ lips].
After the temporary lexicon has been created, the sequence of the bundles of
features is compared to each of the words in the temporary lexicon. As seen above,
Figure 3-4: Tree model of the matching process. This model shows the progression
of index for matched lexical words. Each node may be modeled as a subtree which
may beget another subtree.
at index 0, only two words match segment by segment: "an" and "another". From
here, the temporary lexicon is completely destroyed and two branches corresponding
to these two matched words are created as shown in figure 3-3. Then subtrees from
these two branches are produced through iterations of the same matching process
that is described above. For example, a temporary lexicon is created to begin a new
matching process as a result of the match of "an". But the tree created by this match
fails to complete because no word in the lexicon matches from index 2 onward.
But the tree created by the match of "another" will complete a sequence of matches
and will reconstruct the phrase "another ape back in power."
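
The index bookkeeping of figures 3-3 and 3-4 can be sketched as follows. At a given index, every lexicon entry whose phonemes match the incoming segments one-for-one spawns a branch that restarts the search at the index advanced by the word's length. This simplified stand-in compares phoneme symbols rather than feature bundles and omits the rule-expanded temporary lexicon; run on the example above, it prints the single parse "another ape back in power".

#include <iostream>
#include <string>
#include <vector>

using Phonemes = std::vector<std::string>;
struct LexEntry { std::string word; Phonemes phones; };

// True if 'entry' matches the utterance segment-by-segment starting at 'index'.
bool matchesAt(const Phonemes& utterance, const LexEntry& entry, size_t index) {
    if (index + entry.phones.size() > utterance.size()) return false;
    for (size_t i = 0; i < entry.phones.size(); ++i)
        if (utterance[index + i] != entry.phones[i]) return false;
    return true;
}

// Recursively extend a partial parse; print complete parses of the utterance.
void grow(const Phonemes& utterance, const std::vector<LexEntry>& lexicon,
          size_t index, std::vector<std::string> parsed) {
    if (index == utterance.size()) {               // whole utterance covered
        for (const auto& w : parsed) std::cout << w << ' ';
        std::cout << '\n';
        return;
    }
    for (const auto& entry : lexicon)
        if (matchesAt(utterance, entry, index)) {  // each match spawns a subtree
            auto next = parsed;
            next.push_back(entry.word);
            grow(utterance, lexicon, index + entry.phones.size(), next);
        }
}

int main() {
    std::vector<LexEntry> lexicon = {
        {"an", {"x", "n"}}, {"another", {"x", "n", "ah", "dh", "x", "r"}},
        {"ape", {"ey", "p"}}, {"back", {"b", "ae", "k"}}, {"in", {"ih", "n"}},
        {"power", {"p", "aa", "w", "rr"}},
        {"nineteen", {"n", "aa", "y", "n", "t", "iy", "n"}}};
    Phonemes utterance = {"x", "n", "ah", "dh", "x", "r", "ey", "p",
                          "b", "ae", "k", "ih", "n", "p", "aa", "w", "rr"};
    grow(utterance, lexicon, 0, {});   // prints: another ape back in power
}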
3.4.2
The Detailed Process of the Original Matcher
In the previous section, the function of the overall matching subsystem was described
and a general understanding of its algorithm was presented. In this section, a more
detailed examination of this subsystem is provided. To understand the algorithm
that the original matcher utilizes, we will analyze these four processes: initialization
of the standard template, initialization of the lexicon, initialization of the linguistic
rules, and growing of the matcher tree. Though the details may be complex, this
information may be helpful to those who want to understand the matching process
at the source code level.
Initialization of Standard Template
One of the pieces of input information from figure 3-2 is the standard template, which
is also in appendix B. But before the matcher is able to use this information, it first
needs to translate and store the information into a specific data structure that is
recognized. This beginning process is formally called the template initialization and
is the first block in figure 3-5.
Since the matcher is programmed in C++, the names of each of the boxes represent
classes and the arrows represent transition from one object to another. To better
understand the matcher's specific structure, such as the types of classes and objects,
Figure 3-5: Block diagram of standard template initialization. _segments contains
an array of standard phonemes in data[i], where i is the index of the array. Each
phoneme is defined further by the _segment and _bundle classes.
please refer to Maldonaldo's master's thesis on pages 21 and 22. But for our purpose
of analysis, we will take this structure as given.
When the command to initialize is executed, the matcher jumps to a class called
_segments to define the object for the list of segments.
Within this object, a set of
header variables are assigned. When all the variables for the template are set, then
the matcher is ready to initialize each of the standard phonemes that the template
contains. This initialization algorithm is implemented with a loop that runs until
each of the individual phonemes is defined in the matcher and stored in an array
called data[i].
To define each phoneme in data[i], the _segments class calls upon the _segment class to
do some basic work. First, an object is created with specifically assigned variables,
and then this object calls upon _bundle class to finish the initialization of a phoneme.
Finally _bundle defines some variables and gives values to the individual features for
the particular phoneme. This whole process is repeated for each of the standard
phonemes in the template.
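
A rough picture of the data layout that this initialization builds is sketched below, using simplified stand-ins for the _segments, _segment, and _bundle classes; the real classes, described in Maldonaldo's thesis, carry considerably more state.

#include <map>
#include <string>
#include <vector>

// Simplified stand-ins for the matcher's classes (illustrative only).
struct Bundle {                       // _bundle: feature names -> '+', '-', 'x', ...
    std::map<std::string, char> features;
};

struct Seg {                          // _segment: one standard phoneme
    std::string symbol;               // e.g. "n", "ey"
    Bundle bundle;
};

struct Segments {                     // _segments: the whole standard template
    std::vector<Seg> data;            // data[i] is the i-th phoneme in the file

    // Loop of figure 3-5: read the headers, then add one phoneme per iteration
    // until the end of the template file is reached.
    void addPhoneme(const std::string& symbol,
                    const std::map<std::string, char>& features) {
        data.push_back(Seg{symbol, Bundle{features}});
    }
};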
Initialization of the Lexicon
The structure design for initialization of the lexicon is different from that of the
standard template in that the lexicon is an array of classes. To help understand this
initialization process, refer to figure 3-6 while reading this section.
To generalize this type of structure, an Array template is utilized for a generic
array class in our code. When initialization of the lexicon is called upon, the matcher
begins the process with a class called -lexicon. First all the variable assignments
are completed. Also the Array template is initialized so that this object can contain
a defined array of lexical words and their phonemic descriptions. When all the initialization for _lexicon is finished, then this process executes a loop for each of the
lexical words by iteratively calling a class called _lex-word.
Since there may be more than one possible pronunciation for each of the words,
each _lex-word class is an array of pronunciations.
Therefore, when this class is
called upon, the array and its variables are initialized. Then the process executes a
Figure 3-6: Block diagram of lexicon initialization. This process is relatively complex
because a lexicon consists of an array of words, where each consists of an array of one
or more distinct pronunciations.
Figure 3-7: Block diagram of rule set initialization. The ruleset contains an array
of linguistic rules which need to be translated (or initialized) for the matcher to
recognize.
loop where _pronunciation class is called upon for each distinct pronunciation of a
word. More specifically, the number of times the loop iterates in _lex-word is equal
to the number of pronunciation variations of a particular word.
_pronunciation class is called within each iteration by _lex-word, but this class
does little more than assign a few variables. Then this object sets up the
lexical pronunciation format represented by a series of phonemes by calling upon the
_segments class. In this class, the process executes a loop where each phoneme is
retrieved from the input file and stored in the data structure of a pronunciation;
_segment and _phonemic-rep are classes which finish this process.
When one
pronunciation is done, then the process goes into the next loop in -lex-word for the
next possible variation of the pronunciation of the current word.
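
The nesting that figure 3-6 describes, a lexicon holding words, each word holding pronunciations, and each pronunciation holding segments, can be summarized with placeholder types like the following; the names echo the thesis classes but the definitions are illustrative only.

#include <string>
#include <vector>

// Each pronunciation is a sequence of phonemic segments.
struct PhonemicSegment { std::string symbol; };
struct Pronunciation   { std::vector<PhonemicSegment> segments; };

// A lexical word carries a label and one or more pronunciations
// (mirroring _lex-word holding an Array of _pronunciation objects).
struct LexWord {
    std::string label;
    std::vector<Pronunciation> pronunciations;
};

// The lexicon is an array of lexical words (mirroring _lexicon).
struct Lexicon { std::vector<LexWord> words; };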
Initialization of Linguistic Rules
The third type of information that the matcher needs is the linguistic rule set. The
rules file has to be in a specific format to be recognized by the matcher, as seen in
appendix C. When the command for linguistic rules is first encountered, the matching
algorithm calls upon the _rules class to begin the initialization process, as shown in
figure 3-7. First, many of the variables are given values when _rules is called. Since there are numerous linguistic rules applicable in English, the matcher has to initialize an array of rules. Afterwards, the process enters a loop in which each of the rules is interpreted and stored accordingly. To accomplish this interpretation and storage, the _rule class is called upon for each iteration.
The _rule class merely assigns values to some variables and then calls the _segments class to read a rule from the rule file. Because a rule file follows the same format as the standard template from this point forward, the process for interpreting and storing the data is the same as that of the standard template: the matcher enters a loop within the _segments class, which in turn calls the _bundle class.
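A minimal C++ sketch of the rule-reading loop follows, under the assumption (made purely for illustration) that each rule occupies one line of the rules file; the Rule type and loadRules() are placeholders, not the thesis's _rules and _rule classes.

#include <fstream>
#include <string>
#include <vector>

struct Rule { std::string text; };                 // placeholder for a parsed rule

std::vector<Rule> loadRules(const std::string& path) {
    std::vector<Rule> rules;                       // the array of rules
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {               // one iteration per rule
        if (!line.empty())
            rules.push_back(Rule{line});           // interpret and store the rule
    }
    return rules;
}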
Initialization of the Utterance Data
As we can observe from appendix B, the standard template file and a sample utterance file are of the same format. In fact, the process for initializing the utterance data is exactly the same as that for the standard template, except for some variable assignments. Therefore, the matcher uses the same initialization algorithm as that shown in figure 3-5.
Growth of Matching Tree
Once all the initializations are complete, the "tree growing" process begins; this is a sequential matching algorithm in which a tree grows, word by word, as matching occurs, as shown in figure 3-4. As discussed in connection with that figure, each matched node may be seen as a sub-tree.
Figure 3-8: Block diagram of the tree growing process. This process recursively builds a tree where each node represents a distinct index and a matched word. The recursion occurs when grow_aux2 calls grow_aux, which in turn calls grow_aux2 again.
One can therefore imagine a series of words becoming a large tree made of smaller trees, or sub-trees, which themselves consist of sub-trees, and so on. This recursive model is essential to understanding the details of the tree growing process, which is diagrammed in figure 3-8.
The function that manages the topmost level of the tree growing process is called grow_aux2. It first calls a function named grow_aux to find all the words in the lexicon that match the utterance data from a specific position (initially position 0), as shown in figure 3-8. If there are matched words, grow_aux is also responsible for translating those words into a tree structure such that each of these words becomes a sub-tree of the current node. Then, for each of these sub-trees, grow_aux2 is called to find further branching possibilities for subsequent trees.
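The mutual recursion between grow_aux2 and grow_aux can be sketched in C++ as follows. The Node type and the findMatches() helper are assumptions made for illustration; they do not reproduce the thesis source.

#include <vector>

struct Node {
    int index = 0;                         // position in the utterance
    int wordId = -1;                       // matched lexical word, if any
    std::vector<Node> children;            // sub-trees grown from this node
};

// Stub: in the real matcher this would compare the (rule-modified) lexicon
// against the utterance segments starting at `index`, returning one node per
// matched word.
std::vector<Node> findMatches(int /*index*/) { return {}; }

void grow_aux2(Node& node);                // forward declaration

// grow_aux: find all matching words at the node's index and attach each one
// as a sub-tree of the current node.
void grow_aux(Node& node) {
    for (const Node& match : findMatches(node.index)) {
        node.children.push_back(match);
    }
}

// grow_aux2: top level of tree growth; after grow_aux attaches the matched
// words, recurse from each new sub-tree to find further branches.
void grow_aux2(Node& node) {
    grow_aux(node);
    for (Node& child : node.children) {
        grow_aux2(child);
    }
}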
Let us step through the example that was analyzed in section 2.4.1. The current
position, or node, is initially at 0 for the topmost tree. grow_aux is then called by grow_aux2 to find all the matching words in the existing lexicon, and in doing so two words are found: "an" and "another". Each of these two words attempts to build its own sub-tree by calling grow_aux2 and initiating this whole process again from its own node, or index. Among all the recursive sub-trees built during matching, only one path of branches completed the entire matching process in this particular example: "another ape back in power".
The grow_aux function itself has not yet been described in detail. Overall, this function has two responsibilities: first, to modify the existing lexicon with linguistic rules, and second, to match the segments from the utterance to the modified lexicon. In its second task, the function searches for perfect matches between the standard phonemes and the segments, which is a central characteristic of the original matcher. When finished, the function adds the newly matched words to an array of pre-existing matched words and destroys the modified lexicon.
Since the algorithm for modifying the lexicon to account for linguistic rules is complex and would require an entire chapter of its own, we will not touch upon its details here. Maldonaldo's master's thesis devotes a whole chapter to this topic and explains the algorithm in greater depth.
Let us assume that the matcher already possesses a perfectly modified lexicon
that accounts for all the possible linguistic mutations of a word. Then the matcher is
ready to compare the utterance input to the modified lexicon. In order to accomplish
this task in a more orderly fashion, it needs to compare segment by segment for
each pronunciation of each word in the lexicon. This type of algorithm can easily be
implemented using nested loops.
The purpose of the topmost loop in grow_aux for the matching process is to run through each of the words in the modified lexicon. Due to linguistic rule modifications, each of the words may have more than one possible pronunciation; therefore, each iteration over a lexical word requires a nested loop over its pronunciations. At this point, the matcher searches through each pronunciation of a word and compares it to the utterance on a segment-by-segment basis using a function known as matching.
The matching function examines each of the phonemes in the pronunciation by calling the bundle_match function to compare all the features of that phoneme to all the features in the current segment of the utterance. Since the original matcher requires error-free data, the bundle_match function requires the two sets of segmental information to be exactly the same. If they match without any discrepancy, then the word that the pronunciation represents is added to the array of pre-existing matched words.
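The nested loops can be summarized in the following C++ sketch. The data types and function bodies are illustrative assumptions rather than the thesis source; the sketch only reflects the original matcher's requirement that the two feature bundles be identical.

#include <cstddef>
#include <string>
#include <vector>

struct Feature { std::string name; char value; };     // e.g. {"high", '+'}
struct Segment { std::vector<Feature> features; };
struct Pronunciation { std::vector<Segment> segments; };
struct Word { std::string label; std::vector<Pronunciation> pronunciations; };

// True if every feature of `a` appears in `b` with the same value.
static bool containsAll(const Segment& a, const Segment& b) {
    for (const Feature& f : a.features) {
        bool found = false;
        for (const Feature& g : b.features)
            if (g.name == f.name && g.value == f.value) { found = true; break; }
        if (!found) return false;
    }
    return true;
}

// bundle_match (original matcher): the phoneme and the segment must carry
// exactly the same feature information.
bool bundle_match(const Segment& phoneme, const Segment& segment) {
    return containsAll(phoneme, segment) && containsAll(segment, phoneme);
}

// matching: compare one pronunciation, segment by segment, against the
// utterance starting at position `start`.
bool matching(const Pronunciation& pron,
              const std::vector<Segment>& utterance, std::size_t start) {
    if (start + pron.segments.size() > utterance.size()) return false;
    for (std::size_t i = 0; i < pron.segments.size(); ++i)
        if (!bundle_match(pron.segments[i], utterance[start + i])) return false;
    return true;
}

// Topmost loops: every word in the modified lexicon, every pronunciation.
std::vector<std::string> matchWords(const std::vector<Word>& lexicon,
                                    const std::vector<Segment>& utterance,
                                    std::size_t start) {
    std::vector<std::string> matched;
    for (const Word& w : lexicon)
        for (const Pronunciation& p : w.pronunciations)
            if (matching(p, utterance, start)) { matched.push_back(w.label); break; }
    return matched;
}

The break after a successful pronunciation is only a convenience in this sketch, so that a word is reported once even if several of its pronunciations match.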
Chapter 4
Modified Matcher
As described in section 2.3, the lexical access system consists of two subsystems: the
speech signal processor and the matcher. The signal processor receives a continuous
speech signal from a speaker and translates it into a series of segments, each with
distinct feature characteristics. Then the matcher receives this segmentized data and
outputs a set of possible utterances to recreate words that the speaker intended to
say.
Now let us imagine a scenario where the signal processor is error-free in its analysis.
In other words, the signal processor is able to detect all the acoustic cues for every
possible segment and the features therein. At the end of the labeling and conversion process, this subsystem will output segments, each of which matches exactly, feature by feature, a specific phoneme in the standard template. Although a process with such qualifications seems impractical, the original matcher requires it of the processor.
Constraints that leave no space for uncertainties should not exist in any real-world
systems, and in this regard, the lexical access system is no exception. If the speaker is required to enunciate each syllable very carefully, then it may be feasible to satisfy
the demand for perfect detection in the processor, but for any speech recognizer to
be effective in realistic settings, it must have the ability to handle casual speech. Yet
much of the problem in speech recognition algorithms results from the high degree of
variability in naturally spoken utterances as compared with more restricted speech [5].
Because of the many contextual and speaker-dependent sources of variability, the original matcher is of little practical value and needs to be modified.
One important source of variability in the implementation of the distinctive features for a segment is the position of the segment in a syllable [11]. For example, a
consonant before a vowel may be acoustically different from one after a vowel. Besides
these contextual variations, there are some extra-linguistic variations as well, such as the speaker's vocal tract characteristics, the type of speaking environment, and even whether the speaker has a cold or is nervous [5]. In addition to these two sources of variation, there are some uncertainty issues exclusive to the matcher itself.
4.1 Variability in the Speech Signal
Although many phonetic features can be extracted from the speech signal, the acoustic realization of words and phonemes can be highly variable, and this variability
introduces a good deal of recognition ambiguity in the initial classification of the
speech signal [7]. In fact, the task of developing a speech processing subsystem with
an ability to detect all the phonetic features despite acoustic variability is a near
impossibility. These variations may be partitioned into two categories: contextual
variations and extra-linguistic variations.
4.1.1 Contextual Variations
Local phonetic context can influence the acoustic representation in phonetic segments to varying degrees. At times, the context can affect one or more features in a
phoneme, but without removing entirely the acoustic evidence for the features [11].
In addition, evidence for certain acoustic cues that should be present in the signal sometimes cannot be found in the sound. In the most extreme cases, entire
segments may even be deleted. Yet while segments are deleted and modified in the
production process, it is found that extra segments are almost never inserted because
the modifications are primarily due to the inertia of a physical system [5].
As an example, let us look at some acoustic effects of context on vowel characteristics.
Figure 4-1: Effects of consonants on the first- and second-formant characteristics of eight types of vowels, each in a consonant-vowel-consonant (CVC) context. The graph shows vowels in three consonant environments (velar, postdental, and labial) as well as vowels in isolation [12].
Since the primary articulator that produces distinctions among the vowels
is the tongue body, many of the outstanding variations occur when the consonants
around a vowel place constraints on that articulator. One case is when a back vowel
is positioned after an alveolar consonant, where the tongue body is required to be
in a fronted position to make contact with the alveolar ridge. After the consonant
release, the time taken for the tongue body to position itself for the back vowel is
about 100ms [11]. Similarly if the back vowel precedes an alveolar consonant, then
the time taken for a complete movement of the tongue body is about the same. This
is a significant amount of time, especially in rapid speech where vowel durations are
usually less than 200ms. One can imagine the heavy influence that these consonants
can have on the characteristics of a vowel segment in casual speech.
Figure 4-1 shows the effects of consonants on the first and second formants of
eight vowels in lexical access. The points are measurements of the two formants at the midpoints of vowels, all situated in a consonant-vowel-consonant (CVC) frame with the two consonants being the same. In this figure, three types of consonant environments are used: velar, postdental, and labial; the fourth case shows the vowels in isolation. From the graph, one can clearly observe the impact that consonants have on the acoustic parameters of vowels. One might expect the effects to be even greater in running speech, where the vowel durations are likely to be shorter than they are for isolated CVC utterances.
4.1.2 Extra-linguistic Variations
Extra-linguistic variations play a vital role in introducing uncertainties in the speech
signal as well. One of the most influential causes of these variations is physical differences of the vocal tracts among individuals. For example, the resonant frequencies
between the two sexes generally vary significantly due to the differing physical dimensions. Another type of extra-linguistic variation that the lexical access system, or
any other speech recognition system, is required to handle is background noise, which
depends heavily on the environment. In fact, some acoustic cues may be erased with
enough background noise. As one can imagine, there are countless other factors.
4.2 Matcher-related Uncertainties
The original matcher rests on three unrealistic assumptions. The first is that the speech processor will detect all the possible features, so that each segment will be error-free. The previous section explained that this assumption is not valid
for a working speech recognizer. The second is that all the segments theoretically in
the signal will be detected, but as mentioned previously, this assumption is also
false because some segments may be deleted. The last assumption that the original
matcher makes is that the series of segments it receives completely represents a full
utterance. In other words, the original matcher expects the data from the speech
series of phonemes
First:  /b/ /ey/ /b/ /iy/ /k/ /ae/ /n/
Second: /ey/ /b/ /iy/ /k/ /ae/ /n/

Table 4.1: Two sample series of phonemes as valid outputs of the speech processor. Though the second is a subset of the first series, both represent the utterance "baby can".
recognizer to begin with the first segment of a word and end with the last segment of
a word, independent of the number of words in the utterance. Yet one can imagine
that such an expectation is not valid in a continuous system. In a real-world situation, the speech processor will not have the ability to identify the boundaries of individual words in a continuous speech signal. Therefore, the matcher will receive a series of
segments that may begin and end at random points within words.
Let us look at two valid series of segments in table 4.1, assuming that each segment
has a complete set of features. If the original matcher receives the first series of
segments as an input, then the output will be the statement, "baby can".
But if
the input is the second series of segments, then the original matcher will not output
any utterances because the series does not represent complete words. A more useful
and realistic matcher should recognize that the two different series of segments may
represent the same utterance and process the data accordingly.
4.3 The Modified Matcher Process
Due to the uncertainty factors described above, a modified matcher has been designed
to satisfy more realistic conditions. Specifically, the current modified matcher functions without the constraints of the first and the third assumptions stated in the previous section. Though not dealt with in the current modified matcher, the second assumption must also be relaxed in the final matcher.
First, this section describes the modifications applied to the original matcher to account for its first assumption. It then shows how the modified matcher handles the last assumption, under which the first and the last segments may fall anywhere within words.
segment 1: Time: (nil); Symbol: /ae/; Prosody: (nil); RoC: unspecified
           + vowel, - high, + low, - back, - adv-tongue-root, - const-tongue-root

segment 2: Time: (nil); Symbol: (nil); Prosody: (nil); RoC: unspecified
           + vowel, - high, + low

segment 3: Time: (nil); Symbol: (nil); Prosody: (nil); RoC: unspecified
           + vowel

Table 4.2: Three sample segments, where segment 1 is the standard phoneme /ae/. The other two segments are subsets of the first segment. RoC = Release Or Closure.
4.3.1 Missing Features
The ability for a matcher to acknowledge incomplete sets of features in segments is
very important. As mentioned before, one of the reasons for its importance is that
an expectation placed upon the signal processor by the matcher to detect all possible
acoustic features is unfair and unrealistic. But a more compelling reason is that the
signal processor may actually be more efficient if it does not try to detect the acoustic cues for every feature.
Some features naturally do less to differentiate phonemes than other features, and
developing systems that detect these less informative features may not be worth the
cost of time, money, and design complexity of the speech processor; the difference in
performance may be merely a small reduction in ambiguity. Also, some acoustic cues
tend to be more dependent on the context than other cues, and therefore are more
vulnerable to variability. Since possibilities of error in detection are higher with these
highly variable cues, it may be safer to ignore them in the analysis.
segment 1: /ae/
segment 2: /ae/, /aa/
segment 3: /iy/, /ih/, /ey/, /eh/, /ae/, /aa/, /ao/, /ow/, /ah/, /uw/, /uh/, /rr/, /er/, /x/

Table 4.3: Matched results of the three segments from table 4.2. From segment 1 to segment 3, more phonemes match because fewer features exist to constrain the matching process.
Overview of the Modified Algorithm
Let us study an example to gain a more concrete understanding. Table 4.2 shows three segments, numbered 1 through 3. The features in segment 2 are a
subset of segment 1, and those in segment 3 are a subset of segment 2. The first is a
copy of the standard phoneme, /ae/, and contains all the possible features. But the
second and the third segments are missing some features in comparison to the first
segment. The original matching algorithm will recognize the first segment as /ae/ but
will not recognize the other two segments. On the other hand, the modified algorithm
will acknowledge all the segments but will associate more phonemes as fewer features
exist to constrain the match, as shown in table 4.3. As expected, all the lists include
the intended phoneme, /ae/.
One may use set models to understand the general idea of the two matching
algorithms. As figure 4-2 shows, the original matcher requires that the segment from
the utterance and a standard phoneme to occupy the same "feature-set space". But
in the modified matcher, the "feature-set space" for a segment is a subset of the
"feature-set space" of the standard phoneme. Simply, the first case demands that
the two segments be the same while the second case allows one to be a subset of the
other.
Figure 4-2: Feature-set space model for the original matcher, in part (A), and the modified matcher, in part (B). Part (A) shows that the original matcher requires the two feature-set spaces to be exactly the same, while part (B) shows that the segment space can be a subset of the standard phoneme space.
Specific Modifications in the Algorithm
Most of the modified matching algorithm is carried over from the original algorithm. In fact, all of the initialization and most of the tree-growing algorithms, which are described in section 3.4, are exactly the same. The main difference lies in a test labeled "feature-match?" in figure 3-8, which is called by _bundle_match to perform a feature-by-feature comparison between a standard phoneme and a segment from the utterance. Figure 4-3 (A) shows that in the original matcher, the function that performs this test is called exact_match. If a feature has a particular value within a standard phoneme, then exact_match requires that the same feature have the same value in the segment.
In the modified matcher, a function titled incomp_match replaces exact_match in performing the "feature-match?" test, as shown in figure 4-3 (B). Although a feature may have a particular value in a standard phoneme, this new function allows that feature to be missing in the segment. However, a feature that has differing values in the segment and the standard phoneme is still not allowed.
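The difference between the two tests can be sketched in C++ as follows; the types and the helper are illustrative assumptions, not the thesis code.

#include <string>
#include <vector>

struct Feature { std::string name; char value; };      // e.g. {"high", '+'}
struct Segment { std::vector<Feature> features; };

// True if every feature of `a` appears in `b` with the same value.
static bool subsetOf(const Segment& a, const Segment& b) {
    for (const Feature& f : a.features) {
        bool ok = false;
        for (const Feature& g : b.features)
            if (g.name == f.name && g.value == f.value) { ok = true; break; }
        if (!ok) return false;
    }
    return true;
}

// Original matcher: the segment and the phoneme must carry identical bundles.
bool exact_match(const Segment& phoneme, const Segment& segment) {
    return subsetOf(phoneme, segment) && subsetOf(segment, phoneme);
}

// Modified matcher: features may be missing from the segment, but any feature
// that is present must agree with the phoneme (the segment space is a subset
// of the standard phoneme space).
bool incomp_match(const Segment& phoneme, const Segment& segment) {
    return subsetOf(segment, phoneme);
}

Under this sketch, a feature that is absent from the segment simply places no constraint on the match, while a feature present with a conflicting value rules the phoneme out, which is exactly the subset relation of figure 4-2 (B).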
Figure 4-3: Feature-match model for the original matcher, in part (A), and the modified matcher, in part (B). In part (A), every feature in the phoneme has to match every feature in the segment, and vice versa; in part (B), every feature in the segment has to match every feature in the phoneme, but not vice versa. In this figure, "phoneme" refers to a set of features used by the lexicon and "segment" refers to a set of features, complete or incomplete, detected from the speech signal.
segment (A): Time: (nil); Symbol: p; Prosody: (nil); Release Or Closure: unspecified
             Features: + consonant, - continuant, - sonorant, + lips, - round,
                       - constricted-glottis, - slack-vocal-folds

segment (B): Time: (nil); Symbol: (nil); Prosody: (nil); Release Or Closure: unspecified
             Features: + consonant, - continuant, - sonorant, + lips, - round

Table 4.4: Segment input to the modified matcher. Segment (A) has a complete set of features for /p/ while segment (B) is a subset of the first set.
4.3.2 Uncertainty of First and Last Segments
In a continuous lexical access system, the matcher should not assume that the received series of segments begins and ends at word boundaries. When the system processes a free-flowing stream of speech data, the speech processing subsystem has no inherent knowledge of the positions of the segments within words. Thus the processor sends a series of segments without any indication of whether the series constitutes a complete sentence. Due to this uncertainty, the original matcher has to be modified so that it does not assume the positions of the first and the last segments.
Examples of the Modification
Due to the complex nature of the algorithm that implements this modification, it
may be more beneficial to step through a few examples than to explain the intricate
details of the algorithm. In each of the examples, the input and the corresponding
output of the modified matcher will be provided. The lexicon used for these examples
is the 248-word lexicon found in appendix A.
list of matched words
1. ape       2. appear    3. cup       4. cups      5. hip       6. keep
7. oppose    8. pail      9. pails     10. pay      11. perch    12. pig
13. pigs     14. pop      15. potato   16. power    17. put      18. sport
19. support  20. suppose  21. up       22. zap      23. zip

Table 4.5: Output of the modified matcher with the segment from table 4.4 (A) as input. All the words in this list contain the phoneme /p/.
The first example is shown in table 4.4 (A), where the input is only one segment
with a complete set of features. In fact, the phoneme that matches the given set of
features is /p/. Since the modified matcher does not assume the relative position of
this segment in a word, it will output all the words that contain /p/ as a segment.
The list of words is shown in table 4.5. In this specific case, the original matcher does
not output any words because no words exist that are completely described by /p/
alone.
Let us now erase some of the features from the segment such that the input
represents the incomplete set shown in table 4.4 (B). Because this segment contains fewer features to match than the one shown in part (A), both /p/ and /b/
satisfy the match and the number of possible words in the cohort increases. In fact,
the number of possible words increases from 23 to 62 as shown in tables 4.5 and 4.6.
If the input consists of two consecutive segments, as shown in table 4.7, the outcome is quite different. In this case, the matcher searches the lexicon for any word, or series of two words, that contains these segments in sequence. Therefore, the output consists of individual words that contain the two segments sequentially, or pairs of words in which the first ends with the first segment and the second begins with the second segment.
list of matched words
1. able      2. about     3. above     4. ape       5. appear    6. baby
7. back      8. bad       9. bake      10. balloon  11. bat      12. be
13. because  14. become   15. before   16. began    17. begin    18. below
19. beside   20. between  21. big      22. blow     23. book     24. box
25. boy      26. bug      27. bus      28. busses   29. busy     30. but
31. by       32. cab      33. cub      34. cup      35. cups     36. debate
37. debby    38. dub      39. goodbye  40. hip      41. keep     42. number
43. oppose   44. pail     45. pails    46. pay      47. perch    48. pig
49. pigs     50. pop      51. potato   52. power    53. put      54. rabbit
55. rabid    56. remember 57. sport    58. support  59. suppose  60. up
61. zap      62. zip

Table 4.6: Output of the modified matcher with the segment from table 4.4 (B) as input. Because there are fewer features to match in the segment, the matcher outputs more matched words than in table 4.5.
segment 1: Time: (nil); Symbol: p; Prosody: (nil); Release Or Closure: unspecified
           Features: + consonant, - continuant, - sonorant, + lips, - round,
                     - constricted-glottis, - slack-vocal-folds

segment 2: Time: (nil); Symbol: iy; Prosody: (nil); Release Or Closure: unspecified
           Features: + vowel, + high, - low, - back, + adv-tongue-root, - const-tongue-root

Table 4.7: Two consecutive segments with complete feature sets as the input to the modified matcher. The first segment represents the phoneme /p/ and the second segment represents the phoneme /iy/.
list of matched words
1. ape easy    2. ape even    3. appear      4. cup easy    5. cup even    6. hip easy
7. hip even    8. keep easy   9. keep even   10. pop easy   11. pop even   12. up easy
13. up even    14. zap easy   15. zap even   16. zip easy   17. zip even

Table 4.8: Output of the modified matcher with the two consecutive segments from table 4.7 as input. Most of the matched utterances span a series of two words. In this example, we can conclude that if the first segment is known to be the beginning of a word, then there are no matches.
All the possible words, and series of words, are shown in table 4.8. As one can observe, all the matched utterances except number three have /p/ as the ending segment of the first word and /iy/ as the first segment of the next word.
Although only a couple of examples have been shown, the modified matcher can
recognize any number of segments with any combinations of features for each segment.
User-Mode for Testing
The user may force the first or the last segment of the input to be the beginning or the end of a word by placing the symbol '#' or '%' after "Time:" in the header of the designated segment. Table 4.9 shows how the user can manipulate the input to inform the matcher what type of input it is receiving. The reason for this flexibility is that the user may find forced assignments useful for testing purposes. For example, the cases studied in the previous subsection fall into category A, where the first and the last segments are unrestricted in position.
If "Time:" of the last segment is assigned the character, '%' instead of (nil) as
shown in table 4.9 (B), then the data indicates to the matcher that the series of
segments ends at the last segment of a word. For example, if the last segment of a
two segment input has '%' after "Time:" as shown in table 4.10 (A), then the output
of the matcher is as shown in table 4.11.
(nil) in "Time:"
of first segment
in "Time:"
of first segment
'#'
(nil) in "Time:" of last segment
first and last segments
may be in any position
'%' in "Time:" of last segment
first segment may be in any position
last segment is last of a word
A
B
first segment is first of a word
last segment may be in any position
first segment is first of a word
last segment is last of a word
C
D
Table 4.9: All combinations of (nil), '#' and '%' in the first and last segments and
their meanings. In fact, category D acts as the original matcher where a series of
segments represents a complete word(s).
Input (A):
  segment 1: Time: (nil); Symbol: k; Prosody: (nil); RoC: unspecified
             Features: + consonant, - continuant, - sonorant, + dorsum, + high, - low,
                       - constricted-glottis, - slack-vocal-folds
  segment 2: Time: %; Symbol: iy; Prosody: (nil); RoC: unspecified
             Features: + vowel, + high, - low, - back, + adv-tongue-root, - const-tongue-root

Input (B):
  segment 1: Time: #; Symbol: k; Prosody: (nil); RoC: unspecified
             Features: + consonant, - continuant, - sonorant, + dorsum, + high, - low,
                       - constricted-glottis, - slack-vocal-folds
  segment 2: Time: (nil); Symbol: iy; Prosody: (nil); RoC: unspecified
             Features: + vowel, + high, - low, - back, + adv-tongue-root, - const-tongue-root

Table 4.10: Two sample inputs to the modified matcher where the positions of specific segments are constrained. (A), by using '%', forces the matcher to find word(s) that end with segment 2. (B), by using '#', forces the matcher to find word(s) that begin with segment 1.
list of matched words
1. cookie
2. leaky
Table 4.11: Output of the modified matcher with two segments from table 4.10 (A)
as input. As expected, the last two segments of these words are /k/ and /iy/.
list of matched words
1. keep
Table 4.12: Output of the modified matcher with two segments from table 4.10 (B)
as input, where '#' is placed after "Time:" of the first segment.
of "cookie" and "leaky" are /k/
and /iy/
sequentially.
In contrast, if '%' is not
present, then that particular constraint is absent and the matcher would output more
utterances, such as "book eat".
As explained in table 4.9 (C), "Time:" of the first segment may be assigned the
character, '#' instead of (nil). In such cases, the data indicates that the series of
segments begins at the first segment of a word. For example, if the two segments are
as shown in table 4.10 (B), then the only word that matches is "keep", as shown in
table 4.12. As expected, the first two segments of "keep" are /k/ and /iy/.
Finally, if the first segment is given '#' and the last segment is given '%', then the constraint is that the series of segments represents complete word(s). If the input is as shown in table 4.13, then the matcher outputs no words, since no words are completely described by /k/ /iy/. Set models, shown in figure 4-4, help visualize this last case. If two sets of matcher outputs exist, one corresponding to category B and the other to category C from table 4.9, then the set that represents category D is the AND-function of the two sets. Since there is no overlap between the output utterances of table 4.11 and table 4.12, the matcher outputs no words in this case. Interestingly, the two symbols used together force the matcher to act, to a certain degree, as the original matcher.
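As an illustration of the marker handling and of the AND-relation just described, the following C++ sketch maps the "Time:" markers to boundary constraints and forms the category D cohort as the intersection of the category B and C cohorts. The types and function names are assumptions for illustration only.

#include <algorithm>
#include <iterator>
#include <set>
#include <string>

struct BoundaryConstraints {
    bool firstIsWordInitial = false;   // '#' on the first segment's "Time:" field
    bool lastIsWordFinal = false;      // '%' on the last segment's "Time:" field
};

BoundaryConstraints readMarkers(const std::string& firstTime,
                                const std::string& lastTime) {
    BoundaryConstraints c;
    c.firstIsWordInitial = (firstTime == "#");
    c.lastIsWordFinal = (lastTime == "%");
    return c;
}

// Category D output = AND-function (set intersection) of categories B and C.
std::set<std::string> categoryD(const std::set<std::string>& categoryB,
                                const std::set<std::string>& categoryC) {
    std::set<std::string> d;
    std::set_intersection(categoryB.begin(), categoryB.end(),
                          categoryC.begin(), categoryC.end(),
                          std::inserter(d, d.begin()));
    return d;
}

Applied to the example above, the category B cohort {"cookie", "leaky"} and the category C cohort {"keep"} have an empty intersection, so category D yields no words, in agreement with table 4.13.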
segment 1: Time: #; Symbol: k; Prosody: (nil); RoC: unspecified
           Features: + consonant, - continuant, - sonorant, + dorsum, + high, - low,
                     - constricted-glottis, - slack-vocal-folds

segment 2: Time: %; Symbol: iy; Prosody: (nil); RoC: unspecified
           Features: + vowel, + high, - low, - back, + adv-tongue-root, - const-tongue-root

Table 4.13: A possible input to the modified matcher which follows the constraints described in table 4.9 (D).
Figure 4-4: Set models for the output of the matcher with the different categories described in table 4.9. Note that category D is the output of the AND-function of categories B and C.
Chapter 5
Experiment with Partial Feature
Segments
In section 4.3, two reasons for modifying the original matcher are presented. One of
the reasons is that the original matcher places unrealistic expectations on the speech
signal processor. Indeed, this reason alone is compelling enough for developing a new
matcher. But the second provides a more positive reason; the modified matcher may
decrease the cost and the design complexity of the speech processor while avoiding
significant loss of performance of the entire lexical access system. Some features may
not be worth the computational power required for their detection because they accomplish little to reduce the ambiguity within a segment. Therefore investigating
different partial representations can help us identify the features with the most information, the features with the least information, and everything in between. Through
a long experimental process, a map of features can be made, which will facilitate the
design of the lexical access system. In this chapter, an experiment which explores
the performance of the modified matcher when only partial feature information is
presented. This is, by no means, a comprehensive experiment for the field of research
in partial feature sets, but is a stepping stone toward a further series of experiments
to refine our knowledge in this area.
5.1 Experiment
Two main categories of inputs were used for this experiment: a series of complete feature-set segments and a series of incomplete feature-set segments. Naturally,
phonemes from the standard template were used to represent segments for the first
category. To define the segments in the second category, fixed subsets of features
were utilized, which represented a feature-set convention for the second category of
inputs throughout the experiment.
In considering which features to allow or disallow in the incomplete segments of the second category, whether for consonants, vowels, or glides, we should account for how difficult it is to detect their acoustic cues and how contextually variable they are.
In this initial set of experiments, a small number of features which are expected to
be most reliably detected were used. If a segment is a consonant, the articulator-free features, such as [sonorant], [continuant], and [strident], are expected to be most
easily defined. If a segment is a vowel or a glide, then its articulator-free features and
its articulator-bound features caused by the tongue body, such as [high], [low], and
[back], are expected to be most easily defined. As a result, we used these particular
features, and only these features, as a convention to describe the segments in the
second category.
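The convention can be expressed as a small filtering step; the following C++ sketch keeps only the features named above. The type definitions, the feature names, and the decision to keep the [vowel]/[consonant] labels are illustrative assumptions based on the examples in table 5.1, not a specification from the thesis.

#include <algorithm>
#include <string>
#include <vector>

struct Feature { std::string name; char value; };
struct Segment { bool isVowelOrGlide = false; std::vector<Feature> features; };

Segment toSecondCategory(const Segment& full) {
    // Articulator-free features kept for every segment.
    std::vector<std::string> keep = {"consonant", "vowel", "sonorant",
                                     "continuant", "strident"};
    // Vowels and glides additionally keep the tongue-body features.
    if (full.isVowelOrGlide) {
        keep.insert(keep.end(), {"high", "low", "back"});
    }
    Segment reduced;
    reduced.isVowelOrGlide = full.isVowelOrGlide;
    for (const Feature& f : full.features) {
        if (std::find(keep.begin(), keep.end(), f.name) != keep.end())
            reduced.features.push_back(f);        // retain only the allowed features
    }
    return reduced;
}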
To observe how the conventions of the two categories are applied to a series of
segments, an example is given in table 5.1, where a word "as" is analyzed. Part (A)
of this table follows the conventions defined for the first category of inputs, while
part (B) follows the conventions defined for the second category. In part (A), /ae/
and /z/ are shown in their complete sets of features, just as they are in the standard
template of phonemes. Yet in part (B), the vowel /ae/ provides only the articulator-free features and the articulator-bound features caused by the tongue body, while the
consonant, /z/, provides only the articulator-free features.
Six different tests, as listed in table 5.2, were run for this experiment: the 1-C test, 1-IC test, 2-C test, 2-IC test, 4-C test, and 4-IC test. Each of these tests received 124 separate data samples as inputs, and for each sample the matcher produced matched word(s).
category 1 (A):
  segment 1: Time: #; Symbol: ae; Prosody: (nil); RoC: unspecified
             + vowel, - high, + low, - back, - adv-tongue-root, - const-tongue-root
  segment 2: Time: %; Symbol: z; Prosody: (nil); RoC: unspecified
             + consonant, + continuant, - sonorant, + strident, + blade, + anterior,
             - distributed, + slack-vocal-folds

category 2 (B):
  segment 1: Time: #; Symbol: (nil); Prosody: (nil); RoC: unspecified
             + vowel, - high, + low, - back
  segment 2: Time: %; Symbol: (nil); Prosody: (nil); RoC: unspecified
             + consonant, + continuant, - sonorant, + strident

Table 5.1: A series of segments representing the word "as" for the two different categories of inputs in the experiment. The segments in (A) are described with a full set of features, while the segments in (B) are described with an incomplete set of features. RoC = Release Or Closure.
descriptions of the six tests                    name used
one-word test with complete features             1-C test
one-word test with incomplete features           1-IC test
two-words test with complete features            2-C test
two-words test with incomplete features          2-IC test
four-words test with complete features           4-C test
four-words test with incomplete features         4-IC test

Table 5.2: A list of the six tests of the experiment and their corresponding names used in this thesis.
More specifically, a data sample consists of a random word from the lexicon,
or a series of random words, represented by a series of segments. These segments
may have complete or incomplete feature-sets. The 1-C, 2-C, and 4-C tests follow the
conventions of the first category of inputs while the 1-IC, 2-IC, and 4-IC tests follow
the conventions of the second category of inputs.
In the 1-C test, a data sample corresponds to a series of complete segments which
are intended to represent a particular word; "intended" because the matcher may
interpret the series of segments as another word(s). After this data sample is processed
by the matcher, the number of matched word(s) in the cohort is recorded.
The second test, 1-IC test, is very similar to the first test. The only difference
is that the segments used in this test follow the conventions of the second category
of inputs, i.e. partial feature specifications.
Therefore, each of the data samples
in this test corresponds to a series of incomplete segments, intended to represent a
particular word. In order to compare and analyze the two one-word tests, same list
of 124 "intended" words are used.
The series of segments in table 5.1 (A) represents a sample input for the 1-C test,
and the series of segments in part (B) represents a sample input for the 1-IC test,
where the "intended" word in these examples is "as".
The next two tests, 2-C test and 2-IC test, are similar to the first two tests just
described. The only difference is that the series of segments are intended to represent
two words in series. The number of matched utterances in the cohort for each of
the 124 samples is recorded for each test. An example data input for the 2-C test is
shown in table 5.3 (A) and a corresponding data input for the 2-IC test is shown in
part (B), where the intended words are "on day".
The final two tests, 4-C test and 4-IC test, are similar to the tests described
above, where the only difference lies in that an input is intended to represent a series
of four words. The format of the data samples follows the previous examples given in tables 5.1 and 5.3.
As seen in these tables, the '#' and '%' symbols have been placed after "Time:" of the first and the last segments. As a result, the matcher is forced to match the first segment of the input series to the first segment of a word candidate and the last segment of the input series to the last segment of a word candidate.
(A) — a possible data sample for the 2-C test:
  segment 1: Time: #; Symbol: aa; Prosody: (nil); RoC: unspecified
             + vowel, - round, - high, + low, + back, - adv-tongue-root, + const-tongue-root
  segment 2: Time: (nil); Symbol: n; Prosody: (nil); RoC: unspecified
             + consonant, - continuant, + sonorant, + blade, + anterior, - distributed, + nasal
  segment 3: Time: (nil); Symbol: d; Prosody: (nil); RoC: unspecified
             + consonant, - continuant, - sonorant, + blade, + anterior, - distributed,
             + slack-vocal-folds
  segment 4: Time: %; Symbol: ey; Prosody: (nil); RoC: unspecified
             + vowel, - high, - low, - back, - adv-tongue-root, - const-tongue-root

(B) — a possible data sample for the 2-IC test:
  segment 1: Time: #; Symbol: (nil); Prosody: (nil); RoC: unspecified
             + vowel, - high, + low, + back
  segment 2: Time: (nil); Symbol: (nil); Prosody: (nil); RoC: unspecified
             + consonant, - continuant, + sonorant
  segment 3: Time: (nil); Symbol: (nil); Prosody: (nil); RoC: unspecified
             + consonant, - continuant, - sonorant
  segment 4: Time: %; Symbol: (nil); Prosody: (nil); RoC: unspecified
             + vowel, - high, - low, - back

Table 5.3: The two series of segments represent the two sequential words "on day". The segments in (A) are described with the conventions of the first category of inputs and constitute a possible data sample for the 2-C test. The segments in (B) are described with the conventions of the second category of inputs and constitute a possible data sample for the 2-IC test. RoC = Release Or Closure.
For example, if the input is as shown in table 5.3 (A), then all the matched utterances in the cohort should begin with the phoneme /aa/ and end with the phoneme /ey/.
There are exactly two purposes in running the six tests in this experiment:
  - to analyze the feature-level characteristics of the cohorts in these six tests, and
  - to analyze the word-level characteristics of the cohorts in these six tests.
The feature-level characteristics are defined by examining how statistics derived
from the output of the modified matcher change over a decreasing number of defined
features in segments. For example, the results from the 1-C test are compared to
the results from the 1-IC test in order to better understand the overall consequences
of having partial sets of features. The same comparative analysis is done for the
two-words tests and the four-words tests.
The word-level characteristics are defined by examining how statistics derived from
the output of the modified matcher change over an increasing number of words in
series for a data sample. More specifically, the output data from the one-word tests
is compared with the output data from the two-word tests, which then is compared to
the output data from the four-words tests. This type of analysis provides important information about how the cohort behaves as the number of segments increases.
5.2 Data
All the input data to the modified matcher for each of the six tests and their corresponding output data were recorded, as shown in appendices E, F, and G. As
mentioned before, the output data we are interested in is the number of matched
utterances in the cohort for a given input data sample. For each of the six tests,
statistics such as the mean, standard deviation, and maximum were calculated, and
these are listed in table 5.4.
test    mean     std. dev.   maximum
1-C     1.024    0.154       2
1-IC    2.685    3.014       22
2-C     1.121    0.501       5
2-IC    7.742    12.996      88
4-C     1.097    0.296       2
4-IC    67.790   152.637     1242

Table 5.4: Mean, standard deviation (std. dev.), and maximum of the number of matched utterances of the data inputs for each of the tests in the experiment.
5.2.1 Feature-level Characteristics
If the performance of the lexical access system is solely based on the ambiguity of
the output of the system, which corresponds to the number of utterances in the
cohort produced by the matcher, then it is a foregone conclusion that the more
acoustic information that the system knows, the better its performance will be. In
this experiment, one of the fundamental questions examined is: how much will performance decrease with the given incomplete sets of features? With "given incomplete sets" defined by the conventions of the second category of inputs, the answer to this question is found in the feature-level characteristics of the tests.
1-C and 1-IC Tests
As recorded in table 5.4, the mean of the number of matches in the output of the 1-C
test is 1.024 and the standard deviation is 0.154. These statistics lead us to confirm
the obvious conclusion: if all the possible features are correctly detected, then almost every one of the data samples matches exactly to its intended word. A histogram of the results of this test, shown in figure 5-1, shows that almost all of the 124 inputs correspond to a single utterance in their cohorts. In fact, 98.4% of the data samples
matched one-to-one.
When many of the features were absent for the same words, as is the case in the 1-IC test, the mean and the standard deviation increased to 2.685 and 3.014, respectively.
Figure 5-1: Histogram of the results of the 1-C test. The number of matches in a cohort of the matcher is the independent variable. Most of the distribution is concentrated at one match per data sample.
Figure 5-2: Histogram of the results of the 1-IC test. The number of matches in a cohort of the matcher is the independent variable. Most of the distribution is concentrated at five or fewer matches per data sample.
Figure 5-3: Histogram of the results of the 2-C test. The number of matches in a cohort of the matcher is the independent variable. The results are very similar to those of the 1-C test in figure 5-1.
From figure 5-2, we see that the number of matches in a cohort is usually less than or equal to 5. When the results are examined more closely, we find that 45.2% of the samples matched exactly to their intended words and 71.0% matched within an ambiguity of two possible utterances.
2-C and 2-IC Tests
The mean is 1.121 and the standard deviation is 0.501 for the output of the 2-C test,
as shown in table 5.4. Although some numbers of matches climbed as high as five, the histogram in figure 5-3 shows that almost all the matches were one-to-one. Upon further calculation, we find that 93.5% of the cohorts contained exactly one correctly matched utterance.
In the 2-IC test, however, the mean and the standard deviation jumped to 7.742
and 12.996 respectively. To examine these statistics in detail, let us look at the
histogram shown in figure 5-4. The distribution exponentially decreases as the number
of matches approaches 20.
Even though approximately 64% of the data samples
corresponded to five or fewer matches in their cohorts, the other 36% are significant
and should not be ignored.
Figure 5-4: Histogram of the results of the 2-IC test. The number of matches in a cohort of the matcher is the independent variable. The results resemble an exponentially decreasing function in which the frequency becomes insignificant as the number of matches approaches 20.
4-C and 4-IC Tests
According to table 5.4, the mean is 1.097 and the standard deviation is 0.296 for 4-C
test. As the histogram shows in figure 5-5, approximately 91.9% of the utterances
consisting of four words matched exactly to their intended utterances given that all
the features are defined in all the segments.
In the 4-IC test, however, the outputs behave much differently; the mean is 67.790
while the standard deviation is 152.637. In the histogram provided in figure 5-6, we
find that almost all of the data samples matched to 200 or fewer possible utterances,
while a few individual points measured at 700 or higher. In fact, the highest number of
matches for a data sample was 1242, in which the intended phrase was "do pail another
contain". When the histogram is magnified in the range of 200 or less matches, as
shown in figure 5-7, we observe that much of the distribution lies in 30 or fewer
matches while the rest of the field is generally uniformly distributed.
Figure 5-5: Histogram of the results of the 4-C test. The number of matches in a cohort of the matcher is the independent variable. The distribution resembles those seen in the one-complete-word and two-complete-word tests.
Figure 5-6: Histogram of the results of the 4-IC test. The number of matches in a cohort of the matcher is the independent variable. The results resemble an exponentially decreasing function in which the frequency becomes insignificant as the number of matches approaches 400.
Figure 5-7: A magnified version of the histogram in figure 5-6, restricted to the range below 200 matches. The distribution is generally uniform when the number of matches is greater than 30. The impulse at 200 is the sum of all samples beyond that point.
5.2.2 Word-level Characteristics
As explained in the previous section, a variable that affects the dynamics of the
output of the matcher is the number of features that are detected and labeled for
each segment. Another variable that influences the output is the length of the series
of segments that the matcher receives per process. In our experiment, we measured
such length in terms of the number of words that a particular series of segments
represents.
When the results of the 1-C, 2-C, and 4-C tests are examined, it is evident that the number of matched utterances in the cohort does not change as the number of words varies. The mean values of the tests do not differ from each other by more than 0.1. In fact, the mean value does not appear to have any correlation with the number of words; it neither monotonically increases nor decreases as the number of words grows.
However, as one would expect, the 1-IC, 2-IC, and 4-IC tests produce very different word-level characteristics.
(A) — the word "do":
  segment 1: Time: #; Symbol: d; Prosody: (nil); RoC: unspecified
             + consonant, - continuant, - sonorant
  segment 2: Time: %; Symbol: uw; Prosody: (nil); RoC: unspecified
             + vowel, + high, - low, + back

(B) — the word "add":
  segment 1: Time: #; Symbol: ae; Prosody: (nil); RoC: unspecified
             + vowel, - high, + low, - back
  segment 2: Time: %; Symbol: d; Prosody: (nil); RoC: unspecified
             + consonant, - continuant, - sonorant

Table 5.5: Two words with feature sets constrained by the second convention for inputs. (A) is a series of segments representing the word "do". (B) is a series of segments representing the word "add". RoC = Release Or Closure.
More specifically, one would expect the relationship between the number of matches per data sample and the number of words in the intended utterance to be

    y = 2.685^x                                                        (5.1)

where 2.685 is the mean number of matches for the 1-IC test, x is the number of words in the intended utterance, and y is the expected number of matches for a data sample of x words.
To better understand the formulation of equation 5.1, let us study an example to
observe how a series of words can affect the output of the matcher. Let us assume
that our first series of segments is as shown in table 5.5 (A). If we use this as an input
to the modified matcher, then we would get three matched utterances in the cohort, as
shown in table 5.6 (A). Now let us assume that we have a second set of segments as
shown in table 5.5 (B), and if this is the input to the matcher, the cohort would look
like table 5.6 (B). At this point, the question we ask is this: what would the cohort
be if these two sets of segments were combined into one? If the input is as shown
in table 5.7, then we would expect the cohort to consist of the utterances as shown
in table 5.8. Each possible utterance for the first series is followed by each possible
utterance of the second series, and therefore, the total number in the cohort is 2*3=6.
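The combination just described is a simple Cartesian product of the two cohorts, as the following short C++ illustration (not thesis code) shows; it reproduces the six utterances of table 5.8.

#include <iostream>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> first = {"do", "to", "too"};   // cohort for "do" (table 5.6)
    std::vector<std::string> second = {"add", "at"};        // cohort for "add" (table 5.6)

    // Pair every utterance of the first cohort with every utterance of the second.
    for (const std::string& a : first)
        for (const std::string& b : second)
            std::cout << a << ' ' << b << '\n';             // 3 * 2 = 6 utterances
    return 0;
}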
(A) cohort of segments from table 5.5 (A):
  1. do
  2. to
  3. too

(B) cohort of segments from table 5.5 (B):
  1. add
  2. at

Table 5.6: Two cohorts corresponding to the two individual series of segments from table 5.5. (A) is the cohort for the series intended to represent the word "do". (B) is the cohort for the series intended to represent the word "add".
segment 1: Time: #; Symbol: d; Prosody: (nil); RoC: unspecified
           + consonant, - continuant, - sonorant
segment 2: Time: (nil); Symbol: uw; Prosody: (nil); RoC: unspecified
           + vowel, + high, - low, + back
segment 3: Time: (nil); Symbol: ae; Prosody: (nil); RoC: unspecified
           + vowel, - high, + low, - back
segment 4: Time: %; Symbol: d; Prosody: (nil); RoC: unspecified
           + consonant, - continuant, - sonorant

Table 5.7: The two series of segments from table 5.5 combined into one series. RoC = Release Or Closure.
cohort of segments from table 5.7
1. do add
2. do at
3. to add
4. to at
5. too add
6. too at
Table 5.8: The cohort corresponding to the series of segments in table 5.7.
Figure 5-8: Two graphs in which the independent variable is the number of intended words in the input and the dependent variable is the average number of matches in the cohort. The graph with squares corresponds to the expected behavior and the graph with circles corresponds to the real behavior.
In the results of the 1-IC test, the average number of matched words is 2.685. Following the same principle as above, if the input consisted of two words, then the expected average number of matched words would be 2.685*2.685, or 2.685^2. And if the input was expanded to four words, then the expected average number of matched words would be 2.685^4. Thus equation 5.1 computes the expected number of matched utterances in a cohort for any number of words. In figure 5-8, the expected behavior is also shown graphically with squares.
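For reference, the expected values of equation 5.1 can be computed directly and compared with the observed means of table 5.4; the short program below (an illustration, not part of the thesis) prints both.

#include <cmath>
#include <cstdio>

int main() {
    const double base = 2.685;                          // mean of the 1-IC test
    const int wordCounts[] = {1, 2, 4};                 // 1-IC, 2-IC, 4-IC tests
    const double observed[] = {2.685, 7.742, 67.790};   // means from table 5.4

    std::printf("words  expected  observed\n");
    for (int i = 0; i < 3; ++i) {
        double expected = std::pow(base, wordCounts[i]);   // equation 5.1
        std::printf("%5d  %8.3f  %8.3f\n", wordCounts[i], expected, observed[i]);
    }
    return 0;
}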
Yet, the data we collected behaved much differently from the expected behavior.
In figure 5-8, the real data deviates increasingly from the expected data as the number
of words increases. In fact, figure 5-9 shows that the difference between the real and
the expected increases exponentially as the number of words increases linearly. As one
can imagine, this type of behavior poses great problems, especially because utterances
are not confined to one or two words.
To explain this unexpected behavior, let us examine data sample #52-2 from 2-IC
test. #52-2 is translated as word #52, "caution", and word #2, "able", in series, as seen in appendices A and B.
Figure 5-9: This graph shows the difference between the expected and the real curves from figure 5-8. As the number of words in the input increases, the difference increases exponentially.
These two words were separate data samples in the 1-IC test, and the number of matches for "caution" and "able" individually were 6 and 5, respectively; their cohorts are shown in table 5.9. From this information, the expected number of matches for "caution able" is calculated to be 6*5, or 30. Yet the actual number of matches in the test is 88, as seen in appendix C. What causes such a significant deviation from the expected output? The reason is simple: new words are created across the boundary between the segments for "caution" and the segments for "able". For example, one of the utterances in the cohort is "bus a make all". In this particular case, "make" is a word that is created across the boundary when the two series of segments are concatenated. In fact, 55 other utterances were created in the same unexpected manner.
As the number of words in an utterance increases, the probability of creating
unexpected words in the matching process also increases. In fact, this undesirable
effect increases exponentially with the number of words, as shown in figure 5-9.
cohort of "caution"
1. bus all
2. caution
3. cousin
4. does all
5. go some
6. toe some
cohort of "able"
1. able
2. ache all
3. ache an
4. ape all
5. ape an
Table 5.9: Two cohorts corresponding to the two individual series of segments representing the words, "caution" and "able".
5.3 Conclusion
In this experiment, we have analyzed the results of six different tests to gain insights
into the feature-level and the word-level characteristics of the data samples. One of
the observations we have made is that when all the segments had complete sets of
features, almost all of the data samples matched exactly to the "intended" phrase,
independent of the number of words in the data sample.
The second convention for the data samples stated that all segments carry only the articulator-free features, while the vowel and glide segments additionally carry the articulator-bound features caused by the tongue body. When this convention was used for the data samples, the results showed that the performance of the system decreased exponentially with the increasing number of words in the input. In fact, the average number of utterances in the cohort increased exponentially faster than the expected number due to the newly created words between word boundaries.
From the results, we can conclude that the correlation between the number of
words in the input and the number of matched utterances in the output of the matcher
decreases as more features are defined in the segments. Furthermore, when all the
features are defined, the results reveal that no correlation exists.
Interestingly, only a few data samples caused the mean to increase significantly in
the 4-IC test. For example, sample #68-156-12-56 had 1242 possible utterances that
matched. Because this unexpected behavior is not common, it is my conjecture that
a few additional detected features in the sets would help avoid these unusual cases.
Chapter 6
Summary
In this project, we have examined the lexical access system with particular attention given to the second subsystem, the matcher. The previous version, called the original matcher, ignored the fact that different types of variations exist in a real speech signal. Therefore, the original matcher would fail to function in real situations.
In an effort to account for these uncertainty factors, a modified matcher was designed. Unlike the original matcher, it tolerates some uncertainty in the signal analysis of the processing subsystem. More specifically, this matcher allows the segments to be defined with incomplete sets of features. Also, a series of segments is not required to span a complete utterance in this new matcher.
When the design of the modified matcher was complete, an experiment was conducted to analyze the effects of incomplete feature sets. Because only a very small number of features were defined for the 1-IC, 2-IC, and 4-IC tests, the results were not as good as we had hoped. In these tests, for example, the number of possible utterances in the cohort increased exponentially as the number of words in the input increased linearly. In fact, our results show some unexpected, and undesirable, behavior in which new words were created across word boundaries.
This experiment is only a stepping stone toward a further series of similar experiments to identify which features provide the most information, which provide little, and everything in between. A mapping from features to values that indicate the amount of information they inherently possess is essential in designing an efficient lexical access system.
Where To Go From Here
It has been speculated that the first segment of a word typically has more observable acoustic cues than the later segments. Though it is difficult to estimate its effect on performance, such information would certainly help reduce the ambiguity among the utterances in the cohort. To better understand its effects, an experiment with appropriate parameters should be pursued in the future.
Also, information regarding syllabic affiliation may reduce the ambiguity among
matched words in the cohort. This would require the researcher to change the source
code of the matcher such that the new algorithm recognizes syllabic affiliations.
As mentioned in section 4.2, one of the design assumptions of the original matcher was not resolved in this modified matcher: the deletion of segments. As one can imagine, even a single missing segment in a speech signal will cause tremendous problems in the current matcher, and this issue therefore needs to be resolved. One method is to define linguistic rules in the matching algorithm that compensate for the deletion of segments.
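As a concrete illustration of the kind of change this would involve, the following is a minimal sketch; it is not the thesis matcher, it uses simplified stand-in types (a feature bundle is reduced to a name-to-sign map), and instead of linguistic rules it shows a brute-force alternative that tolerates a single deleted segment by trying to skip each lexical segment in turn. A real solution would restrict the allowed skips with the linguistic rules suggested above.

#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Simplified stand-in for a feature bundle: feature name -> +1 / -1.
using Segment = std::map<std::string, int>;

// Every feature defined in the observed segment must agree with the lexical one
// (the observed segment may carry an incomplete feature set).
bool features_agree(const Segment& lexical, const Segment& observed)
{
    for (const auto& f : observed) {
        auto it = lexical.find(f.first);
        if (it == lexical.end() || it->second != f.second) return false;
    }
    return true;
}

// Align a lexical pronunciation with the observed segments while allowing at
// most one lexical segment to have been deleted from the signal.
bool match_with_deletion(const std::vector<Segment>& lexical,
                         const std::vector<Segment>& observed)
{
    // no deletion: plain segment-by-segment comparison, as the matcher does now
    if (observed.size() == lexical.size()) {
        for (std::size_t i = 0; i < lexical.size(); i++)
            if (!features_agree(lexical[i], observed[i])) return false;
        return true;
    }
    // one deletion: try skipping each lexical segment in turn
    if (observed.size() + 1 == lexical.size()) {
        for (std::size_t skip = 0; skip < lexical.size(); skip++) {
            bool ok = true;
            for (std::size_t i = 0, j = 0; i < lexical.size(); i++) {
                if (i == skip) continue;                      // assume this segment was deleted
                if (!features_agree(lexical[i], observed[j++])) { ok = false; break; }
            }
            if (ok) return true;
        }
    }
    return false;
}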
Finally, more modifications should be made in the current matcher to further relax the constraints on the input. One method to achieve this is to utilize an uncertainty-measure scheme, where a feature can have numeric values (e.g., 0 to 10, in which 0 = very uncertain and 10 = very certain). In this scheme, the matcher will give the signal processor more freedom to identify acoustic cues and define them amidst uncertainty. Currently, the modified matcher utilizes a +/- scheme; if a feature is not identifiable with 100 percent certainty, or very close to it, it must be left out of the segment for the matching algorithm. As one can imagine, such a scheme is not an efficient method of utilizing all the pertinent information in a signal. A more elaborate uncertainty-measure scheme would be appropriate for the finalized lexical access system.
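To make the proposal concrete, here is a minimal sketch, under assumed simplified types, of how a 0-to-10 certainty value could enter the comparison. The names and the additive scoring rule are illustrative only and are not part of the current matcher.

#include <map>
#include <string>

// Illustrative only: a detected feature carries its value and a certainty
// from 0 (very uncertain) to 10 (very certain).
struct DetectedFeature {
    int value;      // +1 or -1
    int certainty;  // 0..10
};

using ObservedSegment = std::map<std::string, DetectedFeature>;
using LexicalSegment  = std::map<std::string, int>;   // fully specified, +1 / -1

// Score how strongly an observed segment supports a lexical segment: agreeing
// features add their certainty, disagreeing features subtract it, and features
// the signal processor could not detect contribute nothing -- instead of having
// to be dropped from the segment entirely, as in the current +/- scheme.
int segment_score(const LexicalSegment& lexical, const ObservedSegment& observed)
{
    int score = 0;
    for (const auto& f : observed) {
        auto it = lexical.find(f.first);
        if (it == lexical.end()) continue;          // feature not used by this phoneme
        score += (it->second == f.second.value) ? f.second.certainty
                                                : -f.second.certainty;
    }
    return score;
}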
Appendix A
248-Word Lexicon
  1. a           2. able        3. about       4. above       5. ache        6. add
  7. again       8. all         9. among      10. an         11. and        12. another
 13. ape        14. appear     15. are        16. arlene     17. as         18. at
 19. away       20. baby       21. back       22. bad        23. bake       24. balloon
 25. bat        26. be         27. because    28. become     29. before     30. began
 31. begin      32. below      33. beside     34. between    35. big        36. blow
 37. book       38. box        39. boy        40. bug        41. bus        42. busy
 43. but        44. by         45. can        46. came       47. can        48. canoe
 49. cant       50. case       51. catch      52. caution    53. city       54. come
 55. comes      56. contain    57. cookie     58. could      59. cousin     60. cub
 61. cup        62. day        63. debate     64. debby      65. deny       66. did
 67. dig        68. do         69. dog        70. does       71. done       72. dont
 73. dub        74. dug        75. duty       76. easy       77. even       78. every
 79. fame       80. feel       81. few        82. fifteen    83. follow     84. for
 85. from       86. gas        87. gave       88. get        89. give       90. gives
 91. giving     92. go         93. gone       94. good       95. goodbye    96. had
 97. has        98. have       99. he        100. her       101. hike      102. him
103. hip       104. his       105. how       106. i         107. ike       108. in
109. into      110. is        111. it        112. jake      113. just      114. keep
115. kept      116. knock     117. lasso     118. lazy      119. leaky     120. leave
121. like      122. lock      123. long      124. look      125. looked    126. loss
127. mack      128. make      129. man       130. many      131. mary      132. may
133. me        134. men       135. money     136. more      137. mouth     138. much
139. my        140. name      141. nasal     142. never     143. nineteen  144. no
145. not       146. nothing   147. now       148. number    149. of        150. on
151. only      152. oppose    153. or        154. other     155. out       156. pail
157. pay       158. perch     159. phone     160. pig       161. pop       162. potato
163. power     164. put       165. rabbit    166. rabid     167. raccoon   168. remember
169. rug       170. said      171. saw       172. say       173. seal      174. see
175. seek      176. seize     177. shake     178. she       179. shoe      180. short
181. shortage  182. show      183. sing      184. small     185. so        186. soak
187. some      188. something 189. sport     190. state     191. submit    192. sudden
193. support   194. suppose   195. system    196. take      197. taken     198. takes
199. taking    200. than      201. that      202. the       203. them      204. then
205. there     206. these     207. they      208. this      209. time      210. to
211. today     212. toe       213. tom       214. tonight   215. too       216. took
217. unite     218. up        219. us        220. very      221. want      222. wanted
223. wants     224. was       225. water     226. way       227. we        228. well
229. went      230. were      231. what      232. when      233. which     234. will
235. with      236. without   237. woman     238. women     239. word      240. would
241. write     242. year      243. you       244. your      245. zap       246. zip
247. zoo       248. zoom

Table A.1: In all the tests, index numbers above are used to indicate specific words.
Appendix B
Standard Phonemes
Utterance: Standard Features Version 1 7/11/1998
Time: (nil)
Symbol: ae
Prosody: (nil)
Release Or Closure: unspecified
Features:
Time: (nil)
Symbol: iy
Prosody: (nil)
Release Or Closure: unspecified
Features:
+ vowel
- high
+ vowel
+ high
- low
- back
+ advtongueroot
- consttonguejroot
+ low
- back
- advtongue-root
- consttonguejroot
Time: (nil)
Symbol: aa
Prosody: (nil)
Release Or Closure: unspecified
Features:
Time: (nil)
Symbol: ih
Prosody: (nil)
Release Or Closure: unspecified
Features:
+ vowel
- round
+ vowel
+ high
- low
- back
- advjtongueroot
- consttongue-root
- high
+
- advtongueroot
+ consttonguejroot
Time: (nil)
Symbol: ao
Prosody: (nil)
Release Or Closure: unspecified
Features:
Time: (nil)
Symbol: ey
Prosody: (nil)
Release Or Closure: unspecified
Features:
+ vowel
+ vowel
+ round
- high
- high
- low
- back
+ advjtongueroot
- consttonguejroot
- low
+ back
- advtonguejroot
+ consttongue-root
Time: (nil)
Symbol: eh
Prosody: (nil)
Release Or Closure: unspecified
Features:
+
low
+ back
Time: (nil)
Symbol: ow
Prosody: (nil)
Release Or Closure: unspecified
Features:
vowel
+ vowel
+ round
- high
- low
- back
- adv-tongueroot
- consttonguejroot
- high
- low
+ back
+
advtonguejroot
- consttongue-root
- low
+ back
- advtongue-root
- consttongue-root
Time: (nil)
Symbol: ah
Prosody: (nil)
Release Or Closure: unspecified
Features:
+ vowel
- round
- high
-
Time: (nil)
Symbol: er
Prosody: (nil)
Release Or Closure: unspecified
Features:
low
+ back
- advjtongueroot
- consttonguejroot
+ reduced
+ vowel
+ rhotic
+ blade
- round
- anterior
- distributed
Time: (nil)
Symbol: uw
Prosody: (nil)
Release Or Closure: unspecified
Features:
+
- high
- low
+ back
- advtonguejroot
- consttonguejroot
vowel
+ round
+ high
-
low
+ back
+ advtongueroot
- consttonguejroot
Time: (nil)
Symbol: x
Prosody: (nil)
Release Or Closure: unspecified
Features:
Time: (nil)
Symbol: uh
+ reduced
+ vowel
- round
Prosody: (nil)
Release Or Closure: unspecified
Features:
+ vowel
+ round
+ high
-
- high
- low
- advtongue-root
- consttonguejroot
low
+ back
- advjtongueroot
- consttongue-root
Time: (nil)
Symbol: w
Prosody: (nil)
Release Or Closure: unspecified
Features:
+ glide
+ lips
Time: (nil)
Symbol: rr
Prosody: (nil)
Release Or Closure: unspecified
Features:
- reduced
+ vowel
+ rhotic
+ blade
- round
- anterior
- distributed
- high
+ round
+ high
- low
+ back
+ advtongue-root
- consttonguejroot
Time: (nil)
Symbol: y
- high
Prosody: (nil)
- low
+ back
- nasal
Release Or Closure: unspecified
Features:
+ glide
+ blade
- anterior
+ distributed
+ high
-
Time: (nil)
Symbol: m
Prosody: (nil)
Release Or Closure: unspecified
Features:
low
- back
+ advtonguejroot
- consttongue-root
+ consonant
- continuant
+ sonorant
+
Time: (nil)
Symbol: r
Prosody: (nil)
Release Or Closure: unspecified
Features:
+ glide
-
Time: (nil)
Symbol: n
Prosody: (nil)
Release Or Closure: unspecified
Features:
lateral
+ rhotic
+ blade
- anterior
- distributed
- high
-
lips
round
+ nasal
-
+ consonant
- continuant
+ sonorant
+ blade
+ anterior
- distributed
+ nasal
low
+ back
- nasal
Time: (nil)
Symbol: h
Prosody: (nil)
Time: (nil)
Symbol: ng
Prosody: (nil)
Release Or Closure: unspecified
Features:
Release Or Closure: unspecified
Features:
+ glide
+
larynx
+ consonant
- continuant
+ sonorant
+ body
+ high
- low
+ nasal
+ spread-glottis
-
constricted-glottis
Time: (nil)
Symbol: 1
Prosody: (nil)
Release Or Closure: unspecified
Features:
+ consonant
- continuant
+ sonorant
+
Time: (nil)
Symbol: v
Prosody: (nil)
Release Or Closure: unspecified
Features:
lateral
- rhotic
+ blade
+ anterior
- distributed
+ consonant
+ continuant
- sonorant
- strident
+ lips
- round
+ slackvocalfolds
+ consonant
+ continuant
- sonorant
- strident
+ lips
Time: (nil)
Symbol: dh
Prosody: (nil)
Release Or Closure: unspecified
Features:
+ continuant
- sonorant
- strident
+ blade
+ anterior
+ distributed
+ continuant
- sonorant
- strident
+ blade
+ anterior
+ distributed
slackvocalfolds
-
slackvocalfolds
Time: (nil)
Symbol: s
Prosody: (nil)
Release Or Closure: unspecified
Features:
+ consonant
+ continuant
- sonorant
+ strident
+ blade
+ anterior
- distributed
+ slackvocalfolds
+ consonant
+ continuant
- sonorant
+ strident
+ blade
+ anterior
- distributed
- slackvocalfolds
Time: (nil)
Symbol: zh
Prosody: (nil)
Release Or Closure: unspecified
Features:
Time: (nil)
Symbol: sh
Prosody: (nil)
Release Or Closure: unspecified
Features:
+ consonant
continuant
- sonorant
+ strident
+ blade
- anterior
+ distributed
+
slack vocal folds
+ consonant
Time: (nil)
Symbol: z
Prosody: (nil)
Release Or Closure: unspecified
Features:
+
round
-
Time: (nil)
Symbol: th
Prosody: (nil)
Release Or Closure: unspecified
Features:
+ consonant
+
-
+ consonant
+ continuant
- sonorant
+ strident
+ blade
- anterior
+ distributed
slackvocalfolds
Time: (nil)
Symbol: f
Prosody: (nil)
Release Or Closure: unspecified
Features:
-
slackvocalfolds
Time: (nil)
Symbol: dj
Prosody: (nil)
Release Or Closure: unspecified
Features:
+ consonant
- continuant
- sonorant
+ strident
+ blade
- anterior
+ distributed
+ slackvocalfolds
Time: (nil)
Symbol: g
Prosody: (nil)
Release Or Closure: unspecified
Features:
+ consonant
- continuant
- sonorant
+ body
+ high
-
low
+ slackvocalfolds
Time: (nil)
Symbol: ch
Time: (nil)
Prosody: (nil)
Release Or Closure: unspecified
Features:
+ consonant
- continuant
- sonorant
+ strident
+ blade
- anterior
+ distributed
- slackvocalfolds
Symbol: p
Prosody: (nil)
Time: (nil)
Time: (nil)
Symbol: b
Symbol: t
Prosody: (nil)
Release Or Closure: unspecified
Features:
+ consonant
- continuant
- sonorant
+ blade
+ anterior
- distributed
- constricted-glottis
- slackvocalfolds
Release Or Closure: unspecified
Features:
+ consonant
- continuant
- sonorant
+ lips
- round
- constricted-glottis
- slackvocalfolds
Prosody: (nil)
Release Or Closure: unspecified
Features:
+ consonant
- continuant
- sonorant
+ lips
- round
+ slackvocalfolds
Time: (nil)
Symbol: d
Prosody: (nil)
Release Or Closure: unspecified
Features:
+ consonant
- continuant
- sonorant
+ blade
+ anterior
- distributed
+ slackvocalfolds
Time: (nil)
Symbol: k
Prosody: (nil)
Release Or Closure: unspecified
Features:
+ consonant
- continuant
- sonorant
+ body
+ high
- low
- constricted-glottis
- slackvocalfolds
Time: (nil)
Symbol: Is
Prosody: (nil)
Release Or Closure: unspecified
Features:
+ reduced
+
vowel
+ anterior
- rhotic
- distributed
- high
-
low
+ back
- advjtongueroot
- consttongue-root
- nasal
Appendix C
Rules: An Example
Rule: place of articulation change
Segment: T
Type: L
Prosody: (nil)
Release Or Closure: (nil)
Features:
Segment: C1
Type: (nil)
Prosody: (nil)
Release Or Closure: unspecified
Features:
+ consonant
- continuant
+ blade
+ anterior
- distributed
Segment: C2
Type: #
Prosody: (nil)
Release Or Closure: unspecified
Features:
Segment: C3
Type: (nil)
Prosody: (nil)
Release Or Closure: unspecified
Features:
+ consonant
+ lips
Segment: S1
Type: C1
Prosody: (nil)
Release Or Closure: unspecified
Features:
C3 lips
C3 blade
C3 dorsum
C3 round
C3 anterior
C3 distributed
C3 high
C3 low
C3 back
Appendix D
Source Code for Tree-Growth
////// Tree //////
// Constructor
_tree::_tree(sentence new_s)
{
    s = new _sentence;
    s = new_s;
    NoOfFirst = 0;   // to record the number of possible first segments
    FirstWordOffset = new int[MaxMatchPerPos];
    sent = new words[s->get_NoOfSegments() + 1];
    int i;
    for (i = 0; i < s->get_NoOfSegments(); i++)
        sent[i] = NULL;

    // Find '#' and '%', which symbolize the first and the last words
    // of a sentence.  Neither is required to exist, in which case the program
    // assumes that the first/last segment in the data isn't necessarily
    // the first/last segment of the possible utterance.
    // NoOfSegInSentence is required because if the sentence file contains
    // the end symbol, '%', then the program needs to stop at that segment
    // while ignoring all subsequent segments.
    for (i = 0; i < s->get_NoOfSegments(); i++)
    {
        if (!strsame(s->get_data(i)->get_time(), "%"))
            continue;                        // ** if no '%' for the given segment, go to the next one
        else
        {
            s->set_NoOfSegInSentence(++i);   // ** if '%', then record the number of segments
            break;
        }
    }

    // record whether or not '%' exists
    if (s->get_NoOfSegInSentence() == 0)
        LastWord = 0;
    else LastWord = 1;

    // test to see if '#' exists
    if (!strsame(s->get_data(0)->get_time(), "#"))
        FirstWord = 0;
    else FirstWord = 1;

    if (LastWord)                            // ** if '%' present
        NoOfSegs = s->get_NoOfSegInSentence();
    else NoOfSegs = s->get_NoOfSegments();
}
// add a word into the matching tree
// at a specific position
void _tree::add(word w, int index)
{
    // check boundary for non-one-word cases
    if ((LastWord) && (index))
        if (w->get_NoOfSegments() + index > NoOfSegs)
            return;

    // check boundary
    if (index >= 0 && index < NoOfSegs)
    {
        // if no word array exists at the position specified, create it first
        if (sent[index] == NULL)
            sent[index] = new _words(MaxMatchPerPos);
        // add word to that word array
        //cout << "word added " << w->get_word() << " " << index << endl;
        sent[index]->add(w);
    }
    else terminate("_tree::add", "invalid index", itoa(index));
}
// output complete and incomplete matched sentences
void _tree::output_aux(destination des, FILE* writeptr, char* s, int index, output_mode om)
{
    int i;
    char *temp;

    if (LastWord)
        if (index > NoOfSegs)
            terminate("_tree::output_aux", "index overflow", itoa(index), itoa(NoOfSegs));

    // when the output request is made at the final segment
    if ((LastWord && (index == NoOfSegs)) || (!LastWord && (index >= NoOfSegs)))
    {
        // if mode is completed, then output the sentence
        if (om == completed)
        {
            if (des == screen)
                cout << NoOfCompleted++ << ". " << s << "." << endl;
            else if (des == disk)
                fprintf(writeptr, "%d. %s.\n", NoOfCompleted++, s);
        }
    }
    // when the output request is not made at the final segment
    else
    {
        // if we cannot continue to follow any word links
        if (sent[index] == NULL)
        {
            // if the output mode is incomplete then output
            if (om == incomplete)
            {
                if (des == screen)
                    cout << NoOfIncomplete++ << ". " << s << endl;
                else if (des == disk)
                    fprintf(writeptr, "%d. %s\n", NoOfIncomplete++, s);
            }
        }
        // if we are not at the final segment and there are more word links to
        // follow from the current position, recurse on all the possible links
        else
        {
            for (i = 0; i < sent[index]->get_NoOfData(); i++)
            {
                // copy the words up to this point and add on the next word,
                // then recurse output on the next link point
                temp = new char[1000];
                strcpy(temp, s);
                strcat(temp, " ");
                //cout << "this is what i have: " << temp << index << i << endl;
                strcat(temp, sent[index]->get_data(i)->get_word());
                // if index==0 and FirstWord==0, then the first word may not be
                // complete.  Therefore, use special variables.
                if (index || FirstWord)
                    output_aux(des, writeptr, temp, sent[index]->get_data(i)->get_NoOfSegments() + index, om);
                else
                    output_aux(des, writeptr, temp, FirstWordOffset[i] + index, om);
            }
        }
    }
    // clean up memory (the buffer was allocated with new[] by the caller)
    delete [] s;
}
void _tree::output(destination des, FILE* writeptr)
{
    // set the GLOBAL counters
    NoOfCompleted = 1;
    NoOfIncomplete = 1;
    char *st;

    // titles
    if (des == screen)
        cout << "\n======= Completed =======\n";
    else if (des == disk)
        fprintf(writeptr, "\n\n======= Results =======\n\n======= Completed =======\n");

    // output completed sentences
    st = new char[5]; strcpy(st, "");
    output_aux(des, writeptr, st, 0, completed);

    // titles
    if (des == screen)
        cout << "\n======= Incomplete =======\n";
    else if (des == disk)
        fprintf(writeptr, "\n======= Incomplete =======\n");

    // output incomplete sentences
    st = new char[5]; strcpy(st, "");
    output_aux(des, writeptr, st, 0, incomplete);

    cout << endl;
}
// grow matching tree
void _tree::grow(rules rs, lexicon lx)
{
    grow_aux2(rs, lx, 0);
}

// recursively grow tree starting from a sentence position
void _tree::grow_aux2(rules rs, lexicon lx, int starting_position)
{
    if (sent[starting_position] != NULL) return;

    //int i;
    //for (i=0; i<NoOfSegments; i++)

    // grow at this particular position
    grow_aux(rs, lx, starting_position);
    if (sent[starting_position] == NULL)
        return;

    // recursively follow the links based on the matched words
    // at this position
    for (int i = 0; i < sent[starting_position]->get_NoOfData(); i++) {
        if ((FirstWord || (!FirstWord && (starting_position))))
            grow_aux2(rs, lx, starting_position + sent[starting_position]->get_data(i)->get_NoOfSegments());
        else if (!FirstWord && (!starting_position))
            grow_aux2(rs, lx, starting_position + FirstWordOffset[i]);
    }
}
// grow matching tree for a particular sentence position
void _tree::grow_aux(rules rs, lexicon lx, int starting_position)
{
    // if the current position has already been grown, quit
    if ((sent[starting_position] != NULL) || (starting_position >= NoOfSegs)) return;

    // construct a temporary lexicon
    lexicon temp_lx = lx->copy();

    // To suppress rules in this program, DO NOT comment the next line
    // out.  Doing so will prevent the matcher from working properly because
    // the expansion bit is determined when the next line is called.  To
    // ignore the rules, go to the rules.rs file and erase the rules as you wish.
    temp_lx->apply_rules(rs, s, starting_position);
    //cout << starting_position << endl;

    // Assuming the lexicon is expanded with a set of external rules ONLY,
    // we will just match the WORDs that were expanded and skip those that
    // were not expanded by the external rules.  The ones not expanded by the
    // external rules are the ones that do not match the first segment of
    // the target position in the sentence.  This is an optimization so we
    // do not need to expand all the words inside the lexicon, but only those
    // PROMISING ones that match the first segment of the sentence.
    int i, j;
    word w;

    // loop through all the touched words in the lexicon as well as all their
    // multiple pronunciations and alternative pronunciations
    for (i = 0; i < temp_lx->get_NoOfData(); i++)
    {
        if (temp_lx->get_data(i)->get_expanded() == exp)
        {
            for (j = 0; j < temp_lx->get_data(i)->get_NoOfData(); j++)
            {
                //cout << "Word tested: " << temp_lx->get_data(i)->getLabel() << endl;
                // if any pronunciation matches the current context in the sentence,
                // it is added to the word array for this position.

                // for the case where '#' doesn't exist and starting_position == 0
                if (!FirstWord && (!starting_position)) {
                    if (match_search(temp_lx->get_data(i)->get_data(j), starting_position))
                    {
                        w = new _word(temp_lx->get_data(i)->getLabel(),
                                      temp_lx->get_data(i)->get_data(j)->get_NoOfSegments());
                        add(w, starting_position);
                        // after one match do we go on or do we quit?
                        // the choice here is quit
                        break;
                        // alternatively we can go on and find more matches for the
                        // same word; to do this just get rid of the break
                    }
                }

                // for the case where '#' does exist or ('#' doesn't exist and
                // starting_position doesn't equal 0)
                if ((FirstWord || (!FirstWord && (starting_position))))
                {
                    if (matching(temp_lx->get_data(i)->get_data(j), s, starting_position))
                    {
                        w = new _word(temp_lx->get_data(i)->getLabel(),
                                      temp_lx->get_data(i)->get_data(j)->get_NoOfSegments());
                        add(w, starting_position);
                        // after one match do we go on or do we quit?
                        // the choice here is quit
                        break;
                        // alternatively we can go on and find more matches for the
                        // same word; to do this just get rid of the break
                    }
                }
            }
        }
    }

    // clean up the temporary lexicon copy
    delete temp_lx;
}
// print out tree status
void _tree::status()
{
    int i, j;
    // sentence
    cout << "Sentence: " << s->get_content();
    cout << "\nposition\tmatch\n--------\t-----\n";
    // position -- matchings
    for (i = 0; i < NoOfSegs; i++)
    {
        if (sent[i] != NULL)
        {
            cout << i << "\t\t";
            for (j = 0; j < sent[i]->get_NoOfData(); j++)
                cout << sent[i]->get_data(j)->get_word() << " ";
            cout << "\n";
        }
    }
}
// match a pronunciation with the sentence context at a given position
int _tree::matching(pronunciation p, sentence s, int index)
{
    // overflow case
    if (LastWord)
        if (index + (p->get_NoOfSegments()) > s->get_NoOfSegments())
            return 0;

    int i;
    // segment by segment match of the pronunciation to the sentence context;
    // if just one fails we exit
    for (i = 0; i < p->get_NoOfSegments(); i++)
    {
        if (!(bundle_incomp_match(p->get_data(i), s->get_data(index + i))))
            return 0;
        else if (!LastWord)
            if ((index + i + 1) == NoOfSegs)
                return 1;
    }
    return 1;   // every segment of the pronunciation matched
}
// ** For a given pronunciation from a word in the lexicon, loop
// ** through each phoneme and compare it to the "incomplete" word.
// ** index isn't required for a single-word search.  This procedure is used
// ** only on the first word of the sentence, given that '#' is absent.
int _tree::match_search(pronunciation p, int index)
{
    int i, count_match = 0, count_miss = 0;

    for (i = 0; i < p->get_NoOfSegments(); i++) {
        // This condition is used if there is only one word to be analyzed.
        // If all the segments of the input match, yet this is not the
        // end of the pronunciation, then start the search again.
        // This means that count_match should be set to zero and
        // count_miss should be incremented by the amount that is needed.
        if (count_match == NoOfSegs) {
            if ((i != p->get_NoOfSegments()) && (LastWord)) {
                count_match = 0;
                count_miss = count_miss + NoOfSegs;
            }
        }

        // get the first word in the sentence.  The first word can start at
        // any segment within itself.
        if (!(bundle_incomp_match(p->get_data(i), s->get_data(count_match + index)))) {
            if (!count_match) {
                count_miss++;
                continue;
            }
            // If a sequence of matches isn't complete, then start the matching
            // process over again.  count_miss is incremented by the number of
            // previous matches in the sequence plus one (for the current mismatch),
            // and count_match is set to zero.
            else {
                count_miss = count_miss + count_match + 1;
                count_match = 0;
                continue;
            }
        }
        else
            count_match++;

        if (!LastWord)                              // ** if '%' doesn't exist
            if ((index + count_match) == NoOfSegs)  // ** if last segment
                break;
    }

    // ** if there is no match at all, then return 0
    if (count_miss == p->get_NoOfSegments())
        return 0;

    // record the # of matched segments for the array of possible first words
    FirstWordOffset[NoOfFirst++] = count_match;
    return 1;
}
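To show how the methods above fit together, here is a hypothetical driver that is not part of the listed source. It assumes the sentence, rules, lexicon and destination types declared elsewhere in the matcher (including the destination value disk used by output() above), and it takes already-loaded objects so that no file-loading code has to be invented here.

// A sketch of the intended calling order for the _tree class above.
void run_match(sentence s, rules rs, lexicon lx, FILE* writeptr)
{
    _tree* t = new _tree(s);      // one tree per segmentized utterance
    t->grow(rs, lx);              // expand the lexicon with the rules and add matching words
    t->status();                  // optional: show which words matched at each position
    t->output(disk, writeptr);    // write the completed and incomplete utterances
    delete t;
}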
Appendix E
Results for One-Word Tests
Lexical Word Number
2
4
6
8
10
12
14
16
18
20
22
24
26
28
30
Number of Matches with Complete Sets (A); Number of Matches with Incomplete Sets (B)
1
5
MEAN
1
5
STD. DEV
1
2 MAXIMUM
1
1
1
2
1
11
1
1
1
1
1
2
1
4
1
5
1
1
1
1
1
22
1
17
32
1
10
34
36
38
1
1
1
12
1
1
40
1
7
42
44
1
1
3
1
46
48
50
1
1
1
2
1
1
52
1
6
54
1
2
56
1
5
58
2
5
60
1
7
62
1
2
64
66
68
1
1
1
4
5
3
70
72
74
1
1
1
2
1
7
76
1
1
78
1
1
           A        B
MEAN       1.024    2.685
STD. DEV   0.154    3.014
MAXIMUM    2        22
WORD#
80
82
84
86
88
90
92
94
96
98
100
102
104
106
108
110
112
114
116
118
120
122
124
126
128
130
132
134
136
138
140
142
144
146
148
150
152
154
156
158
160
A
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
B
1
1
1
1
4
1
2
5
1
1
2
2
1
1
1
1
2
5
2
3
1
2
1
1
1
2
1
2
1
1
2
2
1
2
1
1
7
3
2
1
5
WORD #
162
164
166
168
170
172
174
176
178
180
182
184
186
188
190
192
194
196
198
200
202
204
206
208
210
212
214
216
218
220
222
224
226
228
230
232
234
236
238
240
242
244
246
248
A
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
1
1
1
1
1
1
1
1
1
1
1
2
1
1
1
1
1
1
B
3
5
2
1
2
1
2
1
2
1
2
1
1
2
1
6
1
4
1
1
1
3
2
2
3
2
4
5
1
1
3
1
1
2
1
2
1
2
2
1
1
1
2
1
Appendix F
Results for Two-Word Tests
Lexical Word Number
126-46
142-244
152-80
22-232
224-180
86-46
60-158
124-216
136-142
152-164
222-70
36-34
Number of Matches with Comp-Sets (A); Number of Matches with Incomp-Sets (B)
1
2
1
2
1
7
1
2
1
1
1
2
1
7
1
5
1
2
5
35
1
6
1
12
82-2
180-12
1
1
5
11
74-126
16-8
102-214
1
1
1
7
1
20
162-24
1
13
52-84
222-228
88-86
176-122
88-212
176-202
1
1
1
1
1
1
6
6
4
2
8
1
140-168
1
2
186-112
1
1
178-122
202-100
40-230
144-74
4
1
1
1
2
3
7
9
104-42
1
3
156-222
182-130
134-214
1
1
1
6
5
20
82-236
2
2
132-190
158-142
10-238
52-2
238-244
124-208
1
1
1
1
1
1
1
2
5
88
5
2
           A        B
MEAN       1.121    7.742
STD. DEV   0.501    12.996
MAXIMUM    5        88
WORD#
128-226
156-234
94-212
148-84
188-144
10-86
196-200
102-56
6-52
56-224
24-64
188-166
66-122
124-244
134-156
4-50
42-88
52-154
120-242
196-150
26-228
148-150
52-210
220-28
2-114
10-46
160-120
190-72
66-130
236-232
150-56
6-10
204-170
198-188
178-182
16-86
34-164
20-136
246-236
208-18
80-30
A
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
1
1
1
1
1
2
1
1
1
1
1
1
1
1
2
1
2
B
1
2
10
1
2
2
4
18
12
9
4
4
10
1
4
3
16
20
1
8
2
1
18
14
17
2
5
1
10
4
5
2
6
2
5
1
60
9
4
7
9
WORD#
164-190
194-20
48-112
72-58
96-170
36-158
102-246
186-2
242-180
80-90
26-144
28-92
100-158
214-124
172-194
26-214
98-6
132-212
238-154
176-204
230-176
122-44
136-18
4-86
120-58
76-196
92-172
38-42
96-166
58-30
42-234
154-92
118-194
42-238
40-150
110-166
6-178
182-104
142-52
12-78
240-30
148-196
A
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
2
1
1
1
1
2
1
B
5
4
1
5
1
1
4
1
1
1
1
49
2
4
1
14
2
3
18
3
2
4
2
3
5
6
2
3
2
85
3
7
8
6
11
10
4
2
12
3
9
4
Appendix G
Results for Four-Word Tests
Lexical Word Number; Number of Matches with Comp-Sets (A); Number of Matches with Incomp-Sets (B)
132-66-238-2
152-168-166-230
240-202-212-174
1
1
1
54
14
4
192-32-96-144
1
307
36-226-202-40
170-16-94-194
26-216-96-198
1
1
1
7
50
5
146-170-188-200
158-198-14-156
1
1
20
2
66-158-162-16
1
65
208-90-184-42
14-22-246-110
2-36-16-210
92-86-128-206
32-162-150-144
228-108-54-188
228-152-204-72
1
1
2
1
1
1
1
6
10
15
2
242
10
63
84-212-114-84
156-166-14-208
128-80-92-230
180-78-188-116
90-160-176-168
182-42-22-202
1
1
1
1
1
1
17
8
2
4
5
45
110-118-170-180
1
8
58-198-98-82
70-42-46-228
38-154-112-138
232-128-92-148
202-68-210-48
54-44-102-84
1
1
1
1
2
1
5
28
7
12
9
6
46-216-134-2
1
160
146-8-10-168
30-44-160-228
210-238-180-220
184-232-180-220
218-92-202-16
112-194-230-142
24-70-180-226
188-124-108-120
1
2
2
1
1
1
1
1
2
362
15
4
2
20
2
2
136-146-66-210
2
75
           A         B
MEAN       1.097     67.790
STD. DEV   0.296     152.637
MAXIMUM    2         1242
WORD#
180-30-178-228
146-166-122-26
240-126-150-174
134-14-224-20
130-244-218-34
58-240-172-76
176-86-2-98
150-56-50-14
4-90-24-146
14-110-96-156
230-96-230-214
130-176-126-114
112-94-98-222
126-40-148-192
10-178-80-32
82-170-192-30
96-12-28-90
30-230-4-54
90-18-2-200
102-64-184-152
94-98-18-54
196-120-152-126
4-224-178-108
18-82-58-88
82-156-120-26
236-44-232-116
64-74-44-184
224-178-178-82
130-20-168-106
24-202-94-174
218-92-202-16
176-124-16-134
68-156-12-56
120-196-160-102
122-186-176-202
112-220-124-108
206-240-54-138
122-196-183-32
174-26-184-220
128-10-88-86
18-40-42-178
182-218-98-98
A
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
2
1
1
1
1
1
1
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
1
B
68
20
2
8
24
5
5
9
5
2
13
20
25
77
20
404
44
170
20
112
20
44
10
140
2
8
42
4
8
10
2
2
1242
40
2
1
4
218
5
12
42
2
WORD#
190-48-20-134
94-152-198-102
40-174-186-204
10-244-106-200
186-2-238-154
120-238-34-6
124-40-34-102
242-218-2-146
168-22-36-178
42-124-102-198
26-190-136-226
174-66-186-128
132-52-160-174
34-74-178-150
28-92-220-146
86-84-240-178
24-84-246-118
238-172-108-28
84-132-32-40
150-104-44-4
168-134-74-102
28-22-126-138
76-100-108-74
88-188-86-176
130-70-44-82
52-2-122-46
236-154-240-204
220-246-242-188
82-96-152-134
176-2-66-118
16-134-224-98
44-170-162-238
168-28-90-94
72-128-162-242
98-212-132-194
128-66-176-32
24-206-224-202
192-242-102-20
6-148-128-236
58-148-172-212
244-54-44-24
94-168-36-28
A
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
1
1
1
1
1
1
1
1
2
1
1
1
1
B
18
104
63
1
738
187
192
115
10
12
1
12
116
168
539
2
6
110
109
5
28
52
14
9
8
352
18
3
14
339
2
130
110
13
40
55
2
88
4
20
3
5
Appendix H
Manual for Match3 Program
This appendix provides detailed instructions for using the modified matcher, called match3. Currently this program is located in the directory '/usr/users/rhkim/PROGS/MATCHER/'
and has been compiled on a Linux machine. The type of machine platform is important because the program has to be recompiled when run on other machines (e.g., SGI machines).
H.1 Setup
Before running the matcher, the user needs four types of files, as explained in chapter 3: standard
template of phonemes, 248-words lexicon, linguistic rules, and series of segments representing an
utterance. All these files should be in text format (emacs is a convenient text editor for this purpose). Since the user is expected to provide all this information, he/she should familiarize himself/herself with the specific format that each of these four files is required to follow.
But before we go into the formats of these files, the user should understand the format for
a segment.
H.1.1 Segment
As one can imagine, segments are used extensively in the lexical access system. Currently, the format used to indicate the contents of a segment is shown in figure H.1. Every segment is required to follow this form.
The first element of a segment is the character, <, which indicates the starting point of a segment. Next, a segment must have the four header variables, Time, Symbol, Prosody, and Release Or Closure, even if all their values are (nil). A set of features and their values follows the variables. These features can be deleted (except the consonant, vowel, and glide features) or added to the segment in any order the user wishes. Finally, > on the last line indicates that the segment is complete.
Time: (nil)
Symbol: iy
Prosody: (nil)
Release Or Closure: unspecified
Features:
+ vowel
+ high
- low
- back
+ adv-tongue-root
- const-tongue-root

Figure H.1: The specific text format of a segment that the matcher expects.
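For readers who want to handle such files in their own tools, the following is a minimal sketch of how a segment block in the format of figure H.1 could be read. It is not the matcher's own parser; the struct and function names are illustrative only.

#include <istream>
#include <map>
#include <string>

// Header fields are kept as strings; each "+ feature" / "- feature" line
// becomes an entry mapping the feature name to +1 or -1.
struct ParsedSegment {
    std::map<std::string, std::string> header;   // Time, Symbol, Prosody, Release Or Closure
    std::map<std::string, int> features;
};

ParsedSegment parse_segment(std::istream& in)
{
    ParsedSegment seg;
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty() || line == "<" || line == "Features:") continue;  // delimiters
        if (line == ">") break;                                            // end of the segment
        if (line[0] == '+' || line[0] == '-') {                            // e.g. "+ vowel"
            std::string name = line.substr(1);
            name.erase(0, name.find_first_not_of(" \t"));
            seg.features[name] = (line[0] == '+') ? +1 : -1;
        } else {                                                           // e.g. "Symbol: iy"
            std::string::size_type colon = line.find(':');
            if (colon == std::string::npos) continue;
            std::string value = line.substr(colon + 1);
            value.erase(0, value.find_first_not_of(" \t"));
            seg.header[line.substr(0, colon)] = value;
        }
    }
    return seg;
}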
H.1.2 Standard Template of Phoneme
An example standard template of phonemes is shown in appendix B. From this appendix, we observe that the template basically consists of two components: a header line and a list of segments. The header line in this example is: "Utterance: Standard Features Version 1 7/11/1998". Although the contents of this line are not important to the matching algorithm, the matcher looks for "Utterance:", and therefore the line should be left alone. Each of the segments follows the format given in figure H.1.
Remember to name this file "standard.label" and to keep it in the directory containing the program, match3.
H.1.3 Lexicon
As an example, a section of a lexicon file in text format is shown in figure H.2. The first line of the lexicon is a header line: "LexiconName: M3LexVersion_1_7/11/1998". Again, this line is not important to the matching algorithm and can be titled anything. The next line is the character, <, which indicates that the list of lexical words is starting. The user may then list any words in the format shown; in fact, the words need not even be in alphabetical order. If a word has two or more possible pronunciations, these pronunciations can be listed together, as shown with "able" in figure H.2. To indicate the end of the lexicon, > is used, as shown in figure H.3.
When saving this file, make sure that it is named "lexicon.lex".
For the tree-based matcher by Maldonaldo, the lexicon format is different; please refer to his thesis for the differences.
Figure H.2: The beginning section of the lexicon.

Figure H.3: The end section of the lexicon.
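For completeness, here is a sketch of one way such a lexicon file could be read programmatically. It is not the loader used by match3, and it assumes a simplified line layout of the form "word [seg seg ...],[seg ...]" (one word per line, with pronunciations in square brackets separated by commas), which matches the entries described above but may differ in detail from the exact file format.

#include <map>
#include <sstream>
#include <string>
#include <vector>

// One pronunciation is a sequence of phoneme symbols; a word may have several.
using Pronunciation = std::vector<std::string>;
using Lexicon = std::map<std::string, std::vector<Pronunciation>>;

// Parse a single lexicon line of the assumed form: able [ey b x l],[ey b l]
void parse_lexicon_line(const std::string& line, Lexicon& lex)
{
    std::string::size_type first_bracket = line.find('[');
    if (first_bracket == std::string::npos) return;           // header, '<', or '>' line

    std::string word = line.substr(0, first_bracket);
    word.erase(word.find_last_not_of(" \t") + 1);              // trim trailing blanks

    std::string rest = line.substr(first_bracket);
    std::string::size_type pos = 0;
    while ((pos = rest.find('[', pos)) != std::string::npos) {
        std::string::size_type end = rest.find(']', pos);
        if (end == std::string::npos) break;
        std::istringstream phones(rest.substr(pos + 1, end - pos - 1));
        Pronunciation p;
        std::string symbol;
        while (phones >> symbol) p.push_back(symbol);          // e.g. "ey", "b", "l"
        lex[word].push_back(p);
        pos = end + 1;
    }
}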
H.1.4 Rules
There are two types of filies for linguistic rules: a segmental description of a rule and a list of
rules. An example format of a rule is shown in appendix C. To better understand how to specify
a particular linguistic rule using this format, please refer to Maldonaldo's thesis, which explains
this topic very clearly. Each rule should be in a separate file. When saving this file, make sure
that these files have a ".rul" extension, such as "rule1.rul".
When a set of rules has been defined in text format, a file consisting of list of these rules
should be made. An example is shown in figure H.4. In this figure, the first three lines are unimportant, as long as the rules are in the same directory. This file is normally called "rules.rs".
RuleSet: M3RSVersion_1_7/11/1998
Type: External
Directory: .
rule1
rule2

Figure H.4: A list of rules used in the matching process. rule1 and rule2 refer to files called "rule1.rul" and "rule2.rul", where each contains a different linguistic rule. Both of these ".rul" files follow the format given in appendix C.
H.1.5 Segments Representing an Utterance
Among all the files, the user needs to be most familiar with this file because it contains a series of
segments which represents an utterance. The previous three files are given and usually left
untouched by the user, but this file must be defined by him/her. The format is very similar to the
format used for the standard template file. Figure H.5 shows an example of a file containing the
phrase, "a day".
Figure H.5: A file containing a series of segments representing "a day".
Just like the standard template, the first line contains the header for this file, which is unimportant to the matching algorithm. Next, the user must type in all the sequential segments as he/she desires, using the segmental form described in section H.1.1. Also, as explained in section 4.3, the '#' and '%' symbols may be used within this file. Moreover, one may choose to delete or insert features in segments as he/she wishes, as shown in figure H.6.
While saving this file in the text editor, make sure that the name has a ".label" extension. For example, the file in figure H.6 may be called "sent1.label".
Figure H.6: A file containing the series of segments from figure H.5, except with some absent features and with the '#' and '%' symbols. This text file will be called sent1.label for the rest of the manual.
H.2 Program
When all four types of files are saved in the same directory, one may start using the program. If any of these files is missing, then the program will abort.
athena% ls
lexicon.lex   match3   rule1.rul   rule2.rul   rules.rs   sent1.label   standard.label
athena% match3

Figure H.7: A list of files required in the same directory. match3 is the matcher. lexicon.lex contains the list of lexical words. standard.label contains all the standard phonemes. rules.rs contains a list of all the rules, rule1.rul and rule2.rul. Finally, sent1.label contains the segmental representation of an utterance.
To start the program, type "match3" at the prompt, as shown in figure H.7. Then the program will begin and automatically initialize itself using the standard template, the lexicon, and the
rules. The matcher should look like figure H.8 at this point.
athena% match3
Welcome to Match G3, system booting .....

SYSTEM INITIALIZATION
Booting up on a sun4 system (w20-575-77.mit.edu) ...
Loading Standard Template: standard.label ..... Done.
Loading Lexicon: lexicon.lex ..... Done.
Loading Rule Set: rules.rs ..... Done.
INITIALIZATION COMPLETE

Welcome to Match G3, rhkim!
* Please enter the filename of a sentence to run. (w/o .label)
* Type 'help' at any time to list all available commands.
Match G3>

Figure H.8: The matcher is now running. At this point, all the initialization is complete and the matcher is waiting for an input from the user. The input is the filename of a segmentized utterance.
To input a file containing a series of segments the user created, as described in section H.1.5, type the name without ".label", as shown in figure H.9.
Figure H.9: The user types in the utterance file name, sent1 (without the .label extension).
Then the matcher will automatically output the cohort with a set of possible utterances, as shown in figure H.10.
A prompt appears again and waits for the user to continue the process. To quit the program, type "quit" at the prompt.
Match G3> sent1
Loading Sentence: ./sent1.label ..... Done.
Growing Matching Tree ... Done. [0 sec]

ANSWER

Completed
1. a day.
2. a pay.

Incomplete
1. up

Match G3>

Figure H.10: An example of the matcher interface when an input is given. In this figure, the input is sent1(.label), which consists of the segments shown in figure H.6. As one can see, there are two complete utterances in the cohort. After the matching process is done, the matcher waits for the next input.
Bibliography
[1] J.Y. Choi. Labeling With Features. MIT, unpublished document, 1999.
[2] N. Chomsky and M. Halle. The Sound Pattern of English. MIT Press, Cambridge, MA, 1991.
[3] P.B. Denes and E.N. Pinson. Speech Chain: the Physics and Biology of Spoken
Language. Anchor Press, Garden City, NY, second edition, 1993.
[4] G. Fant. Acoustic Theory of Speech Production, 1960.
[5] D.P. Huttenlocher. Acoustic-Phonetic and Lexical Constraints in Word Recognition: Lexical Access Using Partial Information. Master's thesis, MIT, 1984.
[6] S.J. Keyser and K.N. Stevens. Feature Geometry and the Vocal Tract. Phonology, 1994.
[7] D.H. Klatt. The Problem of Variability in Speech Recognition and in Models of
Perception. Paper presented at the Conference on Variability and Invariance in
Speech in October, 1983.
[8] P. Li. Feature Modifications and Lexical Access. Master's thesis, MIT, 1993.
[9] Aaron Maldonaldo. Incorporating a Feature Tree Geometry into a Matcher for a Speech Recognizer. Master's thesis, MIT, 1999.
[10] K.N. Stevens. Acoustic Correlates of Some Phonetic Categories. Journal of the Acoustical Society of America, 1980.
[11] K.N. Stevens. Acoustic Phonetics. MIT Press, Cambridge, MA, 1998.
[12] K.N. Stevens and A.S. House. Perturbation of Vowel Articulations by Consonantal Context: An Acoustical Study. Journal of Speech and Hearing Research,
6, 1963.
[13] Y. Zhang. Toward Implementation of a Feature-based Lexical Access System. Master's thesis, MIT, 1998.