Implementing the Matcher in the Lexical Access System with Uncertainty in Data

by

Roy Kyung-Hoon Kim

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 2000

© Roy Kyung-Hoon Kim, MM. All rights reserved.

The author hereby grants to MIT permission to reproduce and distribute publicly paper and electronic copies of this thesis document in whole or in part.

Author: Department of Electrical Engineering and Computer Science, May 18, 2000

Certified by: Kenneth Stevens, Professor, Thesis Supervisor

Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Students

Implementing the Matcher in the Lexical Access System with Uncertainty in Data

by Roy Kyung-Hoon Kim

Submitted to the Department of Electrical Engineering and Computer Science on May 18, 2000, in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science

Abstract

The goal of this thesis is to modify the existing matcher of the lexical access system being developed at the Research Laboratory of Electronics so that it provides efficient and accurate results with limited segmental information. This information, provided by the speech signal processor, contains a set of sublexical units called segments and a set of features to characterize each of them. The nature of a feature is to describe a particular characteristic of a given segment. Previously the matching subsystem demanded a complete set of segments and features for each spoken word. Specifically, the speech signal processor was required to be without fault in its efforts to detect all available landmarks and cues and to convert them into the segmentally formatted data that the matcher recognizes. But this requirement for impeccability is nearly impossible to meet and must be relaxed for a real-world lexical access system.

Overall, this new, modified matcher in the lexical access system represents a real-world application that anticipates and responds to imperfections in the given data. More specifically, the modified matcher has the ability to translate a series of segments with incomplete sets of features into possible utterances that the series may represent. With this new matcher, an experiment was performed to initiate a process to identify the features with the most acoustic information. For a given set of incomplete segmental representations, the results of the experiment showed that the output of the matcher, or number of matched utterances, increases exponentially as the input of the matcher, or number of speaker-intended words, increases linearly. But as more features are defined in these incomplete representations, we can conclude from the results that the growth in the number of possible utterances becomes less exponential and more linear.

Thesis Supervisor: Kenneth Stevens
Title: Professor

Acknowledgments

First and foremost to my Lord Jesus. To express my gratitude, with which words do I begin? In my joy, you rejoiced with me. In my despair, you wiped away my tears. The past five years are yours, Jesus. I love you.

To my parents, Sun Chul Kim and Ae Im Kim. I will forever cherish all the prayers and encouragements you have showered upon me.
When I was ready to give up at MIT, your gentle, yet strong, words revived my soul. Um-ma, Ah-pa, sa-rhang-hae-yo.

To my brothers, Peter and Mark. Peter, your passion is found in only a few. Thanks for sharpening me with your zeal. Mark, you have grown tremendously over the year. Your increased love for God has blessed me greatly.

To Ken. Thank you for your advice and support. I think every professor at MIT should learn from the way you teach and care for your students.

To my KCF brothers and sisters. Your constant encouragements and challenges will be treasured forever. I'm deeply saddened to depart.. keep running the race.

To the '99-'00 Outreach Team: James, Sunny, Sera, and Dave. Through this tough, difficult year, each one of you meant the world to me. Thank you for your friendship and partnership.

To my AAA friends and ballers. This past year would not have been the same without you "low-budget" guys. Seek truth. Live life. No more second place.

To Young, my precious friend. Your prayers have sustained me this year. Thanks, bro. Hey, more years to come..

Contents

1 Introduction
2 Background
   2.1 The Speech Chain
   2.2 Basic Terminology of Lexical Access
       2.2.1 Effects of Anatomical Structures
       2.2.2 Features
       2.2.3 Landmarks and Segments
       2.2.4 Examples
   2.3 The Lexical Access Project
3 Original Matcher
   3.1 Standard Phonemes
   3.2 Lexicon
   3.3 Linguistic Rules
   3.4 The Original Matching Process
       3.4.1 The Overall Method
       3.4.2 The Detailed Process of the Original Matcher
4 Modified Matcher
   4.1 Variability in the Speech Signal
       4.1.1 Contextual Variations
       4.1.2 Extra-linguistic Variations
   4.2 Matcher-related Uncertainties
   4.3 The Modified Matcher Process
       4.3.1 Missing Features
       4.3.2 Uncertainty of First and Last Segments
5 Experiment with Partial Feature Segments
   5.1 Experiment
   5.2 Data
       5.2.1 Feature-level Characteristics
       5.2.2 Word-level Characteristics
   5.3 Conclusion
6 Summary
A 248-Word Lexicon
B Standard Phonemes
C Rules: An Example
D Source Code for Tree-Growth
E Results for One-Word Tests
F Results for Two-Word Tests
G Results for Four-Word Tests
H Manual for Match3 Program

List of Figures

2-1 Speech Chain: the basic human communication system (from Denes and Pinson, 1993)
2-2 General anatomical structures of the vocal tract (from Keyser and Stevens, 1994)
2-3 Basic tree with anatomical structures
2-4 Tree with anatomical structures and their corresponding features
2-5 Tree diagram for two phonemes, /n/ and /ey/
2-6 Overall model of the lexical access system
3-1 Basic model of matcher
3-2 Model of matcher with necessary information
3-3 Matching model with linguistic rules using index
3-4 Tree model of the matching process
3-5 Block diagram of standard template initialization
3-6 Block diagram of lexicon initialization
3-7 Block diagram of rule set initialization
3-8 Block diagram of the tree growing process
4-1 Effects of consonants on the first- and second-formant characteristics of eight types of vowels [12]
4-2 Feature set space model for the original matcher (A) and the modified matcher (B)
4-3 Feature-match model for the original matcher (A) and the modified matcher (B)
4-4 Set models for the output of the matcher with the different categories described in table 4.9
5-1 Histogram for the results of the 1-C test
5-2 Histogram for the results of the 1-IC test
5-3 Histogram for the results of the 2-C test
5-4 Histogram for the results of the 2-IC test
5-5 Histogram for the results of the 4-C test
5-6 Histogram for the results of the 4-IC test
5-7 A magnified version of the histogram in figure 5-6 where the range is less than 200
5-8 Two graphs where the independent variable is the number of intended words in the input and the dependent variable is the average number of matches in the cohort
5-9 The difference between the expected and the real graphs from figure 5-8

List of Tables

2.1 List of features and their possible values (from Maldonaldo's Master Thesis, 1999)
2.2 Four consonant types and their corresponding manner features
2.3 Two subsets of articulator features
2.4 Feature matrix for two phonemes, /n/ and /ey/, as seen by the matcher
2.5 A listing of the English vowels, glides, and consonants in the Lexical Access System
3.1 A list of English phonemes used in the lexical access system
4.1 Two sample series of phonemes as valid outputs of the speech processor
4.2 Three sample segments where segment 1 is a standard phoneme, /ae/
4.3 Matched results of the three segments from table 4.2
4.4 Segment input to the modified matcher
4.5 Output of the modified matcher with the segment from table 4.4 (A) as input
4.6 Output of the modified matcher with the segment from table 4.4 (B) as input
4.7 Two consecutive segments with complete feature sets as the input to the modified matcher
4.8 Output of the modified matcher with the two consecutive segments from table 4.7 as input
4.9 All combinations of (nil), '#' and '%' in the first and last segments and their meanings
4.10 Two sample inputs to the modified matcher where the positions of specific segments are constrained
4.11 Output of the modified matcher with the two segments from table 4.10 (A) as input
4.12 Output of the modified matcher with the two segments from table 4.10 (B) as input
4.13 A possible input to the modified matcher which follows the constraints described in table 4.9 (D)
5.1 A series of segments representing the word "as" for two different categories of inputs in the experiment
5.2 A list of the six tests of the experiment and their corresponding names used in this thesis
5.3 Two series of segments representing the sequential words "on day"
5.4 Mean, standard deviation, and maximum of the number of matched utterances for each test in the experiment
5.5 Two words with feature sets constrained by the second convention for inputs
5.6 Two cohorts corresponding to the two individual series of segments from table 5.5
5.7 The two series of segments from table 5.5 combined into one series
5.8 The cohort corresponding to the series of segments in table 5.7
5.9 Two cohorts corresponding to the two individual series of segments representing the words "caution" and "able"
A.1 Index numbers used to indicate specific words in all the tests

Chapter 1

Introduction

The Lexical Access Project, which is an ongoing effort by the Speech Communication Group in the Research Laboratory of Electronics (RLE) at the Massachusetts Institute of Technology (MIT), is aimed at creating a knowledge-based, rule-governed speech recognition system based upon the methods human beings use to produce speech. The idea of recognizing speech in this fashion differs from current approaches to automatic speech recognition. Unlike systems which rely primarily on statistical analysis and complex Markov models for detection, this system relies on a set of acoustic cues in speech which provide information about the articulatory movements in the vocal tract.

The overall function of the lexical access system is to convert a sound signal into a series of segments (which will be defined later), match them to a predefined lexicon, and produce the intended utterance. In order to accomplish this function, the system is partitioned into two subsystems: the speech processing subsystem and the matching subsystem [13]. When the speech signal enters the system, the processor detects and extracts acoustic cues from the speech signal and labels the features in accordance with these cues. From the labeling process, it converts the existing format to a format that is compatible with the segmental information in the lexicon. Finally, the matcher takes the converted segments from the processor and compares them to each of the words in the lexicon to compile a list of possible utterances.

The older version of the matching subsystem, called the "original" matcher, demands a complete series of segments and feature-sets for each spoken word. In other words, the speech signal processor is required to be without fault in its efforts to detect all available acoustic cues and to convert them into the segmentally formatted data that the matcher recognizes. But this constraint of impeccability is realistically impossible to satisfy and must be relaxed for a real-world lexical access system.

This thesis presents a "modified" matcher which, unlike the original matcher, tolerates some uncertainty in the signal analysis of the processing subsystem. Modification is necessary because the processor cannot detect and identify all the available acoustic cues in the speech signal. More specifically, the newly designed matcher allows the segments to be defined with incomplete sets of features. Also, the series of segments is not required to fully complete an utterance in the modified matcher.

Chapter 2 familiarizes the reader with important background information required to understand the basics of the lexical access project.
First, this chapter provides a general overview of the human communication system, with its main focus on the four functional subsystems of the human vocal tract. Then it describes the essential components of the lexical access system and how the anatomical knowledge is utilized in the structure of the system.

The following chapter introduces the purpose and the design of the original matcher. First, it discusses the essential building blocks of any effective matcher. Then it steps through the specific algorithm of the original matcher to show how it behaves when a series of segments is given as its input.

Chapter 4 explains the purpose and the design of the modified matcher. In the beginning, it uncovers some of the unrealistic, yet inherent, assumptions that the original matcher makes in its algorithm, thereby highlighting the reasons for the need of a modified algorithm. Then it discusses specific design modifications which are implemented to relax those assumptions.

Finally, chapter 5 describes an experiment that was performed with the modified matcher, which is an initial step towards identifying the more valuable features. Then the collected data are presented and interpreted.

Chapter 2

Background

2.1 The Speech Chain

The lexical access project is based upon the speech-producing methods that human beings use. Therefore it is essential to understand the basics of how we communicate with each other before we can learn in detail about this ongoing project. First we will discuss the background information needed to fully understand the project, and then analyze the lexical access system in further detail.

Figure 2-1 illustrates a set of events that describe the overall communication system of a human being. According to this model, the speaker first arranges the thoughts by accessing memory for a set of words. After translating these words into linguistic units called phonemes (which will be defined soon), the brain sends impulses to the motor nerves in the vocal tract. Then the sound waves generated by movements of the articulators in the vocal tract and by the respiratory system are radiated and propagated to the listener. As illustrated, the ears of the speaker act as a feedback system to produce the most desired acoustic effect. The listener's ears pick up the sound waves and send the signal through a complex filtering system to the brain. By comparing a set of perceived cues and their corresponding features to an existing lexicon in the memory, the brain deciphers these features into a set of possible words that the speaker intended. Finally these words are stored in the listener's lexicon for future use [3].

Figure 2-1: Speech Chain. This figure shows the basic human communication system (from Denes and Pinson, 1993).

Evidence has shown that at the linguistic level of speech production and perception, there are units of representation that are more fundamental than words, called phonemes [2]. A phoneme is defined as a unit of speech that distinguishes one word from another and is represented as a segment that consists of a particular set of characteristic features. A series of these segments forms the lexical model for a spoken utterance.

2.2 Basic Terminology of Lexical Access

2.2.1 Effects of Anatomical Structures

The sounds that a person produces are a result of physical manipulation of the vocal system.
In fact, an utterance is stored in memory as a sequence of words, and each word in turn can be represented as a sequence of phonemes [2]. Each of these phonemes can then be characterized with specific physical attributes, or features.

The physical aspects of the vocal tract allow some regions to operate independent of the controls by other regions, and this attribute creates a type of classification of features in groups [6]. More specifically, the human vocal tract can be broken up into four functional regions that are generally independent of each other, as shown in figure 2-2.

Figure 2-2: General anatomical structures of the vocal tract. The vocal system may be partitioned into four functional parts (from Keyser and Stevens, 1994).

Region 1 represents the vocal folds, which are a dominant subsystem of the vocal tract in producing speech. The vocal folds periodically interrupt the normal flow of air from the lungs to the mouth by producing a sequence of air puffs whose frequency is determined by the rate at which the glottis, or the space between the vocal folds, is spread or constricted [11]. These "periodic interruptions", or oscillations, are directly correlated with the fundamental frequency. Also the frequency can be changed by increasing the stiffness of the vocal folds. Not only are the vocal folds independent from the rest of the system, but they can be modeled as a source of sound to the rest of the vocal tract; therefore the vocal cords can be modeled as an oscillating source to the system consisting of regions 2, 3, and 4. If the vocal tract is modeled as described, then the system is dependent upon the structural characteristics of regions 2, 3, and 4, and will output a signal in accordance with their unique physical features. So we will now examine these regions and their effects on the output of the system.

Region 2 consists of the laryngeal and pharyngeal pathways from the vocal folds to the oral cavity. The larynx is a cartilaginous structure in which many ligaments, including the vocal folds, are interconnected. The structure moves vertically in producing speech and in the act of swallowing. The pharynx is a tube-like structure surrounded by a set of constrictor muscles which connect the larynx to the base of the oral cavity of the mouth. To effectively alter an utterance, the cross-sectional dimensions of the tube in this region can be widened or contracted. In other words, manipulation of the cross-sectional area within the pharyngeal region contributes to the acoustic filtering of the sound source [11].

Region 3 simply contains the soft palate, which is a flap of tissue used to cover or uncover the velopharyngeal opening. If the flap is lowered, then part of the air from the source escapes through the nasal cavity. Such activity of the soft palate modifies the characteristics of the vocal tract such that the nasal cavity becomes parallel to the rest of the vocal cavity. This modifies the acoustic filter by enhancing the low frequency energy while weakening the high frequencies to cause a nasal sound [4]. If the soft palate is not lowered, then the air travels directly along the cavity of the mouth.

Region 4 represents the structures in the oral cavity such as the tongue and the lips. More specifically, this region is described with the tongue blade, the tongue body, and the lips. These anatomical parts are so important that they are given a distinctive title, usually known as the "articulators".
The tongue is a muscular mass that can move in almost any direction to alter the filtering characteristics of the vocal system. Though it is true that the tongue blade and the body are anatomically attached, each has some independence in acting as needed. The lips can be positioned in three different positions: protruded, rounded, and spread, but for our analysis, only the rounded cases will be considered.

In our analysis above, the general regions that complete the human vocal system have been examined. To summarize the basic components of the anatomical makeup in a more hierarchical model, figure 2-3 is provided.

Figure 2-3: Basic tree with anatomical structures. All the end nodes correspond to a particular physical structure.

Figure 2-4: Tree with anatomical structures and their corresponding features.

2.2.2 Features

Each of the anatomical parts in figure 2-3 acts in certain manners, or has particular features, to produce the desired sounds. In other words, each feature has a corresponding set of acoustic correlates, which are used to distinguish different segments of sounds [10]. As described in the previous section, the vocal folds can be stiffened or slackened while the glottis can be spread or constricted. Other anatomical parts such as the soft palate can be lowered or raised, causing the sound to be nasal or not. The rest of the vocal structures and their physical features are illustrated in figure 2-4.

Some of the labels for the features are abbreviated, and a fuller description can be found in table 2.1. To assign values to each of the features, binary symbols of +/- are utilized. For example, [- nasal] represents the fact that the sound was not nasal because the soft palate was not lowered, while [+ nasal] represents the exact opposite. [+back] corresponds to the tongue body being pulled back from its rested position. Table 2.1 lists the rest of the features that are used in the lexical access system and their possible values.

Table 2.1: List of features and their possible values. The features may be categorized into three distinct groups: articulator-free, articulator, and articulator-bound (from Maldonaldo's Master Thesis, 1999).

  Articulator-free (values +/-): vowel, glide, consonant, sonorant, continuant, strident
  Articulator: vocal folds, glottis, pharynx, soft palate, body, blade, lips
  Articulator-bound (values +/-): stiff, slack, spread, constricted, advanced tongue root, constricted tongue root, nasal, high, low, back, anterior, distributed, lateral, rhotic, round

Articulator-Free Features

A distinct category of features was not discussed in the previous sections. This category, consisting of the features listed under the "articulator-free" section in table 2.1, relays information about the type of sound and the manner of articulation without reference to a particular articulator. More specifically, the articulator-free features show whether a constriction is made in the vocal tract, and if so, how it was made.

Table 2.2: Four consonant types and their corresponding manner features.

  Consonant type   Sonorant   Continuant   Strident
  affricate            -          -            +
  fricative            -          +           +/-
  sonorant             +          -            -
  stop                 -          -            -

In fact, each phoneme is required to have one of the following articulator-free features
assigned a positive value: vowel, glide, or consonant. If a phoneme is characterized with [+vowel], the root node is the dominant node because all of the four regions can actively affect the production of sound for vowels, while for some phonemes, such as /w/, the glide node is the dominant node. For others, the consonant node is the dominant node where, by forming constrictions in the oral cavity, mainly the articulators in Region 4 define the acoustic characteristics. The other three features under the "articulator-free" column in table 2.1 are used to distinguish the different types of consonants. As listed in table 2.2, sonorant, continuant, and strident features are used to distinguish a consonant from its four possible types: affricate, fricative, sonorant, and stop. The sonorant feature refers to whether or not there is a buildup of pressure behind a constriction in region 4 of figure 2-2. [+sonorant] shows that there is a continuation of low frequency amplitude at its closure while [-sonorant] shows that there is a decrease in the low frequency amplitude. [+continuant] corresponds to the fact that there is only a partial closure in the vocal tract while [-continuant] represents a complete closure in the oral cavity. Finally [+strident] means that there are exceptionally strong fricative noises. Articulator Features The second of the three categories, represented in the second column of table 2.1, is titled as the "articulator" features. This category, which includes the blade, the body, lips, pharynx, soft palate, vocal folds, and glottis, shows what anatomical structures are used to produce particular sounds. 22 Primary Articulators Tongue Blade Tongue Body Lips Secondary Articulators Pharynx Soft Palate Vocal Folds Glottis Table 2.3: Two subsets of articulator features This category may be partitioned into two subsets to better understand the functional descriptions of the features. The two subsets of the "articulator" features are seen in table 2.3. The primary articulators, consisting of the tongue blade, the tongue body, and the lips, play a major role in defining the characteristics of the sound. The secondary articulators, consisting of the pharynx, soft palate, vocal folds, and glottis, also play an active role but not as the primary articulators do [6]. Articulator-Bound Features The third of these categories in table 2.1 is called the "articulator-bound" features and its responsibility is to describe how the articulators are positioned and shaped to produce particular sounds. For example, [+rounded] indicates that a segment was produced by rounding the articulator called lips. The tongue blade has four distinctive features: anterior, distributed, lateral, and rhotic. [+anterior] refers to when the tip of the tongue makes contact with the anterior portion of the alveolar ridge. [+distributed] indicates that a broader part of the tongue touched the alveolar ridge. [+lateral] refers to the situation when the air flows around the side of the tongue. And [+rhotic] is very similar to [+lateral], yet with a different shaping of the tongue. The three features that are associated with the tongue body are high, low, and back. High and low refer to the vertical characteristics of the tongue while back refers to its horizontal position. A couple of other articulator-bound features are nasal and round, both of which are described earlier in detail. Advanced tongue root and constricted tongue root features indicate whether the phoneme is tense or lax. 
Constricted glottis and spread glottis indicate the state of the glottis and play a vital role in producing sounds such as /h/. And stiff vocal folds and slack vocal folds directly dictate whether the consonant will be voiced or unvoiced.

2.2.3 Landmarks and Segments

As we noted earlier in the section describing the speech chain, the listener decodes speech by identifying all the phonemes by searching for acoustic cues in the signal. These cues point both to abrupt acoustic events and to articulator-bound features; the abrupt events are technically known as landmarks. Depending on the characteristics of the landmarks, they are classified into one of three groups: vowels, glides, or consonants.

Vowels are produced when there is no narrowing effect in the vocal tract. By studying the formants, which are related to the natural frequencies of the vocal tract, the landmark can be located. More specifically, the time associated with a vowel landmark is marked near the place where the amplitude of the first formant (F1) is a maximum [1].

There are only four glides in the lexical access system. These glides are produced with an intermediate narrowing of the vocal tract, which keeps them from exhibiting abrupt changes in the formants. In the lexical access machine, their landmarks are chosen at times where the signal amplitude is a minimum [1].

Consonants are recognized by their closure and a subsequent release of energy in the vocal tract, causing abrupt spectral changes. This effect is produced when the vocal tract is almost or completely closed and then released.

Each of these landmarks, or parts of landmarks, identifies a particular segment, or a phoneme. Along with the articulator-free features, each segment is identified with its articulator-bound features, based on a hypothesis that segments are stored in human communication systems in the form of discrete classes of features.

2.2.4 Examples

The features associated with a phoneme have traditionally been represented as an array, or matrix, of feature values [2]. In this section, all the previous information will be brought together into a couple of examples of phonemes so that we have a better idea of how they can be modeled in feature matrices.

Table 2.4: Feature matrix for two phonemes, /n/ and /ey/, as seen by the matcher.

  (A) /n/                          (B) /ey/
  Time: (nil)                      Time: (nil)
  Symbol: n                        Symbol: ey
  Prosody: (nil)                   Prosody: (nil)
  Features:                        Features:
    + consonant                      + vowel
    - continuant                     + body
    + sonorant                       - high
    + blade                          - low
    + anterior                       - back
    - distributed                    + adv-tongue-root
    + nasal                          - const-tongue-root

The first phoneme we will analyze is /n/, which is shown in table 2.4 (A). As seen in this feature matrix, [+ consonant] is one of the articulator-free features, which means that the main node is at the consonant node. The next two features describe the rest of the articulator-free features. The articulator feature in this matrix is [+ blade]. [+ anterior], [- distributed], and [+ nasal], which are the rest of the features, show the articulator-bound features of /n/. This is also seen in figure 2-5.

The second phoneme we will analyze is /ey/, which is also shown in table 2.4 (B). From the table, we see that its main node is the root, or the vowel, node. Therefore, there are no more articulator-free features, and the next feature is [+ body], which is an articulator. The remaining features are articulator-bound features. This is illustrated in figure 2-5.
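To make this representation concrete in code, the short sketch below shows one way a feature matrix like table 2.4 (A) could be stored as a bundle of binary feature values. This is only an illustration under assumed, simplified types (FeatureValue, FeatureBundle, and make_n are names invented for this example); the matcher's actual classes for the same purpose, _segment and _bundle, are described in chapter 3.

    #include <map>
    #include <string>

    // Illustrative sketch only: a feature bundle maps feature names to
    // binary values, mirroring the matrices in table 2.4. The real matcher
    // uses its own _segment/_bundle classes (see chapter 3).
    enum class FeatureValue { Plus, Minus };

    struct FeatureBundle {
        std::string symbol;                            // e.g. "n"
        std::map<std::string, FeatureValue> features;  // feature name -> +/-
    };

    // The phoneme /n/ from table 2.4 (A), written out as a bundle.
    FeatureBundle make_n() {
        return FeatureBundle{
            "n",
            {
                {"consonant",   FeatureValue::Plus},   // articulator-free
                {"continuant",  FeatureValue::Minus},
                {"sonorant",    FeatureValue::Plus},
                {"blade",       FeatureValue::Plus},   // articulator
                {"anterior",    FeatureValue::Plus},   // articulator-bound
                {"distributed", FeatureValue::Minus},
                {"nasal",       FeatureValue::Plus},
            }
        };
    }

A segment detected from the speech signal could be held in the same form, with the map simply containing fewer entries when some features go undetected.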
Figure 2-5: Tree diagram for two phonemes, /n/ and /ey/.

2.3 The Lexical Access Project

Now that the basic theoretical information has been presented, we are ready to better understand and appreciate the lexical access project. Figure 2-6 depicts a basic block diagram of the lexical access system. Overall, this system interprets a sound signal into a set of segments and matches them to a predefined lexicon. Although this may sound simple, there are many complicated factors that affect the performance and the output of this machine, as we shall shortly see.

To accomplish this general task, the system must first find all the landmarks and look for the distinguishing frequency characteristics to identify which of the three articulator-free features each matches: vowel, glide, or consonant. After a landmark is detected, the signal processing subsystem looks around the vicinity of the landmark to gather more information about the particular segment. After all the features are detected and given a value, the signal processing subsystem converts all the gathered information into a list of features for the particular segment. These segments need to be converted to formats that are recognizable to the matcher. Then this information, which will correspond to one of the phonemes in table 2.5 (or more than one if a partial representation of the features makes up the phoneme), is sent to the matching subsystem. This process can be seen in figure 2-6.

Figure 2-6: Overall model of the lexical access system. The system is broken down into two subsystems: the speech processing subsystem and the matching subsystem.

Using the processed information from the previous subsystem, the matcher attempts to recreate the utterance of the speaker. The output of the matcher may consist of more than one possible utterance if more than one matches the descriptions of the input. As we shall see later, the matching process is a complex process with problems that are speaker-dependent.

Table 2.5: A listing of the English vowels, glides, and consonants in the Lexical Access System. AF=affricates, FR=fricatives, SO=sonorants, ST=stops.

  Vowels:         /iy/ /ih/ /ey/ /eh/ /ae/ /aa/ /ao/ /ow/ /ah/ /uh/ /uw/ /rr/ /er/ /ex/
  Glides:         /h/ /w/ /y/ /r/
  Consonants-AF:  /ch/ /dj/
  Consonants-FR:  /f/ /v/ /th/ /dh/ /s/ /z/ /sh/ /zh/
  Consonants-SO:  /l/ /m/ /n/ /ng/
  Consonants-ST:  /b/ /d/ /g/ /k/ /p/ /t/

Chapter 3

Original Matcher

The original matcher, which was developed by Zhang (1998) and Maldonaldo (1999), is the older version of the matching subsystem. Before discussing the newer version of the matcher, it may be useful to understand the details of the original matcher. In this chapter, we will give a general description of how the original matcher functions.

The basic function of the matcher is to take an input represented by a series of segments from the speech processing subsystem, and to produce a list of possible utterances which best represent the input data, as shown in figure 3-1. Although the basic system appears simple in functionality, there are additional aspects of the matching subsystem which act as sources of many complications.

One of the main features of the original matcher is that it requires the speech
Although the basic system appears simple in functionality, there are additional aspects of the matching subsystem which act as sources of many complications. One of the main features of the original matcher is that it requires the speech MATCHER Series of Segments Representing a Sampled Utterance List of Possible -Utterance Figure 3-1: Basic model of matcher. For the original matcher, the series of segments has to be error-free and all the possible features have to be detected. 29 Standard Phonemes Linguistic Rules MATCHERList MA TCHER Feature Based Presentation of an Utterance of Possible Utterance Lexicon Figure 3-2: Model of matcher with necessary information. In addition to the series of segments as shown in figure 3-1, standard phonemes, linguistic rules, and the lexicon are also needed. processor subsystem to be perfect and error-free in its analysis of the continuous speech signal. More specifically, for the original matcher to consider the data as valid, the speech processor has to detect all the segments and all the features therein. Although this is an unrealistic constraint for any matcher to place upon the speech processor, the original matcher cannot function without it. This will be discussed later in greater detail. After a speech signal has been properly labeled and converted into discrete lexical representations that perfectly and completely describe all the cues around each of the landmarks, the original matching process is ready to begin. But before the actual implementation takes place, the matcher needs three other different types of information: a list of linguistic rules which are to be applied during the matching, a list of standard phonemes, and a lexicon containing a given list of words with their corresponding phonemic representations. To illustrate these requirements, a modified version of figure 3-1 is shown in figure 3-2. In this new diagram, we note that there are four types of inputs needed for the matcher to perform its duties. In the next few sections, we will describe the three new types of information: the standard phonemes, the lexicon, and the list of linguistic rules. 30 iy ih ey eh ae aa ah uw uh rr er x k r h t 1 m v dh z zh f th ao w n S ow y ng sh dj ch b d g p Table 3.1: A list of English phonemes used in lexical access system. 3.1 Standard Phonemes In any matching algorithm, data must be compared to a defined reference data. Since the input of the matcher is formatted to a series of segments, or phonemes, the matching process is much simplified if the reference is formatted as a list of phonemes. In fact, the standard reference for the original matcher is exactly that: a list of all the possible phonemes in English. A full list of these phonemes is given in table 3.1. Because the standard list of phonemes is a reference for the matcher, each phoneme has to be perfectly described. In other words, each phoneme should have values for all necessary features. No errors of any kind should exist in the description of any of the phonemes. 3.2 Lexicon Not only does the matcher compare individual input segments to the list of standard phonemes, but it also places them sequentially to produce particular English words. Since the output will most likely be a series of words, the matcher needs a reference of possible words where each word is a specific set of phonemes. During the development stages of this subsystem in the lexical access system, a small lexicon of 248 words is used. Appendix A has a copy of this lexicon with a list of words. 
31 3.3 Linguistic Rules A set of rules may be essential for an effective and useful lexical access speech recognizer in the real world. If the speaker were to enunciate every syllable very carefully in the most proper manner, then this attached information would not be necessary in recognizing the utterances. But when the segments occur in contexts, particularly in more complex sequences within a syllable or in sentences that are produced in a conversational or casual style, their articulatory and acoustic attributes may be modified [11]. Due to these modifications, the matcher needs to take appropriate action as defined by the linguistic rules of casual speech. For example, let us consider a series of words, "bat man". In casual speech, a speaker who intended to say those words may actually utter "bap man". This alteration occurs because the /t/ in "bat" assimilates to a /p/ by assuming the place of articulation of the /m/ in the "man". Since these cases are prevalent in any language, the lexical access machine is required to recognize them and adjust the process in a correct manner; therefore linguistic rules are necessary. 3.4 3.4.1 The Original Matching Process The Overall Method In the first stage of the process, also known as initialization, all of the four input files are translated into feature matrices that the matcher can recognize and manipulate. After reading these files into its data structure, the matcher modifies the lexicon by using the given rule set to account for possible linguistic changes within the observed speech. In other words, once initialization is completed, the lexicon is expanded by applying the rule set and adding new pronunciations to words that could have arisen as a result of some fluent speech modification at word boundaries [8]. In order to understand the matching process more completely, let us step through an example from Maldonaldo's thesis [9]. Suppose that the result of the labeling and conversion process demonstrates a 32 Phonemic Rep. of an UtteranceI Lexical Words 0 1 2 3 4 5 x x n n ah dh x r ey p b ae ih n p aa w rr n aa y n t iy 7 6 8 9 10 11 12 13 14 15 16 k n Figure 3-3: Matching model with linguistic rules using index. Two lexical words that match the segments of the utterance at index 0 are "an" and "another". sequence of phonemes corresponding to the phrase, "another ape back in power". Although we will refer to the bundles by their phonemic symbols for the sake of simplicity, this phrase is realistically represented by a sequence of bundles of features and not their symbols. Also suppose that the lexicon consists of these following words: [x n] ("an"), [x n ah dh x r] ("another"), [ey p] ("ape"), [b ae k] ("back"), [ih n] ("in"), [p aa w rr] ("power"), and [n aa y n t iy n] ("nineteen"). Finally suppose that the matcher is informed of only one linguistic rule which states that if a word ends in a /d/, an /n/ or a /t/, and is followed by a consonant produced by [+lips] such as /p/, then the last segment of the previous word can also gain [+lips] features. First a temporary lexicon is created by applying the rule set to the existing lexicon. As illustrated in figure 3-3, each lexical word is lined up, segment by segment, against the output of the signal processing unit, which is the line across the top. Then each of these words is tested for any possible linguistic modifications, and from the example above, only "nineteen" ends with an /n/ and is followed by a /p/. 
Therefore a new set of phonemes is added to "nineteen" in the temporary lexicon, in which the last /n/ has [+ lips]. After the temporary lexicon has been created, the sequence of the bundles of features is compared to each of the words in the temporary lexicon. As seen above, 33 index: AN ndex: 2 0 0 AN OTHER index: 6 no match APE index: 8 BACK index: 11, Figure 3-4: Tree model of the matching process. This model shows the progression of index for matched lexical words. Each node may be modeled as a subtree which may beget another subtree. 34 at index 0, only two words match segment by segment: "an" and "another". From here, the temporary lexicon is completely destroyed and two branches corresponding to these two matched words are created as shown in figure 3-3. Then subtrees from these two branches are produced through iterations of the same matching process that is described above. For example, a temporary lexicon is created to begin a new matching process as a result of the match of "an". But the tree created by this match fails to complete because no words in the lexicon matches from index 2 and forward. But the tree created by the match of "another" will complete a sequence of matches and will reconstruct the phrase "another ape back in power." 3.4.2 The Detailed Process of the Original Matcher In the previous section, the function of the overall matching subsystem was described and a general understanding of its algorithm was presented. In this section, a more detailed examination of this subsystem is provided. To understand the algorithm that the original matcher utilizes, we will analyze these four processes: initialization of the standard template, initialization of the lexicon, initialization of the linguistic rules, and growing of the matcher tree. Though the details may be complex, this information may be helpful to those who want to understand the matching process at the source code level. Initialization of Standard Template One of the pieces of input information from figure 3-2 is the standard template, which is also in appendix B. But before the matcher is able to use this information, it first needs to translate and store the information into a specific data structure that is recognized. This beginning process is formally called the template initialization and is the first block in figure 3-5. Since the matcher is programmed in C++, the names of each of the boxes represent classes and the arrows represent transition from one object to another. To better understand the matcher's specific structure, such as the types of classes and objects, 35 from main program Initialization ' _segment variables assigned __I _segments variables assigned _bundle LOOP: - get all headers features[i] = +, -, x,...} data[i] = new segment no? end of file yes ? back to main program Figure 3-5: Block diagram of standard template initialization. -segments constains an array of standard phonemes in data[i}, where i is the index of the array. Each phoneme is defined further by _segment and _bundle classes. 36 please refer to Maldonaldo's master's thesis on pages 21 and 22. But for our purpose of analysis, we will take this structure as given. When the command to initialize is executed, the matcher jumps to a class called segments to define the object of list of segments. Within this object, a set of header variables are assigned. When all the variables for the template are set, then the matcher is ready to initialize each of the standard phonemes that the template contains. 
This initialization algorithm is implemented with a loop that runs until each of the individual phonemes is defined in the matcher and stored in an array called data/i. To define each phoneme in data/i, -segments class calls upon -segment class to do some basic work. First, an object is created with specifically assigned variables, and then this object calls upon _bundle class to finish the initialization of a phoneme. Finally _bundle defines some variables and gives values to the individual features for the particular phoneme. This whole process is repeated for each of the standard phonemes in the template. Initialization of the Lexicon The structure design for initialization of the lexicon is different from that of the standard template in that the lexicon is an array of classes. To help understand this initialization process, refer to figure 3-6 while reading this section. To generalize this type of structure, an Array template is utilized for a generic array class in our code. When initialization of the lexicon is called upon, the matcher begins the process with a class called -lexicon. First all the variable assignments are completed. Also the Array template is initialized so that this object can contain a defined array of lexical words and their phonemic descriptions. When all the initialization for _lexicon is finished, then this process executes a loop for each of the lexical words by iteratively calling a class called _lex-word. Since there may be more than one possible pronunciation for each of the words, each _lex-word class is an array of pronunciations. Therefore, when this class is called upon, the array and its variables are initialized. Then the process executes a 37 from main program _lexword initialize Array of pronunciation get Label LOOP: initialize pronunciation add to Arrray of pronunciation Initialization _segments '-0 load up [x x x] format LOOP: Sdata[i] yesI? _lexicon initialize Array of lex-word set variables LOOP: initialize lex-word add to Array of lex-word lexicon done? end of segment? lex-word done? yes? Ino? no? _pronunciation yes? = new _segment -segment initialize variables p = new _phonemic-rep mode = either feature = phonemic no? _phonemicjrep check _STD() get phoneme by letter check phoneme back to main program Figure 3-6: Block diagram of lexicon initialization. This process is relatively complex because a lexicon consists of an array of words, where each consists of an array of one or more distinct pronunciations. 38 ' rule from main program Initialization Header = "time" Symbol= "value" _segments _rules initialize variables + ' segment mode = feature state = feature LOOP: initialize rule add to Array of rules newmode = "features" newstate = "features" K- _bundle LOOP: data[i] = new segment - - get all headers features[i]= {+, -, x, . no? no? end of file for rules? end of file for rule? yes? back to main program yes? Figure 3-7: Block diagram of rule set initialization. The ruleset contains an array of linguistic rules which need to be translated (or initialized) for the matcher to recognize. loop where _pronunciation class is called upon for each distinct pronunciation of a word. More specifically, the amount that the loop iterates in -lex-word is equivalent to the number of variations in pronunciations of a particular word. _pronunciation class is called within each iteration by _lex-word, but this class does not accomplish much but assign a few varibles. 
Then this object sets up the lexical pronunciation format represented by a series of phonemes by calling upon the segments class. In this class, the process executes a loop where each phoneme is retrieved from the input file and stored in the data structure of a pronunciation; _segment and _phonemic-rep are classes which finish this process. When one pronunciation is done, then the process goes into the next loop in -lex-word for the next possible variation of the pronunciation of the current word. 39 Initialization of Linguistic Rules The third type of information that the matcher needs is the linguistic rule set. The rules file has to be in a specific format to be recognized by the matcher, as seen in appendix C. When the command for linguistic rules is first encountered, the matching algorithm calls upon the _rules class to begin the initialization process, as shown in figure 3-7. First, many of the variables are given values when -rules is called. Since there are are numerous linguistic rules applicable in English, the matcher has to initialize an array of rules. Afterwards, the process enters a loop where each of the rules is interpreted and stored accordingly. To accomplish the interpretation and storage, _rule class is called upon for each iteration. _rule class merely assigns values to some variables and then calls the _segments class to read a rule from a rule file. Because a rule file follows the same format as that of the standard template from this point forward, the process for interpreting and storing data from here is the same as that of the standard template. The matcher enters a loop within the _segments class to the -bundles class. Initialization of the Utterance Data As we can observe from appendix B, the standard template file and a sample utterance file is of the same format. In fact, the process in which to initialize the utterance data is exactly the same as that of the standard template except for some assignment of variables. Therefore, the matcher utilizes the same initialization algorithm as that shown in figure 3-5. Growth of Matching Tree Once all the initializations are complete, the "tree growing" process begins; it is a process that refers to a sequential matching algorithm in which a tree grows, word by word, as matching occurs, as shown in figure 3-4. From this section, we learned that each matched node may be seen as a sub-tree. Therefore, one can imagine a series of words becoming a huge tree which consists of smaller trees, or sub-trees, which 40 t->grow(rs, N.-p lex) ~ get modified lexicon temporary lexicon = temporary lexicon grow-aux2 - for each sentence tree apply rules (starting-point) LOOP: for each word in lexicon LOOP: for each pronunciation grow-aux (startingpoint) LOOP: lx -> copyo -> - grow-aux2 with incremented index lexicon no? lexical word yes? done?doe yes? tree growing no? o? done? yes? backto min pogra back o man prgro? matching *LOOP: for each phoneme in pronunciation no? \yes? feature match? pronunciation yes ? _.word P done? Label = "word" no? nobundle-match yes? Ords _wd~ budeLOOP: for each feature done? tarray of matched words Figure 3-8: Block diagram of the tree growing process. This process recursively builds a tree where each node represents a distinct index and a matched word. The recursion occurs in this figure when grow-aux2 calls on grow-aux, which calls grow-aux2 back. 41 themselves consist of sub-trees, and so on. This recursive model is very important in understanding the details of the tree growing process. 
A diagram of this process is shown in figure 3-8. The function that manages and begins the topmost level of the tree-growth process is called grow_aux2. It first calls a function called grow_aux to find all the words in the lexicon that match the utterance data from a specific position (initially position 0), as shown in figure 3-8. If there are matched words, grow_aux is also responsible for translating those words into the tree structure, so that each of these words becomes a sub-tree of the current node. Then, for each of these sub-trees, grow_aux2 is called again to find further branching possibilities and produce subsequent sub-trees.

Let us step through the example that was analyzed in section 2.2.4. The current position, or node, is initially at 0 for the topmost tree. grow_aux is then called by grow_aux2 to find all the matching words from the existing lexicon, and two words are found: "an" and "another". Each of these two words attempts to build its own sub-tree by calling grow_aux2 and initiating the whole process again from its own node, or index. Among all the recursive sub-trees completed in matching, only one path of branches completes the entire matching process in this particular example: "another ape back in power".

The grow_aux function itself has not yet been described. Overall, this function has two responsibilities: first, to modify the existing lexicon with the linguistic rules, and second, to match the segments from the utterance against the modified lexicon. In its second task, the function searches for perfect matches between the standard phonemes and the segments, which is a central characteristic of the original matcher. When finished, the function adds the newly matched words to an array of pre-existing matched words and destroys the modified lexicon. Since the algorithm for modifying the lexicon to account for linguistic rules is complex and would require an entire chapter of its own, we will not touch upon its details here. Maldonaldo's master's thesis devotes a whole chapter to this very topic and explains the algorithm in greater depth.

Let us assume that the matcher already possesses a perfectly modified lexicon that accounts for all the possible linguistic mutations of a word. The matcher is then ready to compare the utterance input to the modified lexicon. To accomplish this task in an orderly fashion, it compares segment by segment for each pronunciation of each word in the lexicon, an algorithm that can easily be implemented with nested loops. The topmost loop in grow_aux for the matching process runs through each of the words in the modified lexicon. Due to linguistic rule modifications, each word may have more than one possible pronunciation; therefore each iteration over a lexical word requires a nested loop over its pronunciations. At this point, the matcher goes through each pronunciation of a word and compares it to the utterance on a segment-by-segment basis using a function known as matching. The matching function examines each phoneme in the pronunciation by calling the bundle_match function to compare all the features of that phoneme against all the features of the current segment of the utterance. Since the original matcher requires error-free data, bundle_match requires the two sets of segmental information to be exactly the same.
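As a rough sketch of this nested-loop organization (reusing the illustrative types introduced earlier, not the matcher's actual declarations), the original matcher's comparison can be written as follows; exact_match stands in for the exact, feature-for-feature test just described.

    // Sketch of the original matcher's nested loops: for every word, for every
    // pronunciation, compare phonemes to utterance segments one by one.
    bool exact_match(const Bundle& phoneme, const Bundle& segment) {
        return phoneme.features == segment.features;   // identical feature sets
    }

    bool matches_at(const Pronunciation& pron,
                    const std::vector<Segment>& utterance, int start) {
        if (start + (int)pron.segments.size() > (int)utterance.size()) return false;
        for (size_t i = 0; i < pron.segments.size(); ++i)
            if (!exact_match(pron.segments[i].bundle, utterance[start + i].bundle))
                return false;
        return true;
    }

    std::vector<const LexWord*> find_matches(const Lexicon& lex,
                                             const std::vector<Segment>& utt,
                                             int start) {
        std::vector<const LexWord*> matched;
        for (const LexWord& w : lex.words)              // topmost loop: words
            for (const Pronunciation& p : w.variants)   // nested loop: pronunciations
                if (matches_at(p, utt, start)) { matched.push_back(&w); break; }
        return matched;
    }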
If matched without any discrepancy, then the word that the pronunciation represents is added to the series of pre-existing matched words. 43 Chapter 4 Modified Matcher As described in section 2.3, the lexical access system consists of two subsystems: the speech signal processor and the matcher. The signal processor receives a continuous speech signal from a speaker and translates it into a series of segments, each with distinct feature characteristics. Then the matcher receives this segmentized data and outputs a set of possible utterances to recreate words that the speaker intended to say. Now let us imagine a scenario where the signal processor is error-free in its analysis. In other words, the signal processor is able to detect all the acoustic cues for every possible segment and features therein. In the end of the labeling and conversion process, this subsystem will output segments where each will match exactly, feature by feature, to a specific phoneme in the standard template. Although a process with such qualifications seems impractical, the original matcher requires them from the processor. Constraints that leave no space for uncertainties should not exist in any real-world systems, and in this regard, the lexical access system is no exception. If the speaker is confined to enunciate each syllable very carefully, then it may be feasible to satisfy the demand for perfect detection in the processor, but for any speech recognizer to be effective in realistic settings, it must have the ability to handle casual speech. Yet much of the problem in speech recognition algorithms results from the high degree of variability in naturally spoken utterances as compared with more restricted speech [5]. 44 Due to many contextual and speaker dependent variabilities, the original matcher is of insignificant real value and needs to be modified. One important source of variability in the implementation of the distinctive features for a segment is the position of the segment in a syllable [11]. For example, a consonant before a vowel may be acoustically different from one after a vowel. Besides these contextual variations, there are some extra-linguistic variations as well, such as the speaker's vocal tract characterizations, the type of speaking environment, and even whether the speaker has a cold or is nervous [5]. In addition to these two causes of variations, there are some uncertainty issues exclusive to the matcher as well. 4.1 Variability in the Speech Signal Although many phonetic features can be extracted from the speech signal, the acoustic realization of words and phonemes can be highly variable, and this variability introduces a good deal of recognition ambiguity in the initial classification of the speech signal [7]. In fact, the task of developing a speech processing subsystem with an ability to detect all the phonetic features despite acoustic variability is a near impossibility. These variations may be partitioned into two categories: contextual variations and extra-linguistic variations. 4.1.1 Contextual Variations Local phonetic context can influence the acoustic representation in phonetic segments to varying degrees. At times, the context can affect one or more features in a phoneme, but without removing entirely the acoustic evidence for the features [11]. In addition, for certain acoustic cues which should be present in the signal, evidence for them sometimes cannot be found in the sound. In the most extreme cases, entire segments may even be deleted. 
Yet while segments are deleted and modified in the production process, extra segments are almost never inserted, because the modifications are primarily due to the inertia of a physical system [5].

As an example, let us look at some acoustic effects of context on vowel characteristics.

[Figure 4-1: Effects of consonants on the first- and second-formant characteristics of eight types of vowels, each in a consonant-vowel-consonant (CVC) frame. The graphs show three types of consonants: velars (open circles), postdentals (open triangles), and labials (open circles). Vowels in isolation are also shown (black circles) [12].]

Since the primary articulator that produces distinctions among the vowels is the tongue body, many of the most prominent variations occur when the consonants around a vowel place constraints on that articulator. One case is when a back vowel is positioned after an alveolar consonant, where the tongue body is required to be in a fronted position to make contact with the alveolar ridge. After the consonant release, the time taken for the tongue body to position itself for the back vowel is about 100 ms [11]. Similarly, if the back vowel precedes an alveolar consonant, the time taken for a complete movement of the tongue body is about the same. This is a significant amount of time, especially in rapid speech where vowel durations are usually less than 200 ms. One can imagine the heavy influence that these consonants can have on the characteristics of a vowel segment in casual speech.

Figure 4-1 shows the effects of consonants on the first and second formants of eight vowels. The points are measurements of the two formants at the midpoints of vowels, which are all situated in a consonant-vowel-consonant (CVC) formation, with the two consonants being the same. In this figure, three types of consonants are used: velars (open circles), postdentals (open triangles), and labials (open circles). The fourth case shows vowels in isolation (black circles). From the graph, one can clearly observe the impact that consonants have upon the acoustic parameters of vowels. One might expect the effects to be even greater in running speech, where vowel durations are likely to be shorter than they are for isolated CVC utterances.

4.1.2 Extra-linguistic Variations

Extra-linguistic variations also play a vital role in introducing uncertainties into the speech signal. One of the most influential causes of these variations is the physical difference in vocal tracts among individuals. For example, resonant frequencies generally differ significantly between the two sexes due to differing physical dimensions. Another type of extra-linguistic variation that the lexical access system, or any other speech recognition system, is required to handle is background noise, which depends heavily on the environment. In fact, some acoustic cues may be erased entirely with enough background noise. As one can imagine, there are countless other factors.

4.2 Matcher-related Uncertainties

The original matcher inherently makes three unrealistic assumptions. The first is that the speech processor will detect all the possible features, and therefore that each segment will be error-free. The previous section explained that this assumption is not valid for a working speech recognizer.
The second is that all the segments theoretically in the signal will be detected, but as mentioned previously, this assumption is also false because some segments may be deleted. The last assumption that the original matcher makes is that the series of segments it receives completely represents a full utterance. In other words, the original matcher expects the data from the speech 47 First Second series of phonemes /b/ /ey/ /b/ /iy/ /k/ /ae/ /n/ /ey/ /b/ /iy/ /k/ /ae/ /n/ Table 4.1: Two sample series of phonemes as valid outputs of the speech processor. Though the second is a subset of the first series, both represent the utterance, "baby can recognizer to begin with the first segment of a word and end with the last segment of a word, independent of the number of words in the utterance. Yet one can imagine that such an expectation is not valid in a continuous system. In a real-word situation, the speech processor will not have the ability to identify the boundaries of individual words in continuous speech signal. Therefore the matcher will receive a series of segments that may begin and end at random points within words. Let us look at two valid series of segments in table 4.1, assuming that each segment has a complete set of features. If the original matcher receives the first series of segments as an input, then the output will be the statement, "baby can". But if the input is the second series of segments, then the original matcher will not output any utterances because the series does not represent complete words. A more useful and realistic matcher should recognize that the two different series of segments may represent the same utterance and process the data accordingly. 4.3 The Modified Matcher Process Due to the uncertainty factors described above, a modified matcher has been designed to satisfy more realistic conditions. Specifically, the contemporary modified matcher functions without the constraints of the first and the third assumptions stated in the previous section. Though not dealt with in the current modified matcher, the second assumption also must be relieved in the final matcher. First this section will describe the modifications applied to the original matcher to account for the first assumption that the original matcher makes. Later, the section 48 segment 1 Time: (nil) Symbol: /ae/ Prosody: (nil) RoC: unspecified + vowel - high + low segment 2 Time: (nil) Symbol: (nil) Prosody: (nil) RoC: unspecified + vowel - high +low segment 3 Time: (nil) Symbol: (nil) Prosody: (nil) RoC: unspecified + vowel - back - adv.tongue-root - const-tongue-root Table 4.2: Three sample segments where segment 1 is a standard phoneme, /ae/. The other two segments are subsets of the first segment. RoC = Release Or Closure. will show how the modified matcher handles the last assumption, in which the first and the last segments may be positioned in any parts of words. 4.3.1 Missing Features The ability for a matcher to acknowledge incomplete sets of features in segments is very important. As mentioned before, one of the reasons for its importance is that an expectation placed upon the signal processor by the matcher to detect all possible acoustic features is unfair and unrealistic. But a more compelling reason is that the signal processor may be more efficient in not detecting all acoustic cues for features. 
Some features naturally do less to differentiate phonemes than other features, and developing systems that detect these less informative features may not be worth the cost of time, money, and design complexity of the speech processor; the difference in performance may be merely a small reduction in ambiguity. Also, some acoustic cues tend to be more dependent on the context than other cues, and therefore are more vulnerable to variability. Since possibilities of error in detection are higher with these highly variable cues, it may be safer to ignore them in the analysis. 49 segment 1 /ae/ segment 2 /ae/ /aa/ segment 3 /iy/ /ih/ /ey/ /eh/ /ae/ /aa/ /ao/ /ow/ /ah/ /uw/ /uh/ /rr/ /er/ /x/ Table 4.3: Matched results of the three segments from table 4.2. From left to right, more phonemes match because fewer features exist to constrain the matching process. Overview of the Modified Algorithm Let us study an example to gain better visual understanding. In table 4.2, there are three types of segments ordered from left to right. The features in segment 2 are a subset of segment 1, and those in segment 3 are a subset of segment 2. The first is a copy of the standard phoneme, /ae/, and contains all the possible features. But the second and the third segments are missing some features in comparison to the first segment. The original matching algorithm will recognize the first segment as /ae/ but will not recognize the other two segments. On the other hand, the modified algorithm will acknowledge all the segments but will associate more phonemes as fewer features exist to constrain the match, as shown in table 4.3. As expected, all the lists include the intended phoneme, /ae/. One may use set models to understand the general idea of the two matching algorithms. As figure 4-2 shows, the original matcher requires that the segment from the utterance and a standard phoneme to occupy the same "feature-set space". But in the modified matcher, the "feature-set space" for a segment is a subset of the "feature-set space" of the standard phoneme. Simply, the first case demands that the two segments be the same while the second case allows one to be a subset of the other. 50 Standard Phoneme Space Standard Phoneme Space Segment Space Segment Space (A) (B) Figure 4-2: Feature set space model for the original matcher, in part (A), and the modified matcher, in part (B). Part (A) shows that the original matcher requires the two feature-set space to be exactly the same, while part (B) shows that the segment space can be a subset of the standard phoneme space. Specific Modifications in the Algorithm Most of the modified matching algorithm is carried over from the original algorithm. In fact, all the initialization and most of the tree-growing algorithms, which are all described in section 3.4, are exactly the same. The main difference lies in a test titled "feature-match?" in figure 3-8, which is called by _bundle-match to perform a feature by feature match between a standard phoneme and a segment from the utterance. Figure 4-3 (A) shows that in the original matcher, the function that performs this test is called exact-match. If a feature has a particular value within a standard phoneme, then exact-match requires that the same feature has the same value in the segment. Meanwhile, in the modified matcher, another function titled incomp.match replaces -exactmatch in performing the "feature-match?" test, as shown in table 4-3 (B). 
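Alongside the exact_match sketch shown earlier, the modified test can be written in a few lines. This is a hedged illustration using the same stand-in Bundle type; the real incomp_match operates on the matcher's own classes, but the logic it implements is the subset relation described in the text.

    // Modified matcher: every feature present in the segment must also be
    // present in the standard phoneme with the same value, but features may
    // be missing from the segment altogether.
    bool incomp_match(const Bundle& phoneme, const Bundle& segment) {
        for (const auto& [name, value] : segment.features) {
            auto it = phoneme.features.find(name);
            if (it == phoneme.features.end() || it->second != value)
                return false;          // a conflicting value is still a mismatch
        }
        return true;                   // the segment's features are a subset
    }

In other words, exact_match demands equality of the two feature sets, while incomp_match only demands that the segment's feature set be contained in the phoneme's.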
Although a feature may have a particular value in a standard phoneme, this new function allows that feature to be missing in the segment. Yet situations where the feature has differing values in the segment and the standard phoneme is not allowed. 51 from Tree-growing algorithm from Tree-growing algorithm matching matching LOOP: for each phoneme in pronunciatich LOOP: for each phoneme in pronunciatift bundlematch bundlematch LOOP: for each feature LOOP: for each feature exactmatch incomp-match every feature in the phoneme has to match every feature in the segment (and vice versa) every feature in the segment has tc versa) every feature in the phoneme (not vice versa) (A) (B) Figure 4-3: Feature-match model for the original matcher, in part (A), and the modified matcher, in part (B). In this figure, "phoneme" refers to a set of features used by the lexicon and "segment" refers to a set of features, complete or incomplete, detected from the speech signal. 52 segment sement 1 segment 1 1 segmen Time: (nil) Symbol: p Prosody: (nil) Release Or Closure: unspecified Features: + consonant Time: (nil) Symbol: (nil) Prosody: (nil) Release Or Closure: unspecified Features: + consonant - continuant - continuant - sonorant - sonorant + lips + lips - round - round - constricted-glottis - slack-vocal-folds (B) (A) Table 4.4: Segment input to the modified matcher. Segment (A) has a complete set of features for /p/ while segment (B) is a subset of the first set. 4.3.2 Uncertainty of First and Last Segments In a continuous lexical access system, the matcher should not assume that the received series of segments begins and ends at the boundaries of words. If the system were to process a free flowing stream of speech data, then the speech processing subsystem does not inherently possess any knowledge of the specific positions of the segments. Thus the processor sends a series of segments without the information of the series being a complete sentence or not. Due to this uncertainty, the original matcher necessarily has to be modified such that it does not assume the positions of the first and the last segments. Examples of the Modification Due to the complex nature of the algorithm that implements this modification, it may be more beneficial to step through a few examples than to explain the intricate details of the algorithm. In each of the examples, the input and the corresponding output of the modified matcher will be provided. The lexicon used for these examples is the 248 word lexicon found in appendix A. 53 list of matched words 12. pig 1. ape 2. appear 13. pigs 14. pop 3. cup 15. potato 4. cups 16. power 5. hip 17. put 6. keep 7. oppose 18. sport 19. support 8. pail 20. suppose 9. pails 21. up 10. pay 11. perch 22. zap 23. zip Table 4.5: Output of the modified matcher with segment from table 4.4 (A) as input. All the words in this list contain the phoneme /p/. The first example is shown in table 4.4 (A), where the input is only one segment with a complete set of features. In fact, the phoneme that matches the given set of features is /p/. Since the modified matcher does not assume the relative position of this segment in a word, it will output all the words that contain /p/ as a segment. The list of words is shown in table 4.5. In this specific case, the original matcher does not output any words because no words exist that are completely described by /p/ alone. Let us now erase some of the features from the segment such that the input represents an incomplete set shown in table 4.4 (B). 
Due to the fact that this segment contains fewer features to match than that shown in part (A), both /p/ and /b/ satisfy the match and the number of possible words in the cohort increases. In fact, the number of possible words increases from 23 to 62 as shown in tables 4.5 and 4.6. If the input consists of two consecutive segments, as shown in table 4.7, the outcome is quite different. In this case, the matcher searches in the lexicon for any words, or a series of two words, that contain these segments in a sequential manner. Therefore, the output is represented by individual words that contain these two segments 54 1. able 2. about 3. above 4. ape 5. appear 6. baby 7. back 8. bad 9. bake 10. balloon 11. bat 12. be 13. because 14. become 15. before 16. began list of matched words 33. cub 17. begin 34. cup 18. below 35. cups 19. beside 20. between 36. debate 37. debby 21. big 38. dub 22. blow 39. goodbye 23. book 40. hip 24. box 41. keep 25. boy 42. number 26. bug 43. oppose 27. bus 44. pail 28. busses 45. pails 29. busy 46. pay 30. but 47. perch 31. by 32. cab 48. 49. 50. 51. 52. 53. 54. 55. 56. 57. 58. 59. 60. 61. 62. pig pigs pop potato power put rabbit rabid remember sport support suppose up zap zip Table 4.6: Output of the modified matcher with segment from table 4.5 (B) as input. Due to fewer features to match in the segment, the matcher outputs more matched words than table 4.5. - sonorant segment 2 Time: (nil) Symbol: iy Prosody: (nil) Release Or Closure: unspecified Features: + vowel + high - low + lips - back - round + advtongue-root - constricted-glottis - slack vocalfolds - const-tongue-root segment 1 Time: (nil) Symbol: p Prosody: (nil) Release Or Closure: unspecified Features: + consonant - continuant Table 4.7: Two consecutive segments with complete feature-sets as the input to the modified matcher. The first segment represents the phoneme, /p/, and the second segment represents the phoneme, /iy/. 55 list of matched words 1. 2. 3. 4. ape easy ape even appear cup easy 5. cup even 6. 7. 8. 9. hip easy hip even keep easy keep even 11. 12. 13. 14. 10. pop easy pop even up easy up even zap easy 16. zip easy 17. zip even 15. zap even Table 4.8: Output of the modified matcher with two consecutive segments from table 4.7 as input. Most of the matched utterances are over a series of two words. In this example, we can conclude that if the first segment is known to be the beginning of a word, then there are no matches. sequentially or two words that end with the first segment and begin with the second segment. All the possible words, and series of words, are shown in table 4.8. As one can observe, all the matched utterances, except number three, has /p/ as the ending segment of the first word and /iy/ as the first segment of the next word. Although only a couple of examples have been shown, the modified matcher can recognize any number of segments with any combinations of features for each segment. User-Mode for Testing The user may force the first or the last segment of the input to be the beginning or the end of a word by using the following symbols, '#' or '%' after "Time:", in the header of designated segments. Table 4.9 shows how the user can manipulate the input to inform the matcher what type of input it is receiving. The reason for this flexibility is because the user may find forced assignments useful for testing purposes. 
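Before turning to examples, the effect of these markers can be summarized with a small sketch. The helper below is hypothetical; the actual matcher carries this information in its own segment headers, but the intent is the same: '#' anchors the first segment to a word beginning and '%' anchors the last segment to a word ending.

    #include <string>

    // Hypothetical interpretation of the Time: markers as boundary constraints.
    enum class Anchor { Free, WordInitial, WordFinal };

    struct InputConstraints {
        Anchor first = Anchor::Free;   // '#' on the first segment
        Anchor last  = Anchor::Free;   // '%' on the last segment
    };

    InputConstraints read_constraints(const std::string& first_time,
                                      const std::string& last_time) {
        InputConstraints c;
        if (first_time == "#") c.first = Anchor::WordInitial;
        if (last_time  == "%") c.last  = Anchor::WordFinal;
        return c;   // both Free: unrestricted; both set: complete word(s),
                    // which reproduces the original matcher's behavior
    }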
For example, the cases studied in the previous subsection are categorized into category A, where the first and the last segments are unrestricted in position. If "Time:" of the last segment is assigned the character, '%' instead of (nil) as shown in table 4.9 (B), then the data indicates to the matcher that the series of segments ends at the last segment of a word. For example, if the last segment of a two segment input has '%' after "Time:" as shown in table 4.10 (A), then the output of the matcher is as shown in table 4.11. 56 As one can see, the last two segments (nil) in "Time:" of first segment in "Time:" of first segment '#' (nil) in "Time:" of last segment first and last segments may be in any position '%' in "Time:" of last segment first segment may be in any position last segment is last of a word A B first segment is first of a word last segment may be in any position first segment is first of a word last segment is last of a word C D Table 4.9: All combinations of (nil), '#' and '%' in the first and last segments and their meanings. In fact, category D acts as the original matcher where a series of segments represents a complete word(s). segment 1 Time: (nil) Symbol: k Prosody: (nil) RoC: unspecified Features: + consonant segment 2 Time: % Symbol: iy Prosody: (nil) RoC: unspecified Features: + vowel segment 1 Time: # Symbol: k Prosody: (nil) RoC: unspecified Features: + consonant segment 2 Time: (nil) Symbol: iy Prosody: (nil) RoC: unspecified Features: + vowel - continuant + high - continuant + high - low - sonorant - low + dorsum + high - back + dorsum + high - back low - constricted glottis - slack-vocal-folds - const-tongue-root - low - constricted-glottis - slack-vocal-folds - consttongue-root - sonorant - + advtongue-root (A) + advtongue-root (B) Table 4.10: Two sample inputs to the modified matcher where the positions of the specific segments are constrained. (A), by using '%', forces the matcher to find word(s) that ends with segment 2. (B), by using '#', forces the matcher to find word(s) that begins with segment 1. 57 list of matched words 1. cookie 2. leaky Table 4.11: Output of the modified matcher with two segments from table 4.10 (A) as input. As expected, the last two segments of these words are /k/ and /iy/. list of matched words 1. keep Table 4.12: Output of the modified matcher with two segments from table 4.10 (B) as input, where '#' is placed after "Time:" of the first segment. of "cookie" and "leaky" are /k/ and /iy/ sequentially. In contrast, if '%' is not present, then that particular constraint is absent and the matcher would output more utterances, such as "book eat". As explained in table 4.9 (C), "Time:" of the first segment may be assigned the character, '#' instead of (nil). In such cases, the data indicates that the series of segments begins at the first segment of a word. For example, if the two segments are as shown in table 4.10 (B), then the only word that matches is "keep", as shown in table 4.12. As expected, the first two segments of "keep" are /k/ and /iy/. Finally if the first segment is given '#' and the last segment is given '%', then the constraint is that the series of segments represent complete word(s). If the input is as shown in table 4.13, then the matcher would output no words since no words are completely described by /k/ /iy/. Through figure 4-4, set models may be utilized to better visualize this last case. 
If two sets of matcher's outputs exist, one corresponding to category B and the other corresponding to category C from table 4.9, then the set that represents category D is the AND-function of the two sets. Since there is no overlap of output utterances from table 4.11 and table 4.12; the matcher outputs no words in this case. Interestingly, the two symbols used together force the matcher to act, to a certain degree, as the original matcher. 58 segment 1 Time: # Symbol: k Prosody: (nil) RoC: unspecified Features: + consonant segment 2 Time: % Symbol: iy Prosody: (nil) RoC: unspecified Features: + vowel - continuant - + high - low sonorant + dorsum + high - back low - constricted-glottis - slack-vocalfolds - const-tongue-root + adv-tongue-root - Table 4.13: A possible input to the modified matcher which follows the constraints described in table 4.9 (D). Output from Category D Output from Category C Output from Category B Figure 4-4: Set models for the output of the matcher with different categories as described in table 4.9. Note that category D is the output of the AND-function of categories B and C. 59 Chapter 5 Experiment with Partial Feature Segments In section 4.3, two reasons for modifying the original matcher are presented. One of the reasons is that the original matcher places unrealistic expectations on the speech signal processor. Indeed, this reason alone is compelling enough for developing a new matcher. But the second provides a more positive reason; the modified matcher may decrease the cost and the design complexity of the speech processor while avoiding significant loss of performance of the entire lexical access system. Some features may not be worth the computational power required for their detection because they accomplish little to reduce the ambiguity within a segment. Therefore investigating different partial representations can help us identify the features with the most information, the features with the least information, and everything in between. Through a long experimental process, a map of features can be made, which will facilitate the design of the lexical access system. In this chapter, an experiment which explores the performance of the modified matcher when only partial feature information is presented. This is, by no means, a comprehensive experiment for the field of research in partial feature sets, but is a stepping stone toward a further series of experiments to refine our knowledge in this area. 60 5.1 Experiment Two main categories of inputs were used for this experiment: a series of com- plete feature-set segments and a series of incomplete feature-set segments. Naturally phonemes from the standard template were used to represent segments for the first category. To define the segments in the second category, fixed subsets of features were utilized, which represented a feature-set convention for the second category of inputs throughout the experiment. In considering which features to allow or disallow in the incomplete segments of the second category, whether they are consonants, vowels or glides, we should account for how difficult the task is to detect their acoustic cues and their contextual variability. In this initial set of experiments, a small number of features which are expected to be most reliably detected were used. If a segment is a consonant, the articulatorfree features, such as [sonorant], [continuant], and [strident], are expected to be most easily defined. 
If a segment is a vowel or a glide, then its articulator-free features and its articulator-bound features caused by the tongue body, such as [high], [low], and [back], are expected to be most easily defined. As a result, we used these particular features, and only these features, as a convention to describe the segments in the second category. To observe how the conventions of the two categories are applied to a series of segments, an example is given in table 5.1, where a word "as" is analyzed. Part (A) of this table follows the conventions defined for the first category of inputs, while part (B) follows the conventions defined for the second category. In part (A), /ae/ and /z/ are shown in their complete sets of features, just as they are in the standard template of phonemes. Yet in part (B), the vowel, /ae/ provides only the articulatorfree features and the articulator-bound features caused by the tongue-body, while the consonant, /z/, provides only the articulator-free features. Six different tests, as listed in table 5.2, were run for this experiment: 1-C test, 1-IC test, 2-C test, 2-IC test, 4-C test, 4-IC test. Each of these tests received 124 separate data samples as inputs, where for each sample, the matcher produced matched 61 segment 1 Time: # Symbol: ae Prosody: (nil) RoC: unspecified + vowel segment 2 Time: % Symbol: z Prosody: (nil) RoC: unspecified + consonant segment 1 Time: # Symbol: (nil) Prosody: (nil) RoC: unspecified + vowel segment 2 Time: % Symbol: (nil) Prosody: (nil) RoC: unspecified + consonant - high + low - back - advtongue-root - const-tongue-root + continuant - sonorant + strident + blade + anterior - high + low - back + continuant - sonorant + strident + anterior - distributed + slack-vocal-folds category 2: (B) category 1: (A) Table 5.1: This is a series of segments representing the word, "as" for two different categories of inputs in the experiment. The segments in (A) are described with a full set of features, while the segments in (B) are described with a incomplete set of features. RoC = Release Or Closure. descriptions of the six tests one-word test with complete features one-word test with incomplete features two-words test with complete features two-words test with incomplete features four-words test with complete features four-words test with incomplete features name used 1-C test 1-IC test 2-C test 2-IC test 4-C test 4-IC test Table 5.2: A list of six tests of the experiment and their corresponding names used in this thesis. 62 word(s). More specifically, a data sample consists of a random word from the lexicon, or a series of random words, represented by a series of segments. These segments may have complete or incomplete feature-sets. The 1-C, 2-C, and 4-C tests follow the conventions of the first category of inputs while the 1-IC, 2-IC, and 4-IC tests follow the conventions of the second category of inputs. In the 1-C test, a data sample corresponds to a series of complete segments which are intended to represent a particular word; "intended" because the matcher may interpret the series of segments as another word(s). After this data sample is processed by the matcher, the number of matched word(s) in the cohort is recorded. The second test, 1-IC test, is very similar to the first test. The only difference is that the segments used in this test follow the conventions of the second category of inputs, i.e. partial feature specifications. 
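Concretely, a second-category segment can be thought of as a full standard-template segment with most of its features stripped away. The sketch below illustrates such a reduction under the convention just described; the feature-name strings and the helper itself are assumptions for illustration, following the examples in table 5.1, and do not reproduce the actual test-generation code.

    #include <string>
    #include <vector>

    // Strip a fully specified segment down to the features expected to be
    // detected most reliably (second input convention).
    Bundle reduce_to_convention(const Bundle& full, bool vowel_or_glide) {
        static const std::vector<std::string> kept_always =
            {"vowel", "glide", "consonant",            // major class (kept in table 5.1)
             "sonorant", "continuant", "strident"};    // articulator-free features
        static const std::vector<std::string> tongue_body = {"high", "low", "back"};

        Bundle reduced;
        auto keep = [&](const std::vector<std::string>& names) {
            for (const auto& n : names) {
                auto it = full.features.find(n);
                if (it != full.features.end()) reduced.features[n] = it->second;
            }
        };
        keep(kept_always);
        if (vowel_or_glide) keep(tongue_body);   // vowels and glides also keep these
        return reduced;
    }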
Therefore, each of the data samples in this test corresponds to a series of incomplete segments, intended to represent a particular word. In order to compare and analyze the two one-word tests, same list of 124 "intended" words are used. The series of segments in table 5.1 (A) represents a sample input for an 1-C test, and the series of segments in part (B) represents a sample input for the 1-IC test, where the "intended" word in these examples is "as". The next two tests, 2-C test and 2-IC test, are similar to the first two tests just described. The only difference is that the series of segments are intended to represent two words in series. The number of matched utterances in the cohort for each of the 124 samples is recorded for each test. An example data input for the 2-C test is shown in table 5.3 (A) and a corresponding data input for the 2-IC test is shown in part (B), where the intended words are "on day". The final two tests, 4-C test and 4-IC test, are similar to the tests described above, where the only difference lies in that an input is intended to represent a series of four words. The format of the data samples follow the previous examples given in tables 5.1 and 5.3. As seen in these tables, '#' and '%' symbols have been utilized after "Time:" of the first and the last segments. As a result, the matcher is forced to match the first 63 segment 1 Time: # Symbol: aa Prosody: (nil) RoC: unspecified + vowel segment 2 Time: (nil) Symbol: n Prosody: (nil) RoC: unspecified + consonant segment 3 Time: (nil) Symbol: d Prosody: (nil) RoC: unspecified + consonant segment 4 Time: % Symbol: ey Prosody: (nil) RoC: unspecified + vowel - round - high - continuant + sonorant - continuant - sonorant - high - low + low + blade + high - back + back - advtongue-root + anterior - distributed - low + slack-vocal-folds - advtongue-root -const-tongue-root + consttongue-root + nasal (A) segment 1 Time: # Symbol: (nil) Prosody: (nil) RoC: unspecified + vowel segment 2 Time: (nil) Symbol: (nil) Prosody: (nil) RoC: unspecified + consonant segment 3 Time: (nil) Symbol: (nil) Prosody: (nil) RoC: unspecified + consonant segment 4 Time: % Symbol: (nil) Prosody: (nil) RoC: unspecified + vowel + back - high - continuant + sonorant - continuant - sonorant - high - low + low - back (B) Table 5.3: The two series of segments represent two sequential words, "on day". The segments in (A) are described with the conventions of first category of inputs and is a possible data sample for the 2-C test. The segments in (B) are described with the conventions of the second category of inputs and is a possible data sample for the 2-IC test. RoC = Release Or Closure. 64 of the series of input segments to the first of a word candidate and the last of the series of input segments to the last of a word candidate. For example, if the input is as shown in table 5.3(A), then all the matched utterances of the cohort should begin with phoneme /aa/ and end with the phoneme /ey/. There are exactly two purposes in running the six tests in this experiment: " to analyze the feature-level characteristics of the cohorts in these six tests. " to analyze the word-level characteristics of the cohorts in these six tests. The feature-level characteristics are defined by examining how statistics derived from the output of the modified matcher change over a decreasing number of defined features in segments. 
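For each test, the bookkeeping amounts to recording the cohort size for every one of the 124 samples and then summarizing those counts. A hypothetical summary routine (not the actual analysis scripts used for the thesis) might look like this, assuming a non-empty list of cohort sizes:

    #include <cmath>
    #include <vector>

    struct Stats { double mean; double std_dev; int maximum; };

    // Summarize the recorded cohort sizes of one test (mean, std. dev., max),
    // as reported later in table 5.4.
    Stats summarize(const std::vector<int>& cohort_sizes) {
        double sum = 0;
        int max_size = 0;
        for (int n : cohort_sizes) { sum += n; if (n > max_size) max_size = n; }
        double mean = sum / cohort_sizes.size();
        double var = 0;
        for (int n : cohort_sizes) var += (n - mean) * (n - mean);
        var /= cohort_sizes.size();
        return { mean, std::sqrt(var), max_size };
    }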
For example, the results from the 1-C test are compared to the results from the 1-IC test in order to better understand the overall consequences of having partial sets of features. The same comparative analysis is done for the two-word tests and the four-word tests.

The word-level characteristics are defined by examining how statistics derived from the output of the modified matcher change over an increasing number of words in series for a data sample. More specifically, the output data from the one-word tests is compared with the output data from the two-word tests, which in turn is compared with the output data from the four-word tests. This type of analysis presents important information about how the cohort reacts as the number of segments increases.

5.2 Data

All the input data to the modified matcher for each of the six tests and their corresponding output data were recorded, as shown in appendices E, F, and G. As mentioned before, the output data we are interested in is the number of matched utterances in the cohort for a given input data sample. For each of the six tests, statistics such as the mean, standard deviation, and maximum were calculated; they are listed in table 5.4.

    test    mean     std. dev.   maximum
    1-C     1.024    0.154       2
    1-IC    2.685    3.014       22
    2-C     1.121    0.501       5
    2-IC    7.742    12.996      88
    4-C     1.097    0.296       2
    4-IC    67.790   152.637     1242

Table 5.4: Mean, standard deviation (std. dev.), and maximum of the number of matched utterances for each of the tests in the experiment.

5.2.1 Feature-level Characteristics

If the performance of the lexical access system is judged solely by the ambiguity of its output, which corresponds to the number of utterances in the cohort produced by the matcher, then it is a foregone conclusion that the more acoustic information the system has, the better its performance will be. One of the fundamental questions examined in this experiment is: how much does performance degrade with the given incomplete sets of features? With "given incomplete sets" defined by the conventions of the second category of inputs, the answer to this question is found in the feature-level characteristics of the tests.

1-C and 1-IC Tests

As recorded in table 5.4, the mean number of matches in the output of the 1-C test is 1.024 and the standard deviation is 0.154. These statistics confirm the obvious conclusion: if all the possible features are correctly detected, then almost every data sample matches exactly to its intended word. A histogram for the results of this test, shown in figure 5-1, depicts that almost all of the 124 inputs correspond to a single utterance in their cohorts. In fact, 98.4% of the data samples matched one-to-one. When many of the features were absent for the same words, as in the 1-IC test, the mean and the standard deviation increased, to 2.685 and 3.014 respectively. From figure 5-2, we see that the number of matches in a cohort is usually less than or equal to 5. When the results are examined more closely, we find that 45.2% of the samples matched exactly to their intended word and 71.0% matched within an ambiguity of two possible utterances.

[Figure 5-1: Histogram for the results of the 1-C test. The number of matches in a cohort of the matcher is the independent variable. Much of the distribution is concentrated at one match per data sample.]

[Figure 5-2: Histogram for the results of the 1-IC test. The number of matches in a cohort of the matcher is the independent variable. Much of the distribution is concentrated at five or fewer matches per data sample.]

2-C and 2-IC Tests

The mean is 1.121 and the standard deviation is 0.501 for the output of the 2-C test, as shown in table 5.4. Although a few cohorts climbed as high as five matches, the histogram in figure 5-3 shows that almost all the matches were one-to-one. Upon further calculation, we find that 93.5% of the cohorts contained exactly one correctly matched utterance. In the 2-IC test, however, the mean and the standard deviation jumped to 7.742 and 12.996 respectively. To examine these statistics in detail, let us look at the histogram shown in figure 5-4. The distribution decreases exponentially as the number of matches approaches 20. Even though approximately 64% of the data samples corresponded to five or fewer matches in their cohorts, the other 36% are significant and should not be ignored.

[Figure 5-3: Histogram for the results of the 2-C test. The number of matches in a cohort of the matcher is the independent variable. The results are very similar to those of the 1-C test in figure 5-1.]

[Figure 5-4: Histogram for the results of the 2-IC test. The number of matches in a cohort of the matcher is the independent variable. The results resemble an exponentially decreasing function where the frequency becomes insignificant as the number of matches approaches 20.]

4-C and 4-IC Tests

According to table 5.4, the mean is 1.097 and the standard deviation is 0.296 for the 4-C test. As the histogram in figure 5-5 shows, approximately 91.9% of the four-word utterances matched exactly to their intended utterances, given that all the features were defined in all the segments. In the 4-IC test, however, the outputs behave much differently; the mean is 67.790 and the standard deviation is 152.637. In the histogram provided in figure 5-6, we find that almost all of the data samples matched to 200 or fewer possible utterances, while a few individual points measured 700 or higher. In fact, the highest number of matches for a data sample was 1242, for which the intended phrase was "do pail another contain". When the histogram is magnified over the range of 200 or fewer matches, as shown in figure 5-7, we observe that much of the distribution lies at 30 or fewer matches while the rest of the range is fairly uniformly distributed.

[Figure 5-5: Histogram for the results of the 4-C test. The number of matches in a cohort of the matcher is the independent variable. The distribution resembles those seen in the one-complete-word and two-complete-word tests.]

[Figure 5-6: Histogram for the results of the 4-IC test. The number of matches in a cohort of the matcher is the independent variable. The results resemble an exponentially decreasing function where the frequency becomes insignificant as the number of matches approaches 400.]

[Figure 5-7: A magnified version of the histogram in figure 5-6 where the range is less than 200. The distribution is generally uniform when the number of matches is greater than 30. The bar at 200 is the sum of all counts beyond that point.]
5.2.2 Word-level Characteristics

As explained in the previous section, one variable that affects the dynamics of the output of the matcher is the number of features that are detected and labeled for each segment. Another variable that influences the output is the length of the series of segments that the matcher receives per process. In our experiment, we measured this length in terms of the number of words that a particular series of segments represents.

When the results of the 1-C, 2-C, and 4-C tests are examined, it is evident that the number of matched utterances in the cohort does not change as the number of words varies. The mean values from these tests do not differ from one another by more than 0.1. In fact, the mean value does not appear to have any correlation with the number of words; it neither monotonically increases nor decreases with the increasing number of words.

However, as one would expect, the 1-IC, 2-IC, and 4-IC tests produce very different word-level characteristics. More specifically, one would expect the relationship between the number of matches per sample and the number of words in the intended utterance to be

    y = 2.685^x                                                            (5.1)

where 2.685 is the mean value for the 1-IC test, x is the number of words in the intended utterance, and y is the expected number of matches for a data sample of x words.

    (A) "do":   segment 1   Time: #   Symbol: d    RoC: unspecified   + consonant  - continuant  - sonorant
                segment 2   Time: %   Symbol: uw   RoC: unspecified   + vowel  + high  - low  + back
    (B) "add":  segment 1   Time: #   Symbol: ae   RoC: unspecified   + vowel  - high  + low  - back
                segment 2   Time: %   Symbol: d    RoC: unspecified   + consonant  - continuant  - sonorant

Table 5.5: Two words with feature sets constrained by the second convention for inputs. (A) is a series of segments representing the word "do". (B) is a series of segments representing the word "add". RoC = Release Or Closure; Prosody is (nil) for every segment.

To better understand the formulation of equation 5.1, let us study an example to observe how a series of words can affect the output of the matcher. Assume that our first series of segments is as shown in table 5.5 (A). If we use this as an input to the modified matcher, then we get three matched utterances in the cohort, as shown in table 5.6 (A). Now assume that we have a second series of segments, as shown in table 5.5 (B); if this is the input to the matcher, the cohort looks like table 5.6 (B). At this point, the question we ask is: what would the cohort be if these two series of segments were combined into one? If the input is as shown in table 5.7, then we would expect the cohort to consist of the utterances shown in table 5.8. Each possible utterance for the first series is followed by each possible utterance of the second series, and therefore the total number in the cohort is 3 * 2 = 6.

    cohort of segments from table 5.5 (A):  1. do    2. to    3. too
    cohort of segments from table 5.5 (B):  1. add   2. at

Table 5.6: Two cohorts corresponding to the two individual series of segments from table 5.5. (A) is the cohort for the series intended to represent the word "do". (B) is the cohort for the series intended to represent the word "add".
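The combination rule behind equation 5.1 is a simple cross product of the individual cohorts. The following sketch, using a hypothetical helper over plain strings rather than the matcher's own structures, reproduces the 3 * 2 = 6 count of table 5.8.

    #include <string>
    #include <vector>

    // Combine two cohorts as equation 5.1 assumes: every utterance of the
    // first cohort followed by every utterance of the second.
    std::vector<std::string> combine(const std::vector<std::string>& first,
                                     const std::vector<std::string>& second) {
        std::vector<std::string> result;
        for (const auto& a : first)
            for (const auto& b : second)
                result.push_back(a + " " + b);
        return result;   // size = first.size() * second.size()
    }

    // combine({"do", "to", "too"}, {"add", "at"}) yields the six utterances of
    // table 5.8. The observed 2-IC and 4-IC means exceed this product because,
    // as discussed below, extra words can be matched across the word boundary.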
    segment 1   Time: #      Symbol: d    RoC: unspecified   + consonant  - continuant  - sonorant
    segment 2   Time: (nil)  Symbol: uw   RoC: unspecified   + vowel  + high  - low  + back
    segment 3   Time: (nil)  Symbol: ae   RoC: unspecified   + vowel  - high  + low  - back
    segment 4   Time: %      Symbol: d    RoC: unspecified   + consonant  - continuant  - sonorant

Table 5.7: The two series of segments from table 5.5 combined into one series. RoC = Release Or Closure; Prosody is (nil) for every segment.

    cohort of segments from table 5.7:
    1. do add    2. do at    3. to add    4. to at    5. too add    6. too at

Table 5.8: The cohort corresponding to the series of segments in table 5.7.

In the results of the 1-IC test, the average number of matched words is 2.685. Following the same principle as above, if the input consisted of two words, then the expected average number of matches would be 2.685 * 2.685, or 2.685^2. If the input were expanded to four words, then the expected average number of matches would be 2.685^4. Thus equation 5.1 computes the expected number of matched utterances in a cohort for any number of words. In figure 5-8, the expected behavior is shown graphically with squares. Yet the data we collected behaved much differently from this expectation: in figure 5-8, the real data deviates increasingly from the expected data as the number of words increases. In fact, figure 5-9 shows that the difference between the real and the expected values increases exponentially as the number of words increases linearly. As one can imagine, this type of behavior poses great problems, especially because utterances are not confined to one or two words.

[Figure 5-8: Two graphs where the independent variable is the number of intended words in the input and the dependent variable is the average number of matches in the cohort. The graph with squares corresponds to the expected behavior and the graph with circles corresponds to the real behavior.]

[Figure 5-9: The difference between the expected and the real graphs from figure 5-8. As the number of words in the input increases, the difference increases exponentially.]

To explain this unexpected behavior, let us examine data sample #52-2 from the 2-IC test. #52-2 denotes word #52, "caution", and word #2, "able", in series, as seen in appendices A and B. These two words were separate data samples in the 1-IC test, and the numbers of matches for "caution" and "able" individually were 6 and 5 respectively; their cohorts are shown in table 5.9. From this information, the expected number of matches for "caution able" is calculated to be 6 * 5, or 30. Yet the real number of matches in the test is 88, as seen in appendix F. What causes such significant deviation from the expected output? The reason is simple: new words are created across the boundary between the segments for "caution" and the segments for "able". For example, one of the utterances in the cohort is "bus a make all". In this particular case, "make" is a word that is created across the boundary when the two series of segments are serialized. In fact, 55 other utterances were created in the same unexpected manner. As the number of words in an utterance increases, the probability of creating unexpected words in the matching process also increases.
In fact, this undesirable effect increases exponentially with the number of words, as shown in figure 5-9.

    cohort of "caution":  1. bus all   2. caution   3. cousin    4. does all   5. go some   6. toe some
    cohort of "able":     1. able      2. ache all  3. ache an   4. ape all    5. ape an

Table 5.9: Two cohorts corresponding to the two individual series of segments representing the words "caution" and "able".

5.3 Conclusion

In this experiment, we analyzed the results of six different tests to gain insight into the feature-level and the word-level characteristics of the data samples. One observation is that when all the segments had complete sets of features, almost all of the data samples matched exactly to the "intended" phrase, independent of the number of words in the data sample. The second convention for the data samples stated that all segments carry only the articulator-free features, with vowel and glide segments carrying, in addition, the articulator-bound features governed by the tongue body. When this convention was applied to the data samples, the results showed that the performance of the system decreased exponentially with the increasing number of words in the input. In fact, the average cohort size grew exponentially, and faster than the expected number, because of the new words created across word boundaries. From the results, we can conclude that the correlation between the number of words in the input and the number of matched utterances in the output of the matcher weakens as more features are defined in the segments; when all the features are defined, the results reveal no correlation at all. Interestingly, only a few data samples caused the mean to increase significantly in the 4-IC test. For example, sample #68-156-12-56 produced 1242 matched utterances. Because this behavior is not common, it is my conjecture that a couple of additional detected features in the sets would help avoid these unusual cases.

Chapter 6

Summary

In this project, we have examined the lexical access system with particular attention to its second subsystem, the matcher. The previous version, called the original matcher, ignored the fact that different types of variation exist in a real speech signal. Therefore, the original matcher would fail to function in real situations. In an effort to account for these uncertainty factors, a modified matcher was designed. Unlike the original matcher, it tolerates some uncertainty in the signal analysis of the processing subsystem. More specifically, this matcher allows the segments to be defined with incomplete sets of features. Also, a series of segments is not required to form a complete utterance in this new matcher. When the design of the modified matcher was complete, an experiment was carried out in which the effects of incomplete feature sets were analyzed. Because only a very small number of features were defined for the 1-IC, 2-IC, and 4-IC tests, the results were not as good as we had hoped. In these tests, for example, the number of possible utterances in the cohort increased exponentially as the number of words in the input increased linearly. In fact, our results show some unexpected, and undesirable, behavior in which new words were created between the boundaries of words.
This experiment is only a stepping stone toward a further series of similar experiments to identify the features that provide the most information, features that do not provide much information and everything in between. A mapping of features to values, which indicate the amount of information they inherently possess, is essential 77 in designing an efficient lexical access system. Where To Go From Here It has been speculated that the first segment of a word typically has more observable acoustic cues than the latter segments. Though it is difficult to estimate its effects on the performance, such information will definitely help reduce the ambiguity of utterances in the cohort. To better understand its effects, an experiment with appropriate parameters should be pursued in the future. Also, information regarding syllabic affiliation may reduce the ambiguity among matched words in the cohort. This would require the researcher to change the source code of the matcher such that the new algorithm recognizes syllabic affiliations. As mentioned in section 4.2, one of the design assumptions of the original matcher was not resolved in this modified matcher: the deletion of segments. As one can imagine, just a single missing segment in a speech signal will cause tremendous problems in the current matcher and therefore, it needs to be resolved. One method is to define linguistic rules in the matching algorithm to compensate for deletion of segments. Finally, more modifications should be made in the current matcher to further relax the constraints on the input. One method to achieve this is to utilize an uncertaintymeasure scheme, where a feature can have numeric values (i.e., 0 to 10 in which 0=very uncertain and 10=very certain). In this scheme, the matcher will provide more freedom for the signal processor to identify acoustic cues and define them amidst uncertainty. Currently, the modified matcher utilizes a +/- scheme; if a feature is not identifiable with 100 percent certainty, or very close to 100 percent, it should be left out of the segment for the matching algorithm. As one can imagine, such scheme is not an efficient method of utilizing all the pertinent information in a signal. A more elaborate uncertainty-measure scheme would be appropriate for the finalized lexical access system. 78 Appendix A 248-Word Lexicon 79 1. a 7. again 13. ape 19. away 25. bat 31. begin 37. book 43. but 49. cant 55. comes 61. cup 67. dig 73. dub 79. fame 85. from 91. giving 97. has 103. hip 109. into 115. kept 121. like 127. mack 133. me 139. my 145. not 151. only 157. pay 163. power 169. rug 175. seek 181. shortage 187. some 193. support 199. taking 205. there 211. today 217. unite 223. wants 229. went 235. with 241. write 246. zip 248-Word Lexicon 2. able 3. about 4. above 8. all 9. among 10. an 14. appear 15. are 16. arlene 20. baby 21. back 22. bad 26. be 27. because 28. become 32. below 33. beside 34. between box 38. 39. boy 40. bug 44. by 45. can 46. came 50. case 51. catch 52. caution 56. contain 57. cookie 58. could 62. day 63. debate 64. debby 68. do 69. dog 70. does 74. dug 75. duty 76. easy 80. feel 81. few 82. fifteen 86. gas 87. gave 88. get 92. go 93. gone 94. good 98. have 99. he 100. her 104. his 105. how 106. i 110. is 111. it 112. jake 116. knock 117. lasso 118. lazy 122. lock 123. long 124. look 128. make 129. man 130. many 134. men 135. money 136. more 140. name 141. nasal 142. never 146. nothing 147. now 148. number 152. oppose 153. or 154. other 158. perch 159. phone 160. pig 164. put 165. 
rabbit 166. rabid 170. said 171. saw 172. say 176. seize 177. shake 178. she 182. show 183. sing 184. small 188. something 189. sport 190. state 194. suppose 195. system 196. take 200. than 201. that 202. the 206. these 207. they 208. this 212. toe 213. tom 214. tonight 218. up 219. us 220. very 224. was 225. water 226. way 230. were 231. what 232. when 236. without 237. woman 238. women 242. year 243. you 244. your 247. zoo 248. zoom 5. ache 11. and 17. as 23. bake 29. before 35. big 41. bus 47. can 53. city 59. cousin 65. deny 71. done 77. even 83. follow 89. give 95. goodbye 101. hike 107. ike 113. just 119. leaky 125. looked 131. mary 137. mouth 143. nineteen 149. of 155. out 161. pop 167. raccoon 173. seal 179. shoe 185. so 191. submit 197. taken 203. them 209. time 215. too 221. want 227. we 233. which 239. word 245. zap 6. add 12. another 18. at 24. balloon 30. began 36. blow 42. busy 48. canoe 54. come 60. cub 66. did 72. dont 78. every 84. for 90. gives 96. had 102. him 108. in 114. keep 120. leave 126. loss 132. may 138. much 144. no 150. on 156. pail 162. potato 168. remember 174. see 180. short 186. soak 192. sudden 198. takes 204. then 210. to 216. took 222. wanted 228. well 234. will 240. would Table A.1: In all the tests, index numbers above are used to indicate specific words. 80 Appendix B Standard Phonemes 81 Utterance: Standard Features Version 1 7/11/1998 Time: (nil) Symbol: ae Prosody: (nil) Release Or Closure: unspecified Features: Time: (nil) Symbol: iy Prosody: (nil) Release Or Closure: unspecified Features: + vowel - high + vowel + high - low - back + advtongueroot - consttonguejroot + low - back - advtongue-root - consttonguejroot Time: (nil) Symbol: aa Prosody: (nil) Release Or Closure: unspecified Features: Time: (nil) Symbol: ih Prosody: (nil) Release Or Closure: unspecified Features: + vowel - round + vowel + high - low - back - advjtongueroot - consttongue-root - high + - advtongueroot + consttonguejroot Time: (nil) Symbol: ao Prosody: (nil) Release Or Closure: unspecified Features: Time: (nil) Symbol: ey Prosody: (nil) Release Or Closure: unspecified Features: + vowel + vowel + round - high - high - low - back + advjtongueroot - consttonguejroot - low + back - advtonguejroot + consttongue-root Time: (nil) Symbol: eh Prosody: (nil) Release Or Closure: unspecified Features: + low + back Time: (nil) Symbol: ow Prosody: (nil) Release Or Closure: unspecified Features: vowel + vowel + round - high - low - back - adv-tongueroot - consttonguejroot - high - low + back + advtonguejroot - consttongue-root 82 - low + back - advtongue-root - consttongue-root Time: (nil) Symbol: ah Prosody: (nil) Release Or Closure: unspecified Features: + vowel - round - high - Time: (nil) Symbol: er Prosody: (nil) Release Or Closure: unspecified Features: low + back - advjtongueroot - consttonguejroot + reduced + vowel + rhotic + blade - round - anterior - distributed Time: (nil) Symbol: uw Prosody: (nil) Release Or Closure: unspecified Features: + - high - low + back - advtonguejroot - consttonguejroot vowel + round + high - low + back + advtongueroot - consttonguejroot Time: (nil) Symbol: x Prosody: (nil) Release Or Closure: unspecified Features: Time: (nil) Symbol: uh + reduced + vowel - round Prosody: (nil) Release Or Closure: unspecified Features: + vowel + round + high - - high - low - advtongue-root - consttonguejroot low + back - advjtongueroot - consttongue-root Time: (nil) Symbol: w Prosody: (nil) Release Or Closure: unspecified Features: + glide + lips Time: (nil) Symbol: rr 
Prosody: (nil) Release Or Closure: unspecified Features: - reduced + vowel + rhotic + blade - round - anterior - distributed - high + round + high - low + back + advtongue-root - consttonguejroot Time: (nil) 83 Symbol: y - high Prosody: (nil) - low + back - nasal Release Or Closure: unspecified Features: + glide + blade - anterior + distributed + high - Time: (nil) Symbol: m Prosody: (nil) Release Or Closure: unspecified Features: low - back + advtonguejroot - consttongue-root + consonant - continuant + sonorant + Time: (nil) Symbol: r Prosody: (nil) Release Or Closure: unspecified Features: + glide - Time: (nil) Symbol: n Prosody: (nil) Release Or Closure: unspecified Features: lateral + rhotic + blade - anterior - distributed - high - lips round + nasal - + consonant - continuant + sonorant + blade + anterior - distributed + nasal low + back - nasal Time: (nil) Symbol: h Prosody: (nil) Time: (nil) Symbol: ng Prosody: (nil) Release Or Closure: unspecified Features: Release Or Closure: unspecified Features: + glide + larynx + consonant - continuant + sonorant + body + high - low + nasal + spread-glottis - constricted-glottis Time: (nil) Symbol: 1 Prosody: (nil) Release Or Closure: unspecified Features: + consonant - continuant + sonorant + Time: (nil) Symbol: v Prosody: (nil) Release Or Closure: unspecified Features: lateral - rhotic + blade + anterior - distributed + consonant + continuant - sonorant - strident 84 + lips - round + slackvocalfolds + consonant + continuant - sonorant - strident + lips Time: (nil) Symbol: dh Prosody: (nil) Release Or Closure: unspecified Features: + continuant - sonorant - strident + blade + anterior + distributed + continuant - sonorant - strident + blade + anterior + distributed slackvocalfolds - slackvocalfolds Time: (nil) Symbol: s Prosody: (nil) Release Or Closure: unspecified Features: + consonant + continuant - sonorant + strident + blade + anterior - distributed + slackvocalfolds + consonant + continuant - sonorant + strident + blade + anterior - distributed - slackvocalfolds Time: (nil) Symbol: zh Prosody: (nil) Release Or Closure: unspecified Features: Time: (nil) Symbol: sh Prosody: (nil) Release Or Closure: unspecified Features: + consonant continuant - sonorant + strident + blade - anterior + distributed + slack vocal folds + consonant Time: (nil) Symbol: z Prosody: (nil) Release Or Closure: unspecified Features: + round - Time: (nil) Symbol: th Prosody: (nil) Release Or Closure: unspecified Features: + consonant + - + consonant + continuant - sonorant + strident + blade - anterior + distributed slackvocalfolds Time: (nil) Symbol: f Prosody: (nil) Release Or Closure: unspecified Features: - 85 slackvocalfolds Time: (nil) Symbol: dj Prosody: (nil) Release Or Closure: unspecified Features: + consonant - continuant - sonorant + strident + blade - anterior + distributed + slackvocalfolds Time: (nil) Symbol: g Prosody: (nil) Release Or Closure: unspecified Features: + consonant - continuant - sonorant + body + high - low + slackvocalfolds Time: (nil) Symbol: ch Time: (nil) Prosody: (nil) Release Or Closure: unspecified Features: + consonant - continuant - sonorant + strident + blade - anterior + distributed - slackvocalfolds Symbol: p Prosody: (nil) Time: (nil) Time: (nil) Symbol: b Symbol: t Prosody: (nil) Release Or Closure: unspecified Features: + consonant - continuant - sonorant + blade + anterior - distributed - constricted-glottis - slackvocalfolds Release Or Closure: unspecified Features: + consonant - continuant - sonorant + lips - round - 
constricted-glottis - slackvocalfolds Prosody: (nil) Release Or Closure: unspecified Features: + consonant - continuant - sonorant + lips - round + slackvocalfolds Time: (nil) Symbol: d Prosody: (nil) Release Or Closure: unspecified Features: + consonant - continuant - sonorant + blade + anterior - distributed + slackvocalfolds Time: (nil) Symbol: k Prosody: (nil) Release Or Closure: unspecified Features: + consonant - continuant - sonorant + body + high 86 - low - constricted-glottis - slackvocalfolds Time: (nil) Symbol: Is Prosody: (nil) Release Or Closure: unspecified Features: + reduced + vowel + anterior - rhotic - distributed - high - low + back - advjtongueroot - consttongue-root - nasal 87 Appendix C Rules: An Example 88 Rule: place of articulation change Segment: T Type: L Prosody: (nil) Release Or Closure: (nil) Features: Segment: Cl Type: (nil) Prosody: (nil) Release Or Closure: unspecified Features: + consonant - continuant + blade + anterior - distributed Segment: C2 Type: # Prosody: (nil) Release Or Closure: unspecified Features: Segment: C3 Type: (nil) Prosody: (nil) Release Or Closure: unspecified Features: + consonant + lips Segment: SI Type: CI Prosody: (nil) Release Or Closure: unspecified Features: C3 lips C3 blade C3 dorsum C3 round C3 anterior C3 distributed C3 high C3 low C3 back 89 Appendix D Source Code for Tree-Growth 90 ////// Tree ///// /Constructor _tree::_tree(sentence new-s) { s=new _sentence; s=new-s; NoOfFirst=O; /to record the number of possible first segments FirstWordOffset=new int [MaxMatchPerPos]; sent=new words [s->get_NoOfSegmentsO+1]; int i; for (i=O;i<s->getNoOfSegmentso;i++) sent[i]=NULL; /Find '#' and '%', which symbolize the first and the last words //of a sentence. Both aren't required to exist, in which the program /assumes that the first/last segment in the data isn't necessarily /the first/last segment of the possible utterance. //NoOfSegInSentence is required because if the sentence file contains /the end symbol, "%", then the program needs to stop at that segment //while ignoring all consequent segments. for (i=O;i<s->getLNoOfSegmentso;i++) { if (!strsame(s->get-data(i)->getjtimeo, "%")) continue; // ** if no %for the given segment, go to the next one. else { s->set_NoOfSegInSentence(++i); // break; ** if '%', I} // record whether or not if '%' exist. if (s->getNoOfSegInSentenceo==O) LastWord=O; else LastWord=1; / test to see if '#' exists. if (!strsame(s->get -data()->gettimeO, "#")) FirstWord=O; else FirstWord=1; if (LastWord) // ** if '%' present NoOfSegs=s->getNoOfSegInSentenceo; else NoOfSegs=s->getNoOfSegmentso; } // add a word into the matching tree // at a specific position void _tree::add(word w, int index) { // check boundry for non-oneword cases if ((LastWord)&&(index)) if (w->getNoOfSegments()+index > NoOfSegs) return; 91 then record the number of segments. 
// check boundry if (index>=O && index<NoOfSegs) / if no word array at position specified, create it first if (sent[index]==NULL) sent[index]=new _words(MaxMatchPerPos); // add word to that word array /cout<<"word added "<<w->get-wordO<<" "<<index<<endl; sent[index]->add(w); else terminate("_jree::add","invalid index",itoa(index)); // output complete and incomplete matched sentences void _tree::output-aux(destination des, FILE* writeptr, char* s, int index, output-mode om) int i; char *temp; if (LastWord) if (index > NoOfSegs) terminate("_tree::outputjaux","index overflow",itoa(index),itoa(NoOfSegs)); // when output requests is made at the final segment if ((LastWord&&(index==NoOfSegs))11(!LastWord&&(index>=NoOfSegs))) { / if mode is completed // then output sentence if (om==completed) if (des==screen) cout << NoOfCompleted++ << ". "<< s << "." << endl; else if (des==disk) fprintf(writeptr,"%d. %s.\n",NoOfCompleted++,s); I} // when output requests is not made at the final segment else / if we cannot continue to follow any word links if (sent[index]==NULL) / if output mode is incomplete then output if (om==incomplete) if (des==screen) cout << NoOflncomplete++ << << s << endl; else if (des==disk) fprintf(writeptr,"%d. %s\n",NoOfIncomplete++,s); / if we are not at the final segment and there are more word links to // follow from the current position, recurse on all the possible / links else for (i=0;i<sent[index]->getNoOfDatao;i++) { 92 // copy the words up to this point and add on the next word /then recurse output on the next link point temp = new char [1000]; strcpy(temp,s); strcat(temp," "); I/cout<<"this is what i have: "<<temp<<index<<i<<endl; strcat(temp,sent[index]->get-data(i)->get-wordo); /if index=0 and FirstWord=0, then the first word may not be // complete. Therefore, use special variables. 
if (indexllFirstWord) output_aux(des,writeptr,temp,sent[index]->get-data(i)>getNoOfSegmentso+index,om); else output-aux(des,writeptr,temp,FirstWordOffset[i]+index,om); // clean up memory delete s; void _tree::output (destination des, FILE* writeptr) { // set the GLOBAL counters NoOfCompleted=l; NoOflncomplete=1; char *st; / titles if (des==screen) cout << "\n======= Completed-======\n" else if (des==disk) fprintf(writeptr,"\n\n======= Results =======\n\n======= Completed =======\n"); // output completed st=new char [5]; strcpy(st,""); output-aux(des,writeptr,st,0,completed); / titles if (des==screen) cout << "\n======= Incomplete ======--\n"; else if (des==disk) fprintf(writeptr,"\n======= incomplete ======\n"); // output incomplete sentences st=new char [5]; strcpy(st,""); output-aux(des,writeptr,st,0,incomplete); cout << endl; // grow matching tree void _tree::grow(rules rs, lexicon { grow-aux2(rs,lx,0); lx) 93 // recursively grow tree -starting_ from a sentence position void _tree::grow-aux2(rules rs, lexicon lx, int starting-position) if (sent[starting-position]!=NULL) return; /int i; //for (i=O;i<NoOfSegments;i++) // grow at this particular position grow.aux(rs,lx,startingposition); if (sent[startingposition]==NULL) return; // recursively follow the links based on the matched words // at this position for (nt i=O; i<sent[starting-position]->getNoOfDatao; i++){ if ((FirstWord j(!FirstWord&&(starting-position)))) grow-aux2(rs,lx,starting-position+sent[starting-position]->get-data(i)->getNoOfSegmentso); else if (!FirstWord&&(!starting-position)) grow-aux2(rs,lx,starting-position+FirstWordOffset[i]); } // grow matching tree for a particular sentence position void _tree::grow-aux(rules rs, lexicon lx, int starting-position) // if the current position has already been grown // quit if ((sent[starting-position]!=NULL) (starting-position >= NoOfSegs)) return; // construct a temporary lexicon lexicon tempjx=lx->copyo; // // // // To suppress rules in this program, DO NOT comment the next line out. This will disenable the matcher from working properly because the expansion bit is determinded when the next line is called. To ingnore the rules, go to rules.rs file and erase the rules as you wish. temp-lx->apply-rules(rs,s,starting-position); //cout<<starting-position<<endl; // // // // / // // assuming the lexicon is expanded with a set of external rules ONLY we will just match the WORDs that was expanded and skip those that was not expanded by the external rules. The ones not expanded by the external rules are the ones that does not match the first segment of the target position in the sentence. This is an optimization so we do not need to expand all the words inside the lexicon, but only those PROMISING ones that matches the first segment of the sentence. int ij; word w; // loop through all the touched words in the lexicon as well as all their // multiple pronunciationas and alternative pronunciations for (i=O;i<tempjlx->getNoOfDatao;i++) 94 if (tempjx->get data(i)->getexpanded()==exp) for (j=O;j<temp_x->get-data(i)->getNoOfDatao;j++) //cout<<"Word tested: "<<tempjx->get_.data(i)->getLabel()<<endl; // if any pronunciation match the current context in the sentence / it is added to the word array for this position. // for case where '#' doesn't exist and starting-position=O. 
if (!FirstWord&&(!starting-position)){ if (match-search(tempjx->get-data(i)->get-data(j),starting-position)) { w=new _word (tempjlx->get-data(i)->getLabel(),temp-lx->get-data(i)->get-data(j)>getNoOfSegmentso); add(w,starting-position); // after one match do we go on or do we quit? // choise here is quit break; // alternatively we can go on and find more matches for the // same word.. to do this just get rid of break; // for case where '#' does exist or ('#' doesn't exist and // starting-position doesn't equal 0) if ((FirstWordll(!FirstWord&&(startingposition)))) if (matching(tempjx->getdata(i)->get data(j),s,starting-position)) w=new _word (tempjx->get-data(i)->getLabel(),tempjx->getdata(i)->get-data(j)>getNoOfSegmentso); add(w,starting-position); // after one match do we go on or do we quit? // choise here is quit break; // alternatively we can go on and find more matches for the // same word.. to do this just get rid of break; // clean up the temporary lexicon copy delete temp_lx; // print out tree status void _tree::statuso int ij; // sentence cout << "Sentence: " << s->get-contento; cout << "\nposition\tmatch\n--------\V-----\n"; // position -- matchings for (i=0;i<NoOfSegs;i++) { if (sent[i]!=NULL) cout << i << "\t\t"; for (j=0;j<sent[i]->getNoOfDatao;j++) cout << sent[i]->getdata(j)->get_word() << cout << "\n"; 95 } // match a pronunciation with the word int _tree::matching (pronunciation p, sentence s, int index) { // overflow case if (LastWord) if (index+(p->get_NoOfSegmentsO) > s->getNoOfSegmentso) return 0; int i; // segment by segment match of the pronunciation to the sentence context / if just one fails we exit for (i=0;i<p->getNoOfSegmentso;i++) if (!(bundlejincomp-match(p->get-data(i),s->get-data(index+i)))) return 0; else if (!LastWord) if ((index+i+1)==NoOfSegs) return 1; } } // ** for given pronunciation from a word in the lexicon, it will // ** loop through each phoneme and compare it to the "incomplete" word. // ** index isn't required for single word search. This procedure is used // ** only on the first word of the sentence, given that "#" is absent. int _tree::match-search(pronunciation p, int index) { int i,count_match=0,count_miss=0; for (i=0; i<p->getNoOfSegmentso; i++) { // this condition is used if there is only one word to be analyzed. / if all the segments of the input matches, yet is not the // end of the pronunciation, then start the search again. // This means that countmatch should be set to zero and // countmiss should be incremented the amount that is needed. if (count match==NoOfSegs) I if ((i != p->getNoOfSegmentso)&&(LastWord)){ countmatch=0; count_miss=countmiss+NoOfSegs; } // get the first word in the sentence. The first word can start at // any segment within itself. if (!(bundlejincomp-match(p->get-data(i), s->get-data(countmatch+index)))) { if (!count-match) { count_miss++; continue; / if a sequence of matches isn't complete, then start the matching // process over again. countmiss is incremented by the amount of // previous matches in the sequence and one (for the current mismatch) // and countmatch is set to zero. else { countmiss=countniss+count-match+l; 96 count match=O; continue; } else countmatch++; if (!LastWord) // ** if '% doesn't exists if ((index+count match)==NoOfSegs) // break; ** if last segment } // ** if there is no match at all, then return 0. 
if (count-miss==p->getNoOfSegmentso) return 0; // records # of matches for array of possible first words FirstWordOffset[NoOfFirst++]=countmatch; return 1; } 97 Appendix E Results for One-Word Tests 98 Lexical Word Number 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 Number of Number of Matches with Matches with Complete Sets (A) Incomplete Sets (B) 1 5 MEAN 1 5 STD. DEV 1 2 MAXIMUM 1 1 1 2 1 11 1 1 1 1 1 2 1 4 1 5 1 1 1 1 1 22 1 17 32 1 10 34 36 38 1 1 1 12 1 1 40 1 7 42 44 1 1 3 1 46 48 50 1 1 1 2 1 1 52 1 6 54 1 2 56 1 5 58 2 5 60 1 7 62 1 2 64 66 68 1 1 1 4 5 3 70 72 74 1 1 1 2 1 7 76 1 1 78 1 1 99 A 1.024 0.154 2 B 2.685 3.014 22 WORD# 80 82 84 86 88 90 92 94 96 98 100 102 104 106 108 110 112 114 116 118 120 122 124 126 128 130 132 134 136 138 140 142 144 146 148 150 152 154 156 158 160 A 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 B 1 1 1 1 4 1 2 5 1 1 2 2 1 1 1 1 2 5 2 3 1 2 1 1 1 2 1 2 1 1 2 2 1 2 1 1 7 3 2 1 5 100 WORD # 162 164 166 168 170 172 174 176 178 180 182 184 186 188 190 192 194 196 198 200 202 204 206 208 210 212 214 216 218 220 222 224 226 228 230 232 234 236 238 240 242 244 246 248 A 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 B 3 5 2 1 2 1 2 1 2 1 2 1 1 2 1 6 1 4 1 1 1 3 2 2 3 2 4 5 1 1 3 1 1 2 1 2 1 2 2 1 1 1 2 1 101 Appendix F Results for Two-Word Tests 102 Lexical Word Number 126-46 142-244 152-80 22-232 224-180 86-46 60-158 124-216 136-142 152-164 222-70 36-34 Number of Number of Matches with Matches with Comp-Sets (A) Incomp-Sets (B) 1 2 1 2 1 7 1 2 1 1 1 2 1 7 1 5 1 2 5 35 1 6 1 12 82-2 180-12 1 1 5 11 74-126 16-8 102-214 1 1 1 7 1 20 162-24 1 13 52-84 222-228 88-86 176-122 88-212 176-202 1 1 1 1 1 1 6 6 4 2 8 1 140-168 1 2 186-112 1 1 178-122 202-100 40-230 144-74 4 1 1 1 2 3 7 9 104-42 1 3 156-222 182-130 134-214 1 1 1 6 5 20 82-236 2 2 132-190 158-142 10-238 52-2 238-244 124-208 1 1 1 1 1 1 1 2 5 88 5 2 103 MEAN STD. 
DEV MAXIMUM A 1.121 0.501 5 B 7.742 12.996 88 WORD# 128-226 156-234 94-212 148-84 188-144 10-86 196-200 102-56 6-52 56-224 24-64 188-166 66-122 124-244 134-156 4-50 42-88 52-154 120-242 196-150 26-228 148-150 52-210 220-28 2-114 10-46 160-120 190-72 66-130 236-232 150-56 6-10 204-170 198-188 178-182 16-86 34-164 20-136 246-236 208-18 80-30 A 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 2 1 2 B 1 2 10 1 2 2 4 18 12 9 4 4 10 1 4 3 16 20 1 8 2 1 18 14 17 2 5 1 10 4 5 2 6 2 5 1 60 9 4 7 9 104 WORD# 164-190 194-20 48-112 72-58 96-170 36-158 102-246 186-2 242-180 80-90 26-144 28-92 100-158 214-124 172-194 26-214 98-6 132-212 238-154 176-204 230-176 122-44 136-18 4-86 120-58 76-196 92-172 38-42 96-166 58-30 42-234 154-92 118-194 42-238 40-150 110-166 6-178 182-104 142-52 12-78 240-30 148-196 A 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 2 1 B 5 4 1 5 1 1 4 1 1 1 1 49 2 4 1 14 2 3 18 3 2 4 2 3 5 6 2 3 2 85 3 7 8 6 11 10 4 2 12 3 9 4 105 Appendix G Results for Four-Words Tests 106 Lexical Number of Number of Matches with Comp-Sets (A) Matches with Incomp-Sets (B) 132-66-238-2 152-168-166-230 240-202-212-174 1 1 1 54 14 4 192-32-96-144 1 307 36-226-202-40 170-16-94-194 26-216-96-198 1 1 1 7 50 5 146-170-188-200 158-198-14-156 1 1 20 2 66-158-162-16 1 65 208-90-184-42 14-22-246-110 2-36-16-210 92-86-128-206 32-162-150-144 228-108-54-188 228-152-204-72 1 1 2 1 1 1 1 6 10 15 2 242 10 63 84-212-114-84 156-166-14-208 128-80-92-230 180-78-188-116 90-160-176-168 182-42-22-202 1 1 1 1 1 1 17 8 2 4 5 45 110-118-170-180 1 8 58-198-98-82 70-42-46-228 38-154-112-138 232-128-92-148 202-68-210-48 54-44-102-84 1 1 1 1 2 1 5 28 7 12 9 6 46-216-134-2 1 160 146-8-10-168 30-44-160-228 210-238-180-220 184-232-180-220 218-92-202-16 112-194-230-142 24-70-180-226 188-124-108-120 1 2 2 1 1 1 1 1 2 362 15 4 2 20 2 2 136-146-66-210 2 75 Word Number 107 MEAN STD. 
DEV MAXIMUM A B 1.097 0.296 2 67.790 152.637 1242

WORD# 180-30-178-228 146-166-122-26 240-126-150-174 134-14-224-20 130-244-218-34 58-240-172-76 176-86-2-98 150-56-50-14 4-90-24-146 14-110-96-156 230-96-230-214 130-176-126-114 112-94-98-222 126-40-148-192 10-178-80-32 82-170-192-30 96-12-28-90 30-230-4-54 90-18-2-200 102-64-184-152 94-98-18-54 196-120-152-126 4-224-178-108 18-82-58-88 82-156-120-26 236-44-232-116 64-74-44-184 224-178-178-82 130-20-168-106 24-202-94-174 218-92-202-16 176-124-16-134 68-156-12-56 120-196-160-102 122-186-176-202 112-220-124-108 206-240-54-138 122-196-183-32 174-26-184-220 128-10-88-86 18-40-42-178 182-218-98-98
A 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1
B 68 20 2 8 24 5 5 9 5 2 13 20 25 77 20 404 44 170 20 112 20 44 10 140 2 8 42 4 8 10 2 2 1242 40 2 1 4 218 5 12 42 2
WORD# 190-48-20-134 94-152-198-102 40-174-186-204 10-244-106-200 186-2-238-154 120-238-34-6 124-40-34-102 242-218-2-146 168-22-36-178 42-124-102-198 26-190-136-226 174-66-186-128 132-52-160-174 34-74-178-150 28-92-220-146 86-84-240-178 24-84-246-118 238-172-108-28 84-132-32-40 150-104-44-4 168-134-74-102 28-22-126-138 76-100-108-74 88-188-86-176 130-70-44-82 52-2-122-46 236-154-240-204 220-246-242-188 82-96-152-134 176-2-66-118 16-134-224-98 44-170-162-238 168-28-90-94 72-128-162-242 98-212-132-194 128-66-176-32 24-206-224-202 192-242-102-20 6-148-128-236 58-148-172-212 244-54-44-24 94-168-36-28
A 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1
B 18 104 63 1 738 187 192 115 10 12 1 12 116 168 539 2 6 110 109 5 28 52 14 9 8 352 18 3 14 339 2 130 110 13 40 55 2 88 4 20 3 5

Appendix H

Manual for Match3 Program

This appendix provides detailed instructions for using the modified matcher, called match3. Currently this program is located in the directory '/usr/users/rhkim/PROGS/MATCHER/' and has been compiled on a Linux machine. The machine platform is important because the program has to be recompiled to run on other machines (e.g., SGI machines).

H.1 Setup

Before running the matcher, the user needs four types of files, as explained in chapter 3: the standard template of phonemes, the 248-word lexicon, the linguistic rules, and a series of segments representing an utterance. All of these files should be in text format (emacs is a convenient text editor for this purpose). Since the user is expected to provide all of this information, he or she should become familiar with the specific format that each of these four files is required to follow. But before we go into the formats of these files, the user should understand the format of a segment.

H.1.1 Segment

Segments are used extensively in the lexical access system. The format currently used to specify the contents of a segment is shown in figure H.1, and every segment is required to follow this form. The first element of a segment is the character <, which marks the starting point of the segment. Next, a segment must have the four header variables, Time, Symbol, Prosody, and Release Or Closure, even if all of their values are (nil). A set of features and their values follows these variables. These features can be deleted (except the consonant, vowel, and glide features) or added to the segment in any order the user wishes. Finally, a > on the last line indicates that the segment is complete.

    Time: (nil)
    Symbol: iy
    Prosody: (nil)
    Release Or Closure: unspecified
    Features:
    + vowel
    + high
    - low
    - back
    + adv-tongue-root
    - const-tongue-root

Figure H.1: The specific text format of a segment that the matcher expects.
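Because each segment block is plain text with one field or feature per line, it is straightforward to read programmatically. The following is a minimal sketch, in the spirit of the matcher's C++ sources but not taken from them: the names Segment and read_segment are invented here for illustration, and the code assumes each block is delimited by '<' and '>' on their own lines, as described above.

    // Hypothetical sketch (not from the match3 sources): read one segment
    // block in the text format of figure H.1 into a small structure.
    #include <iostream>
    #include <map>
    #include <string>

    struct Segment {
        std::string time, symbol, prosody, release_or_closure;
        std::map<std::string, char> features;   // feature name -> '+' or '-'
    };

    // Strip leading and trailing blanks from a line.
    static std::string trim(const std::string& s) {
        std::string::size_type a = s.find_first_not_of(" \t\r");
        std::string::size_type b = s.find_last_not_of(" \t\r");
        return (a == std::string::npos) ? "" : s.substr(a, b - a + 1);
    }

    // Read the lines between '<' and '>' from 'in' into 'seg'.
    // Returns false if no complete segment block is found.
    bool read_segment(std::istream& in, Segment& seg) {
        std::string line;
        while (std::getline(in, line) && trim(line) != "<") {}   // find the opening '<'
        if (!in) return false;
        while (std::getline(in, line)) {
            line = trim(line);
            if (line == ">") return true;                        // block is complete
            if (line.rfind("Time:", 0) == 0)          seg.time = trim(line.substr(5));
            else if (line.rfind("Symbol:", 0) == 0)   seg.symbol = trim(line.substr(7));
            else if (line.rfind("Prosody:", 0) == 0)  seg.prosody = trim(line.substr(8));
            else if (line.rfind("Release Or Closure:", 0) == 0)
                seg.release_or_closure = trim(line.substr(19));
            else if (!line.empty() && (line[0] == '+' || line[0] == '-'))
                seg.features[trim(line.substr(1))] = line[0];    // e.g. "+ vowel"
        }
        return false;   // end of file reached before the closing '>'
    }

    int main() {
        Segment seg;
        while (read_segment(std::cin, seg)) {
            std::cout << seg.symbol << ": " << seg.features.size() << " features\n";
            seg = Segment();
        }
        return 0;
    }

Note that features which are not listed in a block simply never appear in the map, which is in the spirit of how the modified matcher treats segments with incomplete feature sets.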
H.1.2 Standard Template of Phonemes

An example standard template of phonemes is shown in appendix B. From this appendix, we observe that the template consists of two components: a header line and a list of segments. The header line in this example is "Utterance: Standard Features Version 1 7/11/1998". Although the contents of this line are not important to the matching algorithm, the matcher looks for "Utterance:", and therefore it should be left alone. Each of the segments follows the format given in figure H.1. Remember to name this file "standard.label" and to keep it in the directory containing the program, match3.

H.1.3 Lexicon

As an example, a section of a lexicon file in text format is shown in figure H.2. The first line of the lexicon is a header line: "LexiconName: M3LexVersion_1_7/11/1998". Again, this line is not important to the matching algorithm and can be titled anything. The next line is the character <, which indicates that the list of lexical words is starting. The user may then list any words in the format shown; in fact, the words do not even have to be in alphabetical order. If a word has two or more possible pronunciations, these pronunciations can be listed as shown for 'able' in figure H.2. To indicate the end of the lexicon, > is used, as shown in figure H.3. When saving this file, make sure that it is named "lexicon.lex". For the tree-based matcher by Maldonaldo, the lexicon format is different; please refer to his thesis for the differences.

    LexiconName: M3LexVersion_1_7/11/1998
    <
    a        [ah]
    able     [ey b x l],[ey b l]
    about    [x b aw t]
    above    [x b ah v]
    ache     [ey k]
    add      [ae d]
    again    [x g eh n]
    all      [ao l]
    among    [x m ah ng]
    an       [ae n],[x n]
    and      [ae n d],[x n d]
    another  [x n ah dh x r]

Figure H.2: The beginning section of the lexicon.

    woman    [w uh m x n]
    women    [w ih m x n]
    word     [w rr d]
    would    [w uh d]
    write    [r aa y t]
    year     [y iy r]
    you      [y uw]
    your     [y ao r]
    zap      [z ae p]
    zip      [z ih p]
    zoo      [z uw]
    zoom     [z uw m]
    >

Figure H.3: The end section of the lexicon.
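A lexicon entry is just a word followed by one or more bracketed pronunciations, so a single line can be split mechanically. The sketch below is hypothetical (Entry and parse_lexicon_line are not names from the match3 sources) and assumes the layout shown in figure H.2, with pronunciations separated by commas and phoneme symbols separated by spaces.

    // Hypothetical sketch (not from the match3 sources): split one lexicon
    // line such as  "able   [ey b x l],[ey b l]"  into the word and its
    // pronunciations, each pronunciation being a list of phoneme symbols.
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    struct Entry {
        std::string word;
        std::vector<std::vector<std::string> > prons;   // one vector per pronunciation
    };

    bool parse_lexicon_line(const std::string& line, Entry& e) {
        std::istringstream in(line);
        if (!(in >> e.word)) return false;               // first token is the word
        std::string rest;
        std::getline(in, rest);                          // everything after the word
        std::string::size_type open = rest.find('[');
        while (open != std::string::npos) {              // one bracketed pronunciation
            std::string::size_type close = rest.find(']', open);
            if (close == std::string::npos) return false;
            std::istringstream ph(rest.substr(open + 1, close - open - 1));
            std::vector<std::string> pron;
            std::string sym;
            while (ph >> sym) pron.push_back(sym);       // space-separated phoneme symbols
            e.prons.push_back(pron);
            open = rest.find('[', close);
        }
        return !e.prons.empty();
    }

    int main() {
        Entry e;
        if (parse_lexicon_line("able [ey b x l],[ey b l]", e))
            std::cout << e.word << " has " << e.prons.size() << " pronunciations\n";
        return 0;
    }

With this kind of split, a word with several pronunciations, such as 'able' in figure H.2, simply becomes several entries in prons.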
H.1.4 Rules

There are two types of files for linguistic rules: a segmental description of a rule and a list of rules. An example format of a rule is shown in appendix C. To better understand how to specify a particular linguistic rule using this format, please refer to Maldonaldo's thesis, which explains the topic very clearly. Each rule should be in a separate file, and each of these files must have a ".rul" extension, such as "rule1.rul". When a set of rules has been defined in text format, a file consisting of a list of these rules should be made. An example is shown in figure H.4. In this figure, the first three lines are unimportant, as long as the rules are in the same directory. This file is normally called "rules.rs".

    RuleSet: M3RSVersion_1_7/11/1998
    Type: External
    Directory: .
    rule1
    rule2

Figure H.4: A list of rules used in the matching process. rule1 and rule2 refer to files called "rule1.rul" and "rule2.rul", each of which contains a different linguistic rule. Both of these ".rul" files follow the format given in appendix C.

H.1.5 Segments Representing an Utterance

Among all of the files, the user needs to be most familiar with this one, because it contains the series of segments that represents an utterance. The previous three files are given and usually left untouched, but this file must be defined by the user. The format is very similar to the format used for the standard template file. Figure H.5 shows an example of a file containing the phrase "a day".

    Utterance: a day

    Time: (nil)
    Symbol: ah
    Prosody: (nil)
    Release Or Closure: unspecified
    Features:
    + vowel
    - round
    - high
    - low
    + back
    - adv-tongue-root
    - const-tongue-root

    Time: (nil)
    Symbol: d
    Prosody: (nil)
    Release Or Closure: unspecified
    Features:
    + consonant
    - continuant
    - sonorant
    + blade
    + anterior
    - distributed
    + slack-vocal-folds

    Time: (nil)
    Symbol: ey
    Prosody: (nil)
    Release Or Closure: unspecified
    Features:
    + vowel
    - high
    - low
    - back
    + adv-tongue-root
    - const-tongue-root

Figure H.5: A file containing a series of segments representing "a day".

Just like the standard template, the first line contains the header for this file, which is unimportant to the matching algorithm. Next, the user must type in all of the sequential segments, in the segmental form described in section H.1.1. Also, as explained in section 4.3, the '#' and '%' symbols may be used within this file. Moreover, one may choose to delete or insert features in the segments, as shown in figure H.6. When saving this file in the text editor, make sure that the name has a ".label" extension. For example, the file in figure H.6 may be called "sent1.label".

    Utterance: a day

    Time: #
    Symbol: ah
    Prosody: (nil)
    Release Or Closure: unspecified
    Features:
    + vowel
    - high
    - low
    + back

    Time: (nil)
    Symbol: d
    Prosody: (nil)
    Release Or Closure: unspecified
    Features:
    + consonant
    - continuant
    - sonorant

    Time: %
    Symbol: ey
    Prosody: (nil)
    Release Or Closure: unspecified
    Features:
    + vowel
    - high
    - low
    - back

Figure H.6: A file containing the series of segments from figure H.5, except with some features absent and with the '#' and '%' symbols. This text file will be called sent1.label for the rest of the manual.
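Since the utterance file is the one file the user writes by hand, it can be convenient to generate it from a small program when preparing many test inputs. The sketch below is a hypothetical illustration only: write_segment is an invented helper, and the exact block layout (including the '<' and '>' delimiters and the placement of '#' and '%' in the Time field) is assumed from section H.1.1 and figure H.6 rather than taken from the match3 sources. It writes a sent1.label-style file for "a day" in which uncertain features are simply omitted.

    // Hypothetical sketch (not from the match3 sources): write an utterance
    // file like figure H.6, with partial feature sets and '#'/'%' markers.
    #include <fstream>
    #include <string>
    #include <vector>

    static void write_segment(std::ofstream& out, const std::string& time,
                              const std::string& symbol,
                              const std::vector<std::string>& features) {
        out << "<\n"
            << "Time: " << time << "\n"
            << "Symbol: " << symbol << "\n"
            << "Prosody: (nil)\n"
            << "Release Or Closure: unspecified\n"
            << "Features:\n";
        for (std::vector<std::string>::size_type i = 0; i < features.size(); ++i)
            out << features[i] << "\n";               // e.g. "+ vowel"
        out << ">\n";
    }

    int main() {
        std::ofstream out("sent1.label");
        out << "Utterance: a day\n";

        std::vector<std::string> ah, d, ey;
        // '#' marks the first segment; only the features we are sure of are kept
        ah.push_back("+ vowel"); ah.push_back("- high");
        ah.push_back("- low");   ah.push_back("+ back");
        write_segment(out, "#", "ah", ah);

        d.push_back("+ consonant"); d.push_back("- continuant"); d.push_back("- sonorant");
        write_segment(out, "(nil)", "d", d);

        // '%' marks the last segment of the utterance
        ey.push_back("+ vowel"); ey.push_back("- high");
        ey.push_back("- low");   ey.push_back("- back");
        write_segment(out, "%", "ey", ey);
        return 0;
    }

A file produced this way can then be given to match3 exactly as described in section H.2 below.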
H.2 Program

When all four types of files have been saved in the same directory, the user may start the program. If any of these files is missing, the program will abort.

    athena% ls
    lexicon.lex  match3  rule1.rul  rule2.rul  rules.rs  sent1.label  standard.label
    athena% match3

Figure H.7: A list of the files required in the same directory. match3 is the matcher. lexicon.lex contains the list of lexical words. standard.label contains all the standard phonemes. rules.rs contains a list of all the rules, rule1.rul and rule2.rul. Finally, sent1.label contains the segmental representation of an utterance.

To start the program, type "match3" at the prompt, as shown in figure H.7. The program will begin and automatically initialize itself using the standard template, the lexicon, and the rules. The matcher should look like figure H.8 at this point.

    athena% match3
    Welcome to Match G3, system booting.....

    ======= SYSTEM INITIALIZATION =======
    Booting up on a sun4 system (w20-575-77.mit.edu)... Done.
    Loading Standard Template: standard.label..... Done.
    Loading Lexicon: lexicon.lex..... Done.
    Loading Rule Set: rules.rs..... Done.
    ======= INITIALIZATION COMPLETE =======

    Welcome to Match G3, rhkim!
    * Please enter the filename of a sentence to run. (w/o .label)
    * Type 'help' at any time to list all available commands.

    Match G3>

Figure H.8: The matcher is now running.

At this point, all of the initialization is complete and the matcher is waiting for an input from the user. The input is the filename of a segmentized utterance. To input a file containing a series of segments that the user created, as described in section H.1.5, type the name without the ".label" extension, as shown in figure H.9.

    * Please enter the filename of a sentence to run. (w/o .label)
    * Type 'help' at any time to list all available commands.

    Match G3> sent1

Figure H.9: The user types in the utterance file name, sent1(.label).

The matcher will then automatically output the cohort with the set of possible utterances, as shown in figure H.10. A prompt appears again and waits for the user to continue the process. To quit the program, type "quit" at the prompt.

    Match G3> sent1
    Loading Sentence: ./sent1.label..... Done.
    Growing Matching Tree... Done. [0 sec]

    ======= ANSWER =======

    Completed
    1. a day.
    2. a pay.

    Incomplete
    1. up

    Match G3>

Figure H.10: An example of the matcher interface when an input is given. In this figure, the input is sent1(.label), which consists of the segments shown in figure H.6. As one can see, there are two complete utterances in the cohort. After the matching process is done, the matcher awaits the next input.

Bibliography

[1] J.Y. Choi. Labeling With Features. MIT, unpublished document, 1999.

[2] N. Chomsky and M. Halle. The Sound Pattern of English. MIT Press, Cambridge, MA, 1991.

[3] P.B. Denes and E.N. Pinson. The Speech Chain: The Physics and Biology of Spoken Language. Anchor Press, Garden City, NY, second edition, 1993.

[4] G. Fant. Acoustic Theory of Speech Production, 1960.

[5] D.P. Huttenlocher. Acoustic-Phonetic and Lexical Constraints in Word Recognition: Lexical Access Using Partial Information. Master's thesis, MIT, 1984.

[6] S.J. Keyser and K.N. Stevens. Feature Geometry and the Vocal Tract. Phonology, 1994.

[7] D.H. Klatt. The Problem of Variability in Speech Recognition and in Models of Perception. Paper presented at the Conference on Variability and Invariance in Speech, October 1983.

[8] P. Li. Feature Modifications and Lexical Access. Master's thesis, MIT, 1993.

[9] Aaron Maldonaldo. Incorporating a Feature Tree Geometry into a Matcher for a Speech Recognizer. Master's thesis, MIT, 1999.

[10] K.N. Stevens. Acoustic Correlates of Some Phonetic Categories. Journal of the Acoustical Society of America, 1980.

[11] K.N. Stevens. Acoustic Phonetics. MIT Press, Cambridge, MA, 1998.

[12] K.N. Stevens and A.S. House. Perturbation of Vowel Articulations by Consonantal Context: An Acoustical Study. Journal of Speech and Hearing Research, 6, 1963.
[13] Y. Zhang. Toward Implementation of a Feature-based Lexical Access System. Master's thesis, MIT, 1998.