DISCOVERING PHONEMIC BASE FORMS AUTOMATICALLY - AN INFORMATION THEORETIC APPROACH

by

John M. Lucassen

Submitted to the Department of Electrical Engineering and Computer Science in Partial Fulfillment of the Requirements for the Degrees of BACHELOR OF SCIENCE and MASTER OF SCIENCE at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY

February 16, 1983

(c) 1983 John M. Lucassen

The author hereby grants to MIT permission to reproduce and to distribute copies of this thesis document in whole or in part.

Signature of Author: Department of Electrical Engineering and Computer Science, January 15, 1983
Certified by: Professor Jonathan Allen, Academic Thesis Supervisor
Certified by: Dr. Lalit R. Bahl, Company Thesis Supervisor
Accepted by: Professor Arthur C. Smith, Chairman, Department Committee on Graduate Studies

ABSTRACT

Discovering Phonemic Base Forms Automatically - an Information Theoretic Approach

by John M. Lucassen

Submitted to the Department of Electrical Engineering and Computer Science on January 15, 1983 in Partial Fulfillment of the Requirements for the Degrees of Bachelor of Science and Master of Science at the Massachusetts Institute of Technology.

Information Theory is applied to the formulation of a set of probabilistic spelling-to-sound rules. An algorithm is implemented to apply these rules. The rule discovery process is automatic and self-organized, and uses a minimum of explicit domain-specific knowledge. The above system is interfaced to an existing continuous-speech recognizer, which has been modified to perform phone recognition. The resulting system can be used to determine the phonemic base form of a word from its spelling and a sample utterance, automatically (without expert intervention). One possible application is the generation of base forms for use in a speech recognition system. All the algorithms employed seek minimum-entropy or maximum-likelihood solutions. Sub-optimal algorithms are used where no practical optimal algorithms could be found.

Thesis Supervisor: Jonathan Allen
Title: Professor of Electrical Engineering

Index Terms: Spelling-to-Sound Rules, Information Theory, Self-organized Pattern Recognition, Efficient Decision Trees, Maximum Likelihood Estimation, Phone Recognition.

CONTENTS

1.0 Introduction
  1.1 Background
  1.2 Objective
  1.3 The Channel Model
  1.4 Feature Set Selection
  1.5 Data Collection
  1.6 Utilizing the Sample Utterance
2.0 The Channel Model
  2.1 Specification of the Channel Model
  2.2 Imposing Structure on the Model
  2.3 Related Work
  2.4 Summary
3.0 Feature Set Selection
  3.1 Introduction
  3.2 Objectives
  3.3 Related Work
  3.4 Binary Questions
  3.5 Towards an Optimal Feature Set
  3.6 Using Nearby Letters and Phones Only
  3.7 Towards a Set of 'Good' Questions
  3.8 Details of Feature Set Selection
4.0 Decision Tree Design
  4.1 Related Work
  4.2 Constructing a Binary Decision Tree
  4.3 Selecting Decision Functions
  4.4 Defining the Tree Fringe
  4.5 An Efficient Implementation of the BQF Algorithm
  4.6 Analysis of Decision Trees
  4.7 Determining Leaf Distributions
5.0 Using the Channel Model
  5.1 A Simplified Interface
    5.1.1 Paths
    5.1.2 Subpaths
    5.1.3 Summary of Path Extension
  5.2 A Simple Base Form Predictor
  5.3 The Base Form Decoder
6.0 Evaluation
  6.1 Performance
    6.1.1 Performance of the Channel Model by itself
    6.1.2 Performance of the Phone Recognizer by itself
    6.1.3 Performance of the complete system
  6.2 Objectives Achieved
  6.3 Limitations of the System
  6.4 Suggestions for Further Research
A.0 References
B.0 Dictionary Pre-processing
  B.1 Types of Pre-processing Needed
  B.2 The Trivial Spelling-to-Baseform Channel Model
  B.3 Base Form Match Score Determination
  B.4 Base Form Completion
  B.5 Base Form Alignment
C.0 The Clustering Algorithm
  C.1 Objective
  C.2 Implementation
D.0 List of Features Used
  D.1 Questions about the Current Letter
  D.2 Questions about the Letter at Offset -1
  D.3 Questions about the Letter at Offset +1
  D.4 Questions about Phones
E.0 Overview of the Decision Tree Design Program
F.0 Evaluation Results

LIST OF ILLUSTRATIONS

Figure 1. The Spelling to Baseform Channel
Figure 2. The Sequential Correspondence between Letters and Phones
Figure 3. Another View of the Channel
Figure 4. A More Complicated Alignment
Figure 5. Different Alignments
Figure 6. Information Contributed by Letters and Phones
Figure 7. Information Contributed by Letters, Cumulative
Figure 8. Information Contributed by Phones, Cumulative
Figure 9. The Best Single Question about the Current Letter
Figure 10. The Best Second Question about the Current Letter
Figure 11. Number of Questions Asked about Letters and Phones
Figure 12. Coefficient Sets and their Convergence
Figure 13. Typical Channel Model Errors
Figure 14. Typical Phone Recognizer Errors
Figure 15. Partially Specified Pronunciations

PREFACE

The work presented in this thesis was done while the author was with the Speech Processing Group at the IBM Thomas J. Watson Research Center. Much of the work done by the Speech Processing Group utilizes methods from the domain of Information Theory [Bahl75] [Jeli75] [Jeli76] [Jeli80]. The project presented here was approached in the same spirit.

The project was initiated in 1980, as a first attempt to make use of the pronunciation information contained in a large on-line dictionary (70,000 entries). The decision to use an automatic decision tree construction method to find spelling-to-sound rules was due to Bob Mercer. To some extent the project was a feasibility study: at the outset, we did not know how difficult the task would be.

A persistent effort was made not to introduce explicit domain-specific knowledge and heuristics into the system. Consequently, the system is highly self-organized, and adaptable to different applications.

Several aspects of the project called for the formulation and efficient implementation of some basic algorithms in Information Theory. The resulting programs can be, and have been, used for other purposes.

ACKNOWLEDGEMENTS

I would like to thank the members of the Speech Recognition Research Group, who were always willing to answer questions and to lend a helping hand. I will not mention everyone by name - you know who you are. In particular, I would like to thank Lalit Bahl and Bob Mercer, whose insights and ideas were immensely helpful at all stages of the project. Finally, I would like to thank my advisor, Professor Jonathan Allen, for his support; and John Tucker, the director of the MIT VI-A Co-op Program, for making my work at IBM possible.

1.0 INTRODUCTION

This chapter gives an overview of the setting in which the project was conceived and the line of reasoning that determined its shape. It also gives a brief description of each of the major portions of the work. It is intended to aid the reader in developing a frame of mind suitable for reading the remaining chapters.

1.1 BACKGROUND

The (phonemic) base form of a word is a string of phones or phonemes that describes the 'normal' pronunciation of the word at some level of detail. It is called the 'base form' to distinguish it from the 'surface form' or phonetic realization of the word. Phonemic base forms are useful for several reasons:

* From the base form of a word, one can derive most alternate pronunciations of the word by the application of phonological rules [Cohe75].

* Base forms reflect similarities between words and between parts of words at a more fundamental level than surface forms. Because the number of different phones is much smaller than the number of different words, these similarities between words at the phone level can be exploited in a number of ways. In particular, phone-based speech recognition systems may use them to model coarticulation phenomena, to facilitate the adaptation to a new speaker, and to reduce the amount of computation needed for recognition [Jeli75].

Phonemic base forms may be obtained in a variety of ways.
Many dictionaries contain pronunciation information from which base forms can be derived. Alternatively, anyone who 'knows' the correct pronunciation of a word and has some knowledge of phonetics can determine the base form. However, it is not generally possible to determine the base form of an unknown word by listening to how it is pronounced, because information may be lost in the transformation from base form to surface form.

Base forms (and surface forms) may be expressed at various levels of detail, to suit their application. For example, some applications need to distinguish three or more different levels of stress on vowels, while some need only one or two; for some applications it is necessary to distinguish between different unstressed vowels, while for other applications this distinction is not important. Because each application sets its own standard regarding the amount of detail specified in a base form, different applications may not be able to share base forms, and it may be necessary to generate a new set of base forms for each new application.

1.2 OBJECTIVE

The objective of this project was to find some way of generating base forms for new words automatically (i.e., without expert intervention).

* The main application we envision is enabling users of a phone-based speech recognition system to add new words (of their own choice) to the vocabulary of the system.

* A second application is the verification and/or correction of existing base forms.

* A third application is the custom-tailoring of base forms to individual speakers, to overcome the effects of speaker idiosyncrasies.

Based on these intended applications, the following objectives were formulated:

1. The system should use only information that can be provided by a typical user of a speech recognition system, such as

   * a few samples of the correct pronunciation of a word, possibly by different speakers
   * the spelling of the word
   * the part-of-speech of the word
   * sample sentences in which the word occurs.

2. The system should be adaptable to a particular convention regarding the amount of detail specified in the base forms.

In order to satisfy the first objective, we decided to use the following sources of information: the spelling of a word, and a single sample utterance.

The spelling of the word is used because (1) it is available in all the applications we presently anticipate, and (2) it contains a lot of information about the pronunciation of the word (although perhaps less so in English than in some other languages).

We decided to use a sample utterance as well, because the spelling of a word by itself does not contain enough information to determine its pronunciation. For example, the pronunciation of the word 'READ' depends on its tense as a verb; further examples include the words 'NUMBER' (noun or adjective), 'CONTRACT' (verb or noun) and 'UNIONIZED' (union-ized or un-ionized).

We will use only one sample of the pronunciation because this is the simplest and least expensive option.

Since we are using the pronunciation of a word as a source of information, one may wonder why it is necessary to use the spelling of the word as well. There are two reasons for this:

1. we do not presently have an automatic phone recognizer that is sufficiently reliable, and

2. as pointed out before, a sample utterance corresponds to a surface form, and it is not generally possible to infer the corresponding base form from this surface form.
There are a variety of potential sources of information which we do not use at the present time. These include the part-of-speech of the word, the capitalization of the word, and so on. Knowledge of the part-of-speech of a word can help to resolve ambiguities in the base form, especially in stress assignment. Case information (whether the first letter of a word is capitalized, for example, or whether the word is spelled entirely in upper case) can also provide clues about the pronunciation, since it appears that different spelling-to-sound rules apply to proper names, acronyms and abbreviations than to 'regular' words. While we do not presently use these types of information, we do not preclude using them in the future.

The second objective (adaptability) is satisfied by the use of self-organized methods wherever possible.

1.3 THE CHANNEL MODEL

We view the task of predicting the possible base forms of a word from its spelling as a communication problem. We will define a hypothetical transformation that turns letter strings into base forms, and construct a model of this transformation, by means of which we can determine the likelihood that a particular letter string produced a certain base form.[1]

This transformation is presumed to take place as a result of transmission of a letter string through a channel that garbles the letter string and produces a phone string as output (see Figure 1). Thus, the model of this transformation will be referred to as a channel model. Information and communication theory provide us with a variety of tools for constructing such models.

[1] It may be that in reality, no such transformation takes place. For example, it may be that letter strings are determined by base forms, or that both are determined by something else, perhaps a morpheme string.

There are a variety of ways to define our channel model. In the interest of simplicity, we have based our model on the assumption that each letter in the spelling of a word corresponds to some identifiable portion of the base form, and that these correspondences are in sequential order, as illustrated in Figure 2 for the words 'BASE' and 'FORM'.[2]

[2] I will use upper case and quotes to denote letters and letter strings ('A', 'B'), and lower case and strokes to denote phonemes (/ei/, /b ee/). Phone strings are presented in a notation similar to that defined in [Cohe75] (p. 309). Since all phone strings in this thesis are accompanied by the corresponding letter strings, I do not provide a translation table here.

    'T A B L E'  -->  (channel)  -->  /t ei b uh l/

    Figure 1. The Spelling to Baseform Channel: The input to the
    channel, a string of n letters l1 l2 l3 ... ln, is shown on the
    left. The output of the channel, a string of m phones
    x1 x2 x3 ... xm, is shown on the right.

Because of this sequential correspondence, the channel model can operate as follows: for each letter that enters the channel, a sequence of phones, representing the pronunciation of that letter, emerges at the other end. The pronunciation of each letter will generally consist of no more than one or two phones, and it is empty in the case of silent letters.

Since the pronunciation of a letter often depends on the context in which it occurs, it is imperative that the model be allowed to inspect this context before it predicts the pronunciation of a letter. This implies that the model must have memory.
The power of the model can be increased further by allowing it to remember the pronunciations of the letters that it has already predicted. Thus, the following information is taken into account by the channel model as it predicts the pronunciation of a given letter in a word: the letter itself, any letters to its left, any letters to its right, and the pronunciation of the letters to the left. Hereafter, I will refer to this information as the 'context' that produces the pronunciation; I will refer to the letter of which the pronunciation is being predicted as the 'current letter', and to its pronunciation as the pronunciation of the current letter or the 'pronunciation of the context'.

    B /b/    A /ei/    S /s/    E //
    F /f/    O /aw/    R /r/    M /m/

    Figure 2. The Sequential Correspondence between Letters and Phones

The channel model needs to predict the pronunciation[3] of individual letters of a letter string. Because the language is not finite, and because we have a limited amount of sample data, we will construct a model that is able to generalize: a model that can predict the pronunciations of contexts that have never been seen before. Such a model can operate by predicting the pronunciation of a context on the basis of familiar patterns within that context. The specification of such a pattern together with a probability distribution over the pronunciation alphabet is a spelling to pronunciation rule. In "Decision Tree Design" (Chapter 4), I will describe an algorithm that can find these spelling to pronunciation rules automatically, based on a statistical analysis of a sufficient amount of 'training data'. The rules will be expressed in the form of a decision tree.

[3] Read this as "determine the probability distribution over the set of possible pronunciations".

1.4 FEATURE SET SELECTION

The decision tree design program is subject to a number of limitations; most important, it requires that the set of features out of which rules may be constructed be specified in advance, and that this set be limited in size. The choice of this feature set is of profound importance, since only features included in this set can be used to formulate rules. The feature set should satisfy the following, conflicting, requirements:

1. it should be small enough to allow the tree design program to operate within our resource limits;

2. it should be sufficiently general, so that any spelling to pronunciation rule can be expressed effectively in terms of the features in the feature set.

We can compromise by using a relatively small number of features that are 'primitive' in the sense that any other feature can be expressed in terms of them. More specifically, we will represent each letter and phone of a context in terms of a small number of binary features. These features are answers to questions of the form 'is the next letter a vowel' and 'is the previous phone a dental'. Features are selected on the basis of the Mutual Information between the feature values and the pronunciation of the current letter. The feature selection process is described in detail in "Feature Set Selection" (Chapter 3); a sketch of how such questions can be represented appears below.[4]

[4] [Gall68] gives a comprehensive overview of Information Theory.
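Such binary questions are simple predicates on single elements of the context. The following is a minimal sketch of this representation; the letter and phone classes shown are illustrative stand-ins, not the feature set actually derived in Chapter 3:

    # A binary feature is a yes/no question about one element of the
    # context.  A question can be stored as a subset of the relevant
    # alphabet; the answer is simply set membership.
    VOWELS = set("AEIOUY")          # hypothetical letter class
    DENTALS = {"t", "d", "s", "z"}  # hypothetical phone class

    def next_letter_is_vowel(word: str, i: int) -> bool:
        # 'Is the letter at offset +1 a vowel?'  Beyond the word
        # boundary the strings are padded with the symbol '#'.
        nxt = word[i + 1] if i + 1 < len(word) else "#"
        return nxt in VOWELS

    def previous_phone_is_dental(phones_so_far: list) -> bool:
        # 'Is the phone at offset -1 a dental?'
        prev = phones_so_far[-1] if phones_so_far else "#"
        return prev in DENTALS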
1.5 DATA COLLECTION

The tasks of feature selection and decision tree design are accomplished by means of self-organized methods. Consequently, we need to have access to a sufficient amount of sample data for training.

Because the channel model is defined in terms of letters and their pronunciation, the training data must also be in this form. This means that the training data must consist of a collection of aligned spelling and base form pairs, in which the correspondence between letters and phones is already indicated.

Training data of this sort is not generally available: while a typical dictionary gives the pronunciation of most words, these pronunciations are not broken down into units that correspond to the individual letters of the word.[5] Consequently, we have had to align all the spelling and base form pairs used for training. Since the number of such pairs is rather large (70,000), this could not be done by hand; instead, we used a simple, self-organized model of letter to base form correspondences (designed and built by R. Mercer) to perform this alignment. The alignment process is described in "Base form Alignment" (Appendix B).

[5] This is not necessary since it appears that people can determine the correct alignment, when needed, without difficulty.

Two sources of spelling and base form information were used: an on-line dictionary, and the base form vocabulary of an existing speech recognition system. The data from the dictionary had to be processed before it could be used, because it contained incomplete pronunciations, in entries of the following form:

    tabular  /t 'ae b j uh l uh r/
    tabulate /-  -  -  -    l ei t/

and because it did not contain the pronunciations of regular inflected forms such as plurals, past tenses and participles, comparatives and superlatives, and so forth. Furthermore, there were some errors in the dictionary, which had to be eliminated. See "Types of Pre-processing Needed" (Appendix B).

1.6 UTILIZING THE SAMPLE UTTERANCE

The Channel Model has been interfaced to a phone recognizer, hereafter referred to as the 'Linguistic Decoder'. A maximum-likelihood decoder is used to identify the base form that is most likely to be correct given the spelling of the current word and the sample utterance.

The phone recognizer is simply a suitably modified version of the connected speech recognition system of [Jeli75], and is not central to this thesis. The incorporation of the Channel Model into the maximum-likelihood decoder is described in "Using the Channel Model" (Chapter 5).

2.0 THE CHANNEL MODEL

2.1 SPECIFICATION OF THE CHANNEL MODEL

This section describes the general properties of the Channel Model and its implementation. The Channel Model was defined with two objectives in mind:

1. the model should be simple, so that its structure can be determined automatically;

2. the model should be as accurate as possible.

The channel model operates on a word from left to right, predicting phones in left-to-right order. This decision was made after preliminary testing failed to indicate any advantage to either direction. The left-to-right direction was chosen to simplify the interface to the linguistic decoder, as well as because of its intuitive appeal.

The Channel that we seek to model is nondeterministic: a given letter string (such as 'READ') may be pronounced differently at different times. Thus, the Channel Model is a probabilistic model: given a letter string, it assigns a probability to each possible pronunciation of that string.
The mapping from letter strings to phone strings is context sensitive: the pronunciation of a letter depends on the context in which it appears. Therefore, the channel model has to be able to 'look back' to previous letters and to 'look ahead' to future letters. The fact that the channel model is allowed to 'look back' implies that the model has memory. The fact that it is allowed to 'look ahead' further implies that it operates with a delay.

The look-back and look-ahead is limited to the letters of the word under consideration. Consequently, the base form of each word is predicted without regard to the context in which the word appears.

Since the likelihood of a particular pronunciation for a letter depends on the pronunciation of preceding letters, we also allow the channel model to 'look back' at the pronunciation of previous letters. This type of look-back is necessary for the model to predict consistent strings. Let me illustrate this with an example: consider the word 'CONTRACT', and its two pronunciations[6]

        C O N T R A C T
    (1) /k 'aa n t r ae k t/   (noun)
    (2) /k uh n t r 'ae k t/   (verb)

[6] The symbol /'/ (single quote) denotes primary stress; the symbol /./ (period) will be used to indicate secondary stress.

Since the channel operates from left to right, the correct pronunciation of the 'A' depends on the pronunciation assigned to the 'O'. If the model is unable to recall the pronunciation decisions made earlier on, it will also predict the following pronunciations, each of which is internally inconsistent:

        C O N T R A C T
    (3) /k 'aa n t r 'ae k t/  (two stressed syllables)
    (4) /k uh n t r ae k t/    (no stress at all)

By allowing the model to 'look back' to pronunciation decisions made earlier on, we can avoid such inconsistencies. This capability also allows the channel model to use knowledge about the permissibility of particular phone sequences in its predictions - thus, the model can avoid producing phone sequences that cannot occur in normal English.

Finally, we will model the channel as being causal. This implies that the channel model can not use any information about the pronunciation of future letters in determining the pronunciation of the current letter. Without this restriction, we would be unable to compute the likelihood of a complete pronunciation string from the individual probabilities of the constituent pronunciations.

2.2 IMPOSING STRUCTURE ON THE MODEL

To simplify the channel model, we view the channel as if it predicts the base form for a letter string by predicting the pronunciation of each letter separately (see Figure 3). The pronunciations of the individual letters of the word are then strung together to form the base form of the word.

Motivation

This simplification can be made because, in general, each letter in the spelling of a word corresponds to some identifiable portion of the base form, and these correspondences are in sequential order. (Exceptions to this rule will be discussed shortly.)

    'E X T R A'  -->  (channel)  -->  /eh/ /k s/ /t/ /r/ /uh/

    Figure 3. Another View of the Channel: The input to the channel,
    a string of n letters l1 ... ln, is shown on the left. The output
    of the channel, a string of m phones, has been organized into n
    groups /x11 ... / through /xn1 ... /, one group of phones for
    each letter. Some of these groups may be empty; others may
    contain more than one phone.
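The per-letter grouping of Figure 3 suggests a natural representation for aligned training pairs. The following is a minimal sketch of such a representation (the data format is illustrative; the thesis does not prescribe one):

    # An aligned spelling/base-form pair: each letter carries the
    # (possibly empty) group of phones it produced.  The alignment
    # for 'EXTRA' shown in Figure 3:
    extra = [("E", ["eh"]),
             ("X", ["k", "s"]),   # one letter, two phones
             ("T", ["t"]),
             ("R", ["r"]),
             ("A", ["uh"])]

    # The base form is recovered by concatenating the groups in order:
    base_form = [ph for _, group in extra for ph in group]
    assert base_form == ["eh", "k", "s", "t", "r", "uh"]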
This simplification allows us to decompose our original task into two independent and presumably simpler sub-tasks:

1. to identify the likely pronunciations of each letter of the word (and compute their probabilities), and

2. to combine the results and identify the likely base forms of the word (and compute their probabilities).

The word is decomposed into letters because this is the simplest possible decomposition: any decomposition based upon higher-level units such as morphemes would require a method for segmenting the letter string, which would add to the complexity of the model.

Limitations

The above decomposition is based on an obvious over-simplification. Consider, for example, the word 'TABLE' and its base form /t ei b uh l/. While we have no difficulty matching up 'TAB' to /t ei b/, it is not clear how to match up 'LE' and /uh l/ while preserving the sequential nature of the alignment. Although the details of the alignment are relatively unimportant, it is important that the alignments be consistent: if the sample letter- and phone strings are aligned haphazardly, it will be difficult to determine the corresponding spelling to base form rules. By using machine-generated alignments, we can be assured of some degree of consistency. The alignment program, which was designed and implemented by R. Mercer, does produce the 'correct' alignment, as illustrated in Figure 4. The process is described in "Base form Alignment" (Appendix B).

    T /t/    A /ei/    B /b/    L /uh l/    E //

    Figure 4. A More Complicated Alignment: The segment /uh l/ is
    aligned with the 'L', and the 'E' is silent.

A second source of disagreement between the 'real' spelling to base form channel and our channel model is found in the treatment of digraphs such as 'EA', 'OU', 'TH' and 'NG' and repeated letters such as 'TT' and 'EE', which often correspond to a single phone. Since the model is capable only of predicting the pronunciation of individual letters, it predicts the pronunciation of such a digraph as if the phone corresponds to one of the letters and the other letter is silent. The main drawback of this approach is that the model must decide on the pronunciation of each letter separately, thereby increasing the computational workload and the opportunity to make errors. On the other hand, it eliminates the need to segment the letter string.

A further limitation of the simplified model is that it can not deal straightforwardly with words in which the correspondence between letters and phones is not sequential, such as 'COMFORTABLE' (with base form /k 'uh m f t er b uh l/) and 'NUCLEAR' (with the non-standard base form /n 'uu k j uh l er/). While the channel model is in principle capable of dealing with rules of any complexity, complex rules such as those involving transpositions may be hard to discover. Fortunately, the correspondence between letters and phones is sequential in most words.

Implementation of the Channel Model

Every prediction made by the Channel Model is based upon knowledge of the context of the current letter:

* the current letter itself,

* the letters that precede and follow it (up to the nearest word boundary), and

* the pronunciation of preceding letters.

The problem of predicting the pronunciation associated with a particular context is a pattern recognition problem. This is a well-studied problem domain [Case81] [Meis72] [Stof74]. We will use a decision tree as classifier. Aside from the advantages normally associated with decision trees [Meis72] [Payn77], we will take advantage of the fact that a decision tree can be constructed automatically. We have implemented an algorithm to perform this task. It is described in detail in "Decision Tree Design" (Chapter 4).
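To make the classifier concrete, here is a minimal sketch of what such a tree looks like at prediction time. The structure and names are illustrative only; the construction algorithm itself is the subject of Chapter 4:

    from dataclasses import dataclass
    from typing import Callable, Dict, Optional

    @dataclass
    class Node:
        # An internal node holds a binary question about the context;
        # a leaf holds a probability distribution over the pronunciation
        # alphabet (phone groups, possibly empty for silent letters).
        question: Optional[Callable[[dict], bool]] = None
        yes: Optional["Node"] = None
        no: Optional["Node"] = None
        distribution: Optional[Dict[str, float]] = None

    def predict(node: Node, context: dict) -> Dict[str, float]:
        # Walk from the root to a leaf, answering one binary question
        # per level; return that leaf's distribution.
        while node.distribution is None:
            node = node.yes if node.question(context) else node.no
        return node.distribution

    # A leaf reached by contexts whose current letter is 'X' might hold
    # {"k s": 0.90, "k": 0.05, "": 0.05}   (numbers illustrative)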
Combining Pronunciations into Base forms

Because the channel is causal, the probability of a particular pronunciation string is equal to the product of the probabilities that each of the letters is pronounced correctly:

    p(X | L) = PROD(i=1..n) p(x_i | L, x_1, x_2, ..., x_{i-1})

where X is a pronunciation string, L is a letter string, and x_i is the i'th element of X.

However, we are not interested in the probability of a particular pronunciation string - we are interested in the probability of a particular base form. A base form may correspond to many different pronunciation strings, each of which represents a different way of segmenting the base form into pronunciations (see Figure 5). Thus, the probability of a particular base form is the sum of the probabilities of all the pronunciation strings that are representations of this base form. Fortunately, the number of plausible alignments of a word is usually small, so that we can calculate their probabilities without difficulty. The number of alignments with non-negligible probabilities is determined to a large extent by the consistency of the alignments in the training data.

    (1)  B /b/     O /ou/    A //      T /t/
    (2)  B /b/     O //      A /ou/    T /t/
    (3)  B /b ou/  O //      A //      T /t/

    Figure 5. Different Alignments: A single base form may correspond
    to several different pronunciation strings, depending on the
    alignment between the letters and phones. Alignment (1) is more
    plausible than alignments (2) or (3).
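Because the model is causal, the sum over segmentations need not be computed by brute force: the per-letter probabilities combine in a simple dynamic program over (letter position, phone position) pairs. A minimal sketch under that assumption follows; p_pron stands in for the channel model's per-letter prediction and is hypothetical:

    from functools import lru_cache

    def base_form_probability(word, phones, p_pron):
        # p(base form | word): sum, over all ways of segmenting `phones`
        # into per-letter groups, of the product of per-letter
        # probabilities.  p_pron(word, i, group, history) gives the
        # probability that letter word[i] is pronounced as `group`
        # (a tuple of phones, possibly empty), given the phones already
        # produced.  MAX_GROUP bounds the phones one letter may produce.
        MAX_GROUP = 3

        @lru_cache(maxsize=None)
        def prob(i, j):
            # Probability that letters word[i:] produce phones[j:].
            if i == len(word):
                return 1.0 if j == len(phones) else 0.0
            total = 0.0
            for k in range(j, min(j + MAX_GROUP, len(phones)) + 1):
                group = tuple(phones[j:k])   # empty group = silent letter
                total += (p_pron(word, i, group, tuple(phones[:j]))
                          * prob(i + 1, k))
            return total

        return prob(0, 0)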
2.4 SUMMARY Now that we have formulated our Channel modelling task as a decision tree design problem, we are faced with several sub-problems, which are the sub- The Channel Model 19 ject of the following chapters: * We need to select a set of primitive features spelling to base form rules will be in terms of which the expressed (see "Feature Set Selection" on page 21 ff.); * We need to formulate the spelling to base form rules themselves - construct the decision tree and determine the probability distribution (over the output alphabet) associated with each leaf (see "Decision Tree Design" on page 43 ff.); Finally, * we need to construct an interface that allows communication between the Channel Model and the Linguistic Decoder (see "Using the Channel Model" on page 68 ff.). The Channel Model 20 3.0 FEATURE SET SELECTION 3.1 INTRODUCTION In the preceding chapter, we motivated our decision to construct a decision tree with spelling to base form rules. Unfortunately, there appears to be no practical method for constructing an optimal decision tree for this problem with our limited computational resources [Hyaf76]. therefore use a sub-optimal tree construction algorithm, We will which is described in detail in the next chapter. This tree construction algorithm builds a decision tree in which all rules are of the form 'if A and B ... tures of a context, specifies and C then X', where A.. .C are binary fea- chosen from a finite set, the feature set, and 'X' a probability distribution over the pronunciation alphabet. 7 The size of this feature set the set is too small, erful; if it is subject to conflicting constraints: the resulting family of rules will not be very pow- is too large, the tree construction program will consume too much of our limited computing resources, of dimensionality' if while being subject to the 'curse (see [Meis72]). The more difficult problem is to determine not how many, but which fea- tures to include in the feature set. Although the size of the feature set is determined manually and somewhat arbitrarily, the selection of indi- vidual features has been largely automated. 7 Since the presence or absence of a particular feature is ascertained by asking 'question' a corresponding question, the terms 'feature' and may often be used interchangeably. Feature Set Selection 21 3.2 OBJECTIVES The feature set that is made available to the decision tree design program should possess the following properties: 1. All the features that are necessary for the construction of a good decision tree should be present in the set. It is clear that the pronunciation of an unknown letter in an unknown context is determined mostly by the identity of the letter itself.' Thus, a good feature set will probably contain features by which the current letter may be identified. In many cases, surrounding letters also affect the pronunciation, and this must be reflected in the fea- ture set. 2. There should be no features in the feature set that are informative but not general. For example, suppose that each of the sample words in the dictionary occupies a single line in the dictionary. Clearly, knowing the line number at which a particular context occurs can be helpful in predicting the pronunciation for that context. However, this information does not generalize to words that do not occur in the dictionary, and therefore does not contribute much to the power of the model. 3. The feature set should not contain features that are not informative. 
Uninformative features should be avoided for two reasons: ' For a confirmation of this assertion, see Figure 6 on page 29. Feature Set Selection 22 their * presence increases the amount of computation required to design the decision tree, even if they are not used at all; where the sample data is sparse, they may appear to be informative * (because of the 'curse of dimensionality'); however, if they are ever incorporated in a decision tree, this tree will be inferior. 3.3 RELATED WORK Several existing systems that predict pronunciation from spelling operate on the basis of a collection of spelling to pronunciation rules [Alle76] [Elov76]. To the extent that these systems operate on the letter string sequentially (from left to right), they can be re-formulated in terms of a decision tree similar to the one we use. The rules that make up these sys- tems are generally of the form 'In the context aal, is pronounced /6/', where 5 is a letter (or a string of letters), * individual letters, sounds a and I are either or letter-sound combinations ('Z', /r/, 'silent E'), * classes of letters, sounds or combinations thereof ('vowel', 'consonant cluster'), or * strings of the above ('a silent E followed by a consonant'), and /6/ 'T or K', specifies a sequence of phones (possibly null). These rules are very similar in structure to the rules that we seek to discover. Consequently, Feature Set Selection it is likely that some of the features used by 23 these rule-based systems would be good candidates for our feature set; the converse may also be true. Since we have made no particular attempt to duplicate features used by other systems, ture sets is either coincidental any similarity between the fea- or (more likely) a confirmation that these features are indeed meaningful. 3.4 BINARY QUESTIONS Any question about a member of a finite set (a letter, a phone, a pronun- ciation) can be expressed as a partition of that set, such that each subset corresponds to a different answer to the question. For example, the answer to the question 'is the letter L a vowel' can be represented by a partition of the letter alphabet into vowels and non-vowels. 9 For reasons described below, we will consider only binary questions (about binary features) as candidates for the feature set. done without loss of generality: Note that this can be just as any integer can be expressed as a sequence of binary digits, any set membership question can be expressed in terms of a set of binary questions. To minimize fragmentation The use of binary features makes it easier to construct a reliable deci- 9 Since any question about a letter of phone is equivalent to a corresponding partition of the letter- or phone alphabet, the terms 'question' and 'partition' Feature Set Selection may often be used interchangeably. 24 sion tree from a limited amount of sample data, because it minimizes the amount of data fragmentation. illustrated as follows: Consider placing an n-ary test at some node in a decision tree, given only are n possible outcomes, each outcome is m/n. The concept of data fragmentation can be m sample contexts for that node. Since there the average number of samples corresponding to The larger this number m/n, the easier it is to make statistically valid observations in the sample sets corresponding to each of the possible outcomes. Since m is given, we can maximize m/n by choos- ing the smallest possible value for n, namely 2. 
To simplify evaluation Throughout feature selection and decision tree design, we need to compare the merits of individual questions. In general, the mutual information" between the answer to a question and the unknown (pronunciation) which is to be predicted is the most convenient standard of comparison. It is clear that a question with several possible outcomes can carry more information than a binary question. 10 However, this greater information content The mutual information between two variables is defined in terms of the entropy (H), or mathematical uncertainty, of each of the variables and the joint entropy of the two variables taken together: MI(x ; y) E H(x) + H(y) - H(x, y). It represents the amount of information that each variable provides about the other variable: H(y I x) = H(y) - MI(x ; y). The entropy of a random variable x with n possible values x * xn, with respective probabilities p 1 . .pn, is defined as n H(x) - E p. x log& 2 ( p. i=1 Feature Set Selection 25 is bought at a price: the use of such questions, cause unnecessary fragmentation. Thus, as pointed out above, can we cannot compare questions with different numbers of possible outcomes by simply measuring their information content: we need an evaluation standard that balances the gain in information against the cost of data fragmentation. problem by limiting ourselves to features We can avoid this of a single degree, such as binary features. To simplify enumeration The number of different possible questions about a discrete random variable is equal to the number of different possible partitions of the set of possible values of this variable. 11 This number is generally very large. In fact, even the set of all possible binary questions about letters or phones, while much smaller, is too large to allow us to enumerate all such possibilities. However, it is much easier to consider a significant frac- tion of the possible binary partitions than it is to consider a signif- icant fraction of all possible partitions. To deal with limited sample data Because we have only a limited amount of sample data, the information con- 1 If a typical context consists of seven letters (a few on each side of the 'current letter', and the current letter itself) and three phones to the left of the current letter, the number of distinct contexts is 26 7 x50 3 = 1.0x10 1 5 . The number of different ways of partitioning a set of this size into subsets is far too large to allow individual consideration of every possibility. Feature Set Selection 26 tent of a particular ability. feature can be estimated only with a certain reli- In fact, the more candidate features one examines, the greater the chance of finding one that 'looks' much better than it really is. By considering only binary questions, we limit the number of candidate features considered. Thus, if we find any features that 'look' good, they probably are. To simplify implementation Finally, the exclusive use of binary features greatly simplifies the programs that are used to generate, compare, and select individual features. Since the answer to a binary question can assume only two different values ('true' or 'false'), only one bit of memory is needed to represent each binary feature of a context. The features that describe a context can therefore be conveniently viewed as a bit vector, and the problem of predicting the pronunciation of a context can be seen as a bit-vector classification problem. 
3.5 TOWARDS AN OPTIMAL FEATURE SET The problem of selecting the best subset of a set of feature's ('measurement selection') has been widely studied, but no efficient and generally applicable method appears to have been found. The difficulty of the prob- lem is illustrated with examples by [Elas67] and by [Tous7l], the latter showing that the best subset of m out of n binary features (m=2 and n=3 in his example) need not include the best single feature and may in fact con- sist of the m worst single features. This, and similar results obtained elsewhere, suggests that in general, an optimal feature set can be found Feature Set Selection 27 only by enumerating and evaluating a very large number of candidate feature sets. We are therefore forced to use a sub-optimal method for fea- ture set selection. In order to make the task of feature selection more manageable, we will only consider features that refer to a single element of the context only (a letter or a phone). This allows us to subdivide the feature selection problem into a number of independent sub-problems, suitable features namely the selection of for each of the elements of a context. Since a single element of a context may assume only 26 to 50 different values, the feature selection problem for such elements entire contexts. is much easier than that for (Approximately 1015 different contexts are possible). The restriction that each question can refer only to a single letter or phone is actually quite severe: if no questions about multiple elements of the context are included in the feature set, any pattern that involves multiple elements (such as a digraph) can be identified only by identifying each of its constituent elements. rules will be more complicated. This means that the corresponding The restriction is justified only by the fact that we have no computationally attractive method of identifying such more general features. Note: The definition of 'context phones) elements' calls for elements (letters or at a particular distance ('offset') away from the current letter. When the current letter is near a word boundary, letter or phone at a given distance. there may be no actual To deal with this situation, the letter alphabet and the phone alphabet are each augmented by the distinguished symbol presumed to '#', be and the phone- extended with an and letter strings of all contexts are infinite sequence of '#' in both directions. Feature Set Selection 28 Phone Letter Mutual at Mutual at Offset Information Offset Information -4 0.110 -3 0.161 -3 0.130 -2 0.307 -2 0.217 -1 0.785 -1 0.636 0 2.993 +1 0.809 +2 0.351 (*) +3 0.197 +4 0.122 Figure 6. Information Contributed by Letters and Phones: This table shows the Mutual Information between a letter (or phone) at a given offset away from the current pronunciation of the current letter. letter and the For example, the second letter to the right of the current letter provides an average of 0.351 bits of information (*). 3.6 USING NEARBY LETTERS AND PHONES ONLY On the average, the entropy or mathematical uncertainty (H) ciation of a letter is 4.342 bits. of the pronun- This difficulty is roughly equivalent to a choice, for each letter, between 24.342 = 20.3 equally likely alternatives. 12 12 This value, 2H, is called the perplexity. age number of equally likely alternatives' The term 'equivalent averis intended as a synonym for 'perplexity'. 
Feature Set Selection 29 The pronunciation of the average letter is determined mostly by the identity of the letter itself and its immediate context, and to a much lesser extent by letters that are further away. Indeed, the mutual information between the pronunciation of a given letter and the identity of the letters surrounding shown in it decreases rapidly with increasing distance, Figure 6 on page 29. According to this table, as is revealing the identity of the current letter reduces the entropy of its pronunciation to 4.342-2.993=1.349 bits, equivalent to a choice between an average of 2.5 equally likely alternatives. 13 Also, it can be seen that letters to the right (corresponding to a positive offset) are more informative than letters to the left. This may be due in part to the fact that where a digraph corresponds to a single phone, that phone is aligned with the first letter in the digraph. A more useful measure of the information content of far-away letters is the amount of additional information that they provide if the identities of closer-by letters are already known. the conditional mutual information.) tion (I will refer to this quantity as Because the total amount of informa- contributed cannot exceed the entropy of the pronunciation itself (which is about 4.342 bits), the conditional mutual information falls off much more rapidly with increasing distance than the unconditional mutual information. This is illustrated in Figure 7 on page 31. far-away letters contain only very little The fact that information about the pronun- ciation of the current letter justifies our decision to limit the feature set to questions that refer to nearby letters only. 13 The figures cited here were obtained from our sample data, and are not (necessarily) In particular, regard representative of, for example, running English text. the sample data was drawn from a dictionary without to word frequencies, and filtered to eliminate entries con- taining blanks or punctuation. Furthermore, the figures cited depend on the phone alphabet used. Feature Set Selection 30 Letter Cumulative Additional Mutual Mutual Mutual at Offset Information Information 0 2.993 2.993 +1 0.809 3.440 0.446 -1 0.636 3.797 0.357 +2 0.351 4.022 0.225 -2 0.217 4.130 0.108 +3 0.197 4.180 0.050 -3 0.130 4.199 0.019 +4 0.122 4.214 0.015 -4 0.110 4.220 0.006 Figure 7. Information Contributed Information by Letters, Cumulative: This table shows how much information (about the pronunciation of the identity current letter) is obtained by determining the of the surrounding letters in the order shown. The current letter provides 2.993 bits of information; the letter to its right yields another 0.446 bits; the letter to its left yields another 0.357 bits and so on. The mutual information contributed by phones at increasing distances is subject to a similar decline: see Figure 8 on page 32. In this figure, the first table indicates how much information the phones to the left of the current letter provide if the identity of this letter is known (hence the figure of 2.993 at the top of the 'cumulative' column). The second table indicates how much information they provide if the four nearest letters on each side of the current letter are known as well (hence the figure of 4.220 at the top of the 'cumulative' column). Note that these figures are much lower, and fall off even more rapidly than those in the Feature Set Selection 31 first table. 
These figures provide some theoretical justification for the fact that we do not include questions about far-away phones in the feature set. In Figure 8, note the figure on the bottom line: the amount of information provided by letters -4. .+4 and phones -3.. -1 entropy of the unknown pronunciation totals 4.239 bits. is 4.342 bits, Since the we are 0.103 bits short and therefore cannot predict the pronunciation with perfect accuracy. This uncertainty is due to two factors: 1. even if all available information is taken into account, the pronunciation of a context may still be ambiguous (in the word 'READ', for example); furthermore, 2. all contextual information that is more than 4 letters or 3 phones away is ignored. It is not clear how much of the difference is due to each of these factors. Note that an entropy of 0.103 is roughly equivalent to a choice between an average of 20.103 = 1.07 equally likely alternatives. 3.7 TOWARDS A SET OF 'GOOD' QUESTIONS This section describes the method used for finding good binary questions about letters and phones. tives outlined before: questions, and a This method was designed to satisfy the objecmaximal maximum of information content of the individual independence so that a combination of questions yields as much information as possible. Feature Set Selection 32 Phone Cumulative Additional Mutual Mutual Mutual at Offset Information Information Information 2.993 -1 0.785 3.398 0.405 -2 0.307 3.630 0.233 -3 0.161 3.798 0.167 4.220 -1 0.785 4.234 0.014 -2 0.307 4.238 0.004 -3 0.161 4.239 0.002 Figure 8. Information Contributed by Phones, Cumulative: This table shows how much information (about the pronunciation of the current letter) is obtained by determining the identity of successive phones to the left of the current letter. In the top table, the identity of the current letter is known; in the bottom table, the identity of the letters at offsets -4..+4 are known. Note: Since the same method is used for both letters and phones, tion explains pronunciations. the method only in terms of letters this secand their The generalization to phones is given in "Questions about Phones" on page 40 f f . . The only source of information used in the feature selection process for the 'current letter' is an array that describes the joint probability distribution of letters and their pronunciations, p(L, x) Feature Set Selection 33 where L is a letter and x is its pronunciation. derived from the aligned letter- This information is and phone strings that constitute the training data for the model. For now, we will assume the existence of a method for determining the single best question about a letter, given a particular optimality criterion. We can then proceed as follows. First, we determine the single most informative binary question Q about a letter L: that which maximizes MI(Q(L) ; x), where x is the pronunciation of L (and MI stands for Mutual Information). We will call this question QL1 1). (question about the current Letter, no. By definition, this choice of QL1 minimizes the conditional entropy, or remaining uncertainty, of the pronunciation of a letter L given Q(L). The question QL1 obtained by this method is shown in Figure 9 on page 35. The mutual information between QL1(L) and the pronunciation of L is 0.865 bits, which is close to the theoretical maximum of 1 bit. 
After QL1 is found, we determine the most informative second question Q about a letter L: that which maximizes MI(Q(L) ; x | QL1(L)). (Note that this is not the same as the second-most informative question.) We will call this question QL2. The question QL2 obtained by this method is illustrated in Figure 10. The amount of additional information obtained by QL2, MI(QL2(L) ; x | QL1(L)), is 0.722 bits. Thus, QL1(L) and QL2(L) together provide 0.865+0.722 = 1.587 bits of information about the pronunciation of L, which is more than half of the total amount of information carried by L (2.993 bits, as indicated in Figure 6).

    Q1(L) is 'true'  for {B C D F K L M N P Q R S T V X Z},
    Q1(L) is 'false' for {# A E G H I J O U W Y '}.

    Figure 9. The Best Single Question about the Current Letter

    Q2(L) is 'true'  for {A C D E I K M O Q S T U X Y Z '},
    Q2(L) is 'false' for {# B F G H J L N P R V W}.

    Q1(L) and Q2(L) together divide the alphabet into four groups:
    1: {# G H J W}
    2: {A E I O U Y '}
    3: {B F L N P R V}
    4: {C D K M Q S T X Z}

    Figure 10. The Best Second Question about the Current Letter

We continue to generate new questions, while maximizing

    MI(QLi(L) ; x | QL1(L), QL2(L), ..., QLi-1(L)).

This process is continued until every letter L is completely specified by the answers to each of the questions about it:

    QL1(L), QL2(L), ..., QLn(L) -(determine)-> L

Since the answers QL1(L) through QLn(L) completely define L, it is not possible to extract any further information from it.

As the reader will recall, this description of the question-finding algorithm assumes the existence of a method for identifying the questions that maximize each of the various objective functions given above. Unfortunately, even the most efficient methods we know for finding guaranteed optimal questions would require a disproportionate amount of computation. Therefore, we must once again resort to a sub-optimal method. The algorithm used is described in detail in "The Clustering Algorithm" (Appendix C). By supplying different optimality criteria, different questions may be derived.

3.8 DETAILS OF FEATURE SET SELECTION

By invoking the clustering program with different parameters (in particular, different probability distribution arrays and different objective functions), we have obtained a variety of different questions. The questions are listed in "List of Features Used" (Appendix D).

Questions about the Current Letter

Questions QL1 through QL7 were obtained by clustering letters with the objective of maximizing

    MI(QLi(L) ; x | QL1(L), QL2(L), ..., QLi-1(L)),

where x is the pronunciation of the letter L. In other words, these questions are the best first question about the current letter, the best second question given the first, the best third question given the first two, and so on (the name QL denotes a Question about a Letter). In cases where several different partitions would provide the same amount of mutual information by this definition, ties were resolved by giving preference to partitions that had the highest unconditional mutual information MI(QLi(L) ; x). Thus, each question obtained is not only the best i'th question (within the limits of the algorithm), but also a good question when considered by itself.

Let me point out some of the salient features of the questions QL1..7 and the partitions of the letter alphabet that they define (as listed in "Questions about the current letter", Appendix D).
* The amount of (conditional) mutual information provided by the first few questions is near the theoretical limit of 1.0, and falls off rapidly for further questions. This confirms our conjecture that a few features contain most of the information in a context, and that these features can be identified automatically. It is inevitable that the conditional mutual information provided by later questions falls off, since the sum of the conditional mutual information is fixed at 2.993 bits (see Figure 6 on page 29).

* The unconditional mutual information of successive questions falls off as the questions become more specialized in making distinctions not made by preceding questions, but then increases to nearly its original value as the number of clusters to be broken up increases and the clustering program utilizes the resulting freedom to maximize the secondary objective function as well.

* In general, it can be seen that as the alphabet is partitioned into smaller and smaller subsets, each subset contains more and more similar letters. In particular, it can be seen that the distinction between 'K' and 'Q' is not drawn until after six questions. The reason for this is clear: both are pronounced /k/ with high likelihood. Indeed, the distinction between 'K' and 'Q' is worth only 0.0008 bits of information, as is indicated at QL7. Also, it turns out that 'E' and apostrophe (''') remain together for a long time. This is not surprising considering the use of the apostrophe in words such as "I'LL" and "YOU'LL".

* One may wonder why the first question does not distinguish between vowels and consonants, since this is the most 'obvious' binary classification of letters. Indeed, it turns out that the question 'is letter L a vowel' conveys 1.0 bits of information about the identity of L, since, in our sample data, a randomly selected letter has a probability of 0.50 of being a vowel. However, a simple calculation (sketched after this list) shows that the question would contribute only 0.790 bits of information about the pronunciation of L, which is somewhat less than the 0.865 bits contributed by QL1.
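The calculation referred to in the last point can be phrased in terms of the mutual_information sketch given earlier; the set of 'vowels' used here is an assumption for illustration.

    # joint: {(letter, pronunciation): p(L, x)}, as in the earlier sketch
    vowels = set("AEIOUY")
    mi_vowel = mutual_information(joint, lambda letter: letter in vowels)
    mi_ql1 = mutual_information(
        joint, lambda letter: letter in set("BCDFKLMNPQRSTVXZ"))
    # expected outcome on the training distribution:
    # mi_vowel ~ 0.790 bits, mi_ql1 ~ 0.865 bits -- the 'obvious'
    # vowel/consonant split says a lot about the identity of the letter,
    # but less about its pronunciation than QL1 does.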
Questions about the letter at offset -1

Questions QLL1 through QLL6 (Questions about the Letter to the Left) were derived specifically to capture the information contained in the letter to the left of the current letter, given the identity of the current letter. (Because of the order in which letters were added in Figure 7 on page 31, the amount of mutual information obtainable from this letter is not listed there. It is 0.346 bits.) The questions (which are listed in "Questions about the letter at offset -1" on page 102 ff.) were obtained with the objective of maximizing the mutual information between QLLi(L) and the pronunciation of the letter following L, given the identity of that letter. In other words, after determining the identity of the current letter, QLL1(L) is the best single question about the letter immediately to its left, QLL2(L) is the best second question, and so on. In order to derive these questions, a variant of the clustering algorithm had to be developed. This new clustering task is somewhat more complicated than the simple clustering of letters. More importantly, it also requires more space: it is necessary to store the probability distribution array p(L, LL, X(L)), where L is the current letter, LL is the letter to its left, and X(L) its pronunciation. This array contains over 140,000 entries.

Questions about the letter at offset +1

Questions QLR1 through QLR7 (Questions about the Letter to the Right) were derived specifically to capture the information carried by the letter to the right of the current letter, given the identity of the latter. As indicated in Figure 7 on page 31, this letter contains an average of 0.446 bits of information about the pronunciation of the current letter. The questions, which are given in "Questions about the letter at offset +1" on page 104 ff., were obtained by the same method used for QLL1..6, although the joint probability distribution used was of course p(L, LR, X(L)), where L is the current letter, LR is the letter to its right, and X(L) its pronunciation.

Questions about the remaining letters

As indicated above, the current letter and the letters immediately adjacent to it are represented in the feature set by features which were especially chosen to be optimal for this purpose. Unfortunately, our clustering program in its present form is unable to perform this task for letters at larger distances from the current letter, because the amount of storage required to represent the joint probability distribution arrays increases by a factor of 26 for every unit of increase in the distance from the current letter.^14 Therefore, we cannot derive questions specifically to capture the relevant features of far-away letters. Given the choice between several sub-optimal question selection methods, we decided simply to use QLL1..6, for the following reasons:

* Any set of questions derived on the basis of properties of the letters is probably better than one which is not (such as a Huffman code or a binary representation of the position of the letter in the alphabet)

* QL1..7 have some unpleasant properties, due to the forced alignment between letter- and phone strings: for example, the distinction between 'K' and 'Q' is found to be nearly irrelevant. There may be other, less obvious anomalies

* When faced with the choice between QLL1..6 and QLR1..7, I simply chose the set which requires the smaller number of bits to represent each letter, lacking any clues as to which would perform better.

^14 For a letter at a distance of (plus or minus) two from the current letter, the probability distribution array would occupy nearly 17 megabytes of storage.

Thus, the lexical context of the 'current letter' is expressed in terms of 56 bits: seven bits each for the current letter and the letter to its right, and six bits each for the remaining letters from offset -4 to +4.

Questions about Phones

The same clustering algorithm that is used to identify good questions about the current letter is also applied to the phone alphabet. This time, the objective is to identify good questions about the phone that immediately precedes the current letter. The objective function, the mutual information between the preceding phone and the pronunciation of the current letter, can be computed from the joint probability distribution p(P, x), where x is the pronunciation of the current letter and P is the preceding phone. The questions obtained (QP1..8, see "Questions about phones" on page 105 ff.) have some amount of intuitive appeal, in that vowels are separated from consonants rather quickly, while the phones that stay together for a long time appear to be rather similar.
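Putting the letter questions together with the phone questions QP1..8, a context is encoded as a fixed-length bit vector. The sketch below shows the bookkeeping only; the q* lookup tables (mapping each symbol to its tuple of question answers) are assumed to have been produced by the clustering program, and the function name is invented.

    def context_bits(letters, phones, ql, qlr, qll, qp):
        """letters: the letters at offsets -4..+4 around the current
        letter; phones: the three preceding phones (offsets -3..-1).
        ql holds 7 bits per letter (QL1..7), qlr 7 (QLR1..7),
        qll 6 (QLL1..6), and qp 8 per phone (QP1..8)."""
        bits = []
        for offset, letter in zip(range(-4, 5), letters):
            if offset == 0:
                bits.extend(ql[letter])      # current letter: QL1..7
            elif offset == 1:
                bits.extend(qlr[letter])     # letter to the right: QLR1..7
            else:
                bits.extend(qll[letter])     # all other letters: QLL1..6
        for phone in phones:
            bits.extend(qp[phone])           # 3 x 8 = 24 phone bits
        return bits                          # 7 + 7 + 7*6 + 24 = 80 bits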
For the benefit of skeptics (including the author) who would like to see a 'rationalization' of some of the more questionable aspects of these questions, a single example was investigated in more detail, namely the apparently paradoxical placement of 'ER0' and 'ER1' in opposite classes in QP1. First, of course, the present assignment of these two phones results in the highest Mutual Information: moving either phone to the opposite subset or swapping the two phones yields an inferior result. Further inspection showed that the phones occur in quite distinct contexts: 'ER0' is followed by <empty> with a probability of 0.46 and by 'ZX' with a probability of 0.16, whereas 'ER1' is followed by 'SX', <empty>, 'NX', 'MX', 'TX', 'KX', 'DX' and 'VX' with a probability of between 0.100 and 0.054 each.

Each of the three phones immediately preceding the current letter is represented in the feature set by eight bits, namely the answers to QP1 through QP8 for each phone. Since the feature set also contains 56 bits that represent letters, the feature set consists of a total of 80 features.

Other Features

We have experimented with a variety of other features. Some of these were simply more compact representations of information already otherwise available; others provided information not obtainable from the regular feature set. The following features have been used in experiments:

* the stress pattern on vowel phones in the phones to the left of the current letter, represented as a vector of stress levels

* the location and length of consonant clusters in the phone sequence already predicted

* whether or not a primary stressed syllable has already been produced.

Other features were considered, but never actually used, as we went back to the more basic feature set:

* the position of the current letter in the word

* the first and last letters of the word, or perhaps some characterization of the final suffix of the word (suffixes are given special treatment in some existing spelling to sound systems, such as [Elov76]).

In practice, we have found that the details of the feature set do not seem to matter much. This is true only up to a point, of course: we have not tried any feature sets that are predictably bad. (Some might argue that we have not tried any feature sets that are predictably good, either). The method described in this chapter yields acceptable results with a minimum of human intervention.

4.0 DECISION TREE DESIGN

This chapter describes the algorithm used to design the decision tree that contains the spelling to base form rules. The exposition assumes an understanding of the concept of a decision tree.

The problem at hand requires that a decision tree be built to classify bit-vectors of approximately 80 bits each. This must be done automatically, based on an analysis of up to 500,000 labeled sample bit-vectors. Each vector must be classified as belonging to one of approximately 200 classes. We will make the following assumptions about the data vectors:
* Each bit-vector may belong to more than one class, i.e. each context may have more than one pronunciation;

* Any output class can correspond to a large number of different underlying patterns, which may correspond to completely different bit-vectors;

* Some of the bits in the vector contain much more information than others, and only a few of these informative bits need to be inspected to classify a particular vector;

* The cost of testing a bit is constant (the same for all bits).

It is possible that we may have made some hidden assumptions that are not listed here.

The exposition in this chapter will cover the following subjects:

* related work (existing decision tree design algorithms)

* the theory behind the decision tree design algorithm

* a set of ad-hoc mechanisms to limit the growth of the decision trees, and their rationalizations

* an efficient implementation of the decision tree design algorithm

* a discussion of the complexity of this implementation

* methods by which to analyze a decision tree, to determine how 'good' it is (in terms of classification accuracy, space requirements and so on), and

* the method used to 'smooth' the sample data distributions associated with the leaves of the decision tree.

4.1 RELATED WORK

The automatic and interactive design of decision tree classifiers has been the subject of much study [Case81] [Hart82] [Hyaf76] [Payn77] [Stof74]. Most of the reported work deals either with the design of optimal trees for very simple problems, or with sub-optimal trees for problems that are more complicated or involve a lot of data. Among the papers that deal with the classification of discrete feature vectors, [Case81] and [Stof74] appear to be the most applicable to the present problem.

Casey [Case81] addresses the problem of designing a decision tree for character recognition. He wishes to classify arrays of 25x40 picture elements (pels), represented as 1000-bit vectors. His classifier design algorithm obtains much of its efficiency from the simplifying assumption that the individual bits in a bit-vector are conditionally independent, given the character that they represent. Casey reports that for the data to which his algorithm was applied (bit-map representations of characters in a single font), the assumption of conditional independence is justified. In part, this may be the case because every character corresponds to a single pattern.

Unfortunately, the bit-vectors that we need to classify do not appear to exhibit this property, because most pronunciations can correspond to a variety of different patterns: the pronunciation /z/, for example, can come from an 'S', a 'Z' or an 'X'. We are therefore unable to use Casey's algorithm directly, although we will employ some of the same heuristics.

Stoffel [Stof74] presents a general method for the design of classifiers for vectors of discrete, unordered variables (such as bit-vectors). He presents a worked-out example in which his method is used to characterize patterns in (8x12) bit-maps that represent characters. In order to be able to deal with large amounts of sample data, he makes the simplifying assumption that most of the bits in the vector are approximately equally informative. This allows him to define the distance between vectors in terms of the number of positions in which they have different values. He then places vectors that differ in only a few bits in equivalence classes, thereby reducing the size of the problem.
It appears that in the data to which he applied his method, bit-map representations of characters, it is indeed true that most bits are approximately equally informative. Unfortunately, the bit-vectors that we need to classify do not exhibit this property: we know that some bits carry as much as 0.9 bits of information, while others carry 0.02 bits or less. This results from the fact that 'nearby' letters and phones are much more informative than those that are further away. While we cannot adopt this particular simplifying assumption, it seems that some variation of his method might be used for data reduction and to improve the modelling capability of the system. No such improvements are presently planned.

4.2 CONSTRUCTING A BINARY DECISION TREE

A binary decision tree for predicting the label (pronunciation) of a bit vector (context) may be constructed as follows:

1. Consider an infinite, rooted binary tree (growing downward from the root).

2. Designate a suitable set of nodes of this tree to be the leaves of the tree. (I will refer to the set of leaf nodes as the fringe of the tree). Discard any nodes that lie below the fringe.

3. Associate a decision function with each internal node in the remaining tree.

4. Associate with each leaf node of the tree a probability distribution over the set of possible labels.

The decision function at each node in the tree is selected on the basis of an analysis of the sample data that would be routed to that node during the classification process. It follows that the decision function of a node can be determined only if the decision functions for the nodes along the path between it and the root are known - thus, they must be assigned to the nodes of the tree in a top-down order. It also follows that each sample contributes to the selection of several decision functions, namely one for each internal node along the path from the root to the leaf to which it is eventually classified. Finally, it follows that each sample contributes to the selection of at most one decision function at each level in the tree. (A level consists of nodes that are at the same distance from the root). This property will be exploited later.

4.3 SELECTING DECISION FUNCTIONS

When the decision function of a given node must be selected, the decision tree design algorithm operates as follows. The sample data routed to this node (by the portion of the decision tree that lies above it) is presumed to be available. Recall that this sample data consists of labeled bit-vectors, or pairs {L, B[1..n]}, where L is a label (a pronunciation) and B[1..n] represents a vector of n bits. From the sample data, we may compute the mutual information between L and each of the bits in the bit-vector, MI(L ; B[i]). The decision function, then, is simply a test against the bit i for which this mutual information is greatest. This heuristic (which for this particular problem was suggested by R. Mercer) is also employed by [Case81], and undoubtedly by others. [Hart82] shows that this choice of decision function minimizes an upper bound on several commonly used cost functions for the resulting tree.
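In code, the selection rule amounts to an argmax over conditional entropies. The sketch below is an illustrative in-memory version; the thesis implementation instead works from frequency arrays accumulated during a scan of the external file.

    import math
    from collections import Counter

    def best_question(samples, n_bits):
        """Return the bit index i that maximizes MI(L; B[i]) over the
        labeled samples routed to a node. `samples` is a list of
        (label, bits) pairs, with `bits` a tuple of 0/1 values."""
        n = len(samples)
        label_counts = Counter(label for label, _ in samples)
        h_label = -sum(c / n * math.log2(c / n)
                       for c in label_counts.values())

        def cond_entropy(i):
            # H(L | B[i]) from the empirical joint counts at this node
            joint = Counter((bits[i], label) for label, bits in samples)
            margin = Counter(bits[i] for _, bits in samples)
            return -sum(c / n * math.log2(c / margin[b])
                        for (b, label), c in joint.items())

        # MI(L; B[i]) = H(L) - H(L | B[i]); H(L) does not depend on i,
        # so maximizing MI is the same as minimizing H(L | B[i]).
        return max(range(n_bits), key=lambda i: h_label - cond_entropy(i))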
The following example illustrates the decision function selection process: consider the situation at the root of the tree. For all questions i, the mutual information MI(B[i] ; L) is calculated. By definition, these values must lie between zero and one. In a typical situation there are about 80 questions; of these, 25 contribute less than 0.05 bits of information each, 41 contribute between 0.05 and 0.10 bits, 7 contribute between 0.10 and 0.50 bits, and 7 contribute more than 0.50 bits. Thus, the most informative question is indeed much more informative than a question selected at random. In our example, the question 'Is the current letter one of {#, A, E, G, H, I, J, O, U, W, Y, '}' contributes the most information, namely 0.865 bits. Thus, it is selected as the decision function at the root of the decision tree.

The decision function is now used to split the sample data into two subsets, namely (1) the samples in which the current letter is one of {#, A, E, G, H, I, J, O, U, W, Y, '}, and (2) the samples in which it is not. One of these subsets of the sample data set corresponds to the left branch of the decision tree, the other to the right branch. Since we started with a label entropy of 4.342 bits, and were able to obtain 0.865 bits of information from the first question, the weighted average label entropy at the leaves of the tree is now 4.342-0.865 = 3.477 bits. In fact, the entropy in the branch corresponding to the 'vowels' is about 4.2 bits, and the entropy in the branch corresponding to the 'consonants' is about 2.8 bits.

For each subtree separately, the above procedure is repeated: the data are analyzed and the Mutual Information between B[i] and L is calculated for all values of i. In each subtree, the single best question is selected as the decision function and used to further partition the data. As it turns out, the next question about the 'vowels' provides as much as 0.95 bits of information, reducing the entropy in that branch from 4.2 bits to 3.2 bits. The next question about the 'consonants' is somewhat less effective, providing only 0.53 bits of information. This process continues along all the branches of the tree, extending the tree until it contains as many as 15,000 leaves. It would be impractical to investigate every branch of the tree manually; statistical tools are used for further analysis (see "Analysis of Decision Trees" on page 59).

Because the decision tree design algorithm operates without 'look-ahead' and simply selects the best available single question, I will refer to it as the Best Question First, or BQF algorithm. The BQF heuristic has a number of desirable properties:

1. Only the single most informative question is used, hence the selection is in general unaffected by the presence of uninformative features

2. The binary trees built by this method tend to be fairly balanced, since a balanced tree offers the greatest potential for entropy reduction

3. Only sample data local to the node is used in the selection of the decision function, hence the tree can be constructed efficiently

4. The algorithm is incremental in nature; thus, the desired tree size need not be specified in advance but can be allowed to depend on observations made during tree construction. It is also possible to refine an existing tree.

Only tests of single bits in the bit-vector are considered as possible decision functions. There are several reasons for this:

1. The decision function is selected from a small class of candidates, so this selection can be made easily

2. The decision function is easy to represent (only the index i needs to be specified)

3. The test can be performed quickly (only a single bit needs to be inspected)

4. There is only one degree of freedom in the selection mechanism; thus, a decision function can be selected even if only a small number of samples is available.
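The whole BQF construction can then be sketched as a recursion around the best_question helper above. This simplified version grows the tree depth-first and uses only two crude termination tests; the actual program proceeds breadth-first for reasons of I/O efficiency, as described in section 4.5, and uses the fuller battery of termination tests of section 4.4.

    from collections import Counter

    class Node:
        """Internal nodes test one bit; leaves hold the empirical label
        distribution of the samples routed to them."""
        def __init__(self, samples, n_bits, depth=0,
                     max_depth=30, min_samples=50):
            self.bit = None
            self.left = self.right = None
            labels = [label for label, _ in samples]
            if (depth >= max_depth or len(samples) < min_samples
                    or len(set(labels)) <= 1):
                self.distribution = Counter(labels)    # terminate: a leaf
                return
            self.bit = best_question(samples, n_bits)  # best single bit
            zeros = [s for s in samples if s[1][self.bit] == 0]
            ones = [s for s in samples if s[1][self.bit] == 1]
            self.left = Node(zeros, n_bits, depth + 1,
                             max_depth, min_samples)
            self.right = Node(ones, n_bits, depth + 1,
                              max_depth, min_samples)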
Unfortunately, the algorithm is sub-optimal, as it must be, since the problem of constructing an optimal decision tree is in general NP-complete [Hyaf76]. Furthermore, there appear to be no tight bounds on the performance of this heuristic under realistic conditions, although [Case81] gives an upper bound on the performance under certain conditions. Considering the enormous cost of constructing an optimal decision tree, this algorithm appears to provide a good compromise.

4.4 DEFINING THE TREE FRINGE

In order to obtain a useful decision tree, it is necessary to limit the growth of the tree. We will do this by means of a number of termination tests, which together specify the tree fringe: if a termination test against a node fails, the node is not extended. In general, we will use several termination tests simultaneously; a node is extended only if it passes all the termination tests. The following is a list of the termination tests that we have implemented to date (a sketch combining several of them follows the list). Tests that are always used are listed first; optional and experimental tests follow.

1. If all the samples at a particular node have the same label L, the entropy of L is zero and no question can reduce it any further. Thus, there is no need to extend the node.

2. If all the samples at a particular node have identical bit vectors B[1..n], the samples are indistinguishable. Therefore, no question will break up the sample data and reduce the entropy of L. Thus, there is no way of extending the node. This situation represents an ambiguity in the bit vectors, which may occur for one of two reasons:

   a. the word in which the context occurs has multiple pronunciations (such as READ: /r ee d/, /r eh d/);

   b. the same context occurs in several different words, in which it is pronounced differently. For example: NATION, NATIONAL (/n ei sh uh n/, /n ae sh uh n uh l/).

3. If the product of entropy and node probability falls below a certain threshold, the node is not extended. To the extent that the entropy of L at a node is a good approximation of the entropy reduction that can be obtained by extending the node, this test defines a tree fringe corresponding to a local optimum in the trade-off between tree size and expected average classification accuracy.

4. If the number of data samples that corresponds to a node is below a certain threshold, the node is not extended. The purpose of this termination test is twofold: to avoid selecting a decision function on the basis of insufficient sample data, and to place an upper bound on the size of the tree.
Then, '1' with a probability of the mutual information between each bit (including the random bits) and the labels L is computed. does not with a vector con- If the best 'real' bit provide more information about L than each of the random bits, the node is not extended. The purpose of this Chi-squared test. termination test is the same as that of the While we could simply set a threshold on the mini- mal mutual information required for extending a node, this threshold should in practice depend on sample size at the node and the distribution of labels L. ter to be trade-off The present scheme requires only a single parame- specified: between the the number of random bits, m. There is a selectivity of the test and execution time, since the mutual information of each of the m random bits must be computed. 7. In practice, we have used values of m=20 to m=n. At each node, the product of entropy and node probability is computed. This quantity is presumed to decline along the path from the root to a leaf. path If the decline in this value over the last few nodes along a does not exceed a certain ,threshold value, the path is not extended. This termination test imposes a 'progress requirement'. The product of entropy and probability is used because the entropy itself does not decline along all paths: the BQF algorithm sometimes selects tests that divide the data for a node into a large class with low entropy and a small class with an even higher entropy than the original data. Decision Tree Design 53 4.5 AN EFFICIENT IMPLEMENTATION OF THE BQF ALGORITHM Processing Order The BQF heuristic can be applied to a node only if the sample data for the node is known. Therefore, top-down fashion. this task: the nodes must be processed in some sort of Three different processing orders appear suitable for depth-first breadth-first order. order, some type of best-first order and We have considered and implemented both best-first and breadth-first order. The original implementation (by R. order. Mercer) extended nodes in best-first The objective of the evaluation function was to identify the node and the test at that node that would constitute the best possible potential refinement of the tree. computation, Since it is not practical to perform this the node with the highest product of entropy (of L) and prob- ability (of the sample data at the node) was extended. upper bound This product is an on the imprcbvement in classification accuracy that can be obtained by extending the node. Unfortunately, storage since it this program made inefficient use of access to external required periodic scans of the 800,000 labeled sample vectors to find the sample data corresponding to the nodes that were to be extended. For efficiency, these accesses were performed in batches: on each scan, the data for best n nodes was read, where n was limited to 40 by the amount of available memory. Nevertheless, the number of scans was still proportional to the number of nodes in the tree. This made it very expensive to construct large trees. In order to utilize each scan of the external file better, the program was re-written to process nodes in breadth-first order. This processing order implies that all the nodes at a particular level are processed sequential- Decision Tree Design 54 ly. Every sample vector corresponds to at most one node at any given lev- el (or to none, if it corresponds to a leaf at a higher level). Thus, the nodes at each level define a partition of the sample data. 
4.5 AN EFFICIENT IMPLEMENTATION OF THE BQF ALGORITHM

Processing Order

The BQF heuristic can be applied to a node only if the sample data for the node is known. Therefore, the nodes must be processed in some sort of top-down fashion. Three different processing orders appear suitable for this task: depth-first order, some type of best-first order, and breadth-first order. We have considered and implemented both best-first and breadth-first order.

The original implementation (by R. Mercer) extended nodes in best-first order. The objective of the evaluation function was to identify the node, and the test at that node, that would constitute the best possible potential refinement of the tree. Since it is not practical to perform this computation, the node with the highest product of entropy (of L) and probability (of the sample data at the node) was extended. This product is an upper bound on the improvement in classification accuracy that can be obtained by extending the node.

Unfortunately, this program made inefficient use of access to external storage, since it required periodic scans of the 800,000 labeled sample vectors to find the sample data corresponding to the nodes that were to be extended. For efficiency, these accesses were performed in batches: on each scan, the data for the best n nodes was read, where n was limited to 40 by the amount of available memory. Nevertheless, the number of scans was still proportional to the number of nodes in the tree. This made it very expensive to construct large trees.

In order to utilize each scan of the external file better, the program was re-written to process nodes in breadth-first order. This processing order implies that all the nodes at a particular level are processed sequentially. Every sample vector corresponds to at most one node at any given level (or to none, if it corresponds to a leaf at a higher level). Thus, the nodes at each level define a partition of the sample data. Therefore, the sample data in the external file can be ordered so that it can be read in a single sequential scan as the nodes at a given level are processed. (The data in a file so ordered is said to be aligned with the nodes at that level in the tree). Since the partition at each level is a refinement of the partition at the previous (higher) level, it is possible to re-order any aligned file to be aligned with the next level in the tree in O(n) time and space, as will be shown later. When proceeding in this fashion, the number of scans of the sample data file is proportional to the depth of the eventual tree, rather than to the number of nodes in the tree. Thus, the number of times that the sample data file has to be read has been reduced from (the number of nodes in the tree) / 40, or between 100 and 7,500, to (the number of levels in the tree) x 3, or between 60 and 90. This improvement is significant, especially for large trees.

Details of Operation

The new BQF program builds the tree one level at a time, starting at the root node. The nodes at each level are extended in left-to-right order. I will first describe how the nodes at a given level can be processed sequentially, given that the input file is properly ordered; I will then show how the sample data can be efficiently re-ordered, so that it will be aligned with the nodes at the next level.

The sample data for each node consists of a set of labeled bit-vectors, {L, B[1..n]}. The decision function for a node is defined by the integer i that maximizes MI(L ; B[i]) = H(L) - H(L | B[i]). The quantity H(L) does not depend on i, and may be ignored. The conditional entropy of L given B[i] may be computed as follows:

    H(L | B[i]) = - Σb p(B[i]=b) ΣL p(L | B[i]=b) x log2 p(L | B[i]=b)

where ΣL denotes a summation over all label values L, and Σb a summation over the two possible values of B[i]. The probability distribution p(L | B[i]) can be computed from the observed frequencies of {L, B[i]} and B[i] in the sample data corresponding to the node. Indeed, the sample data for the node is all that is required to perform this computation. Thus, assuming the data in the input file is aligned with the nodes at the current level in the tree, the nodes at that level can be processed one by one, and this takes no more memory than is used to process a single node, since the storage space can be reused.

While the nodes at a given level are being processed, it is possible to copy the sample data to a new file, such that the data in the new file will be aligned with the nodes at the next level in the tree. This is done as follows: Before processing of a level begins, an empty file is allocated. Then, the nodes at that level are processed one by one. After a node has been processed (and its decision function i has been determined), the portion of the input file that corresponds to the node is scanned twice. On the first scan, the samples for the left subtree of the node (the samples for which B[i] = '0') are copied into the output file; on the second scan, the samples for the right subtree (for which B[i] = '1') are copied into the output file. Since B[i] must be either '0' or '1', each sample is copied to the output file exactly once. It is clear that the data for the node is now properly ordered for the next level: all the data for the left descendant of the node comes before the data for the right descendant. The two scans of the input file needed to copy the data can be accomplished by maintaining two additional pointers into the input file. If an efficient 'seek' operation is available, it may be used instead to reposition a single file pointer as needed. When all the nodes at the level have been processed, the output file becomes the input file for the next pass.
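An in-memory sketch of one such pass is given below; a Python list stands in for the external file, and node numbering follows the usual heap convention (the children of node k are 2k and 2k+1). This illustrates the alignment idea only, not the actual I/O code.

    def reorder_for_next_level(records, decisions):
        """One breadth-first pass. `records` is a list of
        (node_id, label, bits) tuples, already aligned with the current
        level; `decisions` maps each node_id to its chosen bit index,
        or to None if the node was made a leaf. Returns the records
        aligned with the nodes at the next level."""
        output = []
        for node_id, bit in decisions.items():
            block = [r for r in records if r[0] == node_id]
            if bit is None:
                output.extend(block)   # leaf: keep its data for later passes
                continue
            # two scans of the node's block: first the left subtree
            # (B[bit] = 0), then the right subtree (B[bit] = 1)
            output.extend((2 * node_id, lab, b)
                          for _, lab, b in block if b[bit] == 0)
            output.extend((2 * node_id + 1, lab, b)
                          for _, lab, b in block if b[bit] == 1)
        return output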
If a decision is made not to extend a particular node, two options are available:

1. The data corresponding to the node is copied to the output file. Since the node has no descendants, the data does not correspond to any nodes at subsequent levels, and a provision must be made to skip over the data when it is encountered (and to copy it to the output file once again). This approach ensures that the output file always contains all the samples originally contained in the input file. Thus, when the tree is complete, the data will be aligned with the fringe of the tree, and can be used for various types of postprocessing.

2. The data corresponding to the node is not copied to the output file at all. Since the node has no descendants, this also preserves the alignment between the data in the output file and the nodes at the next level in the tree. This approach has the advantage that the data file always contains only the data samples that are still needed. When the tree is nearly complete and only a small number of samples are still used to refine the remaining 'active' nodes, the savings obtained by this method can be substantial (about a factor of two, I suspect).

An outline of the program is shown in "Overview of the Decision Tree Design Program" on page 108.

Program Complexity

The asymptotic running time of the program consists of two components:

1. The input file is scanned for each level in the tree. During each scan, the entire file must be read (presuming that the data for terminated nodes is retained). For each sample read in, it is necessary to determine whether it belongs to the current node. This is done simply by classifying it and determining whether it ends up at the current node, which takes an amount of time proportional to the depth of the tree at that point. Furthermore, the frequency count for {B[i], L} needs to be incremented for all values of i for which B[i] is '1'. This takes an amount of time proportional to the length of the bit vector.

2. For each node, the mutual information between L and each of the bits B[i] must be computed. This computation requires that the array containing the joint frequency distribution of L and B[i] be read, for all i.

In all, then, the running time is of order

    T = O( dN(d+n) + mnL ), where

    d is the depth of the tree
    N is the number of sample pairs {L, B[1..n]}
    n is the number of bits in each bit vector
    m is the number of (internal) nodes in the tree
    L is the size of the output alphabet.

In practice, the quadratic dependence on d (the d^2 N term) is insignificant in comparison with the remainder of the computation, especially as long as d < n. In practice, then,

    T = O( dNn + epsilon x d^2 N + mnL ) = O( dNn + mnL ).

The amount of space (fast random access storage) required by the program also consists of two components:

1. The joint frequency f(L, B[i]) is stored for the current node, for all values of L and i;

2. A copy of the growing tree is stored.

The asymptotic memory requirement is thus

    S = O( Ln + m ).

If it is necessary to preserve the label distributions corresponding to the nodes, they can be written to an external file as they become available.
Finally, the program requires an amount of temporary external storage sufficient to store an extra copy of the sample data file; this can be sequential access storage, and therefore is not included in the space calculation above.

The program has successfully been applied to a set of 800,000 samples, each expressed in terms of 80 features, with an output alphabet of 250 pronunciations, to construct a tree of depth 30 with 15,000 rules. The construction of a tree of depth 23 with 1508 rules, from 31,550 samples, each expressed in terms of 80 features, with an output alphabet of 157 pronunciations, required less than six minutes of CPU time on an IBM Model 3033 processor. It is likely that some type of data reduction, applied before the data is presented to the tree design program, could reduce the computation time considerably.

4.6 ANALYSIS OF DECISION TREES

The ultimate objective of our decision tree is simply to minimize the error rate of the Channel Model. The extent to which this objective is achieved can be measured only by performing prediction experiments. However, each decision tree has a variety of characteristics that may be measured, and that give a rough indication of the quality of the tree. Thus, we can evaluate different methods of constructing decision trees without actually performing costly experiments.

Analysis of decision trees is of practical importance for a second reason: we can obtain information about the features in the feature set. In fact, this is our first opportunity to investigate the interaction between all the individual features in the feature set.

Finally, it is possible to determine the effectiveness of the various termination tests by comparing trees constructed with different thresholds on various parameters (such as the minimum number of samples per node, or the significance level for which to test distribution separations).

The following diagnostic information was obtained about the decision trees generated by the BQF program:

1. the number of nodes in the tree (this is an indication of the amount of storage required to store the tree)

2. the average entropy at the leaves, weighted by the leaf probability (this is an indication of the classification accuracy of the tree)

3. the average depth of the leaves, weighted by the leaf probability (this quantity is proportional to average classification time)

4. the depth of the tree (this is an upper bound on classification time, and also an indication of the cost of constructing the tree)

The following information tells us about the individual questions as well as about the tree itself:

5. the average number of times that each question will be used when classifying a vector (this number ranges from 0 to 1, and is an indication of what questions are found by the BQF program to be informative)

6. the average depth at which each question is asked, if at all (this tends to be a small number for good questions, because good questions are asked first)

The following data tells us about the effectiveness of the various termination tests used:

7. the number and aggregate probability of the paths that were terminated by each of the termination tests (if the number of paths terminated by a given termination test is small, that test can be eliminated)

8. the number of leaves at each depth, and their aggregate probabilities (this is an indication of how well balanced the tree is. If the number and probability of the leaves at the lowest levels is small, construction of the tree may have been unnecessarily costly).
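Several of these diagnostics follow directly from a traversal of the finished tree. The sketch below computes items 2 through 4 for the illustrative Node structure introduced earlier, approximating a leaf's probability by its share of the training samples.

    import math

    def diagnostics(node, n_total, depth=0, acc=None):
        """Weighted leaf entropy, weighted leaf depth, leaf count and
        maximum depth for a tree of Node objects."""
        if acc is None:
            acc = {"entropy": 0.0, "depth": 0.0,
                   "leaves": 0, "max_depth": 0}
        if node.bit is None:                         # a leaf
            total = sum(node.distribution.values())
            p = total / n_total                      # leaf probability
            h = -sum(c / total * math.log2(c / total)
                     for c in node.distribution.values()) if total else 0.0
            acc["entropy"] += p * h
            acc["depth"] += p * depth
            acc["leaves"] += 1
            acc["max_depth"] = max(acc["max_depth"], depth)
        else:
            diagnostics(node.left, n_total, depth + 1, acc)
            diagnostics(node.right, n_total, depth + 1, acc)
        return acc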
As an example diagnostic, Figure 11 lists the average number of questions asked about each of the letters and phones in a context (for a typical tree). As expected, the average rule contains many more tests of the current letter and other nearby letters and phones than of far-away letters and phones. In this example, the average total number of tests is 10.29.

While the BQF heuristic itself has no parameters that can be 'tuned', we have performed experiments to measure the effect of various termination tests and of the corresponding threshold values. Therefore, a second method for evaluation was used: first, a tree was built using the smallest practical set of termination criteria, with the most tolerant thresholds. Significant parameters, such as the entropy and label distribution at each node, were saved with the tree. Then, alternative termination tests were used to define subsets of this existing tree, the attributes of which (such as number of nodes, and average leaf entropy) could be evaluated without actually constructing any new trees. Also, for each node in the tree, the label distributions at the leaves that descended from this node were compared, and the significance level at which these distributions appeared to be different was reported. This procedure was especially effective in the investigation of the effect of threshold values on tree size and classification accuracy. In particular, binary search on a given parameter has been used to define a subtree of a particular size for an application that was subject to memory constraints.

    Letter      Number of        Phone       Number of
    at Offset   Questions        at Offset   Questions

    -4          0.097            -3          0.107
    -3          0.064            -2          0.536
    -2          0.180            -1          0.982
    -1          0.840
     0          4.199
    +1          1.723
    +2          0.955 (*)
    +3          0.337
    +4          0.265

Figure 11. Number of Questions asked about Letters and Phones: This table shows the number of questions asked about the letter (or phone) at a given offset away from the current letter in a typical tree. For example, an average of 0.955 questions are asked about the second letter to the right of the current letter (*).

4.7 DETERMINING LEAF DISTRIBUTIONS

When the structure of the tree is completely determined, a probability distribution (over the pronunciation alphabet) has to be assigned to each leaf of the tree. Since these probabilities are the output of the Channel Model, it is important that they be estimated accurately. Unfortunately, a simple maximum-likelihood estimate based on the training data would be a rather poor choice, for the following reasons:

* The number of samples for each leaf is small relative to the size of the pronunciation alphabet. Therefore, the probability of infrequently occurring pronunciations cannot be estimated reliably.

* The decision tree is tailored to the sample data; therefore, the distributions at the leaves are severely biased and give rise to an overly confident estimate of classification accuracy.

These problems can be alleviated by shrinking the tree, and using the distributions found at nodes that have more samples and are less tailored to the sample data. However, this cannot be taken to its extreme, since the distribution at the root is simply an estimate of the first-order distribution of pronunciations, and does not take the context into account at all.
In order to avoid the problems associated with each of these two extreme methods, we will estimate the actual distribution at each leaf as a linear combination (weighted sum) of the distributions encountered along the path from the root to the leaf. The coefficients of this linear combination (the weights) will be computed by means of deleted estimation [Jeli80], as explained below.

Since not all leaves are at the same depth, the number of nodes along the path from a leaf to the root may vary. In order to simplify the computation of the weights, it is convenient to combine a fixed number (say n) of distributions at each leaf. A typical value for n is 5. The distributions that are used are selected as follows: consider the nodes along the path from a leaf to the root. Associated with each node is a sample size. Call the sample size at the leaf S[1] and the sample size at the root S[n]. Calculate the n-2 numbers that are logarithmically spaced between S[1] and S[n], and call them S[2] through S[n-1]. Along the path from the leaf to the root, find for each of the S[i] the lowest node that has at least S[i] samples. The distributions found at these nodes will be combined to estimate the 'proper' leaf distribution. It can be seen that the distributions at the leaf and at the root are used as the first and the n'th distribution, respectively.

We now have to compute the optimal coefficients or weights X[1..n](L) for every leaf L, where n is the number of distributions that are combined. Because we do not have sufficient sample data to accurately estimate the optimal coefficient set for each leaf L, we divide the leaves into classes, and compute a set of weights X[1..n](C) for each such class C. The division of leaves into classes should be such that leaves with similar optimal coefficient sets are assigned to the same class, and such that there is enough data for each class to estimate these coefficients reliably.

It was conjectured that leaves with similar sample sizes would have similar optimal coefficient sets. Therefore, the range of sample sizes was divided into intervals, and leaves with sample sizes in the same interval were assigned to the same class. Upon further examination of the data, it was found that the coefficient values also vary rather consistently as a function of the entropy at the corresponding leaf: the higher the entropy, the less reliable the leaf distribution is for a given sample size. This may be the case because there are more significant parameters to be estimated in a distribution with higher entropy. Thus, the classes that exhibited a large variation in leaf entropy (the classes for small sample sizes) were further subdivided into entropy classes. Care was taken to ensure that there was enough sample data for each resulting class. For a tree with 5,000 leaves, we typically used about 15 classes.

The problem is now to determine, for each class, the coefficient vector that minimizes the entropy (and thus maximizes the likelihood) of 'actual' data. Unfortunately, the distribution of the 'actual' data is not known - all we have is sample data, and the sample data yields a poor estimate of the leaf distributions for the reasons mentioned above. In fact, if we were to use the sample data to estimate the coefficients directly, the optimal weight for the leaf distributions would always be one and all other weights would always be zero.
Therefore, we will estimate the optimal coefficient values by means of deleted estimation, using an algorithm similar to that presented in [Jeli80]. We first divide the sample data into m batches. A typical value for m is 4. We then construct m trees, where the j'th tree is built from the sample data in batches 1..j-1, j+1..m. For each tree, the data in the 'held-out' batch (batch j for the j'th tree) can be treated as 'actual' data. For each class C, we now compute the coefficient vector X[1..n] that maximizes the probability of the held-out data (all m batches), where the probability of the data in the j'th batch is computed with the distributions found in the j'th tree.

Within each class, the coefficients are computed by an iterative, rapidly converging procedure which is a simplified form of the Forward-Backward algorithm [Baum72]. This algorithm operates by repeatedly re-estimating the coefficient values, given the old values, the sample data and the estimators (the distributions in the tree):

    X'[i] = (1/N) Σx ( X[i] P_i(x) / Σk X[k] P_k(x) )

where X'[i] is the new estimate of X[i], Σx denotes the sum over all N samples x in the held-out sample data, P_i(x) denotes the probability of a sample x according to the i'th estimator for the leaf to which x is classified, and Σk denotes the sum over all n estimators. Of course, the samples x from batch j are classified with the j'th tree, relative to which they are 'new' samples.

As expected, different classes give rise to different coefficient sets. Figure 12 depicts the coefficient sets that were obtained for a typical tree. Because of space limitations, the coefficients of only three of the fifteen classes are shown. The coefficient values after the tenth and thirtieth iterations are shown to illustrate the convergence properties of the algorithm. All coefficients were initialized to 0.2. Class 1 contains all the leaves with a sample size of less than 63 and an entropy of less than 0.65; class 5 contains all the leaves with a sample size of less than 63 and an entropy of more than 1.49; class 15 contains all leaves with a sample size greater than 541. The threshold values that define class boundaries were chosen automatically.

The Forward-Backward algorithm has the property that the probability of the held-out data according to the model,

    Πx Σi X[i] P_i(x),

increases at each iteration until a stationary point is reached. For this particular special case, the algorithm is guaranteed to find the global optimum, provided that the initial estimates of X[1..n] are nonzero.^15 In practice, fifteen to thirty iterations are sufficient.

^15 L. R. Bahl, personal communication.

                    After One     After Ten     After Thirty
                    Iteration     Iterations    Iterations

    Class 1:  X[1]  0.47368       0.92513       0.92854
    N < 63,   X[2]  0.21041       0.04491       0.04220
    H < 0.65  X[3]  0.17609       0.02479       0.02329
              X[4]  0.09715       0.00413       0.00487
              X[5]  0.04268       0.00104       0.00111

    Class 5:  X[1]  0.37938       0.84593       0.88184
    N < 63,   X[2]  0.23135       0.09329       0.07105
    H > 1.49  X[3]  0.20250       0.05065       0.03528
              X[4]  0.13431       0.00946       0.01084
              X[5]  0.05246       0.00067       0.00098

    Class 15: X[1]  0.44915       0.99101       0.99910
    N > 541   X[2]  0.25744       0.00863       0.00050
              X[3]  0.15930       0.00029       0.00022
              X[4]  0.07414       0.00003       0.00006
              X[5]  0.05996       0.00004       0.00011

Figure 12. Coefficient Sets and their Convergence: This figure illustrates the convergence of the coefficient values for three representative classes. In each class, X[1] is the weight for the leaf distribution and X[5] is the weight for the root (or first-order) distribution.
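A compact rendering of this re-estimation loop is given below. The estimators are represented as functions returning P_i(x); everything else (names, batch handling) is an assumption made for the illustration.

    def reestimate_weights(weights, held_out, estimators, iterations=30):
        """EM re-estimation of the interpolation weights X[1..n] for one
        class of leaves. `held_out` is the held-out sample data;
        `estimators[i](x)` returns P_i(x), the probability of x under
        the i'th distribution along the root-to-leaf path."""
        n = len(weights)
        for _ in range(iterations):
            new = [0.0] * n
            for x in held_out:
                scores = [weights[i] * estimators[i](x) for i in range(n)]
                total = sum(scores)
                for i in range(n):
                    new[i] += scores[i] / total    # posterior share of i
            weights = [c / len(held_out) for c in new]
        return weights

    # e.g., starting from the uniform initialization used for Figure 12:
    # weights = reestimate_weights([0.2] * 5, held_out_batch, estimators)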
5.0 USING THE CHANNEL MODEL

5.1 A SIMPLIFIED INTERFACE

The Channel Model models the mapping between words and their base forms. A word is viewed as a letter string, and a base form is viewed as a phone string. Internally, however, the Channel Model is implemented not in terms of phone strings but pronunciation strings, where each pronunciation corresponds to a single letter, and may consist of zero or more phones. As pointed out before, the difference between the two representations is that a single phone string can correspond to many different pronunciation strings, one for each possible alignment between the letters in the word and the phones in the base form.

Presently, the Channel Model is used for two applications. Its main function is to operate in concurrence with the Linguistic Decoder, which finds the most likely base form for a word, given the spelling of the word and a single sample of its pronunciation. As a secondary function, the Channel Model is part of a very simple base form predictor, which attempts to predict the base form of a word on the basis of its spelling alone. This simple predictor is used mostly to test the Channel Model. As a third application, as yet unrealized, the Channel Model could be used in an unlimited-vocabulary speech recognition system: a 'front end' would produce a phone sequence X, and a decoder would find the most likely letter string L given X, using the a priori probability of L and the Channel Model.

The application programs that use the Channel Model are interested in finding base forms, not the various pronunciation strings to which these may correspond. Therefore, an interface has been defined that allows these application programs to operate in terms of phone strings only.

5.1.1 PATHS

The application programs perform their search for the 'correct' base form in a left to right manner: they start with the null base form, and consider successively longer and longer partial solutions (and finally complete solutions) until the best possible complete solution has been identified. Each partial solution is a leading substring or prefix of a set of complete solutions. Such a partial base form is called a path. A path is characterized by a phone string and the probability that the 'correct' base form begins with that phone string. If the model were very accurate, the probability of all paths that are leading substrings of the correct base form would be 1.0, and the probability of all other paths would be 0.0. In practice, of course, there will be many other paths with a nonzero probability.

The successively longer partial solutions obtained by extending shorter solutions are paths and, eventually, complete solutions. The operation of extending a path (by a particular phone) yields a new path, the phone string of which consists of the phone string of the original path extended by the new phone. The probability of the new path is the probability that the new path is a prefix of the correct solution. Clearly, this must be less than or equal to the probability of the original path, with equality only if the new phone is the only phone by which the original path can be extended. In order to simplify the treatment of complete paths, we have found it convenient to adopt the following convention: each word is followed by a distinguished letter that can only have the pronunciation '#', and a path is complete if and only if its last symbol is the termination marker '#'. (For this purpose, '#' is considered part of the phone alphabet.)
(For this purpose, '#' is considered part of the phone alpha- The sum of the probabilities of all different extensions of a path (including the extension by '#') is always equal to the probability of the path itself. The interface to the Channel Model is defined in terms of two operators on paths: one operator creates a null path, given a letter string; the other operator extends a path by a given phone or by the termination marker, yielding a new path. The implementation of the operator which creates the null path is trivial; the implementation of the operator which extends a path is as follows. 5.1.2 SUBPATHS The Decision Tree module which underlies the Channel Model implementation operates in Internally, terms of pronunciation strings rather than phone strings. any path may correspond to several different pronunciation strings. Each such pronunciation string is represented internally as a subpath. A complication arises because paths are extended in increments of one phone, while pronunciation strings are extended in increments of one pronunciation. that they, To straighten matters out, we have defined subpaths so too, are extended in increments of one phone. A subpath con- sists of a pronunciation string, and an indication of how much of the pronunciation string is indeed part of the corresponding path. The reason why this indication is necessary can be illustrated as follows: consider the word 'EXAMPLE' with base form /i g z ae m p uh 1/, Using the Channel Model 71 and the partial base form containing the (or path) pronunciation /i/. extend this path by the phone /g/. 'X' is determined to be /g z/. Thus, /i/. This path has single subpath, Now, consider what happens when we Internally, the pronunciation of the the subpath can be extended to /i/-/g z/ At this point, the pronunciation string of the subpath contains more phones than are desired for the subpath itself (which should have a phone string of /i g/, not /i g z/). Thus, the subpath specification is augmented 'link' by a number that indicates how many of the phones in the last of the subpath are indeed part of the corresponding path: /i/-/g Z/1. Continuing the example, suppose that we now wish to extend the new path, /i g/, suffices to update the indicator In this case, it by the phone /z/. of how many phones are part of the path. The new subpath looks like this: /i/-/g Z/2. A further complication to this scheme is posed by the fact that a pronunciation may not contain any phones at all. dled in an entirely consistent manner, This complication can be han- as follows: consider the word 'AARDVARK' with base form /a r d v a r k/. Assume, for the moment, not with the second. that the first /a/ Thus, is aligned with the first the only subpath of the path /a/ 'A', looks like this: /a/,. Using the Channel Model 72 Now, consider what happens when we extend the path /a/ by the phone /r/. Internally, the pronunciation of the next letter of the word (the second 'A') is predicted to be // and the following subpath is created: /a/-//. Obviously, this phone string /a cannot suffice since this subpath does not contain the r/. Thus, the pronunciation of the third letter must be predicted, yielding the subpath /a/-//-/r/l. In general, whenever the pronunciation of a letter can be null, the pronunciation of the next letter must be predicted as well. If it too can be null, the procedure is repeated. This process continues until the end of the word is reached, or until a letter is encountered that cannot have a null pronunciation. 
Finally, I will illustrate with two examples the mechanism by which multiple subpaths for the same path come into existence. Although both are examples of the same general case, they are sufficiently different that separate illustrations are in order. The first example is rather trivial: suppose that the first /a/ in 'AARDVARK' can align with either the first or the second 'A'. If this is the case, then the path /a/ has two subpaths: /a/1 and //-/a/1. When this path is extended by the phone /r/, each of its subpaths must be extended, and two new subpaths result: /a/-//-/r/1 and //-/a/-/r/1. Thus, multiple subpaths may arise whenever there is an ambiguity in the alignment.

While the above example revolves around the null pronunciation, alignment ambiguities arise whenever two different pronunciations can provide the phone by which the path is extended. Consider, for example, the word 'BE' with base form /b ee/. Since the pronunciation of the 'B' can be either /b/ or /b ee/, the path /b/ is represented by the following two subpaths: /b/1 and /b ee/1. When this path is next extended by the phone /ee/, the following two subpaths may be formed: /b/-/ee/1 and /b ee/2. When, finally, this path is extended by the termination marker '#', the following two subpaths are formed: /b/-/ee/-# and /b ee/-//-#. The first subpath represents the 'correct' alignment, in which the 'B' corresponds to the /b/ and the 'E' to the /ee/. The second subpath represents an 'incorrect' alignment, in which the 'B' corresponds to /b ee/ and the 'E' is silent.

When a path is complete, one subpath is usually much more likely than all other subpaths, due to the consistency of the aligned training data. Nevertheless, it is necessary to keep track of all plausible alignments from the beginning, since the best alignment often cannot be immediately identified.

5.1.3 SUMMARY OF PATH EXTENSION

In order to extend a path, it is necessary to extend all of its subpaths. The probability of the new path can then be computed as the sum of the probabilities of the new subpaths. To extend a subpath, proceed as follows:

1. If not all the phones in the last link of the subpath are 'used', examine the first 'unused' phone:

   a. If it is identical to the phone by which the subpath is to be extended, simply make a copy of the subpath and update the length of the new subpath. The probability of the new subpath is the same as that of the old subpath.

   b. If it is different, the subpath cannot be extended by the given phone. Thus, the probability of the extension is zero.
Since each subpath corresponds to a particular alignment between the letter string and the phone string, the system can also be used to determine the most likely alignment between a given (letter string, phone string) pair. This capability is not used at present.

5.2 A SIMPLE BASE FORM PREDICTOR

A simple base form predictor, which predicts base forms from letter strings, was developed - mainly to serve as a test vehicle for the Channel Model. The function of the base form predictor is to identify the base form X that has the highest likelihood of being the correct base form of a given letter string, L: find the X that maximizes p(X | L), where p(X | L) is computed using the Channel Model. The base form predictor operates as follows (a sketch of this loop follows the description):

1. Start with the null path π0.

2. Maintain a list of paths, ordered by f'(π), which is an estimate of the probability of the best complete path that may be formed by extending the path π. The path with the highest value of f'(π) is at the head of the list. (For efficiency, this list is implemented as a heap, or priority queue, which allows insertions and deletions to be performed in O(log n) time.)

3. Repeat the following procedure until a complete path is found at the head of the list:

   a. Remove the best path from the head of the list.

   b. Generate all its possible extensions by invoking the Channel Model - there will be one extension for each possible phone, and one for the termination marker '#'.

   c. Compute f'(π) for each extension and insert it into the list at its appropriate place, maintaining the ordering on f'(π).

4. When a complete path is found at the head of the list, the algorithm terminates.

This algorithm, which is known as 'stack decoding', has the important property that as long as f'(π) >= f(π) for all paths π, the first complete path that makes its way to the head of the list will be the best solution overall ([Nils80], section 2.4: the A* algorithm).

The efficiency of the stack algorithm depends on how closely f'(π) approximates f(π). In the present implementation, f'(π) is defined as the total probability of all the complete paths that may be formed by extending the path π. Note that the condition f'(π) >= f(π) is always satisfied. However, this is a rather poor estimate of f(π), and the search performed by the predictor tends to take time exponential in the length of the base form that is to be predicted. This is not a problem, since even long words can be handled within a second or so, and the program is used for testing only.
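A minimal sketch of this search loop, with Python's heapq standing in for the ordered list (heapq pops the smallest element, so negated estimates are stored). The extensions(path) interface is an assumption made for illustration, not the Channel Model's actual interface.

    import heapq
    import itertools

    # Sketch of the stack-decoding (A*) loop of the base form predictor.
    # 'extensions(path)' yields (new_path, is_complete, f_prime) for every
    # phone and for the termination marker '#'. Because f'(pi) >= f(pi),
    # the first complete path popped from the heap is the best overall.

    def predict_base_form(null_path, extensions):
        counter = itertools.count()     # tie-breaker for equal estimates
        stack = [(-1.0, next(counter), null_path, False)]
        while stack:
            neg_f, _, path, complete = heapq.heappop(stack)
            if complete:
                return path             # A*: first complete pop is optimal
            for new_path, is_complete, f_prime in extensions(path):
                heapq.heappush(
                    stack, (-f_prime, next(counter), new_path, is_complete))
        return None                     # no complete path was found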
5.3 THE BASE FORM DECODER

The Channel Model has been successfully interfaced to the IBM Linguistic Decoder [Jeli75] [Jeli76], producing a system, the Base Form Decoder, that predicts the base form of a word on the basis of its spelling and a single sample of its pronunciation. The Base Form Decoder is a maximum likelihood decoder: its objective is to identify the phone sequence X that has the highest likelihood of being the correct base form of the word W, given the spelling of the word (a letter string L) and a sample of the pronunciation of the word (a speech signal or acoustic signal A). In other words, X should maximize p(X | L, A). Using Bayes' rule, we can rewrite this expression as

    p(X | L, A) = p(X | L) x p(A | L, X) / p(A | L)

and make the following observations:

1. The expression p(A | L, X) can be simplified to p(A | X), because the dependence of A on L is through X only.

2. The value of the denominator, p(A | L), does not depend on X. Consequently, it makes no contribution to the maximization and may be ignored.

Thus, we can re-state our objective: X should maximize p(X | L) x p(A | X). The quantity p(X | L) can be computed by means of the Channel Model; the quantity p(A | X), which is sometimes called the 'acoustic match', can be computed by means of a model of the relation between phone strings and sounds, such as that described in [Jeli76].

The maximization is performed by the Linguistic Decoder by means of a stack decoding mechanism similar to that described in the previous section, but more sophisticated [Bahl75] (p. 408). In particular, it uses a better approximation of f'(π), so that search time is approximately linear in the length of the decoded string rather than exponential.

An experiment has been performed to evaluate the performance of the combined system (Channel Model, Acoustic Matcher and Stack Decoder): the estimated phone error rate is between 2.14% and 3.34%. The results of this experiment are described in more detail in "Performance of the complete system" on page 83 ff.

6.0 EVALUATION

This chapter consists of four sections:

* Performance
* Objectives Achieved
* Limitations of the system
* Suggestions for further research

6.1 PERFORMANCE

6.1.1 PERFORMANCE OF THE CHANNEL MODEL BY ITSELF

Two versions of the Channel Model have been constructed and tested. One model was trained on 4000 words from a 5000-word base form vocabulary (hereafter referred to as the OC vocabulary, for Office Correspondence); the other model was trained on a 70,000-word dictionary. Not unexpectedly, the 4000 OC base forms appeared to be insufficient as training data for the model. The dictionary, on the other hand, appears to contain enough information to properly train the Channel Model.

The model trained on the 4000 OC base forms was used to estimate a weak, fuzzy lower bound on the error rate of the Channel Model, simply by having it predict the base forms of the words that it was trained on. For each word, the 'decoded' phone string was aligned with the 'correct' phone string using the string alignment program of [Wagn74], and every mismatch (insertion, deletion or substitution) between the two strings was counted as an error. The number of errors was then divided by the number of phones in the 'correct' string, to yield a 'phone error rate' of about 1.05%, with most of the errors being in stress placement.
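As an illustration of this scoring procedure, here is a sketch of a phone-error-rate computation built on the standard string-edit distance of [Wagn74]; the function names are hypothetical, and this is not the alignment program actually used in the experiments.

    # Sketch of the 'phone error rate': count insertions, deletions and
    # substitutions between decoded and correct phone strings (the
    # string-to-string correction problem of [Wagn74]), then divide by
    # the total length of the correct strings.

    def edit_distance(decoded, correct):
        m, n = len(decoded), len(correct)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i                       # deletions
        for j in range(n + 1):
            d[0][j] = j                       # insertions
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if decoded[i - 1] == correct[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution/match
        return d[m][n]

    def phone_error_rate(pairs):
        errors = sum(edit_distance(dec, cor) for dec, cor in pairs)
        phones = sum(len(cor) for _, cor in pairs)
        return errors / phones

    # e.g. phone_error_rate([("b ee".split(), "b ee".split())]) == 0.0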
An upper bound on the error rate of the Channel Model was estimated as follows: the model trained on the dictionary was used to predict the base forms of 500 words from the OC vocabulary, and a 'phone error rate' was calculated as before. Averaged over all the words in the test data, the 'phone error rate' was 21.0%. The 'actual' error rate is less than 21%, for the following reasons:

* The strings compared are in slightly different formats, since the 'correct' strings follow the conventions of the OC vocabulary, whereas the 'decoded' strings attempt to follow the conventions of the dictionary. Typical discrepancies include:

  - /er/ vs. /uh r/,
  - /n t s/ vs. /n s/,
  - /j uu/ vs. /uu/, and
  - unstressed /i/ vs. unstressed /uh/.

* Some phones are represented as two symbols; for example, the 'I' in 'BITE' is represented as /ax ixg/. Consequently, a substitution of, for example, /ix/ for /ax ixg/ is counted as two errors.

* A single word may have multiple base forms; if so, only one will be considered 'correct'.

An examination of the errors made by the Channel Model indicates that some of these errors could either be avoided or tolerated in an application environment, although some errors are rather severe. For a list of typical errors (every tenth error) see Figure 13.

    Context        Correct              Predicted

    COVERAGES      /ERO/                /UHO RX/
    ELEMENTARY     /ERO/                /UHO RX/
    FELIX          /EE1 LX IXO/         /EH1 LX UHO/
    ITINERARY      /AX1 IXG TX IX1/     /IXO TX AX1 IXG/
    NEGOTIATING    /UHO/                /IXO/
    PRODUCING      /UU1/                /UHO/
    REWARD         /EEO/                /IXO/
    SUSPECT        /UH1/                /UHO/
    ACCIDENTS      /IXO/                /UHO/
    CALCULATED     /IXO/                /UHO/
    CRITIQUE       /IXO TX EE1/         /EE1 SH UHO/
    DUPLICATION    /DX UU1/             /DX JX UU1/
    GENERATING     /ERO/                /UHO RX/
    INQUIRING      /IX1/                /IXO/
    LOADED         /IXO/                /UHO/
    PACKAGING      /IXO/                /EI1/
    READS          /EE1/                /EH1/

    Figure 13. Typical Channel Model Errors: Selected errors made by the Channel Model when operating without clues from acoustic information.

6.1.2 PERFORMANCE OF THE PHONE RECOGNIZER BY ITSELF

A version of the Linguistic Decoder for connected speech [Jeli76] was configured to recognize phones. This was done by replacing the standard vocabulary of the system by a vocabulary of 50 phones (stressed and unstressed vowels counted separately). No phonological rules were used.

Training and test data was obtained from an existing corpus of speech, which had been recorded in a noisy (office) environment by a male talker with a close-talking microphone. A total of 3900 words of speech were available, in 'sentences' of ten words each. These words had been drawn from a vocabulary of 2000 different words. The data was segmented automatically into single-word units, and the resulting corpus was split up into 3500 words of training data and 400 words of test data, such that no word occurred in both.

Phone recognition is a rather difficult task (at least for automatic recognizers), and it would be foolhardy to expect good performance of a system this simple. Indeed, the phone recognition rate of the system was only about 60%. This recognition rate was computed on the basis of a minimum-distance alignment, constrained in that two phones could not be aligned if the portions of the speech signal to which they corresponded did not overlap. Insertions were not counted as errors in this experiment. For a list of typical errors (every tenth error) see Figure 14.

    Figure 14. Typical Phone Recognizer Errors: Selected errors made by the Phone Recognizer when operating without clues from spelling, in contexts such as RECEIPT (/TX/ decoded as /KX/), PROCEED (/PX/ as /KX/), POSITION (/PX/ as /TX/), TRANSFER (/FX/ as //), and LUCK and EMPLOYMENT (/LX/ as /WX/).

6.1.3 PERFORMANCE OF THE COMPLETE SYSTEM

The choice of test data for the combined system was constrained by the limited availability of recorded speech and sample base form data in compatible formats. The same data used to test the phone recognizer was used to test the combined system. This data does not include any speech that contributed to the acoustic model training, and comes from a source that is independent of the dictionary.
Of the 400 words of test data, the first 50 words were used to test the decoder and to adjust various parameters; the next 300 words were used for evaluation; the remaining 50 words were saved for emergencies. Of the 300 evaluation words, 4 turned out to be abbreviations and were eliminated from further consideration. The remaining 296 words contained a total of 1914 phones, or an average of 6.5 phones per word. The decision tree used for this experiment consisted of 3424 rules with a (weighted) average length of 10.29 tests. The longest rule required 21 tests.

In addition to the problem of slight differences in format that also affected our Channel Model test, the combined test is complicated by the fact that results depend on the speaker. In fact, the dictionary allows multiple, equally likely base forms for many words, and consequently the system can produce different base forms for different utterances of the same word.

The base forms produced by the combined system (296 in all) were graded by the author, using two different evaluation standards:

* A lower bound on the error rate was established by counting only definite errors, giving the system the 'benefit of the doubt' in questionable cases. This lower bound on the 'phone error rate' was found to be 41 / 1914 = 2.14%.

* An upper bound on the error rate was established by eliminating the 'benefit of the doubt' and counting all possible errors. This upper bound on the 'phone error rate' was found to be 64 / 1914 = 3.34%.

The base forms that contained 'errors' by either of these standards are listed in "Evaluation Results" on page 109. It can be seen that the error rate of the combined system is much less than that of the phone recognizer, and much less than the upper bound on the error rate of the Channel Model operating by itself. Nevertheless, it is clear that the system in its present form is not ready for use 'in the real world'.

6.2 OBJECTIVES ACHIEVED

As the reader will recall, we set out to build a system that would generate base forms automatically (i.e., without expert intervention). This system has been built, out of several quite independent parts.

The Channel Model represents the relation between letter strings and the corresponding phonemic base forms: this model is in the form of a large collection of probabilistic spelling to base form rules. These rules were discovered automatically, by a very simple program, on the basis of an analysis of a suitable amount of sample data. Because of this, the system is self-organized: it can be applied to any collection of sample data, and it will infer an appropriate set of spelling to base form rules from these 'examples'.

A phone recognizer is employed to estimate the probability that a particular phone string would have produced a given observed speech signal. This phone recognizer was constructed by modifying a connected word recognizer, replacing its ordinary vocabulary by a lexicon of phones. Very little knowledge of phonetics or phonology has been applied to the design of the phone recognizer: instead, it is 'trained' on a suitable amount of sample speech by means of the powerful 'Forward-Backward' or 'EM' algorithm.

The knowledge contributed by these two systems is combined as follows. Recall that each system calculates a probability value. If X is a phone string, L a letter string and A an acoustic observation, then:

* p(X | L) is the probability that X is the base form of a word with spelling L;

* p(A | X) is the probability that the base form X, when pronounced by the talker, would have produced the speech signal A.

From these two values, we may calculate the probability of X given L and A using Bayes' Rule:

    p(X | L, A) = p(X | L) x p(A | L, X) / p(A | L)

The denominator, p(A | L), may be ignored in practice because it depends only on A and L and is therefore constant.
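In log-probability form the combination is a simple sum, as the following sketch shows for a fixed candidate list. The two scoring functions are hypothetical stand-ins for the Channel Model and the acoustic match, and the real decoder searches over phone strings with the stack algorithm rather than ranking a fixed list.

    import math

    # Sketch of the knowledge-source combination: rank candidate base
    # forms by log p(X | L) + log p(A | X), which maximizes p(X | L, A)
    # up to the constant p(A | L). 'channel_model' and 'acoustic_match'
    # are assumed interfaces returning probabilities.

    def best_base_form(candidates, letters, acoustics,
                       channel_model, acoustic_match):
        def score(x):
            return (math.log(channel_model(x, letters)) +
                    math.log(acoustic_match(acoustics, x)))
        return max(candidates, key=score)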
The search for the most likely phone string (by the above measure) is conducted using the 'Stack Algorithm' or A* Algorithm. An existing implementation of this algorithm is used.

The performance figures given in the preceding section are intended as an indication of how well a system of this type can perform in practice. The measured phone error rate of between 2% and 4% is probably too high for most 'real' applications. Nevertheless, it is substantially better than the results obtained if either of the two knowledge sources is eliminated. This confirms our conjecture that the two knowledge sources complement one another.

Furthermore, these figures can be viewed as demonstrating the viability of the design philosophy, namely to rely as much as possible on self-organized methods, and as little as possible on the designer's prejudices regarding the structure of the problem and its solution. While the application of Information Theory to language is not new [Shan51], its application to the formulation of spelling to sound rules and the like appears to be novel. The present approach was first suggested in detail by R. Mercer.

As the project neared completion, several limitations of the design became apparent; they are discussed in the next section. The remaining section of this chapter is dedicated to potential improvements of the system.

6.3 LIMITATIONS OF THE SYSTEM

The Channel Model utilizes a rather simple model of the relation between letter strings and the corresponding base forms. This limits the performance of the model in a number of ways:

* Inversions: Sometimes the correspondence between the letters of a word and the phones of its base form is not sequential. Because it is difficult to express a rule involving an inversion in terms of the feature sets employed, these rules may not be discovered at all.

* Transformations: It appears that the present model is unable to capture various transformations that can take place at the phone or letter level, such as the change in the pronunciation of the 'A' when a suffix is added to the word 'NATION':

      'NATION' with base form /n 'ei sh uh n/,
      'NATIONAL' with base form /n 'ae sh uh n uh l/.

  Thus, each of these strings has to be modeled separately. (In this example, the decision between /ei/ and /ae/ is made very difficult by the fact that the nearest difference in the letter strings is more than 4 letters away.)
The Channel Model is trained on a limited amount of sample data, and will therefore have difficulty with types of data that are not represented in the sample set, such as:

* Special 'Words': Some vocabularies of interest, such as 'Office Correspondence', contain a significant number (say, 3%) of abbreviations and other 'words' to which the ordinary spelling to sound rules do not apply, such as 'CHQ', 'APL' and 'IBMers'. The present system is able to deal with abbreviations and acronyms only to the extent that examples are included in the training data.

* Proper Names: We have found that proper names appear to require different spelling to sound rules than ordinary words. Unfortunately, the available dictionary does not contain any proper names on which the Channel Model can be trained.

Finally, the pattern recognition algorithms and statistical procedures (as applied to feature selection, decision tree design and distribution smoothing) have their limitations:

* Complex Patterns: It appears that many spelling to base form rules are triggered only by rather complex patterns (complex in terms of the available features). Unfortunately, the number of bits available to represent each pattern is limited by the depth of the decision tree, which in turn is limited by the amount of sample data and in particular by the number of similar contexts with different pronunciations. In practice, the decision tree construction program is sometimes unable to identify the entire pattern, and will construct a rule that can be invoked when only a part of the pattern is present.

* 'Global' Constraints: The system appears to have difficulty discovering 'global' constraints (within words), such as the fact that each word should contain at least one stressed syllable. Naturally, this knowledge could be made explicit, but we would like to avoid that, since such constraints would be different for each application.

* 'New' Data: The smoothing procedures employed are rather ad hoc; there is no reason to believe that they lead to optimal prediction of 'new' data.

6.4 SUGGESTIONS FOR FURTHER RESEARCH

I believe that with some effort, the system can be improved considerably within the present design framework. The following potential improvements appear particularly promising:

* Add a filter to the system that will recognize acronyms and abbreviations (which is relatively easy, since they are often capitalized), and treat them specially: allow for 'spelled-out' base forms in addition to what the regular Channel Model predicts and to the base forms from the training data.

* Use a more complete dictionary as training material, preferably one that includes inflected forms, geographical names and surnames, and common abbreviations.

* Use a set of phonological rules to allow the phone recognizer to look for surface forms instead of base forms [Cohe75] [Jeli76].

* Improve the pattern matching stage, perhaps by using prime-event analysis [Stof74] or by building a generalized decision network instead of just a tree. Consider using multi-stage optimization in the tree design process [Hart82].

* Improve the feature selection stage, perhaps by abandoning the limitation that features may apply to individual letters and phones only.

* Use a more refined acoustic/phonetic 'speaker performance model' in the phone recognizer, for example one that takes account of context.

* Reduce the amount of data presented to the decision tree design program by 'factoring out' patterns that are repeated many times; this will reduce the amount of computation required.

* Improve the data smoothing procedure, perhaps on the basis of a more accurate model of classification faults.

* Improve the interface between the Channel Model and the Phone Recognizer with regard to constraint propagation and error propagation, perhaps by means of a meta-level model; allow the degree of speaker-sensitivity to be adjusted.

* Eliminate the need for alignment of the sample data by considering all possible alignments at all times.

* Replace the Channel Model (which computes p(X | L)) by two separate models, one of phone strings and one of the mapping from phone strings to letter strings. The quantity p(X | L) can then be computed as p(X) x p(L | X) / p(L), where p(L) may be ignored because it does not depend on X. Together, the two separate models can be more powerful than the simple Channel Model.
A.0 REFERENCES

[Alle76] J. Allen, Synthesis of Speech from Unrestricted Text, in Proceedings of the IEEE, Vol. 64, No. 4, pp. 433-442 (1976)

[Bahl75] L. R. Bahl, F. Jelinek, Decoding for Channels with Insertions, Deletions and Substitutions with Applications to Speech Recognition, in IEEE Transactions on Information Theory, Vol. IT-21, No. 4, pp. 404-411 (1975)

[Baum72] L. E. Baum, An Inequality and Associated Maximization Technique in Statistical Estimation for Probabilistic Functions of Markov Processes, in Inequalities, Vol. III, Academic Press, New York, 1972

[Case81] R. G. Casey, G. Nagy, Decision Tree Design Using a Probabilistic Model, IBM Research Report RJ3358 (40314), 1981

[Cohe75] P. S. Cohen, R. L. Mercer, The Phonological Component of an Automatic Speech-Recognition System, in Speech Recognition, Academic Press, New York, 1975, pp. 275-320

[Elas67] J. D. Elashoff, R. M. Elashoff, G. E. Goldman, On the Choice of Variables in Classification Problems with Dichotomous Variables, in Biometrika, Vol. 54, pp. 668-670 (1967)

[Elov76] H. S. Elovitz, R. Johnson, A. McHugh, J. E. Shore, Letter-to-Sound Rules for Automatic Translation of English Text to Phonetics, in IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-24, No. 6, pp. 446-459 (1976)

[Forn73] G. D. Forney, Jr., The Viterbi Algorithm, in Proceedings of the IEEE, Vol. 61, pp. 268-278 (1973)

[Gall68] R. G. Gallager, Information Theory and Reliable Communication, John Wiley & Sons, 1968

[Hart82] C. R. P. Hartmann, P. K. Varshney, K. G. Mehrotra, C. L. Gerberich, Application of Information Theory to the Construction of Efficient Decision Trees, in IEEE Transactions on Information Theory, Vol. IT-28, No. 4, pp. 565-577 (1982)

[Hunn76] S. Hunnicutt, Phonological Rules for a Text-to-Speech System, in American Journal of Computational Linguistics, Microfiche 57 (1976)

[Hyaf76] L. Hyafil, R. Rivest, Constructing Optimal Binary Decision Trees is NP-Complete, in Information Processing Letters, Vol. 5, No. 1, pp. 15-17 (1976)

[Jeli75] F. Jelinek, L. R. Bahl, R. L. Mercer, Design of a Linguistic Statistical Decoder for the Recognition of Continuous Speech, in IEEE Transactions on Information Theory, Vol. IT-21, No. 3, May 1975, pp. 250-256

[Jeli76] F. Jelinek, Continuous Speech Recognition by Statistical Methods, in Proceedings of the IEEE, Vol. 64, No. 4, pp. 532-556 (1976)
[Jeli80] F. Jelinek, R. L. Mercer, Interpolated Estimation of Markov Source Parameters from Sparse Data, in Proceedings of the Workshop on Pattern Recognition in Practice, North-Holland Publishing Company, 1980, pp. 381-397

[Meis72] W. S. Meisel, Computer-oriented Approaches to Pattern Recognition, Academic Press, New York, 1972

[Nils80] N. J. Nilsson, Principles of Artificial Intelligence, Tioga Publishing Co., 1980

[Payn77] H. J. Payne, W. S. Meisel, An Algorithm for Constructing Optimal Binary Decision Trees, in IEEE Transactions on Computers, Vol. C-26, No. 9, pp. 905-916 (1977)

[Shan51] C. E. Shannon, Prediction and Entropy of Printed English, in Bell System Technical Journal, Vol. 30, pp. 50-64 (1951)

[Stof74] J. C. Stoffel, A Classifier Design Technique for Discrete Variable Pattern Recognition Problems, in IEEE Transactions on Computers, Vol. C-23, No. 4, pp. 428-441 (1974)

[Tous71] G. T. Toussaint, Note on Optimal Selection of Independent Binary-Valued Features for Pattern Recognition, in IEEE Transactions on Information Theory, September 1971, p. 618

[Wagn74] R. A. Wagner, M. J. Fischer, The String-to-String Correction Problem, in Journal of the Association for Computing Machinery, Vol. 21, No. 1, January 1974, pp. 168-173

B.0 DICTIONARY PRE-PROCESSING

Sample data for the construction of the spelling-to-baseform channel model was obtained from two sources:

1. An on-line dictionary containing some 70,000 entries, and

2. The existing base form collection of the IBM Speech Recognition Research Group.

This chapter explains the need for pre-processing of data from these sources, and provides some detail regarding the methods used.

Note: The algorithms presented in this section represent work by R. Mercer and the author.

B.1 TYPES OF PRE-PROCESSING NEEDED

The pronunciation information contained in our dictionary was not originally in a convenient form for further processing, for the following reasons:

1. The pronunciations given in the dictionary are not always base forms: often the dictionary presents several forms of the same basic pronunciation that could all be derived from a single base form. Thus, it is necessary to identify and remove all pronunciation entries that are variants of some given base form. This problem was not addressed at this stage. Given a set of phonological rules [Cohe75], it is a trivial matter to determine whether two phone strings are variants of each other.

2. The pronunciations in the dictionary are sometimes incomplete: in order to save space, the dictionary gives the pronunciation of each word that differs only slightly from the preceding word by merely indicating the difference (see Figure 15). These incomplete pronunciations are identifiable by a leading and/or a trailing hyphen. For each incomplete pronunciation, the single correct completion had to be selected. This was done automatically, as described in "Base Form Completion" on page 97 ff.

    MAG-A-ZINE      /m 'ae g - uh - z .ee n/, /m .ae g - uh - '/
    MA-TRI-MO-NIAL  /m .ae - t r uh - m 'ou - n j uh l/, /- n ee - uh l/
    TOL-ER-ATE      /t 'aa l - uh - r .ei t/
    TOL-ER-A-TIVE   /- r .ei t - i v/
    TOL-ER-A-TOR    /- r .ei t - uh r/
    TO-TA-QUI-NA    /t .aa t - uh - k w 'ai - n uh/, /- k (w) 'ee -/

    Figure 15. Partially Specified Pronunciations: In the dictionary, pronunciations are not always completely specified.

3. The particular version of the dictionary that was available to us contains a small but not insignificant number of errors, most of which appear to have been introduced during some transcription or translation process.
   A match score was computed for each (spelling, pronunciation) pair and the dictionary was sorted by this match score. The part of the dictionary that contained the low-scoring words was then examined manually, and any pronunciations that were found to be in error were either corrected or eliminated.

4. The dictionary contains no entries for regular inflected forms; this affects plurals, present and past tenses and participles, and comparatives and superlatives. A program was written to inspect an ordinary word frequency list (of about 65,000 words). For each inflected form that occurred in this list but not in the dictionary, the program retrieved the pronunciation of the root from the dictionary, and generated the inflected form by means of rule application.

5. The dictionary contains no explicit information regarding the alignment between the letters in the words and the phones in the base forms. The alignment was performed automatically, by means of the trivial spelling-to-baseform channel model described below. See "Base Form Alignment" on page 97 ff. for details.

B.2 THE TRIVIAL SPELLING-TO-BASEFORM CHANNEL MODEL

Note: This model was originally designed and implemented by R. Mercer.

A trivial spelling to baseform channel model was needed to calculate match scores between letter strings and phone strings, and to align them. In this model, a letter is modeled as a Markov Source (or finite-state machine) with an initial and a final state, in which each arc or transition corresponds to a phone. A letter string may be modeled by connecting together the Markov Sources corresponding to the individual letters in the string. The transition probabilities of each Markov Source are independent of the context in which it occurs.

The Markov Source model was chosen mostly because algorithms and programs are available for constructing and utilizing such a model. Despite the obvious shortcomings of this model, it does appear to be sufficiently powerful for the purposes for which it is used (alignment and base form completion). The transition probabilities of the Markov Source model (the only parameters in the model) were computed with the Forward-Backward algorithm [Jeli76] [Baum72], using the dictionary as sample data.
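The spirit of this computation can be captured with a small dynamic program: if each letter emits a short phone substring (possibly empty) independently of context, the probability that a letter string produces a phone string is a sum over all segmentations of the phone string, one piece per letter. The emission-table interface below is a hypothetical stand-in for the trained Markov Sources, not the thesis program.

    # Sketch of the trivial channel model's scoring. 'emit[letter]' maps
    # phone tuples (including the empty tuple, for a silent letter) to
    # probabilities; these play the role of the Markov Source parameters.

    def string_probability(letters, phones, emit, max_piece=3):
        n, m = len(letters), len(phones)
        # prob[i][j] = probability that the first i letters emit phones[:j]
        prob = [[0.0] * (m + 1) for _ in range(n + 1)]
        prob[0][0] = 1.0
        for i in range(1, n + 1):
            dist = emit[letters[i - 1]]
            for j in range(m + 1):
                for k in range(0, min(max_piece, j) + 1):
                    piece = tuple(phones[j - k:j])   # k phones ending at j
                    prob[i][j] += prob[i - 1][j - k] * dist.get(piece, 0.0)
        return prob[n][m]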
B.3 BASE FORM MATCH SCORE DETERMINATION

The Markov Source model of base forms, as described above, is capable of computing the probability that any phone string X is the base form of a given letter string L. This probability can be used as a match score between a letter string and a base form.

The most serious shortcoming of the model is its lack of context sensitivity. While many letters are often silent (some with a probability as high as 50%), the model is not aware that 'runs' of silent letters are much less likely. As a result, the model over-estimates the probability of short phone strings.

In an attempt to compensate for this shortcoming in the model, we have used several different functions as estimates of the match value between a letter string L (of length n) and a base form X (a sketch of these scores follows the list):

* if we assume that the model is correct, we can define the match score between L and X as p(X | L);

* if we view p(X | L) as the product of the individual phone probabilities, we can define the match score as the (geometric) average probability per predicted phone, p(X | L)^(1/n);

* finally, we can normalize the probability p(X | L) by dividing by the expected probability of a correct phone string of the same length, and define the match score as p(X | L) / c^n for a suitable constant c.

Since the match calculations were used only for dictionary processing prior to manual editing, no attempt was made to formally evaluate the effectiveness of each of these measures. Of course, any attempt to evaluate them would have required the manual analysis of a large amount of data.
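A sketch of the three candidate scores in log-probability form, assuming log p(X | L) has already been computed (for example, by a routine like string_probability above); the value of the constant c is purely illustrative.

    import math

    # Sketch of the three match scores described above, in the log domain.

    def match_scores(log_p, n, c=0.5):
        return {
            'raw':        log_p,                     # log p(X | L)
            'geometric':  log_p / n,                 # log p(X | L)^(1/n)
            'normalized': log_p - n * math.log(c),   # log [p(X | L) / c^n]
        }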
B.4 BASE FORM COMPLETION

In the dictionary available to us, many pronunciations are specified incompletely, with a hyphen indicating that a portion of the preceding pronunciation is to be carried over to complete the pronunciation (see Figure 15 on page 95). By convention, only an integral number of (phonemic) syllables may be carried over from the preceding pronunciation. A number of other, unstated rules apply, such as the rule that any inserted string (of the form -x-) must always replace at least one syllable: otherwise it would be difficult to determine where the string should be inserted. As a result, the number of possible completions is always relatively small. For example, there are only two ways of completing the incomplete pronunciation for the word 'TOTAQUINA' (refer to Figure 15 on page 95):

    /t .aa t k (w) 'ee k w 'ai n uh/
    /t .aa t uh k (w) 'ee n uh/

A user of the dictionary will probably have little difficulty identifying the second option as the correct one. The base form completion program attempts to do the same: it calculates the match score between the letter string and each of the proposed base forms, and simply selects the one with the highest score.

B.5 BASE FORM ALIGNMENT

The Markov Source model of the spelling-to-baseform mapping can be used to determine the most likely alignment between the letters in a letter string and the phones in a phone string, by means of a dynamic programming algorithm known as the Viterbi algorithm (see [Forn73]). This algorithm was used to align a total of 70,000 (spelling, baseform) pairs. Although we never formally verified the 'correctness' of the alignments thus obtained, we did not find any unacceptable alignments in any of the hundreds of cases we inspected. In fact, it is not easy to define 'correctness' in this context. In case of doubt, we always accepted the alignment produced by the Viterbi program, to ensure consistency throughout the training set.

C.0 THE CLUSTERING ALGORITHM

Note: The algorithms presented in this section represent work by R. Mercer, M. A. Picheny and the author.

C.1 OBJECTIVE

The clustering program described here is quite general in scope, and has been used for a variety of purposes besides feature selection. The program requires the following inputs:

1. An array describing the joint probability distribution of two random variables. The rows correspond to the entities that are to be clustered; the columns correspond to the possible values of the unknown that is to be predicted.

2. An optional specification of a partition of the rows into some number of 'auxiliary clusters'. These auxiliary clusters represent the combined answers to any previous questions about the rows.

The algorithm attempts to find a binary partition of the rows so that the conditional mutual information between the two resulting clusters and the columns, given the information provided by the 'auxiliary clusters', is maximized. If several possible partitions offer the same value of this primary objective function, ties are resolved in favor of a secondary objective function, namely the unconditional mutual information between the two resulting clusters and the columns.

We can define this formally as follows: the joint probability distribution is of the form p(L, x), where L denotes the 'current letter' of a context and x its pronunciation. The objective of the program is to define a binary partition, C: L -> {1, 2}, that assigns every letter either to cluster no. 1 or to cluster no. 2. The 'auxiliary clusters' provided as input define a mapping A: L -> {1, 2, ..., n} that assigns every letter to one of n classes. If no partition is provided as input, n = 1. Our primary objective is to maximize MI( C(L); x | A(L) ); when there are several solutions, our secondary objective is to maximize MI( C(L); x ).
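For concreteness, here is a sketch of the primary objective MI( C(L); x | A(L) ) for a candidate partition. The dictionary-based data layout is an assumption for illustration, not the layout used by the clustering program.

    import math
    from collections import defaultdict

    # Sketch of the primary objective MI( C(L); x | A(L) ).
    # 'joint[(letter, pron)]' holds the joint probability p(L, x);
    # 'C' and 'A' map letters to cluster labels.

    def conditional_mutual_information(joint, C, A):
        p_acx = defaultdict(float)   # p(a, c, x)
        p_ac = defaultdict(float)    # p(a, c)
        p_ax = defaultdict(float)    # p(a, x)
        p_a = defaultdict(float)     # p(a)
        for (letter, x), p in joint.items():
            a, c = A[letter], C[letter]
            p_acx[(a, c, x)] += p
            p_ac[(a, c)] += p
            p_ax[(a, x)] += p
            p_a[a] += p
        mi = 0.0
        for (a, c, x), p in p_acx.items():
            if p > 0.0:
                mi += p * math.log2(p * p_a[a] / (p_ac[(a, c)] * p_ax[(a, x)]))
        return mi   # in bits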
C.2 IMPLEMENTATION

Merging clusters

The program starts by placing each letter L in a cluster all its own. (Since this gives us more than two clusters, we need to re-define C as C: L -> {1, 2, ..., m}, where m initially is equal to the number of letters.) The program then determines by how much the value of the objective function would drop if it combined any two clusters into one, and it combines those two clusters for which the resulting drop is the smallest. This process is repeated until only some small number of clusters, typically between two and fifteen, remain.

Finding the best binary partition of a small set of clusters

For a small number of clusters n, it is possible to evaluate the objective function for every possible way of combining these n clusters into two, and thereby to identify the optimal combination. If we arbitrarily identify the two final clusters with the values '0' and '1', every possible combination can be specified as a vector of n bits, in which each bit indicates whether the corresponding cluster becomes part of cluster '0' or cluster '1'. If we interchange the roles of '0' and '1', the partition defined by such a bit vector remains the same. Therefore, 2^(n-1) different combinations are possible. If we enumerate these combinations using a Gray code, only one cluster changes place between every pair of successive combinations, and the value of the objective function can easily be computed in an incremental fashion. This method is guaranteed to find the best way of reducing the number of clusters from n to two. It often yields a better result than would be obtained by continuing the repeated merging procedure of the previous section until n = 2.

Swapping and Moving elements

In an attempt to further improve the binary partition obtained by one of the above methods, the clustering program considers all possible ways of moving a single letter from one cluster to another, as well as all ways of swapping two letters that are in opposite clusters. The swap or move that improves the value of the objective function by the greatest amount is performed. This process of swapping and/or moving continues until no single swap or move can further improve the value of our objective function.

While the cost of this post-processing stage is substantial, the method often yields significant improvements. In fact, in all test cases to which the algorithm has been applied, it has made the enumeration process described in the preceding section superfluous: the sequence 'merge until n=2, swap/move until done' always produced the same result as the sequence 'merge until n=14, enumerate and find best combination for n=2, swap/move until done'. For some of the larger clustering tasks, the swap/move stage was omitted because it would take too long.
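A sketch of the greedy merging stage, reusing the conditional_mutual_information function from the preceding sketch as the objective. It recomputes the objective from scratch for every candidate merge and is purely illustrative.

    # Sketch of greedy merging: start from singleton clusters and
    # repeatedly merge the pair whose merge loses the least objective.

    def merge_clusters(letters, joint, A, target=2):
        clusters = [{l} for l in letters]
        while len(clusters) > target:
            best = None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    merged = (clusters[:i] + clusters[i + 1:j] +
                              clusters[j + 1:])
                    merged.append(clusters[i] | clusters[j])
                    C = {l: k for k, cl in enumerate(merged) for l in cl}
                    value = conditional_mutual_information(joint, C, A)
                    if best is None or value > best[0]:
                        best = (value, merged)
            clusters = best[1]
        return clusters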
D.0 LIST OF FEATURES USED

This section contains a listing of the features that were obtained by the feature selection program using various optimality criteria.

D.1 QUESTIONS ABOUT THE CURRENT LETTER

The following questions about the current letter were obtained by maximizing MI(QLi(L) ; X(L) | QL1(L), QL2(L), ..., QLi-1(L)), while resolving ties by maximizing MI(QLi(L) ; X(L)).

QL1 - Mutual Information: 0.865437 bits.
  True:  BCDFKLMNPQRSTVXZ
  False: #AEGHIJOUWY'

QL2 - Conditional MI: 0.721985 bits, unconditional MI: 0.714513 bits.
  True:  ACDEIKMOQSTUXYZ'
  False: #BFGHJLNPRVW

Classes defined by QL1 and QL2: #GHJW, AEIOUY', BFLNPRV, CDKMQSTXZ.

QL3 - Conditional MI: 0.629876 bits, unconditional MI: 0.651345 bits.
  True:  #BCEFIKNPQSVXYZ'
  False: ADGHJLMORTUW

Classes defined by QL1, QL2 and QL3: #, AOU, BFNPV, CKQSXZ, DMT, EIY', GHJW, LR.

QL4 - Conditional MI: 0.534354 bits, unconditional MI: 0.595232 bits.
  True:  #ABCDEFGJKMPQRVX'
  False: HILNOSTUWYZ

Classes defined by QL1 through QL4: #, A, BFPV, CKQX, DM, E', GJ, HW, IY, L, N, OU, R, SZ, T.

QL5 - Conditional MI: 0.190333 bits, unconditional MI: 0.611793 bits.
  True:  ACDEFGHIKLOPQST
  False: #BJMNRUVWXYZ'

Classes defined by QL1 through QL5: BV, CKQ, FP and individual letters.

QL6 - Conditional MI: 0.050348 bits, unconditional MI: 0.619650 bits.
  True:  CDFILMSVYZ
  False: #ABEGHJKNOPQRTUWX'

Classes defined by QL1 through QL6: KQ and individual letters.

QL7 - Conditional MI: 0.000844 bits, unconditional MI: 0.670228 bits.
  True:  BCDFQRSVZ
  False: #AEGHIJKLMNOPTUWXY'

QL1 through QL7 identify each letter uniquely.

D.2 QUESTIONS ABOUT THE LETTER AT OFFSET -1

The following questions about the letter to the left of the current letter were obtained by maximizing MI(QLLi(LL) ; X(L) | L, QLL1(LL), QLL2(LL), ..., QLLi-1(LL)) (where L is the current letter, X(L) is the pronunciation of the current letter, and LL is the letter to the left of the current letter), while resolving ties by maximizing MI(QLLi(LL) ; X(L) | L). Although this quantity is conditional on L, I will refer to it as 'unconditional' to distinguish it from the former quantity.

QLL1 - Mutual Information: 0.108138 bits.
  True:  #AEIOQUWY
  False: BCDFGHJKLMNPRSTVXZ'

QLL2 - Conditional MI: 0.076817 bits, unconditional MI: 0.053857 bits.
  True:  #AHILNRSUVX'
  False: BCDEFGJKMOPQTWYZ

Classes defined by QLL1 and QLL2: #AIU, BCDFGJKMPTZ, EOQWY, HLNRSVX'.

QLL3 - Conditional MI: 0.069841 bits, unconditional MI: 0.052867 bits.
  True:  #ABCFHJMNOPSWX
  False: DEGIKLQRTUVYZ'

Classes defined by QLL1, QLL2 and QLL3: #A, BCFJMP, DGKTZ, EQY, HNSX, IU, LRV', OW.

QLL4 - Conditional MI: 0.067447 bits, unconditional MI: 0.052367 bits.
  True:  #BDFJQRSUWXYZ'
  False: ACEGHIKLMNOPTV

Classes defined by QLL1 through QLL4: #, A, BFJ, CMP, DZ, E, GKT, HN, I, LV, O, QY, R', SX, U, W.

QLL5 - Conditional MI: 0.018370 bits, unconditional MI: 0.052650 bits.
  True:  #BDEGHJMVWXY'
  False: ACFIKLNOPQRSTUZ

Classes defined by QLL1 through QLL5: BJ, CP, KT and individual letters.

QLL6 - Conditional MI: 0.005148 bits, unconditional MI: 0.055200 bits.
  True:  #BFHPT
  False: ACDEGIJKLMNOQRSUVWXYZ'

QLL1 through QLL6 uniquely identify each letter.

D.3 QUESTIONS ABOUT THE LETTER AT OFFSET +1

The following questions about the letter to the right of the current letter were obtained by maximizing MI(QLRi(LR) ; X(L) | L, QLR1(LR), QLR2(LR), ..., QLRi-1(LR)) (where L is the current letter, X(L) is the pronunciation of the current letter, and LR is the letter to the right of the current letter), while resolving ties by maximizing MI(QLRi(LR) ; X(L) | L). Although this quantity is conditional on L, I will refer to it as 'unconditional' to distinguish it from the former quantity.

QLR1 - Mutual Information: 0.142426 bits.
  True:  #AEIORUWY'
  False: BCDFGHJKLMNPQSTVXZ

QLR2 - Conditional MI: 0.089281 bits, unconditional MI: 0.068588 bits.
  True:  #ABCEFJLMPQSTUVW
  False: DGHIKNORXYZ'

Classes defined by QLR1 and QLR2: #AEUW, BCFJLMPQSTV, DGHKNXZ, IORY'.

QLR3 - Conditional MI: 0.092730 bits, unconditional MI: 0.061178 bits.
  True:  #FIJLNTWZ
  False: ABCDEGHKMOPQRSUVXY'

Classes defined by QLR1, QLR2 and QLR3: #W, AEU, BCMPQSV, DGHKX, FJLT, I, NZ, ORY'.

QLR4 - Conditional MI: 0.073014 bits, unconditional MI: 0.081467 bits.
  True:  #BDEFHILMVY
  False: ACGJKNOPQRSTUWXZ'

Classes defined by QLR1 through QLR4: #, AU, BMV, CPQS, DH, E, FL, GKX, I, JT, NZ, OR', W, Y.

QLR5 - Conditional MI: 0.041879 bits, unconditional MI: 0.085400 bits.
  True:  #BDFJKQRSTUWZ'
  False: ACEGHILMNOPVXY

Classes defined by QLR1 through QLR5: CP, GX, JT, MV, QS, R' and individual letters.

QLR6 - Conditional MI: 0.005090 bits, unconditional MI: 0.051465 bits.
  True:  #ABCEFHIJLNOPQRUVXY
  False: DGKMSTWZ'

Classes defined by QLR1 through QLR6: CP and individual letters.

QLR7 - Conditional MI: 0.001989 bits, unconditional MI: 0.088821 bits.
  True:  #ABDEFGIJKLOPQRSTUWXYZ'
  False: CHMNV

QLR1 through QLR7 define all letters uniquely.

D.4 QUESTIONS ABOUT PHONES

The following questions about phones were obtained by maximizing MI(QPi(P) ; X(L) | QP1(P), QP2(P), ..., QPi-1(P)) (where X(L) is the pronunciation of the current letter, and P is the last phone before the pronunciation of the current letter), while resolving ties by maximizing MI(QPi(P) ; X(L)).
QP1 - Conditional MI: 0.265914 bits, unconditional MI: 0.265914 bits.
  True:  UHO UH1 ER1 AE1 EI1 AA1 AX UXG EHO EH1 EE1 IXO IX1 IXG OU1 AW1 UU1 UX1
  False: ERO BX CH DX EEO FX GX HX JH KX LX MX NX PX RX SX SH TX TH DH UUO VX WX JX NG ZX ZH #

QP2 - Conditional MI: 0.164915 bits, unconditional MI: 0.128966 bits.
  True:  ERO AX EEO IXO TH UUO NG ZX #
  False: UHO UH1 ER1 AE1 EI1 AA1 UXG BX CH DX EHO EH1 EE1 FX GX HX IX1 IXG JH KX LX MX NX OU1 AW1 PX RX SX SH TX DH UU1 UX1 VX WX JX ZH

Classes defined by QP1 and QP2:

* UHO UH1 ER1 AE1 EI1 AAO AA1 UXG EHO EH1 EE1 IX1 IXG OU1 AW1 UU1 UX1
* ERO EEO TH UUO NG ZX #
* AX IXO
* BX CH DX FX GX HX JH KX LX MX NX PX RX SX SH TX DH VX WX JX ZH

QP3 - Conditional MI: 0.116981 bits, unconditional MI: 0.076418 bits.
  True:  ERO ER1 EI1 AX EEO EE1 IXG LX MX NX OU1 RX SX TH UUO UU1 NG ZX
  False: UHO UH1 AE1 AA1 UXG BX CH DX EHO EH1 FX GX HX IXO IX1 JH KX AW1 PX SH TX DH UX1 VX WX JX ZH #

Classes defined by QP1 through QP3:

* UHO UH1 AE1 AA1 UXG EHO EH1 IX1 AW1 UX1
* ER1 EI1 AAO EE1 IXG OU1 UU1
* ERO EEO TH UUO NG ZX
* AX
* BX CH DX FX GX HX JH KX PX SH TX DH VX WX JX ZH
* IXO
* LX MX NX RX SX
* #

QP4 - Conditional MI: 0.095049 bits, unconditional MI: 0.062220 bits.
  True:  ER1 CH DX EEO EE1 IXG JH NX OU1 AW1 SX SH TX TH DH UU1 UX1 VX JX NG ZX ZH
  False: UHO UH1 ERO AE1 EI1 AA1 AX UXG BX EHO EH1 FX GX HX IXO IX1 KX LX MX PX RX UUO WX #

Classes defined by QP1 through QP4: UHO UH1 AE1 AA1 UXG EHO EH1 IX1, EI1 AAO, ERO UUO, ER1 EE1 IXG OU1 UU1, AX, BX FX GX HX KX PX WX, CH DX JH SH TX DH VX JX ZH, EEO TH NG ZX, IXO, LX MX RX, NX SX, AW1 UX1, #.

QP5 - Conditional MI: 0.077567 bits, unconditional MI: 0.037309 bits.
  True:  UHO AX UXG DX GX IXG KX MX OU1 PX SX TX UUO UX1 NG
  False: UH1 ERO ER1 AE1 EI1 AA1 BX CH EHO EH1 EEO EE1 FX HX IXO IX1 JH LX NX AW1 RX SH TH DH UU1 VX WX JX ZX ZH #

QP6 - Conditional MI: 0.042339 bits, unconditional MI: 0.054139 bits.
  True:  AE1 EI1 UXG BX DX EHO EH1 EE1 FX IX1 IXG JH KX LX SX TH DH UUO UU1 VX NG ZX #
  False: UHO UH1 ERO ER1 AA1 AX CH EEO GX HX IXO MX NX OU1 AW1 PX RX SH TX UX1 WX JX ZH

QP7 - Conditional MI: 0.017873 bits, unconditional MI: 0.046538 bits.
  True:  ERO AE1 AA1 AX BX CH DX EEO EE1 GX HX IXO IX1 NX AW1 TX TH DH UX1 VX JX ZH #
  False: UHO UH1 ER1 EI1 UXG EHO EH1 FX IXG JH KX LX MX OU1 PX RX SX SH UUO UU1 WX NG ZX

Classes defined by QP1 through QP7: AE1 IX1, CH JX ZH, DH VX, EHO EH1, and individual phones.

QP8 - Conditional MI: 0.003825 bits, unconditional MI: 0.039209 bits.
  True:  UHO ERO ER1 EI1 UXG CH EH1 EEO EE1 HX IX1 IXG LX MX OU1 RX SX SH TH DH UUO UU1 WX NG ZX ZH
  False: UH1 AE1 AA1 AX BX DX EHO FX GX IXO JH KX NX AW1 PX TX UX1 VX JX #

Classes defined by QP1 through QP8: CH ZH, and individual phones.

E.0 OVERVIEW OF THE DECISION TREE DESIGN PROGRAM

The basic control structure of the decision tree design program is illustrated below. The details of the mutual information computation and the termination tests are omitted. As illustrated, the program retains the sample data that corresponds to terminated (unextended) nodes.
    design_decision_tree: procedure;
        do for level = 1, 2, 3, ...;
            call process_nodes_at_current_level();
            if there are no extendable nodes at the next level,
                then stop, else continue;
        end;
    end;

    process_nodes_at_current_level: procedure;
        create an empty output file;
        open the input file;
        do for all nodes at this level, from left to right;
            call process_node();
        end;
        call copy_data();
        replace the input file by the (now full) output file;
    end /* of level */;

    process_node: procedure;
        call copy_data();
        read the data for the current node;
        determine the decision function for the node, or decide not to extend it;
        if the node is extended then
            read data for node, copying data for left subtree to output file;
            read data for node, copying data for right subtree to output file;
        else
            read data for node, copying it to output file;
    end /* of node */;

    copy_data: procedure;
        read data from input file, copying it to output file, either until
        a sample for the current node is found, or until the end of the
        input file is reached;
    end;

F.0 EVALUATION RESULTS

The following is a listing of the words whose base forms were predicted incorrectly in the 'final experiment' by either of the two scoring procedures used (see "Performance of the complete system" on page 83). Some words occurred several times in the sample data; the number of times that each word occurred is given following the spelling of each entry. For words that occurred more than once, repeated errors were counted separately. For example, the word 'SINGLE' occurred five times, and an erroneous base form was produced four out of five times.

The base forms are given in the notation employed by [Cohe75]; a single digit at the end of each vowel phone represents the stress on that vowel, as follows: 0 - unstressed, 1 - stressed. The number of errors found by each of the scoring procedures is indicated to the right of each entry: L (for Low) indicates the number of definite errors; H (for High) indicates the number of possible errors.
    single          SX IX NG UHO LX
    single          SX IX NG UHO LX
    single          SX IX NX UHO LX
    single          SX IX NG UHO LX
    generated       JH EH1 NX RX EI1 TX IXO DX
    professionals   PX RX UHO FX EH1 SH UHO LX ZX
    reduced         RX IXO DX UU1 SX TX
    remove          RX EEO MX OU1 VX
    remove          RX IXO MX UU1 FX
    sometime        SX UH1 MX TX AX IXG MX
    Hugh            HX JX UU1 GX
    Hugh            HX JX UU1 GX
    anxious         EI1 NG KX SH UHO SX
    apparently      UHO PX UHO RX EH1 NX TX LX EEO
    appear          UHO PX JX ERO
    appeared        UHO PX JX ERO UHO DX
    arranged        UHO RX RX EI1 NX JH UHO DX
    associated      UHO SX OU1 SX EEO EI1 TX IXO DX
    capabilities    KX EI1 PX UHO BX UHO LX IXO TX EEO ZX
    caused          KX AW1 UH1 ZX TX
    checked         CH EH1 KX UHO DX
    content         KX AA1 NX TX EH1 NX TX
    copier          KX AA1 PX JX ERO
    criteria        KX RX AX IXG TX JX RX EEO UHO
    distribute      DX IXO SX TX RX UHO BX JX UU1 TX
    distributor     DX IXO SX TX RX UHO BX JX UHO TX ERO
    earliest        ER1 EEO EH1 SX TX
    elements        EH1 MX UHO NX TX SX
    examination     IXO GX ZX AE1 MX UHO NX EI1 SH UHO NX
    insured         IX1 NX SH UX RX DX
    join            JH IXG NX
    lower           LX RX
    magazine        MX UHO GX EI1 ZX EE1 NX
    participated    PX AA RX TX IX SX UHO PX EI1 TX IXO DX
    pilot           PX AX IXG UHO LX UHO TX
    practices       PX RX AE1 KX TX SX NX ZX
    raised          RX EI1 SX TX
    ready           RX EI1 DX EEO
    reflect         RX IXO FX LX EH KX TX
    relating        RX IXO LX EI1 TX IXO NG
    represented     RX EH1 PX RX IXO ZX EHO NX TX IXO DX
    resolved        RX IXO ZX AW1 LX PX EEO DX
    resume          RX IXO SX UU1 MX
    say             SX EH1
    significantly   SX IX1 GX NX IX FX IXO KX UHO NX TX LX EEO
    supplemental    SX UH1 PX UHO LX MX EH1 NX TX UHO LX

    total                                        L = 41    H = 64

Calculated Error Rates:

    Low Estimate:   41 / 1914 = 2.14%
    High Estimate:  64 / 1914 = 3.34%