iZA-K lENEER bI:' i XMAR TK7855 .M41 .R43 6 N go.41 Knowledge-Based Pitch Detection RLE Technical Report No. 518 June 1986 Webster P. Dove Research Laboratory of Electronics Massachusetts Institute of Technology Cambridge, MA 02139 USA This work has been supported in part by the Advanced Research Projects Agency monitored by the Office of Naval Research under Contract No. N00014-81-K-0742, in part by the National Science Foundation under Grant ECS.--07285, in part by Sanders Associates, Inc., and in part by an Amoco Foundation Ffow'ship. Many thanks must surely be for persons who were in the know. For all their help and faith in me, thanks Randy D. and AVO. For others helpful all along, through evenings late and sessions long. My thanks to persons short and tall to DSPG members all. A special word is granted few whose office shared and social glue was help when I was feeling blue. To Stephen, Doug and Cory too. During all this tribulation, who paid the remuneration? A generous organization, thanks to AMOCO Foundation. My greatest gratitude in life is for a blessing I have had in trying times both good and bad. For all her help I love my wife. Knowledge-Based Pitch Detection by Webster P. Dove Submitted to the Department of Electrical Engineering and Computer Science on May 9, 1986 in partial fulfillment of the requirements for the Degree of Doctor of Philosophy in Electrical Engineering ABSTRACT Many problems in signal processing involve a mixture of numerical and symbolic knowledge. Examples of problems of this sort include the recognition of speech and the analysis of images. This thesis focuses on the problem of employing a mixture of symbolic and numerical knowledge within a single system, through the development of a system directed at a modified pitch detection problem. For this thesis, the conventional pitch detection problem was modified by providing a phonetic transcript and sex/age information as input to the system, in addition to the acoustic waveform. The Pitch Detector's Assistant (PDA) system that was developed is an interactive facility for evaluating ways of approaching this problem. The PDA system allows the user to interrupt processing at any point, change either input data, derived data, or problem knowledge and continue execution. This system uses a representation for signals that has infinite domains and facilitates the representation of concepts such as zero-phase and causality. Probabilistic representations for the uncertainty of symbolic and numerical assertions are derived. Efficient procedures for combining such assertions are also developed. The Normalized Local Autocorrelation waveform similarity measure is defined, and an efficient FFT based implementation is presented. The insensitivity of this measure to exponential growth and decay is discussed and its significance for speech analysis. The concept of a history independent rule system is presented. The implementation of that concept in the PDA and its significance are described. A pilot experiment is performed that compares the performance of the PDA to the Gold-Rabiner pitch detector. This experiment indicates that the PDA produces 1/2 as many voicing errors as the Gold-Rabiner program across all tested signal-to-noise ratios. It is demonstrated that the phonetic transcript and sex/age information are significant contributors to this performance improvement. Thesis Supervisor: Alan V. Oppenheim Professor of Electrical Engineering Title: Thesis Supervisor: Randall Davis Associate Professor of the Sloan School of Management Title: i For my daughter Table of Contents Chapter 1 Introduction ......................................................................................... 1.1 Knowledge-Intensive Problems ................................................................. 1.2 Knowledge-Based Solutions .......................................................................... 1.3 A Modified Pitch Detection Problem ............................................................ 1.4 A Pitch Detectors Assistant ................................................................. 1.5 An Overview of the Contributions .............................................................. 1.6 Organization of the Thesis ..................................................................... ...... 1 1 2 5 8 9 10 Chapter 2 Domain Knowledge ................................ ............................................. 2.1 Knowledge about Pitch Production .............................................................. 2.1.1 An Overview of the Oscillatory System .............................................. 2.1.2 General Properties of Vocal Excitation in Time and Frequency ......... 2.1.3 Physiological Factors Influencing FO ....................................... ............. 2.1.4 Phonetic and Phonological Influences ................................................... 2.1.5 Syntactic Effects ................................................................. 2.1.6 Other Linguistic Factors ........................................................................ 2.1.7 Extra-Linguistic Factors ........................................................................ 2.2 Knowledge from Pitch Detection Algorithms .............................................. 2.2.1 Temporal Similarity .............................................................................. 2.2.2 Data Reduction ...................................... ...................................... 2.2.3 Harmonic Structure ................................................................. 2.2.4 Other Methods ....................................................................................... 2.2.5 Voicing Determination ........................................................................... 2.3 Conclusion ..................................................................................................... 11 12 12 15 18 21 23 24 25 28 31 35 41 43 43 45 Chapter 3 System Architecture ........................................................................... 3.1.1 An Example of System Operation ........................................................ . 3.1.2 The Symbolic Input ................................................................. 3.1.3 The Outputs ........................................................................................... 3.2 Pitch Detection in PDA ................................. ................................ 3.2.1 Assertions ................... .............................................. 3.2.2 Voicing Related Assertions ................................................................. 3.2.3 FO Related Assertions .............................................. ................... 3.2.4 Combining Assertions ........................................................................... 3.3 The Rules ................................................................. 47 47 50 51 53 53 54 56 56 59 Contents Contents 3.4 Determining Voicing ......................... ........................................ 3.4.1 Overview ................................................................. 3.4.2 Knowledge in the PDA ........................................ ......................... ......................... 3.4.3 Using the Knowledge ........................................ 3.5 Determining FO ................................................................. 3.5.1 Concepts ........................................ .......................... 3.5.2 Examples of Implementation ............................................................... 3.6 Epochs ............................................................................................................ 3.6.1 Concept ................................................................................................... 3.6.2 Related systems ..................................................................................... 3.7 Knowledge Manager ...................................................................................... .......................... 3.7.1 Overview ....................................... 3.7.2 Running the Rule System ................................................................. ......................... 3.7.3 Network Maintenance ........................................ 3.7.4 Confidence Computation ....................................................................... 3.8 Rule System ................................................................................................... .......................... 3.8.1 Concepts ........................................ 3.9 Conclusions ................................................................. 65 65 67 68 77 77 79 87 88 95 96 96 96 100 102 104 104 107 Chapter 4 System Details ..................................................................................... 4.1 The KBSP Package ......................................................................................... 4.1.1 The Precursor (SPL) ................................................................. 4.1.2 The KBSP Package ................................................................................. 4.2 The Epoch System ......................................................................................... ............... ........................... 4.3 The Rule System ........................ 4.3.1 Overview ................................................................................................ 4.3.2 Architecture ........................................................................................... 4.4 The Knowledge Manager .............................................................................. 4.4.1 Running the Rule System ...................................................................... 4.4.2 Network Management ........................................................................... 4.4.3 Computing Confidence ........................................................................... ......................... 4.5 Numerical Pitch Detection ........................................ 4.5.1 Normalized Local Autocorrelation ................................................... .... 4.5.2 Determining Pitch from Similarity ...................................................... 4.5.2.1 Numerical Pitch Advice ................................................................ 4.5.2.2 Determining the Standard Deviation of FO Estimates ................. 4.6 Conclusions .................................................... ............. 108 108 109 111 115 127 127 130 138 138 139 145 151 151 156 159 164 165 Chapter 5 System Evaluation .............................................................................. ................................... 5.1 Introduction .............................. 5.2 A Comparison of PDA and G-R ................................................................... ......................... 5.2.1 Performance Criteria ........................................ 167 167 168 168 ii Contents 5.2.2 5.2.3 5.2.4 5.2.5 5.2.6 Caveats ................................................................. .................................. Test Results ............................... A Typical PDA Run ................................................................. ......................... The Impact of Symbols ........................................ Comments ................................................................. 169 171 173 174 181 ........................ 182 6.1 Introduction ................................................................. 6.2 Quantitative versus Qualitative Symbols .................................................... 6.3 Symbolic/Numerical Interaction ................................................................. 6.4 Combining Results ........................................ ........................ 6.5 Alternate Values ......................................... 6.6 Mapping Numerical Values to Confidence ................................................... 6.7 Use of Explicit Statistics for Assertions ...................................................... 6.8 Rule Conditions: Match versus Function ..................................................... 6.9 The Use of Dependency ................................ ................................. 6.10 Comments .................................................................................................... 182 182 185 191 194 195 196 198 202 203 Chapter 7 Conclusions ........................................................................................... 7.1 Thesis Contributions ................................................................. 7.2 Future Work ................................................................. 205 205 207 Chapter 6 KBSP .................................... iii . List of Figures 1.1 Inputs: W aveform, Transcript and Sex/Age ......................................... 2.1 X-ray and schematic of the human vocal system ................................ 2.2 Glottal Pulse Approximation in Time and Frequency ......................... 2.3 Voiced W aveform and Spectrum .......................................................... 2.4 Unvoiced Waveform and Spectrum ...................................................... 2.5 A Pop in Time and Frequency ........................................ ................... 2.6 Variation in Glottal Waveshape with Intensity ................................... 2.7 End of Sentence Glottalization ........................................................... 2.8 Clipping Function ................................................................................... 2.9 Gold-Rabiner Extrema ........................................................... 2.10 Trigger Modules ................................................................................... 2.11 Period Estimates ............... .................... ........................ 3.1 Inputs: Symbolic Transcript and Waveform ........................................ 3.2 The Outputs of the PDA ........................................................................ 3.3 Basic Approach ......................... .................................. 3.4 Voicing Assertions ........................................................... 3.5 Support for <VOICED> ......................................... .................. 3.6 Support for <FINAL-PITCH> ........................................................... 3.7 Determining Voicing ........................................................... 3.8 Voicing from Phoneme Identity ........................................................... 3.9 Voicing from Stop Timing ...................................................................... 3.10 Finding Gaps ........................................................... ................... 3.11 Rule to find stop bursts ........................................ 3.12 Voicing from Sonorant Power ........................................................... 3.13 Silence from Broadband Power ........................................................... 3.14 Voicing from Similarity ........................................................... 3.15 Determining FO ........................................................... 3.16 Typical Speech Similarity for /i/ ........................................................ 3.17 FO from Similarity ............................................................................... 3.18 FO from Phoneme Identity ......................................... .................. 3.19 FO from Sex and Age ........................................................... 3.20 FO from Phonemes ............................................................................... 3.21 FO Advice from Phonetic Context ....................................................... 3.22 Preliminary Pitch ................................................................................. Figures 6 12 15 16 16 17 18 26 33 37 38 39 47 48 53 55 57 59 65 68 69 71 73 74 75 76 77 79 79 82 83 83 85 86 Figures 3.23 A priori F ........................................ .................... 3.24 An Example of Epochs ......................................................................... 3.25 Bad Merging .......................................................................................... 3.26 The Knowledge Manager ........................................ ................... 3.27 Phonetic Support for Final Voiced ...................................................... 3.28 Support for Final Pitch ........................................................................ 3.29 Networked Objects ............................................................................... 4.1 The Merging of Epochs ........................................ ................... 4.2 Inefficient Condition ........................................................... 4.3 Efficient Condition ...................................... ..................... 4.4 The Effects of Merging ........................................................................... 4.5 Rules and the Knowledge Manager ....................................................... 4.6 Typical Rule Condition Network .......................................................... 4.7 YAPS Network Architecture ...................................... ............. 4.8 Inefficient YAPS Network ........................................................... 4.9 Typical Support Relationship ........................................................... 4.10 Alternation of Assertions and Bindings .............................................. 4.11 Support for r from s ............................................................................. 4.12 Modified Posterior Probability ........................................................... 4.13 Amplitude Variation in Speech ........................................................... 4.14 Fast NLA ........................................................... 4.15 Voiced stop closure ........................................................... 5.la Input Information ................................................................................ 6.1 Conditions done with Matching ........................................................... 6.2a Conditions done with Functions #1 .................................................... 6.2b Conditions done with Functions #2 .................................................... ii 86 89 93 96 101 103 104 117 124 124 125 130 133 135 135 143 144 147 148 152 156 162 173 198 199 200 List of Tables 5.1 Example of Results ........................... ................................ 5.2 Performance vs SNR ........................................................... 5.3 Processing without Symbols .................................................................. 5.4a Difficult Sentence with Symbols .......................................................... 5.4b Difficult Sentence without Symbols .................................................... Tables 171 172 174 180 180 CHAPTER 1 Introduction Goals In this thesis our goal was to develop techniques for combining symbolic and numerical knowledge in signal processing systems. We felt that systems for solving problems in which both kinds of knowledge could be found were dominated by either a symbolic or numerical approach, and that a more balanced application of symbolic and numerical knowledge would lead to better performance. 1.1. Knowledge-Intensive Problems There are numerous signal processing problems in which the signals involved and the phenomena that underlie them do not all fit straightforward mathematical models. In most if not all of these problems, there is a lot of knowledge available, but this knowledge can't all be expressed in simple mathematical terms. An example of a problem of this sort is tracking seagoing vessels. There are many pieces of applicable input information: ocean acoustic data, radar data, satellite data, sightings, course plans, ocean currents, atmospheric conditions, etc. There is knowledge about wave propagation, ocean thermal and salinity phenomena, mechanical vibration, structural acoustics, propeller ratios, biological noises, typical biological feeding areas, typical shipping lanes, behavior patterns of commercial and military vessels (both national and foreign), activity patterns on ships, theories of sound propagation, etc. Much of this information can be further qualified by knowledge about 1 Chapter 1 specific vessels, captains, companies and world events. Another problem of this sort is the recognition of requests given as speech to a computer. In this case the input information is primarily acoustic. The pertinent knowledge includes sampling and filtering issues, acoustic temporal and spectral properties of speech, phonetic and phonological properties of the language, properties of noise sources, the phonetic and phonological manifestations of various accents, dialects, and individual speakers, the grammar of the language, the semantics of the requests and the nature of the information available to satisfy them. These are two examples of a wealth of problems of this type. Considering these problems it is apparent that some information can be thought of as numerical (e.g. noise power spectra), and some as symbolic (e.g. language grammar). It is our belief that systems must be capable of employing both types of knowledge effectively if they are to employ most of the knowledge. 1.2. Knowledge-Based Solutions Depth versus Breadth Systems to solve problems of this type can approach the problem of dealing with this knowledge in different ways. One approach is to take a small subset of the knowledge and build a system around it. An example might be a system that assumes a probabilistic model for the sound generated by ships, assumes a dynamic model for their motion, and computes an estimate of the trajectories of all vessels by choosing the scenario that maximizes the probability of receiving the signals that were in fact received. Such systems use a small subset of the available knowledge because the cost of computing such "optimal" estimates becomes prohibitive very rapidly as the com- 2 Chapter 1 plexity of the model grows. This approach can be characterized as using a small amount of knowledge in a very powerful way, a "depth" approach. Another approach for dealing with knowledge-intensive problems is to employ heuristics that make use of many pieces of knowledge. In this case, the system is unlikely to be provably optimal since the interaction of the components of the system would be too difficult to analyze. Such systems often don't have a specific mathematical model for the problem they solve. Their design is based on concepts of how the parts of the system should function and interact. Concepts derived from common sense, experience with other approaches to the task, or from a (more tractable) This approach achieves its performance mathematical analysis of the subproblem. through extensive application of the knowledge, it goes for breadth. Symbolic versus Numerical In the field of signal processing most systems for problems that do not require a symbolic result are dominated by the use of numerical knowledge. In part this is because many such systems emphasize depth, and it is only natural that the ideas that are used are chosen from the mathematically attractive aspects of the knowledge; such ideas can be manipulated with the powerful mathematical tools at the signal processor's disposal. However, even in signal processing systems that emphasize breadth, the data is represented and manipulated in primarily numerical terms. In the majority of signal processing systems for speech pitch detection, one does not find any reference to phonetic, or linguistic aspects of pitch production (see chapter 2 for a description of such knowledge). The focus of systems written by the signal processing community is on representing the mathematical properties of signals and manipulating those 3 - Chapter 1 properties, not on modeling or manipulating representations of non-numerical phenomena that might underlie those signals. Conversely, if one examines systems developed by the expert systems community for problems that have both numerical and symbolic components, one sees a tendency to use symbolic representations, and a preference for the use of symbolic processing as the primary means of problem solving. The HEARSAY[1] system for recognizing spoken database requests, and the SIAP[2] system for determining ship locations from acoustic data are both dominated by symbolic approaches. In both systems, numerical processing appears as an initial stage to convert the nput information to a primarily symbolic representation, with little use of numerical representations or numerical processing thereafter. Knowledge-Based Signal Processing There does not appear to be any fundamental reason why a balanced approach to the use of knowledge is not possible with these problems, systems which would employ symbolic and numerical representations throughout, and made "equal" use of symbolic and numerical processing techniques to solve the problem. Such a balance, by virtue of its potentially greater repertoire of knowledge, could achieve better performance than that possible by employing one approach or the other. The focus of knowledge-based signal processing (KBSP) is on learning how to build systems that employ a substantial amount of knowledge to solve knowledgeintensive problems; systems that freely make use of numerical and symbolic knowledge, numerical and symbolic processing methods. As was mentioned at the start of this section, our intent was to learn how to combine symbolic and numerical knowledge in signal processing systems. 4 ·r Specifically, we were looking for new Chapter 1 computer representations for numerical and symbolic information, ways to combine those representations, and other ideas that would contribute to building such systems. The approach we took to achieve these goals was to select a problem that had a balance of numerical and symbolic knowledge, and build a system to solve it, placing particular attention on the development of new ideas in the system that would be applicable to other problems. The problem we selected was a variant of the pitch detection problem, and the system we built is called the Pitch Detector's Assistant (PDA). 1.3. A Modified Pitch Detection Problem The conventional problem of pitch detection involves analyzing a digitized speech waveform, and producing an aligned voicing decision and fO estimate. This thesis focused on a problem that was somewhat different because solving it was to be a catalyst for learning about combining symbolic and numerical knowledge, more than an end in itself. The pitch detection problem for this thesis was chosen to permit a balanced use of symbolic and numerical information, to encourage a balanced application of symbolic and numerical processing techniques, and to achieve this balance from the earliest stages of processing to the final output. This was accomplished in two ways: by changing the pitch detection problem statement, and by placing certain demands on the system that solved it. The conventional pitch detection task is largely numerical. The input is a numerical sequence and one of the two outputs (fO) is numerical. In addition, conventional approaches to pitch detection describe time in numerical terms (by indexing with samples as opposed to using fundamental speech units such as phonemes or syllables). 5 ____ __ Chapler I Therefore, the problem of aligning voicing transitions and fO estimates with the waveform is also in effect a numerical one. Since the conventional problem is primarily numerical, our changes involved augmenting it symbolically. We made symbolic information available as input, information relevant to pitch and voicing determination. In particular, we supplied a phonetic transcript of the utterance (including word and syllable markings), as well as the sex and age of the speaker (either MALE, FEMALE or CHILD). An example of these inputs is shown in figure 1.1. The waveform in the upper 1.78 ,'Ill -2.69 896 DECLARATIVE Phrase Y To he has the eyes bluest Words D. Fiz hiN A-c Syllables bkiu st't adz To c> h Phonemes iP a R C3 - z - A O -Q b I u e s t' ' - PI~ I3C t :~~~ to 3n Z =- I 0 TRANSCRIPT Sex: Female Figure 1.1 Inputs: Waveform, Transcript and Sex/Age 6 22896 Chapter I part of the figure is digitized at 10 kHz and represented in the computer as floating point numbers. The range of values are shown to the left of the vertical axis, and the range of indices are shown below the horizontal axis. The transcript is shown in the lower part of the figure. The four types of transcript marks (phrase, words, syllables and phonemes) are shown in the labeled vertical strata of the picture, and each mark is identified by a string of characters, with its extent depicted by the line below it. The widths of the boxes which border these lines signify the uncertainty with which those boundaries are known. Identical boxes which lie above one another are in reality just images of the same box as viewed from the different strata. The thin line which appears under most syllable marks depicts the "syllabic nucleus", the vowel or vowel-like phoneme around which that syllable is built. Symbolic input of this form helps our goal of combining symbolic and numerical knowledge in three ways: * This symbolic information balances the numerical information already present in the problem, so there are opportunities to combine the two. * Providing this information as input can reduce the tendency for early processing to be dominated by a numerical approach, as has been the case with a number of expert systems for processing signals. Providing the transcript as input alleviated what would otherwise have been a major task: the analysis of the waveform to generate that information. Such a difficult symbolic analysis problem might have led to symbolic processing being dominant. * Though it is unusual to augment the pitch detection problem in this fashion, there are potential uses for a system that could solve it. These include the generation of pitch tracks for use in the enhanced reconstruction of archival speech material, pitch analysis for talking computer databases, and reference pitch tracks for testing other pitch detectors. Also, similar problem scenarios exist that are unrelated to speech. 7 A Chapter One example is the enhancement of satellite imagery when road maps of the area are available. 1.4. A Pitch Detectors Assistant Another influence on this thesis was our abstract picture of the computer system that was to solve it. We intended to investigate many different approaches to combining numerical and symbolic knowledge. However, there are limits on the complexity of a system that can be built in one try. Also, it was not clear at the outset exactly what approaches would be interesting, nor was it clear what approaches would be feasible in the framework of a thesis. The two alternatives in such a situation are to build many separate systems each of which demonstrates some new ideas, or to develop a single system over a long time period by adding new ideas incrementally. We took the latter approach. While we did not intend expending all our effort trying to maximize the knowledge present in this system, it was clear that future systems of this type would be large and would probably be developed in an incremental fashion. Thus, anything that we learned about facilitating the extended development of such systems would be a useful contribution. Also, the amount of knowledge we incorporated into this system would depend on the ease of incorporation, and the more knowledge that was incorporated in our system, the more we would learn about the process and power of combining symbolic and numerical knowledge. For these reasons we chose to picture the system we developed as a Pitch Detector's Assistant (PDA). This view places a premium on the ability to interact with and extend the capabilities of the system. The focus on extensibility is appropriate for any incrementally developed system, the focus on interaction helps the developer to analyze deficiencies 8 Chapter 1 1 in its behavior that suggest the need for new knowledge, and helps the developer to debug any problems that occur when the system is being modified. 1.5. An Overview of the Contributions This thesis makes several contributions related to programming symbolic and numerical knowledge. Some major results are enumerated below: * * * We argue that converting numerical information to symbolic information on input, and using only symbolic processing thereafter is inadvisable. We suggest system architectures with which that can be avoided. We present a number of different ways that symbolic and numerical information can interact and give examples of such interactions drawn from the PDA system. We offer new representations for uncertainty in symbolic and numerical assertions, and new methods for processing uncertainty that are derived mathematically from a few basic principles and implemented efficiently in the PDA. We offer a new model for the representation of signals in systems that admits a broad class of signals (including those with infinite duration and those that are periodic) and makes the representation of temporal phenomena like linear-phase and zero-phase straightforward. Besides results that pertain to the representation of numerical and symbolic knowledge in systems, there are results that pertain to signal processing (a new algorithm for measuring waveform similarity that is insensitive to waveform envelope), rule based systems (a rule system whose results are dependent on the currently active rules and data, but not dependent the order of entry of rules and data, nor rules and data that were active but have since been retracted), and pitch detection (the PDA is shown to outperform the Gold-Rabiner pitch detector[3] over a wide range of signalto-noise ratios, and the symbolic information is shown to be a determining factor in the PDA's performance). 9 Chapter 1 1.6. Organization of the Thesis This thesis is organized into 7 chapters. The Introduction has presented the goals of the work and motives behind them, described and justified the specific problem we attempted to solve, and mentioned some of the important contributions of the work. Chapter 2 presents the knowledge applicable to pitch detection with brief descriptions of the ideas and references for further reading. Chapter 3 describes the general architecture of the PDA system, the specific knowledge that is represented in it and how that knowledge is made to work. Chapter 3 also describes the major components of the PDA from a conceptual standpoint. Chapter 4 discusses the major components of the system in detail and examines issues of implementation. Chapter 5 presents a comparison of the PDA with the G-R pitch detector. Chapter 6 discusses significant points about the implementation of this system and relates them to the general problem of representing numerical and symbolic knowledge in systems, and chapter 7 discusses some results that relate to signal processing, expert systems and pitch detection and suggests some directions for future work. 10 A9 CHAPTER 2 Domain Knowledge This chapter presents knowledge that bears on the pitch detection problem. This knowledge originates from two research areas: speech communications, and signal processing. From speech communications there is knowledge about the mechanics, acoustics and linguistics of pitch. From signal processing there is experience with solutions to this problem in the form of algorithms for pitch detection. While the following sections describe what we learned about pitch, bear in mind that it was not all incorporated into the program that we built (chapter 3 discusses the knowledge that is actually incorporated into our program and how it is used). There are three purposes served by a chapter which presents the whole spectrum of pitch knowledge: Provide a point of comparison. Since we feel that our program incorporates more knowledge than its predecessors, it is only fair to show all the knowledge so the reader can judge the significance of our accomplishment. Illuminate the structure of the knowledge Much of the effort in building this system involved designing data structures and procedures to represent pitch and the phenomena which influence it. Exposure to the available knowledge should help the reader understand the rationale behind these decisions by clarifying the nature of the things to be represented. Provide an archive. This chapter can serve related work by compiling references to the topics of pitch and its measurement into a single document. Chapter 2 Chapter 2 2.1. Knowledge about Pitch Production 2.1.1. An Overview of the Oscillatory System This is a brief presentation of the physiology of pitch production. A more detailed account can be found in a paper by Ohala[4] and the references cited therein. The human vocal system is depicted in an X-ray photograph and schematic diagram in figure 2.1[5]. From the lungs, a single passage passes through the vocal folds of the glottis and up to the velum. There it branches with one passage going past the velum to the nose and the other passage going past the tongue to the mouth. The velum can be used to close off the passage through the nose (the nasal tract), so only a single passage extends from the glottis. The tongue, lips and jaw can be used NOSTRI MOUTH Figure 2.1 X-ray and schematic of the human vocal system 12 I Chapter 2 to vary the cross-section of the passage through the mouth (the vocal tract). By controlling the topology of the passages with the velum and the geometry of the vocal tract with the other articulators, humans implement a variable filter that controls the different sounds of speech. The arrangement of this apparatus that is used for a particular sound is called the "articulatory configuration" for that sound. This system of passages is acoustically excited by three possible means: glottal pulses, turbulence generated by constrictions imposed on the air flow, and transients caused by abrupt pressure release. Glottal Pulses Glottal pulses are the (usually) periodic flaps of the vocal folds that give voiced speech its characteristic buzzy quality. These vibrations are driven by the air flowing between the vocal folds[6] like the vibrations of a trumpet player's lips. The rate of vibration is influenced by changes in the pressure drop across the vocal folds, changes in their tension, and adjustment of the average space between them. Some of these changes are caused by cons-ious attempts to manipulate fO (the rate of glottal vibration). Others are the indirect effects of certain articulatory gestures as described below and in more detail in [4]. Turbulence For certain articulatory configurations (e.g. "s") the nasal tract is blocked and the passage through the vocal tract is reduced to a very small opening somewhere along its length. This causes the particle velocity at the constriction to become very high, leading to turbulent noise generation. Other examples of sounds produced in this fashion are "f', "th" and "sh". 13 Chapter 2 There are two terms for excitation of this kind: "aspiration" and "frication". When the constriction is made with the glottis, but the glottal configuration is such that vibration does not occur, then the resulting turbulent sound is called aspiration (as in the sound "h"). When the constriction is made with the tongue lips or teeth (as in the earlier examples), then the sound is called frication. Transients If the above constriction is carried to the point of complete closure, then pressure builds up in the oral cavity. Phonemes that involve such closure are called stops (e.g. "p" "t" "g"). When the articulatory configuration is changed after a stop and this pressure is released, two things can happen: turbulent noise and popping. During release of a stop, the initial opening is small and turbulent noise (frication) usually occurs at the point of constriction (the so called "place of articulation"). As the opening grows, there is less resistance to airflow through the constriction. The air velocity through the glottis increases, and that can in turn lead to either glottal vibration or aspiration depending on the state of the glottis. The other possible occurrence during the release of a stop is the creation of a pop due to an abrupt change from zero to positive airflow. This event is not an inevitable consequence of stop release and is not considered a fundamental acoustic correlate of stop release. Nevertheless, it is important to realize that such a pop can occur because of the potential confusion between such pops and the low-frequency energy that usually indicates voicing. 14 Chapter 2 2.1.2. General Properties of Vocal Excitation in Time and Frequency From the above description we can see several modes of vocal excitation: glottal vibration, turbulence at the glottis (aspiration), turbulence elsewhere (frication), and possible pop transients during the release of stops. These modes of excitation have the A following temporal and spectral properties. Glottal Pulses An approximation for the glottal waveshape during voiced speech is given in[7] as g(n)= 2[l-cos(rnlN1) On<Nl = cos(7r(n-N 1 )I2N 2) N n (2.1) N 1 +N 2 = 0 otherwise. The time response and frequency spectrum of this approximation (using reasonable values of N 1 and N Amplitude 2 is shown in figure 2.2. dB x(n) Spectrum 29.4 1.1 8.8 -31.6 0.8 0 138 8 Figure 2.2 Glottal Pulse Approximation in Time and Frequency 15 Pi Chapter 2 Quasi-periodicity During voiced speech, if the glottal source and vocal tract were perfectly stable and an infinite amount of data could be analyzed, then the frequency spectrum would consist of lines at multiples of the fundamental frequency of glottal excitation (fO). However, there is cycle to cycle variation in the glottal waveshape, temporal variation in fO, temporal variation in the vocal tract shape (and therefore its impulse response) and practical Fourier analysis must be done with a windowed data segment. Therefore, at best the spectrum of voiced speech consists not of lines at the multiples of f0, but of peaks at or near those multiples. A sample of a voiced speech waveform and its spectrum is shown in figure 2.3. Noise Excitation When there is turbulent excitation of the vocal tract, the frequency spectrum is one of a filtered white noise source. An example of this mode of excitation is shown in figure 2.4. The notable distinctions between these two sorts of excitation is that voiced excitation has substantial low-frequency energy and is generally repetitive in time Amplitude ' 1t ] atl Voiced Speech dB .'" IP\1F Y 1W~~zv10t, $,,i ltsi A00:! 0.0 __ Spectrum I 144 _ 21000 _ _ _ __ _ __._ _ _ _ ~~~~~~~~! 1"i iJ ,'! S~Ist!8Awnt~1 And vo AV ___ 21510 __ I 1A1i i - . A- 1. 8 Pi Figure 2.3 Voiced Waveform and Spectrum -- 16 Chapter 2 dB Unvoiced Speech Amplitude 8.0 I - II-rL~I~YIY ·LY,n 1 ... ,. Spectrum 122 C15 i d a, .. I& rj9;.u a ... 1011Y -16WO)"W 1U·I'V'q'4ll%*VO. ' "' .- r. k 9 :'61 1' 62 -1001.0 a S610 5188 8 Pi Figure 2.4 Unvoiced Waveform and Spectrum (leading to a spectrum containing uniformly spaced peaks), whereas turbulent excitation has little low frequency power, no temporal regularity and therefore no uniform peak structure in the frequency domain. Mixed Excitation It is possible to have both glottal vibration and frication as in the sound "z". In this case the glottal portion of the excitation dominates in the low frequency range and the fricative portion in the high frequencies, leading to a spectrum that has a regular harmonic shape in some regions and an irregular noise in others[8]. Pops The time and frequency response of a pop during stop release is shown in figure 2.5. The increased low-frequency energy can be seen as compared with the earlier spectrum from frication. 17 A Chapter 2 AmpitdeSto Amplitude 3180.I dB Brs Stop Burst dB. 1221 _A, ~ Secru Spectrum ~ A.,, iK\ A, -110 0 If Ny~sletsl~stre~lll 4 v I1 -A0 6844 7354 0 Pi Figure 2.5 A Pop in Time and Frequency 2.1.3. Physiological Factors Influencing FO Sex and Age One of the most pronounced influences on fO is the sex and age of the speaker. It is common knowledge and has been demonstrated in experiments[9] that female speakers and children have much higher nominal fO values than males. Variation in Excitation Spectrum with Voicing Intensity For most voiced speech the vocal folds come completely together during part of the cycle. When voicing is intense and complete glottal closure is achieved, the abruptness of the closure leads to a fairly broadband excitation. However, when voicing is soft, complete closure may not occur during any part of the cycle, resulting in a smooth (narrowband) excitation waveshape. This effect is depicted in figure 2.6. The acoustic implications of this phenomenon is that phonemes which are "softly" spoken may have a depressed high frequency spectrum. 18 Chapter 2 Pc;r - - asgr ;tillQB1C hili LOW VOICE EFFORT MEDIUM VOICE EFFORT HIGH VOICE EFFORT Figure 2.6 Variation in Glottal Waveshape with Intensity Glottal Excitation During Closure When voicing is sufficiently intense to completely close the vocal folds during part of each cycle, recent research[10] suggests that some vibration, and therefore some vocal tract excitation, may still take place. Acoustically, this means that even with a closed glottis, one cannot presume that the acoustic signal is solely determined by the impulse response of the vocal tract (as in some recent works on glottal inverse falteringll11, 12]). Frication Frication, achieved by forcing air through a constriction of the vocal tract above the glottis, leads to an increase in pressure between the glottis and the constriction. If this occurs during glottal vibration, that increased pressure may lower the rate of vibration of the vocal folds (due to the reduced pressure differential across them). Ultimately, this pressure will stop glottal vibration entirely. If frication takes place without voicing and a rapid transition is made to a following voiced state (e.g. fa"), then the rate of glottal vibration may be temporarily 19 Chapter 2 elevated due to high airflow. This situation occurs because the glottis is wide open during the unvoiced fricative to prevent glottal vibration, and because of the pressure that has built up. When the fricative is released, while the glottis is closing in preparation for voicing, the high pressure and open vocal tract leads to a surge in the air velocity that elevates the initial glottal vibration rate. Stops Stop production also leads to a pressure rise above the glottis. Like fricatives, if glottal vibration is taking place, this can first cause a drop in fO followed by the complete cessation of vibration. If a voiced stop is brief enough, vibration may continue during the entire closure phase of the stop. Unvoiced stops can also temporarily elevate fO in an immediately following vowel. These effects (and those for fricatives) are discussed and demonstrated in a paper by Lea[13]. FO bias due to Articulatory Configuration The position of the larynx is believed to play a role in fO determination through the tension it places on the vocal folds. Since different articulatory configurations lead to different positions of the larynx, there is a bias in fO that depends on the phoneme being uttered. This has been documented in a variety of experiments[14, 15]. At the present time, this information is only available for vowels. Rate Limits Since the fO regulation system is mechanical, there are limits on the rate of change of fO that can be achieved through active manipulation. These limits have been studied for both professional singers and laypersons[16, 17]. This information bears on 20 __ Chapter 2 the types of interpolation that can be performed between pitch estimates. It also sets limits on acceptable rates of variation for plausible numerical pitch estimates. An FO model based on Physiology One model for the behavior of fO is based on the mechanical structures used to manipulate fO17]. This model predicts that log(fO) can be modeled as the sum of responses of second order linear systems. This model could potentially provide more constraints on pitch variation than the rate limits mentioned above, so it could potentially be more useful. However, applying it requires the determination of a driving function, and the designers of this model have yet to suggest how such an excitation might determined in the absence of a pitch track. The thrust of their research has been to perform the inverse problem of determining the model excitation from the pitch track. 2.1.4. Phonetic and Phonological Influences FO Effects As was mentioned above, stops, fricatives and vowels can influence fO. Though such effects are physiological in origin, they occur only in specific phonetic contexts, so they can also be considered phonetic influences. Stop Timing Besides having an effect on fO, certain phonemes carry information about voicing onset time (VOT). Information about the timing of voicing with respect to phoneme position is not available for all phonemes. However, stops have been studied extensively[18]. 21 Chapter 2 Each stop can be subdivided into three phases: * A closure phase during which the vocal tract is completely blocked and little if any sound is emanated (see "voiced bars" below). * A burst phase during which fricative energy is produced by the rapid airflow through the opening constriction. A possible aspirative phase due to high airflow at the glottis. * These phases need not all be acoustically apparent. Voiced stops, for example, have little or no discernible aspiration. Also, when a stop is part of a stressed syllable (e.g. "repeal"), it is likely to manifest the acoustic behavior described above. However, when a stop is part of an unstressed syllable (e.g. "letter") these phases may not all be clearly in evidence. In addition to the structural description of stops in terms of these phases, research also provides information on the timing of the phases. These times are strongly influenced by the surrounding phonetic context[ 183. One use of such timing information is to subsegment the information in the transcript. Stops are transcribed in two parts: the closure and the release. By using stop timing information, the release can be further broken down into frication and aspiration. Identifying the fricated portion is useful because the pops that often accompany such frication can confound the detection of voicing. The surge in low-frequency power due to the pop may trigger detectors which rely on the presence of such power to indicate voicing. Other Timing Effects Other research on the influence of phonetic context on segment timing can be found in [19]. 22 Chapter 2 2.1.5. Syntactic Effects A talker may use fO and timing to distinguish certain words or groups of words in a sentence for syntactic purposes. This can involve elevated fO, pronounced variation in fO, and variation in the duration of segmets[20]. Word Level "Content" words may be distinguished from "filler" words in this fashion[21]. For example the noun subject of a sentence might be delivered starting with an unusual fO rise that a preposition would not receive. Phrase level A noun phrase or verb phrase is often accompanied by a rapid rise in fO followed by a gradual fall[22]. This seems to be a mechanism for helping the listener parse the sentence, and at least one method of automatically analyzing the syntax of spoken sentences has been based on this observation[13]. Also, a rising/falling fO pattern on phrases is a feature of some speech synthesis programs (Dennis Klatt, private communication). The overall sentence (if it is declarative and not a question) may also have this rising/falling fO pattern imposed on it. The tendency for average fO to fall during the course of a sentence is called "declination". While declination is a generally accepted aspect of fO behavior in a declarative sentence, some believe that it is in fact a side effect of incremental downsteps in individual phenomenon[23]. 23 syllables, and not a global Chapter 2 2.1.6. Other Linguistic Factors Stress and Prominence When we speak a phrase, we employ two types of stress. The first is lexical stress. This is stress on some syllables of each word that is governed by the identity of the word alone (hence the term "lexical"). For example, "banana" is lexically stressed on its second syllable. Lexically unstressed syllables are often underspoken. Some of the phenomena normally associated with a phoneme may be missing when it appears in an unstressed syllable[24]. This effect can turn a /t/ into a flap (a very brief stop-like sound) in the word "butter" and suppress the second vowel in button" turning the pronunciation into "buttn". The second type of stress is prominence. It is used to mark the significant words in the phrase being spoken and to emphasize the phrase structure. "Lets EAT here." is a phrase with the prominence on EAT. This phrase might be used to answer the inquiry "what should we do at the mall?". The phrase "Lets eat HERE" might be used to point out a specific store in which to eat within the mall. In both cases the purpose of the prominence is to point out the significant part of the communication. Prominence is usually attached to significant verbs or nouns in a sentence[23]. Prominence is intimately tied to fO since one of the ways of achieving prominence is with significant transitions in f0. One of the important relations between it and lexical stress is that prominence is always placed on lexically stressed syllables. Besides influencing the clarity of the acoustic manifestations of various phonemes, stress or prominence may systematically vary that manifestation. 24 For example, Chapter 2 vowels which are stressed tend to be longer than those which are unstressed[20, 19]. Tune The syntactic and linguistic influences on fO have been incorporated into a theory for describing the approximate fO trajectories used in English (ignoring consonental dip and articulatory bias)[23, 251. This theory postulates a series of "tones" (much like the notes of a musical score) that are affixed to some of the lexically stressed syllables of a sentence. These tones specify a rise, fall or combined rise/fall at the location the associated syllable. A single "phrase tone" is associated with the last prominent syllable of the phrase and this tone specifies the behavior of fO out to the end of the phrase. The set of tones for a sentence is called a "tune". The tune, together with speaker specific fO parameters and information about the overall emphasis, is used to predict an average pitch contour for the sentence. While this theory seems very attractive because it collects so many phenomena into a single description, there are some difficulties with using it. First, it is a developing theory and is changing in the face of new experimentation. Second, it depends on knowledge of the tones and their position in the sentence. While such information may be practical to acquire when experienced transcribers of tune are available, it is not easy at present. 2.1.7. Extra-Linguistic Factors Speaking up FO is influenced when a talker attempts to improve the reliability of their communication by EMPHASIZING THEIR WORDS. To do this the speaker increases the loudness of his speech and increases the fO variations that are used in it. 25 Chapter 2 Emotion Emotion also plays a part in determining the fO of an utterance. Anger tends to be reflected by dramatic emphasis which leads to large fO variations. Jitter Because the human glottis is a biological device, its oscillations exhibit noticeable cycle to cycle variation in period and waveshape. Experiments with the voicing of continuous tones has shown period variation of about .5% between successive periods, and total variation of 1.5%[26, 27]. There are no experiments on cycle to cycle period jitter in normal speech. However, a brief experiment (presented in chapter 5) suggests that 2% is a reasonable value to use for expected jitter. This figure is important in pitch detection because it defines when succesive period estimates can be considered the same. Glottalization Some speech waveforms indicate that the excitation is aperiodic glottal pulses with longer spacings than typical for periodic excitation. Spacings two or more times those expected for periodic vibration are not unusual. Aperiodically excited speech is called "glottalization". Certain situations exhibit glottalization frequently. For example, if a word begins in an unstressed vowel, and the preceding word ends in a vowel then glottalization of the unstressed vowel is common (M. Bush private communication). Also, at the ends of sentences and places where fO is unusually low, glottalization is likely[28]. (figure 2.7). 26 __ Chapter 2 Amplitude 32088.80 (SECREST-WAVEFORM las2fl) III -91l AL 28851 30217 Figure 2.7 End of Sentence Glottalization Diplophonia Another unusual behavior of the glottis is its tendency to go into a mode of vibration that makes alternating pulses identical. This is called diplophonia. Although only occasional speakers manifest it constantly, it is not uncommon to find it in the speech of normal talkers at locations where glottalization would be expected (e.g. at the ends of sentences)[28]. Voice Bars Voice bars is a term used to describe the parts of a speech spectrogram that appear when the sound emanating from the throat is recorded rather than the sound from the mouth or nose. This situation may occur during the closure portion of voiced stops. While this does not truly represent an irregular voicing mode, the resulting spectrum can be almost sinusoidal. Thus if period estimates are to be maintained during this time interval, the method of detection cannot rely on the existence of several harmonics of fO. 27 Chapter 2 22. Knowledge from Pitch Detection Algorithms Since the 1940's pitch has been numerically determined for a variety of reasons. One goal is the storage and transmission of a compact representation of speech. Another is the recognition of speech so computers can be controlled by voice. Further goals include the diagnosis of vocal system pathology, the identification of speakers and the analysis of speaker stress. This work provides a body of knowledge about the behavior of various algorithms when used to detect pitch and voicing. Signal processing algorithms for pitch detection all contain knowledge about the pitch detection problem. Some of that knowledge can be readily interpreted, as in the theory behind the basic algorithm. Other knowledge may not be as easy to interpret (e.g. limits on the bounds of program loops may signify an assumption about the range of permissible pitch values), or may be so bound up with the specific implementation that interpretation in the context of an abstract pitch detection problem is impossible (e.g. the choice of vote bias in the Gold-Rabiner decision scheme). The remainder of this chapter discusses the knowledge represented in pitch detectors by presenting a general pitch detection idea and giving specific examples of pitch detectors which use it. We have tried in this discussion to avoid ideas that can't be related to the general problem of pitch detection, and only pertain to a particular method. Thus this is a catalog of pitch detection knowledge not a catalog of pitch detection algorithms. For a more detailed comparative discussion of various pitch detectors the reader should examine[291]. A definition of "periodicity" The primary phenomenon involved in pitch detection is periodicity. By definition, a signal is periodic if a shifted version of the signal is identical to the 28 ___ _II Chapter 2 original signal for some shift (i.e. P: n , x [n = x [n +P]). The "period" of a periodic signal is taken to be the smallest shift for which this is true. In speech processing one describes the signal as "quasi-periodic" (meaning "approximately periodic"). In that context it is typical to think of "periodicity" not as an all or nothing property (as the dictionary would suggest), but as a variable quantity like period. There is no specific definition for the term "periodicity". Roughly speaking, we use it to mean: "How similar is the signal to a shifted version of itself?". Periodicity in the Short Term Speech is more than just almost periodic. One can assume that over intervals of a few periods, properties such as waveshape and period are relatively constant[30]. Thus, algorithms for determining period treat it as a slowly varying quantity that may reasonably be measured on sections of speech data that are a few periods in length. A Common Program Structure Programs for pitch detection generally operate in two phases. The first phase of the program transforms a section of the signal into a data structure from which potential period candidates and periodicity estimates can be readily derived. The second phase takes this data structure and either selects a single period choice, or declares the signal to be unvoiced. These two phases are applied on successive sections of the signal from left to right until all the data is exhausted. These two phases can be further subdivided into four steps: la Preprocessing An initial analysis that changes the basic structure of the signal without producing a data structure from which the period and periodicity can be 29 _ _ Chapter 2 easily measured. Typical examples of this are low-pass filtering to eliminate signal noise and possible aperiodic high frequency speech energy, and either clipping[31], frequency equalization[32], or linear prediction[33] to reduce the formant structure of the signal so the harmonic structure is more apparent. lb Convert the preprocessed signal into the data structure from which the period and periodicity can be readily extracted. 2a Decision Analyze the data structure and choose a period candidate. 2b Postprocessing Derive a final period estimate on the basis of information about some or all of the period candidates from 2a. One example of this is non-linear smoothing[34], which can effectively eliminate individual gross errors in the pitch candidates from 2a. A simple example of such a program would be one in which the first phase computes the short-time autocorrelation function of the signal section[61, and the second phase selects the largest peak of the autocorrelation function (other than at the origin) if it is above a threshold, and declares the speech unvoiced if not. Because so many pitch detection programs have this form, it is a useful way to think of them for comparative purposes. However, this structure is not the only way of arranging such a program (the program developed for this thesis doesn't fit this model well). In the following sections we will (where appropriate) point out how a given algorithm can be decomposed in this way, and how this decomposition compares with those of other algorithms. Period Range Limits Many pitch detection algorithms scan over a finite range of periods (e.g. 3 ms to 15 ms). Searching for a period only in this range constitutes an assumption about the range of pitch periods in speech (the corresponding frequency range is 60 hz to 330 hz). While this is a reasonable range for most of the speech of normal male and female 30 Chapter 2 talkers, it excludes the possibility of picking up unusually long glottalized periods or unusually high fO values in speech from a child. For the speech corpus that was used for the experiments described in chapter 5 of this document, the above range would be too restrictive to permit accurate analysis. There were a few periods shorter than 3ms and periods longer than 15ms were not uncommon. In terms of the pitch detection knowledge that this period restriction represents, it is fair to assume that most periods will fall within this range. However, it is inappropriate to assume that all periods will be so. Rather than unconditionally constrain the period estimate to lie within this range, a better approach would be to downgrade the confidence in a period estimate that lies outside. Such an approach would permit proper measurement of glottalized periods. 22.1. Temporal Similarity "Temporal similarity" describes any pitch detection method that determines the periodicity and period of a signal by comparing it to a shifted version of itself. Since period changes over time in a speech signal, such a comparison must be made using windowed segments of data. In conceptually simplified terms, such an algorithm seeks the smallest P such that V n E N ,x [n 1 = x In +P ]. The set of samples N typically surrounds the location where the period is to be estimated. The criterion for when x [n = x [n +P ] and the determination of the set N are the two major things that differentiate the various methods which use temporal similarity as their basis. 31 __ 10 Chapter 2 An Average Magnitude Difference Pitch Detector The average magnitude difference function (AMDF) pitch detector[35] uses temporal similarity to identify the periodicity and period of the signal. It creates a sequence that displays periodicity as a function of shift (candidate period) by computing the following expression for each integer value of P in the range [3ms 5ms]: AMDF[P]-= (2.2) Ix[n ]-x[n+P]I n EN where the set N contains 20ms of speech. This computation produces low values of AMDF [P ] when the signal is close to periodic with period P, and high values when it is not. In this program, the procedure for selecting a particular shift value (as the final period estimate) involves a complex tracking algorithm. It uses information about the period and periodicity, as measured from the previous signal section, to guide the decision for the current section. This procedure is too complex and algorithm specific to be worth fully describing here. The AMDF method fits the two phase description of pitch detection programs well. The first phase constructs the AMDF[P data structure without preprocessing. This sequence exhibits the lowest values at locations of strongest periodicity, and the location corresponds directly to the period estimate. The second phase is a (compl.icated) decision process, without any distinct postprocessing step. An Autocorrelation Pitch Detector The pitch detector designed by Dubnowski et. al. [31] also measures temporal similarity to determine periodicity and period. In this case, the comparison x [n ] = x [n +P] is performed using an inner product rather than an absolute 32 Chapter 2 difference (as used by the preceding algorithm). The set N contains 30ms of speech, and the candidate period P runs from 2.5 ms to 20 ms. This algorithm uses a preprocessor on the speech signal to spectrally flatten the signal for improved pitch detector performance. To perform this whitening, first the mean of the signal section is subtracted out (in case there is a DC offset), then the peak amplitudes over the first and last thirds of the section are averaged to estimate the clipping threshold C for that section. Finally, the incoming signal values are mapped to the values {+1,0,-1} using a threshold function, which is shown in figure 2.8. 4 output +1 -C i w -~~~~~~~~ -1 I Figure 2.8 Clipping Function 33 input Chapter 2 To compute the inner product for non-zero shift P, the signal section x [n ] is treated as though it were padded with zeroes. This has the effect of applying a linear taper to the values of the inner product[6]. The weight applied to the zero shift inner product is 1.0, and the weight that would be applied to the 30 ms shift inner product (if it were computed) would be 0. This sequence of inner products is the data structure resulting from the first phase of processing. Shift corresponds to period, and value corresponds to periodicity. Given this data structure, the shift (in the range from 2.5ms to 20ms) which yields the largest inner product is the period candidate. If that inner product fails to exceed 1/3 the height of the 0 shift inner product (which is always larger) or if the maximum amplitude for the section (before clipping) is below 1/20 of the maximum for the sequence, then the frame is declared unvoiced. There are some new pieces of knowledge exposed here. One is that whitening has a potentially beneficial effect when one uses autocorrelation for computing periodicity with respect to shift. By flattening the spectrum of the speech, whitening reduces the tendency for the autocorrelation to have large values due to formant structure in the speech (resonances in the vocal tract) rather than due to periodic glottal excitation; large values that could lead to erroneous period candidates. The clipping system used by this pitch detection method is only one of a variety of means to achieve whitening, others include adaptive filter banks[32] and linear prediction[33]. Another new piece of knowledge involves the motive behind the linear taper on the inner product values: "The use of a linear taper on the autocorrelation function effectively enhances the peak at the pitch period with respect to peaks at multiples of the pitch period. thereby reducing the possibility of doubling or tripling the pitch-period estimate because of higher correlations at these lags than at the lag of the actual pitch period." 34 Chapter 2 This exposes a fundamental problem in pitch detection. If a signal is periodic with period P, then it is also periodic with period 2P. To guarantee (or at least encourage) the selection of the true period, the decision process must either emphasize the apparent periodicity at small shifts (as this algorithm does) or determine that no shorter plausible shift exists (as is done in the program written for this thesis). The final new idea exposed by this pitch detector relates to the silence threshold. Algorithms that measure periodicity usually have some kind of normalization so they will work at all amplitude levels. In this algorithm, that normalization is embodied in the comparison with the zero shift inner product. Since the inner product at the shift which corresponds to one period and the inner product at zero shift are both proportional to the energy in the signal section, comparing them yields a procedure that is insensitive to scaling. When the speech becomes very soft so environmental and electronic noise sources dominate the signal, this type of algorithm can "lock on" to periodicities in the noise which have nothing to do with the speech signal. To avoid such situations, this (and most other) pitch detectors employ a silence threshold and either declare the speech unvoiced at low amplitudes, or have a special "silent" voicing declaration for that purpose. Clearly the designers of these algorithms believe that voiced speech is only normally produced over a finite amplitude range (20 to 1 according to this particular method). While we have not found any speech research that corroborates this belief, it does seem to be a reasonable assumption. 2.2.2. Data Reduction Digitized speech normally has many samples in each pitch period. Since pitch is considered to be slowly varying, this is many more samples than necessary to encode 35 'p Chapter 2 the pitch information. Therefore it should be possible to convert the speech to a "reduced" representation that can still be used to estimate pitch, a representation that has fewer samples per second of speech. This is the meaning of "data reduction". The attraction of data reduction is in eliminating the computational cost that goes with handling individual samples. Programs such as the autocorrelation method above must make tens of thousands of arithmetic computations for each 300 sample speech section. If each section could be reduced to a few numbers (e.g. one non-zero sample positioned at the start of each pitch period) then the computation savings would be considerable. Gold-Rabiner One pitch detector that relies on data reduction is the Gold-Rabiner (G-R) pitch detector[3]. This program first eliminates aperiodic high-frequency energy with a 900hz low pass filter. The resulting waveform is immediately reduced to a sequence of extrema (samples that are either larger than or smaller than their immediate neighbors) eliminating perhaps 90-95% of the samples. This extremal sequence is used to create six alternative streams of pulses by adding or subtracting adjacent extrema. Three of these streams are synchronized with the peaks in the original extremal sequence, and have pulse amplitudes as follows: m 1 Each pulse is just the height of the corresponding extremal peak. Each pulse is the height of the extremal peak plus the depth of the preceding valley. 3 Each pulse is the height of the extremal peak less the height of the preceding peak. 2 36 0 Chapter 2 The other three streams are synchronized with the valleys in the original extremal sequence and are defined as follows: m4 Each pulse is just the depth of the corresponding extremal valley. m5 m6 Each pulse is the depth of the extremal valley plus the height of the preceding peak. Each pulse is the depth of the extremal valley less the depth of the preceding valley. These definitions are illustrated in figure 2.9. and they yield six pulse streams with the same period as the original speech, but with many fewer samples per second. These six pulse streams are fed to six identical trigger modules for a further reduction step. Each module produces a sequence of period estimates by running a blanking and decay circuit driven by its input pulse stream. Each time one of these circuits is triggered by a pulse, there follows a "dead time" during which triggering is prevented. After the dead time, a finite trigger threshold is set (initially to the height I Figure 2.9 Gold-Rabiner Extrema 37 __I Chapter 2 of the previous triggering pulse). This threshold decays exponentially with time until another pulse exceeds it, triggering the module and repeating the process. The dead time and the decay time are proportional to the average triggering time for that module. This second reduction procedure is depicted in figure 2.10, which depicts the action of a single module. The output of each module is a sequence of period estimates. Each time a module is triggered, the time since the last triggering constitutes a period estimate. At any given time, the three most recent period estimates are available from each of the six modules. Every 10 ms a final period estimate is determined from this repertoire of 18 numbers. The decision process which is the second phase of the G-R pitch detector is unique. It is presumed that each of the six modules is producing an equally reliable period estimate, and that only the most recent six estimates are viable choices for the final period estimate. First, a constituency of 36 "voters" is established from the 18 available period estimates. If each voter is labeled Pij where i stands for the module and j for TIME Fic. 4. Operation of the detection circuit wh'ch consists of a variable blanking tir,e during which no pulses are accepted, followed by a vari;hle cxponcitiial rundown. Figure 2.10 Trigger Modules 38 __ Chapter 2 the voter from that module, then the voters are defined by figure 2.11. The three most recent period estimates, two pairwise sums and the triple sum, are all voters from that module. The six candidates are the most recent estimates from each of the 6 modules. Four elections are held using four different tolerances for agreement between the candidate and the voter. For each election, a candidate gets one vote from each voter that agrees with the candidate within the specified tolerance. Since the tolerances for the four elections are different, and since the larger tolerances result in larger numbers of votes being cast, a bias is subtracted from each czdidate's votes; a bias that grows with the tolerance and compensates for it. If the candidate with the highest number of votes (including bias) has a positive median vote total across all elections, then that candidate wins, and that is the final period estimate. Otherwise (or if the peak amplitude of the section is less than 1/20 of the peak amplitude of the sentence) the section is declared unvoiced. Finally, a 3pt , P.,RP I3 P1 4 'P. I P,. PPE I TPP,,,i2 1- I TIME I pI s P:3 Pi I 2 P2 , L P22 . PrI -T -" pTIME . * *~~~~~ .~~~~~~~~~~~~~~~~~~, Figure 2.11 Period Estimates 39 ------- `------- Chapter 2 median smoother is applied to the sequence of final pitch estimates to correct gross errors. The G-R pitch detector is probably the most difficult of all pitch detectors to analyze for the knowledge it contains. Clearly there is knowledge behind the byzantine decision process, and the structure of the pulse streams and trigger modules, since it works so well. However, it is very difficult to tell what the ideas are, which apply to pitch detection and which to this specific approach, and which ideas are the most significant. When one speaks of knowledge that is "compiled in" to a program, this is a classic example. This method demonstrates that period can be measured solely from extrema. This is a remarkable thing when you consider how much of a reduction that accomplishes, and how simple it is to do. For example, while zero crossings are a similar reduction and are also simple to find, there are no good examples of pitch detectors based on zero crossings. This method assumes that the extrema of a speech signal fit a pattern that the trigger modules are designed to detect. Specifically, following a "trigger extremum" there is a period of uncertainty in extrema, followed by a period of decaying extrema followed by another trigger extremum. In the use of sums of recent period estimates as voters we see the intent to correct for extraneous pulses triggering the modules. Such pulses divide what should be a single period estimate into two or more shorter ones. However, the original can be recovered by summing adjacent estimates as they have done. This particular knowledge is only significant for methods that are both attempting to use extrema and being plagued by extraneous pulses. 40 __ _ _ __ _ Chapter 2 Perhaps the other most significant knowledge embodied in G-R is its use of redundancy. By having six trigger modules working on "different" views of the signal, and having a voting procedure to collect information from all those sources, the G-R pitch detector makes itself less susceptible to individual errors. Data Reduction by Principle Cycle Analysis This method[36] first band-pass filters the data to eliminate high frequency oscillation then makes a parametric representation of each half-cycles of the resulting waveform. This parametric representation is further reduced by analyzing the halfcycles and eliminating those that could not be the initial half-cycles of a pitch period. The goal is the isolation of the "principle cycles" which start each pitch period. This system is unusual because it uses some notion of the structure of speech to determine its pitch measurement. Drops in amplitude are used to delimit syllables. An approximation of the average pitch during each syllable guides the final pitch estimation for that syllable. Thus this program embodies knowledge about the syllabic nature of speech and the fact that pitch variation within a syllable is small. 2.2.3. Harmonic Structure When a signal is periodic, it has a spectrum composed of lines at the harmonics of the fundamental frequency. If it is quasi-periodic (as voiced speech usually is) then there are not lines but "peaks" instead. That is, one expects substantial spectral energy at or near the multiples of the fundamental, and little energy elsewhere. In the absence of a definitive statement of the nature of the quasi-periodicity, it is not possible to define the precise nature of these peaks. 41 __ Chapter 2 Pitch Detection using Spectral Peak Positions The fundamental frequency of such a quasi-periodic signal can be estimated by measuring the spacing of the peaks[371, adjusting a comb (or sieve) until it best fits the peak pattern[38, 39, 40], or looking for the greatest common denominator of the peak positions[41, 42]. Pitch detection by Harmonic Sum These techniques[43, 441 use the entire spectrum to estimate the period rather than just the peak positions. In some sense, the distinction between these methods and those immediately preceding resembles the distinction between the data reduction time domain methods like G-R and the full time domain methods like autocorrelation. Essentially, these methods take the inner product between the spectrum (or the log spectrum) and an impulse train. The spacing of the impulses that yields the largest projection becomes the fundamental frequency estimate. The largest estimate typically occurs when the impulses align with the peaks in the harmonic spectrum. Cepstral Pitch Detection Since the speech signal can be approximated as the convolution of a periodic pulse train with a low time-width filter function, the cepstrum can be used to turn the convolution of these two signals into a sum. This yields a low-time region containing the details of the vocal-tract response and a high-time region consisting of pulses located at the period of the speech and its multiples. By determining the spacing of those pulses or the initial pulse position one can estimate the period. This is the basis for the cepstral method of pitch detection[451. We include it in the section on spectral methods because the computation of the cepstrum depends on computing the spec42 Chapter 2 trum. 2.2A Other Methods Maximum Likelihood Pitch Detection The assumption that speech is a perfectly periodic signal with additive white gaussian noise, leads to a procedure for maximum likelihood pitch detection[46, 47]. These algorithms create a periodic signal by convolving the speech with an impulse train (aliasing). The spacing of the impulses is adjusted to maximize the similarity between this periodic signal and the original, and the resulting spacing is taken as the period estimate. It can be shown[44] that these algorithms which are computed in the time domain are virtually identical to the harmonic sum spectral method mentioned earlier. Pitch Detection from Prediction Residual These methods[33, 48, 49, 50] use "linear prediction" to produce a residual signal from the speech. By using a short predictor (10th to 16th order), correlations in the signal that manifest themselves over short lags are removed (namely the correlations due to the vocal tract response). Once the short term correlations have been eliminated, it is possible to detect pitch simply by looking for peaks in the resulting residual signal. 2.2.5. Voicing Determination Voicing may be determined numerically in a variety of ways, many of which do not involve pitch estimation. Some of the standard methods are: low frequency power, broadband power, periodicity, zero-crossing rate and predictability. 43 - 0 Chapter 2 Low Frequency Power Low frequency power measures rely on the fact that glottal vibration leads to low frequency energy over a substantial time interval. Other forms of excitation (e.g. frication) lead to high-frequency energy or (like stops) to very short duration broadband power. Thus a low-pass filter up to 600hz followed by some smoothing to eliminate the effects of clicks and stops can yield a useful voicing estimator. Broadband Power Broadband power measurements can be used to determine silent intervals if the background noise is low. Such a direct determination of silence can eliminate the wasted processing that would otherwise take place. Also, estimators that factor out power (such as zero-crossings) can be very erratic at low power. So silence detection should be coupled with them to prevent this spurious behavior from triggering incorrect voicing errors. Periodicity Periodicity measures make use of the fact that glottal excitation is usually periodic. This is often used as a voicing method in pitch detectors because the period is being measured anyway. Unfortunately aperiodic speech is not uncommon, so pitch detectors that rely solely on periodicity to make their voicing determination will be prone to errors on some sentences. Zero-Crossing Rate The zero-crossing rate of a speech signal is sensitive to glottal vibration because the low frequency components that result from such excitation push the signal far from zero for (relatively) long periods of time. If the signal is dominated by noise, the 44 _ _ I I _ Chapter 2 small but rapid deviations due to the high frequency components cause many more zero-crossings to occur. As was mentioned above, this measure must be accompanied by a determination that power is present in significant quantities. When there is little power, the rate of zero-crossings is highly dependent on the nature of the background noise and therefore should not be used as a means to determine voicing. Predictability When there is glottal excitation, the acoustic waveform can be modeled as an impulse-like sequence exciting a slowly varying filter. In this case, linear prediction algorithms can reduce the power in the signal dramatically. When the excitation is random noise, prediction is much less effective in reducing power. Thus, the effectiveness of a linear predictor can be used as a means of determining voicing. It is also the case that linear predictor coefficients have different responses to voiced and unvoiced speech. In particular, the first coefficient of a linear predictor has been used to help determine voicing[51]. Pattern Recognition Approaches Numerical pattern recognition approaches to voicing determination rely on a combination of the above measures and training with pre-marked speech to numerically define the criteria for distinguishing voiced speech from unvoiced speech[l51, 52]. 2.3. Conclusion There is considerable knowledge available that pertains in some way to the pitch detection problem. For reasons discussed in chapter 1, we did not seek to exhaustively 45 Chapter 2 include all of it in the program that was developed for this thesis. The following chapters discuss the knowledge used in that program and how that knowledge was employed. 46 CHAPTER 3 System Architecture This is the first of two chapters describing the Pitch Detector's Assistant (PDA) program. In Chapter 3 we begin with a general overview of the program and its components. The remainder of the chapter breaks the system down first in terms of the problem structure (knowledge about voicing and fO determination) and then in terms of the program structure (the rule system, the dependency networks etc.). Chapter 3 describes the system in terms of the intent of the design and the ideas that motivated it. The following chapter focuses on the actual implementation of the program, dealing with practical problems and the engineering decisions that were necessary to translate the ideas into practice. 3.1.1. An Example of System Operation Before discussing the design of the system we present an example of its operation. The input to the system (except for the speaker sex/age) is shown in figure 3.1. The waveform is shown in the upper part of the figure. It is digitized at 10 khz and represented in the computer as floating point numbers'. The range of values are shown to the left of the vertical axis, and the range of indices are shown below the horizontal axis. The transcript is shown in the lower part of the figure. The four types of transcript marks (phrase, words, syllables and phonemes) are shown in the labelled vertical strata of the picture. Each mark is identified by a string of characters, and its 'While the original data was represented in 16 bit fixed point, all subsequent numerical processing was done in 32 bit foating point notation. Chapter 3 Chapter 3 Amplitude ,- MacSfS 480 SNR ^n ,* 0.8 _3vafl.R a 34846 Transcript Mark Type Phrase! i Words i ' yllables w Phonemesi OECLARATIVE a which tea wrt' t'ti' t' t' t I' I party p'pa p'pa r baker did ri' ddid' i' ddl be' kk: d'be' k'k to go g'gow _ _ __ g'g o" ttuh _ t' t u 8 h 34046 Figure 3.1 Inputs: Symbolic Transcript and Waveform extent is depicted by the line below it. The boxes which border these lines signify the uncertainty with which those boundaries are known. In each case, the box depicts a Gaussian density whose mean and standard deviation are shown by the center and half-width of that box. Identical boxes which lie above one another are in reality just images of the same box as viewed from the different strata. The dotted line which appears under most syllable marks depicts the "syllabic nucleus" the vowel or vowel-like phoneme around which that syllable is built. The results of processing are shown in figure 3.2. All four plots span time horizontally with the domain (in samples at 10khz) appearing below the horizontal axis of each plot. The first plot shows the "confidence" that the speech is glotally excited 48 Chapter 3 Odds Factor 53.5 Voicing Odds . I: \ II V~~~ , iI I ; I I - ---' i I I .~~~~~ _ I\ _ a 8.8 ', 34847 -1 F0 Fe Probability 588 a 34125 a F0 48 Pitch Track - - - - , i 8 34125 8 Mark Type Revised Transcript OECLARATIVE Phrase 0 uhich Uords I tea Wit' SlIlaoles tt WIt I I W Phonemes - party did baker ddu bm' k'kS I X - 6 t' - - I t' t pmr p _ r _ r r dd _ I d b _ go r to m t't L k k r_ 9 : od t' t d u h _ _ _ 8 34046 Figure 3.2 The Outputs of the PDA 49 Chapter 3 (the voicing-odds-factor). The second is a pseudo-intensity plot 2 of the 3-dimensional surface which is the probability density of fO as a function of time (the fOprobability-density). A vertical slice through this surface yields the probability density for fO at that temporal location. The next plot shows the "revised" alignment of the phonetic transcript (where the depicted boundaries include information from numerical measurements) and the last plot is a conventional pitch track derived from the fO probability density and the voicing confidence at each location. 3.1.2. The Symbolic Input The design of the symbolic transcript input was based on the nature of the problem knowledge presented in Chapter 2. Phonemes are important because the phonetic context can be used to infer both pitch and changes in it. Syllables are the objects to which stress is attached. Words are important because of tendency for gaps to occur between them as opposed to within them, and because they are the means of indexing if one is to look up lexical stress. The phrase is the level at which one distinguishes question from declaration and the phrase delimits the outer bounds of the sentence. While it may seem that providing a phonetic transcript would essentially solve the voicing decision problem, there are several reasons why this is not so. First, the boundaries of phonetic marks lack sufficient precision. Typically, they are only given with accuracy to the nearest few centiseconds. Since each analysis frame is one centisecond, errors of several frames are possible. Second, one cannot reliably determine voicing solely from phonetic identity. Some phonemes (like /z/ before a pause) are likely to change from voiced to unvoiced over their duration. Others like /r/ may be 2 The density of dots in this plot corresponds to probability density for fO with black being a high probability density and white being low. 50 Chapter 3 voiced when used at the beginning of a stressed syllable, and unvoiced in an unstressed context. Lastly, the phonetic transcript cannot be expected to be completely accurate. Missing or incorrectly transcribed phonemes are possible. For all these reasons, the phonetic transcript does not solve the voicing decision problem. The phoneme marks in the transcript were made by trained phoneticians who were given a plot of the waveform, its (300hz bandwidth) spectrogram and the words in the utterance. The phoneticians were not permitted to listen to the sentence. Subsequently, additional marks were added to delimit syllables, words and the phrase. In addition, syllabic stress was indicated for each syllable (chosen from the values: UNSTRESSED, STRESSED and PROMINENT) based on listening to the sentences 3 . In marking the phonemes, the phoneticians were asked to indicate their region of uncertainty at the boundaries. This was an unfamiliar task for them and in many cases a phonetician would mark a boundary with just a line (the conventional way of marking). In these cases, the judgment was made by the author based on the distinctness of the indicators for that boundary in the spectrogram. In any case, a lower limit of uncertainty of .Olsec was used since that was the approximate resolution of pencil marks on the spectrogram. 3.1.3. The Outputs The choice of outputs shown in 3.2 reflects two motives. The first motive was to show in detail the process by which the PDA program came to its conclusions. The second was to produce not only the "answer" to the question (the pitch track), but 3 These marks were added by the author because they were necessary input to the PDA and were not a part of the transcriptions provided by the phoneticians. As the author is not a trained linguist, these added marks are somewhat suspect. However the uncertainty represented by such a marking was felt to be a proper challenge for the system. 51 Chapter 3 also to express the uncertainty that is implicit in that answer (e.g. the voicing confidence is displayed rather than just a voicing decision). The first motive is a reflection of the nature of an assistant and the importance of "interaction" between it and the operator. When processing a sentence, the outputs guide the operator in providing or revising information when they feel that the program is "confused" in some portion of the utterance. The outputs also help the program developer to understand and correct deficiencies in the system. The second motive is that such supplementary information as confidence or statistics is important to the user of the program. It ,s not uncommon for a signal processing program to supply only the answer with no other information. This appears to be uniformly the case for other pitch detectors. However, there are two reasons for including information about uncertainty (if it can be estimated inside the program). First, if the user is a person, such information helps them decide how to interpret the answer in the context of their problem. Second, such information can be useful if the results are used in later processing. When more than one system component contributes an answer to a single question, it is necessary to combine those answers. When the number of contributors is fixed and small, the combination function can be designed based on the known identity of the contributors. The voting matrix of the Gold-Rabiner pitch detector is a good example of this. However, if, as in PDA, where there may be many changing contributors, it is awkward to have to rewrite the combination mechanism for each new configuration. One cure for this is to have all contributors use a common means of communicating their answers, a means that contains enough information for the combining to be done without knowing the contributors from which the answers came. 52 Chapter 3 PDA uses the idea of developing a language for answers, and isolating the identity of the contributor from the contribution. This idea was used in the system for refining estimates of boundary positions between phonemes, the system for determining the answer to symbolic valued questions (such as voicing), and the system for determining the value of fO. This ability to isolate contributor from contribution seems crucial to the ability to incrementally develop a large system without having a detailed understanding of the interaction of all of its parts. Thus we see in the choice of outputs the interactive nature of an assistant, the desire to express to the user both the answer and the confidence in it, and the need in a composite system to express both the answers to questions and the supplementary information needed to combine those answers. 3.2. Pitch Detection in PDA Since the pitch detection problem breaks down into the subproblems of finding voicing and fO, the PDA was designed to solve these two subproblems separately then combine the results (see figure 3.3). The approach of both subsystems is similar and can be described as follows: make assertions, verify them, combine results. Before describing the voicing and FO subsystems we discuss the structures on which they are based: assertions and rules. 32.1. Assertions Much of the information used and generated by the PDA during the analysis of a sentence is represented in the form of "assertions". Conceptually, an assertion is a statement about some aspect of the problem with supplementary information about confidence or statistics. Assertions may support and be supported by other assertions 53 Chapter 3 Figure 3.3 Basic Approach and it is through this support that the confidence or statistics of a given assertion is determined. As assertions are added (or removed) from the system, the support they provide influences other assertions that are already present. The PDA is designed to propagate the influence of each change through the entire network of assertions. This means that the confidence in any given assertion is always kept up to date. A more detailed discussion of the implementation and operation of this network is presented in Chapter 4. 3.2.2. Voicing Related Assertions In the voicing subsystem, most assertions take the form of symbolic statements about the utterance as a function of position. Each statement includes a description (e.g. "voiced" or "fricated"), a domain over which the assertion applies (some portion of the utterance), and a confidence (a numerical measure typically varying over the duration of the assertion. The phoneme, word, syllable and phrase marks in the input 54 Chapter 3 (shown in figure 3.1) and the assertions generated by the PDA (shown in figure 3.4) are all assertions of this form. The graphical depiction presented in these pictures shows the "domain" of the assertion as a line bounded by two boxes. The "description" is indicated by the string centered above the line 4. For example, at the word level of the input transcript shown in 3.1 the descriptions are the words of the sentence. Mark Type Voicing Marks G N BRST _ NB NS NS i NV PVF 1 PBe PA iPF PF I PV : .PS PS I- 4 G i NS NV cS NS N NVNV I - NS NV . Ja- PA PF PF : PVPV PS a-' P!5 PA PF PV PVPV PS i NS NS NV PVF PB PA NV PVF PVF 2K P PB PB 3 NS NV PA PF PV PV PF PV PS 1 --- 0 PV Ps pv o-- PS 34846 BRST - Stop burst. NS - Numerically measured silence. NV - Numerically measured voiced. G - Phonetically inferred gap. PVF - Phonetically inferred voiced frication. PB - Phonetically inferred voiced bars. PA - Phonetically inferred aspiration. PF - Phonetically inferred frication. PV - Phonetically inferred voiced. PS - Phonetically inferred silence. Figure 3.4 Voicing Assertions 4 In these figures, the confidence in the assertions is not shown. The confidence contours of some typical assertions are shown in a later section on combining assertions. 55 Chapter 3 32.3. FO Related Assertions In the fO subsystem each assertion takes the form of a sequence of probability densities for fO over a portion of the utterance, and a sequence specifying the confidence that those densities are "valid". Any such assertion may be valid or invalid at any position within their domain. If the assertion is valid then the given probability density applies to fO at that index. If it is invalid then there is no additional information about fO there. The pseudo-intensity plot of fO probability versus position in figure 3.2 displays the unique top level assertion (<FINAL-PITCH>) to which all other fO assertions contribute. This assertion spans the entire utterance and specifies the probability density for fO at each temporal position. 32.4. Combining Assertions A system called the "knowledge manager" is responsible for updating the network of assertions whenever change occurs. This same system is also responsible for computing the confidence (and in the case of fO assertions the fO probability densities) of assertions. The details of the procedure for determining confidence are given in Chapter 4, but there are a few points mentioned here to help the reader understand the material that follows. The confidence of each voicing assertion is expressed as a special kind of odds. The odds of some event can be derived from the probablility of that event by the expression O (x )= P (x )/ (1-P (x )). Thus, the odds of an event ranges from 0 to oo. Odds are usually expressed as a fraction. For example, 1/1 odds reads "1 to 1" and implies a 50% probability. Odds that are greater than one are called "in favor" and odds less than one are called "against". For example, one would say the odds are 9/1 in favor of voicing (90% probability) or 1/9 against voicing (10% probability). 56 Chapter 3 Confidence in the PDA is expressed by what we call the "odds-factor". Specifically, the odds-factor of an assertion is the ratio of the "current odds" to the "a priori odds". Like odds, odds-factors also range from 0 to oo. If an assertion is receiving no support (unlikely since the creator of the assertion typically supports it) it will have an odds-factor of 1.0, signifying that the assertion odds have not changed from their a priori value. The reasons for using odds-factors to represent co.ifidence are given in Chapter 4. The fact that they are used means that the assertions made by the PDA must be interpreted in the context of the a priori odds. If the odds-factor for a "voiced" assertion is 2.0 at some position, it does not mean that the odds are 2/1 in favor of voicing, it means that the odds are twice as high as they were a priori. If the a priori odds were 1/10 against, then they are now 2/10 against. All voicing assertions (those derived from the transcript as well as those derived from the waveform) support a single assertion named <VOICED> which spans the entire utterance. The confidence of <VOICED> is determined by the net contribution from all supporters at each sample index, through a procedure that is described in detail in Chapter 4. The odds-factor for <VOICED> is the product of the "supportfactors" from each contributor, with a support-factor of less than one lowering the confidence of <VOICED> and a support-factor greater than one raising it. It is the odds-factor for <VOICED> that is displayed in the output (figure 3.2) as the confidence of voicing. To clarify this process, figure 3.5 shows the multiple contributions of support for <VOICED> over a subinterval of an utterance. The first plot shows the odds-factor for <VOICED> as a function of time, and the latter plots each show the support- 57 Chapter 3 Mark Type Transcript Support from: PV-VOICEO-SUPPORT rule 71 Phrase i__~~coa ___ lords - r Syllables p P Phonemes QP 1.8 I 0 i 11498 8588 Odds-Factor 53.5 11498 8580 Support from: PF-VOICED-SUPPORT3 rule 7 <VOICED> / i I 0 i 0.0' 8500 11498 Support from: NS-VOICED-SUPPORT rule 7 11498 8588 Support from: PS-VOICED-SUPPORT rule 7 1.0 - 1.0 08' 858 11498 Support from: NV-VOICED-SUPPORT rule 7 \ 0' 85088 11498 Support from: PV-VOICED-SUPPORT rule 7 1.8 a. 1. a H 1.80 ! 1.0 B0 8580 11498 8588 11498 8500 Support from: NV-VOICED-SUPPORT rule 7 Support from: Similarity 71 1.0 I 8580 8SOO8 11498 Figure 3.5 Support for <VOICED> 58 11498 Chapter 3 factor from a contributing assertion, labelled by the rule through which that assertion contributes support. For example, the support labelled "from rule <PV-VOICEDSUPPORT>" is from a phonetic voiced assertion (probably due to a vowel in the phonetic transcript). This support has a value larger than 1.0 where the vowel is located (since the vowel increases the confidence of voicing) and 1.0 elswhere (no contribution). On the other hand, the support labelled "from rule <NS-VOICED- SUPPORT>" has a value less than 1.0 where a numerical silence assertion is located (since silence decreases the confidence in voicing) and 1.0 elsewhere. For fO, there is a similar <FINAL-PITCH> assertion. assertion contributes probability densities for f Each supporting pitch over its domain, and in <FINAL- PITCH> they are combined. In figure 3.6 there are windows showing the individual contributions of fO probability density at a single position. The upper windows each represent the probability density contributed by a single assertion, this density reflects both the confidence and the basic density being asserted. The bottom window shows the final density that results from combining the upper ones. This is one slice of the fO density shown in the pseudo-intensity plot in 3.2. 3.3. The Rules The following explanation of rules should help the reader to interpret the rules given as examples in the following text. A more detailed description of the rule system is given later in this chapter, and details about the implementation are provided in Chapter 4. This discussion and the explanation of the system that follows use essentially literal copies of the rules as found in the program. While that form of presentation may be difficult to understand at first, the rules are presented in this fashion to avoid 59 -- Chapter 3 Transcript Mark Type p(f 0) Contributor 0.44iL Phrase Words k'k~ k Syllables Phonemes k - f 498 . . 205088 Am litude 21749 8 p(f 0) 8.207 Waveform Contributor 0. X~~*elsllellll T j I, i'A ! , lit ~.,~ ~~ ,vii.,,.,,..~, J1i: 't _'*J ' n i 28588 p(f8) 8.22 I I Fina Oensity at 28758 8.8' . -fe 498 .. 8 p(f 8) 8.219 i 21749 Contributor I ontri 0 p(f8) 8.226 f498 CContributor 8.8 8 p(f8) 8.88588 f8 498 8 4,-J 498 a l Contributor 8.80 8 ---49f 498 Figure 3.6 Support for <FINAL-PITCH> the confusion that sometimes occurs when an "English translation" is presented in place of actual rules from the program. We hope the combination of rule text and explanation will serve both the casual reader and one intent on understanding the details of the program. 60 Chapter 3 Consider the following rule for voicing determination: (def rule <phonetic-voiced> COND IT IONS (type 'p-mark x) (some (voicing x) :voiced) (let start (start x)) (let end (end x)) ACTIONS (assert (phonetic-voiced start end) .8 .2) PREMISE-ODDS (lambda (x) (odds x))) The rule's name is "<phonetic-voiced> ". The three uppercase symbols in the text of the rule precede the (only) three significant parts of any rule. The CONDITIONS part must be satisfied for the rule to "trigger". Triggering of the rule creates an object called a "binding" which both stands for that triggering of the rule and describes the unique association of assertions (and other objects) with rule variables that caused the triggering. Like an assertion, this binding has a confidence (represented as an odds factor). Unlike an assertion, this confidence is not computed by multiplying the support-factors of supporters. Instead, the supporters (which are a subset of the variable values) determine the odds-factor of the binding through the PREMISE-ODDS expression in the rule. Only variables which appear in the premise-odds expression are in fact supporters of the binding and contribute to its odds-factor. Finally, the ACTIONS part of the rule is a collection of lisp expressions that specify the changes to be made when the rule is "fired". These forms typically make other assertions, but it is also possible for the actions to cause the binding to give support to existing assertions. This is the way the rules mentioned in figure 3.5 provide support for the assertion <VOICED>. For the preceding rule, the first form in the conditions declares the variable x, and specifies that it should be bound to new assertions of type p-mark. Each time a p-mark (phoneme-mark) assertion is made, the conditions of this rule will be checked 61 Chapter 3 with the value of x bound to that assertion. The second form in the conditions specifies that the value of the expression (voicing x ) be :voiced, thus the phonememark bound to x must be for a voiced phoneme in order for the rule to trigger. The next two forms are let forms. They both declare variables (start and end respectively) and assign them to the value of an expression (in this case, start = (start x ) and end = (end x )). In part, these let clauses serve as a shorthand to avoid the need to rewrite expressions either in the conditions, the premise-odds or the actions. In addition, declaring any variable in the conditions (either with a type form or with a let form) signifies to the knowledge manager that this binding "depends" on the assertion or object that is the value of the variable. Such dependency is discussed in detail in the section describing the knowledge manager, but basically it causes any conditions forms involving the variable to be rechecked whenever that variable value (object or assertion) is changed. Since the knowledge manager only monitors changes in variable values and not changes in expressions, assigning a new variable to an expression is the only way to make the rule responsive to that expression. For example, the expression (start x ) returns an object that stands for the boundary at the start of the phoneme-mark. Suppose the status of the boundary object was significant to the rule. Suppose it was important that the boundary object was also the start of a word. If at some point the rule was triggered and a binding was created, then it would be necessary for the knowledge manager to check the rule conditions if that boundary object were to change. Such a change might mean that the boundary object was no longer the start of a word, and that the binding was consequently invalid. The point is, the boundary object could change without the phoneme-mark changing. The only way to be sensi- 62 Chapter 3 tive to changes in the boundary-object that is the value of the expression (startx) is to use a let clause to assign a new variable to it. As it happens, in this rule the values of the variables start and end are not used in the conditions, so the let clauses only serve as shorthand. The premise-odds form is a lambda expression. In lisp this is a procedure definition, where the first list after the word X is the argument list (x in this case), and the remainder of the form are expressions to evaluate, the last of which gives the value of the subroutine. In this rule, the value of the premise-odds form is just (odds x ) the odds-factor of the phoneme-mark that triggered the rule. This becomes the odds-factor of the binding. The reason this is a lambda expression is that it is necessary to have an argument list. Recall that the supporters of the binding are the values of variables that appear in the premise-odds form. Rather than have the PDA analyze the premise odds form to determine what variables are present, a list of the variables used is required. This both simplifies the task of the PDA and provides a check on the user's programming, since they must properly declare all variables and use them all in order not to generate warnings. Since the argument list and the body of the premise odds form were both already present, it was only logical to turn it into a common procedure declaration. Eventually, when the rule is fired, the actions part makes a phonetic-voiced assertion that covers the same interval as phoneme-mark that triggered it. The two numbers in the (assert ... ) form determine how the binding supports the phoneticvoiced assertion. A detailed description of the meaning of these numbers is given in Chapter 4, but basically the first number is the "probability of detection" and the second is the "probability of false alarm". The first number specifies the probability that the rule will be triggered if the actions are appropriate, the second specifies the 63 Chapter 3 probability that the rule will be triggered if the actions are NOT appropriate. These two numbers are chosen by the system designer to represent their knowledge of the rules' behavior. The following paragraphs recap the features and purposes of the three parts of a rule. * CONDITIONS These are a set of lisp forms that must be satisfied for the rule to trigger. They specify the assertions that are necessary and as a side effect they bind rule variables to these assertions. These variables are passed to both the PREMISE-ODDS and the ACTIONS part of the rule. (e.g. be satisfied that must tests are just forms Many (same (voicing x ) :voiced ) is such a test). Let forms both declare variables and bind them, and type forms declare variables, specify that they should be bound to new assertions, and specify the type of assertion to bind them to. For the rule to be triggered, all tests in the rule must be satisfied. For each (type ... ) form, an assertion of the proper type must be available and all test forms in the conditions must be true. On each iteration, the rule system takes each rule that is triggered and creates a binding that specifies the rule and the variable values that triggered it. Only one binding is created for each way assertions in the database can be associated with type variables in the rule. Thus each rule will fire exactly once on each satisfactory configuration of assertions in the database. * ACTIONS These are a set of lisp forms that change the state of the system by altering existing assertions, or by adding new assertions to the database. New assertions made during the execution of the ACTIONS are "supported" by the binding in accordance with the probability of detection and probability of false alarm specified in the assert clause. The absence of such probability specifications implies the assertion will be support through some other means. * PREMISE-ODDS The premise-odds X expression determines the confidence in the binding that is created each time a rule is triggered. The variable values that support the binding must be declared in the argument list of the X expression. The last form in the X expression specifies the binding's odds-factor. 64 Chapter 3 The rules which follow are not an exhaustive list of the rules in the PDA. Samples are taken from each of the types of rules. For example, though there are 10 rules for subsegmenting stops (corresponding to different phonetic contexts), only one such rule (<silence2> is presented here). 3.4. Determining Voicing 3.4.1. Overview In PDA voicing is determined in two stages. The system first makes estimates of the voicing mode in various subintervals within the utterance. Then, by combining those estimates, it predicts the likelihood of voicing at each sample. The modes of voicing that PDA seeks to identify are: voiced, frication, aspiration, voiced-frication, voiced-bars and silence. All these voicing modes can be derived symbolically based on knowledge of the phonemes and their locations, and a subset of these modes (voiced and silence) can be found numerically by analyzing power levels in different spectral bands, and by looking for similarity in time. Some phonetically derived assertions are first verified by checking for acoustic properties numerically. The presence (or absence) of these acoustic properties in turn affects the confidence of the voicing mode assertion. While no corresponding effort was made to verify numerically derived voicing mode estimates using phonetic information, there is no fundamental reason why that could not be done. However, given the nature of this project there was only time to implement one type of verification, and the phonetically derived assertions seemed to have the greater need for it. The flow of analysis from the input to the voicing assertions and finally to <VOICED> is depicted in figure 3.7. Phoneme marks in the transcript are converted 65 Chapter 3 Figure 3.7 Determining Voicing to voicing marks (Voicing Analysis). There is also a rule that spots the locations where bursts can be expected (Likely Burst) whose actions execute a burst identification procedure (Burst Ident.). This procedure analyzes the waveform and may produce a burst-mark if one can be located. From the waveform, broadband, 66 4 Chapter 3 low-frequency (0-900hz) and high-frequency (2500hz +) power estimates produce a set of power curves that are numerically analyzed to generate voicing marks (Power to Voicing). The "similarity" of the waveform5 is also measured (Similar. Detect.), and for many of the phonetically derived voicing marks, verification of their expected properties is carried out using the waveform (Phonetic Verify). Finally, the voicing marks that have been created and the similarity measurement of the waveform provide support to the unique assertion <VOICED> (Voiced-Support Rules) which is the voicing conclusion of the PDA. 3.4.2. Knowledge in the PDA This section provides a brief description of each of the ideas used. in PDA to determine voicing. The next section shows how those ideas were implemented. Voicing from Phoneme Voicing Most phonemes have inherent voicing (vowels are voiced, fricatives are frication, ... ). In PDA there is a table that defines the properties (including voicing mode) of the various phonemes. Voicing from Stop Aspiration Stops have predictable burst duration and voice onset time (VOT). During the burst the voicing can be assumed to be frication, and during the rest of the VOT the voicing can be assumed to be aspiration. s By this we mean how similar the waveshape is to the waveshape anywhere nearby. Strong similarity implies voicing. 67 Chapter 3 Silent Gaps If there is a gap from the start of the utterance to the first phoneme-mark or from the last phoneme-mark to the end of the utterance, the speech is likely to be silent there. If there is a gap between the end of one word and the start of another, the speech is likely to be silent there also. Location of Stop Bursts In the vicinity of an unvoiced stop release, the boundaries of the burst can be located by scanning outward from the peak in fricative energy. Voicing from Sonorant Power Large amounts of "sonorant" power (60-600hz) suggests that the speech is voiced. Voicing from Broadband Power Low Levels of broadband power signify silence. Voicing From Similarity If at any index it is possible to locate a nearby index where the waveform is "similar", then the speech is voiced. 3.4.3. Using the Knowledge Voicing from Phoneme Voicing The simplest example of voicing determination is the direct hypothesis of voicing from the identity of a phoneme. One rule which accomplishes this is shown in figure 3.8. This rule says "given a p-mark (phoneme mark) whose voicing is :voiced, make a 68 Chapter 3 _ _ (def rule <phonetic-voiced> CONDITIONS (type 'p-mark x) (same (voicing x) :voiced) (let start (start x)) (let end (end x)) PREMISE-OODS (lambda (x) (odds x)) $xS is a phoneme-mark the phoneme must be voiced shorthand variables for SstartS and SendS ; the binding confidence is that of ; the phoneme-mark $Sx ACTIONS (assert (phonetic-voiced start end) ; assert a voicing mark .8 ; probability of rule firing if it should (.8) .2)) ; probability of rule firing if it shouldn't (.2) Figure 3.8 Voicing from Phoneme Identity phonetic-voiced assertion over the interval of that p-mark" 6 . There is a similar rule for the following categories: frication, voiced-frication, voiced-bars. The assertion support numbers of .8 and .2 represent moderate confidence that the mark given by the phonetician really warrants such an assertion (values of 1.0 and 0.0 would represent absolute confidence, and values of .5 and .5 would represent no confidence whatever). Voicing from Stop Aspiration An example of the determination of voicing from stop aspiration is shown in figure 3.9. This is one of ten rules that make frication and aspiration assertions from marks involving unvoiced stops. These are the most complex rules in the system7 This rule seeks phoneme clusters of the form 6 A "mark" is our term for some annotation of the utterance. 7 Their number and complexity reflect an interest in experimenting with the expression of knowledge within the PDA system more than the significance that this information may hold for voicing determination. 69 Chapter 3 (defrule <silence2> CONDITIONS (type 'p-mark stop) ; $stopS is a phoneme-mark (isa 'unvoiced-stop-release stop) ; it must be an unvoiced stop release (let left (left-neighbor stop)) ; the preceding phoneme-mark (isa 'stop-closure left) ; must be a stop closure. (let 2nd-left (left-neighbor (left-neighbor stop))) ; 2nd preceding (let right (right-neighbor stop)) ; following phoneme-mark (isa 's 2nd-left) ; 2nd previous phoneme must be an /s/ (nota 'semi-vowel right) ; following phoneme mustn't be a semi-vowel (type 'speech-sampling-rate sampling-rate-assertion) ; find Fe assertion (let sampling-rote (value sampling-rate-assertion)) ; extract Fs (let stop-start (start stop)) ; starting epoch of $stopS (let stop-end (end stop)) ; ending epoch of Sstop$ ; find the nominal vot for this stop context (let avrg-ms (vot-mean-s-stop-vowel stop 1000.0)) ; locals representing the fraction of release that is frication (let fric-fraction (frication-fraction stop)) ; and the typical standard deviation in $stop$'s VOT. (let deviation-ms (stop-deviation stop)) PREMISE-OODS (lambda (stop left 2nd-left right sampling-rate deviation-ms avrg-ms) (min (odds 2nd-left) (odds left) (odds stop) (odds right) (let. ((dur (// ($length (domain stop)) .001 sampling-rate))) ; dur in ms ; determine the binding confidence based on how typical this stop's VOT is, and the confidence in the stop context. (odds-gaussian (- dur avrg-ms) (e 2 deviation-ms) 5)))) ACTIONS (let. ((dur (Slength (domain stop))) ; stop release duration (in samples) (fric-dur (round ( dur fric-fraction))) ; duration of frication Build an epoch to represent the boundary between frication and aspiration. Uncertainty is given by the uncertainty of VOT. (epoch (epoch:shift stop-start fric-dur :standard-deviotion ( deviation-ms .001 sampling-rate)))) assert marks for the fricative and aspirative portions or the release (assert (phonetic-frication stop-start epoch) .8 .4) (assert (phonetic-aspiration epoch stop-end) .8 .4)))) Figure 3.9 Voicing from Stop Timing /s/<unvoiced closure> <unvoiced stop release> <not a semivowel> The variable avrg-ms is assigned to the expected duration of the stop release, fric -fraction is assigned to the proportion of that release that is expected to be fricated, and deviation -ms is assigned to the expected standard deviation of the stop release. All three numbers are properties associated with each stop type, based on the phonetic context. This information comes from a study of stop durations[ 181]. 70 - Chapter 3 In this rule the premise-odds form uses the statistics of stop duration and the confidence in the phoneme marks to assign confidence to the bindings created when the rule fires. As was mentioned earlier, the confidence in the binding in turn determines the support for the phonetic-frication and phonetic-aspiration assertions made in the actions of the rule. A new epoch 8 must be created to represent the frication/aspiration transition during the release, and the rule provides information regarding the certainty of the position of that epoch. This is the purpose of deviation -ms in the ACTIONS of the rule. The form (epoch :shift ... ) creates a new epoch shif d from the beginning of the stop (stop-start ) by the duration of frication (fric -dur). The :standard-deviation argument to this function means the resulting epoch will have a larger standarddeviation than the epoch it was derived from and the amount of increased uncertainty is given by the deviation -ms expression in the ACTIONS. Silent Gaps Inferring the existence of gaps from the end of the phonetic transcript to the end of the utterance is accomplished with the rule shown in figure 3.10. This rule says "Find an utterance, and find a word that is also at the end of a phrase. Make a gap running from the end of the word to a new epoch created at the end of the utterance.", The reason for different versions of the procedure end in the conditions is that the utterance has no epochs at its start and end. This is also the reason for creating a new epoch at the location where the utterance ends (the standard deviation of the ending epoch was chosen as an intuitively reasonable value). 8 "Epoch" is our term for the object that stands for the boundary between marks. Epochs are discussed in more detail below. 71 Chapter 3 (def rule <phrase-end-gap> CONDITIONS (type 'word-mark word) (type 'utterance utt) (type 'phrase-mark phr) (end-of phr word) (let start (end word)) (let end (Send utt)) ; a transcript word mark ; the waveform ; the transcript phrase mark ; Swords ends at phrase end. ; the ending epoch of Swords ; Sutt$ is not bounded by epochs. ; Therefore, this is just a number. ; Smart '<' fcn can compare the epoch ; Sstart$ with the number SendS (< start end) PREMI SE-OODS ; binding confidence is given by the confidence in SwordS and Sutt$ (lambda (word utt) (min (odds word) (odds utt))) ACTIONS ; Since there was no epoch at the end of $uttS$ we must make one. (assert (gap start (make-simple-epoch :mean end :sd 100 :support utt)) .9 .4)) Figure 3.10 Finding Gaps This particular rule presents an interesting use of the support factors in the (assert ... ) clause. The values of .9 and .4 say "This rule is very likely to detect gaps, but it also has a high false alarm probability". The likelihood of detecting gaps stems from the fact that the phoneticians are not likely to put phoneme-marks past the end of the sentence. The false alarm probability is high because even though there are no phoneme-marks there, there may be residual speech in the waveform (coughing, comments from the experimenter, etc.). Location of Stop Bursts It is helpful to determine where stop bursts occur because their brief high intensity can cause inappropriate voiced support from modules that determine voicing from power. In this situation, the retraction of support for the voiced conclusion caused by the existence of a burst assertion counteracts the power indicators. 72 4 Chapter 3 The reasons for numerically locating stop bursts were: * * The estimates of the end of the burst from information about the phonetic environment of the stop were highly variable. This was an interesting opportunity to demonstrate one technique of combining symbolic and numerical processing. Namely, the use of symbolic information to select the area of application of a signal processing algorithm. The rule which accomplishes this task is a simple one because most of the effort is contained in the signal processing procedure. As can be seen from figure 3.11, the rule looks for the utterance and an unvoiced stop release phoneme mark then calls the function scan -for -burst which actually makes the new assertions. This procedure finds the peak in power interior to the stop and works its way outward looking for a drop in power. Such a simple algorithm would be impractical as a purely numerical means of locating bursts, but in the tighter context provided by a stop phoneme mark it is effective. Notice that the local variables start and end are not in fact used in either the ACTIONS or PREMISE-ODDS of the rule. This is an example of how local variables signify binding dependency. If those epochs were changed for some reason, the old (def rule <numer ic-burst0> CONDITIONS (type 'p-mark x) ; $x$ is a phoneme-mark (type 'utterance utt) ; $utt$ is the waveform (isa ':unvoiced-stop-release x) ; $xS must be a stop release (let start (start x)) (let end (end x)) PREMISE-ODDS ACTIONS (scan-for-burst utt x)) Figure 3.11 Rule to find stop bursts 73 - Chapter 3 binding would become invalid and a new one would be created. Invalidating the old binding would retract any previous results from the procedure scan -for -burst. Subsequently, when the rule was fired again (due to the new binding), scan -for-burst would be reinvoked over the new region delimited by start and end, potentially yielding different results. Also note that there is no PREMISE-ODDS form. This is because the support for any numeric-burst marks that are generated by scan -for--burst is determined by that procedure and not from any of the assertions that triggered this rule. Voicing from Sonorant Power Measuring the amount of low-frequency (sonorant) power (0-600hz) is one way of determining voicing. Unvoiced excitation of the vocal tract (e.g. frication) has mostly high frequency content, whereas glottal vibration always generates substantial low-frequency power. The rule for this task (figure 3.12) is like the one for bursts in that it runs a numerical procedure. In this case, there is no domain qualification by the symbolic information ad the procedure is free to make numeric-voiced conclusions wherever sonorant power is high. (def rule < numeric-voicedl> CONDITIONS ; Sutt$ is the waveform (type 'utterance utt) ; $pwr$ is assertion concerning the maximum sonorant power (type 'max-sonorant-power pwr) ; $pwr-volue$ is the numerical value of that assertion. (let pwr-value (value pwr)) PREMISE-ODDS ACTIONS (scan-for-voicing (sonorant-log-power utt) pwr-value)) Figure 3.12 Voicing from Sonorant Power 74 ___1__11 Chapter 3 As with the previous rule, the odds of the created assertions are determined in the scanning process, not by condition variables. Thus there is no need for a PREMISE-ODDS form since the binding's confidence is irrelevant. This reflects two different purposes of variables in rules. One is the use of variables that have a direct bearing on the confidence of results, the other is the use of variables as a mechanism of controlling the behavior of the system (as in this case). Voicing from Broadband Power This rule is essentially equivalent to the previous rule, except that the scanning process looks for regions of the waveform that have power within a few dB of utterance minimum. The one thing that distinguishes this rule from the previous (as seen in figure 3.13) is the requirement of a high -snr assertion in the CONDITIONS. When experiments were undertaken with noisy speech, it became apparent that the algorithm used by this rule was ineffective if the speech was noisy. The high -snr is made by another module if the difference between maximum and minimum power is found to be greater than 40db. (def rule <numeric-silencel > CONDITIONS (type 'utterance utt) ; the waveform (type 'min-power pwr) ; the minimum power assertion ; the waveform has a high signal to noise ratio (type 'high-snr snr) PREMISE-ODDS ACTIONS (scan-for-silence (broadband-log-power utt) (value pwr))) Figure 3.13 Silence from Broadband Power 75 Chapter 3 Voicing from Similarity This method of determining voicing was built into the system in a different way than the previous. Rather than running a scanning procedure and producing numericvoiced assertions, this rule (figure 3.14) uses the time varying similarity of the waveform to establish the confidence in the binding and supports the final voiced result v directly. The decision to implement this idea in this fashion was partly due to time limitation. The earlier examples all produced results in the form of voicing assertions which in turn supported the final voicing assertion. This two step process meant that other parts of the system had access to the results of those modules. This more direct implementation saved the extra step of producing numeric-voiced assertions at the expense of rendering the results invisible to the rest of the system. (defrule <simi Iority-voiced-support> CONDITIONS (type 'utterance utt) ; the waveform (type 'voiced v) ; the unique assertion <voiced> PREMISE-ODDS (lambda (utt) ;; low sonoront power makes this measure invalid (seq lower (seqcl ip (seq-exponentiate (seq-scale (sonorant-log-power utt) .05 (+ 20 (seq-min (sonorant-log-power utt)))) 1.0 10.0) 1.0 1e.e) ;; use a Ipf on the utt to eliminate uncorrelated power. ;; consider 2.0 similarity as warranting an odds of 1/1 on the premise. (seq-clip (seq-scale (local-similarity (gr-pd:gr-lpf utt)) 1.0 1.0) .1 100.0))) ACTIONS (provide-support v .8 .01)) Figure 3.14 Voicing from Similarity 76 4 Chapter 3 Another problem with this implementation of the similarity rule is that the epochs that would have been created for the numeric-voiced assertions were not made. So any improvement in accuracy that might have come from merging those epochs into the rest of the epochs of the system has been lost. 3.5. Determining FO FO is determined through the interaction of the following ideas: numerical estimation of the spacing between similar places in the waveform, fO prediction from phoneme identity and speaker sex/age, fO prediction from sex/age alone, fO derivative prediction from phonetic environment and stress, and fO range estimation based on preliminary (presumably trustworthy) estimates. This organization is depicted schematically in figure 3.15. 3.5.1. Concepts Waveform Similarity This part of the PDA corresponds to the core of conventional pitch detectors, the part which estimates fO or period using numerical techniques. The basic principle is that periodic glottal pulses exciting a slowly varying vocal tract lead to periodicity in the acoustic output that can be used to determine fO. FO from Phonetic Identity Using information available from speech research[9] the PDA can predict fO from the identity of vowels and dipthongs and the sex/age of the speaker. 77 _I_ Chapter 3 <FINAL-PITCH> Figure 3.15 Determining FO FO from Sex/Age With other phoneme classes, for which no fO bias information is available, a prediction can still be made by using the data for a central vowel like /a/ and specifying a greater uncertainty in the estimate than in the phoneme specific case. FO Derivative from Phonetic Context The effects of consonantal context on the fO trajectory during vowels was investigated by Lea[13]. This information is used by the PDA to advise the numerical pitch 78 I__I _ _ _ _· _ 14 Chapter 3 detector about the likeliest pitch trajectory at the beginning of certain vowels. FO Range from Preliminary Numerical Results The pitch range within a single sentence is rarely greater than one octave, frequently less. In part, this range is being estimated by the phonetic identity and phonetic sex/age information above. However, by collecting a set of estimates of fO that are felt to be reliable it is also possible to estimate the range numerically. 3.5.2. Examples of Implementation Waveform Similarity The numerical period estimation method used by the PDA is based on finding similar positions in the waveform. The technical details of the algorithm are discussed in Chapter 4, but in simple terms the algorithm windows the utterance at a location where the fO estimate is to be made and compares that waveshape with nearby regions of the utterance that have also been windowed. Spacings between the windows that lead to similar waveshapes become candidates for estimating the period of the speech at that location. Figure 3.16 shows the similarity as a function of spacing in the middle of the vowel sound /i/. This algorithm is typical of time domain algorithms that are based on correlation measures in that it has a certain amount of background "clutter" between peaks and that there are peaks at multiples of the "true" period in both directions. The rule responsible for computing fO numerically is shown in figure 3.17. The CONDITIONS extract the utterance from the known assertions and find apriori-f 0 which is a function that specifies the odds of a given fO value. This is the result of the "fO range from preliminary numerical results" module and is discussed below. There 79 Chapter 3 Warped NLA 7 Similarity of /i/ I I I 1 i I:I I iii I iI I '11 I I1 tsi i I ' jij a -280 298 Figure 3.16 Typical Speech Similarity for /i/ (def rule <pd-on- remaining-support> CONDITIONS ; the waveform (type 'utterance utt) ; a preliminary f assertion (type 'apriori-fO apriori-fO-assertion) (let apriori-fe (value apriori-fe-ossertion)) ; and its value ; operator control switch (type 'do-final-pd switch) ; confidence is determined numerically PREMISE-OODS ACTIONS ; all phonemes (lete ((phonemes (get-from-km 'p-mark)) ; regions that have not yet been processed for similarity (support (support-without-sim-pitch (apply '$cover phonemes)))) (loop for ivl being the intervals of support do (format t &final pd on ivl ao"ivl) (assert (sim-pitch-assertions (gr-pd:gr-lpf utt) ivl 0.00 nil .02 apriori-fe))))) Figure 3.17 FO from Similarity is no need for a PREMISE-ODDS because the certainty information for the resulting fO assertions are solely determined from the numerical measurements. A low-pass filtered version ,of the utterance together with this apriori-f 0 function and some control parameters are used to invoke the numerical pitch detector. The parameters passed to the procedure sim -pitch -assertions are (from left to right): the interval over which to detect fO (ivl), the expected period change per cycle 80 Chapter 3 (0.0), the "preferred" look direction (nil ), and the range of possible period change per period (.02). In this particular rule, most of these control parameters are set to a neutral value (for example, no period change is expected and there is no preferred look direction). The other rules which invoke this procedure do supply "advice" in the form of values to these parameters, advice which depends on the phonetic context of the region being analyzed. The choice of .02 for the period "jitter" was selected based on speech research[26], and the results of a brief jitter experiment that we conducted. The details on the influence of these parameters on the pitch detection algorithm are presented in Chaoter 4. The variable support in the ACTIONS stands for the set of intervals that have not yet been numerically tested for pitch. Determining the regions that have already been analyzed is necessary because the implementation of the idea "fO derivative from phonetic context" involves running the same numerical algorithm in a few intervals with advice about what to expect, and two sets of results from the same numerical algorithm in the same location of the utterance would conflict with independence assumptions made in the combination of such results. Thus this rule must restrict its actions to those intervals that have not already been numerically tested by this detector. The existance of the switch do -final -pd points out a lack of complex control in this system. After the PDA has derived all it can from the phonetic input and waveform (including numerical pitch analysis in regions where phonetic advice is available), it will cease making assertions (without ever having run the previous rule). At this point, the operator adds the single assertion do -final -pd , triggering the above rule and performing numerical pitch analysis on the remaining regions. 81 I Chapter 3 FO from Phonetic Identity As was discussed in Chapter 2, tables are available that specify fO given the phoneme and the speaker sex/age (either MALE, FEMALE or CHILD)[91]. In PDA, this information is used to generate fO probability contours over the phonemes for which it is available. These contours have their mean given by the value in the table for that phoneme and speaker, the standard deviation is chosen as 30 Hz, and the odds-factor is given by the odds-factor of the phoneme in question. The standard deviation was chosen as a balance between having some useful effect and encompassing the natural variations of different speakers. While the oddsfactor is dependent on the odds-factor attributed to the phoneme, no use was made in this system of the ability to give individual phonemes different odds factors. Instead, they were all fixed with a nominal odds-factor of 4.0. One of the two rules that stem from this idea is shown in figure 3.18. The other 'A (def rule <phonet icpi tch-f rom-vowels> CONDITIONS (type 'p-mark pm) ; a phoneme-mark (isa ':vowel pm) ; a vowel (type 'sex sex-assertion) ; the sex assertion (let sex (value sex-assertion)) ; its value ; confidence determined numerically PREMISE-ODS ACTIONS ; select the mean f from the sex/age (let ((fe (inherit pm (selectq sex (:male :p-b-men) (:female :p-b-women) (:child :p-b-children))))) ; mean: f, standard-deviation: 3hz, odds from the phoneme-mark '(lImbda (ignore) f) '(lambda (ignore) 30) (assert (pitch-assertion (fcn-taper-over- ivl (lambda (ignore) (send pm :odds-factor)) pm) (domain pm))))) Figure 3.18 FO from Phoneme Identity 82 Chapter 3 performs the same task for dipthongs. The CONDITIONS require a vowel phoneme mark and the sex/age of the speaker. The PREMISE-ODDS are not used. The ACTIONS first look up the proper fO value based on the sex/age using the (inherit ... ) form. Then a pitch assertion is made using that fO, a nominal standard deviation of 30hz, the same domain as the phoneme mark, and a confidence equal to that of the phoneme mark (tapered to zero outside the domain of the phoneme mark). In the dipthong form, the one difference is that the fO value slides from the value attributed to the configuration at the start of the dipthong to that attributed to the end. FO from Sex/Age Alone For regions of the utterance where the above information is not available, a more coarse fO estimate is made based on the sex/age alone (see figure 3.19). In this case, the mean fO comes from that for the neutral vowel /a/ for the speaker type in question. The standard deviation of 60 hz is broader than the above because this, estimate is less well constrained. Finally, the odds-factor is chosen as a fixed 5/1 odds. In figure 3.20 we see the f0 information contributed by the above two ideas. The plot shows time extending from left to right, f0 extending from bottom to top, and probability density is shown by the darkness of the picture. A vertical slice through this surface would plot the probability density for fO at that temporal location. The relatively narrow densities (narrower vertically) are those contributed by the information that is phoneme specific. The wider density is the more general contribution for the remaining regions of the utterance. 83 --~~~__ I--- Chapter 3 (def rule <phonetic-fe-on-remaining-support> CONDITIONS (type 'utterance utt) ; the waveform (type 'do-finol-phonetic-fe switch) ; operator switch (type 'sex sex-assertion) ; the sex assertion (let sex (value sex-assertion)) ; its value PREMISE-OODS ACTIONS (let. ((phonemes (get-from-km 'p-mark)) ;; the support yet to be covered by pitch-assertions (support (support-without-phonotic-pitch (apply '$cover phonemes))) (fe (inherit 'phoneme:<aa> (selectq sex (:male :p-b-men) (:female :p-b-women) (:chilId :p-b-children))))) ; iterate over the remaining intervals mean: f from /a/, standard-deviation: 60, odds-factor: 5 (loop for ivl being the intervals of support do (format t &final-phonetic-f0 on a" ivl) (assert (pitch-assertion (lambda (ignore) fe) (lambda (ignore) 60) (lambda (ignore) 5.0) ivl))))) Figure 3.19 FO from Sex and Age F8 Probaiiity 58 A, ,,r t. ro-,I ,... t`~~~~~ L-i L;~~~~~~~~id _ r ~~~~~~~~,i. f; i, - -,.; .j a a 34125 Figure 3.20 FO from Phonemes FO Derivative from Phonetic Context 84 _1_____ __ _ _____ ___ I _ 4 Chapter 3 Information about speech behavior near consonant vowel boundaries takes two forms. First, for certain cases there is an expected fO change at the boundary. Second, sometimes one expects periodicity to be more apparent on one side of the boundary than the other. In both cases, PDA uses this information to "advise" the numerical similarity detector in those regions. To employ information about the change in fO near consonant vowel boundaries the numerical waveform similarity measure has the ability to make use of estimates of "period moveout". As is described in more detail in Chapter 4, the similarity pitch detector first finds a nearby region of similarity then looks for similarity near double and triple that distance. When given advice about "period-moveout", those multiple period locations are adjusted to reflect the expected moveout. Also the confidence result, which is based on how closely the double and triple period peaks are to the nominal values, is influenced by such advice. The directional advice also affects the confidence result by establishing a penalty for similarity found in the non-preferred direction. The rule which performs this task is shown in figure 3.21. The CONDITIONS of this rule seek an unvoiced obstruent phoneme in a strongly or moderately stressed syllable followed by a dipthong or a vowel. The utterance, apriori-f 0 and speech sampling rate are also acquired in the CONDITIONS. Since the confidence of assertions will be determined numerically, no PREMISE-ODDS is required. The actions first establish ivl as the first .03 seconds of the vowel, then the similarity pitch detector is used over that interval. Notice in this case that the control parameters of the pitch detector which were defaulted in the previous example (figure 3.17) are being specified here. For example, the expected period moveout is "positive" time (since voicing will 85 --- ____·_-··-·-·-·--·---- Chapter 3 (defrule <advised-pd-l> CONDITIONS (type 'utterance utt) (type 'p-mark c) (not-null (memq (send c :phoneme-name) 'phoneme:(<p> <t> <k> <f> <th> <s> <sh> <ch> (type 'speech-sampling-rate sampling-rate-assertion) (let sampling-rate (value sampling-rate-assertion)) (type 'apriori-fO apriori-fe-assertion) (let apriori-fe (value apriori-fe-assertion)) (let v (right-neighbor c)) (or (isa :vowel v) (isa :dipthong v) (inherit v :y)) ; stressed (or (eq (send v :stress) 1) (eq (send v :stress) 2)) PREMI SE-ODDS ACT IONS (let. ((ivl (interval (Sstart v) (+ ($start v) (round ( sampling-rate .03)))))) (assert (sim-pitch-assertions (gr-pd:gr-lpf utt) ivl .01 :pos .02 apriori-f0)))) <q>))) Figure 3.21 FO Advice from Phonetic Context grow stronger to the right). FO Range from Preliminary Numerical Results The apriori-f 0 assertion is generated through the actions of the two rules shown in figure 3.22 and figure 3.23. The first simply generates the preliminarypitch-assertions using the low passed utterance. The second takes the histogram of those preliminary estimates that are very confident ("majority range") and establishes (defrule <preliminary-pd> CONDITIONS (type 'utterance utt) ; tne waveform PREMISE-ODDS ACTIONS (assert (preliminary-pitch-assertions (gr-pd:gr-lpf utt) (domain utt)))) Figure 3.22 Preliminary Pitch 86 Chapter 3 (def rule <numeric-pitch-range> COND IT IONS (type 'preliminary-pitch-assertions prelim-pd) (type 'final-pitch final-pitch) PREMISE-OODS ACTIONS (let ((range (send prelim-pd :range))) (cond ((finite-interval-p range) (let. ((majority-range (send prelim-pd :majority-range .95)) (left (Send majority-range)) (right (+ left ($Slength majority-range)))) (cond ((or (null-interval-p majority-range) (< right 20)) (assert (apriori-fO (lambda (ignore) 1.0)))) (t ;; since the speech may be glottalized we place no ;; limit on the maximum f0, only on the minimum f. (assert (apriori-fe (lambda (fO) (cond ((< fe (Send majority-range)) 1.0) (t (kbsp:interpolate left f right 1.8 .1 :clip)))))))))) (t (assert (apriori-fe (lambda (ignore) 1.0))))))) Figure 3.23 A priori FO an "upper maximum" fO by extending the maximum observed fO ("lower maximum") by the spread of the observed fO values. The apriori-f 0 function that is asserted will penalize fO values above the "lower maximum" linearly with fO up to 10/1 odds against at the upper maximum. apriori-f Note the tests in the actions that cause the 0 assertion to be simply a constant 1.0 (no effect). This occurs if there were no fO estimates from the preliminary-pitch -- assertions that were sufficiently certain, or if the number that passed were spread less than 20hz. These events can occur if there is sufficient noise that few "confident" numerical estimates occur. In this case, the rule becomes disabled. 3.6. Epochs 87 Chapter 3 3.6.1. Concept One definition of the word epoch is "a memorable event". This is the meaning intended by its use here. Some systems that attack problems related to time (e.g. SIAP and HEARSAY) represent the position of objects with numbers. For example, numbers are used to signify the time or frame index of the start of an object like a word (in HEARSAY) or a spectral line (in SIAP). In PDA we took an alternative approach of providing a complex data structure to represent time points. This decision was motivated by the complexity surrounding the manipulation of such events in this problem. The significance of events Speech is analyzed as a concatenation of objects. Words, syllables, phonemes, segments all break up speech into concatenated entities. To employ the symbolic knowledge available to it, PDA needs to know the locations of these objects. For example, it is known that a pitch fall can be associated with unvoiced- consonant/vowel boundaries. In order to position such an inflection, it is necessary to locate the boundary precisely. The boundaries between these objects are what the PDA considers "significant events". In the course of analyzing an utterance, several pieces of (relatively) independent information may be given about the position of some particular boundary. For example, to locate an uv-consonant/vowel boundary there will be information in both the input transcript and positions of threshold crossing generated by the numerical power measuring procedures. To do the best possible job of estimating the positions of these events, there must be a means to combine the information from all these sources. Combining position estimates is one purpose of the Epoch system. 88 Chapter 3 Alignment of Symbols and Waveforms Another important use of the Epoch system is the precise alignment of numerical and symbolic information about the utterance. The PDA uses both numerically derived and symbolically derived information to guide the operation of some numerical procedures. For example, if the numerical period estimator is to be advised about a likely fO fall after an uv-consonant/vowel boundary, then there must be a single statement about where it is expected to occur. Thus, he position information derived from separate numerical and symbolic sources must be combined before it can be used to guide the period estimator. The PDA aligns the phonetic transcript and numerical energy measurements with a single Epoch system. This means that an epoch hypothesized by a symbolic module and an epoch hypothesized by a numerical module will be "merged" into a single epoch if the system concludes that they refer to the same underlying event. This "merger" produces a single "composite" epoch that combines the statistical information from both of the constituent "simple" epochs for better accuracy, and also serves to logically connect (in a network of epochs and objects) those objects that either of the two epochs were previously associated with. All of the above points can be seen in figure 3.24. The small boxes represent the epochs. The width of the box depicts the uncertainty in the position of the event that the epoch stands for. Identical epochs that are exactly above one another are in fact the same epoch redrawn at a different level. The words, syllables and phonemes are connected with one another because they came from a single set of simple epochs (generated by the entry of the phonetic transcript). Likewise the interval marks starting 89 with 'p' (ps-phonetic silence, Chapter 3 Mark Type Transcript OECLARATIVE Phrase Words which tea wit'l t't tr p'par Syllables 6 w i t' Phonemes t' t did party p a r ddd' 1' dd go to g'go t'tWi baker be' kkS d'be' kkl g o' t' t u h 34046 8 Voicing Mark Type arks GG G! a) BRST NB NS NS NS NS NS NS NS NS NS 1-4 -NS : NV NV NV NV NV NVNV I I . i . NV PVF PVF PVF PVF PB PB PB O-PB PA PA PA PA 3-a 3 PA PF PF PF PF PF C PF PV PVPV PV PV PV PV PV PV VPV -c--a PS PV' P PSa PS PS PS PS PS I-. . 3 PS G H 4 G--- PS 3~ 34046 3 Figure 3.24 An Example of Epochs pv=phonetic voiced) were derived from the phoneme marks, so they too share the simple epochs that were created by the transcription. However, the 'n' marks (numerical silence, numerical voiced) were derived from power measures made on the waveform and these procedures created their own set of epochs to delimit the numerically computed regions. It is because of the Epoch system that the numerical and phonetic interval marks are tied together. 90 Chapter 3 Computing temporal position Pitch detection includes both voicing and period determination. The correct time alignment of these two properties with the speech is essential. Since there are inaccuracies in the input transcript and in numerical measurements of the waveform, it is useful for PDA to combine its sources of temporal estimates to determine event positions more accurately. This combination is accomplished through a largely numerical process derived from a few theoretical assumptions. Each epoch includes statistical information that models the position of the event it represents as a Gaussian distribution (by specifying mean and standard-deviation). When two or more epochs are "sufficiently" close and are compatible, they are assumed to stem from the same underlying event. They are further assumed to be independent estimates. These assumptions lead to a formula for determining the mean and variance of the "composite" estimate. A more detailed description of the implementation is provided in Chapter 4. The assumptions The assumptions of independence and Gaussian statistics were made largely for convenience. It seemed appropriate to use a unimodal distribution for the statistics of each event estimate. Smaller errors were more likely than large ones (so it shouldn't be uniform). The Gaussian shape is intuitively familiar to most people with a mathematical background, so providing accuracy assessments in terms of Gaussian standard deviation was convenient 9 . Estimates were also assumed to be independent because if there was dependence there was no readily available way to determine it, 9 The operator must provide in the transcript an indication of the inaccuracy of each position estimate so the system can properly merge them with the temporal estimates that are automatically generated. 91 Chapter 3 and a dependence assumption would require some sort of specification. These are certainly powerful assumptions to make, but they seemed reasonable in this context and they allowed us to derive the mechanism of combining the statistical estimates rather than picking an ad-hoc procedure. Further, there did not seem to be another preferable set of assumptions given the information we had about this problem. While the independent Gaussian assumptions we made are probably wrong in many cases, any other set of assumptions would be as well. If the performance of the system were critically dependent on the precision of specifying the statistics of these estimates, then building a workable system would probably have been impossible. Sufficiently close Each epoch provides a probability distribution for the position of the underlying event it represents. Assuming that a number of such epochs come from a single event, one can view the probability density as specifying the position of the epoch given knowledge about the true event position. Thus if many epochs pertain to a single event, they form a cluster about that event and the system has (estimates) of the statistics of that cluster that can be used to evaluate it. Specifically, if the epoch cluster is viewed as a point in n-dimensional space near the point that corresponds to the true event, it is possible for the system to assess the chance that the epoch point would be further from the event point than it is. It is this evaluation that the Epoch system uses to decide if two epochs are sufficiently close (the details a covered in Chapter 4). 92 Chapter 3 Consistency of Merges It became apparent in Another requirement for epoch merging is consistency. experiments that the criteria of "closeness" was insufficient. The most prominent example was when the starting and ending epochs of a measured burst were merged. This occurred because they were both uncertain enough to have plausibly been from the same event. However, in this case it was logically inconsistent to merge them since they were known to be separated by the burst. To deal with logical information about merging, a filtering mechanism is provided in the merging process. In the PDA, the only filter used specifies that the starting and ending epochs of a given object may not be merged. However, clearly other criteria for merging could be considered. For example, in figure 3.25 one can see failures: the numerical burst (BRST) should not extend into the middle of phonetic silence region Mark Type G NB NS NV PvF PB PA -" Voicing Marks BRST NS NV Of To NV PA PF I I PV PS r' 3 PF S PV 3 PS c= 8498 5888000 Figure 3.25 Bad Merging 93 Chapter 3 (PS), and the numerical voiced region (NV) should not be merged with phonetic frication (PF). In the former case, the numerical burst represents sound and is therefore incompatible with phonetic silence. In the latter case, the NV should signify a voiced region; which cannot be the case during the phonetic frication implied by an unvoiced stop. These situations can and should be detected and corrected by the system. However, there was insufficient time in this project to deal with these issues. We merely wish to point out an important potential use of combined numerical and symbolic processing that should be investigated in future work. Epochs as nodes Epochs serve as the connections between objects. Adjacent phonemes are linked through the epoch that ends one and starts the other. Different levels of abstraction such as a word and its starting phoneme both start with the same epoch. Different concepts such as a numeric-voicing assertion and a word may start with the same epoch. Thus conditions in rules which express connectivity constraints may con- veniently be determined through examination of the network of objects and epochs. Epochs as Abstraction Epochs combine various things related to events under one system. * * * The determination of event positions by combining estimates. The representation of adjacency between objects via a network for efficiency. The determination of connectivity in that network through a combination of numerical closeness and logical consistency tests on the epochs of the objects. By representing events explicitly in this way, other parts of the system can make use of epochs in one abstract way without being concerned with their other uses. For 94 jr Chapter 3 example the rule interpreter can use the connectivity between epochs and objects to determine the left neighbor of some phoneme without being aware of the nature of epochs as numerical event estimates. This sort of abstraction would have been difficult to achieve if epochs were represented simply as numbers. 3.62. Related systems Another system whose purpose is the representation of temporal information is given in[53]. This system allows the user to make statements about intervals in time in symbolic terms such as: interval A begins before interval B", "interval D is inside interval F, etc. The system can then determine the truth of other statements about the intervals in question. This system was intended for the description of time in the context of issues of causality. For example, in some expert systems[541 it is important to know the order of events to determine the meaning of certain observations. For example, in medical diagnosis there are numerous interacting systems within the body. The order of events (such as a rise in blood pressure versus a change in weight) may be needed to make the correct diagnosis. The purpose of the Epoch system is different. It was not intended for determining causality since the relative relations of the objects in PDA's problem are largely clear. The Epoch system provides a mechanism for integrating multiple uncertain estimates to improve precision. Further, the Epoch system serves as the structural linkage for the temporal assertions of the PDA. Neither of these capabilities is part of Allen's system. 95 Chapter 3 3.7. Knowledge Manager 3.7.1. Overview The knowledge manager is a part of the PDA which stands between the rules and the asserted information of the system ( figure 3.26). It runs the rule system, maintains the dependency network, maintains the confidence support network and computes confidence for scalar and distributed assertions. 3.7.2. Running the Rule System The rule system is run in three phases: check changed assertions, check new assertions. fire rule bindings. The latter two tasks are typical for a rule system (though the PDA has a different control regimen than most current rule systems). The first task is unique to PDA and is something that contributes to the efficiency of the rule system. In most rule systems, rule actions may only assert objects. To implement change, the old object is removed and replaced. In the PDA, when rules are invoked they may create new objects or change old ones. Rule Actions Assertion - KM ~Rule Conditions New Objects Assertion Database Figure 3.26 The Knowledge Manager 96 Chapter 3 When changes are made, notices are passed along links between the various objects in the system informing dependents that the change occurred. This may immediately cause the dependent to respond. For example, when a word-mark learns that one of the syllables it covers has changed, the word-mark checks to see if its starting epoch and ending epoch match the starting epoch of the first of its covered syllables, and the ending epoch of the last. If not, then the word-mark updates its start (or end) and itself sends out a change notice. Most objects are not sensitive to change notices. However, the knowledge manager is listed as a dependent of every assertion and it does respond to change notices. It keeps track of all changed objects during a particular rule firing session and subsequently checks the status of all rules with respect to these changed objects (as described below). Another type of object that always responds to change is the binding. Recall that a binding stands for one invocation of a particular rule. Also recall that in the PDA. if the circumstances that justified that rule invocation go away, then the effects of hat rule invocation are undone as well. When a binding learns that one of the obt.cts that triggered its creation has changed it marks itself as "invalid". During the "check changed assertions" phase of rule system operation, the KM feeds all changed objects back through any rules that were interested in them. This can lead to three possible outcomes for any rule. * An invalid binding may be made valid. If the rule finds that the changed object still satisfies the rule's conditions then the existing binding will be made valid and the actions it caused will remain in effect. * An invalid binding may remain invalid. If after all changed objects have been checked, the rule responsible for the binding does not find that the conditions for that binding are now satisfied then the binding will be revoked and all of its effects undone. 97 Chapter 3 * A new binding may be created. It may be that a changed object now satisfies a rule criterion that it previously did not. In this case a new binding will be created (though it will not perform any rule actions until the "firing" phase of rule system operation). Once the old bindings and changed objects have been checked, the next step for the KM is the determination of which rules trigger (generate bindings). Some new bindings may have already been created during the checking of changed objects. The remaining new bindings are found by testing the rules against any new assertions. When new assertions were made during the preceding rule-firing phase, the KM noted all rules that were potentially interested in those assertions. At this point, the KM sends the new assertions to those rules and gets back notification from any rule that was able to create new bindings. Now the KM knows about all new bindings that are based on the previous rule firing phase. Also, any results that were invalidated by changes made during the previous rule firing phase have been undone. At this point, the KM uses the new bindings to invoke their rules leading to a new set of changes and new assertions. l he ability to handle change in the assertion database is novel in the PDA rule systeiim. There are truth maintenance systems (TMS)[55] that can be used to build networks in which cells are either on, off, or unknown and are connected with "or" and "not" to model logical expressions. One of the uses for the dependency network in the PDA is similar to a TMS. When a binding is retracted and it notifies all its dependents (the assertions it made during rule firing) and they are also retracted, then the PDA is acting very much like a TMS. However, there are several ways in which the PDA differs from a conventional TMS. Historically, TMS systems work with fixed networks of cells and update the 98 Chapter 3 network in accordance with changes in the states of those cells imposed from without. The PDA differs from this since it is constantly changing the network as assertions are made (each assertions corresponding to a cell). In addition, the PDA uses this network not only for passing information about the logical state of a cell (i.e. has it been retracted), but also inform the dependents that a change has taken place in which they might be interested, a message that does not necessarily affect the informee. In this sense, the "cells" in the PDA have a more complex state than simply on, off or unknown and more complicated ways of changing that state. Still, the ideas behind TMS certainly had a strong influence on the PDA. Existing rule systems do not work by having assertions each with a complex state that changes with time. They implement change by removal and replacement of the object in question. In a system where objects are simple and assertion is inexpensive this is a reasonable thing to do. However, in the PDA (and likely in similar KBSP systems) some assertions may take many minutes to create, and the alteration of the network structure to add and remove assertions can be expensive. Further, if consistancy is to be maintained when an assertion is removed, all dependent effects mu_ be undone. his can involve great expense. Thus, the desire to automatically undo actions if the rule activation is no longer valid, coupled with actions that can be expensive to take (typical of many signal processing operations), leads to a design which explicitly handles change in objects rather than using removal and replacement. The other unusual aspect aspect of the PDA brought out in this discussion is the simplistic control regimen. Existing rule systems create a conflict set of bindings and select a single member to fire. For example, the oldest or newest binding may be chosen or the binding with the largest number of condition clauses may be chosen. In 99 __ I Chapter 3 contrast, the PDA purposely lacks such a control mechanism. All bindings are collected then all bindings are fired. This choice was motivated by the belief that embedding knowledge in the control mechanism could lead to an obscure system that would be difficult to develop. Thus PDA lacks such a mechanism. The user relies on the consistency maintenance discussed above and explicit conditions in the rules to determine system behavior. 3.7.3. Network Maintenance Objects in the PDA are related through two networks, one for dependency and one for probabilistic support. The first network is for absolute dependence (object "a" must exist for object "b" to exist) and the notification of change. For example, a numerical power estimate depends in this way on the waveform. If the waveform is retracted (killed) the power estimate must be retracted as well. The second network is for computing confidence. If support is removed for an assertion like voiced, the confidence in voiced may change, but the assertion is not removed. The odds factor just reverts to 1.0. and the odds to its a priori value. The two participants in a dependency link are called the user and the server. Typically, each server provides something that its user needs. If the server changes, the user is informed in case the user needs to act on that change. If a server is killed, the user typically dies as well. All signals created by the system are made to "use" the signals from which they are constructed. In this way, if a signal is removed (say the algorithm for computing it is found to be incorrect) then all signals computed from it are removed also. Likewise, bindings are made to "use" all the objects that went into activating the rule. If any of those objects are killed then the binding dies. Also, if any of them are changed, the binding is notified and marks itself invalid (as 100 4 Chapter 3 described earlier). The KM is also made to "use" all assertions so it too will be notified of object change and can check the changed objects against interested rules. Unlike bindings, the KM will not die if one of the objects it is using does. Finally, all assertions made during rule actions "use" the binding for that activation so if the binding dies, all assertions are removed. The support links are established in two ways. During rule actions, new assertions may be "supported" by the binding if the confidence in that assertion stems directly from the objects matched in the conditions. The rules for phonetic voicing from phonetic identity are of this type (see figure 3.8). On the other hand, some assertions have their confidence determined by a numerical procedure. In this case no support is provided, and the default confidence is established when the assertion is created (see for example figure 3.12). The other means of creating support links is the explicit request. For example, all phonetic-voiced assertions ultimately lend (or remove) support from the unique assertion <VOICED>. This is handled by the rule shown in figure 3.27. The provide-support form causes the binding of this rule invocation to "support" the object of type voiced (which is the unique object <VOICED>). The binding is in turn "supported" by the objects that appear in the PREMISE-ODDS form (def rule < pv-voi ced-support > CONDITIONS (type 'phonetic-voiced pv) (type 'voiced v) PREMISE-ODDS (lambda (pv) (ivl-gate pv (odds pv))) ACTIONS (provide-support v .7 .3)) Figure 3.27 Phonetic Support for Final Voiced 101 Chapter 3 (pv in this case). Notice that while the binding "uses" <VOICED>, it is not "supported" by it. If <VOICED> was retracted for some reason, this binding would have to be retracted, but the confidence in the binding does not depend on the confidence of <VOICED>, only on the confidence of the phonetic-voiced assertion found by the conditions. In maintaining these networks, the KM must make the links between these objects, propagate notifications along user and server links, propagate changes in confidence along support links and recompute confidence when necessary. 3.7.4. Confidence Computation When the support for an object is changed, the confidence in that object is usually affected. In terms of the nature of their confidence, there are three types of assertions. Symbolic-scalar, symbolic-sequence and numeric-sequence. A symbolic-scalar assertions makes a statement without reference to time index (e.g. sex/age=CHILD), a symbolic-sequence assumption refers to the time index (e.g. phonetic-voiced gives the confidence of glottal voicing as a function of index), and a numeric-sequence asserts a numerical value as a function of index (e.g. fO versus index). The KM treats the first two types in a similar way and the last differently. Symbolic assertions (sequential and scalar) are supported by bindings. The KM computes the odds-factor of such an assertion using a formula similar to that used in PROSPECTOR[56]. All the incoming support is combined somewhat like water flowing in (or out) along the support links. For symbolic-scalar assertions, the PDA resembles PROSPECTOR a great deal. For symbolic-sequence assertions (which are new to PDA), the odds-factor at each point in the domain of the assertion is computed 102 Chapter 3 separately and kept in a confidence sequence for that assertion 0 . This allows it to make such assertions on intervals over which the confidence is varying. The PDA seems to be the first system to compute the confidence in distributed symbolic assertions in this way. Other systems which use such assertions (e.g. HEARSAY) employ a fixed confidence over their entire duration. For the numeric-sequence assertions which represent estimates of FO a different approach is used. The rules which give support to the final pitch assertion do so by directly linking the supporting pitch assertion with the final pitch as is shown in figure 3.28. The assertions which provide the support do so by specifying the parameters of a Gaussian density and a confidence at each index. At each sample in the final pitch, the KM takes the probability densities from the supporters and their confidences and makes a multi-modal composite density using conventional rules of probability (see Chapter 4 for details). Because it generates a multi-modal density it is capable of reflecting various possible choices for fO with different peak amplitudes. (def rule <phonetic support-for-finoal-pitch> CONDITIONS (type 'finol-pitch final-pitch) (type 'pitch-assertion assert) PREMISE-ODDS ACTIONS (send assert :support final-pitch)) Figure 3.28 Support for Final Pitch 10 Because of the efficiency of the signal processing system which the PDA uses, it is possible to take such an approach despite having in effect many millions of confidence computations to perform. 103 __II Chapter 3 3.8. Rule System This section discusses the concepts that motivated the design of the rule system. The responsibility of the rule system is the recognition of the conditions for activating rules. The rule system in the PDA supports the following features: networked objects, mutable objects, a dynamic rulebase, and test-directed rather than pattern-directed invocation. 3.8.1. Concepts Networked Objects Objects in PDA are all connected together. The rule system works with such a structure. Many rule systems (such as OPS5) do not. Their assertions have attribute values that are symbols. This contrast can be best understood by way of figure 3.29. 41 Type: person Name: joe Father: jim ' I Type: person Name: george Country: USA Type: person Name: jim Father: george I OPS5 (isolated objects) PDA (linked objects) Figure 3.29 Networked Objects 104 Chapter 3 In a rule system such as OPS5, asserted objects are represented by a data structure that records attributes and values for that object. In OPS5, these values can only be scalars or symbols. In terms of the figure, this means that in order to find out whether Joe had an ancestor from the USA, the rule system must take Joe's father's name (Jim) and test it against all other person's names for a match, then see if that person was from the USA. When that fails it repeats the process with Jim finding George. This expensive process can be avoided if the objects are connected as is shown for PDA. In this case, the function inherits can follow the pointers from son to father until the information is located. The ability to directly search for such information can mean an improvement of the order of the number of assertions being searched or possibly powers of that order (depending on the rule conditions). This is discussed more fully in Chapter 4. In the PDA, the networked representation led to a substantial cost savings (as it would in many problems it was applied to). Mutable Objects Mutable objects in the rule system means that objects can be changed after being asserted. In the PDA, the primary source of object mutation is the Epoch system. As objects are linked by it, they change. For example, as a result of an epoch merger a word may become the neighbor of a voicing onset, and this may trigger a rule to fire. A second posible source of change is the explicit modification of objects by rules. In one experimental rule, the voice onset time of stops (marked in the transcript) was compared to tables of expected onset times. If the marked onset time was too large and the stop was at the end of a word, the PDA would unlink the two words and insert a word pause automatically. Such changes have to be detected by other rules if 105 I Chapter 3 the system is to function properly. Rule systems generally rely on the removal and replacement of an assertion in order to change it. In the PDA (and we feel in other KBSP tasks), such an approach to dealing with change would be too costly. The existence of dependencies between objects means that the removal of one object leads to many others being removed and substantial amounts of work being redone. Another problem with removal and replacement is that some objects (e.g. numerical pitch estimates) are very expensive to compute, so removal and replacement is very undesirable. Since some rule systems have functional conditions, they could be used with networked objects. Why can't they handle mutable objects? The problem is in the way change is detected. These rule systems have a network built from the conditions of all the rules. They locate the rules to be fired by dropping new assertions into that network. Once an assertion has been fed to the network and thereby tested, there is no provision for detecting change in that object and testing it again. It must be removed and replaced to accomplish that. Dynamic Rulebase The PDA was intended to evaluate ideas for developing complex systems. One aspect of such development is the testing and debugging of new rules, and the ability to dynamically change the rulebase improves the efficiency of program development. In some rule systems, to alter a rule one must remove and replace all assertions. In the PDA, a complete restart would take a long time. By allowing rules to be added or removed at any time, the developer can experiment with a new rule in a matter of minutes. 106 Chapter 3 Test-Directed Invocation The PDA uses functional tests rather than patterns as the primary way of expressing conditions. Since there were numerical conditions, such tests were inevitable. quantities to be manipulated in Many of the non-numerical conditions extracted information about objects that was not directly present in the object itself and would therefore be difficult to implement with simple (efficient) patterns (see the previous comments about networked objects). Finally, tests are functional abstractions in that they do not constrain the actual data structure of the assertions. Pattern-based rule systems supply you with a data structure that their pattern language supports. By avoiding such patterns we remove any structural constraint on the assertions in PDA and allow those aspects of the system to be programmed independently of the rule conditions. 3.9. Conclusions This chapter has given a conceptual presentation of the PDA system. It discussed the program knowledge and how it was used and developed the ideas behind the major subsystems (Epoch, KM and rule system). The following chapter goes into detail on these systems, presenting the theoretical mathematics behind the numerical similarity estimator and the confidence computation. Chapter 4 also introduces the KBSP signal processing system and discusses how it was used in this project. 107 _ VJ CHAPTER 4 System Details This chapter describes the implementation details of the numerical pitch detector and other major components (the KBSP package, the Epoch system, the rule system, and the knowledge manager) that make up the PDA. 4.1. The KBSP Package The Symbolics LISP machine on which the PDA was implemented supplies the programmer with a LISP language environment that includes a LISP oriented editor, a debugger that allows the examination, modification and resumption of all executing procedures, support for the creation and modification of data types to encourage object oriented programming, and a large system for controlling the graphic display. This LISP language environment provides a powerful base on which to build applications, but it does not support specific applications. In particular, the LISP machine software contains no datatypes or procedures designed specifically for signal processing. The KBSP package extends the basic LISP machine software environment with datatypes, procedures and graphical displays that help the programmer develop and use signal processing algorithms. The KBSP package was the joint effort of the author and Cory Myers who was working concurrently on his PhD thesis. The following text briefly describes the major features of the KBSP system. A much more detailed account is available in the 1986 PhD thesis of Cory Myers which presents extensions to the KBSP package that Chapter 4 __ ____ Chapter 4 allow signals to be manipulated in symbolic as well as numerical form. 4.1.1. The Precursor (SPL) One of the first efforts to specify the desired features of a signal processing datatype was made by Gary Kopec[57]. Kopec's work was based on the principle that it would be easier to write signal processing programs if there was a basic datatype from which all signals were constructed. Designing a procedure to process signals would be simpler because the external interface presented by all signals would be identical. Developing large systems would be simpler because changes in the internal implementation of signals could be made without requiring changes in the procedures that used them. This would increase the chance that programs written at different times and by different people would be compatible. There had to be a common functionality that all signal objects supplied, which was sufficient for any of the operations that might be performed on them. For example, Kopec pointed out that a signal representation should support non-sequential access since many signal processing procedures are described in that fashion (e.g. the EFT). He proposed two operations that a signal must support: (fetch <signal> index) and (domain <signal>) . The first is needed for retreiving the samples of a signal, and the second for identifying which indices are "defined" (attempts to access a signal outside its domain are program errors in his model of signals). Kopec also proposed some other properties that a signal representation should have. He justified these properties by saying - "since mathematical signals have them, computer signal representations should have them as well". For example, he felt that signal objects, once created, should never change (since mathematical signals do not). He suggested that the closer the programmed representation of a signal was to the 109 ___1_1_____ ___ ____ Chapter 4 mathematical concept, the better a signal processing tool it would be. To achieve this he mapped properties of mathematical signals to properties of computer signals. The Signal Model Kopec implemented this philosophy in a signal processing package he referred to as SPL. SPL modeled signals as objects whose domains were finite intervals (e.g. {O,...,(N--1)}) and which supported the two operations (fetch signal index) and (length signal). It was an error to fetch a signal outside its domain. Thus SPL uses what might be called a "finite vector" model for signals. Signal objects (like mathematical vectors) are only defined over a finite range of indices starting with 0, and it is an error to combine (e.g. add) two signals of different lengths. First the shorter must be used to create a zero-padded version of the same length as the longer, then they can be added. Signal Representation in SPL Signals are represented by objects that contain two procedures, one for calculating the samples of the signal and one for calculating the length. Thus the basic task of "creating" a signal is one of "definition" not "computation". Rather than creating a Hamming window by computing all its elements and storing them in an array, the program defines the procedure for computing those elements. Note that such definitions may take place at compile time (as with conventional FORTRAN subroutines) and while the program is running (as procedures which create signals are invoked). The invocation of the procedure for computing the samples of a signal only takes place when some other procedure asks for the sample values of the signal (e.g. via (fetch signal 17)). 110 -------- ·---- Chapter 4 Signal Definition If the definition of every Hamining window of a different length had to be written as a separate piece of code, signal processing would be a difficult chore. To simplify the definition of signals, SPL permits the user to define an entire class of signals with a single block of code (e.g. all Hamming windows are defined with one piece of code). For example, the signal class definition for Hamming windows creates two procedures that are generic versions of the fetch and domain procedure for a specific Hamming window, and a third procedure called a builder. When a specific Hamming window is desired, the builder for the Hamming window class is invoked passing it the desired length. The builder creates an object to represent that specific Hamming window by customizing the generic Hamming fetch and length procedures for the specified length of this specific Hamming window. 4.1.2. The KBSP Package The KBSP package used Kopec's work as a starting point. It includes the idea that signals should be defined rather than computed (i.e. a signal is more like a function than an array), and that the user should be able to define whole classes of signals at once. However, KBSP differs from SPL in some important respects. The Signal Model The signal model used by KBSP is different from the one used by SPL. Its distinctive features are as follows: Domain In KBSP the domain of a signal is (--oo oo). Period All signals have a period (for most signals it is infinite). The combination of two periodic signals generally leads to a signal whose period is the least 111 Chapter 4 common multiple of its constituents. Signals which are not periodic are said to have period-co. Support The support of a signal is a range of indices outside of which the value is 0. Some signals (e.g. Hamming windows) have a finite support, others (e.g. real exponentials) have infinite support. Benefits of the Model * * * * Because this model supports infinite domains, the class of representable signals includes common periodic signals and infinite support signals like exponentials, Gaussians and sinusoids. Infinite domains also means the preservation of temporal properties such as zero-phase, causality and temporal offset is straightforward. For example, the convolution of two zero-phase signals will still be zero-phase. All signals have the same index domain, so any signal may be combined element by element with any other. Thus constructions like (seq-sum (Hamming 31) (Hamming 43)) are well defined in KBSP. Infinite domains means the user of a signal need not be concerned with the support of a signal except for efficiency. For example. when adding two signals of unequal support the user is allowed to fetch samples from the shorter signal outside of its support without causing errors. This makes many procedures that process multiple signals simpler. Side Effects of the Model Deferred Evaluation Choosing infinite domains makes deferred evaluation a requirement. Interval Algebra The computation of support is an important part of the definition of a signal. Some signals (such as correlations) have a support which is a complicated function of the supports of their arguments. To assist the programmer in manipulating supports, a datatype for representing them was created called "interval" and a set of procedures for manipulating intervals were defined. Operations such as union and intersection were provided as well as signal processing related operations such as shift, correlation and convolution of intervals. These procedures made use of explicit symbols for oo and --oo to make the definition and manipulation of signals with infinite support convenient. Fetching Intervals One requirement for a signal representation is that it be computable at an arbitrary index. From a practical perspective, there is a computational cost associated with each access. Thus for efficiency reasons, it is useful to be 112 ----111···1·-··-*---· Chapter 4 able to access many samples of a signal at once (typically a group of samples would be stored temporarily in a fast access data object like an array). Recent versions of Kopec's language allow the user the alternatives of fetching a single sample from a signal or all the samples. In the KBSP package it is impractical to fetch all the samples (since the domain is infinite). Instead, the user is allowed to fetch an arbitrary interval of samples (which are returned in an array). Computing an interval of samples of a sequence usually involves fetching intervals of samples from the sequences on which it is based. Determining which samples of those input sequences are required can be a difficult aspect of defining a signal (esp. for correlation). This is another use for the interval algebra. Signal Representation Like SPL, KBSP represents signals as objects with functions for computing their samples and other properties. The basic properties supplied by all signals in KBSP are support and period. Signal Definition In KBSP one defines an entire class of signals at once. However, rather than think of such a thing as a definition of a class, one can think of it as the definition of a "system". In the case of Hamming for example, the task is one of defining the system Hamming which when applied to a scalar will generate a Hamming signal. For addition, one can define the system seq-sum which when applied to two signals, creates a new signal whose values are given by the sum of the two inputs. 31 Hamming 31 point Hamming signal signal-ag seq-sum D-signal-a + signal-b signalThis is not functionally different from the concept of a class with a finder as Kopec 113 Chapter 4 defined it, but the notion that one is defining a system and that once defined one plugs the outputs of systems into othei systems is a useful way to picture program development. Reasons for Delayed Evaluation Kopec justified delayed evaluation as follows "There is a clear conceptual distinction between the processes of defining a function (by constructing an expression for its values), and computing the function. by evaluating the expression at a specific point of its domain. This observation suggests that creating the program representation of a discrete-time signal or system should not imply the immediate creation of (representations for) all of its domain range pairs. ... creating a signal should not imply computing all of its samples." There are other reasons for separating definition and computation beyond the conceptual distinction between them: * If signals with infinite domains are to be represented, then delayed evaluation is essential. * If there is some chance that the entire signal may not be needed, for example if it is to be decimated, then computing all the samples at definition time would be wasteful. * If the definition of the signal might be changed to an equivalent but more efficient forln before computation is necessary[58]. Transparent Caching In the original description of SPL[57] it was pointed out that efficiency could be gained if a signal type called a store were defined, a signal that saved its samples after computing them. By inserting such stores, the programmer could avoid repeated computations if they fetched samples more than once, for example if they used the same Hamming window for two different purposes. 114 I Chapter 4 This idea was extended independently in both KBSP and later versions of SPL (such as SRL[59] ) to the idea of transparent sample caching. With this capability, the system automatically saves any computed samples in case they are needed later. In SRL, any fetching of a signal caused the signal to be cached over its entire domain. With infinite domain signals this is not possible, so it is necessary in KBSP to allow caching of a subset of the domain. An additional advantage of caching only a subset is the efficiency that accrues from not computing samples that are not needed. Discussion The KBSP package makes algorithm development more convenient. This enhanced the development of the PDA by decreasing the "turn-around" time between an algorithmic idea and an implementation. It also served as the basis for symbolic distributed representations in the PDA. For example, the input transcript was implemented as an augmented sequence object whose sample values were symbolic. This allowed the graphical interface built into the KBSP package to be used to examine the state of the PDA. The KBSP was also used by the Knowledge Manager to compute the confidence sequences for distributed symbolic assertions and the probability density sequences for fO assertions. For those two applications, the KBSP proved invaluable. For a more complete discussion of the principles and implementation of the KBSP package, the reader can refer to the PhD thesis of Cory Myers[58]. 4.2. The Epoch System The PDA has a special datatype ("epoch") to represent temporal events. The motivation for developing a separate datatype to represent temporal events was as fol115 Chapter 4 lows: * * A single number representation for temporal locations failed to capture the inherent temporal uncertainty reflected in both symbolic input and numerical measurements. The information needed to refine the statistical information about event positions is gradually acquired during system operation. So it was apparent that extensive manipulation of event estimates would be necessary. * Besides the numerical time of the event, there was a concept of neighbors to that event that needed representation. There had to be an object type to which the distributed assertions such as phonemes would be attached. * Given the overall PDA philosophy that the user can interact (add and retract information) during system operation, this connection datatype had to be "smart" enough to be able to undo its effects on demand. There are four basic tasks handled by this system of epochs: * * ,* * Compute position and uncertainty statistics for composite epochs that represent multiple event estimates. Determine when individual simple epochs should be merged with other simple or composite epochs. Support the rule system by efficiently answering questions concerning the relationships between temporally distributed assertions. Provide backtracking capability to allow previous merges to be undone. Basic Epochs The basic represenlation of an epoch is a data object with the following structure: Name The name of the object is generated automatically by the system. For simple epochs, the name will be "se#" where # is a number. For composite epochs it will be "ce#". The name is a LISP symbol whose value is the epoch. This allows the operator to examine epochs individually for development and debugging purposes and for studying the systems conclusions about events. Mean This is a number (in samples) that specifies most likely position of the "event" this epoch represents. Variance This number reflects the temporal uncertainty in the Mean. 116 Chapter 4 Neighbors This is a set of links (pointers) to the temporal assertions that have a boundary at this epoch. For example, a phoneme-mark has an epoch as its "start" and an epoch as its "end". Types of Epochs: Simple and Composite When a new epoch is needed (e.g. a new assertion is to be made whose endpoints are not given by earlier assertions), a simple epoch is created. This happens as a result of rule actions making new distributed assertions (e.g. the numeric-voiced rule makes epochs to start and end voicing assertions). When multiple simple epochs lay near one another, the Epoch system constructs a composite epoch to represent the group of simple ones. The composite shadows or hides the simple epochs it represents. Requests given to the simple epochs are forwarded to the composite for handling. This means that for most purposes distributed assertions whose simple epochs have been merged appear to share a single composite epoch. The merging of epochs and resulting effects are depicted diagrammatically in figure 4.1 The boxes represent distributed assertions and the circles represent epochs. The original configuration has two phoneme marks each with a simple starting and ending epoch. Assuming that se 24 and se 25 are sufficiently close to one another, they will appear to merge into a single epoch (ce 3) connecting both assertions (as in the middle part of the figure). In actuality, the simple epochs still exist, they are the simply forwarding all requests for information to ce 3 for reply (this is depicted in the bottom part). Simple Epochs Simple epochs are made by rule invocations when they need to refer to a place in time. For example, the rules which parse the input transcript assert phoneme-marks 117 ·21 Chapter 4 se23 ~------ pml2 se24 se25 pml2 se26 Original sepm12 After (apm12pparent) After (apparent) After (actual) Figure 4.1 The Merging of Epochs each of which starts and ends on a different simple epoch. The Epoch system subsequently merges the adjacent end/start simple epochs of adjacent phoneme-marks (because of their closeness) effectively connecting the phoneme-marks together. Simple epochs are not created when a distributed assertion is built from epoch information that already exists. For example, when syllable-marks are built from their phoneme-marks new simple epochs are unnecessary (and inappropriate). Instead, the syllable-marks are made to start and end on the same simple epochs their included phoneme-marks start and end on. This is necessary because the computation of composite epoch statistics makes the assumption that the individual simple epochs contribute independent information about the underlying event position. 118 Chapter 4 The neighbors of a simple epoch are all the distributed assertions that start or end on it. Typically, when first created a simple epoch has one neighbor. For example, initially the starting epoch of a typical phoneme-mark has only that phoneme-mark as a neighbor. However, if a syllable-mark is added that starts with that phoneme-mark, then that syllable-mark will become another neighbor of the epoch. This is the only way that simple epochs ever get neighbors. They never get them through merging. That is the purpose of composite epochs. Simple epochs are unaffected by merging so retraction can be performed. By keeping a record of what contribution a given simple epoch makes to a composite (which neighbors it contributes and what event estimate), it is possible for the Epoch system to correct the state of the composite if the simple epoch is retracted. Because simple epochs are the only type that are directly connected to assertions, they are the only type that can be retracted. Composite epochs are only attached to simple epochs. Each simple epoch depends on one assertion, typically the one it was created for. For example, the simple epochs on which a numeric-voiced assertion starts and ends depend on that assertion. If the assertion is retracted, the epochs are as well. In cases where simple epochs can have several neighbors another object may provide the support. For example, the simple epochs that terminate the transcript marks (phoneme-marks, syllable-marks, etc.) depend not on individual marks, but on the transcript itself (despite the fact that the transcript object doesn't use them for anything). This way, it is possible to retract and replace a phoneme-mark or a syllablemark without retracting the epoch and all the other marks it is attached to. 119 '4 Chapter 4 Once a simple epoch becomes part of a composite epoch, all requests to that simple epoch (e.g. "what is your mean", "who are your neighbors") are forwarded to the composite epoch. In this way the simple epoch "takes on" the properties of the composite it is a part of. Composite Epochs Composite epochs are made only by the Epoch system (in order to merge together simple epochs). They are not directly attached to assertions, so they only receive inquiries that are forwarded from simple epochs. Each merged group of simple epochs is represented by one and only one composite. Such simple epochs have a record of their composite and the composite has a record of its constituent simple epochs. The composite does not depend on any individual simple epoch out of the group, but will only continue to exist as long as there are at least two constituents. If the group falls below this number, the remaining constituent reverts to answering requests itself and the composite is removed. The neighbors of a composite are the union of the neighbors of the simple constituents. Because of this and the fact that simple epochs forward requests to the composite, when two simple epochs are merged the assertions to which they are attached will become neighbors. This is how much of the connectivity is established in the network of assertions. Computing Statistics The statistics of a composite (mean and variance) are established from the statistics of the constituents in a straightforward manner. By assuming that each simple epoch contributes a Gaussian independent noisy observation of an underlying event, 120 Chapter 4 the constituent statistics are given by: Si (4.1) c =--' 1 ' i a2= cQ 1 £i a 1 where si I is the set of estimates, {Si are the estimate means, and { (4.2) i } the estimate variances. Controlling Merging Each time a newly created simple epoch is close to an existing simple or composite epoch, a decision must be made whether to merge the two. Both numerical and symbolic issues impact on this question. From both perspectives, the program must decide if the merger is appropriate. From a symbolic standpoint, the only criterion the PDA is interested in is that the epochs to be merged have no neighbors in common. The only way they could would be if they were starting and ending epochs of a single distributed assertion. Thus, this criterion serves to prevent the starting epoch of an assertion from being merged with its ending epoch, an action that is clearly unreasonable since the two epochs necessarily represent different events. From a numerical standpoint, the PDA is interested in how probable it is that the proposed epochs refer to the same event. The need to answer this question is the reason that all epochs must specify statistics. Without some statement about where the underlying event might be (some statement about the uncertainty of the epoch), it 121 Chapter 4 would not be possible to answer that question. If the statistics were assumed a priori to be the same for all simple epochs, then the system would be unresponsive to the available information about epochs; information available from the operator (about transcript epoch statistics), from symbolic modules (about the likely timing of stops) and from numerical modules (about the uncertainty in power based voicing assertions). Given the availability of simple epoch statistics and the assumption that each epoch proposed for a merge is an independent Gaussian observation of some event, the formula that gives the probability density for that configuration of simple epochs stemming from a single event is 1 e n 1 Ir2 1A1 2 (4.3) where s' is the vector of simple epoch means, is a vector whose elements are all equal to the mean value of the proposed composite, and A is a diagonal matrix of simple epoch variances. To rate he quality of a proposed cluster of simple epochs in absolute terms, the PDA integrates this probability density over all larger clusters, where the size of the cluster is given by the "Mahanalobis distance"[60] (normalized radius) of the cluster R. 1 n-1(s.-s ) 2 F. R = a2 l 2 (4.4) n -1 Numerically then, the PDA's criterion for merging a cluster of simple epochs is the likelihood that another randomly chosen cluster would be larger in "radius". The 122 Chapter 4 PDA uses an absolute numerical threshold of 20% to make this decision. That is, if there is at least a 20% chance that another randomly chosen cluster (with the same number of elements) would have a larger radius than the proposed cluster, then the proposed cluster is merged. Otherwise, the proposed epochs are considered estimates of different underlying events. This value was chosen empirically to balance the number of incorrectly clustered epochs with the number of failures to properly cluster. Supporting Rule Conditions By linking assertions to one another, the Epoch system may satisfy certain rule conditions. For example, if a rule is interested in an /s/t/ phoneme cluster, then when the Epoch system merges the ending epoch of an /s/ phoneme-mark with the starting epoch of a /t/ phoneme-mark, that rule should be activated. This is accomplished in the PDA through change propagation. All neighbors of a simple epoch are recorded as "users" of the epoch. If the epoch's state changes, the neighbors (assertions) must be "informed". The Knowledge Manager is in turn a user of every assertion. When a simple epoch is merged, it is considered changed. It informs its users (i.e. its neighbors) and they inform the Knowledge Manager. The Knowledge Manager has a record of what object types each rule is interested in, and it informs appropriate rules that the assertions connected to that epoch have changed.' Finally, these rules reevaluate their conditions with respect to the changed objects and determine if firing is appropriate. Besides stimulating the reevaluation of rule conditions, the system of epochs provides an efficient mechanism for evaluating rule conditions concerned with adjacency. Because of the direct linkage between assertions and simple epochs and the linkage between merged simple epochs and composite ones, a rule can evaluate a condition 123 Chapter 4 concerning adjacency by "chasing pointers" rather than with an implied search of the assertion database. For example, in the previous case of the /s/t/ context, in the absence of this linkage the rule might be written as shown in figure 4.2. With the linkage created by the epoch system the condition can instead be written as shown in figure 4.3, where the (right -neighbor x) form looks up y using the epoch linkage. If there are n /s/s and m /t/s in the utterance, the comparison clause (eq (end x) (start y)) in the rule of figure 4.2 would be run n xm times. None of the clauses in the rule of figure 4.3 would be run more than n times. (defrule <slow> CONDITIONS (type 'phoneme-mark x) (isa 's x) (type 'phoneme-mark y) (isa ':t y) (eq (end x) (start y)) Figure 4.2 Inefficient Condition (defrule <fast> CONDITIONS (type 'phoneme-mark (isa ':s x) x) (let y (right-neighbor x)) (isa ':t y) Figure 4.3 Efficient Condition) Figure 4.3 Efficient Condition 124 Chapter 4 Backtracking Backtracking is the process by which an assertion is retracted from the system and its effects undone. The design of the Epoch system is such that the merging of simple epochs can be undone if one or more of the simple epochs is retracted. To accomplish this, the Epoch system preserves each constituent (simple) epoch untouched when a merge is performed. The only effect of the merge is to cause each simple epoch involved to forward requests for information to the composite epoch representing it. In that way, the simple epoch appears to have the properties of the composite (mean, variance and neighbors) while the original properties of the simple epoch are preserved. Because all the simple epoch information is preserved, it is possible to retract any given constituent epoch and recompute the state of the composite properly. A Demonstration Figure 4.4 shows the operation of the Epoch system by comparing the same set of assertions with and wit hout merging. The upper part of the figure shows the realigned transcript and the voicing marks with epoch merging disabled. The lower part of the figure shows what happens if merging is allowed. Vertical lines indicate connected epochs. In the unmerged case, all information that is phonetically derived shares common epochs. Deriving one phonetic description from another (e.g. voicing marks from phonemes) does not generate new boundary information except in the case of the frication/aspiration boundary during stop release. However, there are no connections between numerically derived marks, nor between phonetically and numerically derived ones. Therefore, there has been no refinement of boundary position estimates. 125 Chapter 4 Mark Type Transcript Phrase tea I.^^ I_ _ - , { Syllaoles Phonemes 98 53 Mark Type Voicing Marks 8RST NS NS NS Nv PVF NV NV PA PB9 PA PF PF Pv. I C, PS I, Pv _T i { PS S388 9798 Transcript Mark Type oim,. ------ ... ir e J 1Z · wors tea · - I- t't SyllIabes --- t J A, Phonemes I,' c - p - 2- . 9798 5388 Voicing Mark Type G 3RST NB NS . .d I NV NVI PVFI PB' PA PF I PF PV I PS' arks NV PA I PF II=1 , PS I PV L I i I I I T a~------ z NS I PS 9798 5388 Figure 4.4 The Effects of Merging 126 Chapter 4 In the merged case, there is extensive linkage between the marks and those which have been merged show substantially reduced deviation; a reduction in uncertainty that results from combining multiple sources of information. 4.3. The Rule System 43.1. Overview The rule system in the PDA consists of a set of rules and a set of procedures that implement rule behavior. The basic philosophy of the PDA rule system is as follows: * * A given rule is "primed" to respond to some configuration of data in the PDA. If that configuration arises then the actions part of the rule is executed, once for each such configuration. If at a later time, the data configuration which triggered the rule ceases to exist, then the changes caused by that rule firing will likewise be undone. * Similarly, if a rule is retracted then the changes caused by that rule's actions will be removed. * Finally, anytime a new rule is entered into the system it is informed of the current data in the PDA and executed on all appropriate data configurations. The goal of this philosophy is a rule system whose rest state' is dependent on the rules and data present, but not dependent on the history of rule and data entry. Ephemeral rules and data should not effect the rest state of the system, nor should the order of entry of rules/data actually present. This approach can be contrasted with rule systems such as YAPS[61] and OPS5[62] where the rest state is dependent on history. In these systems: * The disappearance of data configurations causes no automatic retraction of rule results, so a transient data configuration can cause a difference in the rest state. I The rest state is the state of the rule system that is achieved once all rule firing and consequent state modification have ceased. 127 I -- Chapter 4 * · The retraction of rules does not undo their effects, so a transient rule can also affect the rest state. Any data entered prior to a rule is invisible to that rule, so the order of rule and data entry can influence the rest state. A History Independent Rule System The advantage of a rule system whose results are independent of the history of data entry is simplicity of behavior. It is easier to understand the PDA's behavior if that behavior is determined by the information that is present, not the order in which that information was entered, nor any information that is no longer in the system. Similarly, the advantage of a rule system whose results are independent of the history of rule definition and data entry is that this simplicity of behavior is preserved even under changes to the rulebase. One motive for developing a history independent rule system is convenient run time interaction, appropriate for an assistant. Such interaction can take the following forms: * To understand the contribution of one particular type of information, the operator may enter information in batches. For example, the operator might input phoneme-marks, observe the results, then enter syllable-marks to see the impact of this additional information. * The operator might alter the transcript because they want to experiment with the effects of different hypotheses. For example, the operator might change a phoneme-mark from one type to another. To determine the effects of a rule change, the operator can run the PDA up to a rest state, make the change and run the system to rest again, observing the results. Proposed corrections to erroneous rules can be evaluated in a like manner. * It is easier for the operator/rule developer to understand, manipulate and develop the PDA if the conclusions of the system are dependent on the current configuration of rules and data, and not on how that configuration was arrived at. 128 Chapter 4 Another advantage of a history independent rule system is in the impact of rules which change the input data. One rule was written that verified stop voice onset time (VOT) against a table. If a word-final stop-mark was found to have excessive VOT, a word pause was inserted after the stop-mark in the input transcript. Since this rule might be run at any time after the stop had been entered, there was no way for the operator to know if results might be derived from the initial stop configuration which would be invalid after the change. However, because the PDA rule system automatically retracts any such results, the operator need not be concerned with them. The computational price one pays for such history independence is the cost of maintaining a network of dependency links between objects, and propagating effects along the network. The cost in terms of programming and debugging such a network is also high. No Control of Rule Order Another feature of the PDA's rule system is a lack of conflict resolution. Most rule systems have a set of pending rules whose conditions have been satisfied, but which have not yet been fired. These rules are considered in "conflict". Typically, some "conflict resolution" strategy is used to select a single rule from this set, that rule is executed, and then conditions of all rules are reevaluated. The conflict resolution strategy impacts the behavior of the system, because the firing of one rule may prevent other rules from firing. Thus, programming such systems involves understanding and possibly controlling this conflict resolution strategy. The rule system in the PDA has such a set of pending rules, but instead of selecting one, all pending rules are fired. This approach was chosen because we felt that the use of a conflict resolution strategy as a way to influence system behavior was 129 _I__ _________ ___ Chapter 4 undesirable because of its subtlety. For example, the idea that the order in which rules appear in the rulebase should affect system results seemed undesirable. Thus, the firing of rules in the PDA must be controlled explicitly with the conditions of the rules and not indirectly through conflict resolution. Also, since the rule system behaves in a history independent fashion, the order of rule fire is largely irrelevent. 4.3.2. Architecture A picture of the relationship between rules and the Knowledge Manager (KM) is shown in figure 4.5. The KM is connected to all rules and knows which object types Each rule is interested in. New assertions (generally made by rules) are handed to the Knowledge Manager (KM) for processing. The KM passes those objects on to the appropriate rules so the rules can check their conditions. In addition, the KM makes itself a "user" of each new assertion, so it will be informed about any change in the Figure 4.5 Rules and the Knowledge Manager 130 Chapter 4 status of the assertion. Such changes cause the KM to inform appropriate rules so the rules can verify their condition status with respect to the changed assertion. A more detailed account of how the KM handles this task and how rule actions are executed is given below in the section concerning the KM. The rest of this section is concerned with the procedures and structures that accomplish rule condition evaluation. Rule Conditions Rule conditions are composed of three types of clauses: Match Clause The clause (type 'phoneme-mark x) is a match clause. X must be an asserted object of type 'phoneme-mark. Let Clause The clause (let y (right -neighbor x)) is a let clause. When the actions are performed, y will be bound to the value of the expression (right -neighbor x) for the choice of x given above. Test Clause The clause (isa ':stop x) is a test clause. X must be stop phoneme for the rule to be triggered. The match clauses specify both that a given variable should bound to new assertions, and that only assertions of a given type need be considered. Thus match variables do for this rule system what pattern variables do in a pattern based rule system. They implicitly iterate over new assertions. However, in this system one must explicitly declare these variables rather than simply mentioning them in a pattern, and in this system such variables have a type which limits the things they may be bound to. Whenever a new object is asserted, the KM passes that object to all rules containing type clauses for an object of that type. 131 Chapter 4 Let clauses are another way of binding variables, in this case, the variable's value is uniquely determined from the values of variables mentioned in the body of the clause. Unlike match variables, these variables do not independently iterate over new assertions. Let clauses serve to simplify the expression of rules by allowing a variable to be assigned to an often used value. Also, they can be used to save the value of some expression for use either in the PREMISE-ODDS portion of the rule or in the actions. Finally, because the binding that is created when an rule is triggered is assumed to be dependent on the value of all variables in the conditions, assigning a variable to the value of some expression is a way of signifying such dependency. Variables mentioned in let clauses are called "local variables". Since match clauses and let clauses both assign variable values, only one of the two methods may be used with any given variable. The test clauses are the primary way of constraining rule execution. Tests may involve arbitrary LISP functions of any local or match variables. The test clauses are affixed to a network that is used to determine if new (and changed) assertions should trigger rule execution. Each rule is a test network Each rule is responsible for two tasks. Given a new assertion, a rule must determine if it can derive new bindings from it. Given a changed assertion the rule must verify whether or not old bindings are still valid, and it must derive new bindings that are now possible as a result of the change. One way to understand this process is to picture condition evaluation as a multi-dimensional constraining process. Each variable is an axis in a multi-dimensional space. Each possible choice for a variable is a value somewhere on that axis. Suppose a rule has match variables x and 132 _ Chapter 4 y, each of which must be a phoneme-mark, and 20 phoneme-marks have been asserted. The space involved is two dimensional ([x ,y ]) and each axis has twenty possible values, so there are 400 choices for the space [x ,y ], all of which would cause the rule to be executed. Suppose further that rule's conditions include the test (isa ':stop x) and only three phoneme-marks are stops (a /p/ and two /g/s). Then the number of choices for x drops to three and the total number of satisfactory choices for [x ,y ] drops to 60. Finally, suppose the rule includes the test (eq (phoneme y) (phoneme x)). Now there are only two choices for [x ,y ] and the rule will only be fired twice. Constraints on the satisfactory choices for the top-level space of the rule (the one containing all variables mentioned in the conditions) are accomplished through the type restriction in the match clause, and through the test clauses. Each rule uses a network structure of tests and stores to determine which combinations of match variable bindings are satisfactory. A typical network is depicted in figure 4.6. The boxes depict stores where choices for each subspace are kept (the boxes are labeled with the space they store). Each new assertion is added at the bottom of the network as a choice for a one-dimensional subspace, and the acceptable choices propagate upwards through the network. Each new choice must pass through tests (depicted by diamonds). If a choice fails a test at some level, the choice is diverted into a store for bad choices. If the test is passed, the choice is saved in the store for good choices (above the test) and continues to propagate up the network. Dots where paths join are called "merges". Each time a new choice reaches a merge, it is joined once with every compatible choice present in the good-store on the 133 I_ ___ _ I _ _ Chapter 4 {x, y, z, a) lz} iy} z x y Figure 4.6 Typical Rule Condition Network other entering branch. This typically generates many choices for the merged subspace, which continues to propagate up the network. Choices from two branches are compatible if the projection of each choice onto the common subspace has the same value. The circles represent the let clauses. Each of these adds a new dimension to entering subspace (at the bottom of the circle) and produces the value along that new dimension by invoking the let body. Efficiently Evaluating Conditions The overall goals of the rule system are the correct and efficient evaluation of rule conditions. One way to achieve correct evaluation would be to find all the new top-level choices that result from each new assertion, and test them. This would 134 I_ Chapter 4 correspond to moving all test diamonds up to the top of the network. However, such an approach would be extremely inefficient. If a test can be performed on a subspace before a merge, then doing it there once is cheaper than doing it after the merge on the many choices that result. By pushing tests as far down in the network (to the nodes with the fewest dimensions at which the test can be performed) one gets the most power from them. Take any one test and consider how many final choices are eliminated by each invocation. If the rule's space is x ,y ,w ,z, but the test only requires x and y, then each time the test is failed by a choice x 0 ,y 0 in subspace x ,y, all the final choices x 0, y 0 ,w ,z will have been eliminated. Other rule systems build networks similar to this to evaluate rule conditions. For example, YAPS builds structures using a similar branching architecture as shown in figure 4.7. In YAPS, the network topology is completely specified by the condition patterns, and is influenced by pattern order. Each test in YAPS (tests are specified separately from patterns) is affixed to this network in the first location where all variables in the test are available, scanning the network from left to right, bottom to top. While this approach places tests low in the network. it has a flaw. Because the topology of the network is specified by the rule patterns and not the test subspaces, it may be the case that a test's subspace is not present in the network. That will mean the test gets performed on a higher dimensional subspace which will likely lead to unnecessary testing. An example of this problem is shown in figure 4.8. This rule finds all sequences of three NUMs in a row. The numbers adjacent to the branches in the figure are the number of choices that will have flowed through that branch assuming the facts (NUMBER 1) through (NUMBER 100) were asserted. One million invo- 135 ___ _ __ ____. Chapter 4 __ (yaps-rule (x -x) (y -x) (z -) tests (test-i -x -y) (test-2 -y -z) actions ... ) (-X. -y. -) test-2 (-x. -y} test-i {-z} -Z I-v} -x -y Figure 4.7 YAPS Network Architecture 136 __ Chapter 4 (yaps-rule (number -x) (number -x) (number -z) tests (eq -z (+ 2 -x)) (eq -z (+ 1 -y)) actions ... ) {-x, -y, -z) I E 100 1 100 -Z 100 -y Figur-e 4.8 Inefficient YAPS Network -x cations of the test will occur even though only ten thousand would have been necessary. The PDA rule system differs from YAPS in its approach toward generating the network. Rather than derive the network directly from pattern clauses without regard to the subspaces they represent, the PDA designs the rule network to contain all test subspaces so all tests can be used with maximum efficiency. Another source of efficiency is in the "merging" process. If the store for each subspace is indexed by the shared subspaces of all stores it is combined with, then when a new choice comes in from one of those stores, the set of compatible choices in this store can be found with a single lookup (by projecting the new choice onto the shared subspace and indexing). Without this index, it is necessary to do a linear search whose cost will be proportional to the contents of the store. 137 --- --- -~~_ _ Chapter 4 Handling Change Because the state of objects can change, it becomes necessary to reevaluate tests. When a rule is told about a changed object, first it scans up the network stores looking for choices with local variable values that were determined by the changed object. These choices may be eliminated if the local variable value is different as a result of the change. When that happens, all choices that depended on that choice are also eliminated, and a new choice with the correct local variable value is installed in the badstore for that subspace. Now the good-stores with associated tests are checked for choices that use the changed object. Those choices are verified and any that fail are flushed (along with all dependents). Finally, the bad-stores are checked for choices that use the changed object (including choices just generated through a local value change). Those that now pass are propagated up the network as if they were good choices. 4.4. The Knowledge Manager The Knowledge Manager (KM) is a name for the parts of the PDA that handle the interactions between assertions and manage the operation of the rule system. The tasks of the KM fall into three categories: * Run the Rule System * Operate Dependency and Support Networks * Compute Confidence and Other Statistics of Assertions 4.4.1. Running the Rule System The preceding section on the PDA rule system focused on the determination of condition satisfaction. Rules really have two independent roles to play in the operation of the PDA. One role is played by the rule's conditions (which are represented for 138 __ Chapter 4 each rule with the network described earlier). As this network is fed information about new and changed assertions it is able to generate new (or verify old) binding objects. The second role played by each rule is that of generating new assertions and modifying old ones, the effects of the ACTIONS. In the PDA, these two roles fit into separate phases of the overall system cycle as follows: 1) Execute rule actions using bindings which have yet to be used to fire a rule. Keep track of object change and old bindings affected by it. Collect newly generated bindings for execution on the next cycle. Mark bindings dependent on changed objects as potentially invalid. 2) Check rule conditions against changed objects, determine whether bindings affected by change are still valid. Kill bindings that are not verified. Retract all their results (possibly including other bindings both executed and unexecuted). Check rule conditions against new objects. 3) 4) This cycle is one job of the KM. It is run until bindings stop being generated (the so called "rest state" of the system). The implementation of this aspect of the KM is a matter of keeping tables of unfired bindings and unverified bindings, newly generated or changed assertions, and rules currently in the system. 4.4.2. Network Management There are two networks managed by the KM. One is a network of dependency, of objects that rely on the existence and status of other objects. The other is a network of support, consisting of links that signify added or reduced belief in an object or information about its statistics. While these networks are separate from the KM proper, the KM uses them to update the status of the PDA when changes in the status and confidence of assertions have taken place. 139 Chapter 4 Dependency Network The dependency network is a way of maintaining the PDA in a self consistent state despite the modification and retraction of data and rules. It was developed to support the idea of a history independent rule system mentioned above. When there is an implicit dependency between one object and another (for example between a transcribed phoneme and the assertions it led to) then the dependency network can be used to express that dependency. One object is said to be the "user" and the other the "server". If the server is retracted then all its users are notified and they are (usually) retracted as well. Thus the user/server relationship expresses one object's dependency on another. For example, a rule binding is a user of all the variable values for that binding. These values might include the waveform, some phoneme-marks or syllable-marks, the VOICED assertion, etc. If any of those objects are retracted, then that binding is retracted as well. This type of processing is modelled after the processing of so called truth maintenance systems (TMS)55]. Those systems maintain a network of nodes each of which can be in one of three states (on, off, or unknown). The network connects those nodes with constraints that correspond to logical "or" and logical 'not". In operation, one specifies the value at a subset of the nodes, and the network drives the remaining nodes to values as necessary to satisfy the constraints (or the network reports a contradiction signifying that the set of specified values is inconsistant with the network topology). The propagation of retraction information by the Knowledge Manager is exactly the sort of processing that takes place in the operation of a TMS. What we have done 140 Chapter 4 is to merge that technology with the technology of rule systems to create a rule system that is history independent and consequently easier to program. In doing this we had to extend the nature of the propagated information. In a TMS, each node has only three states. In the PDA (as is likely in any KBSP problem) the nodes have more complicated states because they represent more than simple logical quantities. The assertions in the PDA represent vocal excitation modes, fO estimates, etc. Therefore, besides the logical dependency that is manifested by chains of retraction, objects can depend on one another in ways that don't involve being true or false, valid or invalid. For instance, a word mark depends on the syllable marks it covers. If the epoch that ends the last syllable of the word is changed, then the epoch that ends the word mark must also be changed. To handle this type of dependency we developed the idea of propagating change information. When a server is changed, it broadcasts a message indicating that it has changed to all of its users. The users can then react to the change. This aspect of the user/server relationship allows the PDA to react to modifications that fall short of retraction. Any binding is notified if objects it is using are changed. The binding's response to this change is to put itself on a list of possibly invalid bindings. The next time phase 2 of rule system operation occurs ("check changed objects"), the KM will ask this binding's rule to test its conditions against the changed object. If that test passes, then the rule notifies the binding, and the binding removes itself from the invalid bindings list. If the test fails, the KM retracts the binding and the effects of that retraction ripple through the dependency network retracting all objects that were dependent on the binding (users, users of users, ... ). 141 Chapter 4 The binding responds to both change and retraction on the part of its servers. The KM responds to change but not retraction. The KM makes itself a user of all new assertions, so it can be aware of those which change and should be checked against rule conditions. However, when an assertion dies the KM merely forgets it. The opposite behavior is exhibited by the rule condition networks. They are users of all the objects that are held in the stores of the network. Changes in the objects are ignored. However, retraction of the objects causes them to be flushed from the stores. The structure of the dependency network is implemented by pointers from users to servers and vice-versa. Each server keeps a list of users and each user keeps a list of servers. If the user is retracted, it notifies the server and the server disconnects from the user. If the server changes, it notifies the user so the user can respond to the change. If the server is retracted it notifies the user which almost invariably is retracted also. Support Network This network represents and updates lines of support between assertions. The two ends of a support link are the "supporter" and "supportee". The supporter is providing information about the value of the supportee. In the case of symbolic assertions, this support affects the confidence in the supportee. For example, the unique assertion VOICED is supported by all other voicing related assertions (phoneticfrication, numeric-silence, ... ). These supporters are what determine the confidence in VOICED as a function of time. In the case of numerical valued assertions, the supporters are providing information about the value. The FINAL-PITCH assertion is supported by pitch assertions derived numerically and symbolically. Each supporter supplies an estimate of fO together with confidence and varianc'e that contributes to 142 Chapter 4 the probability density for fO contained in FINAL-PITCH. Figure 4.9 shows the typical support relationship for a symbolic assertion like VOICED. Several sources of support must be combined together (using procedures described below) possibly together with a default support (explained below). The network preserves the connections between objects so if a new assertion is added that affects the confidence of some object, that change in confidence can be propagated to supported objects and the rest of the support network brought "up to date" with the new information. The default support is a special kind that exists for handling support that doesn't come from an object. The best example for this is the support given to numeric-voiced assertions. The procedure that creates them determines a value for their confidence in the process of analyzing the utterance. Unfortunately, this procedure is not an assertion itself, so it cannot participate in the support network. To deal with this problem, an assertion can have a default support which is static (unaffected by future change) that is supplied when the assertion is created. This is the way in which the numerical procedure provides support LO the numeric-voiced assertions. Spportee - Default siiPo1wri - Figuporert Relationship Figure 4.9 Typical Support Relationship 143 - Chapter 4 There are two ways to compute confidence for a symbolic assertion. One is used by bindings (the confidence in a binding stands for the confidence that the rule conditions were satisfied). They use the PREMISE-ODDS form in the rule to compute their confidence. This expression is given by the rule writer and may mention any variables in the conditions. The PDA system recognizes that all condition variable values that are mentioned in the PREMISE-ODDS form are supporters. Those that are not mentioned in the PREMISE-ODDS are not considered supporters (they do not influence binding confidence) though they are servers in the user/server sense. This is one reason for having separate networks. The other way used by the PDA to determine the confidence in a symbolic assertion from its supporters is through a Bayesian probabilistic method similar to that used in PROSPECTOR and is described in detail below. This is the confidence computation method used by assertions. Since bindings support the assertions they create and since assertions are used as the variable values in rule conditions that generate and support bindings, the resulting support network appears as in figure 4.10. The last type of support is for numerical assertions (fO). In this case, the supporters are pitch assertions that support the FINAL-PITCH assertion. Bindings were not used to convey support to FINAL-PITCH because the complexity of the support involved in numerical assertions is not handled properly by the BINDING object class. In the PDA, these pitch assertions are not themselves supported. It is unfortunate and probably inaccurate that the confidence in the precursors to these pitch assertions (the waveform and the phonetic transcript) do not influence the confidence in the pitch assertions nor their support for FINAL-PITCH. That simply reflects the limited understanding we have at this point for how to deal with this problem. 144 _11_1 Chapter 4 Figure 4. 10 Alternation of Assertions and Bindings The final point about support representation is that it is time-varying. Most assertions in the network are distributed over a substantial time interval. It was logical that their confidence be variable over that interval since the support for them would be variable. Therefore, the confidence of a distributed assertions is represented by a sequence object and computed with the help of the KBSP signal processing package. This combination of making a small number of assertions, but allowing them to vary confidence over time, was an effective solution to the problem of getting good temporal resolution of assertions such as VOICED, without resorting to making assertions every few samples (which would have been impossible with the rule system technology available to us). 4.4.3. Computing Confidence The computation of confidence for symbolic assertions and probability density for numerical assertions is the last task of the KM. For bindings this is a straightforward task. The PREMISE-ODDS of the rule specifies the function that determines binding 145 _ _ _I I_ Chapter 4 confidence. For symbolic assertions the procedure is similar to PROSPECTOR's procedure for combining support, though the PDA performs this computation separately for each index in the domain of the assertion whereas in PROSPECTOR the confidence of an assertion is a scalar quantity. The procedure for combining support for numerical assertions is derived using probability theory with some independence assumptions. Because the mathematical form of the resulting densities is not a simple expression (such as a Gaussian), the resulting probability density for the numerical parameter is approximated with a uniformly sampled sequence over the region of non-zero probability. PROSPECTOR The derivation of the support combination rule in PROSPECTOR is as follows. Suppose there is an assertion receiving support r and assertions sending support s i . First, assume that all that can ever be known about r comes from {s, }, so P (r I ) = P (r I s Where V signifies "all that is known" and (,4/s5n x signifies "all that is known about" x. Further, assume that the probability of the supporters is independent of any knowledge about r, so P(s 1 s )1 Ir -P(s 1Ir ) ... P(s n Ir) (4.6a) I ) (4.6b) if r is known to be true with certainty and P(s l.s l r )=P(sll r) ... P(s if r is known to be false with certainty. For a single supporter s this leads to the following expression for the posterior probability for r. 146 __ ___ Chapter 4 P(r IV)=P(r I )=P(r Is)P(s IV)+P(r I)P(s IV) = k 1P(s I)+k 2P( (4.7a) I) (4.7b) A typical graph of this relation is shown in figure 4.11. In PROSPECTOR, the rule writer specifies P (s Ir ), P(s I F), and P(r ) and the following formulas are used to compute k 1 and k 2 P(s Ir )P(r) (P(s Ir )-P(s IF))P(r)+P(s IF) (4.8a) P (s Ir )P (r) (P(s I)-P(s Ir))P(r)+l-P(sIF) 48b and 2 These expressions completely specify the relationship between P(r I s ) and P(s IV), including the value of P (s ) the probability for s when nothing is yet known. Unfortunately, s itself is often a supported node, so the rule writer will have provided a value for P (s) even though there already was an implied value for it. The rule won't typically provide consistent information, so there is an inherent contradiction facing PROSPECTOR in deciding which information to use. To deal with this, the authors of PROSPECTOR chose to loosen the specification in (4.7a) in a way that is P(r I s) k2 P(r) k I .I g.- P (,T~( I VT ) U P s) 1 Figure 4.11 Support for r from s - 147 I Chapter 4 depicted in figure 4.12. It assumes the values of P(r) and P(s) are correct and should be consistent. Likewise it assumes the values for k and k 2 are computed correctly, and uses linear interpolation as shown to deal with the inconsistency. To compute confidence from multiple supporters, PROSPECTOR uses the independence assumptions mentioned above and computes the odds of the receiver of support (where odds is defined by O (r) = P(r ). The odds of the recipient is given by l-P(r) O(r IV) = O(r )IIX i where O(r iV ) 0(r) is computed using the single support expressions (4.7a) through (4.8b), together with the non-linear mapping implied by figure 4.12. PDA There are two ways in which the PDA differs from PROSPECTOR's approach to computing confidence for supported symbolic assertions. P(r I s) i k 2 P(r) I I I k I 1 - ,I U Ps ) 1 -P(s Figure 4.12 Modified Posterior Probability 148 ICI IVS) Chapter 4 1) 2) The PDA computes confidence separately at each index. The PDA uses a different representation for confidence that allows a much simpler (more economical) updating procedure so 1 is practical even though .5 million updates are not uncommon. The source of the contradiction in PROSPECTOR is in asking the user for too many constraints. The PDA solution for this involves a different expression of the confidence of an object. Where PROSPECTOR uses the odds O (r I V) as a measure of confidence for each assertion, the PDA uses the odds-factor O (r I V )-O (r ), the ratio of the current odds given what is known to the a priori odds. The formula for the current odds-factor of the recipient based on what is known about a single supporter is (s. IV) P(s, Ir) O(r IV 5 ) Si O (s .i) IX O(r) P(s +l-P(siIr) L 0(S) i IF) O(s ) (4.9) +l-P(s i IF) or more simply P(s i Ir)OF(s s si P(siIF)OF(s IY)+-P(si Ir) I)+-P(si IF) i and the formula for the odds-factor based on all information is 00(r) (r IV) (4.11) Thus, the a priori odds of an assertion have no influence on the propagation of confidence, only on the proper interpretation of the final odds-factor. These two expressions depend only on the odds-factors of assertions and not on any linear or non-linear transformation of them. It is not necessary to convert from odds to probability to compute the non-linearity implied by figure 4.12, for example. This simplification leads to a dramatic savings in the computational cost of propagating 149 _ __ ____ __ Chapter 4 confidence. Numerical Assertions The processing of support for numerical (fO) assertions means taking in fO estimates and combining them to yield a probability density for fO. Each contributing assertion is in the form of a Gaussian estimate for fO with a specified probability of false alarm. It is assumed that an estimate that is in fact a false alarm contributes nothing to the probability for fO. Thus for two contributors there are four possible cases of false alarms: both true, both false, one or the other false. For n contributors, there are 2 n combinations. The probability of any combination is computed from the vector of probabilities of false alarm by taking the product of the probabilities that the false ones are false times the product of the probabilities that the true ones are true. For each combination, only the true estimates influence the density, those are multiplied. Thus the overall density is given by the sum of weighted products of Gaussians, where the weight placed on each product is the probability of that particular combination of true and false estimates. p (f O)= £.| combos II P( ) II P(k )Gaussiank(f o) j EfaLse k Ertue An incremental form of this computation exists whereby a single new estimate s may be used to convert the density for fO before (p ) to the one after (p) as follows: (f ) = P (s )p (f O)GaussianS(f )+P()p(f ) , which allows the cost of computation of probability for numerical values to be linear in the number of supporting assertions. 150 _1_1 Chapter 4 4.5. Numerical Pitch Detection The numerical pitch detector use, . the PDA is based on temporal similarity. First, a map is made of the local similarity of the waveform in the vicinity of a selected index. Then the glottal pitch is estimated from that map by imposing some constraints on the uniformity of the periods. The same similarity map that is created in the first stage of numerical pitch estimation is also used to corroborate voicing. For that purpose there is no period uniformity constraint, so the second stage is not needed. 4.5.1. Normalized Local Autocorrelation We call the procedure for creating a similarity map a "normalized local autocorrelation (NLA)". A sequence x [k ] is windowed twice, once at the location n where the pitch is to be estimated, and again somewhere nearby (e.g. at n +1 ). These two windowed sections are each normalized to unit energy and their inner product is the similarity estimate at location n for the lag . In geometric terms this procedure is measuring the cosine of the angle between the vectors which are the two sections, without regard to the lengths of those vectors. If we define a sequence of vectors x [s ] which stand for the input signal shifted to the left by by s, then for position n, the equations for NLA are 2 yl[n ] = x'[n ]xi Y 2 [n ,lag] = x'[n +lag ]x-i (4.12a) (4.12b) 2 A '"' mark signifies a 1-dimensional sequence. The "" denotes normalization to unit energy. "X" represents element by element multiplication and " < X-, y- >" means the unweighted multiplicative inner product over the implied indices of the sequences. 151 II_ __ _ __ _ Chapter 4 1[n ] gl11n /Y ], (4.13a) [n ]> 1[n lag ] = 2 y2[n ,lagn] - y [n ,lag] = 2 n ag= <Y 2[n ,lag ],y ,ag 2 [n (4.13b) ]> s [n ,lag ] = <l[n ],y 2 [n ,lg ]> (4.14) Amplitude Variation in Speech In estimating similarity for either period or voicing determination, it is preferable for amplitude not to be a factor. This is because the speech signal can be highly variable in amplitude over regions where the waveshape is relatively constant. Examples of such variation are the growth and decay at voice onset and offset (an example for voice onset is seen in figure 4.13). In a section with a relatively stable waveshape, the amplitude can vary by a factor of 2-5 in as little as 20ms. Amplitude 10208.0 Door Waveform I ,!II 8.8 i , i -16988.8 I 78 804 8368 Figure 4.13 Amplitude Variation in Speech 152 Chapter 4 Effects on a Similarity Estimate Of interest here (for period estimation and voicing corroboration) is the ability to find nearby locations of similar waveshape, since that is what indicates glottal vibration. True periodicity (identical waveshape and amplitude) is not necessary. The effect of amplitude sensitivity in a similarity estimate is to make the peak values of the similarity erratic. This means that interpreting those peaks in absolute terms can lead to errors in voicing judgment. Further, if (as in the PDA) peak information is used to estimate the confidence in the fO value, then those confidence estimates will also be inappropriately sensitive to amplitude variation. Reducing Sensitivity Like short-time autocorrelation[61, NLA analyzes its input signal "locally" to reduce sensitivity to long time variation in waveshape and period. Like modified short-time autocorrelation[6] NLA produces the same value at any lag corresponding to a period. It has no linear taper as does standard autocorrelation. Like any normalized measure (e.g. "semblance" processing[63] ) the NLA will produce a value (i.e. 1.0) at n*period that is independent of any scaling of the input signal. What NLA accomplishes that other normalized measures do not is to continue to produce this same value when its input is an exponentially weighted periodic signal instead of a purely periodic one 3 . Thus, similarity estimated with NLA is sensitive to the important phenomenon (waveshape) and insensitive to the interfering effects of growth and decay. 3 This can be understood by recognizing that an exponential envelope on X appears as a simple scaling on delayed versions of x. Since the NLA is immune to scaling on either section in the inner product, the impact of exponential weighting is nil. 153 4 Chapter 4 Warping the Similarity On voiced speech, NLA values at n*period lie between about .7 and 1.0 t. In order to spread this region out and thereby give better control over the setting of similarity thresholds in the system, the warping function y = log & -7 y (l+x 7 (4.15) was used. This warp spreads the relevant values over the range from about 1 to 7. In other portions of the PDA system, this similarity value was used as a kind of odds. That is, the similarity was treated as signifying the likelihood of that lag being the period of the speech or a multiple of it. While high similarity signifies high likelihood and low similarity signifies low likelihood, there is no mathematical justification for this warped similarity being treated as odds. We simply chose to treat it as such because it had the right behavior and no obviously better alternative was apparent. In any following text, numerical references to similarity values refer to the warped similarity. The Choice of Window The similarity function is a two dimensional quantity s [n ,Lag ]. The weighting function w- serves to localize the estimate to the vicinity of n, and to band-limit the response of s [n ,lag ] as function of n. From the definition above, if we temporarily ignore the complexity caused by the normalization of -7l[n] and '2[n ,Lag] we can approximate s [n ,lag ] as follows: s [n ,lag = [+n ]x [+n +lag ]x 2 (4.16a) t These values are based on experiments and not on any theoretical properties of NLA or speech. 154 Chapter 4 For a fixed lag this can be rewritten as a function of n (4.16b) Slag [n ] = Zlag [-+n ]x2 where ag= x [+lag] (4.16c) , or equivalently ag2 where * represents correlation. between w 2 and From this expression we see that the correlation serves to smooth s [lag ,n ] as a function of n. (4.16d) t , and thereby band-limit the response of Further, it is evident that we should choose wi to be the square root of some suitably band-limiting window (e.g. Hamming). The choice of bandwidth for this window depends on the stability of the waveform similarity with n and the sampling interval over n. In the PDA, the waveform similarity is computed every lOms (a value that is common to most pitch processing), if that is a sufficient sampling density, it suggests that the similarity variation as a function of n should have no spectral components above 50 Hz. A 25 ms Hamming window has substantial response up to 40 Hz and little above this frequency so that should pass the relevent part of the similarity variation while suppressing irrelevent higher frequency components. Experiments we performed demonstrated that shortening the window to 12.5 ms lead to unacceptably low stability for the similarity estimate. Implementation The expressions (4.12a) through (4.14) present a direct implementation of the normalized local autocorrelation. A faster FFT based implementation of the complete 155 Chapter 4 NLA algorithm is shown in figure 4.14. The dominant computations in this implementation are the two correlations (*'s) each requiring two forward and one reverse FFT. Assume an FFT size m of 2048 points (large enough to capture the 1200 lags desired) and m-log 2 ( -) butterflies per FFT. Since a butterfly takes one complex multiply and two complex adds (4 real multiplies and 6 real adds), the total arithmetic operation count is roughly 10x1024x10 = 100,000. Using this implementation kept the per sentence processing time (for NLA) to about 17 minutes. Because the processing time was independent of the window length (up to about 80ms), the window length was solely determined by the tradeoff of temporal stability for temporal resolution. 4.5.2. Determining Pitch from Similarity Given a similarity map generated with warped NLA, several factors need to be considered in estimating pitch. y 1 [1 ]XW' ,ag 2 Figure 4.14 Fast NLA 156 ,lag ]> Chapter 4 1 Speech has period jitter of at least .5%. 2 Periodic voiced regions of a sentence can have moveout (monotonic period change) of several percent per cycle. Waveform cycles excited by glottal pulses are typically present in groups of three or more. 3 4 5 Speech does not typically have strong similarity within a period. Above a few kHz, speech similarity becomes erratic. These thoughts led us to the following procedure for estimating pitch from the similarity map. a Select a candidate similarity peak in the lag range from 2 ms to 20 ms whose similarity is at least 1.0. b Search for corroborating peaks near 2 times and 3 times the lag of the candidate. c Starting with a net odds that is the similarity of the candidate, if the similarity of the 2x peak is > 1.0 then multiply the net odds by it. If the similarity of the 3x peak is also > 1.0 then multiply the net odds by it also. Thus the net odds is given by net_odds = sim (peakIx ) (4.17) if sim (peak 2x ) < 1, or net_odds = sim (peak lx )sim (peak 2 x )max( 1.0,sim (peak 3x )) , (4.18) if sim (peak 2x ) 1 . These steps generate two sets of peak candidates, one for positive lags and one for negative lags. Positive and negative candidate sets are processed separately from one another in two passes through the following step. d For each candidate in the set, look for others that are at lags nearer to the origin. If found, the largest similarity of those nearer peaks is divided into the net odds of the candidate. A At this point one direction is selected and the other discarded. Only one direction is kept because the fO assertions that would result from looking in both directions would 157 a Chapter 4 be very correlated. This would violate the assumption of the knowledge manager that multiple assertions on a given topic can be taken as independent information on that topic. e Choose the lag direction (positive or negative) that includes the candidate with the best net odds. Discard the candidates in the other direction. f Make each remaining candidate with a net-odds > 1.0 into a Gaussian fO assertion at that speech index. The mean fO of the assertion is for that P0 candidate (see below), the odds of the assertion is given by the net-odds of the candidate, and the standard deviation is given by an uncertainty estimation formula given below. Speech properties 1-5 are handled by this procedure as follows: 1,2 In step b, corroborating peaks must come from the vicinity of 2 times and 3 times the candidate lag, but not exactly at those positions. The tolerance is +-2% about the expected position. This allows for both period jitter in the individual cycles and any consistent moveout between the first second and third cycles. The support that the multiples give to the candidate is based on how close they are to their expected position. A triangular weighting is applied to the support given by the multiple. The weighting is 1.0 if it is exactly where expected and 0.0 if it is at or exceeds +-2% deviation from the expected position. 3 By searching for multiples of a candidate lag, the procedure uses the existence of multiples to increase its confidence in the candidate. Lack of such multiples is not taken as refutation of the candidate, but such a candidate would need to be very strong to stand by itself against other candidates with multiple support. 4 This knowledge is dealt with in step d. The presence of a strong similarity inside a candidate lag reduces the net odds for that candidate by the similarity of the strongest inner peak. Thus, other things being equal, the shorter of two candidate choices will retain more confidence because the net-odds of the longer will be divided by the primary peak similarity of the shorter. 5 The fact that the speech spectrum above 2 kHz does not carry much similarity information was established by experiment. This effect was also noted by a co-worker who was attempting to track the phase of upper harmonics of voiced speech for coding purposes[64]. Since poorly correlated upper harmonics (and high frequency noise) will suppress peaks in the similarity map, the speech is filtered to eliminate frequencies above 2 kHz. 158 Chapter 4 The above description of the numerical pitch detector applies when it is running without prior information about the sentence. In the PDA, the numerical pitch detector is given prior information about certain portions of the waveform. The next section discusses how this is accomplished. 4.5.2.1. Numerical Pitch Advice In some regions of the sentence, the phonetic context has implications for pitch detection. Information about either the expected temporal behavior of fO, or about how to go about detecting it. In these regions, the numerical pitch detector is given phonetically derived "advice" which takes four forms. I The moveout can be specified. II The preferred direction can be specified. III The expected range of pitch values can be specified. Moveout In certain phonetic contexts the pitch is expected to change in a consistent fashion from period to period. For example, in a vowel-consonant-vowel context where the consonant is /p/, /t/, or /k/, we expect a falling fO at voicing onset. In such contexts, the PDA invokes the numerical pitch estimator with advice the periods should be growing in that region (the growth is estimated as 1% per period as determined by informal experiments). This advice affects the behavior of the numerical pitch detector in step b. Instead of looking for period multiples at exactly 2 times or 3 times the lag of the primary candidate, an adjustment is made in the expected positions of those multiples. The theoretical basis of that adjustment is as follows. 159 . Chapter 4 Suppose we assume that the speech signal is a repeating signal with a linearly changing period. Observations are time values that correspond to known cycles of this repetition (e.g. cycle 1 happened at 1.04, cycle 3 happened at 1.54). Let these "cycle times" be the values of an indexed set of variables { . t _ 2 ,t _1t 0, " } and let us assign the time of the zeroth cycle (t 0) to be 0. Let the variable c stand for a continuous "cycle position" parameter for the speech signal. This variable takes on integer values at the corresponding cycles of the waveform (e.g. c = 1 at cycle 1 and so on. The following expression defines the value of time at any cycle of the waveform: r f dtdc t= (4.19) dc If we define the "instantaneous" period of the signal (P()) as the rate of change of time with cycle dt then we have dc t= f P(c )dc . (4.20) 0 Finally, if we require that the period is a linear function of the cycle then P(c ) P +mc (4.21) and i i f(PO mc 0 2- 2= P 0 i+m 22I Since we have a set of times {t i } we can use equation (4.22) to determine PO and m. For example, if we knew {t _3, .. ., t 31 then we would have 160 Chapter 4 3 2 1 -2 -3 4.5 2.0 .5 - t3 t2 'p'. 1 .5 1 2.0 4.5 (4.23) -2 r -3 In the numerical pitch estimation process candidates are generated together with peaks at the second and third multiple. Equation (4.22) determines where such peaks will be located and given peaks, how to estimate the instantaneous period at c =0. Specifically, if there is an a priori estimate of non-zero moveout and the time of the fist multiple (the primary candidate) is known, then the expected times for the second and third multiples are given by i +-2 t. = t 1+ 2 M 2 (4.24) This expression is used to determine the center of the search region used to find these peaks. A correct estimate of the moveout advice will yield much stronger candidates, because the expected positions of the multiples will correspond more closely to their true positions, and that will mean the triangular weighting that is applied to such multiples will not reduce their support as much. Despite the availability of t 1 t 2 and t 3, the PDA only uses t 1 and t 2 to estimate the period based on a given candidate. This saves having to solve an overconstrained problem, and places less of a burden on the assumption that period is linearly varying with waveform cycle since only two (and not three) cycles are considered. These two advantages are offset by the fact that the estimate of PO is potentially noisier than it 161 4 Chapter 4 might be if all three time values were used. Preferred Direction In some phonetic contexts there should be a stronger similarity in one time direction than the other. Let "left" stand for travelling back in time and "right"stand for travelling ahead. At the closure of a /b/ stop, the waveform will have one shape to the right (where the mouth is closed) and another to the left (where it is open). An example of this effect can be seen in figure figure 4.15. A rule which detects this and similar contexts invokes the numerical pitch detector twice, once on each side of this boundary advising each invocation to look for similarity moving away from the boundary and not towards it. The effect of this advice on the numerical pitch detector is to penalize the net odds of candidates that are in the conflicting direction by 50%. The intent of the penalty is to influence the detector to use the candidates in the direction that is more likely to be the correct one. Amplitude 112880.0 Eyes Waveform V r V -16688.8 18117 V i 11893 Figure 4.15 Voiced stop closure 162 Chapter 4 Pitch Range Within the PDA, there are phonetically derived pitch estimates which are based on research results on the average pitch of certain phonemes as spoken by speakers of certain sex/age classes There are also pitch estimates that are the result of the numerical pitch detector. Both of these types of estimates are integrated by the knowledge manager to provide the final results of the system. As is discussed below, such pitch estimates must be given as Gaussian densities for the fundamental frequency with false alarm specification in order for the knowledge manager to be able to integrate them. Late in the implementation of this system, another kind of knowledge was added to the system concerning the rough pitch range of the sentence. This information was generated by a preliminary numerical pitch analysis that only used data that the detector was very sure of. These pitch estimates generated a histogram of pitch values that was used to estimate the overall fO range of the sentence. Experience demonstrated that this range gave a tight upper bound on the likely fO values of the sentence, but that the lower bound was loose owing to the extremely low values of fO that can result from irregular glottal vibration. Because the upper bound was tight and the lower bound was not, it did not seem that a Gaussian probability shape was appropriate. Since, the knowledge manager only supported assertions with Gaussian probabilities, this "prior" range estimate was given as advice to the numerical pitch detector rather than as fO assertions. This advice takes the form of a sequence of functions which specify the a priori odds on fO for each frame of the speech. These functions influence the numerical pitch detector in step c. Instead of using the similarity of the primary candidate as the 163 Chapter 4 initial net odds, that similarity is first adjusted by the a priori odds of the candidate. Jitter It is also possible to specify the acceptable jitter as advice to this numerical pitch detector. We did not make use of this capability in the PDA. 4.52. Determining the Standard Deviation of FO Estimates Because the fO assertions generated by the numerical pitch detector must be com- bined by the knowledge manager (KM) with other fO assertions, they are given in the format expected by the KM. Each assertion specifies a Gaussian density with an odds at a specific speech index. The knowledge manager combines these densities to generate the composite probability density that is the PDA's statement about fO at that speech index. The mean of the Gaussian is given by the formula for P 0 given in the previous section. The odds of the Gaussian is given by the net odds of the candidate responsible for the fO assertion. The standard deviation is estimated by a formula drawn from radar theory[651. The formula is for the standard deviation of a return time estimate for a radar pulse of a known bandwidth with a known signal-to-noise ratio. The correlation of the transmitted radar pulse with the received one is equivalent to the correlation of the 0 lag speech section with sections taken at various nearby lags. The formula for the variance of the lag estimate is VLr IT7 s T J- :__ 1 (4.25) where 164 Chapter 4 Er is the signal energy, IV is the noise variance, and _1 fo21S(w)2dc o2 _ 27r 1 fS(o)12do is the standard signal bandwidth. By measuring the standard bandwidth (4.26) 2, the noise level (in silent regions) and the signal power, the PDA can use equation (4.25) to estimate the standard deviation in the fO assertions it produces. 4.6. Conclusions This chapter has described some of the interesting features of the implementation of the PDA: * * * * * * * * A signal representation based on o domains that permits the representation of a broad class of signals including exponentials and periodic signals, and allows the representation of temporal phonemena such as delay and linear phase in a straightforward manner. A computational representation for temporal events that merges them based on numerical and symbolic and compatibility, and serves as a means of comrn bining sources of information about temporally distributed phenomena. A rule system that facilitates program development and interaction through history independence. Rule conditions based on procedures rather than patterns which isolate rules from object data structure to simplify programming, and make efficient use of the information encoded by linkage in the assertion database. An efficient algorithm for rule condition procedure evaluation. A new representation for symbolic assertion confidence that allows for temporal variation and thereby represents time varying problems more accurately. A new method for computing the confidence of symbolic assertions (using odds-factors) that eliminates the need for approximating the Bayesian analysis and is substantially less expensive. The NLA, a new periodicity measure that is insensitive to growth and decay in the waveform. 165 aa CHAPTER 5 System Evaluation 5.1. Introduction This chapter describes a pilot experiment to determine the performance of the PDA. In it, the PDA and the Gold Rabiner (G-R) pitch detector are compared, using noisy speech with a hand edited reference pitch track. We chose to do this experiment in order to determine if the idea of combining numerical and symbolic processing in this program was beneficial. Despite the modest amount of symbolic and numeric knowledge that had thus far been incorporated into the PDA', the experiment demonstrated that the PDA produced 1/2 as many errors as G-R accross many different signal-to-noise ratios. To determine the contribution of symbolic information to the PDA's performance, the system was run on two of the test sentences without the aid of a phonetic transcript or the sex/age of the speaker. Without this symbolic information, the PDA and the G-R pitch detector performed equally well in terms of total voicing and fO errors. Surprisingly the symbolic information impacted most on fO errors and not voicing errors. Apparently, the addition of relatively broad fO assertions due to the symbolic information is an appropriate complement to the precise, but multi-modal, fO densities produced by numeric analysis. 1 Many symbolic ideas (such as tune information) and numeric ideas (such as interframe fO smoothing) had not been incorporated. Chapter 5 Chapter 5 5.2. A Comparison of PDA and G-R This is a three way comparison between the PDA, the Gold-Rabiner pitch detector (G-R) and hand marked pitch. The G-R program was implemented on a lisp machine using information derived from the original paper that described it[3], and information drawn from a C language program 2 written by Eliot Singer (a member of the group at MIT Lincoln Laboratory in which Ben Gold works). The sentences (already hand marked for pitch every 100 ms) were acquired from Bruce Secrest who was at the time working for the speech research group at Texas Instruments. The frame rates for both detectors were 100/sec, tie speech was sampled at 12.5 kHz and low-passed below 900 Hz before analysis. Fourteen sentences were used, including 10 different texts, male and female speakers, and age categories ranging from child to elderly. Each of the fourteen sentences was used at 5 signal energy to noise energy ratios (40 dB, 30 dB, 20 dB, 10 dB, 0 dB) by adding white Gaussian noise. 5.2.1. Performance Criteria Four kinds of errors were counted: high and low pitch errors of more than 30 Hz, voiced-to-unvoiced errors and unvoiced-to-voiced errors. The errors were divided by the number of seconds in each sentence. Since the frame rate was a uniform 100 frames/second, these numbers may be interpreted both as percentage errors (100 times the number of errors per frame), and as error rates (errors per second). These errors were further gated by an amplitude criterion. Specifically, if the speech amplitude in the frame in question never exceeded 1/10 of the maximum amplitude of the sentence then the frame was ignored. This deemphasizes the 2 Knowledge embedded in that C program that either differed from the original paper (e.g. voting matrix bias values) or was not present in the original paper (e.g. the precise speech filter spectrum) was essential to achieving the best performance from the detector. 168 Chapter 5 significance of errors in the quiet portions of the sentence, errors that would not be likely to significantly affect the quality of speech reconstruction. The standard deviation of fO "error-free" voiced frames was also measured. However, as is explained below, the hand marked pitch was too imprecise for this figure to be meaningful. 5.2.2. Caveats The hand marked pitch tracks were only measured to the nearest sample lag (as evidenced by the quantization of fO values). This means that precise fO deviation measurements cannot be made with this data since they would be dominated by the fO quantization. We have presented these deviation measurements, but do not include them in our criteria for performance. It appears that peak-picking in the time domain was used to hand mark the pitch. We believe this because in places where there was fO ambiguity between the apparent pitch on the basis of peak-picking and other plausible techniques such as correlation measurements (as was used in the semi-automatic system SAPD[66] ) it was the peak-picking value that was chosen. If peak-picking was the sole method used to estimate fO, then the hand marking may not be the best possible. However, there is no reason to believe that it would give a preference to either of the pitch detection methods being tested, and a cursory examination of the hand marking did not reveal extensive anomalies. Another factor that might influence the results is our use of an amplitude gating on the errors. It seems clear that there should be some amplitude weighting, since errors made at low amplitude would not be as noticeable to someone listening to 169 -- Chapter 5 reconstructed speech. Yet there is no standard for such a weighting. If our particular weighting isn't ideal, it is probably close enough that the effects of the difference on this experiment are minor. The performance of the PDA is enhanced by the fact that the transcript is unaffected by the noise. It would have been fair to "add noise" to the transcripts as well. However, there was insufficient time to retranscribe the utterances after noise was added. Therefore, admitting to the potential bias, the transcript (and sex/age) information used in these experiments is the information derived from clean sentences. As has been mentioned, our primary interest was in learning methods of combining knowledge, symbolic and numeric, and not in optimizing the PDA. Thus we did not incorporate all the knowledge that might have improved the performance of the PDA. In building the PDA, little effort was expended adjusting thresholds, window sizes and shapes, filters etc. Reasonable values were chosen based on intuition, or brief experiments with one or two sentences. Though no experiments on the sensitivity of these parameters were performed, it is unlikely that they are all near their optimal values. Conversely, these sorts of refinements are apparent in the Lincoln Laboratory version of the G-R pitch detector, and minor deviations from the parameter values in that program led to markedly poorer performance. Thus it seems likely that the G-R program is the best example of a program in its class, whereas the PDA is only a "rough cut". While the following tests are hardly conclusive (owing to the limited corpus and the caveats above) they suggest that the combination of symbolic and numeric ideas 170 Chapter 5 can lead to better performance than either taken alone. 5.2.3. Test Results An example taken from the raw data that was gathered is shown in table 5.1. These were the 5 different trials (different SNR's) for a single sentence. The sentence was spoken by speaker "NLD" using text 9 - "Almost everything involved making the child mind.". The speaker was female and in age group 2 - adolescent (we were not able to get specific age information). The other columns in the figure are: SNR High The signal to noise ratio (in dB) used for that trial. Errors in estimated fO that exceeded the hand marked value by at least 30 Hz. Low Errors in estimated fO that fell short of the hand marked value by at least 30 Hz. UV-V Errors in which the speech was marked "voiced" when it should have been marked "unvoiced". V-UV Errors in which the speech was marked "unvoiced" when it should have been marked "voiced". Dev The standard deviation of fO weighted by the normalized amplitude of speech in each frame. Test Spkr Txt Sex nld 9 f nld 9 f nld 9 f nld 9 f nld 9 f PDA Age SNR High Low UV-V V-UV Dev 2 0 0.4 4.0 0.0 0.0 2.1 2 10 0.7 0.9 0.0 0.0 1.7 2 20 0.7 0.9 0.0 0.0 1.7 2 30 0.4 0.7 0.0 0.4 1.6 2 40 0.7 0.7 0.0 0.0 1.6 High 0.0 0.2 0.2 0.4 0.4 Table 5.1 Example of Results 171 GR Low UV-V V-UV Dev 9.1 0.0 11.5 2.5 2.0 0.4 3.1 1.7 0.4 0.0 2.0 1.5 0.4 0.0 2.2 1.5 0.4 0.0 2.0 1.6 Chapter 5 As was mentioned above, the standard deviation values are of little value because the reference fO was so heavily quantized. Therefore, to estimate the overall performance of the two methods at each SNR, the error columns were added and averaged across all sentences with the same SNR. Those results are shown in table 5.2. The columns marked "FO" are the total gross fO errors, and the columns marked "Vcng" are the total voicing errors. The parenthetical numbers in this table are the standard deviations of the numbers above them. In interpreting these tables, differences of about 1.0 errors per second or less should be considered barely significant. This experiment indicates that the PDA makes roughly half the total errors (sum of fO and voicing errors) of G-R irrespective of SNR, and that the difference is largely due to the V-UV errors in G-R. While it might seem nonsensical that G-R is making fewer high fO errors when the SNR is low, in reality G-R is just calling those frames unvoiced, leading to a reduction in fO errors at the expense of V-UV errors. Because of this sort of effect, the behavior shown by individual columns of this table must not be taken out of context. SNR 0: 10: 20: 30: 40: High 1.4 (1.1) 1.4 (1.5) 1.4 (1.6) 1.5 (1.6) 1.6 (1.7) Low 4.1 (2.7) 0.9 (0.9) 0.8 (0.8) 0.8 (0.8) 0.8 (0.8) PDA UV-V V-UV 0.1 1.4 (0.2) (2.4) 0.7 0.2 (0.5) (0.7) 0.2 0.9 (0.8) (0.2) 0.3 0.2 (0.4) (0.2) 0.3 0.1 (0.2) (0.4) GR FO 5.5 (2.1) 2.4 (1.2) 2.1 (1.3) 2.3 (1.3) 2.4 (1.3) Vcng 1.5 (1.7) 0.9 (0.6) 1.1 (0.6) 0.4 (0.3) 0.4 (0.3) High 0.3 (0.4) 0.8 (0.8) 1.0 (0.6) 0.9 (0.6) 0.9 (0.5) Low 4.3 (4.3) 1.5 (1.3) 1.1 (1.3) 0.9 (1.0) 0.9 (1.2) Table 5.2 Performance vs SNR 172 UV-V 0.1 (0.3) 0.2 (0.4) 0.2 (0.3) 0.1 (0.2) 0.1 (0.2) V-UV 13.6 (5.2) 5.4 (3.0) 3.4 (2.1) 3.7 (2.3) 3.9 (2.9) FO 4.6 (3.0) 2.3 (1.1) 2.1 (1.0) 1.9 (0.8) 1.8 (0.9) Vcng 13.7 (3.7) 5.7 (2.1) 3.6 (1.5) 3.8 (1.7) 4.0 (2.0) Chapter 5 This effect also points out a significant difference in the approach of these two systems. The G-R pitch detector attempts to find periodicity; if it can't G-R calls the speech unvoiced. The PDA determines if the speech is voiced separately and without regard to periodicity. Having determined that the speech is voiced, the PDA chooses the most likely pitch. This means that when the analysis becomes difficult, G-R will begin making large numbers of V-> UV errors, whereas the PDA will have a mixture of voicing and fO errors. 5.24. A Typical PDA Run Figures 5.la through 5. f show the results of typical PDA runs from 40 dB SNR Amplitude -i ana MacSfS 48dB SNR a 8.8 -32RRRI 34846 0 Mark Typ e Transcript DECLARATIVE Phrase Words Syllables which tea -Wt' t'ti' w I t' Phonemes 6 t' t | party p'par did i baker d'dzd' p' p a r r i' dd i r ot|i''d * WzI~a be' kk' go to 9'9ow t'tuh db e' kIk d'e ' ok , 9'9 aoow t' t u h 8 34046 Figure 5. la Input Information - - 173 -- I------ Chapter 5 to 0 dB SNR. The first figure shows the input information for this sequence of figures (the waveform and transcript). The remaining figures show the outputs for different SNR's. The four panels of each figure (from the top) are the voicing assertions, the probability of VOICED, the pitch track and the fO probability density. Values for the pitch track that are at the top of the graph signify an unvoiced decision. Some significant features of these runs are * * * * Below 30 dB SNR, the PDA is unable to locate numeric silence regions reliably. This can be seen as a loss of "ns" marked assertions in the voicing assertions panels of the figures. By 0 dB SNR the PDA can no longer numerically resolve distinct voiced intervals. At 0 dB, there is only a single numeric voiced assertion. As the SNR drops one can see an increase in the PDA's uncertainty of f0. This takes for form of a thickening of the contours for fO in -the density plot. Likewise, there is an overall drop in the voicing certainty as can be seen by noting the drop in vertical scale factor for the probability of "voiced" as SNR goes down. 5.2.5. The Impact of Symbols In order to judge the impact of the symbolic information on the performance of the PDA, one of the sentences was reprocessed at all 5 SNR's without the input of a transcript or the sex/age of the speaker. The first sentence shown is the one from table 5.1, which had relatively few errors for either detector. The results are shown in table 5.3. The only significant change is a 5/1 increase in gross low fO errors at 0 dB snr. Note also that the data for the G-R detector corresponds closely with the previous data despite the use of a different noise sequence (the seed for the noise generator was different). This indicates that the experimental results are not extremely sensitive to the specific noise sequence. 174 ii Chapter 5 Mark TpeS Voicing Marks G NB NS NS' NV PVF, PBi PA, PF PV C BRST NS NS NV NV NS NV _-- , . , NS NV PVF x PB PA PF PF PVPV c PS PS- PS I I PA PF PV PV PV PV PS NS NS NV PVF c PB . PIVF P8 PA PA PF C PV PV PS PV NS "--c NV NV PF -a PV PV PS PS 0 Odds 34847 Voicing Odds actor 53.1 I, \ .: I 08. J j 1' i , ?i: , * I \ \ A I *~~~ - - L 34047 -1 F8 Pitch Track 409 - _ NI)2_ ~ i. . a n n 3412S 0 F8 ProDaDility co 508 .16 1 i4 1.(· . .. . i. .. .. 0 34125 0 Figure 5.1lb PDA Outputs at 40dB SNR Chapter 5 Mark Type Voicing flarks G G' NE NS Nv PVF P8 PA: PF: C G--- BRST - NV NV NV NV NV PVF PB I PVF PB PA PF PF v 0- PVPV PS PS !- PS * - -C c - PS 0-4 o--a PA PF PV PV _ Pv NV I I PvF PA PA J P-C PF PF PV D---rC PV ,,- PB oI. PV PS Nv :} ,,f PV Pv PV ---- O-a PS -- a PS PS -a o 8 34847 Odda Factor Voicing Odds 52.3 I ,j i JI i i i' I i I 8.8 : i, I j A K i V A I IL - 34847 -i Pitch Track Fa i v I i I a 34125 8 Fa PCnoatIlIt F8 see t i.i tII a 34125 8 Figure 5.1c PDA Outputs at 30dB SNR I Chapter 5 _ __ Mtlrk Type Voicing narks C GG BRST NE NS Nv NV I NV NV , ,I o----- Pe F'A PA. PF PF o PVPV P PSPS 3---- 0-PS Pv -4 PS PV PV 0- 0 P9 ' PB _ -- PA PA PF PF PV 0-C PV PV PS 0- NV PVF PVF PA PF PF NV J---- PVF PVF Pe NV NV NV PV PV PS _ PV 0a PS PS _- _ _ 8 34847 Odds Factor 48.$ Voicing Odds Ii 1 i /, :- I~ n'N 1r I! i~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ; h . 8.9 - --· - - ! . , · I - ---------------- 34847 -1 Pitch Track 48 -- II 8 34125 FS Praanailitu s58 A--A t --- 34125 8 Figure 5. d PDA Outputs at 20dB SNR ----------------- Chapter 5 Mark Type Voicing Marks G G - NB NS NV PVF PB PA PF a---- G BRST NV Nv NV : . - NV 3 NV PVF x Pe PVF x PB PF O--- PVPV 3 PV PS PS pE e*-- PA c PF o.-C PS 0-4 c PV PA 3 PF PV PV NV 3- PVF x PB NV - -C PA PF PV PV m PV 13· I PS NV PV -- PV PS 3-4 3--.4 PV PS PS .- 8 34847 Odds Factor 21.8, Voicing Odds i ¢ I ;I II i ; I I I '';11 F i , i .. . .r-n.n w. -1 - 1' i I : ; f~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 34847 _ Pitch Track 480 -- -1 ,I '\\~~ _, I--N \. A II 8 34125 0 F0 C D .~ e,;l:i.. 58e a v 34125 8 Figure 5.1e PDA Outputs at 10dB SNR 4 -- Chapter 5 nark Type arks Voicing G G , G - NB NS NV NV PvF pvc PVI PB PA PA PF PV PF -cPS PS PS PVPV PV PV PV 2- PV PV PV PS PS PS ~~~~34e 8^~~~~~~~~~~~~~~ 8.54 PF PF PS Odds Factor PA PA PA PV PS PB PF PF >c PVPV PVF PB PB 147 Voicing Odds - , I ,i ;\r , 1.r ! i I \, ,n * 'A \ ,.ar i 'I t" ' I 'N11 'I 1 I 4. 8.8 34847 -1 F8 Pitch Track 48 1 .I Al -AI A,: -i I ' > I Nu -- a 34125 8 FP F0 Prnmilitt see a 34125 8 Figure 5.1f PDA Outputs at 0dB SNR --- Chapter 5 Spkr nld nld nld nld nld Test Txt Sex 9 f 9 f 9 f 9 f 9 f PDA GR Age SNR High Low UV-V V-UV Dev 2 0.2 2.2 0 0.2 20.2 0.7 10 0.4 0.9 0.2 0.4 1.8 2 20 0.4 0.9 0.2 0.2 1.7 2 2 30 0.2 0.9 0.2 0.4 1.6 40 0.4 2 0.9 0.2 0.2 1.6 High Low UV-V V-UV Dev 10.6 2.7 0.0 11.1 0.2 0.2 2.4 0.0 2.4 1.4 0.4 0.2 0.4 2.2 1.6 0.4 0.0 0.2 2.0 1.6 0.4 0.0 0.2 2.0 1.6 Table 5.3 Processing without Symbols The error rates for that sentence were quite low, and one wouldn't expect much impact from the symbols on such clean numeric data. Tables 5.4a and 5.4b show the Spkr Txt les 2 les 2 les 2 les 2 les 2 Test Sex f f f f f Age SNR High Low 5 0 2.1 5.8 5 10 1.6 1.6 5 20 1.3 1.8 5 30 1.1 1.6 5 40 1.3 1.6 GR PDA UV-V V-UV Dev High Low UV-V V-UV Dev 1.3 0.0 2.4 1.1 1.8 0.0 14.8 2.9 0.0 1.8 2.2 0.8 0.0 0.3 6.6 2.8 1.6 0.8 3.4 2.0 0.0 1.8 2.3 1.3 0.3 0.5 2.3 2.1 0.0 0.3 3.2 2.6 0.5 3.4 2.5 0.8 0.5 2.3 1.3 0.0 Table 5.4a Difficult Sentence with Symbols Spkr Txt les 2 les 2 2 les les 2 les 2 Test Sex f f f f f Age SNR High Low 5 0 0.3 12.1 5 10 0.8 5.0 5 20 0.8 3.7 5 30 0.5 4.0 4.0 5 40 0.8 GR PDA UV-V V-UV Dev High Low UV-V V-UV Dev 1.8 0.8 2.2 1.1 2.6 0.0 12.7 2.3 2.1 1.9 0.8 2.1 0.0 7.7 2.5 0.0 0.8 4.7 1.9 0.0 2.1 2.1 1.0 1.3 0.3 2.9 2.2 0.3 0.8 2.0 1.6 0.0 3.4 2.3 1.1 0.8 2.0 1.6 0.0 0.5 Table 5.4b Difficult Sentence without Symbols 180 W4 Chapter 5 impact of symbols on a more difficult sentence. There is a substantial reduction in the gross low fO errors at the expense of a slight increase in gross high fO errors. The total number of gross fO errors at 0 dB SNR dropped from 12.4 per second to 7.9 per second. At 40 dB SNR the reduction was from 4.8 per second to 2.9 per second. Surprisingly, the symbols appear to have little or no effect on voicing errors. Apparently, the numerical methods for determining voicing in the PDA are more robust than the numerical methods for determining fO. 5.2.6. Comments A comparison between the PDA and G-R showed that the PDA made 1/2 the number of errors. The fact that this improvement in performance was accomplished with a program that only partially taps the available knowledge and lacks extensive polishing, indicates that more substantial gains can be made. More significantly, it is clear that the symbolic input is an important contributor to the PDA performance. Without symbolic input, the two methods have roughly equivalent error rates with the PDA being dominated by low fO errors, and G-R being dominated by V-UV errors. This difference in the nature of their errors is not surprising, since G-R will never accept the irregularly spaced pitch pulses of glottalization as voiced, whereas PDA can do so since its methods of determining voicing are not predicated on periodicity. 181 4 4 _ _ __ _ CHAPTER 6 KBSP 6.1. Introduction The previous chapters have presented the PDA system from several viewpoints. First from the perspective of the knowledge it contained and the concepts of its subsystems, then with an emphasis on implementation, and finally with a demonstration of its performance. In this chapter we discuss what we have learned from this thesis about combining numerical and symbolic processing in systems and what we have learned from other aspects of knowledge-based signal processing. Each section discusses some aspect of KBSP that relates to this thesis with comments on how that issue influenced the thesis or vice-versa. 62. Quantitative versus Qualitative Symbols The value of an attribute can be numerical (1.5) or symbolic ("chocolate"). Consider the phrase "symbolic value" to mean any non-numerical value that has no obvious mapping into a number. For example, "chocolate" is a symbolic value but "five" is not. Within the subset of values that are symbolic, some quantify their attribute and others do not. For example, "chocolate" doesn't quantify the attribute "taste", since there is no obvious ordering of tastes. However, "high" does quantify any attribute it applies to', since there is an ordering between for example "high" and "low". ' Except when applied to a state of mind. Chapter 6 Chapter 6 For "quantitative" symbolic values, there is always an ordering and there may be a metric as well. For example, in cooking one can substitute two "medium" peppers for one "large" one. However, it is rarely the case that arithmetic can be used on quantitative symbolic values. Usually, one recognizes the significance of a particular symbolic value. In the PDA, if the signal-to-noise ratio is "high", then silent regions can be detected by looking for places with "low" power. Numerical values are not meaningful unless their scale can be determined or if they are simple counts ("the number of poles is 5"). The scale might be evident from context - "The first car took 6.7 seconds and the second car took 5.2." or standard usage - "He was doing 75 when the cop pulled him over". Some symbolic values do not need such a scale to be meaningful. Values like "voiced" can be interpreted solely based on their identity. Others derive their meaning from their identity and the attribute they are applied to (e.g. "stop" can be the value of "phoneme identity" or "traffic signal state"). Symbolic values that quantify their attribute do need a scale for them to be meaningful. For example, the value "high" needs some sort of scale. Sometimes a standard scale is implied by the attribute (e.g. "the stove is on high"). If not, a scale must be defined. In a system that uses symbols to quantify numerical attributes, the scale is chosen at the time of quantification. For example, in the PDA the attribute SNR is "high" if it is larger than 40 dB and "low" otherwise. In performing this mapping from numbers to symbols, one divides the numerical range into symbolically labelled, mutually exclusive, exhaustive sets, reducing a near infinite number of possible numerical values to a small number of symbolic ones. 183 14 Chapter 6 In many KBSP systems[67], all numerical attributes are quantified to a coarse symbolic range (e.g. "very high", "high", "low", "very low") before being used for any other purpose. Based on our experience with the PDA, this is not the best approach to building KBSP systems. Mapping from a detailed description to a simplified one irreversibly loses information. This is only appropriate if the lost information is irrelevent or at least unnecessary. This is true for the Gold-Rabiner pitch detector, since the periodicity in speech is apparent in the extrema of a low-pass filtered version. It can also be true when one maps numerical values to symbolic quantitative ones for any particular purpose. An example of this is the SNR mapping in the PDA. This is only used to control whether or not to make s4lence-assertions based on power measurements. However, such mappings are usually sensitive to the purpose to which the information is to be put. The extrema of a speech signal would be insufficient for purposes of reconstruction. Similarly, the numerical to symbolic mapping appropriate for one purpose may not be appropriate for another. If a system is sufficiently simple that a given mapping is only used for compatible purposes, then mapping the numbers "up front" will probably suffice. However, as that system grows, such mappings are likely to impede the inclusion of new knowledge since they will not be appropriate in general. Some possible reasons for this use of initial numerical/symbolic mapping are: 1 2 The authors view the conversion from a high resolution numerical value to a low resolution (but still quantitative) symbolic value as retaining sufficient information about the attribute for their purposes. These systems are built on top of rule systems that encourage or require the use of values taken from a small set. Hence the use of quantitative symbols taken from a small set rather than numerical values taken from a much larger one. 184 Chapter 6 A rule system that relies on matching promotes the initial transformation from numbers to symbols. When using such systems, numerical values are less attractive because matching is impractical (with numerical values the number of possibilities is too large). With numerical values, other condition evaluation methods are required (e.g. the functional tests of the PDA). An alternative to an initial transformation from numbers to symbols is for each rule to perform the mapping in its conditions. In that case, the mapping can be made in the specific context of the analysis to be performed (the conditions of that particular rule), and it places no restriction on the use of the (numerical) value of the attribute by other rules. Thus, many different applications for the same information can be easily supported, and the initial implementation of the system is not constraining future growth. Based on this discussion, we have two pieces of advice about building KBSP systems: 1 2 Avoid an initial, single numerical-to-symbolic mapping unless it obviously preserves all the important information. If the system is based on a rule system, make sure the rule system supports the convenient and efficient application of functional tests in rule conditions (e.g. the PDA rule system). 6.3. Symbolic/Numerical Interaction One of the primary goals of this thesis was a better understanding of how to combine numerical and symbolic processing. In our work on the PDA, we employed four types of interaction between symbolic and numerical information: Combination: Numerical and symbolic information contribute to the same conclusion. . 185 Chapter 6 Verification: Numerical information verifies symbolic assertions. Supplying Parameters: Symbolic information supplies advice for numerical processing. Invocation Control: Symbolic information determines where numerical processing will be applied. Combination In the PDA, two examples of the use of "combination" are the determination of the confidence in symbolic assertion VOICED and the determination of the statistics of the numerical assertion FINAL-PITCH. derived assertions (phonetic-voiced, In the first case, there are symbolically phonetic-frication, numerical-silence, burst) which add or deduct support for VOICED and thereby influence the voicing decision in the pitch track. In the second case, phonetically derived fO estimates and numerically derived f0 estimates are combined to form the fO probability densities of FINALPITCH which in turn determines the final pitch estimate of the PDA. It is interesting to note that both interactions are accomplished numerically (through the merging of confidence estimates or the merging of probability density estimates). The (symbolic) identity of a supporter of VOICED serves to determine the polarity of the confidence contribution (phonetic-frication refutes VOICED whereas phonetic-voiced supports it). With fO assertions, the (symbolic) identity of the phoneme and the sex/age (male, female, child) select the parameter values of the probability density estimates for fO (mean, variance and odds) which is contributed to FINAL-PITCH. Another example of the direct combination of symbolic and numerical information is in the actions of the Epoch system. Phoneme boundaries and boundaries of 186 Chapter 6 numerically determined voiced segments both give rise to epochs which are then merged. Note also that the merging process itself is both numerical (the epochs must be close) and symbolic (the starting and ending epoch of a single assertion cannot be merged even if it would be numerically reasonable to do so). In the PDA, the nature of the "combination" of objects (support versus denial, merging versus not) involves both numerical and symbolic information. The results of the combination can also be symbolic (newly established linkage between voicing marks caused by an epoch merge) or numerical (a new value for the fO probability density or voicing confidence). Verification This type of interaction occurs when there are ways to verify or check an assertion (relatively) independently. For example in the PDA, the phoneme-marks in the input transcript might be erroneous (the phonetician makes a mistake). Also, the translation from a phoneme-mark to a voicing assertion may be erroneous (phonemes do not always manifest their typical acoustic properties). To check phor c voicing assertions the PDA scans the waveform for phenomena that should accompany them. For example, a stressed phonetic-voiced mark is expected to have substantial lowfrequency power near the center. If the measurements check out, the voicing-marks. are given added support, otherwise support is deducted. Another example of verification is in the use of numerical information in the PREMISE-ODDS form of symbolically derived assertions. For example, in the assertion of phonetic-aspiration and phonetic-frication from the phonetic marking of a stop, the PREMISE-ODDS (which provides support for the assertions) includes a comparison of the duration of the stop with the expected duration of stops of that type. If the 187 Chapter 6 stop is marked with a length that substantially deviates from the typical duration, then the phonetic voicing assertions derived from it will only be weakly supported by the rule. Supplying Parameters Information that results from one analysis can be used to provide "advice" for some other analysis to improve its performance. An example of this sort of interaction is in the use of the phonetic context to guide numerical fO determination. For regions where the phonetic context can suggest an expected direction of similarity or an expected moveout, the PDA runs the numerical pitch detector with advice about what to expect. Given this advice, the pitch detector downgrades the confidence in results that do not conform to these expectations. The PDA lacks any self-contained symbolic procedures like the numerical pitch detector. However, one can conceive of KBSP problems in which such interaction is plausible. If there was a symbolic procedure for parsing a stream of phoneme estimates into possible words. and a numerical procedure had determined that there was substantial high frequency noise, then the symbolic procedure could base its analysis primarily on the more robust voiced phonemes, be "suspicious" of the fricative phonemes, and look for errors involving voiced -> unvoiced phoneme substitutions. Finally, one need not view supplying parameters as a rigid structure in the system. Such a view might depict symbolic analysis supplying parameters to numerical analysis in the following way symbolic -numeric 188 Chapter 6 with the parameters being a required input to the numerical processing. Our view of this process, and the reason we used the word "advice", is that this information is optional and might be best depicted in the following way symbolic numeric with the dotted line from the symbolic to the numerical module signifying that information may or may not be forthcoming along that path. Invocation Control In the PDA, symbolic context may suggest certain numerical processing. For example, consider the case of numerical burst analysis. When an unvoiced stop is identified in the transcript, a numerical procedure is invoked there to find the boundaries of the stop burst. This numerical procedure is only used in parts of the utterance where a burst is expected. Thus one need not be as careful in the design of the algorithm as would be the case if it were to be run everywhere. The symbolic invocation serves to "prefilter" the speech eliminating regions which might cause erroneous results. In a problem involving more complex symbolic analysis, it is easy to see where invocation of symbolic processing might be controlled by numerical information. In examining an image to identify, objects (say automobiles) it might be the case that some simple numerical analysis (median filtering, local bandwidth, ... ) might suggest locations where an automobile is likely. The symbolic analysis for the structure of 189 Chapter 6 the automobile can then take place in those locations. This analysis might lead to erroneous results if applied indiscriminately. Besides the potential advantage of avoiding false detection by only applying a procedure where appropriate, KBSP systems can be made more efficient by avoiding the application of complicated/expensive procedures of one type (symbolic or numerical) by using simple processing of the other type. One final comment about invocation control. It is like and unlike the notion of "islands of certainty" as expressed in the HEARSAY system. As applied by HEARSAY to the problem of parsing speech, islands of certainty suggests that one start parsing in places where single words or short phrases are known with relative certainty and work outward from there. It suggests that the correct final parse will be likely to contain these confident words, and that one can get to the correct final parse more quickly by building outwards from such islands rather than constructing the entire parse in a piecemeal fashion. Like islands of certainty, invocation control specifies where good places to work might be located. However, unlike islands of certainty, those places are not dcltermined by the certainty in the information already present there. They are determined by properties present in those areas because of other analysis, either symbolic properties (as in the PDA when the transcript contains a stop) or numerical ones (as in the hypothetical example above). The concept of islands of certainty suggests that places nearby might be preferable to work on. Our concept of invocation control makes no such statement. Thus, while both islands of certainty, and our concept of invocation control both involve controlling the places where analysis should take place, the former is an important 190 Chapter 6 idea for problems with a certain structure (e.g. parsing speech) and the latter is important as a means for combining different types of information in any problem. 6.4. Combining Results In a system that uses a parallel collection of modules to accomplish some task, some means must be provided for combining their results. The PDA uses this technique of multiple contributors for problems that involve both symbolic and numerical answers. Numerical Information A particularly good example of combining numerical results is the parallel processing Gold-Rabiner pitch detector. It uses six separate streams of extrema derived from the speech signal. Each is processed with a separate period detector to produce six sequences of period estimates, one estimate from each detector for each frame (100ms) of speech. Every 100 ms the "best" of the six most recent period estimates is chosen as the final result (if none are good enough, the frame is declared unvoiced). Here are several points about the G-R pitch detector's information combination method: Supporters are Equal In the G-R system, there are 36 potential supporters for each of the six period candidates. These supporters come from the period estimates of the current and two previous frames, and sums of pairs and sums of triples of these estimates. The rationale for using this set of supporters is unimportant for this discussion. The significant point is that each supporter is entitled to one vote in "choosing" the winner. Neither the origin of the supporter (e.g. whether it is a single period estimate, a sum, or a triple) nor any other information about the supporter (e.g. its reliability) is used to weight the votes. Fixed Tolerance (prior assumption) The tolerance for supporters in G-R is fixed. This constitutes a prior assumption about the statistics of the supporters (if a candidate period is 191 Chapter 6 correct, the assumption is that the supporters will be within a specified percentage of the candidate period). The G-R algorithm does not react to the posterior statistics of the supporters. Hard Decision Boundaries The confidence added by the vote of a single supporter is an all or nothing thing. It is either within the tolerance or not, any other information about proximity is discarded. The PDA method for combining numerical information was not intended to be a replacement for the one used in G-R. It was intended as a model for how such information could be combined in any system. However, since G-R is an exaniple of how that has been accomplished in the past, it is illuminating to compare the two. The PDA method of combining numerical information employs the explicit probability densities associated with numerical assertions and uses the algebra of probability to combine them. Support is Weighted Each contributor to FINAL-PITCH supplies a gaussian density estimate with odds (effectively false alarm probability). Thus the contributions are weighted by their confidence, which is in turn determined from the quality of their source information. No Fixed Tolerance or Prior Assumption Each contributor establishes the treatment of their contribution by the statistics (mean and variance) they declare. There is no prior assumption in the combining method about the statistics of the supporters. In each run ol the system, the statistics are redetermined. No Hard Decision The contribution of support is not an all or nothing thing. The closer a supporting assertion is to a candidate pitch, the more support it gives. The ability to employ explicit statistical information makes this new approach to combining numerical information well suited to systems that must accept a wide variety of sources of numerical information, since each source can specify statistics that are appropriate for its own results (as is the case of the phonetically derived and 192 Chapter 6 numerically derived fO assertions in the PDA). Because it specifies the statistics of its results (i.e. the probability density for the numerical value in question), this method should be very useful in situations where the results must be further combined by the system. Also, a system that supplies such statistics in its output can be more helpful to a user than one that does not. Symbolic Information The approach used for combining symbolic values in the PDA corresponds closely with the one used by the PROSPECTOR system[561. Both the PDA and PROSPECTOR express a probabilistic confidence in their assertions. Both use a combination rule derived from Bayesian probability theory, and both propagate confidence through a network connecting the assertions, so changes in confidence are immediately reflected throughout the network. The distinctions between them are as follows: Dimensionality of an Assertion PROSPECTOR assertions corresponded to scalars. An assertion like "rocktype-igneous" had a scalar confidence associated with it. In the PDA, many assertions are distributed in time and therefore correspond to vectors. An assertion such as "voicing=voiced" has a vector confidence which specifics the confidence independently for each sample covered by the assertion. This distinction allows the PDA to have a modest number of assertions (in the hundreds) while responding to the information available with precision down to the sample level. This preserves fast rule system operation without sacrificing responsiveness to small events. The programming convenience and processing power of the KBSP package was crucial to this part of the implementation. Absolute versus Relative Odds PROSPECTOR assigns each of its assertions an absolute odds (equivalent to a probability of truth) and defines support from one assertion to another with two parameters (as discussed in chapter 4). The PDA assigns each assertion an odds-factor (the ratio of the current odds to the prior odds) and defines the support from one assertion to another in terms of probability of detection and probability of false alarm. 193 Chapter 6 As was discussed in chapter 4, these changes eliminate the piecewise linear interpolation of probability used in PROSPECTOR, and thereby improve computational efficiency. Two other possible advantages are that the elimination of interpolation improves performance, and that rule writers familiar with concepts such as probability of detection might find it easier to express themselves using the PDA approach. However, these points are both conjectural. While odds-factors worked well on this problem, it is too early to know if this more "theoretically correct" procedure is in fact a practical improvement over PROSPECTOR's approach. 6.5. Alternate Values There are occasions when contradictory data arise during processing. One way to deal with this situation is to immediately discard all but one of the conflicting results. Another is to keep around the alternatives and delay the choice (assuming it will subsequently become clear which result to chose). For symbolic values such as "voiced" and "unvoiced", the contradictory data are competing assertions. The PDA allows contradictory assertions to coexist. All the support from such assertions is tallied via the unique VOICED assertion. The final voicing decision is made from its odds-factor. This is a fairly conventional approach which is often employed when assertions have explicitly recorded confidence. For some numerical problems, there can be multiple values which do not represent conflict. This is the case with epochs in the PDA. Epochs are merged when they are symbolically compatible (they are not the start and end of the same assertion) and they are numerically compatible (their statistics admit to the possibility that they both pertain to the same underlying event). When epochs do not merge, they are simply kept separate, in effect implying that an additional underlying event has been detected. 194 Chapter 6 For other numerical problems, conflict is an issue. In the case of fO estimates a single number is desired. In the PDA we see a new possibility for dealing with conflicting values. The fO probability densities of the assertion FINAL-PITCH combine the contributions of all fO assertions into a single representation. The competing assertions are neither kept separate, nor are any discarded. In this way the multiple alternatives are all coexisting, but not as individuals. In effect the combined probability density represents a myriad of assertions about the potential value of fO, where the density at each choice of fO represents the likelihood of that point being correct. The algebra of probability allows the system to manipulate this set of assertions as a group. With this approach, it is necessary to use a sampled version of the fO probability density because the combined density no longer conforms to any simple mathematical model. A single, sampled probability density for a numerical quantity takes the place of a set of alternative values for a symbolic quantity each with a specified confidence. 6.6. Mapping Numerical Values to Confidence An important aspect of numerical/symbolic interaction in the PDA is the conversion of numerical value information into symbolic confidence information. Three cases of this are: * * The confidence of phonetic-frication assertions and phonetic-aspiration assertions (produced by rules that respond to stop environments) depends on the match between the duration of the marked stop and the typical duration of stops of that type. The more the marked stop duration differs from expectation, the lower the confidence in the phonetic voicing assertions. The confidence in numerical voicing assertions depends on the measured power within their domain. For example, a numerical-voiced assertion is created when low-frequency power is substantially above the background noise level. The greater that power, the more confident the numerical195 14 Chapter 6 voiced assertion. The confidence in VOICED depends (in part) on the similarity (not the periodicity) of the waveform. If the waveform is substantially similar at some lag (from 2 to 20ms) then the confidence of VOICED is enhanced. The above are examples where numerical values influence symbolic confidence. There is also one example where symbolic confidence converts to numerical value. That case is in the fO decision made for generating a pitch track. A variation in the confidence of a supporting phonetic fO assertion could c'ause a change in the position of the highest peak in the fO probability density of FINAL-PITCH and thereby affect the final pitch choice. All these examples are predicated on symbolic assertions having a numerical side (namely their confidence). The numerical value of some assertions is influencing the value of the symbolic confidence which is itself numerical. Despite the fact that it is implemented numerically, this is a case where a numerically valued assertion is having an impact on a symbolic one, and it is likely to be a common phenomenon in any KBSP system in which symbolic values are uncertain. 6.7. Use of Explicit Statistics for Assertions The association of a confidence with each symbolic assertion has become a commonplace feature of symbolic processing systems. It provides a way for the system to process uncertainty as it processes values. This allows the system to model uncertain analyses accurately and to express this uncertainty to the operator. It can also serve system modularity by keeping together a value and the uncertainty about the value, rather than having the uncertainty explicitly or implicitly built into code that uses the value. 196 Chapter 6 In the PDA, this idea is extended in two ways. First, the concept of explicitly associating a confidence with a data value is extended to numerical values; and second, the association of confidence with symbolic values is applied to distributed symbolic values in the form of a confidence sequence. It is not uncommon in signal processing for the published results of an experiment to show confidence intervals on the values. Also, the notion of the statistics of an estimate is pervasive in the mathematics of signal processing. However, in our experience with signal processing systems, such statistical information is not carried along and processed with signal values, nor is it present in the outputs of signal processing systems (except when the primary purpose of the system is to compute the statistics and not the estimate). In the PDA, the extension to numerical values is embodied in the statistics used for epochs and fO estimates. While the details of these two applications differ, fundamentally they both demonstrate that explicit statistics associated with numerical values is a practical alternative to having the statistics of numerical values implicitly or explicitly encoded in the procedures that operate on those values (as in the G-R voting matrix). The expression of time varying confidence for distributed symbolic assertions with sequences is an idea that is very helpful when the problem extends substantially along some dimension (e.g. time, frequency, space) and there are abstract entities that cover substantial intervals (e.g. phonemes, picture patches). By using a confidence sequence, one can avoid the necessity of attributing equal likelihood to an assertion over its entire domain. 197 __ Chapter 6 6.8. Rule Conditions: Match versus Function In contemporary rule based inference systems, there are two different approaches to determining the applicability of rules: matching and functional testing. Both are means to determine if a given set of circumstances warrant the application of a particular rule. In addition, the use of variables in matching provides a means of binding variables for use in other parts of the condition and in the actions of the rule. Matching is a process that involves comparing a composite object representing a rule condition (a "pattern") with a similar object representing an assertion (a "text"). The pattern consists of literals and wildcards with variables to be bound to the wildcard components. The text is purely literals (while some systems use wildcard assertions for quantification, that issue is not relevant to this discussion). A typical rule will have several such patterns. The rule is activated when there is a set of assertions such that the text literals match the pattern literals term for term, and all pattern wildcards with the same variable match identical text literals. For example, if wildcards in patterns and their associated variables were written with a leading hyphen, then figure 6.1 would be a patterns (phoneme -x end -e1) (phoneme -y start -e2) (simple-epoch - composite -e3) (simple-epoch -e2 composite -e3) texts (phoneme phoneme-mork-23 start simple-epoch-11 end simple-epoch-17) (phoneme phoneme-mark-29 start simple-epoch-21 end simple-epoch-27) (simple-epoch simple-epoch-17 composite composite-epoch-6) (simple-epoch simple-epoch-21 composite composite-epoch-6) Figure 6.1 Conditions done with Matching 198 Chapter 6 condition that looked for consecutive phoneme marks (and a set of matching assertions) 2 . The distinction between functional tests and matching is that a specific procedure is invoked to test for satisfaction of a part of the condition, rather than a general procedure to compare a pattern with the text of the assertion. There are two examples of functional tests that can be compared to figure 6.1. The first (and most straightforward) is shown in figure 6.2a. One significant difference in using functions is that one no longer gets binding as an automatic side effect (by matching wildcards). The "type" clauses provide the replacement for wildcard binding. They cause all assertions of the given type (i.e. tests (type x phoneme-mork) (type y phoneme-mark) (equal y (right-neighbor x)) Phoneme-mark-23 Phoneme-mork-29 I Figure 6.2a Conditions done with Functions #1 2 As was discussed in chapter 4, the actual implementation of linkage between assertions in the PDA is with the simple-epochs that actually bound the assertion being connected to a common compositeepoch. Thus the need for the final two patterns to "extract" the composite for comparison. 199 Chapter 6 phoneme-mark) to be tried as a choice for the specified variable (e.g. x). The sole test is (equal y (right -neighbor x )). Here we see one of the significant advantages of using functional tests. There is no longer any mention of the internal representation of phoneme-marks (i.e. epochs). Instead, we are able to employ the abstract function (right -neighbor ... ) to extract the necessary information. In this way one can separate the expression in the rule condition from the structure of the assertion. There is another somewhat subtler distinction between functional tests and matching that can best be shown by another version of the rule (see figure 6.2b). In this example there are no tests whatever. Instead, the value for the variable y is directly extracted from the network structure in which x is embedded. When functions are used in rule conditions, information can be extracted at any depth from tests (type x phoneme-mark) (let y (right-neighbor x)) Phoneme-mr k-23 Phoneme-mor k-29 Figure 6.2b Conditions done with Functions #2 200 Chapter 6 underlying data structures. With matching, a single pattern can extract information that is directly present in a single assertion (as the simple epoch values -el and -e2 are extracted by the first two patterns in figure 6.1). However, it requires additional patterns if the object of interest (e.g. the composite-epoch) is only indirectly related. These additional patterns come at a price (the cost of comparing them against all the assertions that they do not match). By using a function to extract the desired information one can avoid this wasted computation. The advantage or disadvantage of match may depend on whether the data structure of the assertion texts is usable as the means of storing information in the system. When assertion texts are themselves the means of storing information (e.g. as in OPS5), when there is no underlying (hidden) linkage to be exploited, and when isolating the structure of assertion texts from rules is deemed unimportant, then match can be an efficient and easily implemented approach to rule condition evaluation. In the PDA, the existence of complex hidden data relationships, the cost of creating assertion texts, the extra effort of implementing code for matching, and the desire to isolate the rule conditions from data structure all led to the decision to avoid using match in rule conditions. Another case for the use of match is that other parts of the system can easily analyze the rule patterns for purposes such as the focusing of attention on particular rules during certain types of processing. It can be extremely difficult to automatically examine the (potentially arbitrary) expressions that may result from an unconstrained functional approach to rule conditions. We did not face this issue in the PDA since we did not employ such automatic analysis of the rules for control purposes. However, it seems that the crux of this issue may be in constraining the syntax of the rule and not 201 - Chapter 6 in the choice of a match or functional tests as a way of evaluating conditions. Using functional tests for condition evaluation can be more costly than match. The dramatic efficiencies of the discrimination net (as used by systems such as YAPS and OPS-5) have not been achieved for functional tests. While some efficiency is possible (see chapter 4) and the use of underlying data structure may have computational advantages, there is still substantial cost in the use of functions for rule conditions, and this can be compounded by the added programming complexity. The following point may be the most important with respect to KBSP. Match is an appropriate mechanism for determining equality. If the values of objects require complicated numerical tests to determine the applicability of a rule, then functions become a necessity. While it may be possible to use match with quantitative symbols for some numerical applications (as discussed previously), a programming environment for numerical applications should support the testing of numerical values in rules. 6.9. The Use of Dependency The PDA demonstrates two ways in which dependency can be used in a KBSP problem: for propagating changes in statistics and for keeping the system consistent. The PDA method of statistical propagation is based on the one used by the PROSPECTOR system. In that system, the assertions were all interconnected through a network that propagated a change in the odds of any assertion to all supported assertions. In PROSPECTOR, the assertions not distributed in time or space, and the statistics were just the confidence of the assertion. In the PDA this same idea is used for atomic symbolic valued assertions (as in PROSPECTOR), for distributed symbolic valued assertions (e.g. voicing), and for numerically valued assertions (e.g. f0). This ability 202 Chapter 6 to keep statistical information up to date in the face of change is very helpful in an interactive system (e.g. an assistant) where the operator may volunteer information. It is also useful in cases where information is not supplied in a known order. Contemporary rule systems do not retract information that was derived from rules that are no longer valid, or retract information based on object configurations that no longer exist. By using the ideas behind TMS[55], we implemented a history independent rule system. With such a rule system, rules and data can be entered and retracted in any order. The final results depend only on the rules and assertions which remain. While this may offer some useful potential for self modification on the part of the system, that mode of behavior was not used in the PDA. For the PDA, the biggest advantages of history independence were in the convenience of ignoring the ordering of rule and data addition, and in the ability to make (fast) incremental modification for test purposes without the need to reinitialize the system. 6.10. Comments This chapter has made several suggestions about systems that combine symbrlc and numerical knowledge. * Mapping numeric values to symbolic values before processing is inadvisable if the mapping might not be appropriate for all applications. * Symbolic and numerical information can be combined directly as equal contributors, one can verify the other, one can be used as advice for processing of the other type, and one can guide the places where processing of the other type is applied. Examples of all these types of interaction are present in the PDA. * Information from symbolic and numeric sources can be combined either as numeric values (fO) or symbolic ones (voicing). We have presented effective new computational models for both of these situations. Information about the boundaries of symbolic assertions can be effectively processed with a system that uses both symbolic and numerical criteria for linking the information. Since such assertions can be derived from both numerical and symbolic sources, this too represents a means of combining * 203 Chapter 6 * * * such information. We have demonstrated the effectiveness of such a system. The explicit representation of uncertainty makes a system more modular, allows the contribution of each assertion to be based on its quality, and allows the system to express its confidence to the user. This theme is evident in the Epoch system, and the Knowledge Manager, and the outputs of the PDA. We have employed new uncertainty representations for symbolic assertions (the odds-factor), distributed symbolic assertions (sequential confidence), and numerical assertions (the fO density). When using numerically valued assertions, a rule system with functional tests is likely to be necessary. Such a rule system can also contribute to system modularity (through abstract rule conditions), and system efficiency (through the access of information in a network of assertions). We have implemented such a rule system with a novel means of efficiently evaluating functional conditions. In any rule based system, the ability to insert the rules and data in any order simplifies the behavior of the system and the task of operating it; the ability to add and remove rules and data facilitates development. Using the ideas behind truth maintenance systems, together with the idea of change propagation, we have developed a history independent rule system that allows both of these types of user interaction. Our original goal was to learn more about how to build systems for problems that involved both symbolic and numerical knowledge. We have learned a great deal about ways to go about this task. The previous points are general statements about the task, and the earlier chapters have presented specific examples of how one such system was built. The PDA demonstrates that extensive interaction between symbolic and numerical knowledge is practical, and that it appears to be effective. In the last chapter we will discuss some ideas for future work on this area. 204 $ of 4 CHAPTER 7 Conclusions 7.1. Thesis Contributions This thesis makes several significant contributions to Knowledge-Based Signal Processing. The specific points were outlined in the preceding chapter, but in general they concern methods of combining symbolic and numerical problem knowledge in systems. In addition to these contributions, there are specific contributions to signal processing, pitch detection and expert systems. Signal Processing This thesis defines the signal processing operation we call the Normalized Local Autocorrelation (NLA), an application of the idea behind the correlation coefficient to a single time sequence for the purpose of locating similarities. We investigated the effects of the window in this function and determined that the window should be chosen as the square root of a suitable band-limiting window (such as Hamming window). We also gave an efficient FFT based structure for computing the NLA, and fnally, we pointed out that the NLA has the feature of being insensitive to exponential taper (a property that is very important in determining the period of signals with a time-varying envelope such as speech). The signal representation developed for this thesis is the first to use an unbounded domain, together with random access of the signal values. Since one of the primary goals of such a representation is to capture the broadest possible class of Chapter 7 Chapter 7 signals, the ability to represent signals with infinite domains that are both periodic and aperiodic is a significant advance. Another goal is the ability to express signal processing knowledge in the most straightforward fashion. Since this signal representation allows the convenient expression of concepts such as zero-phase and linear-phase, and since it can preserve the offset of signal sections that are involved in such operations as overlap-add convolution, it enhances our ability to program signals. Pitch Detection Our contribution to pitch detection takes two forms. First, we compiled a list of the knowledge that bears on the pitch detection problem. While Hess has compiled a much more exhaustive treatment of the signal processing part of this knowledge[29], we were unable to find a convenient reference for the wealth of speech knowledge. Chapter 2 serves as a compact source of references about the knowledge in this problem. The other contribution made to pitch detection is in the theme of the PDA. Pitch detectors have always had a strong numerical flavor, and have not made substantial use of the knowledge about pitch that stems from speech research. While the PDA is not appropriate for many conventional pitch detection problems (transcripts are not always available), it does demonstrate that the information present in speech (about stress, phonetic identity, etc.) can be useful for pitch detection. This suggests that systems which are capable of deriving such information from the waveform would be able to make better use of the knowledge unique to speech. 206 Chapter 7 Expert Systems One of the points mentioned in the preceding chapter has implications that go beyond systems for symbolic/numerical problems. The feature of history independence in the rule system was profoundly useful in the development of the PDA system and the insertion of the speech knowledge that it used. That feature would be of value to any expert system project, whether numerical information was involved or not. Another idea used in the PDA that is important for expert systems is the use of functions in rule conditions to extract information .:om a network of assertions. By not performing the implicit iteration that matching and unification require, this technique can save 90% or more of the work involved in determining neighbors and other information that is not explicitly present in any one assertion. This idea would apply to any problem which can be viewed as a large number of related entities, each of which must be interpreted based on its own properties and those of its neighbors. '72. Future Work In the PDA there are networks for maintaining support and for maintaining logical dependency. In building the system we had to avoid making loops in these net- works, or the propagation of information around them would never cease. It is not' hard to envision a case where loops would make sense. For example, if a numerical parameter was estimated approximately and that approximation was used to refine the estimate of the parameter than a loop could result. Our method for propagating support and dependency information will not work with such loops, but there may be alternative propagation methods that would work. 207 Chapter 7 In combining numerical assertions, the PDA started with assertions represented in parametric form (mean, deviation, and confidence) and combined them using a uniformly sampled sequence to represent the resulting density. Two areas for future work are: * * What is a good method for combining numerical assertions which are expressed in the form of arbitrary sequences rather than gaussian parameters? Such a method would be necessary if the results of the initial combination were to be used in a second stage of processing. What are alternative representations for such probability densities from which the true density can be determined? Uniformly sampled sequences can lose contributions that involve very narrow peaks. Some of the knowledge in the PDA was compiled into the numerical pitch detector. While rules served as a convenient modular way of expressing knowledge, they were inefficient and awkward for programming the numerical pitch detector. Is there a way to make the rule system framework better for expressing such knowledge, or is that type of knowledge inherently difficult to express with rules? 208 -1 Bibliography [1] L. Erman, R. Hayes-Roth, V. R. Lesser, and D. R. Reddy, "The Hearsay-II Speech Understanding System: Integrating Knowledge to Resolve Uncertainty," Computing Surveys, vol. 12, pp. 213-254, Jun. 1980. [21 H. P. Nii, E. A. Feigenbaum, J. J. Anton, and A. J. Rockmore, "Signal-to-Symbol Transformation: HASP/SIAP Case Study," Al Magazine, vol. 3, pp. 23-35, Spring 1982. [31 B. Gold and L. R. Rabiner, "Parallel Processing Techniques for Estimating Pitch Periods of Speech in the Time Domain," JASA, vol. 46, pp. 442-448, Aug. 1969. [4] J. J. Ohala, "The physiology of tone," in Consonant Types and Tone, USC Occasional Papers in Linguistics, July 1973, pp. 3-14. [5] J. L. Flanagan, C. H. Coker, L. R. Rabiner, R. W. Schafer, and N. Umeda, "Synthetic voices for computers," IFFFEEE Spectrumn, vol. 7, no. 10, pp. 22-45, Oct. 1970. [61 L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals. Englewood Cliffs, NJ: Prentice-Hall, 1978. [7] A. E. Rosenberg, "Effect of glottal pulse shape on the quality of natural vowels," JASA, vol. 49, no. 2, pp. 583-590, Feb. 1971. [81 J. Makhoul, R. Viswanathan, R. Schwartz, and A. W. F. Huggins, "A MixedSource Model for Speech Compression and Synthesis," JASA, vol. 64, no. 6, pp. 1577-1581, Dec. 1978. [9] G. E. Peterson and H. L. Barney, "Control Methods Used in a Study of Vowels," JASA, vol. 24, no. 2, pp. 175-184, Mar. 1952. [10] D. E. Veeneman and S. L. BeMent, "Automatic glottal inverse filtering from speech and electroglottographic signals ," IEEE Trans. ASSP, vol. 33, no. 2, pp. 369-376, Apr. 1985. [111 D.J. Wong, J.D. Markel, and A.H. Gray, "Least squares glottal inverse filtering from the acoustic speech waveform," IEEE-ASSP, vol. 27, pp. 350-355, Aug. 1979. [12] M. R. Matausek and V. S. Batalov, "A new approach to the determination of the glottal waveform," IEEE Trans. ASSP, vol. 28, no. 6, pp. 616-622, Dec. 1980. Bibliography Bibliography [13] W. A. Lea, "Segmental and Suprasegmental Influences on Fundamental Frequency Contours," in Consonant Types and Tone, USC Occasional Papers in Linguistics, July 1973, pp. 17-69. [14] I. Lehiste and G. E. Peterson, "Some Basic Consideration in the Analysis of Intonation," JASA, vol. 33, no. 4, pp. 419-425, 1961. [151 A. S. House and G. Fairbanks, "The influence of consonant environment upon the secondary acoustical characteristics of vowel," JASA, vol. 25, no. 1, pp. 105-113, Jan. 1953. [16] H. Fujisaki, "Dynamic Characteristics of Voice Fundamental Frequency in Speech and Singing," Fourth F.A.S.E. Symposium, 1981. [17] H. Fujisaki, "Modeling the dynamic characteristics of voice fundamental frequency with applications to analysis and synthesis of intonation," XIII International Congress of Linguists, pp. 57-69, 1982. [181 V. W. Zue, "Acoustic Characteristics of Stop Consonants: A Controlled Study," Tech. Rpt., vol. 523, Lincoln Lab, 1976. [191 D. H. Klatt, "Linguistic uses of segmental duration in English: Acoustic and perceptual evidence," JASA, vol. 59, no. 5, pp. 1208-1221, May 1976. [201 D. H. Klatt, "Vowel lengthening is syntactically determined in a connected discourse," Journal of Phonetics, vol. 3, pp. 129-140, 1975. [21] D. O'Shaughnessy, "Modelling Fundamental Frequency, and its Relationship to Syntax Semantics and Phonetics," PhD, MIT, 1976. [221 W. E. Cooper and J. M. Sorenson, "Fundamental Frequency Contours at Syntactic Boundaries," JASA, vol. 62, no. 3, Sept. 1977. [231 J. B. Pierrehumbert, "The Phonology and Phonetics of English Intonation," PhD, MIT, 1980. [24] D. P. Huttenlocher and V. W. Zue, "A model of lexical access from partial phonetic information," Proc. ICASSP, Mar. 1984. [25] M. Liberman and J. B. Pierrehumbert, "Intonational Invariance Under Changes of Pitch Range and Length," BLTJ, 1982. [26] Y. Horii, "Fundamental frequency perturbation observed in sustained phonation," Journalof Speech and Hearing Research, vol. 22, pp. 5-19, Mar. 1979. [27] Y. Horii, "Vocal shimmer in sustained phonation," Journal of Speech and Hearing Research, vol. 21, no. 1, pp. 202-209, 1980. [28] L. Dolansky and P. Tjerlund, "On Certain Irregularities of Voiced-Speech Waveforms," IEEE-AU, vol. AU-16, no. 1, March 1968. 210 Bibliography [291 W. Hess, Pitch determination in speech signals. Berlin: Springer-Verlag, 1983. [30] J. L. Flanagan, Speech Analysis Synthesis and Perception. New York, NY: Springer-Verlag, 1972. [311 J. J. Dubnowski, R. W. Schafer, and L. R. Rabiner, "Real-Time Digital Hardware Pitch Detector," IEEE Trans. ASSP, vol. 24, no. 1, pp. 2-8, Feb. 1976. [321 M. M. Sondhi, "New Methods of Pitch Extraction," IFEE Trans. AU, vol. AU-16, no. 2, pp. 262-266, June 1968. [33] J. D. Markel, "The SIFT algorithm for fundamental frequency estimation," IEE Trans. Audio and Electroacoustics, vol. 20, no. 5, pp. 367-377, Dec. 1972. [34] L. R. Rabiner, M. R. Sambur, and C. E. Schmidt, "Applications of a nonlinear smoothing algorithm to speech processing," IEEE Trans. ASSP, vol. 23, no. 6, pp. 552-557, Dec. 1975. [351 M. J. Ross, H. L. Shaffer, A. Cohen, R. Freudberg, and H. J. Manley, 'Average magnitude difference function pitch extractor," IEEE Trans. ASSP, vol. 22, no. 5, pp. 353-362, Oct. 1974. [36] N. J. Miller, "Pitch detection by data reduction," IEEE Trans. ASSP, vol. 23, no. 1, pp. 72-79, Feb. 1975. [37] S. Seneff, "Real-Time Harmonic Pitch Detector," IFFFTrans. ASSP, vol. 26, no. 4, pp. 358-365, Aug. 1978. [381 J. A. Moorer, "The optimum comb method of pitch period analysis of continuous digitized speech," IEEE Trans. ASSP, vol. 22, no. 5, pp. 330-338, Oct. 1974. [391 R. J. Sluyter, H. J. Kotmans, and T. Claasen, "Improvements of the harmonicsieve pitch extraction scheme and an appropriate method for voice-unvoiced detection," Proc. ICASSP, pp. 188-191, 1982. [401 H. Duifuis, L. F. Williams, and R. J. Sluyter, "Measurement of pitch in speech: an implementation of Goldstein's theory of pitch perception," J.ISA, vol. 71, no. 9, pp. 1568-1580, June 1982. [411 T. V. Sreenivas and P. V. S. Rao, "Pitch extraction from corrupted harmonics of the power spectrum," JASA, vol. 65, no. 1, pp. 223-227, Jan. 1979. [421 M. Piszczalski and B. A. Galler, "Predicting musical pitch from component frequency ratios," JASA, vol. 66, no. 3, pp. 710-720, Sept. 1979. [43] M. R. Schroeder, "Period Histogram and Product Spectrum: New Methods for Fundamental-Frequency Measurement," JASA, vol. 43, no. 4, pp. 829-833, Jan. 1968. [44] A. M. Noll, "Pitch determination of human speech by the harmonic product spectrum, the harmonic sum spectrum, and a maximum likelihood estimate," Proc. Symposium on Computer Processing in Comm., pp. 779-797, April 8-10, 1969. 211 _I _ Bibliography [45] A. M. Noll, "Cepstrum Pitch Determination," JASA, vol. 41, no. 2, pp. 293-309, Aug. 1966. [46] D. Friedman, "Pseudo-Maximum-Likelihood Speech Pitch Extraction," IEEE Trans. ASSP, vol. 25, no. 3, pp. 213-221, June 1977. [47] J. D. Wise, J. R. Caprio, and T. W. Parks, "Maximum Likelihood Pitch Estimation," IEE Trans. ASSP, vol. 24, no. 5, pp. 418-423, Oct. 1976. [48] J. N. Maksym, "Real-time pitch extraction by adaptive prediction of the speech waveform," IEE Trans. Audio and Electroacoustics, vol. 21, no. 3, pp. 149-154, June 1973. [49] J. D. Markel, "Application of a digital inverse filter for automatic formant and fO analysis," IEEE Trans. Audio and Electroacoustics, vol. 21, no. 3, pp. 154-160, June 1973. [50] D. T. L. Lee, "A novel innovations based time-domain pitch detector," Proc. ICASSP, pp. 40-44, 1980. [51] B. S. Atal and L. R. Rabiner, "A pattern recognition approach to voicedunvoiced-silence classification with applications to speech recognition," IEEE Trans. ASSP, vol. 24, no. 3, pp. 201-212, June 1976. [52] V.V.S. Sarma, "Studies in pattern recognition approach to voiced-unvoicedsilence classification," Proc. ICASSP, pp. 14, 1978. [53] J. F. Allen, "Maintaining knowledge about temporal intervals," Comm. of the ACM, vol. 26, no. 11, pp. 832-843, Nov. 1983. [54] W. J. Long, "Reasoning about state from causation and time in a medical domain," Proc. AAA, pp. 251-254, 1983. [55] D. A. McAllester, "A three valued truth maintenance system," A.I. Memo 473, MIT Al Lab, Cambridge, MA., May 1978. [56] R. O. Duda, P. E. Hart, N. J. Nillson, R. Reboh, J. Slocum, and G. L. Sutherland, "Development of a computer-based consultation for mineral exploration," Annual Report SRI Projects 5821 and 6415, SRI Intl., 1977. [57] G. E. Kopec, "The representation of discrete-time signals and systems in programs," PhD, MIT, 1980. [58] C. Myers, "Numeric and Symbolic Representation and Manipulation of Signals," PhD, MIT, Cambridge, MA., 1986. [59] G. E. Kopec, "The signal representation language SRL," Proc. ICASSP, 1983. [60] R. O. Duda and P. E. Hart, Pattern Classificationand Scene Analysis. New York, NY: Wiley-Interscience, 1973. [61] E. Allen, " "YAPS: Yet another production system"," Proc. AAA, pp. 5-7, 1983. 212 Bibliography [62] C. L. Forgy and J. McDermott, "OPS, A domain-independent production system language," Proc. Joint Conference on Artificial Intelligence, pp. 933-939, 1977. [63] C. V. Kimball and T. L. Marzetta, "Semblance processing of borehole acoustic array data," Geophysics, vol. 49, no. 3, pp. 274-281, Mar. 1984. [64] D. Griffin, "A New Speech Model and its Use in Speech Coding," PhD, MIT, Cambridge, MA., 1986. [65] H. L. Van Trees, Detection, Estimation, and Modulation Theory, Part III. New York: John Wiley & Sons, 1971. [661 C. A. McGonegal, L. R. Rabiner, and A. E. Rosenberg, "A Semiautomatic Pitch Detector (SAPD)," IEEE Trans. ASSP, vol. 23, pp. 570-574, Dec. 1975. [67] A. M. Nazif and M. D. Levine, "Low level image segmentation: an expert system," IEEE Trans. PAMI, vol. 6, no. 5, pp. 555-577, Sep. 1984. 213 -4 40 I I UNCLASSIFIED SCURIITY CuLASSflCAO0ON Of THIS PAG REPORT DOCUMENTATION PAGE 1& RPOART SECURITY CASSlIFCATION 1IL RESTRIrCTrV 2 3. 01STRIUTIONIAVAIABILTY O SIECURITY CASIFICATION AUTWORITY MARKINGS RO RT Approved for public release; 2b. 01CL.AUIFICATION/OOWI4NRAOIN4 4. PRFORIMING ORG"NIZAT10 ON IPORT NUMEUlRi}) NAM1 . M4NTORING ORGANIZATION1RPlOT NIUMSR(S) OF PIRFORMING ORGAQNZATION 0OFFICE I SYMI. 7a. NAM OF MONITORING Research Laboratory of Elect rolterm, CM Z (Cft7. SeM 7b. AoIaES O (CUl. AMil s JUNOINS2POIORNING UF Oli"IZATION :I OFIICE p SYMO.L /if adIdesb 0ROCUREMENT INSTRUMSNT IOtNTIlICATIOlN NUMOII I. N00014-81-K-074Z M Mad ZIP Cie AOOREW Cty. 1Q. SOURCE OF FUNOING NO_. 1400 Wilson Boulevard PROGRAM *rOjCTr Arlington. Virginia Z2217 1l. TITUS 'Iftel aft tW OS ZIP COm" 800 North Quincy Street Arlington, Virginia 22 17 Advanced Research Projects fAgency Is N Mathematical and Information Scien. Div. Ca¢tl 77 Massachusetts Avenue Cambridge, MA 02139 L XRGANIZA^T Office of Naval Research Massachusetts Institute of Te hnology Ek AOORS ILEMNT NO. TASK WORK UNIT rN. NO. No. CN*zlv-*=049 an - 506 Knowledge-Based Pitch Detection Webster 12 ,,,,IAIUJTWR r I 00 AorRT 0 rod Technical I. distribution unlimited SCnxDUI. P. Dove IYr. =Vaasa POM _ orGP 14'. cart AGVERT o June M.. IL 3 UNT 1986 SUPPLtMiNTAmY NOTATION I7. COSATI CoolS c3 O _IE UP 13. AllTRACT Crnmew lE. SUIJeCT TERMS C¢on.an, ww!w t' .eese ed ,,,eft by d"eM ,utMir SUL. o . aO wn if e Mnee d iMey at &e* oi* sM6r) Many problems in signal processing involve a mixture of numerical and symbolic knowledge. Examples of problems of this sort include the recognition of speech and the analysis of images. This tesis focuses on the problem of employing a mixture of symbolic and numerical knowledge within a ingle system, through the development of a system directed at a modified pitch detection problem. For this thesis, the conventional pitch detection problem was modified by providing a phonetic transcript and sex/age information as input to the system, in addition to the acoustic waveform. The Pitch Detector's Assistant (PDA) system that was developed is an interactive facility for evaluating ways of approaching this problem. The PDA system allows the user to interrupt processing at any point. change either input data, derived data, or problem knowledge and continue execution. 2a. OISTPRI UTIONIAVAILAILITY OF ASTC UtNCLASSIfID/UNLIMITO 22. NAME OfP ESPONSI1 TRACT SAME AS RPT. = R SCURTY Cl CL^SSIlICATION tUnclassified OTIC USRas INOIVIOUAL 22b. TLEPON Kyra M. Hall T WLJ I-Aim Cnfrne. 141.3, w A Rnr+ (617 lOI' Am ON OF 1 AN 7 IS OSSOLcE ) NUMIIR 22. OPICE SYMIOL C 253-2569 E. S$CURIT'Y CLASSIFICATION O2f TIS __II_ _ ;Ga I~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~t SlCUITY CLA.2lPIA'lhM Me i 'rus _~~~ *~ - w Oer _ w I This system uses a representation for signals that has infinite domains and facilitates the representation of concepts such as zero-phase and causality. Probabilistic representations for the uncertainty of symbolic and numerical assertions are derived. Efficient procedures for combining such assertions are also developed. The Normalized Local Autocorrelation waveform similarity measure is defined, and an efficient FFT based implementation is presented. The insensitivity of this measure to exponential growth and decay is discussed and its significance for speech analysis. The concept of a history independent rule system is presented. The implementation of that concept in the PDA and its significance are described. A pilot experiment is performed that compares the performance of the PDA to the Gold-Rabiner pitch detector. This experiment indicates that the PDA produces 1/2 as many voicing errors as the Gold-Rabiner program across all tested signal-to-noise ratios. It is demonstrated that the phonetic transcript and sex/age information are significant contributors to this performance improvement. I 1SCUI'rJY CLASSIFICArION OF t'rS PAGE DISTRIBU TION LIST DODAAD Code Director Defense Advanced Research Project Agency 1400 Wilson Boulevard Arlington, Virginia 22209 Attn: Program Management HX1241 (1) Head Mathematical Sciences Division Office of Naval Research 800 North Quincy Street Arlington, Virginia 22217 N00014 (1) Administrative Contracting Officer E19-628 Massachusetts Institute of Technology N66017 (1) Director Naval Research Laboratory Attn: Code 2627 Washington, D.C. 20375 N00 17 3 (6) Defense Technical Information Center Bldg 5, Cameron Station Alexandria, Virginia 22314 S47031 (12) Cambridge, Massachusetts 02139 (1) Dr. Judith Daly DARPA / TTO 1400 Wilson Boulevard Arlington, Virginia 22209 _ I_ _ 4 * ___ __ __I _