MASSACHUSETTS INSTITUTE OF TECHNOLOGY
RESEARCH LABORATORY OF ELECTRONICS
CAMBRIDGE, MASSACHUSETTS

Technical Report 452                                          June 15, 1966

WORD-RECOGNITION COMPUTER PROGRAM

Bernard Gold

(Manuscript received April 14, 1966)

The Research Laboratory of Electronics is an interdepartmental laboratory in which faculty members and graduate students from numerous academic departments conduct research.

The research reported in this document was made possible in part by support extended the Massachusetts Institute of Technology, Research Laboratory of Electronics, by the JOINT SERVICES ELECTRONICS PROGRAMS (U.S. Army, U.S. Navy, and U.S. Air Force) under Contract No. DA36-039-AMC-03200(E); additional support was received from the National Science Foundation (Grant GK-835), the National Institutes of Health (Grant 5 PO1 MH-04737-06), and the National Aeronautics and Space Administration (Grant NsG-496).

Reproduction in whole or in part is permitted for any purpose of the United States Government. Qualified requesters may obtain copies of this report from DDC.

Abstract

A word-recognition computer program has been designed and tested for a vocabulary of 54 words and a population of 10 male speakers. The program performs the functions of segmentation, measurements on the segments, and decision making. Out of the 540 words, 74 were incorrectly classified by the program.

TABLE OF CONTENTS

I.    General Discussion
II.   Synopsis of Present Work
III.  Description of the Computer Facility
IV.   Finding the Word Boundaries
V.    Segmentation
VI.   Description of Measurements
VII.  Discussion of the Formant Measurements
VIII. Decision Making
IX.   Preliminary Results
X.    Future Plans
Appendix I. Word List
Acknowledgment

I. GENERAL DISCUSSION

Recent work has led to some results in automatic word recognition which are reasonably encouraging. The results of this work are recorded here. A good deal of the report sets forth ideas and opinions about the scope of automatic speech recognition, technological aims, and other aspects of this many-faceted problem.

In speaking of automatic speech recognition, we must quickly delineate the problem. To begin with, the general problem of conversing with a machine as if it were a human is of such vast scope and such great complexity that it cannot seriously be considered to be of immediate technological interest. On the other hand, greatly constrained versions of the problem are of interest, and first we shall enumerate and speculate about the consequences of these constraints.

A vital and far-reaching constraint is that of having the (speech-recognition) machine recognize only disconnected words, rather than connected speech. Thus, the speaker utters a single word; the machine analyzes the word and tries to recognize it by comparison against a stored vocabulary. The game may be altered to allow the machine to 'refuse to recognize' the word if it does not adequately match any of the words in the stored vocabulary. The ability to build a machine that recognizes words does not imply an ability to build a machine to recognize connected speech.
A word spoken in isolation can be very different from the same word spoken in a sentence, and the same word spoken in different sentences may again be greatly changed. How to segment speech into words is, to begin with, an unsolved problem. Thus, work on word recognition is not necessarily the correct path to work on speech recognition. Why not view, however, the problem of word recognition as an aim in itself? For technological applications, recognition of words is probably as useful as recognizing connected speech. Even if one wanted to impress a multiword thought on a machine, the speaker should have little trouble in pausing sufficiently long between words to make automatic word segmentation a not too difficult matter.

Accepting, then, the restriction that a reasonable goal in speech recognition is that of word recognition, we next inquire as to a feasible vocabulary size. In particular, we should like to be able to make some good guesses as to how complex the machine becomes as a function of vocabulary size. To make these guesses, however, requires information on how many speakers the machine is expected to respond to. We know, for example, that a digit recognizer responding only to 'his master's voice' is quite simple. If we want the machine to respond to a wide variety of speakers, even the digit recognizer becomes appreciably more complex, but the problem is still quite manageable. Extending the vocabulary to approximately 50 words for a variety of speakers seems to require an additional order of complexity, and it is to this step that the work described here is devoted.

Given the acoustic signal representing the input word, the machine must make measurements on the signal, the result being a set of numbers. If n measurements are performed, the set of n numbers derived can be thought of as a point in an n-dimensional measurement space. The position of the point 'defines' the word. The ability of the machine to recognize a word depends on (a) the compactness of that part of the space containing all versions of the same word uttered by different speakers, and (b) the adequate diffusion of the other words over the space. In our opinion, these attributes derive primarily from the exact nature of the measurements chosen. The argument has often been advanced that the type of measurement is not really too important, because by means of statistical models one can derive transformations that will generate 'good' measurements from the original 'poor' measurements. Although such arguments are valid for many classification problems, we think that successful word recognition depends on acoustic rather than statistical insights.

A spoken word is uniquely defined as a string of phonemes. The natural approach to word recognition would appear to be, first, to find the boundaries separating all of the phonemes and, second, to classify each phoneme. The first problem is called segmentation. We would like to make the point that recognition of words from vocabularies of 20 or more words (provided that a special monosyllabic vocabulary is not chosen) requires some sort of segmentation. Thus far, it seems to be true that rules for consistent segmentation of a word into phonemes are very difficult to invent. In Section V the segmentation rules that were used for the present work are given. There the reader will see to what extent our segmentation is phonemic and to what extent we have avoided some of the more difficult segmentation problems.
A few comments will now be made on how to judge the effectiveness of a word recognizer. For example, the word 'address' may be stressed on the second syllable by nine out of ten speakers. If the tenth speaker stresses the first syllable, should the machine be blamed for an error? Actually, a man listening to these 10 utterances might say that the speaker had, in some sense, made an error. The clear distinction between the two ways of saying 'address' or 'advertisement' suggests that perhaps both versions of these words be stored in the vocabulary. This argument could, of course, be extended to say that whenever the recognizer failed to make consistent measurements on the same word spoken by different speakers, that word should be redefined as two or more words in the vocabulary. This extension has limitations, however, since words are more difficult to recognize in a larger vocabulary. Mainly in the interests of simplicity, no attempt has been made here to enlarge the vocabulary for the purpose of encompassing different ways of saying the same word. It is estimated that an increase of perhaps 10 words to the present vocabulary of 54 would have eliminated a large portion of the errors made. It seems more fruitful, at present, to study the causes of these errors than to try to avoid them.

Having raised the question of speaker pronunciation, let us pursue this important matter further. The word recognizer to be described in this report depends greatly on having the same syllable stressed for every utterance of a given word. In the vast majority of cases, all speakers naturally stress the same syllable and the machine correctly notes this; in several words, however, stress is ambiguous. Although recognition is often successful even then, this success is usually quite precarious. Based on experience with this recognizer, it is thought that instructing the speaker to stress properly would improve the performance. Another worthwhile demand on the speaker is that he explode his final p's, k's, and t's, rather than swallow them, as speakers often do. Finally, if close-talking microphones are used, avoidance of "breathiness" is desirable to prevent low-frequency "wind" noise. Such noise may cause the recognizer to mistake an unvoiced sound for a voiced sound.

Three areas for which word recognition may be a pertinent problem are speech bandwidth compression, voice-actuated mechanisms, and 'talking to computers'. It is well appreciated that a word recognizer and word synthesizer combine to create a speech bandwidth compression system yielding very close to the ultimate bandwidth. The problem of preserving the speaker identity has not yet even been attempted. Furthermore, the very small vocabulary of 50-100 words that might be available, at present, would indeed limit the utility of such a device for speech communication. Whether voice-actuated mechanisms would be beneficial is really impossible to know. Perhaps one reason for wanting to build a word recognizer is to satisfy one's curiosity about its effect on other people. Talking to computers appears to be the most promising immediate application of a word recognizer. Many computer people argue that a typewriter input is more reliable and uses less storage. Such an argument was once very potent, but recent work wherein man and machine must both react quickly to each other has made it less convincing. One could cite as a possible application of a word recognizer the area of computer-controlled graphic communication.
Computer rooms are usually noisy places, hardly a suitable environment for a word recognizer. To minimize these disturbances to the word recognizer requires that special attention be paid to the microphone that is used and exactly how it is mounted on the speaker. This particular problem has been avoided here; the words were recorded in a fairly quiet environment (not an anechoic chamber). Thus, we can only speculate about possible difficulties and how to treat them. In 'talking to computers', the speaker may very well want to keep his hands free for such purposes as typing and for control of an oscilloscope display with a light pen. Microphone mounts of the type worn by telephone operators are perhaps most practical under these conditions; these microphones ought to have noise cancellation and close-talking features, if possible. If the resulting protection against spurious triggering of the recognizer is still insufficient, special circuitry might be necessary. Design of a satisfactory acoustic environment might well be one of the interesting 'tangential' problems of talking to computers.

II. SYNOPSIS OF PRESENT WORK

Presentation of this synopsis has been deferred to make clearer the chosen boundaries of the present work. First, recordings were made of 10 male speakers, each of whom read the same list of 54 words. The chosen words were computer-oriented. The list is given in Appendix I. Each word was analyzed by a 16-channel spectrum analyzer covering the 180-3000 cps band. Also, the words were passed through a pitch extractor and voicing detector. All processed words were passed through the word-recognition program, from which the results of the measurements were tabulated. The main function of the program was to segment the word, make 15 measurements after segmentation, and file these measurements for each of the 540 spoken words. The tabulation of all of these measurements was printed and, after rather lengthy inspection of the data, a decision algorithm was devised. All 540 words were then passed through this algorithm and the results tabulated and printed. The rest of this report describes, in more detail, the computer facility, segmentation, the measurements, and the decision algorithm, and discusses the results that were obtained.

III. DESCRIPTION OF THE COMPUTER FACILITY

For convenience of discussion, the various programs are divided into three compartments, called data gathering, compilation, and decision making. The first, data gathering, is shown in Fig. 1. It consists of a real-time read-in, into TX-2 core memory, of the spectrum, voicing decision, and voice fundamental frequency. All spectrum channels are sampled every 2.5 msec, and the other information is sampled every 5 msec.

Fig. 1. Data-gathering mode of the word-recognition program.

The next step is compilation, whereby the digital tapes are played back into core memory and the results of the measurements compiled. Approximately one hour of computer running time was needed to compile a complete inventory of test results on the 540 words. The result is a three-dimensional matrix of numbers, obtained by performing each measurement on every word and speaker. In a 36-bit word-length computer, 2025 registers are needed to store these numbers as 9-bit quantities.
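As a check on this bookkeeping, the sketch below (a present-day Python illustration, not the original TX-2 code) packs the 10 speakers x 54 words x 15 measurements = 8100 nine-bit values four to a 36-bit register, giving the 2025 registers quoted above.

```python
# Illustrative sketch, not the original TX-2 code: pack the 10 x 54 x 15
# matrix of 9-bit measurement values, four to a 36-bit register.

SPEAKERS, WORDS, MEASUREMENTS = 10, 54, 15
values = SPEAKERS * WORDS * MEASUREMENTS      # 8100 nine-bit quantities

def pack(measurements):
    """Pack 9-bit values, four per 36-bit register."""
    registers = []
    for i in range(0, len(measurements), 4):
        reg = 0
        for v in measurements[i:i + 4]:
            assert 0 <= v < 512               # each value must fit in 9 bits
            reg = (reg << 9) | v
        registers.append(reg)
    return registers

print(values // 4)                            # 2025 registers, as stated
```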
Finally, the decision-making program singles out one speaker's words and, by matching measurements made on these words with similar measurements made on the other 9 speakers, the program tries to recognize all words spoken by this speaker. By performing this operation on all of the speakers in turn, each time matching against the other 9 speakers, a total score is finally obtained.

IV. FINDING THE WORD BOUNDARIES

Let us assume that the spectrum of the word, and the voicing decision associated with the word, are in core memory. Even at this early stage, minor difficulties arise in determining the beginning and end of the word. In order to read the word into memory, the program must sense when the input signal exceeds a threshold. A threshold that is too low may trigger the read-in before the word is uttered. A threshold that is too high may cause the weak 'f' in a word like 'four' to be missed, so that a word more like 'or' will be read into memory. Of course, careful monitoring can eliminate these troubles, but great care consumes time and, besides, there is the tempting thought that in a limited vocabulary (containing 'four' but not 'or'), the garbled word might still be correctly identified.

Our program was designed to eliminate noise of less than 40-msec duration that might have triggered the read-in program. This is done by measuring the average magnitude function with a running window of 17.5-msec width, as indicated in Fig. 2. For our purposes, 'magnitude function' is defined as the sum of the spectrum channels at any moment. Comparison with a fixed threshold determines the word beginning, which is defined as the position of the scanning window when the threshold is first exceeded as the window scans from left to right. Figure 2 has been drawn to illustrate that a spurious sample which triggered the read-in program could be eliminated. To find the end of the word, the procedure is similar, but the scan window must be greatly widened (to 100 msec); a word like 'eight', if the final t is exploded, will contain an interval of silence between the vowel and the t. As seen in Fig. 2, the end of the word is made coincident with the beginning of the window.

Fig. 2. Word-boundary determination: speech level versus time, scanned with a 17.5-msec window at the word beginning and a 100-msec window at the word ending.
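A minimal sketch of this boundary search follows; the numerical threshold and the data layout are assumptions for illustration, since the report does not specify them. With spectrum frames arriving every 2.5 msec, the 17.5-msec window spans 7 frames and the 100-msec window spans 40.

```python
# Sketch of the word-boundary search of Section IV (assumed threshold and
# frame layout). 'frames' is a list of 16-channel spectrum samples.

def magnitude(frame):
    """'Magnitude function': the sum of the spectrum channels."""
    return sum(frame)

def word_boundaries(frames, threshold=100.0):
    mags = [magnitude(f) for f in frames]

    def window_average(start, width):
        return sum(mags[start:start + width]) / width

    # Word beginning: first position of the 7-frame (17.5-msec) window whose
    # average exceeds the threshold; a spurious burst shorter than the window
    # cannot trigger it.
    begin = next(i for i in range(len(mags) - 6)
                 if window_average(i, 7) > threshold)

    # Word ending: beginning of the last 40-frame (100-msec) window whose
    # average exceeds the threshold; the wide window bridges the silent gap
    # before an exploded final 't', as in 'eight'.
    end = max(i for i in range(len(mags) - 39)
              if window_average(i, 40) > threshold)
    return begin, end
```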
V. SEGMENTATION

By studying speech spectrograms and the speech magnitude function, several basic segmentation cues emerge. These will now be discussed.

A sharp dip in the over-all magnitude function generally signifies a segment. Thus, in Fig. 3, the dip shown usually indicates a vowel-consonant-vowel sequence, although the most accurate placement of the segment boundaries is not determined only from this picture. The cue is not perfect; for example, the speech level of the word 'zero' often fluctuates, as shown in Fig. 4.

Fig. 3. Characteristic dip in the magnitude function during an intervocalic consonant.

Fig. 4. Magnitude function for "zero."

Cessation or initiation of voicing is another good cue, which is quite well detectable with a buzz-hiss detector of the type used in vocoders. The voicing pattern of a given word is not unique. In the words 'divide', 'seven', 'number', the voiced fricatives and voiced plosives may be detected as either voiced or unvoiced. Many other such examples can be found. As we shall see in a moment, however, this and the first segmentation cue together serve as the primary segmentation determinants of the present program.

A rapid change in the over-all spectrum is another cue for segmentation. This 'change' we have defined by the formula

C = \frac{\sum_{i=5}^{16} |x_i - y_i|}{\sum_{i=5}^{16} (x_i + y_i)},     (1)

where x_i is the sum of 4 samples from spectrum channel i to the left of a reference time, and y_i is the sum of 4 samples from the same channel to the right of the reference. We see from Eq. 1 that C must be a positive number smaller than unity; large changes in spectrum are reflected as large values of C. Running the summation in (1) from the fifth rather than from the first spectrum channel proved a more sensitive measure of segmentation.

Based on the above-mentioned three cues, the specific segmentation algorithm that was used can now be given.

1. Divide the word into voiced and unvoiced segments. To remove possible spurious segments, first delete all voiced segments of less than 20-msec duration and then all unvoiced segments of less than 20-msec duration.

2. Find the voiced segment having the greatest energy, defined as the product of the average magnitude function and the segment duration.

3. Divide this largest segment further, if possible, by finding an "adequate dip." This "dip" is defined as the position of the smallest minimum of the magnitude function during the segment. Segment edges are ignored in searching for this dip. The dip is considered to be "adequate," and denotes a segment, if it is no greater than 1/3 of one of the maxima to either side and no greater than 0.7 of the other maximum. Given an "adequate dip," segment boundaries to both left and right are found by computing the largest value of C (in Eq. 1) between the dip and the two maxima.

4. If further division according to rule 3 succeeded, then the previous large segment obtained from rule 2 has now been fractured into three segments. The 'stressed segment' of these three is found by the location of the maximum magnitude function.

5. Further fracturing of the stressed segment into, at most, three segments is now possible. The stressed segment is scanned, and C is measured at each sample. All reference boundaries for which C exceeds a threshold are marked as boundaries. The program is constrained to allow only a "left segment" or "right segment" or both to be chopped off the stressed segment.

The rules will now be illustrated for a few words. Figures 5, 6, and 7 show the spectrograms and the resultant segments obtained.

Fig. 5. Segmentation of "directive."

Fig. 6. Segmentation of "nine."

Fig. 7. Segmentation of "multiply."

a) directive
   Rule 1 will result in: dire ct ive
   Rule 2 will choose 'dire' as the large segment
   Rule 3 will further segment to give: di r e ct ive
   Rule 4 will stress the third segment
   Rule 5 does nothing

b) nine
   Rules 1 through 4 do nothing
   Rule 5 will yield: n i ne

c) multiply
   Rule 1 yields: mul t i ply
   Rule 2 chooses 'mul' as the largest segment
   Rules 3 and 4 do nothing
   Rule 5 produces: m ul t i pl y
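To make Eq. 1 and the "adequate dip" of rule 3 concrete, here is a brief sketch; the variable names and frame layout are ours, not the report's.

```python
# Sketch of the spectral-change measure C of Eq. 1 (channels 5-16 only) and
# the "adequate dip" test of rule 3. Frames are lists of 16 channel values.

def spectral_change(frames, t):
    """C at reference time t: 4 frames on each side, channels 5-16."""
    x = [sum(frames[t - 4 + k][i] for k in range(4)) for i in range(4, 16)]
    y = [sum(frames[t + k][i] for k in range(4)) for i in range(4, 16)]
    num = sum(abs(a - b) for a, b in zip(x, y))
    den = sum(a + b for a, b in zip(x, y))
    return num / den if den else 0.0          # 0 <= C < 1

def adequate_dip(dip, left_max, right_max):
    """Rule 3: the dip must be no greater than 1/3 of one neighboring
    maximum and no greater than 0.7 of the other (either assignment)."""
    return ((dip <= left_max / 3.0 and dip <= 0.7 * right_max) or
            (dip <= right_max / 3.0 and dip <= 0.7 * left_max))
```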
VI. DESCRIPTION OF MEASUREMENTS

The measurement repertoire has been limited, so that occasionally only a fraction of a word is analyzed. This can be most easily illustrated by an example. If the word 'intersect' is operated on by the segmentation rules, the result will be

   in      t      [er     s      e      ct]
   p_-4    p_-3   p_-2    p_-1   p_0    p_1

Denoting the stressed segment e as p_0, we can define segments to the right of p_0 as p_1, p_2, etc., and segments to the left of p_0 as p_-1, p_-2, etc. The program analyzes only the bracketed part of the word 'intersect'. As many as two segments to the right and two segments to the left of p_0 are analyzed, the other information being ignored. The stressed segment p_0 may itself be subdivided by rule 5, but this does not affect the treatment of the other segments.

We now enumerate the measurements.

1. Relative duration of any left segment within p_0.
2, 3, 4, and 5. Relative durations of p_-2, p_-1, p_1, and p_2.
6. Presence of a silence in p_-1.
7. Presence of a silence in p_1.
8, 9, 10. Spectral measurements performed on the middle third of p_0.
11. Spectral measurement on p_-1.
12, 13, 14, and 15. Measurements based on formant motion during segment p_0.

By relative duration is meant the ratio of the duration of a given segment to the duration of the entire word. Measurements 6 and 7 are strong cues for the recognition of p_-1 and p_1 as voiceless plosives. A silence is considered to be present when the average speech level in a 20-msec scan window falls below a threshold.

Measurement 8 is defined as

M_8 = \sum_{i=1}^{8} q_i - \sum_{i=9}^{16} q_i,     (2)

where q_i is the average magnitude of spectrum channel i over the middle third of the segment p_0. Similarly,

M_9 = \sum_{i=1}^{4} q_i - \sum_{i=5}^{8} q_i + \sum_{i=9}^{12} q_i - \sum_{i=13}^{16} q_i     (3)

and

M_{10} = \sum_{i=1}^{4} q_i - \sum_{i=5}^{8} q_i - \sum_{i=9}^{12} q_i + \sum_{i=13}^{16} q_i.     (4)

Measurement 11 is similar to M_8, except that the evaluation of q_i takes place over those 20 msec of p_-1 having the greatest average magnitude function.

M_8 is a useful discriminant for segments with low first (F1) and second (F2) formants, such as in the words 'move', 'load', 'quarter'. M_9 tended to give consistently negative results for segments with high F1 and low F2, as in 'octal', 'half', 'add'. M_10 gave similar results for these words, but also gave negative results for the diphthongs in 'divide', 'nine', 'binary', 'five'. M_11 gave strong negative results for the 's' sound.
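A sketch of measurements 8-10 follows, using the channel groupings of Eqs. 2-4 as reconstructed above; it assumes q has already been averaged over the middle third of p_0.

```python
# Sketch of measurements 8-10 (Eqs. 2-4). q is the 16-element list of
# channel averages over the middle third of the stressed segment p_0;
# the channel groupings follow the reconstruction given in the text.

def quarter_sums(q):
    """Sums over channel groups 1-4, 5-8, 9-12, 13-16."""
    return [sum(q[i:i + 4]) for i in range(0, 16, 4)]

def m8(q):
    a, b, c, d = quarter_sums(q)
    return (a + b) - (c + d)     # low-half energy minus high-half energy

def m9(q):
    a, b, c, d = quarter_sums(q)
    return a - b + c - d

def m10(q):
    a, b, c, d = quarter_sums(q)
    return a - b - c + d
```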
VII. DISCUSSION OF THE FORMANT MEASUREMENTS

Formant information contains strong recognition cues and deserves special consideration. Two points are pertinent for the present problem. One is that the usual criterion of acceptability of a formant-tracking device, namely, that it control a formant synthesizer to produce acceptable vowel quality, does not apply. Second, formant measurements on the same vowel (in the same context) by different speakers are widely diffused. From this, it can be inferred that (a) a speech-recognition "formant tracker" need not actually track formants as defined by tracing the center of the dark bars on a spectrogram; it can track some function of the spectrum which is perhaps closely related to formants and easier to measure; and (b) we should not depend too heavily on precise formant locations.

Chart 1. Weights v_i and w_i.

   Frequency (cps)   dB Attenuation in        dB Attenuation in
                     F2 Measurement (v_i)     F3 Measurement (w_i)
        900                  8                      inf
       1080                  4                      inf
       1260                  0                      inf
       1440                  0                      inf
       1620                  0                        8
       1800                  5                        6
       1980                 10                        4
       2160                 15                        2
       2340                 18                        0
       2520                 22                        0
       2700                inf                        0
       2880                inf                        8
       3060                inf                       16

   Note: This chart was obtained from a paper by J. N. Shearme, "Analysis of the Performance of an Automatic Formant Measuring System," presented at the Stockholm Speech Communication Seminar, 1962.

The definitions of formants F1, F2, and F3 that are used here will now be given:

F_1 = \frac{\sum_{i=1}^{7} i\,x_i}{\sum_{i=1}^{7} x_i},   F_2 = \frac{\sum_{i=1}^{16} i\,v_i x_i}{\sum_{i=1}^{16} v_i x_i},   F_3 = \frac{\sum_{i=1}^{16} i\,w_i x_i}{\sum_{i=1}^{16} w_i x_i}.

Here, x_i is the i-th spectrum channel signal, and v_i and w_i are sets of weights which roughly correspond to the average frequency of occurrence of the formants in the i-th filter. Chart 1 shows the sets of weights in the F2 and F3 regions; weighting is uniform in the F1 region.

The measurements 12, 13, 14, and 15 are now defined as: 12) the average value of F1 over p_0; 13) the average value of F2 over p_0; 14) the difference between F2 near the beginning and near the end of p_0; and 15) the average of the absolute value of the time derivative of F2. Thus, some information concerning formant values and some information about formant changes are included.
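The sketch below illustrates these weighted-centroid estimates; converting Chart 1's dB attenuations into linear weights, and centering channel i at 180i cps, are our assumptions for illustration, not procedures stated in the report.

```python
# Sketch of the formant estimates of Section VII: each is a weighted centroid
# of channel index. Linear weights from Chart 1's dB attenuations, and the
# 180*i cps channel centers, are assumptions made for this illustration.

F2_DB = {900: 8, 1080: 4, 1260: 0, 1440: 0, 1620: 0, 1800: 5,
         1980: 10, 2160: 15, 2340: 18, 2520: 22}      # inf elsewhere
F3_DB = {1620: 8, 1800: 6, 1980: 4, 2160: 2, 2340: 0, 2520: 0,
         2700: 0, 2880: 8, 3060: 16}                  # inf elsewhere

def centroid(x, weight):
    """Weighted average channel index; x[i-1] is channel i's signal."""
    num = den = 0.0
    for i in range(1, 17):
        w = weight(i)
        num += i * w * x[i - 1]
        den += w * x[i - 1]
    return num / den if den else 0.0

def formants(x):
    f1 = centroid(x, lambda i: 1.0 if i <= 7 else 0.0)   # uniform, ch 1-7
    f2 = centroid(x, lambda i: 10 ** (-F2_DB.get(180 * i, 999) / 20))
    f3 = centroid(x, lambda i: 10 ** (-F3_DB.get(180 * i, 999) / 20))
    return f1, f2, f3
```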
VIII. DECISION MAKING

Perhaps more time has been spent (by people purportedly working on speech recognition) in trying to devise statistical recognition models than in working closely with actual data. We think that a good deal of this effort has been wasted, not because a recognition model is unimportant, but because most of this work has been directed at models that can be theoretically analyzed, for example, n-dimensional Gaussian models. Speech data will not fit into predetermined formats, and any attempt to squeeze them into one will lower the machine's capabilities.

In the general discussion the framework of a word-recognition decision maker has been loosely defined. Let us first somewhat formalize these ideas by constructing a model and then describe how this model was changed to accommodate the data. Given any one of the 15 measurements and any of the 54 words, an empirical range for that measurement and word can be constructed. Thus, the numbers obtained by applying measurement 8 to each speaker saying the word 'load' varied from 67 to 92; let us then define the range associated with the word 'load' and measurement 8 as all numbers greater than or equal to 67 and less than or equal to 92. In this way a range R_ij is defined for the i-th word and j-th measurement, and by inspection of the observed data numerical values can be assigned to every R_ij.

If we think of a range as a line segment along one of the axes of a 15-dimensional measurement space, then it is clear that any word is bounded within a 15-dimensional "box." Each "box" will contain all utterances of the same word. Now, if these "boxes" were mutually disjoint, the decision problem would be solved, at least to the extent that all 54 words would be perfectly partitioned and thus distinguishable. Let us be more precise about the phrase "mutually disjoint." We mean that no two different words ever fall in the same box. Even if 14 measurements of a given word fell within the 14-dimensional projection of another word's box, if the 15th measurement did not fall into the range of the 15th measurement of the other word, the first word would not be wrongly classified as the other word.

As expected, this division of the measurement space into 'boxes' does not, for our set of measurements, uniquely classify all utterances. If the model were applied exactly as described, the net result would be that many of the boxes would occupy such a large proportion of the space that large numbers of coincident (nondisjoint) boxes would be present. For example, measurement 5 applied to the word 'cycle' covers a range, if nine out of the ten speakers are considered, from 11 to 20. This is a rather restricted region for a measurement that yields a number from 0 to approximately 70 over all 540 words; measurement 5 applied to the 10th speaker, however, yields 47. Thus the range would be 11-47, much wider than before and much less capable of keeping the boxes disjoint.

This failure of the measurements to give numbers that are clustered together for all utterances of the same word is sufficiently prevalent to make it necessary to modify the range model. Presumably, such modification has to be probabilistic, and it is exactly at this point that traps must be avoided. We have very few samples of the effect of each measurement on each word (10, to be exact), and double or triple that number would still be small. The kind of trouble caused is exemplified (and there are many more examples) by measurement 1, the relative duration of p_-2, applied to 'delete'. The numbers obtained vary from 15 to 43 for 9 of the 10 speakers. For the 10th speaker, the segmentation failed to uncover a p_-2, so that the measurement could not be made (if we like, we can say the result is zero). Now, from purely probabilistic considerations, it could be claimed that measurement 1 applied to 'delete' yielded a range from 15 to 43 with probability 0.9, and yielded the measurement 0 with probability 0.1. Inspection of that 10th utterance shows, however, that the stress was put on the first syllable rather than the second, and since we are unwilling to admit that 1 out of 10 speakers will say 'delete' that way, we would prefer to throw away that measurement.

Our procedure, then, is intuitive and based somewhat on personal judgments about each of the utterances. What is done is to find a range R, based on 10 numbers, which may or may not encompass the actual numerical range of observed values, and which appears to be that region in which we expect the numbers to fall. Now, in assigning probabilities, when a given speaker's words are to be processed, he is first withdrawn from the population. Then, if m_1 of the data points falls within our chosen range, we assign a probability q_1 = m_1/9 to the range. If m_2 numbers were below the range, then the probability (associated with R_ij) that a subsequent number is below the range is q_2 = m_2/9, and similarly q_3 = m_3/9 is the probability that a number is above the range.

The model is now nearly complete. For each measurement and word, a range R_ij and three probabilities have been assigned. Based on these entities, a decision-making algorithm can now be described. An 'unknown' word appears. It is first matched with word 1 of the vocabulary as follows: Measurement 1 is made, and it is determined whether this measurement falls within the range of word 1, below this range, or above it. Depending on the result, a probability q_1, q_2, or q_3 will be obtained. A partial score is then found from Chart 2 and saved. The total score is the sum of partial scores over all measurements.

Chart 2. Score-Probability Table.

   Probability   Score
       0          -20
      1/9          -6
      2/9          -4
      3/9          -2
      4/9           0
      5/9           0
      6/9           0
      7/9           2
      8/9           4
       1            5
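A compact sketch of this matching procedure is given below: each vocabulary word carries, per measurement, a range and the three probabilities; the Chart 2 score of the observed outcome's probability is summed over the 15 measurements, and the highest total wins. The data layout and tie handling are ours.

```python
# Sketch of the decision rule of Section VIII. For each vocabulary word we
# store, per measurement, a range (lo, hi) and the probabilities (q1, q2, q3)
# of falling within, below, or above it. Scores follow Chart 2.

SCORE = {0: -20, 1: -6, 2: -4, 3: -2, 4: 0, 5: 0, 6: 0, 7: 2, 8: 4, 9: 5}

def partial_score(value, lo, hi, q1, q2, q3):
    """Chart 2 score for the probability of the observed outcome."""
    if lo <= value <= hi:
        q = q1
    elif value < lo:
        q = q2
    else:
        q = q3
    return SCORE[round(q * 9)]     # each q is an exact multiple of 1/9

def recognize(measurements, vocabulary):
    """vocabulary: {word: [(lo, hi, q1, q2, q3), ...]}, 15 entries per word.
    Returns the vocabulary word with the greatest total score."""
    def total(entry):
        return sum(partial_score(m, *params)
                   for m, params in zip(measurements, entry))
    return max(vocabulary, key=lambda w: total(vocabulary[w]))
```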
The unknown word is then matched in the same way with all other words in the vocabulary, and the word yielding the greatest total score 'wins'; that is, the decision is made that the unknown word is that word in the vocabulary with the highest score.

IX. PRELIMINARY RESULTS

Our program correctly identified all but 74 of the 540 words in our catalogue, an average of approximately 86%. The errors are enumerated in the appendix, together with some supplementary data. The appendix shows that approximately 95% of the words were either correctly identified or finished second. Some of the remaining 5% included words that were either stressed wrongly, mispronounced, or not correctly read into the computer memory.

If 'no decision' is allowable, the performance of the program can be judged from Fig. 8. The abscissa represents a threshold setting that the highest score must exceed for there to be a decision. The two curves show how this threshold setting affects the percentage of correct decisions, as well as that of no decisions.

Fig. 8. (a) Probability of correct decision and (b) probability of no decision, as functions of the threshold setting.

X. FUTURE PLANS

At present, no specific plan for future work exists, mainly because the number of possible directions is large enough to be a little confusing. We end our report by citing some of these possibilities.

Of great importance is the extension to more speakers. If 10 new speakers are added, will simple adjustments of the ranges suffice to maintain the over-all average? How will female speech and foreign accents affect the decision making? Can the ranges and three associated probabilities be found automatically? Possible enlargement of the vocabulary, in the future, may make this a very important question.

In the present work, we have avoided the problem of consistent segmenting of initial and final voiced plosives, final voiced fricatives and nasals, and glides. Clearly, we must make progress along these lines, either in refining the segment measures or in adapting the work of others to the problem.

Other problems that come to mind are further 'tuning' of the present program to improve the average by adjusting the ranges and probabilities and perhaps adding a few measurements, trying to improve the formant-tracking measurements, and experimenting with a speech signal of wider band in the hope of finding measurement cues for distinguishing between plosives and fricatives. Perhaps the most tempting direction is that of applying the technique to a practical problem of 'talking to a computer'.

Appendix I. Word List

insert      delete      name        end         replace     scale
move        cycle       read        skip        binary      jump
save        address     core        overflow    directive   point
list        control     load        register    store       word
add         exchange    subtract    input       zero        output
one         make        two         intersect   three       compare
four        accumulate  five        memory      six         bite
seven       eight       quarter     half        nine        whole
multiply    unite       divide      decimal     number      octal

Acknowledgment

The work reported here was begun because of the joint interest of the author and Professor K. N. Stevens, who has followed the progress continuously, contributing valuable suggestions as well as insights on the nature of some of the more enigmatic speech sounds. Many of the thoughts presented here derive from conversations with him.
One result of the relative informality of the report is the absence of documentation. The main purpose of the report was rapid dissemination of progress up to the present time, rather than a more formal presentation. Nevertheless, it is proper to acknowledge the important influence of previous work. Chief among these is the work of J. W. Forgie and Carma Forgie, who have made important advances in several areas of speech recognition. The author has also had the benefit of day-to-day contact, at Lincoln Laboratory, M.I.T., with J. W. Forgie. Notable among the large volume of other published material is that of George Hughes. Although no mention has been made in this report of the application of the linguistic concept of distinctive features to speech recognition, this approach, developed by Hughes, has greatly influenced the lines of investigation pursued here. Many other significant papers have appeared over the years. An extensive bibliography appears in IEEE Spectrum of March, 1965.