A Virtual Vocabulary Speech Recognizer

by Peter D. Pathe

B.S. Engineering and Applied Science
California Institute of Technology
Pasadena, California
1977

Submitted to the Department of Architecture in Partial Fulfillment of the Requirements of the Degree of Master of Science in Visual Studies at the Massachusetts Institute of Technology

April 1983

(c) Massachusetts Institute of Technology 1983

Signature of Author: Department of Architecture, 1 April, 1983

Certified by: Andrew Lippman, Assistant Professor of Media Technology, Thesis Supervisor

Accepted by: Nicholas Negroponte, Professor of Computer Graphics, Chairman, Departmental Committee for Graduate Students

ABSTRACT

A system for the automatic recognition of human speech is described. A commercially available speech recognizer sees its recognition vocabulary increased through the use of virtual memory management techniques. Central to the design are issues concerning the nature of speech, its effectiveness as an isolated mode of communication with computers, and its role as a part of a multi-modal communication interface.

A highly interactive information retrieval system serves as a sample application. This system is detailed, and features which make it an appropriate application for the speech system are identified.

Sponsored in full by the Office of Naval Research, Contract No. N00014-80-K-0921

Thesis Supervisor: Andrew Lippman
Title: Assistant Professor of Media Technology

Acknowledgements

I wish to thank everyone at the Architecture Machine Group for all the help I received while working there. Andy Lippman provided financial support and resources for this project and others, and gave direction to my research. Chris Schmandt was
a great source of information and advice on speech recognition. Thanks to Chris Lombardi, Neil Galarneau, and Adam Rose for their work on the NEC personal computer. Lenox Brassell was always ready to share his knowledge of Magic6 when I needed it. The lab just wouldn't have been the same without Smokehouse and the Hardware Team.

I especially wish to thank Walter Bender for his constant help and encouragement, both in and out of the lab. I hope we have the opportunity to work together again in the future.

Table Of Contents

Abstract
Acknowledgements
1.0 Introduction
2.0 Communication Modes
    2.1 Speech as a Mode of Interaction
3.0 Automated Speech Recognition
    3.1 The Intelligent Listener
4.0 Contextual Environment
    4.1 What Is NewsPeek?
    4.2 NewsPeek Operation
        4.2.1 Off-line (Editorial) Functions
        4.2.2 On-line Functions (Reader Aids)
    4.3 User Interface
5.0 The Speech Recognition Unit
6.0 The Vocabulary Predictor
7.0 Applying the Virtual Vocabulary Speech Recognizer
8.0 Conclusion
References

1.0 Introduction

Human communication is a well developed art. People speak, write, gesture, sing, dance, paint, and the list goes on. Each of these modes provides a powerful way to get a message from one person to another. Our lives are made richer through the exchange of thoughts, feelings, and experiences.

Unfortunately, the art of communicating with machines is not so well developed. The problem isn't that there is nothing to say; people converse with computers as a matter of course. The real issue is how it must be said.
Most notably, computers cannot yet understand normal human speech.

Automatic speech recognition devices do exist. Commercially available recognizers typically are capable of identifying only a small vocabulary of isolated words uttered by a single speaker. This is nothing like the recognition of fluent speech, yet still is adequate performance in some applications.

The Virtual Vocabulary Speech Recognizer is an enhancement of one such device. At the heart of the system is a small personal computer which controls the speech recognizer, disc memory, and communication with the host. A large recognition vocabulary can be stored in the computer's memory, and sections can be selectively loaded into the recognizer's smaller active vocabulary space. Although the speech recognizer can identify one of only a small number of words at any given moment, its effective vocabulary seems quite large.

The general usefulness of this system hinges on the assumption that, for a given application, there exists some method for predicting a subset vocabulary with a reasonable likelihood of including the speaker's next utterance. This is obviously not a property of all systems to which automatic speech recognition might be applied, but it is practical in certain instances.

The host application in this case is an interactive news retrieval and analysis system. At any given moment, the user is typically involved in reading and searching for groups of stories. Exactly what constitutes a group is largely in his hands, as the system provides the user with the ability to associate the stories by features like content and age. Just as subsets of the global database are constructed by the user during system operation, associated subset recognition vocabularies can be created as well. In addition, other information about the task, such as the history of the user's activity, can be used to supplement the vocabulary subsetting process.
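
The subsetting idea described in this introduction can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the predictor developed in the thesis: the function name `predict_subset`, the scoring weights, and the sample words are all hypothetical. The sketch scores words by how often they occur in the current story group, gives extra weight to words the user has spoken recently, and keeps as many as fit in the active vocabulary.

```python
from collections import Counter

def predict_subset(group_words, history, capacity=60):
    """Pick up to `capacity` words likely to contain the next utterance.

    group_words: words drawn from the stories the user is currently reading.
    history: words the user has spoken recently.
    Both inputs and the scoring weights are illustrative assumptions.
    """
    scores = Counter()
    for word in group_words:
        scores[word] += 1          # one point per occurrence in the story group
    for word in history:
        scores[word] += 2          # recently spoken words get extra weight
    # Highest score first; ties broken alphabetically for a stable result.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:capacity]

# Tiny example with a capacity of three instead of sixty.
subset = predict_subset(
    group_words=["navy", "budget", "navy", "senate", "budget", "navy"],
    history=["senate", "nextpage"],
    capacity=3,
)
print(subset)
```

Any real predictor would need scoring rules tuned to the application; the point here is only that the active vocabulary is recomputed from the task state rather than fixed in advance.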
This speech system provides the recognizer with memory, additional processing power, knowledge about the user, and knowledge of the task being performed. The speech system, when integrated with the host application in this manner, performs with a greater degree of flexibility and utility than it does in isolation.

2.0 Communication Modes

Research in communication between humans has demonstrated the important role the mode of interaction plays in the transfer of information. In experiments performed by Chapanis and Ochsman [4,11], two subjects were involved in assembling a machine. One subject was designated the source of information, the other the seeker. Each time the experiment was performed, the subjects were allowed to interact in a different prescribed manner. The modes made available were communication-rich (subjects together in room), voice, video, handwriting, typing, and various combinations of the above. The quality of the exchange was measured in each case by the time the seeker required to assemble his project.

The communication-rich environment was far and away the best. Of the single modes, voice was found to be the most useful. Without exception, the multi-modal environments including verbal contact proved more valuable than those without. Also of interest, the typewriter fared poorly among experienced typists and novices alike.

Although the studies were performed with pairs of human subjects, the three results cited above are of particular relevance to the design of computer-user interfaces.

People and computers usually converse by typing at each other. In its defense, the keyboard-driven computer terminal is a precise, reliable I/O device, is inexpensively manufactured, and requires minimal computational support. Its utility in accepting and quickly displaying large amounts of text is undeniable. Unfortunately, it is not always well suited to the task of highly interactive communication between humans and computers.
Means of graphical input and output are becoming more common, as are devices enabling computers to speak and hear. Making these and other alternatives useful in computer interaction is an active area of research. As advances in computer design produce faster and more powerful machines, memory and processing "left over" from performing tasks immediately at hand may be afforded alternative I/O devices and devoted to improving the user interface.

From a computer interface designer's standpoint, the most significant result of the communication experiment is not the relative rating of speech and typing as useful modes of interaction, but instead the very high rating of the multi-modal, communication-rich environment. The quality interface provides multiple channels for communication between seeker and source. That some of the channels overlap in their ability to transfer information is not an inefficiency to be optimized away. The redundancy inherent in multi-modal communication is a part of its usefulness. The user of the quality interface is freed from the mental constraints of single-moded thinking. He devotes his mind to applying the computer to the use for which it is intended, undistracted by the need to route his thoughts through an inconvenient channel.

The communication-rich environment is more than a collection of isolated modes. Each mode may show strengths and weaknesses depending on the kind of information to be transferred and idiosyncratic preferences of the user. Each mode supplements the others; in a thoughtfully designed system, the modes should not function independently, but with knowledge of each other whenever possible.

Although it is not yet the state of the art, this seems especially applicable in the case of the human-computer interface. Due to the nature of graphical and verbal I/O devices, some of the "richness" of the interface is lost in their operation.
For example, a speech recognizer may function by reducing a sentence to a string of words, which are passed to the main computer in a form suitable for processing. Other elements of the spoken line, such as emphasis and inflection, are completely ignored. Similarly, a gesture on the touch-screen may be represented by its endpoints. These operations serve to extract some useful part of the input signal, but in so doing they effectively band-limit that signal. The hope is that the redundancy inherent in the multi-modal interface may be used to reconstruct some of the communication bandwidth lost in the feature extraction process.

2.1 Speech as a Mode of Interaction

Speech is a powerful method of communication as a single or supplemental mode. Most people learn to speak at an early age and practice this social skill throughout their lives. It is such a simple and natural process it is all but taken for granted. What makes speech such a powerful tool in human communication?

It is fast. Thought may be quickly converted to speech, the speech itself is rapid, and the speech is easily converted back to thought at the receiving end.

It is dense. Subtle inflections in voice may effectively transmit more information than the text of what is spoken. The way something is said conveys meaning of its own.

It is automatic. One mentally composes a message in the language he speaks, so the act of speaking that message is automatic.

It may operate in parallel to other thought processes. A speaker may talk and drive a car at the same time, for instance.

It requires no special equipment.

There are practical considerations for the use of speech as computer input. Since even untrained people know how to speak, special operator instruction is unnecessary. A well designed computer-speech environment places few physical constraints on the user.
In a sample application, an inspector may talk to his computer from many locations in an industrial site. Perhaps a programmer likes to pace the floor while devising code. Equip the computer with a telephone and operate it from anywhere in the world.

Since the operator's hands are not required for speech, they are free for other tasks. The inspector examines a machine part; the programmer scribbles with a pencil. Of course, the "hands free" nature of speech makes it a good candidate for inclusion in the multi-modal communication environment. Most other input devices require some mechanical manipulation by the user.

People already speak to machines every day. The act is so natural and routine that most would not give it a second thought, but speaking on a telephone is just that. Why don't people feel uncomfortable talking to a machine like the telephone? Because it responds intelligently, as if it were a person. Speech can be a useful method for communicating with machines, provided that the machines hold up their end of the conversation.

Imagine calling a switchboard operator, only to be connected to the (hypothetical) talking computer. Could you tell the difference? Probably. Would you care? Probably not. What if the computer kept saying, "I don't understand," or didn't respond at all? What if the human operator did the same? Perhaps you would start talking to your telephone the same way you talk to your stalled automobile and broken toaster, expecting, and getting, the same intelligent reply from all three.

3.0 Automated Speech Recognition

Although people find voice a fast and effortless mode of communication, machine recognition of conversational speech remains an elusive goal. No device yet exists which can be simply 'plugged into' a computer as if it were a terminal to handle spoken I/O. Commercially available speech recognizers with limited vocabularies (usually under 100 words) perform reasonably well identifying isolated utterances spoken by a co-operative user.
These devices do find applications, but due to their limitations lack the general utility of other input devices.

Most speech recognizers must be 'trained' by the individual whose voice is to be recognized. As the user says each word in the vocabulary, a 'template' is created and saved. The template is used as a representative sample of the audio signal produced by the speaker uttering that word.

When recognizing, the device performs spectral analysis of its audio input. The amplitude spectrum is determined by one of several means, including direct filtering, digital filtering by Fast Fourier Transform, and linear predictive analysis. Various pattern matching techniques are applied to the input signal and each of the stored templates. One of the best of these, called dynamic programming, operates on a signal, or pattern, which may be characterized as a function of several variables. The effects of non-linear variations along a common axis of two such functions can be minimized through this technique. Dynamic programming, as applied to speech recognition, compensates for temporal variations between utterances of a single word by squeezing and stretching the template along its time axis. Recognition improves dramatically [12,14,16].

After the audio input signal has been correlated with the entire vocabulary of templates, the best match is reported. Usually this is accompanied by some kind of 'confidence' metric.

There are complications in addition to those imposed by normal variations in the audio signal produced by a single speaker uttering a single word. Variations between speakers can be quite large. Background noise and room acoustics further serve to distort recognizable audio features. Position in a sentence or phrase affects the listener's perception of the word. Coarticulation between words in fluent (continuous) speech also causes severe recognition problems.
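
The dynamic programming match described in this section can be illustrated with a short Python sketch. It is a textbook time-warping distance, offered as an assumption about the general technique rather than as the algorithm inside any particular commercial recognizer. Each frame is a single number here for brevity, where a real device would compare spectral vectors; the names `dtw_distance` and `best_match` and the sample data are hypothetical.

```python
def dtw_distance(template, utterance):
    """Dynamic-programming (time-warping) distance between two frame sequences.

    Each sequence is a list of numbers, one per time frame. Scalar frames are
    a simplification; spectral vectors would be used in practice.
    """
    n, m = len(template), len(utterance)
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning template[:i] with utterance[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(template[i - 1] - utterance[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # utterance frame is reused
                                 cost[i][j - 1],      # template frame is reused
                                 cost[i - 1][j - 1])  # both sequences advance
    return cost[n][m]

def best_match(templates, utterance):
    """Return (slot, distance) for the stored template closest to the utterance."""
    return min(((slot, dtw_distance(t, utterance)) for slot, t in templates.items()),
               key=lambda pair: pair[1])

# Slot 0 holds a short template; the utterance is a slower rendition of it.
templates = {0: [1, 3, 4, 3, 1], 1: [5, 5, 1, 1]}
slot, distance = best_match(templates, [1, 3, 3, 4, 4, 3, 1])
print(slot, distance)
```

Because frames may be reused on either side, the stretched utterance aligns with the template in slot 0 at zero cost, which is the "squeezing and stretching" along the time axis that the text describes. The returned distance could serve as a crude stand-in for a confidence metric.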
Recognition reliability decreases rapidly as user vocabulary size increases. Allowable variation in a word's audio spectrum may be quite large relative to the differences required to distinguish it from another. Inter-word variations in small vocabularies are more likely sufficient to yield good identification than those of a large vocabulary, but this feature cannot be pushed very far. Identifying a spoken word by its amplitude spectrum is loosely analogous to labelling a typed word by its hashing function value; as the number of table entries (words) increases relative to the hash size, so does the ambiguity of the labels. The more words there are in the recognition space, the greater the likelihood that two (or more) will be similar.

One might think that these machines are attempting to recognize speech at too high a level. After all, there are somewhere on the order of 300,000 words in the English language; they may all be successfully communicated to a computer via terminal keyboard. By reducing English text to a manageable number of easily identified primitives (the alphabet), the problem of word transmission is solved, so why not apply the same reasoning to the spoken-text problem? There are about 20,000 syllables in English, but these are constructed from roughly forty phonemes [8]. Unfortunately, current analysis techniques do not reliably isolate and identify phonemes in the input audio signal. Since a word is composed of many of these primitives, errors in the composite multiply, resulting in poor confidence in the word recognition. Improvements in linguistic analysis may make this a method of choice in the future.

Recognition error can also be reduced through methods of non-acoustical analysis. By providing the recognizer with knowledge about the system and some intelligence with which to apply that knowledge, some of the burden of recognition can be shifted from totally analytical processes.
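
The way phoneme errors multiply within a word can be made concrete with a small calculation. The sketch below assumes that each phoneme is identified independently with the same accuracy, which is a simplification, and the 0.9 figure is an arbitrary illustrative value rather than a measured one.

```python
def word_accuracy(phoneme_accuracy, phonemes_per_word):
    """Chance that every phoneme in a word is identified correctly,
    assuming independent errors (a simplifying assumption)."""
    return phoneme_accuracy ** phonemes_per_word

# Assumed per-phoneme accuracy of 0.9, for words of increasing length.
for n in (1, 3, 5, 8):
    print(n, round(word_accuracy(0.9, n), 3))
```

Under these assumptions a five-phoneme word is recognized correctly only about 59 percent of the time, even though each phoneme alone is right nine times in ten.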
3.1 The Intelligent Listener

One method of applying knowledge and intelligence to a speech system involves the notion of recognizing a word in context. Usually this context is a semantic one, and requires a formal specification of the language's grammar along with production rules relating the vocabulary. This method may be extended to recognize phonemes in a specified linguistic grammar, thus aiding in the identification of words. Specifying formal grammars for natural languages such as English is a formidable task. However, small subset grammars can be defined. Knowledge of the grammar is used to restrict the number of possible sentences which can be constructed from the language's vocabulary. With the aid of semantic analysis, words are recognized so as to produce a correct sentence with the least total likelihood of error, even though the word chosen in a single instance may not be the most likely candidate.

There are some problems with semantic analysis. Complexity increases quickly with the size of the vocabulary and the number of productions in the grammar. And, of course, there may be a large number of mis-recognitions which are semantically correct.

Unfortunately for the semantic analyzer, people often deviate from the formally correct natural grammars when conversing. This may happen between familiar partners, or when commands and other information are being exchanged tersely. Informal speech is apparently less bound to the rules of production than writing is. If a specified grammar for the automated speech recognizer does not include the kinds of idiosyncratic deviances which may crop up during its operation, its utility to the user is impacted.

Some applications do not have the luxury of a reasonably restricting syntax. They may be accessed quite naturally by single-word commands; an action to be taken on a single-word spoken datum may be implicit, or may depend on the state of the system.
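
The grammar-constrained recognition described earlier in this section, where the best whole sentence may overrule the best individual word, can be sketched as follows. This is a minimal illustration under stated assumptions: the two-word toy grammar, the candidate probabilities, and the name `best_sentence` are hypothetical, and exhaustive enumeration stands in for whatever search a practical system would use.

```python
from itertools import product

def best_sentence(candidates, is_grammatical):
    """Choose the most likely word sequence that the grammar accepts.

    candidates: one dict per spoken word, mapping candidate word -> probability.
    is_grammatical: predicate over a tuple of words (a stand-in for a real grammar).
    """
    best, best_score = None, 0.0
    for words in product(*(c.keys() for c in candidates)):
        if not is_grammatical(words):
            continue
        score = 1.0
        for word, options in zip(words, candidates):
            score *= options[word]
        if score > best_score:
            best, best_score = words, score
    return best, best_score

# Toy grammar (an assumption): a sentence is one VERB followed by one NOUN.
VERBS, NOUNS = {"show", "save"}, {"page", "story"}

def grammar(words):
    return len(words) == 2 and words[0] in VERBS and words[1] in NOUNS

# Hypothetical acoustic scores: "story" is the best guess for the first word alone.
acoustic = [{"story": 0.5, "show": 0.4, "save": 0.1},
            {"page": 0.7, "save": 0.3}]
sentence, score = best_sentence(acoustic, grammar)
print(sentence, score)
```

Here "story" scores highest acoustically for the first word, yet the grammar rejects every sentence beginning with it, so "show page" is chosen. The enumeration grows with the product of the candidate counts, which echoes the complexity concern raised above.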
Another sort of context may be employed to improve speech recognition in these situations. Knowledge of the task being performed can be applied to choose likely subsets of the vocabulary to be analyzed in given instances. A word from the smaller vocabulary can then be recognized with a much greater reliability.

Another, perhaps less obvious, context can be used to the recognizer's advantage. That is the context of the multi-modal user interface. Some applications are highly interactive by nature, and may involve the user in continuous or repetitive operations. If, through the course of these operations, the computer can detect patterns in the user's activity, then knowledge of the user's input on all channels may be employed to choose a likely spoken vocabulary subset for recognition. In a sense, by becoming familiar with its partner in conversation, the computer becomes a better listener.

It should be noted that not all speech recognition applications provide an environment conducive to computer-user familiarity. For instance, a system for automatic transcription from dictation might be no more interactive than a tape recorder, the interaction itself yielding no clue to aid the speech recognizer. On the other hand, its recognition may be improved by applying knowledge of the subject matter and semantic analysis. These techniques are not mutually exclusive. By taking the nature of the application, and, to some extent, the nature of the user, into account, all of the techniques described in this section can be combined to produce useful strategies for speech recognition.

The Virtual Vocabulary Speech Recognizer is the strategy for speech recognition explored in the "NewsPeek" project. NewsPeek requires a speech recognizer able to cope with a large vocabulary of single-word utterances. Its speech recognition hardware is capable of handling a vocabulary of just sixty words. However, over 1600 words can be stored in the recognizer's virtual memory.
This recognizer employs knowledge of both the NewsPeek application and the user's interactions with the system to choose sixty-word subsets of the virtual vocabulary for recognition. This system and the NewsPeek project are described in the sections that follow.

4.0 Contextual Environment

The specific environment which provides the 'context' upon which the speech recognition package will rely is "NewsPeek," a personalized, dynamic information analysis system. Before discussing the details of spoken input, it will be useful to describe the NewsPeek system which will be referenced as an example throughout this paper.

4.1 What Is NewsPeek?

NewsPeek is one component of a larger research program. The purpose of the overall effort is to develop interactive systems for data analysis. Specifically, NewsPeek is a personal, interactive, electronic newspaper.

Mead Data Central's "Nexis" data retrieval system serves as the source of the global database and global model used by NewsPeek. Subscribers to the Nexis service are normally provided with a local access station consisting of a computer terminal (video text display), hard copy unit, keyboard including special Nexis function keys, and telephone data-transmission interface. The Nexis station performs two basic functions, these being (1) access and control of centrally located database search tools, and (2) local display and printing of the news items found by those search mechanisms.

NewsPeek's purpose is to modify the Nexis system, the large computerized database and manager described above, and transform it into a medium for "electronic publishing". In other words, NewsPeek alters the role of the Nexis system as archive and repository for news stories already disseminated by conventional print media, and has it instead act as the source of current news intended for first-time distribution electronically.
This alteration is controlled entirely at the subscriber's end of the system, leaving the central Nexis system as is.

NewsPeek replaces Mead Data Central's Nexis video station with a local personal computer. The computer is equipped with a touch-sensitive color graphics display, optical videodisc player, and voice recognition system. This computer duplicates the original station's functions of access and retrieval, but replaces the interface to these functions. Unlike the normal configuration of the Nexis system, NewsPeek resembles a conventional publication. Where the Nexis subscriber retrieves remote data on demand, the NewsPeek subscriber receives his own copy of a locally available, electronically delivered newspaper.

The Nexis search routines, under the direction of NewsPeek, provide the editorial function associated with the publication of the electronic newspaper. Stories for inclusion in the newspaper are located and transmitted to the recipient's computer. Since the resulting publication is intended for a single reader, its contents will reflect his personal tastes and interests. An individualized, user-directed news search is effected, replacing the general, mass-targeted one currently provided by the media; no two subscribers get the same "Time" magazine.

Upon receipt of this publication, the subscriber's personal computer acts as an aid to the examination of the newspaper's contents. Just as one's daily paper is not necessarily read in order or in one sitting, the electronic newspaper is available for perusal. The local computer provides the interactive environment for analysis.

One key feature of this system is the manner in which perusal of the news occurs. In use, the personal computer becomes far more than a potentially better or easier to use translator in data base search. Rather than simply providing verbal and touch-sensitive replacements for the control keys provided by the Nexis system, it allows for the search to assume a new structure and importance.
for the search to assume a new Nexis library searches without the local processor, due to the nature of the tools provided, tend to take a short (linear) path through the global database. The operation basically consists of iteratively reducing a group of candidate stories, group is contents. small enough to allow (i.e., While this is individual examination until the user fine, perhaps single article on the starting with the entire library, until the finds a set of even optimal, a particular subject, target stories.) for the user seeking a it may be restrictive to reader whose goal is less well defined; -23- of its the system was not designed for browsing is browsing. In the NewsPeek system, however, exploited as a method of discovery. If, during the reading of the newspaper, the reader's interest is diverted, he has the option of pursuing that diversion without the expense of initiating a new global To sum up, NewsPeek is an attempt at producing a personal, electronic publication. search In form and content after a hypothetical newspaper. local, personal computer, both in reading it. from the top. it The reader is is modelled aided by his producing the newspaper and in The activity of reading the publication itself encourages digression and variation, as the reader is either to browse, or to make associations free and follow through on them. 4.2 NewsPeek Operation The NewsPeek system can be seperated into those which aid the reader in which prepare two sets of functions, perusing the for a reading session. stories, and those The following sub-sections provide details. 4.2.1 Off-line (Editorial) Functions In order to spare the from annoying delays the news of subscriber to the electronic newspaper during an the day, much of otherwise productive moment with the processing -24- associated with the creation of The the publication occurs prior analagous functions to the time of perusal. 
of editing, composing, and publishing a morning newspaper are performed the night before the paper hits the stands, and so, presumably, would the electronic newspaper's pre-processing. This is a period of otherwise low demand for both the computer and the telephone, which is used to access the Nexis library.

A process is initiated to connect the subscriber's computer with the Nexis system. This program is a top level filter, and directs the Nexis access tools, which may be considered as NewsPeek's remote editor. Just as the newspaper's editor acts as a filter between the volume of news events in the world and what eventually ends up printed in the morning edition, the automated editor decides what information is worth copying from the enormous Nexis database into the user's personal newspaper. Because the program is privileged with knowledge of the reader's requirements and preferences, the resulting publication can be tailored to please the entire circulation (one).

After the desired news stories have been retrieved, their contents are correlated with the user's archive, his local picture library (on optical videodisc), and each other. The point of such extensive cross-referencing is to provide the reader with a potentially huge number of paths through the news, relating the information in the manner of his choosing.

4.2.2 On-line Functions (Reader Aids)

The reader wishes to examine his newspaper, which consists of individually selected stories now available in (fast) local memory. Text is displayed on his color television. In the current version of NewsPeek the front page is an expanded table of contents, combining some of the features of headlines (summarization), and some of normal tables of contents (location). As the system evolves, this table of contents may be preceded by an "intelligently" chosen set of "cover stories."

The user may want to take advantage of the local text processing which has occurred, and may do so in a deceptively simple fashion: he indicates a word. NewsPeek responds by overlaying a set of cross-references. Each line in this set contains the selected word surrounded by the words adjacent to it in another story, library entry, or the picture index, thus indicating the context of its use. At this point the reader may elect to return to studying the underlying page, or, his curiosity piqued, decide to press the correlation on. By choosing a line in the display, he is transported to the story or file from which the line was taken, and a new page or picture is displayed. There is virtually no limit to the number of jumps that can be made in this manner. Of course, the user is always free to back up along the path he has created, or to abandon it completely. Along the way, the user may initiate new Nexis searches, make notes, and file stories in his personal library.

The important features here are that the associations made among the articles being read are the user's own, not those imposed on him by the publisher, and that these associations are used both as the links for choosing the order in which the stories are presented, and as keys for requesting new information.

4.3 User Interface

The interface must be supportive and intuitively clear to the user. In addition, it should be transparent, as it will otherwise color the process of browsing and analyzing the publication. For example, if stories from the newspaper's sports section are displayed more quickly than others, the perusal path is likely to be biased.

However, the NewsPeek designers' intention has never been to simply create a so-called "user-friendly", or enhanced computer front end to the existing Nexis system.
make the news enhanced computer front end to the existing Basic to the design philosophy is the desire to search seem as much like reading an ordinary newspaper as possible, or, information at the very least, like obtaining from a well-informed, co-operative expert. If this cannot be achieved, the system will be used only by those already versed in the use of computers, or as the last resort of those who are not. end, a bimodal To this be issued and words gesturing) input method is employed. selected either through touch or verbally. Input is -27- Commands may (pointing and acknowledged on the television display primarily by the creative use of color. modes are highly redundant. pages by a book; Just if desired.) or by pointing to it on the page. as many commands some requests Touching a words picture of 'soft could. this," may be made with will button' on An easily favor one A word is equal channel ease selected by via over voice and another. the display screen may say more than voiced command such as, "Show me may not translate well into the world of physical gestures. Similarly, the advantage of choosing a word not on the dictated "Next page, please." "please," gesture, fifty he were flipping a page in alternately, he may simply say, saying it input For instance, the reader may turn stroking the TV screen as if (even without the The two screen is some characteristics of and are covered more fully later. -28- obvious. speech in These points the speech recognition system, a 5.0 The Speech Recognition Unit The device used for speech recognition in the NewsPeek system is an NEC This device comes model PC-8001A personal computer. configured with CPU, dual mini-floppy disc drives, unit, and voice recognition hardware. I/O interface To this has been added a Shure SM-lo noise-cancelling microphone connected through a pre-amplifier/mixer. The voice recognizer uttered by is capable of a single speaker. 
This particular model to recognize connected speech; recognizer's vocabulary identifying sixty words that is, if is not able two words from the are spoken one after another in a single phrase, there is no guarantee that either will be recognized. Note that what utterance of is meant here by less than 1.5 "word" is seconds duration. NewsPeek command "NextPage" satifies considered this criterion, and is and audio preamp-mixer the acoustic environment is mounted For example, the a single word by the system. The microphone recognition actually any connected system. are used to help and so improve the reliability of the The microphone is of good audio quality on a headset worn by the user. arrangement offers several Since it is worn by the stabilize advantages for speaker, he is and This microphone speech recognition. free to move around without having to worry about his being heard by the computer. The microphone is always a constant distance -29- from the speaker's mouth, thus eliminating one variable in room acoustics which may otherwise cause problems in word recognition. microphone is close to the direction, effects preamp-mixer aimed in that of outside noise are minimized. line, and to assure that the input is always makes the recognizer's The I/O speaker's mouth and is The audio is used to minimize the signal-to-noise ratio on the microphone's recognizer's Because the speech at the same audio level. job easier, and improves its This also reliability. interface unit enables the CPU to communicate with the recognizer hardware and the disc drives. The device also supports an RS-232 serial data interface through which the CPU can communicate with the NewsPeek host computer. The NEC personal capabilities computer runs a program to of the speech recognizer. have been formatted to templates 1677 on each of two mounted disc drives. serial interface. LOADBLOCK. vocabulary. 
digital voice The CPU accepts from the NewsPeek computer These commands are: The NEC computer transfers the specified number of voice templates from disc storage to the speech recognizer's The block of templates may disc, and can be copied to active The mini-floppy discs allow the storage of commands from its own keyboard, or via extend the start anywhere on the any starting slot in the recognizer's vocabulary, but the templates must be contiguous instances. For in both example, ten templates starting with number -30- fifty on the disc (#50-#59) may be copied to the active vocabulary starting at slot fifteen This SAVEBLOCK. (#15-#24). is similar to the LOADBLOCK command, but transfers a block of templates from the active vocabulary to disc storage. This command is used to create digital voice TRAIN_BLOCK. The NEC computer creates templates. speech recognizer's active vocabulary follow. user utterances which in the active vocabulary, but again, recognizer Upon receipt is activated. in the for the specified number This block may start anywhere of START_LISTEN. a block of templates it must be contiguous. of this command the speech From this point on, whenever the user says a word, the NEC computer sends the host computer an interrupt followed by the the slot number of the recognized word in active vocabulary and the confidence metric of that recognition. If the speaker's utterance is not recognized, an interrupt followed by a null slot number and confidence is sent. STOP_LISTEN. This command turns speech recognition off. Through the use of this simple command set, NEC speech recognition unit the utility of the is enhanced enormously. Without the processing power provided by the NEC personal computer, the speech recognizer is capable of -31- identifying only sixty words. With it, the recognizer draws from a vocabulary of over fifty times that size. 
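The block-transfer commands above can be illustrated with a small in-memory model. This is only a sketch: the class `RecognizerModel`, its method names, and the use of strings as stand-ins for voice templates are assumptions made for illustration, not the NEC program's actual interface. Only the command names, the sixty-slot active vocabulary, the 1677 templates per disc, and the contiguity requirement come from the text.

```python
ACTIVE_SLOTS = 60      # size of the recognizer's active vocabulary (from the text)
DISC_TEMPLATES = 1677  # templates stored per disc (from the text)


class RecognizerModel:
    """Hypothetical in-memory stand-in for the NEC unit; templates are plain strings."""

    def __init__(self):
        self.disc = [None] * DISC_TEMPLATES
        self.active = [None] * ACTIVE_SLOTS

    @staticmethod
    def _check_block(start, count, size):
        # A block must be contiguous and lie entirely inside its store.
        if count < 1 or start < 0 or start + count > size:
            raise ValueError(f"block {start}+{count} outside 0..{size - 1}")

    def load_block(self, disc_start, slot_start, count):
        """LOADBLOCK: copy `count` templates from disc into active slots."""
        self._check_block(disc_start, count, DISC_TEMPLATES)
        self._check_block(slot_start, count, ACTIVE_SLOTS)
        self.active[slot_start:slot_start + count] = self.disc[disc_start:disc_start + count]

    def save_block(self, slot_start, disc_start, count):
        """SAVEBLOCK: copy `count` templates from active slots back to disc."""
        self._check_block(slot_start, count, ACTIVE_SLOTS)
        self._check_block(disc_start, count, DISC_TEMPLATES)
        self.disc[disc_start:disc_start + count] = self.active[slot_start:slot_start + count]

    def train_block(self, slot_start, utterances):
        """TRAIN_BLOCK: create one template per utterance in contiguous slots."""
        self._check_block(slot_start, len(utterances), ACTIVE_SLOTS)
        for offset, word in enumerate(utterances):
            self.active[slot_start + offset] = f"template:{word}"


# The example from the text: disc templates #50-#59 go to active slots #15-#24.
unit = RecognizerModel()
for n in range(50, 60):
    unit.disc[n] = f"template:word{n}"
unit.load_block(disc_start=50, slot_start=15, count=10)
print(unit.active[15], unit.active[24])
```

The model leaves out START_LISTEN and STOP_LISTEN, since recognition itself happens in hardware; it is meant only to make the slot arithmetic and the contiguity check concrete.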
Since the host computer can interrupt the recognition process and issue commands, new words can be trained during application run-time. This feature can be used to eliminate the modality present in some speech recognition systems; the recognizer need not be put in 'train-mode' to learn a large number of words prior to being put in 'recognize-mode' for the duration of the application. For example, the large NewsPeek vocabulary is "grown" dynamically. Except for the initial command vocabulary (about twenty words), all training of the recognizer occurs one word at a time during NewsPeek's operation. Even for a vocabulary that is determined before run-time, the prospect of a 600-word recognizer training session should scare away any sane user.

In spite of the additional capabilities provided by the NEC processor, the recognizer still cannot distinguish an individual word from the many in its virtual memory until that word is loaded into the small active vocabulary. The responsibility for maintaining the state of the recognizer's active vocabulary lies with the host computer. In the NewsPeek system, a part of the user interface handler manages the virtual vocabulary, and is described in the next section.

6.0 The Vocabulary Predictor

The vocabulary predictor is a crucial component in the speech recognition system. When the predictor fails, recognition input goes unrecognized, regardless of the performance of the recognition hardware. Of course, if a system could determine a speaker's next utterance with 100% reliability, it would require no other input equipment, and would, in fact, render the speaker totally superfluous. NewsPeek's vocabulary predictor has a somewhat easier job, as it is allowed sixty guesses at the user's next input. It can, therefore, be imperfect.

It should be noted that a program for predicting the future does not necessarily require supernatural abilities. NewsPeek, by the nature of its
application and user interface, co-operates quite well with the vocabulary predictor. It is an appropriate system for utilizing this sort of speech recognizer. This is elaborated upon later, but a few points here will help clarify the description.

First, NewsPeek's local library of stories is in sections organized by topic, similar to the grouping of stories in an ordinary news magazine. Although the user need not be aware of this internal structure, it does exist, and it provides a convenient method for associating groups of words with groups of news stories.

Second, most NewsPeek commands are words from the news stories being examined by the user, and serve to make the user aware of other stories available for perusal. As a result of this property of the overall design, command words are frequently present on the output display. This group of words is available to the vocabulary predictor.

Third, also due to the nature of the perusal method, there is a good probability that a command will be used more than once in a short input sequence. User input history is accessible by the predictor.

The above three points are mentioned to demonstrate how an application and its interface can suggest natural lines for dividing a large vocabulary into smaller, more manageable blocks. When this is the case, as it is in NewsPeek, the vocabulary predictor's task may be reduced to that of finding a likely subset vocabulary from the set of words contained in a group of small blocks. The decision rules it employs for this operation may also be determined largely by the application it serves.

In general, the vocabulary predictor works in the following manner. Small blocks of words are derived from the total vocabulary. The number of words in this group should be much smaller than the main vocabulary, but should exceed the capacity of the speech recognition unit so that it will be fully utilized.
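The block-based selection that this section describes can be sketched in a few lines. This is a hedged illustration rather than NewsPeek's actual decision rule: the names `Block` and `choose_subset`, the convention that a larger number means a higher priority, the pooling of equal-priority blocks, and the placeholder words are all assumptions. The sixty-slot capacity, the five example blocks, and the use of last-access time as the weighting factor follow the text.

```python
from dataclasses import dataclass

ACTIVE_SLOTS = 60  # capacity of the recognizer's active vocabulary


@dataclass
class Block:
    name: str
    priority: int   # larger number = higher priority (assumed convention)
    words: list


def choose_subset(blocks, last_access, capacity=ACTIVE_SLOTS):
    """Fill the active vocabulary block by block, highest priority first.

    Blocks of equal priority are pooled; when a pool does not fit, its most
    recently accessed words (highest weighting factor) are taken first.
    """
    chosen = []
    for priority in sorted({b.priority for b in blocks}, reverse=True):
        pool = []
        for block in blocks:
            if block.priority != priority:
                continue
            for word in block.words:
                if word not in chosen and word not in pool:
                    pool.append(word)
        pool.sort(key=lambda w: last_access.get(w, 0), reverse=True)
        chosen.extend(pool[:capacity - len(chosen)])
        if len(chosen) >= capacity:
            break
    return chosen


# Placeholder data shaped like the sample situation in the text.
commands = [f"cmd{i}" for i in range(20)]         # (a) basic command set
recent = ["page0", "corr0", "old0", "old1", "old2"]  # (b) last five inputs
story_page = [f"page{i}" for i in range(25)]      # (c) words on the story page
correlation = [f"corr{i}" for i in range(15)]     # (d) words on the correlation page
red_sox = [f"sox{i}" for i in range(30)]          # (e) topic block

# Later position in this list = more recently accessed (made-up timestamps).
last_access = {w: t for t, w in enumerate(story_page + correlation, start=1)}

blocks = [
    Block("a", 4, commands),
    Block("b", 3, recent),
    Block("c", 2, story_page),
    Block("d", 2, correlation),
    Block("e", 1, red_sox),
]
subset = choose_subset(blocks, last_access)
print(len(subset))
```

With these inputs, blocks (a) and (b) take twenty-five slots, thirty-five of the thirty-eight remaining on-screen words fill the rest, and block (e) is left out, which matches the arithmetic of the sample situation.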
The blocks are assigned relative priorities to aid in the assembly of the final subset vocabulary. Words within the blocks will have already been assigned weighting values. A decision rule, which may itself be dependent on the state of the system, uses the priorities and weights to operate on the blocks and picks a subset vocabulary.

In a sample NewsPeek situation, a vocabulary might be constructed like this:

First, several blocks of words are isolated. These are (a) the basic NewsPeek command set, (b) the last five words input by the user over either channel (touch or voice), (c) trained words appearing on the output story page, (d) trained words appearing on the correlation page, and (e) words associated with stories about the Boston Red Sox baseball team.

Second, priorities are assigned to the blocks. Block (a) gets the highest priority; it is not only likely that the user will say a basic command word, it is essential that these commands always be available to him. Block (b) gets the next highest priority. Blocks (c) and (d) get equal priorities. Block (e) gets the lowest priority.

Third, the active subset vocabulary is calculated. Block (a) is included in full, as is block (b). Thirty-five slots are left in the subset, and there are forty words shared by blocks (c) and (d). Two of these are members of block (b). Of the remaining thirty-eight words, the thirty-five with the highest weighting factors (most recently used) are included. Because blocks of higher priority have filled the active space, block (e) is not included.

Underlying this implementation of the vocabulary predictor are two basic assumptions. The first is that there exists some method for sorting the entire vocabulary that is of value, given a specific application program. This is the weighting factor assignment, and is a kind of global vocabulary operator.

The second underlying assumption is that the application can itself suggest logical vocabulary blocks from which to construct the active subset vocabulary. This does more than just allow more complex assignment of weighting factors. Since the speech recognition unit operates by transferring blocks of words, its function can be optimized by identifying these blocks in some general way and organizing virtual memory to exploit them.

In the NewsPeek system, the weighting factor assigned to each word in the vocabulary is derived from the time that word was last accessed by the user. The least recently accessed word in the vocabulary has the lowest weighting factor; the most recently accessed has the highest. Other indicators may be used to augment or replace this simple weighting strategy. Several methods for calculating vocabulary weighting factors are still being investigated for NewsPeek's predictor, and are discussed in more detail later.

7.0 Applying the Virtual Vocabulary Speech Recognizer

The previous sections have extolled the virtues of speech in a multi-modal communication environment, discussed practical problems in the implementation of an automated speech recognition system, and described a sample application for such a system. In this section, the application of NewsPeek's speech recognition system will be detailed. Consideration will be given to issues involving the NewsPeek application directly and how the system in general is affected.

Briefly, then, a list of features making NewsPeek an interesting application for speech recognition:

* personalized application
* most user input is single word
* bi-modal interface (touch-screen and speech)
* recognition vocabulary is run-time dynamic
* two phase (on- and off-line) processing
* objective is manipulation of English text

A point by point discussion follows.

Since NewsPeek is a personalized electronic publication, there is only one user of the system. Thus, the problem of inter-speaker vocabulary variances is neatly skirted. (Not to make light of this, an important problem, but one left for
(Not to an important problem, but one left for -38- another day.) Each NewsPeek subscriber has his own mini-floppy disc, on which is encoded his personally created vocabulary. The second point hindrance in the and an asset as Due to NewsPeek's as the speech system aid the formal structure on the situation. far speech recognizer. ," _ The recognizer is or Imposing a input grammar does little to Most commands would be of stories about is concerned. lack of command syntax, a semantic processor cannot be employed to "V-o" above list may be considered both a the form, rectify the "List the "Do we have any pictures of ?" _ still faced with the task of distinguishing from the same large group of words in the vocabulary. Furthermore, the command to be executed is usually implicit, determined by the state of the activity. system and history of user Requiring the user to restate the obvious undermines the utility inherent step backward. in this mode of the plus side, large spoken vocabulary. semantic analysis is not necessarily and freeing the user's personal computer looked upon as a minor is this: advantage. words hardware. advantage, however, from the problem of so prevalent in the amalgam of fluent As previously mentioned, small vocabularies can be cheap, from this job may be The real single utterances do not suffer co-articulation variances speech. a The problem presented here is that of recognizing single words from a on communication, and is of isolated identified reasonably well by currently available By predicting a likely -39- subset of a large vocabulary to be recognized, the virtual vocabulary speech recognizer attempts to reduce the problem to one which has already been "solved." The context of the bi-modal interface is subsetting process. input The speech recognition is being processed over the other being displayed on the color monitor. aware of input over the local data structure. input mode, system "knows" what line, and what output is as well. 
When the user that word is found in a If the word is one which has been trained for speech recognition, it subset vocabulary. the vocabulary Of course, the system is speech channel, indicates a word via either by the user used in is loaded into the active The assumption here is that since the user is using this word to key his path through the news, there is good possibility he will be using it again in a the immediate future. This technique produced a bit of unexpected fallout; user can guarantee the presence of a word in vocabulary by first touching it the user was never intended to through conscious effort, extent. Because speech channel, of namely, the the active speech on the display screen. Although guide the prediction algorithm this has become an option to a small it can increase the speaker's confidence in the and presents little or no nuisance to him, it is some value. Knowledge of the output display is -40- also available to the speech system. Any trained word on the current output page may be included in the active subset vocabulary. In normal operation all of these words are made active, as they are all potential NewsPeek commands. environment This is the usual way to maintain the of redundant input modes. However, the active vocabulary is small, and the subset prediction algorithm may have some likely off-screen candidates for recognition. is the best mode of access for these words. Speech What to do if the user wishes to select one of these words? One method involves basing the decision upon user activity in the two input modes. If the history of this activity shows the user consistently using touch as the initial mode for selecting on-screen words, then the block of on-screen words is assigned a low priority relative to the off-screen block. words As the on-screen are touched they are loaded into the active subset vocabulary and may then be selected via either input mode. 
On the other hand, if user history shows repeated subsequent failures in recognizing spoken on-screen words, the priority assignments are reversed. This operation is one example of how the modes in a rich communication environment can supplement one another. The speech recognizer's function here is partly dependent on touch-screen input and output.

The NewsPeek vocabulary is run-time dynamic. Among other things, this means that new words are added and old words deleted as part of the normal operation of the system. Many applications that employ speech recognition use a static vocabulary created in advance of its use during a special training session. Training a large vocabulary in a single session can be tedious for the speaker, and in the case of NewsPeek, the vocabulary is not even precisely known before it is to be used. The system itself provides the group of words from which the user selects his vocabulary. Knowledge of the user's changing vocabulary can be used to improve the recognizer's performance, and the performance of NewsPeek in general.

Since NewsPeek's normal function includes a pre-processing phase with the user off-line, a logical place exists for any processing required by the speech system. The off-line operations for reformatting the speech recognizer's virtual memory come under this category. Although it is not practical to format an entire floppy disc after each of the user's verbal commands, it is desirable to have the organization of the system's vocabulary in memory reflect the sorting by blocks and weighting factors used in the construction of the subset vocabularies. After the user has gone off-line, the disc can be restored to a configuration dependent on the state of the vocabulary at the end of the previous session. This periodic cleanup assures optimal memory organization for the start of the user's next session.

The last item on the list of NewsPeek features notes that the system is used for manipulating English language text. These data are real words that the user can say. The NewsPeek user can easily associate verbal names with the objects he manipulates on the output display and throughout the database, just by correlating a spoken word with its printed counterpart. He does this automatically for himself, but the speech recognizer must have the operation performed for it: the user says, "Learn this," touches the word on the screen, then says it. Since the Nexis database provides the text, the NewsPeek reader is spared the use of the terminal keyboard.

As a result of this association operation, the computer acquires the ability to respond verbally to the user. Peripheral devices such as the Votrax Type 'N' Talk or the Prose Model 2000 can be used to convert text strings directly into audible English. This, in turn, may provide useful user feedback from the speech recognizer. For example, the speaker says a word. The recognizer gets a match, but of marginal confidence. Rather than ignore the match and ask the speaker to try again, the computer responds with, "Did you say ___?" The speech recognizer now has only to contend with a simple "yes" or "no" answer. (This technique is adapted from Schmandt [15].)

Actually, what happens in NewsPeek is a little subtler and more interesting than the mere naming of objects to facilitate the operations that manipulate them, for, in a sense, these objects and their names are one and the same. They are words. The words that appear in the NewsPeek stories are the words to be manipulated by the user. They are also the words he says, the words that must be recognized by the speech system.

Owing to the presence of this feature, the vocabulary predictor has a wealth of information about the user's task available to it. Although the current version of the virtual vocabulary does not exploit them, many possibilities for the enhancement of the vocabulary sorting algorithm exist. A partial list follows.

Temporal conditions -- When did the word first appear in the NewsPeek database? When did it last appear? When did it first/last appear in the current section? in a story actually read by the user?

Statistical conditions -- Does the word appear frequently throughout the NewsPeek database? throughout the current story or section? Does the word occur nowhere else within the current story/section? outside the current story/section? Does the word appear periodically (e.g., only in Monday's edition)?

Positional conditions -- Does the word appear in a headline? a story's lead paragraph? concluding paragraph? first or last line within a paragraph? Does the word occur in a phrase with a Nexis keyword? a phrase containing a frequently accessed word? a word appearing frequently throughout the current story or section? a word that appears nowhere else within the current story or section?

The positional conditions may even be computed recursively. For instance, does the word occur in a phrase containing a word discovered by the previous list of conditions? a phrase containing one of those words? and so on. It is, however, not likely that an implementation of this function could justify itself on the basis of either performance or computational cost.

Words satisfying one, some, or all of the above conditions are identifiably different from the rest of the NewsPeek text, and are thus potentially more (or less) likely to be accessed than the others. As this information is made available to the vocabulary predictor, improved speech recognition should result.

Similar analysis can be performed on the vocabulary using information about the user interface. For instance, one might ask about the time a word was first accessed. When was it last accessed by voice? by touch? Is it accessed frequently? periodically? and so forth.

In general, the speech system should be adaptable to various host applications.
They need not display all of the NewsPeek features singled out in this chapter, but those applications similar to NewsPeek should have the least difficulty employing the speech system, and, for the most part, see the best results.

8.0 Conclusion

The small recognition vocabulary of an inexpensive, commercially available speech recognizer can be extended by supplying additional memory and processing capabilities. By integrating the speech recognition system with an appropriate host application, the recognizer is privileged with useful information concerning both the user and his purpose; better recognizer performance is the result. Automatic recognition of natural, fluent, conversational speech is still a long way off. However, as strides are made in the field, speech recognition is destined to become a practical and popular method for computer input.

REFERENCES

[1] Barnett, J., "A Vocal Data Management System," IEEE Trans. Audio Electroacoust., vol. AU-21, June 1973, pp. 185 - 188.

[2] Bates, M., "The Use of Syntax in a Speech Understanding System," IEEE Trans. Acoust., Speech, and Signal Processing, vol. ASSP-23, Feb. 1975, pp. 112 - 117.

[3] Bolt, R. A., "Voice and Gesture at the Graphics Interface," Computer Graphics, Proceedings of ACM SIGGRAPH '80, vol. 14, no. 3, 1980, pp. 262 - 270.

[4] Chapanis, A., "Interactive Human Communication," Scientific American, vol. 232, no. 3, 1975, pp. 36 - 42.

[5] David, E. E. and Denes, P. B., Human Communication: A Unified View, New York, McGraw Hill, 1972.

[6] Hill, D. R., "Man-Machine Interaction Using Speech," Advances in Computers, Alt, F. L., Rubinoff, M., and Yovits, M. C., Eds., vol. II, New York, Academic Press, 1971, pp. 165 - 230.

[7] Lea, W. A., "Establishing the Value of Voice Communication With Computers," IEEE Trans. Audio Electroacoust., vol. AU-16, June 1968, pp. 184 - 197.

[8] Levinson, S. E. and Liberman, M. Y., "Speech Recognition by Computer," Scientific American, vol. 244, no. 4, 1981, pp. 64 - 76.

[9] Licklider, J. C. R., "Man Computer Symbiosis," Perspectives on the Computer Revolution, Pylyshyn, Z. W., Ed., Englewood Cliffs, NJ, Prentice-Hall, 1970, pp. 306 - 318.

[10] Nash-Webber, B., "Semantic Support for a Speech Understanding System," IEEE Trans. Acoust., Speech, and Signal Processing, vol. ASSP-23, Feb. 1975, pp. 124 - 128.

[11] Ochsman, R. B. and Chapanis, A., "The Effects of Ten Communication Modes on the Behavior of Teams During Co-operative Problem Solving," Int. J. Man-Machine Studies, vol. 6, 1974, pp. 579 - 619.

[12] Rabiner, L. R., "Considerations in Dynamic Time Warping Algorithms for Discrete Word Recognition," IEEE Trans. Acoust., Speech, and Signal Processing, vol. ASSP-26, Dec. 1978, pp. 575 - 582.

[13] Reddy, D. R., "Speech Recognition by Machine: A Review," Proceedings of the IEEE, vol. 64, no. 4, April 1976, pp. 501 - 531.

[14] Sakoe, H. and Chiba, S., "Dynamic Programming Algorithm Optimizations for Spoken Word Recognition," IEEE Trans. Acoust., Speech, and Signal Processing, vol. ASSP-26, Feb. 1978, pp. 43 - 49.

[15] Schmandt, C. and Hulteen, E. A., "The Intelligent Voice Interactive Interface," Proceedings Human Factors in Computer Systems, Gaithersburg, MD., 1982, National Bureau of Standards / ACM, pp. 363 - 366.

[16] Tappert, C. C. and Subrata, D. K., "Memory and Time Improvements in a Dynamic Programming Algorithm for Matching Speech Patterns," IEEE Trans. Acoust., Speech, and Signal Processing, vol. ASSP-26, Feb. 1978, pp. 583 - 586.

[17] Walker, D. E., "The SRI Speech Understanding System," IEEE Trans. Acoust., Speech, and Signal Processing, vol. ASSP-23, no. 5, Oct. 1975, pp. 397 - 416.