Intelligent talk-and-touch interfaces using multi-modal semantic grammars

From: Proceedings, Fourth Bar Ilan Symposium on Foundations of Artificial Intelligence. Copyright © 1995, AAAI (www.aaai.org). All rights reserved.
Bruce Krulwich and Chad Burkey
Center for Strategic Technology Research
Andersen Consulting LLP
100 South Wacker Drive, Chicago, IL 60606
krulwich@andersen.com
Abstract

Multi-modal interfaces have been proposed as a way to capture the ease and expressivity of natural communication. Interfaces of this sort allow users to communicate with computers through combinations of speech, gesture, touch, expression, etc. A critical problem in developing such an interface is integrating these different inputs (e.g., spoken sentences, pointing gestures, etc.) into a single interpretation. For example, in combining speech and gesture, a system must relate each gesture to the appropriate part of the sentence. We are investigating this problem as it arises in our talk-and-touch interfaces, which combine full-sentence speech and screen-touching. Our solution, which has been implemented in two completed prototypes, uses multi-modal semantic grammars to match screen touches to speech utterances. Through this mechanism, our systems can easily support wide variations in the speech patterns used to indicate touch references. Additionally, they can ask specific, focused questions of the user in the event of an inability to understand the input. They can also incorporate other semantic information, such as contextual references or references to previous sentence referents, through this single unified approach. Our two prototypes appear effective in providing a straightforward and powerful interface to novice computer users.
1. Talk and touch interfaces
Natural human dialogue is a lot more than sequences of words. Context, gestures, expressions, and implicit knowledge all play a considerable role in conveying complex thoughts and ideas in a discussion. These factors clarify the language, fill in the gaps, and disambiguate the spoken words themselves.
CSTaR's talk and touch project is attempting to incorporate information of this sort into an intelligent interface based on full-sentence speech recognition and touch-screen sensing¹ [Krulwich and Burkey, 1994]. Consider the following utterances, in the context of a typical electronic mail system:

1. Read this message (with a corresponding touch to a message header)
2. Send a message to him (with a corresponding touch to a name)
3. Send a message to him (without a touched name)
4. Forward this message to him and Bill (two touches: one message, one name)
5. Forward this message to him and Bill (with only a touched message)

¹ Our screen touches can be easily replaced with mouse clicks, etc.
In each of these cases, the system has to match the touched objects (references to messages or people) to the ambiguous slots in the sentence. In cases 3 and 5 there are more potential references to touched objects than there are screen touches, so additional inference is required to determine the missing information from context, or at least to determine specifically which information is unspecified.
This paper discusses the approach to these issues that is embodied in our two talk and touch prototypes. The first, an interface to a typical e-mail system shown in figure 1, allows commands such as those listed above. We use this system to introduce and illustrate our approach in section 2. The second, a video communications manager shown briefly in figure 2, supports commands for establishing video calls and multi-point video-conferences, sending and receiving multimedia e-mail, and sharing and manipulating documents during a video-conference. We discuss this second system in detail in section 3.
2. Multi-modal semantic grammars
The first step in our approach to developing an integrated interpretation of speech and screen touches is to parse the spoken text into a multi-modal semantic grammar.² In general, semantic grammars are language grammars that parse text into semantic components instead of syntactic ones [Burton, 1976]. For example, a typical syntactic grammar would parse the sentence forward the new message from John to Bill and my team into syntactic constructs such as subject, verb, direct object, and prepositional phrase. A semantic grammar, on the other hand, would parse it in terms of semantic constructs such as command, message specification, author, and recipients. This allows for easier and more accurate interpretation, as well as improving performance through early application of semantic constraints.
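For illustration only (the slot names and data structures here are hypothetical, and are not those used in our prototypes), the difference between the two parse styles might be rendered as follows:

# Hypothetical rendering of the two parse styles for
# "forward the new message from John to Bill and my team".

syntactic_parse = {
    "verb": "forward",
    "direct_object": "the new message from John",
    "prepositional_phrase": "to Bill and my team",
}

semantic_parse = {
    "command": "forward",
    "message_spec": {"status": "new", "author": "John"},
    "recipients": ["Bill", "my team"],
}

The semantic version exposes exactly the slots the interpreter needs, so constraints such as "an author must be a person" can be applied as soon as the parse is built.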
We have extended this idea by adding semantic constructs to the grammar that refer to other sources of input information, such as screen touches, as well as constructs that refer to context. For example, the sentence "forward this message to Bill and Chad" would be parsed to recognize that "this message" is not only a message specification, but is also a reference either to a touched screen object or to a particular contextually-significant message.
A sample grammar of this sort, for top-level commands in our e-mail prototype, is shown in figure 3. The grammar defines nodes, or "tags" (shown in angled brackets), that correspond to blocks of text in the input string. Each tag can be seen as a slot that may be filled by a portion of an input sentence. Instead of the phrase from him being recognized as a prepositional phrase, as it might be with a syntactic grammar, it can be recognized as an author of a message (the tag <fromPerson>), and in particular as a reference to a touched person on the screen (the tag <pointPerson>).

² Issues relating to speech recognition are discussed in section 4. In our discussion here we assume that the spoken commands have been recognized and are available in text form.
Figure 1: A prototype talk and touch e-mail system workspace
The following are some of the tags used in parsing the sentence forward this message to Bill and Chad, and their corresponding text:

<sentence> = "forward this message to Bill and Chad"
<cmdForward> = "forward this message to Bill and Chad"
<message> = "this message"
<pointItem> = "this message"
<toPeople> = "to Bill and Chad"
<people> = "Bill and Chad"
Figure 2: A video communications manager system
<sentence> ::= <cmdForward> | <cmdReply> | <cmdItem> | <cmdCheck> | <cmdReadMessage>
    | <cmdReadWhich> | <cmdWho> | <cmdWhoIs> | <cmdDelete> | <cmdScroll>.
<cmdForward> ::= <forward> <message> <toPeople> | <forward> <toPeople>
    | <forward> <message> | <forward>.
<cmdReply> ::= <create> <reply> TO <message> | <create> <reply> TO <person>
    | <reply> TO <message> | <reply> TO <person> | <create> <reply> | <reply>.
<cmdItem> ::= <create> <item> <toPeople> | <create> <item>.
<cmdCheck> ::= <check> <new> MAIL FROM MY <groups>
    | <check> <new> MESSAGES FROM MY <groups>
    | <check> <new> MAIL | <check> <new> MESSAGES.
<cmdReadMessage> ::= <read> <message>.
<cmdReadWhich> ::= <read> THE <which> <item> <fromPerson> | <read> THE <which> <item>.
<cmdWho> ::= <who> RECEIVED <message> | <who> GOT <message> | <who> WAS SENT <message>.
<cmdWhoIs> ::= <whoIs> <person>.
<cmdDelete> ::= <delete> <message>.
<cmdScroll> ::= <scroll> <direction> BY <number> | <scroll> <direction>.
<message> ::= <pointItem> | THE <new> <item> <fromPerson> | THE <item> <fromPerson>
    | <msgNum> | THE <new> <reply> <fromPerson> | THE <reply> <fromPerson>.
<pointItem> ::= <pointWhich> <item> | <pointWhich> <reply> | <pointWhich>.
<msgNum> ::= <item> NUMBER <number>.
<pointWhich> ::= THIS | THAT.
<item> ::= MESSAGE | NOTE | LETTER.
<toPeople> ::= TO <people>.
<people> ::= <peopleItem> | <peopleItem> AND <people>.
<peopleItem> ::= <person> | <groups> | MY <groups>.
<fromPerson> ::= FROM <person>.
<person> ::= <pointPerson> | <name> | <me>.
<pointPerson> ::= HIM | HER | HE | SHE | <pointWhich> PERSON.
<name> ::= PAULA | BILL | BRUCE | CHAD.
<create> ::= CREATE A | CREATE A NEW | SEND A | SEND A NEW.
<check> ::= CHECK FOR | SCAN FOR | DO I HAVE ANY | READ MY.
<new> ::= NEW | UNREAD.
<next> ::= NEXT | NEXT <new>.
<previous> ::= PREVIOUS | PREVIOUS <new>.
<which> ::= <next> | <previous>.
<who> ::= TELL ME WHO <else> | TELL ME WHO.
<direction> ::= UP | DOWN.
<groups> ::= FRIENDS | TEAM.

Figure 3: A partial multi-modal semantic grammar for e-mail commands
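As a rough sketch of the mechanics (illustrative code only, not the implementation used in our prototypes), a handful of the productions in figure 3 could be encoded and matched against a recognized utterance as follows; the tags listed in MULTIMODAL_TAGS are the ones later matched against screen touches.

# Illustrative encoding of a few productions from figure 3 (not our prototypes' code).
# Nonterminal tags are written "<tag>"; terminals are plain words.  Alternatives are
# tried in order, so longer alternatives are listed first.

GRAMMAR = {
    "<cmdForward>":  [["<forward>", "<message>", "<toPeople>"]],
    "<forward>":     [["FORWARD"]],
    "<message>":     [["<pointItem>"]],
    "<pointItem>":   [["<pointWhich>", "<item>"], ["<pointWhich>"]],
    "<pointWhich>":  [["THIS"], ["THAT"]],
    "<item>":        [["MESSAGE"], ["NOTE"], ["LETTER"]],
    "<toPeople>":    [["TO", "<people>"]],
    "<people>":      [["<person>", "AND", "<people>"], ["<person>"]],
    "<person>":      [["<pointPerson>"], ["<name>"]],
    "<pointPerson>": [["HIM"], ["HER"]],
    "<name>":        [["BILL"], ["CHAD"]],
}

# Tags that refer to another input modality; their bindings are filled from screen touches.
MULTIMODAL_TAGS = {"<pointItem>", "<pointWhich>", "<pointPerson>"}

def parse(tag, tokens, pos=0):
    """Return (parse_tree, next_position) for the first alternative that matches, else None."""
    if tag not in GRAMMAR:                         # terminal word
        if pos < len(tokens) and tokens[pos] == tag:
            return tag, pos + 1
        return None
    for alternative in GRAMMAR[tag]:
        children, p = [], pos
        for symbol in alternative:
            result = parse(symbol, tokens, p)
            if result is None:
                break
            child, p = result
            children.append(child)
        else:                                      # every symbol in this alternative matched
            return (tag, children), p
    return None

# A full parse consumes every token: end == len(tokens).
tree, end = parse("<cmdForward>", "FORWARD THIS MESSAGE TO BILL AND HIM".split())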
A background assumption that we are making is that the spoken text and the screen touches cannot be matched temporally. If our systems could simply compare the times that particular words were spoken with the times of the screen touches, they could of course determine an a priori most likely match between the touches and their referents. Unfortunately, the current state of the market in speech technology makes accurate time-stamping of multiple input modalities impossible. More importantly, informal studies have shown a tendency for people to have their screen touches lag behind their spoken references. For both reasons, more inference is required to match spoken references with screen touches.
The second interpretation step, then, is to match the multi-modal references in the parse to the touched screen objects. The simplest case, of course, is when the number of screen touches matches the number of multi-modal tags. In this case, the references in the command are resolved with the appropriate components of the touched objects. For example, suppose the spoken command is "forward this message to him and her," and three screen objects were touched. The system determines that a <cmdForward> was given, and sees that the <message> tag refers to a <pointWhich>. This is interpreted by resolving the <pointWhich> with the first screen touch, which is to a message. The system then sees that the <toPeople> contains two <pointPerson> tags, so it resolves the two recipients of the message with the second and third screen touches by checking that the two touched objects have associated names. The system then executes the FORWARD command with the given message and destination names.
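A minimal sketch of this simple case, using hypothetical data structures rather than our prototypes' own, is:

# One-to-one resolution sketch (hypothetical data structures).  Each multi-modal tag
# contributes a required type, e.g. <pointItem> -> "message", <pointPerson> -> "person".

def resolve_one_to_one(required_types, touches):
    """required_types: e.g. ["message", "person", "person"], in sentence order.
    touches: touched screen objects, in touch order; each carries the semantic
    types it can stand for, e.g. {"types": {"message": msg_id, "person": "John"}}."""
    if len(required_types) != len(touches):
        return None                                # handled by the heuristics below
    bindings = []
    for required, touch in zip(required_types, touches):
        if required not in touch["types"]:
            return None                            # type clash: further inference needed
        bindings.append(touch["types"][required])
    return bindings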
The type checking of this resolution is significant, and is crucial to proper handling of more complex situations. Many objects and words on the screen can be interpreted in a number of ways. In the e-mail system workspace shown in figure 1, for example, the name of a message author might be touched as a reference to the person as an author, or as a recipient, or instead to the message itself. This is even more apparent in more complex domains, in which, for example, a document icon may indicate a reference to the document itself, its author, its project, or its purpose.
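Sketched with the same hypothetical structures as above, a single touched object can offer several typed readings, from which the resolution step picks whichever one the sentence requires:

# Sketch of how one touched object can stand for several semantic types
# (hypothetical object layout, not the prototypes' code).

def readings_of(touched):
    """Map a touched screen object to every semantic type it can stand for."""
    if touched["kind"] == "message_header":
        return {"message": touched["message_id"],
                "person": touched["author"]}       # the header also names a person
    if touched["kind"] == "name":
        return {"person": touched["name"]}
    if touched["kind"] == "document_icon":
        return {"document": touched["doc_id"],
                "person": touched["author"]}
    return {}

A mapping of this sort is what would populate the per-touch "types" table used in the resolution sketch above.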
In our previous example, "forward this message to him and her," suppose that there are two recorded screen touches, both of which are to message indicators and could thus refer to either a message or its author. The system parses the message into the multi-modal semantic grammar from figure 3, and determines that the spoken command is a <cmdForward>, that the <message> is a <pointItem>, and that the recipients are two <pointPerson>s. The system then attempts to resolve the three references to screen touches with the two actual screen touches, and realizes that a simple one-to-one mapping is not possible.

The system then checks the previous command to see what messages or people were referenced, to use this context to resolve the current references. If the previous command referred primarily to a message, as is most likely, the system will assume that the current command is referring to the message from the previous command, and that the two screen touches are resolving the recipients of the message. If, on the other hand, the previous command is purely person-oriented (e.g., who is he), the system will assume that the previously referred-to person is one of the recipients, and resolve the first screen touch with the message to forward, and the second with the second recipient.
If, on the other hand, the first screen touch was to a person's name, it could not be interpreted as pointing to a message. In this case, both touches are assumed to refer to recipients, and the message has to be determined from context. If there is no way to determine the message, the user will be asked to specify the message. Because the system was able to interpret the bulk of the command, the question posed to the user can be quite focused, such as "Which message do you want to forward to Chad and Bill?"
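One simple rendering of these heuristics (again with hypothetical helpers, not our prototypes' actual code) fills slots from the previous command's referents when there are more references than touches, and falls back to a focused question when a slot remains open:

# Fallback resolution sketch: context first for surplus references, then touches,
# then a focused question (hypothetical data structures).

def resolve_with_context(required_types, touches, previous_referents):
    """previous_referents: e.g. {"message": last_msg} from the preceding command."""
    unresolved = len(required_types) - len(touches)
    bindings, remaining_touches = [], list(touches)
    for required in required_types:
        if unresolved > 0 and required in previous_referents:
            bindings.append(previous_referents[required])     # fill from context
            unresolved -= 1
        elif remaining_touches and required in remaining_touches[0]["types"]:
            bindings.append(remaining_touches.pop(0)["types"][required])
        else:
            return ask_focused_question(required, bindings)
    return bindings

def ask_focused_question(missing_type, partial_bindings):
    # e.g. "Which message do you want to forward to Chad and Bill?"
    return {"ask_about": missing_type, "already_resolved": partial_bindings}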
In general, the greater the disparity between spoken references to touched objects and the actual screen touches, the more inference may be needed to interpret the command, and the greater the likelihood of having to ask the user questions before proceeding. Our goal in all these cases has been to utilize whatever information is available to construct a partial interpretation of the sentence, and to be able to ask the user as focused a question as possible. While the heuristics we have been discussing are only some of many that are possible, we have found in practice that our system is almost always able to correctly interpret ambiguous commands from beginning users.

Figure 4: A communications workspace
3. A second prototype: Video communications management
Our second talk and touch prototype, a video communications manager, is shown in figure 4. The system supports commands to start and end video calls or multi-point video conferences, to play and send multimedia mail messages, to share and manipulate documents during video conferences, and so on. Our goals in developing this prototype were threefold. First, we wanted to apply talk and touch interfaces to a domain that was inherently verbal and visual, as opposed to a domain like e-mail that was inherently text oriented. Second, we wanted to increase the number of types of objects that could be referenced multi-modally, while decreasing the number of potentially ambiguous references in our spoken commands. Third, we wanted to integrate our work with other ongoing research at CSTaR, particularly in support for multimedia collaboration and communication.
<sentence> ::= <agent> <command> | <command>.
<command> ::= <cmdCall> | <cmdSend> | <cmdPickup> | <cmdHangup> | <cmdPlay>.
<cmdPlay> ::= <play> <message>.
<cmdSend> ::= SEND <toPeople> A <messageType> | SEND <toPeople> A <messageType> <msgType>
    | SEND A <messageType> TO <toPeople> | SEND A <messageType> TO <toPeople> <msgType>.
<cmdPickup> ::= ANSWER <call> | PICK UP <call> | SHOW <call> | PICK <itCall> UP | PICK IT UP.
<cmdCall> ::= CALL <toPeople> | CALL <toPeople> <callType> | GET <toPeople> ON THE LINE
    | GET <toPeople> ON THE LINE <callType> | SET UP A <confType> WITH <toPeople>
    | SET UP A <confType> WITH <toPeople> <callType>.
<cmdHangup> ::= HANG UP <call> | DISCONNECT <call> | END <call>.
<call> ::= THE CALL TO <toPeople> | THE CALL FROM <fromPerson> | <pointCall> | <itCall>.
<message> ::= <pointMsg> | <itMsg> | THE <messageType> FROM <fromPerson>
    | THE <msgNum> <messageType>.
<confType> ::= CONFERENCE CALL | TELECONFERENCE | VIDEO CONFERENCE.
<messageType> ::= MESSAGE | REPLY.
<toPeople> ::= <toPerson> | <toPerson> AND <toPerson> | <toPerson> AND <toPerson> AND <toPerson>
    | <toPerson> <toPerson> AND <toPerson>.
<toPerson> ::= <name> | <pointPerson>.
<pointPerson> ::= HIM | HER | THEM.
<callType> ::= AUDIO ONLY.
<msgType> ::= AUDIO ONLY | TEXT ONLY.
<msgNum> ::= FIRST | SECOND | THIRD | FOURTH.
<play> ::= PLAY | SHOW.
<pointCall> ::= THIS | THIS CALL.
<fromPerson> ::= <name> | <pointPerson>.
<name> ::= BRUCE KRULWICH | BOB LORD | MINDY COHN | MARK PAUL | PALO ALTO.
<agent> ::= EINSTEIN.
<pointMsg> ::= THIS | THIS MESSAGE.
<itCall> ::= IT.
<itMsg> ::= IT.

Figure 5: A grammar for communications management
Figure 4 shows the workspace of our communication manager system. In the center of the screen are messages, with the picture of the sender and the name and date of the message. These correspond to multimedia messages that have been sent or have been left by people trying to call. On the bottom are two areas for active or pending calls, with incoming calls on the left and outgoing calls on the right. Along the left side is an area for feedback from the system, with the recognized spoken sentence, the feedback from the talk and touch component, and the feedback from the speech recognition subsystem.
Figure 5 shows part of the multi-modal semantic grammar for commands in the video communications workspace. This system allows far fewer ambiguous screen touches than the e-mail system, due to the objects on the screen being more distinct and categorized, and due to the lack of support for commands like the forward this message to him and her command that we saw earlier. Grammar tags such as <pointMsg>, <pointCall>, and <pointPerson> correspond to screen touches, and none of the commands at this screen include references to more than one type of screen object.³

Figure 6: A multi-point video conference sharing a document
As an example, if the user gives the command "send a message to Bob Lord and him," the system will parse the sentence and attempt to resolve the single <pointPerson> reference. If there is a single screen touch recorded, the system will attempt to interpret the touched object as a person, be it the person shown in a picture, or the sender of an indicated message, or the other person on a call. If no touch has been recorded, the system will use contextual references from the previous commands, as discussed earlier. If none exist, a focused question will be posed to the user.
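A compact sketch of this cascade (hypothetical object fields, not the prototype's code) is:

# Resolving a single <pointPerson>: coerce a recorded touch to a person, then fall
# back to context; returning None signals that a focused question should be posed.

def resolve_point_person(touches, previous_referents):
    for touch in touches:
        if touch["kind"] == "picture":
            return touch["person"]
        if touch["kind"] == "message":
            return touch["sender"]            # the sender of an indicated message
        if touch["kind"] == "call":
            return touch["remote_party"]      # the other person on the call
    return previous_referents.get("person")   # None => ask, e.g. "Send it to whom?"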
A command such as "set up a video conference with Palo Alto, Mindy Cohn, and him" will establish a multi-point video conference. If the <pointPerson> resolves with a touch to Bob Lord, the conference shown in figure 6 will be started as soon as all participants are available. Calls and conferences of this sort support a variety of multi-modal commands, through the grammar shown in figure 7. As with the communications management grammar, the commands for video conference control feature no multi-reference sentences, and are thus much simpler to interpret.
³ There are many issues involved in designing an interface to facilitate straightforward multi-modal interpretation, which are beyond the scope of the present paper.
4. Discussion and future work
The research that we have presented can be viewed from a number of perspectives, and raises a variety of issues.
One of the critical issues in the research and development of speech recognition systems has been the need for a grammar, or other sources of constraints on the input utterances, to allow for accurate recognition of continuous speech (e.g., [Reddy, 1976; Lee et al., 1990; Huang et al., 1993]). Early approaches used a finite-state grammar, such as a standard BNF grammar, for this purpose. More recent research has generalized this approach to avoid its limitations by replacing the finite-state grammar with a statistical model of word ordering. In our research, however, we have found grammar-based speech recognition to be suitable, primarily because of our need for the multi-modal semantic grammars for interpretation. By using the same grammars for both purposes, we integrate the processing of speech and interpretation, apply our semantic knowledge early in processing, and facilitate improved performance.⁴ There may, however, be other benefits of more general approaches, such as the ability to handle new names or phrases dynamically. We are currently exploring this possibility.
Our approach has to date only been applied to resolving references to touched objects. There are, however, many other uses of gesture in speech [Feldman and Rime, 1991] that have been investigated previously in multi-modal interfaces (e.g., [Bolt and Herranz, 1992; Koons, 1994]). Our approach can easily be extended to include references to screen areas or numerical ranges along axes. It is more difficult, and the subject of ongoing research, to incorporate gestures representing the commands themselves (e.g., an "X" gesture for deletion) into our multi-modal semantic grammars.

Lastly, it is crucial that we develop more powerful techniques for incorporating contextual information. Just as we have developed successful approaches for designing grammars to support multi-modal references, we have to enable our grammars to better support references to contextually relevant objects and people. If we can integrate contextual references and gesture-based commands into our integrated approach to multi-modal grammars, we will truly be able to support high-quality interactions between people and computers.
Acknowledgments: We wish to thank Anatole Gershman, Dave Beck, Lucian Hughes, Steve Sate, Kishore Swaminathan, and Larry Birnbaum for many useful discussions on the research presented, and the innumerable visitors that have seen our demonstrations for their feedback.
⁴ We are currently using the IBM Continuous Speech System™, which is based on Carnegie Mellon's Sphinx I system. We have also used the PE400™ system from Speech Systems Inc.
References
Bolt, R., 1980. Put that there: Voice and gesture at the graphics interface. Computer Graphics (Proceedings of the ACM SIGGRAPH '80), Vol. 14, No. 3.

Bolt, R. and Herranz, E., 1992. Two-handed gesture in multi-modal dialog. In Proceedings of the Fifth Annual Symposium on User Interface Software and Technology, Monterey, CA.

Burton, R., 198x. Semantic grammars. In The Encyclopedia of Artificial Intelligence, Feigenbaum and Barr, eds.

Burton, R., 1976. Semantic grammar: An engineering technique for constructing natural language understanding systems. BBN technical report 3453, Cambridge, MA.

Feldman, R. and Rime, B., 1991. Fundamentals of Nonverbal Behavior. Cambridge University Press.

Huang, X., Alleva, F., Hon, H., Hwang, M., Lee, K., and Rosenfeld, R., 1993. The SPHINX-II speech recognition system: An overview. Computer Speech and Language, volume 2, pp. 137-148.

Koons, D., 1994. Capturing and interpreting multi-modal descriptions with multiple representations. In Working Notes of the 1994 AAAI Spring Symposium on Intelligent Multi-Media Multi-Modal Systems, Stanford, CA, pp. 13-21.

Krulwich, B. and Burkey, C., 1994. Natural command interfaces incorporating speech and gesture. In Working Notes of the 1994 Conference on Lifelike Computer Characters, Snowbird, Utah.

Lee, K., Hon, H., and Reddy, R., 1990. An overview of the SPHINX speech recognition system. IEEE Transactions on Acoustics, Speech, and Signal Processing, pp. 35-45.

Reddy, R., 1976. Speech recognition by machine: A review. IEEE Proceedings, 64:4, pp. 502-531.