Intelligent talk-and-touch interfaces using multi-modal semantic grammars

From: Proceedings, Fourth Bar Ilan Symposium on Foundations of Artificial Intelligence. Copyright © 1995, AAAI (www.aaai.org). All rights reserved.
Bruce Krulwich and Chad Burkey
Center for Strategic Technology Research
Andersen Consulting LLP
100 South Wacker Drive, Chicago, IL 60606
krulwich@andersen.com
Abstract

Multi-modal interfaces have been proposed as a way to capture the ease and expressivity of natural communication. Interfaces of this sort allow users to communicate with computers through combinations of speech, gesture, touch, expression, etc. A critical problem in developing such an interface is integrating these different inputs (e.g., spoken sentences, pointing gestures, etc.) into a single interpretation. For example, in combining speech and gesture, a system must relate each gesture to the appropriate part of the sentence. We are investigating this problem as it arises in our talk-and-touch interfaces, which combine full-sentence speech and screen-touching. Our solution, which has been implemented in two completed prototypes, uses multi-modal semantic grammars to match screen touches to speech utterances. Through this mechanism, our systems can easily support wide variations in the speech patterns used to indicate touch references. Additionally, they can ask specific, focused questions of the user in the event of an inability to understand the input. They can also incorporate other semantic information, such as contextual references or references to previous sentence referents, through this single unified approach. Our two prototypes appear effective in providing a straightforward and powerful interface to novice computer users.
1. Talk and touch interfaces
Natural human dialogue is a lot more than sequences of words. Context, gestures, expressions, and implicit knowledge all play a considerable role in conveying complex thoughts and ideas in a discussion. These factors clarify the language, fill in the gaps, and disambiguate the spoken words themselves.
CSTaR's talk and touch project is attempting to incorporate information of this sort into an intelligent interface based on full-sentence speech recognition and touch-screen sensing¹ [Krulwich and Burkey, 1994]. Consider the following utterances, in the context of a typical electronic mail system:

1. Read this message (with a corresponding touch to a message header)
2. Send a message to him (with a corresponding touch to a name)
3. Send a message to him (without a touched name)
4. Forward this message to him and Bill (two touches: one message, one name)
5. Forward this message to him and Bill (with only a touched message)

¹ Our screen touches can be easily replaced with mouse clicks, etc.
In each of these cases, the system has to match the touched objects (references to messages or people) to the ambiguous slots in the sentence. In cases 3 and 5 there are more potential references to touched objects than there are screen touches, so additional inference is required to determine the missing information from context, or at least to determine specifically which information is unspecified.
This paper discusses the approach to these issues that is embodied in our two talk and touch prototypes. The first, an interface to a typical e-mail system shown in figure 1, allows commands such as those listed above. We use this system to introduce and illustrate our approach in section 2. The second, a video communications manager shown briefly in figure 2, supports commands for establishing video calls and multi-point video-conferences, sending and receiving multimedia e-mail, and sharing and manipulating documents during a video-conference. We discuss this second system in detail in section 3.
2. Multi-modal semantic grammars
The first step in our approach to developing an integrated interpretation of speech and screen touches is to parse the spoken text into a multi-modal semantic grammar.² In general, semantic grammars are language grammars that parse text into semantic components instead of syntactic ones [Burton, 1976]. For example, a typical syntactic grammar would parse the sentence forward the new message from John to Bill and my team into syntactic constructs such as subject, verb, direct object, and prepositional phrase. A semantic grammar, on the other hand, would parse it in terms of semantic constructs such as command, message specification, author, and recipients. This allows for easier and more accurate interpretation, as well as improving performance through early application of semantic constraints.
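For illustration only (the slot names and data structures here are hypothetical, and are not those used in our prototypes), the difference between the two parse styles might be rendered as follows:

# Hypothetical rendering of the two parse styles for
# "forward the new message from John to Bill and my team".

syntactic_parse = {
    "verb": "forward",
    "direct_object": "the new message from John",
    "prepositional_phrase": "to Bill and my team",
}

semantic_parse = {
    "command": "forward",
    "message_spec": {"status": "new", "author": "John"},
    "recipients": ["Bill", "my team"],
}

The semantic version exposes exactly the slots the interpreter needs, so constraints such as "an author must be a person" can be applied as soon as the parse is built.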
We have extended this idea by adding semantic constructs to the grammar that refer to other sources of input information, such as screen touches, as well as constructs that refer to context. For example, the sentence "forward this message to Bill and Chad" would be parsed to recognize that "this message" is not only a message specification, but is also a reference either to a touched screen object or to a particular contextually-significant message.
A sample grammar of this sort, for top-level commands in our e-mail prototype, is shown in figure 3. The grammar defines nodes, or "tags" (shown in angled brackets), that correspond to blocks of text in the input string. Each tag can be seen as a slot that may be filled by a portion of an input sentence. Instead of the phrase from him being recognized as a prepositional phrase, as it might be with a syntactic grammar, it can be recognized as an author of a message (the tag <fromPerson>), and in particular as a reference to a touched person on the screen (the tag <pointPerson>).

² Issues relating to speech recognition are discussed in section 4. In our discussion here we assume that the spoken commands have been recognized and are available in text form.
Figure 1: A prototype talk and touch e-mail system workspace
The following are some of the tags used in parsing the sentence forward this message to Bill and Chad, and their corresponding text:

<sentence> = "forward this message to Bill and Chad"
<cmdForward> = "forward this message to Bill and Chad"
<message> = "this message"
<pointItem> = "this message"
<toPeople> = "to Bill and Chad"
<people> = "Bill and Chad"
Figure 2: A video communications manager system
<sentence> ::= <cmdForward> | <cmdReply> | <cmdItem> | <cmdCheck> | <cmdReadMessage>
    | <cmdReadWhich> | <cmdWho> | <cmdWhoIs> | <cmdDelete> | <cmdScroll>.
<cmdForward> ::= <forward> <message> <toPeople> | <forward> <toPeople>
    | <forward> <message> | <forward>.
<cmdReply> ::= <create> <reply> TO <message> | <create> <reply> TO <person>
    | <reply> TO <message> | <reply> TO <person> | <create> <reply> | <reply>.
<cmdItem> ::= <create> <item> <toPeople> | <create> <item>.
<cmdCheck> ::= <check> <new> MAIL FROM MY <groups>
    | <check> <new> MESSAGES FROM MY <groups>
    | <check> <new> MAIL | <check> <new> MESSAGES.
<cmdReadMessage> ::= <read> <message>.
<cmdReadWhich> ::= <read> THE <which> <item> <fromPerson> | <read> THE <which> <item>.
<cmdWho> ::= <who> RECEIVED <message> | <who> GOT <message> | <who> WAS SENT <message>.
<cmdWhoIs> ::= <whoIs> <person>.
<cmdDelete> ::= <delete> <message>.
<cmdScroll> ::= <scroll> <direction> BY <number> | <scroll> <direction>.
<message> ::= <pointItem> | THE <new> <item> <fromPerson> | THE <item> <fromPerson>
    | <msgNum> | THE <new> <reply> <fromPerson> | THE <reply> <fromPerson>.
<pointItem> ::= <pointWhich> <item> | <pointWhich> <reply> | <pointWhich>.
<msgNum> ::= <item> NUMBER <number>.
<pointWhich> ::= THIS | THAT.
<item> ::= MESSAGE | NOTE | LETTER.
<toPeople> ::= TO <people>.
<people> ::= <peopleItem> | <peopleItem> AND <people>.
<peopleItem> ::= <person> | <groups> | MY <groups>.
<fromPerson> ::= FROM <person>.
<person> ::= <pointPerson> | <name> | <me>.
<pointPerson> ::= HIM | HER | HE | SHE | <pointWhich> PERSON.
<name> ::= PAULA | BILL | BRUCE | CHAD.
<create> ::= CREATE A | CREATE A NEW | SEND A | SEND A NEW.
<check> ::= CHECK FOR | SCAN FOR | DO I HAVE ANY | READ MY.
<new> ::= NEW | UNREAD.
<next> ::= NEXT | NEXT <new>.
<previous> ::= PREVIOUS | PREVIOUS <new>.
<which> ::= <next> | <previous>.
<who> ::= TELL ME WHO <else> | TELL ME WHO.
<direction> ::= UP | DOWN.
<groups> ::= FRIENDS | TEAM.

Figure 3: A partial multi-modal semantic grammar for e-mail commands
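As a rough sketch of the mechanics (illustrative code only, not the implementation used in our prototypes), a handful of the productions in figure 3 could be encoded and matched against a recognized utterance as follows; the tags listed in MULTIMODAL_TAGS are the ones later matched against screen touches.

# Illustrative encoding of a few productions from figure 3 (not our prototypes' code).
# Nonterminal tags are written "<tag>"; terminals are plain words.  Alternatives are
# tried in order, so longer alternatives are listed first.

GRAMMAR = {
    "<cmdForward>":  [["<forward>", "<message>", "<toPeople>"]],
    "<forward>":     [["FORWARD"]],
    "<message>":     [["<pointItem>"]],
    "<pointItem>":   [["<pointWhich>", "<item>"], ["<pointWhich>"]],
    "<pointWhich>":  [["THIS"], ["THAT"]],
    "<item>":        [["MESSAGE"], ["NOTE"], ["LETTER"]],
    "<toPeople>":    [["TO", "<people>"]],
    "<people>":      [["<person>", "AND", "<people>"], ["<person>"]],
    "<person>":      [["<pointPerson>"], ["<name>"]],
    "<pointPerson>": [["HIM"], ["HER"]],
    "<name>":        [["BILL"], ["CHAD"]],
}

# Tags that refer to another input modality; their bindings are filled from screen touches.
MULTIMODAL_TAGS = {"<pointItem>", "<pointWhich>", "<pointPerson>"}

def parse(tag, tokens, pos=0):
    """Return (parse_tree, next_position) for the first alternative that matches, else None."""
    if tag not in GRAMMAR:                         # terminal word
        if pos < len(tokens) and tokens[pos] == tag:
            return tag, pos + 1
        return None
    for alternative in GRAMMAR[tag]:
        children, p = [], pos
        for symbol in alternative:
            result = parse(symbol, tokens, p)
            if result is None:
                break
            child, p = result
            children.append(child)
        else:                                      # every symbol in this alternative matched
            return (tag, children), p
    return None

# A full parse consumes every token: end == len(tokens).
tree, end = parse("<cmdForward>", "FORWARD THIS MESSAGE TO BILL AND HIM".split())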
A background assumption that we are making is that the spoken text and the screen touches cannot be matched temporally. If our systems could simply compare the times that particular words were spoken with the times of the screen touches, they could of course determine an a priori most likely match between the touches and their referents. Unfortunately, the current state of the market in speech technology makes accurate time-stamping of multiple input modalities impossible. More importantly, informal studies have shown a tendency for people to have their screen touches lag behind their spoken references. For both reasons, more inference is required to match spoken references with screen touches.
The second interpretation step, then, is to match the multi-modal references in the parse to the touched screen objects. The simplest case, of course, is when the number of screen touches matches the number of multi-modal tags. In this case, the references in the command are resolved with the appropriate components of the touched objects. For example, suppose the spoken command is "forward this message to him and her," and three screen objects were touched. The system determines that a <cmdForward> was given, and sees that the <message> tag refers to a <pointWhich>. This is interpreted by resolving the <pointWhich> with the first screen touch, which is to a message. The system then sees that the <toPeople> contains two <pointPerson> tags, so it resolves the two recipients of the message with the second and third screen touches by checking that the two touched objects have associated names. The system then executes the FORWARD command with the given message and destination names.
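A minimal sketch of this simple case, using hypothetical data structures rather than our prototypes' own, is:

# One-to-one resolution sketch (hypothetical data structures).  Each multi-modal tag
# contributes a required type, e.g. <pointItem> -> "message", <pointPerson> -> "person".

def resolve_one_to_one(required_types, touches):
    """required_types: e.g. ["message", "person", "person"], in sentence order.
    touches: touched screen objects, in touch order; each carries the semantic
    types it can stand for, e.g. {"types": {"message": msg_id, "person": "John"}}."""
    if len(required_types) != len(touches):
        return None                                # handled by the heuristics below
    bindings = []
    for required, touch in zip(required_types, touches):
        if required not in touch["types"]:
            return None                            # type clash: further inference needed
        bindings.append(touch["types"][required])
    return bindings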
The type checking of this resolution is significant, and is crucial to proper handling of more complex situations. Many objects and words on the screen can be interpreted in a number of ways. In the e-mail system workspace shown in figure 1, for example, the name of a message author might be touched as a reference to the person as an author, or as a recipient, or instead to the message itself. This is even more apparent in more complex domains, in which, for example, a document icon may indicate a reference to the document itself, its author, its project, or its purpose.
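Sketched with the same hypothetical structures as above, a single touched object can offer several typed readings, from which the resolution step picks whichever one the sentence requires:

# Sketch of how one touched object can stand for several semantic types
# (hypothetical object layout, not the prototypes' code).

def readings_of(touched):
    """Map a touched screen object to every semantic type it can stand for."""
    if touched["kind"] == "message_header":
        return {"message": touched["message_id"],
                "person": touched["author"]}       # the header also names a person
    if touched["kind"] == "name":
        return {"person": touched["name"]}
    if touched["kind"] == "document_icon":
        return {"document": touched["doc_id"],
                "person": touched["author"]}
    return {}

A mapping of this sort is what would populate the per-touch "types" table used in the resolution sketch above.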
In our previous example, "forward this message to him and her," suppose that there are two recorded screen touches, both of which are to message indicators and could thus refer to either a message or its author. The system parses the message into the multi-modal semantic grammar from figure 3, and determines that the spoken command is a <cmdForward>, that the <message> is a <pointItem>, and that the recipients are two <pointPerson>s. The system then attempts to resolve the three references to screen touches with the two actual screen touches, and realizes that a simple one-to-one mapping is not possible.

The system then checks the previous command to see what messages or people were referenced, to use this context to resolve the current references. If the previous command referred primarily to a message, as is most likely, the system will assume that the current command is referring to the message from the previous command, and that the two screen touches are resolving the recipients of the message. If, on the other hand, the previous command is purely person-oriented (e.g., who is he), the system will assume that the previously referred-to person is one of the recipients, and resolve the first screen touch with the message to forward, and the second with the second recipient.
If, on the other hand, the first screen touch was to a person's name, it could not be interpreted as pointing to a message. In this case, both touches are assumed to refer to recipients, and the message has to be determined from context. If there is no way to determine the message, the user will be asked to specify the message. Because the system was able to interpret the bulk of the command, the question posed to the user can be quite focused, such as "Which message do you want to forward to Chad and Bill?"
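One simple rendering of these heuristics (again with hypothetical helpers, not our prototypes' actual code) fills slots from the previous command's referents when there are more references than touches, and falls back to a focused question when a slot remains open:

# Fallback resolution sketch: context first for surplus references, then touches,
# then a focused question (hypothetical data structures).

def resolve_with_context(required_types, touches, previous_referents):
    """previous_referents: e.g. {"message": last_msg} from the preceding command."""
    unresolved = len(required_types) - len(touches)
    bindings, remaining_touches = [], list(touches)
    for required in required_types:
        if unresolved > 0 and required in previous_referents:
            bindings.append(previous_referents[required])     # fill from context
            unresolved -= 1
        elif remaining_touches and required in remaining_touches[0]["types"]:
            bindings.append(remaining_touches.pop(0)["types"][required])
        else:
            return ask_focused_question(required, bindings)
    return bindings

def ask_focused_question(missing_type, partial_bindings):
    # e.g. "Which message do you want to forward to Chad and Bill?"
    return {"ask_about": missing_type, "already_resolved": partial_bindings}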
In general, the greater the disparity between spoken references to touched objects and the actual screen touches, the more inference may be needed to interpret the command, and the greater the likelihood of having to ask the user questions before proceeding. Our goal in all these cases has been to utilize whatever information is available to construct a partial interpretation of the sentence, and to be able to ask the user as focused a question as possible. While the heuristics we have been discussing are only some of many that are possible, we have found in practice that our system is almost always able to correctly interpret ambiguous commands from beginning users.

Figure 4: A communications workspace
3. A second prototype: Video communications management
Our second talk and touch prototype, a video communications manager, is shown in figure 4. The system supports commands to start and end video calls or multi-point video conferences, to play and send multimedia mail messages, to share and manipulate documents during video conferences, and so on. Our goals in developing this prototype were threefold. First, we wanted to apply talk and touch interfaces to a domain that was inherently verbal and visual, as opposed to a domain like e-mail that was inherently text oriented. Second, we wanted to increase the number of types of objects that could be referenced multi-modally, while decreasing the number of potentially ambiguous references in our spoken commands. Third, we wanted to integrate our work with other ongoing research at CSTaR, particularly in support for multimedia collaboration and communication.
<sentence> ::= <agent> <command> | <command>.
<command> ::= <cmdCall> | <cmdSend> | <cmdPickup> | <cmdHangup> | <cmdPlay>.
<cmdPlay> ::= <play> <message>.
<cmdSend> ::= SEND <toPeople> A <messageType> | SEND <toPeople> A <messageType> <msgType>
    | SEND A <messageType> TO <toPeople> | SEND A <messageType> TO <toPeople> <msgType>.
<cmdPickup> ::= ANSWER <call> | PICK UP <call> | SHOW <call> | PICK <itCall> UP | PICK IT UP.
<cmdCall> ::= CALL <toPeople> | CALL <toPeople> <callType> | GET <toPeople> ON THE LINE
    | GET <toPeople> ON THE LINE <callType> | SET UP A <confType> WITH <toPeople>
    | SET UP A <confType> WITH <toPeople> <callType>.
<cmdHangup> ::= HANG UP <call> | DISCONNECT <call> | END <call>.
<call> ::= THE CALL TO <toPeople> | THE CALL FROM <fromPerson> | <pointCall> | <itCall>.
<message> ::= <pointMsg> | <itMsg> | THE <messageType> FROM <fromPerson>
    | THE <msgNum> <messageType>.
<confType> ::= CONFERENCE CALL | TELECONFERENCE | VIDEO CONFERENCE.
<messageType> ::= MESSAGE | REPLY.
<toPeople> ::= <toPerson> | <toPerson> AND <toPerson> | <toPerson> AND <toPerson> AND <toPerson>
    | <toPerson> <toPerson> AND <toPerson>.
<toPerson> ::= <name> | <pointPerson>.
<pointPerson> ::= HIM | HER | THEM.
<callType> ::= AUDIO ONLY.
<msgType> ::= AUDIO ONLY | TEXT ONLY.
<msgNum> ::= FIRST | SECOND | THIRD | FOURTH.
<play> ::= PLAY | SHOW.
<pointCall> ::= THIS | THIS CALL.
<fromPerson> ::= <name> | <pointPerson>.
<name> ::= BRUCE KRULWICH | BOB LORD | MINDY COHN | MARK PAUL | PALO ALTO.
<agent> ::= EINSTEIN.
<pointMsg> ::= THIS | THIS MESSAGE.
<itCall> ::= IT.
<itMsg> ::= IT.

Figure 5: A grammar for communications management
Figure 4 shows the workspace of our communication manager system. In the center of the screen are messages, with the picture of the sender and the name and date of the message. These correspond to multimedia messages that have been sent or have been left by people trying to call. On the bottom are two areas for active or pending calls, with incoming calls on the left and outgoing calls on the right. Along the left side is an area for feedback from the system, with the recognized spoken sentence, the feedback from the talk and touch component, and the feedback from the speech recognition subsystem.
Figure 5 shows part of the multi-modal semantic grammar for commands in the video communications workspace. This system allows far fewer ambiguous screen touches than the e-mail system, due to the objects on the screen being more distinct and categorized, and due to the lack of support for commands like the forward this message to him and her command that we saw earlier. Grammar tags such as <pointMsg>, <pointCall>, and <pointPerson> correspond to screen touches, and none of the commands at this screen include references to more than one type of screen object.³

Figure 6: A multi-point video conference sharing a document
As an example, if the user gives the command "send a message to Bob Lord and him," the system will parse the sentence and attempt to resolve the single <pointPerson> reference. If there is a single screen touch recorded, the system will attempt to interpret the touched object as a person, be it the person shown in a picture, or the sender of an indicated message, or the other person on a call. If no touch has been recorded, the system will use contextual references from the previous commands, as discussed earlier. If none exist, a focused question will be posed to the user.
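A compact sketch of this cascade (hypothetical object fields, not the prototype's code) is:

# Resolving a single <pointPerson>: coerce a recorded touch to a person, then fall
# back to context; returning None signals that a focused question should be posed.

def resolve_point_person(touches, previous_referents):
    for touch in touches:
        if touch["kind"] == "picture":
            return touch["person"]
        if touch["kind"] == "message":
            return touch["sender"]            # the sender of an indicated message
        if touch["kind"] == "call":
            return touch["remote_party"]      # the other person on the call
    return previous_referents.get("person")   # None => ask, e.g. "Send it to whom?"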
A command such as "set up a video conference with Palo Alto, Mindy Cohn, and him" will establish a multi-point video conference. If the <pointPerson> resolves with a touch to Bob Lord, the conference shown in figure 6 will be started as soon as all participants are available. Calls and conferences of this sort support a variety of multi-modal commands, through the grammar shown in figure 7. As with the communications management grammar, the commands for video conference control feature no multi-reference sentences, and are thus much simpler to interpret.
³ There are many issues involved in designing an interface to facilitate straightforward multi-modal interpretation, which are beyond the scope of the present paper.
4. Discussion and future work
The research that we have presented can be viewed from a number of perspectives, and raises a variety of issues.
One of the critical issues in the research and development of speech recognition systems has been the need for a grammar, or other sources of constraints on the input utterances, to allow for accurate recognition of continuous speech (e.g., [Reddy, 1976; Lee et al., 1990; Huang et al., 1993]). Early approaches used a finite-state grammar, such as a standard BNF grammar, for this purpose. More recent research has generalized this approach to avoid its limitations by replacing the finite-state grammar with a statistical model of word ordering. In our research, however, we have found grammar-based speech recognition to be suitable, primarily because of our need for the multi-modal semantic grammars for interpretation. By using the same grammars for both purposes, we integrate the processing of speech and interpretation, apply our semantic knowledge early in processing, and facilitate improved performance.⁴ There may, however, be other benefits of more general approaches, such as the ability to handle new names or phrases dynamically. We are currently exploring this possibility.
Our approach has to date only been applied to resolving references to touched objects. There are, however, many other uses of gesture in speech [Feldman and Rime, 1991] that have been investigated previously in multi-modal interfaces (e.g., [Bolt and Herranz, 1992; Koons, 1994]). Our approach can easily be extended to include references to screen areas or numerical ranges along axes. It is more difficult, and the subject of ongoing research, to incorporate gestures representing the commands themselves (e.g., an "X" gesture for deletion) into our multi-modal semantic grammars.

Lastly, it is crucial that we develop more powerful techniques for incorporating contextual information. Just as we have developed successful approaches for designing grammars to support multi-modal references, we have to enable our grammars to better support references to contextually relevant objects and people. If we can integrate contextual references and gesture-based commands into our integrated approach to multi-modal grammars, we will truly be able to support high-quality interactions between people and computers.
Acknowledgments: We wish to thank Anatole Gershman, Dave Beck, Lucian Hughes, Steve Sate, Kishore Swaminathan, and Larry Birnbaum for many useful discussions on the research presented, and the innumerable visitors that have seen our demonstrations for their feedback.
⁴ We are currently using the IBM Continuous Speech System™, which is based on Carnegie Mellon's Sphinx I system. We have also used the PE400™ system from Speech Systems Inc.
References
Bolt, R., 1980. Put that there: Voice and gesture at the graphics interface. Computer Graphics (Proceedings of the ACM SIGGRAPH '80), Vol. 14, No. 3.

Bolt, R. and Herranz, E., 1992. Two-handed gesture in multi-modal dialog. In Proceedings of the Fifth Annual Symposium on User Interface Software and Technology, Monterey, CA.

Burton, R., 198x. Semantic grammars. In The Encyclopedia of Artificial Intelligence, Feigenbaum and Barr, eds.

Burton, R., 1976. Semantic grammar: An engineering technique for constructing natural language understanding systems. BBN technical report 3453, Cambridge, MA.

Feldman, R. and Rime, B., 1991. Fundamentals of Nonverbal Behavior. Cambridge University Press.

Huang, X., Alleva, F., Hon, H., Hwang, M., Lee, K., and Rosenfeld, R., 1993. The SPHINX-II speech recognition system: An overview. Computer Speech and Language, volume 2, pp. 137-148.

Koons, D., 1994. Capturing and interpreting multi-modal descriptions with multiple representations. In Working Notes of the 1994 AAAI Spring Symposium on Intelligent Multi-Media Multi-Modal Systems, Stanford, CA, pp. 13-21.

Krulwich, B. and Burkey, C., 1994. Natural command interfaces incorporating speech and gesture. In Working Notes of the 1994 Conference on Lifelike Computer Characters, Snowbird, Utah.

Lee, K., Hon, H., and Reddy, R., 1990. An overview of the SPHINX speech recognition system. IEEE Transactions on Acoustics, Speech, and Signal Processing, pp. 35-45.

Reddy, R., 1976. Speech recognition by machine: A review. IEEE Proceedings, 64:4, pp. 502-531.