From: AAAI-82 Proceedings. Copyright ©1982, AAAI (www.aaai.org). All rights reserved.
ARGOT: The Rochester Dialogue System
James F. Allen, Alan M. Frisch, and Diane J. Litman
Computer Science Department
The University of Rochester
Rochester, NY 14627
Abstract
We are engaged in a long-term research project that
has the ultimate aim of describing a mechanism that can
partake in an extended English dialogue on some
reasonably well specified range of topics. This paper is a
progress report on the project, called ARGOT. It outlines
the system and describes recent results as well as work in
progress.
1. Introduction
Consider Dialogue 1, a slightly cleaned up version of
an actual dialogue between a computer operator and a user
communicating via terminals.
(1) User:     Could you mount a magtape for me?
(2)           It's T376.
(3)           No ring please.
(4)           Can you do it in five minutes?
(5) Operator: We are not allowed to mount that magtape.
(6)           You will have to talk to the head operator about it.
(7) User:     How about tape T241?

Dialogue 1.
We are building a computer system called ARGOT
that plays the role of the operator in extended dialogues
such as the above. This dialogue illustrates some of the
many issues that must be addressed in building such a
system. For instance, the first utterance taken literally is a
query about the system’s (i.e., the operator’s) abilities. In
this dialogue, however, the user intends it as part of a
request to mount a particular magtape. Thus, the system
must recognize an indirect speech act. Utterance (2)
identifies the tape in question, and (3) and (4) add
constraints on how the requested mounting is supposed to
be done. These four utterances, taken as a unit, can be
summarized as a single request to mount a particular
magtape with no ring within five minutes.
Once the system makes the above inferences, it generates (5), which denies the request, as well as (6), which provides additional information that may be helpful to the user. The system believes that talking to the head operator will be of use to the user because it has recognized the user's goal of getting a tape mounted. Utterance (7) taken in isolation is meaningless; however, in the context of the entire dialogue, it can be seen as an attempt to modify the original request by respecifying the tape to be mounted.

Allen's [1979] model of language as cooperative behavior provides answers to several of the difficulties suggested by Dialogue 1. The basic assumption of that approach, which is adopted in ARGOT, is that the participants in a dialogue are conversing in order to achieve certain goals. As a consequence, a major part of understanding what someone said is recognizing what goals they are pursuing. In purposeful dialogues this model accounts for helpful responses, as well as for responses to indirect speech acts and some sentence fragments. However, since his model has no knowledge of discourse structure, it cannot partake in an extended dialogue.

One of the major advances made in ARGOT is that it recognizes multiple goals underlying utterances. For example, consider the user's goals underlying utterance (2). From the point of view of the task domain, the user's goal is to get the tape mounted (by means of identifying it). From the point of view of the dialogue, the user's goal is to elaborate on a previous request; i.e., the user is specifying the value of a parameter in the plan that was recognized from the first utterance. In the ARGOT system, we recognize both these goals and are investigating the relationship between them. The need for this type of analysis has been pointed out by many researchers (e.g., [Levy, 1979; Grosz, 1979; Appelt, 1981; Johnson and Robertson, 1981]).
The work at SRI [Walker, 1978] in expert-apprentice dialogues monitored the goals of the user at the task level. The only analysis at the communicative goal level was implicit in various mechanisms such as the focusing of attention [Grosz, 1978]. Their work ties the task structure and communicative structure too closely together for our purposes. Appelt [1981] also views utterances as actions which satisfy goals along various explicit dimensions--a social dimension as well as what would correspond to our communicative and task levels. However, his communicative dimension is again mainly concerned with focusing.
2. Organization of ARGOT
Currently, the ARGOT system is divided into many subsystems, each running concurrently. The three subsystems we shall consider in this paper are the task goal reasoner, the communicative goal reasoner, and the linguistic reasoner. Each of these levels is intended to perform both recognition and generation. In this paper we consider only recognition, since the generative side of the system is not currently being implemented.
The task goal reasoner recognizes goals in the domain
of discourse, such as mounting tapes, reading files, etc. The
communicative goal reasoner recognizes goals such as
introducing a topic, clarifying or elaborating on a previous
utterance, modifying the current topic, etc. Allen's earlier
system had parts of both types of analysis, but they were
collapsed into one level. A result of this was that it was
difficult to incorporate knowledge of the dialogue structure
into the analysis.
The linguistic level is responsible for providing input
to the other levels of analysis that reflects the content of
the actual utterances. This parser will be based on the
Word Expert Parser system of Small and Rieger [1981]. As
the linguistic analysis progresses, it will notify the other
levels of the various noun phrases that appear as they are
analyzed. This allows the other levels to start analyzing the
speaker’s intentions before the entire sentence is
linguistically analyzed. Thus, an interpretation may be
found even if the linguistic analysis eventually “fails” to
find a complete sentence. (Failure is not quite the correct
word here, since if the utterance is understood, whether it
was "correct" or not becomes uninteresting.) We are
investigating other information that could be useful for the
rest of the system during parsing; for instance, the
recognition of clue words to the discourse structure
[Reichman, 1978]. If a user utterance contains the word
"please," the communicative level should be notified so
that it can generate an expectation for a request.
In addition, the rest of the system may be able to
provide strong enough expectations about the content of
the utterance that the linguistic level is able to construct a
plausible analysis of what was said, even for some
ungrammatical sentences.
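To make the notification idea concrete, here is a minimal sketch of ours (the real linguistic level is the Word Expert Parser; the toy lexicon, listener interface, and function names below are all invented for illustration): the parser reports noun phrases and clue words to the other levels as soon as it recognizes them, rather than waiting for a complete parse.

# A toy sketch of incremental notification; not ARGOT's parser.
def parse_incrementally(words, listeners):
    """Scan left to right; report noun phrases and clue words to the
    registered listeners as soon as they are recognized."""
    noun_phrases = {"a magtape", "me"}   # hypothetical toy lexicon
    clue_words = {"please"}              # discourse clue words
    for i in range(len(words)):
        for j in range(i + 1, len(words) + 1):
            phrase = " ".join(words[i:j])
            if phrase in noun_phrases:
                for notify in listeners:
                    notify("NP", phrase)
        if words[i] in clue_words:
            for notify in listeners:
                notify("CLUE", words[i])

log = []
parse_incrementally(
    "could you mount a magtape for me".split(),
    [lambda kind, item: log.append((kind, item))])
print(log)  # [('NP', 'a magtape'), ('NP', 'me')]

Under this organization the goal reasoners can begin work on "a magtape" while the rest of the sentence is still being parsed, which is what allows an interpretation to be found even when no complete sentence is ever recognized.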
Splitting the analysis of intention into the communicative and task levels brings about the problem of identifying and relating the high-level goals of the plans at each level. The high-level goals at the task level are dependent on the domain, and correspond to the high-level goals in the earlier model. The high-level communicative goals reflect the structure of English dialogue and are used as input to the task level reasoner. In other words, these goals specify some operation (e.g., introduce goal, specify parameter) that indicates how the task level plan is to be manipulated. Our initial high-level communicative goals are based on the work of Mann, Moore and Levin [1977]. In their model, conversations are analyzed in terms of the ways in which language is used to achieve goals in the task domain. For example, bidding a goal is a communicative action which introduces a task goal for adoption by the hearer.
Given the communicative goals, we must now be able
to recognize plans at this level. Neither Mann et al. [1977]
nor Reichman [1978] have described in detail the process
of recognizing the communicative goals from actual
utterances. Currently, we adapt Allen's [1979] recognition algorithm, which finds an inference path connecting the observed linguistic action(s) to an expected communicative goal. This algorithm uses the representation of the utterance from the linguistic level and a set of possible communicative acts predicted by a dialogue grammar, which indicates what communicative acts are allowed at any particular time for both participants and is modeled after Horrigan [1977].
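The flavor of this recognition step can be shown in a minimal sketch (ours, not ARGOT's code; the rule table, act names, and plausibility structure are invented for illustration): search for an inference path from the observed linguistic act to one of the communicative goals the dialogue grammar currently expects.

from collections import deque

# Hypothetical "motivates" rules: each act or goal maps to goals that
# plausibly motivate it. Asking whether an agent can do something
# often serves a plan to have the agent do it.
MOTIVATES = {
    "REQUEST(INFORM(can-mount))": ["KNOWIF(can-mount)"],
    "KNOWIF(can-mount)":          ["BID-GOAL(INFORM(can-mount))",
                                   "BID-GOAL(MOUNT(tape))"],
}

def recognize(observed_act, expected_goals):
    """Breadth-first search for an inference path from the observed
    linguistic act to a communicative goal the grammar expects."""
    queue = deque([[observed_act]])
    while queue:
        path = queue.popleft()
        if path[-1] in expected_goals:
            return path
        for goal in MOTIVATES.get(path[-1], []):
            queue.append(path + [goal])
    return None  # no connecting path: the utterance is not understood

print(" -> ".join(recognize("REQUEST(INFORM(can-mount))",
                            {"BID-GOAL(MOUNT(tape))"})))

A real recognizer rates competing paths by plausibility rather than simply taking a shortest one; the sketch shows only the path-finding skeleton.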
3. Issues in Knowledge Representation
All levels of analysis in ARGOT make use of a
common knowledge representation and a common
knowledge base module (KB). The KB stores a set of
sentences of the representation language and provides
retrieval facilities for accessing them. These retrieval
facilities are used extensively at all levels of analysis.
Because of this, the sentences stored in the KB not only
represent the knowledge associated with magtapes and the
doings of a computer room, but also the knowledge
necessary to deal with language. This includes knowledge
of physical, mental, and linguistic actions, how these
actions are involved in plans, what computer users do, and
what they expect of computer operators.
Following Brachman [1979b], our representation is
constructed of two levels: the epistemological level and the
conceptual level. The epistemological level provides a set
of knowledge structuring primitives that are used to
construct all conceptual level entities (e.g., action, time,
and belief). Each level of the representation provides
primitive symbols (particular predicate symbols, function
symbols, and constant symbols) which can then be
combined using the notation of FOPC. By inheriting the
logical connectives and quantificational structure of FOPC,
the resulting representation language is quite expressive.
3.1 The Epistemological Level of Representation
The epistemological level of the representation supplies a fixed set of predicates which are the knowledge-structuring primitives out of which all representations are built. The choice of knowledge-structuring primitives has been motivated by the study of semantic networks. For instance, where a semantic network such as Brachman [1979b] might have

[Figure: a semantic network fragment. An Events oval is linked by an unshaded double arrow to a Tape-Mountings oval, which is linked by a shaded double arrow to an individual node; single arrows labeled Agent and Object connect that node to Operator and T376.]

we would have
SUBTYPE(Tape-Mountings, Events)
TYPE(Operator-Mounting-T376, Tape-Mountings)
ROLE(Operator-Mounting-T376, Agent, Operator)
ROLE(Operator-Mounting-T376, Object, T376)
Notice that the SUBTYPE predicate corresponds to the unshaded double arrow, the TYPE predicate to the shaded double arrow, and the ROLE predicate to the single arrow. Our constants are sorted into individuals (e.g., Operator-Mounting-T376, Operator, T376), types (e.g., Events, Tape-Mountings), and rolenames (e.g., Agent, Object). These sorts somewhat correspond to the shaded oval, unshaded oval, and shaded box of the network. Allen and Frisch [1982] have fully described and axiomatized the epistemological level of the representation and have compared it to semantic networks.
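For illustration only, the four sentences above can be stored and queried as simple tuples. This toy encoding is ours, not the KB module's, but it shows how a ROLE lookup on the example behaves:

# The four example sentences stored as tuples.
KB = {
    ("SUBTYPE", "Tape-Mountings", "Events"),
    ("TYPE", "Operator-Mounting-T376", "Tape-Mountings"),
    ("ROLE", "Operator-Mounting-T376", "Agent", "Operator"),
    ("ROLE", "Operator-Mounting-T376", "Object", "T376"),
}

def role(individual, rolename):
    """Return the filler of a role on an individual; since role
    fillers are unique, the first match is the only match."""
    for fact in KB:
        if fact[:3] == ("ROLE", individual, rolename):
            return fact[3]
    return None

print(role("Operator-Mounting-T376", "Agent"))   # Operator
print(role("Operator-Mounting-T376", "Object"))  # T376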
3.2 The Conceptual Level of Representation
The representation of actions is crucial to a dialogue participant for two reasons. The first is that the participant must be able to represent the meaning of utterances that refer to actions (e.g., "Can you mount a magtape for me?" refers to the action of mounting). The second, as previously discussed, is that it is advantageous to model the language comprehension and production processes as purposeful, planned action (e.g., uttering "Can you mount a magtape for me?" is a requesting action). However, existing models of action, most notably the state-space approach (e.g., [Fikes and Nilsson, 1971]), appear inadequate for the above purposes. Since a major deficiency with the existing models is an inadequate treatment of time, we first turn our attention to this issue.
An interval-based temporal logic and its associated inference processes have been defined [Allen, 1981a]. Rather than using a global time line, the representation employs a hierarchical set of reference frames. A particular interval is known by its location relative to the reference frames and other intervals. This is particularly important in a dialogue system, for most temporal knowledge does not come with a precise time. This representation of time has been used to produce a general model of events and actions [Allen, 1981b]. The occurrence of an event corresponds to a partial description of the world over some time interval. Actions are defined as that subclass of events that are caused by agents. This is in contrast to the state-space view of an action as a function from one world state to a succeeding world state. Our approach enables the representation of actions that describe inactivity (e.g., standing still), preserving a state (e.g., preventing your television from being stolen), and simultaneous performance of simpler actions (e.g., talking while juggling).
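As a rough illustration of locating intervals relative to reference frames rather than on a global time line, consider the following sketch. It is our invention, not code from [Allen, 1981a]: the interval names are hypothetical, and only one of the interval relations (DURING) is shown.

# Known relations: each interval is located only relative to its
# reference frame(s); there is no global time line.
DURING = {
    ("mount-request", "todays-session"),
    ("todays-session", "today"),
}

def during(a, b):
    """a DURING b holds if a chain of containing intervals connects
    a to b (DURING is transitive)."""
    if (a, b) in DURING:
        return True
    return any(x == a and during(y, b) for (x, y) in DURING)

print(during("mount-request", "today"))  # True, via todays-session
print(during("mount-request", "1982"))   # False: simply not recorded

Note that the second query fails not because the relation is false but because it is unknown, which is how imprecise temporal knowledge stays imprecise in such a scheme.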
Representing actions, particularly speech acts, requires the representation of beliefs. For example, the effect of the speech act of informing involves changing the beliefs of the hearer. A model of belief has been developed that treats BELIEVE as a predicate on an agent and a sentence of the representation language. To do this, there must be a name for every sentence in the language. Perlis [1981] and Haas [1982] have introduced naming schemes that provide enough expressiveness to deal with traditional representational requirements such as quantifying in. Haas [1982] has used this formulation of belief to predict an agent's actions by constructing plans that can include mental actions. His treatment of belief and action does not suffer from the problem of the possible worlds approach [Moore, 1979] that an agent believes all consequences of his beliefs.
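The general idea of BELIEVE as a predicate on an agent and a name of a sentence can be sketched as follows. The tuple encoding is a stand-in of ours, not the naming scheme of [Perlis, 1981] or [Haas, 1982]:

# Sentences are quoted as tuples, so a belief is itself an ordinary
# sentence that can be stored, nested, and retrieved.
can_mount = ("CAN", "Operator", ("MOUNT", "T376"))

KB = {
    ("BELIEVE", "User", can_mount),
    ("BELIEVE", "Operator", ("BELIEVE", "User", can_mount)),  # nested
}

def believes(agent, sentence):
    """Retrieval only: there is no closure under consequence, so an
    agent need not believe all consequences of his beliefs."""
    return ("BELIEVE", agent, sentence) in KB

print(believes("User", can_mount))                           # True
print(believes("Operator", ("BELIEVE", "User", can_mount)))  # True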
3.3 The Knowledge Base Module
The Knowledge Base (KB) provides a set of retrieval
facilities that is the sole access that the system has to the
sentences stored in the KB. This retrieval facility
corresponds to the matcher in a semantic network
representation. Since retrieval must respect the semantics
of the representation, it is viewed as inference. However,
this inference must be limited because retrieval must
terminate, and must do so in a reasonable amount of time.
Frisch and Allen [1982] have shown how a limited
inference engine suitable for knowledge retrieval can be
given a formal, non-procedural specification in a metalanguage and how such a specification can be efficiently
implemented.
The capabilities and limitations of the retriever can be
thought of intuitively as follows. A set of axioms dealing
solely with the epistemological primitives is built into the
retriever.
For example, three of these axioms are:

∀ t1,t2,t3  SUBTYPE(t1,t2) ∧ SUBTYPE(t2,t3) → SUBTYPE(t1,t3)
  (SUBTYPE is transitive.)

∀ o,t1,t2  TYPE(o,t1) ∧ SUBTYPE(t1,t2) → TYPE(o,t2)
  (Every member of a given type is a member of its supertypes.)

∀ x,r,y,y'  ROLE(x,r,y) ∧ ROLE(x,r,y') → y = y'
  (Role fillers are unique.)
Through these built-in axioms, the retriever "knows about" all of the epistemological primitives. The retriever's power comes from the fact that it can, for the most part, reason completely with the built-in axioms. Its limitations arise because it only partially reasons with the sentences stored in the KB. The retriever also has knowledge of how to control inferences with the built-in axioms. In this manner, the retriever only performs those inferences for which it has adequate control knowledge to perform efficiently.
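In that spirit, a toy retriever (ours, far simpler than the one specified by Frisch and Allen [1982]) might reason completely with the first two built-in axioms while doing nothing but lookup on the stored sentences; the uniqueness axiom is omitted, and the supertype Things is a hypothetical addition:

KB = {
    ("SUBTYPE", "Tape-Mountings", "Events"),
    ("SUBTYPE", "Events", "Things"),     # hypothetical supertype
    ("TYPE", "Operator-Mounting-T376", "Tape-Mountings"),
}

def subtype(t1, t2):
    """Complete reasoning with the transitivity axiom (built in)."""
    if ("SUBTYPE", t1, t2) in KB:
        return True
    return any(f[0] == "SUBTYPE" and f[1] == t1 and subtype(f[2], t2)
               for f in KB)

def has_type(o, t):
    """Complete reasoning with the type-inheritance axiom (built in);
    stored sentences themselves are only looked up, never chained."""
    return any(f[0] == "TYPE" and f[1] == o and
               (f[2] == t or subtype(f[2], t))
               for f in KB)

print(has_type("Operator-Mounting-T376", "Things"))  # True
print(has_type("Operator-Mounting-T376", "Tapes"))   # False

The inference is bounded because the chained axioms are fixed in advance, so retrieval always terminates; whatever cannot be reached this way is simply not retrieved.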
4. A Simple Example
Let us trace a simplified analysis of utterance (1), "Could you mount a magtape for me?" The communicative acts expected at the start of a dialogue by the grammar are (in an informal notation)

user BID-GOAL to system, and
user SUMMON system.

Taking the utterance literally, the linguistic level uses both syntactic and semantic analysis to identify the linguistic actions (speech acts) performed by the speaker. For utterance (1) we have

user REQUEST that system INFORM user if system can mount a tape,

which is sent to the communicative level. The plan recognition algorithm produces BID-GOAL acts for two possible goals:

(G.1) system INFORM user if system can mount a tape (literal interpretation)
(G.2) system MOUNT a tape (indirect interpretation).

The indirect interpretation, (G.2), is favored, illustrating how goal plausibility depends upon what the dialogue participants know and believe. Most people know that operators can mount tapes, so the literal interpretation is unlikely. However, if the user did not know this, the literal interpretation would also have been recognized (i.e., the system might generate "yes" before attempting to mount the tape). It is important to remember here that the plan was recognized starting from the literal interpretation of the utterance. The indirect interpretation falls out of the plan analysis (see [Perrault and Allen, 1980] for more details). Thus, the linguistic level only needs to produce a literal analysis.
The communicative level sends the recognized BID-GOAL, (G.2), to the task reasoner. There, the user's task
level goal to mount a tape is recognized, and the system
accepts the user’s goal as a goal of its own. Of course,
since the task level reasoner is a general plan recognizer as
well, it may well infer beyond the immediate effect of the
specific communicative action. For example, it may infer
that the user has the higher-level goal of reading a file.
The task level reasoner generates a plan for mounting a tape and then inspects this plan for obstacles. Assuming the user says nothing further, there would be an obstacle in the task plan, for the system would not know which tape to mount. The task level reasoner would generate the goal for the system to identify the tape and would send this goal to the communicative goal reasoner. This reasoner would plan a speech act (or acts), obeying the constraints on well-formed discourse, that could lead to accomplishing the goal of identifying the tape. This speech act then would be sent to the linguistic level, which would generate a response such as "Which tape?"

In Dialogue 1, however, the user identifies the tape in utterance (2), which the communicative level recognizes as a SPECIFY-PARAMETER action for the plan created by the initial BID-GOAL action.
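The obstacle-inspection step can be sketched minimally as follows (the plan structure and names here are hypothetical, not ARGOT's representation): unfilled parameters of the recognized plan become goals for the communicative reasoner.

# A recognized (partial) task plan; tape is still unidentified,
# matching the "user says nothing further" case above.
plan = {
    "action": "MOUNT",
    "parameters": {"tape": None, "ring": "no-ring", "deadline": "5min"},
}

def obstacles(plan):
    """Parameters with no known value are obstacles to execution."""
    return [p for p, v in plan["parameters"].items() if v is None]

for param in obstacles(plan):
    # Each obstacle becomes a goal the communicative reasoner can
    # plan a speech act for, e.g. generating "Which tape?"
    print(f"goal: system IDENTIFY {param}")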
5. Current State and Future Directions
We have implemented the knowledge base, a simple dialogue grammar, and simple plan recognizers at both the communicative and task levels. Furthermore, we are currently incorporating a word-expert parser [Small and Rieger, 1981]. As discussed in the previous sections, further research on all aspects of ARGOT (i.e., the levels, the interactions between them, and the theoretical models) is still needed.
Acknowledgements

We would like to thank Bill Mann for providing us with the dialogues. We thank Lokendra Shastri and Marc Valain for their contributions to the development of ARGOT and Dan Russell for his helpful comments on an earlier version of this paper. This work has been supported in part by NSF Grant IST-8012418 and DARPA Grant N00014-82-K-0193.
References
Allen, J.F., "A plan-based approach to speech act recognition," Ph.D. thesis, Computer Science Dept., U. Toronto, 1979.

Allen, J.F., "An interval-based representation of temporal knowledge," Proc., 7th Int'l. Joint Conf. on Artificial Intelligence, Vancouver, B.C., 1981a.

Allen, J.F., "What's necessary to hide?: Reasoning about action verbs," Proc., 19th Annual Meeting, Assoc. for Computational Linguistics, 77-81, Stanford U., 1981b.

Allen, J.F. and A.M. Frisch, "What's in a semantic network?" Proc., 20th Annual Meeting, Assoc. for Computational Linguistics, U. Toronto, June 1982.

Appelt, D., "Planning natural language utterances to satisfy multiple goals," Ph.D. thesis, Computer Science Dept., Stanford U., 1981.

Brachman, R.J., "On the epistemological status of semantic networks," in N.V. Findler (ed.), Associative Networks. New York: Academic Press, 1979b.

Fikes, R.E. and N.J. Nilsson, "STRIPS: A new approach to the application of theorem proving to problem solving," Artificial Intelligence 2, 189-205, 1971.

Frisch, A.M. and J.F. Allen, "Knowledge retrieval as limited inference," Lecture Notes in Computer Science: Proceedings, 6th Conference on Automated Deduction. New York: Springer-Verlag, 1982.

Grosz, B.J., "Discourse knowledge," in [Walker, 1978].

Grosz, B.J., "Utterance and objective: issues in natural language communication," Proc., 6th Int'l. Joint Conf. on Artificial Intelligence, Tokyo, 1979.

Haas, A.R., "Mental states and mental actions in planning," Ph.D. thesis, Computer Science Dept., U. Rochester, 1982.

Horrigan, M.K., "Modelling simple dialogues," Proc., 5th Int'l. Joint Conf. on Artificial Intelligence, MIT, 1977.

Johnson, P.N. and S.P. Robertson, "MAGPIE: A goal-based model of conversation," Research Report #206, Computer Science Dept., Yale U., May 1981.

Levy, D., "Communicative goals and strategies: between discourse and syntax," in T. Givon (ed.), Syntax and Semantics, Vol. 12. New York: Academic Press, 1979.

Mann, W.C., J.A. Moore, and J.A. Levin, "A comprehension model for human dialogue," Proc., 5th Int'l. Joint Conf. on Artificial Intelligence, MIT, 1977.

Moore, R.C., "Reasoning about knowledge and action," Ph.D. thesis, MIT, 1979.

Perlis, D., "Language, computation, and reality," Ph.D. thesis, Computer Science Dept., U. Rochester, 1981.

Perrault, C.R. and J.F. Allen, "A plan-based analysis of indirect speech acts," J. Assoc. Comp'l. Linguistics 6, 3, 1980.

Reichman, R., "Conversational coherency," Cognitive Science 2, 1978.

Small, S.L. and C. Rieger, "Parsing and comprehending with word experts (a theory and its realization)," TR 1039, Dept. Computer Science, U. Maryland, 1981.

Walker, D.E., Understanding Spoken Language. New York: North-Holland, 1978.