AI Magazine Volume 22 Number 4 (2001) (© AAAI)

Toward Conversational Human-Computer Interaction

James F. Allen, Donna K. Byron, Myroslava Dzikovska, George Ferguson, Lucian Galescu, and Amanda Stent
■ The belief that humans will be able to interact with computers in conversational speech has long been a favorite subject in science fiction, reflecting the persistent view that spoken dialogue would be the most natural and powerful user interface to computers. With recent improvements in computer technology and in speech and language processing, such systems are starting to appear feasible. There are significant technical problems that still need to be solved before speech-driven interfaces become truly conversational. This article describes the results of a 10-year effort building robust spoken dialogue systems at the University of Rochester.
The term dialogue is used in different communities in different ways. Many
researchers in the speech-recognition
community view “dialogue methods” as a way
of controlling and restricting the interaction.
For example, consider building a telephony
system that answers queries about your mortgage. The ideal system would allow you to ask
for what you need in any way you chose. The
variety of possible expressions you might use
makes this system a challenge for current
speech-recognition technology. One approach
to this problem is to have the system engage
you in a dialogue by having you answer questions such as “What is your account number?” or “Do you want your balance information?” On the positive side, because the system controls the interaction, your speech is much more predictable, leading to better recognition and language processing. On the negative side, the system has
limited your interaction. You might need to
provide all sorts of information that isn’t relevant to your current situation, making the
interaction less efficient.
Another view of dialogue involves basing human-computer interaction on human conversation. In this view, dialogue enhances the richness of the interaction and allows more complex information to be conveyed than is possible in a single utterance, but language understanding in dialogue becomes correspondingly more complex. It is this second view of dialogue to which we subscribe. Our goal is to design and build systems that approach human performance in conversational interaction. We believe that such an approach is feasible and will lead to much more effective user interfaces to complex systems.
Some people argue that spoken language
interfaces will never be as effective as graphic
user interfaces (GUIs) except in limited special-case situations (for example, Shneiderman
[2000]). This view underestimates the potential power of dialogue-based interfaces. First,
there will continue to be more and more applications for which a GUI is not feasible because
of the size of the device one is interacting with
or because the task one is doing requires using
one’s eyes or hands. In these cases, speech provides a worthwhile and natural additional
modality (Cohen and Oviatt 1995).
Even when a GUI is available, spoken dialogue can be a valuable additional modality
because it adds considerable flexibility and
reduces the amount of training required. For
example, GUI designers are always faced with
a dilemma—either they provide a relatively
basic set of operations, forcing the user to perform complex tasks using long sequences of
commands, or they add higher-level commands that do the task the user desires. One
problem with providing higher-level commands is that in many situations, there is a
wide range of possible tasks; so, the interface
becomes cluttered with options, and the user requires significant training to learn how to use the system.

Technique Used      | Example Task                                    | Task Complexity | Dialogue Phenomena Handled
Finite-state script | Long-distance calling                           | Least complex   | User answers questions
Frame based         | Getting train arrival and departure information |                 | User asks questions, simple clarifications by system
Sets of contexts    | Travel booking agent                            |                 | Shifts between predetermined topics
Plan-based models   | Kitchen design consultant                       |                 | Dynamically generated topic structures, collaborative negotiation subdialogues
Agent-based models  | Disaster relief management                      | Most complex    | Different modalities (for example, planned world and actual world)

Table 1. Dialogue and Task Complexity.
It is important to realize that a speech interface by itself does not solve this problem. If it
simply replaces the operations of menu selection with speaking a predetermined phrase
that performs the equivalent operation, it can
aggravate the problem because the user would
need to remember a potentially long list of
arbitrary commands. Conversational interfaces, however, would provide the opportunity
for the user to state what he/she wants to do in
his/her own terms, just as he/she would do to
another person, and the system takes care of
the complexity.
Dialogue-based interfaces allow the possibility of extended mixed-initiative interaction
(Allen 1999; Chu-Carroll and Brown 1997).
This approach models the human-machine
interaction after human collaborative problem
solving. Rather than viewing the interaction as
a series of commands, the interaction involves
defining and discussing tasks, exploring ways
to perform the task, and collaborating to get it
done. Most importantly, all interactions are
contextually interpreted with respect to the
interactions performed to this point, allowing
the system to anticipate the user’s needs and
provide responses that best further the user’s
goals. Such systems will create a new paradigm
for human-computer interaction.
Dialogue Task Complexity
There is a tremendous range of complexity of
tasks suitable for dialogue-based interfaces, and
we attempt a broad classification of them in
table 1. At the simplest end are the finite-state
systems that follow a script of prompts for the
user. Such systems are in use today for simple
applications such as long-distance dialing by
voice and have already proved quite successful.
This technique works only for the simplest of
tasks.
The frame-based approach includes most of
the spoken dialogue systems constructed to
date. In this approach, the system interprets
the speech to acquire enough information to
perform a specific action, be it answering a
question about train arrivals or routing your
call to the appropriate person at a bank. The
context is fixed in these systems because they
do only one thing. Specialized processing techniques are used that take advantage of the specific domain. One can view the context as
being represented as a set of parameters that
need to be instantiated before the system
action can be taken (for example, Seneff and
Polifroni [2000]). For example, to provide
information about train arrivals and departures, the system needs to know parameters
such as the train ID number, the event
involved (for example, arriving or departing),
and the day of travel (see table 2). The action is
performed as soon as enough information has
been identified. This approach has been used
for systems providing information about current movies (for example, Chu-Carroll [1999]),
train schedules (for example, Sturm, den Os,
and Boves [1999]), and routes to restaurants
(for example, Zue et al. [2000]).
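The frame-based approach can be reduced to a few lines of bookkeeping: the context is a set of slots, each utterance fills whatever slots it can, and the action fires once the required slots are filled. The following sketch is our own simplification; the slot names (following table 2) and the function names are invented for illustration:

# A minimal sketch of a frame-based dialogue context (illustrative only).
REQUIRED_SLOTS = ["train_id", "event"]        # enough to answer a schedule query
OPTIONAL_SLOTS = ["location", "date_time"]

def frame_complete(frame):
    """The action can be taken as soon as the required slots are filled."""
    return all(frame.get(slot) is not None for slot in REQUIRED_SLOTS)

def update_frame(frame, extracted):
    """Merge whatever parameters the latest utterance provided."""
    frame.update(extracted)
    return frame

frame = {}
update_frame(frame, {"train_id": "BN101"})
assert not frame_complete(frame)              # still waiting for the event
update_frame(frame, {"event": "departure"})
assert frame_complete(frame)                  # now the system can act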
Because of the simplicity of these domains, it
is possible to build very robust language-processing systems. One does not need to obtain
full linguistic analyses of the sentences, and in
fact, most information can be extracted by simple patterns designed for the specific domain.
For example, given the utterance “When does
the Niagara Bullet leave Rochester?” pattern-matching techniques could identify values for
the following parameters: the train (answer:
The Niagara Bullet), the event (answer: leaving), and the location (answer: Rochester).
Even if speech recognition was poor, and the
recognized utterance was “Went up the Niagara Bullet to leave in Chester,” patterns could
still extract the train (that is, the Niagara Bullet) and event (that is, leaving) and continue
the dialogue.
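As a rough illustration of such domain-specific pattern matching (our own sketch, not code from any of the cited systems), a handful of regular expressions suffices to fill the parameters from either the correct or the misrecognized utterance:

import re

# Illustrative domain patterns; a real system would have many more.
TRAIN = re.compile(r"\b(?:the\s+)?(?P<train>Niagara Bullet)\b", re.I)
EVENT = re.compile(r"\b(?P<event>leave|leaves|depart|departs|arrive|arrives)\b", re.I)
PLACE = re.compile(r"\b(?P<place>Rochester|Avon|Bath|Corning)\b", re.I)

def extract(utterance):
    """Pull out whatever parameters the patterns can find; ignore the rest."""
    slots = {}
    for name, pattern in [("train", TRAIN), ("event", EVENT), ("location", PLACE)]:
        if m := pattern.search(utterance):
            slots[name] = m.group(1)
    return slots

print(extract("When does the Niagara Bullet leave Rochester?"))
# {'train': 'Niagara Bullet', 'event': 'leave', 'location': 'Rochester'}
print(extract("Went up the Niagara Bullet to leave in Chester"))
# The garbled recognition still yields the train and the event.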
The next level up in complexity involves
representing the task by a series of contexts,
each represented using the frame-based
approach. For example, for a simple travel
booking agent, the system might need to book
a series of travel segments, and each one would
be represented by a context containing the
information about one travel leg. It might also
be able to book hotels and rental cars. With
multiple contexts, such systems must be able
to identify when the user switches contexts. It
can be quite challenging to recognize cases
where a user goes back and wants to modify a
previously discussed context, say, to change
some detail about the first leg of a trip after discussing the second leg. Examples of such work
can be found within the Defense Advanced
Research Projects Agency COMMUNICATOR Project (for example, Xu and Rudnicky [2000]).
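The added bookkeeping over a single frame can be sketched as follows (an invented illustration; the COMMUNICATOR systems cited above use much richer evidence for detecting topic shifts). Each leg of the trip is its own frame, and each new utterance must first be attributed to the right one:

# Illustrative multi-context bookkeeping for a travel-booking dialogue.
contexts = [
    {"topic": "leg 1", "origin": "Rochester", "dest": "Chicago", "date": None},
    {"topic": "leg 2", "origin": "Chicago", "dest": "Seattle", "date": None},
]
current = 1  # the second leg is under discussion

def find_context(mentioned, contexts, current):
    """Heuristic: if the utterance mentions values that belong to an earlier
    context rather than the active one, assume the user switched back."""
    for i, ctx in enumerate(contexts):
        if i != current and any(ctx.get(k) == v for k, v in mentioned.items()):
            return i
    return current

# "Actually, make the Rochester departure Tuesday" mentions leg 1's origin:
current = find_context({"origin": "Rochester"}, contexts, current)
print(current)  # -> 0: the system reopens the first leg's context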
At the University of Rochester, we are primarily interested in the design of systems for
the next two levels of complexity, shown in
table 1. In these, the tasks are too complicated
to represent as a series of parameterized contexts. In fact, these tasks require the system to
maintain an explicit model of the tasks or
world and reason about these models. The language and the dialogues become significantly
more complicated, and one needs to start
explicitly modeling the collaborative problem-solving process that the system and user
engage in. In the plan-based approach, the dialogue involves interactively constructing a
plan with the user (for example, a design for a
kitchen, a plan to evacuate personnel off an
island). The last level of complexity involves
agent-based models. These dialogues might
still involve planning but also might involve
executing and monitoring operations in a
dynamically changing world (for example,
emergency rescue coordination).
Parameter           | Possible Values
The train ID        | BN101, …
The event           | Departure, arrival
The location        | Avon, Bath, Corning, …
The date-time range | Monday, Aug. 3, afternoon, …

Table 2. Context for a Train Information Task.
Practical Dialogue
Note that although this classification probably
covers most of the potential applications for
human-computer interaction, it by no means
captures the extent of full human conversation.
In all these settings, the dialogue is focused on
accomplishing a concrete task. We call such dialogues practical dialogues. The dialogue literature
identifies many specific subclasses of practical
dialogue, such as task-oriented dialogues, information-seeking dialogues, advice and tutoring
dialogues, and command and control dialogues.
The fact that we are dealing with practical
dialogues is important because it does seem
clear that full natural language understanding
by machine will not occur in the foreseeable
future. Our belief, however, is that sufficient
understanding of practical dialogues is feasible,
captured in the following hypothesis:
The Practical Dialogue Hypothesis: The
conversational competence required for
practical dialogues, although still complex, is significantly simpler to achieve
than general human conversational competence.
However, even if this hypothesis is true, it
might still be too time consuming to construct
understanding systems for different domains.
Even if we could build a demonstration system
in one domain, it might still be infeasible to
apply this technique to other domains. Thus,
we arrive at our second hypothesis, which says that although practical dialogues in different domains might appear quite different at first glance, they all share essentially the same underlying structures:
The Domain-Independence Hypothesis:
Within the genre of practical dialogue, the
bulk of the complexity in the language
interpretation and dialogue management
is independent of the task being performed.
(1) USR: We need to get the woman in Penfield to Strong.
(2) SYS: OK
(3) USR: What vehicles are available?
(4) SYS: There are ambulances in Pittsford and Webster.
(5) USR: Let's use the one in Pittsford.
(6) SYS: Do you know that Route 96 is blocked due to construction?
(7) USR: Oh
(8) USR: Let's use the interstate instead.
(9) SYS: OK. I'll dispatch the crew.

Figure 1. A Short Example Dialogue.
If this hypothesis is true, it is worthwhile to
spend a considerable effort building a generic
dialogue system that can then be adapted to
each new task relatively easily. Our experiences, detailed later in this article, have thus far
lent support to the truth of these hypotheses.
The practical dialogue hypothesis is explored
in more detail in Allen et al. (2000).
Example: A Fragment of a Practical Dialogue
The following fragment of a dialogue illustrates
some of the power of dialogue-based interaction. Dialogues of this complexity can be handled by our system (to be described later). The
entire dialogue fragment is shown as figure 1.
The task is an emergency rescue scenario.
Specifically, the user must collaborate with the
system to manage responses to 911 calls in a
simulation for Monroe County, New York. In
this situation, Penfield, Pittsford, and Webster
are towns; Strong is the name of a hospital; and
the dialogue starts after a report of an injured
woman has been received. The first utterance
serves to establish a joint objective to get the
woman to Strong Memorial Hospital. In utterance 2, the system confirms the introduction
of the new objective. This context is crucial for
interpreting the subsequent interaction. The
system must identify the question in 3 as initiating a problem-solving act of identifying
resources to use to solve the problem, which
has significant impact on how the question is
interpreted: First, although the user asked
about vehicles, the system needs to realize that
it should only consider ambulances because
they are the only vehicles that can perform this
part of the task. Second, the term available
means different things in different contexts. In
this context, it means the vehicles are not currently assigned to another task and that a crew
is ready. Finally, although there might be many
available ambulances, the system chooses in its
response in 4 to list the ones that are closest to
Penfield.
Note that in a frame-based system, such contextual interpretation can be built into the specialized language processing. In our system,
however, the context is dynamically changing.
If we next talk about repairing a power line, for
example, the vehicles will now need to be
interpreted as electric utility trucks.
Utterance 5 is interpreted as an attempt to
specify a solution to the objective by sending
one of the ambulances from Pittsford. Specifically, the user has asked to use an ambulance in
Pittsford to take the woman to the hospital. In
this case, the system does not simply agree to
the request because it has identified a problem
with the most direct route. Thus, it responds by
giving the user this information in the form of
a clarification question in 6. At this stage, the
solution has not been agreed to and is still the
active focus of the discussion.
In 7, the user indicates that the problem was
not known (with the “Oh”). Even with a longer
pause here, the system should wait for some
continuation because the user has not stopped
his/her response. Utterance 8 provides further
specification of the solution and is interpreted
as confirming the solution of using an ambulance from Pittsford. (Note that “Let’s use one
from Webster instead” would have been a
rejection of this solution and the introduction
of a new one.) Again, it is reasoning about the
plan and the situation that leads the system to
the correct interpretation.
In utterance 9, the system confirms the solution and takes initiative to notify the ambulance crew (thus starting the execution of the
plan). Although this example is short, it does
show many of the key issues that need to be
dealt with in a planning-based practical dialogue. Note especially that the interaction is
collaborative, with neither the system nor the
user in control of the whole interaction.
Rather, each contributes when he/she/it can
best further the goals of the interaction.
Four Challenges for Dialogue Systems
Before giving a brief overview of our system, we
discuss four major problems in building dialogue systems to handle tasks in complex
domains and how we approached them. The
first is handling the level of complexity of the
language associated with the task. The second
is integrating a dialogue system with a complex
back-end reasoning system (for example, a factory scheduler, a map server, an expert system
for kitchen design). The third is the need for
intention recognition as a key part of the
understanding process. The fourth is enabling
mixed-initiative interaction, in which either
the system or the user can control the dialogue
at different times to make the interaction most
effective. We consider each of these problems
in turn.
Parsing Language in Practical Dialogues
The pattern-matching techniques used to great
effect in frame-based and sequential-context
systems simply do not work for more complex
domains. They do not capture enough of the
subtlety and distinctions that people depend
on in using language. We need to produce a
detailed semantic (that is, deep) representation
of what was said—something that captures
what the user meant by the utterance. Currently, the only way to get such a system is to build
it by hand. Although there are techniques for
automatically learning grammars from corpora
(for example, Charniak [2000]), such systems
produce a shallow representation of the structure of written sentences, not representations
of meaning.
There have been many efforts over the years
to develop broad-coverage grammars of natural
languages such as English. These grammars
have proven to be of little use in practice
because of the vast ambiguity inherent in natural languages. It would not be uncommon for
a 12-word sentence to have hundreds of different parses based on syntax alone. One of the
mainstay techniques for dealing with this
problem has been to use semantic restrictions
in the grammar to enforce semantic, as well as
syntactic, constraints. For example, we might
encode a restriction that the verb eat applies to
objects that are edible, for example, to disambiguate the word chips in “He ate the chips” to
be the ones made out of corn rather than silicon. The problem with semantic restrictions,
however, is that it is hard to find them if we
want to allow all possible sentences in conversational English. This is one place where the
practical dialogue hypotheses come into play.
Although semantic restrictions are hard to find
in general, there do appear to be reasonable
restrictions that apply to practical dialogues in
general. Furthermore, we can further refine the
general grammar by specifying domain-specific
restrictions for the current task, which can significantly reduce the possible interpretations
allowed by the grammar.
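As a toy rendering of this idea (our own sketch; the TRIPS grammar encodes such constraints declaratively in its lexicon and grammar rules), a small type table lets the parser prune readings that violate a verb's restrictions:

# Toy selectional restrictions: each verb constrains the semantic type of
# its object; readings that violate the constraint are pruned.
TYPE_OF = {
    "corn chip": "food",
    "silicon chip": "artifact",
}
RESTRICTIONS = {
    "eat": {"object": "food"},   # "eat" wants an edible object
}

def plausible(verb, object_reading):
    wanted = RESTRICTIONS.get(verb, {}).get("object")
    return wanted is None or TYPE_OF.get(object_reading) == wanted

readings = ["corn chip", "silicon chip"]
print([r for r in readings if plausible("eat", r)])
# ['corn chip'] -- the silicon reading of "He ate the chips" is discarded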
In TRIPS, we use a feature-based augmented
context-free grammar with semantic restrictions as described in Allen (1995). We have
found little need to change this grammar when
moving to new applications, except to extend
the grammar to cover new general forms that
haven’t happened to occur in previous
domains. We do, of course, have to define any
new words that are specific to the new application, but this defining is done relatively easily.
Another significant aspect of parsing spoken
language is that it is not sentence based.
Rather, a single utterance can realize a
sequence of communicative acts called speech
acts. For example, the utterance “OK let’s do
that then send a truck to Avon” is not a grammatical sentence in the traditional sense. It
needs to be parsed as a sequence of three
speech acts: an acknowledgment (“OK”), an
acceptance (“let’s do that”), and a request
(“send a truck to Avon”). Our grammar produces act descriptions rather than sentence
structures. To process an utterance, it looks for
all possible speech acts anywhere in the utterance and then searches for the shortest
sequence of acts that covers the input (or as much of the input as can be covered).
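The covering step can be pictured as a search for the fewest act hypotheses that span the utterance. Here is a minimal sketch of that search (the act inventory and word spans are hand-supplied for the example above; the real parser scores competing hypotheses rather than just counting them):

# Each candidate speech act spans words [start, end) of the utterance
# "OK let's do that then send a truck to Avon" (10 words).
ACTS = [
    ("ACK", 0, 1),       # "OK"
    ("ACCEPT", 1, 4),    # "let's do that"
    ("REQUEST", 4, 10),  # "then send a truck to Avon"
    ("REQUEST", 5, 10),  # "send a truck to Avon" (competing hypothesis)
]
N = 10  # words in the utterance

def shortest_cover(pos=0):
    """Fewest acts covering words [pos, N); None if no full cover exists.
    (Exponential in the worst case; fine for a sketch.)"""
    if pos >= N:
        return []
    best = None
    for act in ACTS:
        if act[1] == pos:
            rest = shortest_cover(act[2])
            if rest is not None and (best is None or 1 + len(rest) < len(best)):
                best = [act] + rest
    return best

print(shortest_cover())
# [('ACK', 0, 1), ('ACCEPT', 1, 4), ('REQUEST', 4, 10)]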
Integrating Dialogue and Task Performance
The second problem is how to build a dialogue system that can be adapted easily to almost any practical task. Given the range of
applications that might be used, from information retrieval to design to emergency relief
management to tutoring, we cannot place
strong constraints on what the application
program looks like. As a result, we chose to
work within an agent-based framework, where
the back end is viewed as a set of agents providing services, and we define a broker that
serves as the link between the dialogue system
and the back end, as shown in figure 2.
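The brokering idea can be made concrete with a short sketch (ours; the class and service names are invented and are not the actual TRIPS broker interface). Back-end agents register the services they provide, and the dialogue system addresses requests to a service rather than to a particular program:

# Schematic service broker: providers register capabilities; the dialogue
# system asks for a service by name without knowing which agent supplies it.
class ServiceBroker:
    def __init__(self):
        self.providers = {}

    def register(self, service, handler):
        self.providers[service] = handler

    def request(self, service, **args):
        if service not in self.providers:
            raise ValueError(f"no provider for {service}")
        return self.providers[service](**args)

broker = ServiceBroker()
broker.register("find-route", lambda origin, dest: f"route from {origin} to {dest}")
broker.register("list-vehicles",
                lambda kind: ["amb1", "amb2"] if kind == "ambulance" else [])

# The dialogue system only knows the abstract service names:
print(broker.request("list-vehicles", kind="ambulance"))  # ['amb1', 'amb2']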
Of course, the dialogue system has to know
much about the task being implemented by
the back end. Thus, the generic system (applicable across a practical domain) is specialized to
the particular domain by integrating domain-specific information. The challenges here lie in
designing a generic system for practical dialogue together with a framework in which new
tasks can be defined relatively easily. Key to this
enterprise is the development of an abstract
problem-solving model that serves as the
underlying structure of the interaction. This
model includes key concepts such as (1) objectives, the way you want the world to be (for example, goals and subgoals, constraints on solutions); (2) solutions, courses of action intended to move closer to achieving the objectives; (3) resources, objects and abstractions (for example, time) that are available for use in solutions; and (4) situations, the way the world currently is (or might be).

Utterances in a practical dialogue are interpreted as manipulations of these different aspects, for example, creating, modifying, deleting, evaluating, and describing objectives, solutions, resources, and situations. A domain-specific task model provides mappings from the abstract problem-solving model to the operations in a particular domain by specifying what things count as objectives, solutions, resources, and situations in this domain and how they can be manipulated. In this way, the general-purpose processing of practical dialogue is separated from the specifics of, for example, looking up information in a database, verifying design constraints, or planning rescue missions.

[Figure 2. Service Providers: the dialogue system is linked through a service broker to a set of back-end service providers.]
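The abstract model amounts to a small amount of shared state. The following is our own rendering of the four concepts as data (the class, field, and task-model names are invented; TRIPS has its own internal representations):

from dataclasses import dataclass, field

# Generic problem-solving state shared by all practical-dialogue domains.
@dataclass
class ProblemSolvingState:
    objectives: list = field(default_factory=list)  # how we want the world to be
    solutions: list = field(default_factory=list)   # courses of action toward objectives
    resources: list = field(default_factory=list)   # objects and abstractions usable in solutions
    situation: dict = field(default_factory=dict)   # how the world currently is

# A domain-specific task model says what counts as each category here.
EVACUATION_TASK_MODEL = {
    "objective-types": ["evacuate-city"],
    "resource-types": ["truck", "helicopter"],
}

state = ProblemSolvingState()
state.objectives.append({"type": "evacuate-city", "city": "Abyss", "to": "Delta"})
state.solutions.append({"achieves": 0, "vehicle": "truck one"})  # solves objective 0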
Intention Recognition

Many areas of natural language processing
have seen great progress in recent years with
the introduction of statistical techniques
trained on large corpora, and some people
believe that dialogue systems will eventually be
built in the same way. We do not think that
this is the case, and one of the main reasons is
the need to do intention recognition, that is,
determining what the user is trying to do by
saying the utterance. Let us illustrate this point
with one particular example from an application in which the person and the system must
collaborate to construct, monitor, and modify
plans to evacuate all the people off an island in
the face of an oncoming hurricane. Figure 3
shows a snapshot of an actual session in
progress with an implemented version of the
system. On the map, you can see the routes
developed to this point, and the plan itself is
displayed in a window showing the actions of
each vehicle over time.
For the following example, the context
developed by the interaction to this point is as
follows:
Objectives: The overall objective is to
evacuate the island. So far, one subgoal
has been developed: evacuating the people in the city of Abyss to Delta.
Solutions: A plan has been developed to
move the people from Abyss to Delta
using truck one.
Within this setting, consider the interpretations of the following utterances:
Can we use a helicopter to get the people
from Abyss? In this setting, this utterance is
most naturally talking about using a helicopter
rather than the truck to evacuate the people at
Abyss. It is ambiguous about whether the user
wants to make this change to the plan (that is,
a request to modify the plan) or is asking a
question about feasibility. With the first interpretation, an appropriate system response
might be to say “sure” and modify the plan. With
the second, it might be, “Yes we could, and that
would save us 10 hours.”
Can we use a helicopter to get the people
at Barnacle? The only change is the city mentioned, but now the most natural interpretation is that the user wants to introduce a new
subgoal (evacuating Barnacle) and is suggesting
a solution (fly them out by helicopter). As
before, this utterance is ambiguous between request and question interpretations. A good
response to the request might be “OK” (and
adding the goal and solution to the plan), but
a good response to the question would be
“Yes.”
Can we use a helicopter to get the people
from Delta? Changing the city to Delta
changes the most likely interpretation yet
again. In this case, the most natural interpretation is that once the people from Abyss arrive
in Delta, we should pick them up by helicopter.
In this case, the user is talking about extending
a solution that was previously discussed. As in
the other two cases, it could be a request or a
question.
These examples show that there are at least
six distinct and plausible interpretations of an
utterance of this form (changing only the city
name). Distinguishing between these interpretations requires reasoning about the task to
identify what interpretation makes sense rationally given the current situation. There is no
way to avoid this reasoning if we are to
respond appropriately to the user. The techniques in statistical natural language processing, although useful for certain subproblems
such as parsing, do not suggest any approach
to deal with problems that require reasoning in
context.
Note also that even if the system only had to
answer questions, it would be necessary to perform intention recognition to know what
question is truly being asked, which reveals the
complexity, and the power, of natural language. Approaches that are based purely on the
form of language, rather than its content and
context, will always remain extremely limited.
There are some good theoretical models of
intention recognition (for example, Kautz and
Allen [1986]), but these models have proven to
be too computationally expensive for real-time
systems. In TRIPS, we use a two-stage process
suggested in Hinkelman and Allen (1989). The
first stage uses a set of rules that match against
the form of the incoming speech acts and generate possible underlying intentions. These
candidate intentions are then evaluated with
respect to the current problem-solving state.
Specifically, we eliminate interpretations that
would not make sense for a rational agent to do
in the current situation. For example, it is
unlikely that one would try to introduce a goal
that is already accomplished.
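A skeleton of this two-stage process, using the helicopter example above (our own sketch of the idea; the rule format, function names, and state representation are all invented for illustration):

# Stage 1: surface-form rules propose candidate intentions for a speech act.
def propose_intentions(speech_act):
    if speech_act["form"] == "can-we-X":  # "Can we use a helicopter ...?"
        return [
            {"intent": "request-modify-plan", "content": speech_act["content"]},
            {"intent": "ask-feasibility", "content": speech_act["content"]},
        ]
    return []

# Stage 2: discard candidates a rational agent would not pursue right now.
def rational(candidate, ps_state):
    # For example, don't introduce a goal that is already accomplished.
    if candidate["intent"] == "request-modify-plan":
        return candidate["content"] not in ps_state["accomplished"]
    return True

def recognize(speech_act, ps_state):
    return [c for c in propose_intentions(speech_act) if rational(c, ps_state)]

state = {"accomplished": []}
act = {"form": "can-we-X", "content": "use-helicopter-for-Abyss"}
print(recognize(act, state))  # both readings survive; task reasoning must pick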
Mixed-Initiative Dialogue
Human practical dialogue involves mixed-initiative interaction; that is, the control of the
dialogue changes between the conversants. In
contrast, in a fixed-initiative dialogue, one participant controls the interaction throughout.
Mixed-initiative interaction increases dialogue
effectiveness and efficiency and enables both participants' needs to be met.

[Figure 3. Interacting with TRIPS: a snapshot of a session, with the routes developed so far shown on a map and the plan displayed as each vehicle's actions over time.]
Finite-state systems are typically fixed system initiative. At each point in the dialogue,
the user must answer the specific question the
system asks. This approach can work well for
very simple tasks such as long-distance dialing
and typically works well for tasks such as finding out information about your bank accounts
(although it often takes many interactions to
get the one required piece of information). It
does not work well when you need to do something a little out of the normal path—in which
case, you often might need to go through
many irrelevant interactions before getting the
information you want, if ever. As the task
becomes more complex, these strategies
become less and less useful.
At the other extreme, a frame-based system
can be fixed user initiative. The system would
do nothing but interpret user input until sufficient information was obtained to perform the
task. This approach is problematic in that the
user might not know what information he/she
still needs to supply.
Because of these problems, many current
spoken-dialogue systems offer limited mixed-initiative interaction. On the one hand, the
system might allow the user to give information that is in addition to what the system
asked for. On the other, the system might ask
for clarification or might prompt for information it needs to complete the task. These are the
simplest forms of initiative that can occur. In
more complex domains, initiative can occur at
different levels (for example, Chu-Carroll and Brown [1997]) and dramatically change the system behavior over extended periods of the dialogue.

There are cases where conflicts can arise between the needs of the system and the requirements of the dialogue. For example, consider the 911 domain discussed earlier. A situation can arise in which the user has asked a question (and, thus, is controlling the dialogue), but the system learns of a new emergency and, thus, wants to notify the user of the new problem. In such cases, the system must balance the cost of respecting the user's initiative by answering the question against the cost of ignoring the question and taking the initiative to deal more quickly with the new emergency. For example, we might see an interaction such as the following:

User: What is the status of clearing the interstate from the accident?

System: Hold on a minute. There's a new report that a tree just fell on someone in Pittsford.

Although this interaction seems incoherent at the discourse level, it might very well be the preferred interaction when viewed from the perspective of optimizing the task performance.

Current spoken-dialogue systems have only supported limited discourse-level mixed initiative. As we see later, in TRIPS, we have developed an architecture in which much more complex interactions can occur. The system's behavior is determined by a component called the behavioral agent that, in responding to an utterance, considers its own private goals in addition to the obligations it has because of the current state of the dialogue.

The TRIPS System: A Prototype Practical Dialogue System

TRIPS (the Rochester interactive planning system) is the latest in a series of prototype practical dialogue systems that have been developed at the University of Rochester (Allen et al. 1995; Ferguson and Allen 1998). We started by collecting and studying human-human dialogues where people collaborate to solve sample problems (Heeman and Allen 1995; Stent 2000). We then used these data to specify performance goals for our systems. Our ongoing research plan is to incrementally increase the complexity of the domain and, at the same time, increase the proficiency of the system. In this section, we provide a brief overview of the latest TRIPS system, concentrating on the responsibilities of the core components and the information flow between them.

As mentioned previously, TRIPS uses an agent-based component architecture. The interagent communication language is a variant of the knowledge query and manipulation language (Labrou and Finin 1997). This architecture has proven to be very useful for supporting the development and maintenance of the system and facilitates experimentation with different algorithms and techniques. More details on the system architecture can be found in Allen, Ferguson, and Stent (2001).

The components of the system can be divided into three areas of function: (1) interpretation, (2) generation, and (3) behavior. Each area consists of a general control-management module that coordinates the behavior of the other modules in the cluster and shares information and messages with the other managers, as shown in figure 4. As we discuss the components, we consider the processing of an utterance from our 911 domain.

[Figure 4. The TRIPS System Architecture: speech recognition, parsing, and GUI events feed an interpretation manager that draws on a reference manager and the discourse context; a generation manager coordinates sentence generation and display planning; the behavioral agent, receiving events and information from external sources, works through the task manager to reach the "back-end" applications and external agents.]

The speech recognizer in TRIPS uses the SPHINX-II system (Huang et al. 1993), which outputs a set of word hypotheses to the parser. A keyboard manager allows the user to type
input as well and passes this input to the parser. The parser is a best-first bottom-up chart
parser along the lines described in Allen (1995)
and, as discussed earlier, uses a grammar that
combines structural (that is, syntactic) and
semantic information. TRIPS uses a generic
grammar for practical dialogue. A general
semantic model and a generic set of predicates
represent the meanings of the common core of
conversational English. The grammar is then
specialized using a set of declarative “scenario”
files defined for each application. These files
define domain-specific lexical items, domain-specific semantic categories, and mappings
from generic predicates to domain-specific
predicates.
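Conceptually, a scenario specialization supplies information of the following kind (an invented illustration of the content just described; this is not the actual TRIPS scenario-file format):

# Invented illustration of what a scenario specialization provides.
SCENARIO = {
    "lexicon": {
        "Penfield": {"class": "TOWN"},
        "Strong": {"class": "HOSPITAL", "refers-to": "SMH1"},
    },
    "semantic-categories": {
        "VEHICLE": ["AMBULANCE", "UTILITY-TRUCK"],
    },
    "predicate-mappings": {
        "TRANSPORT": "EVACUATE-PATIENT",  # generic -> domain predicate
    },
}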
The output of the parser is a sequence of
speech acts, with the content of each specified
using the generic predicates as well as the predicates defined in the scenario files. Consider
the utterance “We need to get the woman in
Penfield to Strong,” where Penfield is a town,
and Strong is the name of a hospital. A simplified version of the output of the parser would
be the speech act given in figure 5.
The output of the parser is passed to the
interpretation manager, which is responsible
for interpreting such speech acts in context
and identifying the discourse obligations that
the utterance produces as well as the problem-solving acts that the user is attempting to
accomplish. It invokes the reference manager
to attempt to identify likely referents for referring expressions. The reference manager uses
the accumulated discourse context from previous utterances, plus knowledge of the particular situation, to identify likely candidates. The
analysis of references in the current sentence is
shown in table 3.
The interpretation manager then uses the
task manager to aid in interpreting the intended speech and problem-solving acts. In this
case, it uses general lexical knowledge that sentences asserting needs often serve to introduce
new active goals in practical dialogues (though not always; for example, they can also be used to extend solutions, as in “Go on to Pittsford. We will need to get fuel there”). To
explore this hypothesis, it checks whether the
action of transporting an injured woman to a
hospital is a reasonable active goal in this
domain, which it is. Thus, it identifies figure 6
as what the user intended to accomplish by
his/her utterance (the intended problem-solving act).
The interpretation manager also identifies a
discourse obligation to respond to the user’s
utterance (figure 7), where SA11 is the assert
speech act shown earlier.
(ASSERT
 :ID SA11
 :SPEAKER USR
 :HEARER SYS
 :CONTENT
  (NEED
   :AGENT (PRO "we")
   :THEME
    (TRANSPORT
     :OBJECT
      (THE ?w (AND
       (TYPE ?w WOMAN)
       (AT-LOC ?w
        (NAME ?n "Penfield"))))
     :TO-LOC (NAME ?s "Strong"))))

Figure 5. A Simplified Parser Output.
(INITIATED
 :WHO USR
 :WHAT
  (CREATE
   :ID PS22
   :AGENT USR
   :WHAT
    (OBJECTIVE
     :WHAT
      (TRANSPORT
       :OBJECT
        (THE ?w (AND
         (TYPE ?w WOMAN)
         (AT-LOC ?w PENFIELD)
         (REFERS-TO ?w WOM1)))
       :TO-LOC SMH1))))

Figure 6. The Intended Problem-Solving Act.
(OBLIGATION
 :WHO SYS
 :WHAT
  (RESPOND-TO
   :WHAT SA11))

Figure 7. The Discourse Obligation.

(REQUEST
 :CONTENT
  (CONFIRM
   :WHAT SA11
   :WHY
    (COMPLETE
     :WHO SYS
     :WHAT PS22)))

Figure 8. The Request to the Generation Manager.

Referring Expression    | Likely Referent                                          | Source Used
“We”                    | SS1: The set consisting of USR and SYS                   | The general setting of the dialogue
“The woman in Penfield” | WOM1: The injured woman in Penfield previously discussed | Discourse history
“Strong”                | SMH1: Strong Memorial Hospital                           | General world knowledge and lexicon

Table 3. Reference Resolution.

The behavioral agent must decide how to handle the proposed act of establishing a goal. Assuming that it has no difficulty doing this and that it has nothing else more pressing, it can plan a response of confirming this suggested goal, thereby completing the problem-solving act initiated by the user. Thus, it could send the request in figure 8 to the generation manager to simply confirm the user's suggestion.

The behavioral agent could also, if it chose to take more task-level initiative, search for available ambulances and suggest one to use for transporting the woman. In this example, however, the system remains more passive. The generation manager, receiving this request as well as knowing the pending discourse obligation, can satisfy both using a simple “OK” (and, possibly, updating a screen showing active goals if there are any).
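The last step can be caricatured in a few lines (an illustration of the bookkeeping only; the names are invented): the generation manager holds the pending discourse obligation, receives the behavioral agent's request, and can discharge both with a single utterance:

# Sketch: one surface utterance can satisfy a discourse obligation and a
# behavioral-agent request at the same time.
obligations = [{"who": "SYS", "what": ("RESPOND-TO", "SA11")}]
requests = [{"confirm": "SA11", "why": ("COMPLETE", "PS22")}]

def plan_response(obligations, requests):
    # If the requested confirmation also answers the pending obligation,
    # a bare acknowledgment does double duty.
    if requests and any(ob["what"] == ("RESPOND-TO", requests[0]["confirm"])
                        for ob in obligations):
        return "OK"
    return None

print(plan_response(obligations, requests))  # -> OK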
Conclusion
Although there are still serious technical issues
remaining to be overcome, dialogue-based user
interfaces are showing promise. Once they
reach a certain level of basic competence, they
will rapidly start to revolutionize the way people interact with computers, much like
direct-manipulation interfaces (using menus
and icons) revolutionized computer use in the
last decade.
We are close to attaining a level of robust
performance that supports empirical evaluation of such systems. For example, we performed an experiment with an earlier dialogue
system that interacted with the user to define
train routes (Allen et al. 1996). For subjects, we
used undergraduates who had never seen the
system before. They were given a short videotape about the routing task; taught the
mechanics of using the system; and then left to
solve routing problems with no further advice
or teaching, except that they should interact
with the system as though it were another person. Over 90 percent of the sessions resulted in
successful completion of the task. Although the task was quite simple, and some of the dialogues fairly lengthy given the task solved, this experiment supports the viability of dialogue-based interfaces and validates the claim that such systems would be usable without any
user training.
In the next year, we plan a similar evaluation
of the TRIPS 911 system, in which untrained
users will be given the task of handling emergencies in a simulated world. This experiment
will provide a much more significant assessment of the approach using a task that is near
to the level of complexity found in a wide
range of useful applications.1
Note
1. This article gave a very high-level overview of the
TRIPS project. More information on the project,
including downloadable videos of system runs, is
available at our web site, www.cs.rochester.edu/research/cisd.
References
Allen, J.; Byron, D.; Dzikovska, M.; Ferguson, G.;
Galescu, L.; and Stent, A. 2000. An Architecture for a
Generic Dialogue Shell. Journal of Natural Language
Engineering 6(3): 1–16.
Allen, J.; Ferguson, G.; and Stent, A. 2001. An Architecture for More Realistic Conversational Systems.
Paper presented at the Intelligent User Interfaces
Conference (IUI-01), 14–17 January, Santa Fe, New
Mexico.
Allen, J. F. 1999. Mixed-Initiative Interaction. IEEE Intelligent Systems 14(5): 14–23.
Allen, J. F. 1995. Natural Language Understanding.
Wokingham, U.K.: Benjamin Cummings.
Allen, J. F.; Miller, B. W.; Ringger, E. K.; and Sikorski, T. 1996. A Robust System for Natural Spoken Dialogue. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, 62–70. New Brunswick, N.J.: Association for Computational Linguistics.
Allen, J. F.; Schubert, L. K.; Ferguson, G.;
Heeman, P.; Hwang, C.-H.; Kato, T.; Light,
M.; Martin, N. G.; Miller, B. W.; Poesio, M.;
and Traum, D. R. 1995. The TRAINS Project:
A Case Study in Defining a Conversational
Planning Agent. Journal of Experimental and
Theoretical AI 7(2): 7–48.
Charniak, E. 2000. A Maximum-Entropy-Inspired Parser. Paper presented at the First Meeting of the North American Chapter of the Association for Computational Linguistics, 29 April–4 May, Seattle, Washington.
Chu-Carroll, J. 1999. Form-Based Reasoning for Mixed-Initiative Dialogue Management in Information-Query Systems. Paper
presented at EUROSPEECH, 5–9 September,
Budapest, Hungary.
Chu-Carroll, J., and Brown, M. K. 1997.
Tracking Initiative in Collaborative Dialogue Interactions. Paper presented at the
Thirty-Fifth Annual Meeting of the Association for Computational Linguistics (ACL-97), 7–12
July, Madrid, Spain.
Cohen, P. R., and Oviatt, S. L. 1995. The
Role of Voice Input for Human-Machine
Communication. Proceedings of the National
Academy of Sciences 92(22): 9921–9927.
Ferguson, G., and Allen, J. F. 1998. TRIPS: An
Integrated Intelligent Problem-Solving
Assistant. In Proceedings of the National
Conference on Artificial Intelligence (AAAI98), 567–573. Menlo Park, Calif.: American
Association for Artificial Intelligence.
Heeman, P. A., and Allen, J. 1995. The
TRAINS-93 Dialogues, TN94-2, Department
of Computer Science, University of
Rochester.
Hinkelman, E., and Allen, J. F. 1989. Two
Constraints on Speech Act Ambiguity.
Paper presented at the Conference of the Association for Computational Linguistics
(ACL), 26–29 June, Vancouver, Canada.
Huang, X. D.; Alleva, F.; Hon, H. W.;
Hwang, M. Y.; Lee, K. F.; and Rosenfeld, R.
1993. The SPHINX-II Speech-Recognition System: An Overview. Computer Speech and Language 7(2): 137–148.
Kautz, H., and Allen, J. F. 1986. Generalized
Plan Recognition. In Proceedings of the
National Conference on Artificial Intelligence (AAAI-86), 32–37. Menlo Park, Calif.:
American Association for Artificial Intelligence.
Labrou, Y., and Finin, T. 1997. A Proposal
for a New KQML Specification. TR CS-97-03, Department of Computer Science and
Electrical Engineering, University of Maryland.
Seneff, S., and Polifroni, J. 2000. Dialogue
Management in the Mercury Flight Reservation System. Paper presented at the
Workshop on Conversational Systems, 30
April, Seattle, Washington.
Shneiderman, B. 2000. The Limits of Speech Recognition. Communications of the ACM 43(9): 63–65.
Stent, A. J. 2000. The Monroe Corpus.
TR728 and TN99-2, Department of Computer Science, University of Rochester.
Sturm, J.; den Os, E.; and Boves, L. 1999.
Dialogue Management in the Dutch ARISE
Train Timetable Information System. Paper
presented at EUROSPEECH, 5–9 September,
Budapest, Hungary.
Xu, W., and Rudnicky, A. 2000. Task-Based
Dialogue Management Using an Agenda.
Paper presented at the Workshop on Conversational Systems, 30 April, Seattle,
Washington.
Zue, V.; Seneff, S.; Glass, J.; Polifroni, J.;
Pao, C.; Hazen, T.; and Hetherington, L.
2000. JUPITER: A Telephone-Based Conversational Interface for Weather Information.
IEEE Transactions on Speech and Audio Processing 8(1).
James Allen is the John
Dessauer professor of
computer science at the
University of Rochester.
He was editor-in-chief of
Computational Linguistics
and has authored several
books, including Natural Language Understanding, 2d ed. (Benjamin Cummings, 1995).
He received a Presidential Young Investigator Award for his research and is a fellow of
the American Association for Artificial
Intelligence. Allen’s research interests lie at
the intersection of language and reasoning
and span a range of issues, including natural language understanding, dialogue systems, knowledge representation, commonsense reasoning, and planning. His e-mail
address is james@cs.rochester.edu.
Myroslava Dzikovska
is a Ph.D. student in
computer science at the
University of Rochester.
She graduated with a
B.Sc. in applied mathematics from the Ivan
Franko Lviv State University in Lviv, Ukraine.
Her research interests are natural language
understanding, spoken dialogue, AI, and
computational linguistics. Her e-mail
address is myros@cs.rochester.edu.
Lucian Galescu received
an M.S. in computer science in 1998 from the
University of Rochester,
where he is currently a
Ph.D. candidate. He also
received an M.S. in mathematics–computer science in 1992, with a thesis in the field of theorem proving, from the
A. I. Cuza University of Iasi, Romania, where
he held a junior faculty position until 1996.
His current research interests are in the area
of statistical modeling of natural language.
His e-mail address is galescu@cs.rochester.edu.
Amanda Stent is a Ph.D.
candidate at the University of Rochester. Her
research area is natural
language generation for
task-oriented spoken dialogue. In 2002, she is
joining the faculty in the
Department of Computer Science at State University of New York
at Stony Brook. Her e-mail address is
stent@cs.rochester.edu.
George Ferguson is a
research scientist in the
Computer Science Department at the University of Rochester. He
received his Ph.D. from
Rochester in 1995 for
work on adapting AI
planning and reasoning
to the needs of interactive, mixed-initiative
systems. His research interests continue to
be at the interface between AI technology
and human-computer interaction. His e-mail address is ferguson@cs.rochester.edu.