DRAFT
Article Review: Spoken Dialogue Technology: Enabling the Conversational User Interface
MICHAEL F. MCTEAR University of Ulster
ACM Computing Surveys (CSUR)
Volume 34, Issue 1 (March 2002)
Pages: 90-169
Year of Publication: 2002
ISSN: 0360-0300
This article has been reviewed by Gregory A. Vaughn Sr., in partial fulfillment
of course DC891, Summer of 2004
“Spoken dialogue systems allow users to interact with computer-based applications such
as databases and expert systems by using natural spoken language. The origins of spoken
dialogue systems can be traced back to Artificial Intelligence research in the 1950s concerned
with developing conversational interfaces. However, it is only within the last decade or so, with
major advances in speech technology, that large-scale working systems have been developed
and, in some cases, introduced into commercial environments. As a result many major
telecommunications and software companies have become aware of the potential for spoken
dialogue technology to provide solutions in newly developing areas such as computer-telephony
integration. Voice portals, which provide a speech-based interface between a telephone user and
Web-based services, are the most recent application of spoken dialogue technology. This article
describes the main components of the technology---speech recognition, language understanding,
dialogue management, communication with an external source such as a database, language
generation, speech synthesis, and shows how these component technologies can be integrated
into a spoken dialogue system. The article describes in detail the methods that have been adopted
in some well-known dialogue systems, explores different system architectures, considers issues
of specification, design, and evaluation, reviews some currently available dialogue development
toolkits, and outlines prospects for future development.”[1]
Beyond his abstract, McTear presents this article as a detailed description and
explanation of the most salient elements of spoken dialogue technology. His intended audience
includes both computer scientists new to this technology and those experienced in it
who wish to conduct further research in the area. McTear takes a different approach in
this paper: rather than focusing on spoken language systems currently in existence, he
focuses on the underlying technologies. His review is supplemented by examples of
specific systems that illustrate common issues and problems.
[1] The review is broken down into logical units that cover:
Spoken Dialogue System Definition
Classification of Dialogue Systems by Control Strategies
Components of a Spoken Dialogue System
1. Speech recognition: The conversion of an input speech utterance, consisting of a
sequence of acoustic-phonetic parameters, into a string of words.
2. Language understanding: The analysis of this string of words with the aim of
producing a meaning representation for the recognized utterance that can be used by
the dialogue management component.
3. Dialogue management: The control of the interaction between the system and the
user, including the coordination of the other components of the system.
4. Communication with external system: For example, with a database
system, expert system, or other computer application.
5. Response generation: The specification of the message to be output by the
system.
6. Speech output: The use of text-to-speech synthesis or prerecorded speech
to output the system’s message.
Review of a Number of Architectures and Dialogue Control Strategies
1. Speech Corpora
2. Wizard-of-Oz studies
3. Speech and Language Understanding Components
4. Guidelines and Standards for Spoken Language Systems
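The six components listed above can be pictured as a simple pipeline. The Python sketch below is only an illustration under stated assumptions: every function name, the DialogueState class, the three-city vocabulary, and the toy flight table are hypothetical and do not come from McTear's article. Speech recognition and speech output are stubbed with plain text, since real versions would need audio input and a synthesizer.

```python
from dataclasses import dataclass, field


@dataclass
class DialogueState:
    """Information the dialogue manager has gathered so far (hypothetical)."""
    slots: dict = field(default_factory=dict)


def recognize_speech(audio: str) -> str:
    # 1. Speech recognition (stub): the "audio" is already text here,
    # so this step only normalizes it into a word string.
    return audio.lower().strip()


def understand(words: str) -> dict:
    # 2. Language understanding: build a crude meaning representation
    # by looking for a known city after "from" or "to".
    cities = ("belfast", "london", "paris")
    meaning = {}
    tokens = words.replace(",", " ").split()
    for i, token in enumerate(tokens):
        if token in cities and i > 0:
            if tokens[i - 1] == "from":
                meaning["origin"] = token
            elif tokens[i - 1] == "to":
                meaning["destination"] = token
    return meaning


def manage_dialogue(state: DialogueState, meaning: dict) -> tuple:
    # 3. Dialogue management: update the state, then decide the next action.
    state.slots.update(meaning)
    for slot in ("origin", "destination"):
        if slot not in state.slots:
            return ("ask", slot)
    return ("lookup", None)


def query_external_system(slots: dict) -> list:
    # 4. Communication with an external system: a toy in-memory "database".
    flights = {("belfast", "london"): ["08:15", "17:40"]}
    return flights.get((slots["origin"], slots["destination"]), [])


def generate_response(action: str, argument, results=None) -> str:
    # 5. Response generation: specify the message to be output.
    if action == "ask":
        return f"What is your {argument}?"
    if results:
        return "Flights leave at " + " and ".join(results) + "."
    return "Sorry, I found no flights."


def speak(message: str) -> str:
    # 6. Speech output (stub): a real system would synthesize or play audio.
    return message


def handle_turn(state: DialogueState, audio: str) -> str:
    """Run one user turn through all six components in order."""
    words = recognize_speech(audio)
    meaning = understand(words)
    action, argument = manage_dialogue(state, meaning)
    results = query_external_system(state.slots) if action == "lookup" else None
    return speak(generate_response(action, argument, results))
```

Calling handle_turn twice on the same DialogueState, first with "I want to fly from Belfast" and then with "to London", shows the dialogue manager asking for the missing destination before it consults the external system.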
Research Approach
The author has conducted a formal survey of the professional literature (a literature review)
in support of his hypothesis that practical, efficient research is needed that
fuses a number of the currently available technologies to
produce practical solutions to problems in spoken dialogue technology.
I believe the work is of importance to researchers in this area, and to me specifically,
because it brings under one cover a cross section of the more important work on this topic.
It is therefore an excellent jumping-off point for any serious research on spoken language
technology. It also gives cause to pause and rethink the approaches that can or should be
taken, and the limits that may apply.
Related Research Problems and Future Research
[1] Research in robotics on gesture-free spoken dialogue has revealed that current
systems cannot handle deictic references (references made with words such as here, there,
and that). These types of references have no meaning unless they are tied to some concrete
situation. Usually they are resolved by including physical gestures (pointing in the
direction you wish the robot to go). A research question can be posed to see [3] “how far can
one go in resolving deictic reference without gesture recognition?” Another area for
examination might be universal quantification and negation. Current systems cannot properly
interpret and respond to questions like (when the robot sees a soda can) “Are the cans that you
see all red?” Likewise, instructions like “Don’t do it” cannot be properly interpreted; only
simple phrases such as “stop” or “stop going to” can be used. Lastly, while there are many
instances where communication with a robot by a number of users may be desirable, only one
user at a time can hold a dialogue with current spoken language systems. Prioritizing natural
language inputs and making sense of seemingly contradictory commands then become major
research challenges.
The author offers multiple paths for future research in the area of spoken
language/dialogue technology. These research initiatives may focus more on the different ways
that natural language components can be integrated and that spoken dialogue systems can be
deployed as marketable real-world applications.
Research in areas of AI such as planning may be tightly integrated with speech
act theory to produce models of conversational agency. More sophisticated dialogue
managers, based on research in text-based dialogue managers, may also be developed. Another path
suggests that statistical techniques may be integrated with reinforcement learning algorithms.
This integration would allow for “automated learning of optimal learning strategies”. Markov
models are already a popular technique in speech recognition and in text translation; here the
dialogue itself is modeled as a Markov decision process, in which each dialogue state is an
element of a state space affected by both user
responses and system actions. Yet another path to be traversed is the World Wide Web. The
author believes that “there is a potential for applications using spoken dialogue technology to
perform services such as home shopping, or to control program appliances around the home.”
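The Markov decision process view mentioned above can be made concrete with a toy example. The sketch below is an assumption-laden illustration, not the method of any system in the article: the three states, the probabilities, and the rewards are all invented. It also uses value iteration, a planning technique that needs the transition model to be known. A reinforcement learner would instead estimate the same values from sample dialogues.

```python
GAMMA = 1.0  # episodic task: every dialogue eventually ends, so no discounting

# Hypothetical model. Each action maps to (probability, next_state, reward) outcomes.
# "high" and "low" stand for the recognizer's confidence in the value it heard.
MDP = {
    "start": {
        "ask": [(0.7, "high", -1), (0.3, "low", -1)],
    },
    "high": {
        # Accept the value as heard: right 90% of the time.
        "accept": [(0.9, "end", 10), (0.1, "end", -10)],
        # Spend an extra turn (-2) to confirm; re-ask if the value was wrong.
        "confirm": [(0.9, "end", 8), (0.1, "start", -2)],
    },
    "low": {
        "accept": [(0.5, "end", 10), (0.5, "end", -10)],
        "confirm": [(0.5, "end", 8), (0.5, "start", -2)],
    },
}


def q_value(values, outcomes):
    """Expected return of one action given the current value estimates."""
    return sum(p * (reward + GAMMA * values[nxt]) for p, nxt, reward in outcomes)


def value_iteration(mdp, sweeps=200):
    """Return (state values, greedy policy) for the given dialogue model."""
    values = {state: 0.0 for state in mdp}
    values["end"] = 0.0  # terminal state
    for _ in range(sweeps):
        for state, actions in mdp.items():
            values[state] = max(q_value(values, o) for o in actions.values())
    policy = {
        state: max(actions, key=lambda a: q_value(values, actions[a]))
        for state, actions in mdp.items()
    }
    return values, policy


values, policy = value_iteration(MDP)
```

With these invented numbers the resulting strategy accepts a high-confidence value directly and asks for confirmation only when confidence is low. A designer might write such a rule by hand, but here it follows from the costs assigned to extra turns and to errors.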
VoiceXML distributed applications are currently being developed and deployed that
support synthesized speech, digitized audio, recognition of spoken and DTMF key input,
recording of spoken input, telephony, and mixed-initiative dialogues. VoiceXML, now
accepted as a standard, may be deployed in spoken language applications that have real-world
usability.
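To give a feel for what such an application looks like, the following Python sketch builds a very small VoiceXML 2.0 document with the standard library. The form, field name, and prompt wording are hypothetical, and the output has not been checked against any particular voice browser. It uses only core VoiceXML elements (vxml, form, field, prompt, option, filled, value) and lets the caller answer either by voice or by DTMF key.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"


def build_drink_form() -> str:
    """Return a minimal VoiceXML document that asks one question."""
    vxml = ET.Element("vxml", {"version": "2.0", "xmlns": VXML_NS})
    form = ET.SubElement(vxml, "form", {"id": "order"})

    # A field collects one piece of information from the caller.
    drink = ET.SubElement(form, "field", {"name": "drink"})
    prompt = ET.SubElement(drink, "prompt")
    prompt.text = "Would you like coffee or tea?"

    # Each option can be spoken or selected with the given DTMF key.
    for key, choice in enumerate(("coffee", "tea"), start=1):
        option = ET.SubElement(drink, "option", {"dtmf": str(key)})
        option.text = choice

    # Once the field is filled, read the recognized value back.
    filled = ET.SubElement(drink, "filled")
    reply = ET.SubElement(filled, "prompt")
    reply.text = "You chose "
    value = ET.SubElement(reply, "value", {"expr": "drink"})
    value.tail = "."

    return ET.tostring(vxml, encoding="unicode")


document = build_drink_form()
```

Printing document shows the markup a voice browser would interpret. The prompt is rendered as synthesized speech, and the option elements define both the spoken and the keypad input that the field accepts.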
In summary the author has provided a list of potential topics for spoken dialogue research:

“more robust speech recognition”
1. including the ability to perform well in noisy conditions
2. deal with out-of-vocabulary words
3. close integration with technologies for natural language

the use of prosody in spoken dialogue systems
1. provide more naturally sounding output
2. assist recognition by identifying phrase boundaries as well as the functions of
utterances.

research concerned with component integration and with investigating the extent to
which the language understanding and dialogue management components can
compensate for deficiencies in speech recognition

investigation of the applicability of different technologies for particular application
types, such as the costs and benefits of parsing using theoretically motivated grammars
compared with robust and partial parsing and with more pragmatically driven methods
such as concept spotting.

studies of different approaches to dialogue management in relation to the requirements
of an application, indicating for example
1. where state-based methods are applicable
2. under which circumstances more complex approaches are required

the incorporation of more sophisticated approaches to dialogue management deriving
from AI-based research.

research into the use of stochastic and machine learning techniques.

the development of multimodal dialogue systems

dialogue systems with Web integration
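Concept spotting, named in the list above as a pragmatically driven alternative to full parsing, can be sketched in a few lines. The Python example below is a hypothetical illustration: the concept names, the small city and day vocabularies, and the spot_concepts function are assumptions made for this sketch. Rather than parsing the whole utterance against a grammar, it picks out only the fragments that matter to the application and ignores everything else.

```python
import re

# Hypothetical vocabularies for a travel application.
CITIES = r"(belfast|london|paris)"
DAYS = r"(monday|tuesday|wednesday|thursday|friday|saturday|sunday|today|tomorrow)"

# One pattern per concept; anything that matches no pattern is simply skipped.
CONCEPT_PATTERNS = {
    "origin": re.compile(r"\bfrom " + CITIES + r"\b"),
    "destination": re.compile(r"\bto " + CITIES + r"\b"),
    "day": re.compile(r"\b" + DAYS + r"\b"),
}


def spot_concepts(utterance: str) -> dict:
    """Return the concepts found in a (possibly disfluent) utterance."""
    text = utterance.lower()
    found = {}
    for concept, pattern in CONCEPT_PATTERNS.items():
        match = pattern.search(text)
        if match:
            found[concept] = match.group(1)
    return found
```

For a disfluent input such as "uh I want um to go from Belfast to London on Friday please", the function still recovers the origin, destination, and day. For a badly recognized string such as "flights london friday" it returns only the day, leaving the dialogue manager to ask for the rest. This tolerance of partial input is the benefit to be weighed against the loss of the fuller analysis a theoretically motivated grammar would give.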
In closing, the author concludes that much of the future research will most probably be pursued
by commercial interests in an effort to deploy products that are marketable and profitable and
that, within the constraints of the technology, can be made desirable to the consumer.
References
[1] Michael F. McTear (2002). Spoken dialogue technology: enabling the conversational
user interface. ACM Computing Surveys, Volume 34, Issue 1 (March 2002), pp. 90-169.
[2] D.C. Charles Hair. Review: Spoken dialogue technology: enabling the
conversational user interface. ACM Computing Surveys (CSUR).
[3] Human-Robot Interaction Through Gesture-Free Spoken Dialogue.
Autonomous Robots, Volume 16, Issue 3 (May 2004), pp. 239-257.
ISSN: 0929-5593.
Citation Appendix
Human-Robotic Interaction Based on Spoken Natural Language Dialogue, by Dimitris
Spiliotopoulos, Ion Androutsopoulos, and Constantine D. Spyropoulos; Software and
Knowledge Engineering Laboratory, Institute of Information and Telecommunications, National
Centre for Scientific Research.
Luis Villaseñor, Manuel Montes, Jean Caelen, A model for the conversation multimodal
man-machine: integration of speech and action, Proceedings of the Latin American conference on
Human-computer interaction, August 17-20, 2003, Rio de Janeiro, Brazil
Karen Ward, David G. Novick, Hands-free documentation, Proceedings of the 21st annual
international conference on Documentation, October 12-15, 2003, San Francisco, CA, USA
Spoken Language Technology 103
locate these connectors, instructions for
the first two actions are not required and
so the system proceeds with instructions
for the third action, which is confirmed in
User3, and for the fourth action. Here the
user requires further instructions, which
are given in System5 with the action confirmed in User5. At this point the user
asserts that the wire between 84 and 99
is connecting, so that the fifth instruction
to connect the second end to 99 is not required.
A further missing axiom is discovered,
which leads the system to ask what
the LED is displaying (System7).
3.4. Summary
The examples presented in this section
have illustrated three different types of
dialogue control strategy. The selection
of a dialogue control strategy determines
the degree of flexibility possible in the dialogue
and places requirements on the
technologies employed for processing the
user’s input and for correcting errors.
There are many variations on the dialogue
strategies illustrated here, and these will
be discussed in greater detail in Section 5.
The next section will examine the component
technologies of spoken dialogue
systems.
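To illustrate how the choice of control strategy fixes the flexibility of a dialogue, the following minimal Python sketch implements a state-based (finite-state) controller. It is not taken from the article: the state names, prompts, slot names, and the run_dialogue helper are hypothetical, and user input is supplied as a list of strings instead of recognized speech.

```python
# Hypothetical dialogue graph: state -> (system prompt, slot to fill, next state).
STATES = {
    "ask_origin": ("Where are you leaving from?", "origin", "ask_destination"),
    "ask_destination": ("Where are you going to?", "destination", "ask_day"),
    "ask_day": ("What day do you want to travel?", "day", "done"),
}


def run_dialogue(user_answers):
    """Walk the fixed state graph, consuming one user answer per state."""
    state = "ask_origin"
    slots = {}
    transcript = []
    answers = iter(user_answers)
    while state != "done":
        prompt, slot, next_state = STATES[state]
        transcript.append(("System", prompt))
        answer = next(answers)
        transcript.append(("User", answer))
        # Whatever the user says is stored in the slot this state expects.
        slots[slot] = answer
        state = next_state
    return slots, transcript


slots, transcript = run_dialogue(["Belfast", "London", "Friday"])
```

The order of questions is fixed by the graph, so a user who volunteers the travel day in the first answer would have it stored as the origin. That rigidity keeps the demands on input processing low, because each state only has to handle one kind of answer, but it leaves little room for user initiative.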
4. COMPONENTS OF A SPOKEN
DIALOGUE SYSTEM
A spoken dialogue system involves the integration
of a number of components that
typically provide the following functionalities
[Wyard et al. 1996]:
Speech recognition: The conversion of an
input speech utterance, consisting of a
sequence of acoustic-phonetic parameters,
into a string of words.
Language understanding: The analysis
of this string of words with the aim
of producing a meaning representation
for the recognized utterance that can
be used by the dialogue management
component.
Dialogue Management: The control of the
interaction between the system and the
user, including the coordination of the
other components of the system.
Communication with external system:
For example, with a database system,
expert system, or other computer
application.
Response generation: The specification of
the message to be output by the system.
Speech output: The use of text-to-speech
synthesis or prerecorded speech to output
the system’s message.
These components are examined in the following
subsections in relation to their role
in a spoken dialogue system (for a recent
text on speech and language processing,
see Jurafsky and Martin [2000]).
4.1. Speech Recognition
The task of the speech recognition component
of a spoken dialogue system is
to convert the user’s input utterance,
which consists of a continuous-time signal,
into a sequence of discrete units such as
phonemes (units of sound) or words. One
major obstacle is the high degree of variability
in the speech signal. This variability
arises from the following factors:
Linguistic variability: Effects on the
speech signal caused by various linguistic
phenomena. One example is coarticulation,
that is, the fact that the same
phoneme can have different acoustic realizations
in different contexts, determined
by the phonemes preceding and
following the sound in question.
Speaker variability: Differences between
speakers, attributable to physical factors
such as the shape of the vocal tract
as well as factors such as age, gender,
and regional origin; and differences
within speakers, due to the fact that
even the same words spoken on a different
occasion by the same speaker tend
to differ in terms of their acoustic properties.
Physical factors such as tiredness,
congested airways due to a cold,
and changes of mood have a bearing
on how words are pronounced, but the
location of a word within a sentence and
the degree of emphasis it is given are
also factors which result in intraspeaker
variability.
ACM Computing Surveys, Vol. 34, No. 1, March 2002.