
A Chatbot Response Generation System

Jasper Feine
Karlsruhe Institute of Technology
Karlsruhe, Germany
jasper.feine@kit.edu
Stefan Morana
Saarland University
Saarbrücken, Germany
stefan.morana@uni-saarland.de
Alexander Maedche
Karlsruhe Institute of Technology
Karlsruhe, Germany
alexander.maedche@kit.edu
ABSTRACT
Developing successful chatbots is a non-trivial endeavor. In particular, the creation of high-quality natural language responses for chatbots remains a challenging and time-consuming task that often depends on high-quality training data and deep domain knowledge. As a consequence, it is essential to engage experts who have the required domain knowledge in the chatbot response development process. However, current tool support for engaging domain experts in the response generation process is limited and often does not go beyond the exchange of decoupled prototypes and spreadsheets. In this paper, we present a system that enables chatbot developers to efficiently engage domain experts in the chatbot response generation process. More specifically, we introduce the underlying architecture of a system that connects to existing chatbots via an API, provides two improvement mechanisms for domain experts to improve chatbot responses during their chatbot interaction, and helps chatbot developers to review the collected response improvements with a sentiment-supported review dashboard. Overall, the design of the system and its improvement mechanisms are useful extensions for chatbot development systems in order to support chatbot developers and domain experts in collaboratively enhancing the natural language responses of a chatbot.
CCS CONCEPTS
• Human-centered computing → Natural language interfaces; User interface design.
KEYWORDS
chatbot response, improvement mechanism, system, domain expert,
chatbot developer
ACM Reference Format:
Jasper Feine, Stefan Morana, and Alexander Maedche. 2020. A Chatbot
Response Generation System. In Mensch und Computer 2020 (MuC’20), September 6–9, 2020, Magdeburg, Germany. ACM, New York, NY, USA, 9 pages.
https://doi.org/10.1145/3404983.3405508
1 INTRODUCTION
The quality of conversational interaction design in general, and the quality of chatbot responses in particular, is critical for the user experience with chatbots (i.e., text-based conversational agents) [19, 25, 34]. However, many interactions with chatbots are rather short, constrained by limited vocabulary, and contain incomplete or wrong information [1, 6, 12]. This not only leads to a low penetration rate of chatbots [25], but also limits their application beyond simple dyadic interactions [37, 50, 54].

The natural language capabilities of chatbots are mostly limited by the amount of effort developers invest in the development of a chatbot's dialog system [21, 38]. Independent of the type of dialog system (i.e., data-driven or rule-based), the creation and evaluation of high-quality chatbot responses is a very time-consuming endeavor. Whereas chatbot developers have the essential technical expertise to develop such dialog systems, they often lack the required domain knowledge. Domain experts, on the other hand, have the required domain knowledge but lack the technological expertise [2, 52]. As a consequence, it is necessary to develop a system that empowers domain experts to engage in the response generation process in order to support chatbot developers in crafting chatbot responses that are relevant for the respective end-user groups [21, 29, 54].

What is currently missing is a system that allows chatbot developers to actually involve the respective domain experts in an effective and efficient chatbot response generation process. Current processes are often limited to the testing of decoupled prototypes and the exchange of spreadsheets. More specifically, current chatbot development systems lack two important interrelated functionalities: (1) they do not enable domain experts to easily improve and propose new chatbot responses while (2) chatbot developers keep control and curation over the response generation process.

To address this need, we introduce a system with three key functionalities: (1) The system is implemented as a web application and connects easily to any existing chatbot via an API. Therefore, the system serves as an additional layer between domain experts and the connected chatbots. In addition, (2) the system enables domain experts to interact with the connected chatbot via an auto-generated chat window, which allows them to directly improve chatbot responses during their interaction. Finally, (3) the system supports chatbot developers with a sentiment-supported review dashboard for reviewing, accepting, changing, or rejecting the chatbot response improvements collected from the domain experts.

The developed system is effective because it enables chatbot developers to continuously improve the responses of existing chatbots together with the respective domain experts. It is efficient because it orchestrates the response generation process in one system without creating additional development effort. It thereby enables chatbot developers to create high-quality chatbot responses, interaction data, and contextual information that can be used to enhance the dialog system of a chatbot. Consequently, the design of the system can be used to extend existing chatbot development systems in order to
support chatbot developers and domain experts in collaboratively enhancing the responses of a chatbot.
In the remainder of this paper, we first review work on chatbot dialog systems, their natural language capabilities, and existing chatbot response improvement systems. Subsequently, we introduce the design of the proposed system and outline how we instantiated it. Next, we report the results of a pilot study in which we tested the proposed system in a field deployment. Finally, we discuss the benefits of the system and critically reflect on its application contexts, its limitations, and further research avenues.
2 RELATED WORK
2.1 Chatbot Dialog Systems
The quality of a conversational interaction between a human and a chatbot is currently mostly limited by the amount of effort developers invest in the chatbot's dialog system. The dialog system typically consists of three interacting components: natural language understanding (i.e., converting words to meaning), dialog management (i.e., deciding the next system action), and response generation (i.e., converting meaning to words) [40].
Dialog systems of chatbots can be broadly distinguished in terms of their dialog coherency and scalability [21, 27]. On the one hand, dialog systems that comprise handcrafted domain-specific dialog rules enable goal-oriented chatbots to converse coherently about a specific topic. The naturalness of the chatbot responses is, however, mostly determined by the amount of effort chatbot developers invest in the development of dialog rules and the authoring of rule-specific chatbot responses [21]. Thus, it is time-consuming to extend the natural language capabilities of such a chatbot, which limits the scalability of these approaches. On the other hand, data-driven dialog managers automatically generate chatbot responses based on large, existing dialog corpora. They probabilistically match user messages to examples in the training data and then select the best matching response from the training data set without using any handcrafted dialog rules. These approaches are often used for the development of non-goal-oriented chatbots whose primary purpose is chatting. However, data-driven approaches lack coherency and robustness because the naturalness of the responses strongly relies on the quality of the training data. The generation of high-quality training data is, however, a major challenge [21, 27].
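To make the data-driven matching concrete, the following minimal sketch retrieves the best matching response from a toy dialog corpus via TF-IDF cosine similarity. The corpus entries, the use of scikit-learn, and the function names are illustrative assumptions on our part, not part of any of the cited systems:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy dialog corpus of (user message, chatbot response) pairs
# (hypothetical examples).
CORPUS = [
    ("when does the exam take place",
     "The exam takes place at the end of the semester."),
    ("who gives the lecture",
     "The lecture is given by our institute."),
    ("how do i register for a thesis",
     "You can register for a thesis via our website."),
]

vectorizer = TfidfVectorizer()
message_matrix = vectorizer.fit_transform([msg for msg, _ in CORPUS])

def respond(user_message: str) -> str:
    """Return the response whose training message matches the input best."""
    similarities = cosine_similarity(
        vectorizer.transform([user_message]), message_matrix
    )
    return CORPUS[similarities.argmax()][1]

print(respond("when is the exam"))  # -> the exam response above

A production dialog manager would additionally track dialog state and draw on far larger corpora; the sketch only illustrates the probabilistic matching step.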
Overall, rule-based dialog systems have dominated the chatbot landscape over the last decades [39], but data-driven dialog systems are becoming more popular [46]. To further leverage the strengths of both approaches, hybrid dialog systems have been proposed [21]. For example, Hybrid Code Networks combine a data-driven neural network with the ability to include procedural rules [60]. However, existing real-world solutions that combine both approaches are still scarce [21].
2.2 Natural Language Limitations of Chatbots

A major goal in the development of chatbots is to create natural language capabilities that meet user expectations [13, 15]. However, most chatbots often reply with the same message, possess only a very limited vocabulary, and often provide wrong information [26, 34].

To demonstrate the chatbots' limited language capabilities, we compared their responses with the responses of humans. To do so, we analyzed an existing human-chatbot dialog corpus from the Conversational Intelligence Challenge 2 (ConvAI2) [9]. We selected this dialog corpus, consisting of 1,111 dialogs, because it contains human-chatbot dialogs with state-of-the-art chatbots that are supposed to hold up an intelligent conversation with a human over several interaction turns. For our analysis, we downloaded the dialog corpus of the wild evaluation round, in which the human users evaluated the chatbot responses. We analyzed the lexical diversity of all chatbot and human messages using a Stanford CoreNLP server [36]. Using the server, we investigated all messages regarding their unique word lemmas and part-of-speech (POS) tags (94,933 in total) and counted the unique adjectives, adverbs, and verbs. We analyzed the adjectives, adverbs, and verbs because they are highly relevant for expressing emotions, an inherently human ability [3, 14]. As depicted in Figure 1, the ConvAI2 chatbots (human users) used in total 282 (494) unique adjectives, 97 (160) unique adverbs, and 264 (466) unique verbs in all ConvAI2 conversations of the wild evaluation round. The results indicate that the human users used 75% more unique adjectives, 65% more unique adverbs, and 76% more unique verbs across all conversations than the ConvAI2 chatbots. The corpus analysis thus reveals that human language usage is very diverse in terms of lexical and emotional diversity, and that even the well-designed chatbots from the ConvAI2 challenge cannot stand up to it.

Figure 1: Analysis of human-chatbot dialogs of the ConvAI2 challenge. The graph illustrates the number of unique adjectives, adverbs, and verbs used by the chatbots and human users during the wild evaluation round of the ConvAI2 challenge.
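For illustration, the following sketch shows how such a count of unique lemmas per POS group can be computed, assuming a Stanford CoreNLP server [36] running locally and a hypothetical list of messages; it approximates the analysis rather than reproducing our exact scripts:

import json
from collections import defaultdict

import requests

# Assumes a CoreNLP server is already running, e.g., started with
# java ... edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
CORENLP_URL = "http://localhost:9000"
PROPS = {"annotators": "tokenize,ssplit,pos,lemma", "outputFormat": "json"}

# Penn Treebank tag prefixes for the three word classes we count.
POS_GROUPS = {"JJ": "adjectives", "RB": "adverbs", "VB": "verbs"}

def unique_lemmas(messages):
    """Count unique adjective/adverb/verb lemmas across all messages."""
    lemmas = defaultdict(set)
    for message in messages:
        resp = requests.post(
            CORENLP_URL,
            params={"properties": json.dumps(PROPS)},
            data=message.encode("utf-8"),
        )
        for sentence in resp.json()["sentences"]:
            for token in sentence["tokens"]:
                for prefix, group in POS_GROUPS.items():
                    if token["pos"].startswith(prefix):
                        lemmas[group].add(token["lemma"].lower())
    return {group: len(words) for group, words in lemmas.items()}

# Hypothetical usage: chatbot_messages and human_messages would hold the
# ConvAI2 utterances of the respective speaker.
# print(unique_lemmas(chatbot_messages), unique_lemmas(human_messages))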
2.3 Crowd-Powered Dialog Systems

To further improve chatbot responses, chatbot developers can develop chatbot prototypes, present them to domain experts, analyze their interaction data, and conduct user interviews [25]. Subsequently, they can modify the chatbot responses before they start
a next improvement cycle. This process, however, is very work intensive.

A promising solution to improve the efficiency of improving chatbot responses is to leverage a dialog system that utilizes crowd-workers. Crowd-workers are human workers who are usually recruited anonymously through an open call over the web and consist of non-experts [5]. Crowd-powered dialog systems could reduce the scalability limitation of manually developed dialog systems without sacrificing complete control over the response generation process [27]. Whereas early crowd-working attempts took several hours to complete a task, recent approaches have been shown to work in nearly real-time [4, 32]. As a consequence, crowd-workers have been used to collectively reply to a user. A brief, non-exhaustive review of promising crowd-powered dialog systems is given in Table 1.

Table 1: Crowd-powered dialog systems.

SUEDE [29]: SUEDE is a speech interface prototyping tool which allows designers to easily create prompt/response speech interfaces and further enables them to test and analyze them with many users using a Wizard-of-Oz mode.

Chorus [34]: Chorus is a crowd-powered conversational assistant. While the assistant appears to be a single individual, it is actually driven by a dynamic crowd of multiple workers using a specially designed response interface.

Edina [31]: The paper reports a hybrid dialog manager which uses a technique called self-dialogs: crowd-workers write both the answers of a user and the responses of a chatbot in order to increase the naturalness of the dialog corpus.

RegionSpeak [62]: RegionSpeak is an advanced version of VizWiz [4] which collects labels from the crowd for several objects in a visual area and then enables blind users to explore the spatial layout of the objects.

Evorus [22]: The Evorus system engages crowd-workers to propose the best suitable chatbot responses and then automates itself over time using machine learning.

Mnemo [20]: The Mnemo system is a crowd-powered dialog plugin which is capable of saving and aggregating human-generated context notes from goal-oriented dialogs.

Fantom [27]: The Fantom system generates evolving dialog trees and automatically creates crowd tasks in order to collect responses for user requests that could not be handled so far.
Besides their advantages, crowd-powered dialog systems also create serious challenges because crowd-workers have been shown to abuse these systems [23, 33, 47]. In the context of crowd-powered dialog systems, three malicious user groups have been identified [23]: inappropriate workers (i.e., provide faulty or irrelevant information), flirters (i.e., are interested in the user's true identity or develop unnecessary personal connections), and spammers (i.e., perform an abnormally large amount of meaningless actions in a task) [23]. In particular, the case of Microsoft's Tay has shown dramatically what can go wrong with systems that automatically learn from user-generated content [47].
2.4 Chatbot Response Improvement Mechanisms

Overall, chatbot developers need to ensure that user-generated chatbot responses do not lead to offensive, inappropriate, or meaningless responses and that the contributors have sufficient domain knowledge to propose meaningful chatbot responses [23, 33, 47]. This applies to crowd-workers and end-users, but also to domain experts who may not take the improvement task seriously. To reduce these risks, several systems have investigated counter-mechanisms against the creation of malicious user improvements.

For example, Chorus [23] uses voting as a filtering mechanism, which worked fairly well in a field deployment. However, the mechanism only worked when at least one other crowd-worker also voted for a message [23]. In another study, the VizWiz system ensures that it always receives at least two response proposals from two different crowd-workers. The results of a field deployment further revealed that it took on average three response proposals to always receive at least one correct response [4]. Another approach is to award points to responses when they contain correct information [59]. A counter-mechanism to reduce the risk of stealing sensitive user data is used by Edina [31]: in its self-dialogs technique, crowd-workers author both the content of the chatbot and the content of the user. Another promising approach is to divide the improvement tasks into micro tasks that prevent any crowd-worker from seeing too much information [33]. However, the interaction context is very important in order to correctly understand a natural language interaction [20], and seeing only fractions of a conversation might not be sufficient to correctly improve a chatbot response. Finally, Fantom anonymized the dialogs, but the anonymization still had to be done manually [27].

Summing up, chatbot developers need to carefully consider the improvement mechanisms that can be used by domain experts to improve chatbot responses as well as the mechanisms to re-evaluate the improved responses before they are finally shown to the end-users.

3 DESIGNING A CHATBOT RESPONSE GENERATION SYSTEM

In this section, we propose the design of a chatbot response generation system that enables chatbot developers to engage domain experts in the chatbot response generation process. The high-level design of the system is described in the next section. Subsequently, the improvement mechanisms as well as the sentiment-supported review dashboard are described.
3.1 High-level Design

The proposed system is developed as a web application. The main dashboard and the main features of the web application are displayed in Figure 3. To start using the system, chatbot developers can connect any existing chatbot via its API. The only requirement for such a chatbot API is that it exchanges the messages between the users and the chatbot. Therefore, the system can be used to improve chatbots with different types of dialog systems because the dialog system of the to-be-improved chatbot still handles all the dialog management. This means that the system acts as an additional layer between the chatbot developers, the domain experts, and the to-be-improved chatbot without requiring access to the chatbot's source code.

Figure 3: (Middle) Main dashboard of the web application, which displays all connected chatbots and main functionalities; (top-left) interface to connect chatbots via their API key; (bottom-left) automatically generated chat window that can be shared with domain experts to improve the chatbot responses; (top-right) improvement dashboard that summarizes key measures of the chatbot response improvement process; (bottom-right) review dashboard to review and delete collected chatbot responses.

The high-level design of the system is illustrated in Figure 2. It illustrates that the system is not limited to one chatbot only but functions as a platform that can easily be connected with several chatbots via their APIs. In addition, an auto-generated chat window can be shared with domain experts. This ensures that the system scales well with the need to test many different chatbot versions and also reduces the effort of developing specific chatbots that can be connected to this system.

Figure 2: High-level design of the chatbot response generation system.

To develop a prototype of the proposed design, we decided to include chatbots that converse via Microsoft's Direct Line 3.0 API [43]. After chatbot developers have connected a chatbot via its Direct Line API (see Figure 3, top-left), the system instantiates a knowledge base and generates a shareable chat window (see Figure 3, bottom-left). Chatbot developers can then share a link to the chat window with domain experts. Domain experts can then interact with the chatbot, and the chat window offers two improvement mechanisms to directly improve the chatbot responses during an interaction (described in detail in the following section). This enables domain experts to directly improve disappointing chatbot responses during the interaction, which are then stored in the system's knowledge base. Finally, chatbot developers can review and delete the collected chatbot responses using a sentiment-supported review dashboard, which is described in Section 3.3.
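The following sketch illustrates, in simplified form, how such a relay layer can exchange messages with a connected chatbot over the public Direct Line 3.0 REST endpoints. The helper names and the simple polling strategy are illustrative assumptions; the actual chat window builds on the Direct Line JS client [43] instead:

import requests

DIRECT_LINE = "https://directline.botframework.com/v3/directline"

def start_conversation(secret: str):
    """Open a Direct Line conversation using the chatbot's API key."""
    resp = requests.post(
        f"{DIRECT_LINE}/conversations",
        headers={"Authorization": f"Bearer {secret}"},
    )
    body = resp.json()
    return body["conversationId"], body["token"]

def relay_message(conversation_id: str, token: str, text: str):
    """Send a domain expert's message and return the chatbot's replies."""
    headers = {"Authorization": f"Bearer {token}"}
    requests.post(
        f"{DIRECT_LINE}/conversations/{conversation_id}/activities",
        headers=headers,
        json={"type": "message", "from": {"id": "domain-expert"}, "text": text},
    )
    # Fetch the activity log; a real implementation would poll with the
    # returned watermark or subscribe to the WebSocket stream instead.
    activities = requests.get(
        f"{DIRECT_LINE}/conversations/{conversation_id}/activities",
        headers=headers,
    ).json()["activities"]
    return [
        a.get("text", "")
        for a in activities
        if a["type"] == "message" and a["from"]["id"] != "domain-expert"
    ]

Because the layer only reads and writes messages, it works with any dialog system behind the API, which is exactly what keeps the threshold for connecting a chatbot low.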
3.2 Improvement Mechanisms

We can classify chatbot response improvement mechanisms along two counteracting continua, namely the need for a chatbot developer to review the collected chatbot response improvements and the mechanism's restrictiveness. At one end of the continuum, very unrestricted improvement mechanisms can lead to creative and natural improvements that increase the quality of the chatbot. However, they can also lead to potentially malicious improvements [10, 26, 33, 47]. At the other end of the continuum, very restricted improvement mechanisms that only allow users to change specific sections of chatbot responses reduce the reviewing effort of chatbot developers [56], but limit the creativity and naturalness of the proposed chatbot responses. A combination of restricted and less restricted mechanisms could be a promising approach to increase the language variation of chatbot responses while chatbot developers keep control over the response generation process. To instantiate such improvement mechanisms, we developed a chat window based on Microsoft's WebChat [42]. The developed chat window allows domain experts to directly improve chatbot responses with both restricted and unrestricted improvement mechanisms during a human-chatbot interaction.

The first improvement mechanism limits the domain expert's degree of freedom to change the responses of a chatbot while still allowing domain experts to increase the lexical and emotional variety of a chatbot. To design such a mechanism, we analyzed popular online translators (i.e., Google Translate, DeepL) that enable users to improve given translations. For example, Google Translate allows users to click on a specific part of the sentence and then shows a drop-down list with alternative translations. Users can then select a more appropriate translation in order to directly manipulate the translation in the web interface.

Based on this idea, we developed a similar improvement mechanism for chatbots and implemented it in the chat window. To do so, the chat window sends all chatbot responses received via the API to an instance of a Stanford CoreNLP server [36] before displaying them. The CoreNLP server tags all words with their part-of-speech (POS) tags. The chat window then highlights all adjectives, adverbs, and verbs because they are highly relevant for expressing emotions [3]. If domain experts want to improve a word in a chatbot response, they can simply click on it. This word, including its POS tag, is then sent to an instance of Princeton's WordNet dictionary [45], which returns appropriate synonyms. If the word is a verb, the JavaScript package compromise [28] further transforms the verb into its appropriate form. The synonyms are then displayed in a drop-down list, and the domain expert can select a more appropriate synonym. The instantiation of the restricted mechanism is shown in Figure 4.

Figure 4: Chat window with restricted improvement mechanism.
The second improvement mechanism should enable domain experts to directly manipulate the chatbot interaction in the chat window in order to simplify the mapping between goals and actions [17] and to encourage a feeling of engagement and power of control [17, 55]. Therefore, we developed a direct manipulation mechanism that is characterized by a continuous representation of the objects of interest (i.e., chatbot responses), offers physical actions (i.e., domain experts can directly click on a chatbot response), and shows the impact of the users' actions immediately on the objects of interest (i.e., chatbot responses are immediately updated in the chat window) [55]. This improvement mechanism enables domain experts to freely improve and add chatbot responses in the chat window, as displayed in Figure 5. However, chatbot response improvements created with this mechanism increase the reviewing effort of chatbot developers because domain experts may propose malicious chatbot responses.

Figure 5: Chat window with unrestricted improvement mechanism.
3.3 Review Dashboard
To reduce the effort of reviewing the collected chatbot responses, the system supports chatbot developers by automatically deleting all responses that contain profanity, using the bad-words JavaScript package [41]. In addition, the system offers a review dashboard (see Figure 6) which enables chatbot developers to review, accept, reject, or modify the collected chatbot responses, which are then updated in the system's knowledge base. To further support chatbot developers, the review dashboard analyzes all chatbot responses using the Vader sentiment analysis package [18]. Improvements with a negative sentiment score are highlighted in the review dashboard. Developers can then easily delete or modify the collected improvements. Finally, chatbot developers can export the collected chatbot responses. The export can then be used to further enhance the dialog system of the chatbot.

Figure 6: Sentiment-supported review dashboard which shows all collected chatbot response improvements. In addition, it shows their interaction context. Color coding indicates the sentiment scores of the collected chatbot responses. Chatbot developers can delete or modify the collected chatbot responses before they export the data.
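The following sketch illustrates the dashboard's triage logic, combining a profanity check with Vader's compound sentiment score [18]. The blocklist is a hypothetical stand-in for the bad-words package [41], and the threshold of 0 mirrors the highlighting of negative-sentiment improvements described above:

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Hypothetical blocklist; the deployed system uses the JavaScript
# bad-words package [41] for this step instead.
PROFANITY = {"damn"}

analyzer = SentimentIntensityAnalyzer()

def triage(improvements):
    """Drop profane improvements and flag negative ones for review."""
    kept, flagged = [], []
    for text in improvements:
        if any(word in PROFANITY for word in text.lower().split()):
            continue  # auto-deleted, never reaches the dashboard
        # Vader's compound score lies in [-1, 1]; improvements with a
        # negative score are highlighted for the developer.
        if analyzer.polarity_scores(text)["compound"] < 0:
            flagged.append(text)
        else:
            kept.append(text)
    return kept, flagged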
4 PILOT STUDY
We tested the system's ability to improve chatbot responses in a pilot study. In particular, we investigated how the system is actually used in a real-world scenario. To this end, we used the system to improve the chatbot responses of an existing chatbot, namely the student service chatbot of our institute. The student service chatbot was developed using the Microsoft Bot Framework and can respond to questions about employees, lectures, exams, and thesis projects. It is available on our website and is mostly used by our students.

We connected the chatbot to the system by adding its Direct Line 3.0 API key and then engaged end-users with the required domain knowledge in a response improvement process. To do so, we posted the link to the chat window in two Facebook groups in which end-users of the chatbot (i.e., students of our institute) coordinate and discuss course contents. We asked them to improve the responses of the student service chatbot because they know best how the chatbot should respond in specific situations. Participation was voluntary, and the participants did not receive any compensation.

After two days of data collection, several students had interacted in 110 sessions with the chatbot via the system. The students sent in total 230 messages to the chatbot. They improved 36 complete chatbot responses and added 27 new chatbot responses to the knowledge base. In addition, they used the restricted improvement mechanism to replace 8 synonyms in the original chatbot responses. Moreover, one student changed one synonym as well as the text of a response. Thus, a total of 72 chatbot responses were changed or added. As a consequence, the design of the chatbot response generation system enabled the students to easily improve the responses of the institute's chatbot.
Subsequently, we investigated the language variation of the newly collected chatbot responses and compared them to the initial chatbot response set. To analyze the language variation, we used a Stanford CoreNLP server [36] and tagged all responses regarding their unique word lemmas and POS tags (see Figure 7). The results revealed that the collected chatbot responses contain a total of eight more unique adjectives, nine more unique adverbs, and eleven more unique verbs than the original chatbot response set. This illustrates that the lexical diversity of the chatbot responses was increased through the response improvements proposed by the students.

Finally, we analyzed the chatbot responses that were improved most frequently. The most frequently improved chatbot response was the welcome message (i.e., "Hello, I am the chatbot of..."). The students proposed in total 10 different versions of it. This variety of improvements leads to the challenge of selecting the best welcome message. We discuss this challenge in more detail in the next section. The second most frequently improved chatbot response (7 times) was the chatbot's excuse for not understanding a user's message (i.e., "I am sorry, but I did not understand your message"). For example, one student asked whether the chatbot also speaks German. The chatbot apologized for not understanding the user, and the student improved the response by replacing it with "Nein tut mir leid. I only speak English!". This information can now be used to update the dialog system of the chatbot in order to enable the chatbot to respond to questions regarding its language capabilities.
Figure 7: Lexical variety of the original chatbot responses and of the improved chatbot responses collected during the pilot study.
5 DISCUSSION
In this paper, we proposed the design of a system that engages
domain experts in the chatbot response generation process. The
system enables chatbot developers to connect an existing chatbot to
the system and enables domain experts to directly improve chatbot
responses during a conversation with this chatbot. In addition,
chatbot developers keep control and curation over the improvement
process because they can review the collected responses using a
sentiment-supported review dashboard.
The proposed design of the system can be useful when chatbot
developers are looking to expand the natural language response
capabilities of a chatbot by involving domain experts in their development process. As a consequence, the proposed system and
its components can be useful extensions for real-world chatbot improvement systems (e.g., Rasa X [51]) or chatbot development kits
(e.g., Microsoft’s Power Virtual Agents [44]). Such an approach can
enable chatbot developers and domain experts to collaboratively
enhance the natural language responses of a chatbot.
A major advantage of the system is its compatibility with different types of existing chatbots, independent of their technology and goal. This is possible because the system only requires an API connection to the chatbot that reveals all messages exchanged between a user and the chatbot. All chatbot responses and improvements are directly displayed in the chat window and stored in the
system’s knowledge base. This is a major advantage over other
improvement approaches that require a chatbot to be adapted to
a specific platform [11] or require additional effort in developing decoupled prototypes, e.g., in the form of mockups.
Therefore, chatbot improvement systems that are able to connect to existing chatbots via APIs can be a promising approach to increase the adoption of tool support because the threshold to use such a system is very low. Chatbot developers do not need to change the source code of the chatbot or develop additional mockups. They only need to connect an existing chatbot via its API to the portal. Other promising approaches have even generated a chatbot directly from APIs by leveraging the crowd [24]. Such easy-to-use approaches can support knowledge transfer from research to practice because research findings are not limited to a specific chatbot instantiation and can thus be directly applied in practice.
5.1 Limitations and Future Research Avenues

Our study comes with limitations that also suggest opportunities for future research. First, it must be noted that the system was only evaluated in a pilot study, which revealed that it was indeed seamlessly possible to connect an existing chatbot with the system and that expert users such as students would use the system to contribute new and enhanced chatbot responses. To gain a further understanding of how domain experts in organizations actually perceive and use such a system, more evaluations are necessary because the current pilot study only focused on one type of user group. Therefore, we plan to conduct further laboratory and field evaluations with different user groups in order to show the benefits of such a system beyond the first indications reported here. This also includes testing how end-users of such a chatbot react to the improved chatbot responses in order to demonstrate that the approach results in the design of better chatbots.

Second, additional evaluations are necessary in order to investigate the scalability of such an improvement approach. The current design of the system leads to the creation of several response improvements for the same chatbot response because humans generally reply in a multitude of ways [27]. Chatbot developers need to converge the collected improvements to an optimal set of chatbot responses in order to implement them in the dialog system. However, it is quite difficult to distill a best suitable set of chatbot responses because no single best interaction style of a chatbot exists. The best chatbot response always depends on the user, task, and context of the interaction [15, 16]. For example, it has been shown that users with different levels of task experience also prefer different language styles of a chatbot [8]. To address this, chatbot developers would need to develop chatbots that adapt their language style to the preferences of an individual user. This has been shown to be a promising approach [57]. However, such approaches require more complex dialog systems that do not only manage the dialog flow, but also select the most appropriate chatbot response based on individual user characteristics that have been collected by the system. Such approaches could further help to develop chatbots that are able to adapt to individual users [14, 54, 58], act as emotion regulators [48, 53], effectively support collaborative collocated talk-in-action [49, 54], and support the future of work [35].

Third, future research should further investigate unrestricted and other restricted user-driven improvement mechanisms to improve chatbot responses while avoiding abuse. While synonym selection, profanity detection, and sentiment analysis help chatbot developers to identify harmful user improvements, more automated approaches [33] can make user-driven chatbot improvement approaches much more efficient. In this regard, existing approaches from FAQ systems could be extended to a chatbot context in which end-users are capable of rating the received chatbot responses.

Fourth, it must be noted that the system's improvement mechanisms may not be equally useful for all types of chatbots. For
chatbots that mainly employ predefined dialogue flows, corrected or reworked answer alternatives may be a substantial contribution. However, a chatbot response is often written to serve as a suitable response to a broad range of user requests. Hence, reworking the chatbot responses based on the experiences of a single user may be counterproductive in some contexts. Therefore, future research could extend the proposed system to improve the complete dialog flow of a chatbot with regard to the interaction context.
Fifth, the design of the system is currently restricted to improving the responses of chatbots (i.e., text-based conversational agents). However, voice-based conversational agents are becoming increasingly important [61]. They have lower barriers of entry and use [7] and enable interactions separate from the already busy chat modality [30]. Consequently, future research can extend the design of the proposed system to investigate response generation systems for voice-based conversational agents.
6 CONCLUSION

In this paper, we propose the design of a system that enables chatbot developers to efficiently engage domain experts in the chatbot response generation process. The system enables chatbot developers to connect existing chatbots via an API to the system. The system enables domain experts to improve the chatbot responses during an interaction. We tested the system with students in a pilot study. Overall, the design of the system and its improvement mechanisms can be useful extensions for chatbot development systems in order to support chatbot developers and domain experts in collaboratively enhancing the natural language responses of a chatbot.

REFERENCES
[1] Martin Adam, Michael Wessel, and Alexander Benlian. 2020. AI-based chatbots in customer service and their effects on user compliance. Electronic Markets 9, 2 (2020), 204.
[2] Saleema Amershi, Maya Cakmak, William Bradley Knox, and Todd Kulesza. 2014. Power to the people: The role of humans in interactive machine learning. AI Magazine 35, 4 (2014), 105–120. https://doi.org/10.1609/aimag.v35i4.2513
[3] Farah Benamara, Carmine Cesarano, Antonio Picariello, and Venkatramana S. Subrahmanian. 2019. Sentiment analysis: Adjectives and adverbs are better than adjectives alone. In Proceedings of ICWSM. Academic Press.
[4] Jeffrey P. Bigham, Chandrika Jayant, Hanjie Ji, Greg Little, Andrew Miller, Robert C. Miller, Robin Miller, Aubrey Tatarowicz, Brandyn White, and Samuel White. 2012. VizWiz: nearly real-time answers to visual questions. In Proceedings of the 23rd annual ACM symposium on User interface software and technology. ACM. https://doi.org/10.1145/1866029.1866080
[5] Jeffrey P. Bigham, Richard E. Ladner, and Yevgen Borodin. 2011. The Design of Human-Powered Access Technology. In Proceedings of the 13th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS '11). Association for Computing Machinery, New York, NY, USA, 3–10. https://doi.org/10.1145/2049536.2049540
[6] Petter Bae Brandtzaeg and Asbjørn Følstad. 2018. Chatbots: Changing User Needs and Motivations. Interactions 25, 5 (2018), 38–43. https://doi.org/10.1145/3236669
[7] Robin N. Brewer, Leah Findlater, Joseph 'Jofish' Kaye, Walter Lasecki, Cosmin Munteanu, and Astrid Weber. 2018. Accessible Voice Interfaces. In Conference on Computer Supported Cooperative Work and Social Computing. ACM, 441–446. https://doi.org/10.1145/3272973.3273006
[8] Veena Chattaraman, Wi-Suk Kwon, Juan E. Gilbert, and Kassandra Ross. 2019. Should AI-based, conversational digital assistants employ social- or task-oriented interaction style? A task-competency and reciprocity perspective for older adults. Computers in Human Behavior 90 (2019), 315–330. https://doi.org/10.1016/j.chb.2018.08.048
[9] ConvAI. 2018. The Conversational Intelligence Challenge 2. http://convai.io/
[10] Florian Daniel, Cinzia Cappiello, and Boualem Benatallah. 2019. Bots Acting Like Humans: Understanding and Preventing Harm. IEEE Internet Computing 23, 2 (2019), 40–49. https://doi.org/10.1109/MIC.2019.2893137
[11] Stephan Diederich, Alfred Brendel, and Lutz M. Kolbe. 2019. Towards a Taxonomy of Platforms for Conversational Agent Design. In 14. International Conference on Wirtschaftsinformatik (WI2019).
[12] Stephan Diederich, Alfred Benedikt Brendel, and Lutz M. Kolbe. 2020. Designing Anthropomorphic Enterprise Conversational Agents. Business & Information Systems Engineering (2020). https://doi.org/10.1007/s12599-020-00639-y
[13] Jasper Feine, Ulrich Gnewuch, Stefan Morana, and Alexander Maedche. 2019. A Taxonomy of Social Cues for Conversational Agents. International Journal of Human-Computer Studies 132 (2019), 138–161. https://doi.org/10.1016/j.ijhcs.2019.07.009
[14] Jasper Feine, Stefan Morana, and Ulrich Gnewuch. 2019. Measuring Service Encounter Satisfaction with Customer Service Chatbots using Sentiment Analysis. In 14. Internationale Tagung Wirtschaftsinformatik (WI2019).
[15] Jasper Feine, Stefan Morana, and Alexander Maedche. 2019. Designing a Chatbot Social Cue Configuration System. In Proceedings of the 40th International Conference on Information Systems (ICIS). AISel, Munich.
[16] Jasper Feine, Stefan Morana, and Alexander Maedche. 2019. Leveraging Machine-Executable Descriptive Knowledge in Design Science Research – The Case of Designing Socially-Adaptive Chatbots. In Extending the Boundaries of Design Science Theory and Practice, Bengisu Tulu, Soussan Djamasbi, and Gondy Leroy (Eds.). Springer International Publishing, Cham, 76–91.
[17] David Frohlich. 1993. The history and future of direct manipulation. Behaviour & Information Technology 12, 6 (1993), 315–329. https://doi.org/10.1080/01449299308924396
[18] C. J. Hutto and Eric Gilbert. 2014. VADER: A parsimonious rule-based model for sentiment analysis of social media text. In Eighth International AAAI Conference on Weblogs and Social Media. Ann Arbor, MI, USA.
[19] Ulrich Gnewuch, Stefan Morana, and Alexander Maedche. 2017. Towards Designing Cooperative and Social Conversational Agents for Customer Service. In Proceedings of the 38th International Conference on Information Systems (ICIS). AISel, Seoul.
[20] S. R. Gouravajhala, Y. Jiang, P. Kaur, and J. Chaar. 2018. Finding Mnemo: Hybrid Intelligence Memory in a Crowd-Powered Dialog System. In Collective Intelligence Conference (CI 2018). Zurich, Switzerland.
[21] J. Harms, P. Kucherbaev, A. Bozzon, and G. Houben. 2019. Approaches for Dialog Management in Conversational Agents. IEEE Internet Computing 23, 2 (2019), 13–22. https://doi.org/10.1109/MIC.2018.2881519
[22] Ting-Hao Huang, Joseph Chee Chang, and Jeffrey P. Bigham. 2018. Evorus: A Crowd-powered Conversational Assistant Built to Automate Itself Over Time. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems - CHI '18. ACM Press, 1–13. https://doi.org/10.1145/3173574.3173869
[23] Ting-Hao Kenneth Huang, Walter S. Lasecki, Amos Azaria, and Jeffrey P. Bigham. 2016. "Is There Anything Else I Can Help You With?" Challenges in Deploying an On-Demand Crowd-Powered Conversational Agent. In Fourth AAAI Conference on Human Computation and Crowdsourcing.
[24] Ting-Hao Kenneth Huang, Walter S. Lasecki, and Jeffrey P. Bigham. 2015. Guardian: A crowd-powered spoken dialog system for web APIs. In Third AAAI Conference on Human Computation and Crowdsourcing.
[25] Mohit Jain, Pratyush Kumar, Ramachandra Kota, and Shwetak N. Patel. 2018. Evaluating and Informing the Design of Chatbots. In DIS 2018, Ilpo Koskinen, Youn-kyung Lim, Teresa Cerratto-Pargman, Kenny Chow, and William Odom (Eds.). Association for Computing Machinery, New York, NY, 895–906. https://doi.org/10.1145/3196709.3196735
[26] Youxuan Jiang, Jonathan K. Kummerfeld, and Walter S. Lasecki. 2017. Understanding Task Design Trade-offs in Crowdsourced Paraphrase Collection. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 103–109. https://doi.org/10.18653/v1/P17-2017
[27] Patrik Jonell, Mattias Bystedt, Fethiye Irmak Dogan, Per Fallgren, Jonas Ivarsson, Marketa Slukova, José Lopes, Ulme Wennberg, Johan Boye, and Gabriel Skantze. 2018. Fantom: A Crowdsourced Social Chatbot using an Evolving Dialog Graph. In 1st Proceedings of Alexa Prize.
[28] Spencer Kelly. 17.09.2019. compromise: modest natural-language processing in javascript. https://github.com/spencermountain/compromise
[29] Scott R. Klemmer, Anoop K. Sinha, Jack Chen, James A. Landay, Nadeem Aboobaker, and Annie Wang. 2000. Suede: a Wizard of Oz prototyping tool for speech user interfaces. In Proceedings of the 13th annual ACM symposium on User interface software and technology. ACM, 1–10.
[30] Rafal Kocielnik, Daniel Avrahami, Jennifer Marlow, Di Lu, and Gary Hsieh. 2018. Designing for workplace reflection: a chat and voice-based conversational agent. In Proceedings of the 2018 Designing Interactive Systems Conference (DIS '18). ACM, 881–894. https://doi.org/10.1145/3196709.3196784
[31] Ben Krause, Marco Damonte, Mihai Dobre, Daniel Duma, Joachim Fainberg, Federico Fancellu, Emmanuel Kahembwe, Jianpeng Cheng, and Bonnie Webber. 2017. Edina: Building an open domain socialbot with self-dialogues. In 1st Proceedings of Alexa Prize.
[32] Walter S. Lasecki, Kyle I. Murray, Samuel White, Robert C. Miller, and Jeffrey P. Bigham. 2011. Real-time crowd control of existing interfaces. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, Jeff Pierce, Maneesh Agrawala, and Scott Klemmer (Eds.). ACM Press, New York, NY, 23. https://doi.org/10.1145/2047196.2047200
[33] Walter S. Lasecki, Jaime Teevan, and Ece Kamar. 2014. Information extraction and manipulation threats in crowd-powered systems. In Proceedings of the ACM conference on Computer supported cooperative work & social computing. ACM. https://doi.org/10.1145/2531602.253173
[34] Walter S. Lasecki, Rachel Wesley, Jeffrey Nichols, Anand Kulkarni, James F. Allen, and Jeffrey P. Bigham. 2013. Chorus: a crowd-powered conversational assistant. In Proceedings of the 26th annual ACM symposium on User interface software and technology. ACM. https://doi.org/10.1145/2501988.2502057
[35] Alexander Maedche, Christine Legner, Alexander Benlian, Benedikt Berger, Henner Gimpel, Thomas Hess, Oliver Hinz, Stefan Morana, and Matthias Söllner. 2019. AI-Based Digital Assistants. Business & Information Systems Engineering 61, 4 (2019), 535–544.
[36] Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. ACL.
[37] Moira McGregor and John C. Tang. 2017. More to Meetings: Challenges in Using Speech-Based Technology to Support Meetings. In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing (CSCW '17). ACM, New York, NY, USA, 2208–2220. https://doi.org/10.1145/2998181.2998335
[38] M. McTear, Z. Callejas, and D. Griol. 2016. The Conversational Interface: Talking to Smart Devices (1st ed.). Springer International Publishing, Switzerland.
[39] Michael F. McTear. 2002. Spoken dialogue technology: enabling the conversational user interface. Comput. Surveys 34, 1 (2002), 90–169.
[40] Michael F. McTear. 2017. The Rise of the Conversational Interface: A New Kid on the Block?. In Future and Emerging Trends in Language Technology. Machine Learning and Big Data, José F. Quesada, Francisco-Jesús Martín Mateos, and Teresa López Soto (Eds.). Springer International Publishing, Cham, 38–49.
[41] Michael Price. 17.09.2019. bad-words: A javascript filter for badwords. https://github.com/web-mech/badwords
[42] Microsoft. 2019. Bot Framework Web Chat. https://github.com/microsoft/BotFramework-WebChat
[43] Microsoft. 2019. Microsoft Bot Framework Direct Line JS Client. https://github.com/microsoft/BotFramework-DirectLineJS
[44] Microsoft. 2019. Microsoft Power Virtual Agents. https://powervirtualagents.microsoft.com/en-us/
[45] George A. Miller. 1995. WordNet: a lexical database for English. Commun. ACM 38, 11 (1995), 39–41. https://doi.org/10.1145/219717.219748
[46] Maali Mnasri. 2019. Recent advances in conversational NLP: Towards the standardization of chatbot building. arXiv preprint arXiv:1903.09025 (2019).
[47] Gina Neff and Peter Nagy. 2016. Automation, algorithms, and politics | Talking to Bots: Symbiotic agency and the case of Tay. International Journal of Communication 10 (2016), 17.
[48] Zhenhui Peng, Taewook Kim, and Xiaojuan Ma. 2019. GremoBot. In Conference Companion Publication of the 2019 on Computer Supported Cooperative Work and Social Computing - CSCW '19, Eric Gilbert and Karrie Karahalios (Eds.). ACM Press, New York, New York, USA, 335–340. https://doi.org/10.1145/3311957.3359472
[49] Martin Porcheron, Joel E. Fischer, Moira McGregor, Barry Brown, Ewa Luger, Heloisa Candello, and Kenton O'Hara. 2017. Talking with Conversational Agents in Collaborative Action. In Companion of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing, Charlotte P. Lee (Ed.). ACM, New York, NY, 431–436. https://doi.org/10.1145/3022198.3022666
[50] Martin Porcheron, Joel E. Fischer, and Sarah Sharples. 2017. "Do Animals Have Accents?". In CSCW'17, Charlotte P. Lee, Steve Poltrock, Louise Barkhuus, Marcos Borges, and Wendy Kellogg (Eds.). The Association for Computing Machinery, New York, New York, 207–219. https://doi.org/10.1145/2998181.2998298
[51] Rasa. 2019. Improve your contextual assistant with Rasa X. https://rasa.com/docs/rasa-x/
[52] Tony Russell-Rose and Tyler Tate. 2013. Designing the search experience: The information architecture of discovery. Morgan Kaufmann/Elsevier, Amsterdam. https://ebookcentral.proquest.com/lib/subhh/detail.action?docID=1046391
[53] Isabella Seeber, Lena Waizenegger, Stefan Seidel, Stefan Morana, Izak Benbasat, and Paul Benjamin Lowry. 2019. Collaborating with Technology-Based Autonomous Agents: Issues and Research Opportunities. SSRN Electronic Journal (2019). https://doi.org/10.2139/ssrn.3504587
[54] Joseph Seering, Michal Luria, Geoff Kaufman, and Jessica Hammer. 2019. Beyond Dyadic Interactions. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems - CHI '19. ACM Press, New York, USA, 1–13. https://doi.org/10.1145/3290605.3300680
[55] Ben Shneiderman. 1997. Direct manipulation for comprehensible, predictable and controllable user interfaces. In Proceedings of the international conference on Intelligent user interfaces. ACM, 33–39. https://doi.org/10.1145/238218.238281
[56] Mark S. Silver. 2008. On the Design Features of Decision Support Systems: The Role of System Restrictiveness and Decisional Guidance. In Handbook on Decision Support Systems 2: Variations, Frada Burstein and Clyde W. Holsapple (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 261–291.
[57] Paul Thomas, Mary Czerwinski, Daniel McDuff, Nick Craswell, and Gloria Mark. 2018. Style and Alignment in Information-Seeking Conversation. In Proceedings of the 2018 Conference on Human Information Interaction & Retrieval - CHIIR '18, Chirag Shah, Nicholas J. Belkin, Katriina Byström, Jeff Huang, and Falk Scholer (Eds.). ACM Press, New York, New York, USA, 42–51. https://doi.org/10.1145/3176349.3176388
[58] Felipe Thomaz, Carolina Salge, Elena Karahanna, and John Hulland. 2020. Learning from the Dark Web: leveraging conversational agents in the era of hyper-privacy to enhance marketing. Journal of the Academy of Marketing Science 48, 1 (2020), 43–63. https://doi.org/10.1007/s11747-019-00704-3
[59] Luis von Ahn and Laura Dabbish. 2004. Labeling images with a computer game. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Elizabeth Dykstra-Erickson (Ed.). ACM, New York, NY, 319–326. https://doi.org/10.1145/985692.985733
[60] Jason D. Williams, Kavosh Asadi, and Geoffrey Zweig. 2017. Hybrid code networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning. arXiv preprint (2017).
[61] Rainer Winkler, Sebastian Hobert, Antti Salovaara, Matthias Söllner, and Jan Marco Leimeister. 2020. Sara, the Lecturer: Improving Learning in Online Education with a Scaffolding-Based Conversational Agent. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. ACM, 1–14. https://doi.org/10.1145/3313831.3376781
[62] Yu Zhong, Walter S. Lasecki, Erin Brady, and Jeffrey P. Bigham. 2015. RegionSpeak: Quick Comprehensive Spatial Descriptions of Complex Images for Blind Users. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, Bo Begole (Ed.). ACM, New York, NY, 2353–2362. https://doi.org/10.1145/2702123.2702437