A Chatbot Response Generation System

Jasper Feine, Karlsruhe Institute of Technology, Karlsruhe, Germany, jasper.feine@kit.edu
Stefan Morana, Saarland University, Saarbrücken, Germany, stefan.morana@uni-saarland.de
Alexander Maedche, Karlsruhe Institute of Technology, Karlsruhe, Germany, alexander.maedche@kit.edu

ABSTRACT
Developing successful chatbots is a non-trivial endeavor. In particular, the creation of high-quality natural language responses for chatbots remains a challenging and time-consuming task that often depends on high-quality training data and deep domain knowledge. As a consequence, it is essential to engage experts who have the required domain knowledge in the chatbot response development process. However, current tool support to engage domain experts in the response generation process is limited and often does not go beyond the exchange of decoupled prototypes and spreadsheets. In this paper, we present a system that enables chatbot developers to efficiently engage domain experts in the chatbot response generation process. More specifically, we introduce the underlying architecture of a system that connects to existing chatbots via an API, provides two improvement mechanisms for domain experts to improve chatbot responses during their chatbot interaction, and helps chatbot developers to review the collected response improvements with a sentiment-supported review dashboard. Overall, the design of the system and its improvement mechanisms are useful extensions for chatbot development systems in order to support chatbot developers and domain experts in collaboratively enhancing the natural language responses of a chatbot.

CCS CONCEPTS
• Human-centered computing → Natural language interfaces; User interface design.

KEYWORDS
chatbot response, improvement mechanism, system, domain expert, chatbot developer

ACM Reference Format:
Jasper Feine, Stefan Morana, and Alexander Maedche. 2020. A Chatbot Response Generation System. In Mensch und Computer 2020 (MuC'20), September 6–9, 2020, Magdeburg, Germany. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3404983.3405508

1 INTRODUCTION
The quality of conversational interaction design in general, and the quality of chatbot responses in particular, is critical for the user experience with chatbots (i.e., text-based conversational agents) [19, 25, 34]. However, many interactions with chatbots are rather short, constrained by a limited vocabulary, and contain incomplete or wrong information [1, 6, 12]. This not only leads to a low penetration rate of chatbots [25], but also limits their application beyond simple dyadic interactions [37, 50, 54].

The natural language capabilities of chatbots are mostly limited by the amount of effort developers invest in the development of a chatbot's dialog system [21, 38]. Independent of the type of dialog system (i.e., data-driven or rule-based), the creation and evaluation of high-quality chatbot responses is a very time-consuming endeavor. Whereas chatbot developers have the essential technical expertise to develop such dialog systems, they often lack the required domain knowledge. Domain experts, in turn, have the required domain knowledge but lack the technological expertise [2, 52]. As a consequence, it is necessary to develop a system that empowers domain experts to engage in the response generation process in order to support chatbot developers in crafting chatbot responses that are relevant for the respective end-user groups [21, 29, 54].

What is currently missing is a system that allows chatbot developers to actually involve the respective domain experts in an effective and efficient chatbot response generation process. Current processes are often limited to the testing of decoupled prototypes and the exchange of spreadsheets. More specifically, current chatbot development systems lack two important interrelated functionalities: (1) they do not enable domain experts to easily improve and propose new chatbot responses while (2) chatbot developers keep control and curation over the response generation process.

To address this need, we introduce a system with three key functionalities: (1) The system is implemented as a web application and connects easily with any existing chatbot via an API. The system thus serves as an additional layer between domain experts and the connected chatbots. (2) The system enables domain experts to interact with the connected chatbot via an auto-generated chat window which allows them to directly improve chatbot responses during their interaction. (3) The system supports chatbot developers with a sentiment-supported review dashboard to review, accept, change, or reject the chatbot response improvements collected from the domain experts.

The developed system is effective because it enables chatbot developers to continuously improve the responses of existing chatbots together with the respective domain experts. It is efficient because it orchestrates the response generation process in one system without creating additional development effort. It thereby enables chatbot developers to create high-quality chatbot responses, interaction data, and contextual information that can be used to enhance the dialog system of a chatbot. The design of the system can therefore be used to extend existing chatbot development systems in order to support chatbot developers and domain experts in collaboratively enhancing the responses of a chatbot.

In the remainder of this paper, we first review work on chatbot dialog systems, their natural language capabilities, and existing chatbot response improvement systems. Subsequently, we introduce the design of the proposed system and outline how we instantiated it. Next, we report the results of a pilot study in which we tested the proposed system in a field deployment. Finally, we discuss the benefits of the system, critically reflect on its application contexts, and outline its limitations and further research avenues.
2 RELATED WORK

2.1 Chatbot Dialog Systems
The quality of a conversational interaction between a human and a chatbot is currently mostly limited by the amount of effort developers invest in the chatbot's dialog system. The dialog system typically consists of three interacting components: natural language understanding (i.e., converting words to meaning), dialog management (i.e., deciding the next system action), and response generation (i.e., converting meaning to words) [40]. Dialog systems of chatbots can be broadly distinguished in terms of their dialog coherency and scalability [21, 27]. On the one hand, dialog systems that comprise handcrafted domain-specific dialog rules enable goal-oriented chatbots to converse coherently about a specific topic. The naturalness of the chatbot responses is, however, mostly determined by the amount of effort chatbot developers invest in the development of dialog rules and the authoring of rule-specific chatbot responses [21]. Thus, it is time-consuming to extend the natural language capabilities of such a chatbot, which limits the scalability of these approaches. On the other hand, data-driven dialog managers automatically generate chatbot responses based on large, existing dialog corpora. They probabilistically match user messages to examples in the training data and then select the best-matching response from the training data set without using any handcrafted dialog rules. These approaches are often used for the development of non-goal-oriented chatbots whose primary purpose is chatting. However, data-driven approaches lack coherency and robustness because the naturalness of the responses strongly relies on the quality of the training data. The generation of high-quality training data is, however, a major challenge [21, 27].

Overall, rule-based dialog systems have dominated the chatbot landscape over the last decades [39], but data-driven dialog systems are becoming more popular [46]. To leverage the strengths of both approaches, hybrid dialog systems have been proposed [21]. For example, Hybrid Code Networks combine a data-driven neural network with the ability to include procedural rules [60]. However, real-world solutions that combine both approaches are still scarce [21].

2.2 Natural Language Limitations of Chatbots
A major goal in the development of chatbots is to create natural language capabilities that meet user expectations [13, 15]. However, most chatbots often reply with the same message, possess only a very limited vocabulary, and often provide wrong information [26, 34].

To demonstrate the limits of chatbots' language capabilities, we compared their responses with the responses of humans. To this end, we analyzed an existing human-chatbot dialog corpus from the Conversational Intelligence Challenge 2 (ConvAI2) [9]. We selected this corpus of 1,111 dialogs because it contains human-chatbot dialogs of state-of-the-art chatbots that are supposed to sustain an intelligent conversation with a human over several interaction turns. For our analysis, we downloaded the dialog corpus of the wild evaluation round, in which human users evaluated the chatbot responses. We analyzed the lexical diversity of all chatbot and human messages using a Stanford CoreNLP server [36]. Using the server, we tagged all messages regarding their unique word lemmas and part-of-speech (POS) tags (i.e., 94,933 in total) and counted the unique adjectives, adverbs, and verbs. We focused on adjectives, adverbs, and verbs because they are highly relevant for expressing emotions, an inherently human ability [3, 14].

As depicted in Figure 1, the ConvAI2 chatbots (human users) used in total 282 (494) unique adjectives, 97 (160) unique adverbs, and 264 (466) unique verbs in all ConvAI2 conversations of the wild evaluation round. The results indicate that the human users used 75% more unique adjectives, 65% more unique adverbs, and 76% more unique verbs across all conversations than the ConvAI2 chatbots. The corpus analysis therefore reveals that humans' language usage is very diverse in lexical and emotional terms, and that even the well-designed chatbots from the ConvAI2 challenge cannot match this diversity.

Figure 1: Analysis of human-chatbot dialogs of the ConvAI2 challenge. The graph illustrates the number of unique adjectives, adverbs, and verbs used by the chatbots and human users during the wild evaluation round of the ConvAI2 challenge.
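To make the analysis concrete, the following is a minimal sketch of the lemma-counting step, assuming a Stanford CoreNLP server [36] is reachable on localhost:9000 and exposes its standard JSON HTTP interface. The helper name `uniqueLemmasByPos` is ours, not part of CoreNLP.

```typescript
// Collect the unique adjective, adverb, and verb lemmas used in a list of messages.

interface CoreNLPToken { word: string; lemma: string; pos: string; }
interface CoreNLPResult { sentences: { tokens: CoreNLPToken[] }[]; }

const CORENLP_URL =
  'http://localhost:9000/?properties=' +
  encodeURIComponent(JSON.stringify({ annotators: 'tokenize,ssplit,pos,lemma', outputFormat: 'json' }));

// Penn Treebank tag prefixes: JJ* = adjective, RB* = adverb, VB* = verb.
const POS_GROUPS: Record<string, 'adjective' | 'adverb' | 'verb'> = {
  JJ: 'adjective', RB: 'adverb', VB: 'verb',
};

async function uniqueLemmasByPos(messages: string[]) {
  const unique = { adjective: new Set<string>(), adverb: new Set<string>(), verb: new Set<string>() };
  for (const message of messages) {
    const response = await fetch(CORENLP_URL, { method: 'POST', body: message });
    const result = (await response.json()) as CoreNLPResult;
    for (const sentence of result.sentences) {
      for (const token of sentence.tokens) {
        const group = POS_GROUPS[token.pos.slice(0, 2)];
        if (group) unique[group].add(token.lemma.toLowerCase());
      }
    }
  }
  return unique;
}

// Usage: run once over the chatbot messages and once over the human messages,
// then compare, e.g., (await uniqueLemmasByPos(humanMessages)).adjective.size.
```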
2.3 Crowd-Powered Dialog Systems
To improve chatbot responses, chatbot developers can develop chatbot prototypes, present them to domain experts, analyze their interaction data, and conduct user interviews [25]. Subsequently, they can modify the chatbot responses before they start the next improvement cycle. This process, however, is very work intensive.

A promising solution for improving chatbot responses more efficiently is to leverage a dialog system that utilizes crowd-workers. Crowd-workers are human workers, usually non-experts, who are recruited anonymously through an open call over the web [5]. Crowd-powered dialog systems could reduce the scalability limitation of manually developed dialog systems without sacrificing complete control over the response generation process [27]. Whereas early crowd-working attempts took several hours to complete a task, recent approaches have been shown to work in nearly real-time [4, 32]. As a consequence, crowd-workers have been used to collectively reply to a user. A brief, non-exhaustive review of promising crowd-powered dialog systems is given in Table 1.

Table 1: Crowd-powered dialog systems.
SUEDE [29]: A speech interface prototyping tool which allows designers to easily create prompt/response speech interfaces and further enables them to test and analyze these interfaces with many users using a Wizard-of-Oz mode.
Chorus [34]: A crowd-powered conversational assistant. While the assistant appears to be a single individual, it is actually driven by a dynamic crowd of multiple workers using a specially designed response interface.
Edina [31]: A hybrid dialog manager which uses a technique called self-dialogs: crowd-workers write both the answers of a user and the responses of a chatbot in order to increase the naturalness of the dialog corpus.
RegionSpeak [62]: An advanced version of VizWiz [4] which collects labels from the crowd for several objects in a visual area and then enables blind users to explore the spatial layout of these objects.
Evorus [22]: Engages crowd-workers to propose the most suitable chatbot responses and then automates itself over time using machine learning.
Mnemo [20]: A crowd-powered dialog plugin which saves and aggregates human-generated context notes from goal-oriented dialogs.
Fantom [27]: Generates evolving dialog trees and automatically creates crowd tasks in order to collect responses for user requests that could not be handled so far.
Besides their advantages, crowd-powered dialog systems also create serious challenges because crowd-workers have been shown to abuse such systems [23, 33, 47]. In the context of crowd-powered dialog systems, three malicious user groups have been identified [23]: inappropriate workers (i.e., they provide faulty or irrelevant information), flirters (i.e., they are interested in the user's true identity or develop unnecessary personal connections), and spammers (i.e., they perform an abnormally large number of meaningless actions in a task). In particular, the case of Microsoft's Tay has dramatically shown what can go wrong with systems that automatically learn from user-generated content [47].

2.4 Chatbot Response Improvement Mechanisms
Overall, chatbot developers need to ensure that user-generated chatbot responses do not lead to offensive, inappropriate, or meaningless responses and that the contributors have sufficient domain knowledge to propose meaningful chatbot responses [23, 33, 47]. This applies to crowd-workers and end-users, but also to domain experts who may not take the improvement task seriously. To reduce these risks, several systems have investigated counter-mechanisms against malicious user improvements.

For example, Chorus [23] uses voting as a filtering mechanism, which worked fairly well in a field deployment. However, the mechanism only worked when at least one other crowd-worker also voted for a message [23]. In another study, the VizWiz system ensured that it always received at least two response proposals from two different crowd-workers. The results of a field deployment further revealed that it took on average three response proposals to always receive at least one correct response [4]. Another approach is to award points to responses that contain correct information [59]. A counter-mechanism that reduces the risk of stealing sensitive user data is used by Edina [31]: with the self-dialogs technique, crowd-workers author both the chatbot's and the user's side of a conversation. Another promising approach is to divide the improvement tasks into micro-tasks that prevent any single crowd-worker from seeing too much information [33]. However, the interaction context is very important for correctly understanding a natural language interaction [20], and seeing only fractions of a conversation might not be sufficient to correctly improve a chatbot response. Finally, Fantom anonymized the dialogs, but the anonymization still had to be done manually [27].

Summing up, chatbot developers need to carefully consider the improvement mechanisms that domain experts can use to improve chatbot responses, as well as the mechanisms to re-evaluate the improved responses before they are finally shown to end-users.
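To illustrate the agreement-based filters discussed above (Chorus [23], VizWiz [4]), the following is a minimal sketch of such a counter-mechanism. The data shapes and the threshold are illustrative assumptions, not taken from the cited systems.

```typescript
// Accept a candidate response only once a minimum number of distinct
// contributors independently agree on it; everything else stays queued
// for manual review.

interface CandidateResponse {
  text: string;
  contributors: Set<string>; // ids of workers who proposed or upvoted this text
}

function acceptedResponses(candidates: CandidateResponse[], minAgreement = 2): string[] {
  return candidates
    .filter((candidate) => candidate.contributors.size >= minAgreement)
    .map((candidate) => candidate.text);
}

// Usage:
// acceptedResponses([{ text: 'Our office hours are 9-5.', contributors: new Set(['w1', 'w2']) }]);
```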
3 DESIGNING A CHATBOT RESPONSE GENERATION SYSTEM
In this section, we propose the design of a chatbot response generation system that enables chatbot developers to engage domain experts in the chatbot response generation process. The high-level design of the system is described in the next section. Subsequently, the improvement mechanisms and the sentiment-supported review dashboard are described.

3.1 High-level Design
The proposed system is developed as a web application. The main dashboard and the main features of the web application are displayed in Figure 3. To start using the system, chatbot developers can connect any existing chatbot via its API. The only requirement for such a chatbot API is that it exchanges the messages between the users and the chatbot. The system can therefore be used to improve chatbots with different types of dialog systems, because the dialog system of the chatbot to be improved still handles all dialog management. This means that the system acts as an additional layer between the chatbot developers, the domain experts, and the chatbot to be improved, without requiring access to the chatbot's source code.

The high-level design of the system is illustrated in Figure 2. The system is not limited to a single chatbot but functions as a platform that can easily be connected to several chatbots via their APIs. In addition, an auto-generated chat window can be shared with domain experts. This ensures that the system scales well with the need to test many different chatbot versions and also reduces the effort of developing specific chatbots that can be connected to the system.

Figure 2: High-level design of the chatbot response generation system.

To develop a prototype of the proposed design, we decided to include chatbots that converse via Microsoft's Direct Line 3.0 API [43]. After chatbot developers have connected a chatbot via its Direct Line API (see Figure 3, top-left), the system instantiates a knowledge base and generates a shareable chat window (see Figure 3, bottom-left). Chatbot developers can then share a link to the chat window with domain experts. Domain experts can then interact with the chatbot, and the chat window offers two improvement mechanisms to directly improve the chatbot responses during an interaction (described in detail in the following section). This enables domain experts to directly improve disappointing chatbot responses during the interaction; the improvements are then stored in the system's knowledge base. Finally, chatbot developers can review and delete the collected chatbot responses using a sentiment-supported review dashboard, which is described in Section 3.3.

Figure 3: (Middle) Main dashboard of the web application, which displays all connected chatbots and the main functionalities; (top-left) interface to connect chatbots via their API key; (bottom-left) automatically generated chat window that can be shared with domain experts to improve the chatbot responses; (top-right) improvement dashboard that summarizes key measures of the chatbot response improvement process; (bottom-right) review dashboard to review and delete collected chatbot responses.
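The following sketch illustrates the message exchange with a connected chatbot over Microsoft's Direct Line 3.0 REST API [43], the system's only integration point. Error handling and the WebSocket streaming variant are omitted; the function names are our own, and the Direct Line secret must be supplied by the chatbot developer.

```typescript
const DIRECT_LINE = 'https://directline.botframework.com/v3/directline';

async function startConversation(secret: string): Promise<string> {
  // Open a new Direct Line conversation with the connected chatbot.
  const res = await fetch(`${DIRECT_LINE}/conversations`, {
    method: 'POST',
    headers: { Authorization: `Bearer ${secret}` },
  });
  const { conversationId } = (await res.json()) as { conversationId: string };
  return conversationId;
}

async function relayUserMessage(secret: string, conversationId: string, text: string): Promise<void> {
  // Forward a domain expert's chat-window message to the chatbot.
  await fetch(`${DIRECT_LINE}/conversations/${conversationId}/activities`, {
    method: 'POST',
    headers: { Authorization: `Bearer ${secret}`, 'Content-Type': 'application/json' },
    body: JSON.stringify({ type: 'message', from: { id: 'domain-expert' }, text }),
  });
}

async function fetchNewActivities(secret: string, conversationId: string, watermark?: string) {
  // Poll for activities (including chatbot responses) received since `watermark`.
  const query = watermark ? `?watermark=${watermark}` : '';
  const res = await fetch(`${DIRECT_LINE}/conversations/${conversationId}/activities${query}`, {
    headers: { Authorization: `Bearer ${secret}` },
  });
  return (await res.json()) as {
    activities: { from: { id: string }; type: string; text?: string }[];
    watermark: string;
  };
}
```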
3.2 Improvement Mechanisms
Chatbot response improvement mechanisms can be classified along two counteracting continua: the effort required of a chatbot developer to review the collected chatbot response improvements, and the mechanism's restrictiveness. At one end of the continuum, very unrestricted improvement mechanisms can lead to creative and natural improvements that increase the quality of the chatbot. However, they can also lead to potentially malicious improvements [10, 26, 33, 47]. At the other end of the continuum, very restricted improvement mechanisms that only allow users to change specific sections of chatbot responses reduce the reviewing effort of chatbot developers [56], but limit the creativity and naturalness of the proposed chatbot responses. A combination of restricted and less restricted mechanisms could therefore be a promising approach to increase the language variation of chatbot responses while chatbot developers keep control over the response generation process. To instantiate such improvement mechanisms, we developed a chat window based on Microsoft's WebChat [42]. The chat window allows domain experts to directly improve chatbot responses with both a restricted and an unrestricted improvement mechanism during a human-chatbot interaction.

The first improvement mechanism limits the domain experts' degrees of freedom in changing the responses of a chatbot while still allowing them to increase its lexical and emotional variety. To design this mechanism, we analyzed popular online translators (i.e., Google Translate, DeepL) that enable users to improve given translations. For example, Google Translate allows users to click on a specific part of a sentence and then shows a drop-down list with alternative translations. Users can select a more appropriate translation in order to directly manipulate the translation in the web interface. Based on this idea, we developed a similar improvement mechanism for chatbots and implemented it in the chat window. To do so, the chat window sends all chatbot responses received via the API to an instance of a Stanford CoreNLP server [36] before displaying them. The CoreNLP server tags all words with their part-of-speech (POS) tag. The chat window then highlights all adjectives, adverbs, and verbs because they are highly relevant for expressing emotions [3]. If domain experts want to improve a word in a chatbot response, they can simply click on it. The word, including its POS tag, is then sent to an instance of Princeton's WordNet dictionary [45], which returns appropriate synonyms. If the word is a verb, the JavaScript package compromise [28] further transforms each synonym into the appropriate verb form. The synonyms are then displayed in a drop-down list, and the domain experts can select a more appropriate synonym. The instantiation of the restricted mechanism is shown in Figure 4; a sketch of this pipeline follows at the end of this section.

Figure 4: Chat window with restricted improvement mechanism.

The second improvement mechanism should enable domain experts to directly manipulate the chatbot interaction in the chat window in order to simplify the mapping between goals and actions [17] and to encourage a feeling of engagement and power of control [17, 55]. Therefore, we developed a direct manipulation mechanism that is characterized by a continuous representation of the objects of interest (i.e., chatbot responses), offers physical actions (i.e., domain experts can directly click on a chatbot response), and shows the impact of the users' actions immediately on the objects of interest (i.e., chatbot responses are immediately updated in the chat window) [55]. This improvement mechanism enables domain experts to freely improve and add chatbot responses in the chat window, as displayed in Figure 5. However, chatbot response improvements created with this mechanism increase the reviewing effort of chatbot developers because domain experts may propose malicious chatbot responses.

Figure 5: Chat window with unrestricted improvement mechanism.
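The following sketch illustrates the restricted mechanism: the clicked word arrives with the POS tag assigned by the CoreNLP server [36], synonyms are fetched in base form, and verbs are re-inflected with the compromise package [28]. Note that `fetchWordNetSynonyms` is a hypothetical stand-in for the WordNet instance [45], and compromise's conjugation methods may behave differently for irregular or unrecognized verbs.

```typescript
import nlp from 'compromise';

// Placeholder for the WordNet lookup; a real implementation would query a
// WordNet-backed service and return base-form synonyms.
async function fetchWordNetSynonyms(lemma: string, pos: 'adjective' | 'adverb' | 'verb'): Promise<string[]> {
  return [];
}

async function synonymsForClickedWord(word: string, posTag: string): Promise<string[]> {
  if (posTag.startsWith('JJ')) return fetchWordNetSynonyms(word, 'adjective');
  if (posTag.startsWith('RB')) return fetchWordNetSynonyms(word, 'adverb');
  if (!posTag.startsWith('VB')) return []; // only adjectives, adverbs, and verbs are highlighted

  // For verbs: look up synonyms of the infinitive, then re-inflect each
  // candidate to match the surface form used in the chatbot response.
  const infinitive = nlp(word).verbs().toInfinitive().text() || word;
  const candidates = await fetchWordNetSynonyms(infinitive, 'verb');
  return candidates.map((candidate) => {
    const verb = nlp(candidate).verbs();
    if (posTag === 'VBD' || posTag === 'VBN') return verb.toPastTense().text() || candidate;
    if (posTag === 'VBG') return verb.toGerund().text() || candidate;
    if (posTag === 'VBZ') return verb.toPresentTense().text() || candidate;
    return candidate; // VB / VBP: keep the base form
  });
}
```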
3.3 Review Dashboard
To reduce the effort of reviewing the collected chatbot responses, the system supports chatbot developers by automatically deleting all responses that contain profanity, using the bad-words JavaScript package [41]. In addition, the system offers a review dashboard (see Figure 6) which enables chatbot developers to review, accept, reject, or modify the collected chatbot responses, which are then updated in the system's knowledge base. To further support chatbot developers, the review dashboard analyzes all chatbot responses using the VADER sentiment analysis package [18]. Improvements with a negative sentiment score are highlighted in the review dashboard. Developers can then easily delete or modify the collected improvements. Finally, chatbot developers can export the collected chatbot responses. The export can then be used to further enhance the dialog system of the chatbot.

Figure 6: Sentiment-supported review dashboard, which shows all collected chatbot response improvements together with their interaction context. Color coding indicates the sentiment scores of the collected chatbot responses. Chatbot developers can delete or modify the collected chatbot responses before they export the data.
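A minimal sketch of this triage step follows, assuming the bad-words package [41] and the `vader-sentiment` npm port of VADER [18]; the flagging threshold is an illustrative choice, and both packages may require local type declarations in a TypeScript project.

```typescript
import Filter from 'bad-words';
import vader from 'vader-sentiment';

type ReviewStatus = 'deleted' | 'flagged' | 'pending';

const profanityFilter = new Filter();

function triageImprovement(text: string): { status: ReviewStatus; compound: number } {
  // `compound` is VADER's normalized sentiment score in [-1, 1].
  const { compound } = vader.SentimentIntensityAnalyzer.polarity_scores(text);
  if (profanityFilter.isProfane(text)) return { status: 'deleted', compound };
  if (compound < 0) return { status: 'flagged', compound };
  return { status: 'pending', compound };
}

// Flagged and pending improvements remain in the knowledge base until a
// chatbot developer accepts, modifies, or rejects them in the dashboard.
```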
4 PILOT STUDY
We tested the system's ability to improve chatbot responses in a pilot study. In particular, we investigated how the system is actually used in a real-world scenario. To this end, we used the system to improve the responses of an existing chatbot, namely the student service chatbot of our institute. The student service chatbot was developed using the Microsoft Bot Framework and can respond to questions about employees, lectures, exams, and thesis projects. It is available on our website and is mostly used by our students. We connected the chatbot to the system by adding its Direct Line 3.0 API key and then engaged end-users with the required domain knowledge in a response improvement process. To do so, we posted the link to the chat window in two Facebook groups in which end-users of the chatbot (i.e., students of our institute) coordinate and discuss course contents. We asked them to improve the responses of the student service chatbot because they know best how the chatbot should respond in specific situations. Participation was voluntary, and the participants did not receive any compensation.

After two days of data collection, students had interacted with the chatbot via the system in 110 sessions. The students sent in total 230 messages to the chatbot. They improved 36 complete chatbot responses and added 27 new chatbot responses to the knowledge base. In addition, they used the restricted improvement mechanism to replace 8 synonyms in the original chatbot responses. Moreover, one student changed both a synonym and the text of a response. Thus, a total of 72 chatbot responses (36 + 27 + 8 + 1) were changed or added. The design of the chatbot response generation system consequently enabled the students to easily improve the responses of the institute's chatbot.

Subsequently, we investigated the language variation of the newly collected chatbot responses and compared it to the initial response set of the chatbot. To analyze the language variation, we used a Stanford CoreNLP server [36] and tagged all responses regarding their unique word lemmas and POS tags (see Figure 7). The results revealed that the collected chatbot responses contain eight more unique adjectives, nine more unique adverbs, and eleven more unique verbs than the original chatbot response set. This illustrates that the lexical diversity of the chatbot responses was increased through the response improvements proposed by the students.

Figure 7: Lexical variety of the original chatbot responses and of the improved chatbot responses collected during the pilot study.

Finally, we analyzed the chatbot responses that were improved most frequently. The most frequently improved chatbot response was the welcome message (i.e., "Hello, I am the chatbot of..."). The students proposed in total 10 different versions of it. This variety of improvements leads to the challenge of selecting the best welcome message, which we discuss in more detail in the next section. The second most frequently improved chatbot response (7 times) was the chatbot's apology for not understanding a user's message (i.e., "I am sorry, but I did not understand your message"). For example, one student asked whether the chatbot also speaks German. The chatbot apologized for not understanding the user, and the student improved the response by replacing it with "Nein, tut mir leid. I only speak English!". This information can now be used to update the dialog system of the chatbot so that it can respond to questions regarding its language capabilities.
5 DISCUSSION
In this paper, we proposed the design of a system that engages domain experts in the chatbot response generation process. The system enables chatbot developers to connect an existing chatbot to the system and enables domain experts to directly improve chatbot responses during a conversation with this chatbot. In addition, chatbot developers keep control and curation over the improvement process because they can review the collected responses using a sentiment-supported review dashboard.

The proposed design can be useful when chatbot developers are looking to expand the natural language response capabilities of a chatbot by involving domain experts in the development process. As a consequence, the proposed system and its components can be useful extensions for real-world chatbot improvement systems (e.g., Rasa X [51]) or chatbot development kits (e.g., Microsoft's Power Virtual Agents [44]). Such an approach can enable chatbot developers and domain experts to collaboratively enhance the natural language responses of a chatbot.

A major advantage of the system is its compatibility with different types of existing chatbots, independent of their technology and goal. This is possible because the system only requires an API connection to the chatbot that exposes all messages exchanged between a user and the chatbot. All chatbot responses and improvements are directly displayed in the chat window and stored in the system's knowledge base. This is a major advantage over other improvement approaches that require a chatbot to be adapted to a specific platform [11] or require additional effort in developing decoupled prototypes, e.g., in the form of mockups. Chatbot improvement systems that connect to existing chatbots via APIs can therefore be a promising approach to increase the adoption of tool support, because the threshold for using the system is very low. Chatbot developers neither need to change the source code of the chatbot nor develop additional mockups. They only need to connect an existing chatbot to the portal via its API. Other promising approaches have even generated a chatbot directly from APIs by leveraging the crowd [24]. Such easy-to-use approaches can support the knowledge transfer from research to practice because research findings are not limited to a specific chatbot instantiation and can thus be directly applied in practice.

5.1 Limitations and Future Research Avenues
Our study comes with limitations that also suggest opportunities for future research. First, the system was only evaluated in a pilot study, which revealed that it was indeed seamlessly possible to connect an existing chatbot to the system and that expert users such as students would use the system to contribute new and enhanced chatbot responses. To gain a deeper understanding of how domain experts in organizations actually perceive and use such a system, more evaluations are necessary, because the current pilot study only focused on one type of user group. We therefore plan to conduct further laboratory and field evaluations with different user groups in order to show the benefits of such a system beyond the first indications reported here. This also includes testing how end-users of such a chatbot react to the improved chatbot responses in order to demonstrate that the approach results in the design of better chatbots.

Second, additional evaluations are necessary to investigate the scalability of such an improvement approach. The current design of the system leads to the creation of several response improvements for the same chatbot response, because humans generally reply in a multitude of ways [27]. Chatbot developers need to converge the collected improvements into an optimal set of chatbot responses in order to implement them in the dialog system. However, it is quite difficult to distill the most suitable set of chatbot responses because no single best interaction style of a chatbot exists. The best chatbot response always depends on the user, the task, and the context of the interaction [15, 16]. For example, it has been shown that users with different levels of task experience prefer different language styles of a chatbot [8]. To address this, chatbot developers would need to develop chatbots that adapt their language style to the preferences of an individual user, which has been shown to be a promising approach [57]. However, such approaches require more complex dialog systems that do not only manage the dialog flow but also select the most appropriate chatbot response based on individual user characteristics collected by the system. Such approaches could further help to develop chatbots that are able to adapt to individual users [14, 54, 58], act as emotion regulators [48, 53], effectively support collaborative collocated talk-in-action [49, 54], and support the future of work [35].

Third, future research should further investigate unrestricted and other restricted user-driven improvement mechanisms that improve chatbot responses while avoiding abuse. While synonym selection, profanity detection, and sentiment analysis help chatbot developers to identify harmful user improvements, more automated approaches [33] can make user-driven chatbot improvement much more efficient. In this regard, existing approaches from FAQ systems could be extended to a chatbot context in which end-users rate the received chatbot responses.

Fourth, the system's improvement mechanisms may not be equally useful for all types of chatbots. For chatbots that mainly employ predefined dialog flows, corrected or reworked answer alternatives may be a substantial contribution. However, a chatbot response is often written to serve as a suitable response to a broad range of user requests. Hence, reworking chatbot responses based on the experiences of a single user may be counterproductive in some contexts. Future research could therefore extend the proposed system to improve the complete dialog flow of a chatbot with regard to the interaction context.

Fifth, the design of the system is currently restricted to improving the responses of chatbots (i.e., text-based conversational agents). However, voice-based conversational agents are becoming increasingly important [61]. They have lower barriers to entry and use [7] and enable interactions separate from the already busy chat modality [30]. Consequently, future research can extend the design of the proposed system to investigate response generation systems for voice-based conversational agents.
6 CONCLUSION
In this paper, we propose the design of a system that enables chatbot developers to efficiently engage domain experts in the chatbot response generation process. The system enables chatbot developers to connect existing chatbots to the system via an API and enables domain experts to improve the chatbot responses during an interaction. We tested the system with students in a pilot study. Overall, the design of the system and its improvement mechanisms can be useful extensions for chatbot development systems in order to support chatbot developers and domain experts in collaboratively enhancing the natural language responses of a chatbot.

REFERENCES
[1] Martin Adam, Michael Wessel, and Alexander Benlian. 2020. AI-based chatbots in customer service and their effects on user compliance. Electronic Markets 9, 2 (2020), 204.
[2] Saleema Amershi, Maya Cakmak, William Bradley Knox, and Todd Kulesza. 2014. Power to the people: The role of humans in interactive machine learning. AI Magazine 35, 4 (2014), 105–120. https://doi.org/10.1609/aimag.v35i4.2513
[3] Farah Benamara, Carmine Cesarano, Antonio Picariello, and Venkatramana S. Subrahmanian. 2019. Sentiment analysis: Adjectives and adverbs are better than adjectives alone. In Proceedings of ICWSM. Academic Press.
[4] Jeffrey P. Bigham, Chandrika Jayant, Hanjie Ji, Greg Little, Andrew Miller, Robert C. Miller, Robin Miller, Aubrey Tatarowicz, Brandyn White, and Samual White. 2012. VizWiz: nearly real-time answers to visual questions. In Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology. ACM. https://doi.org/10.1145/1866029.1866080
[5] Jeffrey P. Bigham, Richard E. Ladner, and Yevgen Borodin. 2011. The Design of Human-Powered Access Technology. In Proceedings of the 13th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS '11). Association for Computing Machinery, New York, NY, USA, 3–10. https://doi.org/10.1145/2049536.2049540
[6] Petter Bae Brandtzaeg and Asbjørn Følstad. 2018. Chatbots: Changing User Needs and Motivations. Interactions 25, 5 (2018), 38–43. https://doi.org/10.1145/3236669
[7] Robin N. Brewer, Leah Findlater, Joseph 'Jofish' Kaye, Walter Lasecki, Cosmin Munteanu, and Astrid Weber. 2018. Accessible Voice Interfaces. In Conference on Computer Supported Cooperative Work and Social Computing. ACM, 441–446. https://doi.org/10.1145/3272973.3273006
[8] Veena Chattaraman, Wi-Suk Kwon, Juan E. Gilbert, and Kassandra Ross. 2019. Should AI-based, conversational digital assistants employ social- or task-oriented interaction style? A task-competency and reciprocity perspective for older adults. Computers in Human Behavior 90 (2019), 315–330. https://doi.org/10.1016/j.chb.2018.08.048
[9] ConvAI. 2018. The Conversational Intelligence Challenge 2. http://convai.io/
[10] Florian Daniel, Cinzia Cappiello, and Boualem Benatallah. 2019. Bots Acting Like Humans: Understanding and Preventing Harm. IEEE Internet Computing 23, 2 (2019), 40–49. https://doi.org/10.1109/MIC.2019.2893137
[11] Stephan Diederich, Alfred Brendel, and Lutz M. Kolbe. 2019. Towards a Taxonomy of Platforms for Conversational Agent Design. In 14. International Conference on Wirtschaftsinformatik (WI2019).
[12] Stephan Diederich, Alfred Benedikt Brendel, and Lutz M. Kolbe. 2020. Designing Anthropomorphic Enterprise Conversational Agents. Business & Information Systems Engineering (2020). https://doi.org/10.1007/s12599-020-00639-y
[13] Jasper Feine, Ulrich Gnewuch, Stefan Morana, and Alexander Maedche. 2019. A Taxonomy of Social Cues for Conversational Agents. International Journal of Human-Computer Studies 132 (2019), 138–161. https://doi.org/10.1016/j.ijhcs.2019.07.009
[14] Jasper Feine, Stefan Morana, and Ulrich Gnewuch. 2019. Measuring Service Encounter Satisfaction with Customer Service Chatbots using Sentiment Analysis. In 14. Internationale Tagung Wirtschaftsinformatik (WI2019).
[15] Jasper Feine, Stefan Morana, and Alexander Maedche. 2019. Designing a Chatbot Social Cue Configuration System. In Proceedings of the 40th International Conference on Information Systems (ICIS). AISeL, Munich.
[16] Jasper Feine, Stefan Morana, and Alexander Maedche. 2019. Leveraging Machine-Executable Descriptive Knowledge in Design Science Research – The Case of Designing Socially-Adaptive Chatbots. In Extending the Boundaries of Design Science Theory and Practice, Bengisu Tulu, Soussan Djamasbi, and Gondy Leroy (Eds.). Springer International Publishing, Cham, 76–91.
[17] David Frohlich. 1993. The history and future of direct manipulation. Behaviour & Information Technology 12, 6 (1993), 315–329. https://doi.org/10.1080/01449299308924396
[18] C. J. Hutto and Eric Gilbert. 2014. VADER: A parsimonious rule-based model for sentiment analysis of social media text. In Eighth International AAAI Conference on Weblogs and Social Media. Ann Arbor, MI, USA.
[19] Ulrich Gnewuch, Stefan Morana, and Alexander Maedche. 2017. Towards Designing Cooperative and Social Conversational Agents for Customer Service. In Proceedings of the 38th International Conference on Information Systems (ICIS). AISeL, Seoul.
[20] S. R. Gouravajhala, Y. Jiang, P. Kaur, and J. Chaar. 2018. Finding Mnemo: Hybrid Intelligence Memory in a Crowd-Powered Dialog System. In Collective Intelligence Conference (CI 2018). Zurich, Switzerland.
[21] J. Harms, P. Kucherbaev, A. Bozzon, and G. Houben. 2019. Approaches for Dialog Management in Conversational Agents. IEEE Internet Computing 23, 2 (2019), 13–22. https://doi.org/10.1109/MIC.2018.2881519
[22] Ting-Hao Huang, Joseph Chee Chang, and Jeffrey P. Bigham. 2018. Evorus: A Crowd-powered Conversational Assistant Built to Automate Itself Over Time. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI '18). ACM Press, 1–13. https://doi.org/10.1145/3173574.3173869
[23] Ting-Hao Kenneth Huang, Walter S. Lasecki, Amos Azaria, and Jeffrey P. Bigham. 2016. "Is There Anything Else I Can Help You With?" Challenges in Deploying an On-Demand Crowd-Powered Conversational Agent. In Fourth AAAI Conference on Human Computation and Crowdsourcing.
[24] Ting-Hao Kenneth Huang, Walter S. Lasecki, and Jeffrey P. Bigham. 2015. Guardian: A Crowd-Powered Spoken Dialog System for Web APIs. In Third AAAI Conference on Human Computation and Crowdsourcing.
[25] Mohit Jain, Pratyush Kumar, Ramachandra Kota, and Shwetak N. Patel. 2018. Evaluating and Informing the Design of Chatbots. In DIS 2018, Ilpo Koskinen, Youn-kyung Lim, Teresa Cerratto-Pargman, Kenny Chow, and William Odom (Eds.). Association for Computing Machinery, New York, NY, 895–906. https://doi.org/10.1145/3196709.3196735
[26] Youxuan Jiang, Jonathan K. Kummerfeld, and Walter S. Lasecki. 2017. Understanding Task Design Trade-offs in Crowdsourced Paraphrase Collection. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 103–109. https://doi.org/10.18653/v1/P17-2017
[27] Patrik Jonell, Mattias Bystedt, Fethiye Irmak Dogan, Per Fallgren, Jonas Ivarsson, Marketa Slukova, José Lopes, Ulme Wennberg, Johan Boye, and Gabriel Skantze. 2018. Fantom: A Crowdsourced Social Chatbot using an Evolving Dialog Graph. In 1st Proceedings of Alexa Prize.
[28] Spencer Kelly. 2019. compromise: modest natural-language processing in javascript. https://github.com/spencermountain/compromise
[29] Scott R. Klemmer, Anoop K. Sinha, Jack Chen, James A. Landay, Nadeem Aboobaker, and Annie Wang. 2000. Suede: A Wizard of Oz Prototyping Tool for Speech User Interfaces. In Proceedings of the 13th Annual ACM Symposium on User Interface Software and Technology. ACM, 1–10.
[30] Rafal Kocielnik, Daniel Avrahami, Jennifer Marlow, Di Lu, and Gary Hsieh. 2018. Designing for Workplace Reflection: A Chat and Voice-Based Conversational Agent. In Proceedings of the 2018 Designing Interactive Systems Conference (DIS '18). ACM, 881–894. https://doi.org/10.1145/3196709.3196784
[31] Ben Krause, Marco Damonte, Mihai Dobre, Daniel Duma, Joachim Fainberg, Federico Fancellu, Emmanuel Kahembwe, Jianpeng Cheng, and Bonnie Webber. 2017. Edina: Building an Open Domain Socialbot with Self-dialogues. In 1st Proceedings of Alexa Prize.
[32] Walter S. Lasecki, Kyle I. Murray, Samuel White, Robert C. Miller, and Jeffrey P. Bigham. 2011. Real-time Crowd Control of Existing Interfaces. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, Jeff Pierce, Maneesh Agrawala, and Scott Klemmer (Eds.). ACM Press, New York, NY, 23. https://doi.org/10.1145/2047196.2047200
[33] Walter S. Lasecki, Jaime Teevan, and Ece Kamar. 2014. Information Extraction and Manipulation Threats in Crowd-Powered Systems. In Proceedings of the ACM Conference on Computer Supported Cooperative Work & Social Computing. ACM. https://doi.org/10.1145/2531602.253173
[34] Walter S. Lasecki, Rachel Wesley, Jeffrey Nichols, Anand Kulkarni, James F. Allen, and Jeffrey P. Bigham. 2013. Chorus: A Crowd-Powered Conversational Assistant. In Proceedings of the 26th Annual ACM Symposium on User Interface Software and Technology. ACM. https://doi.org/10.1145/2501988.2502057
[35] Alexander Maedche, Christine Legner, Alexander Benlian, Benedikt Berger, Henner Gimpel, Thomas Hess, Oliver Hinz, Stefan Morana, and Matthias Söllner. 2019. AI-Based Digital Assistants. Business & Information Systems Engineering 61, 4 (2019), 535–544.
[36] Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. ACL.
[37] Moira McGregor and John C. Tang. 2017. More to Meetings: Challenges in Using Speech-Based Technology to Support Meetings. In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing (CSCW '17). ACM, New York, NY, USA, 2208–2220. https://doi.org/10.1145/2998181.2998335
[38] M. McTear, Z. Callejas, and D. Griol. 2016. The Conversational Interface: Talking to Smart Devices (1st ed.). Springer International Publishing, Switzerland.
[39] Michael F. McTear. 2002. Spoken Dialogue Technology: Enabling the Conversational User Interface. Comput. Surveys 34, 1 (2002), 90–169.
[40] Michael F. McTear. 2017. The Rise of the Conversational Interface: A New Kid on the Block?. In Future and Emerging Trends in Language Technology. Machine Learning and Big Data, José F. Quesada, Francisco-Jesús Martín Mateos, and Teresa López Soto (Eds.). Springer International Publishing, Cham, 38–49.
[41] Michael Price. 2019. bad-words: A javascript filter for badwords. https://github.com/web-mech/badwords
[42] Microsoft. 2019. Bot Framework Web Chat. https://github.com/microsoft/BotFramework-WebChat
[43] Microsoft. 2019. Microsoft Bot Framework Direct Line JS Client. https://github.com/microsoft/BotFramework-DirectLineJS
[44] Microsoft. 2019. Microsoft Power Virtual Agents. https://powervirtualagents.microsoft.com/en-us/
[45] George A. Miller. 1995. WordNet: A Lexical Database for English. Commun. ACM 38, 11 (1995), 39–41. https://doi.org/10.1145/219717.219748
[46] Maali Mnasri. 2019. Recent advances in conversational NLP: Towards the standardization of chatbot building. arXiv preprint arXiv:1903.09025 (2019).
[47] Gina Neff and Peter Nagy. 2016. Automation, Algorithms, and Politics | Talking to Bots: Symbiotic Agency and the Case of Tay. International Journal of Communication 10 (2016), 17.
[48] Zhenhui Peng, Taewook Kim, and Xiaojuan Ma. 2019. GremoBot. In Conference Companion Publication of the 2019 ACM Conference on Computer Supported Cooperative Work and Social Computing (CSCW '19), Eric Gilbert and Karrie Karahalios (Eds.). ACM Press, New York, NY, USA, 335–340. https://doi.org/10.1145/3311957.3359472
[49] Martin Porcheron, Joel E. Fischer, Moira McGregor, Barry Brown, Ewa Luger, Heloisa Candello, and Kenton O'Hara. 2017. Talking with Conversational Agents in Collaborative Action. In Companion of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing, Charlotte P. Lee (Ed.). ACM, New York, NY, 431–436. https://doi.org/10.1145/3022198.3022666
[50] Martin Porcheron, Joel E. Fischer, and Sarah Sharples. 2017. "Do Animals Have Accents?". In CSCW '17, Charlotte P. Lee, Steve Poltrock, Louise Barkhuus, Marcos Borges, and Wendy Kellogg (Eds.). Association for Computing Machinery, New York, NY, 207–219. https://doi.org/10.1145/2998181.2998298
[51] Rasa. 2019. Improve your contextual assistant with Rasa X. https://rasa.com/docs/rasa-x/
[52] Tony Russell-Rose and Tyler Tate. 2013. Designing the Search Experience: The Information Architecture of Discovery. Morgan Kaufmann/Elsevier, Amsterdam. https://ebookcentral.proquest.com/lib/subhh/detail.action?docID=1046391
[53] Isabella Seeber, Lena Waizenegger, Stefan Seidel, Stefan Morana, Izak Benbasat, and Paul Benjamin Lowry. 2019. Collaborating with Technology-Based Autonomous Agents: Issues and Research Opportunities. SSRN Electronic Journal (2019). https://doi.org/10.2139/ssrn.3504587
[54] Joseph Seering, Michal Luria, Geoff Kaufman, and Jessica Hammer. 2019. Beyond Dyadic Interactions. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (CHI '19). ACM Press, New York, NY, 1–13. https://doi.org/10.1145/3290605.3300680
[55] Ben Shneiderman. 1997. Direct Manipulation for Comprehensible, Predictable and Controllable User Interfaces. In Proceedings of the International Conference on Intelligent User Interfaces. ACM, 33–39. https://doi.org/10.1145/238218.238281
[56] Mark S. Silver. 2008. On the Design Features of Decision Support Systems: The Role of System Restrictiveness and Decisional Guidance. In Handbook on Decision Support Systems 2: Variations, Frada Burstein and Clyde W. Holsapple (Eds.). Springer, Berlin, Heidelberg, 261–291.
[57] Paul Thomas, Mary Czerwinski, Daniel McDuff, Nick Craswell, and Gloria Mark. 2018. Style and Alignment in Information-Seeking Conversation. In Proceedings of the 2018 Conference on Human Information Interaction & Retrieval (CHIIR '18), Chirag Shah, Nicholas J. Belkin, Katriina Byström, Jeff Huang, and Falk Scholer (Eds.). ACM Press, New York, NY, USA, 42–51. https://doi.org/10.1145/3176349.3176388
[58] Felipe Thomaz, Carolina Salge, Elena Karahanna, and John Hulland. 2020. Learning from the Dark Web: Leveraging Conversational Agents in the Era of Hyper-Privacy to Enhance Marketing. Journal of the Academy of Marketing Science 48, 1 (2020), 43–63. https://doi.org/10.1007/s11747-019-00704-3
[59] Luis von Ahn and Laura Dabbish. 2004. Labeling Images with a Computer Game. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Elizabeth Dykstra-Erickson (Ed.). ACM, New York, NY, 319–326. https://doi.org/10.1145/985692.985733
[60] Jason D. Williams, Kavosh Asadi, and Geoffrey Zweig. 2017. Hybrid Code Networks: Practical and Efficient End-to-End Dialog Control with Supervised and Reinforcement Learning. arXiv preprint (2017).
[61] Rainer Winkler, Sebastian Hobert, Antti Salovaara, Matthias Söllner, and Jan Marco Leimeister. 2020. Sara, the Lecturer: Improving Learning in Online Education with a Scaffolding-Based Conversational Agent. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. ACM, 1–14. https://doi.org/10.1145/3313831.3376781
[62] Yu Zhong, Walter S. Lasecki, Erin Brady, and Jeffrey P. Bigham. 2015. RegionSpeak: Quick Comprehensive Spatial Descriptions of Complex Images for Blind Users. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, Bo Begole (Ed.). ACM, New York, NY, 2353–2362. https://doi.org/10.1145/2702123.2702437