VOICE INTERFACE OF AUTONOMOUS DEVICES: THE CASE OF A SPEAKING SAMOVAR

Igor Malyuk
Moscow Polytechnic University, Faculty of Mechanical Engineering
Scientific advisor: M.A. Kurbakova, Candidate of Philological Sciences, Associate Professor

Abstract: the article describes contemporary problems that arise when designing a smart device, a device intended for the Internet of Things, or any other autonomous device in which a voice user interface serves as the main machine-user interface.

Keywords: autonomous devices, smart devices, voice user interface, Internet of Things, voice assistant

Over the past decade, a great deal has happened in computer engineering. Hundreds of single-board computers with simple APIs have appeared, allowing non-professional developers to design and deploy complex engineering systems of any kind. Boards such as the Raspberry Pi (RPi), Particle, and BeagleBoard make programs with complex calculations mobile. Thus a new industry, the Internet of Things (IoT), has emerged. The Internet of Things is a global infrastructure for the information society, enabling advanced services by interconnecting (physical and virtual) things based on existing and evolving interoperable information and communication technologies [2].

Nowadays, the common pattern of IoT device use is «master-slave», in which a special application or program must be installed on the user's master device [1]. Using this app, the user gives commands to the device and can observe its current state; the device obeys the commands and sends a reply with technical information (the slave part). The «Internet» part of most IoT devices is that they can be controlled over the Internet, sending their data through third-party servers that process it. Other ways of controlling devices have therefore appeared: giving other apps access to your master control app (Philips Hue [10]), giving them access directly to your device if it conforms to a designated protocol (Apple HomeKit [9]), or even having a separate device do one of these (Amazon Echo [11]).

Most of these third-party apps include, or are bound to, a voice assistant app that has a voice user interface (VUI) and built-in artificial intelligence of varying complexity. They commonly speak with a normal human voice (Yandex.Alice is dubbed by Tatiana Shitova [13]). This is considered not to annoy the user, since it creates the illusion of talking with another human [12].

The main problem with this approach is that when building a smart device, e.g. a smart vacuum cleaner, adding a VUI is attractive because it may widen the customer base, yet designing hardware compatible with contemporary voice assistants creates a number of software limitations and is time-consuming. These problems were faced while designing the Autonomous Samovar. Open-hardware principles were followed, and an RPi Zero was used as the main controller board, giving the user the opportunity to hack the device and replace the board with a similar one, such as a Banana Pi. Such boards are capable enough to run VUI-powered software.

The problem of decentralisation was solved by creating software that distributes commands not only from the master device to the slave device but equally across all devices connected into one mesh. For example, in a mesh network consisting of the Samovar, a Lamp, and the user's PC, the user may command the Samovar to make tea, and the Samovar in turn may command the Lamp to turn on, if configured to do so. This allows the user's devices to interact with each other without the user's involvement and makes it possible to build more fault-tolerant systems of devices.
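As an illustration of this idea, the sketch below shows how such mesh-style command distribution might look. It is not the actual Samovar firmware: the paper does not publish its protocol, so the JSON-over-UDP message format, the port number, the peer list, and the function names are all assumptions.

```python
# A sketch of mesh-style command distribution, not the actual Samovar
# firmware: the message format (JSON over UDP), the port, the peer list
# and the function names below are assumptions for illustration.
import json
import socket

PORT = 50505                                 # hypothetical mesh port
PEERS = ["192.168.1.11", "192.168.1.12"]     # addresses of the other nodes

def send_command(target: str, action: str) -> None:
    """Broadcast a command to every peer; the addressed node executes it."""
    msg = json.dumps({"target": target, "action": action}).encode()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for peer in PEERS:
            sock.sendto(msg, (peer, PORT))

def serve(my_name: str, handlers: dict) -> None:
    """Receive loop run by every node alike: any peer may issue commands,
    so there is no dedicated master device."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(("", PORT))
        while True:
            data, _addr = sock.recvfrom(1024)
            msg = json.loads(data)
            if msg.get("target") == my_name and msg.get("action") in handlers:
                handlers[msg["action"]]()

# Example configuration: when the Samovar starts making tea, it also
# commands the Lamp to turn on, without involving the user's PC.
def make_tea() -> None:
    print("boiling water...")
    send_command("lamp", "turn_on")          # device-to-device interaction

# serve("samovar", {"make_tea": make_tea})
```

Because every node runs the same receive loop and can also send, no single device is a point of failure, which is what makes such a mesh more fault-tolerant than a strict master-slave hierarchy.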
Such an approach means the device need not be tied to existing voice assistants; instead, a custom assistant can be built with only the necessary functions. For example, the Samovar uses a modified Jasper voice assistant as its main VUI [8]. The algorithm of use is:

1. The user places a cup under the Samovar
2. The user gives the Samovar a voice command to make tea, and the Samovar replies
3. The Samovar pours tea leaves into the cup
4. The Samovar takes water from its water tank, boils it, and pours it into the cup, brewing the tea
5. The user takes the cup of tea

The second step of the algorithm is nothing other than a user-machine dialog through the machine's VUI. Acceptable behaviour in this dialog is quite similar to a GUI dialog and can be considered to have the following criteria:

- understandability
- clarity
- expectancy

The first two criteria mean that the user has to understand what the machine does and what it tells the user, while the third means that the user has to understand how the machine tells it. In a desktop GUI, for example, «understandability» stands for displaying windows that separate different kinds of information; «clarity» means the user understands what the information in these windows is and what can be done with it (e.g. press a button); and «expectancy» means the user can imagine what will happen if that button is pressed. In GUIs, «expectancy» is usually achieved by adding animations to objects on the screen, allowing the user to understand what is happening to a particular object, much as with a real-life thing: it has its own rules of existence that cannot be broken (e.g. physical laws). A VUI works in the same way: it is known from psychology that people expect something unknown to work similarly to something they already know.

The main problem with voice control is that people use their voice to communicate with other people. Socrates' «Speak, that I may see you» applies here as well: people judge other people by their voice and can decide at once what kind of person someone is from their manner of speech. Human speech carries two types of information: the semantic part conveys information about objects and actions, while the paralinguistic part conveys a «hidden» message, such as the speaker's emotional state [5]. Obviously, the Samovar cannot have an emotional state, so it was giving false information to the user: it replied to the user's command in a gentle female voice (the RHVoice speech synthesizer, «Anna» voice [7]), which automatically made users assume it had all the characteristics of a human and judge it accordingly.

The command for making tea should contain the words «samovar», «make», «tea» (Russ. «самовар», «сделай», «чаю»); the CMUSphinx speech recognition engine was used [6]. When the command is recognised correctly, the reply is «Okay, making tea» (Russ. «Хорошо, делаю чай»); in all other cases the reply is «Sorry, the command is incorrect» (Russ. «Простите, команда неверная»).
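Under these rules the dialog logic reduces to a keyword check followed by one of the two fixed replies. A minimal sketch is given below, assuming the recogniser yields a plain text transcription; the real Jasper module of the prototype is not reproduced in the paper, so the function names and callbacks are illustrative.

```python
# A sketch of the Samovar's command check; the real Jasper module is not
# reproduced in the paper, so the function names and callbacks here are
# illustrative. CMUSphinx is assumed to return a plain text transcription.
REQUIRED_WORDS = {"самовар", "сделай", "чаю"}    # «samovar», «make», «tea»

def handle_utterance(transcription: str, speak, start_tea) -> None:
    """React to one recognised utterance.

    speak     -- text-to-speech callback (RHVoice in the prototype)
    start_tea -- callback launching the tea-making routine
                 (steps 3 and 4 of the algorithm above)
    """
    words = set(transcription.lower().split())
    if REQUIRED_WORDS <= words:                  # all key words are present
        speak("Хорошо, делаю чай")               # «Okay, making tea»
        start_tea()
    else:
        speak("Простите, команда неверная")      # «Sorry, the command is incorrect»

# A quick check of the logic:
# handle_utterance("самовар сделай чаю", speak=print, start_tea=lambda: None)
```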
To describe the problem, a survey of 10 people was conducted. Participants were given two tasks:

1. To ask the Samovar prototype to make some tea
2. To talk with the Samovar prototype

Drawing on this experience, the participants reported their impressions of the device. Based on their responses, the decision was made to have the Samovar speak in an inhuman, «robotic» voice that does not allow an emotional state to be estimated. To do that, the Samovar's replies were recorded and processed, and the survey was repeated with the new voice.

As a result, it can be seen that a voice that does not lead the user to expect communication with a human is preferable for VUIs in which the number of functions is limited. This may be important for voice assistants implemented in military systems, engineering software, and other areas where software problems can cause emergencies.

References:
1. National Instruments. Application Design Patterns: Master/Slave. http://www.ni.com/tutorial/3022/en/ (accessed April 13, 2019).
2. ITU. Internet of Things Global Standards Initiative. https://www.itu.int/en/ITU-T/gsi/iot/Pages/default.aspx (accessed April 11, 2019).
3. El Ayadi M., Kamel M.S., Karray F. Survey on speech emotion recognition: Features, classification schemes, and databases // Pattern Recognition. 2011. V. 44, N. 3. P. 572–587.
4. Ntalampiras S., Potamitis I., Fakotakis N. An adaptive framework for acoustic monitoring of potential hazards // EURASIP Journal on Audio, Speech, and Music Processing. 2009. V. 2009. P. 13–23.
5. Nwe T.L., Foo S.W., De Silva L.C. Speech emotion recognition using hidden Markov models // Speech Communication. 2003. V. 41, N. 4. P. 603–623.
6. CMUSphinx. Available at: http://cmusphinx.sourceforge.net/wiki/ (accessed April 10, 2019).
7. RHVoice. Available at: https://github.com/Olga-Yakovleva/RHVoice (accessed April 10, 2019).
8. Jasper. Available at: http://jasperproject.github.io (accessed April 11, 2019).
9. Apple Developer Documentation. Using the HomeKit Accessory Protocol Specification (Non-Commercial Version). https://developer.apple.com/support/homekit-accessory-protocol/ (accessed April 11, 2019).
10. Philips. Meet the Hue. https://www2.meethue.com/en-us/philips-hue-app (accessed April 11, 2019).
11. Amazon Developer Documentation. Alexa Skills: Alexa Connected Devices. https://developer.amazon.com/alexa/connected-devices (accessed April 11, 2019).
12. Samostiyenko E.V. The assistant and his debtor: on artificial voices // Praktiki i interpretatsii [Practices and Interpretations]. 2018. V. 3, N. 2. P. 57 (in Russ.).
13. Yandex Blog. Introducing Voice Assistant Alice (in Russ.). https://yandex.ru/blog/company/alisa (accessed April 13, 2019).