
Voice Interface of Autonomous Devices: The Example of a Speaking Samovar

VOICE INTERFACE OF AUTONOMOUS DEVICES: THE EXAMPLE OF A SPEAKING SAMOVAR
Igor Malyuk
Moscow Polytechnic University, Faculty of Mechanical Engineering
Scientific advisor: M.A. Kurbakova, Cand. Sc. (Philology), Associate Professor
Abstract: the article describes contemporary problems that arise when designing a smart device, a device intended for the Internet of Things, or any other autonomous device that uses a voice user interface as its primary machine-user interface.
Keywords: smart devices, voice user interface, internet of things, voice assistant
Over the past decade, computer engineering has seen remarkable developments. Hundreds of single-board computers with simple APIs have appeared, allowing non-professional developers to design and deploy complex engineering systems of all kinds. Boards such as the Raspberry Pi (RPi), Particle, and BeagleBoard bring mobility to programs with complex calculations. Thus a new industry, the Internet of Things (IoT), appeared.
The Internet of Things is a global infrastructure for the information society, enabling advanced services by interconnecting (physical and virtual) things based on existing and evolving interoperable information and communication technologies [2].
Nowadays, the common usage pattern of IoT devices is «master-slave», where a special application or program has to be installed on the user's master device [1]. Using this app, the user gives commands to the device and can observe its current state. The device obeys the commands and sends a reply with technical information (the slave part).
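As an illustration, a minimal sketch of this pattern in Python follows; the address, port, and JSON message format here are hypothetical, not taken from any particular product:

    # Master side of a minimal master/slave exchange over TCP; the address,
    # port and JSON message format are hypothetical.
    import json
    import socket

    def send_command(device_addr, command):
        # Send one command and return the slave's status reply.
        with socket.create_connection(device_addr) as conn:
            conn.sendall(json.dumps({'cmd': command}).encode() + b'\n')
            reply = conn.makefile().readline()  # slave answers with one JSON line
        return json.loads(reply)

    # e.g. send_command(('192.168.1.42', 8000), 'make_tea')
    # might return {'status': 'ok', 'water_level': 0.8}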
The «internet» part of most IoT devices is that they can be controlled over the Internet, sending their data through third-party servers that process it. Hence other ways of controlling devices appeared: giving other apps access to your master control app (Philips Hue [10]), giving them access directly to your device if it conforms to a designated protocol (Apple HomeKit [9]), or even having a separate device do one of these (Amazon Echo [11]).
Most of these third-party apps include or are bundled with a voice assistant app that has a voice user interface (VUI) and built-in artificial intelligence of varying complexity. Commonly they have a natural human voice (Yandex.Alice is voiced by Tatiana Shitova [13]). This is meant not to annoy the user, by creating the illusion of talking with another human [12].
The main problem with this approach is that when building a smart device, e.g. a smart vacuum cleaner, adding a VUI is a good thing because it may widen the customer base. But designing hardware that is compatible with contemporary voice assistants imposes a number of software limitations and is also time-consuming.
These problems were faced while designing the Autonomous Samovar. Open-hardware principles were followed and an RPi Zero was used as the main controller board, giving the user an opportunity to hack the device and replace the board, for instance with a similar Banana Pi. Such boards are capable enough to run VUI-powered software.
The problem of decentralisation was solved by creating software that distributes commands not only from master device to slave device, but equally among all devices connected into one mesh. For example, in a mesh network consisting of the Samovar, a Lamp, and the user's PC, the user may command the Samovar to make tea, yet the Samovar may command the Lamp to turn on if configured to do so. This allows the user's devices to interact with each other without the user, making it possible to build more fault-tolerant systems of devices. A sketch of the idea is given below.
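The article does not publish the Samovar software itself, so the following is only a sketch of the idea, with all names illustrative: every node keeps a list of peers and both executes and forwards commands, so no single master is required.

    # Illustrative sketch of peer-to-peer command distribution in a device
    # mesh: every node can originate a command, executes it if it is the
    # target, and forwards it to its peers; a seen-set stops forwarding loops.

    class Node:
        def __init__(self, name):
            self.name = name
            self.peers = []        # neighbouring Node objects in the mesh
            self.seen = set()      # ids of commands already processed

        def receive(self, cmd_id, target, action):
            if cmd_id in self.seen:
                return             # already handled, stop the flood
            self.seen.add(cmd_id)
            if target == self.name:
                print('%s: executing %s' % (self.name, action))
            for peer in self.peers:
                peer.receive(cmd_id, target, action)

    samovar, lamp, pc = Node('samovar'), Node('lamp'), Node('pc')
    samovar.peers, lamp.peers, pc.peers = [lamp, pc], [samovar], [samovar]

    pc.receive(1, 'samovar', 'make tea')   # user command issued from the PC
    samovar.receive(2, 'lamp', 'turn on')  # Samovar triggers the Lamp itself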
Such an approach avoids being bound to existing voice assistants and allows building your own with only the necessary functions. For example, the Samovar uses a modified Jasper voice assistant as its main VUI [8].
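A tea-making command can be expressed as a module in the style of Jasper's standard module convention (WORDS, isValid(), handle()); the English phrase here is for readability, while the actual Russian command words are given later in the article:

    # A tea-making command module following Jasper's module convention:
    # WORDS declares the vocabulary, isValid() matches a spoken command,
    # handle() performs the action and answers through the mic object.
    import re

    WORDS = ["SAMOVAR", "MAKE", "TEA"]

    def isValid(text):
        # Fire only when the whole command phrase was recognised.
        return bool(re.search(r'\bsamovar\b.*\bmake\b.*\btea\b', text,
                              re.IGNORECASE))

    def handle(text, mic, profile):
        mic.say("Okay, making tea")    # spoken through the speech synthesizer
        # ...start the tea-making sequence on the hardware here...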
The algorithm of use is as follows (a control-flow sketch is given after the list):
1. The user places a cup under the Samovar
2. The user gives the Samovar a voice command to make tea and the Samovar replies
3. The Samovar pours tea leaves into the cup
4. The Samovar takes water from its water tank, boils it and pours it into the cup, brewing the tea
5. The user takes the cup of tea
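Steps 1 and 3–5 are plain actuator control; only step 2 involves the VUI. A compressed control-flow sketch, with hypothetical stubs in place of the real pump and heater code:

    # Compressed control flow for the algorithm above; the hardware calls
    # are hypothetical stubs standing in for the real pump/heater control.

    def wait_for_cup():        print('cup detected')
    def dispense_tea_leaves(): print('tea leaves poured into the cup')
    def boil_and_pour():       print('water boiled and poured, tea brewing')

    def make_tea():
        wait_for_cup()           # step 1
        # step 2: the VUI dialog, handled by the voice-assistant module
        dispense_tea_leaves()    # step 3
        boil_and_pour()          # step 4: water from the tank
        # step 5: the user takes the cup

    make_tea()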
The second step of the algorithm is nothing else but a user-machine dialog using the machine's VUI. Acceptable behaviour in this dialog is quite similar to a GUI dialog and can be considered to have the following criteria:
- understandability
- clarity
- expectancy
The first two criteria mean that the user has to understand what the machine does and what it tells the user, while the third criterion means that the user has to understand how the machine tells something to the user. In a desktop GUI, for example, «understandability» stands for displaying windows that separate different kinds of information, «clarity» for making the user understand what the information in these windows is and what can be done with it (e.g. press a button), and «expectancy» for the user being able to imagine what will happen if they press that button. In GUIs, «expectancy» is usually supported by adding animations to on-screen objects, allowing the user to understand what happens to a certain object, much like observing a real-life thing: it has its own rules of existence that cannot be broken (e.g. physical laws).
A VUI works in the same way: it is known from psychology that people expect something unknown to work similarly to something they already know. The main problem with voice control is that people use voice for communication with other people. Socrates' «Speak, in order that I may see you» applies here as well. The reason is that people judge other people by their voice and can decide at once what another person is like by their manner of speech. Human speech carries two types of information: the semantic part gives information about objects and actions, while the paralinguistic part conveys a «hidden» message, such as emotional state [5]. Obviously, the Samovar cannot have an emotional state, so a human-like voice gives the user false information.
The Samovar replied to the user's command in a gentle female voice (RHVoice speech synthesizer, «Anna» voice [7]), which automatically made users assume it had all the characteristics of a human and judge it accordingly.
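Such a reply can be produced from Python through RHVoice's command-line utility (an assumption: the RHVoice-test tool with a -p voice option; exact flags may differ between RHVoice builds):

    # Synthesizing a spoken reply through RHVoice's command-line test
    # utility (an assumption: the RHVoice-test tool with a -p voice option;
    # exact flags may differ between RHVoice builds).
    import subprocess

    def say(text, voice='Anna'):
        subprocess.run(['RHVoice-test', '-p', voice],
                       input=text.encode('utf-8'), check=True)

    say('Хорошо, делаю чай')    # «Okay, making tea»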
The command for making tea should contain the words «samovar», «make», «tea» (Russ. «самовар», «сделай», «чаю»); the CMUSphinx speech recognition engine was used [6]. When the command is recognised correctly, the reply is «Okay, making tea» (Russ. «Хорошо, делаю чай»); otherwise the reply is «Sorry, the command is incorrect» (Russ. «Простите, команда неверная»).
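Keyword-driven recognition of this kind can be sketched with the pocketsphinx Python bindings in keyword-spotting mode (the English keyphrase and the threshold value are illustrative; the Samovar uses Russian models and phrases):

    # Keyword spotting with the pocketsphinx bindings: react whenever the
    # command phrase is heard and ignore everything else. The keyphrase and
    # threshold are illustrative; Samovar uses Russian models and phrases.
    from pocketsphinx import LiveSpeech

    speech = LiveSpeech(
        lm=False,                # keyword mode, no full language model
        keyphrase='samovar make tea',
        kws_threshold=1e-20,     # lower = more sensitive, more false alarms
    )

    for phrase in speech:        # yields one hit per detection
        print('Okay, making tea')    # in Samovar the reply goes to RHVoice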
A survey of 10 people was conducted to characterise the problem. Participants were given two tasks:
1. To ask the Samovar prototype to make some tea
2. To talk with the Samovar prototype
Based on their experience, the participants gave the following feedback:

[Table: participants' feedback after the first survey]
The decision was to make the Samovar speak with an inhuman, «robotic» voice that does not let the user infer an emotional state. To achieve this, the Samovar's replies were recorded and processed; one possible processing step is sketched below.
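The article does not specify the processing that was applied, so the following is only an assumption: ring-modulating the recorded replies with a low-frequency sine, a common way to strip natural intonation cues and obtain a flat, mechanical timbre.

    # One possible «robotisation» pass (an assumption; the article does not
    # specify the actual processing): ring-modulating the recorded reply
    # with a low-frequency sine strips the natural intonation cues.
    import numpy as np
    from scipy.io import wavfile

    rate, samples = wavfile.read('reply.wav')   # hypothetical recorded reply
    if samples.ndim > 1:
        samples = samples.mean(axis=1)          # mix down to mono
    t = np.arange(len(samples)) / rate
    carrier = np.sin(2 * np.pi * 50.0 * t)      # 50 Hz modulator
    robotic = samples.astype(np.float32) * carrier
    wavfile.write('reply_robotic.wav', rate, robotic.astype(np.int16))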
The survey was then repeated with the new voice:

[Table: participants' feedback after the second survey]
As a result, it can be seen that a voice that does not make the user expect communication with a human is preferable for VUIs where the number of functions is limited. This may be important for voice assistants implemented in military systems, engineering software, or other areas where software problems can cause emergencies.
References:
1. National Instruments. Application Design Patterns: Master/Slave. http://www.ni.com/tutorial/3022/en/ (accessed April 13, 2019).
2. ITU. Internet of Things Global Standards Initiative. https://www.itu.int/en/ITU-T/gsi/iot/Pages/default.aspx (accessed April 11, 2019).
3. El Ayadi M., Kamel M.S., Karray F. Survey on speech emotion recognition: Features, classification schemes, and databases // Pattern Recognition. 2011. V. 44, N. 3. P. 572–587.
4. Ntalampiras S., Potamitis I., Fakotakis N. An adaptive framework for acoustic monitoring of potential hazards // EURASIP Journal on Audio, Speech, and Music Processing. 2009. V. 2009. P. 13–23.
5. Nwe T.L., Foo S.W., De Silva L.C. Speech emotion recognition using hidden Markov models // Speech Communication. 2003. V. 41, N. 4. P. 603–623.
6. CMUSphinx. http://cmusphinx.sourceforge.net/wiki/ (accessed April 10, 2019).
7. RHVoice. https://github.com/Olga-Yakovleva/RHVoice (accessed April 10, 2019).
8. Jasper. http://jasperproject.github.io (accessed April 11, 2019).
9. Apple Developer Documentation. Using the HomeKit Accessory Protocol Specification (Non-Commercial Version). https://developer.apple.com/support/homekit-accessory-protocol/ (accessed April 11, 2019).
10. Philips. «Meet the Hue». https://www2.meethue.com/en-us/philips-hue-app (accessed April 11, 2019).
11. Amazon Developer Documentation. Alexa Skills: Alexa Connected Devices. https://developer.amazon.com/alexa/connected-devices (accessed April 11, 2019).
12. Samostiyenko E.V. The assistant and his debtor: on artificial voices // Praktiki i interpretatsii. Tom 3(2) [Practices and Interpretations. Volume 3(2)] (in Russ.). 2018. P. 57.
13. Yandex Blog. «Introducing Voice Assistant Alice» (in Russ.). https://yandex.ru/blog/company/alisa (accessed April 13, 2019).