Robots Make Things Funnier

Jonas Sjöbergh and Kenji Araki
Graduate School of Information Science and Technology
Hokkaido University
{js,araki}@media.eng.hokudai.ac.jp
Abstract. We evaluate the influence robots can have on the perceived
funniness of jokes. We let people grade how funny simple word play jokes
are, and vary the presentation method. The jokes are presented as text
only or said by a small robot. The same joke is rated significantly higher
when presented by the robot. We also let one robot tell a joke and have a
second robot either laugh, boo, or do nothing. Both laughing and booing make
the joke significantly funnier than no reaction, though there is no significant
difference between laughing and booing.
1 Introduction
While humans use humor very often in daily interactions, computer systems are
still far from able to use humor freely. Research on humor has been done in
many different research areas, for instance psychology, philosophy, sociology and
linguistics. A good overview of recent research on computer processing of humor
can be found in [1]. Two main areas of computer implementation exist:
humor recognition and humor generation.
For humor generation, systems generating quite simple forms of jokes, e.g. word
play jokes, have been constructed [2,3,4,5,6]. Recognition systems that try to
recognize whether a text is a joke or not have also been constructed [7,8,9].
Our paper is mainly relevant for generation systems.

What is considered amusing varies a lot from person to person. Many things influence how
funny something is perceived to be, such as the general mood at the time or who
is delivering the joke. This makes evaluating and comparing jokes and joke generating systems complicated. We evaluate the effects of different ways of delivering
simple jokes to evaluators. Our hypothesis was that jokes would be perceived as
funnier when presented by a small robot than when presented in text form. We
also hypothesised that the same joke would be perceived as funnier if another
robot, for instance, laughs at it than if there is no reaction.
We test these hypotheses by showing simple word play jokes to volunteer
evaluators. Some jokes were told
by a small robot, while some were presented only as text. We also evaluated the
effects of feedback on the jokes, in the form of another robot laughing, booing,
or showing no reaction after the first robot told a joke. Despite some problems
with, for instance, the robot voice being hard to hear and understand, both
hypotheses were confirmed.
2 Evaluation Method
We have collected a large set of simple word play jokes in Japanese. These were
collected by searching the Internet for a few seed jokes. Any page containing at
least two of the seed jokes was then automatically processed to find the largest
common left and right contexts of instances of the two jokes in the page. If the
common contexts were at least four characters each, anything appearing with such
a left and right context in the same page was saved as a new joke. For instance,
if two seed jokes occur in an HTML-list, all list items in the list would be saved.
This method has been used earlier to expand sets of named entities [10].
It is of course possible to run the method repeatedly, using the newly found
jokes as seeds for the next run, to find more jokes. Sometimes things that are not
jokes are also downloaded, for instance when two posters in a web forum use two
different seed jokes as their signatures. This generally leads to grabbing all the
signatures on the page, not all of which are necessarily jokes. We have run our
program a few times, and also done some manual clean up to remove things that
are obviously not jokes. This has resulted in a collection of 2,780 jokes. Many
of these are small variations of other jokes in the collection though, so there are
not 2,780 unique jokes.
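As a rough illustration of the expansion step described above, the following Python sketch finds new candidates on a single page given two seed jokes. The function names and regular-expression details are our own and only approximate the actual implementation.

import re

def longest_common_suffix(a, b):
    """Longest shared ending of a and b (the common left context)."""
    n = 0
    while n < min(len(a), len(b)) and a[len(a) - 1 - n] == b[len(b) - 1 - n]:
        n += 1
    return a[len(a) - n:] if n else ""

def longest_common_prefix(a, b):
    """Longest shared beginning of a and b (the common right context)."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return a[:n]

def expand_from_page(page, seed1, seed2, min_context=4):
    """Return new joke candidates from one page that contains both seeds."""
    i, j = page.find(seed1), page.find(seed2)
    if i < 0 or j < 0:
        return []
    left = longest_common_suffix(page[:i], page[:j])
    right = longest_common_prefix(page[i + len(seed1):], page[j + len(seed2):])
    if len(left) < min_context or len(right) < min_context:
        return []  # require at least four characters of context on each side
    pattern = re.escape(left) + r"(.+?)" + "(?=" + re.escape(right) + ")"
    found = {m.group(1) for m in re.finditer(pattern, page, re.S)}
    return [c for c in found if c not in (seed1, seed2)]

# Example: two seeds that occur as items of the same HTML list.
page = "<ul><li>joke A</li><li>joke B</li><li>joke C</li></ul>"
print(expand_from_page(page, "joke A", "joke C"))  # ['joke B']

In this toy example the shared context is the surrounding list markup, so every other item in the list is picked up as a candidate, matching the HTML-list case mentioned above.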
For our experiments in this paper we selected 22 jokes from this collection.
The main criteria when selecting jokes were that the jokes should be very short
(since the robot model we use has a very limited amount of memory for voice
samples) and be understandable in spoken form (some jokes in the collection are
only understandable in writing).
Two robots were used, both of the same model: the Robovie-i
(http://www.vstone.co.jp/top/products/robot/Robovie-i.html), a fairly small robot
that can move its legs and lean its body sideways, see Figure 1. It also has a small
speaker, though the sound volume is quite low and the sound quality is poor. The
main features of the Robovie-i are that it is cute, easily programmable, and fairly
cheap. One of the robots used is gold colored and one is blue. Whenever a robot
was about to speak, it first moved its body from side to side a little, and then
slightly leaned backwards, looking up at and pointing the speaker (mounted on
the stomach) towards the evaluator. Other than that, the robots did not move.
The robot voice was generated automatically using a text-to-speech tool for
Japanese, AquesTalk (http://www.a-quest.com/aquestal/). The robots use different
synthetic voices, so it is possible to distinguish which robot is talking by listening
alone. The text-to-speech conversion works fairly well, though the produced speech
is sometimes difficult to understand. It is flat, lacking rhythm, intonation, and
comedic timing. The voice is somewhat childlike and sounds "machine like", as
most cheap text-to-speech systems do.
Fig. 1. The robots used

For the first experiment, ten jokes were divided into two sets, set 1 and set 2,
of five jokes each. Jokes in set 1 were always presented first and then the jokes in
set 2. To half of the evaluators (group A) the jokes in set 1 were presented using
one of the robots and to the other half (group B) these jokes were presented only
in text form. The same was done for the jokes in set 2 too, but if an evaluator
had set 1 presented using the robot then the same evaluator would have set 2
presented using text and vice versa. Any one evaluator was only shown the same
joke once, all jokes were shown to all evaluators, and all the jokes were always
presented in the same order. Evaluators were assigned to group A or B based
on the order they arrived in, e.g. the first ten to arrive ended up in group A,
the next ten in group B, etc. This means that all jokes were evaluated an equal
number of times using the robot and using text.
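As a compact summary of this counterbalanced design (a sketch in our own notation, not code from the study), the assignment can be written as:

# Experiment 1: each group sees both joke sets, with the presentation
# method swapped between the two groups.
presentation = {
    "A": {"set 1": "robot", "set 2": "text"},
    "B": {"set 1": "text",  "set 2": "robot"},
}

def assign_group(arrival_index, block_size=10):
    """Assign evaluators to groups A and B in arrival order, in blocks of ten."""
    return "A" if (arrival_index // block_size) % 2 == 0 else "B"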
For the second experiment twelve jokes were divided into three sets of four
jokes each. These were then presented by having one robot tell the joke and
the other robot either laugh a little and say “umai” (“good one”), say “samui”
(“cold”, as in “not funny”), or do nothing. As in the first experiment, the jokes
were presented to all evaluators in the same order, and each evaluator saw each joke exactly once. Set 1 was made up of jokes 0, 3, 4, and
8; set 2 of jokes 1, 2, 5, and 9; and set 3 of jokes 6, 7, 10, and 11. Evaluators
were assigned to either group C, D, or E. All groups had different reactions for
each set of jokes, so the second robot would laugh at four jokes each time, boo
at four jokes, and make no reaction at four jokes, but different jokes for different
groups. Which group had which reaction to which set of jokes is shown in Table
3. All jokes were presented with each reaction to the same number of evaluators.
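The corresponding assignment for the second experiment is a small Latin square over the three groups, three joke sets, and three reactions. The mapping below mirrors Table 3; the variable names are ours.

# Experiment 2: which reaction the second robot shows for each joke set,
# per evaluator group (cf. Table 3).
joke_sets = {1: [0, 3, 4, 8], 2: [1, 2, 5, 9], 3: [6, 7, 10, 11]}
reaction = {
    "C": {1: "booing", 2: "no reaction", 3: "laughter"},
    "D": {1: "laughter", 2: "booing", 3: "no reaction"},
    "E": {1: "no reaction", 2: "laughter", 3: "booing"},
}

# Each group uses every reaction exactly once, so across the three groups
# every joke is presented with every reaction the same number of times.
for group, mapping in reaction.items():
    assert sorted(mapping.values()) == ["booing", "laughter", "no reaction"], group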
Evaluators were found by going to a table in a student cafeteria and setting
up the robots and a sign saying that in exchange for participating in a short
robot experiment volunteers would get some chocolate. Only native speakers of
Japanese could participate. The evaluations were done one person at a time, so
if many arrived at the same time some would have to wait their turn. Evaluators
were asked to grade all jokes on a scale from 1 (boring) to 5 (funny).
As the cafeteria background was fairly noisy, compounded by the poor speaker,
it was sometimes hard to hear what the robot was saying. In such cases the joke
was repeated until the evaluator heard what was said. A quieter environment
would have been better, but finding a large number of evaluators prepared to go
to a different location would have been very difficult. We have since managed
to find better speakers, though they were not available in time for the experiment.

Table 1. Mean evaluation scores for the two sets of jokes using different presentation
methods. Which group evaluated which set using which method is given in parentheses.

Set   Robot     Text
1     2.5 (A)   2.2 (B)
2     3.0 (B)   2.4 (A)
All   2.8       2.3
3 Results
In general, the evaluators were happy to participate, though most people passing by ignored the evaluation. In total, 60 evaluators, 17 women and 43 men,
participated in the experiment. The scores of the jokes vary wildly from person
to person: the lowest per-person mean score over all jokes was 1.3 and the highest
3.9 in the first experiment, and 1.3 and 3.8 in the second experiment.
3.1 Robot vs. Text
The results of the first experiment are presented in Tables 1 and 2. Table 1
shows the mean scores of the different sets of jokes using the robot and text,
and Table 2 shows the scores of each joke. These are of course also influenced
by how interested each evaluator was in this type of joke in general.
The total average scores in Table 1 are perhaps the most interesting result to
focus on. They give a good comparison between the two methods: any specific
evaluator gives a score to the same number of jokes for both methods, and
every joke appears an equal number of times for both methods. As hypothesized,
the robot presentation method gets a higher mean score than text, 2.8
compared to 2.3. Though the standard deviation in the scores is quite high, 1.2
for both methods, the difference is significant (α = 0.01 level, Student's t-test).
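For readers who want to reproduce this kind of comparison, a minimal sketch is given below. It assumes the individual 1-5 ratings are collected into two arrays; the numbers shown are placeholders, not the actual experimental data.

import numpy as np
from scipy import stats

# Placeholder ratings (one per evaluator-joke pair), not the real data.
robot_scores = np.array([3, 4, 2, 5, 3, 2, 4, 3, 1, 4])
text_scores = np.array([2, 3, 2, 4, 2, 1, 3, 2, 1, 3])

# Two-sample Student's t-test on the two sets of ratings.
t, p = stats.ttest_ind(robot_scores, text_scores)
print(f"robot mean {robot_scores.mean():.1f}, text mean {text_scores.mean():.1f}")
print(f"t = {t:.2f}, p = {p:.4f}")  # significant at the 0.01 level if p < 0.01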
Looking at the individual jokes in Table 2, nine of the ten jokes were perceived
as funnier when presented by the robot than in text form, though in many cases
the difference was small. Accounting for the multiple comparisons, only
two jokes (jokes 5 and 7) were significantly funnier at the α = 0.05 level using
the robot.
Table 2. Mean evaluation scores using different presentation methods

Joke              Total   Robot   Text
0                 2.2     2.5     2.0
1                 1.9     2.0     1.9
2                 2.8     3.0     2.5
3                 3.0     3.2     2.9
4                 1.8     2.0     1.5
5                 2.5     3.1     1.9
6                 2.3     2.7     2.0
7                 2.8     3.2     2.4
8                 2.7     2.8     2.6
9                 3.1     3.1     3.2
Average           2.5     2.8     2.3
# Highest Score           9       1

The Pearson correlation between the jokes in text form and the same jokes
presented by the robot is 0.73, a fairly high correlation. This indicates that the
robot improves the impression of the jokes, rather than that the robot is simply
funny in itself (which would make all jokes told by the robot about equally
funny). Some jokes are improved more than others, which could depend on many
things, for instance the quality of the robot voice for the words in question,
whether the joke is new or old, or whether the joke is about the person telling
it. Joke 1 (scored similarly in text and using the robot) is a very well known
Japanese joke that most people no longer find interesting, while joke 5 (scored
a lot higher using the robot than using text) relies on a repeated sound that is
funny when spoken but does not come through as well in text.
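The correlation itself can be recomputed approximately from the per-joke means in Table 2. A minimal sketch, using the rounded values from the table, looks as follows.

import numpy as np

# Per-joke mean scores for jokes 0-9, taken from Table 2.
robot = np.array([2.5, 2.0, 3.0, 3.2, 2.0, 3.1, 2.7, 3.2, 2.8, 3.1])
text = np.array([2.0, 1.9, 2.5, 2.9, 1.5, 1.9, 2.0, 2.4, 2.6, 3.2])

# Pearson correlation between the two presentation methods.
r = np.corrcoef(robot, text)[0, 1]
print(f"Pearson r = {r:.2f}")  # about 0.72 on these rounded means; the paper reports 0.73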
3.2 Laughter, Booing, or No Reaction
The results of the second experiment, which evaluated the influence of a second
robot either laughing, booing, or giving no reaction at all to the telling of a joke,
are presented in Tables 3 and 4. Table 3 shows the mean scores of the different
sets of jokes with the different reactions, and Table 4 shows each individual joke.
Table 3. Mean evaluation scores for the three sets of jokes with different reactions
from the second robot. Which group evaluated which set with which reaction is given
in parentheses.

Set   Jokes             No reaction   Laughter   Booing
1     {0, 3, 4, 8}      1.9 (E)       3.1 (D)    2.9 (C)
2     {1, 2, 5, 9}      2.0 (C)       2.2 (E)    2.5 (D)
3     {6, 7, 10, 11}    2.7 (D)       3.1 (C)    2.4 (E)
All                     2.2           2.8        2.6

Table 4. Mean evaluation scores for each joke with different reactions

Joke              Total   No reaction   Laughter   Booing
0                 2.6     1.6           3.0        3.0
1                 2.5     2.3           2.5        2.8
2                 2.0     1.9           2.1        2.1
3                 3.0     2.5           3.3        3.1
4                 2.7     1.7           3.2        3.1
5                 2.3     1.9           2.1        2.8
6                 2.5     2.2           3.2        2.2
7                 2.7     2.5           3.0        2.5
8                 2.4     1.9           2.9        2.5
9                 2.0     1.8           1.8        2.4
10                3.0     3.4           3.1        2.5
11                2.6     2.6           2.9        2.2
Average           2.5     2.2           2.8        2.6
# Highest Score           1             7          5

As before, the total average score for each reaction type in Table 3 is perhaps
the most interesting result to focus on, since any specific evaluator gives a score
to the same number of jokes for every reaction type, and every joke appears an
equal number of times with each reaction. As hypothesized, the mean scores are
higher with some form of reaction than with no reaction, averaging 2.8 for
laughter and 2.6 for booing, compared to 2.2 for no reaction. Again, the standard
deviation in the scores is quite high: 1.0 for laughter and for no reaction, and
1.1 for booing. The differences between laughter and no reaction and between
booing and no reaction are significant at the α = 0.01 level (Student's t-test, α
adjusted for multiple comparisons), while the difference between laughter and
booing is not significant.
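A sketch of the three pairwise comparisons with a simple Bonferroni adjustment is shown below. The rating arrays are placeholders rather than the actual data, and Bonferroni is our own choice of adjustment; the paper only states that α was adjusted for multiple comparisons.

from itertools import combinations
import numpy as np
from scipy import stats

# Placeholder 1-5 ratings per reaction type (not the real data).
scores = {
    "no reaction": np.array([2, 1, 3, 2, 2, 3, 1, 2]),
    "laughter": np.array([3, 2, 4, 3, 3, 4, 2, 3]),
    "booing": np.array([3, 2, 3, 3, 2, 4, 2, 3]),
}

pairs = list(combinations(scores, 2))
alpha = 0.01 / len(pairs)  # Bonferroni: divide the significance level by the number of tests
for a, b in pairs:
    t, p = stats.ttest_ind(scores[a], scores[b])
    verdict = "significant" if p < alpha else "not significant"
    print(f"{a} vs {b}: t = {t:.2f}, p = {p:.4f} ({verdict} at adjusted alpha {alpha:.4f})")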
Looking at the individual jokes in Table 4, only for one joke out of twelve was
the no reaction presentation better than the other methods.
3.3 Discussion
In total, despite some problems with hearing and understanding what the robots
said, the robots did make things funnier. The difference in mean scores of 0.5 between text and robot is rather large on a scale from 1 to 5, especially considering
that the average score was only 2.5. The same is true for the differences between
no reaction and laughter (0.6) and between no reaction and booing (0.4), again
relative to an average score of 2.5.
Evaluations of the impressions of robots performing manzai (Japanese standup comedy) have shown similar results to ours, i.e. that robots telling jokes give
a funny impression. The overall impression of the robots was rated higher than
viewing amateur comedians perform the same routine on TV [11].
Some general problems with the evaluations were: the noise in the cafeteria,
the low quality robot speakers, the text-to-speech results sometimes being hard
to understand, and the fact that simple word play jokes are not very funny
without context. Some of these (speaker and text-to-speech) are pretty straightforward to improve, while the others might be more difficult.
The robots worked quite well in telling jokes and evaluators seemed to relate
to them. Many evaluators commented on the robot’s reactions with things like
“Yes, I agree, that was not very funny” (when booing) or “Ha ha, for sure it is
lying now!” (when laughing at a joke the evaluator obviously did not think was
very funny), despite the robots being obviously just machines. Jokes mentioning
the speaker (such as “I am so out of shape I get tired just by pushing my luck”)
also tended to get a better reaction when embodied by the robot than when just
read in text form.
4 Conclusions
We evaluated the impact of different presentation methods on how funny jokes
are judged to be. We found that the same joke was perceived as significantly
funnier when told by a robot than when presented only using text. The average
scores were 2.8 (robot) and 2.3 (text), which is a quite large difference on the scale
from 1 to 5 used. This means that it can be difficult to compare the evaluations
of different joke generating systems (or other sources of humor) evaluated at
different times, since even the presentation method used has a very large impact
on the results. There are likely many other factors that influence the evaluation
results too, making it difficult to compare different systems.
We also evaluated the impact of having another robot laugh, boo, or do nothing when a joke was told. This too made a significant difference to the perceived
funniness of a joke, with averages of 2.8 (laugh), 2.6 (boo), and 2.2 (no reaction). The robot always laughed and booed in exactly the same way. A more
varied set of reactions would probably be even funnier.
In future experiments we want to examine the effect of speech only (is it the
voice or the robot that is funny?), and to include a small training evaluation to
remove any effect of a joke being rated as funny simply because it is the first
appearance of the robot.
We already have better speakers for the robots, and would like to have better
text-to-speech quality. We also want to evaluate automatically generated jokes,
and have already used the robots for this in some other work [6].
Acknowledgments
This work has been funded by the Japan Society for the Promotion of Science
(JSPS). We also thank the anonymous reviewers for insightful comments.
References
1. Binsted, K., Bergen, B., Coulson, S., Nijholt, A., Stock, O., Strapparava, C.,
Ritchie, G., Manurung, R., Pain, H., Waller, A., O’Mara, D.: Computational humor. IEEE Intelligent Systems 21(2), 59–69 (2006)
2. Binsted, K.: Machine Humour: An Implemented Model of Puns. PhD thesis, University of Edinburgh, Edinburgh, United Kingdom (1996)
3. Binsted, K., Takizawa, O.: BOKE: A Japanese punning riddle generator. Journal
of the Japanese Society for Artificial Intelligence 13(6), 920–927 (1998)
4. Yokogawa, T.: Generation of Japanese puns based on similarity of articulation. In:
Proceedings of IFSA/NAFIPS 2001, Vancouver, Canada (2001)
5. Stark, J., Binsted, K., Bergen, B.: Disjunctor selection for one-line jokes. In: Proceedings of INTETAIN, Madonna di Campiglio, Italy, pp. 174–182 (2005)
6. Sjöbergh, J., Araki, K.: A complete and modestly funny system for generating
and performing Japanese stand-up comedy. In: Coling 2008: Companion volume:
Posters and Demonstrations, Manchester, UK, pp. 109–112 (2008)
7. Taylor, J., Mazlack, L.: Toward computational recognition of humorous intent. In:
Proceedings of Cognitive Science Conference 2005 (CogSci 2005), Stresa, Italy, pp.
2166–2171 (2005)
8. Mihalcea, R., Strapparava, C.: Making computers laugh: Investigations in automatic humor recognition. In: Proceedings of HLT/EMNLP, Vancouver, Canada
(2005)
9. Sjöbergh, J., Araki, K.: Recognizing humor without recognizing meaning. In: Masulli, F., Mitra, S., Pasi, G. (eds.) WILF 2007. LNCS, vol. 4578, pp. 469–476.
Springer, Heidelberg (2007)
10. Wang, R., Cohen, W.: Language-independent set expansion of named entities using
the web. In: Perner, P. (ed.) ICDM 2007. LNCS, vol. 4597, pp. 342–350. Springer,
Heidelberg (2007)
11. Hayashi, K., Kanda, T., Miyashita, T., Ishiguro, H., Hagita, N.: Robot manzai: robots' conversation as a passive social medium. In: IEEE International Conference
on Humanoid Robots (Humanoids 2005), Tsukuba, Japan, pp. 456–462 (2005)