Turn-Taking and Coordination in Human-Machine Interaction: Papers from the 2015 AAAI Spring Symposium
What’s the Game and Who’s Got the Ball?
Genre in Spoken Interaction
Emer Gilmartin, Francesca Bonin, Loredana Cerrato, Carl Vogel, Nick Campbell
Trinity College Dublin
gilmare@tcd.ie
Copyright © 2014, Association for the Advancement of Artificial
Intelligence (www.aaai.org). All rights reserved.
Abstract
Humans engage in an enormous range of interaction types.
Therefore, there is a need to consider genre when analyzing
or modeling multimodal interaction. While some low level
mechanisms may follow universal patterns, it is also
possible that even basic interaction mechanisms, such as
turn-taking, vary with the type and parameters of different
interactions such as mode, content, and goal. Greater insight
is needed into the characteristics of different interaction
types in order to automatically generate or analyze spoken
interaction. More focused interaction models are needed to
generate human-machine dialogue beyond simple task-based
scenarios, and these models require suitable
interaction data. We discuss genre in spoken interaction,
outline the characteristics of casual conversation, review
available data, and describe ongoing work exploring the
dynamics of task-based and social dialogue, and of ‘chat’
and ‘chunk’ subtypes of casual conversation in particular.
Introduction
Conversational interaction is a multi-modal and multifunctional
activity, where participants filter information
from a bundle of signals and cues in order to make
inferences about the speaker’s literal and pragmatic
meaning, intentions, and affective state. Spoken encounters
between humans take many forms and serve many goals.
We use conversation to conduct business – from buying
stamps in the post office to commercial negotiations in the
boardroom. We access medical and legal help through
spoken consultations. We learn about the world and form
and deepen relationships through spoken interaction, which
seems to be the glue that maintains our social groups
(Dunbar, 2003). Human-machine spoken interaction is
moving from narrow task-based scenarios to more ‘human-like’
applications, feeding the need for models of an
increasing variety of communicative situations.
The range of human-human interactions involving
speech is enormous, with the problem of categorizing
different types of speech exchange into genres labeled as
‘notorious’ (Bakhtine, 1986). Different communicative
situations or ‘speech-exchange systems’ (Sacks, Schegloff,
& Jefferson, 1974) are analogous to sport, where different
games employ similar moves – soccer, rugby, and karate
all involve kicking motions – but the purpose, execution,
and effects of these moves vary. In different
communicative situations the basic building blocks of
dialogue, including utterance characteristics, disfluencies,
turn-taking systems, pauses, gaps and overlaps, may vary
with the type and parameters of different interactions
(mode, content, goal). Such observations are implicit in the
work of Goffman, who emphasizes the relevance of
framing in interactions (e.g. retelling an argument is
distinct from the first hand experience of an argument) as
well as types of interactional scenario (Goffman, 1974;
Goffman, 1981). However, these issues are relevant again
as methodology for interaction analysis increasingly
attends, with high levels of precision, to the timing of
interaction phenomena. One cannot hope to automatically
generate or analyse social interaction if models are based
on unsuitable interaction data – one cannot play soccer
with the rules of rugby. Analysis of different types of
interaction can help discriminate which phenomena are
generalizable and which are situation or genre dependent.
In addition to extending understanding of human
communication, such knowledge will be useful in
human-machine interaction design.
Our previous work on laughter in formal task-based
meetings from the AMI corpus and casual talk from the
TableTalk and DANS corpora revealed differences in the
distribution of laughter and its appearance in advance of
topic boundaries (Gilmartin, Bonin, Vogel, & Campbell,
2013). Our work in building a casual chat dialogue into a
robot interface showed that dialogue structure and timing
of system turns was crucial to dialogue success (Gilmartin
& Campbell, 2012). In the course of this work it became
clear that despite the existence of some theory on small
talk, there is a lack of temporal information needed to build
models of the turn-taking, pause, gap, overlap and
disfluency distribution in various types of social dialogue.
Our goal in the work described below is to investigate
casual social talk and its sub-genres, and to use this
knowledge to inform the design of sociable dialogue
systems.
Genre in Spoken Interaction
A major distinction between different types of speech is
whether the goal of the interaction as a whole or indeed of
individual contributions is ‘transactional’ or ‘interactional’
(Brown & Yule, 1983). Transactional, or task-based, talk
has short-term goals which are clearly defined and known
to the participants – examples include service encounters in
shops or business meetings. Early conversation analysis
work, which formed the basis for much turn-taking theory,
was based on recordings of psychotherapy sessions or
telephone conversations in domains such as emergency or
911 operator assistance and suicide hotlines (Duncan,
1972; Schegloff & Sacks, 1973). These task-based
conversations rely heavily on the transfer of linguistic or
lexical information. In technology, most spoken dialogue
systems have been designed under the constraints of
Allen’s Task Based Dialog Hypothesis for reasons of
tractability (Allen et al., 2000), thus imbuing spoken
dialogue system design with a transactional slant.
However, in real-life conversation there is often no
obvious short term task to be accomplished through speech
and the purpose of the interaction is better described as
building and maintaining social bonds and transferring
attitudinal or affective information (Dunbar, 1998;
Malinowski, 1923) – examples include greetings, gossip,
and social chat or small talk. A tenant’s short chat about
the weather with the concierge of an apartment block is not
intended to transfer important meteorological data but
rather to build a relationship which may serve either of the
participants in the future. Of course, most transactional
encounters are peppered with social or interactional
elements as the establishment and maintenance of friendly
relationships contributes to task success. Below we briefly
describe theories on the structure, content and purpose of
casual conversation – the focus of our work.
Casual Conversation – the Unmarked Case
Researchers in several fields have mentioned casual
conversation as a genre or type of speech exchange system,
and differentiated it from task-based interaction.
Malinowski drew attention to ‘phatic communion’ – an
activity comprising free aimless social conversation, not
intended to exchange information or to express thought,
but rather an emergent activity of congregating people,
which allowed social bonding by avoiding an
uncomfortable silence (which could be seen as unfriendly)
and then engaging participants in a reciprocal flow of
words – generally positive platitudes, comments on the
obvious, and personal views and life history. This function
Malinowski saw as the most basic use of speech, in
addition to more sophisticated uses of language as a
repository and controller of ideas (Malinowski, 1923).
Laver (1975) speculated that the prominence of
competence-based or mentalistic approaches to language in
the 1960s and 1970s, and the resulting lack of attention to
communication in action, had discouraged research into
phatic communication or small talk. He focuses on phatic
communion during the ‘psychologically crucial margins of
interaction’, conversational openings and closings,
postulating that small talk performs a lubricating or
transitional function from silence to initial greetings to
business and back to closing sequences and to leave-taking.
While Laver and others had provided descriptions of
aspects of the structure and content of small talk, their
observations were based on field notes and introspection
rather than recordings. Schneider performed a major study
of small talk based on a corpus of audio recordings of
small talk made ‘in the wild’ throughout Britain
(Schneider, 1988). He concentrated on the linguistic
content of the entire dialogues rather than simply openings
and closings, using discourse analysis principles to
describe instances of small talk at several levels, from
frames such as ‘WEATHER’ to sequences and adjacency
pairs within interaction, and then to the types of utterances
comprising these sequences. Schneider notes that although
Grice’s Co-operative Principle may hold for small talk, the
Gricean maxims of quantity, quality, relation, and manner
do not seem to be well suited as these maxims are tied to
the quality and efficient transfer of information. He
proposes a politeness principle (‘Be polite!’) and two
super-maxims of politesse and friendliness. Politesse is
essentially an avoidance strategy with maxims including
Speech (‘Avoid silence!’), and Person (‘Avoid curiosity!’),
while Friendliness is more active with maxims including
Speech (‘Say something nice!’), and Person (‘Show
interest!’). He identifies sequence types widely used in
small talk such as remark-agree which are often followed
by idling sequences of repetitions of agreeing tails such as
‘Yes, of course’ and ‘MmHmm’. He highlights the
importance of agreeableness to small talk, mentioning a
tendency to exaggerate agreement and positive evaluations,
and how shared experience is often sought in friendly
conversation.
Slade and Eggins define casual conversation as the type of
talk engaged in when ‘talking just for the sake of talking’.
They strongly rebut any notion that casual conversation is
light or ‘meaningless’, arguing that it is through casual
conversation that people form and refine their social reality
(Eggins & Slade, 2004). They cite gossip, where
participants reaffirm their solidarity by jointly ascribing
outsider status to another, and show examples of
conversation between closer friends at a dinner party where
greater intimacy allows greater differences of opinion.
They identify story-telling as a frequent genre in
conversation and highlight the existence of ‘chat’
(interactive exchanges involving short turns by all
participants) and ‘chunks’ (longer uninterrupted
contributions) as elements of conversation. They also
mention the tendency for casual conversation to involve
multiple participants rather than the dyads normally found
in instrumental interactions or the examples often used in
conversation analysis. Instrumental and interactional talk
also differs in duration of episodes – task-based
conversations are temporally bounded by task completion
and tend to be short, while casual conversation can go on
indefinitely. Indeed, in one of the foundational papers on
conversational organization, attention is drawn to the fact
that exactly these casual conversational situations – ‘continuing state(s) of incipient talk’ – were not covered by
the theories of (task-based) conversational structure being
developed (Schegloff & Sacks, 1973).
Many researchers in the field identify the fact that they
base their studies on orthographic transcriptions only as a
limiting factor in the breadth of their descriptions (Laver,
1975; Schneider, 1988; Thornbury & Slade, 2006). In our
work we intend to add prosodic and temporal information
to the existing body of knowledge describing casual
conversation and small talk, and use this information in
combination with existing accounts of social talk to inform
the design of social talk for human computer interaction.
Our first step is analysis of human-human casual
conversation.
Human-human conversational data
Investigation into conversational dynamics is often based
on corpora of recorded speech. Many of these are
recordings of task-based interactions such as describing a
route through a map as in the HCRC MapTask corpus
(Anderson et al., 1991), spotting differences in two
pictures as in the DiaPix task in the LUCID (Baker &
Hazan, 2010) and Wildcat (Van Engen et al., 2010)
corpora, ranking items on a list – for example to decide
which items would be useful in an emergency (Vinciarelli,
Salamin, Polychroniou, Mohammadi, & Origlia, 2012),
and participating in real or staged business meetings as in
the ICSI and AMI corpora (Janin et al., 2003; McCowan et
al., 2005). While these data collection paradigms result in
corpora of great utility to researchers, it is not certain that
tasks such as these can be used to make reliable
generalizations about other genres of natural conversation
(Lemke, 2012).
Researchers have obtained natural data by recording real
home or work situations. Early studies used recordings of
telephone calls, as in Sacks and Schegloff’s emergency
services data, or the conversational data collected by
Jefferson (Sacks et al., 1974). More domain independent
natural telephonic data has been gathered by recording
large numbers of real phone conversations, as in the
Switchboard corpus (Godfrey, Holliman, & McDaniel,
1992), and the ESP-C collection of Japanese telephone
conversations (Campbell, 2004). Audio corpora of non-telephonic
spoken interaction include the Santa Barbara
Corpus (DuBois, Chafe, Meyer, & Thompson, 2000),
sections of the ICE corpora (Greenbaum, 1991) and of the
British National Corpus (BNC-Consortium, 2000).
Unfortunately the unimodal (audio only) nature of these
collections makes them unsuitable for modeling multimodal
interaction. While much progress has been made, it is not
certain that one size fits all for interactional data, or that
results and insights gained, for example, from
measurements of the timing of pauses and gaps in
telephone conversations about a ‘balloon task’ or indeed in
real-life suicide hotline recordings, are generalizable to
face-to-face casual conversation.
There are some examples of collections of audio and
later video recordings of varied natural interactions as in
the Gothenburg Corpus (Allwood, Björnberg, Grönqvist,
Ahlsén, & Ottesjö, 2000). High quality multimodal corpora
are appearing comprising naturalistic encounters with no
prescribed task or subject of discussion imposed on
participants. These include collections of free-talk
meetings, or first encounters between strangers as in the
Swedish Spontal, and the NOMCO and MOMCO Danish
and Maltese corpora (Edlund et al., 2010; Paggio,
Allwood, Ahlsén, & Jokinen, 2010).
In our current work we are focusing on the D64 corpus
(Oertel, Cummins, Edlund, Wagner, & Campbell, 2010).
The advantage of this corpus is that it comprises recordings
of multiparty casual talk in a natural setting of several
hours’ duration, thus providing a nearly prototypical
example of that ‘continuing state of incipient talk’ which is
of interest. Below we describe work in progress on two
components of natural casual conversation – chat and
chunks.
Data and Annotation
The D64 corpus is a multimodal corpus of over 8 hours of
informal conversational English, recorded over two days in
Dublin in November 2009. The corpus comprises three
sessions, all recorded in an apartment living
room. There were between 2 and 5 people on camera at all
times. Participants were all native or near native speakers
of English (C1/C2 on CEFRL). There were no instructions
to participants about what to talk about and care was taken
to ensure that all participants understood that they were
free to talk or not as the mood took them. The various
recordings were synchronized to determine an accurate
timeline for the corpus, and to ensure alignment of video,
audio, and motion capture data.
The microphone recordings were found to be unsuitable
for automatic segmentation as there were frequent overlaps
and bleedover from other speakers. Therefore, the audio
files were first segmented manually into speech and silence
intervals using Praat (Boersma & Weenink, 2010) on 10-second
and 4-second windows. The process was then repeated for
the sound file recorded at the same time for each of the
other speakers, resulting in annotations checked across five
different sound files. Any remaining speech intervals not
assigned to a particular speaker were resolved using Elan
(Wittenburg, Brugman, Russel, Klassmann, & Sloetjes,
2006) to refer to the video recordings taken at the same
time. There are concerns to note regarding the manual
annotation of silence. Humans listening to speech can miss
or indeed imagine the existence of objectively measured
silences of short duration, especially when there is
elongation of previous or following syllables (Martin,
1970). However, Martin’s results were based on annotators
timing pauses with a stopwatch in a single hearing. In the
current work, speech can be slowed down and replayed
and, by using the four-second window, annotators can
clearly see silences and differences in amplitude on the
speech waveform and spectrogram. Therefore, it was much
more likely that silences would be picked up and it was
hoped that problems due to annotators only picking up
perceptually salient silences could be avoided.
After segmentation the data for Session 1 of the corpus
were manually transcribed and annotated. Words,
hesitations, filled pauses, unfinished words, laughs and
coughs were transcribed and marked. The transcription was
carried out at the intonational phrase (IP) level as this is a
usefully small unit of speech – IPs can easily be
concatenated to the interpausal unit (IPU) and turn level as
required. IPs are also the basic unit for intonation study.
Several annotation and labeling schemes for conversational
speech or dialogue were examined to inform the design of
the final annotation scheme, largely based on the TRAINS
dialogue transcription scheme (Heeman & Allen, 1995).
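The concatenation of IPs into IPUs mentioned above can be sketched roughly as follows. This is only an illustration, not the procedure used for the corpus: the `Unit` record, the function name `merge_ips_into_ipus`, and the 0.2-second `max_pause` threshold are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Unit:
    speaker: str
    start: float  # seconds from the start of the recording
    end: float    # seconds from the start of the recording
    text: str


def merge_ips_into_ipus(ips: List[Unit], max_pause: float = 0.2) -> List[Unit]:
    """Join consecutive IPs of one speaker into interpausal units (IPUs).

    Two IPs by the same speaker are joined when the silence between them
    is shorter than max_pause (a hypothetical threshold, in seconds).
    """
    ipus: List[Unit] = []
    # Sorting by (speaker, start) places each speaker's IPs side by side,
    # so the last IPU built so far is the only candidate for joining.
    for ip in sorted(ips, key=lambda u: (u.speaker, u.start)):
        prev = ipus[-1] if ipus else None
        if (
            prev is not None
            and prev.speaker == ip.speaker
            and ip.start - prev.end < max_pause
        ):
            ipus[-1] = Unit(
                prev.speaker,
                prev.start,
                max(prev.end, ip.end),
                prev.text + " " + ip.text,
            )
        else:
            ipus.append(ip)
    # Return the IPUs on a common timeline across speakers.
    return sorted(ipus, key=lambda u: u.start)
```

Turn-level units could be built in the same way with a different joining criterion, which is one reason the IP is a convenient base unit for transcription.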
Pauses and gaps were generated from the segmentation,
with silences bounded on either side by the same speaker
classed as pauses and silences bounded by different
speakers classed as gaps. Filled pauses or fillers are non-lexical single syllable sounds made by the current speaker,
often interpreted as signs of hesitation. They usually take
the form of a schwa or reduced vowel pronounced alone or
with nasalization. They are variously transcribed as uh and
um (more commonly in American English) or eh/er and
em/erm (more commonly in British English). In the
annotation the uh/um forms were used for all filled pauses
with the only distinction being the presence of nasalization.
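As an illustration of how pauses and gaps can be derived from a speech/silence segmentation, the sketch below walks through time-ordered speech intervals and labels each stretch where nobody is speaking. The `Speech` and `Silence` types and the function name are assumptions made for this example, and simultaneous starts after a silence are not treated specially.

```python
from typing import List, NamedTuple


class Speech(NamedTuple):
    speaker: str
    start: float  # seconds
    end: float    # seconds


class Silence(NamedTuple):
    start: float
    end: float
    kind: str  # "pause" (same speaker resumes) or "gap" (speaker change)


def classify_silences(speech: List[Speech]) -> List[Silence]:
    """Label every stretch in which no participant is speaking."""
    ordered = sorted(speech, key=lambda s: s.start)
    silences: List[Silence] = []
    if not ordered:
        return silences
    # floor_end is the latest point at which anyone has been speaking so far;
    # last_speaker is the participant whose speech reached that point.
    floor_end = ordered[0].end
    last_speaker = ordered[0].speaker
    for seg in ordered[1:]:
        if seg.start > floor_end:
            kind = "pause" if seg.speaker == last_speaker else "gap"
            silences.append(Silence(floor_end, seg.start, kind))
        if seg.end > floor_end:
            floor_end = seg.end
            last_speaker = seg.speaker
    return silences
```

Because overlapping speech is common in multiparty talk, the sketch tracks the latest end time across all speakers rather than comparing adjacent intervals only, so an overlapped stretch is never counted as silence.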
In order to more fully investigate genre within casual
talk, an impressionistic annotation labeling conversational
sections as ‘chat’ or ‘chunk’ was carried out. These
annotations will be used with the lexical, structural and
acoustic features annotated to build a classifier and
investigate whether these two genres can be distinguished
automatically.
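To make the idea of structural features concrete, the following sketch computes three simple descriptors for one labelled section. The feature set, the `Turn` tuple layout, and the function name are hypothetical choices for illustration only; they are motivated by the description of chat as short turns by all participants and of chunks as longer uninterrupted contributions, and are not the features used in the study.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Turn = Tuple[str, float, float]  # (speaker, start, end), times in seconds


def section_features(turns: List[Turn]) -> Dict[str, float]:
    """Compute simple structural descriptors for one conversational section."""
    if not turns:
        raise ValueError("section contains no turns")
    talk_time: Dict[str, float] = defaultdict(float)
    for speaker, start, end in turns:
        talk_time[speaker] += end - start
    total = sum(talk_time.values())
    if total <= 0:
        raise ValueError("section contains no speech time")
    ordered = sorted(turns, key=lambda t: t[1])
    # Count how often the speaker differs between consecutive turns.
    changes = sum(1 for a, b in zip(ordered, ordered[1:]) if a[0] != b[0])
    return {
        "mean_turn_duration": total / len(turns),
        "dominant_speaker_share": max(talk_time.values()) / total,
        "speaker_changes": float(changes),
    }
```

Under these assumptions, a chunk-like section would be expected to show a higher dominant speaker share and a longer mean turn duration than a chat-like section. Whether such features actually separate the two genres is the empirical question the planned classifier is meant to address.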
We are particularly interested in differences in turn-taking and pause and gap duration. In task-based dialogue,
turn-taking is traditionally seen as competitive with the floor
viewed as a scarce resource. However, in free-ranging
casual conversation the impetus to avoid silence combined
with the lack of pressing information to transfer may lead
to a situation where the goal is to ‘keep the ball in the air’
with turn-taking occurring at topic exhaustion rather than
at points which would be seen as transition relevant (TRPs)
in more rapid fire instrumental exchanges.
The annotated data, comprising 6164 intonational phrase
units, including speech contributions and very short
utterances, are currently being analysed in order to gain
insight into similarities and differences in the temporal or
chronemic features of chat and chunk sections of casual
talk, and to obtain general data on casual conversation
which can be contrasted with task-based conversation. We
expect to have results in the near future.
Conclusions
To create systems which understand and/or generate social
signals for use in human-machine interaction, it is
necessary to have a clear idea of the use-case scenario of
the system. It is thus vital to gain a greater understanding
of the various genres, speech-exchange systems, or
activities arising in human-human and human-machine
interaction. Comparison of characteristics of different
genres within and across corpora will help discriminate
which phenomena are generalizable and which are
situation dependent. Such knowledge will be useful in
human-machine interaction design, particularly in the field
of companion robots or relational agents, as the very notion
of a companion application entails understanding of social
spoken interaction.
Acknowledgements
This work is supported by the Fastnet Project – Focus on
Action in Social Talk: Network Enabling Technology
funded by Science Foundation Ireland (SFI) 09/IN.1/I2631.
References
Allen, J., Byron, D., Dzikovska, M., Ferguson, G., Galescu, L., &
Stent, A. (2000). An architecture for a generic dialogue shell.
Natural Language Engineering, 6(3&4), 213–228.
Allwood, J., Björnberg, M., Grönqvist, L., Ahlsén, E., & Ottesjö,
C. (2000). The spoken language corpus at the department of
linguistics, Göteborg University. In FQS–Forum Qualitative
Social Research (Vol. 1).
Anderson, A. H., Bader, M., Bard, E. G., Boyle, E., Doherty, G.,
Garrod, S., … others. (1991). The HCRC map task corpus.
Language and Speech, 34(4), 351–366.
Baker, R., & Hazan, V. (2010). LUCID: a corpus of spontaneous
and read clear speech in British English. In Proceedings of the
DiSS-LPSS Joint Workshop 2010.
Bakhtine, M. M. (1986). Speech genres and other late essays.
University of Texas Press.
BNC-Consortium. (2000). British national corpus. URL
http://www.hcu.ox.ac.uk/BNC.
Boersma, P., & Weenink, D. (2010). Praat: doing phonetics by
computer [Computer program], Version 5.1.44.
Brown, G., & Yule, G. (1983). Teaching the spoken language
(Vol. 2). Cambridge University Press.
Campbell, N. (2004). Speech & Expression; the Value of a
Longitudinal Corpus. In LREC.
DuBois, J. W., Chafe, W. L., Meyer, C., & Thompson, S. A.
(2000). Santa Barbara Corpus of Spoken American English. CD-ROM. Philadelphia: Linguistic Data Consortium.
Dunbar, R. (1998). Grooming, gossip, and the evolution of
language. Harvard Univ Press.
Dunbar, R. I. M. (2003). The social brain: Mind, language, and
society in evolutionary perspective. Annual Review of
Anthropology, 163–181.
Duncan, S. (1972). Some signals and rules for taking speaking
turns in conversations. Journal of Personality and Social
Psychology, 23(2), 283–292. doi:10.1037/h0033031
Edlund, J., Beskow, J., Elenius, K., Hellmer, K., Strömbergsson,
S., & House, D. (2010). Spontal: A Swedish Spontaneous
Dialogue Corpus of Audio, Video and Motion Capture. In LREC.
Eggins, S., & Slade, D. (2004). Analysing casual conversation.
Equinox Publishing Ltd.
Gilmartin, E., Bonin, F., Vogel, C., & Campbell, N. (2013).
Laughter and Topic Transition in Multiparty Conversation. In
Proceedings of the SIGDIAL 2013 Conference (pp. 304–308).
Metz, France: Association for Computational Linguistics.
Gilmartin, E., & Campbell, N. (2012). More than just words:
building a chatty robot. Presented at the IWSDS 2012, Paris.
Godfrey, J. J., Holliman, E. C., & McDaniel, J. (1992).
SWITCHBOARD: Telephone speech corpus for research and
development. In Acoustics, Speech, and Signal Processing, 1992.
ICASSP-92., 1992 IEEE International Conference on (Vol. 1, pp.
517–520).
Goffman, E. (1974) Frame Analysis: An Essay on the
Organization of Experience. Northeastern University Press:
Lebanon, NH.
Goffman, E. (1981) Forms of Talk. University of Pennsylvania
Press. Philadelphia, PA.
Greenbaum, S. (1991). ICE: The international corpus of English.
English Today, 28(7.4), 3–7.
Heeman, P. A., & Allen, J. F. (1995). The TRAINS 93 Dialogues.
DTIC Document.
Janin, A., Baron, D., Edwards, J., Ellis, D., Gelbart, D., Morgan,
N., Stolcke, A. (2003). The ICSI meeting corpus. In Acoustics,
Speech, and Signal Processing, 2003. Proceedings.(ICASSP’03).
2003 IEEE International Conference on (Vol. 1, pp. I–364).
Laver, J. (1975). Communicative functions of phatic communion.
Organization of Behavior in Face-to-Face Interaction, 215–238.
Lemke, J. L. (2012). Analyzing verbal data: Principles, methods,
and problems. In Second International Handbook of Science
Education (pp. 1471–1484). Springer.
Malinowski, B. (1923). The problem of meaning in primitive
languages. Supplementary in the Meaning of Meaning, 1–84.
Martin, J. G. (1970). On judging pauses in spontaneous speech.
Journal of Verbal Learning and Verbal Behavior, 9(1), 75–78.
McCowan, I., Carletta, J., Kraaij, W., Ashby, S., Bourban, S.,
Flynn, M., … Karaiskos, V. (2005). The AMI meeting corpus. In
Proceedings of the 5th International Conference on Methods and
Techniques in Behavioral Research (Vol. 88).
Oertel, C., Cummins, F., Edlund, J., Wagner, P., & Campbell, N.
(2010). D64: A corpus of richly recorded conversational
interaction. Journal on Multimodal User Interfaces, 1–10.
Paggio, P., Allwood, J., Ahlsén, E., & Jokinen, K. (2010). The
NOMCO multimodal Nordic resource–goals and characteristics.
Sacks, H., Schegloff, E. A., & Jefferson, G. (1974). A simplest
systematics for the organization of turn-taking for conversation.
Language, 696–735.
Schegloff, E. A., & Sacks, H. (1973). Opening up closings.
Semiotica, 8(4), 289–327.
Schneider, K. P. (1988). Small talk: Analysing phatic discourse
(Vol. 1). Hitzeroth Marburg.
Thornbury, S., & Slade, D. (2006). Conversation: From
description to pedagogy. Cambridge University Press.
Van Engen, K. J., Baese-Berk, M., Baker, R. E., Choi, A., Kim,
M., & Bradlow, A. R. (2010). The Wildcat Corpus of native-and
foreign-accented English: Communicative efficiency across
conversational dyads with varying language alignment profiles.
Language and Speech, 53(4), 510–540.
Vinciarelli, A., Salamin, H., Polychroniou, A., Mohammadi, G.,
& Origlia, A. (2012). From nonverbal cues to perception:
personality and social attractiveness. In Cognitive Behavioural
Systems (pp. 60–72). Springer.
Wittenburg, P., Brugman, H., Russel, A., Klassmann, A., &
Sloetjes, H. (2006). Elan: a professional framework for
multimodality research. In Proceedings of LREC (Vol. 2006).