Turn-Taking and Coordination in Human-Machine Interaction: Papers from the 2015 AAAI Spring Symposium

What’s the Game and Who’s Got the Ball? Genre in Spoken Interaction

Emer Gilmartin, Francesca Bonin, Loredana Cerrato, Carl Vogel, Nick Campbell
Trinity College Dublin
gilmare@tcd.ie

Copyright © 2014, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Humans engage in an enormous range of interaction types. Therefore, there is a need to consider genre when analyzing or modeling multimodal interaction. While some low level mechanisms may follow universal patterns, it is also possible that even basic interaction mechanisms, such as turn-taking, vary with the type and parameters of different interactions such as mode, content, and goal. Greater insight is needed into the characteristics of different interaction types in order to automatically generate or analyze spoken interaction. More focused interaction models are needed to generate human-machine dialogue beyond simple task-based scenarios, and these models require suitable interaction data. We discuss genre in spoken interaction, outline the characteristics of casual conversation, review available data, and describe ongoing work exploring the dynamics of task-based and social dialogue, and of ‘chat’ and ‘chunk’ subtypes of casual conversation in particular.

Introduction

Conversational interaction is a multi-modal and multifunctional activity, where participants filter information from a bundle of signals and cues in order to make inferences about the speaker’s literal and pragmatic meaning, intentions, and affective state. Spoken encounters between humans take many forms and serve many goals. We use conversation to conduct business, from buying stamps in the post office to commercial negotiations in the boardroom. We access medical and legal help through spoken consultations. We learn about the world and form and deepen relationships through spoken interaction, which seems to be the glue that maintains our social groups (Dunbar, 2003). Human-machine spoken interaction is moving from narrow task-based scenarios to more ‘humanlike’ applications, feeding the need for models of an increasing variety of communicative situations.

The range of human-human interactions involving speech is enormous, with the problem of categorizing different types of speech exchange into genres labeled as ‘notorious’ (Bakhtine, 1986). Different communicative situations or ‘speech-exchange systems’ (Sacks, Schegloff, & Jefferson, 1974) are analogous to sport, where different games employ similar moves (soccer, rugby, and karate all involve kicking motions) but the purpose, execution, and effects of these moves vary. In different communicative situations the basic building blocks of dialogue, including utterance characteristics, disfluencies, turn-taking systems, pauses, gaps and overlaps, may vary with the type and parameters of different interactions (mode, content, goal). Such observations are implicit in the work of Goffman, who emphasizes the relevance of framing in interactions (e.g. retelling an argument is distinct from the first-hand experience of an argument) as well as types of interactional scenario (Goffman, 1974; Goffman, 1981). However, these issues are relevant again as methodology for interaction analysis increasingly attends, with high levels of precision, to the timing of interaction phenomena. One cannot hope to automatically generate or analyse social interaction if models are based on unsuitable interaction data; one cannot play soccer with the rules of rugby. Analysis of different types of interaction can help discriminate which phenomena are generalizable and which are situation or genre dependent. In addition to extending understanding of human communication, such knowledge will be useful in human-machine interaction design.

Our previous work on laughter in formal task-based meetings from the AMI corpus and casual talk from the TableTalk and DANS corpora revealed differences in the distribution of laughter and its appearance in advance of topic boundaries (Gilmartin, Bonin, Vogel, & Campbell, 2013). Our work in building a casual chat dialogue into a robot interface showed that dialogue structure and timing of system turns were crucial to dialogue success (Gilmartin & Campbell, 2012). In the course of this work it became clear that despite the existence of some theory on small talk, there is a lack of the temporal information needed to build models of the turn-taking, pause, gap, overlap and disfluency distribution in various types of social dialogue. Our goal in the work described below is to investigate casual social talk and its sub-genres, and to use this knowledge to inform the design of sociable dialogue systems.

Genre in Spoken Interaction

A major distinction between different types of speech is whether the goal of the interaction as a whole or indeed of individual contributions is ‘transactional’ or ‘interactional’ (Brown & Yule, 1983). Transactional, or task-based, talk has short-term goals which are clearly defined and known to the participants; examples include service encounters in shops or business meetings. Early conversation analysis work, which formed the basis for much turn-taking theory, was based on recordings of psychotherapy sessions or telephone conversations in domains such as emergency or 911 operator assistance and suicide hotlines (Duncan, 1972; Schegloff & Sacks, 1973). These task-based conversations rely heavily on the transfer of linguistic or lexical information. In technology most spoken dialogue systems have been designed under the constraints of Allen’s Task Based Dialog Hypothesis for reasons of tractability (Allen et al., 2000), thus imbuing spoken dialogue system design with a transactional slant. However, in real-life conversation there is often no obvious short-term task to be accomplished through speech and the purpose of the interaction is better described as building and maintaining social bonds and transferring attitudinal or affective information (R. Dunbar, 1998; Malinowski, 1923); examples include greetings, gossip, and social chat or small talk. A tenant’s short chat about the weather with the concierge of an apartment block is not intended to transfer important meteorological data but rather to build a relationship which may serve either of the participants in the future. Of course, most transactional encounters are peppered with social or interactional elements as the establishment and maintenance of friendly relationships contributes to task success. Below we briefly describe theories on the structure, content and purpose of casual conversation, the focus of our work.

Casual Conversation – the Unmarked Case

Researchers in several fields have mentioned casual conversation as a genre or type of speech exchange system, and differentiated it from task-based interaction. Malinowski drew attention to ‘phatic communion’, an activity comprising free aimless social conversation, not intended to exchange information or to express thought, but rather an emergent activity of congregating people, which allowed social bonding by avoiding an uncomfortable silence (which could be seen as unfriendly) and then engaging participants in a reciprocal flow of words, generally positive platitudes, comments on the obvious, and personal views and life history. This function Malinowski saw as the most basic use of speech, in addition to more sophisticated uses of language as a repository and controller of ideas (Malinowski, 1923).

Laver (1975) speculated that the prominence of competence-based or mentalistic approaches to language in the 1960s and 1970s, and the resulting lack of attention to communication in action, had discouraged research into phatic communion or small talk. He focuses on phatic communion during the ‘psychologically crucial margins of interaction’, conversational openings and closings, postulating that small talk performs a lubricating or transitional function from silence to initial greetings to business and back to closing sequences and to leave taking.

While Laver and others had provided descriptions of aspects of the structure and content of small talk, their observations were based on field notes and introspection rather than recordings. Schneider performed a major study of small talk based on a corpus of audio recordings of small talk made ‘in the wild’ throughout Britain (Schneider, 1988). He concentrated on the linguistic content of the entire dialogues rather than simply openings and closings, using discourse analysis principles to describe instances of small talk at several levels, from frames such as ‘WEATHER’ to sequences and adjacency pairs within interaction, and then to the types of utterances comprising these sequences. Schneider notes that although Grice’s Co-operative Principle may hold for small talk, the Gricean maxims of quantity, quality, relation, and manner do not seem to be well suited, as these maxims are tied to the quality and efficient transfer of information. He proposes a politeness principle (‘Be polite!’) and two super-maxims of politesse and friendliness. Politesse is essentially an avoidance strategy with maxims including Speech (‘Avoid silence!’) and Person (‘Avoid curiosity!’), while Friendliness is more active with maxims including Speech (‘Say something nice!’) and Person (‘Show interest!’). He identifies sequence types widely used in small talk such as remark-agree, which are often followed by idling sequences of repetitions of agreeing tails such as ‘Yes, of course’ and ‘MmHmm’. He highlights the importance of agreeableness to small talk, mentioning a tendency to exaggerate agreement and positive evaluations, and how shared experience is often sought in friendly conversation.

Slade and Eggins define casual conversation as the type of talk engaged in when ‘talking just for the sake of talking’. They strongly rebut any notion that casual conversation is light or ‘meaningless’, arguing that it is through casual conversation that people form and refine their social reality (Eggins & Slade, 2004). They cite gossip, where participants reaffirm their solidarity by jointly ascribing outsider status to another, and show examples of conversation between closer friends at a dinner party where greater intimacy allows greater differences of opinion. They identify story-telling as a frequent genre in conversation and highlight the existence of ‘chat’ (interactive exchanges involving short turns by all participants) and ‘chunks’ (longer uninterrupted contributions) as elements of conversation. They also mention the tendency for casual conversation to involve multiple participants rather than the dyads normally found in instrumental interactions or the examples often used in conversation analysis. Instrumental and interactional talk also differ in duration of episodes: task-based conversations are temporally bounded by task completion and tend to be short, while casual conversation can go on indefinitely. Indeed, in one of the foundational papers on conversational organization, attention is drawn to the fact that exactly these casual conversational situations, ‘continuing state(s) of incipient talk’, were not covered by the theories of (task-based) conversational structure being developed (Schegloff & Sacks, 1973).
Many researchers in the field note that basing their studies on orthographic transcriptions alone limits the breadth of their descriptions (Laver, 1975; Schneider, 1988; Thornbury & Slade, 2006). In our work we intend to add prosodic and temporal information to the existing body of knowledge describing casual conversation and small talk, and use this information in combination with existing accounts of social talk to inform the design of social talk for human-computer interaction. Our first step is analysis of human-human casual conversation.

Human-human conversational data

Investigation into conversational dynamics is often based on corpora of recorded speech. Many of these are recordings of task-based interactions such as describing a route through a map as in the HCRC MapTask corpus (Anderson et al., 1991), spotting differences in two pictures as in the DiaPix task in the LUCID (Baker & Hazan, 2010) and Wildcat (Van Engen et al., 2010) corpora, ranking items on a list, for example to decide which items would be useful in an emergency (Vinciarelli, Salamin, Polychroniou, Mohammadi, & Origlia, 2012), and participating in real or staged business meetings as in the ICSI and AMI corpora (Janin et al., 2003; McCowan et al., 2005). While these data collection paradigms result in corpora of great utility to researchers, it is not certain that tasks such as these can be used to make reliable generalizations about other genres of natural conversation (Lemke, 2012).

Researchers have obtained natural data by recording real home or work situations. Early studies used recordings of telephone calls, as in Sacks and Schegloff’s emergency services data, or the conversational data collected by Jefferson (Sacks et al., 1974). More domain-independent natural telephonic data has been gathered by recording large numbers of real phone conversations, as in the Switchboard corpus (Godfrey, Holliman, & McDaniel, 1992), and the ESP-C collection of Japanese telephone conversations (Campbell, 2004). Audio corpora of non-telephonic spoken interaction include the Santa Barbara Corpus (DuBois, Chafe, Meyer, & Thompson, 2000), sections of the ICE corpora (Greenbaum, 1991) and of the British National Corpus (BNC-Consortium, 2000). Unfortunately the unimodal (audio only) nature of these collections makes them unsuitable for modeling multimodal interaction. While much progress has been made, it is not certain that one size fits all for interactional data, or that results and insights gained, for example, from measurements of the timing of pauses and gaps in telephone conversations about a ‘balloon task’ or indeed in real-life suicide hotline recordings, are generalizable to face-to-face casual conversation.

There are some examples of collections of audio and later video recordings of varied natural interactions, as in the Gothenburg Corpus (Allwood, Björnberg, Grönqvist, Ahlsén, & Ottesjö, 2000). High quality multimodal corpora are appearing comprising naturalistic encounters with no prescribed task or subject of discussion imposed on participants. These include collections of free-talk meetings, or first encounters between strangers as in the Swedish Spontal, and the NOMCO and MOMCO Danish and Maltese corpora (Edlund et al., 2010; Paggio, Allwood, Ahlsén, & Jokinen, 2010). In our current work we are focusing on the D64 corpus (Oertel, Cummins, Edlund, Wagner, & Campbell, 2010). The advantage of this corpus is that it comprises recordings of multiparty casual talk in a natural setting of several hours’ duration, thus providing a nearly prototypical example of that ‘continuing state of incipient talk’ which is of interest. Below we describe work in progress on two components of natural casual conversation: chat and chunks.

Data and Annotation

The D64 corpus is a multimodal corpus of over 8 hours of informal conversational English, comprising 3 sessions recorded over two days in an apartment living room in Dublin in November 2009. There were between 2 and 5 people on camera at all times. Participants were all native or near native speakers of English (C1/C2 on CEFRL). There were no instructions to participants about what to talk about and care was taken to ensure that all participants understood that they were free to talk or not as the mood took them.

The various recordings were synchronized to determine an accurate timeline for the corpus, and to ensure alignment of video, audio, and motion capture data. The microphone recordings were found to be unsuitable for automatic segmentation as there were frequent overlaps and bleedover from other speakers. Therefore, the audio files were first segmented manually into speech and silence intervals using Praat (Boersma & Weenink, 2010) on 10-second and 4-second windows. The process was then repeated for the sound file recorded at the same time for each of the other speakers, resulting in annotations checked across five different sound files. Any remaining speech intervals not assigned to a particular speaker were resolved using Elan (Wittenburg, Brugman, Russel, Klassmann, & Sloetjes, 2006) to refer to the video recordings taken at the same time.

There are concerns to note with the manual annotation of silence. Humans listening to speech can miss or indeed imagine the existence of objectively measured silences of short duration, especially when there is elongation of previous or following syllables (Martin, 1970). However, Martin’s results were based on annotators timing pauses with a stopwatch in a single hearing.
In the current work, speech can be slowed down and replayed and, by using the four-second window, annotators can clearly see silences and differences in amplitude on the speech waveform and spectrogram. Therefore, it was much more likely that silences would be picked up and it was hoped that problems due to annotators only picking up perceptually salient silences could be avoided.

After segmentation the data for Session 1 of the corpus were manually transcribed and annotated. Words, hesitations, filled pauses, unfinished words, laughs and coughs were transcribed and marked. The transcription was carried out at the intonational phrase (IP) level as this is a usefully small unit of speech; IPs can easily be concatenated to the interpausal unit (IPU) and turn level as required. IPs are also the basic unit for intonation study. Several annotation and labeling schemes for conversational speech or dialogue were examined to inform the design of the final annotation scheme, which is largely based on the TRAINS dialogue transcription scheme (Heeman & Allen, 1995). Pauses and gaps were generated from the segmentation, with silences bounded on either side by the same speaker classed as pauses and silences bounded by different speakers classed as gaps.

Filled pauses or fillers are nonlexical single-syllable sounds made by the current speaker, often interpreted as signs of hesitation. They usually take the form of a schwa or reduced vowel pronounced alone or with nasalization. They are variously transcribed as uh and um (more commonly in American English) or eh/er and em/erm (more commonly in British English). In the annotation the uh/um forms were used for all filled pauses, with the only distinction being the presence of nasalization.

In order to more fully investigate genre within casual talk, an impressionistic annotation labeling conversational sections as ‘chat’ or ‘chunk’ was carried out.
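The derivation of pauses and gaps from the speech segmentation, as described in this section, can be illustrated with a minimal sketch. It is not the tooling used for the corpus (the segmentation itself was done manually in Praat and Elan); the `Interval` record, the `classify_silences` function, and the `min_silence` threshold are hypothetical names introduced here, and the sketch assumes the segmentation has been exported as a list of per-speaker speech intervals with start and end times in seconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """One stretch of speech by one speaker (hypothetical export format)."""
    speaker: str
    start: float  # seconds from the start of the session
    end: float    # seconds from the start of the session


def classify_silences(intervals, min_silence=0.0):
    """Label each stretch where nobody speaks as a 'pause' or a 'gap'.

    A silence bounded on both sides by the same speaker is a pause;
    a silence bounded by different speakers is a gap.
    Returns tuples (kind, silence_start, silence_end, speaker_before, speaker_after).
    """
    ordered = sorted(intervals, key=lambda iv: (iv.start, iv.end))
    silences = []
    last = None  # the interval with the latest end time seen so far
    for iv in ordered:
        # Silence exists only if the next speech starts after ALL earlier
        # speech has ended, which handles overlapping speakers correctly.
        if last is not None and iv.start - last.end > min_silence:
            kind = "pause" if iv.speaker == last.speaker else "gap"
            silences.append((kind, last.end, iv.start, last.speaker, iv.speaker))
        if last is None or iv.end > last.end:
            last = iv
    return silences


# Illustrative (invented) data: speaker A pauses, B overlaps A, then A returns.
example = [
    Interval("A", 0.0, 1.0),
    Interval("A", 1.5, 2.5),
    Interval("B", 2.25, 3.0),
    Interval("A", 3.5, 4.0),
]
for kind, start, end, before, after in classify_silences(example):
    print(kind, start, end, before, after)
```

On the invented example, the silence between A's two contributions is labeled a pause and the silence between B's overlapping contribution and A's return is labeled a gap. Tracking the latest end time rather than simply the previous interval is a design choice that keeps overlapped speech from being mistaken for silence.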
These annotations will be used with the lexical, structural and acoustic features annotated to build a classifier and investigate whether these two genres can be distinguished automatically. We are particularly interested in differences in turn-taking and in pause and gap duration. In task-based dialogue, turn-taking is traditionally viewed as competitive, with the floor seen as a scarce resource. However, in free-ranging casual conversation the impetus to avoid silence combined with the lack of pressing information to transfer may lead to a situation where the goal is to ‘keep the ball in the air’, with turn-taking occurring at topic exhaustion rather than at points which would be seen as transition relevant (TRPs) in more rapid-fire instrumental exchanges.

The annotated data, comprising 6164 intonational phrase units, including speech contributions and very short utterances, are currently being analysed in order to gain insight into similarities and differences in the temporal or chronemic features of chat and chunk sections of casual talk, and to obtain general data on casual conversation which can be contrasted with task-based conversation. We expect to have results in the near future.

Conclusions

To create systems which understand and/or generate social signals for use in human-machine interaction, it is necessary to have a clear idea of the use-case scenario of the system. It is thus vital to gain a greater understanding of the various genres, speech-exchange systems, or activities arising in human-human and human-machine interaction. Comparison of characteristics of different genres within and across corpora will help discriminate which phenomena are generalizable and which are situation dependent. Such knowledge will be useful in human-machine interaction design, particularly in the field of companion robots or relational agents, as the very notion of a companion application entails understanding of social spoken interaction.
Acknowledgements

This work is supported by the Fastnet Project – Focus on Action in Social Talk: Network Enabling Technology funded by Science Foundation Ireland (SFI) 09/IN.1/I2631.

References

Allen, J., Byron, D., Dzikovska, M., Ferguson, G., Galescu, L., & Stent, A. (2000). An architecture for a generic dialogue shell. Natural Language Engineering, 6(3&4), 213–228.
Allwood, J., Björnberg, M., Grönqvist, L., Ahlsén, E., & Ottesjö, C. (2000). The spoken language corpus at the department of linguistics, Göteborg University. In FQS–Forum Qualitative Social Research (Vol. 1).
Anderson, A. H., Bader, M., Bard, E. G., Boyle, E., Doherty, G., Garrod, S., … others. (1991). The HCRC map task corpus. Language and Speech, 34(4), 351–366.
Baker, R., & Hazan, V. (2010). LUCID: a corpus of spontaneous and read clear speech in British English. In Proceedings of the DiSS-LPSS Joint Workshop 2010.
Bakhtine, M. M. (1986). Speech genres and other late essays. University of Texas Press.
BNC-Consortium. (2000). British national corpus. URL http://www.hcu.ox.ac.uk/BNC.
Boersma, P., & Weenink, D. (2010). Praat: doing phonetics by computer [Computer program], Version 5.1.44.
Brown, G., & Yule, G. (1983). Teaching the spoken language (Vol. 2). Cambridge University Press.
Campbell, N. (2004). Speech & Expression; the Value of a Longitudinal Corpus. In LREC.
DuBois, J. W., Chafe, W. L., Meyer, C., & Thompson, S. A. (2000). Santa Barbara Corpus of Spoken American English. CDROM. Philadelphia: Linguistic Data Consortium.
Dunbar, R. (1998). Grooming, gossip, and the evolution of language. Harvard Univ Press.
Dunbar, R. I. M. (2003). The social brain: Mind, language, and society in evolutionary perspective. Annual Review of Anthropology, 163–181.
Duncan, S. (1972). Some signals and rules for taking speaking turns in conversations. Journal of Personality and Social Psychology, 23(2), 283–292. doi:10.1037/h0033031
Edlund, J., Beskow, J., Elenius, K., Hellmer, K., Strömbergsson, S., & House, D. (2010). Spontal: A Swedish Spontaneous Dialogue Corpus of Audio, Video and Motion Capture. In LREC.
Eggins, S., & Slade, D. (2004). Analysing casual conversation. Equinox Publishing Ltd.
Gilmartin, E., Bonin, F., Vogel, C., & Campbell, N. (2013). Laughter and Topic Transition in Multiparty Conversation. In Proceedings of the SIGDIAL 2013 Conference (pp. 304–308). Metz, France: Association for Computational Linguistics.
Gilmartin, E., & Campbell, N. (2012). More than just words: building a chatty robot. Presented at the IWSDS 2012, Paris.
Godfrey, J. J., Holliman, E. C., & McDaniel, J. (1992). SWITCHBOARD: Telephone speech corpus for research and development. In Acoustics, Speech, and Signal Processing, 1992. ICASSP-92., 1992 IEEE International Conference on (Vol. 1, pp. 517–520).
Goffman, E. (1974). Frame Analysis: An Essay on the Organization of Experience. Northeastern University Press: Lebanon, NH.
Goffman, E. (1981). Forms of Talk. University of Pennsylvania Press: Philadelphia, PA.
Greenbaum, S. (1991). ICE: The international corpus of English. English Today, 28(7.4), 3–7.
Heeman, P. A., & Allen, J. F. (1995). The TRAINS 93 Dialogues. DTIC Document.
Janin, A., Baron, D., Edwards, J., Ellis, D., Gelbart, D., Morgan, N., Stolcke, A. (2003). The ICSI meeting corpus. In Acoustics, Speech, and Signal Processing, 2003. Proceedings (ICASSP’03). 2003 IEEE International Conference on (Vol. 1, pp. I–364).
Laver, J. (1975). Communicative functions of phatic communion. Organization of Behavior in Face-to-Face Interaction, 215–238.
Lemke, J. L. (2012). Analyzing verbal data: Principles, methods, and problems. In Second International Handbook of Science Education (pp. 1471–1484). Springer.
Malinowski, B. (1923). The problem of meaning in primitive languages. Supplementary in the Meaning of Meaning, 1–84.
Martin, J. G. (1970). On judging pauses in spontaneous speech. Journal of Verbal Learning and Verbal Behavior, 9(1), 75–78.
McCowan, I., Carletta, J., Kraaij, W., Ashby, S., Bourban, S., Flynn, M., … Karaiskos, V. (2005). The AMI meeting corpus. In Proceedings of the 5th International Conference on Methods and Techniques in Behavioral Research (Vol. 88).
Oertel, C., Cummins, F., Edlund, J., Wagner, P., & Campbell, N. (2010). D64: A corpus of richly recorded conversational interaction. Journal on Multimodal User Interfaces, 1–10.
Paggio, P., Allwood, J., Ahlsén, E., & Jokinen, K. (2010). The NOMCO multimodal Nordic resource–goals and characteristics.
Sacks, H., Schegloff, E. A., & Jefferson, G. (1974). A simplest systematics for the organization of turn-taking for conversation. Language, 696–735.
Schegloff, E. A., & Sacks, H. (1973). Opening up closings. Semiotica, 8(4), 289–327.
Schneider, K. P. (1988). Small talk: Analysing phatic discourse (Vol. 1). Hitzeroth Marburg.
Thornbury, S., & Slade, D. (2006). Conversation: From description to pedagogy. Cambridge University Press.
Van Engen, K. J., Baese-Berk, M., Baker, R. E., Choi, A., Kim, M., & Bradlow, A. R. (2010). The Wildcat Corpus of native-and foreign-accented English: Communicative efficiency across conversational dyads with varying language alignment profiles. Language and Speech, 53(4), 510–540.
Vinciarelli, A., Salamin, H., Polychroniou, A., Mohammadi, G., & Origlia, A. (2012). From nonverbal cues to perception: personality and social attractiveness. In Cognitive Behavioural Systems (pp. 60–72). Springer.
Wittenburg, P., Brugman, H., Russel, A., Klassmann, A., & Sloetjes, H. (2006). Elan: a professional framework for multimodality research. In Proceedings of LREC (Vol. 2006).