Turn-Taking and Coordination in Human-Machine Interaction: Papers from the 2015 AAAI Spring Symposium

Can Overhearers Predict Who Will Speak Next?

Peter A. Heeman and Rebecca Lunsford
Center for Spoken Language Understanding
Oregon Health & Science University, Portland, OR

Copyright © 2015, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

One theory of turn-taking in dialogue is that the current speaker controls when the other conversant can speak, which is also the basis of most spoken dialogue systems. A second theory is that the two conversants negotiate who will speak next. In this paper, we test these theories by examining how well an overhearer can predict who will speak next, based only on the current speaker's utterance, which is what the other conversant would have access to. We had overhearers listen to the current speaker and indicate whether they felt the current speaker would continue or not. Our results support the negotiative model.

Most current spoken dialogue systems are used for slot-filling tasks, where the system asks the user for the values of a set of parameters. Such tasks can be accomplished using a very structured interaction, where the system completely controls the dialogue. However, as we try to employ spoken dialogue systems to help users with more complex tasks, this simple interaction structure will be too limiting. Both the user and the system need to be able to participate freely in the conversation in order to solve the task.

An important component of this is to allow natural turn-taking. Many systems assume a rigid model of turn-taking, where one person keeps the turn until they decide to release it, which we refer to here as the speaker-control model. This approach stems from the work of Sacks, Schegloff, and Jefferson (1974), who postulated that the current speaker decides when to allow someone else to take the turn. The task of a spoken dialogue system is thus to detect the user's end-of-turn signal. Furthermore, Sacks proposed that people strive to minimize gaps (silences between turns) and to minimize overlaps (where both people talk at the same time). Thus the system needs to detect the end-of-turn signal as quickly and as reliably as possible so as to minimize gaps between turns, and to avoid incorrect turn-grabs so as to minimize overlaps. A limitation of the speaker-control approach is that the speaker decides how long he or she wants to speak, without consideration for whether the listener wants the turn. This hinders the ability of the conversants to contribute freely to the conversation.

There is evidence that human turn-taking is more flexible than the speaker-control model. For example, Duncan and Niederehe (1974) proposed that people bid for the turn using a number of turn cues, with the highest bidder winning the turn. In previous work (Selfridge and Heeman 2010), we proposed a system that considers the importance of what it wants to say when placing a bid. Through simulations, we found that this results in more efficient conversations than the speaker-control model, as the system and user can take the turn when they have something important to add to the conversation, and not take the turn when they do not. That work showed the advantage of turn-bidding, but the question remains as to whether people actually negotiate for the turn, and if so, how.

The three models of turn-taking that we will consider are as follows:

Speaker control: the speaker decides when it wants to yield the turn, and indicates this on the current utterance. When the speaker tries to yield the turn, the listener will usually take the turn.

Pure negotiative: the speaker and the other conversant negotiate who will speak next. The speaker does not use the current utterance in expressing his/her bid.

Speaker negotiative: the speaker and the other conversant negotiate who will speak next. The speaker does use the current utterance to express all or at least part of his/her bid. This gives the speaker an earlier opportunity to express his/her bid than in the pure negotiative model.

In this paper, we present the results of a perceptual study that was designed to help shed light on how people engage in turn-taking. Specifically, we want to determine the extent to which the current speaker dictates who will speak next. We conducted a perceptual study in which human subjects analyze human-human turn-taking behavior. After an utterance unit, subjects indicate whether they think the current speaker will continue, using a 6-point Likert scale. We determine how often they are able to predict the actual turn outcome, and how much they agree with each other. This will help us determine the extent to which the current speaker controls whether a turn-transition will occur.
Background

Models of Human-Human Turn-Taking

Sacks: Turn-taking in conversation has been discussed in the sociolinguistics literature. Sacks, Schegloff, and Jefferson (1974) viewed any stretch of speech between switching speakers as a turn. Sacks proposed a set of rules for turn-taking: at each transition-relevance place (TRP), the current speaker can select someone to speak; otherwise, anyone can self-select; otherwise, the current speaker can continue. Thus, in Sacks' model, the speaker must signal when the TRP is occurring. Sacks also notes that turn-transitions often occur with little or no gap, where the gap is the silence between speaker turns. Thus, the speaker needs to project the TRP so that the hearer knows in advance when it will occur.

One problem is that Sacks never explicitly defines what a TRP is. In fact, it seems as if a TRP is simply where the speaker would like to apply the above three rules. So, if a speaker wants to hold onto the turn for three utterances, the speaker simply avoids making a TRP after the first two. It is because of this that we refer to Sacks' model as speaker-control: the speaker decides whether to keep the turn or release it.

Duncan: Duncan (1972) also investigated turn-taking. Duncan defined backchannels very broadly to include acknowledgments, collaborative completions, brief requests for clarification, and brief restatements, and excluded them from speaking turns. Duncan viewed backchannels as the hearer actively avoiding the speaking turn, keeping it squarely with the current speaker. Duncan proposed that at certain points the speaker would issue a turn-yielding signal, at which point the listener could elect to take the turn, issue a backchannel, or remain silent.

The turn-yielding cues that Duncan proposed the current speaker uses to show his/her willingness to release the turn include syntactic and semantic completion, sentence-ending prosody, and gesture. Although only one cue is needed to signal a turn-yield, Duncan found that the likelihood of the listener taking the turn increased linearly with the number of turn-yielding cues that the speaker displayed. He found that when no cues were displayed, there was approximately a 7% chance of the listener attempting to take the turn; when 6 cues were present, there was a 50% chance. In our view, these results suggest that the current speaker might be able to indicate to the listener the degree to which the speaker wants to keep or release the turn. This work is part of the inspiration for our second explanation of turn-taking, a negotiative model in which both participants can bid on the turn.

Schegloff: Schegloff (2000) was concerned with how turn-fights are resolved. Schegloff found that (a) fights for the turn are often accompanied by sudden acoustic alterations, such as louder volume, higher pitch, and faster or slower speaking rate; and (b) the vast majority of fights for the turn are resolved very quickly. He proposed that turn-fights are resolved through an interactive procedure, e.g., syllable-by-syllable negotiation, using devices such as volume, pitch, and speaking rate. However, Schegloff's analysis consisted of only a few examples; no statistical evidence was given. This work also suggests that negotiation might play a role in turn-taking.
Perceptual Studies

Several researchers have done perceptual studies in which subjects, acting as overhearers of a conversation, decide when a turn-transition will occur. de Ruiter, Mitterer, and Enfield (2006) sought to determine whether overhearers are able to project a turn-transfer point and what types of information they use, as this could indicate what types of information conversants use. For this paper, we are interested in whether subjects predict the actual outcome, not what types of information they use and when this information is available.

Tice and Henetz (2011) explored the use of eye-gaze to study turn-boundary projection: will a third party look at the next speaker as the current speaker's turn is coming to a close? On actual turn-transitions for question-answer pairs, they found that subjects did look toward the other speaker. However, they did not measure whether there were false positives, and so it is unclear whether subjects can distinguish end-of-turns from end-of-utterances.

Turn-Taking Cues in Human-Human Dialogue

Gravano and Hirschberg (2011) investigated how well they could operationalize Duncan's model of turn-taking (Duncan 1972), in which the current speaker signals his/her willingness to yield the turn. In a corpus of human-human dialogues, they examined each silence that was at least 50 ms long. The task was to decide whether the following speech would be a transition to the listener or a continuation by the current speaker. Following loosely the features outlined by Duncan, they automatically computed a set of 7 features of the current utterance, including pitch slope in the last 200 ms and last 300 ms, average pitch and intensity in the last 500 ms and last 1000 ms, voice quality, speaking rate, segment length, presence of a tag question, and textual completion (syntactic, semantic, and pragmatic completion). Confirming Duncan, they found a linear relationship between the number of features present and the probability of a turn-transition. When all 7 features were present, the probability of a turn-transition was about 65%. The replication of the linear effect of multiple cues helps support a negotiative model of turn-taking.
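To make the linear cue effect concrete, the sketch below fits a straight line through the endpoints reported above (roughly 7% with no cues and 50% with 6 cues in Duncan's data). The function name and the interpolation itself are ours, purely for illustration; they are not part of either cited study.

```python
# Illustrative only: interpolate the cue-count/turn-taking relationship
# reported by Duncan (1972); Gravano and Hirschberg (2011) found a similar
# linear trend, reaching about 65% when all 7 of their features were present.

def listener_take_turn_rate(n_cues, rate_at_zero=0.07, rate_at_max=0.50, max_cues=6):
    """Estimated chance the listener attempts to take the turn, assuming the
    reported linear trend (about 7 percentage points per additional cue)."""
    slope = (rate_at_max - rate_at_zero) / max_cues
    return rate_at_zero + slope * n_cues

for n in range(7):
    print(f"{n} cues: {listener_take_turn_rate(n):.1%}")
# climbs from 7.0% at 0 cues to 50.0% at 6 cues
```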
Methods

Audio Samples

We selected audio samples from the Switchboard corpus (Godfrey, Holliman, and McDaniel 1992), a corpus of conversational speech in which two people who do not know each other talk on the telephone about a predefined topic. In the future, we will use task-oriented speech as well.

Audio clips were selected from a single conversation. We first marked potential clips from the transcriptions that were syntactically and semantically complete. Utterances had to be at least two words long and contribute to the topic of the conversation. We then listened to these clips in Praat (Boersma 2001) to ensure intonational completeness and that there was not a substantial overlap with the speech of the other speaker.

The remaining clips were then annotated. First, clips were annotated as a question if they had the syntactic or intonational form of a question; otherwise, they were annotated as a statement. Second, we annotated whether the other speaker made a backchannel, such as 'yeah' or 'uh-huh'. This usually overlapped the end of the audio clip, and was not counted as taking the turn. Third, we annotated the turn-taking outcome:

continue: the current speaker continues speaking;
switch: the other conversant speaks next, after the current speaker finishes;
early onset: the other conversant speaks next, but slightly overlaps the current speaker;
turn fight: both speakers start speaking afterwards at around the same time.

Once all potential audio clips were identified, a list was created for each of the two speakers in the conversation that showed only the start and stop times and the annotations. From each list a subset of 20 clips was selected so as to balance the different types of turn-taking outcomes. This was done without listening to the audio so as to avoid experimenter bias. In addition, we selected a clip to be used as part of the instructions. Table 1 gives the distribution of the audio clips that were used in the study. The code 'bc' shows how many of the given utterances were followed by a backchannel.

                        Continue    Switch      Early Onset   Turn-Fight
Speaker A   Statement   6 (3 bc)    9           0             3
            Question    0           2           0             0
Speaker B   Statement   7 (3 bc)    5 (1 bc)    2             4
            Question    0           2 (2 bc)    0             0

Table 1: Audio Clips and Turn-Taking Outcomes.
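As a quick sanity check on Table 1, the following sketch tallies the clip counts (ignoring the 'bc' annotations). The dictionary layout is ours, but the totals it produces — 20 clips per speaker, and 13 continues, 18 switches, 2 early onsets, and 7 turn-fights overall — are the ones the later analyses refer back to.

```python
# Tally the clip distribution in Table 1 (counts only; 'bc' annotations omitted).
table1 = {
    ("A", "statement"): {"continue": 6, "switch": 9, "early_onset": 0, "turn_fight": 3},
    ("A", "question"):  {"continue": 0, "switch": 2, "early_onset": 0, "turn_fight": 0},
    ("B", "statement"): {"continue": 7, "switch": 5, "early_onset": 2, "turn_fight": 4},
    ("B", "question"):  {"continue": 0, "switch": 2, "early_onset": 0, "turn_fight": 0},
}

clips_per_speaker = {
    spk: sum(sum(row.values()) for (s, _), row in table1.items() if s == spk)
    for spk in ("A", "B")
}
clips_per_outcome = {}
for row in table1.values():
    for outcome, count in row.items():
        clips_per_outcome[outcome] = clips_per_outcome.get(outcome, 0) + count

print(clips_per_speaker)   # {'A': 20, 'B': 20}
print(clips_per_outcome)   # {'continue': 13, 'switch': 18, 'early_onset': 2, 'turn_fight': 7}
```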
Mechanical Turk Interface

The data for this research was collected using Amazon's Mechanical Turk, an online workplace that allows users to select and perform Human Intelligence Tasks (HITs).

Figure 1: Selection instructions provided to the subjects.

The instructions included in the interface are shown in Figure 1. The subjects were instructed to listen to each audio clip, then select whether they thought the same speaker would continue speaking or the other speaker would speak next. A Likert scale was used to allow subjects to indicate how certain they were in their choice of speaker. We used a 6-point forced-choice scale (rather than a 5-point scale) in order to force subjects, even if they were uncertain, to decide between switch and continue.

Figure 2: Mechanical Turk session format. Single page shown containing two audio clips.

The Mechanical Turk interface is shown in Figure 2. Subjects had to listen to 20 audio clips from a single speaker, presented in random order (for the initial 20 sessions, the clips were presented in a fixed order). Audio clips were displayed two at a time, with radio buttons below. The interface allowed the subject to replay an audio clip, but pausing during play was disabled. In addition, a subject could change their selection any time before selecting Continue, but could not go back to a previous screen. As it was easy for a subject to think they had made a selection but not actually have triggered the button, the Continue button was disabled until a selection was made for each of the two clips.

Subjects and Task Types

Subjects from the US were recruited via the Amazon Mechanical Turk interface. Prior to commencing their session, subjects were required to read a consent form and verify that they were willing to have their data used in research. They were also asked to confirm that they were native English speakers and were between the ages of 18 and 65. Furthermore, subjects were instructed to ensure that they were able to play a sample audio clip from their computer, and to adjust their volume so they could easily hear the audio prior to continuing the session.

We created two task types. One task type used audio clips from speaker A and the other used audio clips from speaker B. Sixty subjects performed each task type. Subjects could choose to participate in either, or both, task types. In all, 104 subjects participated, with 15 subjects completing both sessions (data from one task type was discarded due to incomplete data).

Assessing the Subjects' Performance

In completing this task, it was possible that some subjects either: 1) had no intention of performing the task with due diligence (i.e., "gaming the system"), or 2) found the task too difficult or frustrating, and thus were unable to perform the task well. In either of these cases, the data for these subjects should be excluded from analyses, and we excluded data as detailed below.

To identify those subjects who were "gaming the system", we intentionally designed the task to allow subjects to make a selection prior to listening to the audio, while listening to the audio, or to play two audio clips simultaneously. By designing the task in this manner, and by logging the subjects' clicks and the audio start and end times, we were able to identify subjects who did not listen to the full audio clips before making a selection and exclude their data from analyses.

The accuracy of the logging differed between subjects. Because of this variance, it was important to determine a logging error benchmark for each subject prior to deciding whether a subject did, in fact, select a response prior to completing the audio. To assess logging error, we subtracted the actual length of each audio clip from the difference between the logged start and end time for that subject and that clip. From this set we then selected the largest difference and used that as the logging error benchmark for that subject. Data for a given subject was excluded if they made at least one selection that was earlier than the end of the audio by more than four times their logging error benchmark. Exceptions to this exclusion rule were allowed for the two longest prompts for each speaker: for the longer prompts, subjects would sometimes make their selection while the prompt was playing, and revise the selection, if needed, once the audio completed. For these cases the data was not automatically excluded for that subject.

Using these criteria, data for 20 of the 104 subjects was excluded from analyses. In some cases, a subject performed well with one task type but not the other. In these cases the subject's data was excluded only for the task type in which their performance was suspect.
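Below is a minimal sketch of the exclusion rule just described. The field names (logged_start, logged_end, selection_time, actual_length) and the per-clip record layout are hypothetical, since the paper does not describe its logging format; only the arithmetic follows the text.

```python
# Hypothetical per-clip records: logged_start/logged_end are the logged play
# times, actual_length is the true clip duration, selection_time is when the
# subject committed a choice. All times in seconds.

def logging_error_benchmark(clips):
    """Largest discrepancy between the logged play duration and the actual
    clip length, taken over all of a subject's clips."""
    return max((c["logged_end"] - c["logged_start"]) - c["actual_length"]
               for c in clips)

def should_exclude(clips, longest_prompt_ids):
    """Exclude a subject who made a selection earlier than the end of the
    audio by more than four times their logging-error benchmark, except on
    the two longest prompts, where early (revisable) selections were tolerated."""
    benchmark = logging_error_benchmark(clips)
    for c in clips:
        if c["clip_id"] in longest_prompt_ids:
            continue
        early_by = c["logged_end"] - c["selection_time"]
        if early_by > 4 * benchmark:
            return True
    return False
```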
Data Analysis

Ability to Predict Outcome

We first analyze how well subjects can predict the actual turn outcome. We collapsed the responses into two groups, with the first 3 points of the scale grouped as continue and the last 3 as switch. For these analyses, we exclude early onsets and turn fights (if we include early onsets, the subjects' ability to predict the actual outcome is slightly worse). Counts of the subjects' predictions are shown in Table 2. From a statistical perspective, we find that the actual outcome has a significant effect on the subjects' predictions, χ²(1, N=1524) = 85.7, p < .0001. However, subjects selected the actual turn outcome only 61.0% of the time. Looking at each actual outcome separately, we see that when the current speaker continued, the subjects selected continue 67.2% of the time, and when the turn switched, the subjects selected switch 57.8% of the time.

                    Subject Prediction
Actual       Continue   Switch    Total
Continue          428      219      647
Switch            370      507      877
Total             798      726     1524

Table 2: Contingency table showing the counts of subjects' selections for the continue and switch turn outcomes.

As it should be easier for subjects to predict the actual outcome when the audio clip contains a question, we also looked at the data with questions excluded. Here we expect the percentage correct to decrease. The results are shown in Table 3. As anticipated, subjects selected the actual turn outcome less often, only 57.5% of the time, which is also a significant effect, χ²(1, N=1326) = 32.7, p < .0001. For the remaining analyses, questions are included.

                    Subject Prediction
Actual       Continue   Switch    Total
Continue          428      219      647
Switch            344      335      679
Total             772      554     1326

Table 3: Contingency table with questions excluded.

Next we look at the predictions for each speaker separately, which are shown in Table 4. For speaker A, subjects selected the actual turn outcome 65.2% of the time, and for speaker B, 57.3% of the time. Here too, we see that the actual turn outcome has a significant effect on subjects' selections, respectively χ²(1, N=782) = 105.5, p < .0001, and χ²(1, N=742) = 15.8, p < .0001.

Speaker A           Subject Prediction
Actual       Continue   Switch    Total
Continue          226       50      276
Switch            222      284      506
Total             448      334      782

Speaker B           Subject Prediction
Actual       Continue   Switch    Total
Continue          202      169      371
Switch            148      223      371
Total             350      392      742

Table 4: Contingency tables separating each speaker.

Looking just at the subjects' ability to predict continues, we see a substantial difference between speakers. Here, subjects were much less successful at predicting the actual turn outcome for speaker B, only 54.5% of the time. In contrast, for speaker A, they selected the actual turn outcome 81.9% of the time. For switches, the difference between the two speakers was less pronounced. Here, subjects were able to select the actual turn outcome 56.1% of the time for speaker A, and 60.1% of the time for speaker B.
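The chi-square statistics above can be reproduced directly from the contingency tables. The sketch below uses scipy as one possible way to do so; note that Yates' continuity correction must be turned off to match the uncorrected values reported in the text.

```python
# Reproduce the reported chi-square statistics from Tables 2-4.
from scipy.stats import chi2_contingency

tables = {
    "Table 2 (all clips)":       [[428, 219], [370, 507]],
    "Table 3 (questions excl.)": [[428, 219], [344, 335]],
    "Table 4 (speaker A)":       [[226,  50], [222, 284]],
    "Table 4 (speaker B)":       [[202, 169], [148, 223]],
}

for name, table in tables.items():
    chi2, p, dof, _ = chi2_contingency(table, correction=False)
    n = sum(map(sum, table))
    print(f"{name}: chi2({dof}, N={n}) = {chi2:.1f}, p = {p:.1e}")
# yields 85.7, 32.7, 105.5, and 15.8, matching the values reported above
```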
Strong Majority

We now examine whether there are some audio clips that drew stronger agreement from the subjects, which we refer to as a strong majority. We do this without regard to whether the subjects correctly predict the actual outcome. Again, we collapse the 6-point Likert scale into a two-way distinction: continue or switch.

We have experimented with two ways to define a strong majority. First, we defined it as the number of subjects that need to agree such that the chance of agreement (assuming responses are equally likely) is less than some cutoff, say 10%. So, over 40 audio clips, we should expect just 4 to reach this cutoff by chance. However, as more subjects are included, the actual percentage of subjects that need to agree decreases. If we have 25 subjects, 18 must agree, or 72%. If we have 50 subjects, 32 must agree, or 64%. If we have 100 subjects, 59 must agree, or 59%. This definition has the same deficit as the analysis in the previous section: it focuses on what is statistically significant rather than on what the subjects' choices tell us about what is going on in the dialogue. Hence, we will not use this criterion to separate the audio clips. Our second definition simply declares a strong majority when some percentage of subjects agree. For this paper, we use 70%. Note that this is about 10 percentage points higher than the overall prediction rate found in the previous section.

Figure 3: Analysis of each audio clip.

The results are shown in Figure 3, with each of the 40 audio clips along the x-axis. The solid line indicates the percentage of subjects that agree that the audio clip is a switch, which is also how the audio clips are arranged along the x-axis. For comparison, the dashed line in the figure indicates how likely it is for subjects to reach that level of agreement by chance, as would be computed in the first metric (the dashed line is not monotonic, as the number of subjects that did the two task types is not the same). The dashed line shows that, although the solid line looks very linear, it hides the fact that it becomes exponentially more difficult to achieve higher and higher levels of agreement by chance.

Using the second definition (agreement of at least 70% of subjects), we divide the x-axis into three parts. The first is labeled 'continue' and comprises the first 12 audio clips, for which at least 70% of the subjects predicted a continue. The third is labeled 'switch' and comprises the last 5 audio clips, for which at least 70% of the subjects predicted a switch.

We now compare the strong majority prediction with the actual outcome. The results are given in Table 5. When there was a strong majority predicting a switch, this was the actual outcome. However, 4 of the 5 predicted switches had the syntactic form of a question, which makes this decision easier. The subjects reached a strong majority of continue for 12 audio clips: 7 had the actual outcome of a continue, while 4 were switches and one was a turn fight. The subjects did not reach a strong majority for the other 23 audio clips. Of these, 6 had an actual outcome of a turn-fight, 2 were early onsets, 6 were continues, and 9 were switches.

                         Actual Outcome
Predicted    Continue   Switch   Early Onset   Turn Fight
Continue            7        4             0            1
Switch              0        5             0            0
Other               6        9             2            6

Table 5: Strong majority prediction versus actual outcome.
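The agreement thresholds quoted for the first definition (18 of 25, 32 of 50, 59 of 100) follow from a two-sided binomial computation. The sketch below shows one way to reproduce them, assuming each subject chooses continue or switch independently with probability 0.5.

```python
# Reproduce the first "strong majority" definition: the smallest number of
# subjects whose agreement on either answer would occur by chance less than
# 10% of the time, assuming independent 50/50 responses.
from scipy.stats import binom

def strong_majority_threshold(n_subjects, cutoff=0.10):
    for k in range(n_subjects // 2 + 1, n_subjects + 1):
        # chance that at least k of n subjects give the same (binary) answer
        chance = 2 * binom.sf(k - 1, n_subjects, 0.5)
        if chance < cutoff:
            return k
    return n_subjects

for n in (25, 50, 100):
    k = strong_majority_threshold(n)
    print(f"{n} subjects: {k} must agree ({k / n:.0%})")
# 25 -> 18 (72%), 50 -> 32 (64%), 100 -> 59 (59%), as stated above
```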
Discussion

We first analyzed the ability of subjects to predict the actual outcome. Removing two ambiguous types of turn moves, early onsets and turn-fights, we found that subjects were able to predict the actual turn outcome at a rate of 61%. For our two speakers, this varied from 57.3% to 65.2%. This might be because some speakers are better able to signal their intention, or might be due to the distribution of dialogue contexts, as we were not trying to match their natural distribution in dialogue. Note that Gravano and Hirschberg (2011) also achieved results in this range using machine learning on the cues present in the current utterance. Hence, one possibility is that our rate of around 61% is indeed how well this task can be done. This would suggest that the speaker-centric model is wrong, as 30-40% of turn-taking points cannot be accounted for.

We also examined whether there were some utterances for which subjects had high agreement, regardless of whether they agreed with the actual outcome. We set a criterion that 70% of subjects had to agree, and found that 17 utterances reached this criterion. The odds that this many would agree by chance are extremely low. The fact that subjects can predict turn outcomes for some utterances and not for others is consistent with the speaker negotiative model. The speaker uses the current utterance to express whether he wants to continue the turn, but might not always have a strong preference. For our set of utterances, the current speaker seems not to have had a preference for 23 out of 40 utterances, or 57% of the time.

Of the 17 with a strong majority, 5 were predicted to be switches, and these were actual switches. This included all 4 questions, which have strong social expectations of how conversants should react, namely to answer the question. So of the 14 non-question switches that occurred, the subjects strongly predicted only one of them. In light of the speaker negotiative model, for these 14, the speaker might not have had a preference. In fact, as questions are such an efficient way for the current speaker to signal that the other should take the turn, other signals might not be used much.

The remaining 12 were predicted to be continues. Seven of these were actual continues, but 4 were switches, and one was a turn-fight. This suggests that even when a speaker wants to continue, and signals this on the utterance, it is still up to both the speaker and the other conversant what actually happens. The skew of signaling continue more than switch (12 versus 5) might be because it is difficult for a speaker to know whether the other should contribute. Hence, speakers might want to be more careful in using this signal. Instead, if they are not sure, they could choose not to signal who should take the turn, thus giving the other conversant an opportunity to take the turn.

Conclusion

In this paper, we discussed the results of an empirical study to better understand human-human turn-taking. Subjects, listening to a speaker's utterance, had to predict whether the current speaker would continue or the turn would switch. However, subjects were only able to predict 61% of the actual turn outcomes, which suggests that the speaker-centric model is a poor account of turn-taking. We did find evidence for a speaker negotiative model of turn-taking, in which the speaker and listener negotiate who will speak next, but where the speaker can signal their desire on the current utterance. Of the 40 utterances, 17 had strong agreement as to who would speak next, and 23 did not. Further, even when there was strong agreement, this did not always match what actually occurred.
Acknowledgments

This work was funded by the National Science Foundation under grant IIS-1321146.

References

Boersma, P. 2001. Praat, a system for doing phonetics by computer. Glot International 5(9/10):341-345.

de Ruiter, J. P.; Mitterer, H.; and Enfield, N. J. 2006. Projecting the end of a speaker's turn: A cognitive cornerstone of conversation. Language 82(3).

Duncan, S. J., and Niederehe, G. 1974. On signalling that it's your turn to speak. Journal of Experimental Social Psychology 10:234-247.

Duncan, S. J. 1972. Some signals and rules for taking speaking turns in conversation. Journal of Personality and Social Psychology 23:283-292.

Godfrey, J. J.; Holliman, E. C.; and McDaniel, J. 1992. SWITCHBOARD: Telephone speech corpus for research and development. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), 517-520.

Gravano, A., and Hirschberg, J. 2011. Turn-taking cues in task-oriented dialogue. Computer Speech and Language 25(3):601-634.

Sacks, H.; Schegloff, E. A.; and Jefferson, G. 1974. A simplest systematics for the organization of turn-taking for conversation. Language 50(4):696-735.

Schegloff, E. A. 2000. Overlapping talk and the organization of turn-taking for conversation. Language in Society 29:1-63.

Selfridge, E. O., and Heeman, P. A. 2010. Importance-driven turn-bidding for spoken dialogue systems. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 177-185.

Tice, M., and Henetz, T. 2011. Turn-boundary projection: Looking ahead. In Proceedings of the 33rd Annual Meeting of the Cognitive Science Society.