Turn-Taking and Coordination in Human-Machine Interaction: Papers from the 2015 AAAI Spring Symposium

Can Overhearers Predict Who Will Speak Next?

Peter A. Heeman and Rebecca Lunsford
Center for Spoken Language Understanding
Oregon Health & Science University
Portland, OR
Abstract

One theory of turn-taking in dialogue is that the current speaker controls when the other conversant can speak; this theory is also the basis of most spoken dialogue systems. A second theory is that the two conversants negotiate who will speak next. In this paper, we test these theories by examining how well an overhearer can predict who will speak next, based only on the current speaker's utterance, which is what the other conversant would have access to. We had overhearers listen to the current speaker and indicate whether they felt the current speaker would continue or not. Our results support the negotiative model.

Most current spoken dialogue systems are used for slot-filling tasks, where the system asks the user for the values of a set of parameters. Such tasks can be accomplished using a very structured interaction, in which the system completely controls the dialogue. However, as we try to employ spoken dialogue systems to help users with more complex tasks, this simple interaction structure will be too limiting. Both the user and the system need to be able to participate freely in the conversation in order to solve the task. An important component of this is allowing natural turn-taking.

Many systems assume a rigid model of turn-taking, where one person keeps the turn until they decide to release it, which we refer to here as the speaker-control model. This approach stems from the work of Sacks, Schegloff, and Jefferson (1974), who postulated that the current speaker decides when to allow someone else to take the turn. The task of a spoken dialogue system is thus to detect the user's end-of-turn signal. Furthermore, Sacks proposed that people strive to minimize gaps (silences between turns) and to minimize overlaps (where both people talk at the same time). Thus the system needs to detect the end-of-turn signal as quickly and as reliably as possible so as to minimize gaps between turns, and to minimize incorrect turn-grabs so as to minimize overlaps. A limitation of the speaker-control approach is that the speaker decides how long he/she wants to speak, without consideration for whether the listener wants the turn. This hinders the ability of the conversants to freely contribute to the conversation.

There is evidence that human turn-taking is more flexible than the speaker-control model. For example, Duncan and Niederehe (1974) proposed that people bid for the turn using a number of turn-cues, with the highest bidder winning the turn. In previous work (Selfridge and Heeman 2010), we proposed a system that considers the importance of what it wants to say when placing a bid. Through simulations, we found that this results in more efficient conversations than the speaker-control model, as the system and user can take the turn when they have something important to add to the conversation, and not take it when they do not. That work showed the advantage of turn-bidding, but the question remains as to whether people actually negotiate for the turn, and if so, how.

The three models of turn-taking that we will consider are as follows:

Speaker control: the speaker decides when to yield the turn, and indicates this on the current utterance. When the speaker tries to yield the turn, the listener will usually take it.

Pure negotiative: the speaker and the other conversant negotiate who will speak next. The speaker does not use the current utterance to express his/her bid.

Speaker negotiative: the speaker and the other conversant negotiate who will speak next. The speaker does use the current utterance to express all or at least part of his/her bid. This gives the speaker an earlier opportunity to express a bid than in the pure negotiative model.

In this paper, we present the results of a perceptual study designed to help shed light on how people engage in turn-taking. Specifically, we want to determine the extent to which the current speaker dictates who will speak next. We conducted a perceptual study in which human subjects analyze human-human turn-taking behavior. After an utterance unit, subjects indicate whether they think the current speaker will continue, using a 6-point Likert scale. We will determine how often they are able to predict the actual turn outcome, and how much they agree with each other. This will help us determine the extent to which the current speaker controls whether a turn-transition will occur.
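To make the turn-bidding idea concrete, the following is a minimal sketch of importance-driven bidding, under our own simplifying assumptions; the numeric bids and the tie-breaking rule are illustrative, not the actual mechanism of Selfridge and Heeman (2010).

```python
import random

def next_speaker(system_importance: float, user_importance: float) -> str:
    """Each conversant bids for the turn with the importance of what it
    wants to say next; the higher bidder takes the turn."""
    if system_importance > user_importance:
        return "system"
    if user_importance > system_importance:
        return "user"
    return random.choice(["system", "user"])  # tie: illustrative coin flip

# The system defers when the user has something more important to add,
# rather than holding the turn until it decides to release it.
print(next_speaker(system_importance=0.2, user_importance=0.8))  # user
```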
Background
Models of Human-Human Turn-Taking
Sacks: Turn-taking in conversation has been discussed in the social-linguistics literature. Sacks, Schegloff, and Jefferson (1974) viewed any stretch of speech between switching speakers as a turn. Sacks proposed a set of rules for turn-taking: at each transition-relevance place (TRP), the current speaker can select someone to speak; otherwise, anyone can self-select; otherwise, the speaker can self-select. Thus, in Sacks' model, the speaker must signal when a TRP is occurring. Sacks also notes that turn-transitions often occur with little or no gap, where the gap is the silence between speaker turns. Thus, the speaker needs to project the TRP so that the hearer knows in advance when it will occur.

One problem is that Sacks never explicitly defines what a TRP is. In fact, it seems as if a TRP is simply where the speaker would like to apply the above three rules. So, if a speaker wants to hold the turn for three utterances, the speaker simply avoids making a TRP after the first two. It is because of this that we refer to Sacks' model as speaker-control: the speaker decides whether to keep the turn or release it.
Duncan: Duncan (1972) also investigated turn-taking. Duncan defined backchannels very broadly, to include acknowledgments, collaborative completions, brief requests for clarification, and brief restatements, and excluded them from speaking turns. Duncan viewed backchannels as the hearer actively avoiding the speaking turn, keeping it squarely with the current speaker. Duncan proposed that at certain points the speaker would issue a turn-yielding signal, at which point the listener could elect to take the turn, issue a backchannel, or remain silent.

The turn-yielding cues that Duncan proposed the current speaker uses to show his/her willingness to release the turn include syntactic and semantic completion, sentence-ending prosody, and gesture. Although only one cue is needed to signal a turn-yield, Duncan found that the likelihood of the listener taking the turn increased linearly with the number of turn-yielding cues the speaker displayed. When no cues were displayed, there was approximately a 7% chance of the listener attempting to take the turn; when 6 cues were present, there was a 50% chance.

In our view, these results suggest that the current speaker might be able to indicate to the listener the degree to which the speaker wants to keep or release the turn. This work is part of the inspiration for our second explanation of turn-taking, a negotiative model in which both participants can bid on the turn.
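Duncan's finding can be read as an approximately linear mapping from cue count to the probability of a turn-taking attempt. The function below is our own illustration of that reading, anchored only at the two reported endpoints (7% with no cues, 50% with all 6 cues); the intermediate values are interpolated, not Duncan's data.

```python
def p_take_turn(num_cues: int) -> float:
    """Linearly interpolate the likelihood that the listener attempts to
    take the turn, from 7% at 0 cues to 50% at all 6 cues."""
    num_cues = max(0, min(num_cues, 6))
    return 0.07 + (0.50 - 0.07) * num_cues / 6

print([round(p_take_turn(n), 2) for n in range(7)])
# [0.07, 0.14, 0.21, 0.28, 0.36, 0.43, 0.5]
```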
Schegloff: Schegloff (2000) was concerned with how turn-fights are resolved. Schegloff found that (a) fights for the turn are often accompanied by sudden acoustic alterations, such as louder volume, higher pitch, and faster or slower speaking rate; and (b) the vast majority of fights for the turn are resolved very quickly. He proposed that turn-fights are resolved through an interactive procedure, e.g., syllable-by-syllable negotiation, using devices such as volume, pitch, and speaking rate. However, Schegloff's analysis consisted of only a few examples; no statistical evidence was given. This work also suggests that negotiation might play a role in turn-taking.

Turn-Taking Cues in Human-Human Dialogue

Gravano and Hirschberg (2011) investigated how well they could operationalize Duncan's (1972) model of turn-taking, in which the current speaker signals his/her willingness to yield the turn. In a corpus of human-human dialogues, they examined each silence that was at least 50 ms long. The task was to decide whether the following speech would be a transition to the listener or a continuation by the current speaker. Following loosely the cues outlined by Duncan, they automatically computed a set of 7 features of the current utterance, including pitch slope in the last 200 ms and last 300 ms, average pitch and intensity in the last 500 ms and last 1000 ms, voice quality, speaking rate, segment length, presence of a tag question, and textual completion (syntactic, semantic, and pragmatic completion). Confirming Duncan, they found a linear relationship between the number of features present and the probability of a turn transition. When all 7 features were present, the probability of a turn transition was about 65%. The replication of the linear effect of multiple cues helps support a negotiative model of turn-taking.
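As a sketch of how this classification task is framed (our own illustration, with hypothetical field names, not Gravano and Hirschberg's code): each silence of at least 50 ms is labeled a transition if the next speech comes from the other party, and a continuation otherwise.

```python
from dataclasses import dataclass
from typing import List

MIN_SILENCE = 0.050  # seconds; only silences of at least 50 ms are classified

@dataclass
class Silence:
    speaker: str        # who was speaking before the silence
    speech_end: float   # time the current speaker stopped, in seconds
    next_onset: float   # time the next speech (by either party) starts
    next_speaker: str

def label_silences(silences: List[Silence]) -> List[str]:
    """Label each qualifying silence as a turn transition or a
    continuation by the current speaker."""
    labels = []
    for s in silences:
        if s.next_onset - s.speech_end < MIN_SILENCE:
            continue  # too short to count as a silence
        labels.append("transition" if s.next_speaker != s.speaker
                      else "continuation")
    return labels
```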
Perceptual Studies

Several researchers have done perceptual studies in which subjects, acting as overhearers of a conversation, decide when a turn-transition will occur. De Ruiter, Mitterer, and Enfield (2006) sought to determine whether overhearers are able to project a turn-transfer point and what types of information they use, as this could indicate what types of information conversants use. For this paper, we are interested in whether subjects can predict the actual outcome, not what types of information they use and when this information is available.

Tice and Henetz (2011) explored the use of eye-gaze to study turn-boundary projection: will a third party look at the next speaker as the current speaker's turn is coming to a close? On actual turn-transitions for question-answer pairs, they found that subjects did look toward the other speaker. However, they did not measure whether there were false positives, so it is unclear whether subjects can distinguish ends of turns from ends of utterances.

Methods
Audio Samples
We selected audio samples from the Switchboard corpus (Godfrey, Holliman, and McDaniel 1992), a corpus of conversational speech in which two people who do not know each other talk on the telephone about a predefined topic. In the future, we will use task-oriented speech as well. Audio clips were selected from a single conversation.

We first marked potential clips from the transcriptions that were syntactically and semantically complete. Utterances had to be at least two words long and contribute to the topic of the conversation. We then listened to these clips in Praat (Boersma 2001) to ensure intonational completeness
and that there was not a substantial overlap with the speech of the other speaker.

The remaining clips were then annotated. First, a clip was annotated as a question if it had the syntactic or intonational form of a question; otherwise, it was annotated as a statement. Second, we annotated whether the other speaker made a backchannel, such as 'yeah' or 'uh-huh'; these usually overlapped the end of the audio clip and were not counted as taking the turn. Third, we annotated the turn-taking outcome:

continue: the current speaker continues speaking.

switch: the other conversant speaks next, after the current speaker finishes.

early onset: the other conversant speaks next, but slightly overlaps the current speaker.

turn fight: both speakers start speaking afterward at around the same time.

Once all potential audio clips were identified, a list was created for each of the two speakers in the conversation that showed only the start and stop times and the annotations. From each list, a subset of 20 clips was selected so as to balance the different types of turn-taking outcomes. This was done without listening to the audio, so as to avoid experimenter bias. In addition, we selected a clip to be used as part of the instructions. Table 1 gives the distribution of the audio clips used in the study; the code 'bc' shows how many of the given utterances were followed by a backchannel.

                     Continue   Switch     Early Onset   Turn Fight
Speaker A
  Statement          6 (3 bc)   9          0             3
  Question           0          2          0             0
Speaker B
  Statement          7 (3 bc)   5 (1 bc)   2             4
  Question           0          2 (2 bc)   0             0

Table 1: Audio Clips and Turn-Taking Outcomes.
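As a concrete illustration of the annotation scheme, a minimal sketch (the field names and example times are ours, not those of an actual annotation tool):

```python
from dataclasses import dataclass

@dataclass
class ClipAnnotation:
    speaker: str       # "A" or "B"
    start: float       # clip boundaries in the conversation, in seconds
    end: float
    form: str          # "question" or "statement"
    backchannel: bool  # other speaker backchannels near the clip's end
    outcome: str       # "continue", "switch", "early_onset", "turn_fight"

# A hypothetical clip: speaker A asks a question and the turn switches.
clip = ClipAnnotation(speaker="A", start=112.4, end=115.1,
                      form="question", backchannel=False, outcome="switch")
```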
Subjects and Task Types

Subjects from the US were recruited via the Amazon Mechanical Turk interface. Prior to commencing their session, subjects were required to read a consent form and verify that they were willing to have their data used in research. They were also asked to confirm that they were native English speakers and were between the ages of 18 and 65. Furthermore, subjects were instructed to ensure that they were able to play a sample audio clip on their computer, and to adjust their volume so they could easily hear the audio prior to continuing the session.

We created two task types. One task type used audio clips from speaker A, and the other used audio clips from speaker B. 60 subjects performed each task type. Subjects could choose to participate in either, or both, task types. In all, 104 subjects participated, with 15 subjects completing both sessions (data from one task type was discarded due to incomplete data).
Mechanical Turk Interface

The data for this research was collected using Amazon's Mechanical Turk, an online workplace that allows users to select and perform Human Intelligence Tasks (HITs).

Figure 1: Selection instructions provided to the subjects.

The instructions included in the interface are shown in Figure 1. The subjects were instructed to listen to each audio clip, then select whether they thought the same speaker would continue speaking or whether the other speaker would speak next. A Likert scale was used to allow subjects to indicate how certain they were in their choice of speaker. We used a 6-point forced-choice scale (rather than a 5-point scale) in order to force subjects, even if they were uncertain, to decide between switch and continue.

Figure 2: Mechanical Turk session format. A single page is shown, containing two audio clips.

The Mechanical Turk interface is shown in Figure 2. Subjects had to listen to 20 audio clips from a single speaker, presented in random order (for the initial 20 sessions, the clips were presented in a fixed order). Audio clips were displayed two at a time, with radio buttons below. The interface allowed the subject to replay an audio clip, but pausing during play was disabled. In addition, a subject could change their selection at any time before clicking Continue, but could not go back to a previous screen. As it was easy for subjects to think they had made a selection without actually having triggered the button, the Continue button was disabled until a selection was made for each of the two clips.
Assessing the Subjects’ Performance
In completing this task, it was possible that some subjects either 1) had no intention of performing the task with due diligence (i.e., they were "gaming the system"), or 2) found the task too difficult or frustrating, and thus were unable to perform it well. In either case, the data for these subjects should be excluded from analyses; we excluded data as detailed below.

To identify subjects who were "gaming the system", we intentionally designed the task to allow subjects to make a selection prior to listening to the audio, while listening to the audio, or while playing two audio clips simultaneously. By designing the task in this manner, and by logging the subjects' clicks and the audio start and end times, we are able to identify subjects who did not listen to the full audio clips before making a selection, and to exclude their data from analyses.
The accuracy of the logging differed between subjects. Because of this variance, it was important to determine a logging-error benchmark for each subject prior to deciding whether a subject did, in fact, select a response before the audio completed. To assess logging error, we subtracted the actual length of each audio clip from the difference between the logged start and end times for that subject and that clip. From this set, we then selected the largest difference and used it as the logging-error benchmark for that subject.

Data for a given subject was excluded if they made at least one selection that was earlier than four times their logging-error benchmark. Exceptions to this exclusion rule were allowed for the two longest prompts for each speaker: for these longer prompts, subjects would sometimes make their selection while the prompt was playing, and revise the selection, if needed, once the audio completed. In these cases, the data was not automatically excluded. Using this criterion, data for 20 of the 104 subjects was excluded from analyses. In some cases, a subject performed well on one task type but not the other; in such cases, the subject's data was excluded only for the task type in which their performance was suspect.
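The exclusion rule can be stated compactly. Below is a minimal sketch under our reading of the rule, with hypothetical log-record fields: a selection counts as premature if it comes more than four times the subject's logging-error benchmark before the logged end of the audio.

```python
from dataclasses import dataclass
from typing import List, Set

@dataclass
class Trial:
    clip_id: str
    clip_length: float     # actual audio length, in seconds
    play_start: float      # logged playback start time
    play_end: float        # logged playback end time
    selection_time: float  # when the subject made their final selection

def logging_error_benchmark(trials: List[Trial]) -> float:
    """Per-subject benchmark: the largest discrepancy between the logged
    playback duration and the actual clip length."""
    return max(abs((t.play_end - t.play_start) - t.clip_length)
               for t in trials)

def should_exclude(trials: List[Trial], exempt_clips: Set[str]) -> bool:
    """Flag a subject whose selection on any non-exempt clip came more
    than four benchmarks before the audio finished playing."""
    benchmark = logging_error_benchmark(trials)
    return any(t.selection_time < t.play_end - 4 * benchmark
               for t in trials if t.clip_id not in exempt_clips)
```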
Data Analysis

Ability to Predict Outcome

We first analyze how well subjects can predict the actual turn outcome. We collapsed the responses into 2 groups, with the first 3 points of the Likert scale grouped as continue, and the last 3 as switch. For these analyses, we exclude early onsets and turn fights (if early onsets are included, the subjects' ability to predict the actual outcome is slightly worse). Counts of the subjects' predictions are shown in Table 2. From a statistical perspective, we find that the actual outcome has a significant effect on the subjects' predictions, χ²(1, N=1524)=85.7, p<.0001. However, subjects selected the actual turn outcome only 61.0% of the time.

                        Subject Prediction
Actual         Continue    Switch    Total
Continue       428         219       647
Switch         370         507       877
Total          798         726       1524

Table 2: Contingency table showing the counts of subjects' selections for the continue and switch turn outcomes.

Looking at each actual outcome separately, we see that when the current speaker continued, the subjects selected continue 67.2% of the time, and when the turn switched, the subjects selected switch 57.8% of the time.

As it should be easier for subjects to predict the actual outcome when the audio clip contains a question, we also looked at the data with questions excluded; here we expect the percentage correct to decrease. The results are shown in Table 3. As anticipated, subjects selected the actual turn outcome less often, only 57.5% of the time, which is still a significant effect, χ²(1, N=1326)=32.7, p<.0001. For the remaining analyses, questions are included.

                        Subject Prediction
Actual         Continue    Switch    Total
Continue       428         219       647
Switch         344         335       679
Total          772         554       1326

Table 3: Contingency table with questions excluded.

Next, we look at the predictions for each speaker separately, as shown in Table 4. For speaker A, subjects selected the actual turn outcome 65.2% of the time; for speaker B, 57.3% of the time. Here too, the actual turn outcome has a significant effect on subjects' selections: χ²(1, N=782)=105.5, p<.0001, and χ²(1, N=742)=15.8, p<.0001, respectively.

Speaker A:
                        Subject Prediction
Actual         Continue    Switch    Total
Continue       226         50        276
Switch         222         284       506
Total          448         334       782

Speaker B:
                        Subject Prediction
Actual         Continue    Switch    Total
Continue       202         169       371
Switch         148         223       371
Total          350         392       742

Table 4: Contingency tables separating each speaker.

Looking just at the subjects' ability to predict continues, we see a substantial difference between the speakers. Subjects were much less successful at predicting the actual turn outcome for speaker B, doing so only 54.5% of the time; in contrast, for speaker A, they selected the actual turn outcome 81.9% of the time.

For switches, the difference between the two speakers was less pronounced. Here, subjects selected the actual turn outcome 56.1% of the time for speaker A, and 60.1% of the time for speaker B.
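For reference, the collapsing of the Likert scale and the χ² statistic for Table 2 can be reproduced as follows (a sketch using scipy; the paper does not name its statistics software). Without a continuity correction, the counts in Table 2 yield the reported χ²(1, N=1524)=85.7.

```python
from scipy.stats import chi2_contingency

def collapse(likert: int) -> str:
    """Collapse the 6-point scale: 1-3 -> continue, 4-6 -> switch."""
    return "continue" if likert <= 3 else "switch"

# Rows: actual outcome (continue, switch); columns: subject prediction.
table2 = [[428, 219],
          [370, 507]]

chi2, p, dof, _ = chi2_contingency(table2, correction=False)
n = sum(map(sum, table2))
print(f"chi2({dof}, N={n}) = {chi2:.1f}, p = {p:.1e}")
# chi2(1, N=1524) = 85.7, p = 2.1e-20
```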
Strong Majority

We now examine whether some audio clips elicited stronger agreement from the subjects, which we refer to as a strong majority. We do this without regard to whether the subjects correctly predicted the actual outcome. Again, we collapse the 6-point Likert scale into a two-way distinction: continue or switch.

We experimented with two ways to define a strong majority. First, we defined it as the number of subjects that need to agree such that the chance of agreement (assuming the two responses are equally likely) is less than some cutoff, say 10%; over 40 audio clips, we should then expect just 4 to reach the cutoff by chance. However, as more subjects are included, the percentage of subjects needed to agree decreases: with 25 subjects, 18 must agree (72%); with 50 subjects, 32 must agree (64%); with 100 subjects, 59 must agree (59%). This definition has the same deficit as the analysis in the previous section: it focuses on what is statistically significant, rather than on what the subjects' choices tell us about what is going on in the dialogue. Hence, we do not use this criterion to separate the audio clips.

Our second definition simply requires that some percentage of the subjects agree; for this paper, we use 70%. Note that this is about 10% higher than the prediction rate we found in the previous section. The results are shown in Figure 3, with each of the 40 audio clips along the x-axis. The solid line indicates the percentage of subjects that agree that the audio clip is a switch, which is also how the audio clips are ordered along the x-axis. For comparison, the dashed line indicates how likely it is for subjects to reach that level of agreement by chance, as would be computed under the first definition (the dashed line is not monotonic, as the number of subjects who did the two task types is not the same). The dashed line shows that although the solid line looks very linear, it hides the fact that it becomes exponentially more difficult to achieve higher and higher levels of agreement.

Figure 3: Analysis of each audio clip.

Using the second definition (agreement of at least 70% of subjects), we divide the x-axis into 3 parts. The first part, labeled 'continue', comprises the first 12 audio clips, for which at least 70% of the subjects predicted a continue. The third part, labeled 'switch', comprises the last 5 audio clips, for which at least 70% of the subjects predicted a switch.

We now compare the strong-majority predictions with the actual outcomes. The results are given in Table 5. When there was a strong majority predicting a switch, this was always the actual outcome. However, 4 of the 5 predicted switches had the syntactic form of a question, which makes this decision easier. The subjects reached a strong majority of continue for 12 audio clips: 7 had the actual outcome of a continue, while 4 were switches and one was a turn fight. The subjects did not reach a strong majority for the remaining 23 audio clips. Of these, 6 had an actual outcome of a turn fight, 2 were early onsets, 6 were continues, and 9 were switches.

                           Actual Outcome
Predicted    Continue   Switch   Early Onset   Turn Fight
Continue     7          4        0             1
Switch       0          5        0             0
Other        6          9        2             6

Table 5: Strong majority prediction versus actual outcome.
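The agreement thresholds quoted above for the first definition can be reproduced with a binomial tail computation: find the smallest number of subjects k such that the chance that at least k of n pick the same one of the two (equally likely) responses falls below the 10% cutoff. A sketch using scipy:

```python
from scipy.stats import binom

def strong_majority_threshold(n: int, cutoff: float = 0.10) -> int:
    """Smallest k such that the chance of k or more of n subjects
    agreeing on either response (each chosen with probability 0.5)
    is below the cutoff."""
    for k in range(n // 2 + 1, n + 1):
        # Both directions of agreement count, hence the factor of 2.
        if 2 * binom.sf(k - 1, n, 0.5) < cutoff:
            return k
    return n

for n in (25, 50, 100):
    k = strong_majority_threshold(n)
    print(n, k, f"{100 * k / n:.0f}%")
# 25 18 72%
# 50 32 64%
# 100 59 59%
```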
Discussion
We first analyzed the ability of subjects to predict the actual outcome. Removing two ambiguous types of turn moves, early onsets and turn fights, we found that subjects were able to predict the actual turn outcome at a rate of 61%. Across our two speakers, this varied from 57.3% to 65.2%. This might be because some speakers are better able to signal their intentions, or might be due to the distribution of dialogue contexts, as we were not trying to match their natural distribution in dialogue. Note that Gravano and Hirschberg (2011) also achieved results in this range using machine learning on the cues present in the current utterance. Hence, one possibility is that our rate of around 61% is indeed how well this task can be done. This would suggest that the speaker-centric model is wrong, as 30-40% of turn-taking points cannot be accounted for.

We also examined whether there were some utterances for which subjects had high agreement, regardless of whether they agreed with the actual outcome. We set a criterion that 70% of subjects had to agree, and found that 17 utterances reached it. The odds that this many would agree by chance are extremely low. The fact that subjects can predict turn outcomes for some utterances and not for others is consistent with the speaker negotiative model: the speaker uses the current utterance to express whether he/she wants to continue the turn, but might not always have a strong preference. For our set of utterances, the current speaker seems to have had no preference for 23 of the 40 utterances, or 57% of the time.

Of the 17 utterances with a strong majority, 5 were predicted to be switches, and these were actual switches. This included all 4 questions, which carry strong social expectations of how conversants should react, namely to answer the question. So of the 14 non-question switches that occurred, the subjects strongly predicted only one. In light of the speaker negotiative model, for these 14, the speaker might not have had a preference. In fact, as questions are such an efficient way for the current speaker to signal that the other should take the turn, other signals might not be used much.

The remaining 12 were predicted to be continues. Seven of these were actual continues, but 4 were switches, and one was a turn fight. This suggests that even when a speaker wants to continue, and signals this on the utterance, what actually happens is still up to both the speaker and the other conversant.

The skew toward signaling continue more than switch (12 versus 5) might arise because it is difficult for a speaker to know whether the other should contribute. Hence, speakers might want to be more careful in using this signal: if they are not sure, they can choose not to signal who should take the turn, thus giving the other conversant an opportunity to take it.
Conclusion
In this paper, we discussed the results of an empirical study designed to better understand human-human turn-taking. Subjects, listening to a speaker's utterance, had to predict whether the current speaker would continue or the turn would switch. Subjects were only able to predict 61% of the actual turn outcomes, which suggests that the speaker-centric model is a poor account of turn-taking.

We did find evidence for a speaker negotiative model of turn-taking, in which the speaker and listener negotiate who will speak next, but where the speaker can signal his/her desire on the current utterance. Of the 40 utterances, 17 had strong agreement as to who would speak next, and 23 did not. Further, even when there was strong agreement, this did not always match what actually occurred.
Acknowledgments
This work was funded by the National Science Foundation
under grant IIS-1321146.
References
Boersma, P. 2001. Praat, a system for doing phonetics by computer. Glot International 5(9/10):341-345.

de Ruiter, J. P.; Mitterer, H.; and Enfield, N. J. 2006. Projecting the end of a speaker's turn: A cognitive cornerstone of conversation. Language 82(3).

Duncan, S. J., and Niederehe, G. 1974. On signalling that it's your turn to speak. Journal of Experimental Social Psychology 10:234-247.

Duncan, S. J. 1972. Some signals and rules for taking speaking turns in conversation. Journal of Personality and Social Psychology 23:283-292.

Godfrey, J. J.; Holliman, E. C.; and McDaniel, J. 1992. SWITCHBOARD: Telephone speech corpus for research and development. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), 517-520.

Gravano, A., and Hirschberg, J. 2011. Turn-taking cues in task-oriented dialogue. Computer Speech and Language 25(3):601-634.

Sacks, H.; Schegloff, E. A.; and Jefferson, G. 1974. A simplest systematics for the organization of turn-taking for conversation. Language 50(4):696-735.

Schegloff, E. A. 2000. Overlapping talk and the organization of turn-taking for conversation. Language in Society 29:1-63.

Selfridge, E. O., and Heeman, P. A. 2010. Importance-driven turn-bidding for spoken dialogue systems. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 177-185.

Tice, M., and Henetz, T. 2011. Turn-boundary projection: Looking ahead. In Proceedings of the 33rd Annual Meeting of the Cognitive Science Society.