Does History Help?
An Experiment on How Context Affects
Crowdsourcing Dialogue Annotation
Elnaz Nouri
Computer Science Department University of Southern California
Natural Dialogue Group, Institute for Creative Technologies
Crowdsourcing Annotation
Faster (?)
Cheaper (?)
Quality (?)
Snow et al. (2008)
In Crowdsourcing Dialogue
Annotation Tasks
 Dialogue data is sequential by nature.
 Does providing context from previous parts of the dialogue (e.g. turns) affect
the annotation of the target part?
Example: Judge the sentiment of the following turn of the dialogue:
Person 1: Come on out, honey! I'm telling you look good!
Tell her she looks good, tell her she looks good.
Person 2: Oh my God, you look so good!
From Seinfeld…
WAITRESS : Tuna on toast, coleslaw, cup of coffee.
GEORGE: Yeah. No, no, no, wait a minute, I always have tuna
on toast. Nothing's ever worked out for me with tuna on toast. I
want the complete opposite of on toast. Chicken salad, on rye,
un-toasted with a side of potato salad ... and a cup of tea.
JERRY: You know chicken salad is not the opposite of tuna,
salmon is the opposite of tuna, 'cuz salmon swim against the
current, and the tuna swim with it.
GEORGE: Good for the tuna!
Link to Video
Interesting Questions
 General aspect:
 Do annotators need context to do each instance of the
annotation?
 Can we present them with only the needed previous context?
 How does context affect stability of the annotation?
 Crowdsourcing aspect:
 Should we present the whole dialogue to the annotator if the
compensation rate is low?
 Can we consider each annotation task as a stand-alone micro-task?
 Do annotators on Amazon Mechanical Turk read the instructions
or the context provided?
So we ran an experiment…
The Idea: A Variable Context
Window Size
Person 1: How is it going? I am Bronson from the Hill restaurant.
Person 2: I am Milton. I am from the Vally restaurant.
Person 1: Alright, cool. So looks like we got some good resources on the table. And, uh, we want to find a way that works for both of us.
Person 2: Uh, yeah I agree. I just want to, we want to maximize both of our profits.
Person 1: So what do we have right here?
The Data Set
 The “Farmers Market” negotiation dataset
 41 dyadic sessions of negotiation based on instructions
 Two restaurant owners are trying to divide some items among
themselves
The Task: Sentiment Analysis
• 3 dialogues used: D1 (31 turns), D2 (16 turns), D3 (30 turns) = 77 turns
• 5 annotators for each instance: A1, A2, A3, A4, A5
• annotators recruited on Amazon Mechanical Turk
• $0.02 for annotating each instance
• “Sentiment Annotation Task” on the turns of the dialogue
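As a rough, assumption-based cost sketch (the slides give only the per-instance rate): if every one of the 77 turns were labeled by 5 workers under each of the 7 context-window conditions (0 to 6 turns), the total payout would be about $53.90. The full crossing of turns, workers, and conditions is my assumption, not a reported figure.

    # Hypothetical cost estimate; the full crossing of turns x window sizes
    # is an assumption, not something stated on the slides.
    turns = 77        # D1 (31) + D2 (16) + D3 (30)
    annotators = 5    # A1..A5 per instance
    conditions = 7    # context window sizes 0..6 (assumed fully crossed)
    rate = 0.02       # USD per annotated instance
    total = turns * annotators * conditions * rate
    print(f"Estimated payout: ${total:.2f}")   # -> $53.90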
Emotion Tag          Score   Emotion Embodied
Strongly positive      2     extremely happy or excited toward the topic
Positive               1     generally happy or satisfied, but the emotion wasn't extreme
Neutral                0     neither positive nor negative
Negative              -1     perceived to be angry or upset toward the topic, but not to the extreme
Strongly negative     -2     extremely negative toward the topic
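A minimal sketch of this scale encoded as a lookup table (the dictionary name and helper function are mine, purely illustrative):

    # Sentiment scale from the table above, as a simple tag-to-score mapping.
    SENTIMENT_SCALE = {
        "strongly positive":  2,   # extremely happy or excited toward the topic
        "positive":           1,   # generally happy or satisfied, not extreme
        "neutral":            0,   # neither positive nor negative
        "negative":          -1,   # angry or upset toward the topic, not extreme
        "strongly negative": -2,   # extremely negative toward the topic
    }

    def score(tag: str) -> int:
        """Map an emotion tag to its numeric score."""
        return SENTIMENT_SCALE[tag.lower().strip()]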
Example Stimuli
Previous Context Window Size = 3
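The HIT layout itself is not reproduced here; below is a hedged sketch of how such a stimulus could be assembled from a list of turns (function and variable names are hypothetical):

    # Hypothetical stimulus builder: a target turn plus up to `window_size`
    # preceding turns as context.
    def build_stimulus(turns, target_index, window_size):
        start = max(0, target_index - window_size)
        context = turns[start:target_index]   # previous turns shown as context
        target = turns[target_index]          # the turn the worker rates
        return context, target

    dialogue = [
        "Person 1: How is it going? I am Bronson from the Hill restaurant.",
        "Person 2: I am Milton. I am from the Vally restaurant.",
        "Person 1: Alright, cool. So looks like we got some good resources on the table.",
        "Person 2: Uh, yeah I agree. We want to maximize both of our profits.",
        "Person 1: So what do we have right here?",
    ]
    context, target = build_stimulus(dialogue, target_index=4, window_size=3)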
Example Annotation Result
TURN | A1 | A2 | A3 | A4 | A5 | AVG | Gold
Person 2: I need the apples so that is done. We get equal bananas and equal strawberries, so… Done! | 1 | 2 | 1 | 0 | 1 | 1.0 | 1
Person 1: Perfect! | 1 | 2 | 1 | 1 | 2 | 1.4 | 2
Person 2: We have reached an agreement. | 1 | 1 | 2 | 1 | 1 | 1.2 | 1
Gold annotation: the whole dialogue was presented to the annotator
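The AVG column is simply the mean of the five crowd labels for each turn; a quick check with the values from the table:

    # Reproducing the AVG column: the mean of the five annotator scores per turn.
    rows = {
        "We get equal bananas and equal strawberries... Done!": [1, 2, 1, 0, 1],
        "Perfect!":                                             [1, 2, 1, 1, 2],
        "We have reached an agreement.":                        [1, 1, 2, 1, 1],
    }
    for turn, labels in rows.items():
        avg = sum(labels) / len(labels)
        print(f"{turn!r}: AVG = {avg:.1f}")    # 1.0, 1.4, 1.2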
Evaluation Method 1:
Distance to the Gold Annotation
Context Window Size   D1       D2       D3
0 turns               0.260    0.341    0.236
1 turn                0.261    0.317*   0.228*
2 turns               0.215*   0.326    0.248
3 turns               0.299    0.349    0.313
4 turns               0.277    0.413    0.268
5 turns               0.238    0.356    0.247
6 turns               0.246    0.341    0.255
(* shows the minimum distance from the Gold annotation for each dialogue)
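The slide does not spell out the distance formula; one plausible reading, sketched below as an assumption only, is the mean absolute difference between the per-turn average of the crowd labels and the gold label:

    # Hedged sketch of "distance to the Gold annotation": mean absolute
    # difference between the per-turn crowd average and the gold label.
    # The exact metric behind the table above is not specified on the slide.
    def distance_to_gold(crowd_labels, gold_labels):
        diffs = []
        for labels, gold in zip(crowd_labels, gold_labels):
            avg = sum(labels) / len(labels)    # average over the 5 annotators
            diffs.append(abs(avg - gold))
        return sum(diffs) / len(diffs)

    # Example with the three turns from the earlier slide (Gold: 1, 2, 1).
    crowd = [[1, 2, 1, 0, 1], [1, 2, 1, 1, 2], [1, 1, 2, 1, 1]]
    gold = [1, 2, 1]
    print(distance_to_gold(crowd, gold))       # (0 + 0.6 + 0.2) / 3 ≈ 0.267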
Evaluation Method 2:
Inter-annotator Agreement
Hypothesis: higher inter-annotator reliability implies more stability and can serve as an indicator of the optimal context window size.
The differences between window sizes were not significant according to a t-test, except for window size 0.
Context Window Size   Krippendorff's alpha
0                     0.0976
1                     0.2165
2                     0.1133
3                     0.2431*
4                     0.1670
5                     0.1923
6                     0.1790
(* shows the maximum agreement)
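For reference, Krippendorff's alpha for interval-scaled labels can be computed with the open-source `krippendorff` Python package; the slides do not say which implementation was used, so treat this as one option rather than the original pipeline. Rows of the reliability matrix are annotators, columns are turns, and np.nan marks missing labels.

    # One way to compute Krippendorff's alpha for interval data
    # (tooling is an assumption; pip install krippendorff).
    import numpy as np
    import krippendorff

    # Rows = annotators A1..A5, columns = turns (toy example from the earlier slide).
    reliability_data = np.array([
        [1, 1, 1],
        [2, 2, 1],
        [1, 1, 2],
        [0, 1, 1],
        [1, 2, 1],
    ], dtype=float)

    alpha = krippendorff.alpha(reliability_data=reliability_data,
                               level_of_measurement="interval")
    print(f"Krippendorff's alpha: {alpha:.4f}")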
Conclusions (and considerations)
Our results imply that:
 the number of previous turns doesn't substantially affect the annotation of the target turn → it is not necessary to show a large number of previous turns or the whole dialogue.
 A context window size of 3 is perhaps enough to do the job.
Considerations:
 the sample size is very small
 the nature of the dialogues and the negotiation task might have affected the results
 our dataset wasn't very emotional! :)
 these are not real negotiations or conversations
 the annotation task can also affect the outcome
Future Work
Further investigation is needed:
 Different datasets
 Different annotation tasks
 Appropriate metrics for measuring
 Suitable baseline annotation for comparison
Questions?
Please tell me what you think! Your feedback and ideas are
sincerely appreciated!